Overview
An evaluation recipe captures the full configuration of an evaluation run (which model to test, which evaluators to use, which dataset to run against, and whether to extract failure modes) as a reusable JSON template. Saving a recipe lets you run the same evaluation repeatedly as your model evolves, ensuring results are directly comparable across iterations. For instructions on saving and accessing evaluation recipes from the UI, see Evaluation Recipes. For the full schema reference, see Evaluation Recipe Schema.

Common recipe patterns
Basic single-model evaluation
Evaluate a custom trained model against a dataset using one judge evaluator. This is the minimal configuration needed to run a reproducible evaluation.

Evaluation with failure mode analysis
Same as above, but with `generateFailureModes: true`. Oumi will automatically cluster low-scoring responses into categories (e.g., hallucination, formatting errors, missing information) that can feed directly into a data synthesis run.
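As an illustration, the two patterns above might be captured in a recipe like this sketch. Only `generateFailureModes` is named on this page; the other field names and values are assumptions, so consult the Evaluation Recipe Schema for the exact shape.

```json
{
  "model": "my-finetuned-model-v3",
  "dataset": "support-queries-eval",
  "evaluators": [
    { "type": "judge", "name": "response-quality" }
  ],
  "generateFailureModes": true
}
```

Setting `generateFailureModes` to `false` gives the basic single-model pattern; everything else stays identical, which is what keeps runs directly comparable.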
Evaluating an external API model
Use a frontier model (e.g., GPT-4.1, Claude) as the subject of evaluation, rather than a custom trained model. This is useful for benchmarking your fine-tuned model against a baseline.

Tips
- Fix `inferenceTemperature: 0.0` and set `inferenceSeed`: deterministic outputs make results comparable across runs.
- Use multiple evaluators: combining a quality evaluator and a safety evaluator gives a more complete picture.
- Run failure mode analysis periodically, not on every iteration. It adds compute cost but provides high-value signal for data synthesis.
- Pin `datasetVersion`: if your evaluation dataset may change, pin the version so historical comparisons remain valid.
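Putting these tips together, here is a sketch of a recipe that evaluates an external API model as the subject. The `datasetVersion`, `inferenceTemperature`, `inferenceSeed`, and `generateFailureModes` keys are mentioned on this page; all other field names and values are assumptions for illustration only — see the Evaluation Recipe Schema for the authoritative structure.

```json
{
  "model": "gpt-4.1",
  "dataset": "support-queries-eval",
  "datasetVersion": "v2",
  "evaluators": [
    { "type": "judge", "name": "response-quality" },
    { "type": "judge", "name": "safety" }
  ],
  "inferenceTemperature": 0.0,
  "inferenceSeed": 42,
  "generateFailureModes": false
}
```

Rerunning the same recipe with only the `model` field swapped to your fine-tuned model gives a like-for-like baseline comparison.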