Overview

An evaluation recipe captures the full configuration of an evaluation run (which model to test, which evaluators to use, which dataset to run against, and whether to extract failure modes) as a reusable JSON template. Saving a recipe lets you run the same evaluation repeatedly as your model evolves, ensuring results are directly comparable across iterations. For instructions on saving and accessing evaluation recipes from the UI, see Evaluation Recipes. For the full schema reference, see Evaluation Recipe Schema.

Common recipe patterns

Basic single-model evaluation

Evaluate a custom-trained model against a dataset using a single judge evaluator. This is the minimal configuration needed to run a reproducible evaluation.
{
  "recipe": {
    "recipeConfig": {
      "type": "evaluate",
      "evaluationConfig": {
        "evaluationType": "single_model",
        "modelIdentifier": {
          "modelType": "CUSTOM_CLOUD_STORAGE",
          "modelName": "my-fine-tuned-model",
          "modelId": 42,
          "modelVersionId": 3
        },
        "evaluators": [
          { "evaluatorId": 10 }
        ],
        "inferenceConfig": {
          "inferenceTemperature": 0.0,
          "inferenceMaxNewTokens": 512,
          "inferenceSeed": 42
        },
        "dataset": {
          "datasetId": 88
        },
        "generateFailureModes": false
      }
    }
  }
}
When to use: Routine quality checks after each training iteration, where you want fast results without full failure mode analysis.
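Because a recipe is plain JSON, it can be convenient to build it programmatically so the IDs vary per run while the structure stays fixed. The sketch below is illustrative only: the field names mirror the JSON example above, but the `make_basic_recipe` helper is not part of any official SDK.

```python
import json

def make_basic_recipe(model_id, model_version_id, evaluator_ids, dataset_id):
    """Build a minimal single-model evaluation recipe as a Python dict.

    Field names mirror the recipe JSON shown above; this helper itself
    is a local convenience, not an official API.
    """
    return {
        "recipe": {
            "recipeConfig": {
                "type": "evaluate",
                "evaluationConfig": {
                    "evaluationType": "single_model",
                    "modelIdentifier": {
                        "modelType": "CUSTOM_CLOUD_STORAGE",
                        "modelName": "my-fine-tuned-model",
                        "modelId": model_id,
                        "modelVersionId": model_version_id,
                    },
                    # One entry per judge evaluator.
                    "evaluators": [{"evaluatorId": e} for e in evaluator_ids],
                    "inferenceConfig": {
                        "inferenceTemperature": 0.0,
                        "inferenceMaxNewTokens": 512,
                        "inferenceSeed": 42,
                    },
                    "dataset": {"datasetId": dataset_id},
                    "generateFailureModes": False,
                },
            }
        }
    }

# Reproduce the example above: model 42 (version 3), evaluator 10, dataset 88.
recipe = make_basic_recipe(42, 3, [10], 88)
print(json.dumps(recipe, indent=2))
```

Serializing with `json.dumps(..., indent=2)` yields a template you can save directly as a recipe file.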

Evaluation with failure mode analysis

Same as above, but with generateFailureModes: true. Oumi will automatically cluster low-scoring responses into categories (e.g., hallucination, formatting errors, missing information) that can feed directly into a data synthesis run.
{
  "recipe": {
    "recipeConfig": {
      "type": "evaluate",
      "evaluationConfig": {
        "evaluationType": "single_model",
        "modelIdentifier": {
          "modelType": "CUSTOM_CLOUD_STORAGE",
          "modelName": "my-fine-tuned-model",
          "modelId": 42,
          "modelVersionId": 3
        },
        "evaluators": [
          { "evaluatorId": 10 },
          { "evaluatorId": 11 }
        ],
        "inferenceConfig": {
          "inferenceTemperature": 0.0,
          "inferenceMaxNewTokens": 512,
          "inferenceSeed": 42,
          "requestsPerMinute": 60
        },
        "dataset": {
          "datasetId": 88
        },
        "generateFailureModes": true
      }
    }
  }
}
When to use: After a training run where you want to understand why the model underperforms, not just how much.
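Since the failure-mode variant differs from the basic recipe only in `generateFailureModes` and the evaluator list, one way to keep the two in sync is to derive the variant from the base recipe rather than maintain two JSON files. A sketch, using a deliberately minimal stand-in recipe (the `with_failure_modes` helper is hypothetical, not an official API):

```python
import copy

def with_failure_modes(recipe, extra_evaluator_ids=()):
    """Return a deep copy of a recipe with failure mode analysis enabled.

    The base recipe is left untouched, so routine runs stay cheap and
    the clustering pass is only added when you ask for it.
    """
    variant = copy.deepcopy(recipe)
    cfg = variant["recipe"]["recipeConfig"]["evaluationConfig"]
    cfg["generateFailureModes"] = True
    cfg["evaluators"].extend({"evaluatorId": e} for e in extra_evaluator_ids)
    return variant

# Minimal stand-in for a saved basic recipe (only the fields the helper touches).
base = {
    "recipe": {"recipeConfig": {"evaluationConfig": {
        "evaluators": [{"evaluatorId": 10}],
        "generateFailureModes": False,
    }}}
}

# Enable failure modes and add a second evaluator (e.g., a safety judge).
variant = with_failure_modes(base, extra_evaluator_ids=[11])
```

`copy.deepcopy` matters here: a shallow copy would mutate the shared `evaluators` list inside the base recipe.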

Evaluating an external API model

Use a frontier model (e.g., GPT-4.1, Claude) as the subject of evaluation rather than a custom-trained model. This is useful for benchmarking your fine-tuned model against a baseline.
{
  "recipe": {
    "recipeConfig": {
      "type": "evaluate",
      "evaluationConfig": {
        "evaluationType": "single_model",
        "modelIdentifier": {
          "modelType": "OPENAI_API",
          "modelName": "gpt-4.1",
          "apiKeys": {
            "openai": "sk-xxxxxxxxxxxxxxxxxxxxxxxx"
          }
        },
        "evaluators": [
          { "evaluatorId": 10 }
        ],
        "inferenceConfig": {
          "inferenceTemperature": 0.0,
          "inferenceMaxNewTokens": 512,
          "requestsPerMinute": 100
        },
        "dataset": {
          "datasetId": 88
        },
        "generateFailureModes": false
      }
    }
  }
}
When to use: Establishing a performance ceiling before fine-tuning, or comparing your model against a known baseline on the same dataset.
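For the baseline comparison to be fair, everything except `modelIdentifier` should match the recipe used for your own model. One hedged way to guarantee that is to derive the baseline recipe from the custom-model recipe, swapping only the model block; `as_api_baseline` below is a hypothetical local helper, and the API key is read from an environment variable rather than hard-coded into the saved recipe.

```python
import copy
import os

def as_api_baseline(recipe, model_name="gpt-4.1"):
    """Return a copy of a recipe retargeted at an external API model.

    Evaluators, inference settings, and dataset are preserved verbatim,
    so scores are directly comparable with the custom-model run.
    """
    variant = copy.deepcopy(recipe)
    cfg = variant["recipe"]["recipeConfig"]["evaluationConfig"]
    cfg["modelIdentifier"] = {
        "modelType": "OPENAI_API",
        "modelName": model_name,
        # Assumption: key is supplied via OPENAI_API_KEY at build time,
        # not stored in the recipe template itself.
        "apiKeys": {"openai": os.environ.get("OPENAI_API_KEY", "")},
    }
    return variant

# Minimal stand-in for a saved custom-model recipe.
custom = {
    "recipe": {"recipeConfig": {"evaluationConfig": {
        "modelIdentifier": {"modelType": "CUSTOM_CLOUD_STORAGE", "modelId": 42},
        "dataset": {"datasetId": 88},
    }}}
}
baseline = as_api_baseline(custom)
```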

Tips

  • Fix inferenceTemperature at 0.0 and set an inferenceSeed: deterministic outputs make results comparable across runs.
  • Use multiple evaluators: combining a quality evaluator and a safety evaluator gives a more complete picture.
  • Run failure mode analysis periodically, not on every iteration. It adds compute cost but provides high-value signal for data synthesis.
  • Pin datasetVersion: if your evaluation dataset may change over time, pinning a version keeps historical comparisons valid.
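A practical sanity check for the comparability tips above is to diff two recipes before trusting a score comparison: if anything other than the model block differs, the runs are not like-for-like. A minimal sketch (the `recipe_diff` generator is illustrative, not an official utility):

```python
def recipe_diff(a, b, path=""):
    """Recursively compare two recipe dicts, yielding paths that differ.

    For two runs meant to be comparable, ideally only fields under
    modelIdentifier should appear in the output.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        for key in sorted(set(a) | set(b)):
            yield from recipe_diff(a.get(key), b.get(key), f"{path}.{key}")
    elif isinstance(a, list) and isinstance(b, list) and len(a) == len(b):
        for i, (x, y) in enumerate(zip(a, b)):
            yield from recipe_diff(x, y, f"{path}[{i}]")
    elif a != b:
        yield path

# Toy example: only the model name differs, so the runs are comparable.
baseline = {"modelIdentifier": {"modelName": "gpt-4.1"}, "dataset": {"datasetId": 88}}
candidate = {"modelIdentifier": {"modelName": "my-fine-tuned-model"}, "dataset": {"datasetId": 88}}
print(list(recipe_diff(baseline, candidate)))
# → ['.modelIdentifier.modelName']
```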