Overview
An evaluator recipe captures the configuration of a judge (the prompt it uses, the model it runs on, and the scoring structure it applies) as a reusable template. Saving an evaluator as a recipe lets you apply the same judge consistently across multiple evaluation runs without reconfiguring it each time. For instructions on saving and accessing evaluator recipes from the UI, see Evaluator Recipes. For the full schema reference, see Evaluator Recipe Schema.

Common recipe patterns
General response quality judge
Evaluates whether a model response is accurate, clear, and helpful. A good starting point for most fine-tuning workflows.

Safety judge
Focused specifically on detecting unsafe, harmful, or policy-violating outputs. Use alongside a quality judge when safety is a concern.

Domain-specific correctness judge
Evaluates factual correctness against a ground-truth reference. Useful for tasks like question answering, classification, or structured extraction where a correct answer exists.

Tips
- Set `inferenceTemperature: 0.0` for judge models. You want deterministic scoring, not creative variation.
- Enable `generateScoreExplanation: true` during development. Explanations help you validate that the judge is reasoning correctly before running large evaluations.
- Use `dataFields` carefully: field names must match the column names in your evaluation dataset exactly.
- Keep prompts focused: judges with 2-3 scoring criteria produce more reliable results than judges with many criteria in a single prompt.
- Use `THINKING_AND_RESPONSE` mode with reasoning models to leverage chain-of-thought in the judge's scoring.
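Putting these tips together, a quality-judge recipe might look like the sketch below. This is a minimal, hypothetical example expressed as a Python dict: the field names `inferenceTemperature`, `generateScoreExplanation`, and `dataFields` come from this page, while the surrounding structure (`name`, `judgePrompt`, `scoringCriteria`, and the placeholder syntax in the prompt) is assumed for illustration and may differ from the actual recipe schema.

```python
# Hypothetical recipe for a general response quality judge.
# Only inferenceTemperature, generateScoreExplanation, and dataFields are
# documented field names; the rest of the structure is illustrative.
quality_judge_recipe = {
    "name": "general-quality-judge",
    "judgePrompt": (
        "Rate the response for accuracy, clarity, and helpfulness "
        "on a scale of 1-5.\n"
        "Question: {question}\n"
        "Response: {response}"
    ),
    # Deterministic scoring, not creative variation.
    "inferenceTemperature": 0.0,
    # Enable during development to validate the judge's reasoning.
    "generateScoreExplanation": True,
    # Must match the column names in the evaluation dataset exactly.
    "dataFields": ["question", "response"],
    # Keep it focused: 2-3 criteria per judge is more reliable.
    "scoringCriteria": ["accuracy", "clarity", "helpfulness"],
}
```

Once validated against a small sample, the same recipe can be reused across runs; for large evaluation jobs you would typically turn `generateScoreExplanation` back off to reduce cost.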