Accessing your Evaluators
You can view and manage your evaluators on the Evaluators page.

Using Oumi’s built-in Evaluators
Oumi includes a growing set of general-purpose evaluators that work well across many tasks:
- Instruction following: Is the model response faithful to the prompt?
- Safety: Is the response free from harmful language?
- Topic adherence: Does the response stay on topic with respect to the prompt?
- Truthfulness: Is the response correct as far as is known from the model knowledge and context window?
Built-in evaluators are a good choice when:
- You are establishing an initial baseline
- You want fast feedback without custom configuration
- Your task aligns with common evaluation dimensions
Built-in evaluators can be reviewed, edited, or extended as needed.
Creating custom Evaluators
Oumi also enables you to build custom evaluators tailored to your specific requirements.
- On the Evaluators page, click Create Evaluator.
- In the Builder, click Create an Evaluator.
Evaluator identity
Provide a name and description for your judge.
- Name: The name of the judge; it should indicate what this judge evaluates (e.g., “Instruction Following”).
- What should the judge evaluate?: A clear sentence describing the evaluation goal (e.g., “Evaluate whether the response accurately follows all instructions provided in the user’s request”).
Judgements
Judgements define the possible outcomes of an evaluation and how each outcome is scored. They tell the evaluator how to label a response and what that label means. Click Add Judgement and provide the following three fields to add a new judgement:
- Label: The name of the outcome (for example, “Correct”, “Incorrect”, “Adherent”, “Non-Adherent”). This is what the evaluator will output.
- Condition: A clear description of when this label should be applied. This acts as guidance for the evaluator, helping it decide which label best fits a given response.
- Score: A numeric value associated with the label (such as 1 for correct, 0 for incorrect). Scores allow you to quantify performance and compare results across examples.
How judgements work
Judgements define the grading system for your task. For example:
- In a classification task:
- Correct → score 1
- Incorrect → score 0
- In a quality evaluation task:
- High Quality → score 1
- Acceptable → score 0.5
- Poor → score 0
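The grading system above can be sketched as a simple label-to-score mapping. This is a minimal Python illustration only; the dictionary and the averaging helper are assumptions for this example, not part of Oumi’s API.

```python
# Hypothetical sketch: map judgement labels to scores and aggregate them.
# The labels mirror the quality-evaluation example above.
JUDGEMENTS = {
    "High Quality": 1.0,
    "Acceptable": 0.5,
    "Poor": 0.0,
}

def aggregate(labels):
    """Average the scores of a batch of judged responses."""
    return sum(JUDGEMENTS[label] for label in labels) / len(labels)

print(aggregate(["High Quality", "Acceptable", "Poor", "High Quality"]))  # 0.625
```

Because every label carries a numeric score, results can be averaged and compared across examples and across evaluation runs.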
Data fields
Define the data fields that the judge should evaluate.
- Input Type: Specify whether the input is:
  - single turn (a user request and an assistant response)
  - single turn (an assistant response)
  - multi-turn (a conversation between a user and an assistant)
- Additional Fields: Provide additional fields (as Key, Display Name, and Description) for the judge to consider in its evaluation.
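A single-turn record with one additional field might look like the sketch below. The field names (including "reference_answer") are illustrative assumptions, not Oumi’s schema.

```python
# Hypothetical single-turn record for a judge to evaluate.
# "reference_answer" stands in for an additional field registered with
# Key="reference_answer", Display Name="Reference Answer",
# Description="Ground-truth answer the response is compared against".
record = {
    "prompt": "What is the capital of France?",
    "response": "Paris.",
    "reference_answer": "Paris",
}

print(record["reference_answer"])  # Paris
```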
Evaluation criteria (optional)
Click Add Criterion to add specific criteria the judge should consider. Each criterion should consist of a Name that describes it (e.g., “Format Compliance”) and a Description (e.g., “Response should be in JSON format and not exceed 1000 characters”).
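Conceptually, each criterion is a name/description pair that is passed to the judge as guidance. The sketch below is a plain-Python illustration of that structure, reusing the “Format Compliance” example above; it is not an Oumi data format.

```python
# Hypothetical representation of evaluation criteria as name/description pairs.
criteria = [
    {
        "name": "Format Compliance",
        "description": "Response should be in JSON format and not exceed 1000 characters",
    },
    {
        # A second illustrative criterion, invented for this sketch.
        "name": "Tone",
        "description": "Response should remain polite and professional",
    },
]

print(len(criteria))  # 2
```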
Examples (optional)
Click Add Example to provide examples that guide the judge’s behavior. Each example should supply sample values for the data fields together with the expected judgement, showing the judge how a case of that kind should be labeled.
Model settings
Specify the model and settings used to calculate the score according to the instructions. This model is not to be confused with the model under evaluation that produced the (prompt, response) pairs.

Using the Agent
You can also use the Agent to define your judges using natural language. Oumi will suggest evaluator patterns based on your task (for example, comprehensiveness, groundedness, fluency, format adherence). See the built-in evaluators for examples of correctly formatted prompts.
Models for custom providers like OpenAI will not appear unless you have registered a corresponding API key with the project. See here for details.
Custom Evaluator design best practices
Well-designed custom evaluators produce consistent, actionable results. The following best practices help keep your custom evaluators clear, focused, and aligned with your success criteria.
- Be specific about what success looks like.
- Avoid overly broad instructions.
- Each evaluator should measure a single, non-overlapping property.
- Start simple, then refine based on failure modes.