Evaluators define what strong performance looks like for your task and are essential for building effective custom AI. Oumi makes evaluation easier by providing powerful built-in evaluators along with the ability to create custom ones from natural language descriptions.

Accessing your Evaluators

You can view and manage your evaluators on the Evaluators page.

Using Oumi’s built-in Evaluators

Oumi includes a growing set of general-purpose evaluators that work well across many tasks:
  • Instruction following: Is the model response faithful to the prompt?
  • Safety: Is the response free from harmful language?
  • Topic adherence: Does the response stay on topic with respect to the prompt?
  • Truthfulness: Is the response factually correct, given the model's knowledge and the information in its context window?
These built-in evaluators are ideal when:
  • You are establishing an initial baseline
  • You want fast feedback without custom configuration
  • Your task aligns with common evaluation dimensions
Built-in evaluators can be reviewed, edited, or extended as needed.

Creating custom Evaluators

Oumi also enables you to build custom evaluators tailored to your specific requirements.
  • On the Evaluators page, click Create Evaluator.
  • In the Builder, click Create an Evaluator.
On the Inputs tab, provide the following fields to define your Evaluator:

Evaluator identity

Provide a name and description for your judge.
  • Name - The name of the judge; should indicate what this judge evaluates (e.g., “Instruction Following”)
  • What should the judge evaluate? - A clear sentence describing the evaluation goal (e.g., “Evaluate whether the response accurately follows all instructions provided in the user’s request”)

Judgements

Judgements define the possible outcomes of an evaluation and how each outcome is scored. They tell the evaluator how to label a response and what that label means. Click Add Judgement and provide the following three fields to add a new judgement:
  • Label - The name of the outcome (for example, “Correct”, “Incorrect”, “Adherent”, “Non-Adherent”). This is what the evaluator will output.
  • Condition - A clear description of when this label should be applied. This acts as guidance for the evaluator, helping it decide which label best fits a given response.
  • Score - A numeric value associated with the label (such as 1 for correct, 0 for incorrect). Scores allow you to quantify performance and compare results across examples.
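Putting the three fields together, a judgement can be thought of as a small record. The sketch below is purely illustrative (the field names mirror the Label, Condition, and Score fields above, but this is not Oumi's actual schema):

```python
# Hypothetical judgement definitions for a correctness evaluator.
# Field names mirror the Label / Condition / Score inputs described above.
judgements = [
    {
        "label": "Correct",
        "condition": "The response follows every instruction in the prompt.",
        "score": 1,
    },
    {
        "label": "Incorrect",
        "condition": "The response ignores or violates at least one instruction.",
        "score": 0,
    },
]
```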

How judgements work

Judgements define the grading system for your task. For example:
  • In a classification task:
    • Correct → score 1
    • Incorrect → score 0
  • In a quality evaluation task:
    • High Quality → score 1
    • Acceptable → score 0.5
    • Poor → score 0
By defining clear labels and conditions, you ensure that evaluations are consistent, interpretable, and aligned with your goals.
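Because each label carries a numeric score, per-example judgements can be rolled up into an overall metric. As a minimal illustration (the label-to-score mapping below is hypothetical and not part of Oumi's API), averaging the scores of a batch of judgements yields a single number you can track over time:

```python
# Hypothetical label-to-score mapping for the quality-evaluation
# task shown above; illustrative only.
SCORES = {"High Quality": 1.0, "Acceptable": 0.5, "Poor": 0.0}

def aggregate(labels):
    """Average the numeric scores for a list of judgement labels."""
    return sum(SCORES[label] for label in labels) / len(labels)

print(aggregate(["High Quality", "Acceptable", "Poor", "High Quality"]))  # 0.625
```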

Data fields

Define the data fields that the judge should evaluate.
  • Input Type - Specify whether the input type is:
    • single turn (a user request and an assistant response)
    • single turn (an assistant response)
    • multi-turn (a conversation between a user and an assistant)
  • Additional Fields - Provide additional fields (as Key, Display Name, and Description) for the judge to consider in its evaluation
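To make the input types concrete, here is a sketch of the data a judge might receive for each type. The field names are hypothetical, chosen only for illustration, and do not reflect Oumi's actual data schema:

```python
# Illustrative payloads for each input type; all keys are hypothetical.

# Single turn: a user request and an assistant response.
single_turn = {
    "request": "Summarize this article in one sentence.",
    "response": "The article argues that remote work boosts productivity.",
}

# Single turn: an assistant response only.
response_only = {
    "response": "The article argues that remote work boosts productivity.",
}

# Multi-turn: a conversation between a user and an assistant.
multi_turn = {
    "conversation": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4"},
        {"role": "user", "content": "And doubled?"},
        {"role": "assistant", "content": "8"},
    ],
}
```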

Evaluation criteria (optional)

Click Add Criterion to add specific criteria the judge should consider. Each item should consist of a Name that describes the criterion (e.g., “Format Compliance”) and a Description (e.g., “Response should be in JSON format and not exceed 1000 characters”).

Examples (optional)

Click Add Example to provide examples that guide the judge’s behavior, such as sample inputs paired with the judgement you expect the judge to produce.

Model settings

Specify the model and settings used to calculate the score according to the instructions.
Note that this judge model is distinct from the model under evaluation, which produced the (prompt, response) pairs.

Using the Agent

You can also use the Agent to define your judges using natural language. Oumi will suggest evaluator patterns based on your task (for example, comprehensiveness, groundedness, fluency, format adherence).
See the built-in evaluators for examples of correctly formatted prompts.
Models from custom providers such as OpenAI will not appear unless you have registered a corresponding API key with the project. See here for details.
Evaluators can use strong general-purpose models from leading vendors such as OpenAI and Anthropic once you add API keys to the platform. Custom evaluators are saved and reusable across evaluations.

Custom Evaluator design best practices

Well-designed custom evaluators produce consistent, actionable results. The following best practices ensure that your custom evaluators are clear, focused, and aligned with your success criteria.
  • Be specific about what success looks like.
  • Avoid overly broad instructions.
  • Each evaluator should measure a single, non-overlapping property.
  • Start simple, then refine based on failure modes.