To improve a model effectively, you first need a clear way to measure its performance. In Oumi, evaluators score model outputs against defined criteria, helping you assess quality before and after training.

Built-in & Custom Evaluators

Oumi includes built-in evaluators (such as instruction following, safety, topic adherence, and truthfulness) to help you quickly establish baselines and gather early feedback. You can review, edit, and reuse these evaluators across evaluations, or create custom ones with the Builder, which lets you define the exact inputs your judge should consider. Alternatively, you can describe your desired evaluator to the Oumi Agent in natural language: specify the scoring criteria, select the evaluator model, and include additional dataset fields for context as needed.
Custom evaluators are reusable and should focus on a single, clearly defined property to ensure consistent and reliable results.
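
To make the pieces of an evaluator concrete, here is a minimal sketch in Python of the information a custom evaluator brings together: a single scoring criterion, a rubric for the judge, a judge model, and extra dataset fields for context. The class and field names below are hypothetical placeholders for illustration, not the Oumi SDK; in the product you would define these same pieces through the Builder or by describing them to the Oumi Agent.

```python
from dataclasses import dataclass, field


@dataclass
class EvaluatorSpec:
    """Illustrative container for a custom evaluator definition.

    Hypothetical sketch, not the Oumi SDK: it only mirrors the pieces
    described above (criterion, rubric, judge model, context fields).
    """

    name: str            # reusable identifier for the evaluator
    criterion: str       # the single, clearly defined property being scored
    rubric: str          # scoring instructions the judge model follows
    judge_model: str     # the model that scores the outputs
    context_fields: list[str] = field(default_factory=list)  # extra dataset columns shown to the judge


# One evaluator per property keeps scores consistent and easy to interpret.
faithfulness = EvaluatorSpec(
    name="faithfulness-v1",
    criterion="faithfulness",
    rubric=(
        "Score 1 if the response is fully supported by the reference "
        "answer; score 0 if it contradicts it or invents information."
    ),
    judge_model="gpt-4o",  # placeholder; choose any supported judge model
    context_fields=["reference_answer"],
)

print(faithfulness)
```

Keeping each evaluator scoped to one criterion, as in this sketch, is what makes it reusable across evaluations: you can combine several narrow evaluators rather than maintaining one broad, ambiguous rubric.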

What’s next

Defining Evaluators

Establish criteria for measuring model performance

Evaluator Recipes

Save and reuse evaluator configurations