Evaluations can be run before or after training to automatically measure model performance, uncover failure modes, and pinpoint the issues that matter most. Each evaluation is fully recorded and reproducible, giving you a reliable audit trail of every run. With configurable evaluators that score outputs against your defined criteria, Oumi turns model iteration into a structured, repeatable loop, so you can continuously diagnose and improve your models.

How evaluations work

Evaluations are created using the LLM-as-a-Judge framework, in which a language model is prompted to analyze outputs and assign scores. The model performing this judgment is called the evaluator model. It is typically different from the model being tested, and can be a strong general-purpose model or a smaller model fine-tuned specifically for evaluation tasks. The model under evaluation is the system generating the (prompt, response) pairs being scored; this may be a baseline model or the current iteration of your custom model.

When you run an evaluation, Oumi executes an evaluation run, which scores the model against a defined dataset and aggregates the results. Each evaluator scores model outputs against a specific criterion (e.g., safety, quality, or correctness) and operates independently, allowing you to assess different dimensions of model performance in a modular way.
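The judging step described above can be sketched in a few lines of Python. This is an illustrative, offline example, not Oumi's actual API: the names (`judge`, `JudgeResult`, `evaluator_model`) are hypothetical, and a stub function stands in for a real evaluator-model call.

```python
from dataclasses import dataclass


@dataclass
class JudgeResult:
    """Outcome of judging one (prompt, response) pair on one criterion."""
    criterion: str
    score: int       # 1 = pass, 0 = fail
    rationale: str


def _stub_model(judge_prompt: str) -> str:
    """Stand-in for a real evaluator model (toy heuristic: fail empty responses)."""
    response = judge_prompt.split("Response: ", 1)[1].split("\n", 1)[0]
    if response.strip():
        return "PASS\nNon-empty response."
    return "FAIL\nEmpty response."


def judge(prompt: str, response: str, criterion: str,
          evaluator_model=None) -> JudgeResult:
    """Score one (prompt, response) pair against a single criterion.

    `evaluator_model` is any callable that takes the judge prompt and
    returns the judge's raw text verdict; stubbed here so the example
    runs without an LLM.
    """
    judge_prompt = (
        f"Criterion: {criterion}\n"
        f"Prompt: {prompt}\n"
        f"Response: {response}\n"
        "Answer PASS or FAIL on the first line, then a one-line rationale."
    )
    raw = (evaluator_model or _stub_model)(judge_prompt)
    verdict, _, rationale = raw.partition("\n")
    return JudgeResult(criterion, 1 if verdict.strip() == "PASS" else 0, rationale)


# Example: judge("What is 2+2?", "4", "correctness") yields score 1.
```

Because the evaluator model is just a callable here, you could swap in a strong general-purpose model or a small fine-tuned judge without changing the scoring loop.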

Evaluation workflow

An evaluation run in Oumi follows these steps:
  • Generate responses by running a model on a set of prompts.
  • Score each (prompt, response) pair using one or more evaluators.
  • Aggregate results to assess overall performance.
  • Extract higher-level failure modes to explain underperformance.
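The four steps above can be sketched as a single loop. This is a simplified illustration under assumed interfaces (a `model` callable mapping prompt to response, and evaluators returning a score plus an optional failure tag), not Oumi's implementation.

```python
from collections import Counter


def run_evaluation(model, prompts, evaluators):
    """Illustrative evaluation run: generate, score, aggregate, tag failures.

    model:      callable prompt -> response  (hypothetical interface)
    evaluators: dict of name -> callable (prompt, response) -> (score, tag or None)
    Returns per-criterion mean scores and a Counter of failure modes.
    """
    # 1. Generate responses by running the model on a set of prompts.
    pairs = [(p, model(p)) for p in prompts]

    per_criterion = {}
    failure_modes = Counter()
    for name, evaluator in evaluators.items():
        scores = []
        for prompt, response in pairs:
            # 2. Score each (prompt, response) pair.
            score, tag = evaluator(prompt, response)
            scores.append(score)
            # 4. Collect failure tags to surface higher-level failure modes.
            if tag:
                failure_modes[tag] += 1
        # 3. Aggregate results per criterion.
        per_criterion[name] = sum(scores) / len(scores)
    return per_criterion, failure_modes


# Toy usage: a model that returns nothing for "hard" prompts,
# and one evaluator that flags empty responses.
toy_model = lambda p: "" if "hard" in p else "answer to " + p
toy_evaluators = {
    "completeness": lambda p, r: (1, None) if r else (0, "empty response"),
}
scores, failures = run_evaluation(toy_model, ["easy q", "hard q"], toy_evaluators)
# scores["completeness"] == 0.5; failures["empty response"] == 1
```

Because evaluators are independent callables keyed by name, adding another dimension (e.g., safety) is just another dictionary entry, which mirrors the modular scoring described above.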
Each new project includes predefined evaluators to help you get started quickly. These evaluators are editable, optional, and reusable, providing strong baseline configurations for your evaluations.