Evaluations can be run before or after training to measure your model’s performance, identify weaknesses, and guide improvements as you iterate. Each evaluation uses configurable evaluators that score model outputs against defined criteria.

How evaluations work

Evaluations are built on the LLM-as-a-Judge framework, in which a language model is prompted to analyze outputs and assign scores. The model performing this judgment is called the evaluator model. It is typically different from the model being tested, and can be either a strong general-purpose model or a smaller model fine-tuned specifically for evaluation tasks.

The model under evaluation is the system generating the (prompt, response) pairs being scored. This may be a baseline model or the current iteration of your custom model.

When you run an evaluation, Oumi executes an evaluation run, which scores the model against a defined dataset and aggregates the results. Evaluators score model outputs against a specific criterion (e.g., safety, quality, or correctness). Each evaluator operates independently, so you can assess different dimensions of model performance in a modular way.
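The LLM-as-a-Judge idea can be sketched in a few lines. This is an illustrative example only, not an Oumi API: the judge model is represented by a plain callable (prompt in, text out) so the sketch runs without any model or network, and names like `CriterionEvaluator` and `JUDGE_TEMPLATE` are hypothetical.

```python
import re
from typing import Callable

# Hypothetical judge prompt template; a real one would be more elaborate.
JUDGE_TEMPLATE = (
    "You are an evaluator. Criterion: {criterion}.\n"
    "Prompt: {prompt}\nResponse: {response}\n"
    "Reply with a score from 1 to 5, e.g. 'Score: 4'."
)

class CriterionEvaluator:
    """Scores one (prompt, response) pair against a single criterion."""

    def __init__(self, criterion: str, judge_fn: Callable[[str], str]):
        self.criterion = criterion
        self.judge_fn = judge_fn  # evaluator model, abstracted as a callable

    def score(self, prompt: str, response: str) -> int:
        judge_prompt = JUDGE_TEMPLATE.format(
            criterion=self.criterion, prompt=prompt, response=response
        )
        raw = self.judge_fn(judge_prompt)
        # Parse the numeric score out of the judge's free-text reply.
        match = re.search(r"Score:\s*(\d)", raw)
        if match is None:
            raise ValueError(f"Unparseable judge output: {raw!r}")
        return int(match.group(1))

# Stand-in judge so the sketch is runnable; a real judge would call a model.
def toy_judge(judge_prompt: str) -> str:
    return "Score: 4"

evaluator = CriterionEvaluator("safety", toy_judge)
print(evaluator.score("What is 2+2?", "4"))  # prints 4
```

Because each evaluator holds exactly one criterion, running several of them over the same pairs gives the modular, per-dimension assessment described above.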

Evaluation workflow

An evaluation run in Oumi follows these steps:
  • Generate responses by running a model on a set of prompts.
  • Score each (prompt, response) pair using one or more evaluators.
  • Aggregate results to assess overall performance.
  • Extract higher-level failure modes to explain underperformance.
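The steps above can be sketched end to end. Again this is an illustrative sketch rather than Oumi's actual interface: the model under test and the evaluators are stub callables, and `run_evaluation` is a hypothetical name. Failure-mode extraction (the last step) is a higher-level analysis and is omitted here.

```python
from statistics import mean
from typing import Callable, Dict, List

def run_evaluation(
    model: Callable[[str], str],
    prompts: List[str],
    evaluators: Dict[str, Callable[[str, str], float]],
) -> Dict[str, float]:
    # Step 1: generate responses by running the model on the prompts.
    pairs = [(p, model(p)) for p in prompts]
    # Step 2: score each (prompt, response) pair with every evaluator.
    scores = {
        name: [ev(p, r) for p, r in pairs] for name, ev in evaluators.items()
    }
    # Step 3: aggregate per-evaluator scores into overall averages.
    return {name: mean(vals) for name, vals in scores.items()}

# Stub model and evaluators so the sketch runs as-is.
toy_model = lambda prompt: prompt.upper()
evaluators = {
    "length_ok": lambda p, r: 1.0 if len(r) <= 50 else 0.0,
    "echoes_prompt": lambda p, r: 1.0 if p.lower() in r.lower() else 0.0,
}
results = run_evaluation(toy_model, ["hello", "how are you?"], evaluators)
print(results)  # {'length_ok': 1.0, 'echoes_prompt': 1.0}
```

Keeping generation, scoring, and aggregation as separate stages means evaluators can be added or swapped without regenerating responses.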
Each new project includes predefined evaluators to help you get started quickly. These evaluators are editable, optional, and reusable, providing strong baseline configurations for your evaluations.