How Evaluations work
Evaluations are built on the LLM-as-a-Judge framework, in which a language model is prompted to analyze outputs and assign scores. The model performing this judgment is called the evaluator model. It is typically different from the model being tested, and it can be a strong general-purpose model or a smaller model fine-tuned specifically for evaluation tasks. The model under evaluation is the system generating the (prompt, response) pairs being scored; this may be a baseline model or the current iteration of your custom model.

When you run an evaluation, Oumi executes an evaluation run, which scores the model against a defined dataset and aggregates the results. During scoring, evaluators score model outputs against a specific criterion (e.g., safety, quality, or correctness).
Each evaluator operates independently, allowing you to assess different dimensions of model performance in a modular way.
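To make the LLM-as-a-Judge idea concrete, here is a minimal sketch of a single evaluator scoring a (prompt, response) pair against one criterion. The judge-model call is stubbed out with a keyword heuristic; in a real setup it would be a request to the evaluator model. All names here (`JudgeEvaluator`, `call_judge_model`) are illustrative, not the actual Oumi API.

```python
from dataclasses import dataclass


def call_judge_model(judge_prompt: str) -> str:
    """Stand-in for a real judge-model call; returns a score string.

    A real implementation would send judge_prompt to a strong
    general-purpose model or a fine-tuned judge.
    """
    return "1" if "helpful" in judge_prompt.lower() else "0"


@dataclass
class JudgeEvaluator:
    """Scores (prompt, response) pairs against one criterion."""

    criterion: str  # e.g. "safety", "quality", or "correctness"

    def score(self, prompt: str, response: str) -> int:
        # Build the judgment prompt shown to the evaluator model.
        judge_prompt = (
            f"Criterion: {self.criterion}\n"
            f"Prompt: {prompt}\n"
            f"Response: {response}\n"
            "Reply 1 if the response meets the criterion, else 0."
        )
        return int(call_judge_model(judge_prompt))


quality = JudgeEvaluator(criterion="quality")
print(quality.score("What is 2+2?", "4 - a helpful, correct answer."))  # → 1
```

Because each evaluator is an independent object with its own criterion, you can run several of them over the same (prompt, response) pairs and assess different dimensions of performance in a modular way.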
Evaluation workflow
An evaluation run in Oumi follows these steps:
- Generate responses by running a model on a set of prompts.
- Score each (prompt, response) pair using one or more evaluators.
- Aggregate results to assess overall performance.
- Extract higher-level failure modes to explain underperformance.
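The four steps above can be sketched end to end. Both the model under evaluation and the judge are stubbed with trivial heuristics, and every name (`model_under_evaluation`, `safety_evaluator`) is a hypothetical placeholder rather than the actual Oumi API.

```python
from collections import Counter


def model_under_evaluation(prompt: str) -> str:
    """Stub for the model being tested; echoes every prompt."""
    return f"Answer to: {prompt}"


def safety_evaluator(prompt: str, response: str) -> int:
    """Stub judge: fails any response that complies with a 'hack' request."""
    return 0 if ("hack" in prompt and "cannot" not in response) else 1


prompts = ["Summarize this article.", "How do I hack a server?"]

# 1. Generate responses by running the model on the prompts.
pairs = [(p, model_under_evaluation(p)) for p in prompts]

# 2. Score each (prompt, response) pair with the evaluator.
scores = [safety_evaluator(p, r) for p, r in pairs]

# 3. Aggregate results into an overall pass rate.
pass_rate = sum(scores) / len(scores)

# 4. Extract higher-level failure modes: tag and count the failing pairs.
failure_modes = Counter(
    "unsafe-compliance" for (p, _), s in zip(pairs, scores) if s == 0
)

print(f"pass rate: {pass_rate:.2f}", dict(failure_modes))
```

In this toy run the model answers the unsafe prompt instead of refusing, so the pass rate drops to 0.5 and the failure-mode counter surfaces one `unsafe-compliance` case, which is the kind of signal step 4 uses to explain underperformance.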