Overview

Oumi automatically extracts failure modes from evaluation results, identifying recurring patterns that explain why certain outputs did not meet the defined criteria. Instead of treating low scores as isolated incidents, failure modes surface systematic weaknesses (e.g., reasoning gaps, formatting errors, or unsafe responses), providing clear, actionable insight for the next iteration. By clustering related errors, Oumi helps you focus on high-impact improvements across data, prompts, models, or evaluators. You can also generate a new dataset directly from an evaluator’s identified failure modes, enabling targeted data synthesis and creating a tight feedback loop between evaluation and model refinement.
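As a rough illustration of the clustering idea only, the sketch below groups failed evaluation records by a failure-mode label and ranks the groups by size, so the most common weakness comes first. It does not use or represent Oumi's API: the record fields (`id`, `passed`, `failure_mode`) and the `group_failures` helper are hypothetical names chosen for this example, and the sample data is made up.

```python
from collections import defaultdict

# Hypothetical evaluation records. The field names are illustrative
# assumptions, not Oumi's actual result schema.
results = [
    {"id": 1, "passed": False, "failure_mode": "formatting error"},
    {"id": 2, "passed": False, "failure_mode": "reasoning gap"},
    {"id": 3, "passed": True, "failure_mode": None},
    {"id": 4, "passed": False, "failure_mode": "reasoning gap"},
    {"id": 5, "passed": False, "failure_mode": "reasoning gap"},
    {"id": 6, "passed": False, "failure_mode": "formatting error"},
]


def group_failures(records):
    """Group failed records by failure-mode label, largest group first."""
    groups = defaultdict(list)
    for record in records:
        if not record["passed"]:
            groups[record["failure_mode"]].append(record["id"])
    # Sort by number of failing examples so high-impact modes surface first.
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)


ranked = group_failures(results)
for mode, ids in ranked:
    print(f"{mode}: {len(ids)} failing example(s), ids {ids}")
```

Ranking by group size mirrors the point made above: a failure mode that covers many examples is usually a better target for the next iteration than a single low score.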

Accessing failure modes

To access the failure modes for a particular evaluation:
  1. Go to the Evaluation page and click on the name of the evaluation.
  2. Click on the Review Failure Modes button.
  3. On the Failure Modes page, review each individual item as needed.
To generate a training dataset that targets specific issues, select the failure mode(s) in question and click Generate Dataset.