Skip to main content
Oumi automatically scans your uploaded datasets for quality issues, highlights problematic entries, and lets you fix them directly in the platform.

How it works

When you upload a dataset to the platform, Oumi automatically runs a series of quality tests to identify potential issues. Any failed tests pinpoint conversations that may need to be removed to enhance overall quality. You can then directly remove problematic rows within the platform’s UI, or export individual quality tests for further analysis. Oumi currently provides the following quality tests:
  • Total tokens exceed 8000 - Flags conversations where the total token count exceeds 8,000, helping identify overly long entries that may affect processing or model performance.
  • Non-alternating user/assistant turns - Flags conversations where messages do not strictly alternate between user and assistant, ensuring the dataset follows a consistent turn-taking structure.
  • Empty turns detected - Flags conversations containing empty or whitespace-only messages, which can introduce noise or reduce dataset quality.

Accessing dataset quality tests

After successfully uploading a dataset, you can access its quality tests from the Datasets page:
  1. Click the Quality Tests link for your dataset.
  2. Under the Quality Tests tab, you can expand each test to view its details.
  3. Click Export Quality Tests to export the quality tests as a JSONL or Parquet file.

Deleting failed rows

To immediately remove rows flagged by Oumi’s quality tests during dataset upload:
  1. Click the Quality Tests link for the dataset with the failing tests.
  2. Under the Quality Tests tab, you can expand each test and drill down further to analyze row-level errors.
  3. Click Delete Failed Rows to delete the problematic rows. Oumi will then re-analyze your new dataset version.
Rather than modifying the original dataset, Oumi creates a new version and applies changes there, ensuring all dataset states are version-controlled and fully recoverable.

Restoring previous dataset versions

To restore a previous dataset:
  1. On the Datasets page, click the name of your dataset.
  2. Click the three-dot menu to the right of the dataset version you want to restore.
  3. Select Restore Version. Oumi will restore the selected dataset version as a new version and automatically rerun the quality checks.