How it works
When you upload a dataset to the platform, Oumi automatically runs a series of quality tests to identify potential issues. Any failed tests pinpoint conversations that may need to be removed to enhance overall quality. You can then directly remove problematic rows within the platform’s UI, or export individual quality tests for further analysis. Oumi currently provides the following quality tests:- Total tokens exceed 8000 - Flags conversations where the total token count exceeds 8,000, helping identify overly long entries that may affect processing or model performance.
- Non-alternating user/assistant turns - Flags conversations where messages do not strictly alternate between user and assistant, ensuring the dataset follows a consistent turn-taking structure.
- Empty turns detected - Flags conversations containing empty or whitespace-only messages, which can introduce noise or reduce dataset quality.
Accessing dataset quality tests
After successfully uploading a dataset, you can access its quality tests from the Datasets page:- Click the
Quality Testslink for your dataset. - Under the
Quality Teststab, you can expand each test to view its details. - Click
Export Quality Teststo export the quality tests as a JSONL or Parquet file.
Deleting failed rows
To immediately remove rows flagged by Oumi’s quality tests during dataset upload:- Click the
Quality Testslink for the dataset with the failing tests. - Under the
Quality Teststab, you can expand each test and drill down further to analyze row-level errors. - Click
Delete Failed Rowsto delete the problematic rows. Oumi will then re-analyze your new dataset version.
Rather than modifying the original dataset, Oumi creates a new version and applies changes there, ensuring all dataset states are version-controlled and fully recoverable.
Restoring previous dataset versions
To restore a previous dataset:- On the Datasets page, click the name of your dataset.
- Click the three-dot menu to the right of the dataset version you want to restore.
- Select
Restore Version. Oumi will restore the selected dataset version as a new version and automatically rerun the quality checks.