Structure & contents
An is a structured collection of prompts and responses used to either train a model or evaluate its performance. Depending on your workflow, a dataset may include:- Prompt–response pairs for supervised fine-tuning
- Prompts only, where model outputs are generated and evaluated separately
- Multi-turn conversations for dialogue-based training or benchmarking
Uploading datasets
You can upload datasets directly into Oumi in a variety of common formats, including JSON, JSONL, CSV, and Parquet. All Oumi datasets follow a standardized internal that defines how messages, roles, and metadata are structured. During upload, Oumi automatically validates and converts your data into this format, ensuring it works seamlessly with training, evaluation, data synthesis, and analysis tools across the platform as well as modern machine learning pipelines.Raw files
Oumi also supports uploading raw files to ground your models in proprietary or domain-specific data. This allows you to incorporate internal documents, knowledge bases, or other private content into your workflows. To learn more, please see Uploading Raw Files.Example usage
Here’s an example of a properly-formed dataset for Oumi in format:messages field, you can also specify a metadata field that is a dictionary of metadata for your data row.
What’s next
Add Datasets
Upload and import datasets into the Oumi platform.
Add Raw Files
Upload and import raw files to contextualize and ground your data.
Data Explorer
Explore, inspect, and validate your datasets.
Recipes
Adding new datasets using guided workflows.