Oumi makes it easy to generate high-quality datasets at any stage of the machine learning workflow, whether you’re creating training data from scratch or synthesizing datasets from failure modes to improve model performance.

Structure & contents

An Oumi dataset is a structured collection of prompts and responses used to either train a model or evaluate its performance. Depending on your workflow, a dataset may include:
  • Prompt–response pairs for supervised fine-tuning
  • Prompts only, where model outputs are generated and evaluated separately
  • Multi-turn conversations for dialogue-based training or benchmarking
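As a rough illustration, the three shapes above might look like this when written as rows with a messages list. The messages layout follows the example further down this page; representing a prompt-only row as a single user message is an assumption, and the `describe` helper is hypothetical, not part of Oumi.

```python
# Illustrative rows only. The prompt-only layout is an assumption,
# not confirmed Oumi behavior.
pair_row = {
    "messages": [
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "2+2 equals 4."},
    ]
}

prompt_only_row = {
    "messages": [
        {"role": "user", "content": "What is 2+2?"},
    ]
}

multi_turn_row = {
    "messages": [
        {"role": "user", "content": "How do I make pasta?"},
        {"role": "assistant", "content": "Boil water and add pasta for 8-10 minutes."},
        {"role": "user", "content": "Should I salt the water?"},
        {"role": "assistant", "content": "Yes, salt it before adding the pasta."},
    ]
}


def describe(row: dict) -> str:
    """Hypothetical helper: classify a row by the roles it contains."""
    roles = [message["role"] for message in row["messages"]]
    if "assistant" not in roles:
        return "prompt-only"
    if len(roles) > 2:
        return "multi-turn"
    return "prompt-response pair"
```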

Uploading datasets

You can upload datasets directly into Oumi in a variety of common formats, including JSON, JSONL, CSV, and Parquet. All Oumi datasets follow a standardized internal format that defines how messages, roles, and metadata are structured. During upload, Oumi automatically validates and converts your data into this format, so that it works with training, evaluation, data synthesis, and analysis tools across the platform as well as with modern machine learning pipelines.
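Oumi performs this conversion for you during upload, but it can help to see what such a conversion involves. The sketch below turns a two-column CSV into JSONL rows with a messages list, using only the Python standard library. The column names `prompt` and `response` and the function `csv_to_jsonl` are assumptions made for illustration; this is not Oumi's converter.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str, prompt_col: str = "prompt",
                 response_col: str = "response") -> str:
    """Convert CSV text into JSONL, one messages-style row per CSV record.

    The column names are hypothetical defaults, not an Oumi convention.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for record in reader:
        row = {
            "messages": [
                {"role": "user", "content": record[prompt_col]},
                {"role": "assistant", "content": record[response_col]},
            ]
        }
        lines.append(json.dumps(row))
    return "\n".join(lines)


sample_csv = (
    "prompt,response\n"
    "What is the capital of France?,The capital of France is Paris.\n"
    "What is 2+2?,2+2 equals 4.\n"
)
jsonl_text = csv_to_jsonl(sample_csv)
```

Each line of `jsonl_text` is a standalone JSON object, which is what distinguishes JSONL from a single JSON array.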

Raw files

Oumi also supports uploading raw files to ground your models in proprietary or domain-specific data. This allows you to incorporate internal documents, knowledge bases, or other private content into your workflows. To learn more, please see Uploading Raw Files.

Example usage

Here’s an example of a well-formed Oumi dataset in JSONL format, with one JSON object per line:
{"messages": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris."}], "metadata": {"source": "geography"}}
{"messages": [{"role": "user", "content": "How do I make pasta?"}, {"role": "assistant", "content": "Boil water and add pasta for 8-10 minutes."}], "metadata": {"source": "cooking"}}
{"messages": [{"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "2+2 equals 4."}], "metadata": {"source": "math"}}
In addition to the standard messages field, each row can include a metadata field: a dictionary of metadata for that row, such as the source key in the example above.
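Before uploading, you may want to sanity-check a JSONL file locally. The following is a minimal sketch of such a check, based only on the structure shown in the example above. The `check_row` function and the set of allowed roles are assumptions, and Oumi's own validation during upload may apply different or stricter rules.

```python
import json

# Assumed role names; the example above only shows "user" and "assistant".
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_row(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as error:
        return [f"invalid JSON: {error.msg}"]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]

    problems = []
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("missing or empty 'messages' list")
    else:
        for index, message in enumerate(messages):
            if not isinstance(message, dict):
                problems.append(f"message {index} is not an object")
                continue
            if message.get("role") not in ALLOWED_ROLES:
                problems.append(f"message {index} has an unexpected role")
            if not isinstance(message.get("content"), str):
                problems.append(f"message {index} has no string 'content'")

    # metadata is optional, but should be a dictionary when present.
    if "metadata" in row and not isinstance(row["metadata"], dict):
        problems.append("'metadata' must be a dictionary")
    return problems
```

To check a whole file, call `check_row` on each non-empty line and report the line numbers where the returned list is not empty.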

What’s next

Add Datasets

Upload and import datasets into the Oumi platform.

Add Raw Files

Upload and import raw files to contextualize and ground your data.

Data Explorer

Explore, inspect, and validate your datasets.

Recipes

Add new datasets using guided workflows.