# SYNTHESIS

> Synthesis recipes

## OVERVIEW

A data synthesis recipe captures the configuration for generating a synthetic dataset, including which model to use, inference settings, and how to structure and split the output. It is stored as a reusable JSON template. Running a synthesis recipe produces a new dataset that appears in your Datasets page once the job completes.

For instructions on accessing and running synthesis recipes from the UI, see [Data synthesis recipes](/guides/synthesis/recipes).

For the full schema reference, see [Dataset recipe schema](/reference/schema/datasets).

***

## COMMON RECIPE PATTERNS

### GENERAL DATASET SYNTHESIS

Generates a new dataset by running an existing seed dataset through a synthesis model. This is the most common starting configuration.

```json theme={null}
{
  "modelIdentifier": {
    "modelType": "llm",
    "modelName": "Llama-3.1-8B-Instruct",
    "modelId": "model_llama_8b",
    "modelVersionId": "v1"
  },
  "inferenceConfig": {
    "inferenceTemperature": 0.8,
    "inferenceMaxNewTokens": 512,
    "inferenceSeed": 42,
    "requestsPerMinute": 60
  },
  "synthesisConfig": {
    "synthesisType": "general",
    "synthesisConfig": {
      "datasetId": "dataset_seed_001",
      "datasetDistribution": {
        "train": 0.8,
        "validation": 0.1,
        "test": 0.1
      }
    }
  }
}
```

**When to use:** Expanding a small seed dataset into a larger training set, or generating diverse variations of existing examples.

***

### HIGH-DIVERSITY SYNTHESIS

Uses a higher temperature (`1.0` here, compared with `0.8` in the general pattern) and omits `inferenceSeed` to maximize variation across generated samples. Useful when your seed dataset is small and you need broad coverage.

```json theme={null}
{
  "modelIdentifier": {
    "modelType": "llm",
    "modelName": "Qwen3-8B",
    "modelId": "model_qwen3_8b",
    "modelVersionId": "v1"
  },
  "inferenceConfig": {
    "inferenceTemperature": 1.0,
    "inferenceMaxNewTokens": 768,
    "requestsPerMinute": 50
  },
  "synthesisConfig": {
    "synthesisType": "general",
    "synthesisConfig": {
      "datasetId": "dataset_seed_002",
      "datasetDistribution": {
        "train": 0.85,
        "validation": 0.1,
        "test": 0.05
      }
    }
  }
}
```

**When to use:** When diversity of phrasing and scenario coverage matters more than strict fidelity to the seed examples.
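
If you already have a general recipe, a high-diversity variant can be derived from it instead of written by hand. The following is a minimal sketch, assuming the recipe is held as a plain Python dictionary with the field names shown on this page. The helper `to_high_diversity` is hypothetical and is not part of any Oumi API.

```python
import copy
import json

# Inference settings copied from the general synthesis pattern on this page.
BASE_RECIPE = {
    "inferenceConfig": {
        "inferenceTemperature": 0.8,
        "inferenceMaxNewTokens": 512,
        "inferenceSeed": 42,
        "requestsPerMinute": 60,
    },
}


def to_high_diversity(recipe, temperature=1.0):
    """Return a copy of `recipe` tuned for variation (hypothetical helper)."""
    variant = copy.deepcopy(recipe)  # leave the original recipe untouched
    inference = variant["inferenceConfig"]
    inference["inferenceTemperature"] = temperature
    # Dropping the fixed seed mirrors the high-diversity example above.
    inference.pop("inferenceSeed", None)
    return variant


diverse = to_high_diversity(BASE_RECIPE)
print(json.dumps(diverse["inferenceConfig"], indent=2))
```

Working on a deep copy keeps the base recipe reusable, so the reproducible and high-diversity variants can coexist.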

***

### FAILURE MODE TARGETED SYNTHESIS

Generates targeted training data from failure modes identified during an evaluation run. Because the synthesized examples address specific, known weaknesses, each training iteration targets a concrete gap instead of adding more general data.

<Note>To use this pattern, first run an evaluation with `generateFailureModes: true`. The resulting failure mode dataset ID can then be used as the input here.</Note>

```json theme={null}
{
  "modelIdentifier": {
    "modelType": "llm",
    "modelName": "Llama-3.1-8B-Instruct",
    "modelId": "model_llama_8b",
    "modelVersionId": "v1"
  },
  "inferenceConfig": {
    "inferenceTemperature": 0.7,
    "inferenceMaxNewTokens": 512,
    "inferenceSeed": 99,
    "requestsPerMinute": 60
  },
  "synthesisConfig": {
    "synthesisType": "general",
    "synthesisConfig": {
      "datasetId": "dataset_failure_modes_eval_run_47",
      "datasetDistribution": {
        "train": 0.9,
        "validation": 0.1,
        "test": 0.0
      }
    }
  }
}
```

**When to use:** Targeted improvement iterations after evaluation reveals specific, recurring failure patterns. This closes the `evaluate → synthesize → train` loop.
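
When iterating through this loop, only the seed dataset ID and the split fractions usually change between runs. The sketch below shows one way to retarget an existing recipe at a new failure mode dataset. It assumes the recipe is a plain Python dictionary using the field names from this page; `retarget_recipe` is a hypothetical helper, not an Oumi function. Note that `datasetId` sits under the nested `synthesisConfig.synthesisConfig` object.

```python
import copy
import json

# Template based on the recipes on this page; values are illustrative.
RECIPE_TEMPLATE = {
    "modelIdentifier": {
        "modelType": "llm",
        "modelName": "Llama-3.1-8B-Instruct",
        "modelId": "model_llama_8b",
        "modelVersionId": "v1",
    },
    "inferenceConfig": {
        "inferenceTemperature": 0.7,
        "inferenceMaxNewTokens": 512,
        "inferenceSeed": 99,
        "requestsPerMinute": 60,
    },
    "synthesisConfig": {
        "synthesisType": "general",
        "synthesisConfig": {
            "datasetId": "dataset_seed_001",
            "datasetDistribution": {"train": 0.8, "validation": 0.1, "test": 0.1},
        },
    },
}


def retarget_recipe(template, dataset_id, distribution):
    """Copy `template` and point it at a different input dataset."""
    recipe = copy.deepcopy(template)
    # The dataset settings live one level down, inside the inner synthesisConfig.
    inner = recipe["synthesisConfig"]["synthesisConfig"]
    inner["datasetId"] = dataset_id
    inner["datasetDistribution"] = dict(distribution)
    return recipe


failure_recipe = retarget_recipe(
    RECIPE_TEMPLATE,
    dataset_id="dataset_failure_modes_eval_run_47",
    distribution={"train": 0.9, "validation": 0.1, "test": 0.0},
)
print(json.dumps(failure_recipe["synthesisConfig"], indent=2))
```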

***

## TIPS

* Use a stronger model for synthesis. A larger or more capable model produces higher-quality synthetic examples, which provide a better training signal for a smaller student model.
* Keep `inferenceSeed` fixed for reproducible synthesis runs; remove it (or randomize) when you want maximum diversity.
* Adjust the `train`/`validation`/`test` splits based on your needs. If you already have a separate validation dataset, you can set `validation` to `0.0` and allocate that share to training.
* Rate-limit thoughtfully. A lower `requestsPerMinute` helps avoid API throttle limits when synthesizing against an external model provider.
* Validate synthesized data in the [Data explorer](/guides/datasets/exploring) before using it for training. Inspect for formatting issues, degenerate outputs, or off-topic examples.
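
When adjusting splits, a quick local check can catch typos before a job is submitted. The sketch below assumes that the fractions in `datasetDistribution` are expected to sum to 1.0, as they do in every example on this page; the schema reference is the authority on the actual constraint. `check_distribution` is a hypothetical helper.

```python
import math


def check_distribution(distribution, tolerance=1e-9):
    """Return a list of problems found in a datasetDistribution mapping.

    Assumption: each fraction lies in [0, 1] and all fractions sum to 1.0.
    """
    problems = []
    for split, fraction in distribution.items():
        if not 0.0 <= fraction <= 1.0:
            problems.append(f"{split}: {fraction} is outside [0, 1]")
    total = sum(distribution.values())
    # Compare with a tolerance because fractions such as 0.85 + 0.1 + 0.05
    # are not exactly representable in floating point.
    if not math.isclose(total, 1.0, abs_tol=tolerance):
        problems.append(f"fractions sum to {total}, expected 1.0")
    return problems


# Splits from the failure mode pattern: no problems are reported.
print(check_distribution({"train": 0.9, "validation": 0.1, "test": 0.0}))

# A split that forgot to reassign the validation share.
print(check_distribution({"train": 0.8, "validation": 0.0, "test": 0.1}))
```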
