Overview

A data synthesis recipe captures the configuration for generating a synthetic dataset, including which model to use, inference settings, and how to structure and split the output. It is stored as a reusable JSON template. Running a synthesis recipe produces a new dataset that appears in your Datasets page once the job completes. For instructions on accessing and running synthesis recipes from the UI, see Data Synthesis Recipes. For the full schema reference, see Dataset Recipe Schema.

Common recipe patterns

General dataset synthesis

Generates a new dataset by running an existing seed dataset through a synthesis model. This is the most common starting configuration.
{
  "modelIdentifier": {
    "modelType": "llm",
    "modelName": "Llama-3.1-8B-Instruct",
    "modelId": "model_llama_8b",
    "modelVersionId": "v1"
  },
  "inferenceConfig": {
    "inferenceTemperature": 0.8,
    "inferenceMaxNewTokens": 512,
    "inferenceSeed": 42,
    "requestsPerMinute": 60
  },
  "synthesisConfig": {
    "synthesisType": "general",
    "synthesisConfig": {
      "datasetId": "dataset_seed_001",
      "datasetDistribution": {
        "train": 0.8,
        "validation": 0.1,
        "test": 0.1
      }
    }
  }
}
When to use: Expanding a small seed dataset into a larger training set, or generating diverse variations of existing examples.
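The platform applies the datasetDistribution fractions server-side when it writes the output dataset. As an illustration of how those fractions partition the generated samples (a sketch, not the platform's implementation), the following shuffles and slices a list of examples:

```python
import random

def split_dataset(examples, distribution, seed=42):
    """Partition synthesized examples into train/validation/test
    according to the fractions in datasetDistribution."""
    fractions = [distribution["train"], distribution["validation"], distribution["test"]]
    assert abs(sum(fractions) - 1.0) < 1e-6, "split fractions must sum to 1"
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for reproducibility
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }
```

With 100 generated examples and the 0.8/0.1/0.1 split above, this yields 80 training, 10 validation, and 10 test examples.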

High-diversity synthesis

Uses higher temperature to maximize variation across generated samples. Useful when your seed dataset is small and you need broad coverage.
{
  "modelIdentifier": {
    "modelType": "llm",
    "modelName": "Qwen3-8B",
    "modelId": "model_qwen3_8b",
    "modelVersionId": "v1"
  },
  "inferenceConfig": {
    "inferenceTemperature": 1.0,
    "inferenceMaxNewTokens": 768,
    "requestsPerMinute": 50
  },
  "synthesisConfig": {
    "synthesisType": "general",
    "synthesisConfig": {
      "datasetId": "dataset_seed_002",
      "datasetDistribution": {
        "train": 0.85,
        "validation": 0.1,
        "test": 0.05
      }
    }
  }
}
When to use: When diversity of phrasing and scenario coverage matters more than strict fidelity to the seed examples.
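The inferenceTemperature setting rescales the model's token logits before sampling. To see why a higher value produces more varied outputs, here is a small standalone illustration of temperature-scaled softmax (not the platform's sampler):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Higher temperature flattens the distribution, so lower-probability
    tokens are sampled more often, yielding more varied outputs."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]
low = softmax_with_temperature(logits, 0.5)   # sharp: top token dominates
high = softmax_with_temperature(logits, 1.0)  # flatter: more diversity
```

At temperature 0.5 the top token takes roughly 98% of the probability mass; at 1.0 it drops to roughly 84%, leaving meaningfully more mass on alternative continuations.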

Failure-mode-targeted synthesis

Generates targeted training data from failure modes identified during an evaluation run. By synthesizing examples that specifically address identified weaknesses, each training iteration becomes more intentional.
To use this pattern, first run an evaluation with generateFailureModes: true. The resulting failure mode dataset ID can then be used as the input here.
{
  "modelIdentifier": {
    "modelType": "llm",
    "modelName": "Llama-3.1-8B-Instruct",
    "modelId": "model_llama_8b",
    "modelVersionId": "v1"
  },
  "inferenceConfig": {
    "inferenceTemperature": 0.7,
    "inferenceMaxNewTokens": 512,
    "inferenceSeed": 99,
    "requestsPerMinute": 60
  },
  "synthesisConfig": {
    "synthesisType": "general",
    "synthesisConfig": {
      "datasetId": "dataset_failure_modes_eval_run_47",
      "datasetDistribution": {
        "train": 0.9,
        "validation": 0.1,
        "test": 0.0
      }
    }
  }
}
When to use: Targeted improvement iterations after evaluation reveals specific, recurring failure patterns. This closes the evaluate → synthesize → train loop.
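If you script recipe creation for each loop iteration, the failure-mode dataset ID from an evaluation run can be dropped into the same template programmatically. A minimal sketch (the helper function name is hypothetical; it simply mirrors the JSON above as a Python dict):

```python
def failure_mode_recipe(failure_dataset_id, model_id, model_name, seed=99):
    """Build a failure-mode-targeted synthesis recipe, mirroring the
    JSON template above. Test is set to 0.0 because the evaluation run
    itself supplies the held-out signal."""
    return {
        "modelIdentifier": {
            "modelType": "llm",
            "modelName": model_name,
            "modelId": model_id,
            "modelVersionId": "v1",
        },
        "inferenceConfig": {
            "inferenceTemperature": 0.7,
            "inferenceMaxNewTokens": 512,
            "inferenceSeed": seed,
            "requestsPerMinute": 60,
        },
        "synthesisConfig": {
            "synthesisType": "general",
            "synthesisConfig": {
                "datasetId": failure_dataset_id,
                "datasetDistribution": {"train": 0.9, "validation": 0.1, "test": 0.0},
            },
        },
    }
```

Each new evaluation run then only needs its failure-mode dataset ID swapped in to produce the next iteration's recipe.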

Tips

  • Use a stronger model for synthesis: a larger or more capable model produces higher-quality synthetic examples that are better training signal for a smaller student model.
  • Keep inferenceSeed fixed for reproducible synthesis runs; remove it (or randomize) when you want maximum diversity.
  • Adjust train/validation/test splits based on your needs. If you have a separate validation dataset already, you can set validation: 0.0 and allocate everything to training.
  • Rate-limit thoughtfully: lower requestsPerMinute avoids hitting API throttle limits when synthesizing against an external model provider.
  • Validate synthesized data in the Data Explorer before using it for training. Inspect for formatting issues, degenerate outputs, or off-topic examples.
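Beyond inspecting the output, it can help to sanity-check the recipe itself before submitting a job. A sketch of such a pre-flight check, assuming only the structure shown in the templates above (the platform may enforce additional schema rules; see the Dataset Recipe Schema reference):

```python
REQUIRED_TOP_LEVEL = ("modelIdentifier", "inferenceConfig", "synthesisConfig")

def validate_recipe(recipe):
    """Catch common recipe mistakes before submitting a synthesis job.
    Returns a list of error strings; an empty list means no issues found."""
    errors = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in recipe:
            errors.append(f"missing top-level key: {key}")
    dist = (recipe.get("synthesisConfig", {})
                  .get("synthesisConfig", {})
                  .get("datasetDistribution"))
    if dist is not None:
        total = sum(dist.values())
        if abs(total - 1.0) > 1e-6:
            errors.append(f"datasetDistribution sums to {total}, expected 1.0")
    return errors
```

Running this on a recipe with a 0.8/0.15/0.1 split, for example, would flag that the fractions sum to 1.05 rather than 1.0.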