Overview
A data synthesis recipe captures the configuration for generating a synthetic dataset, including which model to use, inference settings, and how to structure and split the output. It is stored as a reusable JSON template. Running a synthesis recipe produces a new dataset that appears in your Datasets page once the job completes. For instructions on accessing and running synthesis recipes from the UI, see Data Synthesis Recipes. For the full schema reference, see Dataset Recipe Schema.
Common recipe patterns
General dataset synthesis
Generates a new dataset by running an existing seed dataset through a synthesis model. This is the most common starting configuration.
High-diversity synthesis
Uses a higher temperature to maximize variation across generated samples. Useful when your seed dataset is small and you need broad coverage.
Failure mode targeted synthesis
Generates targeted training data from failure modes identified during an evaluation run. Synthesizing examples that specifically address those weaknesses makes each training iteration more intentional.
To use this pattern, first run an evaluation with generateFailureModes: true. The resulting failure mode dataset ID can then be used as the input here.
Tips
- Use a stronger model for synthesis: a larger or more capable model produces higher-quality synthetic examples, which provide a better training signal for a smaller student model.
- Keep inferenceSeed fixed for reproducible synthesis runs; remove it (or randomize it) when you want maximum diversity.
- Adjust train/validation/test splits based on your needs. If you already have a separate validation dataset, you can set validation: 0.0 and allocate everything to training.
- Rate-limit thoughtfully: a lower requestsPerMinute avoids hitting API throttle limits when synthesizing against an external model provider.
- Validate synthesized data in the Data Explorer before using it for training. Inspect for formatting issues, degenerate outputs, or off-topic examples.
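The tips on seeds, splits, and rate limits can be sketched as a few small Python helpers that sanity-check values before they go into a recipe. This is an illustrative sketch, not the product's API: the helper names (check_splits, fold_validation_into_train, choose_seed, min_request_interval) and the "splits" key are hypothetical, and only inferenceSeed, requestsPerMinute, and the train/validation/test names come from the text above. Where these fields actually sit in the recipe JSON is an assumption; the Dataset Recipe Schema is the authority. The sketch also reads "allocate everything to training" as moving the validation share into training, which is one possible interpretation.

```python
import json
import math
import random


def check_splits(splits):
    """Return True if all split fractions are non-negative and sum to 1.0."""
    values = list(splits.values())
    return all(v >= 0.0 for v in values) and math.isclose(sum(values), 1.0)


def fold_validation_into_train(splits):
    """Set validation to 0.0 and add its former share to train."""
    updated = dict(splits)
    updated["train"] = round(updated["train"] + updated["validation"], 6)
    updated["validation"] = 0.0
    return updated


def choose_seed(reproducible, fixed_seed=42):
    """Fixed seed for repeatable runs; a fresh random seed for more diversity."""
    return fixed_seed if reproducible else random.randrange(2**31)


def min_request_interval(requests_per_minute):
    """Average seconds between requests implied by a requests-per-minute cap."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return 60.0 / requests_per_minute


if __name__ == "__main__":
    splits = fold_validation_into_train(
        {"train": 0.8, "validation": 0.1, "test": 0.1}
    )
    assert check_splits(splits)

    # Hypothetical fragment: field placement is assumed, not taken from the schema.
    recipe_fragment = {
        "inferenceSeed": choose_seed(reproducible=True),
        "requestsPerMinute": 30,
        "splits": splits,
    }
    print(json.dumps(recipe_fragment, indent=2))
    print("seconds between requests:", min_request_interval(30))
```

Checking that the fractions sum to 1.0 before submitting a job catches a common mistake when one split is zeroed out without reassigning its share.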