NEXT STEPS

TROUBLESHOOTING & TIPS

As you work with data synthesis, you may occasionally want to fine-tune configurations, improve performance, or refine output quality. The following tips capture best practices learned from real-world usage, helping you diagnose issues quickly, optimize generation, and ensure your synthetic data is reliable and ready for training.

Start small: Begin with a small number of samples to test your configuration before scaling up.
Use examples: Provide representative, high-quality examples to improve generation quality.
Postprocess outputs: Use postprocessing to clean and format generated text. Oumi’s dataset analysis can also help you automatically analyze datasets and remove low-quality samples.
Validate results: Review generated samples before using them for training.
Version control: Keep your synthesis configs in version control.
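
The "Validate results" tip can be partly automated before a manual review. The following is a minimal sketch, not part of Oumi's API: it assumes the generated samples are stored one JSON object per line, and the field name `response` and the `min_chars` threshold are hypothetical choices you would adapt to your own output format.

```python
import json


def review_samples(lines, text_key="response", min_chars=20):
    """Split JSONL lines into kept samples and (line_number, reason) rejects.

    `text_key` and `min_chars` are illustrative assumptions, not Oumi settings.
    """
    kept, rejected = [], []
    seen_texts = set()
    for line_no, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines
        try:
            sample = json.loads(raw)
        except json.JSONDecodeError:
            rejected.append((line_no, "invalid JSON"))
            continue
        if not isinstance(sample, dict):
            rejected.append((line_no, "not a JSON object"))
            continue
        text = str(sample.get(text_key) or "").strip()
        if len(text) < min_chars:
            rejected.append((line_no, "too short or empty"))
        elif text in seen_texts:
            rejected.append((line_no, "duplicate"))
        else:
            seen_texts.add(text)
            kept.append(sample)
    return kept, rejected


if __name__ == "__main__":
    demo = [
        '{"response": "A sufficiently long generated answer."}',
        '{"response": ""}',
        '{"response": "A sufficiently long generated answer."}',
    ]
    kept, rejected = review_samples(demo)
    print(f"kept {len(kept)} sample(s)")
    for line_no, reason in rejected:
        print(f"line {line_no}: {reason}")
```

Running a check like this on a small first batch fits the "Start small" tip: problems in the configuration tend to show up as empty or duplicated outputs long before a full-size run.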

COMMON ISSUES

Even with well-defined configurations, you may occasionally encounter issues during data synthesis. The following are some of the most common situations, along with guidance on how to identify and address them so you can continue generating high-quality datasets efficiently.

Empty results: Check that your instruction messages are well-formed and that you have valid API access.
Slow generation: Increase num_workers or lower politeness_policy to improve throughput.
Out of memory: Use a smaller model or reduce max_new_tokens in the generation config.
Validation errors: Ensure all attribute IDs are unique and required fields are not empty.
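
The two checks named under "Validation errors" can be run on a config before starting synthesis. This is a hedged sketch under stated assumptions: it treats the attributes as a plain list of dictionaries, and the field names `id` and `name` are hypothetical stand-ins for whatever your synthesis config actually requires. It does not use or mirror Oumi's own validation logic.

```python
def find_config_problems(attributes, required_fields=("id", "name")):
    """Return human-readable problems found in a list of attribute dicts.

    The schema here (dicts with "id" and "name") is an assumption made
    for illustration only.
    """
    problems = []
    seen_ids = set()
    for index, attr in enumerate(attributes):
        # Required fields must be present and non-empty.
        for field in required_fields:
            value = attr.get(field)
            if value is None or str(value).strip() == "":
                problems.append(f"attribute {index}: missing or empty '{field}'")
        # IDs must be unique across all attributes.
        attr_id = attr.get("id")
        if attr_id:
            if attr_id in seen_ids:
                problems.append(f"attribute {index}: duplicate id '{attr_id}'")
            else:
                seen_ids.add(attr_id)
    return problems


if __name__ == "__main__":
    demo_attributes = [
        {"id": "topic", "name": "Topic"},
        {"id": "topic", "name": "Tone"},
        {"id": "style", "name": ""},
    ]
    for problem in find_config_problems(demo_attributes):
        print(problem)
```

Collecting every problem in one pass, rather than stopping at the first, makes it easier to fix a config in a single edit.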

You can also automatically detect and remove low-quality data to maintain dataset integrity using Oumi’s built-in analysis and curation tools.
