Troubleshooting & tips
As you work with data synthesis, you may occasionally want to fine-tune configurations, improve performance, or refine output quality. The following tips capture best practices learned from real-world usage, helping you diagnose issues quickly, optimize generation, and ensure your synthetic data is reliable and ready for training.
- Start small: Begin with a small number of samples to test your configuration
- Use examples: Provide good examples for better generation quality
- Postprocess outputs: Use postprocessing to clean and format generated text. Oumi’s dataset analysis can automatically analyze datasets and help remove low-quality samples
- Validate results: Review generated samples before using for training
- Version control: Keep your synthesis configs in version control
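As an illustration of the "Validate results" tip, here is a minimal sketch of a pre-training sanity check. It assumes the generated samples are stored as JSON Lines and that the generated text lives in a field called "response"; both the file layout and the field name are assumptions for this example, not part of Oumi's API, so adjust them to match your own output.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-blank line from a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def filter_samples(samples, text_field="response", min_chars=20):
    """Split samples into (kept, rejected) using simple quality checks.

    `text_field` and `min_chars` are hypothetical defaults for this sketch.
    Each rejected entry is a (sample, reason) pair for manual review.
    """
    kept, rejected = [], []
    seen = set()
    for sample in samples:
        # Missing or null fields are treated the same as empty text.
        text = str(sample.get(text_field) or "").strip()
        if len(text) < min_chars:
            rejected.append((sample, "too short or empty"))
        elif text in seen:
            rejected.append((sample, "duplicate"))
        else:
            seen.add(text)
            kept.append(sample)
    return kept, rejected
```

Running this on a small batch first (see "Start small") and reading through the rejected pairs is usually enough to tell whether a configuration needs adjusting before you scale up.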
Common issues
Even with well-defined configurations, you may occasionally encounter issues during data synthesis. The following are some of the most common situations, along with guidance on how to identify and address them so you can continue generating high-quality datasets efficiently.
- Empty results: Check that your instruction messages are well-formed and you have proper API access.
- Slow generation: Increase num_workers or lower politeness_policy to improve throughput.
- Out of memory: Use a smaller model or reduce max_new_tokens in generation config.
- Validation errors: Ensure all attribute IDs are unique and required fields are not empty.
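The validation-error check above can be approximated before you launch a run. The sketch below assumes attributes have already been loaded into a list of dictionaries with "id" and "name" keys; that structure and the required-field list are hypothetical stand-ins for illustration, not Oumi's actual config schema.

```python
from collections import Counter


def find_attribute_problems(attributes, required_fields=("id", "name")):
    """Return human-readable problems found in a list of attribute dicts.

    `required_fields` is an assumed list for this sketch; replace it with
    the fields your configuration actually requires.
    """
    problems = []

    # Count non-empty IDs and report any that appear more than once.
    ids = [attr.get("id") for attr in attributes]
    for attr_id, count in Counter(i for i in ids if i).items():
        if count > 1:
            problems.append(f"duplicate id: {attr_id!r} appears {count} times")

    # Report required fields that are missing, None, or blank strings.
    for index, attr in enumerate(attributes):
        for field in required_fields:
            value = attr.get(field)
            if value is None or (isinstance(value, str) and not value.strip()):
                problems.append(
                    f"attribute #{index}: required field {field!r} is empty"
                )
    return problems
```

An empty return value means none of these particular checks failed; it does not guarantee that the configuration passes the tool's own validation.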