This feature is currently in beta.
Benefits of smaller models
Well-trained smaller models can often outperform larger general-purpose models within their specific domains. They also offer practical advantages: they’re easier to deploy locally (thereby enhancing privacy and security), more cost-efficient to run, and simpler to update or retrain as new data becomes available. To train these smaller student models, two main post-training strategies are commonly used:
- Off-policy training – The student learns from fixed targets produced by an external source. In distillation, this typically means training on static datasets containing precomputed teacher logits or completed responses.
- On-policy training – The student generates responses during training, and the teacher provides supervision (e.g., logits) on those samples.
Why use on-policy distillation?
On-policy distillation often performs better than purely off-policy approaches for two key reasons. First, training on student-generated sequences improves context alignment: the student learns from the same types of contexts and mistakes it will encounter at inference time, rather than only from ideal teacher-generated outputs. Second, it helps address a common issue in supervised fine-tuning (SFT) known as test-time distribution mismatch. In traditional SFT:
- Training uses teacher-generated responses
- The student learns from ideal output trajectories
- At inference time, the student generates its own tokens, drifting into contexts it never saw during training
How it works
On-policy distillation compresses large language models by transferring knowledge from a high-performing teacher model to a smaller student model. During training:
- A prompt is provided to the student model.
- The student generates a response.
- The teacher evaluates that response and provides logits or targets.
- The student updates its parameters to better match the teacher’s probability distribution.
Training can draw on two sources of responses:
- Teacher-generated responses (similar to SFT)
- Student-generated responses sampled during training
This mixture allows the student to:
- Learn from the teacher’s knowledge
- Adapt to sequences it generates itself
- Improve robustness during inference
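The loop above can be sketched in a few lines. This is an illustrative toy, not the platform’s implementation: random logit tables stand in for the student and teacher networks, and the loss shown is a reverse KL between their next-token distributions, evaluated on the student’s own samples.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical stand-ins for real networks: row = current token,
# column = logit for the next token.
student_logits = rng.normal(size=(VOCAB, VOCAB))
teacher_logits = rng.normal(size=(VOCAB, VOCAB))

def distill_loss(prompt_token, num_tokens=5):
    """One on-policy pass: the student samples its own response,
    and the loss pulls its distribution toward the teacher's."""
    token, loss = prompt_token, 0.0
    for _ in range(num_tokens):
        p_student = softmax(student_logits[token])
        p_teacher = softmax(teacher_logits[token])
        # Reverse KL(student || teacher) on the student's own context
        loss += np.sum(p_student * (np.log(p_student) - np.log(p_teacher)))
        token = rng.choice(VOCAB, p=p_student)  # student generates next token
    return loss / num_tokens
```

In a real run, the gradient of this loss with respect to the student parameters drives the update in the last step; the teacher stays frozen.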
Workflow
To train a model using on-policy distillation:
- Navigate to Create a Train Model.
- Select a base model.
- Enable On-Policy Distillation.
- Choose a teacher model.
- Configure distillation settings (or use default values).
- Launch the training run.
Configuring via the Builder INPUTS tab
The following configuration options are available if you are configuring your training job from the INPUTS tab of the Builder.
Please see Model Recipe Configs for configuring via the Glossary tab in the Builder.
Model selection
Base model
The model being trained (i.e., the student model).
Teacher model
The reference model used for knowledge distillation. Available teacher models depend on the selected base model.
Data
Training dataset
The dataset that contains examples used to train the model.
Validation dataset
The dataset used to evaluate model performance during training.
Training settings (optional)
The following settings control the optimization process during training.
Epochs
The number of complete passes over the training dataset. Keep in mind that more epochs:
- Increase how much the model learns from the training data
- May increase the risk of overfitting
Learning rate
The initial learning rate after warmup; it controls how large each parameter update is during training. Smaller learning rates yield:
- More stable training
- Slower convergence
Larger learning rates yield:
- Faster training
- Higher risk of instability
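For intuition, “initial learning rate after warmup” typically refers to a schedule like the sketch below. The linear warmup shape and the step counts are assumptions for illustration; the backend’s actual schedule is not specified here.

```python
def learning_rate_at(step, base_lr=1e-4, warmup_steps=100):
    """Assumed linear warmup: ramp from near zero up to base_lr
    over warmup_steps, then hold base_lr (the configured value)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```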
Distillation settings (optional)
The following settings control how student responses are generated and how distillation loss is applied. If no values are specified, default backend values are used.
Temperature
Controls the sampling variability of the student model during training.
- Higher values produce more diverse outputs.
- Default behavior encourages variability so the student model encounters a wider range of sequences during training.
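The effect of temperature can be seen in a small sketch (illustrative only; the function name and values here are not part of the product):

```python
import numpy as np

def temperature_probs(logits, temperature):
    """Scale logits by 1/temperature before the softmax.
    Higher temperature flattens the distribution, so the student's
    sampled responses become more diverse."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Comparing the entropy of the resulting distribution at a low versus a high temperature shows the flattening directly.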
Lambda
This value determines the probability of training on student-generated responses rather than teacher responses. It is the interpolation coefficient used when combining training losses, and controls the balance between:
- supervised learning loss
- distillation loss
| Lambda Value | Behavior |
|---|---|
| 0 | Uses only teacher responses (equivalent to SFT) |
| 0.5 | 50% student responses, 50% teacher responses |
| 1 | Fully on-policy training using only student responses |
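One way to read the table: lambda acts like a per-example coin flip between on-policy and off-policy data. The sketch below illustrates that reading; the actual backend may apply the mixture per batch or inside the loss instead.

```python
import numpy as np

def choose_response_source(lam, rng):
    """With probability lam, train on a student-generated (on-policy)
    response; otherwise train on the teacher/dataset response."""
    return "student" if rng.random() < lam else "teacher"
```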
Beta
The Jensen–Shannon Divergence (JSD) interpolation coefficient determines how strongly the model aligns with the teacher distribution. This value controls the interpolation between forward KL and reverse KL loss during distillation.
| Beta Value | Loss Behavior |
|---|---|
| 0 | Forward KL |
| 1 | Reverse KL |
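The interpolation can be written as a generalized JSD over a beta-weighted mixture distribution, with the endpoints special-cased to their limiting KL divergences. This mirrors common generalized-JSD formulations and is a sketch, not the exact backend loss.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for dense probability vectors with no zero entries."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def generalized_jsd(p_teacher, p_student, beta):
    """beta = 0 -> forward KL(teacher || student);
    beta = 1 -> reverse KL(student || teacher);
    in between, KLs toward the beta-weighted mixture."""
    if beta == 0:
        return kl(p_teacher, p_student)   # forward KL
    if beta == 1:
        return kl(p_student, p_teacher)   # reverse KL
    m = beta * np.asarray(p_teacher, float) + (1 - beta) * np.asarray(p_student, float)
    return beta * kl(p_teacher, m) + (1 - beta) * kl(p_student, m)
```

At `beta = 0.5` this reduces to the symmetric Jensen–Shannon divergence.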
Max completion length
Specifies the maximum response length generated by the student model during training (maximum number of tokens generated for each completion). This value limits:
- generation length
- training compute usage
Consider lowering this value if:
- Your task expects short responses
- You want to reduce training compute
Parameter-efficient settings (optional)
Parameter-efficient fine-tuning (PEFT) enables efficient fine-tuning by training a small subset of parameters instead of the full model. This significantly reduces:
- GPU memory usage
- training time
- storage requirements
PEFT method
Select the parameter-efficient fine-tuning method. LoRA is currently the default PEFT method, with additional methods under development.
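As a rough sketch of why LoRA is parameter-efficient: instead of updating a full weight matrix, it trains two small low-rank factors added to the frozen base weight. The dimensions, scaling, and names below are illustrative, not the platform’s configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8              # r << d is the low-rank bottleneck

W = rng.normal(size=(d_out, d_in))      # frozen base weight (not trained)
A = 0.01 * rng.normal(size=(r, d_in))   # trained low-rank factor
B = np.zeros((d_out, r))                # trained; zero init => start exactly at W

def lora_forward(x, alpha=16):
    """y = W x + (alpha / r) * B (A x); only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = d_out * d_in              # parameters in W
lora_params = r * (d_in + d_out)        # trainable parameters in A and B
```

Here only a quarter as many parameters are trained as in the full matrix, which is where the memory, time, and storage savings come from.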
Default values
If distillation settings are not specified, the system uses default backend values. Recommended workflow:
- Start with the default configuration.
- Run an initial training experiment.
- Adjust parameters such as `lambda` and `beta` based on results:
  - Start with a `lambda` of `0.5`. If performance is poor, increase the on-policy fraction; if that doesn’t help, try lowering it. Experiment with `lambda = [0, 0.5, 1.0]`.
  - Begin with a `beta` of `1`; if training is unstable, reduce it to `0`.
Parameter tuning guidelines
Because on-policy distillation introduces additional hyperparameters, experimentation may be required.
Start with defaults
Default settings provide a strong baseline configuration.
Adjust lambda first
Increasing lambda increases the proportion of student-generated training data. Higher values often improve robustness but may require additional experimentation.
Experiment with beta
Beta can significantly affect training dynamics depending on the task and dataset. There is currently no universally optimal value, so tuning may be required.
When to use on-policy distillation
On-policy distillation is useful when:
- Fine-tuning models that generate long sequences
- Training models that must remain robust to their own outputs
- Improving performance beyond traditional SFT-only training