On-policy distillation trains a smaller student model using a stronger teacher model. During training, the student learns both from teacher-generated examples and from its own generated outputs, allowing it to better match the teacher’s behavior while remaining efficient to deploy.
This feature is currently in beta.

Benefits of smaller models

Well-trained smaller models can often outperform larger general-purpose models within their specific domains. They also offer practical advantages: they’re easier to deploy locally (thereby enhancing privacy and security), more cost-efficient to run, and simpler to update or retrain as new data becomes available. To train these smaller student models, two main post-training strategies are commonly used:
  • Off-policy training – The student learns from fixed targets produced by an external source. In distillation, this typically means training on static datasets containing precomputed teacher logits or completed responses.
  • On-policy training – The student generates responses during training, and the teacher provides supervision (e.g., logits) on those samples.

Why use on-policy distillation?

On-policy distillation often performs better than purely off-policy approaches for two key reasons. First, training on student-generated sequences improves context alignment. The student learns from the same types of contexts and mistakes it will encounter at inference time, rather than only from ideal teacher-generated outputs. This approach also helps address a common issue in supervised fine-tuning (SFT) known as test-time distribution mismatch. In traditional SFT:
  • Training uses teacher-generated responses
  • The student learns from ideal output trajectories
  • At inference time, the student generates its own tokens
Because the student generates tokens autoregressively at inference time, it may encounter contexts that never appeared during training. This mismatch can degrade performance. On-policy distillation reduces this gap by training the model on sequences generated by the student itself, while still using the teacher model to guide learning.

Second, as the student model improves, its generated outputs become better training data. This creates a positive feedback loop where higher-quality student outputs lead to more useful training signals.

How it works

On-policy distillation compresses large language models by transferring knowledge from a high-performing teacher model to a smaller student model. During training:
  1. A prompt is provided to the student model.
  2. The student generates a response.
  3. The teacher evaluates that response and provides logits or targets.
  4. The student updates its parameters to better match the teacher’s probability distribution.
Training may use a mix of:
  • Teacher-generated responses (similar to SFT)
  • Student-generated responses sampled during training
When the student generates a response, the teacher provides logits used in the distillation loss, guiding the student toward the teacher’s behavior. This training setup allows the student to:
  • Learn from the teacher’s knowledge
  • Adapt to sequences it generates itself
  • Improve robustness during inference
Over time, the student model can approximate the teacher’s task performance while achieving significantly lower model size, latency, and inference cost.
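The training step above can be sketched numerically. The following is a minimal illustration, not the platform's implementation: it assumes both models have scored a student-generated sequence and uses a simple forward-KL loss toward the teacher's logits.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_step_loss(student_logits, teacher_logits):
    """Per-step distillation loss on a student-generated sequence.

    Both arrays have shape (seq_len, vocab_size): the student sampled the
    tokens, then both models scored the same positions. The student is
    pushed toward the teacher's distribution via forward KL.
    """
    p_teacher = softmax(teacher_logits)
    log_p_student = np.log(softmax(student_logits))
    kl_per_step = (p_teacher * (np.log(p_teacher) - log_p_student)).sum(axis=-1)
    return kl_per_step.mean()
```

Minimizing this loss drives the student's per-token distribution toward the teacher's on the student's own sampled contexts.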

Workflow

To train a model using on-policy distillation:
  1. Navigate to Create a Train Model.
  2. Select a base model.
  3. Enable On-Policy Distillation.
  4. Choose a teacher model.
  5. Configure distillation settings (or use default values).
  6. Launch the training run.

Configuring via the Builder INPUTS tab

The following configuration options are available when you configure your training job from the INPUTS tab of the Builder. To configure via the Glossary tab instead, see Model Recipe Configs.

Model selection

Base model

The model being trained (i.e., the student model).

Teacher model

The reference model used for knowledge distillation. Available teacher models depend on the selected base model.

Data

Training dataset

The dataset that contains examples used to train the model.

Validation dataset

The dataset used to evaluate model performance during training.

Training settings (optional)

The following settings control the optimization process during training.

Epochs

The number of complete passes over the training dataset. Keep in mind that more epochs:
  • Allow the model to learn more from the training data
  • Increase the risk of overfitting

Learning rate

The initial learning rate after warmup; this controls how large each parameter update is during training. Smaller learning rates:
  • More stable training
  • Slower convergence
Larger learning rates:
  • Faster training
  • Higher risk of instability
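As a rough sketch of what "initial learning rate after warmup" means, here is a typical linear-warmup schedule; the platform's exact schedule is not documented, so this is an assumption for illustration only.

```python
def lr_at_step(step, warmup_steps, peak_lr):
    """Linear warmup from ~0 to peak_lr, then constant afterwards.

    `peak_lr` plays the role of the "initial learning rate after warmup"
    described above; the schedule shape itself is an illustrative assumption.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr
```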

Distillation settings (optional)

The following settings control how student responses are generated and how distillation loss is applied. If no values are specified, default backend values are used.

Temperature

Controls the sampling variability of the student model during training.
  • Higher values produce more diverse outputs.
  • Default behavior encourages variability so the student model encounters a wider range of sequences during training.
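A minimal sketch of temperature-scaled sampling, showing why higher values yield more diverse outputs (the logits are divided by the temperature before the softmax):

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample one token id from temperature-scaled logits.

    temperature > 1 flattens the distribution (more diverse samples);
    temperature < 1 sharpens it toward the argmax.
    """
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))
```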

Lambda

This value determines the probability of training on student-generated responses rather than teacher responses. It also serves as the interpolation coefficient when combining training losses, controlling the balance between:
  • supervised learning loss
  • distillation loss
  Lambda value | Behavior
  0   | Uses only teacher responses (equivalent to SFT)
  0.5 | 50% student responses, 50% teacher responses
  1   | Fully on-policy training using only student responses
Higher lambda values increase exposure to student-generated sequences, which can improve robustness.
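A minimal sketch of lambda acting as a per-example sampling probability, assuming the interpretation described above (student responses are chosen with probability lambda):

```python
import random

def pick_response_source(lam, rng=random):
    """With probability `lam`, train this example on a student-generated
    response; otherwise use a teacher response (SFT-style).

    lam=0 reduces to pure SFT; lam=1 is fully on-policy training.
    """
    return "student" if rng.random() < lam else "teacher"
```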

Beta

The Jensen–Shannon Divergence (JSD) interpolation coefficient determines how strongly the model aligns with the teacher distribution. This value controls the interpolation between forward KL and reverse KL loss during distillation.
  Beta value | Loss behavior
  0 | Forward KL
  1 | Reverse KL
The optimal value depends heavily on the task and model configuration.
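The interpolation can be sketched with the generalized JSD formulation, in which the endpoints recover forward and reverse KL (strictly, as limits, so they are special-cased below). This illustrates the math, not the platform's exact implementation:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for probability vectors with strictly positive entries."""
    return float((p * (np.log(p) - np.log(q))).sum())

def jsd_beta(p_teacher, p_student, beta):
    """Generalized JSD between teacher and student distributions.

    beta=0 recovers forward KL(teacher || student); beta=1 recovers
    reverse KL(student || teacher). beta=0.5 gives the standard
    (symmetric) Jensen-Shannon divergence.
    """
    if beta == 0.0:
        return kl(p_teacher, p_student)  # forward KL
    if beta == 1.0:
        return kl(p_student, p_teacher)  # reverse KL
    m = beta * p_teacher + (1.0 - beta) * p_student
    return beta * kl(p_teacher, m) + (1.0 - beta) * kl(p_student, m)
```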

Max completion length

Specifies the maximum response length generated by the student model during training (maximum number of tokens generated for each completion). This value limits:
  • generation length
  • training compute usage
Reducing this value can be useful if:
  • Your task expects short responses
  • You want to reduce training compute

Parameter-efficient settings (optional)

Parameter-efficient fine-tuning (PEFT) enables efficient fine-tuning by training a small subset of parameters, instead of the full model. This significantly reduces:
  • GPU memory usage
  • training time
  • storage requirements

PEFT method

Select the parameter-efficient fine-tuning method.
LoRA is currently the default PEFT method; additional methods are under development.
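As background on why LoRA is parameter-efficient: it trains only two small low-rank factors while the base weight stays frozen. A minimal numpy sketch (illustrative, not the platform's implementation):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Forward pass of a LoRA-adapted linear layer.

    W (d_out x d_in) is the frozen base weight; only the low-rank factors
    A (r x d_in) and B (d_out x r) are trained. The effective weight is
    W + alpha * B @ A, applied without materializing the full update.
    """
    return x @ W.T + alpha * (x @ A.T) @ B.T
```

With rank r much smaller than the weight dimensions, the trainable parameter count drops from d_out * d_in to r * (d_out + d_in), which is what reduces memory, training time, and storage.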

Default values

If distillation settings are not specified, the system uses default backend values. Recommended workflow:
  1. Start with the default configuration
  2. Run an initial training experiment
  3. Adjust parameters such as lambda and beta based on results:
• Start with a lambda of 0.5. If performance is poor, increase the on-policy fraction.
    • If that doesn’t help, try lowering it. Experiment with lambda = [0, 0.5, 1.0].
    • Begin with a beta of 1; if training is unstable, reduce it to 0.
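The recommended starting point can be summarized as a plain dictionary. The key names here are descriptive placeholders, not the platform's actual configuration fields:

```python
# Hypothetical starting values mirroring the guidance above
starting_config = {
    "lambda": 0.5,  # even mix of student and teacher responses
    "beta": 1.0,    # reverse-KL-leaning loss; reduce toward 0 if unstable
}

# Values to sweep if the initial run underperforms
lambda_sweep = [0.0, 0.5, 1.0]
```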

Parameter tuning guidelines

Because on-policy distillation introduces additional hyperparameters, experimentation may be required.

Start with defaults

Default settings provide a strong baseline configuration.

Adjust lambda first

Increasing lambda increases the proportion of student-generated training data. Higher values often improve robustness but may require additional experimentation.

Experiment with beta

Beta can significantly affect training dynamics depending on the task and dataset. There is currently no universally optimal value, so tuning may be required.

When to use on-policy distillation

On-policy distillation is useful when:
  • Fine-tuning models that generate long sequences
  • Training models that must remain robust to their own outputs
  • Improving performance beyond traditional SFT-only training
On-policy distillation extends supervised fine-tuning by allowing the student model to learn from both teacher responses and its own generated outputs. By reducing test-time distribution mismatch, this approach can produce more robust and higher-performing models.