On-policy distillation trains a smaller student model using a stronger teacher model. During training, the student learns both from teacher-generated examples and from its own generated outputs, allowing it to better match the teacher’s behavior while remaining efficient to deploy.
This feature is currently in beta.

Benefits of smaller models

Well-trained smaller models can often outperform larger general-purpose models within their specific domains. They also offer practical advantages: they’re easier to deploy locally (thereby enhancing privacy and security), more cost-efficient to run, and simpler to update or retrain as new data becomes available. To train these smaller student models, two main post-training strategies are commonly used:
  • Off-policy training – The student learns from fixed targets produced by an external source. In distillation, this typically means training on static datasets containing precomputed teacher logits or completed responses.
  • On-policy training – The student generates responses during training, and the teacher provides supervision (e.g., logits) on those samples.

Why use on-policy distillation?

On-policy distillation often performs better than purely off-policy approaches for two key reasons. First, training on student-generated sequences improves context alignment. The student learns from the same types of contexts and mistakes it will encounter at inference time, rather than only from ideal teacher-generated outputs. This approach also helps address a common issue in supervised fine-tuning (SFT) known as test-time distribution mismatch. In traditional SFT:
  • Training uses teacher-generated responses
  • The student learns from ideal output trajectories
  • At inference time, the student generates its own tokens
Because the student generates tokens autoregressively at inference time, it may encounter contexts that never appeared during training. This mismatch can degrade performance. On-policy distillation reduces this gap by training the model on sequences generated by the student itself, while still using the teacher model to guide learning.

Second, as the student model improves, its generated outputs become better training data. This creates a positive feedback loop where higher-quality student outputs lead to more useful training signals.

How it works

On-policy distillation compresses large language models by transferring knowledge from a high-performing teacher model to a smaller student model. During training:
  1. A prompt is provided to the student model.
  2. The student generates a response.
  3. The teacher evaluates that response and provides logits or targets.
  4. The student updates its parameters to better match the teacher’s probability distribution.
Training may use a mix of:
  • Teacher-generated responses (similar to SFT)
  • Student-generated responses sampled during training
When the student generates a response, the teacher provides logits used in the distillation loss, guiding the student toward the teacher’s behavior. This training setup allows the student to:
  • Learn from the teacher’s knowledge
  • Adapt to sequences it generates itself
  • Improve robustness during inference
Over time, the student model can approximate the teacher’s task performance while achieving significantly lower model size, latency, and inference cost.
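The training step above can be sketched numerically. The following is a minimal illustration, not the platform's implementation: it assumes both models have scored a student-generated sequence and uses a simple forward-KL loss toward the teacher's logits.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_step_loss(student_logits, teacher_logits):
    """Per-step distillation loss on a student-generated sequence.

    Both arrays have shape (seq_len, vocab_size): the student sampled the
    tokens, then both models scored the same positions. The student is
    pushed toward the teacher's distribution via forward KL.
    """
    p_teacher = softmax(teacher_logits)
    log_p_student = np.log(softmax(student_logits))
    kl_per_step = (p_teacher * (np.log(p_teacher) - log_p_student)).sum(axis=-1)
    return kl_per_step.mean()
```

Minimizing this loss drives the student's per-token distribution toward the teacher's on the student's own sampled contexts.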

Workflow

To train a model using on-policy distillation:
  1. Navigate to Create a Train Model.
  2. Select a base model.
  3. Enable On-Policy Distillation.
  4. Choose a teacher model.
  5. Configure distillation settings (or use default values).
  6. Launch the training run.

Configuring via the Builder INPUTS tab

The following configuration options are available when you configure your training job from the INPUTS tab of the Builder. To configure via the Glossary tab instead, see Model Recipe Configs.

Model selection

Base model

The model being trained (i.e., the student model).

Teacher model

The reference model used for knowledge distillation. Available teacher models depend on the selected base model.

Data

Training dataset

The dataset that contains examples used to train the model.

Validation dataset

The dataset used to evaluate model performance during training.

Training settings (optional)

The following settings control the optimization process during training.

Epochs

The number of complete passes over the training dataset. Keep in mind that more epochs:
  • Allow the model to learn more from the training data
  • Increase the risk of overfitting

Learning rate

The initial learning rate after warmup; this controls how large each parameter update is during training. Smaller learning rates:
  • More stable training
  • Slower convergence
Larger learning rates:
  • Faster training
  • Higher risk of instability
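As a rough sketch of what "initial learning rate after warmup" means, here is a typical linear-warmup schedule; the platform's exact schedule is not documented, so this is an assumption for illustration only.

```python
def lr_at_step(step, warmup_steps, peak_lr):
    """Linear warmup from ~0 to peak_lr, then constant afterwards.

    `peak_lr` plays the role of the "initial learning rate after warmup"
    described above; the schedule shape itself is an illustrative assumption.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr
```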

Distillation settings (optional)

The following settings control how student responses are generated and how distillation loss is applied. If no values are specified, default backend values are used.

Temperature

Controls the sampling variability of the student model during training.
  • Higher values produce more diverse outputs.
  • Default behavior encourages variability so the student model encounters a wider range of sequences during training.
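A minimal sketch of temperature-scaled sampling, showing why higher values yield more diverse outputs (the logits are divided by the temperature before the softmax):

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample one token id from temperature-scaled logits.

    temperature > 1 flattens the distribution (more diverse samples);
    temperature < 1 sharpens it toward the argmax.
    """
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))
```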

Lambda

This value determines the probability of training on student-generated responses rather than teacher responses. It also serves as the interpolation coefficient when combining training losses, controlling the balance between:
  • supervised learning loss
  • distillation loss
  Lambda value | Behavior
  0   | Uses only teacher responses (equivalent to SFT)
  0.5 | 50% student responses, 50% teacher responses
  1   | Fully on-policy training using only student responses
Higher lambda values increase exposure to student-generated sequences, which can improve robustness.
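A minimal sketch of lambda acting as a per-example sampling probability, assuming the interpretation described above (student responses are chosen with probability lambda):

```python
import random

def pick_response_source(lam, rng=random):
    """With probability `lam`, train this example on a student-generated
    response; otherwise use a teacher response (SFT-style).

    lam=0 reduces to pure SFT; lam=1 is fully on-policy training.
    """
    return "student" if rng.random() < lam else "teacher"
```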

Beta

The Jensen–Shannon Divergence (JSD) interpolation coefficient determines how strongly the model aligns with the teacher distribution. This value controls the interpolation between forward KL and reverse KL loss during distillation.
  Beta value | Loss behavior
  0 | Forward KL
  1 | Reverse KL
The optimal value depends heavily on the task and model configuration.
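The interpolation can be sketched with the generalized JSD formulation, in which the endpoints recover forward and reverse KL (strictly, as limits, so they are special-cased below). This illustrates the math, not the platform's exact implementation:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for probability vectors with strictly positive entries."""
    return float((p * (np.log(p) - np.log(q))).sum())

def jsd_beta(p_teacher, p_student, beta):
    """Generalized JSD between teacher and student distributions.

    beta=0 recovers forward KL(teacher || student); beta=1 recovers
    reverse KL(student || teacher). beta=0.5 gives the standard
    (symmetric) Jensen-Shannon divergence.
    """
    if beta == 0.0:
        return kl(p_teacher, p_student)  # forward KL
    if beta == 1.0:
        return kl(p_student, p_teacher)  # reverse KL
    m = beta * p_teacher + (1.0 - beta) * p_student
    return beta * kl(p_teacher, m) + (1.0 - beta) * kl(p_student, m)
```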

Max completion length

Specifies the maximum response length generated by the student model during training (maximum number of tokens generated for each completion). This value limits:
  • generation length
  • training compute usage
Reducing this value can be useful if:
  • Your task expects short responses
  • You want to reduce training compute

Parameter-efficient settings (optional)

Parameter-efficient fine-tuning (PEFT) enables efficient fine-tuning by training a small subset of parameters, instead of the full model. This significantly reduces:
  • GPU memory usage
  • training time
  • storage requirements

PEFT method

Select the parameter-efficient fine-tuning method.
LoRA is currently the default PEFT method; additional methods are under development.
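As background on why LoRA is parameter-efficient: it trains only two small low-rank factors while the base weight stays frozen. A minimal numpy sketch (illustrative, not the platform's implementation):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Forward pass of a LoRA-adapted linear layer.

    W (d_out x d_in) is the frozen base weight; only the low-rank factors
    A (r x d_in) and B (d_out x r) are trained. The effective weight is
    W + alpha * B @ A, applied without materializing the full update.
    """
    return x @ W.T + alpha * (x @ A.T) @ B.T
```

With rank r much smaller than the weight dimensions, the trainable parameter count drops from d_out * d_in to r * (d_out + d_in), which is what reduces memory, training time, and storage.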

Default values

If distillation settings are not specified, the system uses default backend values. Recommended workflow:
  1. Start with the default configuration
  2. Run an initial training experiment
  3. Adjust parameters such as lambda and beta based on results:
• Start with a lambda of 0.5. If performance is poor, increase the on-policy fraction.
    • If that doesn’t help, try lowering it. Experiment with lambda = [0, 0.5, 1.0].
    • Begin with a beta of 1; if training is unstable, reduce it to 0.
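The recommended starting point can be summarized as a plain dictionary. The key names here are descriptive placeholders, not the platform's actual configuration fields:

```python
# Hypothetical starting values mirroring the guidance above
starting_config = {
    "lambda": 0.5,  # even mix of student and teacher responses
    "beta": 1.0,    # reverse-KL-leaning loss; reduce toward 0 if unstable
}

# Values to sweep if the initial run underperforms
lambda_sweep = [0.0, 0.5, 1.0]
```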

Parameter tuning guidelines

Because on-policy distillation introduces additional hyperparameters, experimentation may be required.

Start with defaults

Default settings provide a strong baseline configuration.

Adjust lambda first

Increasing lambda increases the proportion of student-generated training data. Higher values often improve robustness but may require additional experimentation.

Experiment with beta

Beta can significantly affect training dynamics depending on the task and dataset. There is currently no universally optimal value, so tuning may be required.

When to use on-policy distillation

On-policy distillation is useful when:
  • Fine-tuning models that generate long sequences
  • Training models that must remain robust to their own outputs
  • Improving performance beyond traditional SFT-only training
On-policy distillation extends supervised fine-tuning by allowing the student model to learn from both teacher responses and its own generated outputs. By reducing test-time distribution mismatch, this approach can produce more robust and higher-performing models.