Overview
This quickstart demonstrates how to build a custom model that can outperform a leading large model in just a few steps. You'll be using:
- Large model (judge and large-model benchmarking): GPT-5.2
- Small model (for fine-tuning): Qwen/Qwen2.5-3B-Instruct
Task definition
You will develop an AI model for a bank that classifies customer support queries by intent, enabling accurate routing in a banking context.
Workflow steps
Oumi provides a fully automated, end-to-end workflow. For this use case, the process includes:
- Uploading your datasets and having Oumi automatically analyze them for potential issues
- Defining evaluators (judges) to assess accuracy and ensure proper output formatting
- Benchmarking both the large and small models on a test dataset to establish performance baselines
- Fine-tuning the small model using your training data to create a custom version
- Re-running the evaluations to see your custom model outperform the larger model's baseline
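Classification by intent matters because the predicted class drives routing. The sketch below shows what that downstream routing step looks like; the label IDs and team names are hypothetical examples, not the actual dataset's label space.

```python
# Toy intent-based routing table. Label IDs and team names are
# illustrative assumptions, not taken from the quickstart datasets.
INTENT_TO_TEAM = {
    0: "card_services",   # e.g. "my card was declined"
    1: "fraud_team",      # e.g. "I see a charge I don't recognize"
    2: "loans_desk",      # e.g. "what is the rate on a personal loan?"
}

def route(intent_class: int) -> str:
    """Route a classified query to the matching support queue."""
    return INTENT_TO_TEAM.get(intent_class, "general_support")

print(route(1))   # a known class routes to its team
print(route(42))  # an unknown class falls back to general support
```

An accurate classifier keeps queries out of the fallback queue, which is why the evaluators later in this quickstart measure both accuracy and output format.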
Project Setup
Start by creating a new project in your workspace:
- Click New Project.
- Give your project a Project Name.
- Provide a description in Project Context. You can also invite your team members to your project by clicking the Invite team members button and selecting their usernames.
- Click Create Project.
Set Up Your Datasets
You can use the following datasets for this quickstart. Download them to your local machine.

From your project's Overview page:
Oumi requires that datasets follow a specific format. Please see Datasets to learn more.
- Click the Create button and select Dataset from the menu.
- Select Upload a Dataset (to the right of Create Dataset).
- Provide a Dataset Name for your dataset and select the JSONL files you downloaded in the previous step.
- Click Create Dataset. Oumi will start uploading your datasets and automatically run a series of quality checks.
Create Your Evaluators & Baselines
Next, define your evaluators for measuring baseline model performance. The Oumi Agent makes it easy to create custom evaluators for any metric using natural language prompts. You'll need two evaluators for this example: one to measure accuracy against ground truth, the other to validate that outputs are correctly formatted (i.e., an integer within the valid class range).

From the Agent pane on the right-hand side of your screen:
Please see Evaluations to learn more about how to assess your model’s quality using Oumi.
- Give the Oumi Agent the following prompt:
- The Agent will analyze your existing project assets and guide you through the steps for creating your model baselines and evaluations. Select GPT-5.2 as your strong model, Qwen2.5-3B-Instruct as your small model, and GPT-5.2 as your judge model.
- Once the Agent finishes configuring your evaluation jobs, click Run It to kick off each evaluation. Oumi will run the two evaluations in parallel.
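The two evaluator checks described above can be sketched in plain Python. This is a minimal illustration, not an Oumi API; the function names and the assumed label-space size are placeholders.

```python
NUM_CLASSES = 77  # assumption: the number of intent classes in the dataset

def is_valid_format(output: str, num_classes: int = NUM_CLASSES) -> bool:
    """Formatting evaluator: output must parse as an integer in the valid class range."""
    try:
        label = int(output.strip())
    except ValueError:
        return False
    return 0 <= label < num_classes

def accuracy(predictions: list[str], ground_truth: list[int]) -> float:
    """Accuracy evaluator: exact match against ground-truth labels.
    Malformed outputs count as wrong, so the two checks reinforce each other."""
    correct = sum(
        1
        for pred, truth in zip(predictions, ground_truth)
        if is_valid_format(pred) and int(pred.strip()) == truth
    )
    return correct / len(ground_truth)

print(accuracy(["12", "3", "oops"], [12, 4, 7]))  # 1 of 3 correct
```

Running both evaluators over the same test set is what lets you compare the large and small models on equal footing before fine-tuning.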
Compare and select your two evaluations.
Fine-tune & Evaluate Your Custom Model
After reviewing the baselines, you can fine-tune your custom small model to improve accuracy and close the gap.

From the Agent-provided options:
- Select Fine-tune Qwen2.5-3B.
- Review the training guidance and plan as the Agent automatically sets up your training configuration. Select SFT with LoRA as your training method and parameter strategy.
- Click Start Training and Approve to kick off the training job.
- Once the training job completes, run an evaluation against your new custom model to ensure that it beats your previous baselines. Click Run Evaluation to kick off the job.
- Click Done for now to wrap up the workflow.
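To make the SFT-with-LoRA choice concrete, the sketch below lists hyperparameters of the kind such a run typically uses and shows why LoRA is cheap for a ~3B model. The values are common community defaults for illustration only; the Agent chooses the actual configuration for you.

```python
# Illustrative SFT-with-LoRA settings for a ~3B model. These are assumed
# community defaults, not values prescribed by Oumi.
lora_training_plan = {
    "base_model": "Qwen/Qwen2.5-3B-Instruct",
    "method": "sft",        # supervised fine-tuning on labeled examples
    "lora_r": 16,           # adapter rank: capacity of the low-rank update
    "lora_alpha": 32,       # scaling factor, often 2 * r
    "lora_dropout": 0.05,
    "learning_rate": 2e-4,  # LoRA tolerates higher LRs than full fine-tuning
    "num_epochs": 3,
}

# LoRA trains far fewer parameters than full fine-tuning: for an adapted
# weight W of shape (d, k), it learns B (d, r) and A (r, k) with r << min(d, k).
d, k, r = 2048, 2048, lora_training_plan["lora_r"]
full_params = d * k
lora_params = r * (d + k)
print(f"trainable fraction per adapted matrix: {lora_params / full_params:.2%}")
```

This small trainable footprint is why a fine-tuned adapter on a 3B model can be trained quickly on your uploaded data while the base weights stay frozen.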