> ## Documentation Index
> Fetch the complete documentation index at: https://docs.oumi.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# OUMI QUICKSTART

> Learn how to quickly create your first custom AI model

## OVERVIEW

This quickstart demonstrates how to build a custom model that can outperform a leading large model in just a few steps. You'll be using:

* **Large model (judge and large model benchmarking):** GPT-5.2
* **Small model (for fine-tuning):** Qwen/Qwen2.5-3B-Instruct

By the end, you’ll have a fine-tuned small model that outperforms the larger model, validated through Oumi’s evaluations, which you’ll set up as part of the workflow.

## TASK DEFINITION

You will develop an AI model for a bank that classifies customer support queries by intent, enabling accurate routing in a banking context.

## WORKFLOW STEPS

Oumi provides a fully automated, end-to-end workflow. For this use case, the process includes:

* Uploading your datasets and having Oumi automatically analyze them for potential issues
* Defining evaluators (judges) to assess accuracy and ensure proper output formatting
* Benchmarking both the large and small models on a test dataset to establish performance baselines
* Fine-tuning the small model using your training data to create a custom version
* Re-running the evaluations to see your custom model outperform the larger model’s baseline

You can complete this quickstart using the Oumi Agent or just the platform UI.

Start by creating a new project in your workspace:

1. Click on `New Project`.
2. Give your project a `Project Name`.
3. Provide a description in `Project Context`. You can also invite your team members to your project by clicking the `Invite team members` button and selecting their usernames.
4. Click `Create Project`.

***

You can use the following datasets for this quickstart.
Download them to your local machine:

* [banking77\_test\_basic.jsonl](https://huggingface.co/datasets/oumi-ai/banking77-oumi-quickstart/resolve/main/banking77_test_basic.jsonl?download=true)
* [banking77\_train\_basic.jsonl](https://huggingface.co/datasets/oumi-ai/banking77-oumi-quickstart/resolve/main/banking77_train_basic.jsonl?download=true)
* [banking77\_val\_basic.jsonl](https://huggingface.co/datasets/oumi-ai/banking77-oumi-quickstart/resolve/main/banking77_val_basic.jsonl?download=true)

Oumi requires that datasets follow a specific format. Please see [Datasets](/guides/datasets) to learn more.

From your project's **Overview** page:

1. Click the `Create` button and select `Dataset` from the menu.
2. Select `Upload a Dataset` (to the right of **Create Dataset**).
3. Provide a `Dataset Name` for your dataset and select the JSONL files you downloaded in the previous step.
4. Click `Create Dataset`.

Oumi will start uploading your datasets and automatically run a series of quality checks.
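Before uploading, it can help to confirm locally that each downloaded file is well-formed JSONL. The sketch below is only an illustration and is not part of Oumi: `summarize_jsonl` is a hypothetical helper that makes no assumption about Oumi's required schema. It only checks that every non-blank line parses as a JSON object and reports which top-level keys appear.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Parse a JSONL file and report record count, bad lines, and top-level keys."""
    records = 0
    bad_lines = []          # 1-based numbers of lines that are not JSON objects
    key_counts = Counter()  # how often each top-level key appears
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(line_number)
                continue
            if not isinstance(record, dict):
                bad_lines.append(line_number)
                continue
            records += 1
            key_counts.update(record.keys())
    return {"records": records, "bad_lines": bad_lines, "keys": dict(key_counts)}


if __name__ == "__main__":
    # File names as downloaded above; files that are missing are skipped.
    for name in (
        "banking77_train_basic.jsonl",
        "banking77_val_basic.jsonl",
        "banking77_test_basic.jsonl",
    ):
        if Path(name).exists():
            print(name, summarize_jsonl(name))
```

An empty `bad_lines` list and the same key set across all three files is a reasonable sign that the files downloaded intact. Oumi's own quality checks still run after upload.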

Next, define your evaluators for measuring baseline model performance. The Oumi Agent makes it easy to create custom evaluators for any metric using natural language prompts. You'll need two evaluators for this example: one to measure accuracy against ground truth, the other to validate that outputs are correctly formatted (i.e., an integer within the valid class range).

Please see [Evaluations](/guides/evaluations/results) to learn more about how to assess your model’s quality using Oumi.

From the **Agent** pane on the right-hand side of your screen:

1. Give the Oumi Agent the following prompt:

```prompt theme={null}
You are building a model to classify and route customer support queries for a bank. The model should determine the customer’s intent based on a provided conversation. Start by creating baselines for performance benchmarking and fine-tuning. First, evaluate a strong model on the uploaded test dataset, then evaluate a small language model on the test dataset. Define two custom judges for the evaluations: one to determine whether the output is correct (measuring accuracy using the ground truth labels stored in the dataset’s metadata fields as `label` and `label_name`), the other to determine whether the output is valid (is an integer between 1 and the number of classes). To make the evaluations run faster, do not create failure modes.
```

2. The Agent will analyze your existing project assets and guide you through the steps for creating your model baselines and evaluations. Select `GPT-5.2` as your `strong model`, `Qwen2.5-1.5B` as your `small model`, and `GPT-5.2` as your `judge model`.
3. Once the Agent finishes configuring your evaluation jobs, click `Run It` to kick off each evaluation. Oumi will run the two evaluations in parallel.
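The output-validity rule in the prompt above (an integer between 1 and the number of classes) can also be expressed as a deterministic check. The sketch below is illustrative and is not how Oumi's judges work, since those are model-based. `is_valid_label` and `NUM_CLASSES` are hypothetical names, and the value 77 assumes the standard Banking77 label set.

```python
import re

NUM_CLASSES = 77  # assumption: Banking77 defines 77 intent classes


def is_valid_label(output, num_classes=NUM_CLASSES):
    """Return True if output is a bare integer between 1 and num_classes inclusive."""
    text = output.strip()
    # Accept only ASCII digits, so "42." or "label 42" are rejected.
    if not re.fullmatch(r"[0-9]+", text):
        return False
    return 1 <= int(text) <= num_classes


if __name__ == "__main__":
    for candidate in ("42", " 7\n", "0", "78", "card_arrival"):
        print(repr(candidate), is_valid_label(candidate))
```

A strict check like this is useful as a quick local sanity test on a few model outputs. A judge model can be more lenient about surrounding text, depending on how its prompt is written.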

After reviewing the baselines, you can fine-tune your custom small model to improve accuracy and close the gap. From the Agent-provided options:

1. Select `Fine-tune Qwen 2.5-3B`.
2. Review the training guidance and plan while the Agent automatically sets up your training configuration, then select `SFT with LoRA` as your training method and parameter strategy.
3. Click `start training` and `approve` to kick off the training job.

To follow the same workflow using only the platform UI, start by creating a new project in your workspace:

1. Click on `New Project`.
2. Give your project a `Project Name`.
3. Provide a description in `Project Context`. You can also invite your team members to your project by clicking the `Invite team members` button and selecting their usernames.
4. Click `Create Project`.

***

You can use the following datasets for this quickstart. Download them to your local machine:

* [banking77\_test\_basic.jsonl](/downloads/banking77_test_basic.jsonl)
* [banking77\_train\_basic.jsonl](/downloads/banking77_train_basic.jsonl)
* [banking77\_val\_basic.jsonl](/downloads/banking77_val_basic.jsonl)

Oumi requires that datasets follow a specific format. Please see [Datasets](/guides/datasets) to learn more.

From your project's **Overview** page:

1. Click the `Create` button and select `Dataset` from the menu.
2. Select `Upload a Dataset` (to the right of **Create Dataset**).
3. Provide a `Dataset Name` for your dataset and select the JSONL files you downloaded in the previous step.
4. Click `Create Dataset`.

Oumi will start uploading your datasets and automatically run a series of quality checks.

Next, define your evaluators for measuring baseline model performance. You'll need two evaluators for this example: one to measure accuracy against ground truth, the other to validate that outputs are correctly formatted (i.e., an integer within the valid class range).

Please see [Evaluations](/guides/evaluations/results) to learn how to assess your model’s quality with Oumi.

Go to your **Evaluators** page and click `Create Evaluator` on the top right-hand side to load the Builder. On the `CONFIGURE` tab:

1. Leave the **Judgments** default values and enter the following for the **Prompt**:

```prompt theme={null}
You are a classification accuracy judge. Your task is to determine whether the assistant's predicted intent label matches the ground truth label for a banking customer support intent classification task.

Inputs: Consider the field 'conversation' (a multi-turn conversation between a user and an assistant) along with the provided data fields 'Ground Truth Label (Integer)' and 'Ground Truth Label Name' to make your assessment.

Decision rule: If the assistant's response contains the correct ground truth label integer OR the correct ground truth label name (or a semantically equivalent version of the label name), respond with 'Yes'. Otherwise, respond with 'No'.

Evaluation Criteria:
1. Exact Integer Match: If the assistant's response contains the ground truth label integer (e.g., if the ground truth is 42, the response contains '42' as a label prediction), this counts as a match.
2. Exact Label Name Match: If the assistant's response contains the ground truth label name exactly as provided (e.g., 'card_arrival'), this counts as a match.
3. Semantic Label Name Equivalence: Minor formatting variations of the label name are acceptable matches. For example, 'card_arrival' matches 'card arrival', 'Card Arrival', 'card-arrival', or 'Card_Arrival'. Underscores, hyphens, spaces, and capitalization differences should be treated as equivalent. However, the semantic meaning must be preserved: 'card_payment' does NOT match 'card_arrival'.
4. Presence in Longer Responses: If the assistant's response contains additional text beyond just the label (e.g., explanations, reasoning, or formatting), the judge should still check whether the correct label integer or label name appears within the response. The label must be clearly identifiable as the assistant's prediction, not merely mentioned in passing or as part of a different context.
5. Strictness: If the response is ambiguous, unclear, contains multiple conflicting labels, or does not clearly indicate the correct label, default to 'No'.

Corner cases:
1. Multiple Labels in Response: If the assistant outputs multiple labels and one of them matches the ground truth, judge 'Yes' only if the correct label is clearly the final or primary prediction. If it is ambiguous which label is the prediction, judge 'No'.
2. No Label Present: If the assistant's response does not contain any recognizable label integer or label name, judge 'No'.
3. Partial Matches: A partial match of the label name is not sufficient. For example, if the ground truth is 'card_arrival', a response of 'card' alone does not match. The full label name or its semantic equivalent must be present.
4. Irrelevant or Off-Topic Responses: If the assistant's response does not attempt to classify the intent at all, judge 'No'.
5. Numerical Ambiguity: If the ground truth label integer appears in the response but in a context unrelated to the classification (e.g., as part of a date or other number), this does not count as a match. The number must be clearly presented as the predicted label.
```

2. For `Model`, select `GPT 5.2`. Leave everything else at its default value.
3. Click the `Save` icon on the top right-hand corner, and give your evaluator a name.
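The label-name equivalence rule in the judge prompt (underscores, hyphens, spaces, and capitalization treated as equivalent, with partial matches rejected) can be illustrated with a small normalization function. This is a sketch for intuition only: `normalize_label` and `same_label` are hypothetical helpers, and they cover only bare label names, not the extraction of a label from a longer response.

```python
import re


def normalize_label(name):
    """Lowercase a label and collapse runs of spaces, hyphens, and underscores into one underscore."""
    return re.sub(r"[\s_-]+", "_", name.strip().lower())


def same_label(predicted, ground_truth):
    """Compare two label names after normalization; partial names do not match."""
    return normalize_label(predicted) == normalize_label(ground_truth)


if __name__ == "__main__":
    for candidate in ("card arrival", "Card-Arrival", "card_payment", "card"):
        print(candidate, same_label(candidate, "card_arrival"))
```

Normalizing both sides before comparing keeps the check symmetric, so it does not matter which form the ground truth label uses.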

Next, define two evaluations using your uploaded datasets: one for the strong model, the other for the small model.

Go to the **Evaluations** page and click `Run Evaluation` on the top right-hand side to load the Builder. In the Builder:

1. Select `Create a Judge-based Evaluation`.

On the `CONFIGURE` tab:

2. Select `GPT5.2` for your `Model`. Under **Evaluators**, select the accuracy and output validity evaluators you created in the previous step.
3. For `Dataset`, select the `banking77_test_basic.jsonl` dataset you previously uploaded.
4. Leave all other settings at their defaults and click `Execute` on the top right-hand side.
5. Give your evaluation a name. This will automatically create a new recipe and run the evaluation.

Repeat these steps with the small model selected as the `Model` to create the second evaluation. You can view your new evaluation results on your `Evaluations` page when each job completes.
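Once both evaluations finish, the accuracy baseline for each model is simply the fraction of examples the accuracy judge marked 'Yes'. The sketch below shows that arithmetic under stated assumptions: it assumes you have the per-example judgments available as a list of strings, `accuracy` is a hypothetical helper, and the sample lists are invented for illustration rather than real results.

```python
def accuracy(judgments):
    """Fraction of judgments equal to 'Yes' (case-insensitive); 0.0 for an empty list."""
    if not judgments:
        return 0.0
    hits = sum(1 for judgment in judgments if judgment.strip().lower() == "yes")
    return hits / len(judgments)


if __name__ == "__main__":
    # Hypothetical judgments, for illustration only; not real evaluation results.
    strong_model = ["Yes", "Yes", "No", "Yes"]
    small_model = ["Yes", "No", "No", "Yes"]
    gap = accuracy(strong_model) - accuracy(small_model)
    print(f"strong model accuracy: {accuracy(strong_model):.2f}")
    print(f"small model accuracy:  {accuracy(small_model):.2f}")
    print(f"gap to close with fine-tuning: {gap:.2f}")
```

The gap between the two baselines is the number to watch when you re-run the same evaluation on the fine-tuned model.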

Next, fine-tune your small model on the training dataset to close the accuracy gap. Your baseline evaluations give you the metrics to beat, so you can compare your custom model against both baselines and confirm that it outperforms them.

1. From the **Models** page, click `Train New Model` on the top right-hand side, followed by `Supervised Fine-Tuning`.
2. On the **CONFIGURE** tab, set the following values (you can leave the other values as is):
   * `Training Method` should be `Supervised Fine-Tuning`
   * `Base Model` should be `Qwen/Qwen2.5-3B-Instruct`
   * `Training Dataset` should be `banking77_train_basic`
3. Click `Execute` and provide a name for your model. Oumi will automatically save the recipe under this name as well.
4. Click `Run Job` to kick off the fine-tuning job.

## NEXT STEPS

You’ve now built a custom model with Oumi, leveraging the strength of a large model while gaining the efficiency and control of a smaller one, without complex or costly development. From here, you’re ready to iterate, scale, and apply this example to your own use cases.

Explore the [Oumi Workflow](/guides/intro) and dive deeper into the available options and configurations for building custom AI models in Oumi.