# OUMI QUICKSTART

> Learn how to quickly create your first custom AI model

## OVERVIEW

This quickstart demonstrates how to build a custom model that can outperform a leading large model in just a few steps.

You'll be using:

* **Large model (judge and large model benchmarking):** GPT-5.2
* **Small model (for fine-tuning):** Qwen/Qwen2.5-3B-Instruct

By the end, you’ll have a fine-tuned small model that outperforms the larger model, validated through Oumi’s evaluations, which you’ll set up as part of the workflow.

## TASK DEFINITION

You will develop an AI model for a bank that classifies customer support queries by intent, enabling accurate routing in a banking context.
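
To make the task concrete, here is a minimal Python sketch of the output contract this quickstart implies: the model's reply should contain one integer class ID. The 0 to 76 range follows the 77 Banking77 intent classes referenced in the evaluator prompts later in this guide. The function name `parse_intent_id` and the sample replies are hypothetical and are not part of Oumi.

```python
import re
from typing import Optional

NUM_CLASSES = 77  # Banking77 intent classes, labeled 0 through 76

# A standalone run of digits: not preceded by a digit, "." or "-",
# and not followed by another digit or a decimal fraction.
_INT_PATTERN = re.compile(r"(?<![\d.\-])\d+(?!\d|\.\d)")


def parse_intent_id(reply: str) -> Optional[int]:
    """Return the first standalone integer in [0, NUM_CLASSES) in `reply`, or None."""
    for match in _INT_PATTERN.finditer(reply):
        value = int(match.group())
        if 0 <= value < NUM_CLASSES:
            return value
    return None


# Hypothetical model replies, for illustration only.
print(parse_intent_id("11"))            # 11
print(parse_intent_id("Intent: 42."))   # 42
print(parse_intent_id("card_arrival"))  # None (label name without an integer)
print(parse_intent_id("77"))            # None (outside 0..76)
```

A rule like this is stricter and less flexible than the LLM-based judges you configure later, but it is a quick way to see what counts as a routable prediction.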

## WORKFLOW STEPS

Oumi provides a fully automated, end-to-end workflow. For this use case, the process includes:

* Uploading your datasets and having Oumi automatically analyze them for potential issues
* Defining evaluators (judges) to assess accuracy and ensure proper output formatting
* Benchmarking both the large and small models on a test dataset to establish performance baselines
* Fine-tuning the small model using your training data to create a custom version
* Re-running the evaluations to confirm that your custom model outperforms the larger model’s baseline
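
The benchmark-and-compare loop in the steps above reduces to two rates per model: accuracy against the ground truth, and the share of outputs that are validly formatted. The Python sketch below computes both on toy data. In this quickstart the judgments come from LLM-based evaluators, so the exact-match rule here is a simplified stand-in. The names `Scores` and `score` and all sample values are illustrative; they are not Oumi APIs or real results.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Scores:
    accuracy: float  # fraction of predictions equal to the ground-truth label
    validity: float  # fraction of predictions that are integers in the valid range


def score(
    predictions: Sequence[Optional[int]],
    labels: Sequence[int],
    num_classes: int = 77,
) -> Scores:
    """Compute accuracy and validity rates; None means no parsable prediction."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    n = len(labels)
    valid = sum(1 for p in predictions if p is not None and 0 <= p < num_classes)
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return Scores(accuracy=correct / n, validity=valid / n)


# Toy values for illustration only; these are not real evaluation results.
labels = [11, 42, 5, 76]
baseline_preds = [11, None, 5, 80]  # one missing and one out-of-range output
tuned_preds = [11, 42, 5, 75]       # all valid, one wrong

baseline = score(baseline_preds, labels)
tuned = score(tuned_preds, labels)
print(baseline)  # Scores(accuracy=0.5, validity=0.5)
print(tuned)     # Scores(accuracy=0.75, validity=1.0)
```

Tracking validity separately from accuracy shows whether a model is failing on the task itself or only on the output format.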

You can complete this quickstart using the Oumi Agent or just the platform UI.

<Tabs>
  <Tab title="Oumi Agent">
    <Steps>
      <Step title="Project setup">
        Start by creating a new project in your workspace:

        1. Click on `New Project`.
        2. Give your project a `Project Name`.
        3. Provide a description in `Project Context`. You can also invite your team members to your project by clicking the `Invite team members` button and selecting their usernames.
        4. Click `Create Project`.

        ***
      </Step>

      <Step title="Set up your datasets">
        You can use the following datasets for this quickstart. Download them to your local machine:

        * [banking77\_test\_basic.jsonl](https://huggingface.co/datasets/oumi-ai/banking77-oumi-quickstart/resolve/main/banking77_test_basic.jsonl?download=true)
        * [banking77\_train\_basic.jsonl](https://huggingface.co/datasets/oumi-ai/banking77-oumi-quickstart/resolve/main/banking77_train_basic.jsonl?download=true)
        * [banking77\_val\_basic.jsonl](https://huggingface.co/datasets/oumi-ai/banking77-oumi-quickstart/resolve/main/banking77_val_basic.jsonl?download=true)

        <Note>Oumi requires that datasets follow a specific format. Please see [Datasets](/guides/datasets) to learn more.</Note>

        From your project's **Overview** page:

        1. Click the `Create` button and select `Dataset` from the menu.

        2. Select `Upload a Dataset` (to the right of **Create Dataset**)

        3. Provide a `Dataset Name` for your dataset and select the JSONL files you downloaded in the previous step.

        4. Click `Create Dataset`. Oumi will start uploading your datasets and automatically run a series of quality checks.

        <video autoPlay controls muted loop playsInline allowFullScreen className="w-full aspect-video rounded-xl" src="https://mintcdn.com/oumi/-C82V_kXqoBIcXEj/videos/quickstart-upload-datasets.mp4?fit=max&auto=format&n=-C82V_kXqoBIcXEj&q=85&s=fa9623a4e893b73e89e5503da1b1694b" data-path="videos/quickstart-upload-datasets.mp4" />

        ***
      </Step>

      <Step title="Create your evaluators & baselines">
        Next, define your evaluators for measuring baseline model performance. The Oumi Agent makes it easy to create custom evaluators for any metric using natural language prompts.

        You'll need two evaluators for this example: one to measure accuracy against ground truth, the other to validate that outputs are correctly formatted (i.e., an integer within the valid class range).

        <Info>Please see [Evaluations](/guides/evaluations/results) to learn more about how to assess your model’s quality using Oumi.</Info>

        From the **Agent** pane on the right-hand side of your screen:

        1. Give the Oumi Agent the following prompt:

        ```prompt theme={null}
        You are building a model to classify and route customer support queries for a bank.
        The model should determine the customer’s intent based on a provided conversation. Start by creating baselines for performance benchmarking and fine-tuning. 
        First, evaluate a strong model on the uploaded test dataset, then evaluate a small language model on the test dataset. Define two custom judges for the evaluations: one to determine whether the output is correct (measuring accuracy using the ground truth labels stored in the dataset’s metadata fields as `label` and `label_name`), the other to determine whether the output is valid (is an integer between 0 and the number of classes minus 1).
        To make the evaluations run faster, do not create failure modes.
        ```

        2. The Agent will analyze your existing project assets and guide you through the steps for creating your model baselines and evaluations. Select `GPT-5.2` as your `strong model`, `Qwen2.5-3B` as your `small model`, and `GPT-5.2` as your `judge model`.

        3. Once the Agent finishes configuring your evaluation jobs, click `Run It` to kick off each evaluation. Oumi will run the two evaluations in parallel.

        <video autoPlay controls muted loop playsInline allowFullScreen className="w-full aspect-video rounded-xl" src="https://mintcdn.com/oumi/eoU5aAI3oLu48TXK/videos/quickstart-create-evals-new.mp4?fit=max&auto=format&n=eoU5aAI3oLu48TXK&q=85&s=e81e8e25cb21915affa587cae271c484" data-path="videos/quickstart-create-evals-new.mp4" />

        Once your evaluation jobs finish, review the results. Your strong model should outperform your small model. With Oumi, you can reach that same level of performance with your custom small model by fine-tuning it on the training dataset.

        <video autoPlay controls muted loop playsInline allowFullScreen className="w-full aspect-video rounded-xl" src="https://mintcdn.com/oumi/-C82V_kXqoBIcXEj/videos/quickstart-view-evals.mp4?fit=max&auto=format&n=-C82V_kXqoBIcXEj&q=85&s=f170cf6596b2a69431096dc5604b6a2e" data-path="videos/quickstart-view-evals.mp4" />

        You can also review your in-depth evaluation results side-by-side. From the **Evaluations** page, click on `Compare` and select your two evaluations.

        <video autoPlay controls muted loop playsInline allowFullScreen className="w-full aspect-video rounded-xl" src="https://mintcdn.com/oumi/-C82V_kXqoBIcXEj/videos/quickstart-view-evals2.mp4?fit=max&auto=format&n=-C82V_kXqoBIcXEj&q=85&s=c4267e9a84892ee1da9ec669d18238fd" data-path="videos/quickstart-view-evals2.mp4" />
      </Step>

      <Step title="Fine-tune & evaluate your custom model">
        After reviewing the baselines, you can fine-tune your custom small model to improve accuracy and close the gap.

        From the Agent-provided options:

        1. Select `Fine-tune Qwen 2.5-3B`.

        2. Review the training guidance and plan as the Agent automatically sets up your training configuration. Select `SFT with LoRA` as your training method and parameter strategy.

        3. Click `start training` and `approve` to kick off the training job.

        <video autoPlay controls muted loop playsInline allowFullScreen className="w-full aspect-video rounded-xl" src="https://mintcdn.com/oumi/-C82V_kXqoBIcXEj/videos/quickstart-fine-tuning.mp4?fit=max&auto=format&n=-C82V_kXqoBIcXEj&q=85&s=2f3bc2c97ed12fbb2b5cfac423316efc" data-path="videos/quickstart-fine-tuning.mp4" />

        4. Once the training job completes, run an evaluation against your new custom model to ensure that it beats your previous baselines. Click `Run Evaluation` to kick off the job.

        <video autoPlay controls muted loop playsInline allowFullScreen className="w-full aspect-video rounded-xl" src="https://mintcdn.com/oumi/-C82V_kXqoBIcXEj/videos/quickstart-run-eval-custom.mp4?fit=max&auto=format&n=-C82V_kXqoBIcXEj&q=85&s=62b6f858d491d56803f9244af6d61941" data-path="videos/quickstart-run-eval-custom.mp4" />

        5. Click `Done for now` to wrap up the workflow.

        Your custom, high-quality AI model tailored to your classification task is now ready for use. The entire process should take only a matter of hours, not months.
      </Step>
    </Steps>
  </Tab>

  <Tab title="Platform UI">
    <Steps>
      <Step title="Project setup">
        Start by creating a new project in your workspace:

        1. Click on `New Project`.
        2. Give your project a `Project Name`.
        3. Provide a description in `Project Context`. You can also invite your team members to your project by clicking the `Invite team members` button and selecting their usernames.
        4. Click `Create Project`.

        ***
      </Step>

      <Step title="Set up your datasets">
        You can use the following datasets for this quickstart. Download them to your local machine:

        * [banking77\_test\_basic.jsonl](/downloads/banking77_test_basic.jsonl)
        * [banking77\_train\_basic.jsonl](/downloads/banking77_train_basic.jsonl)
        * [banking77\_val\_basic.jsonl](/downloads/banking77_val_basic.jsonl)

        <Note>Oumi requires that datasets follow a specific format. Please see [Datasets](/guides/datasets) to learn more.</Note>

        From your project's **Overview** page:

        1. Click the `Create` button and select `Dataset` from the menu.

        2. Select `Upload a Dataset` (to the right of **Create Dataset**)

        3. Provide a `Dataset Name` for your dataset and select the JSONL files you downloaded in the previous step.

        4. Click `Create Dataset`. Oumi will start uploading your datasets and automatically run a series of quality checks.

        <video autoPlay controls muted loop playsInline allowFullScreen className="w-full aspect-video rounded-xl" src="https://mintcdn.com/oumi/-C82V_kXqoBIcXEj/videos/quickstart-upload-datasets.mp4?fit=max&auto=format&n=-C82V_kXqoBIcXEj&q=85&s=fa9623a4e893b73e89e5503da1b1694b" data-path="videos/quickstart-upload-datasets.mp4" />

        ***
      </Step>

      <Step title="Create your evaluators">
        Next, define your evaluators for measuring baseline model performance. You'll need two evaluators for this example: one to measure accuracy against ground truth, the other to validate that outputs are correctly formatted (i.e., an integer within the valid class range).

        <Info>Please see [Evaluations](/guides/evaluations/results) to learn how to assess your model’s quality with Oumi.</Info>

        Go to your **Evaluators** page and click `Create Evaluator` on the top right-hand side to load the Builder.

        On the `CONFIGURE` tab:

        1. Leave the **Judgments** values at their defaults and enter the following for the **Prompt**:

        ```prompt theme={null}

        You are a classification accuracy judge. Your task is to determine whether the assistant's predicted intent label matches the ground truth label for a banking customer support intent classification task.

        Inputs:
        Consider the field 'conversation' (a multi-turn conversation between a user and an assistant) along with the provided data fields 'Ground Truth Label (Integer)' and 'Ground Truth Label Name' to make your assessment.

        Decision rule:
        If the assistant's response contains the correct ground truth label integer OR the correct ground truth label name (or a semantically equivalent version of the label name), respond with 'Yes'. Otherwise, respond with 'No'.

        Evaluation Criteria:
        1. Exact Integer Match: If the assistant's response contains the ground truth label integer (e.g., if the ground truth is 42, the response contains '42' as a label prediction), this counts as a match.
        2. Exact Label Name Match: If the assistant's response contains the ground truth label name exactly as provided (e.g., 'card_arrival'), this counts as a match.
        3. Semantic Label Name Equivalence: Minor formatting variations of the label name are acceptable matches. For example, 'card_arrival' matches 'card arrival', 'Card Arrival', 'card-arrival', or 'Card_Arrival'. Underscores, hyphens, spaces, and capitalization differences should be treated as equivalent. However, the semantic meaning must be preserved: 'card_payment' does NOT match 'card_arrival'.
        4. Presence in Longer Responses: If the assistant's response contains additional text beyond just the label (e.g., explanations, reasoning, or formatting), the judge should still check whether the correct label integer or label name appears within the response. The label must be clearly identifiable as the assistant's prediction, not merely mentioned in passing or as part of a different context.
        5. Strictness: If the response is ambiguous, unclear, contains multiple conflicting labels, or does not clearly indicate the correct label, default to 'No'.

        Corner cases:
        1. Multiple Labels in Response: If the assistant outputs multiple labels and one of them matches the ground truth, judge 'Yes' only if the correct label is clearly the final or primary prediction. If it is ambiguous which label is the prediction, judge 'No'.
        2. No Label Present: If the assistant's response does not contain any recognizable label integer or label name, judge 'No'.
        3. Partial Matches: A partial match of the label name is not sufficient. For example, if the ground truth is 'card_arrival', a response of 'card' alone does not match. The full label name or its semantic equivalent must be present.
        4. Irrelevant or Off-Topic Responses: If the assistant's response does not attempt to classify the intent at all, judge 'No'.
        5. Numerical Ambiguity: If the ground truth label integer appears in the response but in a context unrelated to the classification (e.g., as part of a date or other number), this does not count as a match. The number must be clearly presented as the predicted label.
        ```

        2. For `Model`, select `GPT-5.2`. Leave all other settings at their default values.

        3. Click the `Save` icon on the top right-hand corner, and give your evaluator a name.

        <video autoPlay controls muted loop playsInline allowFullScreen className="w-full aspect-video rounded-xl" src="https://mintcdn.com/oumi/eoU5aAI3oLu48TXK/videos/quickstart-create-evaluators-ui.mp4?fit=max&auto=format&n=eoU5aAI3oLu48TXK&q=85&s=695c3eb3e8b255b4c58a550aeac585fe" data-path="videos/quickstart-create-evaluators-ui.mp4" />

        You will now create your second evaluator for output validity. Click `Create Evaluator` on the top right-hand side to load the Builder again.

        On the `CONFIGURE` tab:

        1. Leave the **Judgments** values at their defaults and enter the following for the **Prompt**:

        ```prompt theme={null}
        You are an output validity judge. Your task is to determine whether the assistant's response contains a valid integer between 0 and 76 (inclusive), representing one of the 77 Banking77 intent classes.

        Inputs:
        Only consider the field 'conversation' (a multi-turn conversation between a user and an assistant) to make your assessment.

        Decision rule:
        Examine the assistant's response for the presence of at least one integer in the range [0, 76]. If such an integer is present, respond with 'Yes'. Otherwise, respond with 'No'.

        Evaluation criteria:
        1. Valid Integer Presence: The assistant's response must contain at least one integer value that falls within the inclusive range of 0 to 76. The integer must be a whole number (not a decimal or fraction).
        2. Permissive Extraction: If the response contains extra text, explanations, or formatting but still includes a clear integer in the range 0-76, judge 'Yes'. The integer does not need to be the only content in the response.
        3. Multiple Numbers: If the response contains multiple numbers, check whether any one of them is a valid integer in the range 0-76. If at least one qualifies, judge 'Yes'.
        4. Label Names Without Integers: If the response contains only a textual label name (e.g., 'card_payment_fee_charged') but no integer, judge 'No'. The task requires an integer output.
        5. Strict Range Enforcement: Integers outside the range 0-76 (e.g., -1, 77, 100) do not count as valid. If the only integers present are outside this range, judge 'No'.
        6. Number Format: Accept integers written as digits (e.g., '42') or as numeric words (e.g., 'forty-two'). Both count as valid if they fall in range. However, decimal numbers like '3.5' do not count as valid integers; only their integer part should be considered if it appears separately.

        Corner cases:
        1. Empty Response: If the assistant's response is empty or contains only whitespace, judge 'No'.
        2. Non-Numeric Response: If the response contains no numbers at all (only text, symbols, or punctuation), judge 'No'.
        3. Out-of-Range Integers Only: If all integers in the response are outside 0-76, judge 'No'.
        4. Ambiguous Numeric Content: If the response contains numbers embedded in non-numeric contexts (e.g., dates, URLs, or identifiers like 'GPT-4'), use judgment to determine whether any standalone integer in range 0-76 is clearly intended as a class prediction. If uncertain, be permissive and judge 'Yes' if a standalone integer in range is present.
        5. Negative Numbers: Negative integers are outside the valid range and should be judged 'No' unless a valid in-range integer also appears.
        6. Refusal or Error Messages: If the assistant refuses to answer or returns an error message without any valid integer, judge 'No'.
        ```

        2. For `Model`, select `GPT-5.2`. Leave all other settings at their default values.

        3. Click the `Save` icon on the top right-hand corner, and give your evaluator a name.

        <video autoPlay controls muted loop playsInline allowFullScreen className="w-full aspect-video rounded-xl" src="https://mintcdn.com/oumi/eoU5aAI3oLu48TXK/videos/quickstart-create-evaluators-ui.mp4?fit=max&auto=format&n=eoU5aAI3oLu48TXK&q=85&s=695c3eb3e8b255b4c58a550aeac585fe" data-path="videos/quickstart-create-evaluators-ui.mp4" />
      </Step>

      <Step title="Create evaluations & baselines">
        Next, define two evaluations using your uploaded datasets: one for the strong model and one for the small model.

        Go to the **Evaluations** page and click `Run Evaluation` on the top right-hand side to load the Builder.

        In the Builder:

        1. Select `Create a Judge-based Evaluation`.
           On the `CONFIGURE` tab:
        2. Select `GPT-5.2` for your `Model`. Under **Evaluators**, select the accuracy and output validity evaluators you created in the previous step.
        3. For `Dataset`, select the `banking77_test_basic.jsonl` dataset you previously uploaded.
        4. Leave all others at their defaults and click `Execute` on the top right-hand side.
        5. Give your evaluation a name. This will automatically create a new recipe and run the evaluation. You can view your new evaluation results on your `Evaluations` page when the job completes.

        <video autoPlay controls muted loop playsInline allowFullScreen className="w-full aspect-video rounded-xl" src="https://mintcdn.com/oumi/-C82V_kXqoBIcXEj/videos/quickstart-create-evalsbaselines-ui.mp4?fit=max&auto=format&n=-C82V_kXqoBIcXEj&q=85&s=8f591bab9efa1fa9715ce3a3bfcc0af6" data-path="videos/quickstart-create-evalsbaselines-ui.mp4" />

        Now, create your second evaluation for the small model. Click `Run Evaluation` on the top right-hand side to load the Builder again.

        In the Builder:

        1. Select `Create a Judge-based Evaluation`.
           On the `CONFIGURE` tab:
        2. Select `Qwen 2.5-3B` for your `Model`. Under **Evaluators**, select the accuracy and output validity evaluators you created in the previous step.
        3. For `Dataset`, select the `banking77_test_basic.jsonl` dataset you previously uploaded.
        4. Leave all others at their defaults and click `Execute` on the top right-hand side.
        5. Give your evaluation a name. This will automatically create a new recipe and run the evaluation. You can view your new evaluation results on your `Evaluations` page when the job completes.

        Once your evaluation jobs finish, review the results. Your strong model should outperform your small model. With Oumi, you can reach that same level of performance with your custom small model.
      </Step>

      <Step title="Fine-tune your custom small model">
        Next, fine-tune your small model on the training dataset to close the accuracy gap. Because your baseline evaluations show which metrics need to improve, it's easy to check whether your custom model outperforms the others.

        1. From the **Models** page, click `Train New Model` on the top right-hand side, followed by `Supervised Fine-Tuning`.

        2. On the **CONFIGURE** tab, set the following values (you can leave the other values as is):

           * `Training Method` should be `Supervised Fine-Tuning`
           * `Base Model` should be `Qwen/Qwen2.5-3B-Instruct`
           * `Training Dataset` should be `banking77_train_basic`

        3. Click `Execute` and provide a name for your model. Oumi will automatically save the recipe under this name as well.

        4. Click `Run Job` to kick off the fine-tuning job.

        <video autoPlay controls muted loop playsInline allowFullScreen className="w-full aspect-video rounded-xl" src="https://mintcdn.com/oumi/-C82V_kXqoBIcXEj/videos/quickstart-trainmodel-ui.mp4?fit=max&auto=format&n=-C82V_kXqoBIcXEj&q=85&s=cab16feb18ab870e8c7d379a261b3f7f" data-path="videos/quickstart-trainmodel-ui.mp4" />

        Once the job completes, create a final evaluation on your custom small model:

        1. From the **Models** page, click on your newly trained model.

        2. Click on `Evaluate` in the top navigation. In the **CONFIGURE** tab:
           * Under `Model`, make sure your newly trained custom small model is selected.
           * Under `Select Evaluators`, select the `Accuracy` and `Output Validity` judges.
           * For the `Dataset`, select the `banking77_test_basic` dataset. You can leave all other optional items as is.

        3. Click `Execute`, give your new evaluation a name, and click `Run Evaluation` to kick off the job.

        <video autoPlay controls muted loop playsInline allowFullScreen className="w-full aspect-video rounded-xl" src="https://mintcdn.com/oumi/WxNX4Z_9BfbirlQx/videos/quickstart-create-finaleval-ui.mp4?fit=max&auto=format&n=WxNX4Z_9BfbirlQx&q=85&s=597fbc378536e517995cbe1fabb6abce" data-path="videos/quickstart-create-finaleval-ui.mp4" />

        Once the job finishes, review your evaluation to verify that your new model is the best-performing across the board.

        From the **Evaluations** page:

        1. Click `Compare` in the top navigation.
        2. Select the previous large model evaluation and the new custom small model evaluation.
        3. Review the evaluators' scores across the two evaluations. Your new custom small model should outperform the large model.

        <video autoPlay controls muted loop playsInline allowFullScreen className="w-full aspect-video rounded-xl" src="https://mintcdn.com/oumi/-C82V_kXqoBIcXEj/videos/quickstart-review-finalevals-ui.mp4?fit=max&auto=format&n=-C82V_kXqoBIcXEj&q=85&s=e19d256b2a5cf43ab9c7d8d28af07b0b" data-path="videos/quickstart-review-finalevals-ui.mp4" />

        Your custom, high-quality AI model tailored to your classification task is now ready for use. The entire process should take only a matter of hours, not months.

        From this point, you can [export your new model](/guides/deployment/exporting-models) or continue improving its performance.
      </Step>
    </Steps>
  </Tab>
</Tabs>

## NEXT STEPS

You’ve now built a custom model with Oumi, leveraging the strength of a large model while gaining the efficiency and control of a smaller one, without complex or costly development. From here, you’re ready to iterate, scale, and apply this example to your own use cases.

Explore the [Oumi Workflow](/guides/intro) and dive deeper into the available options and configurations for building custom AI models in Oumi.
