- Task complexity
- Latency
- Cost
- Language support
- Tool use and agentic capabilities
Key considerations
Selecting the right base model requires balancing performance, cost, and capability across several technical dimensions; each of the considerations below can meaningfully impact your fine-tuning results and production deployment.
Model size and latency

Smaller models (≤4B parameters) offer the fastest inference and can be an order of magnitude cheaper to deploy at scale. Larger models (>8B) provide stronger reasoning and instruction-following capacity, but come with higher latency and cost. For latency-sensitive applications or on-device deployments, start with one of the most compact models and scale up only if the resulting quality doesn’t meet your needs.
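The size/cost trade-off is easy to sanity-check with back-of-envelope arithmetic: weight memory scales linearly with parameter count and numeric precision. A rough sketch (the helper name and the weights-only simplification are ours, not part of any framework):

```python
def estimate_vram_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only memory footprint in GiB.

    bytes_per_param: 2 for fp16/bf16, 1 for int8, 4 for fp32.
    KV cache, activations, and runtime overhead are extra.
    """
    return n_params * bytes_per_param / 1024**3

# A 4B model in bf16 needs roughly 7.5 GiB just for weights;
# an 8B model needs roughly 14.9 GiB, before KV cache and activations.
print(round(estimate_vram_gb(4e9), 1))  # 7.5
print(round(estimate_vram_gb(8e9), 1))  # 14.9
```

Doubling parameter count roughly doubles weight memory at the same precision, which is why the ≤4B models fit comfortably on a single consumer GPU while >8B models often do not.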
Architecture: dense vs MoE

Most models released since 2023 use a “dense” architecture, meaning that every parameter is activated when processing each input token. An alternative is the “Mixture-of-Experts” (MoE) architecture, which routes each token to a small subset of “expert” layers, increasing model capacity at lower inference cost. However, MoE models can be temperamental to fine-tune: expert routing can become unbalanced during training, and small or narrow datasets may not adequately update all experts. If you’re new to fine-tuning or working with limited data, dense models tend to offer more predictable results.
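To see why narrow data can starve experts, consider a toy top-1 router. The hash function and expert count below are illustrative stand-ins for a learned gating network, not taken from any real model:

```python
from collections import Counter

def route(token_id: int, n_experts: int = 8) -> int:
    # Toy deterministic stand-in for a learned top-1 gating network.
    return (token_id * 31 + 7) % n_experts

# A narrow dataset reuses a small vocabulary, so only a few experts
# ever receive tokens; the rest get no gradient signal while training.
narrow_tokens = [5, 17, 5, 17, 42, 5, 42, 17, 5, 42]
broad_tokens = list(range(100))

narrow_load = Counter(route(t) for t in narrow_tokens)
broad_load = Counter(route(t) for t in broad_tokens)
print(len(narrow_load), "of 8 experts see the narrow set")  # 3 of 8
print(len(broad_load), "of 8 experts see the broad set")    # 8 of 8
```

Real MoE training adds load-balancing auxiliary losses to counteract exactly this skew, but with a small fine-tuning set the untouched experts can still drift or go stale.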
Task scope

Small models work well for narrow tasks with a restricted output space (e.g., classification, entity extraction, routing), and they are cheaper and faster than larger models. In internal testing, we have found that SmolLM2 models punch well above their weight across a number of classification tasks; the Llama 3.x series is also an excellent choice for classification. Reserve larger models for open-ended generation, problem-solving, and tasks with hard but varying constraints on response format.
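For restricted-output tasks, it often helps to snap a small model’s free-form generation onto the allowed label set rather than trusting it verbatim. A minimal sketch; the normalization rules are illustrative, and real pipelines may instead use constrained decoding or compare log-probabilities over the label tokens:

```python
def to_label(generation: str, labels: list[str]) -> str:
    """Snap a free-form model generation onto a fixed label set."""
    cleaned = generation.strip(" .'\"").lower()
    for label in labels:
        # Accept exact matches and labels embedded in a longer answer.
        if label == cleaned or label in cleaned:
            return label
    return "unknown"

labels = ["positive", "negative", "neutral"]
print(to_label("Positive.", labels))                  # positive
print(to_label("The sentiment is negative", labels))  # negative
```

Fine-tuning on the narrow task usually makes the raw output well-behaved, but a validation layer like this keeps production behavior predictable either way.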
Task complexity and reasoning

Some models natively support “reasoning-style” generation, which consistently outperforms non-reasoning generation on complex technical tasks requiring multiple steps. Be aware, however, that reasoning models produce many more tokens per prompt, incurring additional latency and cost. If your tasks are consistently complex, consider a reasoning model. If complexity varies, the Qwen3 series offers “hybrid” reasoning that can adapt its depth based on the prompt.
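The cost side is simple arithmetic: output cost scales linearly with tokens generated, so a reasoning trace that is 10x longer costs roughly 10x more per response. A sketch with hypothetical token counts and prices (real prices vary by provider and model):

```python
def output_cost_usd(output_tokens: int, usd_per_million_tokens: float) -> float:
    # Illustrative linear pricing; real pricing varies by provider.
    return output_tokens * usd_per_million_tokens / 1e6

# Hypothetical numbers: a direct answer averaging 200 output tokens
# vs. a reasoning trace pushing the same answer to 2,000 tokens.
plain = output_cost_usd(200, usd_per_million_tokens=0.60)
reasoning = output_cost_usd(2_000, usd_per_million_tokens=0.60)
print(f"{reasoning / plain:.0f}x output cost")  # 10x output cost
```

Latency scales similarly, since decoding time is roughly proportional to tokens generated.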
Tool use

If your use case involves RAG, web search, or generating API calls to internal endpoints, choose a model trained with native tool-use capabilities. Qwen3, gpt-oss-20b, and the Llama 3.x/4 instruct variants are post-trained with extensive tool-use data.
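On the application side, a tool-capable model emits structured calls that your code parses and dispatches. A minimal sketch, assuming the model emits JSON objects with "name" and "arguments" keys; that is a common convention rather than a universal standard, so check your model card for the exact format, and note the registry below is hypothetical:

```python
import json

# Hypothetical registry mapping tool names to internal endpoints.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted tool call and run the matching function."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])

print(dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
# Sunny in Paris
```

Models post-trained on tool-use data emit calls that match their declared schema far more reliably, which is what makes them worth preferring for agentic use cases.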
Context length

If your tasks require the model to process large blocks of text (e.g., RAG over large documents, summarization), choose a model that supports a large training context window. Qwen3 models support training with inputs up to 32K tokens, while smaller models like SmolLM2 are limited to 2K or 4K tokens.
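A quick pre-check is to estimate token count from character length. The 4-characters-per-token figure below is a rough heuristic for English text; use the model’s actual tokenizer for an exact count:

```python
def fits_context(text: str, context_tokens: int, chars_per_token: float = 4.0) -> bool:
    """Rough pre-check; use the model's tokenizer for an exact count."""
    return len(text) / chars_per_token <= context_tokens

doc = "x" * 100_000  # roughly 25K estimated tokens
print(fits_context(doc, 32_000))  # True: fits a 32K window
print(fits_context(doc, 4_000))   # False: too large for a 2K/4K window
```

Remember that prompts, chat-template overhead, and the generated response all share the same window, so leave headroom below the advertised maximum.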
Language support

Different model families prioritize different languages. Gemma 3 supports approximately 140 languages. Qwen3 offers excellent multilingual capabilities with support for 119 languages, and is the clear choice if support for Asian languages (especially Chinese) is important for your use case. SmolLM2 excels in English but has limited support for other languages. Llama 4 has much more extensive multilingual support than Llama 3.
Data provenance

All supported models are open-weight with publicly available model cards. SmolLM2 was trained entirely on publicly available data. Current trends in model development rely heavily on synthetic data, which makes up a substantial fraction of the training data mixes of all models we support. If training-data transparency or specific licensing terms are important for your deployment, review the model card for your chosen base model before starting training.
Available Models (Feb 2026)

The following models are currently supported for fine-tuning in Oumi, organized by size to help you quickly identify the right balance of capability, latency, and cost for your use case.
Compact Models (<2B parameters)

Best for edge deployment, low-latency applications, classification, and rapid prototyping.

| Model | Size | Max Context | Notes |
|---|---|---|---|
| SmolLM2-135M-Instruct | 135M | 2K | Fastest inference, excellent for classification |
| SmolLM2-360M-Instruct | 360M | 2K | Slightly more capable while staying lightweight |
| Qwen3-0.6B | 0.6B | 32K | Strong for its size, good multilingual support |
| Llama-3.2-1B-Instruct | 1B | 8K | Solid general-purpose compact model |
| Qwen2.5-1.5B-Instruct | 1.5B | 8K | Balanced capability and efficiency |
| SmolLM2-1.7B-Instruct | 1.7B | 2K | Top of the compact range |
Mid-size Models (3B–8B parameters)
Good balance of quality and speed for most production use cases.

| Model | Size | Max Context | Notes |
|---|---|---|---|
| Llama-3.2-3B-Instruct | 3B | 8K | Reliable general-purpose, tool-use capable |
| Qwen2.5-3B-Instruct | 3B | 8K | Strong multilingual and coding ability |
| Phi-3.5-mini-instruct | 3.8B | 4K | Excels at reasoning and math |
| Gemma-3-4B-IT | 4B | 8K | 140+ languages, well-rounded |
| Qwen3-4B-Instruct | 4B | 32K | Long context, hybrid reasoning |
| Qwen2.5-7B-Instruct | 7B | 8K | Versatile mid-size option |
| Qwen3-8B | 8B | 32K | Hybrid reasoning, strong tool use |
| Llama-3.1-8B-Instruct | 8B | 8K | Industry standard, excellent tool use |
Larger Models (>8B parameters)
Best for complex reasoning, technical tasks, and when quality is the priority.

| Model | Size | Max Context | Notes |
|---|---|---|---|
| Phi-3.5-MoE-instruct | 16x3.8B (MoE) | 4K | MoE architecture; fast inference for its capacity |
| Llama-4-Scout-17B-16E | 17B (MoE) | 8K | Latest Llama with mixture-of-experts |
| gpt-oss-20b | 20B | 8K | Strong tool use, semantic tasks |
| Qwen3-32B | 32B | 32K | Largest available; best raw capability |
Recommendations by use case
| Use Case | Recommended Models |
|---|---|
| Classification / Narrow tasks | SmolLM2-135M, SmolLM2-360M, SmolLM2-1.7B |
| Latency-critical / Edge | SmolLM2-135M, Qwen3-0.6B, Llama-3.2-1B |
| Tool use / RAG / Agents | Qwen3-4B-Instruct, Qwen3-8B, Llama-3.1-8B, gpt-oss-20b |
| Code generation | Qwen2.5-7B, Qwen3-8B, Llama-3.1-8B |
| General assistant / Chatbot | Llama-3.1-8B, Qwen3-8B, Gemma-3-4B |
| Complex reasoning | Qwen3-32B, Qwen3-8B, gpt-oss-20b |
| Multilingual | Gemma-3-4B (140 languages), Qwen3 family (119 languages), Llama 4 |
| Asian languages | Qwen3 family |
| Long documents | Qwen3 family (32K context) |