Oumi provides a variety of base models for fine-tuning, and the list of supported models is always evolving as exciting new models are released. The base model you select is the most important factor determining your fine-tuned model’s capabilities, latency, and token efficiency. Each base model offers unique strengths; the right base model for your needs will depend primarily on the nature of your use case and infrastructure requirements. A few of the most important factors are:
  • Task complexity
  • Latency
  • Cost
  • Language support
  • Tool use and agentic capabilities
You should weigh these factors carefully to select a base model that delivers the right balance of performance, efficiency, and cost for your specific application.

Key considerations

Selecting the right base model requires balancing performance, cost, and capability across several technical dimensions; each of the considerations below can meaningfully impact your fine-tuning results and production deployment.

Model size and latency

Smaller models (≤4B parameters) offer the fastest inference speed and can be an order of magnitude cheaper to deploy at scale. Larger models (>8B) provide stronger reasoning and instruction-following capacity, but come with higher latency and cost. For latency-sensitive applications or on-device deployments, start with one of the most compact models, and scale up only if the resulting quality doesn’t meet your needs.
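As a rough back-of-envelope check, you can estimate the GPU memory needed just to hold a model's weights from its parameter count. This is a sketch using the common rule of thumb of 2 bytes per parameter for fp16/bf16; the function name and exact figures are illustrative, and actual deployments need additional memory for activations and KV cache.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Rough GPU memory (GB) needed to hold model weights alone.

    bytes_per_param: 2.0 for fp16/bf16, 1.0 for int8, 0.5 for 4-bit quantization.
    Excludes activations, KV cache, and optimizer state, which add more on top.
    """
    return num_params * bytes_per_param / 1e9

# Compact vs mid-size: roughly a 5x difference in weight memory alone.
print(f"SmolLM2-1.7B (bf16): ~{weight_memory_gb(1.7e9):.1f} GB")
print(f"Llama-3.1-8B (bf16): ~{weight_memory_gb(8e9):.1f} GB")
print(f"Llama-3.1-8B (4-bit): ~{weight_memory_gb(8e9, 0.5):.1f} GB")
```

The same arithmetic explains why compact models are attractive for edge deployment: a 135M-parameter model fits comfortably in well under 1 GB even at full precision.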

Architecture: dense vs MoE

Most models released since 2023 use a “dense” architecture, meaning that all parameters are activated when processing each token of each input. An alternative to dense models is the “Mixture-of-Experts” (MoE) architecture, which routes each token to a subset of “expert” layers. This increases model capacity without a proportional increase in inference cost. However, MoE models can be temperamental to fine-tune: expert routing can become unbalanced during training, and small or narrow datasets may not adequately update all experts. If you’re new to fine-tuning or working with limited data, dense models tend to offer more predictable results.
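The routing-imbalance problem can be illustrated with a toy simulation. This is not an actual MoE implementation (real routers are learned gating networks); the `skew` parameter is a stand-in for the bias a narrow fine-tuning dataset can induce toward a few experts.

```python
import random
from collections import Counter

def route_tokens(num_tokens: int, num_experts: int, skew: float, seed: int = 0) -> Counter:
    """Toy top-1 router: each token is sent to exactly one expert.

    `skew` is the probability a token is forced to expert 0, mimicking
    the unbalanced routing a narrow dataset can induce during training.
    """
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(num_tokens):
        if rng.random() < skew:
            counts[0] += 1  # over-used expert
        else:
            counts[rng.randrange(num_experts)] += 1
    return counts

balanced = route_tokens(10_000, 8, skew=0.0)
skewed = route_tokens(10_000, 8, skew=0.6)
# With skewed routing, expert 0 absorbs most tokens while the remaining
# experts receive few gradient updates during fine-tuning.
print(balanced.most_common(1), skewed.most_common(1))
```

Production MoE trainers counter this with auxiliary load-balancing losses, which is part of why MoE fine-tuning has more moving parts than dense fine-tuning.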

Task scope

Small models work well for narrow tasks with a restricted output space (e.g., classification, entity extraction, routing), and they are cheaper and faster than larger models. In internal testing, we have found that SmolLM2 models punch well above their weight across a range of classification tasks. The Llama 3.x series is also an excellent choice for classification. Reserve larger models for open-ended generation, problem-solving, and tasks with strict but varying constraints on response format.

Task complexity and reasoning

Some models are trained to natively support “reasoning-style” generation, which consistently outperforms non-reasoning generation on complex technical tasks requiring multiple steps. But be aware that reasoning models produce many more tokens per prompt (incurring additional latency and cost). If your tasks are consistently complex, consider a reasoning model. If complexity varies, the Qwen3 series offers “hybrid” reasoning that can adapt its depth based on the prompt.
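The latency and cost impact of reasoning-style generation is easy to quantify. The sketch below uses hypothetical per-million-token prices and token counts, not real quotes; the point is only that output tokens dominate the bill once a model emits a long chain of thought before answering.

```python
def completion_cost(prompt_tokens: int, output_tokens: int,
                    usd_per_1m_input: float, usd_per_1m_output: float) -> float:
    """Estimated cost of a single request (illustrative prices only)."""
    return (prompt_tokens * usd_per_1m_input +
            output_tokens * usd_per_1m_output) / 1e6

# Same 500-token prompt; the reasoning model emits ~10x the output tokens.
plain = completion_cost(500, 200, usd_per_1m_input=0.1, usd_per_1m_output=0.4)
reasoning = completion_cost(500, 2_000, usd_per_1m_input=0.1, usd_per_1m_output=0.4)
print(f"{reasoning / plain:.1f}x")  # roughly 6.5x the per-request cost
```

Latency scales similarly, since output tokens are generated sequentially; this is why hybrid-reasoning models that adapt their depth per prompt are attractive for mixed workloads.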

Tool use

If your use case involves RAG, web search, or generation of API calls to internal endpoints, choose a model trained with native tool-use capabilities. Qwen3, gpt-oss-20b, and the Llama 3.x/4 instruct variants are post-trained with extensive tool-use data.

Context length

If your tasks require the model to process large blocks of text (e.g., RAG with large documents, summarization), choose a model that supports a large training context window. Qwen3 models support training with inputs up to 32k tokens. Training with smaller models like SmolLM is limited to 2k or 4k tokens.
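Before committing to a base model, it is worth checking whether your longest training examples actually fit its context window. This sketch uses a whitespace word count times a fudge factor as a crude token estimate; in practice you should use the model's own tokenizer, and the 1.3 tokens-per-word ratio is an assumption that varies by language and domain.

```python
def fits_context(text: str, max_tokens: int, tokens_per_word: float = 1.3) -> bool:
    """Crude check that a document fits a training context window.

    Estimates tokens as whitespace words times a fudge factor; replace
    with the target model's real tokenizer for production use.
    """
    est_tokens = int(len(text.split()) * tokens_per_word)
    return est_tokens <= max_tokens

doc = "word " * 3000                # ~3,900 estimated tokens
print(fits_context(doc, 2_048))     # too long for a 2K model like SmolLM2
print(fits_context(doc, 32_768))    # fits a 32K window like Qwen3's
```

Running a check like this over your dataset up front avoids silent truncation of long examples during fine-tuning.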

Language support

Different model families prioritize different languages. Gemma 3 supports approximately 140 languages. Qwen3 offers excellent multilingual capabilities with support for 119 languages, and is the clear choice if support for Asian languages (especially Chinese) is important for your use case. SmolLM2 excels in English but has limited support for other languages. Llama 4 has much more extensive multilingual support than Llama 3.

Data provenance

All supported models are open-weight with publicly available model cards. SmolLM2 was trained entirely with publicly available data. Current trends in model development rely heavily on synthetic data, which makes up a substantial fraction of the training data mixes of all models we support. If training data transparency or specific licensing terms are important for your deployment, review the model card for your chosen base model before starting training.

Available Models (Feb 2026)

The following models are currently supported for fine-tuning in Oumi, organized by size to help you quickly identify the right balance of capability, latency, and cost for your use case.

Compact Models (<2B parameters)

Best for edge deployment, low-latency applications, classification, and rapid prototyping.
| Model | Size | Max Context | Notes |
| --- | --- | --- | --- |
| SmolLM2-135M-Instruct | 135M | 2K | Fastest inference, excellent for classification |
| SmolLM2-360M-Instruct | 360M | 2K | Slightly more capable while staying lightweight |
| Qwen3-0.6B | 0.6B | 32K | Strong for its size, good multilingual support |
| Llama-3.2-1B-Instruct | 1B | 8K | Solid general-purpose compact model |
| Qwen2.5-1.5B-Instruct | 1.5B | 8K | Balanced capability and efficiency |
| SmolLM2-1.7B-Instruct | 1.7B | 2K | Top of the compact range |

Mid-size Models (3B–8B parameters)

Good balance of quality and speed for most production use cases.
| Model | Size | Max Context | Notes |
| --- | --- | --- | --- |
| Llama-3.2-3B-Instruct | 3B | 8K | Reliable general-purpose, tool-use capable |
| Qwen2.5-3B-Instruct | 3B | 8K | Strong multilingual and coding ability |
| Phi-3.5-mini-instruct | 3.8B | 4K | Excels at reasoning and math |
| Gemma-3-4B-IT | 4B | 8K | 140+ languages, well-rounded |
| Qwen3-4B-Instruct | 4B | 32K | Long context, hybrid reasoning |
| Qwen2.5-7B-Instruct | 7B | 8K | Versatile mid-size option |
| Qwen3-8B | 8B | 32K | Hybrid reasoning, strong tool use |
| Llama-3.1-8B-Instruct | 8B | 8K | Industry standard, excellent tool use |

Larger Models (>8B parameters)

Best for complex reasoning, technical tasks, and when quality is the priority.
| Model | Size | Max Context | Notes |
| --- | --- | --- | --- |
| Phi-3.5-MoE-instruct | 16x3.8B (MoE) | 4K | MoE architecture; fast inference for its capacity |
| Llama-4-Scout-17B-16E | 17B (MoE) | 8K | Latest Llama with mixture-of-experts |
| gpt-oss-20b | 20B | 8K | Strong tool use, semantic tasks |
| Qwen3-32B | 32B | 32K | Largest available; best raw capability |

Recommendations by use case

| Use Case | Recommended Models |
| --- | --- |
| Classification / Narrow tasks | SmolLM2-135M, SmolLM2-360M, SmolLM2-1.7B |
| Latency-critical / Edge | SmolLM2-135M, Qwen3-0.6B, Llama-3.2-1B |
| Tool use / RAG / Agents | Qwen3-4B-Instruct-2507, Qwen3-8B, Llama-3.1-8B, gpt-oss-20b |
| Code generation | Qwen2.5-7B, Qwen3-8B, Llama-3.1-8B |
| General assistant / Chatbot | Llama-3.1-8B, Qwen3-8B, Gemma-3-4B |
| Complex reasoning | Qwen3-32B, Qwen3-8B, gpt-oss-20b |
| Multilingual | Gemma-3-4B (140 languages), Qwen3 family (119 languages), Llama 4 |
| Asian languages | Qwen3 family |
| Long documents | Qwen3 family (32K context) |