- Task complexity
- Latency
- Cost
- Language support
- Tool use and agentic capabilities
Key considerations
Selecting the right base model requires balancing performance, cost, and capability across several technical dimensions; each of the considerations below can meaningfully impact your fine-tuning results and production deployment.
Model size and latency

Smaller models (≤4B parameters) offer the fastest inference and can be an order of magnitude cheaper to deploy at scale. Larger models (>8B) provide stronger reasoning and instruction-following capacity, but come with higher latency and cost. For latency-sensitive applications or on-device deployments, start with one of the most compact models and scale up only if the resulting quality doesn’t meet your needs.
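The size/cost trade-off is easy to sanity-check with back-of-envelope arithmetic: weight memory scales linearly with parameter count and numeric precision. A rough sketch (the helper name and the weights-only simplification are ours, not part of any framework):

```python
def estimate_vram_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only memory footprint in GiB.

    bytes_per_param: 2 for fp16/bf16, 1 for int8, 4 for fp32.
    KV cache, activations, and runtime overhead are extra.
    """
    return n_params * bytes_per_param / 1024**3

# A 4B model in bf16 needs roughly 7.5 GiB just for weights;
# an 8B model needs roughly 14.9 GiB, before KV cache and activations.
print(round(estimate_vram_gb(4e9), 1))  # 7.5
print(round(estimate_vram_gb(8e9), 1))  # 14.9
```

Doubling parameter count roughly doubles weight memory at the same precision, which is why the ≤4B models fit comfortably on a single consumer GPU while >8B models often do not.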
Architecture: dense vs MoE

Most models released since 2023 use a “dense” architecture, meaning that every parameter is activated when processing each input token. An alternative is the “Mixture-of-Experts” (MoE) architecture, which routes each token to a small subset of “expert” layers, increasing model capacity at lower inference cost. However, MoE models can be temperamental to fine-tune: expert routing can become unbalanced during training, and small or narrow datasets may not adequately update all experts. If you’re new to fine-tuning or working with limited data, dense models tend to offer more predictable results.
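To see why narrow data can starve experts, consider a toy top-1 router. The hash function and expert count below are illustrative stand-ins for a learned gating network, not taken from any real model:

```python
from collections import Counter

def route(token_id: int, n_experts: int = 8) -> int:
    # Toy deterministic stand-in for a learned top-1 gating network.
    return (token_id * 31 + 7) % n_experts

# A narrow dataset reuses a small vocabulary, so only a few experts
# ever receive tokens; the rest get no gradient signal while training.
narrow_tokens = [5, 17, 5, 17, 42, 5, 42, 17, 5, 42]
broad_tokens = list(range(100))

narrow_load = Counter(route(t) for t in narrow_tokens)
broad_load = Counter(route(t) for t in broad_tokens)
print(len(narrow_load), "of 8 experts see the narrow set")  # 3 of 8
print(len(broad_load), "of 8 experts see the broad set")    # 8 of 8
```

Real MoE training adds load-balancing auxiliary losses to counteract exactly this skew, but with a small fine-tuning set the untouched experts can still drift or go stale.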
Task scope

Small models work well for narrow tasks with a restricted output space (e.g., classification, entity extraction, routing), and they are cheaper and faster than larger models. In internal testing, we have found that SmolLM2 models punch well above their weight across a number of classification tasks; the Llama 3.x series is also an excellent choice for classification. Reserve larger models for open-ended generation, problem-solving, and tasks with hard but varying constraints on response format.
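For restricted-output tasks, it often helps to snap a small model’s free-form generation onto the allowed label set rather than trusting it verbatim. A minimal sketch; the normalization rules are illustrative, and real pipelines may instead use constrained decoding or compare log-probabilities over the label tokens:

```python
def to_label(generation: str, labels: list[str]) -> str:
    """Snap a free-form model generation onto a fixed label set."""
    cleaned = generation.strip(" .'\"").lower()
    for label in labels:
        # Accept exact matches and labels embedded in a longer answer.
        if label == cleaned or label in cleaned:
            return label
    return "unknown"

labels = ["positive", "negative", "neutral"]
print(to_label("Positive.", labels))                  # positive
print(to_label("The sentiment is negative", labels))  # negative
```

Fine-tuning on the narrow task usually makes the raw output well-behaved, but a validation layer like this keeps production behavior predictable either way.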
Task complexity and reasoning

Some models natively support “reasoning-style” generation, which consistently outperforms non-reasoning generation on complex technical tasks requiring multiple steps. Be aware, however, that reasoning models produce many more tokens per prompt, incurring additional latency and cost. If your tasks are consistently complex, consider a reasoning model. If complexity varies, the Qwen3 series offers “hybrid” reasoning that can adapt its depth based on the prompt.
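The cost side is simple arithmetic: output cost scales linearly with tokens generated, so a reasoning trace that is 10x longer costs roughly 10x more per response. A sketch with hypothetical token counts and prices (real prices vary by provider and model):

```python
def output_cost_usd(output_tokens: int, usd_per_million_tokens: float) -> float:
    # Illustrative linear pricing; real pricing varies by provider.
    return output_tokens * usd_per_million_tokens / 1e6

# Hypothetical numbers: a direct answer averaging 200 output tokens
# vs. a reasoning trace pushing the same answer to 2,000 tokens.
plain = output_cost_usd(200, usd_per_million_tokens=0.60)
reasoning = output_cost_usd(2_000, usd_per_million_tokens=0.60)
print(f"{reasoning / plain:.0f}x output cost")  # 10x output cost
```

Latency scales similarly, since decoding time is roughly proportional to tokens generated.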
Tool use

If your use case involves RAG, web search, or generating API calls to internal endpoints, choose a model trained with native tool-use capabilities. Qwen3, gpt-oss-20b, and the Llama 3.x/4 instruct variants are post-trained with extensive tool-use data.
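On the application side, a tool-capable model emits structured calls that your code parses and dispatches. A minimal sketch, assuming the model emits JSON objects with "name" and "arguments" keys; that is a common convention rather than a universal standard, so check your model card for the exact format, and note the registry below is hypothetical:

```python
import json

# Hypothetical registry mapping tool names to internal endpoints.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted tool call and run the matching function."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])

print(dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
# Sunny in Paris
```

Models post-trained on tool-use data emit calls that match their declared schema far more reliably, which is what makes them worth preferring for agentic use cases.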
Context length

If your tasks require the model to process large blocks of text (e.g., RAG over large documents, summarization), choose a model that supports a large training context window. Qwen3 models support training with inputs up to 32K tokens, while smaller models like SmolLM2 are limited to 2K or 4K tokens.
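A quick pre-check is to estimate token count from character length. The 4-characters-per-token figure below is a rough heuristic for English text; use the model’s actual tokenizer for an exact count:

```python
def fits_context(text: str, context_tokens: int, chars_per_token: float = 4.0) -> bool:
    """Rough pre-check; use the model's tokenizer for an exact count."""
    return len(text) / chars_per_token <= context_tokens

doc = "x" * 100_000  # roughly 25K estimated tokens
print(fits_context(doc, 32_000))  # True: fits a 32K window
print(fits_context(doc, 4_000))   # False: too large for a 2K/4K window
```

Remember that prompts, chat-template overhead, and the generated response all share the same window, so leave headroom below the advertised maximum.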
Language support

Different model families prioritize different languages. Gemma 3 supports approximately 140 languages. Qwen3 offers excellent multilingual capabilities with support for 119 languages, and is the clear choice if support for Asian languages (especially Chinese) is important for your use case. SmolLM2 excels in English but has limited support for other languages. Llama 4 has much more extensive multilingual support than Llama 3.
Data provenance

All supported models are open-weight with publicly available model cards. SmolLM2 was trained entirely on publicly available data. Current trends in model development rely heavily on synthetic data, which makes up a substantial fraction of the training data mixes of all models we support. If training-data transparency or specific licensing terms are important for your deployment, review the model card for your chosen base model before starting training.
Available Models (Feb 2026)

The following models are currently supported for fine-tuning in Oumi, organized by size to help you quickly identify the right balance of capability, latency, and cost for your use case.
Compact Models (<2B parameters)

Best for edge deployment, low-latency applications, classification, and rapid prototyping.

| Model | Size | Max Context | Notes |
|---|---|---|---|
| SmolLM2-135M-Instruct | 135M | 2K | Fastest inference, excellent for classification |
| SmolLM2-360M-Instruct | 360M | 2K | Slightly more capable while staying lightweight |
| Qwen3-0.6B | 0.6B | 32K | Strong for its size, good multilingual support |
| Llama-3.2-1B-Instruct | 1B | 8K | Solid general-purpose compact model |
| Qwen2.5-1.5B-Instruct | 1.5B | 8K | Balanced capability and efficiency |
| SmolLM2-1.7B-Instruct | 1.7B | 2K | Top of the compact range |
Mid-size Models (3B–8B parameters)
Good balance of quality and speed for most production use cases.

| Model | Size | Max Context | Notes |
|---|---|---|---|
| Llama-3.2-3B-Instruct | 3B | 8K | Reliable general-purpose, tool-use capable |
| Qwen2.5-3B-Instruct | 3B | 8K | Strong multilingual and coding ability |
| Phi-3.5-mini-instruct | 3.8B | 4K | Excels at reasoning and math |
| Gemma-3-4B-IT | 4B | 8K | 140+ languages, well-rounded |
| Qwen3-4B-Instruct | 4B | 32K | Long context, hybrid reasoning |
| Qwen2.5-7B-Instruct | 7B | 8K | Versatile mid-size option |
| Qwen3-8B | 8B | 32K | Hybrid reasoning, strong tool use |
| Llama-3.1-8B-Instruct | 8B | 8K | Industry standard, excellent tool use |
Larger Models (>8B parameters)
Best for complex reasoning, technical tasks, and when quality is the priority.

| Model | Size | Max Context | Notes |
|---|---|---|---|
| Phi-3.5-MoE-instruct | 16x3.8B (MoE) | 4K | MoE architecture; fast inference for its capacity |
| Llama-4-Scout-17B-16E | 17B (MoE) | 8K | Latest Llama with mixture-of-experts |
| gpt-oss-20b | 20B | 8K | Strong tool use, semantic tasks |
| Qwen3-32B | 32B | 32K | Largest available; best raw capability |
Recommendations by use case
| Use Case | Recommended Models |
|---|---|
| Classification / Narrow tasks | SmolLM2-135M, SmolLM2-360M, SmolLM2-1.7B |
| Latency-critical / Edge | SmolLM2-135M, Qwen3-0.6B, Llama-3.2-1B |
| Tool use / RAG / Agents | Qwen3-4B-Instruct, Qwen3-8B, Llama-3.1-8B, gpt-oss-20b |
| Code generation | Qwen2.5-7B, Qwen3-8B, Llama-3.1-8B |
| General assistant / Chatbot | Llama-3.1-8B, Qwen3-8B, Gemma-3-4B |
| Complex reasoning | Qwen3-32B, Qwen3-8B, gpt-oss-20b |
| Multilingual | Gemma-3-4B (140 languages), Qwen3 family (119 languages), Llama 4 |
| Asian languages | Qwen3 family |
| Long documents | Qwen3 family (32K context) |