# Deployment workflow
Deployment in Oumi follows a straightforward sequence:

- Export your trained model from the Oumi platform
- Choose an inference target: run locally on your own hardware, or deploy to a cloud provider
- Serve the model using a compatible inference engine (e.g., vLLM, Hugging Face Transformers)
- Monitor and iterate: re-evaluate and retrain as production data evolves
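Once the model is being served, any OpenAI-compatible client can query it, since vLLM exposes an OpenAI-compatible API. A minimal sketch of the serve-and-query step, assuming the model was exported to `./exported-model` (a hypothetical path) and is served locally by vLLM on port 8000:

```python
import json
import urllib.request

# Assumptions: the exported model lives at ./exported-model (hypothetical
# path) and vLLM is serving it at this OpenAI-compatible endpoint.
ENDPOINT = "http://localhost:8000/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "./exported-model",
                       max_tokens: int = 128) -> dict:
    """Build an OpenAI-compatible chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def query(prompt: str) -> str:
    """POST the payload to the running server and return the reply text.

    Requires a live server, e.g. one started with `vllm serve`.
    """
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

`build_chat_request` and `query` are illustrative helper names, not part of the Oumi or vLLM APIs; the same request shape works against cloud endpoints that expose an OpenAI-compatible interface.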
## Choosing a deployment target
The right deployment target depends on your latency requirements, data privacy needs, and infrastructure preferences.

| | Local Inference | Cloud Inference |
|---|---|---|
| Best for | Development, testing, air-gapped environments | Production, high-throughput, scalable APIs |
| Hardware | Your own GPU or CPU | Cloud GPU instances (AWS, GCP, Lambda, etc.) |
| Data privacy | Full control; data never leaves your machine | Depends on provider and configuration |
| Setup effort | Low; single command with vLLM | Moderate; instance provisioning required |
| Scalability | Limited to local resources | Scales horizontally on demand |
| Cost | Infrastructure you already own | Pay-per-use or reserved instance pricing |
## Local inference
Run your exported model directly on your own hardware using vLLM or Hugging Face Transformers. This is the fastest way to get a model running after export and is ideal for iterative testing, internal tools, and privacy-sensitive workloads. Learn more about Local Inference →

## Cloud inference
Deploy your exported model to a cloud provider for scalable, production-grade serving. Oumi-exported models are compatible with several managed inference platforms and GPU cloud providers, including AWS Bedrock and Lambda. Learn more about Cloud Inference →

## What's next
- **Exporting Your Model**: Download your trained model artifacts from Oumi.
- **Local Inference**: Serve your model on your own hardware with vLLM or Hugging Face.
- **Cloud Inference**: Deploy to AWS, Lambda, or another GPU cloud provider.