Once you’ve trained and evaluated a model that meets your performance goals, the final step is to export it from Oumi and deploy it for inference. Oumi exports models in a standard format compatible with popular inference engines, giving you full flexibility over where and how you serve your model.

Deployment workflow

Deployment in Oumi follows a straightforward sequence:
  1. Export your trained model from the Oumi platform
  2. Choose an inference target: run locally on your own hardware, or deploy to a cloud provider
  3. Serve the model using a compatible inference engine (e.g., vLLM, Hugging Face Transformers)
  4. Monitor and iterate: re-evaluate and retrain as production data evolves

Choosing a deployment target

The right deployment target depends on your latency requirements, data privacy needs, and infrastructure preferences.
|               | Local inference                                | Cloud inference                              |
| ------------- | ---------------------------------------------- | -------------------------------------------- |
| Best for      | Development, testing, air-gapped environments  | Production, high-throughput, scalable APIs   |
| Hardware      | Your own GPU or CPU                            | Cloud GPU instances (AWS, GCP, Lambda, etc.) |
| Data privacy  | Full control; data never leaves your machine   | Depends on provider and configuration        |
| Setup effort  | Low; single command with vLLM                  | Moderate; instance provisioning required     |
| Scalability   | Limited to local resources                     | Scales horizontally on demand                |
| Cost          | Infrastructure you already own                 | Pay-per-use or reserved instance pricing     |

Local inference

Run your exported model directly on your own hardware using vLLM or Hugging Face Transformers. This is the fastest way to get a model running after export and is ideal for iterative testing, internal tools, and privacy-sensitive workloads. Learn more about Local Inference →
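As a minimal sketch of the local path: vLLM exposes an OpenAI-compatible HTTP API (by default on port 8000) once you start a server, e.g. with `vllm serve ./exported-model`. The model path here is an assumption — substitute wherever you exported your model. The snippet builds a chat-completion request for that server; the actual HTTP call is shown commented out since it requires a running server.

```python
import json

# Assumed model path and the default vLLM server port. Start the server
# first with:  vllm serve ./exported-model
VLLM_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-compatible chat-completion payload for vLLM."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


payload = build_chat_request("./exported-model", "Summarize this model's purpose.")
print(json.dumps(payload, indent=2))

# With the server running (requires the `requests` package):
# import requests
# response = requests.post(VLLM_URL, json=payload, timeout=60)
# print(response.json()["choices"][0]["message"]["content"])
```

Because the request format follows the OpenAI API, the same payload works unchanged against any OpenAI-compatible endpoint, which makes it easy to move between local and cloud serving later.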

Cloud inference

Deploy your exported model to a cloud provider for scalable, production-grade serving. Oumi-exported models are compatible with several managed inference platforms and GPU cloud providers, including AWS Bedrock and Lambda. Learn more about Cloud Inference →
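As a hedged sketch of the cloud path, the snippet below shows what invoking a model imported into AWS Bedrock might look like via `boto3`'s `bedrock-runtime` client. The model ARN is a placeholder, and the request-body schema varies by model family (the `prompt`/`max_gen_len` fields shown are one common shape); check your provider's console for the exact values. The `invoke_model` call itself is commented out since it requires AWS credentials.

```python
import json

# Placeholder ARN for a model imported into Bedrock; substitute the ARN
# shown in your AWS console after import.
MODEL_ARN = "arn:aws:bedrock:us-east-1:123456789012:imported-model/EXAMPLE"


def build_invoke_body(prompt: str, max_tokens: int = 128) -> str:
    """Serialize a text-generation request body for Bedrock invoke_model.
    NOTE: the expected fields depend on the imported model's family."""
    return json.dumps({"prompt": prompt, "max_gen_len": max_tokens})


body = build_invoke_body("Summarize this model's purpose.")
print(body)

# With AWS credentials configured (requires the `boto3` package):
# import boto3
# client = boto3.client("bedrock-runtime", region_name="us-east-1")
# response = client.invoke_model(modelId=MODEL_ARN, body=body)
# print(json.loads(response["body"].read()))
```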

What’s next

Exporting Your Model

Download your trained model artifacts from Oumi.

Local Inference

Serve your model on your own hardware with vLLM or Hugging Face.

Cloud Inference

Deploy to AWS, Lambda, or another GPU cloud provider.