When to use cloud inference
Cloud deployment is the right choice when:
- You need to serve many users or handle variable traffic
- Your use case requires more GPU memory than local hardware provides
- You want managed scaling, uptime guarantees, or geo-distributed serving
- Your team needs a stable API endpoint for an application or integration
Managed inference platforms
These platforms handle infrastructure provisioning and model serving for you, making them the lowest-friction option for getting a model into production.
AWS: Amazon Bedrock
Amazon Bedrock supports custom model import, allowing you to deploy your Oumi-trained model as a managed inference endpoint. Read the AWS blog post for a full walkthrough of importing an Oumi-exported model into Amazon Bedrock.
Best for: Teams already on AWS who want a fully managed endpoint with no infrastructure overhead.
Lambda: on-demand GPU instances
Lambda provides on-demand GPU cloud instances well-suited for hosting an inference server with vLLM. Watch the Lambda deployment video for a step-by-step guide on spinning up an instance, loading your exported model, and making inference requests.
Best for: Teams who want direct GPU access and control over the serving stack without a full cloud commitment.
Self-hosted cloud deployment
If you prefer to manage your own inference server on a cloud GPU instance (from any provider), the setup mirrors the local inference workflow, with your instance replacing your local machine.
General steps
- Export your model from the Oumi platform and transfer the artifacts to your cloud instance (e.g., via scp, S3, or GCS).
- Install vLLM on the instance:
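Assuming a standard Python environment on the instance, installation is typically:

```shell
# Install vLLM into a fresh virtual environment (recommended, to avoid
# dependency conflicts with system packages).
python -m venv .venv && source .venv/bin/activate
pip install vllm
```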
- Start the inference server, pointing to your exported model directory:
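A typical invocation, assuming the exported artifacts live in ./my-model (the path and API key are illustrative placeholders):

```shell
# Serve the exported model. --host 0.0.0.0 binds on all interfaces so
# remote clients can reach it; pair it with firewall rules and an API key.
vllm serve ./my-model --host 0.0.0.0 --port 8000 --api-key my-secret-key
```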
- Make inference requests using the OpenAI-compatible API:
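As a sketch, using only the Python standard library (the model field and address are placeholders; the model name must match whatever you passed to the serve command):

```python
import json
import urllib.request

# Build an OpenAI-compatible chat completion request for the vLLM server.
payload = {
    "model": "./my-model",  # placeholder: the name/path used at serve time
    "messages": [{"role": "user", "content": "Summarize vLLM in one sentence."}],
    "max_tokens": 64,
}
request = urllib.request.Request(
    "http://<your-instance-ip>:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# On a live instance, send the request and print the completion:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```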
Replace <your-instance-ip> with the public IP or hostname of your cloud instance. Ensure port 8000 is open in your security group or firewall rules.
Choosing an instance type
The right instance type depends on your model size, latency requirements, and budget. As a general rule, larger models require more GPU memory: a 7B parameter model typically needs at least 16 GB of GPU VRAM, while a 30B+ model will need significantly more. Consult your cloud provider’s documentation for current instance availability and pricing.
Considerations
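The sizing rule of thumb above can be sketched as a quick back-of-envelope calculation (the 20% overhead factor is an assumption for KV cache and activations, not a measured value):

```python
def estimate_vram_gb(params_billions: float,
                     bytes_per_param: int = 2,
                     overhead: float = 1.2) -> float:
    """Rough serving-memory estimate: model weights (fp16/bf16 = 2 bytes
    per parameter) plus ~20% headroom for KV cache and activations."""
    return params_billions * bytes_per_param * overhead

print(estimate_vram_gb(7))   # a 7B model needs roughly 16-17 GB
print(estimate_vram_gb(30))  # a 30B model needs roughly 72 GB
```

Quantized weights (e.g., 1 byte per parameter for int8) shrink the estimate accordingly.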
- Cost: Cloud GPU instances are billed by the hour. For variable traffic, consider auto-scaling groups or serverless inference platforms to avoid paying for idle capacity.
- Latency: Network round-trip adds latency compared to local inference. Choose a region close to your users and keep request payloads small.
- Security: Restrict access to your inference endpoint using API keys, VPC networking, or IAM policies. Do not expose the vLLM server publicly without authentication.
- Model versioning: Keep exported artifacts versioned in cloud storage (S3, GCS) so you can roll back to a previous model version if needed.
What’s next
Local Inference
Run your model on your own hardware for development and testing.
Evaluating After Deployment
Re-evaluate your model as production data evolves.