When to use cloud inference
Cloud deployment is the right choice when:
- You need to serve many users or handle variable traffic
- Your use case requires more GPU memory than local hardware provides
- You want managed scaling, uptime guarantees, or geo-distributed serving
- Your team needs a stable API endpoint for an application or integration
Managed inference platforms
These platforms handle infrastructure provisioning and model serving for you, making them the lowest-friction option for getting a model into production.
AWS: Amazon Bedrock
Amazon Bedrock supports custom model import, allowing you to deploy your Oumi-trained model as a managed inference endpoint. Read the AWS blog post for a full walkthrough of importing an Oumi-exported model into Amazon Bedrock.
Best for: Teams already on AWS who want a fully managed endpoint with no infrastructure overhead.
Lambda: on-demand GPU instances
Lambda provides on-demand GPU cloud instances well-suited for hosting an inference server with vLLM. Watch the Lambda deployment video for a step-by-step guide on spinning up an instance, loading your exported model, and making inference requests.
Best for: Teams who want direct GPU access and control over the serving stack without a full cloud commitment.
Self-hosted cloud deployment
If you prefer to manage your own inference server on a cloud GPU instance (from any provider), the setup mirrors the local inference workflow, with your instance replacing your local machine.
General steps
- Export your model from the Oumi platform and transfer the artifacts to your cloud instance (e.g., via scp, S3, or GCS).
- Install vLLM on the instance:
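Assuming a standard Python environment on the instance, installation is typically:

```shell
# Install vLLM into a fresh virtual environment (recommended, to avoid
# dependency conflicts with system packages).
python -m venv .venv && source .venv/bin/activate
pip install vllm
```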
- Start the inference server, pointing to your exported model directory:
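A typical invocation, assuming the exported artifacts live in ./my-model (the path and API key are illustrative placeholders):

```shell
# Serve the exported model. --host 0.0.0.0 binds on all interfaces so
# remote clients can reach it; pair it with firewall rules and an API key.
vllm serve ./my-model --host 0.0.0.0 --port 8000 --api-key my-secret-key
```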
- Make inference requests using the OpenAI-compatible API:
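As a sketch, using only the Python standard library (the model field and address are placeholders; the model name must match whatever you passed to the serve command):

```python
import json
import urllib.request

# Build an OpenAI-compatible chat completion request for the vLLM server.
payload = {
    "model": "./my-model",  # placeholder: the name/path used at serve time
    "messages": [{"role": "user", "content": "Summarize vLLM in one sentence."}],
    "max_tokens": 64,
}
request = urllib.request.Request(
    "http://<your-instance-ip>:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# On a live instance, send the request and print the completion:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```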
Replace <your-instance-ip> with the public IP or hostname of your cloud instance. Ensure port 8000 is open in your security group or firewall rules.
Choosing an instance type
The right instance type depends on your model size, latency requirements, and budget. As a general rule, larger models require more GPU memory: a 7B parameter model typically needs at least 16 GB of GPU VRAM, while a 30B+ model will need significantly more. Consult your cloud provider’s documentation for current instance availability and pricing.
Considerations
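The sizing rule of thumb above can be sketched as a quick back-of-envelope calculation (the 20% overhead factor is an assumption for KV cache and activations, not a measured value):

```python
def estimate_vram_gb(params_billions: float,
                     bytes_per_param: int = 2,
                     overhead: float = 1.2) -> float:
    """Rough serving-memory estimate: model weights (fp16/bf16 = 2 bytes
    per parameter) plus ~20% headroom for KV cache and activations."""
    return params_billions * bytes_per_param * overhead

print(estimate_vram_gb(7))   # a 7B model needs roughly 16-17 GB
print(estimate_vram_gb(30))  # a 30B model needs roughly 72 GB
```

Quantized weights (e.g., 1 byte per parameter for int8) shrink the estimate accordingly.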
- Cost: Cloud GPU instances are billed by the hour. For variable traffic, consider auto-scaling groups or serverless inference platforms to avoid paying for idle capacity.
- Latency: Network round-trip adds latency compared to local inference. Choose a region close to your users and keep request payloads small.
- Security: Restrict access to your inference endpoint using API keys, VPC networking, or IAM policies. Do not expose the vLLM server publicly without authentication.
- Model versioning: Keep exported artifacts versioned in cloud storage (S3, GCS) so you can roll back to a previous model version if needed.
What’s next
Local Inference
Run your model on your own hardware for development and testing.
Evaluating After Deployment
Re-evaluate your model as production data evolves.