
# SELF-HOSTED INFERENCE

> Run scalable inference with Oumi in the cloud

Self-hosted inference lets you serve your exported model using GPU-backed infrastructure, making it suitable for production applications, high-throughput batch jobs, and public-facing APIs. Because Oumi exports models in a standard format compatible with popular inference engines, you are not locked into any single cloud provider.

***

## WHEN TO USE SELF-HOSTED INFERENCE

Self-hosted inference is the right choice when:

* You need to serve many users or handle variable traffic
* Your use case requires more GPU memory than local hardware provides
* You want managed scaling, uptime guarantees, or geo-distributed serving
* Your team needs a stable API endpoint for an application or integration

For lightweight testing and development, [local inference](/guides/deployment/local-inference) is usually faster to set up.

***

## MANAGED INFERENCE PLATFORMS

These platforms reduce the infrastructure work needed to get a model into production. Amazon Bedrock provides a fully managed endpoint, while Lambda supplies on-demand GPU instances on which you run the serving stack yourself.

### AWS: AMAZON BEDROCK

Amazon Bedrock supports custom model import, allowing you to deploy your Oumi-trained model as a managed inference endpoint.

Read the [AWS blog post](https://aws.amazon.com/blogs/machine-learning/accelerate-custom-llm-deployment-fine-tune-with-oumi-and-deploy-to-amazon-bedrock/) for a full walkthrough of importing an Oumi-exported model into Amazon Bedrock.

**Best for:** Teams already on AWS who want a fully managed endpoint with no infrastructure overhead.

***

### LAMBDA: ON-DEMAND GPU INSTANCES

Lambda provides on-demand GPU cloud instances well-suited for hosting an inference server with vLLM.

Watch the [Lambda deployment video](https://www.youtube.com/watch?v=0XpfYRpd_FA) for a step-by-step guide on spinning up an instance, loading your exported model, and making inference requests.

**Best for:** Teams who want direct GPU access and control over the serving stack without a full cloud commitment.

***

## SELF-HOSTED CLOUD DEPLOYMENT

If you prefer to manage your own inference server on a cloud GPU instance (from any provider), the setup mirrors the [local inference](/guides/deployment/local-inference) workflow, with your instance replacing your local machine.

### GENERAL STEPS

1. **Export your model** from the Oumi platform and transfer the artifacts to your cloud instance (e.g., via `scp`, S3, or GCS).
2. **Install vLLM** on the instance:
   ```bash theme={null}
   pip install vllm
   ```
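3. **Start the inference server**, pointing to your exported model directory and setting the model name that clients will pass in their requests:
   ```bash theme={null}
   vllm serve ./exported_model/ --port 8000 --served-model-name exported_model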
   ```
4. **Make inference requests** using the OpenAI-compatible API:
   ```python theme={null}
   from openai import OpenAI

   # vLLM serves its OpenAI-compatible routes under the /v1 prefix
   client = OpenAI(base_url="http://<your-instance-ip>:8000/v1", api_key="unused")

   response = client.chat.completions.create(
       model="exported_model",
       messages=[{"role": "user", "content": "Hello!"}]
   )
   print(response.choices[0].message.content)
   ```

<Note>Replace `<your-instance-ip>` with the public IP or hostname of your cloud instance. Ensure port 8000 is open in your security group or firewall rules, ideally only to the source addresses that need to reach it.</Note>
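
A freshly started server can take a while to load model weights before it accepts requests. The sketch below uses only the Python standard library to poll the `/health` route that vLLM's OpenAI-compatible server exposes, and to build the `/v1` base URL that the OpenAI client expects. The helper names (`base_url`, `wait_until_ready`) and the timeout values are illustrative assumptions, not part of Oumi or vLLM; confirm the route against the vLLM version you install.

```python
import time
import urllib.error
import urllib.request


def base_url(host: str, port: int = 8000) -> str:
    """Build the base URL for the OpenAI client (routes live under /v1)."""
    return f"http://{host}:{port}/v1"


def wait_until_ready(
    host: str,
    port: int = 8000,
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
) -> bool:
    """Poll the server's /health route until it answers 200 or time runs out."""
    url = f"http://{host}:{port}/health"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server is still loading weights, or the port is not reachable yet.
            pass
        time.sleep(poll_s)
    return False
```

Call `wait_until_ready("<your-instance-ip>")` before sending the first request; it returns `False` if the server did not become healthy within the timeout.
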

### CHOOSING AN INSTANCE TYPE

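The right instance type depends on your model size, latency requirements, and budget. As a general rule, larger models require more GPU memory. In 16-bit precision, the weights of a 7B-parameter model alone occupy about 14 GB, so 16 GB of GPU VRAM is a practical minimum, and extra headroom allows longer contexts and more concurrent requests. A 30B+ model needs 60 GB or more at the same precision, which usually means a larger GPU, multiple GPUs, or quantization. Consult your cloud provider's documentation for current instance availability and pricing: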

* [AWS EC2 GPU instances](https://aws.amazon.com/ec2/instance-types/#Accelerated_Computing)
* [GCP GPU machine types](https://cloud.google.com/compute/docs/gpus)
* [Lambda GPU cloud](https://lambdalabs.com/service/gpu-cloud)
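
The sizing rule of thumb can be sketched as simple arithmetic. The helpers below are a rough estimate only: they count weight memory as parameters times bytes per parameter, and treat whatever remains of the GPU's usable memory as the budget for the KV cache. The function names are hypothetical, the `utilization=0.9` default is an assumption modeled on vLLM's `--gpu-memory-utilization` setting, and activation memory and framework overhead are ignored.

```python
def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (2 bytes per parameter for FP16/BF16)."""
    return params_billion * bytes_per_param


def kv_cache_budget_gb(
    gpu_gb: float,
    params_billion: float,
    bytes_per_param: float = 2.0,
    utilization: float = 0.9,
) -> float:
    """Memory left for the KV cache after loading weights.

    A negative result means the weights alone do not fit on this GPU.
    """
    usable_gb = gpu_gb * utilization
    return usable_gb - weights_gb(params_billion, bytes_per_param)


if __name__ == "__main__":
    for gpu_gb in (16, 24, 80):
        budget = kv_cache_budget_gb(gpu_gb, params_billion=7)
        print(f"7B model on a {gpu_gb} GB GPU: {budget:.1f} GB left for KV cache")
```

Under these assumptions a 7B model in 16-bit precision leaves very little cache room on a 16 GB GPU, which is why that figure is best read as a floor rather than a target.
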

***

## CONSIDERATIONS

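**Cost:** Cloud GPU instances are typically billed for the time they are running (per hour or per second, depending on the provider), whether or not they are serving requests. For variable traffic, consider auto-scaling groups or serverless inference platforms to avoid paying for idle capacity.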
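
The effect of idle capacity is easy to quantify. The following sketch uses a hypothetical hourly rate (not a quote from any provider) to show how an always-on instance's monthly bill and its effective cost per hour of useful work change with utilization. Both function names are illustrative.

```python
HOURS_PER_MONTH = 730  # average month: 8760 hours per year / 12


def monthly_cost(hourly_rate_usd: float, instances: int = 1) -> float:
    """Cost of keeping the given number of instances running all month."""
    return hourly_rate_usd * HOURS_PER_MONTH * instances


def cost_per_busy_hour(hourly_rate_usd: float, utilization: float) -> float:
    """Effective price of one hour of actual work at a given utilization (0-1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return hourly_rate_usd / utilization


if __name__ == "__main__":
    rate = 2.00  # hypothetical USD per GPU-hour
    print(f"Always-on monthly cost: ${monthly_cost(rate):,.2f}")
    print(f"Cost per busy hour at 25% utilization: ${cost_per_busy_hour(rate, 0.25):.2f}")
```
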

**Latency:** Every request incurs a network round trip that local inference avoids. Choose a region close to your users and keep request payloads small.

**Security:** Restrict access to your inference endpoint using API keys, VPC networking, or IAM policies. Do not expose the vLLM server publicly without authentication.
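
One basic layer of authentication is a shared bearer token. As a sketch, assuming your vLLM version supports the `--api-key` server option (check `vllm serve --help`), you could generate a random key, pass it to the server at startup, and have clients send it. The OpenAI client sends its `api_key` as an `Authorization: Bearer` header, which the second helper reproduces for other HTTP clients. Both helper names are hypothetical.

```python
import secrets


def new_api_key(n_bytes: int = 32) -> str:
    """Generate a URL-safe random token suitable for use as a shared API key."""
    return secrets.token_urlsafe(n_bytes)


def auth_headers(api_key: str) -> dict[str, str]:
    """Build the bearer-token header that OpenAI-style clients send."""
    return {"Authorization": f"Bearer {api_key}"}


if __name__ == "__main__":
    key = new_api_key()
    # Store the key in a secret manager or environment variable, not in code.
    print("Generated a key of length", len(key))
```

With a key configured on the server, pass the same value as `api_key` when constructing the `OpenAI(...)` client instead of a placeholder. A shared token does not replace network-level controls such as VPC rules or TLS termination.
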

**Model versioning:** Keep exported artifacts versioned in cloud storage (S3, GCS) so you can roll back to a previous model version if needed.
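
To make rollbacks verifiable, it can help to store a checksum manifest next to each exported version. The sketch below hashes every file in an export directory and builds a versioned storage prefix. The `models/<name>/v<NNNN>/` layout and the function names are assumptions for illustration, not an Oumi convention, and uploading to S3 or GCS is left to your usual tooling.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_dir: str) -> dict[str, str]:
    """Map each file's path (relative to model_dir) to its SHA-256 digest."""
    root = Path(model_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def versioned_prefix(model_name: str, version: int) -> str:
    """Build a zero-padded storage prefix (hypothetical layout)."""
    return f"models/{model_name}/v{version:04d}/"
```

Comparing a downloaded artifact's manifest with the stored one confirms that the version you roll back to is the version you originally exported.
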

***

## WHAT'S NEXT

<CardGroup cols={2}>
  <Card title="Local Inference" icon="laptop" href="/guides/deployment/local-inference">
    Run your model on your own hardware for development and testing.
  </Card>

  <Card title="Evaluating After Deployment" icon="chart-column" href="/guides/evaluations">
    Re-evaluate your model as production data evolves.
  </Card>
</CardGroup>
