> ## Documentation Index
> Fetch the complete documentation index at: https://docs.oumi.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# LOCAL INFERENCE

> Running your trained model on your own hardware

## OVERVIEW

Local inference allows you to run trained models directly on your own infrastructure, giving you full control over hardware, data privacy, and performance. This deployment approach is ideal for teams that need low-latency responses, strict data governance, or the ability to operate in offline or restricted environments.

### VLLM AND VLLM-MLX

[vLLM](https://github.com/vllm-project/vllm) and it's Mac Silicon equivalent, [vLLM-MLX](https://github.com/waybarrios/vllm-mlx) are popular libraries for running an OpenAI-compatible inference server locally.

First, follow the installation instructions on those projects' homepages, for instance:

```bash theme={null}
pip install vllm
```

Then, after you have exported your model, navigate to model's parent directory and start vLLM with a command like this:

```bash theme={null}
vllm serve ./exported_model/ --port 1234
```

Ensure [OpenAI's Python library](https://github.com/openai/openai-python) is installed:

```bash theme={null}
pip install openai
```

Then you can access the vLLM server by creating an `OpenAI()` object, replacing the API endpoint with `http://127.0.0.1:1234` and omitting an API key.

### OTHER OPTIONS

Aside from using vLLM's CLI, you can use your exported model directly from Python via `vLLM` or Hugging Face's `Transformer` libraries.

Hugging Face's `Transformer` library includes a CLI that can also serve the model with an OpenAI-compatible API: see the [documentation](https://huggingface.co/docs/transformers/v5.2.0/serve-cli/serving) for more details.
