
Overview

Local inference allows you to run trained models directly on your own infrastructure, giving you full control over hardware, data privacy, and performance. This deployment approach is ideal for teams that need low-latency responses, strict data governance, or the ability to operate in offline or restricted environments.

vLLM and vLLM-MLX

vLLM and its Apple silicon equivalent, vLLM-MLX, are popular libraries for running an OpenAI-compatible inference server locally. First, follow the installation instructions on those projects’ homepages, for instance:
pip install vllm
Then, after you have exported your model, navigate to the model’s parent directory and start vLLM with a command like this:
vllm serve ./exported_model/ --port 1234
Ensure OpenAI’s Python library is installed:
pip install openai
Then you can access the vLLM server by creating an OpenAI() client, setting the base URL to http://127.0.0.1:1234/v1 and supplying a placeholder API key (vLLM does not check it by default).

Other options

Aside from using vLLM’s CLI, you can use your exported model directly from Python via vLLM or Hugging Face’s Transformers library. Transformers also includes a CLI that can serve the model with an OpenAI-compatible API; see its documentation for more details.
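For the direct-from-Python route, a sketch using vLLM’s LLM class might look like this. The model path is a placeholder assumption; the import is kept inside the function so the sketch can be read without vLLM installed.

```python
def generate_locally(prompt: str, model_path: str = "./exported_model/") -> str:
    """Run one offline generation with vLLM's Python API (no server needed)."""
    # Imported lazily: vLLM is only required when this function is called.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path)
    params = SamplingParams(max_tokens=64, temperature=0.7)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text

if __name__ == "__main__":
    print(generate_locally("Hello!"))
```

This offline API skips the HTTP layer entirely, which is convenient for batch jobs; for interactive applications, the server route above is usually the better fit.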