Overview
Local inference allows you to run trained models directly on your own infrastructure, giving you full control over hardware, data privacy, and performance. This deployment approach is ideal for teams that need low-latency responses, strict data governance, or the ability to operate in offline or restricted environments.

vLLM and vLLM-MLX
vLLM and its Apple silicon equivalent, vLLM-MLX, are popular libraries for running an OpenAI-compatible inference server locally. First, follow the installation instructions on those projects’ homepages. Once the server is running, you can connect to it with the standard OpenAI() object, replacing the API endpoint with http://127.0.0.1:1234 and omitting an API key.
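As a rough sketch of the client side, assuming the openai package is installed, a server is already listening on port 1234, and "my-model" stands in for whatever model name you served:

```python
# Base URL of the local OpenAI-compatible server; the /v1 suffix is where
# vLLM exposes the OpenAI-style routes.
BASE_URL = "http://127.0.0.1:1234/v1"


def ask(prompt: str, model: str = "my-model") -> str:
    """Send one chat-completion request to the local server and return the reply."""
    from openai import OpenAI  # requires `pip install openai`

    # The client insists on some api_key value, so pass a placeholder;
    # the local server does not check it.
    client = OpenAI(base_url=BASE_URL, api_key="not-needed")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Both the port and the model name here are placeholders; match them to however you started your server.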
Other options
Aside from using vLLM’s CLI, you can use your exported model directly from Python via vLLM or Hugging Face’s Transformers library.
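For the vLLM route, a minimal sketch of offline batch generation, assuming vLLM is installed and the model path points at your exported model:

```python
def generate_locally(model_path: str, prompts: list[str]) -> list[str]:
    """Run offline batch generation with vLLM's Python API (no server needed)."""
    from vllm import LLM, SamplingParams  # requires `pip install vllm`

    # Loads the model into memory; model_path can be a local directory
    # or a Hugging Face model identifier.
    llm = LLM(model=model_path)
    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(prompts, params)
    # Each output holds one or more candidate completions; take the first.
    return [o.outputs[0].text for o in outputs]
```

This bypasses the HTTP layer entirely, which is convenient for batch jobs and notebooks where a persistent server would be overhead.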
Hugging Face’s Transformers library includes a CLI that can also serve the model with an OpenAI-compatible API; see its documentation for more details.