
# vLLM

Open source library for LLM inference and serving.

  1. TL;DR
  2. Further readings
    1. Sources

## TL;DR

Engineered specifically for high-performance, production-grade LLM inference.

Offers a production-ready, highly mature OpenAI-compatible API.
Fully supports streaming, embeddings, tool/function calling with parallel invocation, vision-language models, rate limiting, and token-based authentication. Optimized for high throughput and batched requests.
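A minimal sketch of how a client could call the OpenAI-compatible endpoint. The model name, port, and API key are assumptions and must match whatever `vllm serve` was started with (see the usage examples below):

```python
# Sketch: streaming a completion from a vLLM server via its OpenAI-compatible API.
# Assumes the server was started as in the usage examples below.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # placeholder; use a real token if auth is enabled
)

# Streaming text completion against the served base model.
stream = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    prompt="vLLM is",
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
```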

Supports PyTorch and Safetensors weights (primary formats), GPTQ and AWQ quantization, and native loading from the Hugging Face model hub.
Does not natively support GGUF (conversion is required).
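A sketch of offline use of the Python API with an AWQ-quantized checkpoint. The model repository name is an assumption; any AWQ-quantized model from the Hugging Face hub should work:

```python
# Sketch: loading an AWQ-quantized model with vLLM's offline Python API.
# The repository below is an assumption; substitute any AWQ checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain paged attention in one paragraph."], params)
for output in outputs:
    print(output.outputs[0].text)
```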

Offers production-grade, fully-featured, OpenAI-compatible tool calling via its API.
Support includes parallel function calls, the tool_choice parameter for controlling tool selection, and streaming of tool calls.
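A sketch of a tool call through the OpenAI-compatible API. The tool schema and model name are illustrative only, and the server typically needs tool-calling support enabled (e.g. vLLM's --enable-auto-tool-choice and an appropriate --tool-call-parser for the served model family):

```python
# Sketch: OpenAI-compatible tool calling against a vLLM server.
# Assumes a tool-calling-capable chat model is served with tool support enabled;
# the tool definition and model name below are hypothetical examples.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # assumption: any served chat model
    messages=[{"role": "user", "content": "What's the weather in Rome?"}],
    tools=tools,
    tool_choice="auto",  # let the model decide whether to call a tool
)

for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```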

Considered the gold standard for production deployments requiring enterprise-grade tool orchestration.
Best suited for production-grade performance and reliability, high concurrent request handling, multi-GPU deployments, and enterprise-scale LLM serving.

Setup

```sh
pip install 'vllm'
# alternatively, as an isolated application
pipx install 'vllm'
```

Usage

```sh
# serve a model with the OpenAI-compatible API on port 8000
vllm serve 'meta-llama/Llama-2-7b-hf' --port '8000' --gpu-memory-utilization '0.9'

# shard a larger model across 2 GPUs with tensor parallelism
vllm serve 'meta-llama/Llama-2-70b-hf' --tensor-parallel-size '2' --port '8000'
```

## Further readings

### Sources