# llama.cpp

LLM inference engine written in C/C++.
Widely used as the base for AI tools like [Ollama] and [Docker model runner].

1. [TL;DR](#tldr)
1. [Further readings](#further-readings)
1. [Sources](#sources)

## TL;DR
Setup:

```sh
brew install 'llama.cpp'
```
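Building from source is also possible; a minimal sketch using the CMake workflow from the [Codebase], assuming a C/C++ toolchain and CMake are installed:

```sh
# Clone and build from source.
# Backend-specific flags (e.g. '-DGGML_CUDA=ON') may be needed for GPU support.
git clone 'https://github.com/ggml-org/llama.cpp.git'
cd 'llama.cpp'
cmake -B 'build'
cmake --build 'build' --config 'Release'
```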
Usage:

```sh
# List available devices and exit.
llama-cli --list-devices

# List models in cache.
llama-cli -cl
llama-cli --cache-list

# Run models from files interactively.
llama-cli -m 'path/to/model.gguf'
llama-cli -m 'path/to/target/model.gguf' -md 'path/to/draft/model.gguf'

# Download and run models.
llama-cli -mu 'https://example.org/some/model'           # URL
llama-cli -hf 'ggml-org/gemma-3-1b-it-GGUF' -c '32768'   # Hugging Face
llama-cli -dr 'ai/qwen2.5' --offline                     # Docker Hub

# Launch the OpenAI-compatible API server.
llama-server -m 'path/to/model.gguf'
llama-server -hf 'ggml-org/gemma-3-1b-it-GGUF' --port '8080' --host '127.0.0.1'

# Run benchmarks.
llama-bench -m 'path/to/model.gguf'
llama-bench -m 'models/7B/ggml-model-q4_0.gguf' -m 'models/13B/ggml-model-q4_0.gguf' \
  -p '0' -n '128,256,512' --progress
```

The web UI can be accessed via browser at <http://127.0.0.1:8080>.
The chat completion endpoint is at <http://127.0.0.1:8080/v1/chat/completions>.
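A minimal request sketch for that endpoint, assuming the server from the examples above is listening on 127.0.0.1:8080:

```sh
# Send an OpenAI-compatible chat completion request and print the reply.
curl -s 'http://127.0.0.1:8080/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{
        "messages": [
          { "role": "user", "content": "Hello!" }
        ]
      }' \
| jq -r '.choices[0].message.content'
```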
Real world use cases:

```sh
# Use models pulled with Ollama.
jq -r '.layers|sort_by(.size)[-1].digest|sub(":";"-")' \
  "$HOME/.ollama/models/manifests/registry.ollama.ai/library/codellama/13b" \
| xargs -pI '%%' llama-bench -m "$HOME/.ollama/models/blobs/%%" --progress
```
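The same digest lookup can feed `llama-server` too; a sketch, assuming the Ollama blob is a plain GGUF file stored in the default location:

```sh
# Serve a model pulled with Ollama through llama-server.
blob="$(
  jq -r '.layers|sort_by(.size)[-1].digest|sub(":";"-")' \
    "$HOME/.ollama/models/manifests/registry.ollama.ai/library/codellama/13b"
)"
llama-server -m "$HOME/.ollama/models/blobs/${blob}" --host '127.0.0.1' --port '8080'
```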
## Further readings

- [Website]
- [Codebase]
- [ik_llama.cpp]

### Sources

[Docker model runner]: ../docker.md#running-llms-locally
[Ollama]: ollama.md
[Codebase]: https://github.com/ggml-org/llama.cpp
[Website]: https://llama-cpp.com/
[ik_llama.cpp]: https://github.com/ikawrakow/ik_llama.cpp