llama.cpp
TODO
LLM inference engine written in C/C++.
Widely used as the base for AI tools like Ollama and Docker Model Runner.
TL;DR
Setup
brew install 'llama.cpp'
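Building from source should also work. A minimal sketch, assuming a recent CMake toolchain and the upstream repository (adjust the URL and flags to taste):
git clone 'https://github.com/ggml-org/llama.cpp'
cmake -S 'llama.cpp' -B 'llama.cpp/build'
cmake --build 'llama.cpp/build' --config 'Release' -j
The resulting binaries (llama-cli, llama-server, llama-bench) should end up under 'llama.cpp/build/bin'.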
Usage
# List available devices and exit.
llama-cli --list-devices
# List models in cache.
llama-cli -cl
llama-cli --cache-list
# Run models from files interactively.
llama-cli -m 'path/to/model.gguf'
llama-cli -m 'path/to/target/model.gguf' -md 'path/to/draft/model.gguf'  # '-md' loads a draft model for speculative decoding
# Download and run models.
llama-cli -mu 'https://example.org/some/model' # URL
llama-cli -hf 'ggml-org/gemma-3-1b-it-GGUF' -c '32768' # Hugging Face; '-c' sets the context size in tokens
llama-cli -dr 'ai/qwen2.5' --offline # Docker Hub
# Launch the OpenAI-compatible API server.
llama-server -m 'path/to/model.gguf'
llama-server -hf 'ggml-org/gemma-3-1b-it-GGUF' --port '8080' --host '127.0.0.1'
# Run benchmarks.
llama-bench -m 'path/to/model.gguf'
llama-bench -m 'models/7B/ggml-model-q4_0.gguf' -m 'models/13B/ggml-model-q4_0.gguf' -p '0' -n '128,256,512' --progress
The web UI can be accessed via browser at http://localhost:8080.
The chat completion endpoint is at http://localhost:8080/v1/chat/completions.
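A quick smoke test of that endpoint from the shell (a sketch; the prompt is arbitrary, and the 'model' field can usually be omitted since the server hosts a single model):
curl -s 'http://localhost:8080/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "Say hello"}]}' \
| jq -r '.choices[0].message.content'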
Real world use cases
# Use models pulled with Ollama.
jq -r '.layers|sort_by(.size)[-1].digest|sub(":";"-")' \
"$HOME/.ollama/models/manifests/registry.ollama.ai/library/codellama/13b" \
| xargs -pI '%%' llama-bench -m "$HOME/.ollama/models/blobs/%%" --progress
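The same digest lookup can feed llama-server to serve an Ollama-pulled model directly. A sketch reusing the codellama/13b manifest from above, assuming the largest layer blob is the GGUF model file:
blob="$(jq -r '.layers|sort_by(.size)[-1].digest|sub(":";"-")' \
  "$HOME/.ollama/models/manifests/registry.ollama.ai/library/codellama/13b")"
llama-server -m "$HOME/.ollama/models/blobs/$blob" --port '8080'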