
Ollama

One of the easiest ways to get up and running with large language models.
It has emerged as one of the most popular tools for local LLM deployment.

  1. TL;DR
  2. Further readings
    1. Sources

TL;DR

Leverages llama.cpp.

Primarily supports the GGUF file format, with quantization levels from Q2_K through Q8_0.
Offers automatic conversion of models from Hugging Face, and allows customization through a Modelfile.

Supports tool calling functionality via API.
Models can decide when to invoke tools and how to use returned data.
Works with models specifically trained for function calling (e.g., Mistral, Llama 3.1, Llama 3.2, and Qwen2.5). However, it does not currently allow forcing a specific tool to be called, nor receiving tool call responses in streaming mode.
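The flow can be sketched against the `tools` field of `/api/chat`: the model returns `tool_calls` entries, the client runs the matching local function, and sends the result back as a `tool` role message. The `get_weather` tool below is hypothetical, and the model's tool call is simulated rather than fetched from a running server:

```python
# Hypothetical tool definition, in the OpenAI-style schema that
# Ollama's /api/chat endpoint accepts in its "tools" field.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def get_weather(city: str) -> str:
    # Stand-in implementation; a real tool would call a weather API.
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> str:
    """Run the local function a model's tool_calls entry refers to."""
    fn = tool_call["function"]
    return TOOLS[fn["name"]](**fn["arguments"])

# Shape of one tool_calls entry from the assistant's response message.
simulated_call = {"function": {"name": "get_weather", "arguments": {"city": "Turin"}}}
print(dispatch(simulated_call))  # result goes back to the model as a "tool" message
```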

Considered ideal for developers who prefer CLI interfaces and automation, need reliable API integration, value open-source transparency, and want efficient resource utilization.

Excellent for building applications that require seamless migration from OpenAI.
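Migrating usually amounts to swapping the base URL for Ollama's OpenAI-compatible `/v1` endpoint and pointing `model` at a locally pulled one, keeping the payload shape intact. A minimal sketch; `gemma3` is just an example model, and the request is built but not sent:

```python
import json
import urllib.request

# Instead of https://api.openai.com/v1; the API key can be any
# non-empty string when talking to a local server.
BASE_URL = "http://localhost:11434/v1"

# Exact OpenAI chat-completions payload shape; only the model changes.
payload = {
    "model": "gemma3",  # a locally pulled model instead of e.g. "gpt-4o"
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
}
request = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer ollama",  # placeholder key
    },
)
# urllib.request.urlopen(request) would return an OpenAI-style
# chat.completion JSON body once the server is running.
```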

Setup
brew install --cask 'ollama-app'  # or just brew install 'ollama'
curl -fsSL 'https://ollama.com/install.sh' | sh
docker pull 'ollama/ollama'

# Run in containers.
docker run -d -v 'ollama:/root/.ollama' -p '11434:11434' --name 'ollama' 'ollama/ollama'
docker run -d --gpus='all' 'ollama/ollama'

# Expose (bind) the server to specific IP addresses and/or with custom ports.
# Default is 127.0.0.1 on port 11434.
# Only valid for the *'serve'* command.
OLLAMA_HOST='some.fqdn:11435' ollama serve

# Use a custom context length.
# Only valid for the *'serve'* command.
OLLAMA_CONTEXT_LENGTH=64000 ollama serve

# Use a remotely served model.
# Valid for all commands *but* 'serve'.
OLLAMA_HOST='some.fqdn:11435' ollama …

The maximum context length for model execution can also be set in the app.
When it is, setting OLLAMA_CONTEXT_LENGTH in the CLI appears to have no effect: the app's setting is used regardless.

Performance examples

Prompt: Hi! Are you there?
The model was run once right before the tests started to exclude loading times.
Requests were sent in headless mode (ollama run 'model' 'prompt').
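The methodology can be sketched as a small timing helper. The function below is an illustration, not the script actually used; it performs one warm-up call to exclude loading time, averages a few headless runs, and simply returns None when the ollama binary is absent:

```python
import shutil
import subprocess
import time

def time_prompt(model: str, prompt: str, runs: int = 3):
    """Average wall-clock time of headless `ollama run` calls.

    Returns None when the ollama binary is not on PATH.
    """
    if shutil.which("ollama") is None:
        return None
    # Warm-up run so model loading time is not counted.
    subprocess.run(["ollama", "run", model, prompt], capture_output=True)
    total = 0.0
    for _ in range(runs):
        start = time.monotonic()
        subprocess.run(["ollama", "run", model, prompt], capture_output=True)
        total += time.monotonic() - start
    return total / runs
```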

glm-4.7-flash:q4_K_M on an M3 Pro MacBook Pro 36 GB


| Context | RAM Usage | Used swap    | Average response time | System remained responsive |
|--------:|----------:|:------------:|----------------------:|:---------------------------|
| 4096    | 19 GB     | No           | 9.27s                 | Yes                        |
| 8192    | 19 GB     | No           | 8.28s                 | Yes                        |
| 16384   | 20 GB     | No           | 9.13s                 | Yes                        |
| 32768   | 22 GB     | No           | 9.05s                 | Yes                        |
| 65536   | 25 GB     | No? (unsure) | 10.07s                | Meh (minor stutters)       |
| 131072  | 33 GB     | Yes          | 18.43s                | No (noticeable stutters)   |

By default, the API is available after installation at http://localhost:11434/api.
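With streaming enabled (the default), /api/generate answers with newline-delimited JSON: each line carries a "response" text fragment, and the final line has "done" set to true. A small sketch of joining the fragments, fed a simulated stream instead of a live server's output:

```python
import json

def collect_stream(lines):
    """Join the text fragments of a streamed /api/generate response."""
    text = []
    for line in lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Simulated stream, shaped like the server's NDJSON output.
sample = [
    '{"model":"gemma3","response":"The sky","done":false}',
    '{"model":"gemma3","response":" is blue.","done":true}',
]
print(collect_stream(sample))  # The sky is blue.
```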

Cloud models are automatically offloaded to Ollama's cloud service.
This lets one keep using local tools while running larger models that would not fit on a personal computer.
Those models are usually tagged with the cloud suffix.

Thinking is enabled by default in the CLI and API for models that support it.
Some of those models (e.g., gpt-oss) also (or only) allow setting thinking levels.

Vision models accept images alongside text.
The model can describe, classify, and answer questions about what it sees.
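A sketch of sending an image through the API, assuming the "images" field of /api/generate takes base64-encoded data (images are passed per message in /api/chat). The bytes below are a fake placeholder, not a real PNG:

```python
import base64
import json

def vision_payload(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build a /api/generate request body carrying one image."""
    payload = {
        "model": model,
        "prompt": prompt,
        # Images travel as base64 strings inside the JSON body.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
    }
    return json.dumps(payload)

# Placeholder bytes instead of reading a real image file.
body = vision_payload("gemma3", "What's in this image?", b"\x89PNG fake")
```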

Usage
# Start the server.
ollama serve

# Verify the server is running.
curl 'http://localhost:11434/'

# Access the API via cURL.
curl 'http://localhost:11434/api/generate' -d '{
  "model": "gemma3",
  "prompt": "Why is the sky blue?"
}'

# Start the interactive menu.
ollama
ollama launch

# Download models.
ollama pull 'qwen2.5-coder:7b'
ollama pull 'glm-4.7:cloud'

# List pulled models.
ollama list
ollama ls

# Show models information.
ollama show 'codellama:13b'

# Run models interactively.
ollama run 'gemma3'
docker exec -it 'ollama' ollama run 'llama3.2'

# Run headless.
ollama run 'glm-4.7-flash:q4_K_M' 'Hi! Are you there?' --verbose
ollama run 'deepseek-r1' --think=false "Summarize this article"
ollama run 'gemma3' --hidethinking "Is 9.9 bigger or 9.11?"
ollama run 'gpt-oss' --think=low "Draft a headline"
ollama run 'gemma3' './image.png' "what's in this image?" --temperature '0.8' --top-p '0.9'

# Launch integrations.
ollama launch 'opencode'
ollama launch 'claude' --model 'glm-4.7-flash'
ollama launch 'openclaw'

# Only configure models used by integrations.
# Do *not* launch them.
ollama launch 'opencode' --config
ollama launch 'claude' --config

# Check usage.
ollama ps

# Stop running models.
ollama stop 'gemma3'

# Delete models.
ollama rm 'gemma3'
ollama rm 'nomic-embed-text:latest' 'llama3.1:8b'

# Create custom models.
# Requires a Modelfile.
ollama create 'mymodel' -f 'Modelfile'

# Quantize models.
# Requires a Modelfile.
ollama create --quantize 'q4_K_M' 'llama3.2'

# Push models to Ollama.
ollama push 'myuser/mymodel'

# Clone models.
ollama cp 'mymodel' 'myuser/mymodel'

# Sign into Ollama cloud, or create a new account.
ollama signin

# Sign out from Ollama cloud.
ollama signout
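The ollama create commands above read a Modelfile, which derives a custom model from a base one. A minimal hypothetical example (the model name, parameters, and system prompt are illustrative):

```
# Base model to derive from (must be pulled or otherwise available).
FROM llama3.2

# Inference parameters baked into the custom model.
PARAMETER temperature 0.8
PARAMETER num_ctx 8192

# Default system prompt.
SYSTEM You are a concise technical assistant.
```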

Further readings

Sources