diff --git a/knowledge base/ai/claude/claude code.md b/knowledge base/ai/claude/claude code.md
index b82c577..9910b75 100644
--- a/knowledge base/ai/claude/claude code.md
+++ b/knowledge base/ai/claude/claude code.md
@@ -17,8 +17,7 @@ Works in a terminal, IDE, browser, and as a desktop app.
> [!warning]
> Normally requires an Anthropic account to be used.
-> One _can_ use [Claude Code router] or [Ollama] to run on a locally server or shared LLM instead, but its performances
-> do seem to take an extreme hit.
+> One _can_ use [Claude Code router] or [Ollama] to run on a locally served or shared LLM instead.
Uses a scope system to determine where configurations apply and who they're shared with.
When multiple scopes are active, the **more** specific ones take precedence.
@@ -39,34 +38,82 @@ brew install --cask 'claude-code'
-
Real world use cases
```sh
# Run Claude Code on a model served locally by Ollama.
-ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model 'lfm2.5-thinking:1.2b'
+ANTHROPIC_AUTH_TOKEN='ollama' ANTHROPIC_BASE_URL='http://localhost:11434' ANTHROPIC_API_KEY='' \
+ claude --model 'lfm2.5-thinking:1.2b'
```
## Run on local models
-Performance examples:
+Claude Code _can_ use other models and engines by setting the `ANTHROPIC_AUTH_TOKEN`, `ANTHROPIC_BASE_URL` and
+`ANTHROPIC_API_KEY` environment variables.
-| Engine | Model | Context (tokens) | Size in RAM | Executing host | Average time to respond to `Hi!` |
-| ------------------ | -------------------- | ---------------- | ----------- | ------------------------ | -------------------------------- |
-| llama.cpp (ollama) | glm-4.7-flash:q4_K_M | 4096 | 19 GB | M3 Pro MacBook Pro 36 GB | 59 s |
-| llama.cpp (ollama) | glm-4.7-flash:q4_K_M | 8192 | 19 GB | M3 Pro MacBook Pro 36 GB | 52 s |
+E.g.:
+
+```sh
+# Run Claude Code on a model served locally by Ollama.
+ANTHROPIC_AUTH_TOKEN='ollama' ANTHROPIC_BASE_URL='http://localhost:11434' ANTHROPIC_API_KEY='' \
+ claude --model 'lfm2.5-thinking:1.2b'
+```
+
+> [!warning]
+> Performance tends to drop substantially depending on the context size and the executing host.
+
+<details>
+<summary>Examples</summary>
+
+Prompt: `Hi! Are you there?`.
+The model was run once right before the tests started, so that loading times are excluded from the measurements.
+Requests were sent in headless mode (`claude -p 'prompt'`).
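+
+A sketch of the kind of invocation behind these measurements; the `time` wrapper and the exact model flag are
+assumptions:
+
+```sh
+# Measure a single headless request. The model must already be loaded to exclude loading times.
+time ANTHROPIC_AUTH_TOKEN='ollama' ANTHROPIC_BASE_URL='http://localhost:11434' ANTHROPIC_API_KEY='' \
+  claude -p 'Hi! Are you there?' --model 'glm-4.7-flash:q4_K_M'
+```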
+
+<details>
+<summary>glm-4.7-flash:q4_K_M on an M3 Pro MacBook Pro 36 GB</summary>
+
+Model: `glm-4.7-flash:q4_K_M`.
+Host: M3 Pro MacBook Pro 36 GB.
+Claude Code version: `v2.1.41`.
+
+| Engine             | Context (tokens) | RAM usage | Used swap | Average response time | System remained responsive |
+| ------------------ | ---------------: | --------: | --------- | --------------------: | -------------------------- |
+| llama.cpp (ollama) | 4096 | 19 GB | No | 19s | No |
+| llama.cpp (ollama) | 8192 | 19 GB | No | 48s | No |
+| llama.cpp (ollama) | 16384 | 20 GB | No | 2m 16s | No |
+| llama.cpp (ollama) | 32768 | 22 GB | No | 7.12s | No |
+| llama.cpp (ollama) | 65536 | 25 GB | No? (unsure) | 10.25s | Meh (minor stutters) |
+| llama.cpp (ollama) | 131072 | 33 GB | No | 3m 42s | **No** (major stutters) |
+
+</details>
+
+</details>
## Further readings
@@ -81,6 +128,7 @@ Performance examples:
### Sources
- [Documentation]
+- [pffigueiredo/claude-code-sheet.md]
+[pffigueiredo/claude-code-sheet.md]: https://gist.github.com/pffigueiredo/252bac8c731f7e8a2fc268c8a965a963
diff --git a/knowledge base/ai/ollama.md b/knowledge base/ai/ollama.md
index ed818f6..7bee45a 100644
--- a/knowledge base/ai/ollama.md
+++ b/knowledge base/ai/ollama.md
@@ -44,13 +44,31 @@ docker run -d --gpus='all' … 'ollama/ollama'
The maximum context for model execution can be set in the app.
If so, using `OLLAMA_CONTEXT_LENGTH` in the CLI seems to have no effect. The app's setting is used regardless.
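+
+A minimal sketch of setting it on the server process when the app's setting is not in use (the value is arbitrary):
+
+```sh
+# Start the server with a custom maximum context length.
+OLLAMA_CONTEXT_LENGTH='16384' ollama serve
+```
+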
-Performance examples:
+<details>
+<summary>Performance examples</summary>
+
-| Model | Context (tokens) | Size in RAM | Executing host | Average time to respond to `Hi!` |
-| -------------------- | ---------------- | ----------- | ------------------------ | -------------------------------- |
-| glm-4.7-flash:q4_K_M | 4096 | 19 GB | M3 Pro MacBook Pro 36 GB | 59 s |
-| glm-4.7-flash:q4_K_M | 8192 | 19 GB | M3 Pro MacBook Pro 36 GB | 19.28 s |
-| glm-4.7-flash:q4_K_M | 16384 | 20 GB | M3 Pro MacBook Pro 36 GB | 9.13 s |
+Prompt: `Hi! Are you there?`.
+The model was run once right before the tests started, so that loading times are excluded from the measurements.
+Requests were sent in headless mode (`ollama run 'model' 'prompt'`).
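+
+A sketch of the kind of invocation behind these measurements; the `time` wrapper is an assumption, and `--verbose`
+makes `ollama` print its own timing statistics:
+
+```sh
+# Measure a single headless request. The model must already be loaded to exclude loading times.
+time ollama run 'glm-4.7-flash:q4_K_M' 'Hi! Are you there?' --verbose
+```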
+
+<details>
+<summary>glm-4.7-flash:q4_K_M on an M3 Pro MacBook Pro 36 GB</summary>
+
+Model: `glm-4.7-flash:q4_K_M`.
+Host: M3 Pro MacBook Pro 36 GB.
+
+| Context (tokens) | RAM usage | Used swap | Average response time | System remained responsive |
+| ---------------: | --------: | --------- | --------------------: | ---------------------------- |
+| 4096 | 19 GB | No | 9.27s | Yes |
+| 8192 | 19 GB | No | 8.28s | Yes |
+| 16384 | 20 GB | No | 9.13s | Yes |
+| 32768 | 22 GB | No | 9.05s | Yes |
+| 65536 | 25 GB | No? (unsure) | 10.07s | Meh (minor stutters) |
+| 131072 | 33 GB | **Yes** | 18.43s | **No** (noticeable stutters) |
+
+</details>
+
+</details>
The API is available after installation at <http://localhost:11434> by default.
@@ -58,6 +76,12 @@ Cloud models are automatically offloaded to Ollama's cloud service.
This allows one to keep using local tools while running larger models that wouldn't fit on a personal computer.
Those models are _usually_ tagged with the `cloud` suffix.
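+
+A minimal sketch, reusing the `glm-4.7:cloud` model pulled further below; cloud models require an ollama.com account
+(`ollama signin`):
+
+```sh
+# Pulled and run like local models, but executed on Ollama's cloud service.
+ollama run 'glm-4.7:cloud' 'Hi! Are you there?'
+```
+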
+Thinking is enabled by default in the CLI and API for models that support it.
+Some of those models (e.g. `gpt-oss`) also (or only) allow setting thinking levels.
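+
+A sketch of the same controls through the API, assuming the generate endpoint's `think` field; booleans toggle
+thinking, while models with levels accept `"low"`, `"medium"` or `"high"`:
+
+```sh
+# Ask gpt-oss for a low thinking effort.
+curl 'http://localhost:11434/api/generate' -d '{
+  "model": "gpt-oss",
+  "prompt": "Draft a headline",
+  "think": "low"
+}'
+```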
+
+Vision models accept images alongside text.
+The model can describe, classify, and answer questions about what it sees.
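+
+A minimal sketch of sending an image through the API; the `image.png` path is a placeholder, and `base64 -i` is the
+macOS syntax (GNU coreutils would need `base64 -w0` to avoid line wrapping):
+
+```sh
+# Send a base64-encoded image alongside the prompt to a vision-capable model.
+curl 'http://localhost:11434/api/generate' -d "{
+  \"model\": \"gemma3\",
+  \"prompt\": \"What is in this image?\",
+  \"images\": [\"$(base64 -i 'image.png')\"]
+}"
+```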
+
Usage
@@ -68,6 +92,13 @@ curl 'http://localhost:11434/api/generate' -d '{
"prompt": "Why is the sky blue?"
}'
+# Expose (bind) the server to a specific IP address and/or a custom port.
+# Defaults to 127.0.0.1 on port 11434.
+OLLAMA_HOST='some.fqdn:11435' ollama serve
+
+# Start the interactive menu.
+ollama
+
# Download models.
ollama pull 'qwen2.5-coder:7b'
ollama pull 'glm-4.7:cloud'
@@ -86,16 +117,22 @@ docker exec -it 'ollama' ollama run 'llama3.2'
# Run headless.
ollama run 'glm-4.7-flash:q4_K_M' 'Hi! Are you there?' --verbose
-OLLAMA_HOST='some.fqdn:11434' ollama run 'glm-4.7-flash:q4_K_M' …
+
+# Control thinking on models that support it.
+ollama run 'deepseek-r1' --think=false 'Summarize this article'
+ollama run 'gemma3' --hidethinking 'Is 9.9 bigger or 9.11?'
+ollama run 'gpt-oss' --think='low' 'Draft a headline'
+
+# Pass images to vision models.
+ollama run 'gemma3' './image.png' "what's in this image?"
# Quickly set up a coding tool with Ollama models.
ollama launch
# Launch integrations.
+ollama launch 'opencode'
ollama launch 'claude' --model 'glm-4.7-flash'
+ollama launch 'openclaw'
# Only configure models used by integrations.
# Do *not* launch them.
+ollama launch 'opencode' --config
ollama launch 'claude' --config
# Check usage.