fix(kb/ai): performance tests

Michele Cereda
2026-02-14 18:54:05 +01:00
parent 77e1c21b65
commit ab84702791
2 changed files with 103 additions and 17 deletions

View File

@@ -17,8 +17,7 @@ Works in a terminal, IDE, browser, and as a desktop app.
> [!warning]
> Normally requires an Anthropic account to be used.<br/>
> One _can_ use [Claude Code router] or [Ollama] to run on a local server or shared LLM instead, but performance
> does seem to take an extreme hit.
> One _can_ use [Claude Code router] or [Ollama] to run on a local server or shared LLM instead.
Uses a scope system to determine where configurations apply and who they're shared with.<br/>
When multiple scopes are active, the **more** specific ones take precedence.
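As a rough illustration, the scopes map onto settings files like the ones below; the paths are the commonly documented
defaults and may differ between versions:

```sh
# Assumed default locations; verify against the documentation of your Claude Code version.
cat "${HOME}/.claude/settings.json"  # User scope: applies to all of the current user's projects.
cat '.claude/settings.json'          # Project scope: committed to the repository and shared with the team.
cat '.claude/settings.local.json'    # Local scope: personal overrides for this project, usually git-ignored.
```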
@@ -39,34 +38,82 @@ brew install --cask 'claude-code'
</details>
<!-- Uncomment if used
<details>
<summary>Usage</summary>
```sh
# Start in interactive mode.
claude
# Run a one-time task.
claude "fix the build error"
# Run a one-off task, then exit.
claude -p 'Hi! Are you there?'
claude -p "explain this function"
# Resume the most recent conversation in the current directory.
claude -c
# Resume a previous conversation.
claude -r
```
</details>
-->
<details>
<summary>Real-world use cases</summary>
```sh
# Run Claude Code on a model served locally by Ollama.
ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model 'lfm2.5-thinking:1.2b'
ANTHROPIC_AUTH_TOKEN='ollama' ANTHROPIC_BASE_URL='http://localhost:11434' ANTHROPIC_API_KEY='' \
claude --model 'lfm2.5-thinking:1.2b'
```
</details>
## Run on local models
Performance examples:
Claude _can_ use other models and engines by setting the `ANTHROPIC_AUTH_TOKEN`, `ANTHROPIC_BASE_URL` and
`ANTHROPIC_API_KEY` environment variables.
| Engine | Model | Context (tokens) | Size in RAM | Executing host | Average time to respond to `Hi!` |
| ------------------ | -------------------- | ---------------- | ----------- | ------------------------ | -------------------------------- |
| llama.cpp (ollama) | glm-4.7-flash:q4_K_M | 4096 | 19 GB | M3 Pro MacBook Pro 36 GB | 59 s |
| llama.cpp (ollama) | glm-4.7-flash:q4_K_M | 8192 | 19 GB | M3 Pro MacBook Pro 36 GB | 52 s |
E.g.:
```sh
# Run Claude Code on a model served locally by Ollama.
ANTHROPIC_AUTH_TOKEN='ollama' ANTHROPIC_BASE_URL='http://localhost:11434' ANTHROPIC_API_KEY='' \
claude --model 'lfm2.5-thinking:1.2b'
```
> [!warning]
> Performance does tend to drop substantially depending on the context size and the executing host.
<details style='padding: 0 0 1rem 1rem'>
<summary>Examples</summary>
Prompt: `Hi! Are you there?`.<br/>
The model was run once right before the tests started, to exclude loading times from the measurements.<br/>
Requests were sent in headless mode (`claude -p 'prompt'`).
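One way to reproduce such a measurement (a hypothetical sketch, not necessarily the exact procedure used for the tables
below; the model name and run count are examples only):

```sh
# Hypothetical reproduction sketch.
export ANTHROPIC_AUTH_TOKEN='ollama' ANTHROPIC_BASE_URL='http://localhost:11434' ANTHROPIC_API_KEY=''

# Warm-up run to load the model. Not measured.
claude --model 'glm-4.7-flash:q4_K_M' -p 'Hi! Are you there?' > /dev/null

# Timed runs. Average the reported wall-clock times.
for i in 1 2 3; do
  time claude --model 'glm-4.7-flash:q4_K_M' -p 'Hi! Are you there?' > /dev/null
done
```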
<details style='padding: 0 0 0 1rem'>
<summary><code>glm-4.7-flash:q4_K_M</code> on an M3 Pro MacBook Pro 36 GB</summary>
Model: `glm-4.7-flash:q4_K_M`.<br/>
Host: M3 Pro MacBook Pro 36 GB.<br/>
Claude Code version: `v2.1.41`.<br/>
| Engine             | Context (tokens) | RAM usage | Used swap    | Average response time | System remained responsive |
| ------------------ | ---------------: | --------: | ------------ | --------------------: | -------------------------- |
| llama.cpp (ollama) |             4096 |     19 GB | No           |                   19s | No                         |
| llama.cpp (ollama) |             8192 |     19 GB | No           |                   48s | No                         |
| llama.cpp (ollama) |            16384 |     20 GB | No           |                2m 16s | No                         |
| llama.cpp (ollama) |            32768 |     22 GB | No           |                 7.12s | No                         |
| llama.cpp (ollama) |            65536 |     25 GB | No? (unsure) |                10.25s | Meh (minor stutters)       |
| llama.cpp (ollama) |           131072 |     33 GB | No           |                3m 42s | **No** (major stutters)    |
</details>
</details>
## Further readings
@@ -81,6 +128,7 @@ Performance examples:
### Sources
- [Documentation]
- [pffigueiredo/claude-code-sheet.md]
<!--
Reference
@@ -103,3 +151,4 @@ Performance examples:
[Website]: https://claude.com/product/overview
<!-- Others -->
[pffigueiredo/claude-code-sheet.md]: https://gist.github.com/pffigueiredo/252bac8c731f7e8a2fc268c8a965a963

View File

@@ -44,13 +44,31 @@ docker run -d --gpus='all' … 'ollama/ollama'
The maximum context for model execution can be set in the app.<br/>
If set there, using `OLLAMA_CONTEXT_LENGTH` in the CLI seems to have no effect; the app's setting is used regardless.
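For instance, a minimal sketch that presumably only takes effect when the server is started from the CLI rather than
through the app:

```sh
# Start the server from the CLI with a larger default context window.
OLLAMA_CONTEXT_LENGTH=8192 ollama serve
```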
Performance examples:
<details style='padding: 0 0 1rem 0'>
<summary>Performance examples</summary>
| Model | Context (tokens) | Size in RAM | Executing host | Average time to respond to `Hi!` |
| -------------------- | ---------------- | ----------- | ------------------------ | -------------------------------- |
| glm-4.7-flash:q4_K_M | 4096 | 19 GB | M3 Pro MacBook Pro 36 GB | 59 s |
| glm-4.7-flash:q4_K_M | 8192 | 19 GB | M3 Pro MacBook Pro 36 GB | 19.28 s |
| glm-4.7-flash:q4_K_M | 16384 | 20 GB | M3 Pro MacBook Pro 36 GB | 9.13 s |
Prompt: `Hi! Are you there?`.<br/>
The model was run once right before the tests started, to exclude loading times from the measurements.<br/>
Requests were sent in headless mode (`ollama run 'model' 'prompt'`).
<details style='padding: 0 0 0 1rem'>
<summary><code>glm-4.7-flash:q4_K_M</code> on an M3 Pro MacBook Pro 36 GB</summary>
Model: `glm-4.7-flash:q4_K_M`.<br/>
Host: M3 Pro MacBook Pro 36 GB.
| Context (tokens) | RAM usage | Used swap    | Average response time | System remained responsive   |
| ---------------: | --------: | ------------ | --------------------: | ---------------------------- |
|             4096 |     19 GB | No           |                 9.27s | Yes                          |
|             8192 |     19 GB | No           |                 8.28s | Yes                          |
|            16384 |     20 GB | No           |                 9.13s | Yes                          |
|            32768 |     22 GB | No           |                 9.05s | Yes                          |
|            65536 |     25 GB | No? (unsure) |                10.07s | Meh (minor stutters)         |
|           131072 |     33 GB | **Yes**      |                18.43s | **No** (noticeable stutters) |
</details>
</details>
The API is available after installation at <http://localhost:11434/api> by default.
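A quick reachability check; `/api/tags` lists the locally available models:

```sh
# Returns a JSON list of the locally available models.
curl -s 'http://localhost:11434/api/tags'
```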
@@ -58,6 +76,12 @@ Cloud models are automatically offloaded to Ollama's cloud service.<br/>
This allows one to keep using local tools while running larger models that wouldn't fit on a personal computer.<br/>
Those models are _usually_ tagged with the `cloud` suffix.
Thinking is enabled by default in the CLI and API for models that support it.<br/>
Some of those models (e.g. `gpt-oss`) also (or only) allow setting thinking levels.
Vision models accept images alongside text.<br/>
The model can describe, classify, and answer questions about what it sees.
<details>
<summary>Usage</summary>
@@ -68,6 +92,13 @@ curl 'http://localhost:11434/api/generate' -d '{
"prompt": "Why is the sky blue?"
}'
# Expose (bind) the server to specific IP addresses and/or custom ports.
# Default is 127.0.0.1 on port 11434.
OLLAMA_HOST='some.fqdn:11435' ollama serve
# Start the interactive menu.
ollama
# Download models.
ollama pull 'qwen2.5-coder:7b'
ollama pull 'glm-4.7:cloud'
@@ -86,16 +117,22 @@ docker exec -it 'ollama' ollama run 'llama3.2'
# Run headless.
ollama run 'glm-4.7-flash:q4_K_M' 'Hi! Are you there?' --verbose
OLLAMA_HOST='some.fqdn:11434' ollama run 'glm-4.7-flash:q4_K_M'
ollama run 'deepseek-r1' --think=false "Summarize this article"
ollama run 'gemma3' --hidethinking "Is 9.9 bigger or 9.11?"
ollama run 'gpt-oss' --think=low "Draft a headline"
ollama run 'gemma3' './image.png' "what's in this image?"
# Quickly set up a coding tool with Ollama models.
ollama launch
# Launch integrations.
ollama launch 'opencode'
ollama launch 'claude' --model 'glm-4.7-flash'
ollama launch 'openclaw'
# Only configure models used by integrations.
# Do *not* launch them.
ollama launch 'opencode' --config
ollama launch 'claude' --config
# Check usage.