chore(kb/ai): performance tests

Michele Cereda
2026-02-14 11:31:25 +01:00
parent 1beca8efff
commit 77e1c21b65
2 changed files with 29 additions and 3 deletions

View File

@@ -9,6 +9,7 @@ Works in a terminal, IDE, browser, and as a desktop app.
## Table of contents <!-- omit in toc -->
1. [TL;DR](#tldr)
1. [Run on local models](#run-on-local-models)
1. [Further readings](#further-readings)
1. [Sources](#sources)
@@ -58,6 +59,15 @@ ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_
</details>
## Run on local models
Performance examples:
| Engine | Model | Context (tokens) | Size in RAM | Executing host | Average time to respond to `Hi!` |
| ------------------ | -------------------- | ---------------- | ----------- | ------------------------ | -------------------------------- |
| llama.cpp (ollama) | glm-4.7-flash:q4_K_M | 4096 | 19 GB | M3 Pro MacBook Pro 36 GB | 59 s |
| llama.cpp (ollama) | glm-4.7-flash:q4_K_M | 8192 | 19 GB | M3 Pro MacBook Pro 36 GB | 52 s |
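A possible way to reproduce such timings, assuming the averages come from single non-interactive runs (the exact measurement methodology is not stated here):

```sh
# Sketch: time one non-interactive prompt through Claude Code pointed at the local Ollama endpoint.
# The model name mirrors the table above; the measurement method is an assumption.
time ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" \
  claude --model 'glm-4.7-flash:q4_K_M' --print 'Hi!'
```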
## Further readings
- [Website]

View File

@@ -27,7 +27,7 @@ open-source transparency, and want efficient resource utilization.
Excellent for building applications that require seamless migration from OpenAI.
<details>
<details style='padding: 0 0 1rem 0'>
<summary>Setup</summary>
```sh
@@ -41,6 +41,17 @@ docker run -d --gpus='all' … 'ollama/ollama'
</details>
The maximum context length for model execution can be set in the app.<br/>
When it is set there, the `OLLAMA_CONTEXT_LENGTH` environment variable on the CLI seems to have no effect; the app's setting is used regardless.
Performance examples:
| Model | Context (tokens) | Size in RAM | Executing host | Average time to respond to `Hi!` |
| -------------------- | ---------------- | ----------- | ------------------------ | -------------------------------- |
| glm-4.7-flash:q4_K_M | 4096 | 19 GB | M3 Pro MacBook Pro 36 GB | 59 s |
| glm-4.7-flash:q4_K_M | 8192 | 19 GB | M3 Pro MacBook Pro 36 GB | 19.28 s |
| glm-4.7-flash:q4_K_M | 16384 | 20 GB | M3 Pro MacBook Pro 36 GB | 9.13 s |
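A minimal sketch to reproduce one row of the table above, assuming the desktop app is not running (so its context setting cannot override the environment variable, per the note before the table):

```sh
# Sketch: serve with an explicit context length, then time a single headless prompt.
OLLAMA_CONTEXT_LENGTH=8192 ollama serve &
time ollama run 'glm-4.7-flash:q4_K_M' 'Hi!' --verbose
```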
The API is available by default at <http://localhost:11434/api> after installation.
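E.g., a quick test request against the generate endpoint (the model and the `num_ctx` value here are just examples):

```sh
# Example call to the local API; the model must already be pulled.
curl 'http://localhost:11434/api/generate' \
  -d '{"model": "glm-4.7-flash:q4_K_M", "prompt": "Hi!", "stream": false, "options": {"num_ctx": 8192}}'
```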
Cloud models are automatically offloaded to Ollama's cloud service.<br/>
@@ -69,10 +80,14 @@ ollama ls
ollama serve
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
# Run models.
# Run models interactively.
ollama run 'gemma3'
docker exec -it 'ollama' ollama run 'llama3.2'
# Run headless.
ollama run 'glm-4.7-flash:q4_K_M' 'Hi! Are you there?' --verbose
OLLAMA_HOST='some.fqdn:11434' ollama run 'glm-4.7-flash:q4_K_M'
# Quickly set up a coding tool with Ollama models.
ollama launch
@@ -121,7 +136,8 @@ ollama signout
```sh
# Run Claude Code on a model served locally by Ollama.
ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model 'lfm2.5-thinking:1.2b'
ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" \
claude --model 'lfm2.5-thinking:1.2b'
```
</details>