From ab84702791d2fee1c89529ad114cdacd1fbfba50 Mon Sep 17 00:00:00 2001
From: Michele Cereda
Date: Sat, 14 Feb 2026 18:54:05 +0100
Subject: [PATCH] fix(kb/ai): performance tests

---
 knowledge base/ai/claude/claude code.md | 69 +++++++++++++++++++++----
 knowledge base/ai/ollama.md             | 51 +++++++++++++++---
 2 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/knowledge base/ai/claude/claude code.md b/knowledge base/ai/claude/claude code.md
index b82c577..9910b75 100644
--- a/knowledge base/ai/claude/claude code.md
+++ b/knowledge base/ai/claude/claude code.md
@@ -17,8 +17,7 @@ Works in a terminal, IDE, browser, and as a desktop app.
 > [!warning]
 > Normally requires an Anthropic account to be used.
-> One _can_ use [Claude Code router] or [Ollama] to run on a locally server or shared LLM instead, but its performances
-> do seem to take an extreme hit.
+> One _can_ use [Claude Code router] or [Ollama] to run on a locally served or shared LLM instead.
 
 Uses a scope system to determine where configurations apply and who they're shared with.
 When multiple scopes are active, the **more** specific ones take precedence.
 
@@ -39,34 +38,82 @@ brew install --cask 'claude-code'
-
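+A rough sketch of where each scope's settings file usually lives follows; the paths assume Claude Code's default
+settings layout and are best confirmed against the official documentation:
+
+```sh
+# Settings files per scope, from most to least specific (assumed defaults).
+ls '.claude/settings.local.json'    # Local project scope. Usually kept out of VCS.
+ls '.claude/settings.json'          # Project scope. Shared with others via VCS.
+ls "${HOME}/.claude/settings.json"  # User scope. Applies to all of one's projects.
+```
+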
   Real world use cases
 
 ```sh
 # Run Claude Code on a model served locally by Ollama.
-ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model 'lfm2.5-thinking:1.2b'
+ANTHROPIC_AUTH_TOKEN='ollama' ANTHROPIC_BASE_URL='http://localhost:11434' ANTHROPIC_API_KEY='' \
+  claude --model 'lfm2.5-thinking:1.2b'
 ```
 ## Run on local models
 
-Performance examples:
+Claude _can_ use other models and engines by setting the `ANTHROPIC_AUTH_TOKEN`, `ANTHROPIC_BASE_URL` and
+`ANTHROPIC_API_KEY` environment variables.
 
-| Engine             | Model                | Context (tokens) | Size in RAM | Executing host           | Average time to respond to `Hi!` |
-| ------------------ | -------------------- | ---------------- | ----------- | ------------------------ | -------------------------------- |
-| llama.cpp (ollama) | glm-4.7-flash:q4_K_M | 4096             | 19 GB       | M3 Pro MacBook Pro 36 GB | 59 s                             |
-| llama.cpp (ollama) | glm-4.7-flash:q4_K_M | 8192             | 19 GB       | M3 Pro MacBook Pro 36 GB | 52 s                             |
+E.g.:
+
+```sh
+# Run Claude Code on a model served locally by Ollama.
+ANTHROPIC_AUTH_TOKEN='ollama' ANTHROPIC_BASE_URL='http://localhost:11434' ANTHROPIC_API_KEY='' \
+  claude --model 'lfm2.5-thinking:1.2b'
+```
+
+> [!warning]
+> Performance tends to drop substantially depending on the context size and the executing host.
+
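+The same variables can be persisted instead of being prefixed to every invocation. A minimal sketch, assuming the
+settings files honour an `env` map (hypothetical values, local project scope):
+
+```sh
+# Persist the Ollama endpoint for the current project only.
+mkdir -p '.claude'
+cat > '.claude/settings.local.json' <<'EOF'
+{
+  "env": {
+    "ANTHROPIC_AUTH_TOKEN": "ollama",
+    "ANTHROPIC_BASE_URL": "http://localhost:11434",
+    "ANTHROPIC_API_KEY": ""
+  }
+}
+EOF
+```
+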
+  Examples
+
+Prompt: `Hi! Are you there?`.
+The model was run once right before the tests started, to exclude loading times from the measurements.
+Requests were sent in headless mode (`claude -p 'prompt'`) and timed as sketched below.
+
+
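+A minimal sketch of how a single measurement could be reproduced, assuming the shell's `time` keyword and the
+Ollama-backed setup shown above:
+
+```sh
+# Time one headless request against the locally served model.
+time ANTHROPIC_AUTH_TOKEN='ollama' ANTHROPIC_BASE_URL='http://localhost:11434' ANTHROPIC_API_KEY='' \
+  claude -p 'Hi! Are you there?' --model 'glm-4.7-flash:q4_K_M'
+```
+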
+  glm-4.7-flash:q4_K_M on an M3 Pro MacBook Pro 36 GB
+
+Model: `glm-4.7-flash:q4_K_M`.
+Host: M3 Pro MacBook Pro 36 GB.
+Claude Code version: `v2.1.41`.
+
+| Engine             | Context (tokens) | RAM usage | Used swap    | Average response time | System remained responsive |
+| ------------------ | ---------------: | --------: | ------------ | --------------------: | -------------------------- |
+| llama.cpp (ollama) |             4096 |     19 GB | No           |                   19s | No                         |
+| llama.cpp (ollama) |             8192 |     19 GB | No           |                   48s | No                         |
+| llama.cpp (ollama) |            16384 |     20 GB | No           |                2m 16s | No                         |
+| llama.cpp (ollama) |            32768 |     22 GB | No           |                 7.12s | No                         |
+| llama.cpp (ollama) |            65536 |     25 GB | No? (unsure) |                10.25s | Meh (minor stutters)       |
+| llama.cpp (ollama) |           131072 |     33 GB | No           |                3m 42s | **No** (major stutters)    |
+
+
+
 ## Further readings
 
@@ -81,6 +128,7 @@ Performance examples:
 ### Sources
 
 - [Documentation]
+- [pffigueiredo/claude-code-sheet.md]
 
+[pffigueiredo/claude-code-sheet.md]: https://gist.github.com/pffigueiredo/252bac8c731f7e8a2fc268c8a965a963
diff --git a/knowledge base/ai/ollama.md b/knowledge base/ai/ollama.md
index ed818f6..7bee45a 100644
--- a/knowledge base/ai/ollama.md
+++ b/knowledge base/ai/ollama.md
@@ -44,13 +44,31 @@ docker run -d --gpus='all' … 'ollama/ollama'
 The maximum context for model execution can be set in the app.
 If so, using `OLLAMA_CONTEXT_LENGTH` in the CLI seems to have no effect. The app's setting is used regardless.
 
-Performance examples:
+
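+How the context sizes used below were set is not recorded; a minimal sketch, assuming the app's setting is left at
+its default and the server honours `OLLAMA_CONTEXT_LENGTH`:
+
+```sh
+# Serve models with a 16k-token context window (example value).
+OLLAMA_CONTEXT_LENGTH='16384' ollama serve
+```
+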
+  Performance examples
+
-| Model                | Context (tokens) | Size in RAM | Executing host           | Average time to respond to `Hi!` |
-| -------------------- | ---------------- | ----------- | ------------------------ | -------------------------------- |
-| glm-4.7-flash:q4_K_M | 4096             | 19 GB       | M3 Pro MacBook Pro 36 GB | 59 s                             |
-| glm-4.7-flash:q4_K_M | 8192             | 19 GB       | M3 Pro MacBook Pro 36 GB | 19.28 s                          |
-| glm-4.7-flash:q4_K_M | 16384            | 20 GB       | M3 Pro MacBook Pro 36 GB | 9.13 s                           |
+Prompt: `Hi! Are you there?`.
+The model was run once right before the tests started, to exclude loading times from the measurements.
+Requests were sent in headless mode (`ollama run 'model' 'prompt'`) and timed as sketched below.
+
+
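+A minimal sketch of a single measurement, assuming the shell's `time` keyword; `--verbose` additionally makes Ollama
+print its own duration and token statistics:
+
+```sh
+# Time one headless request.
+time ollama run 'glm-4.7-flash:q4_K_M' 'Hi! Are you there?' --verbose
+```
+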
+  glm-4.7-flash:q4_K_M on an M3 Pro MacBook Pro 36 GB
+
+Model: `glm-4.7-flash:q4_K_M`.
+Host: M3 Pro MacBook Pro 36 GB.
+
+| Context (tokens) | RAM usage | Used swap    | Average response time | System remained responsive   |
+| ---------------: | --------: | ------------ | --------------------: | ---------------------------- |
+|             4096 |     19 GB | No           |                 9.27s | Yes                          |
+|             8192 |     19 GB | No           |                 8.28s | Yes                          |
+|            16384 |     20 GB | No           |                 9.13s | Yes                          |
+|            32768 |     22 GB | No           |                 9.05s | Yes                          |
+|            65536 |     25 GB | No? (unsure) |                10.07s | Meh (minor stutters)         |
+|           131072 |     33 GB | **Yes**      |                18.43s | **No** (noticeable stutters) |
+
+
 The API is available after installation at <http://localhost:11434> by default.
 
@@ -58,6 +76,12 @@ Cloud models are automatically offloaded to Ollama's cloud service.
 This allows one to keep using local tools while running larger models that wouldn't fit on a personal computer.
 Those models are _usually_ tagged with the `cloud` suffix.
 
+Thinking is enabled by default in the CLI and API for models that support it.
+Some of those models (e.g. `gpt-oss`) also (or only) allow setting thinking levels.
+
+Vision models accept images alongside text.
+The model can describe, classify, and answer questions about what it sees.
+
   Usage
 
 ```sh
@@ -68,6 +92,13 @@ curl 'http://localhost:11434/api/generate' -d '{
   "prompt": "Why is the sky blue?"
 }'
 
+# Expose (bind) the server to specific IP addresses and/or with custom ports.
+# Default is 127.0.0.1 on port 11434.
+OLLAMA_HOST='some.fqdn:11435' ollama serve
+
+# Start the interactive menu.
+ollama
+
 # Download models.
 ollama pull 'qwen2.5-coder:7b'
 ollama pull 'glm-4.7:cloud'
@@ -86,16 +117,22 @@ docker exec -it 'ollama' ollama run 'llama3.2'
 # Run headless.
 ollama run 'glm-4.7-flash:q4_K_M' 'Hi! Are you there?' --verbose
-OLLAMA_HOST='some.fqdn:11434' ollama run 'glm-4.7-flash:q4_K_M' …
+ollama run 'deepseek-r1' --think=false "Summarize this article"
+ollama run 'gemma3' --hidethinking "Is 9.9 bigger or 9.11?"
+ollama run 'gpt-oss' --think=low "Draft a headline"
+ollama run 'gemma3' './image.png' "what's in this image?"
 
 # Quickly set up a coding tool with Ollama models.
 ollama launch
 
 # Launch integrations.
+ollama launch 'opencode'
 ollama launch 'claude' --model 'glm-4.7-flash'
+ollama launch 'openclaw'
 
 # Only configure models used by integrations.
 # Do *not* launch them.
+ollama launch 'opencode' --config
 ollama launch 'claude' --config
 
 # Check usage.