fix(kb/ai): performance tests

Michele Cereda
2026-02-14 18:54:05 +01:00
parent 77e1c21b65
commit ab84702791
2 changed files with 103 additions and 17 deletions

View File

@@ -17,8 +17,7 @@ Works in a terminal, IDE, browser, and as a desktop app.
> [!warning]
> Normally requires an Anthropic account to be used.<br/>
> One _can_ use [Claude Code router] or [Ollama] to run on a local server or shared LLM instead, but performance
> does seem to take an extreme hit.
> One _can_ use [Claude Code router] or [Ollama] to run on a local server or shared LLM instead.
Uses a scope system to determine where configurations apply and who they're shared with.<br/>
When multiple scopes are active, the **more** specific ones take precedence.
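As a rough illustration, the scopes map onto settings files like the ones below; the paths are the commonly documented
defaults and may differ between versions:

```sh
# Assumed default locations; verify against the documentation of your Claude Code version.
cat "${HOME}/.claude/settings.json"  # User scope: applies to all of the current user's projects.
cat '.claude/settings.json'          # Project scope: committed to the repository and shared with the team.
cat '.claude/settings.local.json'    # Local scope: personal overrides for this project, usually git-ignored.
```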
@@ -39,34 +38,82 @@ brew install --cask 'claude-code'
</details>
<!-- Uncomment if used
<details>
<summary>Usage</summary>
```sh
# Start in interactive mode.
claude
# Run a one-time task.
claude "fix the build error"
# Run a one-off task, then exit.
claude -p 'Hi! Are you there?'
claude -p "explain this function"
# Resume the most recent conversation in the current directory.
claude -c
# Resume a previous conversation.
claude -r
```
</details>
-->
<details>
<summary>Real-world use cases</summary>
```sh
# Run Claude Code on a model served locally by Ollama.
ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model 'lfm2.5-thinking:1.2b'
ANTHROPIC_AUTH_TOKEN='ollama' ANTHROPIC_BASE_URL='http://localhost:11434' ANTHROPIC_API_KEY='' \
claude --model 'lfm2.5-thinking:1.2b'
```
</details>
## Run on local models
Performance examples:
Claude _can_ use other models and engines by setting the `ANTHROPIC_AUTH_TOKEN`, `ANTHROPIC_BASE_URL` and
`ANTHROPIC_API_KEY` environment variables.
| Engine | Model | Context (tokens) | Size in RAM | Executing host | Average time to respond to `Hi!` |
| ------------------ | -------------------- | ---------------- | ----------- | ------------------------ | -------------------------------- |
| llama.cpp (ollama) | glm-4.7-flash:q4_K_M | 4096 | 19 GB | M3 Pro MacBook Pro 36 GB | 59 s |
| llama.cpp (ollama) | glm-4.7-flash:q4_K_M | 8192 | 19 GB | M3 Pro MacBook Pro 36 GB | 52 s |
E.g.:
```sh
# Run Claude Code on a model served locally by Ollama.
ANTHROPIC_AUTH_TOKEN='ollama' ANTHROPIC_BASE_URL='http://localhost:11434' ANTHROPIC_API_KEY='' \
claude --model 'lfm2.5-thinking:1.2b'
```
> [!warning]
> Performance does tend to drop substantially depending on the context size and the executing host.
<details style='padding: 0 0 1rem 1rem'>
<summary>Examples</summary>
Prompt: `Hi! Are you there?`.<br/>
The model was run once right before the tests started, to exclude loading times from the measurements.<br/>
Requests were sent in headless mode (`claude -p 'prompt'`).
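One way to reproduce such a measurement (a hypothetical sketch, not necessarily the exact procedure used for the tables
below; the model name and run count are examples only):

```sh
# Hypothetical reproduction sketch.
export ANTHROPIC_AUTH_TOKEN='ollama' ANTHROPIC_BASE_URL='http://localhost:11434' ANTHROPIC_API_KEY=''

# Warm-up run to load the model. Not measured.
claude --model 'glm-4.7-flash:q4_K_M' -p 'Hi! Are you there?' > /dev/null

# Timed runs. Average the reported wall-clock times.
for i in 1 2 3; do
  time claude --model 'glm-4.7-flash:q4_K_M' -p 'Hi! Are you there?' > /dev/null
done
```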
<details style='padding: 0 0 0 1rem'>
<summary><code>glm-4.7-flash:q4_K_M</code> on an M3 Pro MacBook Pro 36 GB</summary>
Model: `glm-4.7-flash:q4_K_M`.<br/>
Host: M3 Pro MacBook Pro 36 GB.<br/>
Claude Code version: `v2.1.41`.<br/>
| Engine             | Context (tokens) | RAM usage | Used swap    | Average response time | System remained responsive |
| ------------------ | ---------------: | --------: | ------------ | --------------------: | -------------------------- |
| llama.cpp (ollama) |             4096 |     19 GB | No           |                   19s | No                         |
| llama.cpp (ollama) |             8192 |     19 GB | No           |                   48s | No                         |
| llama.cpp (ollama) |            16384 |     20 GB | No           |                2m 16s | No                         |
| llama.cpp (ollama) |            32768 |     22 GB | No           |                 7.12s | No                         |
| llama.cpp (ollama) |            65536 |     25 GB | No? (unsure) |                10.25s | Meh (minor stutters)       |
| llama.cpp (ollama) |           131072 |     33 GB | No           |                3m 42s | **No** (major stutters)    |
</details>
</details>
## Further readings
@@ -81,6 +128,7 @@ Performance examples:
### Sources
- [Documentation]
- [pffigueiredo/claude-code-sheet.md]
<!--
Reference
@@ -103,3 +151,4 @@ Performance examples:
[Website]: https://claude.com/product/overview
<!-- Others -->
[pffigueiredo/claude-code-sheet.md]: https://gist.github.com/pffigueiredo/252bac8c731f7e8a2fc268c8a965a963

View File

@@ -44,13 +44,31 @@ docker run -d --gpus='all' … 'ollama/ollama'
The maximum context for model execution can be set in the app.<br/>
If set there, using `OLLAMA_CONTEXT_LENGTH` in the CLI seems to have no effect; the app's setting is used regardless.
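For instance, a minimal sketch that presumably only takes effect when the server is started from the CLI rather than
through the app:

```sh
# Start the server from the CLI with a larger default context window.
OLLAMA_CONTEXT_LENGTH=8192 ollama serve
```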
Performance examples:
<details style='padding: 0 0 1rem 0'>
<summary>Performance examples</summary>
| Model | Context (tokens) | Size in RAM | Executing host | Average time to respond to `Hi!` |
| -------------------- | ---------------- | ----------- | ------------------------ | -------------------------------- |
| glm-4.7-flash:q4_K_M | 4096 | 19 GB | M3 Pro MacBook Pro 36 GB | 59 s |
| glm-4.7-flash:q4_K_M | 8192 | 19 GB | M3 Pro MacBook Pro 36 GB | 19.28 s |
| glm-4.7-flash:q4_K_M | 16384 | 20 GB | M3 Pro MacBook Pro 36 GB | 9.13 s |
Prompt: `Hi! Are you there?`.<br/>
The model was run once right before the tests started, to exclude loading times from the measurements.<br/>
Requests were sent in headless mode (`ollama run 'model' 'prompt'`).
<details style='padding: 0 0 0 1rem'>
<summary><code>glm-4.7-flash:q4_K_M</code> on an M3 Pro MacBook Pro 36 GB</summary>
Model: `glm-4.7-flash:q4_K_M`.<br/>
Host: M3 Pro MacBook Pro 36 GB.
| Context (tokens) | RAM usage | Used swap    | Average response time | System remained responsive   |
| ---------------: | --------: | ------------ | --------------------: | ---------------------------- |
|             4096 |     19 GB | No           |                 9.27s | Yes                          |
|             8192 |     19 GB | No           |                 8.28s | Yes                          |
|            16384 |     20 GB | No           |                 9.13s | Yes                          |
|            32768 |     22 GB | No           |                 9.05s | Yes                          |
|            65536 |     25 GB | No? (unsure) |                10.07s | Meh (minor stutters)         |
|           131072 |     33 GB | **Yes**      |                18.43s | **No** (noticeable stutters) |
</details>
</details>
The API is available after installation at <http://localhost:11434/api> by default.
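A quick reachability check; `/api/tags` lists the locally available models:

```sh
# Returns a JSON list of the locally available models.
curl -s 'http://localhost:11434/api/tags'
```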
@@ -58,6 +76,12 @@ Cloud models are automatically offloaded to Ollama's cloud service.<br/>
This allows one to keep using local tools while running larger models that wouldn't fit on a personal computer.<br/>
Those models are _usually_ tagged with the `cloud` suffix.
Thinking is enabled by default in the CLI and API for models that support it.<br/>
Some of those models (e.g. `gpt-oss`) also (or only) allow setting thinking levels.
Vision models accept images alongside text.<br/>
The model can describe, classify, and answer questions about what it sees.
<details>
<summary>Usage</summary>
@@ -68,6 +92,13 @@ curl 'http://localhost:11434/api/generate' -d '{
"prompt": "Why is the sky blue?"
}'
# Expose (bind) the server to specific IP addresses and/or custom ports.
# Default is 127.0.0.1 on port 11434.
OLLAMA_HOST='some.fqdn:11435' ollama serve
# Start the interactive menu.
ollama
# Download models.
ollama pull 'qwen2.5-coder:7b'
ollama pull 'glm-4.7:cloud'
@@ -86,16 +117,22 @@ docker exec -it 'ollama' ollama run 'llama3.2'
# Run headless.
ollama run 'glm-4.7-flash:q4_K_M' 'Hi! Are you there?' --verbose
OLLAMA_HOST='some.fqdn:11434' ollama run 'glm-4.7-flash:q4_K_M'
ollama run 'deepseek-r1' --think=false "Summarize this article"
ollama run 'gemma3' --hidethinking "Is 9.9 bigger or 9.11?"
ollama run 'gpt-oss' --think=low "Draft a headline"
ollama run 'gemma3' './image.png' "what's in this image?"
# Quickly set up a coding tool with Ollama models.
ollama launch
# Launch integrations.
ollama launch 'opencode'
ollama launch 'claude' --model 'glm-4.7-flash'
ollama launch 'openclaw'
# Only configure models used by integrations.
# Do *not* launch them.
ollama launch 'opencode' --config
ollama launch 'claude' --config
# Check usage.