chore(kb/ai): performance tests

Michele Cereda
2026-02-14 11:31:25 +01:00
parent 1beca8efff
commit 77e1c21b65
2 changed files with 29 additions and 3 deletions

View File

@@ -9,6 +9,7 @@ Works in a terminal, IDE, browser, and as a desktop app.
## Table of contents <!-- omit in toc -->
1. [TL;DR](#tldr)
1. [Run on local models](#run-on-local-models)
1. [Further readings](#further-readings)
1. [Sources](#sources)
@@ -58,6 +59,15 @@ ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_
</details>
## Run on local models
Performance examples:
| Engine | Model | Context (tokens) | Size in RAM | Executing host | Average time to respond to `Hi!` |
| ------------------ | -------------------- | ---------------- | ----------- | ------------------------ | -------------------------------- |
| llama.cpp (ollama) | glm-4.7-flash:q4_K_M | 4096 | 19 GB | M3 Pro MacBook Pro 36 GB | 59 s |
| llama.cpp (ollama) | glm-4.7-flash:q4_K_M | 8192 | 19 GB | M3 Pro MacBook Pro 36 GB | 52 s |
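A possible way to reproduce such timings, assuming the averages come from single non-interactive runs (the exact measurement methodology is not stated here):

```sh
# Sketch: time one non-interactive prompt through Claude Code pointed at the local Ollama endpoint.
# The model name mirrors the table above; the measurement method is an assumption.
time ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" \
  claude --model 'glm-4.7-flash:q4_K_M' --print 'Hi!'
```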
## Further readings
- [Website]

View File

@@ -27,7 +27,7 @@ open-source transparency, and want efficient resource utilization.
Excellent for building applications that require seamless migration from OpenAI.
<details>
<details style='padding: 0 0 1rem 0'>
<summary>Setup</summary>
```sh
@@ -41,6 +41,17 @@ docker run -d --gpus='all' … 'ollama/ollama'
</details>
The maximum context length for model execution can be set in the app.<br/>
When it is set there, the `OLLAMA_CONTEXT_LENGTH` environment variable on the CLI seems to have no effect; the app's setting is used regardless.
Performance examples:
| Model | Context (tokens) | Size in RAM | Executing host | Average time to respond to `Hi!` |
| -------------------- | ---------------- | ----------- | ------------------------ | -------------------------------- |
| glm-4.7-flash:q4_K_M | 4096 | 19 GB | M3 Pro MacBook Pro 36 GB | 59 s |
| glm-4.7-flash:q4_K_M | 8192 | 19 GB | M3 Pro MacBook Pro 36 GB | 19.28 s |
| glm-4.7-flash:q4_K_M | 16384 | 20 GB | M3 Pro MacBook Pro 36 GB | 9.13 s |
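A minimal sketch to reproduce one row of the table above, assuming the desktop app is not running (so its context setting cannot override the environment variable, per the note before the table):

```sh
# Sketch: serve with an explicit context length, then time a single headless prompt.
OLLAMA_CONTEXT_LENGTH=8192 ollama serve &
time ollama run 'glm-4.7-flash:q4_K_M' 'Hi!' --verbose
```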
The API is available by default at <http://localhost:11434/api> after installation.
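E.g., a quick test request against the generate endpoint (the model and the `num_ctx` value here are just examples):

```sh
# Example call to the local API; the model must already be pulled.
curl 'http://localhost:11434/api/generate' \
  -d '{"model": "glm-4.7-flash:q4_K_M", "prompt": "Hi!", "stream": false, "options": {"num_ctx": 8192}}'
```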
Cloud models are automatically offloaded to Ollama's cloud service.<br/>
@@ -69,10 +80,14 @@ ollama ls
ollama serve
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
# Run models.
# Run models interactively.
ollama run 'gemma3'
docker exec -it 'ollama' ollama run 'llama3.2'
# Run headless.
ollama run 'glm-4.7-flash:q4_K_M' 'Hi! Are you there?' --verbose
OLLAMA_HOST='some.fqdn:11434' ollama run 'glm-4.7-flash:q4_K_M'
# Quickly set up a coding tool with Ollama models.
ollama launch
@@ -121,7 +136,8 @@ ollama signout
```sh
# Run Claude Code on a model served locally by Ollama.
ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model 'lfm2.5-thinking:1.2b'
ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" \
claude --model 'lfm2.5-thinking:1.2b'
```
</details>