chore(kb/ai): performance tests
@@ -9,6 +9,7 @@ Works in a terminal, IDE, browser, and as a desktop app.
## Table of contents <!-- omit in toc -->

1. [TL;DR](#tldr)
1. [Run on local models](#run-on-local-models)
1. [Further readings](#further-readings)
1. [Sources](#sources)
@@ -58,6 +59,15 @@ ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_
</details>

## Run on local models

Performance examples:

| Engine             | Model                | Context (tokens) | Size in RAM | Executing host           | Average time to respond to `Hi!` |
| ------------------ | -------------------- | ---------------- | ----------- | ------------------------ | -------------------------------- |
| llama.cpp (ollama) | glm-4.7-flash:q4_K_M | 4096             | 19 GB       | M3 Pro MacBook Pro 36 GB | 59 s                             |
| llama.cpp (ollama) | glm-4.7-flash:q4_K_M | 8192             | 19 GB       | M3 Pro MacBook Pro 36 GB | 52 s                             |
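
The measurement method is not recorded here; a minimal sketch of one way to get comparable numbers, assuming the model tag from the table is already pulled:

```sh
# `--verbose` makes ollama print token counts and duration statistics after the reply.
ollama run 'glm-4.7-flash:q4_K_M' 'Hi!' --verbose

# Alternatively, time the whole round trip from the shell and average a few runs.
time ollama run 'glm-4.7-flash:q4_K_M' 'Hi!'
```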
## Further readings

- [Website]
@@ -27,7 +27,7 @@ open-source transparency, and want efficient resource utilization.
Excellent for building applications that require seamless migration from OpenAI.
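
As a sketch of that migration path (endpoint per Ollama's OpenAI-compatible API; the model name is only an example), an OpenAI-style chat call can be pointed at the local server:

```sh
# Only the base URL (and the model name) change compared to calling OpenAI.
curl 'http://localhost:11434/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemma3",
    "messages": [ { "role": "user", "content": "Hi!" } ]
  }'
```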
<details>
<details style='padding: 0 0 1rem 0'>
<summary>Setup</summary>

```sh
@@ -41,6 +41,17 @@ docker run -d --gpus='all' … 'ollama/ollama'
</details>

The maximum context for model execution can be set in the app.<br/>
If it is set there, using `OLLAMA_CONTEXT_LENGTH` in the CLI seems to have no effect: the app's setting is used regardless.
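
When running headless, the context can also be requested per call; a minimal sketch, assuming the default local endpoint and the generate API's `num_ctx` option:

```sh
# Override the context window for a single request instead of server-wide.
curl 'http://localhost:11434/api/generate' -d '{
  "model": "glm-4.7-flash:q4_K_M",
  "prompt": "Hi!",
  "stream": false,
  "options": { "num_ctx": 8192 }
}'
```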
Performance examples:
| Model                | Context (tokens) | Size in RAM | Executing host           | Average time to respond to `Hi!` |
| -------------------- | ---------------- | ----------- | ------------------------ | -------------------------------- |
| glm-4.7-flash:q4_K_M | 4096             | 19 GB       | M3 Pro MacBook Pro 36 GB | 59 s                             |
| glm-4.7-flash:q4_K_M | 8192             | 19 GB       | M3 Pro MacBook Pro 36 GB | 19.28 s                          |
| glm-4.7-flash:q4_K_M | 16384            | 20 GB       | M3 Pro MacBook Pro 36 GB | 9.13 s                           |

The API is available after installation at <http://localhost:11434/api> by default.
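
A quick reachability check (paths as in Ollama's API reference):

```sh
# Both calls return JSON if the server is up on the default port.
curl 'http://localhost:11434/api/version'   # server version
curl 'http://localhost:11434/api/tags'      # models available locally
```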
Cloud models are automatically offloaded to Ollama's cloud service.<br/>
@@ -69,10 +80,14 @@ ollama ls
ollama serve
OLLAMA_CONTEXT_LENGTH=64000 ollama serve

# Run models.
# Run models interactively.
ollama run 'gemma3'
docker exec -it 'ollama' ollama run 'llama3.2'

# Run headless.
ollama run 'glm-4.7-flash:q4_K_M' 'Hi! Are you there?' --verbose
OLLAMA_HOST='some.fqdn:11434' ollama run 'glm-4.7-flash:q4_K_M' …

# Quickly set up a coding tool with Ollama models.
ollama launch
@@ -121,7 +136,8 @@ ollama signout
```sh
# Run Claude Code on a model served locally by Ollama.
ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model 'lfm2.5-thinking:1.2b'
ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" \
claude --model 'lfm2.5-thinking:1.2b'
```

</details>