chore(kb/ai): review and expand notes

Michele Cereda
2026-02-19 06:58:49 +01:00
parent 6c6b8e0428
commit d94e63268d
5 changed files with 108 additions and 7 deletions


@@ -14,38 +14,64 @@ Vastly used as base for AI tools like [Ollama] and [Docker model runner].
## TL;DR
<!-- Uncomment if used
<details>
<summary>Setup</summary>
```sh
brew install 'llama.cpp'
```
</details>
-->
<!-- Uncomment if used
<details>
<summary>Usage</summary>
```sh
# List available devices and exit.
llama-cli --list-devices
# List models in cache.
llama-cli -cl
llama-cli --cache-list
# Run models from files interactively.
llama-cli -m 'path/to/model.gguf'
llama-cli -m 'path/to/target/model.gguf' -md 'path/to/draft/model.gguf'
# Download and run models.
llama-cli -mu 'https://example.org/some/model' # URL
llama-cli -hf 'ggml-org/gemma-3-1b-it-GGUF' -c '32768' # Hugging Face
llama-cli -dr 'ai/qwen2.5-coder' --offline # Docker Hub
# Launch the OpenAI-compatible API server.
llama-server -m 'path/to/model.gguf'
llama-server -hf 'ggml-org/gemma-3-1b-it-GGUF' --port '8080' --host '127.0.0.1'
# Run benchmarks.
llama-bench -m 'path/to/model.gguf'
llama-bench -m 'models/7B/ggml-model-q4_0.gguf' -m 'models/13B/ggml-model-q4_0.gguf' -p '0' -n '128,256,512' --progress
```
</details>
-->
The web UI can be accessed via browser at <http://localhost:8080>.<br/>
The chat completions endpoint is at <http://localhost:8080/v1/chat/completions>.
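The endpoint can be queried with any OpenAI-compatible HTTP client; a minimal sketch with `curl`, assuming a server was started as above on the default port:

```sh
# Send a chat completion request to a running llama-server instance.
curl -s 'http://localhost:8080/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{
        "messages": [
          { "role": "user", "content": "Summarize speculative decoding in one sentence." }
        ]
      }'
```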
<!-- Uncomment if used
<details>
<summary>Real world use cases</summary>
```sh
# Use models pulled with Ollama.
jq -r '.layers|sort_by(.size)[-1].digest|sub(":";"-")' \
"$HOME/.ollama/models/manifests/registry.ollama.ai/library/codellama/13b" \
| xargs -pI '%%' llama-bench -m "$HOME/.ollama/models/blobs/%%" --progress
```
</details>
-->
## Further readings
- [Website]
- [Codebase]
- [ik_llama.cpp]
@@ -64,6 +90,7 @@ Vastly used as base for AI tools like [Ollama] and [Docker model runner].
<!-- Files -->
<!-- Upstream -->
[Codebase]: https://github.com/ggml-org/llama.cpp
[Website]: https://llama-cpp.com/
<!-- Others -->
[ik_llama.cpp]: https://github.com/ikawrakow/ik_llama.cpp


@@ -13,6 +13,8 @@ They have superseded recurrent neural network-based models.
1. [TL;DR](#tldr)
1. [Reasoning](#reasoning)
1. [Inference](#inference)
1. [Speculative decoding](#speculative-decoding)
1. [Concerns](#concerns)
1. [Run LLMs Locally](#run-llms-locally)
1. [Further readings](#further-readings)
@@ -79,6 +81,57 @@ is satisfied.
Next step is [agentic AI][agent].
## Inference
### Speculative decoding
Refer:
- [Fast Inference from Transformers via Speculative Decoding].
- [Accelerating Large Language Model Decoding with Speculative Sampling].
- [An Introduction to Speculative Decoding for Reducing Latency in AI Inference].
- [Looking back at speculative decoding].
Makes inference faster and more responsive, significantly reducing latency while preserving output quality by
predicting and verifying multiple tokens simultaneously.
Pairs a target LLM with a less resource-intensive _draft_ model.<br/>
The smaller model quickly proposes several next tokens, taking over part of the standard autoregressive decoding the
target model would normally perform and hence reducing the number of sequential steps.<br/>
The target model verifies the proposed tokens in a single forward pass instead of one at a time, accepts the longest
prefix that matches its own predictions, and continues from there.
Generating multiple tokens at once cuts latency and boosts throughput without impacting accuracy.
Use cases:
- Speeding up input-grounded tasks like translation, summarization, and transcription.
- Performing greedy decoding by always selecting the most likely token.
- Low-temperature sampling when outputs need to be focused and predictable.
- Cases where the target model barely fits in the GPU's memory.
Cons:
- Increases memory overhead due to both models needing to be loaded at the same time.
- Less effective for high-temperature sampling (e.g. creative writing).
- Benefits drop if the draft model is poorly matched to the target model.
- Gains are minimal for very small target models that already fit easily in memory.
Effectiveness depends on selecting the right draft model.<br/>
A poor choice will grant minimal speedup, or even slow things down.
The draft model must have:
- At least 10× **_fewer_** parameters than the target model.<br/>
Large draft models will generate tokens more slowly, which defeats the purpose.
- The same tokenizer as the target model.<br/>
This is non-negotiable, since the two models must follow the same internal processes to be compatible.
- Similar training data, to maximize the target model's acceptance rate.
- The same architecture family as the target model, when possible.
Usually, a distilled or simplified version of the target model works best.<br/>
For domain-specific applications, consider fine-tuning a small model to mimic the target model's behavior.
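The propose-and-verify loop described above can be sketched with toy stand-ins for the two models. Here `draft_next` and `target_next` are hypothetical deterministic next-token functions standing in for real LLMs, just to illustrate the longest-accepted-prefix logic:

```python
def draft_propose(prefix, k, draft_next):
    """Draft model autoregressively proposes k candidate tokens."""
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    return proposed

def target_verify(prefix, proposed, target_next):
    """Target model checks the proposals (in a real system, all in one
    forward pass), keeps the longest prefix matching its own greedy
    predictions, and appends its own token at the first mismatch."""
    accepted = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # target's correction.
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Toy "models": the target continues an arithmetic sequence,
# the draft agrees with it only up to a point.
target_next = lambda ctx: ctx[-1] + 1
draft_next = lambda ctx: ctx[-1] + 1 if ctx[-1] < 4 else 0

proposed = draft_propose([1, 2], 4, draft_next)          # [3, 4, 0, 1]
accepted = target_verify([1, 2], proposed, target_next)  # [3, 4, 5]
```

Three tokens come out of a single verification round here, instead of the three sequential target-model steps plain autoregressive decoding would need.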
## Concerns
- Lots of people currently think of LLMs as _real intelligence_, which they are not.
@@ -104,6 +157,8 @@ Refer:
## Further readings
- [SEQUOIA: Serving exact Llama2-70B on an RTX4090 with half-second per token latency]
### Sources
- [Run LLMs Locally: 6 Simple Methods]
@@ -129,14 +184,19 @@ Refer:
<!-- Files -->
<!-- Upstream -->
<!-- Others -->
[Accelerating Large Language Model Decoding with Speculative Sampling]: https://arxiv.org/abs/2302.01318
[An Introduction to Speculative Decoding for Reducing Latency in AI Inference]: https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/
[ChatGPT]: https://chatgpt.com/
[Copilot]: https://copilot.microsoft.com/
[Duck AI]: https://duck.ai/
[Fast Inference from Transformers via Speculative Decoding]: https://arxiv.org/abs/2211.17192
[Grok]: https://grok.com/
[Jan]: https://www.jan.ai/
[Llama]: https://www.llama.com/
[Llamafile]: https://github.com/mozilla-ai/llamafile
[Local LLM Hosting: Complete 2026 Guide - Ollama, vLLM, LocalAI, Jan, LM Studio & More]: https://www.glukhov.org/post/2025/11/hosting-llms-ollama-localai-jan-lmstudio-vllm-comparison/
[Looking back at speculative decoding]: https://research.google/blog/looking-back-at-speculative-decoding/
[Mistral]: https://mistral.ai/
[OpenClaw: Who are you?]: https://www.youtube.com/watch?v=hoeEclqW8Gs
[Run LLMs Locally: 6 Simple Methods]: https://www.datacamp.com/tutorial/run-llms-locally-tutorial
[SEQUOIA: Serving exact Llama2-70B on an RTX4090 with half-second per token latency]: https://infini-ai-lab.github.io/Sequoia-Page/


@@ -126,6 +126,7 @@ ollama ls
# Show models information.
ollama show 'codellama:13b'
ollama show --verbose 'llama3.2'
# Run models interactively.
ollama run 'gemma3'


@@ -97,6 +97,9 @@ jq '.rules=([inputs.rules]|flatten)' 'starting-rule-set.json' 'parts'/*'.json'
# Put specific keys on top.
jq '.objects = [(.objects[] as $in | {type,name,id} + $in)]' 'prod/dataPipeline_deviceLocationConversion_prod.json'
# Sort descending by property `age`, take the first 3 elements.
jq -r 'sort_by(.age)|reverse|[limit(3;.[])]' 'file.json'
```
</details>
@@ -139,6 +142,10 @@ helm template 'chartName' \
# Check that the 'backend.url' key in a 'Pulumi.yaml' file does not start with 'file://' and fail otherwise.
yq -e '(.backend.url|test("^file://")?)|not' 'Pulumi.yaml'
# Get the digest of the biggest element, then replace ':' with '-'.
jq -r '.layers|sort_by(.size)[-1].digest|sub(":";"-")' \
"$HOME/.ollama/models/manifests/registry.ollama.ai/library/codellama/13b"
```
</details>
@@ -162,6 +169,7 @@ yq -e '(.backend.url|test("^file://")?)|not' 'Pulumi.yaml'
- [Remove all null values]
- [jq: select where .attribute in list]
- [An Introduction to JQ]
- [How to sort a json file by keys and values of those keys in jq]
<!--
Reference
@@ -178,6 +186,7 @@ yq -e '(.backend.url|test("^file://")?)|not' 'Pulumi.yaml'
[change multiple values at once]: https://stackoverflow.com/questions/47355901/jq-change-multiple-values#47357956
[deleting multiple keys at once with jq]: https://stackoverflow.com/questions/36227245/deleting-multiple-keys-at-once-with-jq
[filter objects list with regex]: https://til.hashrocket.com/posts/uv0bjiokwk-use-jq-to-filter-objects-list-with-regex
[How to sort a json file by keys and values of those keys in jq]: https://stackoverflow.com/questions/30331504/how-to-sort-a-json-file-by-keys-and-values-of-those-keys-in-jq
[jq select range]: https://stackoverflow.com/questions/45548604/jq-select-range
[jq: select where .attribute in list]: https://stackoverflow.com/questions/50750688/jq-select-where-attribute-in-list
[remove all null values]: https://stackoverflow.com/questions/39500608/remove-all-null-values


@@ -115,6 +115,10 @@ yq -e '(.backend.url|test("^file://")?)|not' 'Pulumi.yaml'
# Apply formatting to the same file you read from
yq -iY --explicit-start '.' 'external-snapshotter/crds.yml'
# Get the digest of the biggest element, then replace ':' with '-'.
jq -r '.layers|sort_by(.size)[-1].digest|sub(":";"-")' \
"$HOME/.ollama/models/manifests/registry.ollama.ai/library/codellama/13b"
# Sort
# Refer <https://stackoverflow.com/questions/30331504/how-to-sort-a-json-file-by-keys-and-values-of-those-keys-in-jq>
# by key