mirror of
https://gitea.com/mcereda/oam.git
synced 2026-02-26 13:14:24 +00:00
chore(kb/ai): review and expand notes
@@ -14,38 +14,64 @@ Vastly used as base for AI tools like [Ollama] and [Docker model runner].

## TL;DR

<!-- Uncomment if used
<details>
<summary>Setup</summary>

```sh
brew install 'llama.cpp'
```

</details>
-->

<!-- Uncomment if used
<details>
<summary>Usage</summary>

```sh
# List available devices and exit.
llama-cli --list-devices

# List models in cache.
llama-cli -cl
llama-cli --cache-list

# Run models from files interactively.
llama-cli -m 'path/to/model.gguf'
llama-cli -m 'path/to/target/model.gguf' -md 'path/to/draft/model.gguf'

# Download and run models.
llama-cli -mu 'https://example.org/some/model'  # URL
llama-cli -hf 'ggml-org/gemma-3-1b-it-GGUF' -c '32768'  # Hugging Face
llama-cli -dr 'ai/qwen2.5-coder' --offline  # Docker Hub

# Launch the OpenAI-compatible API server.
llama-server -m 'path/to/model.gguf'
llama-server -hf 'ggml-org/gemma-3-1b-it-GGUF' --port '8080' --host '127.0.0.1'

# Run benchmarks.
llama-bench -m 'path/to/model.gguf'
llama-bench -m 'models/7B/ggml-model-q4_0.gguf' -m 'models/13B/ggml-model-q4_0.gguf' -p '0' -n '128,256,512' --progress
```

</details>
-->

The web UI can be accessed via browser at <http://localhost:8080>.<br/>
The chat completion endpoint is at <http://localhost:8080/v1/chat/completions>.

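The endpoint speaks the OpenAI chat completions format. Below is a minimal sketch, from Python's standard library, of a request against it; the prompt is illustrative, it assumes llama-server is listening on `localhost:8080`, and the `model` field is a placeholder since llama-server answers with whichever model it was started with.

```python
import json
import urllib.request


def chat_payload(prompt):
    """Build an OpenAI-style chat completion request body.

    The 'model' value is a placeholder: llama-server serves the model
    it was launched with, regardless of what is sent here.
    """
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt, base_url="http://localhost:8080"):
    """POST to the chat completion endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]["content"]


# chat("Say hello in one word.") would return the assistant's reply once
# llama-server is up; here we only show the payload being built.
print(json.dumps(chat_payload("Say hello in one word."), indent=2))
```

Any OpenAI-compatible client library should work the same way by pointing its base URL at the server.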
<!-- Uncomment if used
<details>
<summary>Real world use cases</summary>

```sh
# Use models pulled with Ollama.
jq -r '.layers|sort_by(.size)[-1].digest|sub(":";"-")' \
  "$HOME/.ollama/models/manifests/registry.ollama.ai/library/codellama/13b" \
  | xargs -pI '%%' llama-bench -m "$HOME/.ollama/models/blobs/%%" --progress
```

</details>
-->

## Further readings

- [Website]
- [Codebase]
- [ik_llama.cpp]

@@ -64,6 +90,7 @@ Vastly used as base for AI tools like [Ollama] and [Docker model runner].

<!-- Files -->
<!-- Upstream -->
[Codebase]: https://github.com/ggml-org/llama.cpp
[Website]: https://llama-cpp.com/

<!-- Others -->
[ik_llama.cpp]: https://github.com/ikawrakow/ik_llama.cpp

@@ -13,6 +13,8 @@ They have superseded recurrent neural network-based models.

1. [TL;DR](#tldr)
1. [Reasoning](#reasoning)
1. [Inference](#inference)
   1. [Speculative decoding](#speculative-decoding)
1. [Concerns](#concerns)
1. [Run LLMs Locally](#run-llms-locally)
1. [Further readings](#further-readings)

@@ -79,6 +81,57 @@ is satisfied.

Next step is [agentic AI][agent].

## Inference

### Speculative decoding

Refer:

- [Fast Inference from Transformers via Speculative Decoding].
- [Accelerating Large Language Model Decoding with Speculative Sampling].
- [An Introduction to Speculative Decoding for Reducing Latency in AI Inference].
- [Looking back at speculative decoding].

Makes inference faster and more responsive, significantly reducing latency while preserving output quality by
predicting and verifying multiple tokens simultaneously.

Pairs a target LLM with a less resource-intensive _draft_ model.<br/>
The smaller model quickly proposes several next tokens to the target model, offloading part of the standard
autoregressive decoding the target would normally do and hence reducing the number of sequential steps.<br/>
The target model verifies the proposed tokens in a single forward pass instead of one at a time, accepts the longest
prefix that matches its own predictions, and continues from there.

Generating multiple tokens at once cuts latency and boosts throughput without impacting accuracy.

Use cases:

- Speeding up input-grounded tasks like translation, summarization, and transcription.
- Performing greedy decoding by always selecting the most likely token.
- Low-temperature sampling, when outputs need to be focused and predictable.
- Situations where the target model barely fits in the GPU's memory.

Cons:

- Increases memory overhead due to both models needing to be loaded at the same time.
- Less effective for high-temperature sampling (e.g. creative writing).
- Benefits drop if the draft model is poorly matched to the target model.
- Gains are minimal for very small target models that already fit easily in memory.

Effectiveness depends on selecting the right draft model.<br/>
A poor choice will grant minimal speedup, or even slow things down.

The draft model must have:

- At least 10× **_fewer_** parameters than the target model.<br/>
  Large draft models will generate tokens more slowly, which defeats the purpose.
- The same tokenizer as the target model.<br/>
  This is non-negotiable, since the two models must follow the same internal processes to be compatible.
- Similar training data, to maximize the target model's acceptance rate.
- The same architecture family, when possible.

Usually, a distilled or simplified version of the target model works best.<br/>
For domain-specific applications, consider fine-tuning a small model to mimic the target model's behavior.
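The propose-and-verify loop described above can be sketched with toy stand-ins for the two models. The deterministic "models" below are assumptions for illustration only: each just emits the next character of a fixed string, with the draft deliberately diverging at one spot so the fallback path is exercised.

```python
# Toy sketch of the speculative decoding loop: the draft proposes k tokens,
# the target verifies them in one pass and accepts the longest matching prefix.
TARGET_TEXT = list("the quick brown fox jumps")
DRAFT_TEXT = list("the quick brewn fox jumps")  # draft diverges at one character


def next_token(text, prefix_len):
    """Stand-in for a model's next-token prediction given a prefix length."""
    return text[prefix_len] if prefix_len < len(text) else None


def speculative_decode(k=4):
    out = []
    while len(out) < len(TARGET_TEXT):
        # Draft model proposes up to k next tokens autoregressively.
        proposed = []
        for i in range(k):
            token = next_token(DRAFT_TEXT, len(out) + i)
            if token is None:
                break
            proposed.append(token)
        # Target model verifies the proposals and accepts the longest
        # prefix that matches its own predictions.
        accepted = 0
        for i, token in enumerate(proposed):
            if next_token(TARGET_TEXT, len(out) + i) == token:
                accepted += 1
            else:
                break
        out.extend(proposed[:accepted])
        # On a mismatch (or no acceptance), the target supplies the next
        # token itself, exactly as plain autoregressive decoding would.
        if accepted < len(proposed) or accepted == 0:
            token = next_token(TARGET_TEXT, len(out))
            if token is not None:
                out.append(token)
    return "".join(out)


print(speculative_decode())  # → the quick brown fox jumps
```

Note that the output always matches what the target alone would produce; the draft only changes how many sequential target steps are needed, which is why output quality is preserved.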

## Concerns

- Lots of people currently think of LLMs as _real intelligence_, which they are not.

@@ -104,6 +157,8 @@ Refer:

## Further readings

- [SEQUOIA: Serving exact Llama2-70B on an RTX4090 with half-second per token latency]

### Sources

- [Run LLMs Locally: 6 Simple Methods]

@@ -129,14 +184,19 @@ Refer:

<!-- Files -->
<!-- Upstream -->
<!-- Others -->
[Accelerating Large Language Model Decoding with Speculative Sampling]: https://arxiv.org/abs/2302.01318
[An Introduction to Speculative Decoding for Reducing Latency in AI Inference]: https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/
[ChatGPT]: https://chatgpt.com/
[Copilot]: https://copilot.microsoft.com/
[Duck AI]: https://duck.ai/
[Fast Inference from Transformers via Speculative Decoding]: https://arxiv.org/abs/2211.17192
[Grok]: https://grok.com/
[Jan]: https://www.jan.ai/
[Llama]: https://www.llama.com/
[Llamafile]: https://github.com/mozilla-ai/llamafile
[Local LLM Hosting: Complete 2026 Guide - Ollama, vLLM, LocalAI, Jan, LM Studio & More]: https://www.glukhov.org/post/2025/11/hosting-llms-ollama-localai-jan-lmstudio-vllm-comparison/
[Looking back at speculative decoding]: https://research.google/blog/looking-back-at-speculative-decoding/
[Mistral]: https://mistral.ai/
[OpenClaw: Who are you?]: https://www.youtube.com/watch?v=hoeEclqW8Gs
[Run LLMs Locally: 6 Simple Methods]: https://www.datacamp.com/tutorial/run-llms-locally-tutorial
[SEQUOIA: Serving exact Llama2-70B on an RTX4090 with half-second per token latency]: https://infini-ai-lab.github.io/Sequoia-Page/

@@ -126,6 +126,7 @@ ollama ls

# Show models information.
ollama show 'codellama:13b'
ollama show --verbose 'llama3.2'

# Run models interactively.
ollama run 'gemma3'

@@ -97,6 +97,9 @@ jq '.rules=([inputs.rules]|flatten)' 'starting-rule-set.json' 'parts'/*'.json'

# Put specific keys on top.
jq '.objects = [(.objects[] as $in | {type,name,id} + $in)]' 'prod/dataPipeline_deviceLocationConversion_prod.json'

# Sort descending by property `age`, take the first 3 elements.
jq -r 'sort_by(.age)|reverse|[limit(3;.[])]' 'file.json'
```

</details>

@@ -139,6 +142,10 @@ helm template 'chartName' \

# Check that the 'backend.url' key in a 'Pulumi.yaml' file does not start with 'file://', and fail otherwise.
yq -e '(.backend.url|test("^file://")?)|not' 'Pulumi.yaml'

# Get the digest of the biggest element, then replace ':' with '-'.
jq -r '.layers|sort_by(.size)[-1].digest|sub(":";"-")' \
  "$HOME/.ollama/models/manifests/registry.ollama.ai/library/codellama/13b"
```

</details>

@@ -162,6 +169,7 @@ yq -e '(.backend.url|test("^file://")?)|not' 'Pulumi.yaml'

- [Remove all null values]
- [jq: select where .attribute in list]
- [An Introduction to JQ]
- [How to sort a json file by keys and values of those keys in jq]

<!--
Reference

@@ -178,6 +186,7 @@ yq -e '(.backend.url|test("^file://")?)|not' 'Pulumi.yaml'

[change multiple values at once]: https://stackoverflow.com/questions/47355901/jq-change-multiple-values#47357956
[deleting multiple keys at once with jq]: https://stackoverflow.com/questions/36227245/deleting-multiple-keys-at-once-with-jq
[filter objects list with regex]: https://til.hashrocket.com/posts/uv0bjiokwk-use-jq-to-filter-objects-list-with-regex
[How to sort a json file by keys and values of those keys in jq]: https://stackoverflow.com/questions/30331504/how-to-sort-a-json-file-by-keys-and-values-of-those-keys-in-jq
[jq select range]: https://stackoverflow.com/questions/45548604/jq-select-range
[jq: select where .attribute in list]: https://stackoverflow.com/questions/50750688/jq-select-where-attribute-in-list
[remove all null values]: https://stackoverflow.com/questions/39500608/remove-all-null-values

@@ -115,6 +115,10 @@ yq -e '(.backend.url|test("^file://")?)|not' 'Pulumi.yaml'

# Apply formatting to the same file you read from.
yq -iY --explicit-start '.' 'external-snapshotter/crds.yml'

# Get the digest of the biggest element, then replace ':' with '-'.
jq -r '.layers|sort_by(.size)[-1].digest|sub(":";"-")' \
  "$HOME/.ollama/models/manifests/registry.ollama.ai/library/codellama/13b"

# Sort
# Refer <https://stackoverflow.com/questions/30331504/how-to-sort-a-json-file-by-keys-and-values-of-those-keys-in-jq>
# by key