mirror of
https://gitea.com/mcereda/oam.git
synced 2026-02-26 13:14:24 +00:00
chore(kb/ai): review and expand notes
@@ -14,38 +14,64 @@ Vastly used as base for AI tools like [Ollama] and [Docker model runner].

## TL;DR

<!-- Uncomment if used
<details>
<summary>Setup</summary>

```sh
brew install 'llama.cpp'
```

</details>
-->

<!-- Uncomment if used
<details>
<summary>Usage</summary>

```sh
# List available devices and exit.
llama-cli --list-devices

# List models in cache.
llama-cli -cl
llama-cli --cache-list

# Run models from files interactively.
llama-cli -m 'path/to/model.gguf'
llama-cli -m 'path/to/target/model.gguf' -md 'path/to/draft/model.gguf'

# Download and run models.
llama-cli -mu 'https://example.org/some/model'  # URL
llama-cli -hf 'ggml-org/gemma-3-1b-it-GGUF' -c '32768'  # Hugging Face
llama-cli -dr 'ai/qwen2.5-coder' --offline  # Docker Hub

# Launch the OpenAI-compatible API server.
llama-server -m 'path/to/model.gguf'
llama-server -hf 'ggml-org/gemma-3-1b-it-GGUF' --port '8080' --host '127.0.0.1'

# Run benchmarks.
llama-bench -m 'path/to/model.gguf'
llama-bench -m 'models/7B/ggml-model-q4_0.gguf' -m 'models/13B/ggml-model-q4_0.gguf' -p '0' -n '128,256,512' --progress
```

</details>
-->

The web UI can be accessed via browser at <http://localhost:8080>.<br/>
The chat completion endpoint is at <http://localhost:8080/v1/chat/completions>.

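The endpoint speaks the OpenAI chat completions format. Below is a minimal sketch, from Python's standard library, of a request against it; the prompt is illustrative, it assumes llama-server is listening on `localhost:8080`, and the `model` field is a placeholder since llama-server answers with whichever model it was started with.

```python
import json
import urllib.request


def chat_payload(prompt):
    """Build an OpenAI-style chat completion request body.

    The 'model' value is a placeholder: llama-server serves the model
    it was launched with, regardless of what is sent here.
    """
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt, base_url="http://localhost:8080"):
    """POST to the chat completion endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]["content"]


# chat("Say hello in one word.") would return the assistant's reply once
# llama-server is up; here we only show the payload being built.
print(json.dumps(chat_payload("Say hello in one word."), indent=2))
```

Any OpenAI-compatible client library should work the same way by pointing its base URL at the server.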
<!-- Uncomment if used
<details>
<summary>Real world use cases</summary>

```sh
# Use models pulled with Ollama.
jq -r '.layers|sort_by(.size)[-1].digest|sub(":";"-")' \
  "$HOME/.ollama/models/manifests/registry.ollama.ai/library/codellama/13b" \
  | xargs -pI '%%' llama-bench -m "$HOME/.ollama/models/blobs/%%" --progress
```

</details>
-->

## Further readings

- [Website]
- [Codebase]
- [ik_llama.cpp]

@@ -64,6 +90,7 @@ Vastly used as base for AI tools like [Ollama] and [Docker model runner].

<!-- Files -->
<!-- Upstream -->
[Codebase]: https://github.com/ggml-org/llama.cpp
[Website]: https://llama-cpp.com/

<!-- Others -->
[ik_llama.cpp]: https://github.com/ikawrakow/ik_llama.cpp

@@ -13,6 +13,8 @@ They have superseded recurrent neural network-based models.

1. [TL;DR](#tldr)
1. [Reasoning](#reasoning)
1. [Inference](#inference)
   1. [Speculative decoding](#speculative-decoding)
1. [Concerns](#concerns)
1. [Run LLMs Locally](#run-llms-locally)
1. [Further readings](#further-readings)

@@ -79,6 +81,57 @@ is satisfied.

Next step is [agentic AI][agent].

## Inference

### Speculative decoding

Refer:

- [Fast Inference from Transformers via Speculative Decoding].
- [Accelerating Large Language Model Decoding with Speculative Sampling].
- [An Introduction to Speculative Decoding for Reducing Latency in AI Inference].
- [Looking back at speculative decoding].

Makes inference faster and more responsive, significantly reducing latency while preserving output quality by
predicting and verifying multiple tokens simultaneously.

Pairs a target LLM with a less resource-intensive _draft_ model.<br/>
The smaller model quickly proposes several next tokens to the target model, offloading part of the standard
autoregressive decoding the target would normally do and hence reducing the number of sequential steps.<br/>
The target model verifies the proposed tokens in a single forward pass instead of one at a time, accepts the longest
prefix that matches its own predictions, and continues from there.

Generating multiple tokens at once cuts latency and boosts throughput without impacting accuracy.

Use cases:

- Speeding up input-grounded tasks like translation, summarization, and transcription.
- Performing greedy decoding by always selecting the most likely token.
- Low-temperature sampling, when outputs need to be focused and predictable.
- Situations where the target model barely fits in the GPU's memory.

Cons:

- Increases memory overhead due to both models needing to be loaded at the same time.
- Less effective for high-temperature sampling (e.g. creative writing).
- Benefits drop if the draft model is poorly matched to the target model.
- Gains are minimal for very small target models that already fit easily in memory.

Effectiveness depends on selecting the right draft model.<br/>
A poor choice will grant minimal speedup, or even slow things down.

The draft model must have:

- At least 10× **_fewer_** parameters than the target model.<br/>
  Large draft models will generate tokens more slowly, which defeats the purpose.
- The same tokenizer as the target model.<br/>
  This is non-negotiable, since the two models must follow the same internal processes to be compatible.
- Similar training data, to maximize the target model's acceptance rate.
- The same architecture family, when possible.

Usually, a distilled or simplified version of the target model works best.<br/>
For domain-specific applications, consider fine-tuning a small model to mimic the target model's behavior.
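The propose-and-verify loop described above can be sketched with toy stand-ins for the two models. The deterministic "models" below are assumptions for illustration only: each just emits the next character of a fixed string, with the draft deliberately diverging at one spot so the fallback path is exercised.

```python
# Toy sketch of the speculative decoding loop: the draft proposes k tokens,
# the target verifies them in one pass and accepts the longest matching prefix.
TARGET_TEXT = list("the quick brown fox jumps")
DRAFT_TEXT = list("the quick brewn fox jumps")  # draft diverges at one character


def next_token(text, prefix_len):
    """Stand-in for a model's next-token prediction given a prefix length."""
    return text[prefix_len] if prefix_len < len(text) else None


def speculative_decode(k=4):
    out = []
    while len(out) < len(TARGET_TEXT):
        # Draft model proposes up to k next tokens autoregressively.
        proposed = []
        for i in range(k):
            token = next_token(DRAFT_TEXT, len(out) + i)
            if token is None:
                break
            proposed.append(token)
        # Target model verifies the proposals and accepts the longest
        # prefix that matches its own predictions.
        accepted = 0
        for i, token in enumerate(proposed):
            if next_token(TARGET_TEXT, len(out) + i) == token:
                accepted += 1
            else:
                break
        out.extend(proposed[:accepted])
        # On a mismatch (or no acceptance), the target supplies the next
        # token itself, exactly as plain autoregressive decoding would.
        if accepted < len(proposed) or accepted == 0:
            token = next_token(TARGET_TEXT, len(out))
            if token is not None:
                out.append(token)
    return "".join(out)


print(speculative_decode())  # → the quick brown fox jumps
```

Note that the output always matches what the target alone would produce; the draft only changes how many sequential target steps are needed, which is why output quality is preserved.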

## Concerns

- Lots of people currently think of LLMs as _real intelligence_, which they are not.

@@ -104,6 +157,8 @@ Refer:

## Further readings

- [SEQUOIA: Serving exact Llama2-70B on an RTX4090 with half-second per token latency]

### Sources

- [Run LLMs Locally: 6 Simple Methods]

@@ -129,14 +184,19 @@ Refer:

<!-- Files -->
<!-- Upstream -->
<!-- Others -->
[Accelerating Large Language Model Decoding with Speculative Sampling]: https://arxiv.org/abs/2302.01318
[An Introduction to Speculative Decoding for Reducing Latency in AI Inference]: https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/
[ChatGPT]: https://chatgpt.com/
[Copilot]: https://copilot.microsoft.com/
[Duck AI]: https://duck.ai/
[Fast Inference from Transformers via Speculative Decoding]: https://arxiv.org/abs/2211.17192
[Grok]: https://grok.com/
[Jan]: https://www.jan.ai/
[Llama]: https://www.llama.com/
[Llamafile]: https://github.com/mozilla-ai/llamafile
[Local LLM Hosting: Complete 2026 Guide - Ollama, vLLM, LocalAI, Jan, LM Studio & More]: https://www.glukhov.org/post/2025/11/hosting-llms-ollama-localai-jan-lmstudio-vllm-comparison/
[Looking back at speculative decoding]: https://research.google/blog/looking-back-at-speculative-decoding/
[Mistral]: https://mistral.ai/
[OpenClaw: Who are you?]: https://www.youtube.com/watch?v=hoeEclqW8Gs
[Run LLMs Locally: 6 Simple Methods]: https://www.datacamp.com/tutorial/run-llms-locally-tutorial
[SEQUOIA: Serving exact Llama2-70B on an RTX4090 with half-second per token latency]: https://infini-ai-lab.github.io/Sequoia-Page/

@@ -126,6 +126,7 @@ ollama ls

# Show models information.
ollama show 'codellama:13b'
ollama show --verbose 'llama3.2'

# Run models interactively.
ollama run 'gemma3'

@@ -97,6 +97,9 @@ jq '.rules=([inputs.rules]|flatten)' 'starting-rule-set.json' 'parts'/*'.json'

# Put specific keys on top.
jq '.objects = [(.objects[] as $in | {type,name,id} + $in)]' 'prod/dataPipeline_deviceLocationConversion_prod.json'

# Sort descending by property `age`, take the first 3 elements.
jq -r 'sort_by(.age)|reverse|[limit(3;.[])]' 'file.json'
```

</details>

@@ -139,6 +142,10 @@ helm template 'chartName' \

# Check that the 'backend.url' key in a 'Pulumi.yaml' file does not start with 'file://', and fail otherwise.
yq -e '(.backend.url|test("^file://")?)|not' 'Pulumi.yaml'

# Get the digest of the biggest element, then replace ':' with '-'.
jq -r '.layers|sort_by(.size)[-1].digest|sub(":";"-")' \
  "$HOME/.ollama/models/manifests/registry.ollama.ai/library/codellama/13b"
```

</details>

@@ -162,6 +169,7 @@ yq -e '(.backend.url|test("^file://")?)|not' 'Pulumi.yaml'

- [Remove all null values]
- [jq: select where .attribute in list]
- [An Introduction to JQ]
- [How to sort a json file by keys and values of those keys in jq]

<!--
Reference

@@ -178,6 +186,7 @@ yq -e '(.backend.url|test("^file://")?)|not' 'Pulumi.yaml'

[change multiple values at once]: https://stackoverflow.com/questions/47355901/jq-change-multiple-values#47357956
[deleting multiple keys at once with jq]: https://stackoverflow.com/questions/36227245/deleting-multiple-keys-at-once-with-jq
[filter objects list with regex]: https://til.hashrocket.com/posts/uv0bjiokwk-use-jq-to-filter-objects-list-with-regex
[How to sort a json file by keys and values of those keys in jq]: https://stackoverflow.com/questions/30331504/how-to-sort-a-json-file-by-keys-and-values-of-those-keys-in-jq
[jq select range]: https://stackoverflow.com/questions/45548604/jq-select-range
[jq: select where .attribute in list]: https://stackoverflow.com/questions/50750688/jq-select-where-attribute-in-list
[remove all null values]: https://stackoverflow.com/questions/39500608/remove-all-null-values

@@ -115,6 +115,10 @@ yq -e '(.backend.url|test("^file://")?)|not' 'Pulumi.yaml'

# Apply formatting to the same file you read from.
yq -iY --explicit-start '.' 'external-snapshotter/crds.yml'

# Get the digest of the biggest element, then replace ':' with '-'.
jq -r '.layers|sort_by(.size)[-1].digest|sub(":";"-")' \
  "$HOME/.ollama/models/manifests/registry.ollama.ai/library/codellama/13b"

# Sort
# Refer <https://stackoverflow.com/questions/30331504/how-to-sort-a-json-file-by-keys-and-values-of-those-keys-in-jq>
# by key