diff --git a/knowledge base/ai/llama.cpp.md b/knowledge base/ai/llama.cpp.md index 011684e..f5f3038 100644 --- a/knowledge base/ai/llama.cpp.md +++ b/knowledge base/ai/llama.cpp.md @@ -14,38 +14,64 @@ Vastly used as base for AI tools like [Ollama] and [Docker model runner]. ## TL;DR - - +The web UI can be accessed via browser at .
+The chat completion endpoint is at . + + - ## Further readings +- [Website] - [Codebase] - [ik_llama.cpp] @@ -64,6 +90,7 @@ Vastly used as base for AI tools like [Ollama] and [Docker model runner]. [Codebase]: https://github.com/ggml-org/llama.cpp +[Website]: https://llama-cpp.com/ [ik_llama.cpp]: https://github.com/ikawrakow/ik_llama.cpp diff --git a/knowledge base/ai/llm.md b/knowledge base/ai/llm.md index 0c96dd5..d09dcae 100644 --- a/knowledge base/ai/llm.md +++ b/knowledge base/ai/llm.md @@ -13,6 +13,8 @@ They have superseded recurrent neural network-based models. 1. [TL;DR](#tldr) 1. [Reasoning](#reasoning) +1. [Inference](#inference) + 1. [Speculative decoding](#speculative-decoding) 1. [Concerns](#concerns) 1. [Run LLMs Locally](#run-llms-locally) 1. [Further readings](#further-readings) @@ -79,6 +81,57 @@ is satisfied. Next step is [agentic AI][agent]. +## Inference + +### Speculative decoding + +Refer: + +- [Fast Inference from Transformers via Speculative Decoding]. +- [Accelerating Large Language Model Decoding with Speculative Sampling]. +- [An Introduction to Speculative Decoding for Reducing Latency in AI Inference]. +- [Looking back at speculative decoding]. + +Makes inference faster and more responsive by predicting and verifying multiple tokens simultaneously, +significantly reducing latency while preserving output quality. + +Pairs a target LLM with a less resource-intensive _draft_ model.
+The smaller model quickly proposes several next tokens to the target model, taking over part of the standard +autoregressive decoding the target would otherwise perform and thus reducing the number of sequential steps.
+The target model verifies the proposed tokens in a single forward pass instead of one at a time, accepts the longest +prefix that matches its own predictions, and continues from there. + +Generating multiple tokens at once cuts latency and boosts throughput without impacting accuracy. + +Use cases: + +- Speeding up input-grounded tasks like translation, summarization, and transcription. +- Performing greedy decoding by always selecting the most likely token. +- Sampling at low temperature, when outputs need to be focused and predictable. +- Serving a target model that barely fits in the GPU's memory. + +Cons: + +- Increases memory overhead due to both models needing to be loaded at the same time. +- Less effective for high-temperature sampling (e.g. creative writing). +- Benefits drop if the draft model is poorly matched to the target model. +- Gains are minimal for very small target models that already fit easily in memory. + +Effectiveness depends on selecting the right draft model.
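The propose-then-verify loop above can be sketched in a few lines. This is a toy sketch only: `draft_next` and `target_next` are hypothetical deterministic stand-ins for real model calls over integer token IDs, and the "single forward pass" verification is simulated position by position.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of speculative decoding with greedy acceptance.

    draft_next / target_next: callables mapping a token sequence to the
    next token each model would pick (stand-ins for real model calls).
    """
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model verifies the proposals; a real implementation
    #    scores all k positions in one forward pass. Accept the longest
    #    prefix matching the target's own predictions.
    accepted = []
    ctx = list(prefix)
    for tok in proposed:
        if target_next(ctx) != tok:
            break  # first mismatch: stop accepting
        accepted.append(tok)
        ctx.append(tok)

    # 3. The target always emits one token of its own, so even a full
    #    mismatch still makes progress (at least one token per round).
    accepted.append(target_next(ctx))
    return accepted


# Toy models: the target cycles through a fixed sequence; the draft
# agrees on the first few tokens, then diverges.
TARGET = [1, 2, 3, 4, 5, 6, 7, 8]
target_next = lambda ctx: TARGET[len(ctx) % len(TARGET)]
draft_next = lambda ctx: TARGET[len(ctx) % len(TARGET)] if len(ctx) < 3 else 0

print(speculative_step([], draft_next, target_next, k=4))  # → [1, 2, 3, 4]
```

Here the draft's first three proposals match and are accepted, the fourth is rejected, and the target's own token is appended: four tokens emitted for one verification round instead of four sequential target steps.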
+A poor choice will yield minimal speedup, or even slow things down. + +The draft model must have: + +- At least 10× **_fewer_** parameters than the target model.
+ Large draft models will generate tokens more slowly, which defeats the purpose. +- The same tokenizer as the target model.
+ This is non-negotiable: both models must map text to the same token IDs so the target can verify the proposals. +- Similar training data, to maximize the target model's acceptance rate. +- The same architecture family as the target, when possible. + +Usually, a distilled or simplified version of the target model works best.
+For domain-specific applications, consider fine-tuning a small model to mimic the target model's behavior. + ## Concerns - Lots of people currently thinks of LLMs as _real intelligence_, when it is not. @@ -104,6 +157,8 @@ Refer: ## Further readings +- [SEQUOIA: Serving exact Llama2-70B on an RTX4090 with half-second per token latency] + ### Sources - [Run LLMs Locally: 6 Simple Methods] @@ -129,14 +184,19 @@ Refer: +[Accelerating Large Language Model Decoding with Speculative Sampling]: https://arxiv.org/abs/2302.01318 +[An Introduction to Speculative Decoding for Reducing Latency in AI Inference]: https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/ [ChatGPT]: https://chatgpt.com/ [Copilot]: https://copilot.microsoft.com/ [Duck AI]: https://duck.ai/ +[Fast Inference from Transformers via Speculative Decoding]: https://arxiv.org/abs/2211.17192 [Grok]: https://grok.com/ [Jan]: https://www.jan.ai/ [Llama]: https://www.llama.com/ [Llamafile]: https://github.com/mozilla-ai/llamafile [Local LLM Hosting: Complete 2026 Guide - Ollama, vLLM, LocalAI, Jan, LM Studio & More]: https://www.glukhov.org/post/2025/11/hosting-llms-ollama-localai-jan-lmstudio-vllm-comparison/ +[Looking back at speculative decoding]: https://research.google/blog/looking-back-at-speculative-decoding/ [Mistral]: https://mistral.ai/ [OpenClaw: Who are you?]: https://www.youtube.com/watch?v=hoeEclqW8Gs [Run LLMs Locally: 6 Simple Methods]: https://www.datacamp.com/tutorial/run-llms-locally-tutorial +[SEQUOIA: Serving exact Llama2-70B on an RTX4090 with half-second per token latency]: https://infini-ai-lab.github.io/Sequoia-Page/ diff --git a/knowledge base/ai/ollama.md b/knowledge base/ai/ollama.md index 0338a01..cf8f2c1 100644 --- a/knowledge base/ai/ollama.md +++ b/knowledge base/ai/ollama.md @@ -126,6 +126,7 @@ ollama ls # Show models information. ollama show 'codellama:13b' +ollama show --verbose 'llama3.2' # Run models interactively. 
ollama run 'gemma3' diff --git a/knowledge base/jq.md b/knowledge base/jq.md index 8f35c15..a3ea4c8 100644 --- a/knowledge base/jq.md +++ b/knowledge base/jq.md @@ -97,6 +97,9 @@ jq '.rules=([inputs.rules]|flatten)' 'starting-rule-set.json' 'parts'/*'.json' # Put specific keys on top. jq '.objects = [(.objects[] as $in | {type,name,id} + $in)]' 'prod/dataPipeline_deviceLocationConversion_prod.json' + +# Sort descending by property `age`, take the first 3 elements. +jq -r 'sort_by(.age)|reverse|[limit(3;.[])]' 'file.json' ``` @@ -139,6 +142,10 @@ helm template 'chartName' \ # Check that the 'backend.url key' in a 'Pulumi.yaml' file is not 'file://' and fail otherwise. yq -e '(.backend.url|test("^file://")?)|not' 'Pulumi.yaml' + +# Get the digest of the largest layer, then replace ':' with '-'. +jq -r '.layers|sort_by(.size)[-1].digest|sub(":";"-")' \ + "$HOME/.ollama/models/manifests/registry.ollama.ai/library/codellama/13b" ``` @@ -162,6 +169,7 @@ yq -e '(.backend.url|test("^file://")?)|not' 'Pulumi.yaml' - [Remove all null values] - [jq: select where .attribute in list] - [An Introduction to JQ] +- [How to sort a json file by keys and values of those keys in jq]