Language models

Statistical or machine learning models designed to understand and generate language by predicting the next token in a sequence given the previous ones.

  1. TL;DR
  2. Large Language Models
  3. Inference
    1. Speculative decoding
  4. Reasoning
  5. Prompting
  6. Function calling
  7. Concerns
  8. Run LLMs Locally
  9. Further readings
    1. Sources

TL;DR

Tokens can be words, subwords (one or more fragments of a word), or single characters.
A full sequence of tokens can be a sentence, a paragraph, or an entire essay.
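The idea of subword tokens can be sketched with a toy greedy longest-match tokenizer. This is illustrative only: real models learn their vocabularies with algorithms such as BPE or WordPiece, and the vocabulary below is made up for the example.

```python
# Toy vocabulary; real tokenizers learn tens of thousands of entries.
VOCAB = {"un", "break", "able", "b", "r", "e", "a", "k", "u", "n", "l"}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest vocabulary entries, left to right."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest possible match first, shrinking until one fits.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {word[i:]!r}")
    return tokens

print(tokenize("unbreakable"))  # ['un', 'break', 'able']
```

A word the vocabulary has never seen whole still tokenizes into known fragments, which is why subword models handle rare words gracefully.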

LMs are proficient at understanding human prompts in natural language.
They analyze the structure and use of natural language, enabling machines to process and generate text that is contextually appropriate and coherent.

Their primary purpose is to capture the statistical properties of natural language in mathematical notation.
They can predict the likelihood that a given token will follow a sequence of other tokens by learning the probability distribution of patterns.
This predictive capability is fundamental for tasks that require understanding the context and meaning of text, and it can be extended to more complex tasks.

Context is helpful information before or after a target token.
It can help a language model make better predictions, like determining whether "orange" refers to a citrus fruit or a color.
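How context constrains the next-token distribution can be shown with a tiny bigram model: counting which token follows which in a corpus. Real LMs learn far richer conditional distributions over billions of tokens; the corpus here is a toy.

```python
from collections import Counter, defaultdict

corpus = ("i peeled an orange . i painted the wall orange . "
          "i ate an orange . the orange wall .").split()

# Count bigrams: P(next | prev) ~ count(prev, next) / count(prev)
follow = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow[prev][nxt] += 1

def next_token_probs(prev: str) -> dict[str, float]:
    """Empirical distribution over tokens that follow `prev`."""
    counts = follow[prev]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

print(next_token_probs("an"))      # {'orange': 1.0}
print(next_token_probs("orange"))  # {'.': 0.75, 'wall': 0.25}
```

The preceding token ("an" vs. "the wall") already shifts the probabilities; large models condition on thousands of preceding tokens instead of one.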

Large LMs are language models trained on massive datasets, encoding their acquired knowledge into up to trillions of parameters.

Parameters are internal weights and values that an LLM learns during training.
They are used to capture patterns in language such as grammar, meaning, context and relationships between words.

The more parameters a model has, the better it typically is at understanding and generating complex output.
An increased parameter count, on the other hand, demands more computational resources for training and inference, and makes models more prone to overfitting, slower to respond, and harder to deploy efficiently.

| Provider | Creator    |
| -------- | ---------- |
| ChatGPT  | OpenAI     |
| Claude   | Anthropic  |
| Copilot  | Microsoft  |
| Duck AI  | DuckDuckGo |
| Gemini   | Google     |
| Grok     | xAI        |
| Llama    | Meta       |
| Mistral  | Mistral AI |

Many models now come pre-trained, and one can use the same model for different language-related purposes like classification, summarisation, answering questions, data extraction, text generation, reasoning, planning, translation, coding, sentiment analysis, speech recognition, and more.
They can also be further trained on additional information specific to an industry niche or a particular business.

The capabilities of transformer-based LLMs depend on the amount and the quality of their training data.
LLMs appear to be hitting a performance wall, and will probably need the rise of a different architecture.

LLMs find it difficult, if not impossible, to distinguish data from instructions.
As such, every part of the data could be used for prompt injection.

Large Language Models

Large language models are language models trained on massive datasets, frequently including texts scraped from the Internet.

LLMs have the ability to perform a wide range of tasks with minimal fine-tuning, and are especially proficient in speech recognition, machine translation, natural language generation, optical character recognition, route optimization, handwriting recognition, grammar induction, information retrieval, and other tasks.

They are currently predominantly based on transformers, which have superseded recurrent neural networks as the most effective architecture.

Training LLMs involves feeding them vast amounts of data and iteratively adjusting their parameters (weights) to improve their predictions.
The training process typically includes multiple stages, and requires substantial computational resources.
Stages often use unsupervised pre-training followed by supervised fine-tuning on specific tasks. The models' size and complexity can make them difficult to interpret and control, leading to potential ethical and bias issues.

The capabilities of Transformer-based LLMs depend on the amount and the quality of their training data.
Adding parameters only has a limited impact: given the same training data, models with more parameters usually perform better, but models with fewer parameters and better training data beat those with more parameters and worse training data.

Transformer-based LLMs appear to be hitting a performance wall, and will probably need to switch to a different architecture.
Scaling up the amount of training data did wonders up to ChatGPT 5. Once OpenAI got there, they found that enlarging the training data yielded diminishing returns.

Inference

Speculative decoding

Refer:

Makes inference faster and more responsive, significantly reducing latency while preserving output quality by predicting and verifying multiple tokens simultaneously.

Pairs a target LLM with a less resource-intensive draft model.
The smaller model quickly proposes several next tokens, taking over part of the standard autoregressive decoding the target model would normally do and hence reducing the number of sequential steps.
The target model verifies the proposed tokens in a single forward pass instead of one at a time, accepts the longest prefix that matches its own predictions, and continues from there.

Generating multiple tokens at once cuts latency and boosts throughput without impacting accuracy.
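The propose-verify-accept round described above can be sketched as follows. The `target` and `draft` callables are illustrative stand-ins for real model calls, each returning its single most likely next token; a real target model checks all proposals in one batched forward pass rather than in a Python loop.

```python
def speculative_step(target, draft, prefix: list[str], k: int = 4) -> list[str]:
    """One round of speculative decoding with greedy acceptance."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model verifies the proposals, accepting the longest
    #    prefix that matches its own greedy predictions.
    accepted = []
    for token in proposed:
        if target(prefix + accepted) == token:
            accepted.append(token)
        else:
            break

    # 3. The target contributes one token of its own, so every round
    #    makes progress even when no proposals are accepted.
    accepted.append(target(prefix + accepted))
    return accepted

# Toy run: both "models" greedily continue a fixed sentence.
sentence = "the quick brown fox jumps over the lazy dog".split()
target = lambda prefix: sentence[len(prefix)]
draft = lambda prefix: sentence[len(prefix)]
print(speculative_step(target, draft, ["the"]))
# ['quick', 'brown', 'fox', 'jumps', 'over'] - 5 tokens in one round
```

When the draft model agrees with the target, one round yields k+1 tokens for roughly the cost of one target forward pass; when it disagrees early, the round degrades gracefully to standard one-token decoding.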

Use cases:

  • Speeding up input-grounded tasks like translation, summarization, and transcription.
  • Performing greedy decoding by always selecting the most likely token.
  • Low-temperature sampling when outputs need to be focused and predictable.
  • Situations where the target model barely fits in the GPU's memory.

Cons:

  • Increases memory overhead due to both models needing to be loaded at the same time.
  • Less effective for high-temperature sampling (e.g. creative writing).
  • Benefits drop if the draft model is poorly matched to the target model.
  • Gains are minimal for very small target models that already fit easily in memory.

Effectiveness depends on selecting the right draft model.
A poor choice will grant minimal speedup, or even slow things down.

The draft model must have:

  • At least 10× fewer parameters than the target model.
    Large draft models will generate tokens more slowly, which defeats the purpose.
  • The same tokenizer as the target model.
    This is non-negotiable: both models must map text to the same token IDs for the proposals to be verifiable.
  • Similar training data, to maximize the target model's acceptance rate.
  • The same architecture family as the target model, when possible.

Usually, a distilled or simplified version of the target model works best.
For domain-specific applications, consider fine-tuning a small model to mimic the target model's behavior.

Reasoning

Standard models' behaviour is just autocompletion: they try to infer or recall what the most probable next word would be.

Chain of Thought techniques tell models to show their work by breaking prompts into smaller, more manageable steps, and solving each of them individually before giving back the final answer.
The result is more accurate, but it costs more tokens and requires a bigger context window.
It feels like the model is calculating or thinking, but it is really just increasing the chances that the answer is logically sound.
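The difference is purely in the prompt. A minimal sketch (the question and instruction wording are made up for the example):

```python
question = "A shop sells pens at 3 for $2. How much do 12 pens cost?"

# Direct prompt: the model must jump straight to an answer.
direct_prompt = f"{question}\nAnswer:"

# Chain-of-thought prompt: instruct the model to decompose the
# problem and solve each part before stating the final answer.
cot_prompt = (
    f"{question}\n"
    "Think step by step: break the problem into smaller parts, "
    "solve each part, then state the final answer."
)
```

The second prompt makes the model emit its intermediate steps (4 groups of 3 pens, 4 × $2 = $8), spending extra tokens to raise the odds of a sound answer.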

The ReAct (Reason + Act) paradigm forces models to loop over chains of thought.
A model breaks the request into smaller steps, plans the next action, acts on it using functions should it decide it needs to, checks the results, updates the chain of thought, and repeats this Think-Act-Observe loop to iteratively improve upon responses.
See also ReAct: Synergizing Reasoning and Acting in Language Models.

The ReAct loop unlocked the agentic loop for general-purpose tasks.
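The Think-Act-Observe loop can be sketched in a few lines. Here `llm` is a stand-in callable returning an already-parsed decision dict, and the scripted "model" and `weekday` tool are invented for the example; a real agent would parse the model's raw text output.

```python
def react_loop(llm, tools: dict, prompt: str, max_steps: int = 5) -> str:
    """Minimal ReAct loop over a scratchpad of thoughts and observations."""
    scratchpad = [f"Task: {prompt}"]
    for _ in range(max_steps):
        # Think: the model reasons over the scratchpad so far.
        step = llm("\n".join(scratchpad))
        scratchpad.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":
            return step["input"]                      # final answer
        # Act: call the tool the model chose.
        observation = tools[step["action"]](step["input"])
        # Observe: feed the result back into the next iteration.
        scratchpad.append(f"Action: {step['action']}({step['input']})")
        scratchpad.append(f"Observation: {observation}")
    return "no answer within max_steps"

# Toy run with a scripted "model" and one tool.
script = iter([
    {"thought": "I need the weekday", "action": "weekday", "input": ""},
    {"thought": "I can answer now", "action": "finish", "input": "Today is Monday"},
])
answer = react_loop(lambda _: next(script), {"weekday": lambda _: "Monday"},
                    "What day is it?")
print(answer)  # Today is Monday
```

Each iteration interleaves one model call with at most one tool call, which is what makes the loop agentic but also token-hungry.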

The ReWOO (Reasoning WithOut Observation) method eliminates the dependence on tool outputs for action planning.
Models plan upfront, and avoid redundant usage of tools by anticipating which tools to use upon receiving the initial prompt from the user.
Users can confirm the plan before the model executes it.
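Compared with ReAct, ReWOO collapses the interleaved loop into plan, work, and solve phases, as in this sketch. The `planner`, `tools`, and `solver` callables are illustrative stand-ins for real model and tool calls.

```python
def rewoo(planner, tools: dict, solver, prompt: str) -> str:
    """ReWOO sketch: plan all tool calls upfront, execute, then solve."""
    # 1. Plan: a single model call anticipates every tool to use.
    plan = planner(prompt)          # e.g. [("lookup", "capital of France")]
    # A user could review and confirm `plan` here before execution.
    # 2. Work: run each planned call; no model calls in between.
    evidence = [tools[name](arg) for name, arg in plan]
    # 3. Solve: one final model call combines the prompt and evidence.
    return solver(prompt, evidence)

# Toy run with scripted stand-ins.
plan_fn = lambda q: [("lookup", "capital of France")]
tool_fns = {"lookup": lambda arg: "Paris"}
solve_fn = lambda q, ev: f"Based on {ev}: Paris"
print(rewoo(plan_fn, tool_fns, solve_fn, "What is the capital of France?"))
```

Only two model calls happen regardless of how many tools the plan uses, which is the latency and cost advantage over a ReAct loop.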

Prompting

Good prompting is about designing predictable interactions with a model.
In the context of LLM agent development, it is no different from interface design.

Function calling

Refer Function calling in LLMs.

A.k.a. tool calling.
Allows models to reliably connect to and interact with external tools or APIs.

One provides the LLM with a set of tools, and the model decides during interaction which tool it wants to invoke for a specific prompt and/or to complete a given task.
Models supporting function calling can use (or even create) tools to get or check an answer, instead of just inferring or recalling it.

Function calling grants models real-time data access and information retrieval.
This mitigates the fundamental problem of responses being based on stale training data, and reduces the hallucinations that arise when a model will not admit it doesn't know something.

Using tools increases the overall token count and hence costs, also reducing available context and adding latency.
Deciding which tool to call, using that tool, and then using the results to generate a response is more intensive than just inferring the next token.
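The mechanics on the application side reduce to advertising tool schemas and dispatching the model's structured call. A minimal sketch, assuming the model replies with a JSON tool call; the schema shape loosely follows the common JSON-schema style but provider formats differ, and `get_weather` is a made-up stub.

```python
import json

# Tool schema advertised to the model, plus a stub implementation.
TOOLS = {
    "get_weather": {
        "description": "Current weather for a city",
        "parameters": {"city": {"type": "string"}},
        "fn": lambda city: {"city": city, "temp_c": 11},  # stub
    },
}

def handle_model_turn(model_reply: str) -> str:
    """Dispatch a model's JSON tool call and return the JSON result.

    Example input: '{"tool": "get_weather", "arguments": {"city": "Oslo"}}'
    """
    call = json.loads(model_reply)
    tool = TOOLS[call["tool"]]
    result = tool["fn"](**call["arguments"])
    # In a real loop this result is fed back to the model, which then
    # grounds its natural-language answer on it.
    return json.dumps(result)

print(handle_model_turn('{"tool": "get_weather", "arguments": {"city": "Oslo"}}'))
```

Note that the dispatcher, not the model, executes the function: the model only chooses the tool and its arguments, which is also exactly where the real-world risks described below enter.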

Caution

Allowing LLMs to call functions can have real-world consequences.
This includes financial loss, data corruption or exfiltration, and security breaches.

Concerns

  • Lots of people currently think of LLMs as real, rational intelligence, when they are not.
    LLMs are really nothing more than glorified guessing machines that are designed to interact naturally. It's humans that are biased by evolution toward attributing sentience and agency to entities they interact with.
  • People are mindlessly using LLMs too much, mostly due to the convenience they offer but also because they don't understand what those are or how they work. This causes a lack of critical thinking, and overreliance.
  • People are giving too much credibility to LLM answers, trusting them more than they trust their teachers, accountants, lawyers or even doctors.
  • LLMs are incapable of distinguishing facts from beliefs, and are completely disembodied from the world.
    They do not understand concepts and are unaware of time, change, and causality. They just approximate reasoning by mimicking language based on how connected the tokens are in their own training data.
  • Models are very limited in their ability to revise beliefs. Once some pattern is learned, it is extremely difficult to unwire it due to the very nature of how models function.
  • AI companies could steer and bias their models to say specific things, subtly promote ideologies, influence elections, or even rewrite history in the mind of those who trust the LLM.
  • Models can be vulnerable to attacks (e.g. prompt injection) that can change the LLM's behaviour, bias it, or hide malware in the tools they manage and use.
  • Model training and execution requires massive amounts of data and computation, resources that are normally not available to the common person. Aside from the vast amount of energy and cooling they consume, this encourages people to depend on, and hence give power to, AI companies.
  • Models can learn and exhibit deceptive behavior.
    Standard revision techniques could fail to remove it, and instead empower it while creating a false impression of safety.
    See Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
  • Models are painfully inconsistent, often unaware of their limitations, irritatingly overconfident, and tend not to accept gracefully that they don't know something, preferring to hallucinate instead.
    More recent techniques are making models more efficient, but they only delay this problem.

Run LLMs Locally

Refer:

Ollama | Jan | LM Studio | Docker Model Runner | llama.cpp | vLLM | Llamafile

Further readings

Sources