🧬 How LLMs Actually Work
Training, tokens, transformers, and neural networks — explained simply
Large Language Models (LLMs) like Claude are among the most powerful technologies ever built — but how do they actually work under the hood? In this lesson, we will break it all down using simple analogies that anyone can understand.
What Is an LLM?
An LLM — Large Language Model — is a computer program that has been trained on enormous amounts of text so it can understand and generate human language. Think of it as a student who has read millions of books, articles, and websites, and can now write coherent responses on almost any topic.
But here is the key insight: an LLM does not "know" things the way you do. It has learned statistical patterns about how words and ideas relate to each other. It is incredibly good at predicting what text should come next given what came before.
What Is Training Data?
Before an LLM can do anything useful, it needs to learn from data. This process is called training.
Where does the data come from?
Training data typically includes:
- Books and academic papers
- Websites and articles
- Code repositories
- Conversations and forums
- Documentation and manuals
How much data?
Modern LLMs are trained on trillions of words. To put that in perspective:
| Comparison | Amount |
|---|---|
| Average novel | ~80,000 words |
| Wikipedia (English) | ~4 billion words |
| LLM training data | Trillions of words |
Analogy: Imagine a student who has read every book in every library on Earth, plus every website ever written. That is roughly the scale of data an LLM learns from.
Neural Networks — The Brain Behind the Model
At the heart of every LLM is a neural network. But what is that?
Simple Analogy: A Network of Decision-Makers
Imagine a giant company with millions of employees arranged in layers:
- Input Layer — The front desk. It receives the raw information (your text).
- Hidden Layers — Thousands of departments, each specializing in different patterns. Some recognize grammar, some understand topics, some detect tone.
- Output Layer — The final decision-maker that produces the response.
Each "employee" (called a neuron) receives information, does a small calculation, and passes the result to the next layer. When millions of neurons work together across hundreds of layers, the network can understand incredibly complex patterns.
How Does It Learn?
The network learns through a process called backpropagation:
- Show the network some text
- Ask it to predict the next word
- Check if it got it right
- If wrong, adjust the connections (called weights) slightly
- Repeat this billions of times
Analogy: It is like a musician practicing scales. At first, the notes sound wrong. But after thousands of hours of practice, the musician can play complex pieces effortlessly. The LLM "practices" on billions of text examples.
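Here is a toy version of that practice loop in numpy. It trains a tiny model to predict the next character from the current one using the same predict, check, adjust recipe. Real LLMs backpropagate through many layers over billions of examples; the training text and learning rate below are invented for illustration:

```python
import numpy as np

text = "hello hello hello"
vocab = sorted(set(text))                    # [' ', 'e', 'h', 'l', 'o']
idx = {ch: i for i, ch in enumerate(vocab)}
V = len(vocab)

W = np.zeros((V, V))   # weights: row = current character, column = next-character score
lr = 0.5               # how much to adjust the weights on each mistake

for step in range(200):                      # repeat many times
    for a, b in zip(text, text[1:]):         # show the network some text
        i, j = idx[a], idx[b]
        logits = W[i]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                 # predict the next character
        grad = probs.copy()
        grad[j] -= 1.0                       # how wrong was the prediction?
        W[i] -= lr * grad                    # adjust the weights slightly

# After training, 'h' strongly predicts 'e', 'o' predicts ' ', and so on.
print(vocab[int(np.argmax(W[idx["h"]]))])    # -> 'e'
```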
What Is a Transformer?
The Transformer is the specific type of neural network architecture that powers modern LLMs. It was introduced in the famous 2017 paper "Attention Is All You Need."
The Key Innovation: Attention
Before Transformers, language models processed words one at a time, left to right. The Transformer introduced attention — the ability to look at all the words in a sentence simultaneously and figure out which ones matter most for interpreting each word.
Example
Consider the sentence: "The cat sat on the mat because it was tired."
What does "it" refer to? The cat or the mat?
- A human instantly knows "it" = the cat
- The attention mechanism allows the model to make the same connection by calculating how strongly "it" relates to every other word
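For the curious, here is a minimal numpy sketch of the core calculation, called scaled dot-product attention. The word vectors below are random stand-ins rather than real embeddings, so the printed weights are meaningless; the point is the mechanism, in which every token scores its relevance to every other token and takes a weighted average:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every token scores its relevance to
    every other token, then takes a weighted average of their values."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # pairwise relevance scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax: each row sums to 1
    return w @ V, w

# "The cat sat on the mat because it was tired." -> 10 tokens, "it" is token 7
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))       # stand-in embeddings, 8 dimensions each
out, weights = attention(x, x, x)  # self-attention: Q = K = V = the sentence

# weights[7] is how strongly "it" attends to every other token; in a trained
# model, the weight on "cat" (token 1) would be high.
print(weights[7].round(2))
```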
Why Transformers Changed Everything
| Before Transformers | After Transformers |
|---|---|
| Processed words sequentially | Processes all words in parallel |
| Struggled with long text | Handles long context effectively |
| Slow to train | Much faster to train |
| Limited understanding | Deep contextual understanding |
Analogy: Imagine reading a book one word at a time with no memory of previous pages versus being able to see the entire book at once and instantly connect ideas across chapters. That is the power of attention.
Tokens: How LLMs See Text
LLMs do not read text the way you do. They break text into small pieces called tokens.
What Is a Token?
A token is a chunk of text — usually a word or part of a word:
| Text | Tokens |
|---|---|
| "Hello" | ["Hello"] |
| "understanding" | ["under", "standing"] |
| "ChatGPT" | ["Chat", "G", "PT"] |
| "Hello, world!" | ["Hello", ",", " world", "!"] |
Why Tokens Matter
- Cost — API pricing is based on token count
- Context Window — The model can only process a limited number of tokens at once
- Speed — More tokens = slower processing
Rule of Thumb
In English, 1 token is roughly 4 characters or 0.75 words. So:
- 100 words is approximately 133 tokens
- A full page (~500 words) is approximately 670 tokens
Next Token Prediction: The Core Mechanism
Here is the fundamental secret of how LLMs generate text: they predict the next token, one at a time.
How It Works
When you type "The weather today is" the model:
- Looks at all the tokens: ["The", " weather", " today", " is"]
- Calculates the probability of every possible next token
- Picks one (e.g., "sunny" with 35% probability, "nice" with 20%, "cold" with 15%...)
- Adds "sunny" to the sequence
- Now predicts the next token after "The weather today is sunny"
- Repeats until done
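Here is a toy sketch of that loop in Python. The vocabulary and probabilities are invented for illustration, and this toy only looks at the last token; a real Transformer computes the probabilities from all the tokens with its neural network:

```python
import numpy as np

# Invented next-token probabilities; a real model computes these with
# its neural network at every step.
next_token_probs = {
    "is":    {" sunny": 0.35, " nice": 0.20, " cold": 0.15, " rainy": 0.30},
    "sunny": {" and": 0.5, ".": 0.5},
    "and":   {" warm": 1.0},
    "warm":  {".": 1.0},
}

rng = np.random.default_rng(0)
tokens = ["The", " weather", " today", " is"]

while tokens[-1].strip() in next_token_probs:    # repeat until no continuation is defined
    options = next_token_probs[tokens[-1].strip()]
    choices, probs = zip(*options.items())
    tokens.append(str(rng.choice(list(choices), p=probs)))  # pick one by probability
print("".join(tokens))
```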
The Complete Pipeline
| Stage | What Happens | Example |
|---|---|---|
| Input | You type your prompt | "Explain gravity" |
| Tokenization | Text is broken into tokens | ["Explain", " gravity"] |
| Processing | Transformer processes all tokens with attention | Neural network activations |
| Prediction | Model calculates probability for every possible next token | "Gravity" (40%), "In" (25%), "The" (15%)... |
| Output | Selected token is generated, process repeats | "Gravity is a fundamental force..." |
Analogy: It is like autocomplete on your phone, but incredibly more sophisticated. Your phone predicts the next word based on simple patterns. An LLM predicts the next token based on deep understanding of context, meaning, and structure learned from trillions of words.
Why Do LLMs Hallucinate?
One of the most important things to understand about LLMs is that they sometimes hallucinate — they generate text that sounds completely confident and correct but is actually wrong.
Why Does This Happen?
- Pattern matching, not knowledge retrieval — The model is not looking things up in a database. It is generating text that statistically fits the pattern. Sometimes the pattern produces incorrect facts.
- No built-in fact-checking — The model has no way to verify what it is saying against a source of truth while generating text.
- Trained to be fluent, not factual — The training process optimizes for producing coherent, natural-sounding text. Being factually correct is a side effect, not the primary goal.
- Knowledge cutoff — The model only knows what was in its training data up to a certain date. It has no access to current events or real-time information.
Common Hallucination Examples
- Inventing academic papers that do not exist
- Generating plausible but incorrect code documentation
- Confidently stating wrong dates or statistics
- Creating fake URLs that look legitimate
Analogy: Imagine a very eloquent person who has read millions of books but sometimes mixes up details from different books and presents the mixed-up version with total confidence. That is hallucination.
Temperature: Controlling Creativity
When the model predicts the next token, it does not always pick the most likely one. The temperature setting controls how creative or random the output is.
How Temperature Works
| Temperature | Behavior | Best For |
|---|---|---|
| 0 | Always picks the most likely token | Code, factual Q&A, data extraction |
| 0.3 - 0.5 | Mostly predictable with slight variation | Business writing, summaries |
| 0.7 - 0.9 | More creative and varied | Creative writing, brainstorming |
| 1.0+ | Highly random and unpredictable | Experimental, poetry |
Example
Prompt: "The sunset was..."
- Temperature 0: "beautiful" (most probable)
- Temperature 0.7: "painted across the sky like watercolors"
- Temperature 1.0: "whispering ancient secrets through fractured amber light"
Analogy: Temperature is like a dial on a radio. At 0, you get the clearest, most predictable signal. As you turn it up, you get more interesting but potentially noisier output.
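Under the hood, temperature simply divides the model's raw scores (called logits) before they are turned into probabilities. A minimal sketch with made-up logits for the sunset example:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Low temperature sharpens the distribution toward the top token;
    high temperature flattens it toward uniform randomness."""
    scaled = np.asarray(logits) / max(temperature, 1e-6)   # T=0 -> effectively argmax
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

# Made-up scores for candidate continuations of "The sunset was..."
tokens = ["beautiful", "painted", "whispering"]
logits = [3.0, 2.0, 1.0]

for T in (0.01, 0.7, 1.5):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}:", dict(zip(tokens, probs.round(3))))
```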
Context Window: The Model's Working Memory
The context window is the total number of tokens the model can "see" at once — including both your input and its output.
Context Window Sizes
| Model | Context Window |
|---|---|
| GPT-3 (2020) | 2,048 tokens |
| Claude 3 Haiku | 200,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
| Claude (with extended thinking) | 200,000+ tokens |
Why It Matters
- Too little context — The model forgets earlier parts of the conversation
- More context — The model can reference larger documents, longer conversations, and more complex instructions
- Cost — Larger context windows use more computation and cost more
Analogy: Context window is like a desk. A small desk can only hold a few papers — you constantly have to remove old ones to make room. A large desk lets you spread out everything and see connections between documents.
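In practice this means checking token counts before sending a long document. A small sketch, again assuming tiktoken is installed (Claude uses a different tokenizer, so treat the count as an estimate; the limits below are illustrative):

```python
import tiktoken

CONTEXT_WINDOW = 200_000        # illustrative limit, in tokens
RESERVED_FOR_OUTPUT = 4_000     # leave room for the model's reply

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(prompt: str) -> bool:
    """Estimate whether a prompt leaves enough room for the response."""
    n = len(enc.encode(prompt))
    print(f"prompt is ~{n} tokens of {CONTEXT_WINDOW}")
    return n <= CONTEXT_WINDOW - RESERVED_FOR_OUTPUT

print(fits_in_context("Explain gravity. " * 1000))
```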
Training vs. Inference
These are two completely different phases in an LLM's lifecycle:
Training
- Happens once (or periodically) at the AI company
- Takes weeks to months on thousands of specialized GPUs
- Costs millions of dollars
- The model learns patterns from data
- Results in a set of learned weights (parameters)
Inference
- Happens every time you use the model
- Takes seconds
- The model uses its learned weights to generate responses
- No new learning happens during inference
| Aspect | Training | Inference |
|---|---|---|
| When | Before deployment | Every API call |
| Duration | Weeks/months | Seconds |
| Cost | Millions of dollars | Fractions of a cent per token |
| What happens | Model learns | Model applies what it learned |
| Hardware | Thousands of GPUs | Fewer GPUs per request |
Analogy: Training is like going to medical school for 8 years. Inference is like a doctor seeing a patient. The doctor does not re-learn medicine for each patient — they apply what they already learned.
Putting It All Together
Here is the complete picture of how an LLM works:
- Training Phase: The model reads trillions of tokens and learns statistical patterns about language through a neural network with a Transformer architecture.
- Your Request: You send a prompt to the model via an API or chat interface.
- Tokenization: Your text is broken into tokens.
- Processing: The Transformer processes all tokens using attention to understand context and relationships.
- Generation: The model predicts the next token, one at a time, controlled by temperature.
- Output: Tokens are converted back to human-readable text and returned to you.
The End-to-End Flow
| Step | Process | Details |
|---|---|---|
| 1. Input | You send a prompt | "Write a haiku about coding" |
| 2. Tokenization | Text becomes tokens | ["Write", " a", " ha", "iku", " about", " coding"] |
| 3. Processing | Transformer + Attention | All tokens processed in parallel |
| 4. Prediction | Next token probability | Model picks most appropriate token |
| 5. Output | Tokens become text | "Lines of logic flow / Through silicon pathways bright / Bugs hide in the night" |
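Here is that flow as an actual request, sketched with the Anthropic Python SDK (assumptions: `pip install anthropic`, an API key in your environment, and a model alias that may change over time, so check the current documentation):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # assumed alias; check current docs
    max_tokens=200,      # cap on output tokens (counts against the context window)
    temperature=0.7,     # a little creativity suits a haiku
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
)

print(response.content[0].text)  # tokens converted back to human-readable text
```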
Key Takeaways
- LLMs learn from data — They are trained on trillions of words to learn language patterns
- Neural networks are the engine — Layers of interconnected neurons process information
- Transformers enable understanding — The attention mechanism lets models understand context
- Tokens are the unit of work — Everything is processed as tokens, not words
- Next token prediction is the core — LLMs generate one token at a time based on probability
- Hallucinations are inevitable — Always verify important outputs
- Temperature controls creativity — Lower for accuracy, higher for creativity
- Context window limits memory — The model can only see a fixed number of tokens at once
- Training is expensive and rare — Inference is cheap and happens every time you use the model
Understanding these fundamentals makes you a more effective AI user. You will write better prompts, understand limitations, and know when to trust (or double-check) the output.