🧬 How LLMs Actually Work
Training, tokens, transformers, and neural networks — explained simply
Large Language Models (LLMs) like Claude are among the most powerful technologies ever built — but how do they actually work under the hood? In this lesson, we will break it all down using simple analogies that anyone can understand.
What Is an LLM?
An LLM — Large Language Model — is a computer program that has been trained on enormous amounts of text so it can understand and generate human language. Think of it as a student who has read millions of books, articles, and websites, and can now write coherent responses on almost any topic.
But here is the key insight: an LLM does not "know" things the way you do. It has learned statistical patterns about how words and ideas relate to each other. It is incredibly good at predicting what text should come next given what came before.
What Is Training Data?
Before an LLM can do anything useful, it needs to learn from data. This process is called training.
Where does the data come from?
Training data typically includes:
- Books and academic papers
- Websites and articles
- Code repositories
- Conversations and forums
- Documentation and manuals
How much data?
Modern LLMs are trained on trillions of words. To put that in perspective:
| Comparison | Amount |
|---|---|
| Average novel | ~80,000 words |
| Wikipedia (English) | ~4 billion words |
| LLM training data | Trillions of words |
Analogy: Imagine a student who has read every book in every library on Earth, plus every website ever written. That is roughly the scale of data an LLM learns from.
Neural Networks — The Brain Behind the Model
At the heart of every LLM is a neural network. But what is that?
Simple Analogy: A Network of Decision-Makers
Imagine a giant company with millions of employees arranged in layers:
- Input Layer — The front desk. It receives the raw information (your text).
- Hidden Layers — Thousands of departments, each specializing in different patterns. Some recognize grammar, some understand topics, some detect tone.
- Output Layer — The final decision-maker that produces the response.
Each "employee" (called a neuron) receives information, does a small calculation, and passes the result to the next layer. When millions of neurons work together across hundreds of layers, the network can understand incredibly complex patterns.
How Does It Learn?
The network learns through a process called backpropagation:
- Show the network some text
- Ask it to predict the next word
- Check if it got it right
- If wrong, adjust the connections (called weights) slightly
- Repeat this billions of times
Analogy: It is like a musician practicing scales. At first, the notes sound wrong. But after thousands of hours of practice, the musician can play complex pieces effortlessly. The LLM "practices" on billions of text examples.
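Here is a toy version of that practice loop in numpy. It trains a tiny model to predict the next character from the current one using the same predict, check, adjust recipe. Real LLMs backpropagate through many layers over billions of examples; the training text and learning rate below are invented for illustration:

```python
import numpy as np

text = "hello hello hello"
vocab = sorted(set(text))                    # [' ', 'e', 'h', 'l', 'o']
idx = {ch: i for i, ch in enumerate(vocab)}
V = len(vocab)

W = np.zeros((V, V))   # weights: row = current character, column = next-character score
lr = 0.5               # how much to adjust the weights on each mistake

for step in range(200):                      # repeat many times
    for a, b in zip(text, text[1:]):         # show the network some text
        i, j = idx[a], idx[b]
        logits = W[i]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                 # predict the next character
        grad = probs.copy()
        grad[j] -= 1.0                       # how wrong was the prediction?
        W[i] -= lr * grad                    # adjust the weights slightly

# After training, 'h' strongly predicts 'e', 'o' predicts ' ', and so on.
print(vocab[int(np.argmax(W[idx["h"]]))])    # -> 'e'
```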
What Is a Transformer?
The Transformer is the specific type of neural network architecture that powers modern LLMs. It was introduced in the famous 2017 paper "Attention Is All You Need."
The Key Innovation: Attention
Before Transformers, language models processed words one at a time, left to right. The Transformer introduced attention — the ability to look at all the words in a sentence simultaneously and figure out which ones matter most for interpreting each word.
Example
Consider the sentence: "The cat sat on the mat because it was tired."
What does "it" refer to? The cat or the mat?
- A human instantly knows "it" = the cat
- The attention mechanism allows the model to make the same connection by calculating how strongly "it" relates to every other word
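For the curious, here is a minimal numpy sketch of the core calculation, called scaled dot-product attention. The word vectors below are random stand-ins rather than real embeddings, so the printed weights are meaningless; the point is the mechanism, in which every token scores its relevance to every other token and takes a weighted average:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every token scores its relevance to
    every other token, then takes a weighted average of their values."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # pairwise relevance scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax: each row sums to 1
    return w @ V, w

# "The cat sat on the mat because it was tired." -> 10 tokens, "it" is token 7
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))       # stand-in embeddings, 8 dimensions each
out, weights = attention(x, x, x)  # self-attention: Q = K = V = the sentence

# weights[7] is how strongly "it" attends to every other token; in a trained
# model, the weight on "cat" (token 1) would be high.
print(weights[7].round(2))
```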
Why Transformers Changed Everything
| Before Transformers | After Transformers |
|---|---|
| Processed words sequentially | Processes all words in parallel |
| Struggled with long text | Handles long context effectively |
| Slow to train | Much faster to train |
| Limited understanding | Deep contextual understanding |
Analogy: Imagine reading a book one word at a time with no memory of previous pages versus being able to see the entire book at once and instantly connect ideas across chapters. That is the power of attention.
Tokens: How LLMs See Text
LLMs do not read text the way you do. They break text into small pieces called tokens.
What Is a Token?
A token is a chunk of text — usually a word or part of a word:
| Text | Tokens |
|---|---|
| "Hello" | ["Hello"] |
| "understanding" | ["under", "standing"] |
| "ChatGPT" | ["Chat", "G", "PT"] |
| "Hello, world!" | ["Hello", ",", " world", "!"] |
Why Tokens Matter
- Cost — API pricing is based on token count
- Context Window — The model can only process a limited number of tokens at once
- Speed — More tokens = slower processing
Rule of Thumb
In English, 1 token is roughly 4 characters or 0.75 words. So:
- 100 words is approximately 133 tokens
- A full page (~500 words) is approximately 670 tokens
Next Token Prediction: The Core Mechanism
Here is the fundamental secret of how LLMs generate text: they predict the next token, one at a time.
How It Works
When you type "The weather today is" the model:
- Looks at all the tokens: ["The", " weather", " today", " is"]
- Calculates the probability of every possible next token
- Picks one (e.g., "sunny" with 35% probability, "nice" with 20%, "cold" with 15%...)
- Adds "sunny" to the sequence
- Now predicts the next token after "The weather today is sunny"
- Repeats until done
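Here is a toy sketch of that loop in Python. The vocabulary and probabilities are invented for illustration, and this toy only looks at the last token; a real Transformer computes the probabilities from all the tokens with its neural network:

```python
import numpy as np

# Invented next-token probabilities; a real model computes these with
# its neural network at every step.
next_token_probs = {
    "is":    {" sunny": 0.35, " nice": 0.20, " cold": 0.15, " rainy": 0.30},
    "sunny": {" and": 0.5, ".": 0.5},
    "and":   {" warm": 1.0},
    "warm":  {".": 1.0},
}

rng = np.random.default_rng(0)
tokens = ["The", " weather", " today", " is"]

while tokens[-1].strip() in next_token_probs:    # repeat until no continuation is defined
    options = next_token_probs[tokens[-1].strip()]
    choices, probs = zip(*options.items())
    tokens.append(str(rng.choice(list(choices), p=probs)))  # pick one by probability
print("".join(tokens))
```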
The Complete Pipeline
| Stage | What Happens | Example |
|---|---|---|
| Input | You type your prompt | "Explain gravity" |
| Tokenization | Text is broken into tokens | ["Explain", " gravity"] |
| Processing | Transformer processes all tokens with attention | Neural network activations |
| Prediction | Model calculates probability for every possible next token | "Gravity" (40%), "In" (25%), "The" (15%)... |
| Output | Selected token is generated, process repeats | "Gravity is a fundamental force..." |
Analogy: It is like autocomplete on your phone, but incredibly more sophisticated. Your phone predicts the next word based on simple patterns. An LLM predicts the next token based on deep understanding of context, meaning, and structure learned from trillions of words.
Why Do LLMs Hallucinate?
One of the most important things to understand about LLMs is that they sometimes hallucinate — they generate text that sounds completely confident and correct but is actually wrong.
Why Does This Happen?
- Pattern matching, not knowledge retrieval — The model is not looking things up in a database. It is generating text that statistically fits the pattern. Sometimes the pattern produces incorrect facts.
- No built-in fact-checking — The model has no way to verify what it is saying against a source of truth while generating text.
- Trained to be fluent, not factual — The training process optimizes for producing coherent, natural-sounding text. Being factually correct is a side effect, not the primary goal.
- Knowledge cutoff — The model only knows what was in its training data up to a certain date. It has no access to current events or real-time information.
Common Hallucination Examples
- Inventing academic papers that do not exist
- Generating plausible but incorrect code documentation
- Confidently stating wrong dates or statistics
- Creating fake URLs that look legitimate
Analogy: Imagine a very eloquent person who has read millions of books but sometimes mixes up details from different books and presents the mixed-up version with total confidence. That is hallucination.
Temperature: Controlling Creativity
When the model predicts the next token, it does not always pick the most likely one. The temperature setting controls how creative or random the output is.
How Temperature Works
| Temperature | Behavior | Best For |
|---|---|---|
| 0 | Always picks the most likely token | Code, factual Q&A, data extraction |
| 0.3 - 0.5 | Mostly predictable with slight variation | Business writing, summaries |
| 0.7 - 0.9 | More creative and varied | Creative writing, brainstorming |
| 1.0+ | Highly random and unpredictable | Experimental, poetry |
Example
Prompt: "The sunset was..."
- Temperature 0: "beautiful" (most probable)
- Temperature 0.7: "painted across the sky like watercolors"
- Temperature 1.0: "whispering ancient secrets through fractured amber light"
Analogy: Temperature is like a dial on a radio. At 0, you get the clearest, most predictable signal. As you turn it up, you get more interesting but potentially noisier output.
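Under the hood, temperature simply divides the model's raw scores (called logits) before they are turned into probabilities. A minimal sketch with made-up logits for the sunset example:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Low temperature sharpens the distribution toward the top token;
    high temperature flattens it toward uniform randomness."""
    scaled = np.asarray(logits) / max(temperature, 1e-6)   # T=0 -> effectively argmax
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

# Made-up scores for candidate continuations of "The sunset was..."
tokens = ["beautiful", "painted", "whispering"]
logits = [3.0, 2.0, 1.0]

for T in (0.01, 0.7, 1.5):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}:", dict(zip(tokens, probs.round(3))))
```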
Context Window: The Model's Working Memory
The context window is the total number of tokens the model can "see" at once — including both your input and its output.
Context Window Sizes
| Model | Context Window |
|---|---|
| GPT-3 (2020) | 2,048 tokens |
| Claude 3 Haiku | 200,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
| Claude (with extended thinking) | 200,000+ tokens |
Why It Matters
- Too little context — The model forgets earlier parts of the conversation
- More context — The model can reference larger documents, longer conversations, and more complex instructions
- Cost — Larger context windows use more computation and cost more
Analogy: Context window is like a desk. A small desk can only hold a few papers — you constantly have to remove old ones to make room. A large desk lets you spread out everything and see connections between documents.
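In practice this means checking token counts before sending a long document. A small sketch, again assuming tiktoken is installed (Claude uses a different tokenizer, so treat the count as an estimate; the limits below are illustrative):

```python
import tiktoken

CONTEXT_WINDOW = 200_000        # illustrative limit, in tokens
RESERVED_FOR_OUTPUT = 4_000     # leave room for the model's reply

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(prompt: str) -> bool:
    """Estimate whether a prompt leaves enough room for the response."""
    n = len(enc.encode(prompt))
    print(f"prompt is ~{n} tokens of {CONTEXT_WINDOW}")
    return n <= CONTEXT_WINDOW - RESERVED_FOR_OUTPUT

print(fits_in_context("Explain gravity. " * 1000))
```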
Training vs. Inference
These are two completely different phases in an LLM's lifecycle:
Training
- Happens once (or periodically) at the AI company
- Takes weeks to months on thousands of specialized GPUs
- Costs millions of dollars
- The model learns patterns from data
- Results in a set of learned weights (parameters)
Inference
- Happens every time you use the model
- Takes seconds
- The model uses its learned weights to generate responses
- No new learning happens during inference
| Aspect | Training | Inference |
|---|---|---|
| When | Before deployment | Every API call |
| Duration | Weeks/months | Seconds |
| Cost | Millions of dollars | Fractions of a cent per token |
| What happens | Model learns | Model applies what it learned |
| Hardware | Thousands of GPUs | Fewer GPUs per request |
Analogy: Training is like going to medical school for 8 years. Inference is like a doctor seeing a patient. The doctor does not re-learn medicine for each patient — they apply what they already learned.
Putting It All Together
Here is the complete picture of how an LLM works:
- Training Phase: The model reads trillions of tokens and learns statistical patterns about language through a neural network with a Transformer architecture.
- Your Request: You send a prompt to the model via an API or chat interface.
- Tokenization: Your text is broken into tokens.
- Processing: The Transformer processes all tokens using attention to understand context and relationships.
- Generation: The model predicts the next token, one at a time, controlled by temperature.
- Output: Tokens are converted back to human-readable text and returned to you.
The End-to-End Flow
| Step | Process | Details |
|---|---|---|
| 1. Input | You send a prompt | "Write a haiku about coding" |
| 2. Tokenization | Text becomes tokens | ["Write", " a", " ha", "iku", " about", " coding"] |
| 3. Processing | Transformer + Attention | All tokens processed in parallel |
| 4. Prediction | Next token probability | Model picks most appropriate token |
| 5. Output | Tokens become text | "Lines of logic flow / Through silicon pathways bright / Bugs hide in the night" |
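Here is that flow as an actual request, sketched with the Anthropic Python SDK (assumptions: `pip install anthropic`, an API key in your environment, and a model alias that may change over time, so check the current documentation):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # assumed alias; check current docs
    max_tokens=200,      # cap on output tokens (counts against the context window)
    temperature=0.7,     # a little creativity suits a haiku
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
)

print(response.content[0].text)  # tokens converted back to human-readable text
```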
Key Takeaways
- LLMs learn from data — They are trained on trillions of words to learn language patterns
- Neural networks are the engine — Layers of interconnected neurons process information
- Transformers enable understanding — The attention mechanism lets models understand context
- Tokens are the unit of work — Everything is processed as tokens, not words
- Next token prediction is the core — LLMs generate one token at a time based on probability
- Hallucinations are inevitable — Always verify important outputs
- Temperature controls creativity — Lower for accuracy, higher for creativity
- Context window limits memory — The model can only see a fixed number of tokens at once
- Training is expensive and rare — Inference is cheap and happens every time you use the model
Understanding these fundamentals makes you a more effective AI user. You will write better prompts, understand limitations, and know when to trust (or double-check) the output.