Large Language Models: How They Work
Understand GPT architecture, BPE tokenization, pretraining vs fine-tuning, RLHF, context windows, and scaling laws of LLMs
What Makes an LLM "Large"?
Large Language Models are transformer-based neural networks trained on massive text corpora with billions of parameters. They learn to predict the next token, and from this simple objective, emergent abilities like reasoning, coding, and instruction following arise at scale.
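You can watch this objective in action. A minimal sketch (assuming the Hugging Face transformers library is installed) that asks GPT-2 for its most probable next tokens:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer.encode("The capital of France is", return_tensors="pt")
with torch.no_grad():
    next_logits = model(input_ids).logits[0, -1]  # logits for the next position

# The five tokens the model rates most likely to come next
top = torch.topk(next_logits.softmax(dim=-1), k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([token_id.item()])!r}: p={prob.item():.3f}")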
Scale of Modern LLMs:
- Parameters: from ~1B (GPT-2 XL: 1.5B) to hundreds of billions (GPT-3: 175B, Llama 3.1: 405B)
- Training data: trillions of tokens (Llama 3 was pretrained on over 15T tokens)
- Context windows: 128K tokens (GPT-4) to 200K tokens (Claude)
GPT Architecture: Decoder-Only Transformer
GPT models use a decoder-only transformer with causal (left-to-right) attention. Each token can only attend to previous tokens, enabling autoregressive text generation.
import torch
import torch.nn as nn
class GPTBlock(nn.Module):
"""A single GPT transformer block."""
def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
super().__init__()
self.ln1 = nn.LayerNorm(d_model)
self.attn = nn.MultiheadAttention(d_model, num_heads,
dropout=dropout, batch_first=True)
self.ln2 = nn.LayerNorm(d_model)
self.ffn = nn.Sequential(
nn.Linear(d_model, d_ff),
nn.GELU(), # GPT uses GELU instead of ReLU
nn.Linear(d_ff, d_model),
nn.Dropout(dropout),
)
def forward(self, x, mask=None):
# Pre-norm architecture (GPT-2 style)
normed = self.ln1(x)
attn_out, _ = self.attn(normed, normed, normed, attn_mask=mask)
x = x + attn_out # residual connection
normed = self.ln2(x)
x = x + self.ffn(normed) # residual connection
return x
class MiniGPT(nn.Module):
"""Minimal GPT model for understanding the architecture."""
def __init__(self, vocab_size, d_model, num_heads, num_layers, max_seq_len):
super().__init__()
self.token_emb = nn.Embedding(vocab_size, d_model)
self.pos_emb = nn.Embedding(max_seq_len, d_model) # learned positions
self.blocks = nn.ModuleList([
GPTBlock(d_model, num_heads, d_model * 4) for _ in range(num_layers)
])
self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.token_emb.weight  # GPT-2 ties input and output embeddings
def forward(self, idx):
B, T = idx.shape
        # Causal mask: True above the diagonal blocks attention to future tokens
        mask = torch.triu(torch.ones(T, T, device=idx.device), diagonal=1).bool()
# Token + positional embeddings
positions = torch.arange(T, device=idx.device)
x = self.token_emb(idx) + self.pos_emb(positions)
# Pass through transformer blocks
for block in self.blocks:
x = block(x, mask)
x = self.ln_f(x)
logits = self.head(x) # (B, T, vocab_size)
return logits
# GPT-2 Small: vocab=50257, d_model=768, heads=12, layers=12
model = MiniGPT(vocab_size=50257, d_model=768, num_heads=12,
num_layers=12, max_seq_len=1024)
params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {params/1e6:.0f}M") # ~124M (GPT-2 Small)
BPE Tokenization: How LLMs See Text
LLMs operate on token ids, not characters. A BPE (Byte Pair Encoding) tokenizer merges frequent byte pairs into a fixed vocabulary, so common words become single tokens while rare words are split into subwords.
from transformers import AutoTokenizer
# GPT-2 uses BPE (Byte Pair Encoding) tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Tokenization examples
examples = [
"Hello, world!",
"Machine learning is fascinating",
"supercalifragilisticexpialidocious", # rare word -> subwords
"def fibonacci(n):\n return n if n < 2 else fibonacci(n-1) + fibonacci(n-2)",
]
for text in examples:
tokens = tokenizer.encode(text)
decoded_tokens = [tokenizer.decode([t]) for t in tokens]
print(f"Text: {text[:50]}")
print(f" Tokens ({len(tokens)}): {decoded_tokens[:10]}")
print()
# Key insight: 1 token is approximately 0.75 English words
# "Hello" = 1 token, " world" = 1 token (note the space!)
# Code is less token-efficient than natural language
# Non-English text uses more tokens per word
# Token budget matters for cost and context length:
# GPT-4: 128K context window = ~96K words
# Claude: 200K context window = ~150K words
text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(text)
print(f"'{text}'")
print(f" Word count: {len(text.split())}")
print(f" Token count: {len(tokens)}")
print(f" Ratio: {len(tokens)/len(text.split()):.2f} tokens per word")
The LLM Training Pipeline
Stage 1: Pretraining
Train on trillions of tokens from the internet. Objective: predict the next token. This gives the model general knowledge and language understanding. Cost: $10M-$100M+ in compute.
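The core training step is plain cross-entropy on shifted tokens. A minimal sketch, assuming a model like the MiniGPT above that maps (B, T) token ids to (B, T, vocab_size) logits:
import torch
import torch.nn.functional as F

def pretraining_step(model, batch):
    """One next-token-prediction step; `batch` is a (B, T) tensor of token ids."""
    inputs = batch[:, :-1]    # tokens the model sees
    targets = batch[:, 1:]    # the same sequence shifted left by one
    logits = model(inputs)    # (B, T-1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Usage with the MiniGPT defined earlier and a random batch:
# batch = torch.randint(0, 50257, (8, 256))
# loss = pretraining_step(model, batch); loss.backward()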
Stage 2: Supervised Fine-Tuning (SFT)
Train on high-quality instruction-response pairs curated by humans. Teaches the model to follow instructions, be helpful, and format responses properly. Typically 10K-100K examples.
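A key detail is that the loss is usually computed only on the response tokens, not the prompt. A sketch of building one example (the instruction template here is illustrative, not any particular model's format):
import torch

def build_sft_example(tokenizer, instruction, response):
    """Tokenize one instruction-response pair; mask the prompt out of the loss."""
    prompt = f"### Instruction:\n{instruction}\n### Response:\n"  # hypothetical template
    prompt_ids = tokenizer.encode(prompt)
    response_ids = tokenizer.encode(response + tokenizer.eos_token)
    input_ids = torch.tensor(prompt_ids + response_ids)
    labels = input_ids.clone()
    labels[:len(prompt_ids)] = -100  # -100 is ignored by PyTorch's cross_entropy
    return input_ids, labels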
Stage 3: RLHF (Reinforcement Learning from Human Feedback)
Train a reward model on human preference data (which response is better?). Then use PPO to optimize the LLM to generate responses the reward model scores highly. This aligns the model with human values.
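The reward model itself is typically trained with a pairwise (Bradley-Terry) ranking loss: push the score of the preferred response above the rejected one. A minimal sketch:
import torch
import torch.nn.functional as F

def reward_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss over scalar reward scores, shape (batch,)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Chosen responses scored higher than rejected ones -> small loss
print(reward_loss(torch.tensor([2.0, 1.5]), torch.tensor([0.5, 1.0])))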
Alternative: DPO (Direct Preference Optimization)
A newer approach that skips training a separate reward model and instead optimizes the LLM directly on preference pairs. Simpler and more stable than PPO-based RLHF, and increasingly popular (used by Llama 3 and Zephyr). A sketch of the loss follows below.
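The DPO loss needs only the log-probability of each response under the policy and a frozen reference model (summed over response tokens); beta is typically around 0.1:
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss; each argument is summed response log-probs, shape (batch,)."""
    chosen_margin = policy_chosen - ref_chosen        # log pi/pi_ref for chosen
    rejected_margin = policy_rejected - ref_rejected  # log pi/pi_ref for rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()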
Text Generation: Temperature and Sampling
import torch
import torch.nn.functional as F
def generate_text(model, tokenizer, prompt, max_tokens=50,
temperature=1.0, top_p=0.9, top_k=50):
"""
Generate text with various decoding strategies.
temperature: Controls randomness (0 = greedy, 1 = standard, >1 = creative)
top_p: Nucleus sampling (only consider tokens summing to this probability)
top_k: Only consider the top-k most likely tokens
"""
input_ids = tokenizer.encode(prompt, return_tensors="pt")
for _ in range(max_tokens):
with torch.no_grad():
            logits = model(input_ids).logits[:, -1, :]  # HF-style output; take last position
        # Temperature scaling; clamp near zero so temperature=0 acts as greedy decoding
        logits = logits / max(temperature, 1e-6)
# Top-K filtering
if top_k > 0:
indices_to_remove = logits < torch.topk(logits, top_k)[0][:, -1:]
logits[indices_to_remove] = float('-inf')
        # Top-P (nucleus) filtering: keep the smallest set of tokens whose
        # cumulative probability reaches top_p
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        sorted_probs = F.softmax(sorted_logits, dim=-1)
        cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
        # Drop tokens whose cumulative probability *before* them already exceeds top_p
        sorted_logits[cumulative_probs - sorted_probs >= top_p] = float('-inf')
        # Scatter the filtered logits back to their original vocabulary positions
        logits = torch.full_like(logits, float('-inf')).scatter(1, sorted_indices, sorted_logits)
# Sample from the filtered distribution
probs = F.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
input_ids = torch.cat([input_ids, next_token], dim=-1)
# Stop at end-of-sequence token
if next_token.item() == tokenizer.eos_token_id:
break
return tokenizer.decode(input_ids[0])
# Temperature effects:
# 0.0 -> Always picks the most likely token (deterministic)
# 0.3 -> Conservative, factual responses
# 0.7 -> Balanced creativity and coherence
# 1.0 -> Standard sampling, moderate diversity
# 1.5 -> Very creative, may lose coherence
print("Use temperature=0 for factual tasks, 0.7-1.0 for creative tasks")
Scaling Laws and Emergent Abilities
Chinchilla Scaling Laws
DeepMind's Chinchilla paper showed that for compute-optimal training, model size and data should scale together. A model with N parameters should be trained on roughly 20N tokens.
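In code, the rule of thumb is a single multiplication; Chinchilla itself was a 70B-parameter model trained on 1.4T tokens, exactly this 20:1 ratio:
def chinchilla_optimal_tokens(num_params):
    """Compute-optimal training tokens per the ~20 tokens/parameter heuristic."""
    return 20 * num_params

for params in [1e9, 7e9, 70e9]:
    print(f"{params/1e9:.0f}B params -> {chinchilla_optimal_tokens(params)/1e12:.2f}T tokens")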
Emergent Abilities
As models scale up, qualitatively new capabilities appear that smaller models lack:
- Few-shot learning, basic reasoning
- Chain-of-thought reasoning, code generation
- Complex multi-step reasoning, theory of mind
Abilities appear suddenly at certain scale thresholds, not gradually.
Key Takeaways
- LLMs are decoder-only transformers trained to predict the next token at massive scale
- BPE tokenization breaks text into subwords; 1 token is roughly 0.75 English words
- The training pipeline is: pretrain -> SFT -> RLHF/DPO alignment
- Temperature controls the randomness/creativity trade-off in generation
- Emergent abilities appear suddenly at certain scale thresholds, not gradually