Large Language Models: How They Work

Understand GPT architecture, BPE tokenization, pretraining vs fine-tuning, RLHF, context windows, and scaling laws of LLMs

What Makes an LLM "Large"?

Large Language Models are transformer-based neural networks with billions of parameters, trained on massive text corpora. They learn to predict the next token, and from this simple objective, emergent abilities like reasoning, coding, and instruction following arise at scale.

Scale of Modern LLMs:

GPT-3: 175B params, 300B tokens
GPT-4: ~1.8T params (rumored MoE)
Llama 3: 8B-405B params
Claude: Undisclosed, very large
Gemini: Multimodal, MoE arch
Training cost: $10M-$100M+

GPT Architecture: Decoder-Only Transformer

GPT models use a decoder-only transformer with causal (left-to-right) attention. Each token can only attend to previous tokens, enabling autoregressive text generation.

import torch
import torch.nn as nn

class GPTBlock(nn.Module):
    """A single GPT transformer block."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                           dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),  # GPT uses GELU instead of ReLU
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x, mask=None):
        # Pre-norm architecture (GPT-2 style)
        normed = self.ln1(x)
        attn_out, _ = self.attn(normed, normed, normed, attn_mask=mask)
        x = x + attn_out  # residual connection

        normed = self.ln2(x)
        x = x + self.ffn(normed)  # residual connection
        return x

class MiniGPT(nn.Module):
    """Minimal GPT model for understanding the architecture."""

    def __init__(self, vocab_size, d_model, num_heads, num_layers, max_seq_len):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_seq_len, d_model)  # learned positions
        self.blocks = nn.ModuleList([
            GPTBlock(d_model, num_heads, d_model * 4) for _ in range(num_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.token_emb.weight  # weight tying, as in GPT-2

    def forward(self, idx):
        B, T = idx.shape
        # Create causal mask (lower triangular)
        mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
        mask = mask.to(idx.device)

        # Token + positional embeddings
        positions = torch.arange(T, device=idx.device)
        x = self.token_emb(idx) + self.pos_emb(positions)

        # Pass through transformer blocks
        for block in self.blocks:
            x = block(x, mask)

        x = self.ln_f(x)
        logits = self.head(x)  # (B, T, vocab_size)
        return logits

# GPT-2 Small: vocab=50257, d_model=768, heads=12, layers=12
model = MiniGPT(vocab_size=50257, d_model=768, num_heads=12,
                num_layers=12, max_seq_len=1024)
params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {params/1e6:.0f}M")  # ~124M (GPT-2 Small)

BPE Tokenization: How LLMs See Text

LLMs don't read characters or words; they read tokens. Byte Pair Encoding (BPE) builds a fixed vocabulary by repeatedly merging the most frequent symbol pairs in the training corpus, so common words become single tokens while rare words split into subword pieces.

from transformers import AutoTokenizer

# GPT-2 uses BPE (Byte Pair Encoding) tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenization examples
examples = [
    "Hello, world!",
    "Machine learning is fascinating",
    "supercalifragilisticexpialidocious",  # rare word -> subwords
    "def fibonacci(n):\n    return n if n < 2 else fibonacci(n-1) + fibonacci(n-2)",
]

for text in examples:
    tokens = tokenizer.encode(text)
    decoded_tokens = [tokenizer.decode([t]) for t in tokens]
    print(f"Text: {text[:50]}")
    print(f"  Tokens ({len(tokens)}): {decoded_tokens[:10]}")
    print()

# Key insight: 1 token is approximately 0.75 English words
# "Hello" = 1 token, " world" = 1 token (note the space!)
# Code is less token-efficient than natural language
# Non-English text uses more tokens per word

# Token budget matters for cost and context length:
# GPT-4 Turbo: 128K context window = ~96K words
# Claude 3: 200K context window = ~150K words

text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(text)
print(f"'{text}'")
print(f"  Word count: {len(text.split())}")
print(f"  Token count: {len(tokens)}")
print(f"  Ratio: {len(tokens)/len(text.split()):.2f} tokens per word")

The LLM Training Pipeline

Stage 1: Pretraining

Train on trillions of tokens from the internet. Objective: predict the next token. This gives the model general knowledge and language understanding. Cost: $10M-$100M+ in compute.
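
At its core, this stage is just cross-entropy on shifted sequences: predict token t+1 from tokens 0..t. A minimal sketch, reusing the MiniGPT model defined above (the random batch is a stand-in for real corpus text):

import torch
import torch.nn.functional as F

batch = torch.randint(0, 50257, (4, 128))      # stand-in for a corpus batch
inputs, targets = batch[:, :-1], batch[:, 1:]  # shift by one position

logits = model(inputs)  # (B, T, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                       targets.reshape(-1))
loss.backward()  # real training adds AdamW, LR warmup + cosine decay, etc.
print(f"Next-token loss: {loss.item():.2f}")  # ~10.8 at init (= ln 50257)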

Stage 2: Supervised Fine-Tuning (SFT)

Train on high-quality instruction-response pairs curated by humans. Teaches the model to follow instructions, be helpful, and format responses properly. Typically 10K-100K examples.
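
One practical detail: the SFT loss is usually computed only on the response tokens, with prompt positions masked out. A minimal sketch of that masking, again using the MiniGPT above (the token IDs and prompt length here are made up for illustration):

import torch
import torch.nn.functional as F

input_ids = torch.randint(0, 50257, (1, 64))  # [prompt ; response] tokens
prompt_len = 20                               # hypothetical prompt length

targets = input_ids[:, 1:].clone()
targets[:, :prompt_len - 1] = -100            # ignore prompt positions

logits = model(input_ids[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                       targets.reshape(-1), ignore_index=-100)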

Stage 3: RLHF (Reinforcement Learning from Human Feedback)

Train a reward model on human preference data (which response is better?). Then use PPO to optimize the LLM to generate responses the reward model scores highly. This aligns the model with human values.
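
The reward model itself is commonly trained with a Bradley-Terry pairwise loss: the chosen response should score higher than the rejected one. A minimal sketch of just that loss (the scores are placeholders; in practice they come from an LLM with a scalar head):

import torch
import torch.nn.functional as F

# P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)
r_chosen = torch.tensor([1.3, 0.2])    # scores for chosen responses
r_rejected = torch.tensor([0.4, 0.9])  # scores for rejected responses

reward_loss = -F.logsigmoid(r_chosen - r_rejected).mean()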

Alternative: DPO (Direct Preference Optimization)

A newer approach that skips the reward model and optimizes the LLM directly on preference pairs. It is simpler, more stable, and increasingly popular (used by Llama 3 and Zephyr).
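
In code, the DPO loss needs only the summed log-probability of each response under the policy and under a frozen reference model. A sketch under that assumption (the log-prob values are placeholders):

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Push the policy's preference margin above the reference model's."""
    policy_margin = policy_chosen - policy_rejected
    ref_margin = ref_chosen - ref_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Placeholder summed log-probs for two preference pairs
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-13.5, -10.5]))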

Text Generation: Temperature and Sampling

The code below assumes a Hugging Face causal LM (e.g. GPT2LMHeadModel), whose forward pass returns an output object with a .logits attribute.

import torch
import torch.nn.functional as F

def generate_text(model, tokenizer, prompt, max_tokens=50,
                  temperature=1.0, top_p=0.9, top_k=50):
    """
    Generate text with various decoding strategies.

    temperature: Controls randomness (0 = greedy, 1 = standard, >1 = creative)
    top_p: Nucleus sampling (only consider tokens summing to this probability)
    top_k: Only consider the top-k most likely tokens
    """
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    for _ in range(max_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits[:, -1, :]  # last token logits

        # Apply temperature scaling; the small floor keeps temperature=0
        # well-defined (the distribution then collapses onto the argmax)
        logits = logits / max(temperature, 1e-6)

        # Top-K filtering
        if top_k > 0:
            indices_to_remove = logits < torch.topk(logits, top_k)[0][:, -1:]
            logits[indices_to_remove] = float('-inf')

        # Top-P (nucleus) filtering: drop a token once the cumulative
        # probability of the tokens ranked *before* it reaches top_p
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        sorted_probs = F.softmax(sorted_logits, dim=-1)
        cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
        sorted_mask = cumulative_probs - sorted_probs >= top_p
        sorted_logits[sorted_mask] = float('-inf')
        # Un-sort back to vocabulary order (sorted_indices is a full
        # permutation, so scatter overwrites every position)
        logits = sorted_logits.scatter(1, sorted_indices, sorted_logits)

        # Sample from the filtered distribution
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

        # Stop at end-of-sequence token
        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(input_ids[0])

# Temperature effects:
# 0.0 -> Always picks the most likely token (deterministic)
# 0.3 -> Conservative, factual responses
# 0.7 -> Balanced creativity and coherence
# 1.0 -> Standard sampling, moderate diversity
# 1.5 -> Very creative, may lose coherence
print("Use temperature=0 for factual tasks, 0.7-1.0 for creative tasks")

Scaling Laws and Emergent Abilities

Chinchilla Scaling Laws

DeepMind's Chinchilla paper showed that for compute-optimal training, model size and data should scale together. A model with N parameters should be trained on roughly 20N tokens.
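
A quick back-of-the-envelope with that rule, using the published Llama 3 model sizes:

# Chinchilla rule of thumb: ~20 training tokens per parameter
for n_params in [8e9, 70e9, 405e9]:
    print(f"{n_params/1e9:.0f}B params -> ~{20 * n_params / 1e12:.1f}T tokens")
# 8B -> ~0.2T, 70B -> ~1.4T, 405B -> ~8.1T
# Llama 3 actually trained on ~15T tokens -- well past compute-optimal,
# trading extra training compute for cheaper inference at a given quality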

Emergent at ~10B params:

Few-shot learning, basic reasoning

Emergent at ~100B params:

Chain-of-thought reasoning, code generation

Emergent at ~500B+ params:

Complex multi-step reasoning, theory of mind

Key insight:

Abilities appear suddenly at certain scale thresholds, not gradually

Key Takeaways

  • LLMs are decoder-only transformers trained to predict the next token at massive scale
  • BPE tokenization breaks text into subwords; 1 token is roughly 0.75 English words
  • The training pipeline is: pretrain -> SFT -> RLHF/DPO alignment
  • Temperature controls the randomness/creativity trade-off in generation
  • Emergent abilities appear suddenly at certain scale thresholds, not gradually
