Diffusion Models & Image Generation
How diffusion models work for image generation, including the Stable Diffusion architecture, U-Net, CLIP, ControlNet, and LoRA fine-tuning
How Diffusion Models Work
Diffusion models generate images by learning to reverse a noise-adding process. During training, Gaussian noise is gradually added to images (the forward process), and the model learns to remove that noise step by step (the reverse process). At generation time, the model starts from pure noise and denoises it until an image emerges.
The Two Processes:
- Forward process: Gaussian noise is added to a clean image over T timesteps until nothing but noise remains.
- Reverse process: a neural network predicts the noise present at each timestep and removes it, gradually recovering a clean image.
DDPM: Denoising Diffusion Probabilistic Models
import torch
import torch.nn as nn
import numpy as np

class SimpleDDPM:
    """Simplified DDPM for understanding the diffusion process."""

    def __init__(self, num_timesteps=1000):
        self.T = num_timesteps
        # Noise schedule: linearly increasing betas
        self.betas = torch.linspace(1e-4, 0.02, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alpha_cumprod = torch.cumprod(self.alphas, dim=0)

    def forward_process(self, x0, t):
        """
        Add noise to image x0 at timestep t.
        q(x_t | x_0) = N(x_t; sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)
        """
        sqrt_alpha = torch.sqrt(self.alpha_cumprod[t])
        sqrt_one_minus = torch.sqrt(1 - self.alpha_cumprod[t])
        noise = torch.randn_like(x0)
        x_t = sqrt_alpha * x0 + sqrt_one_minus * noise
        return x_t, noise

    def reverse_step(self, model, x_t, t):
        """
        Denoise one step: predict noise, remove it.
        The model predicts the noise that was added.
        """
        predicted_noise = model(x_t, t)
        alpha = self.alphas[t]
        alpha_bar = self.alpha_cumprod[t]
        beta = self.betas[t]
        # Compute the mean of p(x_{t-1} | x_t)
        mean = (1 / torch.sqrt(alpha)) * (
            x_t - (beta / torch.sqrt(1 - alpha_bar)) * predicted_noise
        )
        # Add noise (except at t=0)
        if t > 0:
            noise = torch.randn_like(x_t)
            sigma = torch.sqrt(beta)
            return mean + sigma * noise
        return mean

    def generate(self, model, shape):
        """Generate an image from pure noise."""
        x = torch.randn(shape)  # start from pure noise
        for t in reversed(range(self.T)):
            x = self.reverse_step(model, x, t)
            if t % 200 == 0:
                print(f"  Step {self.T - t}/{self.T}: denoising...")
        return x

# Training objective:
# 1. Sample a clean image x0
# 2. Sample a random timestep t
# 3. Add noise: x_t = forward_process(x0, t)
# 4. Predict noise: predicted = model(x_t, t)
# 5. Loss = MSE(predicted, actual_noise)

ddpm = SimpleDDPM(num_timesteps=1000)
print("DDPM: learn to predict noise, then remove it step by step")
Stable Diffusion Architecture
CLIP Text Encoder
Converts your text prompt into a sequence of embedding vectors. CLIP was trained on 400M image-text pairs, so it understands the relationship between text descriptions and visual concepts.
VAE (Variational Autoencoder)
Compresses images into a much smaller latent space: a 512x512x3 image becomes a 64x64x4 latent, roughly 48x fewer values. Diffusion happens in this latent space, making it far cheaper than pixel-space diffusion.
U-Net (Noise Predictor)
The core denoising model. A U-shaped network with skip connections that predicts noise at each timestep. Conditioned on both the timestep and the text embedding via cross-attention.
Scheduler (Sampler)
Controls the denoising schedule. Options: DDPM (1000 steps), DDIM (~50 steps), Euler, DPM-Solver. Fewer steps are faster; better solvers such as DPM-Solver keep quality high with only 20-30 steps.
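Each of these components can be loaded on its own from a Stable Diffusion checkpoint, which makes the division of labor concrete. A minimal sketch using the diffusers and transformers component classes; the checkpoint ID, the example prompt, and the single denoising step are illustrative assumptions, not a full sampling loop:

from transformers import CLIPTokenizer, CLIPTextModel
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler
import torch

model_id = "runwayml/stable-diffusion-v1-5"

tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")  # CLIP: prompt -> embeddings
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")                     # VAE: pixels <-> latents
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")            # U-Net: noise predictor
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")         # denoising schedule

# Text conditioning: the prompt becomes a (1, 77, 768) embedding sequence
tokens = tokenizer("a mountain at sunset", padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
text_emb = text_encoder(tokens.input_ids)[0]

# A 512x512 image corresponds to a 4x64x64 latent; generation starts from noise here
latents = torch.randn(1, unet.config.in_channels, 64, 64)

# One denoising step: the U-Net predicts noise conditioned on the text embedding
noise_pred = unet(latents, timestep=999, encoder_hidden_states=text_emb).sample
# After the full denoising loop, vae.decode(latents / vae.config.scaling_factor)
# maps the latent back to pixel space.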
Using Stable Diffusion in Python
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch

# Load Stable Diffusion model
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,  # half precision for speed
)
pipe.to("cuda")  # move to GPU

# Use a faster scheduler (20-30 steps instead of 50)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Generate an image
prompt = "A serene mountain landscape at sunset, oil painting style, highly detailed"
negative_prompt = "blurry, low quality, distorted, deformed"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=25,  # number of denoising steps
    guidance_scale=7.5,      # how closely to follow the prompt
    width=512, height=512,
).images[0]

image.save("mountain_landscape.png")
print("Image generated and saved!")
# Img2Img: modify an existing image
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
).to("cuda")

# init_image = Image.open("sketch.png").resize((512, 512))
# result = img2img(
#     prompt="a detailed watercolor painting",
#     image=init_image,
#     strength=0.75,  # how much to change (0=nothing, 1=completely new)
# ).images[0]
ControlNet and LoRA Fine-Tuning
# ControlNet: Add spatial control to diffusion models
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Load ControlNet for edge-guided generation
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Use edge map to control the generated image structure
# canny_image = get_canny_edges(input_image)  # detect edges
# result = pipe("a beautiful house", image=canny_image).images[0]
# LoRA: Lightweight fine-tuning for custom styles
# LoRA adds small trainable matrices to attention layers
# Instead of fine-tuning all 1B+ params, only train ~1-10M params
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Load a LoRA adapter (e.g., trained on anime style)
# pipe.load_lora_weights("path/to/lora/weights")
# image = pipe("a warrior in anime style").images[0]

# LoRA advantages:
# - File size: ~3-50 MB (vs 2-5 GB for full model)
# - Training: minutes to hours on a single GPU
# - Composable: stack multiple LoRAs together
# - Shareable: easy to distribute on Civitai/HuggingFace
print("LoRA: fine-tune style with ~1% of the parameters")
Key Takeaways
- Diffusion models learn to reverse a noise-adding process, generating images from pure noise
- Stable Diffusion works in latent space (64x64) for efficiency, not pixel space (512x512)
- The U-Net predicts noise conditioned on text embeddings via cross-attention
- ControlNet adds spatial control (edges, poses, depth) to guide image generation
- LoRA enables efficient fine-tuning with only 1% of parameters for custom styles