
Diffusion Models & Image Generation

How diffusion models work for image generation, including Stable Diffusion architecture, U-Net, CLIP, ControlNet, and LoRA fine-tuning

How Diffusion Models Work

Diffusion models generate images by learning to reverse a noise-adding process. During training, Gaussian noise is gradually added to images (forward process). The model learns to remove noise step by step (reverse process), eventually generating images from pure noise.

The Two Processes:

  • Forward Process (fixed): Gradually add Gaussian noise over T steps until the image becomes pure noise. No learning needed.
  • Reverse Process (learned): A neural network predicts and removes noise at each step, gradually reconstructing a clean image from noise.
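The forward process has a convenient closed form: x_t can be sampled directly from x_0 in one shot, without iterating. A quick numpy sketch using the standard linear beta schedule from the DDPM paper shows how the signal fraction (alpha_bar_t) decays to essentially zero by the final step:

```python
import numpy as np

# Standard linear beta schedule from the DDPM paper
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # cumulative product: signal fraction at step t

# Closed form: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
for t in [0, 250, 500, 999]:
    signal = np.sqrt(alpha_bar[t])
    noise_scale = np.sqrt(1 - alpha_bar[t])
    print(f"t={t:4d}: signal={signal:.4f}, noise={noise_scale:.4f}")
# By t=999 the "image" term is negligible: x_T is effectively pure noise
```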

DDPM: Denoising Diffusion Probabilistic Models

import torch
import torch.nn as nn
import numpy as np

class SimpleDDPM:
    """Simplified DDPM for understanding the diffusion process."""

    def __init__(self, num_timesteps=1000):
        self.T = num_timesteps

        # Noise schedule: linearly increasing betas
        self.betas = torch.linspace(1e-4, 0.02, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alpha_cumprod = torch.cumprod(self.alphas, dim=0)

    def forward_process(self, x0, t):
        """
        Add noise to image x0 at timestep t.
        q(x_t | x_0) = N(x_t; sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)
        """
        sqrt_alpha = torch.sqrt(self.alpha_cumprod[t])
        sqrt_one_minus = torch.sqrt(1 - self.alpha_cumprod[t])

        noise = torch.randn_like(x0)
        x_t = sqrt_alpha * x0 + sqrt_one_minus * noise
        return x_t, noise

    def reverse_step(self, model, x_t, t):
        """
        Denoise one step: predict noise, remove it.
        The model predicts the noise that was added.
        """
        predicted_noise = model(x_t, t)

        alpha = self.alphas[t]
        alpha_bar = self.alpha_cumprod[t]
        beta = self.betas[t]

        # Compute the mean of p(x_{t-1} | x_t)
        mean = (1 / torch.sqrt(alpha)) * (
            x_t - (beta / torch.sqrt(1 - alpha_bar)) * predicted_noise
        )

        # Add noise (except at t=0)
        if t > 0:
            noise = torch.randn_like(x_t)
            sigma = torch.sqrt(beta)
            return mean + sigma * noise
        return mean

    def generate(self, model, shape):
        """Generate an image from pure noise."""
        x = torch.randn(shape)  # start from pure noise

        for t in reversed(range(self.T)):
            x = self.reverse_step(model, x, t)
            if t % 200 == 0:
                print(f"  Step {self.T - t}/{self.T}: denoising...")

        return x

# Training objective:
# 1. Sample a clean image x0
# 2. Sample a random timestep t
# 3. Add noise: x_t = forward_process(x0, t)
# 4. Predict noise: predicted = model(x_t, t)
# 5. Loss = MSE(predicted, actual_noise)
ddpm = SimpleDDPM(num_timesteps=1000)
print("DDPM: learn to predict noise, then remove it step by step")
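The five-step training objective above can be written as a short loop. Here is a minimal, self-contained sketch: the tiny MLP and random "data" are placeholders for illustration only, standing in for the real U-Net and image dataset, and the noise schedule is restated inline so the snippet runs on its own:

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Closed-form forward process q(x_t | x_0)."""
    noise = torch.randn_like(x0)
    x_t = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * noise
    return x_t, noise

# Toy noise predictor (a real system uses a U-Net conditioned on t)
model = nn.Sequential(nn.Linear(17, 64), nn.ReLU(), nn.Linear(64, 16))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    x0 = torch.randn(32, 16)                  # 1. sample "clean" data
    t = torch.randint(0, T, (1,)).item()      # 2. sample a random timestep
    x_t, noise = add_noise(x0, t)             # 3. add noise
    t_feat = torch.full((32, 1), t / T)       # timestep as an extra input feature
    predicted = model(torch.cat([x_t, t_feat], dim=1))   # 4. predict the noise
    loss = nn.functional.mse_loss(predicted, noise)      # 5. MSE loss
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final loss: {loss.item():.3f}")
```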

Stable Diffusion Architecture

CLIP Text Encoder

Converts your text prompt into a vector embedding. CLIP was trained on 400M image-text pairs, so it understands the relationship between text descriptions and visual concepts.

VAE (Variational Autoencoder)

Compresses images into a smaller latent space: a 512x512 RGB image becomes a 64x64 latent with 4 channels, roughly 48x fewer values. Diffusion runs entirely in this latent space, which is what makes Stable Diffusion dramatically cheaper than pixel-space diffusion.
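A back-of-the-envelope on that compression (the 4-channel, 8x-downsampled latent is the standard configuration for SD 1.x/2.x):

```python
# Pixel space: a 512 x 512 RGB image
pixel_values = 512 * 512 * 3        # 786,432 values

# Latent space: the VAE downsamples 8x per side and uses 4 channels
latent_values = 64 * 64 * 4         # 16,384 values

ratio = pixel_values / latent_values
print(f"Each denoising step processes {ratio:.0f}x fewer values in latent space")
```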

U-Net (Noise Predictor)

The core denoising model. A U-shaped network with skip connections that predicts noise at each timestep. Conditioned on both the timestep and the text embedding via cross-attention.
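The cross-attention conditioning can be sketched in a few lines: queries come from the image (latent) features, while keys and values come from the text embedding, so each spatial position attends over the prompt's tokens. A minimal single-head version, with dimensions chosen to echo SD's typical sizes (320-dim image features, 768-dim CLIP tokens) but otherwise illustrative:

```python
import torch
import torch.nn as nn

d_img, d_txt, d = 320, 768, 64   # image feature dim, text embed dim, head size

to_q = nn.Linear(d_img, d, bias=False)  # queries from image features
to_k = nn.Linear(d_txt, d, bias=False)  # keys from text embedding
to_v = nn.Linear(d_txt, d, bias=False)  # values from text embedding

img_feats = torch.randn(1, 64 * 64, d_img)  # flattened latent feature map
text_emb = torch.randn(1, 77, d_txt)        # CLIP output: 77 token embeddings

q, k, v = to_q(img_feats), to_k(text_emb), to_v(text_emb)
attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # (1, 4096, 77)
out = attn @ v  # each image position becomes a mixture of text token values

print(out.shape)  # torch.Size([1, 4096, 64])
```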

Scheduler (Sampler)

Controls the denoising schedule. Options include DDPM (1000 steps), DDIM (~50 steps), Euler, and DPM-Solver (20-30 steps). Fewer steps mean faster generation; modern solvers like DPM-Solver retain most of the quality even at 20-30 steps.

Using Stable Diffusion in Python

from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch

# Load Stable Diffusion model
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,  # half precision for speed
)
pipe.to("cuda")  # move to GPU

# Use a faster scheduler (20-30 steps instead of 50)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Generate an image
prompt = "A serene mountain landscape at sunset, oil painting style, highly detailed"
negative_prompt = "blurry, low quality, distorted, deformed"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=25,       # number of denoising steps
    guidance_scale=7.5,           # how closely to follow the prompt
    width=512, height=512,
).images[0]

image.save("mountain_landscape.png")
print("Image generated and saved!")

# Img2Img: modify an existing image
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
).to("cuda")

# init_image = Image.open("sketch.png").resize((512, 512))
# result = img2img(
#     prompt="a detailed watercolor painting",
#     image=init_image,
#     strength=0.75,  # how much to change (0=nothing, 1=completely new)
# ).images[0]
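The guidance_scale parameter used above implements classifier-free guidance: at every denoising step the U-Net runs twice, once conditioned on the prompt and once unconditioned (or on the negative prompt), and the two noise predictions are combined. The formula itself is one line; the arrays below are toy stand-ins for the two U-Net outputs:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: push the prediction toward the prompt."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy noise predictions (stand-ins for two U-Net forward passes)
eps_uncond = np.zeros(4)
eps_cond = np.ones(4)

print(cfg(eps_uncond, eps_cond, 1.0))   # scale 1.0: just the conditional prediction
print(cfg(eps_uncond, eps_cond, 7.5))   # scale 7.5: prompt influence amplified
```

Higher scales follow the prompt more literally but can over-saturate or distort; values around 7-8 are a common default.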

ControlNet and LoRA Fine-Tuning

# ControlNet: Add spatial control to diffusion models
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Load ControlNet for edge-guided generation
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Use edge map to control the generated image structure
# canny_image = get_canny_edges(input_image)  # detect edges
# result = pipe("a beautiful house", image=canny_image).images[0]

# LoRA: Lightweight fine-tuning for custom styles
# LoRA adds small trainable matrices to attention layers
# Instead of fine-tuning all 1B+ params, only train ~1-10M params

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Load a LoRA adapter (e.g., trained on anime style)
# pipe.load_lora_weights("path/to/lora/weights")
# image = pipe("a warrior in anime style").images[0]

# LoRA advantages:
# - File size: ~3-50 MB (vs 2-5 GB for full model)
# - Training: minutes to hours on a single GPU
# - Composable: stack multiple LoRAs together
# - Shareable: easy to distribute on Civitai/HuggingFace
print("LoRA: fine-tune style with ~1% of the parameters")
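The low-rank update itself is simple to sketch: instead of training the frozen weight W, a LoRA layer learns two small matrices A and B whose scaled product is added to W's output. A minimal version (the rank, alpha, and 768-dim layer size are illustrative choices, not tied to any particular checkpoint):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: Wx + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank=4, alpha=4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the original weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init:
        self.scale = alpha / rank   # ...the adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.1f}%)")
```

Because B starts at zero, adding the adapter initially leaves the model's behavior unchanged; training then nudges only the 2 x rank x dim extra parameters.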

Key Takeaways

  • Diffusion models learn to reverse a noise-adding process, generating images from pure noise
  • Stable Diffusion works in latent space (64x64) for efficiency, not pixel space (512x512)
  • The U-Net predicts noise conditioned on text embeddings via cross-attention
  • ControlNet adds spatial control (edges, poses, depth) to guide image generation
  • LoRA enables efficient fine-tuning with only 1% of parameters for custom styles
