Diffusion Models

Denoising Diffusion Probabilistic Models

🌊 What are Diffusion Models?

Diffusion Models are generative models that learn to generate data by reversing a gradual noising process.

Core Idea:

Forward: Gradually add noise (x₀ → x_T)
Reverse: Learn to remove the noise (x_T → x₀)

🎨 Why Diffusion Models?

➡️ Forward Process

Gradually add Gaussian noise

x₀ (clean) → x₁ → x₂ → ... → x_T (pure noise)

⬅️ Reverse Process

Learn to denoise step-by-step

x_T (noise) → ... → x₁ → x₀ (generated image)

📊 Comparison: GAN vs VAE vs Diffusion

  • GAN: high quality and fast sampling, but unstable training and mode collapse
  • VAE: stable training and a good latent space, but blurry outputs
  • Diffusion: excellent quality, stable training, and diverse outputs, but slow sampling (many steps)

🎯 What You Will Learn

  • ➡️ Forward Process: Noise schedule β_t
  • ⬅️ Reverse Process: Denoising with a U-Net
  • 📐 DDPM: Training objective
  • ⚡ DDIM: Fast sampling

Forward Diffusion Process

Gradually Adding Noise

➡️ Forward Process

Forward diffusion is a Markov chain that gradually adds Gaussian noise to the clean data x₀.

q(x_t | x_{t-1}) = N(x_t; √(1-β_t)x_{t-1}, β_t I)
β_t: noise schedule (controls noise amount at step t)

Each step adds small Gaussian noise

🔍 Noise Schedule β_t

β_t controls how much noise is added at step t:

  • 📈 Linear schedule: β_t increases linearly from β₁ to β_T
  • 📉 Cosine schedule: Smoother transition, better performance
  • 📊 Custom schedules: Optimized for a specific dataset

Common Values:

β₁ = 10⁻⁴ (very small noise)
β_T = 0.02 (substantial noise)
T = 1000 steps

⚡ Direct Sampling: q(x_t | x₀)

We do not need to iterate T times! There is a closed-form solution:

q(x_t | x₀) = N(x_t; √(ᾱ_t)x₀, (1-ᾱ_t)I)
ᾱ_t = ∏_{s=1}^t (1-β_s) = ∏_{s=1}^t α_s
α_t = 1 - β_t

Reparameterization: x_t = √(ᾱ_t)x₀ + √(1-ᾱ_t)ε, where ε ~ N(0,I)
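The jump above can be sketched in a few lines of PyTorch, using the linear schedule values quoted earlier; the tensor shapes here are illustrative assumptions:

```python
import torch

# Minimal sketch of direct sampling q(x_t | x_0) with a linear schedule
T = 1000
beta = torch.linspace(1e-4, 0.02, T)        # linear noise schedule
alpha = 1 - beta
alpha_bar = torch.cumprod(alpha, dim=0)     # alpha_bar_t = prod_{s<=t} alpha_s

x0 = torch.randn(4, 3, 32, 32)              # stand-in for clean images
t = 500
eps = torch.randn_like(x0)                  # eps ~ N(0, I)

# Reparameterization: jump straight to x_t, no t-step loop needed
x_t = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
```

Note that alpha_bar decreases monotonically toward 0, so x_T is almost pure noise.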

📅 Diffusion Timeline

t=0
🖼️ Clean Image
x₀
t=250
🌫️ Little Noise
t=500
☁️ More Noise
t=750
🌊 Heavy Noise
t=1000
🔊 Pure Noise
x_T ~ N(0,I)

🎬 Forward Diffusion Animation

Watch image gradually become pure noise

Reverse Diffusion Process

Learning to Denoise

⬅️ Reverse Process

Reverse diffusion is a learned process that removes noise step by step, starting from pure noise x_T and ending at clean data x₀.

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))
θ: learned parameters (neural network)
μ_θ: predicted mean
Σ_θ: predicted variance (or fixed)

Goal: Learn to reverse the forward process!

🏗️ U-Net Noise Predictor

Diffusion models use a U-Net to predict the noise ε_θ(x_t, t):

  • 📥 Input: Noisy image x_t + timestep embedding t
  • 🎯 Output: Predicted noise ε_θ that was added
  • 🏛️ Architecture: Encoder-decoder with skip connections
  • Time embedding: Sinusoidal positional encoding

Key Insight:

Instead of predicting x₀ directly, predict the noise that was added!
Then compute: x₀ = (x_t - √(1-ᾱ_t)ε_θ) / √(ᾱ_t)

🔄 Denoising Step

Given x_t, to obtain x_{t-1}:

1. Predict noise: ε_θ = ε_θ(x_t, t)
2. Compute mean: μ_θ = (1/√α_t)(x_t - (β_t/√(1-ᾱ_t))ε_θ)
3. Sample: x_{t-1} = μ_θ + √β_t · z, where z ~ N(0,I)

For t=1, set z=0 (deterministic final step)
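The three steps above can be sketched as a single function. This is a minimal example assuming a linear schedule; the ε prediction is a placeholder for the U-Net call, and the code is 0-indexed, so the deterministic final step is t = 0 here:

```python
import torch

# One DDPM denoising step x_t -> x_{t-1}
T = 1000
beta = torch.linspace(1e-4, 0.02, T)
alpha = 1 - beta
alpha_bar = torch.cumprod(alpha, dim=0)

def denoise_step(x_t, t, eps_pred):
    # 1-2. predicted noise -> mean of p(x_{t-1} | x_t)
    mean = (x_t - beta[t] / (1 - alpha_bar[t]).sqrt() * eps_pred) / alpha[t].sqrt()
    # 3. sample; z = 0 at the last step (deterministic)
    z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + beta[t].sqrt() * z

x_t = torch.randn(2, 3, 8, 8)
eps_pred = torch.randn_like(x_t)  # stand-in for eps_theta(x_t, t)
x_prev = denoise_step(x_t, 500, eps_pred)
```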

🎬 Reverse Diffusion Animation

Watch noise gradually become a clean image

DDPM

Denoising Diffusion Probabilistic Models

📐 DDPM Training Objective

DDPM (Ho et al., 2020) trains the reverse process by minimizing a variational bound on the negative log-likelihood:

L_VLB = E_q[D_KL(q(x_T|x_0)||p(x_T)) + Σ_t D_KL(q(x_{t-1}|x_t,x_0)||p_θ(x_{t-1}|x_t))]

This is complex! Ho et al. showed a simpler equivalent objective...

⚡ Simplified Loss (Actually Used)

In practice, we use the simplified objective:

L_simple = E_{t,x_0,ε}[||ε - ε_θ(√(ᾱ_t)x_0 + √(1-ᾱ_t)ε, t)||²]
t ~ Uniform({1, ..., T})
ε ~ N(0, I)
x_t = √(ᾱ_t)x_0 + √(1-ᾱ_t)ε (reparameterization)

Simply: predict the noise that was added!

Training Algorithm:

  1. Sample x₀ from data
  2. Sample timestep t uniformly
  3. Sample noise ε ~ N(0,I)
  4. Compute noisy x_t using reparameterization
  5. Predict noise: ε_pred = ε_θ(x_t, t)
  6. Loss = ||ε - ε_pred||²
  7. Backprop and update θ

🎲 Sampling Algorithm (DDPM)

Initialize: x_T ~ N(0, I)
For t = T, T-1, ..., 1:
z ~ N(0,I) if t > 1, else z = 0
ε_pred = ε_θ(x_t, t)
x_{t-1} = (1/√α_t)(x_t - (β_t/√(1-ᾱ_t))ε_pred) + √β_t·z
Return: x₀

💡 Key Insights

  • 🎯 Noise prediction easier than direct x₀ prediction
  • 🔄 Reparameterization allows direct sampling of any x_t
  • 📊 Simple MSE loss works better than complex VLB
  • ⏱️ T=1000 steps typical for high-quality generation

DDIM & Improvements

Faster Sampling & Enhancements

⚡ DDIM (Denoising Diffusion Implicit Models)

Problem with DDPM: sampling requires T=1000 steps (slow!)
Solution: DDIM (Song et al., 2020), which enables deterministic sampling with far fewer steps.

x_{t-1} = √(ᾱ_{t-1})·pred_x₀ + √(1-ᾱ_{t-1}-σ_t²)·ε_θ + σ_t·z
pred_x₀ = (x_t - √(1-ᾱ_t)ε_θ) / √(ᾱ_t)
σ_t = 0 → deterministic (DDIM)
σ_t = √((1-ᾱ_{t-1})/(1-ᾱ_t))·√(1-ᾱ_t/ᾱ_{t-1}) → stochastic (DDPM)

🚀 Faster Sampling

DDIM allows skipping timesteps without retraining:

  • 🏃 10-50 steps instead of 1000 (20-100x faster!)
  • 🎯 Deterministic: same noise → same image
  • 🔄 Interpolation: smooth latent space
  • ✏️ Inversion: encode real images to latent

DDIM Sampling (S=50 steps):

Use subset of timesteps: {1000, 980, 960, ..., 40, 20}
Much faster while maintaining quality!
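The subset above can be generated with a simple stride; this is a sketch, and exact endpoints vary between implementations:

```python
# Evenly strided subset of T timesteps for S DDIM steps
T, S = 1000, 50
stride = T // S
timesteps = list(range(T, 0, -stride))  # [1000, 980, ..., 40, 20]
```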

📊 Noise Schedules

Different noise schedules for better performance:

Schedule Types
  • 📈 Linear: Original DDPM, simple
  • 📉 Cosine: Better for high-res, less noise at extremes
  • 📊 Custom: Learned or hand-tuned for specific data
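As a sketch, the cosine schedule from Nichol & Dhariwal (2021) can be written as follows; the offset s = 0.008 and the 0.999 clip follow that paper's defaults:

```python
import math
import torch

def cosine_beta_schedule(T, s=0.008):
    """Cosine schedule: alpha_bar follows a squared-cosine curve."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((steps / T + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    beta = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return beta.clamp(max=0.999).float()  # clip to avoid singularities near t = T

beta_cos = cosine_beta_schedule(1000)
```

Compared with the linear schedule, noise is added more gently at both extremes of the trajectory.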

🎨 Classifier-Free Guidance

For conditional generation (e.g., text-to-image):

ε_guided = ε_uncond + w·(ε_cond - ε_uncond)
w: guidance scale (typically 7.5)
Higher w → stronger adherence to condition

Train both conditional and unconditional models jointly!

Example (Text-to-Image):

Condition: "a cat wearing a hat"
w=1.0 → mostly ignore text
w=7.5 → strong text adherence
w=15.0 → very strong, may sacrifice quality
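The guidance rule above reduces to a one-line combination of two noise predictions. In this sketch, eps_cond and eps_uncond stand in for two U-Net forward passes (with and without the condition):

```python
import torch

def guided_noise(eps_cond, eps_uncond, w=7.5):
    """Classifier-free guidance: extrapolate from uncond toward cond."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_c = torch.randn(1, 3, 8, 8)   # prediction given the condition
eps_u = torch.randn_like(eps_c)   # prediction without the condition
eps_g = guided_noise(eps_c, eps_u, w=7.5)
```

At w = 1 this reduces to the plain conditional prediction; w > 1 extrapolates past it, strengthening the condition.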

Score-Based Models

Score Matching Perspective

🎯 Score Matching

Alternative perspective: learn the score function ∇_x log p(x).

s_θ(x, t) ≈ ∇_x log p_t(x)
Score: gradient of log probability
Points toward higher probability regions

Connection: score = -ε / √(1-ᾱ_t)

🌊 Langevin Dynamics

Sampling with Langevin dynamics:

x_{t+1} = x_t + δ·∇_x log p(x_t) + √(2δ)·z
δ: step size
z ~ N(0, I)

Move toward high density + random walk
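A toy sketch on a 1-D standard normal, where the true score is known in closed form (∇_x log N(x; 0, 1) = -x), so the update can be run without any learned model:

```python
import torch

torch.manual_seed(0)
score = lambda x: -x                   # exact score of N(0, 1)
delta = 0.01                           # step size

x = torch.randn(10_000) * 5 + 3        # start far from the target density
for _ in range(2_000):
    # Langevin update: drift toward high density + Gaussian random walk
    x = x + delta * score(x) + (2 * delta) ** 0.5 * torch.randn_like(x)
```

After enough steps the samples are approximately N(0, 1), regardless of the initialization.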

📐 SDE Formulation

Song et al. unified view: diffusion as Stochastic Differential Equation:

Forward SDE: dx = f(x,t)dt + g(t)dw
Reverse SDE: dx = [f(x,t) - g(t)²∇_x log p_t(x)]dt + g(t)dw̄
w̄: reverse-time Wiener process

Learn score, solve reverse SDE → generate samples

🔗 Connection to Diffusion

  • 🎯 Score-based: ∇_x log p(x) perspective
  • 🌊 Diffusion: Forward/reverse process perspective
  • 🔄 Equivalent: Different views of same model!
  • Unified: SDE framework combines both

Key Insight:

Denoising score matching ≈ Diffusion model training
Both learn to predict noise/score at different noise levels
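The score-noise connection can be checked numerically for a fixed x₀, since q(x_t | x₀) is Gaussian and its score is available in closed form (a minimal sketch; ᾱ_t = 0.5 is an arbitrary choice):

```python
import torch

alpha_bar_t = torch.tensor(0.5)
x0 = torch.randn(4)
eps = torch.randn(4)
x_t = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * eps

# Score of N(sqrt(ab)*x0, (1-ab)I) evaluated at x_t ...
score = -(x_t - alpha_bar_t.sqrt() * x0) / (1 - alpha_bar_t)
# ... which equals -eps / sqrt(1 - ab), matching the connection above
```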

Implementation

PyTorch Code

💻 DDPM PyTorch Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffusionModel(nn.Module):
    def __init__(self, unet, noise_steps=1000, beta_start=1e-4, beta_end=0.02):
        super().__init__()
        self.unet = unet  # Noise prediction network
        self.noise_steps = noise_steps

        # Noise schedule (linear), registered as buffers so the tensors
        # follow the model across devices with .to(device)
        self.register_buffer("beta", torch.linspace(beta_start, beta_end, noise_steps))
        self.register_buffer("alpha", 1 - self.beta)
        self.register_buffer("alpha_hat", torch.cumprod(self.alpha, dim=0))
        
    def add_noise(self, x_0, t, noise):
        """
        Add noise to x_0 to get x_t.
        
        Args:
            x_0: clean images (batch, C, H, W)
            t: timesteps (batch,)
            noise: Gaussian noise (batch, C, H, W)
        
        Returns:
            x_t: noisy images
        """
        sqrt_alpha_hat = torch.sqrt(self.alpha_hat[t])[:, None, None, None]
        sqrt_one_minus_alpha_hat = torch.sqrt(1 - self.alpha_hat[t])[:, None, None, None]
        
        x_t = sqrt_alpha_hat * x_0 + sqrt_one_minus_alpha_hat * noise
        return x_t
    
    def forward(self, x, t):
        """Predict noise."""
        return self.unet(x, t)

🎓 Training Loop

def train_diffusion(model, dataloader, optimizer, device, epochs=100):
    """Train diffusion model."""
    model.train()
    
    for epoch in range(epochs):
        for x_0 in dataloader:  # assumes the dataloader yields image tensors
            x_0 = x_0.to(device)
            batch_size = x_0.shape[0]
            
            # Sample random timesteps
            t = torch.randint(0, model.noise_steps, (batch_size,), device=device)
            
            # Sample noise
            noise = torch.randn_like(x_0)
            
            # Add noise to get x_t
            x_t = model.add_noise(x_0, t, noise)
            
            # Predict noise
            noise_pred = model(x_t, t)
            
            # Simplified loss
            loss = F.mse_loss(noise_pred, noise)
            
            # Backward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

🎲 Sampling (DDPM)

@torch.no_grad()
def sample_ddpm(model, n_samples, img_size, device):
    """
    Sample images using DDPM algorithm.
    
    Args:
        model: trained diffusion model
        n_samples: number of images to generate
        img_size: (C, H, W)
        device: cuda or cpu
    
    Returns:
        generated images
    """
    model.eval()
    C, H, W = img_size
    
    # Start from pure noise
    x = torch.randn(n_samples, C, H, W, device=device)
    
    # Iteratively denoise
    for t in reversed(range(model.noise_steps)):
        # Create timestep tensor
        t_tensor = torch.full((n_samples,), t, device=device, dtype=torch.long)
        
        # Predict noise
        noise_pred = model(x, t_tensor)
        
        # Get schedule values
        alpha = model.alpha[t]
        alpha_hat = model.alpha_hat[t]
        beta = model.beta[t]
        
        # Denoise step
        if t > 0:
            noise = torch.randn_like(x)
        else:
            noise = torch.zeros_like(x)
        
        x = (1 / torch.sqrt(alpha)) * (x - ((1 - alpha) / torch.sqrt(1 - alpha_hat)) * noise_pred) + torch.sqrt(beta) * noise
    
    return x

⚡ DDIM Sampling

@torch.no_grad()
def sample_ddim(model, n_samples, img_size, device, ddim_steps=50):
    """Fast sampling with DDIM."""
    model.eval()
    C, H, W = img_size
    
    # Subset of timesteps
    timesteps = torch.linspace(model.noise_steps - 1, 0, ddim_steps).long().tolist()
    
    # Start from noise
    x = torch.randn(n_samples, C, H, W, device=device)
    
    for i, t in enumerate(timesteps):
        t_tensor = torch.full((n_samples,), t, device=device, dtype=torch.long)
        
        # Predict noise
        noise_pred = model(x, t_tensor)
        
        # Predict x_0
        alpha_hat = model.alpha_hat[t]
        pred_x0 = (x - torch.sqrt(1 - alpha_hat) * noise_pred) / torch.sqrt(alpha_hat)
        
        if i < len(timesteps) - 1:
            t_prev = timesteps[i + 1]
            alpha_hat_prev = model.alpha_hat[t_prev]
            
            # DDIM update (deterministic)
            x = torch.sqrt(alpha_hat_prev) * pred_x0 + torch.sqrt(1 - alpha_hat_prev) * noise_pred
        else:
            x = pred_x0
    
    return x

Applications

Diffusion Models in the Wild

🚀 Diffusion Model Applications

Diffusion models have become state-of-the-art for a wide range of generative tasks:

🎨 Stable Diffusion

Latent Diffusion Models

Diffusion in compressed latent space (VAE encoder)

Text-to-image, img2img, inpainting

🌟 DALL-E 2

CLIP + Diffusion

Text → CLIP embedding → Diffusion decoder

High-quality text-to-image generation

🖼️ Image Generation

Unconditional/Class-conditional

Generate realistic images from noise

ImageNet, FFHQ, LSUN benchmarks

✏️ Image Editing

Inpainting & Outpainting

Fill missing regions or extend images

Remove objects, extend backgrounds

🔍 Super-Resolution

SR3, Imagen

Upscale low-res images to high-res

64×64 → 256×256 → 1024×1024

🎬 Video Generation

Temporal Diffusion

Generate coherent video sequences

Text-to-video, video prediction

🎵 Audio Synthesis

WaveGrad, DiffWave

Generate high-quality audio waveforms

Text-to-speech, music generation

🧬 Molecular Design

Protein/Drug Generation

Generate novel molecular structures

Drug discovery, protein folding

💡 Why Diffusion Models Excel

  • High Quality: State-of-the-art image/video generation
  • Stable Training: No mode collapse like GANs
  • Diverse Outputs: Stochastic sampling
  • Flexible Conditioning: Text, class, layout, etc.
  • Principled Framework: Strong theoretical foundation

🔮 Future Directions

  • Faster Sampling: 1-step diffusion models
  • 📹 Longer Videos: Temporal consistency
  • 🎮 3D Generation: NeRF + diffusion
  • 🧠 Efficiency: Smaller models, edge deployment
  • 🎨 Control: Better user control over generation

✅ Congratulations!

🎉 Tutorial Complete!

You have learned:

  • ✅ Forward diffusion process (noise addition)
  • ✅ Reverse diffusion process (denoising)
  • ✅ DDPM training & sampling
  • ✅ DDIM for faster generation
  • ✅ Score-based perspective
  • ✅ PyTorch implementation
  • ✅ Real-world applications

🚀 Next Steps

• Implement DDPM for your dataset

• Experiment with different noise schedules

• Try Stable Diffusion, DALL-E 2

• Read papers: DDPM, DDIM, Score-based models