
Diffusion Models and Score-Based Generative Models


Diffusion models and score-based generative models are two of the most powerful and rapidly evolving approaches in generative AI, powering tools such as DALL·E 2, Stable Diffusion, Imagen, and Midjourney.

🌫️ Diffusion Models & Score-Based Generative Models

Generating images, audio, and more by reversing noise.

🧠 What Are Diffusion Models?

Diffusion models are a class of generative models that learn to generate data (e.g., images) by reversing a diffusion process — a step-by-step transformation of random noise into structured data.

Inspired by non-equilibrium thermodynamics, these models simulate how structure can emerge from randomness.

⛓️ Core Intuition

Forward Process: Add Noise

Gradually add Gaussian noise to real data over many steps until it becomes pure noise.

q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)

  • At each timestep t, the data becomes less recognizable.
  • After T steps, x_T \sim \mathcal{N}(0, \mathbf{I}).
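A useful property of this forward process is that x_t can be sampled directly from x_0 in closed form, x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon with \bar{\alpha}_t = \prod_{s \le t}(1 - \beta_s), without iterating through every step. A minimal NumPy sketch (toy 8×8 "image"; the linear β schedule matches the DDPM paper, but the function name and shapes are illustrative):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)        # alpha_bar_t = prod_{s<=t} (1 - beta_s)
    eps = rng.standard_normal(x0.shape)   # the noise the network will learn to predict
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)     # linear schedule from the DDPM paper
x0 = rng.standard_normal((8, 8))          # a toy "image"
x_mid, _ = forward_diffuse(x0, 500, betas, rng)
x_end, _ = forward_diffuse(x0, 999, betas, rng)
# By t = T - 1, alpha_bar is nearly 0, so x_T is essentially pure Gaussian noise.
```

Because any x_t is one draw away from x_0, training can sample random timesteps instead of simulating the whole chain.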

Reverse Process: Remove Noise

Learn to denoise from noise back to data using a neural network \epsilon_\theta(x_t, t):

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)

  • The denoising network is trained to reverse the noise and recover clean data.

📉 Training Objective

Most diffusion models use variational inference or denoising score matching to train.

Simplified loss (DDPM):

\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t}\left[ \left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2 \right]

The network learns to predict the noise \epsilon added at timestep t, given the noisy image x_t.
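One training step of this objective can be sketched as follows. In practice \epsilon_\theta is a U-Net trained by gradient descent; here a placeholder callable stands in for it, and `ddpm_loss` is an illustrative name:

```python
import numpy as np

def ddpm_loss(x0, eps_model, betas, rng):
    """Simplified DDPM objective: draw a random timestep, noise the image
    in closed form, and penalize squared error between true and predicted noise."""
    T = len(betas)
    alpha_bar = np.cumprod(1.0 - betas)
    t = rng.integers(0, T)                  # t ~ Uniform{0, ..., T-1}
    eps = rng.standard_normal(x0.shape)     # the true noise
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_pred = eps_model(xt, t)             # a U-Net in practice; any callable here
    return np.mean((eps - eps_pred) ** 2)   # || eps - eps_theta(x_t, t) ||^2

rng = np.random.default_rng(1)
betas = np.linspace(1e-4, 0.02, 1000)
x0 = rng.standard_normal((8, 8))
# Placeholder "network" that always predicts zero noise:
loss = ddpm_loss(x0, lambda xt, t: np.zeros_like(xt), betas, rng)
```

A zero-noise predictor yields a loss near 1 (the variance of the true noise); a trained network drives this toward its irreducible floor.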

🔁 Sampling Process

  1. Start from Gaussian noise x_T \sim \mathcal{N}(0, \mathbf{I})
  2. Run the reverse denoising process step-by-step
  3. Output is a generated image (or audio, 3D shape, etc.)

This is slow: the original formulation requires hundreds to thousands of denoising steps, each a full network evaluation.
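The reverse loop above can be sketched as ancestral DDPM sampling. A toy NumPy version, assuming a placeholder noise predictor in place of a trained \epsilon_\theta (the function name and 50-step schedule are illustrative only):

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral DDPM sampling: start from pure noise, denoise step by step.
    x_{t-1} = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta) / sqrt(alpha_t)
              + sqrt(beta_t) * z"""
    T = len(betas)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)          # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        eps_pred = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alphas[t])
        if t > 0:                           # no noise is injected at the final step
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean
    return x

rng = np.random.default_rng(2)
betas = np.linspace(1e-4, 0.02, 50)         # few steps, purely for illustration
sample = ddpm_sample(lambda x, t: np.zeros_like(x), (8, 8), betas, rng)
```

Each iteration calls the network once, which is exactly why step counts dominate sampling cost.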

🎨 Popular Diffusion Model Variants

🔹 DDPM (Denoising Diffusion Probabilistic Model)

  • Original method by Ho et al. (2020)
  • Simple and effective, but slow sampling

🔹 Improved DDPMs

  • Cosine noise schedule and learned variances (Nichol & Dhariwal, 2021)
  • Classifier guidance followed shortly after (Dhariwal & Nichol, 2021)

🔹 Latent Diffusion Models (LDMs)

  • Used in Stable Diffusion
  • Diffusion happens in latent space, not pixel space → much faster

🔹 Guided Diffusion

  • Uses class labels or text as guidance
  • Conditional sampling: generate "a cat on a skateboard"
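One widely used conditioning mechanism is classifier-free guidance (Ho & Salimans, 2022): the model produces both a conditional and an unconditional noise prediction, and sampling extrapolates from the latter toward the former. A minimal sketch (`cfg_noise` is an illustrative name; real pipelines apply this inside the denoising loop):

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one. scale = 1 recovers plain conditional sampling;
    larger scales follow the prompt more strongly at the cost of diversity."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0])   # toy "conditional" prediction
eps_u = np.array([0.0, 0.0])   # toy "unconditional" prediction
guided = cfg_noise(eps_c, eps_u, 7.5)   # 7.5 is a typical Stable Diffusion scale
```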

🔹 Imagen (Google)

  • Cascaded diffusion + large-scale text embeddings → high-fidelity text-to-image

🔹 DALL·E 2

  • Uses CLIP + diffusion decoder to generate realistic images from prompts

🧮 Score-Based Generative Models

Developed by Song & Ermon, these models are mathematically connected to diffusion models but emphasize score functions instead of likelihoods.

🔍 Key Idea:

Learn the score function \nabla_x \log p_t(x), the gradient of the log-probability density, at different noise levels.

  • Use Score Matching to train
  • Sampling uses Stochastic Differential Equations (SDEs) or ODE solvers

dx = f(x, t)\,dt + g(t)\,dw

This continuous-time view makes sampling more flexible: the same learned score supports stochastic SDE solvers and deterministic probability-flow ODE solvers, which can be faster and more accurate.
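Once the score is known, samples can be drawn with Langevin dynamics: repeatedly follow the score plus injected noise. A toy sketch where the score of a 1D Gaussian \mathcal{N}(3, 1) is known analytically, \nabla_x \log p(x) = -(x - 3), so no network is needed (`langevin_sample` and the step settings are illustrative):

```python
import numpy as np

def langevin_sample(score, x0, step, n_steps, rng):
    """Unadjusted Langevin dynamics:
    x_{k+1} = x_k + step * score(x_k) + sqrt(2 * step) * z,  z ~ N(0, I).
    Running this with a learned score at decreasing noise levels gives
    annealed Langevin sampling (Song & Ermon, 2019)."""
    x = x0
    for _ in range(n_steps):
        x = x + step * score(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(3)
x_init = rng.standard_normal(10_000)        # start far from the target distribution
samples = langevin_sample(lambda x: -(x - 3.0), x_init,
                          step=0.01, n_steps=2000, rng=rng)
# The empirical distribution drifts toward N(3, 1).
```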

🚀 Applications

| Domain | Use Case |
| --- | --- |
| Image Generation | Stable Diffusion, Midjourney, Imagen |
| Text-to-Image | DALL·E 2, ControlNet, SDXL |
| Audio Generation | WaveGrad, DiffWave, Bark |
| 3D Generation | DreamFusion, LION, Gaussian Splatting |
| Video | Pika, Sora, Gen-2 (multi-frame diffusion) |
| Medical Imaging | MRI reconstruction, inpainting |
| Inpainting / Editing | Photoshop's Generative Fill |

🧠 Advantages of Diffusion Models

  • High-quality samples
  • Stable training (compared to GANs)
  • Flexibility (unconditional, conditional, guided)
  • Excellent controllability (e.g., using ControlNet)

⚠️ Challenges

  • Slow sampling (hundreds of steps)
  • High compute cost
  • Need for large datasets and tuning
  • Sensitive to noise schedules

Solutions:

  • Fast samplers (DDIM, PLMS, DPM-Solver)
  • Latent diffusion for speed + memory savings
  • Distillation (e.g., Consistency Models)
  • Score distillation sampling (SDS) for text-to-3D
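The fast samplers above work by making each step deterministic so that timesteps can be skipped. The core DDIM update (η = 0) first estimates the clean image from the current noise prediction, then re-noises it to the earlier level. A sketch with a placeholder predictor (`ddim_step` and the 20-step schedule are illustrative):

```python
import numpy as np

def ddim_step(x, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM step (eta = 0):
      x0_hat  = (x_t - sqrt(1 - ab_t) * eps) / sqrt(ab_t)
      x_prev  = sqrt(ab_prev) * x0_hat + sqrt(1 - ab_prev) * eps
    Determinism is what lets the sampler skip from ~1000 steps to ~20-50."""
    x0_hat = (x - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_pred

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
timesteps = np.linspace(999, 0, 20, dtype=int)   # 20 steps instead of 1000
rng = np.random.default_rng(4)
x = rng.standard_normal((8, 8))                  # x_T
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    eps_pred = np.zeros_like(x)                  # placeholder for eps_theta(x, t)
    x = ddim_step(x, eps_pred, alpha_bar[t], alpha_bar[t_prev])
```

DPM-Solver and friends refine the same idea with higher-order ODE integration, shrinking step counts further.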

📚 Foundational Papers

  • 🔗 DDPM: Denoising Diffusion Probabilistic Models (Ho et al., 2020)
  • 🔗 Score SDEs: Score-Based Generative Modeling through Stochastic Differential Equations (Song et al., 2021)
  • 🔗 Latent Diffusion Models: High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022)
  • 🔗 Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Saharia et al., 2022)
  • 🔗 ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models (Zhang et al., 2023)

🧰 Ecosystem & Tools

  • Diffusers (HuggingFace) – Easy-to-use diffusion models + pipelines
  • Stable Diffusion Web UIs – AUTOMATIC1111, InvokeAI
  • ComfyUI – Node-based workflow for controlling diffusion generation
  • Open Source Models – SDXL, DeepFloyd IF, Kandinsky 3, RealisticVision

💡 TL;DR

  • Diffusion models generate data by learning to reverse noise.
  • Score-based models learn gradients of log density — often more mathematically elegant.
  • They power today’s state-of-the-art in generative media (images, video, audio).
  • New techniques (faster sampling, latent spaces, control modules) are making them real-time and highly practical.
