A complete and accessible breakdown of diffusion models and score-based generative models: two of the most powerful and rapidly evolving approaches in generative AI, behind tools like DALL·E 2, Stable Diffusion, Imagen, and Midjourney.
🌫️ Diffusion Models & Score-Based Generative Models
Generating images, audio, and more by reversing noise.
🧠 What Are Diffusion Models?
Diffusion models are a class of generative models that learn to generate data (e.g., images) by reversing a diffusion process — a step-by-step transformation of random noise into structured data.
Inspired by non-equilibrium thermodynamics, these models simulate how structure can emerge from randomness.
⛓️ Core Intuition
Forward Process: Add Noise
Gradually add Gaussian noise to real data over many steps until it becomes pure noise.
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\big)$$
- At each timestep $t$, the data becomes less recognizable.
- After $T$ steps, $x_T \sim \mathcal{N}(0, \mathbf{I})$: pure noise.
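A convenient property of the forward process is that the marginal $q(x_t \mid x_0)$ has a closed form, so you can jump straight to any timestep without iterating. Here is a minimal NumPy sketch; the linear beta schedule and toy 8×8 "image" are illustrative assumptions, not tied to any particular library:

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    alpha_bar = np.cumprod(1.0 - betas)      # alpha_bar_t = product of (1 - beta_s)
    eps = rng.standard_normal(x0.shape)      # the injected Gaussian noise
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # linear schedule, as in DDPM
x0 = rng.standard_normal((8, 8))             # a toy 8x8 "image"
xt, eps = forward_noise(x0, 999, betas, rng)
# At t = T - 1, alpha_bar is nearly zero, so x_t is essentially pure noise.
```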
Reverse Process: Remove Noise
Learn to denoise, mapping noise back to data, using a neural network $\epsilon_\theta(x_t, t)$:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$
- The denoising network is trained to reverse the noise and recover clean data.
📉 Training Objective
Most diffusion models use variational inference or denoising score matching to train.
Simplified loss (DDPM):
$$\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t} \left[ \|\epsilon - \epsilon_\theta(x_t, t)\|^2 \right]$$
The network learns to predict the noise $\epsilon$ that was added at timestep $t$, given the noisy image $x_t$.
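The objective above fits in a few lines of NumPy. The `eps_model` stand-in here (a function that predicts zero noise) is a hypothetical placeholder for a real trained denoising network:

```python
import numpy as np

def ddpm_loss(x0, eps_model, betas, rng):
    """Simplified DDPM objective: predict the injected noise at a random t."""
    T = len(betas)
    t = rng.integers(0, T)                       # sample a random timestep
    alpha_bar = np.cumprod(1.0 - betas)
    eps = rng.standard_normal(x0.shape)          # noise we will ask the model to recover
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_hat = eps_model(xt, t)                   # network's noise prediction
    return np.mean((eps - eps_hat) ** 2)         # || eps - eps_theta(x_t, t) ||^2

rng = np.random.default_rng(1)
betas = np.linspace(1e-4, 0.02, 1000)
x0 = rng.standard_normal((8, 8))
# A trivial "network" that predicts zero noise, just to exercise the loss:
loss = ddpm_loss(x0, lambda xt, t: np.zeros_like(xt), betas, rng)
```

In practice this loss is averaged over minibatches and many random timesteps per batch.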
🔁 Sampling Process
- Start from Gaussian noise $x_T \sim \mathcal{N}(0, \mathbf{I})$
- Run the reverse denoising process step-by-step
- Output is a generated image (or audio, 3D shape, etc.)
This is slow — originally hundreds to thousands of steps.
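The reverse loop above (ancestral sampling, as in DDPM) can be sketched as follows. The zero-noise predictor and the short 50-step chain are toy assumptions standing in for a trained $\epsilon_\theta$ and the usual 1000-step schedule:

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling: start from pure noise and denoise step by step."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                      # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = eps_model(x, t)                       # predicted noise at step t
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean                                    # no noise added at the final step
    return x

rng = np.random.default_rng(2)
betas = np.linspace(1e-4, 0.02, 50)                     # short chain for the demo
sample = ddpm_sample(lambda x, t: np.zeros_like(x), (8, 8), betas, rng)
```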
🎨 Popular Diffusion Model Variants
🔹 DDPM (Denoising Diffusion Probabilistic Model)
- Original method by Ho et al. (2020)
- Simple and effective, but slow sampling
🔹 Improved DDPMs
- Better sampling schedules, perceptual loss
- Introduces classifier guidance
🔹 Latent Diffusion Models (LDMs)
- Used in Stable Diffusion
- Diffusion happens in latent space, not pixel space → much faster
🔹 Guided Diffusion
- Uses class labels or text as guidance
- Conditional sampling: generate "a cat on a skateboard"
🔹 Imagen (Google)
- Cascaded diffusion + large-scale text embeddings → high-fidelity text-to-image
🔹 DALL·E 2
- Uses CLIP + diffusion decoder to generate realistic images from prompts
🧮 Score-Based Generative Models
Developed by Song & Ermon, these models are mathematically connected to diffusion models but emphasize score functions instead of likelihoods.
🔍 Key Idea:
Learn the score function $\nabla_x \log p_t(x)$, the gradient of the log-probability density, at each noise level $t$.
- Use Score Matching to train
- Sampling uses Stochastic Differential Equations (SDEs) or ODE solvers
$$dx = f(x, t)\, dt + g(t)\, dw$$
Here $f$ is the drift term, $g$ the diffusion coefficient, and $w$ a standard Wiener process.
This formulation makes sampling more flexible: the reverse-time SDE can be solved with off-the-shelf SDE or ODE solvers, trading speed against accuracy.
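The simplest score-driven sampler is Langevin dynamics: follow the score plus a little noise. A minimal illustration, using a 1-D Gaussian target whose score is known analytically (a real model would learn the score with a neural network):

```python
import numpy as np

def langevin_sample(score, x0, step, n_steps, rng):
    """Langevin dynamics: x <- x + (step/2) * score(x) + sqrt(step) * noise."""
    x = x0
    for _ in range(n_steps):
        x = x + 0.5 * step * score(x) + np.sqrt(step) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(3)
# For a N(3, 1) target, the score is analytic: grad_x log p(x) = -(x - 3).
samples = langevin_sample(lambda x: -(x - 3.0),
                          rng.standard_normal(10000),  # start from N(0, 1)
                          step=0.1, n_steps=500, rng=rng)
# The chain's empirical mean drifts toward the target mean of 3.
```

Score-based models run annealed versions of this loop, starting at high noise levels and gradually lowering the noise.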
🚀 Applications
| Domain | Use Case |
|---|---|
| Image Generation | Stable Diffusion, Midjourney, Imagen |
| Text-to-Image | DALL·E 2, ControlNet, SDXL |
| Audio Generation | WaveGrad, DiffWave, Bark |
| 3D Generation | DreamFusion, LION, Gaussian Splatting |
| Video | Pika, Sora, Gen-2 (multi-frame diffusion) |
| Medical Imaging | MRI reconstruction, inpainting |
| Inpainting / Editing | Photoshop’s Generative Fill |
🧠 Advantages of Diffusion Models
✅ High-quality samples
✅ Stable training (compared to GANs)
✅ Flexibility (unconditional, conditional, guided)
✅ Excellent controllability (e.g., using ControlNet)
⚠️ Challenges
- Slow sampling (hundreds of steps)
- High compute cost
- Need for large datasets and tuning
- Sensitive to noise schedules
Solutions:
- Fast samplers (DDIM, PLMS, DPM-Solver)
- Latent diffusion for speed + memory savings
- Distillation (e.g., Consistency Models)
- Score distillation sampling (SDS) for text-to-3D
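As an example of a fast sampler, a deterministic DDIM-style update can reuse a DDPM noise schedule while visiting only a strided subset of timesteps. This sketch again uses a hypothetical zero-noise predictor in place of a trained network:

```python
import numpy as np

def ddim_sample(eps_model, shape, betas, n_steps, rng):
    """Deterministic DDIM sampling (eta = 0) on a strided subset of timesteps."""
    alpha_bar = np.cumprod(1.0 - betas)
    ts = np.linspace(len(betas) - 1, 0, n_steps).astype(int)  # strided schedule
    x = rng.standard_normal(shape)                            # x_T ~ N(0, I)
    for i in range(len(ts) - 1):
        t, t_prev = ts[i], ts[i + 1]
        eps_hat = eps_model(x, t)
        # Estimate x_0, then jump directly to the previous (strided) timestep:
        x0_hat = (x - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
        x = np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_prev]) * eps_hat
    return x

rng = np.random.default_rng(4)
betas = np.linspace(1e-4, 0.02, 1000)
# 20 steps instead of 1000: the main appeal of DDIM-style samplers.
out = ddim_sample(lambda x, t: np.zeros_like(x), (8, 8), betas, 20, rng)
```

Modern solvers such as DPM-Solver push the step count even lower with higher-order updates.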
📚 Foundational Papers
- 🔗 DDPM – Denoising Diffusion Probabilistic Models (Ho et al., 2020)
- 🔗 Score SDEs – Score-Based Generative Modeling through SDEs (Song et al., 2021)
- 🔗 Latent Diffusion Models – High-Resolution Image Synthesis with LDM (Rombach et al., 2022)
- 🔗 Imagen – Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Saharia et al., 2022)
- 🔗 ControlNet – Adding Conditional Control to Text-to-Image Diffusion Models (Zhang et al., 2023)
🧰 Ecosystem & Tools
- Diffusers (HuggingFace) – Easy-to-use diffusion models + pipelines
- Stable Diffusion Web UIs – AUTOMATIC1111, InvokeAI
- ComfyUI – Node-based workflow for controlling diffusion generation
- Open Source Models – SDXL, DeepFloyd IF, Kandinsky 3, RealisticVision
💡 TL;DR
- Diffusion models generate data by learning to reverse noise.
- Score-based models learn gradients of log density — often more mathematically elegant.
- They power today’s state-of-the-art in generative media (images, video, audio).
- New techniques (faster sampling, latent spaces, control modules) are making them real-time and highly practical.