Diffusion Models

Generative models that learn to reverse a gradual noising process — synthesizing data by iteratively denoising from Gaussian noise, with state-of-the-art quality across images, video, audio, and structured outputs.



A diffusion model is a generative model that learns to invert a gradual corruption process. The forward process takes a clean datum $x_0$ and adds Gaussian noise over $T$ steps according to a fixed schedule, eventually producing (approximately) pure noise $x_T \sim \mathcal{N}(0, I)$. A neural network is trained to predict the noise added at each step (or, equivalently up to a scaling, the score $\nabla_x \log p_t(x)$ of the noised distribution), and sampling proceeds by starting from Gaussian noise and iteratively denoising. The training loss is a simple regression objective, training is stable and scales well, and the resulting samples now define the state of the art for images, video, audio, 3D shapes, molecular structures, and many other continuous modalities. The framework was first formulated by Sohl-Dickstein et al. (2015) and brought to dominance by the DDPM (Ho et al., 2020) and score-based (Song & Ermon, 2019; Song et al., 2021) lines of work, which converged on essentially the same algorithm from independent starting points.
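As a minimal sketch of the forward process and the noise-prediction objective, the NumPy snippet below uses the DDPM closed form $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$. The linear beta schedule, the step count of 1000, and the `toy_eps_model` stand-in are illustrative assumptions, not details taken from this page.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed, commonly used choice)."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)  # cumulative fraction of signal retained
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; also return the noise used."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bars[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

def noise_prediction_loss(eps_model, x0, t, alpha_bars, rng):
    """Mean-squared error between the true noise and the predicted noise."""
    x_t, eps = q_sample(x0, t, alpha_bars, rng)
    return float(np.mean((eps_model(x_t, t) - eps) ** 2))

def toy_eps_model(x_t, t):
    # Hypothetical placeholder for a trained network: always predicts zero noise.
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.standard_normal((4, 8))
loss = noise_prediction_loss(toy_eps_model, x0, t=500, alpha_bars=alpha_bars, rng=rng)
```

In a real training loop the timestep would be sampled uniformly per example and `toy_eps_model` would be replaced by a network whose parameters are updated on this loss.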

Latent diffusion and accelerated editing

Operating directly in pixel space is expensive: each denoising step is a forward pass through a U-Net at full resolution, and high-quality sampling typically takes tens to hundreds of steps. Latent diffusion addresses this by first compressing images into a low-dimensional latent space with a pre-trained autoencoder, then running the diffusion process there — a design that has become the standard for text-to-image systems like Stable Diffusion. Avrahami et al. extend latent diffusion to localized text-driven editing: their Blended Latent Diffusion technique blends the noisy latent of a region under edit with the unperturbed latent of the surrounding image at every denoising step, eliminating the per-step CLIP-gradient calculations that plagued earlier pixel-space blended-diffusion methods. They show that a naive transposition of the blend operation to latent space introduces reconstruction errors, and they propose an optimization-based fix that recovers fidelity outside the edit mask while keeping inference markedly faster than pixel-space alternatives.
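The per-step blend can be sketched as follows. This is the generic masked-blending pattern under stated assumptions, not the authors' implementation: the function name `blend_step`, the array shapes, and the use of a cumulative signal level `alpha_bar_t` to noise the source latent are all illustrative, and the denoiser and autoencoder are omitted.

```python
import numpy as np

def blend_step(z_edit, z_source_clean, mask, alpha_bar_t, rng):
    """One blending step in latent space (hypothetical helper).

    z_edit:         latent produced by the text-guided denoiser at this step.
    z_source_clean: autoencoder latent of the original image.
    mask:           1 inside the edit region, 0 outside, at latent resolution.
    alpha_bar_t:    cumulative signal level of the current step.
    """
    # Noise the source latent to the current level so both operands match.
    noise = rng.standard_normal(z_source_clean.shape)
    z_source_t = (np.sqrt(alpha_bar_t) * z_source_clean
                  + np.sqrt(1.0 - alpha_bar_t) * noise)
    # Keep the edited content inside the mask and the source content outside.
    return mask * z_edit + (1.0 - mask) * z_source_t

rng = np.random.default_rng(0)
z_src = rng.standard_normal((4, 8, 8))
z_new = rng.standard_normal((4, 8, 8))
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0  # edit a central square; broadcasts over channels
z_blend = blend_step(z_new, z_src, mask, alpha_bar_t=1.0, rng=rng)
```

With `alpha_bar_t=1.0` (no noise) the result equals the source latent outside the mask and the edited latent inside it, which is the behaviour the final denoising step should approach.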

Aligning samples with human preference

Once a diffusion model can produce many plausible outputs for a given prompt, the question becomes which one a human would actually prefer. Standard automatic metrics (FID, CLIP score) correlate poorly with human judgement on perceptual artefacts like awkward limbs and facial geometry, and so the field has begun importing the reward-modelling machinery from reinforcement learning from human feedback. Wu et al. introduce the Human Preference Score (HPS): they collect a large dataset of human choices over images generated in the Stable Foundation Discord channel, train a CLIP-based classifier on those preferences, and then use the resulting score as a guidance signal to adapt Stable Diffusion. The adapted model produces images that are preferred over the base model in held-out human studies, and the HPS itself transfers to images generated by other text-to-image systems, which establishes diffusion-model alignment as a separable subproblem distinct from the underlying generative architecture.
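One way to picture a CLIP-style preference classifier is as a softmax over prompt-image similarity scores, trained so that the human-chosen image receives the highest probability. The sketch below assumes precomputed embeddings; the function names, the similarity scale of 100, and the exact loss form are illustrative assumptions rather than a description of the HPS training code.

```python
import numpy as np

def preference_scores(text_emb, image_embs, scale=100.0):
    """Scaled cosine similarity between one prompt and several candidate images."""
    t = text_emb / np.linalg.norm(text_emb)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return scale * (v @ t)

def preference_loss(scores, chosen):
    """Cross-entropy of the human-chosen image under a softmax over candidates."""
    z = scores - scores.max()                 # stabilise the exponentials
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[chosen])

text_emb = np.array([1.0, 0.0])               # toy prompt embedding
image_embs = np.array([[1.0, 0.0],            # candidate aligned with the prompt
                       [0.0, 1.0]])           # candidate orthogonal to the prompt
scores = preference_scores(text_emb, image_embs)
loss_if_first_chosen = preference_loss(scores, chosen=0)
loss_if_second_chosen = preference_loss(scores, chosen=1)
```

The loss is small when the annotator's choice already has the highest score and large otherwise, so minimising it pushes the encoders toward human rankings.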

Diffusion models as classifiers

A trained diffusion model is, in addition to a sampler, an estimator of conditional likelihoods $p(x \mid c)$. Li et al. exploit this with Diffusion Classifier, a method that performs zero-shot classification by evaluating the diffusion-model loss for each candidate label and picking the one with the highest implied likelihood, that is, the lowest noise-prediction error. Without any additional training, the technique produces strong zero-shot classifiers on standard benchmarks, and it inherits two properties that purely discriminative classifiers lack: stronger compositional reasoning (because the underlying diffusion model was trained to align fine-grained linguistic structure with images) and improved robustness to distribution shift. The result reframes generative models as a viable substrate for downstream discrimination, a direction that runs counter to the long-standing wisdom that one should always train discriminative models for discriminative tasks.
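The classification rule can be sketched in a few lines: noise the input at random timesteps, ask the conditional model to predict the noise under each candidate label, and return the label with the lowest average error. Everything named here (`diffusion_classify`, the linear schedule, the two-class `toy_eps_model` that treats each class as a point mass) is a hypothetical stand-in for a real text-conditioned network.

```python
import numpy as np

def diffusion_classify(eps_model, x0, labels, alpha_bars, n_trials=64, seed=0):
    """Return the label with the lowest average noise-prediction error."""
    rng = np.random.default_rng(seed)
    # Share the same (t, eps) draws across labels to reduce variance.
    ts = rng.integers(0, len(alpha_bars), size=n_trials)
    noises = rng.standard_normal((n_trials,) + x0.shape)
    errors = {}
    for c in labels:
        total = 0.0
        for t, eps in zip(ts, noises):
            a = alpha_bars[t]
            x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
            total += np.mean((eps_model(x_t, t, c) - eps) ** 2)
        errors[c] = total / n_trials
    return min(errors, key=errors.get), errors

# Assumed schedule and toy class-conditional model, for illustration only.
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
means = {"cat": np.array([2.0, 0.0]), "dog": np.array([-2.0, 0.0])}

def toy_eps_model(x_t, t, c):
    # Exact noise predictor if class c were a point mass at means[c].
    a = alpha_bars[t]
    return (x_t - np.sqrt(a) * means[c]) / np.sqrt(1.0 - a)

label, errors = diffusion_classify(
    toy_eps_model, np.array([1.9, 0.1]), list(means), alpha_bars
)
```

Reusing the same timesteps and noise draws for every label is a variance-reduction choice: the comparison then depends only on how well each label explains the input.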

Video diffusion in projected latent space

Naively scaling diffusion to video, where a clean datum is now a 3D tensor of frames $\times$ height $\times$ width, blows up both compute and memory. Yu et al. propose Projected Latent Video Diffusion Models (PVDM), which factor the video tensor into 2D-shaped latents using a specialized autoencoder and then run a diffusion model over those 2D projections. The factorization decomposes the cubic structure of pixel volumes into three views that share computation, and the diffusion architecture is adapted so that a single trained model can synthesize videos of arbitrary length. PVDM cuts the FVD of long-video generation on UCF-101 from 1773 to 640, a nearly threefold reduction that demonstrates how careful latent-space design, not architectural exotica, often removes the scalability barrier for diffusion at higher data dimensions.
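To see why three 2D views are cheaper than one 3D volume, the sketch below collapses a toy video along each axis and compares element counts. Plain averaging is a hypothetical stand-in for the learned autoencoder, and the tensor sizes are arbitrary.

```python
import numpy as np

def project_video(video):
    """Collapse a (T, H, W) volume into three 2D views by averaging one axis each.

    Averaging is an assumed placeholder for a learned encoder.
    """
    z_hw = video.mean(axis=0)  # shape (H, W): time collapsed
    z_tw = video.mean(axis=1)  # shape (T, W): height collapsed
    z_th = video.mean(axis=2)  # shape (T, H): width collapsed
    return z_hw, z_tw, z_th

video = np.random.default_rng(0).standard_normal((16, 32, 32))
views = project_video(video)
cubic = video.size                   # T * H * W
planar = sum(v.size for v in views)  # H*W + T*W + T*H
```

For this 16 by 32 by 32 example the volume holds 16384 values while the three views together hold 2048, and the gap widens as any dimension grows because the planar cost is a sum of pairwise products rather than a triple product.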

Diffusion on manifolds and SE(3) for robotics

The standard diffusion process is defined in $\mathbb{R}^d$, where Gaussian noise and Euclidean updates are the natural choices. Many problems live on non-Euclidean manifolds: rotation groups, Lie groups, and product spaces of poses and configurations. Urain et al. derive SE(3)-DiffusionFields, a diffusion model defined directly on the $\mathrm{SE}(3)$ group of rigid transformations, and use it to learn smooth multimodal cost functions for combined grasp planning and motion optimization. Because diffusion models naturally represent multimodal distributions and produce gradients of $\log p$ over their entire support, the learned cost field can be plugged into a downstream optimizer that jointly resolves the grasp pose and the trajectory under task constraints. The work is representative of a broader programme (generalizing diffusion to Riemannian manifolds, equivariant data, and structured spaces) that has expanded the modelling reach of the framework far beyond image pixels.
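The practical difference from the Euclidean case is that noise cannot simply be added to a pose matrix, since the result would no longer be a rigid transformation. A common workaround, sketched below, samples Gaussian noise in the tangent space and maps it back to the group. This is a simplified illustration, not the method of Urain et al.: rotation noise goes through the SO(3) exponential map (Rodrigues' formula) while translation noise is added directly, and the helper names are hypothetical.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix in so(3)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(w):
    """Rodrigues' formula: exponential map from so(3) to SO(3)."""
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-8:
        return np.eye(3) + K  # first-order approximation near the identity
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * (K @ K))

def perturb_pose(H, sigma, rng):
    """Apply tangent-space Gaussian noise to a 4x4 homogeneous pose."""
    xi = sigma * rng.standard_normal(6)  # [rotation part (3), translation part (3)]
    dH = np.eye(4)
    dH[:3, :3] = so3_exp(xi[:3])
    dH[:3, 3] = xi[3:]
    return H @ dH  # composing on the right keeps the result in SE(3)

rng = np.random.default_rng(0)
H_noisy = perturb_pose(np.eye(4), sigma=0.3, rng=rng)
```

The perturbed matrix still has an orthonormal rotation block with determinant 1 and a bottom row of `[0, 0, 0, 1]`, which is what a naive additive perturbation would break.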

Self-consuming loops and model collapse

A finding that has unsettled the field concerns what happens when generative models train on data that earlier generative models produced. Alemohammad et al. analyse three families of autophagous (“self-consuming”) loops that differ in how much fresh real data they retain across generations and how strongly they bias toward high-quality samples. Their main result, holding across image-domain experiments and toy theoretical settings, is that without enough fresh real data in each generation, future generative models are doomed: either their quality decays or their diversity collapses, and in some cases both. The phenomenon, dubbed Model Autophagy Disorder (MAD), has direct implications for the modern web: as generative outputs flood public data, future training runs that scrape that data face a structural risk that simple loss curves do not surface. It is one of the first sharp empirical statements about a problem that pure scaling cannot solve.
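A toy version of such a loop makes the diversity collapse concrete. The sketch below repeatedly fits a 1D Gaussian to data drawn partly from the previous fit and partly from the real distribution. It is an illustrative assumption-laden simulation, not a reproduction of the paper's experiments; `fresh_fraction` and `sampling_bias` are hypothetical knobs standing in for the amount of fresh real data and the preference for high-quality samples.

```python
import numpy as np

def self_consuming_loop(generations, n_samples, fresh_fraction, sampling_bias, seed=0):
    """Repeatedly fit a 1D Gaussian to data drawn partly from its predecessor.

    fresh_fraction: share of each training set drawn from the real data N(0, 1).
    sampling_bias:  factor <= 1 that shrinks the synthetic sampling std,
                    mimicking a preference for typical, high-quality samples.
    Returns the fitted standard deviation at every generation.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    history = [sigma]
    n_fresh = int(round(fresh_fraction * n_samples))
    for _ in range(generations):
        synthetic = rng.normal(mu, sampling_bias * sigma, n_samples - n_fresh)
        fresh = rng.normal(0.0, 1.0, n_fresh)
        data = np.concatenate([synthetic, fresh])
        mu, sigma = data.mean(), data.std()
        history.append(float(sigma))
    return history

closed = self_consuming_loop(20, 2000, fresh_fraction=0.0, sampling_bias=0.8)
mixed = self_consuming_loop(20, 2000, fresh_fraction=0.5, sampling_bias=0.8)
```

In the fully synthetic loop the fitted spread shrinks roughly geometrically toward zero, while mixing in fresh real data each generation stabilises it at a value below the true spread but well away from collapse.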

Open methodological problems

Common threads connect these directions. Sampling efficiency (fewer-step solvers, consistency-style distillations, learned ODE shortcuts) remains the single largest practical lever. Conditioning beyond text prompts, including layout, depth, sketches, and demonstrations, demands principled mechanisms for plugging classifier-free guidance into structured inputs. Alignment carries over reinforcement-learning-from-human-feedback machinery in non-trivial ways once the output is a distribution rather than a single trajectory. Generative-as-discriminative uses, like Diffusion Classifier, suggest that the line between learning $p(x \mid c)$ and learning $p(c \mid x)$ is increasingly fluid. And data hygiene, meaning keeping training distributions free of contamination from prior generative outputs, has become a first-class engineering concern alongside the algorithmic ones.
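For reference, classifier-free guidance itself is a one-line combination of two noise predictions, which is part of why extending it to structured conditions is mostly a question of how the conditional branch is built. The sketch below shows the standard combination rule; the function name and the toy arrays are illustrative.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional toward the conditional noise prediction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 1.0])   # toy unconditional prediction
eps_c = np.array([1.0, 1.0])   # toy conditional prediction
guided = cfg_combine(eps_u, eps_c, guidance_scale=3.0)
```

A scale of 0 recovers the unconditional prediction, a scale of 1 recovers the conditional one, and larger values push samples further toward the condition at some cost in diversity.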

Review this topic

This page was drafted by an agent and is waiting on expert review. Spotted a wrong prerequisite, a missing concept, a misattributed source, or a factual slip? Tell us — your review opens a tracked issue maintainers act on.