All random variables live in $\mathbb{R}^d$. We write $X_0, X_1, \dots, X_T$ for the random variables in the forward chain, and use lowercase $x, y$, etc. as dummy arguments of densities. We write $\mathcal{N}(m, \Sigma)$ for the Gaussian distribution with mean $m$ and covariance $\Sigma$, and $\mathcal{N}(x; m, \Sigma)$ for its density evaluated at $x$. Densities carry explicit subscripts to avoid ambiguity:
$q_t(x)$: marginal density of $X_t$
$q_{t\mid s}(x\mid y)$: conditional density of $X_t$ given $X_s = y$
$q_{t\mid s,r}(x\mid y,z)$: conditional density of $X_t$ given $X_s = y$ and $X_r = z$
The learned reverse process defines a separate chain $Y_T, Y_{T-1}, \dots, Y_0$ with densities $p_\theta$, using analogous subscript conventions.
I.2. Forward process
Let $q_0 = p_{\text{data}}$ be the data distribution. Fix a schedule $0 < \beta_1, \dots, \beta_T < 1$. Define the forward chain by $X_0 \sim q_0$ and

$$q_{t\mid t-1}(x\mid y) = \mathcal{N}\big(x;\ \sqrt{1-\beta_t}\,y,\ \beta_t I\big), \qquad t = 1, \dots, T.$$

Set $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$. By induction (composing affine Gaussian maps), the marginal conditional on $X_0$ has the closed form

$$q_{t\mid 0}(x\mid x_0) = \mathcal{N}\big(x;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\big).$$

Equivalently, $X_t = \sqrt{\bar\alpha_t}\,X_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon$ where $\varepsilon \sim \mathcal{N}(0,I)$ is independent of $X_0$. The schedule is chosen so that $\bar\alpha_T \approx 0$, ensuring $q_T \approx \mathcal{N}(0,I)$.
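The one-shot sampling identity can be checked numerically. A minimal numpy sketch, assuming a hypothetical linear schedule with $T = 1000$ (the constants here are illustrative, not tied to any particular trained model):

```python
import numpy as np

# Illustrative linear beta schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # beta_1 ... beta_T
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)              # alpha_bar_t = prod_{s<=t} alpha_s

def sample_xt(x0, t, rng):
    """Draw X_t | X_0 = x0 in one shot: sqrt(ab_t) x0 + sqrt(1 - ab_t) eps."""
    ab = alpha_bars[t - 1]                   # arrays are 0-indexed; t runs 1..T
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)                  # a toy "image" in R^4
xT = sample_xt(x0, T, rng)                   # at t = T this is near-pure noise
print(alpha_bars[-1])                        # alpha_bar_T ≈ 0, so q_T ≈ N(0, I)
```

No chain simulation is needed: any noise level is one Gaussian draw away from the data.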
The (unconditional) marginal density of $X_t$ is

$$q_t(x) = \int_{\mathbb{R}^d} q_{t\mid 0}(x\mid x_0)\, q_0(x_0)\, dx_0.$$
I.3. The true reverse transitions
By Bayes' rule on the forward chain, for $t \geq 2$:

$$q_{t-1\mid t}(x\mid y) = \frac{q_{t\mid t-1}(y\mid x)\, q_{t-1}(x)}{q_t(y)}.$$

This involves the intractable marginals $q_{t-1}$ and $q_t$, so we cannot compute $q_{t-1\mid t}$ directly. However, if we additionally condition on $X_0$, everything becomes Gaussian. Applying Bayes' rule within the forward chain conditioned on $X_0 = x_0$:

$$q_{t-1\mid t,0}(x\mid y, x_0) = \frac{q_{t\mid t-1}(y\mid x)\, q_{t-1\mid 0}(x\mid x_0)}{q_{t\mid 0}(y\mid x_0)} = \mathcal{N}\big(x;\ \tilde\mu_t(y, x_0),\ \tilde\beta_t I\big),$$

with

$$\tilde\mu_t(y, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,y, \qquad \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t.$$

The intractable reverse transition is then a mixture of these Gaussians under the posterior over the clean image:

$$q_{t-1\mid t}(x\mid y) = \int q_{t-1\mid t,0}(x\mid y, x_0)\, q_{0\mid t}(x_0\mid y)\, dx_0,$$

where $q_{0\mid t}(x_0\mid y) \propto q_{t\mid 0}(y\mid x_0)\, q_0(x_0)$ is the posterior over clean images given a noisy observation — a mixture over the entire data manifold and completely intractable. The tractable Gaussian $q_{t-1\mid t,0}$, however, will be central to both training (Part II) and sampling (Part III): it is the target each learned reverse step tries to approximate, and the bridge posterior that determines the sampling geometry.
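These weights can be sanity-checked against generic bivariate-Gaussian conditioning: given $X_0$, the pair $(X_{t-1}, X_t)$ is jointly Gaussian with covariance $\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})$. A small numpy sketch under an illustrative linear schedule:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_mean_var(xt, x0, t):
    """Mean and variance of q_{t-1|t,0}(. | xt, x0), valid for t >= 2."""
    ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]   # arrays are 0-indexed
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]
    coef_x0 = np.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    coef_xt = np.sqrt(alpha_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    beta_tilde = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
    return coef_x0 * x0 + coef_xt * xt, beta_tilde

# Cross-check at one (t, x0, xt): condition the joint Gaussian of (X_{t-1}, X_t)
# given X0 the generic way and compare with the closed-form weights above.
t, x0, xt = 500, 0.7, -0.3
ab_t, ab_prev, alpha_t = alpha_bars[t - 1], alpha_bars[t - 2], alphas[t - 1]
cov = np.sqrt(alpha_t) * (1.0 - ab_prev)              # Cov(X_{t-1}, X_t | X0)
mean_generic = np.sqrt(ab_prev) * x0 + cov / (1.0 - ab_t) * (xt - np.sqrt(ab_t) * x0)
var_generic = (1.0 - ab_prev) - cov**2 / (1.0 - ab_t)
mean_formula, var_formula = posterior_mean_var(xt, x0, t)
print(abs(mean_generic - mean_formula), abs(var_generic - var_formula))  # both ~0
```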
Part II: Training
II.1. The learned reverse process and the ELBO
We define a reverse chain $Y_T, \dots, Y_0$ with $Y_T \sim \mathcal{N}(0,I)$ and learnable transitions

$$p_{\theta,t}(x\mid y) = \mathcal{N}\big(x;\ \mu_\theta(y,t),\ \sigma_t^2 I\big), \qquad t = T, T-1, \dots, 1$$

where $\sigma_t^2$ is fixed (eg to $\beta_t$ or $\tilde\beta_t$). The marginal density of $Y_0$ under this chain is

$$p_\theta(x) = \int_{\mathbb{R}^{dT}} p(y_T) \prod_{t=1}^{T} p_{\theta,t}(y_{t-1}\mid y_t)\, dy_{1:T}$$

where $p$ denotes the density of $\mathcal{N}(0,I)$ and we have written $y_0 = x$. We want to maximise $\log p_\theta(x_0)$ for data points $x_0 \sim q_0$.
Derivation of the ELBO
The evidence lower bound (ELBO) is derived as follows.
Step 1. Write the marginal as an expectation under the forward posterior and lower-bound it via Jensen's inequality:

$$\log p_\theta(x_0) = \log \mathbb{E}_{q}\!\left[\frac{p(X_T)\prod_{t=1}^{T} p_{\theta,t}(X_{t-1}\mid X_t)}{\prod_{t=1}^{T} q_{t\mid t-1}(X_t\mid X_{t-1})}\right] \geq \mathbb{E}_{q}\!\left[\log \frac{p(X_T)\prod_{t=1}^{T} p_{\theta,t}(X_{t-1}\mid X_t)}{\prod_{t=1}^{T} q_{t\mid t-1}(X_t\mid X_{t-1})}\right],$$

with the expectation over the forward chain started at $X_0 = x_0$.

Step 2. Rewrite each forward factor via Bayes' rule conditioned on $X_0$ and regroup. Averaging over the data, the bound splits into

$$\mathbb{E}_{q_0}\big[-\log p_\theta(X_0)\big] \leq L_T + \sum_{t=2}^{T} L_{t-1} + L_0,$$

with

$$L_T = \mathbb{E}\big[\mathrm{KL}\big(q_{T\mid 0}(\cdot\mid X_0)\,\big\|\,p\big)\big], \quad L_{t-1} = \mathbb{E}\big[\mathrm{KL}\big(q_{t-1\mid t,0}(\cdot\mid X_t, X_0)\,\big\|\,p_{\theta,t}(\cdot\mid X_t)\big)\big], \quad L_0 = \mathbb{E}\big[-\log p_{\theta,1}(X_0\mid X_1)\big],$$

where the outer expectation is over $X_0 \sim q_0$ and $X_t = \sqrt{\bar\alpha_t} X_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon$ with $\varepsilon \sim \mathcal{N}(0,I)$. Note: $L_T$ has no learnable parameters; $L_0$ is a reconstruction term; the interesting terms are $L_1, \dots, L_{T-1}$.
II.2. Reducing $L_{t-1}$ to noise prediction
Each $L_{t-1}$ (for $t \geq 2$) is a KL divergence (Kullback–Leibler divergence) between two Gaussians with the same covariance when $\sigma_t^2 = \tilde\beta_t$, reducing to

$$L_{t-1} = \frac{1}{2\tilde\beta_t}\,\mathbb{E}\big[\big\|\tilde\mu_t(X_t, X_0) - \mu_\theta(X_t, t)\big\|^2\big] + \text{const}.$$
Substitute $X_0 = \frac{1}{\sqrt{\bar\alpha_t}}\big(X_t - \sqrt{1-\bar\alpha_t}\,\varepsilon\big)$ into the expression for $\tilde\mu_t$:

$$\tilde\mu_t(X_t, \varepsilon) = \frac{1}{\sqrt{\alpha_t}}\Big(X_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\varepsilon\Big).$$

Parameterise the model mean as

$$\mu_\theta(y, t) = \frac{1}{\sqrt{\alpha_t}}\Big(y - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\varepsilon_\theta(y, t)\Big)$$
so that $L_{t-1} \propto \mathbb{E}\big[\|\varepsilon - \varepsilon_\theta(X_t, t)\|^2\big]$. The DDPM (denoising diffusion probabilistic model) simplified objective, which drops the $t$-dependent prefactor and sums uniformly over $t$, is

$$L_{\text{simple}}(\theta) = \mathbb{E}_{t,\, X_0,\, \varepsilon}\big[\|\varepsilon - \varepsilon_\theta(X_t, t)\|^2\big], \qquad t \sim \mathrm{Unif}\{1, \dots, T\}.$$

II.3. Conditioning on captions
Given image-caption pairs $(X_0, C) \sim p_{\text{data}}$, the forward process acts only on the image: $q_{t\mid 0}(x\mid x_0)$ is unchanged, and the caption $C$ is carried along as an unperturbed label. The noise predictor becomes $\varepsilon_\theta(y, t, c)$ and the training loss is

$$L_{\text{simple}}(\theta) = \mathbb{E}_{t,\,(X_0, C),\,\varepsilon}\big[\|\varepsilon - \varepsilon_\theta(X_t, t, C)\|^2\big].$$

This is the only change: condition the network on $c$ and train with the same squared-error objective.
To enable classifier-free guidance (CFG) at inference (Section III.2), one additionally trains the network to operate without a caption. During training, the caption $C$ is replaced with a null token $\varnothing$ independently with probability $p_{\text{uncond}} \approx 0.1$. Writing $\tilde C$ for the resulting input ($C$ with probability $1-p_{\text{uncond}}$, $\varnothing$ otherwise), the loss remains $\mathbb{E}\big[\|\varepsilon - \varepsilon_\theta(X_t, t, \tilde C)\|^2\big]$. After convergence, the single network approximates two score functions depending on its third argument:

$$\varepsilon_\theta(y, t, c) \approx -\sqrt{1-\bar\alpha_t}\,\nabla_y \log q_t(y\mid c), \qquad \varepsilon_\theta(y, t, \varnothing) \approx -\sqrt{1-\bar\alpha_t}\,\nabla_y \log q_t(y).$$

The first is the score of the conditional marginal $q_t(\cdot\mid c)$; the second is the score of the unconditional marginal $q_t$. How these are combined at inference time is described in Section III.2.
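Caption dropout is a one-line change to the data pipeline. A sketch, where NULL_TOKEN is a hypothetical encoding of the null caption $\varnothing$:

```python
import numpy as np

NULL_TOKEN = -1          # hypothetical encoding of the null caption
P_UNCOND = 0.1

def maybe_drop_caption(caption_id, rng):
    """Replace the caption with the null token with probability P_UNCOND."""
    return NULL_TOKEN if rng.random() < P_UNCOND else caption_id

rng = np.random.default_rng(0)
batch = [maybe_drop_caption(7, rng) for _ in range(10_000)]
frac_null = batch.count(NULL_TOKEN) / len(batch)
print(round(frac_null, 2))   # ≈ 0.1
```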
II.4. What training actually does, and what it learns
Training is just denoising
The operational content of training is: sample $X_0 \sim q_0$, sample $t \sim \mathrm{Unif}\{1, \dots, T\}$ and $\varepsilon \sim \mathcal{N}(0,I)$, form $X_t = \sqrt{\bar\alpha_t}\,X_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon$, and regress $\varepsilon_\theta(X_t, t)$ against $\varepsilon$. No $X_{t-1}$ is needed; the closed-form marginal $q_{t\mid 0}$ lets you jump directly to any noise level without simulating the chain. Note that predicting $\varepsilon$ and predicting $X_0$ are equivalent given $(X_t, t)$, since $\hat X_0 = \frac{1}{\sqrt{\bar\alpha_t}}\big(X_t - \sqrt{1-\bar\alpha_t}\,\varepsilon_\theta\big)$.
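A single training step, sketched in numpy with a linear map standing in for the noise network (a real model would be a conditional U-Net or transformer; the schedule is illustrative):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative schedule
alpha_bars = np.cumprod(1.0 - betas)

def eps_model(xt, t, W):
    """Stand-in for the noise network: a single linear map."""
    return W @ xt

def training_step(x0, W, rng):
    """One draw of the simplified objective: sample t and eps, form x_t, regress."""
    t = int(rng.integers(1, T + 1))
    ab = alpha_bars[t - 1]
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps   # jump straight to level t
    eps_hat = eps_model(xt, t, W)
    return np.sum((eps - eps_hat) ** 2)                # ||eps - eps_theta(x_t, t)||^2

rng = np.random.default_rng(0)
d = 4
W = 0.01 * rng.standard_normal((d, d))
loss = training_step(rng.standard_normal(d), W, rng)
```

In a real pipeline the loss would be backpropagated through the network; here it just illustrates that one step needs only $(x_0, t, \varepsilon)$.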
Why denoising gives a generative model
A standalone denoiser applied once to $X_T \sim \mathcal{N}(0,I)$ would produce the minimum mean squared error (MMSE) estimate $\mathbb{E}[X_0\mid X_T]$ — a blurry average, not a sharp sample. The ELBO (Section II.1) ensures that training a denoiser at every noise level does more: each $L_{t-1}$ fits $p_{\theta,t}(\cdot\mid y)$ to the forward posterior $q_{t-1\mid t,0}(\cdot\mid y, X_0)$ averaged over $X_0$, calibrating a chain whose $T$ transitions from pure noise produce samples from $\approx q_0$. No individual step needs to do anything dramatic — at large $t$, $\hat x_0$ is poor but $\tilde\mu_t$ barely trusts it; at small $t$, the estimate is accurate and the step commits.
The model learns the marginal reverse, not the trajectory
For each $t$, the loss $L_{t-1}$ asks the single Gaussian $p_{\theta,t}(\cdot\mid y)$ to match $q_{t-1\mid t,0}(\cdot\mid y, X_0)$, but the outer expectation averages over $X_0 \sim q_{0\mid t}(\cdot\mid y)$. The effective target for $\mu_\theta(y, t)$ is therefore

$$\mu_\theta(y, t) \;\longrightarrow\; \mathbb{E}_{x_0 \sim q_{0\mid t}(\cdot\mid y)}\big[\tilde\mu_t(y, x_0)\big] = \tilde\mu_t\big(y,\ \mathbb{E}[X_0\mid X_t = y]\big),$$

where the last equality uses linearity of $\tilde\mu_t$ in $x_0$. In the noise parameterisation, this is equivalent to $\varepsilon_\theta(y, t) \to \mathbb{E}[\varepsilon\mid X_t = y]$.
The model learns the MMSE denoiser: the conditional expectation of the noise (or equivalently of $X_0$) given $X_t = y$. The single Gaussian $p_{\theta,t}(\cdot\mid y)$ approximates the intractable mixture $q_{t-1\mid t}(\cdot\mid y)$ by matching its mean.
The ELBO ties every learned transition to the data distribution $q_0$: the model can only generate images that resemble training data. Part IV replaces this objective with an arbitrary reward.
Part III: Sampling
III.1. The sampling algorithm and its structure
Given a trained noise predictor $\varepsilon_\theta$, Algorithm 2 of Ho et al. (2020) generates images by initialising $x_T \sim \mathcal{N}(0,I)$ and iterating for $t = T, T-1, \dots, 1$:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\varepsilon_\theta(x_t, t)\Big) + \sigma_t z, \qquad z \sim \mathcal{N}(0,I)$$

(with $z = 0$ at $t = 1$). This decomposes as $x_{t-1} = m_t(x_t) + \sigma_t z$: a deterministic target $m_t(x_t)$ plus isotropic Gaussian noise. Each step is therefore a draw from $\mathcal{N}(m_t(x_t), \sigma_t^2 I)$ — an isotropic Gaussian with exact, closed-form log-density. This structure is what makes the policy gradient computation in Part IV tractable.
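The full loop, sketched in numpy with a placeholder predictor and the common choice $\sigma_t^2 = \beta_t$ (with a trained network substituted for eps_theta, this would be the real sampler; the schedule is illustrative):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(xt, t):
    """Placeholder noise predictor (a trained network in practice)."""
    return np.zeros_like(xt)

def sample(d, rng):
    """Start from pure noise and iterate t = T, ..., 1."""
    xt = rng.standard_normal(d)
    for t in range(T, 0, -1):
        beta_t, alpha_t, ab_t = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        mean = (xt - beta_t / np.sqrt(1.0 - ab_t) * eps_theta(xt, t)) / np.sqrt(alpha_t)
        sigma_t = np.sqrt(beta_t)            # one common choice: sigma_t^2 = beta_t
        z = rng.standard_normal(d) if t > 1 else 0.0
        xt = mean + sigma_t * z              # draw from N(m_t(x_t), sigma_t^2 I)
    return xt

x0 = sample(4, np.random.default_rng(0))
```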
Sections III.3 and III.4 derive $m_t$ and $\sigma_t$. First, Section III.2 describes how conditioning on a caption enters at inference time via classifier-free guidance.
III.2. Classifier-free guidance
Sampling with the conditional model $\varepsilon_\theta(x_t, t, c)$ alone would approximate $q_0(\cdot\mid c)$. Classifier-free guidance (CFG), introduced by Ho and Salimans (2022), sharpens this by combining the conditional and unconditional score estimates from Section II.3.
At inference, one forms the guided noise estimate

$$\hat\varepsilon(y, t, c) = (1+w)\,\varepsilon_\theta(y, t, c) - w\,\varepsilon_\theta(y, t, \varnothing)$$

for a guidance weight $w > 0$, and uses $\hat\varepsilon$ in place of $\varepsilon_\theta$ in the sampling step. Expressing this in terms of scores (up to the factor $-\sqrt{1-\bar\alpha_t}$), the guided estimate follows

$$(1+w)\,\nabla_y \log q_t(y\mid c) - w\,\nabla_y \log q_t(y).$$

Applying Bayes' rule — $\nabla_y \log q_t(y\mid c) = \nabla_y \log q_t(c\mid y) + \nabla_y \log q_t(y)$, since $\nabla_y \log q_t(c) = 0$ — this becomes

$$\nabla_y\big[\log q_t(y) + (1+w)\log q_t(c\mid y)\big].$$

The unconditional score $\nabla_y \log q_t(y)$ keeps the sample on the image manifold; the term $(1+w)\log q_t(c\mid y)$ acts as an amplified implicit classifier steering toward images strongly associated with caption $c$. At $w = 0$ this reduces to standard conditional sampling; increasing $w$ concentrates the effective distribution on high-$q_t(c\mid y)$ modes, improving caption fidelity at the cost of diversity. The guided score field does not, in general, integrate to a normalised density — CFG is a heuristic that manipulates the score at inference time, with no formal objective being optimised. Part IV takes a different approach, defining an explicit reward and optimising it via RL.
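The guided estimate itself is a two-line function; eps_cond and eps_uncond stand for the two network evaluations:

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Guided estimate: (1 + w) * conditional - w * unconditional."""
    return (1.0 + w) * eps_cond - w * eps_uncond

eps_c = np.array([1.0, 0.0])
eps_u = np.array([0.5, 0.5])
print(cfg_combine(eps_c, eps_u, 0.0))   # w = 0 recovers the conditional estimate
```

In practice the two evaluations are often batched together, since both use the same network weights.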
III.3. The deterministic target: where the step lands
The target $m_t(x_t)$ is a weighted average of the current noisy state $x_t$ and an estimate of the clean image $x_0$. Define the MMSE clean estimate from the noise predictor:

$$\hat x_0(x_t) = \frac{1}{\sqrt{\bar\alpha_t}}\big(x_t - \sqrt{1-\bar\alpha_t}\,\varepsilon_\theta(x_t, t)\big).$$

Substituting $\hat x_0$ into $\tilde\mu_t(x_t, \hat x_0)$ — the forward posterior mean from Section I.3, with $\hat x_0$ in place of the true clean image:

$$\tilde\mu_t(x_t, \hat x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\cdot\frac{1}{\sqrt{\bar\alpha_t}}\big(x_t - \sqrt{1-\bar\alpha_t}\,\varepsilon_\theta(x_t, t)\big) + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t.$$

Using $\sqrt{\bar\alpha_{t-1}/\bar\alpha_t} = 1/\sqrt{\alpha_t}$, the coefficient of $x_t$ becomes $\frac{\beta_t + \alpha_t(1-\bar\alpha_{t-1})}{\sqrt{\alpha_t}\,(1-\bar\alpha_t)}$. The numerator simplifies: $\beta_t + \alpha_t - \alpha_t\bar\alpha_{t-1} = (1-\alpha_t) + \alpha_t - \bar\alpha_t = 1-\bar\alpha_t$, so the coefficient is $\frac{1}{\sqrt{\alpha_t}}$. The coefficient multiplying $\varepsilon_\theta$ is $-\frac{\beta_t}{\sqrt{\alpha_t}\,\sqrt{1-\bar\alpha_t}}$. Therefore:

$$\tilde\mu_t(x_t, \hat x_0) = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\varepsilon_\theta(x_t, t)\Big)$$

which is exactly $m_t(x_t)$. The sampling step is therefore

$$x_{t-1} = \tilde\mu_t\big(x_t, \hat x_0(x_t)\big) + \sigma_t z,$$

ie, it draws from the forward posterior $q_{t-1\mid t,0}(\cdot\mid x_t, \hat x_0)$, treating the MMSE estimate as the true clean image.
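The identity just derived (plugging $\hat x_0$ into the posterior mean recovers the Algorithm-2 update) holds for an arbitrary network output, and can be verified numerically under an illustrative schedule:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

rng = np.random.default_rng(0)
t = 700
xt = rng.standard_normal(4)
eps_hat = rng.standard_normal(4)     # arbitrary stand-in for the network output

beta_t, alpha_t = betas[t - 1], alphas[t - 1]
ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]

# Route 1: plug x_hat_0 into the forward-posterior mean mu_tilde.
x_hat0 = (xt - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)
mu_tilde = (np.sqrt(ab_prev) * beta_t * x_hat0
            + np.sqrt(alpha_t) * (1.0 - ab_prev) * xt) / (1.0 - ab_t)

# Route 2: the Algorithm-2 mean m_t.
m_t = (xt - beta_t / np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(alpha_t)

print(np.max(np.abs(mu_tilde - m_t)))   # ~0: the two expressions coincide
```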
Contrast with "estimate x0 then re-corrupt"
If sampling instead jumped to $\hat x_0$ and re-injected noise to reach level $t-1$, the step would be $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat x_0(x_t) + \sqrt{1-\bar\alpha_{t-1}}\,z$. This forgets $x_t$ entirely, independently re-adding all noise for level $t-1$ from scratch.
The actual step is far more conservative. In $\tilde\mu_t$, the weight on $\hat x_0$ is $\frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}$, which is small when $\bar\alpha_{t-1} \approx 0$ (large $t$, most noise remaining) and approaches 1 as $\bar\alpha_{t-1} \to 1$ (small $t$, little noise remaining). Early steps barely trust $\hat x_0$; late steps commit to it. This conservatism is essential: $\hat x_0$ is poor at large $t$, but the step does not rely on it.
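This trust profile can be tabulated across the schedule (illustrative constants; the $t = 1$ entry uses the convention $\bar\alpha_0 = 1$):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative schedule
alpha_bars = np.cumprod(1.0 - betas)

# Weight on x_hat_0 inside mu_tilde: sqrt(ab_{t-1}) * beta_t / (1 - ab_t).
ab_prev = np.concatenate(([1.0], alpha_bars[:-1]))   # ab_0 = 1 by convention
w_x0 = np.sqrt(ab_prev) * betas / (1.0 - alpha_bars)

print(w_x0[0], w_x0[-1])   # ≈1 at t = 1, tiny at t = T
```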
III.4. The bridge interpretation
The weights in $\tilde\mu_t$ and the variance $\tilde\beta_t$ arise from the bridge structure of the forward chain.
The forward chain $(X_0, X_1, \dots, X_T)$ conditioned on $X_0 = x_0$ is a Gaussian Markov chain (OU-like, not a random walk, due to the $\sqrt{1-\beta_t}$ contraction). Pinning both endpoints — conditioning on $X_0 = x_0$ and $X_t = x_t$ — gives a jointly Gaussian distribution over the intermediate variables $(X_1, \dots, X_{t-1})$, which is itself Markov with fixed endpoints: a Gaussian Markov bridge. The density $q_{t-1\mid t,0}(\cdot\mid x_t, x_0)$ is the one-step-back marginal of this bridge, with mean $\tilde\mu_t(x_t, x_0)$ and variance $\tilde\beta_t I$, both determined by the geometry of the forward chain.
With $\sigma_t^2 = \tilde\beta_t$, the sampling step is precisely

$$x_{t-1} \sim q_{t-1\mid t,0}\big(\cdot\mid x_t, \hat x_0(x_t)\big),$$

ie, a draw from the bridge posterior, treating the MMSE estimate as the true clean image. In practice one never constructs the bridge as an object — just evaluate $\tilde\mu_t(x_t, \hat x_0)$ and add noise $\sqrt{\tilde\beta_t}\,z$ — but the bridge interpretation explains why these particular weights and this particular variance are correct.
Part IV: Finetuning with reinforcement learning
Parts II and III optimise a single objective: matching the data distribution $q_0$ via the ELBO. But many applications care about a downstream property of the generated image — aesthetic quality, compressibility, prompt-image alignment — that is not captured by data likelihood. This section recasts the sampling chain from Part III as a Markov decision process (MDP), enabling policy gradient methods to optimise an arbitrary reward $r(x_0, c)$ directly. The framework is due to Black et al. (2024), who call the resulting algorithm denoising diffusion policy optimisation (DDPO).
IV.1. The reward maximisation objective
Given a distribution over captions $\mathcal{C}$ and a reward function $r: \mathbb{R}^d \times \mathcal{C} \to \mathbb{R}$, we want to solve

$$\max_\theta\ J(\theta) = \mathbb{E}_{c\sim\mathcal{C}}\,\mathbb{E}_{x_0\sim p_\theta(\cdot\mid c)}\big[r(x_0, c)\big]$$

where $p_\theta(x_0\mid c) = \int p(x_T)\prod_{t=1}^{T}\pi_{\theta,t}(x_{t-1}\mid x_t, c)\, dx_{1:T}$ is the marginal over final images induced by the sampling chain. Since this integral is over $\mathbb{R}^{dT}$, neither $p_\theta(x_0\mid c)$ nor its gradient with respect to $\theta$ can be evaluated.
A naive approach (reward-weighted regression, or RWR) applies REINFORCE directly to this marginal:

$$\nabla_\theta J(\theta) = \mathbb{E}_{c\sim\mathcal{C}}\,\mathbb{E}_{x_0\sim p_\theta(\cdot\mid c)}\big[r(x_0, c)\,\nabla_\theta \log p_\theta(x_0\mid c)\big].$$

This is exact in principle, but $\log p_\theta(x_0\mid c)$ is the intractable log-marginal. In practice, RWR substitutes the ELBO (or another bound) for $\log p_\theta(x_0\mid c)$, which introduces bias: the gradient is no longer that of $J$ but of a surrogate objective. DDPO eliminates this approximation.
IV.2. The multi-step MDP
Recall from Part III that each sampling step is a draw from an isotropic Gaussian:

$$\pi_{\theta,t}(x_{t-1}\mid x_t, c) = \mathcal{N}\big(x_{t-1};\ \tilde\mu_t(x_t, \hat x_0(x_t, c)),\ \sigma_t^2 I\big),$$

where $\hat x_0(x_t, c) = \frac{1}{\sqrt{\bar\alpha_t}}\big(x_t - \sqrt{1-\bar\alpha_t}\,\varepsilon_\theta(x_t, t, c)\big)$ and $\tilde\mu_t$ is the bridge target from Section III.3. (The notation $\pi_{\theta,t}$ replaces $p_{\theta,t}$ from Part II to match MDP conventions; the object is the same.) DDPO maps the sampling chain to the following MDP:
State: $(x_t, t, c)$. Initial state: $x_T \sim \mathcal{N}(0,I)$, $c \sim \mathcal{C}$.
Action: $x_{t-1} \in \mathbb{R}^d$.
Policy: $\pi_{\theta,t}(x_{t-1}\mid x_t, c)$ as above.
Transition: deterministic — the next state is $(x_{t-1}, t-1, c)$. All stochasticity is in the policy.
Reward: $r(x_0, c)$ at the terminal step; zero otherwise.
A trajectory is $\tau = (x_T, x_{T-1}, \dots, x_0)$ — one run of the sampling chain.
IV.3. Factorised likelihoods and the policy gradient
The trajectory log-probability factorises as

$$\log p_\theta(\tau\mid c) = \log p(x_T) + \sum_{t=1}^{T} \log \pi_{\theta,t}(x_{t-1}\mid x_t, c)$$

where each term is an exact isotropic Gaussian log-density:

$$\log \pi_{\theta,t}(x_{t-1}\mid x_t, c) = -\frac{\big\|x_{t-1} - \tilde\mu_t\big(x_t, \hat x_0(x_t, c)\big)\big\|^2}{2\sigma_t^2} - \frac{d}{2}\log(2\pi\sigma_t^2).$$

This is the central advantage: the intractable $\nabla_\theta \log p_\theta(x_0\mid c)$ from Section IV.1 has been replaced by a sum of $T$ exact, closed-form terms. Applying REINFORCE to the trajectory gives

$$\nabla_\theta J(\theta) = \mathbb{E}_{c,\,\tau\sim p_\theta(\cdot\mid c)}\Big[r(x_0, c)\sum_{t=1}^{T}\nabla_\theta \log \pi_{\theta,t}(x_{t-1}\mid x_t, c)\Big].$$

In practice, one samples trajectories from $p_\theta$, evaluates $r(x_0, c)$, and performs a gradient step.
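The estimator can be exercised end-to-end on a toy chain. The sketch below uses a scalar parameter $\theta$ scaling each step's mean, so the per-step derivative of the log-density has a closed form; this is a stand-in for backpropagating through $\varepsilon_\theta$, and the reward is hypothetical:

```python
import numpy as np

def gaussian_logpdf_iso(x, mean, sigma2):
    """Exact log-density of N(mean, sigma2 * I) at x."""
    d = x.size
    return (-np.sum((x - mean) ** 2) / (2.0 * sigma2)
            - 0.5 * d * np.log(2.0 * np.pi * sigma2))

rng = np.random.default_rng(0)
theta, sigma2, d = 0.9, 0.01, 2

# Roll out a 2-step toy "sampling chain": each step draws N(theta * x_prev, sigma2 I).
xT = rng.standard_normal(d)
x1 = theta * xT + np.sqrt(sigma2) * rng.standard_normal(d)
x0 = theta * x1 + np.sqrt(sigma2) * rng.standard_normal(d)
traj = [xT, x1, x0]

# Exact trajectory log-probability: sum of per-step Gaussian log-densities.
logp = sum(gaussian_logpdf_iso(x_next, theta * x_prev, sigma2)
           for x_prev, x_next in zip(traj[:-1], traj[1:]))

# REINFORCE term: r(x0) times d/dtheta of the summed log-densities.
# For this toy policy each derivative is (x_next - theta * x_prev) . x_prev / sigma2.
reward = -np.sum(x0 ** 2)                      # hypothetical reward
grad = sum(np.dot(x_next - theta * x_prev, x_prev) / sigma2
           for x_prev, x_next in zip(traj[:-1], traj[1:]))
reinforce_term = reward * grad
```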
IV.4. On-policy and off-policy variants
The gradient in Section IV.3 requires trajectories sampled from the current policy $p_\theta$. DDPO-SF (score function) implements this directly: sample a batch of trajectories, compute the REINFORCE gradient, update $\theta$, discard the batch. Each trajectory is used for exactly one gradient step.
DDPO-IS (importance sampling) reuses trajectories across multiple updates. Given trajectories sampled from a previous policy $p_{\theta_{\mathrm{old}}}$, the gradient under the current $p_\theta$ is reweighted by the trajectory importance ratio

$$\frac{p_\theta(\tau\mid c)}{p_{\theta_{\mathrm{old}}}(\tau\mid c)} = \prod_{t=1}^{T} \frac{\pi_{\theta,t}(x_{t-1}\mid x_t, c)}{\pi_{\theta_{\mathrm{old}},t}(x_{t-1}\mid x_t, c)},$$

which factorises into per-step Gaussian density ratios, each exact and closed-form. As $\theta$ drifts from $\theta_{\mathrm{old}}$, the importance weights become high-variance. Following PPO, DDPO-IS clips the per-step ratios to $[1-\epsilon, 1+\epsilon]$, preventing large updates from stale trajectories. This improves sample efficiency at the cost of bias from clipping.
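The clipping itself is a few lines on log-density differences. A sketch with scalar log-probabilities (in DDPO these come from the per-step Gaussian densities):

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped objective on an importance ratio."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

# A stale sample whose ratio has drifted to 1.5 contributes as if it were 1.2.
print(clipped_surrogate(np.log(1.5), 0.0, advantage=1.0))
```

Taking the elementwise minimum makes the clip pessimistic: a stale sample can never increase the surrogate beyond what the clipped ratio allows.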
IV.5. Credit assignment
The reward is sparse: $r(x_0, c)$ arrives only at the terminal step, but the policy makes $T$ decisions. Two features make credit assignment tractable. First, per-prompt baselines: for each caption $c$, the mean reward across sampled trajectories is subtracted, so the gradient reinforces trajectories that outperform the per-prompt average. Second, near-determinism: since $\sigma_t$ is small, the trajectory barely branches, so the terminal reward is a smooth function of early decisions. In the limit $\sigma_t \to 0$ (DDIM), the trajectory becomes fully deterministic — eliminating credit assignment but also the stochasticity REINFORCE requires.
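The per-prompt baseline is a groupby-subtract. A sketch:

```python
import numpy as np

def per_prompt_advantages(rewards, prompt_ids):
    """Subtract each prompt's mean reward from that prompt's trajectories."""
    rewards = np.asarray(rewards, dtype=float)
    prompt_ids = np.asarray(prompt_ids)
    adv = np.empty_like(rewards)
    for pid in np.unique(prompt_ids):
        mask = prompt_ids == pid
        adv[mask] = rewards[mask] - rewards[mask].mean()
    return adv

adv = per_prompt_advantages([1.0, 3.0, 10.0, 14.0], ["cat", "cat", "dog", "dog"])
print(adv)   # [-1.  1. -2.  2.]
```

Without the per-prompt grouping, a prompt with systematically high rewards would dominate the gradient regardless of which trajectories were actually good for that prompt.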
IV.6. Reward hacking and KL regularisation
The objective $J(\theta)$ contains no data-matching term: if $r$ is an imperfect proxy for the true goal, unconstrained optimisation will exploit the gap, typically degrading image quality to game the metric (eg, compressibility rewards produce featureless blobs that achieve optimal file size but contain no meaningful content).
The principled fix adds a KL penalty against the pretrained reference $p_{\text{ref}}$:

$$J_{\mathrm{KL}}(\theta) = \mathbb{E}_{c\sim\mathcal{C}}\,\mathbb{E}_{x_0\sim p_\theta(\cdot\mid c)}\big[r(x_0, c)\big] - \beta\,\mathbb{E}_{c\sim\mathcal{C}}\big[\mathrm{KL}\big(p_\theta(\cdot\mid c)\,\big\|\,p_{\text{ref}}(\cdot\mid c)\big)\big].$$

This is more than a constraint on how far $p_\theta$ can drift. The KL-regularised objective has a closed-form optimum:

$$p^{*}(x_0\mid c) \propto p_{\text{ref}}(x_0\mid c)\,\exp\big(r(x_0, c)/\beta\big).$$

This Gibbs reweighting upweights images that are both plausible under $p_{\text{ref}}$ and high-reward, while suppressing high-reward but implausible images. The parameter $\beta$ controls the trade-off: large $\beta$ keeps $p^{*}$ close to $p_{\text{ref}}$; small $\beta$ concentrates on reward-maximising modes that still have reference support. The KL term actively shapes the target distribution — defining what the optimal policy is, not merely limiting how far optimisation can go. If $r$ is sufficiently misspecified, however, even $p^{*}$ will exploit the gap within the support of $p_{\text{ref}}$.
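The closed-form optimum is easy to visualise on a discrete toy support. A sketch, with hypothetical reference probabilities and rewards:

```python
import numpy as np

def gibbs_optimum(p_ref, rewards, beta):
    """p*(x) proportional to p_ref(x) * exp(r(x) / beta), on a discrete support."""
    unnorm = np.asarray(p_ref) * np.exp(np.asarray(rewards) / beta)
    return unnorm / unnorm.sum()

p_ref = np.array([0.5, 0.4, 0.1])   # reference: mass on plausible images
r = np.array([0.0, 1.0, 5.0])       # reward prefers the rare third image

print(gibbs_optimum(p_ref, r, beta=100.0))  # large beta: stays close to p_ref
print(gibbs_optimum(p_ref, r, beta=0.5))    # small beta: mass moves to high reward
```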