Diffusion Models: A Precise Account

Part I: Probabilistic set-up

I.1. Notation

All random variables live in $\mathbb{R}^d$. We write $X_0, X_1, \ldots, X_T$ for the random variables in the forward chain, and use lowercase $x, y$, etc. as dummy arguments of densities. We write $\mathcal{N}(m, \Sigma)$ for the Gaussian distribution with mean $m$ and covariance $\Sigma$, and $\mathcal{N}(x; m, \Sigma)$ for its density evaluated at $x$. Densities carry explicit subscripts to avoid ambiguity: for example, $q_{t|t-1}(x \mid y)$ denotes the density of $X_t$ at $x$ given $X_{t-1} = y$.

The learned reverse process defines a separate chain $Y_T, Y_{T-1}, \ldots, Y_0$ with densities $p_\theta$, using analogous subscript conventions.

I.2. Forward process

Let $q_0 = p_{\mathrm{data}}$ be the data distribution. Fix a schedule $0 < \beta_1, \ldots, \beta_T < 1$. Define the forward chain by $X_0 \sim q_0$ and

$$q_{t|t-1}(x \mid y) = \mathcal{N}(x;\; \sqrt{1-\beta_t}\,y,\; \beta_t\,I), \qquad t = 1, \ldots, T.$$

Set $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. By induction (composing affine Gaussian maps), the marginal conditional on $X_0$ has the closed form

$$q_{t|0}(x \mid x_0) = \mathcal{N}(x;\; \sqrt{\bar{\alpha}_t}\,x_0,\; (1 - \bar{\alpha}_t)\,I).$$

Equivalently, $X_t = \sqrt{\bar{\alpha}_t}\,X_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon$ where $\varepsilon \sim \mathcal{N}(0, I)$ is independent of $X_0$. The schedule is chosen so that $\bar{\alpha}_T \approx 0$, ensuring $q_T \approx \mathcal{N}(0, I)$.
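The closed form can be checked against the step-by-step chain directly. A minimal NumPy sketch (the linear schedule is an arbitrary illustration, not part of the derivation): propagating the mean scale and variance of $X_t \mid X_0$ through the recursion reproduces $\sqrt{\bar{\alpha}_t}$ and $1-\bar{\alpha}_t$ exactly.

```python
import numpy as np

def forward_marginal_params(betas):
    """Propagate the mean scale and variance of X_t | X_0 through
    X_t = sqrt(1-beta_t) X_{t-1} + sqrt(beta_t) eps, step by step,
    and return them alongside the closed form sqrt(abar_t), 1 - abar_t."""
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    m, v = 1.0, 0.0            # X_0 | X_0 has mean scale 1, variance 0
    ms, vs = [], []
    for a in alphas:
        m = np.sqrt(a) * m     # mean scale contracts by sqrt(alpha_t)
        v = a * v + (1.0 - a)  # variance recursion: alpha_t * v + beta_t
        ms.append(m)
        vs.append(v)
    return np.array(ms), np.array(vs), np.sqrt(abar), 1.0 - abar

betas = np.linspace(1e-4, 0.02, 1000)   # illustrative linear schedule
ms, vs, m_closed, v_closed = forward_marginal_params(betas)
assert np.allclose(ms, m_closed) and np.allclose(vs, v_closed)
```

The recursion $v_t = \alpha_t v_{t-1} + \beta_t$ has $v_t = 1-\bar{\alpha}_t$ as its exact solution, which is the induction mentioned above; with this schedule $1-\bar{\alpha}_T$ is within $10^{-3}$ of $1$, matching the requirement $\bar{\alpha}_T \approx 0$.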

The (unconditional) marginal density of $X_t$ is

$$q_t(x) = \int_{\mathbb{R}^d} q_{t|0}(x \mid x_0)\,q_0(x_0)\,\mathrm{d}x_0.$$

I.3. The true reverse transitions

By Bayes' rule on the forward chain, for $t \geq 2$:

$$q_{t-1|t}(x \mid y) = \frac{q_{t|t-1}(y \mid x)\,q_{t-1}(x)}{q_t(y)}.$$

This involves the intractable marginals $q_{t-1}$ and $q_t$, so we cannot compute $q_{t-1|t}$ directly. However, if we additionally condition on $X_0$, everything becomes Gaussian. Applying Bayes' rule within the forward chain conditioned on $X_0 = x_0$ (by the Markov property, $q_{t|t-1,0} = q_{t|t-1}$, so the first factor needs no conditioning on $x_0$):

$$q_{t-1|t,0}(x \mid y, x_0) = \frac{q_{t|t-1}(y \mid x)\,q_{t-1|0}(x \mid x_0)}{q_{t|0}(y \mid x_0)}.$$

All three factors on the right are Gaussian, so the left-hand side is Gaussian. Completing the square gives

$$q_{t-1|t,0}(x \mid y, x_0) = \mathcal{N}(x;\; \tilde{\mu}_t(y, x_0),\; \tilde{\beta}_t\,I)$$

where

$$\tilde{\mu}_t(y, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,y, \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t.$$

The true (unconditioned-on-$X_0$) reverse kernel is then the mixture

$$q_{t-1|t}(x \mid y) = \int_{\mathbb{R}^d} q_{t-1|t,0}(x \mid y, x_0)\,q_{0|t}(x_0 \mid y)\,\mathrm{d}x_0$$

where $q_{0|t}(x_0 \mid y) \propto q_{t|0}(y \mid x_0)\,q_0(x_0)$ is the posterior over clean images given a noisy observation — a mixture over the entire data manifold and completely intractable. The tractable Gaussian $q_{t-1|t,0}$, however, will be central to both training (Part II) and sampling (Part III): it is the target each learned reverse step tries to approximate, and the bridge posterior that determines the sampling geometry.
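The Gaussian form of $q_{t-1|t,0}$ can be verified numerically in one dimension: evaluate the Bayes product on a fine grid, compute its moments, and compare with $\tilde{\mu}_t$ and $\tilde{\beta}_t$. A sketch under arbitrary test values (schedule, $t$, $x_0$, $y$ all chosen for illustration):

```python
import numpy as np

def gauss(x, m, v):
    """Scalar Gaussian density N(x; m, v)."""
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

betas = np.linspace(1e-4, 0.02, 100)
alphas = 1 - betas
abar = np.cumprod(alphas)

t = 50                              # 1-indexed step; arrays below are 0-indexed
b_t, a_t = betas[t - 1], alphas[t - 1]
ab_t, ab_tm1 = abar[t - 1], abar[t - 2]

x0, y = 0.7, -0.3                   # arbitrary clean value and noisy observation

# closed-form posterior parameters from Section I.3
mu = (np.sqrt(ab_tm1) * b_t * x0 + np.sqrt(a_t) * (1 - ab_tm1) * y) / (1 - ab_t)
var = (1 - ab_tm1) / (1 - ab_t) * b_t

# numerical check: q_{t-1|t,0}(x | y, x0) is proportional to
# q_{t|t-1}(y | x) * q_{t-1|0}(x | x0); moments of the normalised product
x = np.linspace(-5, 5, 200001)
unnorm = gauss(y, np.sqrt(a_t) * x, b_t) * gauss(x, np.sqrt(ab_tm1) * x0, 1 - ab_tm1)
w = unnorm / unnorm.sum()
mu_num = (w * x).sum()
var_num = (w * (x - mu_num) ** 2).sum()
assert abs(mu_num - mu) < 1e-6 and abs(var_num - var) < 1e-6
```

This is exactly the completing-the-square computation done by quadrature instead of algebra.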

Part II: Training

II.1. The learned reverse process and the ELBO

We define a reverse chain $Y_T, \ldots, Y_0$ with $Y_T \sim \mathcal{N}(0, I)$ and learnable transitions

$$p_{\theta,t}(x \mid y) = \mathcal{N}(x;\; \mu_\theta(y, t),\; \sigma_t^2\,I), \qquad t = T, T-1, \ldots, 1$$

where $\sigma_t^2$ is fixed (eg to $\beta_t$ or $\tilde{\beta}_t$). The marginal density of $Y_0$ under this chain is

$$p_\theta(x) = \int_{\mathbb{R}^{dT}} p(y_T)\prod_{t=1}^{T} p_{\theta,t}(y_{t-1} \mid y_t)\,\mathrm{d}y_{1:T}$$

where $p$ denotes the density of $\mathcal{N}(0,I)$ and we have written $y_0 = x$. We want to maximise $\log p_\theta(x_0)$ for data points $x_0 \sim q_0$.

Derivation of the ELBO

The evidence lower bound (ELBO) is derived as follows.

Step 1. Write the marginal as an expectation under the forward posterior:

$$\log p_\theta(x_0) = \log \int \frac{p(x_T)\prod_{t=1}^T p_{\theta,t}(x_{t-1}\mid x_t)}{q_{1:T|0}(x_{1:T}\mid x_0)}\;q_{1:T|0}(x_{1:T}\mid x_0)\,\mathrm{d}x_{1:T}$$

where $q_{1:T|0}(x_{1:T}\mid x_0) = \prod_{t=1}^T q_{t|t-1}(x_t \mid x_{t-1})$ is the joint forward density.

Step 2. Apply Jensen's inequality ($\log$ is concave, $q_{1:T|0}(\cdot \mid x_0)$ is a probability measure):

$$\log p_\theta(x_0) \geq \mathbb{E}_{q_{1:T|0}(\cdot \mid x_0)}\!\left[\log\frac{p(X_T)\prod_{t=1}^T p_{\theta,t}(X_{t-1}\mid X_t)}{\prod_{t=1}^T q_{t|t-1}(X_t \mid X_{t-1})}\right] =: -\mathcal{L}(x_0;\theta).$$

So $\mathcal{L}(x_0;\theta)$ is an upper bound on $-\log p_\theta(x_0)$; equivalently, $-\mathcal{L}$ is a lower bound on $\log p_\theta(x_0)$.

Step 3. Decompose $\mathcal{L}$. Inside the expectation, write

$$\log \frac{\prod_{t=1}^T q_{t|t-1}(X_t \mid X_{t-1})}{p(X_T)\prod_{t=1}^T p_{\theta,t}(X_{t-1} \mid X_t)} = -\log p(X_T) + \sum_{t=1}^T \log\frac{q_{t|t-1}(X_t \mid X_{t-1})}{p_{\theta,t}(X_{t-1} \mid X_t)}.$$

For $t \geq 2$, use Bayes' rule within the forward chain conditioned on $X_0$:

$$q_{t|t-1}(X_t \mid X_{t-1}) = \frac{q_{t-1|t,0}(X_{t-1} \mid X_t, X_0)\;q_{t|0}(X_t \mid X_0)}{q_{t-1|0}(X_{t-1} \mid X_0)}$$

to rewrite each term ($t \geq 2$) as

$$\begin{aligned} \log\frac{q_{t|t-1}(X_t \mid X_{t-1})}{p_{\theta,t}(X_{t-1}\mid X_t)} &= \log\frac{q_{t-1|t,0}(X_{t-1}\mid X_t, X_0)}{p_{\theta,t}(X_{t-1}\mid X_t)} \\ &\quad + \log q_{t|0}(X_t\mid X_0) - \log q_{t-1|0}(X_{t-1}\mid X_0). \end{aligned}$$

Summing from $t = 2$ to $T$, the $\log q_{t|0}$ terms telescope:

$$\sum_{t=2}^T \bigl[\log q_{t|0}(X_t \mid X_0) - \log q_{t-1|0}(X_{t-1}\mid X_0)\bigr] = \log q_{T|0}(X_T\mid X_0) - \log q_{1|0}(X_1\mid X_0).$$

The $t=1$ term contributes $\log q_{1|0}(X_1\mid X_0) - \log p_{\theta,1}(X_0 \mid X_1)$. Collecting everything:

$$\begin{aligned} \mathcal{L}(x_0;\theta) = \mathbb{E}\!\Bigl[ &\underbrace{D_{\mathrm{KL}}\!\bigl(q_{T|0}(\cdot\mid X_0)\;\|\;p\bigr)}_{L_T} - \underbrace{\log p_{\theta,1}(X_0\mid X_1)}_{-L_0} \\[4pt] &+ \sum_{t=2}^{T}\underbrace{D_{\mathrm{KL}}\!\bigl(q_{t-1|t,0}(\cdot\mid X_t, X_0)\;\|\;p_{\theta,t}(\cdot\mid X_t)\bigr)}_{L_{t-1}}\;\Bigr] \end{aligned}$$

where the outer expectation is over $X_0 \sim q_0$ and $X_t = \sqrt{\bar{\alpha}_t}\,X_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon$ with $\varepsilon\sim\mathcal{N}(0, I)$. Note: $L_T$ has no learnable parameters; $L_0$ is a reconstruction term; the interesting terms are $L_1, \ldots, L_{T-1}$.

II.2. Reducing $L_{t-1}$ to noise prediction

Each $L_{t-1}$ (for $t \geq 2$) is a Kullback–Leibler (KL) divergence between two Gaussians with the same covariance when $\sigma_t^2 = \tilde{\beta}_t$, reducing to

$$L_{t-1} = \frac{1}{2\tilde{\beta}_t}\bigl\|\tilde{\mu}_t(X_t, X_0) - \mu_\theta(X_t, t)\bigr\|^2 + \text{const}.$$

Substitute $X_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\bigl(X_t - \sqrt{1-\bar{\alpha}_t}\,\varepsilon\bigr)$ into the expression for $\tilde{\mu}_t$:

$$\tilde{\mu}_t(X_t, \varepsilon) = \frac{1}{\sqrt{\alpha_t}}\!\left(X_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon\right).$$

Parameterise the model mean as

$$\mu_\theta(y, t) = \frac{1}{\sqrt{\alpha_t}}\!\left(y - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_\theta(y, t)\right)$$

so that $L_{t-1} \propto \|\varepsilon - \varepsilon_\theta(X_t, t)\|^2$. The DDPM (denoising diffusion probabilistic model) simplified objective, which drops the $t$-dependent prefactor and sums uniformly over $t$, is

$$L_{\mathrm{simple}}(\theta) = \mathbb{E}_{t \sim \mathrm{Unif}\{1,\ldots,T\},\;X_0 \sim q_0,\;\varepsilon\sim\mathcal{N}(0, I)}\!\bigl[\|\varepsilon - \varepsilon_\theta(X_t, t)\|^2\bigr]$$

where $X_t = \sqrt{\bar{\alpha}_t}\,X_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon.$
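One Monte Carlo evaluation of $L_{\mathrm{simple}}$ can be sketched in a few lines of NumPy. The predictor here is a hypothetical stand-in (a real $\varepsilon_\theta$ would be a neural network such as a U-Net); everything else follows the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # illustrative schedule
abar = np.cumprod(1 - betas)

def eps_theta(x_t, t):
    # hypothetical stand-in for the trained noise predictor:
    # the naive guess "x_t is all noise"
    return x_t

def simple_loss_batch(x0_batch):
    """One Monte Carlo estimate of L_simple: sample t and eps,
    form x_t via the closed-form marginal, return the mean squared error."""
    n, d = x0_batch.shape
    t = rng.integers(1, T + 1, size=n)          # t ~ Unif{1,...,T}
    eps = rng.standard_normal((n, d))           # eps ~ N(0, I)
    a = abar[t - 1][:, None]
    x_t = np.sqrt(a) * x0_batch + np.sqrt(1 - a) * eps
    return np.mean(np.sum((eps - eps_theta(x_t, t)) ** 2, axis=1))

x0 = rng.standard_normal((256, 4))              # toy "images" in R^4
loss = simple_loss_batch(x0)
assert np.isfinite(loss) and loss >= 0.0
```

Note that no $X_{t-1}$ or chain simulation appears anywhere: each sample jumps straight to noise level $t$.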

II.3. Conditional training and caption dropout

Given image-caption pairs $(X_0, C) \sim p_{\mathrm{data}}$, the forward process acts only on the image: $q_{t|0}(x \mid x_0)$ is unchanged, and the caption $C$ is carried along as an unperturbed label. The noise predictor becomes $\varepsilon_\theta(y, t, c)$ and the training loss is

$$L_{\mathrm{simple}}(\theta) = \mathbb{E}_{t,\,(X_0, C),\,\varepsilon}\!\bigl[\|\varepsilon - \varepsilon_\theta(X_t, t, C)\|^2\bigr].$$

This is the only change: condition the network on $c$ and train with the same squared-error objective.

To enable classifier-free guidance (CFG) at inference (Section III.2), one additionally trains the network to operate without a caption. During training, the caption $C$ is replaced with a null token $\varnothing$ independently with probability $p_{\mathrm{uncond}} \approx 0.1$. Writing $\tilde{C}$ for the resulting input ($C$ with probability $1 - p_{\mathrm{uncond}}$, $\varnothing$ otherwise), the loss remains $\mathbb{E}[\|\varepsilon - \varepsilon_\theta(X_t, t, \tilde{C})\|^2]$. After convergence, the single network approximates two score functions depending on its third argument:

$$\varepsilon_\theta(y, t, c) \approx -\sqrt{1-\bar{\alpha}_t}\,\nabla_y \log q_t(y \mid c), \qquad \varepsilon_\theta(y, t, \varnothing) \approx -\sqrt{1-\bar{\alpha}_t}\,\nabla_y \log q_t(y).$$

The first is the score of the conditional marginal $q_t(\cdot \mid c)$; the second is the score of the unconditional marginal $q_t$. How these are combined at inference time is described in Section III.2.
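Caption dropout itself is a one-line operation. A sketch, where the integer caption ids and the null-token convention are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
NULL_TOKEN = -1        # stands in for the null caption (a convention invented here)
P_UNCOND = 0.1

def drop_captions(captions):
    """Replace each caption with the null token independently with
    probability P_UNCOND, as in classifier-free guidance training."""
    captions = np.asarray(captions).copy()
    mask = rng.random(captions.shape) < P_UNCOND
    captions[mask] = NULL_TOKEN
    return captions

c = np.arange(100_000)                 # toy caption ids
c_tilde = drop_captions(c)
frac = np.mean(c_tilde == NULL_TOKEN)
assert abs(frac - P_UNCOND) < 0.01     # roughly 10% of captions dropped
```

The network never sees a separate "unconditional training phase"; the single loss with $\tilde{C}$ covers both cases.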

II.4. What training actually does, and what it learns

Training is just denoising

The operational content of training is: sample $X_0 \sim q_0$, sample $t \sim \mathrm{Unif}\{1,\ldots,T\}$ and $\varepsilon \sim \mathcal{N}(0, I)$, form $X_t = \sqrt{\bar{\alpha}_t}\,X_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon$, and regress $\varepsilon_\theta(X_t, t)$ against $\varepsilon$. No $X_{t-1}$ is needed; the closed-form marginal $q_{t|0}$ lets you jump directly to any noise level without simulating the chain. Note that predicting $\varepsilon$ and predicting $X_0$ are equivalent given $(X_t, t)$, since $\hat{X}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(X_t - \sqrt{1-\bar{\alpha}_t}\,\varepsilon_\theta)$.
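The equivalence of $\varepsilon$-prediction and $X_0$-prediction is pure algebra, as a quick numerical round trip confirms (values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
betas = np.linspace(1e-4, 0.02, 1000)
abar = np.cumprod(1 - betas)

t = 400
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
x_t = np.sqrt(abar[t - 1]) * x0 + np.sqrt(1 - abar[t - 1]) * eps

# given (x_t, t) and the noise, the clean image is recovered exactly, so a
# noise predictor and a clean-image predictor carry the same information
x0_hat = (x_t - np.sqrt(1 - abar[t - 1]) * eps) / np.sqrt(abar[t - 1])
assert np.allclose(x0_hat, x0)
```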

Why denoising gives a generative model

A standalone denoiser applied once to $X_T \sim \mathcal{N}(0, I)$ would produce the minimum mean squared error (MMSE) estimate $\mathbb{E}[X_0 \mid X_T]$ — a blurry average, not a sharp sample. The ELBO (Section II.1) ensures that training a denoiser at every noise level does more: each $L_{t-1}$ fits $p_{\theta,t}(\cdot \mid y)$ to the forward posterior $q_{t-1|t,0}(\cdot \mid y, X_0)$ averaged over $X_0$, calibrating a chain whose $T$ transitions from pure noise produce samples from $\approx q_0$. No individual step needs to do anything dramatic — at large $t$, $\hat{x}_0$ is poor but $\tilde{\mu}_t$ barely trusts it; at small $t$, the estimate is accurate and the step commits.

The model learns the marginal reverse, not the trajectory

For each $t$, the loss $L_{t-1}$ asks the single Gaussian $p_{\theta,t}(\cdot \mid y)$ to match $q_{t-1|t,0}(\cdot \mid y, X_0)$, but the outer expectation averages over $X_0 \sim q_{0|t}(\cdot \mid y)$. The effective target for $\mu_\theta(y, t)$ is therefore

$$\begin{aligned} \operatorname*{arg\,min}_{\mu} \;\mathbb{E}_{X_0 \sim q_{0|t}(\cdot\mid y)}\!\bigl[\|\tilde{\mu}_t(y, X_0) - \mu\|^2\bigr] &= \mathbb{E}_{X_0 \sim q_{0|t}(\cdot\mid y)}\!\bigl[\tilde{\mu}_t(y, X_0)\bigr] \\ &= \tilde{\mu}_t\!\bigl(y,\;\mathbb{E}[X_0 \mid X_t = y]\bigr) \end{aligned}$$

where the last equality uses linearity of $\tilde{\mu}_t$ in $x_0$. In the noise parameterisation, this is equivalent to $\varepsilon_\theta(y, t) \to \mathbb{E}[\varepsilon \mid X_t = y]$.

The model learns the MMSE denoiser: the conditional expectation of the noise (or equivalently of $X_0$) given $X_t = y$. The single Gaussian $p_{\theta,t}(\cdot \mid y)$ approximates the intractable mixture $q_{t-1|t}(\cdot \mid y)$ by matching its mean.

The ELBO ties every learned transition to the data distribution $q_0$: the model can only generate images that resemble training data. Part IV replaces this objective with an arbitrary reward.

Part III: Sampling

III.1. The sampling algorithm and its structure

Given a trained noise predictor $\varepsilon_\theta$, Algorithm 2 of Ho et al. (2020) generates images by initialising $x_T \sim \mathcal{N}(0, I)$ and iterating for $t = T, T-1, \ldots, 1$:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_\theta(x_t, t)\right) + \sigma_t\,z, \qquad z \sim \mathcal{N}(0, I)$$

(with $z = 0$ at $t=1$). This decomposes as $x_{t-1} = m_t(x_t) + \sigma_t\,z$: a deterministic target $m_t(x_t)$ plus isotropic Gaussian noise. Each step is therefore a draw from $\mathcal{N}(m_t(x_t),\,\sigma_t^2 I)$ — an isotropic Gaussian with exact, closed-form log-density. This structure is what makes the policy gradient computation in Part IV tractable.
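The loop can be sketched in NumPy. As a toy check, collapse the data distribution to a point at the origin: then the exact MMSE predictor is available in closed form, $\mathbb{E}[\varepsilon \mid X_t = y] = y/\sqrt{1-\bar{\alpha}_t}$, and the chain should return (numerically) to the origin. The schedule, dimension, and predictor are illustrative stand-ins, not a real trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.1, T)   # short illustrative schedule
alphas = 1 - betas
abar = np.cumprod(alphas)

def eps_theta(x_t, t):
    # exact MMSE noise predictor when q0 is a point mass at 0:
    # X_t = sqrt(1-abar_t) * eps, so E[eps | X_t = y] = y / sqrt(1-abar_t)
    return x_t / np.sqrt(1 - abar[t - 1])

def sample(d=4):
    """Ancestral sampling (Algorithm 2 of Ho et al., 2020) with sigma_t^2 = beta_t."""
    x = rng.standard_normal(d)                      # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        a_t, b_t = alphas[t - 1], betas[t - 1]
        mean = (x - b_t / np.sqrt(1 - abar[t - 1]) * eps_theta(x, t)) / np.sqrt(a_t)
        z = rng.standard_normal(d) if t > 1 else 0.0
        x = mean + np.sqrt(b_t) * z
    return x

x0 = sample()
assert np.max(np.abs(x0)) < 1e-6    # chain collapses onto the point-mass data
```

The final step ($t=1$) lands exactly on the data point because the coefficient $1 - \beta_1/(1-\bar{\alpha}_1)$ vanishes there; for a real data distribution the same loop simply swaps in the learned $\varepsilon_\theta$.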

Sections III.3 and III.4 derive $m_t$ and $\sigma_t$. First, Section III.2 describes how conditioning on a caption enters at inference time via classifier-free guidance.

III.2. Classifier-free guidance

Sampling with the conditional model $\varepsilon_\theta(x_t, t, c)$ alone would approximate $q_0(\cdot \mid c)$. Classifier-free guidance (CFG), introduced by Ho and Salimans (2022), sharpens this by combining the conditional and unconditional score estimates from Section II.3.

At inference, one forms the guided noise estimate

$$\hat{\varepsilon}(y, t, c) = (1+w)\,\varepsilon_\theta(y, t, c) - w\,\varepsilon_\theta(y, t, \varnothing)$$

for a guidance weight $w > 0$, and uses $\hat{\varepsilon}$ in place of $\varepsilon_\theta$ in the sampling step. Expressing this in terms of scores (up to the factor $-\sqrt{1-\bar{\alpha}_t}$), the guided estimate follows

$$(1+w)\,\nabla_y \log q_t(y \mid c) - w\,\nabla_y \log q_t(y).$$

Applying Bayes' rule — $\nabla_y \log q_t(y \mid c) = \nabla_y \log q_t(c \mid y) + \nabla_y \log q_t(y)$, since $\nabla_y \log q_t(c) = 0$ — this becomes

$$\nabla_y\!\bigl[\log q_t(y) + (1+w)\log q_t(c \mid y)\bigr].$$

The unconditional score $\nabla_y \log q_t(y)$ keeps the sample on the image manifold; the term $(1+w)\log q_t(c \mid y)$ acts as an amplified implicit classifier steering toward images strongly associated with caption $c$. At $w = 0$ this reduces to standard conditional sampling; increasing $w$ concentrates the effective distribution on high-$q_t(c \mid y)$ modes, improving caption fidelity at the cost of diversity. The guided score field does not, in general, integrate to a normalised density — CFG is a heuristic that manipulates the score at inference time, with no formal objective being optimised. Part IV takes a different approach, defining an explicit reward and optimising it via RL.
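The guidance combination itself is a two-term linear extrapolation; a minimal sketch (the vectors are arbitrary placeholders for network outputs):

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1+w)*eps_cond - w*eps_uncond.
    w = 0 recovers plain conditional sampling."""
    return (1 + w) * eps_cond - w * eps_uncond

eps_c = np.array([1.0, 0.0])    # placeholder conditional prediction
eps_u = np.array([0.5, 0.5])    # placeholder unconditional prediction
assert np.allclose(guided_eps(eps_c, eps_u, 0.0), eps_c)
# w > 0 extrapolates past the conditional estimate, away from the unconditional one
assert np.allclose(guided_eps(eps_c, eps_u, 2.0), [2.0, -1.0])
```

Both network evaluations happen at the same $(y, t)$; only the caption input differs, so CFG doubles the per-step compute.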

III.3. The deterministic target: where the step lands

The target $m_t(x_t)$ is a weighted average of the current noisy state $x_t$ and an estimate of the clean image $x_0$. Define the MMSE clean estimate from the noise predictor:

$$\hat{x}_0(x_t) = \frac{1}{\sqrt{\bar{\alpha}_t}}\!\bigl(x_t - \sqrt{1-\bar{\alpha}_t}\,\varepsilon_\theta(x_t, t)\bigr).$$

Substituting $\hat{x}_0$ into $\tilde{\mu}_t(x_t, \hat{x}_0)$ — the forward posterior mean from Section I.3, with $\hat{x}_0$ in place of the true clean image:

$$\tilde{\mu}_t(x_t, \hat{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\hat{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t.$$

Using $\frac{\sqrt{\bar{\alpha}_{t-1}}}{\sqrt{\bar{\alpha}_t}} = \frac{1}{\sqrt{\alpha_t}}$, the coefficient of $x_t$ becomes $\frac{\beta_t + \alpha_t(1-\bar{\alpha}_{t-1})}{(1-\bar{\alpha}_t)\sqrt{\alpha_t}}$. The numerator simplifies: $\beta_t + \alpha_t - \alpha_t\bar{\alpha}_{t-1} = (1-\alpha_t) + \alpha_t - \bar{\alpha}_t = 1 - \bar{\alpha}_t$, so the coefficient is $\frac{1}{\sqrt{\alpha_t}}$. The coefficient of $\varepsilon_\theta$ is $\frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar{\alpha}_t}}$. Therefore:

$$\tilde{\mu}_t(x_t, \hat{x}_0) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_\theta(x_t, t)\right)$$

which is exactly $m_t(x_t)$. The sampling step is therefore

$$x_{t-1} = \tilde{\mu}_t(x_t, \hat{x}_0(x_t)) + \sigma_t\,z$$

ie, it draws from the forward posterior $q_{t-1|t,0}(\cdot \mid x_t, \hat{x}_0)$, treating the MMSE estimate as the true clean image.
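The coefficient algebra is easy to confirm numerically: for any predictor output whatsoever, substituting $\hat{x}_0$ into $\tilde{\mu}_t$ agrees with $m_t$. A sketch with arbitrary values:

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 100)
alphas = 1 - betas
abar = np.cumprod(alphas)

t = 60
a_t, b_t = alphas[t - 1], betas[t - 1]
ab_t, ab_tm1 = abar[t - 1], abar[t - 2]

x_t = rng.standard_normal(5)
eps_hat = rng.standard_normal(5)    # any predictor output; the identity is algebraic

# route 1: form x0_hat, then the posterior mean mu_tilde(x_t, x0_hat)
x0_hat = (x_t - np.sqrt(1 - ab_t) * eps_hat) / np.sqrt(ab_t)
mu_tilde = (np.sqrt(ab_tm1) * b_t * x0_hat
            + np.sqrt(a_t) * (1 - ab_tm1) * x_t) / (1 - ab_t)

# route 2: the sampling target m_t(x_t) written directly in terms of eps_hat
m_t = (x_t - b_t / np.sqrt(1 - ab_t) * eps_hat) / np.sqrt(a_t)
assert np.allclose(mu_tilde, m_t)
```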

Contrast with "estimate $x_0$ then re-corrupt"

If sampling instead jumped to $\hat{x}_0$ and re-injected noise to reach level $t{-}1$, the step would be $x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{x}_0(x_t) + \sqrt{1-\bar{\alpha}_{t-1}}\,z$. This forgets $x_t$ entirely, independently re-adding all noise for level $t{-}1$ from scratch.

The actual step is far more conservative. In $\tilde{\mu}_t$, the weight on $\hat{x}_0$ is $\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}$, which is small when $\bar{\alpha}_{t-1} \approx 0$ (large $t$, most noise remaining) and approaches $1$ as $\bar{\alpha}_{t-1} \to 1$ (small $t$, little noise remaining). Early steps barely trust $\hat{x}_0$; late steps commit to it. This conservatism is essential: $\hat{x}_0$ is poor at large $t$, but the step does not rely on it.

III.4. The bridge interpretation

The weights in $\tilde{\mu}_t$ and the variance $\tilde{\beta}_t$ arise from the bridge structure of the forward chain.

The forward chain $(X_0, X_1, \ldots, X_T)$ conditioned on $X_0 = x_0$ is a Gaussian Markov chain (OU-like, not a random walk, due to the $\sqrt{1-\beta_t}$ contraction). Pinning both endpoints — conditioning on $X_0 = x_0$ and $X_t = x_t$ — gives a jointly Gaussian distribution over the intermediate variables $(X_1, \ldots, X_{t-1})$, which is itself Markov with fixed endpoints: a Gaussian Markov bridge. The density $q_{t-1|t,0}(\cdot \mid x_t, x_0)$ is the one-step-back marginal of this bridge, with mean $\tilde{\mu}_t(x_t, x_0)$ and variance $\tilde{\beta}_t\,I$, both determined by the geometry of the forward chain.

With $\sigma_t^2 = \tilde{\beta}_t$, the sampling step is precisely

$$x_{t-1} \sim q_{t-1|t,0}(\;\cdot\;\mid\;x_t,\;\hat{x}_0(x_t)),$$

ie, a draw from the bridge posterior, treating the MMSE estimate as the true clean image. In practice one never constructs the bridge as an object — just evaluate $\tilde{\mu}_t(x_t, \hat{x}_0)$ and add noise $\sqrt{\tilde{\beta}_t}\,z$ — but the bridge interpretation explains why these particular weights and this particular variance are correct.

Part IV: Finetuning with reinforcement learning

Parts II and III optimise a single objective: matching the data distribution $q_0$ via the ELBO. But many applications care about a downstream property of the generated image — aesthetic quality, compressibility, prompt-image alignment — that is not captured by data likelihood. This section recasts the sampling chain from Part III as a Markov decision process (MDP), enabling policy gradient methods to optimise an arbitrary reward $r(x_0, c)$ directly. The framework is due to Black et al. (2024), who call the resulting algorithm denoising diffusion policy optimisation (DDPO).

IV.1. The reward maximisation objective

Given a distribution over captions $\mathcal{C}$ and a reward function $r: \mathbb{R}^d \times \mathcal{C} \to \mathbb{R}$, we want to solve

$$\max_\theta \; J(\theta) = \mathbb{E}_{c \sim \mathcal{C}}\;\mathbb{E}_{x_0 \sim p_\theta(\cdot \mid c)}\!\bigl[r(x_0, c)\bigr]$$

where $p_\theta(x_0 \mid c) = \int p(x_T)\prod_{t=1}^{T} \pi_{\theta,t}(x_{t-1} \mid x_t, c)\;\mathrm{d}x_{1:T}$ is the marginal over final images induced by the sampling chain. Since this integral is over $\mathbb{R}^{dT}$, neither $p_\theta(x_0 \mid c)$ nor its gradient with respect to $\theta$ can be evaluated.

A naive approach (reward-weighted regression, or RWR) applies REINFORCE directly to this marginal:

$$\nabla_\theta J = \mathbb{E}_{c \sim \mathcal{C},\; x_0 \sim p_\theta(\cdot \mid c)}\!\bigl[r(x_0, c)\;\nabla_\theta \log p_\theta(x_0 \mid c)\bigr].$$

This is exact in principle, but $\log p_\theta(x_0 \mid c)$ is the intractable log-marginal. In practice, RWR substitutes the ELBO (or another bound) for $\log p_\theta(x_0 \mid c)$, which introduces bias: the gradient is no longer that of $J$ but of a surrogate objective. DDPO eliminates this approximation.

IV.2. The multi-step MDP

Recall from Part III that each sampling step is a draw from an isotropic Gaussian:

$$\pi_{\theta,t}(x_{t-1} \mid x_t, c) = \mathcal{N}\!\bigl(x_{t-1};\; \tilde{\mu}_t(x_t, \hat{x}_0(x_t, c)),\; \sigma_t^2\,I\bigr)$$

where $\hat{x}_0(x_t, c) = \frac{1}{\sqrt{\bar{\alpha}_t}}(x_t - \sqrt{1-\bar{\alpha}_t}\,\varepsilon_\theta(x_t, t, c))$ and $\tilde{\mu}_t$ is the bridge target from Section III.3. (The notation $\pi_{\theta,t}$ replaces $p_{\theta,t}$ from Part II to match MDP conventions; the object is the same.) DDPO maps the sampling chain to the following MDP: the state at step $t$ bundles the caption, the timestep, and the current iterate $(c, t, x_t)$; the action is the next iterate $x_{t-1}$; the policy is $\pi_{\theta,t}$; and the reward is $r(x_0, c)$ at the terminal step and zero at every other step (Section IV.5).

A trajectory is $\tau = (x_T, x_{T-1}, \ldots, x_0)$ — one run of the sampling chain.

IV.3. Factorised likelihoods and the policy gradient

The trajectory log-probability factorises as

$$\log p_\theta(\tau \mid c) = \log p(x_T) + \sum_{t=1}^{T} \log \pi_{\theta,t}(x_{t-1} \mid x_t, c)$$

where each term is an exact isotropic Gaussian log-density:

$$\log \pi_{\theta,t}(x_{t-1} \mid x_t, c) = -\frac{d}{2}\log(2\pi\sigma_t^2) - \frac{1}{2\sigma_t^2}\bigl\|x_{t-1} - \tilde{\mu}_t(x_t, \hat{x}_0(x_t, c))\bigr\|^2.$$

This is the central advantage: the intractable $\nabla_\theta \log p_\theta(x_0 \mid c)$ from Section IV.1 has been replaced by a sum of $T$ exact, closed-form terms. Applying REINFORCE to the trajectory gives

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\cdot \mid c)}\!\left[r(x_0, c)\sum_{t=1}^{T}\nabla_\theta \log \pi_{\theta,t}(x_{t-1} \mid x_t, c)\right]$$

where each $\nabla_\theta \log \pi_{\theta,t}$ propagates through $\tilde{\mu}_t$ and hence through $\varepsilon_\theta$:

$$\nabla_\theta \log \pi_{\theta,t}(x_{t-1} \mid x_t, c) = \frac{1}{\sigma_t^2}\bigl(x_{t-1} - \tilde{\mu}_t(x_t, \hat{x}_0(x_t, c))\bigr)^{\!\top}\nabla_\theta \tilde{\mu}_t(x_t, \hat{x}_0(x_t, c)).$$

In practice, one samples trajectories from $p_\theta$, evaluates $r(x_0, c)$, and performs a gradient step.
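The per-step log-density and the score-function estimator it enables can be sketched on a one-dimensional toy (the policy and reward here are invented for illustration): for a Gaussian policy $\mathcal{N}(\mu, \sigma^2)$ and reward $r(x) = x$, the true gradient of the expected reward with respect to $\mu$ is exactly $1$, and the REINFORCE estimate recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi(x, mu, sigma2):
    """Exact isotropic Gaussian log-density, as in the per-step policy."""
    d = np.size(x)
    return -0.5 * d * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

assert np.isclose(log_pi(0.0, 0.0, 1.0), -0.5 * np.log(2 * np.pi))

# REINFORCE sanity check: policy N(mu, sigma2), reward r(x) = x,
# so grad_mu E[r(x)] = 1 exactly.
mu, sigma2, n = 0.5, 0.09, 200_000
x = mu + np.sqrt(sigma2) * rng.standard_normal(n)
score = (x - mu) / sigma2              # d/dmu log_pi, in closed form
grad_est = np.mean(x * score)          # E[r(x) * score] ~ 1
assert abs(grad_est - 1.0) < 0.05
```

In DDPO the same pattern runs per step, with $\mu$ replaced by $\tilde{\mu}_t(x_t, \hat{x}_0(x_t, c))$ and the gradient flowing into $\theta$ by backpropagation.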

IV.4. On-policy and off-policy variants

The gradient in Section IV.3 requires trajectories sampled from the current policy $p_\theta$. DDPO-SF (score function) implements this directly: sample a batch of trajectories, compute the REINFORCE gradient, update $\theta$, discard the batch. Each trajectory is used for exactly one gradient step.

DDPO-IS (importance sampling) reuses trajectories across multiple updates. Given trajectories sampled from a previous policy $p_{\theta_{\mathrm{old}}}$, the gradient under the current $p_\theta$ is reweighted by the trajectory importance ratio

$$\frac{p_\theta(\tau \mid c)}{p_{\theta_{\mathrm{old}}}(\tau \mid c)} = \prod_{t=1}^{T} \frac{\pi_{\theta,t}(x_{t-1} \mid x_t, c)}{\pi_{\theta_{\mathrm{old}},t}(x_{t-1} \mid x_t, c)}$$

which factorises into per-step Gaussian density ratios, each exact and closed-form. As $\theta$ drifts from $\theta_{\mathrm{old}}$, the importance weights become high-variance. Following PPO, DDPO-IS clips the per-step ratios to $[1-\epsilon, 1+\epsilon]$, preventing large updates from stale trajectories. This improves sample efficiency at the cost of bias from clipping.
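The per-step ratio and its clipping are a few lines each; since both policies share the fixed $\sigma_t^2$, the normalising constants cancel in the ratio. A sketch with arbitrary values:

```python
import numpy as np

def per_step_log_ratio(x_prev, mu_new, mu_old, sigma2):
    """Log of the per-step Gaussian density ratio pi_theta / pi_theta_old.
    Same variance on both sides, so normalising constants cancel."""
    return (np.sum((x_prev - mu_old) ** 2)
            - np.sum((x_prev - mu_new) ** 2)) / (2 * sigma2)

def clipped_weight(log_ratio, eps=0.2):
    """PPO-style clipping of the importance ratio to [1-eps, 1+eps]."""
    return np.clip(np.exp(log_ratio), 1 - eps, 1 + eps)

x_prev = np.array([0.1, -0.2])        # the action actually taken (stored)
mu_old = np.array([0.0, 0.0])         # mean under the policy that sampled it
mu_new = np.array([0.05, -0.1])       # mean under the current policy
w = clipped_weight(per_step_log_ratio(x_prev, mu_new, mu_old, sigma2=0.01))
assert 0.8 <= w <= 1.2                # clipping keeps the weight in range
```

With these numbers the raw ratio exceeds $1+\epsilon$ and the clip binds, which is exactly the mechanism that suppresses large updates from stale trajectories.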

IV.5. Credit assignment

The reward is sparse: $r(x_0, c)$ arrives only at the terminal step, but the policy makes $T$ decisions. Two features make credit assignment tractable. First, per-prompt baselines: for each caption $c$, the mean reward across sampled trajectories is subtracted, so the gradient reinforces trajectories that outperform the per-prompt average. Second, near-determinism: since $\sigma_t$ is small, the trajectory barely branches, so the terminal reward is a smooth function of early decisions. In the limit $\sigma_t \to 0$ (DDIM), the trajectory becomes fully deterministic — eliminating the credit-assignment problem but also the stochasticity that REINFORCE requires.

IV.6. Reward hacking and KL regularisation

The objective $J(\theta)$ contains no data-matching term: if $r$ is an imperfect proxy for the true goal, unconstrained optimisation will exploit the gap, typically degrading image quality to game the metric (eg, compressibility rewards produce featureless blobs that achieve optimal file size but contain no meaningful content).

The principled fix adds a KL penalty against the pretrained reference $p_{\mathrm{ref}}$:

$$\max_\theta \; \mathbb{E}_{c,\; x_0 \sim p_\theta(\cdot\mid c)}\!\bigl[r(x_0, c)\bigr] - \beta\, D_{\mathrm{KL}}\!\bigl(p_\theta(\cdot \mid c) \,\|\, p_{\mathrm{ref}}(\cdot \mid c)\bigr).$$

This is more than a constraint on how far $p_\theta$ can drift. The KL-regularised objective has a closed-form optimum:

$$p^*(x_0 \mid c) \propto p_{\mathrm{ref}}(x_0 \mid c)\,\exp\!\bigl(r(x_0, c)/\beta\bigr).$$

This Gibbs reweighting upweights images that are both plausible under $p_{\mathrm{ref}}$ and high-reward, while suppressing high-reward but implausible images. The parameter $\beta$ controls the trade-off: large $\beta$ keeps $p^*$ close to $p_{\mathrm{ref}}$; small $\beta$ concentrates on reward-maximising modes that still have reference support. The KL term actively shapes the target distribution — defining what the optimal policy is, not merely limiting how far optimisation can go. If $r$ is sufficiently misspecified, however, even $p^*$ will exploit the gap within the support of $p_{\mathrm{ref}}$.
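A discrete toy makes the Gibbs reweighting concrete (the five candidate "images", their reference probabilities, and the rewards are all made up): large $\beta$ stays near $p_{\mathrm{ref}}$, small $\beta$ shifts mass onto the high-reward outcome even though it is implausible under the reference.

```python
import numpy as np

# toy reference distribution and rewards over 5 candidate outcomes
p_ref = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
r = np.array([0.0, 1.0, 2.0, 3.0, 10.0])   # last: high reward, low plausibility

def gibbs_optimum(p_ref, r, beta):
    """Closed-form optimum of the KL-regularised objective:
    p* proportional to p_ref * exp(r / beta)."""
    w = p_ref * np.exp(r / beta)
    return w / w.sum()

p_large = gibbs_optimum(p_ref, r, beta=100.0)   # large beta: stay near p_ref
p_small = gibbs_optimum(p_ref, r, beta=1.0)     # small beta: chase reward
assert np.allclose(p_large, p_ref, atol=0.05)
assert p_small.argmax() == 4                    # reward-dominant at small beta
```

Note that the high-reward outcome dominates at small $\beta$ only because it retains some reference support ($p_{\mathrm{ref}} > 0$); an outcome with zero reference probability gets zero mass under $p^*$ regardless of its reward.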