Rolling Diffusion Models

Keywords: Machine Learning, ICML

Contents
1 Introduction
2 Background: Diffusion Models (2.1 Diffusion; 2.2 Diffusion for temporal data)
3 Rolling Diffusion Models (3.1 A global perspective: Forward process; True backward process and generative process; 3.2 A local perspective: A linear reparameterization; 3.3 Boundary conditions; 3.4 Local training)
4 Related Work (4.1 Video diffusion; 4.2 Other time-series diffusion models)
5 Experiments (5.1 Kolmogorov Flow: Evaluation; 5.2 BAIR Robot Pushing Dataset; 5.3 Kinetics-600: Baseline Rollout)
6 Conclusion
A Rolling Diffusion Objective (Standard Diffusion; Rolling Diffusion Objective)
B Algorithms
C Hyperparameters
D Runtime Complexity
E Simulation Details
F Additional Results (F.1 Kolmogorov Flow; F.2 Kinetics-600)
G Rescaled Noise Schedule
H Hyperparameter Search for β

Rolling Diffusion Models

David Ruhe, Jonathan Heek, Tim Salimans, Emiel Hoogeboom

Abstract

Diffusion models have recently been increasingly applied to temporal data such as video, fluid mechanics simulations, or climate data. These methods generally treat subsequent frames equally regarding the amount of noise in the diffusion process. This paper explores Rolling Diffusion: a new approach that uses a sliding window denoising process. It ensures that the diffusion process corrupts data progressively through time by assigning more noise to frames that appear later in a sequence, reflecting greater uncertainty about the future as the generation process unfolds. Empirically, we show that when the temporal dynamics are complex, Rolling Diffusion is superior to standard diffusion. In particular, this result is demonstrated in a video prediction task using the Kinetics-600 video dataset and in a chaotic fluid dynamics forecasting experiment.

1 Introduction

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) have significantly boosted the field of generative modeling. They provided the foundations for large-scale text-to-image systems like DALL-E 2 (Ramesh et al., 2022), Imagen (Saharia et al., 2022), Parti (Yu et al., 2022), and Stable Diffusion (Rombach et al., 2022). Other applications of diffusion models include density estimation, text-to-speech, and image editing (Kingma et al., 2021; Gao et al., 2023; Kawar et al., 2023). Following these successes, interest in developing diffusion models for time sequences has grown. Prominent recent large-scale works include, e.g., Imagen Video (Ho et al., 2022a) and Stable Video Diffusion (StabilityAI, 2023). Other impressive results for generating video data have been achieved by, e.g., Blattmann et al. (2023); Ge et al. (2023); Harvey et al. (2022); Singer et al. (2022); and Ho et al. (2022b). Applications of sequential generative modeling outside video include, e.g., fluid mechanics and weather and climate modeling (Price et al., 2023; Meng et al., 2022; Lippe et al., 2023).

What is common across many of these works is that they treat the temporal axis as an ‘extra spatial dimension’. That is, they treat the video as a 3D tensor of shape $K \times H \times W$. This has several downsides. First, the memory and computational requirements can quickly become infeasible if one wants to generate long sequences. Second, one is typically interested in being able to roll out the sampling process for a variable number of time steps.
Therefore, an alternative angle is a fully autoregressive approach: condition on a sequence of input frames, generate a single output frame, concatenate it to the input frames, and continue the recursion. In this case, one has to traverse the entire denoising diffusion sampling chain for every single frame, which is computationally intensive. Additionally, iteratively sampling single frames leads to rapid autoregressive error accumulation. A middle ground can be found by jointly generating blocks of frames. However, in this block-autoregressive case, a diffusion model uses the same number of denoising steps for every frame. This is suboptimal: given a sequence of conditioning frames, the generative uncertainty about the next few frames is much lower than about frames further into the future. Finally, both methods sample frames only jointly with earlier frames, which is potentially a suboptimal parameterization.

Figure 1: Overview of the Rolling Diffusion rollout sampling procedure. The input to the model contains some conditioning and a sequence of partially denoised frames. The model then denoises the frames by a small amount. After denoising, the sliding window shifts, and the fully denoised frames are concatenated with the conditioning. This process repeats until the desired number of frames is generated. Example video taken from the Kinetics-600 dataset (Kay et al., 2017) (CC BY 4.0).

In this paper, we propose a new framework called Rolling Diffusion: a method that explicitly corrupts data from past to future. This is achieved by reparameterizing the global diffusion time to a local time for each frame. It turns out that by doing this, one can (apart from boundary conditions) focus entirely on a local sliding-window sequential denoising process. This has several temporal inductive biases, alleviating some of the issues mentioned above.

1. In denoising diffusion models, the model output tends to contain only low-frequency information in high-noise regimes and includes high-frequency information only when corruptions are light. In our framework, the noise strength is higher for frames that are further from the conditioning. As such, the model only needs to predict low-frequency information (i.e., global structure) for frames further into the future; high-frequency information gets included as frames move closer to the present.

2. Each frame is generated jointly with a number of preceding and succeeding frames.

3. Due to the local sliding-window point of view, every frame enjoys the same inductive bias and undergoes a similar sampling procedure regardless of its absolute position in the video.

These merits are empirically demonstrated in, among others, a video prediction experiment using the Kinetics-600 video dataset (Kay et al., 2017) and in an experiment involving chaotic fluid mechanics simulations.

Figure 2: Left: an illustration of a global rolling diffusion process and its local time reparameterization. The global diffusion (denoising) time $t$ (vertical axis) is mapped to a local time $t_k$ for each frame $k$ (horizontal axis).
The local time is then used to compute the diffusion parameters $\alpha_{t_k}$ and $\sigma_{t_k}$. Right: we show how the same local schedule can be applied to each sequence of frames based on the frame index $w$. The nontrivial part of sampling the generative process only occurs in the sliding window as it is shifted over the sequence.

2 Background: Diffusion Models

2.1 Diffusion

Diffusion models consist of a process that stochastically destroys data, named the ‘diffusion process’, and a generative process called the denoising process. Let $\bm{z}_t \in \mathbb{R}^D$ denote a latent variable over a diffusion dimension $t \in [0, 1]$. We refer to $t$ as the global (diffusion) time, which determines the amount of noise added to the data. Given a datapoint $\bm{x} \in \mathbb{R}^D$, $\bm{x} \sim q(\bm{x})$, the diffusion process is designed so that $\bm{z}_0 \approx \bm{x}$ and $\bm{z}_1 \sim \mathcal{N}(0, \mathbf{I})$ via the distribution

$$q(\bm{z}_t | \bm{x}) := \mathcal{N}(\bm{z}_t | \alpha_t \bm{x}, \sigma_t^2 \mathbf{I}), \qquad (1)$$

where $\alpha_t$ and $\sigma_t^2$ are strictly positive scalar functions of $t$. We define their signal-to-noise ratio

$$\operatorname{SNR}(t) := \frac{\alpha_t^2}{\sigma_t^2} \qquad (2)$$

to be monotonically decreasing in $t$.
Finally, we let $\alpha_t^2 + \sigma_t^2 = 1$, corresponding to a variance-preserving process, which also implies $\alpha_t^2 \in (0, 1]$ and $\sigma_t^2 \in (0, 1]$. Given the noising process, it can be shown (Sohl-Dickstein et al., 2015) that the true (i.e., optimal) denoising distribution for a single datapoint $\bm{x}$ from time $t$ to time $s$ ($s \leq t$) is given by

$$q(\bm{z}_s | \bm{z}_t, \bm{x}) = \mathcal{N}(\bm{z}_s | \mu_{t \to s}(\bm{z}_t, \bm{x}), \sigma^2_{t \to s} \mathbf{I}), \qquad (3)$$

where $\mu_{t \to s}$ and $\sigma^2_{t \to s}$ are analytical mean and variance functions of $t$, $s$, $\bm{x}$, and $\bm{z}_t$. The parameterized generative process $p_\theta(\bm{z}_s | \bm{z}_t)$ is then defined by approximating $\bm{x}$ via a neural network $f_\theta: \mathbb{R}^D \times [0, 1] \to \mathbb{R}^D$. That is, we set

$$p_\theta(\bm{z}_s | \bm{z}_t) := q(\bm{z}_s | \bm{z}_t, \bm{x} = f_\theta(\bm{z}_t, t)). \qquad (4)$$
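To make the notation above concrete, the following sketch implements the variance-preserving forward process of Eq. (1) and a single ancestral denoising step based on Eqs. (3)-(4). The cosine schedule, the function names, and the closed-form posterior expressions (written in the style of Kingma et al., 2021) are illustrative assumptions, not choices prescribed by this paper.

import numpy as np

def alpha_sigma(t):
    # Variance-preserving schedule with alpha_t^2 + sigma_t^2 = 1.
    # A cosine schedule is used here purely for illustration.
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def q_sample(x, t, eps):
    # Forward process of Eq. (1): z_t = alpha_t * x + sigma_t * eps, with eps ~ N(0, I).
    alpha, sigma = alpha_sigma(t)
    return alpha * x + sigma * eps

def denoise_step(z_t, x_hat, t, s, rng):
    # One ancestral step z_t -> z_s (0 < s <= t). The true posterior q(z_s | z_t, x)
    # of Eq. (3) is evaluated with x replaced by the network prediction
    # x_hat = f_theta(z_t, t), as in Eq. (4).
    a_t, sig_t = alpha_sigma(t)
    a_s, sig_s = alpha_sigma(s)
    a_ts = a_t / a_s                            # alpha_{t|s}
    var_ts = sig_t**2 - a_ts**2 * sig_s**2      # sigma_{t|s}^2
    mean = (a_ts * sig_s**2 / sig_t**2) * z_t + (a_s * var_ts / sig_t**2) * x_hat
    var = var_ts * sig_s**2 / sig_t**2          # sigma_{t->s}^2
    return mean + np.sqrt(var) * rng.standard_normal(z_t.shape)

Iterating denoise_step from t = 1 down to t = 0, re-predicting x_hat at every step, recovers standard ancestral sampling.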
The diffusion objective can be expressed as a KL divergence between the diffusion process and the denoising process, i.e., $D_{\mathrm{KL}}(q(\bm{x}, \bm{z}_0, \ldots, \bm{z}_1) \,||\, p(\bm{x}, \bm{z}_0, \ldots, \bm{z}_1))$, which simplifies to (Kingma et al., 2021):

$$\mathcal{L}_\theta(\bm{x}) := \mathbb{E}_{t \sim U(0,1),\, \bm{\epsilon} \sim \mathcal{N}(0,1)} \left[ a(t) \, \| \bm{x} - f_\theta(\bm{z}_{t, \bm{\epsilon}}, t) \|^2 \right] + \mathcal{L}_{\mathrm{prior}} + \mathcal{L}_{\mathrm{data}}, \qquad (5)$$

where $\mathcal{L}_{\mathrm{prior}}$ and $\mathcal{L}_{\mathrm{data}}$ are typically negligible. The weighting $a(t)$ can be freely specified. In practice, specific weightings of the loss have been found to yield better sample quality (Ho et al., 2020). This is the case for, e.g., the $\epsilon$-loss, which corresponds to $a(t) = \mathrm{SNR}(t)$.

2.2 Diffusion for temporal data

If one is interested in generating temporal data beyond typical hardware constraints, one must consider (autoregressive) conditional extension of previously generated data. That is, given an initial sample $\bm{x}^k$ at temporal index $k$, we want to sample a (faithful) conditional distribution $p(\bm{x}^{k+1} | \bm{x}^k)$. This process can then be extended to videos of arbitrary length. As discussed in Section 1, it is not yet clear which parameterization choices are optimal for estimating this conditional distribution. Furthermore, no temporal inductive bias is typically baked into the denoising process.

3 Rolling Diffusion Models

We introduce rolling diffusion models, merging the arrow of time with the (de)noising process. To formalize this, we first have to discuss the global diffusion model.
We will see that the only nontrivial parts of the global process take place locally. Defining the noise schedule locally is advantageous because the resulting model does not depend on the number of frames $K$ and can be unrolled indefinitely.

3.1 A global perspective

Let $\bm{x} \in \mathbb{R}^{D \times K}$ be a time-series datapoint, where $K$ denotes the number of frames and $D$ the dimensionality of each frame. The core idea that enables rolling diffusion is a reparameterization of the diffusion (denoising) time $t$ to a frame-dependent local time:

$$t \mapsto t_k. \qquad (6)$$

Note that we still require $t_k \in [0, 1]$ for all $k \in \{0, \ldots, K-1\}$. Furthermore, we still have a monotonically decreasing signal-to-noise schedule, ensuring a well-defined diffusion process. However, we now have a different signal-to-noise schedule for each frame. In this work, we also always have $t_k \leq t_{k+1}$, i.e., the local denoising time of a given frame is smaller than the local time of the next frame. This means we add more noise to future frames: a natural temporal inductive bias. Note that this is not strictly required; one could also use a reverse-time inductive bias or a mixture. An example of such a reparameterization is shown in Figure 2 (left). We depict a map that takes a global diffusion time $t$ (vertical axis) and a frame index $k$ (horizontal axis) and computes a local time $t_k$, indicated by a color intensity.

Forward process. We now redefine the forward process using the local time:

$$q(\bm{z}_t | \bm{x}) := \prod_{k=0}^{K-1} \mathcal{N}(\bm{z}^k_t | \alpha_{t_k} \bm{x}^k, \sigma_{t_k}^2 \mathbf{I}), \qquad (7)$$

where we reuse the $\alpha$ and $\sigma$ functions from before, now evaluated locally at $t_k$. Here, $\bm{x}^k$ denotes the $k$-th frame of $\bm{x}$.
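As a concrete illustration of Eqs. (6)-(7), the sketch below noises each frame according to its own local time. The clipped-linear map local_time is a toy choice made only to satisfy $t_k \in [0, 1]$, $t_k \leq t_{k+1}$, and monotonicity in $t$; the schedules actually used in this paper (Figure 2, Section 3.2) are different.

import numpy as np

def local_time(t, k, K):
    # Toy instance of the reparameterization t -> t_k of Eq. (6): a clipped linear
    # ramp so that t_k lies in [0, 1], increases with the frame index k (later
    # frames are noisier), and increases with the global time t. Illustrative only.
    return float(np.clip(2.0 * t - 1.0 + k / K, 0.0, 1.0))

def q_sample_rolling(x, t, eps):
    # Per-frame forward process of Eq. (7); x and eps have shape [K, D].
    # The same variance-preserving alpha/sigma functions as before are used,
    # now evaluated at the local times t_k.
    K = x.shape[0]
    t_k = np.array([local_time(t, k, K) for k in range(K)])
    alpha = np.cos(0.5 * np.pi * t_k)[:, None]
    sigma = np.sin(0.5 * np.pi * t_k)[:, None]
    return alpha * x + sigma * eps

With all local times equal to $t$, this reduces to the standard forward process of Eq. (1) applied independently to each frame.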
True backward process and generative process. Given a tuple $(s, t)$ with $s, t \in [0, 1]$ and $s \leq t$, we can divide the frames $k \in \{0, \ldots, K-1\}$ into three categories:

$$\text{clean}(s, t) := \{k \mid s_k = t_k = 0\}, \qquad (8)$$
$$\text{noise}(s, t) := \{k \mid s_k = t_k = 1\}, \qquad (9)$$
$$\text{win}(s, t) := \{k \mid s_k \in [0, 1),\ t_k \in (s_k, 1]\}. \qquad (10)$$

Note that here $s$ and $t$ are both diffusion time steps (corresponding to certain SNR levels), while $k$ denotes a frame index. This categorization can be motivated using the schedule depicted in Figure 2. Given, for example, $t = 0.5$ and $s = 0.375$, we see that the first frame $k = 0$ falls into the first category. At this point in time, $\bm{z}_{t_0}$ and $\bm{z}_{s_0}$ are identical, since $\lim_{t \to 0^+} \log \mathrm{SNR}(t) = \infty$. On the other hand, the last frame $k = K - 1$ (31 in the figure) falls into the second category: both $\bm{z}_{t_{K-1}}$ and $\bm{z}_{s_{K-1}}$ are distributed as independent standard Gaussians, since $\lim_{t \to 1^-} \log \mathrm{SNR}(t) = -\infty$. Finally, the frame $k = 16$ falls into the third, most interesting category: the sliding window.
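The three index sets of Eqs. (8)-(10) are straightforward to compute from the local times at two global time steps $s \leq t$. A minimal sketch, assuming monotone local schedules as described above (so that the three sets partition all frames):

def frame_categories(s_local, t_local):
    # Split frame indices into the sets of Eqs. (8)-(10), given per-frame local
    # times s_k (s_local) and t_k (t_local) for a denoising step t -> s.
    clean, noise, win = [], [], []
    for k, (sk, tk) in enumerate(zip(s_local, t_local)):
        if sk == 0.0 and tk == 0.0:
            clean.append(k)   # already fully denoised; copied through (Eq. 8)
        elif sk == 1.0 and tk == 1.0:
            noise.append(k)   # still pure noise; independent of the data (Eq. 9)
        else:
            win.append(k)     # inside the sliding window; actually denoised (Eq. 10)
    return clean, noise, win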
As such, observe that the true denoising process can be factorized as

$$q(\bm{z}_s | \bm{z}_t, \bm{x}) = q(\bm{z}_s^{\text{clean}} | \bm{z}_t, \bm{x}) \, q(\bm{z}_s^{\text{noise}} | \bm{z}_t, \bm{x}) \, q(\bm{z}_s^{\text{win}} | \bm{z}_t, \bm{x}). \qquad (11)$$

This is helpful because, as we will see, the only frames that need to be modeled are those in the window. Namely, the first factor is

$$q(\bm{z}_s^{\text{clean}} | \bm{z}_t, \bm{x}) = \prod_{k \in \text{clean}(s,t)} \delta(\bm{z}^k_s | \bm{z}^k_t). \qquad (12)$$

In other words, if $\bm{z}^k_t$ is already noiseless, then $\bm{z}^k_s$ will also be noiseless. Regarding the second factor, the frames are all independently normally distributed:

$$q(\bm{z}_s^{\text{noise}} | \bm{z}_t, \bm{x}) = \prod_{k \in \text{noise}(s,t)} \mathcal{N}(\bm{z}^k_s | 0, \mathbf{I}). \qquad (13)$$
Simply put, in these cases $\bm{z}^k_s$ is independent noise and does not depend on the data at all. Finally, the third factor has a true, non-trivial denoising process:

$$q(\bm{z}_s^{\text{win}} | \bm{z}_t, \bm{x}) = \prod_{k \in \text{win}(s,t)} \mathcal{N}\!\left(\bm{z}^k_s \,\middle|\, \mu_{t_k \to s_k}(\bm{z}^k_t, \bm{x}^k), \sigma_{t_k \to s_k}^2 \mathbf{I}\right),$$

where $\mu_{t_k \to s_k}$ and $\sigma_{t_k \to s_k}^2$ are the analytical mean and variance functions. Note that we can then optimally (w.r.t.
a KL divergence) factorize the generative process similarly:

$$p_\theta(\bm{z}_s | \bm{z}_t) := p(\bm{z}_s^{\text{clean}} | \bm{z}_t) \, p(\bm{z}_s^{\text{noise}} | \bm{z}_t) \, p_\theta(\bm{z}_s^{\text{win}} | \bm{z}_t), \qquad (14)$$

with $p(\bm{z}_s^{\text{clean}} | \bm{z}_t) := \prod_{k \in \text{clean}(s,t)} \delta(\bm{z}^k_s | \bm{z}^k_t)$ and $p(\bm{z}_s^{\text{noise}} | \bm{z}_t) := \prod_{k \in \text{noise}(s,t)} \mathcal{N}(\bm{z}^k_s | 0, \mathbf{I})$.
The only ‘interesting’, parameterized part of the generative process is then

$$p_\theta(\bm{z}_s^{\text{win}} | \bm{z}_t) := \prod_{k \in \text{win}(s,t)} q(\bm{z}^k_s | \bm{z}_t, \bm{x}^k = f_\theta(\bm{z}_t, t_k)). \qquad (15)$$

In other words, we can focus the generative process solely on the frames that are in the sliding window. Finally, note that we can choose not to condition the model on all $\bm{z}^k_t$ with $t_k = 0$, since frames that are far in the past are likely to be independent of the current frame, and this excessive conditioning would exceed computational constraints.
As such, we get

$$p_\theta(\bm{z}_s^{\text{win}} | \bm{z}_t) = p_\theta(\bm{z}_s^{\text{win}} | \bm{z}^{\text{clean}}_t, \bm{z}^{\text{win}}_t) \qquad (16)$$
$$\phantom{p_\theta(\bm{z}_s^{\text{win}} | \bm{z}_t)} :\approx p_\theta(\bm{z}_s^{\text{win}} | \widehat{\bm{z}_t^{\text{clean}}}, \bm{z}^{\text{win}}_t), \qquad (17)$$

where $\widehat{\bm{z}_t^{\text{clean}}}$ denotes a specific subset of $\bm{z}^{\text{clean}}_t$, typically including a few frames slightly before the current sliding window.
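Putting Eqs. (14)-(17) together, one transition of the generative process copies the clean frames, leaves the pure-noise frames as standard Gaussians, and denoises only the window frames, conditioning on the window and on a few clean frames just before it. The sketch below reuses alpha_sigma, denoise_step, and frame_categories from the earlier sketches; the network interface f_theta (taking context frames, window frames, and their local times, and returning x-predictions for the window) is a hypothetical signature, not the paper's actual architecture.

import numpy as np

def rolling_generative_step(z_t, t_local, s_local, f_theta, rng, num_context=2):
    # One transition z_t -> z_s of Eq. (14), for latents z_t of shape [K, D].
    clean, noise, win = frame_categories(s_local, t_local)
    z_s = z_t.copy()                      # clean frames: delta, copied through
    for k in noise:                       # pure-noise frames: remain N(0, I)
        z_s[k] = rng.standard_normal(z_t.shape[1])
    if not win:
        return z_s
    context = clean[-num_context:]        # the subset \hat{z}_t^clean of Eq. (17)
    x_hat = f_theta(z_t[context], z_t[win], [t_local[k] for k in win])
    for i, k in enumerate(win):           # window frames: one posterior step each
        z_s[k] = denoise_step(z_t[k], x_hat[i], t_local[k], s_local[k], rng)
    return z_s

Sliding-window rollout then amounts to repeatedly running such transitions until the leading frame is fully denoised, shifting the window, and appending fresh noise frames, as in Figure 1.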
In Appendix A, we motivate, in addition to the arguments above, the following loss function:

$$\mathcal{L}_{\text{win},\theta}(\bm{x}) := \mathbb{E}_{t \sim U(0,1),\, \bm{\epsilon} \sim \mathcal{N}(0,1)} \left[ L_{\text{win},\theta}(\bm{x}; t, \bm{\epsilon}) \right] \qquad (18)$$

with

$$L_{\text{win},\theta} := \sum_{k \in \text{win}(t)} a(t_k) \, \| \bm{x}^k - f^k_\theta(\bm{z}_{t,\bm{\epsilon}}^{\text{win}}, \bm{z}_{t,\bm{\epsilon}}^{\text{clean}}, t) \|^2,$$

where we suppress some arguments for notational convenience. Here, $\bm{z}_{t,\bm{\epsilon}}$ denotes a noised version of $\bm{x}$ as a function of $t$ and $\bm{\epsilon}$, and $a(t_k)$ is a weighting function leading to, e.g., the usual ‘simple’ $\epsilon$-MSE loss, $v$-MSE loss, or $x$-MSE loss. Observe Figure 2 again: after training is completed, we can essentially sample from the generative model by traversing the figure with the sliding window from the top left to the bottom right.

3.2 A local perspective

In the previous section, we discussed how rolling diffusion enables us to concentrate entirely on frames within a sliding window. Instead of using $t$ to represent the global diffusion time, which determines the noise level for all frames, we now redefine $t$ to determine the noise level for each frame in a smaller subsequence. Specifically, running the denoising chain from $t = 1$ to $t = 0$ samples a sequence such that the first frame is completely denoised while the subsequent frames still retain some noise. In contrast, the global process described earlier denoises an entire video. Similar to before, we reparameterize $t$ to allow for different noise levels for each frame in the sliding window. This reparameterization should be:
1. Local, meaning that the parameterization is shared and reused across the various positions of the sliding window, independent of their absolute locations.

2. Consistent under moving the window, meaning that the noise level of the current frame $w$ at $t = 0$ should match the noise level of frame $w - 1$ at $t = 1$. This consistency enables seamless denoising as the window slides, ensuring that each frame is progressively denoised while shifting positions.

Let $W$