Autoregressive Image Generation without Vector Quantization

Tianhong Li1     Yonglong Tian2     He Li3     Mingyang Deng1     Kaiming He1
1MIT CSAIL     2Google DeepMind     3Tsinghua University

Abstract

Conventional wisdom holds that autoregressive models for image generation are typically accompanied by vector-quantized tokens. We observe that while a discrete-valued space can facilitate representing a categorical distribution, it is not a necessity for autoregressive modeling. In this work, we propose to model the per-token probability distribution using a diffusion procedure, which allows us to apply autoregressive models in a continuous-valued space. Rather than using categorical cross-entropy loss, we define a Diffusion Loss function to model the per-token probability. This approach eliminates the need for discrete-valued tokenizers. We evaluate its effectiveness across a wide range of cases, including standard autoregressive models and generalized masked autoregressive (MAR) variants. By removing vector quantization, our image generator achieves strong results while enjoying the speed advantage of sequence modeling. We hope this work will motivate the use of autoregressive generation in other continuous-valued domains and applications. Code is available at https://github.com/LTH14/mar.

1 Introduction

Autoregressive models are currently the de facto solution for generative modeling in natural language processing [38, 39, 3]. These models predict the next word or token in a sequence based on the previous words as input. Given the discrete nature of language, the inputs and outputs of these models lie in a categorical, discrete-valued space. This prevailing approach has led to a widespread belief that autoregressive models are inherently linked to discrete representations. As a result, research on generalizing autoregressive models to continuous-valued domains—most notably, image generation—has intensely focused on discretizing the data [6, 13, 40].
A commonly adopted strategy is to train a discrete-valued tokenizer on images, which involves a finite vocabulary obtained by vector quantization (VQ) [51, 41]. Autoregressive models then operate on the discrete-valued token space, analogous to their language counterparts. In this work, we aim to address the following question: Is it necessary for autoregressive models to be coupled with vector-quantized representations? We note that the autoregressive nature, i.e., "predicting next tokens based on previous ones", is independent of whether the values are discrete or continuous. What is needed is a model of the per-token probability distribution, which can be measured by a loss function and from which samples can be drawn. Discrete-valued representations can be conveniently modeled by a categorical distribution, but this is not conceptually necessary. If alternative models of the per-token probability distribution are available, autoregressive models can be applied without vector quantization.

With this observation, we propose to model the per-token probability distribution by a diffusion procedure operating on continuous-valued domains. Our methodology leverages the principles of diffusion models [45, 24, 33, 10] for representing arbitrary probability distributions. Specifically, our method autoregressively predicts a vector z for each token, which serves as a conditioning for a denoising network (e.g., a small MLP). The denoising diffusion procedure enables us to represent an underlying distribution p(x|z) for the output x (Figure 1). This small denoising network is trained jointly with the autoregressive model, with continuous-valued tokens as the input and target. Conceptually, this small prediction head, applied to each token, behaves like a loss function for measuring the quality of z. We refer to this loss function as Diffusion Loss.

Figure 1: Diffusion Loss. Given a continuous-valued token x to be predicted, the autoregressive model produces a vector z, which serves as the condition of a denoising diffusion network (a small MLP). This offers a way to model the probability distribution p(x|z) of this token. This network is trained jointly with the autoregressive model by backpropagation. At inference time, with a predicted z, running the reverse diffusion procedure can sample a token following the distribution x ∼ p(x|z). This method eliminates the need for discrete-valued tokenizers.

Our approach eliminates the need for discrete-valued tokenizers. Vector-quantized tokenizers are difficult to train and are sensitive to gradient approximation strategies [51, 41, 40, 27]. Their reconstruction quality often falls short compared to continuous-valued counterparts [42]. Our approach allows autoregressive models to enjoy the benefits of higher-quality, non-quantized tokenizers.

To broaden the scope, we further unify standard autoregressive (AR) models [13] and masked generative models [4, 29] into a generalized autoregressive framework (Figure 3). Conceptually, masked generative models predict multiple output tokens simultaneously in a randomized order, while still maintaining the autoregressive nature of "predicting next tokens based on known ones". This leads to a masked autoregressive (MAR) model that can be seamlessly used with Diffusion Loss.
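To make the per-token sampling in Figure 1 concrete, the following is a minimal sketch of drawing x ∼ p(x|z) with a small conditional denoising network. It assumes a plain DDPM formulation with a linear noise schedule purely for illustration; the names (denoise_mlp, token_dim) and the schedule are assumptions, not the exact implementation.

import torch

def sample_token(denoise_mlp, z, token_dim, num_steps=100):
    # Reverse diffusion: start from Gaussian noise and iteratively denoise,
    # conditioning every step on the vector z produced by the AR/MAR model.
    betas = torch.linspace(1e-4, 0.02, num_steps)        # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(z.shape[0], token_dim)               # x_T ~ N(0, I)
    for t in reversed(range(num_steps)):
        t_batch = torch.full((z.shape[0],), t)
        eps = denoise_mlp(x, t_batch, z)                 # predicted noise, conditioned on z
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise          # x_{t-1}
    return x                                             # a sample from p(x|z)

Because the denoising network is small, this per-token loop is cheap compared with running the full Transformer, which is run only once per autoregressive step.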
We demonstrate by experiments the effectiveness of Diffusion Loss across a wide variety of cases, including AR and MAR models. It eliminates the need for vector-quantized tokenizers and consistently improves generation quality. Our loss function can be flexibly applied with different types of tokenizers. Further, our method enjoys the fast speed of sequence models. Our MAR model with Diffusion Loss can generate at a rate of <0.3 second per image while achieving a strong FID of <2.0 on ImageNet 256×256. Our best model approaches 1.55 FID. The effectiveness of our method reveals a largely uncharted realm of image generation: modeling the interdependence of tokens by autoregression, jointly with the per-token distribution by diffusion. This is in contrast with typical latent diffusion models [42, 37], in which the diffusion process models the joint distribution of all tokens. Given the effectiveness, speed, and flexibility of our method, we hope that Diffusion Loss will advance autoregressive image generation and be generalized to other domains in future research.

2 Related Work

Sequence Models for Image Generation. Pioneering efforts on autoregressive image models [17, 50, 49, 36, 7, 6] operate on sequences of pixels. Autoregression can be performed by RNNs [50], CNNs [49, 7], and, most recently and popularly, Transformers [36, 6]. Motivated by language models, another series of works [51, 41, 13, 40] model images as discrete-valued tokens. Autoregressive [13, 40] and masked generative models [4, 29] can operate on this discrete-valued token space. But discrete tokenizers are difficult to train, which has recently drawn special attention [27, 54, 32]. Related to our work, the recent GIVT [48] also focuses on continuous-valued tokens in sequence models. GIVT and our work both reveal the significance and potential of this direction. In GIVT, the token distribution is represented by Gaussian mixture models. It uses a pre-defined number of mixtures, which can limit the types of distributions it can represent. In contrast, our method leverages the effectiveness of the diffusion process for modeling arbitrary distributions.

Diffusion for Representation Learning. The denoising diffusion process has been explored as a criterion for visual self-supervised learning. For example, DiffMAE [53] replaces the L2 loss in the original MAE [21] with a denoising diffusion decoder; DARL [30] trains autoregressive models with a denoising diffusion patch decoder. These efforts have focused on representation learning rather than image generation. In their scenarios, generating diverse images is not a goal; these methods have not presented the capability of generating new images from scratch.

Diffusion for Policy Learning. Our work is conceptually related to Diffusion Policy [8] in robotics. In those scenarios, the distribution of actions is formulated as a denoising process conditioned on the robot's observations, which can be pixels or latents [8, 34]. In image generation, we can think of generating a token as an "action" to take. Despite this conceptual connection, the diversity of the generated samples is less of a core consideration in robotics than it is in image generation.

3 Method

In a nutshell, our image generation approach is a sequence model operating on a tokenized latent space [6, 13, 40]. But unlike previous methods that are based on vector-quantized tokenizers (e.g., variants of VQ-VAE [51, 13]), we aim to use continuous-valued tokenizers (e.g., [42]).
We propose Diffusion Loss, which makes sequence models compatible with continuous-valued tokens.

3.1 Rethinking Discrete-Valued Tokens

To begin with, we revisit the roles of discrete-valued tokens in autoregressive generation models. Denote by x the ground-truth token to be predicted at the next position. With a discrete tokenizer, x can be represented as an integer 0 ≤ x < K, where K is the size of the vocabulary.

Table 1 (excerpt): Diffusion Loss vs. cross-entropy loss for the default MAR variant (>1 tokens per step).
loss        FID↓ (w/o CFG)   IS↑ (w/o CFG)   FID↓ (w/ CFG)   IS↑ (w/ CFG)
CrossEnt    8.79             146.1           3.69            278.4
Diff Loss   3.50             201.4           1.98            290.3

Table 2: Flexibility of Diffusion Loss. Diffusion Loss can support different types of tokenizers. (i) VQ tokenizers: we treat the continuous-valued latent before VQ as the tokens. (ii) Tokenizers with a mismatched stride (here, 8): we group 2×2 tokens into a new token for sequence modeling. (iii) Consistency Decoder [35], a non-VQ tokenizer with a different decoder architecture. Here, rFID denotes the reconstruction FID of the tokenizer on the ImageNet training set. Settings for all entries: MAR-L, 400 epochs, ImageNet 256×256; all entries use Diff Loss. †: This tokenizer is trained by us on ImageNet using [42]'s code; the original ones from [42] were trained on OpenImages.
tokenizer (src, arch)   #tokens (raw → seq)   rFID↓   FID↓ (w/o CFG)   IS↑ (w/o CFG)   FID↓ (w/ CFG)   IS↑ (w/ CFG)
[42]   VQ-16            16² → 16²             5.87    7.82             151.7           3.64            258.5
[42]   KL-16            16² → 16²             1.43    3.50             201.4           1.98            290.3
[42]   KL-8             32² → 16²             1.20    4.33             180.0           2.05            283.9
[35]   Consistency      32² → 16²             1.30    5.76             170.6           3.23            271.0
[42]†  KL-16            16² → 16²             1.22    2.85             214.0           1.97            291.2

5 Experiments

We experiment on ImageNet [9] at a resolution of 256×256. We evaluate FID [22] and IS [43], and provide Precision and Recall as references following common practice [10]. We follow the evaluation suite provided by [10].

5.1 Properties of Diffusion Loss

Diffusion Loss vs. Cross-entropy Loss. We first compare continuous-valued tokens with Diffusion Loss against standard discrete-valued tokens with cross-entropy loss (Table 1). For fair comparisons, the tokenizers ("VQ-16" and "KL-16") are both downloaded from the LDM codebase [42]. These are popularly used tokenizers (e.g., [13, 42, 37]). The comparisons cover four variants of AR/MAR. As shown in Table 1, Diffusion Loss consistently outperforms the cross-entropy counterpart in all cases. Specifically, in MAR (e.g., the default), using Diffusion Loss reduces FID by ~50%-60% (relative). This is because the continuous-valued KL-16 has smaller compression loss than VQ-16 (discussed next with Table 2), and also because a diffusion process models distributions more effectively than a categorical one. In the following ablations, unless specified, we follow the "default" MAR setting in Table 1.

MLP width   params   FID↓ (w/o CFG)   IS↑ (w/o CFG)   FID↓ (w/ CFG)   IS↑ (w/ CFG)   inference time
256         2M       3.47             195.3           2.45            274.0          0.286 s / im.
512         6M       3.24             199.1           2.11            281.0          0.288 s / im.
1024        21M      2.85             214.0           1.97            291.2          0.288 s / im.
1536        45M      2.93             207.6           1.91            289.3          0.291 s / im.

Figure 4: Denoising MLP in Diffusion Loss. The denoising MLP is small and efficient. Here, the inference time involves the entire generation model, and the Transformer's size is 407M. Settings: MAR-L, 400 epochs, ImageNet 256×256, 3 MLP blocks.

Figure 5: Sampling steps of Diffusion Loss. We show the FID (left) and IS (right) w.r.t. the number of diffusion sampling steps. Using 100 steps is sufficient to achieve strong generation quality.

Figure 6: Temperature of Diffusion Loss. The temperature τ has a clear influence on both FID (left) and IS (right). Just like the temperature in discrete-valued autoregression, the temperature here also plays a critical role in continuous-valued autoregression.

Flexibility of Diffusion Loss. One significant advantage of Diffusion Loss is its flexibility with various tokenizers. We compare several publicly available tokenizers in Table 2. Diffusion Loss can be easily used even given a VQ tokenizer: we simply treat the continuous-valued latent before the VQ layer as the tokens. This variant gives us 7.82 FID (w/o CFG), comparing favorably with the 8.79 FID (Table 1) of cross-entropy loss using the same VQ tokenizer. This suggests the better capability of diffusion for modeling distributions. This variant also enables us to compare the VQ-16 and KL-16 tokenizers under the same loss. As shown in Table 2, VQ-16 has a much worse reconstruction FID (rFID) than KL-16, which consequently leads to a much worse generation FID (e.g., 7.82 vs. 3.50 in Table 2). Interestingly, Diffusion Loss also enables us to use tokenizers with mismatched strides. In Table 2, we study a KL-8 tokenizer whose stride is 8 and whose output sequence length is 32×32. Without increasing the sequence length of the generator, we group 2×2 tokens into a new token. Despite the mismatch, we obtain decent results: KL-8 gives us 2.05 FID, vs. KL-16's 1.98 FID. Further, this property allows us to investigate other tokenizers, e.g., Consistency Decoder [35], a non-VQ tokenizer with a different architecture/stride designed for different goals. For comprehensiveness, we also train a KL-16 tokenizer on ImageNet using the code of [42], noting that the original KL-16 in [42] was trained on OpenImages [28]. The comparison is in the last row of Table 2. We use this tokenizer in the following explorations.

Denoising MLP in Diffusion Loss. We investigate the denoising MLP in Figure 4. Even a very small MLP (e.g., 2M parameters) can lead to competitive results. As expected, increasing the MLP width helps improve generation quality; we have also explored increasing the depth and observed similar behavior. Note that our default MLP size (1024 width, 21M) adds only ~5% extra parameters to the MAR-L model. During inference, the diffusion sampler incurs a modest cost of ~10% of the overall running time. Increasing the MLP width has negligible extra cost in our implementation (Figure 4), partially because the main overhead is not computation but memory communication.

Sampling Steps of Diffusion Loss. Our diffusion process follows the common practice of DDPM [24, 10]: we train with a 1000-step noise schedule but run inference with fewer steps. Figure 5 shows that using 100 diffusion steps at inference is sufficient to achieve strong generation quality.

Temperature of Diffusion Loss. In the case of cross-entropy loss, the temperature is of central importance.
Diffusion Loss also offers a temperature counterpart for controlling diversity and fidelity. Figure 6 shows the influence of the temperature τ in the diffusion sampler (see Sec. 3.2) at inference time. The temperature τ plays an important role in our models, similar to the observations on cross-entropy-based counterparts (note that the cross-entropy results in Table 1 are with their optimal temperatures).

Figure 7: Speed/accuracy trade-off of the generation process. For MAR, a curve is obtained by varying the number of autoregressive steps (8 to 128). For DiT, a curve is obtained by varying the number of diffusion steps (50, 75, 150, 250) using its official code. We compare our implementations of AR and MAR. AR uses kv-cache for fast inference. The AR/MAR model size is L and the DiT model is DiT-XL. The star marker denotes our default MAR setting used in other ablations. We benchmark FID and speed on ImageNet 256×256 using one A100 GPU with a batch size of 256.

Table 3: System-level comparison on ImageNet 256×256 conditional generation. Diffusion Loss enables Masked Autoregression to achieve leading results in comparison with previous systems. †: LDM operates on continuous-valued tokens, though this result uses a quantized tokenizer.
                                      |        w/o CFG               |        w/ CFG
                           #params    | FID↓    IS↑    Pre.↑  Rec.↑  | FID↓    IS↑    Pre.↑  Rec.↑
pixel-based
ADM [10]                   554M       | 10.94   101.0  0.69   0.63   | 4.59    186.7  0.82   0.52
VDM++ [26]                 2B         | 2.40    225.3  -      -      | 2.12    267.7  -      -
vector-quantized tokens
Autoreg. w/ VQGAN [13]     1.4B       | 15.78   78.3   -      -      | -       -      -      -
MaskGIT [4]                227M       | 6.18    182.1  0.80   0.51   | -       -      -      -
MAGE [29]                  230M       | 6.93    195.8  -      -      | -       -      -      -
MAGVIT-v2 [55]             307M       | 3.65    200.5  -      -      | 1.78    319.4  -      -
continuous-valued tokens
LDM-4† [42]                400M       | 10.56   103.5  0.71   0.62   | 3.60    247.7  0.87   0.48
U-ViT-H/2-G [2]            501M       | -       -      -      -      | 2.29    263.9  0.82   0.57
DiT-XL/2 [37]              675M       | 9.62    121.5  0.67   0.67   | 2.27    278.2  0.83   0.57
DiffiT [19]                -          | -       -      -      -      | 1.73    276.5  0.80   0.62
MDTv2-XL/2 [14]            676M       | 5.06    155.6  0.72   0.66   | 1.58    314.7  0.79   0.65
GIVT [48]                  304M       | 5.67    -      0.75   0.59   | 3.35    -      0.84   0.53
MAR-B, Diff Loss           208M       | 3.48    192.4  0.78   0.58   | 2.31    281.7  0.82   0.57
MAR-L, Diff Loss           479M       | 2.60    221.4  0.79   0.60   | 1.78    296.0  0.81   0.60
MAR-H, Diff Loss           943M       | 2.35    227.8  0.79   0.62   | 1.55    303.7  0.81   0.62

5.2 Properties of Generalized Autoregressive Models

From AR to MAR. Table 1 also serves as a comparison of the AR/MAR variants, which we discuss next. First, replacing the raster order in AR with a random order brings a significant gain, e.g., reducing FID from 19.23 to 13.07 (w/o CFG). Next, replacing the causal attention with its bidirectional counterpart leads to another massive gain, e.g., reducing FID from 13.07 to 3.43 (w/o CFG). The random-order, bidirectional AR is essentially a form of MAR that predicts one token at a time. Predicting multiple tokens (">1") at each step can effectively reduce the number of autoregressive steps. In Table 1, we show that the MAR variant with 64 steps only slightly trades off generation quality. A more comprehensive trade-off comparison is discussed next.

Speed/accuracy Trade-off. Following MaskGIT [4], our MAR enjoys the flexibility of predicting multiple tokens at a time, controlled by the number of autoregressive steps at inference time. Figure 7 plots the speed/accuracy trade-off. MAR has a better trade-off than its AR counterpart, noting that AR uses the efficient kv-cache.
With Diffusion Loss, MAR also shows a favorable trade-off in comparison with the recently popular Diffusion Transformer (DiT) [37]. As a latent diffusion model, DiT models the interdependence across all tokens by the diffusion process. The speed/accuracy trade-off of DiT is mainly controlled by its number of diffusion steps. Unlike our diffusion process, which runs on a small MLP, the diffusion process of DiT involves the entire Transformer architecture. Our method is both more accurate and faster. Notably, our method can generate at a rate of <0.3 second per image with a strong FID of <2.0.

5.3 Benchmarking with Previous Systems

We compare with the leading systems in Table 3. We explore various model sizes (see Appendix B) and train for 800 epochs. Similar to autoregressive language models [3], we observe encouraging scaling behavior. Further investigation into scaling could be promising. Regarding metrics, we report 2.35 FID without CFG, largely outperforming other token-based methods. Our best entry reaches 1.55 FID and compares favorably with leading systems. Figure 8 shows qualitative results.

6 Discussion and Conclusion

The effectiveness of Diffusion Loss on various autoregressive models suggests new opportunities: modeling the interdependence of tokens by autoregression, jointly with the per-token distribution by diffusion. This is unlike the common usage of diffusion, which models the joint distribution of all tokens. Our strong results on image generation suggest that autoregressive models or their extensions are powerful tools beyond language modeling. These models do not need to be constrained by vector-quantized representations. We hope our work will motivate the research community to explore sequence models with continuous-valued representations in other domains.

Figure 8: Qualitative Results. We show selected examples of class-conditional generation on ImageNet 256×256 using MAR-H with Diffusion Loss.

Acknowledgements. Tianhong Li was supported by the MathWorks Fellowship during this project. We thank Congyue Deng and Xinlei Chen for helpful discussion. We thank the Google TPU Research Cloud (TRC) for granting us access to TPUs, and Google Cloud Platform for supporting GPU resources.

References

Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv:1607.06450, 2016.
Bao et al. [2022] Fan Bao, Chongxuan Li, Yue Cao, and Jun Zhu. All are worth words: A ViT backbone for score-based diffusion models. In NeurIPS 2022 Workshop on Score-Based Methods, 2022.
Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In NeurIPS, 2020.
Chang et al. [2022] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked generative image Transformer. In CVPR, 2022.
Chang et al. [2023] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, Yuanzhen Li, and Dilip Krishnan. Muse: Text-to-image generation via masked generative Transformers. In ICML, 2023.
Chen et al. [2020] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, 2020.
Chen et al. [2018] Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An improved autoregressive generative model. In ICML, 2018.
Chi et al. [2023] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In RSS, 2023.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
Dhariwal & Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.
Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
Elfwing et al. [2018] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 2018.
Esser et al. [2021] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming Transformers for high-resolution image synthesis. In CVPR, 2021.
Gao et al. [2023] Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Masked diffusion Transformer is a strong image synthesizer. In ICCV, 2023.
Goodfellow et al. [2014] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
Goyal et al. [2017] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
Gregor et al. [2014] Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. In ICML, 2014.
Gumbel [1954] Emil Julius Gumbel. Statistical theory of extreme values and some practical applications. Nat. Bur. Standards Appl. Math. Ser. 33, 1954.
Hatamizadeh et al. [2023] Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, and Arash Vahdat. DiffiT: Diffusion vision Transformers for image generation. arXiv:2312.02139, 2023.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
He et al. [2022] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
Ho & Salimans [2022] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv:2207.12598, 2022.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
Karras et al. [2023] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. arXiv:2312.02696, 2023.
Kingma & Gao [2023] Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. In NeurIPS, 2023.
Kolesnikov et al. [2022] Alexander Kolesnikov, André Susano Pinto, Lucas Beyer, Xiaohua Zhai, Jeremiah Harmsen, and Neil Houlsby. UViM: A unified modeling approach for vision with learned guiding codes. In NeurIPS, 2022.
Krasin et al. [2017] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. 2017.
Li et al. [2023] Tianhong Li, Huiwen Chang, Shlok Mishra, Han Zhang, Dina Katabi, and Dilip Krishnan. MAGE: Masked generative encoder to unify representation learning and image synthesis. In CVPR, 2023.
Li et al. [2024] Yazhe Li, Jorg Bornschein, and Ting Chen. Denoising autoregressive representation learning. arXiv:2403.05196, 2024.
Loshchilov & Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
Mentzer et al. [2024] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: VQ-VAE made simple. In ICLR, 2024.
Nichol & Dhariwal [2021] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021.
Octo Model Team et al. [2024] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In RSS, 2024.
OpenAI [2024] OpenAI. Consistency Decoder, 2024. URL https://github.com/openai/consistencydecoder.
Parmar et al. [2018] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image Transformer. In ICML, 2018.
Peebles & Xie [2023] William Peebles and Saining Xie. Scalable diffusion models with Transformers. In ICCV, 2023.
Radford et al. [2018] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
Radford et al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
Razavi et al. [2019] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In NeurIPS, 2019.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016.
Shazeer [2019] Noam Shazeer. Fast Transformer decoding: One write-head is all you need. arXiv:1911.02150, 2019.
Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
Song & Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019.
Song et al. [2021] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.
Tschannen et al. [2023] Michael Tschannen, Cian Eastwood, and Fabian Mentzer. GIVT: Generative infinite-vocabulary Transformers. arXiv:2312.02116, 2023.
van den Oord et al. [2016a] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders. In NeurIPS, 2016.
van den Oord et al. [2016b] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
van den Oord et al. [2017] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NeurIPS, 2017.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
Wei et al. [2023] Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, and Christoph Feichtenhofer. Diffusion models as masked autoencoders. In ICCV, 2023.
Yu et al. [2023] Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, and Lu Jiang. MAGVIT: Masked generative video Transformer. In CVPR, 2023.
Yu et al. [2024] Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, Boqing Gong, Ming-Hsuan Yang, David A. Ross, Irfan Essa, and Lu Jiang. Language model beats diffusion: Tokenizer is key to visual generation. In ICLR, 2024.
Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

Figure 10: Failure cases. Similar to existing methods, our system can produce results with noticeable artifacts. For each pair, we show MAR-H and DiT-XL's results for the same class. The leftmost example of DiT is taken from their paper [37]; the others are obtained from their official code.

Appendix A Limitations and Broader Impacts

Limitations. Beyond demonstrating the potential of our method for image generation, this paper acknowledges its limitations. First of all, our image generation system can produce images with noticeable artifacts (Figure 10). This limitation is commonly observed in existing methods, especially when trained on controlled, academic data (e.g., ImageNet). Research-driven models trained on ImageNet still show a noticeable gap in visual quality compared with commercial models trained on massive data. Second, our image generation system relies on existing pre-trained tokenizers. The quality of our system can be limited by the quality of these tokenizers. Pre-training better tokenizers is beyond the scope of this paper. Nevertheless, we hope our work will make it easier to use the continuous-valued tokenizers that will be developed in the future. Last, we note that given limited computational resources, we have primarily tested our method on the ImageNet benchmark. Further validation is needed to assess the scalability and robustness of our approach in more diverse and real-world scenarios.

Broader Impacts.
Our primary aim is to advance fundamental research on generative models, and we believe it will be beneficial to this field. An immediate application of our method is to extend it to large visual generation models, e.g., text-to-image or text-to-video generation. Our approach has the potential to significantly reduce the training and inference cost of such large models. At the same time, our method may suggest the opportunity to replace traditional loss functions with Diffusion Loss in many applications. On the negative side, our method learns statistics from the training dataset and as such may reflect biases in the data; the image generation system may also be misused to generate disinformation, which warrants further consideration.

Appendix B Additional Implementation Details

Classifier-free guidance (CFG). To support CFG [23], at training time the class condition is replaced with a dummy class token for 10% of the samples [23]. At inference time, the model is run with both the given class token and the dummy token, providing two conditioning outputs z_c and z_u. The predicted noise ε is then modified [23] as: ε = ε_θ(x_t | t, z_u) + ω · (ε_θ(x_t | t, z_c) − ε_θ(x_t | t, z_u)), where ω is the guidance scale. At inference time, we use a CFG schedule following [5]. We sweep the optimal guidance scale and temperature combination for each model.

Training. By default, the models are trained using the AdamW optimizer [31] for 400 epochs. The weight decay and momenta for AdamW are 0.02 and (0.9, 0.95). We use a batch size of 2048 and a learning rate (lr) of 8e-4. Our models with Diffusion Loss are trained with a 100-epoch linear lr warmup [16], followed by a constant lr schedule [37]. The cross-entropy counterparts are trained with a cosine lr schedule, which works better for them. Following [37, 25], we maintain an exponential moving average (EMA) of the model parameters with a momentum of 0.9999.

Implementation Details of Table 3. To explore our method's scaling behavior, we study three model sizes. In addition to MAR-L, we explore a smaller model (MAR-B) and a larger model (MAR-H). MAR-B, -L, and -H have 24, 32, and 40 Transformer blocks with widths of 768, 1024, and 1280, respectively. In Table 3 specifically, the denoising MLP has 6, 8, and 12 blocks with widths of 1024, 1280, and 1536, respectively. The training length is increased to 800 epochs. At inference time, we run 256 autoregressive steps to achieve the best results.

Pseudo-code of Diffusion Loss. See Algorithm 1.
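To make the CFG combination above concrete, here is a minimal sketch. The function and argument names (denoise_mlp, z_cond, z_uncond) are illustrative assumptions rather than the exact implementation; only the combination rule follows the equation given above.

import torch

def cfg_noise_prediction(denoise_mlp, x_t, t, z_cond, z_uncond, guidance_scale):
    # Two forward passes of the denoising network: one conditioned on the
    # class-derived vector z_c, one on the dummy (unconditional) vector z_u.
    eps_cond = denoise_mlp(x_t, t, z_cond)
    eps_uncond = denoise_mlp(x_t, t, z_uncond)
    # Combine with the guidance scale omega: eps_u + omega * (eps_c - eps_u).
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

This modified noise prediction would then be used in place of the unguided prediction at every reverse diffusion step.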
Compute Resources. Our training is mainly done on 16 servers with 8 V100 GPUs each. Training a MAR-L model for 400 epochs takes ~2.6 days on these GPUs. As a comparison, training a DiT-XL/2 or LDM-4 model for the same number of epochs on this cluster takes 4.6 and 9.5 days, respectively.

Algorithm 1 Diffusion Loss: PyTorch-like Pseudo-code

import torch
import torch.nn as nn

class DiffusionLoss(nn.Module):
    def __init__(self, depth, width):
        super().__init__()
        # SimpleMLP takes in x_t, timestep, and condition, and outputs predicted noise.
        self.net = SimpleMLP(depth, width)
        # GaussianDiffusion offers forward and backward functions q_sample and p_sample.
        self.diffusion = GaussianDiffusion()

    # Given condition z and ground-truth token x, compute the loss.
    def loss(self, z, x):
        # sample random noise and timestep
        noise = torch.randn(x.shape)
        timestep = torch.randint(0, self.diffusion.num_timesteps, (x.size(0),))
        # sample x_t from x
        x_t = self.diffusion.q_sample(x, timestep, noise)
        # predict noise from x_t
        noise_pred = self.net(x_t, timestep, z)
        # L2 loss on the noise prediction
        loss = ((noise_pred - noise) ** 2).mean()
        # optional: loss += loss_vlb
        return loss

    # Given condition z and noise, sample x using the reverse diffusion process.
    def sample(self, z, noise):
        x = noise
        for t in list(range(self.diffusion.num_timesteps))[::-1]:
            x = self.diffusion.p_sample(self.net, x, t, z)
        return x

Pseudo-code illustrating the concept of Diffusion Loss. Here the conditioning vector z is the output of the AR/MAR model. The gradient is backpropagated to z. For simplicity, we omit the code for inference rescheduling, temperature, and the loss term for the variational lower bound [10], which can be easily incorporated.

Appendix C Comparison between MAR and MAGE

MAR (regardless of the loss used) is conceptually related to MAGE [29]. Besides implementation differences (e.g., architecture specifics, hyper-parameters), a major conceptual difference between MAR and MAGE lies in the scanning order at inference time. In MAGE, following MaskGIT [4], the locations of the next tokens to be predicted are determined on the fly by the sample confidence at each location, i.e., more confident locations are more likely to be selected at each step [4, 29]. In contrast, MAR adopts a fully randomized order, and its temperature sampling is applied to each token. Table 4 compares this difference in controlled settings. The first line is our MAR implementation using MAGE's on-the-fly ordering strategy, which has similar results to the simpler random-order counterpart. Fully randomized ordering makes the training and inference processes consistent regarding the distribution of orders; it also allows us to adopt token-wise temperature sampling in a way similar to autoregressive language models (e.g., GPT [38, 39, 3]).

Table 4: To compare conceptually with MAGE, we run MAR's inference using the MAGE strategy of determining the order on the fly by confidence sampling across the spatial domain. These entries are all based on the tokenizers provided by the LDM codebase [42].
                 order        loss       FID↓   IS↑
MAR, our impl.   on-the-fly   CrossEnt   8.72   145.6
MAR, our impl.   random       CrossEnt   8.79   146.1
MAR, our impl.   random       Diff Loss  3.50   201.4

Appendix D Additional Comparisons

D.1 Autoregressive Image Generation in Pixel Space

Our MAR+DiffLoss approach can also be directly applied to model the RGB pixel space, without the need for an image tokenizer.
To demonstrate this, we conducted an experiment on ImageNet 64×64, grouping every 4×4 pixels into a single token for the Diffusion Loss to model. A MAR-L+DiffLoss model trained for 400 epochs achieved an FID of 2.93, demonstrating the potential to eliminate the need for tokenizers in autoregressive image generation. However, as commonly observed in the diffusion model literature, directly modeling the pixel space is significantly more computationally expensive than using a tokenizer. For MAR+DiffLoss, directly modeling pixels at higher resolutions might require either a much longer sequence length for the autoregressive Transformer or a substantially larger network for the Diffusion Loss to handle larger patches. We leave this exploration for future work.

D.2 ImageNet 512×512

Following previous works, we also report results on ImageNet at a resolution of 512×512, compared with leading systems (Table 5). For simplicity, we use the KL-16 tokenizer, which gives a sequence length of 32×32 on a 512×512 image. Other settings follow the MAR-L configuration described in Table 3. Our method achieves an FID of 2.74 without CFG and 1.73 with CFG. Our results are competitive with those of previous systems. Due to limited resources, we have not trained the larger MAR-H on ImageNet 512×512, which is expected to have better results.

Table 5: System-level comparison on ImageNet 512×512 conditional generation. MAR's CFG scale is set to 4.0; other settings follow the MAR-L configuration described in Table 3.
                            |   w/o CFG        |   w/ CFG
                  #params   | FID↓    IS↑      | FID↓    IS↑
pixel-based
ADM [10]          554M      | 23.24   58.1     | 7.72    172.7
VDM++ [26]        2B        | 2.99    232.2    | 2.65    278.1
vector-quantized tokens
MaskGIT [4]       227M      | 7.32    156.0    | -       -
MAGVIT-v2 [55]    307M      | 3.07    213.1    | 1.91    324.3
continuous-valued tokens
U-ViT-H/2-G [2]   501M      | -       -        | 4.05    263.8
DiT-XL/2 [37]     675M      | 12.03   105.3    | 3.04    240.8
DiffiT [19]       -         | -       -        | 2.67    252.1
GIVT [48]         304M      | 8.35    -        | -       -
EDM2-XXL [25]     1.5B      | 1.91    -        | 1.81    -
MAR-L, Diff Loss  481M      | 2.74    205.2    | 1.73    279.9

D.3 L2 Loss vs. Diff Loss

A naïve baseline for continuous-valued tokens is to compute the Mean Squared Error (MSE, i.e., L2) loss directly between the predictions and the target tokens. In the case of a raster-order AR model, using the L2 loss introduces no randomness and thus cannot generate diverse samples. In the case of the MAR models with the L2 loss, the only randomness is the sequence order; the prediction at a location is deterministic for any given order. In our experiment, we have trained a MAR model with the L2 loss, which as expected leads to a disastrous FID score (>100).
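To spell out this contrast in code, below is a minimal sketch of the two per-token heads; the class and variable names (L2Head, cond_dim, B) are illustrative assumptions, not the actual implementation. An L2-regression head maps a given z to one fixed token, whereas the diffusion head of Algorithm 1 draws a different sample from p(x|z) on each call.

import torch
import torch.nn as nn

class L2Head(nn.Module):
    # Deterministic baseline: the same conditioning z always yields the same token,
    # so the only remaining randomness in MAR would be the generation order.
    def __init__(self, cond_dim, token_dim):
        super().__init__()
        self.proj = nn.Linear(cond_dim, token_dim)

    def forward(self, z):
        return self.proj(z)  # point estimate, trained with an MSE loss

# In contrast, the diffusion head (Algorithm 1) is stochastic:
#   x1 = diffusion_loss.sample(z, torch.randn(B, token_dim))
#   x2 = diffusion_loss.sample(z, torch.randn(B, token_dim))
# x1 and x2 generally differ, giving per-token diversity even for a fixed order.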