Goku: Flow Based Video Generative Foundation Models

Shoufa Chen1∗ Chongjian Ge1∗ Yuqi Zhang2 Yida Zhang2 Fengda Zhu2 Hao Yang2 Hongxiang Hao2 Hui Wu2 Zhichao Lai2 Yifei Hu2 Ting-Che Lin2 Shilong Zhang1 Fu Li2 Chuan Li2 Xing Wang2 Yanghua Peng2 Peize Sun1 Ping Luo1 Yi Jiang2 Zehuan Yuan2 Bingyue Peng2 Xiaobing Liu2
1 The University of Hong Kong 2 Bytedance Inc
∗ Equal Contribution
arXiv:2502.04896v2 [cs.CV] 10 Feb 2025
https://saiyan-world.github.io/goku/

Abstract

This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.

1. Introduction

Video generation has garnered significant attention owing to its transformative potential across a wide range of applications, such as media content creation (Polyak et al., 2024), advertising (Zhang et al., 2024; Bacher et al., 2021), video games (Yang et al., 2024b; Valevski et al., 2024; Quevedo et al., 2024), and world model simulators (Ha and Schmidhuber, 2018; Brooks et al., 2024; Agarwal et al., 2025).
Benefiting from advanced generative algorithms (Goodfellow et al., 2014; Ho et al., 2020; Liu et al., 2023; Lipman et al., 2023), scalable model architectures (Vaswani et al., 2017; Peebles and Xie, 2023), vast amounts of internet-sourced data (Chen et al., 2024b; Nan et al., 2024; Ju et al., 2024), and ongoing expansion of computing capabilities (Corporation, 2022, 2023, 2024), remarkable advancements have been achieved in the field of video generation (Ho et al., 2022b,a; Singer et al., 2023; Blattmann et al., 2023b; Brooks et al., 2024; Kuaishou, 2024; Yang et al., 2024c; Jin et al., 2024; Polyak et al., 2024; Kong et al., 2024; Ji et al., 2024).

In this work, we present Goku, a family of rectified flow (Lipman et al., 2023; Liu et al., 2023) transformer models designed for joint image and video generation, establishing a pathway toward industry-grade performance. This report centers on four key components: data curation, model architecture design, flow formulation, and training infrastructure optimization, each rigorously refined to meet the demands of high-quality, large-scale video generation.

Figure 1 | Generated samples from Goku: (a) Text-to-Image Samples, (b) Text-to-Video Samples, each shown with its text prompt. Key components are highlighted in RED.

First, we present a comprehensive data processing pipeline designed to construct large-scale, high-quality image and video-text datasets. The pipeline integrates multiple advanced techniques, including video and image filtering based on aesthetic scores, OCR-driven content analysis, and subjective evaluations, to ensure exceptional visual and contextual quality. Furthermore, we employ multimodal large language models (MLLMs) (Yuan et al., 2025) to generate dense and contextually aligned captions, which are subsequently refined using an additional large language model (LLM) (Yang et al., 2024a) to enhance their accuracy, fluency, and descriptive richness. As a result, we have curated a robust training dataset comprising approximately 36M video-text pairs and 160M image-text pairs, which are proven sufficient for training industry-level generative models.

Secondly, we take a pioneering step by applying rectified flow formulation (Lipman et al., 2023) for joint image and video generation, implemented through the Goku model family, which comprises Transformer architectures with 2B and 8B parameters. At its core, the Goku framework employs a 3D joint image-video variational autoencoder (VAE) to compress image and video inputs into a shared latent space, facilitating unified representation.
This shared latent space is coupled with a full-attention (Vaswani et al., 2017) mechanism, enabling seamless joint training of image and video. This architecture delivers high-quality, coherent outputs across both images and videos, establishing a unified framework for visual generation tasks. Furthermore, to support the training of Goku at scale, we have developed a robust infrastructure tailored for large-scale model training. Our approach incorporates advanced parallelism strategies (Jacobs et al., 2023; Zhao et al., 2023) to manage memory efficiently during long-context training. Additionally, we employ ByteCheckpoint (Wan et al., 2024) for high-performance checkpointing and integrate fault-tolerant mechanisms from MegaScale (Jiang et al., 2024) to ensure stability and scalability across large GPU clusters. These optimizations enable Goku to handle the computational and data challenges of generative modeling with exceptional efficiency and reliability. We evaluate Goku on both text-to-image and text-to-video benchmarks to highlight its competitive advantages. For text-to-image generation, Goku-T2I demonstrates strong performance across multiple benchmarks, including T2I-CompBench (Huang et al., 2023), GenEval (Ghosh et al., 2024), and DPG-Bench (Hu et al., 2024), excelling in both visual quality and text-image alignment. In text-to-video benchmarks, Goku-T2V achieves state-of-the-art performance on the UCF-101 (Soomro et al., 2012) zero-shot generation task. Additionally, Goku-T2V attains an impressive score of 84.85 on VBench (Huang et al., 2024), securing the top position on the leaderboard (as of 2025-01-25) and surpassing several leading commercial text-to-video models. Qualitative results, illustrated in Figure 1, further demonstrate the superior quality of the generated media samples. These findings underscore Goku’s effectiveness in multi-modal generation and its potential as a high-performing solution for both research and commercial applications. 2. 
Goku: Generative Flow Models for Visual Creation

In this section, we present three core components of Goku: the image-video joint VAE (Yang et al., 2024c), the Goku Transformer architecture, and the rectified flow formulation. These components are designed to work synergistically, forming a cohesive and scalable framework for joint image and video generation. During training, each raw video input $x \in \mathbb{R}^{T \times H \times W \times 3}$ (with images treated as a special case where $T = 1$) is encoded from the pixel space to a latent space using a 3D image-video joint VAE (Section 2.1). The encoded latents are then organized into mini-batches containing both video and image representations, facilitating the learning of a unified cross-modal representation. Subsequently, the rectified flow formulation (Section 2.3) is applied to these latents, leveraging a series of Transformer blocks (Section 2.2) to model complex temporal and spatial dependencies effectively.

Model     Layers   Model Dimension   FFN Dimension   Attention Heads
Goku-1B   28       1152              4608            16
Goku-2B   28       1792              7168            28
Goku-8B   40       3072              12288           48

Table 1 | Architecture configurations for Goku Models. The Goku-1B model is only used for pilot experiments in Section 2.3.

2.1. Image-Video Joint VAE

Earlier research (He et al., 2022; Rombach et al., 2022; Esser et al., 2021) demonstrates that diffusion and flow-based models can significantly improve efficiency and performance by modeling in latent space through a Variational Auto-Encoder (VAE) (Esser et al., 2021; Kingma, 2013). Inspired by Sora (Brooks et al., 2024), the open-source community has introduced 3D-VAE to explore spatio-temporal compression within latent spaces for video generation tasks (Lab and etc., 2024; Zheng et al., 2024; Yang et al., 2024c). To extend the advantages of latent space modeling across multiple media formats, including images and videos, we adopt a jointly trained Image-Video VAE (Yang et al., 2024c) that handles both image and video data within a unified framework.
Specifically, for videos, we apply a compression stride of 8 × 8 × 4 across height, width, and temporal dimensions, respectively, while for images, the compression stride is set to 8 × 8 in spatial dimensions.

2.2. Transformer Architectures

The design of the Goku Transformer block builds upon GenTron (Chen et al., 2024a), an extension of the class-conditioned diffusion transformer (Peebles and Xie, 2023) for text-to-image/video tasks. It includes a self-attention module for capturing inter-token correlations, a cross-attention layer to integrate textual conditional embeddings (extracted via the Flan-T5 language model (Chung et al., 2024)), a feed-forward network (FFN) for feature projection, and a layer-wise adaLN-Zero block that incorporates timestep information to guide feature transformations. Additionally, we introduce several recent design enhancements to improve model performance and training stability, as detailed below.

Plain Full Attention. In Transformer-based video generative models, previous approaches (Chen et al., 2024a; Wu et al., 2023; Singer et al., 2023; Blattmann et al., 2023b) typically combine temporal attention with spatial attention to extend text-to-image generation to video. While this method reduces computational cost, it is sub-optimal for modeling complex temporal motions, as highlighted in prior work (Yang et al., 2024c; Polyak et al., 2024). In Goku, we adopt full attention to model multi-modal tokens (image and video) within a unified network. Given the large number of video tokens remaining after VAE processing, particularly for high-framerate, long-duration videos, we leverage FlashAttention (Shah et al., 2024; Dao, 2024) and sequence parallelism (Li et al., 2021) to optimize both GPU memory usage and computational efficiency.

Patch n’ Pack.
To enable joint training on images and videos of varying aspect ratios and lengths, we follow the approach from NaViT (Dehghani et al., 2024), packing both modalities into a single minibatch along the sequence dimension. This method allows flexible mixing of training instances with different sequence lengths into a single batch, eliminating the need for data buckets (Podell et al., 2023).

3D RoPE Position Embedding. Rotary Position Embedding (RoPE) (Su et al., 2024) has demonstrated effectiveness in LLMs by enabling greater sequence length flexibility and reducing inter-token dependencies as relative distances increase. During joint training, we apply 3D RoPE embeddings to image and video tokens, leveraging their extrapolation capability to accommodate varying resolutions. This adaptability makes RoPE particularly suited for handling diverse resolutions and video lengths. Furthermore, our empirical analysis revealed that RoPE converges faster than sinusoidal positional embeddings during transitions across different training stages.

Q-K Normalization. Training large-scale Transformers can occasionally result in loss spikes, which may lead to model corruption, manifesting as severe artifacts or even pure noise in generated images or videos. To mitigate this issue, we incorporate query-key normalization (Dehghani et al., 2023) to stabilize the training process. Specifically, we apply RMSNorm (Zhang and Sennrich, 2019) to each query-key feature prior to attention computation, ensuring smoother and more reliable training dynamics.

The overall Transformer model is constructed by stacking a sequence of blocks as described above. To address varying computational demands and performance requirements, we design three model variants, summarized in Table 1. The Goku-1B model serves as a lightweight option for pilot experiments.
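The query-key normalization described above can be illustrated with a minimal, single-head sketch. It is not the Goku implementation: the function names (`rms_norm`, `attention_logits`, `qk_norm_attention`) and the toy inputs are hypothetical, and the learned gain that RMSNorm layers usually carry is omitted for brevity. The point it shows is that normalizing each query and key bounds the attention logits by the square root of the head dimension, regardless of how large the raw features grow.

```python
import math

def rms_norm(vec, eps=1e-6):
    """RMSNorm without a learned gain: scale the vector to unit root mean square."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_logits(queries, keys, normalize):
    """Scaled dot-product logits, optionally with RMSNorm on each query and key."""
    dim = len(queries[0])
    if normalize:
        queries = [rms_norm(q) for q in queries]
        keys = [rms_norm(k) for k in keys]
    return [[sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
            for q in queries]

def qk_norm_attention(queries, keys, values):
    """Single-head attention with query-key normalization."""
    out = []
    for row in attention_logits(queries, keys, normalize=True):
        weights = softmax(row)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy inputs (assumed): the first query and key have very large magnitudes.
queries = [[300.0, -400.0, 100.0, 50.0], [0.1, 0.2, -0.1, 0.3]]
keys = [[250.0, -350.0, 80.0, 10.0], [1.0, 0.5, -0.5, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

raw = attention_logits(queries, keys, normalize=False)
normed = attention_logits(queries, keys, normalize=True)
output = qk_norm_attention(queries, keys, values)
```

Without normalization the largest logit in this toy case is in the hundreds of thousands, which saturates the softmax; with normalization every logit stays within [-2, 2] for a head dimension of 4.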
The Goku-2B variant consists of 28 layers, each with a model dimension of 1792 and 28 attention heads, providing a balance between computational efficiency and expressive capacity. In contrast, the larger Goku-8B variant features 40 layers, a model dimension of 3072, and 48 attention heads, delivering superior modeling capacity aimed at achieving high generation quality.

2.3. Flow-based Training

Our flow-based formulation is rooted in the rectified flow (RF) algorithm (Albergo and Vanden-Eijnden, 2023; Lipman et al., 2023; Liu et al., 2023), where a sample is progressively transformed from a prior distribution, such as a standard normal distribution, to the target data distribution. This transformation is achieved by defining the forward process as a series of linear interpolations between the prior and target distributions. Specifically, given a real data sample $\mathbf{x}_1$ from the target distribution and a noise sample $\mathbf{x}_0 \sim \mathcal{N}(0, 1)$ from the prior distribution, a training example is constructed through linear interpolation:

$$\mathbf{x}_t = t \cdot \mathbf{x}_1 + (1 - t) \cdot \mathbf{x}_0, \tag{1}$$

where $t \in [0, 1]$ represents the interpolation coefficient. The model is trained to predict the velocity, defined as the time derivative of $\mathbf{x}_t$, $\mathbf{v}_t = \frac{d\mathbf{x}_t}{dt}$ (which equals $\mathbf{x}_1 - \mathbf{x}_0$ under Equation (1)), and which guides the transformation of intermediate samples $\mathbf{x}_t$ towards the real data $\mathbf{x}_1$ during inference. By establishing a direct, linear interpolation between data and noise, RF simplifies the modeling process, providing improved theoretical properties, conceptual clarity, and faster convergence across data distributions.

Goku takes a pioneering step by adopting a flow-based formulation for joint image-and-video generation. We conduct a pilot experiment to validate the rapid convergence of flow-based training by performing class-conditional generation with Goku-1B, a model specifically designed for these proof-of-concept experiments, on ImageNet-1K (256 × 256) (Deng et al., 2009).
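The interpolation in Equation (1) and its velocity target can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the training code: the helper names (`rf_training_pair`, `euler_sample`) are hypothetical, samples are plain Python lists rather than latent tensors, and the Euler integrator is one simple choice of sampler that the text does not specify. In place of a trained network, the example feeds the exact velocity back into the sampler to show that integrating from t = 0 to t = 1 carries the noise sample to the data sample.

```python
import random

def rf_training_pair(x1, t, rng):
    """Build (x0, x_t, v_target) for one data sample x1 at time t, following Eq. (1)."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                   # noise sample from the prior
    xt = [t * a + (1.0 - t) * b for a, b in zip(x1, x0)]     # x_t = t * x1 + (1 - t) * x0
    v_target = [a - b for a, b in zip(x1, x0)]               # d x_t / dt = x1 - x0
    return x0, xt, v_target

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data) with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
x1 = [1.0, -2.0, 0.5]                      # stand-in for a data latent
x0, xt, v_target = rf_training_pair(x1, 0.25, rng)

# A trained model would approximate v_target from (x_t, t); here the exact
# velocity is used, so the sampler recovers x1 up to floating-point error.
recovered = euler_sample(lambda x, t: v_target, x0, num_steps=10)
```

Because the target velocity is constant along each straight path, even a coarse Euler discretization is exact in this idealized case; a learned velocity field is only approximately straight, so the number of steps matters in practice.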
The model is configured with 28 layers, an attention dimension of 1152, and 16 attention heads. To evaluate performance, we compare key metrics, such as FID-50K and Inception Score (IS), for models trained using the denoising diffusion probabilistic model (DDPM) (Ho et al., 2020) and rectified flow. As shown in Table 2, RF demonstrates faster convergence than DDPM. For instance, Goku-1B (RF) reaches an FID-50K of 2.16 after 400k training steps, whereas Goku-1B (DDPM) requires 1000k steps to reach a similar level of performance (2.26).

Loss             Steps   FID ↓    sFID ↓   IS ↑       Precision ↑   Recall ↑
DDPM             200k    3.0795   4.3498   226.4783   0.8387        0.5317
DDPM             400k    2.5231   4.3821   265.0612   0.8399        0.5591
DDPM             1000k   2.2568   4.4887   286.5601   0.8319        0.5849
Rectified Flow   200k    2.7472   4.6416   232.3090   0.8239        0.5590
Rectified Flow   400k    2.1572   4.5022   261.1203   0.8210        0.5871

Table 2 | Proof-of-concept experiments of class-conditional generation on ImageNet 256×256. Rectified flow achieves faster convergence compared to DDPM.

2.4. Training Details

Multi-stage Training. Directly optimizing joint image-and-video training poses significant challenges, as the network must simultaneously learn spatial semantics critical for images and temporal motion dynamics essential for videos. To tackle this complexity, we introduce a decomposed, multi-stage training strategy that progressively enhances the model’s capabilities, ensuring effective and robust learning across both modalities.

• Stage-1: Text-Semantic Pairing. In the initial stage, we focus on establishing a solid understanding of text-to-image relationships by pretraining Goku on text-to-image tasks. This step is critical for grounding the model in basic semantic comprehension, enabling it to learn to associate textual prompts with high-level visual semantics.
Through this process, the model develops a reliable capacity for representing visual concepts essential for both image and video generation, such as object attributes, spatial configurations, and contextual coherence.

• Stage-2: Image-and-video joint learning. Building on the foundational capabilities of text-to-semantic pairing, we extend Goku to joint learning across both image and video data. This stage leverages the unified framework of Goku, which employs a global attention mechanism adaptable to both images and videos. In addition, acquiring a substantial volume of high-quality video data is generally more resource-intensive than obtaining a similar amount of high-quality image data. To address this disparity, our framework integrates images and videos into unified token sequences during training, enabling the rich information inherent in high-quality images to enhance the generation of video frames (Chen et al., 2024a). By curating a carefully balanced dataset of images and videos, Goku not only gains the capability to generate both high-quality images and videos but also enhances the visual quality of videos by leveraging the rich information from high-quality image data.

• Stage-3: Modality-specific finetuning. In the final stage, we fine-tune Goku for each specific modality to further enhance its output quality. For text-to-image generation, we implement image-centric adjustments aimed at producing more visually compelling images. For text-to-video generation, we focus on adjustments that improve temporal smoothness, motion continuity, and stability across frames, resulting in realistic and fluid video outputs.

Cascaded Resolution Training. In the second stage of joint training, we adopt a cascade resolution strategy to optimize the learning process. Initially, training is conducted on low-resolution image and video data (288 × 512), enabling the model to efficiently focus on fundamental text-semantic-motion relationships at reduced computational costs.
Once these core interactions are well-established, the resolution of the training data is progressively increased, transitioning from 480 × 864 to 720 × 1280. This stepwise resolution enhancement allows Goku to refine its understanding of intricate details and improve overall image fidelity, ultimately leading to superior generation quality for both images and videos.

2.5. Image-to-Video

To extend Goku to accept an image as an additional condition for video generation, we employ a widely used strategy by using the first frame of each clip as the reference image (Girdhar et al., 2023; Blattmann et al., 2023a; Yang et al., 2024c). The corresponding image tokens are broadcast and concatenated with the paired noised video tokens along the channel dimension. To fully leverage the pretrained knowledge during fine-tuning, we introduce a single MLP layer for channel alignment, while keeping the rest of the model architecture identical to Goku-T2V.

3. Infrastructure Optimization

To achieve scalable and efficient training of Goku, we first adopt advanced parallelism strategies (Section 3.1) to handle the challenges of long-context, large-scale models. To further optimize memory usage and balance computation with communication, we implement fine-grained Activation Checkpointing (Section 3.2). Additionally, we integrate robust fault tolerance mechanisms from MegaScale, enabling automated fault detection and recovery with minimal disruption (Section 3.3). Finally, ByteCheckpoint is utilized to ensure efficient and scalable saving and loading of training states, supporting flexibility across diverse hardware configurations (Section 3.4). The details of these optimizations are introduced below.

3.1. Model Parallelism Strategies

The substantial model size and the exceptionally long sequence length (exceeding 220K tokens for the longest sequence) necessitate the adoption of multiple parallelism strategies to ensure efficient training.
Specifically, we employ 3D parallelism to achieve scalability across three axes: input sequences, data, and model parameters.

Sequence-Parallelism (SP) (Korthikanti et al., 2023; Li et al., 2021; Jacobs et al., 2023) slices the input across the sequence dimension for independent layers (e.g., LayerNorm) to eliminate redundant computations, reduce memory usage, and support padding for non-conforming input. We adopt Ulysses (Jacobs et al., 2023) as our implementation, which shards samples across the sequence parallel group from the start of the training loop. During attention computation, it uses all-to-all communication to distribute query, key, and value shards, allowing each worker to process the full sequence but only a subset of attention heads. After parallel computation of attention heads, another all-to-all communication aggregates the results, recombining all heads and the sharded sequence dimension.

Fully Sharded Data Parallelism (FSDP) (Zhao et al., 2023) partitions all parameters, gradients, and optimizer states across the data parallel ranks. Instead of the all-reduce used in Distributed Data Parallelism, FSDP performs all-gather for parameters and reduce-scatter for gradients, enabling overlap with forward and backward computations to potentially reduce communication overhead. In our case, we adopt the HYBRID_SHARD strategy, which combines FULL_SHARD within a shard group with parameter replication across such groups, effectively implementing data parallelism (DP) across groups. This approach minimizes communication costs by limiting all-gather and reduce-scatter operations to within each shard group.

3.2. Activation Checkpointing

While the parallelism methods discussed in Section 3.1 provide significant memory savings and enable large-scale training with long sequences, they inevitably introduce communication overhead among ranks, which can lead to suboptimal overall performance.
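The Ulysses-style exchange described in Section 3.1 can be made concrete with a small single-process simulation of the data layout. This sketch only models which worker holds which (token, head) entries before and after each all-to-all; it performs no real communication, and the function names and sizes are hypothetical choices for illustration. It assumes the sequence length and head count are both divisible by the number of workers.

```python
def shard_by_sequence(seq_len, num_heads, world):
    """Worker r holds tokens [r*S/P, (r+1)*S/P) with all heads: shard[token][head]."""
    per = seq_len // world
    return [[[(s, h) for h in range(num_heads)]
             for s in range(r * per, (r + 1) * per)]
            for r in range(world)]

def all_to_all_seq_to_heads(seq_shards, num_heads):
    """Simulated all-to-all: worker r ends with the full sequence for its head slice."""
    world = len(seq_shards)
    heads_per = num_heads // world
    out = []
    for r in range(world):                       # receiving worker
        rows = []
        for src in range(world):                 # chunks arrive in sequence order
            for token in seq_shards[src]:
                rows.append(token[r * heads_per:(r + 1) * heads_per])
        out.append(rows)
    return out

def all_to_all_heads_to_seq(head_shards):
    """Inverse all-to-all: recombine all heads and re-shard the sequence dimension."""
    world = len(head_shards)
    per = len(head_shards[0]) // world
    out = []
    for r in range(world):
        rows = []
        for s in range(r * per, (r + 1) * per):
            token = []
            for src in range(world):             # gather this token's head slices
                token.extend(head_shards[src][s])
            rows.append(token)
        out.append(rows)
    return out

# Toy configuration (assumed): 8 tokens, 4 heads, 2 sequence-parallel workers.
seq_shards = shard_by_sequence(seq_len=8, num_heads=4, world=2)
head_shards = all_to_all_seq_to_heads(seq_shards, num_heads=4)
restored = all_to_all_heads_to_seq(head_shards)
```

After the first exchange each worker sees all 8 tokens but only 2 of the 4 heads, which is what lets attention run over the full sequence locally; the second exchange restores the original sequence-sharded layout.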
To address this issue and better balance the computation and communication by maximizing their overlap in the profiling trace, we designed a fine-grained Activation Checkpointing (AC) (Chen et al., 2016) strategy. Specifically, we implemented selective activation checkpointing to minimize the number of layers requiring activation storage while maximizing GPU utilization. 3.3. Cluster Fault Tolerance Scaling Goku training to large-scale GPU clusters inevitably introduces fault scenarios, which can reduce training efficiency. The likelihood of encountering failures increases with the number of nodes, as larger systems have a higher probability of at least one node failing. These disruptions can extend training time and increase costs. To enhance stability and efficiency at scale, we adopted fault tolerance techniques from MegaScale (Jiang et al., 2024), including self-check diagnostics, multi-level monitoring, and fast restart/recovery mechanisms. These strategies effectively mitigate the impact of interruptions, enabling Goku to maintain robust performance in large-scale generative modeling tasks. 3.4. Saving and Loading Training Stages Checkpointing training states—such as model parameters, exponential moving average (EMA) parameters, optimizer states, and random states—is crucial for training large-scale models, particularly given the increased likelihood of cluster faults. Reloading checkpointed states ensures reproducibility, which is essential for model reliability and debugging potential issues, including those caused by unintentional errors or malicious attacks. To support scalable large-scale training, we adopt ByteCheckpoint (Wan et al., 2024) as our checkpointing solution. It not only enables parallel saving and loading of partitioned checkpoints with high I/O efficiency but also supports resharding distributed checkpoints. 
This flexibility allows seamless switching between different training scales, accommodating varying numbers of ranks and diverse storage backends. In our setup, checkpointing an 8B model across thousands of GPUs blocks training for less than 4 seconds, which is negligible compared to the overall forward and backward computation time per iteration.

4. Data Curation Pipeline

We unlock the data volume that is utilized for industry-grade video/image generation models. Our data curation pipeline, illustrated in Figure 2, consists of five main stages: (1) image and video collection, (2) video extraction and clipping, (3) image and video filtering, (4) captioning, and (5) data distribution balancing. We describe the data curation procedure in detail below.

4.1. Data Overview

We collect raw image and video data from a variety of sources, including publicly available academic datasets, internet resources, and proprietary datasets obtained through partnerships with collaborating organizations. After rigorous filtering, the final training dataset for Goku consists of approximately 160M image-text pairs and 36M video-text pairs, encompassing both publicly available datasets and internally curated proprietary datasets.

Figure 2 | The data curation pipeline in Goku (stages: Collection, Extraction, Filtering, Captioning, Balancing). Given a large volume of video/image data collected from the Internet, we generate high-quality video/image-text pairs through a series of data filtering, captioning and balancing steps.

The detailed composition of these resources is outlined as follows:

• Text-to-Image Data. Our text-to-image training dataset includes 100M public samples from LAION (Schuhmann et al., 2022) and 60M high-quality, internal samples.
We use public data for pre-training and internal data for fine-tuning. • Text-to-Video Data. Our T2V training dataset includes 11M public clips and 25M in-house clips. The former include Panda-70M (Chen et al., 2024b), InternVid (Wang et al., 2023b), OpenVid-1M (Nan et al., 2024), and Pexels (Lab and etc., 2024). Rather than directly using these datasets, we apply a data curation pipeline to keep high-quality samples. 4.2. Data Processing and Filtering To construct a high-quality video dataset, we implement a comprehensive processing pipeline comprising several key stages. Raw videos are first preprocessed and standardized to address inconsistencies in encoding formats, durations, and frame rates. Next, a two-stage video clipping method segments videos into meaningful and diverse clips of consistent length. Additional filtering processes are applied, including visual aesthetic filtering to retain photorealistic and visually rich clips, OCR filtering to exclude videos with excessive text, and motion filtering to ensure balanced motion dynamics. In addition, the multi-level training data is segmented based on resolution and corresponding filtering thresholds for DINO similarity, aesthetic score, OCR text coverage, and motion score, as summarized in Table 4. We provide the details of each processing step as follows. Table 3 presents the key parameters and their corresponding thresholds used for video quality assessment. Each parameter is essential in ensuring the generation and evaluation of high-quality videos. The Duration parameter specifies that raw video lengths should be at least 4 seconds to capture meaningful temporal dynamics. The Resolution criterion ensures that the minimum dimension (either height or width) of the video is no less than 480 pixels, maintaining adequate visual clarity. 
The Bitrate, which determines the amount of data processed per second during playback, requires a minimum of 500 kbps to ensure sufficient quality, clarity, and manageable file size. Videos with low bitrate typically correspond to content with low complexity, such as static videos or those featuring pure color backgrounds. Finally, the Frame Rate enforces a standard of at least 24 frames per second (film standard) or 23.976 frames per second (NTSC standard) to guarantee smooth motion and prevent visual artifacts. These thresholds collectively establish a baseline for evaluating and generating high-quality video content.

Parameter    Description                                        Threshold
Duration     Raw video length                                   ≥ 4 seconds
Resolution   Width and height of the video                      min{height, width} ≥ 480
Bitrate      Amount of data processed per second during         ≥ 500 kbps
             playback, which impacts the video’s quality,
             clarity, and file size
Frame Rate   Frames displayed per second                        ≥ 24 FPS (Film Standard) /
                                                                23.976 FPS (NTSC Standard)

Table 3 | Summary of video quality parameters and their thresholds for preprocessing. The table outlines the criteria used to filter and standardize raw videos based on essential attributes, ensuring uniformity and compatibility in the dataset.

• Preprocessing and Standardization of Raw Videos. Videos collected from the internet often require extensive preprocessing to address variations in encoding formats, durations, and frame rates. Initially, we perform a primary filtering step based on fundamental video attributes such as duration, resolution, and bitrate. The specific filtering criteria and corresponding thresholds are detailed in Table 3. This initial filtering step is computationally efficient compared to more advanced, model-based filtering approaches, such as aesthetic (Schuhmann et al., 2022) evaluation models.
Following this stage, the raw videos are standardized to a consistent coding format, H.264 (Wiegand et al., 2003), ensuring uniformity across the dataset and facilitating subsequent processing stages.

• Video Clips Extraction. We employ a two-stage video clipping method for this stage. First, we use PySceneDetect (Castellano, 2024) for shot boundary detection, resulting in coarse-grained video clips from raw videos. Next, we further refine the video clips by sampling one frame per second, generating DINOv2 (Oquab et al., 2023) features and calculating cosine similarity between adjacent frames. When similarity falls below a set threshold, we mark a shot change and further divide the clip. Specifically, as shown in Table 4, for video resolutions around 480 × 864, the threshold is 0.85, so that adjacent frames within a clip have a DINO similarity of at least 0.85. For resolutions greater than 720 × 1280, the threshold is set at 0.9. In addition, to standardize length, we limit clips to a maximum of 10 seconds. Furthermore, we consider the similarity between different clips derived from the same source video to ensure diversity and maintain quality. Specifically, we compute the perceptual hashing (Contributors, 2013) values of keyframes from each clip and compare them. If two clips have similar hash values, indicating significant overlap, we retain the clip with a higher aesthetic score. This ensures that the final dataset includes diverse and high-quality video clips.

• Visual Aesthetic Filtering. To assess the visual quality of the videos, we utilize aesthetic models (Schuhmann et al., 2022) to evaluate the keyframes. The aesthetic scores of the keyframes are averaged to obtain an overall aesthetic score for each video. For videos with resolutions around 480 × 864, those with an aesthetic score below 4.3 are discarded, while for resolutions exceeding 720 × 1280, the threshold is raised to 4.5.
This filtering process ensures that the selected clips are photorealistic, visually rich, and of high aesthetic quality.

Stage   Amount   Resolution     DINO-Sim.   Aesthetic   OCR      Motion
480p    36M      ≥ 480×864      ≥ 0.85      ≥ 4.3       ≤ 0.02   0.3 ≤ score ≤ 20.0
720p    24M      ≥ 720×1280     ≥ 0.90      ≥ 4.5       ≤ 0.01   0.5 ≤ score ≤ 15.0
1080p   7M       ≥ 1080×1920    ≥ 0.90      ≥ 4.5       ≤ 0.01   0.5 ≤ score ≤ 8.0

Table 4 | Overview of multi-stage training data. This table summarizes the thresholds for each filtering criterion, including resolution, DINO similarity, aesthetic score, OCR text coverage, motion score, and the corresponding data quantities.

• OCR Filtering. To exclude videos with excessive text, we employ an internal OCR model to detect text within the keyframes. The OCR model identifies text regions, and we calculate the text coverage ratio by dividing the area of the largest bounding box detected by the total area of the keyframe. Videos with a text coverage ratio exceeding predefined thresholds are discarded. Specifically, for videos with resolutions around 480 × 864, the threshold is set at 0.02, while for resolutions exceeding 720 × 1280, the threshold is reduced to 0.01. This process effectively filters out videos with excessive text content.

• Motion Filtering. Unlike images, videos require additional filtering based on motion characteristics. To achieve this, we utilize RAFT (Teed and Deng, 2020) to compute the mean optical flow of video clips, which is then used to derive a motion score. For videos with resolutions around 480 × 864, clips with motion scores below 0.3 (indicating low motion) or above 20.0 (indicating excessive motion) are excluded. For resolutions exceeding 720 × 1280, the thresholds are adjusted to 0.5 and 15.0, respectively. Furthermore, to enhance motion control, the motion score is appended to each caption.

4.3. Captioning

Detailed captions are essential for enabling the model to generate text-aligned images/videos precisely.
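As a rough illustration of the clip extraction and filtering stages described above, the sketch below splits a clip on drops in adjacent-frame feature similarity and then applies the per-stage thresholds from Table 4. It is only a sketch under stated assumptions: the `ClipStats` record, all function names, and the dictionary layout are hypothetical, the scores are assumed to be precomputed by the respective models (DINOv2, the aesthetic predictor, the OCR detector, RAFT), the resolution check is assumed to be orientation-agnostic, and cutting a segment once it reaches 10 seconds is one possible reading of the length limit.

```python
import math
from dataclasses import dataclass

# Thresholds copied from Table 4; the dictionary layout is illustrative only.
STAGES = {
    "480p":  {"res": (480, 864),   "dino": 0.85, "aesthetic": 4.3, "ocr": 0.02, "motion": (0.3, 20.0)},
    "720p":  {"res": (720, 1280),  "dino": 0.90, "aesthetic": 4.5, "ocr": 0.01, "motion": (0.5, 15.0)},
    "1080p": {"res": (1080, 1920), "dino": 0.90, "aesthetic": 4.5, "ocr": 0.01, "motion": (0.5, 8.0)},
}
MAX_CLIP_SECONDS = 10


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_clip(frame_features, threshold, max_len=MAX_CLIP_SECONDS):
    """Split a clip given one feature vector per second of video.

    Returns (start, end) ranges in seconds, end exclusive. A new segment
    starts where adjacent-frame similarity drops below `threshold`, or
    (assumption) when the current segment reaches `max_len` seconds.
    """
    if not frame_features:
        return []
    segments, start = [], 0
    for i in range(1, len(frame_features)):
        sim = cosine_similarity(frame_features[i - 1], frame_features[i])
        if sim < threshold or i - start >= max_len:
            segments.append((start, i))
            start = i
    segments.append((start, len(frame_features)))
    return segments


def text_coverage_ratio(boxes, frame_width, frame_height):
    """Area of the largest text box (x0, y0, x1, y1) over the keyframe area."""
    if not boxes:
        return 0.0
    largest = max((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes)
    return largest / (frame_width * frame_height)


@dataclass
class ClipStats:
    width: int
    height: int
    aesthetic: float     # mean aesthetic score over keyframes
    ocr_coverage: float  # text coverage ratio of the clip
    motion: float        # motion score derived from mean optical flow


def passes_stage_filters(clip, stage):
    """Return True if the clip satisfies every Table 4 threshold for `stage`."""
    t = STAGES[stage]
    short_side, long_side = sorted((clip.width, clip.height))
    low, high = t["motion"]
    return (
        short_side >= t["res"][0]
        and long_side >= t["res"][1]
        and clip.aesthetic >= t["aesthetic"]
        and clip.ocr_coverage <= t["ocr"]
        and low <= clip.motion <= high
    )


# Example: a 480p-sized clip that passes the 480p stage but not the 720p stage.
clip = ClipStats(width=864, height=480, aesthetic=4.4, ocr_coverage=0.015, motion=10.0)
print(passes_stage_filters(clip, "480p"), passes_stage_filters(clip, "720p"))
```

Keeping the thresholds in one table-like structure mirrors Table 4 and makes it easy to see that later stages only tighten the same criteria.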
For images, we use InternVL2.0 (Chen et al., 2024c) to generate dense captions for each sample. To caption video clips, we start with InternVL2.0 (Chen et al., 2024c) for keyframe captions, followed by Tarsier2 (Yuan et al., 2025) for video-wide captions. Note that the Tarsier2 model can inherently describe camera motion types (e.g., zoom in, pan right) in videos, eliminating the need for a separate prediction model and simplifying the overall pipeline compared to previous work such as Polyak et al. (2024). Next, we utilize Qwen2 (Yang et al., 2024a) to merge the keyframe and video captions. We also found empirically that adding the motion score (calculated by RAFT (Teed and Deng, 2020)) to the captions improves motion control for video generation. This approach enables users to specify different motion scores in prompts to guide the model in generating videos with varied motion dynamics.

4.4. Training Data Balancing

The model’s performance is significantly influenced by the data distribution, especially for video data. To balance the video training data, we first use an internal video classification model to generate semantic tags for the videos. We then adjust the data distribution based on these semantic tags to ensure a balanced representation across categories.

[Figure 3: (a) Semantic distribution of video clips. (b) The balanced semantic distribution of subcategories, with panels "Sub-category from Human" (e.g., half-selfie, full-selfie, multi human) and "Sub-category from Scenery" (e.g., forest, natural rivers, snow, grass, sky).]

Figure 3 | Training data distributions. The balanced semantic distributions of primary categories and subcategories are shown in (a) and (b), respectively.

• Data Semantic Distribution. The video classification model assigns a semantic tag to each video based on four evenly sampled keyframes. The model categorizes videos into 9 primary classes (e.g., human, scenery, animals, food) and 86 subcategories (e.g., half-selfie, kid, dinner, wedding).
Figure 3a presents the semantic distribution across our filtered training clips, with humans, scenery, food, urban life, and animals as the predominant categories.

• Data Balancing. The quality of the generated videos is closely tied to the semantic distribution of the training data. Videos involving humans pose greater modeling challenges due to the extensive diversity in appearances, whereas animals and landscapes exhibit more visual consistency and are relatively easier to model. To address this disparity, we implement a data-balancing strategy that emphasizes human-related content while ensuring equitable representation across subcategories within each primary category. Overrepresented subcategories are selectively down-sampled, whereas underrepresented ones are augmented through artificial data generation and oversampling techniques. The balanced data distribution is shown in Figure 3b.

                                                     GenEval    T2I-CompBench                  DPG-Bench
Method                              Text Enc.        Overall    Color     Shape     Texture    Average
SDv1.5 (Rombach et al., 2022)       CLIP ViT-L/14    0.43       0.3730    0.3646    0.4219     63.18
DALL-E 2 (Ramesh et al., 2022)      CLIP ViT-H/16    0.52       0.5750    0.5464    0.6374     -
SDv2.1 (Rombach et al., 2022)       CLIP ViT-H/14    0.50       0.5694    0.4495    0.4982     -
SDXL (Podell et al., 2023)          CLIP ViT-bigG    0.55       0.6369    0.5408    0.5637     74.65
PixArt-α (Chen et al., 2023)        Flan-T5-XXL      0.48       0.6886    0.5582    0.7044     71.11
DALL-E 3 (Betker et al., 2023)      Flan-T5-XXL      0.67†      0.8110†   0.6750†   0.8070†    83.50†
GenTron (Chen et al., 2024a)        CLIP T5XXL       -          0.7674    0.5700    0.7150     -
SD3 (Esser et al., 2024)            Flan-T5-XXL      0.74       -         -         -          -
Show-o (Xie et al., 2024)           Phi-1.5          0.53       -         -         -          -
Transfusion (Zhou et al., 2024)     -                0.63       -         -         -          -
Chameleon (Lu et al., 2024)         -                0.39       -         -         -          -
LlamaGen (Sun et al., 2024)         FLAN-T5 XL       0.32       -         -         -          -
Emu 3 (Wang et al., 2024b)          -                0.66†      0.7913†   0.5846†   0.7422†    80.60
Goku-T2I (2B)                       FLAN-T5 XL       0.70       0.7521    0.4832    0.6691     -
Goku-T2I (2B)†                      FLAN-T5 XL       0.76†      0.7561†   0.5759†   0.7071†    83.65

Table 5 | Comparison with state-of-the-art models on image generation benchmarks.
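The data-balancing step described above can be illustrated with a small resampling sketch. This is not the authors' procedure: it assumes a single hypothetical target count per subcategory, down-samples larger groups by random sampling without replacement, and over-samples smaller groups by repeating clips at random. The artificial data generation mentioned in the text is not modeled, and the function name and input format are assumptions.

```python
import random
from collections import defaultdict


def balance_by_subcategory(clips, target_per_subcategory, seed=0):
    """Resample (clip_id, subcategory) pairs toward a common group size.

    Groups larger than the target are down-sampled without replacement;
    smaller groups keep all clips and are padded with random repeats.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for clip_id, subcategory in clips:
        groups[subcategory].append(clip_id)

    balanced = []
    for subcategory in sorted(groups):
        ids = groups[subcategory]
        if len(ids) >= target_per_subcategory:
            # Over-represented: keep a random subset.
            chosen = rng.sample(ids, target_per_subcategory)
        else:
            # Under-represented: keep everything, then repeat random clips.
            extra = rng.choices(ids, k=target_per_subcategory - len(ids))
            chosen = ids + extra
        balanced.extend((clip_id, subcategory) for clip_id in chosen)
    return balanced


# Example: ten "half-selfie" clips and two "forest" clips, balanced to four each.
clips = [(f"h{i}", "half-selfie") for i in range(10)] + [(f"f{i}", "forest") for i in range(2)]
balanced = balance_by_subcategory(clips, target_per_subcategory=4)
print(len(balanced))
```

To emphasize human-related content, as the text describes, one could pass a larger target for human subcategories; that would require replacing the single integer with a per-subcategory mapping.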
We evaluate on GenEval (Ghosh et al., 2024), T2I-CompBench (Huang et al., 2023), and DPG-Bench (Hu et al., 2024). Following Wang et al. (2024b), we use † to indicate results obtained with prompt rewriting.

5. Experiments

5.1. Text-to-Image Results

We conduct a comprehensive quantitative evaluation of Goku-T2I on widely recognized image generation benchmarks, including GenEval (Ghosh et al., 2024), T2I-CompBench (Huang et al., 2023), and DPG-Bench (Hu et al., 2024). Details of these benchmarks can be found in Appendix A. The results are summarized in Table 5.

Performance on GenEval. To assess text-image alignment comprehensively, we employ the GenEval benchmark, which evaluates the correspondence between textual descriptions and visual content. Since Goku-T2I is primarily trained on dense generative captions, it exhibits a natural advantage when handling detailed prompts. To further explore this, we expand the original short prompts in GenEval with ChatGPT-4o, preserving their semantics while enhancing descriptive detail. As shown in Table 5, Goku-T2I achieves strong performance with the original short prompts, surpassing most state-of-the-art models. With the rewritten prompts, Goku-T2I attains the highest score (0.76), demonstrating its exceptional capability in aligning detailed textual descriptions with generated images.

Performance on T2I-CompBench. We further evaluate the alignment between generated images and textual conditions using the T2I-CompBench benchmark, which focuses on various object attributes such as color, shape, and texture.
As illustrated in Table 5, Goku-T2I consistently outperforms several strong baselines, including PixArt-α (Chen et al., 2023), SDXL (Podell et al., 2023), and DALL-E 2 (Mishkin et al., 2022). Notably, the inclusion of prompt rewriting leads to improved performance across all attributes, further highlighting Goku-T2I’s robustness in text-image alignment.

Performance on DPG-Bench. While the aforementioned benchmarks primarily evaluate text-image alignment with short prompts, DPG-Bench is designed to test model performance on dense prompt following. This challenging benchmark includes 1,000 detailed prompts, providing a rigorous test of a model’s ability to generate visually accurate outputs for complex textual inputs. As shown in the last column of Table 5, Goku-T2I achieves the highest performance with an average score of 83.65, surpassing PixArt-α (Chen et al., 2023) (71.11), DALL-E 3 (Betker et al., 2023) (83.50), and Emu3 (Wang et al., 2024b) (80.60). These results highlight Goku-T2I’s superior ability to handle dense prompts and maintain high fidelity in text-image alignment.

Method                                       Resolution   FVD (↓)   IS (↑)
CogVideo (Chinese) (Hong et al., 2022)       480×480      751.34    23.55
CogVideo (English) (Hong et al., 2022)       480×480      701.59    25.27
Make-A-Video (Singer et al., 2023)           256×256      367.23    33.00
VideoLDM (Blattmann et al., 2023b)           -            550.61    33.45
LVDM (He et al., 2022)                       256×256      372.00    -
MagicVideo (Zhou et al., 2022)               -            655.00    -
PixelDance (Zeng et al., 2024)               -            242.82    42.10
PYOCO (Ge et al., 2023)                      -            355.19    47.76
Emu-Video (Girdhar et al., 2023)             256×256      317.10    42.7
SVD (Blattmann et al., 2023a)                240×360      242.02    -
Goku-2B (ours)                               256×256      246.17    45.77 ± 1.10
Goku-2B (ours)                               240×360      254.47    46.64 ± 1.08
Goku-2B (ours)                               128×128      217.24    42.30 ± 1.03

Table 6 | Zero-shot text-to-video performance on UCF-101. We generate videos of different resolutions, including 256×256, 240×360, 128×128, for comprehensive comparisons.

5.2. Text-to-Video Results

Performance on UCF-101.
We conduct experiments on UCF-101 (Soomro et al., 2012) in a zero-shot text-to-video setting. As UCF-101 only has class labels, we utilize a video-language model, Tarsier-34B (Wang et al., 2024a), to generate detailed captions for all UCF-101 videos. These captions are then used to synthesize videos with Goku. In total, we generate 13,320 videos with the Goku-2B model for evaluation at each of several resolutions, including 256×256, 240×360, and 128×128. Following standard practice (Skorokhodov et al., 2022), we use the I3D model, pre-trained on Kinetics-400 (Carreira and Zisserman, 2017), as the feature extractor. Based on the extracted features, we calculate the Fréchet Video Distance (FVD) (Unterthiner et al., 2018) to evaluate the fidelity of the generated videos. The results in Table 6 demonstrate that Goku consistently generates videos with lower FVD and higher IS. For instance, at a resolution of 128×128, the FVD of videos generated by Goku is 217.24, achieving state-of-the-art performance and highlighting significant advantages over other methods.

Models             Human    Scene   Dynamic   Multiple   Appear.   Quality   Semantic   Overall
                   Action           Degree    Objects    Style     Score     Score
AnimateDiff-V2     92.60    50.19   40.83     36.88      22.42     82.90     69.75      80.27
VideoCrafter-2.0   95.00    55.29   42.50     40.66      25.13     82.20     73.42      80.44
OpenSora V1.2      85.80    42.47   47.22     58.41      23.89     80.71     73.30      79.23
Show-1             95.60    47.03   44.44     45.47      23.06     80.42     72.98      78.93
Gen-3              96.40    54.57   60.14     53.64      24.31     84.11     75.17      82.32
Pika-1.0           86.20    49.83   47.50     43.08      22.26     82.92     71.77      80.69
CogVideoX-5B       99.40    53.20   70.97     62.11      24.91     82.75     77.04      81.61
Kling              93.40    50.86   46.94     68.05      19.62     83.39     75.68      81.85
Mira               63.80    16.34   60.33     12.52      21.89     78.78     44.21      71.87
CausVid            99.80    56.58   92.69     72.15      24.27     85.65     78.75      84.27
Luma               96.40    58.98   44.26     82.63      24.66     83.47     84.17      83.61
HunyuanVideo       94.40    53.88   70.83     68.55      19.80     85.09     75.82      83.24
Goku (ours)        97.60    57.08   76.11     79.48      23.08     85.60     81.87      84.85

Table 7 | Comparison with leading T2V models on VBench.
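For reference, the Fréchet distance underlying the FVD numbers in the UCF-101 evaluation above compares Gaussians fitted to real and generated feature sets: d = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). The sketch below is a minimal NumPy version under stated assumptions: the function name is hypothetical, the I3D feature extraction is not shown, and the random arrays merely stand in for real features. It is not the evaluation code used in the paper.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_videos, feature_dim), e.g. I3D features.
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    sigma_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    sigma_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr((S_r S_g)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of S_r S_g, which are real and non-negative when both
    # covariances are positive semi-definite. Clipping guards against
    # small negative values caused by round-off.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)


# Placeholder features: shifting every feature by 3 leaves the covariance
# unchanged, so the distance reduces to the squared mean shift.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
generated = real + 3.0
print(frechet_distance(real, generated))
```

Using eigenvalues avoids an explicit matrix square root; libraries that provide one (for example SciPy) would be an equally reasonable choice.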
Goku achieves state-of-the-art overall performance. Detailed results across all 16 evaluation dimensions are provided in Table 8 in the Appendix.

Performance on VBench. As presented in Table 7, we evaluate Goku-T2V against state-of-the-art models on VBench (Huang et al., 2024), a comprehensive benchmark designed to assess video generation quality across 16 dimensions. Goku-T2V achieves state-of-the-art overall performance on VBench, showcasing its ability to generate high-quality videos across diverse attributes and scenarios. Among the key metrics, Goku-T2V demonstrates notable strength in human action representation, dynamic degree, and multiple object generation, reflecting its capacity for handling complex and diverse video content. Additionally, it achieves competitive results in appearance style, quality score, and semantic alignment, highlighting its balanced performance across multiple aspects. For detailed results on all 16 evaluation dimensions, we refer readers to Table 8 in the Appendix. This comprehensive analysis underscores Goku-T2V’s superiority in video generation compared to prior approaches.

5.3. Image-to-Video

We finetune Goku-I2V from the T2V initialization with approximately 4.5M text-image-video triplets, sourced from diverse domains to ensure robust generalization. Despite the relatively small number of fine-tuning steps (10k), our model demonstrates remarkable efficiency in animating the reference image while maintaining strong alignment with the accompanying text. As illustrated in Figure 4, the generated videos exhibit high visual quality and temporal coherence, effectively capturing the semantic nuances described in the text.

[Figure 4 prompts:]
A lion running towards the left side of the scene, with flames engulfing its body.
As it runs, the lion gradually transforms into a mass of flames…
A woman in workout gear is lifting weights at a gym, her biceps flexing with each lift, sweat visible on her forehead, with a closeup on her determined expression…
A man surfing on a wave, with the camera following his movement and focusing on his face. He is smiling and giving a thumbs-up to the camera, …

Figure 4 | Samples of Goku-I2V. Reference images are presented in the leftmost columns. We omitted redundant information from the long prompts, displaying only the key details in each one. Key words are highlighted in RED.

5.4. Image and Video Qualitative Visualizations

For intuitive comparisons, we conduct qualitative assessments and present sampled results in Figure 6. The evaluation includes open-source models, such as CogVideoX (Yang et al., 2024c) and Open-Sora-Plan (Zheng et al., 2024), alongside closed-source commercial products, including DreamMachine (Luma, 2024), Pika (pika, 2024), Vidu (Bao et al., 2024), and Kling (Kuaishou, 2024). The results reveal that some commercial models struggle to generate critical video elements when handling complex prompts. For instance, models like Pika, DreamMachine, and Vidu (rows 3–5) fail to render the skimming drone over water. While certain models succeed in generating the target drone, they often produce distorted subjects (rows 1–2) or static frames lacking motion consistency (row 6). In contrast, Goku-T2V (8B) demonstrates superior performance by accurately incorporating all details from the prompt, creating a coherent visual output with smooth motion. Additional comparisons are provided in the appendix for a more comprehensive evaluation. Furthermore, more video examples are available at the Goku homepage.

5.5. Ablation Studies

Model Scaling. We compared Goku-T2V models with 2B and 8B parameters.
Results in Figure 5a indicate that model scaling helps mitigate the generation of distorted object structures, such as the arm in Figure 5a (row 1) and the wheel in Figure 5a (row 2). This aligns with findings observed in large multi-modality models.

Joint Training. We further examine the impact of joint image-and-video training. Starting from the same pretrained Goku-T2I (8B) weights, we fine-tuned Goku-T2V (8B) on 480p videos for an equal number of training steps, with and without joint image-and-video training. As shown in Figure 5b, Goku-T2V without joint training tends to generate low-quality video frames, while the model with joint training more consistently produces photorealistic frames.

[Figure 5: (a) Model Scaling: Goku-T2V (2B) vs. Goku-T2V (8B). (b) Joint Training: Goku-T2V w/o joint training vs. Goku-T2V w/ joint training.]

Figure 5 | Ablation Studies of Model Scaling and Joint Training. Fig. (a) compares Goku-T2V (2B) with Goku-T2V (8B). Fig. (b) compares results with and without joint training.

6. Conclusion

In this work, we presented Goku, a novel model for joint image-and-video generation designed for industry-grade performance. Through an advanced data curation process and a robust model architecture, Goku delivers high-quality outputs by ensuring both fine-grained data selection and effective integration of image and video modalities. Key components, such as the image-video joint VAE and the application of rectified flow, facilitate seamless token interaction across modalities, establishing a shared latent space that enhances model adaptability and attention across tokens. Empirical results highlight Goku’s superiority in commercial-grade visual generation quality.

Acknowledgements

We sincerely appreciate the support of our collaborators at ByteDance who contributed to this work:
Xibin Wu, Chongxi Wang, Yina Tang, Fangzhou Ai, Yi Ren, Wei Wang, Chen Chen, Colin Young, Bobo Zeng, Ge Bai, Yi Fu, Ruoyu Guo, Prasanna Raghav, Weiguo Feng, Xugang Ye, Adithya Sampath, Aaron Shen, Da Tang, Yuan Fang, Qijun Gan, Chen Zhang, Zhenhui Ye, Pan Xie, Houmin Wei, Gaohong Liu, Zherui Liu, Chenyuan Wang, Yun Zhang, Kaihua Jiang, Zhuo Jiang, Yang Bai, Weiqiang Lou, Hongkai Li, Xi Yang, Shuguang Wang, Junru Zheng, Zuquan Song, Zixian Du, Jingzhe Tang, Yongqiang Zhang, Mingji Han, Heng Zhang, Li Han, Sophie Xie, Shuo Li, Xinzhi Yao, Peng Li, Lianke Qin, Dongyang Wang, Yang Cheng, Chundian Liu, Wenhao Hao, Haibin Lin, Xin Liu

[Figure 6 rows, top to bottom: CogVideoX1.5 (5B), Open-Sora Plan (v1.3), Pika, DreamMachine, Vidu, Kling (1.5), Goku (8B).]
Prompt: Gliding through a crystal-clear coral reef, the drone skims just above the vibrant marine life below. Brightly colored corals, schools of fish, and rays of sunlight penetrating the water’s surface all contribute to the serene yet fast-paced journey. The scene showcases the beauty of the underwater world, as the drone swiftly maneuvers through coral arches and narrow underwater channels.

Figure 6 | Qualitative comparisons with state-of-the-art (SoTA) video generation models. This figure showcases comparisons with leading models, including CogVideoX (Yang et al., 2024c), Open-Sora Plan (Lab and etc., 2024), Pika (pika, 2024), DreamMachine (Luma, 2024), Vidu (Bao et al., 2024), and Kling v1.5 (Kuaishou, 2024).

Appendix

A. Benchmark Configurations

T2I-CompBench (Huang et al., 2023). We evaluate the alignment between the generated images and text conditions using T2I-CompBench, a comprehensive benchmark for assessing compositional text-to-image generation capabilities. Specifically, we report scores for color binding, shape binding, and texture binding. To evaluate these results, we employ the Disentangled BLIP-VQA model. For each attribute, we generate 10 images per prompt, with a total of 300 prompts in each category.
GenEval (Ghosh et al., 2024). GenEval is an object-focused framework designed to evaluate compositional image properties, such as object co-occurrence, position, count, and color. For evaluation, we generate a total of 2,212 images across 553 prompts. The final score is reported as the average across tasks.

DPG-Bench (Hu et al., 2024). Compared to the aforementioned benchmarks, DPG-Bench offers longer prompts with more detailed information, making it effective for evaluating compositional generation in text-to-image models. For this evaluation, we generate a total of 4,260 images across 1,065 prompts, with the final score reported as the average across tasks.

VBench (Huang et al., 2024). VBench is a benchmark suite for evaluating video generative models. It provides a structured Evaluation Dimension Suite that breaks down “video generation quality” into precise dimensions for detailed assessment. Each dimension and content category includes a carefully crafted Prompt Suite and sampled Generated Videos from various models.

Appendix B. More Visualization Examples

Appendix B.1. Goku-T2I Samples Visualization

We present more generated image samples with their text prompts in Figure 7. The prompts are randomly selected from the Internet¹. Goku-T2I achieves strong performance in both visual quality and text-image alignment. It can interpret visual elements and their interactions from complex natural language descriptions. Notably, in Figure 8, Goku-T2I exhibits impressive abilities in generating images with rich details, for example, the clear textures of leaves and berries.

Appendix B.2. Goku-T2V Samples Visualization

In Figure 9 we show more examples generated by Goku-T2V, in both landscape (e.g., rows one through five) and portrait mode (e.g., the last row). Goku-T2V is capable of generating high-motion videos (e.g., skiing) and realistic scenes (e.g., forests). All videos are configured with a duration of 4 seconds, a frame rate of 24 FPS, and a resolution of 720p.
For visualization, we uniformly sample five frames in temporal sequence.

Appendix B.3. Goku-T2V Comparisons with Prior Art

Additional comparisons with state-of-the-art text-to-video generation models are presented in Figure 10 and Figure 11. These results demonstrate the strong performance of Goku when evaluated against both open-source models (Yang et al., 2024c; Zheng et al., 2024) and commercial products (pika, 2024; Kuaishou, 2024; Bao et al., 2024; Luma, 2024). For instance, in Figure 10, Goku successfully generates smooth motion and accurately incorporates the specified low-angle shot. In contrast, other models, such as CogVideoX (Yang et al., 2024c), Vidu (Bao et al., 2024), and Kling (Kuaishou, 2024), often produce incorrect objects or improper camera views.

Appendix B.4. Goku-I2V Samples Visualization

We present additional visualizations of generated samples from Goku-I2V in Figure 12, which further validate the effectiveness and versatility of our approach. As shown in the figure, Goku-I2V demonstrates an impressive ability to synthesize coherent and visually compelling videos from diverse reference images, maintaining consistency in motion and scene semantics. For instance, in the first row, the model successfully captures the dynamic and high-energy nature of water boxing, generating fluid and natural movements of splashes synchronized with the subject’s motions. In the second row, the sequence of a child riding a bike through a park illustrates the model’s proficiency in creating smooth and realistic forward motion while preserving environmental consistency. Finally, the third row showcases the model’s ability to handle creative and imaginative scenarios, as seen in the detailed depiction of pirate ships battling atop a swirling coffee cup. The photorealistic rendering and accurate motion trajectories underscore the model’s robustness in both realism and creativity.
These examples highlight Goku-I2V’s capacity to generalize across a wide range of inputs, reinforcing its potential for applications in video generation tasks requiring high fidelity and adaptability.

¹ https://promptlibrary.org/

Method             Total   Quality   Semantic   Subject       Background    Temporal     Motion
                   Score   Score     Score      Consistency   Consistency   Flickering   Smoothness
AnimateDiff-V2     80.27   82.90     69.75      95.30         97.68         98.75        97.76
VideoCrafter-2.0   80.44   82.20     73.42      96.85         98.22         98.41        97.73
OpenSora V1.2      79.23   80.71     73.30      94.45         97.90         99.47        98.20
Show-1             78.93   80.42     72.98      95.53         98.02         99.12        98.24
Gen-3              82.32   84.11     75.17      97.10         96.62         98.61        99.23
Pika-1.0           80.69   82.92     71.77      96.94         97.36         99.74        99.50
CogVideoX-5B       81.61   82.75     77.04      96.23         96.52         98.66        96.92
Kling              81.85   83.39     75.68      98.33         97.60         99.30        99.40
Mira               71.87   78.78     44.21      96.23         96.92         98.29        97.54
CausVid            84.27   85.65     78.75      97.53         97.19         96.24        98.05
Luma               83.61   83.47     84.17      97.33         97.43         98.64        99.35
HunyuanVideo       83.24   85.09     75.82      97.37         97.76         99.44        98.99
Goku               84.85   85.60     81.87      95.55         96.67         97.71        98.50

Method             Dynamic   Aesthetic   Imaging   Object   Multiple   Human
                   Degree    Quality     Quality   Class    Objects    Action
AnimateDiff-V2     40.83     67.16       70.10     90.90    36.88      92.60
VideoCrafter-2.0   42.50     63.13       67.22     92.55    40.66      95.00
OpenSora V1.2      47.22     56.18       60.94     83.37    58.41      85.80
Show-1             44.44     57.35       58.66     93.07    45.47      95.60
Gen-3              60.14     63.34       66.82     87.81    53.64      96.40
Pika-1.0           47.50     62.04       61.87     88.72    43.08      86.20
CogVideoX-5B       70.97     61.98       62.90     85.23    62.11      99.40
Kling              46.94     61.21       65.62     87.24    68.05      93.40
Mira               60.33     42.51       60.16     52.06    12.52      63.80
CausVid            92.69     64.15       68.88     92.99    72.15      99.80
Luma               44.26     65.51       66.55     94.95    82.63      96.40
HunyuanVideo       70.83     60.36       67.56     86.10    68.55      94.40
Goku               76.11     67.22       71.29     94.40    79.48      97.60

Method             Color   Spatial        Scene   Appearance   Temporal   Overall
                           Relationship           Style        Style      Consistency
AnimateDiff-V2     87.47   34.60          50.19   22.42        26.03      27.04
VideoCrafter-2.0   92.92   35.86          55.29   25.13        25.84      28.23
OpenSora V1.2      87.49   67.51          42.47   23.89        24.55      27.07
Show-1             86.35   53.50          47.03   23.06        25.28      27.46
Gen-3              80.90   65.09          54.57   24.31        24.71      26.69
Pika-1.0           90.57   61.03          49.83   22.26        24.22      25.94
CogVideoX-5B       82.81   66.35          53.20   24.91        25.38      27.59
Kling              89.90   73.03          50.86   19.62        24.17      26.42
Mira               42.24   27.83          16.34   21.89        18.77      18.72
CausVid            80.17   64.65          56.58   24.27        25.33      27.51
Luma               92.33   83.67          58.98   24.66        26.29      28.13
HunyuanVideo       91.60   68.68          53.88   19.80        23.89      26.44
Goku               83.81   85.72          57.08   23.08        25.64      27.35

Table 8 | Comparison with state-of-the-art models on video generation benchmarks. We evaluate on VBench (Huang et al., 2024) and compare with Gen-3 (Runway, 2023), Vchitect-2.0 (Team, 2024), VEnhancer (He et al., 2024), Kling (Kuaishou, 2024), LaVie-2 (Wang et al., 2023a), CogVideoX (Yang et al., 2024c), Emu3 (Wang et al., 2024b).

[Figure 7 prompts:]
An embroidered sweater with an anatomical illustration of the human torso and chest, the skin is open to reveal the internal anatomy.
A portrait featuring a 26-year-old Chinese male model in a six-grid layout. He has a sleek, naturally layered Korean hairstyle with subtly drooping bangs. Each panel shows him wearing modern
Prototype flying fox made from blown glass, Lino Tagliapietra style Muranese glassmaking, intricate details.
3d cube woman underwater, iridescent water, dreamlike
Close up shot of hand of a woman touching oats in oat farm. Shot from behind.
3D illustration of the chip with text "AI" floating above it, with a blue color scheme.
A simple design in black on a white background. The word "VINTAGE" is at the bottom.
Great Dane Dog sitting on a toilet bowl in wide bathroom, reading a large double page spread newspaper, sit like human. The background is in a white room.
Full body shot of balenciaga fashion model and parrot hybrid with a human body and the head of the parrot. He is walking through a podium like a model.
Full body photo of a screaming cauliflower monster roaring towards the viewer, very detailed textures. The background is clean and blue.
Create realistic playing cards on fire. The playing cards are presented with 4A. The fire is red and intense. The background is black.

Figure 7 | Qualitative samples of Goku-T2I. Key words are highlighted in RED.
Prompt: Raspberry in the form of women walk along the path of a fairy tale forest. She carries a jug of water with her. Her head is made of one big raspberry on which she has big and beautiful eyes, as well as nose and mouth. The skin of the face has a raspberry color. She has very beautiful hair which consists of raspberry, leaves and thin stems. Her arms and legs are made entirely of intertwined stems. She also wears a skirt with raspberry leaves and small raspberries and she looks very delicate and feminine.

Figure 8 | Qualitative samples of Goku-T2I. Key words are highlighted in RED. For clarity, we zoom in on specific regions to enhance visualization.

[Figure 9 prompts:]
At an aquarium, a diver in a yellow wetsuit is feeding tropical fish in a large tank.
Zooming through a dense, lush rainforest at incredible speed, weaving between colossal trees, with rays of sunlight breaking through the canopy and exotic birds scattering in the distance.
A snowboarder carves down a steep slope, their board cutting swiftly through the snow.
A boxer dances around the ring, fists raised and jabbing rapidly at their opponent.
A kung fu master swiftly maneuvers through a series of rapid punches and palm strikes, their arms blurring with speed.
In a cozy living room with a roaring fireplace and plush furniture, a dog with a shiny coat sits contentedly on a soft rug.

Figure 9 | Qualitative samples of Goku-T2V. Key words are highlighted in RED.

[Figure 10 rows: CogVideoX1.5 (5B), Open-Sora Plan (v1.3), Pika, DreamMachine, Vidu, Kling (1.5), Goku (8B).]
Prompt: An astronaut runs across the surface of the moon, with a low-angle shot showcasing the vast lunar landscape. The movements are smooth and light.

Figure 10 | Qualitative comparisons of Goku-T2V with SOTA video generation methods. Key words are highlighted in RED.

[Figure 11 rows: CogVideoX1.5 (5B), Open-Sora Plan (v1.3), Pika, DreamMachine, Vidu, Kling (1.5), Goku (8B).]
Prompt: A man surfing on a wave, with the camera following his movement and focusing on his face.
He is smiling and giving a thumbs-up to the camera, conveying a sense of enjoyment and excitement. The ocean waves are vibrant and dynamic around him, with sunlight glistening on the water. The background features a clear blue sky, enhancing the lively atmosphere of the scene as he rides the waves with confidence and enthusiasm.

Figure 11 | Qualitative comparisons of Goku-T2V with SOTA video generation methods. Key words are highlighted in RED.

[Figure 12 prompts:]
A person performing dynamic and fast-paced water boxing, demonstrating quick, fluid arm movements while splashing water…
A kid rides a bike in the park, pedaling fast and moving towards the camera…
A highly detailed, photorealistic close-up image of two pirate ships engaged in an intense battle, their sails billowing as they maneuver through the dark, swirling surface of a coffee cup.

Figure 12 | Qualitative samples of Goku-I2V. Key words are highlighted in RED.

References

Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al. (2025). Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575.

Albergo, M. S. and Vanden-Eijnden, E. (2023). Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations.

Bacher, I., Javidnia, H., Dev, S., Agrahari, R., Hossari, M., Nicholson, M., Conran, C., Tang, J., Song, P., Corrigan, D., et al. (2021). An advert creation system for 3d product placements. In Machine Learning and Knowledge Discovery in Databases: Applied Data Science Track: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14–18, 2020, Proceedings, Part IV, pages 224–239. Springer.

Bao, F., Xiang, C., Yue, G., He, G., Zhu, H., Zheng, K., Zhao, M., Liu, S., Wang, Y., and Zhu, J. (2024). Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233.
Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al. (2023). Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8.

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. (2023a). Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127.

Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., and Kreis, K. (2023b). Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575.

Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., and Ramesh, A. (2024). Video generation models as world simulators.

Carreira, J. and Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308.

Castellano, B. (2024). PySceneDetect.

Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., et al. (2023). PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426.

Chen, S., Xu, M., Ren, J., Cong, Y., He, S., Xie, Y., Sinha, A., Luo, P., Xiang, T., and Perez-Rua, J.-M. (2024a). Gentron: Diffusion transformers for image and video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6441–6451.

Chen, T., Xu, B., Zhang, C., and Guestrin, C. (2016). Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.

Chen, T.-S., Siarohin, A., Menapace, W., Deyneka, E., Chao, H.-w., Jeon, B.
E., Fang, Y., Lee, H.-Y., Ren, J., Yang, M.-H., et al. (2024b). Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331.
Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., Tong, W., Hu, K., Luo, J., Ma, Z., et al. (2024c). How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821.
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. (2024). Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53.
Contributors, I. H. (2013). Image hash.
Corporation, N. (2022). Nvidia h100 tensor core gpu architecture.
Corporation, N. (2023). Nvidia announces dgx gh200 ai supercomputer.
Corporation, N. (2024). Nvidia h200 nvl pcie gpu accelerates ai and hpc applications.
Dao, T. (2024). FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR).
Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A. P., Caron, M., Geirhos, R., Alabdulmohsin, I., et al. (2023). Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR.
Dehghani, M., Mustafa, B., Djolonga, J., Heek, J., Minderer, M., Caron, M., Steiner, A., Puigcerver, J., Geirhos, R., Alabdulmohsin, I. M., et al. (2024). Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255.
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al. (2024).
Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning.
Esser, P., Rombach, R., and Ommer, B. (2021). Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883.
Ge, S., Nah, S., Liu, G., Poon, T., Tao, A., Catanzaro, B., Jacobs, D., Huang, J.-B., Liu, M.-Y., and Balaji, Y. (2023). Preserve your own correlation: A noise prior for video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22930–22941.
Ghosh, D., Hajishirzi, H., and Schmidt, L. (2024). Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36.
Girdhar, R., Singh, M., Brown, A., Duval, Q., Azadi, S., Rambhatla, S. S., Shah, A., Yin, X., Parikh, D., and Misra, I. (2023). Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
Ha, D. and Schmidhuber, J. (2018). World models. arXiv preprint arXiv:1803.10122.
He, J., Xue, T., Liu, D., Lin, X., Gao, P., Lin, D., Qiao, Y., Ouyang, W., and Liu, Z. (2024). Venhancer: Generative space-time enhancement for video generation. arXiv preprint arXiv:2407.07667.
He, Y., Yang, T., Zhang, Y., Shan, Y., and Chen, Q. (2022). Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2(3):4.
Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. (2022a). Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851.
Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. (2022b). Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646.
Hong, W., Ding, M., Zheng, W., Liu, X., and Tang, J. (2022). Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868.
Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P., and Yu, G. (2024). Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135.
Huang, K., Sun, K., Xie, E., Li, Z., and Liu, X. (2023). T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747.
Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al. (2024). Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818.
Jacobs, S. A., Tanaka, M., Zhang, C., Zhang, M., Song, S. L., Rajbhandari, S., and He, Y. (2023). Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509.
Ji, Y., Zhang, J., Wu, J., Zhang, S., Chen, S., Ge, C., Sun, P., Chen, W., Shao, W., Xiao, X., et al. (2024). Prompt-a-video: Prompt your video diffusion model via preference-aligned llm. arXiv preprint arXiv:2412.15156.
Jiang, Z., Lin, H., Zhong, Y., Huang, Q., Chen, Y., Zhang, Z., Peng, Y., Li, X., Xie, C., Nong, S., et al. (2024). Megascale: Scaling large language model training to more than 10,000 gpus. arXiv preprint arXiv:2402.15627.
Jin, Y., Sun, Z., Li, N., Xu, K., Jiang, H., Zhuang, N., Huang, Q., Song, Y., Mu, Y., and Lin, Z. (2024).
Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954.
Ju, X., Gao, Y., Zhang, Z., Yuan, Z., Wang, X., Zeng, A., Xiong, Y., Xu, Q., and Shan, Y. (2024). Miradata: A large-scale video dataset with long durations and structured captions. arXiv preprint arXiv:2407.06358.
Kingma, D. P. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al. (2024). Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603.
Korthikanti, V. A., Casper, J., Lym, S., McAfee, L., Andersch, M., Shoeybi, M., and Catanzaro, B. (2023). Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5:341–353.
Kuaishou (2024). Kling ai. https://klingai.com/.
Lab, P.-Y. and etc., T. A. (2024). Open-sora-plan.
Li, S., Xue, F., Baranwal, C., Li, Y., and You, Y. (2021). Sequence parallelism: Long sequence training from system perspective. arXiv preprint arXiv:2105.13120.
Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. (2023). Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations.
Liu, X., Gong, C., and Liu, Q. (2023). Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations.
Lu, P., Peng, B., Cheng, H., Galley, M., Chang, K.-W., Wu, Y. N., Zhu, S.-C., and Gao, J. (2024). Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36.
Luma (2024). Luma ai. https://lumalabs.ai/dream-machine.
Mishkin, P., Ahmad, L., Brundage, M., Krueger, G., and Sastry, G. (2022). Dall·e 2 preview - risks and limitations. Noudettu, 28(2022):3.
Nan, K., Xie, R., Zhou, P., Fan, T., Yang, Z., Chen, Z., Li, X., Yang, J., and Tai, Y.
(2024). Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371.
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. (2023). Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
Peebles, W. and Xie, S. (2023). Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205.
pika (2024). Pika ai. https://pika.art/try.
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. (2023). Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.
Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.-Y., Chuang, C.-Y., et al. (2024). Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720.
Quevedo, J., McIntyre, Q., Campbell, S., and Wachen, R. (2024). Oasis: A universe in a transformer.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695.
Runway (2023). Gen-2: Generate novel videos with text, images or video clips. https://runwayml.com/research/gen-2/.
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al. (2022). Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294.
Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. (2024). Flashattention-3: Fast and accurate attention with asynchrony and low-precision. arXiv preprint arXiv:2407.08608.
Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., and Taigman, Y. (2023). Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations.
Skorokhodov, I., Tulyakov, S., and Elhoseiny, M. (2022). Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3626–3636.
Soomro, K., Zamir, A. R., and Shah, M. (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild. Technical report, Center for Research in Computer Vision, Orlando, FL 32816, USA. CRCV-TR-12-01.
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. (2024). Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063.
Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., and Yuan, Z. (2024). Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525.
Team, V. (2024). Vchitect-2.0: Parallel transformer for scaling up video diffusion models. https://github.com/Vchitect/Vchitect-2.0.
Teed, Z. and Deng, J. (2020). Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer.
Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., and Gelly, S. (2018). Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717.
Valevski, D., Leviathan, Y., Arar, M., and Fruchter, S. (2024). Diffusion models are real-time game engines.
arXiv preprint arXiv:2408.14837.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Wan, B., Han, M., Sheng, Y., Lai, Z., Zhang, M., Zhang, J., Peng, Y., Lin, H., Liu, X., and Wu, C. (2024). Bytecheckpoint: A unified checkpointing system for llm development. arXiv preprint arXiv:2407.20143.
Wang, J., Yuan, L., Zhang, Y., and Sun, H. (2024a). Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634.
Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., Zhao, Y., Ao, Y., Min, X., Li, T., Wu, B., Zhao, B., Zhang, B., Wang, L., Liu, G., He, Z., Yang, X., Liu, J., Lin, Y., Huang, T., and Wang, Z. (2024b). Emu3: Next-token prediction is all you need.
Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., et al. (2023a). Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103.
Wang, Y., He, Y., Li, Y., Li, K., Yu, J., Ma, X., Li, X., Chen, G., Chen, X., Wang, Y., et al. (2023b). Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942.
Wiegand, T., Sullivan, G. J., Bjontegaard, G., and Luthra, A. (2003). Overview of the h.264/avc video coding standard. IEEE Transactions on circuits and systems for video technology, 13(7):560–576.
Wu, J. Z., Ge, Y., Wang, X., Lei, S. W., Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X., and Shou, M. Z. (2023). Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633.
Xie, J., Mao, W., Bai, Z., Zhang, D. J., Wang, W., Lin, K. Q., Gu, Y., Chen, Z., Yang, Z., and Shou, M. Z. (2024).
Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528.
Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., et al. (2024a). Qwen2 technical report. arXiv preprint arXiv:2407.10671.
Yang, M., Li, J., Fang, Z., Chen, S., Yu, Y., Fu, Q., Yang, W., and Ye, D. (2024b). Playable game generation. arXiv preprint arXiv:2412.00887.
Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al. (2024c). Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072.
Yuan, L., Wang, J., Sun, H., Zhang, Y., and Lin, Y. (2025). Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding. arXiv preprint arXiv:2501.07888.
Zeng, Y., Wei, G., Zheng, J., Zou, J., Wei, Y., Zhang, Y., and Li, H. (2024). Make pixels dance: High-dynamic video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8850–8860.
Zhang, B. and Sennrich, R. (2019). Root mean square layer normalization. Advances in Neural Information Processing Systems, 32.
Zhang, J., Chen, J., Wang, C., Yu, Z., Qi, T., Liu, C., and Wu, D. (2024). Virbo: Multimodal multilingual avatar video generation in digital marketing. arXiv preprint arXiv:2403.11700.
Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.-C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., Desmaison, A., Balioglu, C., Damania, P., Nguyen, B., Chauhan, G., Hao, Y., Mathews, A., and Li, S. (2023). Pytorch fsdp: Experiences on scaling fully sharded data parallel. Proc. VLDB Endow., 16(12):3848–3860.
Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. (2024). Open-sora: Democratizing efficient video production for all.
Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., and Levy, O.
(2024). Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039.
Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., and Feng, J. (2022). Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018.