Goku: Flow Based Video Generative Foundation Models
Shoufa Chen1∗ Chongjian Ge1∗ Yuqi Zhang2 Yida Zhang2 Fengda Zhu2 Hao Yang2
Hongxiang Hao2 Hui Wu2 Zhichao Lai2 Yifei Hu2 Ting-Che Lin2 Shilong Zhang1 Fu Li2
Chuan Li2 Xing Wang2 Yanghua Peng2 Peize Sun1 Ping Luo1 Yi Jiang2 Zehuan Yuan2
Bingyue Peng2 Xiaobing Liu2
1 The University of Hong Kong

2 Bytedance Inc

arXiv:2502.04896v2 [cs.CV] 10 Feb 2025

∗ Equal Contribution

https://saiyan-world.github.io/goku/

Abstract
This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail
the foundational elements enabling high-quality visual generation, including the data curation
pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient
and robust large-scale training. The Goku models demonstrate superior performance in both
qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and
84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights
and practical advancements for the research community in developing joint image-and-video
generation models.

1. Introduction
Video generation has garnered significant attention owing to its transformative potential across a
wide range of applications, such as media content creation (Polyak et al., 2024), advertising (Zhang
et al., 2024; Bacher et al., 2021), video games (Yang et al., 2024b; Valevski et al., 2024; Quevedo
et al., 2024), and world model simulators (Ha and Schmidhuber, 2018; Brooks et al., 2024;
Agarwal et al., 2025). Benefiting from advanced generative algorithms (Goodfellow et al., 2014;
Ho et al., 2020; Liu et al., 2023; Lipman et al., 2023), scalable model architectures (Vaswani et al.,
2017; Peebles and Xie, 2023), vast amounts of internet-sourced data (Chen et al., 2024b; Nan
et al., 2024; Ju et al., 2024), and ongoing expansion of computing capabilities (Corporation, 2022,
2023, 2024), remarkable advancements have been achieved in the field of video generation (Ho
et al., 2022b,a; Singer et al., 2023; Blattmann et al., 2023b; Brooks et al., 2024; Kuaishou, 2024;
Yang et al., 2024c; Jin et al., 2024; Polyak et al., 2024; Kong et al., 2024; Ji et al., 2024).
In this work, we present Goku, a family of rectified flow (Lipman et al., 2023; Liu et al., 2023)
transformer models designed for joint image and video generation, establishing a pathway
toward industry-grade performance. This report centers on four key components: data curation,
model architecture design, flow formulation, and training infrastructure optimization—each
rigorously refined to meet the demands of high-quality, large-scale video generation.

[Figure 1: image grid omitted. The prompts shown in the figure are listed below.]

(a) Text-to-Image Samples
• A native Warrior shaman Bengal Cat with a black and white leopard pattern, blue eyes, short fur, and portrait pose, colorful feathers and colorful ornaments, a regal oil-style portrait of the queen of native Kitty shaman white Cat with wings and headdress. Nordic is kind and motherly, it has black eye makeup and her hair is in messy.
• A glass transparent emoji cartoon hand making the peace sign gesture, with fingers straight up and down
• A white bearded man's face emerges from a cloud of white butterflies, background is white
• An extremely happy American Cocker Spaniel.
• An ancient artifact rests on a pedestal, the word “GOKU” etched onto its surface, glowing as if holding a hidden power within.
• Goku Black, in Super Saiyan Rose form, stands in a destroyed cityscape. The word "SAIYAN" is etched into the ground with dark energy.
• An enchanted forest with a waterfall cascading over rocks, the word “GOKU” formed by glowing moss along the stone surface, lighting up the misty surroundings.

(b) Text-to-Video Samples
• A giant panda sitting comfortably at a table, eating a hotpot meal.
• A flock of paper airplanes flutters through a dense jungle, weaving around trees as if they were migrating birds.
• An individual standing in a kitchen, wearing an apron, and holding a frying pan positioned above a burner.

Figure 1 | Generated samples from Goku. Key components are highlighted in RED.
First, we present a comprehensive data processing pipeline designed to construct large-scale, high-quality image and video-text datasets. The pipeline integrates multiple advanced
techniques, including video and image filtering based on aesthetic scores, OCR-driven content analysis, and subjective evaluations, to ensure exceptional visual and contextual quality.
Furthermore, we employ multimodal large language models (MLLMs) (Yuan et al., 2025) to
generate dense and contextually aligned captions, which are subsequently refined using an
additional large language model (LLM) (Yang et al., 2024a) to enhance their accuracy, fluency,
and descriptive richness. As a result, we have curated a robust training dataset comprising
approximately 36M video-text pairs and 160M image-text pairs, which proved sufficient for
training industry-level generative models.
Second, we take a pioneering step by applying the rectified flow formulation (Lipman et al.,
2023) for joint image and video generation, implemented through the Goku model family,
which comprises Transformer architectures with 2B and 8B parameters. At its core, the Goku
framework employs a 3D joint image-video variational autoencoder (VAE) to compress image
and video inputs into a shared latent space, facilitating unified representation. This shared latent
space is coupled with a full-attention (Vaswani et al., 2017) mechanism, enabling seamless joint
training of image and video. This architecture delivers high-quality, coherent outputs across
both images and videos, establishing a unified framework for visual generation tasks.
Furthermore, to support the training of Goku at scale, we have developed a robust infrastructure tailored for large-scale model training. Our approach incorporates advanced parallelism
strategies (Jacobs et al., 2023; Zhao et al., 2023) to manage memory efficiently during long-context
training. Additionally, we employ ByteCheckpoint (Wan et al., 2024) for high-performance
checkpointing and integrate fault-tolerant mechanisms from MegaScale (Jiang et al., 2024) to
ensure stability and scalability across large GPU clusters. These optimizations enable Goku
to handle the computational and data challenges of generative modeling with exceptional
efficiency and reliability.
We evaluate Goku on both text-to-image and text-to-video benchmarks to highlight its competitive advantages. For text-to-image generation, Goku-T2I demonstrates strong performance
across multiple benchmarks, including T2I-CompBench (Huang et al., 2023), GenEval (Ghosh
et al., 2024), and DPG-Bench (Hu et al., 2024), excelling in both visual quality and text-image
alignment. In text-to-video benchmarks, Goku-T2V achieves state-of-the-art performance on
the UCF-101 (Soomro et al., 2012) zero-shot generation task. Additionally, Goku-T2V attains
an impressive score of 84.85 on VBench (Huang et al., 2024), securing the top position on the
leaderboard (as of 2025-01-25) and surpassing several leading commercial text-to-video models.
Qualitative results, illustrated in Figure 1, further demonstrate the superior quality of the generated media samples. These findings underscore Goku’s effectiveness in multi-modal generation
and its potential as a high-performing solution for both research and commercial applications.

2. Goku: Generative Flow Models for Visual Creation
In this section, we present three core components of Goku: the image-video joint VAE (Yang
et al., 2024c), the Goku Transformer architecture, and the rectified flow formulation. These
components are designed to work synergistically, forming a cohesive and scalable framework
for joint image and video generation. During training, each raw video input $x \in \mathbb{R}^{T \times H \times W \times 3}$ (with
images treated as a special case where $T = 1$) is encoded from the pixel space to a latent space
using a 3D image-video joint VAE (Section 2.1). The encoded latents are then organized into
mini-batches containing both video and image representations, facilitating the learning of a
unified cross-modal representation. Subsequently, the rectified flow formulation (Section 2.3) is
applied to these latents, leveraging a series of Transformer blocks (Section 2.2) to model complex
temporal and spatial dependencies effectively.
2.1. Image-Video Joint VAE
Earlier research (He et al., 2022; Rombach et al., 2022; Esser et al., 2021) demonstrates that
diffusion and flow-based models can significantly improve efficiency and performance by
modeling in latent space through a Variational Auto-Encoder (VAE) (Esser et al., 2021; Kingma,
2013). Inspired by Sora (Brooks et al., 2024), the open-source community has introduced 3D-VAE
to explore spatio-temporal compression within latent spaces for video generation tasks (Lab
and etc., 2024; Zheng et al., 2024; Yang et al., 2024c). To extend the advantages of latent space
modeling across multiple media formats, including images and videos, we adopt a jointly
trained Image-Video VAE (Yang et al., 2024c) that handles both image and video data within a
unified framework. Specifically, for videos, we apply a compression stride of 8 × 8 × 4 across
height, width, and temporal dimensions, respectively, while for images, the compression stride
is set to 8 × 8 in spatial dimensions.

| Model | Layers | Model Dimension | FFN Dimension | Attention Heads |
|---|---|---|---|---|
| Goku-1B | 28 | 1152 | 4608 | 16 |
| Goku-2B | 28 | 1792 | 7168 | 28 |
| Goku-8B | 40 | 3072 | 12288 | 48 |

Table 1 | Architecture configurations for Goku Models. Goku-1B model is only used for pilot
experiments in Section 2.3.
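The compression strides stated in Section 2.1 determine the latent grid that the Transformer sees. The following is a minimal sketch of that arithmetic, not the authors' implementation. The function name `latent_shape` is hypothetical, and rounding up with ceiling division is an assumption: the text does not say how the VAE treats frame counts or sizes that are not multiples of the stride.

```python
import math

def latent_shape(frames, height, width, stride_t=4, stride_hw=8):
    """Return the latent grid (T', H', W') for a clip of `frames` frames.

    Assumption: sizes are rounded up (ceiling division). Images are the
    special case frames == 1 and only use the spatial stride.
    """
    t = 1 if frames == 1 else math.ceil(frames / stride_t)
    return t, math.ceil(height / stride_hw), math.ceil(width / stride_hw)

# An image at the lowest training resolution mentioned in the paper.
print(latent_shape(1, 288, 512))
# A hypothetical 16-frame clip at 480 x 864.
print(latent_shape(16, 480, 864))
```

Under these assumptions, the number of latent positions grows linearly with clip length and with pixel area, which is why long, high-resolution clips dominate sequence length.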
2.2. Transformer Architectures
The design of the Goku Transformer block builds upon GenTron (Chen et al., 2024a), an
extension of the class-conditioned diffusion transformer (Peebles and Xie, 2023) for text-to-image/video tasks. It includes a self-attention module for capturing inter-token correlations,
a cross-attention layer to integrate textual conditional embeddings (extracted via the Flan-T5
language model (Chung et al., 2024)), a feed-forward network (FFN) for feature projection,
and a layer-wise adaLN-Zero block that incorporates timestep information to guide feature
transformations. Additionally, we introduce several recent design enhancements to improve
model performance and training stability, as detailed below.
Plain Full Attention. In Transformer-based video generative models, previous approaches (Chen
et al., 2024a; Wu et al., 2023; Singer et al., 2023; Blattmann et al., 2023b) typically combine temporal attention with spatial attention to extend text-to-image generation to video. While this
method reduces computational cost, it is sub-optimal for modeling complex temporal motions,
as highlighted in prior work (Yang et al., 2024c; Polyak et al., 2024). In Goku, we adopt full
attention to model multi-modal tokens (image and video) within a unified network. Given the
large number of video tokens remaining after VAE processing—particularly for high-framerate, long-duration videos—we leverage FlashAttention (Shah et al., 2024; Dao, 2024) and
sequence parallelism (Li et al., 2021) to optimize both GPU memory usage and computational
efficiency.
Patch n’ Pack. To enable joint training on images and videos of varying aspect ratios and
lengths, we follow the approach from NaViT (Dehghani et al., 2024), packing both modalities
into a single minibatch along the sequence dimension. This method allows flexible mixing of
training instances with different sequence lengths into a single batch, eliminating the need for
data buckets (Podell et al., 2023).
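The packing idea can be illustrated with a small NumPy sketch. This is an illustration under stated assumptions rather than the Goku implementation: `pack_sequences` is a hypothetical helper, and the block-diagonal mask reflects the masking used in NaViT so that tokens attend only within their own sample. The `cu_seqlens` array of cumulative lengths is the form that variable-length attention kernels commonly consume.

```python
import numpy as np

def pack_sequences(seqs):
    """Pack token arrays of shape (L_i, D) along the sequence dimension.

    Returns the packed tokens, a boolean block-diagonal attention mask,
    and the cumulative sequence lengths.
    """
    lengths = [s.shape[0] for s in seqs]
    packed = np.concatenate(seqs, axis=0)
    # Segment id of every token: 0 for the first sample, 1 for the second, ...
    seg_ids = np.repeat(np.arange(len(seqs)), lengths)
    # Token i may attend to token j only if both belong to the same sample.
    mask = seg_ids[:, None] == seg_ids[None, :]
    cu_seqlens = np.concatenate([[0], np.cumsum(lengths)])
    return packed, mask, cu_seqlens

image_tokens = np.zeros((3, 4))   # a short "image" sequence
video_tokens = np.ones((5, 4))    # a longer "video" sequence
packed, mask, cu_seqlens = pack_sequences([image_tokens, video_tokens])
print(packed.shape, cu_seqlens.tolist())
```

No padding tokens are needed, so samples with very different lengths can share a batch without bucketing by resolution or duration.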
3D RoPE Position Embedding. Rotary Position Embedding (RoPE) (Su et al., 2024) has demonstrated effectiveness in LLMs by enabling greater sequence length flexibility and reducing
inter-token dependencies as relative distances increase. In our joint training framework, we apply
3D RoPE embeddings to image and video tokens, leveraging their extrapolation capability
to accommodate varying resolutions. This adaptability makes RoPE particularly suited for
handling diverse resolutions and video lengths. Furthermore, our empirical analysis revealed
that RoPE converges faster than sinusoidal positional embeddings during transitions across
different training stages.
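One common way to build a 3D RoPE is to split each attention head's channels into three groups and rotate each group by the token's temporal, vertical, and horizontal coordinate. The sketch below follows that construction. It is an assumption, not a description of Goku's code: the paper does not give the channel split, so `split=(16, 24, 24)` is hypothetical, chosen to sum to the head dimension of 64 implied by Table 1 for Goku-2B and Goku-8B.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x (N, d) by pos * frequency."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d/2,)
    ang = pos[:, None] * freqs[None, :]            # (N, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t, h, w, split=(16, 24, 24)):
    """Apply 1D RoPE to three channel groups using (t, h, w) coordinates.

    The split sizes are a hypothetical choice; each must be even.
    """
    assert sum(split) == x.shape[-1]
    dt, dh, _ = split
    return np.concatenate([
        rope_1d(x[:, :dt], t),
        rope_1d(x[:, dt:dt + dh], h),
        rope_1d(x[:, dt + dh:], w),
    ], axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 64))
k = rng.normal(size=(1, 64))

def score(shift):
    """Attention logit between q and k when both positions move by `shift`."""
    qr = rope_3d(q, np.array([1.0 + shift]), np.array([2.0 + shift]), np.array([3.0 + shift]))
    kr = rope_3d(k, np.array([0.0 + shift]), np.array([5.0 + shift]), np.array([1.0 + shift]))
    return float((qr * kr).sum())
```

Because every group is a pure rotation, the query-key logit depends only on the relative offset along each axis, which is the property that lets the same weights handle different resolutions and clip lengths.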
Q-K Normalization. Training large-scale Transformers can occasionally result in loss spikes,
which may lead to model corruption, manifesting as severe artifacts or even pure noise in
generated images or videos. To mitigate this issue, we incorporate query-key normalization (Dehghani et al., 2023) to stabilize the training process. Specifically, we apply RMSNorm (Zhang
and Sennrich, 2019) to each query-key feature prior to attention computation, ensuring smoother
and more reliable training dynamics.
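A minimal NumPy sketch of query-key normalization is given below: RMSNorm is applied to the queries and keys of each head before the scaled dot product. The function names and the learnable gains `q_gain` and `k_gain` are illustrative, and the exact placement and parameterization in Goku may differ.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm over the last axis: x / sqrt(mean(x^2) + eps) * gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def attention_with_qk_norm(q, k, v, q_gain, k_gain):
    """Softmax attention with RMS-normalized queries and keys.

    q, k, v have shape (heads, tokens, head_dim).
    """
    q = rms_norm(q, q_gain)
    k = rms_norm(k, k_gain)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(2, 5, 8)) for _ in range(3))
gain = np.ones(8)
out = attention_with_qk_norm(q, k, v, gain, gain)
# Scaling the raw queries by 1000 leaves the output essentially unchanged.
out_scaled = attention_with_qk_norm(1000.0 * q, k, v, gain, gain)
```

The example shows why the technique helps with loss spikes: the attention logits no longer grow with the magnitude of the query and key activations.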
The overall Transformer model is constructed by stacking a sequence of blocks as described
above. To address varying computational demands and performance requirements, we design
three model variants, summarized in Table 1. The Goku-1B model serves as a lightweight
option for pilot experiments. The Goku-2B variant consists of 28 layers, each with a model
dimension of 1792 and 28 attention heads, providing a balance between computational efficiency
and expressive capacity. In contrast, the larger Goku-8B variant features 40 layers, a model
dimension of 3072, and 48 attention heads, delivering superior modeling capacity aimed at
achieving high generation quality.
2.3. Flow-based Training
Our flow-based formulation is rooted in the rectified flow (RF) algorithm (Albergo and Vanden-Eijnden, 2023; Lipman et al., 2023; Liu et al., 2023), where a sample is progressively transformed
from a prior distribution, such as a standard normal distribution, to the target data distribution.
This transformation is achieved by defining the forward process as a series of linear interpolations between the prior and target distributions. Specifically, given a real data sample $\mathbf{x}_1$ from
the target distribution and a noise sample $\mathbf{x}_0 \sim \mathcal{N}(0, 1)$ from the prior distribution, a training
example is constructed through linear interpolation:

$$\mathbf{x}_t = t \cdot \mathbf{x}_1 + (1 - t) \cdot \mathbf{x}_0, \qquad (1)$$

where $t \in [0, 1]$ represents the interpolation coefficient. The model is trained to predict the
velocity, defined as the time derivative of $\mathbf{x}_t$, $\mathbf{v}_t = \frac{d\mathbf{x}_t}{dt}$, which guides the transformation of
intermediate samples $\mathbf{x}_t$ towards the real data $\mathbf{x}_1$ during inference. By establishing a direct, linear
interpolation between data and noise, RF simplifies the modeling process, providing improved
theoretical properties, conceptual clarity, and faster convergence across data distributions.
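Differentiating Equation (1) with respect to $t$ gives the regression target $\mathbf{v}_t = \mathbf{x}_1 - \mathbf{x}_0$, which is constant along each interpolation path. The sketch below builds one training example and integrates a velocity field with Euler steps. It is illustrative only: sampling $t$ uniformly and using a fixed-step Euler solver are assumptions, since this section does not specify the timestep distribution or the sampler.

```python
import numpy as np

def rf_training_pair(x1, rng):
    """Build (x_t, t, v_target, x0) for one rectified-flow training step."""
    x0 = rng.standard_normal(x1.shape)      # noise sample from the prior
    t = rng.uniform(0.0, 1.0)               # assumption: uniform timestep
    xt = t * x1 + (1.0 - t) * x0            # Equation (1)
    v_target = x1 - x0                      # d x_t / d t
    return xt, t, v_target, x0

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 3))
xt, t, v_target, x0 = rf_training_pair(x1, rng)
# With the exact velocity, Euler integration recovers the data sample.
recovered = euler_sample(lambda x, t: x1 - x0, x0, steps=8)
```

In training, a network would replace the oracle lambda, and a typical loss would be the mean squared error between its prediction at $(\mathbf{x}_t, t)$ and `v_target`.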
Goku takes a pioneering step by adopting a flow-based formulation for joint image-and-video generation. We conduct a pilot experiment to validate the rapid convergence of flow-based
training by performing class-conditional generation with Goku-1B, a model specifically designed
for these proof-of-concept experiments, on ImageNet-1K (256 × 256) (Deng et al., 2009). The
model is configured with 28 layers, an attention dimension of 1152, and 16 attention heads.
To evaluate performance, we compare key metrics, such as FID-50K and Inception Score (IS),
for models trained using the denoising diffusion probabilistic model (DDPM) (Ho et al., 2020)
and rectified flow. As shown in Table 2, RF demonstrates faster convergence than DDPM.
For instance, Goku-1B (RF) achieves a lower FID-50K after 400k training steps compared to
Goku-1B (DDPM), which requires 1000k steps to reach a similar level of performance.

| Loss | Steps | FID ↓ | sFID ↓ | IS ↑ | Precision ↑ | Recall ↑ |
|---|---|---|---|---|---|---|
| DDPM | 200k | 3.0795 | 4.3498 | 226.4783 | 0.8387 | 0.5317 |
| DDPM | 400k | 2.5231 | 4.3821 | 265.0612 | 0.8399 | 0.5591 |
| DDPM | 1000k | 2.2568 | 4.4887 | 286.5601 | 0.8319 | 0.5849 |
| Rectified Flow | 200k | 2.7472 | 4.6416 | 232.3090 | 0.8239 | 0.5590 |
| Rectified Flow | 400k | 2.1572 | 4.5022 | 261.1203 | 0.8210 | 0.5871 |

Table 2 | Proof-of-concept experiments of class-conditional generation on ImageNet 256×256.
Rectified flow achieves faster convergence compared to DDPM.
2.4. Training Details
Multi-stage Training. Directly optimizing joint image-and-video training poses significant
challenges, as the network must simultaneously learn spatial semantics critical for images
and temporal motion dynamics essential for videos. To tackle this complexity, we introduce a
decomposed, multi-stage training strategy that progressively enhances the model’s capabilities,
ensuring effective and robust learning across both modalities.
• Stage-1: Text-Semantic Pairing. In the initial stage, we focus on establishing a solid understanding of text-to-image relationships by pretraining Goku on text-to-image tasks. This step
is critical for grounding the model in basic semantic comprehension, enabling it to learn to
associate textual prompts with high-level visual semantics. Through this process, the model
develops a reliable capacity for representing visual concepts essential for both image and
video generation, such as object attributes, spatial configurations, and contextual coherence.
• Stage-2: Image-and-video joint learning. Building on the foundational capabilities of text-to-semantic pairing, we extend Goku to joint learning across both image and video data. This
stage leverages the unified framework of Goku, which employs a global attention mechanism
adaptable to both images and videos. Moreover, acquiring a substantial volume of high-quality
video data is generally more resource-intensive compared to obtaining a similar amount of
high-quality image data. To address this disparity, our framework integrates images and
videos into unified token sequences during training, enabling the rich information inherent
in high-quality images to enhance the generation of video frames (Chen et al., 2024a). By
curating a carefully balanced dataset of images and videos, Goku not only gains the capability
to generate both high-quality images and videos but also enhances the visual quality of videos
by leveraging the rich information from high-quality image data.
• Stage-3: Modality-specific finetuning. In the final stage, we fine-tune Goku for each specific
modality to further enhance its output quality. For text-to-image generation, we implement
image-centric adjustments aimed at producing more visually compelling images. For text-to-video generation, we focus on adjustments that improve temporal smoothness, motion
continuity, and stability across frames, resulting in realistic and fluid video outputs.
Cascaded Resolution Training. In the second stage of joint training, we adopt a cascaded resolution strategy to optimize the learning process. Initially, training is conducted on low-resolution
image and video data (288 × 512), enabling the model to efficiently focus on fundamental text-semantic-motion relationships at reduced computational costs. Once these core interactions
are well-established, the resolution of the training data is progressively increased, transitioning
from 480 × 864 to 720 × 1280. This stepwise resolution enhancement allows Goku to refine its
understanding of intricate details and improve overall image fidelity, ultimately leading to
superior generation quality for both images and videos.
2.5. Image-to-Video
To enable Goku to accept an image as an additional condition for video generation, we employ
a widely used strategy: using the first frame of each clip as the reference image (Girdhar
et al., 2023; Blattmann et al., 2023a; Yang et al., 2024c). The corresponding image tokens are
broadcast and concatenated with the paired noised video tokens along the channel dimension.
To fully leverage the pretrained knowledge during fine-tuning, we introduce a single MLP layer
for channel alignment, while preserving the rest of the model architecture identical to Goku-T2V.
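The conditioning path described above can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the authors' code: the "single MLP layer" is modeled as one linear map from 2C to C channels, the tensor layout (T, H, W, C) is hypothetical, and the identity-plus-zeros initialization in the example is one plausible way to preserve pretrained text-to-video behavior at the start of fine-tuning.

```python
import numpy as np

def condition_on_first_frame(noised_video, image_latent, weight, bias):
    """Concatenate a broadcast reference image and align channels.

    noised_video: (T, H, W, C) noised video latents
    image_latent: (H, W, C) latent of the reference (first) frame
    weight, bias: (2C, C) and (C,) parameters of the alignment layer
    """
    t = noised_video.shape[0]
    # Repeat the reference image along the temporal axis.
    image_b = np.broadcast_to(image_latent[None], (t,) + image_latent.shape)
    # Channel-wise concatenation: (T, H, W, 2C).
    x = np.concatenate([noised_video, image_b], axis=-1)
    return x @ weight + bias

rng = np.random.default_rng(0)
T, H, W, C = 4, 6, 8, 16
video = rng.standard_normal((T, H, W, C))
image = rng.standard_normal((H, W, C))
# Hypothetical initialization: pass video channels through, ignore image channels.
weight = np.concatenate([np.eye(C), np.zeros((C, C))], axis=0)
bias = np.zeros(C)
out = condition_on_first_frame(video, image, weight, bias)
```

With this initialization the output equals the unconditioned video latents, so the rest of the network initially sees exactly what Goku-T2V was trained on.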

3. Infrastructure Optimization
To achieve scalable and efficient training of Goku, we first adopt advanced parallelism strategies (Section 3.1) to handle the challenges of long-context, large-scale models. To further
optimize memory usage and balance computation with communication, we implement fine-grained Activation Checkpointing (Section 3.2). Additionally, we integrate robust fault tolerance
mechanisms from MegaScale, enabling automated fault detection and recovery with minimal
disruption (Section 3.3). Finally, ByteCheckpoint is utilized to ensure efficient and scalable
saving and loading of training states, supporting flexibility across diverse hardware configurations (Section 3.4). The details of these optimizations are introduced below.
3.1. Model Parallelism Strategies
The substantial model size and the exceptionally long sequence length (exceeding 220K tokens
for the longest sequence) necessitate the adoption of multiple parallelism strategies to ensure
efficient training. Specifically, we employ 3D parallelism to achieve scalability across three axes:
input sequences, data, and model parameters.
Sequence-Parallelism (SP) (Korthikanti et al., 2023; Li et al., 2021; Jacobs et al., 2023) slices
the input across the sequence dimension for independent layers (e.g., LayerNorm) to eliminate
redundant computations, reduce memory usage, and support padding for non-conforming
input. We adopt Ulysses (Jacobs et al., 2023) as our implementation, which shards samples across
the sequence parallel group from the start of the training loop. During attention computation, it
uses all-to-all communication to distribute query, key, and value shards, allowing each worker
to process the full sequence but only a subset of attention heads. After parallel computation of
attention heads, another all-to-all communication aggregates the results, recombining all heads
and the sharded sequence dimension.
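The data movement performed by the two all-to-all steps can be simulated on a single process. The sketch below is not the DeepSpeed-Ulysses implementation and performs no real communication: it uses a Python list to stand in for the ranks of a sequence-parallel group, and the function names are hypothetical. It assumes the sequence length and the head count are both divisible by the group size.

```python
import numpy as np

def ulysses_all_to_all(seq_shards):
    """Sequence-sharded -> head-sharded.

    seq_shards[r] has shape (seq / P, heads, dim) on rank r.
    Returns a list whose entry r has shape (seq, heads / P, dim).
    """
    p = len(seq_shards)
    heads = seq_shards[0].shape[1]
    assert heads % p == 0
    hp = heads // p
    out = []
    for r in range(p):
        # Rank r receives its head group from every sequence shard, in rank order.
        pieces = [s[:, r * hp:(r + 1) * hp, :] for s in seq_shards]
        out.append(np.concatenate(pieces, axis=0))
    return out

def ulysses_all_to_all_inverse(head_shards):
    """Head-sharded -> sequence-sharded (applied after attention)."""
    p = len(head_shards)
    seq = head_shards[0].shape[0]
    assert seq % p == 0
    sp = seq // p
    out = []
    for r in range(p):
        pieces = [h[r * sp:(r + 1) * sp] for h in head_shards]
        out.append(np.concatenate(pieces, axis=1))
    return out

full = np.arange(8 * 4 * 2, dtype=np.float64).reshape(8, 4, 2)  # (seq, heads, dim)
seq_shards = np.split(full, 2, axis=0)          # two "ranks", 4 tokens each
head_shards = ulysses_all_to_all(seq_shards)    # each rank: 8 tokens, 2 heads
restored = ulysses_all_to_all_inverse(head_shards)
```

Between the two calls each rank holds every token for its subset of heads, so attention can be computed locally without further communication.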
Fully Sharded Data Parallelism (FSDP) (Zhao et al., 2023) partitions all parameters, gradients
and optimizer states across the data parallel ranks. Instead of all-reduce in Distributed Data
Parallelism, FSDP performs all-gather for parameters and reduce-scatter for gradients, enabling
overlap with forward and backward computations to potentially reduce communication overhead. In our case, we adopt the HYBRID_SHARD strategy, which combines FULL_SHARD within
a shard group with parameter replication across shard groups; the replication effectively implements data
parallelism (DP). This approach minimizes communication costs by limiting all-gather and
reduce-scatter operations to each shard group.
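To make the grouping concrete, the following sketch enumerates which ranks shard parameters together and which ranks hold replicas of the same shard. It is a hypothetical layout for illustration: the helper `hybrid_shard_groups` is not a PyTorch API, and the assumption that a shard group consists of consecutive ranks (for example, the GPUs of one node) is not stated in the paper.

```python
def hybrid_shard_groups(world_size, shard_group_size):
    """Return (shard_groups, replicate_groups) for a hybrid sharding layout.

    Assumption: ranks in a shard group are consecutive.
    """
    assert world_size % shard_group_size == 0
    # Parameters are fully sharded inside each of these groups.
    shard_groups = [
        list(range(start, start + shard_group_size))
        for start in range(0, world_size, shard_group_size)
    ]
    # Ranks holding the same shard index form a replication (data-parallel) group.
    replicate_groups = [
        list(range(i, world_size, shard_group_size))
        for i in range(shard_group_size)
    ]
    return shard_groups, replicate_groups

shard_groups, replicate_groups = hybrid_shard_groups(world_size=8, shard_group_size=4)
print(shard_groups)
print(replicate_groups)
```

In this layout, all-gather and reduce-scatter run inside a shard group, while gradient synchronization across replicas only involves ranks that own the same shard.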
3.2. Activation Checkpointing
While the parallelism methods discussed in Section 3.1 provide significant memory savings and
enable large-scale training with long sequences, they inevitably introduce communication
overhead among ranks, which can lead to suboptimal overall performance. To address this
issue and better balance the computation and communication by maximizing their overlap
in the profiling trace, we designed a fine-grained Activation Checkpointing (AC) (Chen et al.,
2016) strategy. Specifically, we implemented selective activation checkpointing to minimize the
number of layers requiring activation storage while maximizing GPU utilization.
3.3. Cluster Fault Tolerance
Scaling Goku training to large-scale GPU clusters inevitably introduces fault scenarios, which can
reduce training efficiency. The likelihood of encountering failures increases with the number of
nodes, as larger systems have a higher probability of at least one node failing. These disruptions
can extend training time and increase costs. To enhance stability and efficiency at scale, we
adopted fault tolerance techniques from MegaScale (Jiang et al., 2024), including self-check
diagnostics, multi-level monitoring, and fast restart/recovery mechanisms. These strategies
effectively mitigate the impact of interruptions, enabling Goku to maintain robust performance
in large-scale generative modeling tasks.
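The claim that failure likelihood grows with cluster size follows from simple probability. As a back-of-the-envelope sketch, assume each node fails independently with the same probability `p_node` during some time window; both the independence assumption and the example value of 1% are illustrative and do not come from the paper.

```python
def prob_any_failure(num_nodes, p_node):
    """Probability that at least one of `num_nodes` independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** num_nodes

for n in (1, 10, 100, 1000):
    print(n, round(prob_any_failure(n, 0.01), 4))
```

Even a small per-node failure rate makes an interruption close to certain at the scale of thousands of nodes, which motivates automated detection and fast restart.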
3.4. Saving and Loading Training Stages
Checkpointing training states—such as model parameters, exponential moving average (EMA)
parameters, optimizer states, and random states—is crucial for training large-scale models,
particularly given the increased likelihood of cluster faults. Reloading checkpointed states
ensures reproducibility, which is essential for model reliability and debugging potential issues,
including those caused by unintentional errors or malicious attacks.
To support scalable large-scale training, we adopt ByteCheckpoint (Wan et al., 2024) as
our checkpointing solution. It not only enables parallel saving and loading of partitioned
checkpoints with high I/O efficiency but also supports resharding distributed checkpoints. This
flexibility allows seamless switching between different training scales, accommodating varying
numbers of ranks and diverse storage backends. In our setup, checkpointing an 8B model across
thousands of GPUs blocks training for less than 4 seconds, which is negligible compared to
the overall forward and backward computation time per iteration.

4. Data Curation Pipeline
We unlock the data volume required for industry-grade video/image generation models.
Our data curation pipeline, illustrated in Figure 2, consists of five main stages: (1) image and
video collection, (2) video extraction and clipping, (3) image and video filtering, (4) captioning,
and (5) data distribution balancing. We describe the details of the data curation procedure below.
[Figure 2: pipeline diagram omitted. Stage labels: Collection, Extraction, Filtering, Captioning, Balancing. Other labels in the diagram: Raw Video, Raw Image, Read Video Info, FFmpeg Error, Extract Clips, Keyframe per Second, Aesthetic Score, OCR, DINOv2 Similarity, Motion Score, NSFW Video Filtering, Background, Video Tag, Caption, long-tail, uniform.]

Figure 2 | The data curation pipeline in Goku. Given a large volume of video/image data
collected from the Internet, we generate high-quality video/image-text pairs through a series of
data filtering, captioning and balancing steps.

4.1. Data Overview
We collect raw image and video data from a variety of sources, including publicly available
academic datasets, internet resources, and proprietary datasets obtained through partnerships
with collaborating organizations. After rigorous filtering, the final training dataset for Goku
consists of approximately 160M image-text pairs and 36M video-text pairs, encompassing both
publicly available datasets and internally curated proprietary datasets. The detailed composition
of these resources is outlined as follows:
• Text-to-Image Data. Our text-to-image training dataset includes 100M public samples from
LAION (Schuhmann et al., 2022) and 60M high-quality, internal samples. We use public data
for pre-training and internal data for fine-tuning.
• Text-to-Video Data. Our T2V training dataset includes 11M public clips and 25M in-house
clips. The former include Panda-70M (Chen et al., 2024b), InternVid (Wang et al., 2023b),
OpenVid-1M (Nan et al., 2024), and Pexels (Lab and etc., 2024). Rather than directly using
these datasets, we apply a data curation pipeline to keep high-quality samples.

4.2. Data Processing and Filtering
To construct a high-quality video dataset, we implement a comprehensive processing pipeline
comprising several key stages. Raw videos are first preprocessed and standardized to address
inconsistencies in encoding formats, durations, and frame rates. Next, a two-stage video clipping
method segments videos into meaningful and diverse clips of consistent length. Additional
filtering processes are applied, including visual aesthetic filtering to retain photorealistic and
visually rich clips, OCR filtering to exclude videos with excessive text, and motion filtering
to ensure balanced motion dynamics. In addition, the multi-level training data is segmented
based on resolution and corresponding filtering thresholds for DINO similarity, aesthetic score,
OCR text coverage, and motion score, as summarized in Table 4. We provide the details of each
processing step as follows.
Table 3 presents the key parameters and their corresponding thresholds used for video
quality assessment. Each parameter is essential in ensuring the generation and evaluation of
high-quality videos. The Duration parameter specifies that raw video lengths should be at
least 4 seconds to capture meaningful temporal dynamics. The Resolution criterion ensures
that the minimum dimension (either height or width) of the video is no less than 480 pixels,
maintaining adequate visual clarity. The Bitrate, which determines the amount of data processed
per second during playback, requires a minimum of 500 kbps to ensure sufficient quality, clarity,
and manageable file size. Videos with low bitrate typically correspond to content with low
complexity, such as static videos or those featuring pure color backgrounds. Finally, the Frame
Rate enforces a standard of at least 24 frames per second (film standard) or 23.976 frames
per second (NTSC standard) to guarantee smooth motion and prevent visual artifacts. These
thresholds collectively establish a baseline for evaluating and generating high-quality video
content.

| Parameter | Description | Threshold |
|---|---|---|
| Duration | Raw video length | ≥ 4 seconds |
| Resolution | Width and height of the video | min{height, width} ≥ 480 |
| Bitrate | Amount of data processed per second during playback, which impacts the video's quality, clarity, and file size | ≥ 500 kbps |
| Frame Rate | Frames displayed per second | ≥ 24 FPS (Film Standard) / 23.976 FPS (NTSC Standard) |

Table 3 | Summary of video quality parameters and their thresholds for preprocessing. The
table outlines the criteria used to filter and standardize raw videos based on essential attributes,
ensuring uniformity and compatibility in the dataset.
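The Table 3 thresholds amount to a cheap metadata check that can run before any model-based filtering. A minimal sketch follows; the `VideoInfo` container and its field names are hypothetical, and in practice the values would come from a probing tool such as ffprobe.

```python
from dataclasses import dataclass

@dataclass
class VideoInfo:
    duration_s: float      # raw video length in seconds
    width: int             # pixels
    height: int            # pixels
    bitrate_kbps: float    # kilobits per second
    fps: float             # frames per second

def passes_basic_filter(v):
    """Apply the duration, resolution, bitrate, and frame-rate thresholds of Table 3."""
    return (
        v.duration_s >= 4.0
        and min(v.height, v.width) >= 480
        and v.bitrate_kbps >= 500.0
        and v.fps >= 23.976
    )

good = VideoInfo(duration_s=12.0, width=1280, height=720, bitrate_kbps=2500.0, fps=24.0)
too_short = VideoInfo(duration_s=3.0, width=1280, height=720, bitrate_kbps=2500.0, fps=24.0)
low_res = VideoInfo(duration_s=12.0, width=640, height=360, bitrate_kbps=2500.0, fps=30.0)
print(passes_basic_filter(good), passes_basic_filter(too_short), passes_basic_filter(low_res))
```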
• Preprocessing and Standardization of Raw Videos. Videos collected from the internet
often require extensive preprocessing to address variations in encoding formats, durations,
and frame rates. Initially, we perform a primary filtering step based on fundamental video
attributes such as duration, resolution, and bitrate. The specific filtering criteria and corresponding
thresholds are detailed in Table 3. This initial filtering step is computationally efficient
compared to more advanced, model-based filtering approaches, such as aesthetic (Schuhmann
et al., 2022) evaluation models. Following this stage, the raw videos are standardized to a
consistent coding format, H.264 (Wiegand et al., 2003), ensuring uniformity across the dataset
and facilitating subsequent processing stages.
• Video Clips Extraction. We employ a two-stage video clipping method for this stage. First,
we use PySceneDetect (Castellano, 2024) for shot boundary detection, resulting in coarse-grained
video clips from raw videos. Next, we further refine the video clips by sampling one frame
per second, generating DINOv2 (Oquab et al., 2023) features and calculating cosine similarity
between adjacent frames. When similarity falls below a set threshold, we mark a shot change
and further divide the clip. Specifically, as shown in Table 4, for video resolutions around
480 × 864, we split a clip wherever the DINO similarity between adjacent frames
falls below 0.85. For resolutions greater than 720 × 1280, the threshold is set at 0.9. In addition, to
standardize length, we limit clips to a maximum of 10 seconds. Furthermore, we consider
the similarity between different clips derived from the same source video to ensure diversity
and maintain quality. Specifically, we compute the perceptual hashing (Contributors, 2013)
values of keyframes from each clip and compare them. If two clips have similar hash values,
indicating significant overlap, we retain the clip with a higher aesthetic score. This ensures
that the final dataset includes diverse and high-quality video clips.
• Visual Aesthetic Filtering. To assess the visual quality of the videos, we utilize aesthetic
models (Schuhmann et al., 2022) to evaluate the keyframes. The aesthetic scores of the
keyframes are averaged to obtain an overall aesthetic score for each video. For videos with
resolutions around 480 × 864, those with an aesthetic score below 4.3 are discarded, while for
resolutions exceeding 720 × 1280, the threshold is raised to 4.5. This filtering process ensures
that the selected clips are photorealistic, visually rich, and of high aesthetic quality.

Stage   Amount   Resolution     DINO-Sim.   Aesthetic   OCR      Motion
480p    36M      ≥ 480×864      ≥ 0.85      ≥ 4.3       ≤ 0.02   0.3 ≤ score ≤ 20.0
720p    24M      ≥ 720×1280     ≥ 0.90      ≥ 4.5       ≤ 0.01   0.5 ≤ score ≤ 15.0
1080p   7M       ≥ 1080×1920    ≥ 0.90      ≥ 4.5       ≤ 0.01   0.5 ≤ score ≤ 8.0

Table 4 | Overview of multi-stage training data. This table summarizes the thresholds for each
filtering criterion, including resolution, DINO similarity, aesthetic score, OCR text coverage,
motion score, and the corresponding data quantities.
• OCR Filtering. To exclude videos with excessive text, we employ an internal OCR model to
detect text within the keyframes. The OCR model identifies text regions, and we calculate
the text coverage ratio by dividing the area of the largest bounding box detected by the total
area of the keyframe. Videos with a text coverage ratio exceeding predefined thresholds are
discarded. Specifically, for videos with resolutions around 480 × 864, the threshold is set at
0.02, while for resolutions exceeding 720 × 1280, the threshold is reduced to 0.01. This process
effectively filters out videos with excessive text content.
• Motion Filtering. Unlike images, videos require additional filtering based on motion characteristics. To achieve this, we utilize RAFT (Teed and Deng, 2020) to compute the mean
optical flow of video clips, which is then used to derive a motion score. For videos with
resolutions around 480 × 864, clips with motion scores below 0.3 (indicating low motion) or
above 20.0 (indicating excessive motion) are excluded. For resolutions exceeding 720 × 1280,
the thresholds are adjusted to 0.5 and 15.0, respectively. Furthermore, to enhance motion
control, the motion score is appended to each caption.
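
The clip-splitting rule and the per-stage thresholds of Table 4 can be sketched as follows. This is a minimal illustration, assuming that per-frame feature vectors, keyframe aesthetic scores, OCR bounding boxes, and a motion score have already been produced by the respective models. The names `StageThresholds`, `STAGES`, `split_points`, `text_coverage`, and `keep_clip` are hypothetical and do not come from the paper's code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageThresholds:
    dino_sim: float       # split where adjacent-frame similarity drops below this
    min_aesthetic: float  # discard clips whose mean keyframe score is lower
    max_ocr_ratio: float  # discard clips whose text coverage is higher
    min_motion: float     # discard clips with less motion
    max_motion: float     # discard clips with more motion


# Threshold values copied from Table 4; the dictionary layout is an assumption.
STAGES = {
    "480p": StageThresholds(0.85, 4.3, 0.02, 0.3, 20.0),
    "720p": StageThresholds(0.90, 4.5, 0.01, 0.5, 15.0),
    "1080p": StageThresholds(0.90, 4.5, 0.01, 0.5, 8.0),
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_points(frame_features, threshold):
    """Return indices i where a new clip starts at frame i, i.e. where the
    similarity between frame i-1 and frame i falls below the threshold."""
    return [
        i
        for i in range(1, len(frame_features))
        if cosine_similarity(frame_features[i - 1], frame_features[i]) < threshold
    ]


def text_coverage(boxes, frame_w, frame_h):
    """Area of the largest OCR box (x0, y0, x1, y1) divided by the frame area."""
    if not boxes:
        return 0.0
    largest = max((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes)
    return largest / (frame_w * frame_h)


def keep_clip(stage, keyframe_aesthetics, ocr_ratio, motion_score):
    """Apply the aesthetic, OCR, and motion thresholds of one training stage."""
    t = STAGES[stage]
    aesthetic = sum(keyframe_aesthetics) / len(keyframe_aesthetics)
    return (
        aesthetic >= t.min_aesthetic
        and ocr_ratio <= t.max_ocr_ratio
        and t.min_motion <= motion_score <= t.max_motion
    )
```

Keeping the thresholds in a single table-like structure makes it easy to reuse the same filtering code for each training stage; the perceptual-hash deduplication step is not covered by this sketch.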

4.3. Captioning
Detailed captions are essential for enabling the model to generate text-aligned images/videos
precisely. For images, we use InternVL2.0 (Chen et al., 2024c) to generate dense captions
for each sample. To caption video clips, we start with InternVL2.0 (Chen et al., 2024c) for
keyframe captions, followed by Tarsier2 (Yuan et al., 2025) for video-wide captions. Note that
the Tarsier2 model can inherently describe camera motion types (e.g., zoom in, pan right) in
videos, eliminating the need for a separate prediction model and simplifying the overall pipeline
compared to previous work such as Polyak et al. (2024). Next, we utilize Qwen2 (Yang et al.,
2024a) to merge the keyframe and video captions. We also found empirically that
adding the motion score (calculated by RAFT (Teed and Deng, 2020)) to the captions improves
motion control for video generation. This approach enables users to specify different motion
scores in prompts to guide the model in generating videos with varied motion dynamics.
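
As a small illustration of the last step, the snippet below appends a motion score to a merged caption. The exact text format used in training is not specified here, so the `"Motion score: ..."` suffix and the function name `caption_with_motion` are assumptions made purely for this sketch.

```python
def caption_with_motion(caption: str, motion_score: float) -> str:
    """Append the motion score to a caption.

    The suffix format is hypothetical; only the idea of attaching the
    RAFT-derived score to the caption comes from the text.
    """
    return f"{caption.rstrip()} Motion score: {motion_score:.1f}."


example = caption_with_motion("A dog runs along the beach.", 12.0)
```

At inference time, a prompt written in the same format would let a user request more or less motion by changing the number.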
4.4. Training Data Balancing
The model’s performance is significantly influenced by the data distribution, especially for
video data. To balance the video training data, we first use an internal video classification model
to generate semantic tags for the videos. We then adjust the data distribution based on these
semantic tags to ensure a balanced representation across categories.
• Data Semantic Distribution. The video classification model assigns a semantic tag to each
video based on four evenly sampled keyframes. The model categorizes videos into 9 primary
classes (e.g., human, scenery, animals, food) and 86 subcategories (e.g., half-selfie, kid, dinner,
wedding). Figure 3a presents the semantic distribution across our filtered training clips, with
humans, scenery, food, urban life, and animals as the predominant categories.
• Data Balancing. The quality of the generated videos is closely tied to the semantic distribution
of the training data. Videos involving humans pose greater modeling challenges due to
the extensive diversity in appearances, whereas animals and landscapes exhibit more visual
consistency and are relatively easier to model. To address this disparity, we implement a
data-balancing strategy that emphasizes human-related content while ensuring equitable
representation across subcategories within each primary category. Overrepresented subcategories
are selectively down-sampled, whereas underrepresented ones are augmented through
artificial data generation and oversampling techniques. The balanced data distribution is shown
in Figure 3b.

(a) Semantic distribution of video clips.
(b) The balanced semantic distribution of subcategories (panels: Sub-category from Human, Sub-category from Scenery).

Figure 3 | Training data distributions. The balanced semantic distributions of primary categories
and subcategories are shown in (a) and (b), respectively.
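
The down-sampling and oversampling part of the balancing strategy can be sketched as below. This is a simplified illustration, assuming every subcategory is resampled to the same target count; the function name `balance_by_subcategory` and the uniform target are assumptions, and the artificial data generation step and the extra emphasis on human-related content are not modeled.

```python
import random
from collections import defaultdict


def balance_by_subcategory(clips, target_per_subcategory, seed=0):
    """Resample (clip_id, subcategory) pairs to a fixed count per subcategory.

    Overrepresented subcategories are down-sampled without replacement;
    underrepresented ones are padded by sampling with replacement.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for clip_id, subcategory in clips:
        groups[subcategory].append(clip_id)

    balanced = {}
    for subcategory, ids in groups.items():
        if len(ids) >= target_per_subcategory:
            balanced[subcategory] = rng.sample(ids, target_per_subcategory)
        else:
            extra = rng.choices(ids, k=target_per_subcategory - len(ids))
            balanced[subcategory] = ids + extra
    return balanced


clips = [(f"forest_{i}", "forest") for i in range(10)] + [("snow_0", "snow"), ("snow_1", "snow")]
balanced = balance_by_subcategory(clips, target_per_subcategory=4)
```

In practice the target would likely differ per primary category, for example a larger share for human-related subcategories; that weighting is left out here.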


Method                            Text Enc.       GenEval   T2I-CompBench                 DPG-Bench
                                                  Overall   Color     Shape     Texture   Average
SDv1.5 (Rombach et al., 2022)     CLIP ViT-L/14   0.43      0.3730    0.3646    0.4219    63.18
DALL-E 2 (Ramesh et al., 2022)    CLIP ViT-H/16   0.52      0.5750    0.5464    0.6374    -
SDv2.1 (Rombach et al., 2022)     CLIP ViT-H/14   0.50      0.5694    0.4495    0.4982    -
SDXL (Podell et al., 2023)        CLIP ViT-bigG   0.55      0.6369    0.5408    0.5637    74.65
PixArt-𝛼 (Chen et al., 2023)      Flan-T5-XXL     0.48      0.6886    0.5582    0.7044    71.11
DALL-E 3 (Betker et al., 2023)    Flan-T5-XXL     0.67†     0.8110†   0.6750†   0.8070†   83.50†
GenTron (Chen et al., 2024a)      CLIP T5XXL      -         0.7674    0.5700    0.7150    -
SD3 (Esser et al., 2024)          Flan-T5-XXL     0.74      -         -         -         -
Show-o (Xie et al., 2024)         Phi-1.5         0.53      -         -         -         -
Transfusion (Zhou et al., 2024)   -               0.63      -         -         -         -
Chameleon (Lu et al., 2024)       -               0.39      -         -         -         -
LlamaGen (Sun et al., 2024)       FLAN-T5 XL      0.32      -         -         -         -
Emu 3 (Wang et al., 2024b)        -               0.66†     0.7913†   0.5846†   0.7422†   80.60

Goku-T2I (2B)                     FLAN-T5 XL      0.70      0.7521    0.4832    0.6691    83.65
Goku-T2I (2B)†                    FLAN-T5 XL      0.76†     0.7561†   0.5759†   0.7071†   -

Table 5 | Comparison with state-of-the-art models on image generation benchmarks. We
evaluate on GenEval (Ghosh et al., 2024), T2I-CompBench (Huang et al., 2023), and DPG-Bench
(Hu et al., 2024). Following (Wang et al., 2024b), we use † to indicate the result with
prompt rewriting.

5. Experiments
5.1. Text-to-Image Results
We conduct a comprehensive quantitative evaluation of Goku-T2I on widely recognized image
generation benchmarks, including GenEval (Ghosh et al., 2024), T2I-CompBench (Huang
et al., 2023), and DPG-Bench (Hu et al., 2024). Details of these benchmarks can be found in
Appendix A. The results are summarized in Table 5.
Performance on GenEval. To assess text-image alignment comprehensively, we employ the
GenEval benchmark, which evaluates the correspondence between textual descriptions and
visual content. Since Goku-T2I is primarily trained on dense generative captions, it exhibits a
natural advantage when handling detailed prompts. To further explore this, we expand the
original short prompts in GenEval with ChatGPT-4o, preserving their semantics while enhancing
descriptive detail. As shown in Table 5, Goku-T2I achieves strong performance with the original
short prompts, surpassing most state-of-the-art models. With the rewritten prompts, Goku-T2I
attains the highest score (0.76), demonstrating its exceptional capability in aligning detailed
textual descriptions with generated images.
Performance on T2I-CompBench. We further evaluate the alignment between generated
images and textual conditions using the T2I-CompBench benchmark, which focuses on various
object attributes such as color, shape, and texture. As illustrated in Table 5, Goku-T2I consistently
outperforms several strong baselines, including PixArt-𝛼 (Chen et al., 2023), SDXL (Podell et al.,
2023), and DALL-E 2 (Mishkin et al., 2022). Notably, the inclusion of prompt rewriting leads
to improved performance across all attributes, further highlighting Goku-T2I’s robustness in
text-image alignment.

Method                                    Resolution   FVD (↓)   IS (↑)
CogVideo (Chinese) (Hong et al., 2022)    480×480      751.34    23.55
CogVideo (English) (Hong et al., 2022)    480×480      701.59    25.27
Make-A-Video (Singer et al., 2023)        256×256      367.23    33.00
VideoLDM (Blattmann et al., 2023b)        -            550.61    33.45
LVDM (He et al., 2022)                    256×256      372.00    -
MagicVideo (Zhou et al., 2022)            -            655.00    -
PixelDance (Zeng et al., 2024)            -            242.82    42.10
PYOCO (Ge et al., 2023)                   -            355.19    47.76
Emu-Video (Girdhar et al., 2023)          256×256      317.10    42.7
SVD (Blattmann et al., 2023a)             240×360      242.02    -
Goku-2B (ours)                            256×256      246.17    45.77 ± 1.10
Goku-2B (ours)                            240×360      254.47    46.64 ± 1.08
Goku-2B (ours)                            128×128      217.24    42.30 ± 1.03

Table 6 | Zero-shot text-to-video performance on UCF-101. We generate videos of different
resolutions, including 256×256, 240×360, 128×128, for comprehensive comparisons.
Performance on DPG-Bench. While the aforementioned benchmarks primarily evaluate text-image
alignment with short prompts, DPG-Bench is designed to test model performance on
dense prompt following. This challenging benchmark includes 1,000 detailed prompts, providing a rigorous test of a model’s ability to generate visually accurate outputs for complex textual
inputs. As shown in the last column of Table 5, Goku-T2I achieves the highest performance
with an average score of 83.65, surpassing PixArt-𝛼 (Chen et al., 2023) (71.11), DALL-E 3 (Betker
et al., 2023) (83.50), and EMU3 (Wang et al., 2024b) (80.60). These results highlight Goku-T2I’s
superior ability to handle dense prompts and maintain high fidelity in text-image alignment.
5.2. Text-to-Video Results
Performance on UCF-101. We conduct experiments on UCF-101 (Soomro et al., 2012) in a
zero-shot text-to-video setting. As UCF-101 only has class labels, we utilize a video-language
model, Tarsier-34B (Wang et al., 2024a), to generate detailed captions for all UCF-101 videos.
These captions are then used to synthesize videos with Goku. In total, we generated 13,320
videos at different resolutions with the Goku-2B model for evaluation, including 256×256, 240×360,
and 128×128. Following standard practice (Skorokhodov et al., 2022), we use the I3D model,
pre-trained on Kinetics-400 (Carreira and Zisserman, 2017), as the feature extractor. Based on
the extracted features, we calculated Fréchet Video Distance (FVD) (Unterthiner et al., 2018)
to evaluate the fidelity of the generated videos. The results in Table 6 demonstrate that Goku
consistently generates videos with lower FVD and higher IS. For instance, at a resolution of
128×128, the FVD of videos generated by Goku is 217.24, achieving state-of-the-art performance
and highlighting significant advantages over other methods.
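
For reference, the Fréchet distance underlying FVD can be sketched as follows. This is a minimal sketch, assuming the I3D features of real and generated videos are already available as two NumPy arrays of shape (num_videos, feature_dim); the feature extraction itself is not shown, and the function name `frechet_distance` is chosen for illustration. The trace of the matrix square root is computed from the eigenvalues of the covariance product, which is one common way to evaluate it and not necessarily the one used in the evaluation above.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    d^2 = ||mu_r - mu_g||^2 + Tr(C_r) + Tr(C_g) - 2 Tr((C_r C_g)^{1/2})
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # For positive semi-definite covariances, the eigenvalues of C_r C_g are
    # real and non-negative, so Tr((C_r C_g)^{1/2}) is the sum of their square
    # roots. Small negative or imaginary parts from round-off are discarded.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


# Random stand-ins for extracted features (not real I3D outputs).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
shifted = real + 3.0
```

With identical feature sets the distance is zero up to round-off, and a pure mean shift contributes only the squared mean difference, which makes both cases convenient sanity checks.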


Models             Human    Scene    Dynamic   Multiple   Appear.   Quality   Semantic   Overall
                   Action            Degree    Objects    Style     Score     Score
AnimateDiff-V2     92.60    50.19    40.83     36.88      22.42     82.90     69.75      80.27
VideoCrafter-2.0   95.00    55.29    42.50     40.66      25.13     82.20     73.42      80.44
OpenSora V1.2      85.80    42.47    47.22     58.41      23.89     80.71     73.30      79.23
Show-1             95.60    47.03    44.44     45.47      23.06     80.42     72.98      78.93
Gen-3              96.40    54.57    60.14     53.64      24.31     84.11     75.17      82.32
Pika-1.0           86.20    49.83    47.50     43.08      22.26     82.92     71.77      80.69
CogVideoX-5B       99.40    53.20    70.97     62.11      24.91     82.75     77.04      81.61
Kling              93.40    50.86    46.94     68.05      19.62     83.39     75.68      81.85
Mira               63.80    16.34    60.33     12.52      21.89     78.78     44.21      71.87
CausVid            99.80    56.58    92.69     72.15      24.27     85.65     78.75      84.27
Luma               96.40    58.98    44.26     82.63      24.66     83.47     84.17      83.61
HunyuanVideo       94.40    53.88    70.83     68.55      19.80     85.09     75.82      83.24

Goku (ours)        97.60    57.08    76.11     79.48      23.08     85.60     81.87      84.85

Table 7 | Comparison with leading T2V models on VBench. Goku achieves state-of-the-art
overall performance. Detailed results across all 16 evaluation dimensions are provided in Table 8
in the Appendix.
Performance on VBench. As presented in Table 7, we evaluate Goku-T2V against state-of-the-art
models on VBench (Huang et al., 2024), a comprehensive benchmark designed to assess
video generation quality across 16 dimensions. Goku-T2V achieves state-of-the-art overall
performance on VBench, showcasing its ability to generate high-quality videos across diverse
attributes and scenarios.
Among the key metrics, Goku-T2V demonstrates notable strength in human action representation, dynamic degree, and multiple object generation, reflecting its capacity for handling
complex and diverse video content. Additionally, it achieves competitive results in appearance
style, quality score, and semantic alignment, highlighting its balanced performance across
multiple aspects.
For detailed results on all 16 evaluation dimensions, we refer readers to Table 8 in the Appendix. This comprehensive analysis underscores Goku-T2V’s superiority in video generation
compared to prior approaches.
5.3. Image-to-Video
We finetune Goku-I2V from the T2V initialization with approximately 4.5M text-image-video
triplets, sourced from diverse domains to ensure robust generalization. Despite the relatively
small number of fine-tuning steps (10k), our model demonstrates remarkable efficiency in
animating the reference image while maintaining strong alignment with the accompanying text. As
illustrated in Figure 4, the generated videos exhibit high visual quality and temporal coherence,
effectively capturing the semantic nuances described in the text.


A lion running towards the left side of the scene, with flames engulfing its body. As it runs, the lion gradually transforms into a mass of flames…

A woman in workout gear is lifting weights at a gym, her biceps flexing with each lift, sweat visible on her forehead, with a closeup on her determined expression…

A man surfing on a wave, with the camera following his movement and focusing on his face. He is smiling and giving a thumbs-up to the camera, …

Figure 4 | Samples of Goku-I2V. Reference images are presented in the leftmost columns. We
omitted redundant information from the long prompts, displaying only the key details in each
one. Key words are highlighted in RED.
5.4. Image and Video Qualitative Visualizations
For intuitive comparisons, we conduct qualitative assessments and present sampled results in
Figure 6. The evaluation includes open-source models, such as CogVideoX (Yang et al., 2024c)
and Open-Sora-Plan (Zheng et al., 2024), alongside closed-source commercial products, including DreamMachine (Luma, 2024), Pika (pika, 2024), Vidu (Bao et al., 2024), and Kling (Kuaishou,
2024). The results reveal that some commercial models struggle to generate critical video
elements when handling complex prompts. For instance, models like Pika, DreamMachine,
and Vidu (rows 3–5) fail to render the skimming drone over water. While certain models succeed in generating the target drone, they often produce distorted subjects (rows 1–2) or static
frames lacking motion consistency (row 6). In contrast, Goku-T2V (8B) demonstrates superior
performance by accurately incorporating all details from the prompt, creating a coherent visual output with smooth motion. Additional comparisons are provided in the appendix for a
more comprehensive evaluation. Furthermore, more video examples are available on the Goku
homepage.
5.5. Ablation Studies
Model Scaling. We compared Goku-T2V models with 2B and 8B parameters. Results in
Figure 5a indicate that model scaling helps mitigate the generation of distorted object structures,
such as the arm in Figure 5a (row 1) and the wheel in Figure 5a (row 2). This aligns with findings
observed in large multi-modality models.
Joint Training. We further examine the impact of joint image-and-video training. Starting from
the same pretrained Goku-T2I (8B) weights, we fine-tuned Goku-T2V (8B) on 480p videos for
an equal number of training steps, with and without joint image-and-video training. As shown
in Figure 5b, Goku-T2V without joint training tends to generate low-quality video frames, while
the model with joint training more consistently produces photorealistic frames.

(a) Model Scaling: Goku-T2V (2B) vs. Goku-T2V (8B).
(b) Joint Training: Goku-T2V w/o Joint Training vs. Goku-T2V w/ Joint Training.

Figure 5 | Ablation Studies of Model Scaling and Joint Training. Fig. (a) compares
Goku-T2V (2B) with Goku-T2V (8B). Fig. (b) compares Goku-T2V trained with and without
joint training.

6. Conclusion
In this work, we presented Goku, a novel model for joint image-and-video generation that targets
industry-standard performance. Through an advanced data curation process and a robust
model architecture, Goku delivers high-quality outputs by ensuring both fine-grained data
selection and effective integration of image and video modalities. Key components, such as the
image-video joint VAE and the application of rectified flow, facilitate seamless token interaction
across modalities, establishing a shared latent space that enhances model adaptability and
attention across tokens. Empirical results highlight Goku’s superiority in commercial-grade
visual generation quality.
Acknowledgements
We sincerely appreciate the support of our collaborators at ByteDance who contributed to this
work. Xibin Wu, Chongxi Wang, Yina Tang, Fangzhou Ai, Yi Ren, Wei Wang, Chen Chen, Colin
Young, Bobo Zeng, Ge Bai, Yi Fu, Ruoyu Guo, Prasanna Raghav, Weiguo Feng, Xugang Ye,
Adithya Sampath, Aaron Shen, Da Tang, Yuan Fang, Qijun Gan, Chen Zhang, Zhenhui Ye, Pan
Xie, Houmin Wei, Gaohong Liu, Zherui Liu, Chenyuan Wang, Yun Zhang, Kaihua Jiang, Zhuo
Jiang, Yang Bai, Weiqiang Lou, Hongkai Li, Xi Yang, Shuguang Wang, Junru Zheng, Zuquan
Song, Zixian Du, Jingzhe Tang, Yongqiang Zhang, Mingji Han, Heng Zhang, Li Han, Sophie
Xie, Shuo Li, Xinzhi Yao, Peng Li, Lianke Qin, Dongyang Wang, Yang Cheng, Chundian Liu,
Wenhao Hao, Haibin Lin, Xin Liu
Rows: CogVideoX1.5 (5B), Open-Sora Plan (v1.3), Pika, DreamMachine, Vidu, Kling (1.5), Goku (8B).

Prompt: Gliding through a crystal-clear coral reef, the drone skims just above the vibrant marine life below. Brightly colored corals, schools
of fish, and rays of sunlight penetrating the water’s surface all contribute to the serene yet fast-paced journey. The scene showcases the
beauty of the underwater world, as the drone swiftly maneuvers through coral arches and narrow underwater channels.

Figure 6 | Qualitative comparisons with state-of-the-art (SoTA) video generation models. This
figure showcases comparisons with leading models, including CogVideoX (Yang et al., 2024c), Open-Sora
Plan (Lab and etc., 2024), Pika (pika, 2024), DreamMachine (Luma, 2024), Vidu (Bao et al., 2024),
and Kling v1.5 (Kuaishou, 2024).


Appendix A. Benchmark Configurations
T2I-CompBench (Huang et al., 2023) We evaluate the alignment between the generated images
and text conditions using T2I-CompBench, a comprehensive benchmark for assessing compositional
text-to-image generation capabilities. Specifically, we report scores for color binding,
shape binding, and texture binding. To evaluate these results, we employ the Disentangled
BLIP-VQA model. For each attribute, we generate 10 images per prompt, with a total of 300
prompts in each category.
GenEval (Ghosh et al., 2024) GenEval is an object-focused framework designed to evaluate
compositional image properties, such as object co-occurrence, position, count, and color. For
evaluation, we generate a total of 2,212 images across 553 prompts. The final score is reported as
the average across tasks.
DPG-Bench (Hu et al., 2024) Compared to the aforementioned benchmarks, DPG-Bench offers
longer prompts with more detailed information, making it effective for evaluating compositional
generation in text-to-image models. For this evaluation, we generate a total of 4,260 images
across 1,065 prompts, with the final score reported as the average across tasks.
VBench (Huang et al., 2024) VBench is a benchmark suite for evaluating video generative
models. It provides a structured Evaluation Dimension Suite that breaks down “video generation
quality” into precise dimensions for detailed assessment. Each dimension and content category
includes a carefully crafted Prompt Suite, along with videos sampled from various generation models.
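
As a quick arithmetic check on the image counts quoted above for the text-to-image benchmarks, the sketch below derives the number of images generated per prompt. The per-prompt figures for GenEval and DPG-Bench are implied by the stated totals and are not quoted directly in the text; the names `BENCHMARKS` and `images_per_prompt` are hypothetical.

```python
# Prompt and image totals as stated in Appendix A.
BENCHMARKS = {
    "GenEval": {"prompts": 553, "images": 2212},
    "DPG-Bench": {"prompts": 1065, "images": 4260},
}


def images_per_prompt(prompts: int, images: int) -> int:
    """Return images per prompt, failing if the totals do not divide evenly."""
    per_prompt, remainder = divmod(images, prompts)
    if remainder:
        raise ValueError("image count is not a multiple of the prompt count")
    return per_prompt


# T2I-CompBench: 10 images per prompt, 300 prompts per attribute category.
T2I_COMPBENCH_IMAGES_PER_CATEGORY = 10 * 300

per_prompt = {name: images_per_prompt(**cfg) for name, cfg in BENCHMARKS.items()}
```

Both GenEval and DPG-Bench work out to four images per prompt, and each T2I-CompBench attribute category corresponds to 3,000 generated images.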

Appendix B. More Visualization Examples
Appendix B.1. Goku-T2I Samples Visualization
We present more generated image samples with their text prompts in Figure 7. The prompts are
randomly selected from the Internet¹. Goku-T2I achieves strong performance in both visual
quality and text-image alignment. It can interpret visual elements and their interactions from
complex natural language descriptions. Notably, in Figure 8, Goku-T2I exhibits an impressive
ability to generate images with rich details, for example, the clear textures of leaves and
berries.
Appendix B.2. Goku-T2V Samples Visualization
In Figure 9 we show more examples generated by Goku-T2V, in both landscape (e.g., rows one
through five) and portrait mode (e.g., the last row). Goku-T2V is capable of generating high-motion
videos (e.g., skiing) and realistic scenes (e.g., forests). All videos are configured with a
duration of 4 seconds, a frame rate of 24 FPS, and a resolution of 720p. For visualization, we
uniformly sample five frames in temporal sequence.
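
The frame selection used for these visualizations can be sketched as follows: a 4-second clip at 24 FPS has 96 frames, from which five are picked at even spacing. Whether the first and last frames are included is not stated, so the endpoint-inclusive spacing below, like the function name `uniform_frame_indices`, is an assumption.

```python
def uniform_frame_indices(num_frames: int, num_samples: int) -> list:
    """Pick num_samples frame indices evenly spaced over [0, num_frames - 1]."""
    if num_samples < 1 or num_frames < num_samples:
        raise ValueError("need 1 <= num_samples <= num_frames")
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


num_frames = 4 * 24  # duration in seconds times frames per second
indices = uniform_frame_indices(num_frames, 5)
```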


Appendix B.3. Goku-T2V Comparisons with Prior Arts
Additional comparisons with state-of-the-art text-to-video generation models are presented
in Figure 10 and Figure 11. These results demonstrate the strong performance of Goku when
evaluated against both open-source models (Yang et al., 2024c; Zheng et al., 2024) and commercial
products (pika, 2024; Kuaishou, 2024; Bao et al., 2024; Luma, 2024). For instance, in Figure 10,
Goku successfully generates smooth motion and accurately incorporates the specified low-angle
shot. In contrast, other models, such as CogVideoX (Yang et al., 2024c), Vidu (Bao et al., 2024),
and Kling (Kuaishou, 2024), often produce incorrect objects or improper camera views.
Appendix B.4. Goku-I2V Samples Visualization
We present additional visualizations of generated samples from Goku-I2V in Figure 12, which
further validate the effectiveness and versatility of our approach. As shown in the figure, Goku-I2V
demonstrates an impressive ability to synthesize coherent and visually compelling videos
from diverse reference images, maintaining consistency in motion and scene semantics.
For instance, in the first row, the model successfully captures the dynamic and high-energy
nature of water boxing, generating fluid and natural movements of splashes synchronized
with the subject’s motions. In the second row, the sequence of a child riding a bike through a
park illustrates the model’s proficiency in creating smooth and realistic forward motion while
preserving environmental consistency. Finally, the third row showcases the model’s ability
to handle creative and imaginative scenarios, as seen in the detailed depiction of pirate ships
battling atop a swirling coffee cup. The photorealistic rendering and accurate motion trajectories
underscore the model’s robustness in both realism and creativity.
These examples highlight Goku-I2V’s capacity to generalize across a wide range of inputs,
reinforcing its potential for applications in video generation tasks requiring high fidelity and
adaptability.

1 https://promptlibrary.org/


Method             Total    Quality   Semantic   Subject       Background    Temporal     Motion
                   Score    Score     Score      Consistency   Consistency   Flickering   Smoothness
AnimateDiff-V2     80.27    82.90     69.75      95.30         97.68         98.75        97.76
VideoCrafter-2.0   80.44    82.20     73.42      96.85         98.22         98.41        97.73
OpenSora V1.2      79.23    80.71     73.30      94.45         97.90         99.47        98.20
Show-1             78.93    80.42     72.98      95.53         98.02         99.12        98.24
Gen-3              82.32    84.11     75.17      97.10         96.62         98.61        99.23
Pika-1.0           80.69    82.92     71.77      96.94         97.36         99.74        99.50
CogVideoX-5B       81.61    82.75     77.04      96.23         96.52         98.66        96.92
Kling              81.85    83.39     75.68      98.33         97.60         99.30        99.40
Mira               71.87    78.78     44.21      96.23         96.92         98.29        97.54
CausVid            84.27    85.65     78.75      97.53         97.19         96.24        98.05
Luma               83.61    83.47     84.17      97.33         97.43         98.64        99.35
HunyuanVideo       83.24    85.09     75.82      97.37         97.76         99.44        98.99
Goku               84.85    85.60     81.87      95.55         96.67         97.71        98.50

Method             Dynamic   Aesthetic   Imaging   Object   Multiple   Human
                   Degree    Quality     Quality   Class    Objects    Action
AnimateDiff-V2     40.83     67.16       70.10     90.90    36.88      92.60
VideoCrafter-2.0   42.50     63.13       67.22     92.55    40.66      95.00
OpenSora V1.2      47.22     56.18       60.94     83.37    58.41      85.80
Show-1             44.44     57.35       58.66     93.07    45.47      95.60
Gen-3              60.14     63.34       66.82     87.81    53.64      96.40
Pika-1.0           47.50     62.04       61.87     88.72    43.08      86.20
CogVideoX-5B       70.97     61.98       62.90     85.23    62.11      99.40
Kling              46.94     61.21       65.62     87.24    68.05      93.40
Mira               60.33     42.51       60.16     52.06    12.52      63.80
CausVid            92.69     64.15       68.88     92.99    72.15      99.80
Luma               44.26     65.51       66.55     94.95    82.63      96.40
HunyuanVideo       70.83     60.36       67.56     86.10    68.55      94.40
Goku               76.11     67.22       71.29     94.40    79.48      97.60

Method             Color    Spatial        Scene    Appearance   Temporal   Overall
                            Relationship            Style        Style      Consistency
AnimateDiff-V2     87.47    34.60          50.19    22.42        26.03      27.04
VideoCrafter-2.0   92.92    35.86          55.29    25.13        25.84      28.23
OpenSora V1.2      87.49    67.51          42.47    23.89        24.55      27.07
Show-1             86.35    53.50          47.03    23.06        25.28      27.46
Gen-3              80.90    65.09          54.57    24.31        24.71      26.69
Pika-1.0           90.57    61.03          49.83    22.26        24.22      25.94
CogVideoX-5B       82.81    66.35          53.20    24.91        25.38      27.59
Kling              89.90    73.03          50.86    19.62        24.17      26.42
Mira               42.24    27.83          16.34    21.89        18.77      18.72
CausVid            80.17    64.65          56.58    24.27        25.33      27.51
Luma               92.33    83.67          58.98    24.66        26.29      28.13
HunyuanVideo       91.60    68.68          53.88    19.80        23.89      26.44
Goku               83.81    85.72          57.08    23.08        25.64      27.35

Table 8 | Comparison with state-of-the-art models on video generation benchmarks. We evaluate on VBench (Huang et al., 2024) and
compare with Gen-3 (Runway, 2023), Vchitect-2.0 (Team, 2024), VEnhancer (He et al., 2024), Kling (Kuaishou, 2024), LaVie-2 (Wang et al.,
2023a), CogVideoX (Yang et al., 2024c), Emu3 (Wang et al., 2024b).

An embroidered sweater with an
anatomical illustration of the human
torso and chest, the skin is open to
reveal the internal anatomy.

A portrait featuring a 26-year-old Chinese male model in a six-grid
layout. He has a sleek, naturally layered Korean hairstyle with
subtly drooping bangs. Each panel shows him wearing modern

Prototype flying fox made from
blown glass, Lino Tagliapietra style
Muranese glassmaking, intricate
details.

3d cube woman underwater,
iridescent water, dreamlike

Close up shot of hand of a
woman touching oats in oat
farm. Shot from behind.

3D illustration of the chip
with text "AI" floating above
it, with a blue color scheme.

A simple design in black on a
white background. The word
"VINTAGE" is at the bottom.

Great Dane Dog sitting on a
toilet bowl in wide bathroom,
reading a large double page
spread newspaper, sit like
human. The background is in
a white room.

Full body shot of balenciaga
fashion model and parrot
hybrid with a human body
and the head of the parrot. He
is walking through a podium
like a model.

Full body photo of a
screaming cauliflower
monster roaring towards the
viewer, very detailed textures.
The background is clean and
blue.

Create realistic playing cards
on fire. The playing cards are
presented with 4A. The fire is
red and intense. The
background is black.

Figure 7 | Qualitative samples of Goku-T2I. Key words are highlighted in RED.


Prompt: Raspberry in the form of women walk along the path of a fairy tale forest. She carries a jug of
water with her. Her head is made of one big raspberry on which she has big and beautiful eyes, as well as
nose and mouth. The skin of the face has a raspberry color. She has very beautiful hair which consists of
raspberry, leaves and thin stems. Her arms and legs are made entirely of intertwined stems. She also
wears a skirt with raspberry leaves and small raspberries and she looks very delicate and feminine.

Figure 8 | Qualitative samples of Goku-T2I. Key words are highlighted in RED. For clarity, we
zoom in on specific regions to enhance visualization.


At an aquarium, a diver in a yellow wetsuit is feeding tropical fish in a large tank.

Zooming through a dense, lush rainforest at incredible speed, weaving between colossal trees, with rays of sunlight breaking through the canopy
and exotic birds scattering in the distance.

A snowboarder carves down a steep slope, their board cutting swiftly through the snow.

A boxer dances around the ring, fists raised and jabbing rapidly at their opponent.

A kung fu master swiftly maneuvers through a series of rapid punches and palm strikes, their arms blurring with speed.

In a cozy living room with a roaring fireplace and plush furniture, a dog with a shiny coat sits contentedly on a soft rug.

Figure 9 | Qualitative samples of Goku-T2V. Key words are highlighted in RED.

Rows: CogVideoX1.5 (5B), Open-Sora Plan (v1.3), Pika, DreamMachine, Vidu, Kling (1.5), Goku (8B).

Prompt: An astronaut runs across the surface of the moon, with a low-angle shot showcasing the vast lunar landscape. The movements
are smooth and light.

Figure 10 | Qualitative comparisons of Goku-T2V with SOTA video generation methods. Key
words are highlighted in RED.

Rows: CogVideoX1.5 (5B), Open-Sora Plan (v1.3), Pika, DreamMachine, Vidu, Kling (1.5), Goku (8B).

Prompt: A man surfing on a wave, with the camera following his movement and focusing on his face. He is smiling and giving a thumbs-up
to the camera, conveying a sense of enjoyment and excitement. The ocean waves are vibrant and dynamic around him, with sunlight
glistening on the water. The background features a clear blue sky, enhancing the lively atmosphere of the scene as he rides the waves with
confidence and enthusiasm.

Figure 11 | Qualitative comparisons of Goku-T2V with SOTA video generation methods. Key
words are highlighted in RED.

A person performing dynamic and fast-paced water boxing, demonstrating quick, fluid arm movements while splashing water…

A kid rides a bike in the park, pedaling fast and moving towards the camera…

A highly detailed, photorealistic close-up image of two pirate ships engaged in an intense battle, their sails billowing as they maneuver through the dark, swirling surface of a
coffee cup.

Figure 12 | Qualitative samples of Goku-I2V. Key words are highlighted in RED.


References
Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y.,
Ding, Y., et al. (2025). Cosmos world foundation model platform for physical ai. arXiv preprint
arXiv:2501.03575.
Albergo, M. S. and Vanden-Eijnden, E. (2023). Building normalizing flows with stochastic
interpolants. In The Eleventh International Conference on Learning Representations.
Bacher, I., Javidnia, H., Dev, S., Agrahari, R., Hossari, M., Nicholson, M., Conran, C., Tang, J.,
Song, P., Corrigan, D., et al. (2021). An advert creation system for 3d product placements. In
Machine Learning and Knowledge Discovery in Databases: Applied Data Science Track: European
Conference, ECML PKDD 2020, Ghent, Belgium, September 14–18, 2020, Proceedings, Part IV,
pages 224–239. Springer.
Bao, F., Xiang, C., Yue, G., He, G., Zhu, H., Zheng, K., Zhao, M., Liu, S., Wang, Y., and Zhu, J.
(2024). Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion
models. arXiv preprint arXiv:2405.04233.
Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y.,
et al. (2023). Improving image generation with better captions. Computer Science.
https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8.
Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English,
Z., Voleti, V., Letts, A., et al. (2023a). Stable video diffusion: Scaling latent video diffusion
models to large datasets. arXiv preprint arXiv:2311.15127.
Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., and Kreis, K. (2023b).
Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575.
Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman,
T., Luhman, E., Ng, C., Wang, R., and Ramesh, A. (2024). Video generation models as world
simulators.
Carreira, J. and Zisserman, A. (2017). Quo vadis, action recognition? a new model and the
kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 6299–6308.
Castellano, B. (2024). PySceneDetect.
Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., et al. (2023).
Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image
synthesis. arXiv preprint arXiv:2310.00426.
Chen, S., Xu, M., Ren, J., Cong, Y., He, S., Xie, Y., Sinha, A., Luo, P., Xiang, T., and Perez-Rua,
J.-M. (2024a). Gentron: Diffusion transformers for image and video generation. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6441–6451.
Chen, T., Xu, B., Zhang, C., and Guestrin, C. (2016). Training deep nets with sublinear memory
cost. arXiv preprint arXiv:1604.06174.
Chen, T.-S., Siarohin, A., Menapace, W., Deyneka, E., Chao, H.-w., Jeon, B. E., Fang, Y., Lee,
H.-Y., Ren, J., Yang, M.-H., et al. (2024b). Panda-70m: Captioning 70m videos with multiple
cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 13320–13331.
Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., Tong, W., Hu, K., Luo, J., Ma, Z., et al.
(2024c). How far are we to gpt-4v? closing the gap to commercial multimodal models with
open-source suites. arXiv preprint arXiv:2404.16821.
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M.,
Brahma, S., et al. (2024). Scaling instruction-finetuned language models. Journal of Machine
Learning Research, 25(70):1–53.
Contributors, I. H. (2013). Image hash.
Corporation, N. (2022). Nvidia h100 tensor core gpu architecture.
Corporation, N. (2023). Nvidia announces dgx gh200 ai supercomputer.
Corporation, N. (2024). Nvidia h200 nvl pcie gpu accelerates ai and hpc applications.
Dao, T. (2024). FlashAttention-2: Faster attention with better parallelism and work partitioning.
In International Conference on Learning Representations (ICLR).
Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A. P., Caron,
M., Geirhos, R., Alabdulmohsin, I., et al. (2023). Scaling vision transformers to 22 billion
parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR.
Dehghani, M., Mustafa, B., Djolonga, J., Heek, J., Minderer, M., Caron, M., Steiner, A., Puigcerver,
J., Geirhos, R., Alabdulmohsin, I. M., et al. (2024). Patch n’pack: Navit, a vision transformer
for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale
hierarchical image database. In CVPR, pages 248–255.
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A.,
Boesel, F., et al. (2024). Scaling rectified flow transformers for high-resolution image synthesis.
In Forty-first International Conference on Machine Learning.
Esser, P., Rombach, R., and Ommer, B. (2021). Taming transformers for high-resolution image
synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
pages 12873–12883.
Ge, S., Nah, S., Liu, G., Poon, T., Tao, A., Catanzaro, B., Jacobs, D., Huang, J.-B., Liu, M.-Y., and
Balaji, Y. (2023). Preserve your own correlation: A noise prior for video diffusion models. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22930–22941.
Ghosh, D., Hajishirzi, H., and Schmidt, L. (2024). Geneval: An object-focused framework for
evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36.
Girdhar, R., Singh, M., Brown, A., Duval, Q., Azadi, S., Rambhatla, S. S., Shah, A., Yin, X., Parikh,
D., and Misra, I. (2023). Emu video: Factorizing text-to-video generation by explicit image
conditioning. arXiv preprint arXiv:2311.10709.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.,
and Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing
systems, 27.
Ha, D. and Schmidhuber, J. (2018). World models. arXiv preprint arXiv:1803.10122.
He, J., Xue, T., Liu, D., Lin, X., Gao, P., Lin, D., Qiao, Y., Ouyang, W., and Liu, Z. (2024). Venhancer:
Generative space-time enhancement for video generation. arXiv preprint arXiv:2407.07667.
He, Y., Yang, T., Zhang, Y., Shan, Y., and Chen, Q. (2022). Latent video diffusion models for
high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2(3):4.
Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi,
M., Fleet, D. J., et al. (2022a). Imagen video: High definition video generation with diffusion
models. arXiv preprint arXiv:2210.02303.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in
neural information processing systems, 33:6840–6851.
Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. (2022b). Video diffusion
models. Advances in Neural Information Processing Systems, 35:8633–8646.
Hong, W., Ding, M., Zheng, W., Liu, X., and Tang, J. (2022). Cogvideo: Large-scale pretraining
for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868.
Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P., and Yu, G. (2024). Ella: Equip diffusion models
with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135.
Huang, K., Sun, K., Xie, E., Li, Z., and Liu, X. (2023). T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information
Processing Systems, 36:78723–78747.
Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.
(2024). Vbench: Comprehensive benchmark suite for video generative models. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818.
Jacobs, S. A., Tanaka, M., Zhang, C., Zhang, M., Song, S. L., Rajbhandari, S., and He, Y. (2023).
Deepspeed ulysses: System optimizations for enabling training of extreme long sequence
transformer models. arXiv preprint arXiv:2309.14509.
Ji, Y., Zhang, J., Wu, J., Zhang, S., Chen, S., Ge, C., Sun, P., Chen, W., Shao, W., Xiao, X., et al.
(2024). Prompt-a-video: Prompt your video diffusion model via preference-aligned llm. arXiv
preprint arXiv:2412.15156.
Jiang, Z., Lin, H., Zhong, Y., Huang, Q., Chen, Y., Zhang, Z., Peng, Y., Li, X., Xie, C., Nong, S.,
et al. (2024). Megascale: Scaling large language model training to more than 10,000 gpus.
arXiv preprint arXiv:2402.15627.
Jin, Y., Sun, Z., Li, N., Xu, K., Jiang, H., Zhuang, N., Huang, Q., Song, Y., Mu, Y., and Lin,
Z. (2024). Pyramidal flow matching for efficient video generative modeling. arXiv preprint
arXiv:2410.05954.
Ju, X., Gao, Y., Zhang, Z., Yuan, Z., Wang, X., Zeng, A., Xiong, Y., Xu, Q., and Shan, Y. (2024).
Miradata: A large-scale video dataset with long durations and structured captions. arXiv
preprint arXiv:2407.06358.
Kingma, D. P. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.
(2024). Hunyuanvideo: A systematic framework for large video generative models. arXiv
preprint arXiv:2412.03603.
Korthikanti, V. A., Casper, J., Lym, S., McAfee, L., Andersch, M., Shoeybi, M., and Catanzaro, B.
(2023). Reducing activation recomputation in large transformer models. Proceedings of Machine
Learning and Systems, 5:341–353.
Kuaishou (2024). Kling ai. https://klingai.com/.
Lab, P.-Y. and etc., T. A. (2024). Open-sora-plan.
Li, S., Xue, F., Baranwal, C., Li, Y., and You, Y. (2021). Sequence parallelism: Long sequence
training from system perspective. arXiv preprint arXiv:2105.13120.
Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. (2023). Flow matching for
generative modeling. In The Eleventh International Conference on Learning Representations.
Liu, X., Gong, C., and Liu, Q. (2023). Flow straight and fast: Learning to generate and transfer
data with rectified flow. In The Eleventh International Conference on Learning Representations.
Lu, P., Peng, B., Cheng, H., Galley, M., Chang, K.-W., Wu, Y. N., Zhu, S.-C., and Gao, J. (2024).
Chameleon: Plug-and-play compositional reasoning with large language models. Advances in
Neural Information Processing Systems, 36.
Luma (2024). Luma ai. https://lumalabs.ai/dream-machine.
Mishkin, P., Ahmad, L., Brundage, M., Krueger, G., and Sastry, G. (2022). DALL·E 2 preview - risks
and limitations. Noudettu, 28(2022):3.
Nan, K., Xie, R., Zhou, P., Fan, T., Yang, Z., Chen, Z., Li, X., Yang, J., and Tai, Y. (2024).
Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint
arXiv:2407.02371.
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza,
D., Massa, F., El-Nouby, A., et al. (2023). Dinov2: Learning robust visual features without
supervision. arXiv preprint arXiv:2304.07193.
Peebles, W. and Xie, S. (2023). Scalable diffusion models with transformers. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 4195–4205.
Pika (2024). Pika ai. https://pika.art/try.
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach,
R. (2023). Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv
preprint arXiv:2307.01952.
Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.-Y.,
Chuang, C.-Y., et al. (2024). Movie gen: A cast of media foundation models. arXiv preprint
arXiv:2410.13720.
Quevedo, J., McIntyre, Q., Campbell, S., and Wachen, R. (2024). Oasis: A universe in a transformer.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. (2022). Hierarchical text-conditional
image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). High-resolution image
synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 10684–10695.
Runway (2023). Gen-2: Generate novel videos with text, images or video clips.
https://runwayml.com/research/gen-2/.
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T.,
Katta, A., Mullis, C., Wortsman, M., et al. (2022). Laion-5b: An open large-scale dataset for
training next generation image-text models. Advances in Neural Information Processing Systems,
35:25278–25294.
Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. (2024). Flashattention-3:
Fast and accurate attention with asynchrony and low-precision. arXiv preprint
arXiv:2407.08608.
Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O.,
Parikh, D., Gupta, S., and Taigman, Y. (2023). Make-a-video: Text-to-video generation without
text-video data. In The Eleventh International Conference on Learning Representations.
Skorokhodov, I., Tulyakov, S., and Elhoseiny, M. (2022). Stylegan-v: A continuous video
generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pages 3626–3636.
Soomro, K., Zamir, A. R., and Shah, M. (2012). Ucf101: A dataset of 101 human actions classes
from videos in the wild. Technical report, Center for Research in Computer Vision, Orlando,
FL 32816, USA. CRCV-TR-12-01.
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. (2024). Roformer: Enhanced transformer
with rotary position embedding. Neurocomputing, 568:127063.
Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., and Yuan, Z. (2024). Autoregressive
model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525.
Team, V. (2024). Vchitect-2.0: Parallel transformer for scaling up video diffusion models.
https://github.com/Vchitect/Vchitect-2.0.
Teed, Z. and Deng, J. (2020). Raft: Recurrent all-pairs field transforms for optical flow. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part II 16, pages 402–419. Springer.
Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., and Gelly, S. (2018).
Towards accurate generative models of video: A new metric & challenges. arXiv preprint
arXiv:1812.01717.
Valevski, D., Leviathan, Y., Arar, M., and Fruchter, S. (2024). Diffusion models are real-time
game engines. arXiv preprint arXiv:2408.14837.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and
Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing
systems, 30.
Wan, B., Han, M., Sheng, Y., Lai, Z., Zhang, M., Zhang, J., Peng, Y., Lin, H., Liu, X., and Wu, C.
(2024). Bytecheckpoint: A unified checkpointing system for llm development. arXiv preprint
arXiv:2407.20143.
Wang, J., Yuan, L., Zhang, Y., and Sun, H. (2024a). Tarsier: Recipes for training and evaluating
large video description models. arXiv preprint arXiv:2407.00634.
Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., Zhao,
Y., Ao, Y., Min, X., Li, T., Wu, B., Zhao, B., Zhang, B., Wang, L., Liu, G., He, Z., Yang, X., Liu, J.,
Lin, Y., Huang, T., and Wang, Z. (2024b). Emu3: Next-token prediction is all you need.
Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., et al.
(2023a). Lavie: High-quality video generation with cascaded latent diffusion models. arXiv
preprint arXiv:2309.15103.
Wang, Y., He, Y., Li, Y., Li, K., Yu, J., Ma, X., Li, X., Chen, G., Chen, X., Wang, Y., et al. (2023b).
Internvid: A large-scale video-text dataset for multimodal understanding and generation.
arXiv preprint arXiv:2307.06942.
Wiegand, T., Sullivan, G. J., Bjontegaard, G., and Luthra, A. (2003). Overview of the H.264/AVC
video coding standard. IEEE Transactions on circuits and systems for video technology, 13(7):560–576.
Wu, J. Z., Ge, Y., Wang, X., Lei, S. W., Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X., and Shou, M. Z.
(2023). Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633.
Xie, J., Mao, W., Bai, Z., Zhang, D. J., Wang, W., Lin, K. Q., Gu, Y., Chen, Z., Yang, Z., and
Shou, M. Z. (2024). Show-o: One single transformer to unify multimodal understanding and
generation. arXiv preprint arXiv:2408.12528.
Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., et al.
(2024a). Qwen2 technical report. arXiv preprint arXiv:2407.10671.
Yang, M., Li, J., Fang, Z., Chen, S., Yu, Y., Fu, Q., Yang, W., and Ye, D. (2024b). Playable game
generation. arXiv preprint arXiv:2412.00887.
Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G.,
et al. (2024c). Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv
preprint arXiv:2408.06072.
Yuan, L., Wang, J., Sun, H., Zhang, Y., and Lin, Y. (2025). Tarsier2: Advancing large
vision-language models from detailed video description to comprehensive video understanding.
arXiv preprint arXiv:2501.07888.
Zeng, Y., Wei, G., Zheng, J., Zou, J., Wei, Y., Zhang, Y., and Li, H. (2024). Make pixels dance:
High-dynamic video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 8850–8860.
Zhang, B. and Sennrich, R. (2019). Root mean square layer normalization. Advances in Neural
Information Processing Systems, 32.
Zhang, J., Chen, J., Wang, C., Yu, Z., Qi, T., Liu, C., and Wu, D. (2024). Virbo: Multimodal
multilingual avatar video generation in digital marketing. arXiv preprint arXiv:2403.11700.
Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.-C., Xu, M., Wright, L., Shojanazeri, H., Ott,
M., Shleifer, S., Desmaison, A., Balioglu, C., Damania, P., Nguyen, B., Chauhan, G., Hao,
Y., Mathews, A., and Li, S. (2023). Pytorch fsdp: Experiences on scaling fully sharded data
parallel. Proc. VLDB Endow., 16(12):3848–3860.
Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. (2024).
Open-sora: Democratizing efficient video production for all.
Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer,
L., and Levy, O. (2024). Transfusion: Predict the next token and diffuse images with one
multi-modal model. arXiv preprint arXiv:2408.11039.
Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., and Feng, J. (2022). Magicvideo: Efficient video
generation with latent diffusion models. arXiv preprint arXiv:2211.11018.