arXiv:2409.19256v2 [cs.LG] 2 Oct 2024
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng (The University of Hong Kong) gmsheng@connect.hku.hk
Chi Zhang (ByteDance) zhangchi.usc1992@bytedance.com
Zilingfeng Ye (ByteDance) yezilingfeng@bytedance.com
Xibin Wu (ByteDance) wuxibin@bytedance.com
Wang Zhang (ByteDance) zhangwang.nozomi@bytedance.com
Ru Zhang (ByteDance) zhangru.1994@bytedance.com
Yanghua Peng (ByteDance) pengyanghua.yanghua@bytedance.com
Haibin Lin (ByteDance) haibin.lin@bytedance.com
Chuan Wu (The University of Hong Kong) cwu@cs.hku.hk
Abstract
Reinforcement Learning from Human Feedback (RLHF) is
widely used in Large Language Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node
represents computation of a neural network (NN) and each
edge denotes data dependencies between the NNs. RLHF
complicates the dataflow by expanding each node into a distributed LLM training or generation program, and each edge
into a many-to-many multicast. Traditional RL frameworks
execute the dataflow using a single controller to instruct
both intra-node computation and inter-node communication,
which can be inefficient in RLHF due to large control dispatch
overhead for distributed intra-node computation. Existing
RLHF systems adopt a multi-controller paradigm, which can
be inflexible due to nesting distributed computation and data
communication. We propose HybridFlow, which combines
single-controller and multi-controller paradigms in a hybrid
manner to enable flexible representation and efficient execution of the RLHF dataflow. We carefully design a set of hierarchical APIs that decouple and encapsulate computation and
data dependencies in the complex RLHF dataflow, allowing
efficient operation orchestration to implement RLHF algorithms and flexible mapping of the computation onto various
devices. We further design a 3D-HybridEngine for efficient
actor model resharding between training and generation
phases, with zero memory redundancy and significantly reduced communication overhead. Our experimental results
demonstrate 1.53×∼20.57× throughput improvement when
running various RLHF algorithms using HybridFlow, as compared with state-of-the-art baselines. HybridFlow source
code will be available at https://github.com/volcengine/verl
CCS Concepts: • Computing methodologies → Distributed
computing methodologies; Machine learning.
Keywords: Distributed systems, Reinforcement Learning
from Human Feedback
ACM Reference Format:
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang
Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025.
HybridFlow: A Flexible and Efficient RLHF Framework. In Twentieth
European Conference on Computer Systems (EuroSys '25), March 30-April 3, 2025, Rotterdam, Netherlands. ACM, New York, NY, USA,
19 pages. https://doi.org/10.1145/3689031.3696075
1 Introduction
Large language models (LLMs) such as GPT [11], Llama [73]
and Claude [7] have revolutionized various artificial intelligence (AI) applications, ranging from writing [2] and searching [52] to coding [63]. LLMs are first pre-trained on trillions of tokens from books, websites, etc., via next-word prediction
to accumulate broad knowledge [11]. Next, LLMs are trained
on domain-specific datasets via supervised fine-tuning (SFT),
to be able to follow human instructions [11]. Despite the outstanding capabilities of LLMs on natural language tasks after
pre-training and SFT, the detrimental and biased contents in
the training datasets may still mislead an LLM to generate
toxic and undesirable content. Reinforcement Learning from
Human Feedback (RLHF) is introduced to further align an
LLM to human values, for building helpful and harmless AI
applications [7, 55].
RLHF is built upon traditional RL algorithms [4, 68, 78],
e.g., Proximal Policy Optimization (PPO) [68] and REINFORCE [78]. The widely adopted PPO-based RLHF system
typically consists of four LLMs [7, 55]: an actor, a critic, a reference policy network and a reward model. PPO-based RLHF
proceeds in iterations, each with three stages: (1) response
generation using the actor model with a batch of prompts;
(2) preparation of training data by scoring the generated responses through a single forward pass of the critic, reference
policy, and reward models; (3) learning from human preference by updating actor and critic through forward and
backward computation. Other RLHF variants [19, 43] follow
similar stages but involve different numbers of models and
data dependencies among the models.
Traditional RL can be modeled as a dataflow [46], which is
a directed acyclic graph (DAG): each node in the RL dataflow
represents computation of a neural network (e.g., actor or
critic network which can be CNN or MLP); each edge denotes data dependency between NN computations (e.g., output of the critic is used as input to actor training [68]). RLHF
dataflow is more complex, with more complicated models
involved (e.g., LLMs for the actor/critic/reference/reward
models), each running distinct computation, and more diverse data dependencies among them (i.e., multicast between
distributed model partitions). Training and generation of an
LLM in the RLHF dataflow requires distributed computation
(e.g., using tensor/pipeline/data parallelism) [40, 71]. Therefore, each node in the RLHF dataflow is a complex distributed
program, corresponding to distributed computation of the
respective LLM. Models in different nodes typically use different parallelism strategies as their workloads vary. The
edge represents data resharding, which is often a many-to-many multicast. Consequently, flexible representation and efficient execution of the complex and resource-intensive
RLHF is imperative.
Traditional RL frameworks such as RLLib [45] and RLLib
Flow [46] utilize a hierarchical single-controller paradigm
to run RL dataflows. A centralized controller assigns nodes
in the dataflow to different processes and coordinates their
execution order. Each node process can further spawn more
workers to perform computation, again following the single-controller paradigm. However, they only provide primitives
for data-parallel training and are constrained to neural networks that are at most hundreds of MB in size [45, 46]. In the
RLHF dataflow, each node corresponds to an LLM with up to
billions of operators, computed using some complex parallelism. A single-controller paradigm is inefficient due to the
substantial overhead of dispatching operators to distributed
accelerators [1, 9].
Existing RLHF systems adopt a multi-controller paradigm
to manage intra-node computation and inter-node data resharding [17, 30, 80]. Each controller independently manages
the computation of one device and uses multiple point-to-point operations to coordinate data dependencies between
different nodes. This multi-controller paradigm introduces
negligible dispatch overhead when performing LLM computation (detailed in §2.2).
However, without central control, it is inflexible to implement various RLHF dataflows, as modifying a single node to
adapt to different data dependencies requires changing all
dependent nodes’ implementation, hindering code reuse.
To address these limitations, we propose HybridFlow, a
flexible and efficient RLHF framework to easily represent and
execute diverse RLHF dataflows, attaining high throughput.
Our key observation is that utilizing the single-controller
paradigm on the inter-node level enables flexible expression of various data dependencies and easy coordination of
inter-node data resharding with minimal overhead, while
integrating the multi-controller paradigm within intra-node
computation enhances computation efficiency substantially.
We advocate a hierarchical hybrid programming model to
generate RLHF dataflows. At the node level, multiple model
classes are provided that encapsulate distributed computation (training, inference and generation) of different LLMs in
the dataflow into primitive APIs. These APIs can seamlessly
support various parallelism strategies from the existing LLM
frameworks (including 3D parallelism [71], ZeRO [59], and PyTorch FSDP [57]), and perform distributed computation
under the multi-controller paradigm. Among the nodes, a
set of transfer protocols are designed to hide the complexity
of data resharding from users, as coordinated by a single
controller. This programming model abstracts away the complexity of distributed computing, allowing users to implement an RLHF dataflow in a few lines of code and run RLHF
through a single process of the single controller. It also effectively decouples intra-node computation and inter-node data
transfer, allowing independent optimization of each model
without changing the code of other models in the dataflow.
Training and generation of the actor model represent major computation in the RLHF dataflow. We further design
a 3D-HybridEngine to enable efficient execution of training
and generation of the actor model, introducing zero memory redundancy and significantly reduced communication
overhead during model parameter resharding between the
training and generation stages. Our hybrid programming
model also facilitates flexible placement of models onto the
same or different sets of GPU devices. This allows us to provide an effective algorithm to optimize GPU allocation and
placement of the models, with various model sizes and distinct workloads, for any RLHF dataflow. Our contributions
in designing HybridFlow are summarized as follows:
• We propose a hierarchical hybrid programming model for
conveniently building the RLHF dataflow. This programming
model enables efficient distributed execution of intra-node
computation and flexible inter-node data resharding and
transfer, for various RLHF algorithms (§4).
• We design a 3D-HybridEngine that executes training and
generation of the actor model with high computation efficiency and zero-redundancy transition between the training
stage and the generation stage (§5).
• We devise an effective mapping algorithm to automatically
identify optimized GPU allocation and placement of each
node (model) in the RLHF dataflow (§6).
• We conduct extensive experiments comparing HybridFlow
with state-of-the-art RLHF systems [17, 30, 82] under various
RLHF algorithms, model sizes and cluster scales. Our evaluation demonstrates 1.53×∼20.57× throughput improvements.
We have open-sourced HybridFlow and believe that HybridFlow can boost future RLHF research and development.

Figure 1. Dataflow graph of 3 RLHF algorithms [19, 43, 55]: (a) PPO, (b) Safe-RLHF, (c) ReMax. Each node is a model (actor, critic, reference policy, reward model, or cost model) performing generation, forward inference, or training on a GPU worker with a controller; edges denote remote calls and data transfers. Stages 1, 2, and 3 represent Generation, Preparation, and Training, respectively.
Figure 2. Programming model used in RLHF systems. (a) Existing RLHF systems adopt the multi-controller paradigm: each model runs its own program that nests computation, collective communication (e.g., all_gather_weights), and point-to-point transfers among actor, critic, and reward workers, which is inflexible (computation and data dependencies are nested and coupled with specific LLM systems) and inefficient (large transition cost between training and generation, rigid support for model placement). (b) HybridFlow utilizes a hybrid programming model: the single controller coordinates models with a few inter-node calls, e.g., responses = actor.gen(prompts); value = critic.comp_value(responses); reward = rm.comp_reward(responses), while each model uses the multi-controller paradigm in distributed computation, which is flexible (decoupled data and computation dependencies, seamless integration with any LLM system) and efficient (zero redundancy during transition, support for different model placement strategies). Inactive nodes in grey represent operations not executed at this time.
2 Background and Motivation
2.1 Reinforcement Learning from Human Feedback
RLHF Workflow. RLHF aligns the linguistic space of LLMs
with human values, using a set of human-ranked candidates
of given prompts [7, 19, 41, 43, 55, 70, 91]. An RLHF system typically consists of multiple models, e.g., an actor, a
critic, a reference policy, and one or multiple reward models.
The actor and the reference are each the pre-trained/fine-tuned
LLM (i.e., the LLM that is undergoing RLHF). The critic and
reward models can be different LLMs fine-tuned on the human preference dataset, with the language modeling head
replaced by a scalar output head [7, 55]. The RLHF workflow
can be decomposed into 3 stages (Figure 1) and we take PPO
as an example:
•Stage 1 (Generation): The actor produces responses from a
batch of prompts using auto-regressive generation.
•Stage 2 (Preparation): Using prompts and generated responses,
the critic computes their values [66, 68], the reference policy
computes their reference log probabilities, and the reward
model computes their rewards [7, 55], all via a single pass of
forward computation of the respective model.
•Stage 3 (Learning/Training): The actor and the critic are
updated via Adam [38], using the batch of data produced by
previous stages and the loss function [55].
Other RLHF algorithms largely follow the 3-stage workflow as well (Figure 1(b)(c)). Safe-RLHF [19] introduces an
auxiliary pretrain loss following PPO-ptx [55] and includes
an additional cost model to fit human preferences and safety
labels simultaneously. ReMax [43] requires an additional generation pass for variance reduction and eliminates the critic
model in the dataflow. Researchers are actively exploring
novel RLHF algorithms [41, 70, 91] and integrating traditional RL methods into RLHF domains [37]. These variances
necessitate a flexible representation of the RLHF dataflow
graph to accommodate diverse algorithmic requirements.
Parallelism Strategies. LLMs are trained and served with
data, pipeline, and tensor parallelism [36, 40, 54]. With data
parallelism (DP), the input data is split into multiple subsets; each subset is processed by a separate device (e.g., a
GPU) [69]. ZeRO [59] is a memory-optimized solution for
DP training, progressively sharding optimizer states, gradients, and model parameters across GPUs. Pipeline parallelism (PP) [32, 53] and tensor parallelism (TP) [71] distribute model parameters, gradients and optimizer states across
multiple GPUs. Modern distributed training frameworks like
Megatron-LM [71] and MegaScale [36] utilize 3D parallelism
or PTD parallelism [54], where P, T, D stand for PP, TP, DP,
respectively. In 3D parallelism, PP size represents the number of pipeline stages in model training, TP size refers to the
number of shards that a tensor is partitioned into, and DP
size is the number of model replicas. LLM serving systems
employ 3D parallelism similar to training while only model
parameters and KVCache are sharded [16, 29, 40].
LLM models in the RLHF dataflow may perform distinct
computations, including training (one forward pass, one
backward pass and model update), inference (one forward
pass) and generation (auto-regressive generation with multiple forward passes). In particular, training and generation are
performed on the actor model, training and inference on the
critic, and inference on reference policy and reward models.
Distinct parallel strategies can be applied to different models
for varied computations to achieve optimal throughput.
2.2 Programming Model for Distributed ML

Single-Controller. It employs a centralized controller to
manage the overall execution flow of the distributed program.
With centralized control logic, users can build core functionalities of the dataflow as a single process (Figure 2(b)), while
the controller automatically generates distributed workers
to carry out the computation. With a global view of the
hardware and dataflow graph, the single-controller paradigm allows flexible and optimized resource mapping and
execution order coordination among dataflow tasks. However, coordination messages are passed from the controller
to all workers, incurring significant dispatch overhead when
executing expansive dataflow graphs on large clusters [1, 9].
Multi-Controller. Each device (aka worker) has its own controller. State-of-the-art distributed LLM training and serving
systems adopt the multi-controller paradigm, due to its scalability and low dispatch overhead (control messaging largely
passed from CPU to GPU over fast PCIe links) [36, 40, 60, 71].
As shown in the example that employs multi-controller RLHF
implementation in Figure 2(a), a separate program is run
for each model, and all workers of one model execute the
same program. Each worker only possesses a local view of
the system state and requires point-to-point communication
between two models (blue code and arrows) to coordinate
model execution order. To implement an RLHF workflow in
the multi-controller architecture, a user must intricately integrate the code for collective communication, computation,
and point-to-point data transfer in the program run at each
device. This leads to deeply nested code of computation and
data transfer, challenging to develop, maintain, and optimize.
In Figure 2(a), each model performs local computation and
all_gather operations (black code), while the actor model
must explicitly manage send operations to the critic and reward models, and the latter must correspondingly implement
receive operations at precise points in their program.
2.3 RLHF Characteristics
Heterogeneous model workloads. The actor, critic, reference and reward models in RLHF may execute training,
inference or generation at different stages, with different
memory footprint and computation demand. For reference
policy and reward models, only their model parameters need
to be stored in GPU memory, as they perform only the forward pass computation. For the actor and the critic, their
model parameters, gradients, and optimizer states must be
stored as they undergo model training. Moreover, a small
actor model (e.g., a 7B pre-trained/fine-tuned LLM) can be
paired with larger critic and reward models (e.g., 70B LLMs)
in RLHF for better alignment [7]. Given such heterogeneity,
different parallelism strategies and tailored optimizations
are needed for running each model during RLHF.
Unbalanced computation between actor training and
generation. In the RLHF dataflow, training and generation
of the actor model are represented by two nodes (Figure 1),
which often render the majority of the workload in each RLHF iteration (e.g., 58.9% of total RLHF time with HybridFlow). Actor training is computation bound [24], often requiring a larger model-parallel (MP) size (i.e., the number of partitions the model is partitioned into) and distributing the workload to more GPUs, e.g., 8 partitions of a 7B model on 8 GPUs. Using the same parallelism strategy (e.g., the same MP size) for generation can lead to underutilization of GPU computation resources due to its memory-bound nature [40]. Previous studies show that combining a larger DP size with a smaller MP size (hybrid data and model parallelism), e.g., partitioning a 7B model into two and replicating it four times on 8 GPUs, can improve the generation throughput [44, 92]. Although using different parallelism strategies for actor training and generation may optimize throughput in both stages, resharding the actor model weights at runtime between the two stages can incur significant communication and memory overhead. For example, aligning a 70B actor model requires transferring 140GB of model weights from training to generation per RLHF iteration, taking up to 36.4% of an iteration time when the two stages are on different devices [30].

Figure 3. Dataflow execution given a model placement plan. Blocks with numbers represent GPUs. In dashed boxes, the models are placed on different sets of devices and can be concurrently computed. Reference model (blue) and reward model (green) are colocated on the same set of GPUs and executed sequentially.
Diverse model placement requirements. Strategic device
placement of models in the RLHF dataflow is necessary, according to computation workloads and data dependencies
of the models. Figure 3 gives an example model placement
plan and the corresponding RLHF execution flow. Models
placed on different sets of devices can be executed in parallel
if no data dependencies exist. Models placed on the same
set of GPUs, referred to as colocated models, share the GPU
memory and are executed sequentially in a time-sharing
manner, as out-of-memory (OOM) error may easily happen
if colocated LLMs execute concurrently.
We observe a compromise: placing models on different devices permits parallel processing but may inevitably lead to
some GPU idle time, given staged model execution in RLHF.
In Figure 3, actor and critic are placed separately, performing training in parallel, but incurring 1/3 of their GPU time
being idle, during other RLHF stages. Supporting various
placement strategies and maximizing device utilization are crucial for optimizing RLHF performance at any model size and cluster scale.

Table 1. Comparison of RLHF frameworks. The original table also illustrates the execution pattern of one PPO iteration per framework, with numbers 1-6 representing response generation, reward model inference, reference model inference, critic inference, actor training, and critic training, respectively.

DeepSpeed-Chat: Parallelism: training with ZeRO, generation with TP. Actor weights in training & generation: model resharding from ZeRO to TP. Model placement: colocate all models on the same set of devices.
OpenRLHF: Parallelism: training with ZeRO, generation with TP. Actor weights: two copies of actor weights for the two stages. Model placement: each model placed on separate devices.
NeMo-Aligner: Parallelism: 3D parallelism for both training and generation. Actor weights: identical model partition in the two stages (shared weights). Model placement: Actor/Ref colocated on some GPUs; Critic/RM colocated on other GPUs.
HybridFlow: Parallelism: training with 3D, ZeRO, or FSDP; generation with 3D parallelism. Actor weights: zero-redundancy model resharding. Model placement: support various model placements and execution patterns.
2.4 Limitations of existing RLHF systems
Inflexible support for various RLHF dataflow graphs.
Existing RLHF systems adopt the multi-controller paradigm
for dataflow implementation [17, 30, 80, 82]. To implement
various RLHF algorithms, a user must navigate and manage
code that mixes collective communication, model computation (potentially using various distributed training/serving
frameworks), and point-to-point data transfer. This code
structure lacks modularity/function encapsulation, making
the RLHF systems tightly coupled with specific LLM training and serving frameworks. Consequently, a user needs to
implement and optimize different RLHF dataflows case-by-case [46], hindering code reuse and increasing the risk of
making mistakes. Existing RLHF frameworks only support
the PPO algorithm. In addition, limited parallel strategies are
supported due to implementation complexity. For example,
to incorporate 3D parallelism for LLM training and generation in DeepSpeed-Chat [82], one may have to re-implement
the whole system due to the mixed code structure.
Inefficient RLHF execution. Table 1 summarizes parallelism strategies, model placement, and execution patterns
adopted by the existing RLHF systems. DeepSpeed-Chat [82]
and OpenRLHF [30] adopt ZeRO-3 for actor training and TP
for actor generation. OpenRLHF uses different copies of the
actor model on different devices for training and generation,
incurring redundant memory usage and frequent weight synchronization among devices. DeepSpeed-Chat maintains the
same copy of actor model on the same set of devices for training and generation, and reshards model weights between
training and generation (due to different parallelisms used
in the two stages), which may still incur substantial memory
and communication overhead for large models (detailed in
§5.4). NeMo-Aligner [17] uses the same 3D parallelism configurations in actor training and generation, experiencing
low generation throughput (§8.4).
Existing RLHF frameworks are limited to one model placement plan and hence one RLHF execution pattern, as shown
in Table 1. Implementing a different placement is difficult,
requiring changing the inner logic of model initialization
and inter-node data transfer as highlighted in blue in Figure 2. OpenRLHF and NeMo-Aligner allow concurrent model
computation in the preparation and learning stages; in the
generation stage, models except the actor are idle, wasting
the GPUs they occupy. DeepSpeed-Chat colocates all models
on the same set of devices, and each device runs each model
sequentially according to the RLHF dataflow. With unbalanced workloads among the models, such a placement can
be inefficient in resource utilization (evaluated in §8.3).
2.5 Design Considerations
To tackle limitations of existing systems, the key question is
- How to design a flexible and efficient programming
model to implement RLHF dataflow? A single-controller
design is particularly advantageous at the inter-node level
due to its flexibility in coordinating data transfer, execution
order, and resource virtualization among distributed computation of different models [9, 50]. The RLHF dataflow graph
typically consists of only a few nodes. Dispatching control
messages to different nodes from the single-controller incurs negligible overhead as compared to distributed computation required for nodes (models) in the dataflow. The
multi-controller paradigm, known for its low latency in dispatching operators to accelerators [20], can be leveraged in
distributed computation of each model. With these insights,
we propose a hierarchical hybrid programming model for
RLHF dataflow implementation. Our key design principle is
to combine single-controller and multi-controller paradigms
in a hybrid manner. This design ensures flexible expression
and efficient execution of RLHF dataflow, maintaining low
control overhead at both inter-node and intra-node levels.
As shown in Figure 2(b), this paradigm decouples intra-node
distributed computation and inter-node data transfer, allowing each model to focus solely on local computation without
managing inter-node communication.
3 HybridFlow Overview
Figure 4 depicts the architecture of HybridFlow, which consists of three major components: Hybrid Programming Model,
3D-HybridEngine and Auto-Mapping algorithm. The hybrid programming model includes a set of hierarchical APIs to enable flexible expression of the RLHF dataflow and efficient computation of models in the dataflow (§4). The 3D-HybridEngine is particularly designed for efficient training and generation of the actor model, allowing different 3D parallel configurations in the two stages and enabling zero memory redundancy and minimized communication overhead during the transition between two stages (§5). The auto-mapping algorithm determines optimized device placement of each model to maximize the throughput of RLHF (§6).

Figure 4. Architecture of HybridFlow. The inputs are the RLHF dataflow graph, model config, and device config; the hierarchical APIs (ResourcePool, ParallelWorker, and transfer protocols, §4) and the 3D-HybridEngine coupling the LLM training and generation engines (§5) execute the dataflow, while the auto-mapping algorithm (§6) determines model placement and device allocation on the physical devices.
The workflow of our RLHF system goes as follows. A user
provides the following inputs to start the RLHF system: (i)
model specifications, including the architecture and size of
the actor/critic/reference policy/reward models in the RLHF
dataflow; (ii) device placement of the models in the dataflow,
as obtained by running the auto-mapping algorithm under
given GPU cluster configurations; (iii) parallelism strategy
for running each model in each stage, e.g., a tuple of (p, t,
d) for 3D parallelism, where p, t, d represent PP size, TP
size and DP size, respectively. The single controller program
takes these inputs to initialize models in the RLHF dataflow
and virtualized resource pool, dispatches operations/models
to devices according to the placement plan, and invokes
functions run by the multiple controllers on devices to carry
out distributed computation of each model.
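For concreteness, the sketch below shows one way these inputs could be written down; the dictionary layout and field names are hypothetical, and only ResourcePool and the (p, t, d) / (pg, tg, dg, d) notation come from HybridFlow's APIs (§4 and §5).

# Hypothetical illustration of the three user inputs; only ResourcePool and the
# parallel-size tuple notation come from the paper, the rest is assumed.
model_specs = {
    "actor":     {"size": "7B", "train_parallel": (1, 4, 4), "gen_parallel": (1, 2, 2, 4)},
    "critic":    {"size": "7B", "train_parallel": (1, 4, 4)},
    "reference": {"size": "7B", "infer_parallel": (1, 4, 4)},
    "reward":    {"size": "7B", "infer_parallel": (1, 4, 4)},
}
# Device placement: models applied to the same ResourcePool instance are colocated.
pool = ResourcePool([8] * 2)                  # 2 machines with 8 GPUs each
placement = {name: pool for name in model_specs}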
The multi-controller program implements the ParallelWorker class: it constructs parallel groups of each model
among allocated devices according to its parallelism strategies, invokes the 3D-HybridEngine for actor training and
generation, and can be integrated seamlessly with existing
LLM engines [40, 57, 60, 71] for training, inference and generation of other models. The transfer protocols are coordinated
by the single controller program to support resharding of
data (including prompts, responses, and other model outputs
in RLHF) between models with distinct parallelism strategies. The data resharding of the actor between training and
generation is handled by 3D-HybridEngine.
4 Hybrid Programming Model

4.1 Hierarchical APIs
Intra-node: encapsulating distributed program. For distributed computation of each model in different RLHF stages,
we provide a base class, 3DParallelWorker. Given allocated
devices, it facilitates distributed model weight initialization and establishes 3D parallel groups for each model. A parallel group includes a set of GPUs to host a specific parallel dimension of the model, e.g., different tensor shards in TP and different model replicas in DP. Figure 5(a) illustrates initialization of the actor model with our APIs, while initialization of other models is similar.

class ActorWorker(3DParallelWorker):
    # An example of distributed computation function
    @register(transfer_mode=3D_PROTO)
    def update_actor(self, prompts: DataProto):
        ...

# Allocate devices for a ResourcePool
resource_pool = ResourcePool([n_gpus_per_machine] * n_machines)
# Map the model to allocated devices and init model
actor_model = ActorWorker(actor_config, resource_pool)

Figure 5. An illustration of hierarchical APIs. (a) Model with 3D parallel configuration, resource allocation, and 3DParallelWorker initialization (code above). (b) Asynchronous data resharding between two models with collect and distribute functions in 3D_PROTO.
Inheriting from the 3DParallelWorker class, several model
classes, for actor, critic, reference, and reward model, respectively, are provided. Each of these model classes encapsulates APIs to implement the model’s distributed forward and
backward computation, auto-regressive generation, and optimizer updates, decoupling the distributed computation code
from data dependencies on other models. These APIs can
be easily implemented by reusing the computation scripts
from existing LLM systems. For example, the computation involved in update_actor function of ActorWorker (the class
for the actor model) is similar to the pre-training scripts in
Megatron-LM [71]. A model class encapsulates fundamental
operations for implementing various RLHF algorithms, e.g.,
generate_sequences in the actor model class for generating responses based on the prompts and compute_reward
in the reward model class for evaluating responses through
a forward pass. (More APIs are detailed in Appendix A).
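As an illustration of this pattern, a critic model class might be sketched as follows; CriticWorker, the engine attribute, and build_critic_engine are assumptions introduced for the sketch, whereas 3DParallelWorker, @register, 3D_PROTO, compute_values, and update_critic are the API names used in the paper.

# A sketch only: the engine calls stand in for computation reused from an
# existing LLM system (e.g., Megatron-LM), as described in the text.
class CriticWorker(3DParallelWorker):
    def __init__(self, config, resource_pool):
        super().__init__(config, resource_pool)
        self.engine = build_critic_engine(config)   # hypothetical helper

    @register(transfer_mode=3D_PROTO)
    def compute_values(self, batch: DataProto):
        # Single forward pass over prompts and responses (preparation stage).
        return self.engine.forward(batch)

    @register(transfer_mode=3D_PROTO)
    def update_critic(self, batch: DataProto):
        # Forward, backward, and optimizer update (training stage).
        return self.engine.train_step(batch)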
Besides base class 3DParallelWorker that implements
3D parallelism, we further provide base classes for PyTorch
FSDP (FSDPWorker) and ZeRO (ZeROWorker), and the corresponding model classes inheriting each base class, to support
different parallelism strategies in model computation. ParallelWorker in Figure 4 denotes one of these base classes.
Inter-node: unifying data resharding implementation
between models. Many-to-many multicast is involved for
data transfer between models employing different parallelism strategies on different devices. We unify this data
transfer implementation by associating each operation in
each model class with a transfer protocol, using @register.
Each transfer protocol consists of a collect function and a distribute function, to aggregate output data and distribute input data according to the parallelism strategy of each model.
In the example in Figure 5(a), update_actor operation is
registered to transfer protocol 3D_PROTO, as 3D parallelism
is used for actor training. In 3D_PROTO, the collect function
gathers all the output data of corresponding model function
(e.g., the loss scalar returned from update_actor) in each
DP group to the single controller, and the distribute function distributes the input data to the registered function (e.g.,
advantages for the update_actor) to each DP group. Data
resharding is enabled using the source model’s output collect
function and the destination model’s input distribute function. Figure 5(b) illustrates data resharding between the actor
(generation) and the critic (inference), where computation
of the models adopts different 3D parallelism strategies. The
single controller gathers data futures using the collect function in 3D_PROTO of actor (steps 1-3) and sends them to critic (step 4); critic distributes the received data futures to each DP group using the distribute function in its 3D_PROTO (step 5). Then remote data is retrieved from actor to critic, with each of critic's GPUs only fetching the required local batch of the actor's output data according to its DP rank (step 6).
The actual data transfer only occurs between GPUs, avoiding
any central bottleneck.
We provide 8 transfer protocols, including 3D_PROTO, DP
_PROTO, ONE_TO_ALL, etc., that cover most data resharding
scenarios (detailed in Appendix B). A user can further extend
the transfer protocols through implementing customized
collect and distribute functions.
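For example, a customized protocol could pair a collect function that concatenates the outputs of all DP groups with a distribute function that splits a batch evenly across the destination model's DP groups; the signatures below are assumptions, since the paper only specifies that a protocol is a pair of collect and distribute functions.

# A sketch of a user-defined transfer protocol (signatures are assumed).
def collect_concat(dp_group_outputs):
    # Aggregate output data (or data futures) from every DP group of the
    # source model into one flat list on the single controller.
    return [item for group in dp_group_outputs for item in group]

def distribute_even(data, dp_size):
    # Split the controller-side data into dp_size shards, one per DP group
    # of the destination model.
    chunk = (len(data) + dp_size - 1) // dp_size
    return [data[i * chunk:(i + 1) * chunk] for i in range(dp_size)]

CUSTOM_PROTO = (collect_concat, distribute_even)   # a (collect, distribute) pair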
Facilitating flexible model placement. We provide a
ResourcePool class that virtualizes a set of GPU devices.
When applying a ResourcePool instance to a model class
(Figure 5(a)), distributed computation of the model will be
mapped to the devices. Models utilizing the same ResourcePool
instance are colocated on the same set of GPUs; models are
placed on different sets of GPUs when different Resource
Pool instances are applied in their model classes. We assume
no overlap between different ResourcePool instances.
Asynchronous dataflow execution. When models are
placed on separate sets of devices, their execution is triggered
automatically as soon as their inputs become available [50].
In Figure 5(b), the data future from actor is immediately
returned after the controller's call (steps 1-3); the controller then initiates a new call to critic and distributes the futures following the transfer protocol (steps 4-5). When some models are placed on the same set of devices, they
are executed sequentially based on the calling order. With
our programming model, HybridFlow is flexible in supporting diverse distributed execution patterns without any code change of the RLHF algorithm (Figure 6).

# Initialize cost model by reusing the RewardWorker
cost = RewardWorker(cost_config, resource_pool)
...  # omit other models initialization
algo_type = "Safe-RLHF"  # specify different RLHF numerical computation

# Examples of PPO and Safe-RLHF
for (prompts, pretrain_batch) in dataloader:
    # Stage 1: Generate responses
    batch = actor.generate_sequences(prompts)
    batch = actor.generate_sequences(prompts, do_sample=False)  # added for ReMax
    # Stage 2: Prepare experience
    batch = critic.compute_values(batch)      # not necessary in ReMax
    batch = reference.compute_log_prob(batch)
    batch = reward.compute_reward(batch)
    batch = cost.compute_cost(batch)          # added for Safe-RLHF
    batch = compute_advantages(batch, algo_type)
    # Stage 3: Actor and critic training
    critic_metrics = critic.update_critic(batch, loss_func=algo_type)
    pretrain_loss = actor.compute_loss(pretrain_batch)
    batch["pretrain_loss"] = pretrain_loss
    actor_metrics = actor.update_actor(batch, loss_func=algo_type)

Figure 6. Implementation of PPO [55], ReMax [43], and Safe-RLHF [19]. Users can adapt to different RLHF algorithms by simply adding or deleting a few lines of code.
4.2 Implementation of different RLHF algorithms
Our APIs enable streamlined development of various RLHF
algorithms (dataflows). Users can implement an RLHF algorithm in a few lines of code as a single process program
to run on the single controller, that involves a sequence
of primitive API calls to invoke distributed computation of
models. Examples of PPO, ReMax, and Safe-RLHF are given
in Figure 6. PPO can be implemented in just 8 lines by invoking model operations including compute_values and
generate_sequences, which are executed under the multi-controller paradigm on multiple GPUs. To adapt to Safe-RLHF, which integrates an additional cost model to evaluate safety preferences and the pretrain loss for the actor, only 5
more lines of code are added on top of PPO implementation.
To adapt to ReMax, one additional call to actor generation is
needed, and the critic-related code can be removed.
Achieving flexibility. This flexibility of extension is crucial for
researchers to explore different RLHF algorithms: they can
reuse distributed computation encapsulated in each model
class and simply adjust the code for numerical computations
according to specific algorithms, such as GAE [67] and KL divergence in compute_advantage and loss functions of actor
and critic. The streamlined development can be attributed
to the hybrid programming model. Our modular API design
simplifies development, facilitates extensive code reuse, and
enables directly incorporating the codebase of existing LLM
training/serving frameworks. It also decouples model computation and data transfer among models. Any change in the
distributed frameworks does not affect the code of the RLHF
algorithm (Figure 6), enabling individualized optimization
for each model’s execution (§5). Flexible placement of models with diverse workloads is supported, enabling optimized
mapping of RLHF dataflow onto various devices (§6).
Figure 7. 3D-HybridEngine workflow in one RLHF iteration. 4 GPUs are used for actor training and generation. 1-2-2 (𝑝-𝑡-𝑑) parallel groups are used in training and 1-1-2-2 (𝑝𝑔-𝑡𝑔-𝑑𝑔-𝑑) parallel groups are used in generation.
5 3D-HybridEngine

We design the 3D-HybridEngine to support efficient training and generation of the actor model, targeting significant RLHF throughput improvement.

5.1 Parallel Groups
To eliminate redundant actor model copies, we advocate
deploying actor training and generation stages on the same
set of devices, 𝑁𝑎 GPUs allocated to the actor, and execute
them sequentially on the same copy of actor model weights.
Nonetheless, actor training and generation may well adopt
different 3D parallelism strategies, i.e., the generation stage
typically requires smaller TP and PP sizes but a larger DP
size, than the training stage (§2.3). 3D-HybridEngine enables
efficient model parameter resharding between actor training
and generation across the same set of devices in this context.
Let 𝑝-𝑡-𝑑 denote 3D parallel groups constructed for actor training, corresponding to the set of GPUs to host 𝑝
pipeline stages, 𝑡 tensor shards, and 𝑑 model replicas [54].
3D-HybridEngine builds different parallel groups for actor
training and generation, according to their different 3D parallelism strategies, respectively. We use 𝑝𝑔 , 𝑡𝑔 , and 𝑑𝑔 to denote
the size of generation pipeline parallel group, generation
tensor parallel group, and micro data parallel group, respectively, in the generation stage. 𝑑𝑔 indicates the ratio of model
replica number in generation over that in training, i.e., each
DP replica in training becomes 𝑑𝑔 micro DP replicas, to process 𝑑𝑔 microbatches of prompts and responses. We have
𝑁𝑎 = 𝑝×𝑡×𝑑 = 𝑝𝑔×𝑡𝑔×𝑑𝑔×𝑑, such that 𝑑𝑔 = 𝑝𝑡/(𝑝𝑔𝑡𝑔). The micro DP
groups are employed exclusively in actor generation stage
to render a larger DP size for full device utilization. The
generation parallel groups are denoted by 𝑝𝑔 -𝑡𝑔 -𝑑𝑔 -𝑑.
5.2 3D-HybridEngine Workflow
Between actor training in iteration 𝑖 of RLHF and actor generation in iteration 𝑖 + 1, the actor model parameters need
to be resharded and prompts data to be distributed, following the parallel group configurations in the two stages. In
iteration 𝑖 + 1 of RLHF, 3D-HybridEngine gathers the actor
model parameters updated in iteration 𝑖 (step 1 in Figure 7), for generation within each micro DP group. Then, the batch of prompts is loaded to each model replica (step 2), which generates responses (Generation stage of RLHF). Following this, 3D-HybridEngine performs an all-gather operation on the generation results within each micro DP group (step 3), and re-partitions model parameters according to the 3D parallelism for actor training (step 4). With model weights, prompts and responses correctly re-distributed, the loss of the actor model is computed and actor model weights are updated following the RLHF algorithm (step 5), i.e., the actor training stage of iteration 𝑖 + 1.

Figure 8. Model weights resharding. 2 machines each with 4 GPUs are used for actor training and generation. (a) Same grouping methods between training and generation (HybridFlow-V). (b) Optimized parallel grouping methods (HybridFlow).
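From the single controller's perspective, one iteration of this workflow can be summarized by the pseudocode below; the engine method names are assumptions introduced for the sketch, while generate_sequences and update_actor are APIs from §4, and the comments mirror the five steps in Figure 7.

# Illustrative pseudocode of the 3D-HybridEngine steps in Figure 7.
for prompts in dataloader:
    engine.gather_weights_for_generation()       # step 1: all-gather params within each micro DP group
    batch = actor.generate_sequences(prompts)    # step 2: load prompts and generate responses
    engine.all_gather_generation_results()       # step 3: all-gather responses within each micro DP group
    engine.repartition_weights_for_training()    # step 4: restore the training-time 3D partition
    actor.update_actor(batch)                    # step 5: actor training for this iteration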
5.3 Zero redundancy model resharding
Parallel grouping methods in 3D parallelism are typically as
follows: PP and TP groups are formed by assigning consecutive ranks to pipeline stages and tensor shards, respectively;
DP groups are constructed by selecting ranks at regular intervals, determined by the product of PP size and TP size.
In Figure 8(a), actor training uses 3D parallel groups, 1-4-2:
there is one PP group for all GPUs (for illustration clarity);
the TP groups are [G1, G2, G3, G4], [G5, G6, G7, G8], and
the DP groups are [G1, G5], [G2, G6], [G3, G7], [G4, G8].
Suppose the same parallel grouping methods are used but
with different parallel sizes, e.g., 1-2-2-2 for generation in
Figure 8(a). During the transition from training to generation, 3D-HybridEngine applies all-gather operations among
the model parallel groups to aggregate all parameters, and
then retain only a subset of model weights on each device
for its generation, according to the parallel groups the device
belongs to. On some GPUs (e.g., G2, G3, G6, G7), there is
no overlap between training and generation model weights,
and separate memory is needed to maintain weights for subsequent training as well (grey boxes in Figure 8(a)). We call the system HybridFlow-V, when 3D-HybridEngine uses the above vanilla parallel grouping methods in the two stages.

We further design a new parallel grouping method for 3D-HybridEngine to use in the generation stage, that eliminates the redundancy in weights storage and leads to minimal memory footprint and communication due to actor model resharding between training and generation. Specifically, we form generation TP and PP groups by selecting ranks at regular intervals, determined by 𝑡/𝑡𝑔 and 𝑝/𝑝𝑔, and construct micro DP groups by sequentially assigning ranks along the generation TP or PP dimensions. In Figure 8(b), 1-2-2-2 parallel groups are used in generation: the generation TP groups are [G1, G3], [G2, G4], [G5, G7], [G6, G8]; and the micro DP groups are [G1, G2], [G3, G4], [G5, G6], [G7, G8]. This strategic rearrangement of generation parallel groups leads to overlap between training and generation model weights on each device, enabling reuse of training weights during generation and zero redundancy in device memory usage due to model resharding. In addition, 3D-HybridEngine conducts several all-gather operations concurrently, one within each micro DP group, leading to significantly reduced communication overhead.

Table 2. Transition overhead between training & generation

Comm. Vol.: DS-Chat ((𝑡𝑝𝑑−1)/𝑡𝑝𝑑)𝑀; HybridFlow-V ((𝑡𝑝−1)/𝑡𝑝)𝑀; HybridFlow ((𝑡𝑝−𝑡𝑔𝑝𝑔)/(𝑡𝑔𝑝𝑔𝑡𝑝))𝑀
Peak Mem.: DS-Chat 𝑀; HybridFlow-V 𝑀; HybridFlow (1/(𝑡𝑔𝑝𝑔))𝑀
Redundancy: DS-Chat (1/𝑡𝑝𝑑)𝑀; HybridFlow-V (1/𝑡𝑝)𝑀; HybridFlow 0

5.4 Transition overhead

In Table 2, we compare communication overhead and memory footprint during the transition between training and generation stages, among different actor engine designs. We assume the model size of the actor is 𝑀 and 𝑁𝑎 GPUs are used for its training and generation. The actor engine in DeepSpeed-Chat conducts an all-gather operation across all GPUs during transition; HybridFlow-V performs this all-gather within training TP and PP groups. The communication volumes for these operations are ((𝑁𝑎−1)/𝑁𝑎)𝑀 = ((𝑡𝑝𝑑−1)/𝑡𝑝𝑑)𝑀 for DeepSpeed-Chat and ((𝑡𝑝−1)/𝑡𝑝)𝑀 for HybridFlow-V, calculated following [13]. Both engines aggregate all model parameters in each GPU's memory before subsequently partitioning model states according to the generation parallel groups, resulting in a peak memory usage of model parameters of 𝑀. As they cannot reuse training weights during generation on some GPUs, training weights need to be maintained on those GPUs, amounting to (1/𝑡𝑝𝑑)𝑀 and (1/𝑡𝑝)𝑀 redundant memory consumption, respectively.

With our parallel grouping method for the generation stage, HybridFlow confines the all-gather operation within each micro DP group. The communication overhead is reduced to ((𝑑𝑔−1)/𝑡𝑝)𝑀 = ((𝑡𝑝−𝑡𝑔𝑝𝑔)/(𝑡𝑔𝑝𝑔𝑡𝑝))𝑀. Each GPU only needs to collect remote parameters within its micro DP group and can reuse the training weights in generation. Therefore, the peak memory usage of model parameters in HybridFlow precisely matches the model partition size on each GPU in generation, eliminating any redundancy in GPU memory usage.
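To make these expressions concrete, the sketch below evaluates them for a hypothetical 70B actor (𝑀 = 140 GB of weights, as in §2.3) trained with (𝑝, 𝑡, 𝑑) = (4, 8, 2) and served with (𝑝𝑔, 𝑡𝑔) = (1, 8); the parallel sizes are illustrative only.

# Illustrative evaluation of the Table 2 formulas; parallel sizes are assumed,
# M = 140 GB follows the 70B example in Section 2.3.
M = 140.0                      # actor weights in GB
p, t, d = 4, 8, 2              # training 3D parallel sizes (assumed)
p_g, t_g = 1, 8                # generation PP/TP sizes (assumed)

comm_ds_chat    = (t * p * d - 1) / (t * p * d) * M              # ~137.8 GB, all-gather across all GPUs
comm_hybrid_v   = (t * p - 1) / (t * p) * M                      # ~135.6 GB, within training TP/PP groups
comm_hybridflow = (t * p - t_g * p_g) / (t_g * p_g * t * p) * M  # ~13.1 GB, within micro DP groups
peak_mem_hybridflow = M / (t_g * p_g)                            # 17.5 GB, the generation partition size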
6 Auto Device Mapping
Our hybrid programming model requires users to input the
following configurations, which are referred to as a mapping
of the RLHF dataflow to the given devices: (a) device placement of the models in the dataflow; (b) the corresponding
parallelism strategy for running each model in each stage.
We provide an efficient algorithm (Algorithm 1) for users
to identify the optimized mapping of executing the RLHF
dataflow on a given cluster of devices, that minimizes the
end-to-end latency of each RLHF iteration. Given a dataflow
𝐷, we first explore all possible placement plans P for the
models in the given cluster (Line 3). For example, the PPO algorithm involves four models, resulting in 15 possible placements (from the Bell partition problem [10, 62]), ranging
from a completely standalone placement where all models
are placed on different devices (e.g., OpenRLHF’s placement)
to colocating all models on the same set of devices (e.g.,
DeepSpeed-Chat’s placement). We refer to colocated models
on the same set of GPUs as a colocated set. Models in a colocated set can employ different parallelism strategies across
the same set of GPUs. We identify the smallest number of
GPUs to be allocated to each of the colocated model sets,
𝐴𝑚𝑖𝑛 , based on memory consumption of colocated models,
ensuring no out-of-memory errors (Line 9).
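The count of 15 placements can be reproduced by enumerating all set partitions of the four models; the following standalone sketch is not the framework's get_placements implementation, it merely illustrates the enumeration.

# Enumerate all set partitions (Bell partitions) of the PPO models.
def partitions(models):
    if not models:
        yield []
        return
    first, rest = models[0], models[1:]
    for part in partitions(rest):
        # place `first` into each existing colocated set ...
        for i in range(len(part)):
            yield part[:i] + [part[i] + [first]] + part[i + 1:]
        # ... or give it a colocated set of its own
        yield part + [[first]]

plans = list(partitions(["actor", "critic", "reference", "reward"]))
print(len(plans))   # 15 possible placements for PPO's four models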
Next, starting from the minimal GPU allocation in 𝐴𝑚𝑖𝑛 ,
we enumerate all feasible device allocations to each colocated model set (Lines 10-12). Given device allocation 𝐴 to
the colocated set and computation workload 𝑊 of models
in the set, we explore optimized parallelism strategies for
each model in the auto_parallel module, that minimizes
model execution latency. The workload 𝑊 includes input
and output shapes and computation (training, inference or
generation) of each model. In auto_parallel, we utilize a
simulator module simu to estimate the latency of different
parallel strategies, following previous research [42, 84, 90, 92]
(outline in Appendix. C).
The d_cost module estimates the end-to-end latency of
the RLHF dataflow under given model placement and parallelism strategies, by iterating through all stages in the
dataflow graph and summing up latencies of all stages (Lines 17,
25). For models in the same colocated set and involving computation in the same stage (such as actor and critic both
performing model update in RLHF training stage), their execution latencies are summed up (Line 32). For models in different colocated sets, their execution within the same stage
EuroSys ’25, March 30-April 3, 2025, Rotterdam, Netherlands
Algorithm 1 Device Mapping for an RLHF Dataflow
1: Input: RLHF dataflow graph 𝐷, LLMs in RLHF dataflow
𝐿=[𝑙 1, 𝑙 2, . . . , 𝑙𝑘 ], workload 𝑊 of LLMs in RLHF dataflow, total
# of GPUs 𝑁 , memory capacity per GPU 𝑄
2: Output: device mapping of models in RLHF dataflow
3: P ← get_placements(𝐷, 𝐿, 𝑁 )
4: 𝐶 ∗ ← ∞
5: 𝑏𝑒𝑠𝑡_𝑚𝑎𝑝𝑝𝑖𝑛𝑔 ← ∅
6: for all 𝑝𝑙𝑚 ∈ P do
7:
𝐶𝑝𝑙𝑚 ← ∞
8:
𝑏𝑒𝑠𝑡_𝑝𝑙𝑚_𝑎𝑙𝑙𝑜𝑐 ← ∅
9:
𝐴𝑚𝑖𝑛 ← get_min_alloc(𝑝𝑙𝑚, 𝑄, 𝑁 )
10:
for all 𝐴 ∈ enum_alloc(𝑁 , 𝐴𝑚𝑖𝑛 ) do
b
11:
𝐿 ← []
12:
for all set ∈ 𝑝𝑙𝑚 do
13:
for all 𝑙 ∈ set do
b
14:
𝑙 ← auto_parallel(𝐴, 𝐴𝑚𝑖𝑛 , 𝑙,𝑊 )
b
15:
𝐿.append(b
𝑙)
16:
𝑝𝑙𝑚.update(b
𝐿)
17:
𝐶𝑎𝑙𝑙𝑜𝑐 ← d_cost(𝐷, 𝑝𝑙𝑚,𝑊 )
18:
if 𝐶𝑎𝑙𝑙𝑜𝑐 < 𝐶𝑝𝑙𝑚 then
19:
𝐶𝑝𝑙𝑚 ← 𝐶𝑎𝑙𝑙𝑜𝑐
20:
𝑏𝑒𝑠𝑡_𝑝𝑙𝑚_𝑎𝑙𝑙𝑜𝑐 ← (𝑝𝑙𝑚, 𝐴)
21:
if 𝐶𝑝𝑙𝑚 < 𝐶 ∗ then
22:
𝐶 ∗ ← 𝐶𝑝𝑙𝑚
23:
𝑏𝑒𝑠𝑡_𝑚𝑎𝑝𝑝𝑖𝑛𝑔 ← 𝑏𝑒𝑠𝑡_𝑝𝑙𝑚_𝑎𝑙𝑙𝑜𝑐
24: return 𝑏𝑒𝑠𝑡_𝑚𝑎𝑝𝑝𝑖𝑛𝑔
25: Procedure d_cost(𝐷, 𝑝𝑙𝑚, 𝑊 ):
26:
𝑠 ← number of stages in 𝐷
27:
𝑐 ← [0] × 𝑠 // Initialize latency for each stage to 0
28:
for all set ∈ 𝑝𝑙𝑚 do
29:
𝑐𝑔 ← [0] × 𝑠
30:
for all 𝑖 ∈ {0, ..., 𝑠 − 1} do
31:
for all b
𝑙 ∈ set do
32:
𝑐𝑔 [𝑖] ← 𝑐𝑔 [𝑖] + simu(b
𝑙,𝑊 [𝑖])
33:
𝑐 [𝑖] ← 𝑚𝑎𝑥 {𝑐 [𝑖], 𝑐𝑔 [𝑖]}
34:
return sum(𝑐)
can be parallelized, and the latency of the stage is determined by the maximum execution time among different sets
(Line 33). We identify the best device placement of the models with their corresponding parallelism strategies, achieving
minimal execution time per RLHF iteration (Lines 18-23).
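For readers who prefer code, the following is a compact Python paraphrase of the d_cost procedure (Lines 25-34); simulate_latency is a placeholder standing in for the simu module, and the names are ours.

# Illustrative paraphrase of d_cost (Lines 25-34 of Algorithm 1).
# `placement` is a list of colocated sets; each set is a list of models whose
# parallelism strategies have already been chosen.
def d_cost(placement, workload_per_stage, simulate_latency):
    num_stages = len(workload_per_stage)
    stage_latency = [0.0] * num_stages
    for colocated_set in placement:
        for stage in range(num_stages):
            # models sharing GPUs run sequentially within a stage: sum latencies
            set_latency = sum(
                simulate_latency(model, workload_per_stage[stage])
                for model in colocated_set
            )
            # different colocated sets run in parallel: take the maximum
            stage_latency[stage] = max(stage_latency[stage], set_latency)
    return sum(stage_latency)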
The complexity of Algorithm 1 is O((N − 1)!/((k − 1)!(N − k)!)), where
𝑘 is the number of models in the dataflow and 𝑁 is the total
number of devices to run the dataflow. This is the worst-case
complexity for enumerating all possible device allocations
for a placement strategy (i.e., the standalone placement),
calculated by assigning 𝑁 devices to 𝑘 models (known as
the integer partition problem [6]). For better efficiency, we
cache parallelism strategies identified for each model on a
number of devices 𝐴, to eliminate redundant searches for
the same parallelism strategies when the model is placed on
different sets of 𝐴 GPUs in different placement strategies.
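A minimal sketch of this enumeration and caching, in plain Python with names of our own choosing, is shown below; with all minimum allocations equal to one GPU, the number of enumerated allocations is C(N−1, k−1), matching the worst-case complexity above.

# Illustrative sketch (names are ours). enum_alloc yields all ways to assign
# total_gpus GPUs to the colocated sets, each receiving at least its minimum.
from functools import lru_cache

def enum_alloc(total_gpus, min_alloc):
    if len(min_alloc) == 1:
        if total_gpus >= min_alloc[0]:
            yield (total_gpus,)
        return
    head_min, rest = min_alloc[0], min_alloc[1:]
    for head in range(head_min, total_gpus - sum(rest) + 1):
        for tail in enum_alloc(total_gpus - head, rest):
            yield (head,) + tail

def search_parallelism(model_name, num_gpus):
    # Stub standing in for the simulator-driven search (Algorithm 2, Appendix C).
    return {"model": model_name, "gpus": num_gpus}

@lru_cache(maxsize=None)
def cached_auto_parallel(model_name, num_gpus):
    # The strategy found for (model, #GPUs) is reused whenever the same model is
    # given the same number of GPUs under a different placement, avoiding
    # redundant searches.
    return search_parallelism(model_name, num_gpus)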
Though we assume 𝑁 homogeneous GPUs when running
the auto mapping algorithm, Algorithm 1 can be readily
extended for optimizing model mapping over heterogeneous
devices, by considering heterogeneous devices in simu and
auto_parallel modules [88].
7
Implementation
HybridFlow is implemented in around 12k lines of Python
code (LoC).
Hybrid programming model. The hierarchical APIs are
implemented with 1.8k LoC. The centralized single controller
is built on top of Ray [50] and uses remote procedure calls (RPC) to coordinate the execution order of different models
and transfer data between models following the dataflow.
These intermediate data are stored in TensorDict [57]. In our
multi-controller paradigm for distributed computation, each
model function runs on a separate process across various
devices, with control messages relayed from each controller’s
CPU process to the corresponding GPU. Our implementation
supports Megatron-LM, PyTorch FSDP, and DeepSpeed as
the LLM training and inference engines, and vLLM for autoregressive generation. In vLLM, we replace the centralized
KVCache manager with a distributed manager to align with
the multi-controller paradigm.
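The single-controller pattern itself is easy to picture with a few lines of Ray; the sketch below is ours and does not reflect HybridFlow's actual classes or method names.

# Minimal sketch of the single-controller pattern on Ray (names are ours, not
# HybridFlow's APIs). The driver below plays the role of the centralized
# controller: it dispatches remote calls to long-running worker processes and
# moves intermediate data between them along the dataflow.
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class ModelWorker:
    """One long-running process holding (a shard of) a model."""
    def __init__(self, name):
        self.name = name

    def run(self, batch):
        # Stand-in for a forward pass or generation on this worker's shard.
        return {f"{self.name}_out": batch}

actor = ModelWorker.remote("actor")
critic = ModelWorker.remote("critic")

# The controller coordinates execution order and ships data between models.
prompts = {"prompts": ["p0", "p1"]}
responses = ray.get(actor.run.remote(prompts))
values = ray.get(critic.run.remote(responses))
print(values)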
3D-HybridEngine. Its main logic is implemented with 2.4k
LoC on top of Megatron-LM and vLLM. We store actor model
weights for training and generation stages on separate memory buffers, offload generation weights to the CPU memory during training, reload generation weights back to GPU
memory during the transition, and use both buffers in generation. We use NCCL communication primitives [35] to
collect and concatenate model parameters in each micro DP
group during the transition between training and generation.
We offload KVCache to CPU memory after generation and
reload it back to GPU in the next iteration.
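A condensed sketch of the per-micro-DP-group all-gather, written against torch.distributed (which dispatches to NCCL on GPUs), is given below; the group construction and tensor names are ours, and the real engine operates on Megatron-LM parameter shards with the buffer management described above.

# Condensed sketch (ours) of an all-gather confined to one micro DP group using
# torch.distributed with the NCCL backend. In practice the process group is
# created once at initialization, and the gathered buffer feeds the generation
# engine's weight layout.
import torch
import torch.distributed as dist

def gather_generation_weights(train_shard: torch.Tensor, micro_dp_ranks):
    group = dist.new_group(ranks=micro_dp_ranks)  # requires an initialized default group
    buffers = [torch.empty_like(train_shard) for _ in micro_dp_ranks]
    dist.all_gather(buffers, train_shard, group=group)
    return torch.cat(buffers, dim=0)  # concatenated weights for this micro DP group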
Auto-Mapping Algorithm is implemented with 1.9k LoC,
together with three simulators for training, inference, and
generation workloads. The algorithm runs on CPU before the RLHF dataflow starts, generating the device mapping and parallelism strategies used for dataflow initialization.
8
Evaluation
8.1
Experimental Setup
Testbed. We deploy HybridFlow on a cluster of 16 machines
(128 GPUs). Each machine is equipped with 8 NVIDIA A100-80GB GPUs interconnected via 600GB/s NVLink. The inter-machine bandwidth is 200Gbps. Our experiments use the following software versions: CUDA 12.1, PyTorch 2.1.2,
Megatron-core 0.6.0, NCCL 2.18.1, and vLLM 0.3.1.
Models and RLHF algorithms. We run the RLHF dataflow
(Figure 1) of PPO [68], ReMax [43] and Safe-RLHF [19] algorithms. PPO is one of the most popular algorithms for
RLHF [7, 55], consisting of actor, critic, reference policy, and
reward models. Each model is a Llama [73] model with sizes ranging from 7B to 70B. Safe-RLHF has an additional cost model whose architecture and size are the same as those of the reward model, and ReMax eliminates the critic model. We use mixed precision for actor and critic training, i.e., BF16 for model parameters and FP32 for gradients and optimizer states, with the Adam [38] optimizer in all experiments. BF16 is used in model inference and auto-regressive generation. Unless otherwise specified, experiment results are obtained with PPO.

Figure 9. PPO throughput. Numbers in parentheses are HybridFlow speedups compared with baselines: (a) 7B (1.68×∼8.63×); (b) 13B (2.70×∼18.96×); (c) 34B (2.41×∼20.57×); (d) 70B (5.17×∼17.98×). [Plots of throughput (tokens/s) vs. # of GPUs for NeMo-Aligner, DS-Chat, OpenRLHF, and HybridFlow.]

Figure 10. ReMax throughput. Numbers in parentheses are HybridFlow speedups compared with baselines: (a) 7B (1.53×∼2.56×); (b) 13B (2.49×∼3.66×); (c) 34B (2.14×∼4.80×); (d) 70B (6.46×∼9.78×). [Plots of throughput (tokens/s) vs. # of GPUs for DS-Chat, OpenRLHF, and HybridFlow.]

Figure 11. Safe-RLHF throughput. Numbers in parentheses are HybridFlow speedups compared with baselines: (a) 7B (1.71×∼12.87×); (b) 13B (2.49×∼18.47×); (c) 34B (2.20×∼19.76×); (d) 70B (4.89×∼16.86×). [Plots of throughput (tokens/s) vs. # of GPUs for NeMo-Aligner, DS-Chat, OpenRLHF, and HybridFlow.]
Baselines. We compare HybridFlow with state-of-the-art
RLHF systems including DeepSpeed-Chat [82] v0.14.0, OpenRLHF [30] v0.2.5, and NeMo-Aligner [17] v0.2.0 (detailed in
Table 1). NeMo-Aligner does not support the ReMax algorithm.
We do not compare HybridFlow to other frameworks such
as Trlx [27], HuggingFaceDDP [79], and Collosal-Chat [15]
as they are less representative and slower than the above
baselines (as reported in [82]).
We use RLHF throughput (tokens/sec) as the performance
metric, computed by dividing the total number of tokens in
prompts and responses in a global batch by the time of one RLHF iteration. All reported performance numbers are averaged
over 5 training iterations after a warm-up of 10 iterations.
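For clarity, the metric can be written as the short helper below (ours); with the settings in this section, the prompt and response lengths are both 1024 and the global batch size is 1024.

# Illustrative helper (ours) for the reported RLHF throughput metric.
def rlhf_throughput(global_batch_size, prompt_len, response_len, iter_time_s):
    total_tokens = global_batch_size * (prompt_len + response_len)
    return total_tokens / iter_time_s  # tokens per second of one RLHF iteration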
Datasets and hyperparameters. We perform RLHF on
"Dahoas/ful-hh-rlhf" dataset [7] of HuggingFace, which is
widely used for LLM alignment [64, 85]. As the baseline
systems may not incorporate continuous-batching optimization [83] during generation, for a fair comparison, we enforce the same length on all responses to be generated. In
each experiment, the input prompt length and the output
response length are both 1024 and the global batch size of
input prompts to the actor model is 1024. The number of
PPO epochs is 1 and the number of PPO update iterations per
epoch is 8, aligning with previous RLHF research [31, 55, 81].
8.2
End-to-End Performance
Figures 9, 10, and 11 show RLHF throughput when running
PPO, ReMax, and Safe-RLHF respectively. The actor, critic,
reference, and reward models in this set of experiments are
of the same size, following previous practice [7, 55, 82]. The
number of GPUs used in experiments of different model
sizes ranges from the smallest number of GPUs to run RLHF
without OOM to 128 GPUs. We do not enable offloading
optimizer states [61] in the experiments for fair comparison.
Figure 12. Throughput of HybridFlow under different placements: (a) 13B, (b) 34B. [Plots of throughput (tokens/s) vs. # of GPUs for Colocate, Split, Standalone, and HybridFlow placements.]

Overall performance. We observe that HybridFlow consistently outperforms the baselines across all model scales. In Figure 9 for PPO, HybridFlow outperforms DeepSpeed-Chat,
OpenRLHF and NeMo-Aligner by 3.67× (up to 7.84×), 3.25×
(up to 5.93×) and 12.52× (up to 20.57×), respectively. This is
mainly because HybridFlow effectively executes generation,
inference, and training in all RLHF stages by sharding the
models with different parallelism strategies to fit various
computation workloads. HybridFlow achieves the highest
average speedup of 9.64× when training 70B models, as HybridFlow reduces the transition overhead by up to 71.2% and
89.1% compared to DeepSpeed-Chat and OpenRLHF, respectively; the latter also incurs large inter-machine communication when training with ZeRO-3. Due to the lack of a KVCache in its generation engine, NeMo-Aligner's main performance bottleneck lies in the generation stage, which accounts for up to 81.2% of its RLHF iteration time. Similar results can be observed in Figures 10 and 11, validating the efficiency of HybridFlow in running various RLHF algorithms.
Scalability. HybridFlow achieves at least 2.09× speedup on 8
GPUs. With increasing GPUs, the strong scaling efficiency of
HybridFlow on various model scales is 66.8%, computed by dividing (throughput at the largest scale / throughput at the smallest scale) by (max. # of GPUs / min. # of GPUs) [5], averaging over three algorithms and all model scales. Scaling to a large number of GPUs with a fixed global batch
size results in smaller local batch sizes for each worker, potentially causing GPU underutilization. Running 7B models
on 128 GPUs, HybridFlow still outperforms the best baseline OpenRLHF by 1.68×, 1.53×, and 1.71× on PPO, ReMax, and Safe-RLHF, respectively. This can be attributed to HybridFlow's ability to adopt the best placement strategies for
different models and cluster sizes to minimize RLHF time.
OpenRLHF performs better in a larger GPU cluster but less
efficiently on smaller ones.
8.3
Model Placement
In this experiment, we implement various model placements
of the PPO algorithm in HybridFlow, under the same model
and cluster settings as in Sec. 8.2: (i) colocate, the placement
strategy in DeepSpeed-Chat; (ii) standalone, that in OpenRLHF; (iii) split, NeMo-Aligner's colocation placement
(actor and reference policy on the same set of devices and
critic and reward model on another); (iv) hybridflow, the
optimized placement obtained by Algorithm 1.
Figure 13. Placement comparison under 13B actor and reference policy & 70B critic and reward model. [Plot of throughput (tokens/s) vs. # of GPUs for Colocate, Split, Standalone, and HybridFlow placements.]
Comparison of different model placements. Figure 12
reveals that the optimized placement of HybridFlow varies under different numbers of GPUs. From 16 to 64 GPUs, colocating
all models on the same set of devices yields the best performance. For 96 to 128 GPUs with 34B models and 96 GPUs
with 13B models, the split strategy becomes optimal. The
split strategy divides GPUs evenly between the two sets
of models, as their sizes are equal. For 13B models on 128
GPUs, the standalone strategy achieves the highest throughput. In this case, HybridFlow allocates 64 GPUs for the actor,
32 for the critic, and 16 each for the reference and reward
model. In smaller clusters, computation of all models can
fully utilize GPU resources; the colocate strategy ensures
maximum GPU usage in different RLHF stages. In larger clusters, RLHF throughput under colocate placement fails to scale
up linearly as the batch size is fixed and the computation-to-communication ratio decreases with a larger DP size on more
GPUs. Standalone and split strategies place models on different devices with a smaller DP size for each model in larger
clusters, facilitating parallel execution of different models in
the same stages. In all cases, our Algorithm 1 produces the
best placement with the highest training throughput.
Larger critic and reward model. We further evaluate
model placements when running PPO with a 13B actor and
reference policy and 70B critic and reward models (larger
critic and reward models are expected to produce better alignment [7]). Figure 13 shows that the colocate strategy still
outperforms others by 44.8% on average with up to 64 GPUs.
The split strategy achieves higher throughput with 96 GPUs.
When scaling to 128 GPUs, the best placement obtained by
Algorithm 1 colocates actor, reference, and reward models
on 64 GPUs while allocating the remaining 64 GPUs to critic.
On the same number of GPUs, the computation time of the actor and reference policy is much smaller than that of the critic and reward model, and colocating the reward model with the actor and reference policy reduces the GPU idle time in the experience
preparation stage. In general, distributing actor and critic on
different devices for parallel execution in the training stage
leads to higher throughput in large clusters.
8.4
3D-HybridEngine
Transition time comparison. Figure 14 shows the transition time between actor training and generation stages on
various model scales, which is the time to reshard model weights from training to generation, under the same settings as in §8.2. OpenRLHF's transition time includes the weight synchronization time between the two copies of the actor model on different devices. HybridFlow reduces the transition time by 55.2% (11.7s) on average and the transition overhead by up to 89.1% (78.2s) with 70B models, while maintaining consistent overhead across different cluster scales. This is attributed to our new parallel grouping method for the generation stage (§5.4). In the baseline methods, all model parameters must be collected during transition, necessitating multiple layer-by-layer collections to prevent OOM. HybridFlow enables zero memory redundancy during transition and requires only one all-gather operation per micro DP group.

Figure 14. Transition time between actor training and generation. Panels: (a) 7B (Tg=2, Pg=1, T=8, P=1); (b) 13B (Tg=4, Pg=1, T=8, P=1); (c) 34B (Tg=8, Pg=1, T=8, P=4); (d) 70B (Tg=8, Pg=1, T=8, P=8). [Plots of time (s) vs. # of GPUs for OpenRLHF, DS-Chat, HybridFlow-V, and HybridFlow.]

Figure 15. Time breakdown on different generation parallel sizes of the actor model on 16 GPUs: (a) 7B, (b) 13B. [Bars of generation time and transition time for (Tg, Dg) = (8, 1), (4, 2), (2, 4), (1, 8).]
Transition and generation time. We further validate the
need to use different parallel sizes in actor training and generation in HybridFlow. In this experiment, all models are
colocated on the same set of GPUs, and the KVCache for
generation is allocated using the remaining GPU memory
(i.e., best-effort allocation). Figure 15 gives the transition and
generation time when running RLHF on 16 GPUs with 7B
and 13B models, respectively, with training parallel groups
1-8-2 (following p-t-d convention) and varying generation
TP group size 𝑡𝑔 from 1 to 8. The generation PP group size
remains constant at 𝑝𝑔 =1 and the micro DP group size 𝑑𝑔 is
computed as 8/𝑡𝑔. We observe that applying a smaller generation TP group size, 𝑡𝑔 =2, for 7B models and 𝑡𝑔 =4 for 13B
models reduces the generation latency by 60.3% and 36.4%,
respectively. Conversely, using the same TP size as training
(𝑡𝑔 =8), following the NeMo-Aligner approach, results in the
largest generation latency due to GPU underutilization. Further reducing 𝑡𝑔 fails to achieve higher speedup, as a smaller
𝑡𝑔 necessitates maintaining a larger KVCache per GPU.
Figure 16. Runtime of the device mapping algorithm. The model size and # of GPUs are simultaneously scaled; x-axis: (model size, # of GPUs) from (7B, 8) to (70B, 128).
8.5 Algorithm Runtime
Figure 16 shows the running time of Algorithm 1, which is
significantly shorter than the days required for actual RLHF training. The running time grows linearly, revealing good
scalability of the device mapping algorithm with model size
and cluster size. Most of the running time is spent on estimating the execution latency of each model’s parallel strategies.
More parallelism strategies are available for a larger model,
requiring more simulations to identify the optimal one for
each placement plan. Caching the optimal parallelism strategies of the models for reuse across different placements reduces the search time for the best placement to at most half an hour.
9
Discussions
Fault Tolerance. HybridFlow is orthogonal to existing fault-tolerance approaches [22, 34, 49, 76, 93] and already incorporates checkpointing. Failures can be detected via NCCL errors, and silent data corruption via checksums. Our programming model enables the single controller to coordinate checkpoint operations via RPC, allowing model states to be saved within each ParallelWorker group. This includes saving the parameters of the actor/critic models, dataloader IDs, and Random Number Generator (RNG) states to ensure system-wide consistency. Moreover, HybridFlow can also employ redundancy-based fault-tolerance methods, such as broadcasting parameters and CPU checkpointing, for fast recovery if enough healthy model replicas are available [76, 93].
Placement Insights. We conclude three main insights for
model placement and GPU allocation in RLHF training. 1)
Allocating more GPUs to the actor model can reduce the time-consuming generation latency, which cannot be parallelized
with other models. 2) When each model computation can
fully utilize GPU resources, colocating all the models is most
effective when training on relatively small-scale clusters. 3)
When scaling up to large-scale clusters (i.e., strong scaling),
distributing the actor and critic models on different devices
EuroSys ’25, March 30-April 3, 2025, Rotterdam, Netherlands
for parallel execution in the training and preparation stages
would help achieve higher throughput.
Resource multiplexing. HybridFlow enables colocation of
models on shared devices by utilizing time-sharing for GPU
computation. Recent research in DNN task scheduling has
developed fine-grained resource multiplexing techniques,
primarily aimed at achieving the service-level objectives
of individual tasks [8, 18, 26, 47, 56, 77]. Although the
ResourcePool implementation supports parallel execution
of colocated models, HybridFlow generally adheres to sequential execution to prevent GPU resource contention or
OOM issues as discussed in Section 2.3. Applying GPU sharing and heterogeneous resources in RLHF training poses
distinct challenges, as it seeks to balance the computation
workload and manage complex data dependencies among
various tasks. Investigating fine-grained auto-mapping algorithms for GPU sharing in RLHF training, coupled with
model offload optimization and integration of heterogeneous
devices, would be a promising direction for future research.
From alignment to reasoning. In RLHF for LLM alignment,
the reward signal is generated by the reward model. Besides
alignment tasks, similar algorithms (e.g., PPO and GRPO [70])
can be applied to other domains, such as code generation and
mathematical reasoning. For these tasks, a ground truth may
exist for each prompt, which can be determined by assessing
the correctness of the output value for each code test case and
verifying the accuracy of mathematical results. Therefore,
the reward model can be replaced by non-neural-network
reward modules, such as a sandbox environment [87] for
evaluating generated code or a reward function [14, 65] to
validate mathematical results. HybridFlow can seamlessly
integrate these reward modules by wrapping them as remote
functions and orchestrating their execution within the single-process script, providing a flexible and efficient framework
for diverse reinforcement learning applications.
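A sketch of this integration, with hypothetical names and a trivial verifier, could look as follows; it is not HybridFlow's actual API.

# Illustrative sketch (hypothetical names): a rule-based reward module wrapped
# as a Ray remote function so the single-controller script can call it in place
# of a neural reward model, e.g., for tasks with a known ground truth.
import ray

@ray.remote
def rule_based_reward(response: str, ground_truth: str) -> float:
    # Stand-in verifier: full reward if the response ends with the ground truth.
    return 1.0 if response.strip().endswith(ground_truth) else 0.0

# Inside the single-process RLHF script, rewards would replace compute_reward, e.g.:
# rewards = ray.get([rule_based_reward.remote(r, gt) for r, gt in zip(responses, truths)])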
10
Related Work
RL frameworks. There have been plenty of frameworks
for RL, ranging from general-purpose RL systems designed
for small-scale DNNs [12, 25, 28, 39, 45, 46] to RLHF systems specifically optimized for LLMs [15, 17, 30, 80, 82]. We
have thoroughly examined closely related work in §2 and we
discuss more RL frameworks in this section. These RL frameworks [12, 25, 28, 39, 74], similar to recent RLHF systems,
use a hodgepodge of multi-controller frameworks to implement their algorithms. They establish multiple long-running
distributed programs with each component coordinating
the execution order with hard-coded data synchronization.
Gear [74] further optimized the experience replay segment
of the RL pipeline. However, all these frameworks fail to
support LLM training, inference, and generation in RLHF.
LLM training and serving systems. TorchDDP [57] and
Horovod [69] support data parallel training. ByteScheduler [58]
and DeepSpeed [60] extend data parallelism with communication and memory optimizations. Numerous systems [23,
36, 48, 54, 71, 75, 89] optimized large model training through
model parallelisms such as tensor parallelism and pipeline
parallelism to partition models across devices. LLM serving
systems [3, 16, 40, 72, 83, 92] also adopt data and model
parallelism to accelerate auto-regressive generation with
specialized optimizations like continuous-batching [83] and
chunked-prefill [3]. Note that all the above frameworks adopt
the multi-controller paradigm for efficient computation.
Dataflow systems. Dataflow systems like MapReduce [21],
Spark [86], Dryad [33], and Naiad [51] are popular for analytics and ML workloads but they lack support for dynamic task
graphs. Ray [50] unifies task-parallel and actor programming
models in a single dynamic task graph and implements a scalable distributed scheduler and a global control store, which
is adopted by many RL frameworks [45, 46]. Pathways [9],
a closed-source project for TPUs, is designed to easily express complex parallelism patterns and fine-grained control
flow within a single DNN model, such as pipeline parallelism
and Mixture-of-Experts with sparse computation. It employs
an asynchronous distributed dataflow design that enables
parallel control plane execution despite data dependencies,
reducing the dispatch overhead of the single-controller paradigm. Its main focus lies in single-model training, requiring
complex compilations of each sub-network of a DNN model.
HybridFlow can integrate Pathways as a submodule to implement the computation of models in the RLHF dataflow.
11
Conclusion
HybridFlow is an RLHF framework that enables flexible representation and efficient execution of diverse RLHF algorithms. We propose a hybrid programming model that allows
users to easily build RLHF dataflow in a few lines of code by
encapsulating distributed computation of different LLMs into
primitive APIs and hiding the complexity of data resharding
among nodes. Our 3D-HybridEngine ensures efficient execution of training and generation of the actor model, with zero
memory redundancy and significantly reduced communication overhead for model parameter resharding. Furthermore,
our effective mapping algorithm optimizes GPU allocation
and placement of models in the RLHF dataflow. Extensive
experiments demonstrate that HybridFlow achieves 1.53× to
20.57× speedup compared to state-of-the-art RLHF systems
under various model sizes and cluster scales.
Acknowledgments
We would like to thank our shepherd Y. Charlie Hu and
the anonymous reviewers for their constructive feedback.
We thank Xin Liu, Yangrui Chen, and Ningxin Zheng for
their insightful feedback on this project. This work was
supported in part by a ByteDance Research Collaboration
Project, and grants from Hong Kong RGC under the contracts
HKU 17204423 and C7004-22G (CRF).
References
[1] Martín Abadi. 2016. TensorFlow: learning functions at scale. In Proceedings of the 21st ACM SIGPLAN international conference on functional
programming. 1–1.
[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge
Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt,
Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report.
arXiv preprint arXiv:2303.08774 (2023).
[3] Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee. 2023. Sarathi: Efficient
llm inference by piggybacking decodes with chunked prefills. arXiv
preprint arXiv:2308.16369 (2023).
[4] Riad Akrour, Marc Schoenauer, and Michele Sebag. 2011. Preferencebased policy learning. In Machine Learning and Knowledge Discovery
in Databases: European Conference, ECML PKDD 2011, Athens, Greece,
September 5-9, 2011. Proceedings, Part I 11. Springer, 12–27.
[5] Gene M Amdahl. 1967. Validity of the single processor approach to
achieving large scale computing capabilities. In Proceedings of the April
18-20, 1967, spring joint computer conference. 483–485.
[6] George E Andrews and Kimmo Eriksson. 2004. Integer partitions.
Cambridge University Press.
[7] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen,
Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom
Henighan, et al. 2022. Training a helpful and harmless assistant
with reinforcement learning from human feedback. arXiv preprint
arXiv:2204.05862 (2022).
[8] Zhihao Bai, Zhen Zhang, Yibo Zhu, and Xin Jin. 2020. {PipeSwitch}:
Fast pipelined context switching for deep learning applications. In 14th
USENIX Symposium on Operating Systems Design and Implementation
(OSDI 20). 499–514.
[9] Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat,
Steven Hand, Daniel Hurt, Michael Isard, Hyeontaek Lim, Ruoming
Pang, Sudip Roy, et al. 2022. Pathways: Asynchronous distributed
dataflow for ml. Proceedings of Machine Learning and Systems 4 (2022),
430–449.
[10] Eric Temple Bell. 1934. Exponential polynomials. Annals of Mathematics (1934), 258–277.
[11] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared
Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse,
Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,
Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot
Learners. CoRR abs/2005.14165 (2020). arXiv:2005.14165 https:
//arxiv.org/abs/2005.14165
[12] I. Caspi. 2017. Reinforcement learning coach by Intel. https://github.
com/NervanaSystems/coach
[13] Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert Van
De Geijn. 2007. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience 19, 13
(2007), 1749–1783.
[14] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton,
Reiichiro Nakano, et al. 2021. Training verifiers to solve math word
problems. arXiv preprint arXiv:2110.14168 (2021).
[15] Collosal-AI Corporation. 2023. Collosal-Chat. https://github.com/
binmakeswell/ColossalChat
[16] NVIDIA Corporation. 2023. TensorRT-LLM: A TensorRT Toolbox for Optimized Large Language Model Inference. https://github.com/NVIDIA/
TensorRT-LLM
[17] NVIDIA Corporation. 2024. NeMo-Aligner: Scalable toolkit for efficient
model alignment. https://github.com/NVIDIA/NeMo-Aligner
EuroSys ’25, March 30-April 3, 2025, Rotterdam, Netherlands
[18] Weihao Cui, Han Zhao, Quan Chen, Hao Wei, Zirui Li, Deze Zeng,
Chao Li, and Minyi Guo. 2022. {DVABatch}: Diversity-aware {MultiEntry} {Multi-Exit} batching for efficient processing of {DNN} services on {GPUs}. In 2022 USENIX Annual Technical Conference (USENIX
ATC 22). 183–198.
[19] Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu,
Yizhou Wang, and Yaodong Yang. 2024. Safe RLHF: Safe Reinforcement
Learning from Human Feedback. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=
TyFrPOKYXw
[20] Frederica Darema. 2001. The spmd model: Past, present and future.
In Recent Advances in Parallel Virtual Machine and Message Passing
Interface: 8th European PVM/MPI Users’ Group Meeting Santorini/Thera,
Greece, September 23–26, 2001 Proceedings 8. Springer, 1–1.
[21] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data
processing on large clusters. Commun. ACM 51, 1 (2008), 107–113.
[22] Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa
Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha
Smelyanskiy, and Murali Annavaram. 2022. {Check-N-Run}: A checkpointing system for training deep learning recommendation models.
In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22). 929–943.
[23] Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen
Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, et al. 2021.
DAPPLE: A pipelined data parallel approach for training large models.
In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and
Practice of Parallel Programming. 431–445.
[24] X Yu Geoffrey, Yubo Gao, Pavel Golikov, and Gennady Pekhimenko.
2021. Habitat: A {Runtime-Based} computational performance predictor for deep neural network training. In 2021 USENIX Annual Technical
Conference (USENIX ATC 21). 503–521.
[25] Danijar Hafner, James Davidson, and Vincent Vanhoucke. 2017. Tensorflow agents: Efficient batched reinforcement learning in tensorflow.
arXiv preprint arXiv:1709.02878 (2017).
[26] Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022.
Microsecond-scale preemption for concurrent GPU-accelerated DNN
inferences. In 16th USENIX Symposium on Operating Systems Design
and Implementation (OSDI 22). 539–558.
[27] Alexander Havrilla, Maksym Zhuravinskyi, Duy Phung, Aman Tiwari,
Jonathan Tow, Stella Biderman, Quentin Anthony, and Louis Castricato.
2023. trlX: A framework for large scale reinforcement learning from
human feedback. In Proceedings of the 2023 Conference on Empirical
Methods in Natural Language Processing. 8578–8595.
[28] C. Hesse, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu.
2017. OpenAI baselines. https://github.com/openai/baselines
[29] Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad
Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi,
Heyang Qin, Arash Bakhtiari, Lev Kurilenko, et al. 2024. DeepSpeedFastGen: High-throughput Text Generation for LLMs via MII and
DeepSpeed-Inference. arXiv preprint arXiv:2401.08671 (2024).
[30] Jian Hu, Xibin Wu, Xianyu, Chen Su, Leon Qiu, Daoning Jiang, Qing
Wang, and Weixun Wang. 2023. OpenRLHF: A Ray-based Highperformance RLHF framework. https://github.com/OpenLLMAI/
OpenRLHF.
[31] Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul,
Weixun Wang, and Lewis Tunstall. 2024. The N+ Implementation
Details of RLHF with PPO: A Case Study on TL; DR Summarization.
arXiv preprint arXiv:2403.17031 (2024).
[32] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao
Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui
Wu, et al. 2019. Gpipe: Efficient training of giant neural networks
using pipeline parallelism. Advances in neural information processing
systems 32 (2019).
EuroSys ’25, March 30-April 3, 2025, Rotterdam, Netherlands
[33] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis
Fetterly. 2007. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys
European conference on computer systems 2007. 59–72.
[34] Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, and Mosharaf Chowdhury. 2023. Oobleck: Resilient distributed training of large models
using pipeline templates. In Proceedings of the 29th Symposium on
Operating Systems Principles. 382–395.
[35] Sylvain Jeaugey. 2017. Nccl 2.0. In GPU Technology Conference (GTC),
Vol. 2. 23.
[36] Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen,
Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al.
2024. MegaScale: Scaling Large Language Model Training to More
Than 10,000 GPUs. arXiv preprint arXiv:2402.15627 (2024).
[37] Timo Kaufmann, Paul Weng, Viktor Bengs, and Eyke Hüllermeier.
2023. A survey of reinforcement learning from human feedback. arXiv
preprint arXiv:2312.14925 (2023).
[38] Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG]
[39] I. Kostrikov. 2017. PyTorch implementation of advantage actor critic
(A2C), proximal policy optimization (PPO) and scalable trust-region
method for deep reinforcement learning. https://github.com/ikostrikov/
pytorch-a2c-ppo-acktr
[40] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin
Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica.
2023. Efficient memory management for large language model serving
with pagedattention. In Proceedings of the 29th Symposium on Operating
Systems Principles. 611–626.
[41] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas
Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. 2023.
Rlaif: Scaling reinforcement learning from human feedback with ai
feedback. arXiv preprint arXiv:2309.00267 (2023).
[42] Cheng Li. 2023. LLM-Analysis: Latency and Memory Analysis of Transformer Models for Training and Inference. https://github.com/cli99/llmanalysis
[43] Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun,
and Zhi-Quan Luo. 2023. ReMax: A Simple, Effective, and Efficient
Reinforcement Learning Method for Aligning Large Language Models.
arXiv preprint arXiv: 2310.10505 (2023).
[44] Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng,
Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gonzalez, et al. 2023. {AlpaServe}: Statistical multiplexing with model
parallelism for deep learning serving. In 17th USENIX Symposium on
Operating Systems Design and Implementation (OSDI 23). 663–679.
[45] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox,
Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. 2018.
RLlib: Abstractions for distributed reinforcement learning. In International conference on machine learning. PMLR, 3053–3062.
[46] Eric Liang, Zhanghao Wu, Michael Luo, Sven Mika, Joseph E Gonzalez,
and Ion Stoica. 2021. RLlib Flow: Distributed Reinforcement Learning
is a Dataflow Problem. Advances in Neural Information Processing
Systems 34 (2021), 5506–5517.
[47] Yun Liang, Huynh Phung Huynh, Kyle Rupnow, Rick Siow Mong Goh,
and Deming Chen. 2014. Efficient GPU spatial-temporal multitasking.
IEEE Transactions on Parallel and Distributed Systems 26, 3 (2014), 748–
760.
[48] Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xiaowei Li. 2017. Flexflow: A flexible dataflow accelerator architecture for
convolutional neural networks. In 2017 IEEE International Symposium
on High Performance Computer Architecture (HPCA). IEEE, 553–564.
[49] Jayashree Mohan, Amar Phanishayee, and Vijay Chidambaram. 2021.
{CheckFreq}: Frequent,{Fine-Grained} {DNN} Checkpointing. In
19th USENIX Conference on File and Storage Technologies (FAST 21).
203–216.
[50] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov,
Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul,
Michael I Jordan, et al. 2018. Ray: A distributed framework for emerging {AI} applications. In 13th USENIX symposium on operating systems
design and implementation (OSDI 18). 561–577.
[51] Derek G Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul
Barham, and Martín Abadi. 2013. Naiad: a timely dataflow system.
In Proceedings of the Twenty-Fourth ACM Symposium on Operating
Systems Principles. 439–455.
[52] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang,
Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju,
William Saunders, et al. 2021. Webgpt: Browser-assisted questionanswering with human feedback. arXiv preprint arXiv:2112.09332
(2021).
[53] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri,
Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei
Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN
training. In Proceedings of the 27th ACM symposium on operating systems principles. 1–15.
[54] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi
Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient
large-scale language model training on gpu clusters using megatronlm. In Proceedings of the International Conference for High Performance
Computing, Networking, Storage and Analysis. 1–15.
[55] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina
Slama, Alex Ray, et al. 2022. Training language models to follow
instructions with human feedback. Advances in Neural Information
Processing Systems 35 (2022), 27730–27744.
[56] Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke. 2017. Dynamic resource management for efficient utilization of multitasking
GPUs. In Proceedings of the twenty-second international conference on
architectural support for programming languages and operating systems.
527–540.
[57] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James
Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia
Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style,
high-performance deep learning library. Advances in neural information processing systems 32 (2019).
[58] Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang
Lan, Chuan Wu, and Chuanxiong Guo. 2019. A Generic Communication Scheduler for Distributed DNN Training Acceleration. In
Proceedings of the 27th ACM Symposium on Operating Systems Principles. ACM, Huntsville Ontario Canada, 16–29. https://doi.org/10.
1145/3341301.3359642
[59] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He.
2020. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance
Computing, Networking, Storage and Analysis. IEEE, 1–16.
[60] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He.
2020. Deepspeed: System optimizations enable training deep learning
models with over 100 billion parameters. In Proceedings of the 26th
ACM SIGKDD International Conference on Knowledge Discovery & Data
Mining. 3505–3506.
[61] Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji
Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He.
2021. {Zero-offload}: Democratizing {billion-scale} model training. In
2021 USENIX Annual Technical Conference (USENIX ATC 21). 551–564.
[62] Gian-Carlo Rota. 1964. The number of partitions of a set. The American
Mathematical Monthly 71, 5 (1964), 498–504.
[63] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai
Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal
Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan
Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron,
Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve.
2023. Code Llama: Open Foundation Models for Code. arXiv preprint
arXiv: 2308.12950 (2023).
[64] Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, and Yelong Shen.
2023. Efficient RLHF: Reducing the Memory Usage of PPO. arXiv
preprint arXiv: 2309.00754 (2023).
[65] David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing Mathematical Reasoning Abilities of Neural Models. arXiv:1904.01557 (2019).
[66] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and
Philipp Moritz. 2015. Trust region policy optimization. In International
conference on machine learning. PMLR, 1889–1897.
[67] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and
Pieter Abbeel. 2018. High-Dimensional Continuous Control Using
Generalized Advantage Estimation. arXiv:1506.02438 [cs.LG]
[68] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and
Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv
preprint arXiv:1707.06347 (2017).
[69] Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast
and easy distributed deep learning in TensorFlow. arXiv preprint
arXiv:1802.05799 (2018).
[70] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song,
Mingchuan Zhang, YK Li, Y Wu, and Daya Guo. 2024. Deepseekmath:
Pushing the limits of mathematical reasoning in open language models.
arXiv preprint arXiv:2402.03300 (2024).
[71] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley,
Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multibillion parameter language models using model parallelism. arXiv
preprint arXiv:1909.08053 (2019).
[72] Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. 2023. PowerInfer:
Fast Large Language Model Serving with a Consumer-grade GPU.
arXiv:2312.12456 [cs.LG]
[73] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal
Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and
fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
[74] Hanjing Wang, Man-Kit Sit, Congjie He, Ying Wen, Weinan Zhang,
Jun Wang, Yaodong Yang, and Luo Mai. 2023. GEAR: a GPU-centric
experience replay system for large reinforcement learning models. In
International Conference on Machine Learning. PMLR, 36380–36390.
[75] Minjie Wang, Chien-chin Huang, and Jinyang Li. 2019. Supporting
very large models using automatic dataflow graph partitioning. In
Proceedings of the Fourteenth EuroSys Conference 2019. 1–17.
[76] Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, TS Eugene Ng, and Yida Wang. 2023. Gemini: Fast failure recovery in distributed training with in-memory checkpoints. In Proceedings of the
29th Symposium on Operating Systems Principles. 364–381.
[77] Zhenning Wang, Jun Yang, Rami Melhem, Bruce Childers, Youtao
Zhang, and Minyi Guo. 2016. Simultaneous multikernel GPU: Multitasking throughput processors via fine-grained sharing. In 2016 IEEE
international symposium on high performance computer architecture
(HPCA). IEEE, 358–369.
[78] Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8
(1992), 229–256.
[79] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond,
Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf,
Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen,
Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush.
2019. HuggingFace’s Transformers: State-of-the-art Natural Language
Processing. arXiv preprint arXiv: 1910.03771 (2019).
EuroSys ’25, March 30-April 3, 2025, Rotterdam, Netherlands
[80] Youshao Xiao, Weichang Wu, Zhenglei Zhou, Fagui Mao, Shangchun
Zhao, Lin Ju, Lei Liang, Xiaolu Zhang, and Jun Zhou. 2023. An Adaptive
Placement and Parallelism Framework for Accelerating RLHF Training.
arXiv preprint arXiv: 2312.11819 (2023).
[81] Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei,
Guangju Wang, Chao Yu, and Yi Wu. 2024. Is dpo superior to ppo for
llm alignment? a comprehensive study. arXiv preprint arXiv:2404.10719
(2024).
[82] Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang,
Conglong Li, Connor Holmes, et al. 2023. DeepSpeed-Chat: Easy, Fast
and Affordable RLHF Training of ChatGPT-like Models at All Scales.
arXiv preprint arXiv:2308.01320 (2023).
[83] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and
Byung-Gon Chun. 2022. Orca: A distributed serving system for
{Transformer-Based} generative models. In 16th USENIX Symposium
on Operating Systems Design and Implementation (OSDI 22). 521–538.
[84] Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou,
Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan
Yan, Beidi Chen, Guangyu Sun, and Kurt Keutzer. 2024. LLM Inference
Unveiled: Survey and Roofline Model Insights. arXiv preprint arXiv:
2402.16363 (2024).
[85] Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang,
and Fei Huang. 2023. Rrhf: Rank responses to align language models
with human feedback without tears. arXiv preprint arXiv:2304.05302
(2023).
[86] Matei Zaharia, Reynold S Xin, Patrick Wendell, Tathagata Das,
Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J Franklin, et al. 2016. Apache spark: a
unified engine for big data processing. Commun. ACM 59, 11 (2016),
56–65.
[87] Chi Zhang, Guangming Sheng, Siyao Liu, Jiahao Li, Ziyuan Feng,
Zherui Liu, Xin Liu, Xiaoying Jia, Yanghua Peng, Haibin Lin, and
Chuan Wu. 2024. A Framework for Training Large Language Models
for Code Generation via Proximal Policy Optimization. NL2Code
Workshop of ACM KDD 2024 (2024).
[88] Shiwei Zhang, Lansong Diao, Chuan Wu, Zongyan Cao, Siyu Wang,
and Wei Lin. 2024. HAP: SPMD DNN Training on Heterogeneous
GPU Clusters with Automated Program Synthesis. arXiv preprint
arXiv:2401.05965 (2024).
[89] Shiwei Zhang, Lansong Diao, Chuan Wu, Siyu Wang, and Wei Lin.
2022. Accelerating large-scale distributed neural network training
with SPMD parallelism. In Proceedings of the 13th Symposium on Cloud
Computing. 403–418.
[90] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng
Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo,
Eric P Xing, et al. 2022. Alpa: Automating inter-and {Intra-Operator}
parallelism for distributed deep learning. In 16th USENIX Symposium
on Operating Systems Design and Implementation (OSDI 22). 559–578.
[91] Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Shihan Dou, Yuhao Zhou,
Zhiheng Xi, Xiao Wang, Haoran Huang, Tao Gui, et al. 2023. Improving
generalization of alignment with human preferences through group
invariant learning. arXiv preprint arXiv:2310.11971 (2023).
[92] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating
Prefill and Decoding for Goodput-optimized Large Language Model
Serving. arXiv:2401.09670 [cs]
[93] Yuchen Zhong, Guangming Sheng, Juncheng Liu, Jinhui Yuan, and
Chuan Wu. 2023. Swift: Expedited Failure Recovery for Large-Scale
DNN Training. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (Montreal,
QC, Canada) (PPoPP ’23). Association for Computing Machinery, New
York, NY, USA, 447–449. https://doi.org/10.1145/3572848.3577510
EuroSys ’25, March 30-April 3, 2025, Rotterdam, Netherlands
G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, P. Yang, H. Lin, C. Wu
Table 3. The transfer protocols in HybridFlow. Each protocol consists of a distribute function and a collect function.

ONE_TO_ALL. Distribute: broadcast the data to all ranks. Collect: gather the data from all ranks. Use case: all the worker methods have the same input and run the same code, e.g., model initialization.

3D_PROTO. Distribute: split the data, scatter it across all DP ranks, and broadcast it within each group. Collect: gather and concatenate the data from the p=-1, t=0 worker in all DP groups. Use case: the model is sharded among multiple workers within each data-parallel group; the output of the model only exists in the last pipeline stage and is duplicated across the data-parallel groups. This is a typical scenario in 3D-parallel training with Megatron-LM, DeepSpeed, etc.

3D_ALL_MICRO_DP. Distribute: split the data by micro DP size, scatter it across all micro DP groups, and broadcast it among all ranks within each group. Collect: gather and concatenate the data from the local_rank=0 worker in all micro DP groups. Use case: used with the HybridEngine to handle the 3D-parallel scheme of the policy model when switching between training and inference.

3D_PP_ONLY. Distribute: broadcast the data to all ranks. Collect: gather and concatenate the data from the t=0, d=0 worker in all PP groups. Use case: used to examine weight names, as they are identical within TP and DP groups.

DP_PROTO. Distribute: split the data into batches and scatter them across all DP ranks. Collect: gather and concatenate the data from all DP ranks. Use case: training the model in data-parallel mode.

ALL_TO_ALL. Distribute: no operation. Collect: gather the data from all ranks. Use case: used when debugging; users can manually define the inputs of each worker and examine their outputs respectively.
A
Primitive APIs in HybridFlow
In HybridFlow, we implemented the primitive of each model
in RLHF training by inheriting from the 3DParallelWorker, FSDPWorker, and ZeROWorker classes. The functions of these model classes
are designed to decouple the distributed computation code
and provide fundamental operations in RLHF for the users.
This primitive design is compatible with the auto-regressive
generation, forward pass, backward pass, and model update
operations in the existing distributed inference and training
frameworks. Users can easily customize the RLHF training
dataflow (by adapting the numerical computation in the provided functions) according to the algorithm’s design and
benefit from reusing the underlying distributed computation
implementation. We illustrate the meaning and the actual
computations of these APIs in Table 4.
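To illustrate how these primitives compose, the sketch below arranges the Table 4 functions into one PPO-style iteration; the worker handles and the compute_advantage helper are placeholders, and data resharding, batching, and PPO epochs are omitted.

# Sketch of one PPO-style iteration composed from the primitives in Table 4.
# The worker handles (actor, critic, reference, reward_model) and the
# compute_advantage helper are placeholders; transfer protocols are omitted.
def ppo_iteration(actor, critic, reference, reward_model, compute_advantage, prompts):
    # Generation stage
    batch = actor.generate_sequence(prompts)
    # Experience preparation stage
    batch["values"] = critic.compute_values(batch)
    batch["ref_log_prob"] = reference.compute_ref_log_prob(batch)
    batch["rewards"] = reward_model.compute_reward(batch)
    batch["advantages"] = compute_advantage(batch)  # numerical only, no forward pass
    # Training stage
    critic.update_critic(batch)
    actor.update_actor(batch)
    return batch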
B
Transfer Protocols
We implemented transfer protocols that cover all common
use cases of data resharding between models in the RLHF dataflow. Users can utilize these pre-defined protocols to generate any RLHF dataflow. Moreover, users can easily define their own transfer protocols by implementing a collect function and a distribute function. Transfer protocols decouple the complicated data resharding from distributed training. We denote by p, t, d the rank of a worker in its pipeline-, tensor-, and data-parallel group, respectively. We illustrate these predefined protocols in Table 3.
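As a shape reference only (the names and signatures below are ours, not HybridFlow's), a user-defined protocol amounts to a pair of functions such as:

# A user-defined transfer protocol is a (distribute, collect) pair. This toy
# DP-style example uses our own names and signatures.
def distribute_dp(data, dp_ranks):
    """Split a batch into one chunk per data-parallel rank."""
    chunk = (len(data) + len(dp_ranks) - 1) // len(dp_ranks)
    return {rank: data[i * chunk:(i + 1) * chunk] for i, rank in enumerate(dp_ranks)}

def collect_dp(outputs_by_rank):
    """Concatenate outputs gathered from all data-parallel ranks, in rank order."""
    return [item for rank in sorted(outputs_by_rank) for item in outputs_by_rank[rank]]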
C
Auto-Parallelism Algorithm
Algorithm 2 outlines the search process of the optimal parallelism strategy of each model. Starting from the minimal
model parallelism size of each model (to prevent OOM when
colocating with multiple workers), we enumerate all feasible parallel configurations based on the number of GPUs
and the number of GPUs per machine U. The default value of U is 8. We use the simu module to estimate the latency of each model based on its workload. This module
includes three simulators for training, inference, and generation workloads, all analytical models following previous research [42, 84, 92]. The training and inference workloads are compute-bound, while the generation workload is memory-bound. For the actor model, we first find the parallelism strategy for training and record the memory usage in the training stage. During actor generation, KVCache requirements are calculated using the batch size and the maximum sequence length. If the model-parallel size for the generation stage cannot accommodate both the parameters and the KVCache, we increase it. We then seek the optimal strategy with the corresponding KVCache allocation by comparing the latency estimates. Developing a comprehensive auto-regressive generation simulator that accounts for variable KVCache sizes could further enhance the auto-mapping process in RLHF research.

Algorithm 2 Auto Parallelism Algorithm
1: Input: device allocation A, minimal device allocation and model parallel sizes for each model in a set A_min, workload W, the number of GPUs per machine U
2: Output: the parallelism strategy for the model in a set
3: Procedure auto_parallel(A, A_min, l, W):
4:   N_l ← A[l]  // Get device allocation of the model
5:   t_min ← A_min[l].t  // Get minimal model parallel size
6:   p_min ← A_min[l].p
7:   best_para ← ∅
8:   best_para.cost ← ∞
9:   for all t ∈ {t_min, t_min + 1, ..., U} do
10:    for all p ∈ {p_min, p_min + 1, ..., N_l/U} do
11:      d ← N_l/(p × t)
12:      para_plan ← (p, t, d)
13:      cost ← simu(para_plan, l, W[l])
14:      if best_para.cost > cost then
15:        best_para.cost ← cost
16:        best_para ← para_plan
17: return best_para
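A plain-Python rendering of Algorithm 2, with simu replaced by a caller-supplied estimator and two small guards added by us (divisibility of the allocation and a floor of one pipeline stage), is sketched below; the names are ours.

# Plain-Python rendering of Algorithm 2 (names are ours). `simulate` stands in
# for the simu module and returns the estimated latency of the workload under
# a (p, t, d) plan.
def auto_parallel(num_gpus, t_min, p_min, gpus_per_machine, workload, simulate):
    best_plan, best_cost = None, float("inf")
    max_p = max(num_gpus // gpus_per_machine, p_min)  # at least p_min pipeline stages
    for t in range(t_min, min(gpus_per_machine, num_gpus) + 1):
        for p in range(p_min, max_p + 1):
            if num_gpus % (p * t) != 0:
                continue  # (p, t) must evenly divide the allocated GPUs
            d = num_gpus // (p * t)
            cost = simulate((p, t, d), workload)
            if cost < best_cost:
                best_plan, best_cost = (p, t, d), cost
    return best_plan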
Table 4. Key functions provided in each model class. Users can use these provided functions to construct various RLHF algorithms in a few lines of code.

Actor, generate_sequence (auto-regressive generation): based on a batch of prompts, the actor model generates a batch of responses and returns the log probability of each token in the responses.

Actor, compute_log_prob (a forward pass): the actor model computes the log probability of each token in the prompts and responses. This log probability is the same as the log probability returned during generation when using the same model precision. (Optional in PPO.)

Actor, compute_loss (a forward pass): the actor model computes the pretraining loss based on the pretraining dataset [7, 19, 55].

Actor, update_actor (a forward pass, a backward pass, and a model update): based on the advantages and returns (calculated from compute_advantage) and the pretraining loss, the actor model calculates the training loss and updates its weights. We implement various losses for diverse RLHF algorithms, including PPO [55], Safe-RLHF [19], ReMax [43], GRPO [70], and others.

Critic, compute_values (a forward pass): the critic model computes the values for each prompt and response.

Critic, update_critic (a forward pass, a backward pass, and a model update): based on the values and returns, the critic computes a squared-error loss to update its weights. We also implement critic losses for diverse RLHF algorithms, including PPO [55], Safe-RLHF [19], ReMax [43], GRPO [70], and others.

Reference Policy, compute_ref_log_prob (a forward pass): the reference model computes the reference log probability of each token in the prompts and responses. This log probability is utilized as a benchmark to evaluate the divergence of the actor model and constrain its learning process.

Reward, compute_reward (a forward pass): the reward model conducts forward computation to calculate scores for a given set of prompts and responses. The rewards can be token-level or sample-level.

compute_advantage (numerical computation, no model): based on the values and rewards from the value model and reward model, respectively, the function estimates the advantages for the given prompts and the current policy model's responses. This computation involves no model forward passes.