ZeRO-Offload: Democratizing Billion-Scale Model Training

Jie Ren, UC Merced; Samyam Rajbhandari, Reza Yazdani Aminabadi, and Olatunji Ruwase, Microsoft; Shuangyan Yang, UC Merced; Minjia Zhang, Microsoft; Dong Li, UC Merced; Yuxiong He, Microsoft

https://www.usenix.org/conference/atc21/presentation/ren-jie

Abstract

Large-scale model training has been a playing ground for a limited few users, because it often requires complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model training accessible to nearly everyone. It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular frameworks such as PyTorch, and it does so without requiring any model change from data scientists or sacrificing computational efficiency.

ZeRO-Offload enables large model training by offloading data and compute to CPU. To preserve compute efficiency, it is designed to minimize data movement to/from GPU and reduce CPU compute time while maximizing memory savings on GPU. As a result, ZeRO-Offload can achieve 40 TFlops/GPU on a single NVIDIA V100 GPU for a 10B parameter model, compared to 30 TFlops using PyTorch alone for a 1.4B parameter model, the largest that can be trained without running out of memory on GPU. ZeRO-Offload is also designed to scale on multiple GPUs when available, offering near-linear speedup on up to 128 GPUs. Additionally, it can work together with model parallelism to train models with over 70 billion parameters on a single DGX-2 box, a 4.5x increase in model size compared to using model parallelism alone.

By combining compute and memory efficiency with ease of use, ZeRO-Offload democratizes large-scale model training, making it accessible even to data scientists with access to just a single GPU.

1 Introduction

Since the advent of attention-based deep learning (DL) models in 2017, we have seen an exponential growth in DL model size, fueled by the substantial quality gains that these attention-based models offer as the number of parameters increases. For example, the largest language model in the literature had fewer than 100M parameters in 2017. It grew to over 300M with BERT [6] in 2018, and increased to tens of billions in 2019 with models such as GPT-2 [3], T5 [20], Megatron-LM [28] and Turing-NLG [25]. Today, the largest language model, GPT-3 [2], has a staggering 175B parameters. With this three-orders-of-magnitude growth in model size since 2017, model accuracy continues to improve with model size [12]. Recent studies in fact show that larger models are more resource-efficient to train than smaller ones [12] for a given accuracy target. As a result, we expect the model size to continue growing in the future. However, accessibility to large model training is severely limited by the nature of state-of-the-art system technologies. Those technologies make entry into the large model training space prohibitively expensive.
To be more specific, distributed parallel DL training technologies such as pipeline parallelism [10], model parallelism [28], and ZeRO [21] (Zero Redundancy Optimizer) allow transcending the memory boundaries of single GPU/accelerator device by splitting the model states (parameters, gradients and optimizer states) across multiple GPU devices, enabling massive models that would otherwise simply not fit in a single GPU memory. All record-breaking large models such as GPT-2, Megatron-LM, Turing-NLG, and GPT-3, were trained using a combination of the aforementioned technologies. However, all of these DL parallel technologies require having enough GPU devices such that the aggregated GPU memory can hold the model states required for the training. For example, training a 10B parameter model efficiently requires a DGX-2 equivalent node with 16 NVIDIA V100 cards, which costs over 100K, beyond the reach of many data scientists, and even many academic and industrial institutions. Heterogeneous DL training is a promising approach to reduce GPU memory requirement by exploiting CPU memory. Many efforts have been made in this direction [8, 9, 11, 17, 23, 23, 24, 32–34]. Nearly all of them target CNN based models, where activation memory is the memory bottleneck, and model size is fairly small (less than 500M). However, the primary memory bottleneck for recent attention based large model training are the model states, instead of activation memory. There is an absence in literature studying these workloads for heterogeneous DL training. Additionally, existing efforts on heterogeneous training are further limited in two major ways: i) nearly all of them exploit CPU memory, but not CPU compute, which we show can be used to significantly reduce the CPU-GPU communication overhead, and 2021 USENIX Annual Technical Conference 551 ii) they are mostly designed for and evaluated on single GPU [9, 11, 23, 34], without a clear path to scaling efficiently on multiple GPUs, which is crucial for large model training. Addressing the aforementioned limitation, we attempt to democratize large model training by developing ZeRO-Offload, a novel heterogeneous DL training technology designed specifically for large model training. ZeRO-Offload exploits both CPU memory and compute for offloading, while offering a clear path towards efficiently scaling on multiple GPUs by working with ZeRO-powered data parallelism [21]. Additionally, our first principle analysis shows that ZeRO-Offload provides an optimal and the only optimal solution in maximizing memory saving while minimizing communication overhead and CPU compute overhead for large model training. ZeRO-Offload is designed around three main pillars: i) Efficiency, ii) Scalabilty, and iii) Usability. Efficiency: The offload strategy in ZeRO-Offload is designed with the goal of achieving comparable compute efficiency to the state-of-art non-offload strategies but for significantly larger models. To achieve this goal, we rely on first principle analysis to identify a unique optimal computation and data partitioning strategy between CPU and GPU devices. This strategy is optimal in three key aspects: i) it requires orders-of-magnitude fewer computation on CPU compared to GPU, preventing the CPU compute from becoming a performance bottleneck, ii) it minimizes the communication volume between CPU and GPU preventing communication from becoming a bottleneck, and iii) it provably maximizes memory savings on GPU while achieving minimum communication volume. 
Our analysis shows that to be optimal in the aforementioned regards, we must offload the gradients, optimizer states and optimizer computation to CPU, while keeping the parameters and the forward and backward computation on GPU. This strategy enables a 10x increase in model size, with minimum communication and limited CPU computation, which allows us to train 13B parameters on a single NVIDIA V100 GPU at 40 TFLOPS, compared to 30 TFLOPS on the same GPU with 1.2B parameters, the largest model that can be trained without any CPU offloading.

Offloading optimizer computation requires the CPU to perform O(M) computation compared to O(MB) on GPU, where M and B are the model size and batch size, respectively. In most cases, the batch size is large and CPU computation is not a bottleneck, but for small batch sizes, the CPU compute can be a bottleneck. We address this using two optimizations: i) an efficient CPU optimizer that is up to 6x faster than the state-of-the-art, and ii) one-step delayed parameter update that allows overlapping the CPU optimizer step with GPU compute, while preserving accuracy. Together, they preserve efficiency for ZeRO-Offload even with small batch sizes.

Scalability: Good scalability is crucial to take advantage of multiple GPUs that may be available to some data scientists. In the DL community, data parallelism is generally used as the de facto standard to scale DL training to multiple GPUs [5, 26, 35]. However, it is not designed to work with heterogeneous training and presents scalability challenges because of the replication of data and computation in data parallel training. Data parallel training replicates all the model states such as optimizer states, parameters, and gradients, and it also replicates the optimizer computation on each GPU. Therefore, offloading model states or optimizer computation to CPU in combination with data parallelism results in significant replication of communication and CPU compute: it increases the CPU memory requirement proportionally to the data parallelism degree while limiting throughput scalability due to the increased communication.

To address these challenges, ZeRO-Offload combines its unique optimal offload strategy with ZeRO [21] powered data parallelism instead of traditional data parallelism. The symbiosis allows ZeRO-Offload to maintain a single copy of the optimizer states in CPU memory regardless of the data parallel degree. Furthermore, it keeps the aggregate communication volume between GPU and CPU, as well as the aggregate CPU computation, constant regardless of data parallelism, allowing ZeRO-Offload to effectively utilize the linear increase in CPU compute with the increase in the data parallelism degree. As a result, ZeRO-Offload achieves excellent scalability on up to 128 GPUs.

    # Figure 1 (left): standard training pipeline
    import torch
    ...
    model = BuildModel(config)
    optimizer = Optimizer(model)
    ...
    for batch in batches:
        loss = model(batch)
        loss.backward()
        optimizer.step()

    # Figure 1 (right): the same pipeline with ZeRO-Offload enabled
    import torch
    import deepspeed
    ...
    model = BuildModel(config)
    optimizer = Optimizer(model)
    model = deepspeed.initialize(model, optimizer)
    ...
    for batch in batches:
        loss = model(batch)
        model.backward(loss)
        model.step()

Figure 1: ZeRO-Offload can be enabled with a few lines of change. The code on the left shows a standard training pipeline, while the code on the right shows the same pipeline with ZeRO-Offload enabled.
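As a point of reference for Figure 1, the sketch below shows what the initialization might look like when ZeRO-Offload is enabled through an explicit DeepSpeed configuration. This is our illustration, not code from the paper: the exact keyword arguments and configuration keys (e.g., "cpu_offload" versus the later "offload_optimizer") differ across DeepSpeed versions, so treat these names as assumptions and consult the DeepSpeed documentation for the release you use.

    import deepspeed

    # Illustrative configuration only; key names vary between DeepSpeed releases.
    ds_config = {
        "train_batch_size": 512,
        "fp16": {"enabled": True},      # mixed precision training
        "zero_optimization": {
            "stage": 2,                 # ZeRO-2: partition optimizer states and gradients
            "cpu_offload": True,        # keep optimizer states and updates on the CPU
        },
    }

    # deepspeed.initialize returns an engine wrapping the model; training then
    # uses model.backward(loss) and model.step() as in Figure 1 (right).
    model, optimizer, _, _ = deepspeed.initialize(
        model=model, optimizer=optimizer, config=ds_config)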
In addition to working with ZeRO powered data parallelism, ZeRO-Offload can be combined with model parallelism [27, 28] to achieve higher memory savings, when multiple GPUs are available. Usability: ZeRO-Offload is available as part of an OpenSource PyTorch library, DeepSpeed (www.deepspeed.ai). Unlike most strategies discussed in Section 2, ZeRO-Offload does not require model refactoring to work. In fact, PyTorch users can enable ZeRO-Offload with few lines of code change to their existing training pipeline as shown in Figure 1, allowing to train 10x larger models easily. Contributions. To the best of our knowledge, ZeROOffload is the first fully distributed all-reduced based training framework using CPU memory and computation resources to train large-scale models. We summarize contributions are as follows: USENIX Association • A unique optimal offload strategy for heterogeneous large model training on GPU + CPU system that enables 10x larger model on a single GPU without sacrificing efficiency (Sec. 3 and Sec. 4.1). • Highly scalable multi-GPU design through i) a symbiotic combination of offload strategy with ZeRO powered data parallelism (Sec. 4.2), allowing ZeRO-Offload to achieve near-linear scalability, and ii) seamless integration with model-parallel training [28], enabling even larger models than using ZeRO-Offload or model parallelism alone (Sec. 4.2). • Open-source implementation of ZeRO-Offload in PyTorch. • Extensive evaluation demonstrating i) Model Scale: 10x increase in model size with up to 13B on a single GPU and 4x increase in model size over model parallelism with up to 70B parameters on a DGX-2 node. ii) Efficiency: Over 40 TFlops for a 10B parameters on a single NVIDIA V100, compared to 30 TFLOPS on the same GPU with 1.4B parameters, the largest model that can be trained without any CPU offloading; Outperform two state-of-the-art heterogeneous DL training frameworks by 22% and 37% respectively on a single GPU. iii) Scalability: Near-perfect linear scalability for a 10B parameter model on up to 128 GPUs. iv) CPU overhead reduction with our ADAM implementation with 6x speedup over PyTorch optimizer and up to 1.5X improvement in end-to-end throughput with delayed parameter update optimizations (Sec. 6). 2 Background and Related Work Memory consumption in large model training. The full spectrum of memory consumption during DL model training can be classified into two parts: i) model states and ii) residual states [21]. Model states include parameters, gradients, and optimizer states (such as momentum and variances in Adam [13]); Residual states include activations, temporary buffers, and unusable fragmented memory. Model states are the primary source of memory bottleneck in large model training. We consider the memory consumption due to model states for large transformer models such as Megatron-LM (8 billion) [28], T5 (11 billion) [20], and Turing-NLG [25] (17.2 billion). They are trained with float-16 mixed precision training [16] and Adam optimizer [13]. Mixed precision training often keeps two copies of the parameters, one in float-16 (fp16) and the other in float-32 (fp32). The gradients are stored in fp16. In addition to the parameters and gradients, the Adam optimizer keeps track of the momentum and variance of the gradients. These optimizer states are stored in fp32. 
Therefore, training a model in mixed precision with the Adam optimizer requires at least 2 bytes of memory for each fp16 parameter and gradient, and 4 bytes of memory for each fp32 parameter and for the momentum and variance of each gradient. In total, a model with M parameters requires 16 × M bytes of memory (2 + 2 + 4 + 4 + 4 bytes per parameter). Therefore, the model states for Megatron-LM, T5 and Turing-NLG require 128 GB, 176 GB and 284 GB, respectively, which is clearly beyond the memory capacity of even the current flagship NVIDIA A100 GPU with 80 GB of memory.

A significant amount of work has been done in recent years to enable large model training, which requires more memory than what is available on a single GPU to fit these model and residual states. These efforts can be classified broadly into two categories: i) scale-out training and ii) scale-up training based approaches. We discuss them as follows.

Scale out large model training. Scale-out training uses the aggregate memory of multiple GPUs to satisfy the memory requirement for large model training. Two prominent examples of scale-out training are model parallelism [5, 28] and pipeline parallelism [7, 10], both of which partition the model states and the residual states across multiple GPUs. Model parallelism [5, 28] partitions the model vertically and distributes the model partitions to multiple GPU devices in order to train large models. Pipeline parallelism [7, 10], on the other hand, parallelizes model training by partitioning the model horizontally across layers. Both of these approaches must change the user model to work, which can limit usability.

A recent work, ZeRO [21], provides an alternative to model and pipeline parallelism to train large models. ZeRO splits the training batch across multiple GPUs similar to data parallel training [5, 26, 35], but unlike data parallel training, which replicates all the model states on each GPU, ZeRO partitions them across all GPUs and uses communication collectives to gather individual parameters as needed during the training. ZeRO does not require changes to the user model to work, making it more generic than model or pipeline parallel training. It also offers better compute efficiency and scalability.

Despite the ability of model and pipeline parallelism, and ZeRO, to train large models, they all require multiple GPUs such that the aggregate GPU memory can hold the model and residual states for training large models. In contrast, ZeRO-Offload is designed to fit a larger model by offloading model states to CPU memory and can train a 10x larger model on a single GPU without sacrificing efficiency. When multiple GPUs are available, ZeRO-Offload is designed to work together with ZeRO to offer excellent scalability, or in conjunction with model parallelism to fit even larger model sizes that are not possible with ZeRO-Offload or model parallelism alone.

Scale up large model training. Existing work scales up model size in a single GPU through three major approaches. The first approach trades computation for memory, saving activation memory (residual memory) by recomputing from checkpoints [4]. The second approach uses compression techniques such as low or mixed precision [16] for model training, saving on both model states and activations. The third approach uses external memory such as the CPU memory as an extension of GPU memory to increase memory capacity during training [8, 9, 11, 17, 23, 24, 33]. Our work, ZeRO-Offload, falls under the third approach.
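To make the per-parameter accounting above concrete before looking at how prior work copes with it, the short sketch below (our illustration, not an artifact of the paper) computes the model-state footprint of mixed-precision Adam training from a parameter count:

    def model_state_bytes(num_params):
        """Model-state memory for fp16/fp32 mixed-precision training with Adam:
        2 B fp16 parameter + 2 B fp16 gradient + 4 B fp32 parameter
        + 4 B fp32 momentum + 4 B fp32 variance = 16 B per parameter."""
        return {
            "fp16_params":   2 * num_params,
            "fp16_grads":    2 * num_params,
            "fp32_params":   4 * num_params,
            "fp32_momentum": 4 * num_params,
            "fp32_variance": 4 * num_params,
        }

    total = sum(model_state_bytes(10_000_000_000).values())
    print(f"model states for a 10B-parameter model: {total / 1e9:.0f} GB")  # ~160 GB

For a 10B-parameter model this is roughly 160 GB of model states alone, which is why a single 32 GB or even 80 GB GPU cannot hold such a model without one of the techniques surveyed here.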
Unlike ZeRO-Offload, the existing efforts in this third category only offload data to CPU but not compute, and they target the training of smaller models. Furthermore, none of the above works is communication optimal, leading to extra communication between CPU and GPU and hurting training throughput. In contrast, a recent work called L2L [18] can enable multi-billion parameter training by managing memory usage in GPU layer by layer. In particular, L2L synchronously moves tensors needed in the upcoming layer into GPU memory for computation and keeps the remaining tensors in CPU memory to save memory. In comparison to ZeRO-Offload, it offers limited efficiency due to extra communication overhead, does not offer a way to scale out across devices, and requires model refactoring, making it difficult to use.

ZeRO powered data parallel training. ZeRO-Offload works with ZeRO to scale DL training to multiple GPUs. ZeRO has three stages, ZeRO-1, ZeRO-2 and ZeRO-3, corresponding to the partitioning of the three different model states: optimizer states, gradients and parameters, respectively. ZeRO-1 partitions the optimizer states only, while ZeRO-2 partitions gradients in addition to optimizer states, and ZeRO-3 partitions all model states. ZeRO-Offload works symbiotically with ZeRO-2, and therefore we discuss it further. In ZeRO-2, each GPU stores a replica of all the parameters, but only updates a mutually exclusive portion of them during the parameter update at the end of each training step. As each GPU only updates a portion of the parameters, it only stores the optimizer states and gradients required to make that update. After the update, each GPU sends its portion of the updated parameters to all the other GPUs using an all-gather communication collective. The ZeRO-2 computation and communication schedule is as follows: During the forward pass, each GPU computes the loss with respect to a different mini-batch. During the backward propagation, as each gradient is computed, it is averaged using a reduce operator at the GPU (or GPUs) that owns the gradient or part of the gradient. After the backward pass, each GPU updates its portion of the parameters and optimizer states using the averaged gradients corresponding to that portion. After this, an all-gather is performed to receive the rest of the parameter updates computed on the other GPUs.

3 Unique Optimal Offload Strategy

ZeRO-Offload is designed to enable efficient large model training on a single GPU or multiple GPUs by offloading some of the model states from GPU to CPU memory during training. As discussed in Sec. 2, the model states (parameters, gradients, and optimizer states) are the primary source of memory bottleneck in large model training. By offloading some of these model states to CPU, ZeRO-Offload can enable training of significantly larger models. (ZeRO-Offload only offloads model states; offloading secondary sources of memory bottleneck such as activation memory is beyond the scope of our work. Given that residual states are significantly smaller than model states, we ignore them for the purpose of our analysis. Furthermore, the first and second approaches described in Sec. 2 can be used in conjunction with ZeRO-Offload to reduce activation memory.) However, identifying the optimal offloading strategy is non-trivial. There are numerous ways to offload model states to CPU memory, each with a different trade-off in terms of CPU computation and GPU-CPU communication, both of which can limit the training efficiency.

To identify the optimal offload strategy, ZeRO-Offload models DL training as a data-flow graph and uses first-principle analysis to efficiently partition this graph between CPU and GPU devices.
ZeRO-Offload partitions the graph in a way that is optimal in three key aspects: i) it requires orders-of-magnitude less computation on CPU compared to GPU, which prevents the CPU from becoming a performance bottleneck (Sec. 3.2), ii) it guarantees the minimization of communication volume between CPU and GPU memory (Sec. 3.3), and iii) it provably maximizes the memory savings while achieving minimum communication volume (Sec. 3.4). In fact, ZeRO-Offload can achieve high efficiency during training that is comparable to non-offload training, and it is uniquely optimal, meaning no other solution can offer better memory savings without increasing the communication volume or increasing CPU computation.

In this section, we discuss the derivation of our unique optimal offload strategy. Our strategy is specifically designed for mixed precision training with the Adam optimizer, which is the de facto training recipe for large model training.

3.1 DL Training as a Data-Flow Graph

The DL training workload can be represented as a weighted directed graph of data and computation, as shown in Figure 2, where the circular nodes represent model states (parameter16, gradient16, parameter32, momentum32, variance32), and the rectangular nodes represent computation (forward, backward, param update). The edges in the graph represent the data flow between the nodes, and the weight of an edge is the total data volume in bytes that flows through it during any given training iteration. For a model with M parameters, the weight of an edge in this graph is either 2M, where the source node produces fp16 model states, or 4M, where the source node produces fp32 model states.

Figure 2: The data flow of fully connected neural networks with M parameters. We use activation checkpointing to reduce activation memory and avoid activation migration between CPU and GPU.

An offload strategy between GPU and CPU can be represented as a two-way partitioning of this graph, such that the computation nodes in a partition are executed on the device that owns the partition, and the data nodes in a partition are stored on the device that owns the partition. The total data volume that must be communicated between GPU and CPU is given by the weight of the edges running across the two partitions.

There are numerous ways to partition this graph. In the following sections, we use first principles to simplify the data flow graph and reduce the number of possible choices based on three different efficiency metrics: i) CPU computation overhead, ii) communication overhead, and iii) memory savings.

3.2 Limiting CPU Computation

The CPU computation throughput is multiple orders of magnitude slower than the GPU computation throughput. Therefore, offloading a large computation graph to CPU will severely limit training efficiency. As such, we must avoid offloading compute intensive components to the CPU.
The compute complexity of DL training per iteration is generally given by O(MB), where M is the model size and B is the effective batch size. To prevent CPU computation from becoming a bottleneck, only those computations that have a compute complexity lower than O(MB) should be offloaded to CPU. This means that the forward propagation and backward propagation, both of which have a compute complexity of O(MB), must be done on GPU, while the remaining computations such as norm calculations, weight updates, etc., which have a complexity of O(M), may be offloaded to CPU. Based on this simple observation, we fuse the forward and backward nodes in our data flow graph into a single super-node (FWD-BWD) and assign it to GPU.

3.3 Minimizing Communication Volume

The CPU memory bandwidth is at least an order of magnitude faster than the PCI-E bandwidth between CPU and GPU, while the GPU memory is another order of magnitude faster than even the CPU memory. Therefore, we must minimize the communication volume between CPU and GPU memory to prevent the PCI-E bandwidth from becoming a training performance bottleneck. To do so, we must first identify the theoretical minimum communication volume for a model-state offload strategy.

The minimum communication volume for any model-state offload strategy is 4M. (It is possible to reduce the communication volume further by offloading only partial model states. For simplification, we assume that offloading a model state means offloading it in its entirety; our analysis of the memory savings per communication volume still holds even if we offload partial model states.) Note that after fusing the forward and backward into a single super-node as discussed in Sec. 3.2, each node in our data flow graph is part of a cycle. Therefore, any partitioning of this graph requires cutting at least two edges, each of which has an edge weight of at least 2M, resulting in a total communication of at least 4M.

If we choose to limit the communication volume to this bare minimum, we can greatly simplify our data-flow graph and reduce the number of partitioning strategies to a handful:

Creating the fp32 super-node. Notice that any partitioning strategy that does not co-locate the fp32 model states with their producer and consumer nodes cannot achieve the minimum communication volume of 4M. Such a partition must cut at least one edge with a weight of 4M and another with a weight of at least 2M, resulting in a communication volume of at least 6M. Therefore, to achieve the minimum communication volume, all offload strategies must co-locate the fp32 model states with their producer and consumer operators, i.e., the fp32 model states (momentum32, variance32 and parameter32) must be co-located with the Param Update and the float2half computation. This constraint allows us to treat all the aforementioned fp32 data and compute nodes in the data flow graph as a single super-node that we refer to as Update Super. We show this reduced data flow graph in Figure 2, consisting of only four nodes: the FWD-BWD Super node, the p16 data node, the g16 data node, and the Update Super node.

p16 assignment. To achieve the minimum communication volume, p16 must be co-located with FWD-BWD Super because the edge weight between these two nodes is 4M. Separating these two nodes would increase the communication volume to 6M (i.e., 4M + 2M). Since we have already assigned the FWD-BWD Super node to GPU to limit computation on CPU, p16 must also be assigned to GPU.

3.4 Maximizing Memory Savings

After simplifying the data flow graph to minimize communication volume, only g16 and Update Super remain to be assigned.
Notice that at this point, all partitions will result in minimum communication volume, so we can prune the choices further to maximize the memory savings on GPU. Table 1 shows the memory savings of all valid partitioning strategies that minimize the communication volume. The maximum memory savings of 8x can be achieved by offloading both g16 and Update Super to CPU.

Table 1: Memory savings for offload strategies that minimize communication volume, compared to the baseline.

    FWD-BWD   p16   g16   Update   Memory   Reduction
    gpu       gpu   gpu   gpu      16M      1x (baseline)
    gpu       gpu   cpu   gpu      14M      1.14x
    gpu       gpu   gpu   cpu      4M       4x
    gpu       gpu   cpu   cpu      2M       8x

3.5 A Unique and Optimal Offload Strategy

ZeRO-Offload allocates all the fp32 model states along with the fp16 gradients in CPU memory, and it also computes the parameter updates on CPU. The fp16 parameters are kept on GPU, and the forward and backward computations are also done on GPU. We arrive at this offload strategy by simplifying our data flow graph and eliminating all other partitioning strategies, as they do not limit CPU computation, minimize communication volume, or maximize memory savings. Therefore, ZeRO-Offload is not only optimal in terms of the aforementioned metrics, it is also unique; there can be no other strategy that offers more memory savings than ZeRO-Offload without increasing the compute complexity on the CPU or incurring additional GPU-CPU communication volume.

4 ZeRO-Offload Schedule

In this section, we discuss the concrete computation and communication schedule for implementing ZeRO-Offload on a single GPU system based on our offload strategy. We then show how we extend this schedule to work effectively on multi-GPU systems by combining our offload strategy with ZeRO data parallelism and model parallelism.

4.1 Single GPU Schedule

As discussed in Sec. 3, ZeRO-Offload partitions the data such that the fp16 parameters are stored on GPU while the fp16 gradients and all the optimizer states, such as the fp32 momentum, variance and parameters, are stored on CPU.

During training, we begin by computing the loss via the forward propagation. Since the fp16 parameters are already present on GPU, no CPU communication is required for this part of the computation. During the backward propagation on the loss, the gradients for different parameters are computed at different points in the backward schedule. ZeRO-Offload can transfer these gradients for each parameter individually, or in small groups, to CPU memory immediately after they are computed. Therefore, only a small amount of memory is required to temporarily hold the gradients in GPU memory before they are transferred to CPU memory. Furthermore, each gradient transfer can be overlapped with the backpropagation on the remainder of the backward graph, allowing ZeRO-Offload to hide a significant portion of the communication cost.

Figure 3: ZeRO-Offload training process on a single GPU (computation stream: FWD & BWD on GPU followed by parameter update on CPU; swapping stream: gradient offload GPU→CPU and parameter swap CPU→GPU, repeated at steps i, i+1, ...).
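To illustrate how this gradient offload can be overlapped with backpropagation, the following is a minimal PyTorch-style sketch (ours, not the actual DeepSpeed implementation): a hook copies each gradient into a pinned CPU buffer on a side CUDA stream as soon as it is produced, while the backward pass continues on the producing stream.

    import torch

    copy_stream = torch.cuda.Stream()   # side stream for GPU->CPU gradient transfers
    cpu_grads = {}                      # pinned CPU buffers, one per parameter

    def attach_offload_hooks(model):
        for name, p in model.named_parameters():
            cpu_grads[name] = torch.empty(p.shape, dtype=p.dtype, pin_memory=True)

            def offload(grad, name=name):
                # Fires as soon as this parameter's gradient is computed. The copy is
                # issued on copy_stream, so backprop on the producing stream can keep
                # working on the rest of the graph while the transfer is in flight.
                producer = torch.cuda.current_stream()
                with torch.cuda.stream(copy_stream):
                    copy_stream.wait_stream(producer)           # gradient is ready
                    cpu_grads[name].copy_(grad, non_blocking=True)
                grad.record_stream(copy_stream)                 # keep grad alive until the copy finishes
                return grad

            p.register_hook(offload)

    # After loss.backward(), synchronize copy_stream before running the CPU optimizer step.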
Figure 4: ZeRO-Offload data placement with multiple GPUs (for each data parallel process DP_i: parameters on GPU memory, gradients on GPU memory and CPU memory, and optimizer states on CPU memory).

After the backward propagation, ZeRO-Offload updates the fp32 parameters and the remaining optimizer states (such as momentum and variance) directly on CPU, and copies the updated fp32 parameters from the CPU memory to the fp16 parameters in GPU memory. Figure 3 shows the computation and communication in each step of ZeRO-Offload diagrammatically, and Figure 5 shows the concrete schedule as pseudo-code.

4.2 Scaling to Multi-GPUs

ZeRO-Offload in its entirety is a symbiotic integration of the ZeRO-Offload strategy described in Sec. 3 and the ZeRO-powered data parallelism discussed in Sec. 2, which allows ZeRO-Offload to scale to hundreds of GPUs efficiently. ZeRO-Offload preserves the model state partitioning strategy of ZeRO Stage-2 (optimizer state and gradient partitioning), while offloading the partitioned gradients, optimizer states and the corresponding parameter updates to CPU.

The key benefit of doing this partitioning before offloading is that for systems with more than one GPU, each data parallel process is only responsible for updating a subset of the parameters. The aggregated communication volume from all the data parallel GPUs to CPU remains constant, and CPU resources are used in parallel to jointly compute a single weight update. As a result, the total CPU update time decreases with increased data parallelism, since the CPU compute resources increase linearly with the increase in the number of compute nodes. This allows ZeRO-Offload to achieve very good scalability, as the overhead of communication across GPUs is offset by the reduction in the CPU optimizer step.

ZeRO-Offload partitions gradients and optimizer states among the different GPUs, and each GPU offloads the partition it owns to CPU memory and keeps it there for the entire training. During the backward propagation, gradients are computed and averaged using reduce-scatter on the GPU, and each GPU only offloads the averaged gradients belonging to its partition to CPU memory. Once the gradients are available on the CPU, the optimizer state partitions are updated in parallel by each data parallel process directly on the CPU. After the update, the parameter partitions are moved back to GPU, followed by an all-gather operation on the GPU, similar to ZeRO-2, to gather all the parameters. Figure 4 shows the data placement of model parameters, gradients and optimizer states for ZeRO-Offload, and the details of the ZeRO-Offload data parallel schedule are presented in Figure 5. The all-gather operation described above is shown as a sequence of broadcast operations in the figure.

Model Parallel training. ZeRO-Offload can also work together with tensor-slicing based model parallelism (MP) frameworks such as Megatron-LM [28]. It does so by offloading the gradients, optimizer states and the optimizer computation corresponding to each MP process, allowing ZeRO-Offload to train significantly larger models than is possible using model parallelism alone. Sec. 6 provides more details.

5 Optimized CPU Execution

We speed up the CPU execution time for the parameter updates with two optimizations. First, we implement a fast CPU Adam optimizer using high-performance computing techniques, offering significant speedup over the state-of-the-art PyTorch implementation.
Second, we develop a one-step delayed parameter update schedule that overlaps the CPU parameter update computation with the forward and backward computation on the GPU, hiding the CPU execution time when enabled. 5.1 Implementing the CPU Optimizer We use three levels of parallelism for improving the performance of the CPU optimizer. 1) SIMD vector instruction [15] for fully exploiting the hardware parallelism supported on CPU architectures. 2) Loop unrolling [31], an effective technique for increasing instruction level parallelism that is crucial for better memory bandwidth utilization. 3) OMP multithreading for effective utilization of multiple cores and threads on the CPU in parallel. Using these technique, we present a significantly faster implementation of Adam optimizer compared to state-of-art PyTorch implementation. Mixed Precision Training with Adam ADAM is an optimization algorithm used for deep-learning training, which takes the loss gradients together with their first and second momentums to update the parameters. Therefore, in addition to the model parameters, ADAM requires two more matrices of the same size (M) saved during the training. In the mixed precision training mode, there are two versions of the parameters stored in memory: one in fp16 (parameter16) used for computing the activations in the forward pass (on GPU), and one master copy in fp32 (parameter32) which is updated by the optimizer (on CPU). The p16 is updated with the parameter32 through f loat2hal f casting, at each training step. Moreover, the momentum and variance of the gradients are saved in fp32 (on CPU), to prevent the precision loss for updating the parameters. Please refer to [13] for further detail on ADAM’s algorithm. USENIX Association 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 for_parallel rank in range(world_size): initialize_layers() for batch in dataset: x = forward(batch) compute_loss(x,batch).backward() backward(x.grad) step() def _is_owner(i): return True if rank owns i else False def initialize_layers(): for i in range(num_layers): l = layers[i] allocate_on_gpu l.param_fp16 if _is_owner(i): allocate_on_cpu l.param_fp32 allocate_on_cpu l.optim_states_fp32 allocate_on_cpu l.cpu_grad def forward(x): for i in range(num_layers): x = layers[i].forward(x) return x def backward(dx): for i in range(num_layers, 0, -1): dx=layers[i].backward(dx) reduce(layers[i].grad, dest_rank = _owner_rank(i)) if _is_owner(i) l.cpu_grad.copy(l.grad) else pass del layers[i].grad def step(): for i in range(num_layers): l=layers[i] if _is_owner(i): update_in_cpu(l.optim_states_fp32, l.cpu_grad, l.param_fp32) l.param_fp16.copy(l.param_fp32) BROADCAST(l.param_fp16, src=_owner_rank(i)) Figure 5: Code representing ZeRO-Offload that combines unique optimal CPU offload strategy with ZeRO-powered data parallelism. Optimized Implementation Algorithm 1 elaborates the ADAM’s implementation detail using SIMD operations. As shown, the Adam function receives the optimizer parameters such as β1 , β2 , and α, and the gradient, momentum, variance and master copy of parameters (parameter32) as the input. We also use some parameters specific to the implementation, like the simd_width and unroll_width. The Adam optimizer sends back the updated variance, momentum, and parameter in both fp16 (to GPU) and fp32 (to CPU) . We firstly read the data, including parameter, gradient, momentum and variance, into the vector registers (line 7). 
Then, we use several fused multiply-add (FMA) vector operations to perform the main execution pipeline, which is repeated by the unrolling width. Note that the rest of the operations, such as multiply, division, and sqrt, also run in vector mode. For the best performance, we use the AVX512 SIMD instruction set and an unroll_width of 8, based on auto-tuning results.

Algorithm 2 CPU-ADAM Optimizer
Input: p32, g32, m32, v32, β1, β2, α, step, eps
Output: p16, p32, m32, v32
Parameter: tile_width, simd_width, unroll_width
 1: biascorrection1 ← −α / (1 − β1^step)
 2: biascorrection2 ← 1 / sqrt(1 − β2^step)
 3: simd_count ← sizeof(p32) / simd_width
 4: unroll omp parallel
 5: for i in 1 to (simd_count / unroll_width) do
 6:   ...
 7:   gv, pv, mv, vv = g32[i], p32[i], m32[i], v32[i]
 8:   mv = FMA(gv, (1 − β1), β1 * mv)
 9:   vv = FMA(gv * gv, (1 − β2), β2 * vv)
10:   gv = FMA(sqrt(vv), biascorrection2, eps)
11:   gv = mv / gv
12:   pv = FMA(gv, biascorrection1, pv)
13:   p32[i], m32[i], v32[i] = pv, mv, vv
14:   ...
15:   if (i == tile_width) copy_to_gpu(p16, p32)
16: end for

In addition to the CPU-Adam optimizer, we implement the CPU-to-GPU fp16 parameter copy in a tiled manner (line 15). We overlap the CPU and GPU execution by parallelizing the Adam computation and copying the parameters over to GPU. As we process the Adam computation of the current tile of data on the CPU, we write the parameters back to GPU for the previously processed tile. This way, we reduce the idle time of the GPU before it starts processing the next training step.

5.2 One-Step Delayed Parameter Update

Despite using a highly optimized CPU optimizer, the CPU computation overhead can become a bottleneck during training with very small batch sizes, when the GPU computation time is not much larger than the CPU compute. For such limited cases, we develop one-step delayed parameter update (DPU), which overlaps CPU and GPU compute to hide the CPU computation overhead by delaying the parameter update by a single step. We verify that DPU does not impact the final accuracy of training in the evaluation.

DPU training schedule. Figure 6 shows the workflow of the ZeRO-Offload training process with delayed parameter update. (1) The first N − 1 steps are trained without DPU to avoid destabilizing the training during the early stages, where gradients change rapidly. (2) On step N, we obtain the gradients from the GPU, but we skip the CPU optimizer step and do not update the fp16 parameters on the GPU either. (3) At step N + 1, we compute the parameter updates on the CPU using the gradients from step N, while computing the forward and backward pass on the GPU in parallel using the parameters updated at step N − 1. From this step onwards, the model at the (i + 1)-th step is trained using the parameters updated with gradients from the (i − 1)-th step instead of the parameters updated at the i-th step, overlapping CPU compute with GPU compute.
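The schedule above can be sketched as follows (our illustration; helper names such as offload_gradients, cpu_optimizer_step, and swap_updated_fp16_params_to_gpu are hypothetical stand-ins for the mechanisms described in this section, and a real implementation overlaps the work through the DeepSpeed engine and CUDA streams rather than a Python thread pool):

    from concurrent.futures import ThreadPoolExecutor

    cpu_worker = ThreadPoolExecutor(max_workers=1)

    def train_with_dpu(num_steps, warmup_steps=40):
        pending_grads = None                       # gradients held back by one step
        for step in range(num_steps):
            update = None
            if step > warmup_steps and pending_grads is not None:
                # Start the CPU optimizer step with LAST step's gradients; it runs
                # concurrently with this step's forward/backward on the GPU.
                update = cpu_worker.submit(cpu_optimizer_step, pending_grads)

            loss = forward(next_batch())           # GPU compute for step i uses parameters
            backward(loss)                         # produced from step i-2 gradients
            grads = offload_gradients()            # GPU -> CPU gradient transfer

            if step <= warmup_steps:
                cpu_optimizer_step(grads)          # early steps: normal synchronous update
                swap_updated_fp16_params_to_gpu()
            else:
                if update is not None:
                    update.result()                # wait for the delayed update to finish
                    swap_updated_fp16_params_to_gpu()
                pending_grads = grads              # applied during the next step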
Accuracy trade-off. Since DPU changes the semantics of the training, it is reasonable to ask if there is a trade-off between model accuracy and training efficiency. To answer this question, we evaluated DPU on multiple training workloads and found that DPU does not hurt convergence if we introduce DPU after a few dozen iterations instead of introducing it at the beginning. Our evaluation results in Sec. 6 show that, compared with training with ZeRO-Offload only, training with delayed parameter update achieves the same model training accuracy with higher training throughput.

Figure 6: Delayed parameter update during the training process (from step N + 1 onwards, the CPU optimizer step and parameter swap for the previous step's gradients overlap with the GPU forward and backward of the current step).

6 Evaluation

This section seeks to answer the following questions, in comparison to the state-of-the-art: (i) How does ZeRO-Offload scale the trainable model size compared to existing multi-billion parameter training solutions on a single GPU/DGX-2 node? (ii) What is the training throughput of ZeRO-Offload on a single GPU/DGX-2 node? (iii) How does the throughput of ZeRO-Offload scale on up to 128 GPUs? (iv) What is the impact of our CPU-Adam and delayed parameter update (DPU) on improving throughput, and does DPU change model convergence?

6.1 Evaluation Methodology

Testbed. For the evaluation of model scale and throughput, we conduct our experiments on a single DGX-2 node, whose details are shown in Table 2. For the evaluation of throughput scalability, we conduct experiments on 8 NVIDIA DGX-2 nodes connected together with InfiniBand using a 648-port Mellanox MLNX-OS CS7500 switch.

Table 2: Hardware overview of the experimental system (DGX-2 node).

    GPU          16 NVIDIA Tesla V100 Tensor Core GPUs
    GPU Memory   32 GB HBM2 on each GPU
    CPU          2 Intel Xeon Platinum 8168 Processors
    CPU Memory   1.5 TB 2666 MHz DDR4
    CPU cache    L1, L2, and L3 are 32K, 1M, and 33M, respectively
    PCIe         bidirectional 32 GBps

Workloads. For the performance evaluation, we focus on evaluating GPT-2 [19] like Transformer based models [30]. We vary the hidden dimension and the number of Transformer blocks to obtain models with different numbers of parameters. Note that scaling the depth alone is often not sufficient because it would make training more difficult [12]. Table 3 shows the configuration parameters used in our experiments.

Table 3: Model configuration in evaluation.

    # params              batch size per GPU   MP setting in ZeRO-Offload   # layers      hidden size
    1, 2 billion          32                   1                            20, 40        2048
    4 billion             32                   1                            64            2304
    6, 8 billion          16                   1                            53, 72        3072
    10, 11 billion        10, 8                1                            50, 55        4096
    12, 13 billion        4                    1                            60, 65        4096
    15 billion            8                    2                            78            4096
    20, 40, 60 billion    8                    2                            25, 50, 75    8192
    70 billion            8                    8                            69            9216

For convergence analysis, such as for the delayed parameter update, we use GPT-2 [19] and BERT [6], both of which are commonly used as pre-trained language models and have demonstrated superior performance in many NLP tasks (e.g., natural language understanding and inference) over recurrent neural networks or convolutional neural networks. We use BERT-large, the same as the one from [6], which has 24 layers, a hidden size of 1024, 16 attention heads, and 336M parameters. Similar to [21, 28], we fine-tune BERT on the Stanford Question Answering Dataset (SQuAD) [1], which is one of the most widely used reading comprehension benchmarks [22]. Unless otherwise stated, we follow the same training procedure and hyperparameter settings as in [6, 19].

Baseline.
We compare the effectiveness of ZeRO-Offload with state-of-the-art multi-billion parameter training solutions:

• PyTorch DDP: This is the existing PyTorch Transformer implementation using DistributedDataParallel [14].
• Megatron [28]: One of the current state-of-the-art multi-billion parameter model training solutions, which employs model parallelism to train up to 8.3B parameter models using 512 GPUs.
• SwapAdvisor [9]: SwapAdvisor explores a genetic algorithm to guide model-agnostic tensor swapping between GPU and CPU memory for GPU memory saving.
• L2L [18]: L2L enables training of deep Transformer networks by keeping one Transformer block at a time in GPU memory and only moving tensors in the upcoming Transformer block into GPU memory when needed.
• ZeRO-2 [21]: ZeRO extends data parallelism by eliminating memory redundancies across multiple GPUs, allowing models of up to 170B parameters to be trained with high training throughput using 25 DGX-2 nodes. ZeRO-2 achieves the SOTA results for large model training and is a strong baseline.

6.2 Experimental Results

6.2.1 Model Scale

As an important step toward democratizing large model training, in this part we first test the largest trainable models on a single GPU as well as 16 GPUs in a single DGX-2 node.

Single GPU. The largest model that can be trained using PyTorch DDP on a single GPU with 32 GB memory is 1.4B parameters before running out of memory, as shown in Figure 7. Both Megatron and ZeRO-2 do not increase the trainable model size on a single GPU in comparison to PyTorch, because they both utilize the aggregated GPU memory to fit larger models. In contrast, ZeRO-Offload enables 13B model training on a single GPU, which is more than 9X larger than using PyTorch, Megatron, and ZeRO-2. This is mainly because of ZeRO-Offload's strategy for maximizing the memory savings on GPU by offloading expensive states such as optimizer states and the majority of gradients to CPU memory. The largest model that can be trained with SwapAdvisor on a single GPU is 8B, which is 38% smaller than the model that can be trained with ZeRO-Offload. SwapAdvisor relies on a black-box approach and uses a simulator to predict which tensors are more frequently used in order to keep them in GPU memory to maximize training throughput. The prediction cannot be fully accurate, and therefore SwapAdvisor keeps more tensors in GPU memory than ZeRO-Offload does. On the other hand, L2L is able to train even larger models (e.g., 17B) on a single GPU by frequently moving weights from unused layers to CPU memory. However, the largest model size does not increase when training L2L with multiple GPUs, which is discussed next.

Multi-GPU in single DGX-2. We further perform model scale tests with 4 and 16 GPUs in a single DGX-2 node, respectively. As shown in Figure 7, the maximum trainable model size stays the same for PyTorch, L2L and SwapAdvisor, because all of them do not handle memory redundancies in data parallelism. As a result, their scalability is bounded by the model scale on a single GPU. Both Megatron and ZeRO-2 support large model training with more GPUs, but they cannot scale efficiently beyond 15B parameters, even with 16 GPUs. Megatron supports larger models than ZeRO-2, because ZeRO-2 still incurs memory redundancies on model weights. On the other hand, ZeRO-Offload easily enables training of up to 70B parameter models by partitioning and offloading optimizer states and gradients to CPU memory combined with model parallelism.
Overall, ZeRO-Offload increases the model scale on a single DGX-2 node by 50X, 4.5X, 7.8X, and 4.2X than using PyTorch, Megatron, ZeRO-2, and L2L, respectively. 6.2.2 Training Throughput Single GPU. Next, we compare the training throughput of SwapAdvisor, L2L and ZeRO-Offload, for models with billion-scale parameters, on a single GPU. We do not include Megatron and ZeRO-2 in this comparison, because both of them cannot train models bigger than 1.4B parameters due to OOM. We evaluate SwapAdvisor, L2L and ZeRO-Offload with the same training batch size (e.g., 512) and same microbatch sizes (shown in table 3), with gradient accumulation enabled. We also disable delayed parameter update in this experiment so that the comparison is only from the system efficiency perspective. We evaluate the performance improvement and its impact on the convergence of delayed parameter update in Section 6.2.4. Figure 8 shows that ZeRO-Offload outperforms SwapAdvisor by 23% (up to 37%) in training throughput. SwapAdvisor relies online genetic algorithm to make tensor swapping decision, which takes hours to find an optimal tensor swapping solution in terms of maximizing the overlapping of computation and tensor swapping. Before getting the optimal tensor swapping solution, SwapAdvisor tries random tensor swapping solutions and hurts training performance. Figure 8 shows that ZeRO-Offload outperforms L2L by 14% on average (up to 22%) in throughput (TFLOPS). The performance benefit of ZeRO-Offload comes from the following two aspects. First, ZeRO-Offload has a lower communication cost between CPU and GPU than L2L. For a model with M parameters, L2L requires 28M data communication volume between GPU and CPU, which is a sum of the weights, gradients, and optimizer states of each layer of the model. As analyzed in Sec. 4.1, the communication volume between CPU and GPU memory in ZeRO-Offload is 4M, which is 7x smaller than L2L. The reduced communication volume significantly mitigates the bottleneck from CPU-GPU communication. Second, compared with L2L, the parameter update of ZeRO-Offload happens on CPU instead of GPU, but our optimized CPU-Adam implementation achieves a quite comparable parameter update performance than the PyTorch Adam implementation on GPU (evaluated in Sec. 6.2.4). Therefore, although the optimizer update on GPU in L2L is slightly faster than the optimizer update on CPU in ZeRO-Offload, the communication overhead introduced by L2L leads to an overall slower throughput than ZeRO-Offload. Multi-GPU in single DGX-2. Next, we compare the training throughput of PyTorch, ZeRO-2, Megatron, ZeRO-Offload without model parallelism (w/o MP), and ZeRO-Offload with model parallelism (w/ MP) in one DGX-2 node. When using MP, we use a MP degree that gives the best performance 560 2021 USENIX Annual Technical Conference for both baseline and ZeRO-Offload. We use a total batch size of 512 for all the experiments using a combination of micro-batch per GPU and gradient accumulation. To get the best performance for each configuration, we use the largest micro batch that it can support without OOM. We exclude L2L [29] in this test because its implementation does not support multi-GPU training. Figure 10 shows the throughput per GPU results when training on multiple GPUs. We make the following observations: • For 1B to 15B models, ZeRO-Offload achieves the highest throughput and has up to 1.33X, 1.11X, 1.64X higher speeds than PyTorch, ZeRO-2, and Megatron, respectively. 
By offloading all the optimizer states to CPU with low overhead, ZeRO-Offload can train with larger micro-batch sizes giving higher throughput. • ZeRO-2 runs out of memory once the model size is beyond 8B due to lack of enough aggregated GPU memory to store the model states on 16 GPUs. Instead, ZeROOffload scales to 13B, without model parallelism because it offloads optimizer states and the majority of gradients to CPU memory. • When combined with model parallelism, ZeRO-Offload enables training up to 70B parameter models with more than 30 TFLOPS throughput per GPU. In contrast, Megatron supports only up to 15B parameter models before running out of memory, using just model parallelism. • Compared ZeRO-Offload with ZeRO-2 and Megatron, ZeRO-Offload outperforms ZeRO-2 and Megatron in throughput for 1–8B and 1–13B parameter models, respectively. ZeRO-Offload is faster than Megatron, because it eliminates frequent communication between different GPUs and can train with larger micro batch sizes. ZeRO-Offload outperforms ZeRO-2 also due to larger micro batch sizes. 6.2.3 Throughput Scalability We compare the throughput scalability of ZeRO-2 and ZeROOffload3 on up to 128 GPUs in Figure 11 and make the following key observations: First, ZeRO-Offload achieves near perfect linear speedup in terms of aggregated throughput (green line) running at over 30 TFlops per GPU (blue bars). Second, from 1 to 16 GPUs, while ZeRO-2 runs out of memory, ZeRO-Offload can effectively train the model, turning the model training from infeasible to feasible. Third, with 32 GPUs, ZeRO-Offload slightly outperforms ZeRO-2 in throughput. The improvement comes from additional memory savings on GPU from ZeRO-Offload, which allows training the model with larger batch sizes that lead to increased GPU computation efficiency. Fourth, with more GPUs (such as 64 3 We do not include comparison against Megatron because it consistently performs worse than ZeRO-Offload, as shown in Figure 10. Given the communication overhead added by model parallelism, scaling out Megatron training can not achieve higher throughput than ZeRO-Offload even with linear scalability. USENIX Association Figure 7: The size of the biggest model that Figure 8: The training throughput with Py- Figure 9: The training throughput is comcan be trained on single GPU, 4 and 16 GPUs Torch, L2L, SwapAdvisor and ZeRO-Offload pared for w/o DPU and w/ DPU to GPT-2. (one DGX-2 node). on a single GPU with a batch size of 512. Batch size is set to 8. for the case with 1B parameters. The CPU-Adam optimizer achieves high speedups by exploiting the instruction-level parallelism, thread-level parallelism, and the tile-based data copy scheme (as shown in line 15 of Algorithm 1). Meanwhile, although CPU-Adam has a slower speed than the PyTorch Adam implementation on GPU (PT-GPU), the performance gap is not very huge, and the CPU computation is not a bottleneck of the training throughout. Figure 10: Training throughput with PyTorch, ZeRO-2, MegatronLM, ZeRO-Offload without model parallelism and ZeRO-Offload with model parallelism. Figure 11: Comparison of training throughput between ZeROOffload and ZeRO-2 using 1–128 GPUs for a 10B parameter GPT2. and 128), ZeRO-2 starts to outperform ZeRO-Offload, because both can now run similar batch sizes, achieving similar computation efficiency, whereas ZeRO-2 does not suffer from the additional overhead of CPU-GPU communication. 
In summary, ZeRO-Offload complements ZeRO-2 and enables large model training from a single device to thousands of devices with good computation efficiency. 6.2.4 Optimized CPU Execution A. CPU-Adam efficiency. In this part, we evaluate our Adam implementation against the PyTorch Adam on CPU. Table 4 shows the optimizer execution time of the two implementations for model parameters from 1 to 10 billion. Compared to PyTorch (PT-CPU), CPU-Adam reduces the execution time by over 5X for all the configurations and 6.4X USENIX Association B. One-step Delayed parameter update (DPU). Figure 9 shows the comparison of the training throughput of GPT-2 with and without DPU. As shown, with DPU enabled, the training achieves 1.12–1.59, updated times higher throughput than without it, for a wide range of model sizes for a small micro batch size of 8. This is expected because DPU allows the optimizer updates to overlap with the next forward computation such that the GPU does not have to be slowed down by the CPU computation and CPU-GPU communication. But, what about accuracy? Convergence impact We study the convergence impact of DPU on both GPT-2 and BERT. Figure 12 shows the pre-training loss curves over 100K training iterations using PyTorch (unmodified GPT-2), and Figure 13 shows the loss curves of fine-tuning Bert-large model on SQuAD using ZeRO-Offload without DPU, and ZeRO-Offload with DPU. In both cases, DPU is enabled after 40 iterations allowing the training to stabilize in its early stage before introducing DPU. We observe that the training curves of the unmodified GPT2 and ZeRO-Offload w/o DPU are exactly overlapped, because ZeRO-Offload w/o DPU performs only system optimizations and does not alter training dynamics. On the other hand, the training curve from ZeRO-Offload with DPU converges slightly slower at the very beginning of the training (e.g., barely can be seen at 2K-5K iterations) and quickly catches up after 5K iterations. For the remaining of the training, the training loss matches the original training until the model converges. For Bert-Large fine-uning, we can see that although the training losses are not exactly the same, they converge in the same trend and are largely overlapped. Without changing any hyperparameters, ZeRO-Offload + DPU achieves the same 2021 USENIX Annual Technical Conference 561 Table 4: Adam latency (s) for PyTorch (PT) and CPU-Adam. #Parameter CPU-Adam PT-CPU PT-GPU (L2L) 1 billion 0.22 1.39 0.10 2 billion 0.51 2.75 0.26 4 billion 1.03 5.71 0.64 8 billion 2.41 11.93 0.87 10 billion 2.57 14.76 1.00 final F1 score (92.8) as the baseline. From these results on both GPT-2 pretraining, and Bert-Large fine-tuning, we empirically verify that DPU is an effective technique to improve the training throughput of ZeRO-Offload without hurting model convergence and accuracy.The 1-step staleness introduced by DPU is well tolerated by the iterative training process once the model has passed the initial training phase. Figure 12: The training loss Figure 13: The fine-tuning loss curve of unmodified GPT-2, curve of BERT, ZeRO-Offload ZeRO-Offload w/o DPU and w/o DPU and ZeRO-Offload with DPU. ZeRO-Offload with DPU. 6.2.5 to ZeRO-Offload in Figure 14 are mainly coming from tensor migration overhead between GPU and CPU memory. (3) ZeRO-Offload further introduces one-step delayed parameter update, which overlaps computation on CPU with computation on GPU and improves performance by 7% compared with using ZeRO-Offload without DPU. 
But what about accuracy?

Convergence impact. We study the convergence impact of DPU on both GPT-2 and BERT. Figure 12 shows the pre-training loss curves of GPT-2 over 100K training iterations with unmodified PyTorch, ZeRO-Offload without DPU, and ZeRO-Offload with DPU, and Figure 13 shows the loss curves of fine-tuning the BERT-large model on SQuAD using ZeRO-Offload without DPU and ZeRO-Offload with DPU. In both cases, DPU is enabled after 40 iterations, allowing the training to stabilize in its early stage before introducing DPU.

We observe that the training curves of the unmodified GPT-2 and ZeRO-Offload w/o DPU overlap exactly, because ZeRO-Offload w/o DPU performs only system optimizations and does not alter the training dynamics. On the other hand, the training curve of ZeRO-Offload with DPU converges slightly more slowly at the very beginning of training (barely visible at 2K–5K iterations) and quickly catches up after 5K iterations. For the remainder of training, the training loss matches that of the original run until the model converges.

For BERT-large fine-tuning, we can see that although the training losses are not exactly the same, they converge with the same trend and largely overlap. Without changing any hyperparameters, ZeRO-Offload + DPU achieves the same final F1 score (92.8) as the baseline. From these results on both GPT-2 pre-training and BERT-large fine-tuning, we empirically verify that DPU is an effective technique for improving the training throughput of ZeRO-Offload without hurting model convergence or accuracy. The 1-step staleness introduced by DPU is well tolerated by the iterative training process once the model has passed the initial training phase.

Figure 12: The training loss curve of unmodified GPT-2, ZeRO-Offload w/o DPU, and ZeRO-Offload with DPU.
Figure 13: The fine-tuning loss curve of BERT with ZeRO-Offload w/o DPU and ZeRO-Offload with DPU.

6.2.5 Performance Breakdown and Analysis

To better understand the performance benefit of the offload strategies and optimization techniques in ZeRO-Offload, we evaluate the training throughput of PyTorch, ZeRO-Offload with PT-CPU, ZeRO-Offload with CPU-Adam (referred to as ZeRO-Offload), and ZeRO-Offload with DPU. We perform the evaluation with various batch sizes for the 1-billion-parameter GPT-2 model on a single GPU. Figure 14 shows the result.

From batch size 1 to 8, PyTorch outperforms ZeRO-Offload with PT-CPU by 16% on average. This is because when the model fits in GPU memory, PyTorch does not incur any communication overhead; meanwhile, PyTorch adopts the PyTorch GPU Adam (PT-GPU) for optimizer computation on GPU.

To reduce the performance loss caused by communication and by optimizer computation on CPU, ZeRO-Offload optimizes execution on CPU. (1) By optimizing the CPU optimizer, ZeRO-Offload implements CPU-Adam and improves performance by up to 10% compared with using the offload strategy alone (i.e., ZeRO-Offload with PT-CPU). (2) PyTorch outperforms ZeRO-Offload by 8% on average when the model fits in GPU memory. As shown in Table 4, the performance gap between CPU-Adam and PT-GPU is not very large; therefore, the performance degradation from PyTorch to ZeRO-Offload in Figure 14 comes mainly from the tensor migration overhead between GPU and CPU memory. (3) ZeRO-Offload further introduces one-step delayed parameter update, which overlaps computation on CPU with computation on GPU and improves performance by 7% compared with ZeRO-Offload without DPU.

In summary, leveraging optimized CPU execution, ZeRO-Offload delivers performance similar to PyTorch when both train with the same batch size on GPU. As the batch size increases, training with PyTorch runs out of GPU memory, while the training throughput of ZeRO-Offload keeps increasing with larger batch sizes. With its unique optimal offload strategy, ZeRO-Offload outperforms PyTorch by 39% in the maximum training throughput achievable on a single GPU for the 1-billion-parameter model.

Figure 14: Comparison of training throughput when enabling offload strategies and optimization techniques step by step in ZeRO-Offload.

7 Conclusions

We presented ZeRO-Offload, a powerful GPU-CPU hybrid DL training technology with high compute efficiency and near-linear throughput scalability that allows data scientists to train multi-billion-parameter models even on a single GPU, without requiring any model refactoring. We open-sourced ZeRO-Offload as part of the DeepSpeed library (www.deepspeed.ai) with the hope of democratizing large model training, allowing data scientists everywhere to harness the potential of truly massive DL models.
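As a concrete illustration of the ease-of-use claim, the sketch below shows roughly how ZeRO-Offload can be switched on through a DeepSpeed configuration. The key names follow the public DeepSpeed documentation and have evolved across releases, so treat both the configuration fields and the commented initialize call as assumptions to verify against the version you install, not as a definitive recipe from this paper.

```python
# Illustrative DeepSpeed configuration enabling ZeRO-Offload. Key names follow
# the public DeepSpeed documentation (deepspeed.ai) and may differ by release.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

# Typical usage (sketch): the model code itself is unchanged.
# import deepspeed
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```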
Acknowledgments

We thank the anonymous reviewers for their constructive comments. We thank our shepherd, Mark Silberstein, for his valuable feedback. This work was partially supported by U.S. National Science Foundation (CCF-1718194, CCF-1553645 and OAC-2104116) and the Chameleon Cloud.

References

[1] The Stanford Question Answering Dataset (SQuAD) leaderboard. https://rajpurkar.github.io/SQuAD-explorer/.

[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.

[3] Yu Cao, Wei Bi, Meng Fang, and Dacheng Tao. Pretrained language models for dialogue generation with multiple input sources, 2020.

[4] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.

[5] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and Andrew Y. Ng. Large scale distributed deep networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1223–1231. Curran Associates, Inc., 2012.

[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[7] Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. PipeDream: Fast and efficient pipeline parallel DNN training, 2018.

[8] Mark Hildebrand, Jawad Khan, Sanjeev Trika, Jason Lowe-Power, and Venkatesh Akella. AutoTM: Automatic tensor movement in heterogeneous memory systems using integer linear programming. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '20, 2020.

[9] Chien-Chin Huang, Gu Jin, and Jinyang Li. SwapAdvisor: Pushing deep learning beyond the GPU memory limit via smart swapping. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '20, pages 1341–1355, New York, NY, USA, 2020. Association for Computing Machinery.

[10] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. GPipe: Efficient training of giant neural networks using pipeline parallelism, 2018.

[11] Hai Jin, Bo Liu, Wenbin Jiang, Yang Ma, Xuanhua Shi, Bingsheng He, and Shaofeng Zhao. Layer-centric memory reuse and data migration for extreme-scale deep learning on many-core architectures. ACM Trans. Archit. Code Optim., 15(3), September 2018.

[12] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.

[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.

[14] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. PyTorch distributed: Experiences on accelerating data parallel training. Proc. VLDB Endow., 13(12):3005–3018, 2020.

[15] Gaurav Mitra, Beau Johnston, Alistair Rendell, Eric McCreath, and Jun Zhou. Use of SIMD vector operations to accelerate application code performance on low-powered ARM and Intel platforms. pages 1107–1116, May 2013.

[16] NVIDIA. Automatic Mixed Precision for Deep Learning. https://developer.nvidia.com/automatic-mixed-precision, 2019.

[17] Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, and Xuehai Qian. Capuchin: Tensor-based GPU memory management for deep learning. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '20, pages 891–905, New York, NY, USA, 2020. Association for Computing Machinery.
[18] Bharadwaj Pudipeddi, Maral Mesmakhosroshahi, Jinwen Xi, and Sujeeth Bharadwaj. Training large neural networks with constant memory using a new execution algorithm. June 2020.

[19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[20] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2020.

[21] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2020.

[22] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Jian Su, Xavier Carreras, and Kevin Duh, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2383–2392. The Association for Computational Linguistics, 2016.

[23] Jie Ren, Jiaolin Luo, Kai Wu, Minjia Zhang, Hyeran Jeon, and Dong Li. Sentinel: Efficient tensor migration and allocation on heterogeneous memory systems for deep learning. In International Symposium on High Performance Computer Architecture (HPCA), 2020.

[24] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-49, 2016.

[25] Corby Rosset. Turing-NLG: A 17-billion-parameter language model by Microsoft, 2020.

[26] Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20, 2019.

[27] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake A. Hechtman. Mesh-TensorFlow: Deep learning for supercomputers. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 10435–10444, 2018.

[28] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019.

[29] Roman Tezikov. PyTorch implementation of L2L execution algorithm. https://github.com/TezRomacH/layer-to-layer-pytorch, 2020.

[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017.

[31] G. Velkoski, M. Gusev, and S. Ristov. The performance impact analysis of loop unrolling.
In 2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pages 307–312, 2014.

[32] Oreste Villa, Mark Stephenson, David Nellans, and Stephen Keckler. Buddy Compression: Enabling larger memory for deep learning and HPC workloads on GPUs. In International Symposium on Computer Architecture, 2020.

[33] Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. SuperNeurons: Dynamic GPU memory management for training deep neural networks. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '18, pages 41–53, New York, NY, USA, 2018. Association for Computing Machinery.

[34] Junzhe Zhang, Sai-Ho Yeung, Yao Shu, Bingsheng He, and Wei Wang. Efficient memory management for GPU-based deep learning systems. CoRR, abs/1903.06631, 2019.

[35] M. A. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. In International Conference on Neural Information Processing Systems, 2010.