DRACO: Byzantine-resilient Distributed Training via Redundant Gradients

Lingjiao Chen, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos
University of Wisconsin-Madison

Abstract

Distributed model training is vulnerable to byzantine system failures and adversarial compute nodes, i.e., nodes that use malicious updates to corrupt the global model stored at a parameter server (PS). To guarantee some form of robustness, recent work suggests using variants of the geometric median as an aggregation rule, in place of gradient averaging. Unfortunately, median-based rules can incur a prohibitive computational overhead in large-scale settings, and their convergence guarantees often require strong assumptions. In this work, we present DRACO, a scalable framework for robust distributed training that uses ideas from coding theory. In DRACO, each compute node evaluates redundant gradients that are used by the parameter server to eliminate the effects of adversarial updates. DRACO comes with problem-independent robustness guarantees, and the model that it trains is identical to the one trained in the adversary-free setup. We provide extensive experiments on real datasets and distributed setups across a variety of large-scale models, where we show that DRACO is several times, and up to orders of magnitude, faster than median-based approaches.

1. Introduction

Distributed and parallel implementations of stochastic optimization algorithms have become the de facto standard in large-scale model training (Li et al., 2014; Recht et al., 2011; Zhang et al., 2015; Agarwal et al., 2010; Abadi et al., 2016; Chen et al., 2015; Paszke et al., 2017a; Chilimbi et al., 2014). Due to increasingly common malicious attacks, hardware errors, and software errors (Castro et al., 1999; Kotla et al., 2007; Blanchard et al., 2017; Chen et al., 2017), protecting distributed machine learning against adversarial attacks and failures has become increasingly important. Unfortunately, even a single adversarial node in a distributed setup can introduce arbitrary bias and inaccuracies to the end model (Blanchard et al., 2017).

A recent line of work (Blanchard et al., 2017; Chen et al., 2017) studies this problem under a synchronous training setup, where compute nodes evaluate gradient updates and ship them to a parameter server (PS), which stores and updates the global model. Many of these works use median-based aggregation, including the geometric median (GM), instead of averaging in order to make their computations more robust. The advantage of median-based approaches is that they can be robust to up to a constant fraction of the compute nodes being adversarial (Chen et al., 2017). However, in large data settings, the cost of computing the geometric median can dwarf the cost of computing a batch of gradients (Chen et al., 2017), rendering it impractical. Furthermore, proofs of convergence for such systems require restrictive assumptions such as convexity, and need to be re-tailored to each different training algorithm. A scalable distributed training framework that is robust against adversaries and can be applied to a large family of training algorithms (e.g., mini-batch SGD, GD, coordinate descent, SVRG, etc.) remains an open problem. In this paper, we instead use ideas from coding theory to ensure robustness during distributed training.
We present DRACO, a general distributed training framework that is robust against adversarial nodes and worst-case compute errors. We show that DRACO can resist any s adversarial compute nodes during training and returns a model identical to the one trained in the adversary-free setup. This allows DRACO to come with "black-box" convergence guarantees, i.e., proofs of convergence in the adversary-free setup carry through to the adversarial setup with no modification, unlike prior median-based approaches (Blanchard et al., 2017; Chen et al., 2017). Moreover, in median-based approaches such as (Blanchard et al., 2017; Chen et al., 2017), the median computation may dominate the overall training time. In DRACO, most of the computational effort is carried out by the compute nodes. This key factor allows our framework to offer up to orders of magnitude faster convergence in real distributed setups.

To design DRACO, we borrow ideas from coding theory and algorithmic redundancy. In standard adversary-free distributed computation setups, during each distributed round, each of the P compute nodes processes B/P gradients and ships their sum to the parameter server. In DRACO, each compute node processes rB/P gradients and sends a linear combination of those to the PS. Thus, DRACO incurs a computational redundancy ratio of r. While this may seem sub-optimal, we show that under a worst-case adversarial setup, it is information-theoretically impossible to design a system that obtains models identical to the adversary-free setup with less redundancy. Upon receiving the P gradient sums, the PS uses a "decoding" function to remove the effect of the adversarial nodes and reconstruct the originally desired sum of the B gradients. With redundancy ratio r, we show that DRACO can tolerate up to (r − 1)/2 adversaries, which is information-theoretically tight. See Figure 1 for a toy example of DRACO's functionality.

Figure 1. The high-level idea behind DRACO's algorithmic redundancy. Suppose we have 4 data points x_1, ..., x_4, and let g_i be the gradient of the model with respect to data point x_i. Instead of having each compute node i evaluate a single gradient g_i, DRACO assigns each node redundant gradients. In this example, the replication ratio is 3, and the parameter server can recover the sum of the gradients from any 2 of the encoded gradient updates. Thus, the PS can still recover the sum of gradients in the presence of an adversary. This can be done through a majority vote on all 6 pairs of encoded gradient updates. This intuitive idea does not scale to a large number of compute nodes. DRACO implements a more systematic and efficient encoding and decoding mechanism that scales to any number of machines.

We present two encoding and decoding techniques for DRACO. The encoding schemes are based on the fractional repetition code and cyclic repetition code presented in (Tandon et al., 2017; Raviv et al., 2017). In contrast to previous work on stragglers and gradient codes (Tandon et al., 2017; Raviv et al., 2017; Charles et al., 2017), our decoders are tailored to the adversarial setting and use different methods. Our decoding schemes utilize an efficient majority vote decoder and a novel Fourier decoding technique.
Compared to median-based techniques that can tolerate approximately a constant fraction of "average-case" adversaries, DRACO's (r − 1)/2 bound on the number of "worst-case" adversaries may be significantly smaller. However, in realistic regimes where only a constant number of nodes are malicious, DRACO is significantly faster, as we show in the experiments in Section 4.

We implement DRACO in PyTorch and deploy it on distributed setups on Amazon EC2, where we compare against median-based training algorithms on several real-world datasets and various ML models. We show that DRACO is up to orders of magnitude faster compared to GM-based approaches across a range of neural networks, e.g., LeNet, VGG-19, AlexNet, ResNet-18, and ResNet-152, and always converges to the correct adversary-free model, while in some cases median-based approaches do not converge.

Related Work  The large-scale nature of modern machine learning has spurred a great deal of novel research on distributed and parallel training algorithms and systems (Recht et al., 2011; Dean et al., 2012; Alistarh et al., 2017; Jaggi et al., 2014; Liu et al., 2014; Mania et al., 2015; Chen et al., 2016). Much of this work focuses on developing and analyzing efficient distributed training algorithms. This work shares ideas with federated learning, in which training is distributed among a large number of compute nodes without centralized training data (Konečný et al., 2015; 2016; Bonawitz et al., 2016).

Synchronous training can suffer from straggler nodes (Zaharia et al., 2008), i.e., a few compute nodes that are significantly slower than average. While early work on straggler mitigation used techniques such as job replication (Shah et al., 2016), more recent work has employed coding theory to speed up distributed machine learning systems (Lee et al., 2017; Li et al., 2015; Dutta et al., 2016; 2017; Reisizadeh et al., 2017; Yang et al., 2017). One notable technique is gradient coding, a straggler mitigation method proposed in (Tandon et al., 2017), which uses codes to speed up synchronous distributed first-order methods (Raviv et al., 2017; Charles et al., 2017; Cotter et al., 2011). Our work builds on and extends this line of work to the adversarial setup. Mitigating adversaries can often be more difficult than mitigating stragglers, since in the adversarial setup we have no knowledge of which nodes are the adversaries.

The topic of byzantine fault tolerance has been extensively studied since the early 80s (Lamport et al., 1982). There has been a substantial amount of recent work on byzantine fault tolerance in distributed training, which shows that while average-based gradient methods are susceptible to adversarial nodes (Blanchard et al., 2017; Chen et al., 2017), median-based update methods can achieve good convergence while being robust to adversarial nodes. Both (Blanchard et al., 2017) and (Chen et al., 2017) use variants of the geometric median to improve the tolerance of first-order methods against adversarial nodes. Unfortunately, convergence analyses of median approaches often require restrictive assumptions and algorithm-specific proofs of convergence. Furthermore, the geometric median aggregation may dominate the training time in large-scale settings.

The idea of using redundancy to guard against failures in computational systems has existed for decades. Von Neumann used redundancy and majority vote operations in boolean circuits to achieve accurate computations in the presence of noise with high probability (Von Neumann, 1956).
These results were further extended in work such as (Pippenger, 1988) to understand how susceptible a boolean circuit is to randomly occurring failures. Our work can be seen as an application of these concepts to the context of distributed training in the presence of adversaries.

2. Preliminaries

Notation  In the following, we denote matrices and vectors in bold, and scalars and functions in standard script. We let 1_m denote the m × 1 all-ones vector, while 1_{n×m} denotes the all-ones n × m matrix. We define 0_m and 0_{n×m} analogously. Given a matrix A ∈ R^{n×m}, we let A_{i,j} denote its entry at location (i, j), A_{i,·} ∈ R^{1×m} denote its ith row, and A_{·,j} ∈ R^{n×1} denote its jth column. Given S ⊆ {1, ..., n} and T ⊆ {1, ..., m}, we let A_{S,T} denote the submatrix of A obtained by keeping the rows indexed by S and the columns indexed by T. Given matrices A, B ∈ R^{n×m}, their Hadamard product, denoted A ⊙ B, is the n × m matrix with (A ⊙ B)_{i,j} = A_{i,j} B_{i,j}.

Distributed Training  The process of training a model from data can be cast as an optimization problem known as empirical risk minimization (ERM):

$$\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \ell(w; x_i),$$

where x_i ∈ R^m represents the ith data point, n is the total number of data points, w ∈ R^d is a model, and ℓ(·;·) is a loss function that measures the accuracy of the predictions made by the model on each data point. One way to approximately solve the above ERM is through stochastic gradient descent (SGD), which operates as follows. We initialize the model at an initial point w_0 and then iteratively update it according to

$$w_k = w_{k-1} - \gamma \, \nabla \ell(w_{k-1}; x_{i_k}),$$

where i_k is a random data-point index sampled from {1, ..., n}, and γ > 0 is the learning rate. In order to take advantage of distributed systems and parallelism, we often use mini-batch SGD. At each iteration of mini-batch SGD, we select a random subset S_k ⊆ {1, ..., n} of the data and update our model according to

$$w_k = w_{k-1} - \frac{\gamma}{|S_k|} \sum_{i \in S_k} \nabla \ell(w_{k-1}; x_i).$$

Many distributed versions of mini-batch SGD partition the gradient computations across the compute nodes. After computing and summing up its assigned gradients, each node sends its respective sum back to the PS. The PS aggregates these sums to update the model w_{k-1} according to the rule above. In this work, we consider the question of how to perform this update in a distributed and robust manner.

Fix a batch (or set of points) S_k, which after relabeling we assume equals {1, ..., B}. We will denote ∇ℓ(w_{k-1}; x_i) by g_i. The fundamental question we consider in this work is how to compute Σ_{i=1}^{B} g_i in a distributed and adversary-resistant manner. We present DRACO, a framework that can compute this summation in a distributed manner, even in the presence of adversaries.

Remark 1. In contrast to previous works, our analysis and framework are applicable to any distributed algorithm that requires the sum of multiple functions. Notably, our framework can be applied to any first-order method, including gradient descent, SVRG (Johnson & Zhang, 2013), coordinate descent, and projected or accelerated versions of these algorithms. For the sake of simplicity, our discussion in the rest of the text will focus on mini-batch SGD.
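To make the distributed update concrete, the following is a minimal, adversary-free sketch in NumPy of one distributed mini-batch SGD step as described above. The quadratic loss, the data partitioning, and all function names are illustrative choices of ours and not part of DRACO.

```python
import numpy as np

def local_gradient_sum(w, X, y):
    # sum of per-example gradients of the squared loss 0.5 * (x^T w - y)^2
    # (an illustrative loss; any differentiable loss fits the same template)
    residual = X @ w - y
    return X.T @ residual

def distributed_minibatch_sgd_step(w, X_batch, y_batch, P, lr):
    # split the batch of B points across the P compute nodes
    X_parts = np.array_split(X_batch, P)
    y_parts = np.array_split(y_batch, P)
    # each node ships the sum of its assigned gradients to the PS
    sums = [local_gradient_sum(w, Xp, yp) for Xp, yp in zip(X_parts, y_parts)]
    # the PS aggregates the sums and applies the mini-batch SGD update
    total = np.sum(sums, axis=0)
    return w - lr * total / len(y_batch)

# toy usage
rng = np.random.default_rng(0)
X, y = rng.normal(size=(180, 10)), rng.normal(size=180)
w = np.zeros(10)
w = distributed_minibatch_sgd_step(w, X, y, P=9, lr=0.1)
```

The rest of the paper is about making the aggregation of the per-node sums robust when some nodes misbehave.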
Adversarial Compute Node Model  We consider the setting where a subset of size s of the P compute nodes acts adversarially against the training process. The goal of an adversary can be either to completely mislead the end model or to bias it towards specific areas of the parameter space. A compute node is considered adversarial if it does not return the prescribed gradient update given its allocated samples. Such a node can ship back to the PS any arbitrary update of dimension equal to that of the true gradient. Mini-batch SGD fails to converge even if there is only a single adversarial node (Blanchard et al., 2017). In this work, we consider the strongest possible adversaries. We assume that each adversarial node has access to unlimited computational power, the entire data set, and the training algorithm, and has knowledge of any defenses present in the system. Furthermore, all adversarial nodes may collaborate with each other.

3. DRACO: Robust Distributed Training via Algorithmic Redundancy

In this section we present our main results for DRACO. Due to space constraints, all proofs are left to the supplement.

We generalize the scheme in Figure 1 to P compute nodes and B data samples. At each iteration of our training process, we assign the B gradients to the P compute nodes using a P × B allocation matrix A. Here, A_{j,k} is 1 if node j is assigned the kth gradient g_k, and 0 otherwise. The support of A_{j,·}, denoted supp(A_{j,·}), is the set of indices k of gradients evaluated by the jth compute node. For simplicity, we will assume B = P throughout the following.

DRACO utilizes redundant computations, so it is worth formally defining the amount of redundancy incurred. This is captured by the following definition.

Definition 3.1. r ≜ (1/P) ||A||_0 denotes the redundancy ratio. In other words, the redundancy ratio is the average number of gradients assigned to each compute node.

We define a d × P matrix G by G ≜ [g_1, g_2, ..., g_P]. Thus, G has all assigned gradients as its columns. The jth compute node first computes a d × P gradient matrix Y_j ≜ (1_d A_{j,·}) ⊙ G using its allocated gradients. In particular, if the kth gradient g_k is allocated to the jth compute node, i.e., A_{j,k} ≠ 0, then the compute node computes g_k as the kth column of Y_j. Otherwise, it sets the kth column of Y_j to 0_d. The jth compute node is equipped with an encoding function E_j that maps the d × P matrix Y_j of its assigned gradients to a single d-dimensional vector. After computing its assigned gradients, the jth compute node sends z_j ≜ E_j(Y_j) to the PS. If the jth node is adversarial, then it instead sends z_j + n_j to the PS, where n_j is an arbitrary d-dimensional vector. We let E be the set of local encoding functions, i.e., E = {E_1, E_2, ..., E_P}.

Let us define a d × P matrix Z_{A,E,G} ≜ [z_1, z_2, ..., z_P], and a d × P matrix N ≜ [n_1, n_2, ..., n_P]. Note that at most s columns of N are non-zero. Under this notation, after all updates are finished the PS receives a d × P matrix R ≜ Z_{A,E,G} + N. The PS then computes a d-dimensional update gradient vector u ≜ D(R) using a decoder function D. The system in DRACO is therefore determined by the tuple (A, E, D): we decide how to assign gradients by designing A, how each compute node should locally amalgamate its gradients by designing E, and how the PS should decode the output by designing D. The process of DRACO is illustrated in Figure 2.

Figure 2. In DRACO, each compute node is allocated a subset of the data set. Each compute node computes redundant gradients, encodes them via E_i, and sends the resulting vector to the PS. These received vectors then pass through a decoder that detects where the adversaries are and removes their effects from the updates. The output of the decoder is the true sum of the gradients. The PS applies the update to the parameter model, and we then continue to the next iteration.
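The (A, E, D) pipeline can be written down almost verbatim. Below is a minimal sketch of one round; the encoders and decoder used in the demo (column sums at each node and a plain sum at the PS) are placeholder choices that reproduce vanilla distributed SGD and are not robust. The robust choices of E and D are the codes developed in the remainder of this section, and all names here are ours.

```python
import numpy as np

def draco_round(G, A, encoders, decoder, adversary_noise=None):
    """One round of the (A, E, D) framework.
    G: d x P matrix of gradients; A: P x P allocation matrix (B = P);
    encoders: list of P functions mapping a masked d x P matrix to a d-vector;
    decoder: function mapping the received d x P matrix R to the update u."""
    d, P = G.shape
    Z = np.zeros((d, P))
    for j in range(P):
        Y_j = A[j, :] * G              # broadcasting A_{j,.} masks the unassigned columns
        Z[:, j] = encoders[j](Y_j)     # node j ships z_j = E_j(Y_j)
    # at most s columns of the noise matrix N are nonzero
    R = Z if adversary_noise is None else Z + adversary_noise
    return decoder(R)

# vanilla mini-batch SGD as an (A, E, D) instance: A = I, E_j = column sum,
# D = sum of the received columns (the PS rescales by the batch size when it updates)
d, P = 6, 8
G = np.random.randn(d, P)
A = np.eye(P)
encoders = [lambda Y: Y.sum(axis=1)] * P
u = draco_round(G, A, encoders, decoder=lambda R: R.sum(axis=1))
assert np.allclose(u, G.sum(axis=1))
```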
This framework of (A, E, D) encompasses both distributed SGD and the GM approach. In distributed mini-batch SGD, we assign one gradient to each compute node. After relabeling, we can assume that we assign g_i to compute node i. Therefore, A is simply the identity matrix I_P. The matrix Y_j then contains g_j in column j and 0 in all other columns. The local encoding function E_j simply returns g_j by computing E_j(Y_j) = Y_j 1_P = g_j, which it then sends to the PS. The decoding function depends on the algorithm: for vanilla mini-batch SGD, the PS takes the average of the gradients, while in the GM approach, it takes a geometric median of the gradients.

In order to guarantee convergence, we want DRACO to exactly recover the true sum of gradients, regardless of the behavior of the adversarial nodes. In other words, we want DRACO to protect against worst-case adversaries. Formally, we want the PS to always obtain the d-dimensional vector G1_P via DRACO with any s adversarial nodes. Below is the formal definition.

Definition 3.2. DRACO with (A, E, D) can tolerate s adversarial nodes if, for any N = [n_1, n_2, ..., n_P] such that |{j : n_j ≠ 0}| ≤ s, we have D(Z_{A,E,G} + N) = G1_P.

Remark 2. If we can successfully defend against s adversaries, then the model update after each iteration is identical to that in the adversary-free setup. This implies that any guarantees of convergence in the adversary-free case transfer to the adversarial case.

Redundancy Bound  We first study how much redundancy is required if we want to exactly recover the correct sum of gradients per iteration in the presence of s adversaries.

Theorem 3.1. Suppose a selection of gradient allocation, encoding, and decoding mechanisms (A, E, D) can tolerate s adversarial nodes. Then its redundancy ratio r must satisfy r ≥ 2s + 1.

The above result is information-theoretic, meaning that regardless of how the compute nodes encode and how the PS decodes, the data samples have to be replicated at least 2s + 1 times on average to defend against s adversarial nodes.

Remark 3. Suppose that a tuple (A, E, D) can tolerate any s adversarial nodes. By Theorem 3.1, this implies that, on average, each compute node encodes at least 2s + 1 d-dimensional vectors. Therefore, if the encoding has linear complexity, then each encoder requires (2s + 1)d operations in the worst case. If the decoder D has linear time complexity, then it requires at least Pd operations in the worst case, as it needs to use the d-dimensional input from all P compute nodes. This gives a computational cost of O(Pd) in general, which is significantly less than that of the median approach in (Blanchard et al., 2017), which requires O(P^2(d + log P)) operations.

Optimal Coding Schemes  A natural question is: can we achieve the optimal redundancy bound with linear-time encoding and decoding? More formally, can we design a tuple (A, E, D) that has redundancy ratio r = 2s + 1, computational complexity O((2s + 1)d) at each compute node, and O(Pd) at the PS? We give a positive answer by presenting two coding approaches that match these bounds.
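Before moving to DRACO's codes, it may help to see the two baseline aggregation rules above written as decoder functions D acting on the received matrix R. The geometric median here is approximated with Weiszfeld's iteration, which is one standard solver; the GM-based systems cited above may use different solvers, so this is only an illustrative sketch with names of our choosing.

```python
import numpy as np

def mean_decoder(R):
    # vanilla mini-batch SGD: the PS averages the received updates (columns of R)
    return R.mean(axis=1)

def geometric_median_decoder(R, iters=100, eps=1e-8):
    # Weiszfeld's iteration: a common way to approximate the geometric median of
    # the columns of R; robust to a minority of outlying columns, but costly at scale
    y = R.mean(axis=1)
    for _ in range(iters):
        dists = np.linalg.norm(R - y[:, None], axis=0)
        w = 1.0 / np.maximum(dists, eps)
        y_new = (R * w).sum(axis=1) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y
```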
The encoding methods are based on the fractional repetition code and the cyclic repetition code of (Tandon et al., 2017; Raviv et al., 2017).

Fractional Repetition Code  Suppose 2s + 1 divides P. The fractional repetition code (derived from (Tandon et al., 2017)) works as follows. We first partition the compute nodes into P/r groups of r = 2s + 1 nodes each, and assign the nodes in a group to compute the same sum of gradients. Let ĝ be the desired sum of gradients per iteration. In order to decode the outputs returned by the compute nodes in the same group, the PS uses a majority vote to select one value. This guarantees that as long as fewer than half of the nodes in a group are adversarial, the majority procedure returns the correct ĝ.

Formally, the repetition code (A^Rep, E^Rep, D^Rep) is defined as follows. The assignment matrix A^Rep is the block-diagonal matrix

$$\mathbf{A}^{\mathrm{Rep}} = \begin{bmatrix} \mathbf{1}_{r\times r} & \mathbf{0}_{r\times r} & \cdots & \mathbf{0}_{r\times r} \\ \mathbf{0}_{r\times r} & \mathbf{1}_{r\times r} & \cdots & \mathbf{0}_{r\times r} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}_{r\times r} & \mathbf{0}_{r\times r} & \cdots & \mathbf{1}_{r\times r} \end{bmatrix}.$$

The jth compute node first computes all of its allocated gradients, Y_j^Rep = (1_d A^Rep_{j,·}) ⊙ G. Its encoder function simply takes the summation of all the allocated gradients, that is, E_j^Rep(Y_j^Rep) = Y_j^Rep 1_P. It then sends z_j = E_j^Rep(Y_j^Rep) to the PS. The decoder works by first finding the majority vote of the outputs of the compute nodes that were assigned the same gradients. For instance, since the first r compute nodes were assigned the same gradients, it finds the majority vote of [z_1, ..., z_r]. It does the same with each of the blocks of size r, and then takes the sum of the P/r majority votes. We note that our decoder is different from the one used in the straggler mitigation setup of (Tandon et al., 2017); it follows the concept of majority decoding, similarly to (Von Neumann, 1956; Pippenger, 1988). Formally, D^Rep is given by

$$D^{\mathrm{Rep}}(\mathbf{R}) = \sum_{\ell=1}^{P/r} \mathrm{Maj}\!\left(\mathbf{R}_{\cdot,\,((\ell-1)r+1):(\ell r)}\right),$$

where Maj(·) denotes the majority vote function and R is the d × P matrix received from all compute nodes. While a naive implementation of majority vote scales quadratically with the number of compute nodes P, we instead use a streaming version of majority vote (Boyer & Moore, 1991), whose complexity is linear in P.

Theorem 3.2. Suppose 2s + 1 divides P. Then the repetition code (A^Rep, E^Rep, D^Rep) with r = 2s + 1 can tolerate any s adversaries, achieves the optimal redundancy ratio, and has linear-time encoding and decoding.
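A minimal NumPy sketch of the fractional repetition scheme just described, using a Boyer-Moore style streaming majority vote applied to the replicas within each group. Exact equality of honest replicas is assumed (they ship identical vectors), which matches the exact-recovery setting above; the function names are ours.

```python
import numpy as np

def repetition_encode(G, s):
    # each group of r = 2s+1 nodes is assigned the same block of r gradients,
    # and every node in the group ships the block's sum (fractional repetition)
    d, P = G.shape
    r = 2 * s + 1
    Z = np.zeros((d, P))
    for g in range(P // r):
        block_sum = G[:, g * r:(g + 1) * r].sum(axis=1)
        Z[:, g * r:(g + 1) * r] = block_sum[:, None]
    return Z

def streaming_majority(block):
    # Boyer-Moore majority vote over the columns of `block` (the r replicas of one group)
    candidate, count = block[:, 0], 0
    for j in range(block.shape[1]):
        if count == 0:
            candidate, count = block[:, j], 1
        elif np.array_equal(block[:, j], candidate):
            count += 1
        else:
            count -= 1
    return candidate

def repetition_decode(R, s):
    r = 2 * s + 1
    P = R.shape[1]
    # majority vote within each group of r replicas, then sum the P/r group values
    return sum(streaming_majority(R[:, g * r:(g + 1) * r]) for g in range(P // r))

# toy usage: P = 9 nodes, s = 1 adversary, r = 3
d, s = 5, 1
G = np.random.randn(d, 9)
R = repetition_encode(G, s)
R[:, 4] += 100.0                     # one adversarial node corrupts its update
assert np.allclose(repetition_decode(R, s), G.sum(axis=1))
```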
Cyclic Code  Next we describe a cyclic code whose encoding method comes from (Tandon et al., 2017) and is similar to that of (Raviv et al., 2017). We denote the cyclic code, with its encoding and decoding functions, by (A^Cyc, E^Cyc, D^Cyc). The cyclic code provides an alternative way to tolerate adversaries in distributed setups. We will show that the cyclic code also achieves the optimal redundancy ratio and has linear-time encoding and decoding. A further difference from the repetition code is that in the cyclic code, the compute nodes compute and transmit complex vectors, and the decoding function takes these complex vectors as input.

To better understand the cyclic code, imagine that all P gradients we wish to compute are arranged in a circle. Since there are P starting positions, there are P possible ways to pick a sequence of 2s + 1 clockwise-consecutive gradients in the circle. Assigning one such sequence of gradients to each compute node leads to redundancy ratio r = 2s + 1. The allocation matrix for the cyclic code is therefore A^Cyc, whose ith row contains r = 2s + 1 cyclically consecutive ones, in positions i through i + 2s modulo B.

In the cyclic code, each compute node computes a linear combination of its assigned gradients. This can be viewed as a generalization of the repetition code's encoder. Formally, we construct a P × P matrix W such that for all j, ℓ, A^Cyc_{j,ℓ} = 0 implies W_{j,ℓ} = 0. Let Y_j^Cyc = (1_d A^Cyc_{j,·}) ⊙ G denote the gradients computed at compute node j. The local encoding function E_j^Cyc is defined by E_j^Cyc(Y_j^Cyc) = G W_{·,j}. After performing this local encoding, the jth compute node sends z_j^Cyc ≜ E_j^Cyc(Y_j^Cyc) to the PS. Let Z_{A^Cyc,E^Cyc,G} ≜ [z_1^Cyc, z_2^Cyc, ..., z_P^Cyc]. Then one can verify from the definition of E_j^Cyc that Z_{A^Cyc,E^Cyc,G} = GW. The matrix received at the PS thus becomes R^Cyc = Z_{A^Cyc,E^Cyc,G} + N = GW + N.

In order to decode, the PS needs to detect which compute nodes are adversarial and recover the correct gradient summation from the non-adversarial nodes. Methods for the latter alone, in the presence of straggler nodes, were presented in (Tandon et al., 2017) and (Raviv et al., 2017). Suppose there is a function φ(·) that can compute the adversarial node index set V; we construct φ explicitly below. Let U be the index set of the non-adversarial nodes, and suppose that the span of W_{·,U} contains 1_P. Then we can obtain a vector b by solving W_{·,U} b = 1_P. Finally, since U is the index set of non-adversarial nodes, for any j ∈ U we must have n_j = 0. Thus,

$$\mathbf{R}^{\mathrm{Cyc}}_{\cdot,U}\,\mathbf{b} = (\mathbf{G}\mathbf{W} + \mathbf{N})_{\cdot,U}\,\mathbf{b} = \mathbf{G}\mathbf{W}_{\cdot,U}\,\mathbf{b} = \mathbf{G}\mathbf{1}_P.$$

The decoder function is given formally in Algorithm 1.

Algorithm 1: Decoder Function D^Cyc
  Input: received d × P matrix R^Cyc
  Output: desired gradient summation u^Cyc
  1. V = φ(R^Cyc)              // locate the adversarial node indices
  2. U = {1, 2, ..., P} − V    // non-adversarial node indices
  3. Find b by solving W_{·,U} b = 1_P
  4. Compute and return u^Cyc = R^Cyc_{·,U} b

To make this approach work, we need to design a matrix W and an index-location function φ(·) such that (i) for all j, k, A_{j,k} = 0 implies W_{j,k} = 0 and the span of W_{·,U} contains 1_P, and (ii) φ(·) can locate the adversarial nodes.
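Algorithm 1 translates almost line for line into code. The sketch below assumes that the encoding matrix W (constructed next) is known to the PS and that the adversary-locating function φ has already produced the index set V; complex arithmetic is used because W is built from a Fourier matrix. This is a sketch of ours, not the authors' implementation.

```python
import numpy as np

def cyclic_decode(R, W, V):
    """Algorithm 1: recover G @ 1_P from R = G @ W + N.
    R: d x P received (complex) matrix, W: P x P encoding matrix,
    V: iterable of adversarial node indices (as located by phi)."""
    P = W.shape[0]
    U = np.setdiff1d(np.arange(P), np.asarray(list(V)))        # honest nodes
    # solve W[:, U] b = 1_P (least squares, since W[:, U] is a tall P x |U| system)
    b, *_ = np.linalg.lstsq(W[:, U], np.ones(P, dtype=complex), rcond=None)
    u = R[:, U] @ b
    return u.real          # the true gradient sum is real; the imaginary part is round-off
```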
Let us first construct W. Let C be the P × P inverse discrete Fourier transform (IDFT) matrix, i.e.,

$$\mathbf{C}_{jk} = \frac{1}{\sqrt{P}} \exp\!\left(\frac{2\pi i}{P}\,(j-1)(k-1)\right), \qquad j, k = 1, 2, \ldots, P.$$

Let C_L be the first P − 2s rows of C and C_R be the last 2s rows. Let α_j be the set of indices of the zero entries in A^Cyc_{j,·}, i.e., α_j = {k : A^Cyc_{j,k} = 0}. Note that C_L is a (P − 2s) × P Vandermonde matrix, and thus any P − 2s of its columns are linearly independent. Since |α_j| = P − 2s − 1, we can uniquely obtain a (P − 2s − 1)-dimensional vector q_j by solving 0 = [q_j 1] · [C_L]_{·,α_j}. Construct a P × (P − 2s − 1) matrix Q ≜ [q_1; q_2; ...; q_P] (stacking the q_j as rows) and a P × P matrix W ≜ [Q 1_P] · C_L. One can verify that (i) each row of W has the same support as the corresponding row of the allocation matrix A^Cyc, and (ii) the span of any P − 2s + 1 columns of W contains 1_P. This is summarized as follows.

Lemma 3.3. For all j, k, A_{j,k} = 0 ⇒ W_{j,k} = 0. For any index set U such that |U| ≥ P − 2s + 1, the column span of W_{·,U} contains 1_P.

The function φ(·) works as follows. Given the d × P matrix R^Cyc received from the compute nodes, we first generate a 1 × d random vector f ∼ N(1_{1×d}, I_d) and compute

$$[h_{P-2s},\, h_{P-2s-1},\, \cdots,\, h_{P-1}] \;\triangleq\; \mathbf{f}\,\mathbf{R}^{\mathrm{Cyc}}\,\mathbf{C}_R^{\dagger},$$

where † denotes the conjugate transpose. We then obtain a vector β = [β_0, β_1, ..., β_{s−1}]^T by solving

$$\begin{bmatrix} h_{P-2s} & h_{P-2s+1} & \cdots & h_{P-s-1} \\ h_{P-2s+1} & h_{P-2s+2} & \cdots & h_{P-s} \\ \vdots & \vdots & \ddots & \vdots \\ h_{P-s-1} & h_{P-s} & \cdots & h_{P-2} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{s-1} \end{bmatrix} = \begin{bmatrix} h_{P-s} \\ h_{P-s+1} \\ \vdots \\ h_{P-1} \end{bmatrix}.$$

We then compute

$$h_{\ell} = \sum_{u=0}^{s-1} \beta_u\, h_{\ell+u-s}, \qquad \ell = 0, 1, \cdots, P - 2s - 1,$$

where indices are taken cyclically, i.e., h_ℓ = h_{P+ℓ} for negative ℓ. Once the vector h ≜ [h_0, h_1, ..., h_{P−1}] is obtained, we compute its IDFT, denoted t ≜ [t_0, t_1, ..., t_{P−1}]. The returned index set is V = {j : t_{j+1} ≠ 0}. The following lemma shows the correctness of φ(·).

Lemma 3.4. Suppose N = [n_1, n_2, ..., n_P] satisfies |{j : ||n_j||_0 ≠ 0}| ≤ s. Then φ(R^Cyc) = φ(GW + N) = {j : ||n_j||_0 ≠ 0} with probability 1.

Finally, we can show that the cyclic code tolerates any s adversaries, achieves the optimal redundancy ratio, and has linear-time encoding and decoding.

Theorem 3.5. The cyclic code (A^Cyc, E^Cyc, D^Cyc) can tolerate any s adversaries with probability 1 and achieves the redundancy ratio lower bound. For d ≫ P, its encoding and decoding achieve linear-time computational complexity.

Note that the cyclic code requires transmitting the complex vectors GW, which potentially doubles the bandwidth requirement. To handle this, one can transform the original real gradient matrix G ∈ R^{d×P} into a complex gradient matrix Ĝ ∈ C^{⌈d/2⌉×P} by letting its ith component have real part G_i and imaginary part G_{⌈d/2⌉+i}. The compute nodes then only need to send ĜW. Once the PS recovers û^Cyc ≜ Ĝ1_P, it can simply sum the real and imaginary parts to form the true gradient summation, i.e., u^Cyc = Re(û^Cyc) + Im(û^Cyc) = G1_P.
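A sketch of the cyclic code construction described in this section: the cyclic allocation, the IDFT matrix C, the vectors q_j, and the resulting encoding matrix W, followed by an end-to-end check. The adversary-locating function φ is omitted here (the adversarial index set is passed in directly), and the shift-by-one cyclic allocation is assumed; this is an illustration of the construction, not the released DRACO code.

```python
import numpy as np

def build_cyclic_code(P, s):
    """Construct the cyclic allocation A (row j has ones in positions j..j+2s mod P)
    and the encoding matrix W = [Q 1_P] C_L described above (a sketch)."""
    r = 2 * s + 1
    A = np.zeros((P, P), dtype=int)
    for j in range(P):
        A[j, (j + np.arange(r)) % P] = 1
    k = np.arange(P)
    C = np.exp(2j * np.pi * np.outer(k, k) / P) / np.sqrt(P)   # IDFT matrix
    C_L = C[:P - 2 * s, :]
    W = np.zeros((P, P), dtype=complex)
    for j in range(P):
        alpha = np.where(A[j] == 0)[0]          # the P - 2s - 1 unassigned positions
        # choose q_j so that [q_j, 1] @ C_L[:, alpha] = 0, which makes row j of W
        # vanish exactly where row j of A vanishes
        q = np.linalg.solve(C_L[:-1, alpha].T, -C_L[-1, alpha])
        W[j, :] = np.concatenate([q, [1.0]]) @ C_L
    return A, W

# end-to-end check (same decoding steps as the Algorithm 1 sketch above);
# each column of W has exactly r = 2s+1 nonzeros, so each node still only
# combines r gradients when it ships G @ W[:, j]
P, s, d = 15, 2, 8
A, W = build_cyclic_code(P, s)
G = np.random.randn(d, P)
Z = G @ W                                       # received matrix before corruption: Z = G W
adversaries = [3, 11]
Z[:, adversaries] += 1000 * (np.random.randn(d, 2) + 1j * np.random.randn(d, 2))
U = np.setdiff1d(np.arange(P), adversaries)     # honest nodes, as located by phi
b, *_ = np.linalg.lstsq(W[:, U], np.ones(P, dtype=complex), rcond=None)
u = (Z[:, U] @ b).real
assert np.allclose(u, G.sum(axis=1))
```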
4. Experiments

In this section we present an empirical study of DRACO and compare it to the median-based approach in (Chen et al., 2017) under different adversarial models and real distributed environments. The main findings are as follows: 1) For the same training accuracy, DRACO is up to orders of magnitude faster than the GM-based approach; 2) In some instances, the GM approach (Chen et al., 2017) does not converge, while DRACO converges in all of our experiments, regardless of which dataset, machine learning model, and adversary attack model we use; 3) Although DRACO is faster than GM-based approaches, its runtime can sometimes scale linearly with the number of adversaries, due to the algorithmic redundancy needed to defend against them.

Implementation and Setup  We compare vanilla mini-batch SGD to both DRACO-based mini-batch SGD and GM-based mini-batch SGD (Chen et al., 2017). In mini-batch SGD, there is no data replication and each compute node only computes gradients sampled from its partition of the data. The PS then averages all received gradients and updates the model. In GM-based mini-batch SGD, the PS uses the geometric median instead of the average to update the model. We have implemented all of these in PyTorch (Paszke et al., 2017b) with MPI4py (Dalcin et al., 2011), deployed on m4.2/4/10xlarge instances on Amazon EC2 (code available at https://github.com/hwang595/Draco). We conduct our experiments on various adversary attack models, datasets, learning problems, and neural network models.

Adversarial Attack Models  We consider two adversarial models. The first is the "reversed gradient" adversary, where adversarial nodes that were supposed to send g to the PS instead send −c·g, for some c > 0. The second is a "constant adversary" attack, where adversarial nodes always send a constant multiple κ of the all-ones vector, with dimension equal to that of the true gradient, to the PS. In our experiments, we set c = 100 for the reverse gradient adversary and κ = −100 for the constant adversary. At each iteration, s nodes are randomly selected to act as adversaries.
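Both attacks amount to a one-line transformation of the honest update. A minimal sketch, with the constants used in our experiments (function names are ours):

```python
import numpy as np

def reversed_gradient_attack(g, c=100.0):
    # adversary ships -c * g instead of the true encoded update g
    return -c * g

def constant_attack(g, kappa=-100.0):
    # adversary ignores its data and ships kappa times the all-ones vector
    # of the same dimension as the true update
    return kappa * np.ones_like(g)
```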
End-to-end Convergence Performance  We first evaluate the end-to-end convergence performance of DRACO, using both the repetition and cyclic codes, and compare it to ordinary mini-batch SGD as well as the GM approach. The datasets and their associated learning models are summarized in Table 1.

Table 1. The datasets used, their associated learning models, and corresponding parameters.

  Dataset         MNIST           CIFAR10     MR
  # data points   70,000          60,000      10,662
  Model           FC / LeNet      ResNet-18   CRN
  # Classes       10              10          2
  # Parameters    1,033k / 431k   11,173k     154k
  Optimizer       SGD             SGD         Adam
  Learning rate   0.01 / 0.01     0.1         0.001
  Batch size      720 / 720       180         180

We use fully connected (FC) neural networks and LeNet (LeCun et al., 1998) for MNIST, ResNet-18 (He et al., 2016) for CIFAR10 (Krizhevsky & Hinton, 2009), and the CNN-rand-non-static (CRN) model of (Kim, 2014) for Movie Review (MR) (Pang & Lee, 2005). The experiments were run on a cluster of 45 compute nodes instantiated on m4.2xlarge instances. At each iteration, we randomly select s = 1, 3, 5 nodes (2.2%, 6.7%, 11.1% of all compute nodes) as adversaries. All three methods are trained for 10,000 distributed iterations. Figure 3 shows how the testing accuracy varies with training time. Tables 2 and 3 give a detailed account of the speedups of DRACO compared to the GM approach, where we run both systems until they achieve the same designated testing accuracy (the details for MNIST are given in the supplement). First, as expected, ordinary mini-batch SGD may not converge even if there is only one adversary. Second, under the reverse gradient adversary model, DRACO converges several times faster than the GM approach, using both the repetition and cyclic codes, achieving more than an order of magnitude speedup over the GM approach in some cases. We suspect that this is because the computation of the GM is more expensive than the encoding and decoding overhead of DRACO.

Table 2. Speedups of DRACO with repetition and cyclic codes over GM when using ResNet-18 on CIFAR10. We run both methods until they reach the same specified testing accuracy. Here ∞ means that the GM approach failed to converge to the same accuracy as DRACO.

  Test accuracy     80%       85%       88%       90%
  2.2% rev grad     2.6/2.0   3.3/2.6   4.2/3.3   ∞/∞
  6.7% rev grad     2.8/2.2   3.4/2.7   4.3/3.4   ∞/∞
  11.1% rev grad    4.1/3.3   4.2/3.2   5.5/4.4   ∞/∞

Table 3. Speedups of DRACO with repetition and cyclic codes over GM when using CRN on MR. We run both methods until they reach the same specified testing accuracy.

  Test accuracy     95%       96%       98%       98.5%
  2.2% rev grad     5.4/4.2   5.6/4.3   9.7/7.4   12/9.0
  6.7% rev grad     6.4/4.5   6.3/4.5   11/8.1    19/13
  11.1% rev grad    7.5/4.7   7.4/4.6   12/8      19/12

Under the constant adversary model, the GM approach sometimes failed to converge, while DRACO still converged in all of our experiments. This reflects our theory, which shows that DRACO always returns a model identical to the one trained by the ordinary algorithm in an adversary-free environment. One reason why the GM approach may fail to converge is that, by using the geometric median, it loses information about a subset of the gradients. Under the constant adversary model, the PS effectively gains no information about the gradients computed by the adversarial nodes, and may not recover the desired optimal model. Another reason might be that the GM often requires conditions such as convexity of the underlying loss function. Since neural networks are generally non-convex, we have no guarantees that GM converges in these settings. It is worth noting that GM may also fail to converge if we use L-BFGS or accelerated gradient descent to perform training, as the choice of algorithm is separate from the underlying geometry of the neural network. Nevertheless, DRACO still converges for such algorithms.

Figure 3. Convergence rates of DRACO, GM, and vanilla mini-batch SGD, on (a) MNIST on FC, (b) MNIST on LeNet, (c) CIFAR10 on ResNet-18, and (d) MR on CRN, all with reverse gradient adversaries; and (e) MNIST on FC, (f) MNIST on LeNet, (g) CIFAR10 on ResNet-18, and (h) MR on CRN, all with constant adversaries. (Each panel plots test accuracy (%) versus training time in minutes.)

Per-iteration cost of DRACO  We provide empirical per-iteration costs of applying DRACO to three large state-of-the-art deep networks: ResNet-152, VGG-19, and AlexNet (He et al., 2016; Simonyan & Zisserman, 2014; Krizhevsky et al., 2012). The experiments provided here are run on 46 real instances (45 compute nodes and 1 PS) on AWS EC2. For ResNet-152 and VGG-19, m4.4xlarge instances (16 cores, 64 GB memory) are used, while the AlexNet experiments are run on m4.10xlarge instances (40 cores, 160 GB memory). We use a batch size of B = 180 and split the data among the compute nodes, so each compute node is assigned B/P = 4 data points per iteration. We use the CIFAR10 dataset for all of the aforementioned networks; for networks not designed for small images (like AlexNet), we resize the CIFAR10 images to fit the network. As shown in Figure 4, with s = 5, the encoding and decoding time of DRACO can be several times larger than the computation time of ordinary SGD, though SGD may not converge in adversarial settings. Nevertheless, DRACO is still several times faster than GM. While the communication cost is high in both DRACO and the GM method, the decoding time of the GM approach, i.e., its geometric median update at the PS, is prohibitively higher. Meanwhile, the overhead of DRACO is relatively negligible.

Figure 4. Empirical per-iteration time cost on large models with 11.1% adversarial nodes. We consider the reverse gradient adversary on (a) VGG-19 and (b) AlexNet, and the constant adversary on (c) VGG-19 and (d) AlexNet. Results on ResNet-152 are in the supplement. (Each panel plots per-iteration time in seconds, broken down into computation, communication, encoding, and decoding.)
5. Conclusion and Open Problems

In this work we presented DRACO, a framework for robust distributed training via algorithmic redundancy. DRACO is robust to arbitrarily malicious compute nodes, while being orders of magnitude faster than state-of-the-art robust distributed systems. We give information-theoretic lower bounds on how much redundancy is required to resist adversaries while maintaining the correct update rule, and show that DRACO achieves this lower bound.

There are several interesting future directions. First, DRACO is designed to output the same model with or without adversaries. However, slightly inexact model updates often do not noticeably decrease performance. Therefore, we might ask whether we can either (1) tolerate more stragglers or (2) reduce the computational cost of DRACO by only approximately recovering the desired gradient summation. Second, while we give two relatively efficient methods for encoding and decoding, there may be others that are more efficient for use in distributed setups.

Acknowledgements

This work was supported in part by a gift from Google and by AWS Cloud Credits for Research from Amazon. We thank Jeffrey Naughton and Remzi Arpaci-Dusseau for invaluable discussions and feedback on earlier drafts of this paper.

References

Abadi, Martín, Barham, Paul, Chen, Jianmin, Chen, Zhifeng, Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Irving, Geoffrey, Isard, Michael, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.

Agarwal, Alekh, Wainwright, Martin J, and Duchi, John C. Distributed dual averaging in networks. In NIPS, pp. 550–558, 2010.

Alistarh, Dan, Grubic, Demjan, Li, Jerry, Tomioka, Ryota, and Vojnovic, Milan. QSGD: Communication-efficient SGD via gradient quantization and encoding. In NIPS, pp. 1707–1718, 2017.

Blanchard, Peva, Guerraoui, Rachid, Stainer, Julien, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In NIPS, pp. 118–128, 2017.

Bonawitz, Keith, Ivanov, Vladimir, Kreuter, Ben, Marcedone, Antonio, McMahan, H Brendan, Patel, Sarvar, Ramage, Daniel, Segal, Aaron, and Seth, Karn. Practical secure aggregation for federated learning on user-held data. arXiv preprint arXiv:1611.04482, 2016.

Boyer, Robert S and Moore, J Strother. MJRTY - a fast majority vote algorithm. In Automated Reasoning, pp. 105–117. Springer, 1991.

Castro, Miguel, Liskov, Barbara, et al. Practical byzantine fault tolerance. In OSDI, volume 99, pp. 173–186, 1999.

Charles, Zachary, Papailiopoulos, Dimitris, and Ellenberg, Jordan. Approximate gradient coding via sparse random graphs. arXiv preprint arXiv:1711.06771, 2017.

Chen, Jianmin, Pan, Xinghao, Monga, Rajat, Bengio, Samy, and Jozefowicz, Rafal. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.

Chen, Tianqi, Li, Mu, Li, Yutian, Lin, Min, Wang, Naiyan, Wang, Minjie, Xiao, Tianjun, Xu, Bing, Zhang, Chiyuan, and Zhang, Zheng. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

Chen, Yudong, Su, Lili, and Xu, Jiaming. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. arXiv preprint arXiv:1705.05491, 2017.

Chilimbi, Trishul M, Suzue, Yutaka, Apacible, Johnson, and Kalyanaraman, Karthik. Project Adam: Building an efficient and scalable deep learning training system. In OSDI, volume 14, pp. 571–582, 2014.

Cotter, Andrew, Shamir, Ohad, Srebro, Nati, and Sridharan, Karthik. Better mini-batch algorithms via accelerated gradient methods. In NIPS, pp. 1647–1655, 2011.

Dalcin, Lisandro D, Paz, Rodrigo R, Kler, Pablo A, and Cosimo, Alejandro. Parallel distributed computing using Python. Advances in Water Resources, 34(9):1124–1139, 2011.

Dean, Jeffrey, Corrado, Greg, Monga, Rajat, Chen, Kai, Devin, Matthieu, Mao, Mark, Senior, Andrew, Tucker, Paul, Yang, Ke, Le, Quoc V, et al. Large scale distributed deep networks. In NIPS, pp. 1223–1231, 2012.

Dutta, Sanghamitra, Cadambe, Viveck, and Grover, Pulkit. Short-Dot: Computing large linear transforms distributedly using coded short dot products. In NIPS, pp. 2100–2108, 2016.

Dutta, Sanghamitra, Cadambe, Viveck, and Grover, Pulkit. Coded convolution for parallel and distributed computing within a deadline. In ISIT, pp. 2403–2407. IEEE, 2017.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.

Jaggi, Martin, Smith, Virginia, Takác, Martin, Terhorst, Jonathan, Krishnan, Sanjay, Hofmann, Thomas, and Jordan, Michael I. Communication-efficient distributed dual coordinate ascent. In NIPS, pp. 3068–3076, 2014.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.

Kim, Yoon. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

Konečný, Jakub, McMahan, Brendan, and Ramage, Daniel. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.

Konečný, Jakub, McMahan, H Brendan, Yu, Felix X, Richtárik, Peter, Suresh, Ananda Theertha, and Bacon, Dave. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

Kotla, Ramakrishna, Alvisi, Lorenzo, Dahlin, Mike, Clement, Allen, and Wong, Edmund. Zyzzyva: Speculative byzantine fault tolerance. In ACM SIGOPS Operating Systems Review, volume 41, pp. 45–58. ACM, 2007.

Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105, 2012.

Lamport, Leslie, Shostak, Robert, and Pease, Marshall. The byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3):382–401, 1982.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lee, Kangwook, Lam, Maximilian, Pedarsani, Ramtin, Papailiopoulos, Dimitris, and Ramchandran, Kannan. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 2017.

Li, Mu, Andersen, David G, Park, Jun Woo, Smola, Alexander J, Ahmed, Amr, Josifovski, Vanja, Long, James, Shekita, Eugene J, and Su, Bor-Yiing. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pp. 583–598, 2014.

Li, Songze, Maddah-Ali, Mohammad Ali, and Avestimehr, A Salman. Coded MapReduce. In Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, pp. 964–971, 2015.

Liu, Ji, Wright, Steve, Re, Christopher, Bittorf, Victor, and Sridhar, Srikrishna. An asynchronous parallel stochastic coordinate descent algorithm. In ICML, pp. 469–477, 2014.

Mania, Horia, Pan, Xinghao, Papailiopoulos, Dimitris, Recht, Benjamin, Ramchandran, Kannan, and Jordan, Michael I. Perturbed iterate analysis for asynchronous stochastic optimization. NIPS, OPT, 2015.

Pang, Bo and Lee, Lillian. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, pp. 115–124, 2005.

Paszke, Adam, Gross, Sam, Chintala, Soumith, and Chanan, Gregory. PyTorch, 2017a.

Paszke, Adam, Gross, Sam, Chintala, Soumith, Chanan, Gregory, Yang, Edward, DeVito, Zachary, Lin, Zeming, Desmaison, Alban, Antiga, Luca, and Lerer, Adam. Automatic differentiation in PyTorch. 2017b.

Pippenger, Nicholas. Reliable computation by formulas in the presence of noise. IEEE Transactions on Information Theory, 34(2):194–197, 1988.

Raviv, Netanel, Tamo, Itzhak, Tandon, Rashish, and Dimakis, Alexandros G. Gradient coding from cyclic MDS codes and expander graphs. arXiv preprint arXiv:1707.03858, 2017.

Recht, Benjamin, Re, Christopher, Wright, Stephen, and Niu, Feng. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pp. 693–701, 2011.

Reisizadeh, Amirhossein, Prakash, Saurav, Pedarsani, Ramtin, and Avestimehr, Salman. Coded computation over heterogeneous clusters. In ISIT, pp. 2408–2412. IEEE, 2017.

Shah, Nihar B, Lee, Kangwook, and Ramchandran, Kannan. When do redundant requests reduce latency? IEEE Transactions on Communications, 64(2):715–722, 2016.

Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Tandon, Rashish, Lei, Qi, Dimakis, Alexandros G, and Karampatziakis, Nikos. Gradient coding: Avoiding stragglers in distributed learning. In ICML, pp. 3368–3376, 2017.

Von Neumann, John. Probabilistic logics and the synthesis of reliable organisms from unreliable components. Automata Studies, 34:43–98, 1956.

Yang, Yaoqing, Grover, Pulkit, and Kar, Soummya. Coded distributed computing for inverse problems. In NIPS, pp. 709–719, 2017.

Zaharia, Matei, Konwinski, Andy, Joseph, Anthony D, Katz, Randy H, and Stoica, Ion. Improving MapReduce performance in heterogeneous environments. In OSDI, volume 8, pp. 7, 2008.

Zhang, Sixin, Choromanska, Anna E, and LeCun, Yann. Deep learning with elastic averaging SGD. In NIPS, pp. 685–693, 2015.