Published as a conference paper at ICLR 2021
LARGE BATCH SIMULATION FOR DEEP REINFORCEMENT LEARNING
Brennan Shacklett1∗  Erik Wijmans2  Aleksei Petrenko3,4  Manolis Savva5  Dhruv Batra2  Vladlen Koltun3  Kayvon Fatahalian1
1 Stanford University   2 Georgia Institute of Technology   3 Intel Labs   4 University of Southern California   5 Simon Fraser University
∗ Correspondence to bps@cs.stanford.edu
ABSTRACT
We accelerate deep reinforcement learning-based training in visually complex 3D
environments by two orders of magnitude over prior work, realizing end-to-end
training speeds of over 19,000 frames of experience per second on a single GPU
and up to 72,000 frames per second on a single eight-GPU machine. The key
idea of our approach is to design a 3D renderer and embodied navigation simulator around the principle of “batch simulation”: accepting and executing large
batches of requests simultaneously. Beyond exposing large amounts of work at
once, batch simulation allows implementations to amortize in-memory storage of
scene assets, rendering work, data loading, and synchronization costs across many
simulation requests, dramatically improving the number of simulated agents per
GPU and overall simulation throughput. To balance DNN inference and training
costs with faster simulation, we also build a computationally efficient policy DNN
that maintains high task performance, and modify training algorithms to maintain
sample efficiency when training with large mini-batches. By combining batch
simulation and DNN performance optimizations, we demonstrate that PointGoal
navigation agents can be trained in complex 3D environments on a single GPU in
1.5 days to 97% of the accuracy of agents trained on a prior state-of-the-art system using a 64-GPU cluster over three days. We provide open-source reference
implementations of our batch 3D renderer and simulator to facilitate incorporation
of these ideas into RL systems.
1 INTRODUCTION
Speed matters. It is now common for modern reinforcement learning (RL) algorithms leveraging
deep neural networks (DNNs) to require billions of samples of experience from simulated environments (Wijmans et al., 2020; Petrenko et al., 2020; OpenAI et al., 2019; Silver et al., 2017; Vinyals
et al., 2019). For embodied AI tasks such as visual navigation, where the ultimate goal for learned
policies is deployment in the real world, learning from realistic simulations is important for successful transfer of learned policies to physical robots. In these cases simulators must render detailed 3D
scenes and simulate agent interaction with complex environments (Kolve et al., 2017; Dosovitskiy
et al., 2017; Savva et al., 2019; Xia et al., 2020; Gan et al., 2020).
Evaluating and training a DNN on billions of simulated samples is computationally expensive. For
instance, the DD-PPO system (Wijmans et al., 2020) used 64 GPUs over three days to learn from
2.5 billion frames of experience and achieve near-perfect PointGoal navigation in 3D scanned environments of indoor spaces. At an even larger distributed training scale, OpenAI Five used over
50,000 CPUs and 1000 GPUs to train Dota 2 agents (OpenAI et al., 2019). Unfortunately, experiments at this scale are out of reach for most researchers. This problem will only grow worse as the
field explores more complex tasks in more detailed environments.
Many efforts to accelerate deep RL focus on improving the efficiency of DNN evaluation and
training – e.g., by “centralizing” computations to facilitate efficient batch execution on GPUs or
TPUs (Espeholt et al., 2020; Petrenko et al., 2020) or by parallelizing across GPUs (Wijmans et al.,
2020). However, most RL platforms still accelerate environment simulation by running many copies
of off-the-shelf, unmodified simulators, such as simulators designed for video game engines (Bellemare et al., 2013; Kempka et al., 2016; Beattie et al., 2016; Weihs et al., 2020), on large numbers of CPUs or GPUs.

[Figure 1: bar chart of frames of experience per second. WIJMANS20: 140 (RGB), 180 (Depth); BPS (our system): 13,300 (RGB), 19,900 (Depth).]
Figure 1: We train agents to perform PointGoal navigation in visually complex Gibson (Xia et al., 2018) and Matterport3D (Chang et al., 2017) environments such as the ones shown here. These environments feature detailed scans of real-world scenes composed of up to 600K triangles and high-resolution textures. Our system is able to train agents using 64×64 depth sensors (a high-resolution example is shown on the left) in these environments at 19,900 frames per second, and agents with 64×64 RGB cameras at 13,300 frames per second on a single GPU.

This approach is a simple and productive way to improve simulation throughput, but it makes inefficient use of computation resources. For example, when rendering complex
environments (Kolve et al., 2017; Savva et al., 2019; Xia et al., 2018), a single simulator instance
might consume gigabytes of GPU memory, limiting the total number of instances to far below the
parallelism afforded by the machine. Further, running many simulator instances (in particular when
they are distributed across machines) can introduce overhead in synchronization and communication
with other components of the RL system. Inefficient environment simulation is a major reason RL
platforms typically require scale-out parallelism to achieve high end-to-end system throughput.
In this paper, we crack open the simulation black box and take a holistic approach to co-designing a
3D renderer, simulator, and RL training system. Our key contribution is batch simulation for RL: designing high-throughput simulators that accept large batches of requests as input (aggregated across
different environments, potentially with different assets) and efficiently execute the entire batch at
once. Exposing work en masse facilitates a number of optimizations: we reduce memory footprint by sharing scene assets (geometry and textures) across rendering requests (enabling orders of
magnitude more environments to be rendered simultaneously on a single GPU), amortize rendering
work using GPU commands that draw triangles from multiple scenes at once, hide latency of scene
I/O, and exploit batch transfer to reduce data communication and synchronization costs between
the simulator, DNN inference, and training. To further improve end-to-end RL speedups, the DNN
workload must be optimized to match high simulation throughput, so we design a computationally
efficient policy DNN that still achieves high task performance in our experiments. Large-batch simulation increases the number of samples collected per training iteration, so we also employ techniques
from large-batch supervised learning to maintain sample efficiency in this regime.
We evaluate batch simulation on the task of PointGoal navigation (Anderson et al., 2018) in 3D
scanned Gibson and Matterport3D environments, and show that end-to-end optimization of batched
rendering, simulation, inference, and training yields a 110× speedup over state-of-the-art prior systems, while achieving 97% of the task performance for depth-sensor-driven agents and 91% for
RGB-camera-driven agents. Concretely, we demonstrate sample generation and training at over
19,000 frames of experience per second on a single GPU.1 In real-world terms, a single GPU is
capable of training a virtual agent on 26 years of experience in a single day.2 This new performance regime significantly improves the accessibility and efficiency of RL research in realistic 3D
environments, and opens new possibilities for more complex embodied tasks in the future.

1 Samples of experience used for learning, not ‘frameskipped’ metrics typically used in Atari/DMLab.
2 Calculated based on the rate at which a physical robot (LoCoBot (Carnegie Mellon University, 2019)) collects observations when operating constantly at maximum speed (0.5 m/s) and capturing 1 frame every 0.25 m.
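As a rough back-of-the-envelope check of this figure (our arithmetic, using the LoCoBot collection rate from footnote 2, under which a real robot gathers 2 frames of experience per second):
$$\frac{19{,}000\ \text{simulated frames/s}}{2\ \text{robot frames/s}} \approx 9{,}500\times \text{ real time}, \qquad \frac{9{,}500\ \text{robot-days per training day}}{365\ \text{days/year}} \approx 26\ \text{years of experience per day}.$$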
2 RELATED WORK
Systems for high-performance RL. Existing systems for high-performance RL have primarily focused on improving the efficiency of DNN components of the workload (policy inference and optimization) and use a simulator designed for efficient single agent simulation as a black box. For
example, Impala and Ape-X used multiple worker processes to asynchronously collect experience
for a centralized learner (Espeholt et al., 2018; Horgan et al., 2018). SEED RL and Sample Factory
built upon this idea and introduced inference workers that centralize network inference, thereby allowing it to be accelerated by GPUs or TPUs (Espeholt et al., 2020; Petrenko et al., 2020). DD-PPO
proposed a synchronous distributed system for similar purposes (Wijmans et al., 2020). A number
of efficient implementations of these ideas have been proposed as part of RL frameworks or in other
deep learning libraries (Liang et al., 2018; Stooke & Abbeel, 2019; Küttler et al., 2019).
We extend the idea of centralizing inference and learning to simulation by cracking open the simulator black box and designing a new simulation architecture for RL workloads. Our large-batch
simulator is a drop-in replacement for large numbers of (non-batched) simulation workers, making
it synergistic with existing asynchronous and synchronous distributed training schemes. It reduces
the number of processes and communication overhead needed for asynchronous methods and eliminates separate simulation worker processes altogether for synchronous methods. We demonstrate
this by combining our system with DD-PPO (Wijmans et al., 2020).
Concurrently with our work, CuLE, a GPU-accelerated reimplementation of the Arcade Learning Environment (ALE), demonstrates the benefits of centralized batch simulation (Dalton et al., 2020).
While both our work and CuLE enable wide-batch execution of their respective simulation workloads, our focus is on high-performance batch rendering of complex 3D environments. This involves
optimizations (GPU-driven pipelined geometry culling, 3D asset sharing, and asynchronous data
transfer) not addressed by CuLE due to the simplicity of rendering Atari-like environments. Additionally, like CuLE, we observe that the large training batches produced by batch simulation reduce
RL sample efficiency. Our work goes further and leverages large-batch optimization techniques
from the supervised learning literature to mitigate the loss of sample efficiency without shrinking
batch size.
Large mini-batch optimization. A consequence of large batch simulation is that more experience
is collected between gradient updates. This provides the opportunity to accelerate learning via large
mini-batch optimization. In supervised learning, using large mini-batches during optimization typically decreases the generalization performance of models (Keskar et al., 2017). Goyal et al. (2017)
demonstrated that model performance can be improved by scaling the learning rate proportionally
with the batch size and “warming-up” the learning rate at the start of training. You et al. (2017) proposed an optimizer modification, LARS, that adaptively scales the learning rate at each layer, and
applied it to SGD to improve generalization further. In reinforcement learning and natural language
processing, the Adam optimizer (Kingma & Ba, 2015) is often used instead of SGD. Lamb (You
et al., 2020) combines LARS (You et al., 2017) with Adam (Kingma & Ba, 2015). We do not find
that large mini-batch optimization harms generalization in reinforcement learning, but we do find it
decreases sample efficiency. We adapt the techniques proposed above – learning rate scaling (You
et al., 2017) and the Lamb optimizer (You et al., 2020) – to improve sample efficiency.
Simulators for machine learning. Platforms for simulating realistic environments for model training fall into two broad categories: those built on top of pre-existing game engines (Kolve et al., 2017;
Dosovitskiy et al., 2017; Lee et al., 2019; Gan et al., 2020; James et al., 2020), and those built from
scratch using open-source 3D graphics and physics libraries (Savva et al., 2017; 2019; Xia et al.,
2018; 2020; Xiang et al., 2020; Zeng et al., 2020). While improving simulator performance has
been a focus of this line of work, it has been evaluated in a narrow sense (i.e. frame rate benchmarks
for predetermined agent trajectories), not accounting for the overall performance of end-to-end RL
training. We instead take a holistic approach to co-design rendering and simulation modules and
their interfaces to the RL training system, obtaining significant gains in end-to-end throughput over
the state of the art.
3 SYSTEM DESIGN & IMPLEMENTATION
Batch simulation accelerates rollout generation during RL training by processing many simulated
environments simultaneously in large batches. Fig. 2 illustrates how batch simulation interacts with
policy inference to generate rollouts. Simulation for sensorimotor agents, such as the PointGoal
navigation task targeted by our implementation, can be separated into two tasks: determining the
next environment state given an agent’s actions and rendering its sensory observations. Therefore,
our design utilizes two components: a batch simulator that performs geodesic distance and navigation mesh (Snook, 2000) computations on the CPU, and a batch renderer that renders complex 3D
environments on the GPU.
During rollout generation, batches of requests are passed between these components – given N
agents, the simulator produces a batch of N environment states. Next, the renderer processes the
batch of environment states by simultaneously rendering N frames and exposing the result directly
in GPU memory. Agent observations (from both the simulator and the renderer) are then provided
as a batch to policy inference to determine the next actions for the N agents.
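As an illustration of this loop, the three components compose roughly as follows (a minimal Python sketch; sim, renderer, and policy are placeholders for the batch simulator, batch renderer, and policy DNN, not our actual interfaces):

# Schematic rollout generation for a batch of N environments.
def generate_rollout(sim, renderer, policy, actions, rollout_length):
    steps = []
    for _ in range(rollout_length):
        states = sim.step(actions)             # CPU: advance all N environments at once
        frames = renderer.render(states)       # GPU: render N observations, kept in GPU memory
        actions = policy.act(frames, states)   # GPU: one batched inference call -> N actions
        steps.append((states, frames, actions))
    return steps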
[Figure 2: architecture diagram. The Learning + Inference component sends N actions to the Batch Simulator (CPU worker threads operating on per-scene metadata), which produces N environment states; the Batch Renderer (GPU, holding Scene Assets 1..K while new scene assets load asynchronously) consumes the N states and produces N frames.]
Figure 2: The batch simulation and rendering architecture. Each component communicates at the granularity of batches of N elements (e.g., N=1024), minimizing communication overheads and allowing components to independently parallelize their execution over each batch. To fit the working set for large batches on the GPU, the renderer maintains K ≪ N unique scene assets in GPU memory and shares these assets across subsets of the N environments in a batch. To enable experience collection across a diverse set of environments, the renderer continuously updates the set of K in-memory scene assets using asynchronous transfers that overlap rollout generation and learning.
The key idea is that the batch simulator and renderer implementations (in addition to the DNN workload) take responsibility for their own parallelization. Large batch sizes (values of N on the order
of hundreds to thousands of environments) provide opportunities for implementations to efficiently
utilize parallel execution resources (e.g., GPUs) as well as amortize processing, synchronization,
and data communication costs across many environments. The remainder of this section describes
the design and key implementation details of our system’s batch simulator and batch renderer, as
well as contributions that improve the efficiency of policy inference and optimization in this regime.
3.1 BATCH ENVIRONMENT SIMULATION
Our CPU-based batch simulator executes geodesic distance and navigation mesh computations in
parallel for a large batch of environments. Due to differences in navigation mesh complexity across
environments, the time to perform simulation may differ per environment. This variance is the
source of workload imbalance problems in parallel synchronous RL systems (Wijmans et al., 2020;
Savva et al., 2019) and one motivation for recent asynchronous designs (Petrenko et al., 2020; Espeholt et al., 2020; 2018). To ensure good workload balance, our batch simulator operates on large
batches that contain significantly more environments than the number of available CPU cores and
dynamically schedules work onto cores using a pool of worker threads (simulation for each environment is carried out sequentially). Worker threads report simulation results into a designated
per-environment slot in a results buffer that is communicated to the renderer via a single batched
request when all environment simulation for a batch is complete. To minimize CPU memory usage,
the simulator only loads navigation meshes and does not utilize the main rendering assets.
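A minimal sketch of this dynamic scheduling pattern (illustrative only; the real simulator uses a persistent C++ thread pool, and env.step stands in for the per-environment geodesic-distance and navigation-mesh update):

from concurrent.futures import ThreadPoolExecutor

def simulate_batch(envs, actions, num_workers):
    # N environments (N much larger than num_workers) are dynamically assigned
    # to worker threads, so environments with expensive navigation meshes do not
    # stall the batch the way a static one-worker-per-environment mapping would.
    results = [None] * len(envs)
    def run_one(i):
        results[i] = envs[i].step(actions[i])  # each environment is simulated sequentially
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        list(pool.map(run_one, range(len(envs))))
    return results  # the filled results buffer is sent to the renderer as one batched request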
3.2 BATCH RENDERING
A renderer for producing RL agent observations in scanned real-world environments must efficiently
synthesize many low-resolution renderings (e.g., 64×64 pixels) of scenes featuring high-resolution
textures and complex meshes.3 Low-resolution output presents challenges for GPU acceleration.
Rendering images one at a time produces too little rendering work to efficiently utilize a modern
GPU rendering pipeline’s parallel processing resources. Rendering many environments concurrently
but individually (e.g., from different worker threads or processes) exposes more rendering work to
the GPU, but incurs the overhead of sending the GPU many fine-grained rendering commands.

3 The Matterport3D dataset contains up to 600K triangles per 3D scan.
To address the problem of rendering many small images efficiently, our renderer combines the GPU
commands required to render observations for an entire simulation batch of N environments into a
single rendering request to the GPU – effectively drawing the entire batch as a single large frame
(individual environment observations are tiles in the image). This approach exposes large amounts
of rendering work to the GPU and amortizes GPU pipeline configuration and rendering overhead
over an entire batch. Our implementation makes use of modern GPU pipeline features (Khronos
Group, 2017) that allow rendering tasks that access different texture and mesh assets to proceed as
part of a single large operation (avoiding GPU pipeline flushes due to pipeline state reconfiguration).
Scene asset sharing. Efficiently utilizing a GPU requires batches to be large (we use N up to 1024).
However, geometry and texture assets for a single environment may be gigabytes in size, so naively
loading unique assets for each environment in a large batch would exceed available GPU memory.
Our implementation allows multiple environments in a batch to reference the same 3D scene assets in
GPU memory. Specifically, our system materializes K unique assets in GPU memory (K ≪ N ) and
constructs batches of N environments that reference these assets. Asset reuse decreases the diversity
of training experiences in a batch, so to preserve diversity we limit the ratio of N to K in any one
batch to 32, and continuously rotate the set of K assets in GPU memory. The renderer refreshes
the set of K assets by asynchronously loading new scene assets into GPU memory during the main
rollout generation and learning loop. As episodes complete, new environments are constructed
to reference the newly loaded assets, and assets no longer referenced by active environments are
removed from GPU memory. This design allows policy optimization to learn from an entire dataset
of assets without exceeding GPU memory or incurring the latency costs of frequent asset loading.
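A sketch of the bookkeeping this implies (a hypothetical data structure for illustration; the real renderer tracks residency in C++ and overlaps asset uploads with rollout generation):

class SceneAssetPool:
    # Tracks K resident scene assets, each shared by at most max_share of the
    # N environments in a batch (max_share = 32 in our configuration).
    def __init__(self, max_share=32):
        self.max_share = max_share
        self.ref_counts = {}  # scene_id -> number of active environments

    def acquire(self, resident_scenes):
        # Assign a newly constructed environment to an under-subscribed resident scene.
        for scene_id in resident_scenes:
            if self.ref_counts.get(scene_id, 0) < self.max_share:
                self.ref_counts[scene_id] = self.ref_counts.get(scene_id, 0) + 1
                return scene_id
        return None  # all resident scenes full: wait for an asynchronously loaded scene

    def release(self, scene_id):
        # Called when an episode completes; scenes with no remaining references
        # can be evicted from GPU memory and replaced with a new asset.
        self.ref_counts[scene_id] -= 1
        return self.ref_counts[scene_id] == 0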
Pipelined geometry culling. When rendering detailed geometry to low-resolution images, most
scene triangles cover less than one pixel. As a result, rendering performance is determined by the
rate the GPU’s rasterization hardware processes triangles, not the rate the GPU can shade covered
pixels. To reduce the number of triangles the GPU pipeline must process, the renderer uses idle
GPU cores to identify and discard geometry that lies outside the agent’s view—a process known
as frustum culling (Akenine-Möller et al., 2018). Our implementation pipelines frustum culling
operations (implemented using GPU compute shaders) with rendering for different environments in
a batch. This pipelined design increases GPU utilization by concurrently executing culling work on
the GPU’s programmable cores and rendering work on the GPU’s rasterization hardware.
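The interleaving can be pictured as software pipelining over the environments in a batch (a conceptual sketch only: in the real renderer both stages are GPU command submissions, and the overlap comes from the GPU executing compute-shader culling concurrently with rasterization, not from host-side control flow):

def render_batch_pipelined(environments, submit_cull, submit_draw):
    # Issue culling for environment i+1 before drawing environment i, so the
    # GPU can overlap culling (programmable cores) with rasterization
    # (fixed-function hardware).
    visible = submit_cull(environments[0])
    for i in range(len(environments)):
        next_visible = submit_cull(environments[i + 1]) if i + 1 < len(environments) else None
        submit_draw(visible)
        visible = next_visible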
3.3 POLICY DNN ARCHITECTURE
High-throughput batch simulation creates a need for high-throughput policy DNN inference. Therefore, we develop a policy DNN architecture designed to achieve an efficient balance between high
task performance and low computational cost. Prior work in PointGoal navigation (Wijmans et al.,
2020) used a policy DNN design where a visual encoder CNN processes an agent’s visual sensory
information followed by an LSTM (Hochreiter & Schmidhuber, 1997) that determines the policy’s
actions. Our policy DNN uses this core design augmented with several performance optimizations.
First, we reduce DNN effective input resolution from 128×128 (Wijmans et al., 2020) to 64×64. Beyond this simple optimization, we choose a shallow visual encoder CNN – a nine-layer ResNet (He
et al., 2016) (ResNet18 with every other block removed), rather than the 50 layer (or larger) ResNets
used by prior work. To counteract reduced task performance from the ResNet’s relatively low capacity, all stages include Squeeze-Excite (SE) blocks (Hu et al., 2018) with r=16. Additionally,
we use a SpaceToDepth stem (Ridnik et al., 2020), which we find performs equally to the standard
Conv+MaxPool stem while using less GPU memory and compute.
Finally, we avoid the use of normalization layers in the ResNet as these require spatial reductions
over the feature maps, preventing layer-fusion optimizations. Instead, the CNN utilizes Fixup Initialization (Zhang et al., 2019) to improve training stability. Fixup Initialization replaces expensive
normalization layers after each convolution with cheap elementwise multiplication and addition.
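A condensed PyTorch sketch of the encoder's main ingredients follows (an illustrative reimplementation, not our exact code; layer widths and the Fixup scaling details are simplified):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceToDepthStem(nn.Module):
    # Rearranges each 4x4 spatial patch into channels, then applies a single
    # convolution; cheaper in memory and compute than a Conv+MaxPool stem.
    def __init__(self, in_ch, out_ch, block=4):
        super().__init__()
        self.block = block
        self.conv = nn.Conv2d(in_ch * block * block, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(F.pixel_unshuffle(x, self.block))

class SqueezeExcite(nn.Module):
    # Channel re-weighting with reduction ratio r=16.
    def __init__(self, ch, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // r), nn.ReLU(inplace=True),
            nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))
        return x * w[:, :, None, None]

class SEBlockNoNorm(nn.Module):
    # Residual block without normalization layers; a learned scalar and a
    # zero-initialized final convolution loosely stand in for Fixup initialization.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.se = SqueezeExcite(out_ch)
        self.scale = nn.Parameter(torch.ones(1))
        self.skip = (nn.Conv2d(in_ch, out_ch, 1, stride=stride)
                     if (stride != 1 or in_ch != out_ch) else nn.Identity())
        nn.init.zeros_(self.conv2.weight)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.se(self.conv2(out)) * self.scale
        return F.relu(out + self.skip(x))

class SEResNet9Encoder(nn.Module):
    # "ResNet18 with every other block removed": one residual block per stage,
    # for nine convolutional layers including the stem.
    def __init__(self, in_ch=1, widths=(64, 128, 256, 512)):
        super().__init__()
        self.stem = SpaceToDepthStem(in_ch, widths[0])
        blocks, prev = [], widths[0]
        for i, w in enumerate(widths):
            blocks.append(SEBlockNoNorm(prev, w, stride=1 if i == 0 else 2))
            prev = w
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):  # x: (batch, in_ch, 64, 64)
        return self.blocks(self.stem(x)).mean(dim=(2, 3))

In the full policy, the pooled encoder features are combined with the GPS+Compass input and fed to the LSTM, following the overall design of Wijmans et al. (2020).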
3.4 LARGE MINI-BATCH POLICY OPTIMIZATION
In on-policy reinforcement learning, policy optimization utilizes trajectories of experience to reduce
bias and for backpropagation-through-time. When generating trajectories of length L with a simulation batch size of N , a rollout will have N ×L steps of experience. Therefore, a consequence of
simulation with large N is that more experience is collected per rollout.
Large N presents the opportunity to utilize large mini-batches to improve the throughput of policy
optimization; however, throughput must be balanced against generalization and sample efficiency
to ensure that reduced task performance does not offset the throughput gains. Although large minibatch training is known to hurt generalization in supervised learning (Keskar et al., 2017), we do not
see evidence of this for RL. Conversely, we do find that sample efficiency for PointGoal navigation
is harmed by naively increasing N . Fortunately, we are able to mitigate this loss of sample efficiency
using techniques for improving generalization from the large mini-batch optimization literature.
First, we scale the learning rate by $\sqrt{B / B_{\text{base}}}$, where $B_{\text{base}} = 256$ and $B$, the training batch size, is $N \times L$ divided by the number of mini-batches per training iteration. We find it beneficial to use the scaled learning rate immediately instead of ‘warming-up’ the learning rate (Goyal et al., 2017). Second,
we use and adapt the Lamb optimizer (You et al., 2020). Lamb is a modification to Adam (Kingma
& Ba, 2015) that applies LARS (You et al., 2017) to the step direction estimated by Adam to better
handle high learning rates. Since the Adam optimizer is often used with PPO (Schulman et al.,
2017), Lamb is a natural choice. Given the Adam step direction $s_t^{(k)}$ for weights $\theta_t^{(k)}$,
$$\theta_{t+1}^{(k)} = \theta_t^{(k)} - \eta_t\, r_t^{(k)} \left(s_t^{(k)} + \lambda \theta_t^{(k)}\right), \qquad r_t^{(k)} = \frac{\phi(\|\theta_t^{(k)}\|)}{\|s_t^{(k)} + \lambda \theta_t^{(k)}\|}, \tag{1}$$
where $\eta_t$ is the learning rate and $\lambda$ is the weight decay coefficient. We set $\phi(\|\theta_t^{(k)}\|)$ as $\min\{\|\theta_t^{(k)}\|, 10.0\}$ and introduce an additional clip on the trust ratio $r_t^{(k)}$:
$$r_t^{(k)} = \min\left\{\max\left\{\frac{\phi(\|\theta_t^{(k)}\|)}{\|s_t^{(k)} + \lambda \theta_t^{(k)}\|},\, \rho\right\},\, \frac{1}{\rho}\right\}. \tag{2}$$
We find the exact value of $\rho$ to be flexible (we observed similar training with $\rho \in \{10^{-2}, 10^{-3}, 10^{-4}\}$) and also observed that this clip is only influential at the start of training, suggesting that there is an initialization scheme where it is unnecessary.
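For concreteness, the two modifications can be sketched as follows (illustrative PyTorch; the real implementation modifies a Lamb optimizer, and names such as clipped_lamb_step are ours):

import torch

def scaled_lr(base_lr, batch_size, base_batch_size=256):
    # Square-root learning-rate scaling, applied from the start of training (no warm-up).
    return base_lr * (batch_size / base_batch_size) ** 0.5

def clipped_lamb_step(param, adam_step, lr, weight_decay, rho=1e-3):
    # One per-tensor update using the clipped trust ratio of Eq. (2); computing
    # adam_step (the Adam step direction s_t for this tensor) is omitted here.
    update = adam_step + weight_decay * param
    phi = torch.clamp(param.norm(), max=10.0)  # phi(||theta||) = min(||theta||, 10)
    trust_ratio = torch.clamp(phi / update.norm(), min=rho, max=1.0 / rho)  # Eq. (2)
    return param - lr * trust_ratio * update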
4 RESULTS
We evaluate the impact of our contributions on end-to-end training speed and task performance
by training PointGoal navigation agents in the complex Gibson (Xia et al., 2018) and Matterport3D (Chang et al., 2017) environments. The fastest published end-to-end training performance in
these environments is achieved with the synchronous RL implementation presented with DD-PPO
(Wijmans et al., 2020). Therefore, both our implementation and the baselines we compare against
are synchronous PPO-based RL systems.
4.1 EXPERIMENTAL SETUP
PointGoal navigation task. We train and evaluate agents via the same procedure as Wijmans et al.
(2020): agents are trained for PointGoalNav (Anderson et al., 2018) with either a Depth sensor
or an RGB camera. Depth agents are trained on Gibson-2plus (Xia et al., 2018) and, consistent with
Wijmans et al. (2020), RGB agents are also trained on Matterport3D (Chang et al., 2017). RGB camera
simulation requires textures for the renderer, increasing the GPU memory consumed by each scene
significantly. Both classes of agent are trained on 2.5 billion simulated samples of experience.
Agents are evaluated on the Gibson dataset (Xia et al., 2018). We use two metrics: Success, whether
or not the agent reached the goal, and SPL (Anderson et al., 2018), a measure of both Success and
efficiency of the agent’s path. We perform policy evaluation using Habitat-Sim (Savva et al., 2019),
unmodified for direct comparability to prior work.
Batch Processing Simulator (BPS). We provide an RL system for learning PointGoalNav built
around the batch simulation techniques and system-wide optimizations described in Section 3. The
remainder of the paper refers to this system as BPS (Batch Processing Simulator). To further accelerate the policy DNN workload, BPS uses half-precision inference and mixed-precision training.
Baseline. The primary baseline for this work is Wijmans et al. (2020)’s open-source PointGoalNav implementation, which uses Habitat-Sim (Savva et al., 2019) – the prior state of the art in high-performance simulation of realistic environments such as Gibson. Unlike BPS, multiple environments are simulated simultaneously using parallel worker processes that render frames at 256×256 pixels before downsampling to 128×128 for the visual encoder. The fastest published configuration uses a ResNet50 visual encoder. Subsequent sections refer to this implementation as WIJMANS20.

Ablations. As an additional baseline, we provide WIJMANS++, which uses the optimized SE-ResNet9-based policy DNN (including performance optimizations and resolution reduction relative to WIJMANS20) developed for BPS, but otherwise uses the same system design and simulator as WIJMANS20 (with a minor modification to not load textures for Depth agents). WIJMANS++ serves to isolate the impact of two components of BPS: first, the low-level DNN efficiency improvements, and, more importantly, the performance of batch simulation versus WIJMANS20’s independent simulation worker design. Additionally, to ablate the effect of our encoder CNN architecture optimizations, we include a variant of BPS, BPS-R50, that uses the same ResNet50 visual encoder and input resolution as WIJMANS20, while maintaining the other optimizations of BPS.
Multi-GPU training. To support multi-GPU training, all three systems replace standard PPO with
DD-PPO (Wijmans et al., 2020). DD-PPO scales rollout generation and policy optimization across
all available GPUs, scaling the number of environments simulated and the number of samples gathered between training iterations proportionally. We report results with eight GPUs.
Sensor | System    | CNN        | Agent Res. | RTX 3090 | RTX 2080Ti | Tesla V100 | 8×2080Ti | 8×V100
Depth  | BPS       | SE-ResNet9 | 64         | 19900    | 12900      | 12600      | 72000    | 46900
Depth  | BPS-R50   | ResNet50   | 128        | 2300     | 1400       | 2500       | 10800    | 18400
Depth  | WIJMANS++ | SE-ResNet9 | 64         | 2800     | 2800       | 2100       | 9300     | 13100
Depth  | WIJMANS20 | ResNet50   | 128        | 180      | 230        | 200        | 1600     | 1360
RGB    | BPS       | SE-ResNet9 | 64         | 13300    | 8400       | 9000       | 43000    | 37800
RGB    | BPS-R50   | ResNet50   | 128        | 2000     | 1050       | 2200       | 6800     | 14300
RGB    | WIJMANS++ | SE-ResNet9 | 64         | 990      | 860        | 1500       | 4600     | 8400
RGB    | WIJMANS20 | ResNet50   | 128        | 140      | OOM        | 190        | OOM      | 1320

Table 1: System performance. Average frames per second (FPS, measured as samples of experience processed per second) achieved by each system. BPS achieves a speedup of 110× over WIJMANS20 on Depth experiments (19,900 vs. 180 FPS) and 95× on RGB experiments (13,300 vs. 140 FPS) on an RTX 3090 GPU. OOM (out of memory) indicates that the RTX 2080Ti could not run WIJMANS20 with the published DD-PPO system parameters due to insufficient GPU memory.
Row | Sensor | System        | Validation SPL | Validation Success | Test SPL | Test Success
1   | Depth  | BPS           | 94.4±0.7       | 99.2±1.4           | 91.5     | 97.3
2   | Depth  | WIJMANS20     | 95.6±0.3       | 99.9±0.2           | 94.4     | 98.2
3   | RGB    | BPS           | 88.4±0.9       | 97.6±0.3           | 83.7     | 95.7
4   | RGB    | BPS @ 128×128 | 87.8±0.7       | 97.3±0.4           | 85.6     | 96.3
5   | RGB    | WIJMANS20     | 92.9           | 99.1               | 92.0     | 97.7

Table 2: Policy performance. SPL and Success of agents produced by BPS and WIJMANS20. The performance of the BPS agent is within the margin of error of the WIJMANS20 agent for Depth experiments on the validation set, and within five percent on RGB. BPS agents are trained on eight GPUs with aggregate batch size N=1024.
Determining batch size. The per-GPU batch size, N , controls a trade-off between memory usage,
sample efficiency, and speed. For BPS, N designates the batch size for simulation, inference, and
training. For WIJMANS20 and WIJMANS++, N designates the batch size for inference and training, as well as the number of simulation processes. WIJMANS20 sets N=4 for consistency with Wijmans et al. (2020). To maximize performance of single-GPU runs, BPS uses the largest batch size that fits in GPU memory, subject to the constraint that no one scene asset can be shared by more than 32 environments in the batch. In eight-GPU configurations, DD-PPO scales the number of parallel rollouts with the number of GPUs, so to maintain reasonable sample efficiency BPS limits per-GPU batch size to N=128, with K=4 active scenes per GPU. WIJMANS++ Depth experiments use N=64 (limited by system memory due to N separate processes running Habitat-Sim). Batch size in WIJMANS++ RGB experiments is limited by GPU memory (N ranges from 6 to 20 depending on the
GPU). Appendix B provides the batch sizes used in all experiments.
Benchmark evaluation. We report end-to-end performance benchmarks in terms of average frames
per second (FPS) achieved by each system. We measure FPS as the number of samples of experience
processed over 16,000 inference batches divided by the time to complete rollout generation and
training for those samples. In experiments that run at 128×128 pixel sensor resolution, rendering
occurs at 256×256 and is downsampled for the policy DNN to match the behavior of WIJMANS20
regardless of system, while 64×64 resolution experiments render without downsampling. Results
are reported across three models of NVIDIA GPUs: Tesla V100, GeForce RTX 2080Ti, and GeForce
RTX 3090. (The different GPUs are also accompanied by different CPUs, see Appendix C.)
4.2 END-TO-END TRAINING SPEED
Single-GPU performance. On a single GPU, BPS trains agents 45× (9000 vs. 190 FPS, Tesla V100) to 110× (19900 vs. 180 FPS, RTX 3090) faster than WIJMANS20 (Table 1). The greatest speedup was achieved using the RTX 3090, which trains Depth agents at 19,900 FPS and RGB agents at 13,300 FPS – a 110× and 95× increase over WIJMANS20, respectively. This 6,600 FPS performance drop from Depth to RGB is not caused by the more complex rendering workload, because the additional cost of fetching RGB textures is masked by the dominant cost of geometry processing. Instead, due to memory constraints, BPS must reduce the batch size (N) for RGB tasks, reducing the performance of all components (further detail in Section 4.4).
[Figure 3: plot of validation SPL (%) vs. wall-clock training time (hours, 0–48) for WIJMANS20, WIJMANS++, and BPS.]
Figure 3: SPL vs. wall-clock time (RGB agents) on a RTX 3090 over 48 hours (time required to reach 2.5 billion samples with BPS). BPS exceeds 80% SPL in 10 hours and achieves a significantly higher SPL than the baselines.

[Figure 4: plot of SPL (%) vs. wall-clock training time (hours) for aggregate batch sizes N = 256, 512, 1024, and 4096.]
Figure 4: SPL vs. wall-clock time (BPS training Depth agents over 2.5 billion samples on 8 Tesla V100s) for various batch sizes (N). N=256 finishes after 2× the wall-clock time as N=1024, but both achieve statistically similar SPL.
To assess how much of the BPS speedup is due to the SE-ResNet9 visual encoder and lower input resolution, we also compare BPS-R50 and WIJMANS20, which have matching encoder architecture and resolution. For Depth agents training on the RTX 3090, BPS-R50 still achieves greater than 10× performance improvement over WIJMANS20 (2,300 vs. 180 FPS), demonstrating the benefits of batch simulation even in DNN-heavy workloads. BPS-R50 is only 6× faster than WIJMANS20 on the RTX 2080Ti, since the ResNet50 encoder’s larger memory footprint requires batch size to be reduced from N=128 on the RTX 3090 (24 GB RAM) to N=64 on the RTX 2080Ti (11 GB RAM). Similarly, increasing DNN input resolution increases memory usage, forcing batch size to be decreased and reducing performance (Table A1).
The BPS batch simulation architecture is significantly faster than the WIJMANS++ design that uses multiple worker processes. When training Depth agents, BPS outperforms WIJMANS++ by 4.5× to 7.8×, with a greater speedup of 6× to 13× for RGB agents. Since BPS and WIJMANS++ use the same policy DNN and input resolution, this comparison isolates the performance advantage of batch simulation and rendering against an optimized version of the multiple-worker-process-based design: WIJMANS++ is up to 15× faster than WIJMANS20. The relative speedup of BPS for RGB agents is larger because WIJMANS++ does not share environment assets between simulator instances. Textures needed for RGB rendering significantly increase the memory footprint of each simulator instance and limit WIJMANS++ to as few as N=6 workers (compared to N=64 for Depth agents). Conversely, BPS shares 3D assets across environments and maintains a batch size of at least N=128 for RGB agents.
Multi-GPU performance. BPS achieves high end-to-end throughput when running in eight-GPU configurations: up to 72,000 FPS for Depth agents on eight RTX 2080Ti. Relative to WIJMANS20, BPS is 29× to 34× faster with eight Tesla V100s and 45× faster with eight RTX 2080Ti. These speedups are lower than the single-GPU configurations, because BPS reduces the per-GPU batch size in eight-GPU configurations to avoid large aggregate batches that harm sample efficiency. This leads to imperfect multi-GPU scaling for BPS: for Depth agents, each RTX 2080Ti is approximately 4000 FPS slower in an eight-GPU configuration than in a single-GPU configuration. Eight-GPU scaling for Depth is lower on the Tesla V100s (3.7×) compared to the 2080Ti (5.6×) because larger batch sizes are needed to utilize the large number of parallel compute units on the Tesla V100.
4.3 POLICY TASK PERFORMANCE
To understand how the system design and visual encoder architecture of BPS impact learning, we
evaluate the task performance of agents trained with BPS in an eight-GPU configuration with aggregate batch size of N =1024. For Depth agents, the reduction in encoder CNN depth results in a 1%
and 3% decrease in SPL on Val and Test respectively with a negligible Success change on Val and a
0.9 Success decrease on Test (Table 2, row 1 vs. 2). For RGB agents, BPS suffers a performance loss
of 3.8/1.3 SPL/Success on Val and 8.3/2.0 SPL/Success on Test (Table 2, row 3 vs. 4). Despite this
performance reduction, the RGB agent trained by BPS would have won the 2019 Habitat challenge by
4 SPL and is only beaten by WIJMANS20’s ResNet50-based policy on Test.
SPL vs. training time. BPS significantly outperforms the baselines in terms of wall-clock training
time to reach a given SPL. After 10 hours of training on a single RTX 3090, BPS reaches over 80%
SPL (on Val) while WIJMANS20 and WIJMANS++ reach only 40% and 65% SPL respectively (Fig. 3).
Furthermore, BPS converges within 1% of peak SPL at approximately 20 hours; conversely, neither
baseline reaches convergence within 48 hours. BPS converges to a lower final SPL in Fig. 3 than
Table 2, likely due to the tested single-GPU configuration differing in batch size and scene asset
swapping frequency compared to the eight-GPU configuration used to produce Table 2.
Effect of batch size. The end-to-end training efficiency of BPS is dependent on batch size (N ):
larger N will increase throughput and reduce wall-clock time to reach a given number of samples,
but may harm sample efficiency and final task performance at convergence. We evaluate this relationship by training Depth agents with BPS across a range of N . As shown in Fig. 4, all experiments
converge within 1% of the peak SPL achieved; however, N =256 halves total throughput compared
to N =1024 (the setting used elsewhere in the paper for eight-GPU configurations). At the high end,
N =4096 yields slightly worse SPL than N =1024 and is only 20% faster. Larger batch sizes also
require more memory for rollout storage and training, which is prohibitive for RGB experiments that
require significant GPU memory for texture assets. In terms of sample efficiency alone, Fig. A1
shows that smaller batch sizes have a slight advantage (without considering training speed).
4.4 RUNTIME BREAKDOWN
Fig. 5 provides a breakdown of time spent in each of the main components of the BPS system (µs per frame). Nearly 60% of BPS runtime on the RTX 3090 GPU (for both Depth and RGB) is spent in DNN inference and training, even when rendering complex 3D environments and using a small, low-cost policy DNN. This demonstrates the high degree of simulation efficiency achieved by BPS. Furthermore, the results in Table A2 for BPS-R50 show that, with the larger visual encoder, over 90% of per-frame time (on Depth tasks) is spent in the DNN workload (70% on learning).

[Figure 5: stacked bar chart of cumulative time (µs) per frame per GPU, split into Simulation, Rendering, Inference, and Learning, for RTX 3090 (Depth), RTX 3090 (RGB), V100 (Depth), and 8×V100 (Depth) configurations.]
Figure 5: BPS runtime breakdown. Inference represents policy evaluation cost during rollout generation. Learning represents the total cost of policy optimization.
Batch size (N ) heavily impacts DNN performance. DNN operations for Depth (N =1024) are 2×
faster than RGB (N =256) on the RTX 3090, because RGB must use a smaller batch size to fit texture
assets in GPU memory. The larger batch size improves GPU utilization for all system components.
A similar effect is visible when comparing the single-GPU and eight-GPU V100 breakdowns. BPS
reduces the per-GPU batch size from N =1024 to N =128 in eight-GPU experiments to maintain an
aggregate batch size of 1024 for sample efficiency. Further work in policy optimization to address
this learning limitation would improve multi-GPU scaling by allowing larger aggregate batch sizes.
5 DISCUSSION
We demonstrated that architecting an RL training system around the idea of batch simulation can
accelerate learning in complex 3D environments by one to two orders of magnitude over prior work.
With these efficiency gains, agents can be trained with billions of simulated samples from complex environments in about a day using only a single GPU. We believe these fast turnaround times
stand to make RL in realistic simulated environments accessible to a broad range of researchers,
increase the scale and complexity of tasks and environments that can be explored, and facilitate
new studies of how much visual realism is needed to learn a given task (e.g., dynamic lighting,
shadows, custom augmentations). To facilitate such efforts, our system is available open-source at
https://github.com/shacklettbp/bps-nav.
More generally, this work demonstrates the value of building RL systems around components that
have been specifically designed for RL workloads, not repurposed from other application domains.
We believe this philosophy should be applied to other components of future RL systems, in particular
to new systems for performing physics simulation in complex environments.
ACKNOWLEDGMENTS
This work was supported in part by NSF, DARPA, ONR YIP, ARO PECASE, Intel, and Facebook.
EW is supported in part by an ARCS fellowship. We thank NVIDIA for GPU equipment donations.
We also thank the Habitat team for helpful discussions and their support of this project.
REFERENCES
Tomas Akenine-Möller, Eric Haines, and Naty Hoffman. Real-time rendering. CRC Press, 2018.
Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta,
Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On
evaluation of embodied navigation agents. arXiv:1807.06757, 2018.
Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich
Küttler, Andrew Lefrancq, Simon Green, Vı́ctor Valdés, Amir Sadik, et al. Deepmind lab.
arXiv:1612.03801, 2016.
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47,
2013.
Carnegie Mellon University. Locobot: An open source low cost robot. https://locobot-website.netlify.com/, 2019.
Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva,
Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor
environments. In International Conference on 3D Vision (3DV), 2017. MatterPort3D dataset
license available at: http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf.
Steven Dalton, Iuri Frosio, and Michael Garland. Accelerating reinforcement learning through gpu
atari emulation. NeurIPS, 2020.
Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA:
An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning,
pp. 1–16, 2017.
Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam
Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with
importance weighted actor-learner architectures. In Proceedings of the International Conference
on Machine Learning (ICML), 2018.
Lasse Espeholt, Raphaël Marinier, Piotr Stanczyk, Ke Wang, and Marcin Michalski. Seed rl: Scalable and efficient deep-rl with accelerated central inference. In Proceedings of the International
Conference on Learning Representations (ICLR), 2020.
Chuang Gan, Jeremy Schwartz, Seth Alter, Martin Schrimpf, James Traer, Julian De Freitas, Jonas
Kubilius, Abhishek Bhandwaldar, Nick Haber, Megumi Sano, et al. Threedworld: A platform for
interactive multi-modal physical simulation. arXiv:2007.04954, 2020.
Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola,
Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2016.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8),
1997.
Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt,
and David Silver. Distributed prioritized experience replay. Proceedings of the International
Conference on Learning Representations (ICLR), 2018.
Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. Rlbench: The robot
learning benchmark & learning environment. IEEE Robotics and Automation Letters, 2020.
Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie,
Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. Highly scalable deep learning training system with
mixed-precision: Training ImageNet in four minutes. arXiv:1807.11205, 2018.
Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In IEEE Conference
on Computational Intelligence and Games, 2016.
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. Proceedings of the International Conference on Learning Representations (ICLR), 2017.
Khronos Group. The Vulkan specification. 2017.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Proceedings of
the International Conference on Learning Representations (ICLR), 2015.
Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel
Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An interactive 3D environment
for visual AI. arXiv:1712.05474, 2017.
Heinrich Küttler, Nantas Nardelli, Thibaut Lavril, Marco Selvatici, Viswanath Sivakumar, Tim
Rocktäschel, and Edward Grefenstette. Torchbeast: A pytorch platform for distributed rl.
arXiv:1910.03552, 2019.
Youngwoon Lee, Edward S Hu, Zhengyu Yang, Alex Yin, and Joseph J Lim. IKEA furniture
assembly environment for long-horizon complex manipulation tasks. arXiv:1911.07246, 2019.
Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph E.
Gonzalez, Michael I. Jordan, and Ion Stoica. RLlib: Abstractions for distributed reinforcement
learning. In International Conference on Machine Learning (ICML), 2018.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. Proceedings of the
International Conference on Learning Representations (ICLR), 2018.
OpenAI, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal
Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé
de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon
Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with large scale deep
reinforcement learning. 2019. URL https://arxiv.org/abs/1912.06680.
Aleksei Petrenko, Zhehui Huang, Tushar Kumar, Gaurav Sukhatme, and Vladlen Koltun. Sample factory: Egocentric 3D control from pixels at 100000 fps with asynchronous reinforcement
learning. Proceedings of the International Conference on Machine Learning (ICML), 2020.
Tal Ridnik, Hussam Lawen, Asaf Noy, and Itamar Friedman. TResNet: High performance GPU-dedicated architecture. arXiv:2003.13630, 2020.
Manolis Savva, Angel X. Chang, Alexey Dosovitskiy, Thomas Funkhouser, and Vladlen Koltun. MINOS: Multimodal indoor simulator for navigation in complex environments. arXiv:1712.03931,
2017.
Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain,
Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat:
A Platform for Embodied AI Research. In Proceedings of IEEE International Conference on
Computer Vision (ICCV), 2019.
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. Highdimensional continuous control using generalized advantage estimation. Proceedings of the International Conference on Learning Representations (ICLR), 2016.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv:1707.06347, 2017.
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,
Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go
without human knowledge. Nature, 550(7676), 2017.
Greg Snook. Simplified 3d movement and pathfinding using navigation meshes. In Mark DeLoura
(ed.), Game Programming Gems, pp. 288–304. Charles River Media, 2000.
Adam Stooke and Pieter Abbeel. rlpyt: A research code base for deep reinforcement learning in
pytorch. arXiv:1909.01500, 2019.
Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster
level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 2019.
Luca Weihs, Jordi Salvador, Klemen Kotar, Unnat Jain, Kuo-Hao Zeng, Roozbeh Mottaghi, and
Aniruddha Kembhavi. Allenact: A framework for embodied ai research. arXiv, 2020.
Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva,
and Dhruv Batra. DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames.
In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
Fei Xia, Amir R Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson
env: Real-world perception for embodied agents. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. Gibson dataset license agreement available at
https://storage.googleapis.com/gibson material/Agreement%20GDS%2006-04-18.pdf.
Fei Xia, William B Shen, Chengshu Li, Priya Kasimbeg, Micael Edmond Tchapmi, Alexander Toshev, Roberto Martı́n-Martı́n, and Silvio Savarese. Interactive Gibson benchmark: A benchmark
for interactive navigation in cluttered environments. IEEE Robotics and Automation Letters, 5(2),
2020.
Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao
Jiang, Yifu Yuan, He Wang, et al. SAPIEN: A simulated part-based interactive environment. In
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32K for ImageNet training.
arXiv:1708.03888, 2017.
Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan
Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep
learning: Training BERT in 76 minutes. Proceedings of the International Conference on Learning
Representations (ICLR), 2020.
Andy Zeng, Pete Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian,
Travis Armstrong, Ivan Krasin, Dan Duong, Vikas Sindhwani, and Johnny Lee. Transporter
networks: Rearranging the visual world for robotic manipulation. Conference on Robot Learning
(CoRL), 2020.
Hongyi Zhang, Yann N Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without
normalization. Proceedings of the International Conference on Learning Representations (ICLR),
2019.
Sensor | System  | Agent Res. | RTX 3090 | RTX 2080Ti | Tesla V100 | 8×2080Ti | 8×V100
Depth  | BPS     | 64         | 19900    | 12900      | 12600      | 72000    | 46900
Depth  | BPS     | 128        | 6900     | 4880       | 5800       | 38000    | 41100
Depth  | BPS-R50 | 64         | 4800     | 2700       | 4000       | 19400    | 26500
Depth  | BPS-R50 | 128        | 2300     | 1400       | 2500       | 10800    | 18400
RGB    | BPS     | 64         | 13300    | 8400       | 9000       | 43000    | 37800
RGB    | BPS     | 128        | 6100     | 3600       | 4800       | 22300    | 31100
RGB    | BPS-R50 | 64         | 4000     | 2100       | 3500       | 14100    | 19700
RGB    | BPS-R50 | 128        | 2000     | 1050       | 2200       | 6800     | 14300

Table A1: Impact of Visual Encoder Input Resolution on Performance. Resolution has the largest impact on performance when increased memory usage forces BPS’ batch size to be decreased. For example, on a single Tesla V100, BPS’ Depth performance drops by 2.2× after increasing the resolution, because batch size decreases from N=1024 to N=512. Conversely, the eight-GPU Tesla V100 results only show a 12% decrease in performance, since batch size is fixed at N=128. Experiments with 128×128 pixel resolution are rendered at 256×256 and downsampled.
Sensor | System    | CNN        | Simulation + Rendering | Inference | Learning
Depth  | BPS       | SE-ResNet9 | 16.1                   | 5.9       | 16.6
Depth  | BPS-R50   | ResNet50   | 26.9                   | 99.3      | 311.3
Depth  | WIJMANS++ | SE-ResNet9 | 270.9                  | 78.8      | 42.8
Depth  | WIJMANS20 | ResNet50   | 1901.3                 | 3968.6    | 1534.5
RGB    | BPS       | SE-ResNet9 | 29.6                   | 13.8      | 30.0
RGB    | BPS-R50   | ResNet50   | 40.3                   | 110.2     | 333.4
RGB    | WIJMANS++ | SE-ResNet9 | 520.3                  | 389.5     | 169.3
RGB    | WIJMANS20 | ResNet50   | 1911.1                 | 4027.5    | 1587.5

Table A2: Runtime breakdown across systems. Microseconds per frame for each RL component on a RTX 3090. SE-ResNet9 uses an input resolution of 64×64, while ResNet50 uses an input resolution of 128×128. Note the large amount of time spent by WIJMANS20 on policy inference, caused by GPU memory constraints that force a small number of rollouts per iteration. BPS-R50’s performance is dominated by the DNN workload due to the large ResNet50 visual encoder.
A ADDITIONAL RESULTS

A.1 FLEE AND EXPLORE TASKS ON AI2-THOR DATASET
To demonstrate batch simulation and rendering on additional tasks besides PointGoal navigation,
BPS also supports the Flee (find the farthest valid location from a given point) and Explore (visit as
much of an area as possible) tasks. We evaluate BPS’s performance on these tasks on the AI2-THOR
(Kolve et al., 2017) dataset to additionally show how batch rendering performs on assets with less
geometric complexity than the scanned geometry in Gibson and Matterport3D.
Table A3 shows the learned task performance and end-to-end training speed of BPS on these two
tasks for Depth-sensor-driven agents. For both tasks, BPS outperforms its results on PointGoal
navigation by around 5000 frames per second, largely due to the significantly reduced geometric
complexity of the AI2-THOR dataset versus Gibson. Additionally, the Explore task slightly outperforms the Flee task by 600 FPS on average due to a simpler simulation workload, because no
geodesic distance computation is necessary.
A.2 STANDALONE BATCH RENDERER PERFORMANCE
To evaluate the absolute performance of BPS’s batch renderer independently from other components of the system, Fig. A2 shows the performance of the standalone renderer on the “Stokes” scene from the Gibson dataset using a set of camera positions taken from a training run.
Task    | FPS   | Training Score | Validation Score
Explore | 25300 | 6.42           | 5.61
Flee    | 24700 | 4.27           | 3.65

Table A3: Task and FPS results for Flee and Explore tasks with Depth agents (on a RTX 3090), where the Training / Validation Score is measured in meters for the Flee task and number of cells visited on the navigation mesh for the Explore task. These tasks achieve higher throughput than PointGoal navigation due to the lower complexity AI2-THOR meshes used. The relatively low scores are a result of the small spatial size of the AI2-THOR assets.
[Figure A1: plot of validation SPL (%) vs. steps of experience (millions) for aggregate batch sizes N = 256, 512, 1024, and 4096.]
Figure A1: BPS’s validation set SPL for Depth vs. number of training samples across a range of batch sizes. This graph shows that sample efficiency slightly decreases with larger batch sizes (with the exception of N=512 vs. N=1024, where N=1024 exhibits better validation score). Ultimately, the difference in converged performance is less than 1% SPL between different batch sizes. Although N=256 converges the fastest in terms of training samples needed, Fig. 4 shows that N=256 performs poorly in terms of SPL achieved per unit of training time.

[Figure A2: plot of frames per second vs. agent sensor resolution (64 to 512) for batch sizes 1, 2, 8, 32, 128, 512, and 1024.]
Figure A2: Frames per second achieved by the standalone renderer on a RTX 3090 across a range of resolutions and batch sizes for a RGB sensor on the Gibson dataset. Performance saturates at a batch size of 512. For lower batch sizes, increasing resolution has a minimal performance impact, because the GPU still isn’t fully utilized. As resolution increases with larger batches, the relative decrease in performance from higher resolution increases.
A batch size of 512 achieves a 3.7× performance increase over a batch size of 1, which emphasizes that much of the end-to-end speedup provided by batch rendering comes from the performance benefits of larger inference and training batches made possible by the batch renderer’s 3D asset sharing.
Fig. A2 also demonstrates that the batch renderer can maintain extremely high performance (approximately 23,000 FPS) at much higher resolutions than those used in the RL tasks presented in this work. While this may be useful for tasks requiring higher-resolution inputs, considerable advancements in DNN performance would be needed to process these high-resolution frames at a framerate comparable to the renderer's.
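As a rough guide to how such standalone measurements can be taken, the sketch below times a batch renderer across batch sizes and resolutions. The render_batch callable is a hypothetical stand-in for a renderer entry point, not necessarily the interface of the released batch renderer.

import time

def benchmark_renderer(render_batch, camera_poses, batch_size, resolution,
                       warmup_iters=10, timed_iters=100):
    """Rough frames-per-second measurement for a batch renderer.

    render_batch(poses, resolution) is a hypothetical stand-in that renders
    one frame per pose and blocks until the whole batch has completed.
    """
    batch = camera_poses[:batch_size]
    for _ in range(warmup_iters):              # warm up caches and GPU clocks
        render_batch(batch, resolution)
    start = time.perf_counter()
    for _ in range(timed_iters):
        render_batch(batch, resolution)
    elapsed = time.perf_counter() - start
    return (timed_iters * batch_size) / elapsed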
A.3   LAMB OPTIMIZER ABLATION STUDY
To demonstrate the benefit the Lamb optimizer provides with regard to sample efficiency, Fig. A3 compares the Lamb optimizer used by BPS with the Adam optimizer used by WIJMANS20 and WIJMANS++. The training setup for the two optimizers is identical, except that learning-rate scaling is removed for Adam, as it causes training to diverge. The benefits of Lamb are most pronounced early in training, allowing Lamb to reach within 0.7% SPL of convergence after just 1 billion samples of experience (while Adam trails Lamb by 1.5% at the same point). As training progresses, the difference shrinks as Adam slowly converges, for a final difference of 0.6% SPL after 2.5 billion frames.
[Figure A3 plot: SPL (higher is better) vs. steps of experience (in millions), with one curve per optimizer (Lamb, Adam).]
Figure A3: The effect of the Lamb optimizer versus the baseline Adam optimizer on sample efficiency while training a Depth-sensor-driven agent. Lamb maintains a consistent lead in SPL throughout training, especially in the first half of training.
B   EXPERIMENT AND TRAINING ADDITIONAL DETAILS
Complete PointGoal navigation description. We train and evaluate agents via the same procedure
as Wijmans et al. (2020). Specifically, agents are trained for PointGoalNav (Anderson et al., 2018)
where the agent is tasked with navigating to a point specified relative to its initial location. Agents
are equipped with a GPS+Compass sensor (providing the agent with its position and orientation
relative to the starting position) and either a Depth sensor or an RGB camera. The agent has access to four low-level actions: forward (0.25 m), turn-left (10°), turn-right (10°), and stop.
Agents are evaluated on the Gibson dataset (Xia et al., 2018). We use two metrics to evaluate the
agents: Success, whether or not the agent called stop within 0.2m of the goal, and SPL (Anderson
et al., 2018), a measure of both Success and efficiency of the agent’s path. During evaluation, the
agent does not have access to reward.
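For reference, SPL can be computed from per-episode success indicators and path lengths as in the sketch below, following the standard definition of Anderson et al. (2018); the helper names are ours.

def spl(successes, shortest_path_lengths, agent_path_lengths):
    """SPL (Anderson et al., 2018): the mean over episodes of
    S_i * l_i / max(p_i, l_i), where S_i is the binary success indicator,
    l_i the geodesic shortest-path length to the goal, and p_i the length
    of the path the agent actually took."""
    terms = [s * l / max(p, l)
             for s, l, p in zip(successes, shortest_path_lengths,
                                agent_path_lengths)]
    return sum(terms) / len(terms)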
Half-precision inference and mixed-precision training. We perform inference in half precision for all components except the action distribution. We train in mixed precision (Jia et al., 2018), utilizing the Apex library in O2 mode, using half precision for all computations except the action distribution and the losses. Additionally, the optimizer performs all of its computations in single precision and applies gradients to a single-precision copy of the weights.
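The sketch below illustrates this style of setup with the Apex amp API in O2 mode; the model and optimizer shown are placeholders rather than the actual policy DNN and Lamb optimizer used in training.

import torch
from apex import amp  # NVIDIA Apex; O2 keeps an FP32 master copy of the weights

policy = torch.nn.Linear(512, 4).cuda()                        # placeholder policy DNN
optimizer = torch.optim.Adam(policy.parameters(), lr=2.5e-4)   # placeholder optimizer

policy, optimizer = amp.initialize(policy, optimizer, opt_level="O2")

def update_step(observations, targets, loss_fn):
    loss = loss_fn(policy(observations), targets)   # loss (kept out of half precision in the actual setup)
    optimizer.zero_grad()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()                      # loss scaling avoids FP16 gradient underflow
    optimizer.step()                                # updates the FP32 master weights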
Training hyper-parameters. Our hyper-parameters for eight-GPU runs are given in Table A4. We additionally employ a gradual learning-rate decay: the learning rate is decayed from its scaled value back to the base value over the first half of training, following a cosine schedule.
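A minimal sketch of this warm-down schedule is shown below, assuming the optimizer is constructed with the scaled learning rate and the scheduler is stepped once per update; the exact implementation may differ.

import math
from torch.optim.lr_scheduler import LambdaLR

def make_warmdown(optimizer, base_lr, scaled_lr, total_updates):
    """Cosine decay from the scaled learning rate down to the base learning
    rate over the first half of training; constant afterwards."""
    floor = base_lr / scaled_lr          # final multiplier on the scaled LR
    half = max(1, total_updates // 2)

    def factor(step):
        if step >= half:
            return floor
        progress = step / half           # cosine interpolation from 1.0 to floor
        return floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))

    return LambdaLR(optimizer, lr_lambda=factor)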
We find it necessary to set ρ=1.0 for the bias, Fixup, and layer-norm parameters of the network, making the optimizer for these parameters equivalent to AdamW (Kingma & Ba, 2015; Loshchilov & Hutter, 2018). We also use L2 weight decay, both to add back regularization lost by removing normalization layers and to stabilize Lamb; we use λ = 10⁻².
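One way to express this per-parameter setting is through optimizer parameter groups, as in the sketch below. Lamb here is a hypothetical optimizer class that accepts a per-group rho option, and the name-matching heuristic is purely illustrative.

def build_param_groups(model, lr, weight_decay=1e-2, default_rho=0.01):
    """Split parameters so bias, Fixup, and layer-norm parameters receive
    rho = 1.0 (AdamW-like updates) while other weights use the default rho."""
    adamw_like, regular = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "bias" in name or "layer_norm" in name or "fixup" in name:
            adamw_like.append(param)
        else:
            regular.append(param)
    return [
        {"params": regular, "lr": lr, "weight_decay": weight_decay, "rho": default_rho},
        {"params": adamw_like, "lr": lr, "weight_decay": weight_decay, "rho": 1.0},
    ]

# optimizer = Lamb(build_param_groups(policy, lr=5.0e-4))  # Lamb is a hypothetical class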
We find one epoch of PPO with two mini-batches to be sufficient (instead of two epochs with two mini-batches), effectively doubling the learning speed. We also evaluated a single mini-batch, but found two to be beneficial while incurring little penalty in overall training speed.
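The sketch below shows the corresponding update structure: a configurable number of PPO epochs and mini-batches per rollout (one epoch and two mini-batches in our configuration), using the standard clipped surrogate loss without the value or entropy terms. The rollout and policy interfaces are placeholders, not the actual training code.

import torch

def ppo_update(policy, optimizer, rollout, num_epochs=1, num_mini_batches=2,
               clip=0.2):
    for _ in range(num_epochs):
        # rollout.iterate_mini_batches is a hypothetical helper that splits
        # the stored experience into the requested number of mini-batches.
        for batch in rollout.iterate_mini_batches(num_mini_batches):
            new_log_probs, _ = policy.evaluate_actions(batch.obs, batch.actions)
            ratio = torch.exp(new_log_probs - batch.old_log_probs)
            surr1 = ratio * batch.advantages
            surr2 = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * batch.advantages
            loss = -torch.min(surr1, surr2).mean()   # clipped surrogate objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()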
C   BENCHMARKING ADDITIONAL DETAILS
Pretrained benchmarking. A pretrained DNN is used when benchmarking to avoid frequent environment resets at the start of training.
Benchmarking hyper-parameters. Table A5 shows the settings for the hyper-parameters that impact system throughput.
GPU details. We report FPS results on three models of NVIDIA GPUs: Tesla V100, GeForce RTX 2080 Ti, and GeForce RTX 3090. We demonstrate scaling to multiple GPUs with eight-GPU configurations for all but the RTX 3090. Single-GPU and eight-GPU results are benchmarked on the same machines; however, single-GPU configurations are limited to 12 cores and 64 GB of RAM, as this is a reasonable configuration for a single-GPU workstation.
PPO Parameters
PPO Epochs                                              1
PPO Mini-Batches                                        2
PPO Clip                                                0.2
Clipped value loss                                      No
Per mini-batch advantage normalization                  No
γ                                                       0.99
GAE-λ (Schulman et al., 2016)                           0.95
Learning rate                                           5.0 × 10⁻⁴ (Depth), 2.5 × 10⁻⁴ (RGB)
Learning rate scaling                                   √(B / B_base)
B_base                                                  256
Max gradient norm                                       1.0
Weight decay                                            0.01
Lamb ρ                                                  0.01

Per-GPU Parameters
Number of unique scenes (K)                             4
Simulation batch size / Number of Environments (N)      128
Rollout length (L)                                      32

Table A4: Hyper-parameters used for BPS training on 8 GPUs.
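For convenience, the Table A4 settings can be collected into a single configuration object; the dictionary below simply mirrors the table, with field names of our choosing.

BPS_8GPU_CONFIG = {
    # PPO parameters (from Table A4)
    "ppo_epochs": 1,
    "ppo_mini_batches": 2,
    "ppo_clip": 0.2,
    "clipped_value_loss": False,
    "per_mini_batch_advantage_norm": False,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "learning_rate": {"depth": 5.0e-4, "rgb": 2.5e-4},
    "lr_scaling": "sqrt(B / B_base)",
    "b_base": 256,
    "max_grad_norm": 1.0,
    "weight_decay": 0.01,
    "lamb_rho": 0.01,
    # per-GPU parameters
    "unique_scenes_per_gpu": 4,      # K
    "environments_per_gpu": 128,     # N
    "rollout_length": 32,            # L
}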
CPU details. Each GPU configuration also uses a different CPU configuration based on hardware access. Tesla V100 benchmarking was done with 2× Intel Xeon E5-2698 v4 (a DGX-1 station), RTX 2080 Ti benchmarking with 2× Intel Xeon Gold 6226, and RTX 3090 benchmarking with 1× Intel i7-5820k. On all CPUs, we disable Hardware P-State (HWP) where applicable and put the software P-State in performance mode. Our CPU load on simulation worker cores is inherently sporadic, and we find that certain CPUs cannot change clock frequencies quickly enough to avoid a considerable performance penalty when they are allowed to enter a power-saving state.
Sensor   System       CNN          Resolution
Depth    BPS          SE-ResNet9   64
Depth    BPS          SE-ResNet9   128
Depth    BPS-R50      ResNet50     64
Depth    BPS-R50      ResNet50     128
Depth    WIJMANS++    SE-ResNet9   64
Depth    WIJMANS20    ResNet50     128
RGB      BPS          SE-ResNet9   64
RGB      BPS          SE-ResNet9   128
RGB      BPS-R50      ResNet50     64
RGB      BPS-R50      ResNet50     128
RGB      WIJMANS++    SE-ResNet9   64
RGB      WIJMANS20    ResNet50     128

Each configuration reports PPO Epochs, Rollout length (L), and Number of Environments (N) under five hardware configurations: Tesla V100 (1 GPU and 8 GPUs), RTX 2080Ti (1 GPU and 8 GPUs), and RTX 3090 (1 GPU).
1024 128   512 128   512 128   256 128   512 128   256 128   256 128   128 128   20 20
1 32 512   1 32 128   1 32 256   1 32 64   1 32 64   2 128 4
1 32 128   1 32 64∗   1 32 64   1 32 32∗   1 32 6   2 128 4
128 1024   128 512   128 512   64 128   128 256   64∗ 256   64 256   32∗ 64   6 16
Table A5: System configuration parameters for Table 1. ∗ indicates 4 mini-batches per epoch instead of 2.