Modeling Strong and Human-Like Gameplay with KL-Regularized Search

Athul Paul Jacob *1,2, David J. Wu *1, Gabriele Farina *3, Adam Lerer 1, Hengyuan Hu 1, Anton Bakhtin 1, Jacob Andreas 2, Noam Brown 1

*Equal contribution. 1 Meta AI Research, New York, NY, USA. 2 CSAIL, MIT, Cambridge, MA, USA. 3 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. Correspondence to: Athul Paul Jacob, David J. Wu, Gabriele Farina, Noam Brown.

arXiv:2112.07544v2 [cs.MA] 17 Feb 2022

Abstract

We consider the task of building strong but human-like policies in multi-agent decision-making problems, given examples of human behavior. Imitation learning is effective at predicting human actions but may not match the strength of expert humans, while self-play learning and search techniques (e.g. AlphaZero) lead to strong performance but may produce policies that are difficult for humans to understand and coordinate with. We show in chess and Go that regularizing search based on the KL divergence from an imitation-learned policy results in higher human prediction accuracy and stronger performance than imitation learning alone. We then introduce a novel regret minimization algorithm that is regularized based on the KL divergence from an imitation-learned policy, and show that using this algorithm for search in no-press Diplomacy yields a policy that matches the human prediction accuracy of imitation learning while being substantially stronger.

1. Introduction

Self-play AI algorithms have matched or exceeded expert human performance in many games, such as chess (Campbell et al., 2002; Silver et al., 2018), Go (Silver et al., 2016; 2017), and poker (Moravčík et al., 2017; Brown & Sandholm, 2017; 2019). However, the resulting policies often differ markedly from how humans play (McIlroy-Young et al., 2020a). This is a serious problem for human-computer interactions that involve cooperation rather than pure competition. In such settings, modeling the other participants accurately is important for success. For example, it is important for a self-driving car at a four-way stop sign to conform to existing human conventions rather than its own self-play solution to the problem (Lerer & Peysakhovich, 2019). Moreover, even in purely adversarial games, the alien nature of AI policies makes it difficult for humans to understand and learn from superhuman bots.

The classic approach toward modeling human behavior is imitation learning (IL) on human data. However, evidence in multiple games indicates that IL on expert human data produces policies that are much weaker than actual expert human play in domains with complex strategic planning.

In this paper, we study the problem of producing policies that are both strong and human-like in games with complex strategic planning like chess, Go, Hanabi, and Diplomacy. In all four, we find that conducting search with KL-regularization towards an IL policy matches or exceeds the prior state of the art for prediction accuracy of expert humans while also better matching expert human performance.

In Section 3, we show that Monte Carlo tree search (MCTS) with a human imitation-learned policy prior and value function surpasses prior state-of-the-art results for human prediction accuracy in chess and Go. As explained by Grill et al. (2020), standard MCTS with a policy prior can be viewed as a form of KL-regularized search, optimizing a value function subject to a KL-divergence term with that prior.
Although MCTS has been extensively studied for developing strong agents, it has been explored much less in the context of developing human-like agents.

Section 4 builds on these findings and shows how to generalize them to a class of imperfect-information games (in which ordinary MCTS is unsound and cannot be applied) via a new algorithm for KL-regularized regret minimization. We show that existing regret minimization algorithms achieve low accuracy in predicting expert human actions in no-press Diplomacy. We then introduce the first regret minimization algorithm to incorporate a cost term proportional to the KL divergence between the search policy and a human-imitation-learned anchor policy. We call this algorithm policy-regularized hedge, or piKL-hedge. We prove that piKL-hedge converges to an equilibrium in which all players' policies are optimal given the joint policies of the players and the cost of deviating from the anchor policy. We then present results in no-press Diplomacy showing that piKL-hedge produces policies that predict human actions as accurately as imitation learning while also improving head-to-head performance in a population of prior agents.

Appendix L additionally shows that applying KL-regularization toward a human IL policy in the search algorithm SPARTA (Lerer et al., 2020) produces similar or better human prediction accuracy while greatly improving performance in the benchmark domain of Hanabi (Bard et al., 2020). Our experiments demonstrate the benefits of KL-regularized search in all four of chess, Go, no-press Diplomacy, and Hanabi for producing agents that are simultaneously more human-like and closer in strength to actual human experts than purely imitation-learned models.

2. Preliminaries

We study the problem of learning policies for multiplayer games. Here we briefly introduce the key ingredients of both classes of games we study; Section 3 and Section 4 give a more formal presentation tailored to individual game types and learning algorithms.

An (N-player) game is defined by a state space S, an action space A, a (deterministic) transition function T : S × A^N → S, and a collection of reward functions u_i. We model the behavior of each player in a game as a policy π_i : S → ∆(A) (a distribution over actions given states). In every round of a game, each player observes a (possibly incomplete) view s_i^t of the current state. One or more players select actions a_i^t ∼ π_i(· | s_i^t), then each player receives a reward u_i^t(s^t, a^t), where a^t = (a_1^t, ..., a_N^t), and the game transitions into a new state s^{t+1} = T(s^t, a^t). Each player i aims to maximize its expected reward, and the optimal policies for doing so may depend on the policies π_{−i} = {π_1, ..., π_{i−1}, π_{i+1}, ..., π_N} of the other players.

The sequential decision-making problem described above is extremely general, and in this paper we focus on two special cases. In perfect-information games, players make moves sequentially (e.g., u_1 and s^2 depend only on a_1, u_2 and s^3 depend only on a_2, etc.). Many important games, including chess and Go, fall into this category. Next, we study a more general class of imperfect-information, simultaneous-action games that make no assumptions about the dependence of different u_i and T on a; here we focus on games with only a single round, also called matrix games. Owing to the large differences between these two settings, the tools for computing strong policies are quite different.
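To make these definitions concrete, the following is a minimal Python sketch of the interface they imply for an N-player simultaneous-action game; the type names and helper functions are illustrative assumptions (and partial observability is elided by giving every player the full state), not part of the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int          # placeholder types, for illustration only
Action = int
Policy = Callable[[State], Dict[Action, float]]   # pi_i : S -> Delta(A)

@dataclass
class Game:
    """An N-player game with a deterministic transition T and per-player rewards u_i."""
    num_players: int
    transition: Callable[[State, Tuple[Action, ...]], State]      # T : S x A^N -> S
    rewards: Callable[[State, Tuple[Action, ...]], List[float]]   # (u_1, ..., u_N)(s, a)

def sample(dist: Dict[Action, float]) -> Action:
    """Draw an action from a probability distribution over actions."""
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs)[0]

def play_round(game: Game, state: State, policies: List[Policy]) -> Tuple[State, List[float]]:
    """One round: each player samples a_i ~ pi_i(. | s), rewards u_i(s, a) are paid out,
    and the game transitions to s' = T(s, a)."""
    joint = tuple(sample(pi(state)) for pi in policies)
    return game.transition(state, joint), list(game.rewards(state, joint))
```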
The remainder of this paper accordingly offers a deeper exploration of each class of games: perfect-information games in Section 3 and imperfect-information games in Section 4.

3. Perfect-Information Games: Policy Regularization in Monte Carlo Tree Search

In this section, we focus on developing strong human-like agents for perfect-information games. Monte Carlo tree search (MCTS) has been highly successful for developing strong, but not necessarily human-like, agents in this setting, and is a key component of general learning algorithms such as AlphaZero and MuZero that are capable of achieving superhuman performance in chess, Go, and similar games (Silver et al., 2018; Schrittwieser et al., 2020). By contrast, for developing human-like agents, the best prior human move prediction accuracies for chess and Go were all achieved with pure imitation learning on human data (McIlroy-Young et al., 2020a; Cazenave, 2017; Silver et al., 2017).

The state of the art for predicting human moves in chess is the Maia engine created by McIlroy-Young et al. (2020a) via pure imitation learning without any search. However, this approach appears to be of limited effectiveness for modeling sufficiently strong humans. Although the weakest Maia models at low temperatures appear to outperform the players they imitate (due to "averaging away" of the imitated players' individual idiosyncratic mistakes (Anderson et al., 2021)), each successive model trained on data from stronger players improves by much less than the players themselves improve.¹ The strongest model, trained to predict human 1900-1999-rated players, even with low temperature appears to be clearly below a 1950-average level of performance in all but the minority of bullet-speed games (in which very little time is available for planning and players are forced to rely more heavily on intuition).

¹ See ratings data at https://lichess.org/@/maia1, https://lichess.org/@/maia5, https://lichess.org/@/maia9

Similarly, in Go, pure imitation-learning agents have not exceeded mid-expert level on various online servers despite being trained on top-expert and master-level games (Cazenave, 2017). In contrast, search-based reinforcement learning (RL) agents such as AlphaZero that do not use a human policy prior play at a superhuman level, but often in non-human ways that humans find difficult to understand even when given access to interactively query and inspect the agent's analysis (Silver et al., 2017; Egri-Nagy & Törmänen, 2020).

However, we show in both chess and Go that if the human-learned model is used in MCTS with appropriate parameters, MCTS outperforms those models' human prediction accuracy while simultaneously reducing the shortcomings in those models' strength.

3.1. Background

We consider sequential games where each player i alternately chooses an action a from a policy π_i, where a ∼ π_i(· | s). Each action deterministically transitions the game into a new state s' = T(s, a) and gives rewards u_i(s, a). Notationally, we may elide the player i in some places when it is clear that i is the next player to move in the state being considered.

For this work, following Silver et al. (2016), we implement one of the most common modern forms of MCTS, which uses a value function V predicting the expected total future reward V_i(s) = E[ Σ_t u_i(s^t, a^t) | π_1, π_2, s_0 = s ] and a policy prior τ (typically both the outputs of a trained deep neural net), and attempts to produce an improved policy π.
Each turn, MCTS builds a game tree over multiple iterations rooted at the current state. On each iteration t, MCTS explores a single path down the tree by simulating, at each successive state s with player i to move, the action

\arg\max_a \Big[ Q(s,a) + c_{\text{puct}}\, \tau(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{N(s,a) + 1} \Big]    (1)

where Q(s, a) is the estimated expected future reward for i from playing action a in state s, the visit count N(s, a) is the number of times a has been explored from s, τ(s, a) is the prior policy probability, and c_puct is a tunable parameter trading off exploration versus exploitation. Upon reaching a state s^t not yet seen, MCTS queries the value function V_i(s^t) for each player i and, based on V_i(s^t) and any intermediate rewards received, updates all Q(s, a) estimates on the path traversed. The final agent policy is π(s, a) = N(s, a) / Σ_b N(s, b), where s is the root state, or optionally we may also use π(s, a) ∝ N(s, a)^{1/T}, where T is a temperature parameter. See also Appendix F for a fuller description of MCTS.

Grill et al. (2020) show that the agent policy π computed by this form of MCTS is an approximate solution to the optimization problem

\arg\max_{\pi} \Big[ \sum_a Q(s,a)\, \pi(s,a) - \lambda\, D_{\mathrm{KL}}(\tau \,\|\, \pi) \Big]    (2)

where λ ∼ c_puct/√N and N is the total number of iterations. In other words, at every node of the tree recursively, MCTS implicitly optimizes its expected future reward subject to KL regularization of its policy towards the prior policy τ, with strength controlled by λ. For any fixed computational budget N, we can therefore tune c_puct to vary the strength of that prior, with c_puct = ∞ approximating the prior policy before search, and c_puct → 0 approaching a greedy argmax of the Q value estimates.

If our goal is a strong human-like agent rather than solely a strong agent, and the KL-regularizing policy is learned from human data, then that policy serves not just as a prior, but also as an anchor policy that regularizing towards is desirable in and of itself. With a good choice of c_puct, MCTS can improve that policy while remaining close to human. Our experiments confirm that MCTS improves on the strength and human prediction accuracy of the best existing models in both chess and Go.
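As a concrete illustration of the selection rule in Eq. (1) and the visit-count policy described above, here is a minimal NumPy sketch of the bookkeeping at a single MCTS node; it is a simplified toy view under assumed array representations, not the chess or Go implementation used in the experiments.

```python
import numpy as np

def puct_select(Q, N, prior, c_puct):
    """Eq. (1): argmax_a [ Q(s,a) + c_puct * tau(s,a) * sqrt(sum_b N(s,b)) / (N(s,a) + 1) ]."""
    exploration = c_puct * prior * np.sqrt(N.sum()) / (N + 1)
    return int(np.argmax(Q + exploration))

def root_policy(N, temperature=1.0):
    """Final agent policy at the root: pi(s,a) proportional to N(s,a)^(1/T)."""
    weights = N.astype(float) ** (1.0 / temperature)
    return weights / weights.sum()

# Toy example: three legal actions at the root, with an imitation-learned prior tau.
Q = np.array([0.10, 0.05, 0.02])    # estimated expected future reward per action
N = np.array([12, 30, 8])           # visit counts so far
tau = np.array([0.2, 0.7, 0.1])     # human-imitation policy prior
print(puct_select(Q, N, tau, c_puct=2.0), root_policy(N))
```

Larger c_puct makes the prior term dominate the selection (and hence the final visit distribution), which is the mechanism the text above exploits to keep the search anchored to the human-learned policy.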
3.2. Experiments in Chess and Go

In chess and Go, we ran two main experiments each. First, in chess using the prior state-of-the-art Maia models from McIlroy-Young et al. (2020a), and in Go using a model trained on professional games from the GoGoD dataset, we demonstrate that MCTS with that model outperforms the raw model in human prediction accuracy. Second, we also sanity-check that MCTS with the same parameters greatly improves the strength of the same models in chess and Go.

3.2.1 Data and Model Architecture

In chess, for the human-learned anchor policy we use the pre-trained Maia1100, Maia1500, and Maia1900 models from McIlroy-Young et al. (2020a), which achieve state-of-the-art performance on rating-conditional human move prediction. These models follow a standard AlphaZero-like residual block architecture, including both a policy and a value head, and were trained to imitate players in rating "buckets" 1100, 1500, and 1900 respectively, based on roughly 10 million games each from the popular Lichess server (each bucket contains games between players rated from N to N+99).

For Go, we trained a deep neural net on the GoGoD professional game dataset.² We match Cazenave (2017) in using games from 1900 through 2014 for training and 2015-2016 as the test set, with roughly 73,000 and 6,500 games, respectively. Our architecture matches the 20-block residual net structure used by some versions of AlphaZero (Silver et al., 2017), except that it adds squeeze-and-excitation layers, which have been successful in image processing tasks (Hu et al., 2018) and in self-play learning in chess and Go (LC0, 2020; Troisi, 2019). See Appendix E for additional details.

² https://gogodonline.co.uk/

3.2.2 Improved Human Prediction and Strength

In Table 1 we show the top-1 accuracy of MCTS at predicting players in chess and Go. MCTS on top of each model tested outperforms that model at predicting human moves. In chess, the benefit provided increases greatly as the rating of the players predicted increases, while the optimal choice of c_puct appears to decrease (allowing increasingly small value differences to affect the search). This is consistent with the intuitive hypothesis that stronger players plan more deeply, increasing the value of explicitly modeling planning, and that they are more sensitive to small future value differences. In Go, despite our baseline model being equal to or better than all prior imitation-learning models on the GoGoD human pro games dataset, MCTS improves it yet further.

Table 1. % top-1 test accuracy predicting human moves in chess and Go, using MCTS with 50 playouts and various c_puct, or the raw model without MCTS (equivalent to infinite c_puct). In chess, the first 10 ply per game and moves with < 30s time left are excluded, similar to (McIlroy-Young et al., 2020a). Standard error is approx. 0.1 or less on all values. MCTS on top of current state-of-the-art models improves human prediction accuracy significantly.

Model            Predicting     Raw Model    Model + MCTS, c_puct =
                                              10      5       2       1       0.5
Chess Maia1100   Rating 1100    51.1          51.2    51.3    50.8    49.5    47.4
Chess Maia1500   Rating 1500    52.4          52.7    52.9    52.8    51.9    50.1
Chess Maia1900   Rating 1900    53.2          53.6    54.0    54.3    53.8    52.4
Go Cazenave      GoGoD          54.7          -       -       -       -       -
Go Wu (2018)     GoGoD          55.3          -       -       -       -       -
Go Wang et al.   GoGoD          57.7          -       -       -       -       -
Go Ours          GoGoD          57.8          58.1    58.3    58.5    58.1    57.1

Figure 1. Improvement in top-1 accuracy of Maia1900 in chess or our GoGoD model in Go using MCTS, plotted versus winrate of MCTS against the raw model (temperature 1). Error bars indicate 1 standard error. Many c_puct values increase both human prediction accuracy and winrate over the raw model in both chess and Go.

In Figure 2, for chess we see that while KL-regularized search improves each model's accuracy on players of its target rating, surprisingly, the improvement grows yet larger when each model predicts players of higher rating than it was trained on. This suggests that as human players improve, the incremental average change in their behavior resembles or is correlated with the way that highly-regularized search improves the strength of a baseline policy.

Figure 2. Top-1 % test accuracy in chess for Maia models trained to predict moves by players in rating buckets 1100, 1500, and 1900, applied to predict all rating buckets from 1100 to 2500, with and without MCTS (Maia1100+MCTS with c_puct = 10, Maia1500+MCTS with c_puct = 5, Maia1900+MCTS with c_puct = 2). The black bolded outline indicates where the model is predicting the same rating as it was trained on. Standard error is very small, the slight thin shade around each line. MCTS most improves prediction of a model on players of its target rating and higher.

Additionally, in Appendix C, we show that if we apply post-processing to the MCTS policy based on Grill et al. (2020), MCTS improves cross entropy with human moves in both chess and Go, not just top-1 accuracy. In other words, not only does policy-regularized search improve the prediction of the top move, but it also better models the overall distribution of moves that humans may likely play.
We measure the strength impact of regularized search with 1000 games³ per c_puct setting between the raw model policy and the MCTS policy, sampling each at temperature 1. Figure 1 shows the change in human prediction accuracy of MCTS in both chess and Go, plotted jointly versus winrate of MCTS against the raw model. Rather than solely a tradeoff between strength and accuracy, most c_puct values in the range we tested increase both, some achieving more than 90% winrate while still improving human prediction. See Appendix D for results at lower temperature and evidence that accuracy improves further at longer time controls.

³ In Go, we also use the open-source KataGo (Wu, 2020) to determine when the game is over and to score the result. Unlike RL agents, the humans whom our models imitate universally pass and score the game well before it becomes mechanically scorable, so we use KataGo as a neutral judge.

Although we did not test against humans directly to calibrate, this gives clear evidence that a well-tuned human-regularized MCTS agent would be better able to match the 1900-1999-rated chess players that Maia1900 currently falls hundreds of Elo short of imitating, while simultaneously being more accurate to their move-by-move behavior, and similarly for human-imitation agents in Go.

4. Imperfect-Information Games: Policy-regularized Regret Minimization

While MCTS is a popular search algorithm for perfect-information deterministic games, it is not able to compute optimal policies in imperfect-information games. Instead, iterative algorithms based on regret minimization are the leading approach to search in imperfect-information games.

Hedge (Littlestone & Warmuth, 1994; Freund & Schapire, 1997) is an iterative regret minimization algorithm that in general converges to a coarse correlated equilibrium (CCE) (Hannan, 1957). In the special case of two-player zero-sum games, it also converges to a Nash equilibrium (NE) (Nash, 1951). Regret Matching (RM) (Blackwell et al., 1956; Hart & Mas-Colell, 2000) is another equilibrium-finding algorithm, similar to Hedge, that has historically been more popular and that we compare our algorithm to in this paper.

The effectiveness of the implicit KL-regularization in MCTS that we study in Section 3 motivates us to develop an equilibrium-finding algorithm, called piKL-Hedge, that similarly biases the search towards an anchor policy.
In Section 4.3, we show that piKL-Hedge achieves better empirical performance against baseline human-imitation models than Hedge and RM in a large imperfect-information game, as well as much higher human prediction accuracy.

4.1. Background

We consider a game with N players where each player i chooses an action a from a set of actions A_i. We denote the actions of all players other than i as a_{−i}. After all players simultaneously choose an action, player i receives a reward of u_i(a, a_{−i}). Players may also choose a probability distribution over actions, where the probability of action a is denoted π_i(a) and the vector of probabilities is denoted π_i. We also define the fixed policy that we are interested in biasing player i towards as the anchor policy τ_i ∈ ∆(A_i).

Each player i maintains a regret for each action. The regret on iteration t is denoted R_i^t(a). Initially, all regrets are zero. On each iteration t of Hedge, π_i^t(a) is set according to

\pi_i^t(a) \propto \exp\big( R_i^t(a) \big)    (3)

Next, each player samples an action a* from A_i according to π_i^t, and all regrets are updated such that

R_i^{t+1}(a) = R_i^t(a) + u_i(a, a^*_{-i}) - \sum_{a' \in A_i} \pi_i^t(a')\, u_i(a', a^*_{-i})    (4)

It is proven that the average policy of Hedge over all iterations converges to a NE in two-player zero-sum games and, more broadly, that the players' joint policy distribution converges to a CCE as t → ∞.

We wish to model agents that seek to maximize their expected reward, while at the same time playing "close" to the anchor policy. The two goals can be reconciled by defining a composite utility function that adds a penalty based on the "distance" between the player policy and their anchor policy, with coefficient λ_i ∈ [0, ∞) scaling the penalty. For each player i, we define i's utility as a function of the agent policy π_i ∈ ∆(A_i) given the policies π_{−i} of all other agents:

U_i(\pi_i, \pi_{-i}) := u_i(\pi_i, \pi_{-i}) - \lambda_i D_{\mathrm{KL}}(\pi_i \,\|\, \tau_i).    (5)

When λ_i is large, the utility function is dominated by the KL-divergence term λ_i D_KL(π_i ‖ τ_i), and so the agent will naturally tend to play a policy π_i close to the anchor policy τ_i.⁴ When λ_i is small, the dominating term is the reward u_i(π_i, a^t_{−i}), and so the agent will tend to maximize reward without as closely matching the anchor policy τ_i. These statements are made precise in Theorem 1 and Theorem 2.

⁴ The careful reader may observe that the direction of the KL-divergence term, D_KL(π ‖ τ), is the opposite of the direction implicit in MCTS, D_KL(τ ‖ π). We choose this direction for greater ease of theoretical analysis and implementation in the context of regret minimization; for our use cases we have not found the exact form of the loss to be critical so much as simply doing any reasonable regularized search.

4.2. No-Regret Learning for Policy-Regularized Utilities

In this section, we present Algorithm 1, a no-regret algorithm based on Hedge for any player i to learn strong policies relative to the regularized utilities defined in (5). As we show in Proposition 1 in Appendix A, it guarantees that each player i accumulates sublinear regret (of order log T) with respect to the regularized utility functions

U_i^t(\pi_i) := U_i(\pi_i, a^t_{-i}) = u_i(\pi_i, a^t_{-i}) - \lambda_i D_{\mathrm{KL}}(\pi_i \,\|\, \tau_i),

no matter the opponents' actions a^t_{−i} at each time t.

Algorithm 1 piKL-Hedge (for Player i)
Data:
• A_i, the set of actions for Player i;
• u_i, the reward function for Player i;
• η > 0, a learning rate hyperparameter.

function INITIALIZE()
    t ← 0
    for each action a ∈ A_i do
        CV_i^0(a) ← 0

function PLAY()
    t ← t + 1
    let π_i^t be the policy such that
        \pi_i^t(a) \propto \exp\Big( \frac{\eta\, \mathrm{CV}_i^{t-1}(a) + t \lambda_i \eta \log \tau_i(a)}{1 + t \lambda_i \eta} \Big)    (6)
    sample an action a^t ∼ π_i^t
    play a^t and observe the actions a^t_{−i} played by the opponents
    for each a ∈ A_i do
        CV_i^t(a) ← CV_i^{t−1}(a) + u_i(a, a^t_{−i})

As with many other regret-minimization methods, we consider the average policy of each player i over T iterations:

\bar\pi_i^T := \frac{1}{T} \sum_{t=1}^{T} \pi_i^t    (7)

where π_i^t is defined in (6). We take π̄_i^T to be the final agent policy produced by piKL-HedgeBot in no-press Diplomacy (as described in more detail in Section 4.3).
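To make Algorithm 1 concrete, below is a minimal NumPy sketch of piKL-Hedge running in self-play on a one-shot matrix game, following the pseudocode and Eq. (6) above; the payoff function, anchor policies, and hyperparameter values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def pikl_hedge_policy(cv, t, anchor, lam, eta):
    """Eq. (6): pi_i^t(a) proportional to exp[(eta*CV_i^{t-1}(a) + t*lam*eta*log tau_i(a)) / (1 + t*lam*eta)]."""
    logits = (eta * cv + t * lam * eta * np.log(anchor)) / (1.0 + t * lam * eta)
    logits -= logits.max()                      # numerical stability only
    p = np.exp(logits)
    return p / p.sum()

def run_pikl_hedge(payoffs, anchors, lam=0.1, eta=0.1, iters=10_000, seed=0):
    """Self-play piKL-Hedge in an N-player matrix game.

    payoffs: function mapping a tuple of action indices to a vector of per-player rewards.
    anchors: list of anchor policies tau_i, one probability vector per player.
    Returns each player's average policy (Eq. (7))."""
    rng = np.random.default_rng(seed)
    n = len(anchors)
    cv = [np.zeros(len(tau)) for tau in anchors]      # CV_i(a): cumulative reward of a vs. observed opponents
    avg = [np.zeros(len(tau)) for tau in anchors]
    for t in range(1, iters + 1):
        policies = [pikl_hedge_policy(cv[i], t, anchors[i], lam, eta) for i in range(n)]
        actions = [int(rng.choice(len(p), p=p)) for p in policies]
        for i in range(n):
            for a in range(len(anchors[i])):
                joint = tuple(actions[:i] + [a] + actions[i + 1:])
                cv[i][a] += payoffs(joint)[i]         # update CV_i(a) against the opponents' sampled actions
            avg[i] += policies[i]
    return [x / iters for x in avg]

# Toy example: rock-paper-scissors with a non-uniform anchor policy for both players.
RPS = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
payoffs = lambda a: np.array([RPS[a[0], a[1]], -RPS[a[0], a[1]]])
anchors = [np.array([0.6, 0.3, 0.1]), np.array([0.6, 0.3, 0.1])]
print(run_pikl_hedge(payoffs, anchors, lam=0.5))
```

With large lam the average policies stay near the anchor; with lam near 0 they approach the uniform Nash equilibrium of rock-paper-scissors, mirroring the tradeoff described above.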
As shown in Appendix A, the KL-divergence of π̄_i^T from the anchor policy τ_i converges to be inversely proportional to λ_i:

Theorem 1 (piKL stays close to the anchor policy). Upon running Algorithm 1 for T iterations in a multiplayer general-sum game, the policy π̄_i^T is at a distance

D_{\mathrm{KL}}(\bar\pi_i^T \,\|\, \tau_i) \le \frac{1}{\lambda_i} \Big( \frac{R_i^T}{T} + D_i \Big),

where D_i is any upper bound on possible rewards for Player i. In particular, if η > 0 is set so that R_i^T = o(T), then D_KL(π̄_i^T ‖ τ_i) is asymptotically bounded by D_i/λ_i as T → +∞.

We can also show (see Appendix A) that in the case of a two-player zero-sum game, π̄_i^T approximates a Nash equilibrium of the original utility functions, with the approximation guarantee controlled by λ:

Theorem 2. Let (π̄_1, π̄_2) be any limit point of the average policies (π̄_1^T, π̄_2^T) of the players. Almost surely, (π̄_1, π̄_2) is a (max_{i=1,2} {λ_i β_i})-approximate Nash equilibrium policy with respect to the original utility functions u_i, where β_i is as defined in (21).

Lastly, we remark that in the special case that τ_i is the uniform policy for all players i, the above results imply that Algorithm 1 converges towards a quantal response equilibrium (McKelvey & Palfrey, 1995a), in which an imperfect agent is modeled as choosing actions with probability exponentially decaying in the amount by which each action is worse than the best action(s), given that all other agents behave the same way. Our method can be seen as a generalization that takes into account a human-learned prior for which actions may be more likely.

4.3. Diplomacy Experiments

Diplomacy is a benchmark 7-player simultaneous-action game featuring both cooperation and competition. In Appendix G, we summarize the rules of the game. Using piKL-Hedge, we develop an agent, piKL-HedgeBot, and show that it improves upon prior approaches for the game. In Appendix B, we also illustrate the key features of piKL-Hedge in Blotto, a famous 2-player simultaneous-action game.

4.3.1 Algorithms and Models

In no-press Diplomacy, we compare the different equilibrium search algorithms (RM, Hedge, and piKL-Hedge) using the procedure introduced by Gray et al. (2020). We perform 1-ply lookahead where, on each turn, we sample up to 30 of the most likely actions for each player from a policy network trained via imitation learning on human data (the IL policy). We then consider the 1-ply subgame consisting of those possible actions, where the rewards for a given joint action are given by querying a value network trained on human game data as in Gray et al. (2020). We play according to the approximate equilibrium computed for that subgame by that algorithm. For piKL-Hedge, the anchor policy is simply the same human-trained policy network. Our baseline policy and value models also contain a few improvements over prior models for no-press Diplomacy, described in Appendices I and J.

In our experiments, we label our RM, Hedge, and piKL-Hedge agents as RMBot, HedgeBot, and piKL-HedgeBot respectively. We also compare against SearchBot (Gray et al., 2020) (similar to RMBot but using the models from Gray et al. (2020) rather than our models).
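As an illustration of the 1-ply search procedure just described, the sketch below assembles the subgame from hypothetical il_policy and value_net objects and hands it to a matrix-game solver such as the run_pikl_hedge sketch above; the object methods, the renormalization of the IL policy over the sampled candidate actions, and the solver interface are all assumptions made for exposition, not the authors' implementation.

```python
import numpy as np
from functools import lru_cache

def pikl_search_action(state, me, players, il_policy, value_net, solver, k=30, seed=0):
    """1-ply piKL search in the style of Section 4.3.1: sample up to k candidate actions per
    player from the IL policy, treat the induced matrix game (with rewards from the value
    network) as the subgame, solve it with a regret-minimization solver (e.g. the
    run_pikl_hedge sketch above), and sample our action from the resulting average policy.
    The anchor policy is the IL policy itself, renormalized over the candidate actions."""
    candidates = {p: il_policy.top_actions(state, p, k) for p in players}               # hypothetical API
    anchors = [il_policy.renormalized_probs(state, p, candidates[p]) for p in players]  # hypothetical API

    @lru_cache(maxsize=None)
    def payoffs(joint_idx):
        # Reward vector for one joint action, queried lazily so that only joint actions
        # actually visited by the solver require a value-network call.
        joint = {p: candidates[p][i] for p, i in zip(players, joint_idx)}
        return value_net.values(state, joint)                                           # hypothetical API

    avg_policies = solver(payoffs, anchors)
    mine = avg_policies[players.index(me)]
    rng = np.random.default_rng(seed)
    return candidates[me][int(rng.choice(len(mine), p=mine))]
```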
See Appendix H for more details about the hyperparameters used.

4.3.2 Strong, human-like play with piKL-Hedge

Similar to chess and Go, we compare the human prediction accuracy of RMBot, HedgeBot, and piKL-HedgeBot (with different λs) to the IL anchor policy, as well as testing their head-to-head strength. In particular, we test their ability to predict human moves in 226 no-press Diplomacy games from a validation set, and measure their score against the IL policy across 700 games each.

In Figure 3 (Left), we present the average top-1 accuracy of unit orders in each action predicted by these methods, as well as their average scores against 6 IL anchor policies. The raw IL model (λ = ∞) predicts human moves with high accuracy but is weak and achieves a low average score. Unregularized Hedge and RM (λ = 0) achieve high score but low human prediction accuracy. By contrast, piKL-HedgeBot with different λ achieves a variety of highly favorable combinations of the two. λ = 10^{−1} gives about the same top-1 accuracy as the IL policy but improves score by a factor of 1.4x over the IL anchor policy. λ = 10^{−3} outperforms unregularized search methods in both score (by ~5%) and human prediction accuracy (by ~6%). Mild regularization improves average score, rather than harming it.

We also tested pure RL agents and found that they perform poorly in predicting human moves. In particular, the recently proposed DORA and Human DNVI-NPU algorithms (Bakhtin et al., 2021) achieve top-1 accuracy of only 29.1% and 37.8% respectively.

Next, in Figure 3 (Right), we compare the top-1 accuracy of these methods across players of different pseudo-Elo ratings. The pseudo-Elo e_i for player i is constructed based on the logit rating s_i introduced in Gray et al. (2020), where e_i = s_i · 400/log(10) + 1000. The top-1 accuracy of all the search-based policies increases with pseudo-Elo, indicating that they are better at modeling stronger players than weaker players. piKL-Hedge (λ = 10^{−1}) performs just as well as the anchor policy across pseudo-Elos while being significantly stronger than the anchor, while λ = 10^{−3} is as strong as or stronger than Hedge and RM but matches human play far better.

Figure 3. (Left) Average top-1 accuracy of unit orders in each action predicted by the human IL anchor policy, RM, Hedge, and piKL-Hedge (λ ranging from 10^{−5} to 10^{−1}), versus head-to-head score against 6 IL anchor policies. piKL-HedgeBot (λ = 10^{−1}) predicts human moves as accurately as the anchor policy while achieving a much higher score. At the same time, piKL-HedgeBot (λ = 10^{−3}) allows for a stronger and more human-like policy than unregularized search methods (Hedge, RM). Note that equal performance would imply an average score of 1/7 ≈ 14.3%. Error bars indicate 1 standard error. (Right) Average top-1 accuracy of unit orders in each action predicted by the human policy, RM, and piKL, as a function of pseudo-Elo player rating.

Table 2. Average score achieved by agents in a uniformly sampled pool of other agents. piKL-HedgeBot (λ = 10^{−3}) outperforms all other agents in this setting. DipNet agents from (Paquette et al., 2019) use a temperature of 0.1, while IL Policy and Blueprint (Gray et al., 2020) use a temperature of 0.5. The ± shows one standard error. † (Paquette et al., 2019); ‡ (Gray et al., 2020).

Agent                          Average Score
DipNet†                        3.7% ± 0.3%
DipNet RL†                     4.7% ± 0.3%
Blueprint‡                     4.9% ± 0.3%
BRBot‡                         16.1% ± 0.6%
SearchBot‡                     13.4% ± 0.5%
IL Policy                      7.9% ± 0.4%
RMBot                          31.3% ± 0.7%
piKL-HedgeBot (λ = 10^{−3})    32.9% ± 0.7%

4.3.3 piKL-HedgeBot performs well against a varied pool of agents

We also develop a new head-to-head evaluation setting, where rather than testing one agent versus 6x of another agent, all 7 agents per game are sampled uniformly from a pool. The 1v6 head-to-head scores used in prior work (Gray et al., 2020; Bakhtin et al., 2021; Anthony et al., 2020; Paquette et al., 2019) indicate whether a population of 6x agents can be invaded by a 1x agent, and hence whether the 6 agents constitute an Evolutionarily Stable Strategy (ESS) (Taylor & Jonker, 1978; Smith, 1982). By contrast, assigning the 7 agents per game randomly from a pool studies the robustness of an agent to a variety of other agents.

We experiment with a pool of 8 agents. Five are previously published agents: DipNet, DipNet RL (Paquette et al., 2019), Blueprint, BRBot, and SearchBot (Gray et al., 2020), and three are our agents: IL policy, RMBot, and piKL-HedgeBot. Doing well in this population requires playing well with both human-like policies (DipNet, Blueprint) and equilibrium policies (SearchBot, RMBot). Each experiment only compares one lambda value of piKL-HedgeBot for fairness. The results of this experiment with piKL-HedgeBot (λ = 10^{−3}) are presented in Table 2. piKL-HedgeBot (λ = 10^{−3}) outperforms all other agents.

Overall, unlike chess and Go, we do not find that piKL-Hedge clearly improves human prediction accuracy over the IL model. However, also unlike chess and Go, we observe that piKL-Hedge does improve playing strength over prior search methods against various past agents. In general-sum games like Diplomacy, it appears that piKL-Hedge with a human anchor policy allows for slightly better play in a population containing human-like agents while still doing well against equilibrium-searchers, or alternatively can improve strength over IL to a lesser degree with no cost at all to prediction accuracy.

5. Related Work

5.1. Regularized Learning and Planning

Several prior works have explored augmenting reinforcement learning with supervised data from expert demonstrations (Vecerik et al., 2017; Nair et al., 2018). For example, Hester et al. (2018) augment Deep Q-Learning with a margin loss on demonstration data that aims to make Q(a) for each demonstration action higher than that of other actions. KL-regularization has also been used successfully to incorporate expert demonstrations into RL training (Rudner et al., 2021; Ng et al., 2000; Boularias et al., 2011; Wu et al., 2019; Peng et al., 2019; Siegel et al., 2019). In these settings, the standard RL objective is augmented by a KL divergence penalty that expresses the dissimilarity between the online policy and a reference policy derived from demonstrations. This helps guide exploration in RL or ameliorate inaccurate modeling of the environment in domains such as robotics. AlphaStar (Vinyals et al., 2019), which achieved expert human performance in StarCraft 2, uses self-play RL initialized from supervised policies, with a KL penalty term for deviating from the supervised policy during training to "aid in exploration and to preserve strategic diversity". In our work, we use the KL term to better approximate human play during inference-time search.
Prior work has explored entropy-regularized utilities in games. Ling et al. (2018) show that a particular type of entropic regularization in extensive-form games leads to quantal response equilibria. Cen et al. (2021) give fast online optimization algorithms for entropy-regularized utilities in the context of quantal response. Farina et al. (2019) regularize utilities with a KL divergence term from a precomputed Nash equilibrium strategy to design agents that trade off game-theoretic safety and exploitation. In our work, we instead regularize the utilities with a KL divergence term towards a human-imitation-learned policy. And unlike previous works, we empirically study our approach in a much larger imperfect-information game.

5.2. Strong Human-Compatible Policies

Prior work in multi-agent reinforcement learning has emphasized the importance of human-compatible policies in cooperative multi-agent environments. Lerer & Peysakhovich (2019) demonstrate that self-play policies may perform poorly with other agents if they do not conform to the population equilibrium (social conventions), and propose a combination of policy gradient and imitation loss directly on samples of population data.

Human-compatible policies have also been studied on the benchmark game of Hanabi, where ad hoc play with humans is regarded as an open challenge problem (Bard et al., 2020). Most work on this challenge has focused on zero-shot coordination with humans, in which an agent must adapt to human play with no prior experience with human partners (Hu et al., 2020; 2021b; Cui et al., 2021). Learning human-compatible policies from a combination of human data and planning is less well-studied in this setting.

5.2.1 MCTS, Chess and Go

Prior work in tree search methods, especially in chess and Go, has typically focused on developing strong agents without concern for accurately modeling human behavior. For example, in Go, a significant body of older work investigates imitation learning (IL) to obtain a baseline policy prior to use with MCTS, but tunes and evaluates the final agent via playing strength alone (Tian & Zhu, 2016; Cazenave, 2017).

Regarding the use of search for human modeling, recent work in chess by McIlroy-Young et al. (2020b) found that pure IL outperformed all other approaches and that adding MCTS with the parameters of a standard engine significantly harmed human prediction accuracy. However, in our work, using a different range of parameters, we show clear results to the contrary. In Go, Wang et al. (2017) report promising results on human prediction and playing strength using search, albeit with a specialized architecture and rollout method. Baier et al. (2018) report in the card game Spades both excellent human accuracy and playing strength by adding a human policy bias to a variant of MCTS. To our knowledge, our work is the first to demonstrate a clear gain in human prediction accuracy over deep learning models in the highly-studied domains of chess and Go via simple, well-established methods of policy-regularized planning.

5.2.2 Diplomacy

Diplomacy is a benchmark 7-player game that involves communication, negotiation, cooperation, and competition in a strategic multi-agent setting.
While chess and Go are two-player zero-sum games for which optimal play is well-defined and can be computed through self-play (Nash, 1951), Diplomacy has no such guarantees, and strong play likely requires modeling other agents, even in the no-press variant where natural-language communication is not allowed (Bakhtin et al., 2021). Paquette et al. (2019) showed that neural network IL on human data in no-press Diplomacy can reasonably approximate human play, but that bootstrapping RL from this agent leads to a breakdown in cooperation. Anthony et al. (2020) developed new RL methods based on fictitious play that improve the performance of agents in no-press Diplomacy, and Gray et al. (2020) showed that an equilibrium-finding regret minimization search procedure on top of human IL models achieves human-level performance in no-press Diplomacy. However, although both methods rely on the human IL model to generate a restricted action set for RL or search, neither contains any explicit regularization when choosing among those actions, and we show that the equilibrium search in Gray et al. (2020) greatly decreases the accuracy of modeling human players. Similarly, Bakhtin et al. (2021) achieved strong results in no-press Diplomacy via self-play RL both from scratch and initialized with a human-learned policy, but we show that the resulting final agents do not ultimately model humans well.

6. Conclusion

In this paper, we showed across several domains that regularizing search policies according to a KL-divergence loss with an imitation-learned (IL) policy produces policies that maintain high human prediction accuracy while being far stronger than the original learned policy. In chess and Go, applying standard MCTS regularized toward a human-learned policy achieves state-of-the-art prediction accuracy, surpassing imitation learning, while also winning more than 85% of games against an IL model. We then introduced a novel regret minimization algorithm that is regularized based on the KL divergence from an IL policy. In no-press Diplomacy, this algorithm yields either a policy that predicts human play with the same accuracy as imitation learning alone while increasing win rate against state-of-the-art baselines by a factor of 1.4, or alternately a policy that outperforms unregularized search while achieving much higher human prediction accuracy. We presented in Appendix L similar successful results for KL-regularized search in Hanabi.

There are several directions for future work, such as extending piKL-Hedge to handle extensive-form games. Additionally, there may be better ways to regularize search than KL-divergence. Finally, it remains to be seen how KL-regularized search performs when combined with RL.

Author Contributions

A. P. Jacob was the primary researcher for piKL-hedge and contributed to the direction, experiments, and writing of the entire paper. D. J. Wu was the primary researcher for MCTS in chess and Go, and contributed to the direction, experiments, and writing of the entire paper. G. Farina was the primary formulator of the piKL-hedge algorithm and handled all the theory in the paper. A. Lerer contributed to the direction of the project, the formulation of piKL-hedge, its experimental evaluation, and paper writing. H. Hu was the primary researcher for the extension of piKL to Hanabi covered in Appendix L. A. Bakhtin contributed to the experimental evaluation of piKL-hedge. J.
Andreas contributed to the direction of the project and to paper writing. N. Brown initiated the project and contributed to the direction of the project, the formulation of piKL-hedge, its experimental evaluation, and paper writing. References Abernethy, J. and Rakhlin, A. Beating the adaptive bandit with high probability. Technical Report UCB/EECS-2009-10, EECS Department, University of California, Berkeley, Jan 2009. URL http://www2.eecs.berkeley.edu/Pubs/ TechRpts/2009/EECS-2009-10.html. Anderson, A., McIlroy-Young, R., Sen, S., and Kleinberg, J. Introducing maia, a human-like neural network chess engine, 2021. URL https: //lichess.org/blog/X9PUixUAANCqFRSh/ introducing-maia-a-human-like-neuralnetwork-chess-engine. Anthony, T., Eccles, T., Tacchetti, A., Kramár, J., Gemp, I., Hudson, T., Porcel, N., Lanctot, M., Perolat, J., Everett, R., Singh, S., Graepel, T., and Bachrach, Y. Learning to play no-press diplomacy with best response policy iteration. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 17987–18003. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/ paper/2020/file/ d1419302db9c022ab1d48681b13d5f8bPaper.pdf. Baier, H., Sattaur, A., Powley, E., Devlin, S., Rollason, J., and Cowling, P. Emulating human play in a leading mobile card game. IEEE Transactions on Games, PP:1–1, 05 2018. doi: 10.1109/TG.2018.2835764. Bakhtin, A., Wu, D., Lerer, A., and Brown, N. No-press diplomacy from scratch. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. Bard, N., Foerster, J. N., Chandar, S., Burch, N., Lanctot, M., Song, H. F., Parisotto, E., Dumoulin, V., Moitra, S., Hughes, E., et al. The hanabi challenge: A new frontier for ai research. Artificial Intelligence, 280:103216, 2020. Blackwell, D. et al. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956. Boularias, A., Kober, J., and Peters, J. Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 182–189. JMLR Workshop and Conference Proceedings, 2011. Brown, N. and Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, pp. eaao1733, 2017. Brown, N. and Sandholm, T. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019. Campbell, M., Hoane Jr, A. J., and Hsu, F.-h. Deep Blue. Artificial intelligence, 134(1-2):57–83, 2002. Cazenave, T. Residual networks for computer go. IEEE Transactions on Computational Intelligence and AI in Games, PP:1–1, 03 2017. doi: 10.1109/ TCIAIG.2017.2681042. Cen, S., Wei, Y., and Chi, Y. Fast policy extragradient methods for competitive games with entropy regularization. In Neural Information Processing Systems (NeurIPS), 2021. Cui, B., Hu, H., Pineda, L., and Foerster, J. K-level resoning for zero-shot coordination in hanabi. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. Egri-Nagy, A. and Törmänen, A. The game is not over yet—go in the post-alphago era. Philosophies, 5(4):37–0, 2020. ISSN 2409-9287. doi: 10.3390/philosophies5040037. URL https:// www.mdpi.com/2409-9287/5/4/37. Farina, G., Kroer, C., and Sandholm, T. Online convex optimization for sequential decision processes and extensiveform games. In AAAI Conference on Artificial Intelligence, 2019. 
Modeling Strong and Human-Like Gameplay with KL-Regularized Search Fickinger, A., Hu, H., Amos, B., Russell, S., and Brown, N. Scalable online planning via reinforcement learning fine-tuning. CoRR, abs/2109.15316, 2021. URL https: //arxiv.org/abs/2109.15316. Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997. Gray, J., Lerer, A., Bakhtin, A., and Brown, N. Humanlevel performance in no-press diplomacy via equilibrium search. In International Conference on Learning Representations, 2020. Grill, J.-B., Altché, F., Tang, Y., Hubert, T., Valko, M., Antonoglou, I., and Munos, R. Monte-carlo tree search as regularized policy optimization. In International Conference on Machine Learning, pp. 3769–3778. PMLR, 2020. Hannan, J. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957. Hart, S. and Mas-Colell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5): 1127–1150, 2000. Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., et al. Deep q-learning from demonstrations. In Thirtysecond AAAI conference on artificial intelligence, 2018. Hu, H., Lerer, A., Peysakhovich, A., and Foerster, J. “otherplay” for zero-shot coordination. In International Conference on Machine Learning, pp. 4399–4410. PMLR, 2020. Hu, H., Lerer, A., Brown, N., and Foerster, J. N. Learned belief search: Efficiently improving policies in partially observable settings. CoRR, abs/2106.09086, 2021a. URL https://arxiv.org/abs/2106.09086. Hu, H., Lerer, A., Cui, B., Pineda, L., Wu, D., Brown, N., and Foerster, J. Off-belief learning. In International Conference on Machine Learning. PMLR, 2021b. Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, 2018. doi: 10.1109/CVPR.2018.00745. LC0. Leela chess zero information page on "neural network topology". https://lczero.org/dev/ backend/nn/, 2020. Lerer, A. and Peysakhovich, A. Learning existing social conventions via observationally augmented self-play. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 107–114. ACM, 2019. Lerer, A., Hu, H., Foerster, J., and Brown, N. Improving policies via search in cooperative partially observable games. In AAAI Conference on Artificial Intelligence, 2020. Ling, C. K., Fang, F., and Kolter, J. Z. What game are we playing? end-to-end learning in normal and extensive form games. International Joint Conferences on Artificial Intelligence Organization, 2018. Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. Information and computation, 108(2):212– 261, 1994. McIlroy-Young, R., Sen, S., Kleinberg, J., and Anderson, A. Aligning superhuman ai with human behavior: Chess as a model system. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1677–1687, 2020a. McIlroy-Young, R., Wang, R., Sen, S., Kleinberg, J., and Anderson, A. Learning personalized models of human behavior in chess. Technical report, Microsoft, Inc., August 2020b. URL https: //www.microsoft.com/en-us/research/ publication/learning-personalizedmodels-of-human-behavior-in-chess/. McKelvey, R. D. and Palfrey, T. R. Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6–38, 1995a. ISSN 0899-8256. 
doi: https://doi.org/10.1006/game.1995.1023. URL https://www.sciencedirect.com/science/article/pii/S0899825685710238.

McKelvey, R. D. and Palfrey, T. R. Quantal response equilibria for normal form games. Games and economic behavior, 10(1):6–38, 1995b.

Moravčík, M., Schmid, M., Burch, N., Lisỳ, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., and Bowling, M. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.

Lai, M. Forum post on alphazero news (post by user matthewlai). http://talkchess.com/forum3/viewtopic.php?f=2&t=69175&sid=06ca6a966c29743d765c11b13402be8d&start=70#p781765, 2018.

Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6292–6299. IEEE, 2018.

Nash, J. Non-cooperative games. Annals of mathematics, pp. 286–295, 1951.

Smith, J. M. Evolution and the Theory of Games. Cambridge university press, 1982.

Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, volume 1, pp. 2, 2000.

Taylor, P. D. and Jonker, L. B. Evolutionary stable strategies and game dynamics. Mathematical biosciences, 40(1-2):145–156, 1978.

Paquette, P., Lu, Y., Bocco, S. S., Smith, M., Satya, O.-G., Kummerfeld, J. K., Pineau, J., Singh, S., and Courville, A. C. No-press diplomacy: Modeling multi-agent gameplay. In Advances in Neural Information Processing Systems, pp. 4474–4485, 2019.

Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

Rakhlin, A. Lecture notes on online learning, 2009.

Rudner, T. G., Lu, C., Osborne, M., Gal, Y., and Teh, Y. W. On pathologies in kl-regularized reinforcement learning from expert demonstrations. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.

Siegel, N., Springenberg, J. T., Berkenkamp, F., Abdolmaleki, A., Neunert, M., Lampe, T., Hafner, R., Heess, N., and Riedmiller, M. Keep doing what worked: Behavior modelling priors for offline reinforcement learning. In International Conference on Learning Representations, 2019.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.

Siu, H. C., Peña, J., Chang, K. C., Chen, E., Zhou, Y., Lopez, V. J., Palko, K., and Allen, R. E. Evaluation of human-ai teams for learned and rule-based agents in hanabi. CoRR, abs/2107.07630, 2021.
URL https: //arxiv.org/abs/2107.07630. Tian, Y. Github thread for elf opengo, "[suggestion] clarify fpu in paper". https://github.com/pytorch/ ELF/issues/140, 2019. Tian, Y. and Zhu, Y. Better computer go player with neural network and long-term prediction, 2016. Troisi, S. Github thread for minigo "[experiment] squeeze and excitation". https://github.com/ tensorflow/minigo/issues/683, 2019. Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575 (7782):350–354, 2019. Wang, J., Wang, W., Wang, R., and Gao, W. Beyond monte carlo tree search: Playing go with deep alternative neural network and long-term evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1), Feb. 2017. URL https://ojs.aaai.org/index.php/ AAAI/article/view/10749. Wu, D. Go neural net sandbox. https: //github.com/lightvector/GoNN#rawneural-net-results, 2018. Wu, D. Accelerating self-play learning in go. In AAAI-20 Workshop on Reinforcement Learning in Games, 2020. Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning. 2019. Modeling Strong and Human-Like Gameplay with KL-Regularized Search A. Proofs In this Appendix, we present detailed proofs of Proposition 1 and Theorem 2. A.1. Known results We start by recalling a few standard results. First, we recall the follow-the-regularized-leader (FTRL) algorithm, one of the most well-studied algorithms in online optimization. At every time t, the FTRL algorithm instantiated with domain X , 1-strongly-convex regularizer φ : X → R, and learning rate η > 0, produces iterates according to ( ) t X φ(x) xt+1 = arg max − + `τ (x) , η x∈X τ =1 (FTRL) where `1 , . . . , `t are the convex utility functions gave as feedback by the environment. The FTRL algorithm guarantees the following regret bound. Lemma 1 (Rakhlin (2009), Corollary 7). The iterates xt ∈ X produced by the FTRL algorithm set up with constant step size η > 0 and 1-strongly convex regularizer φ satisfy the regret bound T X t=1 `t (u) − `t (xt ) ≤ T φ(u) X t t+1 + ` (x ) − `t (xt ) η t=1 ∀u ∈ X. (8) In the analysis of Algorithm 1 we will also make use of the following technical lemma, a proof of which can be obtained starting using the same techniques as Abernethy & Rakhlin (2009, Lemma A.4) Lemma 2. Let p ∈ ∆(A) be a distribution over a discrete set A, q ∈ R|A| be a vector, and D > 0 be any constant such that maxa,a0 ∈A {q(a) − q(a0 )} ≤ M . Then, 2 a∈A p(a) · exp{−q(a)} P 2 − 1 ≤ p(a) · exp{−q(a)} a∈A P exp{2M } X p(a)q(a)2 . M2 a∈A Proof. Let qmin := mina∈A q(a) and q̃(a) := q(a) − qmin be a shifted version of q(a) so that 0 ≤ q̃(a) ≤ M for all a ∈ A. Let now X denote a random variable with value q̃(a) with probability p(a) for all a ∈ A. Then, 2 a∈A p(a) · exp{−q(a)} P exp{qmin }2 a∈A p(a) · exp{q(a)}2 2 − 1 = 2 − 1 P P exp{qmin }2 a∈A p(a) · exp{−q(a)} a∈A p(a) · exp{−q(a)} P p(a) · exp{−q̃(a)}2 = Pa∈A 2 − 1 a∈A p(a) · exp{−q̃(a)} P = E[exp(−X)] − 1. 
[E exp(−X)]2 Applying Lemma A.4 from Abernethy & Rakhlin (2009) we obtain 2 a∈A p(a) · exp{−q(a)} P 2 − 1 ≤ a∈A p(a) · exp{−q(a)} P exp{2M } − 2M − 1 exp{2M } (E[X 2 ] − E[X]2 ) ≤ E[X 2 ], M2 M2 which is exactly the statement. A.2. Bounding the distance between the iterates of Algorithm 1 t Lemma 3. At all times t and for all players i, the with constant step Ppolicies πi produced by the FTRL algorithm set up size η and negative entropy regularizer ϕ(x) := a∈Ai x(a) log x(a), when observing the utilities Uit , match the policies πit produced by Algorithm 1. Modeling Strong and Human-Like Gameplay with KL-Regularized Search Proof. Plugging the particular choices of utilities and regularizer into (FTRL), we obtain ( t ! ) X 0 X 1 t+1 πi = arg max Uit (π) − π(a) log π(a) η π∈∆(Ai ) 0 t =1 a∈Ai ( )  ! X t X X π(a) t0 π(a) ui (a, a−i ) − λi π(a) log = arg max η − π(a) log π(a) τi (a) π∈∆(Ai ) t0 =1 a∈Ai a∈Ai ( ! ) t X X X t0 tλi log τi (a) + = arg max η ui (a, a−i ) π(a) − (1 + tλi η) π(a) log π(a) π∈∆(Ai ) t0 =1 a∈Ai a∈Ai ( X η = arg max π∈∆(Ai ) 1 + tλi η tλi log τi (a) + t X ! 0 ui (a, at−i ) t0 =1 a∈Ai ) π(a) − X π(a) log π(a) . (9) a∈Ai A well-known closed form solution to the above entropy-regularized problem is given by the softmax function. In particular, let ! t X η t+1 t0 wi (a) := tλi log τi (a) + ũi (a, a−i ) ∀a ∈ Ai . (10) 1 + tλi η 0 t =1 Then, exp{wit+1 (a)} t+1 0 (a )} a0 ∈Ai exp{wi πit+1 (a) = P ∀a ∈ Ai , which coincides with the iterate produced by Algorithm 1. The next observation shows that the iterates πi do not change if the utility function ui is first shifted to be in the range [0, Di ]. Remark 1. Consider the shifted utilities 0 ũi (a, at−i ) := ui (a, at−i ) − min a∈A1 ×···×An ui (a) ∈ [0, Di ] (11) and let v be defined as (10) using ũi in place of ui , that is, vit+1 (a) := t X 0 η ũi (a, at−i ) tλi log τi (a) + 1 + tλi η 0 ! t =1 ∀a ∈ Ai . (12) Then, the iterates πi can be equivalently expressed as exp{vit+1 (a)} t+1 0 (a )} a0 ∈Ai exp{vi πit+1 (a) = P ∀a ∈ Ai . Proof. Let γ := mina∈A1 ×···×An ui (a) denote the minimum utility that Player i can get against the actions of the opponents. Since the argmax of a function does not change if a constant is added to the objective, from (9) we can write ( ! ) t X X X η η t+1 t0 πi = arg max − γ+ tλi log τi (a) + ui (a, a−i ) π(a) − π(a) log π(a) 1 + tλi η 1 + tλi η π∈∆(Ai ) t0 =1 a∈Ai a∈Ai ! ) ( t X X X η t0 = arg max tλi log τi (a) + (ui (a, a−i ) − γ) π(a) − π(a) log π(a) π∈∆(Ai ) 1 + tλi η a∈A t0 =1 a∈Ai i ) ( X X t+1 = arg max vi (a)π(a) − π(a) log π(a) , π∈∆(Ai ) a∈Ai a∈Ai Modeling Strong and Human-Like Gameplay with KL-Regularized Search where the second equality follows from the fact that π ∈ ∆(Ai ). A solution is again given by softmax function exp{vit+1 (a)} t+1 0 (a )} a0 ∈Ai exp{vi πit+1 (a) = P (13) ∀a ∈ Ai . In the rest of the proof we will use (13) to analyze the iterates πi produced by the algorithm. Lemma 4. Let η ≤ 1/(λi βi + 2Di ). Then, at all times t, kπit+1 − πit k∇2 ϕ(πit ) ≤ √ 3e . tλi η Proof. At all times t, introduce the vector ξit ∈ R|Ai | defined as ξit (a) :=  η −λi v t (a) + λi log τi (a) + ũi (a, at−i ) 1 + tλi η ∀a ∈ Ai . (14) At all times t and for all a, it holds that   η 1 + (t − 1)λi η t v (a) + λi log τi (a) + ũi (a, at−i ) 1 + tλi η η   η 1 + tλi η t = v (a) − λi v t (a) + λi log τi (a) + ũi (a, at−i ) 1 + tλi η η vit+1 (a) = (15) = v t (a) + ξit (a). Substituting (15) we can write exp{vit (a)} · exp{ξit (a)} πit (a) exp{ξit (a)} P = . 
In the rest of the analysis we will use (13) to study the iterates $\pi_i$ produced by the algorithm.

Lemma 4. Let $\eta\le 1/(\lambda_i\beta_i+2D_i)$. Then, at all times t,
$$\|\pi_i^{t+1}-\pi_i^t\|_{\nabla^2\varphi(\pi_i^t)} \;\le\; \frac{\sqrt{3}\,e}{t\lambda_i\eta}.$$

Proof. At all times t, introduce the vector $\xi_i^t\in\mathbb{R}^{|A_i|}$ defined as
$$\xi_i^t(a) := \frac{\eta}{1+t\lambda_i\eta}\Bigl(-\lambda_i v_i^t(a) + \lambda_i\log\tau_i(a) + \tilde u_i(a,a_{-i}^t)\Bigr) \qquad \forall a\in A_i. \tag{14}$$
At all times t and for all a, it holds that
$$v_i^{t+1}(a) = \frac{\eta}{1+t\lambda_i\eta}\left(\frac{1+(t-1)\lambda_i\eta}{\eta}v_i^t(a) + \lambda_i\log\tau_i(a) + \tilde u_i(a,a_{-i}^t)\right) = \frac{\eta}{1+t\lambda_i\eta}\left(\frac{1+t\lambda_i\eta}{\eta}v_i^t(a) - \lambda_i v_i^t(a) + \lambda_i\log\tau_i(a) + \tilde u_i(a,a_{-i}^t)\right) = v_i^t(a) + \xi_i^t(a). \tag{15}$$
Substituting (15) into (13) we can write
$$\pi_i^{t+1}(a) = \frac{\exp\{v_i^t(a)\}\exp\{\xi_i^t(a)\}}{\sum_{a'\in A_i}\exp\{v_i^t(a')\}\exp\{\xi_i^t(a')\}} = \frac{\pi_i^t(a)\exp\{\xi_i^t(a)\}}{\sum_{a'\in A_i}\pi_i^t(a')\exp\{\xi_i^t(a')\}}. \tag{16}$$
Expanding the definition of the local norm induced by $\nabla^2\varphi(\pi_i^t)$, expanding the square, and using $\sum_a\pi_i^t(a)=1$, we find
$$\|\pi_i^{t+1}-\pi_i^t\|^2_{\nabla^2\varphi(\pi_i^t)} = \sum_{a\in A_i}\frac{1}{\pi_i^t(a)}\bigl(\pi_i^{t+1}(a)-\pi_i^t(a)\bigr)^2 = \sum_{a\in A_i}\pi_i^t(a)\left(\frac{\exp\{\xi_i^t(a)\}}{\sum_{a'\in A_i}\pi_i^t(a')\exp\{\xi_i^t(a')\}}-1\right)^2 \tag{17}$$
$$= \frac{\sum_{a\in A_i}\pi_i^t(a)\exp\{\xi_i^t(a)\}^2}{\bigl(\sum_{a'\in A_i}\pi_i^t(a')\exp\{\xi_i^t(a')\}\bigr)^2} - 1, \tag{18}$$
where (17) follows from substituting (16). We now apply Lemma 2 with $q=\xi_i^t$, $p=\pi_i^t$, and $A=A_i$. First, we study the range $\max_{a,a'\in A_i}\{\xi_i^t(a)-\xi_i^t(a')\}$ appearing in the statement of the lemma. In particular, using (15) and the definitions of $v_i^t$ and $v_i^{t+1}$ we have
$$\begin{aligned}
\max_{a,a'\in A_i}\{\xi_i^t(a)-\xi_i^t(a')\} &= \max_{a,a'\in A_i}\{v_i^{t+1}(a)-v_i^t(a)-v_i^{t+1}(a')+v_i^t(a')\}\\
&= \max_{a,a'\in A_i}\Biggl\{\frac{\lambda_i\eta\,\bigl(\log\tau_i(a)-\log\tau_i(a')\bigr)}{(1+t\lambda_i\eta)(1+(t-1)\lambda_i\eta)} - \frac{\lambda_i\eta^2}{(1+t\lambda_i\eta)(1+(t-1)\lambda_i\eta)}\sum_{t'=1}^{t-1}\bigl(\tilde u_i(a,a_{-i}^{t'})-\tilde u_i(a',a_{-i}^{t'})\bigr) + \frac{\eta}{1+t\lambda_i\eta}\bigl(\tilde u_i(a,a_{-i}^t)-\tilde u_i(a',a_{-i}^t)\bigr)\Biggr\}\\
&\le \eta(\lambda_i\beta_i + 2D_i), \tag{19}
\end{aligned}$$
where the inequality follows from upper bounding the max of the sum by the sum of the maxes of each term, together with
$$\frac{\lambda_i\eta}{(1+t\lambda_i\eta)(1+(t-1)\lambda_i\eta)} \le \frac{\lambda_i\eta}{1+t\lambda_i\eta} \le \lambda_i\eta, \qquad \frac{\eta}{1+t\lambda_i\eta}\le\eta,$$
and
$$\frac{\lambda_i\eta^2}{(1+t\lambda_i\eta)(1+(t-1)\lambda_i\eta)}\left|\sum_{t'=1}^{t-1}\bigl(\tilde u_i(a,a_{-i}^{t'})-\tilde u_i(a',a_{-i}^{t'})\bigr)\right| \le \frac{\lambda_i\eta^2\,t D_i}{t\lambda_i\eta} = \eta D_i.$$
Applying Lemma 2 to the right-hand side of (18) with the bound on the range of $\xi_i^t$ shown in (19) yields
$$\|\pi_i^{t+1}-\pi_i^t\|^2_{\nabla^2\varphi(\pi_i^t)} \;\le\; \frac{\exp\{2\eta(\lambda_i\beta_i+2D_i)\}}{\eta^2(\lambda_i\beta_i+2D_i)^2}\sum_{a\in A_i}\pi_i^t(a)\,\xi_i^t(a)^2. \tag{20}$$
Using the fact that any convex combination of values is upper bounded by the maximum value, we can further bound the right-hand side of (20) as
$$\|\pi_i^{t+1}-\pi_i^t\|^2_{\nabla^2\varphi(\pi_i^t)} \;\le\; \frac{\exp\{2\eta(\lambda_i\beta_i+2D_i)\}}{\eta^2(\lambda_i\beta_i+2D_i)^2}\max_{a\in A_i}\xi_i^t(a)^2 = \frac{\exp\{2\eta(\lambda_i\beta_i+2D_i)\}}{\eta^2(\lambda_i\beta_i+2D_i)^2}\max_{a\in A_i}\bigl(v_i^{t+1}(a)-v_i^t(a)\bigr)^2,$$
where the equality follows from (15).
Hence, expanding the definition of $v_i^t$ and $v_i^{t+1}$ and using $(x+y+z)^2\le 3(x^2+y^2+z^2)$,
$$\begin{aligned}
\|\pi_i^{t+1}-\pi_i^t\|^2_{\nabla^2\varphi(\pi_i^t)} &\le \frac{\exp\{2\eta(\lambda_i\beta_i+2D_i)\}}{\eta^2(\lambda_i\beta_i+2D_i)^2}\max_{a\in A_i}\left(\frac{\lambda_i\eta\,\log\tau_i(a)}{(1+t\lambda_i\eta)(1+(t-1)\lambda_i\eta)} + \frac{\eta\,\tilde u_i(a,a_{-i}^t)}{1+t\lambda_i\eta} - \frac{\lambda_i\eta^2}{(1+t\lambda_i\eta)(1+(t-1)\lambda_i\eta)}\sum_{t'=1}^{t-1}\tilde u_i(a,a_{-i}^{t'})\right)^2\\
&\le 3\,\frac{\exp\{2\eta(\lambda_i\beta_i+2D_i)\}}{(1+t\lambda_i\eta)^2\,\eta^2(\lambda_i\beta_i+2D_i)^2}\bigl(\lambda_i^2\eta^2\beta_i^2 + 2\eta^2 D_i^2\bigr)\\
&\le 3\,\frac{\exp\{2\eta(\lambda_i\beta_i+2D_i)\}}{(1+t\lambda_i\eta)^2} \;\le\; 3\,\frac{\exp\{2\eta(\lambda_i\beta_i+2D_i)\}}{(t\lambda_i\eta)^2} = \left(\frac{\sqrt{3}\,\exp\{\eta(\lambda_i\beta_i+2D_i)\}}{t\lambda_i\eta}\right)^2.
\end{aligned}$$
Using the hypothesis that $\eta\le 1/(\lambda_i\beta_i+2D_i)$ and taking square roots yields the statement.

A.3. Completing the analysis

Proposition 1. Fix a player $i\in\{1,\ldots,n\}$. The regret
$$R_i^T := \max_{\pi^*\in\Delta(A_i)}\sum_{t=1}^{T}U_i^t(\pi^*) - \sum_{t=1}^{T}U_i^t(\pi_i^t)$$
incurred up to any time T by the policies $\pi_i^t$ defined in (6), where the learning rate is set to any value $0<\eta\le 1/(\lambda_i\beta_i+2D_i)$, satisfies
$$R_i^T \;\le\; \frac{\log|A_i|}{\eta} + \frac{3e(1+\log T)}{\lambda_i\eta}\bigl(D_i + \lambda_i\beta_i + \lambda_i\sqrt{|A_i|}\bigr),$$
where $D_i$ is any upper bound on the range of the possible rewards of Player i, and
$$\beta_i := \max_{a\in A_i}\log\bigl(1/\tau_i(a)\bigr). \tag{21}$$

Proof. Let $q_i^t := \bigl(\tilde u_i(a,a_{-i}^t)\bigr)_{a\in A_i}$, and note that, by definition of the regularized utilities $U_i$,
$$\begin{aligned}
U_i^t(\pi_i^{t+1}) - U_i^t(\pi_i^t) &= (q_i^t)^\top\bigl(\pi_i^{t+1}-\pi_i^t\bigr) - \lambda_i D_{\mathrm{KL}}(\pi_i^{t+1}\,\|\,\tau_i) + \lambda_i D_{\mathrm{KL}}(\pi_i^t\,\|\,\tau_i)\\
&= (q_i^t)^\top\bigl(\pi_i^{t+1}-\pi_i^t\bigr) - \lambda_i\varphi(\pi_i^{t+1}) + \lambda_i\varphi(\pi_i^t) - \lambda_i\nabla\varphi(\tau_i)^\top\bigl(\pi_i^t-\pi_i^{t+1}\bigr)\\
&= \bigl(q_i^t + \lambda_i\nabla\varphi(\tau_i)\bigr)^\top\bigl(\pi_i^{t+1}-\pi_i^t\bigr) - \lambda_i\varphi(\pi_i^{t+1}) + \lambda_i\varphi(\pi_i^t)\\
&\le \bigl(q_i^t + \lambda_i\nabla\varphi(\tau_i)\bigr)^\top\bigl(\pi_i^{t+1}-\pi_i^t\bigr) - \lambda_i\varphi(\pi_i^t) - \lambda_i\nabla\varphi(\pi_i^t)^\top\bigl(\pi_i^{t+1}-\pi_i^t\bigr) + \lambda_i\varphi(\pi_i^t)\\
&= \bigl(q_i^t + \lambda_i\nabla\varphi(\tau_i) - \lambda_i\nabla\varphi(\pi_i^t)\bigr)^\top\bigl(\pi_i^{t+1}-\pi_i^t\bigr)\\
&\le \bigl\|q_i^t + \lambda_i\nabla\varphi(\tau_i) - \lambda_i\nabla\varphi(\pi_i^t)\bigr\|_{\nabla^{-2}\varphi(\pi_i^t)}\cdot\bigl\|\pi_i^{t+1}-\pi_i^t\bigr\|_{\nabla^2\varphi(\pi_i^t)},
\end{aligned}$$
where the first inequality follows by convexity of $\varphi$ and the second by the generalized Cauchy-Schwarz inequality with the primal-dual norm pair $\|\cdot\|_{\nabla^2\varphi(\pi_i^t)}$ and $\|\cdot\|_{\nabla^{-2}\varphi(\pi_i^t)}$. A bound for the second term in the product is given by Lemma 4. We now bound the first norm. First,
$$\begin{aligned}
\bigl\|q_i^t + \lambda_i\nabla\varphi(\tau_i) - \lambda_i\nabla\varphi(\pi_i^t)\bigr\|^2_{\nabla^{-2}\varphi(\pi_i^t)} &= \sum_{a\in A_i}\pi_i^t(a)\bigl(\tilde u_i(a,a_{-i}^t) + \lambda_i\log\tau_i(a) - \lambda_i\log\pi_i^t(a)\bigr)^2\\
&\le 3\sum_{a\in A_i}\pi_i^t(a)\Bigl(\tilde u_i(a,a_{-i}^t)^2 + \lambda_i^2(\log\tau_i(a))^2 + \lambda_i^2(\log\pi_i^t(a))^2\Bigr)\\
&\le 3\bigl(D_i^2 + \lambda_i^2\beta_i^2 + \lambda_i^2|A_i|\bigr) \;\le\; 3\bigl(D_i + \lambda_i\beta_i + \lambda_i\sqrt{|A_i|}\bigr)^2,
\end{aligned}$$
where the second inequality follows from the fact that $x\log^2 x\le 1$ for all $x\in[0,1]$. Taking square roots, we find
$$\bigl\|q_i^t + \lambda_i\nabla\varphi(\tau_i) - \lambda_i\nabla\varphi(\pi_i^t)\bigr\|_{\nabla^{-2}\varphi(\pi_i^t)} \le \sqrt{3}\bigl(D_i + \lambda_i\beta_i + \lambda_i\sqrt{|A_i|}\bigr).$$
So, using Lemma 4, we can write
$$U_i^t(\pi_i^{t+1}) - U_i^t(\pi_i^t) \;\le\; \frac{3e}{t\lambda_i\eta}\bigl(D_i + \lambda_i\beta_i + \lambda_i\sqrt{|A_i|}\bigr).$$
Plugging the above expression into Lemma 1 yields
$$R_i^T \;\le\; \frac{\log|A_i|}{\eta} + \sum_{t=1}^{T}\frac{3e}{t\lambda_i\eta}\bigl(D_i+\lambda_i\beta_i+\lambda_i\sqrt{|A_i|}\bigr) \;\le\; \frac{\log|A_i|}{\eta} + \frac{3e(1+\log T)}{\lambda_i\eta}\bigl(D_i+\lambda_i\beta_i+\lambda_i\sqrt{|A_i|}\bigr),$$
which is the statement.
A.4. Proof of Theorem 1

Theorem 1 (piKL stays close to the anchor policy). Upon running Algorithm 1 for T iterations in a multiplayer general-sum game, the average policy $\bar\pi_i^T$ is at a distance
$$D_{\mathrm{KL}}(\bar\pi_i^T\,\|\,\tau_i) \;\le\; \frac{1}{\lambda_i}\left(\frac{R_i^T}{T} + D_i\right)$$
from the anchor policy $\tau_i$, where $D_i$ is any upper bound on the possible rewards for Player i. In particular, if $\eta>0$ is set so that $R_i^T=o(T)$, then the bound on $D_{\mathrm{KL}}(\bar\pi_i^T\,\|\,\tau_i)$ approaches $D_i/\lambda_i$ as $T\to+\infty$.

Proof. By definition of regret,
$$\begin{aligned}
\frac{1}{T}R_i^T &= \max_{\pi_i^*\in\Delta(A_i)}\frac{1}{T}\sum_{t=1}^{T}\Bigl(U_i^t(\pi_i^*) - U_i^t(\pi_i^t)\Bigr)\\
&= \max_{\pi_i^*\in\Delta(A_i)}\left\{\frac{1}{T}\sum_{t=1}^{T}u_i(\pi_i^*,\pi_{-i}^t) - \frac{1}{T}\sum_{t=1}^{T}u_i(\pi_i^t,\pi_{-i}^t) - \lambda_i D_{\mathrm{KL}}(\pi_i^*\,\|\,\tau_i) + \frac{\lambda_i}{T}\sum_{t=1}^{T}D_{\mathrm{KL}}(\pi_i^t\,\|\,\tau_i)\right\}\\
&\ge \frac{1}{T}\sum_{t=1}^{T}u_i(\tau_i,\pi_{-i}^t) - \frac{1}{T}\sum_{t=1}^{T}u_i(\pi_i^t,\pi_{-i}^t) + \frac{\lambda_i}{T}\sum_{t=1}^{T}D_{\mathrm{KL}}(\pi_i^t\,\|\,\tau_i)\\
&\ge \frac{1}{T}\sum_{t=1}^{T}u_i(\tau_i,\pi_{-i}^t) - \frac{1}{T}\sum_{t=1}^{T}u_i(\pi_i^t,\pi_{-i}^t) + \lambda_i D_{\mathrm{KL}}(\bar\pi_i^T\,\|\,\tau_i)\\
&\ge -D_i + \lambda_i D_{\mathrm{KL}}(\bar\pi_i^T\,\|\,\tau_i),
\end{aligned}$$
where the first inequality follows by lower bounding the maximum with its value at $\pi_i^*=\tau_i$ (for which the KL penalty vanishes, and the remaining KL terms are nonnegative), the second by convexity of the KL divergence, and the third by the definition of $D_i$. Rearranging yields the inequality in the statement.

A.5. Relationship with Nash Equilibrium

In this subsection, we show that when all players play according to Algorithm 1 in a two-player zero-sum game, the average policies $\bar\pi_i^T$ converge to a Nash equilibrium of the regularized game whose utilities are $U_i$.

Proposition 2. For any $T\in\mathbb{N}$, $\eta>0$, and $\delta\in(0,1)$, define the quantity
$$\xi^T(\delta) := \frac{R_1^T+R_2^T}{T} + \bigl(\max_i D_i\bigr)\sqrt{\frac{32}{T}\log\frac{2\max_i|A_i|}{\delta}}.$$
Upon running Algorithm 1 for any T iterations with learning rate $\eta>0$, the average policies $\bar\pi_i^T$ of each player form a $\xi^T(\delta)$-approximate Nash equilibrium with respect to the regularized utility functions $U_i$ with probability at least $1-\delta$, for any $\delta\in(0,1)$.

Proof. Fix a player $i\in\{1,2\}$ and any policy $\pi^*\in\Delta(A_i)$, and introduce the discrete-time stochastic process
$$w^t := \bigl(U_i(\pi^*,\pi_{-i}^t) - U_i(\pi_i^t,\pi_{-i}^t)\bigr) - \bigl(U_i(\pi^*,a_{-i}^t) - U_i(\pi_i^t,a_{-i}^t)\bigr).$$
Since the opponent $-i$ plays according to Algorithm 1, its action $a_{-i}^t$ at all times t is selected by sampling (unbiasedly) an action from the policy $\pi_{-i}^t$. Therefore, $w^t$ is a martingale difference sequence. Furthermore, by expanding the definition of $U_i$, the absolute value of $w^t$ satisfies
$$|w^t| = \bigl|\bigl(u_i(\pi^*,\pi_{-i}^t)-u_i(\pi_i^t,\pi_{-i}^t)\bigr) - \bigl(u_i(\pi^*,a_{-i}^t)-u_i(\pi_i^t,a_{-i}^t)\bigr)\bigr| \le \bigl|u_i(\pi^*,\pi_{-i}^t)-u_i(\pi_i^t,\pi_{-i}^t)\bigr| + \bigl|u_i(\pi^*,a_{-i}^t)-u_i(\pi_i^t,a_{-i}^t)\bigr| \le 2D_i.$$
Hence, using the Azuma-Hoeffding inequality, for any $\delta\in(0,1)$,
$$1-\delta \;\le\; \mathbb{P}\left[\sum_{t=1}^{T}w^t \le D_i\sqrt{8T\log\frac{1}{\delta}}\right] \;\le\; \mathbb{P}\left[\sum_{t=1}^{T}U_i(\pi^*,\pi_{-i}^t) - \sum_{t=1}^{T}U_i(\pi_i^t,\pi_{-i}^t) \le R_i^T + D_i\sqrt{8T\log\frac{1}{\delta}}\right],$$
where the second inequality uses the fact that $\sum_{t=1}^{T}\bigl(U_i(\pi^*,a_{-i}^t)-U_i(\pi_i^t,a_{-i}^t)\bigr)\le R_i^T$ by definition of $R_i^T$. Since the above holds for any $\pi^*\in\Delta(A_i)$, using a union bound over each $a\in A_i$,
$$\mathbb{P}\left[\max_{\pi^*\in\Delta(A_i)}\sum_{t=1}^{T}U_i(\pi^*,\pi_{-i}^t) - \sum_{t=1}^{T}U_i(\pi_i^t,\pi_{-i}^t) \le R_i^T + D_i\sqrt{8T\log\frac{|A_i|}{\delta}}\right] \ge 1-\delta \tag{22}$$
for any player $i\in\{1,2\}$ and any $\delta\in(0,1)$.
Summing Inequality (22) for $i\in\{1,2\}$ and using a union bound, we can further write
$$\mathbb{P}\Biggl[\max_{\pi_1^*\in\Delta(A_1)}\Bigl\{\sum_{t=1}^{T}U_1(\pi_1^*,\pi_2^t)\Bigr\} + \max_{\pi_2^*\in\Delta(A_2)}\Bigl\{\sum_{t=1}^{T}U_2(\pi_1^t,\pi_2^*)\Bigr\} - \sum_{t=1}^{T}\bigl(U_1(\pi_1^t,\pi_2^t)+U_2(\pi_1^t,\pi_2^t)\bigr) \le R_1^T+R_2^T+\bigl(\max_i D_i\bigr)\sqrt{32T\log\frac{\max_i|A_i|}{\delta}}\Biggr] \ge 1-2\delta.$$
Dividing by T and noting that for any player $i\in\{1,2\}$
$$\frac{1}{T}\sum_{t=1}^{T}U_i(\pi^*,\pi_{-i}^t) = U_i\Bigl(\pi^*,\frac{1}{T}\sum_{t=1}^{T}\pi_{-i}^t\Bigr) = U_i\bigl(\pi^*,\bar\pi_{-i}^T\bigr)$$
further yields
$$\mathbb{P}\Biggl[\max_{\pi_1^*\in\Delta(A_1)}U_1(\pi_1^*,\bar\pi_2^T) + \max_{\pi_2^*\in\Delta(A_2)}U_2(\bar\pi_1^T,\pi_2^*) - \frac{1}{T}\sum_{t=1}^{T}\bigl(U_1(\pi_1^t,\pi_2^t)+U_2(\pi_1^t,\pi_2^t)\bigr) \le \frac{R_1^T+R_2^T}{T}+\bigl(\max_i D_i\bigr)\sqrt{\frac{32}{T}\log\frac{\max_i|A_i|}{\delta}}\Biggr] \ge 1-2\delta. \tag{23}$$
We now analyze the term
$$(\clubsuit) := -\frac{1}{T}\sum_{t=1}^{T}\bigl(U_1(\pi_1^t,\pi_2^t)+U_2(\pi_1^t,\pi_2^t)\bigr).$$
Plugging in the definitions of $U_1$ and $U_2$, that is,
$$U_1(\pi_1,\pi_2) := u_1(\pi_1,\pi_2) - \lambda_1 D_{\mathrm{KL}}(\pi_1\,\|\,\tau_1), \qquad U_2(\pi_1,\pi_2) := u_2(\pi_1,\pi_2) - \lambda_2 D_{\mathrm{KL}}(\pi_2\,\|\,\tau_2) = -u_1(\pi_1,\pi_2) - \lambda_2 D_{\mathrm{KL}}(\pi_2\,\|\,\tau_2),$$
into $(\clubsuit)$ yields
$$(\clubsuit) = \frac{1}{T}\sum_{t=1}^{T}\bigl(\lambda_1 D_{\mathrm{KL}}(\pi_1^t\,\|\,\tau_1) + \lambda_2 D_{\mathrm{KL}}(\pi_2^t\,\|\,\tau_2)\bigr) \;\ge\; \lambda_1 D_{\mathrm{KL}}(\bar\pi_1^T\,\|\,\tau_1) + \lambda_2 D_{\mathrm{KL}}(\bar\pi_2^T\,\|\,\tau_2) \tag{24}$$
$$= -\bigl(U_1(\bar\pi_1^T,\bar\pi_2^T) + U_2(\bar\pi_1^T,\bar\pi_2^T)\bigr), \tag{25}$$
where (24) follows from convexity of the KL divergence, and (25) follows again from the definition of $U_1$ and $U_2$. Substituting (25) back into (23), we find
$$\mathbb{P}\Biggl[\Bigl(\max_{\pi_1^*\in\Delta(A_1)}U_1(\pi_1^*,\bar\pi_2^T) - U_1(\bar\pi_1^T,\bar\pi_2^T)\Bigr) + \Bigl(\max_{\pi_2^*\in\Delta(A_2)}U_2(\bar\pi_1^T,\pi_2^*) - U_2(\bar\pi_1^T,\bar\pi_2^T)\Bigr) \le \frac{R_1^T+R_2^T}{T}+\bigl(\max_i D_i\bigr)\sqrt{\frac{32}{T}\log\frac{\max_i|A_i|}{\delta}}\Biggr] \ge 1-2\delta. \tag{26}$$
Since
$$\max_{\pi_1^*\in\Delta(A_1)}U_1(\pi_1^*,\bar\pi_2^T) - U_1(\bar\pi_1^T,\bar\pi_2^T) \ge 0 \quad\text{and}\quad \max_{\pi_2^*\in\Delta(A_2)}U_2(\bar\pi_1^T,\pi_2^*) - U_2(\bar\pi_1^T,\bar\pi_2^T) \ge 0,$$
the inequality above in particular implies that
$$\mathbb{P}\Biggl[\max\Bigl\{\max_{\pi_1^*\in\Delta(A_1)}U_1(\pi_1^*,\bar\pi_2^T) - U_1(\bar\pi_1^T,\bar\pi_2^T),\;\max_{\pi_2^*\in\Delta(A_2)}U_2(\bar\pi_1^T,\pi_2^*) - U_2(\bar\pi_1^T,\bar\pi_2^T)\Bigr\} \le \frac{R_1^T+R_2^T}{T}+\bigl(\max_i D_i\bigr)\sqrt{\frac{32}{T}\log\frac{\max_i|A_i|}{\delta}}\Biggr] \ge 1-2\delta, \tag{27}$$
which is equivalent to the statement after making the variable substitution $\delta := \delta'/2$.

In particular, when $\eta\le 1/(\lambda_i\beta_i+2D_i)$ for both players $i\in\{1,2\}$, Proposition 2 implies that the average strategy profile is an $O(1/\sqrt{T})$-approximate Nash equilibrium with respect to the regularized utility functions $U_i$. A standard application of the Borel-Cantelli lemma enables us to convert the high-probability guarantees of Proposition 2 at finite time into almost-sure convergence in the limit. Specifically:

Corollary 1. Let $(\bar\pi_1,\bar\pi_2)$ be any limit point of the average policies $(\bar\pi_1^T,\bar\pi_2^T)$ of the players. Almost surely, $(\bar\pi_1,\bar\pi_2)$ is a Nash equilibrium with respect to the regularized utility functions $U_1,U_2$, respectively.

From there, it is immediate to give guarantees with respect to the original (i.e., unregularized) game, and Theorem 2 follows.

Theorem 2. Let $(\bar\pi_1,\bar\pi_2)$ be any limit point of the average policies $(\bar\pi_1^T,\bar\pi_2^T)$ of the players. Almost surely, $(\bar\pi_1,\bar\pi_2)$ is a $(\max_{i=1,2}\{\lambda_i\beta_i\})$-approximate Nash equilibrium policy with respect to the original utility functions $u_i$, where $\beta_i$ is as defined in (21).

Proof. From Corollary 1, almost surely $(\bar\pi_1,\bar\pi_2)$ is a Nash equilibrium of the regularized game whose players' utilities are $U_1$ and $U_2$, respectively.
Expanding the definition of Nash equilibrium relative to Player 1, we have that
$$\begin{aligned}
0 &= \max_{\pi_1^*\in\Delta(A_1)}\{U_1(\pi_1^*,\bar\pi_2) - U_1(\bar\pi_1,\bar\pi_2)\}\\
&= \max_{\pi_1^*\in\Delta(A_1)}\{u_1(\pi_1^*,\bar\pi_2) - \lambda_1 D_{\mathrm{KL}}(\pi_1^*\,\|\,\tau_1) - u_1(\bar\pi_1,\bar\pi_2) + \lambda_1 D_{\mathrm{KL}}(\bar\pi_1\,\|\,\tau_1)\}\\
&\ge \max_{\pi_1^*\in\Delta(A_1)}\{u_1(\pi_1^*,\bar\pi_2) - u_1(\bar\pi_1,\bar\pi_2) - \lambda_1 D_{\mathrm{KL}}(\pi_1^*\,\|\,\tau_1)\}\\
&= \max_{\pi_1^*\in\Delta(A_1)}\Bigl\{u_1(\pi_1^*,\bar\pi_2) - u_1(\bar\pi_1,\bar\pi_2) - \lambda_1\sum_{a\in A_1}\pi_1^*(a)\log\pi_1^*(a) - \lambda_1\sum_{a\in A_1}\pi_1^*(a)\log\bigl(1/\tau_1(a)\bigr)\Bigr\}\\
&\ge \max_{\pi_1^*\in\Delta(A_1)}\Bigl\{u_1(\pi_1^*,\bar\pi_2) - u_1(\bar\pi_1,\bar\pi_2) - \lambda_1\sum_{a\in A_1}\pi_1^*(a)\log\bigl(1/\tau_1(a)\bigr)\Bigr\}\\
&\ge \max_{\pi_1^*\in\Delta(A_1)}\{u_1(\pi_1^*,\bar\pi_2) - u_1(\bar\pi_1,\bar\pi_2)\} - \lambda_1\beta_1,
\end{aligned}$$
where the first inequality follows since the KL divergence is always nonnegative, the second since the negative entropy function is nonpositive on the simplex, and the third from the definition of $\beta_1$. Symmetrically, for Player 2 we find that
$$0 \;\ge\; \max_{\pi_2^*\in\Delta(A_2)}\{u_2(\bar\pi_1,\pi_2^*) - u_2(\bar\pi_1,\bar\pi_2)\} - \lambda_2\beta_2.$$
Hence, the exploitability of $\bar\pi_1$ is at most $\lambda_1\beta_1$, while the exploitability of $\bar\pi_2$ is at most $\lambda_2\beta_2$, which immediately implies the statement.

B. Illustrations of piKL-Hedge in Blotto

Figure 4. Comparison of piKL and regret matching as a function of λ in Colonel Blotto(10, 3). As λ increases in piKL-Hedge, piKL moves closer to the anchor policy at the cost of increased exploitability. The scale of λ is related to the scale of the payoffs in the game, which are [0, 1] in Blotto.

Colonel Blotto is a famous 2-player simultaneous-action game with a large action space but short, simple rules. In Blotto, each player has c coins to distribute across f fields. The aim is to win the most fields by allocating the player's coins across the fields. A field is won by the player contributing the most coins to that field (and drawn if there is a tie). The winner receives a reward of +1 and the loser receives -1; both receive 0 in the case of a tie.

In Figure 4 we illustrate the key features of piKL-Hedge in Blotto, using a uniform anchor policy for convenience. Incidentally, piKL-Hedge with a uniform anchor policy converges to a quantal response equilibrium (McKelvey & Palfrey, 1995b). piKL-Hedge finds policies that play close to the anchor policy while having low regret, with λ controlling the relative weight of these two desiderata.

C. Human Policy KL-Regularized Search Also Improves Cross-Entropy

Model              Dataset                       (raw model)  cpuct = 10  cpuct = 5  cpuct = 2  cpuct = 1  cpuct = 0.5
Maia1500 (Chess)   Lichess 1500 Rating Bucket    1.476        1.469       1.465      1.470      1.504      1.598
Maia1900 (Chess)   Lichess 1900 Rating Bucket    1.440        1.429       1.422      1.418      1.443      1.529
Our Model (Go)     GoGoD                         1.388        1.362       1.359      1.362      1.391      1.478

Table 3. Cross-entropy predicting human moves in chess and Go using smooth KL optimization post-processing of MCTS with various cpuct. Largest standard error of any value is around 0.0033.

Here we show that not only does policy-regularized search improve top-1 accuracy, it also improves cross-entropy for a reasonable range of parameter values in both chess and Go, as well as performing well in Diplomacy.

For chess and Go, to obtain the policy distribution whose cross-entropy with the human data we compute, we cannot (unlike for top-1 accuracy or for play) directly use the MCTS visit distribution, because occasionally, simply due to discretization, MCTS may give zero visits to the actual move that a human played, resulting in an undefined (i.e. infinite) cross-entropy.
Instead, we leverage the result of Grill et al. (2020) that PUCT-style MCTS with a policy prior can be seen as a discrete approximation to solving a smooth optimization:
$$\arg\max_{\pi}\;\sum_a Q(s,a)\,\pi(s,a) - \lambda\, D_{\mathrm{KL}}(\tau\,\|\,\pi), \qquad \text{where } \lambda = c_{\text{puct}}\,\frac{\sqrt{\sum_a n_a}}{k+\sum_a n_a},$$
where Q is the current value estimate from search for each action, τ is the anchor or prior policy, λ controls the strength of the regularization towards that prior as a function of the number of visits, cpuct is the MCTS exploration coefficient, n_a is the number of times a was explored, and k is an arbitrary constant not affecting the asymptotic results.

We perform MCTS exactly as normal; the only difference is that at the very end, rather than using visit counts, we compute the π optimizing the above objective, using the human imitation-learned anchor policy for τ and the MCTS-estimated Q-values for Q (and using the same unweighted average child value for moves that lack a Q-value estimate due to having zero visits, as described in Appendix F). We use this resulting smooth π as the final policy prediction and compute its cross-entropy with the actual human moves.

Note that Grill et al. (2020) choose k = |A|, the number of legal actions, for mathematical convenience, corresponding to adding one extra visit per possible action. However, since in our experiments we use only 50 visits including the root visit (i.e. Σ_a n_a = 49), and the branching factor can be as large as 362 in Go, adding one extra visit per legal action would greatly overestimate the total number of visits, which in practice gives a less accurate correspondence between MCTS and this smoother regularized solution; we instead choose k = 0.

In Table 3 we show the results. Across roughly the same parameter ranges, the regularized search policy using this smoothed MCTS post-processing achieves lower cross-entropy with human moves than the raw imitation-learned policy without search. This suggests that not only does search improve on the raw imitation-learned policy at pinpointing the top action, it also gives a more accurate model of the overall distribution of likely human actions.

Figure 5. Average cross-entropy of per-unit order prediction in Diplomacy, comparing the human-imitation-learned anchor policy, regret matching, hedge, and piKL-Hedge (λ from 10−5 to 10−1), as a function of pseudo-Elo player rating.

Similarly, in Figure 5, we show that piKL-Hedge achieves better average cross-entropy of unit orders in no-press Diplomacy than unregularized search methods, and matches that of imitation learning at λ = 0.1. This corroborates the results of Section 4.3 and shows that piKL-Hedge provides the same benefits in modeling the overall distribution of human actions as it does in predicting the top move: out-predicting unregularized search (while playing as well or slightly better against human-like opponents), or equaling the prediction quality of imitation learning (while playing much more strongly than IL).
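A minimal Python sketch of this post-processing step, under the assumption (from Grill et al. (2020)) that the maximizer of the regularized objective has the form π(a) = λ·τ(a)/(α - Q(a)) for the normalizing constant α > max_a Q(a), which we find by bisection; the prior τ is assumed strictly positive, and all names are ours.

```python
import numpy as np

def smooth_kl_policy(q, tau, lam, iters=100):
    """Approximately solve max_pi <q, pi> - lam * KL(tau || pi) on the simplex.

    q:   MCTS value estimates Q(s, a) for each action.
    tau: anchor/prior policy (assumed > 0 everywhere, e.g. a softmax prior).
    lam: regularization strength, e.g. c_puct * sqrt(sum(n)) / (k + sum(n)).
    """
    q, tau = np.asarray(q, dtype=float), np.asarray(tau, dtype=float)
    lo = q.max() + 1e-12          # sum of pi blows up as alpha approaches max(q)
    hi = q.max() + lam            # here each term is <= tau(a), so the sum is <= 1
    for _ in range(iters):        # the sum is decreasing in alpha; bisect for sum == 1
        alpha = 0.5 * (lo + hi)
        total = np.sum(lam * tau / (alpha - q))
        if total > 1.0:
            lo = alpha
        else:
            hi = alpha
    pi = lam * tau / (hi - q)
    return pi / pi.sum()          # renormalize away the bisection tolerance
```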
D. More Experiments in Chess and Go

Game    cpuct   MCTS Win% vs raw model (temp = 1)   MCTS Win% vs raw model (temp = 0.3)
Chess   10.0    72.2% ± 1.2%                        62.4% ± 1.3%
Chess   5.0     79.5% ± 1.1%                        72.3% ± 1.2%
Chess   2.0     88.0% ± 0.8%                        86.1% ± 0.9%
Chess   1.0     92.2% ± 0.7%                        92.9% ± 0.6%
Chess   0.5     94.4% ± 0.6%                        94.7% ± 0.5%
Go      10.0    73.2% ± 1.4%                        63.3% ± 1.5%
Go      5.0     80.5% ± 1.3%                        74.1% ± 1.4%
Go      2.0     87.6% ± 1.0%                        85.3% ± 1.1%
Go      1.0     94.6% ± 0.7%                        94.4% ± 0.7%
Go      0.5     96.4% ± 0.6%                        97.0% ± 0.5%

Table 4. Winrate of base model + MCTS vs base model at temperature 1 and 0.3. Base model is Maia1900 in chess, and our GoGoD model in Go. 1000 games per figure, draws count as half a win, ± indicates one standard error. Go uses Japanese rules with 6.5 komi. MCTS greatly improves strength in chess and Go; the smallest cpuct values improve it most.

Model      Predicting   Main Time   Increment   Raw Model Acc %   MCTS Acc %   Acc Gain from MCTS   Approx Stderr
Maia1500   1500         3m          0s          51.9              52.1         0.2                  0.17
Maia1500   1500         5m          0s          52.7              53.2         0.5                  0.16
Maia1500   1500         10m         0s          52.4              53.1         0.7                  0.20
Maia1500   1500         3m          2s          53.1              53.7         0.6                  0.31
Maia1500   1500         5m          3s          52.5              53.4         0.9                  0.35
Maia1500   1500         15m         15s         51.9              52.6         0.7                  0.37
Maia1900   1900         3m          0s          53.0              53.9         0.8                  0.12
Maia1900   1900         5m          0s          53.2              54.5         1.3                  0.17
Maia1900   1900         10m         0s          52.9              54.6         1.7                  0.24
Maia1900   1900         3m          2s          54.1              55.5         1.4                  0.27
Maia1900   1900         5m          3s          53.1              55.1         2.0                  0.51
Maia1900   1900         15m         15s         53.7              55.9         2.2                  0.56

Table 5. Difference in top-1 accuracy between the raw model and MCTS using the best cpuct in predicting human moves for chess players in rating buckets 1500 and 1900 using Maia1500 and Maia1900, split by time control of the games, excluding all time controls with fewer than 100 games. Approx Stderr indicates the rough standard error of the raw accuracy values on that row given the number of games of that time control. Despite the statistical uncertainty on some individual values, overall the improvement of MCTS vs the raw model clearly tends to be larger on games with longer time controls.

This section summarizes the results of a small number of additional experiments in chess and Go. Whereas Figure 1 used a temperature of 1.0 when sampling from the agent policy, we show in Table 4 that MCTS also achieves similar winrates versus the raw model when both are sampled using a much lower temperature of 0.3.

Additionally, we confirm in Go the same result that McIlroy-Young et al. (2020a) reported in chess: current top RL agents are far worse at matching human moves than even just imitation learning. We tested the current top open-source Go program KataGo (Wu, 2020) using its final 6-, 10-, and 15-block models (from https://katagotraining.org), which range from upper-amateur to superhuman level, and found that they achieve accuracies of about 35%, 43%, and 43%, all more than 10% lower than the results in Table 1.

Lastly, in Table 5 we show that in chess the amount of improvement given by MCTS over the raw model in predicting human players on the Lichess test set tends to be larger for games with longer time controls than for shorter time controls, going from about 0.2% to about 0.7% between the shortest and longest time controls for 1500-1599-rated players, and from about 0.8% to around 2.0% for 1900-1999-rated players. This is consistent with the intuition that humans rely more heavily on planning when they have more time to think, increasing the gain from modeling that planning.
E. Baseline Model Architecture and Training for Go

As summarized in Section 3.2, for training baseline imitation-learning models to play on the 19x19 board in Go, our architecture follows the same 20-block, 256-channel residual net described in Silver et al. (2017), except with the addition of squeeze-and-excitation layers at the end of each residual block (Hu et al., 2018). In particular, the following additional operations are inserted just prior to each skip connection that adds the output R of a residual block to the trunk X (a code sketch of this block is given at the end of this appendix):

• Channelwise global average pooling of R from 19 × 19 × 256 channels to 256 channels.
• A fully connected layer including bias from 256 channels to 64 channels.
• A ReLU nonlinearity.
• A fully connected layer including bias from 64 channels to 512 channels, which are split into two vectors S and B of 256 channels each.
• Output R × sigmoid(S) + B to be added back to the trunk X, instead of R as in a normal residual net.

In other words, the final result of the residual block as a whole is ReLU(X + R × sigmoid(S) + B) instead of ReLU(X + R).

Additionally, some games are played with a komi (compensation given to White for playing second) that is not equal to the value of 7.5 used by Silver et al. (2017). Therefore, for the final feature of the input encoding, rather than a binary-valued feature equaling 1 if the player to move is White and 0 if the player to move is Black, we instead use the real-valued feature komi/10 if the player to move is White or -komi/10 if the player to move is Black. We additionally exclude a very tiny number of games with extreme komi values outside the range [-60, 60].

We train using a mini-batch size of 2048, distributed as 8 batches of 256 across 8 GPUs, and train for a total of 64 epochs, roughly 475,000 minibatches, on the GoGoD dataset. We use SGD with momentum 0.9, a weight decay coefficient of 1e-4, and a learning rate schedule of 1e-1, 1e-2, 1e-3, 1e-4 for the first 16, next 16, next 16, and last 16 epochs respectively. We train both the policy and value heads jointly, minimizing the cross-entropy of the policy head with respect to the one-hot move made in the actual game and the MSE of the value head with respect to the game result of -1 or 1, except that, similarly to Silver et al. (2017), we weight the MSE value loss by 0.01 to avoid overfitting of the value head.
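The following minimal PyTorch sketch (our own illustration, not the paper's training code) shows the modified residual block described above; the conv-BN structure of the residual branch follows the standard block of Silver et al. (2017), and tensor shapes assume NCHW layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SqueezeExciteResBlock(nn.Module):
    """Residual block with the squeeze-and-excitation variant described above."""

    def __init__(self, channels: int = 256, squeeze: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.fc1 = nn.Linear(channels, squeeze)        # 256 -> 64
        self.fc2 = nn.Linear(squeeze, 2 * channels)    # 64 -> 512, split into S and B

    def forward(self, x):
        # Ordinary residual branch R.
        r = F.relu(self.bn1(self.conv1(x)))
        r = self.bn2(self.conv2(r))
        # Squeeze: channelwise global average pooling of R to a 256-vector.
        pooled = r.mean(dim=(2, 3))
        s_b = self.fc2(F.relu(self.fc1(pooled)))
        s, b = s_b.chunk(2, dim=1)                     # S and B, 256 channels each
        s = torch.sigmoid(s)[:, :, None, None]
        b = b[:, :, None, None]
        # Excite: ReLU(X + R * sigmoid(S) + B) instead of ReLU(X + R).
        return F.relu(x + r * s + b)
```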
F. MCTS Algorithmic Details

In this appendix, we summarize the details of the version of MCTS used in our experiments, including one often-overlooked detail. We follow a standard MCTS implementation very similar to that of Silver et al. (2017). Each turn, the algorithm builds and expands a game tree over multiple iterations, rooted at the current state for that turn. On each iteration t, MCTS starts at the root and descends the tree by exploring at each state s an action a according to some exploration method. Upon reaching a state s^t not yet explored, it adds s^t to the tree, queries the value function V_i(s^t) for each player i to estimate the total expected future reward, and updates the statistics of all nodes traversed based on V_i(s^t) and any intermediate rewards received. Subsequent iterations begin again from the root. For our work in chess and Go, we follow the convention where win, loss, and draw have reward 1, -1, and 0.

The statistics tracked at the node for each state s where player i is to move include the visit counts N(s, a), which are the number of iterations that reached state s and tried action a, and Q(s, a), the average value of those iterations from the perspective of player i, i.e.
$$Q(s,a) = \frac{1}{N(s,a)}\sum_t \bigl(V_i(s^t) + U_i(s,s^t)\bigr),$$
where the sum ranges only over those iterations t that reached state s and tried action a, and U_i(s, s^t) is the total intermediate reward on the path from s to s^t. When descending the tree, the exploration method is to always select the action
$$\arg\max_a\; Q(s,a) + c_{\text{puct}}\,\tau(s,a)\,\frac{\sqrt{\sum_b N(s,b)}}{N(s,a)+1}, \tag{28}$$
where τ(s, a) is the prior policy probability for action a in state s, and cpuct is a tunable parameter controlling the tradeoff between exploration and exploitation. The final agent policy π is simply proportional to the visit counts at the root, i.e. π(s, a) = N(s, a)/Σ_b N(s, b) where s is the root state, or optionally we may also have π(s, a) ∝ N(s, a)^{1/T} where T is a temperature parameter.

One final often-overlooked detail concerns how to evaluate actions for which there is no Q(s, a) estimate. Since the tree policy depends on the Q(s, a) estimates of the possible actions, there is a nontrivial choice of what Q value to use for an action a that has been tried zero times and therefore never estimated. Since the tree branches exponentially, deeper in the MCTS tree there will always be many actions with zero visits, and so this choice can affect the behavior of MCTS even in the limit of large amounts of search. Unfortunately, the details of this choice have sometimes been left undiscussed and undocumented in major past work, and as a result major MCTS implementations have not standardized on it, variously choosing game-loss (Lai, 2018), the current running average parent Q or a Q-value minus a heuristic offset parameter (Tian, 2019), or many other options.

In our work, we use the equal-weighted average value of all actions at the parent node that have been visited at least once, i.e.
$$\frac{\sum_a Q(s,a)\,\mathbb{I}(N(s,a)>0)}{\sum_a \mathbb{I}(N(s,a)>0)}.$$
This can be viewed as corresponding to a naive prior that the values of actions are i.i.d. draws from an unknown distribution. While not perfect, our choice is simple, parameter-free, behaves in a way that is invariant to any global translation or scaling of the game's rewards, and works reasonably well in practice for our purposes.
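A short Python sketch of the selection rule (28) together with the parent-average fallback for unvisited actions; the node interface (N, Q, prior) is our own illustrative naming, not the paper's code.

```python
import math

def select_action(node, c_puct):
    """PUCT selection (28) with the equal-weighted parent-average Q fallback.

    `node` is assumed to expose dicts node.N[a] (visit counts), node.Q[a]
    (mean values, meaningful only when N[a] > 0), and node.prior[a] (tau).
    """
    sqrt_total = math.sqrt(sum(node.N.values()))
    visited = [a for a in node.N if node.N[a] > 0]
    # Average Q over visited siblings; used in place of Q for zero-visit actions.
    # If nothing has been visited yet we fall back to 0 here (our simplification).
    fallback_q = sum(node.Q[a] for a in visited) / len(visited) if visited else 0.0

    def score(a):
        q = node.Q[a] if node.N[a] > 0 else fallback_q
        return q + c_puct * node.prior[a] * sqrt_total / (node.N[a] + 1)

    return max(node.N, key=score)
```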
G. Brief Description of Diplomacy

We briefly summarize the rules of Diplomacy; see Paquette et al. (2019) for a more detailed description. The board is a map of Europe partitioned into 75 regions, 34 of which are supply centers (SCs) that players compete to control. Players command multiple units and each turn privately issue orders for each unit they own (to hold, move, support another unit, or convoy). These orders are revealed at the same time, thereby making Diplomacy a simultaneous-action game. A player wins the game by controlling a majority (18) of the SCs. A game may also end in a draw if all remaining players agree. In this case, we use the Sum-of-Squares (SoS) scoring system as used in prior works (Paquette et al., 2019; Gray et al., 2020; Bakhtin et al., 2021). If no player wins, SoS defines the score of player i as C_i²/Σ_{i'} C_{i'}², where C_i is the SC count for player i. Diplomacy is specifically designed so that a player is unlikely to achieve victory without help from other players. The full game allows unrestricted private natural-language communication between players each turn prior to choosing orders, but we focus on the simpler no-press variant, in which no such communication is allowed, but modeling how opponents will behave continues to be important.

H. Diplomacy Hyper-parameters

We describe the search parameters used in this work and compare them to those in previous works. Compared to Gray et al. (2020), we use a much less expensive set of search parameters for all results in the main section of this paper (see Table 7). In Table 8, we show that this tuned-down set of parameters slightly reduces piKL-HedgeBot's performance but allows us to draw similar conclusions when comparing against SearchBot (Gray et al., 2020) and supervised learning-based bots from prior works. Additionally, Table 8 also shows that when our agent (piKL-HedgeBot (λ = 10−3)) uses the same search parameters as Gray et al. (2020), it outperforms SearchBot by a large margin.

In our experiments, to compare against DipNet (Paquette et al., 2019) we use the original model checkpoint (DipNet SL from https://github.com/diplomacy/research) and we sample from the policy with temperature 0.1 (as used in prior works). Similarly, to compare against the SearchBot (Gray et al., 2020) agent we use the released checkpoint (the blueprint from https://github.com/facebookresearch/diplomacy_searchbot/releases/tag/1.0) and agent configuration (https://github.com/facebookresearch/diplomacy_searchbot/blob/master/conf/common/agents/searchbot_02_fastbot.prototxt).

The only other search parameter unique and new to our piKL-HedgeBot algorithm is η in Algorithm 1, which we heuristically set on each hedge iteration t to c/(σ√t), where σ is the standard deviation across iterations of the average utility experienced by the agent i being updated. We find c = 10/3 works well for no-press Diplomacy. (A small sketch of this schedule follows Table 7 below.)

Model                               Temperature
DipNet (Paquette et al., 2019)      0.1
DipNet RL (Paquette et al., 2019)   0.1
Blueprint (Gray et al., 2020)       0.1
IL Policy (Ours)                    0.5

Table 6. Sampling temperatures used for the models across prior works. For our imitation learning model (IL Policy), we use a temperature of 0.5 in all experiments to encourage stochasticity.

Parameter                                    (Gray et al., 2020)   Ours
Number of candidate actions (Nc)             50                    30
Max candidate actions per unit               3.5                   3.5
Number of search iterations                  256                   512
Policy sampling temperature for rollouts     0.75                  N/A
Policy sampling top-p                        0.95                  0.95
Rollout length, move phases                  2                     0

Table 7. Search parameters used in Gray et al. (2020) compared to the search parameters used in our work. All experiments in the main body of this work use search settings that are much cheaper to run. We use a rollout length of 0 and Nc = 30 while increasing the number of search iterations to 512.
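A minimal sketch of the heuristic learning-rate schedule described above; variable names are ours, and the paper specifies only the formula η_t = c/(σ√t) with c = 10/3.

```python
import math

def hedge_learning_rate(t, utility_history, c=10.0 / 3.0):
    """Per-iteration learning rate eta_t = c / (sigma * sqrt(t)).

    utility_history: the average utility the updated agent experienced on each
    previous hedge iteration; sigma is its standard deviation. The small floor
    on sigma (to avoid dividing by zero early on) is our own choice.
    """
    n = len(utility_history)
    mean = sum(utility_history) / n
    sigma = math.sqrt(sum((u - mean) ** 2 for u in utility_history) / n)
    sigma = max(sigma, 1e-6)
    return c / (sigma * math.sqrt(t))
```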
1x (rows) vs 6x (columns)            DipNet          DipNet RL       Blueprint       BRBot           SearchBot
DipNet (Paquette et al., 2019)       -               6.7% ± 0.9%     11.6% ± 0.1%    0.1% ± 0.1%     0.7% ± 0.2%
DipNet RL (Paquette et al., 2019)    18.9% ± 1.4%    -               10.5% ± 1.1%    0.1% ± 0.1%     0.6% ± 0.2%
Blueprint (Gray et al., 2020)        20.2% ± 1.3%    7.5% ± 1.0%     -               0.3% ± 0.1%     0.9% ± 0.2%
BRBot (Gray et al., 2020)            67.3% ± 1.0%    43.7% ± 1.0%    69.3% ± 1.7%    -               11.1% ± 1.1%
SearchBot (Gray et al., 2020)        51.1% ± 1.9%    35.2% ± 1.8%    52.7% ± 1.3%    17.2% ± 1.3%    -
piKL-HedgeBot (λ = 0.001)            54.8% ± 1.8%    31.4% ± 1.8%    50.3% ± 1.8%    19.2% ± 1.4%    16.6% ± 1.3%
piKL-HedgeBot (λ = 0.001),
  (Gray et al., 2020) parameters     60.1% ± 1.8%    33.3% ± 1.8%    58.1% ± 1.8%    23.6% ± 1.6%    20.3% ± 1.4%

Table 8. Average SoS scores achieved by the 1x agent (rows) against the 6x agents (columns). This table compares the performance of SearchBot (Gray et al., 2020) and other agents from prior work with piKL-HedgeBot (λ = 10−3), which uses a much cheaper search setting. Using the much cheaper search setting comes at a relatively small cost in performance because we use improved value and policy models (see Appendix I and Appendix H for more details). When using the same parameters as Gray et al. (2020), piKL-HedgeBot (λ = 10−3) significantly outperforms SearchBot under most settings. Note that equal performance would be 1/7 ≈ 14.3%. The ± shows one standard error.

I. Diplomacy Model Architecture and Input Features

Our imitation learning policy model for Diplomacy uses the same transformer-encoder, LSTM-decoder architecture as Bakhtin et al. (2021) for reinforcement learning in Diplomacy, but applied to imitation learning over human games. This architecture also resembles the architecture used by a significant amount of past work (Gray et al., 2020; Anthony et al., 2020; Paquette et al., 2019) but replaces the graph-convolution encoder with a transformer, which we find to produce good results.

Additionally, we slightly modify the input feature encoding relative to Gray et al. (2020), removing a small number of redundant channels and adding channels to indicate the "home centers" of each of the 7 powers (Austria, England, France, etc.), which are the locations where that power is allowed to build new armies or fleets. See Table 9 for the new list of input features. By adding the home centers to the input encoding instead of leaving them implicit, the game becomes entirely equivariant to permutations of those powers: e.g. if one swaps all the units of England and France, and all their centers, and which centers are their home centers, the resulting game is isomorphic to the original except with the two powers renamed. This allows us to augment the training data via equivariant permutations of the seven possible powers in the encoding. Every time we sample a position from the dataset for training, we also choose among all 7-factorial permutations of the powers uniformly at random, and correspondingly permute both the input and output, to reduce overfitting and improve the model's generalization given the limited human data available.

Feature                         Type                                          Number of Channels
Location has unit?              One-hot (army/fleet), or all zero             2
Owner of unit                   One-hot (7 powers), or all zero               7
Buildable, Removable?           Binary                                        2
Location has dislodged unit?    One-hot (army/fleet), or all zero             2
Owner of dislodged unit         One-hot (7 powers), or all zero               7
Area type                       One-hot (land, coast, water)                  3
Supply center owner             One-hot (7 powers or neutral), or all zero    8
Home center                     One-hot (7 powers), or all zero               7

Table 9. Per-location input features used.
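A minimal sketch of the power-permutation idea described above, assuming a hypothetical layout in which a one-hot power feature from Table 9 is exposed as a trailing axis of size 7; the actual training pipeline's tensor layout and output relabeling may differ.

```python
import numpy as np

def permute_power_channels(power_onehots, rng):
    """Relabel the 7 powers by a uniformly random permutation.

    power_onehots: float array [..., 7] holding a one-hot power feature from
    Table 9 (e.g. unit owner, dislodged-unit owner, home center). Returns the
    permuted feature together with the map old_power_id -> new_power_id, so the
    caller can relabel any power-indexed outputs consistently.
    """
    perm = rng.permutation(7)            # new channel j takes old channel perm[j]
    permuted = power_onehots[..., perm]
    old_to_new = np.argsort(perm)        # old power p now lives at index old_to_new[p]
    return permuted, old_to_new
```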
J. Improved Value Model in Diplomacy

For no-press Diplomacy, we note that Gray et al. (2020) observed that their search agent benefits from short rollouts using the trained human policy before applying the human-learned value model to evaluate the position. Doing so appears to result in more accurate evaluations reflecting the likely outcomes from a given game state, which the raw value model may fail to learn sufficiently accurately on the limited human dataset. Since the expectation of the learned value model after a short rollout appears to be better than the learned value model itself, this motivates training a model to directly approximate the former.

In a fashion broadly similar to Silver et al. (2016) generating rollout games to train a more accurate value head for Go, we therefore generated a large stream of data by uniformly sampling positions from the human game dataset for Diplomacy, rolling them forward between 4 and 8 phases of game play via the same rollout settings as Gray et al. (2020) (i.e. policy sampling temperature 0.75, top-p 0.95), and training a new value model to predict the resulting post-rollout value estimate of the old value model. Samples were continuously and asynchronously added to a replay buffer of 10,000 batches, and the buffer was continuously sampled to train the same transformer-based architecture as the human-trained model from Appendix I, initialized with the weights of that model. Training was constrained to never exceed the rate of data generation by more than a factor of 2 (i.e. using each sample twice in expectation) and proceeded for 128,000 mini-batches of 1024 samples each using the ADAM optimizer with a fixed learning rate of 1e-5.
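A schematic of this data-generation loop; the callables passed in are hypothetical stand-ins for the paper's Diplomacy infrastructure, and only the recipe itself is taken from the text above.

```python
import random

def generate_value_targets(sample_position, rollout_fn, old_value_model,
                           replay_buffer, num_samples=1000):
    """Label sampled human positions with the old value model's post-rollout value.

    sample_position(): draws a position uniformly from the human game dataset.
    rollout_fn(state, num_phases, temperature, top_p): rolls the position forward
        with the human policy using the Gray et al. (2020) rollout settings.
    old_value_model(state): the previously trained value model.
    replay_buffer.add(state, target): stores the training pair for the new model.
    """
    for _ in range(num_samples):
        state = sample_position()
        horizon = random.randint(4, 8)                         # roll forward 4-8 phases
        rolled = rollout_fn(state, num_phases=horizon,
                            temperature=0.75, top_p=0.95)
        replay_buffer.add(state, old_value_model(rolled))      # post-rollout value target
```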
K. More Experiments in Diplomacy

This section compiles additional performance results from evaluation games.

K.1. Head-to-Head Performance

In Table 10, we compare the performance of piKL-HedgeBot in 1v6 head-to-head games against the underlying imitation anchor policy, following prior work (Gray et al., 2020; Bakhtin et al., 2021; Paquette et al., 2019; Anthony et al., 2020). We find that the λ = 10−1 policy is substantially stronger than the imitation policy while matching its accuracy in predicting human moves, while the λ = 10−3 policy outperforms unregularized search methods while playing much closer to the human policy.

K.2. piKL-HedgeBot's Performance in Population-Based Experiments

In this section, we provide all the results from the population experiments across various piKL-HedgeBot λ values. Figure 6 and Table 11 show the results. piKL-HedgeBot with λ = 10−3 performs best across individual population experiments, with an SoS score of 32.9%. Performance drops as we continue to increase λ past 10−3. Error bars indicate 1 standard error.

1x: IL Policy
6x Agent                     Average SoS Score of the 1x agent
piKL-HedgeBot (λ = 10−1)     8.3 ± 0.9%
piKL-HedgeBot (λ = 10−2)     2.5 ± 0.4%
piKL-HedgeBot (λ = 10−3)     1.8 ± 0.3%
piKL-HedgeBot (λ = 10−4)     2.1 ± 0.3%
piKL-HedgeBot (λ = 10−5)     1.6 ± 0.2%
HedgeBot                     1.5 ± 0.2%
RMBot                        1.4 ± 0.2%

6x: IL Policy
1x Agent                     Average SoS Score of the 1x agent
piKL-HedgeBot (λ = 10−1)     21.1 ± 1.4%
piKL-HedgeBot (λ = 10−2)     44.2 ± 1.7%
piKL-HedgeBot (λ = 10−3)     52.7 ± 1.7%
piKL-HedgeBot (λ = 10−4)     49.7 ± 1.7%
piKL-HedgeBot (λ = 10−5)     46.9 ± 1.7%
HedgeBot                     46.5 ± 1.7%
RMBot                        46.2 ± 1.7%

Table 10. Average SoS score attained by the 1x agent against the 6x agent. The piKL-HedgeBot (λ = 10−1) policy is substantially stronger than the IL Policy, while the (λ = 10−2) policy is almost as strong as RMBot. The ± shows one standard error. Note that equal performance would be 1/7 ≈ 14.3%.

Figure 6. Average SoS score achieved by piKL-HedgeBots in uniformly sampled pools of other agents as a function of λ. piKL-HedgeBot (λ = 10−3) performs best across individual sweeps with an SoS score of 32.9%.

Agent                               λ = 10−1 pool   λ = 10−2 pool   λ = 10−3 pool   λ = 10−4 pool   λ = 10−5 pool   HedgeBot pool
DipNet (Paquette et al., 2019)      4.9% ± 0.3%     3.8% ± 0.3%     3.7% ± 0.3%     3.5% ± 0.3%     3.6% ± 0.3%     3.6% ± 0.3%
DipNet RL (Paquette et al., 2019)   5.6% ± 0.4%     4.2% ± 0.3%     4.7% ± 0.3%     4.6% ± 0.3%     4.4% ± 0.3%     3.9% ± 0.3%
Blueprint (Gray et al., 2020)       7.1% ± 0.4%     5.8% ± 0.4%     4.9% ± 0.3%     5.7% ± 0.3%     5.3% ± 0.3%     5.5% ± 0.3%
BRBot (Gray et al., 2020)           18.2% ± 0.6%    16.3% ± 0.6%    16.1% ± 0.6%    14.3% ± 0.6%    15.0% ± 0.6%    14.7% ± 0.6%
SearchBot (Gray et al., 2020)       36.1% ± 0.8%    14.1% ± 0.6%    13.4% ± 0.5%    13.6% ± 0.5%    13.0% ± 0.5%    14.1% ± 0.6%
IL Policy                           10.2% ± 0.6%    8.5% ± 0.4%     7.9% ± 0.4%     8.8% ± 0.4%     8.9% ± 0.4%     8.7% ± 0.4%
RMBot                               36.8% ± 1.1%    31.7% ± 0.8%    31.3% ± 0.7%    31.8% ± 0.7%    32.2% ± 0.7%    32.2% ± 0.7%
piKL-HedgeBot / HedgeBot (varied)   15.6% ± 0.6%    29.9% ± 0.7%    32.9% ± 0.7%    31.8% ± 0.7%    31.9% ± 0.7%    31.7% ± 0.7%

Table 11. Average SoS score achieved by agents in uniformly sampled pools of other agents. Each column corresponds to a separate population experiment containing the listed prior agents plus the indicated piKL-HedgeBot (or HedgeBot) variant; the last row reports that variant's own score. piKL-HedgeBot with λ = 10−3 performs best across individual sweeps with an SoS score of 32.9%. The ± shows one standard error.

L. Dec-POMDP Games: Policy-Regularized SPARTA on Hanabi

In this section, we extend KL-regularized search to decentralized partially observable Markov decision processes (Dec-POMDPs) and test it on the Hanabi benchmark (Bard et al., 2020). We first train an imitation learning policy (IL policy) from human data. We then use the IL policy as the blueprint policy in SPARTA (Lerer et al., 2020), a search technique for Dec-POMDPs, and apply KL-regularization toward the IL policy in SPARTA. We call this new algorithm piKL-SPARTA. We show that piKL-SPARTA matches or even slightly improves on the original IL policy in human move prediction accuracy while greatly improving self-play performance.
L.1. Background

A Dec-POMDP is an N-player fully cooperative game with state space S that is partially observed by each player i through their individual observation function o_i = Ω_i(s) for s ∈ S, with joint action space A = A_1 × A_2 × ··· × A_N and transition function T : S × A → P(S) that returns the distribution of the next state given the current state and joint action. The reward function R : S × A → R assigns a scalar reward to the entire team at each time step. A trajectory is denoted as τ^t = (s^0, a^0, r^0, ..., s^t), while the action-observation history (AOH) of each player is defined as τ_i^t = (o_i^0, a^0, r^0, ..., o_i^t). A full trajectory or full AOH that reaches the terminal state may be denoted more simply as τ or τ_i respectively. The policy for each individual player, π_i(a_i^t | τ_i^t), takes as input the AOH and returns a distribution over valid actions. The joint policy π = (π_1, ..., π_N) is a tuple containing all players' policies. The goal is to find a policy maximizing the expected total return,
$$\pi^* = \arg\max_\pi J(\pi) = \arg\max_\pi \mathbb{E}_{\tau\sim P(\tau|\pi)}\, R^0(\tau), \qquad \text{where } R^t(\tau) = \sum_{t'\ge t}\gamma^{(t'-t)}\, r^{t'}$$
is the forward-looking return with optional discount factor γ ≤ 1.

Hanabi is a well-established large-scale Dec-POMDP benchmark (Bard et al., 2020). It is a 2-to-5-player card game with a deck of 50 cards equally divided into 5 color suits. Each color consists of five ranks, with three 1s, two 2s, two 3s, two 4s, and one 5. Each player draws five cards from a randomly shuffled deck to start the game. The goal of the team is to play cards in order of increasing rank from 1 to 5 for every color suit. Players take turns to either play a card, discard a card, or give a hint to another player about their cards. When giving a hint, the acting player picks a receiver of the hint and a rank or color of any card in the receiver's hand. The recipient then learns exactly which cards in their hand match the given rank or color. Hinting costs one information token. The team starts with eight information tokens and can recoup one information token after discarding a card or successfully playing a 5 of any color. If a player makes an invalid play, e.g. playing a red 3 when a red 2 has not been played yet, the team loses one life token. After each play or discard, the player draws a new card if possible. The game ends when three life tokens are lost, in which case the team receives 0 points, or one round after the entire deck is exhausted, in which case the score equals the number of cards successfully played.

The majority of existing works in the Hanabi domain focus on learning human-compatible policies without using human data (Bard et al., 2020; Siu et al., 2021; Hu et al., 2021b). Fewer works have explored modeling human policies using human game data, partly due to the lack of publicly available datasets. To the best of our knowledge, the strongest supervised learning agent trained from human data was that of Hu et al. (2021b), where the authors use it as an unseen test-time partner agent in evaluation to estimate how well their agents might collaborate with humans. No prior work has been done to better predict human moves in Hanabi. A few search techniques, such as SPARTA (Lerer et al., 2020) and RL-search (Fickinger et al., 2021), have been proposed for large-scale Dec-POMDPs and have specifically been applied to Hanabi. In this work we choose SPARTA as our backbone for simplicity, while noting that our KL-regularization methods may also be generalized to other search techniques.
SPARTA is a test-time policy improvement algorithm that can be applied on top of any policy. SPARTA assumes that a blueprint policy (BP) π is common knowledge in the Dec-POMDP and that all players play their part of the blueprint unless they should deviate according to the SPARTA rule. Here we briefly discuss the single-agent variant, where the search agent assumes that the other agents will always play the blueprint. Given blueprint π, SPARTA first defines a belief function that tracks the distribution of the real trajectory given the search agent's own AOH, B_i(τ^t) = P(τ^t | τ_i^t, π). Then the search agent i computes the expected value for each action a using Monte Carlo rollouts:
$$Q_\pi(\tau_i^t, a) = \mathbb{E}_{\tau^t\sim B_i(\tau^t)}\, Q_\pi(\tau^t, a), \tag{29}$$
where
$$Q_\pi(\tau^t, a) = \mathbb{E}_{\tau\sim P(\tau\,|\,T,\ \tau^t,\ a_i^t=a,\ a_{j\ne i}^t\sim\pi,\ a^{t'>t}\sim\pi)}\, R^t(\tau)$$
is the expected forward-looking return on τ^t, assuming that the search agent performs action a for the current step and follows the blueprint afterwards, while the other agents always follow the blueprint. SPARTA computes the belief B_i(τ^t) analytically: it maintains a distribution over all possible trajectories and adjusts it at every step by removing the trajectories that contradict public knowledge or would have led to different joint actions according to the known joint policy π.

L.2. Method

We start with an imitation learning policy (IL policy) trained from human gameplay data collected from an online Hanabi game platform. The IL policy is used both as the baseline for predicting human moves and as the blueprint policy for our piKL-SPARTA. The analytical belief update procedure in SPARTA requires full knowledge of the partners' policy. This is challenging when predicting human moves and playing with humans, because our IL policy models the average behavior of the entire population of players in the training dataset, and a player at test time may perform actions that the IL policy would never take, leading to a null belief space that terminates the search. Therefore, we follow the practice in Hu et al. (2021a) and train an approximate neural network belief model on self-play data generated by the IL policy. Given the IL policy π as blueprint and the approximate belief model B̂_i that replaces B_i in Eq. 29, piKL-SPARTA selects actions for the search player i following
$$P(a) \;\propto\; \pi(a\,|\,\tau_i^t)\cdot\exp\!\left(\frac{Q_\pi(\tau_i^t, a)}{\lambda}\right). \tag{30}$$

L.3. Experimental Setup

We use a similar dataset acquired from en.boardgamearena.com as in Hu et al. (2021b). The dataset consists of 240,954 2-player Hanabi games. We randomly sample 1,000 games to create a validation set and another 4,000 games for the test set. The training set contains the remaining 235,954 games, with an average score of 15.88. Each game records the AOH τ_i, i ∈ {1, 2}, for both players. The IL policy π_θ is parameterized by a neural network θ and is trained to minimize the cross-entropy loss
$$\mathcal{L}(\theta) = -\mathbb{E}_{\tau_i\sim\mathcal{D}}\sum_{t=0}^{T}\log\pi_\theta(a_i^t\,|\,\tau_i^t)$$
with stochastic gradient descent. The training set D is dynamically augmented with color shuffling (Hu et al., 2020), where a random color permutation that changes both the observation and action space is applied to each trajectory τ_i sampled from the dataset before it is fed to the network. Hu et al. (2021b) show that this data augmentation method greatly reduces overfitting and leads to better policies.
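A minimal sketch of the action-selection rule in Eq. (30), assuming the per-action search values Q_π(τ_i^t, a) have already been estimated by rollouts; function and variable names are ours, not the paper's code.

```python
import numpy as np

def pikl_sparta_action(blueprint_probs, q_values, lam, rng, greedy=False):
    """Select an action via Eq. (30): P(a) proportional to pi(a|tau) * exp(Q(a)/lam).

    blueprint_probs: IL (blueprint) policy probabilities over the legal actions.
    q_values: Monte Carlo estimates of Q_pi(tau_i^t, a) for the same actions.
    lam: regularization strength; larger lam stays closer to the blueprint.
    """
    blueprint_probs = np.asarray(blueprint_probs, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    logits = np.log(blueprint_probs + 1e-40) + q_values / lam
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    if greedy:                                   # piKL-SPARTA-G takes the arg max
        return int(np.argmax(probs))
    return int(rng.choice(len(probs), p=probs))
```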
The network θ uses the Public-LSTM structure of Hu et al. (2021a), which eliminates the need to re-unroll the LSTM on sampled trajectories from the beginning of the game, as they share the same public observations as the real trajectory. To construct the approximate belief model B̂, we train a neural network φ to predict each player's own hand, which is the only hidden information in Hanabi. The network predicts each card in hand from oldest to newest auto-regressively. We denote the hidden cards as {h_i^j}, with h_i^1 being the oldest and h_i^m being the newest, m ≤ 5. The cross-entropy loss for φ in an N-player setting is then
$$\mathcal{L}(\phi) = -\mathbb{E}_{\tau\sim\pi_\theta}\left[\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{j=1}^{m}\log P_\phi\bigl(h_i^j\,\big|\,\tau_i^t, h_i^1,\ldots,h_i^{j-1}\bigr)\right].$$
In Hanabi, it is sufficient to reconstruct τ^t given τ_i^t and {h_i^j}. The model is trained on an infinite stream of data generated by π_θ through self-play to avoid overfitting. We train two variants of the belief model, one with sampled actions a ∼ π_θ and the other with greedy actions a = arg max π_θ. They are used by piKL-SPARTA and piKL-SPARTA-G(reedy) respectively.

We first run piKL-SPARTA on the test set to compare its ability to predict human moves against the IL policy. For each τ_i^t in the test set, we sample K/|A_i| hands from the belief model P_φ, where K is the total number of search rollouts for this step and |A_i| is the number of legal actions of the search player. In all our experiments we set K = 10,000. In practice, we sample 2K/|A_i| hands and take the top K/|A_i| samples that do not contradict the public knowledge. If the belief model fails to produce any samples that comply with the public knowledge, we revert back to the blueprint.
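A schematic of the sampling procedure just described, with the belief network and the public-knowledge check abstracted behind callables (our stand-ins, not the paper's code):

```python
def sample_consistent_hands(sample_hand, consistent_with_public_knowledge,
                            k_total, num_legal_actions):
    """Draw 2K/|A_i| candidate hands and keep the first K/|A_i| consistent ones.

    sample_hand(): draws one hand from the learned belief model P_phi.
    consistent_with_public_knowledge(hand): checks the hand against public knowledge.
    Returns None to signal "revert to the blueprint" when nothing is consistent.
    """
    budget = 2 * k_total // num_legal_actions
    keep = k_total // num_legal_actions
    candidates = [sample_hand() for _ in range(budget)]
    consistent = [h for h in candidates if consistent_with_public_knowledge(h)]
    if not consistent:
        return None          # fall back to the blueprint policy for this step
    return consistent[:keep]
```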
Despite the lack of theoretical soundness, we are interested in whether 2p piKL-SPARTA can obtain empirical improvement over the 1p version, since piKL-SPARTA regularizes towards the blueprint and therefore the mismatch between assuming the partner follows the blueprint and the policy they actually play would likely be less severe. Lastly, in Dec-POMDPs, it is common to select actions greedily in self-play instead of sampling from a mixed policy, since in fully cooperative settings there is no need to avoid being deterministic or predictable to an adversary. Therefore, we also experimented with piKL-SPARTA-G(reedy) where every agent, including the baseline IL policy, play according to the arg max of their action distributions at every step, and the belief model for piKL-SPARTA-G is trained on trajectories produced by a greedy IL policy. L.4. Results Figure 7. Top-1 test accuracy and self-play score of IL policy and piKL-SPARTA in Hanabi. Blue dot is the IL policy and green dots are piKL-SPARTA with different λ. The self-play score is evaluated with sampling based 2p piKL-SPARTA and sampling based SL policy. Additional evaluations with more λ and more algorithm variants are presented in Table 12 and Table 13. Error bar is 1 standard error. Table 12 summarizes the results for human move prediction. On the full test set, piKL-SPARTA with λ = 10 and λ = 20 outperform the IL policy slightly, although not statistically confidently, and vastly outperform unregularized SPARTA. We also investigate the prediction accuracy on games with score ≥ 10 or ≥ 20 that more predominantly come from more experienced human players. The improvement of piKL-SPARTA over the IL policy may be larger on these games, although this result is noisy and at best should be considered only mildly suggestive since filtering on final score also introduces other major confounding factors as well. In Table 13, we show the self-play performance piKL-SPARTA and IL policy. The top two rows show the results of the 9 √ Via the relationship λ ≈ cpuct N where N is the number of MCTS iterations Modeling Strong and Human-Like Gameplay with KL-Regularized Search Subset of Test Set λ =0 λ =0.5 λ =1 λ =2 IL Policy All games Games w/score ≥ 10 Games w/score ≥ 20 25.30% ± 0.16% 27.01% ± 0.18% 28.87% ± 0.19% 56.04% ± 0.24% 58.86% ± 0.25% 61.86% ± 0.25% 60.22% ± 0.21% 62.62% ± 0.22% 65.21% ± 0.23% 62.44% ± 0.19% 64.51% ± 0.20% 66.81% ± 0.21% 63.63% ± 0.18% 65.29% ± 0.19% 67.39% ± 0.19% Subset of Test Set λ =5 λ =10 λ =20 λ =50 IL Policy All games Games w/score ≥ 10 Games w/score ≥ 20 63.50% ± 0.18% 65.33% ± 0.19% 67.48% ± 0.19% 63.68% ± 0.18% 65.43% ± 0.19% 67.54% ± 0.19% 63.71% ± 0.18% 65.42% ± 0.19% 67.53% ± 0.19% 63.68% ± 0.18% 65.35% ± 0.19% 67.45% ± 0.19% 63.63% ± 0.18% 65.29% ± 0.19% 67.39% ± 0.19% Table 12. Human prediction accuracy of unregularized SPARTA (λ = 0) and of piKL-SPARTA with different λ and the IL policy on test set. Each row represents their accuracy on a subset filtered by the final score of the games. pikl-SPARTA achieves similar prediction accuracy as IL for most λ and is far more accurate than unregularized SPARTA. 
                     λ = 0          λ = 0.5        λ = 1          λ = 2          λ = 5          λ = 10         IL Policy
1p piKL-SPARTA       20.56 ± 0.05   21.16 ± 0.06   20.02 ± 0.09   17.71 ± 0.12   13.53 ± 0.16   11.04 ± 0.17   8.81 ± 0.15
2p piKL-SPARTA       20.23 ± 0.04   23.11 ± 0.03   22.60 ± 0.03   21.63 ± 0.05   18.65 ± 0.11   15.23 ± 0.15   8.81 ± 0.15
1p piKL-SPARTA-G     22.78 ± 0.03   23.07 ± 0.03   22.76 ± 0.03   22.41 ± 0.04   21.91 ± 0.05   21.45 ± 0.06   19.72 ± 0.10
2p piKL-SPARTA-G     19.98 ± 0.04   23.72 ± 0.02   23.39 ± 0.03   22.93 ± 0.03   22.36 ± 0.03   21.98 ± 0.04   19.72 ± 0.10

Table 13. Performance of unregularized SPARTA (λ = 0), of piKL-SPARTA under different λ, and of the IL policy, evaluated on 4,000 self-play games, reported ± one standard error. In the top two rows, both piKL-SPARTA and the IL policy sample actions according to their action distributions. In the bottom two rows, both algorithms take greedy actions; "-G" is short for "-Greedy". All numbers in each table section use the same learned belief model for fair comparison. At the high λ values that maintain or improve human prediction accuracy, piKL-SPARTA significantly improves playing strength over IL, while at lower λ values piKL-SPARTA outscores unregularized SPARTA. For reference, unregularized greedy 1p SPARTA with exact beliefs rather than learned beliefs gets 23.49 ± 0.02.

In Table 13, we show the self-play performance of piKL-SPARTA and the IL policy. The top two rows show the results of the sampling version, while the bottom two rows show those of the greedy version. The conclusions are consistent across both cases. Both 1p and 2p variants of piKL-SPARTA outscore IL for all λ tested. Together with Table 12, we find that with λ = 10, piKL-SPARTA maintains IL prediction accuracy and greatly outscores IL in self-play, an overall improvement without any tradeoff. For smaller λ = 2 or λ = 5, piKL-SPARTA further outscores IL while losing some prediction accuracy on human moves, but remains vastly more accurate than unregularized SPARTA (λ = 0).

Through qualitative analysis of the games played by both methods, we find that IL makes many mistakes, due both to sampling low-probability actions and to the fact that the training set contains bad moves. piKL-SPARTA, on the other hand, avoids many of these mistakes while still otherwise following the same strategies and human-like signaling conventions. It is particularly good at preventing catastrophic failures where the agents lose all life tokens and all points, since the Q values for good and bad actions differ much more widely in those cases.

The 2p piKL-SPARTA variants, despite being theoretically unsound, result in a further score improvement. Meanwhile, we notice that 2p versions of plain SPARTA (λ = 0) underperform their 1p variants, which is consistent with the observations of Lerer et al. (2020), despite our use of an approximate learned belief model instead of exact beliefs. This suggests that regularization towards the blueprint IL policy is successful in keeping trajectories similar enough to those from the blueprint that the unsound assumption that the partner plays the blueprint does not cause major problems, while still allowing both players to deviate enough to correct major blunders that they would otherwise make.

Lastly, we notice that 1p piKL-SPARTA with λ = 0.5 underperforms the best 2p variants and that its advantage over the unregularized version shrinks relative to the exact-belief baseline. The reason could be that the quality of the samples from the learned belief model worsens as the trajectories it sees at test time become more off-distribution for smaller λ. For reference, we also run the original SPARTA, which does not have this problem as it computes beliefs analytically.
The original SPARTA gets 23.49 ± 0.02, which is better than 1p piKL-SPARTA with λ = 0.5 but worse than 2p piKL-SPARTA with the same λ. The computational cost of 1p SPARTA with exact beliefs and of 2p piKL-SPARTA with approximate learned beliefs is roughly the same, because the exact belief computation accounts for 50% of the entire computation. Therefore, 2p piKL-SPARTA could be a preferable option for improving performance in Dec-POMDPs via self-play without incurring the huge cost of joint SPARTA or RL-search (Fickinger et al., 2021).