Modeling Strong and Human-Like Gameplay with KL-Regularized Search

Athul Paul Jacob *1,2, David J. Wu *1, Gabriele Farina *3, Adam Lerer 1, Hengyuan Hu 1, Anton Bakhtin 1, Jacob Andreas 2, Noam Brown 1

*Equal contribution. 1 Meta AI Research, New York, NY, USA. 2 CSAIL, MIT, Cambridge, MA, USA. 3 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. Correspondence to: Athul Paul Jacob, David J. Wu, Gabriele Farina, Noam Brown.

arXiv:2112.07544v2 [cs.MA] 17 Feb 2022

Abstract

We consider the task of building strong but human-like policies in multi-agent decision-making problems, given examples of human behavior. Imitation learning is effective at predicting human actions but may not match the strength of expert humans, while self-play learning and search techniques (e.g. AlphaZero) lead to strong performance but may produce policies that are difficult for humans to understand and coordinate with. We show in chess and Go that regularizing search based on the KL divergence from an imitation-learned policy results in higher human prediction accuracy and stronger performance than imitation learning alone. We then introduce a novel regret minimization algorithm that is regularized based on the KL divergence from an imitation-learned policy, and show that using this algorithm for search in no-press Diplomacy yields a policy that matches the human prediction accuracy of imitation learning while being substantially stronger.

1. Introduction

Self-play AI algorithms have matched or exceeded expert human performance in many games, such as chess (Campbell et al., 2002; Silver et al., 2018), Go (Silver et al., 2016; 2017), and poker (Moravčík et al., 2017; Brown & Sandholm, 2017; 2019). However, the resulting policies often differ markedly from how humans play (McIlroy-Young et al., 2020a). This is a serious problem for human-computer interactions that involve cooperation rather than pure competition. In such settings, modeling the other participants accurately is important for success. For example, it is important for a self-driving car at a four-way stop sign to conform to existing human conventions rather than its own self-play solution to the problem (Lerer & Peysakhovich, 2019). Moreover, even in purely adversarial games, the alien nature of AI policies makes it difficult for humans to understand and learn from superhuman bots.

The classic approach toward modeling human behavior is imitation learning (IL) on human data. However, evidence in multiple games indicates that IL on expert human data produces policies that are much weaker than actual expert human play in domains with complex strategic planning.

In this paper, we study the problem of producing policies that are both strong and human-like in games with complex strategic planning like chess, Go, Hanabi, and Diplomacy. In all four, we find that conducting search with KL-regularization towards an IL policy matches or exceeds the prior state of the art for prediction accuracy of expert humans while also better matching expert human performance.

In Section 3, we show that Monte Carlo tree search (MCTS) with a human imitation-learned policy prior and value function surpasses prior state-of-the-art results for human prediction accuracy in chess and Go. As explained by Grill et al. (2020), standard MCTS with a policy prior can be viewed as a form of KL-regularized search, optimizing a value function subject to a KL-divergence term with that prior.
Although MCTS has been extensively studied for developing strong agents, it has been explored much less in the context of developing human-like agents.

Section 4 builds on these findings and shows how to generalize them to a class of imperfect-information games (in which ordinary MCTS is unsound and cannot be applied) via a new algorithm for KL-regularized regret minimization. We show that existing regret minimization algorithms achieve low accuracy in predicting expert human actions in no-press Diplomacy. We then introduce the first regret minimization algorithm to incorporate a cost term proportional to the KL divergence between the search policy and a human-imitation-learned anchor policy. We call this algorithm policy-regularized hedge, or piKL-hedge. We prove that piKL-hedge converges to an equilibrium in which all players' policies are optimal given the joint policies of the players and the cost of deviating from the anchor policy. We then present results in no-press Diplomacy showing that piKL-hedge produces policies that predict human actions as accurately as imitation learning while also improving head-to-head performance in a population of prior agents.

Appendix L additionally shows that applying KL-regularization toward a human IL policy in the search algorithm SPARTA (Lerer et al., 2020) produces similar or better human prediction accuracy while greatly improving performance in the benchmark domain of Hanabi (Bard et al., 2020). Our experiments demonstrate the benefits of KL-regularized search in all four of chess, Go, no-press Diplomacy, and Hanabi for producing agents that are simultaneously more human-like and closer in strength to actual human experts than purely imitation-learned models.

2. Preliminaries

We study the problem of learning policies for multiplayer games. Here we briefly introduce the key ingredients of both classes of games we study; Section 3 and Section 4 give a more formal presentation tailored to individual game types and learning algorithms.

An (N-player) game is defined by a state space S, an action space A, a (deterministic) transition function T : S × A^N → S, and a collection of reward functions u_i. We model the behavior of each player in a game as a policy π_i : S → ∆(A) (a distribution over actions given states). In every round of a game, each player observes a (possibly incomplete) view s_i^t of the current state. One or more players select actions a_i^t ∼ π_i(· | s_i^t), then each player receives a reward u_i^t(s^t, a^t), where a^t = (a_1^t, ..., a_N^t), and the game transitions into a new state s^{t+1} = T(s^t, a^t). Each player i aims to maximize its expected reward, and the optimal policies for doing so may depend on the policies π_{−i} = {π_1, ..., π_{i−1}, π_{i+1}, ..., π_N} of the other players.

The sequential decision-making problem described above is extremely general, and in this paper we focus on two special cases. In perfect-information games, players make moves sequentially (e.g., u_1 and s^2 depend only on a_1, u_2 and s^3 depend only on a_2, etc.). Many important games, including chess and Go, fall into this category. Next, we study a more general class of imperfect-information, simultaneous-action games that make no assumptions about the dependence of different u_i and T on a; here we focus on games with only a single round, also called matrix games. Owing to the large differences between these two settings, the tools for computing strong policies are quite different.
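To make these definitions concrete, the following is a minimal Python sketch of the interface they imply for an N-player simultaneous-action game; the type names and helper functions are illustrative assumptions (and partial observability is elided by giving every player the full state), not part of the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int          # placeholder types, for illustration only
Action = int
Policy = Callable[[State], Dict[Action, float]]   # pi_i : S -> Delta(A)

@dataclass
class Game:
    """An N-player game with a deterministic transition T and per-player rewards u_i."""
    num_players: int
    transition: Callable[[State, Tuple[Action, ...]], State]      # T : S x A^N -> S
    rewards: Callable[[State, Tuple[Action, ...]], List[float]]   # (u_1, ..., u_N)(s, a)

def sample(dist: Dict[Action, float]) -> Action:
    """Draw an action from a probability distribution over actions."""
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs)[0]

def play_round(game: Game, state: State, policies: List[Policy]) -> Tuple[State, List[float]]:
    """One round: each player samples a_i ~ pi_i(. | s), rewards u_i(s, a) are paid out,
    and the game transitions to s' = T(s, a)."""
    joint = tuple(sample(pi(state)) for pi in policies)
    return game.transition(state, joint), list(game.rewards(state, joint))
```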
The remainder of this paper accordingly offers a deeper exploration of each class of games: perfect-information games in Section 3 and imperfect-information games in Section 4.

3. Perfect-Information Games: Policy Regularization in Monte Carlo Tree Search

In this section, we focus on developing strong human-like agents for perfect-information games. Monte Carlo tree search (MCTS) has been highly successful for developing strong, but not necessarily human-like, agents in this setting, and is a key component of general learning algorithms such as AlphaZero and MuZero that are capable of achieving superhuman performance in chess, Go, and similar games (Silver et al., 2018; Schrittwieser et al., 2020). By contrast, for developing human-like agents, the best prior human move prediction accuracies for chess and Go were all achieved with pure imitation learning on human data (McIlroy-Young et al., 2020a; Cazenave, 2017; Silver et al., 2017).

The state of the art for predicting human moves in chess is the Maia engine created by McIlroy-Young et al. (2020a) via pure imitation learning without any search. However, this approach appears to be of limited effectiveness for modeling sufficiently strong humans. Although the weakest Maia models at low temperatures appear to outperform the players they imitate (due to "averaging away" of the imitated players' individual idiosyncratic mistakes (Anderson et al., 2021)), each successive model trained on data from stronger players improves by much less than the players themselves improve.¹ The strongest model, trained to predict human 1900-1999-rated players, even with low temperature appears to be clearly below a 1950-average level of performance in all but the minority of bullet-speed games (in which very little time is available for planning and players are forced to rely more heavily on intuition).

¹ See ratings data at https://lichess.org/@/maia1, https://lichess.org/@/maia5, https://lichess.org/@/maia9

Similarly, in Go, pure imitation-learning agents have not exceeded mid-expert level on various online servers despite being trained on top-expert and master-level games (Cazenave, 2017). In contrast, search-based reinforcement learning (RL) agents such as AlphaZero that do not use a human policy prior play at a superhuman level, but often in non-human ways that humans find difficult to understand even when given access to interactively query and inspect the agent's analysis (Silver et al., 2017; Egri-Nagy & Törmänen, 2020).

However, we show in both chess and Go that if the human-learned model is used in MCTS with appropriate parameters, MCTS outperforms those models' human prediction accuracy while simultaneously reducing the shortcomings in those models' strength.

3.1. Background

We consider sequential games where each player i alternately chooses an action a from a policy π_i, where a ∼ π_i(· | s). Each action deterministically transitions the game into a new state s' = T(s, a) and gives rewards u_i(s, a). Notationally, we may elide the player i in some places when it is clear that i is the next player to move in the state being considered.

For this work, following Silver et al. (2016), we implement one of the most common modern forms of MCTS, which uses a value function V predicting the expected total future reward V_i(s) = E[ Σ_t u_i(s^t, a^t) | π_1, π_2, s_0 = s ] and a policy prior τ (typically both the outputs of a trained deep neural net), and attempts to produce an improved policy π.
Each turn, MCTS builds a game tree over multiple iterations rooted at the current state. On each iteration t, MCTS explores a single path down the tree by simulating, at each successive state s with player i to move, the action

\arg\max_a \Big[ Q(s,a) + c_{\text{puct}}\, \tau(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{N(s,a) + 1} \Big]    (1)

where Q(s, a) is the estimated expected future reward for i from playing action a in state s, the visit count N(s, a) is the number of times a has been explored from s, τ(s, a) is the prior policy probability, and c_puct is a tunable parameter trading off exploration versus exploitation. Upon reaching a state s^t not yet seen, MCTS queries the value function V_i(s^t) for each player i and, based on V_i(s^t) and any intermediate rewards received, updates all Q(s, a) estimates on the path traversed. The final agent policy is π(s, a) = N(s, a) / Σ_b N(s, b), where s is the root state, or optionally we may also use π(s, a) ∝ N(s, a)^{1/T}, where T is a temperature parameter. See also Appendix F for a fuller description of MCTS.

Grill et al. (2020) show that the agent policy π computed by this form of MCTS is an approximate solution to the optimization problem

\arg\max_{\pi} \Big[ \sum_a Q(s,a)\, \pi(s,a) - \lambda\, D_{\mathrm{KL}}(\tau \,\|\, \pi) \Big]    (2)

where λ ∼ c_puct/√N and N is the total number of iterations. In other words, at every node of the tree recursively, MCTS implicitly optimizes its expected future reward subject to KL regularization of its policy towards the prior policy τ, with strength controlled by λ. For any fixed computational budget N, we can therefore tune c_puct to vary the strength of that prior, with c_puct = ∞ approximating the prior policy before search, and c_puct → 0 approaching a greedy argmax of the Q value estimates.

If our goal is a strong human-like agent rather than solely a strong agent, and the KL-regularizing policy is learned from human data, then that policy serves not just as a prior, but also as an anchor policy that regularizing towards is desirable in and of itself. With a good choice of c_puct, MCTS can improve that policy while remaining close to human. Our experiments confirm that MCTS improves on the strength and human prediction accuracy of the best existing models in both chess and Go.
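As a concrete illustration of the selection rule in Eq. (1) and the visit-count policy described above, here is a minimal NumPy sketch of the bookkeeping at a single MCTS node; it is a simplified toy view under assumed array representations, not the chess or Go implementation used in the experiments.

```python
import numpy as np

def puct_select(Q, N, prior, c_puct):
    """Eq. (1): argmax_a [ Q(s,a) + c_puct * tau(s,a) * sqrt(sum_b N(s,b)) / (N(s,a) + 1) ]."""
    exploration = c_puct * prior * np.sqrt(N.sum()) / (N + 1)
    return int(np.argmax(Q + exploration))

def root_policy(N, temperature=1.0):
    """Final agent policy at the root: pi(s,a) proportional to N(s,a)^(1/T)."""
    weights = N.astype(float) ** (1.0 / temperature)
    return weights / weights.sum()

# Toy example: three legal actions at the root, with an imitation-learned prior tau.
Q = np.array([0.10, 0.05, 0.02])    # estimated expected future reward per action
N = np.array([12, 30, 8])           # visit counts so far
tau = np.array([0.2, 0.7, 0.1])     # human-imitation policy prior
print(puct_select(Q, N, tau, c_puct=2.0), root_policy(N))
```

Larger c_puct makes the prior term dominate the selection (and hence the final visit distribution), which is the mechanism the text above exploits to keep the search anchored to the human-learned policy.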
3.2. Experiments in Chess and Go

In chess and Go, we ran two main experiments each. First, in chess using the prior state-of-the-art Maia models from McIlroy-Young et al. (2020a), and in Go using a model trained on professional games from the GoGoD dataset, we demonstrate that MCTS with that model outperforms the raw model in human prediction accuracy. Second, we also sanity-check that MCTS with the same parameters greatly improves the strength of the same models in chess and Go.

3.2.1 Data and Model Architecture

In chess, for the human-learned anchor policy we use the pre-trained Maia1100, Maia1500, and Maia1900 models from McIlroy-Young et al. (2020a), which achieve state-of-the-art performance on rating-conditional human move prediction. These models follow a standard AlphaZero-like residual block architecture, including both a policy and a value head, and were trained to imitate players in rating "buckets" 1100, 1500, and 1900 respectively, based on roughly 10 million games each from the popular Lichess server (each bucket contains games between players rated from N to N+99).

For Go, we trained a deep neural net on the GoGoD professional game dataset.² We match Cazenave (2017) in using games from 1900 through 2014 for training and 2015-2016 as the test set, with roughly 73,000 and 6,500 games, respectively. Our architecture matches the 20-block residual net structure used by some versions of AlphaZero (Silver et al., 2017), except that it adds squeeze-and-excitation layers, which have been successful in image processing tasks (Hu et al., 2018) and in self-play learning in chess and Go (LC0, 2020; Troisi, 2019). See Appendix E for additional details.

² https://gogodonline.co.uk/

3.2.2 Improved Human Prediction and Strength

In Table 1 we show the top-1 accuracy of MCTS at predicting players in chess and Go. MCTS on top of each model tested outperforms that model at predicting human moves. In chess, the benefit provided increases greatly as the rating of the players predicted increases, while the optimal choice of c_puct appears to decrease (allowing increasingly small value differences to affect the search). This is consistent with the intuitive hypothesis that stronger players plan more deeply, increasing the value of explicitly modeling planning, and that they are more sensitive to small future value differences. In Go, despite our baseline model being equal to or better than all prior imitation-learning models on the GoGoD human pro games dataset, MCTS improves it yet further.

Table 1. % top-1 test accuracy predicting human moves in chess and Go, using MCTS with 50 playouts and various c_puct, or the raw model without MCTS (equivalent to infinite c_puct). In chess, the first 10 ply per game and moves with < 30s time left are excluded, similar to (McIlroy-Young et al., 2020a). Standard error is approx. 0.1 or less on all values. MCTS on top of current state-of-the-art models improves human prediction accuracy significantly.

Model            Predicting     Raw Model    Model + MCTS, c_puct =
                                              10      5       2       1       0.5
Chess Maia1100   Rating 1100    51.1          51.2    51.3    50.8    49.5    47.4
Chess Maia1500   Rating 1500    52.4          52.7    52.9    52.8    51.9    50.1
Chess Maia1900   Rating 1900    53.2          53.6    54.0    54.3    53.8    52.4
Go Cazenave      GoGoD          54.7          -       -       -       -       -
Go Wu (2018)     GoGoD          55.3          -       -       -       -       -
Go Wang et al.   GoGoD          57.7          -       -       -       -       -
Go Ours          GoGoD          57.8          58.1    58.3    58.5    58.1    57.1

Figure 1. Improvement in top-1 accuracy of Maia1900 in chess or our GoGoD model in Go using MCTS, plotted versus winrate of MCTS against the raw model (temperature 1). Error bars indicate 1 standard error. Many c_puct values increase both human prediction accuracy and winrate over the raw model in both chess and Go.

In Figure 2, for chess we see that while KL-regularized search improves each model's accuracy on players of its target rating, surprisingly, the improvement grows yet larger when each model predicts players of higher rating than it was trained on. This suggests that as human players improve, the incremental average change in their behavior resembles or is correlated with the way that highly-regularized search improves the strength of a baseline policy.

Figure 2. Top-1 % test accuracy in chess for Maia models trained to predict moves by players in rating buckets 1100, 1500, and 1900, applied to predict all rating buckets from 1100 to 2500, with and without MCTS (Maia1100+MCTS with c_puct = 10, Maia1500+MCTS with c_puct = 5, Maia1900+MCTS with c_puct = 2). The black bolded outline indicates where the model is predicting the same rating as it was trained on. Standard error is very small, the slight thin shade around each line. MCTS most improves prediction of a model on players of its target rating and higher.

Additionally, in Appendix C, we show that if we apply post-processing to the MCTS policy based on Grill et al. (2020), MCTS improves cross entropy with human moves in both chess and Go, not just top-1 accuracy. In other words, not only does policy-regularized search improve the prediction of the top move, but it also better models the overall distribution of moves that humans may likely play.
We measure the strength impact of regularized search with 1000 games³ per c_puct setting between the raw model policy and the MCTS policy, sampling each at temperature 1. Figure 1 shows the change in human prediction accuracy of MCTS in both chess and Go, plotted jointly versus winrate of MCTS against the raw model. Rather than solely a tradeoff between strength and accuracy, most c_puct values in the range we tested increase both, some achieving more than 90% winrate while still improving human prediction. See Appendix D for results at lower temperature and evidence that accuracy improves further at longer time controls.

³ In Go, we also use the open-source KataGo (Wu, 2020) to determine when the game is over and to score the result. Unlike RL agents, the humans whom our models imitate universally pass and score the game well before it becomes mechanically scorable, so we use KataGo as a neutral judge.

Although we did not test against humans directly to calibrate, this gives clear evidence that a well-tuned human-regularized MCTS agent would be better able to match the 1900-1999-rated chess players that Maia1900 currently falls hundreds of Elo short of imitating, while simultaneously being more accurate to their move-by-move behavior, and similarly for human-imitation agents in Go.

4. Imperfect-Information Games: Policy-regularized Regret Minimization

While MCTS is a popular search algorithm for perfect-information deterministic games, it is not able to compute optimal policies in imperfect-information games. Instead, iterative algorithms based on regret minimization are the leading approach to search in imperfect-information games.

Hedge (Littlestone & Warmuth, 1994; Freund & Schapire, 1997) is an iterative regret minimization algorithm that in general converges to a coarse correlated equilibrium (CCE) (Hannan, 1957). In the special case of two-player zero-sum games, it also converges to a Nash equilibrium (NE) (Nash, 1951). Regret Matching (RM) (Blackwell et al., 1956; Hart & Mas-Colell, 2000) is another equilibrium-finding algorithm, similar to Hedge, that has historically been more popular and that we compare our algorithm to in this paper.

The effectiveness of the implicit KL-regularization in MCTS that we study in Section 3 motivates us to develop an equilibrium-finding algorithm, called piKL-Hedge, that similarly biases the search towards an anchor policy.
In Section 4.3, we show that piKL-Hedge achieves better empirical performance against baseline human-imitation models than Hedge and RM in a large imperfect-information game, as well as much higher human prediction accuracy.

4.1. Background

We consider a game with N players where each player i chooses an action a from a set of actions A_i. We denote the actions of all players other than i as a_{−i}. After all players simultaneously choose an action, player i receives a reward of u_i(a, a_{−i}). Players may also choose a probability distribution over actions, where the probability of action a is denoted π_i(a) and the vector of probabilities is denoted π_i. We also define the fixed policy that we are interested in biasing player i towards as the anchor policy τ_i ∈ ∆(A_i).

Each player i maintains a regret for each action. The regret on iteration t is denoted R_i^t(a). Initially, all regrets are zero. On each iteration t of Hedge, π_i^t(a) is set according to

\pi_i^t(a) \propto \exp\big( R_i^t(a) \big)    (3)

Next, each player samples an action a* from A_i according to π_i^t, and all regrets are updated such that

R_i^{t+1}(a) = R_i^t(a) + u_i(a, a^*_{-i}) - \sum_{a' \in A_i} \pi_i^t(a')\, u_i(a', a^*_{-i})    (4)

It is proven that the average policy of Hedge over all iterations converges to a NE in two-player zero-sum games and, more broadly, that the players' joint policy distribution converges to a CCE as t → ∞.

We wish to model agents that seek to maximize their expected reward, while at the same time playing "close" to the anchor policy. The two goals can be reconciled by defining a composite utility function that adds a penalty based on the "distance" between the player policy and their anchor policy, with coefficient λ_i ∈ [0, ∞) scaling the penalty. For each player i, we define i's utility as a function of the agent policy π_i ∈ ∆(A_i) given the policies π_{−i} of all other agents:

U_i(\pi_i, \pi_{-i}) := u_i(\pi_i, \pi_{-i}) - \lambda_i D_{\mathrm{KL}}(\pi_i \,\|\, \tau_i).    (5)

When λ_i is large, the utility function is dominated by the KL-divergence term λ_i D_KL(π_i ‖ τ_i), and so the agent will naturally tend to play a policy π_i close to the anchor policy τ_i.⁴ When λ_i is small, the dominating term is the reward u_i(π_i, a^t_{−i}), and so the agent will tend to maximize reward without as closely matching the anchor policy τ_i. These statements are made precise in Theorem 1 and Theorem 2.

⁴ The careful reader may observe that the direction of the KL-divergence term, D_KL(π ‖ τ), is the opposite of the direction implicit in MCTS, D_KL(τ ‖ π). We choose this direction for greater ease of theoretical analysis and implementation in the context of regret minimization; for our use cases we have not found the exact form of the loss to be critical so much as simply doing any reasonable regularized search.

4.2. No-Regret Learning for Policy-Regularized Utilities

In this section, we present Algorithm 1, a no-regret algorithm based on Hedge for any player i to learn strong policies relative to the regularized utilities defined in (5). As we show in Proposition 1 in Appendix A, it guarantees that each player i accumulates sublinear regret (of order log T) with respect to the regularized utility functions

U_i^t(\pi_i) := U_i(\pi_i, a^t_{-i}) = u_i(\pi_i, a^t_{-i}) - \lambda_i D_{\mathrm{KL}}(\pi_i \,\|\, \tau_i),

no matter the opponents' actions a^t_{−i} at each time t.

Algorithm 1 piKL-Hedge (for Player i)
Data:
• A_i, the set of actions for Player i;
• u_i, the reward function for Player i;
• η > 0, a learning rate hyperparameter.

function INITIALIZE()
    t ← 0
    for each action a ∈ A_i do
        CV_i^0(a) ← 0

function PLAY()
    t ← t + 1
    let π_i^t be the policy such that
        \pi_i^t(a) \propto \exp\Big( \frac{\eta\, \mathrm{CV}_i^{t-1}(a) + t \lambda_i \eta \log \tau_i(a)}{1 + t \lambda_i \eta} \Big)    (6)
    sample an action a^t ∼ π_i^t
    play a^t and observe the actions a^t_{−i} played by the opponents
    for each a ∈ A_i do
        CV_i^t(a) ← CV_i^{t−1}(a) + u_i(a, a^t_{−i})

As with many other regret-minimization methods, we consider the average policy of each player i over T iterations:

\bar\pi_i^T := \frac{1}{T} \sum_{t=1}^{T} \pi_i^t    (7)

where π_i^t is defined in (6). We take π̄_i^T to be the final agent policy produced by piKL-HedgeBot in no-press Diplomacy (as described in more detail in Section 4.3).
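To make Algorithm 1 concrete, below is a minimal NumPy sketch of piKL-Hedge running in self-play on a one-shot matrix game, following the pseudocode and Eq. (6) above; the payoff function, anchor policies, and hyperparameter values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def pikl_hedge_policy(cv, t, anchor, lam, eta):
    """Eq. (6): pi_i^t(a) proportional to exp[(eta*CV_i^{t-1}(a) + t*lam*eta*log tau_i(a)) / (1 + t*lam*eta)]."""
    logits = (eta * cv + t * lam * eta * np.log(anchor)) / (1.0 + t * lam * eta)
    logits -= logits.max()                      # numerical stability only
    p = np.exp(logits)
    return p / p.sum()

def run_pikl_hedge(payoffs, anchors, lam=0.1, eta=0.1, iters=10_000, seed=0):
    """Self-play piKL-Hedge in an N-player matrix game.

    payoffs: function mapping a tuple of action indices to a vector of per-player rewards.
    anchors: list of anchor policies tau_i, one probability vector per player.
    Returns each player's average policy (Eq. (7))."""
    rng = np.random.default_rng(seed)
    n = len(anchors)
    cv = [np.zeros(len(tau)) for tau in anchors]      # CV_i(a): cumulative reward of a vs. observed opponents
    avg = [np.zeros(len(tau)) for tau in anchors]
    for t in range(1, iters + 1):
        policies = [pikl_hedge_policy(cv[i], t, anchors[i], lam, eta) for i in range(n)]
        actions = [int(rng.choice(len(p), p=p)) for p in policies]
        for i in range(n):
            for a in range(len(anchors[i])):
                joint = tuple(actions[:i] + [a] + actions[i + 1:])
                cv[i][a] += payoffs(joint)[i]         # update CV_i(a) against the opponents' sampled actions
            avg[i] += policies[i]
    return [x / iters for x in avg]

# Toy example: rock-paper-scissors with a non-uniform anchor policy for both players.
RPS = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
payoffs = lambda a: np.array([RPS[a[0], a[1]], -RPS[a[0], a[1]]])
anchors = [np.array([0.6, 0.3, 0.1]), np.array([0.6, 0.3, 0.1])]
print(run_pikl_hedge(payoffs, anchors, lam=0.5))
```

With large lam the average policies stay near the anchor; with lam near 0 they approach the uniform Nash equilibrium of rock-paper-scissors, mirroring the tradeoff described above.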
As shown in Appendix A, the KL-divergence of π̄_i^T from the anchor policy τ_i converges to be inversely proportional to λ_i:

Theorem 1 (piKL stays close to the anchor policy). Upon running Algorithm 1 for T iterations in a multiplayer general-sum game, the policy π̄_i^T is at a distance

D_{\mathrm{KL}}(\bar\pi_i^T \,\|\, \tau_i) \le \frac{1}{\lambda_i} \Big( \frac{R_i^T}{T} + D_i \Big),

where D_i is any upper bound on possible rewards for Player i. In particular, if η > 0 is set so that R_i^T = o(T), then D_KL(π̄_i^T ‖ τ_i) is asymptotically bounded by D_i/λ_i as T → +∞.

We can also show (see Appendix A) that in the case of a two-player zero-sum game, π̄_i^T approximates a Nash equilibrium of the original utility functions, with the approximation guarantee controlled by λ:

Theorem 2. Let (π̄_1, π̄_2) be any limit point of the average policies (π̄_1^T, π̄_2^T) of the players. Almost surely, (π̄_1, π̄_2) is a (max_{i=1,2} {λ_i β_i})-approximate Nash equilibrium policy with respect to the original utility functions u_i, where β_i is as defined in (21).

Lastly, we remark that in the special case that τ_i is the uniform policy for all players i, the above results imply that Algorithm 1 converges towards a quantal response equilibrium (McKelvey & Palfrey, 1995a), in which an imperfect agent is modeled as choosing actions with probability exponentially decaying in the amount by which each action is worse than the best action(s), given that all other agents behave the same way. Our method can be seen as a generalization that takes into account a human-learned prior for which actions may be more likely.

4.3. Diplomacy Experiments

Diplomacy is a benchmark 7-player simultaneous-action game featuring both cooperation and competition. In Appendix G, we summarize the rules of the game. Using piKL-Hedge, we develop an agent, piKL-HedgeBot, and show that it improves upon prior approaches for the game. In Appendix B, we also illustrate the key features of piKL-Hedge in Blotto, a famous 2-player simultaneous-action game.

4.3.1 Algorithms and Models

In no-press Diplomacy, we compare the different equilibrium search algorithms (RM, Hedge, and piKL-Hedge) using the procedure introduced by Gray et al. (2020). We perform 1-ply lookahead where, on each turn, we sample up to 30 of the most likely actions for each player from a policy network trained via imitation learning on human data (the IL policy). We then consider the 1-ply subgame consisting of those possible actions, where the rewards for a given joint action are given by querying a value network trained on human game data as in Gray et al. (2020). We play according to the approximate equilibrium computed for that subgame by that algorithm. For piKL-Hedge, the anchor policy is simply the same human-trained policy network. Our baseline policy and value models also contain a few improvements over prior models for no-press Diplomacy, described in Appendices I and J.

In our experiments, we label our RM, Hedge, and piKL-Hedge agents as RMBot, HedgeBot, and piKL-HedgeBot respectively. We also compare against SearchBot (Gray et al., 2020) (similar to RMBot but using the models from Gray et al. (2020) rather than our models).
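As an illustration of the 1-ply search procedure just described, the sketch below assembles the subgame from hypothetical il_policy and value_net objects and hands it to a matrix-game solver such as the run_pikl_hedge sketch above; the object methods, the renormalization of the IL policy over the sampled candidate actions, and the solver interface are all assumptions made for exposition, not the authors' implementation.

```python
import numpy as np
from functools import lru_cache

def pikl_search_action(state, me, players, il_policy, value_net, solver, k=30, seed=0):
    """1-ply piKL search in the style of Section 4.3.1: sample up to k candidate actions per
    player from the IL policy, treat the induced matrix game (with rewards from the value
    network) as the subgame, solve it with a regret-minimization solver (e.g. the
    run_pikl_hedge sketch above), and sample our action from the resulting average policy.
    The anchor policy is the IL policy itself, renormalized over the candidate actions."""
    candidates = {p: il_policy.top_actions(state, p, k) for p in players}               # hypothetical API
    anchors = [il_policy.renormalized_probs(state, p, candidates[p]) for p in players]  # hypothetical API

    @lru_cache(maxsize=None)
    def payoffs(joint_idx):
        # Reward vector for one joint action, queried lazily so that only joint actions
        # actually visited by the solver require a value-network call.
        joint = {p: candidates[p][i] for p, i in zip(players, joint_idx)}
        return value_net.values(state, joint)                                           # hypothetical API

    avg_policies = solver(payoffs, anchors)
    mine = avg_policies[players.index(me)]
    rng = np.random.default_rng(seed)
    return candidates[me][int(rng.choice(len(mine), p=mine))]
```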
See Appendix H for more details about the hyperparameters used.

4.3.2 Strong, human-like play with piKL-Hedge

Similar to chess and Go, we compare the human prediction accuracy of RMBot, HedgeBot, and piKL-HedgeBot (with different λs) to the IL anchor policy, as well as testing their head-to-head strength. In particular, we test their ability to predict human moves in 226 no-press Diplomacy games from a validation set, and measure their score against the IL policy across 700 games each.

In Figure 3 (Left), we present the average top-1 accuracy of unit orders in each action predicted by these methods, as well as their average scores against 6 IL anchor policies. The raw IL model (λ = ∞) predicts human moves with high accuracy but is weak and achieves a low average score. Unregularized Hedge and RM (λ = 0) achieve high score but low human prediction accuracy. By contrast, piKL-HedgeBot with different λ achieves a variety of highly favorable combinations of the two. λ = 10^{−1} gives about the same top-1 accuracy as the IL policy but improves score by a factor of 1.4x over the IL anchor policy. λ = 10^{−3} outperforms unregularized search methods in both score (by ~5%) and human prediction accuracy (by ~6%). Mild regularization improves average score, rather than harming it.

We also tested pure RL agents and found that they perform poorly in predicting human moves. In particular, the recently proposed DORA and Human DNVI-NPU algorithms (Bakhtin et al., 2021) achieve top-1 accuracy of only 29.1% and 37.8% respectively.

Next, in Figure 3 (Right), we compare the top-1 accuracy of these methods across players of different pseudo-Elo ratings. The pseudo-Elo e_i for player i is constructed based on the logit rating s_i introduced in Gray et al. (2020), where e_i = s_i · 400/log(10) + 1000. The top-1 accuracy of all the search-based policies increases with pseudo-Elo, indicating that they are better at modeling stronger players than weaker players. piKL-Hedge (λ = 10^{−1}) performs just as well as the anchor policy across pseudo-Elos while being significantly stronger than the anchor, while λ = 10^{−3} is as strong as or stronger than Hedge and RM but matches human play far better.

Figure 3. (Left) Average top-1 accuracy of unit orders in each action predicted by the human IL anchor policy, RM, Hedge, and piKL-Hedge (λ ranging from 10^{−5} to 10^{−1}), versus head-to-head score against 6 IL anchor policies. piKL-HedgeBot (λ = 10^{−1}) predicts human moves as accurately as the anchor policy while achieving a much higher score. At the same time, piKL-HedgeBot (λ = 10^{−3}) allows for a stronger and more human-like policy than unregularized search methods (Hedge, RM). Note that equal performance would imply an average score of 1/7 ≈ 14.3%. Error bars indicate 1 standard error. (Right) Average top-1 accuracy of unit orders in each action predicted by the human policy, RM, and piKL, as a function of pseudo-Elo player rating.

Table 2. Average score achieved by agents in a uniformly sampled pool of other agents. piKL-HedgeBot (λ = 10^{−3}) outperforms all other agents in this setting. DipNet agents from (Paquette et al., 2019) use a temperature of 0.1, while IL Policy and Blueprint (Gray et al., 2020) use a temperature of 0.5. The ± shows one standard error. † (Paquette et al., 2019); ‡ (Gray et al., 2020).

Agent                          Average Score
DipNet†                        3.7% ± 0.3%
DipNet RL†                     4.7% ± 0.3%
Blueprint‡                     4.9% ± 0.3%
BRBot‡                         16.1% ± 0.6%
SearchBot‡                     13.4% ± 0.5%
IL Policy                      7.9% ± 0.4%
RMBot                          31.3% ± 0.7%
piKL-HedgeBot (λ = 10^{−3})    32.9% ± 0.7%

4.3.3 piKL-HedgeBot performs well against a varied pool of agents

We also develop a new head-to-head evaluation setting, where rather than testing one agent versus 6x of another agent, all 7 agents per game are sampled uniformly from a pool. The 1v6 head-to-head scores used in prior work (Gray et al., 2020; Bakhtin et al., 2021; Anthony et al., 2020; Paquette et al., 2019) indicate whether a population of 6x agents can be invaded by a 1x agent, and hence whether the 6 agents constitute an Evolutionarily Stable Strategy (ESS) (Taylor & Jonker, 1978; Smith, 1982). By contrast, assigning the 7 agents per game randomly from a pool studies the robustness of an agent to a variety of other agents.

We experiment with a pool of 8 agents. Five are previously published agents: DipNet, DipNet RL (Paquette et al., 2019), Blueprint, BRBot, and SearchBot (Gray et al., 2020), and three are our agents: IL policy, RMBot, and piKL-HedgeBot. Doing well in this population requires playing well with both human-like policies (DipNet, Blueprint) and equilibrium policies (SearchBot, RMBot). Each experiment only compares one lambda value of piKL-HedgeBot for fairness. The results of this experiment with piKL-HedgeBot (λ = 10^{−3}) are presented in Table 2. piKL-HedgeBot (λ = 10^{−3}) outperforms all other agents.

Overall, unlike chess and Go, we do not find that piKL-Hedge clearly improves human prediction accuracy over the IL model. However, also unlike chess and Go, we observe that piKL-Hedge does improve playing strength over prior search methods against various past agents. In general-sum games like Diplomacy, it appears that piKL-Hedge with a human anchor policy allows for slightly better play in a population containing human-like agents while still doing well against equilibrium-searchers, or alternatively can improve strength over IL to a lesser degree with no cost at all to prediction accuracy.

5. Related Work

5.1. Regularized Learning and Planning

Several prior works have explored augmenting reinforcement learning with supervised data from expert demonstrations (Vecerik et al., 2017; Nair et al., 2018). For example, Hester et al. (2018) augment Deep Q-Learning with a margin loss on demonstration data that aims to make Q(a) for each demonstration action higher than that of other actions. KL-regularization has also been used successfully to incorporate expert demonstrations into RL training (Rudner et al., 2021; Ng et al., 2000; Boularias et al., 2011; Wu et al., 2019; Peng et al., 2019; Siegel et al., 2019). In these settings, the standard RL objective is augmented by a KL divergence penalty that expresses the dissimilarity between the online policy and a reference policy derived from demonstrations. This helps guide exploration in RL or ameliorate inaccurate modeling of the environment in domains such as robotics. AlphaStar (Vinyals et al., 2019), which achieved expert human performance in StarCraft 2, uses self-play RL initialized from supervised policies, with a KL penalty term for deviating from the supervised policy during training to "aid in exploration and to preserve strategic diversity". In our work, we use the KL term to better approximate human play during inference-time search.
Prior work has explored entropy-regularized utilities in games. Ling et al. (2018) show that a particular type of entropic regularization in extensive-form games leads to quantal response equilibria. Cen et al. (2021) give fast online optimization algorithms for entropy-regularized utilities in the context of quantal response. Farina et al. (2019) regularize utilities with a KL divergence term from a precomputed Nash equilibrium strategy to design agents that trade off game-theoretic safety and exploitation. In our work, we instead regularize the utilities with a KL divergence term towards a human-imitation-learned policy. And unlike previous works, we empirically study our approach in a much larger imperfect-information game.

5.2. Strong Human-Compatible Policies

Prior work in multi-agent reinforcement learning has emphasized the importance of human-compatible policies in cooperative multi-agent environments. Lerer & Peysakhovich (2019) demonstrate that self-play policies may perform poorly with other agents if they do not conform to the population equilibrium (social conventions), and propose a combination of policy gradient and imitation loss directly on samples of population data.

Human-compatible policies have also been studied on the benchmark game of Hanabi, where ad hoc play with humans is regarded as an open challenge problem (Bard et al., 2020). Most work on this challenge has focused on zero-shot coordination with humans, in which an agent must adapt to human play with no prior experience with human partners (Hu et al., 2020; 2021b; Cui et al., 2021). Learning human-compatible policies from a combination of human data and planning is less well-studied in this setting.

5.2.1 MCTS, Chess and Go

Prior work in tree search methods, especially in chess and Go, has typically focused on developing strong agents without concern for accurately modeling human behavior. For example, in Go, a significant body of older work investigates imitation learning (IL) to obtain a baseline policy prior to use with MCTS, but tunes and evaluates the final agent via playing strength alone (Tian & Zhu, 2016; Cazenave, 2017).

Regarding the use of search for human modeling, recent work in chess by McIlroy-Young et al. (2020b) found that pure IL outperformed all other approaches and that adding MCTS with the parameters of a standard engine significantly harmed human prediction accuracy. However, in our work, using a different range of parameters, we show clear results to the contrary. In Go, Wang et al. (2017) report promising results on human prediction and playing strength using search, albeit with a specialized architecture and rollout method. Baier et al. (2018) report in the card game Spades both excellent human accuracy and playing strength by adding a human policy bias to a variant of MCTS. To our knowledge, our work is the first to demonstrate a clear gain in human prediction accuracy over deep learning models in the highly-studied domains of chess and Go via simple, well-established methods of policy-regularized planning.

5.2.2 Diplomacy

Diplomacy is a benchmark 7-player game that involves communication, negotiation, cooperation, and competition in a strategic multi-agent setting.
While chess and Go are two-player zero-sum games for which optimal play is well-defined and can be computed through self-play (Nash, 1951), Diplomacy has no such guarantees, and strong play likely requires modeling other agents, even in the no-press variant where natural-language communication is not allowed (Bakhtin et al., 2021). Paquette et al. (2019) showed that neural network IL on human data in no-press Diplomacy can reasonably approximate human play, but that bootstrapping RL from this agent leads to a breakdown in cooperation. Anthony et al. (2020) developed new RL methods based on fictitious play that improve the performance of agents in no-press Diplomacy, and Gray et al. (2020) showed that an equilibrium-finding regret minimization search procedure on top of human IL models achieves human-level performance in no-press Diplomacy. However, although both methods rely on the human IL model to generate a restricted action set for RL or search, neither contains any explicit regularization when choosing among those actions, and we show that the equilibrium search in Gray et al. (2020) greatly decreases the accuracy of modeling human players. Similarly, Bakhtin et al. (2021) achieved strong results in no-press Diplomacy via self-play RL both from scratch and initialized with a human-learned policy, but we show that the resulting final agents do not ultimately model humans well.

6. Conclusion

In this paper, we showed across several domains that regularizing search policies according to a KL-divergence loss with an imitation-learned (IL) policy produces policies that maintain high human prediction accuracy while being far stronger than the original learned policy. In chess and Go, applying standard MCTS regularized toward a human-learned policy achieves state-of-the-art prediction accuracy, surpassing imitation learning, while also winning more than 85% of games against an IL model. We then introduced a novel regret minimization algorithm that is regularized based on the KL divergence from an IL policy. In no-press Diplomacy, this algorithm yields either a policy that predicts human play with the same accuracy as imitation learning alone while increasing win rate against state-of-the-art baselines by a factor of 1.4, or alternately a policy that outperforms unregularized search while achieving much higher human prediction accuracy. We presented in Appendix L similar successful results for KL-regularized search in Hanabi.

There are several directions for future work, such as extending piKL-Hedge to handle extensive-form games. Additionally, there may be better ways to regularize search than KL-divergence. Finally, it remains to be seen how KL-regularized search performs when combined with RL.

Author Contributions

A. P. Jacob was the primary researcher for piKL-hedge and contributed to the direction, experiments, and writing of the entire paper. D. J. Wu was the primary researcher for MCTS in chess and Go, and contributed to the direction, experiments, and writing of the entire paper. G. Farina was the primary formulator of the piKL-hedge algorithm and handled all the theory in the paper. A. Lerer contributed to the direction of the project, the formulation of piKL-hedge, its experimental evaluation, and paper writing. H. Hu was the primary researcher for the extension of piKL to Hanabi covered in Appendix L. A. Bakhtin contributed to the experimental evaluation of piKL-hedge. J.
Andreas contributed to the direction of the project and to paper writing. N. Brown initiated the project and contributed to the direction of the project, the formulation of piKL-hedge, its experimental evaluation, and paper writing. References Abernethy, J. and Rakhlin, A. Beating the adaptive bandit with high probability. Technical Report UCB/EECS-2009-10, EECS Department, University of California, Berkeley, Jan 2009. URL http://www2.eecs.berkeley.edu/Pubs/ TechRpts/2009/EECS-2009-10.html. Anderson, A., McIlroy-Young, R., Sen, S., and Kleinberg, J. Introducing maia, a human-like neural network chess engine, 2021. URL https: //lichess.org/blog/X9PUixUAANCqFRSh/ introducing-maia-a-human-like-neuralnetwork-chess-engine. Anthony, T., Eccles, T., Tacchetti, A., Kramár, J., Gemp, I., Hudson, T., Porcel, N., Lanctot, M., Perolat, J., Everett, R., Singh, S., Graepel, T., and Bachrach, Y. Learning to play no-press diplomacy with best response policy iteration. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 17987–18003. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/ paper/2020/file/ d1419302db9c022ab1d48681b13d5f8bPaper.pdf. Baier, H., Sattaur, A., Powley, E., Devlin, S., Rollason, J., and Cowling, P. Emulating human play in a leading mobile card game. IEEE Transactions on Games, PP:1–1, 05 2018. doi: 10.1109/TG.2018.2835764. Bakhtin, A., Wu, D., Lerer, A., and Brown, N. No-press diplomacy from scratch. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. Bard, N., Foerster, J. N., Chandar, S., Burch, N., Lanctot, M., Song, H. F., Parisotto, E., Dumoulin, V., Moitra, S., Hughes, E., et al. The hanabi challenge: A new frontier for ai research. Artificial Intelligence, 280:103216, 2020. Blackwell, D. et al. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956. Boularias, A., Kober, J., and Peters, J. Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 182–189. JMLR Workshop and Conference Proceedings, 2011. Brown, N. and Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, pp. eaao1733, 2017. Brown, N. and Sandholm, T. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019. Campbell, M., Hoane Jr, A. J., and Hsu, F.-h. Deep Blue. Artificial intelligence, 134(1-2):57–83, 2002. Cazenave, T. Residual networks for computer go. IEEE Transactions on Computational Intelligence and AI in Games, PP:1–1, 03 2017. doi: 10.1109/ TCIAIG.2017.2681042. Cen, S., Wei, Y., and Chi, Y. Fast policy extragradient methods for competitive games with entropy regularization. In Neural Information Processing Systems (NeurIPS), 2021. Cui, B., Hu, H., Pineda, L., and Foerster, J. K-level resoning for zero-shot coordination in hanabi. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. Egri-Nagy, A. and Törmänen, A. The game is not over yet—go in the post-alphago era. Philosophies, 5(4):37–0, 2020. ISSN 2409-9287. doi: 10.3390/philosophies5040037. URL https:// www.mdpi.com/2409-9287/5/4/37. Farina, G., Kroer, C., and Sandholm, T. Online convex optimization for sequential decision processes and extensiveform games. In AAAI Conference on Artificial Intelligence, 2019. 
Modeling Strong and Human-Like Gameplay with KL-Regularized Search Fickinger, A., Hu, H., Amos, B., Russell, S., and Brown, N. Scalable online planning via reinforcement learning fine-tuning. CoRR, abs/2109.15316, 2021. URL https: //arxiv.org/abs/2109.15316. Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997. Gray, J., Lerer, A., Bakhtin, A., and Brown, N. Humanlevel performance in no-press diplomacy via equilibrium search. In International Conference on Learning Representations, 2020. Grill, J.-B., Altché, F., Tang, Y., Hubert, T., Valko, M., Antonoglou, I., and Munos, R. Monte-carlo tree search as regularized policy optimization. In International Conference on Machine Learning, pp. 3769–3778. PMLR, 2020. Hannan, J. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957. Hart, S. and Mas-Colell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5): 1127–1150, 2000. Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., et al. Deep q-learning from demonstrations. In Thirtysecond AAAI conference on artificial intelligence, 2018. Hu, H., Lerer, A., Peysakhovich, A., and Foerster, J. “otherplay” for zero-shot coordination. In International Conference on Machine Learning, pp. 4399–4410. PMLR, 2020. Hu, H., Lerer, A., Brown, N., and Foerster, J. N. Learned belief search: Efficiently improving policies in partially observable settings. CoRR, abs/2106.09086, 2021a. URL https://arxiv.org/abs/2106.09086. Hu, H., Lerer, A., Cui, B., Pineda, L., Wu, D., Brown, N., and Foerster, J. Off-belief learning. In International Conference on Machine Learning. PMLR, 2021b. Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, 2018. doi: 10.1109/CVPR.2018.00745. LC0. Leela chess zero information page on "neural network topology". https://lczero.org/dev/ backend/nn/, 2020. Lerer, A. and Peysakhovich, A. Learning existing social conventions via observationally augmented self-play. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 107–114. ACM, 2019. Lerer, A., Hu, H., Foerster, J., and Brown, N. Improving policies via search in cooperative partially observable games. In AAAI Conference on Artificial Intelligence, 2020. Ling, C. K., Fang, F., and Kolter, J. Z. What game are we playing? end-to-end learning in normal and extensive form games. International Joint Conferences on Artificial Intelligence Organization, 2018. Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. Information and computation, 108(2):212– 261, 1994. McIlroy-Young, R., Sen, S., Kleinberg, J., and Anderson, A. Aligning superhuman ai with human behavior: Chess as a model system. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1677–1687, 2020a. McIlroy-Young, R., Wang, R., Sen, S., Kleinberg, J., and Anderson, A. Learning personalized models of human behavior in chess. Technical report, Microsoft, Inc., August 2020b. URL https: //www.microsoft.com/en-us/research/ publication/learning-personalizedmodels-of-human-behavior-in-chess/. McKelvey, R. D. and Palfrey, T. R. Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6–38, 1995a. ISSN 0899-8256. 
doi: https://doi.org/10.1006/game.1995.1023. URL https://www.sciencedirect.com/science/article/pii/S0899825685710238.

McKelvey, R. D. and Palfrey, T. R. Quantal response equilibria for normal form games. Games and economic behavior, 10(1):6–38, 1995b.

Moravčík, M., Schmid, M., Burch, N., Lisỳ, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., and Bowling, M. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.

Lai, M. Forum post on alphazero news (post by user matthewlai). http://talkchess.com/forum3/viewtopic.php?f=2&t=69175&sid=06ca6a966c29743d765c11b13402be8d&start=70#p781765, 2018.

Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6292–6299. IEEE, 2018.

Nash, J. Non-cooperative games. Annals of mathematics, pp. 286–295, 1951.

Smith, J. M. Evolution and the Theory of Games. Cambridge university press, 1982.

Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In Icml, volume 1, pp. 2, 2000.

Taylor, P. D. and Jonker, L. B. Evolutionary stable strategies and game dynamics. Mathematical biosciences, 40(1-2):145–156, 1978.

Paquette, P., Lu, Y., Bocco, S. S., Smith, M., Satya, O.-G., Kummerfeld, J. K., Pineau, J., Singh, S., and Courville, A. C. No-press diplomacy: Modeling multi-agent gameplay. In Advances in Neural Information Processing Systems, pp. 4474–4485, 2019.

Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

Rakhlin, A. Lecture notes on online learning, 2009.

Rudner, T. G., Lu, C., Osborne, M., Gal, Y., and Teh, Y. W. On pathologies in kl-regularized reinforcement learning from expert demonstrations. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.

Siegel, N., Springenberg, J. T., Berkenkamp, F., Abdolmaleki, A., Neunert, M., Lampe, T., Hafner, R., Heess, N., and Riedmiller, M. Keep doing what worked: Behavior modelling priors for offline reinforcement learning. In International Conference on Learning Representations, 2019.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.

Siu, H. C., Peña, J., Chang, K. C., Chen, E., Zhou, Y., Lopez, V. J., Palko, K., and Allen, R. E. Evaluation of human-ai teams for learned and rule-based agents in hanabi. CoRR, abs/2107.07630, 2021.
URL https: //arxiv.org/abs/2107.07630. Tian, Y. Github thread for elf opengo, "[suggestion] clarify fpu in paper". https://github.com/pytorch/ ELF/issues/140, 2019. Tian, Y. and Zhu, Y. Better computer go player with neural network and long-term prediction, 2016. Troisi, S. Github thread for minigo "[experiment] squeeze and excitation". https://github.com/ tensorflow/minigo/issues/683, 2019. Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575 (7782):350–354, 2019. Wang, J., Wang, W., Wang, R., and Gao, W. Beyond monte carlo tree search: Playing go with deep alternative neural network and long-term evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1), Feb. 2017. URL https://ojs.aaai.org/index.php/ AAAI/article/view/10749. Wu, D. Go neural net sandbox. https: //github.com/lightvector/GoNN#rawneural-net-results, 2018. Wu, D. Accelerating self-play learning in go. In AAAI-20 Workshop on Reinforcement Learning in Games, 2020. Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning. 2019. Modeling Strong and Human-Like Gameplay with KL-Regularized Search A. Proofs In this Appendix, we present detailed proofs of Proposition 1 and Theorem 2. A.1. Known results We start by recalling a few standard results. First, we recall the follow-the-regularized-leader (FTRL) algorithm, one of the most well-studied algorithms in online optimization. At every time t, the FTRL algorithm instantiated with domain X , 1-strongly-convex regularizer φ : X → R, and learning rate η > 0, produces iterates according to ( ) t X φ(x) xt+1 = arg max − + `τ (x) , η x∈X τ =1 (FTRL) where `1 , . . . , `t are the convex utility functions gave as feedback by the environment. The FTRL algorithm guarantees the following regret bound. Lemma 1 (Rakhlin (2009), Corollary 7). The iterates xt ∈ X produced by the FTRL algorithm set up with constant step size η > 0 and 1-strongly convex regularizer φ satisfy the regret bound T X t=1 `t (u) − `t (xt ) ≤ T φ(u) X t t+1 + ` (x ) − `t (xt ) η t=1 ∀u ∈ X. (8) In the analysis of Algorithm 1 we will also make use of the following technical lemma, a proof of which can be obtained starting using the same techniques as Abernethy & Rakhlin (2009, Lemma A.4) Lemma 2. Let p ∈ ∆(A) be a distribution over a discrete set A, q ∈ R|A| be a vector, and D > 0 be any constant such that maxa,a0 ∈A {q(a) − q(a0 )} ≤ M . Then, 2 a∈A p(a) · exp{−q(a)} P 2 − 1 ≤ p(a) · exp{−q(a)} a∈A P exp{2M } X p(a)q(a)2 . M2 a∈A Proof. Let qmin := mina∈A q(a) and q̃(a) := q(a) − qmin be a shifted version of q(a) so that 0 ≤ q̃(a) ≤ M for all a ∈ A. Let now X denote a random variable with value q̃(a) with probability p(a) for all a ∈ A. Then, 2 a∈A p(a) · exp{−q(a)} P exp{qmin }2 a∈A p(a) · exp{q(a)}2 2 − 1 = 2 − 1 P P exp{qmin }2 a∈A p(a) · exp{−q(a)} a∈A p(a) · exp{−q(a)} P p(a) · exp{−q̃(a)}2 = Pa∈A 2 − 1 a∈A p(a) · exp{−q̃(a)} P = E[exp(−X)] − 1. 
[E exp(−X)]2 Applying Lemma A.4 from Abernethy & Rakhlin (2009) we obtain 2 a∈A p(a) · exp{−q(a)} P 2 − 1 ≤ a∈A p(a) · exp{−q(a)} P exp{2M } − 2M − 1 exp{2M } (E[X 2 ] − E[X]2 ) ≤ E[X 2 ], M2 M2 which is exactly the statement. A.2. Bounding the distance between the iterates of Algorithm 1 t Lemma 3. At all times t and for all players i, the with constant step Ppolicies πi produced by the FTRL algorithm set up size η and negative entropy regularizer ϕ(x) := a∈Ai x(a) log x(a), when observing the utilities Uit , match the policies πit produced by Algorithm 1. Modeling Strong and Human-Like Gameplay with KL-Regularized Search Proof. Plugging the particular choices of utilities and regularizer into (FTRL), we obtain ( t ! ) X 0 X 1 t+1 πi = arg max Uit (π) − π(a) log π(a) η π∈∆(Ai ) 0 t =1 a∈Ai ( )  ! X t X X π(a) t0 π(a) ui (a, a−i ) − λi π(a) log = arg max η − π(a) log π(a) τi (a) π∈∆(Ai ) t0 =1 a∈Ai a∈Ai ( ! ) t X X X t0 tλi log τi (a) + = arg max η ui (a, a−i ) π(a) − (1 + tλi η) π(a) log π(a) π∈∆(Ai ) t0 =1 a∈Ai a∈Ai ( X η = arg max π∈∆(Ai ) 1 + tλi η tλi log τi (a) + t X ! 0 ui (a, at−i ) t0 =1 a∈Ai ) π(a) − X π(a) log π(a) . (9) a∈Ai A well-known closed form solution to the above entropy-regularized problem is given by the softmax function. In particular, let ! t X η t+1 t0 wi (a) := tλi log τi (a) + ũi (a, a−i ) ∀a ∈ Ai . (10) 1 + tλi η 0 t =1 Then, exp{wit+1 (a)} t+1 0 (a )} a0 ∈Ai exp{wi πit+1 (a) = P ∀a ∈ Ai , which coincides with the iterate produced by Algorithm 1. The next observation shows that the iterates πi do not change if the utility function ui is first shifted to be in the range [0, Di ]. Remark 1. Consider the shifted utilities 0 ũi (a, at−i ) := ui (a, at−i ) − min a∈A1 ×···×An ui (a) ∈ [0, Di ] (11) and let v be defined as (10) using ũi in place of ui , that is, vit+1 (a) := t X 0 η ũi (a, at−i ) tλi log τi (a) + 1 + tλi η 0 ! t =1 ∀a ∈ Ai . (12) Then, the iterates πi can be equivalently expressed as exp{vit+1 (a)} t+1 0 (a )} a0 ∈Ai exp{vi πit+1 (a) = P ∀a ∈ Ai . Proof. Let γ := mina∈A1 ×···×An ui (a) denote the minimum utility that Player i can get against the actions of the opponents. Since the argmax of a function does not change if a constant is added to the objective, from (9) we can write ( ! ) t X X X η η t+1 t0 πi = arg max − γ+ tλi log τi (a) + ui (a, a−i ) π(a) − π(a) log π(a) 1 + tλi η 1 + tλi η π∈∆(Ai ) t0 =1 a∈Ai a∈Ai ! ) ( t X X X η t0 = arg max tλi log τi (a) + (ui (a, a−i ) − γ) π(a) − π(a) log π(a) π∈∆(Ai ) 1 + tλi η a∈A t0 =1 a∈Ai i ) ( X X t+1 = arg max vi (a)π(a) − π(a) log π(a) , π∈∆(Ai ) a∈Ai a∈Ai Modeling Strong and Human-Like Gameplay with KL-Regularized Search where the second equality follows from the fact that π ∈ ∆(Ai ). A solution is again given by softmax function exp{vit+1 (a)} t+1 0 (a )} a0 ∈Ai exp{vi πit+1 (a) = P (13) ∀a ∈ Ai . In the rest of the proof we will use (13) to analyze the iterates πi produced by the algorithm. Lemma 4. Let η ≤ 1/(λi βi + 2Di ). Then, at all times t, kπit+1 − πit k∇2 ϕ(πit ) ≤ √ 3e . tλi η Proof. At all times t, introduce the vector ξit ∈ R|Ai | defined as ξit (a) :=  η −λi v t (a) + λi log τi (a) + ũi (a, at−i ) 1 + tλi η ∀a ∈ Ai . (14) At all times t and for all a, it holds that   η 1 + (t − 1)λi η t v (a) + λi log τi (a) + ũi (a, at−i ) 1 + tλi η η   η 1 + tλi η t = v (a) − λi v t (a) + λi log τi (a) + ũi (a, at−i ) 1 + tλi η η vit+1 (a) = (15) = v t (a) + ξit (a). Substituting (15) we can write exp{vit (a)} · exp{ξit (a)} πit (a) exp{ξit (a)} P = . 
In the rest of the analysis we will use (13) to study the iterates $\pi_i$ produced by the algorithm.

Lemma 4. Let $\eta\le 1/(\lambda_i\beta_i+2D_i)$. Then, at all times t,
$$\|\pi_i^{t+1}-\pi_i^t\|_{\nabla^2\varphi(\pi_i^t)} \;\le\; \frac{\sqrt{3}\,e}{t\lambda_i\eta}.$$

Proof. At all times t, introduce the vector $\xi_i^t\in\mathbb{R}^{|A_i|}$ defined as
$$\xi_i^t(a) := \frac{\eta}{1+t\lambda_i\eta}\Bigl(-\lambda_i v_i^t(a) + \lambda_i\log\tau_i(a) + \tilde u_i(a,a_{-i}^t)\Bigr) \qquad \forall a\in A_i. \tag{14}$$
At all times t and for all a, it holds that
$$v_i^{t+1}(a) = \frac{\eta}{1+t\lambda_i\eta}\left(\frac{1+(t-1)\lambda_i\eta}{\eta}v_i^t(a) + \lambda_i\log\tau_i(a) + \tilde u_i(a,a_{-i}^t)\right) = \frac{\eta}{1+t\lambda_i\eta}\left(\frac{1+t\lambda_i\eta}{\eta}v_i^t(a) - \lambda_i v_i^t(a) + \lambda_i\log\tau_i(a) + \tilde u_i(a,a_{-i}^t)\right) = v_i^t(a) + \xi_i^t(a). \tag{15}$$
Substituting (15) into (13) we can write
$$\pi_i^{t+1}(a) = \frac{\exp\{v_i^t(a)\}\exp\{\xi_i^t(a)\}}{\sum_{a'\in A_i}\exp\{v_i^t(a')\}\exp\{\xi_i^t(a')\}} = \frac{\pi_i^t(a)\exp\{\xi_i^t(a)\}}{\sum_{a'\in A_i}\pi_i^t(a')\exp\{\xi_i^t(a')\}}. \tag{16}$$
Expanding the definition of the local norm induced by $\nabla^2\varphi(\pi_i^t)$, expanding the square, and using $\sum_a\pi_i^t(a)=1$, we find
$$\|\pi_i^{t+1}-\pi_i^t\|^2_{\nabla^2\varphi(\pi_i^t)} = \sum_{a\in A_i}\frac{1}{\pi_i^t(a)}\bigl(\pi_i^{t+1}(a)-\pi_i^t(a)\bigr)^2 = \sum_{a\in A_i}\pi_i^t(a)\left(\frac{\exp\{\xi_i^t(a)\}}{\sum_{a'\in A_i}\pi_i^t(a')\exp\{\xi_i^t(a')\}}-1\right)^2 \tag{17}$$
$$= \frac{\sum_{a\in A_i}\pi_i^t(a)\exp\{\xi_i^t(a)\}^2}{\bigl(\sum_{a'\in A_i}\pi_i^t(a')\exp\{\xi_i^t(a')\}\bigr)^2} - 1, \tag{18}$$
where (17) follows from substituting (16). We now apply Lemma 2 with $q=\xi_i^t$, $p=\pi_i^t$, and $A=A_i$. First, we study the range $\max_{a,a'\in A_i}\{\xi_i^t(a)-\xi_i^t(a')\}$ appearing in the statement of the lemma. In particular, using (15) and the definitions of $v_i^t$ and $v_i^{t+1}$ we have
$$\begin{aligned}
\max_{a,a'\in A_i}\{\xi_i^t(a)-\xi_i^t(a')\} &= \max_{a,a'\in A_i}\{v_i^{t+1}(a)-v_i^t(a)-v_i^{t+1}(a')+v_i^t(a')\}\\
&= \max_{a,a'\in A_i}\Biggl\{\frac{\lambda_i\eta\,\bigl(\log\tau_i(a)-\log\tau_i(a')\bigr)}{(1+t\lambda_i\eta)(1+(t-1)\lambda_i\eta)} - \frac{\lambda_i\eta^2}{(1+t\lambda_i\eta)(1+(t-1)\lambda_i\eta)}\sum_{t'=1}^{t-1}\bigl(\tilde u_i(a,a_{-i}^{t'})-\tilde u_i(a',a_{-i}^{t'})\bigr) + \frac{\eta}{1+t\lambda_i\eta}\bigl(\tilde u_i(a,a_{-i}^t)-\tilde u_i(a',a_{-i}^t)\bigr)\Biggr\}\\
&\le \eta(\lambda_i\beta_i + 2D_i), \tag{19}
\end{aligned}$$
where the inequality follows from upper bounding the max of the sum by the sum of the maxes of each term, together with
$$\frac{\lambda_i\eta}{(1+t\lambda_i\eta)(1+(t-1)\lambda_i\eta)} \le \frac{\lambda_i\eta}{1+t\lambda_i\eta} \le \lambda_i\eta, \qquad \frac{\eta}{1+t\lambda_i\eta}\le\eta,$$
and
$$\frac{\lambda_i\eta^2}{(1+t\lambda_i\eta)(1+(t-1)\lambda_i\eta)}\left|\sum_{t'=1}^{t-1}\bigl(\tilde u_i(a,a_{-i}^{t'})-\tilde u_i(a',a_{-i}^{t'})\bigr)\right| \le \frac{\lambda_i\eta^2\,t D_i}{t\lambda_i\eta} = \eta D_i.$$
Applying Lemma 2 to the right-hand side of (18) with the bound on the range of $\xi_i^t$ shown in (19) yields
$$\|\pi_i^{t+1}-\pi_i^t\|^2_{\nabla^2\varphi(\pi_i^t)} \;\le\; \frac{\exp\{2\eta(\lambda_i\beta_i+2D_i)\}}{\eta^2(\lambda_i\beta_i+2D_i)^2}\sum_{a\in A_i}\pi_i^t(a)\,\xi_i^t(a)^2. \tag{20}$$
Using the fact that any convex combination of values is upper bounded by the maximum value, we can further bound the right-hand side of (20) as
$$\|\pi_i^{t+1}-\pi_i^t\|^2_{\nabla^2\varphi(\pi_i^t)} \;\le\; \frac{\exp\{2\eta(\lambda_i\beta_i+2D_i)\}}{\eta^2(\lambda_i\beta_i+2D_i)^2}\max_{a\in A_i}\xi_i^t(a)^2 = \frac{\exp\{2\eta(\lambda_i\beta_i+2D_i)\}}{\eta^2(\lambda_i\beta_i+2D_i)^2}\max_{a\in A_i}\bigl(v_i^{t+1}(a)-v_i^t(a)\bigr)^2,$$
where the equality follows from (15).
Hence, expanding the definition of $v_i^t$ and $v_i^{t+1}$ and using $(x+y+z)^2\le 3(x^2+y^2+z^2)$,
$$\begin{aligned}
\|\pi_i^{t+1}-\pi_i^t\|^2_{\nabla^2\varphi(\pi_i^t)} &\le \frac{\exp\{2\eta(\lambda_i\beta_i+2D_i)\}}{\eta^2(\lambda_i\beta_i+2D_i)^2}\max_{a\in A_i}\left(\frac{\lambda_i\eta\,\log\tau_i(a)}{(1+t\lambda_i\eta)(1+(t-1)\lambda_i\eta)} + \frac{\eta\,\tilde u_i(a,a_{-i}^t)}{1+t\lambda_i\eta} - \frac{\lambda_i\eta^2}{(1+t\lambda_i\eta)(1+(t-1)\lambda_i\eta)}\sum_{t'=1}^{t-1}\tilde u_i(a,a_{-i}^{t'})\right)^2\\
&\le 3\,\frac{\exp\{2\eta(\lambda_i\beta_i+2D_i)\}}{(1+t\lambda_i\eta)^2\,\eta^2(\lambda_i\beta_i+2D_i)^2}\bigl(\lambda_i^2\eta^2\beta_i^2 + 2\eta^2 D_i^2\bigr)\\
&\le 3\,\frac{\exp\{2\eta(\lambda_i\beta_i+2D_i)\}}{(1+t\lambda_i\eta)^2} \;\le\; 3\,\frac{\exp\{2\eta(\lambda_i\beta_i+2D_i)\}}{(t\lambda_i\eta)^2} = \left(\frac{\sqrt{3}\,\exp\{\eta(\lambda_i\beta_i+2D_i)\}}{t\lambda_i\eta}\right)^2.
\end{aligned}$$
Using the hypothesis that $\eta\le 1/(\lambda_i\beta_i+2D_i)$ and taking square roots yields the statement.

A.3. Completing the analysis

Proposition 1. Fix a player $i\in\{1,\ldots,n\}$. The regret
$$R_i^T := \max_{\pi^*\in\Delta(A_i)}\sum_{t=1}^{T}U_i^t(\pi^*) - \sum_{t=1}^{T}U_i^t(\pi_i^t)$$
incurred up to any time T by the policies $\pi_i^t$ defined in (6), where the learning rate is set to any value $0<\eta\le 1/(\lambda_i\beta_i+2D_i)$, satisfies
$$R_i^T \;\le\; \frac{\log|A_i|}{\eta} + \frac{3e(1+\log T)}{\lambda_i\eta}\bigl(D_i + \lambda_i\beta_i + \lambda_i\sqrt{|A_i|}\bigr),$$
where $D_i$ is any upper bound on the range of the possible rewards of Player i, and
$$\beta_i := \max_{a\in A_i}\log\bigl(1/\tau_i(a)\bigr). \tag{21}$$

Proof. Let $q_i^t := \bigl(\tilde u_i(a,a_{-i}^t)\bigr)_{a\in A_i}$, and note that, by definition of the regularized utilities $U_i$,
$$\begin{aligned}
U_i^t(\pi_i^{t+1}) - U_i^t(\pi_i^t) &= (q_i^t)^\top\bigl(\pi_i^{t+1}-\pi_i^t\bigr) - \lambda_i D_{\mathrm{KL}}(\pi_i^{t+1}\,\|\,\tau_i) + \lambda_i D_{\mathrm{KL}}(\pi_i^t\,\|\,\tau_i)\\
&= (q_i^t)^\top\bigl(\pi_i^{t+1}-\pi_i^t\bigr) - \lambda_i\varphi(\pi_i^{t+1}) + \lambda_i\varphi(\pi_i^t) - \lambda_i\nabla\varphi(\tau_i)^\top\bigl(\pi_i^t-\pi_i^{t+1}\bigr)\\
&= \bigl(q_i^t + \lambda_i\nabla\varphi(\tau_i)\bigr)^\top\bigl(\pi_i^{t+1}-\pi_i^t\bigr) - \lambda_i\varphi(\pi_i^{t+1}) + \lambda_i\varphi(\pi_i^t)\\
&\le \bigl(q_i^t + \lambda_i\nabla\varphi(\tau_i)\bigr)^\top\bigl(\pi_i^{t+1}-\pi_i^t\bigr) - \lambda_i\varphi(\pi_i^t) - \lambda_i\nabla\varphi(\pi_i^t)^\top\bigl(\pi_i^{t+1}-\pi_i^t\bigr) + \lambda_i\varphi(\pi_i^t)\\
&= \bigl(q_i^t + \lambda_i\nabla\varphi(\tau_i) - \lambda_i\nabla\varphi(\pi_i^t)\bigr)^\top\bigl(\pi_i^{t+1}-\pi_i^t\bigr)\\
&\le \bigl\|q_i^t + \lambda_i\nabla\varphi(\tau_i) - \lambda_i\nabla\varphi(\pi_i^t)\bigr\|_{\nabla^{-2}\varphi(\pi_i^t)}\cdot\bigl\|\pi_i^{t+1}-\pi_i^t\bigr\|_{\nabla^2\varphi(\pi_i^t)},
\end{aligned}$$
where the first inequality follows by convexity of $\varphi$ and the second by the generalized Cauchy-Schwarz inequality with the primal-dual norm pair $\|\cdot\|_{\nabla^2\varphi(\pi_i^t)}$ and $\|\cdot\|_{\nabla^{-2}\varphi(\pi_i^t)}$. A bound for the second term in the product is given by Lemma 4. We now bound the first norm. First,
$$\begin{aligned}
\bigl\|q_i^t + \lambda_i\nabla\varphi(\tau_i) - \lambda_i\nabla\varphi(\pi_i^t)\bigr\|^2_{\nabla^{-2}\varphi(\pi_i^t)} &= \sum_{a\in A_i}\pi_i^t(a)\bigl(\tilde u_i(a,a_{-i}^t) + \lambda_i\log\tau_i(a) - \lambda_i\log\pi_i^t(a)\bigr)^2\\
&\le 3\sum_{a\in A_i}\pi_i^t(a)\Bigl(\tilde u_i(a,a_{-i}^t)^2 + \lambda_i^2(\log\tau_i(a))^2 + \lambda_i^2(\log\pi_i^t(a))^2\Bigr)\\
&\le 3\bigl(D_i^2 + \lambda_i^2\beta_i^2 + \lambda_i^2|A_i|\bigr) \;\le\; 3\bigl(D_i + \lambda_i\beta_i + \lambda_i\sqrt{|A_i|}\bigr)^2,
\end{aligned}$$
where the second inequality follows from the fact that $x\log^2 x\le 1$ for all $x\in[0,1]$. Taking square roots, we find
$$\bigl\|q_i^t + \lambda_i\nabla\varphi(\tau_i) - \lambda_i\nabla\varphi(\pi_i^t)\bigr\|_{\nabla^{-2}\varphi(\pi_i^t)} \le \sqrt{3}\bigl(D_i + \lambda_i\beta_i + \lambda_i\sqrt{|A_i|}\bigr).$$
So, using Lemma 4, we can write
$$U_i^t(\pi_i^{t+1}) - U_i^t(\pi_i^t) \;\le\; \frac{3e}{t\lambda_i\eta}\bigl(D_i + \lambda_i\beta_i + \lambda_i\sqrt{|A_i|}\bigr).$$
Plugging the above expression into Lemma 1 yields
$$R_i^T \;\le\; \frac{\log|A_i|}{\eta} + \sum_{t=1}^{T}\frac{3e}{t\lambda_i\eta}\bigl(D_i+\lambda_i\beta_i+\lambda_i\sqrt{|A_i|}\bigr) \;\le\; \frac{\log|A_i|}{\eta} + \frac{3e(1+\log T)}{\lambda_i\eta}\bigl(D_i+\lambda_i\beta_i+\lambda_i\sqrt{|A_i|}\bigr),$$
which is the statement.
A.4. Proof of Theorem 1

Theorem 1 (piKL stays close to the anchor policy). Upon running Algorithm 1 for T iterations in a multiplayer general-sum game, the average policy $\bar\pi_i^T$ is at a distance
$$D_{\mathrm{KL}}(\bar\pi_i^T\,\|\,\tau_i) \;\le\; \frac{1}{\lambda_i}\left(\frac{R_i^T}{T} + D_i\right)$$
from the anchor policy $\tau_i$, where $D_i$ is any upper bound on the possible rewards for Player i. In particular, if $\eta>0$ is set so that $R_i^T=o(T)$, then the bound on $D_{\mathrm{KL}}(\bar\pi_i^T\,\|\,\tau_i)$ approaches $D_i/\lambda_i$ as $T\to+\infty$.

Proof. By definition of regret,
$$\begin{aligned}
\frac{1}{T}R_i^T &= \max_{\pi_i^*\in\Delta(A_i)}\frac{1}{T}\sum_{t=1}^{T}\Bigl(U_i^t(\pi_i^*) - U_i^t(\pi_i^t)\Bigr)\\
&= \max_{\pi_i^*\in\Delta(A_i)}\left\{\frac{1}{T}\sum_{t=1}^{T}u_i(\pi_i^*,\pi_{-i}^t) - \frac{1}{T}\sum_{t=1}^{T}u_i(\pi_i^t,\pi_{-i}^t) - \lambda_i D_{\mathrm{KL}}(\pi_i^*\,\|\,\tau_i) + \frac{\lambda_i}{T}\sum_{t=1}^{T}D_{\mathrm{KL}}(\pi_i^t\,\|\,\tau_i)\right\}\\
&\ge \frac{1}{T}\sum_{t=1}^{T}u_i(\tau_i,\pi_{-i}^t) - \frac{1}{T}\sum_{t=1}^{T}u_i(\pi_i^t,\pi_{-i}^t) + \frac{\lambda_i}{T}\sum_{t=1}^{T}D_{\mathrm{KL}}(\pi_i^t\,\|\,\tau_i)\\
&\ge \frac{1}{T}\sum_{t=1}^{T}u_i(\tau_i,\pi_{-i}^t) - \frac{1}{T}\sum_{t=1}^{T}u_i(\pi_i^t,\pi_{-i}^t) + \lambda_i D_{\mathrm{KL}}(\bar\pi_i^T\,\|\,\tau_i)\\
&\ge -D_i + \lambda_i D_{\mathrm{KL}}(\bar\pi_i^T\,\|\,\tau_i),
\end{aligned}$$
where the first inequality follows by lower bounding the maximum with its value at $\pi_i^*=\tau_i$ (for which the KL penalty vanishes, and the remaining KL terms are nonnegative), the second by convexity of the KL divergence, and the third by the definition of $D_i$. Rearranging yields the inequality in the statement.

A.5. Relationship with Nash Equilibrium

In this subsection, we show that when all players play according to Algorithm 1 in a two-player zero-sum game, the average policies $\bar\pi_i^T$ converge to a Nash equilibrium of the regularized game whose utilities are $U_i$.

Proposition 2. For any $T\in\mathbb{N}$, $\eta>0$, and $\delta\in(0,1)$, define the quantity
$$\xi^T(\delta) := \frac{R_1^T+R_2^T}{T} + \bigl(\max_i D_i\bigr)\sqrt{\frac{32}{T}\log\frac{2\max_i|A_i|}{\delta}}.$$
Upon running Algorithm 1 for any T iterations with learning rate $\eta>0$, the average policies $\bar\pi_i^T$ of each player form a $\xi^T(\delta)$-approximate Nash equilibrium with respect to the regularized utility functions $U_i$ with probability at least $1-\delta$, for any $\delta\in(0,1)$.

Proof. Fix a player $i\in\{1,2\}$ and any policy $\pi^*\in\Delta(A_i)$, and introduce the discrete-time stochastic process
$$w^t := \bigl(U_i(\pi^*,\pi_{-i}^t) - U_i(\pi_i^t,\pi_{-i}^t)\bigr) - \bigl(U_i(\pi^*,a_{-i}^t) - U_i(\pi_i^t,a_{-i}^t)\bigr).$$
Since the opponent $-i$ plays according to Algorithm 1, its action $a_{-i}^t$ at all times t is selected by sampling (unbiasedly) an action from the policy $\pi_{-i}^t$. Therefore, $w^t$ is a martingale difference sequence. Furthermore, by expanding the definition of $U_i$, the absolute value of $w^t$ satisfies
$$|w^t| = \bigl|\bigl(u_i(\pi^*,\pi_{-i}^t)-u_i(\pi_i^t,\pi_{-i}^t)\bigr) - \bigl(u_i(\pi^*,a_{-i}^t)-u_i(\pi_i^t,a_{-i}^t)\bigr)\bigr| \le \bigl|u_i(\pi^*,\pi_{-i}^t)-u_i(\pi_i^t,\pi_{-i}^t)\bigr| + \bigl|u_i(\pi^*,a_{-i}^t)-u_i(\pi_i^t,a_{-i}^t)\bigr| \le 2D_i.$$
Hence, using the Azuma-Hoeffding inequality, for any $\delta\in(0,1)$,
$$1-\delta \;\le\; \mathbb{P}\left[\sum_{t=1}^{T}w^t \le D_i\sqrt{8T\log\frac{1}{\delta}}\right] \;\le\; \mathbb{P}\left[\sum_{t=1}^{T}U_i(\pi^*,\pi_{-i}^t) - \sum_{t=1}^{T}U_i(\pi_i^t,\pi_{-i}^t) \le R_i^T + D_i\sqrt{8T\log\frac{1}{\delta}}\right],$$
where the second inequality uses the fact that $\sum_{t=1}^{T}\bigl(U_i(\pi^*,a_{-i}^t)-U_i(\pi_i^t,a_{-i}^t)\bigr)\le R_i^T$ by definition of $R_i^T$. Since the above holds for any $\pi^*\in\Delta(A_i)$, using a union bound over each $a\in A_i$,
$$\mathbb{P}\left[\max_{\pi^*\in\Delta(A_i)}\sum_{t=1}^{T}U_i(\pi^*,\pi_{-i}^t) - \sum_{t=1}^{T}U_i(\pi_i^t,\pi_{-i}^t) \le R_i^T + D_i\sqrt{8T\log\frac{|A_i|}{\delta}}\right] \ge 1-\delta \tag{22}$$
for any player $i\in\{1,2\}$ and any $\delta\in(0,1)$.
Summing Inequality (22) for $i\in\{1,2\}$ and using a union bound, we can further write
$$\mathbb{P}\Biggl[\max_{\pi_1^*\in\Delta(A_1)}\Bigl\{\sum_{t=1}^{T}U_1(\pi_1^*,\pi_2^t)\Bigr\} + \max_{\pi_2^*\in\Delta(A_2)}\Bigl\{\sum_{t=1}^{T}U_2(\pi_1^t,\pi_2^*)\Bigr\} - \sum_{t=1}^{T}\bigl(U_1(\pi_1^t,\pi_2^t)+U_2(\pi_1^t,\pi_2^t)\bigr) \le R_1^T+R_2^T+\bigl(\max_i D_i\bigr)\sqrt{32T\log\frac{\max_i|A_i|}{\delta}}\Biggr] \ge 1-2\delta.$$
Dividing by T and noting that for any player $i\in\{1,2\}$
$$\frac{1}{T}\sum_{t=1}^{T}U_i(\pi^*,\pi_{-i}^t) = U_i\Bigl(\pi^*,\frac{1}{T}\sum_{t=1}^{T}\pi_{-i}^t\Bigr) = U_i\bigl(\pi^*,\bar\pi_{-i}^T\bigr)$$
further yields
$$\mathbb{P}\Biggl[\max_{\pi_1^*\in\Delta(A_1)}U_1(\pi_1^*,\bar\pi_2^T) + \max_{\pi_2^*\in\Delta(A_2)}U_2(\bar\pi_1^T,\pi_2^*) - \frac{1}{T}\sum_{t=1}^{T}\bigl(U_1(\pi_1^t,\pi_2^t)+U_2(\pi_1^t,\pi_2^t)\bigr) \le \frac{R_1^T+R_2^T}{T}+\bigl(\max_i D_i\bigr)\sqrt{\frac{32}{T}\log\frac{\max_i|A_i|}{\delta}}\Biggr] \ge 1-2\delta. \tag{23}$$
We now analyze the term
$$(\clubsuit) := -\frac{1}{T}\sum_{t=1}^{T}\bigl(U_1(\pi_1^t,\pi_2^t)+U_2(\pi_1^t,\pi_2^t)\bigr).$$
Plugging in the definitions of $U_1$ and $U_2$, that is,
$$U_1(\pi_1,\pi_2) := u_1(\pi_1,\pi_2) - \lambda_1 D_{\mathrm{KL}}(\pi_1\,\|\,\tau_1), \qquad U_2(\pi_1,\pi_2) := u_2(\pi_1,\pi_2) - \lambda_2 D_{\mathrm{KL}}(\pi_2\,\|\,\tau_2) = -u_1(\pi_1,\pi_2) - \lambda_2 D_{\mathrm{KL}}(\pi_2\,\|\,\tau_2),$$
into $(\clubsuit)$ yields
$$(\clubsuit) = \frac{1}{T}\sum_{t=1}^{T}\bigl(\lambda_1 D_{\mathrm{KL}}(\pi_1^t\,\|\,\tau_1) + \lambda_2 D_{\mathrm{KL}}(\pi_2^t\,\|\,\tau_2)\bigr) \;\ge\; \lambda_1 D_{\mathrm{KL}}(\bar\pi_1^T\,\|\,\tau_1) + \lambda_2 D_{\mathrm{KL}}(\bar\pi_2^T\,\|\,\tau_2) \tag{24}$$
$$= -\bigl(U_1(\bar\pi_1^T,\bar\pi_2^T) + U_2(\bar\pi_1^T,\bar\pi_2^T)\bigr), \tag{25}$$
where (24) follows from convexity of the KL divergence, and (25) follows again from the definition of $U_1$ and $U_2$. Substituting (25) back into (23), we find
$$\mathbb{P}\Biggl[\Bigl(\max_{\pi_1^*\in\Delta(A_1)}U_1(\pi_1^*,\bar\pi_2^T) - U_1(\bar\pi_1^T,\bar\pi_2^T)\Bigr) + \Bigl(\max_{\pi_2^*\in\Delta(A_2)}U_2(\bar\pi_1^T,\pi_2^*) - U_2(\bar\pi_1^T,\bar\pi_2^T)\Bigr) \le \frac{R_1^T+R_2^T}{T}+\bigl(\max_i D_i\bigr)\sqrt{\frac{32}{T}\log\frac{\max_i|A_i|}{\delta}}\Biggr] \ge 1-2\delta. \tag{26}$$
Since
$$\max_{\pi_1^*\in\Delta(A_1)}U_1(\pi_1^*,\bar\pi_2^T) - U_1(\bar\pi_1^T,\bar\pi_2^T) \ge 0 \quad\text{and}\quad \max_{\pi_2^*\in\Delta(A_2)}U_2(\bar\pi_1^T,\pi_2^*) - U_2(\bar\pi_1^T,\bar\pi_2^T) \ge 0,$$
the inequality above in particular implies that
$$\mathbb{P}\Biggl[\max\Bigl\{\max_{\pi_1^*\in\Delta(A_1)}U_1(\pi_1^*,\bar\pi_2^T) - U_1(\bar\pi_1^T,\bar\pi_2^T),\;\max_{\pi_2^*\in\Delta(A_2)}U_2(\bar\pi_1^T,\pi_2^*) - U_2(\bar\pi_1^T,\bar\pi_2^T)\Bigr\} \le \frac{R_1^T+R_2^T}{T}+\bigl(\max_i D_i\bigr)\sqrt{\frac{32}{T}\log\frac{\max_i|A_i|}{\delta}}\Biggr] \ge 1-2\delta, \tag{27}$$
which is equivalent to the statement after making the variable substitution $\delta := \delta'/2$.

In particular, when $\eta\le 1/(\lambda_i\beta_i+2D_i)$ for both players $i\in\{1,2\}$, Proposition 2 implies that the average strategy profile is an $O(1/\sqrt{T})$-approximate Nash equilibrium with respect to the regularized utility functions $U_i$. A standard application of the Borel-Cantelli lemma enables us to convert the high-probability guarantees of Proposition 2 at finite time into almost-sure convergence in the limit. Specifically:

Corollary 1. Let $(\bar\pi_1,\bar\pi_2)$ be any limit point of the average policies $(\bar\pi_1^T,\bar\pi_2^T)$ of the players. Almost surely, $(\bar\pi_1,\bar\pi_2)$ is a Nash equilibrium with respect to the regularized utility functions $U_1,U_2$, respectively.

From there, it is immediate to give guarantees with respect to the original (i.e., unregularized) game, and Theorem 2 follows.

Theorem 2. Let $(\bar\pi_1,\bar\pi_2)$ be any limit point of the average policies $(\bar\pi_1^T,\bar\pi_2^T)$ of the players. Almost surely, $(\bar\pi_1,\bar\pi_2)$ is a $(\max_{i=1,2}\{\lambda_i\beta_i\})$-approximate Nash equilibrium policy with respect to the original utility functions $u_i$, where $\beta_i$ is as defined in (21).

Proof. From Corollary 1, almost surely $(\bar\pi_1,\bar\pi_2)$ is a Nash equilibrium of the regularized game whose players' utilities are $U_1$ and $U_2$, respectively.
Expanding the definition of Nash equilibrium relative to Player 1, we have that
$$\begin{aligned}
0 &= \max_{\pi_1^*\in\Delta(A_1)}\{U_1(\pi_1^*,\bar\pi_2) - U_1(\bar\pi_1,\bar\pi_2)\}\\
&= \max_{\pi_1^*\in\Delta(A_1)}\{u_1(\pi_1^*,\bar\pi_2) - \lambda_1 D_{\mathrm{KL}}(\pi_1^*\,\|\,\tau_1) - u_1(\bar\pi_1,\bar\pi_2) + \lambda_1 D_{\mathrm{KL}}(\bar\pi_1\,\|\,\tau_1)\}\\
&\ge \max_{\pi_1^*\in\Delta(A_1)}\{u_1(\pi_1^*,\bar\pi_2) - u_1(\bar\pi_1,\bar\pi_2) - \lambda_1 D_{\mathrm{KL}}(\pi_1^*\,\|\,\tau_1)\}\\
&= \max_{\pi_1^*\in\Delta(A_1)}\Bigl\{u_1(\pi_1^*,\bar\pi_2) - u_1(\bar\pi_1,\bar\pi_2) - \lambda_1\sum_{a\in A_1}\pi_1^*(a)\log\pi_1^*(a) - \lambda_1\sum_{a\in A_1}\pi_1^*(a)\log\bigl(1/\tau_1(a)\bigr)\Bigr\}\\
&\ge \max_{\pi_1^*\in\Delta(A_1)}\Bigl\{u_1(\pi_1^*,\bar\pi_2) - u_1(\bar\pi_1,\bar\pi_2) - \lambda_1\sum_{a\in A_1}\pi_1^*(a)\log\bigl(1/\tau_1(a)\bigr)\Bigr\}\\
&\ge \max_{\pi_1^*\in\Delta(A_1)}\{u_1(\pi_1^*,\bar\pi_2) - u_1(\bar\pi_1,\bar\pi_2)\} - \lambda_1\beta_1,
\end{aligned}$$
where the first inequality follows since the KL divergence is always nonnegative, the second since the negative entropy function is nonpositive on the simplex, and the third from the definition of $\beta_1$. Symmetrically, for Player 2 we find that
$$0 \;\ge\; \max_{\pi_2^*\in\Delta(A_2)}\{u_2(\bar\pi_1,\pi_2^*) - u_2(\bar\pi_1,\bar\pi_2)\} - \lambda_2\beta_2.$$
Hence, the exploitability of $\bar\pi_1$ is at most $\lambda_1\beta_1$, while the exploitability of $\bar\pi_2$ is at most $\lambda_2\beta_2$, which immediately implies the statement.

B. Illustrations of piKL-Hedge in Blotto

Figure 4. Comparison of piKL and regret matching as a function of λ in Colonel Blotto(10, 3). As λ increases in piKL-Hedge, piKL moves closer to the anchor policy at the cost of increased exploitability. The scale of λ is related to the scale of the payoffs in the game, which are [0, 1] in Blotto.

Colonel Blotto is a famous 2-player simultaneous-action game with a large action space but short, simple rules. In Blotto, each player has c coins to distribute across f fields. The aim is to win the most fields by allocating the player's coins across the fields. A field is won by the player contributing the most coins to that field (and drawn if there is a tie). The winner receives a reward of +1 and the loser receives -1; both receive 0 in the case of a tie.

In Figure 4 we illustrate the key features of piKL-Hedge in Blotto, using a uniform anchor policy for convenience. Incidentally, piKL-Hedge with a uniform anchor policy converges to a quantal response equilibrium (McKelvey & Palfrey, 1995b). piKL-Hedge finds policies that play close to the anchor policy while having low regret, with λ controlling the relative weight of these two desiderata.

C. Human Policy KL-Regularized Search Also Improves Cross-Entropy

Model              Dataset                       (raw model)  cpuct = 10  cpuct = 5  cpuct = 2  cpuct = 1  cpuct = 0.5
Maia1500 (Chess)   Lichess 1500 Rating Bucket    1.476        1.469       1.465      1.470      1.504      1.598
Maia1900 (Chess)   Lichess 1900 Rating Bucket    1.440        1.429       1.422      1.418      1.443      1.529
Our Model (Go)     GoGoD                         1.388        1.362       1.359      1.362      1.391      1.478

Table 3. Cross-entropy predicting human moves in chess and Go using smooth KL optimization post-processing of MCTS with various cpuct. Largest standard error of any value is around 0.0033.

Here we show that not only does policy-regularized search improve top-1 accuracy, it also improves cross-entropy for a reasonable range of parameter values in both chess and Go, as well as performing well in Diplomacy.

For chess and Go, to obtain the policy distribution whose cross-entropy with the human data we compute, we cannot (unlike for top-1 accuracy or for play) directly use the MCTS visit distribution, because occasionally, simply due to discretization, MCTS may give zero visits to the actual move that a human played, resulting in an undefined (i.e. infinite) cross-entropy.
Instead, we leverage the result of Grill et al. (2020) that PUCT-style MCTS with a policy prior can be seen as a discrete approximation to solving a smooth optimization:
$$\arg\max_{\pi}\;\sum_a Q(s,a)\,\pi(s,a) - \lambda\, D_{\mathrm{KL}}(\tau\,\|\,\pi), \qquad \text{where } \lambda = c_{\text{puct}}\,\frac{\sqrt{\sum_a n_a}}{k+\sum_a n_a},$$
where Q is the current value estimate from search for each action, τ is the anchor or prior policy, λ controls the strength of the regularization towards that prior as a function of the number of visits, cpuct is the MCTS exploration coefficient, n_a is the number of times a was explored, and k is an arbitrary constant not affecting the asymptotic results.

We perform MCTS exactly as normal; the only difference is that at the very end, rather than using visit counts, we compute the π optimizing the above objective, using the human imitation-learned anchor policy for τ and the MCTS-estimated Q-values for Q (and using the same unweighted average child value for moves that lack a Q-value estimate due to having zero visits, as described in Appendix F). We use this resulting smooth π as the final policy prediction and compute its cross-entropy with the actual human moves.

Note that Grill et al. (2020) choose k = |A|, the number of legal actions, for mathematical convenience, corresponding to adding one extra visit per possible action. However, since in our experiments we use only 50 visits including the root visit (i.e. Σ_a n_a = 49), and the branching factor can be as large as 362 in Go, adding one extra visit per legal action would greatly overestimate the total number of visits, which in practice gives a less accurate correspondence between MCTS and this smoother regularized solution; we instead choose k = 0.

In Table 3 we show the results. Across roughly the same parameter ranges, the regularized search policy using this smoothed MCTS post-processing achieves lower cross-entropy with human moves than the raw imitation-learned policy without search. This suggests that not only does search improve on the raw imitation-learned policy at pinpointing the top action, it also gives a more accurate model of the overall distribution of likely human actions.

Figure 5. Average cross-entropy of per-unit order prediction in Diplomacy, comparing the human-imitation-learned anchor policy, regret matching, hedge, and piKL-Hedge (λ from 10−5 to 10−1), as a function of pseudo-Elo player rating.

Similarly, in Figure 5, we show that piKL-Hedge achieves better average cross-entropy of unit orders in no-press Diplomacy than unregularized search methods, and matches that of imitation learning at λ = 0.1. This corroborates the results of Section 4.3 and shows that piKL-Hedge provides the same benefits in modeling the overall distribution of human actions as it does in predicting the top move: out-predicting unregularized search (while playing as well or slightly better against human-like opponents), or equaling the prediction quality of imitation learning (while playing much more strongly than IL).
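A minimal Python sketch of this post-processing step, under the assumption (from Grill et al. (2020)) that the maximizer of the regularized objective has the form π(a) = λ·τ(a)/(α - Q(a)) for the normalizing constant α > max_a Q(a), which we find by bisection; the prior τ is assumed strictly positive, and all names are ours.

```python
import numpy as np

def smooth_kl_policy(q, tau, lam, iters=100):
    """Approximately solve max_pi <q, pi> - lam * KL(tau || pi) on the simplex.

    q:   MCTS value estimates Q(s, a) for each action.
    tau: anchor/prior policy (assumed > 0 everywhere, e.g. a softmax prior).
    lam: regularization strength, e.g. c_puct * sqrt(sum(n)) / (k + sum(n)).
    """
    q, tau = np.asarray(q, dtype=float), np.asarray(tau, dtype=float)
    lo = q.max() + 1e-12          # sum of pi blows up as alpha approaches max(q)
    hi = q.max() + lam            # here each term is <= tau(a), so the sum is <= 1
    for _ in range(iters):        # the sum is decreasing in alpha; bisect for sum == 1
        alpha = 0.5 * (lo + hi)
        total = np.sum(lam * tau / (alpha - q))
        if total > 1.0:
            lo = alpha
        else:
            hi = alpha
    pi = lam * tau / (hi - q)
    return pi / pi.sum()          # renormalize away the bisection tolerance
```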
D. More Experiments in Chess and Go

Game    cpuct   MCTS Win% vs raw model (temp = 1)   MCTS Win% vs raw model (temp = 0.3)
Chess   10.0    72.2% ± 1.2%                        62.4% ± 1.3%
Chess   5.0     79.5% ± 1.1%                        72.3% ± 1.2%
Chess   2.0     88.0% ± 0.8%                        86.1% ± 0.9%
Chess   1.0     92.2% ± 0.7%                        92.9% ± 0.6%
Chess   0.5     94.4% ± 0.6%                        94.7% ± 0.5%
Go      10.0    73.2% ± 1.4%                        63.3% ± 1.5%
Go      5.0     80.5% ± 1.3%                        74.1% ± 1.4%
Go      2.0     87.6% ± 1.0%                        85.3% ± 1.1%
Go      1.0     94.6% ± 0.7%                        94.4% ± 0.7%
Go      0.5     96.4% ± 0.6%                        97.0% ± 0.5%

Table 4. Winrate of base model + MCTS vs base model at temperature 1 and 0.3. Base model is Maia1900 in chess, and our GoGoD model in Go. 1000 games per figure, draws count as half a win, ± indicates one standard error. Go uses Japanese rules with 6.5 komi. MCTS greatly improves strength in chess and Go; the smallest cpuct values improve it most.

Model      Predicting   Main Time   Increment   Raw Model Acc %   MCTS Acc %   Acc Gain from MCTS   Approx Stderr
Maia1500   1500         3m          0s          51.9              52.1         0.2                  0.17
Maia1500   1500         5m          0s          52.7              53.2         0.5                  0.16
Maia1500   1500         10m         0s          52.4              53.1         0.7                  0.20
Maia1500   1500         3m          2s          53.1              53.7         0.6                  0.31
Maia1500   1500         5m          3s          52.5              53.4         0.9                  0.35
Maia1500   1500         15m         15s         51.9              52.6         0.7                  0.37
Maia1900   1900         3m          0s          53.0              53.9         0.8                  0.12
Maia1900   1900         5m          0s          53.2              54.5         1.3                  0.17
Maia1900   1900         10m         0s          52.9              54.6         1.7                  0.24
Maia1900   1900         3m          2s          54.1              55.5         1.4                  0.27
Maia1900   1900         5m          3s          53.1              55.1         2.0                  0.51
Maia1900   1900         15m         15s         53.7              55.9         2.2                  0.56

Table 5. Difference in top-1 accuracy between the raw model and MCTS using the best cpuct in predicting human moves for chess players in rating buckets 1500 and 1900 using Maia1500 and Maia1900, split by time control of the games, excluding all time controls with fewer than 100 games. Approx Stderr indicates the rough standard error of the raw accuracy values on that row given the number of games of that time control. Despite the statistical uncertainty on some individual values, overall the improvement of MCTS vs the raw model clearly tends to be larger on games with longer time controls.

This section summarizes the results of a small number of additional experiments in chess and Go. Whereas Figure 1 used a temperature of 1.0 when sampling from the agent policy, we show in Table 4 that MCTS also achieves similar winrates versus the raw model when both are sampled using a much lower temperature of 0.3.

Additionally, we confirm in Go the same result that McIlroy-Young et al. (2020a) reported in chess: current top RL agents are far worse at matching human moves than even just imitation learning. We tested the current top open-source Go program KataGo (Wu, 2020) using its final 6-, 10-, and 15-block models (from https://katagotraining.org), which range from upper-amateur to superhuman level, and found that they achieve accuracies of about 35%, 43%, and 43%, all more than 10% lower than the results in Table 1.

Lastly, in Table 5 we show that in chess the amount of improvement given by MCTS over the raw model in predicting human players on the Lichess test set tends to be larger for games with longer time controls than for shorter time controls, going from about 0.2% to about 0.7% between the shortest and longest time controls for 1500-1599-rated players, and from about 0.8% to around 2.0% for 1900-1999-rated players. This is consistent with the intuition that humans rely more heavily on planning when they have more time to think, increasing the gain from modeling that planning.
E. Baseline Model Architecture and Training for Go

As summarized in Section 3.2, for training baseline imitation-learning models to play on the 19x19 board in Go, our architecture follows the same 20-block, 256-channel residual net described in Silver et al. (2017), except with the addition of squeeze-and-excitation layers at the end of each residual block (Hu et al., 2018). In particular, the following additional operations are inserted just prior to each skip connection that adds the output R of a residual block to the trunk X (a code sketch of this block is given at the end of this appendix):

• Channelwise global average pooling of R from 19 × 19 × 256 channels to 256 channels.
• A fully connected layer including bias from 256 channels to 64 channels.
• A ReLU nonlinearity.
• A fully connected layer including bias from 64 channels to 512 channels, which are split into two vectors S and B of 256 channels each.
• Output R × sigmoid(S) + B to be added back to the trunk X, instead of R as in a normal residual net.

In other words, the final result of the residual block as a whole is ReLU(X + R × sigmoid(S) + B) instead of ReLU(X + R).

Additionally, some games are played with a komi (compensation given to White for playing second) that is not equal to the value of 7.5 used by Silver et al. (2017). Therefore, for the final feature of the input encoding, rather than a binary-valued feature equaling 1 if the player to move is White and 0 if the player to move is Black, we instead use the real-valued feature komi/10 if the player to move is White or -komi/10 if the player to move is Black. We additionally exclude a very tiny number of games with extreme komi values outside the range [-60, 60].

We train using a mini-batch size of 2048, distributed as 8 batches of 256 across 8 GPUs, and train for a total of 64 epochs, roughly 475,000 minibatches, on the GoGoD dataset. We use SGD with momentum 0.9, a weight decay coefficient of 1e-4, and a learning rate schedule of 1e-1, 1e-2, 1e-3, 1e-4 for the first 16, next 16, next 16, and last 16 epochs respectively. We train both the policy and value heads jointly, minimizing the cross-entropy of the policy head with respect to the one-hot move made in the actual game and the MSE of the value head with respect to the game result of -1 or 1, except that, similarly to Silver et al. (2017), we weight the MSE value loss by 0.01 to avoid overfitting of the value head.
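The following minimal PyTorch sketch (our own illustration, not the paper's training code) shows the modified residual block described above; the conv-BN structure of the residual branch follows the standard block of Silver et al. (2017), and tensor shapes assume NCHW layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SqueezeExciteResBlock(nn.Module):
    """Residual block with the squeeze-and-excitation variant described above."""

    def __init__(self, channels: int = 256, squeeze: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.fc1 = nn.Linear(channels, squeeze)        # 256 -> 64
        self.fc2 = nn.Linear(squeeze, 2 * channels)    # 64 -> 512, split into S and B

    def forward(self, x):
        # Ordinary residual branch R.
        r = F.relu(self.bn1(self.conv1(x)))
        r = self.bn2(self.conv2(r))
        # Squeeze: channelwise global average pooling of R to a 256-vector.
        pooled = r.mean(dim=(2, 3))
        s_b = self.fc2(F.relu(self.fc1(pooled)))
        s, b = s_b.chunk(2, dim=1)                     # S and B, 256 channels each
        s = torch.sigmoid(s)[:, :, None, None]
        b = b[:, :, None, None]
        # Excite: ReLU(X + R * sigmoid(S) + B) instead of ReLU(X + R).
        return F.relu(x + r * s + b)
```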
F. MCTS Algorithmic Details

In this appendix, we summarize the details of the version of MCTS used in our experiments, including one often-overlooked detail. We follow a standard MCTS implementation very similar to that of Silver et al. (2017). Each turn, the algorithm builds and expands a game tree over multiple iterations, rooted at the current state for that turn. On each iteration t, MCTS starts at the root and descends the tree by exploring at each state s an action a according to some exploration method. Upon reaching a state s^t not yet explored, it adds s^t to the tree, queries the value function V_i(s^t) for each player i to estimate the total expected future reward, and updates the statistics of all nodes traversed based on V_i(s^t) and any intermediate rewards received. Subsequent iterations begin again from the root. For our work in chess and Go, we follow the convention where win, loss, and draw have reward 1, -1, and 0.

The statistics tracked at the node for each state s where player i is to move include the visit counts N(s, a), which are the number of iterations that reached state s and tried action a, and Q(s, a), the average value of those iterations from the perspective of player i, i.e.
$$Q(s,a) = \frac{1}{N(s,a)}\sum_t \bigl(V_i(s^t) + U_i(s,s^t)\bigr),$$
where the sum ranges only over those iterations t that reached state s and tried action a, and U_i(s, s^t) is the total intermediate reward on the path from s to s^t. When descending the tree, the exploration method is to always select the action
$$\arg\max_a\; Q(s,a) + c_{\text{puct}}\,\tau(s,a)\,\frac{\sqrt{\sum_b N(s,b)}}{N(s,a)+1}, \tag{28}$$
where τ(s, a) is the prior policy probability for action a in state s, and cpuct is a tunable parameter controlling the tradeoff between exploration and exploitation. The final agent policy π is simply proportional to the visit counts at the root, i.e. π(s, a) = N(s, a)/Σ_b N(s, b) where s is the root state, or optionally we may also have π(s, a) ∝ N(s, a)^{1/T} where T is a temperature parameter.

One final often-overlooked detail concerns how to evaluate actions for which there is no Q(s, a) estimate. Since the tree policy depends on the Q(s, a) estimates of the possible actions, there is a nontrivial choice of what Q value to use for an action a that has been tried zero times and therefore never estimated. Since the tree branches exponentially, deeper in the MCTS tree there will always be many actions with zero visits, and so this choice can affect the behavior of MCTS even in the limit of large amounts of search. Unfortunately, the details of this choice have sometimes been left undiscussed and undocumented in major past work, and as a result major MCTS implementations have not standardized on it, variously choosing game-loss (Lai, 2018), the current running average parent Q or a Q-value minus a heuristic offset parameter (Tian, 2019), or many other options.

In our work, we use the equal-weighted average value of all actions at the parent node that have been visited at least once, i.e.
$$\frac{\sum_a Q(s,a)\,\mathbb{I}(N(s,a)>0)}{\sum_a \mathbb{I}(N(s,a)>0)}.$$
This can be viewed as corresponding to a naive prior that the values of actions are i.i.d. draws from an unknown distribution. While not perfect, our choice is simple, parameter-free, behaves in a way that is invariant to any global translation or scaling of the game's rewards, and works reasonably well in practice for our purposes.
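A short Python sketch of the selection rule (28) together with the parent-average fallback for unvisited actions; the node interface (N, Q, prior) is our own illustrative naming, not the paper's code.

```python
import math

def select_action(node, c_puct):
    """PUCT selection (28) with the equal-weighted parent-average Q fallback.

    `node` is assumed to expose dicts node.N[a] (visit counts), node.Q[a]
    (mean values, meaningful only when N[a] > 0), and node.prior[a] (tau).
    """
    sqrt_total = math.sqrt(sum(node.N.values()))
    visited = [a for a in node.N if node.N[a] > 0]
    # Average Q over visited siblings; used in place of Q for zero-visit actions.
    # If nothing has been visited yet we fall back to 0 here (our simplification).
    fallback_q = sum(node.Q[a] for a in visited) / len(visited) if visited else 0.0

    def score(a):
        q = node.Q[a] if node.N[a] > 0 else fallback_q
        return q + c_puct * node.prior[a] * sqrt_total / (node.N[a] + 1)

    return max(node.N, key=score)
```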
G. Brief Description of Diplomacy

We briefly summarize the rules of Diplomacy; see Paquette et al. (2019) for a more detailed description. The board is a map of Europe partitioned into 75 regions, 34 of which are supply centers (SCs) that players compete to control. Players command multiple units and each turn privately issue orders for each unit they own (to hold, move, support another unit, or convoy). These orders are revealed at the same time, thereby making Diplomacy a simultaneous-action game. A player wins the game by controlling a majority (18) of the SCs. A game may also end in a draw if all remaining players agree. In this case, we use the Sum-of-Squares (SoS) scoring system as used in prior works (Paquette et al., 2019; Gray et al., 2020; Bakhtin et al., 2021). If no player wins, SoS defines the score of player i as C_i²/Σ_{i'} C_{i'}², where C_i is the SC count for player i. Diplomacy is specifically designed so that a player is unlikely to achieve victory without help from other players. The full game allows unrestricted private natural-language communication between players each turn prior to choosing orders, but we focus on the simpler no-press variant, in which no such communication is allowed, but modeling how opponents will behave continues to be important.

H. Diplomacy Hyper-parameters

We describe the search parameters used in this work and compare them to those in previous works. Compared to Gray et al. (2020), we use a much less expensive set of search parameters for all results in the main section of this paper (see Table 7). In Table 8, we show that this tuned-down set of parameters slightly reduces piKL-HedgeBot's performance but allows us to draw similar conclusions when comparing against SearchBot (Gray et al., 2020) and supervised learning-based bots from prior works. Additionally, Table 8 also shows that when our agent (piKL-HedgeBot (λ = 10−3)) uses the same search parameters as Gray et al. (2020), it outperforms SearchBot by a large margin.

In our experiments, to compare against DipNet (Paquette et al., 2019) we use the original model checkpoint (DipNet SL from https://github.com/diplomacy/research) and we sample from the policy with temperature 0.1 (as used in prior works). Similarly, to compare against the SearchBot (Gray et al., 2020) agent we use the released checkpoint (the blueprint from https://github.com/facebookresearch/diplomacy_searchbot/releases/tag/1.0) and agent configuration (https://github.com/facebookresearch/diplomacy_searchbot/blob/master/conf/common/agents/searchbot_02_fastbot.prototxt).

The only other search parameter unique and new to our piKL-HedgeBot algorithm is η in Algorithm 1, which we heuristically set on each hedge iteration t to c/(σ√t), where σ is the standard deviation across iterations of the average utility experienced by the agent i being updated. We find c = 10/3 works well for no-press Diplomacy. (A small sketch of this schedule follows Table 7 below.)

Model                               Temperature
DipNet (Paquette et al., 2019)      0.1
DipNet RL (Paquette et al., 2019)   0.1
Blueprint (Gray et al., 2020)       0.1
IL Policy (Ours)                    0.5

Table 6. Sampling temperatures used for the models across prior works. For our imitation learning model (IL Policy), we use a temperature of 0.5 in all experiments to encourage stochasticity.

Parameter                                    (Gray et al., 2020)   Ours
Number of candidate actions (Nc)             50                    30
Max candidate actions per unit               3.5                   3.5
Number of search iterations                  256                   512
Policy sampling temperature for rollouts     0.75                  N/A
Policy sampling top-p                        0.95                  0.95
Rollout length, move phases                  2                     0

Table 7. Search parameters used in Gray et al. (2020) compared to the search parameters used in our work. All experiments in the main body of this work use search settings that are much cheaper to run. We use a rollout length of 0 and Nc = 30 while increasing the number of search iterations to 512.
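A minimal sketch of the heuristic learning-rate schedule described above; variable names are ours, and the paper specifies only the formula η_t = c/(σ√t) with c = 10/3.

```python
import math

def hedge_learning_rate(t, utility_history, c=10.0 / 3.0):
    """Per-iteration learning rate eta_t = c / (sigma * sqrt(t)).

    utility_history: the average utility the updated agent experienced on each
    previous hedge iteration; sigma is its standard deviation. The small floor
    on sigma (to avoid dividing by zero early on) is our own choice.
    """
    n = len(utility_history)
    mean = sum(utility_history) / n
    sigma = math.sqrt(sum((u - mean) ** 2 for u in utility_history) / n)
    sigma = max(sigma, 1e-6)
    return c / (sigma * math.sqrt(t))
```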
1x (rows) vs 6x (columns)            DipNet          DipNet RL       Blueprint       BRBot           SearchBot
DipNet (Paquette et al., 2019)       -               6.7% ± 0.9%     11.6% ± 0.1%    0.1% ± 0.1%     0.7% ± 0.2%
DipNet RL (Paquette et al., 2019)    18.9% ± 1.4%    -               10.5% ± 1.1%    0.1% ± 0.1%     0.6% ± 0.2%
Blueprint (Gray et al., 2020)        20.2% ± 1.3%    7.5% ± 1.0%     -               0.3% ± 0.1%     0.9% ± 0.2%
BRBot (Gray et al., 2020)            67.3% ± 1.0%    43.7% ± 1.0%    69.3% ± 1.7%    -               11.1% ± 1.1%
SearchBot (Gray et al., 2020)        51.1% ± 1.9%    35.2% ± 1.8%    52.7% ± 1.3%    17.2% ± 1.3%    -
piKL-HedgeBot (λ = 0.001)            54.8% ± 1.8%    31.4% ± 1.8%    50.3% ± 1.8%    19.2% ± 1.4%    16.6% ± 1.3%
piKL-HedgeBot (λ = 0.001),
  (Gray et al., 2020) parameters     60.1% ± 1.8%    33.3% ± 1.8%    58.1% ± 1.8%    23.6% ± 1.6%    20.3% ± 1.4%

Table 8. Average SoS scores achieved by the 1x agent (rows) against the 6x agents (columns). This table compares the performance of SearchBot (Gray et al., 2020) and other agents from prior work with piKL-HedgeBot (λ = 10−3), which uses a much cheaper search setting. Using the much cheaper search setting comes at a relatively small cost in performance because we use improved value and policy models (see Appendix I and Appendix H for more details). When using the same parameters as Gray et al. (2020), piKL-HedgeBot (λ = 10−3) significantly outperforms SearchBot under most settings. Note that equal performance would be 1/7 ≈ 14.3%. The ± shows one standard error.

I. Diplomacy Model Architecture and Input Features

Our imitation learning policy model for Diplomacy uses the same transformer-encoder, LSTM-decoder architecture as Bakhtin et al. (2021) for reinforcement learning in Diplomacy, but applied to imitation learning over human games. This architecture also resembles the architecture used by a significant amount of past work (Gray et al., 2020; Anthony et al., 2020; Paquette et al., 2019) but replaces the graph-convolution encoder with a transformer, which we find to produce good results.

Additionally, we slightly modify the input feature encoding relative to Gray et al. (2020), removing a small number of redundant channels and adding channels to indicate the "home centers" of each of the 7 powers (Austria, England, France, etc.), which are the locations where that power is allowed to build new armies or fleets. See Table 9 for the new list of input features. By adding the home centers to the input encoding instead of leaving them implicit, the game becomes entirely equivariant to permutations of those powers: e.g. if one swaps all the units of England and France, and all their centers, and which centers are their home centers, the resulting game is isomorphic to the original except with the two powers renamed. This allows us to augment the training data via equivariant permutations of the seven possible powers in the encoding. Every time we sample a position from the dataset for training, we also choose among all 7-factorial permutations of the powers uniformly at random, and correspondingly permute both the input and output, to reduce overfitting and improve the model's generalization given the limited human data available.

Feature                         Type                                          Number of Channels
Location has unit?              One-hot (army/fleet), or all zero             2
Owner of unit                   One-hot (7 powers), or all zero               7
Buildable, Removable?           Binary                                        2
Location has dislodged unit?    One-hot (army/fleet), or all zero             2
Owner of dislodged unit         One-hot (7 powers), or all zero               7
Area type                       One-hot (land, coast, water)                  3
Supply center owner             One-hot (7 powers or neutral), or all zero    8
Home center                     One-hot (7 powers), or all zero               7

Table 9. Per-location input features used.
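A minimal sketch of the power-permutation idea described above, assuming a hypothetical layout in which a one-hot power feature from Table 9 is exposed as a trailing axis of size 7; the actual training pipeline's tensor layout and output relabeling may differ.

```python
import numpy as np

def permute_power_channels(power_onehots, rng):
    """Relabel the 7 powers by a uniformly random permutation.

    power_onehots: float array [..., 7] holding a one-hot power feature from
    Table 9 (e.g. unit owner, dislodged-unit owner, home center). Returns the
    permuted feature together with the map old_power_id -> new_power_id, so the
    caller can relabel any power-indexed outputs consistently.
    """
    perm = rng.permutation(7)            # new channel j takes old channel perm[j]
    permuted = power_onehots[..., perm]
    old_to_new = np.argsort(perm)        # old power p now lives at index old_to_new[p]
    return permuted, old_to_new
```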
J. Improved Value Model in Diplomacy

For no-press Diplomacy, we note that Gray et al. (2020) observed that their search agent benefits from short rollouts using the trained human policy before applying the human-learned value model to evaluate the position. Doing so appears to result in more accurate evaluations reflecting the likely outcomes from a given game state, which the raw value model may fail to learn sufficiently accurately on the limited human dataset. Since the expectation of the learned value model after a short rollout appears to be better than the learned value model itself, this motivates training a model to directly approximate the former.

In a fashion broadly similar to Silver et al. (2016) generating rollout games to train a more accurate value head for Go, we therefore generated a large stream of data by uniformly sampling positions from the human game dataset for Diplomacy, rolling them forward between 4 and 8 phases of game play via the same rollout settings as Gray et al. (2020) (i.e. policy sampling temperature 0.75, top-p 0.95), and training a new value model to predict the resulting post-rollout value estimate of the old value model. Samples were continuously and asynchronously added to a replay buffer of 10,000 batches, and the buffer was continuously sampled to train the same transformer-based architecture as the human-trained model from Appendix I, initialized with the weights of that model. Training was constrained to never exceed the rate of data generation by more than a factor of 2 (i.e. using each sample twice in expectation) and proceeded for 128,000 mini-batches of 1024 samples each using the ADAM optimizer with a fixed learning rate of 1e-5.
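A schematic of this data-generation loop; the callables passed in are hypothetical stand-ins for the paper's Diplomacy infrastructure, and only the recipe itself is taken from the text above.

```python
import random

def generate_value_targets(sample_position, rollout_fn, old_value_model,
                           replay_buffer, num_samples=1000):
    """Label sampled human positions with the old value model's post-rollout value.

    sample_position(): draws a position uniformly from the human game dataset.
    rollout_fn(state, num_phases, temperature, top_p): rolls the position forward
        with the human policy using the Gray et al. (2020) rollout settings.
    old_value_model(state): the previously trained value model.
    replay_buffer.add(state, target): stores the training pair for the new model.
    """
    for _ in range(num_samples):
        state = sample_position()
        horizon = random.randint(4, 8)                         # roll forward 4-8 phases
        rolled = rollout_fn(state, num_phases=horizon,
                            temperature=0.75, top_p=0.95)
        replay_buffer.add(state, old_value_model(rolled))      # post-rollout value target
```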
K. More Experiments in Diplomacy

This section compiles additional performance results from evaluation games.

K.1. Head-to-Head Performance

In Table 10, we compare the performance of piKL-HedgeBot in 1v6 head-to-head games against the underlying imitation anchor policy, following prior work (Gray et al., 2020; Bakhtin et al., 2021; Paquette et al., 2019; Anthony et al., 2020). We find that the λ = 10−1 policy is substantially stronger than the imitation policy while matching its accuracy in predicting human moves, while the λ = 10−3 policy outperforms unregularized search methods while playing much closer to the human policy.

K.2. piKL-HedgeBot's Performance in Population-Based Experiments

In this section, we provide all the results from the population experiments across various piKL-HedgeBot λ values. Figure 6 and Table 11 show the results. piKL-HedgeBot with λ = 10−3 performs best across individual population experiments, with an SoS score of 32.9%. Performance drops as we continue to increase λ past 10−3. Error bars indicate 1 standard error.

1x: IL Policy
6x Agent                     Average SoS Score of the 1x agent
piKL-HedgeBot (λ = 10−1)     8.3 ± 0.9%
piKL-HedgeBot (λ = 10−2)     2.5 ± 0.4%
piKL-HedgeBot (λ = 10−3)     1.8 ± 0.3%
piKL-HedgeBot (λ = 10−4)     2.1 ± 0.3%
piKL-HedgeBot (λ = 10−5)     1.6 ± 0.2%
HedgeBot                     1.5 ± 0.2%
RMBot                        1.4 ± 0.2%

6x: IL Policy
1x Agent                     Average SoS Score of the 1x agent
piKL-HedgeBot (λ = 10−1)     21.1 ± 1.4%
piKL-HedgeBot (λ = 10−2)     44.2 ± 1.7%
piKL-HedgeBot (λ = 10−3)     52.7 ± 1.7%
piKL-HedgeBot (λ = 10−4)     49.7 ± 1.7%
piKL-HedgeBot (λ = 10−5)     46.9 ± 1.7%
HedgeBot                     46.5 ± 1.7%
RMBot                        46.2 ± 1.7%

Table 10. Average SoS score attained by the 1x agent against the 6x agent. The piKL-HedgeBot (λ = 10−1) policy is substantially stronger than the IL Policy, while the (λ = 10−2) policy is almost as strong as RMBot. The ± shows one standard error. Note that equal performance would be 1/7 ≈ 14.3%.

Figure 6. Average SoS score achieved by piKL-HedgeBots in uniformly sampled pools of other agents as a function of λ. piKL-HedgeBot (λ = 10−3) performs best across individual sweeps with an SoS score of 32.9%.

Agent                               λ = 10−1 pool   λ = 10−2 pool   λ = 10−3 pool   λ = 10−4 pool   λ = 10−5 pool   HedgeBot pool
DipNet (Paquette et al., 2019)      4.9% ± 0.3%     3.8% ± 0.3%     3.7% ± 0.3%     3.5% ± 0.3%     3.6% ± 0.3%     3.6% ± 0.3%
DipNet RL (Paquette et al., 2019)   5.6% ± 0.4%     4.2% ± 0.3%     4.7% ± 0.3%     4.6% ± 0.3%     4.4% ± 0.3%     3.9% ± 0.3%
Blueprint (Gray et al., 2020)       7.1% ± 0.4%     5.8% ± 0.4%     4.9% ± 0.3%     5.7% ± 0.3%     5.3% ± 0.3%     5.5% ± 0.3%
BRBot (Gray et al., 2020)           18.2% ± 0.6%    16.3% ± 0.6%    16.1% ± 0.6%    14.3% ± 0.6%    15.0% ± 0.6%    14.7% ± 0.6%
SearchBot (Gray et al., 2020)       36.1% ± 0.8%    14.1% ± 0.6%    13.4% ± 0.5%    13.6% ± 0.5%    13.0% ± 0.5%    14.1% ± 0.6%
IL Policy                           10.2% ± 0.6%    8.5% ± 0.4%     7.9% ± 0.4%     8.8% ± 0.4%     8.9% ± 0.4%     8.7% ± 0.4%
RMBot                               36.8% ± 1.1%    31.7% ± 0.8%    31.3% ± 0.7%    31.8% ± 0.7%    32.2% ± 0.7%    32.2% ± 0.7%
piKL-HedgeBot / HedgeBot (varied)   15.6% ± 0.6%    29.9% ± 0.7%    32.9% ± 0.7%    31.8% ± 0.7%    31.9% ± 0.7%    31.7% ± 0.7%

Table 11. Average SoS score achieved by agents in uniformly sampled pools of other agents. Each column corresponds to a separate population experiment containing the listed prior agents plus the indicated piKL-HedgeBot (or HedgeBot) variant; the last row reports that variant's own score. piKL-HedgeBot with λ = 10−3 performs best across individual sweeps with an SoS score of 32.9%. The ± shows one standard error.

L. Dec-POMDP Games: Policy-Regularized SPARTA on Hanabi

In this section, we extend KL-regularized search to decentralized partially observable Markov decision processes (Dec-POMDPs) and test it on the Hanabi benchmark (Bard et al., 2020). We first train an imitation learning policy (IL policy) from human data. We then use the IL policy as the blueprint policy in SPARTA (Lerer et al., 2020), a search technique for Dec-POMDPs, and apply KL-regularization toward the IL policy in SPARTA. We call this new algorithm piKL-SPARTA. We show that piKL-SPARTA matches or even slightly improves on the original IL policy in human move prediction accuracy while greatly improving self-play performance.
L.1. Background

A Dec-POMDP is an N-player fully cooperative game with state space S that is partially observed by each player i through their individual observation function o_i = Ω_i(s) for s ∈ S, with joint action space A = A_1 × A_2 × ··· × A_N and transition function T : S × A → P(S) that returns the distribution of the next state given the current state and joint action. The reward function R : S × A → R assigns a scalar reward to the entire team at each time step. A trajectory is denoted as τ^t = (s^0, a^0, r^0, ..., s^t), while the action-observation history (AOH) of each player is defined as τ_i^t = (o_i^0, a^0, r^0, ..., o_i^t). A full trajectory or full AOH that reaches the terminal state may be denoted more simply as τ or τ_i respectively. The policy for each individual player, π_i(a_i^t | τ_i^t), takes as input the AOH and returns a distribution over valid actions. The joint policy π = (π_1, ..., π_N) is a tuple containing all players' policies. The goal is to find a policy maximizing the expected total return,
$$\pi^* = \arg\max_\pi J(\pi) = \arg\max_\pi \mathbb{E}_{\tau\sim P(\tau|\pi)}\, R^0(\tau), \qquad \text{where } R^t(\tau) = \sum_{t'\ge t}\gamma^{(t'-t)}\, r^{t'}$$
is the forward-looking return with optional discount factor γ ≤ 1.

Hanabi is a well-established large-scale Dec-POMDP benchmark (Bard et al., 2020). It is a 2-to-5-player card game with a deck of 50 cards equally divided into 5 color suits. Each color consists of five ranks, with three 1s, two 2s, two 3s, two 4s, and one 5. Each player draws five cards from a randomly shuffled deck to start the game. The goal of the team is to play cards in order of increasing rank from 1 to 5 for every color suit. Players take turns to either play a card, discard a card, or give a hint to another player about their cards. When giving a hint, the acting player picks a receiver of the hint and a rank or color of any card in the receiver's hand. The recipient then learns exactly which cards in their hand match the given rank or color. Hinting costs one information token. The team starts with eight information tokens and can recoup one information token after discarding a card or successfully playing a 5 of any color. If a player makes an invalid play, e.g. playing a red 3 when a red 2 has not been played yet, the team loses one life token. After each play or discard, the player draws a new card if possible. The game ends when three life tokens are lost, in which case the team receives 0 points, or one round after the entire deck is exhausted, in which case the score equals the number of cards successfully played.

The majority of existing works in the Hanabi domain focus on learning human-compatible policies without using human data (Bard et al., 2020; Siu et al., 2021; Hu et al., 2021b). Fewer works have explored modeling human policies using human game data, partly due to the lack of publicly available datasets. To the best of our knowledge, the strongest supervised learning agent trained from human data was that of Hu et al. (2021b), where the authors use it as an unseen test-time partner agent in evaluation to estimate how well their agents might collaborate with humans. No prior work has been done to better predict human moves in Hanabi. A few search techniques, such as SPARTA (Lerer et al., 2020) and RL-search (Fickinger et al., 2021), have been proposed for large-scale Dec-POMDPs and have specifically been applied to Hanabi. In this work we choose SPARTA as our backbone for simplicity, while noting that our KL-regularization methods may also be generalized to other search techniques.
SPARTA is a test-time policy improvement algorithm that can be applied on top of any policy. SPARTA assumes that a blueprint policy (BP) π is common knowledge in the Dec-POMDP and that all players play their part of the blueprint unless they should deviate according to the SPARTA rule. Here we briefly discuss the single-agent variant, where the search agent assumes that the other agents will always play the blueprint. Given blueprint π, SPARTA first defines a belief function that tracks the distribution of the real trajectory given the search agent's own AOH, B_i(τ^t) = P(τ^t | τ_i^t, π). Then the search agent i computes the expected value for each action a using Monte Carlo rollouts:
$$Q_\pi(\tau_i^t, a) = \mathbb{E}_{\tau^t\sim B_i(\tau^t)}\, Q_\pi(\tau^t, a), \tag{29}$$
where
$$Q_\pi(\tau^t, a) = \mathbb{E}_{\tau\sim P(\tau\,|\,T,\ \tau^t,\ a_i^t=a,\ a_{j\ne i}^t\sim\pi,\ a^{t'>t}\sim\pi)}\, R^t(\tau)$$
is the expected forward-looking return on τ^t, assuming that the search agent performs action a for the current step and follows the blueprint afterwards, while the other agents always follow the blueprint. SPARTA computes the belief B_i(τ^t) analytically: it maintains a distribution over all possible trajectories and adjusts it at every step by removing the trajectories that contradict public knowledge or would have led to different joint actions according to the known joint policy π.

L.2. Method

We start with an imitation learning policy (IL policy) trained from human gameplay data collected from an online Hanabi game platform. The IL policy is used both as the baseline for predicting human moves and as the blueprint policy for our piKL-SPARTA. The analytical belief update procedure in SPARTA requires full knowledge of the partners' policy. This is challenging when predicting human moves and playing with humans, because our IL policy models the average behavior of the entire population of players in the training dataset, and a player at test time may perform actions that the IL policy would never take, leading to a null belief space that terminates the search. Therefore, we follow the practice in Hu et al. (2021a) and train an approximate neural network belief model on self-play data generated by the IL policy. Given the IL policy π as blueprint and the approximate belief model B̂_i that replaces B_i in Eq. 29, piKL-SPARTA selects actions for the search player i following
$$P(a) \;\propto\; \pi(a\,|\,\tau_i^t)\cdot\exp\!\left(\frac{Q_\pi(\tau_i^t, a)}{\lambda}\right). \tag{30}$$

L.3. Experimental Setup

We use a similar dataset acquired from en.boardgamearena.com as in Hu et al. (2021b). The dataset consists of 240,954 2-player Hanabi games. We randomly sample 1,000 games to create a validation set and another 4,000 games for the test set. The training set contains the remaining 235,954 games, with an average score of 15.88. Each game records the AOH τ_i, i ∈ {1, 2}, for both players. The IL policy π_θ is parameterized by a neural network θ and is trained to minimize the cross-entropy loss
$$\mathcal{L}(\theta) = -\mathbb{E}_{\tau_i\sim\mathcal{D}}\sum_{t=0}^{T}\log\pi_\theta(a_i^t\,|\,\tau_i^t)$$
with stochastic gradient descent. The training set D is dynamically augmented with color shuffling (Hu et al., 2020), where a random color permutation that changes both the observation and action space is applied to each trajectory τ_i sampled from the dataset before it is fed to the network. Hu et al. (2021b) show that this data augmentation method greatly reduces overfitting and leads to better policies.
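A minimal sketch of the action-selection rule in Eq. (30), assuming the per-action search values Q_π(τ_i^t, a) have already been estimated by rollouts; function and variable names are ours, not the paper's code.

```python
import numpy as np

def pikl_sparta_action(blueprint_probs, q_values, lam, rng, greedy=False):
    """Select an action via Eq. (30): P(a) proportional to pi(a|tau) * exp(Q(a)/lam).

    blueprint_probs: IL (blueprint) policy probabilities over the legal actions.
    q_values: Monte Carlo estimates of Q_pi(tau_i^t, a) for the same actions.
    lam: regularization strength; larger lam stays closer to the blueprint.
    """
    blueprint_probs = np.asarray(blueprint_probs, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    logits = np.log(blueprint_probs + 1e-40) + q_values / lam
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    if greedy:                                   # piKL-SPARTA-G takes the arg max
        return int(np.argmax(probs))
    return int(rng.choice(len(probs), p=probs))
```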
The network θ uses the Public-LSTM structure of Hu et al. (2021a), which eliminates the need to re-unroll the LSTM on sampled trajectories from the beginning of the game, as they share the same public observations as the real trajectory. To construct the approximate belief model B̂, we train a neural network φ to predict each player's own hand, which is the only hidden information in Hanabi. The network predicts each card in hand from oldest to newest auto-regressively. We denote the hidden cards as {h_i^j}, with h_i^1 being the oldest and h_i^m being the newest, m ≤ 5. The cross-entropy loss for φ in an N-player setting is then
$$\mathcal{L}(\phi) = -\mathbb{E}_{\tau\sim\pi_\theta}\left[\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{j=1}^{m}\log P_\phi\bigl(h_i^j\,\big|\,\tau_i^t, h_i^1,\ldots,h_i^{j-1}\bigr)\right].$$
In Hanabi, it is sufficient to reconstruct τ^t given τ_i^t and {h_i^j}. The model is trained on an infinite stream of data generated by π_θ through self-play to avoid overfitting. We train two variants of the belief model, one with sampled actions a ∼ π_θ and the other with greedy actions a = arg max π_θ. They are used by piKL-SPARTA and piKL-SPARTA-G(reedy) respectively.

We first run piKL-SPARTA on the test set to compare its ability to predict human moves against the IL policy. For each τ_i^t in the test set, we sample K/|A_i| hands from the belief model P_φ, where K is the total number of search rollouts for this step and |A_i| is the number of legal actions of the search player. In all our experiments we set K = 10,000. In practice, we sample 2K/|A_i| hands and take the top K/|A_i| samples that do not contradict the public knowledge. If the belief model fails to produce any samples that comply with the public knowledge, we revert back to the blueprint.
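A schematic of the sampling procedure just described, with the belief network and the public-knowledge check abstracted behind callables (our stand-ins, not the paper's code):

```python
def sample_consistent_hands(sample_hand, consistent_with_public_knowledge,
                            k_total, num_legal_actions):
    """Draw 2K/|A_i| candidate hands and keep the first K/|A_i| consistent ones.

    sample_hand(): draws one hand from the learned belief model P_phi.
    consistent_with_public_knowledge(hand): checks the hand against public knowledge.
    Returns None to signal "revert to the blueprint" when nothing is consistent.
    """
    budget = 2 * k_total // num_legal_actions
    keep = k_total // num_legal_actions
    candidates = [sample_hand() for _ in range(budget)]
    consistent = [h for h in candidates if consistent_with_public_knowledge(h)]
    if not consistent:
        return None          # fall back to the blueprint policy for this step
    return consistent[:keep]
```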
Despite the lack of theoretical soundness, we are interested in whether 2p piKL-SPARTA can obtain empirical improvement over the 1p version, since piKL-SPARTA regularizes towards the blueprint and therefore the mismatch between assuming the partner follows the blueprint and the policy they actually play would likely be less severe. Lastly, in Dec-POMDPs, it is common to select actions greedily in self-play instead of sampling from a mixed policy, since in fully cooperative settings there is no need to avoid being deterministic or predictable to an adversary. Therefore, we also experimented with piKL-SPARTA-G(reedy) where every agent, including the baseline IL policy, play according to the arg max of their action distributions at every step, and the belief model for piKL-SPARTA-G is trained on trajectories produced by a greedy IL policy. L.4. Results Figure 7. Top-1 test accuracy and self-play score of IL policy and piKL-SPARTA in Hanabi. Blue dot is the IL policy and green dots are piKL-SPARTA with different λ. The self-play score is evaluated with sampling based 2p piKL-SPARTA and sampling based SL policy. Additional evaluations with more λ and more algorithm variants are presented in Table 12 and Table 13. Error bar is 1 standard error. Table 12 summarizes the results for human move prediction. On the full test set, piKL-SPARTA with λ = 10 and λ = 20 outperform the IL policy slightly, although not statistically confidently, and vastly outperform unregularized SPARTA. We also investigate the prediction accuracy on games with score ≥ 10 or ≥ 20 that more predominantly come from more experienced human players. The improvement of piKL-SPARTA over the IL policy may be larger on these games, although this result is noisy and at best should be considered only mildly suggestive since filtering on final score also introduces other major confounding factors as well. In Table 13, we show the self-play performance piKL-SPARTA and IL policy. The top two rows show the results of the 9 √ Via the relationship λ ≈ cpuct N where N is the number of MCTS iterations Modeling Strong and Human-Like Gameplay with KL-Regularized Search Subset of Test Set λ =0 λ =0.5 λ =1 λ =2 IL Policy All games Games w/score ≥ 10 Games w/score ≥ 20 25.30% ± 0.16% 27.01% ± 0.18% 28.87% ± 0.19% 56.04% ± 0.24% 58.86% ± 0.25% 61.86% ± 0.25% 60.22% ± 0.21% 62.62% ± 0.22% 65.21% ± 0.23% 62.44% ± 0.19% 64.51% ± 0.20% 66.81% ± 0.21% 63.63% ± 0.18% 65.29% ± 0.19% 67.39% ± 0.19% Subset of Test Set λ =5 λ =10 λ =20 λ =50 IL Policy All games Games w/score ≥ 10 Games w/score ≥ 20 63.50% ± 0.18% 65.33% ± 0.19% 67.48% ± 0.19% 63.68% ± 0.18% 65.43% ± 0.19% 67.54% ± 0.19% 63.71% ± 0.18% 65.42% ± 0.19% 67.53% ± 0.19% 63.68% ± 0.18% 65.35% ± 0.19% 67.45% ± 0.19% 63.63% ± 0.18% 65.29% ± 0.19% 67.39% ± 0.19% Table 12. Human prediction accuracy of unregularized SPARTA (λ = 0) and of piKL-SPARTA with different λ and the IL policy on test set. Each row represents their accuracy on a subset filtered by the final score of the games. pikl-SPARTA achieves similar prediction accuracy as IL for most λ and is far more accurate than unregularized SPARTA. 
                     λ = 0          λ = 0.5        λ = 1          λ = 2          λ = 5          λ = 10         IL Policy
1p piKL-SPARTA       20.56 ± 0.05   21.16 ± 0.06   20.02 ± 0.09   17.71 ± 0.12   13.53 ± 0.16   11.04 ± 0.17   8.81 ± 0.15
2p piKL-SPARTA       20.23 ± 0.04   23.11 ± 0.03   22.60 ± 0.03   21.63 ± 0.05   18.65 ± 0.11   15.23 ± 0.15   8.81 ± 0.15
1p piKL-SPARTA-G     22.78 ± 0.03   23.07 ± 0.03   22.76 ± 0.03   22.41 ± 0.04   21.91 ± 0.05   21.45 ± 0.06   19.72 ± 0.10
2p piKL-SPARTA-G     19.98 ± 0.04   23.72 ± 0.02   23.39 ± 0.03   22.93 ± 0.03   22.36 ± 0.03   21.98 ± 0.04   19.72 ± 0.10

Table 13. Performance of unregularized SPARTA (λ = 0), of piKL-SPARTA under different λ, and of the IL policy, evaluated on 4,000 self-play games, reported ± one standard error. In the top two rows, both piKL-SPARTA and the IL policy sample actions according to their action distributions. In the bottom two rows, both algorithms take greedy actions; "-G" is short for "-Greedy". All numbers in each table section use the same learned belief model for fair comparison. At the high λ values that maintain or improve human prediction accuracy, piKL-SPARTA significantly improves playing strength over IL, while at lower λ values piKL-SPARTA outscores unregularized SPARTA. For reference, unregularized greedy 1p SPARTA with exact beliefs rather than learned beliefs gets 23.49 ± 0.02.

In Table 13, we show the self-play performance of piKL-SPARTA and the IL policy. The top two rows show the results of the sampling version, while the bottom two rows show those of the greedy version. The conclusions are consistent across both cases. Both 1p and 2p variants of piKL-SPARTA outscore IL for all λ tested. Together with Table 12, we find that with λ = 10, piKL-SPARTA maintains IL prediction accuracy and greatly outscores IL in self-play, an overall improvement without any tradeoff. For smaller λ = 2 or λ = 5, piKL-SPARTA further outscores IL while losing some prediction accuracy on human moves, but remains vastly more accurate than unregularized SPARTA (λ = 0).

Through qualitative analysis of the games played by both methods, we find that IL makes many mistakes, due both to sampling low-probability actions and to the fact that the training set contains bad moves. piKL-SPARTA, on the other hand, avoids many of these mistakes while still otherwise following the same strategies and human-like signaling conventions. It is particularly good at preventing catastrophic failures where the agents lose all life tokens and all points, since the Q values for good and bad actions differ much more widely in those cases.

The 2p piKL-SPARTA variants, despite being theoretically unsound, result in a further score improvement. Meanwhile, we notice that 2p versions of plain SPARTA (λ = 0) underperform their 1p variants, which is consistent with the observations of Lerer et al. (2020), despite our use of an approximate learned belief model instead of exact beliefs. This suggests that regularization towards the blueprint IL policy is successful in keeping trajectories similar enough to those from the blueprint that the unsound assumption that the partner plays the blueprint does not cause major problems, while still allowing both players to deviate enough to correct major blunders that they would otherwise make.

Lastly, we notice that 1p piKL-SPARTA with λ = 0.5 underperforms the best 2p variants and that its advantage over the unregularized version shrinks relative to the exact-belief baseline. The reason could be that the quality of the samples from the learned belief model worsens as the trajectories it sees at test time become more off-distribution for smaller λ. For reference, we also run the original SPARTA, which does not have this problem as it computes beliefs analytically.
The original SPARTA gets 23.49 ± 0.02, which is better than 1p piKL-SPARTA with λ = 0.5 but worse than 2p piKL-SPARTA with the same λ. The computational cost of 1p SPARTA with exact beliefs and of 2p piKL-SPARTA with approximate learned beliefs is roughly the same, because the exact belief computation accounts for 50% of the entire computation. Therefore, 2p piKL-SPARTA could be a preferable option for improving performance in Dec-POMDPs via self-play without incurring the huge cost of joint SPARTA or RL-search (Fickinger et al., 2021).