ESCHER: ESCHEWING IMPORTANCE SAMPLING IN GAMES BY COMPUTING A HISTORY VALUE FUNCTION TO ESTIMATE REGRET

arXiv:2206.04122v2 [cs.GT] 11 Oct 2022

Stephen McAleer, Carnegie Mellon University, smcaleer@cs.cmu.edu
Gabriele Farina, Carnegie Mellon University, gfarina@cs.cmu.edu
Marc Lanctot, DeepMind, lanctot@deepmind.com
Tuomas Sandholm, Carnegie Mellon University; Strategy Robot, Inc.; Optimized Markets, Inc.; Strategic Machine, Inc., sandholm@cs.cmu.edu

ABSTRACT

Recent techniques for approximating Nash equilibria in very large games leverage neural networks to learn approximately optimal policies (strategies). One promising line of research uses neural networks to approximate counterfactual regret minimization (CFR) or its modern variants. DREAM, the only current CFR-based neural method that is model free and therefore scalable to very large games, trains a neural network on an estimated regret target that can have extremely high variance due to an importance sampling term inherited from Monte Carlo CFR (MCCFR). In this paper we propose an unbiased model-free method that does not require any importance sampling. Our method, ESCHER, is principled and is guaranteed to converge to an approximate Nash equilibrium with high probability. We show that the variance of the estimated regret of ESCHER is orders of magnitude lower than that of DREAM and other baselines. We then show that ESCHER outperforms the prior state of the art, DREAM and neural fictitious self-play (NFSP), on a number of games, and that the difference becomes dramatic as game size increases. In the very large game of dark chess, ESCHER is able to beat DREAM and NFSP in a head-to-head competition over 90% of the time.

1 INTRODUCTION

A core challenge in computational game theory is the problem of learning strategies that approximate Nash equilibrium in very large imperfect-information games such as Starcraft (Vinyals et al., 2019), dark chess (Zhang & Sandholm, 2021), and Stratego (McAleer et al., 2020; Perolat et al., 2022). Due to the size of these games, tabular game-solving algorithms such as counterfactual regret minimization (CFR) are unable to produce such equilibrium strategies. To sidestep this issue, stochastic methods such as Monte Carlo CFR (MCCFR) have been proposed in the past. These methods use computationally inexpensive unbiased estimators of the regret (i.e., utility gradient) of each player, trading off speed for convergence guarantees that hold with high probability rather than in the worst case.

Several unbiased estimation techniques of utility gradients are known. Some, such as external sampling, produce low-variance gradient estimates that are dense, and therefore are prohibitive in the settings mentioned above. Others, such as outcome sampling, produce high-variance estimates that are sparse and can be computed given only the realization of play, and are therefore more appropriate for massive games. However, even outcome-sampling MCCFR is inapplicable in practice. First, since it is a tabular method, it can only update regret on information sets that it has seen during training. In very large games, only a small fraction of all information sets will be seen during training. Therefore, generalization (via neural networks) is necessary. Second, to achieve unbiasedness of the utility gradient estimates, outcome-sampling MCCFR uses importance sampling (specifically, it divides the utility of each terminal state by a reach probability, which is often tiny), leading to estimates with extremely large magnitudes and high variance.
This drawback is especially problematic when MCCFR is implemented using function approximation, as the high variance of the updates can cause instability of the neural network training. Deep CFR (Brown et al., 2019) addresses the first shortcoming above by training a neural network to estimate the regrets cumulated by outcome-sampling MCCFR, but is vulnerable to the second shortcoming, causing the neural network training procedure to be unstable. DREAM (Steinberger et al., 2020) improves on Deep CFR by partially addressing the second shortcoming by using a history-based value function as a baseline (Schmid et al., 2019). This baseline greatly reduces the variance in the updates and is shown to have better performance than simply regressing on the MCCFR updates. However, DREAM still uses importance sampling to remain unbiased. So, while DREAM was shown to work in small artificial poker variants, it is still vulnerable to the high variance of the estimated counterfactual regret and indeed we demonstrate that in games with long horizons and/or large action spaces, this importance sampling term causes DREAM to fail. In this paper, we introduce Eschewing importance Sampling by Computing a History value function to Estimate Regret (ESCHER), a method that is unbiased, low variance, and does not use importance sampling. ESCHER is different from DREAM in two important ways, both of which we show are critical to achieving good performance. First, instead of using a history-dependent value function as a baseline, ESCHER uses one directly as an estimator of the counterfactual value. Second, ESCHER does not multiply estimated counterfactual values by an importance-weighted reach term. To remove the need to weight by the reach to the current information state, ESCHER samples actions from a fixed sampling policy that does not change from one iteration to the next. Since this distribution is static, our fixed sampling policy simply weights certain information sets more than others. When the fixed sampling policy is close to the balanced policy (i.e., one where each leaf is reached with equal probability), these weighting terms minimally affect overall convergence of ESCHER with high probability. We find that ESCHER has orders of magnitude lower variance of its estimated regret. In experiments with a deep learning version of ESCHER on the large games of phantom tic tac toe, dark hex, and dark chess, we find that ESCHER outperforms NFSP and DREAM, and that the performance difference increases to be dramatic as the size of the game increases. Finally, we show through ablations that both differences between ESCHER and DREAM (removing the bootstrapped baseline and removing importance sampling) are necessary in order to get low variance and good performance on large games. 2 BACKGROUND We consider extensive-form games with perfect recall (Osborne & Rubinstein, 1994; Hansen et al., 2004; Kovařı́k et al., 2022). An extensive-form game progresses through a sequence of player actions, and has a world state w ∈ W at each step. In an N -player game, A = A1 × · · · × AN is the space of joint actions for the players. Ai (w) ⊆ Ai denotes the set of legal actions for player i ∈ N = {1, . . . , N } at world state w and a = (a1 , . . . , aN ) ∈ A denotes a joint action. At each world state, after the players choose a joint action, a transition function T (w, a) ∈ ∆W determines the probability distribution of the next world state w0 . 
Upon transition from world state $w$ to $w'$ via joint action $a$, player $i$ makes an observation $o_i = \mathcal{O}_i(w, a, w')$. In each world state $w$, player $i$ receives a utility $\mathcal{U}_i(w)$. The game ends when the players reach a terminal world state. In this paper, we consider games that are guaranteed to end in a finite number of actions.

A history is a sequence of actions and world states, denoted $h = (w^0, a^0, w^1, a^1, \ldots, w^t)$, where $w^0$ is the known initial world state of the game. $\mathcal{U}_i(h)$ and $\mathcal{A}_i(h)$ are, respectively, the utility and set of legal actions for player $i$ in the last world state of a history $h$. An information set for player $i$, denoted by $s_i$, is a sequence of that player's observations and actions up until that time, $s_i(h) = (a_i^0, o_i^1, a_i^1, \ldots, o_i^t)$. Define the set of all information sets for player $i$ to be $\mathcal{I}_i$. The set of histories that correspond to an information set $s_i$ is denoted $\mathcal{H}(s_i) = \{h : s_i(h) = s_i\}$, and it is assumed that they all share the same set of legal actions $\mathcal{A}_i(s_i(h)) = \mathcal{A}_i(h)$. For simplicity we often drop the subscript $i$ for an information set $s$ when the player is implied.

A player's strategy $\pi_i$ is a function mapping from an information set to a probability distribution over actions. A strategy profile $\pi$ is a tuple $(\pi_1, \ldots, \pi_N)$. All players other than $i$ are denoted $-i$, and their strategies are jointly denoted $\pi_{-i}$. A strategy for a history $h$ is denoted $\pi_i(h) = \pi_i(s_i(h))$ and $\pi(h)$ is the corresponding strategy profile. When a strategy $\pi_i$ is learned through RL, we refer to the learned strategy as a policy.

The expected value (EV) $v_i^{\pi}(h)$ for player $i$ is the expected sum of future utilities for player $i$ in history $h$, when all players play strategy profile $\pi$. The EV for an information set $s_i$ is denoted $v_i^{\pi}(s_i)$ and the EV for the entire game is denoted $v_i(\pi)$. A two-player zero-sum game has $v_1(\pi) + v_2(\pi) = 0$ for all strategy profiles $\pi$. The EV for an action in an information set is denoted $v_i^{\pi}(s_i, a_i)$. A Nash equilibrium (NE) is a strategy profile such that, if all players played their NE strategy, no player could achieve higher EV by deviating from it. Formally, $\pi^*$ is a NE if $v_i(\pi^*) = \max_{\pi_i} v_i(\pi_i, \pi^*_{-i})$ for each player $i$.

The exploitability $e(\pi)$ of a strategy profile $\pi$ is defined as $e(\pi) = \sum_{i \in \mathcal{N}} \max_{\pi_i'} v_i(\pi_i', \pi_{-i})$. A best response (BR) strategy $\mathrm{BR}_i(\pi_{-i})$ for player $i$ to a strategy $\pi_{-i}$ is a strategy that maximally exploits $\pi_{-i}$: $\mathrm{BR}_i(\pi_{-i}) = \arg\max_{\pi_i} v_i(\pi_i, \pi_{-i})$. An $\epsilon$-best response ($\epsilon$-BR) strategy $\mathrm{BR}_i^{\epsilon}(\pi_{-i})$ for player $i$ to a strategy $\pi_{-i}$ is a strategy that is at most $\epsilon$ worse for player $i$ than the best response: $v_i(\mathrm{BR}_i^{\epsilon}(\pi_{-i}), \pi_{-i}) \ge v_i(\mathrm{BR}_i(\pi_{-i}), \pi_{-i}) - \epsilon$. An $\epsilon$-Nash equilibrium ($\epsilon$-NE) is a strategy profile $\pi$ in which, for each player $i$, $\pi_i$ is an $\epsilon$-BR to $\pi_{-i}$.
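To make the exploitability and best-response definitions above concrete, the following is a minimal sketch (ours, not from the paper) in a two-player zero-sum normal-form game, where a best response can be taken over pure actions; the payoff matrix and function names are illustrative only.

```python
# Minimal illustration (ours): exploitability and best response in a two-player
# zero-sum normal-form game. A is the row player's payoff matrix; the column
# player's payoffs are -A, and a best response can be taken over pure actions.
import numpy as np

A = np.array([[0., -1., 1.],    # rock-paper-scissors payoffs for the row player
              [1., 0., -1.],
              [-1., 1., 0.]])

def exploitability(pi_row, pi_col):
    """e(pi) = sum_i max_{pi_i'} v_i(pi_i', pi_-i)."""
    br_value_row = np.max(A @ pi_col)        # best the row player can do against pi_col
    br_value_col = np.max(-(A.T @ pi_row))   # best the column player can do against pi_row
    return br_value_row + br_value_col

uniform = np.ones(3) / 3
print(exploitability(uniform, uniform))                  # 0.0: uniform play is the NE of RPS
print(exploitability(np.array([1., 0., 0.]), uniform))   # 1.0: always playing rock is exploitable
```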
2.1 COUNTERFACTUAL REGRET MINIMIZATION (CFR)

In this section we review the counterfactual regret minimization (CFR) framework. All superhuman poker AIs have used advanced variants of the framework as part of their architectures (Bowling et al., 2015; Brown & Sandholm, 2018; 2019). CFR is also the basis of several reinforcement learning algorithms described in Section 8. We will leverage and extend the CFR framework in the rest of the paper. We will start by reviewing the framework.

Define $\eta^{\pi}(h)$ to be the reach weight of joint policy $\pi$ to reach history $h$, and let $z$ denote a terminal history. Define $Z$ to be the set of all terminal histories. Define $Z(s) \subseteq Z$ to be the set of terminal histories $z$ that can be reached from information state $s$, and define $z[s]$ to be the unique history $h \in s$ that is a prefix of $z$. Define
$$v_i(\pi, h) = \sum_{z \sqsupseteq h} \eta^{\pi}(h, z)\, u_i(z) \tag{1}$$
to be the expected value under $\pi$ for player $i$ having reached $h$. Note that this value function takes as input the full-information history $h$ and not an information set. Define
$$v_i^c(\pi, s) = \sum_{z \in Z(s)} \eta^{\pi}_{-i}(z[s])\, \eta^{\pi}(z[s], z)\, u_i(z) = \sum_{h \in s} \eta^{\pi}_{-i}(h)\, v_i(\pi, h) \tag{2}$$
to be the counterfactual value for player $i$ at state $s$ under the joint strategy $\pi$. Define the strategy $\pi_{s \to a}$ to be a modified version of $\pi$ where $a$ is played at information set $s$, and the counterfactual state-action value $q_i^c(\pi, s, a) = v_i^c(\pi_{s \to a}, s)$. For any state $s$, strategy $\pi$, and action $a \in A(s)$, one can define a local counterfactual regret for not switching to playing $a$ at $s$ as $r^c(\pi, s, a) = q_i^c(\pi, s, a) - v_i^c(\pi, s)$.

Counterfactual regret minimization (CFR) (Zinkevich et al., 2008a) is a strategy iteration algorithm that produces a sequence of policies $\{\pi_1, \pi_2, \ldots, \pi_T\}$. Each policy $\pi_{t+1}(s)$ is derived directly from a collection of cumulative regrets $R_T(s, a) = \sum_{t=1}^{T} r^c(\pi_t, s, a)$, for all $a \in A(s)$, using regret matching (Hart & Mas-Colell, 2000). In two-player zero-sum games, the average policy $\bar\pi_T$ converges to an approximate Nash equilibrium at a rate of $e(\bar\pi_T) \le O(1/\sqrt{T})$.

2.2 MONTE CARLO COUNTERFACTUAL REGRET MINIMIZATION (MCCFR)

In the standard CFR algorithm, the quantities required to produce new policies in Equations 1 and 2 require full traversals of the game tree to compute exactly. Monte Carlo CFR (Lanctot et al., 2009) is a stochastic version of CFR which instead estimates these quantities. In particular, MCCFR uses a sampling approach which specifies a distribution over blocks $Z_j$ of terminal histories such that $\cup_j Z_j = Z$, the set of terminal histories. Upon sampling a block $j$, a certain sampled counterfactual value $\hat v^c(\pi, s \mid j)$ (defined in detail later in this section) is computed for all prefix histories that occur in $Z_j$. Then, estimated regrets are accumulated and new policies derived as in CFR. The main result is that $\mathbb{E}[\hat v^c(\pi, s \mid j)] = v^c(\pi, s)$, so MCCFR is an unbiased approximation of CFR, and inherits its convergence properties, albeit under a probabilistic guarantee.

Blocks are sampled via a sampling policy $\tilde\pi$, which is commonly a function of the players' joint policy $\pi$. Two sampling variants were defined in the original MCCFR paper: outcome sampling (OS-MCCFR) and external sampling (ES-MCCFR). External sampling samples only the opponent's (and chance's) choices; hence, it requires a forward model of the game to recursively traverse all of the subtrees under the player's actions. Outcome sampling is the most extreme sampling variant, where blocks consist of a single terminal history: it is the only model-free variant of MCCFR compliant with the standard reinforcement learning loop where the agent learns entirely from experience with the environment. The OS-MCCFR counterfactual value estimator, when the opponent samples from their current policy as is commonly done, is given as follows:
$$\hat v_i(\pi, s \mid z) = \frac{\eta^{\pi}_{-i}(z[s])\, \eta^{\pi}(z[s], z)\, u_i(z)}{\eta^{\tilde\pi}(z)} = \frac{1}{\eta^{\tilde\pi_i}(z[s])} \cdot \frac{\eta^{\pi_i}(z[s], z)}{\eta^{\tilde\pi_i}(z[s], z)}\, u_i(z) \tag{3}$$
The importance sampling term that is used to ensure the unbiasedness of the values can have a significant detrimental effect on the convergence rate (Gibson et al., 2012). Variance reduction techniques provide some empirical benefit (Schmid et al., 2019; Davis et al., 2019), but have not been evaluated on games with long trajectories where the importance corrections have their largest impact.
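The following toy numerical sketch (ours, with arbitrary numbers) illustrates the variance problem just described: an importance-weighted estimate that divides by small sampling probabilities is unbiased but can have enormous variance, whereas querying an exact value function has none.

```python
# Toy illustration (ours): variance of an importance-weighted utility estimate
# versus a direct value-function estimate. All numbers are arbitrary stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n_outcomes = 20
target = np.full(n_outcomes, 1.0 / n_outcomes)        # distribution induced by the players' policies
sampling = rng.dirichlet(0.1 * np.ones(n_outcomes))   # skewed sampling policy: some outcomes are rare
utility = rng.uniform(-1, 1, size=n_outcomes)
true_value = target @ utility

samples = rng.choice(n_outcomes, size=100_000, p=sampling)
is_estimates = (target[samples] / sampling[samples]) * utility[samples]  # importance-weighted, unbiased
vf_estimates = np.full_like(is_estimates, true_value)                    # what an exact value function returns

print("true value            ", true_value)
print("IS estimator mean/var ", is_estimates.mean(), is_estimates.var())
print("value-fn mean/var     ", vf_estimates.mean(), vf_estimates.var())  # zero variance by construction
```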
2.3 DEEP COUNTERFACTUAL REGRET MINIMIZATION

Deep CFR (Brown et al., 2019; Steinberger, 2019; Li et al., 2019) is a method that uses neural networks to scale MCCFR to large games. Deep CFR performs external-sampling MCCFR and trains a regret network $R_i(s, a \mid \psi)$ on a replay buffer of information sets and estimated cumulative regrets. The regret network is trained to approximate the cumulative regrets seen so far at that information state. The estimated counterfactual regrets are computed the same way as in MCCFR, namely using importance sampling or external sampling.

2.4 DREAM

DREAM (Steinberger et al., 2020) builds on Deep CFR and approximates OS-MCCFR with deep neural networks. Like Deep CFR, it trains a regret network $R_i(s, a \mid \psi)$ on a replay buffer of information sets and estimated cumulative regrets. Additionally, in order to limit the high variance of OS-MCCFR, DREAM uses a learned history value function $q_i(\pi, h, a \mid \theta)$ as a baseline (Schmid et al., 2019). While the baseline helps remove variance in the estimation of future utility, in order to remain unbiased DREAM must use importance sampling as in OS-MCCFR. We show empirically that the variance of the DREAM estimator of the counterfactual value, although lower than that of OS-MCCFR, will often be quite high, even in small games and with an oracle history value function. This high-variance estimator might make neural network training very difficult. In contrast, ESCHER has no importance sampling term and instead directly uses the learned history value function $q_i(\pi, h, a \mid \theta)$ to estimate regrets.

3 TABULAR ESCHER WITH ORACLE VALUE FUNCTION

In this section we define a tabular version of our proposed algorithm, ESCHER, where we assume oracle access to a value function. While in practice this tabular algorithm will not compare well to existing approaches due to the expense of generating an oracle value function, in this section we show that if we assume access to an oracle value function at no cost, then our tabular method is sound and converges to a Nash equilibrium with high probability. In the next section we introduce our main method, which is a deep version of this tabular method and learns a neural network value function from data.

As shown in Equation 3, the OS-MCCFR estimator can be seen as containing two separate terms. The first term, $1/\eta^{\tilde\pi_i}(z[s])$, ensures that each information set is updated equally often in expectation. The second term, $\eta^{\pi_i}(z[s], z)\, u_i(z) / \eta^{\tilde\pi_i}(z[s], z)$, is an unbiased estimator of the history value $v_i(\pi, z[s])$. In DREAM, the second term is estimated with a bootstrapped baseline to reduce variance. But since the baseline is not perfect in practice, as we show in our ablations, this term still induces high variance, which prevents the regret network from learning effectively. The main idea behind ESCHER is to remove the first reach-weighting term by ensuring that the sampling distribution for the update player remains fixed across iterations, and to replace the second term with a history value function $\hat v_i(\pi, z[s])$.

Similar to OS-MCCFR, the tabular version of ESCHER iteratively updates each player's policy by sampling a single trajectory. When updating a player's strategy, we sample from a fixed distribution (for example, the uniform distribution) for that player and sample from the opponent's current policy for the other player. As in MCCFR, we update the estimated regret for all actions in each information state reached in the trajectory. However, unlike in MCCFR, we do not use the terminal utility to estimate regret but instead use the oracle history value function. This reduces variance in the update because the oracle value function at a given history-action pair will always be the same if the policies are the same.
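Section 3 assumes oracle access to the history value function. As a minimal sketch (ours), such an oracle can be computed by exhaustive recursion on a tiny hand-written game tree; the toy game, data layout, and function names below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (ours): an "oracle" history value function computed by exhaustive
# recursion over a tiny hand-written game tree. Real experiments would use a full
# game implementation; this only makes the oracle assumption of Section 3 concrete.
# Non-terminal node: (acting_player, {action: child}); terminal node: payoff to player 0.
TOY_GAME = (0, {
    "L": (1, {"l": +1.0, "r": -1.0}),
    "R": (1, {"l": -2.0, "r": +2.0}),
})

def value(node, policies, player=0):
    """v_i(pi, h): expected utility for `player` when all players follow `policies` from h."""
    if not isinstance(node, tuple):                 # terminal history: zero-sum payoff
        return node if player == 0 else -node
    acting, children = node
    pi = policies[acting]
    return sum(pi[a] * value(child, policies, player) for a, child in children.items())

def q_value(node, action, policies, player=0):
    """q_i(pi, h, a): value of playing `action` at h and then following `policies`."""
    _, children = node
    return value(children[action], policies, player)

policies = {0: {"L": 0.5, "R": 0.5}, 1: {"l": 0.7, "r": 0.3}}
print(value(TOY_GAME, policies))           # -0.2 for player 0 under these policies
print(q_value(TOY_GAME, "R", policies))    # -0.8: forcing R is worse than the policy's value
```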
ESCHER samples from the opponent's current strategy when it is their turn, but samples from a fixed strategy that visits every information set with roughly equal probability when it is the update player's turn. As a result, the expected value of the history value is equal to the counterfactual value scaled by a term that weights certain information sets up or down based on the fixed sampling policy.

Formally, define the fixed sampling policy $b_i(s, a)$ to be any policy that remains constant across iterations and puts positive probability on every action. This fixed sampling policy can be one of many distributions, such as one that samples uniformly over available actions at every information set. An interesting open research direction is finding good fixed sampling policies. In this paper, our fixed sampling policy samples uniformly over actions, which is somewhat similar to the robust sampling technique introduced in Li et al. (2019). When updating player $i$, we construct a joint fixed sampling policy $\tilde\pi^i(s, a)$ as
$$\tilde\pi^i(s, a) = \begin{cases} b_i(s, a) & \text{if it is the update player } i\text{'s turn,} \\ \pi_{-i}(s, a) & \text{otherwise.} \end{cases} \tag{4}$$

We use a fixed sampling policy because it allows us to remove any importance sampling in our estimator. Unlike OS-MCCFR, which must divide by the current player's reach probability to remain unbiased, our method simply weights the regrets of certain information states more than others, but total average regret is still guaranteed to converge to zero. To remove the importance sampling term that arises from estimating the future expected value from the terminal utility, we substitute this estimate with the history value function for the ground-truth history in that trajectory. Since we only update regret on information states visited during the trajectory, our estimator is zero on all other information sets. Formally, we define our estimator for the counterfactual regret as follows:
$$\hat r_i(\pi, s, a \mid z) = \begin{cases} q_i(\pi, z[s], a) - v_i(\pi, z[s]) & \text{if } z \in Z(s), \\ 0 & \text{otherwise.} \end{cases} \tag{5}$$
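A minimal sketch (ours) of Equations 4 and 5 in code: the distribution used to sample actions along the trajectory, and the regret estimate computed at a visited infostate from an (oracle or learned) history value function. All names and numbers are illustrative.

```python
# Minimal sketch (ours) of Equations 4 and 5: the joint sampling policy used to
# generate a trajectory, and the regret estimate at a visited infostate computed
# from a history value function q. All names are illustrative.
import numpy as np

def sampling_distribution(update_player, acting_player, legal_actions, current_policy):
    """Equation 4: a fixed (here uniform) policy on the update player's turn,
    the opponent's current policy otherwise."""
    if acting_player == update_player:
        return np.ones(len(legal_actions)) / len(legal_actions)
    return np.array([current_policy[a] for a in legal_actions])

def estimated_regrets(q_values, policy_at_s):
    """Equation 5 for the sampled z: r_hat(s, a) = q(z[s], a) - v(z[s]),
    with v(z[s]) = sum_a pi(a|s) q(z[s], a)."""
    v = float(np.dot(policy_at_s, q_values))
    return q_values - v

q = np.array([0.2, -0.1, 0.5])       # q_i(pi, z[s], a) for the three legal actions
pi_s = np.array([0.6, 0.3, 0.1])     # current policy at infostate s
print(estimated_regrets(q, pi_s))    # regret of each action relative to the policy's value
```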
If we sample from $\tilde\pi^i$ when updating player $i$, then the expected value of our counterfactual regret estimator is:
$$
\mathbb{E}_{z \sim \tilde\pi^i}\big[\hat r_i(\pi, s, a \mid z)\big]
= \sum_{z \in Z} \eta^{\tilde\pi^i}(z)\, \hat r_i(\pi, s, a \mid z)
= \sum_{z \in Z(s)} \eta^{\tilde\pi^i}(z)\, \big[ q_i(\pi, z[s], a) - v_i(\pi, z[s]) \big]
= \sum_{h \in s} \sum_{z \sqsupseteq h} \eta^{\tilde\pi^i}(z)\, \big[ q_i(\pi, h, a) - v_i(\pi, h) \big]
= \sum_{h \in s} \eta^{\tilde\pi^i}(h)\, \big[ q_i(\pi, h, a) - v_i(\pi, h) \big]
= \eta_i^{\tilde\pi^i}(s) \sum_{h \in s} \eta^{\pi}_{-i}(h)\, \big[ q_i(\pi, h, a) - v_i(\pi, h) \big]
= w(s)\, \big[ v_i^c(\pi, s, a) - v_i^c(\pi, s) \big]
= w(s)\, r^c(\pi, s, a), \tag{6}
$$
where for $h, h' \in s$, $\eta_i^{\tilde\pi^i}(h) = \eta_i^{\tilde\pi^i}(h') = \eta_i^{\tilde\pi^i}(s) =: w(s)$ is the reach probability for reaching that infostate for player $i$ via the fixed sampling distribution. Unlike the MCCFR estimator, our estimator has no importance sampling terms, and as a result has much lower variance. When all information sets are visited by the sampling distribution with equal probability, ESCHER is perfectly unbiased.

Algorithm 1: Tabular ESCHER with Oracle Value Function
1: for t = 1, ..., T do
2:   for update player i ∈ {0, 1} do
3:     Sample trajectory τ using sampling distribution π̃^i (Equation 4)
4:     for each state s ∈ τ do
5:       for each action a do
6:         Estimate immediate regret r̂(π, s, a|z) = q_i(π, z[s], a) − v_i(π, z[s])
7:         Update total estimated regret of action a at infostate s: R̂(s, a) = R̂(s, a) + r̂(π, s, a|z)
8:       Update π_i(s, a) via regret matching on the total estimated regret
9: return average policy π̄

Algorithm 2: ESCHER
1: Initialize history value function q
2: Initialize policy π_i for both players
3: for t = 1, ..., T do
4:   Retrain history value function q on data from π
5:   Reinitialize regret networks R_0, R_1
6:   for update player i ∈ {0, 1} do
7:     for P trajectories do
8:       Get trajectory τ using sampling distribution π̃^i (Equation 4)
9:       for each state s ∈ τ do
10:        for each action a do
11:          Estimate immediate counterfactual regret r̂(π, s, a|z) = q_i(π, z[s], a|θ) − Σ_{a′} π_i(s, a′) q_i(π, z[s], a′|θ)
12:        Add (s, r̂(π, s)) to the cumulative regret buffer
13:        Add (s, a′) to the average policy buffer, where a′ is the action taken at state s in trajectory τ
14:    Train regret network R_i on the cumulative regret buffer
15:  Train average policy network π̄_φ on the average policy buffer
16: return average policy network π̄_φ

The correctness of our method is established by the next theorem, whose proof can be found in Appendix 7. As shown in the proof, the regret of our method is bounded with high probability by a term that is inversely proportional to the minimum over information sets $s$ of $w(s)$. Therefore, our theory suggests that the balanced sampling distribution is the optimal sampling distribution, but in practice other sampling distributions might perform better. In our experiments we approximate the balanced distribution with uniform sampling over actions.

Theorem 1. Assume a fixed sampling policy that puts positive probability on every action. For any $p \in (0, 1)$, with probability at least $1 - p$, the regret accumulated by each agent learning using the tabular algorithm ESCHER (Algorithm 1) is upper bounded by $O(\sqrt{T} \cdot \operatorname{poly}\log(1/p))$, where the $O(\cdot)$ notation hides game-dependent and sampling-policy-dependent constants.

In the appendix we extend the analysis to the case of an approximate history value function, and give a bound with an explicit dependence on the magnitude of the approximation error.
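Before turning to the deep version, the following is a short sketch (ours) of the policy update inside Algorithm 1: estimated immediate regrets are accumulated per infostate and the next policy is obtained by regret matching. The data structures are illustrative; a real implementation would also weight the average policy appropriately.

```python
# Sketch (ours) of the policy update in Algorithm 1: accumulate the estimated
# immediate regrets into a table and derive the next policy by regret matching
# (Hart & Mas-Colell, 2000). Data structures and names are illustrative.
from collections import defaultdict
import numpy as np

NUM_ACTIONS = 3
cumulative_regret = defaultdict(lambda: np.zeros(NUM_ACTIONS))   # R_hat(s, a)
cumulative_policy = defaultdict(lambda: np.zeros(NUM_ACTIONS))   # accumulator for the average policy

def regret_matching(regrets):
    """pi(a|s) proportional to the positive part of the cumulative regret."""
    positive = np.maximum(regrets, 0.0)
    total = positive.sum()
    return positive / total if total > 0 else np.full(len(regrets), 1.0 / len(regrets))

def update_infostate(s, estimated_regrets):
    """One visit to infostate s on the sampled trajectory (Algorithm 1, lines 6-8)."""
    cumulative_regret[s] += estimated_regrets          # line 7
    policy = regret_matching(cumulative_regret[s])     # line 8
    cumulative_policy[s] += policy                     # simplified tracking of the average policy (line 9)
    return policy

print(update_infostate("s0", np.array([0.4, -0.1, -0.3])))  # all mass on the only positive-regret action
```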
4 ESCHER

As shown in the previous section, when using an oracle value function, our method minimizes regret in the single-agent case and converges to a Nash equilibrium in the two-player case. In this section we describe our main method, where we learn a history-dependent value function via a neural network.

Game | ESCHER (Ours) | Ablation 1 | Ablation 2 | DREAM
Phantom Tic-Tac-Toe | (2.6 ± 0.1) × 10^-1 | (4.1 ± 0.7) × 10^1 | (1.4 ± 0.4) × 10^7 | (4.6 ± 1.0) × 10^7
Dark Hex 4 | (1.8 ± 0.1) × 10^-1 | (1.3 ± 0.9) × 10^2 | (3.1 ± 1.7) × 10^8 | (2.8 ± 2.0) × 10^8
Dark Hex 5 | (1.3 ± 0.1) × 10^-1 | (3.3 ± 1.6) × 10^2 | (2.0 ± 0.6) × 10^5 | (5.3 ± 3.9) × 10^8

Game | ESCHER (Ours) | Ablation 2 | DREAM | OS-MCCFR
Leduc | (5.3 ± 0.0) × 10^0 | (3.3 ± 0.7) × 10^2 | (2.8 ± 0.0) × 10^2 | (2.2 ± 0.0) × 10^3
Battleship | (1.4 ± 0.0) × 10^0 | (7.1 ± 0.3) × 10^2 | (1.2 ± 0.0) × 10^3 | (2.4 ± 0.0) × 10^3
Liar's Dice | (9.0 ± 0.1) × 10^-1 | (7.8 ± 0.9) × 10^1 | (4.0 ± 0.8) × 10^2 | (1.2 ± 0.1) × 10^3

Table 1: These results track the average variance of the regret estimator of each algorithm over all iterations. The top table shows results of deep algorithms on large games. Ablation 1 is ESCHER but with a bootstrapped baseline, and Ablation 2 is ESCHER but with reach weighting. The bottom table shows tabular versions of the algorithms with oracle value functions on small games. Because ESCHER does not use importance sampling, the variance of its estimator is orders of magnitude smaller than that of the baselines. Colors scale with the ratio to the minimum value in each row, according to a logarithmic color scale (1, 10, 100, ≥ 1000).

Similar to Deep CFR and DREAM, we train a regret network $R_i(s, a \mid \psi)$ over a buffer of information states and targets, where the targets come directly from the learned history value function $q_i(\pi, h, a \mid \theta)$.

Our method is built on Deep CFR. In particular, like Deep CFR, we traverse the game tree and add this experience to replay buffers. The first replay buffer stores information states and instantaneous regret estimates and is used to train a regret network $R_\psi(s, a)$ that estimates the cumulative regret at a given information set. Unlike Deep CFR and DREAM, which use the terminal utility and sampling probabilities from the current trajectory to estimate the value, in ESCHER the instantaneous regret estimates are computed from the current history value function alone:
$$\hat r_i(\pi, s, a \mid z) = q_i(\pi, z[s], a \mid \theta) - \sum_{a'} \pi_i(s, a')\, q_i(\pi, z[s], a' \mid \theta) \tag{7}$$
Similar to Deep CFR, each player's current policy $\pi_i$ is given by performing regret matching on the output of the current regret network $R_i(s, a \mid \psi)$. The second replay buffer stores histories and terminal utilities and is used to train the value network $Q_\theta$ to estimate the expected utility of each player at a given history when both players play their current policies from that point on. Lastly, the third replay buffer stores information states and the actions taken by the policy $\pi$; this data is used to train an average policy network $\bar\pi_\phi$ that approximates the average policy across all iterations. It is this average policy that has no regret and converges to an approximate Nash equilibrium in self-play.

As described in the previous section, the only difference between our tabular method and MCCFR, and between our deep method and Deep OS-MCCFR, is the estimator for the immediate regret. While Deep OS-MCCFR uses an importance-weighted estimate of the counterfactual value computed from the utility of the rollout, we instead simply use the value function to estimate the immediate regret. We describe our algorithm in Algorithm 2.
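As a sketch (ours, not the authors' code), the Equation 7 targets and the regret-matching policy over the regret network's output can be computed with a few tensor operations; shapes and names below are illustrative assumptions.

```python
# Sketch (ours): computing Equation 7 regret targets from a learned history value
# network q(.|theta) and deriving the current policy by regret matching over the
# regret network's output. Shapes and example numbers are illustrative.
import torch

def regret_targets(q_values, policy):
    """Equation 7: r_hat(s, a) = q(z[s], a) - sum_a' pi(a'|s) q(z[s], a')."""
    baseline = (policy * q_values).sum(dim=-1, keepdim=True)
    return q_values - baseline

def policy_from_regret_net(regret_estimates):
    """Regret matching on the regret network output (as in Deep CFR and ESCHER)."""
    positive = torch.clamp(regret_estimates, min=0.0)
    total = positive.sum(dim=-1, keepdim=True)
    uniform = torch.full_like(positive, 1.0 / positive.shape[-1])
    return torch.where(total > 0, positive / total.clamp(min=1e-12), uniform)

# toy batch: two visited infostates, three legal actions each
q = torch.tensor([[0.2, -0.1, 0.5], [1.0, 0.0, -1.0]])                    # q_i(pi, z[s], a | theta)
cum_regret_estimates = torch.tensor([[0.3, 0.0, -0.2], [0.0, 0.0, 0.0]])  # output of R_i(s, a | psi)
pi = policy_from_regret_net(cum_regret_estimates)
targets = regret_targets(q, pi)   # appended to the cumulative regret buffer
print(pi)
print(targets)
```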
5 RESULTS

We compare the variance of the counterfactual value estimates from ESCHER, DREAM, and ablations in Table 1. Variance is computed over the set of all counterfactual regret values estimated in a single iteration. The results in the table are the average of each iteration's variance over the first five iterations of training. The top table shows results of deep algorithms on large games. A summary of the different ablations is given in Table 2.

[Figure 1. Top row ("Experiment: Playing against an opponent that plays uniformly at random"): win rate of ESCHER (Ours), DREAM, and NFSP against a uniformly random opponent on Phantom Tic-Tac-Toe, Dark Hex 5, and Dark Chess, plotted against the number of information sets touched. Bottom row ("Experiment: Playing against another baseline head-to-head"): head-to-head win rate of ESCHER vs NFSP and ESCHER vs DREAM on the same three games.]
Figure 1: ESCHER is competitive with NFSP and DREAM in Phantom Tic-Tac-Toe. But as the size of the game increases, ESCHER performs increasingly better than both DREAM and NFSP. In Dark Chess, ESCHER beats DREAM and NFSP in over 90% of the matches.

Algorithm | History value function | Bootstrapped baseline | Importance sampling
ESCHER (Ours) | ✓ | ✗ | ✗
Ablation 1 | ✓ | ✓ | ✗
Ablation 2 | ✓ | ✗ | ✓
DREAM / VR-MCCFR | ✓ | ✓ | ✓
Deep CFR / OS-MCCFR | ✗ | ✗ | ✓

Table 2: ESCHER is different from DREAM in two important ways. First, ESCHER does not use importance sampling. Second, ESCHER does not estimate counterfactual values for sampled actions via a bootstrapped baseline. Our ablations show that both improvements are necessary.

From this experiment we can see that because ESCHER does not use importance sampling, the variance of its estimator is orders of magnitude smaller than that of the baselines. Also, we see that even Ablation 1, which does not use importance sampling, has high variance. This is because when the history value function is not exact, the bootstrapping method recursively divides by the sampling probability on sampled actions. We see that this higher variance indeed leads to worse performance for DREAM and these ablations in Figure 2. Therefore, both improvements of ESCHER over DREAM are necessary.
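For concreteness, the following is a small sketch (our reading of the metric described at the start of this section, not the authors' code) of how the variance statistic reported in Table 1 can be computed; the input numbers are random stand-ins.

```python
# Sketch (ours) of the Table 1 statistic: the variance of all counterfactual-regret
# estimates produced within one iteration, averaged over the first few iterations.
# The estimates below are random stand-ins for whatever an algorithm produced.
import numpy as np

rng = np.random.default_rng(0)
regret_estimates_per_iteration = [
    rng.normal(size=1000) / rng.uniform(0.01, 1.0, size=1000)  # heavy-tailed stand-in data
    for _ in range(5)
]

avg_variance = np.mean([np.var(estimates) for estimates in regret_estimates_per_iteration])
print(avg_variance)
```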
We compare our method to DREAM and NFSP, the most popular baselines that are also open source, on the games of Phantom Tic-Tac-Toe (TTT), Dark Hex, and Dark Chess. Dark chess is a popular game among humans under the name Fog of War Chess on the website chess.com, and has emerged as a benchmark task (Zhang & Sandholm, 2021). All of these games are similar in that they are imperfect-information versions of perfect-information games played on square boards. Phantom TTT is played on a 3 × 3 board, dark hex 4 is played on a 4 × 4 board, and dark hex 5 is played on a 5 × 5 board. Because these games are large, we are not able to compare exact exploitability, so instead we compare performance through head-to-head evaluation. Results are shown in Figure 1, where the x-axis tracks the number of information sets visited during training. We see that our method is competitive with DREAM and NFSP on Phantom TTT. On the larger game of Dark Hex 5, ESCHER beats DREAM and NFSP head to head and also scores higher against a random opponent. Moving to the largest game of Dark Chess, we see that ESCHER beats DREAM and NFSP head to head over 90% of the time and is also able to beat a random opponent, while DREAM and NFSP are no better than random.

[Figure 2. Top row ("Ablation experiment: Playing against an opponent that plays uniformly at random"): win rate of ESCHER (Ours), Ablation 1, and Ablation 2 against a uniformly random opponent on Phantom Tic-Tac-Toe, Dark Hex 4, and Dark Hex 5, plotted against the number of information sets touched. Bottom row ("Ablation experiment: Playing against another ablation head-to-head"): head-to-head win rate of ESCHER vs Ablation 1 and ESCHER vs Ablation 2 on the same games.]
Figure 2: Ablation study on ESCHER. As summarized in Table 2, "Ablation 1" is ESCHER but with a bootstrapped history-value baseline, while "Ablation 2" is ESCHER but with reach weighting. Since ESCHER with a bootstrapped history-value baseline and reach weighting is equivalent to DREAM, these results show that both changes to DREAM are necessary for ESCHER to work in large games.

6 DISCUSSION: LIMITATIONS AND FUTURE RESEARCH

Our method can be improved in a number of ways. First, it requires two separate updates in one iteration. Perhaps the method could be more efficient if only on-policy data were used. Second, our method, like Deep CFR, DREAM, and NFSP, trains neural networks on large replay buffers of past experience. Unlike RL algorithms such as DQN (Mnih et al., 2015), these replay buffers must record all data ever seen in order to learn an average. This can be a problem when the amount of data required is much larger than the replay buffer memory. Third, we do not use various implementation details that help performance in Deep CFR, such as weighting by the sum of the reach probabilities over all iterations. Finally, our method uses separate data to train the value function. Our method could be made much more efficient by also using the data generated for training the policy to train the value function.

One direction of future research is finding optimal sampling distributions. In our method we use the uniform distribution over actions as our fixed sampling distribution, but this can be far from optimal. In principle, any distribution that remains fixed will guarantee that the method converges with high probability. One possible direction would be to try to estimate the theoretically optimal balanced distribution. Other, less principled, methods such as using the average policy might work well in practice as well (Burch et al., 2012). Another direction is in connecting this work with the reinforcement learning literature.
Similar to reinforcement learning, we learn a Q value and a policy, and there are many techniques from reinforcement learning that are promising to try in this setting. For example, although we learned the value function simply through Monte-Carlo rollouts, one could use bootstrapping-based methods such as TD-λ (Sutton, 1988) and expected SARSA (Rummery & Niranjan, 1994). The policy might be able to be learned via some sort of policy gradient, similar to QPG (Srinivasan et al., 2018), NeuRD (Hennes et al., 2020), and F-FoReL (Perolat et al., 2021). R EFERENCES Bowling, M., Burch, N., Johanson, M., and Tammelin, O. Heads-up limit hold’em poker is solved. Science, 347(6218):145–149, 2015. Brafman, R. I. and Tennenholtz, M. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002. Brown, N. and Sandholm, T. Libratus: The superhuman AI for no-limit poker. In IJCAI, pp. 5226–5228, 2017a. Brown, N. and Sandholm, T. Safe and nested subgame solving for imperfect-information games. Advances in neural information processing systems, 30, 2017b. Brown, N. and Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018. Brown, N. and Sandholm, T. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019. Brown, N., Sandholm, T., and Amos, B. Depth-limited solving for imperfect-information games. Advances in neural information processing systems, 31, 2018. Brown, N., Lerer, A., Gross, S., and Sandholm, T. Deep counterfactual regret minimization. In International Conference on Machine Learning, pp. 793–802, 2019. Brown, N., Bakhtin, A., Lerer, A., and Gong, Q. Combining deep reinforcement learning and search for imperfect-information games. Advances in Neural Information Processing Systems, 33: 17057–17069, 2020. Burch, N., Lanctot, M., Szafron, D., and Gibson, R. Efficient monte carlo counterfactual regret minimization in games with many player actions. Advances in neural information processing systems, 25, 2012. Burch, N., Johanson, M., and Bowling, M. Solving imperfect information games using decomposition. In Twenty-eighth AAAI conference on artificial intelligence, 2014. Daskalakis, C., Foster, D. J., and Golowich, N. Independent policy gradient methods for competitive reinforcement learning. Advances in neural information processing systems, 33:5527–5540, 2020. Davis, T., Schmid, M., and Bowling, M. Low-variance and zero-variance baselines for extensive-form games. CoRR, abs/1907.09633, 2019. URL http://arxiv.org/abs/1907.09633. Ding, D., Wei, C.-Y., Zhang, K., and Jovanović, M. R. Independent policy gradient for large-scale markov potential games: Sharper rates, function approximation, and game-agnostic convergence. arXiv preprint arXiv:2202.04129, 2022. Farina, G., Kroer, C., and Sandholm, T. Online convex optimization for sequential decision processes and extensive-form games. In AAAI Conference on Artificial Intelligence, 2019a. Farina, G., Kroer, C., and Sandholm, T. Regret circuits: Composability of regret minimizers. In International Conference on Machine Learning, 2019b. Farina, G., Kroer, C., and Sandholm, T. Stochastic regret minimization in extensive-form games. In International Conference on Machine Learning, 2020. 10 Feng, X., Slumbers, O., Yang, Y., Wan, Z., Liu, B., McAleer, S., Wen, Y., and Wang, J. Discovering multi-agent auto-curricula in two-player zero-sum games. 
Advances in Neural Information Processing Systems (NeurIPS), 2021. Fox, R., Mcaleer, S. M., Overman, W., and Panageas, I. Independent natural policy gradient always converges in markov potential games. In International Conference on Artificial Intelligence and Statistics, pp. 4414–4425. PMLR, 2022. Fu, H., Liu, W., Wu, S., Wang, Y., Yang, T., Li, K., Xing, J., Li, B., Ma, B., Fu, Q., and Wei, Y. Actor-critic policy optimization in a large-scale imperfect-information game. In Proceedings of the Tenth International Conference on Learning Representations (ICLR), 2022. Gibson, R., Lanctot, M., Burch, N., Szafron, D., and Bowling, M. Generalized sampling and variance in counterfactual regret minimization. In Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (AAAI-12)., pp. 1355–1361, 2012. Gray, J., Lerer, A., Bakhtin, A., and Brown, N. Human-level performance in no-press diplomacy via equilibrium search. In International Conference on Learning Representations, 2020. Gruslys, A., Lanctot, M., Munos, R., Timbers, F., Schmid, M., Perolat, J., Morrill, D., Zambaldi, V., Lespiau, J.-B., Schultz, J., et al. The advantage regret-matching actor-critic. arXiv preprint arXiv:2008.12234, 2020. Hansen, E. A., Bernstein, D. S., and Zilberstein, S. Dynamic programming for partially observable stochastic games. Conference on Artificial Intelligence (AAAI), 2004. Hart, S. and Mas-Colell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000. Heinrich, J. and Silver, D. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121, 2016. Hennes, D., Morrill, D., Omidshafiei, S., Munos, R., Perolat, J., Lanctot, M., Gruslys, A., Lespiau, J.-B., Parmas, P., Duéñez-Guzmán, E., et al. Neural replicator dynamics: Multiagent learning via hedging policy gradients. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pp. 492–501, 2020. Jin, C., Liu, Q., Wang, Y., and Yu, T. V-learning–a simple, efficient, decentralized algorithm for multiagent rl. arXiv preprint arXiv:2110.14555, 2021. Kovařı́k, V., Schmid, M., Burch, N., Bowling, M., and Lisỳ, V. Rethinking formal models of partially observable multiagent decision making. Artificial Intelligence, 303:103645, 2022. Lanctot, M., Waugh, K., Zinkevich, M., and Bowling, M. Monte carlo sampling for regret minimization in extensive games. In Advances in neural information processing systems, pp. 1078–1086, 2009. Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Pérolat, J., Silver, D., and Graepel, T. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017. Lanctot, M., Lockhart, E., Lespiau, J.-B., Zambaldi, V., Upadhyay, S., Pérolat, J., Srinivasan, S., Timbers, F., Tuyls, K., Omidshafiei, S., et al. Openspiel: A framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453, 2019. Leonardos, S., Overman, W., Panageas, I., and Piliouras, G. Global convergence of multi-agent policy gradient in markov potential games. arXiv preprint arXiv:2106.01969, 2021. Li, H., Hu, K., Zhang, S., Qi, Y., and Song, L. Double neural counterfactual regret minimization. In International Conference on Learning Representations, 2019. Li, J., Koyamada, S., Ye, Q., Liu, G., Wang, C., Yang, R., Zhao, L., Qin, T., Liu, T.-Y., and Hon, H.-W. Suphx: Mastering mahjong with deep reinforcement learning. 
arXiv preprint arXiv:2003.13590, 2020. 11 Liang, E., Liaw, R., Nishihara, R., Moritz, P., Fox, R., Goldberg, K., Gonzalez, J., Jordan, M., and Stoica, I. Rllib: Abstractions for distributed reinforcement learning. In International Conference on Machine Learning, pp. 3053–3062, 2018. Liu, W., Li, B., and Togelius, J. Model-free neural counterfactual regret minimization with bootstrap learning. IEEE Transactions on Games, 2022. McAleer, S., Lanier, J., Fox, R., and Baldi, P. Pipeline PSRO: A scalable approach for finding approximate Nash equilibria in large games. In Advances in Neural Information Processing Systems, 2020. McAleer, S., Lanier, J., Baldi, P., and Fox, R. XDO: A double oracle algorithm for extensive-form games. Advances in Neural Information Processing Systems (NeurIPS), 2021. McAleer, S., Wang, K., Lanctot, M., Lanier, J., Baldi, P., and Fox, R. Anytime optimal psro for two-player zero-sum games. arXiv preprint arXiv:2201.07700, 2022. Mguni, D. H., Wu, Y., Du, Y., Yang, Y., Wang, Z., Li, M., Wen, Y., Jennings, J., and Wang, J. Learning in nonzero-sum stochastic games with potentials. In International Conference on Machine Learning, pp. 7688–7699. PMLR, 2021. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. Moravcik, M., Schmid, M., Ha, K., Hladik, M., and Gaukrodger, S. Refining subgames in large imperfect information games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016. Moravčı́k, M., Schmid, M., Burch, N., Lisỳ, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., and Bowling, M. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017. Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M. I., et al. Ray: A distributed framework for emerging {AI} applications. In 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18), pp. 561–577, 2018. Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., et al. A generalized training approach for multiagent learning. In International Conference on Learning Representations, 2019. Osborne, M. J. and Rubinstein, A. A Course in Game Theory. MIT Press, 1994. Perolat, J., Piot, B., and Pietquin, O. Actor-critic fictitious play in simultaneous move multistage games. In International Conference on Artificial Intelligence and Statistics, pp. 919–928. PMLR, 2018. Perolat, J., Munos, R., Lespiau, J.-B., Omidshafiei, S., Rowland, M., Ortega, P., Burch, N., Anthony, T., Balduzzi, D., De Vylder, B., et al. From Poincaré recurrence to convergence in imperfect information games: Finding equilibrium via regularization. In International Conference on Machine Learning, pp. 8525–8535. PMLR, 2021. Perolat, J., de Vylder, B., Hennes, D., Tarassov, E., Strub, F., de Boer, V., Muller, P., Connor, J. T., Burch, N., Anthony, T., et al. Mastering the game of stratego with model-free multiagent reinforcement learning. arXiv preprint arXiv:2206.15378, 2022. Pinto, L., Davidson, J., Sukthankar, R., and Gupta, A. Robust adversarial reinforcement learning. In International Conference on Machine Learning, pp. 2817–2826. PMLR, 2017. Rummery, G. A. and Niranjan, M. On-line Q-learning using connectionist systems, volume 37. 
Citeseer, 1994. 12 Schmid, M., Burch, N., Lanctot, M., Moravcik, M., Kadlec, R., and Bowling, M. Variance reduction in monte carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines. In Proceedings of the The Thirty-Third AAAI Conference on Artificial Intelligence, 2019. Schmid, M., Moravcik, M., Burch, N., Kadlec, R., Davidson, J., Waugh, K., Bard, N., Timbers, F., Lanctot, M., Holland, Z., et al. Player of games. arXiv preprint arXiv:2112.03178, 2021. Serrino, J., Kleiman-Weiner, M., Parkes, D. C., and Tenenbaum, J. Finding friend and foe in multi-agent games. Advances in Neural Information Processing Systems, 32, 2019. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. nature, 550 (7676):354–359, 2017. Srinivasan, S., Lanctot, M., Zambaldi, V., Pérolat, J., Tuyls, K., Munos, R., and Bowling, M. Actorcritic policy optimization in partially observable multiagent environments. Advances in neural information processing systems, 31, 2018. Steinberger, E. Single deep counterfactual regret minimization. arXiv preprint arXiv:1901.07621, 2019. Steinberger, E., Lerer, A., and Brown, N. DREAM: Deep regret minimization with advantage baselines and model-free learning. arXiv preprint arXiv:2006.10410, 2020. Sutton, R. S. Learning to predict by the methods of temporal differences. Machine learning, 3(1): 9–44, 1988. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019. Wei, C.-Y., Hong, Y.-T., and Lu, C.-J. Online reinforcement learning in stochastic games. Advances in Neural Information Processing Systems, 30, 2017. Wurman, P. R., Barrett, S., Kawamoto, K., MacGlashan, J., Subramanian, K., Walsh, T. J., Capobianco, R., Devlic, A., Eckert, F., Fuchs, F., et al. Outracing champion gran turismo drivers with deep reinforcement learning. Nature, 602(7896):223–228, 2022. Xie, Q., Chen, Y., Wang, Z., and Yang, Z. Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. In Conference on learning theory, pp. 3674–3682. PMLR, 2020. Zha, D., Xie, J., Ma, W., Zhang, S., Lian, X., Hu, X., and Liu, J. Douzero: Mastering doudizhu with self-play deep reinforcement learning. In International Conference on Machine Learning, pp. 12333–12344. PMLR, 2021. Zhang, B. and Sandholm, T. Subgame solving without common knowledge. Advances in Neural Information Processing Systems, 34, 2021. Zhang, R., Ren, Z., and Li, N. Gradient play in stochastic games: stationary points, convergence, and sample complexity. arXiv preprint arXiv:2106.00198, 2021. Zinkevich, M., Johanson, M., Bowling, M., and Piccione, C. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems (NeurIPS), 2008a. Zinkevich, M., Johanson, M., Bowling, M., and Piccione, C. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems (NeurIPS), 2008b. 13 7 P ROOFS We start by recalling a central theorem connecting regret to counterfactual regret (see, e.g., Zinkevich et al. (2008b); Farina et al. (2019b;a)). Proposition 1. 
Fix any player $i$, and let
$$R_s^T := \max_{\hat a \in A_s} \sum_{t=1}^{T} r_i^c(\pi^t, s, \hat a) = \max_{\hat a \in A_s} \sum_{t=1}^{T} \big( q_i^c(\pi^t, s, \hat a) - v_i^c(\pi^t, s) \big)$$
be the counterfactual regret accumulated up to time $T$ by the regret minimizer local at each information set $s$. Then, the regret
$$R_i^T := \max_{\hat\pi_i} \sum_{t=1}^{T} \big( v_i(\hat\pi_i, \pi^t_{-i}) - v_i(\pi^t) \big)$$
accumulated by the policies $\pi^t$ on the overall game tree satisfies
$$R_i^T \le \sum_{s} \max\{R_s^T, 0\}.$$

We can now use a modification of the argument by Farina et al. (2020) to bound the degradation of regret due to the use of an estimator of the counterfactual regrets. However, our analysis requires some modifications compared to that of Farina et al. (2020), in that ESCHER introduces estimation at the level of counterfactuals, while the latter paper introduces estimation at the level of the game utilities.

Theorem 1. Assume a fixed sampling policy that puts positive probability on every action. For any $p \in (0, 1)$, with probability at least $1 - p$, the regret accumulated by each agent learning using the tabular algorithm ESCHER (Algorithm 1) is upper bounded by $O(\sqrt{T} \cdot \operatorname{poly}\log(1/p))$, where the $O(\cdot)$ notation hides game-dependent and sampling-policy-dependent constants.

Proof. As shown in Section 3, for any information set $s$ the counterfactual regret estimators $\hat r_i(\pi, s, a \mid h)$ are unbiased up to a time-independent multiplicative factor; specifically, $\mathbb{E}_{h \sim \tilde\pi^i}[\hat r_i(\pi, s, a \mid h)] = w(s)\, r_i(\pi, s, a)$ for all actions $a$ available to player $i$ at world states in $s$. Hence, for each $a \in A_s$ we can construct the martingale difference sequence
$$X_a^t := w(s)\, r_i(\pi^t, s, a) - \hat r_i(\pi^t, s, a \mid h).$$
Clearly, $X_a^t$ is bounded, with $|X_a^t|$ upper bounded by (twice) the range $D$ of payoffs of player $i$. Hence, from the Azuma–Hoeffding inequality, we obtain that for all $p \in (0, 1)$
$$\mathbb{P}\left[ \sum_{t=1}^{T} X_a^t \le 2D \sqrt{2T \log\frac{1}{p}} \right] \ge 1 - p.$$
Using a union bound on the actions, we can therefore write
$$\mathbb{P}\left[ \max_{a} \sum_{t=1}^{T} X_a^t \le 2D \sqrt{2T \log\frac{|A_s|}{p}} \right] \ge 1 - p.$$
The left-hand side in the probability can be expanded as follows:
$$\max_{a} \sum_{t=1}^{T} X_a^t = \max_{a} \left\{ w(s) \sum_{t=1}^{T} r_i(\pi^t, s, a) - \sum_{t=1}^{T} \hat r_i(\pi^t, s, a \mid h) \right\} \ge \max_{a} \left\{ w(s) \sum_{t=1}^{T} r_i(\pi^t, s, a) \right\} - \max_{a} \left\{ \sum_{t=1}^{T} \hat r_i(\pi^t, s, a \mid h) \right\} \ge w(s) R_s^T - |A_s| D \sqrt{T},$$
where the last inequality follows from the fact that the regret cumulated by regret matching (which is run on the regret estimates $\hat r_i$) is upper bounded by $|A_s| D \sqrt{T}$. Hence, we can write
$$\mathbb{P}\left[ w(s) R_s^T - |A_s| D \sqrt{T} \le 2D \sqrt{2T \log\frac{|A_s|}{p}} \right] \ge 1 - p \iff \mathbb{P}\left[ R_s^T \le \frac{D |A_s|}{w(s)} \sqrt{T} + \frac{2D}{w(s)} \sqrt{2T \log\frac{|A_s|}{p}} \right] \ge 1 - p,$$
where in the second step we used the hypothesis that $w(s) > 0$ for all $s$. Since the right-hand side inside of the probability is non-negative, we can further write
$$\mathbb{P}\left[ \max\{R_s^T, 0\} \le \frac{D |A_s|}{w(s)} \sqrt{T} + \frac{2D}{w(s)} \sqrt{2T \log\frac{|A_s|}{p}} \right] \ge 1 - p,$$
valid for every information set $s$. Now, using the known analysis of CFR (Proposition 1), we obtain that the regret accumulated by the ESCHER iterates satisfies
$$R_i^T \le \sum_{s} \max\{0, R_s^T\}.$$
Hence, using a union bound over all information sets $s \in S_i$ of player $i$, we find that
$$\mathbb{P}\left[ R_i^T \le \sum_{s} \left( \frac{D |A_s|}{w(s)} + \frac{2D}{w(s)} \sqrt{2 \log\frac{|A_s| |S_i|}{p}} \right) \sqrt{T} \right] \ge 1 - p \iff \mathbb{P}\left[ R_i^T \le \frac{2D |A_s| |S_i|}{\min_s w(s)} \sqrt{2 \log\frac{|A_s| |S_i|}{p}}\, \sqrt{T} \right] \ge 1 - p,$$
for all $p \in (0, 1)$. Absorbing game-dependent and sampling-policy-dependent constants yields the statements.
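As a quick numerical sanity check (ours) of the concentration step used in the proof, the following simulates bounded i.i.d. differences (a special case of a martingale difference sequence) and verifies empirically that the Azuma–Hoeffding threshold is exceeded with probability at most p.

```python
# Numerical sanity check (ours) of the concentration step: for bounded differences
# |X_t| <= 2D, Azuma-Hoeffding gives P[ sum_t X_t > 2D*sqrt(2T*log(1/p)) ] <= p.
# Here the X_t are i.i.d., which is a special case of a martingale difference sequence.
import numpy as np

rng = np.random.default_rng(0)
D, T, p, n_trials = 1.0, 1000, 0.05, 20_000

X = rng.uniform(-2 * D, 2 * D, size=(n_trials, T))          # zero-mean, bounded by 2D
threshold = 2 * D * np.sqrt(2 * T * np.log(1 / p))
violation_rate = np.mean(X.sum(axis=1) > threshold)
print(f"allowed failure probability: {p}, observed: {violation_rate:.4f}")
```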
7.1 INCORPORATING FUNCTION APPROXIMATION ERRORS

We now adapt the correctness analysis above to take into account errors in the approximation of the history value function by the deep neural network. We can model that error explicitly by modifying (5) to incorporate a history-action-dependent error $\delta(h, a)$ as follows:
$$\hat r_i(\pi, s, a \mid z) = \begin{cases} q_i(\pi, z[s], a) - v_i(\pi, z[s]) + \delta(z[s], a) & \text{if } z \in Z(s), \\ 0 & \text{otherwise.} \end{cases}$$
Repeating the analysis of Section 3, we have
$$\mathbb{E}_{z \sim \tilde\pi^i}\big[\hat r_i(\pi, s, a \mid z)\big] = w(s)\, r^c(\pi, s, a) + w(s) \sum_{h \in s} \eta^{\pi}_{-i}(h)\, \delta(h, a).$$
Propagating the error term throughout the analysis, and assuming that each error term is at most $\epsilon > 0$ in magnitude, we obtain that for each infostate $s$ and $p \in (0, 1)$,
$$\mathbb{P}\left[ R_s^T \le \frac{(D + \epsilon)|A_s|}{w(s)} \sqrt{T} + \epsilon + \frac{2(D + \epsilon)}{w(s)} \sqrt{2T \log\frac{|A_s|}{p}} \right] \ge 1 - p.$$
Again using the union bound across all infostates of player $i$, we obtain
$$\mathbb{P}\left[ R_i^T \le \frac{2(D + \epsilon)|A_s||S_i|}{\min_s w(s)} \sqrt{2 \log\frac{|A_s||S_i|}{p}}\, \sqrt{T} + \epsilon |S_i| \right] \ge 1 - p,$$
showing that errors in the function approximation translate linearly into additive regret overhead.

8 RELATED WORK

Superhuman performance in two-player games usually involves two components: the first focuses on finding a model-free blueprint strategy, which is the setting we focus on in this paper. The second component improves this blueprint online via model-based subgame solving and search (Burch et al., 2014; Moravcik et al., 2016; Brown et al., 2018; 2020; Brown & Sandholm, 2017b; Schmid et al., 2021). This combination of blueprint strategies with subgame solving has led to state-of-the-art performance in Go (Silver et al., 2017), Poker (Brown & Sandholm, 2017a; 2018; Moravčík et al., 2017), Diplomacy (Gray et al., 2020), and The Resistance: Avalon (Serrino et al., 2019). Methods that only use a blueprint have achieved state-of-the-art performance on Starcraft (Vinyals et al., 2019), Gran Turismo (Wurman et al., 2022), DouDizhu (Zha et al., 2021), Mahjong (Li et al., 2020), and Stratego (McAleer et al., 2020; Perolat et al., 2022). Because ESCHER is a method for finding a blueprint, it can be combined with subgame solving and is complementary to these approaches. In the rest of this section we focus on other model-free methods for finding blueprints.

Deep CFR (Brown et al., 2019; Steinberger, 2019) is a general method that trains a neural network on a buffer of counterfactual values. However, Deep CFR uses external sampling, which may be impractical for games with a large branching factor, such as Stratego and Barrage Stratego. DREAM (Steinberger et al., 2020) and ARMAC (Gruslys et al., 2020) are model-free regret-based deep learning approaches. ReCFR (Liu et al., 2022) proposes a bootstrap method for estimating cumulative regrets with neural networks that could potentially be combined with our method.

Neural Fictitious Self-Play (NFSP) (Heinrich & Silver, 2016) approximates fictitious play by progressively training a best response against an average of all past opponent policies using reinforcement learning. The average policy converges to an approximate Nash equilibrium in two-player zero-sum games.

Policy Space Response Oracles (PSRO) (Lanctot et al., 2017; Muller et al., 2019; Feng et al., 2021; McAleer et al., 2022) is another promising method for approximately solving very large games. PSRO maintains a population of reinforcement learning policies and iteratively trains a best response to a mixture of the opponent's population. PSRO is a fundamentally different method than the previously described methods in that in certain games it can be much faster, but in other games it can take exponentially long in the worst case. Neural Extensive-Form Double Oracle (NXDO) (McAleer et al., 2021) combines PSRO with extensive-form game solvers, and could potentially be combined with our method.
There is an emerging literature connecting reinforcement learning to game theory. QPG (Srinivasan et al., 2018) shows that state-conditioned Q-values are related to counterfactual values by a reach-weighted term summed over all histories in an infostate, and proposes an actor-critic algorithm that empirically converges to a NE when the learning rate is annealed. NeuRD (Hennes et al., 2020) and F-FoReL (Perolat et al., 2021) approximate replicator dynamics and follow the regularized leader, respectively, with policy gradients. Actor Critic Hedge (ACH) (Fu et al., 2022) is similar to NeuRD but uses an information-set-based value function. All of these policy-gradient methods do not have theory proving that they converge with high probability in extensive-form games when sampling trajectories from the policy. In practice, they often perform worse than NFSP and DREAM on small games but remain promising approaches for scaling to large games (Perolat et al., 2022).

Robust reinforcement learning (Pinto et al., 2017) seeks to train an RL policy to be robust against an adversarial environment. In future work we will look to apply ESCHER to this setting.

Markov games (or stochastic games) are extensive-form games where the world state information is shared among all players at each timestep, but players take simultaneous actions. Recent literature has shown that reinforcement learning algorithms converge to Nash equilibrium in two-player zero-sum Markov games (Brafman & Tennenholtz, 2002; Wei et al., 2017; Perolat et al., 2018; Xie et al., 2020; Daskalakis et al., 2020; Jin et al., 2021) and in multi-player general-sum Markov potential games (Leonardos et al., 2021; Mguni et al., 2021; Fox et al., 2022; Zhang et al., 2021; Ding et al., 2022).

9 ADDITIONAL EXPERIMENTAL RESULTS

9.1 DESCRIPTION OF GAME INSTANCES

We use OpenSpiel (Lanctot et al., 2019) for all our games. Below we list the parameters used to define each game in OpenSpiel.

Leduc. Game name: leduc_poker. Parameters: {"players": 2}
Battleship. Game name: battleship. Parameters: {"board_width": 2, "board_height": 2, "ship_sizes": "[2]", "ship_values": "[2]", "num_shots": 3, "allow_repeated_shots": False}
Liar's Dice. Game name: liars_dice. Parameters: None
Phantom Tic Tac Toe. Game name: phantom_ttt. Parameters: None
Dark Hex 4. Game name: dark_hex. Parameters: {"board_size": 4}
Dark Hex 5. Game name: dark_hex. Parameters: {"board_size": 5}
Dark Chess. Game name: dark_chess. Parameters: None

9.2 TABULAR EXPERIMENTS

We compare a tabular version of ESCHER with oracle value functions to a tabular version of DREAM with oracle value functions and with OS-MCCFR. We run experiments on Leduc poker, Battleship, and Liar's Dice, and use the implementations from OpenSpiel (Lanctot et al., 2019). We see in Figure 3 on the top row that ESCHER remains competitive with DREAM and OS-MCCFR on these games. On the bottom row we plot the average variance of the regret estimators over all information sets visited over an iteration window for each of these algorithms. While DREAM does improve upon OS-MCCFR, it still has orders of magnitude higher variance than ESCHER. Although this does not matter much in tabular experiments, we conjecture that high regret estimator variance makes neural network training unstable without prohibitively large buffer sizes.

[Figure 3. Panels: (a) Leduc exploitability, (b) Battleship exploitability, (c) Liar's Dice exploitability, (d) Leduc variance, (e) Battleship variance, (f) Liar's Dice variance.]
Figure 3: The tabular version of ESCHER with an oracle value function is competitive with the tabular version of DREAM with an oracle value function and with OS-MCCFR in terms of exploitability (top row). The regret estimator in ESCHER has orders of magnitude lower variance than those of DREAM and OS-MCCFR (bottom row).
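A minimal sketch (ours, assuming a recent OpenSpiel installation) of loading the game instances listed in Section 9.1 and computing the exploitability of a uniform tabular policy on a small game, as a starting point for reproducing the tabular experiments; the API calls reflect OpenSpiel's Python interface as we understand it.

```python
# Sketch (ours, assumes OpenSpiel is installed): load a subset of the game instances
# from Section 9.1 and compute the exploitability of a uniform policy on a small game.
import pyspiel
from open_spiel.python import policy
from open_spiel.python.algorithms import exploitability

GAMES = {
    "Leduc": ("leduc_poker", {"players": 2}),
    "Battleship": ("battleship", {"board_width": 2, "board_height": 2, "ship_sizes": "[2]",
                                  "ship_values": "[2]", "num_shots": 3, "allow_repeated_shots": False}),
    "Liar's Dice": ("liars_dice", {}),
    "Dark Hex 5": ("dark_hex", {"board_size": 5}),
}

for name, (game_name, params) in GAMES.items():
    game = pyspiel.load_game(game_name, params) if params else pyspiel.load_game(game_name)
    print(name, "loaded:", game.num_distinct_actions(), "distinct actions")

# Exploitability of the uniform policy on a small game (TabularPolicy is uniform by default).
leduc = pyspiel.load_game("leduc_poker", {"players": 2})
print("uniform-policy exploitability on Leduc:",
      exploitability.exploitability(leduc, policy.TabularPolicy(leduc)))
```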
10 HYPERPARAMETERS FOR DEEP EXPERIMENTS

For all deep experiments we first did a hyperparameter search that starts from good hyperparameters for flop hold 'em poker. We report the final hyperparameters used for the deep experiments. As described in the Dark Chess section, DREAM experiments on dark chess used 500 batches for advantage and average training to avoid running out of memory.

10.1 ESCHER

Parameter | Value
n regret network traversals | 1,000
n history value network traversals | 1,000
batch size regret network | 2048
batch size history value network | 2048
train steps regret network | 5,000
train steps history value network | 5,000
train steps average policy network | 10,000

Table 3: ESCHER and ablations hyperparameters for Phantom TTT, Dark Hex 4, Dark Hex 5.

Parameter | Value
n regret network traversals | 1,000
n history value network traversals | 1,000
batch size regret network | 2048
batch size history value network | 2048
train steps regret network | 500
train steps history value network | 500
train steps average policy network | 10,000

Table 4: ESCHER and ablations hyperparameters for the variance experiments.

When computing the value function, we add random noise to the current policy to induce coverage over all information sets. To do this, we added 0.01 times a uniform distribution to the current policy and renormalized.

10.2 DREAM

We use the codebase from the original DREAM paper (Steinberger et al., 2020) with a wrapper to integrate with OpenSpiel (Lanctot et al., 2019) and RLlib (Liang et al., 2018). Where not otherwise specified, we use default parameters from the DREAM codebase.

Parameter | Value
n batches adv training | 4,000
n traversals per iter | 1,000
n batches per iter baseline | 1,000
periodic restart | 10
max n las sync simultaneously | 12
mini batch size adv | 10,000
max buffer size adv | 2,000,000
mini batch size avrg | 10,000
max buffer size avrg | 2,000,000
batch size baseline | 2048
n batches avrg training | 4000

Table 5: DREAM hyperparameters for Phantom TTT, Dark Hex 4, Dark Hex 5.

10.3 NFSP

We use our own implementation of NFSP that uses RLlib's (Liang et al., 2018) DQN implementation and outperforms the original paper's results on Leduc poker.

Parameter | Value
circular buffer size | 2e5
total rollout experience gathered each iter | 1024 steps
learning rate | 0.01
batch size | 4096
TD-error loss type | MSE
target network update frequency | every 10,000 steps
RL learner params | DDQN
anticipatory param | 0.1
avg policy reservoir buffer size | 2e6
avg policy learning starts after | 16,000 steps
avg policy learning rate | 0.1
avg policy batch size | 4096
avg policy optimizer | SGD

Table 6: NFSP hyperparameters for Phantom TTT, Dark Hex 4, Dark Hex 5.

10.4 DARK CHESS

Hyperparameters are the same as in the other deep experiments (described above), except that DREAM experiments on dark chess used 500 batches for advantage and average training to avoid running out of memory. For these experiments, only the current observation was passed to the network for each method. As a result, we cannot expect these algorithms to learn a strong strategy on dark chess, but it is still a fair comparison. In future work we plan to do more engineering to include more information in the networks.
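A tiny sketch (ours) of the exploration noise described in Section 10.1: mix 0.01 of a uniform distribution into the current policy and renormalize so that every action, and hence every information set, has positive probability under the sampling used to train the value function.

```python
# Sketch (ours) of the exploration noise from Section 10.1: add 0.01 times a
# uniform distribution to the current policy and renormalize, so that the
# value-function training data covers all information sets.
import numpy as np

def add_uniform_noise(policy_probs, weight=0.01):
    noisy = np.asarray(policy_probs, dtype=float) + weight * (1.0 / len(policy_probs))
    return noisy / noisy.sum()

print(add_uniform_noise([1.0, 0.0, 0.0]))  # every action now has positive probability
```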
11 CODE

We will include a GitHub link to our open-source code under the MIT license in the deanonymized version of this work. Our code is built on top of the OpenSpiel (Lanctot et al., 2019) and Ray (Moritz et al., 2018) frameworks, both of which are open source and available under the Apache-2.0 license.