Published as a conference paper at ICLR 2024

THE UPDATE-EQUIVALENCE FRAMEWORK FOR DECISION-TIME PLANNING

Samuel Sokota†1   Gabriele Farina†2   David J. Wu†   Hengyuan Hu3   Kevin A. Wang†4   J. Zico Kolter1,5   Noam Brown†6
† Work done at Meta AI
1 Carnegie Mellon University   2 Massachusetts Institute of Technology   3 Stanford University   4 Brown University   5 Bosch AI   6 OpenAI
ssokota@andrew.cmu.edu   gfarina@mit.edu

ABSTRACT

The process of revising (or constructing) a policy at execution time—known as decision-time planning—has been key to achieving superhuman performance in perfect-information games like chess and Go. A recent line of work has extended decision-time planning to imperfect-information games, leading to superhuman performance in poker. However, these methods involve solving subgames whose sizes grow quickly in the amount of non-public information, making them unhelpful when the amount of non-public information is large. Motivated by this issue, we introduce an alternative framework for decision-time planning that is not based on solving subgames, but rather on update equivalence. In this update-equivalence framework, decision-time planning algorithms replicate the updates of last-iterate algorithms, which need not rely on public information. This facilitates scalability to games with large amounts of non-public information. Using this framework, we derive a provably sound search algorithm for fully cooperative games based on mirror descent and a search algorithm for adversarial games based on magnetic mirror descent. We validate the performance of these algorithms in cooperative and adversarial domains, notably in Hanabi, the standard benchmark for search in fully cooperative imperfect-information games. Here, our mirror descent approach exceeds or matches the performance of public-information-based search while using two orders of magnitude less search time. This is the first instance of a non-public-information-based algorithm outperforming public-information-based approaches in a domain they have historically dominated.

1 INTRODUCTION

Decision-time planning (DTP) is the process of revising (or even constructing from scratch) a policy immediately before using that policy to make a decision. In settings involving strategic decision making, the benefits of DTP can be quite large. For example, while decision-time planning approaches have achieved superhuman performance in chess (Campbell et al., 2002), Go (Silver et al., 2016), and poker (Moravčík et al., 2017; Brown and Sandholm, 2018; 2019), approaches without decision-time planning remain non-competitive.

Currently, the dominant paradigm for DTP is based on solving (or improving the policy as much as possible in) subgames. In perfect-information games, the subgame at a state is naturally defined as a game beginning from that state that proceeds according to the same rules as the original game. In contrast, the definition of subgame is nuanced in imperfect-information games, due to the presence of private information. While multiple definitions have been proposed, all those that facilitate sound guarantees (Nayyar et al., 2013; Dibangoye et al., 2013; Oliehoek, 2013; Burch et al., 2014; Moravcik et al., 2016; Brown and Sandholm, 2017; Moravčík et al., 2017; Brown et al., 2020; Sokota et al., 2023b) rely on variants of a distribution known as a public belief state (PBS)—i.e., the joint posterior over each player's decision point, given public information and the joint policy that has been played so far. (Different literatures use different names to refer to this distribution and its variants.)
Unfortunately, PBS-based planning has a fundamental limitation: it is ineffective in settings with large amounts of non-public information. This shortcoming arises because PBS-based DTP distributes its computational budget across all decision points supported by the PBS and because the number of such decision points increases with the amount of non-public information in the game. When the amount of non-public information is small, such as in Texas Hold'em, where there are only (52 choose 2) = 1326 possible private hands, this is a non-issue because it is feasible to construct strong policies for decision points supported by the PBS. However, as the amount of non-public information grows, meaningfully improving the policy at all such decision points becomes increasingly infeasible. For example, in games where the only public knowledge is the amount of time that has passed, PBS-based subgame solving requires reasoning about every decision point consistent with the time step. In the worst case—games with no public information at all—PBS-based subgame solving requires solving the entire game, defeating the entire purpose of DTP.

In this work, motivated by the insufficiency of PBS-based conceptualizations, we investigate an alternative perspective on DTP that we call the update-equivalence framework for DTP. Rather than viewing DTP algorithms as solving subgames, the update-equivalence framework views them as implementing the updates of last-iterate algorithms—a term that we use to refer to algorithms whose last iterates possess desirable properties, such as monotonic improvements in expected return, decreases in exploitability, or convergence to an equilibrium point. (For example, this definition of last-iterate algorithm would include policy iteration in Markov decision processes, since its last iterate converges to the optimal policy, but exclude algorithms for imperfect-information games possessing only average-iterate guarantees, such as counterfactual regret minimization (Zinkevich et al., 2007) and fictitious play (Brown, 1951).) This framework offers a simple mechanism for generating and analyzing principled DTP algorithms, contrasting with PBS-based algorithms and analyses, which are notorious, especially in adversarial games, for their reconditeness. Furthermore, and importantly, because these last-iterate algorithms need not involve PBSs, the update-equivalence framework for DTP is not inherently limited by the amount of non-public information. Indeed, the most natural instances of the framework focus their entire computation budget on the decision point that is actually occupied by the planning agent, entirely ignoring counterfactual decision points supported by the PBS.

We invoke the update-equivalence framework to derive a DTP algorithm based on mirror descent (Nemirovsky and Yudin, 1983; Beck and Teboulle, 2003). We prove that this algorithm, which we call mirror descent search (MDS), guarantees policy improvement in fully cooperative games. We test MDS in Hanabi (Bard et al., 2020), the standard benchmark for search in fully cooperative imperfect-information games. Remarkably, MDS exceeds or matches the performance of state-of-the-art PBS methods while using two orders of magnitude less search time. This is the first instance of a non-public-information-based algorithm outperforming public-information-based approaches in a setting they have historically dominated. Furthermore, we show that the performance gap between MDS and PBS methods widens as the amount of non-public information increases, reflecting MDS's superior scalability properties. In addition to MDS, we also introduce a search algorithm for adversarial settings based on magnetic mirror descent (MMD) (Sokota et al., 2023a) that we call MMD search (MMDS).
We show empirically that MMDS significantly improves performance in games where PBS methods are essentially inapplicable due to the large amount of non-public information.

2 BACKGROUND AND NOTATION

We use the formalism of finite-horizon partially observable stochastic games, or POSGs for short (Hansen et al., 2004). POSGs are a class of games equivalent to perfect-recall timeable extensive-form games (Jakobsen et al., 2016; Kovařík et al., 2022). To describe finite-horizon POSGs, we use s ∈ S to notate Markov states, a_i ∈ A_i to notate player i's actions, o_i ∈ O_i to notate player i's observations, h_i ∈ H_i = ∪_t (O_i × A_i)^t × O_i to notate player i's decision points, a ∈ A = ×_i A_i to notate joint actions, h ∈ H = ×_i H_i to notate histories, R_i : S × A → R to notate player i's reward function, T : S × A → ∆(S) to notate the transition function, and O_i : S × A → O_i to notate player i's observation function.

Each player i interacts with the game via a policy π_i, which maps decision points to distributions over actions, π_i : H_i → ∆(A_i). Given a joint policy π = (π_i)_i, we use the notation P_π to notate probability functions associated with the joint policy π. For example, P_π(h | h_i) denotes the probability of history h under joint policy π given that player i occupies decision point h_i.

The objective of player i is to maximize its expected return J_i(π) := E_π[G], where the return is defined as the sum of rewards G := Σ_t R_i(S^t, A^t) and where we use capital letters to denote random variables. (We maintain the convention of using capital letters for random variables throughout the paper.) If the return for each player is the same, we say the game is common payoff. If there are two players and the returns of these players are the opposite of one another, we say the game is two-player zero-sum (2p0s).

We notate the expected value for an agent at a history h^t under joint policy π as

    v_i^π(h^t) = E_π[G^{≥t} | h^t] = E_π[ Σ_{t′≥t} R_i(S^{t′}, A^{t′}) | h^t ].

The expected action value for an agent at a history h^t taking action a_i^t under joint policy π is denoted

    q_i^π(h^t, a_i^t) = E_π[G^{≥t} | h^t, a_i^t] = E_π[ R_i(S^t, A^t) + v_i^π(H^{t+1}) | h^t, a_i^t ].

The expected value of a decision point h_i^t is the weighted sum of history values, where each history is weighted by its probability, conditioned on the decision point and the historical joint policy:

    v_i^π(h_i^t) = E_π[ v_i^π(H^t) | h_i^t ].

Similarly, the expected action value for action a_i^t at a decision point h_i^t is defined as

    q_i^π(h_i^t, a_i^t) = E_π[ q_i^π(H^t, a_i^t) | h_i^t ].
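To make the notation concrete, the following minimal sketch (ours, not from the paper) computes decision-point values and action values by weighting history-level quantities by the posterior P_π(h | h_i); the inputs `posterior`, `history_value`, and `history_action_value` are hypothetical stand-ins for game-specific quantities.

    # Decision-point values as posterior-weighted history values (illustrative only).
    # posterior[h]                holds P_pi(h | h_i)        [assumed input]
    # history_value[h]            holds v_i^pi(h)            [assumed input]
    # history_action_value[(h,a)] holds q_i^pi(h, a)         [assumed input]

    def decision_point_value(posterior, history_value):
        # v_i^pi(h_i) = E_pi[ v_i^pi(H) | h_i ]
        return sum(p * history_value[h] for h, p in posterior.items())

    def decision_point_action_value(posterior, history_action_value, action):
        # q_i^pi(h_i, a) = E_pi[ q_i^pi(H, a) | h_i ]
        return sum(p * history_action_value[(h, action)] for h, p in posterior.items())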
3 THE FRAMEWORK OF UPDATE EQUIVALENCE

The framework of update equivalence rests on the idea that last-iterate algorithms and DTP algorithms may be viewed, implicitly or explicitly, as inducing updates. For last-iterate algorithms, this induced update refers to the mapping from one iterate (i.e., policy) to the next. For DTP algorithms, this induced update refers to the mapping from the policy to be revised (called the blueprint policy) to the policy produced by the search. We define the idea that these updates may be equivalent below.

Definition 3.1 (Update equivalence). For some fixed game, a decision-time planning algorithm and a last-iterate algorithm are update equivalent if, for any policy π^t and any information state h_i, the distribution of actions played by the decision-time planning algorithm at h_i with blueprint π^t is equal to the distribution of actions the last-iterate algorithm would play at h_i on iteration t+1, given that it had played π^t on iteration t.

There are at least two benefits to considering update-equivalence relationships for decision-time planning. First, it begets an approach to analyzing DTP algorithms via their last-iterate algorithm counterparts. This avenue of analysis simply requires determining whether the next iterate is improved from the previous, circumventing a host of complications that make the analysis of PBS approaches nuanced. Second, and perhaps more importantly, it enables the generation of new principled DTP algorithms via last-iterate algorithms with desirable guarantees.

We illustrate these benefits in the two coming subsections. In Section 3.1, we focus on last-iterate algorithms that operate using action-value feedback, providing a general procedure to generate DTP analogues of these algorithms and discussing three DTP algorithms constructed via this procedure, as well as their soundness. In Section 3.2, we also discuss last-iterate algorithms that do not operate using action-value feedback, demonstrating how the framework of update equivalence can be used to generate or justify DTP algorithms with more complex structure.

3.1 ACTION-VALUE-BASED PLANNERS

In Algorithm 1, we give a procedure for constructing DTP algorithms that are update equivalent to last-iterate algorithms operating on action-value feedback. Specifically, let the last-iterate algorithm be represented by a continuous function U : (π^t(h_i), q^{π^t}(h_i)) ↦ π^{t+1}(h_i) mapping the current policy π^t to the next policy π^{t+1} by incorporating the action-value feedback q^{π^t} for the current policy. Algorithm 1 repeatedly samples histories starting from a given decision point and the joint policy and acquires estimates of the sample returns for these histories via rollouts. When the computational budget has been exhausted, Algorithm 1 applies the update function U to the current joint policy and estimated action values. This leads to an update-equivalent DTP algorithm, as we establish next.
Proposition 3.2. Consider a last-iterate algorithm operating with action-value feedback whose update function U is continuous. Then, as the number of rollouts goes to infinity, the output of Algorithm 1, conditioned on U and given inputs h_i^t, π, converges in probability to U(π(h_i^t), q_i^π(h_i^t)).

Proof. This proposition holds because the running means q̄ converge in probability to q_i^π(h_i^t) by the law of large numbers and U is continuous.

Algorithm 1: Update-Equivalent Search for a Last-Iterate Algorithm with Action-Value Feedback and Update U
    Input: decision point h_i^t, joint policy π
    Initialize q̄[a] as a running-mean tracker for each a ∈ A_i
    repeat
        Sample history H^t ∼ P_π(H^t | h_i^t)
        for a ∈ A_i do
            Sample return G^{≥t} ∼ P_π(G^{≥t} | H^t, a)
            q̄[a].update_running_mean(G^{≥t})
        end for
    until search budget is exhausted
    return U(π(h_i^t), q̄)

Monte Carlo Search   Algorithm 1 enables us to relate last-iterate algorithms with DTP counterparts. A notable existing instance of such a relationship had already been identified by Tesauro (1995) and Tesauro and Galperin (1996) between policy iteration (as the last-iterate algorithm) and Monte Carlo search (as the corresponding DTP algorithm). Indeed, by plugging policy iteration's local update function into Algorithm 1, we recover exactly Monte Carlo search as a special case. The local update induced by policy iteration can be written as

    U : (π̄, q̄) ↦ argmax_{π̄′ ∈ ∆(A_i)} ⟨π̄′, q̄⟩,

where π̄ ∈ ∆(A_i) and q̄ ∈ R^{|A_i|}. In other words, policy iteration's update simply plays a greedy policy with respect to its action values. By Proposition 3.2, when the computational budget given to the Monte Carlo search DTP algorithm grows, the policies produced converge to those of policy iteration. So, with enough budget, Monte Carlo search inherits all desirable properties enjoyed by policy iteration, including convergence to optimal policies in single-agent settings and convergence to Nash equilibrium in perfect-information 2p0s games (Littman, 1996).

Mirror Descent Search   Algorithm 1 also enables us to give novel justification to DTP approaches in settings in which giving guarantees has been historically difficult. One example of such a setting is common-payoff games, where a naïve application of Monte Carlo search can lead to bad performance as a result of the posterior over histories induced by the search policy diverging too far from the blueprint policy's posterior (Sokota et al., 2022). We show that this issue can be resolved using mirror descent (Beck and Teboulle, 2003; Nemirovsky and Yudin, 1983), which is a general approach to optimizing objectives that penalizes the distance between iterates. In the context of sequential decision making, a natural means of leveraging mirror descent is to instantiate simultaneous updaters U of the form

    U : (π̄, q̄) ↦ argmax_{π̄′ ∈ ∆(A_i)} ⟨π̄′, q̄⟩ − (1/η) KL(π̄′, π̄)

at each decision point, where π̄ is a local policy, η is a stepsize, and q̄ is a local action-value vector. In the case of discrete action spaces, this update reduces to the closed-form learning algorithm

    U(π̄, q̄) ∝ π̄ e^{η q̄},

known in the literature as hedge or multiplicative weights update. In the appendix, we show that mirror descent satisfies the following improvement property.

Theorem 3.3. Consider a common-payoff game. Let π^t be a joint policy having positive probability on every action at every decision point. Then, if we run mirror descent at every decision point with action-value feedback, for any sufficiently small stepsize η > 0, J(π^{t+1}) ≥ J(π^t). Furthermore, this inequality is strict if π^{t+1} ≠ π^t (that is, if π^t is not a local optimum).

We call mirror descent search (MDS) the DTP algorithm obtained by applying Algorithm 1 with mirror descent as the last-iterate algorithm. By virtue of Proposition 3.2, as the computational budget increases, MDS inherits the improvement property of Theorem 3.3.
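For concreteness, the following is a minimal Python sketch of Algorithm 1 instantiated with the mirror descent (hedge) update above, i.e., a bare-bones single MDS step. It is our illustration rather than the paper's implementation; `sample_history` and `rollout_return` are assumed stand-ins for a simulator and belief model providing samples from P_π(H^t | h_i^t) and P_π(G^{≥t} | H^t, a). Replacing the final hedge step with an argmax over q̄ recovers Monte Carlo search.

    import math

    def mds_plan(h_i, blueprint_local, actions, sample_history, rollout_return,
                 eta=1.0, num_rollouts=1000):
        # One update-equivalent search step (Algorithm 1) with the hedge update.
        # blueprint_local: dict action -> probability, the blueprint's local policy pi(h_i).
        # sample_history:  callable h_i -> history sampled from P_pi(. | h_i)      [assumed]
        # rollout_return:  callable (history, action) -> sampled return G^{>=t}    [assumed]
        q_sum = {a: 0.0 for a in actions}
        counts = {a: 0 for a in actions}
        for _ in range(num_rollouts):
            h = sample_history(h_i)              # H^t ~ P_pi(H^t | h_i^t)
            for a in actions:
                g = rollout_return(h, a)         # G^{>=t} ~ P_pi(. | H^t, a)
                q_sum[a] += g
                counts[a] += 1
        q_bar = {a: q_sum[a] / max(counts[a], 1) for a in actions}

        # Mirror descent (hedge) update: new policy proportional to pi_bar * exp(eta * q_bar).
        logits = {a: math.log(max(blueprint_local[a], 1e-12)) + eta * q_bar[a] for a in actions}
        max_logit = max(logits.values())
        weights = {a: math.exp(v - max_logit) for a, v in logits.items()}
        total = sum(weights.values())
        return {a: w / total for a, w in weights.items()}   # revised local policy at h_i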
Magnetic Mirror Descent Search   Another example of a setting that has been historically difficult for DTP is 2p0s games. In such settings, neither policy iteration- nor mirror descent-based approaches yield useful policies—instead, these approaches tend to cycle, diverge, or even exhibit formally chaotic behavior already in simple games such as rock-paper-scissors (Cheung and Piliouras, 2019). It was recently shown that this issue can be empirically and reliably resolved using magnetic mirror descent (Sokota et al., 2023a), which is an extension of mirror descent with additional proximal regularization toward a magnet that dampens these cycles. As with mirror descent, in the context of sequential decision making, a natural means of leveraging magnetic mirror descent is to instantiate simultaneous updaters U at each decision point:

    U : (π̄, q̄) ↦ argmax_{π̄′ ∈ ∆(A_i)} ⟨π̄′, q̄⟩ − (1/η) KL(π̄′, π̄) − α KL(π̄′, ρ̄),

where α is a regularization temperature and ρ̄ ∈ ∆(A_i) is a local reference policy; in the case of discrete action spaces, this update reduces to the following closed form:

    U(π̄, q̄) ∝ [π̄ e^{η q̄} ρ̄^{η α}]^{1/(1+αη)}.

We call magnetic mirror descent search (MMDS) the DTP algorithm obtained by applying Algorithm 1 with magnetic mirror descent as the last-iterate algorithm. Once again from Proposition 3.2, we establish that if the observation of MMD's reliable last-iterate convergence to equilibrium made by Sokota et al. (2023a) holds, then MMDS leads to expected improvement in 2p0s games as the computational budget grows.

3.2 BEYOND ACTION-VALUE-BASED PLANNERS

Many DTP algorithms that cannot be derived from Algorithm 1 may nevertheless be useful to view from the perspective of update equivalence. One class of such approaches makes updates at future decision points during search. Approaches of this form, such as Monte Carlo tree search (MCTS) (Coulom, 2006; Kocsis and Szepesvári, 2006; Browne et al., 2012), played an important role in successes in perfect-information games (Silver et al., 2016; 2018; Schrittwieser et al., 2019; Antonoglou et al., 2022). Another class of approaches fine-tunes the belief model (Sokota et al., 2022) to track the search policy. We provide empirical evidence for the soundness of three such MMD-based approaches in the context of imperfect-information 2p0s games: 1) with subgame updates, a variant in which the planning agent also performs MMD updates at its own future decision points; 2) with belief fine-tuning (BFT), a variant that implements an MMD update on top of a posterior fine-tuned to the search policy; 3) with opponent updates, a variant that implements an MMD update assuming a subgame joint policy in which the opponent has performed an MMD update at its next decision point. See Appendix B.1 for further algorithmic details.

In Figure 1, we measure the divergence of these last-iterate algorithm counterparts to agent quantal response equilibrium (AQRE) (McKelvey and Palfrey, 1998) as a function of the number of update iterations. We find that all three last-iterate algorithm analogues exhibit empirical convergence, suggesting that these DTP algorithms may be safe in 2p0s settings. More generally, these findings may hint that, so long as MMD is used as a local updater, one may have wide leeway in designing safe DTP algorithms for 2p0s games.

Figure 1: KL divergence to AQRE as a function of iterations (thousands) for last-iterate algorithm analogues of variants of MMD-based search algorithms (Standard, With Subgame Updates, With BFT, With Opponent Update) on Kuhn Poker, 2x2 Abrupt Dark Hex, 4-Sided Liar's Dice, and Leduc Poker. Each variant exhibits empirical convergence.
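For reference, the closed-form magnetic mirror descent update that serves as the local updater for MMDS and the variants above can be sketched as follows (a minimal illustration under the stated discrete-action assumption, not the authors' code). Setting α = 0 recovers the hedge update used by MDS.

    import math

    def mmd_local_update(pi_bar, q_bar, rho_bar, eta=0.1, alpha=0.1):
        # Closed-form MMD update for a discrete action set:
        # new policy proportional to [pi * exp(eta*q) * rho^(eta*alpha)]^(1 / (1 + alpha*eta)).
        # pi_bar, rho_bar: dicts action -> probability (current local policy and magnet).
        # q_bar:           dict action -> estimated action value.
        exponent = 1.0 / (1.0 + alpha * eta)
        logits = {}
        for a in pi_bar:
            log_pi = math.log(max(pi_bar[a], 1e-12))
            log_rho = math.log(max(rho_bar[a], 1e-12))
            logits[a] = exponent * (log_pi + eta * q_bar[a] + eta * alpha * log_rho)
        max_logit = max(logits.values())
        weights = {a: math.exp(v - max_logit) for a, v in logits.items()}
        total = sum(weights.values())
        return {a: w / total for a, w in weights.items()}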
4 EXPERIMENTS

In this section, we add further evidence for the update-equivalence framework's utility by showing that the novel DTP algorithms derived from it also perform well in practice. We focus on two settings with imperfect information: i) two variants of Hanabi (Bard et al., 2020), a fully cooperative card game in which PBS-based DTP approaches are considered state-of-the-art; and ii) 3x3 Abrupt Dark Hex and Phantom Tic-Tac-Toe, 2p0s games with virtually no public information.

4.1 HANABI

We consider two versions of two-player Hanabi, a standard benchmark in the literature (Bard et al., 2020). Both versions are scored out of 25 points—scoring a 25 is considered winning. First, we trained instances of PPO (Schulman et al., 2017) in self-play for each setting. We intentionally selected instances whose final performance roughly matched those of R2D2 (Kapturowski et al., 2019) instances that have been used to benchmark DTP algorithms. Next, we trained a belief model for each PPO policy using Seq2Seq with the same setup and hyperparameters as Hu et al. (2021); in this setup, the belief model takes in the decision point of one player and predicts the private information of the other player, conditioned on the PPO policy having been played thus far.

We adapted our implementation of MDS from that of single-agent SPARTA with a learned belief model (Hu et al., 2021). Our adaptation involves three important changes. First, MDS performs search for all agents, rather than only one agent, as was done in Hu et al. (2021). Second, MDS plays the argmax of a mirror descent update (this differs slightly from our description in the previous section in that we play the argmax of, rather than sample from, the updated policy), rather than the argmax of the empirical Q-values from search, as SPARTA does. Third, MDS always plays its search policy. This contrasts with both SPARTA (Lerer et al., 2020; Hu et al., 2021) and RLSearch (Fickinger et al., 2021), which, after doing search, perform a validation check to determine whether the search policy outperforms the blueprint policy; if the validation check fails, the blueprint policy is played instead of the search policy. This validation check can be expensive and usually fails (SPARTA and RLSearch only play their search policies on a handful of turns). Thus, MDS's ability to sidestep the validation check is a significant advantage.
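To make the decision rule concrete, the following minimal sketch (ours, with hypothetical inputs, not the released implementation) shows how, given a local blueprint distribution and Q-estimates computed from rollouts over belief-model samples, MDS selects the argmax of the hedge update; the default η = 20 is the stepsize reported for Hanabi in Appendix B.2.

    import math

    def mds_act(blueprint_probs, q_hat, eta=20.0):
        # Pick an action via the argmax of a mirror descent (hedge) update.
        # blueprint_probs: dict action -> blueprint probability at the current decision point.
        # q_hat:           dict action -> Q estimate from rollouts over sampled beliefs. [assumed]
        scores = {a: math.log(max(blueprint_probs[a], 1e-12)) + eta * q_hat[a]
                  for a in blueprint_probs}
        return max(scores, key=scores.get)   # argmax of pi * exp(eta * q_hat)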
5-Card Hanabi   We first test MDS in the 5-card 8-hint variant of Hanabi, which is commonly played by humans. Specifically, we compare MDS using a PPO blueprint policy against single- and multi-agent SPARTA and RLSearch, which are considered state-of-the-art, with exact belief models and an R2D2 blueprint policy; we also compare against single-agent SPARTA using the same PPO blueprint policy and Seq2Seq belief model. For MDS, we use 10,000 samples from the belief model for search; this takes about 2 seconds per decision, which is about two orders of magnitude faster than multi-agent SPARTA and multi-agent RLSearch.

We report our results in Table 1. To give context to the results, the strongest reported performance of model-free agents in the literature exceeds 24.40, while that of the strongest reported search agents exceeds 24.60 (Hu et al., 2021). It becomes increasingly hard to improve performance as the score increases; for high-scoring policies, increases as small as a few hundredths of a point are considered substantial and can translate to multiple percentage points of winning percentage. To give specific numbers as an illustration, Lerer et al. (2020) report that improving the expected return from 24.53 to 24.61 can increase the winning percentage from 71.1% to 75.5%.

Despite being handicapped by 1) an approximate belief model, 2) a lack of validation, and 3) two orders of magnitude less search time, we find that MDS achieves superior or comparable performance to PBS methods. Given the historical dominance of PBS approaches in settings where they are applicable, the fact that a non-PBS-based method could achieve competitive performance is already notable; that MDS does so with such a severe handicap is particularly impressive.

Table 1: MDS compared to SPARTA and RLSearch in 5-card 8-hint Hanabi. Despite using approximate beliefs and roughly two orders of magnitude less search time, MDS matches the performance of prior methods using exact beliefs. The first four rows used an R2D2 blueprint that scored 24.23 ± 0.04; the last two rows used a PPO blueprint that scored 24.24 ± 0.02.

    Search     Agents planned   Belief model   Validation used   Time per move   Expected return
    SPARTA     Single           Exact          Yes               3s              24.57 ± 0.03
    RLSearch   Single           Exact          Yes               35s             24.59 ± 0.02
    SPARTA     Multi            Exact          Yes               450s            24.61 ± 0.02
    RLSearch   Multi            Exact          Yes               180s            24.62 ± 0.03
    SPARTA     Single           Seq2Seq        Yes               1s              24.52 ± 0.02
    MDS        Multi            Seq2Seq        No                2s              24.62 ± 0.02

7-Card Hanabi   We next investigate MDS in the 7-card 4-hint variant of Hanabi. This variant was introduced by Hu et al. (2021) to illustrate the challenges of performing search as the amount of private information scales. Specifically, the number of possible histories is too large to enumerate. As a result, multi-agent SPARTA is inapplicable altogether, single-agent SPARTA and RLSearch require an approximate belief model, and multi-agent RLSearch requires both an approximate belief model and belief fine-tuning (Sokota et al., 2022). Because of the significantly increased burden of running PBS methods in this setting, one would hope that a method that eschews PBSs might be able to outperform PBS methods. And, indeed, in Table 2, where we report expected return and standard error over 2000 games, we find that MDS does outperform multi-agent RLSearch with BFT, as well as a variety of single-agent search techniques. These results support our claim that the update-equivalence framework offers superior scaling properties in the amount of non-public information, compared to PBS methods.

Table 2: MDS compared to SPARTA and RLSearch in 7-card 4-hint Hanabi; MDS compares favorably to these approaches. The first three rows used an R2D2 blueprint that scored 23.67 ± 0.02; the last two rows used a PPO blueprint that scored 23.66 ± 0.03.

    Search     Agents planned   Belief model   Validation used   Time per move   Expected return
    RLSearch   Single           Seq2Seq        Yes               35s             24.14 ± 0.04
    RLSearch   Single           BFT            Yes               100s            24.18 ± 0.03
    RLSearch   Multi            BFT            Yes               310s            24.18 ± 0.03
    SPARTA     Single           Seq2Seq        Yes               1s              24.17 ± 0.03
    MDS        Multi            Seq2Seq        No                2s              24.28 ± 0.02

To offer further intuition for the behavior of MDS, we show the performance in 7-card 4-hint Hanabi as a function of mirror descent's stepsize in Figure 2.
As might be expected, we find that improvement is a unimodal function of stepsize: if the stepsize is too small, opportunities to improve the blueprint policy are neglected; on the other hand, if the stepsize is too large, the search policy's posterior diverges too far from that of the blueprint for the search to provide useful feedback.

Figure 2: Performance of MDS as a function of stepsize in 7-card 4-hint Hanabi (expected return vs. stepsize, with the blueprint shown for reference). Small stepsizes provide less improvement over the blueprint; overly large stepsizes cause the search policy to diverge too far from the blueprint, resulting in less improvement or even detriment.

4.2 3X3 ABRUPT DARK HEX AND PHANTOM TIC-TAC-TOE

Unlike common-payoff games, measuring performance in 2p0s games is difficult because the usual metric of interest—exploitability (i.e., the expected return of a best responder)—cannot be cheaply estimated. Indeed, providing a reasonable lower bound requires training an approximate best response (Timbers et al., 2022), which may be costly. To facilitate this approximate best response training under a modest computation budget, we focus our experiments on a blueprint policy that selects actions uniformly at random at every decision point so that rollouts may be performed quickly; furthermore, rather than using a learned belief model, we track an approximate posterior using particle filtering (Doucet and Johansen, 2009) with 10 particles. If particles remain, our implementation of MMDS performs a rollout for each particle for each legal action until the end of the game and performs an MMD update on top of the empirical means of the returns. Otherwise, the agent executes the blueprint policy.
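A minimal sketch of this particle-based procedure follows (our illustration, with hypothetical helper names, not the released implementation); the defaults η = 50, α = 0.01, and a uniform magnet match the values reported for the uniform blueprint in Appendix B.3.

    import math
    import random

    def mmds_act(particles, legal_actions, blueprint_local, rollout_return, magnet,
                 eta=50.0, alpha=0.01):
        # Particle-filter MMDS step as described above (simplified sketch).
        # particles:       surviving histories consistent with the agent's observations.
        # blueprint_local: dict action -> blueprint probability at this decision point.
        # rollout_return:  callable (history, action) -> return of a rollout to game end. [assumed]
        # magnet:          dict action -> magnet policy rho (uniform in the experiments).
        if not particles:                                # no particles left: fall back to blueprint
            return sample_action(blueprint_local)
        q_bar = {a: sum(rollout_return(h, a) for h in particles) / len(particles)
                 for a in legal_actions}                 # one rollout per particle per action

        # Closed-form MMD update on the empirical mean returns (cf. Section 3.2):
        # search policy proportional to [pi * exp(eta*q) * rho^(eta*alpha)]^(1 / (1 + alpha*eta)).
        exponent = 1.0 / (1.0 + alpha * eta)
        logits = {a: exponent * (math.log(max(blueprint_local[a], 1e-12)) + eta * q_bar[a]
                                 + eta * alpha * math.log(max(magnet[a], 1e-12)))
                  for a in legal_actions}
        max_logit = max(logits.values())
        weights = {a: math.exp(v - max_logit) for a, v in logits.items()}
        total = sum(weights.values())
        return sample_action({a: w / total for a, w in weights.items()})

    def sample_action(policy):
        actions, probs = zip(*policy.items())
        return random.choices(actions, weights=probs, k=1)[0]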
We empirically investigate MMDS in 3x3 Abrupt Dark Hex and Phantom Tic-Tac-Toe, two standard benchmarks available in OpenSpiel (Lanctot et al., 2019). We show approximate exploitability results in Table 3. We computed these results using OpenSpiel's (Lanctot et al., 2019) DQN (Mnih et al., 2015) best response code, trained for 10 million time steps. We show standard error over 5 DQN best response training seeds and 2000 final evaluation games for the fully trained model. We compare against a bot that plays the first legal action, the uniform random blueprint, Liang et al. (2018)'s implementation of independent PPO (Schulman et al., 2017), Lanctot et al. (2019)'s implementation of NFSP (Heinrich and Silver, 2016), and MMD (Sokota et al., 2023a). For the learning agents, we ran 5 seeds and include checkpoints trained for both 1 and 10 million time steps.

We find that MMDS reduces the approximate exploitability of the uniform random blueprint by more than a third, despite using a meager 10-particle approximate posterior. Furthermore, MMDS achieves lower approximate exploitability than all non-MMD-based approaches.

Table 3: Approximate exploitability in 3x3 Abrupt Dark Hex and Phantom Tic-Tac-Toe, on a 0 to 100 scale. MMDS substantially reduces Random's exploitability.

    Agent              3x3 Ab. DH   Phantom TTT
    1st Legal Action   100 ± 0      100 ± 0
    Random             74 ± 1       78 ± 0
    PPO(1M steps)      85 ± 6       89 ± 6
    PPO(10M steps)     100 ± 0      90 ± 4
    NFSP(1M steps)     91 ± 4       95 ± 1
    NFSP(10M steps)    59 ± 1       78 ± 5
    Random+MMDS        50 ± 1       50 ± 1
    MMD(1M steps)      34 ± 2       37 ± 1
    MMD(10M steps)     20 ± 1       15 ± 1

We also investigate the performance of MMDS in head-to-head matchups. We show results using the uniform random blueprint with 10 particles and also an MMD(1M) blueprint with 100 particles in Figure 3. The values shown are averages over 10,000 games with bootstrap estimates of 95% confidence intervals. We find that MMDS tends to improve the performance of the blueprint policies in both cases, despite the short length of the games.

Figure 3: Expected return of the uniform random blueprint and uniform random blueprint + MMDS (a), and of the MMD(1M) blueprint and MMD(1M) blueprint + MMDS (b), against various opponents (1st Legal Action, Random, PPO, NFSP, and MMD checkpoints at 1M and 10M steps) in 3x3 Abrupt Dark Hex and Phantom Tic-Tac-Toe. MMDS tends to improve head-to-head expected return.

5 RELATED WORK

Other Approaches to Imperfect Information   Motivated by the deficiencies of PBS-based DTP described above, a small group of existing works has advocated for alternative approaches to DTP for common-payoff games (Tian et al., 2020) and 2p0s games (Zhang and Sandholm, 2021). Tian et al. (2020)'s approach relies on a result that shows that it is possible to decompose the change in expected return for two different joint policies across decision points. By leveraging this result, Tian et al. (2020) introduce a search procedure called joint policy search (JPS) that is guaranteed to not decrease the expected return. Zhang and Sandholm (2021)'s approach is based on the insight that, in practice, it is effective to consider a subgame that excludes most of the decision points supported by the PBS. Specifically, Zhang and Sandholm (2021) advocate in favor of solving a max-margin subgame (Moravcik et al., 2016) that includes the planning agent's true decision point, as well as any opponent decision points that are possible from the perspective of the planning agent. Despite proving the existence of games in which this approach, which they call 1-KLSS, increases the exploitability of the policy, Zhang and Sandholm (2021) find experimentally that, on small and medium-sized games, 1-KLSS reliably decreases exploitability. In concurrent work, Liu et al. (2023) derive an alternative variant of 1-KLSS that guarantees safety, making it a promising approach for scaling search to adversarial settings with large amounts of non-public information.

Structurally Similar Approaches   While the motivation for the update-equivalence framework most closely resembles that of the works of Tian et al. (2020) and Zhang and Sandholm (2021), natural instances of the update-equivalence framework are structurally more similar to search algorithms unrelated to resolving issues with PBS-based planning.
As discussed previously, arguably the most fundamental instance of the framework of update equivalence is Monte Carlo search (MCS) (Tesauro, 1995; Tesauro and Galperin, 1996), which is update equivalent to policy iteration, as was articulated by Tesauro and Galperin (1996) themselves: "[MCS] basically implements a single step of policy iteration." More recently, Anthony et al. (2019); Anthony (2021); Hamrick et al. (2021); Lerer et al. (2020); Hu et al. (2021) investigate MCS in a variety of settings: Anthony et al. (2019) and Anthony (2021) find that it yields perhaps surprisingly good performance relative to MCTS in Hex; Hamrick et al. (2021) report comparable performance with MCTS across a variety of settings, with the exception of Go; Lerer et al. (2020) and Hu et al. (2021) show strong results for Hanabi (under the names single-agent SPARTA and learned belief search).

In his work, Anthony (2021) also investigates a search algorithm called policy gradient search that performs policy gradient updates at future decision points. He finds that a variant of this approach that involves regularization toward the blueprint policy tends to outperform not regularizing. This approach is similar to some of the variants of MMDS investigated in Section 3.2. The combination of policy gradient search and the framework of update equivalence may be a fruitful direction in the pursuit of algorithms that perform well in both perfect- and imperfect-information games.

Separately from Anthony (2021), Jacob et al. (2022) also investigate approaches to DTP involving regularization toward a blueprint. Jacob et al. (2022) show empirically that, by performing regularized search, it is possible to increase the expected return of an imitation-learned policy without a loss of prediction accuracy (for the policy being imitated). The most immediately related experiments to this work are those concerning Hanabi, in which Jacob et al. (2022) investigated MDS (under the name piKL SPARTA) applied on top of imitation-learned blueprint policies. Jacob et al. (2022) note that this approach reliably increases the performance of weak policies, but recognize neither that it possesses an improvement guarantee nor that it is simply performing a hedge update, as we do in this work. Jacob et al. (2022)'s experiments on Diplomacy (Paquette et al., 2019) are also related in that their approach is similar to a follow-the-regularized-leader analogue of MMDS. This approach played an important role in recent empirical successes for Diplomacy (Bakhtin et al., 2023; Meta Fundamental AI Research Diplomacy Team (FAIR) et al., 2022), suggesting that the framework of update equivalence is a natural approach to general-sum settings.

6 CONCLUSION AND FUTURE WORK

In this work, motivated by the deficiencies of search algorithms based on public belief state representations of imperfect-information games, we advocate for a new paradigm for decision-time planning, which we call the framework of update equivalence. We show how the framework of update equivalence can be used to generate and ground new decision-time planning algorithms for imperfect-information games. Furthermore, we show that these algorithms can achieve competitive (or superior) performance compared with state-of-the-art PBS-based methods in Hanabi and can reduce approximate exploitability in 3x3 Abrupt Dark Hex and Phantom Tic-Tac-Toe.
We believe the framework of update equivalence opens the door to many exciting possibilities for DTP and expert iteration (Anthony et al., 2017; Anthony, 2021) in settings with large amounts of imperfect information. Among these are the prospects of extending algorithms resembling AlphaZero (Silver et al., 2018), Stochastic MuZero (Antonoglou et al., 2022), and Diplodocus (Bakhtin et al., 2023) to imperfect-information games.

7 ACKNOWLEDGEMENTS

We thank Brian Hu Zhang for helpful discussions regarding knowledge-limited subgame solving (Zhang and Sandholm, 2021) and Eugene Vinitsky for helpful feedback regarding the presentation of the work.

REFERENCES

T. Anthony, Z. Tian, and D. Barber. Thinking fast and slow with deep learning and tree search. In Neural Information Processing Systems (NeurIPS), 2017.
T. Anthony, R. Nishihara, P. Moritz, T. Salimans, and J. Schulman. Policy gradient search: Online planning and expert iteration without search trees. Computing Research Repository (CoRR), 2019.
T. W. Anthony. Expert iteration. PhD thesis, UCL (University College London), 2021.
I. Antonoglou, J. Schrittwieser, S. Ozair, T. K. Hubert, and D. Silver. Planning in stochastic environments with a learned model. In International Conference on Learning Representations (ICLR), 2022.
A. Bakhtin, D. J. Wu, A. Lerer, J. Gray, A. P. Jacob, G. Farina, A. H. Miller, and N. Brown. Mastering the game of No-Press Diplomacy via human-regularized reinforcement learning and planning. In International Conference on Learning Representations (ICLR), 2023.
N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song, E. Parisotto, V. Dumoulin, S. Moitra, E. Hughes, I. Dunning, S. Mourad, H. Larochelle, M. G. Bellemare, and M. Bowling. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence, 2020.
A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 2003.
G. W. Brown. Iterative solution of games by fictitious play. In Activity Analysis of Production and Allocation, 1951.
N. Brown and T. Sandholm. Safe and nested subgame solving for imperfect-information games. In Neural Information Processing Systems (NeurIPS), 2017.
N. Brown and T. Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 2018.
N. Brown and T. Sandholm. Superhuman AI for multiplayer poker. Science, 2019.
N. Brown, A. Bakhtin, A. Lerer, and Q. Gong. Combining deep reinforcement learning and search for imperfect-information games. Neural Information Processing Systems (NeurIPS), 2020.
C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 2012.
N. Burch, M. Johanson, and M. Bowling. Solving imperfect information games using decomposition. In AAAI Conference on Artificial Intelligence (AAAI), 2014.
M. Campbell, A. Hoane, and F.-h. Hsu. Deep Blue. Artificial Intelligence, 2002.
Y. K. Cheung and G. Piliouras. Vortices instead of equilibria in minmax optimization: Chaos and butterfly effects of online learning in zero-sum games. In Conference on Learning Theory (COLT), 2019.
R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In Computers and Games, 2006.
J. S. Dibangoye, C. Amato, O. Buffet, and F. Charpillet. Optimally solving Dec-POMDPs as continuous-state MDPs. In International Joint Conference on Artificial Intelligence (IJCAI), 2013.
A. Doucet and A. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 2009.
A. Fickinger, H. Hu, B. Amos, S. Russell, and N. Brown. Scalable online planning via reinforcement learning fine-tuning. In Neural Information Processing Systems (NeurIPS), 2021.
J. B. Hamrick, A. L. Friesen, F. Behbahani, A. Guez, F. Viola, S. Witherspoon, T. Anthony, L. H. Buesing, P. Veličković, and T. Weber. On the role of planning in model-based deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2021.
E. A. Hansen, D. S. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In AAAI Conference on Artificial Intelligence (AAAI), 2004.
J. Heinrich and D. Silver. Deep reinforcement learning from self-play in imperfect-information games. Computing Research Repository (CoRR), 2016.
H. Hu, A. Lerer, N. Brown, and J. N. Foerster. Learned belief search: Efficiently improving policies in partially observable settings. Computing Research Repository (CoRR), 2021.
A. P. Jacob, D. J. Wu, G. Farina, A. Lerer, H. Hu, A. Bakhtin, J. Andreas, and N. Brown. Modeling strong and human-like gameplay with KL-regularized search. In International Conference on Machine Learning (ICML), 2022.
S. K. Jakobsen, T. B. Sørensen, and V. Conitzer. Timeability of extensive-form games. In Innovations in Theoretical Computer Science (ITCS), 2016.
S. Kapturowski, G. Ostrovski, W. Dabney, J. Quan, and R. Munos. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations (ICLR), 2019.
L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning (ECML), 2006.
V. Kovařík, M. Schmid, N. Burch, M. Bowling, and V. Lisý. Rethinking formal models of partially observable multiagent decision making. Artificial Intelligence, 2022.
M. Lanctot, E. Lockhart, J.-B. Lespiau, V. Zambaldi, S. Upadhyay, J. Pérolat, S. Srinivasan, F. Timbers, K. Tuyls, S. Omidshafiei, D. Hennes, D. Morrill, P. Muller, T. Ewalds, R. Faulkner, J. Kramár, B. D. Vylder, B. Saeta, J. Bradbury, D. Ding, S. Borgeaud, M. Lai, J. Schrittwieser, T. Anthony, E. Hughes, I. Danihelka, and J. Ryan-Davis. OpenSpiel: A framework for reinforcement learning in games. Computing Research Repository (CoRR), 2019.
A. Lerer, H. Hu, J. Foerster, and N. Brown. Improving policies via search in cooperative partially observable games. In AAAI Conference on Artificial Intelligence (AAAI), 2020.
E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gonzalez, M. Jordan, and I. Stoica. RLlib: Abstractions for distributed reinforcement learning. In International Conference on Machine Learning (ICML), 2018.
M. L. Littman. Algorithms for Sequential Decision Making. PhD thesis, Brown University, 1996.
W. Liu, H. Fu, Q. Fu, and Y. Wei. Opponent-limited online search for imperfect information games. In International Conference on Machine Learning (ICML), 2023.
R. D. McKelvey and T. R. Palfrey. Quantal response equilibria for extensive form games. Experimental Economics, 1998.
Meta Fundamental AI Research Diplomacy Team (FAIR), A. Bakhtin, N. Brown, E. Dinan, G. Farina, C. Flaherty, D. Fried, A. Goff, J. Gray, H. Hu, A. P. Jacob, M. Komeili, K. Konath, M. Kwon, A. Lerer, M. Lewis, A. H. Miller, S. Mitts, A. Renduchintala, S. Roller, D. Rowe, W. Shi, J. Spisak, A. Wei, D. Wu, H. Zhang, and M. Zijlstra. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 2022.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
M. Moravcik, M. Schmid, K. Ha, M. Hladik, and S. J. Gaukrodger. Refining subgames in large imperfect information games. In AAAI Conference on Artificial Intelligence (AAAI), 2016.
M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, and M. Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 2017.
A. Nayyar, A. Mahajan, and D. Teneketzis. Decentralized stochastic control with partial history sharing: A common information approach. IEEE Transactions on Automatic Control, 2013.
A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization. 1983.
F. A. Oliehoek. Sufficient plan-time statistics for decentralized POMDPs. In International Joint Conference on Artificial Intelligence (IJCAI), 2013.
P. Paquette, Y. Lu, S. S. Bocco, M. Smith, S. O-G, J. K. Kummerfeld, J. Pineau, S. Singh, and A. C. Courville. No-press Diplomacy: Modeling multi-agent gameplay. Neural Information Processing Systems (NeurIPS), 2019.
J. Perolat, R. Munos, J.-B. Lespiau, S. Omidshafiei, M. Rowland, P. Ortega, N. Burch, T. Anthony, D. Balduzzi, B. De Vylder, G. Piliouras, M. Lanctot, and K. Tuyls. From Poincaré recurrence to convergence in imperfect information games: Finding equilibrium via regularization. In International Conference on Machine Learning (ICML), 2021.
J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 2019.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. Computing Research Repository (CoRR), 2017.
D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 2018.
S. Sokota, H. Hu, D. J. Wu, J. Z. Kolter, J. N. Foerster, and N. Brown. A fine-tuning approach to belief state modeling. In International Conference on Learning Representations (ICLR), 2022.
S. Sokota, R. D'Orazio, J. Z. Kolter, N. Loizou, M. Lanctot, I. Mitliagkas, N. Brown, and C. Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. In International Conference on Learning Representations (ICLR), 2023a.
S. Sokota, R. D'Orazio, C. K. Ling, D. J. Wu, J. Z. Kolter, and N. Brown. Abstracting imperfect information away from two-player zero-sum games. In International Conference on Machine Learning (ICML), 2023b.
G. Tesauro. Temporal difference learning and TD-Gammon. Journal of the International Computer Games Association (ICGA), 1995.
G. Tesauro and G. Galperin. On-line policy improvement using Monte-Carlo search. In Neural Information Processing Systems (NIPS), 1996.
Y. Tian, Q. Gong, and T. Jiang. Joint policy search for multi-agent collaboration with imperfect information. In Neural Information Processing Systems (NeurIPS), 2020.
F. Timbers, N. Bard, E. Lockhart, M. Lanctot, M. Schmid, N. Burch, J. Schrittwieser, T. Hubert, and M. Bowling. Approximate exploitability: Learning a best response. In International Joint Conference on Artificial Intelligence (IJCAI), 2022.
B. Zhang and T. Sandholm. Subgame solving without common knowledge. In Neural Information Processing Systems (NeurIPS), 2021.
M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. In Neural Information Processing Systems (NIPS), 2007.

A THEORY

In this section, we prove Theorem 3.3. To start, consider the following more general lemma.

Lemma A.1 (Folklore). Let f ∈ C^1(X), where X ⊆ R^d is convex and compact, let x_0 ∈ X, and let x_+ be the solution to the mirror descent step

    x_+ := argmin_{x ∈ X} { η⟨∇f(x_0), x⟩ + D_φ(x, x_0) },

where φ is differentiable and 1-strongly convex with respect to a norm ‖·‖. Then, for η small enough, if x_+ ≠ x_0, f(x_+) < f(x_0).

Proof. From the first-order necessary optimality conditions for the mirror descent step, we have

    ⟨η∇f(x_0) + ∇φ(x_+) − ∇φ(x_0), x̂ − x_+⟩ ≥ 0   for all x̂ ∈ X.

Hence, letting x̂ := x_0, and using the strong convexity of φ and the assumption that x_+ ≠ x_0, we obtain

    ⟨∇f(x_0), x_+ − x_0⟩ ≤ −(1/η)‖x_+ − x_0‖^2 < 0,

and with a further application of the Cauchy-Schwarz inequality,

    ‖x_+ − x_0‖ ≤ η‖∇f(x_0)‖.

By continuity of the gradient of f, there must exist ϵ > 0 such that

    ⟨∇f(x), x_+ − x_0⟩ < 0   for all x ∈ B(x_0, ϵ).

Furthermore, the mean value theorem guarantees that f(x_+) − f(x_0) = ⟨∇f(ξ), x_+ − x_0⟩ for some ξ on the line connecting x_0 to x_+. So, as long as η‖∇f(x_0)‖ < ϵ, we have f(x_+) < f(x_0), as we wanted to show.

Theorem 3.3. Consider a common-payoff game. Let π^t be a joint policy having positive probability on every action at every decision point. Then, if we run mirror descent at every decision point with action-value feedback, for any sufficiently small stepsize η > 0, J(π^{t+1}) ≥ J(π^t). Furthermore, this inequality is strict if π^{t+1} ≠ π^t (that is, if π^t is not a local optimum).

Proof. The space of joint policies Π is a Cartesian product of probability simplices, immediately implying that it is convex and compact. Furthermore, the expected return function is a polynomial function of any joint policy π ∈ Π. Hence, J ∈ C^1(Π). The result then follows directly from Lemma A.1, as, by hypothesis, the players are performing mirror descent steps on J.

B EXPERIMENTS

In this section, we provide further details about some of our empirical results and also show some additional results.

B.1 BEYOND ACTION-VALUE-BASED PLANNERS

First, we describe in greater detail the DTP algorithms that we investigated in Section 3.2.
With Subgame Updates   With subgame updates differs from MMDS in the feedback it uses. In particular, rather than using the action values for the current policy, with subgame updates uses the action values for the joint policy induced by performing MMD updates at its own future decision points (but leaving the opponent's policy fixed). Because the opponent is fixed, in tabular settings, we can compute the feedback for the last-iterate algorithm analogue of this approach using one backward induction pass for each player.

With Belief Fine-Tuning   With BFT differs from MMDS in the distribution it samples histories from. In particular, rather than sampling from the distribution induced by the current policy, it samples from the distribution induced by the search policy for each player.

With Opponent Update   With opponent update differs from MMDS in the feedback it uses. In particular, rather than using the action values for the current policy, with opponent update uses the action values for the joint policy induced by performing an MMD update for the opponent at the next time step. Note that, as a DTP algorithm, this approach would involve two belief model sampling steps: one to sample an opponent decision point for updating and one to sample an unbiased history for that opponent information state.

Agent Quantal Response Equilibria Solving   In Figure 4, we show the convergence results from the main body, along with the exploitabilities of the corresponding iterates. For these experiments, we used α = 0.1 and η = α/10, except for with opponent updates on Leduc, where we used η = α/20, and with subgame updates on Leduc, where we used η = α/50.

Figure 4: Solving for agent quantal response equilibria using last-iterate algorithm analogues of variants of MMDS (Standard, With Subgame Updates, With BFT, With Opponent Update) on Kuhn Poker, 2x2 Abrupt Dark Hex, 4-Sided Liar's Dice, and Leduc Poker; top panels show KL divergence to AQRE and bottom panels show exploitability, both as functions of iterations (thousands).

MiniMaxEnt Equilibrium Solving   In Figure 5, we show convergence results for solving for MiniMaxEnt equilibria, which are the solutions of MiniMaxEnt objectives (Perolat et al., 2021). A MiniMaxEnt objective is an objective of the form

    J_i : π ↦ E[ Σ_t R_i(S^t, A^t) + α H(π_i(H_i^t)) − α H(π_{−i}(H_{−i}^t)) | π ].

We used α = 0.1 and η = α/10, except for with opponent update on Leduc, where we used η = α/20, and with subgame updates on Leduc, which used η = α/50. We again observe empirical convergence.

Figure 5: Solving for MiniMaxEnt equilibria using last-iterate algorithm analogues of variants of MMDS (same layout as Figure 4: KL divergence and exploitability as functions of iterations in thousands).

Solving for Nash Equilibria   Next we show that these last-iterate algorithm analogues can be made to converge to Nash equilibria by annealing the amount of regularization used. We show these results in Figure 6, compared against CFR (Zinkevich et al., 2007). The hyperparameters for these results are shown in Table 4.

Figure 6: Exploitability of different MMDS analogues with annealed regularization, compared against CFR, as a function of iterations in Kuhn Poker, 2x2 Abrupt Dark Hex, 4-Sided Liar's Dice, and Leduc Poker.

Table 4: Schedules for Figure 6.

    Method \ Game          Kuhn Poker         2x2 Abrupt Dark Hex   4-Sided Liar's Dice          Leduc Poker
    Standard               α_t = η_t = 1/√t   α_t = η_t = 1/√t      α_t = 1/√t, η_t = 2/√t       α_t = 5/√t, η_t = 1/√t
    With Subgame Updates   α_t = η_t = 1/√t   α_t = η_t = 1/√t      α_t = 1/√t, η_t = 1/(2√t)    α_t = 5/√t, η_t = 1/(5√t)
    With BFT               α_t = η_t = 1/√t   α_t = η_t = 1/√t      α_t = 1/√t, η_t = 2/√t       α_t = 5/√t, η_t = 1/(2√t)
    With Opponent Update   α_t = η_t = 1/√t   α_t = η_t = 1/√t      α_t = η_t = 1/√t             α_t = 5/√t, η_t = 1/(2√t)
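For illustration, the annealed last-iterate loop can be sketched as follows (a minimal, assumption-laden sketch rather than the exact experimental code); `local_update` stands for any per-decision-point updater such as the closed-form MMD update sketched in Section 3.2, and `q_feedback` is an assumed oracle for the action values under the current joint policy.

    import math

    def run_annealed(local_update, local_policies, magnets, q_feedback, num_iters=10000):
        # Annealed last-iterate loop (sketch).
        # local_update: callable (pi_bar, q_bar, rho_bar, eta, alpha) -> new pi_bar.
        # q_feedback:   assumed oracle mapping the current joint policy to a dict
        #               decision point -> action-value vector under that policy.
        for t in range(1, num_iters + 1):
            alpha_t = 1.0 / math.sqrt(t)     # annealed temperature (cf. Table 4)
            eta_t = 1.0 / math.sqrt(t)       # annealed stepsize (cf. Table 4)
            q = q_feedback(local_policies)
            local_policies = {h: local_update(pi_bar, q[h], magnets[h], eta_t, alpha_t)
                              for h, pi_bar in local_policies.items()}
        return local_policies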
Solving for Nash Equilibria with MiniMaxEntRL Objectives   We also show analogous results for MiniMaxEnt objectives in Figure 7. The hyperparameters for these experiments are shown in Table 5.

Figure 7: Exploitability of different MMDS analogues under MiniMaxEnt objectives with annealed regularization, compared against CFR, as a function of iterations in Kuhn Poker, 2x2 Abrupt Dark Hex, 4-Sided Liar's Dice, and Leduc Poker.

Table 5: Schedules for Figure 7.

    Method \ Game       Kuhn Poker         2x2 Abrupt Dark Hex   4-Sided Liar's Dice          Leduc Poker
    One-Step Search     α_t = η_t = 1/√t   α_t = η_t = 1/√t      α_t = 1/√t, η_t = 2/√t       α_t = 5/√t, η_t = 1/√t
    Multi-Step Search   α_t = η_t = 1/√t   α_t = η_t = 1/√t      α_t = 1/√t, η_t = 1/(2√t)    α_t = 5/√t, η_t = 1/(5√t)
    BFT Search          α_t = η_t = 1/√t   α_t = η_t = 1/√t      α_t = 1/√t, η_t = 2/√t       α_t = 5/√t, η_t = 1/√t
    Opponent Search     α_t = η_t = 1/√t   α_t = η_t = 1/√t      α_t = η_t = 1/√t             α_t = 5/√t, η_t = 1/(2√t)

B.2 HANABI

For our Hanabi experiments, we used η = 20 for the MDS results in Tables 1 and 2. We performed search with 10,000 samples.

B.3 3X3 ABRUPT DARK HEX AND PHANTOM TIC-TAC-TOE

For our 3x3 Abrupt Dark Hex and Phantom Tic-Tac-Toe experiments with a uniform blueprint, we used η = 50, α = 0.01, and set ρ to be uniform. For the MMD(1M) blueprint, we used η = 10, α = 0.05, and set ρ to be uniform. For particle filtering, we sampled 10 particles for the uniform blueprint and 100 particles for the MMD(1M) blueprint from the start of the game independently at every decision point to reduce bias. For the baselines, we used the same setup as Sokota et al. (2023a).