Published as a conference paper at ICLR 2024
Designing Skill-Compatible AI:
Methodologies and Frameworks in Chess
Karim Hamade
kar@cs.toronto.edu
University of Toronto
Reid McIlroy-Young
reidmcy@cs.toronto.edu
University of Toronto
Jon Kleinberg
kleinberg@cornell.edu
Cornell University
Siddhartha Sen
sidsen@microsoft.com
Microsoft Research
Ashton Anderson
ashton@cs.toronto.edu
University of Toronto
Abstract
Powerful artificial intelligence systems are often used in settings where they
must interact with agents that are computationally much weaker, for example
when they work alongside humans or operate in complex environments
where some tasks are handled by algorithms, heuristics, or other entities
of varying computational power. For AI agents to successfully interact in
these settings, however, achieving superhuman performance alone is not
sufficient; they also need to account for suboptimal actions or idiosyncratic
style from their less-skilled counterparts. We propose a formal evaluation
framework for assessing the compatibility of near-optimal AI with interaction
partners who may have much lower levels of skill; we use popular collaborative
chess variants as model systems to study and develop AI agents that can
successfully interact with lower-skill entities. Traditional chess engines
designed to output near-optimal moves prove to be inadequate partners when
paired with engines of various lower skill levels in this domain, as they are
not designed to consider the presence of other agents. We contribute three
methodologies to explicitly create skill-compatible AI agents in complex
decision-making settings, and two chess game frameworks designed to foster
collaboration between powerful AI agents and less-skilled partners. On
these frameworks, our agents outperform state-of-the-art chess AI (based on
AlphaZero) despite being weaker in conventional chess, demonstrating that
skill-compatibility is a tangible trait that is qualitatively and measurably
distinct from raw performance. Our evaluations further explore and clarify
the mechanisms by which our agents achieve skill-compatibility.
1 Introduction
As AI achieves superhuman performance in an increasing number of areas, a recurring theme
is that its behavior can be incomprehensible to agents of lower skill levels, or incompatible
with their behavior. Game-playing is a familiar instance of this; it is well-understood, for
example, that modern chess engines play in a style that is often alien to even the best human
players, to the extent that calling out “engine moves” (actions only computers would take)
is a staple of professional commentary. It is also standard practice for human players to use
moves outputted by chess engines as recommendations for their own play, only to find that
they cannot successfully follow them up. In a way, chess AI (which is "near-optimal") and
human players (as examples of less-skilled agents) "speak" very different dialects that are
often not mutually intelligible. This lack of compatibility leads to failures when less skilled
agents interact with optimal ones, in settings where the less-skilled parties may be humans
interacting with powerful AI systems, or simple heuristics interacting with much stronger ones.
Given this state of affairs, an important open question in any given domain is how to achieve
AI performance that is both high-level and compatible with agents of lower skill. How might
we accomplish this, and how would we know when we’ve succeeded?
In this work, we propose a training paradigm to create AI agents that combine strong
performance with skill-compatibility, we use our paradigm to train several skill-compatible
agents, and we illustrate their effectiveness on two novel chess tasks. Our paradigm is based on
the following idea: skill-compatible agents should still achieve a very high level of performance,
but in such a way that if they are interrupted at any point in time and replaced with a much
weaker agent, the weaker agent should be able to take over from the current state and still
perform well. We enforce a high level of performance by testing in an adversarial environment
where the opponent might have superhuman abilities, and we encourage robustness against
interruption and replacement by a weaker agent in chess (skill-compatibility) in two different
ways: independently at random after each move; and by decomposing individual moves into
a selection of a piece type and a subsequent selection of a move using that piece type, following
the popular chess variant known as “hand and brain.”
This interruption framework thus provides computationally powerful agents with an objective
function that balances two distinct goals: (i) it dissuades them from playing incompatible
“engine moves,” since such moves may strand a weaker agent in a state where it can’t find
a good next action even when one exists, but (ii) it still promotes high performance since
the goal is to win games despite interruptions from weaker counterparts. Our paradigm
thus suggests a strong and measurable notion of the interpretability of a powerful agent’s
actions: the actions are interpretable (and skill-compatible) in our sense if and only if a weaker
agent can find an effective way to follow up on them. This grounding is particularly useful
in complex settings such as our motivating domain of chess which contains actions where there
might be no succinct "explanation" (we know this both in practice, and also theoretically from
the fact that general game-tree evaluation is believed to be outside the complexity class NP.)
2 Background and Related Work
Chess and AI. Chess has a long history in AI research (Shannon, 1950), with milestones including Deep Blue (Campbell, 1999), followed by superhuman performance on commodity hardware, and more recently AlphaZero and its follow-ups (Silver et al., 2016; Schrittwieser et al., 2020). More recent work on the relation of algorithmically-generated chess moves to human behavior plays an important role in our work (McGrath et al., 2022; McIlroy-Young et al., 2020; Anderson et al., 2017; Maharaj et al., 2022).
Human-AI collaboration. Recent work has studied human-AI collaboration in a multi-agent scenario, where an AI agent and a weaker human agent work alongside each other to complete a task (Carroll et al., 2019; Strouse et al., 2021; Yang et al., 2022). One
distinction from our work is their notion of compatibility, where the focus has been on agents
working simultaneously on related tasks; in contrast, a central feature of our framework is
that the compatibility is inter-temporal, with the design goal that a less-skilled agent should
be able to take over at any point from the partially completed state of the AI agent’s progress.
Opponent modeling. Agents that interact with humans have made great strides in performance by modeling other human actors (Bard et al., 2019), whether by modeling opponents as optimal players (Perolat et al., 2022; Gray et al., 2020; Brown & Sandholm, 2018), or by building agents that communicate and collaborate with other human players in multiplayer games (Vinyals et al., 2019; Yu et al., 2021; (FAIR)† et al., 2022). Several prior works take a strictly adversarial, non-collaborative perspective, using MCTS in strong agents to safely exploit suboptimal play (Wang et al., 2022; Ganzfried & Sandholm, 2015).
2.1 Chess engines
We select chess as our model system due to the ready availability of both superhuman AI
agents and AI agents designed to emulate lower-skilled human players, the complexity of the
decision-making task, and the need for more understandable game AI agents that others can
interact with. We list the existing engines that we make use of in our work.
leela. leela (Lyashuk et al., 2023) is an open-source version of AlphaZero (Silver et al., 2018), a deep RL agent that consists of a neural network that evaluates boards (value) and suggests moves (policy) (Silver et al., 2016), both of which are used to guide a Monte Carlo Tree Search (MCTS) algorithm to select the next action (Jacob et al., 2022; Grill et al., 2020). The network is trained using repeated iterations of self-play followed by back-propagation. We use a small version of leela as the superhuman engine in our tests and as our baseline for comparison.

Figure 1: Stochastic Tag Team Framework

Figure 2: Hand and Brain Framework
maia. The maia engines (McIlroy-Young et al., 2020; 2022) are a set of human-like chess engines that capture human style at targeted skill levels. They are trained via classification on tens of millions of human games at a specified skill level, and as such maia does not use MCTS during play. Most of our work uses the weakest version, maia1100, which was trained on games from 1100-rated players (roughly the 20th percentile of skill) on the open-source online chess platform lichess.org. maia serves as our instantiation of sub-optimal, lower-skilled agents.
3 Methodology
3.1 Chess frameworks
Our goal is to develop chess agents that can behave in inherently skill-compatible ways, much
like skilled coaches tailor their actions to suit their students. In this work, we approach this task
via two proxy frameworks in which our engine “coaches”, or seniors, interact collaboratively
with weaker “students”, or juniors. There are two natural ways in which this collaboration
could take place within a sequential game setup: the seniors and juniors could alternate
between who takes the current action, or the seniors and juniors could collaboratively construct
each action. Our two frameworks follow these two directions to enable cooperation within
a chess game, suiting our experimental purpose of designing and evaluating skill-compatible
agents. The former can be likened to self-driving cars, where AI needs to be ready for a
hand-off to humans at a moment’s notice, whereas the latter resembles automatic stock
picking strategies where potential candidates are filtered by AI for humans to make final picks.
Note that both frameworks incorporate a strong and weak engine jointly in control of a single
color, and incorporate an element of stochasticity to preclude perfect prediction strategies
by agents that skirt the objective of achieving compatibility. We will generally refer to our
agents as playing the Focal roles against a team of opponent agents playing the Alter roles.
Stochastic Tag Team (STT). The STT setup consists of two teams, each in control of
a color on the chessboard. A team consists of two agents, the junior teammate, which is
generally maia, and the senior teammate, which is generally a stronger engine (e.g. leela,
or the engines we design). Prior to each move, Nature flips a fair coin to determine whether
the junior or senior makes the move, with no consultation with any other party allowed. This
setting introduces a cooperative aspect to chess, since a senior will need to be prepared for
the possibility that a weaker junior might be making the next move. At the same time, the
senior will need to play at a high level of chess, since the opponent senior will also be playing
at a high level, and it will simultaneously need to attempt to exploit the weaknesses in the
opponent junior. The STT framework thus allows us to explore scenarios where teammates and opponents can be both high-skilled and low-skilled, and the high-skilled AI is required to both perform at a high level and account for low-skill involvement in both teams. See Figure 1. We use the tuple ⟨S1 S2; J1 J2⟩ to denote a game played under STT, where S1 is the senior engine on the white team, J1 the junior engine on the white team, S2 the senior engine on the black team, and J2 the junior engine on the black team. Note that while the first
team is technically white and the second team black in this notation, we abuse the notation to
indicate multiple games played between these two teams, alternating between black and white.
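To make the framework concrete, the following minimal Python sketch implements the STT game loop described above, using the python-chess package purely for board bookkeeping; the random movers are illustrative stand-ins for the senior and junior engines, not the paper's implementations.

```python
import random
import chess

def random_mover(board: chess.Board) -> chess.Move:
    # Stand-in "engine": plays a uniformly random legal move.
    return random.choice(list(board.legal_moves))

def play_stt_game(white_team, black_team, rng=None):
    """Play one STT game. white_team and black_team are (senior, junior) pairs of
    callables (board -> move). Before every move, Nature flips a fair coin to
    decide which teammate plays; the realized flips form the game's bitstring
    ('1' = senior moved, '0' = junior moved)."""
    rng = rng or random.Random()
    board = chess.Board()
    bits = []
    while not board.is_game_over():
        senior, junior = white_team if board.turn == chess.WHITE else black_team
        use_senior = rng.random() < 0.5
        bits.append("1" if use_senior else "0")
        mover = senior if use_senior else junior
        board.push(mover(board))
    return board.result(), "".join(bits)

# Example with two all-random teams:
# result, bitstring = play_stt_game((random_mover, random_mover),
#                                   (random_mover, random_mover))
```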
Hand and Brain (HB). This cooperative setup, which has witnessed a massive increase in interest by grandmasters (GMs) and amateurs alike in recent years, also consists of two teams, each
in control of a color on the chessboard. A team consists of two agents, the brain agent (always
the stronger agent in our case), which selects the piece type to be moved (e.g., knight N), and the
hand (maia agent in our case), which then selects the specific piece and move to make given the
brain’s selection (e.g., move the knight on g1 to f3). See Figure 2. GM Hikaru Nakamura stated
that as the brain playing with a weaker hand, he often picks sub-optimal moves he finds more
suitable for his hand partner Nakamura (2021), which is in the spirit of our work. In contrast to
the previous framework, this framework exemplifies scenarios where the stronger agent “nudges”
the weaker agent, narrowing their action space, which then makes the action. We will use the tuple {H1 H2; B1 B2} to denote a game played under HB, where H1 and B1 are the hand and brain agents on the white team, and H2 and B2 are the hand and brain agents on the black team.
In our work, the brain will always be the stronger agent, and the hand the weaker engine that
chooses a move (stochastically from its distribution, to induce some randomness and ensure the
brains aren’t able to perfectly predict their decision) conditioned on the brain’s piece-type choice.
3.2 Methodologies to create skill-compatible seniors
We contribute three methodologies to create agents that outperform superhuman chess
engines (leela) in these two skill-compatibility frameworks. We emphasize that we are not
aiming to improve upon state-of-the-art chess engines. Instead, we are interested in designing
skill-compatible chess AI that can productively interact with weaker agents.
Tree agent. Our first agent is the maia engine augmented with MCTS. By exploring future
game states based solely on maia’s policies and values, the Tree agent inherently takes its
junior’s propensities into account when deciding what move to make. Notably, this agent only
requires a maia model and does not rely on a superhuman agent. It is also framework-agnostic
and thus implemented identically for both frameworks. In HB, the tree agent’s output is
filtered to only convey the piece type, as mentioned above.
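The following simplified sketch illustrates the idea behind the tree agent: a PUCT-style MCTS in which both the move priors and the leaf evaluations come solely from a maia-like model. The callable maia_policy_value, returning a move-to-prior dictionary and a value in [-1, 1] for the side to move, is an assumed interface for illustration; the actual agent's search may differ in its details.

```python
import math
import chess

class Node:
    def __init__(self, prior):
        self.prior = prior          # maia's policy probability for reaching this node
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}          # chess.Move -> Node

    def q(self):                    # mean value from this node's side-to-move view
        return self.value_sum / self.visits if self.visits else 0.0

def puct_select(node, c_puct=1.5):
    """Choose the child maximizing Q + U; Q is negated because child values are
    stored from the opponent's perspective."""
    sqrt_n = math.sqrt(node.visits)
    return max(node.children.items(),
               key=lambda it: -it[1].q()
               + c_puct * it[1].prior * sqrt_n / (1 + it[1].visits))

def simulate(board, node, maia_policy_value):
    """One simulation: returns the value from the perspective of the side to move
    at `board`, updating visit/value statistics along the path."""
    node.visits += 1
    if board.is_game_over():
        value = 0.0 if board.result() == "1/2-1/2" else -1.0    # side to move lost
    elif not node.children:                                     # expand with maia priors
        policy, value = maia_policy_value(board)
        for move, p in policy.items():
            node.children[move] = Node(p)
    else:
        move, child = puct_select(node)
        board.push(move)
        value = -simulate(board, child, maia_policy_value)      # flip perspective
        board.pop()
    node.value_sum += value
    return value

def tree_agent_move(board, maia_policy_value, n_simulations=800):
    # Assumes the game is not already over.
    root = Node(prior=1.0)
    for _ in range(n_simulations):
        simulate(board, root, maia_policy_value)
    return max(root.children.items(), key=lambda it: it[1].visits)[0]
```

In the full agent, the visit counts at the root determine the move played; in HB, only the piece type of that move is conveyed.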
Expector agent. We introduce a type of gold standard agent that conforms to the exact
setting of the frameworks. The expector agent has access to models of juniors/hands, and
maximizes its expected win probability w over a short time horizon given the identities of
the seniors and juniors. For STT, it simulates all possible bitstrings 2 plies into the future, selecting the move m = argmax_m E_{s∈{00,01,10,11}}[w | (m,s)]; and for HB it selects the piece that maximizes its expected win probability over maia's distribution of moves conditioned on that piece (Dp): p = argmax_p E_{m∈Dp}[w | m]. In the former case it requires a model of the
other three agents in the game to perform the simulation, and in the latter case it also needs
a model of its own hand to obtain the distribution Dp . In both cases, it requires access to
a strong agent to compute the win probabilities that the expectation is maximizing. Although
it requires no training, playing moves is expensive due to calls to multiple chess engines and
evaluations of the current board state. The version of the expector designed for STT will be denoted as expt, and the one designed for HB will be denoted as exph.
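A minimal sketch of the expt selection rule is given below, under stated assumptions: candidates is a short list of candidate moves, movers_by_ply[ply][bit] maps each future ply and coin-flip bit ('1' = senior, '0' = junior) to a model of the agent that would move, and win_prob is a strong evaluator returning the focal team's win probability; all of these callables are illustrative stand-ins rather than the paper's implementation.

```python
import chess
from statistics import mean

def expector_move(board, candidates, movers_by_ply, win_prob,
                  bitstrings=("00", "01", "10", "11")):
    """Return argmax_m E_s[w | (m, s)] over the equally likely bitstrings s of the
    next plies (for expt: the opponent's move, then the focal team's next move)."""
    def rollout_value(move, s):
        board.push(move)
        pushed = 1
        for ply, bit in enumerate(s):            # bit is the character '1' or '0'
            if board.is_game_over():
                break
            board.push(movers_by_ply[ply][bit](board))
            pushed += 1
        value = win_prob(board)                  # evaluate the resulting position
        for _ in range(pushed):
            board.pop()                          # restore the original board
        return value
    return max(candidates,
               key=lambda m: mean(rollout_value(m, s) for s in bitstrings))
```

The exph variant analogously averages win probabilities over the hand's move distribution conditioned on each piece type, and the pure-tricking and pure-helping variants discussed in the supplement correspond to restricting the set of bitstrings.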
Attuned agent. The attuned agent is a self-play RL agent that directly learns from playing
in the two frameworks. In contrast to learning from self-play in conventional chess, as leela
does, the attuned agents are created by generating games from self-play of leela and maia
teams in both frameworks. With a small training set, this method is essentially a fine-tuning
procedure for leela that takes into account maia’s interventions, rather than training a chess
engine from scratch. The Supplement includes full training details. This methodology is the
most practical of the three in its ability to be modified and generalize, and lies somewhere
between the rigidity of tree and the specificity of exp. However, it is the only agent of the
three we introduced that requires training. The version of the attuned agent trained on STT will be denoted as attt, and the one trained on HB will be denoted as atth. In HB, similarly to the tree agent, we only take the piece type of the outputted move.
Table 1: Game results for all agents in STT and HB

                              STT                                        HB
Game Setup                    tree     expt     attt     maia           tree     exph     atth     maia
⟨focal leela; maia maia⟩      56.5**   66.5**   55.0**   7.5*           60.0**   56.5**   55.0**   27.5*
focal | leela                 0.5**    14.0*    12.5*    0.0**          0.5**    N/A      4.5*     0.0**

** ≤ ±0.5, * ≤ ±1.5; see appendix for full error ranges. Note that exph does not output moves, so it cannot play leela directly.
4 Experiments
4.1 Agent strength in each framework
Our foremost goal is to quantify the objective performance of each agent on each framework.
Are they better at playing with weaker partners than state-of-the-art chess AI—that is, are they
skill-compatible? Here, we use maia1100 as the junior and hand agent for all analyses, and the
three methodologies will accordingly make use of maia1100 as a base model to guide the search
for tree, as a junior to guide the look-ahead for exp, and in the creation of the training dataset
for att. Our evaluation metric is the win-share over games, which is defined as (W + D/2)/n, where W is the number of wins, D the number of draws, and 1000 ≤ n ≤ 10000 the number of games,
from the perspective of the focal team. Equally-matched agents will each score a win-share of
50%, and scoring above 50% indicates a victory. We compute the standard error by treating the
experiments as random samples from a trinomial distribution, as detailed in the Supplement.
Table 1 shows the results for all three agents on both frameworks, STT and HB. As shown in the first row, which documents the performance of our focal agents in matchups against the state-of-the-art chess AI leela in our frameworks, all of our methodologies achieve a winning score (>50%) in both frameworks. In STT, the expt agent that performs a short look-ahead dominates (66.5%), and in HB it is the tree agent that scores the highest
(60%). Our main result is that all three methodologies produce agents that play well and
more intelligibly to weaker partners than state-of-the-art chess AI.
In order to validate that our focal agents are achieving their gains by explicitly accounting for
the presence of the weaker partners, we eliminate the two most pressing potential confounders.
The first hypothesis we rule out is that our focal agents are simply stronger than leela, which
the second row in Table 1 shows is evidently not the case. In fact, they are significantly weaker,
losing most of their games to leela in head-to-head regular-chess matchups. tree performs
particularly poorly (<1%), likely because it is unrelated to leela’s weights, unlike exp and
att. It is striking that tree outperforms att on both STT and HB despite being significantly
weaker, indicating that it must compensate with larger adaptation to and synergy with its
lower-skilled partner. The second hypothesis we rule out is whether leela is a particularly bad
senior/brain due to its strength, and our focals are better at adapting to their hands/brains
simply because they are weaker and more similar to them. Replacing our focals with maia
as the senior/brain (the most similar, weakest possible senior/brain) refutes this idea (see last
column). We note that, while att and exp are weaker than leela, they still achieve non-trivial
scores (>10%) in regular chess versus leela, indicating they are still strong chess agents.
We have established our central result: Our focal agents are objectively weaker than standard
state-of-the-art AI, but their compatibility with maia is more than sufficient to defeat leela
in both collaborative frameworks. We now investigate the mechanisms of skill-compatibility.
4.2 Mechanisms of achieving skill-compatibility in STT
How do our agents achieve skill-compatibility? In this section, we answer this question by
analyzing agent behavior at the individual move level. We define the win probability loss of a
move, which measures the degree to which any given move is sub-optimal. Notice that any chess
move is either optimal, meaning it preserves the win probability of the previous position (as
evaluated by a strong engine such as leela), or is sub-optimal, meaning it degrades the agent’s
win probability by a certain amount. We will define the win probability loss of a move, or simply
Table 2: Average losses for agents in STT

Agent                      Gt        Ge        Ga           Agent                      Gt       Ge       Ga
leela                      1.15**    1.16**    1.17**       maial                      4.46*    4.21*    4.19**
focal                      1.91**    1.37**    1.43**       maiaf                      3.57*    3.37**   3.77**
∆Gf(leela, focal, *)       -0.76**   -0.21**   -0.26**      ∆Gf(maial, maiaf, *)       0.89*    0.84*    0.42*
∆Gf(teaml, teamf, *)       0.13*     0.63*     0.16**

** ≤ ±0.02, * ≤ ±0.04; see appendix for full error ranges
Table 3: Different effects of seniors on juniors in STT. Effects that are stronger (p<0.05) than that of the opposing senior are in bold.

                                      Gt                   Ge                   Ga
Agent                                 leela     tree       leela     exp        leela     att
Tricking:                             -0.03**   0.54**     -0.27**   1.38**     -0.01**   0.25**
Helping (Interceding Junior):         0.26*     0.36*      0.30**    0.61**     0.34**    0.21**
Helping (Interceding Senior):         0.15*     0.46*      0.12*     1.88**     0.20**    0.25**
Indirect:                             0.56*                -0.18**              0.37**

** ≤ ±0.07, * ≤ ±0.11; see appendix for full error ranges
the loss, as the difference between the win probability of the board preceding the move and that of the board following it. It ranges from 0 to 100. Given an agent A and a condition C on moves played by A from a set of games Gf, we define a mean value LGf(A,C) as follows: LGf(A,C) = (1/|S|) Σ_{m∈S} Loss(m), where S = {m ∈ Gf | m satisfies C and m is played by A}. To
compare losses between agents A1 and A2 , we define ∆Gf (A1 ,A2 ,C) = LGf (A1 ,C)−LGf (A2 ,C),
where setting C = ∗ means taking all moves.
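As a small worked example of this bookkeeping, the sketch below computes LGf(A,C) and ∆Gf(A1,A2,C) from per-move records; the MoveRecord format is an assumption for illustration, not the paper's data format.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class MoveRecord:
    agent: str
    loss: float          # win-probability loss of the move, in [0, 100]

def L(records, agent, cond=lambda r: True):
    """L_Gf(A, C): mean loss over moves played by `agent` that satisfy `cond`."""
    return mean(r.loss for r in records if r.agent == agent and cond(r))

def delta(records, a1, a2, cond=lambda r: True):
    """Delta_Gf(A1, A2, C) = L_Gf(A1, C) - L_Gf(A2, C); cond defaults to '*'."""
    return L(records, a1, cond) - L(records, a2, cond)
```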
For this section, we will be referring to agents as they appear in the tuple ⟨focal leela; maiaf maial⟩, where
focal refers to one of tree, expt , or attt ; leela denotes leela playing as the alter; maiaf
refers to the Focal team’s junior maia agent, and maial refers to the Alter team’s junior maia
agent. Note that maiaf and maial are both maia1100 agents. We will refer to this tuple
as Gf for simplicity, with f being the starting letter of the corresponding focal. All agents
and games in this section are in STT, so we omit the specifying subscript.
4.2.1 General effects on junior performance
Table 2 shows the average loss by the agents involved at every move from the games played in STT.
For all focal agents, we have LGf(focal,∗) > LGf(leela,∗), yet LGf(maiaf,∗) < LGf(maial,∗).
This is a more granular, move-level statement of our central result: our focal engines sacrifice
some optimality for the ability to influence and be skill-compatible with the sub-optimal
agents they are interacting with.
Note also that LGt (tree,∗) > LGa (att,∗), yet ∆Gt (maial ,maiaf ,∗) > ∆Ga (maial ,maiaf ,∗),
implying tree’s results are more dependent on influencing the juniors present in the game.
exp combines low absolute ∆Ge (focal,leela,∗) with higher ∆Ge (maial ,maiaf ,∗) to get the
best overall team loss difference among the three agents, which explains its higher score in
the main evaluations.
To compare how these findings vary with position strength, we plot ∆Gf (maial ,maiaf ,i),
where i stipulates that the moves originate from boards whose win probability falls in the range i (Figure 4). For all focal engines, the gap is increasing in i: the closer the board is to winning,
the more maiaf outperforms maial . Winning boards are thus more critical, where the junior’s
moves have a chance to throw the game, compared to losing situations where the junior cannot
bring about large positive changes to the evaluation.
Figure 3: Ratio of probability of excess loss induction over different loss magnitudes

Figure 4: ∆Gf(maial, maiaf) as a function of different board win probabilities
4.2.2 Tricking, Helping, and Indirect Effects in STT
Having established that our focal agents induce a gap between maiaf and maial performance,
we investigate three possible mechanisms by which they can achieve this: an immediate
“tricking” effect, which we define as an induction of loss in maial over a 1-ply horizon (the
alter junior’s next move), an immediate “helping” effect, which we define as reduction of loss
in maiaf over a 2-ply horizon (the focal junior’s next move), and finally an indirect effect
that cannot be measured over these short horizons (for example, early choices that impact
how the game unfolds many moves into the future).
The chief difficulty of comparing the focals’ effect on the juniors to that of leela on the juniors
in the games played is that direct comparison would be potentially confounded by the different
board distributions the opposing teams faced. We deal with the different distributions using two
distinct methods detailed in the appendix. We present in Table 3 the main results derived from the first method. Bold values indicate that the senior exhibits the effect in question. All agents
display some tricking effect, with the magnitude being largest for exp (1.38) and smallest for att
(0.25). We do not observe a helping effect for att, but do so for the other focals. In particular,
the helping effect for exp is dramatically higher when the opponent leela intercedes (1.88 vs
0.61), and we believe this is due to the agent’s preference for tricking the opponent junior when
it intercedes instead of helping its own junior down the horizon. Finally, exp has no beneficial
indirect effect on the juniors (meaning longer than its optimization horizon of 2 plies), whereas
the tree and att have a measurable indirect effect even when they do not precede the juniors.
Additionally, extra analysis in the supplement suggests that this indirect effect is more important
than the tricking effect for these engines. These results demonstrate that there are multiple ways
to influence the juniors, and the strongest agent, exp, distinguishes itself by a complete lack
of a beneficial indirect effect in favour of strong immediate effects, both helping and tricking.
4.3 Mechanisms of achieving skill-compatibility in HB
We now turn to the HB framework. Note that Gf now refers to the tuple {focal leela; maiaf maial}, and agents without subscript will be referring to those created for the HB framework. We
examine the effect of our focal agents (playing as brain) on the maia agents (playing as hand),
and focus on two mechanisms: intra-team effects, where the brain picks a piece that causes
maia to pick a better/worse move than it would have without interference, and inter-team
effects, where the teams affect each other.
We first examine intra-team effects by computing the savings, the difference between the loss of the move maia would have selected without interference from the brain agent and the loss of the team's actual move played. Interestingly, Table 4 shows that exp is the only brain
exhibiting such an effect (0.3%), which is natural as it has been explicitly instructed to minimize
the expected loss of its hand. tree, which is the best performing agent in the framework,
actually has negative savings (-0.2%), meaning its influence is causing its own hand to play worse
moves. To analyze results more closely, we inspect the interaction between hands and brains.
There are four key hand-brain interactions: agreement (same move chosen), blindsiding
(different moves but same piece-type, allowing the hand’s move), correction (hand resamples to
match the brain's different piece-type move), and disagreement (hand selects a different move after forced resampling). Detailed results on the proportions of each are in the supplement.

Table 4: Comparison of team loss to hypothetical maia loss without brain in HB. Better in bold.

             Gt                    Ge                    Ga                    maia as focal
             leela      tree       leela      exp        leela      att        leela      maia
             (maial)    (maiaf)    (maial)    (maiaf)    (maial)    (maiaf)    (maial)    (maiaf)
True loss    3.76**     3.55**     3.44**     3.33**     3.61**     3.50**     3.73*      4.25*
maia loss    3.75**     3.29**     3.40**     3.62**     3.59**     3.43**     3.39*      3.60*
Savings      -0.01**    -0.22**    -0.04**    0.3**      -0.02**    -0.07**    -0.32*     -0.65*

** ≤ ±0.03, * ≤ ±0.07; see appendix for full error ranges
For all brains, the correction case yields some savings (4%–6%), and the disagreement yields
a drop in the performance of the hands (1%–2%). Furthermore, the savings in the correction
case are lower for tree and att than they are for leela. tree’s effect does not appear to
come from savings, but rather, we see that when tree corrects its hand, it induces a high
loss in the opponents’ next move compared to leela’s correction (4.8% vs 3.8%), showing
a tricking action exerted by the tree on the opponent team.
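For concreteness, the sketch below shows one way the four interaction types could be told apart from logged moves, assuming we record the hand's unconstrained move, the brain's internally intended move (of which only the piece type is conveyed), and the move the team actually played; these names are illustrative and may differ from the paper's logging code.

```python
def classify_interaction(hand_move, brain_move, final_move, piece_type_of):
    """Classify one Hand-and-Brain interaction given the hand's unconstrained
    choice, the brain's intended move, and the move actually played."""
    if hand_move == brain_move:
        return "agreement"           # same move chosen by hand and brain
    if piece_type_of(hand_move) == piece_type_of(brain_move):
        return "blindsiding"         # different moves, same piece type: hand's move stands
    if final_move == brain_move:
        return "correction"          # forced resample lands on the brain's intended move
    return "disagreement"            # forced resample lands on yet another move
```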
As tree and att both increase their agreement with maia, we test the strategy of maximizing
agreement and eliminating disagreement (as well as correction and blindsiding) entirely by
letting maia play on its own against leela as the alter senior and maia as the alter junior,
and it loses to leela with a 40% ±1.5 score. This confirms that disagreement-induced loss
is more than compensated for by the benefit of correction of a strong brain.
4.4 Exploring imperfect partner modelling
All our experiments thus far have explored the results of senior agents designed to play with
maia1100 (meaning a generic maia trained on many 1100 rated player games), tested on
maia1100. Note that the term "designed for junior X" means, for att: that it trains with
junior X; for tree: that it runs MCTS using X to guide the search tree; and for exp: that it uses
junior X to simulate the next couple of moves. Now we investigate cross-compatibility, meaning
whether seniors designed for these generic maia juniors are compatible with juniors that they are
not explicitly designed for. We investigate three possible instantiations of cross-compatibility:
cross-skill compatibility, specific player compatibility, and cross-style compatibility, and observe
some success with the first two and an inability to generalize to radically different juniors.
4.4.1 Cross-skill compatibility
Here, we use maia1900, the strongest available version of maia trained on data of 1900 rated
players, as an alternative junior/hand. We analogously create tree, exp, and att agents
designed to be compatible with maia1900. In this section, focalx denotes a focal designed for
compatibility with maia1x00. Testing focal9 agents with maia1900 as the junior in Table 8 shows that they are able to beat leela in the frameworks, with the exception of tree9 producing no gains in STT. We test cross-compatibility of the focal1 and focal9 agents by partnering
them with each other’s juniors. We emphasize that focal9 agents are not exposed to maia1100
prior to testing, and vice versa. As shown in Table 8 (appendix), focal9 agents are always
compatible with maia1100 as a junior, irrespective of focal and framework, while the same
is not always true of focal1 agents with maia1900. We hypothesize the asymmetry is due to
focal1 agents being more aggressive in playing suboptimally in order to exploit the junior,
which backfires when maia1900 does not fall for exploits, whereas focal9 agents do not get
explicitly punished if they over-conservatively fail to set up a trap for maia1100.
4.4.2 Specific player compatibility in STT
We turn to validating whether the focal agents we created here using the generic maia models
are compatible with individualized engines that are fine-tuned to mimic particular human
players McIlroy-Young et al. (2020). We do so with the objective of exploring the applicability of
our method in the case where a complete model of the opponent/partner junior is not available
(as is the case in most situations). Our initial experiment consists of individualized models
of 2 players (rated 1650 and 1950), for which we designed specialized seniors, and compared
the performance of these specialized seniors to that of seniors designed for generic maia1900.
In these games, generic focal agents generally win against leela, though by a smaller margin than when a senior is matched with the junior it was designed for (Table 10). We now turn to more extensive player generalization testing.
Accordingly, we selected a random subset of 23 models, each trained on a particular human Lichess player rated between 1400 (43rd percentile skill) and 1900 (83rd percentile skill), as the juniors for this experiment. We then use an exp that internally uses the generic maia agent with the nearest rating to the player in question, and have it play in STT with that player's model. The median score of the different exp agents in this scenario with the 23 different juniors is 53.3% (±1%). While this is below exp's performance of 66.5% in Table 1, it nonetheless demonstrates that a generic approximation is sufficient to encapsulate
skill-compatibility with individuals. In fact, out of 23 different players tested, exp was shown
to be winning (p<0.05) in 18 of the cases after 3000 games played, with the experiments on
the remaining 5 players not showing statistical significance to that level.
4.4.3 Cross-style compatibility in STT
We now investigate using a completely different junior based on a non-neural architecture. To do that, we calibrate a low-depth version of stockfish (we call it sfw) to play at the skill level of maia1100; a comparison in the appendix demonstrates that it is very different from any agent used so far. Note that there is no tree agent in this setting. As shown in Table 9 (appendix), exp is able to obtain positive results with sfw, but att is unable to. Generalization is nonexistent, with engines designed for maia1100 losing when tested with an sfw junior and vice versa. We attribute this to the lack of similarity between sfw and maia1100. This indicates that our agents' compatibility is not merely a function of skill, but also of style alignment.
5 Limitations and Discussion
Our work proposes a methodology for creating powerful agents that are skill-compatible
with weaker partners. Our key finding is that, in a complex decision-making setting such as chess,
skill-compatibility is a qualitatively and quantitatively measurable attribute of agents distinct
from raw ability on the underlying task. Our designed frameworks show that in situations
where strong engines are required to collaborate with weak engines, playing strength alone
is insufficient to achieve the best results; it is necessary to achieve compatibility, even at the
cost of pure strength. Finally, our three methodologies, each distinct in design and method of
operation, demonstrate that there are multiple viable techniques to create agents that achieve
this form of compatibility, with different agents using different strategies in-game. Indeed,
some strategies center on helping the weak engine make better moves should it assume control,
while others explicitly disrupt the compatibility of the adversary, forcing weak opponent agents
into errors, which even a strong partner like leela is unable to mitigate.
Our work therefore is an empirical proof-of-concept for skill-compatibility in chess, and
provides a roadmap for the creation of human-compatible agents in this domain and beyond.
While our paper does not speak directly to the prospect of skill-compatibility in other domains,
we believe that a number of the techniques here are relatively general in nature, with clear
analogues to other settings. For example, while the tree agent is very chess specific, and exp is
difficult to run in continuous environments, a number of ideas underlying these methodologies
— exp’s short-term prediction, att’s tandem training with the targeted skill — can be easily
modified to fit different tasks, and offer potential for skill-compatibility in these tasks.
The training frameworks we propose use human-like maia agents as weak partners, and a natural
next direction for future work is the design of experiments to test these methods with human
chess players. Our scope likewise did not include instantiating the stronger partner beyond
using leela, which offers opportunity to test robustness to modifications of the environment.
References
Anderson, Ashton, Kleinberg, Jon, & Mullainathan, Sendhil. 2017. Assessing human error
against a benchmark of perfection. ACM Transactions on Knowledge Discovery from Data
(TKDD), 11(4), 1–25.
Bard, Nolan, Foerster, Jakob N., Chandar, A. P. Sarath, Burch, Neil, Lanctot, Marc, Song,
H. Francis, Parisotto, Emilio, Dumoulin, Vincent, Moitra, Subhodeep, Hughes, Edward,
Dunning, Iain, Mourad, Shibl, Larochelle, H., Bellemare, Marc G., & Bowling, Michael H.
2019. The Hanabi Challenge: A New Frontier for AI Research. Artif. Intell., 280, 103216.
Brown, Noam, & Sandholm, Tuomas. 2018. Superhuman AI for heads-up no-limit poker:
Libratus beats top professionals. Science, 359(6374), 418–424.
Campbell, Murray. 1999. Knowledge discovery in deep blue. Communications of the ACM,
42(11), 65–67.
Carroll, Micah, Shah, Rohin, Ho, Mark K, Griffiths, Tom, Seshia, Sanjit, Abbeel, Pieter, &
Dragan, Anca. 2019. On the utility of learning about humans for human-ai coordination.
Advances in neural information processing systems, 32.
(FAIR)†, Meta Fundamental AI Research Diplomacy Team, Bakhtin, Anton, Brown, Noam,
Dinan, Emily, Farina, Gabriele, Flaherty, Colin, Fried, Daniel, Goff, Andrew, Gray,
Jonathan, Hu, Hengyuan, et al. 2022. Human-level play in the game of Diplomacy by
combining language models with strategic reasoning. Science, 378(6624), 1067–1074.
Ganzfried, Sam, & Sandholm, Tuomas. 2015. Safe opponent exploitation. ACM Transactions
on Economics and Computation (TEAC), 3(2), 1–28.
Gray, Jonathan, Lerer, Adam, Bakhtin, Anton, & Brown, Noam. 2020. Human-level
performance in no-press diplomacy via equilibrium search. arXiv preprint arXiv:2010.02923.
Grill, Jean-Bastien, Altché, Florent, Tang, Yunhao, Hubert, Thomas, Valko, Michal,
Antonoglou, Ioannis, & Munos, Rémi. 2020. Monte-Carlo tree search as regularized policy
optimization. Pages 3769–3778 of: International Conference on Machine Learning. PMLR.
Jacob, Athul Paul, Wu, David J, Farina, Gabriele, Lerer, Adam, Hu, Hengyuan, Bakhtin,
Anton, Andreas, Jacob, & Brown, Noam. 2022. Modeling strong and human-like gameplay
with KL-regularized search. Pages 9695–9728 of: International Conference on Machine
Learning. PMLR.
Lyashuk, Alexander, & et al. 2023. leela. https://lczero.org/. Accessed: 2023-02-11.
Maharaj, Shiva, Polson, Nick, & Turk, Alex. 2022. Chess AI: competing paradigms for
machine intelligence. Entropy, 24(4), 550.
McGrath, Thomas, Kapishnikov, Andrei, Tomašev, Nenad, Pearce, Adam, Wattenberg,
Martin, Hassabis, Demis, Kim, Been, Paquet, Ulrich, & Kramnik, Vladimir. 2022.
Acquisition of chess knowledge in alphazero. Proceedings of the National Academy of
Sciences, 119(47), e2206625119.
McIlroy-Young, Reid, Sen, Siddhartha, Kleinberg, Jon, & Anderson, Ashton. 2020. Aligning
superhuman ai with human behavior: Chess as a model system. Pages 1677–1687 of:
Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery
& Data Mining.
McIlroy-Young, Reid, Wang, Russell, Sen, Siddhartha, Kleinberg, Jon, & Anderson, Ashton.
2022. Learning Personalized Models of Human Behavior in Chess. Proceedings of the 28th
ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
Nakamura, Hikaru. 2021. In Hand and Brain chess, is the stronger player generally preferred to
be the hand or the brain? https://chess.stackexchange.com/a/34973. Accessed: 2023-04-07,
Original video source no longer available.
Perolat, Julien, De Vylder, Bart, Hennes, Daniel, Tarassov, Eugene, Strub, Florian, de Boer,
Vincent, Muller, Paul, Connor, Jerome T, Burch, Neil, Anthony, Thomas, et al. 2022.
Mastering the game of Stratego with model-free multiagent reinforcement learning. Science,
378(6623), 990–996.
Schrittwieser, Julian, Antonoglou, Ioannis, Hubert, Thomas, Simonyan, Karen, Sifre, Laurent,
Schmitt, Simon, Guez, Arthur, Lockhart, Edward, Hassabis, Demis, Graepel, Thore, et al.
2020. Mastering atari, go, chess and shogi by planning with a learned model. Nature,
588(7839), 604–609.
Shannon, Claude E. 1950. XXII. Programming a computer for playing chess. The London,
Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 41(314), 256–275.
Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche,
George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc,
et al. 2016. Mastering the game of Go with deep neural networks and tree search. nature,
529(7587), 484–489.
Silver, David, Hubert, Thomas, Schrittwieser, Julian, Antonoglou, Ioannis, Lai, Matthew,
Guez, Arthur, Lanctot, Marc, Sifre, Laurent, Kumaran, Dharshan, Graepel, Thore, et al.
2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through
self-play. Science, 362(6419), 1140–1144.
Strouse, DJ, McKee, Kevin, Botvinick, Matt, Hughes, Edward, & Everett, Richard. 2021.
Collaborating with humans without human data. Advances in Neural Information
Processing Systems, 34, 14502–14515.
Vinyals, Oriol, Babuschkin, Igor, Czarnecki, Wojciech M., Mathieu, Michaël, Dudzik, Andrew,
Chung, Junyoung, Choi, David H., Powell, Richard, Ewalds, Timo, Georgiev, Petko, Oh,
Junhyuk, Horgan, Dan, Kroiss, Manuel, Danihelka, Ivo, Huang, Aja, Sifre, L., Cai, Trevor,
Agapiou, John P., Jaderberg, Max, Vezhnevets, Alexander Sasha, Leblond, Rémi, Pohlen,
Tobias, Dalibard, Valentin, Budden, David, Sulsky, Yury, Molloy, James, Paine, Tom Le,
Gulcehre, Caglar, Wang, Ziyun, Pfaff, Tobias, Wu, Yuhuai, Ring, Roman, Yogatama, Dani,
Wünsch, Dario, McKinney, Katrina, Smith, Oliver, Schaul, Tom, Lillicrap, Timothy P.,
Kavukcuoglu, Koray, Hassabis, Demis, Apps, Chris, & Silver, David. 2019. Grandmaster
level in StarCraft II using multi-agent reinforcement learning. Nature, 1–5.
Wang, Tony Tong, Gleave, Adam, Belrose, Nora, Tseng, Tom, Dennis, Michael D, Duan, Yawen,
Pogrebniak, Viktor, Miller, Joseph, Levine, Sergey, & Russell, Stuart. 2022. Adversarial policies beat professional-level go ais. In: Deep Reinforcement Learning Workshop NeurIPS 2022.
Yang, Mesut, Carroll, Micah, & Dragan, Anca. 2022. Optimal Behavior Prior: Data-Efficient
Human Models for Improved Human-AI Collaboration. arXiv preprint arXiv:2211.01602.
Yu, Chao, Velu, Akash, Vinitsky, Eugene, Wang, Yu, Bayen, Alexandre M., & Wu, Yi. 2021.
The Surprising Effectiveness of MAPPO in Cooperative, Multi-Agent Games. ArXiv,
abs/2103.01955.
6 Supplement
6.1 Code Release
Our code is released at github.com/CSSLab/skill-compatibility-chess. We also include
several of our trained models.
6.2 Methodology Details
6.2.1 Testing and error range computation
To test a particular focal in STT, we run games of the form ⟨focal leela; maia maia⟩. For any bitstring s, we
play 2 games, with the focal and leela teams switching between black and white. This is done
because some bitstrings are biased in favour of a particular color, and we therefore eliminate this
bias by having each team play both sides of the bitstring. Additionally, unfair bitstrings are undesirable, as with them the color is likely more relevant than the team in achieving victory; hence, they represent a source of unbiased noise in the result, adding one win for each team. It is difficult to quantify the fairness of a bitstring (it is not sufficient to ensure an equal number of senior and junior moves; the order matters). Therefore, for consistency, all experiments to test individual focal agents sample from the same set of bitstrings during testing, eliminating the possibility that some agents sample more unfair (and hence noise-contributing) bitstrings during their testing.
No analogous measures are applicable for HB, as stochasticity is internal to the teams rather
than being a characteristic of a game.
For both frameworks, testing consists of between 1000 and 10000 games, depending on our
targeted significance. Here we detail computation of the standard error se for the win-share
displayed in the main section of the paper.
If we play n games, with W wins and L losses, we can write the empirical win-share as 0.5 + (ŵ − l̂)/2, where ŵ and l̂ are the empirical unbiased estimators of the true probabilities w and l, computed as W/n and L/n respectively. Since w, l, and d (the probability of a draw) form a trinomial distribution, we have

V(ŵ − l̂) = w(1−w)/n + l(1−l)/n + 2wl/n,

which simplifies to (w + l − (w−l)²)/n. Plugging this variance in to get the se of the win-share, we have

se = 0.5 · sqrt((w + l − (w−l)²)/n).

This is maximized with w = l = 0.5, giving se ≤ 0.5/√n. Our data is presented in percentages, so setting n = 10000 guarantees se ≤ 0.5%, and n = 1000 guarantees se ≤ 1.5%. In practice, we are often able to get lower errors with lower n because w ≠ l.
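The derivation above can be transcribed directly; the following small function computes the empirical win-share and its trinomial standard error (a worked example, not new methodology).

```python
from math import sqrt

def win_share_and_se(wins, draws, losses):
    n = wins + draws + losses
    w, l = wins / n, losses / n
    share = (wins + draws / 2) / n                 # equals 0.5 + (w_hat - l_hat) / 2
    se = 0.5 * sqrt((w + l - (w - l) ** 2) / n)    # trinomial standard error
    return share, se

# Worst case w = l = 0.5: win_share_and_se(5000, 0, 5000) -> (0.5, 0.005),
# i.e. se = 0.5 / sqrt(n) = 0.5% for n = 10000.
```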
All other quantities in our paper are sample means, with large sample sizes allowing the central
limit theorem to be used to obtain their standard errors.
6.2.2 leela
We use the 128x10-t60-2-5300 leela network, obtained from Vieri^1, with a 1500-node search. Against stockfish 13 (60k nodes), a strong classical engine that uses alpha-beta search, this version of leela obtains a score of 59 ± 3. Meloni^2 benchmarks stockfish 13 against human elo,
so we deduce that our version of leela plays at around 3050 elo. While leela can be made
significantly stronger if more nodes are used in search, limiting it to 1500 allows us to generate
more games and run more tests, while still retaining superhuman capability. We use 1500
nodes for all seniors, for consistency. To compute the win probabilities of boards, needed to
conduct most of our loss analysis, we use a separate instantiation of leela with the same
parameters. In practice though, the leela used for evaluation is stronger than the leela
playing as senior, because evaluations occur multiple times per move for different statistics,
which, due to caching, is equivalent to working with more nodes.
6.2.3 att
To create att, a dataset of 10000 games (80% train, 10% validate, and 10% test) is generated of the game ⟨leela leela; maia maia⟩ for STT or {leela leela; maia maia} for HB. Then, starting with leela's weights, and using a learning rate of 10^-5 and 10000 iterations, we run back-propagation to update leela's policy and value neural network. We use the version of maia for which we are attempting to achieve compatibility in this training scheme.
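As a rough illustration of this fine-tuning step, one can picture the procedure as continuing supervised training from pretrained weights on the moves and final results of the generated games; the generic PyTorch loop below is only a sketch under that assumption, and the network interface and data loader are assumed stand-ins rather than the paper's training code.

```python
import torch
import torch.nn as nn

def finetune_att(net: nn.Module, loader, iterations=10_000, lr=1e-5, device="cpu"):
    """Sketch: fine-tune a pretrained policy/value network on (board, played move,
    game result) triples extracted from the STT/HB self-play games."""
    net.to(device).train()
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    policy_loss_fn = nn.CrossEntropyLoss()   # target: the move actually played
    value_loss_fn = nn.MSELoss()             # target: the final game result in [-1, 1]
    it = iter(loader)
    for _ in range(iterations):
        try:
            boards, moves, results = next(it)
        except StopIteration:
            it = iter(loader)
            boards, moves, results = next(it)
        boards, moves, results = boards.to(device), moves.to(device), results.to(device)
        policy_logits, values = net(boards)  # assumed two-headed output
        loss = policy_loss_fn(policy_logits, moves) + value_loss_fn(values.squeeze(-1), results)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net
```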
As this is a tuning task, and we are starting with leela weights, we perform some parameter
tuning, with the objective not necessarily to find optimal parameters, but rather to find a
set thereof that compromises between cost and robustness. The main time cost comes from
generating the datasets of games, and training the agents.
To do so, we train models with learning rates from 10^-1 to 10^-6 on dataset sizes of 1000, 10000, and 100000 games, and a short training time of 1000 iterations. We test these models with a small number of matches that is sufficient to determine whether the training procedure produced viable engines (viable simply means plausible, not necessarily successful). See Table 5. Some learning rates catastrophically fail and exhibit nonsensical learning curves, or produce
agents that lose a large majority of their games.
From the above, we decide to use 10000 games as our dataset size, as it appears that a range
of learning rates are viable on it, and it is not as costly to generate as 100000. We also settle
on a learning rate of 10^-5, as this rate is more robust on small datasets than 10^-4 and requires less time to converge than 10^-6.
We note that there appears to be a connection between the quality of the policy and value accuracy curves and actual performance: models that overfit or fail to converge also perform poorly when tested on the frameworks. Accordingly, we observe that 100000 iterations produces more complete convergence curves for this particular learning rate and dataset size choice.
After having selected hyper-parameters which we deem plausible, we run our first full training
run 8 times to ensure that performance on the framework following training is repeatable,
rather than a product of chance. The worst model of the 8 has a win-rate of 53.5 ±1, and
the best model a win-rate of 56.0 ±1.
Note that our goal is to find valid, stable hyper-parameters, which does not preclude the
existence of better sets of hyper-parameters. For all other models (meaning different frameworks or different maia juniors), we use these parameters, and if convergence issues arise,
we modify parameters heuristically. This type of modification was not actually needed for any
models in the main paper, but was required for a few additional models which we detail here.
The rest of the hyper-parameters can be found in the configuration files in the linked code.
^1 https://lczero.org/dev/wiki/best-nets-for-lc0
^2 https://www.melonimarco.it/en/2021/03/08/stockfish-and-lc0-test-at-different-number-ofnodes/
6.2.4 exp
exp is more expensive to run than the standard neural network engines, as it makes multiple calls to engines as subroutines to compute the move that maximizes the expectation. Accordingly, it is infeasible to do a full search over all legal moves, as that may consume up to hundreds of times as much compute. Likewise, these engines use shallow leela engines for board evaluation to determine their move selection. For STT, we compute the expectation over the top 5 moves, with evaluation conducted at 300 nodes, whereas in HB, for each piece, we compute the expectation over the top 3 maia moves (meaning 18 moves checked in total), with evaluation conducted at 50 nodes.
6.2.5 Detailing method 1 used in Section 4.2.2
While involved, the technique used here comes with the added benefit that no engine plays additional moves outside what has already been played within the evaluation games, and therefore the comparison is more pertinent to the games themselves. To study these three mechanisms, we
analyze special sequences on the board. Since in STT agent selection is done via coin flipping, games can be represented as a bitstring s, where 1s indicate senior moves and 0s indicate junior moves. We now detail how we compute the values present in Table 3. By computing the loss on moves that are preceded by specific substrings, we can isolate different effects. Accordingly, we define LGf(A,s), where s is a condition that stipulates that the moves must be preceded by a (partial) bitstring s in Gf. For example, LGf(maiaf,1) means that we are computing the loss of maiaf only when its move comes immediately after leela's, whereas LGf(maiaf,0) means the moves come after maial's.
As mentioned earlier, in order to compare the tricking effect of a senior S1 to a senior S2 , we do not
directly compare LGf (J2 ,1) (measuring how S1 tricks J2 ) to LGf (J1 ,1) (measuring how S2 tricks
J1 ) , as J1 and J2 are playing on different board distributions (the distributions of the board of the
two teams are not the same, the simplest example being that the focal team will have more winning boards). We perform an indirect comparison of each senior’s effect to its own junior’s effect,
as they share a board distribution, and all necessary moves are already present. We do the same
for both seniors and then compare these quantities. Formally, we define IGf(A,s) = LGf(A, 1⊕s) − LGf(A, 0⊕s), where ⊕ is concatenation, the 1 and 0 denote the comparison of the loss induced
by the senior being present in that position to that by the junior, and s is a string to standardize
the agents that play in between should we be measuring an effect across more than 1 ply.
The immediate tricking effect of a senior S under this definition can thus be computed as
IGf (Jopp ,ϕ) where ϕ is the empty string and Jopp is the opposing junior.
To measure the immediate helping effect of a senior S on its partner junior Jpar, note that the opposing team must play a move prior to Jpar, and there are two possible situations depending
on which opponent plays. There are thus two helping effects to be measured, IGf (Jpar ,1) and
IGf (Jpar ,0), denoting the opponent senior and junior being the interceding agents, respectively.
Finally, in order to examine the indirect effect, we check whether the focal agents affect the performance of their juniors over longer time horizons. We compute this as ∆Gf(maial, maiaf, 00), where the two zeros indicate that the move must not be preceded by any senior move for 2 plies.
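To make the conditioning concrete, the sketch below computes LGf(A,s) and IGf(A,s) from logged moves; the LoggedMove format, in particular the preceding bitstring field, is an assumption for illustration rather than the paper's data format.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class LoggedMove:
    agent: str
    loss: float
    preceding: str   # bits of the moves played just before, in chronological order
                     # e.g. "10": a senior moved 2 plies ago, a junior 1 ply ago

def L_cond(records, agent, prefix):
    """L_Gf(A, s): mean loss of `agent` on moves preceded by the bitstring s."""
    return mean(r.loss for r in records
                if r.agent == agent and r.preceding.endswith(prefix))

def I(records, agent, s=""):
    """I_Gf(A, s) = L_Gf(A, 1+s) - L_Gf(A, 0+s): extra loss `agent` incurs when a
    senior, rather than a junior, moved len(s)+1 plies earlier, with the
    intermediate movers fixed by s."""
    return L_cond(records, agent, "1" + s) - L_cond(records, agent, "0" + s)

# Tricking effect of a senior: I(records, opposing_junior, "")
# Helping effects:             I(records, partner_junior, "0") and I(records, partner_junior, "1")
```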
6.2.6 Detailing method 2 used in Section 4.2.2
We perform this analysis with leela and each focal agent on two separate board distributions:
the set of boards that leela’s team and that the focal’s team encountered in-game. Table 6
shows these results for different focal agents. exp induces similar loss in maia regardless of the
board distribution (4.47 ≈ 4.39). This is in line with exp’s myopic optimization objective. The
gulf between exp and leela remains large regardless of whether we compare the distributions as seen in game (4.39 > 3.25) or standardize the board distribution (4.39 > 3.65 and 4.47 > 3.25). In contrast, there is a notable degradation in the loss induction abilities of tree and att when eliminating distributional effects (3.89 < 4.85 and 3.95 < 4.46). Consequently, the standardized comparisons of these focals to leela show a much closer result than the unstandardized in-game observations. Interestingly, comparing along the minor diagonal shows that leela induces
more loss than these two agents (4.50 > 3.89 and 4.45 > 3.87), suggesting that the distributional
effects induced by tree and att are actually more important than their direct effects.
6.3 Hardware
We made use of four Tesla K80 GPUs for experimentation, each with 12 GB of VRAM. We show in Table 11 the times taken for the most important tasks of the paper.
As an example, if we wish to generate the games for an attt agent, train it, test it to se = 0.5,
and obtain analysis metrics, it would take 10 hours for generation, 3 hours for training, 10
hours for testing, and 12 hours for analysis, a total of 35 hours. Alongside experiments that are not included in the paper, we estimate the total compute time used by the project to be approximately one year's worth of our GPUs.
6.4 Extra Experiments
6.4.1 Agreement rate of various engines
We compute the agreement rate of various engines used in our experiments in Figure 5. Notice
how the att agents from both frameworks and calibrated to both juniors all play similar
moves to each other and to leela, on which they are based. maia1100 and maia1900 also
exhibit similarity to each other and dissimilarity to att agents. tree exhibits moderate
similarity to both the maia agents and the leela-derived agents. sfw, a weakened version
(25 nodes) of stockfish 8 used in a later experiment, is entirely dissimilar to any other agent,
likely due to its architectural uniqueness.
6.4.2 Modification of training target for att in STT
The training procedure for att includes back-propagating on moves that both leela and maia play in STT. Semantically, this updates the value-head to take into account maia's interventions; however, it also updates the policy-head to learn maia's moves, which we believed would weaken the engine. Therefore, we created a version of the engine that only learns leela's moves, so as not to affect the policy with maia's moves, and only to have the value-head change to adapt to maia's interventions. We also use an increased training set of 40000 games because convergence with smaller datasets was worse (exclusion of maia moves halves the quantity of data and reduces its diversity, making it prone to overfitting). Interestingly, this turns out to be less effective in STT (52.0%±0.5 < 55.0%±0.5) than the default version that includes the maia policy moves, although the engine turns out to be very strong in raw strength, achieving 36.5%±1.5 against leela compared to the default's 14.0%±1.5. It is expected that this engine is stronger, as it is not ingesting maia moves, and we suspect that in STT, training the policy on maia moves is actually beneficial, as it allows the agent to conduct search at least partially based on maia's moves, mimicking tree to an extent.
6.4.3 Comparison of tricking vs helping strategies in STT
In order to compare the effect of attempting purely to sabotage maial to that of purely aiding maiaf, we modify the algorithm of exp to m = argmax_m E_{s∈{0,1}}[w | (m,s)], which effectively optimizes only one ply into the future, eliminating any helping effect (which requires at least 2 plies of foresight) and making this a pure tricking engine. In order to isolate the helping effect, we use m = argmax_m E_{s∈{10,11}}[w | (m,s)], which (falsely) assumes an un-exploitable leela is always playing the opponent move, thereby forcing the optimization to be solely to help maiaf. Both versions are able to beat leela in the framework; however, the tricking version achieves a higher score of 62.5%±1 as opposed to the helping version, which achieves a score of 57.0%±1, both lower than that achieved by the actual exp. We suspect that pure tricking is easier to conduct than pure helping, as there is no interceding agent to account for and a shorter time horizon, hence less branching.
6.5 Extra Figures and Tables
Figure 5: Agreement rate of different engines with each other
Table 5: Viability of attt engines created according to learning rate and dataset size

Learning rate      10^-1   10^-2   10^-3   10^-4   10^-5   10^-6
1000 games         X       X       X       X       ✓       ✓
10000 games        X       X       X       ✓       ✓       ✓
100000 games       X       X       ✓       ✓       ✓       ✓
6.6 Main Paper Tables with Standard Deviation
Tables 12-15 show the standard deviation of each value from the main text.
Table 6: maia loss induced by different seniors in distributions occurring to different teams in STT

maia loss induced by                      Gt                      Ge                      Ga
                                          leela       tree        leela       exp         leela       att
Distribution of leela's team (maial)      3.60±0.02   3.89±0.02   3.25±0.02   4.47±0.02   3.87±0.02   3.95±0.02
Distribution of focal's team (maiaf)      4.50±0.03   4.85±0.03   3.65±0.02   4.39±0.02   4.33±0.01   4.46±0.01
Table 7: Metrics by interaction type for tree and att in HB

                      Agreement            Correction           Disagreement         Blindsiding
                      leela     focal      leela     focal      leela     focal      leela     focal
                      (maial)   (maiaf)    (maial)   (maiaf)    (maial)   (maiaf)    (maial)   (maiaf)
Gt Distribution       44±1      60±1       20±1      16±1       21±1      14±1       15±1      10±1
Gt Savings            0         0          5.4±0.1   4.3±0.1    -2.0±0.1  -2.1±0.1   0         0
Gt Opponent loss      3.2±0.1   3.4±0.1    3.8±0.1   4.8±0.1    4.0±0.1   4.4±0.1    3.3±0.1   3.8±0.1
Ga Distribution       44±1      52±1       20±1      20±1       20±1      16±1       16±1      11±1
Ga Savings            0         0          5.2±0.1   4.6±0.1    -1.8±0.1  -1.9±0.1   0         0
Ga Opponent loss      3.2±0.1   3.2±0.1    3.8±0.1   4.0±0.1    3.9±0.1   3.9±0.1    3.3±0.1   3.7±0.1
Table 8: Generalization results with maia1100 and maia1900 as junior partners.

STT framework
                          tree                     expt                     attt
Tested on junior          maia1100    maia1900     maia1100    maia1900     maia1100    maia1900
Designed for maia1100     56.5±0.5    41.5±1.5     66.5±0.5    53.0±0.5     55.0±0.5    51.5±0.5
Designed for maia1900     51.0±0.5    50.0±0.5     55.0±0.5    65.5±1.0     52.0±0.5    53.0±0.5

HB framework
                          tree                     exph                     atth
Tested on junior          maia1100    maia1900     maia1100    maia1900     maia1100    maia1900
Designed for maia1100     60.0±0.5    52.5±0.5     55.0±0.5    45.5±1.5     56.0±0.5    37.5±1.5
Designed for maia1900     57.0±0.5    58.0±0.5     51.5±0.5    55.0±0.5     54.0±0.5    54.0±0.5
Table 9: Generalization results with maia1100 and sfw on STT

                          expt                     attt
Tested on junior          maia1100    sfw          maia1100    sfw
Designed for maia1100     66.5±0.5    44.0±1.0     55.0±0.5    44.0±1.0
Designed for sfw          43.5±1.0    55.0±0.5     48.5±0.5    51.0±0.5
Table 10: Generalization results with specific players in STT

Player A (1650 rating)
Senior designed for     Att           Tree        Exp
Maia1900                51 ±0.5       52.5 ±0.5   57 ±2
Player A                Not trained   51.5 ±0.5   68 ±2

Player B (1950 rating)
Senior designed for     Att           Tree        Exp
Maia1900                51 ±0.5       52 ±0.5     53 ±2
Player B                Not trained   46.5 ±0.5   66.5 ±2
Table 11: Approximate times of main tasks involved in experimentation

Task                                  Approximate Time (h)
1000 STT games, no exp                1
1000 STT games, with exp              2-3
1000 HB games, no exp                 2
1000 HB games, with exp               2-3
1000 games with metric collection     4
10000 training iterations             3
Table 12: Table 1 with standard deviations

                              STT                                                      HB
Game Setup                    tree       expt       attt       maia          tree       exph       atth       maia
⟨focal leela; maia maia⟩      56.5±0.5   66.5±0.5   55.0±0.5   7.5±1.5       60.0±0.5   56.5±0.5   55.0±0.5   27.5±1.5
focal | leela                 0.5±0.5    14.0±1.5   12.5±1.5   0.0±0         0.5±0.5    N/A        4.5±1.5    0.0±0
Table 13: Table 2 with standard deviations

Agent                      Gt           Ge           Ga             Agent                      Gt          Ge          Ga
leela                      1.15±0.01    1.16±0.01    1.17±0.01      maial                      4.46±0.04   4.21±0.04   4.19±0.02
focal                      1.91±0.01    1.37±0.01    1.43±0.01      maiaf                      3.57±0.03   3.37±0.02   3.77±0.02
∆Gf(leela, focal, *)       -0.76±0.02   -0.21±0.02   -0.26±0.02     ∆Gf(maial, maiaf, *)       0.89±0.04   0.84±0.04   0.42±0.03
∆Gf(teaml, teamf, *)       0.13±0.03    0.63±0.04    0.16±0.02
Table 14: Table 3 with standard deviations

                                    Gt                        Ge                        Ga
Agent                               leela        tree         leela        exp          leela        att
Tricking: I(Jopp, ϕ)                -0.03±0.06   0.54±0.07    -0.27±0.04   1.38±0.05    -0.01±0.06   0.25±0.04
Helping: I(Jpar, 0)                 0.26±0.11    0.36±0.10    0.30±0.07    0.61±0.07    0.34±0.06    0.21±0.05
Helping: I(Jpar, 1)                 0.15±0.09    0.46±0.09    0.12±0.09    1.88±0.07    0.20±0.06    0.25±0.05
Indirect: ∆(maial, maiaf, 00)       0.56±0.10                 -0.18±0.07                0.37±0.06
Table 15: Table 4 with standard deviations

            Gt                         Ge                         Ga                         maia as focal
            leela        tree          leela        exp           leela        att           leela        maia
            (maial)      (maiaf)       (maial)      (maiaf)       (maial)      (maiaf)       (maial)      (maiaf)
True loss   3.76±0.02    3.55±0.02     3.44±0.02    3.33±0.02     3.61±0.02    3.50±0.02     3.73±0.05    4.25±0.05
maia loss   3.75±0.02    3.29±0.02     3.40±0.02    3.62±0.02     3.59±0.02    3.43±0.02     3.39±0.05    3.60±0.05
Savings     -0.01±0.03   -0.22±0.03    -0.04±0.03   0.3±0.03      -0.02±0.03   -0.07±0.03    -0.32±0.07   -0.65±0.07