MLGym: A New Framework and Benchmark for Advancing AI Research Agents

Deepak Nathani1†, Lovish Madaan2,7, Nicholas Roberts3†, Nikolay Bashlykov7, Ajay Menon7, Vincent Moens5, Amar Budhiraja7, Despoina Magka6, Vladislav Vorotilov7, Gaurav Chaurasia7, Dieuwke Hupkes7, Ricardo Silveira Cabral7, Tatiana Shavrina7, Jakob Foerster6, Yoram Bachrach6, William Yang Wang1, Roberta Raileanu2,7

1 University of California, Santa Barbara, 2 University College London, 3 University of Wisconsin–Madison, 4 University of Oxford, 5 PyTorch Core Libraries at Meta, 6 FAIR at Meta, 7 GenAI at Meta
† Work done during internship at Meta

We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-Bench consists of 13 diverse and open-ended AI research tasks from domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier large language models (LLMs) on our benchmark, including Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, as well as develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.

Date: February 21, 2025
Correspondence: Deepak Nathani at dnathani@ucsb.edu, Roberta Raileanu at raileanu@meta.com
Code: https://github.com/facebookresearch/MLGym

1 Introduction

Accelerating scientific discovery has been a long-standing ambition in artificial intelligence (AI) research, with early initiatives like the Oak Ridge Applied Artificial Intelligence Project, launched in 1979 (Team, 1985; Emrich et al., 1988; Johnson and Schaffer, 1994). More recent explorations enabled by advances in foundation models (Achiam et al., 2023; Anthropic, 2024; Team et al., 2024; Dubey et al., 2024) provide a proof-of-concept of a fully automated pipeline for end-to-end paper generation (Lu et al., 2024). In the future, we envision AI Research Agents capable of independently conducting literature search, generating scientific hypotheses, designing experiments, implementing new methods, analyzing results, disseminating findings by writing scientific papers, and applying this research in products, thus assisting with all parts of the research process. Such agents should be capable of working either fully autonomously or under human supervision, taking user feedback into account.
This vision stems from the recognition that AI, with its capacity to process vast datasets and discern complex patterns, could accelerate scientific breakthroughs in areas such as drug discovery and materials science by identifying promising drug candidates or predicting the properties of novel materials (Hessler and Baringhaus, 2018; Schneider et al., 2020; Guo et al., 2021). Unlike traditional methods, AI agents can reveal hidden interdisciplinary relationships by analyzing vast knowledge graphs, leading to novel insights and solutions for complex challenges like climate modeling. By automating laborious tasks and exploring unconventional avenues, AI agents can liberate scientists to focus on higher-level cognitive activities, ultimately driving innovation and expanding the frontiers of knowledge.

Figure 1 Diagram of MLGym, a unified framework designed to integrate diverse and open-ended AI research tasks into a single platform for developing and evaluating LLM agents on these tasks.

Machine learning (ML) research, with its emphasis on empirical validation and systematic experimentation in simulation, presents an ideal testbed for exploring and improving the utility of LLMs for advancing scientific research. However, the scientific method inherently relies on empirical validation, rigorous evaluation, and standardized benchmarks to ensure the reliability and reproducibility of findings. While significant progress has been made in developing AI agents for various domains (Yang et al., 2024; Wu et al., 2024; Ma et al., 2024; Deng et al., 2023; Wang et al., 2023), we currently lack comprehensive frameworks and benchmarks specifically designed to assess their capabilities in conducting open-ended AI research tasks in diverse domains. This absence of standardized evaluation tools hinders our ability to objectively measure progress and identify areas for improvement in this emerging field.

Recently, a number of papers have started to evaluate LLM agents on various SWE and ML tasks; notable examples include SWE-Bench (Jimenez et al., 2023), SWE-agent (Yang et al., 2024), ScienceAgentBench (Chen et al., 2024), SUPER (Bogin et al., 2024), MLE-Bench (Chan et al., 2024), MLAgentBench (Huang et al., 2024), and RE-Bench (METR, 2024). However, existing benchmarks for AI Research Agents either do not include open-ended research tasks, or only cover a narrow range of research domains. In addition, existing frameworks are not designed to enable research on different training algorithms for AI Research Agents such as reinforcement learning, curriculum learning, or open-ended learning. Finally, current frameworks do not allow flexible artifacts to be evaluated (e.g. different outputs of the agent’s research such as a model, algorithm, or set of predictions).

In this paper, we introduce MLGym—the first Gym (Brockman et al., 2016) environment for AI Research Agents and a unified framework designed to integrate diverse and open-ended AI research tasks into a single platform for developing and evaluating LLM agents on such tasks (see Figure 1 for a diagram of MLGym). Being a Gym environment, our framework enables research on different training algorithms for AI Research Agents such as reinforcement learning (RL), curriculum learning, and open-ended learning.
We also release MLGym-Bench, a curated set of 13 open-ended research tasks, covering a wide range of domains such as computer vision, natural language processing, reinforcement learning, and game theory, carefully crafted to evaluate the performance of agents in realistic, multifaceted workflows. MLGym and MLGym-Bench expand the range of problems considered by current LLM agent frameworks and benchmarks, by offering the ability to flexibly evaluate performance on open-ended research tasks. For example, performance can be measured based on various artifacts such as model weights, RL training algorithms, or code representing game theory strategies. We compare five frontier LLMs across the tasks in MLGym-Bench under consistent experimental settings, highlighting their strengths and limitations. Finally, we propose a new evaluation metric for agents, adapted from the optimization (Dolan and Moré, 2002) and automated machine learning (AutoML; Roberts et al., 2022a) literature, to more fairly assess the relative performance of LLM agents across tasks with their own distinct performance metrics.

To summarize our contributions, we (i) introduce MLGym, the first Gym environment for evaluating and developing AI Research Agents, (ii) release MLGym-Bench, a suite of diverse open-ended AI research tasks for evaluating LLM agents, (iii) propose a new evaluation metric for comparing multiple agents on a variety of tasks, and (iv) extensively evaluate frontier LLMs on MLGym-Bench. Finally, MLGym makes it easy for researchers and developers to integrate and evaluate new tasks, agents, or models. In the rest of the paper, we discuss related LLM agent frameworks and benchmarks, provide an overview of the MLGym framework, introduce the mechanics behind MLGym-Bench and its evaluation, present our experimental setup and results, and conclude with a discussion of limitations and future extensions.

1.1 Capability Levels for AI Research Agents

We propose a hierarchical framework to categorize the capabilities of LLM agents for accelerating AI research. This framework consists of six levels, each representing a distinct degree of autonomy and scientific contribution.

Level 0: Reproduction At this level, LLM agents can reproduce existing research papers either with or without access to the original code. This level demonstrates a basic understanding of the research domain and the ability to replicate established results.

Level 1: Baseline Improvement At Level 1, LLM agents can improve performance on a benchmark given baseline code that is not state-of-the-art (SOTA). This level indicates the ability to analyze and optimize existing solutions, even if they are not the most advanced.

Level 2: SOTA Achievement At Level 2, LLM agents can achieve SOTA performance on a benchmark given only a task description and access to the published literature before the invention of the SOTA approach, but no access to the SOTA paper or code. This level demonstrates the ability to come up with a solution to an open research problem which is as good as the one found by humans.

Level 3: Novel Scientific Contribution At Level 3, LLM agents can make a novel scientific contribution, such as coming up with a new method that establishes a new SOTA on multiple benchmarks and is worthy of publication at a top ML conference such as NeurIPS.
Level 4: Groundbreaking Scientific Contribution At Level 4, LLM agents can identify key research questions, directions, and solutions, and make a notable scientific contribution worthy of an oral presentation or a best paper award at a prestigious ML conference such as NeurIPS.

Level 5: Long-Term Research Agenda At Level 5, LLM agents can pursue a long-term research agenda, coming up with the research questions, directions, and solutions, continuously producing scientific discoveries over the span of weeks, months, or years. LLMs at this level should be capable of paradigm-shifting research breakthroughs worthy of prizes such as the Nobel or Turing.

By defining these capability levels, we provide a framework for evaluating frontier AI Research Agents. MLGym-Bench focuses on Level 1: Baseline Improvement of the categorization defined above.

2 Related Work

2.1 AI Research Frameworks and Benchmarks

Table 1 shows a comparison between MLGym and MLGym-Bench with other related LLM agent frameworks and benchmarks. Below, we expand on the differences between MLGym and these works. First, MLGym is the first framework for AI Research Agents that provides a Gym interface, making it easy to integrate and train these agents using RL algorithms. MLGym-Bench is also the first benchmark to include tasks that require research on algorithms in multiple domains such as RL, game theory, or SAT.

Benchmark          | Gym Interface | Algorithmic Tasks | Open-Ended Research | Flexible Artifacts | Agentic Harness
MLGym (ours)       | ✓             | ✓                 | ✓                   | ✓                  | ✓
MLE-Bench          | ✗             | ✗                 | ✗                   | ✗                  | ✗
SWE-Bench/Agent    | ✗             | ✗                 | ✗                   | ✗                  | ✓
MLAgentBench       | ✗             | ✗                 | ✓                   | ✓                  | ✓
RE-Bench           | ✗             | ✗                 | ✓                   | ✓                  | ✗
ScienceAgentBench  | ✗             | ✗                 | ✗                   | ✗                  | ✗

Table 1 Comparison of MLGym and MLGym-Bench with other related LLM agent frameworks and benchmarks. Algorithmic Tasks refers to the inclusion of tasks that require coming up with new algorithms, such as reinforcement learning, game theory or SAT problems. Open-Ended Research refers to the inclusion of tasks that are not fully solved by the research community and where multiple new solutions could be discovered, such as language modeling, game theory or SAT problems. Flexible Artifacts refers to the allowance of different research artifacts such as model weights, reinforcement learning algorithms, or code capturing an agent’s strategy.

Second, MLGym-Bench encompasses a wide range of open-ended AI research tasks, covering supervised learning, language modeling, reinforcement learning, game theory and SAT. In contrast, SWE-Bench/SWE-Agent (Yang et al., 2024) focuses on solving GitHub issues, so the code changes either fix the issue or not (as opposed to optimization tasks with finer-grained metrics, such as a loss metric in a supervised learning problem). Similarly, MLE-Bench (Chan et al., 2024) includes narrowly scoped machine learning tasks from Kaggle competitions. While these tasks have a spectrum of quality levels, they tend to be already solved by current state-of-the-art methods. On the other hand, MLAgentBench (Huang et al., 2024) contains both ML-specialized tasks (regression, classification, code speed improvements) and tasks focused on recent research challenges (e.g. CLRS reasoning corpus (Veličković et al., 2022), BabyLM challenge (Oba et al., 2023)). RE-Bench (METR, 2024) also consists of broadly scoped ML engineering tasks which are hard to saturate and reward increasingly sophisticated approaches.
ScienceAgentBench (Chen et al., 2024) incorporates data-driven scientific discovery tasks extracted from peer-reviewed publications, but these are so specific that they resemble Kaggle competitions rather than open research questions. Third, MLGym allows for flexible evaluation artifacts: it is sufficient to provide Python code that the agent can call to examine the quality of its current solution, such as a model checkpoint or an RL algorithm. In contrast, MLE-Bench requires a CSV file to be submitted for grading each question, and SWE-Bench/Agent requires evaluating a piece of code through a collection of unit tests. MLAgentBench, RE-Bench and ScienceAgentBench provide Python scripts to compute the evaluation scores. Finally, MLGym enables easy evaluation of both models and agents. To facilitate model evaluation, MLGym provides a default agentic harness that can be used out-of-the-box to evaluate any base model.

2.2 LLM Agents

Research on tool-augmented LLMs (Schick et al., 2023) has inspired a new research agenda of “agentic” LLMs (Kaddour et al., 2023; Wang et al., 2024a), where LLMs interact with an external environment. Existing work explores teaching LLMs to use tools or APIs (Schick et al., 2023; Qin et al., 2023), navigate the web (Nakano et al., 2022; Deng et al., 2023; Zhou et al., 2023), interface with operating systems (Wu et al., 2024), play games (Paglieri et al., 2024; Wang et al., 2023), or interact with other simulated (Wang et al., 2024b; Lin et al., 2023) or physical worlds (Zhang et al., 2024a). Evaluating agentic LLMs typically involves designing controlled environments, providing suitable tools, defining tasks and goals, and establishing quantitative metrics to measure the system’s performance.

Building on these directions, Yoran et al. (2024) introduce AssistantBench, emphasizing the complexity of open-web navigation and showcasing how current systems struggle with realistic, time-consuming tasks such as monitoring real-estate markets or identifying nearby businesses. Meanwhile, Kapoor et al. (2024) highlight the importance of standardized evaluation protocols that consider both accuracy and cost, warning against overfitting and advocating for more reproducible benchmarks. Extending these concerns to multi-dimensional environments, Liu et al. (2023) propose AgentBench—a suite of eight interactive settings that test agents’ capacity for reasoning, decision-making, and long-term instruction following. Similarly, Mialon et al. (2023) focus on holistic planning skills through GAIA, a benchmark designed to assess performance on real-world questions requiring robust tool-use and multimodal reasoning, revealing substantial gaps between human-level proficiency and current LLMs. Finally, Trivedi et al. (2024) emphasize the necessity of sophisticated tool integration with AppWorld, an interactive environment where agents must operate diverse applications via APIs and generate complex code in an iterative fashion. Collectively, these works underscore not only the breadth of agentic LLM capabilities but also the pressing need for systematic, multifaceted benchmarks that capture complex tasks with verifiable results and foster reproducible progress in the field. However, none of these works focuses on evaluating or developing LLM agents for open-ended AI research tasks.
2.3 Agents for Software Engineering and Data Science

In line with the principle of reproducibility and verifiability, software engineering tasks provide a testbed for LLM agents, where tasks can be tightly scoped and outcomes rigorously measured. Recent work has explored how agents can tackle code-level challenges in controlled settings that permit systematic evaluation. As discussed above, Yang et al. (2024) introduce SWE-agent, which operates within a constrained agent-computer interface to facilitate file creation, repository navigation, and code testing—thereby enhancing both traceability and reproducibility on benchmarks such as SWE-bench and HumanEvalFix. Similarly, Wang et al. (2024c) describe OpenHands, a platform that restricts agent interactions to sandboxed environments for safer command execution and verifiable web browsing, and in doing so provides a standardized foundation for benchmarking. Magentic-One (Fourney et al., 2024) is another agentic system competent in software engineering but also augmented with web navigation capabilities, as demonstrated by its strong performance on the GAIA, AssistantBench and WebArena (Zhou et al., 2023) agentic benchmarks. On the other hand, Zhang et al. (2024b) achieve competitive performance on SWE-bench with AutoCodeRover, which, unlike the agentic approaches, solves GitHub issues by combining LLM-based programming with program representation as an abstract syntax tree.

Towards the goal of automating data science work, Li et al. (2024) introduce AutoKaggle, a multi-agent human-assisting system, and Grosnit et al. (2024) present AgentK v1.0, an end-to-end autonomous data science agent; both of these systems perform well on Kaggle competition data. Still within the realm of data science work, Lei et al. (2024) build Spider 2.0, a challenging benchmark and code agent framework for automating text-to-SQL workflows. Going one step further, Cao et al. (2024) introduce Spider 2-V, an autonomous multimodal agent coupled with a benchmark focusing on the automation of enterprise data science and engineering workflows.

More search-oriented approaches include SWE-Search (Antoniades et al., 2024), a multi-agent framework that marries Monte Carlo Tree Search (MCTS) with iterative refinement, enabling agents to continuously evaluate and improve their approaches to repository-level tasks. In a similar vein, Koh et al. (2024b) explore tree search for LLM agents and show that equipping LLM agents with best-first search boosts performance on the WebArena and VisualWebArena (Koh et al., 2024a) agentic benchmarks. Also on augmenting LLM agents with search, Yu et al. (2025) propose MCTS-based test-time search and self-learning techniques that yield better performance on VisualWebArena. Finally, Xia et al. (2024) demonstrate that even relatively simple approaches can excel when thoroughly monitored: an ’agentless’ system follows a three-step process and outperforms more complex agent-based methods on SWE-bench Lite, underscoring the value of constrained, verifiable environments in driving reproducible gains for autonomous SWE agents.

2.4 Agents for Scientific Research

Controlled SWE contexts build the foundation for more complex automation while maintaining a reproducible and verifiable approach. However, the software foundations alone are not sufficient to close the remaining gaps towards the goal of accelerating science.
Going from limited environments and well-defined tasks with metrics towards the less-defined area of open-ended questions, substantial effort is needed to boost the capabilities of research agents. For instance, coming up with automatable criteria to gauge scientific novelty, or constructing theories that build on automated findings from heterogeneous disciplines, are examples of areas that need more refinement and experimentation. Nevertheless, the first steps on this path can be taken now in the fields of ML research and data science, since these areas offer a scientific playground with tasks that are well-defined and have formal criteria of verifiability (benchmarks and tests), falsifiability (ablation studies and tests for data leakage, memorization, out-of-domain generalization, etc.), and reproducibility.

Data Science Many recent works approach both classic data science tasks and real-life repository-based tasks as a testbed for agents with a known test set and metrics. While based on similar grounds, these works differ in the resulting levels of autonomy of the agents. For instance, ML-Bench (Tang et al., 2024) focuses on explicit tasks within existing GitHub repositories — evaluating agents in code-centric setups without delving into open-ended objectives. By contrast, Data Interpreter (Hong et al., 2024) extends agent testing to broader data science problems, spanning coding tasks, mathematical reasoning, and a limited suite of open-ended applications (e.g., OCR, web search, and mini-game generation), thus reflecting a more flexible approach to autonomy. The agentic benchmark SUPER (Bogin et al., 2024) raises the bar by requiring the agent to formulate the task itself and iterate on NLP-related data and tasks within research repositories, thereby emphasizing self-directed problem-solving.

AI Research Because machine learning itself is built on models and simulations, the field naturally becomes an object of automation as well. Having an agent formulate a task itself and approach open-ended tasks naturally leads to automatic agentic enhancement of the machine learning methods themselves. AutoML (Eggensperger et al., 2019; Lindauer and Hutter, 2020; Tornede et al., 2023) and NAS (Elsken et al., 2019; Nasir et al., 2024) approaches have previously been paving the foundations of ML automation within environments with built-in restrictions (an explicit set of methods, a definition of the search space and strategy), while the agentic approach can propose open-ended solutions without said specifications. For example, MLAgentBench (Huang et al., 2024) consists of an environment for agents to solve 13 complex tasks ranging from improving image classification to language modeling, with current state-of-the-art LLMs achieving a 0% success rate on the most difficult of these tasks. The proposed pipelines for agents in the environment include designing and running experiments, analyzing the results, and iterating towards improving the defined metrics. Similarly, RE-Bench (Research Engineering Benchmark) (METR, 2024) is a set of 7 diverse and challenging ML tasks with the methodological addition of real human expert involvement and progress comparison: timed sessions for ML experts vs LLM agents. The authors state that the best agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment.
However, humans currently display better returns to increased time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2x the score of the top agent when both are given 32 total hours. MLE-bench (Chan et al., 2024) focuses on Kaggle tasks as a source for agentic evaluations. Agents are evaluated against well-defined metrics, datasets, and real competition result distributions, with attempts limited to 24 hours. However, in contrast with MLGym, all these works cover a narrower set of domains and do not assess algorithmic reasoning capabilities. Moreover, some of them do not provide a standardized agentic harness for model evaluation, but instead vary both the harness (also known as a scaffold) and the LLM when comparing performance. While our work focuses on creating an evaluation framework with objective and standardized evaluation metrics, other recent works focus on developing an agentic harness for the more subjective task of generating papers based on end-to-end experimental cycles (Lu et al., 2024).

Scientific Discovery Several recent works have approached scientific automation with LLM agents targeting the process of scientific discovery. DiscoveryWorld (Jansen et al., 2024) is a benchmark for scientific agents evaluated in a game-like virtual discovery environment. Its 120 tasks require an agent to form hypotheses, design and run experiments, analyze results, and act on conclusions – for areas like proteomics, chemistry, archeology, physics, agriculture, rocket science, linguistics, or epidemiology. The custom simulation engine only supports a limited list of objects and 14 possible actions. A distinctive feature of the work is that it focuses on general discovery skills rather than task-specific solutions, and the assessment and the space of objects and actions are common to all scientific domains. ScienceAgentBench (Chen et al., 2024), however, approaches the similar goal of building a discovery-based agentic benchmark differently: the tasks are based on 44 cherry-picked peer-reviewed publications that include data-driven discovery tasks with well-defined metrics. The scientific areas covered include bioinformatics, computational chemistry, geographical information science, and neuroscience, yielding 102 tasks of various types, such as data processing, modeling or visualization. Each task is defined by a Python-based evaluation environment, end-result metrics, and intermediate evaluation criteria. Special metrics control for data contamination and agent shortcut issues. Comparing different baselines, including pure LLMs with prompting, the authors state that execution feedback is necessary for agents to generate useful solutions.

The idea of execution feedback and iterative improvement for research tasks has also been proposed in ResearchAgent (Baek et al., 2024). Its agentic, concept-based approach with literature-based discovery shows large improvements for end-to-end iterative solution generation, also supported by ablations comparing knowledge-based and random facts. The agent is evaluated solely with subjective human preference annotations and automated preference evaluations. While such evaluation covers structured aspects of the end-to-end experimental pipeline (problem clarity, feasibility, significance, relevance, originality, method generalizability, innovativeness, experiment reproducibility, validity, etc.), relying solely on human judgment without supporting it with objective metrics is insufficient, as Si et al. (2024) show.
3 MLGym

An LLM agent can perform ML research and development by interacting with a shell environment through a sequence of commands. Given a task description, some starter code, and access to its action and observation history, the LLM generates appropriate shell commands to accomplish research objectives such as generating ideas, processing data, implementing new methods, training and evaluating models, analyzing the results, and reasoning about what experiments to run next. The agent is iteratively prompted to take actions based on the task description and execution feedback from previous commands, allowing it to develop and self-refine its solutions in-context.

MLGym provides a unified framework for evaluating and developing agents and models for AI research tasks. We take inspiration from the long-established field of RL and build a Gym (Brockman et al., 2016) environment that can execute shell commands in a local Docker container. MLGym provides access to four core components: Agents, Environment, Datasets, and Tasks. MLGym’s modular design allows one to easily utilize and extend the library. For example, researchers can implement other agentic harnesses to improve performance, expand the environment by adding more tools for an agent, add more datasets within a given task (e.g., if the task is image classification they could add ImageNet in addition to CIFAR-10), and even add more tasks to the MLGym benchmark. Below, we discuss each component in detail.

3.1 Agents

The Agent class provided by MLGym acts as a wrapper around a base LLM and provides functionality for integrating various base models, history processors, and cost management. Moreover, unlike other frameworks (Huang et al., 2024; Yang et al., 2024), MLGym separates the agent from the environment, allowing for easy integration of external agents. This also enables one to fairly compare different base models given the same agentic harness, without the need to implement their own agentic orchestration. The agent is expected to take the history of all prior observations and actions as input and return the next action to take. The provided action is then passed to the environment, which executes the command and returns the next observation based on the command output. The agent can execute any bash command in the environment. In addition, it has access to a set of tools (i.e., bash scripts such as editing a file) that it can use similarly to any other bash command. MLGym provides an agent adapted from SWE-Agent (Yang et al., 2024) as a default agentic harness. We describe the design and configuration of the tools in Section 3.5. The full system prompt used can be found in Listing 1.

3.2 Environment

MLGym environments are designed as Gymnasium (gym) environments (Towers et al., 2024). The environment component is responsible for initializing a shell environment in a local Docker container with all the required tools, installing task-specific Python dependencies, copying all the necessary data and code into a separate agent workspace, and managing interactions between the LLM agent and the system. Moreover, to support open-ended research tasks and make the environment safe and flexible, the MLGym environment also manages permissions for various files and directories. Specifically, when running in a Docker container, due to various security concerns associated with using a root user, we create a non-root user named "agent" and set the appropriate permissions for the working directory.
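Since MLGym environments follow the Gymnasium interface, an external agent interacts with them through the familiar reset/step loop. Below is a minimal illustrative sketch of this pattern; the environment id, the contents of the observations, and the LLMAgent wrapper are hypothetical placeholders, not MLGym's actual API.

```python
import gymnasium as gym

class LLMAgent:
    """Hypothetical wrapper around a base LLM that maps the interaction
    history to the next shell command (cf. Section 3.1)."""

    def __init__(self, model):
        self.model = model
        self.history = []  # alternating observations and actions

    def act(self, observation: str) -> str:
        self.history.append(("observation", observation))
        action = self.model.generate(self.history)  # next shell command as text
        self.history.append(("action", action))
        return action

def run_episode(env_id: str, agent: LLMAgent, max_steps: int = 50):
    env = gym.make(env_id)               # a task-specific environment (placeholder id)
    observation, info = env.reset()      # e.g. task description and initial workspace state
    for _ in range(max_steps):
        action = agent.act(observation)  # a shell command or tool invocation
        observation, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:      # e.g. the agent issued `submit`
            break
    env.close()
    return info                          # e.g. final evaluation metrics
```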
In this work, we make a conscious decision to decouple tools and the ACI as defined in SWE-Agent (Yang et al., 2024)¹. Note that this ensures that the agent and environment are not tightly coupled, allowing for easier implementation of other agentic architectures. Practically, this means that when the environment is initialized, it also initializes the tools in the working environment, and tool documentation is prepared that can be added to the LLM agent’s prompt. More details about the tools are provided in Section 3.5.

3.3 Datasets

MLGym provides a simple abstraction for defining datasets through configuration files. It supports both locally stored and Hugging Face datasets. We decouple the dataset definition from the task definition, so that a single dataset can be used in multiple tasks. Similarly, a single task can have more than one dataset, so that the agent’s code can be evaluated across all of them to demonstrate the generality of the implemented method. Moreover, if the dataset files are stored locally, the environment automatically copies the relevant files to the agent workspace with read-only permissions. This ensures that the agent cannot change the dataset files, which is important for reproducibility and cheating prevention. If the dataset is stored on Hugging Face, the agent is given the dataset URL through the starter code or in the prompt and asked to utilize it. Note that if the LLM agent fails to follow instructions or uses a different dataset, the evaluation code will either not work or result in degraded performance.

3.4 Tasks

We provide an easy abstraction to define any ML research task using configuration files. Each task can incorporate one or more datasets, custom evaluation scripts (with read-only access), a task-specific conda environment, optional starter code, training timeouts, and memory management settings. This provides a flexible framework for defining diverse open-ended ML research tasks covering a wide range of difficulty. For example, one can define an easier version of a task by providing baseline code, and a harder version by providing no starter code or code with bugs, thus creating a natural curriculum.

Evaluation is a critical component for any ML task. Every task requires a different evaluation protocol; thus, Kaggle-style evaluation as done in MLE-Bench (Chan et al., 2024), where the agent is expected to submit a CSV file, is not feasible for every problem. For example, in reinforcement learning settings, the evaluation artifact is a set of models trained on a set of pre-defined random seeds, which is then used to compute a mean reward across a set of environment seeds. Similarly, for game-theoretic tasks, it can be a Python file with a strategy function which will be evaluated against a fixed set of strategy functions. Since we aim to evaluate the agent on open-ended and diverse tasks, it is not possible to convert all submissions to a CSV format. To ensure extensibility to such open-ended tasks, the task definition is expected to provide an evaluation script and submission artifact instructions. The LLM agent can then be prompted to follow the submission instructions and write the appropriate code. Moreover, the evaluation script is read-only for the LLM agent, so while it can inspect the evaluation format, it cannot modify the script to change the evaluation logic. Existing works such as Huang et al. (2024); METR (2024); Chen et al. (2024) also use a script-based evaluation approach, whereas MLE-Bench (Chan et al., 2024) uses a Kaggle-style evaluation.
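To illustrate the kind of information a task definition carries, here is a hypothetical configuration expressed as a Python dictionary. The field names and values are made up for illustration and do not correspond to MLGym's actual configuration schema.

```python
# Hypothetical task definition; keys and values are illustrative only.
cifar10_task = {
    "name": "imageClassificationCifar10",
    "description": "Improve test accuracy of an image classifier on CIFAR-10.",
    "datasets": ["cifar10"],                     # a task may reference one or more datasets
    "starter_code": "baselines/cifar10_cnn.py",  # optional; omit it to create a harder variant
    "evaluation_script": "evaluate.py",          # read-only for the agent
    "conda_env": "cifar10",                      # task-specific Python dependencies
    "training_timeout_seconds": 3600,            # caps the runtime of training commands
    "memory_limit_gb": 32,                       # memory management setting
}
```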
All our design decisions for the Agent, Environment, Dataset, and Task components are meant to reduce overhead on the developers’ and researchers’ side and enhance reproducibility in this newly emerging area.

¹ As of the latest release, SWE-Agent also decouples tools/ACI from the agent.

3.5 Tools and ACI

Augmenting LLM agents with the ability to use external tools is a critical component for making progress on knowledge-intensive tasks. In this work, we extend the ACI (agent-computer interface) first introduced in SWE-Agent (Yang et al., 2024) with some additional features required for an ML research agent. Specifically, we extend the commands for search, navigation, file viewing, file editing and context management with our permission management system, and introduce new commands for literature search and a memory module. For example, if the agent tries to open a file without read permission, the file viewer tool will generate textual feedback for the agent. Similarly, if the agent tries to edit the evaluation script (which is marked as read-only), the edit tools will output a feedback string instead of failing silently. Literature search and the ability to maintain an experimental log in its memory are crucial for the agent to surpass SOTA solutions on open-ended research tasks. Similar to SWE-Agent, tools are defined as bash or Python scripts and are made available as bash commands in the environment. All tool documentation is provided to the agent in the system prompt. See Table 2 for a description of the available tools.

Table 2 List of tools available to agents. Required arguments are enclosed in <> and optional arguments are in [].

SWE-Agent Tools
  Search
    search_dir <search_term> [<dir>]        searches for the search term in all files in dir
    search_file <search_term> [<file>]      searches for the search term in the given file
    find_file <file_name> [<dir>]           finds all the files with the given name in dir
  File Viewer
    open <path> [<line_number>]             opens the given file and goes to the line number
    goto <line_number>                      moves the window to show the line number
    scroll_down                             moves the window down 1000 lines
    scroll_up                               moves the window up 1000 lines
  File Editing
    create <filename>                       creates a new file
    insert <line_number> <text_to_add>      inserts the given text at line number in the open file
    edit <start_line>:<end_line> <replacement_text>   replaces the given lines with the given text in the open file

Extended Tools
  Evaluation
    validate                                validates the current submission and returns the metrics on the test set
    submit                                  submits the current code and terminates the session
  Literature Search
    literature_search <query> [<num_results>]   queries the Semantic Scholar API for papers with attached PDFs
    parse_pdf_url <url>                     downloads and extracts the contents of a PDF given a URL
  Memory Module
    memory_write <content_str>              saves important results, configs or findings to memory
    memory_read <query_str>                 retrieves the top-2 elements from memory most similar to a query

Validation and Submit We provide two commands to the agent to validate the submission and submit the results. Both the validate and submit commands run the evaluation script and give the agent feedback on its current score on the test set.
However, while the submit command is a terminal action, i.e., the agent’s trajectory is terminated and the evaluation script is executed to log the final scores, the validate command can be used as many times as needed during the run to get the current performance on the test set. The addition of a validation command helps the agent continuously improve its performance on the test set.

Literature Search and PDF Parser We provide the agent with two tools to find and extract knowledge from external sources. The Literature Search tool allows the agent to query the Semantic Scholar API to find research papers about a given query that have open-access PDFs available, and the PDF Parsing tool allows the agent to download PDFs and convert them into a text-based representation. The paper contents can be stored in the context window as well as in the Memory Module for longer-term tasks. Combined, these two tools allow the agent to find and analyze research papers as part of its workflow. See Table 2 for more information about these tools and how they are called.

Memory Module - Research Logs We introduce the Memory Module for MLGym, an important tool to improve the performance of agents on long-horizon AI research tasks. The Memory Module enables the agent to persistently store critical findings and successful training configurations using a structured memory system, overcoming the challenge of limited context retention in long tasks. During our experiments, we observed that when the agent has access to the memory module, it can retrieve the best training configuration from memory and continue to iterate on it (see Figure 11 and Figure 12). Without the memory module, the agent’s trajectory can become longer than the model’s context length; the agent is then unable to retrieve the best configuration, effectively forgetting older experiments and only being able to locally iterate on recent configurations.

The module is equipped with two core functions: memory_write and memory_read. The memory_write function allows the agent to store key insights and effective configurations by saving text data along with its corresponding embeddings and tags in JSON format. In contrast, the memory_read method retrieves the top-k most relevant stored entries based on cosine similarity with a given query, allowing the agent to review past knowledge and iterate from previously successful configurations. Empirical results demonstrate the positive impact of the Memory Module on long-horizon tasks. Agents equipped with the Memory Module were able to sustain progress over extended sequences of trials, reusing optimal configurations and findings to achieve superior results compared to agents limited by fixed context windows. To further enhance its capabilities, we add the state of the memory to the system prompt (memory tags and number of records) so that the agent is aware of the type of data stored. Tags for a memory record are extracted by identifying the 3-gram that most closely matches the record.

This module significantly reduces the limitations of constrained context length, allowing agents to operate effectively in long experimental settings. However, it is an early version and there are many ways to improve the module. For example, one possible direction would be to introduce a more structured memory format, such as hierarchical or relational models, allowing for precise storage and retrieval of information and enabling more complex reasoning over stored knowledge.
Another is to incorporate memory operations directly into the model’s training or fine-tuning process to allow the agent to natively utilize stored knowledge for improved performance. A third is to use a sub-agent that automatically manages the memory by selecting important insights, removing unnecessary entries, and updating the memory. Each of these directions would require extensive experimentation and rigorous testing to ensure robustness and scalability. For all the experiments presented in this paper, the agent only uses the SWE-Agent tools and the validation command.

4 MLGym-Bench

The primary motivation behind our benchmark is to challenge models across different aspects of machine learning, including data handling, model architecture, and strategic decision-making. By incorporating tasks from data science, game theory, computer vision, natural language processing, and reinforcement learning, the benchmark aims to provide a varied and comprehensive agent evaluation testbed. The tasks included in the benchmark are carefully selected to represent real-world challenges, ensuring that models are tested on their ability to generalize and perform effectively across various scenarios. Each task is accompanied by standardized evaluation scripts and baseline implementations, providing a clear reference point for performance assessment and comparison. The benchmark suite is structured into six main categories, each focusing on a specific domain of machine learning: Data Science, 3-SAT, Game Theory, Computer Vision, Natural Language Processing, and Reinforcement Learning. Below we describe each of the tasks in the benchmark.

4.1 Data Science

House Price Prediction (Kaggle, 2016) In the House Price Prediction task, the goal is to predict housing prices using the Kaggle House Price dataset. This task evaluates models based on their ability to accurately predict prices from various features, using RMSE and R² as performance metrics. The baseline for this task is a simple Ridge Regression model with minimal feature engineering.

4.2 3-SAT

3-SAT (Cook, 1971) In the 3-SAT task, the LLM agent is given a DPLL solver implementation and is prompted to optimize the variable selection heuristic. The DPLL code is stored in a read-only file; the agent can inspect it to structure its heuristic function code, but it cannot modify it. A simple random selection heuristic is used as a baseline and starter code for the LLM agent. The performance is measured by the total wall-clock time taken to solve a set of 100 generated 3-SAT instances. The instances are generated using the algorithm described in Selsam et al. (2018).

4.3 Game Theory

We consider several tasks related to making strategic choices in iterated games, considering multiple well-known games. Specifically, we consider the task of producing code for a strategy for playing in a repeated two-player game. In each such task we provide an opponent strategy, in the form of an opponent bot for playing the game, and ask the agent to produce code for a strategy for best-responding to this opponent, i.e. provide code for a strategy that maximizes the score against that opponent.

We very briefly review game theory terminology, with various textbooks covering this topic in more detail (Fudenberg and Tirole, 1991). In a two-player normal form game G, players select actions simultaneously, with the outcome determined by the choices of both players. Let $A_1 = \{a_{11}, \ldots, a_{1k}\}$ be the (pure) strategies available to player 1 and let $A_2 = \{a_{21}, \ldots, a_{2m}\}$ be the strategies available to player 2.
Denote the set of strategy profiles, consisting of a strategy choice for both players, as $A = A_1 \times A_2$. The utility of the players depends on the actions selected by both of them, i.e. the payoffs are $u : A \to \mathbb{R}^2$, where $u(a) = (u_1(a), u_2(a))$ for $a \in A$, and where each player $i$ tries to maximize their individual utility $u_i$. A mixed strategy is a probability distribution over pure strategies. Given a mixed strategy profile $\sigma = (\sigma_1, \sigma_2)$, the expected utility of player $i$ is $u_i(\sigma_1, \sigma_2) = \sum_{(a_1, a_2) \in A} \sigma_1(a_1)\, \sigma_2(a_2)\, u_i(a_1, a_2)$.

A repeated game consists of k rounds in which the players play the same underlying normal form game. The history at the (j+1)-th round consists of the actions (pure strategies) chosen by both players in each of the rounds 1 to j. We denote by H the set of all possible such histories, so a strategy in a repeated game is a function $a_i : H \to \Delta(A)$, i.e. a function that takes the history of actions chosen in the previous rounds and provides a distribution over the actions the player would take in the next round. In our tasks, a strategy in the repeated game is expressed as a piece of code that takes in the history (actions of both players in the previous rounds) and outputs an action for the next round (where the code may make some random choices, hence yielding a distribution over the selected next-round actions). Given an opponent strategy $a_2$, the goal of our agent is to produce a strategy that best responds to the opponent and produces the maximal payoff, i.e. $\arg\max_{a_1} u_1(a_1, a_2)$. Note that in this expression $a_2$ is a given opponent strategy expressed as a piece of code that takes the history of the previous rounds and selects an action for the next round (possibly making some random choices), and the goal of the agent is to produce $a_1$ as a piece of code capturing the strategy of the first player. The agent's optimization goal is thus selecting the code $a_1$ so as to maximize player 1's expected payoff $u_1$ against the fixed opponent $a_2$.

We consider the repeated versions of prominent games, which we briefly discuss here: iterated Prisoner’s Dilemma (Flood, 1958; Fudenberg and Tirole, 1991; Axelrod, 1980), Battle of the Sexes (Cooper et al., 1989; Luce and Raiffa, 2012) and Colonel Blotto (Roberson, 2006). As our goal was to highlight how our agent framework can be used to solve game-theoretic tasks, rather than to provide a rigorous evaluation and analysis of many game-theoretic environments, we only included a few games. However, additional games can easily be added.

Prisoner’s Dilemma (Axelrod, 1980). In this game, two players each have two options: cooperate or defect. When both cooperate, they receive a moderate reward. If one defects while the other cooperates, the defector gets a high reward while the cooperator gets a low payoff. If both defect, they both receive a low payoff. Due to the structure of payoffs, although mutual cooperation yields the best collective outcome, individual incentives often push towards defection. We included a repeated game consisting of k = 20 rounds. In the repeated version, players remember previous interactions and can adjust their strategies based on the history of past outcomes. Repeating the stage game multiple times allows for the development of trust and cooperation, as players recognize that consistent cooperation can lead to better long-term benefits than short-term defection (Axelrod, 1980).
As our opponent strategy, we provided a simple model which randomizes between cooperation, defection, or actions chosen based only on the last round of the interaction.

Battle of the Sexes (Cooper et al., 1989). This is a simple game illustrating coordination challenges between two participants with different preferences. In the game, two participants have to agree on a venue (for instance, where to go to spend an evening). There are two possible venues, and both players would rather make the same choice than different choices. The strategic dilemma arises because each player wants to coordinate their choice with the other, but the two players rank the venues differently (one prefers the first venue and the other prefers the second venue). Similarly to the iterated Prisoner’s Dilemma, we used a repeated game with k = 20 rounds and a simple opponent that makes random choices using the information from the last round.

Colonel Blotto (Roberson, 2006). This game is a model of strategic allocation of limited resources under competition. Two players (“Colonels”) must simultaneously distribute their resources (such as troops) over several alternative locations (“battlefields”). The player who allocates more resources to a battlefield wins that battlefield. The overall winner is the player who wins the most battlefields. The key challenge arises from the fact that players must make their allocations without knowing how their opponent will distribute their resources. This yields an environment where players try to anticipate their opponent’s moves to decide how best to allocate their own resources in order to maximize their chances of winning. A key insight from the game is the importance of diversification and unpredictability: it is harder to exploit an opponent who spreads resources across multiple battlefields and varies their strategy. Our target opponent used a very simple random allocation rule (re-normalizing to the overall budget of resources).

It is important to note that in all the game-theoretic tasks, the agent is allowed to look at the opponent’s strategy, and thus these tasks measure code understanding and the LLM’s capability to exploit the opponent’s strategy. In the future, we plan to add tasks where the opponent’s strategy is not provided to the agent, and where the agent is pitted against multiple opponents in a round-robin fashion, similar to the setup used in Axelrod’s original Prisoner’s Dilemma tournament.
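To make the expected artifact concrete, below is a hypothetical sketch of the kind of code a strategy submission could contain for the iterated Prisoner's Dilemma. The action encoding ("C"/"D"), the history format (a list of (agent_action, opponent_action) pairs), and the payoff values are assumptions made for illustration; they are not the benchmark's exact interface or payoff matrix.

```python
import random

def opponent_strategy(history):
    """Toy opponent in the spirit described above: half the time it acts at
    random, otherwise it repeats the agent's action from the previous round."""
    if not history or random.random() < 0.5:
        return random.choice(["C", "D"])
    agent_last, _ = history[-1]
    return agent_last

def agent_strategy(history):
    """One candidate strategy the agent might submit and then iterate on:
    open with cooperation, then mirror the opponent's previous move."""
    if not history:
        return "C"
    _, opponent_last = history[-1]
    return opponent_last

PAYOFFS = {  # (agent, opponent) -> (agent_payoff, opponent_payoff); standard illustrative values
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

def play(rounds=20):
    """Plays the repeated game and returns the agent's total payoff."""
    history, total = [], 0
    for _ in range(rounds):
        a, b = agent_strategy(history), opponent_strategy(history)
        history.append((a, b))
        total += PAYOFFS[(a, b)][0]
    return total
```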
Problem Setting          | Domain                      | Task                       | Dataset/Environment
Supervised Learning      | Data Science                | Regression                 | House Price Prediction²
Supervised Learning      | Computer Vision             | Image Classification       | CIFAR-10 (Krizhevsky et al., 2009)
Supervised Learning      | Computer Vision             | Image Classification       | Fashion MNIST (Xiao et al., 2017)
Supervised Learning      | Computer Vision             | Image Captioning           | MS-COCO (Lin et al., 2014)
Supervised Learning      | Natural Language Processing | Natural Language Inference | MNLI (Williams et al., 2018)
Self-Supervised Learning | Natural Language Processing | Language Modeling          | FineWeb (Penedo et al., 2024)
Reinforcement Learning   | Reinforcement Learning      | MetaMaze Navigation        | Gymnax (Lange, 2022)
Reinforcement Learning   | Reinforcement Learning      | MountainCar Continuous     | Gymnax (Lange, 2022)
Reinforcement Learning   | Reinforcement Learning      | Breakout MinAtar           | Gymnax (Lange, 2022)
Algorithmic Reasoning    | Computer Science            | 3-SAT                      | Randomly Generated (Selsam et al., 2018)
Algorithmic Reasoning    | Game Theory                 | Prisoner’s Dilemma         | N/A
Algorithmic Reasoning    | Game Theory                 | Battle of the Sexes        | N/A
Algorithmic Reasoning    | Game Theory                 | Colonel Blotto             | N/A

Table 3 List of tasks included in MLGym-Bench along with their respective problem setting, domain, and datasets.

4.4 Computer Vision

Image Classification (CIFAR-10) (Krizhevsky et al., 2009) The Image Classification CIFAR-10 task involves classifying images into one of ten classes using the CIFAR-10 dataset. This task tests the ability of models to learn visual patterns and features, with a baseline accuracy of 49.71% leaving ample room for improvement.

Image Classification (Fashion MNIST) (Xiao et al., 2017) The Image Classification Fashion MNIST task involves classifying fashion items into predefined categories using the Fashion MNIST dataset. The agent is provided with a simple two-layer CNN as a baseline and has to optimize the accuracy on the test set. The agent can optimize the model architecture and the hyper-parameters for training.

Image Captioning (MS-COCO) (Lin et al., 2014) For the image captioning task, the agent has to write the modeling code and come up with a good architecture and training setup for the image-text pairs in the MS-COCO dataset. We provide the agent with baseline training code that uses an image encoder and a text decoder. We use the MS-COCO training and validation sets after removing all images containing humans. The agent has to optimize the BLEU scores (Papineni et al., 2002) computed between the model-generated captions and the ground truth captions for a given image.

² https://www.kaggle.com/datasets/yasserh/housing-prices-dataset

4.5 Natural Language Processing

For language, we test the agent’s ability to understand and modify the training setup for both Natural Language Understanding (NLU) and Natural Language Generation (NLG), as detailed below.

Natural Language Inference (Williams et al., 2018) In this task, the agent starts from a pre-trained BERT model (Devlin, 2018), and we provide the agent with baseline code to fine-tune it on the training set of the MNLI benchmark. The agent is expected to come up with good hyper-parameters and a fine-tuning strategy to optimize the test set accuracy on MNLI.

Language Modeling (Jordan et al., 2024) In the Language Modeling task, the agent is expected to train a language model for next token prediction using a smaller version of the FineWeb (Penedo et al., 2024) dataset. The LLM agent is provided with the dataset and the NanoGPT (Jordan et al., 2024) codebase as a baseline and starting point.
We use version #8 of modded-nanogpt³ as the starting point. The training and validation sets contain 1.773B and 100M tokens, respectively. The performance metric is the perplexity of the trained model on the validation set.

4.6 Reinforcement Learning

MetaMaze Navigation (Miconi et al., 2020) The MetaMaze Navigation task simulates a grid-world environment where agents must navigate using local observations and reach the goal location.

Mountain Car Continuous (Brockman et al., 2016) We use the continuous version of the Mountain Car environment introduced in Brockman et al. (2016), where the task is to learn a policy that drives a car up a steep hill in a continuous control environment.

Breakout MinAtar (Young and Tian, 2019) The Breakout MinAtar task involves playing the arcade game Breakout in a simulated environment. This environment was introduced in Young and Tian (2019) and is a popular benchmark for evaluating reinforcement learning agents.

For all the RL tasks, we use the environments from the Gymnax library (Lange, 2022) and the PPO algorithm from Gymnax-blines⁴ as a baseline and starting code for the LLM agent.

5 Experimental Setup

5.1 Agent and Models

For our experiments, we utilize a SWE-Agent-based agent adapted specifically for the MLGym environment. SWE-Agent follows a simple ReAct-style thought and action loop (Yao et al., 2023), where the agent is prompted with the ACI documentation, the task and dataset description, as well as lightweight generic instructions to act as an ML researcher. The agent is configured to use a single command per step, and is not allowed to use any interactive session commands (e.g., python REPL, vim). We use a set of 5 state-of-the-art models for our experiments: OpenAI o1-preview, Gemini 1.5 Pro, Claude-3.5-sonnet-20241022 (referred to as Claude-3.5-sonnet in the paper), Llama-3.1-405b-instruct, and GPT-4o. All the models are used with temperature=0.0 and top-p=0.95, with the exception of OpenAI o1-preview, which does not support changing the decoding parameters and uses a default temperature of 1.0.

³ https://github.com/KellerJordan/modded-nanogpt
⁴ https://github.com/RobertTLange/gymnax-blines

5.2 Environment Configuration

The MLGym environment is configured with several key parameters to facilitate effective interaction between the agent and the tasks:

• Window Configuration: The environment uses a window size of 1000 lines with an overlap of 2 lines, allowing the agent to effectively navigate and edit large files while maintaining context.

• Context Management: A history processor maintains a rolling window with the five most recent interactions (action and observation), helping the agent maintain context about the most recent interactions while keeping the input size manageable.

• Command Interface: The environment provides a set of specialized commands beyond standard bash operations, including file navigation commands (goto, scroll_up, scroll_down), file editing commands (edit, insert) with linting support, file and directory search commands (search_file, search_dir, find_file), and evaluation commands (validate, submit).

A single agent run is limited to 50 steps (i.e. interactions with the environment), after which the agent is terminated and the last codebase state is auto-submitted. Moreover, to control the runtime of the agent and prevent it from simply increasing the number of parameters in the model, we set a task-specific timeout for the training commands. In the next section, we discuss the evaluation metrics used in our experiments.
6 Evaluation

In order to compare agents on MLGym, we aggregate the scores of each method—an agent architecture paired with a backbone model—across our tasks. There are many ways one can aggregate scores. Common options include computing the average score across tasks for each method or computing the average ranking of each method across tasks. While simple, these approaches can weight metrics in undesirable ways and disproportionately penalize certain methods. Averaging across different metrics may unfairly weight the metrics differently based on their relative scales, and averaging ranks can disproportionately penalize methods that effectively solve a task but are tied with other methods that also solve the task.

Rather than naively averaging scores or rankings, we employ performance profile curves (Dolan and Moré, 2002), which allow us to compare relative performance gains across both methods and tasks. Performance profiles were originally developed to compare optimization techniques across a set of optimization problems. Since then, they have been used by the AutoML community to compare AutoML methods across diverse domains, each with their own domain-specific metrics (Tu et al., 2022; Roberts et al., 2022b).

One challenge when using performance profiles is that they produce a curve for each method (where a higher curve is better), rather than a direct ranking of methods. To address this, the AutoML Decathlon (Roberts et al., 2022a) competition introduced the AUP score, which computes the area under the performance profile curve for each method, where a higher value constitutes better performance. Variants of the AUP score have since been used to score the AutoML Cup5 and MLCommons AlgoPerf (Dahl et al., 2023) competitions. Next, we define performance profiles, the AUP score, and the details of their usage within MLGym.

6.1 Performance Profiles and the AUP Score

For a given method m, its performance profile curve is defined as

\rho_m(\tau) = \frac{1}{|T|}\,\bigl|\{\, t \in T : \log_{10} r_{t,m} \le \tau \,\}\bigr|, \qquad r_{t,m} = \frac{\ell_{t,m}}{\min\{\ell_{t,m} : m \in M\}} \quad (1)

where M is the set of all methods, T is the set of tasks, \ell_{t,m} is the performance metric for a method m on task t, and r_{t,m} is a quantity called the performance ratio. Importantly, this definition assumes that the performance metric for each task, \ell_{t,\cdot}, must be defined such that lower scores are better—we discuss our modification to this definition to support other scores in Section 6.2.

5 https://2023.automl.cc/competitions/automl-cup/

Performance profiles are parameterized by a threshold, \tau, on the distance between the method m and the best-scoring methods on each of the tasks. At a given threshold \tau, performance profiles compute the proportion of tasks for which the method m is within \tau of the best method for each task. In order to derive a final score for each method m \in M, we compute the AUP score as

\mathrm{AUP}_m = \int_{1}^{\tau_{\max}} \rho_m(\tau)\, d\tau, \quad (2)

where \tau_{\max} is the minimum \tau for which \rho_m(\tau) = 1 for all m \in M.

6.2 Usage in MLGym

In the context of MLGym, a method is defined as a combination of an agent scaffolding and a backbone model. Since in this work we use a single agent scaffolding (SWE-Agent), we are effectively comparing the performance of different backbone models. Moreover, we adapt performance profiles and AUP scores to handle various edge cases introduced by our MLGym tasks.
• Metric Direction Handling: For metrics where higher values are better (e.g., accuracy, R2), we invert the performance ratio calculation and use the maximum score instead of the minimum:

r_{t,m} = \frac{\max\{\ell_{t,m} : m \in M\}}{\ell_{t,m}} \quad (3)

• Infeasible Methods: To be counted as feasible, a method must produce at least one valid solution and outperform the baseline. Methods that do not produce any valid solution, or that underperform the baseline, are marked as infeasible. The score of an infeasible method is set to (1 + \varepsilon) \times r_{t,m_{\mathrm{baseline}}}, where r_{t,m_{\mathrm{baseline}}} is the score obtained by the baseline method on task t. We set \varepsilon = 0.05.

We report the metrics across 4 independent runs for each model on each task. Finally, since the LM agent can use the validate command to check performance without ending the run, we maintain two separate sets of performance profiles and AUP scores for each model (see the illustrative sketch below).

1. Best Submission Profiles, ρ^{bs}_m(τ)@4, are computed using the best final submission across 4 runs. A submission is classified as a final submission in two cases: if the agent uses the submit command, or if the agent terminates without submitting and the last codebase state is used to evaluate the performance.

2. Best Attempt Profiles, ρ^{ba}_m(τ)@4, which are computed using the best attempt observed across 4 runs. Any valid call to the validate command is considered an attempt.

The resulting AUP scores provide complementary information:

• AUP^{bs}_m@4 indicates the model's ability to consistently submit its best attempt as the final solution. Note that to do this, the LM agent has to be able to keep an internal state of the best attempt and recover from any mistakes made after the best attempt was made.

• AUP^{ba}_m@4 captures the model's exploration capability and is an indicator of the ceiling of the model's performance.

Apart from the AUP scores and performance profiles, we also report the raw performance scores for each model on each task. Similar to the performance profiles, we categorize the raw scores into two sets: Best Submission@4 and Best Attempt@4.

7 Results

7.1 AUP Scores and Performance Profiles

As detailed in Section 6, we evaluate the performance of each model in the SWE-Agent-based scaffolding using performance profiles and the Area Under the Performance Profile (AUP) score.

Figure 2 Performance profiles comparing Best Attempt@4 and Best Submission@4 across all models and tasks. The x-axis shows the performance ratio threshold τ and the y-axis shows the fraction of tasks where a model achieves performance within τ of the best model.

Moreover, since our agent can log the performance of intermediate steps, we categorize the performance of each model using two categories: Best Submission and Best Attempt. Best Submission indicates the LLM agent's capability to produce a valid final solution for a task as well as its ability to fall back to the best intermediate solution in case some experiments do not pan out, whereas Best Attempt indicates the potential ceiling of the LLM agent's capability to solve the given task. Figure 2 shows the performance profiles for Best Attempt (on the left) and Best Submission (on the right). Similarly, Table 4 shows the AUP scores for the Best Attempt and Best Submission for all models.
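Before discussing the results, here is a minimal, self-contained sketch of how the performance profiles and AUP scores reported in Figure 2 and Table 4 could be computed, including the higher-is-better inversion (Eq. 3) and the infeasible-method penalty. It is a simplified reimplementation for exposition, not MLGym's actual evaluation code; any scores fed to it in practice would come from the per-task metrics.

```python
# Illustrative reimplementation of performance profiles and AUP scores (Section 6);
# not MLGym's actual evaluation code.
import numpy as np

EPS = 0.05  # infeasible methods are scored at (1 + EPS) times the baseline's ratio


def performance_ratios(scores, higher_is_better, baseline="baseline"):
    """scores: dict mapping method -> raw score on one task (None = no valid solution)."""
    base = scores[baseline]

    def worse_than_baseline(s):
        return s < base if higher_is_better else s > base

    feasible = {m: s for m, s in scores.items()
                if s is not None and (m == baseline or not worse_than_baseline(s))}
    best = max(feasible.values()) if higher_is_better else min(feasible.values())
    # Eq. (1) for lower-is-better metrics, Eq. (3) for higher-is-better metrics
    ratios = {m: (best / s if higher_is_better else s / best) for m, s in feasible.items()}
    for m in scores:
        if m not in ratios:  # infeasible: no valid solution, or did not beat the baseline
            ratios[m] = (1 + EPS) * ratios[baseline]
    return ratios


def profile(tau, method, ratios_per_task):
    """rho_m(tau): fraction of tasks where log10 of the performance ratio is <= tau."""
    r = np.array([ratios_per_task[t][method] for t in ratios_per_task])
    return float(np.mean(np.log10(r) <= tau))


def aup(method, ratios_per_task, tau_max, n_grid=1000):
    """Area under the performance profile curve, integrated from tau = 1 to tau_max (Eq. 2)."""
    taus = np.linspace(1.0, tau_max, n_grid)
    ys = [profile(t, method, ratios_per_task) for t in taus]
    return float(sum(0.5 * (ys[i] + ys[i + 1]) * (taus[i + 1] - taus[i])
                     for i in range(n_grid - 1)))
```

Under this sketch, `ratios_per_task` would map each task to the output of `performance_ratios` for that task, `aup(m, ratios_per_task, tau_max)` would then be compared across methods, and τ_max would be chosen as the smallest τ at which every method's profile reaches 1.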
In our experiments, we found that OpenAI O1-preview is the best-performing model on aggregate across our set of tasks for both Best Attempt and Best Submission, with Gemini 1.5 Pro and Claude-3.5-Sonnet being close behind.

Model | Best Attempt AUP@4 | Best Submission AUP@4
Llama3.1-405b-instruct | 1.015 | 1.039
Claude-3.5-Sonnet | 1.142 | 1.135
Gemini-1.5-Pro | 1.140 | 1.125
GPT-4o | 1.000 | 1.029
OpenAI O1 | 1.150 | 1.176

Table 4 AUP@4 scores for the best attempt and best submission across all models. Best scores are highlighted in blue.

7.2 Raw Performance Scores

To compare the performance of each model on each task, we also report aggregate metrics over 4 runs with different seeds, namely the Best Attempt@4 and Best Submission@4 in Table 5 and Table 6 respectively. While OpenAI O1-Preview is not dominant in all tasks, with Gemini-1.5-Pro, Claude-3.5-Sonnet, and Llama3.1-405b-Instruct occasionally taking the lead, it is consistently in the top performing models for most tasks and thus takes the top spot in the AUP scores and performance profiles. This shows that the performance profile is a good metric to compare the performance of different models on a set of tasks with a diverse set of metrics. We also find that Llama-3.1-405b-Instruct and GPT-4o are the only models that fail to produce any valid solution for the Language Modeling and Breakout tasks, respectively.

Task | Metric | Baseline | Llama3.1-405b-instruct | GPT-4o | Claude-3.5-Sonnet | Gemini-1.5-Pro | OpenAI o1
CIFAR-10 | Accuracy | 0.497 | 0.548 | 0.733 | 0.895 | 0.84 | 0.857
Battle of Sexes | Average Reward | 1.023 | 1.261 | 1.149 | 1.442 | 1.443 | 1.444
Prisoners Dilemma | Average Reward | 2.372 | 2.632 | 2.6 | 2.567 | 2.63 | 2.629
Blotto | Average Reward | -0.248 | 0.043 | 0.047 | 0.576 | 0.249 | 0.248
House Price Prediction | R2 Score | 0.88 | 0.908 | 0.895 | 0.921 | 0.914 | 0.931
Fashion MNIST | Accuracy | 0.783 | 0.876 | 0.927 | 0.945 | 0.916 | 0.92
MS-COCO | BLEU Score | 0.279 | 0.294 | 0.176 | 0.298 | 0.131 | 0.135
MNLI | Validation Accuracy | 0.525 | 0.777 | 0.819 | 0.830 | 0.838 | 0.836
Language Modeling | Validation Loss | 4.673 | ∞ | 4.361 | 4.476 | 4.166 | 3.966
Breakout | Average Score | 48.817 | 58.87 | ∞ | 35.017 | 71.389 | 63.518
Mountain Car Continuous | Average Reward | 33.794 | 18.692 | -215.776 | 36.313 | 92.513 | 96.335
Meta Maze | Average Return | 15.734 | 26.744 | 7.823 | 48.562 | 27.859 | 34.986
3-SAT Heuristic | Wall-Clock Time (s) | 16.158 | 13.793 | 13.676 | 15.728 | 14.36 | 13.652

Table 5 Best Attempt@4 scores for all models. Best scores are highlighted in blue. Note: ∞ indicates that the model was not able to produce even a single valid solution for submission or validation.

Task | Metric | Baseline | Llama3.1-405b-instruct | GPT-4o | Claude-3.5-Sonnet | Gemini-1.5-Pro | OpenAI o1
CIFAR-10 | Accuracy | 0.497 | 0.528 | 0.733 | 0.894 | 0.758 | 0.854
Battle of Sexes | Average Reward | 1.023 | 1.256 | 1.144 | 1.439 | 1.443 | 1.439
Prisoners Dilemma | Average Reward | 2.372 | 2.562 | 2.582 | 2.563 | 2.63 | 2.571
Blotto | Average Reward | -0.248 | 0.041 | 0.047 | 0.228 | 0.088 | 0.247
House Price Prediction | R2 Score | 0.88 | 0.908 | 0.895 | 0.912 | 0.908 | 0.931
Fashion MNIST | Accuracy | 0.783 | 0.876 | 0.927 | 0.945 | 0.916 | 0.906
MS-COCO | BLEU Score | 0.279 | 0.294 | 0.111 | 0.125 | 0.131 | 0.135
MNLI | Validation Accuracy | 0.525 | 0.777 | 0.819 | 0.830 | 0.838 | 0.836
Language Modeling | Validation Loss | 4.673 | ∞ | 4.361 | 4.476 | 4.166 | 3.966
Breakout | Average Score | 48.817 | 58.87 | ∞ | 17.735 | 71.389 | 63.518
Mountain Car Continuous | Average Reward | 33.794 | 18.692 | -216.621 | 36.313 | 92.513 | 96.335
Meta Maze | Average Return | 15.734 | 26.744 | 7.823 | 48.562 | 22.889 | 34.986
3-SAT Heuristic | Wall-Clock Time (s) | 16.158 | 13.936 | 13.676 | 15.728 | 14.36 | 13.83

Table 6 Best Submission@4 scores for all models. Best scores are highlighted in blue. Note: ∞ indicates that the model was not able to produce even a single valid solution for submission or validation.
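For clarity, the sketch below illustrates how the Best Attempt@4 and Best Submission@4 values reported in Tables 5 and 6 could be aggregated from per-run logs. The log structure (`submission`, `attempts`) is hypothetical and the example scores are made up; this is not the framework's actual logging code.

```python
# Hypothetical aggregation of Best Attempt@4 / Best Submission@4 from per-run logs.
# The log structure below is illustrative; MLGym's actual format may differ.

def best_at_k(runs, higher_is_better=True):
    """runs: list of dicts like {"submission": float | None, "attempts": [float, ...]}"""
    pick = max if higher_is_better else min
    submissions = [r["submission"] for r in runs if r["submission"] is not None]
    attempts = [s for r in runs for s in r["attempts"]]
    # float("inf") plays the role of the "∞ = no valid solution" marker in the tables
    best_submission = pick(submissions) if submissions else float("inf")
    best_attempt = pick(attempts) if attempts else float("inf")
    return best_attempt, best_submission


# Example with made-up accuracy scores over 4 seeds:
runs = [
    {"submission": 0.81, "attempts": [0.74, 0.81]},
    {"submission": 0.79, "attempts": [0.79, 0.83]},  # best attempt was not submitted
    {"submission": None, "attempts": [0.70]},        # run failed before submitting
    {"submission": 0.80, "attempts": [0.80]},
]
print(best_at_k(runs))  # -> (0.83, 0.81)
```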
7.3 Computational Cost

As discussed in Kapoor et al. (2024), it is important to also consider the Pareto curve of performance vs. cost for a more comprehensive evaluation of the agents' capabilities and their computational cost. In this work, we do not compare different agent scaffoldings; however, the Pareto curve can still be useful to choose the most balanced model for a set of tasks. Figure 3 shows the Best Attempt AUP@4 vs. Average Cost for all models. We use Best Attempt AUP scores for this plot to highlight the maximum performance achievable by each model for a given cost.

Figure 3 Best Attempt AUP@4 vs cost for all models. The x-axis shows the API cost in USD and the y-axis shows the AUP@4 score.

According to the results discussed in Section 7.1, OpenAI O1-Preview is the best-performing model; however, it is also the most computationally expensive by a wide margin. In contrast, Gemini-1.5-Pro and Claude-3.5-Sonnet are much more cost-effective while still reaching high performance not too far from OpenAI O1's, with Gemini-1.5-Pro being the most cost-effective. Gemini-1.5-Pro is cheaper than both GPT-4o and Llama-3.1-405b-Instruct and provides massive performance gains relative to them. GPT-4o is one of the cheapest models to run but performs significantly worse than the top models, Claude-3.5-Sonnet, Gemini-1.5-Pro, or OpenAI O1-Preview. Overall, Gemini-1.5-Pro strikes the best balance between performance and cost on MLGym-Bench, being the cheapest model to run (approximately 9× cheaper than OpenAI O1) while achieving 99% of the AUP of OpenAI O1, the top-performing model.

The API pricing for OpenAI O1-preview, GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro was taken from their respective pricing pages, and for Llama-3.1-405b-instruct it was taken from together.ai. For details on API pricing, tokens spent, and context length, please consult Table 8.

7.4 Agent Behavior Analysis

7.4.1 Failure Mode Analysis

In this section we analyze the failure modes of our agents on MLGym-Bench tasks, using three key perspectives: termination error distribution, failed or incomplete run rates, and task-specific failure patterns. We collect trajectories across 11 tasks and 5 models with 4 different seeds. This results in a total of 220 trajectories, with 20 and 44 trajectories for each task and model, respectively.

Figure 4 Termination Error Distribution by model. The size of the bars corresponds to the number of times each model triggered an exit status.

Figure 5 Number of Failed and Incomplete runs per model. The criteria for marking a run as incomplete or failed are described in Section 7.4.1.

Termination Errors Figure 4 shows the distribution of different causes of termination encountered by each model during task execution, as indicated by the first word of the error message. We categorize the errors into the following types: context length exceeded, evaluation error, file permission error, cost limit exceeded, format error, and runtime error. First, we observe that almost all models encounter Evaluation Error, which is generally the most frequent final error, accounting for 75% of all termination errors. Evaluation Error is generally triggered by missing
submission artefacts or an incorrect submission format at the last step or when the submit command is issued. Gemini-1.5-Pro is the only model that does not submit any invalid solutions, with OpenAI O1-Preview and Claude-3.5-Sonnet being the runners-up. OpenAI O1-Preview and Claude-3.5-Sonnet demonstrate superior error handling capabilities with the lowest overall error rates. Cost Limit is the second most frequent error encountered by Claude-3.5-Sonnet, Gemini-1.5-Pro and OpenAI O1-Preview, indicating that they could further improve performance if provided with a larger budget. However, it is interesting to note that Gemini-1.5-Pro is the most cost-effective model across all tasks but still encounters the Cost Limit error most frequently among all models.

Failed and Incomplete Runs The failed and incomplete run analysis in Figure 5 reveals significant variations in model reliability. If an agent run fails with a termination error without producing any valid intermediate submission, we mark it as failed. If, instead, the run fails with a termination error but produces a valid intermediate submission, i.e., at least one score on the test set is obtained, we mark it as incomplete. Note that the model's submission does not have to beat the baseline to be considered a valid intermediate submission. We are not interested in the performance of the model's submission here, but rather in the agent's ability to produce a valid submission by following the given instructions. GPT-4o exhibits the highest failure rate, while Gemini-1.5-Pro and OpenAI O1-Preview achieve the best completion rates. While Claude-3.5-Sonnet is one of the top-performing models across all tasks (Section 7.1), it has a high failure rate. Another interesting observation is that OpenAI O1-Preview has a high incompletion rate, but it always produces at least one valid solution for all tasks. We report additional results and failure mode analysis in Section A.2.

7.4.2 Action Analysis

In this section, we analyze the overall action distribution, as well as the distribution across models and trajectory steps. To analyze the action distribution effectively, we group the actions according to the categories defined in Table 2: Edit, View, Search, Validate and Submit. We treat validate and submit as two separate categories. Additionally, we have two open-ended categories: Python and Bash. All actions that match the regex patterns python.*, deepspeed.*, or torchrun.* are considered Python actions; these actions usually correspond to the agent attempting to run a model evaluation or training script. All other actions are grouped under the Bash category, i.e., are considered open-ended bash commands. A minimal sketch of this grouping is shown below.

Overall Action Distribution Figure 6 shows the action distribution across all runs. File commands such as Edit and View are among the most frequently used commands, with Edit accounting for 50% of the total actions, whereas Search commands are rarely used, accounting for only 1% of the total actions. This distribution suggests that models spend a significant portion of their time in an iterative development cycle of editing and viewing files. Additionally, we observe a trend of regular experimental evaluation and periodic validation of solutions through the frequent use of Python and Validate commands.
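The following is an illustrative sketch of such a regex-based grouping. The Python/Bash patterns follow Section 7.4.2, while the mapping of individual commands to the Edit/View/Search buckets is assumed from the command interface in Section 5.2 and may not exactly match Table 2; the helper itself is hypothetical and not part of the MLGym codebase.

```python
# Hypothetical regex-based action grouping following the categories in Section 7.4.2.
import re

FILE_EDIT = {"edit", "insert", "create"}
FILE_VIEW = {"open", "goto", "scroll_up", "scroll_down"}
SEARCH = {"search_file", "search_dir", "find_file"}
PYTHON_PATTERNS = [r"^python.*", r"^deepspeed.*", r"^torchrun.*"]


def categorize_action(command: str) -> str:
    stripped = command.strip()
    name = stripped.split()[0] if stripped else ""
    if name in FILE_EDIT:
        return "Edit"
    if name in FILE_VIEW:
        return "View"
    if name in SEARCH:
        return "Search"
    if name == "validate":
        return "Validate"
    if name == "submit":
        return "Submit"
    if any(re.match(p, stripped) for p in PYTHON_PATTERNS):
        return "Python"
    return "Bash"  # everything else counts as an open-ended bash command


print(categorize_action("edit 10:20"))       # Edit
print(categorize_action("python train.py"))  # Python
print(categorize_action("ls -la"))           # Bash
```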
Figure 6 Action distribution across all runs. We group the actions into categories following the grouping defined in Table 2 and Section 7.4.2.

Figure 7 Action distribution for each model. We group the actions into categories following the grouping defined in Table 2 and Section 7.4.2.

Per-Model Action Distribution Figure 7 shows the action distribution for each model. GPT-4o takes the fewest actions overall, indicating that the model either errors out or submits too early without reaching an optimal solution. This is consistent with the failure analysis shown in Figure 5. Among the best-performing models, Claude-3.5-Sonnet and OpenAI O1-Preview perform the most actions within a run, while Gemini-1.5-Pro performs the fewest. Consistent with the cost analysis discussed in Section 7.3, Gemini-1.5-Pro's shorter trajectories contribute to it being the most cost-effective model.

Per-Step Action Distribution Figure 8 illustrates the distribution of actions taken by agents across trajectory steps. Initially, Bash commands are predominant, indicating that agents start by checking and setting up their environment with basic commands such as ls, pwd, cd, etc. As the steps progress, Edit actions become the most frequent, reflecting the agents' focus on modifying and refining code. This is complemented by consistent use of View commands, suggesting a pattern of iterative development where agents frequently review their changes. Python and Validate commands are used steadily throughout, which indicates an iterative cycle of experiments and evaluation. Submit actions are sparse, typically appearing towards the end of the process, aligning with the finalization of tasks. However, we can observe the Submit action being used as early as step 5, which indicates that some models submit their solution too early and likely fail to reach an optimal solution that beats other models. Interestingly, Search commands are rarely used, suggesting that agents might benefit from improved search strategies to enhance efficiency while editing code. Overall, our analysis highlights a structured approach where agents begin by getting familiar with the environment and the task, conduct multiple iterations of experiments and validation, and conclude with a submission. We report additional action analysis in Section A.3.

Figure 8 Action distribution for each step. We group the actions into categories following the grouping defined in Table 2 and Section 7.4.2.

8 Discussion and Limitations

Our findings highlight both the opportunities and ongoing challenges in leveraging large language models (LLMs) as agents for scientific workflows. The proposed MLGym framework and accompanying MLGym-Bench tasks demonstrate that modern LLM agents can successfully tackle a diverse array of quantitative experiments, reflecting advanced skills and domain adaptability. At the same time, our results reveal notable capability gaps, which point to several avenues for improvement:

• Scaling beyond ML tasks To further evaluate the agent's AI research capabilities, it is essential to scale up the evaluation framework to accommodate large-scale domain-specific datasets, more complex tasks, as well as domains outside AI.
This will enable the community to assess the robustness and generalizability of different methods, as well as identify potential limitations and areas for improvement.

• Interdisciplinary Ablations and Generalization Within the stage of method evaluation, one approach is to test the solutions for generalization:

– automatically evaluating the applicability of a new method on different domains. For example, new LLM architectures like Mamba (Gu and Dao, 2024) could be automatically applied to data on DNA, chemical molecules, music generation, etc.

– automatically running interdisciplinary and multidisciplinary ablations, where we systematically remove or modify specific components of the proposed ML system to assess their impact on performance. This will enable the community to more quickly identify the most critical factors contributing to generalization across different domains.

• Addressing Scientific Novelty While agentic benchmarks have demonstrated their effectiveness in evaluating complex tasks in different areas, it is essential to acknowledge that the proposed interdisciplinary extrapolation of methods is just one aspect of the broader scientific understanding of "novelty" and "discovery" (Popper, 2005; Langley, 1987). It is not yet clear whether the notion of scientific novelty can be successfully automated or even formally defined in a form suitable for agents. For many scientific disciplines, development may be uneven and depend on the availability of open data, and on the development of the methods, metrics, and definitions used.

• Data Openness Imperative Finally, we emphasize the importance of data openness in driving scientific progress. By making our representative 'corpus of the world' widely accessible, including scientific artifacts, reproducible code, and domain-specific data for modeling, we can facilitate collaboration and accelerate discovery. This imperative is crucial for advancing our understanding of complex systems and developing more effective solutions to real-world problems. Removing once-accessible resources that have entered LLM training from public access can have an irreparable impact on the acceleration of scientific progress: it becomes impossible to identify the sources of facts, and impossible to tell whether an out-of-distribution result from a scientific work is a hallucination or a genuinely new result.

9 Ethical Considerations

AI agents proficient in tackling open research challenges like those in our benchmark could catalyze a remarkable acceleration in scientific advancement. This prospect is exhilarating, yet it demands a meticulous comprehension of model progress to ensure responsible and controlled deployment of such breakthroughs. MLGym-Bench, for instance, can serve as a metric for model autonomy within OpenAI's Preparedness Framework, autonomous capabilities in Anthropic's Responsible Scaling Policy, and ML R&D in Google DeepMind's Frontier Safety Framework. Should AI agents become adept at autonomously conducting AI research, the positive impacts could be multifaceted, encompassing accelerated scientific progress in healthcare, climate science, and other domains, expedited safety and alignment research for models, and economic growth spurred by the development of novel products.
The ability of agents to deliver high-quality research could signify a transformative stride in the economy. Nonetheless, agents capable of executing open-ended AI research tasks, such as enhancing their own training code, could augment the capabilities of cutting-edge models at a pace outstripping human researchers. If innovations outpace our ability to comprehend their ramifications, we risk developing models with catastrophic harm or misuse potential without parallel advancements in securing, aligning, and controlling such models. We believe a model proficient in solving a substantial portion of MLGym-Bench likely possesses the capacity to execute numerous open-ended AI tasks. We are open-sourcing MLGym and MLGym-Bench to foster understanding and research into the agentic capabilities of AI Research Agents and promote transparency regarding acceleration risks in frontier AI labs. In doing so, we acknowledge the limitations of MLGym-Bench and strongly encourage the development of additional evaluations of automated AI research capabilities, particularly those tailored to the workflow of researchers training frontier models. 10 Conclusions This paper presents MLGym and MLGym-Bench as initial steps toward building robust, flexible, and transparent LLM agents for AI research. As the field continues to evolve, improvements in long-context reasoning, better agent architectures, training and inference algorithms, as well as richer evaluation methodologies will be essential to fully harness LLMs’ potential for scientific discovery, in general and for AI research in particular. By fostering collaboration among researchers in machine learning, scientific computing, and diverse application domains, we can move closer to a future where AI-driven agents meaningfully accelerate scientific research, all while maintaining verifiability, reproducibility, and integrity in scientific discovery. 11 Acknowledgments We thank Sten Sootla, Mikayel Samvelyan, Sharath Chandra Raparthy, Mike Plekhanov, and Rishi Hazra for many insightful discussions about evaluating and developing AI Research Agents. 22 References Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 1, 2024. Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, and William Wang. Swe-search: Enhancing software agents with monte carlo tree search and iterative refinement, 2024. URL https://arxiv.org/abs/2410. 20285. Robert Axelrod. Effective choice in the prisoner’s dilemma. Journal of conflict resolution, 24(1):3–25, 1980. Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models, April 2024. URL https://arxiv.org/abs/2404. 07738. Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, and Tushar Khot. SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories, September 2024. URL https://arxiv.org/abs/2409.07440v1. Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016. URL https://arxiv.org/abs/1606.01540. 
Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongshen Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, and Tao Yu. Spider2-v: How far are multimodal agents from automating data science and engineering workflows?, 2024. URL https://arxiv.org/abs/2407.10956. Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering, October 2024. URL https://arxiv.org/abs/2410.07095v1. Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery, October 2024. URL https://arxiv.org/abs/2410.05080v1. Stephen A Cook. The complexity of theorem-proving procedures. Proceedings of the third annual ACM symposium on Theory of computing, pages 151–158, 1971. Russell Cooper, Douglas V DeJong, Robert Forsythe, and Thomas W Ross. Communication in the battle of the sexes game: some experimental results. The RAND Journal of Economics, pages 568–587, 1989. George Dahl, Frank Schneider, Zachary Nado, Naman Agarwal, Chandramouli Shama Sastry, Philipp Hennig, Sourabh Medapati, Runa Eschenhagen, Priya Kasimbeg, Daniel Suo, Juhan Bae, Justin Gilmer, Abel Peirson, Bilal Khan, Rohan Anil, Mike Rabbat, Shankar Krishnan, Daniel Snider, Ehsan Amid, and Peter Mattson. Benchmarking neural network training algorithms, 06 2023. Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web, 2023. URL https://arxiv.org/abs/2306.06070. Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. Elizabeth D. Dolan and Jorge J. Moré. Benchmarking optimization software with performance profiles. Mathematical Programming, 91(2):201–213, January 2002. ISSN 1436-4646. doi: 10.1007/s101070100263. URL https://arxiv. org/abs/cs/0102001. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. Katharina Eggensperger, Marius Lindauer, and Frank Hutter. Pitfalls and best practices in algorithm configuration. Journal of Artificial Intelligence Research, 64:861–893, 2019. 23 Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21, 2019. M Emrich, A Agarwal, B Jairam, N Murthy, and OAK RIDGE NATIONAL LAB TN. Potential applications of artificial intelligence to the field of software engineering. Technical report, 1988. Merrill M Flood. Some experimental games. Management Science, 5(1):5–26, 1958. Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang, Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. 
Magentic-one: A generalist multi-agent system for solving complex tasks, 2024. URL https://arxiv.org/abs/2411.04468. Drew Fudenberg and Jean Tirole. Game theory. MIT press, 1991. Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balazs Kegl, Haitham Bou-Ammar, and Jun Wang. Large language models orchestrating structured reasoning achieve kaggle grandmaster level, 2024. URL https://arxiv.org/abs/2411.03562. Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024. URL https: //arxiv.org/abs/2312.00752. Kai Guo, Zhenze Yang, Chi-Hua Yu, and Markus J Buehler. Artificial intelligence and machine learning in design of mechanical materials. Materials Horizons, 8(4):1153–1172, 2021. Gerhard Hessler and Karl-Heinz Baringhaus. Artificial intelligence in drug design. Molecules, 23(10):2520, 2018. Sirui Hong, Yizhang Lin, Bang Liu, Bangbang Liu, Binhao Wu, Danyang Li, Jiaqi Chen, Jiayi Zhang, Jinlin Wang, Li Zhang, Lingyao Zhang, Min Yang, Mingchen Zhuge, Taicheng Guo, Tuo Zhou, Wei Tao, Wenyi Wang, Xiangru Tang, Xiangtao Lu, Xiawu Zheng, Xinbing Liang, Yaying Fei, Yuheng Cheng, Zongze Xu, and Chenglin Wu. Data Interpreter: An LLM Agent For Data Science, March 2024. URL https://arxiv.org/abs/2402.18679. Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation, April 2024. URL https://arxiv.org/abs/2310.03302. Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents, June 2024. URL https://arxiv.org/abs/2406.06769. Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023. Leland Johnson and Daniel Schaffer. Oak Ridge National Laboratory: the first fifty years. Univ. of Tennessee Press, 1994. Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024. URL https://github.com/KellerJordan/modded-nanogpt. Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. Challenges and applications of large language models. arXiv preprint arXiv:2307.10169, 2023. Kaggle. House prices - advanced regression techniques. Online; accessed January 24, 2025, 2016. URL https: //www.kaggle.com/c/house-prices-advanced-regression-techniques. Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. Ai agents that matter, 2024. URL https://arxiv.org/abs/2407.01502. Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649, 2024a. Jing Yu Koh, Stephen McAleer, Daniel Fried, and Ruslan Salakhutdinov. Tree search for language model agents, 2024b. URL https://arxiv.org/abs/2407.01476. Alex Krizhevsky, Geoffrey Hinton, et al. 
Learning multiple layers of features from tiny images. 2009. 24 Robert Tjarko Lange. gymnax: A JAX-based reinforcement learning environment library, 2022. URL http://github. com/RobertTLange/gymnax. P Langley. Scientific discovery: Computational explorations of the creative processes. MIT press, 1987. Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, and Tao Yu. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows, 2024. URL https: //arxiv.org/abs/2411.07763. Ziming Li, Qianbo Zang, David Ma, Jiawei Guo, Tuney Zheng, Minghao Liu, Xinyao Niu, Yue Wang, Jian Yang, Jiaheng Liu, Wanjun Zhong, Wangchunshu Zhou, Wenhao Huang, and Ge Zhang. Autokaggle: A multi-agent framework for autonomous data science competitions, 2024. URL https://arxiv.org/abs/2410.20424. Jiaju Lin, Haoran Zhao, Aochi Zhang, Yiting Wu, Huqiuyue Ping, and Qin Chen. Agentsims: An open-source sandbox for large language model evaluation, 2023. URL https://arxiv.org/abs/2308.04026. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. Marius Lindauer and Frank Hutter. Best practices for scientific research on neural architecture search. Journal of Machine Learning Research, 21(243):1–18, 2020. Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as Agents. https://arxiv.org/abs/2308.03688v2, August 2023. Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery, August 2024. URL https://arxiv.org/abs/2408.06292. R Duncan Luce and Howard Raiffa. Games and decisions: Introduction and critical survey. Courier Corporation, 2012. Yubo Ma, Zhibin Gou, Junheng Hao, Ruochen Xu, Shuohang Wang, Liangming Pan, Yujiu Yang, Yixin Cao, Aixin Sun, Hany Awadalla, and Weizhu Chen. Sciagent: Tool-augmented language models for scientific reasoning, 2024. URL https://arxiv.org/abs/2402.11451. METR. Evaluating frontier ai r&d capabilities of language model agents against human experts, 11 2024. URL https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/. Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for General AI Assistants, November 2023. URL https://arxiv.org/abs/2311.12983. Thomas Miconi, Aditya Rawal, Jeff Clune, and Kenneth O. Stanley. Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity, 2020. URL https://arxiv.org/abs/2002.10585. Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2022. URL https://arxiv.org/abs/2112.09332. 
Muhammad Umair Nasir, Sam Earle, Julian Togelius, Steven James, and Christopher Cleghorn. Llmatic: neural architecture search via large language models and quality diversity optimization. In proceedings of the Genetic and Evolutionary Computation Conference, pages 1110–1118, 2024. Miyu Oba, Akari Haga, Akiyo Fukatsu, and Yohei Oseki. Babylm challenge: Curriculum learning based on sentence complexity approximating language acquisition. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pages 290–297, 2023. Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, et al. Balrog: Benchmarking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543, 2024. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002. 25 Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URL https://arxiv.org/abs/2406.17557. Karl Popper. The logic of scientific discovery. Routledge, 2005. Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023. URL https://arxiv.org/abs/2307.16789. Brian Roberson. The colonel blotto game. Economic Theory, 29(1):1–24, 2006. Nicholas Roberts, Samuel Guo, Cong Xu, Ameet Talwalkar, David Lander, Lvfang Tao, Linhang Cai, Shuaicheng Niu, Jianyu Heng, Hongyang Qin, Minwen Deng, Johannes Hog, Alexander Pfefferle, Sushil Ammanaghatta Shivakumar, Arjun Krishnakumar, Yubo Wang, Rhea Sukthanker, Frank Hutter, Euxhen Hasanaj, Tien-Dung Le, Mikhail Khodak, Yuriy Nevmyvaka, Kashif Rasul, Frederic Sala, Anderson Schneider, Junhong Shen, and Evan Sparks. Automl decathlon: Diverse tasks, modern methods, and efficiency at scale. In Marco Ciccone, Gustavo Stolovitzky, and Jacob Albrecht, editors, Proceedings of the NeurIPS 2022 Competitions Track, volume 220 of Proceedings of Machine Learning Research, pages 151–170. PMLR, 28 Nov–09 Dec 2022a. URL https: //proceedings.mlr.press/v220/roberts23a.html. Nicholas Roberts, Xintong Li, Tzu-Heng Huang, Dyah Adila, Spencer Schoenberg, Cheng-Yu Liu, Lauren Pick, Haotian Ma, Aws Albarghouthi, and Frederic Sala. AutoWS-bench-101: Benchmarking automated weak supervision with 100 labels. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022b. URL https://openreview.net/forum?id=nQZHEunntbJ. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023. URL https: //arxiv.org/abs/2302.04761. Petra Schneider, W Patrick Walters, Alleyn T Plowright, Norman Sieroka, Jennifer Listgarten, Robert A Goodnow Jr, Jasmin Fisher, Johanna M Jansen, José S Duca, Thomas S Rush, et al. Rethinking drug design in the artificial intelligence era. Nature reviews drug discovery, 19(5):353–364, 2020. 
Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L. Dill. Learning a SAT solver from single-bit supervision. CoRR, abs/1802.03685, 2018. URL http://arxiv.org/abs/1802.03685. Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can llms generate novel research ideas? a large-scale human study with 100+ nlp researchers, 2024. URL https://arxiv.org/abs/2409.04109. Xiangru Tang, Yuliang Liu, Zefan Cai, Yanjun Shao, Junjie Lu, Yichi Zhang, Zexuan Deng, Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yin Fang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, and Mark Gerstein. ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code, June 2024. URL https://arxiv.org/abs/2311.09835. Artificial Intelligence Task Team. Artifical intelligence and nuclear power. 1985. Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. Alexander Tornede, Difan Deng, Theresa Eimer, Joseph Giovanelli, Aditya Mohan, Tim Ruhkopf, Sarah Segel, Daphne Theodorakopoulos, Tanja Tornede, Henning Wachsmuth, et al. Automl in the age of large language models: Current challenges, future opportunities and risks. arXiv preprint arXiv:2306.08107, 2023. Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U. Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Hannah Tan, and Omar G. Younis. Gymnasium: A standard interface for reinforcement learning environments, 2024. URL https://arxiv.org/abs/2407.17032. Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents, July 2024. URL https://arxiv.org/abs/2407.18901. 26 Renbo Tu, Nicholas Roberts, Mikhail Khodak, Junhong Shen, Frederic Sala, and Ameet Talwalkar. NAS-bench-360: Benchmarking neural architecture search on diverse tasks. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=xUXTbq6gWsB. Petar Veličković, Adrià Puigdomènech Badia, David Budden, Razvan Pascanu, Andrea Banino, Misha Dashevskiy, Raia Hadsell, and Charles Blundell. The clrs algorithmic reasoning benchmark. In International Conference on Machine Learning, pages 22084–22102. PMLR, 2022. Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023. Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, December 2024a. ISSN 2095-2228, 2095-2236. doi: 10.1007/ s11704-024-40231-1. Lei Wang, Jingsen Zhang, Hao Yang, Zhiyuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Ruihua Song, Wayne Xin Zhao, Jun Xu, Zhicheng Dou, Jun Wang, and Ji-Rong Wen. 
User behavior simulation with large language model based agents, 2024b. URL https://arxiv.org/abs/2306.02552. Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenDevin: An Open Platform for AI Software Developers as Generalist Agents, July 2024c. URL https://arxiv.org/abs/2407.16741. Adina Williams, Nikita Nangia, and Samuel R Bowman. The multi-genre nli corpus. 2018. Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. Os-copilot: Towards generalist computer agents with self-improvement. arXiv preprint arXiv:2402.07456, 2024. Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based Software Engineering Agents, July 2024. URL https://arxiv.org/abs/2407.01489. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017. John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering, May 2024. URL https://arxiv. org/abs/2405.15793. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X. Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?, July 2024. URL https://arxiv.org/abs/2407.15711. Kenny Young and Tian Tian. Minatar: An atari-inspired testbed for thorough and reproducible reinforcement learning experiments, 2019. URL https://arxiv.org/abs/1903.03176. Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, and Zhou Yu. Exact: Teaching ai agents to explore with reflective-mcts and exploratory learning, 2025. URL https://arxiv.org/abs/2410.02052. Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, and Chuang Gan. Building cooperative embodied agents modularly with large language models, 2024a. URL https: //arxiv.org/abs/2307.02485. Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1592–1604, 2024b. Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023. 27 Appendix A Additional Results and Analysis A.1 Computational Cost Table 7 lists the resources needed to run the agent on each task in MLGym-Bench. Each task has a set Training Timeout, which is used as the time limit for any python commands. Specifically, it is used to prevent the agent from continuously scaling the model parameters. Average agent runtime and Baseline runtime show the wall clock time for each agent run and the provided baseline code, respectively. 
Task Training Timeout GPUs/Agents Average Agent Runtime Baseline Runtime (mins) 30m 30m 30m 30m 30m 30m 40m 40m 40m 30m 30m 30m 30m 1 0 0 0 1 1 1 1 2 2 2 2 0 4h 30m 30m 30m 1.5h 2h 15 5 5 5 10 10 7 22 20 15 15 15 5 CIFAR-10 Battle of Sexes Prisoners Dilemma Blotto House Price Prediction Fashion MNIST MS-COCO MNLI Language Modeling Breakout Mountain Car Continuous Meta Maze 3-SAT Heuristic 4h 2h 2h 2h 30m

Table 7 Computational resources required for each task in MLGym-bench.

Table 8 lists the average input and output tokens and associated pricing for each model across all tasks in MLGym-Bench. We report the model pricing as listed by their respective providers. Llama3.1-405b-Instruct pricing is taken from Together AI. Note that for this work, we used the open-weights model checkpoint with FP-8 precision, hosted on Meta internal servers. Gemini-1.5-Pro charges 2X for using the long-context capabilities, i.e., for input and output exceeding 128K tokens. However, in our experiments, we do not observe Gemini using the long-context capabilities, so the final price is reported based on the normal pricing.

Model | Avg. Input Tokens | Avg. Output Tokens | Input Pricing | Output Pricing | Context Length
Llama3.1-405b-instruct∗ | 304348 | 2512 | 3.50 | 3.50 | 128k
Claude-3.5-Sonnet | 707704 | 12415 | 3.00 | 15.0 | 200k
Gemini-1.5-Pro† | 282613 | 1633 | 1.25 | 5.00 | 2M
GPT-4o | 266886 | 2429 | 2.50 | 10.0 | 128k
OpenAI O1-Preview | 368898 | 60704 | 15.0 | 60.0 | 128k

Table 8 Model pricing, token usage and context length details. Model pricing is in USD per 1M tokens. ∗ Llama3.1: FP8 endpoint by Together6

6 https://www.together.ai/pricing

A.2 Failure Mode Analysis

Figure 9 Number of Failed and Incomplete runs per task. The criteria for marking a run as incomplete or failed are described in Section 7.4.1.

Continuing the discussion from Section 7.4.1, we show the failed and incomplete runs for each task to understand the difficulty distribution of tasks. Language Modeling and all Reinforcement Learning tasks (Meta Maze, Mountain Car Continuous and Breakout) prove the most challenging, with the highest failure rates, whereas Fashion MNIST and Prisoner's Dilemma show the lowest failure rates, with all models producing a valid intermediate solution and a valid submission for all seeds. These failure patterns align with the raw performance scores in Table 5 and Table 6, where tasks requiring complex architectural decisions (Language Modeling) or complex algorithms (Breakout, Meta Maze and Mountain Car Continuous) also prove the most difficult. Traditional supervised learning tasks are handled more reliably across models, while the more advanced models demonstrate better error handling and completion rates overall.

A.3 Action Analysis

Extending the results presented in Section 7.4.2, Figure 10 shows the action distribution for each task. The bars represent the sum of all actions taken by all models on a particular task. We notice that RL tasks have the highest action count, while game-theoretic tasks have the lowest. Algorithmic tasks such as 3-SAT and the Game Theory tasks (Blotto, Prisoner's Dilemma and Battle of Sexes) also have the highest number of validation actions, signifying a quick experimental cycle. Similarly, all RL tasks have the most complex codebases among all MLGym-Bench tasks and thus agents extensively use the View commands.
Edit View Validate 800 Submit Search Python Bash Count 600 400 200 0 Breakout MountainCar Maze Regression CIFAR-10 FineWeb MS-COCO 3-SAT F-MNIST Blotto MNLI PD BoS Figure 10 Action Distribution for each task. We group the actions into categories following the grouping defined in Table 2 and Section 7.4.2. A.4 Model Rankings Table 9 and Table 10 show each model’s ranking based on Best Attempt@4 and Best Submission@4 scores respectively. The aggregate ranks are computed using the BORDA7 count method. The aggregated rankings computed using BORDA count method align with the AUP score results as shown in Table 4. However, similar to any ranking-only metric, it does not convey the relative difference between each model’s performance. Rank 1 2 3 4 5 6 CIFAR-10 Battle of Sexes Prisoners Dilemma Blotto House Price Prediction Fashion MNIST Language Modeling Breakout Mountain Car Continuous Meta Maze 3-SAT Heuristic BORDA Claude-3.5-Sonnet OpenAI O1 Llama3-405b-instruct Claude-3.5-Sonnet OpenAI O1 Claude-3.5-Sonnet OpenAI O1 Gemini-1.5-Pro OpenAI O1 Claude-3.5-Sonnet OpenAI O1 OpenAI O1 OpenAI O1 Gemini-1.5-Pro Gemini-1.5-Pro Gemini-1.5-Pro Claude-3.5-Sonnet GPT-4o Gemini-1.5-Pro OpenAI O1 Gemini-1.5-Pro OpenAI O1 GPT-4o Gemini-1.5-Pro Gemini-1.5-Pro Claude-3.5-Sonnet OpenAI O1 OpenAI O1 Gemini-1.5-Pro OpenAI O1 GPT-4o Llama3-405b-instruct Claude-3.5-Sonnet Gemini-1.5-Pro Llama3-405b-instruct Claude-3.5-Sonnet GPT-4o Llama3-405b-instruct GPT-4o GPT-4o Llama3-405b-instruct Gemini-1.5-Pro Claude-3.5-Sonnet Baseline Baseline Llama3-405b-instruct Gemini-1.5-Pro Llama3-405b-instruct Llama3-405b-instruct GPT-4o Claude-3.5-Sonnet Llama3-405b-instruct GPT-4o Llama3-405b-instruct Baseline Claude-3.5-Sonnet Llama3-405b-instruct Baseline Claude-3.5-Sonnet GPT-4o Baseline Baseline Baseline Baseline Baseline Baseline Llama3-405b-instruct GPT-4o GPT-4o GPT-4o Baseline Baseline Table 9 Individual and Aggregate Ranking of models based on Best Attempt@4. We use the BORDA method to compute the aggregate ranks. 7 https://en.wikipedia.org/wiki/Borda_count 30 Rank 1 2 3 4 5 6 CIFAR-10 Battle of Sexes Prisoners Dilemma Blotto House Price Prediction Fashion MNIST Language Modeling Breakout Mountain Car Continuous Meta Maze 3-SAT Heuristic BORDA Claude-3.5-Sonnet Gemini-1.5-Pro Gemini-15-Pro OpenAI O1 OpenAI O1 Claude-3.5-Sonnet OpenAI O1 Gemini-1.5-Pro OpenAI O1 Claude-3.5-Sonnet GPT-4o OpenAI O1 OpenAI O1 OpenAI O1 GPT-4o Claude-3.5-Sonnet Claude-3.5-Sonnet GPT-4o Gemini-1.5-Pro OpenAI O1 Gemini-1.5-Pro OpenAI O1 OpenAI O1 Gemini-1.5-Pro Gemini-1.5-Pro Claude-3.5-Sonnet OpenAI O1 Gemini-1.5-Pro Llama3-405b-instruct Gemini-1.5-Pro GPT-4o Llama3-405b-instruct Claude-3.5-Sonnet Llama3-405b-instruct Llama3-405b-instruct Claude-3.5-Sonnet GPT-4o Llama3-405b-instruct Claude-3.5-Sonnet GPT-4o Gemini-1.5-Pro OpenAI O1 Claude-3.5-Sonnet Baseline Baseline Gemini-1.5-Pro Gemini-1.5-Pro GPT-4o Llama3-405b-instruct GPT-4o Llama3-405b-instruct Llama3-405b-instruct GPT-4o Llama3-405b-instruct Baseline Claude-3.5-Sonnet Llama3-405b-instruct Baseline Claude-3.5-Sonnet Llama3-405b-instruct Baseline Baseline Baseline Baseline Baseline Baseline Llama3-405b-instruct GPT-4o GPT-4o GPT-4o Baseline Baseline Table 10 Individual and Aggregate Ranking of models based on Best Subimission@4. We use the BORDA method to compute the aggregate ranks. A.5 Memory Utilization Figure 11 and Figure 12 show the agent using the memory module to store and retrieve specific experimental results and use them to submit the best possible model. 
Figure 11 Example of retrieving the best training configuration from memory and restarting exploration from it. 31 Figure 12 Example of retrieving the best training configuration from memory and restarting exploration from it. B Prompts Listing 1 System Propmt SETTING : You are an autonomous machine learning researcher , and you ’ re working directly in the command line with a special interface . The special interface consists of a file editor that shows you 1000 lines of a file at a time . In addition to typical bash commands , you can also use the following commands to help you navigate and edit files . COMMANDS : open : docstring : opens the file at the given path in the editor . If line_number is provided , the window will be move to include that line signature : open " < path >" [ < line_number >] arguments : - path ( string ) [ required ]: the path to the file to open - line_number ( integer ) [ optional ]: the line number to move the window to ( if not provided , the window will start at the top of the file ) goto : docstring : moves the window to show < line_number > signature : goto < line_number > arguments : - line_number ( integer ) [ required ]: the line number to move the window to scroll_down : docstring : moves the window down 1000 lines signature : scroll_down scroll_up : docstring : moves the window down 1000 lines signature : scroll_up create : 32 docstring : creates and opens a new file with the given name signature : create < filename > arguments : - filename ( string ) [ required ]: the name of the file to create search_dir : docstring : searches for search_term in all files in dir . If dir is not provided , searches in the current directory signature : search_dir < search_term > [ < dir >] arguments : - search_term ( string ) [ required ]: the term to search for - dir ( string ) [ optional ]: the directory to search in ( if not provided , searches in the current directory ) search_file : docstring : searches for search_term in file . If file is not provided , searches in the current open file signature : search_file < search_term > [ < file >] arguments : - search_term ( string ) [ required ]: the term to search for - file ( string ) [ optional ]: the file to search in ( if not provided , searches in the current open file ) find_file : docstring : finds all files with the given name in dir . If dir is not provided , searches in the current directory signature : find_file < file_name > [ < dir >] arguments : - file_name ( string ) [ required ]: the name of the file to search for - dir ( string ) [ optional ]: the directory to search in ( if not provided , searches in the current directory ) edit : docstring : replaces lines < start_line > through < end_line > ( inclusive ) with the given text in the open file . The replacement text is terminated by a line with only end_of_edit on it . All of the < replacement text > will be entered , so make sure your indentation is formatted properly . Python files will be checked for syntax errors after the edit . If the system detects a syntax error , the edit will not be executed . Simply try to edit the file again , but make sure to read the error message and modify the edit command you issue accordingly . Issuing the same command a second time will just lead to the same error message again . 
  signature: edit <start_line>:<end_line>
  <replacement_text>
  end_of_edit
  arguments:
    - start_line (integer) [required]: the line number to start the edit at
    - end_line (integer) [required]: the line number to end the edit at (inclusive)
    - replacement_text (string) [required]: the text to replace the current selection with

insert:
  docstring: inserts the given text after the specified line number in the open file. The text to insert is terminated by a line with only end_of_insert on it. All of the <text_to_add> will be entered, so make sure your indentation is formatted properly. Python files will be checked for syntax errors after the insertion. If the system detects a syntax error, the insertion will not be executed. Simply try to insert again, but make sure to read the error message and modify the insert command you issue accordingly.
  signature: insert <line_number>
  <text_to_add>
  end_of_insert
  arguments:
    - line_number (integer) [required]: the line number after which to insert the text
    - text_to_add (string) [required]: the text to insert after the specified line

submit:
  docstring: submits your current code and terminates the session
  signature: submit

validate:
  docstring: validates your current submission file and returns the metrics on the test set
  signature: validate

Please note that THE EDIT and INSERT COMMANDS REQUIRE PROPER INDENTATION. If you'd like to add the line 'print(x)' you must fully write that out, with all those spaces before the code! Indentation is important and code that is not indented correctly will fail and require fixing before it can be run.

RESPONSE FORMAT:
Your shell prompt is formatted as follows:
(Open file: <path>)

You need to format your output using two fields; discussion and command. Your output should always include _one_ discussion and _one_ command field EXACTLY as in the following example:

DISCUSSION
First I'll start by using ls to see what files are in the current directory. Then maybe we can look at some relevant files to see what they look like.
```
ls -a
```

You should only include a *SINGLE* command in the command section and then wait for a response from the shell before continuing with more discussion and commands. Everything you include in the DISCUSSION section will be saved for future reference. Please do not include any DISCUSSION after your action. If you'd like to issue two commands at once, PLEASE DO NOT DO THAT! Please instead first submit just the first command, and then after receiving a response you'll be able to issue the second command.

You're free to use any other bash commands you want (e.g. find, grep, cat, ls, cd) in addition to the special commands listed above. However, the environment does NOT support interactive session commands (e.g. python, vim), so please do not invoke them.

Your goal is to achieve the best possible score, not just to submit your first working solution. Consider strategies like validating your answer using the `validate` command, manually spot-checking predictions, building custom validation sets and grading functions, and comparing different algorithms. Once you have exhausted all possible solutions and cannot make progress, you can submit your final solution by using the `submit` command.

INSTRUCTIONS:
Now, you're going to train a model to improve performance on this task. Your terminal session has started and you're in the workspace root directory.
You can use any bash commands or the special interface to help you. Edit all the files you need or create a new training script. Remember, YOU CAN ONLY ENTER ONE COMMAND AT A TIME. You should always wait for feedback after every command.

When you're satisfied with all of the changes you've made, you can run your training file. Your training file should include the logic for saving the predictions for the `test` set of the task. The submission file should be named `submission.csv` with the instance id and prediction columns. A sample submission file is given in the workspace and you can read it to get a better understanding of the submission format.

Note however that you cannot use any interactive session commands (e.g. python, vim) in this environment, but you can write scripts and run them. E.g. you can write a python script and then run it with `python <script_name>.py`.

NOTE ABOUT THE EDIT AND INSERT COMMANDS: Indentation really matters! When editing a file, make sure to insert appropriate indentation before each line!

IMPORTANT TIPS:
1. Always start by trying to understand the baseline script if available. This will give you an idea of one possible solution for the task and the baseline scores that you have to beat.
2. If you run a command and it doesn't work, try running a different command. A command that did not work once will not work the second time unless you modify it!
3. If you open a file and need to get to an area around a specific line that is not in the first 100 lines, say line 583, don't just use the scroll_down command multiple times. Instead, use the goto 583 command. It's much quicker.
4. Always make sure to look at the currently open file and the current working directory (which appears right after the currently open file). The currently open file might be in a different directory than the working directory! Note that some commands, such as 'create', open files, so they might change the current open file.
5. When editing files, it is easy to accidentally specify a wrong line number or to write code with incorrect indentation. Always check the code after you issue an edit to make sure that it reflects what you wanted to accomplish. If it didn't, issue another command to fix it.
6. You have a limited number of actions/steps you can take in the environment. The current step and the remaining number of steps will be given after every action. Use the remaining steps wisely. If you only have a few remaining steps, it is better to submit a working solution than to keep trying.
7. Each of your actions should take less than 1800 seconds to complete. If your action doesn't finish within the time limit, it will be interrupted.

(Current Step: 0, Remaining Steps: 50)
(Open file: n/a)
(Current directory: /home/agent/imageClassificationCifar10)
bash-
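To make the inclusive, 1-indexed line-range semantics of the `edit` command in Listing 1 concrete, the following is a minimal Python sketch. It is an illustration only, not the actual MLGym file-editor implementation, and `apply_edit` is a hypothetical helper name.

```
def apply_edit(lines, start_line, end_line, replacement_text):
    """Replace lines start_line..end_line (1-indexed, inclusive) with replacement_text.

    lines: list of strings, one per line of the open file.
    replacement_text: the text supplied between the edit command and end_of_edit.
    Returns the new list of lines.
    """
    new_lines = replacement_text.splitlines()
    # Convert the 1-indexed inclusive range to Python's 0-indexed slicing.
    return lines[: start_line - 1] + new_lines + lines[end_line:]

# Illustrative usage on hypothetical file contents.
file_lines = ["import torch", "lr = 0.1", "epochs = 5"]
print(apply_edit(file_lines, 2, 3, "lr = 0.01\nepochs = 20"))
# ['import torch', 'lr = 0.01', 'epochs = 20']
```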