MLGym: A New Framework and Benchmark
for Advancing AI Research Agents
Deepak Nathani1† , Lovish Madaan2,7 , Nicholas Roberts3† , Nikolay Bashlykov7 , Ajay Menon7 ,
Vincent Moens5 , Amar Budhiraja7 , Despoina Magka6 , Vladislav Vorotilov7 , Gaurav Chaurasia7 ,
Dieuwke Hupkes7 , Ricardo Silveira Cabral7 , Tatiana Shavrina7 , Jakob Foerster6 , Yoram
Bachrach6 , William Yang Wang1 , Roberta Raileanu2,7
1 University of California, Santa Barbara, 2 University College London, 3 University of Wisconsin–Madison, 4 University of Oxford, 5 PyTorch Core Libraries at Meta, 6 FAIR at Meta, 7 GenAI at Meta
† Work done during internship at Meta
We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and
developing LLM agents on AI research tasks. This is the first Gym environment for machine learning
(ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents.
MLGym-Bench consists of 13 diverse and open-ended AI research tasks from domains such as
computer vision, natural language processing, reinforcement learning, and game theory. Solving these
tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating
and processing data, implementing ML methods, training models, running experiments, analyzing
the results, and iterating through this process to improve on a given task. We evaluate a number
of frontier large language models (LLMs) on our benchmark, such as Claude-3.5-Sonnet, Llama-3.1
405B, GPT-4o, o1-preview, and Gemini-1.5 Pro. Our MLGym framework makes it easy to add new
tasks, integrate and evaluate models or agents, generate synthetic data at scale, as well as develop new
learning algorithms for training agents on AI research tasks. We find that current frontier models can
improve on the given baselines, usually by finding better hyperparameters, but do not generate novel
hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework
and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.
Date: February 21, 2025
Correspondence: Deepak Nathani at dnathani@ucsb.edu, Roberta Raileanu at raileanu@meta.com
Code: https://github.com/facebookresearch/MLGym
1 Introduction
Accelerating scientific discovery has been a long-standing ambition in artificial intelligence (AI) research,
with early initiatives like the Oak Ridge Applied Artificial Intelligence Project in 1979 exploring this direction (Team,
1985; Emrich et al., 1988; Johnson and Schaffer, 1994). More recent explorations enabled by advances in
foundation models (Achiam et al., 2023; Anthropic, 2024; Team et al., 2024; Dubey et al., 2024) provide
a proof-of-concept of a fully automated pipeline for end-to-end paper generation (Lu et al., 2024). In the
future, we envision AI Research Agents capable of independently conducting literature search, generating
scientific hypotheses, designing experiments, implementing new methods, analyzing results, disseminating
findings by writing scientific papers, and applying this research in products, thus assisting with all parts of
the research process. Such agents should be capable of both working fully autonomously and being guided by
human supervision, taking into account feedback from users.
This vision stems from the recognition that AI, with its capacity to process vast datasets and discern complex
patterns, could accelerate scientific breakthroughs in areas such as drug discovery and materials science by
identifying promising drug candidates or predicting the properties of novel materials (Hessler and Baringhaus,
2018; Schneider et al., 2020; Guo et al., 2021). Unlike traditional methods, AI agents can reveal hidden
interdisciplinary relationships by analyzing vast knowledge graphs, leading to novel insights and solutions
Figure 1 Diagram of MLGym, a unified framework designed to integrate diverse and open-ended AI research tasks into a single platform for developing and evaluating LLM agents on these tasks.
for complex challenges like climate modeling. By automating laborious tasks and exploring unconventional
avenues, AI agents can liberate scientists to focus on higher-level cognitive activities, ultimately driving
innovation and expanding the frontiers of knowledge. Machine learning (ML) research, with its emphasis on
empirical validation and systematic experimentation in simulation, presents an ideal testbed for exploring and
improving the utility of LLMs for advancing scientific research.
However, the scientific method inherently relies on empirical validation, rigorous evaluation, and standardized
benchmarks to ensure the reliability and reproducibility of findings. While significant progress has been made
in developing AI agents for various domains (Yang et al., 2024; Wu et al., 2024; Ma et al., 2024; Deng et al.,
2023; Wang et al., 2023), we currently lack comprehensive frameworks and benchmarks specifically designed
to assess their capabilities in conducting open-ended AI research tasks in diverse domains. This absence
of standardized evaluation tools hinders our ability to objectively measure progress and identify areas for
improvement in this emerging field.
Recently, a number of papers have started to evaluate LLM agents on various SWE and ML tasks; notable
examples include SWE-Bench (Jimenez et al., 2023), SWE-agent (Yang et al., 2024), ScienceAgentBench (Chen
et al., 2024), SUPER (Bogin et al., 2024), MLE-Bench (Chan et al., 2024), MLAgentBench (Huang et al.,
2024), and RE-Bench (METR, 2024). However, existing benchmarks for AI Research Agents either do not
include open-ended research tasks, or only cover a narrow range of research domains. In addition, existing
frameworks are not designed to enable research on different training algorithms for AI Research Agents such as
reinforcement learning, curriculum learning, or open-ended learning. Finally, current frameworks do not allow
flexible artifacts to be evaluated (e.g. different outputs of the agent’s research such as a model, algorithm, or
set of predictions).
In this paper, we introduce MLGym—the first Gym (Brockman et al., 2016) environment for AI Research
Agents and a unified framework designed to integrate diverse and open-ended AI research tasks into a single
platform for developing and evaluating LLM agents on such tasks (see Figure 1 for a diagram of MLGym).
Being a Gym environment, our framework enables research on different training algorithms for AI Research
Agents such as reinforcement learning (RL), curriculum learning, and open-ended learning. We also release
MLGym-Bench, a curated set of 13 open-ended research tasks, covering a wide range of domains such as
computer vision, natural language processing, reinforcement learning, and game theory, carefully crafted to
evaluate the performance of agents in realistic, multifaceted workflows. MLGym and MLGym-Bench expand
the range of problems considered by current LLM agent frameworks and benchmarks, by offering the ability
to flexibly evaluate performance on open-ended research tasks. For example, performance can be measured
based on various artifacts such as model weights, RL training algorithms, or code representing game theory
strategies. We compare five frontier LLMs across the tasks in MLGym-Bench under consistent experimental
settings, highlighting their strengths and limitations. Finally, we propose a new evaluation metric for agents,
adapted from the optimization (Dolan and Moré, 2002) and automated machine learning (AutoML; Roberts et al., 2022a) literature, to more fairly assess the relative performance of LLM agents across tasks with their own distinct performance metrics.
To summarize our contributions, we (i) introduce MLGym, the first Gym environment for evaluating and
developing AI Research Agents, (ii) release MLGym-Bench, a suite of diverse open-ended AI research tasks
for evaluating LLM agents, (iii) propose a new evaluation metric for comparing multiple agents on a variety
of tasks, and (iv) extensively evaluate frontier LLMs on MLGym-Bench. Finally, MLGym makes it easy for
researchers and developers to integrate and evaluate new tasks, agents, or models.
In the rest of the paper, we discuss related LLM agent frameworks and benchmarks, provide an overview
of the MLGym framework, introduce the mechanics behind MLGym-Bench and its evaluation, present our
experimental setup and results, and conclude with a discussion of limitations and future extensions.
1.1 Capability Levels for AI Research Agents
We propose a hierarchical framework to categorize the capabilities of LLM agents for accelerating AI research.
This framework consists of six levels, each representing a distinct degree of autonomy and scientific contribution.
Level 0: Reproduction At this level, LLM agents can reproduce existing research papers either with or
without access to the original code. This level demonstrates a basic understanding of the research domain
and the ability to replicate established results.
Level 1: Baseline Improvement At Level 1, LLM agents can improve performance on a benchmark given a
baseline code that is not state-of-the-art (SOTA). This level indicates the ability to analyze and optimize
existing solutions, even if they are not the most advanced.
Level 2: SOTA Achievement At Level 2, LLM agents can achieve SOTA performance on a benchmark given
only a task description and access to the published literature before the invention of the SOTA approach, but
no access to the SOTA paper or code. This level demonstrates the ability to come up with a solution to an
open research problem which is as good as the one found by humans.
Level 3: Novel Scientific Contribution At Level 3, LLM agents can make a novel scientific contribution, such
as coming up with a new method that establishes a new SOTA on multiple benchmarks, and is worthy of
publication at a top ML conference such as NeurIPS.
Level 4: Groundbreaking Scientific Contribution At Level 4, LLM agents can identify key research questions,
directions, and solutions, and make a notable scientific contribution worthy of an oral presentation or a best
paper award at a prestigious ML conference such as NeurIPS.
Level 5: Long-Term Research Agenda At Level 5, LLM agents can pursue a long-term research agenda,
coming up with the research questions, directions, and solutions, continuously producing scientific discoveries
over the span of weeks, months, or years. LLMs at this level should be capable of paradigm-shifting research
breakthroughs worthy of prizes such as Nobel or Turing.
By defining these capability levels, we provide a framework for evaluating frontier AI Research Agents.
MLGym-Bench focuses on Level 1 (Baseline Improvement) of the categorization defined above.
2 Related Work
2.1 AI Research Frameworks and Benchmarks
Table 1 shows a comparison between MLGym and MLGym-Bench with other related LLM agent frameworks
and benchmarks. Below, we expand on the differences between MLGym and these works.
First, MLGym is the first framework for AI Research Agents that provides a Gym interface, making it easy
to integrate and train these agents using RL algorithms. MLGym-Bench is also the first benchmark to include
tasks that require research on algorithms in multiple domains such as RL, game theory, or SAT.
Benchmark            Gym Interface   Algorithmic Tasks   Open-Ended Research   Flexible Artifacts   Agentic Harness
MLGym (ours)         ✓               ✓                   ✓                     ✓                    ✓
MLE-Bench            ✗               ✗                   ✗                     ✗                    ✗
SWE-Bench/Agent      ✗               ✗                   ✗                     ✗                    ✓
MLAgentBench         ✗               ✗                   ✓                     ✓                    ✓
RE-Bench             ✗               ✗                   ✓                     ✓                    ✗
ScienceAgentBench    ✗               ✗                   ✗                     ✗                    ✗
Table 1 Comparison of MLGym and MLGym-Bench with other related LLM agent frameworks and benchmarks.
Algorithmic Tasks refers to the inclusion of tasks that require coming up with new algorithms such as reinforcement
learning, game theory or SAT problems. Open-ended Research refers to the inclusion of tasks that are not fully solved
by the research community and where multiple new solutions could be discovered such as language modeling, game
theory or SAT problems. Flexible Artifacts refers to the allowance of different research artifacts such as model weights,
reinforcement learning algorithms, or code capturing an agent’s strategy.
Second, MLGym-Bench encompasses a wide range of open-ended AI research tasks, covering supervised
learning, language modeling, reinforcement learning, game theory, and SAT. In contrast, SWE-Bench/SWE-Agent (Yang et al., 2024) focuses on solving GitHub issues, so the code changes either fix the issue or not
(as opposed to optimization tasks with finer-grained metrics, such as a loss metric in a supervised learning
problem). Similarly, MLE-Bench (Chan et al., 2024) includes narrowly scoped machine learning tasks from
Kaggle competitions. While these tasks have a spectrum of quality levels, they tend to be already solved
by current state-of-the-art methods. On the other hand, MLAgentBench (Huang et al., 2024) contains both
ML-specialized tasks (regression, classification, code speed improvements) and tasks focused on recent research
challenges (e.g. CLRS reasoning corpus (Veličković et al., 2022), BabyLM challenge (Oba et al., 2023)).
RE-bench (METR, 2024) also consists of broadly scoped ML engineering tasks which are hard to saturate and
reward increasingly sophisticated approaches. ScienceAgentBench (Chen et al., 2024) incorporates data-driven
scientific discovery tasks extracted from peer-reviewed publications, but which are so specific that they
resemble Kaggle competitions rather than open research questions.
Third, MLGym allows for flexible evaluation artifacts: it is sufficient to provide Python code that the agent can
call to examine the quality of its current solution, such as a model checkpoint or an RL algorithm. In contrast,
MLE-Bench requires a CSV file to be submitted for grading each question and SWE-Bench/Agent require
evaluating a piece of code through a collection of unit tests. MLAgentBench, RE-Bench and ScienceAgentBench
provide Python scripts to compute the evaluation scores.
Finally, MLGym enables easy evaluation of both models and agents. To facilitate model evaluation, MLGym
provides a default agentic harness that can be used out-of-the-box to evaluate any base model.
2.2 LLM Agents
Research on tool-augmented LLMs (Schick et al., 2023) has inspired a new research agenda of “agentic”
LLMs (Kaddour et al., 2023; Wang et al., 2024a), where LLMs interact with an external environment.
Existing work explores teaching LLMs to use tools or APIs (Schick et al., 2023; Qin et al., 2023), navigate
the web (Nakano et al., 2022; Deng et al., 2023; Zhou et al., 2023), interface with operating systems (Wu
et al., 2024), play games (Paglieri et al., 2024; Wang et al., 2023), or interact with other simulated (Wang
et al., 2024b; Lin et al., 2023) or physical worlds (Zhang et al., 2024a). Evaluating agentic LLMs typically
involves designing controlled environments, providing suitable tools, defining tasks and goals, and establishing
quantitative metrics to measure the system’s performance.
Building on these directions, Yoran et al. (2024) introduce AssistantBench, emphasizing the complexity of
open-web navigation and showcasing how current systems struggle with realistic, time-consuming tasks such
as monitoring real-estate markets or identifying nearby businesses. Meanwhile, Kapoor et al. (2024) highlight
the importance of standardized evaluation protocols that consider both accuracy and cost, warning against
overfitting and advocating for more reproducible benchmarks. Extending these concerns to multi-dimensional
environments, Liu et al. (2023) propose AgentBench—a suite of eight interactive settings that test agents’
capacity for reasoning, decision-making, and long-term instruction following. Similarly, Mialon et al. (2023)
focus on holistic planning skills through GAIA, a benchmark designed to assess performance on real-world
questions requiring robust tool-use and multimodal reasoning, revealing substantial gaps between human-level
proficiency and current LLMs. Finally, Trivedi et al. (2024) emphasize the necessity of sophisticated tool
integration with AppWorld, an interactive environment where agents must operate diverse applications via
APIs and generate complex code in an iterative fashion. Collectively, these works underscore not only the
breadth of agentic LLM capabilities but also the pressing need for systematic, multifaceted benchmarks that
capture complex tasks with verifiable results and foster reproducible progress in the field. However, none of
these works focuses on evaluating or developing LLM agents for open-ended AI research tasks.
2.3 Agents for Software Engineering and Data Science
In line with the principle of reproducibility and verifiability, software engineering tasks provide a testbed
for LLM agents, where tasks can be tightly scoped and outcomes rigorously measured. Recent work has
explored how agents can tackle code-level challenges in controlled settings that permit systematic evaluation.
As discussed above, Yang et al. (2024) introduce SWE-agent, which operates within a constrained agent-computer interface to facilitate file creation, repository navigation, and code testing—thereby enhancing both
traceability and reproducibility on benchmarks such as SWE-bench and HumanEvalFix. Similarly, Wang
et al. (2024c) describe OpenHands, a platform that restricts agent interactions to sandboxed environments for
safer command execution and verifiable web browsing, and in doing so provides a standardized foundation
for benchmarking. Magentic-One (Fourney et al., 2024) is another agentic system competent in software
engineering but also augmented with web navigation capabilities, as demonstrated by its strong performance
on the GAIA, AssistantBench and WebArena (Zhou et al., 2023) agentic benchmarks. On the other hand,
Zhang et al. (2024b) achieve competitive performance on SWE-bench with AutoCodeRover, which, unlike the
agentic approaches, solves Github issues by combining LLM-based programming with program representation
as an abstract syntax tree.
Towards the goal of automating data science work, Li et al. (2024) introduce AutoKaggle, a multi-agent
human-assisting system, and Grosnit et al. (2024) present AgentK v1.0, an end-to-end autonomous data
science agent; both of these systems perform well on Kaggle competition data. Still within the realm of
data science work, Lei et al. (2024) build Spider 2.0, a challenging benchmark and code agent framework
for automating text-to-SQL workflows. Going one step further, Cao et al. (2024) introduce Spider 2-V, an
autonomous multimodal agent coupled with a benchmark focusing on the automation of enterprise data
science and engineering workflows.
More search-oriented approaches include SWE-Search (Antoniades et al., 2024), a multi-agent framework
that marries Monte Carlo Tree Search (MCTS) with iterative refinement, enabling agents to continuously
evaluate and improve their approaches to repository-level tasks. In a similar vein, Koh et al. (2024b) explore
tree search for LLM agents and show that equipping LLM agents with best-first search boosts performance for
the WebArena and VisualWebArena (Koh et al., 2024a) agentic benchmarks. Also on augmenting LLM agents
with search, Yu et al. (2025) propose MCTS-based test-time search and self-learning techniques that yield
better performance on VisualWebArena. Finally, Xia et al. (2024) demonstrate that even relatively simple
approaches can excel when thoroughly monitored: an ’agentless’ system follows a three-step process and
outperforms more complex agent-based methods on SWE-bench Lite, underscoring the value of constrained,
verifiable environments in driving reproducible gains for autonomous SWE agents.
2.4 Agents for Scientific Research
Controlled SWE contexts build the foundation for more complex automation while maintaining a reproducible
and verifiable approach. However, software foundations alone are not sufficient to address the remaining gaps on the path to accelerating science. Moving from constrained environments and well-defined tasks with clear metrics to the less-defined space of open-ended questions requires substantial effort to improve the capabilities of research agents. For instance, devising automatable criteria to gauge scientific novelty, or constructing theories that build on automated findings from heterogeneous disciplines, are areas that still need refinement and experimentation.
Nevertheless, the first steps on this path can be taken now in the fields of ML research and data science, since these areas offer a scientific playground with tasks that are well-defined and have formal criteria of verifiability (benchmarks and tests), falsifiability (ablation studies and tests for data leakage, memorization, out-of-domain generalization, etc.), and reproducibility.
Data Science
Many recent works approach both classic data science tasks and real-life repository-based tasks as a testbed
for agents with a known test set and metrics. While based on similar grounds, the works differ in the resulting
levels of autonomy of the agents. For instance, ML-Bench (Tang et al., 2024) focuses on explicit tasks within
existing GitHub repositories — evaluating agents in code-centric setups without delving into open-ended
objectives. By contrast, Data Interpreter (Hong et al., 2024) extends agent testing to broader data science
problems, spanning coding tasks, mathematical reasoning, and a limited suite of open-ended applications (e.g.,
OCR, web search, and mini-game generation), thus reflecting a more flexible approach to autonomy. The
agentic benchmark SUPER (Bogin et al., 2024) raises the bar by requiring the agent to formulate the task
itself and iterate on NLP-related data and tasks within research repositories, thereby emphasizing self-directed
problem-solving.
AI Research
Since machine learning itself relies on models and simulations, it naturally becomes an object of automation as well. Having an agent formulate a task itself and approach open-ended problems naturally leads to automatic, agentic enhancement of machine learning methods themselves. AutoML (Eggensperger et al., 2019; Lindauer and Hutter, 2020; Tornede et al., 2023) and NAS (Elsken et al., 2019; Nasir et al., 2024) approaches have previously laid the foundations of ML automation within environments with built-in restrictions (an explicit set of methods, a defined search space and strategy), whereas the agentic approach can propose open-ended solutions without such specifications.
For example, MLAgentBench (Huang et al., 2024) consists of an environment for agents to solve 13 complex
tasks ranging from improving image classification to language modeling, with the current state-of-the-art
LLMs achieving 0% success rate for the most difficult of these tasks. The proposed pipelines for agents in
the environment include designing and running experiments, analyzing the results, and iterating towards
improving the defined metrics. Similarly, RE-Bench (Research Engineering Benchmark) (METR, 2024) is a set
of 7 diverse and challenging ML tasks, with the methodological addition of involving real human experts for progress comparison: timed sessions for ML experts vs. LLM agents. The authors state that the best agents achieve
a score 4x higher than human experts when both are given a total time budget of 2 hours per environment.
However, humans currently display better returns to increased time budgets, narrowly exceeding the top AI
agent scores given an 8-hour budget, and achieving 2x the score of the top agent when both are given 32 total
hours. MLE-bench (Chan et al., 2024) focuses on Kaggle tasks as a source for agentic evaluations. Agents are
evaluated on well-defined metrics, datasets, and real competition result distributions, and attempts are limited to 24 hours. However, in contrast with MLGym, all these works cover a narrower set of domains and do not assess algorithmic reasoning capabilities. Moreover, some of them do not provide a standardized agentic harness for model evaluation, instead varying both the harnesses (also known as scaffolds) and the LLMs when comparing performance. While our work focuses on creating an evaluation framework with
objective and standardized evaluation metrics, other recent works focus on developing an agentic harness for
the more subjective task of generating papers based on end-to-end experimental cycles (Lu et al., 2024).
Scientific Discovery
Several recent works have approached scientific automation with LLM agents targeting the process of scientific
discovery. DiscoveryWorld (Jansen et al., 2024) is a benchmark that evaluates scientific agents in a game-like virtual discovery environment. Its 120 tasks require an agent to form hypotheses, design and run experiments, analyze results, and act on conclusions, in areas such as proteomics, chemistry, archeology, physics, agriculture, rocket science, linguistics, and epidemiology. The custom simulation engine only supports a limited list of objects and 14 possible actions. A distinctive feature of the work is its focus on general discovery skills rather than task-specific solutions, with the assessment and the space of objects and actions shared across all scientific domains.
ScienceAgentBench (Chen et al., 2024), however, takes a different approach to the similar goal of creating a discovery-based agentic benchmark: the tasks are drawn from 44 selected peer-reviewed publications that include data-driven discovery tasks with well-defined metrics. The scientific areas covered include bioinformatics, computational chemistry, geographical information science, and neuroscience, yielding 102 tasks of various types, such as data processing, modeling, or visualization. Each task is defined by a Python-based evaluation environment, end-result metrics, and intermediate evaluation criteria. Dedicated metrics control for data contamination and agent shortcut issues. Comparing different baselines, including pure LLMs with prompting, the authors conclude that execution feedback is necessary for agents to generate useful solutions.
The idea of execution feedback and iterative improvement for research tasks has also been explored in ResearchAgent (Baek et al., 2024). Its agentic, concept-based approach with literature-based discovery shows strong improvements in end-to-end iterative solution generation, supported by ablations comparing knowledge-based and random facts. However, the agent is evaluated solely with subjective human preference annotations and automatic human preference evaluations. While these cover structured aspects of the end-to-end experimental pipeline (problem clarity, feasibility, significance, relevance, originality, method generalizability, innovativeness, experiment reproducibility, validity, etc.), relying solely on human judgment without objective metrics is insufficient, as Si et al. (2024) show.
3 MLGym
An LLM agent can perform ML research/development by interacting with a shell environment through a
sequence of commands. Given a task description, some starter code and access to its action and observation
history, the LLM generates appropriate shell commands to accomplish research objectives like generating
ideas, processing data, implementing new methods, training and evaluating models, analyzing the results, and
reasoning about what experiments to run next. The agent is iteratively prompted to take actions based on
the task description and execution feedback from previous commands, allowing it to develop and self-refine
the solutions in-context.
MLGym provides a unified framework for evaluating and developing agents and models for AI research tasks. We take inspiration from the long-standing field of RL and build a Gym (Brockman et al., 2016) environment that can execute shell commands in a shell on a local Docker machine. MLGym provides access to four core components: Agents, Environment, Datasets, and Tasks. MLGym's modular design allows one to easily utilize and extend the library. For example, researchers can implement other agentic harnesses to improve performance, expand the environment by adding more tools for an agent, add more datasets within a given task (e.g., if the task is image classification they could add ImageNet in addition to CIFAR-10), and even add more tasks to the MLGym benchmark. Below, we discuss each component
in detail.
3.1 Agents
The Agent class provided by MLGym acts as a wrapper around a base LLM and provides functionality
for integrating various base models, history processors, and cost management. Moreover, unlike other
frameworks (Huang et al., 2024; Yang et al., 2024), MLGym separates the agent from the environment,
allowing for easy integration of external agents. This also enables one to fairly compare different base models
given the same agentic harness, without the need to implement their own agentic orchestration.
The agent is expected to take the history of all prior observations and actions as input and return the next
action to take. The provided action is then passed to the environment, which executes the command and
returns the next observation based on the command output. The agent can execute any bash command in
the environment. In addition, it has access to a set of tools (i.e., bash scripts such as editing a file) that it can
use similarly to any other bash command. MLGym provides an agent adapted from SWE-Agent (Yang et al.,
2024) as a default agentic harness. We describe the design and configuration of the tools in Section 3.5. The
full system prompt used can be found in Listing 1.
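To make this interaction pattern concrete, the sketch below shows a minimal agent-environment loop in this spirit; the class and function names (ResearchAgent, query_llm, env.step) are illustrative assumptions rather than the actual MLGym API.

```python
# Minimal sketch of the agent-environment loop described above.
# All names (ResearchAgent, query_llm, env) are hypothetical, not the MLGym API.
from dataclasses import dataclass, field
from typing import List


def query_llm(system_prompt: str, history: List[dict]) -> str:
    """Placeholder for the base-model call (e.g. an API request); returns the next shell command."""
    return "validate"  # stub


@dataclass
class ResearchAgent:
    """Wraps a base LLM and keeps the history of observations and actions."""
    system_prompt: str                                # tool docs + task and dataset description
    history: List[dict] = field(default_factory=list)

    def act(self, observation: str) -> str:
        # Condition on the full interaction history plus the latest observation
        # and return the next shell command to execute.
        self.history.append({"role": "user", "content": observation})
        command = query_llm(self.system_prompt, self.history)
        self.history.append({"role": "assistant", "content": command})
        return command


def run_episode(env, agent: ResearchAgent, max_steps: int = 50) -> None:
    """ReAct-style loop: observe, generate a command, execute it, repeat."""
    observation = env.reset()                         # task description + workspace listing
    for _ in range(max_steps):
        command = agent.act(observation)
        observation, done = env.step(command)         # run the command in the shell, capture output
        if done:                                      # e.g. the agent issued `submit`
            break
```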
3.2 Environment
MLGym environments are designed as Gymnasium (gym) environments (Towers et al., 2024). The environment
component is responsible for initializing a shell environment in a local docker machine, with all the required
tools, installing task-specific Python dependencies, copying all the necessary data and code into a separate agent workspace, and managing interactions between the LLM agent and the system. Moreover, to support open-ended research tasks and make the environment safe and flexible, the MLGym environment also manages
permissions for various files and directories. Specifically, when running in a docker container, due to various
security concerns associated with using a root user, we create a non-root user named "agent" and set the
appropriate permissions for the working directory.
In this work, we make a conscious decision to decouple tools and ACI as defined in SWE-Agent (Yang et al.,
2024)1 . Note that this ensures that the agent and environment are not tightly coupled, allowing for easier
implementation of other agentic architectures. Practically, this means that when the environment is initialized,
it also initializes the tools in the working environment and prepares tool documentation which can be
added to the LLM agent’s prompt. More details about the tools are provided in Section 3.5.
3.3 Datasets
MLGym provides a simple abstraction for defining datasets through configuration files. It supports both
locally stored and Hugging Face datasets. We decouple the dataset definition from the task definition, so that
a single dataset can be used in multiple tasks. Similarly, a single task can have more than one dataset so
that the agent’s code can be evaluated across all of them to demonstrate the generality of the implemented
method.
Moreover, if the dataset files are stored locally, the environment automatically copies the relevant files to the
agent workspace with read-only permissions. This ensures that the agent cannot change the dataset files,
which is important for reproducibility and cheating prevention.
If the dataset is hosted on Hugging Face, the agent is given the dataset URL through the starter code or in the prompt and asked to utilize it. Note that if the LLM agent fails to follow these instructions or uses a different dataset, the evaluation code will either fail or report degraded performance.
3.4 Tasks
We provide an easy abstraction to define any ML research task using configuration files. Each task can
incorporate one or more datasets, custom evaluation scripts (with read-only access), task-specific conda
environment, optional starter code, training timeouts, and memory management settings. This provides a
flexible framework for defining diverse open-ended ML research tasks covering a wide range of difficulty. For
example, one can define an easier version of a task by providing a baseline code and a harder version by
providing no starter code or one with bugs, thus creating a natural curriculum.
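As an illustration, a task definition along these lines might look like the following sketch, expressed here as a Python dictionary; the field names and values are assumptions for illustration, not MLGym's actual configuration schema.

```python
# Hypothetical sketch of a task definition with one dataset; field names are
# illustrative assumptions, not MLGym's actual configuration schema.
task_config = {
    "name": "cifar10-image-classification",
    "description": "Improve test accuracy on CIFAR-10 starting from the provided baseline.",
    "datasets": [
        {
            "name": "cifar10",
            "source": "huggingface",          # or "local" with a read-only data path
            "identifier": "uoft-cs/cifar10",
        }
    ],
    "starter_code": "src/baseline.py",        # optional; omit for a harder task variant
    "evaluation_script": "evaluate.py",       # read-only for the agent
    "conda_env": "envs/cifar10.yaml",         # task-specific dependencies
    "training_timeout_seconds": 3600,         # caps the runtime of training commands
}
```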
Evaluation is a critical component for any ML task. Every task requires a different evaluation protocol; thus,
Kaggle-style evaluation as done in MLE-Bench (Chan et al., 2024) where the agent is expected to submit a
CSV file is not feasible for every problem. For example, in reinforcement learning settings, the evaluation
artifact is a set of models trained on a set of pre-defined random seeds, which is then used to get a mean
reward across a set of environment seeds. Similarly for Game Theoretic tasks, it can be a Python file with a
strategy function which will be evaluated against a fixed set of strategy functions. Since we aim to evaluate
the agent on open-ended and diverse tasks, it is not possible to convert all submissions to a CSV format.
To ensure extensibility to such open-ended tasks, the task definition is expected to provide an evaluation
script and submission artifact instructions. The LLM agent can then be prompted to follow the submission
instructions and write the appropriate code. Moreover, the evaluation script is read-only for the LLM agent, so
while it can inspect the evaluation format, it cannot modify the script to change the evaluation logic.
Existing works such as Huang et al. (2024); METR (2024); Chen et al. (2024) also use a script based evaluation
approach, whereas MLE-Bench (Chan et al., 2024) uses a Kaggle style evaluation.
All our design decisions for the Agent, Environment, Dataset, and Tasks are meant to reduce overhead on the
developers’ and researchers’ side and enhance reproducibility in this newly emerging area.
1 As of the latest release, SWE-Agent also decouples tools/ACI from the agent.
3.5 Tools and ACI
Augmenting LLM agents with the ability to use external tools is a critical component for making progress
on knowledge-intensive tasks. In this work, we extend the ACI (agent-computer interface) first introduced in
SWE-Agent (Yang et al., 2024) with some additional features required for an ML research agent. Specifically,
we extend the commands for search, navigation, file viewer, file editor and context management with our
permission management system and introduce new commands for literature search and a memory module.
For example, if the agent tries to open a file without read permission, the file viewer tool will generate textual
feedback for the agent. Similarly, if the agent tries to edit the evaluation script (which is marked as read-only), the edit tools will output a feedback string instead of failing silently. Literature search and the ability to maintain an experimental log in its memory are crucial for the agent to surpass SOTA solutions on open-ended
research tasks.
Similar to SWE-Agent, tools are defined as bash or python scripts and are made available as bash commands
in the environment.
All tool documentation is provided to the agent in the system prompt. See Table 2 for a description of the
available tools.
Category             Tool                 Arguments                                      Documentation

SWE-Agent Tools
Search               search_dir           <search_term> [<dir>]                          searches for the search term in all files in dir
                     search_file          <search_term> [<file>]                         searches for the search term in the given file
                     find_file            <file_name> [<dir>]                            finds all the files with the given name in dir
File Viewer          open                 <path> [<line_number>]                         opens the given file and goes to the line number
                     goto                 <line_number>                                  moves the window to show the line number
                     scroll_down                                                         moves the window down 1000 lines
                     scroll_up                                                           moves the window up 1000 lines
File editing         create               <filename>                                     creates a new file
                     insert               <line_number> <text_to_add>                    inserts the given text at line number in the open file
                     edit                 <start_line>:<end_line> <replacement_text>     replaces the given lines with the given text in the open file
Evaluation           validate                                                            validates the current submission file and returns the metrics on the test set
                     submit                                                              submits the current code and terminates the session

Extended Tools
Literature Search    literature_search    <query> [<num_results>]                        query Semantic Scholar API for papers with attached PDFs
                     parse_pdf_url        <url>                                          downloads and extracts the contents of a PDF given a URL
Memory Module        memory_write         <content_str>                                  save important results, configs or findings to memory
                     memory_read          <query_str>                                    retrieve top-2 elements from memory most similar to a query

Table 2 List of tools available to agents. Required arguments are enclosed in <> and optional arguments are in [].
Validation and Submit We provide two commands to the agent to validate the submission and submit the
results. Both the validate and submit commands are used to run the evaluation script and give the agent
feedback on its current score on the test set. However, while the submit command is a terminal action, i.e.,
the agent’s trajectory is terminated, and the evaluation script is executed to log the final scores, the validate
command can be used as many times as needed during the run to get the current performance on the test set.
The addition of a validation command helps the agent to continuously improve its performance on the test set.
Literature Search and PDF Parser We provide the agent with two tools to find and extract knowledge from
external sources. The Literature Search tool allows the agent to query the Semantic Scholar API to find
research papers about a given query that have open-access PDFs available, and the PDF Parsing tool allows
the agent to download PDFs and convert them into a text-based representation. The paper contents can be
stored in the context window as well as the Memory Module for longer-term tasks. Combined, these two tools
allow the agent to find and analyze research papers as part of its workflow. See Table 2 for more information
about these tools and how they are called.
Memory Module - Research Logs We introduce the Memory Module for MLGym, an important tool to
improve the performance of agents on long-horizon AI research tasks. The Memory Module enables the agent
to persistently store critical findings and successful training configurations using a structured memory system,
overcoming the challenge of limited context retention in long tasks. During our experiments, we observed
that when the agent has access to the memory module, it can retrieve the best training configuration from
memory and continue to iterate on it (see Figure 11 and Figure 12). Without the memory module, the agent's trajectory can grow longer than the model's context length, so the agent can no longer retrieve the best configuration, effectively forgetting older experiments and only being able to iterate locally on recent configurations.
The module is equipped with two core functions: memory_write and memory_read. The memory_write
function allows the agent to store key insights and effective configurations by saving text data along with
its corresponding embeddings and tags in JSON format. In contrast, the memory_read method retrieves the
top-k most relevant stored entries based on cosine similarity with a given query, allowing the agent to review
past knowledge and iterate from previously successful configurations.
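For illustration, a minimal version of such an embedding-based memory could look like the sketch below; the embed() placeholder and the on-disk JSON layout are assumptions, not MLGym's actual implementation.

```python
# Illustrative sketch of an embedding-based memory store; the embed() placeholder
# and the on-disk JSON layout are assumptions, not MLGym's actual implementation.
import hashlib
import json
from pathlib import Path

import numpy as np

MEMORY_FILE = Path("memory.json")


def embed(text: str) -> np.ndarray:
    """Placeholder embedding; a real system would call a sentence-embedding model."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).standard_normal(384)


def memory_write(content: str, tags: list[str]) -> None:
    """Append a record (text, tags, embedding) to the JSON store."""
    records = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
    records.append({"content": content, "tags": tags, "embedding": embed(content).tolist()})
    MEMORY_FILE.write_text(json.dumps(records))


def memory_read(query: str, top_k: int = 2) -> list[str]:
    """Return the top-k stored entries most similar to the query (cosine similarity)."""
    if not MEMORY_FILE.exists():
        return []
    records = json.loads(MEMORY_FILE.read_text())
    q = embed(query)

    def cosine(vec: list[float]) -> float:
        v = np.asarray(vec)
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8))

    ranked = sorted(records, key=lambda r: cosine(r["embedding"]), reverse=True)
    return [r["content"] for r in ranked[:top_k]]
```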
Empirical results demonstrate the positive impact of the Memory Module on long-horizon tasks. Agents
equipped with the Memory Module were able to sustain progress over extended sequences of trials, reusing
optimal configurations and findings to achieve superior results compared to agents limited by fixed context
windows. To further enhance its capabilities, we added the state of the memory to the system prompt (memory
tags and number of records) so that the agent is aware of the type of data stored. Tags from a memory record
are extracted by identifying the 3-gram most closely matching the memory record.
This module significantly reduces the limitations of constrained context length, allowing agents to operate
effectively in long experimental settings. However, it is an early version and there are many ways to improve
the module. For example, one possible direction would be to introduce a more structured memory format,
such as hierarchical or relational models, allowing for precise storage and retrieval of information and enabling
more complex reasoning over stored knowledge. Another is to incorporate memory operations directly into the
model’s training or fine-tuning process to allow the agent to natively utilize stored knowledge for improved
performance. Or using a sub-agent that will automatically manage the memory by selecting important insights,
removing unnecessary entries, and updating the memory. Each of these directions would require extensive
experimentation and rigorous testing to ensure robustness and scalability.
For all the experiments presented in this paper, the agent only uses the SWE-Agent tools and validation
command.
4 MLGym-Bench
The primary motivation behind our benchmark is to challenge models across different aspects of machine
learning, including data handling, model architecture, and strategic decision-making. By incorporating tasks
from data science, game theory, computer vision, natural language processing, and reinforcement learning, the
benchmark aims to provide a varied and comprehensive agent evaluation testbed.
The tasks included in the benchmark are carefully selected to represent real-world challenges, ensuring that
models are tested on their ability to generalize and perform effectively across various scenarios. Each task is
accompanied by standardized evaluation scripts and baseline implementations, providing a clear reference
point for performance assessment and comparison.
The benchmark suite is structured into several main categories, each focusing on a specific domain: Data Science, Game Theory, Computer Vision, Natural Language Processing, and Reinforcement Learning. Below, we describe each of the tasks in the benchmark.
4.1 Data Science
House Price Prediction (Kaggle, 2016) In the House Price Prediction task, the goal is to predict housing
prices using the Kaggle House Price dataset. This task evaluates models based on their ability to accurately
predict prices from various features, using RMSE and R² as performance metrics. The baseline for this task
is a simple Ridge Regression model with minimal feature engineering.
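A minimal Ridge Regression baseline of this kind might look like the sketch below; the file and column names ("train.csv", "SalePrice") are assumptions about the task layout rather than the exact dataset schema.

```python
# Minimal sketch of a Ridge Regression baseline; "train.csv" and the "SalePrice"
# target column are assumptions about the task layout.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")
y = df["SalePrice"]
X = pd.get_dummies(df.drop(columns=["SalePrice"])).fillna(0)   # minimal feature engineering

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)
preds = model.predict(X_val)

rmse = float(np.sqrt(mean_squared_error(y_val, preds)))        # task metric 1
r2 = r2_score(y_val, preds)                                    # task metric 2
print(f"RMSE: {rmse:.2f}  R^2: {r2:.3f}")
```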
4.2 3-SAT
3-SAT (Cook, 1971) In the 3-SAT task, the LLM agent is given a DPLL solver implementation and is prompted to optimize its variable selection heuristic. The DPLL code is stored in a read-only file; the agent can inspect it to structure its heuristic function, but it cannot modify it. A simple random selection heuristic is used as a baseline and starter code for the LLM agent. Performance is measured by the total wall-clock time taken to solve a set of 100 generated 3-SAT instances. The instances are generated using the algorithm described in Selsam et al. (2018).
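For illustration, a drop-in alternative to the random heuristic could score variables by how often they appear in the remaining unsatisfied clauses; the function signature below (clauses as lists of signed integers, a partial assignment as a dict) is an assumption about how the heuristic plugs into the provided DPLL code, not the actual task API.

```python
# Hypothetical variable-selection heuristic for a DPLL solver: pick the
# unassigned variable occurring most frequently in the remaining clauses.
# The signature is an assumption about the starter-code interface.
from collections import Counter
from typing import Dict, List, Optional


def select_variable(clauses: List[List[int]], assignment: Dict[int, bool]) -> Optional[int]:
    counts: Counter = Counter()
    for clause in clauses:
        # Skip clauses already satisfied by the current partial assignment.
        if any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
            continue
        for lit in clause:
            var = abs(lit)
            if var not in assignment:
                counts[var] += 1
    if not counts:
        return None                     # nothing left to branch on
    return counts.most_common(1)[0][0]  # most frequent unassigned variable
```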
4.3 Game Theory
We consider several tasks related to making strategic choices in iterated games, covering multiple well-known games. Specifically, we consider the task of producing code for a strategy for playing a repeated two-player
game. In each such task we provide an opponent strategy, in the form of an opponent bot for playing the
game, and ask the agent to produce code for a strategy for best-responding to this opponent, i.e. provide
code for a strategy that maximizes the score against that opponent. We very briefly review game theory
terminology, with various textbooks covering this topic in more detail (Fudenberg and Tirole, 1991).
In a two-player normal form game $G$, players select actions simultaneously, with the outcome determined by the choices of both players. Let $A_1 = \{a_1^1, \ldots, a_1^k\}$ be the (pure) strategies available to player 1 and let $A_2 = \{a_2^1, \ldots, a_2^m\}$ be the strategies available to player 2. Denote the set of strategy profiles, consisting of a strategy choice for both players, as $A = A_1 \times A_2$. The utility of the players depends on the actions selected by both of them, i.e. the payoffs are $u : A \to \mathbb{R}^2$, where $u(a) = (u_1(a), u_2(a))$ for $a \in A$, and where each player $i$ tries to maximize their individual utility $u_i$. A mixed strategy is a probability distribution $\Delta$ over pure strategies. Given a mixed strategy profile $\sigma = (\sigma_1, \sigma_2)$, the expected utility of player $i$ is $u_i(\sigma_1, \sigma_2) = \sum_{(a_1, a_2) \in A} \sigma_1(a_1)\, \sigma_2(a_2)\, u_i(a_1, a_2)$.
A repeated game consists of k rounds in which the players play the same underlying normal form game. The
history at the $(j+1)$-th round consists of the actions (pure strategies) chosen by both players in each of the rounds $1$ to $j$. We denote by $H$ the set of all possible such histories, so a strategy in a repeated game is a function $a_i : H \to \Delta(A_i)$, i.e. a function that takes the history of actions chosen in the previous rounds and provides a distribution over the actions the agent would take in the next round. In our tasks, a strategy in
the repeated game is expressed as a piece of code that takes in the history (actions of both players in the
previous rounds), and outputs an action for the next round (where the code may make some random choices,
hence yielding a distribution over the selected next round actions). Given an opponent strategy a2 , the goal
of our agent is to produce a strategy that best responds to the opponent and yields the maximal payoff, i.e. $\arg\max_{a_1} u_1(a_1, a_2)$. Note that in this equation $a_2$ is a given opponent strategy expressed as a piece of
code that takes the history over the previous rounds and selects an action for the next round (possibly making
some random choices), and that the goal of an agent is to produce a1 as a piece of code capturing the strategy
of the first player. The agent optimization goal is selecting the code a1 so as to maximize player 1’s expected
payoff u1 against the fixed opponent a2 .
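To make the "strategy as code" format concrete, the sketch below shows a simple tit-for-tat-style strategy for the iterated Prisoner's Dilemma, written as a function from the interaction history to the next action; the signature and the "C"/"D" action encoding are assumptions for illustration, not the exact interface used by our evaluation scripts.

```python
# Illustrative strategy for an iterated game, written as code that maps the
# interaction history to the next action. The signature and the "C"/"D"
# encoding are assumptions, not the exact task interface.
import random
from typing import List, Tuple

History = List[Tuple[str, str]]  # per round: (our action, opponent action)


def strategy(history: History) -> str:
    """Tit-for-tat with occasional forgiveness: cooperate first, then mirror
    the opponent's last move, sometimes forgiving a defection."""
    if not history:
        return "C"                      # cooperate on the first round
    _, opponent_last = history[-1]
    if opponent_last == "D" and random.random() < 0.1:
        return "C"                      # small chance to forgive a defection
    return opponent_last                # otherwise mirror the opponent
```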
We consider the repeated version of prominent games, which we briefly discuss here: iterated Prisoner’s
Dilemma (Flood, 1958; Fudenberg and Tirole, 1991; Axelrod, 1980), Battle of the Sexes (Cooper et al., 1989;
Luce and Raiffa, 2012), and Colonel Blotto (Roberson, 2006). As our goal was to highlight how our agent framework could be used to solve game-theoretic tasks, rather than to provide a rigorous evaluation and analysis of many game-theoretic environments, we only included a few games. However, additional games can easily be added.
Prisoner's Dilemma (Axelrod, 1980). In this game, two players each have two options: cooperate or defect.
When both cooperate, they receive a moderate reward. If one defects while the other cooperates, the defector
gets a high reward while the cooperator gets a low payoff. If both defect, they both receive a low payoff.
Due to the structure of payoffs, although mutual cooperation yields the best collective outcome, individual
incentives often push towards defection. We included a repeated game, consisting of k = 20 rounds of the
game. In the repeated version, players remember previous interactions and can adjust their strategies based
on the history consisting of the past outcomes. Repeating the stage game multiple times allows for the
development of trust and cooperation, as players recognize that consistent cooperation can lead to better
long-term benefits than short-term defection (Axelrod, 1980). As our opponent strategy we provided a simple
model which randomizes between cooperation, defection, or actions chosen based only on the last round of
the interaction.
Battle of Sexes (Cooper et al., 1989). This is a simple game illustrating coordination challenges between two
participants with different preferences. In the game, two participants have to agree on a venue (for instance
where to go to spend an evening). There are two possible venues, and both players would rather make the
same choice rather than making different choices. The strategic dilemma arises because each player wants
to coordinate their choice with the other, but they have a different ranking over the venues (one prefers the
first venue and the other prefers the second venue). Similarly to the iterated Prisoner’s Dilemma, we have
used a repeated game with k = 20 rounds and used a simple opponent that makes random choices using the
information from the last round.
Colonel Blotto Game (Roberson, 2006). This game is a model of strategic allocation of limited resources
under competition. Two players (“Colonels”) must simultaneously distribute their resources (such as troops)
over several alternative locations (“battlefields”). The player who allocates more resources to a battlefield
wins that battlefield. The overall winner is the player who wins the most battlefields. The key challenge arises
from the fact that players must make their allocations without knowing how their opponent will distribute
their resources. This yields an environment where players try to anticipate their opponent’s moves to decide
how to best allocate their own resources in order to maximize their chances of winning. A key insight from
the game is the importance of diversification and unpredictability: it is harder to exploit an opponent who
spreads resources across multiple battlefields and varies their strategy. Our target opponent used a very
simple random allocation rule (re-normalizing to the overall budget of resources).
It is important to note that in all the game theoretic tasks, the agent is allowed to look at the opponent’s
strategy, and thus these tasks measure code understanding and the LLM’s capabilities to exploit the opponent’s
strategy. In the future, we plan to add tasks where the opponent’s strategy is not provided to the agent, and
the agent is pitted against multiple opponents in a round-robin fashion, similar to the setup used in Axelrod’s
original Prisoner’s Dilemma tournament.
Problem Setting            Domain                        Task                          Dataset/Environment
Supervised Learning        Data Science                  Regression                    House Price Prediction2
Supervised Learning        Computer Vision               Image Classification          CIFAR-10 (Krizhevsky et al., 2009)
Supervised Learning        Computer Vision               Image Classification          Fashion MNIST (Xiao et al., 2017)
Supervised Learning        Computer Vision               Image Captioning              MS-COCO (Lin et al., 2014)
Supervised Learning        Natural Language Processing   Natural Language Inference    MNLI (Williams et al., 2018)
Self-Supervised Learning   Natural Language Processing   Language Modeling             FineWeb (Penedo et al., 2024)
Reinforcement Learning     Reinforcement Learning        MetaMaze Navigation           Gymnax (Lange, 2022)
Reinforcement Learning     Reinforcement Learning        MountainCar Continuous        Gymnax (Lange, 2022)
Reinforcement Learning     Reinforcement Learning        Breakout MinAtar              Gymnax (Lange, 2022)
Algorithmic Reasoning      Computer Science              3-SAT                         Randomly Generated (Selsam et al., 2018)
Algorithmic Reasoning      Game Theory                   Prisoner's Dilemma            N/A
Algorithmic Reasoning      Game Theory                   Battle of Sexes               N/A
Algorithmic Reasoning      Game Theory                   Colonel Blotto                N/A

Table 3 List of tasks included in MLGym-Bench along with their respective problem setting, domain, and datasets.
4.4 Computer Vision
Image Classification (CIFAR-10) (Krizhevsky et al., 2009) The Image Classification CIFAR-10 task involves
classifying images into one of ten classes using the CIFAR-10 dataset. This task tests the ability of models to
learn visual patterns and features, with a baseline accuracy of 49.71%, encouraging improvements.
Image Classification (Fashion MNIST) (Xiao et al., 2017) The Image Classification Fashion MNIST task
involves classifying fashion items into predefined categories using the Fashion MNIST dataset. The agent is
provided with a simple two-layer CNN as a baseline and has to optimize accuracy on the test set. The agent can modify both the model architecture and the training hyper-parameters.
Image captioning (MS-COCO) (Lin et al., 2014) For the image captioning task, the agent has to write
the modeling code and come up with a good architecture and training setup for the image-text pairs in the MS-COCO dataset. We provide baseline training code to the agent which uses an image encoder and a text decoder. We use the MS-COCO training and validation sets after removing all images containing humans. The agent has to optimize the BLEU scores (Papineni et al., 2002) computed over the model-generated captions and ground-truth captions for a given image.
2 https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
4.5 Natural Language Processing
For language, we test the agent’s ability to understand and modify the training setup for both Natural Language
Understanding (NLU) and Natural Language Generation (NLG) as detailed below.
Natural Language Inference (Williams et al., 2018) In this task, the agent starts from a pre-trained BERT model (Devlin, 2018), and we provide baseline code to fine-tune it on the training set of the MNLI benchmark. The agent is expected to come up with good hyper-parameters and a fine-tuning strategy to optimize the test set accuracy on MNLI.
Language Modeling (Jordan et al., 2024) In the Language Modeling task, the agent is expected to train
a language model for next token prediction using a smaller version of the FineWeb (Penedo et al., 2024)
dataset. The LLM Agent is provided with the dataset and the NanoGPT (Jordan et al., 2024) codebase as a
baseline and starting point. We use version #8 from modded-nanogpt3 as the starting point. The training
and validation sets contain 1.773B and 100M tokens, respectively. The performance metric is the perplexity of
the trained model on the validation set.
4.6 Reinforcement Learning
MetaMaze Navigation (Miconi et al., 2020) The MetaMaze Navigation task simulates a grid-world environment
where agents must navigate using local observations and reach the goal location.
Mountain Car Continuous (Brockman et al., 2016) We use the continuous version of the Mountain Car
environment introduced in Brockman et al. (2016), where the task is to learn a policy that drives a car up a
steep hill in a continuous control environment.
Breakout MinAtar (Young and Tian, 2019) The Breakout MinAtar task involves playing the arcade game
Breakout in a simulated environment. This environment was introduced in Young and Tian (2019) and is a
popular benchmark for evaluating reinforcement learning agents.
For all the RL tasks, we use the environments from the Gymnax library (Lange, 2022) and the PPO algorithm
from Gymnax-blines4 as a baseline and starting code for the LLM agent.
5 Experimental Setup
5.1 Agent and Models
For our experiments, we utilize a SWE-Agent based agent adapted specifically for the MLGym environment. SWE-Agent follows a simple ReAct-style thought and action loop (Yao et al., 2023), where the agent is prompted with the ACI documentation, the task and dataset description, as well as lightweight generic instructions to act as an ML researcher. The agent is configured to use a single command per step, and is not allowed to use any interactive session commands (e.g., python REPL, vim).
We use a set of 5 state-of-the-art models for our experiments: OpenAI O1-preview, Gemini 1.5 Pro, Claude-3.5-sonnet-20241022 (referred to as Claude-3.5-sonnet in the paper), Llama-3.1-405b-instruct, and GPT-4o. All the models are used with temperature=0.0 and top-p=0.95, with the exception of OpenAI O1-preview, which does not support changing the decoding parameters and has a default temperature=1.0.
3 https://github.com/KellerJordan/modded-nanogpt
4 https://github.com/RobertTLange/gymnax-blines
5.2 Environment Configuration
The MLGym environment is configured with several key parameters to facilitate effective interaction between
the agent and the tasks:
• Window Configuration: The environment uses a window size of 1000 lines with an overlap of 2 lines,
allowing the agent to effectively navigate and edit large files while maintaining context.
• Context Management: A processor maintains a rolling window with the five most recent interactions
(action and observation), helping the agent maintain context about the most recent interactions while
keeping the input size manageable.
• Command Interface: The environment provides a set of specialized commands beyond standard bash
operations, including file navigation commands (goto, scroll_up, scroll_down), file editing commands
(edit, insert) with linting support, file and directory search commands (search_file, search_dir,
find_file), and evaluation commands (validate, submit).
A single agent run is limited to 50 steps (i.e., interactions with the environment), after which the agent is terminated and the last codebase state is auto-submitted. Moreover, to control the runtime of the agent and prevent it from simply increasing the number of parameters in the model, we set a task-specific timeout for the training commands.
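A schematic summary of this configuration, written as a plain dictionary, is given below; the actual MLGym configuration schema and key names may differ.

env_config = {
    "window": {"size": 1000, "overlap": 2},        # file-viewer window for navigation/editing
    "history": {"last_n_interactions": 5},         # rolling (action, observation) context
    "commands": {
        "navigation": ["goto", "scroll_up", "scroll_down"],
        "editing": ["edit", "insert"],              # with linting support
        "search": ["search_file", "search_dir", "find_file"],
        "evaluation": ["validate", "submit"],
    },
    "max_steps": 50,                                # last codebase state auto-submitted afterwards
    "training_timeout": "task-specific",            # caps the runtime of training commands
}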
In the next section, we discuss the evaluation metrics used in our experiments.
6
Evaluation
In order to compare agents on MLGym, we aggregate the scores of each method—an agent architecture
paired with a backbone model—across our tasks. There are many ways one can aggregate scores. Common
options include computing the average score across tasks for each method or computing the average ranking of each method across tasks. While simple, these approaches can weight metrics in undesirable ways and
disproportionately penalize certain methods. Averaging across different metrics may unfairly weight the
metrics differently based on their relative scales, and averaging ranks can disproportionately penalize methods
that effectively solve a task but are tied with other methods that also solve the task. Rather than naive
averaging of scores or rankings, we employ performance profile curves (Dolan and Moré, 2002), which allow us
to compare relative performance gains across both methods and tasks. Performance profiles were originally
developed to compare optimization techniques across a set of optimization problems. Since then, they have
been used by the AutoML community to compare AutoML methods across diverse domains, each with their
own domain-specific metrics (Tu et al., 2022; Roberts et al., 2022b).
One challenge when using performance profiles is that they produce a curve for each method (where a higher
curve is better), rather than a direct ranking of methods. To address this, the AutoML Decathlon (Roberts
et al., 2022a) competition introduced the AUP score, which computes the area under the performance profile
curve for each method, where a higher value constitutes better performance. Variants of the AUP score have
since been used to score the AutoML Cup5 and MLCommons AlgoPerf (Dahl et al., 2023) competitions. Next,
we define performance profiles, the AUP score, and the details of their usage within MLGym.
6.1
Performance Profiles and the AUP Score
For a given method m, its performance profile curve is defined as

$$\rho_m(\tau) = \frac{1}{|T|}\,\Bigl|\bigl\{\, t \in T : \log_{10} r_{t,m} \le \tau \,\bigr\}\Bigr|, \qquad r_{t,m} = \frac{\ell_{t,m}}{\min\{\ell_{t,m} : m \in M\}}, \qquad (1)$$

where M is the set of all methods, T is the set of tasks, ℓ_{t,m} is the performance metric for a method m on task t, and r_{t,m} is a quantity called the performance ratio.
Importantly, this definition assumes that the performance metric for each task, ℓ_{t,·}, is defined such that lower scores are better; we discuss our modification to this definition to support other scores in Section 6.2.
5 https://2023.automl.cc/competitions/automl-cup/
Performance profiles are parameterized by a threshold, τ , on the distance between the method m and the best
scoring methods on each of the tasks. At a given threshold τ , performance profiles compute the proportion of
tasks for which the method m is within τ of the best method for each task.
In order to derive a final score for each method m ∈ M, we compute the AUP score as

$$\mathrm{AUP}_m = \int_{1}^{\tau_{\max}} \rho_m(\tau)\, d\tau, \qquad (2)$$

where τ_max is the minimum τ for which ρ_m(τ) = 1 for all m ∈ M.
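Both quantities are straightforward to compute from a table of per-task scores. Below is a small sketch for the lower-is-better case of Eq. (1); the function and variable names are ours, and the τ grid is left to the caller.

import numpy as np

def performance_profiles(scores, taus):
    # scores[t][m]: metric of method m on task t (lower is better, positive, every method scored).
    tasks = list(scores)
    methods = sorted({m for t in tasks for m in scores[t]})
    profiles = {m: [] for m in methods}
    for tau in taus:
        for m in methods:
            within = sum(np.log10(scores[t][m] / min(scores[t].values())) <= tau
                         for t in tasks)
            profiles[m].append(within / len(tasks))   # fraction of tasks within threshold
    return profiles

def aup(profiles, taus):
    # Area under each performance profile curve (Eq. 2), trapezoidal approximation.
    return {m: float(np.trapz(rho, taus)) for m, rho in profiles.items()}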
6.2
Usage in MLGym
In the context of MLGym, a method is defined as a combination of an agent scaffolding and a backbone model. Since we use a single agent scaffolding (SWE-Agent) in this work, we are effectively comparing the performance of different backbone models. Moreover, we adapt performance profiles and AUP scores to handle various edge cases introduced by our MLGym tasks.
• Metric Direction Handling. For metrics where higher values are better (e.g., accuracy, R²), we invert the performance ratio calculation and use the maximum score instead of the minimum:

$$r_{t,m} = \frac{\max\{\ell_{t,m} : m \in M\}}{\ell_{t,m}}. \qquad (3)$$
• Infeasible Methods. To be counted as feasible on a task, a method must produce at least one valid solution and outperform the baseline. Methods that produce no valid solution or underperform the baseline are marked as infeasible. The score of an infeasible method is set to (1 + ε) × r_{t,m_baseline}, where r_{t,m_baseline} is the score obtained by the baseline method on task t. We set ε = 0.05.
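A sketch of the adapted performance ratio for a single task, covering both adjustments, is shown below; the baseline key, the treatment of missing scores, and the helper name are ours.

def performance_ratio(task_scores, method, higher_is_better=False,
                      baseline="baseline", eps=0.05):
    # task_scores maps method name -> raw metric on this task; a missing entry
    # means the method produced no valid solution.
    def ratio(m):
        if higher_is_better:                                   # Eq. (3)
            return max(task_scores.values()) / task_scores[m]
        return task_scores[m] / min(task_scores.values())      # Eq. (1)

    score = task_scores.get(method)
    if score is None:
        feasible = False
    elif higher_is_better:
        feasible = score > task_scores[baseline]
    else:
        feasible = score < task_scores[baseline]

    if not feasible:            # infeasible: penalized relative to the baseline's ratio
        return (1 + eps) * ratio(baseline)
    return ratio(method)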
We report the metrics across 4 independent runs for each model on each task. Finally, since the LM agent can
use the validate command to check the performance without ending the run, we maintain two separate sets
of performance profiles and AUP scores for each model.
1. Best Submission Profiles, ρ^bs_m(τ)@4, are computed using the best final submission across 4 runs. A submission is classified as a final submission in two cases: if the agent uses the submit command, or if the agent terminates without submitting, in which case the last codebase state is used to evaluate performance.
2. Best Attempt Profiles, ρ^ba_m(τ)@4, are computed using the best attempt observed across 4 runs. Any valid call to the validate command is considered an attempt.
The resulting AUP scores provide complementary information:
• AUP^bs_m@4 indicates the model's ability to consistently submit its best attempt as the final solution. Note that to do this, the LM agent has to be able to keep an internal state of the best attempt and recover from any mistakes made after the best attempt was made.
• AUP^ba_m@4 captures the model's exploration capability and is an indicator of the ceiling of the model's performance.
Apart from the AUP scores and performance profiles, we also report the raw performance scores for each model
on each task. Similar to performance profiles, we categorize the raw scores in two sets: Best Submission@4
and Best Attempt@4.
7
Results
7.1
AUP Scores and Performance Profiles
As detailed in Section 6, we evaluate the performance of each model in the SWE-Agent based agent scaffolding using performance profiles and the Area Under the Performance Profile (AUP) score.
Figure 2 Performance profiles comparing Best Attempt@4 and Best Submission@4 across all models and tasks. The x-axis shows the performance ratio threshold τ and the y-axis shows the fraction of tasks where a model achieves performance within τ of the best model.
Moreover, since our agent can log the performance of intermediate steps, we categorize the performance of each model into two categories: Best Submission and Best Attempt. Best Submission indicates the LLM agent's capability to produce a valid final solution for a task, as well as its ability to fall back to the best intermediate solution in case some experiments do not pan out. Best Attempt, on the other hand, indicates the potential ceiling of the LLM agent's capability on the given task.
Figure 2 shows the performance profiles for Best Attempt (on the left) and Best Submission (on the right).
Similarly, Table 4 shows the AUP scores for the Best Attempt and Best Submission for all models.
In our experiments, we found that OpenAI O1-preview is the best-performing model on aggregate across our
set of tasks for both Best Attempt and Best Submission, with Gemini 1.5 Pro and Claude-3.5-Sonnet being
close behind.
Model                     Best Attempt AUP@4    Best Submission AUP@4
Llama3.1-405b-instruct    1.015                 1.039
Claude-3.5-Sonnet         1.142                 1.135
Gemini-1.5-Pro            1.140                 1.125
GPT-4o                    1.000                 1.029
OpenAI O1                 1.150                 1.176

Table 4 AUP@4 scores for the best attempt and best submission across all models. Best scores are highlighted in blue.
7.2
Raw Performance Scores
To compare the performance of each model on each task, we also report aggregate metrics over 4 runs with
different seeds, namely the Best Attempt@4 and Best Submission@4 in Table 5 and Table 6 respectively.
While OpenAI O1-Preview is not dominant in all tasks, with Gemini-1.5-Pro, Claude-3.5-Sonnet, and Llama-3.1-405b-Instruct occasionally taking the lead, it is consistently among the top-performing models for most tasks and thus takes the top spot in the AUP scores and performance profiles. This shows that the performance profile is a good metric for comparing the performance of different models on a set of tasks with a diverse set of metrics.
We also find that Llama-3.1-405b-Instruct and GPT-4o are the only models that fail to produce any valid
solution for the Language Modeling and Breakout tasks, respectively.
Task                       Metric                 Baseline   Llama3.1-405b-instruct   GPT-4o     Claude-3.5-Sonnet   Gemini-1.5-Pro   OpenAI o1
CIFAR-10                   Accuracy               0.497      0.548                    0.733      0.895               0.84             0.857
Battle of Sexes            Average Reward         1.023      1.261                    1.149      1.442               1.443            1.444
Prisoners Dilemma          Average Reward         2.372      2.632                    2.6        2.567               2.63             2.629
Blotto                     Average Reward         -0.248     0.043                    0.047      0.576               0.249            0.248
House Price Prediction     R2 Score               0.88       0.908                    0.895      0.921               0.914            0.931
Fashion MNIST              Accuracy               0.783      0.876                    0.927      0.945               0.916            0.92
MS-COCO                    BLEU Score             0.279      0.294                    0.176      0.298               0.131            0.135
MNLI                       Validation Accuracy    0.525      0.777                    0.819      0.830               0.838            0.836
Language Modeling          Validation Loss        4.673      ∞                        4.361      4.476               4.166            3.966
Breakout                   Average Score          48.817     58.87                    ∞          35.017              71.389           63.518
Mountain Car Continuous    Average Reward         33.794     18.692                   -215.776   36.313              92.513           96.335
Meta Maze                  Average Return         15.734     26.744                   7.823      48.562              27.859           34.986
3-SAT Heuristic            Wall-Clock Time (s)    16.158     13.793                   13.676     15.728              14.36            13.652

Table 5 Best Attempt@4 scores for all models. Best scores are highlighted in blue. Note: ∞ indicates that the model was not able to produce even a single valid solution for submission or validation.
Task                       Metric                 Baseline   Llama3.1-405b-instruct   GPT-4o     Claude-3.5-Sonnet   Gemini-1.5-Pro   OpenAI o1
CIFAR-10                   Accuracy               0.497      0.528                    0.733      0.894               0.758            0.854
Battle of Sexes            Average Reward         1.023      1.256                    1.144      1.439               1.443            1.439
Prisoners Dilemma          Average Reward         2.372      2.562                    2.582      2.563               2.63             2.571
Blotto                     Average Reward         -0.248     0.041                    0.047      0.228               0.088            0.247
House Price Prediction     R2 Score               0.88       0.908                    0.895      0.912               0.908            0.931
Fashion MNIST              Accuracy               0.783      0.876                    0.927      0.945               0.916            0.906
MS-COCO                    BLEU Score             0.279      0.294                    0.111      0.125               0.131            0.135
MNLI                       Validation Accuracy    0.525      0.777                    0.819      0.830               0.838            0.836
Language Modeling          Validation Loss        4.673      ∞                        4.361      4.476               4.166            3.966
Breakout                   Average Score          48.817     58.87                    ∞          17.735              71.389           63.518
Mountain Car Continuous    Average Reward         33.794     18.692                   -216.621   36.313              92.513           96.335
Meta Maze                  Average Return         15.734     26.744                   7.823      48.562              22.889           34.986
3-SAT Heuristic            Wall-Clock Time (s)    16.158     13.936                   13.676     15.728              14.36            13.83

Table 6 Best Submission@4 scores for all models. Best scores are highlighted in blue. Note: ∞ indicates that the model was not able to produce even a single valid solution for submission or validation.
7.3
Computational Cost
As discussed in Kapoor et al. (2024), it is important to also consider the Pareto curve of performance vs. cost for a more comprehensive evaluation of the agents' capabilities and their computational cost. In this work, we do not compare different agent scaffoldings; however, the Pareto curve can still be useful for choosing the most balanced model for a set of tasks. Figure 3 shows the Best Attempt AUP@4 vs. Average Cost for all models. We use Best Attempt AUP scores for this plot to highlight the maximum performance achievable by each model for a given cost.
Consistent with the results discussed in Section 7.1, OpenAI O1-Preview is the best-performing model; however, it is also the most computationally expensive by a wide margin. In contrast, Gemini-1.5-Pro and Claude-3.5-Sonnet are much more cost-effective while still reaching performance not far from OpenAI O1's, with Gemini-1.5-Pro being the most cost-effective.
Figure 3 Best Attempt AUP@4 vs cost for all models. The x-axis shows the API cost in USD and the y-axis shows the AUP@4 score.
Gemini-1.5-Pro is cheaper than both GPT-4o and Llama-3.1-405b-Instruct and provides substantial performance gains relative to them. GPT-4o is one of the cheapest models to run but performs significantly worse than the top models, Claude-3.5-Sonnet, Gemini-1.5-Pro, and OpenAI O1-Preview. Overall, Gemini-1.5-Pro strikes the best balance between performance and cost on MLGym-Bench, being the cheapest model to run (approximately 9× cheaper than OpenAI's O1) while achieving 99% of the AUP of OpenAI O1, the top-performing model.
The API pricing for OpenAI O1-Preview, GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro was taken from their respective pricing pages, and that for Llama-3.1-405b-Instruct was taken from together.ai. For details on API pricing, tokens spent, and context length, please consult Table 8.
7.4
Agent Behavior Analysis
7.4.1
Failure Mode Analysis
In this section we analyze the failure modes of our agents on MLGym-Bench tasks, using three key perspectives:
termination error distribution, failed or incomplete run rates, and task-specific failure patterns. We collect
trajectories across 11 tasks and 5 models with 4 different seeds. This results in a total of 220 trajectories with
20 and 44 trajectories for each task and model, respectively.
Termination Errors Figure 4 shows the distribution of different causes for termination encountered by each
model during task execution, as indicated by the first word of the error message. We categorize the errors into
the following types: context length exceeded, evaluation error, file permission error, cost limit
exceeded, format error, and runtime error.
First, we observe that almost all models encounter Evaluation Error, which is also the most frequent final error, accounting for 75% of all termination errors. Evaluation Error is generally triggered by missing submission artefacts or an incorrect submission format at the last step or when the submit command is issued.
Figure 4 Termination Error Distribution by model. The size of the bars corresponds to the number of times each model triggered an exit status.
Figure 5 Number of Failed and Incomplete runs per model. The criteria for marking a run as incomplete or failed are described in Section 7.4.1.
Gemini-1.5-Pro is the only model that does not submit any invalid solutions, with OpenAI O1-Preview and Claude-3.5-Sonnet being the runners-up.
OpenAI O1-Preview and Claude-3.5-Sonnet demonstrate superior error handling capabilities with the lowest overall error rates. Cost Limit is the second most frequent error encountered by Claude-3.5-Sonnet, Gemini-1.5-Pro, and OpenAI O1-Preview, indicating that they could further improve performance if provided with a larger budget. However, it is interesting to note that Gemini-1.5-Pro is the most cost-effective model across all tasks but still encounters the Cost Limit error most frequently among all models.
Failed and Incomplete Runs The failed and incomplete run analysis in Figure 5 reveals significant variations
in model reliability. If an agent run fails with a termination error without producing any valid intermediate
submission, we mark it as failed. If instead the run fails with a termination error but produces a valid intermediate submission, i.e., at least one score on the test set is obtained, we mark it as incomplete. Note that
the model’s submission does not have to beat the baseline to be considered a valid intermediate submission.
We are not interested in the performance of the model’s submission here, but rather the ability of the agent
to produce a valid submission by following the given instructions.
GPT-4o exhibits the highest failure rate, while Gemini-1.5-Pro and OpenAI O1-Preview achieve the best
completion rates. While Claude-3.5-Sonnet is one of the top performing models across all tasks (Section 7.1),
it has a high failure rate. Another interesting observation is that OpenAI O1-Preview has a high incompletion
rate, but it always produces at least one valid solution for all tasks.
We report additional results and failure mode analysis in Section A.2.
7.4.2
Action Analysis
In this section, we analyze the overall action distribution, as well as across models and trajectory steps. To
analyze the action distribution effectively, we group the actions according to categories defined in Table 2:
Edit , View , Search , Validate and Submit . We treat validate and submit as two separate categories.
Additionally, we have two open-ended categories: Python and Bash. All actions that match the regex patterns python.*, deepspeed.*, and torchrun.* are considered Python actions. These actions usually correspond to the agent attempting to run a model evaluation or training script. All other actions are grouped under the Bash category, i.e., they are treated as open-ended bash commands.
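A small sketch of this grouping rule is given below; the exact command membership of the Edit, View, and Search categories follows Table 2, which we only approximate here.

import re

PYTHON_PATTERNS = (r"^python.*", r"^deepspeed.*", r"^torchrun.*")
CATEGORIES = {                       # approximate grouping; see Table 2 for the exact lists
    "Edit": {"edit", "insert", "create"},
    "View": {"open", "goto", "scroll_up", "scroll_down"},
    "Search": {"search_file", "search_dir", "find_file"},
    "Validate": {"validate"},
    "Submit": {"submit"},
}

def categorize(action: str) -> str:
    action = action.strip()
    command = action.split()[0] if action else ""
    for name, commands in CATEGORIES.items():
        if command in commands:
            return name
    if any(re.match(p, action) for p in PYTHON_PATTERNS):
        return "Python"              # training / evaluation scripts
    return "Bash"                    # everything else: open-ended bash commands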
Overall Action Distribution Figure 6 shows the action distribution across all runs. File commands such as Edit and View are among the most frequently used, with Edit accounting for 50% of all actions. In contrast, Search commands are rarely used, accounting for only 1% of all actions.
This distribution suggests that models spend a significant portion of their time in an iterative development cycle of editing and viewing files. Additionally, we observe a trend of regular experimental evaluation and periodic validation of solutions, reflected in the frequent use of Python and Validate commands.
Figure 6 Action distribution across all runs. We group the actions into categories following the grouping defined in Table 2 and Section 7.4.2.
Figure 7 Action distribution for each model. We group the actions into categories following the grouping defined in Table 2 and Section 7.4.2.
Per-Model Action Distribution Figure 7 shows the action distribution for each model. GPT-4o takes the fewest actions overall, indicating that the model either errors out or submits too early without reaching an optimal solution. This is consistent with the failure analysis shown in Figure 5.
Among the best-performing models, Claude-3.5-Sonnet and OpenAI O1-Preview perform the most actions within a run, while Gemini-1.5-Pro performs the fewest. Consistent with the cost analysis discussed in Section 7.3, Gemini-1.5-Pro's shorter trajectories contribute to it being the most cost-effective model.
Per-Step Action Distribution Figure 8 illustrates the distribution of actions taken by agents across trajectory
steps. Initially, Bash commands are predominant, indicating that agents start by checking and setting up
their environment with basic commands such as ls, pwd, cd etc. As the steps progress, Edit actions become
the most frequent, reflecting the agents’ focus on modifying and refining code. This is complemented by a
consistent use of View commands, suggesting a pattern of iterative development where agents frequently
review their changes.
Python and Validate commands are used steadily throughout, which indicates an iterative cycle of experiments
and evaluation. Submit actions are sparse, typically appearing towards the end of the process, aligning with
the finalization of tasks. However, we observe the Submit action being used as early as step 5, which indicates that some models submit their solution too early and likely fail to reach an optimal solution that beats the other models.
Interestingly, Search commands are rarely used, suggesting that agents might benefit from improved search
strategies to enhance efficiency while editing code.
Overall, our analysis highlights a structured approach where agents begin by getting familiar with the environment and the task, conduct multiple iterations of experiments and validation, and conclude with a submission. We report additional action analysis in Section A.3.
8
Discussion and Limitations
Our findings highlight both the opportunities and ongoing challenges in leveraging large language models
(LLMs) as agents for scientific workflows. The proposed MLGym framework and accompanying MLGym-Bench tasks demonstrate that modern LLM agents can successfully tackle a diverse array of quantitative
experiments, reflecting advanced skills and domain adaptability. At the same time, our results reveal notable
capability gaps, which point to several avenues for improvement:
• Scaling beyond ML tasks To further evaluate the agent’s AI Research capabilities, it is essential to
scale up the evaluation framework to accommodate large-scale domain-specific datasets, more complex
tasks, as well as domains outside AI. This will enable the community to assess the robustness and
generalizability of different methods, as well as identify potential limitations and areas for improvement.
Figure 8 Action distribution for each step. We group the actions into categories following the grouping defined in Table 2 and Section 7.4.2.
• Interdisciplinary Ablations and Generalization Within the stage of method evaluation, one approach is
to test the solutions for generalization:
– automatically evaluating the applicability of a new method to different domains. For example, new LLM architectures like Mamba (Gu and Dao, 2024) could be automatically applied to data on DNA, chemical molecules, music generation, etc.
– automatically running interdisciplinary and multidisciplinary ablations, where we systematically
remove or modify specific components of the proposed ML system to assess their impact on
performance. This will enable the community to more quickly identify the most critical factors
contributing to generalization across different domains.
• Addressing Scientific Novelty While agentic benchmarks have demonstrated their effectiveness in evaluating complex tasks in different areas, it is essential to acknowledge that the proposed interdisciplinary extrapolation of methods is just one aspect of the broader scientific understanding of "novelty" and "discovery" (Popper, 2005; Langley, 1987). It is not yet clear whether the notion of scientific novelty can be successfully automated or even formally defined in a form suitable for agents. For many scientific disciplines, progress may be uneven and depend on the availability of open data and on the maturity of the methods, metrics, and definitions used.
• Data Openness Imperative Finally, we emphasize the importance of data openness in driving scientific
progress. By making our representative ’corpus of the world’ widely accessible, including scientific
artifacts, reproducible code, and domain-specific data for modeling, we can facilitate collaboration and
accelerate discovery. This imperative is crucial for advancing our understanding of complex systems and
developing more effective solutions to real-world problems. Removing once-accessible resources that have entered LLM training from public access can have an irreparable impact on the acceleration of scientific progress, as it becomes impossible to identify the sources of facts and to distinguish an out-of-distribution result in a scientific work from a hallucination or a genuinely new result.
9
Ethical Considerations
AI agents proficient in tackling open research challenges like those in our benchmark could catalyze a
remarkable acceleration in scientific advancement. This prospect is exhilarating yet demands a meticulous
comprehension of model progress to ensure responsible and controlled deployment of such breakthroughs.
MLGym-Bench, for instance, can serve as a metric for model autonomy within OpenAI’s Preparedness
Framework, autonomous capabilities in Anthropic’s Responsible Scaling Policy, and ML R&D in Google
DeepMind’s Frontier Safety Framework.
Should AI agents become adept at autonomously conducting AI research, the positive impacts could be
multifaceted, encompassing accelerated scientific progress in healthcare, climate science, and other domains,
expedited safety and alignment research for models, and economic growth spurred by the development of
novel products. The ability of agents to deliver high-quality research could signify a transformative stride in
the economy.
Nonetheless, agents capable of executing open-ended AI research tasks, such as enhancing their own training
code, could augment the capabilities of cutting-edge models at a pace outstripping human researchers. If
innovations outpace our ability to comprehend their ramifications, we risk developing models with catastrophic
harm or misuse potential without parallel advancements in securing, aligning, and controlling such models.
We believe a model proficient in solving a substantial portion of MLGym-Bench likely possesses the capacity
to execute numerous open-ended AI tasks. We are open-sourcing MLGym and MLGym-Bench to foster
understanding and research into the agentic capabilities of AI Research Agents and promote transparency
regarding acceleration risks in frontier AI labs. In doing so, we acknowledge the limitations of MLGym-Bench
and strongly encourage the development of additional evaluations of automated AI research capabilities,
particularly those tailored to the workflow of researchers training frontier models.
10
Conclusions
This paper presents MLGym and MLGym-Bench as initial steps toward building robust, flexible, and transparent LLM agents for AI research. As the field continues to evolve, improvements in long-context reasoning,
better agent architectures, training and inference algorithms, as well as richer evaluation methodologies will be
essential to fully harness LLMs’ potential for scientific discovery, in general and for AI research in particular.
By fostering collaboration among researchers in machine learning, scientific computing, and diverse application
domains, we can move closer to a future where AI-driven agents meaningfully accelerate scientific research, all
while maintaining verifiability, reproducibility, and integrity in scientific discovery.
11
Acknowledgments
We thank Sten Sootla, Mikayel Samvelyan, Sharath Chandra Raparthy, Mike Plekhanov, and Rishi Hazra for
many insightful discussions about evaluating and developing AI Research Agents.
References
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida,
Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,
2023.
AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 1, 2024.
Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, and William Wang. Swe-search: Enhancing
software agents with monte carlo tree search and iterative refinement, 2024. URL https://arxiv.org/abs/2410.
20285.
Robert Axelrod. Effective choice in the prisoner’s dilemma. Journal of conflict resolution, 24(1):3–25, 1980.
Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative Research Idea
Generation over Scientific Literature with Large Language Models, April 2024. URL https://arxiv.org/abs/2404.
07738.
Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, and
Tushar Khot. SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories, September
2024. URL https://arxiv.org/abs/2409.07440v1.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba.
Openai gym, 2016. URL https://arxiv.org/abs/1606.01540.
Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang,
Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongshen Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin,
Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, and Tao Yu. Spider2-v: How far are multimodal
agents from automating data science and engineering workflows?, 2024. URL https://arxiv.org/abs/2407.10956.
Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu,
Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. MLE-bench: Evaluating Machine Learning
Agents on Machine Learning Engineering, October 2024. URL https://arxiv.org/abs/2410.07095v1.
Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong
Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia
Ning, Song Gao, Yu Su, and Huan Sun. ScienceAgentBench: Toward Rigorous Assessment of Language Agents for
Data-Driven Scientific Discovery, October 2024. URL https://arxiv.org/abs/2410.05080v1.
Stephen A Cook. The complexity of theorem-proving procedures. Proceedings of the third annual ACM symposium on
Theory of computing, pages 151–158, 1971.
Russell Cooper, Douglas V DeJong, Robert Forsythe, and Thomas W Ross. Communication in the battle of the sexes
game: some experimental results. The RAND Journal of Economics, pages 568–587, 1989.
George Dahl, Frank Schneider, Zachary Nado, Naman Agarwal, Chandramouli Shama Sastry, Philipp Hennig, Sourabh
Medapati, Runa Eschenhagen, Priya Kasimbeg, Daniel Suo, Juhan Bae, Justin Gilmer, Abel Peirson, Bilal Khan,
Rohan Anil, Mike Rabbat, Shankar Krishnan, Daniel Snider, Ehsan Amid, and Peter Mattson. Benchmarking
neural network training algorithms, 06 2023.
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web:
Towards a generalist agent for the web, 2023. URL https://arxiv.org/abs/2306.06070.
Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018.
Elizabeth D. Dolan and Jorge J. Moré. Benchmarking optimization software with performance profiles. Mathematical
Programming, 91(2):201–213, January 2002. ISSN 1436-4646. doi: 10.1007/s101070100263. URL https://arxiv.
org/abs/cs/0102001.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil
Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,
2024.
Katharina Eggensperger, Marius Lindauer, and Frank Hutter. Pitfalls and best practices in algorithm configuration.
Journal of Artificial Intelligence Research, 64:861–893, 2019.
Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine
Learning Research, 20(55):1–21, 2019.
M Emrich, A Agarwal, B Jairam, N Murthy, and OAK RIDGE NATIONAL LAB TN. Potential applications of
artificial intelligence to the field of software engineering. Technical report, 1988.
Merrill M Flood. Some experimental games. Management Science, 5(1):5–26, 1958.
Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang, Zhu, Friederike Niedtner,
Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor
Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. Magentic-one: A generalist multi-agent
system for solving complex tasks, 2024. URL https://arxiv.org/abs/2411.04468.
Drew Fudenberg and Jean Tirole. Game theory. MIT press, 1991.
Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul
Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab,
Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balazs Kegl, Haitham Bou-Ammar, and
Jun Wang. Large language models orchestrating structured reasoning achieve kaggle grandmaster level, 2024. URL
https://arxiv.org/abs/2411.03562.
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024. URL https:
//arxiv.org/abs/2312.00752.
Kai Guo, Zhenze Yang, Chi-Hua Yu, and Markus J Buehler. Artificial intelligence and machine learning in design of
mechanical materials. Materials Horizons, 8(4):1153–1172, 2021.
Gerhard Hessler and Karl-Heinz Baringhaus. Artificial intelligence in drug design. Molecules, 23(10):2520, 2018.
Sirui Hong, Yizhang Lin, Bang Liu, Bangbang Liu, Binhao Wu, Danyang Li, Jiaqi Chen, Jiayi Zhang, Jinlin Wang,
Li Zhang, Lingyao Zhang, Min Yang, Mingchen Zhuge, Taicheng Guo, Tuo Zhou, Wei Tao, Wenyi Wang, Xiangru
Tang, Xiangtao Lu, Xiawu Zheng, Xinbing Liang, Yaying Fei, Yuheng Cheng, Zongze Xu, and Chenglin Wu. Data
Interpreter: An LLM Agent For Data Science, March 2024. URL https://arxiv.org/abs/2402.18679.
Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating Language Agents on Machine
Learning Experimentation, April 2024. URL https://arxiv.org/abs/2310.03302.
Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad
Majumder, Oyvind Tafjord, and Peter Clark. DISCOVERYWORLD: A Virtual Environment for Developing and
Evaluating Automated Scientific Discovery Agents, June 2024. URL https://arxiv.org/abs/2406.06769.
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan.
Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.
Leland Johnson and Daniel Schaffer. Oak Ridge National Laboratory: the first fifty years. Univ. of Tennessee Press,
1994.
Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista,
Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024. URL
https://github.com/KellerJordan/modded-nanogpt.
Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. Challenges
and applications of large language models. arXiv preprint arXiv:2307.10169, 2023.
Kaggle. House prices - advanced regression techniques. Online; accessed January 24, 2025, 2016. URL https:
//www.kaggle.com/c/house-prices-advanced-regression-techniques.
Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. Ai agents that matter,
2024. URL https://arxiv.org/abs/2407.01502.
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan
Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual
web tasks. arXiv preprint arXiv:2401.13649, 2024a.
Jing Yu Koh, Stephen McAleer, Daniel Fried, and Ruslan Salakhutdinov. Tree search for language model agents, 2024b.
URL https://arxiv.org/abs/2407.01476.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Robert Tjarko Lange. gymnax: A JAX-based reinforcement learning environment library, 2022. URL http://github.
com/RobertTLange/gymnax.
P Langley. Scientific discovery: Computational explorations of the creative processes. MIT press, 1987.
Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng
Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, and Tao
Yu. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows, 2024. URL https:
//arxiv.org/abs/2411.07763.
Ziming Li, Qianbo Zang, David Ma, Jiawei Guo, Tuney Zheng, Minghao Liu, Xinyao Niu, Yue Wang, Jian Yang,
Jiaheng Liu, Wanjun Zhong, Wangchunshu Zhou, Wenhao Huang, and Ge Zhang. Autokaggle: A multi-agent
framework for autonomous data science competitions, 2024. URL https://arxiv.org/abs/2410.20424.
Jiaju Lin, Haoran Zhao, Aochi Zhang, Yiting Wu, Huqiuyue Ping, and Qin Chen. Agentsims: An open-source sandbox
for large language model evaluation, 2023. URL https://arxiv.org/abs/2308.04026.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence
Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference,
Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
Marius Lindauer and Frank Hutter. Best practices for scientific research on neural architecture search. Journal of
Machine Learning Research, 21(243):1–18, 2020.
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men,
Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun
Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as Agents.
https://arxiv.org/abs/2308.03688v2, August 2023.
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards Fully
Automated Open-Ended Scientific Discovery, August 2024. URL https://arxiv.org/abs/2408.06292.
R Duncan Luce and Howard Raiffa. Games and decisions: Introduction and critical survey. Courier Corporation, 2012.
Yubo Ma, Zhibin Gou, Junheng Hao, Ruochen Xu, Shuohang Wang, Liangming Pan, Yujiu Yang, Yixin Cao, Aixin
Sun, Hany Awadalla, and Weizhu Chen. Sciagent: Tool-augmented language models for scientific reasoning, 2024.
URL https://arxiv.org/abs/2402.11451.
METR. Evaluating frontier ai r&d capabilities of language model agents against human experts, 11 2024. URL
https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/.
Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A
benchmark for General AI Assistants, November 2023. URL https://arxiv.org/abs/2311.12983.
Thomas Miconi, Aditya Rawal, Jeff Clune, and Kenneth O. Stanley. Backpropamine: training self-modifying neural
networks with differentiable neuromodulated plasticity, 2020. URL https://arxiv.org/abs/2002.10585.
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu
Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button,
Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human
feedback, 2022. URL https://arxiv.org/abs/2112.09332.
Muhammad Umair Nasir, Sam Earle, Julian Togelius, Steven James, and Christopher Cleghorn. Llmatic: neural
architecture search via large language models and quality diversity optimization. In proceedings of the Genetic and
Evolutionary Computation Conference, pages 1110–1118, 2024.
Miyu Oba, Akari Haga, Akiyo Fukatsu, and Yohei Oseki. Babylm challenge: Curriculum learning based on sentence
complexity approximating language acquisition. In Proceedings of the BabyLM Challenge at the 27th Conference on
Computational Natural Language Learning, pages 290–297, 2023.
Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo
Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, et al. Balrog: Benchmarking agentic llm and vlm reasoning
on games. arXiv preprint arXiv:2411.13543, 2024.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine
translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages
311–318, 2002.
Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von
Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URL
https://arxiv.org/abs/2406.17557.
Karl Popper. The logic of scientific discovery. Routledge, 2005.
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill
Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu,
and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023. URL
https://arxiv.org/abs/2307.16789.
Brian Roberson. The colonel blotto game. Economic Theory, 29(1):1–24, 2006.
Nicholas Roberts, Samuel Guo, Cong Xu, Ameet Talwalkar, David Lander, Lvfang Tao, Linhang Cai, Shuaicheng
Niu, Jianyu Heng, Hongyang Qin, Minwen Deng, Johannes Hog, Alexander Pfefferle, Sushil Ammanaghatta
Shivakumar, Arjun Krishnakumar, Yubo Wang, Rhea Sukthanker, Frank Hutter, Euxhen Hasanaj, Tien-Dung
Le, Mikhail Khodak, Yuriy Nevmyvaka, Kashif Rasul, Frederic Sala, Anderson Schneider, Junhong Shen, and
Evan Sparks. Automl decathlon: Diverse tasks, modern methods, and efficiency at scale. In Marco Ciccone,
Gustavo Stolovitzky, and Jacob Albrecht, editors, Proceedings of the NeurIPS 2022 Competitions Track, volume
220 of Proceedings of Machine Learning Research, pages 151–170. PMLR, 28 Nov–09 Dec 2022a. URL https:
//proceedings.mlr.press/v220/roberts23a.html.
Nicholas Roberts, Xintong Li, Tzu-Heng Huang, Dyah Adila, Spencer Schoenberg, Cheng-Yu Liu, Lauren Pick, Haotian
Ma, Aws Albarghouthi, and Frederic Sala. AutoWS-bench-101: Benchmarking automated weak supervision with
100 labels. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track,
2022b. URL https://openreview.net/forum?id=nQZHEunntbJ.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda,
and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023. URL https:
//arxiv.org/abs/2302.04761.
Petra Schneider, W Patrick Walters, Alleyn T Plowright, Norman Sieroka, Jennifer Listgarten, Robert A Goodnow Jr,
Jasmin Fisher, Johanna M Jansen, José S Duca, Thomas S Rush, et al. Rethinking drug design in the artificial
intelligence era. Nature reviews drug discovery, 19(5):353–364, 2020.
Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L. Dill. Learning a SAT
solver from single-bit supervision. CoRR, abs/1802.03685, 2018. URL http://arxiv.org/abs/1802.03685.
Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can llms generate novel research ideas? a large-scale human study
with 100+ nlp researchers, 2024. URL https://arxiv.org/abs/2409.04109.
Xiangru Tang, Yuliang Liu, Zefan Cai, Yanjun Shao, Junjie Lu, Yichi Zhang, Zexuan Deng, Helan Hu, Kaikai An,
Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao
Chang, Yin Fang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, and Mark Gerstein. ML-Bench:
Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code, June 2024.
URL https://arxiv.org/abs/2311.09835.
Artificial Intelligence Task Team. Artificial intelligence and nuclear power. 1985.
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent,
Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of
context. arXiv preprint arXiv:2403.05530, 2024.
Alexander Tornede, Difan Deng, Theresa Eimer, Joseph Giovanelli, Aditya Mohan, Tim Ruhkopf, Sarah Segel, Daphne
Theodorakopoulos, Tanja Tornede, Henning Wachsmuth, et al. Automl in the age of large language models: Current
challenges, future opportunities and risks. arXiv preprint arXiv:2306.08107, 2023.
Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U. Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão,
Andreas Kallinteris, Markus Krimmel, Arjun KG, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet
Tai, Hannah Tan, and Omar G. Younis. Gymnasium: A standard interface for reinforcement learning environments,
2024. URL https://arxiv.org/abs/2407.17032.
Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish
Sabharwal, and Niranjan Balasubramanian. AppWorld: A Controllable World of Apps and People for Benchmarking
Interactive Coding Agents, July 2024. URL https://arxiv.org/abs/2407.18901.
Renbo Tu, Nicholas Roberts, Mikhail Khodak, Junhong Shen, Frederic Sala, and Ameet Talwalkar. NAS-bench-360:
Benchmarking neural architecture search on diverse tasks. In Thirty-sixth Conference on Neural Information
Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=xUXTbq6gWsB.
Petar Veličković, Adrià Puigdomènech Badia, David Budden, Razvan Pascanu, Andrea Banino, Misha Dashevskiy,
Raia Hadsell, and Charles Blundell. The clrs algorithmic reasoning benchmark. In International Conference on
Machine Learning, pages 22084–22102. PMLR, 2022.
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar.
Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen,
Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. A survey on large language model based autonomous
agents. Frontiers of Computer Science, 18(6):186345, December 2024a. ISSN 2095-2228, 2095-2236. doi: 10.1007/
s11704-024-40231-1.
Lei Wang, Jingsen Zhang, Hao Yang, Zhiyuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Ruihua Song,
Wayne Xin Zhao, Jun Xu, Zhicheng Dou, Jun Wang, and Ji-Rong Wen. User behavior simulation with large language
model based agents, 2024b. URL https://arxiv.org/abs/2306.02552.
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song,
Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao,
Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham
Neubig. OpenDevin: An Open Platform for AI Software Developers as Generalist Agents, July 2024c. URL
https://arxiv.org/abs/2407.16741.
Adina Williams, Nikita Nangia, and Samuel R Bowman. The multi-genre nli corpus. 2018.
Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng
Kong. Os-copilot: Towards generalist computer agents with self-improvement. arXiv preprint arXiv:2402.07456,
2024.
Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based Software
Engineering Agents, July 2024. URL https://arxiv.org/abs/2407.01489.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning
algorithms, 2017.
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press.
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering, May 2024. URL https://arxiv.
org/abs/2405.15793.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing
reasoning and acting in language models. In The Eleventh International Conference on Learning Representations,
2023. URL https://openreview.net/forum?id=WE_vluYUL-X.
Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. AssistantBench:
Can Web Agents Solve Realistic and Time-Consuming Tasks?, July 2024. URL https://arxiv.org/abs/2407.15711.
Kenny Young and Tian Tian. Minatar: An atari-inspired testbed for thorough and reproducible reinforcement learning
experiments, 2019. URL https://arxiv.org/abs/1903.03176.
Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, and Zhou Yu. Exact: Teaching ai
agents to explore with reflective-mcts and exploratory learning, 2025. URL https://arxiv.org/abs/2410.02052.
Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, and
Chuang Gan. Building cooperative embodied agents modularly with large language models, 2024a. URL https:
//arxiv.org/abs/2307.02485.
Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages
1592–1604, 2024b.
Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan
Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint
arXiv:2307.13854, 2023.
Appendix
A
Additional Results and Analysis
A.1
Computational Cost
Table 7 lists the resources needed to run the agent on each task in MLGym-Bench. Each task has a set training timeout, which is used as the time limit for any Python command; specifically, it prevents the agent from continuously scaling up the model parameters. Average Agent Runtime and Baseline Runtime show the wall-clock time for each agent run and for the provided baseline code, respectively.
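A minimal sketch of how such a timeout can be enforced on agent-issued training commands is shown below; the function name and return format are illustrative, not the framework's actual implementation.

import subprocess

def run_training_command(command: str, timeout_seconds: int) -> str:
    # Run a shell command issued by the agent, killing it if it exceeds the task timeout.
    try:
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=timeout_seconds)
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return f"Command exceeded the {timeout_seconds}s training timeout and was terminated."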
Task                       Training Timeout   GPUs/Agents   Average Agent Runtime   Baseline Runtime (mins)
CIFAR-10                   30m                1             4h                      15
Battle of Sexes            30m                0             30m                     5
Prisoners Dilemma          30m                0             30m                     5
Blotto                     30m                0             30m                     5
House Price Prediction     30m                1             1.5h                    10
Fashion MNIST              30m                1             2h                      10
MS-COCO                    40m                1                                     7
MNLI                       40m                1                                     22
Language Modeling          40m                2             4h                      20
Breakout                   30m                2             2h                      15
Mountain Car Continuous    30m                2             2h                      15
Meta Maze                  30m                2             2h                      15
3-SAT Heuristic            30m                0             30m                     5

Table 7 Computational resources required for each task in MLGym-bench.
Table 8 lists the average input and output tokens and associated pricing for each model across all tasks in MLGym-Bench. We report the model pricing as listed by their respective providers. Llama3.1-405b-Instruct pricing is taken from Together AI. Note that for this work, we used the open-weights model checkpoint with FP-8 precision, hosted on Meta internal servers. Gemini-1.5-Pro charges 2× for using the long-context capabilities, i.e., for input and output exceeding 128K tokens. However, in our experiments Gemini does not use the long-context capabilities, so the final price is reported based on the normal pricing.
Model                      Avg. Input Tokens   Avg. Output Tokens   Input Pricing   Output Pricing   Context Length
Llama3.1-405b-instruct∗    304348              2512                 3.50            3.50             128k
Claude-3.5-Sonnet          707704              12415                3.00            15.0             200k
Gemini-1.5-Pro†            282613              1633                 1.25            5.00             2M
GPT-4o                     266886              2429                 2.50            10.0             128k
OpenAI O1-Preview          368898              60704                15.0            60.0             128k

Table 8 Model pricing, token usage, and context length details. Model pricing is in USD per 1M tokens. ∗ Llama3.1: FP8 endpoint by Together6
6 https://www.together.ai/pricing
A.2
Failure Mode Analysis
Figure 9 Number of Failed and Incomplete runs per task. The criteria for marking a run as incomplete or failed are described in Section 7.4.1.
Continuing the discussion from Section 7.4.1, we show the failed and incomplete runs on each task to
understand the difficulty distribution of tasks. Language Modeling and all Reinforcement Learning tasks
(Meta Maze, Mountain Car Continuous and Breakout) prove the most challenging, with the highest failure
rates. In contrast, Fashion MNIST and Prisoner's Dilemma show the lowest failure rates, with all models
producing a valid intermediate solution and a valid submission for all seeds.
These failure patterns align with the raw performance scores in Table 5 and Table 6: tasks requiring complex architectural decisions (Language Modeling) or complex algorithms (Breakout, Meta Maze, and Mountain Car Continuous) are the hardest for the agents. Traditional supervised learning tasks are handled more reliably across models, while the more advanced models demonstrate better error handling and completion rates overall.
A.3
Action Analysis
Extending the results presented in Section 7.4.2, Figure 10 shows the action distribution on each task. The bars represent the sum of all the actions taken by all models on a particular task. We notice that RL tasks have the highest action count, while game-theoretic tasks have the lowest. Algorithmic tasks such as 3-SAT and the game theory tasks (Blotto, Prisoner's Dilemma, and Battle of Sexes) also have the highest number of validation actions, signifying a quick experimental cycle. Similarly, the RL tasks have the most complex codebases among all MLGym-Bench tasks, and thus agents extensively use the View commands.
Figure 10 Action Distribution for each task. We group the actions into categories following the grouping defined in Table 2 and Section 7.4.2.
A.4
Model Rankings
Table 9 and Table 10 show each model’s ranking based on Best Attempt@4 and Best Submission@4 scores
respectively. The aggregate ranks are computed using the BORDA7 count method. The aggregated rankings
computed using BORDA count method align with the AUP score results as shown in Table 4. However, similar
to any ranking-only metric, it does not convey the relative difference between each model’s performance.
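For reference, Borda aggregation assigns each method points according to its per-task position and sums them across tasks; a small sketch (ignoring ties) is shown below.

from collections import defaultdict

def borda_aggregate(rankings):
    # rankings: task -> list of methods ordered best to worst.
    points = defaultdict(int)
    for order in rankings.values():
        n = len(order)
        for position, method in enumerate(order):
            points[method] += n - 1 - position    # best gets n-1 points, worst gets 0
    return sorted(points, key=points.get, reverse=True)   # higher total = better aggregate rank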
Task                      Rank 1                 Rank 2              Rank 3                  Rank 4                  Rank 5                  Rank 6
CIFAR-10                  Claude-3.5-Sonnet      OpenAI O1           Gemini-1.5-Pro          GPT-4o                  Llama3-405b-instruct    Baseline
Battle of Sexes           OpenAI O1              Gemini-1.5-Pro      Claude-3.5-Sonnet       Llama3-405b-instruct    GPT-4o                  Baseline
Prisoners Dilemma         Llama3-405b-instruct   Gemini-1.5-Pro      OpenAI O1               GPT-4o                  Claude-3.5-Sonnet       Baseline
Blotto                    Claude-3.5-Sonnet      Gemini-1.5-Pro      OpenAI O1               GPT-4o                  Llama3-405b-instruct    Baseline
House Price Prediction    OpenAI O1              Claude-3.5-Sonnet   Gemini-1.5-Pro          Llama3-405b-instruct    GPT-4o                  Baseline
Fashion MNIST             Claude-3.5-Sonnet      GPT-4o              OpenAI O1               Gemini-1.5-Pro          Llama3-405b-instruct    Baseline
Language Modeling         OpenAI O1              Gemini-1.5-Pro      GPT-4o                  Claude-3.5-Sonnet       Baseline                Llama3-405b-instruct
Breakout                  Gemini-1.5-Pro         OpenAI O1           Llama3-405b-instruct    Baseline                Claude-3.5-Sonnet       GPT-4o
Mountain Car Continuous   OpenAI O1              Gemini-1.5-Pro      Claude-3.5-Sonnet       Baseline                Llama3-405b-instruct    GPT-4o
Meta Maze                 Claude-3.5-Sonnet      OpenAI O1           Gemini-1.5-Pro          Llama3-405b-instruct    Baseline                GPT-4o
3-SAT Heuristic           OpenAI O1              GPT-4o              Llama3-405b-instruct    Gemini-1.5-Pro          Claude-3.5-Sonnet       Baseline
BORDA                     OpenAI O1              Gemini-1.5-Pro      Claude-3.5-Sonnet       Llama3-405b-instruct    GPT-4o                  Baseline

Table 9 Individual and Aggregate Ranking of models based on Best Attempt@4. We use the BORDA method to compute the aggregate ranks.
7 https://en.wikipedia.org/wiki/Borda_count
Task                      Rank 1              Rank 2              Rank 3                  Rank 4                  Rank 5                  Rank 6
CIFAR-10                  Claude-3.5-Sonnet   OpenAI O1           Gemini-1.5-Pro          GPT-4o                  Llama3-405b-instruct    Baseline
Battle of Sexes           Gemini-1.5-Pro      OpenAI O1           Claude-3.5-Sonnet       Llama3-405b-instruct    GPT-4o                  Baseline
Prisoners Dilemma         Gemini-1.5-Pro      GPT-4o              OpenAI O1               Claude-3.5-Sonnet       Llama3-405b-instruct    Baseline
Blotto                    OpenAI O1           Claude-3.5-Sonnet   Gemini-1.5-Pro          GPT-4o                  Llama3-405b-instruct    Baseline
House Price Prediction    OpenAI O1           Claude-3.5-Sonnet   Llama3-405b-instruct    Gemini-1.5-Pro          GPT-4o                  Baseline
Fashion MNIST             Claude-3.5-Sonnet   GPT-4o              Gemini-1.5-Pro          OpenAI O1               Llama3-405b-instruct    Baseline
Language Modeling         OpenAI O1           Gemini-1.5-Pro      GPT-4o                  Claude-3.5-Sonnet       Baseline                Llama3-405b-instruct
Breakout                  Gemini-1.5-Pro      OpenAI O1           Llama3-405b-instruct    Baseline                Claude-3.5-Sonnet       GPT-4o
Mountain Car Continuous   OpenAI O1           Gemini-1.5-Pro      Claude-3.5-Sonnet       Baseline                Llama3-405b-instruct    GPT-4o
Meta Maze                 Claude-3.5-Sonnet   OpenAI O1           Llama3-405b-instruct    Gemini-1.5-Pro          Baseline                GPT-4o
3-SAT Heuristic           GPT-4o              OpenAI O1           Llama3-405b-instruct    Gemini-1.5-Pro          Claude-3.5-Sonnet       Baseline
BORDA                     OpenAI O1           Gemini-1.5-Pro      Claude-3.5-Sonnet       GPT-4o                  Llama3-405b-instruct    Baseline

Table 10 Individual and Aggregate Ranking of models based on Best Submission@4. We use the BORDA method to compute the aggregate ranks.
A.5
Memory Utilization
Figure 11 and Figure 12 show the agent using the memory module to store and retrieve specific experimental
results and use them to submit the best possible model.
Figure 11 Example of retrieving the best training configuration from memory and restarting exploration from it.
Figure 12 Example of retrieving the best training configuration from memory and restarting exploration from it.
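A hypothetical sketch of such a memory is shown below: the agent appends each (configuration, score) pair after a validation and can later retrieve the best one before submitting. The class and file names are illustrative, not the module's actual interface.

import json

class ExperimentMemory:
    def __init__(self, path="memory.json"):
        self.path, self.records = path, []

    def store(self, config: dict, score: float) -> None:
        # Persist every validated experiment so it survives later mistakes.
        self.records.append({"config": config, "score": score})
        with open(self.path, "w") as f:
            json.dump(self.records, f, indent=2)

    def best(self, higher_is_better: bool = True):
        if not self.records:
            return None
        pick = max if higher_is_better else min
        return pick(self.records, key=lambda r: r["score"])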
B
Prompts
Listing 1 System Prompt
SETTING: You are an autonomous machine learning researcher, and you're working directly in the command line with a special interface.
The special interface consists of a file editor that shows you 1000 lines of a file at a time.
In addition to typical bash commands, you can also use the following commands to help you navigate and edit files.
COMMANDS:
open:
  docstring: opens the file at the given path in the editor. If line_number is provided, the window will be moved to include that line
  signature: open "<path>" [<line_number>]
  arguments:
    - path (string) [required]: the path to the file to open
    - line_number (integer) [optional]: the line number to move the window to (if not provided, the window will start at the top of the file)
goto:
  docstring: moves the window to show <line_number>
  signature: goto <line_number>
  arguments:
    - line_number (integer) [required]: the line number to move the window to
scroll_down:
  docstring: moves the window down 1000 lines
  signature: scroll_down
scroll_up:
  docstring: moves the window up 1000 lines
  signature: scroll_up
create:
  docstring: creates and opens a new file with the given name
  signature: create <filename>
  arguments:
    - filename (string) [required]: the name of the file to create
search_dir:
  docstring: searches for search_term in all files in dir. If dir is not provided, searches in the current directory
  signature: search_dir <search_term> [<dir>]
  arguments:
    - search_term (string) [required]: the term to search for
    - dir (string) [optional]: the directory to search in (if not provided, searches in the current directory)
search_file:
  docstring: searches for search_term in file. If file is not provided, searches in the current open file
  signature: search_file <search_term> [<file>]
  arguments:
    - search_term (string) [required]: the term to search for
    - file (string) [optional]: the file to search in (if not provided, searches in the current open file)
find_file:
  docstring: finds all files with the given name in dir. If dir is not provided, searches in the current directory
  signature: find_file <file_name> [<dir>]
  arguments:
    - file_name (string) [required]: the name of the file to search for
    - dir (string) [optional]: the directory to search in (if not provided, searches in the current directory)
edit:
  docstring: replaces lines <start_line> through <end_line> (inclusive) with the given text in the open file. The replacement text is terminated by a line with only end_of_edit on it. All of the <replacement text> will be entered, so make sure your indentation is formatted properly. Python files will be checked for syntax errors after the edit. If the system detects a syntax error, the edit will not be executed. Simply try to edit the file again, but make sure to read the error message and modify the edit command you issue accordingly. Issuing the same command a second time will just lead to the same error message again.
  signature: edit <start_line>:<end_line>
  <replacement_text>
  end_of_edit
  arguments:
    - start_line (integer) [required]: the line number to start the edit at
    - end_line (integer) [required]: the line number to end the edit at (inclusive)
    - replacement_text (string) [required]: the text to replace the current selection with
insert:
  docstring: inserts the given text after the specified line number in the open file. The text to insert is terminated by a line with only end_of_insert on it. All of the <text_to_add> will be entered, so make sure your indentation is formatted properly. Python files will be checked for syntax errors after the insertion. If the system detects a syntax error, the insertion will not be executed. Simply try to insert again, but make sure to read the error message and modify the insert command you issue accordingly.
  signature: insert <line_number>
  <text_to_add>
  end_of_insert
  arguments:
    - line_number (integer) [required]: the line number after which to insert the text
    - text_to_add (string) [required]: the text to insert after the specified line
submit:
  docstring: submits your current code and terminates the session
  signature: submit
validate:
  docstring: validates your current submission file and returns the metrics on test set
  signature: validate
Please note that THE EDIT AND INSERT COMMANDS REQUIRE PROPER INDENTATION.
If you'd like to add the line '    print(x)', you must fully write that out, with all those spaces before the code! Indentation is important, and code that is not indented correctly will fail and require fixing before it can be run.
RESPONSE FORMAT:
Your shell prompt is formatted as follows:
(Open file: <path>)
You need to format your output using two fields: discussion and command.
Your output should always include _one_ discussion and _one_ command field EXACTLY as in the following example:
DISCUSSION
First I'll start by using ls to see what files are in the current directory. Then maybe we can look at some relevant files to see what they look like.
```
ls -a
```
You should only include a *SINGLE* command in the command section and then wait for a response from the shell before continuing with more discussion and commands.
Everything you include in the DISCUSSION section will be saved for future reference.
Please do not include any DISCUSSION after your action.
If you'd like to issue two commands at once, PLEASE DO NOT DO THAT! Please instead first submit just the first command, and then after receiving a response you'll be able to issue the second command.
You're free to use any other bash commands you want (e.g. find, grep, cat, ls, cd) in addition to the special commands listed above.
However, the environment does NOT support interactive session commands (e.g. python, vim), so please do not invoke them.
Your goal is to achieve the best possible score, not just to submit your first working solution. Consider strategies like validating your answer using the `validate` command, manually spot-checking predictions, building custom validation sets and grading functions, and comparing different algorithms.
Once you have exhausted all possible solutions and cannot make progress, you can submit your final solution by using the `submit` command.
INSTRUCTIONS:
Now, you're going to train a model to improve performance on this task. Your terminal session has started and you're in the workspace root directory. You can use any bash commands or the special interface to help you. Edit all the files you need or create a new training script.
Remember, YOU CAN ONLY ENTER ONE COMMAND AT A TIME. You should always wait for feedback after every command.
When you're satisfied with all of the changes you've made, you can run your training file. Your training file should include the logic for saving the predictions for the `test` set of the task. The submission file should be named `submission.csv` with the instance id and prediction columns.
A sample submission file is given in the workspace and you can read it to get a better understanding of the submission format.
Note however that you cannot use any interactive session commands (e.g. python, vim) in this environment, but you can write scripts and run them. E.g. you can write a python script and then run it with `python <script_name>.py`.
NOTE ABOUT THE EDIT AND INSERT COMMANDS: Indentation really matters! When editing a file, make sure to insert appropriate indentation before each line!
IMPORTANT TIPS:
1. Always start by trying to understand the baseline script if available. This will give you an idea of one possible solution for the task and the baseline scores that you have to beat.
2. If you run a command and it doesn't work, try running a different command. A command that did not work once will not work the second time unless you modify it!
3. If you open a file and need to get to an area around a specific line that is not in the first 100 lines, say line 583, don't just use the scroll_down command multiple times. Instead, use the goto 583 command. It's much quicker.
4. Always make sure to look at the currently open file and the current working directory (which appears right after the currently open file). The currently open file might be in a different directory than the working directory! Note that some commands, such as 'create', open files, so they might change the current open file.
5. When editing files, it is easy to accidentally specify a wrong line number or to write code with incorrect indentation. Always check the code after you issue an edit to make sure that it reflects what you wanted to accomplish. If it didn't, issue another command to fix it.
6. You have a limited number of actions/steps you can take in the environment. The current step and the remaining number of steps will be given after every action. Use the remaining steps wisely. If you only have a few remaining steps, it is better to submit a working solution than to keep trying.
7. Each of your actions should take less than 1800 seconds to complete. If your action doesn't finish within the time limit, it will be interrupted.
(Current Step: 0, Remaining Steps: 50)
(Open file: n/a)
(Current directory: /home/agent/imageClassificationCifar10)
bash-
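As a concrete illustration of the RESPONSE FORMAT required by the system prompt above, the following sketch parses a model response into its DISCUSSION text and the single fenced command. It is a hypothetical helper written for this explanation, not MLGym's actual parsing code.

import re

def parse_response(response: str):
    """Split an agent response into (discussion, command) per the required format."""
    # The command is the content of the single triple-backtick block.
    match = re.search(r"```(.*?)```", response, flags=re.DOTALL)
    if match is None:
        raise ValueError("response must contain exactly one fenced command block")
    command = match.group(1).strip()
    # Everything before the block, minus the DISCUSSION header, is the discussion.
    discussion = response[: match.start()].replace("DISCUSSION", "", 1).strip()
    return discussion, command

Under this reading, anything outside the first fenced block is treated as discussion, and a response without a fenced command is rejected, matching the "one discussion field, one command field" requirement stated in the prompt.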