MARS: Modular Agent with Reflective Search for Automated AI Research

Jaehyun Nam
2026.02.08
· arXiv · by 이호민
#AI Research #Automated Research #LLM Agents #MCTS #Modular Design

Key Points

  • MARS (Modular Agent with Reflective Search) is a novel framework designed to automate AI research, addressing challenges like expensive evaluations and complex code generation that limit traditional LLM-based agents.
  • It employs three core pillars: Budget-Aware Planning via cost-constrained Monte Carlo Tree Search, a Modular Construction approach with a "Design-Decompose-Implement" pipeline, and Comparative Reflective Memory for distilling high-signal insights.
  • Evaluated on MLE-Bench, MARS achieves state-of-the-art performance among open-source frameworks, demonstrating superior medal rates and the ability to generalize learned insights across different search paths.

MARS (Modular Agent with Reflective Search) is a framework designed for automating AI research, specifically addressing challenges posed by computationally expensive evaluations and opaque performance attribution in this domain. Unlike traditional LLM-based agents that generate monolithic scripts and neglect execution costs, MARS optimizes for autonomous scientific discovery through three core pillars: Budget-Aware Planning, Modular Construction, and Comparative Reflective Memory.

The paper formalizes the problem as finding a solution $s^*$ that maximizes an objective $O(s, E)$ within an environment $E$, subject to a cost constraint $B$:

$$s^* = \arg\max_s O(s, E) \quad \text{s.t.} \quad \text{Cost}(s) \le B$$

This problem is instantiated as Machine Learning Engineering (MLE) tasks, where $E$ includes datasets and $O$ is a performance metric such as accuracy on a held-out test set.
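As a minimal illustration, the constrained objective can be read as a search loop that proposes candidates and stops once cumulative cost exhausts the budget $B$; the function names here are hypothetical, not the paper's API:

```python
# Minimal sketch of budget-constrained solution search (hypothetical interfaces).
def search(propose, evaluate, cost, budget):
    """Return the best solution found before the cost budget B is exhausted."""
    best, best_score, spent = None, float("-inf"), 0.0
    while True:
        s = propose(best)           # generate a candidate (e.g., via an LLM agent)
        c = cost(s)                 # estimated evaluation cost of s
        if spent + c > budget:      # respect the aggregate budget B
            break
        spent += c
        score = evaluate(s)         # O(s, E): e.g., held-out accuracy
        if score > best_score:
            best, best_score = s, score
    return best, best_score
```

The loop makes the key tension explicit: every evaluation consumes budget, so an agent must decide which candidates are worth their cost.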

Core Methodology:

  1. Modular Construction Strategy (Modular Decomposition):
MARS fundamentally shifts from generating monolithic scripts to a repository-level, modular software architecture. This addresses token limits, enhances precision by focusing on smaller logical units, enables efficient code reuse, and improves testability. A node solution $s_n$ is defined as a tuple comprising a set of $l$ independent modules and one orchestration script:

$$s_n = \langle \{M_j\}_{j=1}^l, \pi_{\text{main}} \rangle$$

Each module $M_j$ encapsulates a specific sub-task (e.g., data preprocessing, configuration), and $\pi_{\text{main}}$ orchestrates the pipeline.
The process involves a three-stage "Design-Decompose-Implement" workflow:
  • Idea Generation: An Idea Generation Agent articulates a comprehensive natural language plan.
  • Module Decomposition: A Modular Agent parses the plan and decomposes the solution into logical, independent functional modules.
  • Component Implementation and Debugging: A Coding Agent sequentially implements each module MjM_j and orchestrates them via Ο€main\pi_{\text{main}}.
To avoid full-repository regeneration, Diff-Based Editing is employed, allowing atomic, multi-file updates by specifying target files, blocks to replace, and new code in a standardized diff format.
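The node-solution structure and the diff-editing step can be sketched together as follows; the class layout and diff schema are illustrative assumptions, not the paper's exact format:

```python
from dataclasses import dataclass, field

# Hypothetical layout of a node solution s_n = <{M_j}, pi_main>.
@dataclass
class NodeSolution:
    modules: dict = field(default_factory=dict)  # {M_j}: file name -> source
    pi_main: str = ""                            # orchestration script (main.py)

    def repo(self) -> dict:
        """Materialize the solution as a repository of files."""
        return {**self.modules, "main.py": self.pi_main}

# Illustrative atomic multi-file diff: each edit names a target file, the exact
# block to replace, and the new code. All edits are validated before any is
# applied, so a malformed diff changes nothing (the update is all-or-nothing).
def apply_diff(repo: dict, edits: list) -> dict:
    for e in edits:
        if e["search"] not in repo.get(e["file"], ""):
            raise ValueError(f"block not found in {e['file']}")
    new_repo = dict(repo)
    for e in edits:
        new_repo[e["file"]] = new_repo[e["file"]].replace(e["search"], e["replace"], 1)
    return new_repo
```

Validating every edit before applying any is what makes the multi-file update atomic: a stale search block aborts the whole diff instead of leaving the repository half-patched.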

  2. Reflective Memory (Lesson Learning):
To overcome context window limitations and enable iterative improvement, MARS introduces Lesson Learning to distill high-value insights into a compact lesson pool.
  • Solution Improvement Lessons: An Empirical Analysis Agent extracts objective findings from execution logs. A Lesson Distillation Agent compares new solutions against the best known, distilling structured lessons containing algorithmic changes, impact analysis, and generalized rules.
  • Debugging Lessons: For failed executions, a dedicated agent analyzes buggy code, error logs, and fixes, producing lessons that explain failure logic and provide guidelines to prevent similar errors.
  • Lesson Management: A Review Agent filters redundant insights via LLM-based reasoning to maintain a high-signal, diverse lesson pool.
  • Lesson Utilization: Relevant lessons (up to the $K_m$ most recent) from corresponding categories are utilized by the agent, which is instructed to explicitly cite applied lessons.
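The retrieval step can be sketched as a categorized pool returning the $K_m$ most recent lessons per category; the structure and method names below are assumptions for illustration:

```python
from collections import defaultdict

# Hypothetical sketch of the lesson pool: lessons are stored per category
# ("improvement", "debugging", ...) and the K_m most recent are retrieved.
class LessonPool:
    def __init__(self, k_m: int = 5):
        self.k_m = k_m
        self._pool = defaultdict(list)

    def add(self, category: str, lesson: str) -> None:
        # In MARS, a Review Agent filters redundant lessons before storage.
        self._pool[category].append(lesson)

    def retrieve(self, category: str) -> list:
        """Return up to K_m most recent lessons to inject into the agent prompt."""
        return self._pool[category][-self.k_m:]
```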
  3. Budget-Aware Monte Carlo Tree Search (MCTS):
MARS explores the solution space using an MCTS framework with domain-specific modifications:
  • Actions and Expansion: Three distinct operators transform a parent state $s_{\text{parent}}$ into a child solution $s_{\text{new}}$:
    • Drafting (Root Expansion): Generates a completely new solution from scratch.
    • Improvement: Applied to valid nodes, modifying modules and the main script to maximize $O$.
    • Debugging: Applied to failed nodes, inheriting structure but modifying parts to resolve errors. Buggy children enter an automatic debugging loop with up to $N_d$ debugging actions.
  • Node Selection: The Upper Confidence Bound for Trees (UCT) algorithm is used to balance exploitation and exploration. Traversal selects the child maximizing the UCT value until a non-"fully expanded" node is found.
    • The root node is not fully expanded (i.e., it remains eligible for a new draft) if it has no children or if the best solution has not improved after implementing $n_s$ valid nodes.
    • Buggy nodes are always fully expanded.
    • Valid nodes are fully expanded if they have $\ge N_i$ children (improvement attempts).
  • Efficiency-Guided Reward Function: A reward function $R(v)$ balances performance gains with execution cost.
First, a global normalized score $G(v)$ is computed based on the performance metric $M(v)$ relative to the history of explored nodes $V$:

$$G(v) := \begin{cases} 0.5 & \text{if } M_{\text{max}} = M_{\text{min}} \\ \frac{M(v) - M_{\text{min}}}{M_{\text{max}} - M_{\text{min}}} & \text{otherwise} \end{cases}$$

where $M_{\text{max}} = \max_{v' \in V} M(v')$ and $M_{\text{min}} = \min_{v' \in V} M(v')$.
The efficiency-guided reward $R(v)$ then incorporates execution time $t(v)$ and time limit $L(v)$:

$$R(v) := G(v) \cdot [t(v)/L(v)]^w$$

where $w$ is a penalty weight hyperparameter.
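The selection and reward rules above can be sketched together in a few lines of Python; the exploration constant, node fields, and function names are assumptions for illustration, not the paper's implementation:

```python
import math

# Hypothetical UCT score: exploit (mean reward) + explore (visit-count bonus).
def uct(parent, child, c: float = 1.414) -> float:
    if child.visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore

def select(root, fully_expanded) -> object:
    """Descend by max UCT until reaching a node that is not fully expanded."""
    node = root
    while fully_expanded(node) and node.children:
        node = max(node.children, key=lambda ch: uct(node, ch))
    return node  # expand here via drafting / improvement / debugging

# Efficiency-guided reward R(v) = G(v) * (t/L)^w over raw metric history.
def reward(m_v: float, history: list, t: float, limit: float, w: float = -0.07) -> float:
    m_max, m_min = max(history), min(history)
    g = 0.5 if m_max == m_min else (m_v - m_min) / (m_max - m_min)
    # With w < 0, faster runs (smaller t/L) earn a larger multiplier.
    return g * (t / limit) ** w
```

Note how the negative exponent implements the efficiency bias: two nodes with the same normalized score $G(v)$ are ranked by runtime, with the faster one rewarded more.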

Task-Specific Components (for MLE):
For MLE tasks, MARS includes Task Preparation (extracting metadata, formalizing objectives, preparing data), Data Analysis (Exploratory Data Analysis guidance), and Curriculum-Based Exploration (progressively exploring from simple baselines to complex methods).
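Curriculum-based exploration can be modeled as an ordered list of method tiers, with more complex tiers unlocking only after earlier ones produce a working solution; the tiers and names below are illustrative examples, not the paper's exact curriculum:

```python
# Hypothetical curriculum: explore simple baselines first, unlock more complex
# method tiers as earlier tiers produce valid, scoring solutions.
CURRICULUM = [
    ["linear baseline", "gradient boosting"],           # tier 0: simple
    ["small neural network", "standard augmentation"],  # tier 1: moderate
    ["pretrained backbone", "ensembling"],              # tier 2: complex
]

def allowed_methods(solved_tiers: int) -> list:
    """Methods available given how many tiers already have a valid solution."""
    unlocked = min(solved_tiers + 1, len(CURRICULUM))
    return [m for tier in CURRICULUM[:unlocked] for m in tier]
```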

Experiments and Results:
MARS is evaluated on MLE-Bench, a benchmark of 75 Kaggle competitions covering NLP, CV, and tabular data. Experiments are conducted under a strict 24-hour wall-clock time budget per competition on a standard node (NVIDIA A100 GPU, 12 vCPUs, 220 GB RAM). Metrics include Above Median Rate, Any Medal Rate, and Gold Medal Rate.
MARS establishes state-of-the-art performance among open-source frameworks, significantly outperforming AIDE and AIRA-dojo under identical constraints. Using Gemini-3-Pro-Preview, MARS achieves 98.7% valid submissions, 65.8% Above Median Rate, and 56.0% Any Medal Rate (31.1% Gold). A scaled variant, MARS+ (2x A100 GPUs), achieves even higher rates (73.3% Above Median, 59.6% Any Medal), surpassing resource-intensive competitors. MARS consistently outperforms baselines across Lite, Medium, and High task complexities.
Ablation studies demonstrate the significant contributions of both Modular Decomposition and Lesson Learning. Comparisons of tree search strategies show that Budget-Aware MCTS (with $w=-0.07$) consistently yields superior performance over greedy search or vanilla MCTS ($w=0$), effectively balancing exploration with resource constraints. Qualitative analysis shows modular decomposition leads to more extensive and structured codebases (higher lines of code, more files) in the best solutions, with diverse modules tailored to specific sub-tasks.