
MARS: Modular Agent with Reflective Search for Automated AI Research
Key Points
- MARS (Modular Agent with Reflective Search) is a novel framework designed to automate AI research, addressing challenges like expensive evaluations and complex code generation that limit traditional LLM-based agents.
- It employs three core pillars: Budget-Aware Planning via cost-constrained Monte Carlo Tree Search, a Modular Construction approach with a "Design-Decompose-Implement" pipeline, and Comparative Reflective Memory for distilling high-signal insights.
- Evaluated on MLE-Bench, MARS achieves state-of-the-art performance among open-source frameworks, demonstrating superior medal rates and the ability to generalize learned insights across different search paths.
MARS (Modular Agent with Reflective Search) is a framework designed for automating AI research, specifically addressing challenges posed by computationally expensive evaluations and opaque performance attribution in this domain. Unlike traditional LLM-based agents that generate monolithic scripts and neglect execution costs, MARS optimizes for autonomous scientific discovery through three core pillars: Budget-Aware Planning, Modular Construction, and Comparative Reflective Memory.
The paper formalizes the problem as finding a solution $s$ that maximizes an objective $f(s; \mathcal{E})$ within an environment $\mathcal{E}$, subject to a cost constraint $C$:

$$s^{*} = \arg\max_{s \in \mathcal{S}} f(s; \mathcal{E}) \quad \text{s.t.} \quad \mathrm{cost}(s) \le C$$

This problem is instantiated as Machine Learning Engineering (MLE) tasks, where $\mathcal{E}$ includes datasets and $f$ is a performance metric such as accuracy on a held-out test set.
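The constrained objective above can be sketched as a budget-gated search loop; all function names here are illustrative assumptions, not the paper's API:

```python
def budget_constrained_search(candidates, evaluate, cost_of, budget):
    """Maximize a performance metric over candidate solutions while the
    total evaluation cost stays within the budget. Hypothetical sketch:
    `evaluate` plays the role of f(s; E), `cost_of` of the per-solution cost."""
    best, best_score, spent = None, float("-inf"), 0.0
    for s in candidates:
        cost = cost_of(s)
        if spent + cost > budget:   # hard cost constraint: stop before overspending
            break
        spent += cost
        score = evaluate(s)         # performance metric f(s; E)
        if score > best_score:
            best, best_score = s, score
    return best, best_score, spent
```

The key difference from unconstrained search is that the loop terminates on cost, not on convergence, which is what makes budget-aware planning necessary in the first place.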
Core Methodology:
- Modular Construction Strategy (Modular Decomposition):
Each module encapsulates a specific sub-task (e.g., data preprocessing, configuration), and a main script orchestrates the pipeline.
The process involves a three-stage "Design-Decompose-Implement" workflow:
- Idea Generation: An Idea Generation Agent articulates a comprehensive natural language plan.
- Module Decomposition: A Modular Agent parses the plan and decomposes the solution into logical, independent functional modules.
- Component Implementation and Debugging: A Coding Agent sequentially implements each module and orchestrates them via the main script.
Diff-Based Editing is employed, allowing atomic, multi-file updates by specifying target files, blocks to replace, and new code in a standardized diff format.
- Reflective Memory (Lesson Learning):
MARS uses Lesson Learning to distill high-value insights into a compact lesson pool.
- Solution Improvement Lessons: An Empirical Analysis Agent extracts objective findings from execution logs. A Lesson Distillation Agent compares new solutions against the best known, distilling structured lessons containing algorithmic changes, impact analysis, and generalized rules.
- Debugging Lessons: For failed executions, a dedicated agent analyzes buggy code, error logs, and fixes, producing lessons that explain the failure logic and provide guidelines to prevent similar errors.
- Lesson Management: A Review Agent filters redundant insights via LLM-based reasoning to maintain a high-signal, diverse lesson pool.
- Lesson Utilization: Relevant lessons (up to a fixed number of the most recent) from corresponding categories are provided to the agent, which is instructed to explicitly cite applied lessons.
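The lesson pool's store/filter/retrieve cycle can be sketched as below. The exact-text redundancy check stands in for the paper's LLM-based Review Agent, and the category names and capacity are illustrative assumptions:

```python
from collections import deque

class LessonPool:
    """Minimal sketch of the reflective lesson pool: lessons are stored per
    category (e.g. "improvement", "debugging"), near-duplicates are dropped,
    and retrieval returns up to the k most recent lessons of a category."""

    def __init__(self, max_per_category=50):
        self.pools = {}
        self.max_per_category = max_per_category

    def add(self, category, lesson):
        pool = self.pools.setdefault(
            category, deque(maxlen=self.max_per_category))
        if lesson in pool:       # crude redundancy filter (an LLM Review Agent in the paper)
            return False
        pool.append(lesson)      # oldest lesson is evicted once capacity is reached
        return True

    def retrieve(self, category, k=5):
        pool = self.pools.get(category, deque())
        return list(pool)[-k:]   # up to the k most recent lessons
```

Keeping the pool bounded and deduplicated is what keeps the retrieved context "high-signal" rather than an ever-growing log of observations.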
- Budget-Aware Monte Carlo Tree Search (MCTS):
- Actions and Expansion: Three distinct operators transform a parent state into a child solution:
- Drafting (Root Expansion): Generates a completely new solution from scratch.
- Improvement: Applied to valid nodes, modifying modules and the main script to maximize the objective.
- Debugging: Applied to failed nodes, inheriting structure but modifying parts to resolve errors. Buggy children enter an automatic debugging loop with a bounded number of debugging actions.
- Node Selection: The Upper Confidence Bound for Trees (UCT) algorithm is used to balance exploitation and exploration. Traversal selects the child maximizing the UCT value until a non-"fully expanded" node is found.
- The root node is fully expanded if it has no children or if the best solution hasn't improved after implementing a set number of valid nodes.
- Buggy nodes are always fully expanded.
- Valid nodes are fully expanded if they have children (improvement attempts).
- Efficiency-Guided Reward Function: A reward function balances performance gains with execution cost. A raw reward $r(s)$ is derived from the performance metric $f(s)$, normalized so that higher values indicate better solutions. The efficiency-guided reward then incorporates execution time $t(s)$ and the time limit $T$:

$$\hat{r}(s) = r(s) - \lambda \cdot \frac{t(s)}{T}$$

where $\lambda$ is a penalty weight hyperparameter.
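The two mechanisms above, UCT-based child selection and a time-penalized reward, can be sketched together. The concrete formulas and constants here are plausible instantiations consistent with the description, not the paper's exact definitions:

```python
import math

def efficiency_guided_reward(raw_reward, exec_time, time_limit, lam=0.1):
    """Penalize the raw performance reward by normalized execution time.
    `lam` plays the role of the penalty-weight hyperparameter."""
    return raw_reward - lam * (exec_time / time_limit)

def uct_score(value_sum, visits, parent_visits, c=1.414):
    """Standard UCT: mean reward (exploitation) plus an exploration bonus."""
    if visits == 0:
        return float("inf")   # unvisited children are always tried first
    return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children):
    """Pick the index of the child maximizing UCT.
    `children` is a list of (value_sum, visits) pairs."""
    parent_visits = sum(v for _, v in children) or 1
    scores = [uct_score(val, vis, parent_visits) for val, vis in children]
    return scores.index(max(scores))
```

With $\lambda > 0$, two solutions of equal quality are ranked by runtime, which is how the search is steered away from expensive branches under a fixed wall-clock budget.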
Task-Specific Components (for MLE):
For MLE tasks, MARS includes Task Preparation (extracting metadata, formalizing objectives, preparing data), Data Analysis (Exploratory Data Analysis guidance), and Curriculum-Based Exploration (progressively exploring from simple baselines to complex methods).
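Curriculum-Based Exploration can be sketched as ordering candidate methods by complexity before the search tries them; the method names and complexity scores below are purely illustrative:

```python
def curriculum_order(methods):
    """Sketch of Curriculum-Based Exploration: try candidate methods in order
    of increasing complexity, so cheap baselines are established before
    expensive approaches spend the budget."""
    return sorted(methods, key=lambda m: m["complexity"])

# Hypothetical candidate methods with assumed complexity rankings.
methods = [
    {"name": "gradient_boosting", "complexity": 2},
    {"name": "linear_baseline",   "complexity": 1},
    {"name": "deep_ensemble",     "complexity": 3},
]
```

Running the baseline first gives the search an early valid submission and a reference score that later, costlier methods must beat.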
Experiments and Results:
MARS is evaluated on MLE-Bench, a benchmark of 75 Kaggle competitions covering NLP, CV, and tabular data. Experiments are conducted under a strict 24-hour wall-clock time budget per competition on a standard node (NVIDIA A100 GPU, 12 vCPUs, 220 GB RAM). Metrics include Above Median Rate, Any Medal Rate, and Gold Medal Rate.
MARS establishes state-of-the-art performance among open-source frameworks, significantly outperforming AIDE and AIRA-dojo under identical constraints. Using Gemini-3-Pro-Preview, MARS achieves 98.7% valid submissions, 65.8% Above Median Rate, and 56.0% Any Medal Rate (31.1% Gold). A scaled variant, MARS+ (2x A100 GPUs), achieves even higher rates (73.3% Above Median, 59.6% Any Medal), surpassing resource-intensive competitors. MARS consistently outperforms baselines across Lite, Medium, and High task complexities.
Ablation studies demonstrate the significant contributions of both Modular Decomposition and Lesson Learning. Comparisons of tree search strategies show that Budget-Aware MCTS (with a positive time-cost penalty weight) consistently yields superior performance over greedy search or vanilla MCTS (penalty weight zero), effectively balancing exploration with resource constraints. Qualitative analysis shows modular decomposition leads to more extensive and structured codebases (more lines of code, more files) in the best solutions, with diverse modules tailored to specific sub-tasks.