
Agent Laboratory: Using LLMs as Research Assistants
Key Points
- Agent Laboratory is an LLM-agent-driven system designed to assist human researchers by automating the entire research workflow, including literature review, experimentation, and report writing.
- The system features an `mle-solver` for iterative ML code improvement and a `paper-solver` for generating academic reports, with evaluations showing `mle-solver`'s strong performance on ML benchmarks.
- While `gpt-4o` proved most efficient and cost-effective, and co-pilot mode improved output quality, the system's autonomously generated research reports still fell below the standards of top-tier academic publications like NeurIPS.
Agent Laboratory is a system designed to help human researchers implement their research ideas: it takes a human-produced research idea as input and outputs a research report and code repository. It functions as a structured framework adaptable to varying computational resources, from MacBooks to GPU clusters. The system comprises specialized large language model (LLM) driven agents that facilitate the entire research workflow, encompassing literature reviews, plan formulation, experiment execution, and report writing, aiming to complement human creativity by automating repetitive and time-intensive tasks.
The system operates in three primary phases: (1) Literature Review, (2) Experimentation, and (3) Report Writing. During each phase, specialized LLM agents collaborate, integrating external tools such as arXiv, Hugging Face, Python, and LaTeX. This workflow begins with independent collection and analysis of research papers, progresses through collaborative planning and data preparation, and culminates in automated experimentation and comprehensive report generation.
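The three-phase workflow described above can be sketched as a simple state-passing pipeline. This is a minimal illustration, not Agent Laboratory's actual API: the class and function names (`ResearchState`, `literature_review`, and so on) are hypothetical, and each phase body is a placeholder for the agent collaboration the text describes.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    """State threaded through the three phases (illustrative)."""
    idea: str
    literature: list = field(default_factory=list)
    plan: str = ""
    results: dict = field(default_factory=dict)
    report: str = ""

def literature_review(state: ResearchState) -> ResearchState:
    # Phase 1: agents independently collect and summarize papers (e.g. via arXiv).
    state.literature.append(f"summaries of papers related to: {state.idea}")
    return state

def experimentation(state: ResearchState) -> ResearchState:
    # Phase 2: collaborative plan formulation, data preparation, experiment runs.
    state.plan = f"plan derived from {len(state.literature)} literature summaries"
    state.results = {"primary_metric": 0.0}  # placeholder for experimental output
    return state

def report_writing(state: ResearchState) -> ResearchState:
    # Phase 3: compile plan, results, and insights into a report (LaTeX in practice).
    state.report = f"report for '{state.idea}' with results {state.results}"
    return state

def run_workflow(idea: str) -> ResearchState:
    state = ResearchState(idea=idea)
    for phase in (literature_review, experimentation, report_writing):
        state = phase(state)
    return state
```

Each phase consumes the state produced by the previous one, mirroring how the system's literature review feeds planning, which in turn feeds experimentation and report generation.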
The core methodology for solving ML problems within Agent Laboratory is addressed by the mle-solver module. This general-purpose ML code solver takes research directions, provided as text from preceding phases, and iteratively refines research code. Its operation involves:
- Iterative Improvement: `mle-solver` maintains a collection of "top programs" representing the current best solutions.
- Conditioning: These top programs are iteratively conditioned on inputs such as task instructions, command descriptions, and distilled knowledge.
- Scoring Function: Improvements are guided by a scoring function that evaluates experimental results.
- Code Modification: Changes to the code are generated via two distinct commands: REPLACE, which rewrites the existing code in its entirety, and EDIT, which modifies specific lines within it.
- Feedback Loop: Successfully compiled code updates the collection of top programs based on their improved scores. In cases of compilation errors, the system attempts up to three repairs before generating new code. At each step, the agent employs a reflective mechanism to continuously refine outcomes.
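The loop above can be captured in a short sketch. This is a schematic of the described mechanism only, with hypothetical stand-ins (`propose_edit`, `compile_and_score`) for the LLM-driven REPLACE/EDIT commands and the experimental scoring function; the constants and scoring logic are illustrative, not taken from the paper.

```python
import random

MAX_REPAIRS = 3  # the paper describes up to three repair attempts
TOP_K = 2        # illustrative size for the "top programs" pool

def propose_edit(program: str) -> str:
    # Stand-in for the LLM's two modification commands:
    # REPLACE rewrites the whole program; EDIT changes specific lines.
    if random.choice(["REPLACE", "EDIT"]) == "REPLACE":
        return program + "  # fully rewritten"
    return program + "  # lines edited"

def compile_and_score(program: str):
    # Stand-in scoring function over experimental results;
    # returns None to signal a compilation failure.
    if "broken" in program:
        return None
    return len(program) % 10  # arbitrary illustrative score

def solver_step(top_programs: list) -> list:
    # Condition on a current top program and propose a modification.
    candidate = propose_edit(random.choice(top_programs))
    score = compile_and_score(candidate)
    repairs = 0
    while score is None and repairs < MAX_REPAIRS:
        candidate = candidate.replace("broken", "fixed")  # repair attempt
        score = compile_and_score(candidate)
        repairs += 1
    if score is not None:
        # Successfully compiled code joins the pool; keep only the top K.
        top_programs.append(candidate)
        top_programs.sort(key=compile_and_score, reverse=True)
        del top_programs[TOP_K:]
    return top_programs
```

Running `solver_step` repeatedly imitates the iterate-score-repair cycle: candidates that fail to compile get a bounded number of repairs, and only scored successes displace entries in the top-programs pool.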
The mle-solver's effectiveness was evaluated in isolation on 10 ML challenges from MLE-bench, a benchmark for real-world ML tasks on Kaggle. It demonstrated superior consistency and performance compared to other solvers: mle-solver obtained four medals (two gold, one silver, one bronze), outperforming OpenHands (gpt-4o) with two gold medals, AIDE (o1-preview) with one gold and one bronze, and MLAB with zero medals. Furthermore, mle-solver achieved above median human performance on 6 out of 10 benchmarks, surpassing AIDE (5/10), OpenHands (2/10), and MLAB (0/10).
For generating research reports, Agent Laboratory introduces the paper-solver. This module functions as a results and code-to-report generator, synthesizing outputs and findings from experimental phases into a human-readable academic paper. It takes as input the research plan, experimental results, derived insights, and the literature review, producing outputs in a standard academic paper format suitable for conference submissions.
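A results-and-code-to-report generator of this shape can be sketched as follows. The function name and section layout are hypothetical, assumed only for illustration; the real paper-solver is LLM-driven, whereas this sketch merely shows the input-to-skeleton mapping the text describes.

```python
def paper_solver(plan: str, results: dict, insights: list, literature_review: str) -> str:
    """Assemble experimental inputs into a LaTeX-style paper skeleton (illustrative)."""
    sections = {
        "Abstract": insights[:1],
        "Related Work": [literature_review],
        "Methods": [plan],
        "Results": [f"{name}: {value}" for name, value in results.items()],
        "Discussion": insights,
    }
    lines = []
    for title, body in sections.items():
        lines.append(f"\\section{{{title}}}")
        lines.extend(body)
    return "\n".join(lines)
```

Note how each input named in the text (plan, results, insights, literature review) maps onto a standard section of the conference-paper format.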
The human-perceived quality of research outputs was assessed for three LLM backends: gpt-4o, o1-mini, and o1-preview. Fifteen papers, generated autonomously by Agent Laboratory based on five research questions, were reviewed by 10 volunteer PhD students who rated them on a scale of 1 to 5 across experimental quality, report quality, and perceived usefulness. O1-preview achieved the highest perceived usefulness (4.4/5) and report quality (3.4/5), with slightly lower experimental quality (2.9/5). O1-mini showed the highest experimental quality (3.2/5). Gpt-4o scored lowest overall, especially in experimental quality (2.6/5), though maintaining a strong usefulness rating (4.0/5). Quality varied by research topic, with "word order" receiving the highest report quality (3.8/5) and usefulness (4.5/5), but lowest experimental quality (2.7/5). "Cognitive bias" achieved the highest experimental quality (3.2/5).
Human reviewers also evaluated papers using NeurIPS-style criteria, scoring quality, significance, clarity, soundness, presentation, and contribution out of 10. O1-preview attained the highest average overall score (4.0/10), followed by o1-mini (3.8/10) and gpt-4o (3.5/10). O1-mini excelled in quality (2.3/4) and o1-preview in soundness (2.2/4). All models showed modest performance in significance (2.2–2.5/4) and contribution (average 2.1/4), indicating limitations in originality and impact. Clarity scores varied, with gpt-4o rated highest (2.6/4). All models' scores were significantly below the 5.9 average for accepted NeurIPS papers, highlighting gaps in rigor.
In co-pilot (human-guided) mode, Agent Laboratory showed improved performance. Researchers rated the tool highly on utility (3.5/5), continuation likelihood (3.75/5), satisfaction (3.63/5), and usability (4.0/5). Custom topics generally received higher scores across these metrics compared to preselected topics. Paper quality in co-pilot mode improved over autonomous mode, with an average overall score increase from 3.8/10 to 4.38/10 (+0.58). Gains were observed in quality (+0.75), clarity (+0.23), soundness (+0.48), and presentation (+0.33), with minimal change in significance (-0.05) and contribution (+0.03).
Runtime and cost analysis revealed gpt-4o as the most computationally efficient and cost-effective backend, completing the entire workflow in 1165.4 seconds. The o1-class backends were slower and more expensive, with o1-mini costing $7.51 and o1-preview requiring 6201.3 seconds ($9.58 for this task alone).
Agent Laboratory contributes to the growing field of autonomous research systems, building upon prior works such as The Virtual Lab, ChemCrow, Coscientist, ResearchAgent, and The AI Scientist, by providing a structured framework for LLM agents to assist in the full research workflow.