
Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?
Key Points
- This paper rigorously investigates the effectiveness of repository-level context files, such as AGENTS.md, for coding agents on both established benchmarks and a novel dataset with developer-provided files.
- Surprisingly, the study reveals that LLM-generated context files tend to decrease agent task success rates and increase inference costs by over 20%, while human-written files yield only marginal performance gains.
- Trace analysis indicates that context files encourage broader exploration and testing, leading to the conclusion that unnecessary instructions make tasks harder, and context files should contain only minimal requirements.
This paper rigorously investigates the effectiveness of repository-level context files, such as AGENTS.md, for coding agents, a practice widely encouraged by agent developers. The authors evaluate coding agents' task completion performance in two settings: established SWE-BENCH LITE tasks from popular repositories using LLM-generated context files, and a novel benchmark, AGENTBENCH, derived from real-world issues in less popular repositories that already contain developer-committed context files.
The core methodology revolves around a comparative evaluation across three context file settings:
- NONE: No context files are provided to the agent.
- LLM: A context file is automatically generated by an LLM (using the agent's recommended initialization command and model) from the pre-patch repository state. This setting is applied to both SWE-BENCH LITE and AGENTBENCH.
- HUMAN: A developer-provided context file, present in the repository in its pre-patch state, is used. This setting is only applicable to AGENTBENCH instances, as SWE-BENCH LITE repositories do not contain such files.
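The three settings can be pictured as an evaluation grid over the two benchmarks. A minimal sketch, with illustrative names not taken from the paper's code (the HUMAN setting only exists where repositories ship their own context files):

```python
# Hypothetical sketch of the study's evaluation grid: which context-file
# settings apply to which benchmark. Names are illustrative.
SETTINGS = {
    "SWE-BENCH LITE": ["NONE", "LLM"],        # no developer-written context files exist
    "AGENTBENCH": ["NONE", "LLM", "HUMAN"],   # repos ship their own AGENTS.md/CLAUDE.md
}

def evaluation_conditions():
    """Yield each (benchmark, setting) pair evaluated in the study."""
    for benchmark, settings in SETTINGS.items():
        for setting in settings:
            yield benchmark, setting

conditions = list(evaluation_conditions())
```

This makes the asymmetry explicit: five conditions in total, with the HUMAN setting absent from SWE-BENCH LITE.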
To facilitate this evaluation, the authors construct AGENTBENCH, a new benchmark comprising 138 unique Python software engineering tasks, created from real GitHub issues across 12 recent and niche repositories, each featuring developer-written context files. The generation of AGENTBENCH instances follows a five-stage process:
- Finding Repositories: GitHub search is used to identify repositories containing AGENTS.md or CLAUDE.md at their root, filtering for Python projects with test suites and at least 400 pull requests (PRs) to ensure sufficient data for instance extraction. This process yielded 12 candidate repositories.
- Filtering Pull Requests: PRs are filtered using a combination of rule-based checks and an LLM agent (specifically, GPT-5.2 with CODEX). Only PRs that reference at least one issue, modify at least one Python file, and are assessed by the agent to introduce deterministic, testable behaviors are kept. Unlike SWE-BENCH LITE, AGENTBENCH does not require PRs to contain unit tests, accommodating the less strict practices of niche repositories.
- Environment Set-Up: For each selected PR and its corresponding repository state, an LLM agent (GPT-5.2 with CODEX) is tasked with producing a script that sets up the execution environment, runs the repository's test suite, and stores the results as a machine-readable dictionary. Instances are retained only if the resulting dictionary indicates at least one passing test.
- Task Descriptions: A third LLM agent generates a standardized and detailed task description for each instance. This description is based on the PR description, associated issues, and the original golden patch. The descriptions are structured into six sections: description, steps to reproduce, expected behavior, observed behavior, specification, and additional information, ensuring precise specifications without leaking the solution.
- Generating Unit Tests: Since most collected PRs do not modify or add unit tests, an LLM agent generates specific unit tests that pass for any implementation resolving the described task. These generated tests are verified to fail on the base repository and pass on the patched repository. The final test set for each instance combines these generated tests with a maximal set of existing repository tests that pass on the patched repository. The success rate for an instance is defined as the percentage of predicted patches for which the full final test set passes.
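The fail-to-pass verification and the per-instance success rate described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code; `run_tests` is a hypothetical helper that executes a repository state's test suite and returns a pass/fail map:

```python
# Illustrative sketch of the fail-to-pass check and success-rate metric.
# run_tests is a hypothetical helper: repo state -> {test_id: passed}.
from typing import Callable, Dict, List

TestResults = Dict[str, bool]

def verify_generated_tests(run_tests: Callable[[str], TestResults],
                           base_repo: str, patched_repo: str) -> bool:
    """Keep the generated tests only if every one fails on the base
    repository and every one passes once the golden patch is applied."""
    base = run_tests(base_repo)
    patched = run_tests(patched_repo)
    return all(not passed for passed in base.values()) and all(patched.values())

def success_rate(per_patch_results: List[TestResults]) -> float:
    """Percentage of predicted patches whose full final test set passes."""
    if not per_patch_results:
        return 0.0
    solved = sum(all(results.values()) for results in per_patch_results)
    return 100.0 * solved / len(per_patch_results)
```

A patch counts as solving the instance only if every test in the combined set passes, which is why a single failing generated test marks the patch as unsuccessful.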
The evaluation utilizes four agent-LLM pairings: CLAUDE CODE with SONNET-4.5, CODEX with GPT-5.2 and with GPT-5.1 MINI, and QWEN CODE with QWEN3-30B-CODER. Performance is measured by success rate, the number of steps (agent-environment interactions), and the monetary cost of LLM inference.
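Aggregating the three reported metrics over a set of runs is straightforward; a minimal sketch with an assumed record structure (not the paper's actual data format):

```python
# Minimal sketch of aggregating the study's three metrics. The RunRecord
# fields are assumptions about what each evaluation run would record.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RunRecord:
    solved: bool      # did the final test set pass for the predicted patch?
    steps: int        # number of agent-environment interactions
    cost_usd: float   # monetary cost of LLM inference for this run

def summarize(runs: List[RunRecord]) -> Dict[str, float]:
    """Compute success rate (%), average steps, and average cost."""
    n = len(runs)
    return {
        "success_rate_pct": 100.0 * sum(r.solved for r in runs) / n,
        "avg_steps": sum(r.steps for r in runs) / n,
        "avg_cost_usd": sum(r.cost_usd for r in runs) / n,
    }
```

Comparing these summaries between the NONE, LLM, and HUMAN settings is exactly the comparison the study's headline numbers report.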
The study's surprising findings indicate that:
- LLM-generated context files generally reduce task success rates (by 0.5% on SWE-BENCH LITE and 2% on AGENTBENCH) while increasing inference cost by over 20% on average across models and prompts. They also increase the average number of steps required to complete a task.
- Developer-provided context files (HUMAN) marginally improve performance compared to providing no context files (an increase of 4% on average) but also increase average steps and costs by up to 19%.
- Context file overviews are not effective: the presence of context files, whether LLM-generated or human-provided, does not meaningfully reduce the number of steps an agent needs before it first interacts with the files relevant to the task.
- Context files as redundant documentation: When documentation-related files are removed from the codebase, LLM-generated context files tend to outperform developer-provided ones, suggesting LLM-generated files are largely redundant with existing documentation, while developer-written ones provide additional, potentially disruptive, information. This implies that LLM-generated context files can be beneficial where existing documentation is scarce.
- Behavioral changes: Trace analysis reveals that context files encourage broader exploration (e.g., more thorough testing, file traversal) and increase the use of repository-specific tooling. Coding agents generally respect instructions within context files (e.g., specific tools mentioned are used significantly more).
- Increased reasoning tokens: The presence of context files leads to an increase in reasoning tokens used by agents like GPT-5.2 and GPT-5.1 MINI, suggesting that the additional instructions make tasks conceptually harder for the agents.
The authors conclude that unnecessary requirements from context files make tasks harder, contradicting current agent-developer recommendations. They suggest that human-written context files should only describe minimal, essential requirements. The evaluation framework presented aims to aid in improving the helpfulness of LLM-generated context files.