Measuring AI Ability to Complete Long Tasks

Thomas Kwa
2025.04.20
arXiv
#AI · #Task Completion · #Benchmark · #AI Capabilities · #AI Safety

Key Points

  1. This paper proposes the "50%-task-completion time horizon" as a new metric, quantifying AI capability by the duration of tasks (as typically completed by humans) that AI models can finish with 50% success.
  2. Across 170 diverse tasks, the study finds that this AI time horizon has been doubling approximately every seven months since 2019, driven primarily by gains in reliability, reasoning, and tool use.
  3. Extrapolating this exponential trend, the research suggests that within about five years, AI systems could automate many software tasks that currently take human professionals a month to complete.

This paper, "Measuring AI Ability to Complete Long Tasks," by Kwa et al. from Model Evaluation & Threat Research (METR), addresses the critical challenge of quantifying AI system capabilities beyond traditional benchmarks, which often suffer from artificiality, adversarial selection, and rapid saturation. The authors propose a novel, intuitive metric: the task completion time horizon, specifically focusing on the 50%-task-completion time horizon. This metric represents the typical time humans take to complete tasks that AI models can successfully accomplish with a 50% probability.

The core methodology involves a three-step process:

  1. Task Suite Creation and Curation: A diverse suite of 170 tasks was compiled, designed to capture skills relevant to research and software engineering.
    • HCAST [8]: 97 diverse software tasks (1 minute to ~30 hours), covering cybersecurity, machine learning, and software engineering. These tasks are realistic, solvable by professionals, and primarily text-based, allowing for automatic scoring (0 to 1, with defined success thresholds for continuous scores).
    • RE-Bench [2]: 7 challenging ML research engineering tasks, each estimated to take approximately 8 hours for a human expert.
    • Software Atomic Actions (SWAA): 66 novel, single-step tasks (1 second to 30 seconds), representing atomic actions in software development (e.g., file selection, code completion, math). These tasks were developed blind to AI performance and are grouped into 5 task families. SWAA was introduced to provide finer resolution for shorter task measurements and to assess pre-2023 models.
Tasks are grouped into "task families" (e.g., "crossword") to ensure diversity by down-weighting families with many tasks.
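The down-weighting of large task families can be sketched in a few lines of Python. This is a minimal illustration consistent with the inverse-square-root weighting the paper describes; the function and variable names are my own, not the paper's code:

```python
from collections import Counter

def family_weights(task_families):
    """Weight each task by 1/sqrt(n_family), so a family with many
    tasks (e.g. "crossword") does not dominate aggregate success rates."""
    counts = Counter(task_families)
    return [counts[f] ** -0.5 for f in task_families]

# Three tasks in one family, one in another: the singleton keeps full weight.
w = family_weights(["crossword", "crossword", "crossword", "file_io"])
```

Each "crossword" task gets weight 3^(-1/2) ≈ 0.577, while the lone "file_io" task keeps weight 1.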

  2. Human Baselining: To establish task difficulty and length, skilled human professionals (average 5 years of relevant experience, many from top universities) performed most tasks.
    • HCAST: 286 successful baselines from ~460 attempts, totaling 2,529 hours across all baselines. Humans worked in the Vivaria environment, with screens/audio recorded to prevent cheating. Task durations were derived from the geometric mean of successful baselines; manual estimates were used for tasks lacking successful baselines.
    • RE-Bench: Task duration fixed at 8 hours, based on the paper's intent for human experts.
    • SWAA: Baselines collected by METR employees using a custom webapp, with precise timing for single-step actions. Each decision-based task was baselined 4 times, and fill-in-the-blank tasks 3 times.
This process yielded human time-to-complete estimates for 148 of the 170 tasks, serving as a proxy for task "length" or "difficulty."
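The geometric-mean aggregation of successful baselines is straightforward to express; a minimal sketch (function name and example durations are my own):

```python
import math

def geometric_mean_minutes(times):
    """Geometric mean of successful human baseline durations.
    Less sensitive than the arithmetic mean to the heavy right
    tail typical of task-completion times."""
    return math.exp(sum(math.log(t) for t in times) / len(times))

# Three successful baselines of 10, 20, and 40 minutes:
geometric_mean_minutes([10, 20, 40])  # (10*20*40)**(1/3) = 20.0
```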

  3. AI Agent Evaluation and Time Horizon Calculation:
    • AI Agent Evaluation: 13 frontier AI models from 2019 (e.g., GPT-2, GPT-3) to 2025 (e.g., Claude 3.7 Sonnet, o1) were evaluated. Most models used the "modular-public" agent scaffold, which provides Python and Bash commands with context management. Some models (o1-preview, o1) used a slightly adapted scaffold because of tool-use and agentic struggles. Each agent/task pair was typically run 8 times to obtain an average success rate. A strong negative correlation was observed between human time-to-complete and AI success rate (R² ≈ 0.83 for an exponential fit of success rate against the logarithm of human time). Early models failed nearly all tasks taking over 1 minute, while recent models completed some tasks exceeding 4 hours of human time.
    • Time Horizon Calculation: Drawing inspiration from psychometric studies and Item Response Theory (IRT), the paper fits a logistic model to relate the human-estimated task duration to the AI agent's success rate. The 50% time horizon is then derived as the task duration at which the model achieves a 50% success probability according to this fitted logistic curve. The paper notes that success rates are weighted by the inverse square root of tasks in a family to reduce bias from large task families.
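The logistic fit and horizon extraction described in step 3 can be sketched as follows. This is a simplified stand-in for the paper's IRT-style procedure: it fits success ≈ sigmoid(a − b·log₂ t) by least squares on the logit scale rather than by maximum likelihood, and all names are my own:

```python
import numpy as np

def fifty_percent_horizon(human_minutes, success_rates, weights=None):
    """Fit p(success) = sigmoid(a - b*log2(t)) on the logit scale,
    then return the task duration t at which predicted success is 50%."""
    x = np.log2(np.asarray(human_minutes, dtype=float))
    # clip so the logit stays finite for 0% / 100% success rates
    p = np.clip(np.asarray(success_rates, dtype=float), 0.01, 0.99)
    y = np.log(p / (1 - p))
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
    sw = np.sqrt(w)                               # weighted least squares
    X = np.column_stack([np.ones_like(x), -x])    # model: y = a - b*x
    (a, b), *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return 2.0 ** (a / b)                         # solve a - b*log2(t) = 0
```

The optional `weights` argument is where the inverse-square-root family weights would enter.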

Key Findings:
The 50% time horizon for frontier AI models has grown exponentially from 2019 to 2025, demonstrating a doubling time of approximately seven months. The trend may have accelerated in 2024. This progress is attributed to improved logical reasoning, better tool use capabilities, and enhanced reliability and self-awareness in task execution. The 80% time horizon shows a similar trend but is roughly 5x shorter. Current systems show limitations, performing worse on less structured, "messier" tasks.
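The doubling time itself falls out of a log-linear regression of horizon against release date. A self-contained sketch (illustrative only; the paper's exact regression details may differ):

```python
import math

def doubling_time_months(years, horizons_minutes):
    """Least-squares slope of log2(horizon) vs. release date (in years)
    gives doublings per year; invert to get months per doubling."""
    n = len(years)
    xbar = sum(years) / n
    ybar = sum(math.log2(h) for h in horizons_minutes) / n
    num = sum((x - xbar) * (math.log2(h) - ybar)
              for x, h in zip(years, horizons_minutes))
    den = sum((x - xbar) ** 2 for x in years)
    return 12 * den / num  # months per doubling
```

On synthetic horizons that exactly double every 7 months, this recovers 7.0.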

External Validity Experiments:
To address concerns about the generalizability of these findings to real-world tasks, three supplementary experiments were conducted:

  1. Messiness Factors: HCAST and RE-Bench tasks were scored against 16 "messiness" factors (e.g., resource-limited, novel, dynamic environment). While models performed worse on tasks with higher messiness scores, the trend in AI agent performance over time was consistent across both lower and higher messiness subsets, with no evidence of plateaus.
  2. SWE-bench Verified [7]: Replicating the methodology on SWE-bench Verified, which includes human difficulty annotations, confirmed the exponential trend with an even shorter doubling time. This may be because the maintainer-centric difficulty annotations underestimate the time contractors would need for the easier SWE-bench tasks.
  3. Internal Pull Requests (PRs): Evaluation of AI agent performance on internal PRs revealed significant differences in human completion times (contractors taking 5-18x longer than repo maintainers). When contractor time was used as the task length measure, the AI time horizons derived from the combined SWAA, HCAST, and RE-Bench data were compatible with AI agent performance on internal PRs.

The supplementary experiments provided little evidence that performance trends are meaningfully slower on the more realistic tasks tested, though they highlighted that AI agent time horizons can vary significantly with task domain and the reference human population.

Implications:
Naively extrapolating the observed trend suggests that AI systems could achieve a time horizon of over one month (167 work hours) between late 2028 and early 2031. However, this extrapolation is subject to external validity concerns and potential changes in future growth rates. The paper discusses various factors that could either accelerate or decelerate this trend.
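The naive extrapolation is simple compound-growth arithmetic; a sketch under assumed inputs (the starting horizon and date below are illustrative placeholders, not the paper's fitted values):

```python
import math

def crossing_year(h0_minutes, year0, doubling_months, target_minutes):
    """Year at which an exponentially growing time horizon, starting at
    h0_minutes in year0 and doubling every doubling_months, reaches target."""
    doublings = math.log2(target_minutes / h0_minutes)
    return year0 + doublings * doubling_months / 12.0

# Hypothetical ~1-hour horizon in early 2025, doubling every 7 months,
# reaching one work-month (167 h = 10,020 min):
crossing_year(60, 2025.0, 7, 167 * 60)  # ≈ 2029.3
```

A longer doubling time or later starting point pushes the crossing toward the later end of the paper's late-2028-to-early-2031 window, which is why the authors stress the sensitivity of this extrapolation.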