Measuring AI agent autonomy in practice
Key Points
- Analyzing millions of human-agent interactions, the study finds that AI agent autonomy is increasing: Claude Code works autonomously for longer stretches, and experienced users shift from approving each action to monitoring and interrupting when needed.
- Claude Code pauses for clarification more often than humans interrupt it, and while agents are emerging in higher-stakes domains like healthcare and finance, most deployed actions are currently low-risk, reversible, and subject to human oversight.
- The paper concludes that effective agent oversight requires robust post-deployment monitoring and human-AI interaction paradigms that jointly manage autonomy and risk, rather than mandated interaction patterns such as step-by-step approval.
This paper, titled "Measuring AI agent autonomy in practice," empirically investigates the real-world usage of AI agents, focusing on the autonomy users grant them, how this changes with experience, the domains agents operate in, and the associated risks. Conducted by Anthropic researchers using their own infrastructure and data from late 2025 through early 2026, the study addresses the critical need to understand practical agent deployment for safe and responsible AI.
The core methodology employs a dual-pronged approach, leveraging two distinct data sources to provide both breadth and depth. The researchers define an agent as an AI system equipped with tools for taking actions (e.g., running code, calling APIs).
- Public API Traffic: This source provides broad visibility into agentic deployments across thousands of diverse customers. The analysis is performed at the level of individual tool calls, defined as specific actions taken by an agent through its tools. This approach allows for grounded, consistent observations of agent actions regardless of the underlying agent architecture, which is often opaque to the model provider. While offering breadth, this method limits the ability to reconstruct full, sequential agent sessions or understand how individual actions compose into longer behaviors. To assess risk and autonomy for these tool calls, the researchers utilized Claude itself to estimate a risk score (1-10, where 10 is high risk of substantial harm) and an autonomy score (1-10, where 10 is high independence) for each individual tool call based on its context. These Claude-generated classifications were validated against internal data where possible, with a designated opt-out category for non-inferable cases. The study also categorized tool calls by domain.
- Claude Code Internal Product Data: As Claude Code is Anthropic's own product, the researchers have full visibility into complete agent workflows and user sessions. This allows for in-depth study of autonomy metrics, such as turn duration (time elapsed between Claude starting work and stopping), human interruption rates, and agent-initiated clarification rates. It also enables tracking user behavior changes with experience (e.g., auto-approval rates, interruption rates over account tenure). However, this source provides insight into only a single product, predominantly used for software engineering, thus limiting generalizability across diverse domains.
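The per-tool-call classification step described above can be sketched as a small validation layer over the classifier model's output. This is a minimal illustration: the JSON schema, field names, and opt-out convention here are assumptions for the example, not the paper's actual pipeline.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical record for one classified tool call; field names are
# illustrative, not the study's real schema.
@dataclass
class ToolCallScore:
    domain: str
    risk: Optional[int]      # 1-10, where 10 = high risk of substantial harm
    autonomy: Optional[int]  # 1-10, where 10 = high independence
    opted_out: bool          # True when scores cannot be inferred from context

def parse_judge_output(raw: str) -> ToolCallScore:
    """Validate a classifier model's JSON verdict for a single tool call."""
    obj = json.loads(raw)
    if obj.get("opt_out"):
        # Designated opt-out category for non-inferable cases.
        return ToolCallScore(domain=obj.get("domain", "unknown"),
                             risk=None, autonomy=None, opted_out=True)
    risk, autonomy = int(obj["risk"]), int(obj["autonomy"])
    for name, score in (("risk", risk), ("autonomy", autonomy)):
        if not 1 <= score <= 10:
            raise ValueError(f"{name} score {score} outside the 1-10 scale")
    return ToolCallScore(domain=obj.get("domain", "unknown"),
                         risk=risk, autonomy=autonomy, opted_out=False)

score = parse_judge_output('{"domain": "finance", "risk": 7, "autonomy": 9}')
print(score.risk, score.autonomy)  # 7 9
```

Validating scale bounds and routing non-inferable cases to an explicit opt-out value keeps downstream aggregates (domain breakdowns, risk-autonomy distributions) from silently absorbing malformed verdicts.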
By combining these sources, the study addresses questions unanswerable by either alone. For example, the duration of Claude Code's autonomous work is measured directly by tracking time between turns. User experience is quantified by the number of sessions completed.
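The session-level metrics above (turn duration as time between Claude starting and stopping work, and human interruption rate) can be sketched from an event log. The event names and timestamps below are made-up assumptions, not Claude Code's real telemetry.

```python
from datetime import datetime

# Illustrative session event log; event types are assumptions for this sketch.
events = [
    {"t": "2026-01-10T09:00:00", "type": "agent_start"},
    {"t": "2026-01-10T09:12:30", "type": "agent_stop"},       # agent finished its turn
    {"t": "2026-01-10T09:15:00", "type": "agent_start"},
    {"t": "2026-01-10T09:18:00", "type": "human_interrupt"},  # user cut the turn short
]

def turn_durations_and_interrupts(events):
    """Pair each agent_start with the next stop or interrupt; return
    (turn durations in seconds, fraction of turns ended by interruption)."""
    durations, interrupted = [], 0
    start = None
    for e in events:
        ts = datetime.fromisoformat(e["t"])
        if e["type"] == "agent_start":
            start = ts
        elif start is not None:
            durations.append((ts - start).total_seconds())
            if e["type"] == "human_interrupt":
                interrupted += 1
            start = None
    return durations, interrupted / max(len(durations), 1)

durations, interrupt_rate = turn_durations_and_interrupts(events)
print(durations, interrupt_rate)  # [750.0, 180.0] 0.5
```

Tracking these per turn rather than per session is what lets interruption rates be compared across users with very different session lengths.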
The key findings are:
- Increased Autonomy: For the longest-running Claude Code sessions, the 99.9th percentile turn duration nearly doubled, from under 25 minutes to over 45 minutes, within three months (October 2025 to January 2026). The increase was smooth rather than tied to sharp jumps at model releases, suggesting that existing models have a deployment overhang: they are capable of more autonomy than is typically exercised in practice.
- Evolving User Oversight: Experienced Claude Code users (more than 750 sessions) auto-approve actions in over 40% of sessions, compared with roughly 20% for new users (fewer than 50 sessions). Counterintuitively, experienced users also interrupt Claude more often (9% of turns versus 5%). This indicates a strategic shift from constant manual approval to active monitoring and intervention when necessary. Similar patterns were observed on the public API, where human involvement decreased for more complex tasks, suggesting step-by-step approval becomes less practical.
- Agent-Initiated Oversight: Claude Code proactively pauses for clarification more frequently than humans interrupt it, especially on complex tasks (more than twice as often). This highlights agent-initiated stops as a crucial form of internal oversight, where models self-limit autonomy by recognizing and surfacing their uncertainty.
- Risky Domain Emergence: While software engineering accounts for nearly 50% of public API tool calls, the study observed emerging usage in higher-stakes domains like healthcare, finance, and cybersecurity. However, most agent actions remain low-risk and reversible, with 73% appearing to have human involvement and only 0.8% being irreversible actions. The upper-right quadrant of the risk-autonomy distribution (high risk, high autonomy) is sparsely populated but not empty, indicating novel frontier uses, though some high-risk activities might be simulations or evaluations.
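The risk-autonomy quadrant breakdown in the last finding can be sketched by bucketing the 1-10 scores from the classification step. The sample scores and the cut point at 5 below are assumptions for illustration, not values from the study.

```python
from collections import Counter

def quadrant(risk, autonomy, threshold=5):
    """Split the 1-10 x 1-10 score grid at `threshold` (an assumed cut point)."""
    return ("high" if risk > threshold else "low",
            "high" if autonomy > threshold else "low")

# Made-up (risk, autonomy) score pairs for a handful of tool calls.
sample = [(2, 3), (1, 8), (6, 2), (9, 9), (3, 4), (2, 7)]
counts = Counter(quadrant(r, a) for r, a in sample)

# The ("high", "high") cell corresponds to the sparsely populated
# upper-right quadrant the study flags as frontier use.
print(counts[("high", "high")], "of", len(sample), "calls are high-risk, high-autonomy")
```

In the paper's framing, monitoring is about watching this upper-right cell: it is nearly empty today, so growth there is an early signal of novel high-stakes deployment.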
The paper concludes that effective oversight requires novel post-deployment monitoring infrastructure and human-AI interaction paradigms that allow humans and AI to co-manage autonomy and risk. It recommends that model and product developers invest in post-deployment monitoring (including methods to link API requests into coherent sessions), train models to recognize their own uncertainty, and design user interfaces for effective monitoring and intervention rather than mandating specific interaction patterns like approving every action. The research acknowledges limitations, including reliance on Anthropic's data, the scope of public API analysis (individual tool calls vs. full sessions), Claude-generated classifications, and the dynamic nature of the agent landscape.