GitHub - microsoft/Webwright: A simple SWE style browser agent framework that achieves SOTA results on long horizon web tasks.
Key Points
- Webwright provides LLMs with a terminal to launch and inspect browser sessions, enabling them to generate re-runnable Python scripts by treating the browser as a disposable environment.
- This "code-as-action" paradigm prioritizes the local workspace as state, allowing for robust, reusable, and efficient composition of complex web workflows without relying on hidden orchestration.
- Webwright achieves state-of-the-art performance on benchmarks like Online-Mind2Web and Odysseys, significantly outperforming prior methods by treating the browser as an environment for code execution.
Webwright is a novel framework that redefines how Large Language Models (LLMs) interact with web environments, treating the browser as a disposable environment rather than a stateful workspace. Its core innovation lies in separating the agent from the browser, allowing LLMs to operate within a terminal where they launch multiple browser sessions, inspect pages, and complete web tasks by generating and debugging Python code.
Core Methodology and Paradigm:
Unlike traditional web agents that treat the browser session as the primary workspace and predict single, discrete actions within a predefined interaction loop, Webwright adopts a "workspace-as-state" paradigm. The persistent artifact is not the transient browser session, but the code and logs stored in the local workspace. The agent's browsing history is encapsulated within a re-runnable Python script.
The core methodology revolves around "code-as-action":
- Code Generation: The LLM agent generates complete, end-to-end Playwright Python scripts to accomplish a web task. This contrasts with systems where LLMs predict low-level actions (e.g., clicks, types, DOM selectors) or call predefined tools.
- Execution: The generated Python script is executed. This script can include complex logic, loops, functions, and abstractions to handle dynamic web behaviors (e.g., lazy loading, re-rendering, conditional waiting).
- Inspection and Repair: After execution, the agent inspects the outcomes, primarily through captured screenshots and execution logs. If the task is not completed or an error occurs, the agent identifies the issue and repairs the Python script. This mirrors a human engineer iterating on a robotic process automation (RPA) script, which leaves room for exploratory scripting.
- Flat Loop: The interaction loop is simplified to a single flat cycle: generate a script, execute it, inspect the results, and repair as needed. This design aims for readability, ease of debugging, and flexibility.
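The generate, execute, inspect, and repair cycle described in the bullets above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not Webwright's actual implementation: `flat_loop`, `run_script`, the `ScriptWriter` callback, and the `attempt_N.py` / `attempt_N.log` file names are all hypothetical, and a real agent would call an LLM where the callback is invoked and would also inspect screenshots.

```python
from __future__ import annotations

import subprocess
import sys
from pathlib import Path
from typing import Callable

# Hypothetical signature: given the task and the log of the previous run,
# return the full source of a new script. A real agent would call an LLM here.
ScriptWriter = Callable[[str, str], str]


def run_script(path: Path, timeout: float = 60.0) -> tuple[int, str]:
    """Execute a script in a fresh process; return (exit code, combined log)."""
    proc = subprocess.run(
        [sys.executable, str(path)],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout + proc.stderr


def flat_loop(
    task: str,
    write_script: ScriptWriter,
    workspace: Path,
    max_rounds: int = 5,
) -> Path | None:
    """Generate, execute, inspect, and repair until a script exits cleanly."""
    feedback = ""  # empty on the first round, the last log afterwards
    for round_no in range(1, max_rounds + 1):
        script = workspace / f"attempt_{round_no}.py"
        script.write_text(write_script(task, feedback))
        code, log = run_script(script)
        # The workspace, not the process, holds the state: keep every log.
        (workspace / f"attempt_{round_no}.log").write_text(log)
        if code == 0:
            return script
        feedback = log
    return None
```

Because every attempt and its log stay on disk, a failed run can be inspected after the fact, and the returned script can be rerun without the agent.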
Architectural Principles:
Webwright is designed to be lightweight and minimal. Its core agent loop, Playwright environment, and CLI are roughly 450, 570, and 150 lines of code, respectively, and avoid complex hidden frameworks. It supports pluggable model backends (OpenAI, Anthropic, OpenRouter) and writes run artifacts (trajectories, screenshots) to disk for thorough inspection and debugging.
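One common way to make model backends pluggable is a small name-to-callable registry. The sketch below shows that general pattern only; it is not Webwright's code, and `register_backend`, `get_backend`, and the offline `echo` backend are hypothetical names invented for illustration. A real backend would wrap a provider's API client.

```python
from typing import Callable, Dict

# A backend is modeled here as a function from prompt text to completion text.
Backend = Callable[[str], str]

_BACKENDS: Dict[str, Backend] = {}


def register_backend(name: str) -> Callable[[Backend], Backend]:
    """Decorator that registers a backend under a short name."""
    def decorator(fn: Backend) -> Backend:
        _BACKENDS[name] = fn
        return fn
    return decorator


@register_backend("echo")
def echo_backend(prompt: str) -> str:
    """Stand-in backend for offline use; a real one would call a provider API."""
    return f"# script for: {prompt}"


def get_backend(name: str) -> Backend:
    """Look up a backend by name, listing the valid choices on failure."""
    try:
        return _BACKENDS[name]
    except KeyError:
        raise ValueError(
            f"unknown backend {name!r}; available: {sorted(_BACKENDS)}"
        ) from None
```

With this shape, the agent loop depends only on the `Backend` callable, so swapping providers becomes a configuration change rather than a code change.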
Key Differentiators from Other Browser Agents:
Webwright fundamentally differs from other browser-agent frameworks in several key aspects:
- Paradigm: While others might use hybrid code/NL primitives (Stagehand), CLI tools for an external agent (agent-browser), or autonomous LLM loops over DOM/AX snapshots (browser-use), Webwright positions itself as a "coding agent with a terminal," treating the browser as a disposable environment it spawns.
- Action Space: Its action space is free-form Python, allowing the LLM to write Playwright scripts directly. This is a significant departure from discrete subcommands, indexed click/type actions, or LLM-translated Playwright primitives.
- State Definition: The "state" for Webwright is the local workspace (code, screenshots, logs), whereas for others, it's typically the persistent browser session. This makes browser sessions disposable in Webwright.
- Loop Shape: Webwright's loop is flat: the agent writes a script, runs it, and inspects the result. Multi-step interactions can therefore be composed into a single program, leading to fewer interaction rounds and faster execution compared to step-by-step observation and prediction loops.
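To make the free-form action space and the workspace-as-state idea concrete, here is a sketch of the kind of script an agent might write: several page visits composed into one program, with screenshots and log lines written to a local directory. The function names, the `run_artifacts` directory, and the example URL are assumptions for illustration; the Playwright calls used (`sync_playwright`, `chromium.launch`, `new_page`, `goto`, `title`, `screenshot`) are part of Playwright's Python sync API.

```python
from pathlib import Path


def visit_all(page, urls, workspace: Path) -> dict:
    """Visit each URL with the given page, saving a screenshot per step."""
    workspace.mkdir(parents=True, exist_ok=True)
    titles = {}
    for i, url in enumerate(urls):
        page.goto(url, wait_until="domcontentloaded")
        titles[url] = page.title()
        # Screenshots and printed lines are what gets inspected after the run.
        page.screenshot(path=str(workspace / f"step_{i}.png"))
        print(f"[step {i}] {url} -> {titles[url]!r}")
    return titles


def main() -> dict:
    # Imported lazily so the module loads even where Playwright is missing.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # disposable session
        try:
            page = browser.new_page()
            return visit_all(page, ["https://example.com"], Path("run_artifacts"))
        finally:
            browser.close()
```

Calling `main()` runs the whole sequence in one execution round. The browser is closed at the end, while the screenshots in `run_artifacts` and the script itself persist.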
Performance:
Webwright demonstrates state-of-the-art performance on two real-website benchmarks with a 100-step budget:
- Online-Mind2Web (300 tasks): Achieved 86.7% with GPT-5.4, outperforming other open-sourced harnesses. Claude Opus 4.7 also performed strongly at 84.7%.
- Odysseys (200 long-horizon tasks): Achieved 60.1% with GPT-5.4 (average 76.1 steps), an improvement of 15.6 percentage points over the prior SOTA (Opus 4.6, 44.5%), which used vision-based approaches and persistent browsers.
Usage and Integration:
Webwright can be used as a standalone CLI tool, taking configuration files, task instructions, and initial URLs as input. Crucially, it ships with plugin manifests for Claude Code, OpenAI Codex, OpenClaw, and Hermes Agent. This allows host agents to drive the Webwright loop natively without requiring additional LLM API keys or costs. It offers two primary modes for script generation: /webwright:run for a one-shot final_script.py and /webwright:craft for a reusable, parameterized CLI tool that can be rerun with different arguments.
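The difference between a one-shot script and a reusable, parameterized tool can be illustrated with `argparse`: values that a one-shot script would hard-code become command-line arguments. This is only a sketch of the idea, not the output of /webwright:craft; the flag names `--query`, `--max-results`, and `--out-dir` are hypothetical, and the browser automation itself is left as a placeholder.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Declare the inputs that vary between runs of the crafted tool."""
    parser = argparse.ArgumentParser(
        description="Hypothetical crafted tool: search a site and save results."
    )
    parser.add_argument("--query", required=True, help="search term to submit")
    parser.add_argument("--max-results", type=int, default=5,
                        help="how many results to keep")
    parser.add_argument("--out-dir", default="artifacts",
                        help="where to write screenshots and logs")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse arguments; passing argv=None falls back to sys.argv."""
    args = build_parser().parse_args(argv)
    # The browser automation would run here, driven by args instead of
    # hard-coded values.
    print(f"would search for {args.query!r}, keeping {args.max_results} "
          f"results in {args.out_dir}")
    return args
```

For example, `main(["--query", "laptops", "--max-results", "3"])` reruns the same workflow with new inputs, without regenerating the script.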