Harness engineering: leveraging Codex in an agent-first world
Key Points
1. This paper details an experiment in which a software product with over a million lines of code was developed in five months, entirely generated by OpenAI's Codex agents, significantly accelerating the development timeline.
2. This "harness engineering" approach redefined the engineer's role, shifting the focus from writing code to designing legible environments, specifying intent, and building feedback loops that enabled agents to autonomously execute tasks and maintain the codebase.
3. Success relied on structured in-repository knowledge, mechanically enforced architectural invariants, and autonomous agents managing reviews, validating fixes, and continuously addressing technical debt to ensure coherence and maintainability.
This paper introduces "Harness Engineering," a novel approach to software development where engineering teams build and ship a software product with zero lines of manually-written code, relying entirely on AI agents, specifically Codex. The authors, from OpenAI, detail their five-month experiment which resulted in a product with internal daily users and external alpha testers, culminating in approximately one million lines of agent-generated code. This methodology reportedly achieved development velocity up to 10 times faster than traditional human-driven coding.
The core methodology shifts the engineer's role from writing code to "harnessing" agents. This involves designing environments, precisely specifying intent, and constructing robust feedback loops that enable agents to perform reliable work. The guiding principle is "humans steer, agents execute." When an agent struggles, the human's job is not to write code but to identify the missing capability (tools, guardrails, or documentation) and then prompt the agent itself to implement the necessary fix or enhancement.
Key technical aspects of this "agent-first" paradigm include:
- Agent-centric Interaction and Workflow: Engineers interact with the system almost entirely through high-level prompts. Codex agents are then responsible for translating these prompts into actionable development steps, which include opening Pull Requests (PRs), self-reviewing their changes locally, requesting additional agent or human reviews, responding to feedback, and iterating until all reviewers are satisfied. This iterative loop is termed the "Ralph Wiggum Loop." Agents directly use standard development tools (e.g., the `gh` CLI, local scripts, and repository-embedded skills) to gather context and execute tasks, eliminating the need for human copy-pasting.
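The review loop described above can be sketched as a simple convergence loop. This is a minimal illustration, not the paper's actual implementation: `run_agent` and `collect_reviews` are hypothetical stand-ins for the Codex and reviewer interactions.

```python
# Sketch of the "Ralph Wiggum Loop": re-run the agent on the same task,
# feeding reviewer feedback back in, until every reviewer is satisfied.
# run_agent and collect_reviews are hypothetical callables for illustration.

def ralph_wiggum_loop(task, run_agent, collect_reviews, max_rounds=10):
    """Iterate agent work -> review -> feedback until reviews come back clean."""
    feedback = []
    for round_no in range(1, max_rounds + 1):
        change = run_agent(task, feedback)    # agent opens or updates the PR
        feedback = collect_reviews(change)    # agent and human review comments
        if not feedback:                      # all reviewers satisfied
            return change, round_no
    raise RuntimeError("escalate to a human: reviews did not converge")
```

The escape hatch matters: per the paper's principle, when the loop fails to converge the human steers (diagnosing the missing tool or guardrail) rather than writing the fix directly.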
- Enhanced Application Legibility for Agents: To overcome the human QA bottleneck as code throughput increased, the system was designed to make the application's internal state directly legible to agents. This includes:
- Isolated Worktrees: The application is made bootable per Git worktree, allowing Codex to launch and drive isolated instances for each change, preventing state contamination and enabling parallel development.
- UI/UX Interaction: Integration with the Chrome DevTools Protocol allows agents to access DOM snapshots, screenshots, and perform navigation. This empowers Codex to reproduce bugs, validate UI fixes, and directly reason about user interface behavior.
- Observability Integration: A local, ephemeral observability stack exposes logs, metrics, and traces directly to Codex. Agents can query logs using LogQL and metrics using PromQL, enabling them to validate performance SLOs (e.g., "service startup completes in under 800ms") and detect regressions without human oversight. Agents can work on single tasks for extended periods (up to six hours).
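The SLO check in the last item can be made concrete. In the paper's setup an agent fetches metrics from the local observability stack via PromQL; the sketch below assumes the samples have already been retrieved and only shows the validation step, with invented names and thresholds.

```python
# Hedged sketch: validating a "startup completes in under 800ms" SLO from
# already-fetched metric samples. The percentile method and budget are
# illustrative assumptions, not the paper's exact check.

def percentile(samples, p):
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(round(p / 100 * (len(ordered) - 1))))
    return ordered[idx]

def check_startup_slo(startup_ms, budget_ms=800, p=95):
    """Return (ok, observed) so an agent can decide whether to fix or escalate."""
    observed = percentile(startup_ms, p)
    return observed <= budget_ms, observed
```

Returning both the verdict and the observed value gives the agent something to cite in a PR description or an escalation message.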
- Structured Repository Knowledge Base: Recognizing that context is a scarce resource for agents, the initial monolithic `AGENTS.md` file was replaced with a structured `docs/` directory acting as the primary "system of record." The `AGENTS.md` now functions as a concise table of contents (approx. 100 lines), guiding agents to deeper, context-specific documentation. This approach promotes "progressive disclosure," where agents are taught where to look for information rather than being overwhelmed upfront. The knowledge base, including design documents, architectural maps, product specifications, and execution plans, is versioned and co-located in the repository. Mechanical enforcement, via linters and CI jobs, validates the knowledge base's freshness and structure. A "doc-gardening" agent is deployed to scan for stale documentation and open corrective PRs. The philosophy emphasizes pushing all relevant context—even "chat discussions" about architectural patterns—into the repository to ensure agent legibility, favoring "boring" technologies due to their composability, API stability, and strong representation in model training data.
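A doc-gardening pass of the kind described above might look like the following sketch. The `last-reviewed:` marker and 90-day threshold are assumptions for illustration; the paper does not specify the actual staleness check.

```python
# Illustrative "doc-gardening" scan: flag docs whose review marker is missing
# or older than a freshness window, so an agent can open corrective PRs.
# The front-matter convention and threshold are invented for this sketch.
from datetime import date, timedelta
import re

STALE_AFTER = timedelta(days=90)

def stale_docs(docs, today=None):
    """docs: mapping of path -> file text; returns paths needing a refresh PR."""
    today = today or date.today()
    stale = []
    for path, text in docs.items():
        m = re.search(r"last-reviewed:\s*(\d{4})-(\d{2})-(\d{2})", text)
        if not m or today - date(*map(int, m.groups())) > STALE_AFTER:
            stale.append(path)
    return sorted(stale)
```

Because the output is a plain list of paths, the same scan works as a CI gate (fail if non-empty) or as input to a background agent that opens one PR per stale file.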
- Mechanical Enforcement of Architecture and "Taste": To maintain coherence in a fully agent-generated codebase, strict architectural invariants are enforced mechanically, rather than micromanaging implementations. The application architecture is rigidly structured into layers (e.g., Types → Config → Repo → Service → Runtime → UI) with strictly validated dependency directions and permissible edges. Cross-cutting concerns are channeled through explicit "Providers." These constraints are enforced by custom, Codex-generated linters and structural tests, which provide remediation instructions in their error messages for the agents. "Taste invariants," such as structured logging, naming conventions, and file size limits, are also encoded into these mechanical checks. This allows human taste to be captured once and continuously enforced across the entire codebase, effectively acting as "garbage collection" to prevent technical debt from compounding.
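A dependency-direction check of the sort described above can be sketched directly from the layer order. The module naming scheme and the remediation message are assumptions for illustration; the paper's linters are Codex-generated and richer than this.

```python
# Sketch of a mechanical layering check mirroring the paper's
# Types -> Config -> Repo -> Service -> Runtime -> UI order: a module may
# only depend on layers at or below its own. Names are illustrative.

LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}

def layer_of(module):
    # Assumed convention: top-level package names the layer,
    # e.g. "service.billing" lives in the "service" layer.
    return module.split(".")[0]

def check_import(importer, imported):
    """Return a remediation message for an illegal edge, or None if allowed."""
    a, b = layer_of(importer), layer_of(imported)
    if RANK[b] > RANK[a]:
        return (f"{importer} must not import {imported}: "
                f"move the shared logic down a layer or route it through a Provider")
    return None
```

Note that the error message itself carries the remediation instruction, matching the paper's point that lint output is written for agents to act on, not just for humans to read.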
- Autonomous Feature Development and Garbage Collection: The system has evolved to the point where Codex can drive new features end-to-end, from bug reproduction and fix implementation to validation and PR management (including responding to feedback and detecting build failures), escalating only when human judgment is indispensable. Furthermore, to combat entropy and "AI slop" in agent-generated code, the team encoded "golden principles" directly into the repository. Background Codex tasks regularly scan for deviations from these principles, update quality grades, and open targeted refactoring PRs for automated or quick human review. This continuous, automated refactoring prevents the accumulation of technical debt, much like garbage collection in a programming language runtime.
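The grade-and-refactor scan described above can be sketched as follows. The two principles checked (structured logging, a file-size cap), the thresholds, and the letter grades are all invented for illustration; the paper's golden principles and grading are not enumerated here.

```python
# Hedged sketch of a background "garbage collection" scan: grade files against
# sample golden principles and queue the worst offenders for refactoring PRs.
# Principles, thresholds, and grades are assumptions for this illustration.

MAX_LINES = 400  # assumed file-size cap

def grade_file(path, text):
    violations = []
    lines = text.splitlines()
    if len(lines) > MAX_LINES:
        violations.append(f"{path}: {len(lines)} lines exceeds the {MAX_LINES}-line cap")
    for i, line in enumerate(lines, 1):
        if "print(" in line:  # golden principle: use the structured logger
            violations.append(f"{path}:{i}: use the structured logger, not print")
    grade = "A" if not violations else ("B" if len(violations) <= 2 else "C")
    return grade, violations

def refactor_candidates(files):
    """files: path -> text; return paths graded below A, worst grades first."""
    graded = {p: grade_file(p, t) for p, t in files.items()}
    return sorted((p for p, (g, _) in graded.items() if g != "A"),
                  key=lambda p: graded[p][0], reverse=True)
```

As with the layering linter, each violation string doubles as a remediation instruction, so a background agent can turn the scan output directly into a targeted refactoring PR.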
The paper concludes by highlighting that building software still demands discipline, but this discipline is now primarily applied to the scaffolding, tooling, abstractions, and feedback loops that govern agent behavior, rather than directly writing application code. The primary challenges have shifted to designing robust environments and control systems for agents to build and maintain complex software at scale.