
Code as Agent Harness
Key Points
- This paper introduces "code as agent harness," proposing that code serves as an executable, inspectable, and stateful operational substrate for AI agents, moving beyond its traditional role as a mere output.
- This framework positions code as crucial for grounding agent reasoning, enabling programmatic actions, and modeling environments, thereby enhancing reliability and closed-loop behavior in agentic systems.
- The survey systematically organizes this perspective into three layers: the harness interface, mechanisms for sustained operation (planning, memory, tool use), and scaling the harness for multi-agent coordination.
This paper introduces the concept of "Code as Agent Harness," positing that code serves as the fundamental operational substrate for modern agentic AI systems, moving beyond its traditional role as merely a target output. It argues that code provides an executable, inspectable, and stateful medium through which agents reason, act, observe feedback, and verify progress, thereby enabling reliable closed-loop behavior in long-running tasks. The survey systematically studies this perspective across three interconnected layers: the harness interface, harness mechanisms, and scaling the harness to multi-agent settings.
1. Harness Interface: Code for Reasoning, Acting, and Environment Modeling (§2)
This layer establishes how code forms the basic connection between an LLM agent and its task environment, allowing for executable, inspectable, and stateful interactions.
- Code for Reasoning (§2.1): Code externalizes internal model logic into verifiable computation.
- Program-Delegated Reasoning: Rather than relying solely on natural language for complex computation, the LLM generates executable programs (e.g., Python scripts, domain-specific language (DSL) commands). An external runtime or interpreter executes this code, producing formally grounded outputs. This approach, exemplified by Program-of-Thoughts (PoT) prompting, separates high-level reasoning (the LLM's task decomposition) from low-level computation (the external executor's role), which improves reliability and verifiability. Intermediate execution traces, variable states, and function outputs can be fed back to the model for refinement.
- Formal Verification and Symbolic Reasoning Interfaces: Code functions as a persistent intermediate representation in hybrid neural-symbolic systems. LLMs can generate formal specifications and proof scripts, or invoke symbolic solvers (e.g., SMT solvers), through code interfaces. This allows for rigorous verification of computational steps and exploration of structured reasoning trajectories (e.g., Graph-of-Thoughts).
- Iterative Code-Grounded Reasoning: Execution feedback from generated code (e.g., runtime errors, test failures, or success signals) is used to iteratively refine the LLM's reasoning. This creates a closed-loop system where the agent diagnoses issues and proposes revised code, converging towards a correct solution.
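The program-delegated and iterative patterns above can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not the survey's implementation: `scripted_model` is a hypothetical stand-in for an LLM call, the convention that a program stores its result in a variable named `answer` is invented for this example, and plain `exec` is used only for brevity (it is not a sandbox).

```python
import traceback
from typing import Callable, Optional, Tuple


def run_program(source: str) -> Tuple[bool, str]:
    """Execute generated code and return (ok, feedback)."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # assumption: illustration only, NOT a sandbox
    except Exception:
        # The error text becomes the feedback signal for the next attempt.
        return False, traceback.format_exc(limit=1)
    if "answer" not in namespace:
        return False, "program did not define `answer`"
    return True, repr(namespace["answer"])


def refine_until_success(
    propose: Callable[[str, Optional[str]], str],
    task: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Ask for code, run it, and feed failures back until a run succeeds."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        source = propose(task, feedback)
        ok, feedback = run_program(source)
        if ok:
            return feedback
    return None


def scripted_model(task: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: buggy first draft, fixed after feedback."""
    if feedback is None:
        return "answer = sum(range(1, 101)) / zero"  # NameError on purpose
    return "answer = sum(range(1, 101))"


print(refine_until_success(scripted_model, "sum the integers 1..100"))  # prints 5050
```

The arithmetic is done by the interpreter rather than by the model, and the only thing passed back after a failure is the execution feedback, which is the division of labor the bullets describe.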
- Code for Acting (§2.2): Code translates high-level intent into executable operations within various environments.
- Grounded Skill Selection: LLMs generate or select pre-defined code snippets (skills) to perform specific actions. These skills are often represented as Python functions or API calls with defined inputs and outputs, grounding the agent's actions in the capabilities of the environment.
- Programmatic Policy Generation: For complex interactions, LLMs synthesize programmatic policies (e.g., control scripts, behavior trees, state machines) that specify sequences of actions and conditional logic. These policies are directly executable by robotic, GUI, or software agents.
- Lifelong Code-Based Agents: Agents continuously learn and refine their action policies by generating and updating code modules based on new experiences and feedback, leading to adaptive and evolving behaviors over time.
- Code for Environment Modeling (§2.3): Code represents the world state, transition dynamics, and feedback signals.
- Structured World Representations: Environments are modeled using code-based structures, such as Document Object Model (DOM) trees for web interfaces, code repositories for software environments, or formal specifications. Agents parse, query, and manipulate these structures programmatically.
- Execution-Trace World Modeling: The history of agent interactions and environment changes is captured through execution traces, logs, and system states, all represented in code or derived from code execution. These provide a structured, inspectable record of the environment's dynamics.
- Code-Grounded Evaluation Environments: Unit tests, test suites, and simulations, all defined in code, serve as verifiable ground truth and feedback mechanisms. Agents can generate tests, run them, and use the results to evaluate their performance and revise their strategies.
- Verifiable Environment Construction: For scientific discovery or system design, agents can programmatically construct and manipulate simulations, experimental setups, or software architectures, ensuring reproducibility and verifiability of the environment itself.
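To make the environment-modeling ideas above concrete, here is a small sketch of a world whose state, transitions, trace, and success check are all ordinary code. Everything in it is hypothetical and chosen for illustration: the `FileWorld` class, its two actions, and the `goal_reached` check are not taken from the survey.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileWorld:
    """A toy environment: the world state is a mapping from path to content."""

    files: Dict[str, str] = field(default_factory=dict)
    trace: List[dict] = field(default_factory=list)

    def step(self, action: str, path: str, content: str = "") -> str:
        """Apply one action, record it in the trace, and return the observation."""
        if action == "write":
            self.files[path] = content
            observation = "ok"
        elif action == "read":
            observation = self.files.get(path, "error: no such file")
        else:
            observation = f"error: unknown action {action!r}"
        self.trace.append(
            {"action": action, "path": path, "observation": observation}
        )
        return observation


def goal_reached(world: FileWorld) -> bool:
    """Code-defined evaluation: README.md must exist and start with a heading."""
    return world.files.get("README.md", "").startswith("# ")


world = FileWorld()
world.step("read", "README.md")              # fails, and the failure is logged
world.step("write", "README.md", "# Demo")
print(goal_reached(world), len(world.trace))  # prints: True 2
```

Because the state and the trace are plain data, an agent (or a human reviewer) can inspect exactly what happened and rerun the check deterministically.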
2. Harness Mechanisms: Planning, Memory, Tool Use, Control, and Optimization (§3)
This layer details how the "code as agent harness" sustains reliable behavior over long horizons.
- Planning for Agent Harness (§3.1): LLMs generate and refine plans as sequences of code-based operations.
- Linear Decomposition: Breaking down complex tasks into a sequential list of code-executable sub-tasks or function calls.
- Structure-grounded Planning: Generating plans that conform to existing code structures, API schemas, or software architectures.
- Search-based Planning: Exploring a search space of possible code modifications or execution paths, often guided by execution feedback or symbolic reasoning.
- Orchestration-based Planning: Defining complex workflows as executable scripts or state machines, coordinating multiple code modules or tools.
- Memory and Context Engineering (§3.2): Code enables persistent storage and retrieval of information.
- Working Memory: Short-term memory storing current code states, variables, and recent execution traces.
- Semantic Memory: Knowledge bases encoded as structured code (e.g., functions, classes, data structures) or retrieved via embedding search on code artifacts.
- Experiential Memory: Storing past problem-solving episodes, including successful code solutions, execution traces, and debugging paths, for reuse.
- Long-Term Memory: Persistent storage of reusable skills, domain knowledge, and learned policies, often serialized as code or data structures.
- Multi-Agent Memory: Shared code repositories or state databases that facilitate coordination and knowledge transfer between multiple agents.
- Context Compaction and State Offloading: Techniques to summarize code-based execution histories and states for efficient LLM prompting, preventing context window overflow.
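As one possible reading of context compaction, the sketch below keeps the most recent execution events verbatim and collapses older ones into a single summary line. The event format (plain strings, with failures prefixed by `ERROR`) and the function name `compact_history` are assumptions made for this example; real systems may summarize with a model or offload state to files instead.

```python
from typing import List


def compact_history(events: List[str], keep_last: int = 3) -> List[str]:
    """Keep the newest events verbatim; replace older ones with a summary."""
    if len(events) <= keep_last:
        return list(events)
    cut = len(events) - keep_last
    older, recent = events[:cut], events[cut:]
    # Assumed convention: failed steps are logged with an "ERROR" prefix.
    errors = sum(1 for event in older if event.startswith("ERROR"))
    summary = f"[{len(older)} earlier events omitted; errors: {errors}]"
    return [summary] + recent


history = [
    "run tests",
    "ERROR: test_parse failed",
    "edit parser.py",
    "run tests",
    "OK: 12 passed",
]
for line in compact_history(history, keep_last=2):
    print(line)
```

The summary line preserves how much was dropped and how many failures occurred, so the prompt stays short without hiding that earlier attempts went wrong.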
- Tool Use for Agent Harness (§3.3): Code provides the interface for agents to interact with external tools.
- Function-Oriented Tool Use: LLMs generate function calls defined by code (e.g., Python functions, REST API calls) to perform specific operations.
- Environment-Interaction Tool Use: Code allows agents to interact with diverse environments, such as command-line interfaces, web browsers (via DOM manipulation), or robotic platforms.
- Verification-Driven Tool Use: Tools are leveraged for validation, such as running unit tests, linters, or static analyzers on generated code.
- Workflow-Orchestration Tool Use: Code defines and executes complex workflows by chaining together multiple tool calls.
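Function-oriented tool use can be illustrated with a small dispatcher that takes a model-produced call, checks it against the tool's signature, and runs it. The JSON shape (`name` plus `arguments`), the `tool` decorator, and the two example tools are hypothetical choices for this sketch, not a specific framework's API.

```python
import inspect
import json
from typing import Callable, Dict

TOOLS: Dict[str, Callable] = {}


def tool(fn: Callable) -> Callable:
    """Register a plain Python function as a callable tool."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def add(a: int, b: int) -> int:
    return a + b


@tool
def word_count(text: str) -> int:
    return len(text.split())


def dispatch(call_json: str) -> dict:
    """Validate and execute a tool call of the form {"name": ..., "arguments": {...}}."""
    call = json.loads(call_json)
    name = call.get("name")
    fn = TOOLS.get(name)
    if fn is None:
        return {"ok": False, "error": f"unknown tool {name!r}"}
    try:
        # Signature.bind raises TypeError for missing or unexpected arguments.
        bound = inspect.signature(fn).bind(**call.get("arguments", {}))
    except TypeError as exc:
        return {"ok": False, "error": str(exc)}
    return {"ok": True, "result": fn(*bound.args, **bound.kwargs)}


print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
print(dispatch('{"name": "add", "arguments": {"a": 2}}'))
```

Returning a structured error instead of raising lets the harness hand the failure back to the model as feedback, which is also the basis of the verification-driven pattern listed above.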
- Harness Control through the Plan, Execute, and Verify Loop (§3.4): The plan-execute-verify loop is the core feedback cycle through which the harness supports self-correction.
- From Debugging to Harness-Level Control: When code execution fails, the harness captures errors and uses them to guide the LLM in debugging and revising the code.
- Planning as Contract Formation: Plans generated by the LLM (as code) serve as explicit contracts, specifying expected outcomes and allowing for systematic verification at each step.
- Sandboxed Execution and Permissioned State Transition: Code is executed in controlled, isolated environments (sandboxes) to ensure safety and track state changes, often with explicit permission models.
- Verification through Deterministic Sensors: The harness uses code-based sensors (e.g., API calls, test results) to obtain deterministic feedback from the environment, enabling objective evaluation of agent actions.
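A compact way to picture the plan, execute, and verify loop is a plan whose steps each carry an explicit postcondition. The sketch below is an assumption-laden illustration: the `Step` type, the dictionary state, and the rollback-on-failure policy are invented here, and the shallow snapshot is only adequate because the toy state holds immutable values.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]


@dataclass
class Step:
    name: str
    action: Callable[[State], None]         # mutates the state
    postcondition: Callable[[State], bool]  # deterministic check (the "contract")


def run_plan(steps: List[Step], state: State) -> List[str]:
    """Execute steps in order; verify each one and roll back the first failure."""
    log: List[str] = []
    for step in steps:
        snapshot = dict(state)  # shallow copy: enough for immutable values
        step.action(state)
        if step.postcondition(state):
            log.append(f"{step.name}: verified")
        else:
            state.clear()
            state.update(snapshot)  # restore the last verified state
            log.append(f"{step.name}: failed, rolled back")
            break
    return log


plan = [
    Step("init", lambda s: s.update(x=1), lambda s: s["x"] == 1),
    Step("double", lambda s: s.update(x=s["x"] * 3), lambda s: s["x"] == 2),  # buggy action
]
state: State = {}
print(run_plan(plan, state), state)
```

The second step multiplies by three instead of two, so its postcondition fails, the state is restored to `{"x": 1}`, and the log gives the harness a precise point at which to ask for a revised step.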
- Agentic Harness Engineering for Adaptive Optimization (§3.5): The harness itself can evolve.
- Deep Telemetry: Capturing comprehensive execution data (code generated, errors, performance metrics) serves as feedback for improving the harness components.
- Evolution Agent: An agent (or meta-agent) capable of analyzing telemetry and modifying the harness's internal logic, tool definitions, or prompting strategies.
- Governed Harness Mutation: Controlled and systematic changes to the harness structure, often guided by performance objectives and safety constraints, preventing regressions.
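One simple way to think about governed mutation is a gate that accepts a proposed harness change only if no previously passing case starts to fail. The sketch below assumes a harness can be described by a plain config dictionary and that a `passes(harness, case)` check exists; both are hypothetical simplifications, and the toy `passes` function stands in for a real evaluation run.

```python
from typing import Callable, Dict, List, Tuple

Harness = Dict[str, int]  # hypothetical: a harness reduced to a config dict


def governed_update(
    current: Harness,
    candidate: Harness,
    passes: Callable[[Harness, str], bool],
    cases: List[str],
) -> Tuple[Harness, List[str]]:
    """Return (accepted harness, regressions); reject the candidate on any regression."""
    regressions = [
        case for case in cases
        if passes(current, case) and not passes(candidate, case)
    ]
    if regressions:
        return current, regressions
    return candidate, []


def passes(harness: Harness, case: str) -> bool:
    """Toy check: a case passes if the input fits the configured limit."""
    return len(case) <= harness["max_input_chars"]


current = {"max_input_chars": 10}
cases = ["short", "a bit longer"]
print(governed_update(current, {"max_input_chars": 20}, passes, cases))
print(governed_update(current, {"max_input_chars": 3}, passes, cases))
```

The first candidate is accepted because it passes everything the current harness passes; the second is rejected and the regressed case is reported, which gives the evolution agent a concrete signal to act on.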
3. Scaling the Harness: Multi-Agent Orchestration over Code (§4)
This layer extends the "code as agent harness" concept to collaborative multi-agent systems, where code facilitates coordination, shared state, and collective verification.
- Improved Coding Support through Multi-agent Collaboration (§4.1):
- Functional Role Specialization: Agents assume distinct roles (e.g., manager, coder, reviewer, tester) with responsibilities defined and executed through code.
- Diverse Interaction Modes Grounded in Shared Program State: Collaboration occurs through shared code artifacts (e.g., a common codebase, pull requests) that reflect the evolving shared state. Modes include programming, debugging, red-teaming, and debate.
- Optimized Workflow Topology: Agents interact in centralized, distributed, or streaming workflows, all mediated by shared code and execution states.
- Execution Feedback and Shared-Harness Synchronization (§4.2):
- Execution Feedback Integration: Feedback from one agent's code execution (e.g., test results) is shared with other agents to inform their actions.
- Shared-Harness Synchronization: Mechanisms ensure all agents have a consistent view of the shared code artifacts and environment state.
- Position: The Shared Code-Centric Harness Substrate (§4.3):
- Shared Harness Representation: Code repositories, test suites, and execution traces form a common, inspectable, and editable workspace for multi-agent teams.
- Harness-State Convergence: Agents work towards a common goal by iteratively modifying and converging on a shared, correct code state.
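Shared-harness synchronization can be illustrated with a workspace that rejects commits based on a stale view. This sketch uses optimistic concurrency with a version counter, which is one common technique chosen here for illustration and not something the survey prescribes; the `SharedWorkspace` class and its methods are hypothetical.

```python
import threading
from typing import Dict, Tuple


class SharedWorkspace:
    """A toy shared code state with version-checked commits."""

    def __init__(self) -> None:
        self._files: Dict[str, str] = {}
        self._version = 0
        self._lock = threading.Lock()

    def read(self) -> Tuple[int, Dict[str, str]]:
        """Return the current version and a copy of the files."""
        with self._lock:
            return self._version, dict(self._files)

    def commit(self, base_version: int, changes: Dict[str, str]) -> bool:
        """Apply changes only if the caller's view is still current."""
        with self._lock:
            if base_version != self._version:
                return False  # stale view: the agent must re-read and retry
            self._files.update(changes)
            self._version += 1
            return True


workspace = SharedWorkspace()
coder_view, _ = workspace.read()
reviewer_view, _ = workspace.read()
print(workspace.commit(coder_view, {"app.py": "x = 1"}))     # prints True
print(workspace.commit(reviewer_view, {"app.py": "x = 2"}))  # prints False
```

The second commit is refused because the reviewer's view predates the coder's change, so every agent that succeeds in committing is guaranteed to have seen the latest shared state.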
Emerging Fields and Open Problems (§5)
The paper identifies key applications, including coding assistants, GUI/OS automation, embodied agents, scientific discovery, and personalization. It also outlines critical open challenges:
- Harness-Level Evaluation: Beyond final task success, evaluating the quality of intermediate code, plans, and recovery strategies.
- Semantic Verification: Verifying code properties beyond simple execution success, such as correctness with respect to complex specifications or security.
- Self-Evolving Harnesses without Regression: Ensuring that adaptive improvements to the harness do not introduce new failures.
- Transactional Shared Program State: Managing concurrent modifications and resolving conflicts in shared code environments across multiple agents.
- Human-in-the-Loop Safety: Integrating human oversight for safety-critical actions and defining accountability mechanisms within the code-harnessed system.
- Multimodal Code-Harness Systems: Extending the code-centric harness to incorporate and reason about multimodal inputs and outputs.
In essence, "Code as Agent Harness" redefines the role of code from a passive output to an active, central component of agent infrastructure, enabling a systematic approach to building executable, verifiable, and stateful AI agent systems.