Unrolling the Codex agent loop
Key Points
- The Codex CLI's core logic is its "agent loop," which orchestrates interactions between the user, the AI model, and various tools to execute software tasks.
- This loop iteratively generates prompts, performs model inference, and executes tool calls, with conversation history and tool outputs progressively appended to the input until an assistant message is produced.
- To manage performance and context window limitations, Codex employs strategies like prompt caching for efficient inference and automatic conversation compaction to reduce prompt length.
The article "Unrolling the Codex agent loop" by OpenAI explains the core operational mechanics of the Codex CLI, a cross-platform local software agent designed to generate high-quality software changes. The first in a series, it focuses specifically on the "agent loop," the central logic that orchestrates interactions among the user, the language model (LLM), and various tools to accomplish software development tasks. The term "Codex" in this context refers to the "Codex harness," which encapsulates this core agent loop and execution logic.
The fundamental concept of the agent loop involves an iterative process:
- User Input: The loop begins by taking input from the user.
- Prompt Preparation: This input is then integrated into a comprehensive textual prompt, which also incorporates conversation history and contextual information.
- Model Inference: The prepared prompt is sent to the LLM via the Responses API. The model processes this prompt, translating it into input tokens, sampling, and then generating output tokens, which are translated back into text. This process is incremental, supporting streaming outputs.
- Response Handling: The model's response is one of two kinds:
- A final assistant message directly addressing the user's initial request or posing a follow-up question, signaling a termination state for the current "turn."
- A tool call, requesting the agent to execute a specific operation (e.g., running a shell command).
- Tool Execution and Iteration: If a tool call is requested, the agent executes the tool, appends the tool's output to the original prompt, and re-queries the model. This iterative cycle of inference and tool calling continues until the model delivers an assistant message.
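The steps above can be sketched in a few lines. This is a minimal illustration, not the real Codex harness: the model client, tool runner, and item shapes (FakeModel, run_turn, the dict fields) are simplified stand-ins.

```python
class FakeModel:
    """Stand-in model: first requests a shell tool call, then answers."""
    def __init__(self):
        self.calls = 0

    def infer(self, history, tools):
        self.calls += 1
        if self.calls == 1:
            return {"type": "function_call", "name": "shell",
                    "arguments": "echo hi"}
        return {"type": "assistant_message", "content": "done"}

def run_turn(model, tools, history, user_message):
    """Run one conversation turn: iterate inference and tool calls
    until the model emits a final assistant message."""
    history.append({"role": "user", "content": user_message})
    while True:
        response = model.infer(history, tools)  # one inference call
        if response["type"] == "assistant_message":
            history.append(response)            # turn is complete
            return response["content"]
        # Otherwise it is a tool call: execute it, then append both the
        # call and its output so the next request has full context.
        output = tools[response["name"]](response["arguments"])
        history.append(response)
        history.append({"type": "function_call_output", "output": output})

history = []
result = run_turn(FakeModel(), {"shell": lambda cmd: f"ran: {cmd}"},
                  history, "say hi")
```

Note how the history grows monotonically: the user message, the tool call, its output, and the final assistant message all remain in the list for the next turn.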
A "turn" of conversation encompasses the entire journey from user input to the final assistant message, potentially involving numerous internal inference-tool call iterations. Crucially, the full conversation history, including prior messages and tool calls, is included in the prompt for subsequent turns, leading to an increasing prompt length. This necessitates careful context window management by the agent, as LLMs have a finite context window for both input and output tokens.
Core Methodology: Codex's Agent Loop Implementation
Codex CLI leverages the Responses API for model inference, sending HTTP requests to configurable endpoints (e.g., https://chatgpt.com/backend-api/codex/responses for ChatGPT login, https://api.openai.com/v1/responses for OpenAI hosted models, or http://localhost:11434/v1/responses for local models via Ollama/LM Studio).
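The endpoint choice can be illustrated with a small helper. The endpoint URLs are those named above; the payload shape is a minimal placeholder, and build_request is a hypothetical helper, not Codex's actual HTTP client.

```python
import json
import urllib.request

# Endpoints named in the article; which one is used depends on how the
# user authenticated or configured Codex.
ENDPOINTS = {
    "chatgpt": "https://chatgpt.com/backend-api/codex/responses",
    "openai": "https://api.openai.com/v1/responses",
    "local": "http://localhost:11434/v1/responses",
}

def build_request(provider, payload, api_key=None):
    """Build (but do not send) an HTTP POST to the chosen endpoint."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        ENDPOINTS[provider],
        data=json.dumps(payload).encode(),
        headers=headers,
        method="POST",
    )

# Illustrative payload; field contents are placeholders.
req = build_request("openai", {"model": "<model>", "input": []}, "sk-...")
```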
1. Building the Initial Prompt:
The Responses API structures the prompt from various input types provided by the client, rather than verbatim text. The core JSON payload sent to the API contains:
- instructions: A system or developer message loaded from configuration files (e.g., ~/.codex/config.toml or bundled model-specific Markdown files).
- tools: A JSON array of tool definitions the model can invoke. These include built-in Codex tools (e.g., shell, update_plan), Responses API-provided tools (e.g., web_search), and user-configured MCP (Model Context Protocol) server tools (e.g., mcp__weather__get-forecast). Each tool definition specifies its type, name, description, parameters, and required arguments.
- input: A JSON array of message objects, each with a type (e.g., message), a role (system, developer, user, assistant), and content. Codex populates this field with:
  - A message describing the sandboxed environment (file permissions, network access, writable folders).
  - (Optional) A message from user-defined developer_instructions.
  - (Optional) A message aggregating "user instructions" from project-specific Markdown files (AGENTS.override.md, AGENTS.md) and skill metadata.
  - A message detailing the local environment (current working directory cwd, shell).
  - The user's own message, appended at the end of the input array.

The OpenAI Responses API server then constructs the prompt for the LLM by combining these elements in a specific order: instructions, tools, and then the input messages.
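A concrete payload following this structure might look as below. All literal strings (the instructions text, the tool schema, the message contents) are illustrative placeholders, not what Codex actually sends.

```python
import json

# Illustrative shape of the Responses API payload: instructions,
# tools, and the input message array. Every string value here is a
# placeholder standing in for real Codex-generated content.
payload = {
    "instructions": "<system/developer message from config>",
    "tools": [
        {
            "type": "function",
            "name": "shell",
            "description": "Run a shell command in the sandbox",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "array",
                                "items": {"type": "string"}},
                },
                "required": ["command"],
            },
        },
    ],
    "input": [
        {"type": "message", "role": "developer",
         "content": "<sandbox: permissions, network, writable folders>"},
        {"type": "message", "role": "user",
         "content": "<user instructions from AGENTS.md>"},
        {"type": "message", "role": "user",
         "content": "<environment: cwd, shell>"},
        {"type": "message", "role": "user",
         "content": "Fix the failing test"},
    ],
}

body = json.dumps(payload)  # serialized request body
```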
2. The First Turn and Subsequent Iterations:
The initial HTTP POST request to the Responses API initiates the first turn. The server responds with a Server-Sent Events (SSE) stream, containing event types like response.reasoning_summary_text.delta (for streaming reasoning), response.output_item.added (for tool calls or other model outputs), and response.output_text.delta (for streaming assistant messages). Codex consumes these events, republishing them internally.
When a tool call (e.g., function_call with name: "shell") is identified, Codex executes the command. The tool's output is then formatted as a function_call_output item and appended to the input array. This extended input array, which now includes the model's reasoning and the tool call/output, forms the prefix for the prompt of the *next* inference request within the same turn. This ensures the model has full context of previous steps. This process continues until an assistant message is generated, concluding the turn.
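The append-and-requery step can be sketched as follows. The item shapes loosely follow the function_call / function_call_output names used above, but the exact fields (call_id, the argument encoding) are simplified assumptions, and run_shell is a placeholder for real sandboxed execution.

```python
import json

# The input array so far: just the user's request.
input_items = [
    {"type": "message", "role": "user", "content": "List the repo files"},
]

# Item received from the SSE stream: the model asked for a shell call.
tool_call = {
    "type": "function_call",
    "name": "shell",
    "call_id": "call_1",
    "arguments": json.dumps({"command": ["ls"]}),
}

def run_shell(arguments):
    # Placeholder for real sandboxed command execution.
    return "README.md\nsrc\n"

# Append the call and its output. The extended array becomes the exact
# prefix of the next inference request within the same turn, so the
# model sees every prior step.
input_items.append(tool_call)
input_items.append({
    "type": "function_call_output",
    "call_id": "call_1",
    "output": run_shell(tool_call["arguments"]),
})
```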
3. Performance Considerations:
The ever-growing prompt due to conversation history and tool outputs can lead to performance issues, as inference cost scales with prompt length; total work across a turn is potentially quadratic if the entire history is reprocessed on each iteration.
- Prompt Caching: OpenAI's models employ prompt caching, where computation from previous inference calls is reused if the new prompt is an exact prefix of a previously processed one. This makes sampling linear rather than quadratic.
- Cache Misses: Certain operations can cause cache misses:
- Changing available tools mid-conversation.
- Changing the target model.
- Modifying sandbox configuration, approval mode, or current working directory.
- Cache Preservation: To keep the cached prefix valid, Codex appends new messages to the input array when configuration changes occur (e.g., a new environment message after a sandbox or cwd change) rather than modifying earlier messages.
- Context Window Management (Compaction): When the number of tokens exceeds a defined auto_compact_limit, Codex automatically compacts the conversation by calling a special /responses/compact endpoint of the Responses API. This endpoint returns a smaller list of items, including an item with encrypted_content that preserves the model's latent understanding of the original, longer conversation. This allows the agent to continue without exhausting the context window while maintaining conversational coherence.
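The compaction trigger can be sketched as a simple threshold check. This is a hypothetical illustration: count_tokens is a crude stand-in for a real tokenizer, compact_fn stands in for the POST to the compact endpoint, and the "compacted" item type is an invented placeholder for whatever item the API actually returns with encrypted_content.

```python
def count_tokens(items):
    # Crude proxy for a tokenizer: roughly 4 characters per token.
    return sum(len(str(item)) for item in items) // 4

def maybe_compact(items, auto_compact_limit, compact_fn):
    """Return the conversation unchanged while it fits under the
    limit; otherwise replace it with a compacted version."""
    if count_tokens(items) <= auto_compact_limit:
        return items
    # compact_fn stands in for calling the /responses/compact
    # endpoint, which returns a shorter list of items including one
    # carrying encrypted_content.
    return compact_fn(items)

# A long conversation well past a small limit gets compacted.
history = [{"role": "user", "content": "x" * 400}] * 10
compacted = maybe_compact(
    history,
    auto_compact_limit=100,
    compact_fn=lambda items: [{"type": "compacted",   # placeholder type
                               "encrypted_content": "<opaque>"}],
)
```

The key property is that compaction is transparent to the loop itself: subsequent turns simply build their prompt on top of the shorter item list.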