Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases

Jun Ge
2026.01.26
·Arxiv·by 이호민
#Agent #LLM #Software Engineering #Scalability #Codebase

Key Points

  • The Confucius Code Agent (CCA) is introduced as a scalable software engineering agent designed to operate over massive repositories, addressing challenges in long-context reasoning and long-term memory.
  • CCA is built on the Confucius SDK, which features hierarchical working memory for context management, a persistent note-taking system for continual learning, and a modular extension system for tool use, all supported by a meta-agent that automates configuration synthesis and refinement.
  • On SWE-Bench-Pro, CCA achieves a Resolve@1 score of 54.3%, outperforming prior research baselines and comparing favorably to commercial results under identical conditions, demonstrating the impact of principled scaffolding.

The Confucius Code Agent (CCA) paper introduces a software engineering agent designed to operate over large-scale codebases, sustain long-horizon sessions, and coordinate complex toolchains. CCA is built on the Confucius SDK, an agent development platform structured around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX).

The paper addresses two core challenges in scalable agentic software engineering: C1 (Long-context reasoning) and C2 (Long-term memory). Existing agents often struggle with these due to flat interaction histories, heuristic prompt engineering, or tightly coupled tool pipelines. The Confucius SDK provides a principled approach to managing external information and agent behavior by treating AX, UX, and DX as distinct first-class design principles, unlike frameworks that conflate them. AX focuses on the agent's internal cognitive workspace, providing distilled, structured, and stable reasoning context to the LLM. UX emphasizes transparency, controllability, and interpretability for human users via readable logs and execution traces. DX concerns observability, evaluation, and modularity for developers building and improving agent systems. This separation avoids issues like context overflow for AX, information trimming for UX, and entanglement for DX. For instance, while UX receives rich, streaming updates, AX receives only a compressed summary of tool execution outcomes, avoiding verbose diffs.
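The AX/UX split described above can be illustrated with a minimal sketch in which one tool result is rendered two ways: in full for the human-facing log, and as a compact outcome line for the model's context. The names ToolResult, ux_view, and ax_view are hypothetical and are not taken from the Confucius SDK; the summary format is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class ToolResult:
    """Hypothetical record of one tool execution (not an SDK type)."""
    command: str
    exit_code: int
    output: str  # full, possibly very long, tool output


def ux_view(result: ToolResult) -> str:
    """Human-facing rendering: keep the complete output for the log."""
    return f"$ {result.command}\n{result.output}\n(exit {result.exit_code})"


def ax_view(result: ToolResult, max_chars: int = 80) -> str:
    """Model-facing rendering: status, output size, and the last line only."""
    status = "ok" if result.exit_code == 0 else f"failed (exit {result.exit_code})"
    lines = result.output.splitlines()
    tail = lines[-1][:max_chars] if lines else ""
    return f"{result.command}: {status}; {len(lines)} output lines; last: {tail}"
```

Keeping only the last output line is a deliberately crude heuristic chosen for brevity; the paper describes the AX view as a compressed summary without specifying how it is produced.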

The Confucius SDK underpins CCA through four key mechanisms:

  1. F1 (Context Management): This mechanism, primarily serving AX for C1, employs a hierarchical working memory with configurable visibility scopes (e.g., session, entry, runnable). On top of this sits an adaptive context compression system driven by a planner agent called the "Architect." When the effective prompt length approaches configurable thresholds, the Architect is invoked in a separate LLM call. It analyzes the conversation history to construct a structured summary, explicitly preserving key information categories such as task goals, decisions, open TODOs, and critical error traces. This compressed summary then replaces marked historical messages, while a rolling window of recent messages is kept in original form. The summary is inserted as a new AI message, so future turns see both the compact summary and the recent raw history. This design preserves semantically important information and access to long reasoning chains, overcoming the brittleness of fixed-window truncation.
  2. F2 (Note-Taking Agent): Addressing C2, this mechanism serves both AX and UX by enabling persistent knowledge accumulation. A dedicated note-taking agent, built on the Confucius Orchestrator, distills structured session trajectories (including user messages, tool invocations, LLM outputs, and system events) into compact Markdown notes. These notes are stored in a file-system-like tree, allowing programmatic search, read, write, edit, delete, and import operations via structured tools. A distinctive aspect is the emphasis on "hindsight notes" for failures, which record compilation errors, runtime exceptions, and unproductive strategies. This corpus of failure cases, indexed by error messages and stack traces, lets agents retrieve known fixes for similar failures in future sessions, reducing repeated "thrashing."
  3. F3 (Extensions): Serving AX (for C1) and DX, extensions are modular components that attach to the orchestrator and participate in each iteration of the loop, cleanly separating the core orchestration logic from agent capabilities. An extension is a typed configuration object that registers callbacks (e.g., on_input_messages, on_plain_text, on_tag, on_llm_output). These callbacks are invoked in a fixed order and have access to a shared run context exposing I/O interfaces, session-wide storage, hierarchical memory, and an artifact store. Extensions cover perception (parsing model outputs into structured actions), reasoning (shaping prompts, adding format instructions), and action (executing tools such as shell commands, file edits, and code search, and persisting results). This modularity enables composition and reuse across agents, improved observability, and easier ablation. CCA itself is instantiated by bundling coding-specific extensions such as file search, file editing, and CLI tools.
  4. F4 (Meta-agent): Primarily for DX, the Meta-agent automates a build-test-improve loop for synthesizing, evaluating, and refining agent configurations. Implemented as an agent built on the Confucius Orchestrator, it interactively constructs new agents from high-level natural language specifications. It generates structured configuration forms, determines repository scope and latency/safety constraints, and selects existing extensions. After user confirmation, it automatically synthesizes the agent's configuration and prompts and wires in the selected extensions and memory policies. The Meta-agent then spins up the candidate agent locally, drives it on regression tasks, and observes outputs and tool traces. Upon detecting failures (e.g., brittle tool selection, incorrect file-edit patterns), it proposes concrete modifications to prompts or extension configurations. These patches are applied and the test loop reruns, iteratively refining the agent until target metrics are met. This mechanism was used to develop CCA itself, leading to more reliable tool selection and recovery behaviors.
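The context compression step in F1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the SDK's implementation: Message, estimate_tokens, compress_history, and stub_architect are hypothetical names, the four-characters-per-token estimate is a rough stand-in for a real tokenizer, and the Architect's separate LLM call is replaced by a stub function.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str


def estimate_tokens(messages: List[Message]) -> int:
    """Crude size estimate (assumption: roughly 4 characters per token)."""
    return sum(len(m.content) for m in messages) // 4


def compress_history(
    messages: List[Message],
    summarize: Callable[[List[Message]], str],
    threshold_tokens: int = 1000,
    keep_recent: int = 4,
) -> List[Message]:
    """Replace older messages with one summary once the threshold is reached.

    The most recent `keep_recent` messages are kept verbatim; everything
    before them is passed to `summarize` and replaced by a single
    assistant message holding the result.
    """
    split = len(messages) - keep_recent
    if split <= 0 or estimate_tokens(messages) < threshold_tokens:
        return list(messages)  # nothing to compress yet
    old, recent = messages[:split], messages[split:]
    summary = Message(role="assistant", content=summarize(old))
    return [summary] + recent


def stub_architect(old: List[Message]) -> str:
    """Stand-in for the Architect's LLM call (hypothetical output format)."""
    return (
        f"Structured summary of {len(old)} earlier messages: "
        "task goals, decisions, open TODOs, critical error traces."
    )
```

Passing the summarizer as a callable keeps the sketch runnable without a model; in a real agent, summarize would issue the separate LLM call and return the structured summary text.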

The core of CCA is the Confucius Orchestrator, a minimal yet extensible execution loop (Algorithm 1) that repeatedly invokes the LLM, interprets its outputs, and coordinates tool use. It supports two modes of interaction with the LLM: native tool-use APIs (e.g., structured JSON) for advanced models, and XML-style tag parsing (e.g., <bash>...</bash>) for others, ensuring broad model compatibility. Iteration is capped by a maximum limit but is primarily agent-driven: the orchestrator treats an LLM output that emits no actions as a completion signal, while extensions can explicitly request continuation (e.g., the Bash extension raising an interrupt after command execution).
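The tag-parsing mode of such a loop can be sketched as below. This is a simplified illustration, not the paper's Algorithm 1: parse_actions, run_loop, and scripted_llm are hypothetical names, the LLM is a scripted stub, the tag handler only echoes the command instead of executing it, and "continue whenever an action was handled" stands in for the extension-driven continuation signal.

```python
import re
from typing import Callable, Dict, List, Tuple

# Matches simple, non-nested tags such as <bash>ls</bash>.
TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def parse_actions(text: str) -> List[Tuple[str, str]]:
    """Extract (tag, body) pairs from XML-style tags in the model output."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_RE.finditer(text)]


def run_loop(
    llm: Callable[[List[str]], str],
    handlers: Dict[str, Callable[[str], str]],
    task: str,
    max_iters: int = 10,
) -> List[str]:
    """Invoke the LLM until it emits no handled action or the cap is hit."""
    history = [task]
    for _ in range(max_iters):
        output = llm(history)
        history.append(output)
        actions = [(t, b) for t, b in parse_actions(output) if t in handlers]
        if not actions:
            break  # no actions emitted: treat as a completion signal
        for tag, body in actions:
            history.append(f"[{tag} result] {handlers[tag](body)}")
    return history


def scripted_llm(replies: List[str]) -> Callable[[List[str]], str]:
    """Stub LLM that returns canned replies in order (for illustration)."""
    it = iter(replies)
    return lambda history: next(it)


def fake_bash(cmd: str) -> str:
    """Placeholder handler: reports the command rather than running it."""
    return f"ran: {cmd}"
```

For example, run_loop(scripted_llm(["<bash>ls src</bash>", "Done."]), {"bash": fake_bash}, "fix bug") performs one tool step and then stops on the second reply, which contains no tags.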

The Confucius SDK also provides a full suite of developer tools to support the agent development cycle, including a Trace UI for visualizing call stacks, tool interactions, and memory flows; a Playground for interactive prompt refinement; an Eval UI for regression tests and benchmark evaluations; and centralized agent management for large-scale development, integration, deployment, and monitoring.

Experimental evaluation on SWE-Bench-Pro demonstrates CCA's strong performance, achieving a Resolve@1 of 54.3% with Claude Opus 4.5, exceeding prior research baselines and comparing favorably to commercial results under identical conditions. This performance underscores the paper's central argument that agent scaffolding, beyond just model capability, is a primary determinant of agent performance.