A new open-source artificial intelligence system, dubbed the Confucius Code Agent (CCA), has been unveiled by researchers from Meta and Harvard. Designed to function as an AI software engineer, the CCA targets real-world GitHub projects and complex industrial-scale software environments, promising reproducible results on demanding benchmarks like SWE Bench Pro.
The Confucius SDK: Elevating Agent Design Through Scaffolding
At the core of the Confucius Code Agent lies the Confucius SDK, an agent development platform that prioritizes 'scaffolding' as a fundamental design element, moving beyond simple language model wrappers. The SDK is structured around three critical axes: Agent Experience, User Experience, and Developer Experience.
- Agent Experience: Governs the information presented to the model, including context arrangement, active memory, and tool outcomes.
- User Experience: Emphasizes clear execution traces, code comparisons, and built-in protections for human engineers.
- Developer Experience: Focuses on the observability, configuration, and debugging processes for the agent itself.
Pioneering Mechanisms for Advanced AI Software Engineering
The SDK introduces several innovative mechanisms crucial for its capabilities:
- Unified Orchestrator with Hierarchical Working Memory: Real-world software tasks often require reasoning across many files and many interaction steps. The Confucius SDK's orchestrator manages a hierarchical working memory: it segments a task trajectory into distinct scopes, summarizes prior actions, and carries compressed context into subsequent turns. This keeps prompts within the model's context limit while preserving key artifacts such as patches, error logs, and design decisions, and it makes the case that coding agents benefit from an explicit memory structure.
- Persistent Note-Taking for Cross-Session Learning: A dedicated note-taking agent generates structured Markdown notes from execution traces. The notes capture task-specific strategies, repository conventions, and common failure patterns, and serve as long-term memory that can be reused across sessions. In experiments, giving the Confucius Code Agent these notes on a second run reduced average turns from 64 to 61, lowered token usage from approximately 104k to 93k, and raised Resolve@1 from 53.0 to 54.4.
- Modular Extensions and Sophisticated Tool Integration: The Confucius SDK exposes tools as modular extensions, covering file editing, command execution, test runners, and code search. Each extension maintains its own state and prompt integration. In an ablation study with Claude 4.5 Sonnet, moving from simple to richer tool handling raised Resolve@1 from 44.0 to 51.6, which suggests that how the agent selects and sequences tools has a substantial effect on success rates.
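The hierarchical working memory described in the first item above can be pictured with a small Python sketch. This is not the Confucius SDK's API: the class names `Scope` and `WorkingMemory`, the character budget, and the one-line placeholder summarizer are all assumptions made for illustration. It shows only the general idea of closing a scope into a summary, always keeping artifacts, and dropping the oldest summaries first when the prompt exceeds its budget.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scope:
    """One segment of a task trajectory (hypothetical structure)."""
    name: str
    turns: list[str] = field(default_factory=list)


class WorkingMemory:
    """Toy hierarchical memory: summaries of closed scopes + kept artifacts."""

    def __init__(self, budget_chars: int) -> None:
        self.budget_chars = budget_chars      # stand-in for a token limit
        self.summaries: list[str] = []        # compressed closed scopes
        self.artifacts: list[str] = []        # patches, error logs, decisions
        self.active: Optional[Scope] = None

    def open_scope(self, name: str) -> None:
        # Opening a new scope closes (and summarizes) the previous one.
        self.close_scope()
        self.active = Scope(name)

    def record(self, turn: str, artifact: Optional[str] = None) -> None:
        if self.active is None:
            raise RuntimeError("open a scope before recording turns")
        self.active.turns.append(turn)
        if artifact is not None:
            self.artifacts.append(artifact)   # artifacts survive summarization

    def close_scope(self) -> None:
        if self.active is None:
            return
        # Placeholder summarizer; a real agent would likely call an LLM here.
        self.summaries.append(
            f"[{self.active.name}] summarized {len(self.active.turns)} turns"
        )
        self.active = None

    def build_prompt(self) -> str:
        summaries = list(self.summaries)
        live = self.active.turns if self.active else []
        while True:
            text = "\n".join(summaries + self.artifacts + live)
            # Artifacts and live turns are never dropped in this sketch.
            if len(text) <= self.budget_chars or not summaries:
                return text
            summaries.pop(0)                  # drop the oldest summary first
```

In this sketch, raw turns from a closed scope never reach later prompts, while an error log or patch recorded as an artifact does. That is the trade-off the article describes: compress the history and keep the evidence.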
Automating Agent Design with a Meta Agent
Further enhancing the SDK's capabilities is a meta agent that streamlines the agent engineering process. This meta agent interprets natural language specifications for an agent, then iteratively proposes configurations, prompts, and tool sets. It subsequently evaluates candidate agents on tasks, analyzes performance metrics, and refines the configuration through an automated build, test, and improve cycle. Notably, the Confucius Code Agent itself was produced with the assistance of this meta agent, transforming parts of the agent design process into an LLM-guided optimization problem.
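The build, test, and improve cycle can be sketched as a simple search loop. The Python below is an illustration under stated assumptions, not the meta agent's actual implementation: `AgentConfig`, `improve_loop`, the toy task table, and the toy proposer and evaluator are all hypothetical. In the real system the proposal step would presumably be an LLM call and the evaluation step would run candidate agents on real tasks.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical agent configuration: a prompt style plus a tool set."""
    prompt_style: str
    tools: tuple[str, ...]


History = list[tuple[AgentConfig, float]]


def improve_loop(
    initial: AgentConfig,
    propose: Callable[[AgentConfig, History], AgentConfig],
    evaluate: Callable[[AgentConfig], float],
    rounds: int,
) -> tuple[AgentConfig, float]:
    """Propose a candidate, score it, and keep it only if it scores higher."""
    best, best_score = initial, evaluate(initial)
    history: History = [(best, best_score)]
    for _ in range(rounds):
        candidate = propose(best, history)   # "build"
        score = evaluate(candidate)          # "test"
        history.append((candidate, score))
        if score > best_score:               # "improve"
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins so the loop can run without any model or benchmark.
ALL_TOOLS = ("file_edit", "test_runner", "code_search")
TASKS = {
    "fix-import": "file_edit",
    "failing-test": "test_runner",
    "find-usage": "code_search",
}


def toy_evaluate(config: AgentConfig) -> float:
    """Fraction of toy tasks whose required tool is available."""
    solved = sum(1 for needed in TASKS.values() if needed in config.tools)
    return solved / len(TASKS)


def toy_propose(current: AgentConfig, history: History) -> AgentConfig:
    """Add the first tool the current configuration is missing."""
    for tool in ALL_TOOLS:
        if tool not in current.tools:
            return AgentConfig(current.prompt_style, current.tools + (tool,))
    return current


if __name__ == "__main__":
    start = AgentConfig(prompt_style="terse", tools=())
    print(improve_loop(start, toy_propose, toy_evaluate, rounds=5))
```

Keeping `propose` and `evaluate` as plain callables is a design choice of this sketch. It lets the loop stay the same whether the proposer is a heuristic, as here, or a language model reading the score history.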
Superior Performance on Challenging Benchmarks
Evaluations on SWE Bench Pro, which comprises 731 GitHub issues requiring real repository modifications, indicate that a strong scaffold can matter more than a larger language model alone. The Confucius Code Agent paired with Claude 4.5 Sonnet achieved a Resolve@1 of 52.7, edging out the more powerful Claude 4.5 Opus running on a less sophisticated scaffold, which scored 52.0. On SWE Bench Verified, the Confucius Code Agent with Claude 4 Sonnet reached a Resolve@1 of 74.6, ahead of SWE Agent (66.6) and OpenHands (72.8). Performance also remained stable as the number of edited files grew, including on tasks requiring changes to more than ten files, which points to its suitability for large codebases.
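To put the SWE Bench Pro gap in perspective, the reported percentages can be converted into approximate task counts. The short Python calculation below rests on two assumptions: that Resolve@1 is the percentage of tasks resolved in a single attempt, and that both scores were computed over all 731 issues. The function name `resolved_count` is made up for this example, and the counts are back-of-envelope estimates, not reported results.

```python
def resolved_count(resolve_at_1_percent: float, total_tasks: int) -> int:
    """Approximate number of resolved tasks implied by a Resolve@1 percentage."""
    return round(resolve_at_1_percent / 100 * total_tasks)


TOTAL_TASKS = 731  # SWE Bench Pro size as stated in the article

cca_sonnet = resolved_count(52.7, TOTAL_TASKS)     # CCA + Claude 4.5 Sonnet
opus_baseline = resolved_count(52.0, TOTAL_TASKS)  # Claude 4.5 Opus, simpler scaffold

print(cca_sonnet, opus_baseline, cca_sonnet - opus_baseline)  # prints: 385 380 5
```

Under those assumptions, the 0.7-point difference corresponds to roughly five issues out of 731.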
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost