In a significant development for artificial intelligence, researchers from CAMEL AI, Eigent AI, and other contributing organizations have unveiled SETA. This new open-source project provides a robust toolkit and environment stack specifically engineered to empower reinforcement learning (RL) in AI agents operating within Unix-style terminal environments. SETA aims to furnish a complete, end-to-end solution for developing, training, and evaluating agents tasked with executing verifiable operations under benchmark harnesses such as Terminal Bench.
Key Innovations Driving Progress
The SETA initiative distinguishes itself through several core advancements:
- Pioneering Terminal Agent Performance: An agent developed using SETA, powered by Claude Sonnet 4.5, has established a new benchmark for state-of-the-art performance on Terminal Bench 2.0. Similarly, a GPT-4.1-based agent achieved top scores on Terminal Bench 1.0. These comparisons are restricted to agents utilizing the same foundational language models, highlighting SETA's optimization capabilities.
- Scalable RL Training with Synthetic Environments: The research team has released an initial collection of 400 synthetic terminal tasks of varying complexity. Of these, 260 are used for RLHF-style fine-tuning of a Qwen3-8B model, demonstrating a scalable approach to agent training.
- Elegant Agent Architecture for Broad Application: A unified agent implementation underpins both local task execution and official Terminal Bench evaluations. This streamlined design promotes consistency and enhances the agent's ability to generalize across different training and assessment frameworks.
Deep Dive into the SETA Framework
The SETA code repository introduces a specialized Terminal Toolkit designed to convert a language model into a functional terminal agent. During each task execution, the framework meticulously generates a structured log directory, providing essential insights into agent activity. Key files within this structure include:
- `chatagent.log`: This file maintains a comprehensive history of the agent's messages, tool invocations, and test outcomes.
- A `sessions` directory: This contains `session_logs` that capture detailed terminal interactions initiated by the toolkit.
- Specific log files within `session_logs`, such as `blocking_commands.log` and `session_run_zork_1_correct_path.log`, record command outputs across various sessions and operational modes.
- `tests.log` and `tests.log.strip`: These document the results of test runs, with the latter removing control characters for cleaner analysis.
This organized logging structure offers a practical pathway for debugging, enabling developers to trace agent decisions from high-level chat interactions down to individual shell commands and verify task success or failure through test logs.
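The `tests.log.strip` convention, a copy of the raw log with control characters removed, can be reproduced with a short filter. The sketch below is illustrative rather than SETA's actual implementation; it assumes the common case of ANSI escape sequences mixed into terminal output.

```python
import re

# Pattern covering ANSI CSI sequences (e.g. color codes), lone escape pairs,
# and other non-printing control characters except newline and tab.
ANSI_RE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]|\x1b.|[\x00-\x08\x0b-\x1f\x7f]")

def strip_control(text: str) -> str:
    """Remove ANSI escape codes and control characters for cleaner log analysis."""
    return ANSI_RE.sub("", text)

def write_stripped(log_path: str) -> None:
    """Write a `<log>.strip` file alongside the raw log, mirroring tests.log.strip."""
    with open(log_path, encoding="utf-8", errors="replace") as f:
        cleaned = strip_control(f.read())
    with open(log_path + ".strip", "w", encoding="utf-8") as f:
        f.write(cleaned)
```

A filter like this keeps diffing and grepping over test output reliable, since raw terminal logs are often littered with color and cursor-movement codes.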
For formal Terminal Bench evaluation, the GitHub repository includes dedicated entry points. Developers can navigate to a specific directory and execute scripts like run_eval.sh for Terminal Bench 1.0 or run_tb2.sh for Terminal Bench 2.0. Evaluation results are stored in a designated JSON file, with task-specific session logs residing in their respective directories.
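Once a run finishes, the aggregated JSON results can be post-processed for quick comparisons. The snippet below is a hypothetical sketch: the field names (`results`, `task_id`, `is_resolved`) are assumptions for illustration, not SETA's documented schema.

```python
import json

def summarize(results_path: str) -> float:
    """Compute overall accuracy from a per-task results JSON file.

    Assumes a hypothetical schema:
    {"results": [{"task_id": ..., "is_resolved": bool}, ...]}
    """
    with open(results_path, encoding="utf-8") as f:
        data = json.load(f)
    tasks = data["results"]
    resolved = sum(1 for t in tasks if t["is_resolved"])
    return resolved / len(tasks)
```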
Enhancing Agent Memory: The Note Taking Toolkit
A notable addition to the SETA framework is the Note Taking Toolkit, conceptualized as a form of persistent memory for complex, long-horizon tasks. Example tool calls illustrate the agent's capacity to systematically record and retrieve notes while addressing terminal challenges. While the current public documentation highlights the existence and application examples of this toolkit, a complete training objective for optimal note utilization is still under development. Nevertheless, this feature provides agents with a distinct channel to externalize intermediate findings and crucial hints, separate from the immediate terminal buffer.
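The public materials show example tool calls rather than a full API, but the underlying idea, a small store the agent can write to and read from outside the terminal buffer, can be sketched as follows. The class and method names here are illustrative assumptions, not SETA's actual interface.

```python
import json
from pathlib import Path

class NoteTakingToolkit:
    """Minimal persistent-memory sketch: notes survive across agent steps
    by living on disk instead of in the terminal scrollback.

    Illustrative only -- not SETA's real Note Taking Toolkit API.
    """

    def __init__(self, note_dir: str = "notes"):
        self.path = Path(note_dir) / "notes.json"
        self.path.parent.mkdir(parents=True, exist_ok=True)
        if not self.path.exists():
            self.path.write_text("{}")

    def write_note(self, key: str, content: str) -> None:
        """Record an intermediate finding under a key the agent chooses."""
        notes = json.loads(self.path.read_text())
        notes[key] = content
        self.path.write_text(json.dumps(notes, indent=2))

    def read_note(self, key: str) -> str:
        """Retrieve a previously recorded note, or an empty string."""
        return json.loads(self.path.read_text()).get(key, "")
```

Exposing `write_note` and `read_note` as tool calls gives the model an explicit channel for long-horizon state, which is the role the article ascribes to the toolkit.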
Unpacking SETA's Performance Milestones
The agent harness within SETA has achieved strong results on Terminal Bench. An agent powered by Claude Sonnet 4.5 attained 46.5% accuracy on Terminal Bench 2.0 across 89 real-world tasks. This performance secures the top rank, surpassing the second-best system by 3 percentage points, with particular strengths in git workflows, DevOps automation, and code security tasks. On Terminal Bench 1.0, a GPT-4.1-based agent reached 35% accuracy, exceeding the next competitor within the same model family by 4.7 percentage points. For context, a supervised Qwen3-8B baseline registered only 3.4% on Terminal Bench 2.0, an outcome substantially improved upon by the Qwen3-8B terminal agent trained with the SETA RL pipeline in the custom synthetic environments.
Key Takeaways for Developers and Researchers
- SETA stands as a collaborative open-source project, providing both essential agent toolkits and synthetic RL environments tailored for terminal agents, fully compatible with the Terminal Bench evaluation format.
- The framework demonstrates leading performance for CAMEL terminal agents on both Terminal Bench 1.0 and 2.0, using Claude Sonnet 4.5 and GPT-4.1 as foundational models, benchmarked specifically against agents employing the same model families.
- The open-source SETA RL dataset, available on Hugging Face, encompasses 400 synthetic terminal tasks, each structured with a `task.yaml`, `Dockerfile`, and `run-tests.sh`, with 260 tasks dedicated to RLHF-style fine-tuning of a Qwen3-8B based agent.
- The SETA codebase offers a Terminal Toolkit with structured logging capabilities and a Note Taking Toolkit for long-term memory, integrating with Terminal Bench evaluation scripts and logging paths in its GitHub repository.
- This comprehensive design offers a clear, repeatable path from synthetic RL environments to agents verified by benchmarks, empowering developers with a consistent stack for training, debugging, and evaluating terminal agents, moving beyond ad-hoc tool-calling demonstrations.
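Given the per-task layout described above (a `task.yaml`, `Dockerfile`, and `run-tests.sh` per task), a dataset checkout can be sanity-checked before training. This is an illustrative helper under that assumed layout, not part of SETA's tooling.

```python
from pathlib import Path

# Files the article describes as present in every synthetic task directory.
REQUIRED = ("task.yaml", "Dockerfile", "run-tests.sh")

def incomplete_tasks(dataset_root: str) -> list[str]:
    """Return names of task directories missing any required file."""
    bad = []
    for task_dir in sorted(Path(dataset_root).iterdir()):
        if not task_dir.is_dir():
            continue
        if any(not (task_dir / name).is_file() for name in REQUIRED):
            bad.append(task_dir.name)
    return bad
```

Running a check like this before launching an RL job catches broken task directories early, rather than mid-rollout inside a container.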
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost