StepFun AI Launches Step-DeepResearch: A New Era for AI-Powered Scholarly Inquiry
Monday, January 26, 2026 · 6 min read

StepFun AI has unveiled Step-DeepResearch, a 32-billion-parameter artificial intelligence agent designed to elevate common web searches into rigorous research endeavors. Built on the Qwen2.5-32B-Base architecture, this end-to-end deep research agent plans, explores sources, validates evidence, and generates cited reports as a single model, all while keeping inference costs low.

From Basic Search to In-Depth Exploration

Many existing web agents are primarily optimized for multi-hop question-answering, focusing on matching predefined answers to concise queries. This approach often resembles targeted information retrieval rather than genuine, expansive research. Deep research tasks, by contrast, demand nuanced capabilities such as recognizing latent intentions, making decisions over extended timeframes, utilizing tools across multiple interactions, structured reasoning, and cross-source verification amidst uncertainty.

Step-DeepResearch reimagines this complex process as a sequence of decisions executed through a compact set of core "atomic capabilities." The development team has identified four such foundational skills: strategic planning and task breakdown, in-depth information gathering, critical reflection and validation, and professional report drafting. Rather than coordinating numerous external sub-agents, the system integrates this entire operational cycle into a single model that autonomously determines its subsequent action at each stage.

Specialized Data Synthesis for Core Skills

To instill these distinct atomic capabilities, StepFun’s researchers developed bespoke data pipelines for each skill. For comprehensive planning, the team curated high-quality technical reports, survey papers, and financial analyses. They then reverse-engineered realistic research plans and task hierarchies from document titles, abstracts, and structures, subsequently generating trajectories that adhere to these complex plans. This methodology exposes the model to extensive project structures, moving beyond simple question templates.

For advanced information seeking, graph-based queries were constructed over knowledge graphs like Wikidata5m and CN-DBpedia. Subgraphs were sampled and expanded via search, generating questions that necessitate multi-hop reasoning across various entities and documents. Another pipeline employed a Wiki-style hyperlink index to enforce cross-document retrieval and evidence synthesis. Training data was carefully filtered to exclude easy questions solvable by simpler strategies, thereby concentrating on challenging search problems.
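The graph-based synthesis step can be illustrated with a toy sketch: sample a connected chain of triples from a knowledge graph, then compose a question that can only be answered by following every hop. The graph, entity names, and question template below are invented for illustration; the actual pipeline operates over Wikidata5m and CN-DBpedia at scale.

```python
import random

# Toy knowledge graph as (head, relation, tail) triples. Entities and
# relations are invented; the real pipeline uses Wikidata5m / CN-DBpedia.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Poland", "currency", "zloty"),
]

def sample_multihop_chain(triples, hops=2, seed=0):
    """Random-walk the graph to build a chain of connected triples."""
    rng = random.Random(seed)
    by_head = {}
    for h, r, t in triples:
        by_head.setdefault(h, []).append((r, t))
    node = rng.choice(sorted(by_head))   # deterministic given the seed
    chain = []
    for _ in range(hops):
        if node not in by_head:
            break                        # walk reached a leaf entity
        r, t = rng.choice(by_head[node])
        chain.append((node, r, t))
        node = t
    return chain

def chain_to_question(chain):
    """Compose a question whose answer requires following every hop."""
    first_head, first_rel, _ = chain[0]
    last_rel = chain[-1][1]
    question = (f"What is the {last_rel} of the entity reached from "
                f"'{first_head}' via '{first_rel}'?")
    return question, chain[-1][2]        # (question, gold answer)
```

A filtering stage like the one described would then discard any question a baseline single-hop strategy answers correctly.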

Reflection and verification data were produced using self-correction loops and sophisticated multi-agent teacher traces. These teacher agents extracted claims, formulated checks, verified facts, refined plans if inconsistencies arose, and then compiled reports. The resulting robust trajectories were then refined and utilized for supervising a single student agent. Report generation training occurred in two phases: an initial phase for domain-specific style and depth using query-report pairs, followed by supervised fine-tuning with stringent formatting and plan consistency requirements.

Progressive Training with Extended Context

The training regimen for Step-DeepResearch involved three distinct stages: agentic mid-training, supervised fine-tuning, and reinforcement learning. During the first mid-training stage, atomic capabilities were introduced without explicit tools, leveraging context lengths up to 32,000 tokens. This data encompassed active reading, synthetic reasoning traces, summarization, and reflection. Researchers observed consistent performance improvements on benchmarks such as SimpleQA, TriviaQA, and FRAMES as training progressed to approximately 150 billion tokens, with the most significant gains noted on FRAMES, which specifically tests structured reasoning.

Stage two extended the context window to 128,000 tokens and integrated explicit tool calls. The model learned tasks including URL-based question answering, deep web search, extensive document summarization, and long dialogue reasoning. This phase aligned the model with practical research scenarios where searching, browsing, and analytical tasks must be seamlessly combined within a single workflow.

During supervised fine-tuning, all four atomic capabilities were integrated into complete deep search and research trajectories. Data cleaning ensured that only accurate and concise trajectories (in terms of steps and tool calls) were retained. The pipeline deliberately introduced controlled tool errors followed by corrections to enhance robustness, and enforced strict citation formats to guarantee reports were accurately grounded in retrieved sources.
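The controlled-tool-error idea can be sketched as a trajectory transform that prefixes some tool calls with a simulated failure and an error observation, so the fine-tuning data shows the agent recovering by retrying. The message schema and error text below are assumptions, not StepFun's actual trace format.

```python
import random

def inject_tool_error(trajectory, error_rate=0.3, seed=0):
    """Before some tool calls, insert a simulated failed attempt and an
    error observation; the original call then reads as a successful
    retry. The step dicts here are a hypothetical simplification."""
    rng = random.Random(seed)
    augmented = []
    for step in trajectory:
        if step.get("role") == "tool_call" and rng.random() < error_rate:
            augmented.append(dict(step))                    # failed attempt
            augmented.append({"role": "tool_result",
                              "error": "timeout: tool call failed, retrying"})
        augmented.append(step)                              # successful retry
    return augmented
```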

Finally, reinforcement learning optimized the agent within a live tool environment. Tasks and checklists were developed through reverse synthesis, and a checklist-style Rubrics Judge was trained to assess reports along detailed dimensions. The reward mechanism translated ternary rubric labels into asymmetric binary rewards, capturing both positive targets and identified violations. The policy was refined using Proximal Policy Optimization (PPO) with a learned critic, employing generalized advantage estimation with minimal discounting so that rewards late in lengthy trajectories are not discounted away.
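A minimal sketch of that reward mapping: each rubric criterion carries a ternary label (satisfied, not applicable, violated), and violations are penalized more heavily than satisfactions are rewarded, which is what makes the collapsed reward asymmetric. The specific weights below are illustrative assumptions, not StepFun's published values.

```python
def rubric_reward(labels, hit_weight=1.0, violation_weight=1.5):
    """Collapse ternary rubric labels (+1 satisfied, 0 not applicable,
    -1 violated) into one scalar reward. The asymmetry comes from
    weighting violations above hits; both weights are illustrative."""
    hits = sum(1 for label in labels if label == 1)
    violations = sum(1 for label in labels if label == -1)
    scored = hits + violations      # not-applicable criteria are skipped
    if scored == 0:
        return 0.0
    return (hit_weight * hits - violation_weight * violations) / scored
```

Under this scheme a report satisfying two criteria and violating one scores below the simple hit rate would suggest, pushing the policy to avoid violations rather than merely accumulate hits.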

Integrated Architecture and Robust Search Capabilities

For inference, Step-DeepResearch operates as a single ReAct-style agent, iteratively alternating between internal thought processes, tool invocations, and observations until a report is ready for output. Its toolkit includes batch web search, a task manager, shell commands, and file operations. Execution occurs within a sandboxed environment, maintaining terminal persistence via tmux. A perception-oriented browser minimizes redundant page captures through perceptual hash distance. Support for multimodal inputs is provided by tools for document parsing, audio transcription, and image analysis.
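The perceptual-hash deduplication can be sketched with a standard average hash: downsample a screenshot to an 8x8 grid, threshold each cell at the mean, and skip a re-capture whose hash lies within a small Hamming distance of the previous one. This pure-Python version, and the 5-bit threshold, are illustrative assumptions; a production agent would typically use a library such as ImageHash.

```python
def average_hash(gray, size=8):
    """Average hash of a grayscale image given as a 2D list of 0-255
    ints: nearest-neighbor downsample to size x size, threshold at the
    mean. A minimal sketch of a standard perceptual hash."""
    h, w = len(gray), len(gray[0])
    cells = []
    for i in range(size):
        for j in range(size):
            cells.append(gray[i * h // size][j * w // size])
    mean = sum(cells) / len(cells)
    return [1 if c > mean else 0 for c in cells]

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return sum(x != y for x, y in zip(a, b))

def is_duplicate_capture(prev_hash, new_hash, threshold=5):
    """Skip re-capturing a page whose hash is within `threshold` bits
    of the previous capture (threshold is an assumed value)."""
    return hamming(prev_hash, new_hash) <= threshold
```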

Information acquisition leverages two critical resources. The StepFun team indicates that its proprietary Search API draws upon a vast repository of over 20 million high-quality academic papers and more than 600 premium indices. Furthermore, a meticulously curated authority indexing strategy identifies over 600 trusted domains, encompassing governmental, academic, and institutional websites. Retrieval operates at the paragraph level, employing authority-aware ranking to prioritize information from highly trusted sources when relevance is comparable.
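Authority-aware ranking of that kind might look like the following sketch: sort passages by relevance, but when two scores fall within a small epsilon of each other, prefer the passage from a trusted domain. The field names, the epsilon tie-break rule, and the example domains are assumptions for illustration, not StepFun's actual ranking function.

```python
def authority_rank(passages, trusted_domains, rel_epsilon=0.05):
    """Sort passages by relevance, breaking near-ties (scores within
    rel_epsilon) in favor of trusted domains. Schema is hypothetical."""
    def key(p):
        # bucket relevance so scores within rel_epsilon compare equal,
        # then prefer trusted domains inside each bucket
        bucket = round(p["relevance"] / rel_epsilon)
        trusted = p["domain"] in trusted_domains
        return (-bucket, not trusted, -p["relevance"])
    return sorted(passages, key=key)
```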

The file management tools facilitate patch-based editing, allowing the agent to update only modified sections of a report. A summary-aware storage mechanism saves complete tool outputs to local files while injecting only concise summaries into the agent’s context. This functions as an external memory system, preventing context overflow during extensive projects.
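The summary-aware storage mechanism can be approximated as a tiny key-value store: the full tool output goes to a file on disk, and only a content-hash key plus a short summary is returned for the agent's context. The class name, API, and truncation-based "summary" below are simplifications for illustration; the real system presumably produces genuine summaries.

```python
import hashlib
import pathlib
import tempfile

class SummaryStore:
    """Minimal external memory: persist full tool output to disk and
    return only a handle plus a short summary for the agent's context.
    A hypothetical sketch, not StepFun's implementation."""

    def __init__(self, root):
        self.root = pathlib.Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, text, summary_chars=200):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        (self.root / f"{key}.txt").write_text(text, encoding="utf-8")
        summary = text[:summary_chars].rstrip()
        if len(text) > summary_chars:
            summary += "..."
        return key, summary   # only these two enter the model context

    def get(self, key):
        """Reload the full output when the agent actually needs it."""
        return (self.root / f"{key}.txt").read_text(encoding="utf-8")

# demo: write a long tool output, keep only the summary in context
store = SummaryStore(tempfile.mkdtemp())
key, summary = store.put("long tool output " * 200)
```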

Evaluation Highlights and Cost Efficiency

To thoroughly assess its deep research capabilities, the StepFun team introduced ADR-Bench, a Chinese benchmark comprising 110 open-ended tasks across nine diverse domains. Seventy of these tasks cover general areas like education, science, engineering, and social life, with evaluations conducted via expert side-by-side comparisons. The remaining 40 tasks in finance and law are scored using explicit rubrics that adhere to principles of atomicity and verifiability.

On the rigorous Scale AI Research Rubrics, Step-DeepResearch achieved 61.42 percent rubric compliance, a performance level comparable to leading models such as OpenAI-DeepResearch and Gemini-DeepResearch, and clearly ahead of numerous open-source and proprietary baselines. On ADR-Bench, expert-based Elo ratings showed the 32B model surpassing larger open models such as MiniMax-M2, GLM-4.6, and DeepSeek-V3.2, while remaining competitive with established systems such as Kimi-Researcher and MiniMax-Agent-Pro.

These evaluations underscore Step-DeepResearch’s competitive quality at a notably lower operational cost, marking a significant advancement in accessible, high-fidelity AI-powered research.

This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.

Source: MarkTechPost