Researchers from Stanford University, Together AI, Duke University, and Harvard University have unveiled DSGym, a framework engineered to evaluate and train data science agents. Unlike prior benchmarks that often reduce to simple code completion, DSGym pushes agents to scrutinize datasets, formulate workflows, execute code, and deliver verifiable solutions across more than 1,000 diverse data science challenges, all supported by expert-curated ground truth and a post-training pipeline.
Addressing Limitations in Existing Benchmarks
The development of DSGym was prompted by an investigation into existing benchmarks that claim to test data-aware agents. This analysis revealed significant shortcomings: many questions could be solved through pattern recognition or prior knowledge embedded in the question text alone, rather than genuine data analysis. For instance, even when the data files were concealed, models still reached 40.5% accuracy on QRData, 86.8% on DAEval, and 44.4% on DiscoveryBench. Furthermore, the team identified annotation errors and inconsistent numerical tolerances, underscoring the need for a more robust evaluation system.
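This hidden-data probe is simple to reproduce in outline. The sketch below is illustrative only: `ask_model` and `grade` are hypothetical helpers standing in for a model call and an answer checker, not part of DSGym.

```python
# Illustrative sketch of the hidden-data probe (hypothetical helpers, not DSGym code):
# score the same questions with and without their data files and compare accuracies.
def probe_shortcuts(tasks: list[dict], ask_model, grade) -> dict:
    """tasks: assumed to be dicts with 'question', 'files', and 'answer' keys."""
    with_data = without_data = 0
    for task in tasks:
        pred_full = ask_model(task["question"], files=task["files"])   # files visible
        pred_blind = ask_model(task["question"], files=None)           # files concealed
        with_data += int(grade(pred_full, task["answer"]))
        without_data += int(grade(pred_blind, task["answer"]))
    n = len(tasks)
    return {"acc_with_data": with_data / n, "acc_without_data": without_data / n}
```

A benchmark whose no-data accuracy stays high, as with DAEval above, is one whose questions largely do not require genuine data analysis.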
DSGym's Standardized Evaluation Framework
DSGym standardizes its evaluation process around three core components: Task, Agent, and Environment. Tasks fall into two categories, Data Analysis and Data Prediction. Data Analysis tasks present one or more files along with a natural language query that agents must answer through code. Data Prediction tasks, conversely, provide training and test data splits together with an explicit performance metric, and require the agent to construct a modeling pipeline that outputs predictions.
Each task is encapsulated within a Task Object, containing data files, the prompt, a scoring function, and relevant metadata. Agents interact with this system via a CodeAct-style loop, where each turn involves producing a reasoning block detailing their strategy, a code block for execution within the environment, and an answer block when a solution is ready. The Environment is implemented as a Docker container cluster, where worker containers mount data as read-only volumes, provide a writable workspace, and come equipped with domain-specific Python libraries.
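A minimal sketch of how these pieces might fit together is shown below, assuming hypothetical `llm` and `sandbox` interfaces; the class fields and method names are illustrative rather than DSGym's actual API.

```python
# Illustrative Task Object and CodeAct-style loop (names and signatures are assumed,
# not DSGym's actual interfaces).
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    task_id: str
    task_type: str                  # "analysis" or "prediction"
    prompt: str                     # natural language query or competition brief
    data_files: list[str]           # mounted read-only inside the worker container
    score: Callable[[str], float]   # scoring function applied to the final answer
    metadata: dict = field(default_factory=dict)

def run_codeact_agent(task: Task, llm, sandbox, max_turns: int = 20) -> float:
    """Each turn yields a reasoning block, a code block executed in the sandbox,
    and, once the agent is ready, an answer block passed to the scoring function."""
    history = [task.prompt]
    for _ in range(max_turns):
        turn = llm.generate("\n".join(history))        # assumed: returns reasoning/code/answer
        if turn.answer is not None:
            return task.score(turn.answer)
        result = sandbox.execute(turn.code)            # e.g. a Docker worker with read-only
        history.append(turn.reasoning)                 # data mounts and a writable workspace
        history.append(result.stdout + result.stderr)  # execution feedback for the next turn
    return 0.0
```

In this sketch the sandbox stands in for DSGym's worker containers; keeping data mounts read-only prevents an agent from altering the files it is being scored against.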
Expanding the Benchmark Suite: DSGym Tasks, DSBio, and DSPredict
Building on this runtime infrastructure, DSGym Tasks aggregates and refines existing benchmarks and introduces new datasets. Existing suites such as QRData, DAEval, DABStep, and MLEBench Lite were cleaned by eliminating unscorable items and applying a 'shortcut filter' that removes questions easily solved without actual data access.
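The cleaning step can be outlined roughly as below; `reference_model` and `grade` are hypothetical placeholders, and the filter simply drops items that a strong model answers correctly with the data files withheld.

```python
# Rough sketch of the benchmark-cleaning step (placeholder names, not the authors' code):
# remove items without a checkable ground truth, then apply a shortcut filter that drops
# questions answerable without access to the data files.
def clean_benchmark(tasks: list[dict], reference_model, grade) -> list[dict]:
    kept = []
    for task in tasks:
        if task.get("answer") is None:                              # unscorable item
            continue
        blind_pred = reference_model(task["question"], files=None)  # data files withheld
        if grade(blind_pred, task["answer"]):                       # solvable without data
            continue
        kept.append(task)
    return kept
```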
- DSBio: To address scientific discovery, DSGym introduces DSBio, a collection of 90 bioinformatics tasks derived from peer-reviewed publications and open-source datasets. These tasks span single-cell analysis, spatial and multi-omics, and human genetics, featuring deterministic numerical or categorical answers validated by expert reference notebooks.
- DSPredict: This component targets modeling challenges from real Kaggle competitions. A crawler gathers recent competitions accepting CSV submissions and meeting specific size and clarity criteria. After preprocessing, DSPredict is divided into 'Easy' (38 introductory competitions) and 'Hard' (54 highly complex challenges). In total, DSGym Tasks encompasses 972 data analysis tasks and 114 prediction tasks.
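As a rough illustration of how a DSPredict-style prediction task can be scored, the sketch below assumes a CSV submission keyed by an `id` column and uses RMSE as the metric; real competitions declare their own metric and file layout per task.

```python
# Hypothetical scoring of a prediction task: compare the agent's submission CSV
# against held-out labels using the competition's declared metric (RMSE here).
import pandas as pd
from sklearn.metrics import mean_squared_error

def score_submission(submission_csv: str, answers_csv: str,
                     id_col: str = "id", target_col: str = "target") -> float:
    sub = pd.read_csv(submission_csv)
    gold = pd.read_csv(answers_csv)
    merged = gold.merge(sub, on=id_col, suffixes=("_true", "_pred"))
    return mean_squared_error(merged[f"{target_col}_true"],
                              merged[f"{target_col}_pred"]) ** 0.5
```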
Insights into Current Agent Capabilities
Evaluations using DSGym covered leading closed-source models such as GPT-5.1, GPT-5, and GPT-4o, open-weight models such as Qwen3-Coder-480B and GPT-OSS-120B, and smaller models such as Qwen2.5-7B-Instruct. All models were tested under identical conditions: the same CodeAct agent, temperature 0, and external tools disabled.
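The controlled setup can be summarized as a small configuration sketch; the keys below are illustrative rather than DSGym's actual config schema, and the turn budget is an assumption.

```python
# Illustrative evaluation settings mirroring the controlled setup described above
# (keys and values other than those stated in the article are assumptions).
EVAL_CONFIG = {
    "agent": "codeact",      # identical agent scaffold for every model
    "temperature": 0.0,      # greedy decoding for reproducibility
    "external_tools": [],    # tools such as web search disabled
    "max_turns": 20,         # assumed turn budget, not stated in the article
}
```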
On verified general analysis benchmarks, including QRData Verified and DAEval Verified, top models achieved 60% to 90% exact match accuracy. However, performance significantly declined on DABStep Hard, indicating persistent challenges in multi-step quantitative reasoning over complex financial tables. DSBio revealed a more critical weakness, with Kimi-K2-Instruct achieving the highest overall accuracy at 43.33%. A substantial majority (85-96%) of failures on DSBio stemmed from domain grounding errors, such as misusing specialized libraries or incorrect biological interpretations, rather than basic coding errors.
While most frontier models achieved high valid submission rates (above 80%) on MLEBench Lite and DSPredict Easy, valid submissions rarely surpassed 70% on DSPredict Hard, and Kaggle leaderboard medal rates were negligible. This trend suggests a 'simplicity bias,' where agents often settle for baseline solutions instead of exploring more advanced models and hyperparameter tuning.
DSGym as a Training Ground for Data Science Agents
Beyond benchmarking, the DSGym environment also supports the synthesis of training data. Using subsets of QRData and DABStep, the researchers prompted agents to explore datasets, propose questions, solve them with code, and record the resulting trajectories, yielding 3,700 synthetic queries. A judge model then filtered these down to DSGym-SFT, a dataset of 2,000 high-quality query-trajectory pairs. Fine-tuning a 4B Qwen3-based model on DSGym-SFT produced an agent competitive with GPT-4o on the standardized analysis benchmarks despite its far smaller parameter count, demonstrating that execution-grounded supervision on structured tasks is an effective path to stronger data science agents.
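A rough outline of this synthesis loop is sketched below; `explorer_agent` and `judge` are placeholder interfaces, not the authors' implementation.

```python
# Sketch of the trajectory-synthesis pipeline (placeholder interfaces): an agent explores
# a dataset, poses a question, solves it with executed code, and a judge model keeps only
# high-quality query-trajectory pairs for supervised fine-tuning.
def synthesize_sft_pairs(datasets, explorer_agent, judge, n_per_dataset: int = 10) -> list[dict]:
    accepted = []
    for ds in datasets:                                   # e.g. QRData / DABStep subsets
        for _ in range(n_per_dataset):
            query = explorer_agent.propose_question(ds)   # explore files, propose a query
            trajectory = explorer_agent.solve(ds, query)  # reasoning + code + recorded answer
            if judge.accept(query, trajectory):           # judge filters low-quality pairs
                accepted.append({"query": query, "trajectory": trajectory})
    return accepted                                       # DSGym-SFT-style training pairs
```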
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost