Researchers from Stanford University, Together AI, Duke University, and Harvard University have unveiled DSGym, a framework engineered to evaluate and train data science agents. Unlike prior benchmarks that often reduce to simple code completion, DSGym pushes agents to scrutinize datasets, formulate workflows, execute code, and deliver verifiable solutions across more than 1,000 diverse data science challenges, all supported by expert-curated ground truth and a post-training pipeline.
Addressing Limitations in Existing Benchmarks
The development of DSGym was prompted by an investigation into existing benchmarks that claim to test data-aware agents. This analysis revealed significant shortcomings: many questions could be solved through pattern recognition or prior knowledge embedded in the question text alone, rather than genuine data analysis. For instance, even when the data files were concealed, models still reached 40.5% accuracy on QRData, 86.8% on DAEval, and 44.4% on DiscoveryBench. Furthermore, the team identified annotation errors and inconsistent numerical tolerances, underscoring the need for a more robust evaluation system.
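This hidden-data probe is simple to reproduce in outline. The sketch below is illustrative only: `ask_model` and `grade` are hypothetical helpers standing in for a model call and an answer checker, not part of DSGym.

```python
# Illustrative sketch of the hidden-data probe (hypothetical helpers, not DSGym code):
# score the same questions with and without their data files and compare accuracies.
def probe_shortcuts(tasks: list[dict], ask_model, grade) -> dict:
    """tasks: assumed to be dicts with 'question', 'files', and 'answer' keys."""
    with_data = without_data = 0
    for task in tasks:
        pred_full = ask_model(task["question"], files=task["files"])   # files visible
        pred_blind = ask_model(task["question"], files=None)           # files concealed
        with_data += int(grade(pred_full, task["answer"]))
        without_data += int(grade(pred_blind, task["answer"]))
    n = len(tasks)
    return {"acc_with_data": with_data / n, "acc_without_data": without_data / n}
```

A benchmark whose no-data accuracy stays high, as with DAEval above, is one whose questions largely do not require genuine data analysis.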
DSGym's Standardized Evaluation Framework
DSGym standardizes its evaluation process around three core components: Task, Agent, and Environment. Tasks fall into two categories, Data Analysis and Data Prediction. Data Analysis tasks present one or more files along with a natural language query that agents must answer through code. Data Prediction tasks, conversely, provide training and test data splits together with an explicit performance metric, and require the agent to construct a modeling pipeline that outputs predictions.
Each task is encapsulated within a Task Object, containing data files, the prompt, a scoring function, and relevant metadata. Agents interact with this system via a CodeAct-style loop, where each turn involves producing a reasoning block detailing their strategy, a code block for execution within the environment, and an answer block when a solution is ready. The Environment is implemented as a Docker container cluster, where worker containers mount data as read-only volumes, provide a writable workspace, and come equipped with domain-specific Python libraries.
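A minimal sketch of how these pieces might fit together is shown below, assuming hypothetical `llm` and `sandbox` interfaces; the class fields and method names are illustrative rather than DSGym's actual API.

```python
# Illustrative Task Object and CodeAct-style loop (names and signatures are assumed,
# not DSGym's actual interfaces).
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    task_id: str
    task_type: str                  # "analysis" or "prediction"
    prompt: str                     # natural language query or competition brief
    data_files: list[str]           # mounted read-only inside the worker container
    score: Callable[[str], float]   # scoring function applied to the final answer
    metadata: dict = field(default_factory=dict)

def run_codeact_agent(task: Task, llm, sandbox, max_turns: int = 20) -> float:
    """Each turn yields a reasoning block, a code block executed in the sandbox,
    and, once the agent is ready, an answer block passed to the scoring function."""
    history = [task.prompt]
    for _ in range(max_turns):
        turn = llm.generate("\n".join(history))        # assumed: returns reasoning/code/answer
        if turn.answer is not None:
            return task.score(turn.answer)
        result = sandbox.execute(turn.code)            # e.g. a Docker worker with read-only
        history.append(turn.reasoning)                 # data mounts and a writable workspace
        history.append(result.stdout + result.stderr)  # execution feedback for the next turn
    return 0.0
```

In this sketch the sandbox stands in for DSGym's worker containers; keeping data mounts read-only prevents an agent from altering the files it is being scored against.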
Expanding the Benchmark Suite: DSGym Tasks, DSBio, and DSPredict
Building on this runtime infrastructure, DSGym Tasks aggregates and refines existing benchmarks and introduces new datasets. Existing suites such as QRData, DAEval, DABStep, and MLEBench Lite were cleaned by eliminating unscorable items and applying a 'shortcut filter' that removes questions easily solved without actual data access.
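The cleaning step can be outlined roughly as below; `reference_model` and `grade` are hypothetical placeholders, and the filter simply drops items that a strong model answers correctly with the data files withheld.

```python
# Rough sketch of the benchmark-cleaning step (placeholder names, not the authors' code):
# remove items without a checkable ground truth, then apply a shortcut filter that drops
# questions answerable without access to the data files.
def clean_benchmark(tasks: list[dict], reference_model, grade) -> list[dict]:
    kept = []
    for task in tasks:
        if task.get("answer") is None:                              # unscorable item
            continue
        blind_pred = reference_model(task["question"], files=None)  # data files withheld
        if grade(blind_pred, task["answer"]):                       # solvable without data
            continue
        kept.append(task)
    return kept
```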
- DSBio: To address scientific discovery, DSGym introduces DSBio, a collection of 90 bioinformatics tasks derived from peer-reviewed publications and open-source datasets. These tasks span single-cell analysis, spatial and multi-omics, and human genetics, featuring deterministic numerical or categorical answers validated by expert reference notebooks.
- DSPredict: This component targets modeling challenges from real Kaggle competitions. A crawler gathers recent competitions accepting CSV submissions and meeting specific size and clarity criteria. After preprocessing, DSPredict is divided into 'Easy' (38 introductory competitions) and 'Hard' (54 highly complex challenges). In total, DSGym Tasks encompasses 972 data analysis tasks and 114 prediction tasks.
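As a rough illustration of how a DSPredict-style prediction task can be scored, the sketch below assumes a CSV submission keyed by an `id` column and uses RMSE as the metric; real competitions declare their own metric and file layout per task.

```python
# Hypothetical scoring of a prediction task: compare the agent's submission CSV
# against held-out labels using the competition's declared metric (RMSE here).
import pandas as pd
from sklearn.metrics import mean_squared_error

def score_submission(submission_csv: str, answers_csv: str,
                     id_col: str = "id", target_col: str = "target") -> float:
    sub = pd.read_csv(submission_csv)
    gold = pd.read_csv(answers_csv)
    merged = gold.merge(sub, on=id_col, suffixes=("_true", "_pred"))
    return mean_squared_error(merged[f"{target_col}_true"],
                              merged[f"{target_col}_pred"]) ** 0.5
```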
Insights into Current Agent Capabilities
Evaluations using DSGym covered leading closed-source models such as GPT-5.1, GPT-5, and GPT-4o, open-weight models such as Qwen3-Coder-480B and GPT-OSS-120B, and smaller models such as Qwen2.5-7B-Instruct. All models were tested under identical conditions: the same CodeAct agent, temperature 0, and external tools disabled.
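The controlled setup can be summarized as a small configuration sketch; the keys below are illustrative rather than DSGym's actual config schema, and the turn budget is an assumption.

```python
# Illustrative evaluation settings mirroring the controlled setup described above
# (keys and values other than those stated in the article are assumptions).
EVAL_CONFIG = {
    "agent": "codeact",      # identical agent scaffold for every model
    "temperature": 0.0,      # greedy decoding for reproducibility
    "external_tools": [],    # tools such as web search disabled
    "max_turns": 20,         # assumed turn budget, not stated in the article
}
```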
On verified general analysis benchmarks, including QRData Verified and DAEval Verified, top models achieved 60% to 90% exact match accuracy. However, performance significantly declined on DABStep Hard, indicating persistent challenges in multi-step quantitative reasoning over complex financial tables. DSBio revealed a more critical weakness, with Kimi-K2-Instruct achieving the highest overall accuracy at 43.33%. A substantial majority (85-96%) of failures on DSBio stemmed from domain grounding errors, such as misusing specialized libraries or incorrect biological interpretations, rather than basic coding errors.
While most frontier models achieved high valid submission rates (above 80%) on MLEBench Lite and DSPredict Easy, valid submissions rarely surpassed 70% on DSPredict Hard, and Kaggle leaderboard medal rates were negligible. This trend suggests a 'simplicity bias,' where agents often settle for baseline solutions instead of exploring more advanced models and hyperparameter tuning.
DSGym as a Training Ground for Data Science Agents
Beyond benchmarking, the DSGym environment also supports the synthesis of training data. Using subsets of QRData and DABStep, the researchers prompted agents to explore datasets, propose questions, solve them with code, and record the resulting trajectories, yielding 3,700 synthetic queries. A judge model then filtered these down to DSGym-SFT, a dataset of 2,000 high-quality query-trajectory pairs. Fine-tuning a 4B Qwen3-based model on DSGym-SFT produced an agent competitive with GPT-4o on the standardized analysis benchmarks despite its far smaller parameter count, demonstrating that execution-grounded supervision on structured tasks is an effective path to stronger data science agents.
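A rough outline of this synthesis loop is sketched below; `explorer_agent` and `judge` are placeholder interfaces, not the authors' implementation.

```python
# Sketch of the trajectory-synthesis pipeline (placeholder interfaces): an agent explores
# a dataset, poses a question, solves it with executed code, and a judge model keeps only
# high-quality query-trajectory pairs for supervised fine-tuning.
def synthesize_sft_pairs(datasets, explorer_agent, judge, n_per_dataset: int = 10) -> list[dict]:
    accepted = []
    for ds in datasets:                                   # e.g. QRData / DABStep subsets
        for _ in range(n_per_dataset):
            query = explorer_agent.propose_question(ds)   # explore files, propose a query
            trajectory = explorer_agent.solve(ds, query)  # reasoning + code + recorded answer
            if judge.accept(query, trajectory):           # judge filters low-quality pairs
                accepted.append({"query": query, "trajectory": trajectory})
    return accepted                                       # DSGym-SFT-style training pairs
```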
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost