Introducing NousCoder-14B: A New Era for AI in Competitive Coding
Nous Research has released NousCoder-14B, an AI model engineered for competitive programming. Built by post-training the Qwen3-14B foundation model with reinforcement learning (RL) on verifiable rewards, it reaches 67.87% Pass@1 on the LiveCodeBench v6 benchmark (problems from August 2024 to May 2025), a 7.08-percentage-point improvement over the Qwen3-14B baseline's 60.79% on the same evaluation. The trained weights are freely available on Hugging Face under the Apache 2.0 license, fostering open-source collaboration.
Benchmarking Competitive Code Performance
LiveCodeBench v6, a benchmark built specifically for competitive programming, serves as the primary evaluation standard for NousCoder-14B. Its test split comprises 454 distinct problems. A solution counts only if it passes every hidden input-output test while respecting the time and memory limits. The Pass@1 metric is the fraction of problems for which the first generated program satisfies all of these criteria.
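The Pass@1 metric described above is simple to compute. The sketch below assumes a hypothetical `results` list holding one boolean per benchmark problem, recording whether the first generated program passed every hidden test:

```python
def pass_at_1(results: list[bool]) -> float:
    """Fraction of problems solved by the first generated attempt."""
    if not results:
        return 0.0
    return sum(results) / len(results)

# Example: the first attempt passes all hidden tests on 3 of 4 problems.
print(pass_at_1([True, True, False, True]))  # 0.75
```

On the 454-problem test split, a 67.87% Pass@1 corresponds to roughly 308 problems solved on the first attempt.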
Crafting the RL Training Dataset
For its reinforcement learning regimen, NousCoder-14B was trained on 24,000 verifiable code generation problems, each paired with a reference implementation and numerous test cases. These were sourced from TACO Verified, PrimeIntellect SYNTHETIC 1, and LiveCodeBench challenges compiled before July 31, 2024. Evaluation used LiveCodeBench v6, a distinct set of 454 problems from August 2024 to May 2025. This setup mirrors real competitive programming tasks and suits RL well: executing the generated code against the test cases yields a computationally cheap, binary reward signal.
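A verifiable problem of this kind can be represented very simply. The record below is a hypothetical illustration of the shape described above (statement, reference implementation, and input/output test cases); the field names are not from the source:

```python
# One hypothetical verifiable training problem: enough structure to
# execute a candidate program and check its output deterministically.
problem = {
    "statement": "Read n space-separated integers and print their sum.",
    "reference_solution": "print(sum(int(x) for x in input().split()))",
    "tests": [
        {"input": "1 2 3", "output": "6"},
        {"input": "10 -4", "output": "6"},
    ],
}

print(len(problem["tests"]))  # 2
```

Because every problem carries executable tests, correctness can be decided mechanically, with no learned reward model in the loop.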
The Reinforcement Learning Execution Environment
The RL environment for NousCoder-14B was built using the Atropos framework. The model generates Python code, with each "rollout" receiving a scalar reward based on test case performance:
- +1 reward for passing all test cases.
- -1 reward for incorrect output, exceeding a 15-second time limit, or breaching a 4 GB memory limit.
Modal functioned as an autoscaled sandbox for secure, scalable execution, launching one container per rollout to isolate verification from training. A pipelined design streamed completions to the verifier while new generations began, keeping the training loop inference-bound rather than bottlenecked by verification.
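The reward rule above can be sketched as a small verifier. This is a simplified local stand-in, not the Atropos/Modal implementation: it runs the candidate program in a subprocess with the article's 15-second time limit, while the 4 GB memory limit and container isolation are assumed to be enforced by the sandbox and are not shown here. The `tests` format is hypothetical:

```python
import subprocess

TIME_LIMIT_S = 15  # per-rollout time limit from the article


def score_rollout(code: str, tests: list[dict]) -> int:
    """Return +1 if the generated program passes every test case,
    -1 on wrong output, a crash, or exceeding the time limit."""
    for case in tests:
        try:
            proc = subprocess.run(
                ["python3", "-c", code],
                input=case["input"],
                capture_output=True,
                text=True,
                timeout=TIME_LIMIT_S,
            )
        except subprocess.TimeoutExpired:
            return -1  # breached the time limit
        if proc.returncode != 0:
            return -1  # runtime error
        if proc.stdout.strip() != case["output"].strip():
            return -1  # incorrect output
    return 1  # passed all test cases
```

Running each rollout in its own container, as the article describes, keeps a malicious or runaway program from affecting the training job itself.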
Advanced Optimization Techniques
NousCoder-14B employs Group Relative Policy Optimization (GRPO), which removes the need for a separate value model. Researchers explored three objectives: Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO), Group Sequence Policy Optimization (GSPO), and an enhanced GSPO+. All three share the same advantage definition: each rollout's reward normalized by the mean and standard deviation of its group. DAPO notably modifies GRPO with a "clip higher" rule to encourage exploration, a token-level policy gradient that weights all tokens equally, and dynamic sampling that discards groups whose rollouts all receive the same reward and therefore carry zero advantage.
GSPO shifts importance weighting to the sequence level. GSPO+ maintains this correction but rescales gradients for equal token weighting regardless of sequence length. Performance differences on LiveCodeBench v6 were modest. At 81,920 tokens, DAPO achieved 67.87% Pass@1, slightly ahead of GSPO (66.26%) and GSPO+ (66.52%). At 40,960 tokens, all objectives were approximately 63% Pass@1.
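The shared advantage definition, reward normalized by the group's mean and standard deviation, is a few lines of code. This is a minimal sketch; the epsilon term for numerical stability is an assumption, not a detail from the source:

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward by its group's mean and std,
    the advantage definition shared by GRPO, DAPO, GSPO, and GSPO+."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group where every rollout earns the same reward yields zero advantage
# for all members -- exactly the groups DAPO's dynamic sampling discards.
print(group_advantages([1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0]
```

With the binary ±1 reward, any group mixing passes and failures produces nonzero advantages, while all-pass or all-fail groups contribute no learning signal.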
Context Management Innovations
Training used an iterative context-extension schedule, starting at 32,000 tokens and expanding to 40,000. For evaluation, YaRN context extension pushed the window to 81,920 tokens. A key technique was "overlong filtering": the advantage of any program exceeding the maximum context window was reset to zero, so truncated rollouts were neither rewarded nor penalized. This kept the model from learning to favor shorter solutions merely to avoid truncation, preserving code quality when context length was scaled up at test time.
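Overlong filtering amounts to masking the advantage of truncated rollouts. The sketch below assumes per-rollout token counts are available; the function name and signature are illustrative, not from the source:

```python
def apply_overlong_filter(advantages: list[float],
                          lengths: list[int],
                          max_context: int) -> list[float]:
    """Zero out the advantage of any rollout that hit the context limit,
    so truncated programs are neither rewarded nor punished and the model
    is not pushed toward artificially short solutions."""
    return [0.0 if n >= max_context else a
            for a, n in zip(advantages, lengths)]


# The middle rollout was truncated at the 40,960-token window, so its
# (negative) advantage is masked rather than penalizing length.
print(apply_overlong_filter([0.8, -0.5, 1.2], [1000, 40960, 2000], 40960))
# [0.8, 0.0, 1.2]
```

Without this mask, truncated programs would score -1 and the policy would learn to shorten outputs, which the article notes degrades code quality when the context window is later extended.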
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost