The Qwen development team has officially released Qwen3-Coder-Next, an open-weight language model designed to power coding agents and local development workflows. Built on the Qwen3-Next-80B-A3B foundation, the model combines a sparse Mixture-of-Experts (MoE) architecture with hybrid attention mechanisms: of its 80 billion total parameters, only 3 billion are active per token. The design aims to match the performance of models with far more active parameters while sharply reducing inference costs during long coding sessions and automated agent runs.
Qwen3-Coder-Next targets more than code completion: its intended use cases include agentic coding, browser-integrated development tools, and advanced IDE copilots. Training on a large collection of executable tasks, combined with reinforcement learning, enables it to independently plan actions, use external tools, execute code, and recover from runtime errors across long operational sequences.
Innovative Hybrid Architecture
Researchers describe the model as a hybrid design combining Gated DeltaNet, Gated Attention, and a Mixture-of-Experts framework. Key architectural specifications include 80 billion total parameters (3 billion active per token), 48 layers, and a hidden dimension of 2048. The 48 layers are arranged as 12 repeated blocks, each consisting of three (Gated DeltaNet → MoE) layers followed by one (Gated Attention → MoE) layer.
Each MoE layer contains 512 experts, of which 10 routed experts plus one shared expert are active for each token. This gives the model substantial capacity for specialization while keeping the computational footprint comparable to a 3-billion-parameter dense model.
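The routing scheme described above can be sketched as follows. This is a minimal, hedged illustration of top-k expert routing with a shared expert, not the model's actual implementation; it uses toy dimensions and random stand-in weights, while the real model uses 512 experts, top-10 routing, one shared expert, and hidden size 2048.

```python
import numpy as np

def moe_forward(x, router_w, expert_ws, shared_w, top_k):
    """Route one token through top_k selected experts plus a shared expert."""
    logits = x @ router_w                          # one router score per expert
    top = np.argsort(logits)[-top_k:]              # indices of selected experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                           # softmax over selected experts only
    out = shared_w @ x                             # shared expert is always active
    for g, e in zip(gates, top):
        out += g * (expert_ws[e] @ x)              # add gated expert outputs
    return out

# Toy dimensions for illustration (the real model: 512 experts, top-10, d=2048).
rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.standard_normal(d)
y = moe_forward(x,
                rng.standard_normal((d, n_experts)),       # router weights
                rng.standard_normal((n_experts, d, d)),    # per-expert weights
                rng.standard_normal((d, d)),               # shared expert weights
                top_k=4)
```

Because only `top_k` expert matrices are touched per token, the compute cost scales with the active experts rather than all 512, which is how an 80B-parameter model can run with a 3B-parameter footprint.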
Advanced Agentic Training
The development team emphasizes that Qwen3-Coder-Next was "agentically trained at scale." The training pipeline combined large-scale synthesis of executable tasks (approximately 800,000 verifiable tasks), interaction with diverse environments, and reinforcement learning. This methodology mirrors SWE-Bench-style workflows, teaching the model long-horizon reasoning, tool orchestration, test execution, and recovery from failed runs rather than static code modeling alone.
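The execute-and-recover behavior this training rewards can be sketched as a simple agent loop. All names here (`model`, `env`, the `passed` flag) are illustrative stand-ins, not Qwen APIs: the key idea is that the environment's tests, not the model, decide success, and failed attempts are fed back so the model can recover.

```python
def solve(task, model, env, max_steps=8):
    """Propose patches until the environment's tests pass or the budget runs out."""
    history = [task]
    for _ in range(max_steps):
        patch = model(history)       # model proposes the next edit given the transcript
        result = env(patch)          # environment applies the patch and runs the tests
        history.append(result)       # feed the outcome back so the model can recover
        if result["passed"]:         # verifiable task: the test suite is the judge
            return patch
    return None                      # give up after the step budget
```

A stub run: with a model that proposes `fix1`, `fix2`, ... and an environment that only accepts `fix2`, the loop fails once, observes the failure, and succeeds on the second attempt.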
Performance Benchmarks
Evaluations reveal Qwen3-Coder-Next's highly competitive performance. On SWE-Bench Verified, the model achieved a score of 70.6, comparable to DeepSeek-V3.2 (671B parameters) at 70.2 and GLM-4.7 (358B parameters) at 74.2. For SWE-Bench Multilingual, it scored 62.8, nearly matching DeepSeek-V3.2 at 62.3 and GLM-4.7 at 63.7. On the more demanding SWE-Bench Pro, Qwen3-Coder-Next secured 44.3, surpassing DeepSeek-V3.2 (40.9) and GLM-4.7 (40.6).
On Terminal-Bench 2.0 the model scored 36.2, again demonstrating competitiveness, and on the Aider benchmark it reached 66.2, placing it among the top models in its category. These results support the team's claim that Qwen3-Coder-Next performs on par with models that have ten to twenty times more active parameters, particularly in coding and agent-oriented scenarios.
Seamless Agent Integration and Tool Use
Qwen3-Coder-Next has been optimized for tool calling and integration with diverse coding agent frameworks. Its 256,000-token context window lets these systems keep large codebases, detailed logs, and ongoing conversations within a single continuous session. Notably, the model supports only a 'non-thinking' mode: it does not generate explicit <think></think> blocks, which simplifies integration for agents that expect direct tool calls and responses.
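A sketch of what an agent-side request might look like against an OpenAI-compatible /v1/chat/completions endpoint (the kind served by vLLM or SGLang, as discussed below). The model id and the `run_tests` tool are illustrative assumptions, not official values; the point is the shape of the tool schema, and that in non-thinking mode the reply contains direct text or tool calls with no <think></think> block to strip.

```python
import json

# Hedged sketch of a tool-calling request body; model id and tool are placeholders.
payload = {
    "model": "Qwen3-Coder-Next",       # placeholder; use whatever name you served
    "messages": [
        {"role": "user", "content": "Fix the failing test in utils.py"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "run_tests",       # hypothetical tool exposed by the agent
            "description": "Run the project's test suite and return the results.",
            "parameters": {"type": "object", "properties": {}},
        },
    }],
}
body = json.dumps(payload)             # POST this to /v1/chat/completions
```

Because no reasoning block is emitted, the agent can dispatch any returned tool call directly instead of first filtering out thinking tokens.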
Flexible Deployment Options
For large-scale server deployments, the Qwen team recommends SGLang and vLLM, both of which expose OpenAI-compatible /v1 API endpoints. For local development, Unsloth provides GGUF quantizations: a 4-bit quantized version requires approximately 46 GB of RAM, while an 8-bit variant needs about 85 GB. The Unsloth guide supports context sizes up to 262,144 tokens, recommends 32,768 tokens as a practical default for less powerful local machines, and covers integration with local agents modeled on OpenAI Codex.
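The RAM figures above can be sanity-checked with back-of-the-envelope arithmetic. The effective bits-per-weight values below are rough assumptions (quantized GGUF files carry some metadata and higher-precision tensors above the nominal bit width), not Unsloth's published numbers.

```python
def approx_size_gb(n_params, bits_per_weight):
    """Rough model size: parameters × bits per weight, converted to gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

total_params = 80e9                              # 80B total parameters
four_bit = approx_size_gb(total_params, 4.5)     # assumed effective ~4-bit width
eight_bit = approx_size_gb(total_params, 8.5)    # assumed effective ~8-bit width
# Yields roughly 45 GB and 85 GB, consistent with the reported
# ~46 GB (4-bit) and ~85 GB (8-bit) RAM requirements.
```

Note that it is the 80B total parameters, not the 3B active ones, that must fit in memory; sparsity saves compute per token, not resident model size.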
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost