Moonshot AI has announced the public release of Kimi K2.5, an open-source visual agentic intelligence model. This advanced system integrates a sophisticated Mixture of Experts language architecture, a dedicated vision encoder, and a novel parallel multi-agent framework termed Agent Swarm. Kimi K2.5 aims to significantly enhance performance in areas like code generation, multimodal reasoning, and in-depth web research, as demonstrated by strong results on key industry benchmarks.
Pioneering Model Architecture
Kimi K2.5 is built upon a Mixture of Experts (MoE) architecture, boasting an impressive 1 trillion total parameters while activating approximately 32 billion parameters per token. The network comprises 61 layers and incorporates 384 experts, with 8 experts selected for each token, plus one shared expert. It features an attention hidden size of 7168 and utilizes 64 attention heads.
The model uses Multi-head Latent Attention (MLA) and the SwiGLU activation function. Its tokenizer has a vocabulary of 160,000 tokens, and a maximum context length of 256,000 tokens is maintained during both training and inference. This extensive context window is crucial for managing prolonged tool interactions, lengthy documents, and intricate multi-step research processes.
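For readers who want the reported hyperparameters in one place, the sketch below collects them into a plain Python configuration object. The field names are illustrative and do not mirror Moonshot's actual configuration schema; the values simply restate the figures reported above.

```python
from dataclasses import dataclass

@dataclass
class KimiK25Config:
    """Illustrative summary of the reported Kimi K2.5 hyperparameters.

    Field names are hypothetical; they do not reflect Moonshot's config files.
    """
    total_params: float = 1e12             # ~1T total parameters
    active_params_per_token: float = 32e9  # ~32B activated per token
    num_layers: int = 61
    num_experts: int = 384                 # routed experts
    experts_per_token: int = 8             # experts selected per token
    shared_experts: int = 1                # always-on shared expert
    hidden_size: int = 7168                # attention hidden size
    num_attention_heads: int = 64
    attention: str = "MLA"                 # Multi-head Latent Attention
    activation: str = "SwiGLU"
    vocab_size: int = 160_000
    max_context_length: int = 256_000

config = KimiK25Config()
print(f"Routing: {config.experts_per_token} routed experts + {config.shared_experts} shared per token")
```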
Visual input is processed by a MoonViT encoder, which includes around 400 million parameters. This vision component is natively integrated, with visual and text tokens trained together within a unified multimodal backbone. Kimi K2.5 underwent continuous pre-training on approximately 15 trillion tokens of diverse vision and text data, building upon the Kimi K2 Base. This integrated training method allows the model to inherently learn joint structures across images, documents, and natural language from its initial development stages.
The released model checkpoints are compatible with widely used inference frameworks such as vLLM, SGLang, and KTransformers, and require transformers version 4.57.1 or newer. Quantized INT4 variants, built on techniques introduced with Kimi K2 Thinking, are also available, enabling deployment on standard GPUs with reduced memory requirements.
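As a rough illustration of what serving a released checkpoint with one of these frameworks might look like, here is a minimal sketch using vLLM's offline Python API. The repository ID is a placeholder rather than the confirmed checkpoint name, and settings such as tensor parallelism and context length depend entirely on the available hardware.

```python
# Minimal vLLM sketch; the repository ID below is a placeholder, not the
# confirmed checkpoint name, and parallelism settings depend on your GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="moonshotai/Kimi-K2.5",  # placeholder Hugging Face repo ID
    trust_remote_code=True,         # custom model code shipped with the checkpoint
    tensor_parallel_size=8,         # adjust to the number of GPUs available
    max_model_len=256_000,          # up to the reported 256K-token context
)

params = SamplingParams(temperature=0.6, max_tokens=1024)
outputs = llm.generate(
    ["Summarize the trade-offs of Mixture of Experts routing in two sentences."],
    params,
)
print(outputs[0].outputs[0].text)
```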
Advanced Coding and Multimodal Functionality
Kimi K2.5 distinguishes itself as a formidable open-source coding model, particularly adept at scenarios where visual context is paramount for code generation. The model can interpret UI mockups, design screenshots, and even video inputs, subsequently producing structured frontend code complete with layout, styling, and interactive logic.
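A mockup-to-code request of this kind could be issued against any OpenAI-compatible endpoint, such as those exposed by vLLM or SGLang when serving the checkpoint locally. The sketch below assumes such a local server; the base URL, served-model name, and image file are placeholders, not details from Moonshot's release.

```python
# Sketch of a UI-mockup-to-code request against an OpenAI-compatible endpoint
# (e.g. one exposed by vLLM or SGLang). Base URL, model name, and image path
# are placeholders for however the checkpoint is served locally.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("dashboard_mockup.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="kimi-k2.5",  # placeholder served-model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Generate a responsive HTML/CSS/JS implementation of this dashboard mockup."},
        ],
    }],
)
print(response.choices[0].message.content)
```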
Examples from the Moonshot team illustrate the model's ability to analyze an image depicting a puzzle, determine the shortest path, and then generate code that visualizes the solution. This exemplifies its cross-modal reasoning capabilities, where it seamlessly integrates image comprehension, algorithmic planning, and code synthesis within a single workflow.
The expansive 256,000-token context window of K2.5 allows it to maintain extensive specification histories during development. Developers can combine design assets, product documentation, and an existing codebase within a single prompt, letting the model refactor or extend a project while staying consistent with the original visual designs.
The Power of Agent Swarm and Parallel Agent Reinforcement Learning
A pivotal innovation within Kimi K2.5 is Agent Swarm, a sophisticated multi-agent system developed using Parallel Agent Reinforcement Learning (PARL). In this paradigm, an orchestrator agent systematically deconstructs complex objectives into numerous subtasks. It then deploys specialized sub-agents to execute these subtasks concurrently.
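The article does not describe Agent Swarm's internals, but the orchestrator-plus-parallel-sub-agents pattern it describes can be sketched conceptually with asyncio. Everything below (the function names, the stand-in sub-agent, the aggregation step) is illustrative and is not Moonshot's implementation.

```python
# Conceptual sketch of an orchestrator fanning subtasks out to parallel
# sub-agents, in the spirit of the Agent Swarm description above. This is
# not Moonshot's implementation; all names and logic are illustrative.
import asyncio

async def sub_agent(subtask: str) -> dict:
    """Stand-in sub-agent: a real one would call the model and external
    tools (search, browser, code execution) in a loop."""
    await asyncio.sleep(0.1)  # simulate model latency and tool calls
    return {"subtask": subtask, "finding": f"result for '{subtask}'"}

async def orchestrator(goal: str, num_agents: int = 8) -> list[dict]:
    # 1. Decompose the goal into subtasks (here: trivially, by segment).
    subtasks = [f"{goal} — segment {i}" for i in range(num_agents)]
    # 2. Run the sub-agents concurrently instead of one after another.
    findings = await asyncio.gather(*(sub_agent(t) for t in subtasks))
    # 3. Consolidate results (a real orchestrator would have the model merge them).
    return list(findings)

results = asyncio.run(orchestrator("find niche content creators"))
print(len(results), "findings consolidated")
```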
The Kimi team reports that K2.5 can effectively manage up to 100 sub-agents within a single task, supporting up to 1,500 coordinated steps or tool calls per execution. This parallel processing capability leads to approximately 4.5 times faster task completion compared to a single-agent pipeline, particularly for broad search tasks.
PARL introduces a metric known as Critical Steps, which incentivizes policies that minimize the number of sequential steps required to solve a task. This encourages agents to divide work into parallel branches while upholding overall consistency, moving beyond simplistic sequential planning.
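The article does not give a formal definition, but one plausible reading is that Critical Steps measures the longest chain of steps that must run sequentially, i.e. the critical path of the task's dependency graph, so that fanning work out into parallel branches lowers the count. The sketch below computes that quantity for a toy step graph; the definition, graph, and names are assumptions, not taken from the PARL work itself.

```python
# Toy illustration of a "critical steps"-style quantity: the length of the
# longest dependency chain in a task's step graph. The exact PARL definition
# is not given in the article; this is one plausible interpretation.
from functools import lru_cache

# step -> list of steps it depends on (a small acyclic example)
deps = {
    "plan": [],
    "search_a": ["plan"],
    "search_b": ["plan"],
    "search_c": ["plan"],
    "aggregate": ["search_a", "search_b", "search_c"],
    "report": ["aggregate"],
}

@lru_cache(maxsize=None)
def depth(step: str) -> int:
    """Number of steps on the longest sequential chain ending at `step`."""
    parents = deps[step]
    return 1 + (max(depth(p) for p in parents) if parents else 0)

critical_steps = max(depth(s) for s in deps)
print(critical_steps)  # 4: plan -> search_* -> aggregate -> report
```

In this toy graph there are six steps in total, but only four are sequential because the three searches run in parallel; a Critical Steps-style reward would favor exactly this kind of decomposition over a purely sequential plan.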
One practical demonstration from the Kimi team involves a research scenario focused on identifying numerous niche content creators. The orchestrator employs Agent Swarm to deploy a large number of research agents, each exploring different segments of the web. The system then consolidates their findings into a structured, comprehensive table.
Impressive Benchmark Performance
Kimi K2.5 has demonstrated strong performance across various agentic benchmarks, scoring 50.2 on HLE Full with tools and 74.9 on BrowseComp with context management. In Agent Swarm mode, the BrowseComp score increases further to 78.4, alongside improvements in WideSearch metrics. The Kimi team positions these results favorably against leading closed-source models such as GPT 5.2, Claude 4.5, and Gemini 3 Pro, as well as the open-weight DeepSeek V3, on these specific agentic suites.
For vision and video tasks, K2.5 also records high scores, achieving 78.5 on MMMU Pro and 86.6 on VideoMMMU. The model excels in document and scene understanding challenges like OmniDocBench, OCRBench, and WorldVQA, underscoring the effectiveness of its MoonViT encoder and long-context training for real-world multimodal applications.
In coding benchmarks, K2.5 reports scores of 76.8 on SWE Bench Verified, 50.7 on SWE Bench Pro, 73.0 on SWE Bench Multilingual, 50.8 on Terminal Bench 2.0, and 85.0 on LiveCodeBench v6. These figures firmly establish K2.5 among the top open-source coding models in reported performance.
Regarding long-context language benchmarks, K2.5 achieved 61.0 on LongBench V2 and 70.0 on AA LCR under standard evaluation. For reasoning tasks, it secured high scores on AIME 2025, HMMT 2025 February, GPQA Diamond, and MMLU Pro when operating in thinking mode.
Key Highlights
- Trillion-Scale Mixture of Experts: Kimi K2.5 leverages a Mixture of Experts architecture with 1T total parameters and approximately 32B active parameters per token, featuring 61 layers, 384 experts, and a 256K context length, optimized for complex multimodal and tool-intensive workflows.
- Native Multimodal Training with MoonViT: The model integrates a MoonViT vision encoder (around 400M parameters) and is trained on approximately 15T mixed vision and text tokens, ensuring seamless, unified processing of images, documents, and language.
- Parallel Agent Swarm with PARL: The Agent Swarm system, powered by Parallel Agent Reinforcement Learning, can coordinate up to 100 sub-agents and up to 1,500 coordinated steps or tool calls per task, delivering roughly 4.5 times faster execution on wide research tasks compared to single-agent approaches.
- Strong Benchmark Performance: K2.5 reports competitive scores, including 76.8 on SWE Bench Verified, 78.5 on MMMU Pro, 86.6 on VideoMMMU, 50.2 on HLE Full with tools, and 74.9 on BrowseComp, positioning it competitively against the models listed above across agentic and multimodal suites.
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost