Kimi K2.5: Visual Agentic Intelligence | Technical Report
2026.01.29
by web-ghost
#LLM #Agent #Multimodal #AI #Open Source

Key Points

  1. Kimi K2.5 is introduced as a powerful open-source multimodal model with state-of-the-art coding and vision capabilities, built on continued pretraining with 15T mixed visual and text tokens.
  2. A key innovation is the self-directed agent swarm, which enables K2.5 to orchestrate up to 100 sub-agents and 1,500 parallel tool calls, significantly reducing complex task execution time by up to 4.5x.
  3. K2.5 demonstrates strong performance across coding, visual debugging, and office productivity tasks, as well as on agentic benchmarks like HLE, BrowseComp, and SWE-Verified, signaling a step toward advanced agentic intelligence.

Kimi K2.5 is an advanced open-source multimodal large language model, developed as a successor to Kimi K2. It has been continually pretrained on approximately 15 trillion mixed visual and text tokens, enabling state-of-the-art capabilities in coding, vision, and, in particular, self-directed agentic intelligence.

A core innovation of Kimi K2.5 is its Agent Swarm paradigm, which represents a significant shift from single-agent scaling to a self-directed, coordinated swarm-like execution. This capability allows K2.5 to autonomously orchestrate an agent swarm comprising up to 100 sub-agents, executing parallel workflows across as many as 1,500 tool calls. This parallel execution paradigm demonstrably reduces execution time by up to 4.5x compared to single-agent setups, significantly shortening the critical path for complex tasks. The agent swarm's creation and orchestration are fully automated by Kimi K2.5, requiring no predefined subagents or workflows.
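As a rough illustration of why swarm-style fan-out shortens the critical path, the sketch below runs hypothetical subagent calls concurrently using Python's standard library. This is a minimal analogy only: `run_subagent` and `orchestrate` are invented names, and K2.5's actual orchestration is learned end-to-end rather than hand-coded.

```python
import concurrent.futures
import time

def run_subagent(subtask: str) -> str:
    """Hypothetical stand-in for a frozen subagent executing one tool call."""
    time.sleep(0.05)  # simulate tool-call latency
    return f"done:{subtask}"

def orchestrate(subtasks: list[str], max_workers: int = 100) -> list[str]:
    """Fan subtasks out concurrently: wall-clock time then tracks the
    slowest subagent (the critical path), not the sum of all calls."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_subagent, subtasks))

start = time.perf_counter()
results = orchestrate([f"subtask-{i}" for i in range(10)])
elapsed = time.perf_counter() - start
# 10 x 0.05 s of simulated latency finishes in roughly one 0.05 s "step"
```

With all ten calls in flight at once, the elapsed time is close to a single call's latency, which is the same critical-path effect the swarm paradigm exploits at much larger scale.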

The Agent Swarm functionality is powered by Parallel-Agent Reinforcement Learning (PARL). In PARL, a trainable orchestrator agent learns to decompose complex tasks into parallelizable subtasks. These subtasks are then executed concurrently by dynamically instantiated, frozen subagents. A key challenge in training such parallel orchestrators is the problem of "serial collapse," where the orchestrator defaults to sequential execution despite parallel capacity. To mitigate this, PARL employs a staged reward shaping mechanism. The reward function is defined as:
R_t = \lambda_{aux}(e) \cdot r_{parallel} + (1 - \lambda_{aux}(e)) \cdot \big( \mathbb{I}[\mathrm{success}] \cdot Q(\tau) \big)
Here, \lambda_{aux}(e) is an annealing coefficient that decreases from 0.1 to 0.0 over the course of training. Early in training, the auxiliary reward r_{parallel} incentivizes the instantiation and concurrent execution of subagents, promoting exploration of the parallel scheduling space. As training progresses, the focus gradually shifts towards maximizing the end-to-end task quality Q(\tau), ensuring task success. To further enforce the emergence of parallel strategies, a computational bottleneck is introduced by evaluating performance using Critical Steps, a latency-oriented metric inspired by the critical path in parallel computation:
CriticalSteps = \sum_{t=1}^{T} \Big( S_{main}(t) + \max_{i} S_{sub,i}(t) \Big)
where S_{main}(t) captures orchestration overhead and \max_{i} S_{sub,i}(t) reflects the slowest subagent's progress at each stage. This metric ensures that spawning more subtasks is only beneficial if it shortens the overall critical path. This methodology enables an 80% reduction in end-to-end runtime and a 3x-4.5x reduction in minimum critical steps for complex tasks.
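The staged reward and the Critical Steps metric can be sketched in a few lines of Python. This is a minimal illustration, not the report's implementation: the 0.1 to 0.0 annealing range is from the report, but the linear schedule and all function names here are assumptions.

```python
def lambda_aux(epoch: int, total_epochs: int, lam0: float = 0.1) -> float:
    """Anneal the auxiliary coefficient from lam0 (0.1) down to 0.0.
    The linear schedule is an assumption for illustration."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return lam0 * (1.0 - frac)

def staged_reward(epoch: int, total_epochs: int,
                  r_parallel: float, success: bool, quality: float) -> float:
    """R_t = lam * r_parallel + (1 - lam) * (I[success] * Q(tau))."""
    lam = lambda_aux(epoch, total_epochs)
    return lam * r_parallel + (1.0 - lam) * (float(success) * quality)

def critical_steps(stages: list[tuple[int, list[int]]]) -> int:
    """CriticalSteps = sum_t ( S_main(t) + max_i S_sub,i(t) ).
    Each stage is (orchestrator steps, [per-subagent steps])."""
    return sum(s_main + (max(subs) if subs else 0) for s_main, subs in stages)

# Early training: auxiliary parallelism reward still active (lam = 0.1)
early = staged_reward(0, 10, r_parallel=1.0, success=True, quality=0.5)  # ~0.55
# Late training: only task success and quality matter (lam = 0.0)
late = staged_reward(9, 10, r_parallel=1.0, success=True, quality=0.5)   # 0.5

# Sequential plan: one subagent carries all the work -> long critical path
seq = critical_steps([(1, [6]), (1, [6])])               # 14
# Parallel plan: the same work split across three subagents per stage
par = critical_steps([(1, [2, 2, 2]), (1, [2, 2, 2])])   # 6
```

Note how the parallel plan halves the critical path even though the total subagent work is identical, which is exactly the behavior the metric is designed to reward.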

Beyond agent swarms, Kimi K2.5 excels in Coding with Vision. Leveraging its massive-scale vision-text joint pre-training, it can transform simple conversations into complete front-end interfaces, including interactive layouts and rich animations. It significantly improves image/video-to-code generation and visual debugging by reasoning directly over visual inputs. This allows users to express intent visually, and K2.5 can autonomously inspect its own visual output and iterate on it, demonstrating breakthrough autonomous visual debugging capabilities. Performance is evaluated on Kimi Code Bench, an internal benchmark covering diverse software engineering tasks from building to debugging across multiple languages, showing consistent improvements over K2. Users can access agentic coding capabilities via K2.5 Agent with preconfigured tools or through Kimi Code, an open-sourced product that integrates with terminals and IDEs and supports visual inputs.

Kimi K2.5 also brings its agentic intelligence to Office Productivity. K2.5 Agent can handle high-density, large-scale office work end-to-end, reasoning over complex inputs, coordinating multi-step tool use, and generating expert-level outputs such as documents, spreadsheets, PDFs, and slide decks. It supports advanced tasks like adding annotations in Word, constructing financial models with Pivot Tables, and writing LaTeX equations in PDFs, scaling to long-form outputs of up to 10,000 words or 100 pages. Evaluations on internal benchmarks like AI Office Benchmark and General Agent Benchmark show significant improvements over K2 Thinking, with 59.3% and 24.3% gains respectively in end-to-end performance on real-world professional tasks.

Kimi K2.5 is available through Kimi.com, the Kimi App, API, and Kimi Code, with Kimi.com and the Kimi App offering four modes: K2.5 Instant, K2.5 Thinking, K2.5 Agent, and K2.5 Agent Swarm (Beta). The model aims to redefine the boundaries of AI in knowledge work, representing a meaningful step towards AGI for the open-source community.