
Latent Collaboration in Multi-Agent Systems
Key Points
- LatentMAS introduces an end-to-end training-free framework that enables multi-agent systems to collaborate directly within the continuous latent space, moving beyond traditional text-based mediation.
- This framework achieves "pure latent collaboration" by having agents auto-regressively generate last-layer hidden embeddings as internal thoughts and share information losslessly through a shared latent working memory built from KV caches.
- Empirical evaluations across nine benchmarks demonstrate that LatentMAS consistently outperforms single-model and text-based MAS baselines, significantly enhancing reasoning accuracy while reducing token usage by 70.8-83.7% and achieving 4-4.3x faster inference.
This paper introduces LatentMAS, an end-to-end training-free framework enabling pure latent collaboration among Large Language Model (LLM) agents within multi-agent systems (MAS). While existing LLM agents typically rely on text-based communication, LatentMAS shifts reasoning and communication directly into the continuous latent space, offering substantial gains in accuracy, efficiency, and speed.
The core motivation stems from the limitations of natural language as a mediation medium in MAS, which can lead to communication bottlenecks, information loss, and computational overhead. LatentMAS proposes to unify internal latent chain-of-thought (CoT) reasoning with cross-agent latent communication, addressing this gap.
LatentMAS is built upon three foundational principles:
- Reasoning Expressiveness: Hidden representations inherently encode richer, continuous thoughts compared to discrete tokens.
- Communication Fidelity: A shared latent working memory preserves and transfers internal representations across agents without loss.
- Collaboration Complexity: The framework achieves higher collaborative expressiveness than text-based MAS with significantly lower inference complexity.
The methodology of LatentMAS consists of two main technical components:
1. Auto-regressive Latent Thoughts Generation:
Within each LLM agent, reasoning unfolds through the auto-regressive generation of last-layer hidden representations. Instead of decoding to tokens and then re-embedding for subsequent steps, the model directly appends the last-layer hidden state h_t as the input embedding for step t+1. This process is repeated for T latent steps, producing a sequence of continuous output representations (h_1, ..., h_T), defined as the latent thoughts.
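As a minimal toy sketch (not the authors' code), the loop below illustrates this re-insertion pattern: the last-layer hidden state is appended directly as the next input embedding, with no intermediate token decoding. Here `last_hidden` and its linear map `W` are stand-ins for a real frozen LLM forward pass, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size; a real LLM would use thousands of dimensions

# Stand-in for a frozen LLM forward pass: maps a sequence of input
# embeddings to the last-layer hidden state at the final position.
W = rng.standard_normal((d, d)) / np.sqrt(d)
def last_hidden(embeds):
    # mean-pool the context, then apply a fixed linear map (toy surrogate)
    return np.tanh(np.mean(embeds, axis=0) @ W)

def generate_latent_thoughts(prompt_embeds, T):
    """Auto-regressively append last-layer hidden states as next inputs."""
    embeds = list(prompt_embeds)
    thoughts = []
    for _ in range(T):
        h = last_hidden(np.stack(embeds))  # continuous "thought" vector
        thoughts.append(h)
        embeds.append(h)  # re-insert directly; no decode/re-embed round trip
    return thoughts

prompt = rng.standard_normal((3, d))  # pretend these are prompt embeddings
thoughts = generate_latent_thoughts(prompt, T=4)
print(len(thoughts), thoughts[0].shape)
```

The key design point the sketch captures is that each step consumes the previous step's continuous output, so no information is quantized away through a vocabulary bottleneck.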
To ensure distributional consistency and prevent out-of-distribution activations when re-inserting last-layer hidden states as input embeddings, an Input-Output Distribution Alignment mechanism is employed. This involves a linear alignment operator, a projection matrix W, that maps each output vector h_t back to a valid input embedding, e_{t+1} = W h_t. In practice, W is computed once via ridge regression over paired output hidden states H and input embeddings E, W = (H^T H + λI)^{-1} H^T E, making its computational cost negligible.
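The alignment matrix can be sketched with the closed-form ridge regression solution. This is an illustrative reconstruction under assumed shapes, where `H` plays the role of collected last-layer hidden states and `E` the corresponding input embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 256, 8  # toy: number of paired samples, hidden size

# H: last-layer hidden states; E: matching input embeddings.
# Here E is synthesized as a noisy linear image of H for demonstration.
H = rng.standard_normal((n, d))
E = H @ rng.standard_normal((d, d)) + 0.01 * rng.standard_normal((n, d))

lam = 1e-3
# Closed-form ridge regression: W = (H^T H + lam*I)^{-1} H^T E,
# solved once; reused for every latent step thereafter.
W = np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ E)

aligned = H @ W  # map output vectors back into the input embedding space
err = np.linalg.norm(aligned - E) / np.linalg.norm(E)
print(round(err, 3))
```

Because W is a single d-by-d matrix fit offline, applying it at each latent step adds only one matrix-vector product to generation.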
The theoretical advantage of this approach is quantified by Theorem 3.1 (Expressiveness of Latent Thoughts), which states that latent thoughts generation is on the order of d / log|V| times more efficient than text-based reasoning for equivalent expressive capacity, where d is the hidden dimension and |V| is the vocabulary size. Since d grows with model scale while log|V| stays roughly fixed, larger models inherently benefit more from latent reasoning.
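To get a feel for the magnitude of this ratio, the arithmetic below plugs in illustrative numbers (a 4096-dimensional hidden state and a ~152k-token vocabulary, assumed here for the sake of the example, not taken from the paper's tables):

```python
import math

d = 4096      # hidden dimension (illustrative)
V = 152064    # vocabulary size (illustrative, Qwen-scale)

# A discrete token conveys at most log2|V| bits of choice,
# while a latent vector has d continuous coordinates.
bits_per_token = math.log2(V)
ratio = d / bits_per_token  # the d / log|V| efficiency factor

print(round(bits_per_token, 1), round(ratio, 1))
```

Even under this coarse counting, a single latent step can carry two orders of magnitude more capacity than a single token choice, which is the intuition behind the theorem.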
2. Working Memory Preservation and Thoughts Transfer:
For inter-agent communication, LatentMAS introduces a novel latent working memory transfer mechanism. Instead of passing text, the complete Key-Value (KV) caches from all transformer layers of a source agent A_j are extracted and constitute its latent working memory M_j.
Specifically, M_j = {(K_j^(l), V_j^(l))} over all layers l, where the keys and values cover both the initial input context and the newly generated latent thoughts.
A successive agent A_{j+1} integrates M_j by performing layer-wise concatenation: K_j^(l) and V_j^(l) are prepended to the existing keys and values of A_{j+1}. This ensures that A_{j+1}'s subsequent latent thoughts generation is conditioned on both A_j's working memory and its own internal representations.
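The transfer step above can be sketched as plain layer-wise array concatenation. This is a minimal mock (NumPy arrays standing in for real per-layer KV tensors; the shapes and helper names are assumptions, not the paper's implementation):

```python
import numpy as np

L, d_head = 2, 4  # toy: number of layers, per-head key/value dimension

def make_kv(seq_len, seed):
    """Mock a per-layer KV cache: layer -> (keys, values) arrays."""
    rng = np.random.default_rng(seed)
    return {l: (rng.standard_normal((seq_len, d_head)),
                rng.standard_normal((seq_len, d_head))) for l in range(L)}

def transfer_working_memory(src_kv, dst_kv):
    """Prepend the source agent's KV cache to the destination's, layer-wise."""
    merged = {}
    for l in range(L):
        K_src, V_src = src_kv[l]
        K_dst, V_dst = dst_kv[l]
        merged[l] = (np.concatenate([K_src, K_dst], axis=0),
                     np.concatenate([V_src, V_dst], axis=0))
    return merged

agent_a_kv = make_kv(seq_len=5, seed=0)  # A's context + latent thoughts
agent_b_kv = make_kv(seq_len=3, seed=1)  # B's own context so far
merged = transfer_working_memory(agent_a_kv, agent_b_kv)
print(merged[0][0].shape)
```

Because the receiving agent attends over the prepended cache exactly as it would over its own prefix, no re-encoding of the source agent's context is needed.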
This mechanism guarantees Lossless Information Transfer, as formalized by Theorem 3.3, stating that outputs generated by an agent receiving latent working memory are equivalent to those obtained by explicitly inputting preceding agents' outputs.
The entire process involves agents performing latent generation and passing their updated KV caches; only the final agent decodes the ultimate text answer. Theorem 3.4 (LatentMAS Complexity) highlights its efficiency: the time complexity for each agent is O((m + t)^2), where m is the input length and t is the latent thought length. Because t is far shorter than a typical text chain-of-thought, this is substantially lower than the complexity of text-based MAS.
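A back-of-the-envelope comparison makes the quadratic savings concrete. The token counts below are illustrative assumptions (not figures from the paper), contrasting a short latent trace against a longer text chain-of-thought an equivalent text-based agent would emit:

```python
m = 1000  # shared input length (illustrative)
t = 60    # latent thought steps (illustrative)
n = 400   # text CoT tokens for a text-based agent (illustrative)

# Per-agent attention cost scales quadratically in total sequence length.
latent_cost = (m + t) ** 2
text_cost = (m + n) ** 2

speedup = text_cost / latent_cost
print(round(speedup, 2))
```

Since the gap between t and n widens with harder tasks (longer text reasoning traces), the quadratic term makes the latent variant's advantage grow rather than shrink with problem difficulty.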
Empirical evaluations across 9 benchmarks (math/science reasoning, commonsense understanding, code generation) using Qwen3 models (4B, 8B, 14B) in both sequential and hierarchical MAS settings demonstrate that LatentMAS consistently outperforms single-model and text-based MAS baselines. It achieves up to 14.6% higher accuracy, reduces output token usage by 70.8%-83.7%, and provides 4x-4.3x faster end-to-end inference without any additional training. These results validate that latent collaboration significantly enhances system-level reasoning quality while delivering substantial efficiency gains.