
KVzap: Fast, Adaptive, and Faithful KV Cache Pruning
Key Points
- KVzap is introduced as a fast, adaptive, and faithful KV cache pruning method designed to alleviate the memory bottleneck in long-context Large Language Models.
- It approximates an improved KVzip variant (KVzip+) by training a lightweight surrogate model (linear or MLP) that predicts importance scores from hidden states, dynamically pruning KV pairs below a fixed threshold.
- KVzap achieves 2-4x KV cache compression with negligible accuracy loss across Qwen3-8B, Llama-3.1-8B-Instruct, and Qwen3-32B on diverse long-context and reasoning tasks, demonstrating state-of-the-art performance.
KVzap is a novel method for pruning the Key-Value (KV) cache in transformer-based language models, designed to alleviate the memory bottleneck associated with long context lengths during inference. While various methods have been proposed to compress the KV cache along the sequence length (T-axis), their adoption in major inference engines has been limited due to trade-offs between speed and accuracy. KVzap addresses this by offering a fast, adaptive, and faithful approximation of the state-of-the-art KVzip method.
The paper first highlights the significant memory consumption of the KV cache, which grows linearly with the sequence length $T$ and also scales with the number of layers $L$, the number of KV heads $H$, and the head dimension $D$, making it a dominant bottleneck for LLMs handling long contexts. Existing architectural modifications have targeted the $L$, $H$, and $D$ axes, but $T$-axis compression often relies on ad-hoc pruning.
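To make the scale of this bottleneck concrete, here is a back-of-envelope sketch (assuming Llama-3.1-8B's published dimensions: 32 layers, 8 KV heads, head dimension 128, and an fp16 cache):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Total KV cache size: 2 (keys and values) x layers x KV heads x seq_len x head_dim."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Llama-3.1-8B-style dimensions at a 128k-token context, fp16:
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=131072)
print(size / 2**30)  # 16.0 GiB -- comparable to the 8B model's own weights
```

At this scale the cache, not the weights, dominates per-request memory, which is why $T$-axis pruning is attractive.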
KVzip and KVzip+:
KVzap builds upon KVzip, a query-agnostic KV cache pruning method that achieves high compression with minimal accuracy loss. KVzip operates by defining an "importance score" for each KV pair based on a copy-and-paste pretext task: given an input prompt, an extended prompt is constructed by repeating the original prompt. For each head $h$, the score of the KV pair at position $i$ in the original prompt is the maximum attention weight it receives when attending to the repeated prompt:

$$S_i^{(h)} = \max_j \, a_{j,i}^{(h)},$$

where $a_{j,i}^{(h)}$ is the attention weight from query $j$ (in the repeated prompt) to key $i$ (in the original prompt). KV pairs with the lowest scores are then pruned. A key limitation of KVzip is its computational expense: it requires a costly double prefilling step, making it slow and unsuitable for real-time decoding.
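The per-head scoring step can be sketched as follows (a simplified single-layer view; `attn` is assumed to hold the cross-attention weights obtained from the second prefill over the repeated prompt):

```python
import torch

def kvzip_scores(attn: torch.Tensor) -> torch.Tensor:
    """attn: [heads, T_repeat, T_orig] attention weights from queries in the
    repeated prompt to keys in the original prompt.
    Returns [heads, T_orig]: the max attention each original KV pair receives."""
    return attn.max(dim=1).values

attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)  # toy attention weights
scores = kvzip_scores(attn)  # shape [8, 16]; lowest-scoring pairs get pruned
```

Computing `attn` is exactly the expensive part: it requires prefilling the prompt twice, which is what KVzap later avoids.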
KVzap introduces KVzip+, an enhanced version of KVzip's scoring mechanism. KVzip+ incorporates a normalization term inspired by recent work on expected attention, which considers the contribution of each token to the residual stream. The KVzip+ score is defined as:

$$S_i^{(h)} = \max_j \, a_{j,i}^{(h)} \cdot \frac{\lVert W_O \, v_i^{(h)} \rVert_2}{\lVert x_j \rVert_2},$$

where $x_j$ is the input hidden state at decoding step $j$, $W_O$ is the output projection matrix, and $v_i^{(h)}$ is the value vector at position $i$. This normalization weights the attention scores by the magnitude of the value vector's contribution to the output and inversely by the magnitude of the current hidden state, aiming to better reflect each token's true importance. Experiments confirm that KVzip+ consistently matches or exceeds the performance of the original KVzip.
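Under the same toy-tensor assumptions, the KVzip+ normalization can be sketched by reweighting the attention matrix before taking the per-key maximum (shapes and the per-head output-projection slice `w_o` are illustrative, not the paper's exact layout):

```python
import torch

def kvzip_plus_scores(attn, values, hidden, w_o):
    """attn:   [heads, T_rep, T_orig]  attention weights a[j, i]
       values: [heads, T_orig, d_head] value vectors v_i
       hidden: [T_rep, d_model]        hidden states x_j at the querying steps
       w_o:    [d_model, d_head]       per-head slice of the output projection
       Score_i = max_j a[j, i] * ||W_O v_i|| / ||x_j||."""
    v_contrib = (values @ w_o.T).norm(dim=-1)   # [heads, T_orig]: ||W_O v_i||
    x_norm = hidden.norm(dim=-1)                # [T_rep]:        ||x_j||
    weighted = attn * v_contrib[:, None, :] / x_norm[None, :, None]
    return weighted.max(dim=1).values           # [heads, T_orig]

attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)
scores = kvzip_plus_scores(attn, torch.randn(8, 16, 64),
                           torch.randn(16, 256), torch.randn(256, 64))
```

Tokens whose values barely move the residual stream get downweighted even if they receive high raw attention.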
KVzap Methodology:
To overcome KVzip's practical limitations, KVzap proposes approximating the KVzip+ scores using a lightweight surrogate model. This model is designed to be computationally efficient and applicable during both prefilling and decoding.
- Surrogate Model Training: A per-layer surrogate model is trained to predict the KVzip+ scores directly from the input hidden states. The model takes a hidden state in $\mathbb{R}^d$ (where $d$ is the hidden dimension) as input and outputs scores in $\mathbb{R}^{H_{kv}}$ (where $H_{kv}$ is the number of KV heads). Two types of surrogate model are explored: a simple linear layer (KVzap-Linear) and a two-layer Multi-Layer Perceptron (KVzap-MLP). The MLP uses a single hidden layer followed by a GELU activation. These models are trained on a large dataset of (hidden state, KVzip+ score) pairs derived from diverse text sources, effectively learning to map intrinsic properties of the hidden states to KV-pair importance as defined by KVzip+.
- Pruning Policy: Unlike KVzip, which enforces a fixed compression budget (e.g., keeping exactly 50% of KV pairs), KVzap employs a dynamic thresholding policy: a KV pair is discarded if its predicted score falls below a fixed threshold $\tau$. This makes KVzap input-adaptive, automatically adjusting the compression ratio to the perceived information density of the input. Complex or information-rich inputs retain more KV pairs (lower compression), while redundant inputs yield higher compression.
- Sliding Window: To preserve essential local context, KVzap implements a sliding window mechanism: a fixed window of the most recent tokens is always retained, regardless of predicted scores. This ensures the model always has access to the immediately preceding tokens, which are often critical for coherence and local dependencies.
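The two surrogate variants might look like the following (a minimal sketch; the MLP hidden width and the toy dimensions are illustrative assumptions, not values from the paper):

```python
import torch
import torch.nn as nn

def make_surrogate(d_model: int, n_kv_heads: int, mlp_width: int = 512, mlp: bool = False):
    """Per-layer surrogate mapping hidden states [..., d_model] -> scores [..., n_kv_heads]."""
    if not mlp:
        return nn.Linear(d_model, n_kv_heads)   # KVzap-Linear
    return nn.Sequential(                       # KVzap-MLP (mlp_width is illustrative)
        nn.Linear(d_model, mlp_width),
        nn.GELU(),
        nn.Linear(mlp_width, n_kv_heads),
    )

surrogate = make_surrogate(d_model=256, n_kv_heads=8, mlp=True)
scores = surrogate(torch.randn(4, 10, 256))     # [batch, seq, n_kv_heads]
```

Training would regress these outputs against precomputed KVzip+ scores, e.g. with an MSE loss over the (hidden state, score) dataset described above.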
The KVzap pruning process during prefilling (simplified PyTorch pseudocode):

```python
def compress(hidden_states, keys, values, kvzap_model, threshold, window=128):
    scores = kvzap_model(hidden_states)    # predict scores from hidden states
    scores[..., -window:] = float("inf")   # always preserve the recent window
    keep = scores >= threshold             # boolean mask of tokens to keep
    return keys[keep], values[keep]        # return the pruned KV pairs
```

For decoding, a score buffer is maintained to manage the sliding window dynamically.
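At decode time the window cannot be applied with a single retroactive mask, so one way to realize such a score buffer (a sketch of the idea only, not the paper's implementation) is to hold each new token's score until it leaves the window, and only then decide whether to evict its KV pair:

```python
from collections import deque

class ScoreBuffer:
    """Defer pruning decisions until a token exits the sliding window."""
    def __init__(self, threshold: float, window: int = 128):
        self.threshold = threshold
        self.buf = deque(maxlen=window)      # (position, predicted score) pairs

    def step(self, pos: int, score: float):
        """Record the newest token; return the position to evict, if any."""
        evicted = None
        if len(self.buf) == self.buf.maxlen:
            old_pos, old_score = self.buf[0]         # token about to leave the window
            if old_score < self.threshold:
                evicted = old_pos                    # its KV pair can be pruned
        self.buf.append((pos, score))                # newest token is always kept
        return evicted
```

Tokens inside the window are always kept, matching the prefill policy; eviction decisions lag the decode position by exactly the window size.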
Experimental Evaluation:
KVzap was rigorously evaluated on Qwen3-8B, Llama-3.1-8B-Instruct, and Qwen3-32B across various benchmarks:
- RULER (4k & 16k): Long-context benchmark for retrieval, multi-hop tracing, aggregation, and QA.
- LongBench: A comprehensive long-context benchmark covering diverse tasks.
- AIME25: A reasoning benchmark for complex mathematical problems.
Key Findings:
- Score Prediction Accuracy: KVzap-MLP consistently achieved a higher Pearson correlation (0.668-0.772) between predicted and actual KVzip+ scores than KVzap-Linear (0.629-0.743), indicating a better approximation of the oracle scores.
- Performance on RULER: KVzap achieved state-of-the-art results, matching or exceeding KVzip+ performance on RULER 4k, while significantly outperforming 15 other pruning methods. It maintained near-perfect accuracy up to 3-4x compression.
- Performance on LongBench: KVzap models maintained near-perfect accuracy up to 2-3x compression. The adaptive thresholding naturally resulted in lower compression ratios compared to synthetic datasets like RULER, reflecting the higher information density of real-world data.
- Performance on AIME25 (Reasoning): KVzap-MLP preserved reasoning accuracy even when discarding over 50% of the KV cache, demonstrating its effectiveness for generative tasks.
- Adaptive Compression: The thresholding approach led to varying compression ratios across different tasks and even within prompts, adapting to the input's complexity.
- Overhead: KVzap's computational overhead is negligible (0.02% for Linear, 1.1% for MLP) as it primarily involves a few matrix multiplications. Memory overhead is similarly low. During memory-bandwidth bound decoding, these additional FLOPs efficiently utilize idle GPU cycles.
- Sliding Window Importance: Ablations showed that the sliding window is crucial for maintaining accuracy, especially when input hidden states do not explicitly encode position information.
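The overhead figure for KVzap-Linear can be sanity-checked with a back-of-envelope count (assuming Llama-3.1-8B-style layer dimensions; the paper's exact accounting may differ):

```python
# Per-token matmul parameters (~= FLOPs / 2) in one Llama-3.1-8B-style
# decoder layer: d=4096, 8 KV heads of dim 128 (GQA), FFN width 14336.
d, h_kv, head_dim, ffn = 4096, 8, 128, 14336
attn = 2 * d * d + 2 * d * (h_kv * head_dim)   # Q and O projections + GQA K, V
mlp = 3 * d * ffn                              # gate, up, and down projections
layer = attn + mlp

surrogate_params = d * h_kv                    # KVzap-Linear: one d -> h_kv matmul
ratio = surrogate_params / layer
print(f"{ratio:.5%}")                          # ~0.015%, same order as the reported 0.02%
```

The linear surrogate adds tens of thousands of parameters per layer against hundreds of millions, which is why its cost disappears into idle GPU cycles during bandwidth-bound decoding.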
In conclusion, KVzap successfully addresses the limitations of prior KV cache pruning methods by providing a fast, adaptive, and accurate solution that can be seamlessly integrated into LLM inference pipelines. It achieves significant KV cache compression (2-4x) with negligible accuracy degradation across diverse models and tasks, making it a promising candidate for production deployment.