
KVzap: Fast, Adaptive, and Faithful KV Cache Pruning
Key Points
- KVzap is introduced as a fast, adaptive, and faithful KV cache pruning method designed to alleviate the memory bottleneck in long-context Large Language Models.
- It approximates an improved KVzip variant (KVzip+) by training a lightweight surrogate model (linear or MLP) that predicts importance scores from hidden states, dynamically pruning KV pairs below a fixed threshold.
- KVzap achieves 2-4x KV cache compression with negligible accuracy loss across Qwen3-8B, Llama-3.1-8B-Instruct, and Qwen3-32B on diverse long-context and reasoning tasks, demonstrating state-of-the-art performance.
KVzap is a novel method for pruning the Key-Value (KV) cache in transformer-based language models, designed to alleviate the memory bottleneck associated with long context lengths during inference. While various methods have been proposed to compress the KV cache along the sequence length (T-axis), their adoption in major inference engines has been limited due to trade-offs between speed and accuracy. KVzap addresses this by offering a fast, adaptive, and faithful approximation of the state-of-the-art KVzip method.
The paper first highlights the significant memory consumption of the KV cache, which grows linearly with the sequence length $T$ and also scales with the number of layers $L$, the number of KV heads $H$, and the head dimension $D$, making it a dominant bottleneck for LLMs handling long contexts. Existing architectural modifications have targeted the $L$, $H$, and $D$ axes, but $T$-axis compression often relies on ad-hoc pruning.
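To make the scale of this bottleneck concrete, here is a back-of-envelope sketch (assuming Llama-3.1-8B's published dimensions: 32 layers, 8 KV heads, head dimension 128, and an fp16 cache):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Total KV cache size: 2 (keys and values) x layers x KV heads x seq_len x head_dim."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Llama-3.1-8B-style dimensions at a 128k-token context, fp16:
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=131072)
print(size / 2**30)  # 16.0 GiB -- comparable to the 8B model's own weights
```

At this scale the cache, not the weights, dominates per-request memory, which is why $T$-axis pruning is attractive.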
KVzip and KVzip+:
KVzap builds upon KVzip, a query-agnostic KV cache pruning method that achieves high compression with minimal accuracy loss. KVzip operates by defining an "importance score" for each KV pair based on a copy-and-paste pretext task: given an input prompt, an extended prompt is constructed by repeating the original prompt. For each head $h$, the score of the KV pair at position $i$ in the original prompt is the maximum attention weight it receives when attending to the repeated prompt:

$$S_i^{(h)} = \max_j \, a_{j,i}^{(h)},$$

where $a_{j,i}^{(h)}$ is the attention weight from query $j$ (in the repeated prompt) to key $i$ (in the original prompt). KV pairs with the lowest scores are then pruned. A key limitation of KVzip is its computational expense: it requires a costly double prefilling step, making it slow and unsuitable for real-time decoding.
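The per-head scoring step can be sketched as follows (a simplified single-layer view; `attn` is assumed to hold the cross-attention weights obtained from the second prefill over the repeated prompt):

```python
import torch

def kvzip_scores(attn: torch.Tensor) -> torch.Tensor:
    """attn: [heads, T_repeat, T_orig] attention weights from queries in the
    repeated prompt to keys in the original prompt.
    Returns [heads, T_orig]: the max attention each original KV pair receives."""
    return attn.max(dim=1).values

attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)  # toy attention weights
scores = kvzip_scores(attn)  # shape [8, 16]; lowest-scoring pairs get pruned
```

Computing `attn` is exactly the expensive part: it requires prefilling the prompt twice, which is what KVzap later avoids.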
KVzap introduces KVzip+, an enhanced version of KVzip's scoring mechanism. KVzip+ incorporates a normalization term inspired by recent work on expected attention, which considers the contribution of each token to the residual stream. The KVzip+ score is defined as:

$$S_i^{(h)} = \max_j \, a_{j,i}^{(h)} \cdot \frac{\lVert W_O \, v_i^{(h)} \rVert_2}{\lVert x_j \rVert_2},$$

where $x_j$ is the input hidden state at decoding step $j$, $W_O$ is the output projection matrix, and $v_i^{(h)}$ is the value vector at position $i$. This normalization weights the attention scores by the magnitude of the value vector's contribution to the output and inversely by the magnitude of the current hidden state, aiming to better reflect each token's true importance. Experiments confirm that KVzip+ consistently matches or exceeds the performance of the original KVzip.
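Under the same toy-tensor assumptions, the KVzip+ normalization can be sketched by reweighting the attention matrix before taking the per-key maximum (shapes and the per-head output-projection slice `w_o` are illustrative, not the paper's exact layout):

```python
import torch

def kvzip_plus_scores(attn, values, hidden, w_o):
    """attn:   [heads, T_rep, T_orig]  attention weights a[j, i]
       values: [heads, T_orig, d_head] value vectors v_i
       hidden: [T_rep, d_model]        hidden states x_j at the querying steps
       w_o:    [d_model, d_head]       per-head slice of the output projection
       Score_i = max_j a[j, i] * ||W_O v_i|| / ||x_j||."""
    v_contrib = (values @ w_o.T).norm(dim=-1)   # [heads, T_orig]: ||W_O v_i||
    x_norm = hidden.norm(dim=-1)                # [T_rep]:        ||x_j||
    weighted = attn * v_contrib[:, None, :] / x_norm[None, :, None]
    return weighted.max(dim=1).values           # [heads, T_orig]

attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)
scores = kvzip_plus_scores(attn, torch.randn(8, 16, 64),
                           torch.randn(16, 256), torch.randn(256, 64))
```

Tokens whose values barely move the residual stream get downweighted even if they receive high raw attention.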
KVzap Methodology:
To overcome KVzip's practical limitations, KVzap proposes approximating the KVzip+ scores using a lightweight surrogate model. This model is designed to be computationally efficient and applicable during both prefilling and decoding.
- Surrogate Model Training: A per-layer surrogate model is trained to predict the KVzip+ scores directly from the input hidden states. The model takes a hidden state in $\mathbb{R}^d$ (where $d$ is the hidden dimension) as input and outputs scores in $\mathbb{R}^{H_{kv}}$ (where $H_{kv}$ is the number of KV heads). Two types of surrogate model are explored: a simple linear layer (KVzap-Linear) and a two-layer Multi-Layer Perceptron (KVzap-MLP). The MLP uses a single hidden layer followed by a GELU activation. These models are trained on a large dataset of (hidden state, KVzip+ score) pairs derived from diverse text sources, effectively learning to map intrinsic properties of the hidden states to KV-pair importance as defined by KVzip+.
- Pruning Policy: Unlike KVzip, which enforces a fixed compression budget (e.g., keeping exactly 50% of KV pairs), KVzap employs a dynamic thresholding policy: a KV pair is discarded if its predicted score falls below a fixed threshold $\tau$. This makes KVzap input-adaptive, automatically adjusting the compression ratio to the perceived information density of the input. Complex or information-rich inputs retain more KV pairs (lower compression), while redundant inputs yield higher compression.
- Sliding Window: To preserve essential local context, KVzap implements a sliding window mechanism: a fixed window of the most recent tokens is always retained, regardless of predicted scores. This ensures the model always has access to the immediately preceding tokens, which are often critical for coherence and local dependencies.
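The two surrogate variants might look like the following (a minimal sketch; the MLP hidden width and the toy dimensions are illustrative assumptions, not values from the paper):

```python
import torch
import torch.nn as nn

def make_surrogate(d_model: int, n_kv_heads: int, mlp_width: int = 512, mlp: bool = False):
    """Per-layer surrogate mapping hidden states [..., d_model] -> scores [..., n_kv_heads]."""
    if not mlp:
        return nn.Linear(d_model, n_kv_heads)   # KVzap-Linear
    return nn.Sequential(                       # KVzap-MLP (mlp_width is illustrative)
        nn.Linear(d_model, mlp_width),
        nn.GELU(),
        nn.Linear(mlp_width, n_kv_heads),
    )

surrogate = make_surrogate(d_model=256, n_kv_heads=8, mlp=True)
scores = surrogate(torch.randn(4, 10, 256))     # [batch, seq, n_kv_heads]
```

Training would regress these outputs against precomputed KVzip+ scores, e.g. with an MSE loss over the (hidden state, score) dataset described above.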
The KVzap pruning process during prefilling (simplified PyTorch pseudocode):

```python
def compress(hidden_states, keys, values, kvzap_model, threshold, window=128):
    scores = kvzap_model(hidden_states)    # predict scores from hidden states
    scores[..., -window:] = float("inf")   # always preserve the recent window
    keep = scores >= threshold             # boolean mask of tokens to keep
    return keys[keep], values[keep]        # return the pruned KV pairs
```

For decoding, a score buffer is maintained to manage the sliding window dynamically.
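At decode time the window cannot be applied with a single retroactive mask, so one way to realize such a score buffer (a sketch of the idea only, not the paper's implementation) is to hold each new token's score until it leaves the window, and only then decide whether to evict its KV pair:

```python
from collections import deque

class ScoreBuffer:
    """Defer pruning decisions until a token exits the sliding window."""
    def __init__(self, threshold: float, window: int = 128):
        self.threshold = threshold
        self.buf = deque(maxlen=window)      # (position, predicted score) pairs

    def step(self, pos: int, score: float):
        """Record the newest token; return the position to evict, if any."""
        evicted = None
        if len(self.buf) == self.buf.maxlen:
            old_pos, old_score = self.buf[0]         # token about to leave the window
            if old_score < self.threshold:
                evicted = old_pos                    # its KV pair can be pruned
        self.buf.append((pos, score))                # newest token is always kept
        return evicted
```

Tokens inside the window are always kept, matching the prefill policy; eviction decisions lag the decode position by exactly the window size.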
Experimental Evaluation:
KVzap was rigorously evaluated on Qwen3-8B, Llama-3.1-8B-Instruct, and Qwen3-32B across various benchmarks:
- RULER (4k & 16k): Long-context benchmark for retrieval, multi-hop tracing, aggregation, and QA.
- LongBench: A comprehensive long-context benchmark covering diverse tasks.
- AIME25: A reasoning benchmark for complex mathematical problems.
Key Findings:
- Score Prediction Accuracy: KVzap-MLP consistently achieved a higher Pearson correlation (0.668-0.772) between predicted and actual KVzip+ scores than KVzap-Linear (0.629-0.743), indicating a better approximation of the oracle scores.
- Performance on RULER: KVzap achieved state-of-the-art results, matching or exceeding KVzip+ performance on RULER 4k, while significantly outperforming 15 other pruning methods. It maintained near-perfect accuracy up to 3-4x compression.
- Performance on LongBench: KVzap models maintained near-perfect accuracy up to 2-3x compression. The adaptive thresholding naturally resulted in lower compression ratios compared to synthetic datasets like RULER, reflecting the higher information density of real-world data.
- Performance on AIME25 (Reasoning): KVzap-MLP preserved reasoning accuracy even when discarding over 50% of the KV cache, demonstrating its effectiveness for generative tasks.
- Adaptive Compression: The thresholding approach led to varying compression ratios across different tasks and even within prompts, adapting to the input's complexity.
- Overhead: KVzap's computational overhead is negligible (0.02% for Linear, 1.1% for MLP) as it primarily involves a few matrix multiplications. Memory overhead is similarly low. During memory-bandwidth bound decoding, these additional FLOPs efficiently utilize idle GPU cycles.
- Sliding Window Importance: Ablations showed that the sliding window is crucial for maintaining accuracy, especially when input hidden states do not explicitly encode position information.
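The overhead figure for KVzap-Linear can be sanity-checked with a back-of-envelope count (assuming Llama-3.1-8B-style layer dimensions; the paper's exact accounting may differ):

```python
# Per-token matmul parameters (~= FLOPs / 2) in one Llama-3.1-8B-style
# decoder layer: d=4096, 8 KV heads of dim 128 (GQA), FFN width 14336.
d, h_kv, head_dim, ffn = 4096, 8, 128, 14336
attn = 2 * d * d + 2 * d * (h_kv * head_dim)   # Q and O projections + GQA K, V
mlp = 3 * d * ffn                              # gate, up, and down projections
layer = attn + mlp

surrogate_params = d * h_kv                    # KVzap-Linear: one d -> h_kv matmul
ratio = surrogate_params / layer
print(f"{ratio:.5%}")                          # ~0.015%, same order as the reported 0.02%
```

The linear surrogate adds tens of thousands of parameters per layer against hundreds of millions, which is why its cost disappears into idle GPU cycles during bandwidth-bound decoding.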
In conclusion, KVzap successfully addresses the limitations of prior KV cache pruning methods by providing a fast, adaptive, and accurate solution that can be seamlessly integrated into LLM inference pipelines. It achieves significant KV cache compression (2-4x) with negligible accuracy degradation across diverse models and tasks, making it a promising candidate for production deployment.