NVIDIA Open-Sources KVzap: A Breakthrough in LLM KV Cache Compression for Long Contexts
Friday, January 16, 2026 · 4 min read

The demand for large language models (LLMs) to process increasingly lengthy contexts, often spanning tens or hundreds of thousands of tokens, has brought a critical challenge to the forefront: the key-value (KV) cache. This cache, essential for transformer decoders, stores a key and a value vector for every token at every layer and attention head. Its substantial memory footprint, which can reach hundreds of gigabytes for larger models at extended contexts, directly restricts batch sizes and increases the time to first token.
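
To get a feel for the scale, a rough back-of-the-envelope calculation shows how quickly the cache grows with context length and batch size. The model dimensions below are illustrative assumptions, not figures from the article:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """KV cache size: keys and values (factor 2) for every token, layer, and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical 70B-class model with GQA, fp16 cache, 128k-token context, batch of 8
size_gib = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                          seq_len=128_000, batch_size=8) / 2**30
print(f"{size_gib:.0f} GiB")  # ~312 GiB of cache before a single new token is generated
```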

Addressing the KV Cache Bottleneck

While various architectural innovations like Grouped Query Attention (GQA) and Multi-head Latent Attention (MLA) have successfully compressed the KV cache along the head or dimension axes, the sequence axis has largely remained untouched. Solutions like sparse or retrieval attention only selectively access cached tokens, without reducing the overall memory allocated for them. This highlighted a pressing need for techniques capable of intelligently pruning cache entries deemed less critical for future token generation.
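
The distinction between the axes is easy to see once the cache size is factorized. The numbers below are assumptions chosen only to illustrate why sequence-axis pruning is complementary to GQA- or MLA-style compression:

```python
# Per layer, the KV cache scales as num_kv_heads * head_dim * seq_len. GQA/MLA shrink
# the first two factors; sequence-axis pruning targets the third, so the savings multiply.
# All numbers here are assumptions for illustration, not figures from the article.
num_query_heads, num_kv_heads = 64, 8

head_axis_saving = num_query_heads / num_kv_heads   # 8x from sharing KV heads (GQA)
keep_fraction = 0.3                                  # a pruner that keeps 30% of entries
seq_axis_saving = 1 / keep_fraction                  # ~3.3x from dropping cache entries

print(f"combined compression: {head_axis_saving * seq_axis_saving:.1f}x")  # ~26.7x
```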

NVIDIA’s KVpress project has served as a central hub for evaluating over twenty such pruning methods, offering a public leaderboard on Hugging Face. Among these, KVzip and its refined version, KVzip+, established themselves as leading baselines. These methods assign an importance score to each cache entry by observing how well a model can reproduce its input, then prune lower-scoring entries. While highly effective, KVzip’s reliance on an extended prefill prompt makes it computationally intensive and impractical for real-time production inference.
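
The oracle scoring idea can be sketched in a few lines. This is a simplified rendering of the behaviour described above (scoring each cache entry during an extra pass in which the model reproduces its input), not the exact KVzip+ formulation; the tensor shapes and the top-k pruning rule are assumptions:

```python
import torch

def oracle_scores(reconstruction_attn):
    """Score each cached position by the maximum attention it receives while the model
    reproduces its own input. Assumed shape:
    reconstruction_attn: [num_reconstruction_queries, num_cached_tokens], one head."""
    return reconstruction_attn.max(dim=0).values       # [num_cached_tokens]

def keep_top_entries(scores, keep_ratio=0.3):
    """Keep only the highest-scoring fraction of cache entries."""
    k = max(1, int(keep_ratio * scores.numel()))
    return scores.topk(k).indices                       # indices of entries to retain

# Example: 4 reconstruction queries attending over 10 cached tokens
attn = torch.softmax(torch.randn(4, 10), dim=-1)
print(keep_top_entries(oracle_scores(attn), keep_ratio=0.3))
```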

Introducing KVzap: Efficient, Adaptive Pruning

KVzap emerges as NVIDIA's answer to the limitations of oracle-based pruning. It replaces the expensive KVzip+ scoring mechanism with a compact surrogate model that operates directly on hidden states. For each transformer layer and sequence position, this module predicts importance scores for every key-value head. Researchers explored two architectures for the surrogate: a single linear layer (KVzap Linear) and a two-layer MLP, with the MLP consistently showing the stronger correlation with the oracle scores (squared Pearson correlations between 0.63 and 0.77).
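
A minimal PyTorch sketch of the two surrogate variants, assuming each maps a layer's hidden state to one score per KV head; the class names and the MLP's hidden width are assumptions, since the article does not give exact sizes:

```python
import torch.nn as nn

class KVzapLinearSurrogate(nn.Module):
    """Linear variant: a single projection from hidden state to per-KV-head scores."""
    def __init__(self, hidden_size: int, num_kv_heads: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_kv_heads)

    def forward(self, hidden_states):          # [batch, seq_len, hidden_size]
        return self.proj(hidden_states)        # [batch, seq_len, num_kv_heads]

class KVzapMLPSurrogate(nn.Module):
    """Two-layer MLP variant, which the article reports correlates best with the oracle."""
    def __init__(self, hidden_size: int, num_kv_heads: int, intermediate_size: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.ReLU(),
            nn.Linear(intermediate_size, num_kv_heads),
        )

    def forward(self, hidden_states):          # [batch, seq_len, hidden_size]
        return self.net(hidden_states)         # [batch, seq_len, num_kv_heads]
```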

The surrogate models are trained on a subset of the Nemotron Pretraining Dataset, where KVzip+ provides the ground truth importance scores. This process generates approximately 1.2 million training pairs per key-value head, enabling the surrogate to effectively learn the underlying importance ranking.
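
Training then reduces to a straightforward regression of surrogate outputs onto the KVzip+ targets. The sketch below uses random placeholder tensors in place of the real (hidden state, oracle score) pairs, and the MSE loss and optimizer are assumptions rather than details from the article:

```python
import torch
import torch.nn as nn

hidden_size, num_kv_heads = 4096, 8                 # assumed model dimensions
surrogate = nn.Sequential(                          # same shape as the MLP sketch above
    nn.Linear(hidden_size, 128), nn.ReLU(), nn.Linear(128, num_kv_heads)
)
optimizer = torch.optim.AdamW(surrogate.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Placeholders standing in for (hidden state, KVzip+ score) pairs drawn from the
# Nemotron Pretraining Dataset subset described above.
hidden_states = torch.randn(2048, hidden_size)
oracle_targets = torch.rand(2048, num_kv_heads)

for step in range(200):
    loss = loss_fn(surrogate(hidden_states), oracle_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```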

Seamless Integration and Minimal Overhead

During inference, KVzap processes hidden states to generate scores for cache entries. Entries falling below a predefined threshold are pruned, with a crucial safeguard: a sliding window of the 128 most recent tokens is always preserved to maintain local context. Unlike fixed-budget methods, KVzap employs score thresholding, so the compression ratio adapts dynamically to the information density of the input prompt, with observed ratios varying by up to 20% from prompt to prompt.
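
In code, the pruning rule amounts to a thresholded boolean mask with the recent-token window forced on. This is a minimal sketch of the behaviour described above, assuming a per-head score vector for simplicity:

```python
import torch

def kvzap_keep_mask(scores: torch.Tensor, threshold: float, window: int = 128) -> torch.Tensor:
    """scores: [seq_len] predicted importance for one layer / KV head.
    Returns a boolean mask (True = keep) that never drops the last `window` tokens."""
    keep = scores >= threshold
    keep[-window:] = True               # always preserve the most recent tokens
    return keep

scores = torch.rand(10_000)             # placeholder surrogate outputs
mask = kvzap_keep_mask(scores, threshold=0.6)
print(f"compression: {scores.numel() / mask.sum().item():.1f}x")
```

Because a threshold rather than a fixed budget drives the decision, the printed ratio shifts with the score distribution of the prompt, which is exactly the adaptive behaviour described above.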

Crucially, KVzap introduces negligible computational and memory overhead. The MLP variant adds at most about 1.1% to the linear projection FLOPs, while the linear version adds a mere 0.02%. In long-context scenarios where attention's quadratic cost dominates, these additional FLOPs are virtually insignificant, making KVzap highly practical for integration into existing LLM serving infrastructure.
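
As a sanity check on why the extra cost vanishes at long context, compare the surrogate's per-token FLOPs with the per-token attention FLOPs, which grow linearly with the number of cached tokens. All dimensions below are assumptions; the 1.1% and 0.02% figures above refer to the models the article actually measured:

```python
# Assumed dimensions for illustration only
hidden_size, num_q_heads, num_kv_heads, head_dim = 4096, 32, 8, 128
seq_len = 128_000                                   # cached tokens at long context

# Per-token, per-layer attention FLOPs (QK^T scores + weighted sum of values)
attn_flops = 2 * num_q_heads * head_dim * seq_len

# Per-token surrogate FLOPs (the assumed 128-wide MLP from the earlier sketch)
surrogate_flops = hidden_size * 128 + 128 * num_kv_heads

print(f"surrogate / attention: {100 * surrogate_flops / attn_flops:.3f}%")  # ~0.05%
```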

Benchmarking Breakthrough Performance

KVzap's effectiveness was rigorously evaluated across a range of LLMs (Qwen3-8B, Llama-3.1-8B Instruct, Qwen3-32B) and benchmarks, including RULER for synthetic long-context tasks, LongBench for real-world documents, and AIME25 for math reasoning. The results demonstrate remarkable efficiency:

  • On RULER, KVzap configurations achieved over 70% cache removal while maintaining accuracy within a few tenths of a point of the full cache baseline.
  • For LongBench, where real-world documents are less repetitive and compression ratios are therefore more modest, KVzap remained competitive with the full-cache baseline at up to 2-3x compression.
  • On AIME25, KVzap MLP preserved or slightly improved pass@4 accuracy at approximately 2x compression, and it remained robust even when discarding more than half of the cache.

Overall, the best KVzap configurations consistently delivered average cache compression between 2.7x and 3.5x across all evaluated models and benchmarks, without significantly impacting task performance. This open-source method, implemented within the KVpress framework and offering ready-to-use checkpoints, represents a significant step forward in making long-context LLM deployments more efficient and scalable.
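
For readers who want to try it, KVpress exposes its compression methods as "presses" plugged into a Hugging Face pipeline. The snippet below only sketches that general pattern: the KVzapPress class name, its arguments, and the pipeline details are placeholders and assumptions, since the article does not spell out the API, so check the KVpress repository for the actual identifiers and released checkpoints:

```python
from transformers import pipeline
from kvpress import KVzapPress  # placeholder name; consult the KVpress docs for the real press

# KVpress registers a custom text-generation pipeline that applies a press during prefill
# (pipeline task name and arguments assumed from the project's documented usage pattern).
pipe = pipeline(
    "kv-press-text-generation",
    model="Qwen/Qwen3-8B",
    device_map="auto",
    torch_dtype="auto",
)
press = KVzapPress()   # threshold / checkpoint arguments omitted; see the released checkpoints

context = "..."        # long document whose cache is compressed at prefill
question = "..."       # query answered against the compressed cache
answer = pipe(context, question=question, press=press)["answer"]
```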

This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.

Source: MarkTechPost