cs.CL

Protecting prompt boundaries solves most KV cache eviction problems

Gabriel Garcia

May 18, 2026

When large language models evict key-value (KV) cache entries during long-context decoding, most eviction policies fail catastrophically unless special tokens at prompt boundaries are protected. This work evaluates seven policies (LRU, H2O, SnapKV, StreamingLLM, Ada-KV, QUEST, Random) on LongBench and finds that simply reserving 10% of the cache at each boundary recovers the bulk of the lost quality. Analysis of attention patterns shows that the first token (position 0) captures roughly 75% of prefix attention, while most scorers underweight the other boundary tokens. With boundary protection in place, simpler scoring variants become equivalent to more complex attention-based methods. The results hold across 10 models and extend to 64K-token contexts, though the improvements diminish at extreme compression ratios. This is primarily an empirical study, with implications for practitioners deploying long-context LLMs under memory constraints.
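To make the idea of boundary protection concrete, here is a minimal sketch and not the paper's implementation. It assumes that "reserving 10% of cache at each boundary" means pinning a fixed share of the cache budget to the first and last positions of the prompt, then filling the remaining slots by score. The function name `select_kept_positions`, the `boundary_frac` parameter, and the use of a single per-position score are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def select_kept_positions(scores: Sequence[float], budget: int,
                          boundary_frac: float = 0.10) -> List[int]:
    """Return the cache positions to keep under a global budget.

    Assumption (not from the paper): each boundary gets
    int(budget * boundary_frac) pinned slots, and the rest of the budget
    goes to the highest-scoring unprotected positions.
    """
    n = len(scores)
    if budget >= n:
        return list(range(n))  # nothing needs to be evicted

    # Slots pinned at each end; capped so both ends fit in the budget.
    reserve = min(int(budget * boundary_frac), budget // 2)
    protected = set(range(reserve)) | set(range(n - reserve, n))

    # Fill the remaining budget with the best-scoring unprotected positions.
    remaining = budget - len(protected)
    candidates = [i for i in range(n) if i not in protected]
    candidates.sort(key=lambda i: scores[i], reverse=True)

    return sorted(protected | set(candidates[:remaining]))


if __name__ == "__main__":
    # Toy scorer that favors recent positions and would evict position 0.
    toy_scores = [float(i) for i in range(20)]
    print(select_kept_positions(toy_scores, budget=10))
```

In this toy run the scorer alone would drop position 0, but the reserved slot keeps it in the cache regardless of its score. Any scoring policy can be plugged in through `scores`, which is the sense in which protection is independent of the scorer here.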
Published as "Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction" (arXiv:2605.18053)