Protecting prompt boundaries solves most KV cache eviction problems
Gabriel Garcia
May 18, 2026
When large language models evict key-value (KV) cache entries during long-context decoding, most eviction policies fail catastrophically unless the special tokens at prompt boundaries are protected. This work evaluates seven policies (LRU, H2O, SnapKV, StreamingLLM, Ada-KV, QUEST, Random) on LongBench and finds that simply reserving 10% of the cache at each boundary recovers the bulk of the lost quality. Analysis of attention patterns shows that the first token (position 0) captures ~75% of prefix attention, while most scorers underweight the other boundary tokens. With boundary protection, simpler scoring variants become equivalent to more complex attention-based methods. The results hold across 10 models and extend to 64K-token contexts, though the improvements diminish at extreme compression ratios. This is primarily an empirical study, with implications for practitioners deploying long-context LLMs under memory constraints.
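The idea of boundary protection can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes that "each boundary" means the first and last positions of the prompt, that the reserved fraction is taken from the cache budget, and that some scorer has already produced one importance score per cached token. The function name `select_kept_positions` and its parameters are hypothetical.

```python
from typing import List, Sequence


def select_kept_positions(scores: Sequence[float], budget: int,
                          boundary_frac: float = 0.10) -> List[int]:
    """Pick which cached token positions to keep (illustrative sketch).

    Assumption: a fraction of the budget is reserved at the start and at
    the end of the prompt; the rest goes to the highest-scoring tokens.
    """
    n = len(scores)
    if budget >= n:
        return list(range(n))  # everything fits, nothing to evict
    if budget <= 0:
        return []

    # Slots reserved per boundary, capped so both reserves fit the budget.
    reserve = min(int(budget * boundary_frac), budget // 2)
    protected = set(range(reserve)) | set(range(n - reserve, n))

    # Fill the remaining slots with the best-scoring unprotected tokens.
    remaining = budget - len(protected)
    candidates = [i for i in range(n) if i not in protected]
    candidates.sort(key=lambda i: scores[i], reverse=True)

    return sorted(protected | set(candidates[:remaining]))


# Toy example: the scorer underweights both boundary tokens.
scores = [float(i) for i in range(20)]
scores[0] = -2.0
scores[19] = -1.0

unprotected = select_kept_positions(scores, budget=10, boundary_frac=0.0)
protected = select_kept_positions(scores, budget=10, boundary_frac=0.10)
```

In this toy case, pure score-based selection drops positions 0 and 19, while the protected variant keeps both and gives up the two lowest-scoring middle tokens instead. The scorer itself is left untouched, so the same wrapper could sit around any of the scoring policies.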