cs.CL

Managing memory for long-context language models across GPUs and SSDs

Jian Lin, Jiazhi Mi, Zicong Hong, Haodong Wang, Qianli Liu, Haodyue Zhang, Peng Li, Song Guo

May 18, 2026

Long-context language models require key-value (KV) caches that can exceed GPU memory, forcing systems to offload them to host DRAM and disk. Current approaches lean on sparsity, keeping only the most critical cache entries, but sparsity can be pushed only so far before accuracy suffers, so data transfer remains a bottleneck during decoding. KVDrive instead treats the problem as a systems optimization: it jointly manages cache placement across tiers, restructures the decoding pipeline to overlap computation and I/O, and coordinates data movement to minimize stalls. Implemented and tested on popular LLMs, the system achieves a 1.74× throughput improvement over existing offloading systems while maintaining accuracy. The work is aimed at practitioners deploying long-context inference under memory constraints.
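To make the idea of overlapping computation and I/O concrete, here is a minimal Python sketch of one-step-ahead prefetching. It is not KVDrive's implementation: the functions `load_kv_block`, `attend`, `decode_step_sequential`, and `decode_step_overlapped` are hypothetical names, and the sleeps are stand-ins for real transfer and compute time. It only illustrates the general pattern of fetching the next layer's cache block from a slower tier while the current layer is being processed.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def load_kv_block(layer, delay=0.01):
    """Hypothetical stand-in for fetching one layer's KV block from DRAM or SSD."""
    time.sleep(delay)      # simulated transfer latency
    return [layer] * 4     # placeholder payload


def attend(layer, kv_block, delay=0.01):
    """Hypothetical stand-in for the attention computation over a fetched block."""
    time.sleep(delay)      # simulated compute time
    return sum(kv_block)


def decode_step_sequential(num_layers):
    """Baseline: each layer waits for its own load before computing."""
    return [attend(i, load_kv_block(i)) for i in range(num_layers)]


def decode_step_overlapped(num_layers):
    """Prefetch layer i+1 on a background thread while layer i computes."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(load_kv_block, 0)
        for i in range(num_layers):
            kv_block = pending.result()          # blocks only if the load is not done yet
            if i + 1 < num_layers:
                pending = io_pool.submit(load_kv_block, i + 1)
            outputs.append(attend(i, kv_block))  # runs while the next load is in flight
    return outputs
```

Both functions return the same values; the overlapped version simply hides most of the load latency behind compute when the two take comparable time. A real system would also have to decide which blocks live in which tier, which this sketch leaves out.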
Published as "KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference", arXiv:2605.18071.