Managing memory for long-context language models across GPUs and SSDs

Jian Lin, Jiazhi Mi, Zicong Hong, Haodong Wang, Qianli Liu, Haodyue Zhang, Peng Li, Song Guo

Long-context language models require massive key-value (KV) caches that exceed GPU memory, forcing systems to offload them to host DRAM and disk. Current approaches maximize sparsity, selectively keeping only the critical cache entries, but sparsity can be pushed only so far before accuracy degrades, so data transfers remain a bottleneck during decoding. KVDrive rethinks the problem as a systems optimization: it jointly manages cache placement across memory tiers, restructures the decoding pipeline to overlap computation with I/O, and coordinates data movement to minimize stalls. Implemented and tested on popular LLMs, the system achieves a 1.74× throughput improvement over existing offloading systems while maintaining accuracy. The work is intended for practitioners deploying long-context inference under memory constraints.
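
To make the idea of overlapping computation with I/O concrete, the following is a minimal illustrative sketch, not KVDrive's implementation. It assumes a simple per-layer prefetch scheme in which the KV block for the next layer is fetched on a background thread while the current layer computes. The names `load_kv`, `attend`, `decode_sequential`, and `decode_overlapped` are hypothetical, and `time.sleep` stands in for real transfer and compute costs.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_kv(layer, io_s=0.01):
    """Hypothetical stand-in for fetching one layer's KV block from DRAM or SSD."""
    time.sleep(io_s)  # simulated transfer latency
    return [layer] * 4


def attend(layer, kv, compute_s=0.01):
    """Hypothetical stand-in for one layer's attention computation."""
    time.sleep(compute_s)  # simulated compute time
    return sum(kv) + layer


def decode_sequential(num_layers):
    """Baseline: each layer waits for its own transfer before computing."""
    return [attend(i, load_kv(i)) for i in range(num_layers)]


def decode_overlapped(num_layers):
    """Prefetch layer i+1 on a background thread while layer i computes."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(load_kv, 0)  # start the first transfer
        for i in range(num_layers):
            kv = pending.result()  # stall only if the transfer is not done yet
            if i + 1 < num_layers:
                pending = io_pool.submit(load_kv, i + 1)  # issue next transfer
            outputs.append(attend(i, kv))  # compute while the transfer runs
    return outputs


if __name__ == "__main__":
    for fn in (decode_sequential, decode_overlapped):
        start = time.perf_counter()
        fn(8)
        print(fn.__name__, round(time.perf_counter() - start, 3), "s")
```

Both functions produce the same outputs. In this toy setting the overlapped version hides most of the simulated transfer time behind compute, which is the general effect a restructured decoding pipeline aims for; a real system would use asynchronous device copies and storage reads rather than Python threads.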