Speeding up long-context LLMs by caching attention states

Yasuyuki Okoshi, Hao Mark Chen, Guanxi Lu, Hongxiang Fan, Masato Motomura, Daichi Fujiki

Long conditioning prefixes help control LLM behavior but create two problems: their influence weakens as generation proceeds, and the cost of computing attention over them grows linearly with prefix length. This work introduces attention-state memory, which precomputes the attention states between prefix and query tokens and stores them in a lightweight lookup table, avoiding redundant computation at inference time. Tested on ManyICLBench with LLaMA-3.1-8B, the approach improves accuracy over standard in-context learning at 1K–8K token budgets, reduces attention latency by a factor of 1.36 at 8K tokens, and matches retrieval-augmented generation (RAG) performance on the NBA benchmark while using only 20% of its memory. The method requires no training or gradient updates, making it practical when prefixes change frequently.
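The idea of storing an attention state and reusing it later can be illustrated with a small sketch. This is not the authors' implementation: it only demonstrates the standard identity that softmax attention over a concatenated prefix and context can be recovered exactly from two partial states, each consisting of an output vector and a log-sum-exp normalizer. All names here (`attention_state`, `merge_states`, `table`, the key `"q0"`) are hypothetical, and looking up the table by an exact query identifier is a simplifying assumption; the abstract does not say how the table is indexed.

```python
import math

def attention_state(q, keys, values):
    """Single-query softmax attention over one block.

    Returns (output, lse), where lse is the log-sum-exp of the scaled scores.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                      # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]
    return out, m + math.log(z)

def merge_states(state_a, state_b):
    """Combine two partial attention states computed over disjoint key blocks."""
    out_a, lse_a = state_a
    out_b, lse_b = state_b
    m = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - m), math.exp(lse_b - m)
    total = w_a + w_b
    out = [(w_a * a + w_b * b) / total for a, b in zip(out_a, out_b)]
    return out, m + math.log(total)

# Toy data: a 3-token prefix, a 2-token context, and one query (dimension 2).
prefix_k = [[0.1, 0.3], [0.5, -0.2], [-0.4, 0.8]]
prefix_v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ctx_k = [[0.2, 0.2], [-0.3, 0.6]]
ctx_v = [[2.0, 1.0], [-1.0, 0.5]]
q = [0.7, -0.1]

# Offline: precompute the prefix attention state and store it (hypothetical table).
table = {"q0": attention_state(q, prefix_k, prefix_v)}

# Online: attend only over the new context, then merge with the stored state.
merged_out, merged_lse = merge_states(table["q0"], attention_state(q, ctx_k, ctx_v))

# Reference: full attention over prefix + context in one pass.
full_out, full_lse = attention_state(q, prefix_k + ctx_k, prefix_v + ctx_v)
```

In this sketch the online step touches only the context keys and values, so its cost no longer depends on the prefix length; the merged result matches the full computation up to floating-point error. A practical system would also need a way to map unseen queries to stored entries, which is outside what this example covers.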