Speeding up long-context LLMs by caching attention states
Yasuyuki Okoshi, Hao Mark Chen, Guanxi Lu, Hongxiang Fan, Masato Motomura, Daichi Fujiki
May 18, 2026
Long conditioning prefixes help control LLM behavior but create two problems: their influence weakens during generation, and the cost of computing attention over them scales linearly with prefix length. This work introduces attention-state memory, which precomputes the attention states between prefix and query tokens and stores them in a lightweight lookup table, avoiding redundant computation at inference time. Tested on ManyICLBench with LLaMA-3.1-8B, the approach improves accuracy over standard in-context learning at token budgets of 1K–8K, reduces attention latency by a factor of 1.36 at 8K tokens, and matches the performance of retrieval-augmented generation (RAG) on the NBA benchmark while using only 20% of its memory. The method requires no training or gradient updates, which makes it practical when prefixes change frequently.
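The idea of reusing precomputed prefix attention can be illustrated with a small NumPy sketch. This is not the authors' implementation: it only demonstrates the standard identity that softmax attention over two disjoint key sets can be recovered by merging their partial outputs with log-sum-exp weights. The function names, the toy dimensions, and the dictionary keyed by query index are all hypothetical, and the summary does not specify how the actual lookup table is keyed or how unseen queries are handled.

```python
import numpy as np


def attention_state(q, K, V):
    """Return (output, log-sum-exp) of softmax attention for a single query."""
    scores = K @ q / np.sqrt(q.shape[-1])
    m = scores.max()                      # subtract the max for numerical stability
    w = np.exp(scores - m)
    return (w @ V) / w.sum(), m + np.log(w.sum())


def merge_states(o1, lse1, o2, lse2):
    """Combine two partial attention states computed over disjoint key sets."""
    m = max(lse1, lse2)
    w1, w2 = np.exp(lse1 - m), np.exp(lse2 - m)
    return (w1 * o1 + w2 * o2) / (w1 + w2)


rng = np.random.default_rng(0)
d = 16                                    # toy head dimension (assumption)
K_prefix, V_prefix = rng.normal(size=(64, d)), rng.normal(size=(64, d))
K_local, V_local = rng.normal(size=(8, d)), rng.normal(size=(8, d))
queries = rng.normal(size=(4, d))

# Offline step: hypothetical lookup table of prefix attention states,
# keyed here by query index purely for illustration.
table = {i: attention_state(q, K_prefix, V_prefix) for i, q in enumerate(queries)}

# Inference step: attend only to the short local context, then merge
# with the stored prefix state instead of re-reading the whole prefix.
i = 2
o_local, lse_local = attention_state(queries[i], K_local, V_local)
o_fast = merge_states(*table[i], o_local, lse_local)

# Reference: full attention over the concatenated prefix and local context.
o_full, _ = attention_state(
    queries[i], np.vstack([K_prefix, K_local]), np.vstack([V_prefix, V_local])
)
assert np.allclose(o_fast, o_full)
```

In this toy setting the merged result matches full attention exactly, while the inference-time work depends only on the length of the local context rather than on the prefix length.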