Predicting what to retrieve before language models ask
Wuyang Zhang, Shichao Pei
May 18, 2026
Retrieval-augmented generation (RAG) systems ground language models in external knowledge, but generation stalls while the model waits for retrieval results. This work replaces synchronous retrieval with predictive prefetching: a framework that uses semantic signals in the generation dynamics to forecast when information will be needed and what to retrieve. Three components (a retrieval predictor, a context monitor, and a query generator) work together to trigger asynchronous retrieval before the model needs the results. Across multiple benchmarks, the approach achieves a 43.5% reduction in end-to-end latency and 62.4% faster time-to-first-token while matching the answer quality of standard synchronous RAG.
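The prefetching idea can be illustrated with a minimal sketch, which is not the authors' implementation. Everything in it is a hypothetical stand-in: `retrieve` simulates a retrieval backend with a fixed delay, `predict_retrieval_needed` replaces the paper's retrieval predictor with a toy keyword cue, `generate_query` replaces the query generator with a join of recent tokens, and the decoding loop plays the role of the context monitor. The `need_at` parameter marks the step where the model would otherwise block on retrieval.

```python
from __future__ import annotations

import asyncio

RETRIEVAL_DELAY = 0.05  # simulated retrieval latency in seconds (assumption)
TOKEN_DELAY = 0.01      # simulated per-token decode time in seconds (assumption)
CUE_WORDS = {"who", "when", "where"}  # toy trigger set (assumption)


async def retrieve(query: str) -> str:
    """Hypothetical retrieval backend: sleeps, then returns a fake result."""
    await asyncio.sleep(RETRIEVAL_DELAY)
    return f"docs for: {query}"


def predict_retrieval_needed(context: list[str]) -> bool:
    """Toy retrieval predictor: fires when the latest token is a cue word."""
    return bool(context) and context[-1] in CUE_WORDS


def generate_query(context: list[str]) -> str:
    """Toy query generator: uses the last three tokens as the query."""
    return " ".join(context[-3:])


async def generate_with_prefetch(tokens: list[str], need_at: int):
    """Decode tokens one by one and start retrieval early when predicted.

    Returns the retrieved text and the step at which retrieval was issued.
    """
    context: list[str] = []
    pending = None      # in-flight retrieval task, if any
    issued_at = None    # step index at which retrieval was started
    docs = None
    for step, token in enumerate(tokens):
        if step == need_at:
            if pending is None:
                # Predictor never fired: fall back to synchronous retrieval.
                pending = asyncio.create_task(retrieve(generate_query(context)))
                issued_at = step
            docs = await pending  # short or no wait if prefetched early
        await asyncio.sleep(TOKEN_DELAY)  # stand-in for decoding one token
        context.append(token)
        # Context monitor: check the predictor after every new token.
        if pending is None and predict_retrieval_needed(context):
            pending = asyncio.create_task(retrieve(generate_query(context)))
            issued_at = step
    return docs, issued_at


if __name__ == "__main__":
    sample = ["tell", "me", "who", "wrote", "the", "novel", "Dune"]
    print(asyncio.run(generate_with_prefetch(sample, need_at=6)))
```

In this toy run the retrieval task starts at step 2, so it overlaps with decoding of the following tokens and most of its latency is hidden by the time step 6 awaits the result. If the predictor never fires, the loop degrades to ordinary synchronous retrieval at `need_at`.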