cs.CL

Predicting what to retrieve before language models ask

Wuyang Zhang, Shichao Pei

May 18, 2026

Retrieval-augmented generation (RAG) systems ground language models in external knowledge, but the model must wait for each retrieval result, which adds latency. This work replaces synchronous retrieval with predictive prefetching: a framework that uses semantic signals in the generation dynamics to forecast when information will be needed and what to retrieve. Three components (a retrieval predictor, a context monitor, and a query generator) work together to trigger asynchronous retrieval before the model needs the results. Across multiple benchmarks, the approach achieves a 43.5% reduction in end-to-end latency and a 62.4% faster time-to-first-token while matching the answer quality of standard synchronous RAG.
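To make the prefetching idea concrete, here is a minimal sketch of how retrieval might be overlapped with decoding. It is not the paper's implementation: every name in it (`retrieve`, `should_prefetch`, `context_window`, `make_query`, `generate_with_prefetch`, `CUE_WORDS`) is hypothetical, the cue-word trigger is a toy stand-in for the learned semantic signals the summary describes, and both decoding and retrieval are simulated with `asyncio.sleep`.

```python
import asyncio

# Hypothetical trigger cues; the paper uses semantic signals, not keywords.
CUE_WORDS = {"who", "when", "where"}


async def retrieve(query, delay=0.05):
    """Stand-in retriever: simulates I/O latency and returns a fake document."""
    await asyncio.sleep(delay)
    return [f"doc for: {query}"]


def should_prefetch(tokens):
    """Toy retrieval predictor: fire when the latest token is a cue word."""
    return bool(tokens) and tokens[-1] in CUE_WORDS


def context_window(tokens, size=4):
    """Toy context monitor: expose the most recent tokens."""
    return tokens[-size:]


def make_query(window):
    """Toy query generator: join the monitored window into a query string."""
    return " ".join(window)


async def generate_with_prefetch(stream, step=0.01):
    """Consume a token stream and start retrieval early, without blocking."""
    tokens = []
    pending = None  # background retrieval task, if one has been started
    for tok in stream:
        await asyncio.sleep(step)  # simulated decoding step
        tokens.append(tok)
        if pending is None and should_prefetch(tokens):
            query = make_query(context_window(tokens))
            # Launch retrieval in the background; decoding continues meanwhile.
            pending = asyncio.create_task(retrieve(query))
    # Only wait here, at the point where the documents are actually needed.
    docs = await pending if pending is not None else []
    return tokens, docs


if __name__ == "__main__":
    prompt = "tell me who wrote hamlet".split()
    tokens, docs = asyncio.run(generate_with_prefetch(prompt))
    print(docs)
```

In this sketch the retrieval task starts as soon as the cue appears and runs while the remaining tokens are decoded, so the final `await` waits only for whatever retrieval time has not already been hidden behind generation. A synchronous baseline would instead pause decoding for the full retrieval delay.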
Published as "Predictive Prefetching for Retrieval-Augmented Generation", arXiv:2605.17989