cs.CL

Adapting language models faster by skipping token generation

Andrew Lanpouthakoun, Aryaman Arora, Zhengxuan Wu, Dhruv Pai, Ben Keigwin, Dan Jurafsky, Christopher Potts

May 14, 2026

Serving personalized language models to many users at once degrades throughput, because applying per-user adapters slows down token generation. PreFT applies adapters only during the prefill phase (processing the user's input) and discards them during decoding (generating output tokens), which removes this bottleneck. The authors implement prefill-only versions of LoRA and ReFT in vLLM and report a 1.9× throughput improvement when serving 512 adapters on Llama 3.1 70B. On supervised finetuning, the performance gap relative to standard adapters can be closed by increasing adapter rank at no throughput cost; on reinforcement learning tasks, prefill-only adapters match standard adapters. The code and implementations are released.
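The prefill-only idea can be illustrated with a toy sketch. This is not the authors' vLLM implementation: it assumes a single attention step with a key/value cache, and the names `LoRALinear`, `make_layers`, `generate`, and `adapter_in_decode` are hypothetical. The adapter (a low-rank update `B @ A` added to a base weight `W`) shapes the cache entries for the prompt, while the decode loop uses the base weights alone.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class LoRALinear:
    """Toy linear layer: base weight W plus an optional low-rank update B @ A."""
    def __init__(self, W, A, B):
        self.W, self.A, self.B = W, A, B
        self.adapter_calls = 0  # how often the adapter path was executed

    def __call__(self, x, use_adapter):
        y = matvec(self.W, x)
        if use_adapter:
            self.adapter_calls += 1
            delta = matvec(self.B, matvec(self.A, x))
            y = [a + b for a, b in zip(y, delta)]
        return y

def attend(q, keys, values):
    """Softmax attention of one query over all cached keys and values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    dim = len(values[0])
    return [sum(e / z * v[j] for e, v in zip(exps, values)) for j in range(dim)]

def generate(prompt, steps, k_proj, v_proj, adapter_in_decode):
    keys, values = [], []
    # Prefill: the adapter is always applied while the prompt is processed,
    # so the cached keys and values already reflect the finetuned behavior.
    for x in prompt:
        keys.append(k_proj(x, use_adapter=True))
        values.append(v_proj(x, use_adapter=True))
    # Decode: the toy model uses the current state as the query. With
    # adapter_in_decode=False only the base weights are used from here on.
    x, outputs = prompt[-1], []
    for _ in range(steps):
        x = attend(x, keys, values)
        keys.append(k_proj(x, use_adapter=adapter_in_decode))
        values.append(v_proj(x, use_adapter=adapter_in_decode))
        outputs.append(x)
    return outputs

def make_layers():
    """Build fresh key and value projections with a rank-1 adapter (toy values)."""
    W = [[1.0, 0.0], [0.0, 1.0]]  # base weight (identity)
    A = [[0.5, -0.5]]             # down-projection, shape 1 x 2
    B = [[1.0], [2.0]]            # up-projection, shape 2 x 1
    return LoRALinear(W, A, B), LoRALinear(W, A, B)

prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

k_po, v_po = make_layers()
prefill_only = generate(prompt, 2, k_po, v_po, adapter_in_decode=False)

k_st, v_st = make_layers()
standard = generate(prompt, 2, k_st, v_st, adapter_in_decode=True)

print("adapter calls, prefill-only:", k_po.adapter_calls)
print("adapter calls, standard:    ", k_st.adapter_calls)
```

In this toy setup the adapter path runs once per prompt token in prefill-only mode and once per prompt token plus once per generated token in standard mode. The first generated state is identical in both modes because it depends only on the adapted cache; later states can diverge, since the standard mode also adapts the cache entries it writes during decoding.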
Published as "PreFT: Prefill-only finetuning for efficient inference", arXiv:2605.14217