Does prompt compression work for diffusion language models?

Sterling Huang, Abigayle Brown, Jiyoo Noh, Jiakang Xu, Wantong Huo, Kaung Myat Kyaw, Jonathan Chan

Prompt compression reduces inference cost by shortening inputs, but its effectiveness has been studied mainly on autoregressive models such as GPT. This work evaluates LLMLingua-2 on LLaDA, an 8B-parameter diffusion language model, using three benchmarks: GSM8K (math), DUC2004 (summarization), and ShareGPT (reconstruction). At 2× compression, semantic similarity was preserved, but task outcomes diverged: summarization remained stable, while mathematical reasoning degraded sharply. The gap between BERTScore precision and recall suggests that information loss, rather than semantic drift, drives the failures. The finding that semantic similarity does not guarantee task stability in diffusion models indicates that compression methods must be redesigned specifically for non-autoregressive architectures.
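
To illustrate why a gap between BERTScore precision and recall points to information loss, the following is a minimal sketch of the greedy-matching precision and recall that BERTScore is built on. It is not the evaluation code used in this work: the one-hot toy embeddings, the example sentences, and the function names are all assumptions made for illustration, whereas real BERTScore uses contextual embeddings from a pretrained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_precision_recall(cand_vecs, ref_vecs):
    """Greedy-matching precision and recall in the style of BERTScore.

    Precision: each candidate token is matched to its most similar reference token.
    Recall: each reference token is matched to its most similar candidate token.
    """
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    return precision, recall

def one_hot_embed(tokens, vocab):
    """Hypothetical stand-in for a contextual encoder: one-hot vector per token."""
    return [[1.0 if w == t else 0.0 for w in vocab] for t in tokens]

# Toy example (assumed, not from the benchmarks): the candidate drops the numbers.
reference = "tom has 3 apples and buys 4 more".split()
candidate = "tom has apples".split()
vocab = sorted(set(reference) | set(candidate))

p, r = greedy_precision_recall(
    one_hot_embed(candidate, vocab),
    one_hot_embed(reference, vocab),
)
print(f"precision={p:.3f} recall={r:.3f}")
```

In this toy case everything the candidate says is supported by the reference, so precision stays high, while the dropped tokens (including the numbers a math problem depends on) pull recall down. High precision with low recall is the pattern associated with information loss; semantic drift would instead lower precision.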