cs.CL

Making adversarial attacks on language models look natural

Benedict Florance Arockiaraj, Alexander Feng, Jianxiong Cai, Xiaoyu Cheng

May 18, 2026

Universal adversarial triggers are input-agnostic attack sequences that fool NLP models across tasks, but existing methods produce unnatural text that is easy to spot. This work combines part-of-speech filtering with a perplexity-based loss to generate triggers that read like natural phrases. On SST sentiment analysis, the triggers achieve attack success rates as high as 96–98% while remaining grammatical. The authors also show that adversarial training with these triggers improves robustness, raising model accuracy under attack from 12% to 48%. The contribution matters for both adversarial security research and practical defense: it shows that believable attacks are possible and that they call for more sophisticated defenses than existing approaches provide.
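As a rough illustration of how a part-of-speech filter and a perplexity term could work together, here is a minimal sketch. It is not the authors' implementation: the function names, the additive form of the objective, and the `fluency_weight` parameter are all assumptions made for this example, and the attack loss is assumed to be a quantity the attacker minimizes.

```python
import math


def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative log-likelihood (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def filter_by_pos(candidates, allowed_tags):
    """Keep only candidate tokens whose POS tag is allowed for this trigger slot.

    `candidates` is a list of (token, tag) pairs; the tags would come from
    whatever tagger is in use (hypothetical here).
    """
    return [token for token, tag in candidates if tag in allowed_tags]


def combined_objective(attack_loss, token_log_probs, fluency_weight=0.1):
    """Attack loss plus a weighted log-perplexity penalty.

    Assumed form for illustration only: lower is better for the attacker,
    and a higher perplexity (less natural trigger) is penalized.
    """
    return attack_loss + fluency_weight * math.log(perplexity(token_log_probs))
```

In this sketch the filter narrows the candidate tokens for each trigger position before any scoring happens, and the perplexity term then favors candidates that a language model considers fluent. How the two signals are actually weighted in the paper is not stated in this summary.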
Published as Universal Adversarial Triggers arXiv:2605.17936