Making adversarial attacks on language models look natural
Benedict Florance Arockiaraj, Alexander Feng, Jianxiong Cai, Xiaoyu Cheng
May 18, 2026
Universal adversarial triggers are input-agnostic attack sequences that fool NLP models across tasks, but existing methods produce unnatural text that is easy to spot. This work combines part-of-speech filtering with a perplexity-based loss to generate triggers that read like natural phrases. On SST (Stanford Sentiment Treebank) sentiment analysis, the triggers reach attack success rates of 96–98% while remaining grammatical. The authors also show that adversarial training with these triggers raises model accuracy under attack from 12% to 48%. The contribution matters for both adversarial security research and practical defense: it shows that believable attacks are possible and that they call for more sophisticated defenses than existing approaches provide.
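To make the combination of part-of-speech filtering and a perplexity-based loss concrete, here is a minimal sketch of one selection step. It is not the authors' implementation: the candidate words, POS tags, attack-loss values, unigram probabilities, the allowed-tag set, and the `weight` parameter are all hypothetical, and a real system would take the attack loss from the victim model and the perplexity from a pretrained language model.

```python
import math

# Hypothetical candidates: (word, POS tag, attack loss).
# Attack-loss values are made up; lower means more damaging to the victim model.
CANDIDATES = [
    ("zxqv", "X", 0.10),        # strong attack, but not a real word
    ("terrible", "ADJ", 0.30),
    ("movie", "NOUN", 0.60),
    ("the", "DET", 0.90),
]

# Assumed set of POS tags allowed in a trigger.
ALLOWED_POS = {"ADJ", "NOUN", "ADV", "VERB"}

# Toy unigram probabilities standing in for a real language model.
UNIGRAM_PROB = {"zxqv": 1e-6, "terrible": 1e-3, "movie": 5e-3, "the": 5e-2}


def perplexity(tokens):
    """Perplexity of a token sequence under the toy unigram model."""
    log_prob = sum(math.log(UNIGRAM_PROB[t]) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def select_trigger_word(candidates, weight=0.05):
    """Return the word minimising attack loss + weight * log-perplexity.

    Candidates whose POS tag is not in ALLOWED_POS are dropped first.
    """
    kept = [(word, loss) for word, pos, loss in candidates if pos in ALLOWED_POS]
    scored = [(loss + weight * math.log(perplexity([word])), word)
              for word, loss in kept]
    return min(scored)[1]


if __name__ == "__main__":
    print(select_trigger_word(CANDIDATES))
```

In this toy setting the gibberish token has the lowest attack loss but is removed by the POS filter, and the perplexity term then favours the more fluent of the remaining words. The `weight` value controls how much attack strength is traded for naturalness.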