Making adversarial attacks on language models look natural

Universal adversarial triggers are input-agnostic attack sequences that fool NLP models across tasks, but existing methods produce unnatural text that is easy to spot. This work combines part-of-speech filtering with a perplexity-based loss to generate triggers that read like natural phrases. On SST sentiment analysis, the triggers reach attack success rates as high as 96–98% while remaining grammatical. The authors also show that adversarial training with these triggers raises model accuracy under attack from 12% to 48%. The contribution matters for both adversarial security research and practical defense: it shows that believable attacks are possible and that they call for more sophisticated defenses than existing approaches.
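
To make the two ingredients concrete, here is a minimal sketch of how a part-of-speech filter and a perplexity penalty could fit together when scoring candidate trigger tokens. It is an illustration under stated assumptions, not the authors' implementation: the toy lexicon `TOY_POS`, the whitelist `ALLOWED_TAGS`, the bigram table, the weight `lam`, and the additive form of `trigger_objective` are all hypothetical. A real system would presumably use a trained tagger and a neural language model.

```python
import math

# Hypothetical toy part-of-speech lexicon; a real pipeline would use a tagger.
TOY_POS = {
    "the": "DET", "is": "VERB", "movie": "NOUN",
    "awful": "ADJ", "truly": "ADV", "zoning": "VERB",
}
# Assumed whitelist of tags allowed in a trigger (not taken from the paper).
ALLOWED_TAGS = {"ADJ", "ADV", "NOUN"}


def pos_filter(candidates, allowed=ALLOWED_TAGS):
    """Keep only candidate tokens whose tag is in the allowed set."""
    return [w for w in candidates if TOY_POS.get(w) in allowed]


def perplexity(tokens, bigram_prob, floor=1e-6):
    """Perplexity of a non-empty token sequence under a toy bigram model."""
    prev = "<s>"
    log_prob = 0.0
    for tok in tokens:
        # Unseen bigrams fall back to a small floor probability.
        log_prob += math.log(bigram_prob.get((prev, tok), floor))
        prev = tok
    return math.exp(-log_prob / len(tokens))


def trigger_objective(attack_loss, ppl, lam=0.1):
    """Quantity to minimize: attack loss plus a weighted fluency penalty.

    attack_loss is assumed to be the victim model's loss toward the
    attacker's target label, so lower is better for the attacker.
    """
    return attack_loss + lam * math.log(ppl)


if __name__ == "__main__":
    candidates = ["the", "awful", "is", "truly", "zoning"]
    print(pos_filter(candidates))
    lm = {("<s>", "truly"): 0.5, ("truly", "awful"): 0.5}
    ppl = perplexity(["truly", "awful"], lm)
    print(trigger_objective(attack_loss=1.2, ppl=ppl))
```

In this sketch the filter narrows the search space before any scoring happens, and the log-perplexity term then favors fluent candidates among those that remain. Raising `lam` trades attack strength for naturalness.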