cs.CL

How LLMs turn novices into experts at causing harm—and how to stop them

Ruohao Guo, Wei Xu, Alan Ritter

June 1, 2026

LLMs can inadvertently help malicious users exceed their own capabilities through extended back-and-forth conversations—teaching novices specialized attacks or automating harmful tasks at scale. This paper introduces HarmAmp, a benchmark of 12 real-world multi-turn harm scenarios, and TrajSafe, a monitoring system that detects dangerous conversational paths and steers models toward safer responses. Experiments show TrajSafe cuts harm significantly without over-blocking legitimate requests or degrading general model performance.
Published as "Investigating and Alleviating Harm Amplification in LLM Interactions", arXiv:2606.02423