How LLMs turn novices into experts at causing harm—and how to stop them
Ruohao Guo, Wei Xu, Alan Ritter
June 1, 2026
Over extended multi-turn conversations, LLMs can inadvertently help malicious users exceed their own capabilities, for example by teaching novices specialized attacks or by automating harmful tasks at scale. This paper introduces HarmAmp, a benchmark of 12 real-world multi-turn harm scenarios, and TrajSafe, a monitoring system that detects dangerous conversational paths and steers models toward safer responses. Experiments show that TrajSafe significantly reduces harm without over-blocking legitimate requests or degrading general model performance.