How LLMs turn novices into experts at causing harm—and how to stop them

Over extended multi-turn conversations, LLMs can inadvertently help malicious users exceed their own capabilities, for example by teaching novices specialized attacks or by automating harmful tasks at scale. This paper introduces HarmAmp, a benchmark of 12 real-world multi-turn harm scenarios, and TrajSafe, a monitoring system that detects dangerous conversational trajectories and steers models toward safer responses. In the reported experiments, TrajSafe significantly reduces harm without over-blocking legitimate requests or degrading general model performance.