← Back to Machine Learning cs.LG
Do language models trust tool outputs more than user instructions?
Mohammed Sameer Syed, Rozhin Yasaei
May 30, 2026
Researchers tested whether language models show consistent vulnerability to adversarial attacks regardless of delivery channel. Using identical malicious payloads delivered through user messages, tool metadata, or tool outputs, they found agent-native models are far more susceptible via tool descriptions than user input—a vulnerability that reverses for general-purpose models. The catch: models treat tool metadata as trusted but tool results as ordinary data, a non-linear encoding hidden deep in the network that standard safety checks miss.
Read the original paper →