← Back to Machine Learning
cs.LG

Do language models trust tool outputs more than user instructions?

Mohammed Sameer Syed, Rozhin Yasaei

May 30, 2026

Researchers tested whether language models show consistent vulnerability to adversarial attacks regardless of delivery channel. Using identical malicious payloads delivered through user messages, tool metadata, or tool outputs, they found agent-native models are far more susceptible via tool descriptions than user input—a vulnerability that reverses for general-purpose models. The catch: models treat tool metadata as trusted but tool results as ordinary data, a non-linear encoding hidden deep in the network that standard safety checks miss.
Published as Same Payload, Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models arXiv:2606.00566
Read the original paper →