Most hallucination detection methods exploit benchmark shortcuts

Recent work claims rapid progress in detecting when large language models hallucinate, but this paper shows that much of that progress is illusory. The authors find that standard benchmarks contain construction artifacts (ground-truth answers embedded in the input prompts) that a simple text-similarity baseline can exploit to achieve near-perfect detection without any model knowledge. After controlling for these artifacts across 22 detection methods, 12 models, and 6 corpora, most established baselines collapse to chance-level performance. Only supervised probes on upper-layer hidden states (SAPLMA and DRIFT, the latter a new method introduced in the paper) show consistent, genuine detection ability. The results suggest that the field has overestimated its progress, and the authors call for more rigorous benchmark design.
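To make the shortcut concrete, here is a minimal sketch of how a text-similarity baseline could exploit answer leakage. It is an illustration under stated assumptions, not the paper's actual baseline: the function names, the token-overlap score, the 0.5 threshold, and the toy prompt are all hypothetical. The idea is that if the ground-truth answer already appears in the prompt, a response that shares no tokens with the prompt can be flagged as a hallucination without consulting the model at all.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(prompt: str, response: str) -> float:
    """Fraction of response tokens that also occur in the prompt.

    Hypothetical stand-in for a text-similarity score; the paper's
    baseline may use a different measure.
    """
    resp = tokens(response)
    if not resp:
        return 0.0
    prompt_vocab = set(tokens(prompt))
    return sum(t in prompt_vocab for t in resp) / len(resp)


def flag_hallucination(prompt: str, response: str, threshold: float = 0.5) -> bool:
    """Flag a response as hallucinated when its overlap with the prompt is low.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return overlap_score(prompt, response) < threshold


# Toy prompt (made up for illustration) in which the answer leaks into the input.
toy_prompt = (
    "Context: The Eiffel Tower was completed in 1889. "
    "Question: When was the Eiffel Tower completed?"
)

if __name__ == "__main__":
    for candidate in ("1889", "1925"):
        print(candidate, flag_hallucination(toy_prompt, candidate))
```

On a benchmark without leaked answers, a score like this carries no signal about correctness, which is consistent with the collapse to chance that the summary describes once the artifacts are controlled for.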