Why your sparse autoencoder benchmark might be lying to you

Sparse autoencoders (SAEs) are a key tool for interpreting large language models, but their development relies on benchmarks that may not reliably distinguish good architectures from bad ones. This paper audits SAEBench, the standard evaluation suite, with three checks: how much a metric varies when the same SAE is retrained, how well it correlates with synthetic ground truth, and whether it discriminates between training stages. Two metrics, Targeted Probe Perturbation and Spurious Correlation Removal, fail multiple tests and should not be used. The most reliable metric tested, sae-probes, still struggles to separate different variants of the same architecture. The work is primarily a diagnostic audit; its conclusion is that the field needs fundamentally better evaluation methods.
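
To make the first check concrete, here is a minimal sketch of how retrain noise might be compared with the gap between two SAE configurations. It is an illustration under stated assumptions, not the audit's actual procedure: the function names `seed_noise` and `discrimination_ratio` are hypothetical, the pooled-standard-deviation formula is one reasonable choice among several, and the scores are made-up placeholders rather than results from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def seed_noise(scores):
    """Sample standard deviation of one metric across retrains of the same SAE."""
    return stdev(scores)


def discrimination_ratio(scores_a, scores_b):
    """Gap between two configurations' mean scores, in units of pooled retrain noise.

    Hypothetical helper: a ratio well below 1 suggests the metric cannot
    separate the two configurations from retraining noise alone.
    """
    gap = abs(mean(scores_a) - mean(scores_b))
    pooled = sqrt((seed_noise(scores_a) ** 2 + seed_noise(scores_b) ** 2) / 2)
    return gap / pooled if pooled > 0 else float("inf")


# Made-up placeholder scores for three retrains each; NOT data from the audit.
arch_a = [0.70, 0.74, 0.72]
arch_b = [0.71, 0.75, 0.73]

ratio = discrimination_ratio(arch_a, arch_b)
print(f"gap / retrain noise = {ratio:.2f}")
```

When the gap is smaller than the spread across retrains, as with these placeholder numbers, a benchmark ranking of the two configurations could flip from one training run to the next.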