What inference hardware leaks about your LLM

Small numerical deviations introduced by different GPUs, inference engines, and attention implementations accumulate and alter LLM outputs in measurable ways. Researchers built a fingerprinting method that queries a model and infers its hardware platform, inference engine, and attention backend, even at non-zero temperature, where sampling already makes outputs vary. This matters for privacy and security: anyone with query access to a black-box LLM can uncover details of the infrastructure behind it. Eliminating these traces requires harmonizing hardware and software across entire stacks, which the authors show is fundamentally impractical.
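To make the underlying effect concrete, here is a minimal toy sketch of why the order of floating-point operations matters. It is not the researchers' fingerprinting method. It only illustrates the kind of deviation the paragraph above describes: the same three terms summed in two different orders give slightly different results, and that difference is enough to flip a near-tied greedy choice between two candidate tokens. The function names, the terms, and the competing logit value are all hypothetical and chosen for illustration.

```python
from functools import reduce


def sum_left_to_right(values):
    """Accumulate as ((v0 + v1) + v2) + ..., like a simple sequential loop."""
    return reduce(lambda acc, x: acc + x, values)


def sum_right_to_left(values):
    """Accumulate as v0 + (v1 + (v2 + ...)), a different but equally valid order."""
    return reduce(lambda acc, x: x + acc, reversed(values))


def greedy_pick(logit_a, logit_b):
    """Toy greedy decoder over two candidate tokens: pick A only if strictly larger."""
    return "A" if logit_a > logit_b else "B"


# Hypothetical partial products that add up to the logit of token A.
terms = [0.1, 0.2, 0.3]
# Hypothetical logit of a competing token B, sitting almost exactly at the same value.
logit_b = 0.6

logit_a_seq = sum_left_to_right(terms)   # (0.1 + 0.2) + 0.3
logit_a_rev = sum_right_to_left(terms)   # 0.1 + (0.2 + 0.3)

print("left-to-right:", repr(logit_a_seq))
print("right-to-left:", repr(logit_a_rev))
print("results differ:", logit_a_seq != logit_a_rev)
print("token picked (left-to-right):", greedy_pick(logit_a_seq, logit_b))
print("token picked (right-to-left):", greedy_pick(logit_a_rev, logit_b))
```

Real inference stacks differ in far subtler ways (kernel tiling, reduction trees, fused operations, reduced-precision formats), but the mechanism is the same: floating-point addition is not associative, so two mathematically equivalent computations can disagree in the last bits, and those bits occasionally decide which token is emitted.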