Can language models understand thermal images in the dark?
Tayeba Qazi, Ayush Maheshwari, Prerana Mukherjee, Brejesh Lall
May 30, 2026
Thermal cameras see through darkness and fog where regular vision fails, but CLIP and other vision-language models can't align thermal images with text descriptions. The gap stems from three problems: no captioned thermal datasets, LLMs that don't understand heat physics, and scene-level and object-level thermal signals that conflict when forced into one embedding space. T-CLIP addresses this with IR-Cap, a physics-aware captioning pipeline, plus a dual-adapter architecture that learns thermal understanding separately for scenes and objects. The result is consistent wins over baselines on three benchmarks.
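To make the dual-adapter idea concrete, here is a minimal sketch in plain Python. It is not the T-CLIP implementation: the summary does not describe the adapter design, so the residual bottleneck form, the `ResidualAdapter` class, the `alpha` blend weight, the embedding size, and the random stand-in embeddings are all assumptions chosen for illustration. It only shows the general pattern of routing one frozen image embedding through two separate adapters, so that scene-level and object-level text are each matched against their own adapted embedding rather than a shared one.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def matvec(weights, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


class ResidualAdapter:
    """Hypothetical bottleneck adapter: v + alpha * up(relu(down(v))).

    Assumption: the real adapters may look quite different.
    """

    def __init__(self, dim, hidden, alpha, rng):
        self.alpha = alpha
        self.down = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        self.up = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]

    def __call__(self, v):
        hidden = [max(0.0, x) for x in matvec(self.down, v)]
        delta = matvec(self.up, hidden)
        return l2_normalize([x + self.alpha * d for x, d in zip(v, delta)])


rng = random.Random(0)
dim = 8  # toy size; real CLIP embeddings are much larger

# Random stand-ins for a frozen image embedding and two caption embeddings.
image_embedding = l2_normalize([rng.gauss(0, 1) for _ in range(dim)])
scene_text = l2_normalize([rng.gauss(0, 1) for _ in range(dim)])
object_text = l2_normalize([rng.gauss(0, 1) for _ in range(dim)])

# Two independent adapters, one per granularity (untrained here).
scene_adapter = ResidualAdapter(dim, hidden=4, alpha=0.5, rng=rng)
object_adapter = ResidualAdapter(dim, hidden=4, alpha=0.5, rng=rng)

scene_embedding = scene_adapter(image_embedding)
object_embedding = object_adapter(image_embedding)

# Each text granularity is scored against its own adapted embedding.
scene_score = cosine(scene_embedding, scene_text)
object_score = cosine(object_embedding, object_text)
print(scene_score, object_score)
```

In a trained system the two adapters would be optimized against scene-level and object-level captions respectively. Here the weights are random, so the printed scores carry no meaning beyond showing the data flow.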