cs.CV

Can language models understand thermal images in the dark?

Tayeba Qazi, Ayush Maheshwari, Prerana Mukherjee, Brejesh Lall

May 30, 2026

Thermal cameras see through darkness and fog where visible-light cameras fail, but CLIP and other vision-language models fail to align thermal images with text descriptions. The gap stems from three problems: a lack of captioned thermal datasets, LLMs that do not understand heat physics, and scene-level and object-level thermal signals that conflict within a single embedding space. T-CLIP addresses these with IR-Cap, a physics-aware captioning pipeline, and a dual-adapter architecture that learns thermal understanding separately for scenes and objects. The authors report consistent gains over baselines on three benchmarks.
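To make the dual-adapter idea concrete, here is a minimal sketch of one way scene-level and object-level alignment could be kept in separate branches over a shared frozen image feature. It is not the authors' implementation: the summary above does not describe the adapter design, so the bottleneck residual adapter, the blend weight `alpha`, the temperature of 0.07, the equal weighting of the two losses, and the random stand-in features are all assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def residual_adapter(x, w_down, w_up, alpha=0.5):
    """Hypothetical bottleneck adapter: blend the frozen feature with a small correction."""
    hidden = np.maximum(x @ w_down, 0.0)  # ReLU over the low-dimensional projection
    return (1.0 - alpha) * x + alpha * (hidden @ w_up)


def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(img, txt, temperature=0.07):
    """Symmetric CLIP-style contrastive loss; row i of img is paired with row i of txt."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


rng = np.random.default_rng(0)
batch, dim, bottleneck = 4, 16, 4

# Random stand-ins (assumption) for frozen encoder outputs and caption embeddings.
thermal_feats = rng.normal(size=(batch, dim))
scene_text = rng.normal(size=(batch, dim))
object_text = rng.normal(size=(batch, dim))

# One adapter per granularity, so the two signals do not share adapter weights.
scene_adapter = (rng.normal(scale=0.1, size=(dim, bottleneck)),
                 rng.normal(scale=0.1, size=(bottleneck, dim)))
object_adapter = (rng.normal(scale=0.1, size=(dim, bottleneck)),
                  rng.normal(scale=0.1, size=(bottleneck, dim)))

scene_emb = residual_adapter(thermal_feats, *scene_adapter)
object_emb = residual_adapter(thermal_feats, *object_adapter)

# Equal weighting of the two branch losses is an assumption.
total_loss = (clip_contrastive_loss(scene_emb, scene_text)
              + clip_contrastive_loss(object_emb, object_text))
print(f"combined loss: {total_loss:.4f}")
```

In a real training loop only the adapter weights would be updated while the pretrained encoders stay frozen. That is the usual motivation for adapters, though whether T-CLIP trains this way is not stated in the summary above.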
Published as "T-CLIP: Enabling Thermal Perception for Contrastive Language-Image Pretraining" (arXiv:2606.00673)