Can language models understand thermal images in the dark?

Thermal cameras see through darkness and fog where visible-light imaging fails, but CLIP and other vision-language models can't align thermal images with text descriptions. The gap stems from three problems: a lack of captioned thermal datasets, LLMs that don't understand heat physics, and scene-level and object-level thermal signals that conflict when forced into a single embedding space. T-CLIP addresses these with IR-Cap, a physics-aware captioning pipeline, plus a dual-adapter architecture that learns thermal understanding separately for scenes and objects. The result is consistent wins over baselines on three benchmarks.
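
To make the dual-adapter idea concrete, here is a minimal sketch of how one shared image feature could be routed through two separate adapters, giving scenes and objects their own embedding. The summary above does not describe the adapter internals, so everything here is an assumption for illustration: the residual bottleneck design, the zero initialization, the dimensions, and all names (`Adapter`, `dual_adapt`, `scene_adapter`, `object_adapter`) are hypothetical and not taken from T-CLIP.

```python
import math
import random


def linear(x, weights):
    """Multiply vector x by a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def l2_normalize(x):
    """Scale x to unit length so dot products act as cosine similarity."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


class Adapter:
    """Hypothetical residual bottleneck adapter: x + up(relu(down(x)))."""

    def __init__(self, dim, bottleneck, rng):
        # Small random down-projection (bottleneck x dim).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Zero-initialized up-projection (dim x bottleneck): the adapter
        # starts as the identity and only departs from it once trained.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = relu(linear(x, self.down))
        delta = linear(hidden, self.up)
        return [v + d for v, d in zip(x, delta)]


def dual_adapt(feature, scene_adapter, object_adapter):
    """Send one shared feature through two adapters, yielding two embeddings.

    Assumption: both branches read the same frozen-encoder feature, so
    scene-level and object-level cues do not share one adapted space.
    """
    scene = l2_normalize(scene_adapter(feature))
    obj = l2_normalize(object_adapter(feature))
    return scene, obj


# Stand-in for a frozen image-encoder output (random, not real thermal data).
rng = random.Random(0)
dim = 8
feature = [rng.gauss(0.0, 1.0) for _ in range(dim)]

scene_adapter = Adapter(dim, bottleneck=2, rng=rng)
object_adapter = Adapter(dim, bottleneck=2, rng=rng)
scene_emb, object_emb = dual_adapt(feature, scene_adapter, object_adapter)
```

In a setup like this, each branch would be matched against its own kind of caption (scene-level or object-level), so the two training signals update different parameters rather than competing inside one embedding space.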