Catching deepfakes by finding contradictions across vision, motion, and depth

Advanced deepfake generators maintain consistency within individual modalities (appearance, motion, geometry) while inadvertently creating contradictions across them. CAM-VFD exploits this by using cross-attention fusion to compare CLIP-based appearance features against VideoMAE motion features and MiDaS depth features, identifying forensic inconsistencies that are invisible to single-modal approaches. On the GenVidBench and GenVideo benchmarks it reaches 95.31% and 93.43% accuracy, respectively, and it remains robust to compression, noise, blur, and adversarial perturbations. Code and models are publicly available.
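
To make the cross-attention fusion idea concrete, here is a minimal sketch in plain Python. It is not the CAM-VFD implementation: the summary does not say which modality supplies the queries, what the feature dimensions are, or how the fused features are classified. The sketch assumes appearance tokens act as queries over motion (or depth) tokens, omits learned projections, and uses a hypothetical helper name `cross_attention` with toy 2-dimensional features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    Assumption: queries come from one modality (e.g. appearance) and
    keys/values from another (e.g. motion or depth). Each output row is
    a weighted average of `values`, weighted by query-key similarity.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Toy stand-ins for per-frame features (hypothetical values, not real
# CLIP / VideoMAE / MiDaS outputs).
appearance = [[1.0, 0.0], [0.0, 1.0]]
motion = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Each appearance token gathers the motion context most similar to it.
fused = cross_attention(appearance, motion, motion)
```

In a real detector the fused rows would feed a classifier head, and a second attention pass could relate appearance to depth in the same way. Those stages are left out here because the summary does not describe them.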