Catching deepfakes by finding contradictions across vision, motion, and depth
Hoda Osama Elkhodary, Sherin Mostafa Youssef, Marwa Elshenawy, Dalia Sobhy
May 16, 2026
Advanced deepfake generators keep each individual modality (appearance, motion, geometry) internally consistent, but inadvertently create contradictions across modalities. CAM-VFD exploits these cross-modal contradictions: it uses cross-attention fusion to compare CLIP-based appearance features against VideoMAE motion features and MiDaS depth features, exposing forensic inconsistencies that single-modality detectors cannot see. On the GenVidBench and GenVideo benchmarks it reaches 95.31% and 93.43% accuracy, respectively, and remains robust to compression, noise, blur, and adversarial perturbations. Code and models are publicly available.
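To make the idea of cross-attention fusion concrete, here is a minimal sketch in plain Python. It is not the CAM-VFD implementation: the summary does not specify the attention direction, feature dimensions, or how the streams are combined, so everything below is an assumption made for illustration. Appearance tokens act as queries over motion and depth tokens, there are no learned projections, and the three streams are simply concatenated. The toy vectors stand in for CLIP, VideoMAE, and MiDaS features, and `cross_attention` is a hypothetical helper name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    Each query attends over all keys; the output is the attention-weighted
    sum of the values. Returns (outputs, attention_weights).
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        w = softmax(scores)
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Toy stand-ins (assumed shapes): 2 appearance tokens, 3 motion tokens,
# 3 depth tokens, all of dimension 4.
appearance = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
motion = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
depth = [[0.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]

# Appearance queries attend separately to the motion and depth streams.
motion_ctx, w_motion = cross_attention(appearance, motion, motion)
depth_ctx, w_depth = cross_attention(appearance, depth, depth)

# Assumed fusion step: concatenate each appearance token with its two
# attended contexts, giving one 12-dimensional vector per token.
fused = [a + m + z for a, m, z in zip(appearance, motion_ctx, depth_ctx)]
```

In a real detector the fused vectors would feed a trained classifier, and the queries, keys, and values would pass through learned linear projections first. The sketch only shows how one modality can be read in the context of another, which is where cross-modal contradictions would surface.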