cs.AI

Catching deepfakes by finding contradictions across vision, motion, and depth

Hoda Osama Elkhodary, Sherin Mostafa Youssef, Marwa Elshenawy, Dalia Sobhy

May 16, 2026

Advanced deepfake generators keep each individual modality (appearance, motion, geometry) internally consistent, but they inadvertently introduce contradictions between modalities. CAM-VFD exploits this: it uses cross-attention fusion to compare CLIP-based appearance features against VideoMAE motion features and MiDaS depth features, surfacing forensic inconsistencies that single-modality detectors cannot see. On the GenVidBench and GenVideo benchmarks it reaches 95.31% and 93.43% accuracy, respectively, and is reported to be robust to compression, noise, blur, and adversarial perturbations. Code and models are publicly available.
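To make the cross-attention fusion idea concrete, here is a minimal sketch in plain Python. It is not the authors' architecture: the feature vectors are random stand-ins for what CLIP, VideoMAE, and MiDaS would produce, the learned query/key/value projections are omitted (identity projections are assumed), and the names `cross_attention` and `fuse` are hypothetical. It only illustrates how appearance tokens can query the motion and depth streams so that a downstream classifier sees them side by side.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention with identity projections.

    Each query token (one modality) attends over all tokens of another
    modality. A real model would apply learned W_q, W_k, W_v first;
    that is omitted here as a simplifying assumption.
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys_values])
        attended.append(
            [sum(w * kv[i] for w, kv in zip(weights, keys_values)) for i in range(d)]
        )
    return attended


def fuse(appearance, motion, depth):
    """Concatenate each appearance token with its motion- and
    depth-attended context, giving one fused vector per token."""
    app_to_motion = cross_attention(appearance, motion)
    app_to_depth = cross_attention(appearance, depth)
    return [a + m + g for a, m, g in zip(appearance, app_to_motion, app_to_depth)]


def random_tokens(count, dim):
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


# Hypothetical stand-ins for per-clip features, all projected to dimension 8.
random.seed(0)
DIM = 8
appearance = random_tokens(4, DIM)  # e.g. appearance tokens
motion = random_tokens(6, DIM)      # e.g. motion tokens (count may differ)
depth = random_tokens(4, DIM)       # e.g. depth tokens

fused = fuse(appearance, motion, depth)
print(len(fused), len(fused[0]))
```

Concatenation is only one way to combine the attended streams; a detector could instead feed the difference between an appearance token and its attended context to the classifier, which would emphasize cross-modal disagreement more directly. Which option the paper uses is not stated in this summary.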
Published as "CAM-VFD: Cross-Attention Multimodal Video Forgery Detection", arXiv:2605.17133