Abstract
The rise of the metaverse is closely intertwined with the rapid advancement of deepfake technology. Within the metaverse, individuals exist in digital form and interact, transact, and communicate through virtual avatars. However, deepfake technology has led to a proliferation of forged information disseminated under the guise of users’ virtual identities, posing significant security risks to the metaverse. More robust deepfake detection methods are therefore urgently needed. This paper addresses deepfake video detection by exploiting the spatiotemporal inconsistencies introduced by deepfake generation techniques, and proposes the spatiotemporal inconsistency learning and interactive fusion (ST-ILIF) detection method, which consists of a phase-aware stream and a sequence stream. The spatial inconsistencies in frames of deepfake videos are primarily attributed to variations in the structural information carried by the phase component in the Fourier domain. To mitigate overfitting to content information, the phase-aware stream learns spatial inconsistencies from phase-based reconstructed frames. In addition, because deepfake videos are generated frame by frame and lack temporal consistency between frames, the sequence stream extracts temporal inconsistency features from the spatiotemporal difference information between consecutive frames. Finally, feature interaction and fusion between the two streams further enhance the representation ability of the intermediate and classification features. Evaluated on four mainstream datasets, the proposed method outperformed most existing methods, and extensive experimental results demonstrated its effectiveness in identifying deepfake videos.
Our source code is available at https://github.com/qff98/Deepfake-Video-Detection
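The two input signals named in the abstract, phase-based reconstructed frames and differences between consecutive frames, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `phase_only_reconstruction` and `frame_differences`, the use of grayscale frames, the choice of unit amplitude, and the min-max normalization are all assumptions made for illustration; the paper's actual preprocessing may differ.

```python
import numpy as np


def phase_only_reconstruction(frame: np.ndarray) -> np.ndarray:
    """Rebuild a grayscale frame (H, W) from its Fourier phase alone.

    Assumption: the amplitude spectrum is replaced by a constant (1.0),
    so only the structural information carried by the phase survives.
    """
    spectrum = np.fft.fft2(frame.astype(np.float64))
    phase = np.angle(spectrum)                 # keep the phase component
    unit_spectrum = np.exp(1j * phase)         # discard the amplitude
    recon = np.real(np.fft.ifft2(unit_spectrum))
    # Min-max normalize to [0, 1] (hypothetical choice for this sketch).
    lo, hi = recon.min(), recon.max()
    return (recon - lo) / (hi - lo + 1e-12)


def frame_differences(clip: np.ndarray) -> np.ndarray:
    """Return differences between consecutive frames.

    clip has shape (T, H, W); the result has shape (T - 1, H, W).
    """
    return np.diff(clip.astype(np.float64), axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.random((4, 16, 16))             # synthetic stand-in for a face clip
    phase_frames = np.stack([phase_only_reconstruction(f) for f in clip])
    diffs = frame_differences(clip)
    print(phase_frames.shape, diffs.shape)
```

In this sketch, scaling a frame's intensity leaves its phase-only reconstruction essentially unchanged, which gives some intuition for why a phase-based input may be less tied to appearance than raw pixels. A detector along the lines described in the abstract would feed such reconstructions to one stream and the frame differences to another; how the two are fused is not shown here.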