ABSTRACT
The complementary sensor characteristics of audio, visible-camera, and thermal-camera data can enhance the robustness of person recognition. Existing multimodal person recognition frameworks, however, are typically formulated under the assumption that data from all modalities is always available. In this paper, we propose a novel trimodal sensor fusion framework using audio, visible, and thermal data that addresses the missing modality problem. Within this framework, a deep latent embedding network, termed AVTNet, learns multiple latent embeddings. A novel loss function, termed the missing modality loss, extends the triplet loss calculation to account for possibly missing modalities while the individual latent embeddings are learnt. Additionally, a joint latent embedding of the trimodal data is learnt using a multi-head attention transformer, which assigns attention weights to the individual modalities. The different latent embeddings are subsequently used to train a deep neural network. The proposed framework is validated on the SpeakingFaces dataset. A comparative analysis with baseline algorithms shows that the proposed framework significantly increases person recognition accuracy while accounting for missing modalities.
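The abstract describes two mechanisms: a triplet-style loss that tolerates absent modalities, and a joint trimodal embedding fused by multi-head attention. The sketch below is a minimal illustration of those two ideas under stated assumptions, not the authors' AVTNet implementation: it assumes PyTorch, and the class names `MissingModalityTripletLoss` and `TrimodalAttentionFusion` are hypothetical.

```python
# Minimal sketch of (a) a triplet loss masked over missing modalities and
# (b) multi-head attention fusion of trimodal embeddings. Hypothetical names;
# not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MissingModalityTripletLoss(nn.Module):
    """Triplet loss averaged only over the modalities present in each sample.

    anchors/positives/negatives: dicts mapping a modality name ("audio",
    "visible", "thermal") to a (batch, dim) embedding tensor.
    masks: dict mapping a modality name to a (batch,) float tensor that is
    1.0 where the modality is present and 0.0 where it is missing.
    """

    def __init__(self, margin: float = 0.2):
        super().__init__()
        self.margin = margin

    def forward(self, anchors, positives, negatives, masks):
        total, count = 0.0, 0.0
        for m in anchors:
            d_ap = F.pairwise_distance(anchors[m], positives[m])  # (batch,)
            d_an = F.pairwise_distance(anchors[m], negatives[m])  # (batch,)
            per_sample = F.relu(d_ap - d_an + self.margin)
            # Missing modalities contribute nothing to the loss or the count.
            total = total + (per_sample * masks[m]).sum()
            count = count + masks[m].sum()
        return total / count.clamp(min=1.0)  # guard against an all-missing batch


class TrimodalAttentionFusion(nn.Module):
    """Fuses per-modality embeddings into one joint latent embedding.

    Stacks the three modality embeddings as a length-3 token sequence so that
    multi-head self-attention can weight the modalities before pooling.
    """

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio, visible, thermal):
        tokens = torch.stack([audio, visible, thermal], dim=1)  # (batch, 3, dim)
        fused, _ = self.attn(tokens, tokens, tokens)            # (batch, 3, dim)
        return fused.mean(dim=1)                                # (batch, dim)


if __name__ == "__main__":
    batch, dim = 8, 256
    mods = ["audio", "visible", "thermal"]
    anchors = {m: torch.randn(batch, dim) for m in mods}
    positives = {m: torch.randn(batch, dim) for m in mods}
    negatives = {m: torch.randn(batch, dim) for m in mods}
    # Simulate a missing thermal stream for half the batch.
    masks = {m: torch.ones(batch) for m in mods}
    masks["thermal"][: batch // 2] = 0.0

    loss = MissingModalityTripletLoss()(anchors, positives, negatives, masks)
    joint = TrimodalAttentionFusion(dim)(anchors["audio"],
                                         anchors["visible"],
                                         anchors["thermal"])
    print(loss.item(), joint.shape)
```

Treating the three modality embeddings as a short token sequence lets the attention weights vary per sample, which matches the abstract's description of assigning attention weights to the different modalities; the masked denominator in the loss keeps the gradient scale comparable whether a sample has one, two, or three modalities.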