DOI: 10.1145/3551626.3564965
Short paper

A Multimodal Sensor Fusion Framework Robust to Missing Modalities for Person Recognition

Published: 13 December 2022

ABSTRACT

Person recognition can be made more robust by exploiting the complementary sensor characteristics of audio, visible cameras, and thermal cameras. Existing multimodal person recognition frameworks, however, are mostly formulated under the assumption that all modalities are always available. In this paper, we propose a novel trimodal sensor fusion framework using audio, visible, and thermal data that addresses the missing modality problem. Within the framework, a deep latent embedding network, termed AVTNet, is proposed to learn multiple latent embeddings. A novel loss function, termed the missing modality loss, accounts for possible missing modalities in the triplet loss calculation while learning the individual latent embeddings. Additionally, a joint latent embedding over the trimodal data is learnt using a multi-head attention transformer, which assigns attention weights to the different modalities. The resulting latent embeddings are then used to train a deep neural network. The proposed framework is validated on the SpeakingFaces dataset, and a comparative analysis with baseline algorithms shows that it significantly increases person recognition accuracy while accounting for missing modalities.
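The abstract names two concrete mechanisms: a triplet-style loss whose terms for absent modalities are skipped, and a multi-head attention transformer that weights the audio, visible, and thermal embeddings when forming the joint embedding. The paper's implementation is not reproduced on this page, so the following is a minimal PyTorch-style sketch of how such components could look; the class names, the 0/1 presence mask, the margin value, and the mean-pooling of attended tokens are illustrative assumptions, not the authors' AVTNet code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MissingModalityTripletLoss(nn.Module):
    """Triplet loss in which terms for absent modalities are masked out
    (an illustrative reading of the 'missing modality loss')."""

    def __init__(self, margin: float = 0.2):
        super().__init__()
        self.margin = margin

    def forward(self, anchor, positive, negative, present):
        # anchor/positive/negative: (batch, 3, dim) per-modality embeddings
        # present: (batch, 3) 0/1 mask, 1 = modality observed for the sample
        d_pos = (anchor - positive).pow(2).sum(dim=-1)  # (batch, 3)
        d_neg = (anchor - negative).pow(2).sum(dim=-1)  # (batch, 3)
        per_modality = F.relu(d_pos - d_neg + self.margin) * present
        # Normalize by the number of observed modality terms.
        return per_modality.sum() / present.sum().clamp(min=1.0)


class TrimodalAttentionFusion(nn.Module):
    """Joint embedding of audio/visible/thermal via multi-head attention,
    with missing modalities excluded through the key padding mask."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visible, thermal, present):
        # Treat the three modality embeddings as a length-3 token sequence.
        tokens = torch.stack([audio, visible, thermal], dim=1)  # (batch, 3, dim)
        pad = present < 0.5  # True entries are ignored by the attention
        attended, _ = self.attn(tokens, tokens, tokens, key_padding_mask=pad)
        # Mean-pool over observed modalities only (assumes >= 1 observed).
        keep = present.unsqueeze(-1)
        return (attended * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)


# Toy usage: batch of 4 samples, one of which is missing its thermal stream.
batch, dim = 4, 256
audio, visible, thermal = (torch.randn(batch, dim) for _ in range(3))
present = torch.ones(batch, 3)
present[0, 2] = 0.0  # thermal missing for sample 0
joint = TrimodalAttentionFusion(dim)(audio, visible, thermal, present)  # (4, 256)
```

In the full framework, the per-modality embeddings would come from the learned AVTNet latent spaces and the joint embedding would feed the downstream person recognition network; mask-based handling as above is only one plausible realization of the missing-modality behaviour the abstract describes.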

Published in

MMAsia '22: Proceedings of the 4th ACM International Conference on Multimedia in Asia
December 2022, 296 pages
ISBN: 9781450394789
DOI: 10.1145/3551626
Copyright © 2022 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 59 of 204 submissions (29%)
