ABSTRACT
The complementary sensor characteristics of audio, visible-camera, and thermal-camera data can enhance the robustness of person recognition. Existing multimodal person recognition frameworks, however, are typically formulated under the assumption that data from all modalities is always available. In this paper, we propose a novel trimodal sensor fusion framework using audio, visible, and thermal data that addresses the missing modality problem. Within this framework, a deep latent embedding network, termed AVTNet, learns multiple latent embeddings. A novel loss function, termed the missing modality loss, extends the triplet loss calculation to account for possibly missing modalities while the individual latent embeddings are learnt. Additionally, a joint latent embedding of the trimodal data is learnt using a multi-head attention transformer, which assigns attention weights to the individual modalities. The different latent embeddings are subsequently used to train a deep neural network. The proposed framework is validated on the SpeakingFaces dataset. A comparative analysis with baseline algorithms shows that the proposed framework significantly increases person recognition accuracy while accounting for missing modalities.
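The abstract describes two mechanisms: a triplet-style loss that tolerates absent modalities, and a joint trimodal embedding fused by multi-head attention. The sketch below is a minimal illustration of those two ideas under stated assumptions, not the authors' AVTNet implementation: it assumes PyTorch, and the class names `MissingModalityTripletLoss` and `TrimodalAttentionFusion` are hypothetical.

```python
# Minimal sketch of (a) a triplet loss masked over missing modalities and
# (b) multi-head attention fusion of trimodal embeddings. Hypothetical names;
# not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MissingModalityTripletLoss(nn.Module):
    """Triplet loss averaged only over the modalities present in each sample.

    anchors/positives/negatives: dicts mapping a modality name ("audio",
    "visible", "thermal") to a (batch, dim) embedding tensor.
    masks: dict mapping a modality name to a (batch,) float tensor that is
    1.0 where the modality is present and 0.0 where it is missing.
    """

    def __init__(self, margin: float = 0.2):
        super().__init__()
        self.margin = margin

    def forward(self, anchors, positives, negatives, masks):
        total, count = 0.0, 0.0
        for m in anchors:
            d_ap = F.pairwise_distance(anchors[m], positives[m])  # (batch,)
            d_an = F.pairwise_distance(anchors[m], negatives[m])  # (batch,)
            per_sample = F.relu(d_ap - d_an + self.margin)
            # Missing modalities contribute nothing to the loss or the count.
            total = total + (per_sample * masks[m]).sum()
            count = count + masks[m].sum()
        return total / count.clamp(min=1.0)  # guard against an all-missing batch


class TrimodalAttentionFusion(nn.Module):
    """Fuses per-modality embeddings into one joint latent embedding.

    Stacks the three modality embeddings as a length-3 token sequence so that
    multi-head self-attention can weight the modalities before pooling.
    """

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio, visible, thermal):
        tokens = torch.stack([audio, visible, thermal], dim=1)  # (batch, 3, dim)
        fused, _ = self.attn(tokens, tokens, tokens)            # (batch, 3, dim)
        return fused.mean(dim=1)                                # (batch, dim)


if __name__ == "__main__":
    batch, dim = 8, 256
    mods = ["audio", "visible", "thermal"]
    anchors = {m: torch.randn(batch, dim) for m in mods}
    positives = {m: torch.randn(batch, dim) for m in mods}
    negatives = {m: torch.randn(batch, dim) for m in mods}
    # Simulate a missing thermal stream for half the batch.
    masks = {m: torch.ones(batch) for m in mods}
    masks["thermal"][: batch // 2] = 0.0

    loss = MissingModalityTripletLoss()(anchors, positives, negatives, masks)
    joint = TrimodalAttentionFusion(dim)(anchors["audio"],
                                         anchors["visible"],
                                         anchors["thermal"])
    print(loss.item(), joint.shape)
```

Treating the three modality embeddings as a short token sequence lets the attention weights vary per sample, which matches the abstract's description of assigning attention weights to the different modalities; the masked denominator in the loss keeps the gradient scale comparable whether a sample has one, two, or three modalities.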