ABSTRACT
In this paper, we present our solution to the MuSe-Personalisation sub-challenge of the Multimodal Sentiment Analysis Challenge (MuSe) 2023. The MuSe-Personalisation task aims to predict time-continuous emotion values (i.e., arousal and valence) from multimodal data. A central difficulty of this sub-challenge is individual variation, which causes models to generalise poorly to unseen test subjects. To address this problem, we first extract several informative visual features and then propose a framework comprising feature selection, feature learning, and a fusion strategy to discover the best combination of features for sentiment analysis. Our method achieved the Top-1 performance in the MuSe-Personalisation sub-challenge, reaching a combined CCC of 0.8681 over physiological arousal and valence on the test set and outperforming the baseline system by a large margin (i.e., 10.42%).
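Submissions in this sub-challenge are scored with the Concordance Correlation Coefficient (CCC). As a minimal sketch (the function and variable names are ours, not from the paper; averaging the two per-dimension CCCs into the combined score is our assumption about how the reported 0.8681 is formed), the metric can be computed as:

```python
import numpy as np

def ccc(preds, labels):
    """Concordance Correlation Coefficient between two 1-D time series."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mean_p, mean_l = preds.mean(), labels.mean()
    var_p, var_l = preds.var(), labels.var()
    cov = ((preds - mean_p) * (labels - mean_l)).mean()
    # CCC penalises both low correlation and mean/scale mismatch.
    return 2.0 * cov / (var_p + var_l + (mean_p - mean_l) ** 2)

def combined_ccc(pred_arousal, gold_arousal, pred_valence, gold_valence):
    # Assumed combination: simple mean of the arousal and valence CCCs.
    return 0.5 * (ccc(pred_arousal, gold_arousal)
                  + ccc(pred_valence, gold_valence))
```

A perfect prediction yields a CCC of 1, while predictions that track the trend but with a shifted mean or wrong scale score strictly lower, which is why CCC is preferred over plain Pearson correlation for continuous emotion targets.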
Index Terms
- Temporal-aware Multimodal Feature Fusion for Sentiment Analysis