ABSTRACT
Accurately modeling affect dynamics, the changes and fluctuations in emotions and affective displays during human conversations, is crucial for understanding human interactions. However, modeling affect dynamics is challenging due to contextual factors such as the complex and nuanced nature of intra- and inter-personal dependencies. Intrapersonal dependencies refer to the influences and dynamics within an individual, including their affective states and how those states evolve over time. Interpersonal dependencies, in contrast, involve the interactions between individuals, encompassing how affective displays are influenced by and influence others during conversations. To address these challenges, we propose the Cross-person Memory Transformer (CPM-T), a framework that explicitly models intra- and inter-personal dependencies in multimodal non-verbal cues. CPM-T maintains memory modules that store and update dependencies between earlier and later parts of a conversation. The framework further employs cross-modal attention to align information across modalities and cross-person attention to align behaviors in multi-party interactions. We evaluate the effectiveness and robustness of our approach on three publicly available datasets covering joint engagement, rapport, and human belief prediction. Our framework outperforms baseline models by up to 22.6%, 15.1%, and 10.0% in average F1-score on the three tasks, respectively. Finally, we demonstrate the importance of each component of the framework through ablation studies on multimodal temporal behavior.
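To make the architecture described above concrete, below is a minimal PyTorch sketch of how memory slots, cross-modal attention, and cross-person attention could be combined in one block. The class name `CrossPersonMemoryBlock`, the slot count, the fusion order, and the use of static learnable memory (the paper's memory *update* step is omitted here) are our own illustrative assumptions, not the authors' released implementation.

```python
# A hypothetical sketch of the CPM-T ideas, not the authors' code.
import torch
import torch.nn as nn


class CrossPersonMemoryBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, mem_slots: int = 8):
        super().__init__()
        # Learnable memory slots intended to carry context between earlier
        # and later parts of a conversation (write-back omitted for brevity).
        self.memory = nn.Parameter(torch.randn(mem_slots, dim))
        # Cross-modal attention: one modality queries another
        # (e.g., person A's video features query their audio features).
        self.cross_modal = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-person attention: one person's fused features query a partner's.
        self.cross_person = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Memory read: the aligned features attend over the memory slots.
        self.mem_read = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_a, audio_a, video_b):
        # All inputs: (batch, time, dim) per-modality feature sequences.
        # Intra-personal, cross-modal fusion for person A.
        fused_a, _ = self.cross_modal(video_a, audio_a, audio_a)
        # Inter-personal alignment: A's behavior attends to B's behavior.
        aligned, _ = self.cross_person(fused_a, video_b, video_b)
        # Condition on conversation-level memory.
        mem = self.memory.unsqueeze(0).expand(aligned.size(0), -1, -1)
        out, _ = self.mem_read(aligned, mem, mem)
        return out


block = CrossPersonMemoryBlock()
v_a, a_a, v_b = (torch.randn(2, 50, 256) for _ in range(3))
print(block(v_a, a_a, v_b).shape)  # torch.Size([2, 50, 256])
```

A pooled output of such a block could feed a classification head for the engagement, rapport, or belief prediction tasks; the per-modality feature extractors (e.g., for face, pose, or audio) are assumed to run upstream.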
Index Terms
- HIINT: Historical, Intra- and Inter-personal Dynamics Modeling with Cross-person Memory Transformer