ABSTRACT
Accurately modeling affect dynamics, the changes and fluctuations in emotions and affective displays during human conversations, is crucial for understanding human interactions. However, modeling affect dynamics is challenging due to contextual factors such as the complex and nuanced nature of intra- and inter-personal dependencies. Intrapersonal dependencies refer to the influences and dynamics within an individual, including their affective states and how those states evolve over time. Interpersonal dependencies, in contrast, involve the interactions between individuals, encompassing how affective displays are influenced by and influence others during conversations. To address these challenges, we propose the Cross-person Memory Transformer (CPM-T), a framework that explicitly models intra- and inter-personal dependencies in multimodal non-verbal cues. CPM-T maintains memory modules that store and update dependencies between earlier and later parts of a conversation. The framework further employs cross-modal attention to align information across modalities and cross-person attention to align behaviors in multi-party interactions. We evaluate the effectiveness and robustness of our approach on three publicly available datasets covering joint engagement, rapport, and human belief prediction. Our framework outperforms baseline models by up to 22.6%, 15.1%, and 10.0% in average F1-score on the three tasks, respectively. Finally, we demonstrate the importance of each component of the framework through ablation studies on multimodal temporal behavior.
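To make the architecture described above concrete, below is a minimal PyTorch sketch of how memory slots, cross-modal attention, and cross-person attention could be combined in one block. The class name `CrossPersonMemoryBlock`, the slot count, the fusion order, and the use of static learnable memory (the paper's memory *update* step is omitted here) are our own illustrative assumptions, not the authors' released implementation.

```python
# A hypothetical sketch of the CPM-T ideas, not the authors' code.
import torch
import torch.nn as nn


class CrossPersonMemoryBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, mem_slots: int = 8):
        super().__init__()
        # Learnable memory slots intended to carry context between earlier
        # and later parts of a conversation (write-back omitted for brevity).
        self.memory = nn.Parameter(torch.randn(mem_slots, dim))
        # Cross-modal attention: one modality queries another
        # (e.g., person A's video features query their audio features).
        self.cross_modal = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-person attention: one person's fused features query a partner's.
        self.cross_person = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Memory read: the aligned features attend over the memory slots.
        self.mem_read = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_a, audio_a, video_b):
        # All inputs: (batch, time, dim) per-modality feature sequences.
        # Intra-personal, cross-modal fusion for person A.
        fused_a, _ = self.cross_modal(video_a, audio_a, audio_a)
        # Inter-personal alignment: A's behavior attends to B's behavior.
        aligned, _ = self.cross_person(fused_a, video_b, video_b)
        # Condition on conversation-level memory.
        mem = self.memory.unsqueeze(0).expand(aligned.size(0), -1, -1)
        out, _ = self.mem_read(aligned, mem, mem)
        return out


block = CrossPersonMemoryBlock()
v_a, a_a, v_b = (torch.randn(2, 50, 256) for _ in range(3))
print(block(v_a, a_a, v_b).shape)  # torch.Size([2, 50, 256])
```

A pooled output of such a block could feed a classification head for the engagement, rapport, or belief prediction tasks; the per-modality feature extractors (e.g., for face, pose, or audio) are assumed to run upstream.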
Index Terms
- HIINT: Historical, Intra- and Inter-personal Dynamics Modeling with Cross-person Memory Transformer