DOI: 10.1145/3577190.3614122
Research Article | Open Access

HIINT: Historical, Intra- and Inter- personal Dynamics Modeling with Cross-person Memory Transformer

Published: 9 October 2023

ABSTRACT

Accurately modeling affect dynamics, the changes and fluctuations in emotions and affective displays over the course of human conversations, is crucial for understanding human interactions. Modeling affect dynamics is challenging, however, because of contextual factors such as the complex and nuanced nature of intra- and inter-personal dependencies. Intrapersonal dependencies refer to the influences and dynamics within an individual, including their affective states and how those states evolve over time. Interpersonal dependencies, in contrast, involve the interactions and dynamics between individuals, encompassing how affective displays are influenced by and influence others during conversations. To address these challenges, we propose the Cross-person Memory Transformer (CPM-T), a framework that explicitly models intra- and inter-personal dependencies in multimodal non-verbal cues. CPM-T maintains memory modules that store and update dependencies between earlier and later parts of a conversation. In addition, the framework employs cross-modal attention to align information across modalities and cross-person attention to align behaviors in multi-party interactions. We evaluate the effectiveness and robustness of our approach on three publicly available datasets covering joint engagement, rapport, and human belief prediction. Our framework outperforms baseline models in average F1-score by up to 22.6%, 15.1%, and 10.0%, respectively, on these three tasks. Finally, we demonstrate the importance of each component of the framework via ablation studies with respect to multimodal temporal behavior.
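
As a concrete illustration of the mechanisms named in the abstract, the short Python (PyTorch) sketch below combines cross-modal attention, cross-person attention, and a rolling memory in a single block. This is a minimal sketch under our own assumptions about tensor shapes, module names, and the memory-update rule; it is not the authors' released implementation of CPM-T.

# Minimal sketch (illustrative, not the authors' code) of the three ingredients the
# abstract describes: cross-modal attention aligning one person's modalities,
# cross-person attention aligning behavior with an interaction partner, and a
# memory that carries dependencies from earlier parts of the conversation.
import torch
import torch.nn as nn

class CrossPersonMemoryBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, mem_slots: int = 16):
        super().__init__()
        self.cross_modal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_person = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mem_update = nn.GRUCell(dim, dim)                    # hypothetical memory-update rule
        self.memory = nn.Parameter(torch.zeros(mem_slots, dim))   # learnable memory slots

    def forward(self, vis_a, aud_a, vis_b):
        # vis_a, aud_a: visual/audio features of person A, shape (B, T, dim)
        # vis_b: visual features of the interaction partner, shape (B, T, dim)
        B = vis_a.size(0)
        mem = self.memory.unsqueeze(0).expand(B, -1, -1)          # (B, mem_slots, dim)

        # 1) cross-modal attention: audio queries attend to visual keys/values
        fused_a, _ = self.cross_modal(aud_a, vis_a, vis_a)

        # 2) cross-person attention: A's fused cues attend to the partner's cues
        #    together with memory of earlier conversation segments
        ctx = torch.cat([vis_b, mem], dim=1)
        aligned, _ = self.cross_person(fused_a, ctx, ctx)

        # 3) memory update: summarize this segment and fold it into the memory state
        summary = aligned.mean(dim=1)                             # (B, dim)
        new_mem = self.mem_update(summary, mem.mean(dim=1))       # (B, dim)
        return aligned, new_mem

# usage with random features standing in for one conversation segment
block = CrossPersonMemoryBlock()
out, mem = block(torch.randn(2, 32, 256), torch.randn(2, 32, 256), torch.randn(2, 32, 256))
print(out.shape, mem.shape)  # torch.Size([2, 32, 256]) torch.Size([2, 256])

In a full model, such a block would be applied segment by segment, with the updated memory passed forward so that historical context from earlier in the conversation informs later predictions.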


Published in

          ICMI '23: Proceedings of the 25th International Conference on Multimodal Interaction
          October 2023
          858 pages
ISBN: 9798400700552
DOI: 10.1145/3577190

          Copyright © 2023 Owner/Author

This work is licensed under a Creative Commons Attribution 4.0 International License.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 9 October 2023


          Qualifiers

          • research-article
          • Research
          • Refereed limited

          Acceptance Rates

Overall Acceptance Rate: 453 of 1,080 submissions, 42%
