
Survey of deep emotion recognition in dynamic data using facial, speech and textual cues

Published in: Multimedia Tools and Applications

Abstract

With the advancement of multimedia and human-computer interaction, it has become increasingly crucial to perceive people’s emotional states in dynamic data (e.g., video, audio, text streams) in order to serve them effectively. Emotion recognition has emerged as a prominent research area over the past decades. Traditional methods rely heavily on manually crafted features and focus primarily on a single modality; they struggle to extract sufficient discriminative information for complex emotion recognition tasks. To tackle this issue, deep neural model-based methods have gained significant popularity: they automatically learn more discriminative emotional features, addressing the poor discriminability of manually designed features, and they are further employed to integrate information across multiple modalities, enhancing the extraction of discriminative information. In this paper, we provide a comprehensive review of studies on deep neural model-based emotion recognition in dynamic data using facial, speech, and textual cues published within the past five years. Specifically, we first explain discrete and continuous representations of emotion by introducing widely accepted emotion models. We then elucidate how advanced methods combine different neural models, organizing them by the popular deep neural architectures they employ (e.g., the Transformer), together with the corresponding preprocessing mechanisms. In addition, we chart the development trend by surveying diverse datasets, metrics, and competitive performances. Finally, we discuss significant research challenges and opportunities. Our survey bridges gaps in the literature, since existing surveys are narrower in focus: they cover only single-modal methods, concentrate solely on multi-modal methods, overlook certain aspects of face, speech, and text, or emphasize outdated methodologies.
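To make the multimodal integration described above concrete, the following is a minimal, hypothetical PyTorch sketch of late fusion over pre-extracted facial, speech, and textual features for discrete emotion classification. Every module name, feature dimension, and the six-class label set here is an illustrative assumption, not a detail taken from the surveyed methods.

```python
import torch
import torch.nn as nn

class LateFusionEmotionClassifier(nn.Module):
    """Illustrative late-fusion model: one encoder per modality, then a
    shared classification head. Dimensions are assumptions, e.g. 512-d face
    embeddings, 128-d acoustic features, 768-d text embeddings."""

    def __init__(self, face_dim=512, speech_dim=128, text_dim=768,
                 hidden_dim=256, num_emotions=6):  # e.g., Ekman's six basic emotions
        super().__init__()
        # Project each modality into a shared hidden space.
        self.face_enc = nn.Sequential(nn.Linear(face_dim, hidden_dim), nn.ReLU())
        self.speech_enc = nn.Sequential(nn.Linear(speech_dim, hidden_dim), nn.ReLU())
        self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Fusion by concatenation, followed by a linear emotion head.
        self.classifier = nn.Linear(3 * hidden_dim, num_emotions)

    def forward(self, face_feat, speech_feat, text_feat):
        fused = torch.cat([self.face_enc(face_feat),
                           self.speech_enc(speech_feat),
                           self.text_enc(text_feat)], dim=-1)
        return self.classifier(fused)  # unnormalized emotion logits

# Usage with random stand-in features for a batch of 4 video clips.
model = LateFusionEmotionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 128), torch.randn(4, 768))
print(logits.argmax(dim=-1))  # predicted discrete emotion index per clip
```

In the literature this survey covers, such plain concatenation is typically replaced by attention- or Transformer-based fusion, but the overall data flow is the same: per-modality encoders followed by a fusion step and an emotion prediction head.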


Availability of data and materials

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Code Availability

Not applicable


Funding

This work is supported by the National Natural Science Foundation of China under Grant 61772125 and by the Fundamental Research Funds for the Central Universities under Grant N2217001.

Author information


Contributions

Not applicable

Corresponding author

Correspondence to Zhenhua Tan.

Ethics declarations

Ethics approval

Not applicable

Consent to participate

Not applicable

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, T., Tan, Z. Survey of deep emotion recognition in dynamic data using facial, speech and textual cues. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-023-17944-9
