Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

ABSTRACT
Obtaining large, human-labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves state-of-the-art performance on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available.
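The cross-modal distillation described above trains the speech student to match the face teacher's predicted emotion distribution on time-aligned clips, rather than any hard label. A minimal NumPy sketch of the soft-target loss is below; the function names, the temperature value, and the toy logits are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy between teacher soft targets and student predictions.

    Conceptually, the teacher sees face frames and the student sees the
    time-aligned speech segment; softening with temperature T transfers
    the teacher's full distribution over emotions, not just its argmax.
    """
    p = softmax(teacher_logits, T)   # soft targets (treated as constants)
    q = softmax(student_logits, T)   # student predictions
    return -np.mean(np.sum(p * np.log(q + 1e-12), axis=-1))

# Toy example: 4 clips, 8 emotion classes.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))
loss_matched = distillation_loss(teacher, teacher)               # student agrees
loss_random = distillation_loss(teacher, rng.normal(size=(4, 8)))  # student guesses
assert loss_matched < loss_random  # agreeing with the teacher lowers the loss
```

The loss is minimised exactly when the student's softened distribution equals the teacher's, which is what makes it usable as the sole training signal when no audio labels exist.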