DOI: 10.1145/3240508.3240578
Research Article
Open Access

Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

Published: 15 October 2018

ABSTRACT

Obtaining large, human-labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available.
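Since the abstract only summarises the approach, a minimal sketch of the cross-modal distillation step may help make it concrete. This is an illustration under stated assumptions, not the authors' released code: the toy network definitions (`FaceTeacher`, `SpeechStudent`), the eight-way emotion output, the temperature value, and the data pipeline are all hypothetical stand-ins. Only the general recipe follows the paper: a frozen face teacher produces soft emotion targets for time-aligned speech, and an audio student is trained against them with a softened KL-divergence objective, in the style of knowledge distillation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical backbones: stand-ins for the paper's face teacher (a CNN for
# facial emotion recognition) and speech student (a CNN over spectrograms).
# The architectures and the 8-class output are assumptions for illustration.
class FaceTeacher(nn.Module):
    def __init__(self, num_emotions=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_emotions)

    def forward(self, face_frames):              # (B, 3, H, W) face crops
        h = self.features(face_frames).flatten(1)
        return self.classifier(h)                # emotion logits


class SpeechStudent(nn.Module):
    def __init__(self, num_emotions=8, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.embedding = nn.Linear(64, embed_dim)    # reusable speech emotion embedding
        self.classifier = nn.Linear(embed_dim, num_emotions)

    def forward(self, spectrograms):             # (B, 1, freq, time) log-spectrograms
        h = self.features(spectrograms).flatten(1)
        z = self.embedding(h)
        return self.classifier(z), z


def distillation_step(teacher, student, optimizer, faces, audio, T=2.0):
    """One cross-modal distillation step: the frozen face teacher provides
    soft emotion targets for time-aligned audio; no audio labels are used."""
    teacher.eval()
    with torch.no_grad():
        soft_targets = F.softmax(teacher(faces) / T, dim=1)

    student_logits, _ = student(audio)
    log_probs = F.log_softmax(student_logits / T, dim=1)
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A training loop would iterate this step over face-voice pairs extracted from the same video clips, which is where the "no labelled audio" property comes from: the only supervision on the audio side is the teacher's prediction for the co-occurring face. In this sketch the student's penultimate `embedding` layer plays the role of the speech emotion embedding that the paper then evaluates on external benchmarks.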


Published in

MM '18: Proceedings of the 26th ACM International Conference on Multimedia
October 2018, 2167 pages
ISBN: 9781450356657
DOI: 10.1145/3240508

Copyright © 2018 Owner/Author. This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

MM '18 paper acceptance rate: 209 of 757 submissions, 28%. Overall acceptance rate: 995 of 4,171 submissions, 24%.
