Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor

Chapter in: Advanced Social Interaction with Agents. Lecture Notes in Electrical Engineering (LNEE), vol. 510.

Abstract

Using supporting backchannel (BC) cues can make human-computer interaction more social. BCs provide feedback from the listener to the speaker, signaling that the speaker is still being listened to. Depending on the modality of the interaction, BCs can be expressed in different ways, for example as gestures or acoustic cues. In this work, we consider only acoustic cues. We propose an approach to detecting BC opportunities based on acoustic input features such as power and pitch. While other works in the field rely on hand-written rule sets or specialized features, we make use of artificial neural networks, which are capable of deriving higher-order features from the input features themselves. In our setup, we first used a fully connected feed-forward network to establish an updated baseline in comparison to our previously proposed setup. We then extended this setup with Long Short-Term Memory (LSTM) networks, which have been shown to outperform feed-forward setups on various tasks. Our best system achieved an F1-score of 0.37 using power and pitch features. Adding linguistic information via word2vec increased the score to 0.39.
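The pipeline sketched in the abstract — frame-wise acoustic features (e.g. power) fed into a feed-forward scorer of BC opportunities — can be illustrated in a few lines. This is a minimal sketch, not the authors' implementation: the frame size, hop size, weights, and function names below are assumptions for demonstration only.

```python
import numpy as np

def frame_power(signal, frame_len=160, hop=80):
    """Log power per overlapping frame of a mono audio signal."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.log(np.mean(f ** 2) + 1e-10) for f in frames])

def feedforward_bc_score(features, W, b):
    """One dense layer with a sigmoid: score for a BC opportunity."""
    return 1.0 / (1.0 + np.exp(-(features @ W + b)))

rng = np.random.default_rng(0)
signal = rng.standard_normal(1600)        # 0.1 s of toy audio at 16 kHz
feats = frame_power(signal)               # one log-power value per frame
W = rng.standard_normal(feats.shape[0])   # toy weights (untrained)
score = feedforward_bc_score(feats, W, 0.0)
print(float(score))
```

The paper's actual systems additionally use pitch features, replace the single dense layer with trained feed-forward and LSTM networks, and append word2vec embeddings of the recognized words; the released repository (see Notes) contains the full setup.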


Notes

  1. Our code for extraction, training, postprocessing and evaluation is available at https://github.com/phiresky/backchannel-prediction. The repository also contains a script to reproduce all of the results of this paper.



Acknowledgements

This work was conducted in the SecondHands project, which has received funding from the European Union's Horizon 2020 Research and Innovation programme (call: H2020-ICT-2014-1, RIA) under grant agreement No. 643950.

Author information

Correspondence to Markus Müller.

Copyright information

© 2019 Springer International Publishing AG, part of Springer Nature

About this chapter


Cite this chapter

Ruede, R., Müller, M., Stüker, S., Waibel, A. (2019). Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor. In: Eskenazi, M., Devillers, L., Mariani, J. (eds) Advanced Social Interaction with Agents. Lecture Notes in Electrical Engineering, vol 510. Springer, Cham. https://doi.org/10.1007/978-3-319-92108-2_25

  • DOI: https://doi.org/10.1007/978-3-319-92108-2_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-92107-5

  • Online ISBN: 978-3-319-92108-2

  • eBook Packages: Engineering, Engineering (R0)
