Abstract
Using supporting backchannel (BC) cues can make human-computer interaction more social. BCs provide feedback from the listener to the speaker, signaling that the listener is still paying attention. BCs can be expressed in different ways depending on the modality of the interaction, for example as gestures or as acoustic cues. In this work, we consider acoustic cues only. We propose an approach to detecting BC opportunities based on acoustic input features such as power and pitch. While other work in the field relies on hand-written rule sets or specialized features, we use artificial neural networks, which are capable of deriving higher-order features from the input features themselves. In our setup, we first used a fully connected feed-forward network to establish an updated baseline relative to our previously proposed setup. We then extended this setup with Long Short-Term Memory (LSTM) networks, which have been shown to outperform feed-forward setups on various tasks. Our best system achieved an F1-score of 0.37 using power and pitch features; adding linguistic information via word2vec increased the score to 0.39.
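The reported F1-scores compare predicted BC opportunities against ground-truth listener BCs. A common evaluation in BC-prediction work counts a prediction as a true positive if it falls within a tolerance margin of a reference BC time; the sketch below illustrates this style of scoring. The margin value, the greedy one-to-one matching, and the function name are illustrative assumptions, not the paper's exact protocol.

```python
from bisect import bisect_left

def f1_with_margin(predicted, reference, margin=0.5):
    """Compute F1 between predicted and reference BC times (seconds).

    A prediction counts as a true positive if it can be greedily
    matched one-to-one to a reference BC within +/- margin seconds.
    NOTE: margin and matching scheme are illustrative assumptions.
    """
    reference = sorted(reference)
    used = [False] * len(reference)
    tp = 0
    for t in sorted(predicted):
        # find the first candidate reference time within the margin window
        i = bisect_left(reference, t - margin)
        while i < len(reference) and reference[i] <= t + margin:
            if not used[i]:
                used[i] = True  # each reference BC matches at most once
                tp += 1
                break
            i += 1
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With this scheme, a system that fires slightly early or late still gets credit, which matches the intuition that a BC is appropriate within a short window rather than at one exact frame.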
Notes
- 1.
Our code for extraction, training, postprocessing and evaluation is available at https://github.com/phiresky/backchannel-prediction. The repository also contains a script to reproduce all of the results of this paper.
Acknowledgements
This work has been conducted in the SecondHands project, which has received funding from the European Union's Horizon 2020 Research and Innovation programme (call: H2020-ICT-2014-1, RIA) under grant agreement No. 643950.
Copyright information
© 2019 Springer International Publishing AG, part of Springer Nature
Cite this chapter
Ruede, R., Müller, M., Stüker, S., Waibel, A. (2019). Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor. In: Eskenazi, M., Devillers, L., Mariani, J. (eds) Advanced Social Interaction with Agents. Lecture Notes in Electrical Engineering, vol 510. Springer, Cham. https://doi.org/10.1007/978-3-319-92108-2_25
Print ISBN: 978-3-319-92107-5
Online ISBN: 978-3-319-92108-2