Abstract
Using supporting backchannel (BC) cues can make human-computer interaction more social. BCs provide feedback from the listener to the speaker, signaling that the listener is still paying attention. BCs can be expressed in different ways depending on the modality of the interaction, for example as gestures or as acoustic cues. In this work, we consider acoustic cues only. We propose an approach to detecting BC opportunities based on acoustic input features such as power and pitch. While other work in the field relies on hand-written rule sets or specialized features, we use artificial neural networks, which are capable of deriving higher-order features from the input features themselves. In our setup, we first used a fully connected feed-forward network to establish an updated baseline relative to our previously proposed setup. We then extended this setup with Long Short-Term Memory (LSTM) networks, which have been shown to outperform feed-forward setups on various tasks. Our best system achieved an F1-score of 0.37 using power and pitch features; adding linguistic information via word2vec increased the score to 0.39.
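The reported F1-scores compare predicted BC opportunities against ground-truth listener BCs. A common evaluation in BC-prediction work counts a prediction as a true positive if it falls within a tolerance margin of a reference BC time; the sketch below illustrates this style of scoring. The margin value, the greedy one-to-one matching, and the function name are illustrative assumptions, not the paper's exact protocol.

```python
from bisect import bisect_left

def f1_with_margin(predicted, reference, margin=0.5):
    """Compute F1 between predicted and reference BC times (seconds).

    A prediction counts as a true positive if it can be greedily
    matched one-to-one to a reference BC within +/- margin seconds.
    NOTE: margin and matching scheme are illustrative assumptions.
    """
    reference = sorted(reference)
    used = [False] * len(reference)
    tp = 0
    for t in sorted(predicted):
        # find the first candidate reference time within the margin window
        i = bisect_left(reference, t - margin)
        while i < len(reference) and reference[i] <= t + margin:
            if not used[i]:
                used[i] = True  # each reference BC matches at most once
                tp += 1
                break
            i += 1
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With this scheme, a system that fires slightly early or late still gets credit, which matches the intuition that a BC is appropriate within a short window rather than at one exact frame.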
Notes
- 1.
Our code for extraction, training, postprocessing and evaluation is available at https://github.com/phiresky/backchannel-prediction. The repository also contains a script to reproduce all of the results of this paper.
Acknowledgements
This work has been conducted in the SecondHands project, which has received funding from the European Union's Horizon 2020 Research and Innovation programme (call: H2020-ICT-2014-1, RIA) under grant agreement No. 643950.
Copyright information
© 2019 Springer International Publishing AG, part of Springer Nature
Cite this chapter
Ruede, R., Müller, M., Stüker, S., Waibel, A. (2019). Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor. In: Eskenazi, M., Devillers, L., Mariani, J. (eds) Advanced Social Interaction with Agents. Lecture Notes in Electrical Engineering, vol 510. Springer, Cham. https://doi.org/10.1007/978-3-319-92108-2_25
Print ISBN: 978-3-319-92107-5
Online ISBN: 978-3-319-92108-2