ISCA Archive Interspeech 2012

Learning when to listen: detecting system-addressed speech in human-human-computer dialog

Elizabeth Shriberg, Andreas Stolcke, Dilek Hakkani-Tür, Larry Heck

New challenges arise for addressee detection when multiple people interact jointly with a spoken dialog system using unconstrained natural language. We study the problem of discriminating computer-directed from human-directed speech in a new corpus of human-human-computer (H-H-C) dialog, using lexical and prosodic features. The prosodic features use no word, context, or speaker information. Results with 19% WER speech recognition show improvements from lexical features (EER=23.1%) to prosodic features (EER=12.6%) to a combined model (EER=11.1%). Prosodic features also provide a 35% error reduction over a lexical model using true words (EER from 10.2% to 6.7%). Modeling energy contours with GMMs provides a particularly good prosodic model. While lexical models perform well for commands, they confuse free-form system-directed speech with human-human speech. Prosodic models dramatically reduce these confusions, implying that users change speaking style as they shift addressees (computer versus human) within a session. Overall results provide strong support for combining simple acoustic-prosodic models with lexical models to detect speaking style differences for this task.
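The equal error rates (EER) quoted above can be made concrete with a small sketch. This is not the authors' evaluation code: the function names `equal_error_rate` and `relative_error_reduction` and the toy scores are hypothetical, and the threshold sweep is only one simple way to approximate an EER. It assumes that higher scores mean "computer-directed" and that the 35% figure is a relative reduction. Computed from the rounded EERs in the abstract, that reduction comes out at about 34%.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption: higher score = more likely computer-directed (target).
    Returns the mean of miss and false-alarm rates at the threshold
    where the two rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # Miss: a target utterance scored below the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: a non-target utterance scored at or above the threshold.
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - fa)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + fa) / 2
    return eer


def relative_error_reduction(baseline_eer, improved_eer):
    """Fraction of the baseline error removed by the improved model."""
    return (baseline_eer - improved_eer) / baseline_eer


# Toy scores (made up for illustration, not from the paper).
toy_targets = [0.9, 0.8, 0.7, 0.3]
toy_nontargets = [0.6, 0.4, 0.2, 0.1]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)

# EERs from the abstract: lexical model on true words vs. with prosody added.
reduction = relative_error_reduction(10.2, 6.7)
print(f"toy EER = {toy_eer:.2f}, relative reduction = {reduction:.1%}")
```

A production evaluation would typically interpolate between operating points on the detection error trade-off curve rather than pick the nearest observed threshold.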

Index Terms: addressee detection, spoken dialog system, prosody, language model, GMM, boosting, logistic regression.


doi: 10.21437/Interspeech.2012-83

Cite as: Shriberg, E., Stolcke, A., Hakkani-Tür, D., Heck, L. (2012) Learning when to listen: detecting system-addressed speech in human-human-computer dialog. Proc. Interspeech 2012, 334-337, doi: 10.21437/Interspeech.2012-83

@inproceedings{shriberg12_interspeech,
  author={Elizabeth Shriberg and Andreas Stolcke and Dilek Hakkani-Tür and Larry Heck},
  title={{Learning when to listen: detecting system-addressed speech in human-human-computer dialog}},
  year=2012,
  booktitle={Proc. Interspeech 2012},
  pages={334--337},
  doi={10.21437/Interspeech.2012-83}
}