Abstract
This paper investigates how prominent words can be distinguished from non-prominent ones in a setting where a user interacted with a computer in a small game designed as a Wizard-of-Oz experiment. Misunderstandings by the system were deliberately triggered, and the user was asked to correct them naturally, i.e., using prosodic cues; the corrected word is therefore expected to be highly prominent. Audio-visual recordings were made with a distant microphone and without visual markers. Relative energy, duration, and fundamental frequency were calculated as acoustic features; rigid head movements and image-transformation-based features from the mouth region were extracted from the visual channel. Different feature combinations are evaluated with respect to their power to discriminate prominent from non-prominent words using an SVM. Depending on the features, accuracies of approximately 70%–80% are achieved, with the visual features proving particularly beneficial when the acoustic features are weaker.
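The classification step described in the abstract can be sketched as follows. This is a minimal, illustrative sketch and not the paper's implementation: it assumes per-word feature vectors have already been extracted and time-aligned (acoustic: relative energy, duration, mean F0; visual: head-movement and mouth-region descriptors), uses random placeholder data, and relies on scikit-learn's SVC; all dimensions and names are assumptions.

```python
# Hedged sketch: prominent vs. non-prominent word classification with an SVM
# from per-word audio-visual features. All data and dimensions are placeholders.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Placeholder data: 200 words, 3 acoustic + 5 visual features per word.
X_acoustic = rng.normal(size=(200, 3))   # relative energy, duration, mean F0
X_visual = rng.normal(size=(200, 5))     # head-movement / mouth-region descriptors
y = rng.integers(0, 2, size=200)         # 1 = prominent, 0 = non-prominent

# Early (feature-level) fusion: concatenate both modalities.
X_av = np.hstack([X_acoustic, X_visual])

# Standardize features, then classify with an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

# Compare feature combinations, as the paper does, via cross-validation.
for name, X in [("acoustic", X_acoustic), ("visual", X_visual),
                ("audio-visual", X_av)]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:>12s} accuracy: {acc:.2f}")
```

With real features, comparing the three runs would reproduce the paper's central observation, i.e., whether adding the visual channel improves over the acoustic features alone.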
Acknowledgements
I want to thank Petra Wagner, Britta Wrede, and Heiko Wersing for fruitful discussions. Furthermore, I am very grateful to Rujiao Yan and Samuel Kevin Ngouoko for their help in setting up the visual processing and the forced alignment, respectively. Many thanks to Mark Dunn for support with the cameras and the recording system, as well as to Mathias Franzius for support with tuning the SVMs. Special thanks go to my subjects for their patience and effort.
Copyright information
© 2014 Springer Science+Business Media New York
Cite this paper
Heckmann, M. (2014). Visual Contribution to Word Prominence Detection in a Playful Interaction Setting. In: Mariani, J., Rosset, S., Garnier-Rizet, M., Devillers, L. (eds) Natural Interaction with Robots, Knowbots and Smartphones. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8280-2_21
DOI: https://doi.org/10.1007/978-1-4614-8280-2_21
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8279-6
Online ISBN: 978-1-4614-8280-2
eBook Packages: Engineering (R0)