Abstract
In speech recognition systems, each word or phoneme in the vocabulary is typically represented by a model trained with samples of that particular class. Recognition is then performed by determining which model best represents the input word/phoneme to be classified. In this paper, a novel classification strategy based on complementary class models is presented. A complementary model for a particular class \(j\) is a model trained with instances of all the considered classes except those associated with class \(j\). This work describes new multi-classifier schemes for isolated word speech recognition based on the combination of standard Hidden Markov Models (HMMs) and Complementary Gaussian Mixture Models (CGMMs). In particular, two different conditions are considered. If the data is represented by a single feature vector, a cascade classification scheme using HMMs and CGMMs is proposed. When the data is instead represented by multiple feature vectors, a classification scheme based on a voting strategy that combines scores from individual HMMs and CGMMs is proposed. The proposed classification schemes are evaluated on two audio-visual speech databases under acoustic noise conditions. Experimental results show that the proposed classification methodologies improve recognition rates over a wide range of signal-to-noise ratios.
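The core idea of a complementary class model can be sketched in a few lines: for each class \(j\), fit a density model on every training sample *not* labeled \(j\), then classify a new sample as the class whose complementary model explains it worst. The sketch below is illustrative only, not the authors' implementation: it stands in a single diagonal Gaussian for each complementary GMM, and all function names are assumptions.

```python
import numpy as np

def fit_gaussian(X):
    """Single-Gaussian stand-in for a complementary GMM: mean and diagonal variance."""
    mu = X.mean(axis=0)
    var = X.var(axis=0) + 1e-6  # small floor to avoid division by zero
    return mu, var

def log_likelihood(x, mu, var):
    """Diagonal-Gaussian log-likelihood of a single feature vector x."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)

def train_complementary_models(X, y, classes):
    """Complementary model for class j: trained on all samples NOT labeled j."""
    return {j: fit_gaussian(X[y != j]) for j in classes}

def classify(x, comp_models):
    """Pick the class whose complementary model assigns x the LOWEST likelihood."""
    scores = {j: log_likelihood(x, mu, var) for j, (mu, var) in comp_models.items()}
    return min(scores, key=scores.get)
```

Note the inverted decision rule relative to conventional class models: a low score under the complement of class \(j\) is evidence *for* class \(j\), since the complement was trained without that class's samples.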
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Sad, G.D., Terissi, L.D., Gómez, J.C. (2015). Complementary Gaussian Mixture Models for Multimodal Speech Recognition. In: Schwenker, F., Scherer, S., Morency, L.P. (eds) Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction. MPRSS 2014. Lecture Notes in Computer Science, vol 8869. Springer, Cham. https://doi.org/10.1007/978-3-319-14899-1_6
Print ISBN: 978-3-319-14898-4
Online ISBN: 978-3-319-14899-1