Preliminaries

Watanabe, Shinji; Delcroix, Marc; Metze, Florian; Hershey, John R.

doi:10.1007/978-3-319-64680-0_1

Chapter
First Online: 26 July 2017

Abstract

Robust automatic speech recognition (ASR) technologies have greatly evolved due to the emergence of deep learning. This chapter introduces the general background of robustness issues of deep neural-network-based ASR. It provides an overview of robust ASR research including a brief history of several studies before the deep learning era, basic formulations of ASR, signal processing, and neural networks. This chapter also introduces common notations for variables and equations, which are extended in the later chapters to deal with more advanced topics. Finally, the chapter provides an overview of the book structure by summarizing the contributions of the individual chapters and associates them with the different components of a robust ASR system.

Notes

1. The WERs refer to the Kaldi AMI recipe, November 15, 2016. https://github.com/kaldi-asr/kaldi/blob/master/egs/ami/s5b.
2. http://www.clsp.jhu.edu/workshops/15-workshop/.
3. However, these concepts have inspired related techniques for DNN-based acoustic models, such as DNN parameter regularization based on the L2 norm and Kullback–Leibler (KL) divergence, that can be regarded as a variant of MAP adaptation in the context of DNNs.
4. This problem is discussed in Chap. 13.

Author information

Authors and Affiliations

Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA
Shinji Watanabe & John R. Hershey
NTT Communication Science Laboratories, NTT Corporation, 2-4, Hikaridai, Seika-cho, Kyoto, Japan
Marc Delcroix
Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA
Florian Metze

Corresponding author

Correspondence to Shinji Watanabe.

Editor information

Editors and Affiliations

Mitsubishi Electric Research Laboratories (MERL), Cambridge, Massachusetts, USA
Shinji Watanabe
NTT Communication Science Laboratories, NTT Corporation, Kyoto, Japan
Marc Delcroix
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Florian Metze
Mitsubishi Electric Research Laboratories (MERL), Cambridge, Massachusetts, USA
John R. Hershey

About this chapter

Cite this chapter

Watanabe, S., Delcroix, M., Metze, F., Hershey, J.R. (2017). Preliminaries. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_1

DOI: https://doi.org/10.1007/978-3-319-64680-0_1
Published: 26 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0
eBook Packages: Computer Science, Computer Science (R0)
