Research Article

Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

Published: 24 April 2018

Abstract

Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, especially those based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, given sufficient training data, can alleviate the shortcomings of unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks. Along the way, we discuss the pros and cons of these approaches and report their experimental results on benchmark databases. We expect this overview to facilitate the development of robust speech recognition systems for noisy acoustic environments.



Published in ACM Transactions on Intelligent Systems and Technology, Volume 9, Issue 5: Research Survey and Regular Papers (September 2018), 274 pages.
ISSN: 2157-6904; EISSN: 2157-6912; DOI: 10.1145/3210369

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States

Publication History
• Published: 24 April 2018
• Accepted: 1 January 2018
• Revised: 1 November 2017
• Received: 1 July 2017
