Research Article

Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

Published: 24 April 2018

Abstract

Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, especially those based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, given sufficient training data, can alleviate the shortcomings of unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks. Along the way, we discuss the pros and cons of these approaches and report their experimental results on benchmark databases. We expect this overview to facilitate the development of robust speech recognition systems for noisy acoustic environments.



Published in ACM Transactions on Intelligent Systems and Technology, Volume 9, Issue 5: Research Survey and Regular Papers (September 2018), 274 pages.
ISSN: 2157-6904; EISSN: 2157-6912; DOI: 10.1145/3210369

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States

Publication History
• Published: 24 April 2018
• Accepted: 1 January 2018
• Revised: 1 November 2017
• Received: 1 July 2017
