Abstract
Eliminating the negative effects of non-stationary environmental noise is a long-standing research topic in automatic speech recognition, yet it remains an important challenge. Data-driven supervised approaches, especially those based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches; given sufficient training data, they can alleviate the shortcomings of unsupervised methods across a variety of real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in developing environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks. Along the way, we discuss the pros and cons of these approaches and report their experimental results on benchmark databases. We hope that this overview will facilitate the development of speech recognition systems that are robust in noisy acoustic environments.
- Alex Acero. 2012. Acoustical and Environmental Robustness in Automatic Speech Recognition. Vol. 201. Springer Science 8 Business Media, Berlin.Google Scholar
- Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, and others. 2016. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of the International Conference on Machine Learning (ICML’16). New York, NY. 173--182. Google ScholarDigital Library
- Yekutiel Avargel and Israel Cohen. 2007. System identification in the short-time fourier transform domain with crossband filtering. IEEE Trans. Audio Speech Lang. Process. 15, 4 (Mar. 2007), 1305--1319. Google ScholarDigital Library
- Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe. 2015. The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU’15). 504--511.Google ScholarCross Ref
- Jon Barker, Emmanuel Vincent, Ning Ma, Heidi Christensen, and Phil Green. 2013. The PASCAL CHiME speech separation and recognition challenge. Comput. Speech Lang. 27, 3 (May 2013), 621--633. Google ScholarDigital Library
- Steven Boll. 1979. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Sign. Process. 27, 2 (Apr. 1979), 113--120.Google ScholarCross Ref
- Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, and others. 2005. The AMI meeting corpus: A pre-announcement. In Proceedings of the International Workshop on Machine Learning for Multimodal Interaction. 28--39. Google ScholarDigital Library
- Zhuo Chen, Shinji Watanabe, Hakan Erdoğan, and John R. Hershey. 2015. Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’15). Dresden, Germany, 1--5.Google Scholar
- Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of the Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST’14). 103--111.Google ScholarCross Ref
- Henry Cox, Robert M. Zeskind, and Mark M. Owen. 1987. Robust adaptive beamforming. IEEE Trans. Acoust. Speech Sign. Process. 35, 10 (Oct. 1987), 1365--1376.Google ScholarCross Ref
- Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A. Bharath. 2017. Generative adversarial networks: An overview (submitted for publication).Google Scholar
- George E. Dahl, Dong Yu, Li Deng, and Alex Acero. 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 20, 1 (Jan. 2012), 30--42. Google ScholarDigital Library
- Najim Dehak, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet. 2011. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19, 4 (May 2011), 788--798. Google ScholarDigital Library
- Li Deng. 2011. Front-end, back-end, and hybrid techniques for noise-robust speech recognition. In Robust Speech Recognition of Uncertain or Missing Data. Springer, Berlin, 67--99.Google Scholar
- Yariv Ephraim and David Malah. 1984. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Sign. Process. 32, 6 (Dec. 1984), 1109--1121.Google ScholarCross Ref
- Yariv Ephraim and David Malah. 1985. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Sign. Process. 23, 2 (Apr. 1985), 443--445.Google Scholar
- Hakan Erdogan, Tomoki Hayashi, John R. Hershey, Takaaki Hori, Chiori Hori, Wei-Ning Hsu, Suyoun Kim, Jonathan Le Roux, Zhong Meng, and Shinji Watanabe. 2016. Multi-channel speech recognition: LSTMs all the way through. In Proceedings of the CHiME-4 Workshop.Google Scholar
- Hakan Erdogan, John R. Hershey, Shinji Watanabe, and Jonathan Le Roux. 2015. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’15). 708--712.Google ScholarCross Ref
- Hakan Erdogan, John R. Hershey, Shinji Watanabe, Michael I. Mandel, and Jonathan Le Roux. 2016. Improved MVDR beamforming using single-channel mask prediction networks. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’16). 1981--1985.Google ScholarCross Ref
- Xue Feng, Yaodong Zhang, and James Glass. 2014. Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’14). 1759--1763.Google ScholarCross Ref
- Tian Gao, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2015. Joint training of front-end and back-end deep neural networks for robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’15). 4375--4379.Google ScholarCross Ref
- J.-L. Gauvain and Chin-Hui Lee. 1994. Maximum a posteriori estimation for multivariate gaussian mixture observations of markov chains. IEEE Trans. Speech Aud. Process. 2, 2 (Apr. 1994), 291--298.Google Scholar
- Jürgen Geiger, Jort F. Gemmeke, Björn Schuller, and Gerhard Rigoll. 2014a. Investigating NMF speech enhancement for neural network based acoustic models. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’14). 2405--2409.Google ScholarCross Ref
- Jürgen Geiger, Erik Marchi, Felix Weninger, Björn Schuller, and Gerhard Rigoll. 2014b. The TUM system for the REVERB challenge: Recognition of reverberated speech using multi-channel correlation shaping dereverberation and BLSTM recurrent neural networks. In Proceedings of the REVERB Workshop, Held in Conjunction with ICASSP 2014 and HSCMA 2014. 1--8.Google Scholar
- Jürgen Geiger, Felix Weninger, Jort F. Gemmeke, Martin Wöllmer, Björn Schuller, and Gerhard Rigoll. 2014c. Memory-enhanced neural networks and NMF for robust ASR. IEEE/ACM Trans. Audio Speech Lang. Process. 22, 6 (June 2014), 1037--1046. Google ScholarDigital Library
- Jürgen Geiger, Zixing Zhang, Felix Weninger, Björn Schuller, and Gerhard Rigoll. 2014d. Robust speech recognition using long short-term memory recurrent neural networks for hybrid acoustic modelling. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’14). 631--635.Google ScholarCross Ref
- Ritwik Giri, Michael L. Seltzer, Jasha Droppo, and Dong Yu. 2015. Improving speech recognition in reverberation using a room-aware deep neural network and multi-task learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’15). 5014--5018.Google ScholarCross Ref
- Yifan Gong. 1995. Speech recognition in noisy environments: A survey. Speech Commun. 16, 3 (Apr. 1995), 261--291. Google ScholarDigital Library
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press, Cambridge, MA. Google ScholarDigital Library
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems (NIPS’14). 2672--2680. Google ScholarDigital Library
- E. M. G. Grais, Gerard Roma, Andrew J. R. Simpson, and Mark D. Plumbley. 2016. Combining mask estimates for single channel audio source separation using deep neural networks. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’16). 3339--3343.Google Scholar
- Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv:1308.0850 (Aug. 2013).Google Scholar
- Kun Han, Yuxuan Wang, DeLiang Wang, William S. Woods, Ivo Merks, and Tao Zhang. 2015. Learning spectral mapping for speech dereverberation and denoising. IEEE/ACM Trans. Audio Speech Lang. Process. 23, 6 (Apr. 2015), 982--992.Google Scholar
- John H. L. Hansen and Mark A. Clements. 1991. Constrained iterative speech enhancement with application to speech recognition. IEEE Trans. Sign. Process. 39, 4 (Apr. 1991), 795--805. Google ScholarDigital Library
- John H. L. Hansen and Bryan L. Pellom. 1998. An effective quality evaluation protocol for speech enhancement algorithms. In Proceedings of the International Conference on Spoken Language Processing (ICSLP’98). 2819--2822.Google Scholar
- Jahn Heymann, Lukas Drude, Aleksej Chinaev, and Reinhold Haeb-Umbach. 2015. BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU’15). 444--451.Google ScholarCross Ref
- Jahn Heymann, Lukas Drude, and Reinhold Haeb-Umbach. 2016a. Neural network based spectral mask estimation for acoustic beamforming. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’16). 196--200.Google ScholarDigital Library
- Jahn Heymann, Lukas Drude, and Reinhold Haeb-Umbach. 2016b. Wide residual BLSTM network with discriminative speaker adaptation for robust speech recognition. In Proceedings of the 4th International Workshop on Speech Processing in Everyday Environments (CHiME’16). 12--17.Google Scholar
- Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Sign. Process. Mag. 29, 6 (Nov. 2012), 82--97.Google ScholarCross Ref
- Geoffrey E. Hinton and Ruslan R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science 313, 5786 (July 2006), 504--507.Google ScholarCross Ref
- Hans Günter Hirsch and Harald Finster. 2005. The simulation of realistic acoustic input scenarios for speech recognition systems. In Proceedings of the Conference of the International Speech Communications Association (INTERSPEECH). Lisbon, Portugal, 2697–2700.Google ScholarCross Ref
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neur. Comput. 9, 8 (Nov. 1997), 1735--1780. Google ScholarDigital Library
- Yedid Hoshen, Ron J. Weiss, and Kevin W. Wilson. 2015. Speech acoustic modeling from raw multichannel waveforms. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’15). 4624--4628.Google Scholar
- Yi Hu and Philipos C. Loizou. 2008. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 16, 1 (Jan. 2008), 229--238. Google ScholarDigital Library
- Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis. 2014. Deep learning for monaural speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’14). 1562--1566.Google ScholarCross Ref
- Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis. 2015. Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 23, 12 (Dec. 2015), 2136--2147. Google ScholarDigital Library
- Takaaki Ishii, Hiroki Komiyama, Takahiro Shinozaki, Yasuo Horiuchi, and Shingo Kuroiwa. 2013. Reverberant speech recognition based on denoising autoencoder. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’13). 3512--3516.Google ScholarCross Ref
- Penny Karanasou, Yongqiang Wang, Mark J. F. Gales, and Philip C. Woodland. 2014. Adaptation of deep neural network acoustic models using factorised i-vectors. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’14). 2180--2184.Google Scholar
- Arash Khabbazibasmenj, Sergiy A. Vorobyov, and Aboulnasr Hassanien. 2012. Robust adaptive beamforming based on steering vector estimation with as little as possible prior information. IEEE Trans. Sign. Process. 60, 6 (June 2012), 2974--2987. Google ScholarDigital Library
- Keisuke Kinoshita, Marc Delcroix, Sharon Gannot, Emanuël A. P. Habets, Reinhold Haeb-Umbach, Walter Kellermann, Volker Leutnant, Roland Maas, Tomohiro Nakatani, Bhiksha Raj, and others. 2016. A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research. EURASIP J. Adv. Sign. Process. 2016, 1 (Dec. 2016), 1--19.Google ScholarCross Ref
- Souvik Kundu, Gautam Mantena, Yanmin Qian, Tian Tan, Marc Delcroix, and Khe Chai Sim. 2016. Joint acoustic factor learning for robust deep neural network based automatic speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’16). 5025--5029.Google ScholarDigital Library
- Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neur. Comput. 1, 4 (1989), 541--551. Google ScholarDigital Library
- Daniel D. Lee and H. Sebastian Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 6755 (Oct. 1999), 788--791.Google ScholarCross Ref
- Kang Hyun Lee, Shin Jae Kang, Woo Hyun Kang, and Nam Soo Kim. 2016. Two-stage noise aware training using asymmetric deep denoising autoencoder. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’16). 5765--5769.Google ScholarDigital Library
- Kang Hyun Lee, Woo Hyun Kang, Tae Gyoon Kang, and Nam Soo Kim. 2017. Integrated DNN-based model adaptation technique for noise-robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’17). 5245--5249.Google ScholarCross Ref
- Christopher J. Leggetter and Philip C. Woodland. 1995. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech Lang. 9, 2 (Apr. 1995), 171--185.Google ScholarCross Ref
- Bo Li, Tara N. Sainath, Ron J. Weiss, Kevin W. Wilson, and Michiel Bacchiani. 2016. Neural network adaptive beamforming for robust multichannel speech recognition. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’16). 1976--1980.Google ScholarCross Ref
- Jinyu Li, Li Deng, Yifan Gong, and Reinhold Haeb-Umbach. 2014. An overview of noise-robust automatic speech recognition. IEEE/ACM Trans. Audio Speech. Lang. Proces. 22, 4 (Apr. 2014), 745--777. Google ScholarDigital Library
- Yan Liu, Yang Liu, Shenghua Zhong, and Songtao Wu. 2017. Implicit visual learning: Image recognition via dissipative learning model. ACM Trans. Intell. Syst. Technol. 8, 2 (Jan. 2017), 31:1--31:24. Google ScholarDigital Library
- Yulan Liu, Pengyuan Zhang, and Thomas Hain. 2014. Using neural network front-ends on far field multiple microphones based speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’14). 5542--5546.Google ScholarCross Ref
- Philipos C. Loizou. 2013. Speech Enhancement: Theory and Practice. Taylor Francis, Abingdon, UK. Google ScholarCross Ref
- Xugang Lu, Yu Tsao, Shigeki Matsuda, and Chiori Hori. 2013. Speech enhancement based on deep denoising autoencoder. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’13). 436--440.Google ScholarCross Ref
- Andrew L. Maas, Quoc V. Le, Tyler M. OŃeil, Oriol Vinyals, Patrick Nguyen, and Andrew Y. Ng. 2012. Recurrent neural networks for noise reduction in robust ASR. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’12). 22--25.Google Scholar
- Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. 2016. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Proceedings of the Advances in Neural Information Processing Systems (NIPS’16). 2802--2810. Google ScholarDigital Library
- Claude Marro, Yannick Mahieux, and Klaus Uwe Simmer. 1998. Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering. IEEE Trans. Speech Audio Process. 6, 3 (May 1998), 240--259.Google ScholarCross Ref
- Iain McCowan and Herv’e Bourlard. 2003. Microphone array post-filter based on noise field coherence. IEEE Trans. Speech Audio Process. 11, 6 (Nov. 2003), 709--716.Google ScholarCross Ref
- Zhong Meng, Shinji Watanabe, John R. Hershey, and Hakan Erdogan. 2017. Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’17). 271--275.Google ScholarCross Ref
- Tobias Menne, Jahn Heymann, Anastasios Alexandridis, Kazuki Irie, Albert Zeyer, Markus Kitza, Pavel Golik, Kulikov Ilia, Lukas Durde, Ralf Schlater, Hermann Ney, Reinhold Haeb-Umbach, and Athanasios Mouchtaris. 2016. The RWTH /UPB/FORTH system combination for the 4th CHiME challenge evaluation. In Proceedings of the 4th International Workshop on Speech Processing in Everyday Environments (CHiME’16). 49--51.Google Scholar
- Xavier Mestre and Miguel Angel Lagunas. 2003. On diagonal loading for minimum variance beamformers. In Proceedings of the 3rd IEEE International Symposium on Signal Processing and Information Technology. 459--462.Google Scholar
- Daniel Michelsanti and Zheng-Hua Tan. 2017. Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’17). 2008--2012.Google ScholarCross Ref
- Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara. 2016. Joint optimization of denoising autoencoder and DNN acoustic model based on multi-target learning for noisy speech recognition. In Proceedings of theConference of the International Speech Communication Association (INTERSPEECH’16). 3803--3807.Google ScholarCross Ref
- Seyedmahdad Mirsamadi and John H. L. Hansen. 2015. A study on deep neural network acoustic model adaptation for robust far-field speech recognition. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’15). 2430--2434.Google Scholar
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, and others. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (Feb. 2015), 529--533.Google ScholarCross Ref
- Asunción Moreno, Børge Lindberg, Christoph Draxler, Gaël Richard, Khalid Choukri, Stephan Euler, and Jeffrey Allen. 2000. SPEECHDAT-CAR. A large speech database for automotive environments. In Proceedings of the the 2nd International Conference on Language Resources and Evaluation (LREC’00).Google Scholar
- Arun Narayanan and DeLiang Wang. 2013. Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’13). 7092--7096.Google ScholarCross Ref
- Arun Narayanan and DeLiang Wang. 2014. Joint noise adaptive training for robust automatic speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’14). 2504--2508.Google ScholarCross Ref
- Arun Narayanan and DeLiang Wang. 2015. Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training. IEEE/ACM Trans. Audio Speech Lang. Process. 23, 1 (Jan. 2015), 92--101. Google ScholarDigital Library
- Tsubasa Ochiai, Shinji Watanabe, Takaaki Hori, and John R. Hershey. 2017. Multichannel end-to-end speech recognition. In Proceedings of the the 34th International Conference on Machine Learning (ICML’17). 2632--2641.Google Scholar
- Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv:1609.03499 (Sep. 2016).Google Scholar
- ITU-T Recommendation P.862. 2001. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs.Google Scholar
- Kuldip Paliwal, Kamil Wójcicki, and Benjamin Shannon. 2011. The importance of phase in speech enhancement. Speech Commun. 53, 4 (Apr. 2011), 465--494. Google ScholarDigital Library
- Se Rim Park and Jinwon Lee. 2016. A fully convolutional neural network for speech enhancement. arXiv:1609.07132 (Sep. 2016).Google Scholar
- Santiago Pascual, Antonio Bonafonte, and Joan Serrà. 2017. SEGAN: Speech enhancement generative adversarial network. arXiv:1703.09452 (Mar. 2017).Google Scholar
- David Pearce and Hans-Günter Hirsch. 2000. The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’00). 29--32.Google Scholar
- David Pearce and J. Picone. 2002. Aurora Working Group: DSR Front End LVCSR Evaluation AU/384/02. Institute for Signal & Information Processing, Mississippi State University, Tech. Rep (2002).Google Scholar
- Pasi Pertilä and Joonas Nikunen. 2014. Microphone array post-filtering using supervised machine learning for speech enhancement. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’14). 2675--2679.Google ScholarCross Ref
- Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, Dinei Florêncio, and Mark Hasegawa-Johnson. 2017. Speech enhancement using Bayesian WaveNet. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’17). 2013--2017.Google ScholarCross Ref
- Yanmin Qian, Mengxiao Bi, Tian Tan, and Kai Yu. 2016. Very deep convolutional neural networks for noise robust speech recognition. IEEE/ACM Trans. Audio Speech. Lang. Process. 24, 12 (Dec. 2016), 2263--2276. Google ScholarDigital Library
- Yanmin Qian and Tian Tan. 2016. The SJTU CHiME-4 system: Acoustic noise robustness for real single or multiple microphone scenarios. In Proceedings of the CHiME-4 Workshop.Google Scholar
- Schuyler R. Quackenbush, Thomas Pinkney Barnwell, and Mark A. Clements. 1988. Objective Measures of Speech Quality. Prentice-Hall, Upper Saddle River, NJ.Google Scholar
- Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, and Yoshua Bengio. 2017. A network of deep neural networks for distant speech recognition. In Proceedings of the IEEE International Conference on Audio, Speech, and Signal Processing (ICASSP’17). 4880--4884.Google ScholarCross Ref
- Dario Rethage, Jordi Pons, and Xavier Serra. 2017. A Wavenet for speech denoising. arXiv:1706.07162 (June 2017).Google Scholar
- Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, and others. 2015. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 3 (Dec. 2015), 211--252. Google ScholarDigital Library
- Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak. 2015. Convolutional, long short-term memory, fully connected deep neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’15). 4580--4584.Google ScholarCross Ref
- Tara N. Sainath, Ron J. Weiss, Kevin W. Wilson, Bo Li, Arun Narayanan, Ehsan Variani, Michiel Bacchiani, Izhak Shafran, Andrew W. Senior, Kean K. Chin, Ananya Misra, and Chanwoo Kim. 2017. Multichannel signal processing with deep neural networks for automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25, 5 (May 2017), 965--979. Google ScholarDigital Library
- George Saon, Tom Sercu, Steven Rennie, and Hong-Kwang J. Kuo. 2016. The IBM 2016 english conversational telephone speech recognition system. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’16). 7--11.Google Scholar
- Johan Schalkwyk, Doug Beeferman, Françoise Beaufays, Bill Byrne, Ciprian Chelba, Mike Cohen, Maryam Kamvar, and Brian Strope. 2010. “Your word is my command”: Google search by voice: A case study. In Advances in Speech Recognition. Springer, 61--90.Google Scholar
- Markus Schedl, Yi-Hsuan Yang, and Perfecto Herrera-Boyer. 2016. Introduction to intelligent music systems and applications. ACM Trans. Intell. Syst. Technol. 8, 2 (Oct. 2016), 17:1--17:8. Google ScholarDigital Library
- Björn Schuller, Felix Weninger, Martin Wöllmer, Yang Sun, and Gerhard Rigoll. 2010. Non-negative matrix factorization as noise-robust feature extractor for speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’10). 4562--4565.Google ScholarCross Ref
- Michael L. Seltzer, Dong Yu, and Yongqiang Wang. 2013. An investigation of deep neural networks for noise robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’13). 7398--7402.Google ScholarCross Ref
- Sangita Sharma, Dan Ellis, Sachin S. Kajarekar, Pratibha Jain, and Hynek Hermansky. 2000. Feature extraction using non-linear transformation for robust speech recognition on the aurora database. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’00). 1117--1120. Google ScholarDigital Library
- Soundararajan Srinivasan, Nicoleta Roman, and DeLiang Wang. 2006. Binary and ratio time-frequency masks for robust speech recognition. Speech Commun. 48, 11 (Nov. 2006), 1486--1501.Google ScholarCross Ref
- Pawel Swietojanski, Arnab Ghoshal, and Steve Renals. 2013. Hybrid acoustic models for distant and multichannel large vocabulary speech recognition. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU’13). 285--290.Google ScholarCross Ref
- Pawel Swietojanski, Arnab Ghoshal, and Steve Renals. 2014. Convolutional neural networks for distant speech recognition. IEEE Sign. Process. Lett. 21, 9 (Sep. 2014), 1120--1124.Google ScholarCross Ref
- George Trigeorgis, Fabien Ringeval, Raymond Bruckner, Erik Marchi, Mihalis Nicolaou, Björn Schuller, and Stefanos Zafeiriou. 2016. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’16). Shanghai, China, 5200--5204.Google ScholarDigital Library
- Barry D. Van Veen and Kevin M. Buckley. 1988. Beamforming: A versatile approach to spatial filtering. IEEE ASSP Mag. 5, 2 (Apr. 1988), 4--24.Google ScholarCross Ref
- Emmanuel Vincent, Jon Barker, Shinji Watanabe, Jonathan Le Roux, Francesco Nesta, and Marco Matassoni. 2013. The second ‘CHiME’ speech separation and recognition challenge: Datasets, tasks and baselines. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’13). 126--130.Google ScholarCross Ref
- Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte. 2006. Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 14, 4 (July 2006), 1462--1469. Google ScholarDigital Library
- Emmanuel Vincent, Shinji Watanabe, Aditya Arie Nugraha, Jon Barker, and Ricard Marxer. 2016. An analysis of environment, microphone and data simulation mismatches in robust speech recognition. (submitted for publication).Google Scholar
- Tuomas Virtanen, Rita Singh, and Bhiksha Raj. 2012. Techniques for Noise Robustness in Automatic Speech Recognition. John Wiley & Sons, Hoboken, NJ. Google Scholar
- DeLiang Wang. 2005. On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis. Springer US, Boston, MA, 181--197.Google Scholar
- Yuxuan Wang, Arun Narayanan, and DeLiang Wang. 2014. On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 22, 12 (Dec. 2014), 1849--1858. Google ScholarDigital Library
- Yuxuan Wang and DeLiang Wang. 2013. Towards scaling up classification-based speech separation. IEEE Trans. Audio Speech Lang. Process. 21, 7 (July 2013), 1381--1390. Google ScholarDigital Library
- Yuxuan Wang and DeLiang Wang. 2015. A deep neural network for time-domain signal reconstruction. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’15). 4390--4394.Google ScholarCross Ref
- Zhong-Qiu Wang and DeLiang Wang. 2016. A joint training framework for robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 24, 4 (Apr. 2016), 796--806. Google ScholarDigital Library
- Ernst Warsitz and Reinhold Haeb-Umbach. 2007. Blind acoustic beamforming based on generalized eigenvalue decomposition. IEEE Trans. Audio Speech Lang. Process. 15, 5 (July 2007), 1529--1539. Google ScholarDigital Library
- Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R. Hershey, and Björn Schuller. 2015. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation. 91--99. Google ScholarDigital Library
- Felix Weninger, Florian Eyben, and Björn Schuller. 2014. Single-channel speech separation with memory-enhanced recurrent neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’14). 3709--3713.Google ScholarCross Ref
- Felix Weninger, Jordi Feliu, and Björn Schuller. 2012. Supervised and semi-supervised suppression of background music in monaural speech recordings. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’12). 61--64.Google ScholarCross Ref
- Felix Weninger, Jürgen Geiger, Martin Wöllmer, Björn Schuller, and Gerhard Rigoll. 2013. The Munich feature enhancement approach to the 2nd CHiME challenge using BLSTM recurrent neural networks. In Proceedings of the 2nd CHiME Workshop on Machine Listening in Multisource Environments. 86--90.
- Felix Weninger, Jürgen Geiger, Martin Wöllmer, Björn Schuller, and Gerhard Rigoll. 2014a. Feature enhancement by deep LSTM networks for ASR in reverberant multisource environments. Comput. Speech Lang. 28, 4 (July 2014), 888--902.
- Felix Weninger, John R. Hershey, Jonathan Le Roux, and Björn W. Schuller. 2014b. Discriminatively trained recurrent neural networks for single-channel speech separation. In Proceedings of the IEEE Global Conference on Signal and Information Processing (GlobalSIP’14). 577--581.
- Felix Weninger, Shinji Watanabe, Jonathan Le Roux, John R. Hershey, Yuuki Tachioka, Jürgen Geiger, Björn Schuller, and Gerhard Rigoll. 2014c. The MERL/MELCO/TUM system for the REVERB challenge using deep recurrent neural network feature enhancement. In Proceedings of the REVERB Workshop, Held in Conjunction with ICASSP 2014 and HSCMA 2014. 1--8.
- Donald S. Williamson and DeLiang Wang. 2017a. Speech dereverberation and denoising using complex ratio masks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’17). 5590--5594.
- Donald S. Williamson and DeLiang Wang. 2017b. Time-frequency masking in the complex domain for speech dereverberation and denoising. IEEE/ACM Trans. Audio Speech Lang. Process. 25, 7 (July 2017), 1492--1501.
- Martin Wöllmer, Florian Eyben, Alex Graves, Björn Schuller, and Gerhard Rigoll. 2010a. Improving keyword spotting with a tandem BLSTM-DBN architecture. In Proceedings of the Advances in Non-Linear Speech Processing: International Conference on Nonlinear Speech Processing (NOLISP’10). 68--75.
- Martin Wöllmer, Florian Eyben, Björn W. Schuller, Yang Sun, Tobias Moosmayr, and Nhu Nguyen-Thien. 2009. Robust in-car spelling recognition—A tandem BLSTM-HMM approach. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’09). 2507--2510.
- Martin Wöllmer, Björn Schuller, Florian Eyben, and Gerhard Rigoll. 2010b. Combining long short-term memory and dynamic Bayesian networks for incremental emotion-sensitive artificial listening. IEEE J. Select. Top. Sign. Process. 4, 5 (Oct. 2010), 867--881.
- Martin Wöllmer, Zixing Zhang, Felix Weninger, Björn Schuller, and Gerhard Rigoll. 2013. Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’13). 6822--6826.
- Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, and others. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144 (Oct. 2016).
- Bingyin Xia and Changchun Bao. 2013. Speech enhancement with weighted denoising auto-encoder. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’13). 3444--3448.
- Xiong Xiao, Shinji Watanabe, Hakan Erdogan, Liang Lu, John Hershey, Michael L. Seltzer, Guoguo Chen, Yu Zhang, Michael Mandel, and Dong Yu. 2016a. Deep beamforming networks for multi-channel speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’16). 5745--5749.
- Xiong Xiao, Chenglin Xu, Zhaofeng Zhang, Shengkui Zhao, Sining Sun, and Shinji Watanabe. 2016b. A study of learning based beamforming methods for speech recognition. In Proceedings of the CHiME Workshop. 26--31.
- Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. 2016. Achieving Human Parity in Conversational Speech Recognition. Technical Report MSR-TR-2016-71. Microsoft Research.
- Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2014a. Dynamic noise aware training for speech enhancement based on deep neural networks. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’14). 2670--2674.
- Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2014b. An experimental study on speech enhancement based on deep neural networks. IEEE Sign. Process. Lett. 21, 1 (Jan. 2014), 65--68.
- Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2014c. NMF-based target source separation using deep neural network. IEEE Sign. Process. Lett. 21, 1 (Jan. 2014), 65--68.
- Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2015. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 23, 1 (Jan. 2015), 7--19.
- Yi-Hsuan Yang and Homer H. Chen. 2012. Machine recognition of music emotion: A review. ACM Trans. Intell. Syst. Technol. 3, 3 (May 2012), 40:1--40:30.
- Takuya Yoshioka, Armin Sehr, Marc Delcroix, Keisuke Kinoshita, Roland Maas, Tomohiro Nakatani, and Walter Kellermann. 2012. Making machines understand us in reverberant rooms: Robustness against reverberation for automatic speech recognition. IEEE Sign. Process. Mag. 29, 6 (Nov. 2012), 114--126.
- Chengzhu Yu, Atsunori Ogawa, Marc Delcroix, Takuya Yoshioka, Tomohiro Nakatani, and John H. L. Hansen. 2015. Robust i-vector extraction for neural network adaptation in noisy environment. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’15). 2854--2857.
- Sergey Zagoruyko and Nikos Komodakis. 2016. Wide residual networks. arXiv:1605.07146 (May 2016).
- Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV’14). 818--833.
- Zixing Zhang, Nicholas Cummins, and Björn Schuller. 2017. Advanced data exploitation for speech analysis—An overview. IEEE Sign. Process. Mag. 34 (July 2017). 24 pages.
- Zixing Zhang, Joel Pinto, Christian Plahl, Björn Schuller, and Daniel Willett. 2014. Channel mapping using bidirectional long short-term memory for dereverberation in hands-free voice controlled devices. IEEE Trans. Cons. Electron. 60, 3 (Aug. 2014), 525--533.
- Zixing Zhang, Fabien Ringeval, Jing Han, Jun Deng, Erik Marchi, and Björn Schuller. 2016. Facing realism in spontaneous emotion recognition from speech: Feature enhancement by autoencoder with LSTM neural networks. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’16).