
Neural Networks

Volume 64, April 2015, Pages 39-48

2015 Special Issue
Deep Convolutional Neural Networks for Large-scale Speech Tasks

https://doi.org/10.1016/j.neunet.2014.08.005

Abstract

Convolutional Neural Networks (CNNs) are an alternative type of neural network that can be used to reduce spectral variations and model spectral correlations which exist in signals. Since speech signals exhibit both of these properties, we hypothesize that CNNs are a more effective model for speech compared to Deep Neural Networks (DNNs). In this paper, we explore applying CNNs to large vocabulary continuous speech recognition (LVCSR) tasks. First, we determine the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks. Specifically, we focus on how many convolutional layers are needed, what an appropriate number of hidden units is, and what the best pooling strategy is. Second, we investigate how to incorporate speaker-adapted features, which cannot directly be modeled by CNNs as they do not obey locality in frequency, into the CNN framework. Third, given the importance of sequence training for speech tasks, we introduce a strategy to use ReLU+dropout during Hessian-free sequence training of CNNs. Experiments on 3 LVCSR tasks indicate that a CNN with the proposed speaker-adapted and ReLU+dropout ideas allows for a 12%–14% relative improvement in WER over a strong DNN system, achieving state-of-the-art results on these 3 tasks.

Introduction

Recently, Deep Neural Networks (DNNs) have achieved tremendous success in acoustic modeling for large vocabulary continuous speech recognition (LVCSR) tasks, showing significant gains over state-of-the-art Gaussian Mixture Model/Hidden Markov Model (GMM/HMM) systems on a wide variety of small and large vocabulary tasks (Dahl et al., 2012, Hinton, Deng, et al., 2012, Jaitly et al., 2012, Kingsbury et al., 2012, Seide et al., 2011). Convolutional Neural Networks (CNNs) (LeCun and Bengio, 1995, Lecun et al., 1998) are an alternative type of neural network that can be used to model spatial and temporal correlation, while reducing translational variance in signals.

CNNs are attractive compared to fully-connected DNNs for a variety of reasons. First, DNNs ignore input topology, as the input can be presented in any (fixed) order without affecting the performance of the network (LeCun & Bengio, 1995). However, spectral representations of speech have strong correlations in time and frequency, and modeling local correlations with CNNs, through weights which are shared across local regions of the input space, has been shown to be beneficial in other fields (LeCun, Huang, & Bottou, 2004). Second, DNNs are not explicitly designed to model translational variance within speech signals, which can exist due to different speaking styles (LeCun & Bengio, 1995). More specifically, different speaking styles lead to formants being shifted in the frequency domain, as well as variations in phoneme durations. These speaking styles require us to apply various speaker adaptation techniques to reduce feature variation. While DNNs of sufficient size could indeed capture translational invariance, this requires large networks with lots of training examples. CNNs on the other hand capture translational invariance with far fewer parameters by averaging the outputs of hidden units in different local time and frequency regions.
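
To make the shared-weight and pooling idea concrete, the sketch below (Python/PyTorch, with illustrative layer sizes rather than the exact configuration used in this paper) applies a small time-frequency convolution to a log-mel input patch and then max-pools along frequency only, so that a formant shifted by a bin or two still activates the same pooled unit.

    # Illustrative sketch only: a time-frequency convolution with
    # max-pooling along the frequency axis, one way a CNN absorbs small
    # formant shifts with far fewer parameters than a fully connected net.
    import torch
    import torch.nn as nn

    conv_block = nn.Sequential(
        # input: (batch, 1 channel, 40 mel bins, 11 context frames)
        nn.Conv2d(in_channels=1, out_channels=64, kernel_size=(9, 9), padding=4),
        nn.Sigmoid(),
        # pool only along frequency, so nearby frequency shifts collapse
        # onto the same pooled output
        nn.MaxPool2d(kernel_size=(3, 1)),
    )

    x = torch.randn(8, 1, 40, 11)   # a batch of log-mel input patches
    print(conv_block(x).shape)      # torch.Size([8, 64, 13, 11])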

In fact, CNNs have been heavily explored in the image recognition and computer vision fields, offering improvements over DNNs on many tasks (Lawrence, 1997, LeCun et al., 2004). Recently, CNNs have also been explored for speech recognition (Abdel-Hamid, Mohamed, Jiang, & Penn, 2012), again showing improvements over DNNs, though on small vocabulary tasks with shallow networks. Specifically, Abdel-Hamid et al. (2012) introduced a novel framework to model spectral correlations where convolutional weights were shared over limited frequency regions, a technique known as limited weight sharing (LWS). One of the limitations of this LWS approach was that the network was limited to one convolutional layer, unlike most CNN work which uses multiple convolutional layers (LeCun et al., 2004). In this paper, we explore a spatial modeling approach similar to work done in the image recognition community, where convolutional weights are fully shared across all time and frequency components. This modeling approach, known as full weight sharing (FWS), allows for multiple convolutional layers and encourages deeper networks.
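
The difference between the two sharing schemes can be sketched as follows (Python/PyTorch; band sizes and filter shapes are hypothetical, chosen only to make the contrast visible): FWS is an ordinary convolution whose weights are reused at every frequency position, while LWS applies a separate filter set to each frequency band.

    # Hedged sketch contrasting the two weight-sharing schemes.
    # FWS: one Conv2d whose weights are shared over the whole
    #      time-frequency plane.
    # LWS: a separate small convolution per frequency band (hypothetical
    #      band layout); weights are shared only within each band.
    import torch
    import torch.nn as nn

    class LimitedWeightSharingConv(nn.Module):
        def __init__(self, bands=4, band_height=10, out_channels=16):
            super().__init__()
            self.band_height = band_height
            self.convs = nn.ModuleList(
                nn.Conv2d(1, out_channels, kernel_size=(band_height, 5),
                          padding=(0, 2))
                for _ in range(bands)
            )

        def forward(self, x):                    # x: (batch, 1, freq, time)
            bands = x.split(self.band_height, dim=2)
            return torch.cat([c(b) for c, b in zip(self.convs, bands)], dim=2)

    fws = nn.Conv2d(1, 16, kernel_size=(10, 5), padding=(0, 2))  # full sharing
    lws = LimitedWeightSharingConv()
    x = torch.randn(2, 1, 40, 11)
    print(fws(x).shape, lws(x).shape)   # (2, 16, 31, 11) and (2, 16, 4, 11)

Because FWS filters are position independent, they can be stacked into multiple convolutional layers exactly as in image models, which is what enables the deeper networks explored here.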

The first part of this paper explores the appropriate architecture for CNNs on LVCSR tasks. Specifically, we investigate how many convolutional vs. fully connected layers are needed, the filter size per convolutional layer, an appropriate number of hidden units per layer, and a good pooling strategy. In addition, we compare the LWS strategy proposed in Abdel-Hamid et al. (2012) to our FWS strategy.

The second part of this paper explores the best type of input feature to be used with CNNs. Various speaker adaptation techniques have been shown to improve the performance of speech recognition systems. In this paper, we focus on how to incorporate feature-space maximum likelihood linear regression (fMLLR) (Gales, 1998) and identity vectors (i-vectors) (Saon, Soltau, Picheny, & Nahamoo, 2013), which do not exhibit locality in frequency, into the CNN framework through a joint CNN/DNN architecture (Sainath, Kingsbury, Mohamed, Dahl, Saon, Soltau, Beran, Aravkin, & Ramabhadran, 2013).
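
A minimal sketch of such a joint architecture is shown below (Python/PyTorch; layer sizes, feature dimensions and the single merging layer are illustrative assumptions, not the exact network used in this paper): the log-mel input passes through the convolutional stream, the features lacking frequency locality enter a fully connected stream, and the two streams are concatenated before the remaining layers.

    # Illustrative joint CNN/DNN sketch. Locality-preserving log-mel
    # features go through the convolutional stream; fMLLR/i-vector style
    # features, which have no frequency locality, go through a parallel
    # fully connected stream. The two are merged before the output layer.
    import torch
    import torch.nn as nn

    class JointCNNDNN(nn.Module):
        def __init__(self, aux_dim=540, n_targets=512):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 64, kernel_size=(9, 9), padding=4), nn.Sigmoid(),
                nn.MaxPool2d((3, 1)),
                nn.Flatten(),                    # -> 64 * 13 * 11 = 9152
            )
            self.dnn = nn.Sequential(nn.Linear(aux_dim, 512), nn.Sigmoid())
            self.merge = nn.Sequential(
                nn.Linear(9152 + 512, 1024), nn.Sigmoid(),
                nn.Linear(1024, n_targets),
            )

        def forward(self, logmel, aux):
            # logmel: (batch, 1, 40 mel bins, 11 frames); aux: (batch, aux_dim)
            return self.merge(torch.cat([self.cnn(logmel), self.dnn(aux)], dim=1))

    net = JointCNNDNN()
    out = net(torch.randn(4, 1, 40, 11), torch.randn(4, 540))
    print(out.shape)                             # torch.Size([4, 512])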

Finally, we investigate the role of rectified linear units (ReLU) and dropout (Hinton, Srivastava, Krizhevsky, Sutskever, & Salakhutdinov, 2012) for Hessian-free (HF) sequence training (Kingsbury et al., 2012) of CNNs. In Dahl, Sainath, and Hinton (2013), ReLU+dropout was shown to give good performance for cross-entropy (CE) trained DNNs but was not employed during HF sequence training. However, sequence training is critical for speech recognition performance, providing an additional relative gain of 10%–15% over a CE-trained DNN (Kingsbury et al., 2012). During CE training, the dropout mask changes for each utterance. However, during HF training, we are not guaranteed to get conjugate directions if the dropout mask changes for each utterance. Therefore, in order to make dropout usable during HF, we keep the dropout mask fixed per utterance for all iterations of conjugate gradient (CG) within a single HF iteration.
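
The sketch below illustrates that idea (Python/PyTorch; the two-layer network, the 0.5 dropout rate and the helper names are hypothetical): a dropout mask is drawn once per utterance for a given HF iteration and then reused, unchanged, for every CG step inside that iteration.

    # Hedged sketch of dropout with a per-utterance mask that is held
    # fixed across all CG steps of one HF iteration.
    import torch

    def sample_masks(shapes, p=0.5, seed=0):
        # one mask per layer, drawn once; inverted-dropout scaling by 1/(1-p)
        g = torch.Generator().manual_seed(seed)
        return [(torch.rand(s, generator=g) > p).float() / (1.0 - p) for s in shapes]

    def forward(h, weights, masks):
        for W, m in zip(weights, masks):
            h = torch.sigmoid(h @ W) * m     # apply the fixed dropout mask
        return h

    weights = [torch.randn(40, 256), torch.randn(256, 256)]
    feats = torch.randn(100, 40)             # frames of a single utterance

    # One HF iteration: the masks for this utterance are sampled once ...
    masks = sample_masks(shapes=[(256,), (256,)], seed=42)
    for cg_step in range(5):
        # ... and reused for every CG step, so the objective seen by CG
        # does not change between steps and the directions stay conjugate.
        out = forward(feats, weights, masks)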

After analyzing the best CNN architecture, input feature set and ReLU, we then explore using CNNs on a 50 hr English Broadcast News (BN) task (Kingsbury, 2009). Naturally, our best DNN system offers a 13% relative improvement over the GMM/HMM, consistent with gains observed in the literature with DNNs vs. GMM/HMMs (Kingsbury et al., 2012). Comparing DNNs to CNNs, we find that a CNN hybrid system offers a 3% relative improvement over the hybrid DNN, whereas the joint CNN/DNN system which incorporates speaker adaptation and ReLU+dropout offers a 14% improvement. Finally, we explore the behavior of the joint CNN/DNN and ReLU+dropout on two larger scale tasks, namely a 300 hr Switchboard (SWB) task and a 400 hr BN task. We find that using the CNN with these improvements, we can obtain a 12% relative improvement over the DNN on SWB and a 16% relative improvement over the DNN on 400 hr BN.

The rest of this paper is organized as follows. The basic CNN architecture used in this paper is described in Section 2. An exploration of various weight-sharing and pooling strategies is discussed in Section 3, while input feature analysis is discussed in Section 4. Using ReLU+dropout for HF sequence training is discussed in Section 5. Results on three LVCSR tasks are presented in Section 6. Finally, Section 7 concludes the paper and discusses future work.

Section snippets

Basic CNN architecture

In this section, we describe the basic CNN architecture and experimental setup used in this paper.

Convolutional vs. fully connected layers

In this section, we analyze the best approach for combining convolutional and fully connected layers for speech recognition tasks.

Analysis of input features

Convolutional neural networks require features which are locally correlated in time and frequency. This implies that Linear Discriminant Analysis (LDA) features, which are very commonly used in speech, cannot be used with CNNs as they remove locality in frequency (Abdel-Hamid et al., 2012). Mel filter-bank (FB) features are one type of speech feature which exhibit this locality property (Mohamed, Hinton, & Penn, 2012).
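
The locality argument can be seen directly from the shape of the mel filters. The sketch below (Python; it assumes librosa is available, and the random "LDA-like" matrix is just a stand-in for a dense linear transform) shows that each mel filter weights only a narrow band of FFT bins, whereas every output of a dense projection such as LDA mixes all frequency bins.

    # Sketch: mel filter-bank outputs are local in frequency,
    # a dense (LDA-like) projection is not.
    import numpy as np
    import librosa

    mel_fb = librosa.filters.mel(sr=16000, n_fft=512, n_mels=40)  # (40, 257)
    power_spec = np.abs(np.random.randn(257, 100)) ** 2           # fake frames
    log_mel = np.log(mel_fb @ power_spec + 1e-10)                 # (40, 100)

    # each mel filter is non-zero over only a handful of FFT bins
    print([int((row > 0).sum()) for row in mel_fb[:5]])

    # a dense projection (stand-in for LDA) touches every bin in every output
    lda_like = np.random.randn(40, 257)
    print(int((np.abs(lda_like[0]) > 0).sum()))                   # 257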

Typically, various speaker adaptation techniques are applied to speech

Sequence training with rectified linear units and dropout

At IBM, two stages of neural network training are performed. First, DNNs are trained with a frame-discriminative stochastic gradient descent (SGD) cross-entropy (CE) criterion. Second, CE-trained DNN weights are re-adjusted using a sequence-level objective function (Kingsbury, 2009). Specifically, using the cross-entropy trained DNN, sequence information is saved out in the form of a numerator lattice, representing the correct set of words, and a denominator lattice, representing competing

Results on LVCSR tasks

In this section, we compare CNN performance to two state-of-the-art techniques used for LVCSR tasks, namely DNNs and GMM/HMMs. We report CNN performance with the architecture described in Section 3, and also break down the improvements due to speaker adaptation and ReLU, described in Sections 4 and 5, respectively.

The GMM system is trained using our standard recipe (Soltau et al., 2010), which is briefly

Conclusions and future work

In this paper, we explored how to make CNNs a more powerful model for speech tasks compared to DNNs. Specifically, we experimentally determined an appropriate number of convolutional layers, hidden units, filter size and pooling strategy for CNNs. In addition, we introduced a joint CNN/DNN architecture to allow speaker-adapted features to be used in this framework. Finally, we investigated a strategy to make dropout effective during HF sequence training. Experiments on 3 LVCSR tasks, namely a 50

References (34)

  • Gales, M. (1998). Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech and Language.
  • Abdel-Hamid, O., Mohamed, A., Jiang, H., & Penn, G. (2012). Applying convolutional neural network concepts to hybrid...
  • Bourlard, H., et al. (1993). Connectionist speech recognition: a hybrid approach.
  • Dahl, G., Sainath, T., & Hinton, G. (2013). Improving deep neural networks for LVCSR using rectified linear units and...
  • Dahl, G., et al. (2012). Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing.
  • Deng, L., Abdel-Hamid, O., & Yu, D. (2013). A deep convolutional neural network using heterogeneous pooling for trading...
  • Gales, M. (1999). Semi-tied covariance matrices for hidden Markov models. IEEE Transactions on Speech and Audio Processing.
  • Glorot, X., & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proc. AI...
  • Hinton, G., et al. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine.
  • Hinton, G., et al. (2012). Improving neural networks by preventing co-adaptation of feature detectors. The Computing Research Repository (CoRR) 1207.
  • Jaitly, N., Nguyen, P., Senior, A. W., & Vanhoucke, V. (2012). Application of pretrained deep neural networks to large...
  • Kingsbury, B. (2009). Lattice-based optimization of sequence classification criteria for neural-network acoustic...
  • Kingsbury, B., Sainath, T. N., & Soltau, H. (2012). Scalable minimum Bayes risk training of deep neural network...
  • Krizhevsky, A., et al. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems.
  • Lawrence, S. (1997). Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks.
  • LeCun, Y., et al. (1995). Convolutional networks for images, speech, and time-series.
  • LeCun, Y., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.