Learning the speech front-end with raw waveform CLDNNs

Sainath, Tara N.; Weiss, Ron J.; Senior, Andrew; Wilson, Kevin W.; Vinyals, Oriol

doi:10.21437/Interspeech.2015-1

Learning an acoustic model directly from the raw waveform has been an active area of research. However, waveform-based models have not yet matched the performance of neural networks trained on log-mel features. We show that raw waveform features match the performance of log-mel filterbank energies when used with a state-of-the-art CLDNN acoustic model trained on over 2,000 hours of speech. Specifically, we show the benefit of each part of the CLDNN: the time convolution layer reduces temporal variations, the frequency convolution layer preserves locality and reduces frequency variations, and the LSTM layers provide temporal modeling. In addition, by stacking raw waveform features with log-mel features, we achieve a 3% relative reduction in word error rate.
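To make the role of the time convolution layer concrete, the following is a minimal sketch of a raw-waveform front-end for a single frame. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify filter sizes, pooling, or nonlinearities, so the pooling over the whole frame, the rectification, the log compression, the `floor` constant, and the function name `time_conv_frontend` are all hypothetical choices. In a trained model the filter taps would be learned jointly with the rest of the network; here they are random. The frequency convolution and LSTM layers are not shown.

```python
import math
import random


def time_conv_frontend(frame, filters, floor=0.01):
    """Map one raw-waveform frame to one feature per filter.

    frame:   list of M waveform samples
    filters: list of P filters, each a list of N taps with N <= M
    floor:   small constant added before the log (assumed value)
    """
    features = []
    for taps in filters:
        n = len(taps)
        # Slide the filter over the frame: one dot product per time shift.
        outputs = [
            sum(taps[k] * frame[start + k] for k in range(n))
            for start in range(len(frame) - n + 1)
        ]
        # Pool over all time shifts so that small shifts of the input
        # within the frame do not change the feature (assumed design).
        pooled = max(outputs)
        # Rectify and log-compress; the floor avoids log(0).
        features.append(math.log(max(pooled, 0.0) + floor))
    return features


# Illustrative sizes only: a 64-sample frame and 4 filters of 16 taps.
rng = random.Random(0)
frame = [rng.uniform(-1.0, 1.0) for _ in range(64)]
filters = [[rng.gauss(0.0, 0.1) for _ in range(16)] for _ in range(4)]
feats = time_conv_frontend(frame, filters)
```

Under these assumptions the output has one value per filter, so it can stand in for (or be stacked with) a log-mel feature vector of matching dimension before the later layers. Pooling over time is what discards fine time shifts in this sketch.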

Cite as: Sainath, T.N., Weiss, R.J., Senior, A., Wilson, K.W., Vinyals, O. (2015) Learning the speech front-end with raw waveform CLDNNs. Proc. Interspeech 2015, 1-5, doi: 10.21437/Interspeech.2015-1

@inproceedings{sainath15_interspeech,
  author={Tara N. Sainath and Ron J. Weiss and Andrew Senior and Kevin W. Wilson and Oriol Vinyals},
  title={{Learning the speech front-end with raw waveform CLDNNs}},
  year=2015,
  booktitle={Proc. Interspeech 2015},
  pages={1--5},
  doi={10.21437/Interspeech.2015-1}
}