Lower Frame Rate Neural Network Acoustic Models

Pundak, Golan; Sainath, Tara N.

doi:10.21437/Interspeech.2016-275

Neural network acoustic models trained with Connectionist Temporal Classification (CTC) were recently proposed as an alternative to conventional cross-entropy trained neural network acoustic models, which output frame-level decisions every 10 ms [1]. Unlike conventional models, CTC learns an alignment jointly with the acoustic model and outputs a blank symbol in addition to the regular acoustic state units. This allows the CTC model to run at a lower frame rate, outputting decisions every 30 ms rather than every 10 ms, which improves overall system speed. In this work, we explore how conventional models behave at lower frame rates. On a large-vocabulary Voice Search task, we show that with conventional models we can lower the frame rate to one decision every 40 ms while improving WER by 3% relative over a CTC-based model.
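The abstract does not say how the frame rate is lowered. One common approach, assumed here purely for illustration, is to stack several consecutive 10 ms feature frames into one wider vector and keep only every k-th stacked frame, so the model emits one decision per k input frames. The sketch below shows that arithmetic in plain Python; the function names `stack_and_subsample` and `output_period_ms` and the toy feature values are hypothetical and are not taken from the paper.

```python
def stack_and_subsample(frames, stack=4, step=4):
    """Concatenate `stack` consecutive frames, advancing `step` frames each time.

    `frames` is a list of equal-length feature vectors (lists of floats).
    This is an assumed illustration, not the paper's stated method.
    """
    if stack < 1 or step < 1:
        raise ValueError("stack and step must be positive")
    out = []
    # Only start positions with a full window of `stack` frames are used.
    for start in range(0, len(frames) - stack + 1, step):
        merged = []
        for frame in frames[start:start + stack]:
            merged.extend(frame)
        out.append(merged)
    return out


def output_period_ms(input_period_ms, step):
    """Time between model decisions after keeping every `step`-th frame."""
    return input_period_ms * step


# Toy input: 12 frames of 2 features each, one frame per 10 ms.
frames = [[float(i), float(i) + 0.5] for i in range(12)]
low_rate = stack_and_subsample(frames, stack=4, step=4)
period = output_period_ms(10, 4)
```

With `step=4` the 10 ms input frames yield one decision per 40 ms, matching the lowest rate mentioned in the abstract; `step=3` would give the 30 ms rate attributed to the CTC model.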

Cite as: Pundak, G., Sainath, T.N. (2016) Lower Frame Rate Neural Network Acoustic Models. Proc. Interspeech 2016, 22-26, doi: 10.21437/Interspeech.2016-275

@inproceedings{pundak16_interspeech,
  author={Golan Pundak and Tara N. Sainath},
  title={{Lower Frame Rate Neural Network Acoustic Models}},
  year=2016,
  booktitle={Proc. Interspeech 2016},
  pages={22--26},
  doi={10.21437/Interspeech.2016-275}
}