ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition

Radfar, Martin; Barnwal, Rohit; Swaminathan, Rupak Vignesh; Chang, Feng-Ju; Strimel, Grant P.; Susanj, Nathan; Mouchtaris, Athanasios

doi:10.21437/Interspeech.2022-10844

The recurrent neural network transducer (RNN-T) is a prominent streaming end-to-end (E2E) ASR technology. In RNN-T, the acoustic encoder commonly consists of stacks of LSTMs. Very recently, the Conformer architecture was introduced as an alternative to LSTM layers: the RNN-T encoder is replaced with a modified Transformer encoder that has convolutional layers at the frontend and between attention layers. In this paper, we introduce a new streaming ASR model, Convolutional Augmented Recurrent Neural Network Transducers (ConvRNN-T), in which we augment the LSTM-based RNN-T with a novel convolutional frontend consisting of local and global context CNN encoders. ConvRNN-T uses causal 1-D convolutional layers, squeeze-and-excitation, dilation, and residual blocks to provide both global and local audio context representations to the LSTM layers. We show that ConvRNN-T outperforms RNN-T, Conformer, and ContextNet on LibriSpeech and in-house data. In addition, ConvRNN-T has lower computational complexity than Conformer. ConvRNN-T's superior accuracy, along with its low footprint, makes it a promising candidate for on-device streaming ASR technologies.
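To make the terms "causal", "dilation", and "residual" concrete, here is a minimal pure-Python sketch of a causal dilated 1-D convolution wrapped in a residual connection. It is an illustration only and not the authors' implementation: the function names, the single-channel input, the ReLU, and the kernel values are all assumptions, and squeeze-and-excitation is left out.

```python
def causal_conv1d(x, kernel, dilation=1):
    """Causal dilated 1-D convolution over a list of floats (hypothetical helper).

    y[t] = sum_k kernel[k] * x[t - k * dilation], with x treated as zero
    before the start of the sequence, so y[t] never sees future frames.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation  # look back only, never ahead
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def residual_causal_block(x, kernel, dilation=1):
    """Add the input back to the ReLU of the causal convolution (assumed layout)."""
    conv = causal_conv1d(x, kernel, dilation)
    return [xi + max(0.0, ci) for xi, ci in zip(x, conv)]


frames = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy single-channel feature sequence
out = residual_causal_block(frames, kernel=[0.5, 0.5], dilation=2)
print(out)  # [1.5, 3.0, 5.0, 7.0, 9.0]
```

Because each output depends only on the current and past frames, changing a later frame leaves earlier outputs unchanged, which is the property a streaming frontend needs. Increasing `dilation` widens the span of past context without adding kernel weights.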

Cite as: Radfar, M., Barnwal, R., Swaminathan, R.V., Chang, F.-J., Strimel, G.P., Susanj, N., Mouchtaris, A. (2022) ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition. Proc. Interspeech 2022, 4431-4435, doi: 10.21437/Interspeech.2022-10844

@inproceedings{radfar22_interspeech,
  author={Martin Radfar and Rohit Barnwal and Rupak Vignesh Swaminathan and Feng-Ju Chang and Grant P. Strimel and Nathan Susanj and Athanasios Mouchtaris},
  title={{ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={4431--4435},
  doi={10.21437/Interspeech.2022-10844}
}