Multi-channel Attention for End-to-End Speech Recognition

Braun, Stefan; Neil, Daniel; Anumula, Jithendar; Ceolini, Enea; Liu, Shih-Chii

doi:10.21437/Interspeech.2018-1301

Recent end-to-end models for automatic speech recognition use sensory attention to integrate multiple input channels within a single neural network. However, these attention models are sensitive to the ordering of the channels used during training. This work proposes a sensory attention mechanism that is invariant to the channel ordering and increases the overall parameter count by only 0.09%. We demonstrate that, even without re-training, our attention-equipped end-to-end model can handle arbitrary numbers of input channels during inference. Compared with a recent related model with sensory attention, our model achieves a relative character error rate (CER) improvement of 40.3% to 42.9% when tested on the real noisy recordings from the multi-channel CHiME-4 dataset. In a two-channel configuration experiment, the attention signal allows the lower signal-to-noise ratio (SNR) sensor to be identified with 97.7% accuracy.
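
The abstract does not spell out the attention architecture, so the following is only a minimal NumPy sketch of one generic way to obtain the two properties it names (invariance to channel ordering and support for any number of channels). It is not the paper's model. It assumes a single scoring vector shared by all channels, a softmax taken across the channel axis, and a weighted sum of the channel features. The function name `channel_attention`, the weights `w`, and the array shapes are hypothetical.

```python
import numpy as np

def channel_attention(features, w, b=0.0):
    """Fuse channels with attention weights from a shared scorer.

    features: array of shape (channels, time, feat)
    w:        array of shape (feat,), shared across all channels (assumption)
    Returns the fused features (time, feat) and the weights (channels, time).
    """
    # The same scorer is applied to every channel, so no parameter is tied
    # to a channel index or to the number of channels.
    scores = features @ w + b                         # (channels, time)
    # Numerically stable softmax over the channel axis.
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)
    # Weighted sum over channels; a sum does not depend on channel order.
    fused = (weights[..., None] * features).sum(axis=0)
    return fused, weights

rng = np.random.default_rng(0)
w = rng.normal(size=8)
x = rng.normal(size=(6, 5, 8))            # 6 channels, 5 frames, 8 features

fused, weights = channel_attention(x, w)

# Shuffling the channels leaves the fused output unchanged.
perm = rng.permutation(6)
fused_perm, _ = channel_attention(x[perm], w)

# The same weights also work with a different channel count.
fused_two, weights_two = channel_attention(x[:2], w)
```

In a sketch like this, the per-frame weights could also be inspected to see which channel the model favors, which is the kind of signal the two-channel SNR experiment relies on.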

Cite as: Braun, S., Neil, D., Anumula, J., Ceolini, E., Liu, S.-C. (2018) Multi-channel Attention for End-to-End Speech Recognition. Proc. Interspeech 2018, 17-21, doi: 10.21437/Interspeech.2018-1301

@inproceedings{braun18_interspeech,
  author={Stefan Braun and Daniel Neil and Jithendar Anumula and Enea Ceolini and Shih-Chii Liu},
  title={{Multi-channel Attention for End-to-End Speech Recognition}},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={17--21},
  doi={10.21437/Interspeech.2018-1301}
}