Automatic speech recognition (ASR) of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in state-of-the-art ASR systems. Motivated by the invariance of the visual modality to acoustic signal corruption, this paper presents an audio-visual multi-channel overlapped speech recognition system featuring a tightly integrated separation front-end and recognition back-end. A series of audio-visual multi-channel speech separation front-end components based on TF masking, filter & sum, and mask-based MVDR beamforming approaches were developed. To reduce the error cost mismatch between the separation and recognition components, they were jointly fine-tuned using the connectionist temporal classification (CTC) loss function, or a multi-task criterion interpolating it with a scale-invariant signal-to-noise ratio (Si-SNR) error cost. Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by absolute word error rate (WER) reductions of up to 6.81% (26.83% relative) and 22.22% (56.87% relative) on overlapped speech constructed by either simulating or replaying the Lip Reading Sentences 2 (LRS2) dataset, respectively.
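The Si-SNR error cost mentioned above projects the separated estimate onto the reference signal and measures the residual energy, making the metric invariant to rescaling of the estimate. A minimal NumPy sketch of this standard formulation follows; the function name `si_snr` and the interpolation weight `alpha` in the comment are illustrative, not from the paper.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (in dB) between an
    estimated and a reference time-domain signal."""
    # Remove DC offset so the projection is well defined
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference: s_target = <x̂, s> s / ||s||²
    s_target = (np.dot(estimate, reference) * reference
                / (np.dot(reference, reference) + eps))
    # Residual (distortion) component
    e_noise = estimate - s_target
    # Si-SNR = 10 log10(||s_target||² / ||e_noise||²)
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                           / (np.dot(e_noise, e_noise) + eps))

# In a multi-task setup of the kind described, a negated Si-SNR term
# could be interpolated with the CTC loss, e.g.
#   loss = alpha * ctc_loss - (1 - alpha) * si_snr(estimate, reference)
```

Because of the projection step, `si_snr(3.0 * est, ref)` equals `si_snr(est, ref)`, which is the property the "scale-invariant" name refers to.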
Cite as: Yu, J., Wu, B., Gu, R., Zhang, S.-X., Chen, L., Xu, Y., Yu, M., Su, D., Yu, D., Liu, X., Meng, H. (2020) Audio-Visual Multi-Channel Recognition of Overlapped Speech. Proc. Interspeech 2020, 3496-3500, doi: 10.21437/Interspeech.2020-2346
@inproceedings{yu20f_interspeech,
  author={Jianwei Yu and Bo Wu and Rongzhi Gu and Shi-Xiong Zhang and Lianwu Chen and Yong Xu and Meng Yu and Dan Su and Dong Yu and Xunying Liu and Helen Meng},
  title={{Audio-Visual Multi-Channel Recognition of Overlapped Speech}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={3496--3500},
  doi={10.21437/Interspeech.2020-2346}
}