In this paper, we present a novel end-to-end automatic speech recognition (ASR) method that considers whether the input speech can be reconstructed from the generated text. A speech-to-text encoder-decoder model is one of the most powerful end-to-end ASR methods because it makes no conditional independence assumptions. However, encoder-decoder models often suffer from a mismatch between teacher forcing in the training phase and free running in the testing phase: there is no guarantee that text will be generated correctly once generation errors appear in the conditioning context. To mitigate this problem, our proposed method uses not only the generation probability of the text, computed by a speech-to-text encoder-decoder, but also the reconstruction probability of the speech, computed by a text-to-speech encoder-decoder, on the basis of a maximum mutual information criterion. We expect the reconstruction criterion to impose a constraint against generation errors. In addition, to compute the reconstruction probability, we introduce a mixture density network into the text-to-speech encoder-decoder. Our experiments on Japanese lecture ASR tasks demonstrate that considering the reconstruction criterion yields ASR performance improvements.
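The combination of a generation probability and a reconstruction probability described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how the two terms are weighted or what form the mixture density network's output takes, so the interpolation `weight`, the diagonal-covariance Gaussian mixture, and the function names `gmm_diag_log_likelihood` and `pick_hypothesis` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def gmm_diag_log_likelihood(
    frame: Sequence[float],
    weights: Sequence[float],
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> float:
    """Log-density of one acoustic frame under a diagonal Gaussian mixture.

    Assumption: the mixture parameters stand in for what a mixture density
    network might emit for one frame; the paper's actual parameterization
    is not given in the abstract.
    """
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_pdf = 0.0
        for x, m, v in zip(frame, mu, var):
            log_pdf += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        component_logs.append(math.log(w) + log_pdf)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def pick_hypothesis(
    hypotheses: List[Tuple[str, float, float]], weight: float = 0.5
) -> str:
    """Return the text with the highest combined log score.

    Each hypothesis is (text, log P(text | speech), log P(speech | text)).
    `weight` scales the reconstruction term and is a hypothetical tuning knob.
    """
    best = max(hypotheses, key=lambda h: h[1] + weight * h[2])
    return best[0]
```

With `weight=0.0` the selection falls back to the speech-to-text score alone; a positive weight lets a hypothesis that reconstructs the speech well overtake one that the speech-to-text model slightly prefers.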
Cite as: Masumura, R., Sato, H., Tanaka, T., Moriya, T., Ijima, Y., Oba, T. (2019) End-to-End Automatic Speech Recognition with a Reconstruction Criterion Using Speech-to-Text and Text-to-Speech Encoder-Decoders. Proc. Interspeech 2019, 1606-1610, doi: 10.21437/Interspeech.2019-2111
@inproceedings{masumura19b_interspeech,
  author={Ryo Masumura and Hiroshi Sato and Tomohiro Tanaka and Takafumi Moriya and Yusuke Ijima and Takanobu Oba},
  title={{End-to-End Automatic Speech Recognition with a Reconstruction Criterion Using Speech-to-Text and Text-to-Speech Encoder-Decoders}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1606--1610},
  doi={10.21437/Interspeech.2019-2111}
}