ISCA Archive Interspeech 2007

Audio-visual integration for robust speech recognition using maximum weighted stream posteriors

Rowan Seymour, Darryl Stewart, Ji Ming

In this paper, we demonstrate for the first time the robustness of the Maximum Stream Posterior (MSP) method for audio-visual integration on a large speaker-independent speech recognition task in noisy conditions. Furthermore, we show that the method can be generalised and improved by using a softer weighting scheme to account for moderate noise conditions. We call this generalised method the Maximum Weighted Stream Posterior (MWSP) method. In addition, we carry out the first tests of the Posterior Union Model approach for audio-visual integration. All of the methods are compared in digit recognition tests involving various audio and video noise levels and conditions, including tests where both modalities are affected by noise. We also introduce a novel form of noise called jitter, which is used to simulate camera movement. The results verify that the MSP approach is robust and that its generalised form (MWSP) can lead to further improvements in moderate noise conditions.
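To make the stream-weighting idea concrete, below is a minimal illustrative sketch of weighted stream posterior combination. It assumes the common formulation in which per-class audio and video log-likelihoods are combined with a stream weight and normalised into a class posterior, and a grid of candidate weights is searched, keeping the one that yields the most confident posterior. The function names, the weight grid, and the maximum-posterior selection criterion are illustrative assumptions; the paper's exact MSP/MWSP criteria may differ.

import numpy as np

def weighted_stream_posteriors(audio_loglik, video_loglik, lam):
    # Combine per-class audio and video log-likelihoods with stream weight
    # lam on audio and (1 - lam) on video, then normalise to a posterior.
    combined = lam * audio_loglik + (1.0 - lam) * video_loglik
    combined -= combined.max()          # subtract max for numerical stability
    post = np.exp(combined)
    return post / post.sum()

def max_weighted_stream_posterior(audio_loglik, video_loglik,
                                  weights=np.linspace(0.0, 1.0, 11)):
    # Search a grid of stream weights and keep the weight whose combined
    # posterior is most confident (largest maximum posterior value).
    best_lam, best_post = None, None
    for lam in weights:
        post = weighted_stream_posteriors(audio_loglik, video_loglik, lam)
        if best_post is None or post.max() > best_post.max():
            best_lam, best_post = lam, post
    return best_lam, best_post

In this reading, restricting the weight grid to {0, 1} corresponds to picking a single best stream (an MSP-like hard decision), while a finer grid gives the softer weighting that the MWSP generalisation describes for moderate noise conditions.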


doi: 10.21437/Interspeech.2007-282

Cite as: Seymour, R., Stewart, D., Ming, J. (2007) Audio-visual integration for robust speech recognition using maximum weighted stream posteriors. Proc. Interspeech 2007, 654-657, doi: 10.21437/Interspeech.2007-282

@inproceedings{seymour07_interspeech,
  author={Rowan Seymour and Darryl Stewart and Ji Ming},
  title={{Audio-visual integration for robust speech recognition using maximum weighted stream posteriors}},
  year=2007,
  booktitle={Proc. Interspeech 2007},
  pages={654--657},
  doi={10.21437/Interspeech.2007-282}
}