Non-linear estimation of voice activity to improve automatic recognition of noisy speech

Gemello, Roberto; Mana, Franco; Mori, Renato de

doi:10.21437/Interspeech.2005-243

Feed-forward multi-layer perceptrons (MLPs) and recurrent neural networks (RNNs), fed with different sets of acoustic features, are proposed for estimating the presence or absence of speech in a continuous speech signal under various levels of background noise. Detailed performance evaluations of voice activity detection (VAD) are reported on the Aurora2, Aurora3 and TIMIT corpora. The best results are obtained with an RNN fed by the acoustic features used for automatic speech recognition (ASR), augmented by specific features. Detailed ASR evaluations are also reported on Aurora2 and on the German, Italian and Spanish portions of the Aurora3 test set. The highest word error rate (WER) reduction (16.9%) is obtained when the noise-only presence probability is used to modify the phone posterior probabilities used for speech decoding.
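The abstract does not state how the noise-only probability modifies the phone posteriors, so the following is only an illustrative sketch under an assumed scheme, not the authors' method. It assumes each frame has a posterior vector over phones that includes a silence/noise class, and it mixes that vector with a point mass on the silence class, weighted by the frame's noise-only probability. The function name `reweight_posteriors`, the `silence_index` parameter, and the mixing rule itself are all hypothetical.

```python
def reweight_posteriors(posteriors, noise_prob, silence_index=0):
    """Mix per-frame phone posteriors with a noise-only probability.

    Assumed (hypothetical) rule, per frame:
        new[k] = (1 - p_noise) * old[k]            for every phone k
        new[silence_index] += p_noise
    If the input frame sums to 1, the output frame also sums to 1.
    """
    if len(posteriors) != len(noise_prob):
        raise ValueError("one noise probability is required per frame")

    reweighted = []
    for frame, p_noise in zip(posteriors, noise_prob):
        if not 0.0 <= p_noise <= 1.0:
            raise ValueError("noise probability must lie in [0, 1]")
        # Scale every phone posterior by the probability that speech is present.
        scaled = [p * (1.0 - p_noise) for p in frame]
        # Move the removed probability mass onto the silence/noise class.
        scaled[silence_index] += p_noise
        reweighted.append(scaled)
    return reweighted


# Usage: one frame with three classes (index 0 is the assumed silence class).
frames = [[0.2, 0.5, 0.3]]
noise_only = [0.5]
modified = reweight_posteriors(frames, noise_only)
```

Under this assumed rule, a frame judged likely to contain only noise has its speech-phone posteriors suppressed before decoding, while a frame with a noise-only probability of zero is left unchanged.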

Cite as: Gemello, R., Mana, F., Mori, R.d. (2005) Non-linear estimation of voice activity to improve automatic recognition of noisy speech. Proc. Interspeech 2005, 2617-2620, doi: 10.21437/Interspeech.2005-243

@inproceedings{gemello05_interspeech,
  author={Roberto Gemello and Franco Mana and Renato de Mori},
  title={{Non-linear estimation of voice activity to improve automatic recognition of noisy speech}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={2617--2620},
  doi={10.21437/Interspeech.2005-243}
}