ISCA Archive Interspeech 2007

Automatic recognition of connected vowels only using speaker-invariant representation of speech dynamics

Satoshi Asakawa, Nobuaki Minematsu, Keikichi Hirose

Speech acoustics vary with gender, age, microphone, room, transmission line, and a variety of other factors. In speech recognition research, to deal with these inevitable non-linguistic variations, speech data from thousands of speakers recorded under different acoustic conditions have been collected to train acoustic models of individual phonemes. Recently, a novel representation of speech dynamics was proposed [1, 2], in which the above non-linguistic factors are effectively removed from speech, much as pitch information is removed from the spectrum by smoothing it. This representation captures only speaker- and microphone-invariant speech dynamics; no absolute or static acoustic properties such as spectra are used, because with them speaker identity inevitably remains in the representation. In our previous study, the new representation was applied to recognizing a sequence of isolated vowels [3]. The proposed method, trained on a single speaker, outperformed conventional HMMs trained on more than four thousand speakers, even for noisy speech. The current paper presents initial results of applying the dynamic representation to recognizing continuous speech, namely connected vowels.
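The abstract does not spell out how the speaker-invariant representation is built. As a rough, hypothetical illustration of the general idea only, the sketch below models each vowel segment with a Gaussian over cepstral features and summarizes an utterance by the matrix of pairwise Bhattacharyya distances between segments. Because the Bhattacharyya distance is unchanged when the same invertible affine transformation is applied to both distributions, such a distance matrix discards global speaker- or channel-dependent warpings of the feature space while keeping the relative dynamics among segments. The Gaussian modelling, the Bhattacharyya distance, and all function names here are assumptions made for illustration, not a statement of the authors' exact method.

import numpy as np

def gaussian_stats(frames):
    # frames: (T, D) array of cepstral vectors for one vowel segment
    # (segmentation and feature extraction are assumed to be given).
    mu = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False) + 1e-6 * np.eye(frames.shape[1])  # regularize
    return mu, cov

def bhattacharyya(mu1, cov1, mu2, cov2):
    # Bhattacharyya distance between two Gaussians; invariant to any
    # invertible affine transform applied jointly to both distributions.
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term2 = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term1 + term2

def structure_matrix(segments):
    # segments: list of (T_i, D) cepstral frame arrays, one per vowel segment.
    # Returns the symmetric matrix of pairwise Bhattacharyya distances, used
    # here as a speaker- and channel-invariant "structure" of the utterance.
    stats = [gaussian_stats(s) for s in segments]
    n = len(stats)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = bhattacharyya(*stats[i], *stats[j])
            dist[i, j] = dist[j, i] = d
    return dist

Under these assumptions, two utterances could then be compared through the difference between their structure matrices (for example, the Euclidean distance between their upper-triangular parts); this matching scheme is likewise only one plausible choice, not the paper's.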


doi: 10.21437/Interspeech.2007-325

Cite as: Asakawa, S., Minematsu, N., Hirose, K. (2007) Automatic recognition of connected vowels only using speaker-invariant representation of speech dynamics. Proc. Interspeech 2007, 890-893, doi: 10.21437/Interspeech.2007-325

@inproceedings{asakawa07_interspeech,
  author={Satoshi Asakawa and Nobuaki Minematsu and Keikichi Hirose},
  title={{Automatic recognition of connected vowels only using speaker-invariant representation of speech dynamics}},
  year=2007,
  booktitle={Proc. Interspeech 2007},
  pages={890--893},
  doi={10.21437/Interspeech.2007-325}
}