Statistical language model based on a hierarchical approach: MCnv

Zitouni, Imed; Smaili, Kamel; Haton, Jean-Paul

doi:10.21437/Eurospeech.2001-7

In this paper, we propose a new language model based on dependent word sequences organized in a multi-level hierarchy. We call this model MCnv, where n is the maximum number of words in a sequence and v is the maximum number of levels. The model's originality lies in its ability to take into account dependent variable-length sequences for very large vocabularies. To discover the variable-length sequences and build the hierarchy, we use a set of 233 syntactic classes derived from the 8 elementary grammatical classes of French. The MCnv model learns hierarchical word patterns and uses them to re-evaluate and filter the n-best utterance hypotheses output by our speech recognizer MAUD. The model was trained on a corpus of 43 million words extracted from a French newspaper and uses a vocabulary of 20,000 words. Tests were conducted on 300 sentences. The model achieved a 17% decrease in perplexity compared to an interpolated class trigram model. Rescoring the original n-best hypotheses resulted in a 5% improvement in accuracy.
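
The abstract reports two evaluation steps, perplexity measurement and n-best rescoring, without giving formulas. As a minimal illustrative sketch, and not the MCnv model itself, the Python below shows the standard perplexity definition and a generic log-linear rescoring of n-best hypotheses. The names `perplexity`, `rescore_nbest`, `toy_lm_logprob`, the `lm_weight` parameter, and the toy probability table are all assumptions introduced for illustration; the actual scoring used with MAUD is not described here.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities: exp(-mean log p)."""
    return math.exp(-sum(log_probs) / len(log_probs))


def rescore_nbest(hypotheses, lm_logprob, lm_weight=1.0):
    """Reorder n-best hypotheses by recognizer score plus weighted LM score.

    hypotheses: list of (words, recognizer_log_score) pairs.
    lm_logprob: function mapping a word list to a log probability.
    lm_weight:  hypothetical interpolation weight (an assumption, not from the paper).
    Returns the word lists sorted from best to worst combined score.
    """
    scored = [
        (rec_score + lm_weight * lm_logprob(words), words)
        for words, rec_score in hypotheses
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [words for _, words in scored]


# Toy stand-in for a language model: a made-up unigram table.
TOY_LOGPROBS = {
    "le": math.log(0.5),
    "chat": math.log(0.3),
    "chas": math.log(0.01),
}


def toy_lm_logprob(words):
    """Sum of unigram log probabilities, with a small floor for unseen words."""
    return sum(TOY_LOGPROBS.get(w, math.log(1e-4)) for w in words)


# Two competing hypotheses; the recognizer slightly prefers the first one.
nbest = [(["le", "chas"], -1.0), (["le", "chat"], -1.2)]
reranked = rescore_nbest(nbest, toy_lm_logprob)
```

In this toy setup the language model score outweighs the small recognizer preference, so `reranked[0]` is the hypothesis containing "chat". A relative perplexity decrease such as the one reported would be computed as `(baseline - new) / baseline` on the same test set.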

Cite as: Zitouni, I., Smaili, K., Haton, J.-P. (2001) Statistical language model based on a hierarchical approach: MCnv. Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001), 29-32, doi: 10.21437/Eurospeech.2001-7

@inproceedings{zitouni01_eurospeech,
  author={Imed Zitouni and Kamel Smaili and Jean-Paul Haton},
  title={{Statistical language model based on a hierarchical approach: MCnv}},
  year=2001,
  booktitle={Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001)},
  pages={29--32},
  doi={10.21437/Eurospeech.2001-7}
}