In this paper, we propose a new language model based on dependent word sequences organized in a multi-level hierarchy. We call this model MCnv, where n is the maximum number of words in a sequence and v is the maximum number of levels in the hierarchy. The originality of this model lies in its capacity to take into account dependent variable-length sequences for very large vocabularies. To discover the variable-length sequences and build the hierarchy, we use a set of 233 syntactic classes extracted from the 8 elementary grammatical classes of French. The MCnv model learns hierarchical word patterns and uses them to re-evaluate and filter the n-best utterance hypotheses produced by our speech recognizer MAUD. The model was trained on a corpus of 43 million words extracted from a French newspaper and uses a vocabulary of 20,000 words. Tests were conducted on 300 sentences. The model achieves a 17% decrease in perplexity compared to an interpolated class trigram model. Rescoring the original n-best hypotheses yields a 5% improvement in accuracy.
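To make the two evaluation quantities concrete, the following is a minimal sketch of how perplexity is computed from per-word log-probabilities and how an n-best list can be re-ranked with a language model score. It is not the MCnv model or the MAUD recognizer: the function names, the additive score combination with an `lm_weight` factor, and the toy unigram table are all assumptions made for illustration only.

```python
import math

def perplexity(log_probs):
    """Perplexity from a list of per-word natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

def rescore_nbest(hypotheses, lm_logprob, lm_weight=1.0):
    """Re-rank (words, recognizer_score) pairs, best first.

    Assumption: the final score is the recognizer score plus a weighted
    language model log-probability. The abstract does not specify how
    the scores are actually combined.
    """
    rescored = [
        (words, score + lm_weight * lm_logprob(words))
        for words, score in hypotheses
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy language model: a unigram table with a small floor
# probability for unknown words. It stands in for a real model.
TOY_LOGPROBS = {
    "le": math.log(0.5),
    "chat": math.log(0.25),
    "chas": math.log(0.01),
}

def toy_lm_logprob(words):
    return sum(TOY_LOGPROBS.get(w, math.log(1e-4)) for w in words)

# Two invented hypotheses with invented recognizer log-scores.
nbest = [
    (["le", "chas"], -10.0),
    (["le", "chat"], -10.5),
]

ranked = rescore_nbest(nbest, toy_lm_logprob)
best_words, best_score = ranked[0]
print(" ".join(best_words), round(best_score, 3))
```

In this toy setup the second hypothesis overtakes the first after rescoring, because the language model penalizes the unlikely word more than the recognizer favored it. A lower perplexity on held-out text means the model assigns higher average probability to each word.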
Cite as: Zitouni, I., Smaili, K., Haton, J.-P. (2001) Statistical language model based on a hierarchical approach: MCnv. Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001), 29-32, doi: 10.21437/Eurospeech.2001-7
@inproceedings{zitouni01_eurospeech,
  author={Imed Zitouni and Kamel Smaili and Jean-Paul Haton},
  title={{Statistical language model based on a hierarchical approach: MCnv}},
  year=2001,
  booktitle={Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001)},
  pages={29--32},
  doi={10.21437/Eurospeech.2001-7}
}