LSA-based language model adaptation for highly inflected languages

Alumäe, Tanel; Kirt, Toomas

doi:10.21437/Interspeech.2007-267

This paper presents a language model topic adaptation framework for highly inflected languages. In such languages, sub-word units are used as the basic units for language modeling. Since such units carry little semantic information, they are not well suited to topic adaptation. We propose to lemmatize the corpus of training documents before constructing a latent topic model. To adapt the language model, we use a few lemmatized training sentences to find a set of documents that are semantically close to the current document. Fast marginal adaptation of a sub-word trigram language model is then used to adapt the background model. Experiments on a set of Estonian test texts show that the proposed approach gives a 19% decrease in language model perplexity. A statistically significant decrease in perplexity is already observed when using just two sentences for adaptation. We also show that the model employing lemmatization gives consistently better results than the unlemmatized model.
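
The rescaling step named in the abstract, fast marginal adaptation, can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes the commonly used unigram-rescaling form, in which each background probability P_bg(w | h) is multiplied by (P_adapt(w) / P_bg(w))^beta and the result is renormalized over the vocabulary. The vocabulary, the toy texts, the add-one smoothing, the value beta = 0.5, and the function names are all hypothetical. The LSA retrieval of semantically close documents and the mapping between lemmas and sub-word units are not shown; the adaptation text simply stands in for the retrieved documents.

```python
from collections import Counter


def unigram_probs(tokens, vocab, smoothing=1.0):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}


def fast_marginal_adaptation(p_bg_cond, p_bg_uni, p_adapt_uni, beta=0.5):
    """Rescale P_bg(w | h) by (P_adapt(w) / P_bg(w)) ** beta, then renormalize."""
    scaled = {
        w: p * (p_adapt_uni[w] / p_bg_uni[w]) ** beta
        for w, p in p_bg_cond.items()
    }
    z = sum(scaled.values())  # normalization term for this history h
    return {w: s / z for w, s in scaled.items()}


# Hypothetical toy data (assumption: not taken from the paper).
vocab = ["the", "river", "bank", "money"]
background_text = "the river the bank the money the river".split()
adaptation_text = "the money the bank money money".split()  # stands in for retrieved documents

p_bg_uni = unigram_probs(background_text, vocab)
p_adapt_uni = unigram_probs(adaptation_text, vocab)

# Assumed background conditional distribution P_bg(w | h) for one fixed history h.
p_bg_cond = {"the": 0.1, "river": 0.4, "bank": 0.3, "money": 0.2}

adapted = fast_marginal_adaptation(p_bg_cond, p_bg_uni, p_adapt_uni, beta=0.5)
for word in vocab:
    print(f"{word}: background={p_bg_cond[word]:.3f} adapted={adapted[word]:.3f}")
```

In this toy setting, words that are relatively more frequent in the adaptation text than in the background text ("money") gain probability mass, while words that are relatively rarer ("river") lose it. The exponent beta controls how strongly the background model is pulled toward the adaptation statistics.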

Cite as: Alumäe, T., Kirt, T. (2007) LSA-based language model adaptation for highly inflected languages. Proc. Interspeech 2007, 2357-2360, doi: 10.21437/Interspeech.2007-267

@inproceedings{alumae07_interspeech,
  author={Tanel Alumäe and Toomas Kirt},
  title={{LSA-based language model adaptation for highly inflected languages}},
  year=2007,
  booktitle={Proc. Interspeech 2007},
  pages={2357--2360},
  doi={10.21437/Interspeech.2007-267}
}