Unsupervised language model adaptation based on automatic text collection from WWW

Suzuki, Motoyuki; Kajiura, Yasutomo; Ito, Akinori; Makino, Shozo

doi:10.21437/Interspeech.2006-572

An n-gram language model trained on a general corpus performs well. However, it is well known that a topic-specialized n-gram outperforms a general one. Several adaptation methods have been proposed for building a topic-specialized n-gram. These methods either use a given corpus on the target topic or collect topic-related documents from a database. When neither a topic corpus nor topic-related documents in a database are available, the general n-gram cannot be adapted. This paper proposes a new unsupervised adaptation method that collects topic-related documents from the World Wide Web. Several query terms are extracted from the recognized text, and the web pages returned by a search engine for those terms are used for adaptation. In experiments, the proposed method achieved word accuracy 7.2 points higher than the general n-gram.
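The abstract does not say how query terms are chosen or how the collected text is combined with the general model, so the following is only a rough sketch of the pipeline under stated assumptions. It uses unigram models for brevity, ranks candidate query terms by how over-represented they are in the recognized text relative to a background corpus, and adapts by linear interpolation. All function names, the scoring rule, and the weight `lam` are illustrative assumptions, not the authors' method. The web search step is left out: `collected_pages` stands in for text that a search engine would return.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption: real systems use a proper one).
    return text.lower().split()


def extract_query_terms(recognized_text, background_counts, background_total, k=3):
    """Pick k words that are over-represented in the recognized text.

    Assumed scoring rule: p_doc * log(p_doc / p_bg), with add-one style
    smoothing on the background estimate to avoid division by zero.
    """
    counts = Counter(tokenize(recognized_text))
    total = sum(counts.values())

    def score(word):
        p_doc = counts[word] / total
        p_bg = (background_counts.get(word, 0) + 1) / (background_total + 1)
        return p_doc * math.log(p_doc / p_bg)

    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(counts, key=lambda w: (-score(w), w))[:k]


def unigram_probs(texts):
    # Maximum-likelihood unigram model over a list of texts.
    counts = Counter(w for t in texts for w in tokenize(t))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def interpolate(general, topic, lam=0.5):
    # Linear interpolation: lam * P_topic(w) + (1 - lam) * P_general(w).
    vocab = set(general) | set(topic)
    return {w: lam * topic.get(w, 0.0) + (1 - lam) * general.get(w, 0.0)
            for w in vocab}


general_corpus = ["the cat sat on the mat", "the dog ate the food"]
background_counts = Counter(w for t in general_corpus for w in tokenize(t))
background_total = sum(background_counts.values())

recognized = "the patient needs heart surgery the heart"
query_terms = extract_query_terms(recognized, background_counts, background_total)

# Placeholder for pages a search engine would return for query_terms.
collected_pages = ["heart surgery helps the patient", "the heart pumps blood"]

general_lm = unigram_probs(general_corpus)
topic_lm = unigram_probs(collected_pages)
adapted_lm = interpolate(general_lm, topic_lm, lam=0.5)
```

In this toy run, the common word "the" scores below the topic words, and the adapted model assigns non-zero probability to words such as "heart" that the general model never saw. A real system would apply the same idea to higher-order n-grams with proper smoothing.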

Cite as: Suzuki, M., Kajiura, Y., Ito, A., Makino, S. (2006) Unsupervised language model adaptation based on automatic text collection from WWW. Proc. Interspeech 2006, paper 1806Thu1A2O.1, doi: 10.21437/Interspeech.2006-572

@inproceedings{suzuki06b_interspeech,
  author={Motoyuki Suzuki and Yasutomo Kajiura and Akinori Ito and Shozo Makino},
  title={{Unsupervised language model adaptation based on automatic text collection from WWW}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1806Thu1A2O.1},
  doi={10.21437/Interspeech.2006-572}
}