ISCA Archive Eurospeech 2003

Using the web for fast language model construction in minority languages

Viet Bac Le, Brigitte Bigi, Laurent Besacier, Eric Castelli

Designing and building a language model for a minority language is a hard task. By minority language, we mean a language with few available resources, especially for statistical learning. In this paper, a new methodology for fast language model construction in minority languages is proposed. It is based on the use of Web resources to collect and build effective text corpora. By applying some filtering techniques, this methodology allows quick and efficient construction of a language model at a small cost in terms of computational and human resources. Our initial experiments have shown excellent performance of the Web language models compared with newspaper language models when the proposed filtering methods are applied to a majority language (French). Following the same approach for a minority language (Vietnamese), a valuable language model was constructed in 3 months, with only 15% new development needed to adapt some of the filtering tools.
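As a rough illustration of the kind of pipeline the abstract describes (Web text, then filtering, then language model statistics), the following is a minimal sketch only. The abstract does not specify the filtering methods, so everything here is an assumption: the function names `strip_markup`, `keep_sentence`, and `bigram_counts`, the in-vocabulary ratio criterion, and the 0.8 threshold are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter


def strip_markup(html):
    """Remove tags and collapse whitespace (a crude stand-in for real HTML cleaning)."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def keep_sentence(sentence, vocabulary, min_in_vocab=0.8):
    """Assumed filter: keep a sentence if enough of its words are in the vocabulary."""
    words = sentence.lower().split()
    if not words:
        return False
    known = sum(1 for w in words if w in vocabulary)
    return known / len(words) >= min_in_vocab


def bigram_counts(sentences):
    """Count bigrams over the kept sentences, with sentence boundary markers."""
    counts = Counter()
    for s in sentences:
        tokens = ["<s>"] + s.lower().split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts
```

In such a sketch, adapting the pipeline to a new language would mostly mean changing the tokenization and the vocabulary passed to `keep_sentence`, which is one plausible reading of why only part of the tooling needed new development.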


doi: 10.21437/Eurospeech.2003-779

Cite as: Le, V.B., Bigi, B., Besacier, L., Castelli, E. (2003) Using the web for fast language model construction in minority languages. Proc. 8th European Conference on Speech Communication and Technology (Eurospeech 2003), 3117-3120, doi: 10.21437/Eurospeech.2003-779

@inproceedings{le03_eurospeech,
  author={Viet Bac Le and Brigitte Bigi and Laurent Besacier and Eric Castelli},
  title={{Using the web for fast language model construction in minority languages}},
  year=2003,
  booktitle={Proc. 8th European Conference on Speech Communication and Technology (Eurospeech 2003)},
  pages={3117--3120},
  doi={10.21437/Eurospeech.2003-779}
}