A language-modeling approach to inverse text normalization and data cleanup for multimodal voice search applications

Ju, Yun-Cheng; Odell, Julian

doi:10.21437/Interspeech.2008-570

In this paper we address two related challenges in multimodal local search applications on mobile devices: first, displaying business names correctly, and second, harvesting language model training data from an inconsistently labeled corpus. We investigate how common text normalization and the quality of the language model training corpus affect the accuracy of displayed results. We propose a new language model framework that eliminates the need for explicit inverse text normalization. The same framework can be applied to sift through corrupted language model training data. Our new language model is 25% more accurate while being 25% smaller.

Cite as: Ju, Y.-C., Odell, J. (2008) A language-modeling approach to inverse text normalization and data cleanup for multimodal voice search applications. Proc. Interspeech 2008, 2179-2182, doi: 10.21437/Interspeech.2008-570

@inproceedings{ju08_interspeech,
  author={Yun-Cheng Ju and Julian Odell},
  title={{A language-modeling approach to inverse text normalization and data cleanup for multimodal voice search applications}},
  year=2008,
  booktitle={Proc. Interspeech 2008},
  pages={2179--2182},
  doi={10.21437/Interspeech.2008-570}
}