Abstract
The rapid growth of the Internet has created a tremendous number of multilingual resources. However, language boundaries prevent information sharing and discovery across countries. Proper names play an important role in search queries and knowledge discovery. Foreign proper names are often translated phonetically, a process referred to as transliteration. In this research we propose a generic transliteration framework that incorporates an enhanced Hidden Markov Model (HMM) and a Web mining model. We improve on traditional statistical transliteration in three areas: (1) incorporating a simple phonetic transliteration knowledge base; (2) incorporating a bigram and a trigram HMM; and (3) incorporating a Web mining model that uses word frequency-of-occurrence information from the Web. We evaluated the framework on English–Arabic back transliteration. Experiments showed that, when using the HMM alone, a combination of the bigram and trigram HMM approaches performed best for English–Arabic transliteration. While the bigram model alone achieved fairly good performance, the trigram model alone did not. The Web mining approach boosted performance by 79.05%. Overall, our framework achieved a precision of 0.72 when the eight best transliterations were considered. Our results show promise for using transliteration techniques to improve multilingual Web retrieval.
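To make the pipeline summarized above more concrete, the following is a minimal Python sketch of how scores from a bigram and a trigram model might be combined, how candidates might be re-ranked with Web frequency counts, and how a top-eight precision could be computed. It is an illustration only, not the authors' implementation. The function names, the interpolation weight, the `log(1 + count)` bonus, the toy candidates, and the reading of "precision" as the fraction of names whose correct form appears among the top k candidates are all assumptions that do not come from the abstract.

```python
import math


def combined_score(bigram_logp, trigram_logp, weight=0.5):
    """Interpolate two model log-probabilities.

    The 0.5 weight is an assumed placeholder, not a value from the paper.
    """
    return weight * bigram_logp + (1 - weight) * trigram_logp


def rerank_with_web_counts(candidates, web_counts, top_k=8):
    """Re-rank candidates by model score plus a Web-frequency bonus.

    candidates: dict mapping candidate string -> model log-probability.
    web_counts: dict mapping candidate string -> hypothetical hit count.
    The log(1 + count) bonus is an assumed way to mix the two signals.
    """
    scored = {
        cand: logp + math.log1p(web_counts.get(cand, 0))
        for cand, logp in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_k]


def precision_at_k(ranked_lists, gold, k=8):
    """Fraction of names whose gold transliteration is in the top k (assumed metric)."""
    hits = sum(1 for name, ranked in ranked_lists.items() if gold[name] in ranked[:k])
    return hits / len(ranked_lists)


# Toy data: invented candidates for one back-transliterated name.
toy_candidates = {"jorj": -2.0, "george": -2.5, "gorg": -2.2}
toy_web_counts = {"george": 1000, "jorj": 3, "gorg": 0}
print(rerank_with_web_counts(toy_candidates, toy_web_counts))
```

In this toy run the model alone prefers a misspelled candidate, while the frequency bonus promotes the spelling that is common on the Web. This matches the intuition behind using Web occurrence counts to filter statistically generated candidates.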
Cite this article
Zhou, Y., Huang, F. & Chen, H. Combining probability models and web mining models: a framework for proper name transliteration. Inf Technol Manage 9, 91–103 (2008). https://doi.org/10.1007/s10799-007-0031-9
DOI: https://doi.org/10.1007/s10799-007-0031-9