Abstract
As participants in the TIDES Surprise language exercise, researchers at the University of Massachusetts helped collect Hindi-English resources and developed a cross-language information retrieval system. Components included normalization, stop-word removal, transliteration, structured query translation, and language modeling using a probabilistic dictionary derived from a parallel corpus. Existing technology was successfully applied to Hindi. The biggest stumbling blocks were collecting parallel English and Hindi text and dealing with numerous proprietary encodings.
Index Terms
- Hindi CLIR in thirty days