Hindi CLIR in thirty days

Authors:
Leah S. Larkey, University of Massachusetts, Amherst, MA
Margaret E. Connell, University of Massachusetts, Amherst, MA
Nasreen Abduljaleel, University of Massachusetts, Amherst, MA

ACM Transactions on Asian Language Information Processing, Volume 2, Issue 2, pp. 130–142
https://doi.org/10.1145/974740.974746

Published: 01 June 2003

Abstract

As participants in the TIDES Surprise language exercise, researchers at the University of Massachusetts helped collect Hindi–English resources and developed a cross-language information retrieval system. Components included normalization, stop-word removal, transliteration, structured query translation, and language modeling using a probabilistic dictionary derived from a parallel corpus. Existing technology was successfully applied to Hindi. The biggest stumbling blocks were collection of parallel English and Hindi text and dealing with numerous proprietary encodings.
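
The abstract names language modeling with a probabilistic dictionary as one component but does not give a formula. As a rough illustration only, the sketch below scores a document by query likelihood, explaining each query term through its dictionary translations and smoothing with collection statistics. The function name `translation_lm_score`, the mixing weight `lam`, the floor value, and the romanized toy vocabulary and probabilities are all assumptions made for this example, not details taken from the article.

```python
import math
from collections import Counter

FLOOR = 1e-12  # arbitrary floor so an untranslatable query term does not give log(0)


def translation_lm_score(query_terms, doc_terms, collection_terms, trans_prob, lam=0.5):
    """Log-likelihood of the query given one document, via a probabilistic dictionary.

    trans_prob[q][d] is a hypothetical estimate of P(q | d): the probability that
    document-language term d is rendered as query-language term q.
    lam is an assumed mixing weight between document and collection evidence.
    """
    doc_tf, coll_tf = Counter(doc_terms), Counter(collection_terms)
    doc_len, coll_len = max(len(doc_terms), 1), max(len(collection_terms), 1)
    score = 0.0
    for q in query_terms:
        translations = trans_prob.get(q, {})
        # Probability of q under the document model, summed over its translations.
        p_doc = sum(p * doc_tf[d] / doc_len for d, p in translations.items())
        # Same quantity under the whole-collection (background) model.
        p_coll = sum(p * coll_tf[d] / coll_len for d, p in translations.items())
        score += math.log(max(lam * p_doc + (1 - lam) * p_coll, FLOOR))
    return score


# Toy data: made-up probabilities and romanized tokens, for illustration only.
toy_dict = {"river": {"nadi": 1.0}, "water": {"pani": 0.9, "jal": 0.8}}
doc_a = ["nadi", "pani", "pani"]
doc_b = ["shahar", "sadak"]
collection = doc_a + doc_b
query = ["river", "water"]

ranked = sorted(
    [("doc_a", doc_a), ("doc_b", doc_b)],
    key=lambda item: translation_lm_score(query, item[1], collection, toy_dict),
    reverse=True,
)
print([name for name, _ in ranked])
```

With this toy data, `doc_a` ranks above `doc_b` because it contains translations of both query terms, while `doc_b` receives only the smoothed background probability. In a real system the dictionary probabilities would be estimated from word-aligned parallel text rather than written by hand.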

Index Terms

Hindi CLIR in thirty days
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection
      2. Dictionaries
    2. Search engine architectures and scalability
      1. Search engine indexing

Published in

ACM Transactions on Asian Language Information Processing, Volume 2, Issue 2
June 2003
90 pages
ISSN: 1530-0226
EISSN: 1558-3430
DOI: 10.1145/974740

Copyright © 2003 ACM

Publisher
Association for Computing Machinery
New York, NY, United States

Author Tags
Hindi
cross-language
cross-lingual information retrieval
evaluation