Adjusting Machine Translation Datasets for Document-Level Cross-Language Information Retrieval: Methodology

  • Conference paper
Text, Speech, and Dialogue (TSD 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11107)

Abstract

Evaluating the performance of Cross-Language Information Retrieval (CLIR) models is difficult, since collecting and assessing a substantial amount of data for CLIR system evaluation can be a non-trivial and expensive process. At the same time, a substantial number of machine translation datasets are now available. In the present paper we attempt to solve this problem by suggesting a strict workflow for transforming machine translation datasets into a CLIR evaluation dataset (with automatically obtained relevance assessments), as well as a workflow for extracting a representative subsample from the initial large corpus of documents so that it is suitable for further manual assessment. We also hypothesize, and then show through a number of experiments on the United Nations Parallel Corpus data, that the quality of an information retrieval algorithm on the automatically assessed sample can in fact be treated as a reasonable metric.
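The idea of automatically obtained relevance assessments can be illustrated with a minimal sketch. It rests on a simplifying assumption that is not taken from the paper: in a document-aligned parallel corpus, the only document treated as relevant to a source-language document is its own translation. The function names, the toy documents, and the token-overlap scorer are hypothetical placeholders for a real CLIR model, and mean reciprocal rank is used here only as one convenient metric.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]


def reciprocal_ranks(source_docs: Dict[str, str],
                     target_docs: Dict[str, str],
                     score: Scorer) -> List[float]:
    """For each source document, rank all target documents by `score` and
    return the reciprocal rank of the aligned translation (same id).

    Assumption: the aligned translation is the single relevant document.
    """
    result = []
    for doc_id, query_text in source_docs.items():
        ranked = sorted(
            target_docs,
            key=lambda tid: score(query_text, target_docs[tid]),
            reverse=True,
        )
        result.append(1.0 / (ranked.index(doc_id) + 1))
    return result


def mean_reciprocal_rank(source_docs: Dict[str, str],
                         target_docs: Dict[str, str],
                         score: Scorer) -> float:
    """Average the reciprocal ranks over all source documents."""
    ranks = reciprocal_ranks(source_docs, target_docs, score)
    return sum(ranks) / len(ranks)


def token_overlap(a: str, b: str) -> float:
    """Hypothetical stand-in for a CLIR model: Jaccard overlap of
    lowercased whitespace tokens (only shared numbers and names match)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Toy bilingual corpus; ids link each document to its translation.
english = {
    "d1": "Resolution 1325 adopted by the Security Council",
    "d2": "Report of the Secretary-General on Somalia",
}
french = {
    "d1": "Résolution 1325 adoptée par le Conseil de sécurité",
    "d2": "Rapport du Secrétaire général sur la Somalie",
}

print(mean_reciprocal_rank(english, french, token_overlap))
```

Any scoring function with the same signature can be substituted for `token_overlap`, so different retrieval models can be compared on the same automatically labelled data.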

Notes

  1. Here we define a document-level information retrieval system as a type of information retrieval system in which users query not with short keyword phrases but with full-text example documents.

  2. Nonetheless, the document-level setting somewhat simplifies the evaluation procedure: there is no need for example queries or for ground-truth relevance judgements between queries and documents, since only document-to-document relevance is required.

  3. For simplicity, in the present paper we discuss the case of a bilingual dataset. However, the approaches described here can easily be generalized to multiple languages.

Acknowledgements

We would like to acknowledge the hard work and commitment from Ivan Menshikh throughout this study. We are also thankful to Anna Potapenko for offering very useful comments on the present paper, and Konstantin Vorontsov for encouragement and support.

The present research was supported by the Ministry of Education and Science of the Russian Federation under the unique research id RFMEFI57917X0143.

Author information

Correspondence to Polina Kazakova.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Shtekh, G., Kazakova, P., Nikitinsky, N. (2018). Adjusting Machine Translation Datasets for Document-Level Cross-Language Information Retrieval: Methodology. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science, vol 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_9

  • DOI: https://doi.org/10.1007/978-3-030-00794-2_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00793-5

  • Online ISBN: 978-3-030-00794-2

  • eBook Packages: Computer Science (R0)