Adjusting Machine Translation Datasets for Document-Level Cross-Language Information Retrieval: Methodology

Shtekh, Gennady; Kazakova, Polina; Nikitinsky, Nikita

doi:10.1007/978-3-030-00794-2_9

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11107)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

Evaluating the performance of Cross-Language Information Retrieval (CLIR) models is difficult, since collecting and assessing a substantial amount of data for CLIR system evaluation can be a non-trivial and expensive process. At the same time, a substantial number of machine translation datasets are now available. In the present paper we attempt to address this problem by suggesting a strict workflow for transforming machine translation datasets into a CLIR evaluation dataset (with automatically obtained relevance assessments), as well as a workflow for extracting a representative subsample from the initial large corpus of documents so that it is suitable for further manual assessment. We also hypothesize, and then show through a number of experiments on the United Nations Parallel Corpus data, that the quality of an information retrieval algorithm on the automatically assessed sample can in fact be treated as a reasonable metric.

Notes

1. Here we define a document-level information retrieval system as a type of information retrieval system in which users query not with short keyword phrases but with full-text example documents.
2. Nonetheless, the document-level setting somewhat simplifies the evaluation procedure: there is no need for example queries or for ground-truth relevance judgments between queries and documents, since only document-to-document relevance is required.
3. For simplicity, in the present paper we discuss the case of a bilingual dataset. However, the approaches described here can be easily generalized to multiple languages.
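
To make notes 1 to 3 concrete, the following is a minimal sketch of how a document-level, bilingual retrieval run could be scored from document-to-document relevance alone. It assumes that each source-language document has exactly one relevant target-language document, namely its aligned translation; this is one plausible reading of "automatically obtained relevance assessments" and is not confirmed by the text above. The function names, the document ids, and the choice of mean reciprocal rank as the metric are all hypothetical and chosen for illustration only.

```python
from typing import Dict, List


def reciprocal_rank(ranked_ids: List[str], relevant_id: str) -> float:
    """Return 1/rank of the relevant document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]], alignment: Dict[str, str]
) -> float:
    """Average the reciprocal rank over all query documents.

    rankings:  query document id -> ranked list of retrieved document ids
    alignment: query document id -> id of its aligned translation
               (ASSUMPTION: the translation is the only relevant document)
    """
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(hits, alignment[q]) for q, hits in rankings.items())
    return total / len(rankings)


# Hypothetical toy data: English documents used as queries against French ones.
alignment = {"en_1": "fr_1", "en_2": "fr_2", "en_3": "fr_3"}
rankings = {
    "en_1": ["fr_1", "fr_3", "fr_2"],  # translation found at rank 1
    "en_2": ["fr_3", "fr_2", "fr_1"],  # translation found at rank 2
    "en_3": ["fr_1", "fr_2"],          # translation not retrieved
}

score = mean_reciprocal_rank(rankings, alignment)
print(score)  # (1 + 1/2 + 0) / 3 = 0.5
```

Because whole documents act as queries, no separate query set or manual query-document judgments are needed here; the alignment table alone supplies the relevance labels under the stated assumption.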

Acknowledgements

We would like to acknowledge the hard work and commitment from Ivan Menshikh throughout this study. We are also thankful to Anna Potapenko for offering very useful comments on the present paper, and Konstantin Vorontsov for encouragement and support.

The present research was supported by the Ministry of Education and Science of the Russian Federation under the unique research id RFMEFI57917X0143.

Author information

Authors and Affiliations

Integrated Systems, Vorontsovskaya Street, 35B building 3, room 413, 109147, Moscow, Russia
Nikita Nikitinsky
National University of Science and Technology MISIS, Leninsky Avenue 4, 119049, Moscow, Russia
Gennady Shtekh & Polina Kazakova

Corresponding author

Correspondence to Polina Kazakova.

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Aleš Horák
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Karel Pala

About this paper

Cite this paper

Shtekh, G., Kazakova, P., Nikitinsky, N. (2018). Adjusting Machine Translation Datasets for Document-Level Cross-Language Information Retrieval: Methodology. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science, vol 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_9

DOI: https://doi.org/10.1007/978-3-030-00794-2_9
Published: 08 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00793-5
Online ISBN: 978-3-030-00794-2
eBook Packages: Computer Science, Computer Science (R0)
