ABSTRACT
We propose a neural network model to estimate word translation probabilities for Cross-Lingual Information Retrieval (CLIR). The model estimates better probabilities for word translations than automatic word alignments alone, and generalizes to unseen source-target word pairs. We further improve the lexical neural translation model, and in turn CLIR performance, by incorporating source word context and by encoding the character sequences of input source words to generate translations of out-of-vocabulary words. To be effective, neural network models typically need training on large amounts of data labeled directly on the final task, in this case relevance to queries. In contrast, our approach requires only parallel data to train the translation model, and uses an unsupervised model to compute CLIR relevance scores.
We report results on the retrieval of text and speech documents in three morphologically complex languages with limited training data (Swahili, Tagalog, and Somali), using short English queries. Despite training on only about 2M words of parallel data for each language, we obtain neural network translation models that are very effective for this task. We also obtain further improvements using (i) a modified relevance model, which uses the probability of occurrence of a translation of each query term in the source document, and (ii) confusion networks (instead of 1-best output) that encode multiple transcription alternatives in the output of an Automatic Speech Recognition (ASR) system.
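As a rough illustration of item (i), the sketch below scores a source-language document against an English query in two ways: a conventional translation-based query likelihood, which averages translation probabilities over the document's words, and a "probability of occurrence" variant, which asks how likely it is that at least one word in the document translates to the query term. The abstract does not give the exact form of the authors' model, so both formulas, the smoothing weight alpha, the background probabilities, and the toy translation table are assumptions made for illustration only.

```python
import math
from collections import Counter

def query_likelihood_score(query, doc_tokens, trans_prob, background, alpha=0.3):
    """Log-score using a mixture over document words:
    P(q|D) = alpha * P(q|background) + (1 - alpha) * sum_f P(q|f) * P(f|D)."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    score = 0.0
    for q in query:
        p_doc = 0.0
        if n:
            p_doc = sum(trans_prob.get((q, f), 0.0) * c / n for f, c in counts.items())
        p = alpha * background.get(q, 1e-6) + (1 - alpha) * p_doc
        score += math.log(p)
    return score

def occurrence_score(query, doc_tokens, trans_prob, background, alpha=0.3):
    """Log-score using the probability that some document word translates to q:
    P(q occurs in D) = 1 - prod_f (1 - P(q|f)).  Assumed form, not the paper's."""
    score = 0.0
    for q in query:
        p_absent = 1.0
        for f in doc_tokens:
            p_absent *= 1.0 - trans_prob.get((q, f), 0.0)
        p = alpha * background.get(q, 1e-6) + (1 - alpha) * (1.0 - p_absent)
        score += math.log(p)
    return score

# Hypothetical toy data: (English term, source word) -> translation probability.
trans_prob = {("water", "maji"): 0.8, ("school", "shule"): 0.9}
background = {"water": 0.001, "school": 0.001}
query = ["water", "school"]
doc_relevant = ["maji", "safi", "shule"]
doc_other = ["habari", "safi"]

ql_relevant = query_likelihood_score(query, doc_relevant, trans_prob, background)
ql_other = query_likelihood_score(query, doc_other, trans_prob, background)
occ_relevant = occurrence_score(query, doc_relevant, trans_prob, background)
occ_other = occurrence_score(query, doc_other, trans_prob, background)
```

One difference between the two forms in this sketch: the mixture score dilutes a single strong translation match as the document grows longer, while the occurrence score does not, since it only asks whether a translation appears at all.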
We achieve overall relative improvements in mean average precision (MAP) of up to 24% on Swahili, 50% on Tagalog, and 39% on Somali over the baseline probabilistic model, and larger improvements over monolingual retrieval from machine translation output.
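For readers unfamiliar with the metric, the following minimal sketch shows how mean average precision and a relative improvement over a baseline are commonly computed. The document identifiers and scores are hypothetical, and the abstract does not state evaluation details such as rank cutoffs, so this is the textbook definition rather than the authors' exact evaluation procedure.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents for the query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def relative_improvement(baseline, system):
    """Relative gain of system over baseline, e.g. 0.25 means 25%."""
    return (system - baseline) / baseline

# Hypothetical example: two queries with made-up rankings and judgments.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),
    (["d4", "d5"], {"d5"}),
]
map_score = mean_average_precision(runs)
gain = relative_improvement(0.20, 0.25)
```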
Index Terms
- Neural-Network Lexical Translation for Cross-lingual IR from Text and Speech