Neural-Network Lexical Translation for Cross-lingual IR from Text and Speech

Authors:
Rabih Zbib, Raytheon BBN Technologies, Cambridge, MA, USA
Lingjun Zhao, Raytheon BBN Technologies, Cambridge, MA, USA
Damianos Karakos, Raytheon BBN Technologies, Cambridge, MA, USA
William Hartmann, Raytheon BBN Technologies, Cambridge, MA, USA
Jay DeYoung, Northeastern University, Boston, MA, USA
Zhongqiang Huang, Alibaba Technologies, Hangzhou, China
Zhuolin Jiang, Raytheon BBN Technologies, Cambridge, MA, USA
Noah Rivkin, Franklin W. Olin College of Engineering, Newton, MA, USA
Le Zhang, Raytheon BBN Technologies, Cambridge, MA, USA
Richard Schwartz, Raytheon BBN Technologies, Cambridge, MA, USA
John Makhoul, Raytheon BBN Technologies, Cambridge, MA, USA

SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2019, Pages 645–654, https://doi.org/10.1145/3331184.3331222

Published: 18 July 2019

ABSTRACT

We propose a neural network model to estimate word translation probabilities for Cross-Lingual Information Retrieval (CLIR). The model estimates better probabilities for word translations than automatic word alignments alone, and generalizes to unseen source-target word pairs. We further improve the lexical neural translation model (and subsequently CLIR) by incorporating source word context and by encoding the character sequences of input source words to generate translations of out-of-vocabulary words. To be effective, neural network models typically need to be trained on large amounts of data labeled directly for the final task, in this case relevance to queries. In contrast, our approach requires only parallel data to train the translation model, and uses an unsupervised model to compute CLIR relevance scores.
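
As a rough illustration of how word translation probabilities can feed an unsupervised relevance score, the sketch below ranks source-language documents against an English query with a translation-based query-likelihood model. The mixture form, the smoothing weight `alpha`, the toy probability table, and all function names are assumptions made for illustration; the abstract does not specify the exact relevance model.

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_tokens, trans_prob, background, alpha=0.3):
    """Log P(query | doc) under a translation-based query-likelihood model.

    trans_prob[f][e] is P(e | f), the probability that source word f
    translates to English term e. The values used here are hypothetical;
    in the setting of the abstract they would come from the neural lexical
    translation model. background[e] is a general-English unigram
    probability used for smoothing (also an assumption).
    """
    counts = Counter(doc_tokens)
    doc_len = max(len(doc_tokens), 1)  # avoid division by zero on empty documents
    score = 0.0
    for e in query_terms:
        # P(e | doc) = sum over source words f of P(e | f) * P(f | doc)
        p_doc = sum(trans_prob.get(f, {}).get(e, 0.0) * c / doc_len
                    for f, c in counts.items())
        score += math.log(alpha * background.get(e, 1e-6) + (1 - alpha) * p_doc)
    return score

# Toy Swahili-English table (illustrative values only).
toy_trans = {"maji": {"water": 0.9}, "chakula": {"food": 0.8}}
toy_background = {"water": 0.001, "food": 0.001}
docs = {"d1": ["maji", "safi"], "d2": ["chakula", "safi"]}
ranking = sorted(docs, reverse=True,
                 key=lambda d: query_log_likelihood(["water"], docs[d],
                                                    toy_trans, toy_background))
```

No relevance labels are used anywhere in this scoring: only the translation table, which is trained on parallel data, and simple corpus statistics.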

We report results on the retrieval of text and speech documents in three morphologically complex languages with limited training data resources (Swahili, Tagalog, and Somali) using short English queries. Despite training on only about 2M words of parallel data for each language, we obtain neural network translation models that are very effective for this task. We also obtain further improvements using (i) a modified relevance model, which uses the probability of occurrence of a translation of each query term in the source document, and (ii) confusion networks (instead of 1-best output) that encode multiple transcription alternatives in the output of an Automatic Speech Recognition (ASR) system.
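
One possible reading of "the probability of occurrence of a translation of each query term" is the probability that at least one position in the document translates to that term. The sketch below follows that reading and accepts confusion-network bins (several weighted word alternatives per position) in place of 1-best tokens. The independence assumption across positions, the data layout, the toy posteriors, and the function name are all hypothetical and are not taken from the paper.

```python
def occurrence_probability(query_term, bins, trans_prob):
    """P(at least one document position translates to query_term).

    bins is a list of positions; each position is a list of
    (source_word, posterior) alternatives. A text document or a 1-best
    transcript is the special case of one alternative with posterior 1.0.
    Positions are treated as independent, which is a simplifying assumption.
    """
    p_absent = 1.0
    for alternatives in bins:
        # Marginalize over the word alternatives at this position.
        p_here = sum(posterior * trans_prob.get(word, {}).get(query_term, 0.0)
                     for word, posterior in alternatives)
        p_absent *= 1.0 - min(p_here, 1.0)
    return 1.0 - p_absent

toy_table = {"maji": {"water": 0.9}}  # illustrative P(English | Swahili)
# Hypothetical ASR output where the 1-best word is a misrecognition.
one_best = [[("mji", 1.0)]]
confusion_network = [[("mji", 0.6), ("maji", 0.4)]]
p_one_best = occurrence_probability("water", one_best, toy_table)
p_network = occurrence_probability("water", confusion_network, toy_table)
```

In this toy case the 1-best transcript gives the query term no support, while the confusion network still assigns it some probability through the lower-ranked alternative.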

We achieve overall MAP relative improvements of up to 24% on Swahili, 50% on Tagalog, and 39% on Somali over the baseline probabilistic model, and larger improvements over monolingual retrieval from machine translation output.
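
For readers unfamiliar with the metric, the figures above are relative gains in mean average precision (MAP). A minimal sketch of how such a figure is computed follows; the rankings and scores are made up for illustration and are not results from the paper.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)

def mean_average_precision(runs):
    """runs is a list of (ranked_doc_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def relative_improvement(baseline_map, system_map):
    """Relative gain of system over baseline, e.g. 0.25 means 25%."""
    return (system_map - baseline_map) / baseline_map

# Made-up example: the same two queries ranked by a baseline and by a system.
baseline_runs = [(["d2", "d1", "d3"], {"d1"}), (["d4", "d5"], {"d5"})]
system_runs = [(["d1", "d2", "d3"], {"d1"}), (["d4", "d5"], {"d5"})]
gain = relative_improvement(mean_average_precision(baseline_runs),
                            mean_average_precision(system_runs))
```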

Index Terms

Neural-Network Lexical Translation for Cross-lingual IR from Text and Speech
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Neural networks
2. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
      1. Probabilistic retrieval models

Published in
SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2019
1512 pages
ISBN: 9781450361729
DOI: 10.1145/3331184
General Chairs: Benjamin Piwowarski (CNRS - Sorbonne Universite, France), Max Chevalier (Universite de Toulouse, CNRS, France), Eric Gaussier (Universite Grenoble Alpes, CNRS, France)
Program Chairs: Yoelle Maarek (Amazon Research, Israel), Jian-Yun Nie (University of Montreal, Canada), Falk Scholer (RMIT University, Australia)
Copyright © 2019 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cross-lingual information retrieval
machine translation
neural networks
probabilistic modeling
speech recognition
Qualifiers
- research-article
Acceptance Rates
SIGIR'19 Paper Acceptance Rate: 84 of 426 submissions, 20%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%