Finding next of kin: Cross-lingual embedding spaces for related languages

Serge Sharoff

doi:10.1017/S1351324919000354

Published online by Cambridge University Press: 04 September 2019

Affiliation: Centre for Translation Studies, University of Leeds, Leeds, UK
Corresponding author: Serge Sharoff. Email: s.sharoff@leeds.ac.uk

Abstract

Some languages have very few NLP resources, yet many of them are closely related to better-resourced languages. This paper explores how the similarity between related languages can be exploited to port resources from better-resourced to lesser-resourced languages. It introduces a way of building a representation shared across related languages by combining cross-lingual embedding methods with a lexical similarity measure based on the weighted Levenshtein distance. One outcome of the experiments is a Panslavonic embedding space for nine Balto-Slavonic languages. The paper demonstrates that the resulting embedding space helps in applications such as morphological prediction, named-entity recognition and genre classification.
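To make the lexical similarity component concrete, here is a minimal Python sketch of a weighted Levenshtein distance, in which some character substitutions cost less than others. The substitution weights, the `SUB_COSTS` table and the function names are illustrative assumptions and are not taken from the paper; the abstract does not say how the weights are chosen or how the measure is combined with the embeddings.

```python
from typing import Dict, Tuple

# Hypothetical substitution costs (assumption, not from the paper):
# pairs of letters that often correspond in related languages are cheap.
SUB_COSTS: Dict[Tuple[str, str], float] = {
    ("i", "y"): 0.2,
    ("o", "a"): 0.3,
    ("h", "g"): 0.3,
}


def sub_cost(a: str, b: str) -> float:
    """Cost of substituting character a with b (symmetric, default 1.0)."""
    if a == b:
        return 0.0
    return SUB_COSTS.get((a, b), SUB_COSTS.get((b, a), 1.0))


def weighted_levenshtein(s: str, t: str, indel: float = 1.0) -> float:
    """Edit distance with weighted substitutions and a flat insert/delete cost."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [j * indel for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * indel]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel,               # delete a from s
                curr[j - 1] + indel,           # insert b into s
                prev[j - 1] + sub_cost(a, b),  # substitute a with b
            ))
        prev = curr
    return prev[-1]


def lexical_similarity(s: str, t: str) -> float:
    """Map the distance to [0, 1], where 1.0 means identical strings."""
    if not s and not t:
        return 1.0
    return 1.0 - weighted_levenshtein(s, t) / max(len(s), len(t))
```

With these assumed weights, the transliterated pair "holova" and "golova" differs only by the cheap h/g substitution, so it scores as much more similar than it would under the unweighted distance, where every substitution costs 1.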

Keywords

Multilinguality; Text classification; Cross-lingual embeddings

Type: Article
Information: Natural Language Engineering, Volume 26, Issue 2, March 2020, pp. 163–182

DOI: https://doi.org/10.1017/S1351324919000354
Copyright: © Cambridge University Press 2019
