Finding next of kin: Cross-lingual embedding spaces for related languages

Published online by Cambridge University Press:  04 September 2019

Serge Sharoff*
Affiliation:
Centre for Translation Studies, University of Leeds, Leeds, UK
*Corresponding author. Email: s.sharoff@leeds.ac.uk

Abstract

Some languages have very few NLP resources, while many of them are closely related to better-resourced languages. This paper explores how the similarity between the languages can be utilised by porting resources from better- to lesser-resourced languages. The paper introduces a way of building a representation shared across related languages by combining cross-lingual embedding methods with a lexical similarity measure which is based on the weighted Levenshtein distance. One of the outcomes of the experiments is a Panslavonic embedding space for nine Balto-Slavonic languages. The paper demonstrates that the resulting embedding space helps in such applications as morphological prediction, named-entity recognition and genre classification.
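As a rough illustration of the lexical similarity component mentioned in the abstract, the sketch below computes a weighted Levenshtein distance in Python. It is not the paper's implementation: the function names, the `CHEAP_PAIRS` table and all cost values are hypothetical placeholders chosen for illustration, and the normalisation into a similarity score is likewise an assumption.

```python
from typing import Callable


def weighted_levenshtein(
    a: str,
    b: str,
    sub_cost: Callable[[str, str], float],
    indel_cost: float = 1.0,
) -> float:
    """Edit distance in which each substitution has its own cost."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + indel_cost,            # delete ca
                    curr[j - 1] + indel_cost,        # insert cb
                    prev[j - 1] + sub_cost(ca, cb),  # substitute (or match)
                )
            )
        prev = curr
    return prev[-1]


# Hypothetical table of character pairs treated as cheap substitutions.
# These pairs and the 0.25 cost are invented for this example only.
CHEAP_PAIRS = {frozenset(p) for p in [("i", "y"), ("o", "a"), ("h", "g")]}


def toy_sub_cost(x: str, y: str) -> float:
    """Illustrative substitution cost: 0 for a match, low for listed pairs."""
    if x == y:
        return 0.0
    if frozenset((x, y)) in CHEAP_PAIRS:
        return 0.25
    return 1.0


def lexical_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 minus distance over the longer word length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_levenshtein(a, b, toy_sub_cost) / longest
```

With a cost table of this kind, word pairs that differ only by regular correspondences score as closer than pairs that differ by arbitrary characters, which is the intuition behind using such a measure for related languages.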

Type
Article
Copyright
© Cambridge University Press 2019
