Towards a Map of the Syntactic Similarity of Languages

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10761)

Abstract

In this paper we propose a computational method for determining the syntactic similarity between languages. We investigate multiple approaches and metrics, showing that the results are consistent across methods. We report results on 16 languages belonging to various language families. The analysis that we conduct can be adapted to any language for which resources are available.

Notes

  1. The authors also present a brief history of the syntactic approaches and acknowledge the work of [30], a pioneer in this field and a forerunner of their approach.

  2. We use dependency parsing [21], so we rely on dependency trees to compute the syntactic similarity.

  3. Tagging and parsing accuracy, respectively, for each language: Bg: 0.96, 0.87; Cs: 0.98, 0.82; Da: 0.95, 0.82; De: 0.92, 0.67; El: 0.96, 0.82; En: 0.94, 0.85; Es: 0.95, 0.59; Et: 0.94, 0.80; Fi: 0.93, 0.75; Fr: 0.96, 0.61; Hu: 0.92, 0.79; It: 0.97, 0.47; Nl: 0.89, 0.77; Pt: 0.96, 0.83; Ro: 0.95, 0.82; Sv: 0.95, 0.84.

  4. We believe that our investigation is not negatively influenced by the choice of corpus, because we are consistent across all experiments in terms of text genre and we report only results obtained by comparing languages on the same dataset. In future work, we intend to apply the proposed methods to other datasets as well (for example, the EUR-Lex Corpus [2]).

  5. While the effect of translation cannot be denied, we rely on the fact that the interpreters/translators of Europarl are native speakers of the target language, which reduces the impact of the source language on translations significantly (as opposed to the translations performed by language learners, for example).

  6. We repeated Exp. #1 and #2 using the rank distance [7] instead of the edit distance, and there were no significant differences in the results.

  7. The development of Romanian far from the large Romance core made contact between Romanian and the other Romance languages difficult until the 18th century. Instead, from the 9th to the 17th century there was a significant cultural influence of the South Slavic languages (especially Old Slavic), due in part to the exclusive use of Old Church Slavonic for religious purposes, which led to South Slavic being given “the status of a cultural superstrate language” [36].

  8. www.bing.com/translator.

  9. www.translate.google.com.
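
Note 6 contrasts the edit distance with the rank distance. As an illustration only, the sketch below implements both on generic sequences. It assumes the standard unit-cost Levenshtein distance [22] and a commonly used string formulation of rank distance (symbols are indexed by occurrence, position differences are summed, and the position of any unmatched symbol is added). The function names and the part-of-speech sequences in the usage lines are hypothetical; how the paper derives its input sequences from the parsed corpus is not shown on this page.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Unit-cost Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i symbols of a
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def _indexed_positions(seq: Sequence[Hashable]) -> Dict[Tuple[Hashable, int], int]:
    """Map (symbol, occurrence number) to its 1-based position in seq."""
    counts: Dict[Hashable, int] = defaultdict(int)
    positions: Dict[Tuple[Hashable, int], int] = {}
    for pos, sym in enumerate(seq, start=1):
        counts[sym] += 1
        positions[(sym, counts[sym])] = pos
    return positions


def rank_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Rank distance on sequences (assumed string formulation, see lead-in)."""
    pa, pb = _indexed_positions(a), _indexed_positions(b)
    total = 0
    for key in pa.keys() | pb.keys():
        if key in pa and key in pb:
            total += abs(pa[key] - pb[key])  # shared indexed symbol
        else:
            total += pa.get(key, 0) + pb.get(key, 0)  # unmatched symbol
    return total


# Hypothetical part-of-speech sequences, used only to show the call shape.
tags_x = ["DET", "NOUN", "VERB", "DET", "NOUN"]
tags_y = ["DET", "NOUN", "DET", "NOUN", "VERB"]
print(edit_distance(tags_x, tags_y), rank_distance(tags_x, tags_y))
```

Both functions accept any sequence of hashable items, so the same code works on character strings and on lists of tags or dependency labels.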

References

  1. Augsten, N., Böhlen, M., Gamper, J.: Approximate matching of hierarchical data using pq-grams. In: Proceedings of VLDB 2005, pp. 301–312 (2005)

  2. Baisa, V., Michelfeit, J., Medved, M., Jakubícek, M.: European Union language resources in Sketch Engine. In: Proceedings of LREC 2016, pp. 2799–2803 (2016)

  3. Bortolussi, L., Sgarro, A., Dinu, L.P.: Measures of fuzzy disarray in linguistic typology. In: Proceedings of IPMU 2008, pp. 167–172 (2008)

  4. Callison-Burch, C., Koehn, P., Monz, C., Post, M., Soricut, R., Specia, L.: Findings of the 2012 workshop on statistical machine translation. In: Proceedings of WMT 2012, pp. 10–51 (2012)

  5. Charniak, E., Knight, K., Yamada, K.: Syntax-based language models for statistical machine translation. In: Proceedings of the 9th Machine Translation Summit (2003)

  6. Ciobanu, A.M., Dinu, L.P.: An etymological approach to cross-language orthographic similarity. Application on Romanian. In: Proceedings of EMNLP 2014, pp. 1047–1058 (2014)

  7. Dinu, A., Dinu, L.P.: On the syllabic similarities of Romance languages. In: Gelbukh, A. (ed.) CICLing 2005. LNCS, vol. 3406, pp. 785–788. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-30586-6_88

  8. Dryer, M.S.: 81 order of subject, object, and verb. In: The World Atlas of Language Structures, pp. 330–333 (2005)

  9. Duma, M., Vertan, C., Menzel, W.: A new syntactic metric for evaluation of machine translation. In: Proceedings of the ACL Student Research Workshop, pp. 130–135 (2013)

  10. Dunn, M., Greenhill, S., Levinson, S., Gray, R.: Evolved structure of language shows lineage-specific trends in word-order universals. Nature 473(7345), 79–82 (2011)

  11. Dunn, M., Terrill, A., Reesink, G., Foley, R., Levinson, S.: Structural phylogenetics and the reconstruction of ancient language history. Science 309(5743), 2072–2075 (2005)

  12. Eger, S., Hoenen, A., Mehler, A.: Language classification from bilingual word embedding graphs. In: Proceedings of COLING 2016, Technical Papers, pp. 3507–3518 (2016)

  13. Eger, S., Schenk, N., Mehler, A.: Towards semantic language classification: inducing and clustering semantic association networks from Europarl. In: Proceedings of *SEM 2015, pp. 127–136 (2015)

  14. Futrell, R., Mahowald, K., Gibson, E.: Quantifying word order freedom in dependency corpora. In: Proceedings of Depling 2015, pp. 91–100 (2015)

  15. Ganitkevitch, J., Cao, Y., Weese, J., Post, M., Callison-Burch, C.: Joshua 4.0: packing, PRO, and paraphrases. In: Proceedings of WMT 2012, pp. 283–291 (2012)

  16. Gao, Q., Vogel, S.: Parallel implementations of word alignment tool. In: Proceedings of SETQA-NLP 2008, pp. 49–57 (2008)

  17. Gray, R., Atkinson, Q.: Language tree divergences support the Anatolian theory of Indo-European origin. Nature 426, 435–439 (2003)

  18. Greenberg, J.H.: Language in the Americas. Stanford University Press, Stanford (1987)

  19. Johannsen, A., Hovy, D., Søgaard, A.: Cross-lingual syntactic variation over age and gender. In: Proceedings of CoNLL 2015, pp. 103–112 (2015)

  20. Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the 10th Machine Translation Summit, pp. 79–86 (2005)

  21. Kübler, S., McDonald, R., Nivre, J.: Dependency parsing. Synth. Lect. Hum. Lang. Technol. 1(1), 1–127 (2009)

  22. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 10, 707–710 (1965)

  23. Liu, D., Gildea, D.: Syntactic features for evaluation of machine translation. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 25–32 (2005)

  24. Longobardi, G., et al.: Across language families: genome diversity mirrors linguistic variation within Europe. Am. J. Phys. Anthropol. 157(4), 630–640 (2015)

  25. Longobardi, G., Guardiano, C.: Evidence for syntax as a signal of historical relatedness. Lingua 119(11), 1679–1706 (2009)

  26. Martins, A.F.T., Almeida, M.B., Smith, N.A.: Turning on the turbo: fast third-order non-projective turbo parsers. In: Proceedings of ACL 2013, Short Papers, vol. 2, pp. 617–622 (2013)

  27. McMahon, A., McMahon, R.: Finding families: quantitative methods in language classification. Trans. Philol. Soc. 101(1), 7–55 (2003)

  28. Nagata, R., Whittaker, E.: Reconstructing an Indo-European family tree from non-native English texts. In: Proceedings of ACL 2013, Long Papers, vol. 1, pp. 1137–1147 (2013)

  29. Nerbonne, J., Wiersma, W.: A measure of aggregate syntactic distance. In: Proceedings of the Workshop on Linguistic Distances, pp. 82–90 (2006)

  30. Nichols, J.: Linguistic Diversity in Space and Time. University of Chicago Press, Chicago (1992)

  31. Nichols, J., Warnow, T.: Tutorial on computational linguistic phylogeny. Lang. Linguist. Compass 2(5), 760–820 (2008)

  32. Niehues, J., Zhang, Y., Mediani, M., Herrmann, T., Cho, E., Waibel, A.: The Karlsruhe Institute of Technology translation systems for the WMT 2012. In: Proceedings of WMT 2012, pp. 349–355 (2012)

  33. Nivre, J.: Towards a universal grammar for natural language processing. In: Gelbukh, A. (ed.) CICLing 2015. LNCS, vol. 9041, pp. 3–16. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18111-0_1

  34. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)

  35. Petrov, S., Das, D., McDonald, R.T.: A universal part-of-speech tagset. In: Proceedings of LREC 2012, pp. 2089–2096 (2012)

  36. Schulte, K.: Loanwords in Romanian. In: Loanwords in the World’s Languages: A Comparative Handbook, pp. 230–259 (2009)

  37. Vilar, D.: DFKI’s SMT system for WMT 2012. In: Proceedings of WMT 2012, pp. 382–387 (2012)

  38. Zeman, D.: Data issues of the multilingual translation matrix. In: Proceedings of WMT 2012, pp. 395–400 (2012)

  39. Zhang, K., Shasha, D.: Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput. 18(6), 1245–1262 (1989)

Acknowledgments

We thank the anonymous reviewers for their helpful and constructive comments. This work was supported by a grant of the Romanian National Authority for Scientific Research and Innovation, CNCS/CCCDI UEFISCDI, project number PN-III-P2-2.1-53BG/2016, within PNCDI III.

Author information

Corresponding author

Correspondence to Liviu P. Dinu.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Ciobanu, A.M., Dinu, L.P., Sgarro, A. (2018). Towards a Map of the Syntactic Similarity of Languages. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol. 10761. Springer, Cham. https://doi.org/10.1007/978-3-319-77113-7_44

  • DOI: https://doi.org/10.1007/978-3-319-77113-7_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77112-0

  • Online ISBN: 978-3-319-77113-7

  • eBook Packages: Computer Science, Computer Science (R0)
