Text Mining pp 87–112

Towards a Network Model of the Coreness of Texts: An Experiment in Classifying Latin Texts Using the TTLab Latin Tagger

Abstract

The analysis of longitudinal corpora of historical texts requires the integrated development of tools for automatically preprocessing these texts and for building representation models of their genre- and register-related dynamics. In this chapter we present such a joint endeavor, which ranges from resource formation via preprocessing to network-based text representation and classification. We start by presenting the so-called TTLab Latin Tagger (TLT), which preprocesses texts of classical and medieval Latin. Its lexical resource, the Frankfurt Latin Lexicon (FLL), is also briefly introduced. As a first test case for showing the expressiveness of these resources, we perform a tripartite classification task of authorship attribution, genre detection and a combination thereof. To this end, we introduce a novel text representation model that explores the core structure (the so-called coreness) of lexical network representations of texts. Our experiment shows the expressiveness of this representation format and, indirectly, of our Latin preprocessor.
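The abstract does not spell out how the chapter derives coreness, so the following is only a minimal sketch of the general idea, assuming a plain word co-occurrence network and the standard k-core notion [42]. The function names `cooccurrence_graph` and `core_numbers`, the `window` parameter, and the sample sentence are illustrative assumptions, not part of the TLT or of the chapter's model. The peeling loop is a simple quadratic variant, not the O(m) algorithm of [43].

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link each token to the tokens that follow it within `window` positions.

    Hypothetical helper: the chapter's actual network induction may differ.
    """
    adj = defaultdict(set)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if v != w:  # ignore self-loops
                adj[w].add(v)
                adj[v].add(w)
    return adj


def core_numbers(adj):
    """Return the core number of every vertex.

    Repeatedly removes a vertex of minimum remaining degree; the largest
    minimum degree seen so far is the core number of the removed vertex.
    """
    degree = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


if __name__ == "__main__":
    tokens = "gallia est omnis divisa in partes tres".split()
    graph = cooccurrence_graph(tokens, window=2)
    for word, k in sorted(core_numbers(graph).items()):
        print(word, k)
```

A text could then be described by how its vocabulary is distributed over the core numbers; whether and how the chapter aggregates such values into features is not visible from this preview.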

Notes

  1. According to Anne Bohnenkamp-Renken, Goethe-House Frankfurt, personal communication.

  2. TTLab is an acronym that denotes the Frankfurt Text-technology lab (www.hucompute.org).

  3. The FLL results from a cooperation of historians and computer scientists; see the project website for more information: www.comphistsem.org.

  4. collex.hucompute.org.

  5. archives.nd.edu/whitaker/dictpage.htm.

  6. ramminger.userweb.mwn.de.

  7. la.wiktionary.org/wiki/Pagina_prima.

  8. Provided by Michael Trauth, Trier University. See also [36].

  9. collex.hucompute.org is our interface for this human computation of a Latin resource.

  10. www.dmgh.de/de/fs1/object/display/bsb00000820_meta:titlePage.html?sortIndex=020:030:0001:010:00:00.

  11. By the members of Bernhard Jussen’s lab at Goethe-University Frankfurt: Silke Schwandt, Tim Geelhaar, and colleagues.

  12. la.wiktionary.org/wiki/Pagina_prima.

  13. Models of text representation as small as the one introduced here are common in quantitative text linguistics; an early example is [40]. The difference is that, while these models mostly rely on well-established indices such as the type-token ratio (TTR) or the rate of hapax legomena, we aim to develop a completely new set of quantitative text characteristics based on the same notion of text organization.

  14. A test of this hypothesis will be the object of a forthcoming paper.

  15. Obviously, for any v: u(v) ≥ σ(v).

  16. For linguistic indicators of the monastic-scholastic distinction see [57, 58].

  17. The underlying texts of this corpus have been selected by Silke Schwandt from the Patrologia Latina [17]. They are accessible via the eHumanities Desktop (hudesktop.hucompute.org).

  18. Note that, beyond PoS tagging, this goal also requires semantic disambiguation and sense tagging [61], which the TLT does not yet provide.

References

  1. Heyer G (2014) Digital and computational humanities. www.dagstuhl.de/mat/Files/14/14301/14301.HeyerGerhard.ExtAbstract.pdf

  2. Hearst MA (1999) Untangling text data mining. In: Proceedings of ACL’99: the 37th annual meeting of the association for computational linguistics, University of Maryland

  3. Mehler A (2004) Textmining. In: Lobin H, Lemnitzer L (eds) Texttechnologie. Perspektiven und Anwendungen. Stauffenburg, Tübingen, pp 329–352

  4. de Saussure F (1916) Cours de linguistique générale. Payot, Lausanne/Paris

  5. Peirce CS (1993) Semiotische Schriften 1906–1913, vol 3. Suhrkamp, Frankfurt am

  6. Crane G, Wulfman C (2003) Towards a cultural heritage digital library. In: Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries (JCDL ’03), Washington. IEEE Computer Society, pp 75–86

  7. Bamman D, Passarotti M, Busa R, Crane G (2008) The annotation guidelines of the Latin Dependency Treebank and Index Thomisticus Treebank. In: Proceedings of LREC 2008, Marrakech, Morocco, ELRA

  8. Bamman D, Crane G (2009) Structured knowledge for low-resource languages: The Latin and Ancient Greek dependency treebanks. In: Proceedings of the text mining services 2009, Leipzig. Springer, New York

  9. Passarotti M (2010) Leaving behind the less-resourced status. The case of Latin through the experience of the Index Thomisticus Treebank. In: Proceedings of the 7th SaLTMiL workshop on the creation and use of basic lexical resources for less-resourced languages (LREC 2010), La Valletta, Malta, ELDA

  10. Gleim R, Hoenen A, Diewald N, Mehler A, Ernst A (2011) Modeling, building and maintaining lexica for corpus linguistic studies by example of Late Latin. In: Corpus Linguistics 2011, Birmingham, 20–22 July 2011

  11. Büchler M, Heyer G, Gründer S (2008) eAQUA–bringing modern text mining approaches to two thousand years old ancient texts. In: Proceedings of e-Humanities–An emerging discipline, workshop at the 4th IEEE international conference on e-Science

  12. Jussen B, Mehler A, Ernst A (2007) A corpus management system for historical semantics. Sprache und Datenverarbeitung. Int J Lang Data Proc 31(1–2):81–89

  13. Büchler M, Geßner A, Heyer G, Eckart T (2010) Detection of citations and text reuse on ancient Greek texts and its applications in the classical studies: eAQUA project. In: Proceedings of digital humanities 2010, London

  14. Mehler A, Schwandt S, Gleim R, Ernst A (2012) Inducing linguistic networks from historical corpora: Towards a new method in historical semantics. In: Durrell M et al (eds) Proceedings of the Conference on new methods in historical corpora, April 29–30, 2011, Manchester. Corpus linguistics and Interdisciplinary perspectives on language (CLIP). Narr, Tübingen, pp 257–274

  15. Crane G (1996) Building a digital library: the Perseus Project as a case study in the humanities. In: Proceedings of the first ACM international conference on Digital libraries (DL ’96), New York. ACM, USA, pp 3–10

  16. Smith DA, Rydberg-Cox JA, Crane GR (2000) The Perseus Project: A digital library for the humanities. Lit Linguistic Comput 15(1):15–25

  17. Jordan MD (ed) (1995) Patrologia latina database. Chadwyck-Healey, Cambridge

  18. Amancio DR, Antiqueira L, Pardo TAS, Costa LdF, Oliveira ON, Nunes MDGV (2008) Complex networks analysis of manual and machine translations. Int J Mod Phys C 19(4):583–598

  19. Amancio DR, Oliveira ON Jr, Costa LdF (2012) Identification of literary movements using complex networks to represent texts. New J Phys 14:043029

  20. Liu J, Wang J, Wang C (2008) A text network representation model. In: FSKD ’08: Proceedings of the 2008 fifth international conference on fuzzy systems and knowledge discovery, Washington. IEEE computer society, pp 150–154

  21. Mehler A (2008) Large text networks as an object of corpus linguistic studies. In: Lüdeling A, Kytö M (eds) Corpus Linguistics. An international handbook of the science of language and society. De Gruyter, Berlin, pp 328–382

  22. Koster CHA (2005) Constructing a parser for Latin. In: Gelbukh AF (ed) Proceedings of the 6th international conference on computational linguistics and intelligent text processing (CICLing 2005). LNCS, vol 3406. Springer, New York, pp 48–59

  23. Passarotti M, Dell’Orletta F (2010) Improvements in parsing the Index Thomisticus Treebank. Revision, combination and a feature model for medieval Latin. In: Proceedings of LREC 2010, Malta, ELDA

  24. Voutilainen A (1995) A syntax-based part-of-speech analyser. In: Proceedings of the 7th conference of the European chapter of the association for computational linguistics (EACL), Belfield, Ireland, pp 157–164

  25. Jurafsky D, Martin JH (2000) Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. Prentice Hall, Upper Saddle River

  26. Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT Press, Cambridge

  27. Ratnaparkhi A (1996) A maximum entropy model for part-of-speech tagging. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP). Philadelphia, Pennsylvania

  28. Tsochantaridis I, Joachims T, Hofmann T, Altun Y (2005) Large margin methods for structured and interdependent output variables. J Mach Learn Res 6:1453–1484

  29. Nguyen N, Guo Y (2007) Comparisons of sequence labeling algorithms and extensions. In: Proceedings of the 24th International conference on machine learning (ICML). ACM, New York

  30. Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th international conference on machine learning. St. Petersburg/Russia

  31. Constant M, Sigogne A (2011) MWU-aware part-of-speech tagging with a CRF model and lexical resources. In: MWE ’11 Proceedings of the workshop on multiword expressions: from parsing and generation to the real world. Stroudsburg, pp 49–56

  32. Simionescu R (2011) Hybrid POS tagger. In: Proceedings of the workshop on language resources and tools with industrial applications, Cluj-Napoca

  33. Mehler A, Gleim R, Waltinger U, Diewald N (2010) Time series of linguistic networks by example of the Patrologia Latina. In: Fähnrich KP, Franczyk B (eds) Proceedings of INFORMATIK 2010: service science, September 27–October 01, 2010, Leipzig. Volume 2 of Lecture Notes in Informatics, GI, pp 609–616

  34. Passarotti M (2000) Development and perspectives of the Latin morphological analyser LEMLAT (1). Linguistica Computazionale 3:397–414

  35. Schmid H (1994) Probabilistic part-of-speech tagging using decision trees. In: Jones D, Somers H (eds) New methods in language processing studies in computational linguistics. UCL Press, London

  36. Springmann U, Najock D, Morgenroth H, Schmid H, Gotscharek A, Fink F (2014) OCR of historical printings of Latin texts: problems, prospects, progress. In: Antonacopoulos A, Schulz KU (eds) Digital access to textual cultural heritage 2014 (DATeCH 2014), Madrid. ACM, May 19–20, pp 71–75

  37. Okazaki N (2007) CRFsuite: a fast implementation of conditional random fields (CRFs). http://www.chokkan.org/software/crfsuite/manual.html

  38. Zipf GK (1972) Human behavior and the principle of least effort. An introduction to human ecology. Hafner Publishing, New York

  39. Panhuis DG (2009) Latin grammar. University of Michigan Press, Ann Arbor

  40. Liiv H, Tuldava J (1993) On classifying texts with the help of cluster analysis. In: Hřebíček L, Altmann G (eds) Quantitative text analysis. Wissenschaftlicher Verlag, Trier, pp 253–262

  41. Schuhmacher M, Ponzetto SP (2014) Knowledge-based graph document modeling. In: Proceedings of the 7th ACM international conference on web search and data mining (WSDM ’14), New York. ACM, pp 543–552

  42. Seidman SB (1983) Network structure and minimum degree. Soc Networks 5:269–287

  43. Batagelj V, Zaveršnik M (2003) An O(m) algorithm for cores decomposition of networks. http://vlado.fmf.uni-lj.si/vlado/vladounp.html. arXiv:cs/0310049

  44. Ashraf M, Sinha S (2012) Core-periphery organization of graphemes in written sequences: decreasing positional rigidity with increasing core order. In: Gelbukh A (ed) Computational linguistics and intelligent text processing. Lecture notes in computer science, vol 7181. Springer, New York, pp 142–153

  45. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174

  46. Giatsidis C, Thilikos DM, Vazirgiannis M (2011) Evaluating cooperation in communities with the k-core structure. In: Proceedings of the 2011 international conference on advances in social networks analysis and mining (ASONAM ’11), Washington. IEEE Computer Society, pp 87–93

  47. Alvarez-Hamelin JI, Dall’Asta L, Barrat A, Vespignani A (2008) k-core decomposition of internet graphs: hierarchies, self-similarity and measurement biases. Net Heterogeneous Media 3(2):371–393

  48. Halliday MAK, Hasan R (1989) Language, context, and text: aspects of language in a social-semiotic perspective. Oxford University Press, Oxford

  49. Dehmer M (2008) Information processing in complex networks: Graph entropy and information functionals. Appl Math Comput 201:82–94

  50. Dehmer M, Mowshowitz A (2011) A history of graph entropy measures. Inform Sci 181(1):57–78

  51. Mehler A (2011) A quantitative graph model of social ontologies by example of Wikipedia. In: Dehmer M, Emmert-Streib F, Mehler A (eds) Towards an information theory of complex networks: statistical methods and applications. Birkhäuser, Boston, pp 259–319

  52. Cover TM, Thomas JA (2006) Elements of information theory. Wiley-Interscience, Hoboken

  53. Botafogo RA, Rivlin E, Shneiderman B (1992) Structural analysis of hypertexts: identifying hierarchies and useful metrics. ACM Trans Infor Syst 10(2):142–180

  54. Mehler A (2008) Structural similarities of complex networks: A computational model by example of wiki graphs. Appl Artif Intell 22(7,8):619–683

  55. Mehler A, Pustylnikov O, Diewald N (2011) Geography of social ontologies: testing a variant of the Sapir-Whorf Hypothesis in the context of Wikipedia. Comput Speech Lang 25(3):716–740

  56. Pieper U (1975) Differenzierung von Texten nach numerischen Kriterien. Folia Linguistica VII:61–113

  57. Frank-Job B (1994) Die Textgestalt als Zeichen. Lateinische Handschriftentradition und die Verschriftlichung der romanischen Sprachen, ScriptOralia, vol 67. Narr, Tübingen

  58. Frank-Job B (2003) Diskurstraditionen im Verschriftlichungsprozeß der romanischen Sprachen. In: Aschenberg H, Wilhelm R (eds) Romanische Sprachgeschichte und Diskurstraditionen. Narr, Tübingen, pp 19–35

  59. Köhler R, Galle M (1993) Dynamic aspects of text characteristics. In: Hřebíček L, Altmann G (eds) Quantitative text analysis. Wissenschaftlicher Verlag, Trier, pp 46–53

  60. McCarthy PM, Jarvis S (2010) MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behav Res Methods 42(2):381–392

  61. Schütze H (1998) Automatic word sense discrimination. Computat Linguistics 24(1):97–123

  62. Stamatatos E (2011) Plagiarism detection based on structural information. In: Proceedings of the 20th ACM international conference on information and knowledge management (CIKM ’11), New York. ACM, pp 1221–1230

  63. Evert S (2008) Corpora and collocations. In: Lüdeling A, Kytö M (eds) Corpus linguistics. An international handbook of the science of language and society. Mouton de Gruyter, Berlin, pp 1212–1248

  64. Miller GA (1956) The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychol Rev 63:81–97

  65. van Dijk TA, Kintsch W (1983) Strategies of Discourse Comprehension. Academic Press, New York

  66. Rieger B (1998) Warum fuzzy Linguistik? Überlegungen und Ansätze zu einer computerlinguistischen Neuorientierung. In: Krallmann D, Schmitz HW (eds) Perspektiven einer Kommunikationswissenschaft. Internationales Gerold Ungeheuer Symposium, Essen 1995. Nodus, Münster, pp 153–183

Acknowledgements

Financial support by the BMBF project Computational Historical Semantics (www.comphistsem.org) as part of the research center on Digital Humanities is gratefully acknowledged. We also thank both anonymous reviewers for their valuable comments, Barbara Job and Silke Schwandt for their helpful advice on distinguishing monastic and scholastic texts, and, finally, Andy Lücking for assisting in producing TikZ graphics.

Author information

Corresponding author

Correspondence to Alexander Mehler.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Mehler, A., vor der Brück, T., Gleim, R., Geelhaar, T. (2014). Towards a Network Model of the Coreness of Texts: An Experiment in Classifying Latin Texts Using the TTLab Latin Tagger. In: Biemann, C., Mehler, A. (eds) Text Mining. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-12655-5_5

  • DOI: https://doi.org/10.1007/978-3-319-12655-5_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12654-8

  • Online ISBN: 978-3-319-12655-5

  • eBook Packages: Computer Science (R0)
