Structural Differentiae of Text Types – A Quantitative Model

  • Conference paper
Data Analysis, Machine Learning and Applications

Abstract

The categorization of natural language texts is a well-established research field in computational and quantitative linguistics (Joachims 2002). In most cases, the vector space model is applied as a bag-of-words approach: lexical features are extracted from input texts to train a categorization model that then assigns, for example, authorship or topic categories. In parallel, there have been efforts to categorize texts by structural rather than lexical features. More specifically, quantitative text characteristics have been computed to derive a structural text signature that nevertheless allows reliable text categorization (Kelih & Grzybek 2005; Pieper 1975). This “bag of features” approach is regaining attention for the categorization of websites and other document types whose structure is far more complex than a simple tree. Here we present a novel approach to structural classifiers that systematically computes structural signatures of documents. In summary, we present a text categorization algorithm that, in the absence of any lexical features, still classifies remarkably well even when the classes are defined thematically.
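The idea of a structural text signature can be illustrated with a minimal Python sketch. It is not the authors' feature set or algorithm, which the abstract does not specify. The function name `structural_signature`, the four features chosen, and the naive regex-based sentence and paragraph splitting are all assumptions made for illustration. The point is only that the resulting vector depends on how a text is segmented, not on which words it contains.

```python
import re
import statistics


def structural_signature(text: str) -> list[float]:
    """Return a small vector of purely structural features.

    Hypothetical feature set (an assumption, not taken from the paper):
    paragraph count, mean sentences per paragraph, mean words per
    sentence, and the standard deviation of words per sentence.
    """
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]

    sentences_per_paragraph = []
    words_per_sentence = []
    for paragraph in paragraphs:
        # Naive sentence split: whitespace that follows ., ! or ?
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
        sentences_per_paragraph.append(len(sentences))
        # Only token counts are kept; the tokens themselves are discarded.
        words_per_sentence.extend(len(s.split()) for s in sentences)

    if not words_per_sentence:
        return [0.0, 0.0, 0.0, 0.0]

    return [
        float(len(paragraphs)),
        statistics.fmean(sentences_per_paragraph),
        statistics.fmean(words_per_sentence),
        statistics.pstdev(words_per_sentence),
    ]
```

Two texts with the same segmentation but entirely different vocabulary yield the same vector, which is what makes such a signature independent of lexical content. Vectors of this kind could then be passed to any standard classifier in place of bag-of-words features.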

References

  • ALTMANN, G. (1988): Wiederholungen in Texten. Brockmeyer, Bochum.
  • BIBER, D. (1995): Dimensions of Register Variation: A Cross-Linguistic Comparison. University Press, Cambridge.
  • BOCK, H.H. (1974): Automatische Klassifikation. Theoretische und praktische Methoden zur Gruppierung und Strukturierung von Daten (Cluster-Analyse). Vandenhoeck & Ruprecht, Göttingen.
  • GLEIM, R.; MEHLER, A.; DEHMER, M.; PUSTYLNIKOV, O. (2007): Isles Through the Category Forest – Utilising the Wikipedia Category System for Corpus Building in Machine Learning. In: WEBIST ’07, WIA(2). Barcelona, Spain, 142-149.
  • JOACHIMS, T. (2002): Learning to classify text using support vector machines. Kluwer, Boston/Dordrecht/London.
  • KELIH, E.; GRZYBEK, P. (2005): Satzlänge: Definitionen, Häufigkeiten, Modelle (Am Beispiel slowenischer Prosatexte). In: LDV-Forum 20(2), 31-51.
  • MEHLER, A.; GEIBEL, P.; GLEIM, R.; HEROLD, S.; JAIN, B.; PUSTYLNIKOV, O. (2006): Much Ado About Text Content. Learning Text Types Solely by Structural Differentiae. In: OTT’06.
  • MEHLER, A.; GEIBEL, P.; PUSTYLNIKOV, O.; HEROLD, S. (2007): Structural Classifiers of Text Types. To appear in: LDV Forum.
  • PIEPER, U. (1975): Differenzierung von Texten nach Numerischen Kriterien. In: Folia Linguistica VII, 61-113.
  • RIEGER, B. (1989): Unscharfe Semantik: Die empirische Analyse, quantitative Beschreibung, formale Repräsentation und prozedurale Modellierung vager Wortbedeutungen in Texten. Peter Lang, Frankfurt a. M.
  • SÜDDEUTSCHER VERLAG (2004): Süddeutsche Zeitung 1994-2003. 10 Jahre auf DVD. München.

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pustylnikov, O., Mehler, A. (2008). Structural Differentiae of Text Types – A Quantitative Model. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78246-9_77
