Structural Differentiae of Text Types – A Quantitative Model

Pustylnikov, Olga; Mehler, Alexander

doi:10.1007/978-3-540-78246-9_77

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

The categorization of natural language texts is a well-established research field in computational and quantitative linguistics (Joachims 2002). In the majority of cases, the vector space model is applied in a bag-of-words fashion: lexical features are extracted from input texts in order to train a categorization model and thus to assign, for example, authorship or topic categories. In parallel, there has been some effort to categorize texts by structural rather than lexical features. More specifically, quantitative text characteristics have been computed in order to derive a kind of structural text signature that nevertheless allows reliable text categorization (Kelih & Grzybek 2005; Pieper 1975). This “bag of features” approach is regaining attention for the categorization of websites and other document types whose structure is far from the simplicity of tree-like structures. Here we present a novel approach to structural classifiers which systematically computes structural signatures of documents. In summary, we present a text categorization algorithm which, in the absence of any lexical features, nevertheless classifies remarkably well, even if the classes are thematically defined.
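
To make the notion of a structural text signature concrete, the following minimal Python sketch derives a small feature vector from the shape of a document alone, ignoring every word in it. It is an illustration under stated assumptions, not the model of this chapter: the abstract does not list the features actually used, so the five quantities below (node count, leaf count, maximum depth, mean depth, mean fan-out), the function name `structural_signature`, and the toy XML document are all hypothetical, and the sketch assumes a tree-shaped XML input.

```python
import xml.etree.ElementTree as ET
from statistics import mean


def structural_signature(xml_text):
    """Return structure-only features of one XML document (illustrative choice)."""
    root = ET.fromstring(xml_text)
    depths = []   # depth of every element, with the root at depth 1
    fanouts = []  # number of children of every element that has children
    stack = [(root, 1)]
    while stack:
        node, depth = stack.pop()
        depths.append(depth)
        children = list(node)
        if children:
            fanouts.append(len(children))
        stack.extend((child, depth + 1) for child in children)
    n_nodes = len(depths)
    return {
        "nodes": n_nodes,
        "leaves": n_nodes - len(fanouts),
        "max_depth": max(depths),
        "mean_depth": mean(depths),
        "mean_fanout": mean(fanouts) if fanouts else 0.0,
    }


# Hypothetical toy document: text > divisions > paragraphs > sentences.
# Element names carry no weight; only the tree shape is measured.
doc = "<text><div><p><s/><s/></p><p><s/></p></div><div><p><s/></p></div></text>"
sig = structural_signature(doc)
print(sig)
```

Vectors of this kind, computed for every document in a corpus, could then be passed to any standard classifier in place of word counts, which is the sense in which such an approach works without lexical features.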

References

ALTMANN, G. (1988): Wiederholungen in Texten. Brockmeyer, Bochum.
BIBER, D. (1995): Dimensions of Register Variation: A Cross-Linguistic Comparison. University Press, Cambridge.
BOCK, H.H. (1974): Automatische Klassifikation. Theoretische und praktische Methoden zur Gruppierung und Strukturierung von Daten (Cluster-Analyse). Vandenhoeck & Ruprecht, Göttingen.
GLEIM, R.; MEHLER, A.; DEHMER, M.; PUSTYLNIKOV, O. (2007): Isles Through the Category Forest — Utilising the Wikipedia Category System for Corpus Building in Machine Learning. In: WEBIST ’07, WIA(2). Barcelona, Spain, 142-149.
JOACHIMS, T. (2002): Learning to classify text using support vector machines. Kluwer, Boston/Dordrecht/London.
KELIH, E.; GRZYBEK, P. (2005): Satzlänge: Definitionen, Häufigkeiten, Modelle (Am Beispiel slowenischer Prosatexte). In: LDV-Forum 20(2), 31-51.
MEHLER, A.; GEIBEL, P.; GLEIM, R.; HEROLD, S.; JAIN, B.; PUSTYLNIKOV, O. (2006): Much Ado About Text Content. Learning Text Types Solely by Structural Differentiae. In: OTT’06.
MEHLER, A.; GEIBEL, P.; PUSTYLNIKOV, O.; HEROLD, S. (2007): Structural Classifiers of Text Types. To appear in: LDV Forum.
PIEPER, U. (1975): Differenzierung von Texten nach Numerischen Kriterien. In: Folia Linguistica VII, 61-113.
RIEGER, B. (1989): Unscharfe Semantik: Die empirische Analyse, quantitative Beschreibung, formale Repräsentation und prozedurale Modellierung vager Wortbedeutungen in Texten. Peter Lang, Frankfurt a. M.
SÜDDEUTSCHER VERLAG (2004). Süddeutsche Zeitung 1994-2003. 10 Jahre auf DVD. München.

Author information

Authors and Affiliations

Faculty of Linguistics and Literature Study, University of Bielefeld, Germany
Olga Pustylnikov & Alexander Mehler

Editor information

Editors and Affiliations

Institute of Computer Science and Institute of Business Economics and Information Systems, University of Hildesheim, Marienburgerplatz 22, 31141, Hildesheim, Germany
Christine Preisach
Lehrstuhl für Mustererkennung und Bildverarbeitung, Universität Freiburg, Gebäude 052, 79110, Freiburg i. Br, Germany
Hans Burkhardt
Institute of Computer Science and Institute of Business Economics and Information Systems, Marienburgerplatz 22, 31141, Hildesheim, Germany
Lars Schmidt-Thieme
Fakultät für Wirtschaftswissenschaften, Lehrstuhl für Betriebswirtschaftslehre, insbes. Marketing, Universitätsstraße 25, 33615, Bielefeld, Germany
Reinhold Decker

Cite this paper

Pustylnikov, O., Mehler, A. (2008). Structural Differentiae of Text Types – A Quantitative Model. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78246-9_77

DOI: https://doi.org/10.1007/978-3-540-78246-9_77
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78239-1
Online ISBN: 978-3-540-78246-9
eBook Packages: Mathematics and Statistics; Mathematics and Statistics (R0)
