Towards Structure-sensitive Hypertext Categorization

Mehler, Alexander; Gleim, Rüdiger; Dehmer, Matthias

doi:10.1007/3-540-31314-1_49

Conference paper

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

Hypertext categorization is the task of automatically assigning category labels to hypertext units. Like text categorization, it falls within the area of function learning based on the bag-of-features approach. This scenario faces the problem of a many-to-many relation between websites and their hidden logical document structure. The paper argues that this relation is a prevalent characteristic which interferes with any effort to apply the classical apparatus of categorization to web genres. This is confirmed by a threefold experiment in hypertext categorization. To outline a solution to this problem, the paper sketches an alternative method of unsupervised learning which aims at bridging the gap between statistical and structural pattern recognition (Bunke et al. 2001) in the area of web mining.
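
As a rough illustration of why a bag-of-features representation can miss document structure, the following minimal Python sketch flattens two websites into feature counts. It is not the paper's method: the function name `bag_of_features`, the two sites, and their tokens are all hypothetical and chosen only for this example.

```python
from collections import Counter

def bag_of_features(pages):
    """Flatten a website (a list of pages, each a list of tokens)
    into a single feature-count vector, discarding page boundaries."""
    bag = Counter()
    for tokens in pages:
        bag.update(tokens)
    return bag

# Two hypothetical sites (assumed data): identical content,
# but segmented into pages in different ways.
site_a = [["home", "staff", "contact"], ["publications", "projects"]]
site_b = [["home"], ["staff", "contact", "publications"], ["projects"]]

bag_a = bag_of_features(site_a)
bag_b = bag_of_features(site_b)

# The bags coincide although the page structure differs.
same_bag = bag_a == bag_b
same_structure = [len(p) for p in site_a] == [len(p) for p in site_b]
print(same_bag, same_structure)  # prints "True False"
```

A classifier that sees only such bags cannot tell the two sites apart, which is one simple way to picture the structural information that the abstract says the classical approach leaves out.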

References

AMITAY, E. and CARMEL, D. and DARLOW, A. and LEMPEL, R. and SOFFER, A. (2003): The connectivity sonar. Proc. of the 14th ACM Conference on Hypertext, 28–47.
BOCK, H.H. (1974): Automatische Klassifikation. Vandenhoeck & Ruprecht, Göttingen.
BUNKE, H. and GÜNTER, S. and JIANG, X. (2001): Towards bridging the gap between statistical and structural pattern recognition. Proc. of the 2nd Int. Conf. on Advances in Pattern Recognition, Berlin, Springer, 1–11.
CHAKRABARTI, S. and DOM, B. and INDYK, P. (1998): Enhanced hypertext categorization using hyperlinks. Proc. of ACM SIGMOD, International Conf. on Management of Data, ACM Press, 307–318.
DEHMER, M. and MEHLER, A. (2004): A new method of similarity measuring for a specific class of directed graphs. Submitted to Tatra Mountain Journal, Slovakia.
FÜRNKRANZ, J. (2002): Hyperlink ensembles: a case study in hypertext classification. Information Fusion, 3(4), 299–312.
GIBSON, D. and KLEINBERG, J. and RAGHAVAN, P. (1998): Inferring web communities from link topology. Proc. of the 9th ACM Conf. on Hypertext, 225–234.
GLEIM, R. (2005): Ein Framework zur Extraktion, Repräsentation und Analyse webbasierter Hypertexte. Proc. of GLDV’05, 42–53.
HSU, C.-W. and CHANG, C.-C. and LIN, C.-J. (2003): A practical guide to SVM classification. Technical report, Department of Computer Science and Information Technology, National Taiwan University.
JOACHIMS, T. (2002): Learning to classify text using support vector machines. Kluwer, Boston.
JOACHIMS, T. and CRISTIANINI, N. and SHAWE-TAYLOR, J. (2001): Composite kernels for hypertext categorisation. Proc. of the 11th ICML, 250–257.
KLEINBERG, J. (1999): Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632.
KOSALA, R. and BLOCKEEL, H. (2000): Web mining research: A survey. SIGKDD Explorations, 2(1), 1–15.
MEHLER, A. and DEHMER, M. and GLEIM, R. (2004): Towards logical hypertext structure: a graph-theoretic perspective. Proc. of I2CS’04, Berlin, Springer.
MIZUUCHI, Y. and TAJIMA, K. (1999): Finding context paths for web pages. Proc. of the 10th ACM Conference on Hypertext and Hypermedia, 13–22.
REHM, G. (2002): Towards automatic web genre identification. Proc. of the Hawai’i Int. Conf. on System Sciences.
RIEGER, B. (1989): Unscharfe Semantik. Peter Lang, Frankfurt a.M.
YANG, Y. (1999): An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, 1(1/2), 67–88.
YANG, Y. and SLATTERY, S. and GHANI, R. (2002): A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2–3), 219–241.
YOSHIOKA, T. and HERMAN, G. (2000): Coordinating information using genres. Technical report, Massachusetts Institute of Technology.

Author information

Authors and Affiliations

Universität Bielefeld, 33615, Bielefeld, Germany
Alexander Mehler & Rüdiger Gleim
Technische Universität Darmstadt, 64289, Darmstadt, Germany
Matthias Dehmer

Editor information

Editors and Affiliations

Institut für Technische und Betriebliche Informationssysteme, Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, 39106, Magdeburg, Germany
Myra Spiliopoulou
Institut für Wissens- und Sprachverarbeitung, Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, 39106, Magdeburg, Germany
Rudolf Kruse, Christian Borgelt & Andreas Nürnberger
Institut für Entscheidungstheorie und Unternehmensforschung, Universität Karlsruhe (TH), 76128, Karlsruhe
Wolfgang Gaul

About this paper

Cite this paper

Mehler, A., Gleim, R., Dehmer, M. (2006). Towards Structure-sensitive Hypertext Categorization. In: Spiliopoulou, M., Kruse, R., Borgelt, C., Nürnberger, A., Gaul, W. (eds) From Data and Information Analysis to Knowledge Engineering. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31314-1_49

DOI: https://doi.org/10.1007/3-540-31314-1_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31313-7
Online ISBN: 978-3-540-31314-4
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
