Skip to main content

Towards Structure-sensitive Hypertext Categorization

  • Conference paper
  • 2192 Accesses

Abstract

Hypertext categorization is the task of automatically assigning category labels to hypertext units. Comparable to text categorization it stays in the area of function learning based on the bag-of-features approach. This scenario faces the problem of a many-to-many relation between websites and their hidden logical document structure. The paper argues that this relation is a prevalent characteristic which interferes any effort of applying the classical apparatus of categorization to web genres. This is confirmed by a threefold experiment in hypertext categorization. In order to outline a solution to this problem, the paper sketches an alternative method of unsupervised learning which aims at bridging the gap between statistical and structural pattern recognition (Bunke et al. 2001) in the area of web mining.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   159.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  • AMITAY, E. and CARMEL, D. and DARLOW, A. and LEMPEL, R. and SOFFER, A. (2003): The connectivity sonar. Proc. of the 14th ACM Conference on Hypertext, 28–47.

    Google Scholar 

  • BOCK, H.H. (1974): Automatische Klassifikation. Vandenhoeck & Ruprecht, Göttingen.

    Google Scholar 

  • BUNKE, H. and GÜNTER, S. and JIANG, X. (2001): Towards bridging the gap between statistical and structural pattern recognition. Proc. of the 2nd Int. Conf. on Advances in Pattern Recognition, Berlin, Springer, 1–11.

    Google Scholar 

  • CHAKRABARTI, S. and DOM, B. and INDYK, P. (1998): Enhanced hypertext categorization using hyperlinks. Proc. of ACM SIGMOD, International Conf. on Management of Data, ACM Press, 307–318.

    Google Scholar 

  • DEHMER, M. and MEHLER, A. (2004): A new method of similarity measuring for a specific class of directed graphs. Submitted to Tatra Mountain Journal, Slovakia.

    Google Scholar 

  • FÜRNKRANZ, J. (2002): Hyperlink ensembles: a case study in hypertext classification. Information Fusion, 3(4), 299–312.

    Article  Google Scholar 

  • GIBSON, D. and KLEINBERG, J. and RAGHAVAN, P. (1998): Inferring web communities from link topology. Proc. of the 9th ACM Conf. on Hypertext, 225–234.

    Google Scholar 

  • GLEIM, R. (2005): Ein Framework zur Extraktion, Repräsentation und Analyse webbasierter Hypertexte, Proc. of GLDV’ 05, 42–53.

    Google Scholar 

  • HSU, C.-W. and CHANG, C.-C. and LIN, C.-J. (2003): A practical guide to SVM classification. Technical report, Department of Computer Science and Information Technology, National Taiwan University.

    Google Scholar 

  • JOACHIMS, T. (2002): Learning to classify text using support vector machines. Kluwer, Boston, 2002.

    Google Scholar 

  • JOACHIMS, T. and CRISTIANINI, N. and SHAWE-TAYLOR, J. (2001): Composite kernels for hypertext categorisation. Proc. of the 11th ICML, 250–257.

    Google Scholar 

  • KLEINBERG, J. (1999): Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632

    Article  MATH  MathSciNet  Google Scholar 

  • KOSALA, R. and BLOCKEEL, H. (2000): Web mining research: A survey. SIGKDD Explorations, 2(1), 1–15.

    Google Scholar 

  • MEHLER, A. and DEHMER, M. and GLEIM, R. (2004): Towards logical hypertext structure — a graph-theoretic perspective. Proc. of I2CS’ 04, Berlin, Springer.

    Google Scholar 

  • MIZUUCHI, Y. and TAJIMA, K. (1999): Finding context paths for web pages. Proc. of the 10th ACM Conference on Hypertext and Hypermedia, 13–22.

    Google Scholar 

  • REHM, G. (2002): Towards automatic web genre identification. Proc. of the Hawai’i Int. Conf. on System Sciences.

    Google Scholar 

  • RIEGER, B. (1989): Unscharfe Semantik. Peter Lang, Frankfurt a.M.

    Google Scholar 

  • YANG, Y. (1999): An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, 1,1/2, 67–88.

    MATH  Google Scholar 

  • YANG, Y. and SLATTERY, S. and GHANI, R. (2002): A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2–3), 219–241.

    Google Scholar 

  • YOSHIOKA, T. and HERMAN, G. (2000): Coordinating information using genres. Technical report, Massachusetts Institute of Technology.

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2006 Springer Berlin · Heidelberg

About this paper

Cite this paper

Mehler, A., Gleim, R., Dehmer, M. (2006). Towards Structure-sensitive Hypertext Categorization. In: Spiliopoulou, M., Kruse, R., Borgelt, C., Nürnberger, A., Gaul, W. (eds) From Data and Information Analysis to Knowledge Engineering. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31314-1_49

Download citation

Publish with us

Policies and ethics