DOI: 10.1145/1526709.1526880
poster

Purely URL-based topic classification

Published: 20 April 2009

ABSTRACT

Given only the URL of a web page, can we identify its topic? This is the question we examine in this paper. Web pages are usually classified by their content, but a URL-only classifier is preferable (i) when speed is crucial, (ii) to enable content filtering before an objectionable web page is downloaded, (iii) when a page's content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser without fetching the target page, and (v) when a focused crawler wants to infer the topic of a target page before devoting bandwidth to downloading it. We apply a machine learning approach to the topic identification task and evaluate its performance in extensive experiments on categorized web pages from the Open Directory Project (ODP). When training a separate binary classifier for each topic, we achieve typical F-measure values between 80 and 85 and a typical precision of around 85. We also ran experiments on a small data set of university web pages. For the task of classifying these pages into faculty, student, course, and project pages, our methods improve over previous approaches by 13.8 points of F-measure.
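The abstract names neither the features nor the learning algorithm, so the following is only an illustrative sketch of the setup it describes: one binary classifier per topic, trained on URLs alone and scored by F-measure. The token features, the naive Bayes model, and all names (`url_tokens`, `BinaryUrlClassifier`, `f_measure`, the example URLs) are assumptions made for illustration, not the authors' method.

```python
import math
import re
from collections import Counter

# Assumed stop tokens; the paper's actual preprocessing is not given in the abstract.
STOP = {"http", "https", "www"}

def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens (an assumed feature set)."""
    parts = re.split(r"[^a-z]+", url.lower())
    return [t for t in parts if len(t) > 1 and t not in STOP]

class BinaryUrlClassifier:
    """Topic-vs-rest multinomial naive Bayes over URL tokens (illustrative choice)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            self.counts[label].update(url_tokens(url))
            self.docs[label] += 1
        return self

    def predict(self, url):
        vocab = set(self.counts[True]) | set(self.counts[False])
        n_docs = self.docs[True] + self.docs[False]
        scores = {}
        for label in (True, False):
            total = sum(self.counts[label].values())
            # Log prior and log likelihoods, both with add-one smoothing.
            score = math.log((self.docs[label] + 1) / (n_docs + 2))
            for tok in url_tokens(url):
                score += math.log((self.counts[label][tok] + 1) / (total + len(vocab) + 1))
            scores[label] = score
        return scores[True] > scores[False]

def f_measure(predicted, actual):
    """Harmonic mean of precision and recall, as a fraction in [0, 1]."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy training data for a hypothetical "sports" topic.
train_urls = [
    "http://www.espn.com/football/scores",
    "http://sports.example.org/football/league",
    "http://www.cooking.example.org/recipes/pasta",
    "http://news.example.com/recipes/politics",
]
train_labels = [True, True, False, False]
sports = BinaryUrlClassifier().fit(train_urls, train_labels)
```

With one such classifier trained per topic, a page's URL is passed to each classifier in turn. The abstract reports F-measure on a 0 to 100 scale, which corresponds to multiplying the value returned by `f_measure` by 100.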

Published in: WWW '09: Proceedings of the 18th international conference on World wide web
April 2009
1280 pages
ISBN: 9781605584874
DOI: 10.1145/1526709

Copyright © 2009 Copyright is held by the author/owner(s)

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 1,899 of 8,196 submissions (23%)
