poster

Purely URL-based topic classification

Authors:
Eda Baykan (Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland)
Monika Henzinger (Ecole Polytechnique Fédérale de Lausanne & Google Zürich, Lausanne, Switzerland)
Ludmila Marian (Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland)
Ingmar Weber (Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland)

WWW '09: Proceedings of the 18th international conference on World wide web, April 2009, Pages 1109–1110
https://doi.org/10.1145/1526709.1526880

Published: 20 April 2009

ABSTRACT

Given only the URL of a web page, can we identify its topic? This is the question we examine in this paper. Web pages are usually classified using their content, but a URL-only classifier is preferable (i) when speed is crucial, (ii) to enable content filtering before an (objectionable) web page is downloaded, (iii) when a page's content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser without fetching the target page, and (v) when a focused crawler wants to infer the topic of a target page before devoting bandwidth to downloading it. We apply a machine learning approach to the topic identification task and evaluate its performance in extensive experiments on categorized web pages from the Open Directory Project (ODP). When training separate binary classifiers for each topic, we achieve typical F-measure values between 80 and 85, and a typical precision of around 85. We also ran experiments on a small data set of university web pages. For the task of classifying these pages into faculty, student, course, and project pages, our methods improve over previous approaches by 13.8 points of F-measure.
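
To make the setup in the abstract concrete, the following is a minimal sketch of one binary, URL-only topic classifier together with the precision and F-measure computation. The abstract does not name the features or the learning algorithm, so everything here is an assumption chosen for illustration: the token-based features, the stop list, the naive Bayes model with add-one smoothing, and the toy `example.com` URLs are all hypothetical and are not the authors' method or data.

```python
import math
import re
from collections import Counter

# Hypothetical stop list of URL parts that carry little topical signal.
STOP = {"http", "https", "www", "com", "org", "net", "html", "htm", "index"}

def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens, dropping stop tokens."""
    parts = re.split(r"[^a-z]+", url.lower())
    return [t for t in parts if len(t) > 1 and t not in STOP]

class BinaryNaiveBayes:
    """Topic-vs-rest multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}
        self.vocab = set()

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            self.counts[label].update(url_tokens(url))
            self.docs[label] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def _log_score(self, tokens, label):
        total = sum(self.counts[label].values())
        n_docs = self.docs[True] + self.docs[False]
        score = math.log((self.docs[label] + 1) / (n_docs + 2))
        for t in tokens:
            if t in self.vocab:  # tokens never seen in training are ignored
                score += math.log((self.counts[label][t] + 1) / (total + len(self.vocab)))
        return score

    def predict(self, url):
        tokens = url_tokens(url)
        return self._log_score(tokens, True) > self._log_score(tokens, False)

def precision_recall_f1(predicted, actual):
    """Return precision, recall and F-measure as fractions in [0, 1]."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy training set for a single hypothetical topic ("sports" vs. everything else).
train_urls = [
    "http://www.example.com/sports/football/scores.html",
    "http://soccer-news.example.org/league/table",
    "http://www.example.net/recipes/pasta.html",
    "http://cooking.example.com/kitchen/baking",
]
train_labels = [True, True, False, False]

clf = BinaryNaiveBayes().fit(train_urls, train_labels)
print(clf.predict("http://example.org/football/league"))
print(clf.predict("http://example.com/recipes/baking.html"))
```

In a one-classifier-per-topic setup, one such model would be trained for each topic, each deciding independently whether a URL belongs to its topic. Note that `precision_recall_f1` returns fractions, whereas the abstract reports precision and F-measure on a 0 to 100 scale.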

References

The 4 universities data set. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/.
Open directory project. http://www.dmoz.org/.
E. Baykan, M. Henzinger, and I. Weber. Web page language identification based on urls. In International conference on Very Large Data Bases (VLDB), pages 176–187, 2008.
S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In International conference on Management of data (SIGMOD), pages 307–318, 1998.
M. Kan and H. O. N. Thi. Fast webpage classification using url features. In International conference on Information and knowledge management (CIKM), pages 325–326, 2005.
X. Qi and B. D. Davison. Knowing a web page by the company it keeps. In International conference on Information and knowledge management (CIKM), pages 228–237, 2006.
X. Qi and B. D. Davison. Web page classification: Features and algorithms. ACM Computing Surveys, 41, 2009. To appear.

Index Terms

1. Information systems
  1. Information retrieval
    1. Document representation

Published in
WWW '09: Proceedings of the 18th international conference on World wide web
April 2009
1280 pages
ISBN: 9781605584874
DOI: 10.1145/1526709
General Chairs: Juan Quemada (DIT-UPM), Gonzalo León (DIT-UPM)
Program Chairs: Yoelle Maarek (Google Inc., Israel), Wolfgang Nejdl (L3S and Hannover University)
Copyright © 2009 Copyright is held by the author/owner(s)
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
ODP
URL
topic classification
Qualifiers
- poster
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%