Article

Evaluation of crawling policies for a web-repository crawler

Authors:
Frank McCown, Old Dominion University, Norfolk, Virginia
Michael L. Nelson, Old Dominion University, Norfolk, Virginia

HYPERTEXT '06: Proceedings of the seventeenth conference on Hypertext and hypermedia, August 2006, Pages 157–168. https://doi.org/10.1145/1149941.1149972

Published: 22 August 2006

ABSTRACT

We have developed a web-repository crawler that is used for reconstructing websites when backups are unavailable. Our crawler retrieves web resources from the Internet Archive, Google, Yahoo and MSN. We examine the challenges of crawling web repositories, and we discuss strategies for overcoming some of these obstacles. We propose three crawling policies which can be used to reconstruct websites. We evaluate the effectiveness of the policies by reconstructing 24 websites and comparing the results with live versions of the websites. We conclude with our experiences reconstructing lost websites on behalf of others and discuss plans for improving our web-repository crawler.

Index Terms

1. Information systems
  1. World Wide Web
    1. Web applications
    2. Web services

Published in
HYPERTEXT '06: Proceedings of the seventeenth conference on Hypertext and hypermedia
August 2006
178 pages
ISBN: 1595934170
DOI: 10.1145/1149941
General Chair: Uffe K. Wiil, University of Southern Denmark
Program Chairs: Peter J. Nürnberg, Aalborg University Esbjerg, Denmark; Jessica Rubart, Sagem Orga, Germany
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
crawler policy
digital preservation
search engine
website reconstruction
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 378 of 1,158 submissions, 33%
