ABSTRACT
We have developed a web-repository crawler that is used for reconstructing websites when backups are unavailable. Our crawler retrieves web resources from the Internet Archive, Google, Yahoo and MSN. We examine the challenges of crawling web repositories, and we discuss strategies for overcoming some of these obstacles. We propose three crawling policies which can be used to reconstruct websites. We evaluate the effectiveness of the policies by reconstructing 24 websites and comparing the results with live versions of the websites. We conclude with our experiences reconstructing lost websites on behalf of others and discuss plans for improving our web-repository crawler.
Index Terms
- Evaluation of crawling policies for a web-repository crawler