DOI: 10.1145/1149941.1149972
Article

Evaluation of crawling policies for a web-repository crawler

Published: 22 August 2006

ABSTRACT

We have developed a web-repository crawler that is used for reconstructing websites when backups are unavailable. Our crawler retrieves web resources from the Internet Archive, Google, Yahoo and MSN. We examine the challenges of crawling web repositories, and we discuss strategies for overcoming some of these obstacles. We propose three crawling policies which can be used to reconstruct websites. We evaluate the effectiveness of the policies by reconstructing 24 websites and comparing the results with live versions of the websites. We conclude with our experiences reconstructing lost websites on behalf of others and discuss plans for improving our web-repository crawler.

Published in

HYPERTEXT '06: Proceedings of the seventeenth conference on Hypertext and hypermedia
August 2006
178 pages
ISBN: 1595934170
DOI: 10.1145/1149941

Copyright © 2006 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall acceptance rate: 378 of 1,158 submissions (33%)
