The SHARC framework for data quality in Web archiving

  • Special Issue Paper
The VLDB Journal

Abstract

Web archives preserve the history of born-digital content and offer great potential for sociologists, business analysts, and legal experts on intellectual property and compliance issues. Data quality is crucial for these purposes. Ideally, crawlers should gather coherent captures of entire Web sites, but the politeness etiquette and completeness requirement mandate very slow, long-duration crawling while Web sites undergo changes. This paper presents the SHARC framework for assessing the data quality in Web archives and for tuning capturing strategies toward better quality with given resources. We define data quality measures, characterize their properties, and develop a suite of quality-conscious scheduling strategies for archive crawling. Our framework includes single-visit and visit–revisit crawls. Single-visit crawls download every page of a site exactly once in an order that aims to minimize the “blur” in capturing the site. Visit–revisit strategies revisit pages after their initial downloads to check for intermediate changes. The revisiting order aims to maximize the “coherence” of the site capture (the number of pages that did not change during the capture). The quality notions of blur and coherence are formalized in the paper. Blur is a stochastic notion that reflects the expected number of page changes that a time-travel access to a site capture would accidentally see, instead of the ideal view of an instantaneously captured, “sharp” site. Coherence is a deterministic quality measure that counts the number of unchanged and thus coherently captured pages in a site snapshot. Strategies that aim to either minimize blur or maximize coherence are based on prior knowledge of or predictions for the change rates of individual pages. Our framework includes fairly accurate classifiers for change predictions. All strategies are fully implemented in a testbed and shown to be effective by experiments with both synthetically generated sites and a periodic crawl series for different Web sites.
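
To make the two quality notions more concrete, the following is a minimal Python sketch and not the paper's formalization, which the abstract does not reproduce. It assumes that each page changes at a constant average rate, so that the expected number of changes between a download and a reference time is the rate multiplied by the elapsed time, and it treats a page as coherently captured when no change falls between its visit and its revisit. The function names, the dictionary layout, and the example values are all hypothetical.

```python
from typing import Dict, List, Tuple


def coherence(visits: Dict[str, Tuple[float, float]],
              changes: Dict[str, List[float]]) -> int:
    """Count pages with no change between their visit and revisit times.

    Simplifying assumption of this sketch: a page is coherent if none of
    its change timestamps lies in the interval (visit, revisit].
    """
    unchanged = 0
    for page, (visit, revisit) in visits.items():
        page_changes = changes.get(page, [])
        if not any(visit < t <= revisit for t in page_changes):
            unchanged += 1
    return unchanged


def expected_blur(download_times: Dict[str, float],
                  change_rates: Dict[str, float],
                  reference_time: float) -> float:
    """Expected number of page changes seen relative to a reference time.

    Assumption of this sketch: page p changes at a constant average rate
    change_rates[p], so the expected number of changes between its download
    and the reference time is rate * |download_time - reference_time|.
    """
    return sum(change_rates[page] * abs(t - reference_time)
               for page, t in download_times.items())


# Hypothetical visit-revisit capture of three pages.
visits = {"a": (0.0, 10.0), "b": (1.0, 9.0), "c": (2.0, 8.0)}
changes = {"a": [5.0], "b": [0.5], "c": []}
coherent_pages = coherence(visits, changes)  # "b" and "c" are unchanged

# Hypothetical single-visit capture: the same pages in two download orders.
rates = {"a": 0.5, "b": 0.1, "c": 0.2}
blur_abc = expected_blur({"a": 0.0, "b": 1.0, "c": 2.0}, rates, 1.0)
blur_bac = expected_blur({"b": 0.0, "a": 1.0, "c": 2.0}, rates, 1.0)
```

In this toy setting, downloading the fastest-changing page "a" closest to the reference time gives a lower expected blur than downloading it first, which illustrates why the download order depends on known or predicted change rates.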

Author information

Corresponding author

Correspondence to Dimitar Denev.

About this article

Cite this article

Denev, D., Mazeika, A., Spaniol, M. et al. The SHARC framework for data quality in Web archiving. The VLDB Journal 20, 183–207 (2011). https://doi.org/10.1007/s00778-011-0219-9

  • DOI: https://doi.org/10.1007/s00778-011-0219-9
