The SHARC framework for data quality in Web archiving

Denev, Dimitar; Mazeika, Arturas; Spaniol, Marc; Weikum, Gerhard

doi:10.1007/s00778-011-0219-9

Special Issue Paper
Published: 02 March 2011

The VLDB Journal, Volume 20, pages 183–207 (2011)

Abstract

Web archives preserve the history of born-digital content and offer great potential for sociologists, business analysts, and legal experts on intellectual property and compliance issues. Data quality is crucial for these purposes. Ideally, crawlers should gather coherent captures of entire Web sites, but the politeness etiquette and completeness requirement mandate very slow, long-duration crawling while Web sites undergo changes. This paper presents the SHARC framework for assessing the data quality in Web archives and for tuning capturing strategies toward better quality with given resources. We define data quality measures, characterize their properties, and develop a suite of quality-conscious scheduling strategies for archive crawling. Our framework includes single-visit and visit–revisit crawls. Single-visit crawls download every page of a site exactly once in an order that aims to minimize the “blur” in capturing the site. Visit–revisit strategies revisit pages after their initial downloads to check for intermediate changes. The revisiting order aims to maximize the “coherence” of the site capture (the number of pages that did not change during the capture). The quality notions of blur and coherence are formalized in the paper. Blur is a stochastic notion that reflects the expected number of page changes that a time-travel access to a site capture would accidentally see, instead of the ideal view of an instantaneously captured, “sharp” site. Coherence is a deterministic quality measure that counts the number of unchanged and thus coherently captured pages in a site snapshot. Strategies that aim to either minimize blur or maximize coherence are based on prior knowledge of or predictions for the change rates of individual pages. Our framework includes fairly accurate classifiers for change predictions. All strategies are fully implemented in a testbed and shown to be effective by experiments with both synthetically generated sites and a periodic crawl series for different Web sites.
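As a rough illustration of the two quality notions, the following Python sketch computes a blur-like and a coherence-like score for a toy capture. It is not the paper's formalization: the `PageCapture` record, the field names, the Poisson-style change model (a constant `change_rate` per page), and the assumption that the time-travel point is uniformly distributed over the capture interval are all assumptions made here for the example.

```python
from dataclasses import dataclass


@dataclass
class PageCapture:
    """Hypothetical record for one page in a site capture."""
    url: str
    change_rate: float    # assumed average number of changes per time unit
    download_time: float  # time of the download, within [0, capture_length]
    visit_digest: str     # content fingerprint at the first visit
    revisit_digest: str   # content fingerprint at the revisit


def expected_blur(pages, capture_length):
    """Expected number of changes a time-travel access would see.

    Assumption: changes arrive at a constant rate per page and the access
    time is uniform on [0, capture_length], so the expected gap between
    access time and download time t is (t^2 + (T - t)^2) / (2T).
    """
    total = 0.0
    for page in pages:
        t = page.download_time
        mean_gap = (t ** 2 + (capture_length - t) ** 2) / (2 * capture_length)
        total += page.change_rate * mean_gap
    return total


def coherence(pages):
    """Count pages whose content is identical at visit and revisit."""
    return sum(1 for page in pages if page.visit_digest == page.revisit_digest)


pages = [
    PageCapture("/index", change_rate=0.2, download_time=0.0,
                visit_digest="a1", revisit_digest="a1"),
    PageCapture("/news", change_rate=0.4, download_time=5.0,
                visit_digest="b1", revisit_digest="b2"),
]
print(expected_blur(pages, capture_length=10.0))
print(coherence(pages))
```

Under these assumptions, a page's contribution to blur is smallest when it is downloaded at the midpoint of the capture interval, which gives one intuition for why the download order matters.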

Author information

Authors and Affiliations

Max Planck Institute for Informatics, Campus E1.4, 66123, Saarbrücken, Germany
Dimitar Denev, Arturas Mazeika, Marc Spaniol & Gerhard Weikum

Corresponding author

Correspondence to Dimitar Denev.

About this article

Cite this article

Denev, D., Mazeika, A., Spaniol, M. et al. The SHARC framework for data quality in Web archiving. The VLDB Journal 20, 183–207 (2011). https://doi.org/10.1007/s00778-011-0219-9

Received: 27 August 2010
Accepted: 03 February 2011
Published: 02 March 2011
Issue Date: April 2011
DOI: https://doi.org/10.1007/s00778-011-0219-9
