Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive

Ainsworth, Scott G.; Nelson, Michael L.

doi:10.1007/s00799-014-0120-4

Published: 05 August 2014

Volume 16, pages 129–144, (2015)

International Journal on Digital Libraries

Abstract

When viewing an archived page using the archive’s user interface (UI), the user selects a datetime to view from a list. The archived web page, if available, is then displayed. From this display, the web archive UI attempts to simulate the web browsing experience by smoothly transitioning between archived pages. During this process, the target datetime changes with each link followed, potentially drifting away from the datetime originally selected. For sparsely archived resources, this almost transparent drift can be many years in just a few clicks. We conducted 200,000 acyclic walks of archived pages, following up to 50 links per walk, comparing the results of two target datetime policies. The Sliding Target policy allows the target datetime to change as it does in archive UIs such as the Internet Archive’s Wayback Machine. The Sticky Target policy, represented by the Memento API, keeps the target datetime the same throughout the walk. We found that the Sliding Target policy drift increases with the number of walk steps, number of domains visited, and choice (number of links available). However, the Sticky Target policy controls temporal drift, holding it to less than 30 days on average regardless of walk length or number of domains visited. The Sticky Target policy shows some increase as choice increases, but this may be caused by other factors. We conclude that based on walk length, the Sticky Target policy generally produces at least 30 days less drift than the Sliding Target policy.
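The difference between the two policies can be illustrated with a small simulation. The following is a minimal sketch, not the authors' experimental code: the function names, the two-page toy archive, and its capture dates are all invented for illustration, and drift is taken here to be the absolute difference between the memento datetime and the originally selected datetime.

```python
from datetime import datetime

def closest(captures, target):
    """Return the capture datetime nearest to the target datetime."""
    return min(captures, key=lambda c: abs(c - target))

def walk(archive, path, start, sticky):
    """Follow `path` (a list of URIs) and return the drift at each step.

    archive: dict mapping URI -> list of capture datetimes (toy data).
    sticky:  True keeps the target fixed (Sticky Target);
             False moves it to each memento reached (Sliding Target).
    """
    target = start
    drifts = []
    for uri in path:
        memento = closest(archive[uri], target)
        drifts.append(abs(memento - start))
        if not sticky:
            target = memento  # the next request inherits this datetime
    return drifts

# Hypothetical archive: capture dates are invented to make the effect visible.
archive = {
    "page_a": [datetime(2005, 5, 14), datetime(2005, 3, 31)],
    "page_b": [datetime(2005, 4, 20)],
}
start = datetime(2005, 5, 14)
path = ["page_a", "page_b", "page_a"]

sliding_days = [d.days for d in walk(archive, path, start, sticky=False)]
sticky_days = [d.days for d in walk(archive, path, start, sticky=True)]
print("sliding:", sliding_days)
print("sticky: ", sticky_days)
```

In this toy walk, returning to page_a under the Sliding Target policy lands on an older capture than the one first viewed, while the Sticky Target policy returns to the original capture.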

Notes

http://web.archive.org/web/20050514013608/http://www.cs.odu.edu/.
http://web.archive.org/web/20050422001752/http://sci.odu.edu/.
http://www.gnu.org/software/wget/.

Fig. 1 Impact of drift on archive browsing

Acknowledgments

This work was supported in part by the National Science Foundation (IIS 1009392) and the Library of Congress. We are grateful to the Internet Archive for their continued support of Memento access to their archive. Memento is a joint project between the Los Alamos National Laboratory Research Library and Old Dominion University.

Author information

Authors and Affiliations

Old Dominion University, Norfolk, VA, USA
Scott G. Ainsworth & Michael L. Nelson

Corresponding author

Correspondence to Scott G. Ainsworth.

Additional information

This article is an extended version of the JCDL’13 article by the same title.

Appendix: Curl interactions for the example

The following curl commands and responses demonstrate the HTTP interactions that retrieve the HTML pages (but not the embedded images) shown in Fig. 1. The first three steps use the Internet Archive’s Wayback Machine UI. Steps 4–6 use the Memento API. Note that the response headers shown are a subset of the headers returned by the Internet Archive. (Each response actually contains 15–20 headers.) The headers shown were selected to show the datetime selection process.

Step 1. Retrieve the ODU Computer Science home page. This step retrieves the Computer Science home page for 2005-05-14 01:36:08 (Fig. 1a). Because the datetime encoded in the Wayback Machine URI matches an archived Memento-Datetime, a 200 status is returned directly; there are no redirects.
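How the datetime is encoded in the request URI for step 1 can be sketched as follows. This is an illustration rather than the command used by the authors; the helper name wayback_uri is hypothetical, and the 14-digit YYYYMMDDhhmmss layout is taken from the URI given in the Notes.

```python
from datetime import datetime

WAYBACK_PREFIX = "http://web.archive.org/web/"

def wayback_uri(original_uri, when):
    """Build a Wayback Machine URI with a 14-digit YYYYMMDDhhmmss datetime."""
    return f"{WAYBACK_PREFIX}{when.strftime('%Y%m%d%H%M%S')}/{original_uri}"

# Datetime and original URI from step 1.
step1_uri = wayback_uri("http://www.cs.odu.edu/", datetime(2005, 5, 14, 1, 36, 8))
print(step1_uri)
```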

Step 2. Retrieve the Science home page for 2005-05-14 01:36:08. When the user clicks, the Memento-Datetime from step 1 is included in the request URI and used to select the archived copy of the Science Department home page; in this case, 2005-04-22 00:17:52 is the closest match (Fig. 1b). A 302 redirect is used to redirect the browser to the URI-M of the archived resource.
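The datetime that becomes the next target can be read back out of the URI-M that the 302 redirect points to. The sketch below assumes the redirect target has the same form as the URI given in the Notes; the function name parse_uri_m and the regular expression are illustrative, not part of any archive API.

```python
import re
from datetime import datetime

# Matches a Wayback Machine URI-M: prefix, 14-digit datetime, original URI.
URI_M = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")

def parse_uri_m(uri_m):
    """Split a URI-M into its memento datetime and original URI."""
    match = URI_M.match(uri_m)
    if match is None:
        raise ValueError(f"not a recognised URI-M: {uri_m}")
    when = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return when, match.group(2)

# Redirect target for step 2, as listed in the Notes.
location = "http://web.archive.org/web/20050422001752/http://sci.odu.edu/"
memento_datetime, original_uri = parse_uri_m(location)
print(memento_datetime, original_uri)
```

Under the Sliding Target policy, this parsed datetime, rather than the one the user originally selected, is carried into the next request.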

Step 3. Retrieve the Computer Science home page for 2005-04-22 00:17:52. When the user clicks, the Memento-Datetime from step 2, which is different from step 1, is used to select the archived copy of the Computer Science home page. Now, 2005-03-31 is the closest match (Fig. 1c). Again, a 302 redirects to the archived copy’s URI-M.
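The drift accumulated over steps 1 to 3 can be worked out directly from the datetimes quoted above. This is a small arithmetic sketch; because only the date is given for the step 3 memento, its drift is computed at whole-day granularity, and the variable names are illustrative.

```python
from datetime import date, datetime

target = datetime(2005, 5, 14, 1, 36, 8)    # datetime selected in step 1
step2 = datetime(2005, 4, 22, 0, 17, 52)    # memento reached in step 2
step3_date = date(2005, 3, 31)              # memento reached in step 3 (date only)

drift_step2 = target - step2
drift_step3_days = (target.date() - step3_date).days

print("drift after step 2:", drift_step2)
print("drift after step 3:", drift_step3_days, "days")
```

Two clicks, ending on the page where the walk started, move the view from mid-May back to the end of March.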

Step 4. Retrieve the ODU Computer Science home page. This step retrieves the Computer Science home page for 2005-05-14 01:36:08 (Fig. 1a) using the Memento API. The Accept-Datetime request header is added to the curl command and the datetime is removed from the URI. This causes the Wayback Machine to use the Memento API instead of the process normally used by the UI. Again, because the requested datetime matches an archived Memento-Datetime, the 302 redirects to the matching memento. Also note the additional response header, Memento-Datetime, which is part of the Memento API.
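A Memento-style request of this kind can be sketched as below. The sketch only assembles the URI and headers and sends nothing; the function name memento_request is hypothetical, and the URI form (the Wayback Machine prefix followed directly by the original URI) is an assumption based on the description of removing the datetime from the URI. The Accept-Datetime value uses the HTTP date format.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def memento_request(original_uri, target):
    """Return (url, headers) for a datetime-negotiated request.

    `target` must be a timezone-aware UTC datetime.
    """
    # Assumed form: the datetime-free Wayback Machine URI.
    url = "http://web.archive.org/web/" + original_uri
    headers = {"Accept-Datetime": format_datetime(target, usegmt=True)}
    return url, headers

target = datetime(2005, 5, 14, 1, 36, 8, tzinfo=timezone.utc)
url, headers = memento_request("http://www.cs.odu.edu/", target)
print(url)
print(headers)
```

Because the target datetime travels in a header instead of the URI, the same header value can be reused unchanged for every link followed, which is what makes the target sticky.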

Step 5. Retrieve the Science home page. This step retrieves the Science home page nearest the original target datetime using the Memento API (Fig. 1e). The Accept-Datetime request header is again added to the curl command. Like step 2, a 302 redirect is used to indicate the best archived copy. Also note the additional response header, Memento-Datetime, which is part of the Memento API. The Vary header now includes Memento-Datetime to ensure that web caching remains coherent.

It should also be noted that a Memento API response includes a Link header with much more Memento information. This header is not shown because it does not affect datetime selection.

Step 6. Retrieve the Computer Science page. The Computer Science home page is again retrieved using the original target datetime. As expected, the same version retrieved in step 4 is retrieved again (Fig. 1f).
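Checking that a response stays on target amounts to comparing its Memento-Datetime header with the original target datetime. The sketch below assumes the header value is an HTTP date; the function name drift_from_headers is hypothetical, and the example header dictionary is constructed from the datetime stated for step 4, not copied from a captured response.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def drift_from_headers(headers, target):
    """Return the absolute difference between Memento-Datetime and target."""
    memento = parsedate_to_datetime(headers["Memento-Datetime"])
    return abs(memento - target)

target = datetime(2005, 5, 14, 1, 36, 8, tzinfo=timezone.utc)

# Illustrative header for steps 4 and 6, which return the same memento.
response_headers = {"Memento-Datetime": "Sat, 14 May 2005 01:36:08 GMT"}

drift = drift_from_headers(response_headers, target)
print("drift:", drift)
```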

About this article

Cite this article

Ainsworth, S.G., Nelson, M.L. Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive. Int J Digit Libr 16, 129–144 (2015). https://doi.org/10.1007/s00799-014-0120-4

Received: 31 October 2013
Revised: 14 June 2014
Accepted: 01 July 2014
Published: 05 August 2014
Issue Date: June 2015
DOI: https://doi.org/10.1007/s00799-014-0120-4
