Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive

International Journal on Digital Libraries

Abstract

When viewing an archived page using the archive’s user interface (UI), the user selects a datetime to view from a list. The archived web page, if available, is then displayed. From this display, the web archive UI attempts to simulate the web browsing experience by smoothly transitioning between archived pages. During this process, the target datetime changes with each link followed, potentially drifting away from the datetime originally selected. For sparsely archived resources, this almost transparent drift can be many years in just a few clicks. We conducted 200,000 acyclic walks of archived pages, following up to 50 links per walk, comparing the results of two target datetime policies. The Sliding Target policy allows the target datetime to change as it does in archive UIs such as the Internet Archive’s Wayback Machine. The Sticky Target policy, represented by the Memento API, keeps the target datetime the same throughout the walk. We found that the Sliding Target policy drift increases with the number of walk steps, number of domains visited, and choice (number of links available). However, the Sticky Target policy controls temporal drift, holding it to less than 30 days on average regardless of walk length or number of domains visited. The Sticky Target policy shows some increase as choice increases, but this may be caused by other factors. We conclude that based on walk length, the Sticky Target policy generally produces at least 30 days less drift than the Sliding Target policy.

Notes

  1. http://web.archive.org/web/20050514013608/http://www.cs.odu.edu/.

    Fig. 1 Impact of drift on archive browsing

  2. http://web.archive.org/web/20050422001752/http://sci.odu.edu/.

  3. http://www.gnu.org/software/wget/.

Acknowledgments

This work was supported in part by the National Science Foundation (IIS 1009392) and the Library of Congress. We are grateful to the Internet Archive for their continued support of Memento access to their archive. Memento is a joint project between the Los Alamos National Laboratory Research Library and Old Dominion University.

Author information

Corresponding author

Correspondence to Scott G. Ainsworth.

Additional information

This article is an extended version of the JCDL’13 article by the same title.

Appendix: Curl interactions for the example

The following curl commands and responses demonstrate the HTTP interactions that retrieve the HTML pages (but not the embedded images) shown in Fig. 1. The first three steps use the Internet Archive’s Wayback Machine UI. Steps 4–6 use the Memento API. Note that the response headers shown are a subset of the headers returned by the Internet Archive. (Each response actually contains 15–20 headers.) The headers shown were selected to show the datetime selection process.

Step 1. Retrieve the ODU Computer Science home page. This step retrieves the Computer Science home page for 2005-05-14 01:36:08 (Fig. 1a). Because the datetime encoded in the Wayback Machine URI exactly matches an archived Memento-Datetime, a 200 status is returned directly; there are no redirects.

figure a

Step 2. Retrieve the Science home page for 2005-05-14 01:36:08. When the user clicks, the Memento-Datetime from step 1 is included in the request URI and used to select the archived copy of the Science Department home page; in this case, 2005-04-22 00:17:52 is the closest match (Fig. 1b). A 302 redirect is used to redirect the browser to the URI-M of the archived resource.

figure b
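The URIs in steps 1 and 2 consist of a 14-digit datetime followed by the original URI. The following Python sketch assumes that pattern, which is inferred from the example URIs in this article rather than taken from a Wayback Machine specification. It shows how the request URI for step 2 could be derived from the memento retrieved in step 1. The function names `split_uri_m` and `build_wayback_uri` are hypothetical.

```python
import re
from datetime import datetime

# Assumed URI layout: prefix, 14-digit datetime (YYYYMMDDhhmmss), original URI.
WAYBACK_PREFIX = "http://web.archive.org/web/"
_URI_M = re.compile(r"^http://web\.archive\.org/web/(\d{14})/(.+)$")


def split_uri_m(uri_m):
    """Split a Wayback-style URI-M into (datetime, original URI)."""
    match = _URI_M.match(uri_m)
    if match is None:
        raise ValueError("not a Wayback-style URI-M: " + uri_m)
    stamp, original = match.groups()
    return datetime.strptime(stamp, "%Y%m%d%H%M%S"), original


def build_wayback_uri(target, original):
    """Encode a target datetime and an original URI as a UI request URI."""
    return f"{WAYBACK_PREFIX}{target:%Y%m%d%H%M%S}/{original}"


# Step 1: the memento the user is currently viewing.
memento_dt, page = split_uri_m(
    "http://web.archive.org/web/20050514013608/http://www.cs.odu.edu/"
)

# Step 2: following a link reuses that datetime as the new target.
next_request = build_wayback_uri(memento_dt, "http://sci.odu.edu/")
print(next_request)
```

The archive answers such a request with a 302 to the closest capture, so the datetime carried forward to the next click is the one in the redirect target, not the one in the request.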

Step 3. Retrieve the Computer Science home page for 2005-04-22 00:17:52. When the user clicks, the Memento-Datetime from step 2, which is different from step 1, is used to select the archived copy of the Computer Science home page. Now, 2005-03-31 is the closest match (Fig. 1c). Again, a 302 redirects to the archived copy’s URI-M.

figure c

Step 4. Retrieve the ODU Computer Science home page. This step retrieves the Computer Science home page for 2005-05-14 01:36:08 (Fig. 1a) using the Memento API. The Accept-Datetime request header is added to the curl command and the datetime is removed from the URI, which causes the Wayback Machine to use the Memento API instead of the process normally used by the UI. As in step 1, the requested datetime exactly matches an archived Memento-Datetime, but here it is carried in the Accept-Datetime header rather than in the URI, and the response is a 302 that redirects to the matching memento. Also note the additional response header, Memento-Datetime, which is part of the Memento API.

figure d
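The request described in step 4 can be sketched in Python without contacting the archive. The sketch assumes that the datetime-free URI is simply the Wayback prefix followed by the original URI, as the step describes, and that the example datetime is in GMT. The function name `memento_request` is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumption: dropping the 14-digit datetime leaves prefix + original URI.
WAYBACK_PREFIX = "http://web.archive.org/web/"


def memento_request(original_uri, target):
    """Return (request URI, headers) for a Memento-style request.

    The target datetime travels in the Accept-Datetime header as an
    HTTP date, not in the URI.
    """
    headers = {"Accept-Datetime": format_datetime(target, usegmt=True)}
    return WAYBACK_PREFIX + original_uri, headers


# The target datetime from step 1, assumed to be GMT.
target = datetime(2005, 5, 14, 1, 36, 8, tzinfo=timezone.utc)
uri, headers = memento_request("http://www.cs.odu.edu/", target)
print(uri)
print(headers)
```

With curl, the header value would be passed using the `-H` option. Because the target lives in a header that the client controls, it can be resent unchanged on every click, which is what makes the Sticky Target policy possible.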

Step 5. Retrieve the Science home page. This step retrieves the Science home page nearest the original target datetime using the Memento API (Fig. 1e). The Accept-Datetime request header is again added to the curl command. As in step 2, a 302 redirect indicates the best archived copy. The response again carries the Memento-Datetime header, which is part of the Memento API. The Vary header now includes Memento-Datetime to ensure that web caching remains coherent.

figure e

It should also be noted that a Memento API response includes a Link header with much more Memento information. This header is not shown because it does not affect datetime selection.
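The Memento-Datetime response header gives a client what it needs to measure how far a memento lies from the target it asked for. The Python sketch below computes that difference. The header value is illustrative: it reuses the Science page datetime from step 2, on the assumption that step 5 selects the same capture because both steps use the same target. The function name `drift_from_target` is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def drift_from_target(response_headers, target):
    """Absolute difference between a memento's datetime and the target.

    response_headers is a plain dict here; real HTTP header names are
    case-insensitive, which this sketch does not handle.
    """
    memento_dt = parsedate_to_datetime(response_headers["Memento-Datetime"])
    return abs(memento_dt - target)


# Original target datetime, assumed to be GMT.
target = datetime(2005, 5, 14, 1, 36, 8, tzinfo=timezone.utc)

# Illustrative response header (assumed, see the paragraph above).
response_headers = {"Memento-Datetime": "Fri, 22 Apr 2005 00:17:52 GMT"}

drift = drift_from_target(response_headers, target)
print(drift)
```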

Step 6. Retrieve the Computer Science page. The Computer Science home page is again retrieved using the original target datetime. As expected, the same version retrieved in step 4 is retrieved again (Fig. 1f).

figure f
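The contrast between steps 1–3 and steps 4–6 can be reproduced on toy data. The Python sketch below walks three pages under both policies and records, at each step, the distance in days between the selected capture and the original target. The capture dates are invented for illustration and are not the captures behind Fig. 1. Measuring drift as an absolute difference from the original target is an assumption about the metric, and `closest` and `walk` are hypothetical names.

```python
from bisect import bisect_left
from datetime import datetime


def closest(captures, target):
    """Return the capture nearest to target (captures sorted ascending)."""
    i = bisect_left(captures, target)
    candidates = captures[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda c: abs(c - target))


def walk(pages, target, sticky):
    """Visit pages in order; return drift in days from the original target."""
    original = target
    drifts = []
    for captures in pages:
        memento = closest(captures, target)
        drifts.append(abs(memento - original).days)
        if not sticky:
            # Sliding Target: the memento's datetime becomes the next target.
            target = memento
    return drifts


# Hypothetical capture datetimes for three pages visited in sequence.
PAGES = [
    [datetime(2005, 4, 22), datetime(2005, 7, 30)],
    [datetime(2005, 3, 20), datetime(2005, 6, 20)],
    [datetime(2005, 2, 10), datetime(2005, 5, 10)],
]
TARGET = datetime(2005, 5, 14)

sliding = walk(PAGES, TARGET, sticky=False)
sticky = walk(PAGES, TARGET, sticky=True)
print("sliding:", sliding)  # [22, 55, 93]
print("sticky: ", sticky)   # [22, 37, 4]
```

In this toy walk the sliding target moves steadily earlier, because each selection is made relative to the previous memento. The sticky target is bounded at every step by the nearest capture to the original datetime.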

Cite this article

Ainsworth, S.G., Nelson, M.L. Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive. Int J Digit Libr 16, 129–144 (2015). https://doi.org/10.1007/s00799-014-0120-4
