Abstract
To address the requirements of different user groups and use cases of web archives, we have identified three views for accessing and exploring them: user-centric, data-centric and graph-centric. The user-centric view is the natural way to look at archived pages in a browser, just as the live web is consumed. Zooming out from there to whole collections in a web archive, data processing methods enable analysis at scale. In this data-centric view, the web and its dynamics, as well as the contents of archived pages, can be examined from two angles: (1) by retrospectively analysing crawl metadata with respect to the size, age and growth of the web and (2) by processing archival collections to build research corpora from web archives. Finally, the third perspective is what we call the graph-centric view, which considers websites, pages or extracted facts as nodes in a graph. Links among pages or among the extracted information are represented by edges in the graph. This structural perspective conveys an overview of the holdings and of the connections among the contained resources and information. Only the three views together provide the holistic view that is required to work effectively with web archives.
Copyright information
© 2021 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Holzmann, H., Nejdl, W. (2021). A Holistic View on Web Archives. In: Gomes, D., Demidova, E., Winters, J., Risse, T. (eds) The Past Web. Springer, Cham. https://doi.org/10.1007/978-3-030-63291-5_8
DOI: https://doi.org/10.1007/978-3-030-63291-5_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63290-8
Online ISBN: 978-3-030-63291-5
eBook Packages: Computer Science, Computer Science (R0)