Crawling the Infinite Web: Five Levels Are Enough

Baeza-Yates, Ricardo; Castillo, Carlos

doi:10.1007/978-3-540-30216-2_13

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3243)

Included in the following conference series:

International Workshop on Algorithms and Models for the Web-Graph

Abstract

A large number of publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. This usually produces Web sites that can create arbitrarily many pages. In this article, several probabilistic models for browsing “infinite” Web sites are proposed and studied. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 “clicks” away from the start page, to reach 90% of the pages that users actually visit.
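
To make the depth estimate concrete, the following is a minimal sketch under a simplifying assumption that is not taken from the chapter: a user starts at the start page (level 0) and, at every page, goes one level deeper with a fixed probability q, otherwise leaving the site. Under that assumption the expected share of page views at levels 0 through d is 1 - q^(d+1). The function names and the values of q are hypothetical and chosen only for illustration; the chapter's own models may differ.

```python
def coverage_at_depth(q: float, depth: int) -> float:
    """Share of page views at levels 0..depth in the assumed geometric model.

    q is the (hypothetical) probability that a user goes one level deeper
    from any page; with probability 1 - q the session ends.
    """
    if not 0.0 <= q < 1.0:
        raise ValueError("q must satisfy 0 <= q < 1")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    # Expected views at level k are q**k, so the cumulative share is
    # (1 - q**(depth + 1)) after normalising by the total 1 / (1 - q).
    return 1.0 - q ** (depth + 1)


def depth_for_coverage(q: float, target: float = 0.9) -> int:
    """Smallest crawl depth whose cumulative share of views reaches target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must satisfy 0 < target < 1")
    depth = 0
    while coverage_at_depth(q, depth) < target:
        depth += 1
    return depth


if __name__ == "__main__":
    # Illustrative values of q only; they are not measurements from the chapter.
    for q in (0.5, 0.6):
        d = depth_for_coverage(q)
        print(f"q={q}: depth {d} covers {coverage_at_depth(q, d):.4f} of views")
```

In this toy model, continuation probabilities of 0.5 and 0.6 require crawl depths of 3 and 4 respectively to reach 90% of page views, which falls in the same range as the 3 to 5 clicks reported in the abstract.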

Author information

Authors and Affiliations

Center for Web Research, DCC, Universidad de Chile, Chile
Ricardo Baeza-Yates & Carlos Castillo

Editor information

Editors and Affiliations

University of Rome “La Sapienza”, Rome, Italy
Stefano Leonardi

Cite this paper

Baeza-Yates, R., Castillo, C. (2004). Crawling the Infinite Web: Five Levels Are Enough. In: Leonardi, S. (eds) Algorithms and Models for the Web-Graph. WAW 2004. Lecture Notes in Computer Science, vol 3243. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30216-2_13

DOI: https://doi.org/10.1007/978-3-540-30216-2_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23427-2
Online ISBN: 978-3-540-30216-2
eBook Packages: Springer Book Archive
