Abstract
A large number of publicly available Web pages are generated dynamically upon request and contain links to other dynamically generated pages. As a result, many Web sites can create arbitrarily many pages. In this article, several probabilistic models for browsing “infinite” Web sites are proposed and studied. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 “clicks” away from the start page, to reach 90% of the pages that users actually visit.
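As a rough illustration of the kind of estimate described above, the sketch below computes the crawl depth needed to cover a target share of page views under one simple assumption: a user at any level follows a link to the next level with a fixed probability q. This geometric model, the function name `depth_for_coverage`, and the parameter values are illustrative assumptions, not the models or data from the paper.

```python
def depth_for_coverage(q: float, target: float = 0.9) -> int:
    """Return the smallest depth d such that levels 0..d hold at least
    `target` of all page views.

    Assumption (hypothetical, for illustration only): every visit starts
    at level 0 and continues one level deeper with probability q, so the
    share of views at levels 0..d is 1 - q**(d + 1).
    """
    if not 0.0 <= q < 1.0:
        raise ValueError("q must be in [0, 1)")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")

    depth = 0
    missed = q  # share of views that lie deeper than the current depth
    while 1.0 - missed < target:
        depth += 1
        missed *= q
    return depth


if __name__ == "__main__":
    for q in (0.3, 0.5, 0.6):
        print(f"q={q}: crawl to depth {depth_for_coverage(q)}")
```

Under this assumed model, a continuation probability of 0.5 gives a depth of 3 and a probability of 0.6 gives a depth of 4 for 90% coverage.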
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Baeza-Yates, R., Castillo, C. (2004). Crawling the Infinite Web: Five Levels Are Enough. In: Leonardi, S. (eds) Algorithms and Models for the Web-Graph. WAW 2004. Lecture Notes in Computer Science, vol 3243. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30216-2_13
DOI: https://doi.org/10.1007/978-3-540-30216-2_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23427-2
Online ISBN: 978-3-540-30216-2