Crawling the Infinite Web: Five Levels Are Enough

  • Conference paper
Algorithms and Models for the Web-Graph (WAW 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3243)

Abstract

A large number of publicly available Web pages are generated dynamically upon request and contain links to other dynamically generated pages. As a result, many Web sites can produce arbitrarily many pages. In this article, several probabilistic models for browsing “infinite” Web sites are proposed and studied. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 “clicks” away from the start page, to reach 90% of the pages that users actually visit.
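
To make the depth estimate concrete, the following is a minimal sketch under a deliberately simplified assumption that is not taken from the paper: a user at any level follows one more link with a fixed probability `q` and otherwise ends the session. The parameter `q` and the function names `coverage` and `depth_for_coverage` are hypothetical, and the sketch counts page views per level rather than distinct pages, so it only illustrates why a small number of levels can capture most of the visited content.

```python
def coverage(q: float, depth: int) -> float:
    """Fraction of page views that fall on levels 0..depth.

    Assumption (illustrative, not the paper's model): a session reaches
    level k with probability q**k, so expected views per level form a
    geometric series whose normalised partial sum is 1 - q**(depth + 1).
    """
    return 1.0 - q ** (depth + 1)


def depth_for_coverage(q: float, target: float = 0.9) -> int:
    """Smallest depth (in clicks from the start page) reaching `target`."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    depth = 0
    # Coverage grows monotonically with depth, so a linear scan terminates.
    while coverage(q, depth) < target:
        depth += 1
    return depth


if __name__ == "__main__":
    for q in (0.5, 0.6):
        print(q, depth_for_coverage(q))
```

Under this toy assumption, continuation probabilities of 0.5 and 0.6 give depths of 3 and 4 clicks for 90% of page views, the same order of magnitude as the range reported in the abstract; the models in the paper itself may differ.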


Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Baeza-Yates, R., Castillo, C. (2004). Crawling the Infinite Web: Five Levels Are Enough. In: Leonardi, S. (eds) Algorithms and Models for the Web-Graph. WAW 2004. Lecture Notes in Computer Science, vol 3243. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30216-2_13

  • DOI: https://doi.org/10.1007/978-3-540-30216-2_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23427-2

  • Online ISBN: 978-3-540-30216-2

  • eBook Packages: Springer Book Archive
