
Part of the book series: Studies in Computational Intelligence (SCI, volume 376)

Abstract

In this article, we describe the design and implementation of a focused crawling system that effectively collects webpages on specific topics. We develop an algorithm for deciding where to crawl next that exploits both anchor texts and the concept of PageRank. Given a topic, the system attempts to collect relevant webpages by crawling pages that are expected to be both closely similar to the topic and highly ranked. We report and analyze experimental results obtained for many topics.
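The abstract does not give the scoring function, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that each candidate link is scored by a weighted sum of (a) the cosine similarity between the topic and the link's anchor text and (b) a PageRank value computed over the part of the link graph seen so far. The names `rank_frontier`, `alpha`, and the bag-of-words similarity are illustrative assumptions.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {url: [outlinked urls]}."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = graph.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


def rank_frontier(topic, links, graph, alpha=0.5):
    """Order candidate links best-first.

    links: list of (url, anchor_text) pairs found on crawled pages.
    graph: link graph observed so far, {url: [outlinked urls]}.
    alpha: assumed weight of anchor-text similarity versus PageRank.
    """
    topic_vec = Counter(tokenize(topic))
    ranks = pagerank(graph)
    max_rank = max(ranks.values())
    heap = []
    for url, anchor in links:
        sim = cosine_similarity(topic_vec, Counter(tokenize(anchor)))
        score = alpha * sim + (1.0 - alpha) * ranks.get(url, 0.0) / max_rank
        heapq.heappush(heap, (-score, url))  # negate: heapq is a min-heap
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


# Toy link graph and candidate links (hypothetical data).
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
links = [("b", "cooking recipes"), ("c", "sports news")]
order = rank_frontier("sports", links, graph)
```

In this toy example, page `c` is visited before `b` because its anchor text matches the topic and it receives more inbound links. A real crawler would recompute or incrementally update the ranks as the observed graph grows; the weighting `alpha` is a tunable assumption here.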




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Uemura, Y., Itokawa, T., Kitasuka, T., Aritsugi, M. (2012). An Effectively Focused Crawling System. In: Watanabe, T., Jain, L.C. (eds) Innovations in Intelligent Machines – 2. Studies in Computational Intelligence, vol 376. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23190-2_5

  • DOI: https://doi.org/10.1007/978-3-642-23190-2_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23189-6

  • Online ISBN: 978-3-642-23190-2

  • eBook Packages: Engineering, Engineering (R0)
