Abstract
In this article, we describe the design and implementation of a focused crawling system that effectively collects webpages on specific topics. We develop an algorithm for deciding where to crawl next that exploits not only anchor texts but also the concept of PageRank. Given a target topic, the system collects relevant webpages by crawling pages that are expected both to be closely similar to the topic and to have a high rank. We report and analyze experimental results over many topics.
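The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea of ordering a crawl frontier by both anchor-text similarity and PageRank. Everything specific here is an assumption made for illustration: the bag-of-words cosine similarity, the linear mix controlled by the hypothetical weight `alpha`, the PageRank recomputation over the link graph seen so far, and the in-memory `toy_web` that stands in for real page fetching.

```python
import heapq
import math
from collections import Counter, defaultdict


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def pagerank(links: dict, damping: float = 0.85, iters: int = 30) -> dict:
    """Power-iteration PageRank over the link graph discovered so far."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    if not nodes:
        return {}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for v in nodes:
                    nxt[v] += damping * rank[u] / n
        rank = nxt
    return rank


def crawl(web: dict, seed: str, topic: str, budget: int, alpha: float = 0.5) -> list:
    """Return the order in which pages are fetched.

    `web` maps url -> list of (anchor_text, target_url) pairs (assumed input).
    """
    seen_links = defaultdict(list)  # link graph built from fetched pages
    best_anchor = {}                # url -> best anchor similarity seen so far
    visited, order = set(), []
    frontier = [(0.0, seed)]
    while frontier and len(order) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for anchor, target in web.get(url, []):
            seen_links[url].append(target)
            sim = cosine(anchor, topic)
            best_anchor[target] = max(best_anchor.get(target, 0.0), sim)
        ranks = pagerank(seen_links)
        top = max(ranks.values(), default=1.0)
        # Re-score every unvisited candidate; heapq is a min-heap, so negate.
        frontier = [
            (-(alpha * sim + (1 - alpha) * ranks.get(u, 0.0) / top), u)
            for u, sim in best_anchor.items()
            if u not in visited
        ]
        heapq.heapify(frontier)
    return order


toy_web = {
    "seed": [("focused web crawler", "a"), ("cooking recipes", "b")],
    "a": [("crawler frontier", "c")],
    "b": [],
    "c": [],
}
print(crawl(toy_web, "seed", "focused crawler", budget=3))
```

On this toy graph the crawler fetches the on-topic page `a` before the off-topic page `b`, and then prefers `c`, which has both a partially matching anchor and a higher rank estimate. Recomputing PageRank after every fetch keeps the sketch short but would be too costly at scale, where an incremental or periodic update would be a more realistic choice.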
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Uemura, Y., Itokawa, T., Kitasuka, T., Aritsugi, M. (2012). An Effectively Focused Crawling System. In: Watanabe, T., Jain, L.C. (eds) Innovations in Intelligent Machines – 2. Studies in Computational Intelligence, vol 376. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23190-2_5
DOI: https://doi.org/10.1007/978-3-642-23190-2_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23189-6
Online ISBN: 978-3-642-23190-2
eBook Packages: Engineering, Engineering (R0)