An Effectively Focused Crawling System

Uemura, Yuki; Itokawa, Tsuyoshi; Kitasuka, Teruaki; Aritsugi, Masayoshi

doi:10.1007/978-3-642-23190-2_5

Part of the book series: Studies in Computational Intelligence (SCI, volume 376)

Abstract

In this article, we describe the design and implementation of a focused crawling system for effectively collecting webpages on specific topics. We develop an algorithm for deciding where to crawl next that exploits not only anchor texts but also the concept of PageRank. Given a topic, our system attempts to collect relevant webpages by crawling pages that are expected to be both closely similar to the topic and highly ranked. Experimental results on many topics are reported and analyzed.
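
The abstract does not state the scoring formula, so the following is only a minimal sketch of the general idea of ordering uncrawled URLs by both anchor-text similarity and PageRank. The linear combination, the weight `alpha`, the bag-of-words cosine similarity, and all function names are illustrative assumptions, not the authors' algorithm.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two short texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over the link graph seen so far.

    graph maps a crawled URL to the list of URLs it links to.
    Rank held by pages with no known out-links is spread uniformly.
    """
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    n = len(nodes)
    if n == 0:
        return {}
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u, targets in graph.items():
            for v in targets:
                new[v] += damping * rank[u] / len(targets)
        rank = new
    return rank


def rank_frontier(topic, links, alpha=0.5):
    """Order uncrawled URLs, best candidate first.

    links: iterable of (source_url, target_url, anchor_text) triples
    extracted from pages already crawled.
    alpha: hypothetical weight; 1.0 uses only anchor-text similarity,
    0.0 uses only (normalised) PageRank.
    """
    graph, best_sim = {}, {}
    for src, dst, anchor in links:
        graph.setdefault(src, []).append(dst)
        sim = cosine_similarity(topic, anchor)
        best_sim[dst] = max(best_sim.get(dst, 0.0), sim)
    rank = pagerank(graph)
    top = max(rank.values(), default=1.0)
    crawled = set(graph)  # assumption: any page with known out-links was crawled
    scores = {
        u: alpha * best_sim[u] + (1 - alpha) * rank[u] / top
        for u in best_sim
        if u not in crawled
    }
    # Sort by descending score; break ties by URL for a stable order.
    return sorted(scores, key=lambda u: (-scores[u], u))


links = [
    ("s1", "x", "web crawler"),
    ("s1", "y", "recipes"),
    ("s2", "y", "food"),
]
by_text = rank_frontier("web crawler", links, alpha=1.0)
by_rank = rank_frontier("web crawler", links, alpha=0.0)
```

In this toy input, `x` has the matching anchor text while `y` has more in-links, so the two extreme settings of `alpha` order the frontier differently; intermediate values trade off the two signals.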

Author information

Authors and Affiliations

Computer Science and Electrical Engineering, Graduate School of Science and Technology, Kumamoto University, Japan
Yuki Uemura, Tsuyoshi Itokawa, Teruaki Kitasuka & Masayoshi Aritsugi

Editor information

Editors and Affiliations

Department of Systems and Social Informatics, Graduate School of Information Science, Nagoya University, Japan
Toyohide Watanabe
School of Electrical and Information Engineering, University of South Australia, Mawson Lakes Campus, Adelaide, South Australia, Australia
Lakhmi C. Jain

About this chapter

Cite this chapter

Uemura, Y., Itokawa, T., Kitasuka, T., Aritsugi, M. (2012). An Effectively Focused Crawling System. In: Watanabe, T., Jain, L.C. (eds) Innovations in Intelligent Machines – 2. Studies in Computational Intelligence, vol 376. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23190-2_5

DOI: https://doi.org/10.1007/978-3-642-23190-2_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23189-6
Online ISBN: 978-3-642-23190-2
eBook Packages: Engineering, Engineering (R0)
