Finding seeds to bootstrap focused crawlers

World Wide Web

Abstract

Focused crawlers are effective tools for applications requiring a high number of pages belonging to a specific topic. Several strategies for implementing these crawlers have been proposed in the literature, which aim to improve crawling efficiency by increasing the number of relevant pages retrieved while avoiding non-relevant pages. However, an important aspect of these crawlers has been largely overlooked: the selection of the seed pages that serve as the starting points for a crawl. In this paper, we show that the seeds can greatly influence the performance of crawlers, and propose a new framework for automatically finding seeds. We describe a system that implements this framework and show, through a detailed experimental evaluation, that crawlers given a large and varied seed set obtain not only higher harvest rates but also improved topic coverage.

Notes

  1. http://dmoz.org

  2. No smoothing is required here since we only use terms that occur in some document.

Acknowledgments

This research was partially sponsored by projects TTDSW (PRONEM/FAPEAM/CNPq), e-vox pesquisa (FAPEAM), e-spot (CNPq Universal), and by individual CNPq fellowship grants to Edleno Moura (308130/2014-6) and Altigran da Silva (311433/2014-6). This material is based on research sponsored by DARPA under agreement number FA8750-14-2-0236. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

Author information

Corresponding author

Correspondence to Altigran Soares da Silva.

Cite this article

Vieira, K., Barbosa, L., da Silva, A.S. et al. Finding seeds to bootstrap focused crawlers. World Wide Web 19, 449–474 (2016). https://doi.org/10.1007/s11280-015-0331-7

  • DOI: https://doi.org/10.1007/s11280-015-0331-7
