Finding seeds to bootstrap focused crawlers

Vieira, Karane; Barbosa, Luciano; da Silva, Altigran Soares; Freire, Juliana; Moura, Edleno

doi:10.1007/s11280-015-0331-7

Published: 26 February 2015

World Wide Web, Volume 19, pages 449–474 (2016)

Abstract

Focused crawlers are effective tools for applications that require a large number of pages on a specific topic. Several strategies for implementing these crawlers have been proposed in the literature; they aim to improve crawling efficiency by increasing the number of relevant pages retrieved while avoiding non-relevant pages. However, an important aspect of these crawlers has been largely overlooked: the selection of the seed pages that serve as the starting points for a crawl. In this paper, we show that the seeds can greatly influence the performance of crawlers, and we propose a new framework for automatically finding seeds. We describe a system that implements this framework and show, through a detailed experimental evaluation, that crawlers given a large and varied seed set obtain not only higher harvest rates but also improved topic coverage.

Notes

http://dmoz.org
No smoothing is required here since we only use terms that occur in some document.

Acknowledgments

This research was partially sponsored by projects TTDSW (PRONEM/FAPEAM/CNPq), e-vox pesquisa (FAPEAM), e-spot (CNPq Universal), and by individual CNPq fellowship grants to Edleno Moura (308130/2014-6) and Altigran da Silva (311433/2014-6). This material is based on research sponsored by DARPA under agreement number FA8750-14-2-0236. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

Author information

Authors and Affiliations

Instituto de Computação, Universidade Federal do Amazonas, Manaus, Brazil
Karane Vieira, Altigran Soares da Silva & Edleno Moura
IBM Research - Brazil, Rio de Janeiro, Brazil
Luciano Barbosa
Department of Computer Science and Engineering, New York University, New York, USA
Juliana Freire

Corresponding author

Correspondence to Altigran Soares da Silva.

About this article

Cite this article

Vieira, K., Barbosa, L., da Silva, A.S. et al. Finding seeds to bootstrap focused crawlers. World Wide Web 19, 449–474 (2016). https://doi.org/10.1007/s11280-015-0331-7

Received: 23 June 2014
Revised: 28 November 2014
Accepted: 09 January 2015
Published: 26 February 2015
Issue Date: May 2016
DOI: https://doi.org/10.1007/s11280-015-0331-7
