ABSTRACT
This paper presents an algorithm to bound the bandwidth used by a Web crawler. The crawler collects statistics on the transfer rate of each server to predict the expected bandwidth use of future downloads. This prediction allows the crawler to activate the optimal number of fetcher threads to exploit the assigned bandwidth. Experimental results show the effectiveness of the proposed technique.
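The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes an exponentially weighted average as the per-server rate predictor and a simple greedy rule for deciding how many fetches to activate. The names `RateEstimator`, `select_fetches`, `alpha`, and `default_rate` are hypothetical and chosen for illustration.

```python
class RateEstimator:
    """Keeps a smoothed transfer-rate estimate (bytes/s) for each host.

    Assumption: an exponentially weighted moving average stands in for
    whatever statistic the paper actually collects.
    """

    def __init__(self, alpha=0.5, default_rate=50_000.0):
        self.alpha = alpha                # weight given to the newest sample
        self.default_rate = default_rate  # guess for hosts never seen before
        self.rates = {}

    def observe(self, host, num_bytes, seconds):
        """Record one finished download from `host`."""
        if seconds <= 0:
            return  # ignore degenerate timings
        sample = num_bytes / seconds
        previous = self.rates.get(host)
        if previous is None:
            self.rates[host] = sample
        else:
            self.rates[host] = self.alpha * sample + (1 - self.alpha) * previous

    def predict(self, host):
        """Predicted bandwidth use (bytes/s) of the next download from `host`."""
        return self.rates.get(host, self.default_rate)


def select_fetches(pending_hosts, estimator, budget):
    """Pick downloads to activate so the predicted total stays within `budget`.

    Walks the queue in order and skips any download whose predicted rate
    would push the total over the budget (a greedy heuristic, assumed here).
    Returns the selected hosts and the predicted bandwidth they consume.
    """
    selected = []
    used = 0.0
    for host in pending_hosts:
        rate = estimator.predict(host)
        if used + rate <= budget:
            selected.append(host)
            used += rate
    return selected, used
```

In this sketch, the number of fetcher threads to activate is simply `len(selected)`. A crawler would call `observe` after every completed download and re-run `select_fetches` whenever a thread becomes free.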
Index Terms
- Design of a crawler with bounded bandwidth