TS-IDS Algorithm for Query Selection in the Deep Web Crawling

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8709)

Abstract

Deep web crawling is the process of collecting the data items held in a data source hidden behind a searchable interface. Since the only way to access the data is by sending queries, one research challenge is selecting a set of queries that retrieves most of the data with minimal network traffic. This is a set covering problem, which is NP-hard. The size of the problem, in terms of both the number of documents and the number of terms involved, calls for new approximation algorithms for efficient deep web crawling. Inspired by the TF-IDF weighting measure in information retrieval, this paper proposes the TS-IDS algorithm, which assigns each document an importance value proportional to term size (TS) and inversely proportional to document size (IDS). The algorithm is extensively tested on a variety of datasets and compared with the traditional greedy algorithm and the more recent IDS algorithm. We demonstrate that TS-IDS outperforms the greedy algorithm and the IDS algorithm by up to 33% and 24%, respectively. Our work also contributes to the classic set covering problem by leveraging the long-tail distributions of terms and documents in natural languages. Since long-tail distributions are ubiquitous in the real world, our approach can be applied in areas beyond deep web crawling.
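To make the set covering view concrete, the following is a minimal sketch of greedy query selection with a pluggable document weight. It is not the paper's TS-IDS implementation: the abstract does not give the exact formula, so the function names, the cost model (a query costs as many documents as it returns), and the inverse-document-size weight shown here are all assumptions for illustration. The TS factor is deliberately left out because it is not specified in the abstract.

```python
from typing import Callable, Dict, List, Set


def greedy_query_selection(
    term_to_docs: Dict[str, Set[int]],
    doc_weight: Callable[[int], float] = lambda doc: 1.0,
) -> List[str]:
    """Select query terms until every retrievable document is covered.

    Assumption: the cost of a query is the number of documents it returns,
    used here as a stand-in for network traffic. Weights must be positive.
    """
    # Terms that match nothing cannot cover any document, so skip them.
    candidates = {t: docs for t, docs in term_to_docs.items() if docs}
    uncovered: Set[int] = set().union(*candidates.values())
    selected: List[str] = []

    def gain_per_cost(term: str) -> float:
        # Weight of the documents this term would newly cover, per unit cost.
        new_weight = sum(doc_weight(d) for d in candidates[term] & uncovered)
        return new_weight / len(candidates[term])

    while uncovered:
        # Iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(candidates), key=gain_per_cost)
        selected.append(best)
        uncovered -= candidates[best]
    return selected


def inverse_document_size(term_to_docs: Dict[str, Set[int]]) -> Dict[int, float]:
    """Hypothetical IDS-style weight: 1 / (number of terms matching the document)."""
    sizes: Dict[int, int] = {}
    for docs in term_to_docs.values():
        for d in docs:
            sizes[d] = sizes.get(d, 0) + 1
    return {d: 1.0 / n for d, n in sizes.items()}


# Toy index: which documents each candidate query term retrieves.
toy_index = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4}, "d": {1, 2, 3, 4}}

plain = greedy_query_selection(toy_index)
weights = inverse_document_size(toy_index)
weighted = greedy_query_selection(toy_index, doc_weight=weights.__getitem__)
```

With unit weights this reduces to the traditional greedy heuristic for set covering. Passing a different `doc_weight` changes which documents the selection favours; documents matched by few terms receive larger weights under the inverse-size assumption, so queries that reach them are preferred earlier.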

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Wang, Y., Lu, J., Chen, J. (2014). TS-IDS Algorithm for Query Selection in the Deep Web Crawling. In: Chen, L., Jia, Y., Sellis, T., Liu, G. (eds) Web Technologies and Applications. APWeb 2014. Lecture Notes in Computer Science, vol 8709. Springer, Cham. https://doi.org/10.1007/978-3-319-11116-2_17

  • DOI: https://doi.org/10.1007/978-3-319-11116-2_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11115-5

  • Online ISBN: 978-3-319-11116-2

  • eBook Packages: Computer Science (R0)
