Integrating Deep-Web Information Sources

  • Conference paper

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 71)

Abstract

Deep-web information sources are difficult to integrate into automated business processes if they only provide a search form. A wrapping agent is a piece of software that allows a developer to query such information sources without worrying about the details of interacting with their forms. Our goal is to help software engineers construct wrapping agents that interpret queries written in high-level structured languages. We expect this to reduce integration costs because it relieves developers of the burden of transforming their queries into low-level interactions in an ad-hoc manner. In this paper, we report on our reference framework, delve into the related work, and highlight current research challenges. Our intention is to help guide future research efforts in this area.
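To make the idea of a wrapping agent concrete, the following is a minimal sketch of the translation step the abstract describes: a high-level query, given as attribute constraints, is mapped onto the fields of a search form. It is not the framework reported in the paper. The class names `Condition` and `FormWrapper`, the form URL `https://example.org/search`, and the field names `q_author` and `q_year` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class Condition:
    """One constraint of a high-level query, e.g. author = 'Kushmerick'."""
    attribute: str
    value: str


class FormWrapper:
    """Hypothetical wrapper that maps query attributes onto search-form fields."""

    def __init__(self, action_url: str, field_map: dict[str, str]):
        self.action_url = action_url  # URL the form submits to (assumed)
        self.field_map = field_map    # query attribute -> form field name

    def to_request(self, conditions: list[Condition]) -> str:
        """Translate the conditions into a GET request URL for the form."""
        params = {}
        for cond in conditions:
            if cond.attribute not in self.field_map:
                # The form cannot express this constraint, so fail loudly.
                raise ValueError(f"unsupported attribute: {cond.attribute}")
            params[self.field_map[cond.attribute]] = cond.value
        return f"{self.action_url}?{urlencode(params)}"


# Hypothetical form description; a real one would come from analysing the form.
wrapper = FormWrapper(
    "https://example.org/search",
    {"author": "q_author", "year": "q_year"},
)
url = wrapper.to_request(
    [Condition("author", "Kushmerick"), Condition("year", "2000")]
)
print(url)
```

The developer states what to look for, and the wrapper decides which form fields to fill in. A full wrapping agent would also submit the request, navigate the result pages, and extract the data records, none of which this sketch attempts.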

Supported by the European Commission (FEDER), the Spanish and the Andalusian R&D&I programmes (grants TIN2007-64119, P07-TIC-2602, P08-TIC-4100, and TIN2008-04718-E).




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

de Viana, I.F., Hernandez, I., Jiménez, P., Rivero, C.R., Sleiman, H.A. (2010). Integrating Deep-Web Information Sources. In: Demazeau, Y., et al. Trends in Practical Applications of Agents and Multiagent Systems. Advances in Intelligent and Soft Computing, vol 71. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12433-4_37

  • DOI: https://doi.org/10.1007/978-3-642-12433-4_37

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12432-7

  • Online ISBN: 978-3-642-12433-4

  • eBook Packages: Engineering, Engineering (R0)
