Abstract
Deep-web information sources are difficult to integrate into automated business processes if they only provide a search form. A wrapping agent is a piece of software that allows a developer to query such information sources without worrying about the details of interacting with such forms. Our goal is to help software engineers construct wrapping agents that interpret queries written in high-level structured languages. We expect this to reduce integration costs because it relieves developers of the burden of transforming their queries into low-level interactions in an ad-hoc manner. In this paper, we report on our reference framework, delve into the related work, and highlight current research challenges. This is intended to help guide future research efforts in this area.
Supported by the European Commission (FEDER), the Spanish and the Andalusian R&D&I programmes (grants TIN2007-64119, P07-TIC-2602, P08-TIC-4100, and TIN2008-04718-E).
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
de Viana, I.F., Hernandez, I., Jiménez, P., Rivero, C.R., Sleiman, H.A. (2010). Integrating Deep-Web Information Sources. In: Demazeau, Y., et al. Trends in Practical Applications of Agents and Multiagent Systems. Advances in Intelligent and Soft Computing, vol 71. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12433-4_37
DOI: https://doi.org/10.1007/978-3-642-12433-4_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12432-7
Online ISBN: 978-3-642-12433-4
eBook Packages: Engineering, Engineering (R0)