Integrating Deep-Web Information Sources

de Viana, Iñaki Fernández; Hernandez, Inma; Jiménez, Patricia; Rivero, Carlos R.; Sleiman, Hassan A.

doi:10.1007/978-3-642-12433-4_37


Conference paper


Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 71)

Abstract

Deep-web information sources are difficult to integrate into automated business processes if they only provide a search form. A wrapping agent is a piece of software that allows a developer to query such information sources without worrying about the details of interacting with such forms. Our goal is to help software engineers construct wrapping agents that interpret queries written in high-level structured languages. We think that this helps reduce integration costs because it relieves developers of the burden of transforming their queries into low-level interactions in an ad-hoc manner. In this paper, we report on our reference framework, delve into the related work, and highlight current research challenges. This is intended to help guide future research efforts in this area.

Supported by the European Commission (FEDER), the Spanish and the Andalusian R&D&I programmes (grants TIN2007-64119, P07-TIC-2602, P08-TIC-4100, and TIN2008-04718-E).



Author information

Authors and Affiliations

University of Huelva
Iñaki Fernández de Viana & Patricia Jiménez
University of Sevilla
Inma Hernandez, Carlos R. Rivero & Hassan A. Sleiman


Editor information

Editors and Affiliations

Laboratoire d’Informatique de Grenoble, Centre National de la Recherche Scientifique, Maison Jean Kuntzmann, 110 av. de la Chimie, F-38041, Grenoble, France
Yves Demazeau
Department of Information and Computing Sciences, Universiteit Utrecht Centrumgebouw Noord, office A117, Padualaan 14, De Uitho, 3584CH, Utrecht, The Netherlands
Frank Dignum
Departamento de Informática y Automática, Facultad de Ciencias, Universidad de Salamanca, Plaza de la Merced S/N, 37008, Salamanca, Spain
Juan M. Corchado
Escuela Universitaria de Informática, Universidad Pontificia de Salamanca, Compañía 5, 37002, Salamanca, Spain
Javier Bajo
ETSI Informática, Universidad de Sevilla, Avda. Reina Mercedes s/n, 41012, Sevilla, Spain
Rafael Corchuelo
Departamento de Informática y Automática Facultad de Ciencias, Universidad de Salamanca, Plaza de la Merced S/N, 37008, Salamanca, Spain
Emilio Corchado
Escuela Superior de Ingeniería Informática, Edificio Politécnico, Despacho 408, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
Florentino Fernández-Riverola
Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Camino de Vera s/n, 46022, Valencia, Spain
Vicente J. Julián
Department of Computing and Management, Poznan University of Technology, Strzelecka Str. 11, 60965, Poznan, Poland
Pawel Pawlewski
Department of Computer Science, Dartmouth College 6211 Sudikoff Laboratory, NH 03755-3510, Hanover, USA
Andrew Campbell


About this paper

Cite this paper

de Viana, I.F., Hernandez, I., Jiménez, P., Rivero, C.R., Sleiman, H.A. (2010). Integrating Deep-Web Information Sources. In: Demazeau, Y., et al. Trends in Practical Applications of Agents and Multiagent Systems. Advances in Intelligent and Soft Computing, vol 71. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12433-4_37


DOI: https://doi.org/10.1007/978-3-642-12433-4_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12432-7
Online ISBN: 978-3-642-12433-4
eBook Packages: Engineering, Engineering (R0)
