research-article

Automatic wrapper induction from hidden-web sources with domain knowledge

Authors:
Pierre Senellart (INRIA Saclay & TELECOM ParisTech, Paris, France)
Avin Mittal (Indian Institute of Technology, Bombay, India)
Daniel Muschick (Technische Universität Graz, Graz, Austria)
Rémi Gilleron (Université Lille 3 & INRIA Lille, Villeneuve d'Ascq, France)
Marc Tommasi (Université Lille 3 & INRIA Lille, Villeneuve d'Ascq, France)

WIDM '08: Proceedings of the 10th ACM workshop on Web information and data management, October 2008, Pages 9–16
https://doi.org/10.1145/1458502.1458505

Published: 30 October 2008

ABSTRACT

We present an original approach to the automatic induction of wrappers for hidden-Web sources that requires no human supervision. The only input is domain knowledge, expressed as a set of concept names and concept instances. Extracting valuable data from a hidden-Web source involves two tasks: understanding the structure of a given HTML form and relating its fields to concepts of the domain, and understanding how the resulting records are represented in an HTML result page. For the former, we combine heuristics with probing of the form using domain instances; for the latter, we apply a supervised machine learning technique adapted to tree-like information to an automatic, imperfect, and imprecise annotation of the result page derived from the domain knowledge. Experiments demonstrate the validity and potential of the approach.
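
The two ingredients named in the abstract, domain knowledge given as concept names with instances and an automatic but imperfect annotation of result pages, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the concepts, the instances, the year pattern, and the function names (`guess_concept`, `annotate`, `annotate_page`) are all assumptions chosen for illustration, and the label matching shown is only one simple heuristic among many possible ones.

```python
import re
from html.parser import HTMLParser

# Hypothetical domain knowledge: concept names mapped to a few known instances.
DOMAIN = {
    "author": {"serge abiteboul", "jeffrey ullman"},
    "title": {"foundations of databases"},
    "year": set(),  # recognized by a pattern below rather than by a list
}

# Assumed pattern for the "year" concept (four digits starting with 19 or 20).
YEAR_PATTERN = re.compile(r"^(19|20)\d{2}$")


def guess_concept(field_label):
    """Heuristic: relate a form field to a concept if its label contains the concept name."""
    words = set(re.findall(r"[a-z]+", field_label.lower()))
    for concept in DOMAIN:
        if concept in words:
            return concept
    return None


def annotate(text):
    """Label one text node with a concept, or None when the domain knowledge is silent."""
    normalized = " ".join(text.lower().split())
    if YEAR_PATTERN.match(normalized):
        return "year"
    for concept, instances in DOMAIN.items():
        if normalized in instances:
            return concept
    return None


class TextNodeCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML fragment in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        if data.strip():
            self.nodes.append(data.strip())


def annotate_page(html):
    """Return (text, concept) pairs; unmatched nodes keep the label None."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return [(text, annotate(text)) for text in collector.nodes]


if __name__ == "__main__":
    row = ("<tr><td>Serge Abiteboul</td><td>Foundations of Databases</td>"
           "<td>1995</td><td>Addison-Wesley</td></tr>")
    for text, concept in annotate_page(row):
        print(f"{text!r}: {concept}")
```

In this sketch the last cell of the row stays unlabeled because the toy domain knowledge has no matching concept. Gaps and errors of this kind are what make the annotation imperfect and imprecise, and they are the reason the abstract pairs it with a supervised learner that generalizes from the annotated tree.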

Index Terms

Automatic wrapper induction from hidden-web sources with domain knowledge
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. World Wide Web
    1. Web applications
    2. Web services

Published in
WIDM '08: Proceedings of the 10th ACM workshop on Web information and data management
October 2008
164 pages
ISBN: 9781605582603
DOI: 10.1145/1458502
Program Chairs:
Chee-Yong Chan (National University of Singapore, Singapore)
Neoklis Polyzotis (University of California-Santa Cruz, USA)
Copyright © 2008 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
deep web
form
hidden web
information extraction
invisible web
probing
web service
wrapper