DOI: 10.1145/1458502.1458505
research-article

Automatic wrapper induction from hidden-web sources with domain knowledge

Published: 30 October 2008

ABSTRACT

We present an original approach to the automatic induction of wrappers for hidden-Web sources that requires no human supervision. The approach needs only domain knowledge, expressed as a set of concept names and concept instances. Extracting valuable data from hidden-Web sources involves two tasks: understanding the structure of a given HTML form and relating its fields to concepts of the domain, and understanding how the resulting records are represented in an HTML result page. For the first task, we combine heuristics with probing of the form using domain instances; for the second, we apply a supervised machine learning technique adapted to tree-like information to an automatic, imperfect, and imprecise annotation produced from the domain knowledge. We report experiments that demonstrate the validity and potential of the approach.
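
As a rough illustration of the two tasks named in the abstract, the following minimal Python sketch relates form field labels to concepts by simple name matching and annotates the text nodes of a result page with a gazetteer lookup. It is not the authors' implementation: the `DOMAIN` dictionary, the function names, the substring heuristic, and the sample HTML are all assumptions made for this example, and neither form probing nor the learning step is shown.

```python
from html.parser import HTMLParser

# Hypothetical domain knowledge: concept names mapped to known instances.
DOMAIN = {
    "author": {"serge abiteboul", "jeffrey ullman"},
    "title": {"foundations of databases"},
}


def match_field_to_concept(field_label, domain):
    """Relate a form field to a concept with a naive name-matching heuristic."""
    label = field_label.lower()
    for concept in domain:
        if concept in label:
            return concept
    return None  # no concept name occurs in the label


class TextNodeCollector(HTMLParser):
    """Collect non-empty text nodes together with the path of enclosing tags."""

    def __init__(self):
        super().__init__()
        self.path = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.path:
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.path), text))


def annotate(html, domain):
    """Label each text node with a concept when it exactly matches an instance.

    The result is deliberately imperfect: nodes unknown to the gazetteer
    keep the label None.
    """
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    annotated = []
    for path, text in collector.nodes:
        label = None
        for concept, instances in domain.items():
            if text.lower() in instances:
                label = concept
                break
        annotated.append((path, text, label))
    return annotated


SAMPLE_PAGE = (
    "<table><tr><td>Serge Abiteboul</td>"
    "<td>Foundations of Databases</td><td>1995</td></tr></table>"
)

if __name__ == "__main__":
    print(match_field_to_concept("Author name:", DOMAIN))
    for row in annotate(SAMPLE_PAGE, DOMAIN):
        print(row)
```

In this sketch the year cell stays unlabeled because the toy gazetteer has no matching instance. That kind of partial, noisy labeling over a tree of tag paths is what the abstract describes as the input to the supervised learner.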


Published in

WIDM '08: Proceedings of the 10th ACM workshop on Web information and data management
October 2008
164 pages
ISBN: 9781605582603
DOI: 10.1145/1458502

Copyright © 2008 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery, New York, NY, United States
