Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach

  • Chapter
Intelligent Exploration of the Web

Part of the book series: Studies in Fuzziness and Soft Computing ((STUDFUZZ,volume 111))

Abstract

A critical problem in developing information agents for the Web is accessing data that is formatted for human use. We have developed a set of tools for extracting data from web sites and transforming it into a structured data format, such as XML. The resulting data can then be used to build new applications without having to deal with unstructured data. The advantages of our wrapping technology over previous work are the ability to learn highly accurate extraction rules, to verify the wrapper to ensure that the correct data continues to be extracted, and to automatically adapt to changes in the sites from which the data is being extracted.

© 2000 IEEE. Reprinted, with permission, from IEEE Data Engineering Bulletin, 23(4), December, 2000.

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Knoblock, C.A., Lerman, K., Minton, S., Muslea, I. (2003). Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach. In: Szczepaniak, P.S., Segovia, J., Kacprzyk, J., Zadeh, L.A. (eds) Intelligent Exploration of the Web. Studies in Fuzziness and Soft Computing, vol 111. Physica, Heidelberg. https://doi.org/10.1007/978-3-7908-1772-0_17

  • DOI: https://doi.org/10.1007/978-3-7908-1772-0_17

  • Publisher Name: Physica, Heidelberg

  • Print ISBN: 978-3-7908-2519-0

  • Online ISBN: 978-3-7908-1772-0

  • eBook Packages: Springer Book Archive
