Information Extraction from Web Pages Using Presentation Regularities and Domain Knowledge

World Wide Web

Abstract

The World Wide Web is transforming itself into the largest information resource, making information extraction (IE) from the Web an important and challenging problem. In this paper, we present an automated, domain-independent IE system that transforms a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they may contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in the form of a statistical domain model. We demonstrate that such a system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and RoadRunner document sets indicate that our algorithms can scale up to very large data sets.

Author information

Corresponding author

Correspondence to Srinivas Vadrevu.

About this article

Cite this article

Vadrevu, S., Gelgi, F. & Davulcu, H. Information Extraction from Web Pages Using Presentation Regularities and Domain Knowledge. World Wide Web 10, 157–179 (2007). https://doi.org/10.1007/s11280-007-0021-1

  • DOI: https://doi.org/10.1007/s11280-007-0021-1
