Information Extraction from Web Pages Using Presentation Regularities and Domain Knowledge

Vadrevu, Srinivas; Gelgi, Fatih; Davulcu, Hasan

doi:10.1007/s11280-007-0021-1

Published: 02 March 2007

World Wide Web, Volume 10, pages 157–179 (2007)

Abstract

The World Wide Web is transforming itself into the largest information resource, making information extraction (IE) from the Web an important and challenging problem. In this paper, we present an automated, domain-independent IE system that transforms a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they may contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in the form of a statistical domain model. We demonstrate that such a system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and RoadRunner document sets indicate that our algorithms can scale up to very large data sets.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Arizona State University, Tempe, AZ, 85287, USA
Srinivas Vadrevu, Fatih Gelgi & Hasan Davulcu

Corresponding author

Correspondence to Srinivas Vadrevu.

About this article

Cite this article

Vadrevu, S., Gelgi, F. & Davulcu, H. Information Extraction from Web Pages Using Presentation Regularities and Domain Knowledge. World Wide Web 10, 157–179 (2007). https://doi.org/10.1007/s11280-007-0021-1

Received: 02 May 2006
Revised: 19 September 2006
Accepted: 08 January 2007
Published: 02 March 2007
Issue Date: June 2007
DOI: https://doi.org/10.1007/s11280-007-0021-1
