Abstract
The World Wide Web is becoming the largest information resource, which makes information extraction (IE) from the Web an important and challenging problem. In this paper, we present an automated, domain-independent IE system that transforms a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they may contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in the form of a statistical domain model. We demonstrate that such a system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and RoadRunner document sets indicate that our algorithms can scale up to very large data sets.
Cite this article
Vadrevu, S., Gelgi, F. & Davulcu, H. Information Extraction from Web Pages Using Presentation Regularities and Domain Knowledge. World Wide Web 10, 157–179 (2007). https://doi.org/10.1007/s11280-007-0021-1