Exploiting Attribute Redundancy for Web Entity Data Extraction

Zhu, Yanxu; Yin, Gang; Li, Xiang; Wang, Huaimin; Shi, Dianxi; Yuan, Lin

doi:10.1007/978-3-642-24826-9_15

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7008)

Included in the following conference series:

International Conference on Asian Digital Libraries

Abstract

Web entities are often associated with many attributes that describe them, and extracting these attributes is essential for Web entity data extraction. This paper proposes a novel approach that exploits duplicated attribute-value pairs. First, we construct an initial seed set of attributes, including names and enumerable values, and a training set of Web pages from the target website. Second, we locate the position of each attribute by matching attribute values within the pages of the site against the values in the seed set. Third, we choose the position with the highest supportiveness as the extraction path, which we use to extract other attribute-value pairs that share the same template. Finally, we conduct an extensive experimental study on a large real data set to demonstrate the effectiveness of our extraction approach.
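The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define "supportiveness", so the sketch assumes it is simply the number of seed values found at a given position across the training pages. The position is modeled as an indexed tag path, and the names `PathText`, `text_nodes`, `best_path`, and `extract` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag; they are left out of the paths.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class PathText(HTMLParser):
    """Collect (indexed_tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []            # (tag, label) for each open element
        self.counts = [Counter()]  # same-tag sibling counters, one per depth
        self.nodes = []            # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.counts[-1][tag] += 1
        self.stack.append((tag, f"{tag}[{self.counts[-1][tag]}]"))
        self.counts.append(Counter())

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if any(open_tag == tag for open_tag, _ in self.stack):
            while True:
                open_tag, _ = self.stack.pop()
                self.counts.pop()
                if open_tag == tag:
                    break

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(label for _, label in self.stack)
            self.nodes.append((path, text))


def text_nodes(page):
    parser = PathText()
    parser.feed(page)
    parser.close()
    return parser.nodes


def best_path(training_pages, seed_values):
    """Return the path with the most seed-value matches (assumed 'supportiveness')."""
    support = Counter()
    for page in training_pages:
        for path, text in text_nodes(page):
            if text in seed_values:
                support[path] += 1
    return support.most_common(1)[0][0] if support else None


def extract(page, path):
    """Return every text value found at the chosen path on a page."""
    return [text for node_path, text in text_nodes(page) if node_path == path]
```

Given a few pages that share a template and a seed set for one attribute, `best_path` picks the position where seed values recur most often, and `extract` applies that position to pages whose values are not in the seed set. A stray match elsewhere on a page (for example, a seed value appearing in a title) is outvoted as long as the true position collects more matches.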

Author information

Authors and Affiliations

College of Computer Science and Technology, National University of Defense Technology, 410073, Changsha, Hunan, China
Yanxu Zhu, Gang Yin, Xiang Li, Huaimin Wang & Dianxi Shi
National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, 410073, China
Huaimin Wang
College of Electronic Technology, Information Engineering University, 450004, Zhengzhou, Henan, China
Lin Yuan

Editor information

Editors and Affiliations

Information Science and Technology Building, Tsinghua University, 100084, Beijing, P.R. China
Chunxiao Xing
Faculty of Informatics, University of Lugano, 6900, Lugano, Switzerland
Fabio Crestani
Institute of Software Technology and Interactive Systems, Vienna University of Technology, 1040, Vienna, Austria
Andreas Rauber

Cite this paper

Zhu, Y., Yin, G., Li, X., Wang, H., Shi, D., Yuan, L. (2011). Exploiting Attribute Redundancy for Web Entity Data Extraction. In: Xing, C., Crestani, F., Rauber, A. (eds) Digital Libraries: For Cultural Heritage, Knowledge Dissemination, and Future Creation. ICADL 2011. Lecture Notes in Computer Science, vol 7008. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24826-9_15

DOI: https://doi.org/10.1007/978-3-642-24826-9_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24825-2
Online ISBN: 978-3-642-24826-9
eBook Packages: Computer Science, Computer Science (R0)
