
Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7008))

Abstract

Web entities are often associated with many attributes that describe them, and extracting these attributes is essential for Web entity data extraction. This paper proposes a novel approach that exploits duplicated attribute-value pairs. First, we construct an initial seed set of attributes, including their names and enumerable values, and a training set of Web pages from the target website. Second, we locate the position of each attribute by matching attribute values within the pages of the site against the values contained in the seed set. Third, we choose the position with the highest supportiveness as the extraction path, which we use to extract other attribute-value pairs that follow the same template. Finally, we conduct an extensive experimental study on a large real data set to demonstrate the effectiveness of our extraction approach.
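
The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each page has already been flattened into (path, text) pairs, and it approximates "supportiveness" as the number of seed-value matches observed at a path, since the abstract does not define the measure. The function names, the seed contents, and the paths are all hypothetical.

```python
from collections import Counter

def locate_paths(pages, seed):
    """Count, per attribute, how often a seed value appears at each path."""
    support = {name: Counter() for name in seed}
    for page in pages:
        for path, text in page:
            for name, values in seed.items():
                # Assumption: case-insensitive exact match on the node text.
                if text.strip().lower() in values:
                    support[name][path] += 1
    return support

def choose_paths(support):
    """Keep, per attribute, the path with the highest match count."""
    return {name: counts.most_common(1)[0][0]
            for name, counts in support.items() if counts}

def extract(page, paths):
    """Read the text found at each chosen path on a page."""
    by_path = dict(page)
    return {name: by_path[path]
            for name, path in paths.items() if path in by_path}

# Hypothetical seed set: attribute name -> enumerable values (lowercase).
seed = {"color": {"red", "blue"}, "format": {"paperback", "hardcover"}}

# Hypothetical training pages that share one template.
training = [
    [("/html/body/div[1]", "Red"), ("/html/body/div[2]", "Paperback"),
     ("/html/body/p", "Blue sky")],
    [("/html/body/div[1]", "Blue"), ("/html/body/div[2]", "Hardcover"),
     ("/html/body/p", "red")],
    [("/html/body/div[1]", "Red"), ("/html/body/div[2]", "Paperback")],
]

paths = choose_paths(locate_paths(training, seed))

# A new page whose values are not in the seed set.
new_page = [("/html/body/div[1]", "Green"), ("/html/body/div[2]", "Ebook")]
result = extract(new_page, paths)
print(paths)
print(result)
```

In this toy data the stray "red" in the paragraph node is outvoted by the three matches at the first div, so the values on the new page are read from the template positions even though the seed set has never seen them.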


Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhu, Y., Yin, G., Li, X., Wang, H., Shi, D., Yuan, L. (2011). Exploiting Attribute Redundancy for Web Entity Data Extraction. In: Xing, C., Crestani, F., Rauber, A. (eds) Digital Libraries: For Cultural Heritage, Knowledge Dissemination, and Future Creation. ICADL 2011. Lecture Notes in Computer Science, vol 7008. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24826-9_15

  • DOI: https://doi.org/10.1007/978-3-642-24826-9_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24825-2

  • Online ISBN: 978-3-642-24826-9

  • eBook Packages: Computer Science, Computer Science (R0)
