Visually Extracting Data Records from Query Result Pages

Anderson, Neil; Hong, Jun

doi:10.1007/978-3-642-37401-2_40

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7808)

Abstract

Web databases are now pervasive. Query result pages are dynamically generated from these databases in response to user-submitted queries. Automatically extracting structured data from query result pages is a challenging problem, as the structure of the data is not explicitly represented. While humans have shown good intuition in visually understanding data records on a query result page as displayed by a web browser, no existing approach to data record extraction has made full use of this intuition. We propose a novel approach, in which we make use of the common sources of evidence that humans use to understand data records on a displayed query result page. These include structural regularity, and visual and content similarity between data records displayed on a query result page. Based on these observations we propose new techniques that can identify each data record individually, while ignoring noise items, such as navigation bars and adverts. We have implemented these techniques in a software prototype, rExtractor, and tested it using two datasets. Our experimental results show that our approach achieves significantly higher accuracy than previous approaches.
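The idea of combining visual and content similarity to separate data records from noise can be illustrated with a small sketch. This is not rExtractor's algorithm, which the abstract does not specify. It assumes a hypothetical `Block` structure holding a rendered bounding box and text, uses left-edge alignment and width ratio as a stand-in for visual similarity, uses the Jaccard index over coarse token types as a stand-in for content similarity, and picks illustrative thresholds.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A rendered page block (hypothetical): bounding box plus visible text."""
    x: int
    y: int
    width: int
    height: int
    text: str


def token_shapes(text):
    """Reduce text to coarse token types so different records can still match."""
    shapes = set()
    for token in text.split():
        if any(ch.isdigit() for ch in token):
            shapes.add("<num>")          # prices, counts, dates
        elif token.endswith(":"):
            shapes.add(token.lower())    # field labels such as "Price:"
        elif token.isalpha():
            shapes.add("<word>")
        else:
            shapes.add(token)            # separators and other symbols
    return shapes


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def visual_similarity(a, b):
    """Assumed measure: blocks must share a left edge; score is the width ratio."""
    if a.x != b.x:
        return 0.0
    return min(a.width, b.width) / max(a.width, b.width)


def content_similarity(a, b):
    return jaccard(token_shapes(a.text), token_shapes(b.text))


def extract_records(blocks, visual_min=0.9, content_min=0.5):
    """Greedily cluster similar blocks and return the largest repeating group."""
    clusters = []
    for block in blocks:
        for cluster in clusters:
            rep = cluster[0]  # compare against the first block of each cluster
            if (visual_similarity(rep, block) >= visual_min
                    and content_similarity(rep, block) >= content_min):
                cluster.append(block)
                break
        else:
            clusters.append([block])
    best = max(clusters, key=len, default=[])
    # A single block shows no regularity, so it is treated as noise.
    return best if len(best) >= 2 else []


# Made-up page: a navigation bar, three records, and an advert.
page = [
    Block(0, 0, 800, 30, "Home | About | Contact"),
    Block(100, 50, 600, 80, "Title: Dune Price: $9"),
    Block(100, 140, 600, 80, "Title: Emma Price: $12"),
    Block(100, 230, 590, 80, "Title: Ulysses Price: $15"),
    Block(720, 50, 80, 300, "Buy now 50% off"),
]
records = extract_records(page)
print([b.text for b in records])
```

In this toy page the three record blocks align on the left edge and share the same token types, so they form the largest cluster, while the navigation bar and the advert each end up alone and are discarded.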

Author information

Authors and Affiliations

School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast, UK, BT7 1NN
Neil Anderson & Jun Hong

Editor information

Editors and Affiliations

Department of Information Engineering, Nagoya University, 464-8601, Nagoya, Japan
Yoshiharu Ishikawa
Department of Computer Science and Technology, Harbin Institute of Technology, 150006, Harbin, China
Jianzhong Li
School of Computer Science and Engineering, University of New South Wales, 2031, Sydney, NSW, Australia
Wei Wang & Wenjie Zhang
Department of Computing and Information Systems, University of Melbourne, 3052, Melbourne, VIC, Australia
Rui Zhang

Cite this paper

Anderson, N., Hong, J. (2013). Visually Extracting Data Records from Query Result Pages. In: Ishikawa, Y., Li, J., Wang, W., Zhang, R., Zhang, W. (eds) Web Technologies and Applications. APWeb 2013. Lecture Notes in Computer Science, vol 7808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37401-2_40

DOI: https://doi.org/10.1007/978-3-642-37401-2_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37400-5
Online ISBN: 978-3-642-37401-2
eBook Packages: Computer Science (R0)