A Visual Based Page Segmentation for Deep Web Data Extraction

Palekar, Vikas R.

doi:10.1007/978-81-322-0491-6_72

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 131)

Abstract

This paper proposes a new web content structure analysis based on visual representation. Many web applications, such as information retrieval, information extraction, and automatic page adaptation, can benefit from this structure. The paper presents an automatic, top-down, tag-tree-independent approach to detecting web content structure. It simulates how a user understands the layout of a web page through visual perception. Compared with existing techniques such as those based on the DOM tree, the approach is independent of the underlying HTML document representation, so it can work well even when the HTML structure differs considerably from the visual layout structure.
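The abstract gives no algorithmic detail, so the following is only an illustrative sketch and not the paper's method. It uses a recursive XY-cut, a standard whitespace-based technique swapped in here to show what a top-down, tag-tree-independent segmentation can look like: the input is a flat list of rendered bounding boxes, with no reference to the HTML nesting. The `Box` type, the `segment` and `widest_gap` functions, the `min_gap` threshold, and the sample coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Bounding box of one rendered element, in page pixels (hypothetical input)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def widest_gap(boxes: List[Box], axis: str, min_gap: float) -> Optional[Tuple[float, float]]:
    """Return (gap_width, cut_position) for the widest empty band, or None."""
    # Project every box onto the chosen axis as an interval.
    if axis == "y":
        spans = sorted((b.y0, b.y1) for b in boxes)
    else:
        spans = sorted((b.x0, b.x1) for b in boxes)
    best = None
    reach = spans[0][1]  # furthest edge covered by the intervals seen so far
    for start, end in spans[1:]:
        gap = start - reach
        if gap >= min_gap and (best is None or gap > best[0]):
            best = (gap, (reach + start) / 2)
        reach = max(reach, end)
    return best


def segment(boxes: List[Box], min_gap: float = 15.0) -> List[List[Box]]:
    """Split top-down at the widest whitespace band until no band is wide enough."""
    if len(boxes) < 2:
        return [boxes] if boxes else []
    best_axis, best = None, None
    for axis in ("y", "x"):
        found = widest_gap(boxes, axis, min_gap)
        if found is not None and (best is None or found[0] > best[0]):
            best_axis, best = axis, found
    if best is None:
        return [boxes]  # no separator wide enough: treat as one visual block
    cut = best[1]
    # The cut lies inside an empty band, so every box falls wholly on one side.
    if best_axis == "y":
        first = [b for b in boxes if b.y1 <= cut]
    else:
        first = [b for b in boxes if b.x1 <= cut]
    second = [b for b in boxes if b not in first]
    return segment(first, min_gap) + segment(second, min_gap)


# Made-up layout: header, left navigation, two close paragraphs, footer.
page = [
    Box("header", 0, 0, 800, 80),
    Box("nav", 0, 100, 150, 600),
    Box("p1", 170, 100, 800, 300),
    Box("p2", 170, 310, 800, 600),
    Box("footer", 0, 640, 800, 700),
]
blocks = [[b.label for b in block] for block in segment(page)]
print(blocks)  # [['header'], ['nav'], ['p1', 'p2'], ['footer']]
```

In this sketch the two paragraphs stay in one block because the 10-pixel gap between them is below the assumed `min_gap` of 15, while the wider bands around the header, navigation, and footer become separators. A real system would obtain the boxes from a rendering engine and would likely use further visual cues such as font and background colour.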

Author information

Authors and Affiliations

M. E. I. T. Prof. Ram Meghe Institute of Technology & Research, Badnera, Maharashtra, India
Vikas R. Palekar

Corresponding author

Correspondence to Vikas R. Palekar.

Editor information

Editors and Affiliations

Department of Mathematics, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, 247667, India
Kusum Deep
Department of Mathematics and Computer Science, Liverpool Hope University, Hope Park, Liverpool, L16 9JD, United Kingdom
Atulya Nagar
Department of Paper Technology, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, 247667, India
Millie Pant
ABV-Indian Institute of Information Technology and Management, 109, E-Block, Gwalior, 474010, India
Jagdish Chand Bansal

Cite this paper

Palekar, V.R. (2012). A Visual Based Page Segmentation for Deep Web Data Extraction. In: Deep, K., Nagar, A., Pant, M., Bansal, J. (eds) Proceedings of the International Conference on Soft Computing for Problem Solving (SocProS 2011) December 20-22, 2011. Advances in Intelligent and Soft Computing, vol 131. Springer, New Delhi. https://doi.org/10.1007/978-81-322-0491-6_72

DOI: https://doi.org/10.1007/978-81-322-0491-6_72
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-0490-9
Online ISBN: 978-81-322-0491-6
eBook Packages: Engineering, Engineering (R0)
