On Finding Templates on Web Collections

Vieira, Karane; da Costa Carvalho, André Luiz; Berlt, Klessius; de Moura, Edleno S.; da Silva, Altigran S.; Freire, Juliana

doi:10.1007/s11280-009-0059-3

Published: 27 January 2009

World Wide Web, volume 12, pages 171–211 (2009)

Karane Vieira¹,
André Luiz da Costa Carvalho¹,
Klessius Berlt¹,
Edleno S. de Moura¹,
Altigran S. da Silva¹ &
Juliana Freire²

Abstract

Templates are pieces of HTML code common to a set of web pages, usually adopted by content providers to enhance the uniformity of layout and navigation of their Web sites. They are usually generated using authoring/publishing tools or by programs that build HTML pages to publish content from a database. In spite of their usefulness, the content of templates can negatively affect the quality of results produced by systems that automatically process information available in web sites, such as search engines, clustering and automatic categorization programs. Further, the information available in templates is redundant, and thus processing and storing such information just once for a set of pages may save computational resources. In this paper, we present and evaluate methods for detecting templates considering a scenario where multiple templates can be found in a collection of Web pages. Most previous work has studied template detection algorithms in a scenario where the collection has just a single template. The scenario with multiple templates is more realistic and, as discussed here, it raises important questions that may require extensions and adjustments in previously proposed template detection algorithms. We show how to apply and evaluate two template detection algorithms in this scenario, creating solutions for detecting multiple templates. The methods studied partition the input collection into clusters that contain common HTML paths and share a high number of HTML nodes, and then apply a single-template detection procedure over each cluster. We also propose a new algorithm for single template detection based on a restricted form of bottom-up tree-mapping that requires only a small set of pages to correctly identify a template and which has a worst-case linear complexity. Our experimental results over a representative set of Web pages show that our approach is efficient and scalable while obtaining accurate results.
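The two-stage idea described in the abstract (partition pages by shared HTML paths, then look for a template inside each cluster) can be illustrated with a minimal sketch. This is not the authors' algorithm: the names `tag_paths`, `cluster_pages`, and `common_paths`, the Jaccard similarity measure, the greedy clustering strategy, and the `threshold` value of 0.6 are all assumptions made for illustration, and a plain set intersection stands in for the single-template detection step, which the paper bases on bottom-up tree mapping.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths (e.g. 'html/body/div') in a page."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity between two sets of paths (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages, threshold=0.6):
    """Greedy clustering: a page joins the first cluster whose
    representative path set is similar enough, else starts a new one."""
    clusters = []  # list of (representative_paths, member_indices)
    for i, html in enumerate(pages):
        paths = tag_paths(html)
        for rep, members in clusters:
            if jaccard(rep, paths) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((paths, [i]))
    return [members for _, members in clusters]


def common_paths(pages, members):
    """Simplified stand-in for single-template detection:
    the paths shared by every page in the cluster."""
    return set.intersection(*(tag_paths(pages[i]) for i in members))


pages = [
    "<html><body><div><ul><li>x</li></ul></div><p>a</p></body></html>",
    "<html><body><div><ul><li>y</li></ul></div><p>b</p></body></html>",
    "<html><body><table><tr><td>z</td></tr></table></body></html>",
]
clusters = cluster_pages(pages)
template_candidates = [common_paths(pages, members) for members in clusters]
```

Here the first two pages share all of their tag paths and fall into one cluster, while the table-based page forms its own. A real system would compare node content as well as structure, since pages can share paths without sharing a template.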

Author information

Authors and Affiliations

Department of Computer Science, Federal University of Amazonas, Manaus, Brazil
Karane Vieira, André Luiz da Costa Carvalho, Klessius Berlt, Edleno S. de Moura & Altigran S. da Silva
School of Computing, University of Utah, Salt Lake City, USA
Juliana Freire

Corresponding author

Correspondence to Edleno S. de Moura.

About this article

Cite this article

Vieira, K., da Costa Carvalho, A.L., Berlt, K. et al. On Finding Templates on Web Collections. World Wide Web 12, 171–211 (2009). https://doi.org/10.1007/s11280-009-0059-3

Received: 07 January 2008
Revised: 18 December 2008
Accepted: 05 January 2009
Published: 27 January 2009
Issue Date: June 2009
DOI: https://doi.org/10.1007/s11280-009-0059-3
