Exploiting Web Sites Structural and Content Features for Web Pages Clustering

Lanotte, Pasqua Fabiana; Fumarola, Fabio; Malerba, Donato; Ceci, Michelangelo

doi:10.1007/978-3-319-60438-1_44

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10352)

Included in the following conference series:

International Symposium on Methodologies for Intelligent Systems

Abstract

Web page clustering is a central task in Web mining: it helps organize the content of websites, understand their structure, and discover interactions among web pages. The task is difficult because web pages have multiple dimensions, based on textual, hyperlink, and HTML formatting (i.e., HTML tags and visual) properties. Existing algorithms use these sources of information almost independently, mainly because they are difficult to combine. This paper contributes a method for clustering the web pages of a website that uses a distributional representation combining all these features in a single vector space. The approach first crawls the website, using the pages' HTML formatting and web lists to identify the hyperlink structure and represent it by means of an adapted skip-gram model. This hyperlink structure and the textual information are then fused into a single vector space representation. The resulting representation is used to cluster the pages of a website using their hyperlink structure and textual information simultaneously. Experiments on real websites show that the proposed method improves clustering results.
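
The abstract does not spell out how page sequences are built or how the two views are fused, so the following is only a rough sketch under stated assumptions, not the authors' implementation. It assumes that page sequences come from simple random walks over a toy link graph (a stand-in for the crawl guided by web lists), that skip-gram training pairs are taken from a fixed window over each walk, and that fusion is plain concatenation of L2-normalised vectors. All names (`random_walks`, `skipgram_pairs`, `fuse`, the toy `links` graph) are hypothetical.

```python
import math
import random


def random_walks(links, walk_length, walks_per_page, seed=0):
    """Generate random walks over a page -> outlinks mapping (assumed stand-in for the crawl)."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(links):
        for _ in range(walks_per_page):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = links.get(walk[-1], [])
                if not neighbours:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


def skipgram_pairs(walk, window):
    """Return (target, context) pairs for pages at most `window` steps apart in a walk."""
    pairs = []
    for i, target in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, walk[j]))
    return pairs


def l2_normalise(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(link_vec, text_vec):
    """Assumed fusion: concatenate the normalised link and text views of one page."""
    return l2_normalise(link_vec) + l2_normalise(text_vec)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical toy website: each page maps to the pages it links to.
links = {
    "home": ["staff", "courses"],
    "staff": ["alice", "bob"],
    "courses": ["ml", "db"],
    "alice": ["staff"],
    "bob": ["staff"],
    "ml": ["courses"],
    "db": ["courses"],
}

walks = random_walks(links, walk_length=4, walks_per_page=2)
pairs = [pair for walk in walks for pair in skipgram_pairs(walk, window=1)]
```

The `pairs` list is what a skip-gram trainer would consume to learn one link-based vector per page. Each such vector could then be passed to `fuse` together with a text vector (for example, TF-IDF weights), and the fused vectors handed to any standard clustering algorithm that works on a vector space.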

Acknowledgments

We acknowledge the support of the EU Commission through the project MAESTRA - Learning from Massive, Incompletely annotated, and Structured Data (Grant number ICT-2013-612944).

Author information

Authors and Affiliations

University of Bari Aldo Moro, via Orabona 4, 70125, Bari, Italy
Pasqua Fabiana Lanotte, Donato Malerba & Michelangelo Ceci
Unicredit Research and Development, 20100, Milan, Italy
Fabio Fumarola
CINI - Consorzio Interuniversitario Nazionale per l’Informatica, Bari, Italy
Donato Malerba & Michelangelo Ceci

Corresponding author

Correspondence to Pasqua Fabiana Lanotte.

Editor information

Editors and Affiliations

Warsaw University of Technology, Warsaw, Poland
Marzena Kryszkiewicz
University of Bari Aldo Moro, Bari, Italy
Annalisa Appice
Institute of Informatics, University of Warsaw, Warsaw, Poland
Dominik Ślęzak
Faculty of Electronics & Information, Warsaw University of Technology, Warsaw, Poland
Henryk Rybinski
Institute of Mathematics, Warsaw University, Warsaw, Poland
Andrzej Skowron
Department of Computer Science, University of North Carolina at Charlotte, North Carolina, USA
Zbigniew W. Raś

About this paper

Cite this paper

Lanotte, P.F., Fumarola, F., Malerba, D., Ceci, M. (2017). Exploiting Web Sites Structural and Content Features for Web Pages Clustering. In: Kryszkiewicz, M., Appice, A., Ślęzak, D., Rybinski, H., Skowron, A., Raś, Z. (eds) Foundations of Intelligent Systems. ISMIS 2017. Lecture Notes in Computer Science, vol 10352. Springer, Cham. https://doi.org/10.1007/978-3-319-60438-1_44

DOI: https://doi.org/10.1007/978-3-319-60438-1_44
Published: 14 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60437-4
Online ISBN: 978-3-319-60438-1
eBook Packages: Computer Science, Computer Science (R0)
