The Anatomy of SnakeT: A Hierarchical Clustering Engine for Web-Page Snippets

Ferragina, Paolo; Gullì, Antonio

doi:10.1007/978-3-540-30116-5_48


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3202)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery


Abstract

The purpose of a search engine is to retrieve, from a given textual collection, the documents deemed relevant to a user query. Typically, a user query is modeled as a set of keywords, and a document is a Web page, a PDF file, or any other file that can be parsed into a set of tokens (words). Documents are ranked in a flat list according to some measure of relevance to the user query. That list contains hyperlinks to the relevant documents, their titles, and the so-called (page or Web) snippets, namely document excerpts that let the user judge whether a document is indeed relevant without accessing it.
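
The flat-list model described above can be illustrated with a small Python sketch. It is not SnakeT's method, nor any real engine's ranking: the relevance measure (a count of distinct query keywords found in the document), the snippet window, the function names, and the example URLs and texts are all assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query_terms, doc_tokens):
    """Toy relevance (an assumption): distinct query keywords present in the document."""
    return len(query_terms & set(doc_tokens))

def snippet(text, query_terms, width=5):
    """Return a short excerpt around the first query keyword found in the text."""
    words = text.split()
    for i, word in enumerate(words):
        if set(tokenize(word)) & query_terms:
            start = max(0, i - width)
            return " ".join(words[start:i + width + 1])
    # No keyword in the body: fall back to the opening words.
    return " ".join(words[:2 * width + 1])

def search(query, docs):
    """Rank matching documents in a flat list of (hyperlink, title, snippet)."""
    terms = set(tokenize(query))
    hits = []
    for url, (title, body) in docs.items():
        s = score(terms, tokenize(title + " " + body))
        if s > 0:
            hits.append((s, url, title, snippet(body, terms)))
    # Highest score first; ties broken by URL so the order is deterministic.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [(url, title, snip) for _, url, title, snip in hits]

# Hypothetical collection: URL -> (title, body).
DOCS = {
    "http://example.org/a": ("Jaguar cars", "The jaguar is a luxury car brand founded in England"),
    "http://example.org/b": ("Big cats", "The jaguar is a large cat native to the Americas and a strong swimmer"),
    "http://example.org/c": ("Cooking", "How to bake bread at home"),
}

results = search("jaguar cat", DOCS)
for url, title, snip in results:
    print(url, "|", title, "|", snip)
```

Each result carries exactly the three items the abstract lists: a hyperlink, a title, and a snippet that lets the user judge relevance without opening the document.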

Partially supported by the Italian MIUR projects ALINWEB and ECD, and by the Italian Registry of ccTLD.it.



Author information

Authors and Affiliations

Dipartimento di Informatica, Università di Pisa
Paolo Ferragina & Antonio Gullì


Editor information

Editors and Affiliations

INSA-Lyon, LIRIS CNRS UMR5205, F-69621, Villeurbanne, France
Jean-François Boulicaut
Dipartimento di Informatica, Università degli Studi di Bari
Floriana Esposito
Pisa KDD Laboratory, ISTI - CNR, Area della Ricerca di Pisa, Via Giuseppe Moruzzi 1, Pisa, Italy
Fosca Giannotti
Dipartimento di Informatica, Via F. Buonarroti 2, 56127, Pisa, Italy
Dino Pedreschi


About this paper

Cite this paper

Ferragina, P., Gullì, A. (2004). The Anatomy of SnakeT: A Hierarchical Clustering Engine for Web-Page Snippets. In: Boulicaut, JF., Esposito, F., Giannotti, F., Pedreschi, D. (eds) Knowledge Discovery in Databases: PKDD 2004. Lecture Notes in Computer Science, vol 3202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30116-5_48


DOI: https://doi.org/10.1007/978-3-540-30116-5_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23108-0
Online ISBN: 978-3-540-30116-5
eBook Packages: Springer Book Archive
