Web Search Clustering and Labeling with Hidden Topics

Authors:
Cam-Tu Nguyen (Tohoku University)
Xuan-Hieu Phan (Tohoku University)
Susumu Horiguchi (Tohoku University)
Thu-Trang Nguyen (Vietnam National University)
Quang-Thuy Ha (Vietnam National University)

ACM Transactions on Asian Language Information Processing, Volume 8, Issue 3, Article No. 12, pp. 1–40
https://doi.org/10.1145/1568292.1568295

Published: 01 August 2009

Abstract

Web search clustering reorganizes search results (also called “snippets”) so that they are more convenient to browse. Such post-retrieval clustering systems face three key requirements: (1) the clustering algorithm should group similar documents together; (2) clusters should be labeled with descriptive phrases; and (3) the clustering system should provide high-quality clustering without downloading whole Web pages.

This article introduces a novel framework for clustering Vietnamese Web search results that addresses all three requirements. The main motivation is that enriching short snippets with hidden topics drawn from large document collections on the Internet makes it possible to cluster and label those snippets effectively, in a topic-oriented manner, without processing whole Web pages. Our approach builds on recent successful topic analysis models such as Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation. The underlying idea is to collect a very large external data collection, called the “universal dataset,” and then build a clustering system on both the original snippets and a rich set of hidden topics discovered from that collection. The hidden topics thus give the snippets to be clustered a richer representation. We carefully evaluate our method and show that it yields impressive clustering quality.
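
The enrichment idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes a topic model has already been estimated offline on a universal dataset, and stands in for that model with a tiny hand-written table of word probabilities per topic (`TOPIC_WORD`). The inference step (`infer_topics`), the `topic_weight` parameter, the English toy snippets, and the average-link clustering with a similarity threshold are all illustrative assumptions. The point is only to show how topic features can make two snippets with no words in common look similar.

```python
import math
from collections import Counter

# ASSUMPTION: toy P(word | topic) table standing in for a topic model
# (e.g. LDA) estimated offline on a large "universal dataset".
TOPIC_WORD = {
    "sports": {"football": 0.4, "match": 0.3, "league": 0.2, "goal": 0.1},
    "finance": {"bank": 0.4, "stock": 0.3, "market": 0.2, "loan": 0.1},
}

def infer_topics(tokens, smoothing=1e-3):
    """Crude stand-in for topic inference: sum P(word | topic), then normalize."""
    scores = {
        topic: smoothing + sum(words.get(tok, 0.0) for tok in tokens)
        for topic, words in TOPIC_WORD.items()
    }
    total = sum(scores.values())
    return {topic: score / total for topic, score in scores.items()}

def enrich(snippet, topic_weight=2.0):
    """Represent a snippet by its word counts plus weighted hidden-topic features."""
    tokens = snippet.lower().split()
    vec = Counter(tokens)
    for topic, prob in infer_topics(tokens).items():
        vec["topic:" + topic] = topic_weight * prob
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def hac(vectors, threshold=0.3):
    """Average-link agglomerative clustering; stop when no pair reaches threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def sim(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        best, (i, j) = max(
            (sim(clusters[i], clusters[j]), (i, j))
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

snippets = [
    "football match tonight",
    "league goal highlights",
    "bank loan rates",
    "stock market news",
]
vectors = [enrich(s) for s in snippets]
clusters = hac(vectors)
print(clusters)
```

With word counts alone, the first two snippets have a cosine similarity of zero because they share no words. The added topic features are what allow the sketch to group them. In a real system the topic table would come from a model trained on the universal dataset, and the snippets would first be segmented into Vietnamese words.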

Index Terms

Computing methodologies → Artificial intelligence → Natural language processing → Language resources

Published in

ACM Transactions on Asian Language Information Processing Volume 8, Issue 3
August 2009
81 pages
ISSN:1530-0226
EISSN:1558-3430
DOI:10.1145/1568292

Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 August 2009
- Accepted: 1 May 2009
- Revised: 1 April 2009
- Received: 1 September 2008

Author Tags
Hierarchical Agglomerative Clustering
Latent Dirichlet allocation
Vietnamese
Web search clustering
cluster labeling
collocation
hidden topics analysis
Qualifiers
- research-article
- Research
- Refereed