Web Search Clustering and Labeling with Hidden Topics

Published: 01 August 2009

Abstract

Web search clustering reorganizes search results (also called “snippets”) so that they are easier to browse. Such post-retrieval clustering systems must meet three key requirements: (1) the clustering algorithm should group similar documents together; (2) clusters should be labeled with descriptive phrases; and (3) the system should produce high-quality clusters without downloading the full Web pages.

This article introduces a novel framework for clustering Vietnamese Web search results that addresses all three requirements. The main motivation is that enriching short snippets with hidden topics drawn from very large document collections on the Internet makes it possible to cluster and label them effectively in a topic-oriented manner, without fetching the full Web pages. Our approach builds on recent successful topic analysis models such as Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation. The underlying idea is to collect a very large external data collection, called the “universal dataset,” and then build a clustering system on both the original snippets and a rich set of hidden topics discovered from that collection. The hidden topics thus give each snippet a richer representation for clustering. We evaluate the method carefully and show that it yields high clustering quality.
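
The enrichment idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the topic-word table `TOPIC_WORD` is a hypothetical toy stand-in for a model estimated from the universal dataset, `topic_mixture` is a crude substitute for proper LDA inference, and the names `enrich`, `cosine`, and `topic_weight` are assumptions introduced only for this example.

```python
import math
from collections import Counter

# Hypothetical topic-word probabilities, standing in for a topic model
# estimated from the "universal dataset" (toy values, not from the article).
TOPIC_WORD = {
    "sports": {"football": 0.4, "goal": 0.3, "league": 0.3},
    "finance": {"stock": 0.4, "bank": 0.3, "market": 0.3},
}

def topic_mixture(tokens, smoothing=0.01):
    """Crude topic inference: add up p(word | topic) per topic, then normalize.
    A real system would run LDA inference (e.g. Gibbs sampling) instead."""
    scores = {topic: smoothing for topic in TOPIC_WORD}
    for token in tokens:
        for topic, dist in TOPIC_WORD.items():
            scores[topic] += dist.get(token, 0.0)
    total = sum(scores.values())
    return {topic: score / total for topic, score in scores.items()}

def enrich(tokens, topic_weight=1.0):
    """Term-frequency vector plus one pseudo-feature per hidden topic."""
    vec = {word: float(count) for word, count in Counter(tokens).items()}
    for topic, prob in topic_mixture(tokens).items():
        vec["topic:" + topic] = topic_weight * prob
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Three toy snippets, already tokenized (word segmentation is out of scope here).
snippets = {
    "s1": ["football", "league"],
    "s2": ["goal"],
    "s3": ["stock", "market"],
}
enriched = {name: enrich(tokens) for name, tokens in snippets.items()}
```

Snippets `s1` and `s2` share no words, so their plain bag-of-words cosine similarity is zero. After enrichment both carry a large weight on the `topic:sports` feature, so `cosine(enriched["s1"], enriched["s2"])` is clearly positive and larger than the similarity between `s1` and `s3`. Any standard clustering algorithm could then be run on the enriched vectors.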

    • Published in

      ACM Transactions on Asian Language Information Processing, Volume 8, Issue 3
      August 2009
      81 pages
      ISSN:1530-0226
      EISSN:1558-3430
      DOI:10.1145/1568292
      Copyright © 2009 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 August 2009
      • Accepted: 1 May 2009
      • Revised: 1 April 2009
      • Received: 1 September 2008
      Qualifiers

      • research-article
      • Research
      • Refereed