DOI: 10.1145/2911451.2911522
research-article

Explainable User Clustering in Short Text Streams

Published: 07 July 2016

ABSTRACT

User clustering has been studied from different angles: behavior-based, to identify similar browsing or search patterns, and content-based, to identify shared interests. Once user clusters have been found, they can be used for recommendation and personalization. So far, content-based user clustering has mostly focused on static sets of relatively long documents. Given the dynamic nature of social media, there is a need to dynamically cluster users in the context of short text streams. User clustering in this setting is more challenging than in the case of long documents as it is difficult to capture the users' dynamic topic distributions in sparse data settings. To address this problem, we propose a dynamic user clustering topic model (or UCT for short). UCT adaptively tracks changes of each user's time-varying topic distribution based both on the short texts the user posts during a given time period and on the previously estimated distribution. To infer changes, we propose a Gibbs sampling algorithm where a set of word-pairs from each user is constructed for sampling. The clustering results are explainable and human-understandable, in contrast to many other clustering algorithms. For evaluation purposes, we work with a dataset consisting of users and tweets from each user. Experimental results demonstrate the effectiveness of our proposed clustering model compared to state-of-the-art baselines.
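The abstract says that a set of word-pairs is constructed from each user for Gibbs sampling, but it does not say how the pairs are formed. The sketch below is one plausible reading, not the authors' procedure: it assumes that a word-pair is an unordered pair of tokens co-occurring in the same short text, as in biterm-style topic models, and that pairs are pooled over all of a user's posts in a time period. The function names `word_pairs` and `user_word_pairs` and the toy posts are hypothetical.

```python
from collections import Counter
from itertools import combinations


def word_pairs(tokens):
    """Return every unordered pair of tokens taken from one short text.

    Assumption: a "word-pair" is two tokens that co-occur in the same post.
    Each pair is sorted so that (a, b) and (b, a) count as the same pair.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def user_word_pairs(posts_by_user):
    """Pool word-pair counts over all posts of each user.

    posts_by_user maps a user id to a list of tokenized posts
    (each post is a list of strings) from one time period.
    """
    pairs = {}
    for user, posts in posts_by_user.items():
        counter = Counter()
        for tokens in posts:
            counter.update(word_pairs(tokens))
        pairs[user] = counter
    return pairs


# Hypothetical toy data: two users, already tokenized.
posts = {
    "alice": [["topic", "model", "tweets"], ["topic", "model"]],
    "bob": [["football", "match"]],
}
pairs = user_word_pairs(posts)
```

Pooling pairs per user rather than per post is one way to counter the sparsity of individual short texts: a single tweet yields only a handful of pairs, while a user's posts over a period yield many more co-occurrence observations to sample from.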


    • Published in

      SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
      July 2016
      1296 pages
      ISBN:9781450340694
      DOI:10.1145/2911451

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      SIGIR '16 paper acceptance rate: 62 of 341 submissions, 18%
      Overall acceptance rate: 792 of 3,983 submissions, 20%
