ABSTRACT
User clustering has been studied from different angles: behavior-based, to identify similar browsing or search patterns, and content-based, to identify shared interests. Once user clusters have been found, they can be used for recommendation and personalization. So far, content-based user clustering has mostly focused on static sets of relatively long documents. Given the dynamic nature of social media, there is a need to cluster users dynamically in the context of short text streams. User clustering in this setting is more challenging than with long documents, because it is difficult to capture users' dynamic topic distributions when the data is sparse. To address this problem, we propose a dynamic user clustering topic model (UCT for short). UCT adaptively tracks changes in each user's time-varying topic distribution based both on the short texts the user posts during a given time period and on the previously estimated distribution. To infer changes, we propose a Gibbs sampling algorithm in which a set of word-pairs is constructed for each user for sampling. In contrast to many other clustering algorithms, the clustering results are explainable and human-understandable. For evaluation, we work with a dataset consisting of users and the tweets posted by each user. Experimental results demonstrate the effectiveness of our proposed clustering model compared to state-of-the-art baselines.
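The two ingredients named in the abstract, per-user word-pairs and an estimate that depends on both the current period and the previous distribution, can be illustrated with a minimal sketch. This is not the UCT model or its Gibbs sampler. It assumes that a word-pair is an unordered pair of distinct words co-occurring in the same short text, and that the previous distribution acts as a Dirichlet-style prior on the current topic counts. The function names and the `prior_weight` parameter are hypothetical.

```python
from collections import Counter
from itertools import combinations


def word_pairs(posts):
    """Count unordered word pairs that co-occur within each short text.

    Assumption: whitespace tokenization, lowercasing, and one pair per
    distinct word combination per post.
    """
    pairs = Counter()
    for post in posts:
        tokens = sorted(set(post.lower().split()))
        pairs.update(combinations(tokens, 2))
    return pairs


def update_topic_distribution(prev_theta, topic_counts, prior_weight=1.0):
    """Smooth the current period's topic counts with the previous estimate.

    Assumption: prev_theta sums to 1 and is used as a prior whose strength
    is prior_weight pseudo-counts; the result is again a distribution.
    """
    total = sum(topic_counts) + prior_weight
    return [
        (count + prior_weight * prev) / total
        for count, prev in zip(topic_counts, prev_theta)
    ]
```

For example, `word_pairs(["apple banana", "banana apple cherry"])` counts the pair `("apple", "banana")` twice, and a larger `prior_weight` keeps the updated distribution closer to the previous one when a user posts little in the current period.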
Index Terms
- Explainable User Clustering in Short Text Streams