Explainable User Clustering in Short Text Streams

Authors:
Yukun Zhao, Shandong University, Jinan, China
Shangsong Liang, University College London, London, United Kingdom
Zhaochun Ren, University of Amsterdam, Amsterdam, Netherlands
Jun Ma, Shandong University, Jinan, China
Emine Yilmaz, University College London, London, United Kingdom
Maarten de Rijke, University of Amsterdam, Amsterdam, Netherlands

SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, July 2016, Pages 155–164
https://doi.org/10.1145/2911451.2911522

Published: 7 July 2016

ABSTRACT

User clustering has been studied from different angles: behavior-based, to identify similar browsing or search patterns, and content-based, to identify shared interests. Once user clusters have been found, they can be used for recommendation and personalization. So far, content-based user clustering has mostly focused on static sets of relatively long documents. Given the dynamic nature of social media, there is a need to dynamically cluster users in the context of short text streams. User clustering in this setting is more challenging than in the case of long documents as it is difficult to capture the users' dynamic topic distributions in sparse data settings. To address this problem, we propose a dynamic user clustering topic model (or UCT for short). UCT adaptively tracks changes of each user's time-varying topic distribution based both on the short texts the user posts during a given time period and on the previously estimated distribution. To infer changes, we propose a Gibbs sampling algorithm where a set of word-pairs from each user is constructed for sampling. The clustering results are explainable and human-understandable, in contrast to many other clustering algorithms. For evaluation purposes, we work with a dataset consisting of users and tweets from each user. Experimental results demonstrate the effectiveness of our proposed clustering model compared to state-of-the-art baselines.
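
Two ingredients named in the abstract, the word-pairs built from each user's short texts and the reuse of the previously estimated topic distribution, can be illustrated with a minimal sketch. The abstract does not say how the pairs are formed or how the previous estimate enters the update, so everything below is an assumption: pairs are taken as unordered co-occurrences of distinct words within one post, and the previous distribution is treated as pseudo-counts added to the current period's topic counts. The function names and the `prior_weight` parameter are hypothetical, and this is not the Gibbs sampler of UCT.

```python
from collections import Counter
from itertools import combinations


def word_pairs(text):
    """Return the unordered pairs of distinct words co-occurring in one short text.

    Assumption: whitespace tokenization and lowercasing; the paper may differ.
    """
    words = sorted(set(text.lower().split()))
    return list(combinations(words, 2))


def user_word_pairs(posts):
    """Pool the word-pairs of all short texts one user posted in a time period."""
    pairs = Counter()
    for text in posts:
        pairs.update(word_pairs(text))
    return pairs


def smoothed_topic_distribution(topic_counts, previous, prior_weight=1.0):
    """Blend the current period's topic counts with the previous estimate.

    Assumption: `previous` sums to 1 and contributes `prior_weight`
    pseudo-counts in total; this is an illustrative rule, not the paper's.
    """
    total = sum(topic_counts) + prior_weight
    return [(n + prior_weight * p) / total
            for n, p in zip(topic_counts, previous)]
```

Pooling pairs per user rather than per post is one way to soften the sparsity of individual short texts, and a larger `prior_weight` makes the estimate change more slowly from one period to the next.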

Index Terms

Information systems → Information retrieval → Retrieval models and ranking

Published in
SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
July 2016
1296 pages
ISBN: 9781450340694
DOI: 10.1145/2911451
General Chairs:
Raffaele Perego, ISTI-CNR, Italy
Fabrizio Sebastiani, Qatar Computing Research Institute, HBKU, Qatar
Program Chairs:
Javed Aslam, Northeastern University, US
Ian Ruthven, University of Strathclyde, UK
Justin Zobel, University of Melbourne, Australia
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
short text processing
user clustering
user topic modeling
Qualifiers
- research-article
Acceptance Rates
SIGIR '16 Paper Acceptance Rate: 62 of 341 submissions, 18%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%
Article Metrics
- Total Citations: 37
- Total Downloads: 984
- Downloads (Last 12 months): 24
- Downloads (Last 6 weeks): 1