Abstract
The problem of clustering content in social media has pervasive applications, including the identification of discussion topics, event detection, and content recommendation. Here, we describe a streaming framework for online detection and clustering of memes in social media, specifically Twitter. A pre-clustering procedure, namely protomeme detection, first isolates atomic tokens of information carried by the tweets. Protomemes are thereafter aggregated, based on multiple similarity measures, to obtain memes as cohesive groups of tweets reflecting actual concepts or topics of discussion. The clustering algorithm takes into account various dimensions of the data and metadata, including natural language, the social network, and the patterns of information diffusion. As a result, our system can build clusters of semantically, structurally, and topically related tweets. The clustering process is based on a variant of Online K-means that incorporates a memory mechanism, used to “forget” old memes and replace them over time with the new ones. The evaluation of our framework is carried out using a dataset of Twitter trending topics. Over a 1-week period, we systematically determined whether our algorithm was able to recover the trending hashtags. We show that the proposed method outperforms baseline algorithms that only use content features, as well as a state-of-the-art event detection method that assumes full knowledge of the underlying follower network. We finally show that our online learning framework is flexible, due to its independence of the adopted clustering algorithm, and best suited to work in a streaming scenario.
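The memory mechanism mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `ForgetfulOnlineKMeans`, the `min_similarity` threshold, the use of cosine similarity on sparse term vectors, and the rule of evicting the least recently updated cluster are all assumptions chosen for illustration. The abstract states only that a variant of Online K-means "forgets" old memes and replaces them with new ones.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class ForgetfulOnlineKMeans:
    """Hypothetical online clustering with at most k live clusters."""

    def __init__(self, k, min_similarity=0.3):
        self.k = k
        self.min_similarity = min_similarity  # assumed outlier threshold
        self.clusters = []  # each: {"sum": Counter, "last_update": int}
        self.step = 0

    def add(self, vector):
        """Assign one incoming vector and return its cluster index."""
        self.step += 1
        best, best_sim = None, -1.0
        for i, c in enumerate(self.clusters):
            # The summed vector has the same direction as the centroid,
            # so cosine similarity against it is equivalent.
            sim = cosine(vector, c["sum"])
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.min_similarity:
            c = self.clusters[best]
            c["sum"].update(vector)  # accumulate term weights
            c["last_update"] = self.step
            return best
        # No cluster is similar enough: start a new one.
        new = {"sum": Counter(vector), "last_update": self.step}
        if len(self.clusters) < self.k:
            self.clusters.append(new)
            return len(self.clusters) - 1
        # All k slots are taken: forget the least recently updated cluster.
        stale = min(range(self.k),
                    key=lambda i: self.clusters[i]["last_update"])
        self.clusters[stale] = new
        return stale


model = ForgetfulOnlineKMeans(k=2, min_similarity=0.3)
a = model.add({"#nba": 1.0, "finals": 1.0})  # opens the first cluster
b = model.add({"#nba": 1.0, "game": 1.0})    # similar enough to join it
c = model.add({"#oscars": 1.0})              # opens the second cluster
d = model.add({"#worldcup": 1.0})            # evicts the stalest cluster
```

In this sketch, eviction is driven purely by recency of updates, so a meme that stops receiving tweets is the first to be replaced when an unrelated item arrives and all `k` slots are in use.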
Notes
Term vectors may or may not include retweets; in all our experiments, we include them. Our framework also makes no assumption about the language of the tweets, so it can work with multiple languages.
Note that \(R_{p}\) is not necessarily a subset of \(U_{p}\) when only a sample of the tweets is considered in the stream; the sample may include a retweeted message but not the original one.
Cite this article
JafariAsbagh, M., Ferrara, E., Varol, O. et al. Clustering memes in social media streams. Soc. Netw. Anal. Min. 4, 237 (2014). https://doi.org/10.1007/s13278-014-0237-x
DOI: https://doi.org/10.1007/s13278-014-0237-x