ABSTRACT
Twitter is a new web application that plays the dual roles of online social networking and micro-blogging. Users communicate with each other by publishing text-based posts. The popularity and open structure of Twitter have attracted a large number of automated programs, known as bots, which are a double-edged sword for Twitter. Legitimate bots generate a large volume of benign tweets that deliver news and update feeds, while malicious bots spread spam or malicious content. More interestingly, between human and bot there has emerged the cyborg, which refers to either a bot-assisted human or a human-assisted bot. To help human users identify who they are interacting with, this paper focuses on the classification of human, bot, and cyborg accounts on Twitter. We first conduct a set of large-scale measurements on a collection of over 500,000 accounts. We observe the differences among humans, bots, and cyborgs in terms of tweeting behavior, tweet content, and account properties. Based on the measurement results, we propose a classification system that consists of four parts: (1) an entropy-based component, (2) a machine-learning-based component, (3) an account properties component, and (4) a decision maker. The system uses a combination of features extracted from an unknown user to determine the likelihood of that user being a human, bot, or cyborg. Our experimental evaluation demonstrates the efficacy of the proposed classification system.
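As a rough illustration of what an entropy-based component might measure, the sketch below computes the Shannon entropy of binned inter-tweet intervals. This is an assumption made for illustration, not the estimator used in the paper: the function name `interval_entropy`, the `bin_seconds` parameter, and the choice of plain Shannon entropy are all hypothetical. The intuition is that highly regular posting times yield low entropy, while irregular timing yields higher entropy.

```python
import math
from collections import Counter


def interval_entropy(timestamps, bin_seconds=60):
    """Shannon entropy (in bits) of binned gaps between consecutive posts.

    Hypothetical helper: `timestamps` are posting times in seconds, and
    `bin_seconds` is an assumed bin width, not a value from the paper.
    """
    if len(timestamps) < 2:
        return 0.0  # no gaps to measure
    ordered = sorted(timestamps)
    # Gaps between consecutive posts.
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    # Group the gaps into fixed-width bins and count each bin.
    bins = Counter(int(gap // bin_seconds) for gap in intervals)
    total = len(intervals)
    # H = sum over bins of p * log2(1 / p).
    return sum((n / total) * math.log2(total / n) for n in bins.values())


# Perfectly periodic posting (one post per hour) has a single interval bin.
periodic = list(range(0, 36000, 3600))
# Irregular posting spreads the gaps across several bins.
irregular = [0, 60, 180, 420, 900]

print(interval_entropy(periodic))
print(interval_entropy(irregular))
```

In a full system, a score like this would be only one feature among several, combined with content and account-property features by the decision maker.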