Abstract
With the ever-growing use of social media, authorship attribution plays an important role in preventing cybercrime and in analyzing the online trails left behind by cyber pranksters, stalkers, bullies, identity thieves, and the like. In this paper, we propose a method for authorship attribution in micro-blogs that is one hundred to a thousand times faster than state-of-the-art counterparts. The method relies on a powerful and scalable feature representation that takes advantage of user patterns in micro-blog messages, and on a custom-tailored pattern classifier adapted to deal with big data and high-dimensional data. Finally, we discuss search-space reduction when analyzing hundreds of online suspects and millions of online micro-messages, which makes this approach invaluable for digital forensics and law enforcement.
The authors thank CAPES (Grant #01P45543013), CNPq (Grants #477662/2013-7 and #304352/2012-8), FAPESP (Grant #2010/05647-4), and Microsoft Research for their financial support.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Cavalcante, T., Rocha, A., Carvalho, A. (2014). Large-Scale Micro-Blog Authorship Attribution: Beyond Simple Feature Engineering. In: Bayro-Corrochano, E., Hancock, E. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2014. Lecture Notes in Computer Science, vol 8827. Springer, Cham. https://doi.org/10.1007/978-3-319-12568-8_49
DOI: https://doi.org/10.1007/978-3-319-12568-8_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12567-1
Online ISBN: 978-3-319-12568-8
eBook Packages: Computer Science, Computer Science (R0)