Large-Scale Micro-Blog Authorship Attribution: Beyond Simple Feature Engineering

Cavalcante, Thiago; Rocha, Anderson; Carvalho, Ariadne

doi:10.1007/978-3-319-12568-8_49

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 8827)

Abstract

With the ever-growing use of social media, authorship attribution plays an important role in preventing cybercrime and in helping analyze the online trails left behind by cyber pranksters, stalkers, bullies, identity thieves, and the like. In this paper, we propose a method for authorship attribution in micro-blogs that is one hundred to a thousand times faster than state-of-the-art counterparts. The method relies on a powerful and scalable feature representation that takes advantage of user patterns in micro-blog messages, and on a custom-tailored pattern classifier adapted to deal with big data and high-dimensional data. Finally, we discuss search-space reduction when analyzing hundreds of online suspects and millions of online micro-messages, which makes this approach invaluable for digital forensics and law enforcement.

The authors thank CAPES (Grant #01P45543013), CNPq (Grants #477662/2013-7 and #304352/2012-8), FAPESP (Grant #2010/05647-4), and Microsoft Research for their financial support.

Author information

Authors and Affiliations

Institute of Computing, University of Campinas, Av. Albert Einstein, 1251, Cidade Universitaria, Campinas, SP, Brasil, CEP 13083-852
Thiago Cavalcante, Anderson Rocha & Ariadne Carvalho

Editor information

Editors and Affiliations

Department of Electrical Engineering and Computer Science, CINVESTAV, Guadalajara, Jalisco, México
Eduardo Bayro-Corrochano
Department of Computer Science, University of York, YO10 5GH, Deramore Lane, York, UK
Edwin Hancock

About this paper

Cite this paper

Cavalcante, T., Rocha, A., Carvalho, A. (2014). Large-Scale Micro-Blog Authorship Attribution: Beyond Simple Feature Engineering. In: Bayro-Corrochano, E., Hancock, E. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2014. Lecture Notes in Computer Science, vol 8827. Springer, Cham. https://doi.org/10.1007/978-3-319-12568-8_49

DOI: https://doi.org/10.1007/978-3-319-12568-8_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12567-1
Online ISBN: 978-3-319-12568-8
eBook Packages: Computer Science, Computer Science (R0)
