Abstract
Stylometry is a form of authorship attribution that relies on the linguistic information found in a document. While there has been significant work in stylometry, most research focuses on the closed-world problem, where the author of the document is known to be in the suspect set. For open-world problems, where the author may not be in the suspect set, traditional classification methods are ineffective. This paper proposes the “classify-verify” method, which augments classification with a binary verification step, and evaluates it on stylometric datasets. The method, which can be generalized to any domain, significantly outperforms traditional classifiers in open-world settings and yields an F1-score of 0.87, comparable to that of traditional classifiers in closed-world settings. Moreover, the method successfully detects adversarial documents in which authors deliberately change their styles, a problem on which closed-world classifiers fail.
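The classify-then-verify idea can be illustrated with a minimal sketch. The abstract does not specify the classifier or the verifier, so everything below is an assumption made for illustration: a nearest-centroid classifier over cosine distance stands in for the closed-world classifier, a fixed distance threshold stands in for the binary verification step, and the function names, feature vectors, and threshold value are hypothetical.

```python
import math
from typing import Dict, List, Optional


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def classify(doc: List[float], centroids: Dict[str, List[float]]) -> str:
    """Closed-world step: always returns the nearest known author."""
    return min(centroids, key=lambda author: cosine_distance(doc, centroids[author]))


def verify(doc: List[float], centroid: List[float], threshold: float) -> bool:
    """Binary verification step (assumed form): accept if close enough."""
    return cosine_distance(doc, centroid) <= threshold


def classify_verify(
    doc: List[float], centroids: Dict[str, List[float]], threshold: float
) -> Optional[str]:
    """Return the predicted author, or None if verification rejects it."""
    candidate = classify(doc, centroids)
    if verify(doc, centroids[candidate], threshold):
        return candidate
    return None  # open-world outcome: author is likely not in the suspect set


# Hypothetical toy feature vectors, not taken from the paper's datasets.
centroids = {"alice": [0.9, 0.1, 0.0], "bob": [0.1, 0.8, 0.1]}
in_set_doc = [0.85, 0.15, 0.0]
out_of_set_doc = [0.0, 0.1, 0.9]

print(classify_verify(in_set_doc, centroids, threshold=0.2))
print(classify_verify(out_of_set_doc, centroids, threshold=0.2))
```

In this toy setup the plain classifier still assigns the out-of-set document to its nearest suspect, while the verification step rejects that assignment. That rejection is the behavior the abstract attributes to the method in open-world settings.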
© 2014 IFIP International Federation for Information Processing
About this paper
Cite this paper
Stolerman, A., Overdorf, R., Afroz, S., Greenstadt, R. (2014). Breaking the Closed-World Assumption in Stylometric Authorship Attribution. In: Peterson, G., Shenoi, S. (eds) Advances in Digital Forensics X. DigitalForensics 2014. IFIP Advances in Information and Communication Technology, vol 433. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44952-3_13
DOI: https://doi.org/10.1007/978-3-662-44952-3_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44951-6
Online ISBN: 978-3-662-44952-3
eBook Packages: Computer Science (R0)