Breaking the Closed-World Assumption in Stylometric Authorship Attribution

Stolerman, Ariel; Overdorf, Rebekah; Afroz, Sadia; Greenstadt, Rachel

doi:10.1007/978-3-662-44952-3_13

Conference paper

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 433)

Abstract

Stylometry is a form of authorship attribution that relies on the linguistic information found in a document. While there has been significant work in stylometry, most research focuses on the closed-world problem, in which the author of the document is known to be in the suspect set. For open-world problems, in which the author may not be in the suspect set, traditional classification methods are ineffective. This paper proposes the “classify-verify” method, which augments classification with a binary verification step, and evaluates it on stylometric datasets. The method, which can be generalized to any domain, significantly outperforms traditional classifiers in open-world settings and yields an F1-score of 0.87, comparable to that of traditional classifiers in closed-world settings. Moreover, the method successfully detects adversarial documents in which authors deliberately change their styles, a problem on which closed-world classifiers fail.
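
The two-stage idea described in the abstract can be illustrated with a minimal sketch: a closed-world classifier first picks the most likely author from the suspect set, and a binary verification step then either accepts that answer or rejects it as "author not in the suspect set". Everything concrete here is an assumption made for illustration and is not taken from the paper: the character-trigram features, the cosine distance, the fixed threshold, and the function names `char_ngrams`, `cosine_distance`, `build_profiles` and `classify_verify` are all hypothetical.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count character n-grams; a stand-in for a real stylometric feature set."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def build_profiles(training_docs):
    """Merge each suspect's documents into one n-gram profile."""
    profiles = {}
    for author, docs in training_docs.items():
        profile = Counter()
        for doc in docs:
            profile.update(char_ngrams(doc))
        profiles[author] = profile
    return profiles


def classify_verify(document, profiles, threshold):
    """Classify among the suspects, then verify the chosen author.

    Returns (author, distance); author is None when verification
    rejects the classifier's choice (assumed open-world case).
    """
    features = char_ngrams(document)
    # Step 1 (classify): the closed-world choice is the nearest profile.
    distances = {a: cosine_distance(features, p) for a, p in profiles.items()}
    best = min(distances, key=distances.get)
    # Step 2 (verify): accept only if the document is close enough.
    if distances[best] <= threshold:
        return best, distances[best]
    return None, distances[best]
```

A caller would build the profiles once from the suspects' known writing and then call `classify_verify(document, profiles, threshold)` for each questioned document. The threshold controls the trade-off between wrongly rejecting a true suspect and wrongly accepting an outside author, so in practice it would have to be tuned on held-out data rather than fixed by hand as in this sketch.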

Author information

Authors and Affiliations

Drexel University, Philadelphia, Pennsylvania, USA
Ariel Stolerman, Rebekah Overdorf & Rachel Greenstadt
Computer Science Division, University of California at Berkeley, Berkeley, California, USA
Sadia Afroz

Editor information

Editors and Affiliations

Air Force Institute of Technology, Wright-Patterson Air Force Base, OH 45433-7765, USA
Gilbert Peterson
University of Tulsa, Tulsa, OK 74104-3189, USA
Sujeet Shenoi

Cite this paper

Stolerman, A., Overdorf, R., Afroz, S., Greenstadt, R. (2014). Breaking the Closed-World Assumption in Stylometric Authorship Attribution. In: Peterson, G., Shenoi, S. (eds) Advances in Digital Forensics X. DigitalForensics 2014. IFIP Advances in Information and Communication Technology, vol 433. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44952-3_13

DOI: https://doi.org/10.1007/978-3-662-44952-3_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44951-6
Online ISBN: 978-3-662-44952-3
eBook Packages: Computer Science, Computer Science (R0)
