Stylometric Authorship Attribution of Collaborative Documents

Dauber, Edwin; Overdorf, Rebekah; Greenstadt, Rachel

doi:10.1007/978-3-319-60080-2_9

Conference paper
First Online: 02 June 2017

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 10332)

Abstract

Stylometry is the study of writing style based on linguistic features and is typically applied to authorship attribution problems. In this work, we apply stylometry to a novel dataset of multi-authored documents collected from Wikia, using both relaxed classification with a support vector machine (SVM) and multi-label classification techniques. We define five possible scenarios and show that one, the case where labeled and unlabeled collaborative documents by the same authors are available, yields high accuracy on our dataset, while the other, more restrictive cases yield lower accuracies. Based on the results of these experiments and knowledge of the multi-label classifiers used, we propose a hypothesis to explain this overall poor performance. Additionally, we perform authorship attribution of pre-segmented text from the Wikia dataset and show that, while this performs better than multi-label learning, it requires large amounts of data to be successful.
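
The abstract does not define relaxed classification. One common reading, assumed here, is that a prediction counts as correct when any true author of a document appears among the k highest-ranked candidates. The Python sketch below illustrates that scoring rule only. The function names, author labels, and scores are hypothetical stand-ins for per-author SVM decision values and are not taken from the paper.

```python
from typing import Dict, List, Set


def relaxed_hit(scores: Dict[str, float], true_authors: Set[str], k: int) -> bool:
    """Return True if any true author is among the k highest-scoring candidates."""
    # Rank candidate authors from highest to lowest score.
    ranked = sorted(scores, key=lambda author: scores[author], reverse=True)
    return any(author in true_authors for author in ranked[:k])


def relaxed_accuracy(all_scores: List[Dict[str, float]],
                     all_truth: List[Set[str]], k: int) -> float:
    """Fraction of documents with at least one true author in the top k."""
    hits = sum(relaxed_hit(s, t, k) for s, t in zip(all_scores, all_truth))
    return hits / len(all_scores)


# Hypothetical scores for three documents over candidate authors A, B, C.
example_scores = [
    {"A": 0.9, "B": 0.3, "C": 0.1},
    {"A": 0.2, "B": 0.5, "C": 0.4},
    {"A": 0.6, "B": 0.1, "C": 0.3},
]
# Hypothetical ground truth; a collaborative document may have several authors.
example_truth = [{"A", "B"}, {"C"}, {"B"}]

for k in (1, 2, 3):
    print(k, relaxed_accuracy(example_scores, example_truth, k))
```

Under this reading, accuracy can only stay the same or rise as k grows, which is why a relaxed result is usually reported together with the value of k that was used.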

Notes

1. http://en.wikipedia.org.
2. https://drive.google.com.
3. http://starwars.wikia.com/wiki/Main_Page.
4. https://github.com/.

Acknowledgements

This work was supported by the National Science Foundation under grant #1253418.

Author information

Authors and Affiliations

Drexel University, Philadelphia, PA, 19104, USA
Edwin Dauber, Rebekah Overdorf & Rachel Greenstadt

Corresponding author

Correspondence to Edwin Dauber.

Editor information

Editors and Affiliations

Ben-Gurion University of the Negev, Beer-Sheva, Israel
Shlomi Dolev
Tata Consultancy Services (India), Chennai, Tamil Nadu, India
Sachin Lodha

About this paper

Cite this paper

Dauber, E., Overdorf, R., Greenstadt, R. (2017). Stylometric Authorship Attribution of Collaborative Documents. In: Dolev, S., Lodha, S. (eds) Cyber Security Cryptography and Machine Learning. CSCML 2017. Lecture Notes in Computer Science, vol 10332. Springer, Cham. https://doi.org/10.1007/978-3-319-60080-2_9

DOI: https://doi.org/10.1007/978-3-319-60080-2_9
Published: 02 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60079-6
Online ISBN: 978-3-319-60080-2
eBook Packages: Computer Science, Computer Science (R0)
