Abstract
Stylometry is the study of writing style based on linguistic features and is typically applied to authorship attribution problems. In this work, we apply stylometry to a novel dataset of multi-authored documents collected from Wikia, using both relaxed classification with a support vector machine (SVM) and multi-label classification techniques. We define five possible scenarios and show that one, the case where labeled and unlabeled collaborative documents by the same authors are available, yields high accuracy on our dataset, while the other four, more restrictive cases yield lower accuracy. Based on the results of these experiments and knowledge of the multi-label classifiers used, we propose a hypothesis to explain this overall poor performance. Additionally, we perform authorship attribution of pre-segmented text from the Wikia dataset and show that, while this performs better than multi-label learning, it requires large amounts of data to be successful.
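As a rough illustration of relaxed classification, the sketch below scores a document as correctly attributed when any of its true authors appears among the k highest-ranked candidates. That definition is an assumption made for illustration and is not taken from the abstract; the function names, the author names, and the scores are all hypothetical stand-ins for per-author classifier outputs such as SVM decision values.

```python
from typing import Dict, List, Set


def top_k_authors(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k candidate authors with the highest scores."""
    # Sort by score descending; break ties by name so results are deterministic.
    ranked = sorted(scores, key=lambda author: (-scores[author], author))
    return ranked[:k]


def relaxed_accuracy(
    all_scores: List[Dict[str, float]],
    true_authors: List[Set[str]],
    k: int,
) -> float:
    """Fraction of documents with at least one true author in the top k.

    Assumption: this is one plausible reading of "relaxed" attribution,
    not necessarily the exact criterion used in the paper.
    """
    if not all_scores or len(all_scores) != len(true_authors):
        raise ValueError("need one non-empty score dict per document")
    hits = 0
    for scores, truth in zip(all_scores, true_authors):
        if truth & set(top_k_authors(scores, k)):
            hits += 1
    return hits / len(all_scores)


# Hypothetical per-author scores for three collaborative documents.
doc_scores = [
    {"alice": 0.9, "bob": 0.4, "carol": 0.1},
    {"alice": 0.2, "bob": 0.5, "carol": 0.6},
    {"alice": 0.7, "bob": 0.2, "carol": 0.6},
]
doc_truth = [{"alice", "bob"}, {"alice"}, {"carol"}]

strict = relaxed_accuracy(doc_scores, doc_truth, k=1)
relaxed = relaxed_accuracy(doc_scores, doc_truth, k=2)
```

With these made-up scores, widening the candidate list from one author to two raises the fraction of documents counted as correct, which is the trade-off relaxed attribution makes: a less precise answer in exchange for a higher chance of including a true author.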
Acknowledgements
This work was supported by the National Science Foundation under grant #1253418.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Dauber, E., Overdorf, R., Greenstadt, R. (2017). Stylometric Authorship Attribution of Collaborative Documents. In: Dolev, S., Lodha, S. (eds) Cyber Security Cryptography and Machine Learning. CSCML 2017. Lecture Notes in Computer Science, vol 10332. Springer, Cham. https://doi.org/10.1007/978-3-319-60080-2_9
DOI: https://doi.org/10.1007/978-3-319-60080-2_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60079-6
Online ISBN: 978-3-319-60080-2
eBook Packages: Computer Science, Computer Science (R0)