Stylometric Authorship Attribution of Collaborative Documents

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 10332)

Abstract

Stylometry is the study of writing style based on linguistic features and is typically applied to authorship attribution problems. In this work, we apply stylometry to a novel dataset of multi-authored documents collected from Wikia, using both relaxed classification with a support vector machine (SVM) and multi-label classification techniques. We define five possible scenarios and show that one, the case where labeled and unlabeled collaborative documents by the same authors are available, yields high accuracy on our dataset, while the other, more restrictive cases yield lower accuracies. Based on the results of these experiments and knowledge of the multi-label classifiers used, we propose a hypothesis to explain this overall poor performance. Additionally, we perform authorship attribution of pre-segmented text from the Wikia dataset and show that, while this performs better than multi-label learning, it requires large amounts of data to be successful.
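
The abstract names relaxed classification without defining it. As a rough illustration only, the sketch below assumes the common top-k reading: a prediction counts as correct when at least one true author of a document appears among the k highest-scoring candidates. The function names, the scores, and the author labels are all hypothetical and are not taken from the paper; in practice the scores would come from a classifier such as an SVM.

```python
from typing import Dict, List, Set


def top_k_authors(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k candidate authors with the highest scores.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    return sorted(scores, key=lambda author: (-scores[author], author))[:k]


def relaxed_accuracy(
    all_scores: List[Dict[str, float]],
    true_authors: List[Set[str]],
    k: int,
) -> float:
    """Fraction of documents with at least one true author in the top k.

    Assumption: for a multi-authored document, matching any one of its
    authors counts as a hit. Other definitions are possible.
    """
    if not all_scores:
        return 0.0
    hits = 0
    for scores, authors in zip(all_scores, true_authors):
        if authors & set(top_k_authors(scores, k)):
            hits += 1
    return hits / len(all_scores)


# Hypothetical per-document classifier scores for three candidate authors.
docs = [
    {"alice": 0.6, "bob": 0.3, "carol": 0.1},
    {"alice": 0.5, "bob": 0.2, "carol": 0.3},
    {"alice": 0.1, "bob": 0.2, "carol": 0.7},
]
# Hypothetical ground truth; the first document is collaborative.
truth = [{"alice", "bob"}, {"carol"}, {"bob"}]

strict = relaxed_accuracy(docs, truth, k=1)   # ordinary top-1 accuracy
relaxed = relaxed_accuracy(docs, truth, k=2)  # relaxed, top-2 accuracy
```

With k=1 this reduces to ordinary accuracy; raising k trades precision for a shortlist that is more likely to contain a true author.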

Acknowledgements

This work was supported by the National Science Foundation under grant #1253418.

Author information

Corresponding author

Correspondence to Edwin Dauber.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Dauber, E., Overdorf, R., Greenstadt, R. (2017). Stylometric Authorship Attribution of Collaborative Documents. In: Dolev, S., Lodha, S. (eds) Cyber Security Cryptography and Machine Learning. CSCML 2017. Lecture Notes in Computer Science, vol 10332. Springer, Cham. https://doi.org/10.1007/978-3-319-60080-2_9

  • DOI: https://doi.org/10.1007/978-3-319-60080-2_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-60079-6

  • Online ISBN: 978-3-319-60080-2

  • eBook Packages: Computer Science, Computer Science (R0)
