
Robust stylometric analysis and author attribution based on tones and rimes

Published online by Cambridge University Press:  10 April 2019

Renkui Hou*
Affiliation:
Department of Linguistics, College of Humanities, Guangzhou University, Guangzhou, China; Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong
Chu-Ren Huang
Affiliation:
Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong
*Corresponding author. Email: hourk0917@163.com

Abstract

In this article, we propose an innovative and robust approach to stylometric analysis that requires no annotation and leverages lexical and sub-lexical information. In particular, we leverage the phonological information of tones and rimes in Mandarin Chinese, extracted automatically from unannotated texts. Texts by different authors were represented by tones, tone motifs, and word-length motifs, as well as rimes and rime motifs. Support vector machines and random forests were used to build the text classification models for authorship attribution. From the experimental results, we conclude that the combination of rime bigrams, word-final rimes, and segment-final rimes discriminates effectively among texts by different authors when random forests are used to build the classification model. This robust approach can in principle be applied to other languages with an established phonological inventory of onsets and rimes.
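To make the rime-bigram representation concrete, the following is a minimal sketch in Python and not the authors' pipeline. It assumes the text has already been converted to numbered pinyin syllables (for example `zhong1`), treats `y` and `w` as onsets for simplicity, and marks a syllable without a tone digit as neutral tone `5`. The names `ONSETS`, `split_syllable`, and `rime_bigram_frequencies` are hypothetical and introduced only for illustration.

```python
from collections import Counter

# Pinyin initials, two-letter ones first so that "zh" matches before "z".
# Assumption: "y" and "w" are treated as onsets, which is a simplification.
ONSETS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
          "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable):
    """Split a numbered-pinyin syllable such as 'zhong1' into (onset, rime, tone)."""
    if syllable[-1].isdigit():
        body, tone = syllable[:-1], syllable[-1]
    else:
        body, tone = syllable, "5"  # assumption: no digit means neutral tone
    for onset in ONSETS:
        # Require a non-empty rime after the onset.
        if body.startswith(onset) and len(body) > len(onset):
            return onset, body[len(onset):], tone
    return "", body, tone  # zero-onset syllable such as 'an1'


def rime_bigram_frequencies(syllables):
    """Return the relative frequency of each pair of adjacent rimes."""
    rimes = [split_syllable(s)[1] for s in syllables]
    bigrams = Counter(zip(rimes, rimes[1:]))
    total = sum(bigrams.values())
    if total == 0:
        return {}
    return {bigram: count / total for bigram, count in bigrams.items()}


if __name__ == "__main__":
    sample = ["zhong1", "guo2", "ren2", "min2"]
    for bigram, freq in rime_bigram_frequencies(sample).items():
        print(bigram, round(freq, 3))
```

A frequency dictionary of this kind, computed per text, could serve as one feature vector for a classifier such as a random forest. Word-final and segment-final rimes would additionally require word segmentation and punctuation boundaries, which this sketch does not cover.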

Type
Article
Copyright
© Cambridge University Press 2019 
