Abstract
Stylometry, in the form of simple statistical text analysis, has proven to be a powerful tool for text classification, for example in authorship attribution. Researchers who analyze retro-digitized comics, manga, and graphic novels face the problem that automated text recognition (ATR) still produces comparatively high error rates, while manual transcription of the texts remains highly time-consuming. In this paper, we present an approach and measures that indicate whether stylometry based on unsupervised ATR will produce reliable results for a given dataset of comics images.
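The abstract does not name the specific measures used in the paper. As an illustration only, the following minimal Python sketch shows two quantities commonly used when judging ATR output for stylometric purposes: the character error rate of an ATR transcript against a manual transcription, and the relative frequencies of selected words in each version. The function names, the whitespace tokenization, and the sample strings are assumptions made for this sketch, not the authors' method.

```python
from collections import Counter
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the manual reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_frequencies(text: str, vocabulary: Sequence[str]) -> List[float]:
    """Relative frequency of each vocabulary word in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty text
    return [counts[word] / total for word in vocabulary]


# Hypothetical sample: a manual transcription and an ATR result with one error.
manual = "the cat sat on the mat"
atr = "the cat sat on tbe mat"

cer = character_error_rate(manual, atr)
freq_manual = relative_frequencies(manual, ["the", "on"])
freq_atr = relative_frequencies(atr, ["the", "on"])
```

In this sample a single misrecognized character yields a low character error rate, yet it halves the measured frequency of "the" in the ATR version. This is why a quality threshold for stylometry cannot be read off the error rate alone without looking at the features the analysis relies on.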
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Hartel, R., Dunst, A. (2019). How Good Is Good Enough? Establishing Quality Thresholds for the Automatic Text Analysis of Retro-Digitized Comics. In: Kompatsiaris, I., Huet, B., Mezaris, V., Gurrin, C., Cheng, WH., Vrochidis, S. (eds) MultiMedia Modeling. MMM 2019. Lecture Notes in Computer Science, vol 11296. Springer, Cham. https://doi.org/10.1007/978-3-030-05716-9_59
DOI: https://doi.org/10.1007/978-3-030-05716-9_59
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05715-2
Online ISBN: 978-3-030-05716-9
eBook Packages: Computer Science, Computer Science (R0)