Abstract
In this paper, we introduce a large-scale dataset, called HKR, to address the challenging problems of detecting and recognizing handwritten Russian and Kazakh text in scanned documents. We present a new Russian and Kazakh database (about 95% Russian and 5% Kazakh words and sentences) for offline handwriting recognition. Several pre-processing and segmentation procedures were developed together with the database. Both languages are written in Cyrillic and share the same 33 characters; the Kazakh alphabet additionally contains 9 specific characters. The dataset is a collection of forms. The form templates were generated with LaTeX and subsequently filled out by hand. The database consists of more than 1500 filled forms, with approximately 63000 sentences and more than 715699 symbols produced by approximately 200 different writers. It can serve researchers working on handwriting recognition with machine learning and deep learning. For the experiments, we used several popular text recognition methods for word and line recognition, including CTC-based and attention-based methods. The results indicate the diversity of HKR. The dataset is available at https://github.com/abdoelsayed2016/HKR_Dataset.
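To make the character set and the CTC-based recognition setting concrete, the following is a minimal sketch of a 42-symbol label alphabet (33 shared Cyrillic letters plus the 9 Kazakh-specific ones) together with greedy best-path decoding. Best-path decoding is one common way to read out CTC output; the abstract does not say which decoder was used. The names `ALPHABET`, `BLANK`, and `ctc_greedy_decode`, the lowercase-only alphabet, and the choice of blank index are assumptions made for illustration, not details taken from the HKR code.

```python
# Illustrative sketch only: names and label layout are assumptions,
# not taken from the HKR repository.

RUSSIAN = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"  # 33 shared Cyrillic letters
KAZAKH_EXTRA = "әғқңөұүһі"                     # 9 Kazakh-specific letters
ALPHABET = RUSSIAN + KAZAKH_EXTRA              # 42 lowercase symbols in total
BLANK = len(ALPHABET)                          # index reserved for the CTC blank


def ctc_greedy_decode(frame_labels):
    """Best-path CTC decoding: collapse repeated labels, then drop blanks.

    `frame_labels` is the per-frame argmax of a recognizer's output,
    given as a sequence of integer label indices.
    """
    chars = []
    previous = None
    for label in frame_labels:
        # Emit a character only when the label changes and is not the blank.
        if label != previous and label != BLANK:
            chars.append(ALPHABET[label])
        previous = label
    return "".join(chars)


if __name__ == "__main__":
    # Hypothetical per-frame labels spelling a short Kazakh word.
    q, a, z = (ALPHABET.index(c) for c in "қаз")
    print(ctc_greedy_decode([q, q, BLANK, a, a, BLANK, z]))
```

A blank between two identical labels keeps them as separate characters, which is how CTC represents doubled letters.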
Acknowledgements
We would like to thank Maksat Kanatov and Kuanysh Slyamkhan for their help with this research project. This work was funded by the Ministry of Education and Science of the Republic of Kazakhstan (Grant No. AP05135175).
Cite this article
Nurseitov, D., Bostanbekov, K., Kurmankhojayev, D. et al. Handwritten Kazakh and Russian (HKR) database for text recognition. Multimed Tools Appl 80, 33075–33097 (2021). https://doi.org/10.1007/s11042-021-11399-6
DOI: https://doi.org/10.1007/s11042-021-11399-6