
Handwritten Kazakh and Russian (HKR) database for text recognition

Multimedia Tools and Applications

Abstract

In this paper, we introduce HKR, a large-scale dataset that addresses the challenging problems of detecting and recognizing handwritten Russian and Kazakh text in scanned documents. The database is intended for offline handwriting recognition and contains about 95% Russian and 5% Kazakh words and sentences. Several pre-processing and segmentation procedures were developed together with the database. Both languages are written in Cyrillic and share the same 33 characters; the Kazakh alphabet contains 9 additional specific characters. The dataset is a collection of forms whose sources were generated with LaTeX and subsequently filled out by hand by the writers. The database consists of more than 1500 filled forms containing approximately 63000 sentences and more than 715699 symbols, produced by approximately 200 different writers. It can serve researchers who apply machine learning and deep learning to handwriting recognition tasks. For the experiments, we used several popular text recognition methods for word and line recognition, including CTC-based and attention-based methods. The results indicate the diversity of HKR. The dataset is available at https://github.com/abdoelsayed2016/HKR_Dataset.
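The alphabet described in the abstract (33 shared Cyrillic letters plus 9 Kazakh-specific ones) determines the label set a CTC-based recognizer would predict over. The following is a minimal sketch of such a label set together with greedy (best-path) CTC decoding. It is not taken from the HKR repository: the function names, the choice of index 0 for the blank, and the decision to include both letter cases are assumptions made for illustration, and a real label set would presumably also cover digits, punctuation, and spaces.

```python
# Illustrative sketch only; names and conventions are assumptions,
# not the authors' implementation.

RUSSIAN_LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"  # the 33 shared letters
KAZAKH_EXTRA_LOWER = "әғқңөұүһі"                     # the 9 Kazakh-specific letters

BLANK = 0  # index reserved for the CTC blank symbol (assumed convention)


def build_charset():
    """Return a sorted list of upper- and lower-case letters of both alphabets."""
    lower = RUSSIAN_LOWER + KAZAKH_EXTRA_LOWER
    return sorted(set(lower + lower.upper()))


def make_codec(charset):
    """Map characters to integer labels starting at 1 (0 is the blank)."""
    char_to_id = {ch: i + 1 for i, ch in enumerate(charset)}
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    return char_to_id, id_to_char


def ctc_greedy_decode(frame_ids, id_to_char):
    """Best-path CTC decoding: collapse repeated labels, then drop blanks."""
    out = []
    prev = BLANK
    for idx in frame_ids:
        if idx != prev and idx != BLANK:
            out.append(id_to_char[idx])
        prev = idx
    return "".join(out)


if __name__ == "__main__":
    charset = build_charset()
    char_to_id, id_to_char = make_codec(charset)
    # Per-frame argmax labels as a recognizer might emit them (made-up values).
    frames = [char_to_id["қ"], char_to_id["қ"], BLANK,
              char_to_id["а"], BLANK, char_to_id["а"], char_to_id["з"]]
    print(len(charset), ctc_greedy_decode(frames, id_to_char))
```

The blank between the two "а" frames is what lets CTC output a doubled letter; without it the repeated label would be collapsed into one.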




Acknowledgements

We would like to thank the following people for helping with this research project: Maksat Kanatov and Kuanysh Slyamkhan. This work was funded by the Ministry of Education and Science of the Republic of Kazakhstan (Grant No. AP05135175).

Author information

Corresponding author

Correspondence to Abdelrahman Abdallah.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Nurseitov, D., Bostanbekov, K., Kurmankhojayev, D. et al. Handwritten Kazakh and Russian (HKR) database for text recognition. Multimed Tools Appl 80, 33075–33097 (2021). https://doi.org/10.1007/s11042-021-11399-6

  • DOI: https://doi.org/10.1007/s11042-021-11399-6
