Abstract
Analysis of degraded printed documents has been an active research topic for the last several years. This article contributes to the segmentation of word images into symbols, and to the recognition of those symbols, in degraded printed document images of Bengali, the 7th most popular language in the world. A novel approach to symbol-level segmentation based on a Multilayer Perceptron (MLP) network is proposed. A database of segmenting and non-segmenting image columns is developed from the ISIDDI page-level database, and segmentation is treated as a two-class classification problem. The MLP weights are learnt from this database using the backpropagation algorithm. We introduce certain new metrics, based on which the F-score of the proposed segmentation algorithm is determined. Our method utilizes information that is relevant for character segmentation and ignores other highly variable information contained in a printed text document, thus allowing efficient transfer learning between datasets and alleviating the need for labelled training data. Besides Bengali, we have tested the method on English, Tamil and Devnagari scripts. For classification, we have identified 336 symbols and developed the corresponding training and test sets from the ISIDDI database. Two classifiers, one CNN based and the other LSTM based, have been developed for this 336-class problem. The classification accuracies obtained on the test set by the CNN classifier and the LSTM classifier are 86.05% and 88.11%, respectively. The proposed classifiers outperform the existing classifiers for the ISIDDI database.
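The idea of treating segmentation as a two-class problem over image columns can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the article: the toy word image, the window-of-columns features, the network size, the learning rate, and the helper names `column_windows`, `train_mlp`, `predict` and `f_score` are all hypothetical. The F-score shown is the standard column-level one, not the new metrics the article introduces.

```python
import numpy as np

def column_windows(word_img, half_width=2):
    """One feature vector per image column: the flattened window of
    2*half_width+1 columns centred on it (edges padded with background 0).
    This feature choice is an assumption, not the article's."""
    h, w = word_img.shape
    padded = np.pad(word_img, ((0, 0), (half_width, half_width)))
    width = 2 * half_width + 1
    return np.stack([padded[:, c:c + width].ravel() for c in range(w)])

def train_mlp(X, y, hidden=8, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer MLP with full-batch backpropagation
    on the mean cross-entropy loss."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, hidden)
    b2 = 0.0
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                   # hidden activations
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # P(column is segmenting)
        g = (p - y) / len(y)                       # dLoss/dlogit
        gh = np.outer(g, W2) * (1.0 - h ** 2)      # back-propagate through tanh
        W2 -= lr * (h.T @ g)
        b2 -= lr * g.sum()
        W1 -= lr * (X.T @ gh)
        b1 -= lr * gh.sum(axis=0)
    return W1, b1, W2, b2

def predict(params, X):
    """Label a column 1 (segmenting) when its logit is positive."""
    W1, b1, W2, b2 = params
    return (np.tanh(X @ W1 + b1) @ W2 + b2 > 0).astype(int)

def f_score(y_true, y_pred):
    """Standard F1 for the segmenting class (0.0 if there are no true positives)."""
    t, p = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((t == 1) & (p == 1)))
    fp = int(np.sum((t == 0) & (p == 1)))
    fn = int(np.sum((t == 1) & (p == 0)))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy 6x10 "word": two ink blocks separated by two blank columns (4 and 5).
img = np.zeros((6, 10))
img[1:5, 0:4] = 1.0
img[1:5, 6:10] = 1.0
labels = (img.sum(axis=0) == 0).astype(float)   # 1 = segmenting column

X = column_windows(img)
params = train_mlp(X, labels)
pred = predict(params, X)
print("column-level F-score on the toy image:", f_score(labels, pred))
```

In degraded documents, symbols often touch or break, so blank columns alone do not mark the boundaries; that is the situation in which a learned column classifier is useful. The toy labels here come from blank columns only to keep the sketch short.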
Cite this article
Mukherjee, J., Parui, S.K. & Roy, U. NN-based analytic approach to symbol level recognition for degraded Bengali printed documents. Sādhanā 45, 263 (2020). https://doi.org/10.1007/s12046-020-01492-1