Generalized Segmentation Algorithm for Dissimilar Script Languages

Authors

  • Abdul Majid, Department of Computer Science and Technology, Faculty of Information Science and Technology, Ocean University of China
  • Qinbo, Department of Computer Science and Technology, Faculty of Information Science and Technology, Ocean University of China
  • Dil Nawaz Hakro, Faculty of Engineering and Technology (FET), University of Sindh, Jamshoro, Pakistan
  • Muhammad Owais Khan, Department of Computer Science and Technology, Faculty of Information Science and Technology, Ocean University of China

DOI:

https://doi.org/10.32628/CSEIT2390657

Keywords:

Multiscript, Languages, Scripts

Abstract

Optical character recognition (OCR) is considered one of the fastest methods of data entry. OCR converts a text image, represented as pixel information at x and y coordinates, into text data in a particular language. As a field of pattern recognition and document image understanding, OCR becomes more challenging when the image contains text in a different language. Differences in script pose different challenges, which require entirely different approaches and algorithms: Latin scripts require one approach, whereas the scripts of languages that have adopted the Arabic script require another. Accordingly, various solutions have been proposed for different languages. Segmentation is one of the most important tasks in the OCR process, and good segmentation increases OCR accuracy. Segmentation first separates the text lines from the text image, then divides the lines into words, and finally divides the words into the characters to be recognized. This study proposes a single segmentation algorithm for the scripts of multiple languages: it identifies the script and then segments the text image for further OCR processing. The proposed generalized algorithm checks the style, direction and other properties of the script and then adapts the segmentation process to segment the text lines, words and characters of the language. The algorithm segments more than ten languages written in three scripts, and the resulting images can be passed on to feature extraction and classification, which makes OCR for the selected languages easier. Experiments covered multiple scripts, languages and images, and the proposed algorithm successfully segmented 32,833 text line, word and character images. The algorithm achieves 97% accuracy in segmenting these images and can be extended to further languages and scripts.
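
The line, word and character hierarchy described in the abstract can be illustrated with projection profiles, a technique commonly used for printed-text segmentation. The sketch below is only an illustration under stated assumptions and is not the authors' algorithm: the abstract does not say which method the paper uses, and the names `runs`, `segment_lines`, `segment_columns`, `min_gap` and `right_to_left` are hypothetical. It assumes a binarized image (1 = ink, 0 = background), lines separated by empty rows, and words separated by wider column gaps than characters.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where the profile is non-zero.

    Runs separated by fewer than `min_gap` empty positions are merged.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            spans.append((start, i))       # the run ends at an empty position
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for span in spans:
        if merged and span[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], span[1])  # gap too small: same unit
        else:
            merged.append(span)
    return merged


def segment_lines(image: Image) -> List[Tuple[int, int]]:
    """Split the page into text lines using the horizontal projection profile."""
    row_profile = [sum(row) for row in image]
    return runs(row_profile)


def segment_columns(image: Image, top: int, bottom: int,
                    min_gap: int = 1,
                    right_to_left: bool = False) -> List[Tuple[int, int]]:
    """Split one text line (rows top..bottom) along the vertical projection profile.

    A small `min_gap` yields character-like pieces, a larger one yields words.
    `right_to_left` only reverses the reading order of the returned spans
    (hypothetical handling of script direction, e.g. for Arabic-script text).
    """
    width = len(image[0]) if image else 0
    col_profile = [sum(image[r][c] for r in range(top, bottom))
                   for c in range(width)]
    spans = runs(col_profile, min_gap)
    return spans[::-1] if right_to_left else spans


# Tiny hand-made example: two text lines, the first holding two "words".
sample: Image = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 0],
]

lines = segment_lines(sample)
top, bottom = lines[0]
words = segment_columns(sample, top, bottom, min_gap=3)
chars = segment_columns(sample, top, bottom, min_gap=1)
```

Column gaps alone do not separate the characters of cursive scripts such as Arabic, where letters within a word are joined, so a practical multi-script segmenter would need additional script-specific steps beyond this sketch.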

Published

2023-12-30

Section

Research Articles

How to Cite

[1]
Abdul Majid, Qinbo, Dil Nawaz Hakro, Muhammad Owais Khan, "Generalized Segmentation Algorithm for Dissimilar Script Languages", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 9, Issue 6, pp. 303-309, November-December 2023. Available at doi: https://doi.org/10.32628/CSEIT2390657