Photometric Ligature Extraction Technique for Urdu Optical Character Recognition

Urdu Optical Character Recognition (OCR) based on character-level recognition (the analytical approach) is less popular than ligature-level recognition (the holistic approach) because of the added complexity caused by overlapping characters and strokes. This paper presents a holistic Urdu ligature extraction technique. The proposed Photometric Ligature Extraction (PLE) technique is independent of font size and column layout and is capable of handling non-overlapping ligatures as well as all inter- and intra-overlapping ligatures. It uses a customized photometric filter together with X-shearing, padding, and connected component analysis to extract complete ligatures instead of extracting primary and secondary ligatures separately. A total of approximately 267,800 ligatures were extracted from scanned Urdu Nastaliq printed text images with an accuracy of 99.4%. The proposed framework thus outperforms the existing Urdu Nastaliq text extraction and segmentation algorithms. It can also be applied to other languages written in the Nastaliq script style, such as Arabic, Persian, Pashto, and Sindhi.
Keywords-ligature; holistic; Urdu OCR; Nastaliq; photometric filter; Urdu printed text images

I. INTRODUCTION
OCR technology is used to obtain machine-editable text from text images. It allows the digitization of valuable printed and handwritten data covering cultural and historical milestones [1]. The commercial OCR systems now available report recognition rates close to 100% for languages using the Latin alphabet, such as English, German, and French. Arabic and Chinese OCR systems are also well developed. Despite significant research interest in this area, OCR systems for many languages, including Urdu, are still at the development stage [2][3]. Urdu is Pakistan's official language and has a large collection of valuable printed and handwritten material in the form of books, novels, magazines, and newspapers. Most of this material is not accessible digitally. The Urdu language has 39 basic characters, 28 of which are Arabic. It is mostly written in the Nastaliq script style, a complex calligraphic style written diagonally from right to left with varying inter- and intra-word spaces, overlapping characters and strokes, incorrect or filled loops, and no fixed baseline [4][5], as shown in Figure 1.
Figure 1. Major challenges in Nastaliq text: intra-overlapping ligatures (red), inter-overlapping ligatures (green), false and filled loops (blue), and missing baseline (yellow).
Urdu OCR is primarily composed of five stages: image acquisition, pre-processing, segmentation, classification and recognition, and post-processing [6]. Image acquisition collects digital images through camera shots, scanned text images, or generated synthetic images [6]. Pre-processing aims to enhance the quality of an acquired image [6]; noise and skew removal, binarization, and contrast enhancement are the main operations in this step, performed with classic image processing techniques. Segmentation decomposes a source image into characters, ligatures, or words [7][8], usually by means of projection profiles and Connected Component Analysis (CCA). Classification aims to correctly classify the extracted or segmented features (ligatures, characters, words, etc.). The most common classifiers are Decision Trees (DTs), Statistical Classifiers (SCs), Neural Networks (NNs) [9][10], Hidden Markov Models (HMMs), and Support Vector Machines (SVMs). Finally, post-processing corrects the recognition errors in the obtained text [10]. The techniques used for OCR post-processing include manual error correction, dictionary-based error correction, and context-based error correction [12][13].
Among the above stages, segmentation at the character, ligature, or word level is the most challenging stage in Urdu OCR. Based on these levels, Urdu OCR can be divided into two categories: the analytical approach at the character level [14][15] and the holistic approach at the ligature level [7][16][17]. The analytical approach segments text at the character level either explicitly or implicitly. Explicit segmentation requires extensive knowledge of characters, as it explicitly divides handwritten or printed text into characters. Many researchers have adopted explicit character segmentation [17][18][19][20][21]. Implicit segmentation, on the other hand, integrates the segmentation and recognition processes. Successful work has been reported for implicit segmentation [22][23][24][25][26] due to the smaller number of segments. However, both algorithms require a massive amount of training data for better results. The holistic approach, also referred to as the segmentation-free method, extracts text at the ligature or word level. Ligatures are formed from isolated (non-joiner) characters and non-isolated (joiner) characters (Figure 2). Avoiding character-level segmentation has made the holistic method extremely popular [3][27][28][29][30][31][32]. The authors in [27] followed the projection technique for text line extraction. The main body and diacritics were identified based on the distance between the horizontal base and the average line. The technique was tested on a small data set that was not specific to the Nastaliq script, consisting of 1050 single characters and ligatures, with 98.86% accuracy. The authors in [28] used the horizontal projection technique, with CCA applied before text segmentation. The horizontal span of each secondary component on the baseline was calculated to re-associate diacritics with their respective primary ligatures. However, this approach was designed to work on text files rather than text images to extract complete ligatures.
Similarly, the authors in [29] applied the vertical projection profile method to associate secondary ligatures by calculating the start and end points of diacritics. Their method reported 100% and 99% accuracy in baseline identification and ligature extraction, respectively, on scanned images with a font size of 48, but it ignores intra-overlapping ligatures and is font size dependent.
The authors in [3] employed the horizontal projection method along with dilation to merge secondary and primary ligatures before separating lines from the image. The authors in [30] used only 300 ligature samples to evaluate their method, reporting 91.3% accuracy in segmentation and 78% in diacritics association. The authors in [31] proposed a ligature extraction technique based on 6 heuristic conditions, reporting an accuracy of 99.02% on 45 images. The authors in [32] proposed a line segmentation technique that applies connected component analysis to images to collect the width, height, and centroids of ligatures, reporting 99.80% accuracy. However, this technique does not segment multi-column scripts or inter- and intra-overlapped ligatures. Many recognition techniques classify primary and secondary components separately [3][27][28][29][30][31][32] to reduce the number of distinct recognizable classes. Such techniques face significant challenges in re-associating the secondary components with their primary components to recognize the entire ligature. The complexity of character segmentation has shifted the focus towards the holistic approach, i.e. the recognition of words or ligatures in the text. Segmenting text at the character level is more complex than recognizing words and ligatures due to character overlapping, varying inter- and intra-word spaces, context sensitivity, the different forms characters take according to their position in a word or ligature, and the cursive script style. The literature review reveals that Urdu OCR remains an open field for researchers to design a system capable of handling factors such as intra- and inter-ligature overlapping, multi-column text images with borders, font variation, and large volumes of ligature data for classification.
An efficient ligature extraction technique for Urdu OCR is proposed in this paper. The proposed method extracts complete ligatures rather than separating primary and secondary components. It is independent of font size and column layout and is capable of handling all overlapping and non-overlapping ligatures by addressing intra-overlapped ligatures as well as the complex association of secondary components. Because complete ligatures are extracted, secondary ligatures do not need to be reassociated with their primary ligatures in the classification and recognition steps. The proposed framework is designed for Urdu but is applicable to other languages that follow the Nastaliq style, such as Arabic, Persian, Pashto, and Sindhi.
II. THE PROPOSED METHODOLOGY
The proposed framework for ligature extraction is depicted in Figure 3. It consists of three steps: image acquisition, image binarization, and PLE. Urdu printed text images from novels and religious books written in the Nastaliq style, in single and double columns and with varying font sizes, were downloaded from different sources [33] and are referred to as I_img. First, each I_img is converted into a binarized image I_th by hard thresholding. The resulting I_th is a monochrome image with a white background and black text (Figure 4). Then, the PLE process is applied to each I_th.
Figure 3. Framework for Urdu ligature extraction.
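The binarization step can be illustrated with a minimal sketch. It assumes a grayscale image stored as rows of 0-255 integers and a cut-off of 128; the paper does not state the threshold value, so both the function name `binarize` and the default threshold are hypothetical. The output follows the convention used in the rest of the paper: 1 for white background, 0 for black text.

```python
def binarize(gray, threshold=128):
    """Hard-threshold a grayscale image given as rows of 0-255 integers.

    Returns 1 for white background and 0 for black text.
    The threshold of 128 is an assumed value, not taken from the paper.
    """
    return [[1 if px >= threshold else 0 for px in row] for row in gray]


# Tiny synthetic patch: a light background with a few dark text pixels.
gray = [
    [250, 245, 252, 248],
    [240,  30,  25, 247],
    [251,  20, 243, 249],
]
I_th = binarize(gray)
```

A fixed cut-off is the simplest possible choice; scans with uneven illumination may need a different value per image.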

A. Photometric Ligature Extraction (PLE)
The proposed PLE uses a customized photometric filter that is specifically designed to decompose an image based on photometric similarity. The stepwise description of the PLE process follows:
• In the first step, PLE deploys a photometric filter to extract text lines (L_lines) from the image (I_th). The algorithm in Figure 4 demonstrates the working of the photometric filter. The filter scans the binarized image from top to bottom to detect text using the logical AND operator. The size of the filter is matched to the width (W) of the image, i.e. 1×W. The output of the filter for each row is saved in an array. The resulting array is a stream of zeroes and ones, which is reduced with the AND operation to a single bit, i.e. 0 or 1. A value of 0 indicates the presence of one or more black pixels in the row; otherwise the value is 1.
• In the second step, each line image in L_lines is first rotated counterclockwise by 90°. The photometric filter is then applied to each rotated line to extract both overlapped and non-overlapped ligatures (L_lig).
• In the third step, the overlapped ligatures are corrected by applying an X-shear transformation and padding simultaneously to each L_lig, which overcomes the most challenging issue of inter- and intra-ligature overlapping. The output of this step consists of the sheared and padded ligatures L_sheared-lig.
• In the fourth step, the L_sheared-lig images are classified into two classes based on the extent value of the first ligature encountered in the image using CCA. Height, width, centroid, etc. are the major properties obtained through CCA. The developed methodology uses an additional component property termed extent, defined as the ratio of the contour area to the bounding rectangle area. The extent value is a key feature that distinguishes secondary from primary ligatures with 99% accuracy. If the extent value of a ligature is less than the hard threshold value, a dilation operation is carried out on that ligature, producing L_ligs,dilated. This process reduces the distance between the primary and secondary components of a ligature.
• In the last step, the photometric filter is applied again to all dilated and non-dilated ligatures, L_ligs,dilated and L_ligs,non-dilated, to extract complete ligatures as the final output L_extracted-ligs.
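The first two steps above can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: the function names (`photometric_filter`, `split_on_blank`, `rotate_ccw`, `rotate_cw`, `extract_ligature_candidates`) are hypothetical, and images are assumed to be lists of rows with 1 for white and 0 for black.

```python
def photometric_filter(image):
    """AND-reduce every row of a binary image (1 x W window).

    Returns one bit per row: 0 if the row contains any black pixel, else 1.
    """
    return [int(all(px == 1 for px in row)) for row in image]


def split_on_blank(image):
    """Cut the image into bands of consecutive rows whose filter bit is 0."""
    bands, start = [], None
    for y, bit in enumerate(photometric_filter(image)):
        if bit == 0 and start is None:
            start = y                      # a band of text begins
        elif bit == 1 and start is not None:
            bands.append(image[start:y])   # a blank row closes the band
            start = None
    if start is not None:
        bands.append(image[start:])
    return bands


def rotate_ccw(image):
    """Rotate 90 degrees counterclockwise (rightmost column becomes first row)."""
    return [list(col) for col in zip(*image)][::-1]


def rotate_cw(image):
    """Rotate 90 degrees clockwise (inverse of rotate_ccw)."""
    return [list(col) for col in zip(*image[::-1])]


def extract_ligature_candidates(line):
    """Rotate a text line, reuse the row filter, and rotate each piece back."""
    return [rotate_cw(band) for band in split_on_blank(rotate_ccw(line))]


page = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]
lines = split_on_blank(page)                      # step 1: two text lines
ligatures = extract_ligature_candidates(lines[0]) # step 2: pieces of line 0
```

Because the counterclockwise rotation places the rightmost column first, the pieces of a line come out in right-to-left order in this sketch, which matches the reading direction of Urdu.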

B. Demonstration of the Proposed PLE
The stepwise demonstration of the proposed PLE technique is shown in Figure 6. The input of the PLE technique is a monochrome image with a white background and black text (Figure 5). In the first step, text lines are extracted one by one from the text image by applying the photometric filter (Figure 6(a)). In the next step, each extracted line is first rotated counterclockwise and then passed through the photometric filter again to extract both overlapped ligatures (marked with red circles) and non-overlapped ligatures (Figure 6(b)). The issue of inter- and intra-ligature overlapping is resolved (see Figure 6(c), marked with green circles) by applying X-shearing and padding simultaneously to each ligature. Figure 6(d) depicts the list of dilated and non-dilated ligatures. The dilation process reduces the distance between the primary and secondary components of a ligature. Finally, the photometric filter is applied again to these ligatures to obtain the final output shown in Figure 6(e). This last pass further improves the correct separation of ligatures.
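The shearing, extent, and dilation steps can be sketched in the same style. Several points here are assumptions rather than details from the paper: the shear factor `k`, the 3×3 dilation neighbourhood, the value of `EXTENT_THRESHOLD`, and the use of the black-pixel count of a whole crop as a stand-in for the contour area of a connected component. All function names are hypothetical.

```python
def x_shear_padded(image, k=1):
    """Shift row y horizontally by round(k * y) pixels.

    White (1) padding is added on both sides so no ink leaves the frame.
    """
    offsets = [round(k * y) for y in range(len(image))]
    lo = min(offsets)
    pad = max(offsets) - lo
    sheared = []
    for row, off in zip(image, offsets):
        left = off - lo
        sheared.append([1] * left + list(row) + [1] * (pad - left))
    return sheared


def blank_columns(image):
    """1 for every column that contains no black pixel, else 0."""
    return [int(all(px == 1 for px in col)) for col in zip(*image)]


def extent(image):
    """Black-pixel count divided by the tight bounding-box area (simplified)."""
    ys = [y for y, row in enumerate(image) if 0 in row]
    xs = [x for row in image for x, px in enumerate(row) if px == 0]
    if not xs:
        return 0.0
    box = (max(ys) - min(ys) + 1) * (max(xs) - min(xs) + 1)
    return len(xs) / box


def dilate(image):
    """Grow black regions by one pixel using a 3x3 neighbourhood."""
    h, w = len(image), len(image[0])
    out = [list(row) for row in image]
    for y in range(h):
        for x in range(w):
            if image[y][x] == 0:
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        out[ny][nx] = 0
    return out


# Two diagonal strokes whose column ranges overlap: no blank column between them.
overlapped = [
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 1],
]
before = blank_columns(overlapped)
after = blank_columns(x_shear_padded(overlapped, k=1))

# A sparse crop (body plus distant diacritic) falls below the assumed threshold.
EXTENT_THRESHOLD = 0.5  # hypothetical value
crop = [[0, 1, 1], [1, 1, 1], [1, 1, 0]]
merged = dilate(crop) if extent(crop) < EXTENT_THRESHOLD else crop
```

In this toy case the shear straightens both strokes, so a blank column appears between them and a further pass of the filter can separate them; the appropriate shear direction and factor for real Nastaliq text would have to be chosen empirically.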

III. RESULTS AND ANALYSIS
The proposed Urdu ligature extraction framework was evaluated on downloaded Urdu printed text images. The technique was tested on a total of 600 novel and book images. The working dataset mainly consisted of non-overlapping lines with no boundary across images. First, the photometric filter was applied to the images and extracted lines with an accuracy of 99.6%. A total of 13,200 lines were extracted from the 600 images. These lines were then segmented into ligatures. A total of 267,800 ligatures were extracted after the complete execution of all the steps of the proposed PLE, with an overall accuracy of 99.4%. Table I compares the proposed ligature extraction framework with previously reported methods. The authors in [27] evaluated their approach on 1050 ligatures with 98.86% accuracy in primary and secondary stroke extraction. The authors in [29] achieved 99% accuracy in ligature and diacritics extraction. The authors in [30] tested their system on 300 sample images, of which 274 were segmented correctly, giving 91.3% accuracy. The authors in [31] analyzed 45 Urdu images to classify and associate the connected components with 99.02% accuracy. The authors in [32] used 10,063 text lines to test their algorithm and reported an accuracy of 99.8%.
However, as discussed above, due to the limited availability of ligature data sets, researchers have mostly evaluated their algorithms on their own data sets to check the accuracy of ligature segmentation/extraction. Therefore, the reported accuracy depends on the complexity of the text images used for segmentation and reassociation of primary and secondary components. Segmentation algorithms achieving accuracy near 99% apply CCA for primary and secondary component segmentation in [28][31][32] and the projection profile method in [3][27][29][30], and then reassociate the secondary components. These studies also ignore the extraction of inter-overlapped ligatures. The last row of Table I presents the findings of the proposed technique. The proposed solution resolved the problem of inter-ligature overlapping with an accuracy of 99.4%. However, the efficiency of the proposed method is reduced due to the redundant use of the word . This complete ligature remains unaffected even after X-shearing because the primary component Alif 'ا' lies in the region of the second main component 'يک' and the diacritics also overlap with the neighboring primary components. It was also observed that spacing between diacritics that lie below the main body sometimes leads to incorrect line segmentation.
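As a quick sanity check on the figures reported in this section, the averages implied by the dataset counts and the accuracy reported for [30] can be recomputed directly. The variable names are illustrative; only the counts come from the text.

```python
# Counts reported in this section.
images, lines, ligatures = 600, 13_200, 267_800

lines_per_image = lines / images        # average text lines per page image
ligatures_per_line = ligatures / lines  # average ligatures per text line

# Accuracy reported for [30]: 274 of 300 samples segmented correctly.
accuracy_30 = 100 * 274 / 300

print(f"{lines_per_image:.0f} lines/image, "
      f"{ligatures_per_line:.2f} ligatures/line, "
      f"{accuracy_30:.1f}% for [30]")
```

The ratio 274/300 reproduces the 91.3% quoted for [30], and the data set averages 22 lines per image.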

IV. CONCLUSION
This paper presented an efficient technique for the extraction of Urdu ligatures in Nastaliq fonts. The technique uses a customized photometric filter together with X-shearing, padding, and CCA, which results in the efficient extraction of overlapped and non-overlapped ligatures. The proposed framework achieves an accuracy of