Pashtu Numerals Recognition through Convolutional Neural Networks

Abstract-In this paper, we introduce a new Pashtu numerals dataset of scanned handwritten images and make it publicly available for scientific and research use. Pashtu is used by more than fifty million people for both oral and written communication, yet little effort has been devoted to an Optical Character Recognition (OCR) system for the language. We propose a deep learning based method for recognizing handwritten Pashtu numerals, using convolutional neural networks (CNNs) for both feature extraction and classification. The proposed CNN based model obtains a recognition accuracy of 91.45%.

Keywords-Optical character recognition, Convolutional neural networks, Pashtu numerals recognition

Date Received 04-12-2019; Date Accepted 25-12-2019; Date Published 01-01-2020


I. INTRODUCTION
With the advancement of information technology, online and offline usage of digital text is increasing day by day. A system that converts scanned images of machine-printed and handwritten text into editable form is called Optical Character Recognition (OCR). OCR is one of the most researched areas in computer vision and pattern recognition, and OCR for most languages has matured over the last 15 years [1][2][3]. The primary sources of text images are scanned documents, images of scenes, and broadcast videos.
The very first OCR system was investigated 65 years ago [4]. Since then, sustained research effort has led to nearly complete OCR systems for the major languages of the world. Although very large scale deployments exist for non-cursive script languages, a mature OCR for cursive script languages is still a challenging task.
Pashtu is spoken by around fifty million people around the globe [5]. It is a national language of Afghanistan and is also spoken in most parts of Pakistan. The language has a rich and diverse literature. A large body of written material exists in Pashtu, covering diverse topics such as education, politics, religion, and poetry. Despite all this, Pashtu still lacks a mature OCR system.
Pashtu OCR (POCR) remains far from mature because of several major problems specific to the language. For example, Pashtu is written in a cursive script from right to left. In non-cursive script languages the shape of a character varies very little, whereas characters in Pashtu show significant variation. Similarly, the formation rules for Pashtu are very complicated [6,7]. In cursive script languages, when several characters combine they form an intermediate shape called a ligature. Ligatures do not occur in non-cursive script languages, and their presence makes an OCR system for cursive scripts more complex.
We have a long term research strategy that will lead to a complete OCR system for Pashtu. As no standard database exists for OCR development, as an initial step we build the first Pashtu numerals database, which we call PHND V-0. We also introduce a CNN based recognition algorithm for these numerals.
In a nutshell, our current research work makes the following contributions:
• Introducing a new database for numerals recognition of the Pashtu language.
• Introducing a CNN based model for the recognition of the numerals of Pashtu text.

II. RELATED WORK
OCR has previously been addressed through two prominent classes of methods: holistic and analytical. We discuss both in turn. Holistic methods have no specific typography rules. They are generic and can be applied to any language. An image containing text is treated as a one-dimensional vector, and features are extracted from the whole image, so no segmentation is needed. One of the main drawbacks of these methods is that they require a large amount of training data. These algorithms are robust to changes in scale and rotation. Moreover, a rich set of features is needed for building a model.
For synthetic data, a very low error rate has been reported with these methods. However, they fail to perform well when applied to a comparatively larger database, because very little training data was used in the development stage.
For Pashtu text, a method based on the holistic approach is reported in [13]. The authors used the Noori Nastaliq script of the language, and the OCR system was evaluated on a synthetic database. Other methods developed for OCR can be found in [14][15][16][17][18][19].
The second class is analytical methods, which are more advanced and are constructed from specific grammatical rules of the respective language. A unique set of features is used to identify each character, and segmentation is performed at the atomic level. These methods perform better when the prior segmentation is easy. For non-cursive script languages the boundary of a character can be located easily; hence the results are much better. Good segmentation is mandatory for acceptable performance, and segmentation is itself a big challenge in analytical methods. For the Pashtu language, no algorithm based on analytical methods has been developed so far.
Some methods which are based on Hidden Markov Models and Neural Networks are reported in [8,9] for other cursive script languages. Some excellent papers have also been published for cursive script languages in ICDAR [10,11].
A database for Pashtu ligatures is reported in [21]. The authors used Recurrent Neural Networks to develop a Pashtu OCR and named their database KPTI. KPTI consists of 17,015 images of Pashtu text, and the tests in [21] are performed on a limited set of images. To the best of our knowledge, [21] is the best research work reported specifically for the Pashtu language. Other works that use deep learning based methods for cursive script languages can be found in [22][23][24][25].

III. PASHTU NUMERALS DATABASE
A main drawback of deep learning based methods is the requirement of large training data. In this paper, we introduce the first handwritten database for Pashtu numerals. The database is freely available for research and can be provided upon request.
We collected data from different regions to bring diversity in writing style. The images were collected from faculty members, staff, and students of three universities, namely, the University of Azad Jammu and Kashmir, the University of Malakand, and the University of Peshawar. A form was distributed among the participants, on which they wrote the Pashtu digits by hand. In total, 750 participants took part in the data collection, and every participant wrote each digit four times.
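As a quick check of the resulting database size, assuming that each of the 750 participants contributed all ten digits (0 to 9) and that no form was discarded, the total number of numeral images is:

$$750 \times 10 \times 4 = 30{,}000$$

This agrees with the 30,000 images used in the experiments.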
The participants were between 18 and 60 years of age. We scanned the written forms at a resolution of 300 dpi and then applied the following pre-processing steps:
• We corrected the inclination of each page with a horizontal histogram.
• We detected the center of each numeral, using a connected component algorithm to localize the center.
• We extracted each numeral from the image and rescaled it to 30 × 30.
• We converted all images to binary form after the above-mentioned steps.
The handwritten images of each participant are stored in one folder. Figure 1 shows the complete folder of a single participant.
Each file name has three parts, i.e., S, D, and V. S represents the subject number, which is in the range 1-750, D shows the digit number, and V represents the version of writing, which is from 0-4. We collected the written forms from 750 participants. The database can be freely downloaded for research use.
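The center-localization step of the pre-processing can be illustrated with a minimal sketch. It assumes a binarized page stored as a list of rows of 0/1 values and uses 8-connectivity; the function name `component_centroids` and the toy page are hypothetical, and the paper does not state which connected component variant was actually used.

```python
from collections import deque

def component_centroids(img):
    """Return (row, col) centroids of 8-connected foreground blobs.

    img is a list of rows of 0/1 values, where 1 marks ink.
    """
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    centroids = []
    for r0 in range(rows):
        for c0 in range(cols):
            if img[r0][c0] != 1 or seen[r0][c0]:
                continue
            # Breadth-first flood fill collects one connected component.
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            pixels = []
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if (0 <= rr < rows and 0 <= cc < cols
                                and img[rr][cc] == 1 and not seen[rr][cc]):
                            seen[rr][cc] = True
                            queue.append((rr, cc))
            n = len(pixels)
            # The centroid is the mean pixel position of the component.
            centroids.append((sum(p[0] for p in pixels) / n,
                              sum(p[1] for p in pixels) / n))
    return centroids

# Toy "page" with two separate blobs (made-up data, not from PHND).
page = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
centers = component_centroids(page)
```

Each centroid could then serve as the center of the window that is cropped and rescaled to 30 × 30.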

IV. PROPOSED METHOD
The details of the proposed method are discussed in this section. We used a CNN based method for feature extraction and classification. More details are given in the following paragraphs.

A. Architecture
The performance of a CNN based model depends on several parameters, for example, the size of the kernels, the number of convolutional layers, and the number of filters in every layer. Our proposed architecture is shown in Figure 2. In our model, we used two convolutional layers with 24 filters of size 5×5 and 48 filters of size 5×5, respectively. As the activation function, we used the rectified linear unit (ReLu). After each convolutional layer, we placed a max-pooling layer.
A CNN model has three main parts, i.e., convolutional layers, pooling layers, and fully connected layers. We represent the kernels as N×M×C, where N and M are the height and width of the filter and C is the number of channels. The pooling layer filters are represented by P×Q, where P and Q represent the height and width of the filter, respectively. The fully connected layer is the final layer, which performs the task of classification. Our complete framework is shown in Figure 2.

Layers

1) Convolutional Layers
Certain features of the images, such as edges, corners, and edge points, are extracted through the convolutional layers. We used a stride of 1 pixel for feature extraction. The input to the activation of a convolutional layer can be computed as

$$z = k * I + b$$

where $I$ is the input to the layer, $b$ represents the bias matrix, and $k$ represents the filter moving through the image. The activation function is then applied as

$$y = f(z)$$

which in our case is ReLu:

$$f(z) = \max(0, z)$$

The ReLu helps to increase the non-linear properties of the decision function and of the overall network.

2) Pooling Layers
In the pooling layer, small patches are taken from the output of the convolutional layers and each patch is down-sampled to a single output. Different kinds of pooling are reported in the literature; in our work, we used maximum pooling, which takes the maximum value of the whole block. We fixed the size of the pixel window to 3 × 3.

3) Fully Connected Layers
The final output of the convolutional and pooling layers is given to the fully connected layer. The data is first flattened and then passed to the fully connected layer, which connects all neurons of the previous layer to its own neurons. The complete architecture with its connectivity is shown in Figure 2.
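The three operations described above (filtering with stride 1, ReLu, and maximum pooling) can be sketched in plain Python on a toy input. This is an illustration only, not the authors' implementation: the 4 × 4 image and 2 × 2 kernel are made up, no padding is used, and the pooling stride of 3 is an assumption because the text gives only the 3 × 3 window size.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Stride-1 sliding-window filter without padding.

    Like most CNN libraries, this computes a cross-correlation
    (the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw)) + bias
             for c in range(out_w)]
            for r in range(out_h)]

def relu(fmap):
    """Element-wise max(0, z)."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size, stride):
    """Keep the maximum of each size x size window."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    return [[max(fmap[r * stride + i][c * stride + j]
                 for i in range(size) for j in range(size))
             for c in range(out_w)]
            for r in range(out_h)]

# Made-up 4 x 4 input and 2 x 2 kernel, for illustration only.
image = [
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [2, 1, 0, 0],
    [1, 0, 1, 2],
]
kernel = [[1, -1],
          [-1, 1]]

features = relu(conv2d_valid(image, kernel))   # 3 x 3 feature map
pooled = max_pool(features, size=3, stride=3)  # 1 x 1 after 3 x 3 pooling
```

In the real model the same steps run over 30 × 30 inputs with 24 and 48 learned 5 × 5 filters, and the pooled maps are flattened before the fully connected layer.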

B. CNN Optimization

1) Learning Rate
The learning rate α controls the size of each update to the weights of the network and therefore determines how the network converges. If α is too small, convergence is slow; if it is too large, the training diverges. We selected the value of α with extreme care.
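The effect of the learning rate can be seen on a one-dimensional toy problem. The sketch below runs plain gradient descent on f(x) = x² with three hand-picked values of α; the function and the values are illustrative assumptions and are unrelated to the value used for the proposed model.

```python
def gradient_descent(alpha, steps=50, x0=1.0):
    """Minimise f(x) = x**2 with a fixed learning rate alpha."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x          # derivative of x**2
        x = x - alpha * grad    # step against the gradient
    return x

x_small = gradient_descent(alpha=0.01)  # small alpha: still far from 0
x_good = gradient_descent(alpha=0.4)    # moderate alpha: essentially 0
x_large = gradient_descent(alpha=1.1)   # too large: |x| keeps growing
```

Each step multiplies x by (1 − 2α), so the iterates shrink only when 0 < α < 1 and shrink fastest near α = 0.5.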

2) Activation Function
ReLu is commonly used as an activation function in deep learning models, and we also used it in our model. It increases the non-linear properties of the decision function, helps the generalization ability of the CNN, and reduces the computational cost of the model.

3) Stochastic Gradient Descent
We used Stochastic Gradient Descent (SGD) to update the weights and biases. At each iteration, the algorithm takes a small step in the direction of the negative gradient in order to minimize the loss function:

$$P_{j+1} = P_j - \alpha \nabla L(P_j)$$

In the above equation, $j$ is the iteration number, $\alpha > 0$ is the learning rate, $P$ is the parameter vector, and $L(P)$ is the loss function. Standard gradient descent evaluates this gradient over the whole dataset at once.

4) Mini-batch
SGD evaluates the gradient of the proposed CNN model and updates all parameters using only a part of the training data, which we call a mini-batch. In the optimization process, the whole training set is divided into batches. The gradient is calculated for each batch and, after the network is updated, the next batch is considered. In this way, the loss function is reduced with each iteration. An epoch is one full pass over the whole training data through these small subsets, i.e., mini-batches. We fixed the mini-batch size to 100 and the number of epochs to 20 during our work.
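A minimal sketch of how one epoch is split into mini-batches is given below. It uses the mini-batch size of 100 from the text and the 20,000 training images reported in the experiments; the helper `minibatches` and the fixed random seed are hypothetical choices made for illustration.

```python
import random

def minibatches(n_samples, batch_size, seed=0):
    """Yield lists of shuffled sample indices, one list per mini-batch."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # shuffle once per epoch
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

n_train = 20000     # training images reported in the experiments
batch_size = 100    # mini-batch size stated in the text

batches = list(minibatches(n_train, batch_size))
updates_per_epoch = len(batches)   # one parameter update per mini-batch
```

With these numbers, one epoch corresponds to 20,000 / 100 = 200 parameter updates.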

5) Momentum
In some cases, the descent algorithm oscillates along the path of steepest descent while moving toward the optimum value. We added a momentum term to prevent this oscillation. The SGD update then becomes

$$P_{j+1} = P_j - \alpha \nabla L(P_j) + \gamma (P_j - P_{j-1})$$

In this equation, the symbol $\gamma$ decides how much the previous gradient step contributes to the current iteration. The training data is shuffled during this process.

6) Regularization
Overfitting is a common problem during the training of supervised learning models. To reduce it, we added a regularization term for the weights to the loss function. The loss function after regularization takes the form

$$L_R(P) = L(P) + \lambda \Omega(w)$$

In this equation, the weight vector is represented by $w$, the regularization coefficient by $\lambda$, and the regularization function by $\Omega(w)$.

7) Softmax Classifier
In most CNN models, the Softmax classifier is used for multi-class classification tasks. It is particularly helpful when a probabilistic output is needed. We also applied Softmax for multi-class classification in our work:

$$\sigma_r(x) = \frac{\exp(a_r(x))}{\sum_{k=1}^{K} \exp(a_k(x))}$$

In the above equation, $x$ represents the input image to the network, $a_r(x)$ is the score that the network assigns to class $r$, $\sigma_r(x)$ is the resulting probability of class $r$, and $K$ is the number of classes.
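The momentum update, the regularized loss, and the Softmax function can be sketched together in a few lines of plain Python. Several choices here are assumptions rather than details taken from the paper: the names `gamma` (momentum coefficient) and `lam` (regularization coefficient), the L2 penalty Ω(w) = ½ Σ w² whose gradient is w, the application of that penalty to every parameter, and all numeric values.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def momentum_step(p, p_prev, grad, alpha, gamma, lam):
    """One SGD update with momentum and an L2 penalty.

    Assumes Omega(w) = 0.5 * sum(w**2), so the penalty adds lam * w
    to the gradient of the unregularized loss.
    """
    return [pj - alpha * (g + lam * pj) + gamma * (pj - pp)
            for pj, pp, g in zip(p, p_prev, grad)]

# Made-up scores for three classes and a made-up two-parameter model.
probs = softmax([2.0, 1.0, 0.1])
new_p = momentum_step(p=[1.0, -2.0], p_prev=[0.5, -2.0],
                      grad=[0.2, -0.4], alpha=0.1, gamma=0.9, lam=0.0)
```

In the example, the first parameter moved by +0.5 in the previous step, so the momentum term adds 0.9 × 0.5 to its update, while the second parameter receives only the plain gradient step.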

8) Network Parameters
The network parameters play a vital role in assessing the complexity of an architecture and allow a clear comparison between different architectures. We computed the dimension of the feature map as

$$D = \frac{I - C}{S} + 1$$

where $D$ represents the dimension of the resulting feature map, $I$ refers to the dimension of the feature map used as input, $C$ is the dimension of the filter that is convolved with $I$, and $S$ refers to the stride used in the convolution process. For each layer, we obtained the number of parameters through

$$T_j = (C \times C \times F_{j-1} + 1) \times F_j$$

where $T_j$ represents the total parameters in the jth layer, $F_j$ is the number of output feature maps of the jth layer, and $F_{j-1}$ is the total number of feature maps in the (j-1)th layer.

Figure 3. The mis-classification rate (%) for training data.
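The feature-map dimensions and parameter counts of the described architecture can be worked through with the standard expressions (I − C)/S + 1 for the output size and (C·C·F_prev + 1)·F_out for the parameters of a convolutional layer. The sketch below rests on assumptions that the text does not state: no padding, a single-channel (binary) 30 × 30 input, one bias per filter, and a pooling stride equal to the 3 × 3 window size.

```python
def out_dim(i, c, s=1):
    """Output size when a c-wide window slides over an i-wide input."""
    return (i - c) // s + 1

def conv_params(c, f_prev, f_out):
    """Weights plus one bias per filter for a c x c convolution."""
    return (c * c * f_prev + 1) * f_out

POOL_STRIDE = 3  # assumption: the pooling stride is not given in the text

after_conv1 = out_dim(30, 5)                       # 24 filters, 5 x 5
after_pool1 = out_dim(after_conv1, 3, POOL_STRIDE) # 3 x 3 max pooling
after_conv2 = out_dim(after_pool1, 5)              # 48 filters, 5 x 5
after_pool2 = out_dim(after_conv2, 3, POOL_STRIDE)

params_conv1 = conv_params(5, 1, 24)   # single-channel input assumed
params_conv2 = conv_params(5, 24, 48)
```

Under these assumptions the spatial size shrinks 30 → 26 → 8 → 4 → 1, and the two convolutional layers hold 624 and 28,848 parameters; a different pooling stride or padding would change the sizes but not the formulas.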

V. EXPERIMENTS AND RESULTS

A. Experimental Setup
For the experiments, we used a system with an Intel Core i7 CPU, 8 GB of RAM, and an NVIDIA 840M GPU. The model was implemented with TensorFlow and Keras. We trained the model for 20 epochs with a batch size of 100.
For Pashtu numerals recognition, the only dataset reported in the literature is PHND V-0. We used 20,000 images for the training stage and the remaining 10,000 images for testing.
We used subjects 1-500 for training and subjects 501-750 for testing. With 40 images per subject, this corresponds to 20,000 and 10,000 images, respectively.

B. Results Discussion
The performance of the proposed CNN based model for numerals recognition is investigated and discussed in this section of the paper.
We cannot compare our results with results on any other Pashtu numerals database, as no other such database has been reported in the literature. Details of PHND are given in Section III. The database consists of 30,000 images; we used 20,000 images for training and 10,000 for testing. From Figure 3 it is clear that the misclassification rate decreases as the number of epochs increases. At 20 epochs, the misclassification rate for the training data almost reaches 0.
For more details, the confusion matrix for all ten classes is provided in Table 3. Table 3 shows that the classification accuracy of some classes, such as 0, 2, 3, and 5, is comparatively weak. The likely reason for these weaker results is the similarity in shape of these digits; for example, 0 is mostly confused with 5 and vice versa. These details can be studied in Table 3 for each class. Overall, we obtained a classification accuracy of 91.64%.
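The relation between a confusion matrix and the overall and per-class accuracies can be sketched as follows. The 3-class matrix is made up for illustration and is not taken from Table 3; rows are assumed to hold the true classes and columns the predicted classes.

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def per_class_recall(cm):
    """For each true class, the fraction of its samples predicted correctly."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

# Made-up matrix: rows are true classes, columns are predictions.
cm = [
    [8, 1, 1],
    [0, 9, 1],
    [2, 0, 8],
]
overall = accuracy(cm)
recalls = per_class_recall(cm)
```

Off-diagonal entries show which classes are confused with each other; in this toy matrix the first and third classes are mistaken for one another more often than for the second.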
In a nutshell, the results reported are encouraging and confirm the effectiveness of the newly proposed CNNs based model for the Pashtu numerals recognition.

VI. CONCLUSION AND FUTURE WORK
In this work, we introduce a new Pashtu numerals database named PHND V-0. The database consists of 30,000 images collected at three different universities in Pakistan. We make the database freely available for download and research purposes. We also introduce a deep learning based recognition system for Pashtu digits. The model is a convolutional neural network with two convolutional layers, each followed by a maximum pooling layer. The fully connected layer completes the classification task, for which we used Softmax.
Our current research work is part of our long term research strategy regarding cursive script languages. As future work, we are planning several directions. Firstly, we will build a complete Pashtu database for ligatures. In the next step, we will move towards a complete OCR for the Pashtu language, which we will explore for both offline and online Pashtu text recognition. We also plan to work on the translation of Pashtu text into other languages. Finally, we intend to apply our proposed deep learning based methods to other cursive script languages that are at the same undeveloped stage, such as Sindhi and Punjabi.