Deep learning-based recognition system for Pashto handwritten text: benchmark on PHTI

This article introduces a recognition system for handwritten text in the Pashto language, representing the first attempt to establish a baseline system using the Pashto Handwritten Text Imagebase (PHTI) dataset. Initially, the PHTI dataset was pre-processed to eliminate unwanted characters; subsequently, the dataset was divided into training (70%), validation (15%), and test (15%) sets. The proposed recognition system is based on multi-dimensional long short-term memory (MD-LSTM) networks. A comprehensive empirical analysis was conducted to determine the optimal parameters for the proposed MD-LSTM architecture. Counter experiments were used to compare the performance of the proposed system with state-of-the-art models on the PHTI dataset. The novelty of our proposed model, compared to other state-of-the-art models, lies in its hidden layer size (i.e., 10, 20, 80) and its Tanh layer size (i.e., 20, 40). The system achieves a Character Error Rate (CER) of 20.77% as a baseline on the test set. The top 20 confusions are reported to examine the performance and limitations of the proposed model. The results highlight the complications and future prospects of the Pashto language in its digital transition.


INTRODUCTION
One of the essential themes of the digital turn (Rosenzweig, 2003) is that media stored in computers must be adaptable for analysis and effective processing in various applications. Document images are media content that requires easy analysis and processing. These images include handwritten documents, forms, bank cheques, reports, printed magazines, historical documents, etc. Cameras and scanners serve as the primary sources for the acquisition of document images. However, these images (documents) are initially in pixel form and cannot be explicitly analyzed and processed. Therefore, a special mechanism is needed to convert such document images into a suitable form that can be read, analyzed, and processed.
Similarly, the model of Ahmad et al. (2016) contains three layers of (4, 20, 100) LSTM units with a Tanh layer size of (16, 80). To the best of our knowledge from the available literature, this is the first evaluation of the PHTI dataset for real-world text lines of the Pashto language. Figure 1 illustrates the abstract view of our proposed system.
The rest of the article is organized as follows: "Related Work" reviews document image analysis (DIA) work on Pashto. "Pashto Handwritten Text Imagebase-PHTI" describes the PHTI dataset. The proposed methodology is described in "Proposed Recognition System for PHTI". "Experiments and Results" covers the experimental setup and results, while "Conclusion" concludes the article.

RELATED WORK
In general, a considerable amount of work is available on the Pashto language. However, relatively little work addresses the recognition of handwritten Pashto text. We therefore present relevant work that may help in the recognition of handwritten Pashto text. The following paragraphs describe several techniques based on different approaches, including principal component analysis (PCA), hidden Markov models (HMM), artificial neural networks (ANN), convolutional neural networks (CNN), K-nearest neighbor (KNN), recurrent neural networks (RNN), histogram of gradients (HoG), zoning features, and the Scale-Invariant Feature Transform (SIFT). MacRostie et al. (2004) presented a DIA system for the Pashto language named BBN Byblos. The BBN system is based on HMMs, and its CER ranges from 2.1% to 26.3%. However, the dataset used in that research is not available. A dataset of 1,000 synthetically generated images of Pashto ligatures was presented by Wahab, Amin & Ahmed (2009). They used PCA for ligature classification and obtained 17% accuracy on average. Because of the scale variations in the dataset reported in Wahab, Amin & Ahmed (2009), Ahmad, Amin & Khan (2010) tested the SIFT descriptor for classifying those ligatures and achieved 73% accuracy, outperforming PCA.
In another work, Ahmad et al. (2015a) used MD-LSTM and showed promising results in recognizing ligatures with scale and rotation variations. The system achieved a ligature recognition rate of 99%. A dataset named Katib's Pashto Text Imagebase (KPTI) was introduced by Ahmad et al. (2016). The KPTI dataset contains 1,026 images taken from Pashto-scribed books. These images were further split into 17,015 text-line images using the methods described in Ahmad et al. (2017) and Ahmad, Naz & Razzak (2021). The CER on the KPTI dataset was 9.22% using an MD-LSTM model. The KPTI dataset was further evaluated in Ahmad et al. (2018), where the issue of space anomaly was considered, and the accuracy improved by 2.89%. Their approach for handling space anomaly was also validated on Arabic and Urdu text. The related work presented so far covers research on real-world Pashto text lines. The remaining work addresses only individual characters or digits in the Pashto language. Such work is not considered suitable for generalization to explore language composition and structure.
A medium-sized dataset was presented by Khan et al. (2019), containing 102 samples of each of 44 Pashto characters, for a total of 4,488 images. They used both KNN and ANN and obtained accuracies of 70.05% and 72%, respectively. Khan et al. (2020) presented a Pashto Handwritten Numerals Database (PHND) containing 50,000 scanned images of Pashto digits. They used CNN and RNN models to evaluate their dataset and obtained an accuracy of 98%.
Likewise, another medium-sized dataset was created by Khan et al. (2021) for Pashto handwritten character recognition. The dataset contains 8,800 images of Pashto characters. They used zoning techniques, Gabor filters, and hybrid feature maps. Hybrid feature maps showed a promising result, with an accuracy of 83%. Amin, Yasir & Ahn (2020) created a dataset named Poha containing 26,400 images of 44 Pashto characters and 10 Pashto digits. They used a CNN-based model to evaluate the Poha dataset and obtained an accuracy of 99.64%.
Similarly, Uddin et al. (2021) presented a Pashto handwritten character dataset consisting of 43,000 images. The authors used a deep neural network with a ReLU activation function and obtained an accuracy of 87.6%. Another medium-sized dataset was created by Huang et al. (2021). Their dataset contains 11,352 images of Pashto characters. They used the KNN algorithm with HoG and zoning features and achieved accuracies of 80.34% on KNN and 76.42% on HoG.
Rehman et al. (2021) developed a dataset for Pashto digits and characters. The authors used LeNet, CNN, and Deep CNN models for Pashto digit and character recognition. In their work, the Deep CNN showed promising results, yielding an accuracy of 99.42% for handwritten Pashto digits and 99.17% for handwritten characters. Siddiqu et al. (2023) developed an isolated Pashto character database for the 44 letters of the Pashto alphabet, using a variety of font styles. The database was pre-processed and reduced in size to 32 by 32 pixels, followed by conversion to binary format with a black background and white text.

PASHTO HANDWRITTEN TEXT IMAGEBASE-PHTI
In this research, we use the PHTI dataset (Hussain et al., 2022) as a benchmark. The dataset is designed for research that considers the generalization aspect of the Pashto language. PHTI contains 36,082 Pashto handwritten text-line images along with ground truth annotated in UTF-8. PHTI comprises various genres, including short stories, historical memory, poetry, and religious content. The data was collected from learners with diverse gender identities, educational backgrounds, and personal experiences. The available version of PHTI has 169 unique characters. In the context of the Pashto language, some of these characters are not directly linked to the language. Therefore, this study applies pre-processing. Figure 2 shows samples of PHTI text-line images. The dataset can be downloaded from https://github.com/adqecsbbu/PHTI.

Pre-processing
As mentioned, PHTI has 36,082 text-line images. We discovered non-Pashto (unwanted) characters such as A-Z, a-z, and 0-9, as well as special characters such as (¡, @, !). Text-line images containing these unwanted characters are skipped. As a result, the number of text-line images decreases from 36,082 to 25,939, and the number of unique classes (letters) is reduced from 169 to 94. Furthermore, to reduce the training time, the variable height of the text-line images is fixed to 48 pixels while locking the aspect ratio.
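The filtering and resizing steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the character set in `UNWANTED` only covers the examples listed in the text, and the function names (`keep_line`, `resized_shape`, `filter_samples`) are hypothetical.

```python
import re

# Assumption: "unwanted" means ASCII letters, ASCII digits, and the special
# characters named in the text; the exact set used by the authors may differ.
UNWANTED = re.compile(r"[A-Za-z0-9¡@!]")

TARGET_HEIGHT = 48  # pixels, as stated in the text


def keep_line(transcription: str) -> bool:
    """Return True when the ground-truth text has no unwanted character."""
    return UNWANTED.search(transcription) is None


def resized_shape(width: int, height: int, target_height: int = TARGET_HEIGHT):
    """Scale (width, height) to the target height, keeping the aspect ratio."""
    scale = target_height / height
    return max(1, round(width * scale)), target_height


def filter_samples(samples):
    """Keep only (image_path, transcription) pairs that pass keep_line."""
    return [(path, text) for path, text in samples if keep_line(text)]
```

A text line whose transcription fails `keep_line` is dropped entirely rather than cleaned, which matches the "skipped" wording in the text; the width returned by `resized_shape` would then be passed to whichever image library performs the actual resize.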

Data split
To support supervised learning, the data must be split fairly into training, validation, and test sets. For this, we split the PHTI data using the holdout method (May, Maier & Dandy, 2010). The data was shuffled, and then 70% of the text lines were selected for training, 15% for validation, and 15% for testing.
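A shuffled 70/15/15 holdout split of this kind can be sketched as below. The seed, the function name `holdout_split`, and the rounding choice (fractions are truncated and the remainder goes to the test set) are assumptions for illustration; the source does not specify them.

```python
import random


def holdout_split(items, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle a copy of items and split it into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, roughly 15%
    return train, val, test
```

Fixing the seed keeps the three sets identical across runs, which matters when several model configurations are compared on the same validation and test sets.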

Evaluation metric
To evaluate the results of our experiments on the PHTI dataset, we used the Levenshtein Edit Distance (LED) (Yujian & Bo, 2007). The LED is a metric that measures the difference between two sequences given as strings. The overall error is determined by calculating the total number of insertions (I), substitutions (S), and deletions (D) divided by the total number of characters in the target string (N). Equation (1) shows how CER can be computed using LED.

$$\mathrm{CER} = \frac{I + S + D}{N} \times 100 \qquad (1)$$
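As an illustration of the metric, the sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and derives a CER in percent from it. Aggregating edits over the whole corpus before dividing (rather than averaging per-line CERs) is an assumption; the source does not say which variant was used.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn source into target."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def cer(predictions, targets) -> float:
    """Corpus-level CER in percent: total edits / total target characters."""
    edits = sum(levenshtein(p, t) for p, t in zip(predictions, targets))
    total = sum(len(t) for t in targets)
    return 100.0 * edits / total
```

Note that this operates on Unicode code points, so a base letter and a combining mark count as two characters.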

PROPOSED RECOGNITION SYSTEM FOR PHTI
Handwritten text recognition is a classic problem; because of language specificity, the solutions are diverse and rely on language-specific treatments. Among these solutions, RNN-based techniques have performed better than other approaches on such problems (Naz et al., 2017; Messina & Louradour, 2015).
Before going into the details of the recognition system, it is essential to describe the proposed model and the grid search for its optimal parameters.

Proposed model
Deep-learning models based on RNNs can better learn the previous context of input sequences. Classic RNNs, however, are limited by the vanishing gradient problem, especially for long-term dependencies (Hochreiter, 1998). LSTM was introduced by Hochreiter & Schmidhuber (1997) to address the vanishing gradient problem; LSTMs learn both short-term and long-term dependencies. We use a multi-dimensional LSTM (MD-LSTM) based RNN approach that can scan the input images in four directions (up, down, left, and right). The MD-LSTM is robust against many variations, including scale, rotation, and registration. The proposed MD-LSTM system has two major design parameters: (1) the hidden layer size and (2) the Tanh layer size. The optimal values for the hidden layer size and the Tanh layer size are considered crucial. For example, the MD-LSTM model proposed by Graves & Schmidhuber (2008) contains three hidden layers with (10, 20, 80) LSTM units, while another model introduced by Ahmad et al. (2016) contains three layers of (4, 20, 100) LSTM units. Therefore, the number of hidden layers and LSTM units depends on the specific problem. We need empirical analysis to find the optimal hidden layers as well as the LSTM units in those layers.
In addition, we also need optimal values for the Tanh layer size. Therefore, we performed two different grid analyses: (1) to find the optimal hidden layer size with appropriate LSTM units and (2) to find the optimal Tanh layer size. The next section describes the overall grid analysis in detail.

Grid analysis
The purpose of the first grid analysis is to obtain the optimal hidden layer size in terms of LSTM units. For this purpose, we carried out 10 experiments. We started with the minimum number of LSTM units in three hidden layers, and the focus was on obtaining the minimum CER on the test set (TS) and validation set (VS). Table 1 shows the overall grid analysis. It is noteworthy that the Tanh layer size for each hidden layer configuration was set to its maximum value. For example, consider the first row in Table 1 for the hidden layer size (4, 15, 80): the first Tanh layer (i.e., 16) lies between the first and second hidden layers, and the second Tanh layer (i.e., 60) lies between the second and third hidden layers. The 4 LSTM units in four different directions (up, down, left, and right) make 16 connections. Therefore, the maximum Tanh layer size is taken as 16. Similarly, the second hidden layer contains 15 LSTM units, which multiplied by four directions makes 60 connections. Thus, the second Tanh layer size is taken as 60. After the completion of the 10 experiments, we found the optimal hidden layer size to be (10, 20, 80) LSTM units, as shown in bold in Table 1 at serial no. 9.
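The arithmetic behind the maximum Tanh layer size can be written out as a short sketch. The helper names `max_tanh_sizes` and `is_valid` are hypothetical, and treating the maximum as an upper bound on admissible Tanh sizes is an interpretation of the rule described above.

```python
DIRECTIONS = 4  # up, down, left, and right scanning directions


def max_tanh_sizes(hidden_sizes):
    """Upper bound for each Tanh layer placed between consecutive hidden
    layers: (LSTM units in the preceding hidden layer) x (directions)."""
    return tuple(DIRECTIONS * units for units in hidden_sizes[:-1])


def is_valid(hidden_sizes, tanh_sizes):
    """Check that every Tanh layer size is positive and within its bound."""
    limits = max_tanh_sizes(hidden_sizes)
    return len(tanh_sizes) == len(limits) and all(
        0 < size <= limit for size, limit in zip(tanh_sizes, limits)
    )
```

For the first row of Table 1, `max_tanh_sizes((4, 15, 80))` gives `(16, 60)`, matching the worked example in the text. The configuration of Ahmad et al. (2016), with hidden sizes (4, 20, 100) and Tanh sizes (16, 80), sits exactly at this bound.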
The second grid analysis aims to find the optimal values for the Tanh layer units. We performed seven experiments while keeping the hidden layer size fixed at (10, 20, 80) LSTM units. Table 2 shows the grid analysis for finding the optimal Tanh layer size. In this analysis, the focus was again on the minimum CER on the TS as well as the VS. Initially, the Tanh layer size was kept at a minimum and was then gradually increased. In this way, a Tanh layer size of (20, 40) was found to provide the best performance.
Consequently, the configuration of the proposed model is complete. It has three hidden layers with (10, 20, 80) LSTM units that can scan an input image in four directions. Furthermore, the Tanh layer size of the proposed model is (20, 40). Figure 3 shows the detailed architecture of our proposed MD-LSTM model.

EXPERIMENTS AND RESULTS
After the grid search identified a suitable hidden layer size (i.e., 10, 20, and 80 LSTM units) and Tanh layer size (i.e., 20, 40) for the PHTI dataset, the proposed model was trained on the training set. To avoid over-fitting, the validation set was provided during the training process to check the validity of the learning process. After 70 epochs, the training converged to a minimal loss, with a CER of 15.45% on the training set and 21.15% on the validation set. The training was then stopped and the model was evaluated on unseen data, i.e., the test set. As the PHTI dataset has not been tested so far, no results are available for comparison. However, to validate the selection of our proposed model, counter experiments were run with the models of Graves & Schmidhuber (2008) and Ahmad et al. (2016). Table 3 shows the comparison of the mentioned models on the PHTI dataset.
Further, to examine the complexities present in the material, we computed the overall confusion matrix, which is a 94 × 94 matrix. For quick reference, Table 4 provides the top 20 confusions associated with the letters in the PHTI dataset. The top two of these confusions involve a single pair of letters, which are misclassified as each other 1,712 times. The major reason is the shape similarity among Pashto letters. Similarly, these two letters are confused with all other letters 3,866 times overall. Thus, the impact of the top two confusions on the overall error is about 2.59%. These findings are consistent with other research, including Ahmad et al. (2016).
Visual analysis shows that images presenting shorter text lines are comparatively well recognized, whereas longer text lines are not recognized accurately. Figure 4 shows the visual comparison of a few text-line images along with the predicted text. The results also support the empirical analysis via grid search for finding the hidden layer size and Tanh layer size, as the proposed model provides better performance than the models of Graves & Schmidhuber (2008) and Ahmad et al. (2016). The majority of the misclassifications occur among letters that resemble each other in shape. We suggest transfer learning in the near future to improve the overall accuracy on the PHTI dataset.

Figure 1
Framework for Pashto handwritten text recognition system. DOI: 10.7717/peerj-cs.1925/fig-1

Shabir et al. (2023) prepared an extensive Pashto-transformed invariant inverted handwritten text dataset. They fine-tuned MobileNetV2 for Pashto (text classification and image feature extraction). The transformed invariant lightweight Pashto deep learning (TILPDeep) technique was used, and the authors achieved an accuracy of 0.9839 on the training set and 0.9405 on the validation set. Khaliq et al. (2023) worked on the recognition of Pashto characters and ligatures in handwritten text, using the PHWD-V2 dataset for their experiments. Several machine learning and deep learning methods were assessed, including MobileNetV2, VGG19, MobileNetV3, and a customized CNN (CCNN). The CCNN showed the highest accuracy, with 93.98, 92.08, and 92.99 for training, validation, and testing, respectively.
Hussain et al. (2022) published a dataset named Pashto Handwritten Text Imagebase (PHTI). PHTI contains 36,082 text-line images presenting real-world Pashto text in handwritten form. However, the dataset has not been explored or tested for character recognition. Hence, in the available literature, very little work addresses the recognition of handwritten text lines from real-world Pashto data. Most of the work addresses isolated characters or digits. The main reason was the non-availability of a dataset containing images of handwritten text lines. Therefore, this work presents a baseline recognition system on the PHTI dataset.

Figure 2
Samples of the PHTI dataset. DOI: 10.7717/peerj-cs.1925/fig-2

Figure 3
Proposed MD-LSTM model, consisting of three hidden layers with LSTM units and two Tanh layers with Tanh activation function. DOI: 10.7717/peerj-cs.1925/fig-3

CONCLUSION
This article presented a deep learning-based recognition system using the MD-LSTM model for the recognition of Pashto handwritten text-line images. The PHTI dataset was benchmarked for the first time in this research. For the proposed recognition system, comprehensive experiments were conducted to find the optimal hidden layer size for the MD-LSTM-based architecture. Similarly, the Tanh layer size was determined through several experiments. The final proposed model was also compared with the models developed by Graves & Schmidhuber (2008) and Ahmad et al. (2016). The proposed system shows a baseline accuracy of 79.23% on the test set of the PHTI dataset, corresponding to a CER of 20.77%. Further, the misclassifications are examined, and the top 20 confusions on the PHTI dataset are provided.

Table 1
MD-LSTM empirical analysis: hidden layers vs. Tanh layer size. HLS: hidden layer size; TLS: Tanh layer size; VS: validation set; TS: test set; TE: total epochs; PETS: per-epoch time spent.

Table 2
Grid analysis for Tanh layer size. TLS: Tanh layer size; VS: validation set; TS: test set; TE: total epochs; PETS: per-epoch time spent.

Table 4
Top 20 confusions associated with the letters in the PHTI dataset. MCWL: misclassifications with letter; MCWAL: misclassifications with all letters.
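One way to derive a table of this shape from aligned recognition output is sketched below. It assumes that MCWL counts how often a true letter is recognized as one specific other letter and that MCWAL counts how often that true letter is recognized as any other letter; this reading of the abbreviations, the function name `confusion_stats`, and the input format are assumptions. The alignment of predicted and target characters (for example, via an edit-distance backtrace) is not shown.

```python
from collections import Counter


def confusion_stats(pairs, top_k=20):
    """pairs: iterable of (true_char, predicted_char) from aligned output.
    Returns up to top_k rows (true, predicted, MCWL, MCWAL), sorted by MCWL."""
    mcwl = Counter()   # true letter recognized as one specific other letter
    mcwal = Counter()  # true letter recognized as any other letter
    for true_char, pred_char in pairs:
        if true_char != pred_char:
            mcwl[(true_char, pred_char)] += 1
            mcwal[true_char] += 1
    return [
        (true_char, pred_char, count, mcwal[true_char])
        for (true_char, pred_char), count in mcwl.most_common(top_k)
    ]
```

With a 94-class alphabet, the same counts could equally be read off the off-diagonal cells and row sums of the 94 × 94 confusion matrix.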