A REVIEW OF ARABIC TEXT RECOGNITION DATASET

Building a robust Optical Character Recognition (OCR) system for languages with cursive scripts, such as Arabic, has always been challenging. The challenges increase when the text contains diacritics of different sizes for characters and words. Apart from the complexity of the font used, these challenges must be addressed in recognizing the text of the Holy Quran. To solve them, an OCR system has to go through several phases, and each problem has to be addressed with a different approach; researchers are therefore studying these challenges and proposing various solutions. This has motivated the present study to review Arabic OCR datasets, because the dataset plays a major role in determining the nature of an OCR system. State-of-the-art approaches to segmentation and recognition are identified, in particular Recurrent Neural Networks (Long Short-Term Memory, LSTM, and Gated Recurrent Unit, GRU) used with Connectionist Temporal Classification (CTC). The review also covers deep learning models and the use of GRU in the Arabic domain. This paper contributes a profile of Arabic text recognition datasets, which in turn determines the nature of the OCR systems developed, and identifies research directions for building Arabic text recognition datasets.


INTRODUCTION
Conventionally, the input of an Optical Character Recognition (OCR) system is a page image. To convert this image into the equivalent text, the page is usually segmented into paragraphs, the paragraphs into text-lines, the text-lines into words, the words into sub-words, and finally the sub-words into individual characters. A character recognizer then recognizes the segmented characters one by one. This method is called segmentation-based OCR. The second, more recent method is holistic OCR, in which recognition is performed at the word or text-line level. This method avoids the issues of character segmentation. OCR methods therefore fall into two types: segmentation-based and segmentation-free (holistic).
In the segmentation-based OCR method, the first main step is to detect discrete components that are suitable for the final recognition step. Such a method depends heavily on extracting the individual characters from the text. It remained the state-of-the-art method for some time, until segmentation-free methods outperformed it. An old version of Tesseract (Smith 2007), a popular open-source OCR system, is a good example of a segmentation-based OCR system. Segmentation-based OCR can be divided into two classes: template matching and over-segmentation. Template matching is based on connected components: the characters are extracted and matched against the possible templates, and recognition is achieved using some similarity measure. The use of template-matching methods is quite limited, as they suffer greatly from font variations, image noise, and touching characters. In the over-segmentation method, instead of finding a precise segmentation point between two characters, an approximate segmentation point is located; as a result, over-segmentation occurs and is corrected in the next step. The over-segmentation method is very useful for cursive scripts such as Arabic, where correct character segmentation is very hard to perform. Ahmed et al. (2007a, 2007b), Jabril et al. (2011, 2016a, 2016b, 2016c) and Atallah & Khairuddin (2009) introduced rules for reconfirming the potential segmentation points of Arabic words using the peaks and vertex points of Voronoi diagrams on the baseline, based on peak detection. Their word and sub-word segmentation approach has three steps. First, a peak detection function is adopted to model the maximum and minimum peaks. Second, a stroke operator is used to extract potential segmentation points. Third, a baseline determination process estimates the parameters, which depend mostly on the minimum peaks, and determines the vertex point nearest to each minimum peak on the baseline in order to confirm that minimum peak as a segmentation point. In addition, Jabril et al. (2017) applied a novel method that locates segmentation points in Arabic words by detecting peaks with neural networks. This method relies on baseline and peak identification and segments the text in two steps: a peak identification function is applied at the sub-word level to frame the minimum and maximum peaks, and baseline detection provides high accuracy.
The difference between segmentation-free and segmentation-based methods is the level of segmentation. In segmentation-free methods, the text-line is usually segmented into words or parts of a word, and the recognition engine then tries to recognize the whole word or the parts of the word. In the segmentation-free approaches, sometimes called holistic approaches, discriminating features are extracted from the word or parts of the word. An HMM or ANN classifier is then trained on the extracted features to recognize the word or the sub-word. A recent approach in the segmentation-free domain is to use an RNN with CTC. This method is called sequence learning, where the classifier has to map the input sequence to the output sequence. This study focuses on RNN and CTC.

RECURRENT NEURAL NETWORKS
The human mind usually analyzes information according to past experience and the current context, whereas traditional neural networks are not context-dependent. Recurrent Neural Networks (RNN), in contrast, are designed to take advantage of context information. An RNN consists of connected units that allow information to be passed from one step to the next, as shown in Figure 1. The RNN is capable of remembering the context of a sequence owing to the feedback connections between the hidden layers. In practice, however, it is not capable of remembering very long contexts because of the Exploding Gradient Problem and the Vanishing Gradient Problem. When training uses gradient descent-based learning, the error signal is propagated back to update the internal weight connections. The first-order gradient values either grow exponentially, hence the name Exploding Gradient, or shrink to zero exponentially, which is called the Vanishing Gradient; either way, RNN learning becomes very slow or even unusable. To solve these problems, Hochreiter & Schmidhuber (1997) replaced the activation units at the hidden layer with memory cells; the resulting architecture is called the Long Short-Term Memory (LSTM), as shown in Figure 2.
FIGURE 2. Example of a net with eight input units, four output units, and two memory cell blocks of size 2 (Hochreiter & Schmidhuber 1997)
Another RNN, introduced by Cho et al. (2014) and called the Gated Recurrent Unit (GRU), also aims to solve the vanishing gradient problem. The GRU is basically an LSTM without an output gate, and it therefore fully writes the contents of its memory cell to the larger net at each time step, as shown in Figure 3. The LSTM has three gates, i, f and o, which are the input, forget, and output gates, respectively; c and c~ are the memory cell and the new memory cell content. The GRU has two gates, r and z, which are the reset and update gates; h and h~ are the activation and the candidate activation.
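To make the GRU gate notation concrete, the following is a minimal NumPy sketch of a single GRU step and of its application over a sequence. It is an illustration only and not the implementation of any reviewed system: the parameter names (W_z, U_z, b_z and so on), the random initialization, and the sizes are assumptions of this sketch. The update rule follows one common convention, h = (1 - z) * h_prev + z * h~; some formulations swap the roles of z and 1 - z.

```python
import numpy as np


def sigmoid(x):
    """Logistic function used for the reset and update gates."""
    return 1.0 / (1.0 + np.exp(-x))


def init_gru_params(input_size, hidden_size, seed=0):
    """Create small random GRU parameters (hypothetical names and scale)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("z", "r", "h"):
        params["W_" + gate] = 0.1 * rng.standard_normal((hidden_size, input_size))
        params["U_" + gate] = 0.1 * rng.standard_normal((hidden_size, hidden_size))
        params["b_" + gate] = np.zeros(hidden_size)
    return params


def gru_step(x, h_prev, p):
    """One GRU time step: returns the new activation h."""
    # Update gate z decides how much of the candidate replaces the old state.
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])
    # Reset gate r decides how much of the old state feeds the candidate.
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])
    # Candidate activation h~.
    h_cand = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h_prev) + p["b_h"])
    # Interpolate between the previous activation and the candidate.
    return (1.0 - z) * h_prev + z * h_cand


def run_gru(xs, p):
    """Apply gru_step over a sequence xs of shape (T, input_size)."""
    h = np.zeros(p["b_h"].shape[0])
    states = []
    for x in xs:
        h = gru_step(x, h, p)
        states.append(h)
    return np.stack(states)


# Usage: a sequence of 5 feature vectors of size 4, hidden size 3.
params = init_gru_params(input_size=4, hidden_size=3)
states = run_gru(np.ones((5, 4)), params)  # array of shape (5, 3)
```

Because the new activation is an interpolation between the previous activation and the candidate, the state can be carried forward almost unchanged when z is close to zero, which is the mechanism that lets gated units retain longer context than a plain RNN.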
In order to map the input sequence to the output sequence, a specialized algorithm called CTC is used, which is comparable to the forward-backward algorithm of HMMs.
CONNECTIONIST TEMPORAL CLASSIFICATION
Connectionist Temporal Classification (CTC) was introduced by Graves et al. (2006) for labelling unsegmented sequence data with recurrent neural networks. Sequence classification usually requires pre-segmented training data and post-processing to transform the network outputs into label sequences. CTC removes both requirements: the CTC loss function maps the input sequence to the output sequence during training, and the CTC decoder transforms the neural network output into the final text.
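The two roles of CTC described above can be sketched in a few lines of NumPy. The sketch below is a simplified illustration under stated assumptions, not the code of any reviewed system: index 0 is assumed to be the blank symbol, `probs` is assumed to be a (T, V) array of per-frame softmax outputs, the decoder is the simple best-path (greedy) variant, and the loss is computed in the probability domain rather than the log domain, so it is only suitable for short sequences.

```python
import numpy as np

BLANK = 0  # index reserved for the CTC blank symbol (an assumption of this sketch)


def ctc_greedy_decode(probs, blank=BLANK):
    """Best-path decoding: arg-max label per frame, merge repeats, drop blanks."""
    best_path = np.argmax(probs, axis=1)
    decoded = []
    previous = blank
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(int(label))
        previous = label
    return decoded


def ctc_label_probability(probs, labels, blank=BLANK):
    """Forward algorithm: total probability of all frame paths that collapse to labels."""
    # Interleave blanks: [l1, l2] becomes [blank, l1, blank, l2, blank].
    extended = [blank]
    for label in labels:
        extended += [label, blank]
    num_frames, num_states = probs.shape[0], len(extended)

    alpha = np.zeros((num_frames, num_states))
    alpha[0, 0] = probs[0, blank]
    if num_states > 1:
        alpha[0, 1] = probs[0, extended[1]]

    for t in range(1, num_frames):
        for s in range(num_states):
            total = alpha[t - 1, s]              # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1, s - 1]     # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, extended[s]]

    result = alpha[-1, -1]
    if num_states > 1:
        result += alpha[-1, -2]
    return result


def ctc_loss(probs, labels, blank=BLANK):
    """Negative log-likelihood of the label sequence under the CTC model."""
    return -np.log(ctc_label_probability(probs, labels, blank))


# Usage: two frames, alphabet {blank, 'a'}, uniform outputs.
# The paths (a, a), (a, blank) and (blank, a) all collapse to "a".
uniform = np.full((2, 2), 0.5)
p_a = ctc_label_probability(uniform, [1])  # 3 paths * 0.25 = 0.75
```

The forward recursion has the same structure as the forward pass of an HMM, which is why CTC is often compared with the forward-backward algorithm.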
ARABIC OPTICAL CHARACTER RECOGNITION DATASET
An OCR system requires a dataset for training, so that the system can learn how to recognize the text within an image and then convert that image into digital text. Owing to the lack of a standard benchmark, most studies in this field were conducted using private datasets, without a fair comparison. Hence, although most works report high accuracy, their results may not scale to a larger set of problems. Therefore, an extensive list of publicly available datasets is offered in this subsection.
IFN/ENIT (Pechwitz et al. 2002) is an Arabic handwritten word dataset that contains 26,459 handwritten Tunisian town names, which were written by 411 different writers. This dataset is available to the public for research purposes.
The Arabic Handwritten Database (AHDB) (Al-Ma'adeed et al. 2002) contains the most popular Arabic words, numerals, and entities used in cheques, and written by 100 different writers.
The Arabic Cheque Database (Al-Ohali et al. 2003) is a handwritten cheque database for legal and courtesy amount recognition. It contains 29,498 sub-words, 15,175 digits in the form of Indo-Arabic numerals, and 2,499 legal and courtesy amount words extracted from 3,000 cheques.
The Handwritten Arabic Character Database (Asiri & Khorsheed 2005) contains 15,800 isolated handwritten Arabic character images, written by approximately 500 Saudi Arabian secondary school students of both genders. The handwritten pages were scanned at 300 dpi, and each character image was saved as a 7×7 grey-scale image. However, this dataset is unavailable to the public.
The Handwritten Arabic Digit Database (Awaidah & Mahmoud 2009) contains 21,120 scanned samples of digits written by 44 different writers. Each writer wrote each digit from 0 to 9 a total of 48 times in the Indian format. The images were saved with a resolution of 300 pixels and then converted to binary format. A horizontal histogram was used to segment the scanned pages into lines, and a vertical histogram was then used to segment each line into digits. This dataset is available online for researchers.
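The histogram-based segmentation mentioned for this dataset can be illustrated with a short NumPy sketch. It is a generic projection-profile example and not the dataset authors' code: the function names are hypothetical, the input is assumed to be a binary image in which ink pixels are 1 and background pixels are 0, and any row or column with no ink is treated as a gap. For cursive Arabic words, column gaps of this kind separate isolated digits or sub-words rather than individual connected characters.

```python
import numpy as np


def find_runs(mask):
    """Return half-open (start, end) index pairs for each run of True values."""
    padded = np.concatenate(([False], np.asarray(mask, dtype=bool), [False]))
    changes = np.diff(padded.astype(int))
    starts = np.where(changes == 1)[0]
    ends = np.where(changes == -1)[0]
    return list(zip(starts.tolist(), ends.tolist()))


def segment_lines(binary_page):
    """Horizontal histogram: rows containing ink are grouped into text lines."""
    row_profile = binary_page.sum(axis=1)
    return find_runs(row_profile > 0)


def segment_columns(binary_line):
    """Vertical histogram: columns containing ink are grouped into components."""
    column_profile = binary_line.sum(axis=0)
    return find_runs(column_profile > 0)


# Usage on a tiny synthetic page with two lines; the first line has two blobs.
page = np.zeros((7, 9), dtype=int)
page[1:3, 1:3] = 1
page[1:3, 5:8] = 1
page[4:6, 2:4] = 1

for top, bottom in segment_lines(page):
    line_image = page[top:bottom]
    pieces = segment_columns(line_image)  # column spans of each blob in the line
```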
The Database for Handwritten Arabic Characters (HACDB) (Lawgali et al. 2013) was developed to cover all shapes of the Arabic characters, including overlapping characters. This dataset contains 6,600 characters written by 50 writers aged between 14 and 50 years. This database is available publicly for research purposes.
The UPTI database (Sabbour & Shafait 2013) contains images that are synthetically generated using the Nastaleeq font for Urdu printed text. This database consists of 10,063 images of Urdu text lines, provided in both ligature and line versions. This dataset is suitable for training deep learning models and for page segmentation. To segment the images into lines and words, baseline estimation was performed by calculating the maximum horizontal projection, followed by connected components analysis. This dataset is available publicly for research purposes.
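The baseline estimation step mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure of the UPTI authors: the function name is hypothetical, the input is assumed to be a binary text-line image with ink pixels equal to 1, and the baseline is taken to be the single row with the largest horizontal projection. The subsequent connected components step is not shown.

```python
import numpy as np


def estimate_baseline_row(binary_line):
    """Return the row index with the maximum horizontal projection (ink count)."""
    row_profile = binary_line.sum(axis=1)
    return int(np.argmax(row_profile))


# Usage: a synthetic line whose densest row is row 4.
line_image = np.zeros((6, 10), dtype=int)
line_image[2, 1:4] = 1
line_image[3, 2:7] = 1
line_image[4, :] = 1
baseline = estimate_baseline_row(line_image)
```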
Mohd Sanusi Azmi (2013) introduced a novel feature based on combinations of triangle geometry for digital Jawi paleography. A dataset of 69,400 images of Arabic calligraphy characters was built from the handwriting of ten calligraphy experts.
KHATT is an open Arabic offline handwritten text database. It has 2,000 unique paragraph images with 9,000 line images, written by 1,000 different writers from different countries and with different qualifications, ages, genders, and handedness (left or right). The images are stored in different resolutions of 200, 300, and 600 dpi. Besides handwriting recognition, the dataset can be used for other research purposes such as line segmentation, noise removal techniques, binarization, and writer identification. This dataset is divided into 70%, 15%, and 15% for training, validation, and testing, respectively. This dataset is available publicly for research purposes.
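A 70%, 15%, 15% partition of this kind can be reproduced for any list of samples with a few lines of Python. The sketch below is illustrative only: the function name, the fixed seed, and the use of plain sample identifiers are assumptions, and it does not reproduce the official KHATT partition. For handwriting data one may prefer to split by writer so that no writer appears in more than one subset; that refinement is not shown.

```python
import random


def split_dataset(items, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle items reproducibly and split them into train, validation, test."""
    items = list(items)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(items)
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


# Usage with 2,000 hypothetical sample identifiers.
train_ids, val_ids, test_ids = split_dataset(range(2000))
sizes = (len(train_ids), len(val_ids), len(test_ids))
```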
The KAFD dataset (Luqman et al. 2014) is an Arabic font database at page-level and text-line level. It consists of 40 fonts with 10 sizes in three resolutions at 100 dpi, 200 dpi, and 300 dpi. KAFD dataset contains 2,576,024 line-images. This dataset is available publicly for research purposes.
The ALIF Dataset (Yousfi et al. 2015a) is a dataset for Arabic embedded text recognition in video frames. It consists of 6,532 cropped text line images from 8 popular Arabic news channels. For benchmarking purposes, this dataset is divided into ALIF Train with 4,152 text images, ALIF Test1 with 900 text images, ALIF Test2 with 1,299 text images, and ALIF Test3 with 1,022 text images. This dataset can be obtained upon request.
The ACTIV Dataset (Zayene et al. 2015) is a public dataset, which was extracted from 80 videos (more than 850,000 frames) collected from 4 different Arabic news channels. It consists of 4,824 text lines with 21,520 words. This dataset is publicly available for research purposes.
SmartATID (Chabchoub et al. 2016) or the Smartphone Arabic Text Images Database contains both printed and hand-written images captured by mobile devices. The printed version contains 16,472 document images, while the hand-written version contains 9,088 document images. Both sets were captured using two types of mobile phones, namely, Samsung Galaxy S6 edge and iPhone 6S plus. Different parameters were used, such as camera version, light conditions, and position. This dataset is available publicly for research purposes.
Alaa et al. (2017) propose a database for degraded Arabic historical manuscripts dating to the Islamic and ancient Arabic eras. The documents in the database exhibit different types of degradation such as smears, uneven illumination, contrast variation, blur, deteriorated paper, bleed-through, faded ink or faint characters, and thin or weak text.
The Printed PAW Dataset (Bataineh 2017) introduces a database for printed sub-words or Part of Arabic Word (PAWs). The proposed database consists of 415,280 images with 83,056 unique PAWs, which can construct approximately 550,000 different words. This database will be available to the researchers upon request.
The ACTIV 2.0 Dataset (Zayene et al. 2018a) is a public dataset that was extracted from 189 video clips, and produces 4,063 key-frames for detection and 10,415 cropped text images for recognition. This dataset is distributed with open-source tools for annotation and evaluation.
The Quran Text Image Dataset (QTID) (Badry et al. 2018) is the first Arabic dataset that includes Arabic marks (diacritics). It consists of 309,720 word images with a dimension of 192×64, synthetically generated from the Quranic words with font sizes of 22, 24, 26, and 28 pixels.
Jabril et al. (2013) introduced a database (AHDB/FTR) of Arabic handwritten text images, which supports research on open-vocabulary Arabic handwritten text recognition, word segmentation, and writer identification, and which can be freely accessed by researchers worldwide. This database consists of 497 images of Libyan cities, handwritten by five Arabic scholars.
Table 1 clearly shows that more complex tasks, such as recognizing diacritical image texts (for example, the Quranic text) at the word or line level, have not received much attention. Only the QTID dataset deals with the Quranic text, yet this dataset is not available to researchers. In addition, it is synthetically generated at the word level. This study proposes a dataset based on a printed version of the Holy Quran, at the page and line level. The word level can easily be obtained from the line level by applying the vertical histogram projection. Furthermore, this dataset is meant to be publicly available. To the best of our knowledge, there are no publicly available diacritical line datasets or Quranic image datasets for text recognition purposes.
Heryanto et al. (2018) proposed a deep learning approach that uses a convolutional network to learn features, optimizing the data representation through end-to-end training of the parameters from raw input data to target class. A multi-classifier implicitly segments the sub-word into a sequence of characters; it consists of one sub-word length classifier and seven character classifiers. This approach was reported to be superior to state-of-the-art methods of Jawi handwriting recognition.

ARABIC TEXT RECOGNITION WITH DEEP LEARNING
Yousfi (2016) presented an Arabic video text recognition system based on the deep learning approach. The proposed model used the input image without any pre-processing or segmentation. A multi-scale window-based scanning scheme and deep neural models were applied to extract feature vectors from the input image: Deep Belief Networks and a Multi-Layer Perceptron were used as deep auto-encoders, alongside one model based on a convolutional neural network. Next, the feature vectors were sent to a BLSTM network to learn the sequence labeling, followed by a CTC output layer with a softmax activation function. A subset of the ALIF dataset was used in this work: 7,000 text images to train the model, 673 text images to validate the system, and 900 text images for testing. The author compared two approaches to feature extraction (learned features vs. hand-crafted ones) and reported that the convolutional neural network outperformed the hand-crafted approaches. To show the strength of this model, a comparative study was performed with 'ABBYY Fine Reader 12'. The system outperformed the commercial software by almost 11 points in terms of character recognition rate (CRR).
Graves (2012) won the ICDAR 2009 Arabic offline handwriting recognition competition. This work was based on MDLSTM recurrent neural networks, with raw pixel data as input and CTC as output. The dataset used for training and testing was IFN/ENIT.
Rashid et al. (2013) described a low-resolution, multi-font, open-vocabulary system for printed Arabic text. The system is based on an MDLSTM recurrent neural network architecture with a CTC layer. The proposed method was trained and evaluated using the APTI database, and a word recognition rate of 99% was reported.
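The studies in this section report results as character or word recognition rates (CRR, WRR) or as the corresponding error rates (CER, WER). The sketch below shows one common way to compute such rates from a reference text and a recognized text using the Levenshtein edit distance. It is an assumption of this sketch that the rate is defined as 1 minus the edit distance divided by the reference length; the reviewed papers may use slightly different definitions, and the function names are hypothetical. Note that Python strings are sequences of Unicode code points, so each Arabic diacritic counts as a separate character.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def recognition_rate(ref_units, hyp_units):
    """1 - (edit distance / reference length); the error rate is the complement."""
    if len(ref_units) == 0:
        raise ValueError("reference must not be empty")
    return 1.0 - levenshtein(ref_units, hyp_units) / len(ref_units)


def character_recognition_rate(ref_text, hyp_text):
    """CRR over Unicode code points (diacritics count as characters)."""
    return recognition_rate(list(ref_text), list(hyp_text))


def word_recognition_rate(ref_text, hyp_text):
    """WRR over whitespace-separated words."""
    return recognition_rate(ref_text.split(), hyp_text.split())


# Usage: one substituted character out of four, one wrong word out of four.
crr = character_recognition_rate("abcd", "abed")
wrr = word_recognition_rate("a b c d", "a b x d")
# A missing diacritic (BEH + FATHA vs. BEH alone) costs one edit.
missing_mark = levenshtein("\u0628\u064e", "\u0628")
```

Under this definition a system that drops diacritics is penalized once per missing mark, which is one reason why recognizing fully diacritized text such as the Quranic text is harder to score well on than undiacritized text.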
A study by Morillot et al. (2013) was presented by the University of Balamand (Lebanon) and Telecom ParisTech (France) for the OpenHaRT 2013 competition. They implemented a system based on BLSTM for the text-line recognition task. A recognition rate of 52% was obtained using a single BLSTM recognizer trained on only 11% of the available NIST/OpenHaRT data (145,000 text-lines).
Chherawala et al. (2013) compared handcrafted features and automatically learned features using the IFN/ENIT dataset. The features used were the concavity features (CCV) for Arabic word images; the distribution features of Rath and Manmatha (R-M) for handwritten word spotting in historical manuscripts; the features of Marti and Bunke (M-B) for handwritten text recognition with HMM; SIFT and Local Gradient Histogram (LGH) features; and features learned automatically by the MDLSTM. The results showed that although the MDLSTM is capable of learning features, the handcrafted features achieved better results.
Pham et al. (2014) reported that dropout on the first layer can reduce the CER and WER by 10% to 20%, and that if dropout is applied to the MDLSTM, the error can be reduced by 30% to 40%. The system was evaluated with three datasets in three languages: the RIMES dataset for French, with a character accuracy of 91.1%; the IAM dataset for English, with a character accuracy of 85.6%; and the OpenHaRT dataset for Arabic handwriting recognition, with a character accuracy of 90.1%.
Hamdani et al. (2014) used Hidden Markov Models (HMM) for sequence modeling, with features extracted by a BLSTM used to train the HMM. Minimum Phone Error (MPE) discriminative training was used to enhance the training. They used the OpenHaRT dataset and implemented an n-gram language model, which was pre-smoothed using the Modified Kneser-Ney method.
Yousefi et al. (2015) performed an experiment similar to that of Chherawala et al. (2013).
However, in this experiment they showed that the LSTM, which was faster to learn and converge than the MDLSTM, also achieved better results on the same IFN/ENIT dataset with the same handcrafted features, namely CCV, R-M, M-B, and LGH. The LSTM also extracted features automatically from the raw images; this result was obtained by applying a normalization scheme to the input to reduce the translation to the horizontal axis. They also showed that the LSTM with automatic features obtained a better result than with the handcrafted features.
Ahmad et al. (2017) proposed a system based on MDLSTM for Arabic character recognition, with a CTC layer as the output. A preprocessing technique was introduced that removes extra white space and de-skews the text-lines for precise height normalization. This system improved the recognition rate by 29% and reached a character recognition accuracy of 75.8% on the text-lines of the KHATT dataset.
Nashwan et al. (2017) proposed a holistic Arabic OCR approach that is computationally efficient. To reduce the word recognition time, they used a lexicon reduction technique based on clustering the features of similarly shaped words. The approach consists of two modules: a training module and a recognition module. For training, they extracted hybrid holistic features that combine a global, word-level Discrete Cosine Transform with local block-based features. They then built clusters of similar word shapes, and these clusters were subsequently used in the recognition module. After preprocessing the input image to extract the lines and words, the features were extracted for each word image. The model then looked for the best n clusters, that is, those with the minimum Euclidean distance to the test image vector. As a result, a word list from the selected clusters was used to construct a word matrix of possible recognition hypotheses for the whole line.
This word matrix was rescored using a 4-gram language model to obtain the best recognition hypothesis. Different sets were used to test the proposed system. The first set contained 1,152 words in three different fonts and four font sizes, and achieved a WRR of 99.3%. The second set contained 2,730 words of text from recent computerized books and achieved a WRR of 84.8%. The third set consisted of 2,276 words from old non-computerized books with not well-known fonts. These results were compared with Sakhr, ABBYY, and NovoDynamics, which are well-known commercial Arabic OCR systems, and the results were promising.
Zayene et al. (2018b) presented an Arabic video embedded text recognition system based on a deep learning approach. They used an MDLSTM network as the input layers, so that the MDLSTM learns the features from the raw input image, and CTC with a softmax activation function as the output layer. The method was trained and evaluated using the AcTiV-R database, a part of the AcTiV dataset that consists of 10,415 text-line images and 44,583 words. They reported a character recognition rate of 96.5%. They also reported that their system outperformed previous work on the ALIF dataset, in particular the systems based on the combination of CNN and BLSTM (Yousfi et al. 2015a, 2015b).
Rahal et al. (2018) proposed a holistic text recognition system based on statistical features. They adopted the Bag of Features (BoF) model, using a Sparse Auto-Encoder (SAE) for feature representation, and used the Hidden Markov Model (HMM) for the recognition process. As preprocessing, Gaussian smoothing was used to reduce the noise normally associated with text images, and image re-scaling was used to obtain a standardized height for all images. This system was evaluated in an experiment with three datasets, namely KHATT, APTI, and MNIST.
The average recognition accuracies obtained varied between 99.65% and 99.96% for the mono-font case and exceeded 99% for the mixed-font case.
Jain (2018) introduced an end-to-end system using a combination of CNN and RNN architectures and showed the superiority of the hybrid CNN and RNN over a system that depends only on RNN. This method is reported to have outperformed the previous methods on the existing benchmarks. Figure 4 shows the visualization of the hybrid CNN-RNN architecture with a 7-layer-deep convolutional block.
Suvarnam & Ch (2019) used a combined CNN-GRU model to recognize the characters of a license plate number without segmentation: the CNN was used for feature extraction and the GRU for sequence modeling. The reported testing precision of the proposed framework is 100% and the training accuracy is 90%.
Jiang et al. (2018) used end-to-end learning OCR technologies to solve the CAPTCHA problem. They used two pipelines to solve this arithmetic operation, the deep convolutional neural network (DCNN) with parallel dense layers and component-connection-
Table 2 clearly shows that the RNN has become the state-of-the-art system in the text recognition domain. Either MDLSTM or BLSTM can be used to obtain good results, and each has its own benefits. The LSTM was faster at learning and converging than the MDLSTM. The GRU, on the other hand, has not yet been used for Arabic text recognition. Some researchers have implemented GRU in OCR for license plates and CAPTCHA, and many researchers use GRU in the Arabic domain for other tasks such as speech recognition (Zerari et al. 2019), Arabic neural machine translation (Almahairi et al. 2016), Arabic named entity recognition (Gridach & Haddad 2017), and Arabic diacritization (Moumen et al. 2018).
It was concluded that complex tasks, such as recognizing diacritical image texts (Quranic text) at the word or line level, have not received much attention, which points to future research directions in this area.

DISCUSSION
We found that complex tasks, such as recognizing diacritical image texts (Quranic text) at the word or line level in Arabic OCR, have not received much attention. This can be a direction for future work in preparing Arabic datasets. Furthermore, to the best of our knowledge, there are no publicly available diacritical line datasets or Quranic image datasets for text recognition purposes. Only the QTID dataset deals with the Quranic text; however, it is not available to researchers. In addition, it is synthetically generated at the word level.
This work has described Arabic OCR datasets covering various types of data, such as handwritten text, printed text, and embedded text. The datasets presented have considerable potential for fully automated OCR using machine learning and deep learning approaches. We found that the RNN has become the state-of-the-art system in the text recognition domain. A comparison of performance revealed that LSTM was faster at learning and converging than MDLSTM. A few researchers have implemented GRU in OCR for license plates and CAPTCHA, and many researchers use GRU in the Arabic domain for other tasks such as speech recognition, Arabic neural machine translation, Arabic named entity recognition, and Arabic diacritization. Meanwhile, complex tasks such as recognizing diacritical image texts (Quranic text) at the word or line level have not received much attention.

SUMMARY
We have highlighted the different approaches to Arabic OCR systems. The different types of OCR dataset, which reflect the type of OCR system (handwritten text, printed text, and embedded text), have been discussed, as have the different levels of segmentation: character, sub-word, word, text-line, and paragraph. The general techniques used for feature extraction have also been reviewed. In addition, recognition techniques and architectures such as MDLSTM, BLSTM, CNN, and HMM have been explained in this study. Finally, recent related studies in the area of Arabic OCR have been reviewed.