A vast dataset for Kurdish handwritten digits and isolated characters recognition

This article presents two massive datasets for central Kurdish handwriting digits and isolated characters named K-ZHMARA and K-PIT. The first dataset, named K-ZHMARA dataset, contains 70,000 images of Kurdish digits, 7000 images for each digit, and a printed A4 paper with a grid of 10 × 10 is used for data collection. Apart from digits, the K-PIT dataset includes 245,000 images of all Kurdish characters, 7000 images for each character; data was collected via a printed A4 paper with a grid of 12 × 10 for this dataset. Moreover, both datasets include 315,000 images. Python programming has been used to scan each piece of paper, segment, crop, resize, binarize, and invert the images via edge detection and image processing techniques.


Subject
Pattern Recognition, Computer Vision, and Deep Learning Specific subject area A vast dataset for Kurdish digits and isolated characters recognition Type of data Image How data were acquired The collected handwriting images were captured using a scanner and then segmented, cropped, resized, binarized, inversed, and annotated.

Data format
Raw data Segmented data Annotations Jpg format Description of data collection Each handwritten digit and character were written on an empty printed grid paper to facilitate the segmentation process.Writers were advised to write the digits and characters in the right boxes according to its label.Data collection forms were collected from more than 1500 participants.The handwritten characters and digits are segmented using bounding boxes.Each of the bounding boxes contains the characters that are written.Data source location Halabja, Kurdistan Region, Iraq Data accessibility A vast dataset for Kurdish digits and isolated characters recognition [1] .Data identification number: 10.17632/zb66pp7vjh.1 Direct URL to data: https://dx.doi.org/10.17632/zb66pp7vjh.1

Value of the Data
• In the study of pattern recognition and image processing, handwriting recognition is regarded as an exciting and motivating problem.It can be used in a variety of ways, such as applications to learn the characters and digits for children, reading assistance for the blind, computerized reading, processing for paper documents, and turning any handwritten material into structural text.• The datasets can be used for handwriting optical character/digit recognition and identification using machine learning and deep learning models.• Both datasets are ready to implement since they are pre-processed (including suspending excess lighting and noises, segmentation, cropping, resizing, binarization image, and inverse images) for each character and digit.• Deep learning and machine learning researchers are interested in central Kurdish or other languages with similar scripts, such as Persian, Arabic, and Urdu.• The datasets can be used as a standard for usability and quality in subsequent works because they were collected in a precise way, and it is vast data that can achieve higher accuracy in designed models.

Objective
OCR aims to modify or convert any type of text or text-containing document, including handwritten, printed, or scanned text images, into a digital format that may be edited and used for more in-depth processing.OCR allows a machine to recognize text in such materials automatically.A few significant obstacles must be identified and overcome to automate successfully, for instance, the existence of a huge and reliable dataset.
There has not been much research done on automatically recognizing Kurdish handwritten characters and digits since machine and deep learning models need huge datasets to achieve high accuracy; the aim of this work is to prepare two huge datasets for the Kurdish language named K-PIT (for Kurdish characters) and K-ZHMARA (for Kurdish digits), these datasets can be used to build a model for handwriting optical character/digit recognition and identification via deep learning and machine learning approaches.

Data Description
Kurdish language dialects are used across four main nation-states in the Middle East [2] , and only one dialect, Sorani, has official status in one of these nation-states.The majority of Kurdishspeaking regions are located in Turkey, Iraq, Iran, and Syria.More than 40 million people speak Kurdish as a whole, according to estimates [ 3 , 4 ].One of the two main dialects of Kurdish, known as Central Kurdish (Sorani), is spoken by an estimated 9 to 10 million people [5] .It is mostly written with a 35-character modified Arabic/Persian alphabet without characters that have recently been replaced, such as ( ‫ك‬ ), which is no longer used by the Kurdish language and has been replaced with ( ‫)ک‬ [ 6 , 7 ].A large database of isolated handwritten Central Kurdish digit and character images has been developed in this effort, tot aling 315,0 0 0 images, with 70 0 0 images of each handwritten by more than 1500 native individuals.Table 1 shows the number of images and the percentage of each character in the K-PIT database.The Quantity and Proportion of Digits Obtained for the K-ZHMARA Dataset are shown in Table 2 .Central Kurdish uses modified Arabic/Persian (Farsi) characters for writing, and there are numerous expansive databases of Persian and Arabic handwriting characters for recognition of offline characters; some databases even assert that their database can be used to recognize other languages that use the Arabic scripts, for instance, Kurdish [8][9][10] .Nevertheless, there are three primary issues.The first is that it does not include all of the Kurdish letters, such as V( ‫,)ڤ‬ L ( ‫,)ڵ‬ J( ‫,)ژ‬ R( ‫,)ڕ‬ and O ( ‫.)ۆ‬The Kurdish language has an inconsistent quantity and percentage of characters, which is the second issue.The third problem is all the datasets worked with the characters only and ignored the digits.

Experimental Design, Materials and Methods
The data collection methodology for preparing a handwritten image dataset includes several phases, such as gathering handwritten data from participants via designed forms which are labelled to indicate the type and position of the character/digit.The second step is a scanner device that scans the collected data in forms.Next, the scanned forms are processed and segmented in order to extract each digit or character as a separate image.Then some pre-processing techniques are applied to the images in order to achieve higher accuracy percentage when the dataset is used for machine learning models such as binarization and inverting the images.The last step is labeling; each similar digit or character is stored in a specific folder for both test and train directories with unique IDs.

Data Collection
The first step in creating a database is often locating an appropriate data source.Here, gathering examples of handwritten Kurdish numbers and characters from several different writers is the main objective.This task can be achieved by designing several suitable forms for Kurdish digits and characters.Fig. 1 demonstrates a form designed to collect Kurdish digits for the K-ZHMARA dataset, and it contains 100 empty boxes; each line is labeled with a specific number, and the participants have to fill the forms according to the labels.
Each digit is to be written ten times by the writers in each empty line.The number of individuals who participated in building the K-ZHMARA dataset was approximately 700.Similarly, Fig. 2 is another example designed to collect the Kurdish characters for the K-PIT dataset, and it contains 120 empty boxes; each line is labeled with a specific character.Since the Kurdish language (Sorani) has 35 characters without characters that have recently been replaced, we designed three forms, two forms with 12 characters and the last form with 11 characters.Each participant was asked to fill out all three forms.Each character is to be written ten times by the writers in each empty line.The number of individuals who participated in building the K-ZHMARA and K-PIT dataset is more than 1500.The total number of the forms used to collect  Kurdish digits was 700 and for the Kurdish characters was 2100, meaning that there were 2400 forms filled out by the volunteers who built the datasets.Several places were selected to fill the forms, students from 5 different colleges of the University of Halabja, the university students who stay in the dormitory of the University of Halabja, and several primary and preparatory schools in Halabja governate; as a result, each character have 700 distinct images.

Processing of Forms
The forms were filled out by the participants and scanned via a scanner.The scanner may produce files in the following formats: pdf, jpeg, png, or tiff.The scanner uses 75 to 600 dpi to scan documents.The png format was chosen for the initial scan of the forms since JPEGs contain less data than PNGs.Due to the forms being white, all forms were filled out using a black or dark blue pen.Fig. 3 displays a sample of a scanned page.

Image Pre-Processing
Pre-processing stage is a crucial step in every recognition system.It is used to enhance the quality of pictures.First, the big square border has been detected using Python programming language and 4 point detection algorithm to correct the position and angle of the forms since some forms at the time of scanning may be scanned with an incorrect angle.Fig. 4 Demonstrates a form after applying the 4-point detection algorithm.

Image Segmentation and Cropping
Each form page was subjected to the cropping procedure after the pre-processing stage in order to crop each letter block.Each square around the letters and digits has been detected one by one from the first row (left to right).Once all the squares from the first row are detected, and then the program detects the first square from the second row, this procedure will continue until detecting all the squares; this process was done via Python programming and edge detection algorithms, as demonstrated in Fig. 5 .The template had different resolutions, and it has 4 different forms, 1 form to collect the digits, which is divided into 10 rows and 10 columns, and 3 forms to collect the characters, 2 of them divided into 12 rows and 10 columns, and the last one divided into 11 rows and 10 columns because the number of Kurdish characters is 35.  100 distinct single-digit picture were created from a page with (146 × 146) pixels when the K ZHMARA dataset template was saved.
When the templates of the K-PIT dataset were saved, totally generated 350 separate images, 10 images of every single character (the first form, which was labeled with 12 characters, generated 120 images; similarly, the second form generated 120 images, but the last form which labeled with 11 characters generate 110 images) from the page with the (146 × 146) pixels, the images after cropping, resizing, and the saving process is shown in Fig. 6 .Then each line with a specific character/digit is saved in separate folders as a final step and achieves better results with the machine and deep learning models, and all the cropped images are inversed and binarized, as illustrated in Fig. 7 .
Each letter or digit was cropped as a separate image during cropping and then saved in a separate folder with its own ID.The images have the same size within the entire dataset.Due to each digit and character being written 10 times by 700 writers, who each wrote once, there were 70 0 0 images produced for each digit and character.

Labeling and Organizing
Labeling is the last step after pre-processing the dataset.Each image is labeled with an ID number, as shown in Table 3 and Table 4 ; the number of the folder in each dataset represents a single digit or character.For example, folder number 02 in the K-PIT dataset is the id of the letter, which in this case is Alef ( ‫ا‬ ), and folder number 03 in the K-ZHMARA dataset is the id of the digit, which in this case is three ( ٣ ).Each digit and character was stored in a folder with its ID as the name of that folder, with each folder containing 60 0 0 images of that letter/digit for the training and 10 0 0 images for the testing.

Ethics Statement
After submitting this project to the Institutional Review Board (IRB) of "The University of Halabja."They accepted this project by a reference protocol (1/8/205) on 1/10/2022.
With the approval of the people who had taken part in the writing, all the handwriting was collected.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.

Data Availability
A Vast Dataset for Kurdish Digits and Isolated Characters Recognition (Original data) (Mendeley Data).

Fig. 5 .
Fig. 5.A filled form with all squares around the characters detected by edge detection methods.

Table 1
Quantity and proportion of characters obtained for the K-PIT dataset.
( continued on next page )

Table 2 Quantity
and proportion of digits obtained for the K-ZHMARA dataset.