An extensive dataset of handwritten Central Kurdish isolated characters


Abstract
To collect handwritten samples of isolated Kurdish characters, each character was printed on an A4 sheet carrying a 14 × 9 grid. Each sheet carried only one printed character so that the volunteers knew which character to write on it. Each sheet was then scanned, sliced, and cropped with a macro in Photoshop so that the same process was applied to all characters. The grids were filled in mainly by student volunteers from multiple universities in Erbil.

Value of the Data
• The dataset is suitable for machine learning models for handwriting recognition.
• It can benefit researchers interested in applying deep learning and machine learning to the Kurdish, Persian, and Arabic languages.
• The data can serve as a starting point for research on the more complex problems of joined characters and word recognition in this language.
• Because it is highly standardized (carefully sized and formatted), it can be used as a benchmark of quality and usability for future work.

Data Description
Central Kurdish (Sorani) is one of the two main dialects of the Kurdish language. It is generally thought that Sorani is spoken by about 9 to 10 million people in Iraq and Iran [1, 2]. It is mainly written using a modified Arabic/Persian alphabet containing 34 characters, including characters that have been replaced in recent years, such as (ك), which is no longer used in Kurdish and has been replaced with (ک). In this work, a comprehensive database of isolated handwritten Central Kurdish character images has been created. It contains 40,940 images, with an average of 1170 images of each character, written by 390 native writers. Table 1 shows the number of images and the percentage of each character in the whole database. The Mendeley repository consists of a samples folder that contains samples of each character and a zip file containing the whole dataset described in this paper.
Kurdish is written with modified Arabic/Persian (Farsi) characters, and there are many comprehensive databases of Arabic and Persian handwritten characters for offline character recognition; some even claim that they can be used for recognition of other languages, such as Urdu and Kurdish [3, 4]. However, there are two main problems. First, these databases do not contain all the characters used in Kurdish, such as Re (ڕ), Ve (ڤ), Le (ڵ), and Wo (ۆ). Second, the number and proportion of the characters they contain are not consistent with those that the Kurdish language uses.
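The per-character counts and percentages of the kind reported in Table 1 can be recomputed from a local copy of the dataset. The sketch below is a minimal example that assumes the extracted archive holds one sub-folder per character with JPEG images inside; the folder layout, the `.jpg` extension, and the function names `count_images` and `class_percentages` are assumptions for illustration, not details confirmed by the repository.

```python
from pathlib import Path


def count_images(root, pattern="*.jpg"):
    """Return {folder_name: image_count} for each sub-folder of root."""
    root = Path(root)
    return {
        folder.name: sum(1 for _ in folder.glob(pattern))
        for folder in sorted(root.iterdir())
        if folder.is_dir()
    }


def class_percentages(counts):
    """Return {label: share of the whole dataset, in percent}."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero when no images were found.
        return {label: 0.0 for label in counts}
    return {label: 100.0 * n / total for label, n in counts.items()}
```

For a perfectly balanced set of 35 characters, each share would be 100/35, or about 2.86%, so deviations from that value indicate how unevenly the characters are represented.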

Data collection
Finding a suitable source of data is the first step toward building a database. Here, the main goal is to collect images of Kurdish handwritten characters written by many writers, so a form was designed for this purpose. The form is shown in Fig. 1. Each form covers one letter of the alphabet, which is printed in the top right corner, followed by 125 empty blocks. The writers were asked to write each letter three times, in three empty blocks. The total number of writers is 390.
The forms were distributed among two main categories: the academic staff of the Information Technology department at Tishk International University, and university students from the University of Kurdistan-Hawler, Salahaddin University, and Tishk International University, as shown in Table 2. There were ten sets of forms, each set with 35 forms for 35 different letters. At first, we decided that nine sets, which would give us at least 1100 images for each letter, were the best option for the time that we had. However, some problems arose during the collection process. The first prints of the forms contained errors; for instance, in Set 2 there were two forms for the letter (چ) and none for (ج). Since we printed and distributed the forms at the same time, we were not aware of this problem until the pre-processing stage. This created an inconsistency in the number of samples: for example, by the 9th set we had 504 images of the letter (ڤ), far fewer than for the other letters, which had at least 1000 images each. We therefore added a 10th set to complement the others; it contained only the letters that were missing from the first nine sets, namely (ی،ن،ل،ک،ق،ڤ،غ،ش،ژ،ز), as detailed in Table 3. In Table 3, the first column lists the letter and columns 2-11 give the number of images gathered in each set; the first row is the header and rows 2-36 correspond to the letters; the last row and the last column give the totals for each set and each letter.
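The decision to print a complementary set amounts to summing the per-set counts for each letter and flagging the letters that fall short of a target. A minimal sketch follows; the function name `letters_below_target`, the 1000-image target, and the example counts are illustrative assumptions (only the total of 504 images for Ve after nine sets comes from the text above, and its split across sets is invented for the example).

```python
def letters_below_target(per_set_counts, target=1000):
    """Return {letter: total} for letters whose total over all sets is below target.

    per_set_counts maps each letter to a list of image counts, one per set.
    """
    totals = {letter: sum(counts) for letter, counts in per_set_counts.items()}
    return {letter: total for letter, total in totals.items() if total < target}


# Illustrative counts only, not the values of Table 3.
example = {
    "Ve": [126, 126, 126, 126, 0, 0, 0, 0, 0],  # sums to 504
    "letter_x": [125] * 9,                       # hypothetical letter, sums to 1125
}
shortfall = letters_below_target(example)
print(shortfall)
```

Letters returned by such a check are the candidates for an additional round of forms.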

Form processing
All form pages were scanned using a high-quality scanner that supports resolutions from 300 to 1800 dpi and can output PDF, JPEG, or BMP files. A resolution of 600 dpi was used because it captures more detail than 300 dpi without producing files as large as those at 1800 dpi, and the JPEG format was chosen because its compression makes it better suited to storing more than 40 thousand images. All the letters were written with a black or dark blue pen, since the paper was white. An example of a scanned page is shown in Fig. 2.
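The trade-off between detail and file size follows from how scan resolution maps to pixel dimensions. The sketch below computes the pixel size of an A4 page (210 × 297 mm) at the three resolutions mentioned; the function name `scan_size_px` is an assumption, and the figures describe the raw pixel grid only, not the actual JPEG file sizes.

```python
MM_PER_INCH = 25.4
A4_MM = (210, 297)  # width and height of an A4 sheet in millimetres


def scan_size_px(dpi, page_mm=A4_MM):
    """Return (width, height) in pixels of a page scanned at the given dpi."""
    return tuple(round(side / MM_PER_INCH * dpi) for side in page_mm)


for dpi in (300, 600, 1800):
    width, height = scan_size_px(dpi)
    print(dpi, width, height, round(width * height / 1e6, 1), "megapixels")
```

The pixel count grows with the square of the resolution, so an 1800 dpi scan holds nine times as many pixels as a 600 dpi scan of the same page.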

Pre-processing
The pre-processing phase is important in any recognition system. Its goal is to improve the quality of the images so that proper features can be extracted later. Pre-processing was applied to each form page to enhance the images. First, the table border was removed using the Eraser Tool in Adobe Photoshop. The result of this step is shown in Fig. 3.

Cropping
After the pre-processing phase was completed, each form page was cropped into individual letter blocks. This was done by designing a template with the Slice tool in Adobe Photoshop. The template had a resolution of 6440 × 4140 pixels and divided the page into 9 rows and 14 columns. When saved, the template generated 126 separate images of single characters from the page, each of 460 × 460 pixels. The Slice tool cropping and saving process is shown in Fig. 4, while the output of this process is summarized in Fig. 5. During cropping, each letter was saved in a separate folder named with the ID of the letter. All letter images were saved at the same size. Each form holds 125 handwritten samples of a letter, and each of the 390 writers wrote a letter three times, resulting in 1050 images for each letter (Table 3).
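The slicing template can be expressed as a regular grid. The following sketch computes the crop box of every cell for a 6440 × 4140 pixel page divided into 9 rows and 14 columns. It only illustrates the geometry: the cropping itself was done with the Slice tool in Adobe Photoshop, and the function name `grid_boxes` and the row-major ordering are assumptions.

```python
def grid_boxes(page_width=6440, page_height=4140, rows=9, cols=14):
    """Return crop boxes (left, upper, right, lower) in row-major order."""
    cell_w = page_width // cols   # 6440 // 14 == 460
    cell_h = page_height // rows  # 4140 // 9 == 460
    return [
        (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
        for r in range(rows)
        for c in range(cols)
    ]


boxes = grid_boxes()
print(len(boxes), boxes[0], boxes[-1])
```

Each box follows the (left, upper, right, lower) convention accepted by Pillow's `Image.crop`, so a scripted equivalent of the template could extract a cell with `page.crop(box)` on a page opened with `Image.open`.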

Table 1
Number and percentage of the collected letters.

Table 2
Source of the data.

Table 3
Sets of data collection.