BanglaWriting: A multi-purpose offline Bangla handwriting dataset

This article presents BanglaWriting, a Bangla handwriting dataset that contains single-page handwriting samples from 260 individuals of different personalities and ages. Each page includes bounding-boxes that enclose each word, along with the Unicode representation of the writing. The dataset contains 21,234 words and 32,787 characters in total, including 5,470 unique words of the Bangla vocabulary. Apart from the usual words, the dataset comprises 261 comprehensible overwritings and 450 handwritten strikes and mistakes. All of the bounding-boxes and word labels were generated manually. The dataset can be used for complex optical character/word recognition, writer identification, handwritten word segmentation, and word generation. Furthermore, it is suitable for studying age-based and gender-based variation in handwriting.


Parameters for data collection
Scanner: HP Scanjet 2400
Smartphone camera: Xiaomi Redmi 6, Xiaomi Redmi 7
A single image contains the handwriting of an individual. Each individual is identified using age, gender, and a unique person id. The handwritten words are segmented using bounding-boxes. Each of the bounding-boxes contains the characters that are written. Labelme [1] software is used to draw and label the bounding-boxes.

Description of data collection
The writing was done using regular stationery products. Writers were advised to write on a random topic. Only one page of writing was collected from each individual. The handwritten pages were then captured using scanners and smartphone cameras. Each captured image was cropped and annotated manually.

Value of the Data
• The dataset explores the possibilities and uses of handwriting from scanned and photographed documents. The use of scanned and photographed pages in recognition and identification is often termed an offline approach.
• The dataset is suitable for machine learning [2] models, deep learning [3] models, producing embedding vectors [4] of handwriting, etc.
• The dataset covers a broad range of uses of Bangla handwriting [5]. It contains bounding-box annotations for each handwritten word, a Unicode representation of each written word, and writer information for each document. Therefore, the dataset is suitable for word segmentation, optical character recognition, writer identification, writer verification, and handwriting generation.
• The dataset contains raw images (without any pre-processing) of each document. It also contains supplementary pre-processing scripts to suppress excess lighting and noise.
• The dataset can be used to explore writing patterns related to age and gender.

Data Description
BanglaWriting, the dataset presented in this paper, aims to provide a handwriting dataset that is enriched along several dimensions. The dataset can be used in diverse machine learning and deep learning based applications. It can be applied to handwriting biometric tasks, including identification, verification, and age/gender estimation. Further, the dataset can support specific computer vision tasks such as optical character recognition and handwriting segmentation. Moreover, the dataset can be used to train generative handwriting models. Fig. 1 illustrates the possible domains to which the dataset can contribute. This dataset's construction and usage are different from usual Bangla datasets [6]. The currently available datasets for Bangla writing only include isolated character writings, whereas the BanglaWriting dataset contains word-based writing with bounding-boxes. The dataset is modeled on well-known offline handwriting and writer recognition datasets [5]. Table 1 presents a comparison of the BanglaWriting dataset with some of the popular datasets of diverse languages. Most of the bigger datasets (such as KHATT [7], IAM [5]) include some automated and pre-estimated parameters to label the data. In comparison, the annotations and labels of the BanglaWriting dataset are manually determined. Overall, the BanglaWriting dataset provides a modest amount of high-quality data.
The BanglaWriting dataset contains single-page handwriting samples of 260 individuals from eight different districts (illustrated in Table 4). It consists of 5,470 unique words and 124 unique characters. Moreover, the overall dataset comprises 21,234 words and 32,787 characters in total. The dataset contains Bangla characters, numerals, diacritics, and conjuncts. Furthermore, it has punctuation marks and English letters mixed with Bangla writing. Table 2 illustrates the Bangla characters that exist in the dataset. For better understanding, Fig. 2 explains the underlying construction of a Bangla word. Fig. 3 illustrates a sample of the BanglaWriting dataset, with its bounding-boxes and labels.
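As a rough illustration of how totals of this kind could be recomputed from the word labels, the following Python sketch counts words and characters. It is not the authors' procedure: the function name `vocabulary_stats` is hypothetical, characters are counted as Unicode code points (the paper does not state the counting unit), and stripping the asterisk marker before counting is an assumption.

```python
from collections import Counter


def vocabulary_stats(labels):
    """Return word and character totals for a list of word labels.

    Assumptions (not from the paper): characters are counted as Unicode
    code points, and the '*' marker is stripped before counting.
    """
    words = [label.replace("*", "").strip() for label in labels]
    words = [word for word in words if word]  # drop strike-only labels
    char_counts = Counter(ch for word in words for ch in word)
    return {
        "total_words": len(words),
        "unique_words": len(set(words)),
        "total_characters": sum(char_counts.values()),
        "unique_characters": len(char_counts),
    }


# Hypothetical labels: two Bangla words (one repeated) and one strike.
sample_labels = [
    "\u0986\u09ae\u09bf",              # "আমি", 3 code points
    "\u09ac\u09be\u0982\u09b2\u09be",  # "বাংলা", 5 code points
    "\u0986\u09ae\u09bf",
    "*",
]
stats = vocabulary_stats(sample_labels)
print(stats)
```

Counting graphemes instead of code points would give different character totals, so the unit should be fixed before comparing against the figures reported here.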
The dataset is presented in two different versions: i) raw and ii) converted. The raw file contains raw images that were manually cropped, with no image-processing techniques applied. Hence, the raw dataset includes diverse color shifts and shadowing effects. In contrast, the converted file contains a cleaned version of the raw images (discussed in Section 2.5). Fig. 6 illustrates the difference between the raw and converted dataset images. Further, Fig. 4 shows the directory structure for both dataset versions. For every image, a JSON file with the same naming convention is also included. The JSON file contains the word-level bounding-box information and the label for each bounding-box. The JSON format is illustrated in Fig. 9 and is further elaborated in Section 2.4.
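Because each image and its JSON file share the same name, the two can be paired by file stem. The sketch below shows one way to do this on a plain list of file names. The function name `pair_images_with_annotations`, the set of image extensions, and the example file names are assumptions made for illustration and are not taken from the dataset.

```python
from pathlib import PurePosixPath

# Assumed image extensions; the paper does not state the image format.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pair_images_with_annotations(filenames):
    """Pair each image name with the JSON name that has the same stem."""
    paths = [PurePosixPath(name) for name in filenames]
    json_by_stem = {
        p.stem: p.name for p in paths if p.suffix.lower() == ".json"
    }
    pairs = []
    for p in sorted(paths):
        if p.suffix.lower() in IMAGE_SUFFIXES and p.stem in json_by_stem:
            pairs.append((p.name, json_by_stem[p.stem]))
    return pairs


# Hypothetical directory listing.
listing = ["1_20_M.jpg", "1_20_M.json", "2_35_F.jpg", "notes.txt"]
pairs = pair_images_with_annotations(listing)
print(pairs)
```

Images without a matching JSON file are skipped rather than reported, which keeps the sketch short; a real loader might prefer to raise an error instead.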
The label for each word-level bounding-box represents the written word in Unicode format. Three possible classes/label formats are maintained, which are presented below.
1. Clear writing: The bounding-box contains the word that the writer intended to write, and the word is understandable. In this case, we label the bounding-box with the Unicode value of the written word.
2. Overwriting: The bounding-box contains the written word, but some of the characters have been struck out. Writers often strike out a character to indicate that it should be excluded. In such a case, we label the comprehensible characters with the proper Unicode values, omit the struck-out characters from the label, and add an asterisk ('*') to the Unicode label to mark the issue.
3. Strikes and mistakes: The dataset contains some random strikes (such as word underlines and rules) and fully struck-out words. We do not include any Unicode text in such cases, and we label them using only an asterisk ('*').
Fig. 5 further illustrates some examples of the labels mentioned above. Moreover, Table 3 presents the quantitative distribution of each class in the dataset.
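The three label formats can be told apart from the label text alone. The following is a minimal sketch of such a check, assuming only what is described above: a label without an asterisk is clear writing, a label that is nothing but asterisks is a strike or mistake, and a label that mixes text with an asterisk is overwriting. The function name `classify_label` and the class names it returns are hypothetical, and the position of the asterisk inside an overwriting label is not assumed.

```python
def classify_label(label):
    """Map a word label to one of the three annotation classes."""
    text = label.strip()
    if not text:
        raise ValueError("empty label")
    if "*" not in text:
        return "clear"
    if text.replace("*", "").strip() == "":
        return "strike_or_mistake"
    return "overwriting"


# Hypothetical labels for illustration.
for example in ["\u0986\u09ae\u09bf", "\u0986\u09ae\u09bf*", "*"]:
    print(repr(example), "->", classify_label(example))
```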
The dataset also includes a supplementary script [13] used to produce the cleaned images of the 'converted' version of the data. The script is used to reduce the noise and lighting variations of the 'raw' data images.

2 Experimental Design, Materials and Methods

Data Collection
The dataset was collected from the students of Bangladesh University of Business and Technology. Furthermore, to obtain a better age distribution, the students' household members were also included. Fig. 7 illustrates the age and gender distribution of the population. The writers were selected based on two primary clinical constraints: a) the minimum age of a writer is 8, and b) the writer should be physically fit to write.
The writers wrote on A4-sized paper, using regular ball-point and gel pens. Each individual was asked to write on any topic. Therefore, the number of words varies from document to document. Fig. 8 presents the word distribution per document. Moreover, allowing writers to write on random topics also resulted in mistakes and overwriting, which are also labeled.
The writers are from eight different districts of Bangladesh. We define a writer as belonging to a particular district if he/she stayed in the district for more than ten years. Table 4 presents the quantitative distribution of the geographical location of the writers.

Data Extraction
The handwritten pages were then imaged using a scanner and smartphone cameras. The dataset contains a total of 52 scanned images and 208 images captured using smartphone cameras. The scanned images are free of noise. In contrast, the images captured using smartphone cameras contain noise due to environmental factors, such as varied lighting, flash glare, and shadows.

Data Preprocessing
Each image was cropped and straightened manually. The images were named using the pattern personIdentifier_age_gender. To preserve the dataset's authenticity and quality, no augmentation was applied to increase its size.
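The naming pattern allows the writer information to be recovered from a file name. The sketch below splits a name into its three fields. The helper `parse_writer_filename`, the `WriterInfo` type, the example name `17_23_M.jpg`, and the assumptions that the age is an integer and that the fields contain no underscores are all illustrative and should be checked against the actual files.

```python
from pathlib import PurePath
from typing import NamedTuple


class WriterInfo(NamedTuple):
    person_id: str
    age: int
    gender: str


def parse_writer_filename(filename):
    """Split 'personIdentifier_age_gender.<ext>' into its three fields."""
    stem = PurePath(filename).stem
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected file name: {filename!r}")
    person_id, age, gender = parts
    return WriterInfo(person_id, int(age), gender)


# Hypothetical file name following the stated pattern.
info = parse_writer_filename("17_23_M.jpg")
print(info)
```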

Data Labeling
The dataset was manually annotated using labelme [1] software. Fig. 3 illustrates the word-based bounding-boxes and the Unicode text labels for each bounding-box. The figure also demonstrates the annotation policy adopted for overwriting and cropped words/characters. Table 3 illustrates the labeling policy adopted for the three different labels/classes of the word-based bounding-boxes. A JSON file was generated for each image. The "shapes" property contains an array of "label" and "points" parameter pairs. The "label" parameter contains the written word (in UTF-8) in the bounding-box, whereas the "points" parameter contains an array of the starting and ending pixel coordinates of the bounding-box. The "imagePath", "imageHeight", and "imageWidth" properties contain additional information: the filename of the corresponding image and the height and width of the image, respectively.
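A minimal sketch of reading one annotation is given below. It assumes the key names written by labelme ("shapes", "label", "points", "imagePath", "imageHeight", "imageWidth") and that each bounding-box is stored as two corner points. The sample JSON content, the coordinates, and the function name `load_word_boxes` are invented for illustration and are not taken from the dataset.

```python
import json

# Invented sample that mimics the described structure.
SAMPLE_JSON = """
{
  "shapes": [
    {"label": "আমি", "points": [[10, 20], [110, 60]]},
    {"label": "*", "points": [[200.0, 58.5], [120.5, 22.0]]}
  ],
  "imagePath": "17_23_M.jpg",
  "imageHeight": 1000,
  "imageWidth": 800
}
"""


def load_word_boxes(annotation):
    """Return a list of {'label', 'box'} dicts from a parsed annotation.

    'box' is (x_min, y_min, x_max, y_max), computed with min/max so that
    the order in which the two corners were drawn does not matter.
    """
    boxes = []
    for shape in annotation.get("shapes", []):
        xs = [point[0] for point in shape["points"]]
        ys = [point[1] for point in shape["points"]]
        boxes.append(
            {"label": shape["label"], "box": (min(xs), min(ys), max(xs), max(ys))}
        )
    return boxes


annotation = json.loads(SAMPLE_JSON)
boxes = load_word_boxes(annotation)
for entry in boxes:
    print(entry["label"], entry["box"])
```

When reading the real files from disk, opening them with `encoding="utf-8"` avoids platform-dependent decoding of the Bangla labels.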

Supplementary Script
As the dataset contains raw images taken using scanners and smartphones, differences in lighting and background noise are noticeable (illustrated in Fig. 6). Hence, the dataset includes a supplementary Python [14] and OpenCV [15] based script [13] that eliminates lighting issues and reduces the background noise. The script cleans the images and generates images suitable for machine learning and deep learning strategies. The cleaned images are provided in the 'converted.zip' file, whereas 'raw.zip' contains the raw images to which no image-processing techniques were applied.

Figure 2: Graphemes are the smallest units of meaningful writing. A grapheme always contains a grapheme root. In the Bangla writing system, a grapheme may have one vowel and one consonant diacritic. Occasionally, a grapheme may include consonant conjuncts as its grapheme root. The figure is derived from [11,12].

Figure 3: The left image illustrates a handwriting image with word-level bounding-boxes. The labels/words for each bounding-box are presented on the right. The excluded word (second row, second word) is marked using an asterisk (*).

Figure 4: The figure illustrates the directory structure of the BanglaWriting data files. The 'raw.zip' contains raw images that were only labeled. The 'converted.zip' contains labels, and the images are manually processed using the additional script [13]. For every image file, there exists a JSON file with the same naming scheme. The JSON file contains the bounding-boxes and labels.

Figure 5: The figure depicts some examples of the words and labels generated for each class. The left, middle, and right columns show clear writing, overwriting, and strikes/mistakes, respectively.

Figure 7: The left graph shows the age distribution, and the right graph shows the gender distribution of the dataset.

Figure 8: The left graph illustrates the distribution of words per document. The right graph shows the same distribution without outliers. The word-count histogram approximates a normal distribution.

Table 2: The BanglaWriting dataset contains all characters of Bangla vocabulary. The table illustrates the Bangla characters that also exist in the dataset.

Table 3: The table describes the quantitative distribution of each label along with the labeling schemes.

Table 4: The table describes the quantitative distribution of the geographical location of the writers.