Tamil handwritten palm leaf manuscript dataset (THPLMD)

Most palm leaf manuscripts survive only in deteriorated condition, with cracks, discoloration, damage from moisture and humidity, and insect bites, which makes them challenging research material. We captured 262 samples of deteriorated Tamil palm leaves from 'Naladiyar' (27), 'Tholkappiyam' (221), and 'Thirikadugam' (14), works that deal with human health, discipline, and authoritative Tamil grammar. We contribute a high-quality raw dataset captured with a Nikon camera, pre-enhance the samples with image editing software, and apply the Otsu threshold to produce binarized ground truth images. Preparing such ground truth is highly time-consuming, so making it readily accessible supports research in machine, deep, and transfer learning, AI, and ANN.


A. Methodological Contribution:
• The dataset covers a variety of linguistic units, including vowels, consonants, and compound characters, which facilitates text mining. It is relevant to machine learning, deep learning, and transfer learning, and to computer vision and image processing within artificial intelligence and artificial neural networks.
• The dataset comprises samples from three distinct categories of writers and was used as a testing dataset for trained models, facilitating feature extraction and analysis in Tamil handwritten character recognition. Ready-made ground truth datasets are hard to obtain because producing them demands significant time and resources from the research community, and this scarcity hinders the advancement of learning models.

B. Benefit of data:
The ancient Tamil palm leaf manuscripts of Naladiyar, Tholkappiyam, and Thirikadugam, shown in Fig. 1(a) to Fig. 1(c), contain useful information that can benefit people.
• Naladiyar (Four Hundred Quatrains) [7]: It was composed by Jain monks and deals with human morals and ethics, praising righteous conduct and highlighting the value of living a moral life, effective wealth management, and enjoyment.
• Tholkappiyam (Ancient Tamil Grammar) [8]: It was written by Tholkappiyar and is an authoritative text on Tamil grammar and literary topics, covering orthography, semantics, prosody, phonology, and morphology.
• Thirikadugam (The Three Special Stimulants) [9]: It was written by Nallathanar and teaches secular ethics through an analogy to the traditional herbal medicine that treats stomach ailments with three herbs: sukku (dried ginger), milaku (pepper), and thippili (long pepper).

C. Reuse of data:
• The ancient Tamil palm leaf dataset was collected and photographed using a Nikon D7200 DSLR camera. This resulted in a high-quality, standardized dataset from which the final binarized ground truth dataset is produced efficiently. The ground truth can be used for character-level modifications and is free of noise and degradation, providing a valuable resource for society by making the text visually perceptible and easily readable.

Objective
The transmission and preservation of writing in various cultures have been achieved with a variety of technologies, and palm-leaf manuscripts are considered one of the oldest and most widespread of these. In India, particularly in the southern region, they are the most ancient form of written communication. The knowledge preserved in palm-leaf manuscripts remains highly valuable even for the current young generation. Manuscripts vary in size across localities, with an average width of 4 centimeters and length of 48 centimeters, while also measuring more than 40 centimeters in thickness. The narayam was the primary tool used to scribe on palm-leaf manuscripts, called Thaliyola. Palm leaf also served as the primary writing and drawing medium in countries across South and Southeast Asia, including Nepal, Sri Lanka, Burma, Thailand, Indonesia, and Cambodia. Manuscripts contain a wide range of information, including details about astrology, astronomy, and traditional medicines [1].
Documents and manuscripts that record cultural, historical, and medical information about our rich and ancient culture cover a wide range of topics. The majority of them are written on palm leaves, which are susceptible to damage from handling, moisture, and fungus growth [2]. Due to the tropical climate of the area, the earlier palm leaf manuscripts have been completely destroyed. Climate, pollution, and biological factors cause manuscripts to deteriorate: heat, moisture, humidity, fungi, insects such as silverfish and cockroaches, rodent activity, seepage of ink, smearing along cracks, dirt, and discoloration [3].
The primary objective of the proposed work is to remove the decay that makes portions of the manuscript text hard to understand. A few degraded samples from 'Naladiyar', 'Tholkappiyam', and 'Thirikadugam' are presented in Fig. 2. These works comprise authoritative text on Tamil grammar and medicine and promote human health and discipline.
A binarization method is needed for this dataset to identify and remove potential anomalies before data processing, which reduces the probability that they affect later post-processing steps. A ground truth dataset containing accurate information will therefore help improve the performance and evaluation of machine, deep, and transfer learning models, as well as artificial intelligence and artificial neural networks, in the future.

Data Description
High-quality raw captures were used to ensure the best results in the final stage of ground truth binarization. Specifically, the degraded Tamil palm leaf manuscripts of Naladiyar, Tholkappiyam, and Thirikadugam were selected for their methodological advantages and potential for reuse, and are available in the repository. Photoshop was used to edit the dataset images, performing tasks such as cropping, resizing, and image correction. The Otsu threshold algorithm was then applied to generate the ground truth images. Capturing a high-quality, standardized dataset with a Nikon D7200 DSLR camera makes the production of the binarized ground truth efficient. The proposed dataset comprises binarized ground truth without degradation or noise, which helps the general public understand the text. Such datasets are highly significant to the research community, specifically for character-level modifications and for the development of segmentation, enhancement, and feature extraction algorithms.
The dataset of degraded Tamil palm leaf manuscripts is described in Table 1. It is made up of 27 samples from Naladiyar, 221 samples from Tholkappiyam, and 14 samples from Thirikadugam, sourced from the Dr. U. Ve. Swaminatha Iyer Library in Tamilnadu. In total, the dataset contains 262 unprocessed images, and 199 corresponding images with ground truth data are available at doi: 10.17632/xz9rx5wfc5.1.
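The per-manuscript counts quoted above can be checked with a few lines of arithmetic. The sketch below only restates the figures given in the text; the variable names are illustrative and are not part of the published repository.

```python
# Sample counts as stated in the text (see Table 1).
raw_counts = {"Naladiyar": 27, "Tholkappiyam": 221, "Thirikadugam": 14}

total_raw = sum(raw_counts.values())   # total unprocessed images
ground_truth_images = 199              # images with ground truth, as stated in the text

# Share of each manuscript in the raw collection.
shares = {name: count / total_raw for name, count in raw_counts.items()}

# Fraction of raw images that have a binarized ground truth counterpart.
ground_truth_coverage = ground_truth_images / total_raw

for name, share in shares.items():
    print(f"{name}: {raw_counts[name]} images ({share:.1%})")
print(f"Total raw images: {total_raw}")
print(f"Ground truth coverage: {ground_truth_coverage:.1%}")
```

The tally shows that the collection is strongly imbalanced: roughly 84% of the raw images come from Tholkappiyam, and about 76% of the raw images have a ground truth counterpart. Anyone using the data for training or evaluation may want to account for this.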

Dataset Accumulation
A total of 262 samples of degraded Tamil palm leaf manuscripts were collected, showing damage from rodent activity, humidity, and cracks. The samples were gathered from the Dr. U. Ve. Swaminatha Iyer Library, Chennai, Tamilnadu, India: 27 samples from 'Naladiyar', 221 samples from 'Tholkappiyam', and 14 samples from 'Thirikadugam', as listed in Table 1. These manuscripts serve multiple purposes, including providing details on human well-being, medicine, and authoritative literature on Tamil linguistic structures. The data repository containing all unprocessed raw samples and binarized datasets can be accessed online at doi: 10.17632/xz9rx5wfc5.1.

Dataset Acquisition
A plain background plays an important role in capturing the raw degraded palm leaf dataset successfully. A white sheet was placed on the table, and 5 layers of palm leaves were handled carefully and arranged on it horizontally from bottom to top. The leaves were then photographed with a Nikon D7200 DSLR camera to improve image quality, with the leaves in full focus, a suitable camera angle, the correct orientation, and a proper lighting setup. About 262 samples were collected in this way. The capture setup is shown in Fig. 3.

Dataset Preprocessing
After capture, the raw palm leaf images were imported into a digital system with an acquisition tool, arranged sequentially, and visually inspected to confirm that they were clear and unblurred. The dataset was then segregated into three folders. The proposed method focuses on removing the deterioration that obscures sections of text. Photoshop was used for basic editing tasks such as cropping, adjusting exposure, white balance, and contrast, and removing blemishes. White balance (WB) is the process of removing unrealistic colour casts so that areas that should appear neutral are rendered white in the image.
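The white balance and contrast corrections described above were performed manually in Photoshop. As a rough programmatic analogue, the sketch below applies a gray-world white balance and a percentile contrast stretch with NumPy. Both techniques are assumptions chosen for illustration, not the authors' procedure, and the function names `gray_world_white_balance` and `stretch_contrast` are hypothetical.

```python
import numpy as np


def gray_world_white_balance(rgb: np.ndarray) -> np.ndarray:
    """Remove a global colour cast by equalizing the channel means.

    Assumption: the gray-world heuristic (the average colour of the scene
    is neutral). This stands in for the manual Photoshop adjustment.
    `rgb` is an H x W x 3 uint8 image.
    """
    img = rgb.astype(np.float64)
    channel_means = img.reshape(-1, 3).mean(axis=0)
    target = channel_means.mean()
    # Per-channel gain that pulls each channel mean to the common target.
    gains = target / np.maximum(channel_means, 1e-6)
    return np.clip(img * gains, 0, 255).round().astype(np.uint8)


def stretch_contrast(rgb: np.ndarray, low_pct: float = 1.0,
                     high_pct: float = 99.0) -> np.ndarray:
    """Linearly stretch intensities between two percentiles to 0..255."""
    img = rgb.astype(np.float64)
    lo, hi = np.percentile(img, [low_pct, high_pct])
    if hi <= lo:
        # Flat image: nothing to stretch.
        return rgb.copy()
    out = (img - lo) / (hi - lo) * 255.0
    return np.clip(out, 0, 255).round().astype(np.uint8)


if __name__ == "__main__":
    # Synthetic yellowish patch standing in for a photographed leaf.
    leaf = np.zeros((8, 8, 3), dtype=np.uint8)
    leaf[..., 0], leaf[..., 1], leaf[..., 2] = 200, 180, 100
    balanced = gray_world_white_balance(leaf)
    print(balanced.reshape(-1, 3).mean(axis=0))
```

Gray-world balancing can over-correct images that are legitimately dominated by one colour, as aged palm leaves often are, so a manual correction against the white background sheet may remain preferable in practice.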
Luminance and colour intensity were adjusted to increase the brightness of the image, a specific area was selected and feathered, and curve adjustments were then applied to balance the colours and correct the images. To create a ground truth image, the Otsu threshold segmentation algorithm [5] is then used. Because the ground truth images are generated by the Otsu threshold algorithm, the selection of an appropriate threshold value is of utmost significance in this process [4]. The Otsu method fixes an intensity value (t) that separates pixels into foreground (fg) and background (bg). It iterates over the candidate values of t to find the one that minimizes the weighted within-class variance of the fg and bg pixels. Pixels below the fixed value are treated as foreground and pixels above it as background. The weighted within-class variance is given by Eq. (1) [6]:

$$\sigma_w^2(t) = \Omega_{fg}(t)\,\sigma_{fg}^2(t) + \Omega_{bg}(t)\,\sigma_{bg}^2(t) \qquad (1)$$

where Ω_fg and Ω_bg are the probabilities of foreground and background pixels at threshold t, and σ² represents the variance of the colour values. The key idea here is

Fig. 4. (a)-(f) Binarized ground truth images for the corresponding raw dataset images. (a) Original image of Naladiyar. (b) Binarized ground truth image for the original image of Naladiyar in Fig. 4(a). (c) Original image of Tholkappiyam. (d) Binarized ground truth image for the original image of Tholkappiyam in Fig. 4(c). (e) Original image of Thirikadugam. (f) Binarized ground truth image for the original image of Thirikadugam in Fig. 4(e).
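The Otsu thresholding step described under Dataset Preprocessing, which produces binarized images like those in Fig. 4, can be sketched in NumPy as follows. This is a minimal illustration that assumes an 8-bit grayscale input and an exhaustive search over all thresholds. It is not the authors' implementation, and the function names `otsu_threshold` and `binarize` are hypothetical.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the threshold t minimizing the weighted within-class variance.

    `gray` is assumed to be a uint8 grayscale image. Pixels with
    intensity < t form the foreground class, the rest the background.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256, dtype=np.float64)

    best_t, best_var = 0, np.inf
    for t in range(1, 256):
        w_fg = prob[:t].sum()   # probability of foreground at threshold t
        w_bg = prob[t:].sum()   # probability of background at threshold t
        if w_fg == 0.0 or w_bg == 0.0:
            continue            # one class is empty, variance undefined
        mu_fg = (levels[:t] * prob[:t]).sum() / w_fg
        mu_bg = (levels[t:] * prob[t:]).sum() / w_bg
        var_fg = (((levels[:t] - mu_fg) ** 2) * prob[:t]).sum() / w_fg
        var_bg = (((levels[t:] - mu_bg) ** 2) * prob[t:]).sum() / w_bg
        within = w_fg * var_fg + w_bg * var_bg   # weighted within-class variance
        if within < best_var:
            best_var, best_t = within, t
    return best_t


def binarize(gray: np.ndarray) -> np.ndarray:
    """Map foreground (darker than t) to 0 and background to 255."""
    t = otsu_threshold(gray)
    return np.where(gray < t, 0, 255).astype(np.uint8)


if __name__ == "__main__":
    # Synthetic leaf: light background with a dark stroke of "ink".
    page = np.full((32, 32), 200, dtype=np.uint8)
    page[10:14, 4:28] = 40
    print("threshold:", otsu_threshold(page))
    print("ink pixels:", int((binarize(page) == 0).sum()))
```

The exhaustive search keeps the code close to the definition of the weighted variance. Library routines typically maximize the equivalent between-class variance with cumulative sums, which gives the same threshold at lower cost.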

Table 1
Tamil palm leaf dataset collections.