DIMPSAR: Dataset for Indian medicinal plant species analysis and recognition

Mobile-captured images of medicinal plants are widely used in research. Machine vision tasks, such as identifying plant species for intelligent imaging device applications, account for a significant part of this work. Botanists, farmers and researchers can reliably identify medicinal plants with the help of images captured using smartphones. Mobile-captured images can also be used in quality control to make sure that the right plant species are being used in pharmaceutical products. In education, pictures of medicinal plants and their uses can inform learners, medical professionals, and the general public. Images can further support research in chemistry, pharmacology, and the therapeutic potential of medicinal plants. In this paper, we contribute a dataset of Indian medicinal plant species, collected from different regions of Karnataka and Kerala. The datasets cover multiple resolutions, varying illumination, varying backgrounds, and different seasons of the year. They consist of 5900 whole-plant images of forty plant species and 6900 single-leaf images of eighty plant species, captured under real-time conditions using smartphones. The datasets would be useful to researchers developing algorithmic models based on image processing, machine learning, and deep learning, and for educating people about medicinal plants. The dataset can be accessed by anybody, without charge, at DOI:10.17632/748f8jkphb.2, 10.17632/748f8jkphb.3



Value of the Data
• Datasets of medicinal plants support scientific investigation in several fields. Exploring the appearance and features of different plant species is useful in vision-based plant identification, phytochemical analysis, and conservation efforts.
• Datasets can be used for the development of algorithmic models based on deep learning, machine learning, image analysis, and pattern recognition.
• Furthermore, in the area of object detection, the raw images can support models that address challenges such as medicinal plant species classification, leaf region segmentation, pre-processing, shadow removal, leaf count estimation, similarity modeling between plant species of the same and different classes, and occluded leaf recognition.
• Datasets can be used to create applications that educate students and spread awareness of Indian medicinal plants and their health benefits.
• The collected images can be analyzed with image processing tools to extract useful information such as plant morphology, color, and texture. This information can be used in applications such as machine learning algorithms for the diagnosis of plant diseases.
• The collected images can help in creating a comprehensive database of information on diseases, usefulness, the pesticides and fertilizers to be applied to prevent diseases, and other industrial uses.

Objective
Ayurveda is an ancient medicinal system that has been practiced in India for several thousand years to treat various diseases at lower cost and with fewer undesirable side effects than allopathic medicine. Every organ of a medicinal plant, such as the root, leaf, stem, fruit, and seed, has medicinal properties. The primary objective of creating an image dataset of Indian medicinal plant species is to promote ayurvedic medicinal practices and spread knowledge about the common medicinal plants around us. This awareness would encourage researchers, educators, and practitioners in the field of medicinal plants to address new challenges. It would also increase the harvest of medicinal plants, encouraging people to adopt better health practices and reducing health-related risks. Specifically, the datasets can benefit plant species identification with machine and deep learning, biodiversity conservation, medicinal plant research [1], phytochemical analysis, ayurvedic medicine, education and outreach, and conservation and sustainable use.
Overall, the creation of an Indian medicinal plant species dataset can contribute to the development of innovative and sustainable approaches to the use and conservation of these plants.

Data Description
This work is distinctive in that no standard plant organ image dataset for Indian medicinal plants exists in the literature, as shown in Table 1. The image acquisition process is free from the constraining factors that usually occur in the conventional image acquisition model [2]. [...] botanists, phytochemists and data sources from the Botanical Survey of India. Table 1 highlights the literature on Indian medicinal plants that uses self-built datasets, and Table 2 lists the Indian medicinal plant species contributed as part of this work [3]. Two datasets are collected, one of single leaves and one of whole plants. The whole-plant image dataset covers forty plant species, with 5200 raw image samples and 5900 image samples after augmentation [9]. The leaf image dataset covers 80 plant species with 6900 image samples. Table 3 summarizes the datasets.

Table 1. Literature on Indian medicinal plants using self-built datasets.
Reference   Dataset            Species   Images   Image characteristics
...         ...                ...       ...      High resolution, uniform illumination, plain background
[5]         Self-built         18        300      Uniform illumination, plain background
[6]         Self-built         50        1500     Plain background
[7]         Self-built         4         -        Plain background
[2]         Self-built         40        2515     Plain background
[8]         Mendeley dataset   30        1835     High resolution images

Data collection
The authors collected the dataset from various botanical gardens in and around Mysuru, Karnataka and Kasaragod, Kerala, India.

Medicinal leaf dataset
The dataset is created by first plucking more than 80 leaves of each species. The leaves are then cleaned to remove dust and other particles. The leaf images are captured in both indoor and outdoor environments, under natural and other lighting conditions [10]. The leaves are placed on various background surfaces and photographed with moderate image stabilization, using zoom for tiny leaves and autofocus for leaves of standard size. Sample leaf images are presented in Fig. 1.
The images captured in this dataset consist of image samples that include:
a. Occluded images: The leaf images that are broken or the leaf region is not complete and partially crawled.

Medicinal plant dataset
The dataset covers forty plant species, with more than 100 image samples per species. The images are captured using multiple mobile phones with different resolutions and other specifications [11]. The datasets can be used to address real-time plant species recognition and analysis challenges. Fig. 2 shows sample images from the plant dataset, acquired under various conditions and image acquisition factors.

Data nomenclature
The two contributed datasets are the medicinal leaf dataset and the medicinal plant dataset. Both are captured using smartphone cameras under the real-time conditions described above, including different illuminations, occluded leaves, shadows, and varying backgrounds. The medicinal plant dataset is captured directly from farms under the specified image acquisition conditions at 5 MP, 8 MP, 12 MP, 13 MP, and 16 MP resolutions. In both datasets, the collected images are compiled into separate folders, each named after the plant species it contains.
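Because each species has its own folder, the folder name can serve directly as the class label. The following is a minimal sketch of how such a layout could be indexed, not the authors' tooling. The function names, the root path, and the set of image file extensions are assumptions; the article does not state the image file format.

```python
from collections import Counter
from pathlib import Path

# Assumed image extensions; the article does not state the file format.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def index_dataset(root):
    """Return (image_path, species_label) pairs, using each folder name as the label."""
    samples = []
    for species_dir in sorted(Path(root).iterdir()):
        if not species_dir.is_dir():
            continue  # skip stray files next to the species folders
        for image_path in sorted(species_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_EXTENSIONS:
                samples.append((image_path, species_dir.name))
    return samples


def images_per_species(samples):
    """Count how many images were indexed for each species label."""
    return Counter(label for _, label in samples)


def species_below(counts, minimum):
    """List species whose image count is below `minimum`, in alphabetical order."""
    return sorted(label for label, n in counts.items() if n < minimum)
```

For example, after downloading the medicinal plant dataset to a local folder (hypothetical path `plant_root`), `species_below(images_per_species(index_dataset(plant_root)), 100)` would list any species folder with fewer than 100 indexed images.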

Data augmentation
Data augmentation is applied only to the medicinal plant dataset. Its 5200 image samples, captured with natural backgrounds, were augmented to reduce dataset imbalance and cover real-time challenges. Geometrical and intensity transformations are applied: image rotation is achieved by rotating the image by 180 degrees, low contrast by multiplying the intensity by a factor of 0.6, high contrast with an intensity factor
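The transformations named above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: images are assumed to be 8-bit arrays of shape height x width (x channels), the function names are hypothetical, and the high-contrast factor is left as a caller-supplied parameter because its value is not given in the text above.

```python
import numpy as np


def rotate_180(image):
    """Rotate an H x W (x C) image array by 180 degrees."""
    return np.rot90(image, k=2, axes=(0, 1))


def scale_intensity(image, factor):
    """Multiply pixel intensities by `factor`, round, and clip to the 8-bit range."""
    scaled = image.astype(np.float64) * factor
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)


def augment(image, high_factor=None):
    """Return the augmented variants of one image, keyed by transformation name.

    The 0.6 low-contrast factor comes from the text; `high_factor` is an
    assumption-free placeholder that the caller must supply.
    """
    variants = {
        "rotated_180": rotate_180(image),
        "low_contrast": scale_intensity(image, 0.6),
    }
    if high_factor is not None:
        variants["high_contrast"] = scale_intensity(image, high_factor)
    return variants
```

Clipping after scaling keeps factors above 1 from overflowing the 8-bit range, which would otherwise wrap bright pixels around to dark values.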

Ethics Statements
This work does not involve studies with animals or humans.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have or could be perceived to have influenced the work reported in this article.