A novel dataset of potato leaf disease in uncontrolled environment

Potatoes are of the utmost importance for both food processing and daily consumption; however, they are also prone to pests and diseases, which can cause significant economic losses. To address this issue, the implementation of image processing and computer vision methods in conjunction with machine learning and deep learning techniques can serve as an alternative approach for quickly identifying diseases in potato leaves. Several studies have demonstrated promising results. However, the current research is limited by the use of a single dataset, the PlantVillage dataset, which may not accurately represent the diverse conditions of potato pests and diseases in real-world settings. Therefore, a new dataset that accurately depicts various types of diseases is crucial. We propose a novel dataset that offers several advantages over previous datasets, including data obtained in an uncontrolled environment that results in a diverse range of variables such as background and image angles. The proposed dataset comprises 3076 images categorized into seven classes, including leaves attacked by viruses, bacteria, fungi, pests, nematodes, phytophthora, and healthy leaves. This dataset aims to provide a more accurate representation of potato leaf diseases and facilitate advancements in the current research on potato leaf disease identification.


© 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Specifications Table

Value of the Data
• This dataset was collected in an uncontrolled environment, resulting in a variety of variables, including the background and the diverse directions and distances of the images.
• This dataset better represents the various types of diseases commonly found on potato leaves by categorizing them into seven classes: leaves showing symptoms of attack by viruses, bacteria, fungi, pests, nematodes, and phytophthora, and healthy leaves.
• This dataset can be used for computer vision and pattern recognition, primarily in classification tasks.
• The potato leaf dataset is intended to motivate researchers to develop novel approaches for classifying potato leaf diseases in uncontrolled environments.
• This dataset will encourage researchers to classify or build models to identify potato leaf pests and diseases using advanced computer vision techniques under background clutter and occlusion conditions.

Background
The implementation of image processing and computer vision methods can serve as an alternative approach to accelerate the process of identifying diseases in potatoes through the symptoms of leaves. However, the use of a single dataset may impede the ability of Machine Learning (ML) or Deep Learning (DL) to generalize, diversify, and adjust to diverse situations. Furthermore, the existing PlantVillage dataset [1] utilizes a controlled environment for image capture, whereby each image is captured under controlled parameters such as a clean background, controlled angles, and camera directions. This approach was adopted to enable system designers to minimize various sources of variability and achieve high-performance results. In addition, the PlantVillage dataset comprises only three classes: healthy, early blight, and late blight. Among these, late and early blight are caused by fungi. Although previous studies have demonstrated superior performance, the available datasets may not accurately represent the real-world conditions of potato pests and diseases because of the controlled environment in which the images were captured and the lack of information on disease type, which only captures diseases caused by fungi. The creation of a dataset specifically for diversifying images of potato leaf diseases in uncontrolled environments is highly needed because of the scarcity of suitable image datasets for a wider range of diseases. To address this issue, we have recently acquired novel primary data that offer several advantages over previous datasets and will better represent the various types of diseases commonly found on potato leaves.

Data Description
The dataset was developed by multidisciplinary teams from the Faculty of Engineering and Informatics, Universitas Multimedia Nusantara, and the Faculty of Agriculture, Universitas Gadjah Mada. The dataset was collected from several potato farms on Java Island, Indonesia, primarily in Central Java. The dataset was gathered in an unrestricted setting characterized by multiple inconsistencies, such as varying backgrounds and diverse directions and distances. This dataset comprises several categories of potato leaf diseases, including those caused by fungi, viruses, pests, bacteria, phytophthora, and nematodes.
The sample images for each category are shown in Fig. 1. The total number of images per category in the dataset is listed in Table 1. As listed in Table 1, the dataset consists of seven classes, with a total of 3076 images. The classes were categorized as "virus," "phytophthora," "nematode," "fungi," "bacteria," "pest," and "healthy". The images had a resolution of 1500 × 1500 pixels and were saved in .jpg format to ensure ease of access and compatibility with various image-processing software packages.
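A quick sanity check on a downloaded copy is to count the images in each class and compare the totals with Table 1. The sketch below assumes a hypothetical layout with one sub-folder per class, named after the seven labels above; the actual folder names in the released archive may differ, and the function name `count_images` is illustrative only.

```python
from collections import Counter
from pathlib import Path

# Class labels as listed in the text. The one-folder-per-class layout and the
# folder names themselves are assumptions, not taken from the released archive.
CLASSES = ["virus", "phytophthora", "nematode", "fungi", "bacteria", "pest", "healthy"]


def count_images(root, classes=CLASSES, suffixes=(".jpg", ".jpeg")):
    """Return a Counter mapping each class name to its number of image files."""
    counts = Counter()
    for name in classes:
        class_dir = Path(root) / name
        if not class_dir.is_dir():
            # A missing folder is reported as zero images instead of raising.
            counts[name] = 0
            continue
        counts[name] = sum(
            1
            for path in class_dir.iterdir()
            if path.is_file() and path.suffix.lower() in suffixes
        )
    return counts
```

Calling `count_images` on the dataset root and summing the values should give 3076 if the layout matches this assumption.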

Image capturing
The images used in this study were obtained from multiple angles and distances ranging from approximately 5-15 cm. The pictures were acquired under diverse weather conditions, including sunny, cloudy, and partially cloudy. Images were captured between 8:00 a.m. and 3:00 p.m. The dataset covers symptoms of pathogen and pest attacks on potato leaves of varying ages, approximately 35-80 days after planting. At this stage, the symptoms caused by each pathogen are visible on the leaves and can be distinguished unambiguously from other infections. However, because the dataset was intended for the image classification task, the disease progression stage was not included in the dataset. The disease stage was randomly selected based on the occurrence of diseases observed in the field. The images were obtained in two distinct periods in August 2023, at the potato farms listed in Table 2.
Images were captured by several team members using various smartphone cameras. The detailed specifications of the smartphone cameras are listed in Table 3. The captured images varied in resolution from 2448 × 3264 to 3472 × 4624, in both vertical and horizontal orientations, because of the different specifications of the smartphone cameras used. The aspect ratios of the images captured by the smartphone cameras were 4:3 and 16:9. Furthermore, the different camera positions used by the team members resulted in various backgrounds and occlusions in the images. Fig. 2 shows how the team members gathered the images.

Image labelling
Experts in plant disease, specifically potato leaf disease, from the Department of Plant Protection, Faculty of Agriculture, Universitas Gadjah Mada, evaluated each image and labelled it as one of seven classes: "virus," "phytophthora," "nematode," "fungi," "bacteria," "pest," and "healthy." Labelling was performed by careful visual observation of the captured images, relying on the core symptoms of each disease category, such as leaf discoloration, wilting, spots, lesions, or abnormal growth patterns, which provide clues about the nature of the disease. Images exhibiting ambiguous traits, such as the concurrent presence of multiple disease features, were excluded from the final dataset. Reference materials and disease manuals were also used. In addition, the experts labelled the images based on field observations, considering factors such as weather conditions, plant history, and common diseases in the area. The combination of field observations, experience, and visual inspection of the symptoms appearing on the leaves determined the final disease category. A summary of the visual characteristics of each class in the potato leaf dataset is provided in Table 4.

Phytophthora infestans symptoms on leaves can be observed as dark brown to black lesions; if left untreated, the lesions enlarge into circular, necrotic patches [2]. The symptoms examined in this study included dark black lesions covering most of the leaf, from 20% to 90% of the surface area, which eventually dried out, did not sporulate, and exhibited a tan hue. Some photographs showed infections in the form of small lesions, ranging in color from bright green to dark green, circular to irregular in shape, and displaying wet spots, as described by [3].
Fungal diseases in potato leaves can show different symptoms, depending on the causative organism. Early blight caused by Alternaria solani on leaves can be recognized as circular patterns manifested along leaf edges [4], slightly sunken leaf spots with yellow borders, and concentric rings. In certain cases, these spots may converge. This characteristic becomes more pronounced as the pathogen infects the underside of the leaf, displaying light-brown spots [5]. However, it should be noted that this development does not lead to leaf drying, as previously described by [6]. Another fungal disease is characterized by yellow leaves that become necrotic with powdery patches [2].
Bacterial diseases show symptoms on leaves as secondary symptoms resulting from infection of the tubers and stems. The leaves wilt without dying or becoming necrotic; wilting is rapid, and the plant does not initially turn yellow. When an infected lower stem is placed in water, a prominent milky ooze is observed [2]. These symptoms may or may not be visible, depending on the development of the disease and its relationship with temperature. Wilt symptoms occur when vascular infection inhibits nutrient flow in the stem and leaf petioles [7]. In this study, we did not use leaves with symptoms of primary bacterial attack.
Nematode infections above the ground can be seen as expanding patches with poor growth in the field. The plants are smaller and have yellowish leaves, with symptoms similar to those of water and nutrient deficiency. Plants with damaged roots become wilted, particularly under warmer temperatures during the day, and may remain wilted even with irrigation [8].
Viral symptoms include reduced leaf size and crinkling, mild mottling or mosaic patterns, and necrosis; severe infections can cause dwarfing in plants.

In addition to these diseases, crop losses in potatoes worldwide can be caused by pests. Potato pests can be broadly grouped into three categories: sucking pests, tuber- and root-damaging pests, and foliage feeders or defoliating pests [9]. Damage from thrips can be observed in leaf tissues, which become distorted and dotted with silver or chlorotic spots. With continuous feeding, the leaf tips wither, curl up, and eventually die; the damage is more severe under dry weather conditions. Both nymphs and adults graze on leaves along the midribs and veins. Foliage feeders such as caterpillars, leaf beetles, and grasshoppers feed on leaves by scraping and skeletonizing them. Older larval instars feed heavily in groups, frequently defoliating the plants. If the infestation is severe, the crop can lose its leaves all at once [9]. Another pest is the leaf miner, Liriomyza spp., whose larvae mine the leaves [10].

Image pre-processing

Comparison with existing dataset
This section compares the proposed dataset with the existing PlantVillage dataset and highlights its novelty. Table 5 provides a comparison between the PlantVillage and proposed datasets.
As presented in Table 5, the PlantVillage dataset includes only images of leaf symptoms caused by fungi. In contrast, our dataset includes images of leaf symptoms caused by a variety of sources, including viruses, phytophthora, nematodes, fungi, bacteria, and pests. The PlantVillage dataset was collected in a controlled environment with uniform background, angle, orientation, and lighting conditions. Additionally, the images were captured at a low resolution of 256 × 256 pixels. The proposed dataset addresses the shortcomings of the PlantVillage dataset by presenting images in a broader range of symptom categories, captured in uncontrolled environments and at a higher resolution. Fig. 4 shows samples of the images from the PlantVillage dataset and our proposed dataset from the fungi category, and Fig. 5 shows samples from the healthy class. Although the PlantVillage dataset offers benchmark information and training materials for testing and evaluating algorithms for the classification of potato leaf diseases, it is limited in the range of potato leaf diseases it covers. Consequently, a model trained on this dataset can encounter constraints when dealing with unfamiliar categories or varied real-world situations. This difference indicates that our proposed dataset provides the diverse conditions needed to achieve robust model performance in real-world settings.

Preliminary study

Proposed preliminary study
After collecting the dataset, we conducted a preliminary study using several pre-trained CNNs: EfficientNetV2B3 [11,12], MobileNetV3-Large [13], VGG-16 [14], ResNet50 [15], and DenseNet121 [16]. These CNN-based models were selected because of their lightweight size and outstanding performance on the ImageNet dataset. In addition, previous research has shown that these model families perform well for leaf disease classification using the PlantVillage dataset [17,18].
The dataset was first split into training and test datasets in a ratio of 90:10, consisting of 2765 and 311 images for training and testing, respectively. Before training the CNN models, the training set was split into 2489 images for training and 276 images for validation. All images were then resized to 224 × 224 pixels to match the input specifications of all tested CNN models. The preliminary study was conducted on the free version of Google Colab using the Keras and TensorFlow libraries. All CNN models were initialized with weights pre-trained on the ImageNet dataset. For consistency, all models were trained using the Adam optimizer with a learning rate of 0.0001, a categorical cross-entropy loss function, a batch size of 64, and 50 epochs.
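The two-stage split described above can be sketched in plain Python. This is only an illustration: the text does not state whether the split was stratified or how rounding was handled, so the per-class sampling in `stratified_split` is an assumption, and the toy labels below will not reproduce the reported 2489/276/311 counts. The `TRAIN_CONFIG` dictionary simply collects the hyperparameters stated in the text; both names are hypothetical.

```python
import random


def stratified_split(labels, holdout_fraction, seed=0):
    """Split positions of `labels` into (keep, holdout), sampling per class.

    Per-class sampling is an assumption; the source only gives the 90:10 ratio.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    keep, holdout = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_holdout = round(len(indices) * holdout_fraction)
        holdout.extend(indices[:n_holdout])
        keep.extend(indices[n_holdout:])
    return sorted(keep), sorted(holdout)


# Hyperparameters as stated in the text, gathered in one place.
TRAIN_CONFIG = {
    "input_size": (224, 224),
    "optimizer": "adam",
    "learning_rate": 1e-4,
    "loss": "categorical_crossentropy",
    "batch_size": 64,
    "epochs": 50,
}

# Toy labels for illustration only (not the real class distribution).
labels = ["fungi"] * 50 + ["healthy"] * 30

# Stage 1: hold out 10% of all images as the test set.
train_val_idx, test_idx = stratified_split(labels, 0.10, seed=0)

# Stage 2: hold out 10% of the remaining images for validation.
train_val_labels = [labels[i] for i in train_val_idx]
train_pos, val_pos = stratified_split(train_val_labels, 0.10, seed=1)
train_idx = [train_val_idx[p] for p in train_pos]
val_idx = [train_val_idx[p] for p in val_pos]
```

Splitting the test set off first, before any validation split or augmentation, keeps the test images out of every later training decision.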
A schematic of the experimental setup is shown in Fig. 6. Two scenarios were employed in this study: training without data augmentation, and training with data augmentation. The second scenario added preprocessing steps using data augmentation, including brightness adjustment, horizontal and vertical flipping, rotation, zooming, and horizontal and vertical shifting, resulting in approximately 990 to 1000 images per class. Data augmentation was used to address the class imbalance in the dataset. Several evaluation metrics were employed in this study: test accuracy, precision, recall, and F1 score. Eqs. (1)-(4) present the formulae for each metric. Most of the evaluation metrics employed TP, FP, TN, and FN, where TP is a True Positive; FP is a False Positive; TN is a True Negative; and FN is a False Negative.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$

$$\text{Precision} = \frac{TP}{TP + FP} \quad (2)$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad (3)$$

$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (4)$$

Tables 6 and 7 list the performance results of the tested models for the scenarios without and with augmentation, respectively. In the scenario without augmentation, as presented in Table 6, EfficientNetV2B3 displayed the highest test accuracy of 0.7363, followed by MobileNetV3-Large at 0.7203. In contrast, VGG-16 demonstrated the lowest test accuracy, 0.5981. In the second scenario, in which the dataset was augmented, EfficientNetV2B3 again emerged as the top-performing model, with a test accuracy of 0.723, as shown in Table 7. MobileNetV3-Large followed closely, with a test accuracy of 0.7042, whereas VGG-16 continued to exhibit the poorest performance, with a test accuracy of 0.5627. These results indicate that the model performance remained subpar, even after augmentation was added to balance the dataset. This is because of the complexity of our proposed dataset, which encompasses a diverse range of backgrounds and image angles, so that augmentation does not yield significant improvements.
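The metrics named above (accuracy, precision, recall, and F1 score) can be computed for a multi-class problem from a confusion matrix, treating each class in turn as the positive class. The sketch below is a minimal illustration in plain Python. The text does not state how per-class values were aggregated, so the macro (unweighted) average used here is an assumption, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Build a matrix where conf[i][j] counts true class i predicted as class j."""
    index = {name: k for k, name in enumerate(classes)}
    conf = [[0] * len(classes) for _ in classes]
    for truth, pred in zip(y_true, y_pred):
        conf[index[truth]][index[pred]] += 1
    return conf


def classification_metrics(conf):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging is an assumption; the source does not specify it.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = conf[k][k]
        fp = sum(conf[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(conf[k]) - tp                       # true k, predicted class differs
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    correct = sum(conf[k][k] for k in range(n))
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with three of the seven classes (labels are illustrative only).
classes = ["fungi", "pest", "healthy"]
y_true = ["fungi", "fungi", "pest", "pest", "healthy", "healthy"]
y_pred = ["fungi", "pest", "pest", "pest", "healthy", "fungi"]
metrics = classification_metrics(confusion_matrix(y_true, y_pred, classes))
```

On an imbalanced dataset such as this one, macro-averaged scores weight every class equally, so they can differ noticeably from the overall accuracy.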
Table 8 compares the performance of the best-performing model in this preliminary study, EfficientNetV2B3, when trained on the PlantVillage dataset and on our proposed dataset. As shown in Table 8, EfficientNetV2B3 achieved a test accuracy of 98.15% when trained on the PlantVillage dataset. However, when trained on the proposed dataset, the accuracy of the model was 73.63%. These results suggest that while the model performed well on the PlantVillage dataset, it struggled on our proposed dataset. Given the distinctive characteristics of our proposed dataset, this is understandable: the lack of a controlled environment makes it significantly harder for the model to learn effectively. Therefore, it is necessary to develop models with improved performance for identifying potato leaf pests and diseases in uncontrolled environments. The real-world scenarios presented in our dataset can help algorithms handle diverse situations, leading to improved recognition accuracy and facilitating their optimization and refinement. This will enable researchers to train more advanced algorithms. We hope that the release of this dataset will lead to the development of an automatic potato leaf disease identification system that can be used in real-life scenarios and will contribute to the advancement of precision agriculture.

Limitations
Adding potato leaf disease samples from countries outside Indonesia could enhance the diversity of the dataset.

Ethics Statement
The dataset presented in this work does not include tests on animals or humans. All images used were obtained by the authors and do not come from any other source.

Fig. 2. Process of capturing images by team members.

Fig. 4. The sample of potato leaves in the fungi category in (a) the PlantVillage dataset, and (b) the proposed dataset.

Fig. 5. The sample of potato leaves in the healthy category in (a) the PlantVillage dataset, and (b) the proposed dataset.

Table 1
Distribution of the dataset.

Table 2
Location of potato farm.

Table 3
Specifications of smartphone cameras used to capture dataset images.
Image acquisition periods: August 2, 2023, for potato farms located in Magelang, Central Java, and August 15-16, 2023, for potato farms located in Wonosobo, Central Java. The locations of the potato farms are listed in Table 2.

Table 4
Summary of visual characteristics of potato leaves.

Table 5
Comparison between PlantVillage datasets and our proposed dataset.

Table 6
Experimental result from non-augmented dataset.

Table 7
Experimental result from augmented dataset.

Table 8
Comparison of results with other datasets.