A Hybrid Deep Learning and Optimized Machine Learning Approach for Rose Leaf Disease Classification

Analysis of the symptoms on rose leaves can identify up to 15 different diseases. This research aims to develop Convolutional Neural Network (CNN) models for classifying diseases on rose leaves using hybrid deep learning techniques with a Support Vector Machine (SVM). The developed models were based on the VGG16 architecture, and early or late fusion techniques were applied to concatenate the output of a fully connected layer. The results showed that the models based on early fusion performed better than those based on either late fusion or VGG16 alone. In addition, the models using the SVM classifier classified the diseases appearing on rose leaves more accurately than the models using the softmax classifier. In particular, a hybrid deep learning model based on early fusion and SVM with the categorical hinge loss function yielded a validation accuracy of 88.33% and a validation loss of 0.0679, outperforming the other models. Moreover, this model was evaluated by 10-fold cross-validation and achieved 90.26% accuracy, 90.59% precision, 92.44% recall, and a 91.50% F1-score for disease classification on rose leaves.

I. INTRODUCTION

Roses are widely produced and exported globally. In 2019, the export value of roses was more than 175 million US dollars. The top five countries with the highest export rankings are the Netherlands, Denmark, Uganda, Germany, and Canada [1]. In cultivating roses, pest problems such as insect infestations are often encountered, along with pathogens caused by fungi, viruses, and bacteria [2]. There are also diseases caused by nutritional deficiencies, such as in nitrogen, iron, zinc, and magnesium. Disease symptoms can be detected in roots, stems, branches, leaves, and buds or flowers. The leaves in particular show the symptoms of various infectious diseases. However, classifying an infected disease requires skill and experience. For example, rose mosaic disease is a common disease worldwide and can sometimes be caused by more than one pathogen [3]. Image processing methods for plant disease classification are currently being studied [4], combined with machine learning such as Support Vector Machine (SVM) [5] or K-Nearest Neighbors (KNN) [6]. For example, the authors in [7] classified 4 rose leaf diseases using machine learning with at least 94% accuracy. In addition to machine learning, other methods such as deep learning and neural networks are applied to recognize and classify plant diseases. The author in [8] developed a Convolutional Neural Network (CNN) model, which applied MobileNet and transfer learning, for rose disease classification. More than 30 diseases affect roses, and at least 15 of them can be observed on the leaves [2].
Most of the research used a single perspective or a single set of images as the dataset for model training. However, in the deep learning model training, it is necessary to use images with multiple perspectives, such as image augmentation, and include image segmentation to highlight features that appear on the image. This approach usually increases the accuracy of the model. Moreover, there are currently very few studies that have applied hybrid deep learning models to classify plant diseases. Therefore, this research aims to develop rose leaf disease classification models using hybrid deep learning. Besides, this work also compares the performance of models using different classifiers, namely softmax function and SVM.

A. Image Data Collection
This study classifies rose diseases by identifying the symptoms on leaves based on image processing and CNN. A program was developed using the Google search engine and the ChromeDriver utility to search for and download rose leaf images with dimensions of at least 224 pixels. All downloaded images were rechecked and labeled. Moreover, the author took photos of rose leaves with and without diseases using an Android mobile phone. Thus, a dataset of 4,032 downloaded and photographed images was formed. The images were categorized into 16 different classes according to the shown rose disease (15 diseases + 1 normal/control class), as shown in Table I. Finally, all images were resized and cropped to 224×224 pixels.

B. Image Augmentation
During the deep learning training stage, many images are needed to increase the performance of the model. If a particular class has a small number of images, it can affect classification accuracy. Thus, the author increased the number of training images by using image augmentation techniques, including vertical and horizontal flips, rotation (45, -45, 90, and -90 degrees), shearing (45 and -45 degrees), and random zoom-in up to 200%. Thus, the number of images increased from 4,032 to 40,320. An example of image augmentation is shown in Figure 1.

C. Image Processing
Image processing was applied to emphasize the physical appearance of the rose leaves. As a preliminary step, the rose leaf was separated from the background pixels with the GrabCut [9] method, which is based on the graph-cut technique. An output image is illustrated in Figure 2; the background pixels are removed by the GrabCut method.
The images obtained after the removal of the background pixels were then subjected to color-space conversion and image thresholding.

1) Hue, Saturation, Value Color Space
Hue, Saturation, Value (HSV) is a color space model that describes the color type as a hue angle between 0 and 360 degrees, together with saturation (vibrancy) and value (brightness). This work focused on hues around 120 degrees, which correspond to the green color of a rose leaf. All green areas were desaturated toward gray by lowering their saturation. As a result, any color not related to green became more prominent.
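The green-desaturation step can be sketched in NumPy; the hue window and damping factor below are illustrative assumptions, not values stated in the paper:

```python
import numpy as np

def desaturate_green(hsv, center=120.0, width=40.0, factor=0.2):
    """Reduce the saturation of pixels whose hue lies near green (~120 deg).

    `hsv` is an (H, W, 3) float array with hue in degrees [0, 360) and
    saturation/value in [0, 1]. The `width` window and `factor` damping
    are illustrative choices.
    """
    out = hsv.copy()
    green = np.abs(out[..., 0] - center) < width   # mask of green-ish hues
    out[..., 1] = np.where(green, out[..., 1] * factor, out[..., 1])
    return out

# One green pixel and one red pixel, both fully saturated.
hsv = np.array([[[120.0, 1.0, 1.0], [0.0, 1.0, 1.0]]])
result = desaturate_green(hsv)
```

After this step the green pixel is nearly gray while the red pixel keeps its full saturation, so non-green regions stand out.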

2) Truncated Adaptive Gaussian Thresholding
Truncated Adaptive Gaussian Thresholding (TAGT) is a combination of truncate and adaptive Gaussian thresholding. For truncate thresholding, an image without background pixels is processed. Pixels greater than the threshold value $th_v = 127$ were assigned that value, as in (1) [10]:

$$out(x, y) = \begin{cases} th_v, & \text{if } in(x, y) > th_v \\ in(x, y), & \text{otherwise} \end{cases} \quad (1)$$

where $in(x, y)$ refers to the input pixel at coordinate $(x, y)$, and $out(x, y)$ refers to the output pixel at that coordinate.
Next, the threshold was adjusted from the weighted sum of the 7×7 pixel neighborhood block using adaptive Gaussian and binary thresholding, as in (2) [10]:

$$out(x, y) = \begin{cases} max_v, & \text{if } in(x, y) > th(x, y) \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

where $max_v$ refers to the maximum value assigned to the pixels and $th(x, y)$ refers to the threshold calculated individually for each pixel.

3) Double Inverse-Binary Thresholding
Double Inverse-Binary Thresholding (DIBT) is a thresholding method in which inverse-binary thresholding is applied twice. First, the image without background pixels passes through a thresholding step combining inverse-binary and binary thresholding, with the threshold values set to 100 and 0 respectively. The resulting image was then adjusted with a threshold value of 127 by inverse-binary thresholding, calculated in (3) [10]:

$$out(x, y) = \begin{cases} 0, & \text{if } in(x, y) > th_v \\ max_v, & \text{otherwise} \end{cases} \quad (3)$$
This results in an image emphasizing the coordinates of suspected disease infection or wilt on the rose leaf. Each original image thus yields 3 additional images, namely HSV, TAGT, and DIBT, as shown in Figure 3.
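A minimal NumPy sketch of one reading of DIBT, in which the first combined inverse-binary/binary step is simplified to a single inverse-binary pass at threshold 100:

```python
import numpy as np

def inverse_binary(img, th_v, max_v=255):
    """Inverse-binary thresholding as in (3): 0 where pixel > th_v, else max_v."""
    return np.where(img > th_v, 0, max_v).astype(np.uint8)

def dibt(img):
    """Double Inverse-Binary Thresholding sketch.

    Simplified interpretation: inverse-binary at 100, then inverse-binary
    again at 127 on the intermediate result.
    """
    step1 = inverse_binary(img, 100)
    return inverse_binary(step1, 127)

# A bright pixel (150) and a dark pixel (50): the double inversion maps
# bright -> 255 and dark -> 0.
out = dibt(np.array([[150, 50]], dtype=np.uint8))
```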

D. Hybrid Deep Learning Modeling
In this step the CNN models for the classification of diseases on a rose leaf were developed. Twelve models were developed as follows.

1) Visual Geometry Group16-Based CNN Model
The Visual Geometry Group 16 (VGG16) model is a CNN architecture presented in [11]. The input images of the VGG16-based CNN model were set to 224×224 pixels for processing through 16 weight layers, including 13 convolution layers and 3 fully connected layers. All convolution layers have a 3×3 kernel size, a padding size of 1 pixel, and the Rectified Linear Unit (ReLU) activation function. Spatial pooling follows with 5 max-pooling layers with a 2×2 pixel filter and a stride of 2. Further, 1 flatten layer was included before feeding the output to the fully connected layers. Furthermore, the softmax activation function was applied with 1,000 output classes in the last fully connected layer. Thus, the total number of trainable parameters of this model was 138,357,544, as shown in Figure 4.
According to Figure 4, the "Process: A" refers to feature extraction layers, "Process: B" refers to the flatten layer, and "Process: C" refers to the fully connected layers. This research used the original image dataset with 16 labeled output classes (see Table I) to develop the VGG16-based CNN model. Therefore, the softmax layer (in Process: C) was set to 16 instead of 1000 classes. The overall trainable parameters of this model were 134,326,096.
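The 16-class variant can be sketched with Keras; `weights=None` is used here so the sketch runs offline, whereas a transfer-learning setup would typically start from pretrained ImageNet weights:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Convolutional base of VGG16 (Process: A), without the 1000-class head.
base = tf.keras.applications.VGG16(include_top=False, weights=None,
                                   input_shape=(224, 224, 3))

inp = layers.Input(shape=(224, 224, 3))
x = base(inp)
x = layers.Flatten()(x)                          # Process: B
x = layers.Dense(4096, activation="relu")(x)     # Process: C
x = layers.Dense(4096, activation="relu")(x)
out = layers.Dense(16, activation="softmax")(x)  # 16 rose-leaf classes
model = tf.keras.Model(inp, out)
```

With the 16-class head, `model.count_params()` gives 134,326,096, matching the trainable-parameter count stated above.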

2) Early Fusion Model
The early fusion (EF) model was developed based on the VGG16-based CNN model. It starts by feeding each separately processed image type (original, HSV, TAGT, and DIBT) to its own CNN channel for feature extraction (see Process: A in Figure 4) with the VGG16 architecture. The outputs of the four channels were fused and flattened before being classified by the fully connected layers, as shown in Figure 5. The total number of trainable parameters was 484,041,973.
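The early fusion wiring can be illustrated with the Keras functional API; a tiny stand-in convolutional branch replaces the full VGG16 base of each channel, and the 64-unit dense layer stands in for the paper's 4,096-unit layers, to keep the sketch lightweight:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_branch(name):
    # Stand-in feature extractor for one image type; the paper uses a
    # full VGG16 convolutional base (Process: A) per channel.
    inp = layers.Input(shape=(224, 224, 3), name=name)
    x = layers.Conv2D(8, 3, activation="relu")(inp)
    x = layers.MaxPooling2D(4)(x)
    return inp, x

branches = [conv_branch(n) for n in ("original", "hsv", "tagt", "dibt")]
inputs = [inp for inp, _ in branches]

# Early fusion: concatenate the per-channel feature maps, then flatten
# and classify with fully connected layers.
fused = layers.Concatenate()([feat for _, feat in branches])
x = layers.Flatten()(fused)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(16, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```

The late fusion variant differs only in where the merge happens: each branch is classified first, and the softmax outputs are fused afterwards.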

3) Late Fusion Model
The VGG16-based CNN model was extended to the Late Fusion (LF) model in this work. The LF model starts with each processed image type as input (as in the EF model) for each CNN channel, then fuses the outputs obtained after classification by the softmax activation function. After the fusion of the results from the 4 image types, they were classified by 2 dense layers of size 4,096 and finalized with the softmax function into 16 output classes (Figure 6). The total number of trainable parameters was 586,665,136.

4) VGG16-Based SVM Models
In the VGG16-based CNN model, the softmax activation function was used to classify the final output at the last layer. In contrast, the VGG16-based SVM model applies the SVM classifier instead of the softmax activation function, as shown in Figure 7.
SVM is a popular supervised learning classifier, especially for binary classification. In this work, a multi-class SVM is required to classify the 16 classes of rose leaf disease images, and there are several formulations of multi-class SVM. Various multi-class SVM classifiers, namely L2-SVM, Categorical Hinge Loss SVM (CHL-SVM), and Weston-Watkins SVM (WW-SVM), were applied to the VGG16-based CNN models as follows.
The L2-SVM is based on the optimization of the L2 norm and the Squared Hinge Loss (SHL), which is calculated in (4):

$$SHL = \sum_{i=1}^{N} \max\left(0,\; 1 - \hat{y}_i \left(w^{T} x_i + b\right)\right)^{2} \quad (4)$$

where $w^T$ refers to the transposed weight vector, $x$ refers to the augmented sample data vector, $b$ refers to the bias, $N$ refers to the number of samples in a dataset, $\hat{y}$ refers to the actual class, and $y$ refers to the predicted class.
Then, the SHL was optimized with a minimum Euclidean norm and a large error penalty. Thus, the L2-SVM was formulated in (6):

$$\min_{w, b} \; \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{N} \max\left(0,\; 1 - \hat{y}_i \left(w^{T} x_i + b\right)\right)^{2} \quad (6)$$

where $\|w\|^2$ refers to the Euclidean norm (L2 norm regularization), and $C$ refers to the large error penalty for misclassification, in which $C > 0$.
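A NumPy sketch of the binary-case L2-SVM objective combining (4) and (6); labels are in {-1, +1}, and the multi-class extension follows the Weston-Watkins scheme described below:

```python
import numpy as np

def l2_svm_objective(w, b, X, y, C=1.0):
    """L2-SVM objective: ||w||^2 / 2 + C * squared hinge loss.

    X is an (N, d) sample matrix, y holds labels in {-1, +1}; this
    mirrors (4)-(6) for the binary case.
    """
    margins = y * (X @ w + b)
    shl = np.maximum(0.0, 1.0 - margins) ** 2   # squared hinge per sample
    return 0.5 * np.dot(w, w) + C * shl.sum()

w = np.array([1.0, 0.0])
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
obj = l2_svm_objective(w, 0.0, X, y)  # both samples outside the margin
```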
The CHL-SVM, or multi-class hinge loss function, was implemented with TensorFlow 2-based Keras and is calculated in (7) [12]:

$$L = \max\left(0,\; 1 + \max_{j \neq y}\left(p_{j}\right) - p_{y}\right) \quad (7)$$

where $p_y$ refers to the score of the actual class and $p_j$ refers to the scores of the other classes.
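The categorical hinge loss can be reproduced in NumPy following the definition used by Keras' `tf.keras.losses.CategoricalHinge`:

```python
import numpy as np

def categorical_hinge(y_true, y_pred):
    """Multi-class (categorical) hinge loss per sample.

    y_true is one-hot, y_pred is a score vector per sample:
    loss = max(0, 1 + max(negative scores) - positive score).
    """
    pos = np.sum(y_true * y_pred, axis=-1)           # score of the true class
    neg = np.max((1.0 - y_true) * y_pred, axis=-1)   # best wrong-class score
    return np.maximum(0.0, 1.0 + neg - pos)

y_true = np.array([[0.0, 1.0, 0.0]])
y_pred = np.array([[0.1, 0.7, 0.2]])
loss = categorical_hinge(y_true, y_pred)
```

Here the true class scores 0.7 and the strongest competitor 0.2, giving a loss of max(0, 1 + 0.2 - 0.7) = 0.5.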
Regarding the WW-SVM, or Weston-Watkins hinge loss [13], the linear classifier was calculated in (8), and the optimization was formulated in (9):

$$f_j(x) = w_j^{T} x, \quad j = 1, \ldots, k \quad (8)$$

$$\min_{w} \; \frac{1}{2} \sum_{j=1}^{k} \|w_j\|^{2} + C \sum_{i=1}^{l} \sum_{j \neq y_i} \max\left(0,\; 1 + w_j^{T} x_i - w_{y_i}^{T} x_i\right) \quad (9)$$

where $w$ refers to the weights, $k$ refers to the number of classes, $l$ refers to the number of samples in the dataset, $y$ refers to the predicted class, and $C$ refers to the large error penalty for misclassification, in which $C > 0$.
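A per-sample NumPy sketch of the Weston-Watkins hinge term in (9), given the k class scores from (8):

```python
import numpy as np

def ww_hinge(scores, y):
    """Weston-Watkins multi-class hinge loss for one sample.

    `scores` holds the k class scores w_j . x, and y is the index of the
    actual class; the loss sums max(0, 1 - (score_y - score_j)) over all
    wrong classes j != y.
    """
    margins = scores[y] - scores       # margin of the true class over each class
    margins = np.delete(margins, y)    # drop the true class itself
    return np.sum(np.maximum(0.0, 1.0 - margins))

# True class 0 beats class 1 by 2.0 (no loss) but class 2 by only 0.5,
# which violates the unit margin and contributes 0.5.
loss = ww_hinge(np.array([3.0, 1.0, 2.5]), 0)
```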
Finally, the last fully connected layer with the softmax function was replaced with the multi-class SVM to classify the final output. Therefore, 3 VGG16-based SVM models, namely VGG16 & L2-SVM, VGG16 & CHL-SVM, and VGG16 & WW-SVM, were developed with different multi-class SVM classifiers.

5) Early Fusion-Based SVM Models
The early fusion-based SVM model applied the 3 different multi-class SVM classifiers to the last fully connected layer with softmax of the EF model. Thus, 3 EF models were developed, namely EF & L2-SVM, EF & CHL-SVM, and EF & WW-SVM.

6) Late Fusion-Based SVM Models
The LF model included two softmax functions at two layers: before fusion and at the last layer of the model. Thus, the softmax classifier of the last layer was bypassed and replaced with the multi-class SVM classifiers. Therefore, 3 LF-based SVM models were developed, namely LF & L2-SVM, LF & CHL-SVM, and LF & WW-SVM.

By default, all models are based on the VGG16 CNN architecture. The image dataset was randomly split into 70%, 15%, and 15% for training, validation, and testing respectively. The hyperparameters were set as follows: the batch size was 64, the learning rate was 0.001, and training took 200 epochs. The models were compiled with the Adam optimizer.
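The 70/15/15 split can be sketched with NumPy; the random seed is an assumption for reproducibility, not a value stated in the paper:

```python
import numpy as np

def split_dataset(n, seed=0):
    """Random 70/15/15 split of n sample indices into train/val/test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = n * 70 // 100          # integer arithmetic avoids float rounding
    n_val = n * 15 // 100
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

# 40,320 augmented images -> 28,224 train, 6,048 validation, 6,048 test.
train, val, test = split_dataset(40320)
```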

E. Model Evaluation
All models were evaluated and validated by the accuracy and loss value during training. In addition, the VGG16-based CNN, EF, and LF models that applied the softmax function at the fully connected layer were evaluated using the categorical cross-entropy loss [14]:

$$\text{Cross-entropy} = -\sum_{i=1}^{m} \sum_{j=1}^{n} y_{i,j} \log\left(p_{i,j}\right) \quad (10)$$

where $m$ refers to the total number of inputs, $n$ refers to the number of classes, $y_{i,j}$ refers to input $i$ belonging to class $j$, and $p_{i,j}$ refers to the predicted probability of class $j$ for input $i$.
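The categorical cross-entropy in (10) can be written directly in NumPy; a small `eps` guards against log(0), and Keras additionally averages over the batch:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy as in (10), summed over inputs and classes.

    y_true is an (m, n) one-hot matrix, y_pred an (m, n) matrix of
    predicted class probabilities.
    """
    return -np.sum(y_true * np.log(y_pred + eps))

# A maximally uncertain 2-class prediction costs log(2) nats.
loss = categorical_cross_entropy(np.array([[0.0, 1.0]]),
                                 np.array([[0.5, 0.5]]))
```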
Further, k-fold cross-validation was used to estimate the learning skill of the model on an unseen dataset. K-fold cross-validation is mainly used to measure the performance of machine learning models but can also be applied to deep learning models. Thus, all 12 models were evaluated by 10-fold cross-validation in this work. The performance of the models was validated on accuracy (ACC) [15], precision (PREC) [16], recall (REC) [17], and F1-score.
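A self-contained NumPy sketch of 10-fold splitting and macro-averaged precision/recall/F1, as a simplified stand-in for libraries such as scikit-learn:

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Yield (train, test) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    for fold in np.array_split(idx, k):
        yield np.setdiff1d(idx, fold), fold

def fold_metrics(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over the classes present."""
    precs, recs = [], []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precs.append(tp / (tp + fp) if tp + fp else 0.0)
        recs.append(tp / (tp + fn) if tp + fn else 0.0)
    prec, rec = float(np.mean(precs)), float(np.mean(recs))
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

folds = list(kfold_indices(100, k=10))
prec, rec, f1 = fold_metrics(np.array([0, 0, 1, 1]),
                             np.array([0, 1, 1, 1]))
```

Here the F1 is computed from the macro-averaged precision and recall; per-class F1 averaging is another common convention.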

III. RESULTS

All 12 models were trained, validated, and tested, and their performances were compared.

A. Model Training and Validation Performance
The results showed that the models developed with the early fusion technique performed better than the late fusion and VGG16 models. In particular, the model developed with the early fusion method and categorical hinge loss for the SVM (EF & CHL-SVM) gave the best accuracy among the models, as shown in Table II. According to Table II, the EF & CHL-SVM model was able to classify the disease-free (NM) rose leaves with 98.95% accuracy. The most accurately classified rose leaf diseases were VW, IB, and DM, with 98.48%, 94.08%, and 92.68% accuracy respectively. For most of the other diseases the accuracy was higher than 87%, except for BB, ATN, CLS, and SLR, which had less than 83%. Notably, the SLR disease had the lowest accuracy at 74.44%.

B. Model Evaluation
All developed models were tested and evaluated by 10-fold cross-validation with a test dataset. The results showed that the performance of the EF-based models was higher than that of the LF-based and VGG16-based models. Among the EF-based models, the EF & CHL-SVM had the highest performance, with 90.26% accuracy, 90.59% precision, 92.44% recall, and a 91.50% F1-score, as shown in Table III.

IV. CONCLUSION

This research developed 12 models for classifying rose diseases from the symptoms that appear on rose leaves, using CNN models based on the VGG16 architecture and image processing. The classification of rose diseases consists of 16 classes (9 classes for diseases caused by fungi, 4 for virus diseases, 1 for insect bites, 1 for nutrient deficiencies, and 1 disease-free class). The 12 developed CNN models were divided into three groups: VGG16, EF, and LF. In addition, each group was divided into two classifier types: softmax and SVM. The softmax function was used in 3 models, namely the VGG16-based CNN, EF, and LF models. The multi-class SVM classifiers used were L2-SVM, CHL-SVM, and WW-SVM. There were 4,032 rose leaf images for model training. The images were resized to 224×224 pixels and underwent image augmentation, resulting in a dataset of 40,320 images. These images were subjected to image processing, including removal of background pixels, HSV color space conversion, TAGT, and DIBT, to emphasize their features. Both TAGT and DIBT are based on image thresholding. Ultimately, the dataset was split into 70% for training, 15% for validation, and 15% for model testing by 10-fold cross-validation.
The results showed that the EF-based models gave the highest training, validation, and testing performance, followed by the LF-based and the VGG16-based models. In addition, the models developed with the SVM classifier performed better than the models using the softmax function. The model using CHL-SVM showed the highest performance, followed by the models using WW-SVM, L2-SVM, and the softmax function. Thus, the EF & CHL-SVM, a model based on the early fusion method and employing the SVM categorical hinge loss function, was the most suitable model for classifying diseases on rose leaves, with an accuracy of at least 88.33%. The models developed in [7, 8] had accuracies of no less than 94%, which is higher than that of the EF & CHL-SVM model in this work. However, those two studies only classified 4 rose leaf diseases, unlike this study, which classified 15.
Moreover, it is evident that image processing can improve rose leaf disease classification accuracy, especially when the features are fused. Besides, it was found that SVM gave better results as a classifier than the softmax activation function, which is consistent with the findings in [18].
Regarding further work, the author plans to develop a model based on U-Net deep learning and a transfer-learning approach to detect and classify diseases on plants, and then integrate it into the Internet of Things.