Achieving Explainability for Plant Disease Classification with Disentangled Variational Autoencoders

Agricultural image recognition tasks are becoming increasingly dependent on deep learning (DL); however, despite the excellent performance of DL, it is difficult to comprehend the type of logic or features of the input image it uses during decision making. Knowing the logic or features is crucial for result verification, algorithm improvement, training data improvement, and knowledge extraction. However, the explanations from the current heatmap-based algorithms are insufficient for the abovementioned requirements. To address this, this paper details the development of a classification and explanation method based on a variational autoencoder (VAE) architecture, which can visualize the variations of the most important features by visualizing the generated images that correspond to the variations of those features. Using the PlantVillage dataset, an acceptable level of explainability was achieved without sacrificing the classification accuracy. The proposed method can also be extended to other crops as well as other image classification tasks. Further, application systems that use this method for disease identification tasks, such as the identification of potato blackleg disease and potato virus Y, as well as for other image classification tasks, are currently being developed.


Introduction
Deep learning (DL) methods, especially those that use deep convolutional neural networks (DCNNs), are widely used in agricultural image recognition [1], [2]. However, explaining the rationale behind the recognitions (decisions) performed by a DCNN remains difficult and continues to be an active field of research [3].
There are several reasons why the decisions made by a DCNN must be understood.
One reason is that a DCNN may unintentionally base its discrimination on false features (artifacts) [4], [5], on features derived from cognitive biases (annotations made with unreasonable assumptions) [6], or on non-robust features such as textures [7]. Further, by interpreting the features and the relationships between them, key information can be obtained. To address the abovementioned requirements, the type of features, how their variations were used in the DCNN decision making, and their relative importance must be understood. Hereafter, we refer to the important features as key features.
The most popular DCNN algorithms for the visualization of key features describe the area of the image that is the most crucial to the decision making process (such as a heat map) [8]-[10]. However, these methods cannot help us identify the exact features that were used in the classification [11] (e.g., color or shape). For example, if yellowing and wrinkling are two features at the same location on a leaf, a heat map cannot be used to identify which of the two features was used in the classification. In addition, although a dataset contains variations of these features (e.g., different degrees of yellowing or wrinkling), heat maps cannot express these variations. Therefore, a heat map cannot provide a complete explanation of the features used in the classification or of their variations.
The current study focuses on developing a feature visualization method based on the variation of features, in order to show what the classification system considers as features as well as their relative importance while also maintaining an acceptable classification accuracy.
The rest of this paper is structured as follows: First, the importance of the proposed system is discussed. The second section discusses the literature related to this study. The third section introduces the algorithm development and the technical details. The fourth section provides the results and the fifth section presents the conclusions of this study.

Related Work
Explainability is a key aspect of agriculture; as such, several researchers who have developed agricultural image classification algorithms have attempted to incorporate explainability into their approaches. One of the pioneering approaches to achieve explainability involves the use of activation maps; specifically, these approaches use thresholding to visualize the activation maps [12], [13], [14]. However, only visualizing the first activation layer is not sufficient for a classification that involves the top layers of the neural network. Therefore, researchers are now using methods such as saliency and guided backpropagation to achieve explainability [15]. Another approach involves the use of occlusion maps, in which a part of the image is occluded and the changes in the activation maps are observed [16]. To use this method, the user must guess the exact size and shape of the occlusion. In recent years, the Grad-CAM [9] algorithm has become popular among researchers in the agricultural field [17], [18], with some researchers also developing their own approaches to explain the classification process involving a U-Net architecture [19]. A good review of such approaches can be found in [8], with most of these methods focusing on visualizing the key areas. However, these visualizations do not provide an idea of how a feature is represented in a neural network.
Numerous attempts based on a variational autoencoder (VAE) approach have been made to find interpretable representations from data, such as InfoGAN [20], Beta-VAE [21], and FactorVAE [22]. Moreover, in recent research [23], concept whitening, which utilizes a concept whitening layer and a labeled concept dataset, has gained popularity. In contrast, the current research identifies the independent features (concepts) in the image dataset, together with their variations, that can be used for the classification of that dataset.

Development of Explainable AI Algorithm using Disentangled VAE
To develop the explainable algorithm, a specific architecture was used, where the basic idea was to train a VAE that generates disentangled features in its latent space; these features are then used in the classification task. Once the classification task is complete, the relative importance of the features can be calculated and visualized via feature interpolation (changing the values) using the decoder component of the VAE. Next, the key features can be visualized for the given classification. Hereafter, such features are referred to as "classifiable latent features." The latent features are first separated, in the latent space of the VAE, into classifiable features, which are latent features of the VAE that are suitable for the classification, and non-classifiable features, which are latent features of the VAE that are not suitable for the classification. The most important classifiable features of an image can be identified after it has undergone classification. Hereinafter, the proposed model, which explains its decisions using explainable classifiable latent features, is referred to as ECLF.

Figure 1 Major stages of the ECLF system
As shown in Figure 1, the latent features are distinguished as either classifiable or non-classifiable in stage 1. In stage 2, the classifiable features are used to train the image classifier (in this study, unless specified, the classifier refers to a nonlinear classifier). In stage 3, the variations of the features (left to right, from one class to another class) that are important to the classification are shown. In this image, the two most important features (1 and 2) are shown, and 1* and 2* show the areas where the most significant changes in these features occur. In addition, a complementary model, the explainable classifiable latent features class-specific (ECLF-CS) system, is also established to extract class-specific latent features. Systems that apply these models are currently being developed [24].
In the current study, the primary objective was to develop a system that can visualize the variation of important features in plant leaf images (diseased and healthy) and show the disentangled (separated variations) factors of these features that were used in the classification.
To confirm the explainability and accuracy of the proposed system, the explainability performance, the amount by which disentanglement affects the quality of the visual explanations and the accuracy, and the major factors affecting disentanglement and explainability must be explored. To do so, the objectives addressed in the following sections were set.

ECLF Model
The ECLF model is discussed in three major stages: 1. the VAE training stage, 2. the classifier training stage, and 3. the important feature visualization for classification.

VAE Training Stage
In this stage, the VAE is trained so that it learns to produce a disentangled latent feature vector that can extract the classifiable and non-classifiable features from an image and reconstruct that image. Three major factors were considered for building explainability in the VAE training stage: 1. separation of classifiable and non-classifiable features in ECLF; 2. a disentangled representation for the separated features, which reduces their correlation; and 3. human understandability of the visualized features of our system, which helps users identify important features from the reconstruction.

Separation of Classifiable and Non-classifiable Features in ECLF
A VAE is an algorithm that attempts to approximate the posterior distribution of the latent variable given a data point [25], which here is represented by an image. When an image is given to the encoder component of the VAE, it produces a latent vector $z$ that is used as the input to the decoder component of the VAE. ECLF first divides the latent vector $z$ produced by the encoder into two parts, a classifiable feature vector (CFV, $z_c$) and a non-classifiable feature vector (NCFV, $z_{nc}$), using the following procedure (Figure 2) with an adversarial discriminator. The latent vector is generated using Equation (1),

$z \sim \mathcal{N}(\mu, \operatorname{diag}(\sigma^{2})),$ (1)

where $\mathcal{N}$ is the normal distribution with a diagonal variance represented by $\operatorname{diag}(\sigma^{2})$.
$z = [z_c, z_{nc}].$ (2)

In this case, the CFV and NCFV parameters can be divided as follows:

$\mu = [\mu_c, \mu_{nc}],$ (3)

$\sigma = [\sigma_c, \sigma_{nc}].$ (4)

If the decoder function is $D(\cdot)$, then the reconstructed image $x'$ is obtained using Equation (5):

$x' = D([z_c, z_{nc}]).$ (5)

Figure 2 Training procedure of the VAE and adversarial discriminator

Figure 2 shows the simplified training procedure for the variational autoencoder and the adversarial discriminator. Notably, an adversarial discriminator was used to form the NCFV during the VAE training. The red arrow indicates the discriminative loss from the adversarial discriminator.
This loss (if the performance of the adversarial discriminator is good, it is treated as a loss for the encoder) attempts to remove the classifiable features from the NCFV, and the yellow arrow shows the reconstruction loss from the decoder to the encoder.
An adversarial discriminator function attempts to learn the class of the input image using only the NCFV [26]; its output is therefore the class it assigns to the image from the NCFV alone.
Given the ground truth class of the image, the adversarial discriminator is trained using the classification loss between its predicted class and the ground truth. In contrast to [26], which used an autoencoder, a part of the VAE output was used here, and the CFV was not conditioned on an attribute label because the algorithm should find the classifiable attributes or features by itself. If the adversarial discriminator can learn to discriminate the classes with a desirable accuracy, the NCFV might still contain information that can be used to clearly separate the classes. Given that adversarial training is conducted, a high discriminator accuracy is treated as a loss $\mathcal{L}_{adv}$ for the encoder (Equation (7)). Therefore, the encoder is discouraged from producing classifiable features in the NCFV.
Gradually, the encoder learns to send classifiable features through the CFV. Therefore, the VAE loss function was minimized while also attempting to minimize $\mathcal{L}_{adv}$.
A supportive classifier was adopted to support the formation of the CFV. However, its loss was set such that its role is not as prominent as that of the adversarial classifier. Although this classifier could be used for the final classification (stage 2 in Figure 1), its hyperparameters were set for the VAE training, which leads to suboptimal classification accuracy. For this reason, a separate classifier was used as the final classifier.
A convolutional encoder and decoder were used in the system architecture. Note that it is critical to ensure that the features extracted in the encoder are properly represented in the decoder. To achieve this, a decoder with the same weights as the encoder was constructed. The authors of [27] proved that hidden layer activations in a DNN can be recovered using a generative network. As such, it was assumed that the VAE encoder can be recovered using the decoder if proper conditions are ensured. The restrictive nature of the VAE loss function, which has an information bottleneck property [28], [29], may prevent NCFV features from also passing through the CFV.
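As a minimal sketch of the mechanism described in this subsection (not the authors' implementation), the listing below shows how the latent split into CFV and NCFV and the adversarial penalty could be wired together; a PyTorch implementation is assumed, the convolutional layers are replaced by small fully connected ones for brevity, and all module and variable names (SplitVAE, adv, z_c, z_nc) are illustrative.

```python
# Illustrative sketch only: latent split into CFV/NCFV with an adversarial
# discriminator on the NCFV. PyTorch assumed; names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitVAE(nn.Module):
    def __init__(self, feat_dim=1024, cfv_dim=64, ncfv_dim=256, n_classes=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU())
        self.fc_mu = nn.Linear(512, cfv_dim + ncfv_dim)
        self.fc_logvar = nn.Linear(512, cfv_dim + ncfv_dim)
        self.dec = nn.Sequential(nn.Linear(cfv_dim + ncfv_dim, 512), nn.ReLU(),
                                 nn.Linear(512, feat_dim))
        self.cfv_dim = cfv_dim
        # Adversarial discriminator: predicts the class from the NCFV only.
        self.adv = nn.Sequential(nn.Linear(ncfv_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_classes))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # Eq. (1)
        z_c, z_nc = z[:, :self.cfv_dim], z[:, self.cfv_dim:]      # Eq. (2)
        x_rec = self.dec(torch.cat([z_c, z_nc], dim=1))           # Eq. (5)
        return x_rec, mu, logvar, z_c, z_nc

def adversarial_losses(model, x, y):
    """The discriminator tries to classify from z_nc; the encoder is penalised
    when it succeeds, so classifiable information migrates to z_c.
    In practice the discriminator and encoder updates would alternate
    (detaching z_nc for the discriminator step)."""
    x_rec, mu, logvar, z_c, z_nc = model(x)
    disc_loss = F.cross_entropy(model.adv(z_nc), y)   # trains the discriminator
    enc_adv_loss = -disc_loss                         # encoder penalty (L_adv)
    rec_loss = F.mse_loss(x_rec, x)
    return rec_loss, disc_loss, enc_adv_loss
```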

Latent Vector Disentanglement in VAE
Disentanglement in the latent feature vector is encouraged. In the CFV, if the features are disentangled, then, when the latent vector is decoded while changing one CFV feature, only the image features that correspond to that latent feature should change in $x'$. This is a necessary condition for visualizing which latent feature corresponds to which image features in $x'$. Although the definition of disentanglement has been a debated topic [30] in recent years, several methods have been proposed to help variational autoencoders learn disentangled latent features [21], [22], [29]. The current study used the algorithm presented in [31] to improve the disentanglement between the features during training (Equation (9)) owing to its ability to factorize the different terms of the VAE loss, so that only the required terms were minimized.
$\mathcal{L}_{\beta} = \mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] - \beta\, D_{KL}\left(q(z|x)\,\|\,p(z)\right).$ (9)

An attempt was made to increase the evidence lower bound, $\mathcal{L}_{\beta}$, by reducing the second term, the divergence between the two probability distributions $q(z|x)$ and $p(z)$ in Equation (9), which can be considered as the information bottleneck term [28], [29].
Notably, $n$ denotes the number of images in a batch, and β denotes the coefficient responsible for controlling the information bottleneck.
According to [31], the second term can be divided into three components: the index-code mutual information, the total correlation, and the dimension-wise KL divergence. Of these terms, the total correlation term in Equation (11), which reduces the correlation between latent dimensions and encourages their independence, and the dimension-wise KL term in Equation (12) were given special attention.
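For reference, the decomposition used in [31] (β-TCVAE) splits the averaged KL term into the three components named above; the notation below follows that paper rather than the symbols of the present text.

```latex
% Decomposition of the averaged KL term following [31] (beta-TCVAE).
% q(z) denotes the aggregate posterior and z_j the j-th latent dimension.
\mathbb{E}_{p(x)}\!\left[D_{KL}\big(q(z|x)\,\|\,p(z)\big)\right]
  = \underbrace{I_{q}(z;x)}_{\text{index-code MI}}
  + \underbrace{D_{KL}\Big(q(z)\,\Big\|\,\prod\nolimits_{j} q(z_{j})\Big)}_{\text{total correlation, Eq. (11)}}
  + \underbrace{\sum\nolimits_{j} D_{KL}\big(q(z_{j})\,\|\,p(z_{j})\big)}_{\text{dimension-wise KL, Eq. (12)}}
```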

Full Training Loss Function for the Feature Separation Stage
For the training, the total loss function expressed in Equation (14) was used. In this function, the reconstruction loss, the training (KL) loss of the VAE, the supportive classifier loss, and the discriminative regularization loss [32], which is used to increase the visualization quality, are combined, and α, ϵ, ε, β, and γ were used as weighting constants during the training.
In this function, the losses can be divided into four main categories: the reconstruction and discriminative regularization losses can be referred to as the reconstruction losses, which are helpful in identifying the features reconstructed or changed by changing the latent vector of the VAE; the supportive classifier and adversarial losses can be referred to as the feature separation (classifiable) losses; the total correlation term can be referred to as the disentangling loss; and the dimension-wise KL term as the prior matching loss.
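A rough sketch of how these terms could be combined into a total loss such as Equation (14) is given below; the pairing of each weight with each term is an assumption made here for illustration, since the text only names the constants.

```python
# Hypothetical combination of the loss terms described above; the assignment of
# the weights (alpha, eps, eps2, beta, gamma) to terms is an assumption.
def total_loss(rec, disc_reg, sup_cls, adv, total_corr, dim_kl,
               alpha=1.0, eps=1.0, eps2=1.0, beta=40.0, gamma=1.0):
    reconstruction = alpha * rec + gamma * disc_reg   # reconstruction losses
    separation     = eps * sup_cls + eps2 * adv       # feature-separation losses
    disentangling  = beta * total_corr                # disentangling loss
    prior_matching = dim_kl                           # prior-matching loss
    return reconstruction + separation + disentangling + prior_matching
```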

Classifier Training Stage
The classifier is responsible for recognizing classes using the CFV. After the VAE training was completed, the classifier was trained on the CFV. Given that the encoder produces parameters for the CFV distribution, as shown in Equation (3), the classifier cannot be trained directly on this distribution. Since sampling from it might reduce the classification accuracy, the value with the highest likelihood, which is the mean $\mu_c$, was selected to train the classifier while keeping the encoder weights fixed. After the classifier is trained, it can be used to make predictions (final classification), for which $\mu_c$ was also used as the input to the classifier. Once the prediction is made, it needs to be explained using the features learned in the VAE. If a linear classifier is used, it is easy to understand which features are more important than others because the relationship between the features and the prediction is linear. However, for a more generalized approach, it is better to use a classifier algorithm that can use both nonlinear and linear relationships for classification. Several types of nonlinear classifiers can be used in this part of the algorithm.
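A minimal sketch of this stage, reusing the hypothetical SplitVAE module from the earlier listing and assuming PyTorch data loaders, is shown below; `cfv_dim`, `n_classes`, `vae`, and `train_loader` are placeholders, not names from the original implementation.

```python
# Sketch: train the final (nonlinear) classifier on the CFV mean mu_c with the
# encoder frozen. `vae`, `train_loader`, `cfv_dim`, `n_classes` are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

classifier = nn.Sequential(nn.Linear(cfv_dim, 128), nn.ReLU(),
                           nn.Linear(128, 64), nn.ReLU(),
                           nn.Linear(64, n_classes))
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)

for x, y in train_loader:
    with torch.no_grad():                        # encoder weights stay fixed
        h = vae.enc(x)
        mu = vae.fc_mu(h)[:, :vae.cfv_dim]       # mu_c: most likely CFV value
    loss = F.cross_entropy(classifier(mu), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```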

Important Feature Visualization for Classification
Through this visualization, the features that play an important role in selecting one class over another are determined. When the nonlinear classifier makes a decision on an image, it uses a local decision boundary to decide between the classes; at a zoomed-in level, this decision boundary can be approximated using a linear classifier. This method (local interpretable model-agnostic explanations; LIME) was introduced in [5] and uses a standard segmentation algorithm to create image super pixels (sets of pixels with the same properties), then masks other super pixels to make sample points (input points to the algorithm). The LIME method was used because local decision boundaries were the focal point of this study, although numerous methods to determine the importance of a feature in decision-making are available. In Figure 5, the blue and pink backgrounds are separated by the nonlinear function of the deep learning model. This nonlinear function cannot be approximated using a linear function; LIME attempts to explain the bright red cross point by sampling instances (data points) and sending them to the nonlinear function to obtain the predictions; they are then weighted according to their proximity, which is denoted by the size of the marker. A faithful local explanation is denoted by a dashed line [5].

Approximating Nonlinear Boundary Using a Linear Classifier
The proposed method focuses on two classes that are used to explain a decision, namely class A and class B. Class A is selected by the nonlinear classifier as the maximum likelihood class for the CFV that is being explained (this can be different from the ground truth class), meaning that the features that differentiate class A from class B need to be known. In Step 2, samples are taken from the distribution of the point being explained because it represents the probability distribution of that point. Using these points, a linear plane perpendicular to the variation of the variable that contributes the most to the classification can be constructed.
The linear classifier was trained to approximate the activation of the nonlinear classifier in the sampled region.
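A possible sketch of this local approximation, assuming scikit-learn and treating the nonlinear classifier as a black-box function of the latent vector, is given below; the sampling scheme and function names are illustrative, and the two-output linear classifier of the text is reduced to a single class-A activation for brevity.

```python
# Sketch: fit a local linear surrogate to the nonlinear classifier by regressing
# its class-A activation on latent samples drawn around the point being explained.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_local_linear(nonlinear_act, mu_c, sigma_c, class_a, n_samples=1000):
    """nonlinear_act(z) -> per-class activations; mu_c/sigma_c describe the
    CFV distribution of the instance being explained (illustrative names)."""
    zs = np.random.normal(mu_c, sigma_c, size=(n_samples, mu_c.shape[0]))
    targets = np.array([nonlinear_act(z)[class_a] for z in zs])
    lin = LinearRegression().fit(zs, targets)   # minimises the MSE, as in the text
    return lin.coef_, lin.intercept_
```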

Figure 7 Explanation support points
Points (a) and (b) were chosen to be closer to the decision boundary; the importance of each feature was then defined as the change it causes in the activation of the linear classifier for class A when points (a) and (b) are taken as inputs. The following formula was used to determine the importance of each feature: if $w$ is the weight vector of the linear classifier and $I$ is the importance vector, then each element of $I$ is the corresponding weight multiplied by the change in that feature between the two points. The maximum likelihood class was chosen over the ground truth class because the nonlinear classifier sometimes yields incorrect classifications. In such a case, an explanation must be given as to why such a decision was made by the nonlinear classifier.
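Under this reading, one plausible formalization (ours, not necessarily the authors' exact formula) is the following, where $w$ is the weight vector of the local linear classifier and $a$ and $b$ are the two support points:

```latex
% One plausible reading of the feature-importance measure: the contribution of
% feature i to the change in the linear classifier's class-A activation when
% moving from support point a to support point b (notation ours).
I_{i} = w_{i}\,\big(b_{i} - a_{i}\big), \qquad
\sum_{i} I_{i} = w^{\top}(b - a)
```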

Visualizing Most Important Feature for Classification
Once the most crucial features were selected, the meaning of these features and their variations can be visualized through the decoder, as in Equations (5) and (8).
When a classifiable feature is visualized, to understand that feature exactly, the way the image features respond to changes in the latent variable also needs to be visualized. Equation (19) shows how an interpolation vector is created:

$\tilde{z}_i = (1 - t)\, a_i + t\, \bar{b}_i,$ (19)

where $\tilde{z}_i$ is the changed input feature for the visualization, $a_i$ is the feature value of class A at the original point, $\bar{b}_i$ is the mean feature value of class B, $i$ is the feature number of the visualized feature, and $t$ is an interpolation constant.

Figure 9 Classification explanation overview
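A small sketch of the interpolation in Equation (19) is shown below: only the selected feature index is moved from its class-A value toward the class-B mean, the remaining latent dimensions stay fixed, and each interpolated vector is decoded for visualization; the function and variable names are illustrative.

```python
# Sketch: interpolate one latent feature from its class-A value toward the
# class-B mean and decode each step (Eq. (19)). Names are illustrative.
import numpy as np

def interpolate_feature(decode, z_a, z_b_mean, feat_idx, steps=8):
    frames = []
    for t in np.linspace(0.0, 1.0, steps):          # interpolation constant t
        z = z_a.copy()
        z[feat_idx] = (1 - t) * z_a[feat_idx] + t * z_b_mean[feat_idx]
        frames.append(decode(z))                    # reconstructed image x'
    return frames
```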

Visualization of Areas with the Largest Change
Although it is sometimes possible to see the changes in $x'$ directly, the exact areas where the features have changed can also be visualized explicitly.

Figure 10 Visualizing the Changed Areas
In Figure 10, to obtain frame T, the absolute differences between consecutive frames as we move from frame A to frame B, as well as the absolute differences across the channels, were added. Then, the 80th percentile of the summed values was computed, and each channel of frame T was thresholded according to the remaining values to capture the differences between frames A and B.
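A NumPy sketch of this procedure, under the assumption that the interpolation frames are available as arrays, is given below; shapes and names are illustrative.

```python
# Sketch of the changed-area map of Figure 10: sum absolute differences across
# interpolation frames and channels, then keep only values above the 80th
# percentile. Frame shapes (H, W, C) and names are illustrative.
import numpy as np

def changed_area_map(frames):
    """frames: list of HxWxC images decoded while moving from class A to B."""
    diffs = [np.abs(frames[i + 1].astype(float) - frames[i].astype(float))
             for i in range(len(frames) - 1)]
    total = np.sum(diffs, axis=(0, 3))          # sum over frames and channels
    thresh = np.percentile(total, 80)           # 80th percentile threshold
    return (total >= thresh).astype(np.uint8)   # mask of the largest changes
```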

Factors Affecting Explainability and Accuracy in ECLF
Explainability and accuracy play major roles in any explainable system. Many researchers who work with explainable algorithms have expressed concerns regarding the tradeoff between explainability and accuracy [35], [36]. Moreover, explainability has several definitions in the literature on deep learning [37]. In the ECLF system, the definitions of explainability and accuracy need to be clarified, and the important factors that affect these two properties and their tradeoff need to be determined. As discussed in Section 3.1.1.2, during training, increasing the value of β increases the weight of the disentanglement term in the loss, helping to improve disentanglement. This means that the degree of disentanglement depends on the value of β.
In the current study, different aspects of explainability were considered, such as the interpretability of the classifiable features (disentanglement, compactness, and separation of the CFV and NCFV) and the human understandability (visual quality) of the produced explanations. In Section 3.1.1.1, the importance of disentanglement for explainability is detailed. When compactness is considered, the lower the dimensionality of the CFV, the higher the compactness, and vice versa; thus, the user must go through a smaller number of dimensions during investigation, increasing the explainability. Moreover, separating the CFV and NCFV increases the explainability of the system, as it helps us determine which features are classifiable and which are not. Users can understand each feature by visualizing the changes in $x'$ (Section 3.1.3); therefore, the quality of $x'$ also plays a paramount role in the explainability of ECLF.
The interaction between these factors needs to be understood to obtain a better understanding of the explainability of the ECLF system; more specifically, how increasing β (Section 3.1.1.2), which scales the information bottleneck term of the loss function, would affect the other explainability components. In addition, the dimensionality of the latent vector, which might play a paramount role in explainability, also needs to be investigated.
The final classification accuracy of the nonlinear classifier trained during the classifier training stage is considered as the overall accuracy. Given that the input for the final classifier is the CFV, which is affected by disentanglement (β in this case) and the dimensionality of the latent vector, the interaction between the classification accuracy and the explainability factors, such as disentanglement and dimensionality, also needs to be understood.

Explainable Classifiable Latent Features Class-Specific (ECLF-CS)
Even though the ECLF model can show the differences between classifiable features of individual instances of classes, it cannot show class-specific features. To address this limitation, a system was developed that can show the class-specific features of a CFV. A class-specific feature vector can be defined as a feature vector that is specific to one class and can separate that class from at least one other class. Although it is possible to use ECLF-CS in multiclass situations, only two-class situations were considered in this study.
ECLF-CS uses a latent vector that contains classifiable features specific to the classes in two separate vectors, CFVS1 and CFVS2, which in turn are assigned to two classes, C1 and C2. In this case, the latent vector can be expressed as

$z = [z_{s1}, z_{s2}, z_{nc}].$ (19)

In the encoding phase, an input image is sent through the encoder to produce the vector, and the input to the decoder is decided based on the class of the image. Only the vector assigned to that class and the NCFV are passed to the decoder; for example, if the class is C1, the vector passed to the decoder is $[z_{s1}, z_{nc}]$. For the NCFV, the adversarial loss is used, as in Equation (7). The number of classes that can be trained is limited in this approach, and training is slow compared with ECLF because of the class-specific training procedure.
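A minimal sketch of the class-conditional decoder input described above, assuming PyTorch tensors and a binary class label, is shown below; the names z_s1, z_s2, and z_nc are illustrative and not taken from the original implementation.

```python
# Sketch of the class-conditional decoder input in ECLF-CS: the latent vector is
# split into two class-specific parts plus the NCFV, and only the part assigned
# to the instance's class is passed to the decoder. Names are illustrative.
import torch

def decoder_input(z, y, s_dim, nc_dim):
    """z: (B, 2*s_dim + nc_dim) latent vectors; y: (B,) class labels in {0, 1}."""
    z_s1, z_s2, z_nc = torch.split(z, [s_dim, s_dim, nc_dim], dim=1)
    z_cls = torch.where(y.view(-1, 1) == 0, z_s1, z_s2)  # class C1 -> z_s1, C2 -> z_s2
    return torch.cat([z_cls, z_nc], dim=1)
```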

Differences between ECLF and ECLF-CS
As explained in Section 3.2, ECLF and ECLF-CS features have different characteristics; ECLF features present the characteristics of the entire dataset, while ECLF-CS features only present the classifiable characteristics of a single category. Therefore, ECLF-CS features can provide direct information on the presence of a given category.

Classification and Important Feature Visualization for ECLF-CS Features
The classifiers for ECLF-CS are trained in the same manner as the classifier for ECLF.
Since there are two class-specific vectors (CFVS1 and CFVS2), they are merged before the classifier is trained, as shown in Equation (20):
$z_{cls} = [z_{s1}, z_{s2}].$ (20)

The important features were determined and visualized according to the procedure described in Section 3.1.3. In contrast to ECLF, when a feature is selected as important in ECLF-CS, feature visualization is performed depending on the feature vector from which the feature was obtained. If it came from CFVS1, it is possible that CFVS2 was not used to produce $x'$, and vice versa.

VAE Architecture
During training, the convolutional layers were pretrained using the entire PlantVillage dataset [38], [12], which contains diseased and healthy plant leaves. Then, training on specific datasets was conducted, which is discussed in Section 4.1.1.

Architecture of the Discriminator and Supportive Classifier
All the classifiers and discriminators have three fully connected layers with a rectified linear unit (ReLU), and the output size of the CFV determines the input size of the supportive classifier.

VAE Training Stage
The VAE was pretrained using the entire PlantVillage dataset for 106,000 iterations, using only the convolutional parts, which together act as a pure encoder-decoder architecture.

Figure 12 Encoder-decoder architecture for pretraining
Up to 120,000 iterations, the system was trained using only three of the loss terms shown in Equation (14). The warmup phase [39] was then applied to the remaining terms for training up to 140,000 iterations. Although the warmup phase is normally applied to the KL divergence terms, it was also used on the other terms to balance their learning with the KL divergence terms. The results were saved and evaluated every 20,000 iterations, and the final results were taken after 1,500,000 iterations.
Ideally, $x'$ would be a good representation of $x$; however, this is not always possible, and in such cases the faithfulness of the reconstruction comes into question. In recent years, several researchers, including [32], have attempted to solve this problem. Following [32], a discriminative regularization loss based on a VGG-16 network [29] was used as a discriminative regularizer, with the first three layers of the network used for the discriminative regularization.

Final Classifier Training Stage
The training was conducted for 5,000 iterations and the best validation accuracy was used for the testing. The classifier in the final classifier training had the same architecture as the supportive classifier.

Linear Classifier Training
The linear classifier, which is used in the explanation phase, was trained using samples generated by the VAE, as explained in Section 3.1.3. The input size of the linear classifier was the same as the size of the CFV, and the linear classifier had two outputs. One thousand sample points were used to train the linear classifier, and the mean squared error (MSE) between the nonlinear classifier activations and the linear classifier activations was minimized so that the linear classifier represents decisions near the boundary.

PlantVillage Dataset Subsets
Part of the PlantVillage dataset from [38], [12], which contains 39 classes of images (12 healthy and 27 diseased classes) with a single leaf in each image, was used in this study. Segmented versions of the leaf images, which were created in [12], were used.
If the full PlantVillage dataset is used, the system would have to learn not only the differences between the respective diseases and between healthy and diseased plants but also the differences between the plant types, for example, the differences between grape and potato leaves. In a real field or application condition, however, the type of plant is known and only the distinction between healthy and diseased leaves needs to be determined. Therefore, original datasets that contained only one type of leaf were created from the PlantVillage dataset (Table 1) for this purpose. Since some classes, such as "potato healthy," had very few images, the datasets were restricted to 30 validation and testing images per class.

Figure 13 Sample leaves from three original datasets

Figure 13 shows sample leaves from the different datasets as well as the features that were used to differentiate the classes. The apple dataset seems to have more subtle features compared with the other two datasets.

Visualization of Decision Explanations of ECLF
The VAE with a 320-dimensional latent vector was trained for 1,500,000 iterations, which increased the classification accuracy as well as the reconstruction loss. For the Grape4, Apple4, and Potato3 datasets, the classification accuracies were 96.7, 90.0, and 94.4%, respectively. The visual quality of the reconstructions increased, although the reconstruction error seemed to increase owing to overfitting. Therefore, it was assumed that the classifiable features require more iterations for training.

Figure 14 A - Sample leaves from 3 datasets, B - Feature encroachment to healthy side

Several difficulties were encountered when obtaining the explanations, one of the most severe being that the reconstruction was not 100% accurate. The decoder recreates the image from the latent vector supplied by the encoder, and some information is lost in the latent vector, which may have led to this loss in reconstruction accuracy. Research is being conducted to create VAEs that can generate images very close to the original image [40]. Moreover, the explanations are sometimes not easily detectable by the human eye.

Visualization of Class Differences
Traveling from point (A) to point (B) in Figure 15, the manner in which the classifiable features change from class A to class B can be visualized. Given that it is easy to interpret such changes between diseased and healthy classes, the diseased class was used as class A and the healthy class as class B to visualize the features changing from class A to class B. Moreover, it is also advantageous to understand which features were used close to the decision boundary; therefore, travel was also conducted from point (a) to point (b) for visualization purposes. Higher values of β appeared to remove the high-frequency features, whereas at lower values it became difficult to distinguish the differences between individual features. A relatively good separation was observed at β = 1000. However, as can be seen from Figure 18, a β value of 1000 resulted in a less accurate classification. Thus, a very high β value may force explainability while compromising accuracy.

Accuracy of ECLF
To test the effect of dimensionality on the reconstruction, latent vectors of 20, 40, 80, 160, and 320 dimensions were used. The iteration at which the sum of the VAE-related loss terms was the minimum was selected, because these terms are directly related to the variational autoencoder loss function; note that β and γ were kept constant.
Figure 20 shows how the classification accuracy changes with the dimensionality of the latent vector.

Effect of Dimensionality on the Visual Decision Explanations of ECLF
In this section, we focus on the effect of dimensionality on visual decision explanations.

Figure 21 Feature interpolation visualizations for grape late blight to healthy

In Figure 21, the visualization was conducted in a manner similar to that in Figure 18, and it shows that, when the dimensionality increases, the images become closer to the ground truth image. However, lower dimensions use more globally distributed features of an image, such as color. In contrast, with higher dimensions in the latent vector, the features tend to move toward local features, such as lesions and local color changes.
At 20 dimensions, the images clearly showed a color change from yellow to green, in addition to changes in shape. In this case, explainability becomes easy to achieve, but the reconstruction quality, reconstruction accuracy, and classification accuracy are compromised.
In contrast, in higher dimensions, the accuracy and the reconstruction quality increase, but explainability, in terms of both disentanglement and the number of dimensions, is compromised.

Experimental Results of ECLF-CS Classification Accuracy
Using ECLF-CS, the classification accuracies for the Apple2, Grape2, and Potato2 datasets were 98.3, 98.3, and 100.0%, respectively, which were higher than those of ECLF. In particular, the accuracy on the Apple2 dataset was higher, which was attributed to the class-specific training and to the classifier only having to handle two classes. Even with a very high number of iterations (1,400,000), the classification accuracies were 100.0, 96.8, and 100.0% for the Apple2, Grape2, and Potato2 datasets, respectively, implying that no significant changes were observed.

ECLF-CS Feature Visualization
A 160-dimensional VAE, trained for 1,400,000 iterations, was used in this experiment to obtain the ECLF-CS features, using the Apple4 and Grape4 classes. As previously mentioned, the lowest loss point, where the sum of the VAE-related loss terms was the minimum, was used, and the two classes were visualized by dropping the other class in the visualization. For example, while the diseased side of the encoder produces a diseased vector, the decoder produces an image from the diseased vector passed to it. Therefore, there can be some information loss; however, what the decoder produces reflects the information that the classifier used for the classification. The healthy side likewise tries to produce an image that is close to a healthy image. These images show us what the VAE sees in the latent space (Figure 22). Figure 19 shows the difference between the reconstructed diseased and healthy images at the lowest loss point and after 1,400,000 iterations. ECLF-CS also seems to follow the same trend as the multiclass classification in that a higher number of iterations provides better visualization.

Figure 23 Important feature visualization for ECLF-CS
In Figure 23, the features are class-specific; however, in the classification stage, they are used to differentiate between classes; thus, the figure shows which part of each feature is considered as belonging to another class in the classification. The features of the two-class classification can clearly be divided into classes A and B. Grape feature 1 belongs to the grape late-blight class, and grape feature 3 belongs to the healthy grape class. Apple features 1 and 4 belong to the apple scab (diseased) and healthy apple classes, respectively.
In ECLF, the features are trained to cross the classes during the VAE training, and therefore, it is easy to understand how the features behave. However, in ECLF-CS, the features are not trained to cross the classes, so the feature variations must be visualized within the class images. Figure 24 shows a reconstruction comparison between the ECLF and ECLF-CS features, with those in ECLF-CS appearing to be of better quality, which may be due to the class-specific training conducted during the VAE training stage. Furthermore, class-specific training may also explain why the apple class showed a larger improvement in the classification accuracy. Compared with multiclass classification, two-class classification seems to handle subtle features more easily. The apple dataset has diseased features that are not very prominent to the eye, such as lesions and darkening.

Comparison with Other Methods
The VGG-11 network [29] was used to compare the classification accuracy of ECLF and ECLF-CS. For the training of ECLF on two classes, the same datasets that were used for ECLF-CS were employed, and the same training and testing parameters that were used in ECLF were used. It must be noted that the VGG-11 network was not exclusively optimized for any class. For the Grape4, Grape2, Apple2, and Potato2 datasets, nearly the same classification accuracies were achieved using VGG-11 and ECLF; however, ECLF showed a decrease in accuracy on the Apple4 and Potato3 datasets. In the case of the Apple4 dataset, the low accuracy was attributed to the network's dependence on high-frequency features for classification. This dependency was visually confirmed for the Apple4 dataset, as shown in Figure 13.
Moreover, as discussed in Section 4.2.4, the classification accuracy on the Apple4 dataset depends on the value of β, as the number of high-frequency features reduces with the increase in β. Therefore, the classification accuracy and the high-frequency features were found to have a high correlation in the case of the Apple4 dataset. As such, ECLF exhibits a low capacity to capture the high-frequency classifiable features that are required for the classification of the Apple4 dataset under high explainability conditions. Moreover, as mentioned in Section 4.2.1, the accuracy on the Apple4 dataset increased to 94.4% at 1,500,000 iterations, but this phenomenon needs further investigation. Although a higher accuracy and explainability can be assumed if the β value is reduced during training, an increase in the correlation between individual features may occur (Figure 16), which would hamper the explainability of the classification results.
As seen in Figure 13, which shows sample leaves from the Potato3 dataset, and in Figure 17, which shows an almost constant relationship with increasing β, the Potato3 dataset appears to be less dependent on high-frequency features, which implies a low correlation with high-frequency features. The reduced accuracy in the case of the Potato3 dataset may be attributed to overfitting of the model in the 320-dimensional latent space. As can be seen from Figure 20, the classification accuracy on the Potato3 classes can reach up to 93.3% in an 80-dimensional latent space. However, ECLF showed performance competitive with the VGG-11 network in the two-class classifications. This may be because only two classes were used, which allowed additional classifiable information to be captured for each class.
ECLF-CS showed performance competitive with VGG-11 and ECLF on the Grape2 and Potato2 datasets, and ECLF-CS showed improved performance on the Apple2 dataset, which was attributed to the class-specific training performed during the training of ECLF-CS. Although the reasons for the performance differences of ECLF and the performance improvements of ECLF-CS on specific datasets compared with VGG-11 were discussed, further studies are needed to verify the stated conjectures.

Conclusions
This study details a new approach that deviates from the conventional important-area visualization approaches in order to explain the underlying reasons that lead to a classification based on feature variations. The proposed network was trained with datasets extracted from the PlantVillage dataset and achieved acceptable accuracy with high explainability. In the future, this algorithm could be used to identify new disease symptoms that human evaluators may otherwise miss. There are some limitations to our study, including the low quality of the visualizations and the reduced classification accuracy of ECLF, which must be addressed in future studies.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.