Detection of grapevine yellows symptoms in Vitis vinifera L. with artificial intelligence

Grapevine yellows (GY) are a significant threat to grapes due to their severe symptoms and the lack of treatments. Conventional diagnosis of the phytoplasmas associated with GY relies on symptom identification, owing to the sensitivity limits of diagnostic tools (e.g. real-time PCR) in asymptomatic vines, where the low concentration of the pathogen or its erratic distribution can lead to a high rate of false negatives. The primary symptoms of GY are leaf discoloration and irregular wood ripening, which can easily be confused with symptoms of other diseases, making recognition a difficult task. Herein, we present a novel system, utilizing convolutional neural networks, for end-to-end detection of GY in red grapevine (cv. Sangiovese) using color images of leaf clippings. The diagnostic test detailed in this work does not require the user to be an expert at identifying GY. Data augmentation strategies make the system robust to alignment errors during data capture. When applied to the task of recognizing GY from digital images of leaf clippings, amongst many other diseases and a healthy control, the system has a sensitivity of 98.96% and a specificity of 99.40%. Deep learning has a 35.97% and 9.88% better positive predictive value (PPV) when recognizing GY from sight than a baseline system without deep learning and trained humans, respectively. We evaluate six neural network architectures: AlexNet, GoogLeNet, Inception v3, ResNet-50, ResNet-101 and SqueezeNet. We find ResNet-50 to be the best compromise between accuracy and training cost. The trained neural networks, code to reproduce the experiments, and the leaf clipping image data are publicly available online. This work will advance the frontier of GY detection by improving detection speed, enabling a more effective response to the disease.


Introduction
Grapevine yellows (GY) are among the most important diseases currently studied in grapevine. Among GY, two causal phytoplasmas are a major concern: Flavescence dorée (FD) and Bois noir (BN). FD (Candidatus Phytoplasma vitis) is a member of the Elm Yellows group (16SrV) (Martini et al., 1999), and BN (Ca. Phytoplasma solani) is a member of the Stolbur group (16SrXII) (Quaglino et al., 2013). These are considered the most dangerous phytoplasmas found in all major winegrowing areas of Euro-Mediterranean countries, Chile and Asia (Gajardo et al., 2009; Belli et al., 2010; Mirchenari et al., 2015). In Europe and the Mediterranean basin, the causal agent of FD is classified as a quarantine pest. FD is vectored from vine to vine by Scaphoideus titanus, a non-native ampelophagous leafhopper from North America (Chuche and Thiéry, 2014). Conversely, the BN etiological agent is transmitted by the polyphagous leafhopper Hyalesthes obsoletus, endemic to Europe (Maixner, 1994). Current control strategies include uprooting of infected plants and vector control with pesticides. Laboratory-limited approaches include antibiotic treatment for cultivar recovery and meristem tip culture, which can generate phytoplasma-free in vitro material (Bertaccini, 2007), but these do not provide successful phytoplasma disease control. A more efficient strategy to limit phytoplasma disease diffusion is needed.
Detection protocols based on molecular assays for phytoplasmas are not completely reliable when applied to woody plants, due to the low concentration of the pathogen and its erratic distribution in these hosts (Belli et al., 2010). Thus, diagnosis of phytoplasma diseases in grapevines relies on symptom identification, and visual symptom recognition plays a strategic role in phytoplasma control. Further, effective sampling of plants reduces the risk of false negatives. BN and FD differ in etiology and epidemiology but are indistinguishable by symptoms: desiccation of inflorescences, reduction of growth, leaf discoloration (showing typical sectorial reddening in red cultivars), berry shrivel and irregular maturation of wood. The time at which sampling occurs (late summer) makes it difficult to distinguish GY from other grapevine disorders that may show similar symptoms. For example, GY symptoms can be similar to leafroll or to direct damage due to leafhopper feeding (Belli et al., 2010). In particular, in red grapes, esca disease (caused by a fungal complex) and grapevine leafroll (caused by viruses) can cause leaf discolorations that may be chromatically similar to GY. The insect Stictocephala bisonia causes a sectorial reddening of leaves that is also quite similar to that observed in red grapes infected by GY. Thus, symptoms of GY can be confused with symptoms of other diseases, even by experts (see Fig. 1).
Supervised learning has the potential to distinguish GY from other diseases (Cruz et al., 2017; Sharif et al., 2018; Zhou et al., 2014). It is automatic, requiring no prior expertise or skill from the user. It can support sampling procedures by reducing false positives in large-scale vineyard monitoring. To the best of the authors' knowledge, we are the first to investigate end-to-end symptom identification of GY with deep learning.

Related work and motivation
Accurate and timely diagnosis of disease is the most important tactic for plant disease control. But, when carried out by humans, it requires continuous monitoring and manual observation; this is prone to error, time-consuming, and costly (Ali et al., 2017). Automatic tools that identify and detect plant diseases could be an effective solution for monitoring real-time disease diffusion in crop fields (Abdulridha et al., 2018; Ampatzidis et al., 2017; Zhou et al., 2014). In recent years, various studies focused on the development of computer vision techniques to detect disease from leaf clipping images. Kaur et al. (2018) developed a system to detect disease in Glycine max with a k-means based segmentation algorithm. The system detects downy mildew, frog eye and Septoria leaf blight from images collected by the PlantVillage project (https://plantvillage.psu.edu/). A combination of color and texture features achieves an accuracy of 90.7%. Zhou et al. (2014) presented a novel representation of a leaf clipping image (also known as a feature representation, or feature) called orientation code matching. The feature overcomes differences in light source, occlusion, rotation and translation variations. The authors demonstrate good precision, recall and F-measure results when the feature is paired with a support vector machine (SVM). Sengar et al. (2018) proposed an intensity thresholding method for measuring the progression of powdery mildew in cherries; this study also used the PlantVillage dataset. Sharif et al. (2018) presented a hybrid method for detecting and identifying diseases in citrus plants. The method segments citrus lesion spots with an optimized weighted segmentation method; uses color, texture, and geometric features; carries out feature selection with dimensionality reduction, entropy, and a skewness-based covariance vector; and finally classifies the samples with a support vector machine.
The method is tested on the citrus disease image gallery dataset, PlantVillage and a local dataset, and is capable of detecting anthracnose, black spot, canker, scab, greening, and melanose. Ali et al. (2017) proposed a method based on color difference to segment diseased parts of Kinnow mandarin leaves. Together with color histogram and texture features, the authors achieve good accuracy and area under the curve (AUC).
Recently in computer vision, there is a focus on deep learning algorithms (LeCun et al., 2015; Szegedy et al., 2015). Deep learning, also known as neural networks, is a type of supervised learning algorithm. Supervised learning is a field of artificial intelligence (AI) and machine learning (ML). In supervised learning, a system automates a domain-specific task; some examples are face recognition, speech recognition, and object recognition. The algorithm develops a model of inference and reasoning from a set of labelled training samples. The goal is for the system to make correct decisions on a different set of test samples, carrying out automatic deduction of a sample's membership to a specific population. Supervised learning can be an effective tool for diagnosing pests and diseases. AI and ML can provide expert analysis and diagnosis of plant diseases; the end-user does not need to be an expert at identifying the disease. Deep learning is capable of even better performance than the mentioned related work (Cruz et al., 2017; Sharif et al., 2018). In previous work, supervised learning detected diseases with similar symptoms and pests, while reducing diagnosis time and cost (Luvisi et al., 2016). For example, Cruz et al. (2017) developed a vision-based program to detect symptoms of Olive Quick Decline Syndrome (OQDS) on leaves of Olea europaea L. infected by Xylella fastidiosa. This has great potential to distinguish symptomatic and asymptomatic leaves for sample selection in large-scale monitoring programs (Cardinale et al., 2018) and in the field.

Objectives
Herein, we demonstrate the potential for deep learning to detect grapevine yellows (GY) in Vitis vinifera L. cv. Sangiovese, a red grape cultivar sensitive to this disease (Pierro et al., 2018a,b). An overview of the proposed system is given in Fig. 2. To the best of the authors' knowledge, we are the first to approach this problem with deep learning. This work addresses the following research questions: (1) Is it possible to automatically detect GY from leaf clipping images?
And, can it be distinguished from other, similar looking diseases? (2) Among the many deep learning architectures available, which algorithm is best? (3) How does the proposed system compare to related work that does not use deep learning? (4) How do human recognition rates compare to the machine recognition rates?
Surprisingly, few AI and ML studies compare against human performance. Previous researchers were content to study AI and ML for the sake of advancing the frontier of research, or it was assumed that the machine would be better. One study (Russakovsky et al., 2015) applied ML to predict 1000 different object categories and found that the human recognition rate was roughly the same as or slightly worse than AI recognition rates (1.7% better to 6.1% worse). Another study (Dodge and Karam, 2017), which also used 1000 different object categories, found that human performance is roughly the same as AI performance. However, these studies cover many different object categories. There are no studies of AI vs. human performance for plant disease diagnosis, and none for our specific problem (grapes and grapevine yellows).

Data and methods
Two sets of data are aggregated for this study. Section 2.1.1 discusses the first of the two datasets. Plants were surveyed, sampled, photographed and then diagnosed for GY with DNA analysis of the pathogen. With supervised learning, the ML algorithm learns from examples (datasets). In training, the algorithm is fed a digital image with a correct diagnosis and learns to recognize visual cues associated with that diagnosis. The membership of a sample to a disease population must be correct or the ML algorithm will learn the wrong cues. Visual inspection is not enough to verify GY, thus samples are lab-verified to carry the BN and FD phytoplasmas with DNA extraction followed by a real-time PCR test. Section 2.1.3 discusses the second of the two datasets, PlantVillage (https://plantvillage.psu.edu/). The machine learning algorithms used in this work require a very large dataset, larger than the first dataset we collected. We use PlantVillage data to fulfill this requirement; this is called data augmentation. PlantVillage data comes with labels, so no further verification is performed. Section 2.1.5 discusses how we further augment the data by generating multiple samples from a single sample with image processing.

Sampled data
Field surveys were conducted in Tuscany (Central Italy) from July to October 2017 in vineyards of cv. Sangiovese located in several districts. Healthy control (HC) samples of cv. Sangiovese were collected from the greenhouse of the Department of Agriculture, Food and Environment (DAFE, University of Pisa, Italy). Infected control (IC) leaves were collected from V. vinifera plants that were previously assessed by molecular tools and found to be infected by 'Ca. P. solani' (subgroup 16SrXII-A) and Flavescence dorée phytoplasmas (subgroups 16SrV-C and -D).
To further stress the system, we also collected non-GY leaves from grapevines that showed other diseases: downy mildew, esca disease, grapevine leafroll, powdery mildew and Stictocephala bisonia. Recognition of non-GY diseases was carried out visually, by evaluating symptoms. Images of downy mildew, grapevine leafroll, powdery mildew and S. bisonia were combined into a single population because there were insufficient samples to form separate populations. We refer to this population as other diseases (OD). In total, we collected 134 healthy controls, 134 GY samples and 104 samples of other diseases.

DNA verification of sampled data
The following describes how we verify GY for collected data with DNA extraction and real-time PCR tests. For each grapevine, 10-12 leaves were collected, and their fresh central midribs were dissected and stored at −20 °C until DNA extraction. DNA was extracted with a 2% cetyltrimethylammonium bromide (CTAB) based buffer from leaf veins according to the protocol described by Li et al. (2008), with the modifications that follow. Briefly, 1 g of leaf veins was homogenized in plastic bags (Bioreba, Switzerland) with 5 ml of 2% CTAB buffer using a Homex 6 (Bioreba, Switzerland). The homogenate was incubated at 65 °C for 15 min. DNA was extracted with one volume of chloroform:isoamyl alcohol (24:1) and precipitated with one volume of isopropanol. Pellets were washed with 70% ethanol, air-dried, suspended in 100 µl of deionized water and stored at −20 °C until use. Concentrations of the nucleic acids were determined by measuring the absorbance at 260 nm with a spectrophotometer. Purity was assessed by calculating the ratio of the absorbance at 260 nm over the absorbance at 280 nm. Specific detection of phytoplasmas associated with BN and FD was carried out by amplification of 16S ribosomal DNA through a TaqMan assay using the Rotor-Gene Q (Qiagen, Germany), following the reaction conditions described by Angelini et al. (2007). The template used in the assay was a 1:10 dilution of the DNA extracted from the samples. The grapevine chloroplast chaperonin 21 gene was used as the endogenous control, while DNA extracted from HC plants and ICs were used as negative and positive controls, respectively. DNA samples that gave no positive signal for the endogenous gene were further cleaned up and tested with real-time PCR until they yielded the specific control amplicon. A threshold cycle (Ct) < 37 was associated with the presence of GY phytoplasmas (Mori et al., 2015).
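The endpoint interpretation of the assay can be sketched as a small decision rule. The function and argument names below are hypothetical; only the Ct < 37 cutoff and the endogenous-control retest rule come from the protocol above.

```python
# Hypothetical sketch of the decision rule described above: a sample is called
# GY-positive when the phytoplasma TaqMan assay crosses threshold before cycle
# 37, provided the endogenous control (chaperonin 21) amplified. All names here
# are illustrative, not from the original study.

CT_CUTOFF = 37.0  # Ct < 37 associated with presence of GY phytoplasmas (Mori et al., 2015)

def call_gy(phytoplasma_ct, endogenous_ct):
    """Return 'positive', 'negative', or 'retest' for one qPCR sample.

    phytoplasma_ct / endogenous_ct: Ct values, or None if no amplification.
    """
    if endogenous_ct is None:
        # No signal for the endogenous gene: DNA quality suspect; clean up and retest.
        return "retest"
    if phytoplasma_ct is not None and phytoplasma_ct < CT_CUTOFF:
        return "positive"
    return "negative"
```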

Transfer learning
The machine learning algorithms used in this work require a very large dataset. This helps the algorithm learn discriminative features of the disease of interest and reduces false negatives by providing the system with knowledge of cues that are irrelevant to GY. However, sourcing the required number of images can pose a challenge. As a rule-of-thumb, thousands to hundreds of thousands of images are normally expected. It would be too costly and time-consuming to collect such an order of magnitude of images, particularly for leaf clippings of specimens infected with a quarantine pathogen. Thus, we use a concept called transfer learning (Schmidhuber, 2015; Yosinski et al., 2014), described as follows: (1) The algorithm is trained on some other dataset that has a sufficient number of data samples. For example, in Karpathy et al. (2014) images from the ImageNet grand challenge are used. ImageNet has over a million data samples and 1000 different populations (object categories).
Fig. 2. (1) Imaging; (2) deep learning, a machine learning (ML) and artificial intelligence (AI) algorithm; (3) end-to-end diagnosis of GY symptoms and other diseases.
(2) The algorithm is then retrained with the data at hand.
The idea is that with a large, somewhat related dataset, the ML method can learn relevant patterns that are generally applicable across problem domains. It is then re-trained on the data at hand, and doing this is better than attempting to train the ML algorithm from scratch (Cruz et al., 2017; Schmidhuber, 2015; Yosinski et al., 2014). The ML algorithms in this work are publicly obtainable with step (1) already completed; such a model is also known as a pre-trained network. The reader is referred to those works for details of the training procedure (Krizhevsky et al., 2012; He et al., 2015; Szegedy et al., 2015; Iandola et al., 2016; Szegedy et al., 2016).

Data augmentation from publicly available sources
Even with transfer learning, we still need to add more data to our dataset for the ML algorithm to work. We add grapevine leaf clipping images from the publicly available PlantVillage dataset (https://plantvillage.psu.edu/). Specifically, we use the subset that is publicly available from Mohanty et al. (2016). The subset we use contains the following grapevine diseases: black rot, esca and leaf blight. This dataset does not contain any GY samples because GY is a nascent problem. Even so, we add it to our data to meet the minimum sample requirement for training a deep ML algorithm. We used all of the grapevine images provided by the dataset: 1180 images of grapevine with black rot; 1383 with esca (black measles); 1076 with leaf blight (Isariopsis leaf spot). The collected esca samples were merged into the augmented esca sample population. The reader is referred to Mohanty et al. (2016) for the methodology of health assessment for the other diseases.

Data augmentation from image preprocessing
Before the images are processed by the neural network, we apply image processing techniques to center the leaf in the image. This procedure also removes unnecessary background information. In this step, the data is further augmented with translation and rotation of the original samples. First, the image is segmented from the background and noise; this automatically centers the leaf and removes the background from the canvas of the image. Second, the image is randomly flipped along the horizontal axis with a rate of 0.5. Third, the image is rotated randomly by an angle in [−30°, 30°] about the center of the image and translated independently along the vertical and horizontal axes by a random offset in [−30, 30] pixels. Steps two and three generate many samples from an original image (Fig. 3). Perturbing the original images prevents the machine learning algorithm from overfitting to spatial image artifacts and allows a certain amount of tolerance when orienting the leaf during acquisition. As a result, the system does not need the leaves to be perfectly aligned, though the apex must be at the top and the petiole at the bottom. The pre-trained neural networks used in this work require an image of a specific size. We resize all leaf clipping images to 227 × 227, the input image size required by the AlexNet neural network architecture. For images of uneven aspect ratio, the horizontal axis is resized to match the required size. The data used in this study is available publicly on GitHub.1
In the following, we explain the segmentation procedure in detail. First, we obtain the leaf mask, a black and white image where the white pixels correspond to the leaf. The leaf mask is obtained by: 1. Converting the original RGB image to grayscale with a YIQ transformation; 2. Blurring the grayscale image with a 3 × 3 Gaussian filter to remove noise; 3. Obtaining a rough leaf mask by thresholding with Otsu's algorithm (Baxi and Vala, 2013); 4. Applying a morphological close operation to the leaf mask with the goal of removing the petiole from the final image (the close operation uses a 15 × 15 disc); 5. Finally, median filtering with an 11 × 11 filter to remove small objects due to noise.
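The augmentation steps above (random flip with rate 0.5, rotation in [−30°, 30°], translation in [−30, 30] pixels) might be sketched as follows. This is a numpy-only illustration using nearest-neighbor resampling, not the authors' MATLAB implementation.

```python
import numpy as np

# A minimal numpy-only sketch of the augmentation described above: random
# horizontal flip (p = 0.5), rotation in [-30, 30] degrees and independent
# translations in [-30, 30] pixels. A production pipeline would typically use
# an imaging library with interpolation; nearest-neighbor sampling is used
# here to stay self-contained.

def augment(img, rng):
    """img: (H, W, 3) uint8 array. Returns one randomly perturbed copy."""
    h, w = img.shape[:2]
    # 1. Random horizontal flip with rate 0.5.
    if rng.random() < 0.5:
        img = img[:, ::-1]
    # 2-3. Random rotation about the image center plus random translation,
    # implemented by inverse-mapping each output pixel to a source pixel.
    theta = np.deg2rad(rng.uniform(-30, 30))
    ty, tx = rng.integers(-30, 31, size=2)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    y0, x0 = ys - cy - ty, xs - cx - tx
    src_y = np.round(cy + np.cos(theta) * y0 - np.sin(theta) * x0).astype(int)
    src_x = np.round(cx + np.sin(theta) * y0 + np.cos(theta) * x0).astype(int)
    valid = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.zeros_like(img)          # out-of-bounds pixels stay black
    out[valid] = img[src_y[valid], src_x[valid]]
    return out

rng = np.random.default_rng(0)
leaf = rng.integers(0, 256, size=(227, 227, 3), dtype=np.uint8)  # stand-in image
samples = [augment(leaf, rng) for _ in range(5)]  # five variants of one image
```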
The original image is cropped to the minimum bounding box enclosing the leaf mask. Note that classification in later stages uses RGB; grayscale is only used for calculating the leaf mask. This procedure is automatic. An overview of the segmentation procedure is given in Fig. 4.
Fig. 3. The computer methods in this work require thousands to tens of thousands of data samples. Such an order of magnitude of samples would be too difficult to collect and too costly to lab verify. One measure to address this is data augmentation with image processing. This figure demonstrates how a single sample can generate many samples by randomly translating, rotating and flipping the images. (A) Original. (B-F) Augmented images generated from A.
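The five-step leaf-mask procedure and the bounding-box crop can be sketched as follows, with scipy.ndimage standing in for the original imaging toolbox. The Gaussian sigma and the keep-darker-region assumption at the thresholding step are illustrative choices.

```python
import numpy as np
from scipy import ndimage

# A hedged sketch of the segmentation pipeline described above (YIQ grayscale,
# 3x3 Gaussian blur, Otsu threshold, 15x15 disc close, 11x11 median filter,
# bounding-box crop). Parameter details are assumptions, not the authors' code.

def otsu_threshold(gray):
    """Return the Otsu threshold of a grayscale image with values in [0, 1]."""
    hist, edges = np.histogram(gray, bins=256, range=(0.0, 1.0))
    p = hist / hist.sum()
    omega = np.cumsum(p)                    # class probability up to each bin
    mu = np.cumsum(p * np.arange(256))      # cumulative mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu[-1] * omega - mu) ** 2 / (omega * (1.0 - omega))
    return edges[np.nanargmax(sigma_b)]     # maximize between-class variance

def leaf_mask(rgb):
    """rgb: (H, W, 3) float array in [0, 1]. Returns a boolean leaf mask."""
    gray = rgb @ np.array([0.299, 0.587, 0.114])        # 1. YIQ luma channel
    gray = ndimage.gaussian_filter(gray, sigma=1.0, truncate=1.0)  # 2. 3x3 blur
    # 3. Rough mask by Otsu; here the leaf is assumed darker than the background.
    mask = gray < otsu_threshold(gray)
    yy, xx = np.mgrid[-7:8, -7:8]                       # 4. close with 15x15 disc
    disc = (yy ** 2 + xx ** 2) <= 49
    mask = ndimage.binary_closing(mask, structure=disc)
    # 5. 11x11 median filter removes small noise objects.
    mask = ndimage.median_filter(mask.astype(np.uint8), size=11).astype(bool)
    return mask

def crop_to_mask(rgb, mask):
    """Crop the RGB image to the minimum bounding box enclosing the mask."""
    ys, xs = np.where(mask)
    return rgb[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```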

Background of deep learning
Deep learning was demonstrated to have high accuracy in other fields and has been successfully applied to the detection of a variety of crops and diseases (Mohanty et al., 2016; Cruz et al., 2017). Deep learning is a collection of methods that improve optimization and generalization of neural networks and allow neurons to share parameters (convolutional neural networks). Deep learning has gained popularity over the past few years, though it is not a new concept; the term may have been used to describe neural networks as early as the 1980s (Dechter, 1986). Backpropagation, the training mechanism of a deep learning algorithm, has long been used to train neural networks (LeCun et al., 1989). A neural network consists of many ordered layers. In a feed-forward neural network, signals are propagated sequentially from the front layer to the end layer of the network. A layer consists of a set of neurons that receive an input stimulus x, where each element x_i is a real number that is the output of some other neuron or, with image data and for the first layer only, an image pixel. x is the ordered output of all the neurons in the previous layer. The number of neurons, the number of layers, and the structure of the layers are experimental parameters, and commonly used structures are given later in Section 2.2.2. The input is linearly weighted by the weight vector w. An overview of a single neuron is given in Fig. 5.
The result is the scalar charge X. If the charge is sufficient, the neuron fires and sends stimulus to each neuron in the next layer of the network. The activation function defines the nature of how neurons fire. The neural networks in this work use a rectified linear activation function (ReLU), described as follows (Nair and Hinton, 2010):

$y = \max(0, X)$ (1)

where $\max(\cdot)$ returns the maximum value of its input arguments. It is currently the most popular activation function (Krizhevsky et al., 2012; He et al., 2015; Szegedy et al., 2015; Iandola et al., 2016; Szegedy et al., 2016) because it addresses the problem of the vanishing gradient. The neural network is a generally feed-forward structure, where information propagates layer by layer through the network. At the front end of the network, the whole image is given as input. The end of the network reports the confidence of the membership of a sample to a particular population. Unlike the other layers, the end layer uses a softmax activation function to generate pseudo-probabilities. Let $\Omega$ be the set of neurons in the output layer, and $y_i$ be the prediction for class i:

$y_i = e^{X_i} / \sum_{j \in \Omega} e^{X_j}$ (2)
Eq. (2) forces the set of outputs of $\Omega$ to sum to 1, resembling a probability. The proper weights w are obtained by minimizing the following loss function (Girosi et al., 1995; Bishop, 1995; Nielsen, 2015):

$L(w) = -\ln y_d + \frac{\lambda}{2} \lVert w \rVert^2$ (3)

The first term is the cross-entropy cost function, where $y_d$ is the pseudo-probability for the correct class. The second term is a regularization term; it prevents overfitting the training data. $\lambda$ is the regularization parameter. It can be shown that to minimize the loss of Eq. (3), the weights can be updated as follows:

$w_{t+1} = w_t - \eta \nabla L(w_t) + \mu \Delta w_t$ (4)

This equation is referred to as stochastic gradient descent with momentum, also known as the generalized delta rule. $w_{t+1}$ is the updated weight and $w_t$ is the current weight. In the gradient term, $\eta$ is the learning rate, a small non-zero value. $\eta$ is small so that a single sample cannot alter the model too much; this is useful for outliers that are not consistent with the current model. $\eta$ decreases over time to allow the model to converge. $\nabla L(w_t)$ is the error gradient. The second term is called momentum. It can help the model overcome local minima by allowing the model to continue along the gradient; otherwise, the model might find a suboptimal solution to Eq. (3) in a local minimum. $\mu$ is the momentum weight, typically much larger than $\eta$. $\Delta w_t$ are the weight updates from the previous step. A more common weight update rule is mini-batch stochastic gradient descent with momentum. It has a more stable descent of the error gradient by splitting the training data into mini-batches. It calculates updates to the model on the set of samples in the mini-batch as a whole, then carries out the update:

$w_{t+1} = w_t - \frac{\eta}{m} \sum_{i=1}^{m} \nabla L_i(w_t) + \mu \Delta w_t$

where m is the number of samples in a mini-batch. Neural networks have thousands of weights, and the above equation must be carried out for each mini-batch.

Fig. 5. This is an overview of a single neuron. A neuron receives input stimulus x, and b is the bias term. The resulting charge X is the dot product of x and w plus the bias term; it is thresholded by the activation function to produce the output y.
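The single-neuron forward pass, the ReLU and softmax activations of Eqs. (1)-(2), and one SGD-with-momentum step of Eq. (4) can be illustrated with a short numpy sketch. All sizes and hyperparameter values are toy choices, and the gradient is a random stand-in rather than a real backpropagated error.

```python
import numpy as np

# Illustrative numpy sketch of a single neuron (Fig. 5), the ReLU and softmax
# activations of Eqs. (1)-(2), and one SGD-with-momentum update of Eq. (4).

def relu(X):
    return np.maximum(0.0, X)                 # Eq. (1)

def softmax(X):
    e = np.exp(X - X.max())                   # shift for numerical stability
    return e / e.sum()                        # Eq. (2): outputs sum to 1

rng = np.random.default_rng(0)
x = rng.normal(size=5)        # stimulus: ordered outputs of the previous layer
w = rng.normal(size=5)        # weight vector
b = 0.1                       # bias term

X = w @ x + b                 # scalar charge of the neuron
y = relu(X)                   # neuron output

# One stochastic-gradient-descent-with-momentum update, Eq. (4):
eta, mu = 0.001, 0.9          # learning rate and momentum weight (mu >> eta)
grad = rng.normal(size=5)     # stand-in for the error gradient dL/dw
dw_prev = np.zeros(5)         # weight update from the previous step
dw = -eta * grad + mu * dw_prev
w = w + dw
```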
To further reduce training time, computation of neural network weight updates is often carried out with GPU computing. Recent advances in computing power and the outstanding performance in the ImageNet object recognition challenge (Krizhevsky et al., 2012) have renewed the scientific community's interest in neural networks (LeCun et al., 2015). We focus on convolutional neural networks, a type of neural network that can complete recognition tasks previously thought to be too challenging. Generally, a convolutional neural network trains itself to filter an input to detect cues in an image. A convolutional layer accomplishes this by learning a bank of filters that are convolved with the input:

$y = h * k$

where h is the input image and k is a filter learned by the network.
Convolution layers are followed by pooling layers, which downsample the input stimulus. Though there is still much to be understood about the human visual system, convolutional neural networks resemble the process of image filtering in the human receptive field. Examples of the output of convolutional layers are given in Fig. 6. These images provide examples of the cues used by the neural network.
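The filtering and pooling described above can be sketched in a few lines. The 'valid' boundary handling, the hand-picked edge filter and the 2 × 2 pooling window are illustrative choices; in a real network the filter values are learned.

```python
import numpy as np

# A minimal sketch of the convolution y = h * k used by a convolutional layer,
# followed by the max-pooling downsampling described above.

def conv2d(h, k):
    """Cross-correlate image h with filter k (the deep-learning 'convolution')."""
    kh, kw = k.shape
    H, W = h.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(h[i:i + kh, j:j + kw] * k)
    return out

def max_pool(h, size=2):
    """Downsample by the maximum over non-overlapping size x size windows."""
    H, W = h.shape
    H, W = H - H % size, W - W % size
    return h[:H, :W].reshape(H // size, size, W // size, size).max(axis=(1, 3))

# A vertical-edge filter responds strongly where intensity changes left to right.
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)
image = np.zeros((8, 8)); image[:, 4:] = 1.0   # dark-to-bright step edge
response = max_pool(conv2d(image, edge_filter))
```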

Deep learning architectures
Six pre-trained convolutional neural networks are used in this work. The first is AlexNet (Krizhevsky et al., 2012), a feed-forward convolutional neural network. AlexNet is trained to detect one thousand different objects from roughly one million images (Russakovsky et al., 2015). The second neural network architecture in this work is GoogLeNet, an improvement on deep convolutional neural networks developed by Google (Szegedy et al., 2015). It was the first departure from traditional convolutional neural networks that stacked many layers of convolution operations; GoogLeNet layers form a directed acyclic graph. It introduced a concept called inception modules, which combine a set of convolution and pooling operations carried out in parallel. The operations are of varying sizes, enabling the network to detect cues of varying sizes. The third neural network is an optimization of GoogLeNet, called Inception v3 (Szegedy et al., 2016). The fourth and fifth neural networks are ResNet-50 and ResNet-101 (He et al., 2015). Unlike other networks, ResNet-50 has residual components: the network's convolution-pooling operations are organized into sets, and a set of convolution-pooling operations supplements its normal output with the stimulus supplied to the beginning of the set, bypassing the set's filters. ResNet-101 is a more complex improvement upon ResNet-50. The last neural network architecture is SqueezeNet (Iandola et al., 2016), a residual network without inception components. Like Inception v3, SqueezeNet is an optimization; however, it is applied to the ImageNet challenge with the goal of maintaining overall accuracy while minimizing the cost of training the network. For transfer learning, we reuse the parameters of the pre-trained network except for the last classification layers, which are replaced with a fully connected layer and a softmax layer. For a complete explanation of the neural network architectures, the reader is referred to Krizhevsky et al. (2012), He et al. (2015), Szegedy et al. (2015), Iandola et al. (2016) and Szegedy et al. (2016). Layer operations can be used to approximate the relative complexity of a network, though Table 1 does not consider the layer inter-connections.
Fig. 6. Convolutional neural networks detect visual cues in an image. This is done by processing a given image with many filters. Each subfigure is the output of a single filter and highlights cues that the neural network determined to be important. The deep learning architecture AlexNet (Krizhevsky et al., 2012) was used to generate this figure.

Parameters and cross-validation
Experiment 1: For the first set of experiments, six pre-trained neural networks are obtained. The reader is referred to the respective literature for the training procedures of those networks (Krizhevsky et al., 2012; He et al., 2015; Szegedy et al., 2015; Iandola et al., 2016; Szegedy et al., 2016). After obtaining the networks, they are retrained on the augmented dataset of images we collected. We use fine-tuning (Li and Hoiem, 2017): all network layers are frozen except for the final fully connected layer. Retraining parameters are as follows: mini-batch size is 25, initial learning rate is 0.001, L2 regularization is 0.0001. We use mini-batch stochastic gradient descent for the optimizer. The weight and bias learning rate factors of the new layer are increased by a factor of 20. Retraining is limited to 5 epochs for all architectures. We use five-fold random cross-validation. The size of each population, after data augmentation and image processing, is given in Table 2.
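The fine-tuning setup might be sketched as follows: the pre-trained layers are frozen and act as a fixed feature extractor, and only the final softmax layer is retrained with mini-batch SGD using the stated hyperparameters. The random "features" below are a stand-in for real frozen-network activations, and the dimensions are toy choices.

```python
import numpy as np

# Hedged sketch of fine-tuning: only the final fully connected (softmax) layer
# is trained, with mini-batch SGD, batch 25, learning rate 0.001, L2 0.0001,
# and 5 epochs, as described above. Data sizes are illustrative.

rng = np.random.default_rng(0)
n, d, classes = 200, 64, 6
feats = rng.normal(size=(n, d))               # frozen-layer activations (stand-in)
labels = rng.integers(0, classes, size=n)     # diagnoses for the training images

W = np.zeros((d, classes))                    # the only trainable weights
eta, lam, batch, epochs = 0.001, 0.0001, 25, 5

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss(W):
    p = softmax(feats @ W)
    ce = -np.log(p[np.arange(n), labels] + 1e-12).mean()  # cross-entropy
    return ce + 0.5 * lam * np.sum(W ** 2)                # plus L2 term

for _ in range(epochs):                       # retraining limited to 5 epochs
    order = rng.permutation(n)
    for s in range(0, n, batch):
        idx = order[s:s + batch]
        p = softmax(feats[idx] @ W)
        p[np.arange(len(idx)), labels[idx]] -= 1.0        # dL/dlogits
        grad = feats[idx].T @ p / len(idx) + lam * W
        W -= eta * grad
```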
For a single fold, 2680 images are selected randomly from each population, resulting in a data subset of 16,800 images. By ensuring that all populations have an equal number of samples, we avoid an a priori bias. With each fold, 70% of the images are used for training and the remaining 30% are used for testing. This is repeated five times to obtain the testing results given in the 'Results and Discussion' section. Experiments were coded in MATLAB R2018a and executed on a Dell Precision Rack 7910 with the following specifications: Intel Xeon E5-2630 v3 microprocessor operating at 2.40 GHz, 128 GB of DDR3 memory operating at 2133 MHz, Ubuntu 16.04 operating system, and an Nvidia Quadro K6000 GPU. Code is publicly available on GitHub.2
Experiment 2: With the second set of experiments, we compare a deep learning setup to one without deep learning. The convolutional neural network (deep learning) setup is described as follows: (1) the image is processed by a pre-trained convolutional neural network, AlexNet trained on ImageNet (Krizhevsky et al., 2012). (2) However, the image is not fully processed by AlexNet; we harvest the activations of the convolutional part of the network. (3) The harvested activations are classified by a support vector machine (SVM) (Chang and Lin, 2011) for the final prediction.
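The balanced per-fold sampling and 70/30 split described for Experiment 1 can be sketched with the standard library alone. The population names and sizes below are toy stand-ins; the study drew 2680 images per population over five folds.

```python
import random

# Illustrative stdlib-only sketch of the cross-validation sampling above:
# each fold draws an equal number of images per population (avoiding an
# a priori bias), then splits them 70% / 30% into training and testing.

def make_fold(populations, per_class, train_frac=0.70, seed=0):
    """populations: dict name -> list of image ids. Returns (train, test)."""
    rng = random.Random(seed)
    train, test = [], []
    for name, images in populations.items():
        chosen = rng.sample(images, per_class)   # equal count per population
        cut = int(round(train_frac * per_class))
        train += [(name, i) for i in chosen[:cut]]
        test += [(name, i) for i in chosen[cut:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

pops = {c: list(range(100)) for c in ["GY", "HC", "OD"]}   # toy populations
folds = [make_fold(pops, per_class=50, seed=k) for k in range(5)]  # 5-fold CV
```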
The system without deep learning, the baseline, is described as follows: (1) Local binary patterns (Ojala et al., 1994) and color histogram features are extracted from the image. (2) The local binary patterns and color histogram features are classified with a SVM for final prediction.
For both systems, the SVM has the following parameters: a C-SVM is used; C is set to 1; a radial basis function kernel is used; and gamma is set to the inverse of the number of pixels in the image. The baseline system we compare our work to performs so poorly on the dataset used in Experiment 1 that we had to reduce the complexity of the problem. We reduce the complexity by reducing the problem to a binary classification problem (GY or non-GY); by not including data from publicly available sources, which would introduce more disease populations; and by not augmenting the data with image processing, which would perturb the data with misalignment. GY is considered positive (P) and non-GY is considered negative (N). Only the GY and non-GY leaf clipping images that we collected are used for this experiment (from Section 2.1.1).
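The SVM configuration can be sketched with scikit-learn as a stand-in for the authors' LIBSVM-based setup (Chang and Lin, 2011): a C-SVM with C = 1, an RBF kernel, and gamma equal to the inverse of the number of pixels in the 227 × 227 input. The toy feature vectors below are illustrative, not real image features.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of the stated SVM parameters; scikit-learn is an assumption, as the
# original experiments were run in MATLAB. Toy data stands in for features.

n_pixels = 227 * 227
clf = SVC(C=1.0, kernel="rbf", gamma=1.0 / n_pixels)

# Toy stand-in for feature vectors of GY-negative (0) and GY-positive (1) leaves.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 10)),
               rng.normal(3.0, 1.0, (40, 10))])
y = np.array([0] * 40 + [1] * 40)
clf.fit(X, y)
```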
Experiment 3: While the overall goal of our work is to develop an ML system, we would also like to know how human performance compares to that of our diagnostic test. We conduct a test where human participants are presented with at least 100 images of grape leaves, one at a time. For each leaf, the participant guesses whether the leaf is positive or negative for grapevine yellows (GY). There were 7 trained novices and 5 experts in this study. Participants processed 149.08 ± 45.67 images. For this purpose, web-based software was developed to collect participant responses (Fig. 7).
Prior to beginning any data collection, informed consent was obtained from each volunteer participant. Participants received training to detect GY by sight. First, participants spent 30 min reading about GY and an identification guide. Second, participants received an additional 30 min of training in which they were presented with trial images. Participants diagnosed the images one by one at their own pace and received immediate feedback on their decisions. Their guesses were not logged at this point.
After 60 min of training, the participants were given a set of testing images. Participants made a single guess for each image, at their own pace. They received no feedback on their responses. The responses to the testing images were collected. This experiment used the same dataset as experiment 2. Participants who were already experts in plant science were not required to receive training.

Evaluation metrics
In the following, we explain the metrics used to assess overall performance. In general, each metric is a rate and higher is better. Sensitivity, also known as true positive rate (TPR), is calculated as TPR = TP / P, where TP is the number of true positives and P is the number of positive samples. Specificity, also known as true negative rate (TNR), is calculated as TNR = TN / N, where TN is the number of true negatives and N is the number of negative samples. Precision, also known as positive predictive value, is calculated as PPV = TP / (TP + FP), where FP is the number of false positives. Negative predictive value is calculated as NPV = TN / (TN + FN), where FN is the number of false negatives. False negative rate and false positive rate are calculated as FNR = FN / P and FPR = FP / N. Accuracy is calculated as ACC = (TP + TN) / (P + N); it is often used as a measure of overall performance. Another metric for performance is the F1 score, F1 = 2 · PPV · TPR / (PPV + TPR), the harmonic mean of precision and sensitivity. It provides an alternative to accuracy, which does not account for false positives.
Table 1. This table compares the different deep learning architectures used in this work according to the number of layer operations. Layer operations include cross-channel normalization, pooling, concatenation, activation, dropout, etc. They approximate the complexity of the architecture. An architecture with more layer operations will take longer to train. It will also have more weights, requiring increasingly non-trivial hardware. AlexNet (Krizhevsky et al., 2012) and SqueezeNet (Iandola et al., 2016) represent the most "lightweight" architectures considered in this work. Inception v3 (Szegedy et al., 2016) and ResNet-101 (Szegedy et al., 2015) represent the most complex architectures.
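The metrics defined in this subsection can be collected into a single helper; the function and dictionary keys below are our naming, not the paper's code:

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute the diagnostic rates from raw confusion counts."""
    p, n = tp + fn, tn + fp               # positive / negative sample counts
    tpr = tp / p                          # sensitivity
    tnr = tn / n                          # specificity
    ppv = tp / (tp + fp)                  # precision
    npv = tn / (tn + fn)
    acc = (tp + tn) / (p + n)
    f1 = 2 * ppv * tpr / (ppv + tpr)      # harmonic mean of PPV and TPR
    return {"TPR": tpr, "TNR": tnr, "PPV": ppv, "NPV": npv,
            "FNR": 1 - tpr, "FPR": 1 - tnr, "ACC": acc, "F1": f1}

# Example with 100 positive and 100 negative test samples:
m = diagnostic_metrics(tp=95, fp=5, tn=95, fn=5)
# m["TPR"], m["ACC"] and m["F1"] all equal 0.95 here
```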
The Matthews correlation coefficient (MCC) accounts for both false positives and false negatives. It is akin to the chi-square statistic for the confusion matrix. MCC is the only metric that is not a rate; MCC ∈ [−1, 1] and higher is better. Some results are presented in terms of a confusion matrix, also known as a contingency table. Cells of the confusion matrix are calculated as C(i, j) = (1 / n_i) Σ_k [Prediction(k) = j][Label(k) = i], where i is the row (a population label), j is the column (also a population label), k is an iterator (an index of the current test sample), n_i is the number of test samples belonging to population i, Prediction(k) references the k-th prediction, and Label(k) references the true label of the k-th test sample.
[.] is an Iverson bracket that evaluates to 1 if its argument is true and 0 otherwise. For this work, we enumerate the population labels as follows: black rot (BR) is 0, healthy control (HC) is 1, Esca disease (ED) is 2, grapevine yellows (GY) is 3, leaf blight (LB) is 4, and all other diseases and pests (OD) are 5. The rows of the confusion matrix represent the samples of a population; the columns represent how each sample was predicted. An ideal system would have the identity matrix as its confusion matrix, that is, the non-diagonal elements should be zero. The diagonal elements of a confusion matrix correspond to the sensitivity for recognition of that population. Confusion matrix elements are ratios.
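A sketch of this computation; the label enumeration follows the text, while the function itself is illustrative:

```python
import numpy as np

LABELS = {"BR": 0, "HC": 1, "ED": 2, "GY": 3, "LB": 4, "OD": 5}

def confusion_matrix(labels, predictions, n_classes=6):
    """Row-normalised confusion matrix: C[i, j] is the fraction of
    population-i samples predicted as population j, i.e. the
    Iverson-bracket sum divided by n_i."""
    C = np.zeros((n_classes, n_classes))
    for true, pred in zip(labels, predictions):
        C[true, pred] += 1
    n_i = C.sum(axis=1, keepdims=True)
    n_i[n_i == 0] = 1          # leave rows with no test samples as zeros
    return C / n_i

# Toy example: four GY samples (one mistaken for HC) and two HC samples.
y_true = [3, 3, 3, 3, 1, 1]
y_pred = [3, 3, 3, 1, 1, 1]
C = confusion_matrix(y_true, y_pred)
# C[3, 3] == 0.75 (GY sensitivity), C[3, 1] == 0.25, C[1, 1] == 1.0
```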
For all experiments, many trials are carried out. We present the results of each experiment with the performance metrics in Eqs. (7)-(16), given as the average and variance across all trials. Variance is defined as Var(m) = E[(m − E(m))²], where m is a vector of the metric values and E(·) is the expectation, or average, of the metric values.
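This is the population variance (the expected squared deviation from the mean); a one-function sketch with hypothetical fold accuracies:

```python
import numpy as np

def trial_variance(m):
    """Var(m) = E[(m - E[m])^2], the spread reported across folds."""
    m = np.asarray(m, dtype=float)
    return float(np.mean((m - m.mean()) ** 2))

# Five hypothetical fold accuracies (illustrative values):
acc = [0.99, 0.98, 0.99, 0.97, 0.99]
v = trial_variance(acc)   # equivalent to np.var(acc)
```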
To discuss performance, we consider two more tools: (1) sigma levels, to determine whether an architecture is sufficient. Sigma levels are often used in manufacturing to gauge the performance of a process. We use the 68-95-99.7 rule of a normal distribution to establish one, two and three sigma levels of performance without error. That is, one sigma corresponds to an expectation of success without error 68% of the time; two sigma, 95%; and three sigma, 99.7%. Generally, a process with less than two sigma performance is not desirable and should not be used in production.
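The sigma thresholds can be expressed as a small lookup; the cut-off values follow the 68-95-99.7 rule quoted above, while the function is ours:

```python
def sigma_level(success_rate):
    """Map an error-free success rate (0..1) to a sigma level per the
    68-95-99.7 rule; 0 means below one sigma."""
    if success_rate >= 0.997:
        return 3
    if success_rate >= 0.95:
        return 2
    if success_rate >= 0.68:
        return 1
    return 0

# A production-grade process should reach at least two sigma:
level = sigma_level(0.9896)   # 2, e.g. a sensitivity of 98.96%
```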
(2) Unpaired t-tests, to determine whether there are significant differences between two architectures. Because there are few samples for each population (from five validation folds), we use pooled variance. Unpaired t-tests determine the P value and statistical significance; a P value of less than 0.05 is generally considered statistically significant.
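The pooled-variance t statistic can be sketched as follows, taking per-architecture sample means, sample variances (ddof = 1) and fold counts as inputs; the naming is ours:

```python
import math

def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Unpaired t statistic with pooled variance, appropriate when the
    per-architecture fold counts are small."""
    sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Two architectures with identical fold results give t = 0:
t = pooled_t(0.99, 1e-4, 5, 0.99, 1e-4, 5)
```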

Experiment 1 -Feasibility of GY detection by computers
The first set of experiments answers the questions: can GY be predicted from leaf clipping images? And, among the many deep learning architectures available, which is best? We answer these questions by providing recognition results for six network architectures. They detect the following conditions from leaf clipping images of grapes: healthy control (HC), GY, black rot (BR), esca (ED), leaf blight (LB), and other grapevine diseases (OD). These experiments demonstrate state-of-the-art performance in automatic detection of GY. In this experiment, recognizing the symptoms of GY is a complex problem because detection must occur in spite of other diseases that are similar in appearance. As a further challenge, the alignment of the images (where the leaf is placed on the image canvas) was perturbed during data augmentation with image processing. Finally, the images have varying background conditions. Despite this, the system achieves outstanding performance. Some architectures begin to meet expectations for a production-level system (two sigma levels of performance). A summary of results for GY detection is given in Table 3 and a full presentation of results for all populations is given in Table 4.
The best sensitivity, also known as true positive rate (TPR), is given by ResNet-101 with a TPR of 98.96%, and the worst by SqueezeNet with a TPR of 71.34%. All architectures except GoogLeNet and SqueezeNet have at least two sigma TPR. When considering a diagnostic test, one should consider sensitivity, specificity and predictive value, so it would be premature to dismiss GoogLeNet and SqueezeNet for poor TPR performance alone. Comparing the best and second-best performers (ResNet-50) in terms of TPR, an unpaired t-test of the results yields t = 40.51 with a P-value of less than 0.0001. Thus, ResNet-101 alone has the best sensitivity. Comparing the best and worst performers in terms of TPR, an unpaired t-test yields t = 11.87 with a P-value of less than 0.0001. Tests that find statistical significance between the best and worst performers support the idea that there is a difference between architectures, and that it is important to consider which neural network to use.
The best specificity, also known as true negative rate (TNR), is given by ResNet-101 with a TNR of 99.40%, and the worst by AlexNet with a TNR of 97.65%. All architectures have at least two sigma TNR, so all architectures have sufficient specificity. Comparing the best and second-best performers (Inception v3) in terms of TNR, an unpaired t-test of the results yields t = 14.85 with a P-value of less than 0.0001. Thus, ResNet-101 has the best specificity. Comparing the best and worst performers in terms of TNR, an unpaired t-test yields t = 740.50 with a P-value of less than 0.0001.
The best positive predictive value (PPV) is given by ResNet-101 with a PPV of 97.09%, and the worst by AlexNet with 89.62%. Only three architectures achieve at least two sigma PPV: the two variations of ResNet and Inception v3. PPV is a better indicator than TPR or TNR of the ability of the system to detect a disease. Thus, the two variations of ResNet and Inception v3 are the best contenders thus far. ResNet-101 has only slightly higher PPV than the other two, and there would appear to be no difference. But an unpaired t-test of the best and second-best architectures according to PPV yields t = 11.99 with a P-value of less than 0.0001, indicating a significant difference. ResNet-101 has the best predictive value. Comparing the best and worst performers in terms of PPV, an unpaired t-test yields t = 56.53 with a P-value of less than 0.0001.
Table 4. In-depth results for automatic prediction of GY in Vitis vinifera L. cv. Sangiovese by various deep learning algorithms. The results are given in terms of confusion matrixes measuring performance for prediction of the non-GY control (HC), GY, black rot, esca, leaf blight, and other diseases/pests. Results are presented as averages across five trials, with variance given in parentheses.
Negative predictive value (NPV) is the ability of a diagnostic test to determine that a sample does not have GY. The best NPV is given by ResNet-101 with an NPV of 99.79%, and the worst by SqueezeNet with an NPV of 94.66%. SqueezeNet is the only architecture that does not achieve at least two sigma NPV, and both versions of ResNet achieve three sigma NPV. Considering the previous discussion of TPR and PPV, it is easier for a deep learning algorithm to reject a GY diagnosis than it is to conclude that a sample has GY. Considering the two ResNet variations, an unpaired t-test of the results yields t = 0 with a P-value of greater than 0.99999; thus, there is no significant difference between the two architectures. Surprisingly, AlexNet has the third-best NPV. The t-test conclusion is the same when comparing ResNet-101 to AlexNet, with a P-value of greater than 0.99999. This is surprising because AlexNet was one of the seminal methods that demonstrated the potential of convolutional neural networks in 2012. One would assume that more recent architectures would perform better, yet for NPV AlexNet outperforms GoogLeNet and Inception v3. Comparing the best and worst performers in terms of NPV, an unpaired t-test yields t = 73.73 with a P-value of less than 0.0001. False negative rate (FNR) can be calculated as 1 − TPR, and false positive rate (FPR) as 1 − TNR. We do not discuss these metrics further because they are linear functions of other metrics and would not entail conclusions different from the analysis of TPR and TNR.
The best overall accuracy (ACC) is 99.33%, obtained by ResNet-101, followed by ResNet-50, Inception v3, AlexNet, GoogLeNet and SqueezeNet, in that order. ResNet-50 is a close contender, with an accuracy of 99.18%. An unpaired t-test of the two results yields t = 0 with a P value of greater than 0.99999, so there is no significant difference. The next-best architecture to be statistically different from ResNet-50 is AlexNet, with an accuracy of 97.63%, t = 268.79 and a P value of less than 0.00001. AlexNet continues to surprise as a strong contender despite its lack of contemporary layer structures (residual and inception). All architectures achieve two sigma levels of accuracy except for SqueezeNet, and none exceed two sigma performance. A comparison of the best and worst (SqueezeNet) architectures according to accuracy with an unpaired t-test yields t = 146.52 and a P value of less than 0.00001.
The best F1-score is 98.01%, obtained by ResNet-101, followed by ResNet-50, Inception v3, AlexNet, GoogLeNet and SqueezeNet. The next-best architecture to be statistically different from ResNet-101 is Inception v3, with an F1-score of 95.14%, t = 90.75 and a P value of less than 0.00001. Only the ResNet-50, ResNet-101 and Inception v3 architectures achieve two sigma levels of F1-score. Comparing the best and worst performer (SqueezeNet) with an unpaired t-test yields t = 17.57 and a P value of less than 0.0001. There is a significant gap between the best and worst performer. MCC is similar to F1 in that it tries to balance false positives and false negatives; because of this similarity, the conclusions would be similar to the analysis of the F1 results. MCC is not a rate, so sigma levels are not applicable.
The most surprising conclusion of Table 3 is that AlexNet has competitive performance. It was also surprising that SqueezeNet did not do well; it was often the worst architecture. Further, SqueezeNet has the worst variance among the various metrics. SqueezeNet is a convolutional neural network designed to preserve overall accuracy on the ImageNet grand challenge while minimizing the complexity of the network. This may explain why it does not do as well on leaf clipping images. The ResNet architectures are the only networks to achieve at least two sigma performance for all relevant metrics, and the only networks we can recommend for production-grade systems for automatic detection of GY.
Considering GY performance only, ResNet-101 is the best architecture for automatic detection. However, for NPV, accuracy and F1-score, there is no significant difference between ResNet-50 and ResNet-101. A comparison of training time and accuracy for detecting GY is given in Fig. 8. ResNet-101 and Inception v3 take the longest time to train. Their performance may be a diminishing return considering that ResNet-50 takes half as long to train. Considering this, ResNet-50 is a good alternative to ResNet-101. It is notable that AlexNet and SqueezeNet have similar training times. A t-test yields a P value of 0.05, so the difference is not quite statistically significant. However, during the training procedure, all layers but the final fully connected layer are frozen. It is possible that SqueezeNet would be faster if the networks were not frozen. Table 3 contains results for GY detection only, whereas Table 4 contains confusion matrixes that describe performance across all six populations. Considering the confusion matrixes, AlexNet has the best sensitivity when predicting leaf blight, GY, healthy controls, other diseases, black rot and esca, in that order. The greatest variance is given by the other diseases population, and this is not surprising: the other diseases population consists of many diseases with varying symptoms. AlexNet achieves two sigma level sensitivity for leaf blight and GY, and one sigma level sensitivity for all other populations.
GoogLeNet has best sensitivity when predicting leaf blight, other diseases, healthy controls, esca disease, black rot and GY in that order. In general, GoogLeNet has a high variance of sensitivity for all populations, and leaf blight is the only population with two sigma levels of sensitivity. GoogLeNet is not as versatile and reliable as AlexNet when identifying diseases from leaf clipping images.
Inception v3 has the best sensitivity when detecting leaf blight, esca disease, other diseases, black rot, healthy controls and GY, in that order. Compared to GoogLeNet, the variance of sensitivity is greatly reduced. Inception v3 achieves two sigma levels of sensitivity on all populations except the GY population, the primary focus of our work.
ResNet-50 has best sensitivity when detecting leaf blight, GY, other diseases, healthy controls, black rot, and esca disease in that order. Generally, ResNet-50 has low variance except for esca disease, its worst performing population. ResNet-50 achieves two sigma levels of performance for all populations, and three sigma levels for leaf blight.
ResNet-101 has best sensitivity for detecting leaf blight, other diseases, black rot, esca disease, healthy controls and GY in that order. It achieves two sigma levels of sensitivity for all populations except for leaf blight where it achieves three sigma levels of sensitivity. The variance is generally low for all populations.
SqueezeNet has best sensitivity with leaf blight, healthy controls, black rot, other diseases, esca disease and GY in that order. SqueezeNet has the worst performance of all the networks when applied to our task. For GY, esca, other diseases and black rot, it has a variance that is a whole order of magnitude greater than the variances of other networks. SqueezeNet achieves two sigma levels of performance for only the leaf blight population.
Considering the confusion matrixes, leaf blight appears to be the easiest disease for deep learning architectures to detect. SqueezeNet, the worst performing architecture in our experiments, does not perform well in general, except for leaf blight recognition, where it still achieves two sigma levels of sensitivity. Some architectures (such as AlexNet, ResNet-50 and ResNet-101) show promising results for a variety of diseases. Further, the highly complex architectures (Inception v3, ResNet-50, ResNet-101) do well on the other diseases population, which contains many diseases with different symptoms. Considering that AlexNet is the least complex of all the architectures discussed in this work, it is possible that complexity increases the ability of a network to account for varying cues within a population. These results conclusively demonstrate that a neural network can detect GY from leaf clipping images. While the accuracy of traditional detection methods, such as PCR, is not disputed, a system using one of these neural network architectures can offer an early screening method used in parallel with traditional lab testing. This will accelerate responses and mitigate crop losses.

Experiment 2 -Is deep learning necessary?
For the second set of experiments, we address the question: how does the proposed deep learning system compare to related work that does not use deep learning? Occam's razor suggests that the best solution is the simplest one. Neural networks are difficult to implement because they are time-consuming to train and they introduce the complications of data augmentation and transfer learning. Deep learning algorithms require expensive GPUs to train, and there is no benefit to using deep learning if a more conventional system would suffice. We investigate whether color and texture features with a support vector machine are enough to detect GY from grapevine leaf clipping images.
The baseline is similar to the approaches in Kaur et al. (2018) and Sharif et al. (2018). Poor performance of the baseline system demonstrates the need for a more complex classification scheme, such as deep learning. For this experiment, the deep learning-based system takes the activations from the convolutional layers of the AlexNet architecture and pipes the activations to a support vector machine for the final prediction. This also demonstrates the need for convolutional neural networks over non-convolutional alternatives. Deep learning obtains a 92.06% overall accuracy and a Matthews correlation coefficient (MCC) of 0.832. The baseline system, local binary patterns (LBP) and color histograms with an SVM, obtains only 26.79% overall accuracy and an MCC of −0.1244 (Table 5). The gap in performance is so great that no statistical analysis is needed. The results indicate that the problem is indeed complex enough to outweigh the drawbacks of deep learning.

Experiment 3 -Comparison to human experts
Experiment 3 addresses the question: among human recognition, deep learning and a baseline system without deep learning, which is best? Results are given in Table 5. The results in this table use the data subset from experiment 2, not the dataset from experiment 1. Experiments 2 and 3 can be compared with each other, but neither can be compared with experiment 1. Deep learning is better than humans for all metrics.
There is such a large gap in performance between humans, the baseline system and deep learning that an in-depth statistical analysis is not needed; the difference is clearly significant. Human recognition does not meet the minimum expectations for a production-level system (two sigma levels of performance), supporting the adage that it is hard to recognize GY by sight. Human results are in line with the ML results in that specificity (TNR) and NPV are higher than sensitivity (TPR) and PPV, indicating that it is easier to reject a hypothesis of membership in the GY class than it is to confirm one.

General discussion
GY is a serious threat due to its severe symptoms and the lack of curative treatments. Methods for effective detection of grapevine yellows diseases are of worldwide interest. Bois noir (BN) is endemic to some regions but still dangerous. Both pathogens cause similar symptoms, and health monitoring programs are carried out worldwide. Diagnosis relies on effective symptom identification. Yet a low concentration of the pathogen and its erratic distribution in the host lead to infected but asymptomatic grapevines. This leads to high rates of false negatives in detection. Further, GY symptoms such as leaf discoloration, bunch drying, and irregular wood ripening are typical of other diseases and most evident in late summer. This makes recognition of GY a difficult task (Belli et al., 2010). The aim of this work is to develop a tool for supporting sampling procedures.
The accuracy of lab testing is unparalleled. Yet this tool can accelerate a response to GY when used in parallel with traditional detection methods (e.g. polymerase chain reaction (PCR), fluorescence in situ hybridization (FISH), immunofluorescence (IF), enzyme-linked immunosorbent assay (ELISA), flow cytometry (FCM), gas chromatography-mass spectrometry (GC-MS), etc. (Mehle et al., 2017)). The benefits of the proposed system are as follows: • The end-user does not need to be an expert at detecting GY.
No prior skill is required of the user. There is no manual segmentation, initialization, or initial guesses. The user submits a cropped leaf clipping image to the system and receives a diagnosis.
• Expensive sensing equipment is not required. The system accepts images of roughly 300-pixel resolution; thus, consumer-grade video cameras, camcorders and scanners are sufficient. Specialized sensors (thermal, multi-spectral) are not needed.
• The system does not need pre-processing of images. Deep learning is end-to-end. We merely expect the user to center the leaf in the image. Further, the system can tolerate poor alignment during data collection.
• The system is very accurate. With ResNet-101, we obtain a sensitivity of 98.96% and a specificity of 99.40%.
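Taken together, these points amount to a very small inference pipeline. The sketch below is illustrative only: the authors' system was built in MATLAB, and the `diagnose` helper, the nearest-neighbour resize, and the dummy model are our assumptions.

```python
import numpy as np

def diagnose(image, model, size=300):
    """Hypothetical end-to-end inference: resize a cropped, centred leaf
    image and return the predicted population label. `model` stands in
    for a trained network such as ResNet-50."""
    h, w = image.shape[:2]
    ys = np.arange(size) * h // size        # nearest-neighbour row indices
    xs = np.arange(size) * w // size        # nearest-neighbour column indices
    resized = image[ys][:, xs]              # (size, size, 3) crop
    scores = model(resized)                 # per-class scores
    labels = ["BR", "HC", "ED", "GY", "LB", "OD"]
    return labels[int(np.argmax(scores))]

# Dummy model that always scores class 3 (GY) highest:
dummy = lambda img: np.array([0.0, 0.1, 0.0, 0.8, 0.05, 0.05])
leaf = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder image
result = diagnose(leaf, dummy)                   # "GY"
```

No segmentation, initialization, or initial guess appears anywhere in this flow, which is the point of the bullets above.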
There is great potential to revolutionize the detection of GY. The system provides a template for swift detection of diseases in other crops. One major drawback to the system is the complexity of deep learning algorithms. However, training the system is a one-time procedure carried out on a server, not by the end-user. Transfer learning and data augmentation mitigate the data requirements of deep learning. The results of experiments 2 and 3 show the need for deep learning despite this drawback. Systems without neural networks (deep learning) and human experts do not achieve the same level of performance as the proposed system.
A drawback of our work is that the leaf images are not taken in field conditions, where there are other leaves, occlusion and varying illumination. Systems that operate in the field, under very unconstrained scenarios, are still an open challenge. Even detection in very constrained scenarios is challenging, particularly for diseases that can easily be confused with others. The purpose of our study is to show that automatic detection is now possible with deep learning, which can be applied to other pests and diseases via transfer learning. When diagnosing a plant, growers collect various plant parts for conventional testing (such as PCR). Leaf blades would be collected through this process, so it is reasonable to require growers to clip the leaf blade before imaging.

Conclusion
In the field, the true-positive rate of GY detection was frequently overestimated (due to similar symptoms) or underestimated (due to variability in symptom expression) (Rizzo et al., 2018). Thus, automatic tools that can help the sampler in pathogen recognition are crucial to avoid missing GY-positive plants, in particular FD-positive ones. We demonstrate that it is possible to detect GY from leaf clipping images, and that certain systems exceed expectations. ResNet-101 is the clear best performer for this application. However, ResNet-50 is often second best, and both systems deliver at least two sigma performance for all metrics. An in-depth statistical analysis revealed that, for NPV, accuracy and F1-score, there is no significant difference between ResNet-50 and ResNet-101. We recommend ResNet-50 for leaf diagnosis systems because the increased complexity of ResNet-101 may be a diminishing return. It was notable that AlexNet has competitive performance given its lack of sophistication compared to the other architectures in this work. Consistently, there is a statistically significant gap between the best and worst performer, indicating the importance of selecting an appropriate neural network for this application. Deep learning has 35.97% and 22.88% better predictive value (PPV) for recognizing GY from sight than a baseline system without deep learning and trained humans, respectively. Future work in this research will focus on the development of tools that can be used in the field, where there are other leaves, occlusion and varying illumination. We anticipate the need for better segmentation algorithms to facilitate this. We will also consider implementing the methods on an Nvidia Jetson so that predictions can be made without the need for a remote connection to a server.
Table 5. Results comparing the performance of a system with deep learning (AlexNet) to a baseline system that does not use deep learning and to humans. (Top row) Deep learning: pre-trained convolutional neural network (CNN) features with a linear support vector machine (SVM) classifier. (Middle row) Non-deep learning: a baseline system using local binary patterns (LBP) and color histograms with an SVM classifier, similar to the approaches given in related work. (Bottom row) Human recognition on the same dataset. The deep learning algorithm achieves good performance when predicting whether a sample has GY, whereas a system without deep learning performs poorly. Without deep learning, a ML algorithm cannot exceed human performance. This demonstrates the need for deep learning when automatically diagnosing the symptoms of GY from leaf clipping images. Variance is given in terms of 10⁻¹.
A. Cruz et al., Computers and Electronics in Agriculture 157 (2019) 63-76