Detecting diseases in apple tree leaves using FPN–ISResNet–Faster RCNN

ABSTRACT Apple leaf diseases typified by small disease spots are generally difficult to detect in images. This study proposes a deep learning model, the feature pyramid network (FPN)–inception squeeze-and-excitation ResNet (ISResNet)–Faster RCNN (region-based convolutional neural network) model, to improve the accuracy of detecting apple leaf diseases. Apple leaf diseases were identified, evaluated, and validated by using the FPN–ISResNet–Faster RCNN. The results were compared with those obtained by the single-shot multibox detector (SSD), Faster RCNN, and ISResNet–Faster RCNN. The detection accuracies obtained by using different feature extraction networks, positions and numbers of SE and inception modules, scales of FPN structures, and scales of anchor frames were also compared. The results showed that the average precision (AP) and the APs at intersection over union (IOU) thresholds of 0.5 and 0.75 (AP50 and AP75) obtained from the FPN–ISResNet–Faster RCNN were 62.71%, 93.68%, and 70.94%, respectively, which are higher than those of the SSD, VGG–Faster RCNN, GoogleNet–Faster RCNN, ResNet50–Faster RCNN, ResNeXt–Faster RCNN, and ISResNet–Faster RCNN. FPN–ISResNet–Faster RCNN was shown to be able to detect diseases in apple leaves with high accuracy and generalizability.


Introduction
Apple trees can be infected by more than 100 kinds of diseases in their branches, leaves, fruits, and roots (Y.H. Zhang et al., 2018). Such diseases reduce the yield and quality of apples and cause economic losses for farmers (ElMasrya et al., 2009). As diseases in apple trees cause millions of dollars of loss every year, their detection has garnered considerable attention (Chao et al., 2020; Yan et al., 2020).
Diseases in the leaves of apple trees include leaf casting, powdery mildew, rust, Venturia nashicola, viruses, and silver leaf disease (Sottocornola et al., 2022). Traditionally, such diseases are manually identified by farmers and experts, which is costly and time consuming (Hou, Fan, et al., 2016).
Machine learning methods, such as support vector machines (SVMs) (Bruzzone & Carlin, 2006; Melgani & Bruzzone, 2004; Pal & Mather, 2005), random forests (RFs) (Pal, 2005), and artificial neural networks (ANNs) (Kumar et al., 2020; Laurindo et al., 2017; Li et al., 2007), have been used to detect crop diseases (Bhargava & Bansal, 2020). Disease spots can be identified from images according to their segmented thresholds (Golhani et al., 2018). However, images collected in natural conditions can be significantly affected by environmental factors such as relative humidity and illumination (D.F. Wang et al., 2021). Moreover, most disease spots are small and scattered on the leaves of the plant, which makes it difficult for machine learning methods to learn their features and reduces their accuracy (Geetharamani & Arun, 2019; Hou, Fan, et al., 2016).
Deep learning uses a hierarchical model that includes an input layer, multiple hidden layers, and an output layer according to a mapping relationship from low-level information to high-level semantics (Umit, 2020; Vita et al., 2021; Zhong & Zhao, 2020). Thus, plant diseases have been detected in recent years using different deep learning models, such as convolutional neural networks (CNNs) (Abade et al., 2021; G. S. Hu et al., 2019), bidirectional long short-term memory networks (P. Chen et al., 2020), lightweight attention networks (J.D. Chen et al., 2021), and deep neural networks with transfer learning (Krishnamoorthy et al., 2021). A masked-regions with convolutional neural network (Mask-RCNN) algorithm was developed that integrates a residual network (ResNet) into a multiscale convolution layer to extract disease spots in leaves of grape vines, yielding an accuracy of 90.83% (Picon et al., 2019; Triki et al., 2021; Yu et al., 2019). A five-order spatial pyramid model was constructed to identify diseases, yielding an accuracy of 88.4% using the AI Challenger 2018 dataset (Jie et al., 2021). A faster double-region inception-based attention CNN was proposed based on the Faster RCNN algorithm to detect diseases in the leaves of grape vines; it recorded an accuracy of 81.1% and a detection speed of 15.01 frames per second (FPS) (Xie et al., 2020). An asymmetric shuffle CNN that integrates an attention mechanism of the spatial channel squeeze and excitation (scSE) block into a feature extraction network has been proposed to expand the receptive field of the convolution kernel and enhance feature extraction (Jin et al., 2022). An inception-based attention rainbow single-shot multibox detector (INAR-SSD) model was proposed to identify apple leaf diseases, yielding an accuracy of 78.80% at a detection speed of 23.13 FPS (Jiang et al., 2019). An improved YOLO-V3 model combined with a dense connection module was proposed to detect apple anthracnose, recording an accuracy that was 1.4% higher than that of the YOLO-V3 model (Tian et al., 2019).
The above deep learning methods are mainly used to detect diseases in single leaves in images acquired in a laboratory setting with a natural background, and without interference due to illumination, contrast, and the camera (Barman et al., 2020). However, these methods cannot adequately detect small and dense disease spots, different diseases that present with similar shapes and colors, or multiple diseases in a single leaf (S.W. Zhang et al., 2019). This study proposes a feature pyramid network (FPN)–inception squeeze-and-excitation ResNet (ISResNet)–Faster RCNN model based on Faster RCNN, an FPN, an attention mechanism, and an inception structure to better detect diseases in the leaves of apple trees. The objective of the study is to design a novel deep learning model, which we call FPN–ISResNet–Faster RCNN, to improve the accuracy of detecting, and of mapping the spatial distribution of, multiple apple leaf diseases with small and dense disease spots in the complex scenes of a typical apple orchard.

Materials
The original dataset contained 2,891 images of diseased leaves of apple trees, of which 2,838 images with a size of 960 × 640 were obtained from the Plant Pathology 2021-FGVC8 dataset released by the Kaggle competition, and 53 images with a size of 256 × 310 were obtained from the AI Challenger 2018 datasets. The images with a size of 256 × 310 were resized to 960 × 640. The dataset contained 697 images of leaves with powdery mildew caused by Chaetomium albicans, 1,127 images of Alternaria leaf spots caused by virulent strains of Alternaria alternata, and 1,067 images of rust disease caused by Gymnosporangium yamadae (Figure 1).
The disease spots due to powdery mildew, Alternaria, and rust disease had different sizes, with powdery mildew having the largest disease spots on the apple leaves, almost covering entire leaves. In contrast, those of Alternaria and rust disease were small and dense.
If the number of images in a dataset is too small, a neural network can fit the training set too closely and is prone to overfitting (Tiwari et al., 2021). Image enhancement techniques, such as sharpening, rotation, and mean blurring, were used to generate more images to simulate various forms of interference by the natural environment. The enhanced images were learned by a neural network to extract as many features as possible to improve detection performance. Changes in image brightness and contrast, together with sharpening, were used to simulate the influence of the weather, sunshine, and other factors on deep learning for disease detection. The images were flipped horizontally and vertically, as well as rotated by 90°, 180°, and 270°, to mitigate the interference of the camera angle on the quality of the captured images. Gaussian noise and mean blurring were used to alleviate the influence of camera shake and blur on the quality of the images.
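The geometric and photometric transformations described above can be illustrated with a minimal sketch. The study used OpenCV for this step; here, as a simplifying assumption, a grayscale image is represented by a nested list of pixel values so that the example needs no external library, and all function names and the `alpha`/`beta` parameters are hypothetical.

```python
def flip_horizontal(img):
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Reverse the row order (top-bottom flip)."""
    return [list(row) for row in img[::-1]]


def rotate90_clockwise(img):
    """Rotate by 90 degrees clockwise; apply 2 or 3 times for 180 or 270 degrees."""
    return [list(row) for row in zip(*img[::-1])]


def adjust_brightness_contrast(img, alpha=1.0, beta=0.0):
    """Scale contrast by alpha, shift brightness by beta, clip to [0, 255]."""
    return [[min(255, max(0, round(alpha * p + beta))) for p in row] for row in img]


# Toy 2 x 2 "image" standing in for a leaf photograph.
image = [[1, 2],
         [3, 4]]
augmented = [
    flip_horizontal(image),
    flip_vertical(image),
    rotate90_clockwise(image),
    rotate90_clockwise(rotate90_clockwise(image)),
    adjust_brightness_contrast(image, alpha=1.2, beta=10),
]
print(len(augmented), "augmented variants")
```

In practice, the bounding-box annotations would need to be transformed together with each image, which this sketch does not cover.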
After image enhancement, the dataset grew to 37,583 images (13 × 2,891, i.e. the original images plus 12 enhanced versions of each). The enlarged dataset was divided into a training set, a validation set, and a test set at a ratio of 8:1:1. Thus, the training set contained 30,065 images for model training, the validation set contained 3,759 images for model evaluation, and the test set contained 3,759 images for verifying the model's generalizability.
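As an illustration, an 8:1:1 split of this kind can be sketched as follows. The shuffling seed, the rounding rule, and the function name `split_dataset` are assumptions that the text does not specify, so the resulting counts may differ from the reported ones by an image or two.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into train/validation/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# 37,583 images in the enlarged dataset; indices stand in for file names.
train, val, test = split_dataset(range(37583))
print(len(train), len(val), len(test))
```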
Image annotation is a key step in building a dataset of images of apple leaf diseases. The locations of the disease spots and the categories of diseases in the apple leaf images were annotated in Pascal VOC2012 format by using the data labeling software LabelImg. Each annotated image was saved with an extensible markup language (XML) file that contained the locations of the disease spots and the types of diseases.
Errors in image annotation are inevitable due to the limitations of the annotation tools and the manual nature of the task. However, the image labels were checked repeatedly and verified by experienced professionals to minimize annotation errors and their impact on the subsequent training.

Construction of FPN–ISResNet–Faster RCNN model
A model to identify diseases in leaves should include a feature extraction network, disease localization, and classification and regression. The FPN–ISResNet–Faster RCNN framework proposed here includes the ISResNet, a region proposal network (RPN), and a fully connected (FC) layer. ISResNet includes ResNet, inception, and squeeze-and-excitation network (SENet) modules. The ResNet module, which is a residual network, can accelerate convergence and avoid overfitting the network. The inception module can increase the network width and its receptive field. The SENet module can efficiently fuse useful information to extract the most important features and suppress noise in the network. The RPN receives the feature map obtained by ISResNet and generates high-quality bounding boxes to locate disease spots on the apple leaves. The FC layer receives all information in the flattened one-dimensional tensor and generates a classification together with the regression parameters, including the central coordinates X and Y, and the width and height of each bounding box (Figure 2).
The ISResNet structure can suppress interference due to background information and efficiently learn useful information to extract the most important features from the images. ISResNet is designed to learn and detect disease spots rather than interfering factors such as brightness and contrast from images in the dataset. The ISResNet is designed as shown in Table 1.
The classification networks AlexNet, VGGNet, ResNet, and GoogleNet were evaluated by trial and error according to the characteristics of disease spots in the apple leaf images. ResNet was found to be the most suitable backbone network for detecting diseases.
ResNet generally has 18, 34, 50, 101, or 152 hidden layers. Therefore, we designed our model with 18, 34, 50, 101, and 152 hidden layers to find the optimal number. ResNet50, which has 50 hidden layers, delivered the best accuracy in terms of detecting apple leaf diseases and was therefore used as the backbone network of the FPN–ISResNet–Faster RCNN model.
ResNet50 has five convolution layers (Table 1). The first two layers, Conv1 and Conv2_x, can learn shallow features of images, such as leaf color and edges. The last three layers, Conv3_x, Conv4_x, and Conv5_x, can extract deep and abstract features.
The inception module can increase the width of ResNet50 and improve its ability to learn multiscale features of the disease spots. Two inception modules were added behind the Conv3_x and Conv4_x layers of ISResNet to integrate the contextual semantic information of the network (Table 1). Each inception module contains 1 × 1, 3 × 3, and 5 × 5 convolution layers as well as a 3 × 3 max pooling layer. The 5 × 5 convolution layer in the original inception module is replaced by two 3 × 3 convolution layers to reduce the amount of computation while obtaining the same receptive field in the network. Each convolution layer applies convolution kernels of a different size to the output of the previous convolution layer. The results of the max pooling layer are used as inputs to the 1 × 1 convolution layer to change the number of channels of the feature map. The feature maps obtained from the convolution and pooling layers are concatenated along the channel dimension and output.
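The claim that two stacked 3 × 3 convolutions cover the same receptive field as one 5 × 5 convolution at lower cost can be checked with a short calculation. The sketch below assumes stride-1 convolutions with the same number of input and output channels (256 is an arbitrary illustrative value) and ignores biases; the function names are hypothetical.

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


def conv_weight_count(kernel_sizes, channels):
    """Number of weights when every layer maps `channels` to `channels` (no bias)."""
    return sum(k * k * channels * channels for k in kernel_sizes)


channels = 256  # illustrative assumption, not a value taken from Table 1
print(receptive_field([5]), receptive_field([3, 3]))
print(conv_weight_count([5], channels), conv_weight_count([3, 3], channels))
```

Under these assumptions, both variants see a 5 × 5 input window, while the stacked version needs 18/25 = 72% of the weights of the single 5 × 5 layer.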
The attention mechanism, including channel and spatial mechanisms, can automatically assign small weights to interference to suppress it, and large weights to disease spots to strengthen their representation. SENet, as a pioneering network of the channel attention mechanism, is added to ResNet50.
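The squeeze operation is used to compress the spatial dimensions H and W to 1 × 1, that is, 1 × 1 × C, where C is the number of channels and a pixel represents a channel, using global average pooling to obtain a C-dimensional vector z:
$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$
where $z_c$ is the cth element of z, that is, the result of the feature $u_c$ after global average pooling over the spatial dimensions H × W, $U = [u_1, u_2, \ldots, u_C]$, $v_c$ is the cth convolution kernel, and $u_c$ is the output of $v_c$. The excitation operation can fully capture channel-wise dependencies as follows:
$$s = F_{ex}(z, W) = \sigma(W_2\,\delta(W_1 z))$$
where σ is the sigmoid function, δ is the ReLU function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, and r is the reduction ratio used to reduce the dimensionality of the fully connected layers.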
Two SENet structures are placed behind the Conv2_x and Conv5_x layers to improve the ability of the model to learn diseased areas (Table 1) (Yang et al., 2021). A feature map X in the residual block is pooled globally using the SENet and input to the FC layer. The number of feature channels is changed to 1/r of the number of original channels C in the first FC layer, where r is the reduction coefficient, to decrease the computational cost and increase the capacity of the SE blocks in the SENet. A value of r = 16 was used with SE-ResNet-50 in this study after trial and error, following J. Hu et al. (2020), because it achieves a trade-off between accuracy and complexity (parameter size) (Table 2). The number of channels is then restored to C in the second FC layer. The results of the second FC layer are input to the sigmoid activation function to obtain C weights, which are assigned to the C channels to obtain a feature map based on channel attention. A sigmoid function and a ReLU function are used in the ISResNet.
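The squeeze, excitation, and channel-reweighting steps described above can be sketched in plain Python. This is a toy illustration under stated assumptions rather than the model's implementation: the feature map is a nested list with C = 2 channels, r = 2 is used instead of r = 16 to keep the example small, biases are omitted, and the function name `se_block` and the weight values are hypothetical.

```python
import math


def se_block(feature, w1, w2):
    """Squeeze-and-excitation on a feature map.

    feature: list of C channels, each an H x W nested list.
    w1: (C/r) x C weights of the first FC layer (biases omitted).
    w2: C x (C/r) weights of the second FC layer (biases omitted).
    """
    # Squeeze: global average pooling gives one value per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one weight per channel.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]
    # Scale: multiply every channel by its attention weight.
    scaled = [[[s_c * p for p in row] for row in ch] for s_c, ch in zip(s, feature)]
    return z, s, scaled


feature = [[[1.0, 3.0]],   # channel 0, a 1 x 2 map
           [[2.0, 2.0]]]   # channel 1
w1 = [[0.5, -0.5]]         # C/r = 1 hidden unit (toy values)
w2 = [[1.0], [-1.0]]
z, s, scaled = se_block(feature, w1, w2)
print(z, s)
```

Each weight in `s` lies between 0 and 1, so channels judged less informative are attenuated rather than removed.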
An FPN is added to the model to locate small spots of Alternaria and rust disease on the leaves. The FPN uses a structure similar to the U-Net model to carry out the forward processing of a network from bottom to top. The results are derived from the output of the last layer of each stage to construct a feature pyramid.
The features in the last residual block layers of the conv2, conv3, conv4, and conv5 layers are selected as the features of the FPN, denoted as {C2, C3, C4, C5}, with strides of 4, 8, 16, and 32, respectively. Up-sampling from top to bottom is carried out using a transposed convolution, and lateral connections are used to fuse the results of up-sampling with the feature maps of the same size generated from bottom to top. The feature M5 is obtained from the C5 layer with a 1 × 1 convolution operation. M5 is up-sampled and fused with the output of a 1 × 1 convolution of the C4 layer to obtain M4. The above process is repeated twice to obtain M3 and M2, respectively. The features of the layers P2, P3, P4, and P5 are obtained after a 3 × 3 convolution of the features in the M layers. The number of channels for all feature maps in the FPN is set to 256. The designed FPN can fuse low-level position information and high-level semantic information of the CNN to improve the detectability and accuracy of the positioning of apple leaf diseases.
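The top-down pathway described above can be illustrated with a simplified sketch. Several assumptions are made for brevity: the maps are single-channel nested lists that have already passed through the 1 × 1 lateral convolution, nearest-neighbor up-sampling is swapped in for the transposed convolution used in the model, the final 3 × 3 convolutions are omitted, and the 960 × 640 input size is only an illustrative choice. All function names are hypothetical.

```python
def fpn_level_sizes(width, height, strides=(4, 8, 16, 32)):
    """Spatial size (width, height) of C2..C5 for a given input size."""
    return {f"C{i + 2}": (width // s, height // s) for i, s in enumerate(strides)}


def upsample2x(fmap):
    """Nearest-neighbor 2x up-sampling (stand-in for a transposed convolution)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two maps of equal size (the lateral fusion)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down(laterals):
    """laterals: [C2, C3, C4, C5] after the lateral convolution; returns [M2..M5]."""
    merged = [laterals[-1]]  # M5 comes directly from C5
    for lateral in reversed(laterals[:-1]):
        merged.append(add_maps(lateral, upsample2x(merged[-1])))
    return merged[::-1]


print(fpn_level_sizes(960, 640))
c5 = [[1]]
c4 = [[0] * 2 for _ in range(2)]
c3 = [[1] * 4 for _ in range(4)]
c2 = [[0] * 8 for _ in range(8)]
m2, m3, m4, m5 = top_down([c2, c3, c4, c5])
print(len(m2), len(m2[0]))
```

Each level doubles the resolution of the one above it, which is why the strides of 4, 8, 16, and 32 allow the up-sampled map to be added directly to the next lateral map.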
The RPN can generate a series of high-quality anchors. The predicted feature maps are output using the FPN. Reshape1 is used to transform the dimensionality of the feature maps from (B, C, H, W) to (B, 2, H × C/2, W), which are input to the softmax layer to distinguish the foreground from the background. Reshape2 is used to transform the dimensions of the feature maps from (B, 2, H × C/2, W) back to the original dimensions (B, C, H, W). Corresponding proposal boxes are generated by combining a bounding box and the parameters of regression. The feature extraction network sends the feature maps to the RPN, which performs a 3 × 3 convolution operation. The score of the feature map is obtained using two 1 × 1 convolution operations in the regression and classification layers. Finally, a series of candidate boxes is obtained according to the anchors with the sorted scores of the intersection over union (IOU) of the feature maps.
The size of the feature map output by the network in this paper was 7 × 7 × 2048. A 3 × 3 convolution is applied to this feature map to obtain another 7 × 7 × 2048 feature map. Two 1 × 1 convolutions are then applied in the classification and regression layers to obtain two feature maps of sizes 7 × 7 × 36 and 7 × 7 × 72, respectively, where 36 = 6 areas of anchor frames (16², 32², 64², 128², 256², and 512²) × 3 ratios of anchor frames (1:1, 1:2, and 2:1) × 2 features (foreground and background), and 72 = 6 areas of anchor frames × 3 ratios of anchor frames × 4 regression parameters (central coordinates X and Y, and the width and height of each bounding box). Finally, the candidate boxes are obtained.
Designing the anchor frame is an important step of target positioning in the field of target detection (Ren et al., 2018). Multiscale anchor frames are used to accurately locate disease spots of different sizes on the leaves of apple trees in the images. Eighteen scales of anchor boxes are generated according to the combination of their areas of 16², 32², 64², 128², 256², and 512², and the ratios of the anchor frames (aspect ratios) of 1:1, 1:2, and 2:1. The anchor boxes are input to the regression and classification layers of the RPN to obtain the corresponding classification scores and regression parameters, including the central coordinates X and Y, and the width and height of each bounding box.
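The combination of six areas and three aspect ratios, and the resulting channel counts of the RPN output layers, can be reproduced with a short sketch. The interpretation of each ratio as width:height and the function name `make_anchors` are assumptions made for illustration.

```python
import math


def make_anchors(areas, ratios):
    """Return (width, height) pairs; each ratio is read as width:height."""
    anchors = []
    for area in areas:
        for rw, rh in ratios:
            w = math.sqrt(area * rw / rh)
            h = area / w
            anchors.append((w, h))
    return anchors


areas = [s * s for s in (16, 32, 64, 128, 256, 512)]
ratios = [(1, 1), (1, 2), (2, 1)]
anchors = make_anchors(areas, ratios)

num_anchors = len(anchors)        # 6 areas x 3 ratios
cls_channels = num_anchors * 2    # foreground and background scores
reg_channels = num_anchors * 4    # X, Y, width, and height offsets
print(num_anchors, cls_channels, reg_channels)
```

Every anchor keeps the area of its scale while its shape changes with the ratio, so the 18 anchors at each position cover both compact and elongated disease spots.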
A large overlap occurs between the anchor boxes generated by the RPN. A non-maximum suppression (NMS) algorithm is used to screen the anchor boxes according to their classification scores (Neubeck & Gool, 2006). The NMS is run as follows: the anchor boxes are sorted according to their confidences, and the one with the highest confidence is selected. The IOU between the selected anchor box and each remaining anchor box is calculated, and anchor boxes whose IOU is larger than the pre-set threshold are eliminated (suppressed). The above steps are repeated until all anchor boxes are processed. The anchor boxes selected in each iteration are retained for output and do not appear in the next iteration. The positions of the anchor boxes retained after screening are corrected according to the regression parameters of the bounding boxes to obtain high-quality proposal bounding boxes. Multi-scale anchor boxes improve the positioning ability of the FPN–ISResNet–Faster RCNN model for the detection of apple leaf diseases.
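The NMS procedure described above can be written as a minimal greedy sketch. Boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates, and the threshold of 0.5 in the example is illustrative only; the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold):
    """Return the indices of the boxes kept by greedy non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest remaining confidence
        keep.append(best)
        # Suppress every remaining box that overlaps the selected one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, threshold=0.5))  # the second box overlaps the first
```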

Evaluation of detection accuracy
Average precision (AP) is a commonly used index to assess the accuracy of target detection. AP is the definite integral of the precision–recall (P-R) curve, which is formed by the recall and precision rates of all categories of apple leaf diseases (i.e. the area between the P-R curve and the X-axis), $AP = \int_0^1 P(R)\,dR$.
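The area under the P-R curve can be approximated numerically from a list of (recall, precision) points. The sketch below uses all-point interpolation with a monotonically non-increasing precision envelope, which is one common convention; the text does not state which interpolation was used, so this choice and the function name are assumptions.

```python
def average_precision(recalls, precisions):
    """Area under the P-R curve; recalls must be sorted in ascending order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


print(average_precision([0.5, 1.0], [1.0, 0.5]))  # 0.75
```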
The recall rate is the ratio of all correctly detected targets to all targets that should be detected by a model in the ground truth bounding boxes (GTBox). Precision is the ratio of all correctly detected targets to all actually detected targets:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
where P and R are the precision and recall, respectively, TP is the number of samples that a classifier detects correctly (true positive), FP is the number of samples that a classifier detects incorrectly as positive (false positive), and FN is the number of samples that a classifier incorrectly detects as negative (false negative). TP, FP, and FN are determined by the IOU between a prediction bounding box (PBBox) and a GTBox. IOU is used in this paper to evaluate the prediction effect of PBBox:
$$IOU = \frac{PBBox \cap GTBox}{PBBox \cup GTBox}$$
where PBBox∩GTBox is the intersection between PBBox and GTBox, and PBBox∪GTBox is their union.
A PBBox has a good predictive effect if its IOU exceeds a pre-set threshold, in which case it is a TP. The prediction of a PBBox fails if its IOU does not exceed the pre-set threshold, in which case it is an FP. A GTBox that is not detected by any PBBox is counted as an FN. The threshold of IOU is set to 0.7 in this study. The AP used in this paper was the average over 10 IOU thresholds between 0.5 and 0.95 with an increment of 0.05.
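The counting of TP, FP, and FN from IOU values can be illustrated as follows. The greedy one-to-one matching of predictions (assumed to be sorted by descending confidence) to ground-truth boxes is a common convention that the text does not spell out, and boxes are assumed to be (x1, y1, x2, y2) tuples; all function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(pred_boxes, gt_boxes, threshold):
    """Return (TP, FP, FN); each ground-truth box can be matched at most once."""
    matched = set()
    tp = fp = 0
    for pb in pred_boxes:
        best_iou, best_j = 0.0, None
        for j, gb in enumerate(gt_boxes):
            if j not in matched and iou(pb, gb) > best_iou:
                best_iou, best_j = iou(pb, gb), j
        if best_j is not None and best_iou > threshold:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
    fn = len(gt_boxes) - len(matched)  # ground-truth boxes left undetected
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


# The 10 IOU thresholds over which AP is averaged: 0.50, 0.55, ..., 0.95.
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

preds = [(0, 0, 10, 10), (50, 50, 60, 60)]
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
print(precision_recall(*count_matches(preds, gts, threshold=0.5)))
```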
In addition, AP50, AP75, and FPS are used to evaluate the detection accuracy of the FPN–ISResNet–Faster RCNN model. AP50 and AP75 are the average accuracies obtained when the IOU thresholds are 0.5 and 0.75, respectively. FPS is the number of images that can be processed per second and is used to evaluate the speed of detection of the FPN–ISResNet–Faster RCNN model, where the higher the FPS, the faster the detection by the model.

Experimental environment and configuration
The computer hardware environment for model training included an Intel(R) Xeon(R) processor with 32 GB of memory, an NVIDIA GeForce RTX2080 Ti GPU with 11 GB of video memory, and CUDA version 11.1. Python was used for model construction and image enhancement. PyCharm was used as the integrated development environment (IDE). PyTorch was used as the framework for deep learning for the detection of apple leaf diseases. Algorithms such as the gamma transform, Gaussian noise correction, and the rotation, flipping, and sharpening of images were used from the open-source library OpenCV to enhance the images.

Comparison of experimental results by different models
A single-shot multibox detector (SSD) model, which is a one-stage object detector (Sun et al., 2021), and the Faster RCNN model, which is a two-stage object detector, were used in this paper to detect apple leaf diseases. ResNet50 was used as the feature extraction network of the SSD model. VGG16, GoogleNet, ResNet50, ResNeXt, and ISResNet (designed in this study) were selected as feature extraction networks of the Faster RCNN model. The main hyperparameters used in the models were set as follows: initial learning rate = 0.001, $\text{learning rate} = \text{initial learning rate} \times \text{decay rate}^{\,\text{step}/\text{decay steps}}$, momentum = 0.9, weight decay = 0.0005, image size = 224 × 224, reduction coefficient r = 16, and batch size = 32. The results of the FPN–ISResNet–Faster RCNN model were compared with those of the SSD and the Faster RCNN models (Table 3).
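The exponential learning rate schedule listed among the hyperparameters can be sketched as below. Only the initial learning rate of 0.001 is taken from the text; the decay rate of 0.9 and the 1,000 decay steps are placeholder assumptions, since their values are not reported, and the function name is hypothetical.

```python
def learning_rate(step, initial_lr=0.001, decay_rate=0.9, decay_steps=1000):
    """Exponential decay: initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)


for step in (0, 1000, 5000):
    print(step, learning_rate(step))
```

With these placeholder values, the learning rate falls by 10% every 1,000 steps.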
According to Table 3, the FPN–ISResNet–Faster RCNN model achieved the highest accuracies, with an AP of 62.71%, an AP50 of 93.68%, and an AP75 of 70.94%. Faster RCNN with the feature extraction network of ISResNet had the best accuracy (AP50 = 90.86%) in detecting powdery mildew, whereas FPN–ISResNet–Faster RCNN had the best accuracy in detecting Alternaria leaf spots (AP50 = 93.34%) and rust disease (AP50 = 96.94%). The accuracy of detecting powdery mildew was higher than those of identifying Alternaria leaf spots and rust disease when using all feature extraction networks of the SSD and Faster RCNN models, except for the FPN–ISResNet–Faster RCNN model. This is because the areas of disease spots representing powdery mildew were generally larger than those of Alternaria leaf spots and rust disease. The FPN–ISResNet–Faster RCNN model had higher accuracies in detecting Alternaria leaf spots (93.34%) and rust disease (96.94%) than those of the SSD and Faster RCNN models.
The SSD model had the highest speed of detection (46.89 FPS) of all models because the image size input to it was compressed to 300 × 300, and no

Visualized results of detection of apple leaf diseases
Powdery mildew, Alternaria leaf spots, and rust disease were detected in the apple tree leaves using the proposed FPN–ISResNet–Faster RCNN model as shown in Figure 3. The model was able to detect diseases in different situations, such as a leaf afflicted by only one type of disease, and one with multiple diseases. The model's accuracy of detection was mostly above 90%, which shows that it was effective because it suppressed interference from the natural background; therefore, its ability to detect apple leaf diseases had high generalizability.
The FPN–ISResNet–Faster RCNN made a few incorrect identifications and missed a few diseased leaves due to interference from the natural environment. An example of the misidentification of rust disease, labeled as (1), is shown in the left image of Figure 3E. The red branches in position (1) were mistakenly detected as rust disease by the model because the color of the spots (red) of rust disease was identical to that of adjacent tree branches. An example of a missed detection of powdery mildew, labeled as (2), is shown in the right image of Figure 3E, which occurred because the leaf in position (2) had shriveled and therefore could not be detected by the model.

Performance test of FPN–ISResNet–Faster RCNN
To test and verify the effectiveness and generalizability of the FPN–ISResNet–Faster RCNN proposed in this study, test samples of different disease types in apple tree leaves were selected (different from the training samples). The apple orchard for the experiment was located in Shatan village, Enhe Town, Zhongning County, Ningxia, China (Figure 4). The apple trees were 5-year-old Red Fuji, with a plant spacing of 1.5 m and a row spacing of 3.5 m. A multi-rotor unmanned aerial vehicle (UAV; Mavic 2 Zoom) was used to collect crown images (MP4 video format) of the apple trees along a scheduled route in the orchard at 16:00 on 8 July 2021. The flying height and speed of the UAV were 3.5 m and 1.5 m/s, respectively. The angle between the vertical and the central axis of the camera was set at 45° to avoid the effect of the downward rotating air flow generated by the UAV rotors on imaging. Four apple trees were selected, and their disease type (i.e. powdery mildew, Alternaria leaf spots, and rust disease), longitudes, and latitudes were recorded with a GPS handset.
Redundant images acquired during the take-off and landing of the UAV were deleted, and images acquired at high altitude were retained. Agisoft Metashape Professional software was used for image stitching. The polygon subset method in the ENVI 5.3 software was used to crop the redundant parts of the stitched images outside the orchard. Geometric and radiation corrections were used to correct the distortion of the cropped image. The corrected image was divided into image slices with a size of 960 × 640 because whole large images cannot be input into FPN–ISResNet–Faster RCNN at once.
The segmented images were input into FPN–ISResNet–Faster RCNN to automatically identify and extract disease plaques in apple tree leaves in the orchard according to the disease categories learned during training. A morphological method was used to remove noise and fill small holes generated in the image mask. A conditional random field was used to optimize labels of the predicted image. A Douglas–Peucker algorithm was used to simplify the mask boundary. All extracted image slices were stitched from top to bottom to obtain a distribution map of disease in apple tree leaves in the orchard (Figure 5).
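The boundary simplification step can be illustrated with a textbook recursive form of the Douglas–Peucker algorithm. The tolerance `epsilon`, the open polyline input, and the function names are assumptions for illustration; the text does not report the settings used on the mask boundaries.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)


def douglas_peucker(points, epsilon):
    """Simplify a polyline, keeping points farther than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_line_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax > epsilon:
        # Keep the farthest point and simplify both halves recursively.
        left = douglas_peucker(points[:index + 1], epsilon)
        right = douglas_peucker(points[index:], epsilon)
        return left[:-1] + right
    return [points[0], points[-1]]


boundary = [(0, 0), (1, 0.01), (2, 0), (3, 2), (4, 0)]
print(douglas_peucker(boundary, epsilon=0.1))
```

A larger `epsilon` removes more vertices and gives smoother but less detailed mask outlines.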
According to Figure 5, FPN–ISResNet–Faster RCNN could segment powdery mildew, Alternaria leaf spot, and rust disease in apple tree leaves automatically and clearly in the orchard without training samples. Disease plaques in apple tree leaves are masked with complete edges, few fragmentations, few mask adhesions, much semantic information, and many plaque details. Therefore, FPN–ISResNet–Faster RCNN can detect diseases in leaves of apple trees without training samples, with high accuracy and generalizability.
The performance of the FPN–ISResNet–Faster RCNN was compared with the results of other studies. In this study, the AP50 of FPN–ISResNet–Faster RCNN and the precision rate of ISResNet for detecting apple leaf diseases are 93.68% and 98.79%, respectively. Jiang et al. (2019) proposed an INAR-SSD (SSD with inception module and rainbow concatenation) model to detect five apple leaf diseases (Alternaria leaf spots, brown spots, mosaic, grey spots, and rust) with a detection performance of 78.80% mean average precision (mAP) for one apple leaf disease dataset. Douarre et al. (2019) used SegNet generative adversarial nets (GANs) to detect apple diseases with an accuracy of 64.3%. Yu et al. (2019) constructed an ROI-aware DCNN-based visual geometry group (VGG) to detect apple leaf diseases with an accuracy of 84.3% including 865 images. Kobayashi et al. (2018) used InceptionV3 to detect apple leaf diseases with an accuracy of 93.0% trained from the PlantVillage Subset, which contains 9,568 images. G. Wang et al. (2017) used VGG to detect apple leaf diseases with an accuracy of 90.4% trained from the PlantVillage Subset, which contains 2,086 images. Peng and Cai (2017) used FCN to detect apple leaf diseases with an accuracy of 87.5%. Sun et al. (2021) constructed the MEAN (Mobile End AppleNet block)-SSD to detect apple leaf diseases with an accuracy of 83.12% mAP trained from the AppleDisease5 dataset. Therefore, the proposed FPN–ISResNet–Faster RCNN model was shown to have a relatively high accuracy in detecting apple tree leaf diseases.

Selection of feature extraction networks
The feature extraction network can learn the semantic information of diseases. Traditional feature extraction networks include GoogleNet, VGG16, ResNet18, [...] had higher precision values of 97.35%, 97.46%, and 96.84%, respectively, than the other traditional feature extraction networks. ISResNet had the best performance in terms of detection accuracy, with a precision of 98.79%. Therefore, it had a strong ability to learn the semantic features of the dataset (Hou et al., 2019).

Influence of position and number of SENet and inception modules on accuracy
ResNet50 was used as the backbone network, and the SENet module was placed behind each ConvN_x of the ResNet50, where N = 2, 3, 4, and 5. The number of SENet modules was set to M = 2, 3, and 4. The detection accuracies obtained with different positions and numbers of SENet modules after ConvN_x are shown in Table 5.
According to Table 5, the detection accuracy of ResNet50 was better if two SENets (M = 2) were put behind the convolution layers compared to that of other numbers of SENet modules (M = 3, 4). In particular, the accuracy of ResNet50 was the highest (98.10%) when two SENets were put behind Conv2_x and Conv5_x. This value was 0.75% higher than that obtained by ResNet50 without the attention mechanism. The SENet put behind Conv2_x allocated a small weight to interference and large weights to the disease spots. The SENet put behind Conv5_x output a feature map with the most abstract semantic information.
With two SENets fixed after Conv2_x and Conv5_x, the inception module was put behind each ConvN_x of the ResNet50, where N = 2, 3, 4, and 5. The number of inception modules was set to M = 2, 3, and 4. The detection accuracies obtained with different positions and numbers of inception modules after ConvN_x are shown in Table 6.
The optimal combination of ResNet50 featured two inception modules behind Conv3_x and Conv4_x (i.e. one each), which increased the detection accuracy by 0.69% compared with SE-ResNet50 without the inception module. Placing inception modules behind Conv3_x and Conv4_x can enable the fusion of high-level and low-level semantic information to improve the ability of the model to learn multiscale features. It is not useful to learn the multiscale features of the network if the inception module is put behind Conv1_x because it extracts only low-level semantic features learned by Conv1_x, such as the color and edges of leaves in an image. The inception module is not placed behind Conv5_x because this does not improve detection accuracy but increases computational load.

Influence of FPN on detection accuracy
The FPN combines low-level location information and high-level semantic information by using a transposed convolution and a jump connection to improve the detection of multiscale disease spots, especially small and dense spots. The detection accuracies obtained by the FPN-ISResNet-Faster RCNN model with multiscale FPN structures for different apple leaf diseases were compared with those of the ISResNet-Faster RCNN model, as shown in Table 7.
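The top-down fusion step of an FPN can be sketched as follows. This is a simplified, single-channel illustration: it uses nearest-neighbor up-sampling in place of a transposed convolution and omits the 1 × 1 lateral convolution and any smoothing convolution, so it should be read as an assumption-laden sketch rather than the structure used in this study. The function names and toy values are hypothetical.

```python
def upsample_nearest_2x(feature):
    """Double the height and width of a 2-D map by repeating each value."""
    out = []
    for row in feature:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def top_down_merge(coarse, lateral):
    """Add the up-sampled coarse (deep) map to the finer (shallow) lateral map."""
    up = upsample_nearest_2x(coarse)
    if len(up) != len(lateral) or len(up[0]) != len(lateral[0]):
        raise ValueError("lateral map must be exactly twice the size of the coarse map")
    return [
        [u + l for u, l in zip(up_row, lat_row)]
        for up_row, lat_row in zip(up, lateral)
    ]


# Toy example: a 1 x 2 deep map fused with a 2 x 4 shallow map.
deep_map = [[1, 2]]
shallow_map = [[10, 20, 30, 40], [50, 60, 70, 80]]
merged = top_down_merge(deep_map, shallow_map)
# merged == [[11, 21, 32, 42], [51, 61, 72, 82]]
```

Each position of the merged map carries both the coarse, semantically strong value and the fine, spatially precise value, which is the intuition behind using the pyramid for small disease spots.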
The AP, AP50, and AP75 values of FPN-ISResNet-Faster RCNN were 4.54%, 3.80%, and 6.58% higher than those of ISResNet-Faster RCNN, respectively. The precision rates of FPN-ISResNet-Faster RCNN for Alternaria leaf spots and rust disease were 5.38% and 7.8% higher than those of ISResNet-Faster RCNN, respectively. FPN-ISResNet-Faster RCNN enhanced the ability of the network to learn small disease spots, resulting in good detection accuracy. By contrast, when a feature graph with small disease spots and abstract semantic information was input to ISResNet-Faster RCNN to obtain another feature graph with smaller disease spots, it resulted in poor extraction. FPN-ISResNet-Faster RCNN can therefore detect Alternaria leaf spots and rust disease, which appear as small disease spots in the leaves of apple trees, with higher accuracy than ISResNet-Faster RCNN.
However, the detection accuracy of FPN-ISResNet-Faster RCNN for powdery mildew was poorer than that of ISResNet-Faster RCNN (0.43% lower), which may be due to the following reasons:
1) Powdery mildew in the leaves of apple trees in the images of this study is mostly in the late stage of the disease. The leaves infected by powdery mildew are generally covered with white powder, shriveled up together, and are large targets to be detected; traditional deep learning models such as ISResNet-Faster RCNN can easily detect them with high accuracy.
2) Multiple down- and up-sampling operations in the FPN lead to the dilution of semantic information during the transmission of information from deep to shallow layers, errors in the positioning information of deep networks, and weak feature information in the shallow feature maps.
3) The target edge information of disease spots in the shallow layers can only be accessed at the final fusion stage, resulting in a fuzzy target boundary and an incomplete target structure in the predicted saliency map.
4) The 1 × 1 convolution layer adopted in the FPN reduces the channel dimensionality of feature maps, which can easily result in lost channel information.
5) There are many convolution operators of different dimensions and repeated aggregation operations in the nearest neighbor and bilinear interpolation methods adopted in the up-sampling operator of the FPN, and differences in semantic information and resolution between the features at adjacent layers generate a lot of redundant and wrong information in the process of pixel-level merging.
6) The complex integration of modules in the FPN-ISResNet-Faster RCNN model may cause inaccurate positioning and identification of powdery mildew in the leaves of apple trees in images.

Influence of scale of anchor frames on detection accuracy
The anchor frame is an important part of Faster RCNN. The original anchor frame of Faster RCNN had three scales (128 × 128, 256 × 256, and 512 × 512) and three aspect ratios (1:1, 1:2, and 2:1). Therefore, there were nine anchor boxes at each pixel of a feature graph. Each anchor box was mapped to the original image according to the ratio of the feature graph to the original image. A candidate proposal box was obtained according to the classification score. The aspect ratios were still set to 1:2, 1:1, and 2:1, but the basic scale of the anchor boxes was changed from the original 128 to 16 in this study because of the multiscale disease spots in the apple leaf images in the dataset. The anchor size in this study was changed from the original three scales to three, five, six, and nine scales (Table 8). The detection accuracies of apple diseases were obtained using the FPN-ISResNet-Faster RCNN model at two aspect ratios and different scales of anchor frames, as shown in Table 8. The FPN-ISResNet-Faster RCNN model with six scales of anchor frames (16 × 16, 32 × 32, 64 × 64, 128 × 128, 256 × 256, and 512 × 512) had the highest detection accuracy (AP = 62.71%, AP50 = 93.68%, and AP75 = 70.94%). Therefore, the scale of the anchor frame had a significant impact on the detection accuracy of apple leaf diseases.
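How the number of anchors per location follows from the scales and aspect ratios can be shown with a short sketch. It assumes the common convention that each anchor keeps an area of scale × scale and that the ratio is read as height to width; the source does not state which convention was used, and the function name `generate_anchors` is hypothetical.

```python
import math


def generate_anchors(scales, ratios):
    """Return (x1, y1, x2, y2) anchors centered on the origin.

    Assumption: each anchor has area scale * scale and ratio = height / width.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((-width / 2, -height / 2, width / 2, height / 2))
    return anchors


RATIOS = [0.5, 1.0, 2.0]                      # 1:2, 1:1, and 2:1
ORIGINAL_SCALES = [128, 256, 512]             # original Faster RCNN setting
SIX_SCALES = [16, 32, 64, 128, 256, 512]      # six-scale setting from this study

original_anchors = generate_anchors(ORIGINAL_SCALES, RATIOS)  # 9 per location
six_scale_anchors = generate_anchors(SIX_SCALES, RATIOS)      # 18 per location
```

Under these assumptions, three scales and three ratios give the nine anchors per location described above, and the six-scale setting gives 18, with the added 16, 32, and 64 scales covering small disease spots.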
A few cases of the misidentification and missed detection of apple leaf diseases were noted. The training time of the FPN-ISResNet-Faster RCNN model was longer than those of other models, although its accuracy was higher (Hou et al., 2020). Online image enhancement should be used in the future to replace offline enhancement to improve the speed of training and robustness of the model. The impact of different optimizers and loss functions on network training should also be explored. In addition, generative adversarial networks should be used to automatically generate datasets of diseases in the leaves of plants in the natural environment.

Conclusions
This study proposed a deep learning-based FPN-ISResNet-Faster RCNN model to improve the detection accuracy of multiscale disease spots in apple tree leaves. The FPN integrates low-level location information with high-level semantic information to improve the capacity of the model to identify small disease spots. An attention mechanism and an inception module were integrated into the feature extraction network of the model to reduce interference from background information, and increase the width and receptive field of the network for multiscale disease detection.
The dataset was expanded using 12 image enhancement methods to simulate the morphology of the leaves in the natural environment to improve the generalizability and robustness of the model. The detection accuracies of apple leaf diseases obtained from different models and feature extraction networks, with different positions and numbers of SENet and inception modules, and with multiscale anchor frames were compared to verify the effectiveness of the FPN-ISResNet-Faster RCNN model. The proposed model obtained the best accuracy among all models compared, with an AP of 62.71%, an AP50 of 93.68%, and an AP75 of 70.94%. The ISResNet structure delivered the best performance, with a precision of 98.79%. The accuracy of ResNet50 was 98.10% when two SENets were put behind Conv2_x and Conv5_x (one each). Placing two inception modules behind Conv3_x and Conv4_x was the optimal structure of ResNet50, and yielded a detection accuracy of 98.79%. The AP value obtained by the FPN-ISResNet-Faster RCNN model was 4.54% higher than that obtained by the ISResNet-Faster RCNN model. The FPN-ISResNet-Faster RCNN model with six scales of anchor frames (16 × 16, 32 × 32, 64 × 64, 128 × 128, 256 × 256, and 512 × 512) had the best accuracy for detecting apple leaf diseases.
In this study, the generalizability of the FPN-ISResNet-Faster RCNN model was effectively improved using the following methods: 1) data augmentation, such as translation, denoising, flipping, cropping, rotation, and weighting; 2) network improvement, such as increases in model complexity, optimization of model structure, regularization, and dropout; 3) improvement of training processes, such as weight initialization, early stopping, cross validation, optimizer selection, loss function, and batch size; and 4) postprocessing, such as ensemble learning and voting.

Figure 2. Framework of the FPN-ISResNet-Faster RCNN model, where C2-C5 represent the predicted feature layers, in represents the inception module, P2-P5 represent the predicted outputs of the feature graphs from the FPN, ROI represents the region of interest, and Cls_score represents the classification output by softmax.

Figure 3. Visualized detection results of (A) powdery mildew, (B) Alternaria leaf spots, (C) rust disease, and (D) Alternaria leaf spots and rust disease in apple tree leaves using the FPN-ISResNet-Faster RCNN model. (E) Identification errors in the left image labeled with (1) and omissions in the right image labeled with (2).
ResNet34, ResNet50, ResNet101, and ResNeXt. Two feature extraction networks, SE-ResNet and ISResNet, were designed in this study, where the former integrates SE and ResNet, and the latter integrates the inception structure and attention mechanism of GoogleNet into ResNeXt. The above nine feature extraction networks were trained and validated using the same dataset. The size of all training images was increased to 224 × 224, and the cross-entropy loss function and the Adam optimizer were used. The batch size was set to 32, the maximum number of iterations was 500, and the initial learning rate was set to 0.001. A warm-up method was used to reduce the learning rate and prevent the model from over-fitting. The precision of each feature extraction network is shown in Table 4. ResNet50, ResNeXt, and GoogleNet

Figure 5. Distribution of powdery mildew, Alternaria leaf spots, and rust disease in apple tree leaves in magnified images of the (a) northern and (b) southern orchard.

Table 3. Comparison of the accuracies and speeds of different models for detecting apple leaf diseases.

Table 4. Precision rates of feature extraction networks used to detect apple leaf diseases.

Table 5. Accuracies of ResNet50 with different positions and numbers of SENet modules after the convolution layers for the detection of apple leaf diseases.

Table 6. Accuracies of ResNet50 with different positions and numbers of inception modules after the convolution layers for the detection of apple leaf diseases.

Table 7. Comparison of detection accuracies of apple leaf diseases obtained from models with and without FPNs.

Table 8. Comparison of detection accuracies of apple leaf diseases obtained by the proposed model with different scales of anchor frames.