A Screen Location Method for Treating American Hyphantria cunea Larvae Using Convolutional Neural Network



Introduction
As a worldwide quarantine pest, American Hyphantria cunea seriously damages forest trees, fruit trees, and crops. It was first discovered in Dandong, Liaoning, China, in 1979 and spread rapidly from east to west. Notice no. 3 of 2018 issued by the State Forestry Administration shows that the current epidemic areas of American Hyphantria cunea involve 572 county-level administrative regions in 11 provinces (districts, cities). In addition, based on the survey results, adult American Hyphantria cunea was detected in a number of nonepidemic areas in 2018, and the diffusion situation is grim.
The epidemic may spread to Changzhou, Jiangsu province, and Huanggang, Hubei province. In addition, there is a risk of transmission to Shanghai and Zhejiang province.
Currently, the control methods for American Hyphantria cunea can be divided into physical control, chemical control, and biological control [1][2][3][4]. In physical control, herb control and light trapping are widely used [2]. In biological control, natural enemies of Hyphantria cunea, biocontrol bacteria, and viruses are mainly used [3]. In chemical control, spraying of chemical agents is mainly used [4]. Chemical control has the advantages of high efficiency, convenience, and wide applicability. However, it easily causes chemical pollution and waste of resources. Target application technology based on machine vision can effectively improve the spraying efficiency, reduce the dosage, and avoid chemical pollution.
Target application is one of the research focuses of precision pesticide application, and many scholars have conducted meaningful studies on it [5][6][7][8][9][10][11][12][13][14]. To address the poor conditions for mechanized operation in orchards in China and the ineffective spraying into gaps between fruit trees during continuous spraying with traditional orchard sprayers, Xu et al. designed an automatic target-spraying control system, which has good practical value for accurate targeting control and for the prevention of diseases and insect pests in sparse orchards [5]. Underwood et al. designed a system that can automatically deliver liquid to designated crops using a manipulator, a fine nozzle, and other equipment, which can effectively reduce production cost and improve production efficiency [6]. Liu et al. designed a spray rod-based precise target-spraying system according to the agronomic requirements of plants with large row spacing and plant spacing, which has good practicability for target application operations with plant spacing above 15 cm in the field [8]. In current agricultural practice, pesticides are usually applied evenly in the field. However, many insect pests and diseases show uneven spatial distribution. Moreover, excessive pesticide use leads to pesticide pollution. To reduce the use of agricultural chemicals and meet people's demand for healthy food, Oberti et al. used a modular agricultural robot to selectively spray grapevines to study target spraying [9]. Berenstein and Edan proposed a precise pesticide-spraying device that can deal with targets of irregular shape and variable size. The device includes a nozzle that can automatically adjust the spray angle, a color camera, a distance sensor, and other devices. Moreover, the device can spray specific targets on site and significantly reduce the amount of pesticide used [13].
Based on the color, shape, and distribution of the screen of American Hyphantria cunea larvae, a new screen location algorithm based on the convolutional neural network is proposed.
This algorithm can help the spraying robot make rapid and accurate decisions and improve the spraying efficiency.

Algorithm Flow of the Screen Location of American Hyphantria cunea Larvae
The screen image samples are collected on site in an area affected by American Hyphantria cunea. The screen data set of American Hyphantria cunea larvae is first created based on the color, texture, other characteristics, and distribution of the screen. Then, a multicolor space-based CNN architecture is proposed, and the screen data set of American Hyphantria cunea larvae is employed to train the model. The screen of American Hyphantria cunea larvae is located by the sliding window method and the CNN. Existing sliding window mechanisms are mainly divided into the multisize sliding window mechanism and the multiscale sliding window mechanism. Based on the original sliding window mechanism, a nonoverlapping sliding window is proposed, in which the regions in the candidate boxes do not overlap.
The algorithm flow of the screen location of American Hyphantria cunea larvae is shown in Figure 1. First, the original image is sharpened to improve the image contrast and enhance the recognition effect. Second, the maximum number of extractions N_m is set, and the extraction counter N is initialized to zero. The image is divided into several candidate frames by the sliding window method. The CNN model of RGB and YIQ is used to score each candidate frame and classify it into one of three levels (excellent, qualified, and unqualified) based on the score. Excellent candidate frames are retained, and unqualified candidate frames are eliminated. The regions in the qualified candidate frames are extracted and screened again; before each new extraction, the width and height of the window are reduced to half of their previous size. The extraction and screening process is complete when the number of extractions reaches the set value or when there is no qualified candidate frame. Finally, all the excellent candidate frames are fused, and the outline of the screen of American Hyphantria cunea larvae is drawn on the original image.

Gathering Images.
The reticular screen of American Hyphantria cunea larvae is the object, and a Canon 600D digital camera is used to take color pictures with a resolution of 960 × 720. As shown in Figure 2, the reticular screen pictures of American Hyphantria cunea larvae are collected in the field.
to sharpen the original image, which is shown in Figure 3. The equation for updating the pixel value of each point of the original image is as follows:

The Data Set.
To train the CNN model, the collected images are employed to create the reticular screen data set of American Hyphantria cunea larvae. The training set contains 1318 images, and the test set contains 1318 images. First, the original image is sharpened. Then, local images are manually cropped from it and divided into two categories (the reticular screen of American Hyphantria cunea larvae and non-American Hyphantria cunea larvae). The two types of image samples are stored in different folders in JPG format. Some images in the data set are shown in Figure 4.

History of CNN in Images.
In 1998, LeCun et al. proposed LeNet-5 and used it to classify handwritten digits [15]. The network structure of LeNet-5 is shown in Figure 5; it consists of an input layer, an output layer, three convolution layers, two pooling layers, and a fully connected layer. After that, the development of convolutional neural networks was at a low ebb, and there was no major breakthrough for a long time. In 2012, Krizhevsky et al. used 1.2 million high-resolution images from the ImageNet LSVRC-2010 contest to train a large deep convolutional neural network, which is called AlexNet. On the test data, the recognition performance of AlexNet was clearly better than the previous state of the art [16]. The convolutional neural network later became the focus of research in the field of image processing, and researchers proposed many improvements to enhance network performance [17][18][19][20][21]. In 2013, Zeiler and Fergus proposed ZF Net and introduced a new visualization technique [17]. In 2014, Simonyan and Zisserman proposed VGG Net, which extended the depth of the network to 19 layers. The experimental results showed that depth has an important impact on network performance [18]. In the same year, Szegedy et al. proposed GoogLeNet, which deepens the network and introduces the inception structure to replace the traditional combination of convolution and activation function [19]. Through constant improvement and innovation, networks have become deeper, architectures more complex, and accuracy higher. He et al. proposed ResNet, and Huang et al. proposed DenseNet in 2016 [21].
In addition to accuracy, speed is also an important performance index of the convolutional neural network. In 2017, Howard et al. proposed the efficient MobileNets model for mobile and embedded vision applications [22]. MobileNets reduces the hardware requirements of the CNN and plays an important role in promoting the wide application of the CNN.

CNN Based on Multicolor Space.
Grouped convolution (also called packet convolution) first appeared in AlexNet [16] and can greatly reduce the number of training parameters. However, the use of grouped convolution has been limited to a single color space, mostly RGB space. Different color spaces have different advantages, and a mix of several color spaces can make up for the shortcomings of a single color space. This paper proposes a CNN architecture based on grouped convolution over different color spaces, as shown in Table 1, taking the RGB image and YIQ image as examples. The input is a 32 × 32 image, and the output is the probability that the screen of American Hyphantria cunea larvae is present. The detailed process is described as follows:

Input Layers.
First, since this network model can only process fixed-size images, the bilinear interpolation algorithm is used to scale the image to the required input size. Generally, practical network models need to be trained with a large amount of data, but there are few picture samples of the screen of American Hyphantria cunea larvae. To solve this problem, random cropping, flipping, and saturation and brightness adjustment are applied to the image samples during training to make up for the insufficient number of image samples and improve the performance and generalization ability of the model.
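The scaling and augmentation steps described above can be sketched as follows. This is a minimal illustration on a single-channel image stored as nested lists; the function names, the corner-aligned sampling convention, and the crop and brightness parameters are assumptions made for the example, not details given in the paper.

```python
import random


def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D image (list of rows) with bilinear interpolation.

    Assumption: corner-aligned sampling, i.e. the four corner pixels of the
    input map exactly onto the four corner pixels of the output.
    """
    in_h, in_w = len(img), len(img[0])
    out = []
    for r in range(out_h):
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            # Interpolate along x on the two nearest rows, then along y.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def random_crop(img, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w patch at a random position."""
    top = rng.randint(0, len(img) - crop_h)
    left = rng.randint(0, len(img[0]) - crop_w)
    return [row[left:left + crop_w] for row in img[top:top + crop_h]]


def horizontal_flip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def adjust_brightness(img, delta):
    """Add delta to every pixel and clamp to the 8-bit range [0, 255]."""
    return [[min(255, max(0, v + delta)) for v in row] for row in img]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = [[0, 10], [20, 30]]
resized = bilinear_resize(sample, 3, 3)
patch = random_crop(resized, 2, 2, rng)
```

In practice each color channel would be processed the same way, and the resized image would have the 32 × 32 input size used by the network.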

Conv1 Layer.
The Conv1 layer is a convolution layer, where the images are transferred from RGB space to YIQ space before convolution and are convolved separately in RGB space and YIQ space.
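As an illustration of the color space transfer, the sketch below converts an RGB image to YIQ with Python's standard `colorsys` module and keeps both versions side by side as the two inputs of the grouped convolution. The helper names and the dictionary layout are hypothetical, and the conversion coefficients used by `colorsys` may differ slightly from those used by the authors.

```python
import colorsys


def rgb_image_to_yiq(rgb_image):
    """Convert rows of (r, g, b) tuples in [0, 1] to rows of (y, i, q) tuples."""
    return [[colorsys.rgb_to_yiq(r, g, b) for (r, g, b) in row]
            for row in rgb_image]


def two_group_input(rgb_image):
    """Return the two color-space versions that feed the two convolution groups."""
    return {"rgb": rgb_image, "yiq": rgb_image_to_yiq(rgb_image)}


# A 1 x 2 toy image: one white pixel and one black pixel.
groups = two_group_input([[(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)]])
```

For a gray pixel the Y component carries the brightness while I and Q are close to zero, which is why YIQ separates luminance from chrominance.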
In this layer, a convolution kernel of size 3 × 3 and the zero-padding strategy are adopted, and the step size is set to 1. Then, 9 feature maps of size 32 × 32 are obtained by the convolution in a single color space. The size of the convolution kernel determines the size of a neuron's receptive field. If the convolution kernel is too small, effective local features cannot be extracted; if the convolution kernel is too large, the complexity of feature extraction may far exceed its representation ability. Therefore, a proper convolution kernel is crucial to improving the performance of the CNN. Based on verification, the 3 × 3 convolution kernel gives the better effect. It consists of 9 weights w, each corresponding to a pixel value x of the image block. The convolution operation can be described as
$$y = \sum_{i=1}^{9} w_i x_i + b$$
where y denotes the output of the convolution operation and b denotes the offset term, which is employed to better fit the data. The final values of w and b are determined by the network training. The convolution kernel traverses the whole image to produce a feature map. Since there are 9 different convolution kernels and 2 images, 18 feature maps are finally obtained.
For the 18 feature maps obtained by convolution, a 2 × 2 pooling window with a step size of 2 is used for max pooling to reduce the amount of data. The pooling result is then passed through the ReLU activation function and stored as the feature value in the feature map. The expression of ReLU is
$$f(x) = \max(0, x)$$
This layer outputs 18 feature maps of size 16 × 16, and the number of training parameters is 2 × (3 × 3 × 3 × 9 + 9) = 504.
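The operations of this layer can be illustrated with a small single-channel sketch: a 3 × 3 convolution with zero padding and step size 1, 2 × 2 max pooling with step size 2, ReLU, and the parameter count quoted above. The function names are hypothetical, and the convolution is written as the sliding weighted sum commonly used in CNN frameworks (no kernel flip).

```python
def conv3x3_same(img, kernel, bias=0.0):
    """3 x 3 convolution, stride 1, zero padding, on one channel."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = bias
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Pixels outside the image count as zero (zero padding).
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr + 1][dc + 1] * img[rr][cc]
            out[r][c] = acc
    return out


def max_pool2x2(fm):
    """2 x 2 max pooling with stride 2; halves the width and height."""
    return [[max(fm[r][c], fm[r][c + 1], fm[r + 1][c], fm[r + 1][c + 1])
             for c in range(0, len(fm[0]) - 1, 2)]
            for r in range(0, len(fm) - 1, 2)]


def relu(fm):
    """Element-wise max(0, x)."""
    return [[max(0.0, v) for v in row] for row in fm]


def conv_params(in_channels, kernels, groups=2, k=3):
    """Weights plus biases of one grouped convolution layer."""
    return groups * (k * k * in_channels * kernels + kernels)


image = [[1.0] * 32 for _ in range(32)]
ones = [[1.0] * 3 for _ in range(3)]
feature_map = relu(max_pool2x2(conv3x3_same(image, ones)))
print(len(feature_map), len(feature_map[0]))  # 16 16
print(conv_params(in_channels=3, kernels=9))  # 504
```

The printed shape matches the 16 × 16 output size stated for Conv1, and the parameter count reproduces 2 × (3 × 3 × 3 × 9 + 9) = 504.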

Remaining Convolutional Layer.
The setup is similar to that of the Conv1 layer except for the number of convolution kernels. As the number of convolutional layers increases, the number of convolution kernels doubles with each layer.
The extracted features become more complete and accurate, but the processing time increases correspondingly. The number of convolutional layers is determined by the actual situation; in this case, it is chosen by jointly considering accuracy and timeliness.

Global Average Pool.
The global average pooling layer is used to replace the fully connected layer that would otherwise follow the last convolutional layer. As proposed in Network in Network [23], each feature map output by the last convolutional layer is averaged into a single value, so the number of neurons equals the number of feature maps. Since a total of 72 feature maps are output from the Conv3 layer, this layer has 72 neurons.
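A minimal sketch of global average pooling is given below, together with a helper that reproduces the feature map counts implied by the text (9 kernels per color space in Conv1, doubling per layer, two color spaces). The function names are hypothetical, and the intermediate count for Conv2 is inferred from the doubling rule rather than stated in the paper.

```python
def global_average_pool(feature_maps):
    """Reduce each 2-D feature map to its mean value (one neuron per map)."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
            for fm in feature_maps]


def feature_map_counts(first_layer_kernels=9, groups=2, layers=3):
    """Total feature maps per convolutional layer when kernels double each layer."""
    return [groups * first_layer_kernels * 2 ** i for i in range(layers)]


pooled = global_average_pool([[[1, 3], [5, 7]], [[2, 2], [2, 2]]])
print(pooled)                # [4.0, 2.0]
print(feature_map_counts())  # [18, 36, 72]
```

The last entry, 72, agrees with the number of neurons given for the global average pooling layer.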

Output Layers.
The output layer is a fully connected layer connected to the global average pooling layer. In this case, the samples have only two classes (American Hyphantria cunea and non-American Hyphantria cunea). Therefore, the output layer contains two neurons. The 72 values derived from the Conv3 layer are weighted and accumulated to obtain two values V_1 and V_2, and the calculation can be described as
$$V_j = \sum_{i=1}^{72} w_{ij} x_i, \quad j = 1, 2$$
where w denotes a weight and x_i denotes the value of the i-th neuron in the global average pooling layer. The final value of w is determined by network training. The activation function Softmax is employed to calculate the target probability and nontarget probability, which can be written as
$$P_j = \frac{e^{V_j}}{e^{V_1} + e^{V_2}}, \quad j = 1, 2$$
The first value P_1 is the target probability, which is set as the final output value of the program. When this probability is higher than 99%, the image is considered to be a screen image of American Hyphantria cunea; otherwise, it is not.
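The output computation can be sketched as a weighted sum followed by Softmax and the 99% decision threshold. The weights in the usage lines are made-up numbers for illustration only; in the paper they are learned during training, and the function names are hypothetical.

```python
import math


def output_layer(x, weights):
    """Weighted sums V_j = sum_i w_ij * x_i; weights[j] holds the row for class j."""
    return [sum(w * xi for w, xi in zip(wj, x)) for wj in weights]


def softmax(values):
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def is_screen(x, weights, threshold=0.99):
    """Return (decision, target probability) for one pooled feature vector."""
    p_target = softmax(output_layer(x, weights))[0]
    return p_target > threshold, p_target


# Toy example with a single pooled feature and invented weights.
confident, p_high = is_screen([1.0], [[10.0], [0.0]])
uncertain, p_low = is_screen([1.0], [[2.0], [0.0]])
```

Only the first Softmax output is used, since with two classes the second is simply one minus the first.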
Based on the above method, commonly used color spaces such as RGB and YIQ are selected. The effects of single-color space convolution, multicolor space grouped convolution, and single-color space grouped convolution are tested on the screen sample database of American Hyphantria cunea larvae. The results are shown in Table 2. Suppose a total of N images are involved in the test, of which N_1 images are identified as screen images, N_2 of those N_1 images are truly screen images, and N_3 screen images are not identified as screen images. The detection rate = (N_2/N_1) × 100% and the omission rate = (N_3/N) × 100% are shown in Table 2. The network architecture of single-color space convolution is similar to that of a single group in Table 1, and it can be seen from Table 2 that RGB and YIQ perform well in single-color space convolution, better than the other color spaces. However, the number of parameters is much higher than that of grouped convolution. The detection rate and recognition rate of grouped convolution are similar to those of single-color space convolution while using fewer parameters. Therefore, grouped convolution is selected.
Multicolor space grouped convolution is better than single-color space grouped convolution on the whole. Since the omission rate of single-color space grouped convolution is higher, it is not conducive to the prevention and control of American Hyphantria cunea. Among the multicolor space grouped convolutions compared, the RGB and YIQ grouped convolution has a higher detection rate and the lowest omission rate. After comprehensive consideration of the experimental results, the RGB and YIQ grouped convolution is finally selected for judgment.
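The two evaluation measures defined above reduce to simple ratios. The sketch below uses hypothetical counts chosen only to show the arithmetic; they are not values from Table 2.

```python
def detection_rate(n1, n2):
    """Percentage of images identified as screen (n1) that truly are screen (n2)."""
    return 100.0 * n2 / n1


def omission_rate(n, n3):
    """Percentage of all tested images (n) that are missed screen images (n3)."""
    return 100.0 * n3 / n


# Hypothetical counts: 200 test images, 50 flagged, 48 of them correct, 4 missed.
print(detection_rate(n1=50, n2=48))  # 96.0
print(omission_rate(n=200, n3=4))    # 2.0
```

Note that, as defined in the text, the omission rate is normalized by the total number of test images rather than by the number of true screen images.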

Image Positioning Based on Candidate Frames.
At present, convolution-based image localization algorithms can be divided into two categories. One combines candidate frames with a classifier. The image is first divided into several blocks according to certain criteria to generate candidate frames. Then, the area in each candidate frame is convolved and classified [24][25][26][27]. The other directly generates the recognition probability and position coordinates of the object over the whole image range [28][29][30]. By comparison, the former is more accurate, and the latter is faster.
In general, the distribution of the screen of American Hyphantria cunea larvae is irregular, the environmental interference is relatively strong, and the accuracy of direct recognition over the whole image range is low. Therefore, the combination of the candidate box and the convolutional neural network is adopted in this paper. The accuracy of candidate frame extraction affects the accuracy of object location and the speed of the algorithm. Thus, researchers have proposed many candidate frame extraction algorithms [25,31,32], among which the sliding window, selective search, and region proposal network are the most common. Selective search is slow, and the region proposal network method requires a large amount of prior knowledge. Given the actual situation of American Hyphantria cunea larvae, the sliding window method is finally selected. That is, in the image coordinate system, a rectangular window moves according to a certain rule and intercepts the subimages in the window. The rectangular window is called the sliding window; the size and outer contour of the intercepted subimages are called candidate boxes. Sliding window mechanisms are divided into multisize sliding windows and multiscale sliding windows. A multisize sliding window uses several sliding windows of different sizes that slide over the entire image in equidistant steps to extract candidate frames. On the basis of this method, Li et al. proposed an artificial object detection algorithm based on the sliding window [33]. A multiscale sliding window is based on an image pyramid, which requires scaling the image to different scales. Then, a fixed-size sliding window moves across each scaled image to extract candidate frames. Teutsch and Kruger used this method to quickly detect moving vehicles [34].
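The two existing mechanisms can be sketched as plain box generators. The sketch assumes, for simplicity, that the step equals the window size in both directions and that multiscale boxes are mapped back to original-image coordinates; the function names and these choices are illustrative assumptions, not details fixed by the paper.

```python
def sliding_windows(img_w, img_h, win_w, win_h, step_x, step_y):
    """All (x, y, w, h) boxes of a fixed-size window that fit inside the image."""
    return [(x, y, win_w, win_h)
            for y in range(0, img_h - win_h + 1, step_y)
            for x in range(0, img_w - win_w + 1, step_x)]


def multisize_windows(img_w, img_h, sizes):
    """Multisize mechanism: several window sizes slide over the same image."""
    boxes = []
    for win_w, win_h in sizes:
        boxes += sliding_windows(img_w, img_h, win_w, win_h, win_w, win_h)
    return boxes


def multiscale_windows(img_w, img_h, win_w, win_h, scales):
    """Multiscale mechanism: one window size slides over rescaled images.

    Boxes are returned in original-image coordinates by dividing by the scale.
    """
    boxes = []
    for s in scales:
        scaled_w, scaled_h = int(img_w * s), int(img_h * s)
        for (x, y, w, h) in sliding_windows(scaled_w, scaled_h,
                                            win_w, win_h, win_w, win_h):
            boxes.append((x / s, y / s, w / s, h / s))
    return boxes


print(len(sliding_windows(960, 720, 320, 240, 320, 240)))            # 9
print(len(multisize_windows(960, 720, [(320, 240), (160, 120)])))    # 45
print(len(multiscale_windows(960, 720, 320, 240, [1.0, 0.5])))       # 10
```

Both mechanisms enumerate boxes over the whole image at every size or scale, which is why the number of candidates grows quickly as finer windows are added.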

Noncoincident Sliding Window.
A noncoincident sliding window is proposed based on the original sliding window mechanism. The significant advantage of the noncoincident sliding window is that the process of extracting candidate frames is combined with the process of screening them by the CNN model, which greatly reduces the processing time. The specific process is shown in Figure 6. First, a suitable sliding window size is determined. The window slides over the entire image with its width and height as the step sizes in the x-axis and y-axis directions to extract candidate frames, so that the areas in the candidate frames do not coincide with each other. Each candidate frame is processed by the method shown in Section 4.2, which outputs the probability that the screen of American Hyphantria cunea larvae is present in the candidate box. Two thresholds E and Q are set, and the grade is denoted as G. When G > E, the candidate box is excellent, and most of its area is target area; when Q < G < E, the candidate box is qualified, and part of its area is target area; when G < Q, the candidate box is unqualified, and no target area is considered to exist. Excellent candidate boxes are retained, and unqualified candidate boxes are eliminated. Since only part of a qualified candidate box is target area, accurate location requires narrowing the sliding window and sliding it again to extract candidate boxes from the area inside the qualified candidate box, repeating the cycle several times. Once there is no qualified candidate box or the number of candidate box extractions reaches the set value, the cycle stops. When the width and height of the sliding window are reduced to one-half of the previous values, the candidate areas generated by the sliding window in the qualified box do not overlap each other. Therefore, one-half is chosen as the reduction proportion of the sliding window.
Through repeated looping, the candidate frames become smaller and smaller, and more and more excellent candidate frames are collected. All the excellent candidate frames are combined to obtain the target outline frame. The specific comparison between the noncoincident sliding window and the existing sliding window mechanisms is shown in Table 3.
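The refinement loop described above can be sketched as follows. The scoring function is passed in as a callback; here a simple overlap fraction with a known rectangle stands in for the trained CNN, which is purely an assumption for demonstration. The default thresholds follow the 99% and 1% values used in the results, and boundary cases (a score exactly equal to a threshold) are treated as the lower grade because the text does not specify them.

```python
def overlap_fraction(box, target):
    """Stand-in scorer: fraction of `box` covered by `target`, both (x, y, w, h)."""
    bx, by, bw, bh = box
    tx, ty, tw, th = target
    ow = max(0, min(bx + bw, tx + tw) - max(bx, tx))
    oh = max(0, min(by + bh, ty + th) - max(by, ty))
    return (ow * oh) / (bw * bh)


def noncoincident_search(img_w, img_h, win_w, win_h, score,
                         excellent=0.99, qualified=0.01, max_rounds=4):
    """Return the excellent boxes found by repeated nonoverlapping sliding."""
    regions = [(0, 0, img_w, img_h)]  # areas still to be examined
    kept = []
    for _ in range(max_rounds):
        next_regions = []
        for (rx, ry, rw, rh) in regions:
            # Step equals window size, so boxes inside a region never overlap.
            for y in range(ry, ry + rh - win_h + 1, win_h):
                for x in range(rx, rx + rw - win_w + 1, win_w):
                    box = (x, y, win_w, win_h)
                    g = score(box)
                    if g > excellent:
                        kept.append(box)          # excellent: keep as is
                    elif g > qualified:
                        next_regions.append(box)  # qualified: refine next round
                    # otherwise unqualified: discard
        if not next_regions:
            break
        regions = next_regions
        win_w, win_h = win_w // 2, win_h // 2     # halve the window
        if win_w < 1 or win_h < 1:
            break
    return kept


target = (0, 0, 160, 120)  # hypothetical screen location
found = noncoincident_search(960, 720, 320, 240,
                             lambda b: overlap_fraction(b, target))
print(found)  # [(0, 0, 160, 120)]
```

Because only qualified boxes are subdivided, the number of boxes scored depends on how much of the image is ambiguous rather than on the image size alone.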
To better show the advantages of the nonoverlapping sliding window, all three sliding window mechanisms use the width (w) of the sliding window as the step size in the x-axis direction and the height (h) of the sliding window as the step size in the y-axis direction, and candidate frame extraction is performed four times.
For the multisize sliding window, the window is repeatedly reduced to one-half of its previous width and height, and candidate frames are extracted from the original image with sliding windows of four sizes. For the multiscale sliding window, the image is repeatedly reduced to one-half of its previous width and height, and candidate frames are extracted from images of 4 different scales with a sliding window of fixed size. The initial window size of the noncoincident sliding window is chosen from a range of values, and experiments show that between 100 and 300 candidate frames are extracted per image. In terms of the number of candidate frames, the candidate frame extraction algorithm proposed in this paper greatly reduces the number of candidate frames compared with the other two sliding window mechanisms.

Screening Results of the Net Curtain of American Hyphantria cunea Larvae.
The size of the original image is 960 × 720. The specific process of candidate frame extraction and screening is shown in Figure 7. The red rectangular frames are excellent candidate frames. A total of four rounds of candidate frame extraction and screening are performed, and a total of 26 excellent candidate frames are obtained. In the first round, the window size is 320 × 240, and the sliding window traverses the entire image to obtain 9 candidate frames. A candidate frame with probability higher than 99% is excellent, one with probability lower than 1% is unqualified, and the rest are qualified. After screening by the CNN model, one excellent candidate frame and 8 qualified candidate frames are obtained, as shown in Figure 7(a). In the second round, the window size is 160 × 120, and the sliding window traverses the 8 qualified regions to obtain 32 candidate frames. After screening, 12 excellent candidate frames, 13 qualified candidate frames, and 7 unqualified candidate frames are obtained, as shown in Figure 7(b). In the third round, the window size is 80 × 60, and 52 candidate frames are obtained. After screening, 9 excellent candidate frames, 16 qualified candidate frames, and 27 unqualified candidate frames are obtained, as shown in Figure 7(c). In the fourth round, the window size is 40 × 30, and 64 candidate frames are obtained; the screening result is shown in Figure 7(d).
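The candidate counts reported above follow directly from the halving rule: the first round tiles the whole image, and every later round splits each qualified frame into four. The short check below reproduces the reported numbers; the function names are hypothetical.

```python
def first_round_count(img_w, img_h, win_w, win_h):
    """Number of nonoverlapping windows that tile the image."""
    return (img_w // win_w) * (img_h // win_h)


def next_round_count(qualified_frames):
    """Each qualified frame is split into 2 x 2 half-size frames."""
    return 4 * qualified_frames


print(first_round_count(960, 720, 320, 240))  # 9
print(next_round_count(8))                    # 32
print(next_round_count(13))                   # 52
print(next_round_count(16))                   # 64

# The per-round grades reported in the text add up to the round totals.
assert 1 + 8 == 9
assert 12 + 13 + 7 == 32
assert 9 + 16 + 27 == 52
```

For comparison, tiling the whole image with the finest 40 × 30 window alone would already produce 24 × 24 = 576 frames.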
After the screening is finished, all the excellent candidate frames are merged. The specific process is shown in Figure 8. First, a pure black image of the same size as the original image is created, and the regions of all the excellent candidate frames in Figure 8(a) are copied onto it and set to white, as shown in Figure 8(b). Then, the outline of the white region is extracted and drawn on the original image, as shown in Figure 8(c), to obtain the final processing result.
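The merging step can be illustrated with a small mask-based sketch: the excellent frames are painted white on a black image, and the outline is taken as the white pixels that touch a black pixel or the image border. The 4-neighbour outline rule and the function names are assumptions for the example; the paper does not state which contour extraction method is used.

```python
def build_mask(img_w, img_h, boxes):
    """Black image (0) with every (x, y, w, h) box filled in white (255)."""
    mask = [[0] * img_w for _ in range(img_h)]
    for (x, y, w, h) in boxes:
        for r in range(y, y + h):
            for c in range(x, x + w):
                mask[r][c] = 255
    return mask


def outline_pixels(mask):
    """White pixels with at least one 4-neighbour that is black or off-image."""
    h, w = len(mask), len(mask[0])
    edge = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] == 0:
                continue
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if any(not (0 <= rr < h and 0 <= cc < w) or mask[rr][cc] == 0
                   for rr, cc in neighbours):
                edge.append((r, c))
    return edge


# Two adjacent frames merge into one white region with a single outline.
mask = build_mask(8, 5, [(1, 1, 3, 3), (4, 1, 3, 3)])
outline = outline_pixels(mask)
```

Adjacent excellent frames share no interior boundary in the mask, so the extracted outline follows the merged region rather than the individual frames.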
Using the above process, more images are processed, and the results are shown in Figure 9. It can be seen that the algorithm obtains good results in different scenes. The single-picture recognition rate, false positive rate, and processing time are shown in Table 4. The recognition rate refers to the ratio of the identified screen area to the total screen area. The false positive rate refers to the ratio of the nonscreen area included in the located target area to the total screen area. It can be seen that the recognition rate is above 96%, and the processing time is less than 150 ms. The false positive rate is slightly higher when the background light intensity is high; in the remaining cases, it is less than 5%.
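The two area-based measures defined above can be computed from a predicted mask and a ground-truth mask. The sketch below uses tiny boolean masks with made-up values to show the arithmetic; how the ground-truth screen area was annotated in the paper is not stated, so the mask representation is an assumption.

```python
def area_rates(predicted, truth):
    """Return (recognition rate, false positive rate) in percent.

    Both masks are 2-D lists of booleans of the same size. Both rates are
    normalized by the total screen area, as defined in the text.
    """
    screen = hit = extra = 0
    for p_row, t_row in zip(predicted, truth):
        for p, t in zip(p_row, t_row):
            screen += t
            hit += p and t          # screen area that was identified
            extra += p and not t    # nonscreen area inside the located region
    return 100.0 * hit / screen, 100.0 * extra / screen


truth = [[True, True, False],
         [True, True, False]]
predicted = [[True, True, True],
             [True, False, False]]
print(area_rates(predicted, truth))  # (75.0, 25.0)
```

Because the false positive rate is divided by the screen area rather than by the background area, it can exceed 100% when the located region is much larger than the true screen.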

Conclusion
In this paper, a new screen location method for American Hyphantria cunea larvae is proposed based on the CNN. A CNN architecture based on multicolor space is proposed, and the RGB and YIQ grouped convolution is selected for judgment. The image is divided by sliding windows to avoid convolution over the whole image range and to improve the processing precision. A new candidate frame extraction algorithm, named the noncoincident sliding window method, is proposed. The image is divided into several candidate frames, and the grouped convolution of RGB and YIQ space is applied to each candidate frame. The result is output in the form of a probability, and two thresholds are set. A result higher than the high threshold is directly considered excellent, and one lower than the low threshold is removed. Candidate frames in the middle range are divided again by the noncoincident sliding window method. This process is repeated until the number of candidate frame extractions reaches the set value or there is no qualified candidate frame. The final recognition result is obtained by merging the excellent candidate frames. It is verified that the recognition rate of the method is higher than 96%, and the single-image processing time is less than 150 ms.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.