Lightweight tea bud recognition network integrating GhostNet and YOLOv5

Abstract: Aiming at the low detection accuracy and slow speed caused by the complex background of tea sprouts and the small target size, this paper proposes a tea bud detection algorithm integrating GhostNet and YOLOv5. To reduce the number of parameters, the GhostNet module is introduced, which also reduces the detection time. A coordinate attention mechanism is then added to the backbone layer to enhance the feature extraction ability of the model. A bi-directional feature pyramid network (BiFPN) is used in the neck layer for feature fusion, increasing the fusion between shallow and deep networks to improve the detection accuracy of small objects. Finally, efficient intersection over union (EIOU) is used as the localization loss to further improve detection accuracy. The experimental results show that the precision of GhostNet-YOLOv5 is 76.31%, which is 1.31%, 4.83%, and 3.59% higher than that of Faster RCNN, YOLOv5, and YOLOv5-Lite respectively. Comparing the actual detection effects of the GhostNet-YOLOv5 and YOLOv5 algorithms on buds in different quantities, at different shooting angles, and under different illumination angles, with the F1 score as the evaluation value, GhostNet-YOLOv5 is 7.84%, 2.88%, and 3.81% higher than YOLOv5 in these three environments.


Introduction
With the continuous improvement of people's quality of life, tea has become an indispensable drink in people's leisure life. Therefore, to further expand the influence of tea culture, it is necessary to improve the control of tea quality [1]. The processing of tea is divided into picking, fixation, rolling, initial drying, shaping, and final drying. Among these, the quality of tea picking directly affects the economic benefits. At present, tea picking is mainly divided into manual and mechanical picking [2]. Manual picking relies on workers' experience with the differences in the color and shape of tea buds, which is time-consuming, labor-intensive, and costly [3]. Mechanical picking solves the problem of slow picking, but it cannot accurately identify the buds. Therefore, it is urgent to study detection algorithms for tea buds.
Detection algorithms based on machine vision have long been a research hotspot [4]. In the field of tea detection, different algorithms serve different purposes [5]: there are detection methods for tea leaf diseases, tea bud leaves, and tea sprout picking points. Mukhopadhyay et al. [6] presented a novel approach for automatically detecting tea leaf diseases based on image processing. Yang et al. [7] developed an effective, simple computer vision algorithm that uses infrared thermal image processing to detect diseased areas of tea and estimate tea disease. Karunasena et al. [8] proposed a tea bud detection method using a cascade classifier that combines histogram of oriented gradients features with a support vector machine classifier. Zhang et al. [9] proposed a method to obtain picking-point information based on the Shi-Tomasi algorithm; the test identified 1042 effective tender-bud shoots and marked 887 picking points, with a success rate of 85.12%. These methods all provide a theoretical basis for the automatic picking of tea leaves.
Subsequently, object detection algorithms based on deep learning rose rapidly and evolved into two-stage network models such as Fast R-CNN [10] and Faster R-CNN [11], and one-stage network models such as YOLO [12] and the single shot multibox detector [13]. These algorithms have found applications in agriculture [14]. Lawal et al. [15] proposed a modified YOLOv3 model called YOLO-Tomato to detect tomatoes under complex environmental conditions, with performance superior to other state-of-the-art methods. Roy et al. [16] presented a high-performance real-time fine-grained object detection framework for plant disease detection; at a detection rate of 70.19 FPS, the proposed model obtained a precision of 90.33%, an F1 score of 93.64%, and a mean average precision of 96.29%. The detection effect of deep learning on common crops is thus excellent. However, using deep learning to detect tea buds in complex backgrounds is difficult, so related studies are few. Yang et al. [17] proposed a complete solution, including the mechanical structure, the visual recognition system, and the motion control system of a high-quality automatic tea-plucking robot. Tao et al. [18] proposed a tea picking point location method based on Mask RCNN: by training a picking point location model and segmenting tea buds, the coordinates of picking points can be located. Li et al. [19] proposed a real-time tea bud detection method using a channel- and layer-pruned YOLOv3-SPP deep learning algorithm. After compression, the number of parameters, model size, and inference time of the tea bud detection model were reduced by 96.82%, 96.81%, and 59.62% respectively, while the mean average precision was only 0.40% lower than that of the original model.
The above research provides a basis for the study of tea bud detection under complex background.
Xu et al. [20] proposed a detection and classification approach using a two-level fusion network with a variable universe to solve the detection and classification problems of different grades of tea. This established a foundation for automating the picking of famous and high-quality tea, but it lacks consideration of actual environmental factors. Considering that tea bud picking requires a certain rapidity and accuracy, this paper takes the YOLOv5 model as the basic framework and improves the backbone layer based on GhostNet to reduce the amount of calculation and the detection time. The coordinate attention mechanism is introduced to improve the recognition accuracy of small targets such as buds, and EIOU is used as the loss function of the bounding box. Using a self-made data set, the algorithm designed in this paper is compared with the original algorithm. The paper is organized as follows: Section 2 introduces the YOLO framework; Section 3 describes the proposed GhostNet-YOLOv5 model; Section 4 discusses the experimental results and a comparative analysis of the proposed object detection model. Finally, conclusions and prospects are discussed in Section 5.

Related works
YOLO, short for "you only look once", is an object detection algorithm based on a deep neural network. Its biggest feature is that it runs very fast and can be used in real-time systems. Different from the two-stage R-CNN family, YOLO uses a one-stage, end-to-end prediction pipeline [21]: it directly predicts the categories and locations of different targets using only one CNN network.
At present, the YOLO algorithm has been upgraded in five versions: YOLOv1, YOLOv2, YOLOv3, YOLOv4, and YOLOv5. The improvement ideas of predecessors are mainly divided into the following four parts: 1) Input: in the model training stage, some improvement measures are proposed, such as mosaic data enhancement, adaptive image scaling, adaptive anchor box calculation; 2) Feature extraction layer: optimize the main detection algorithms, such as focus structure and CSP structure; 3) Feature fusion layer: FPN structure is used in the early stage, and FPN+PANet [22] structure is used in the later stage; 4) Output layer: the main improvement is the loss function GIOU-Loss [23] and non-maximum suppression (NMS) [24].
The overall framework of the current YOLOv5 algorithm is shown in Figure 1. It is divided into four modules: the input layer, backbone layer, neck layer, and output layer. There are four versions of YOLOv5: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Considering model size, this paper chooses the fastest version, YOLOv5s, and improves the network on this basis to increase the accuracy of object detection.
YOLOv5 is indeed a detection algorithm with fast detection speed and high accuracy. It performs well on some open-source datasets, but it still needs improvement for the tea bud detection task. Aiming at the problems of complex background, indistinct color, and the small size of tea buds, this paper improves and optimizes YOLOv5 and proposes the tea bud detection algorithm GhostNet-YOLOv5. Experimental and test results show the effectiveness of the algorithm. The YOLOv5 network evolved from the YOLO model. Compared with the Faster R-CNN network, the YOLO network model turns the detection problem into a regression problem [25]. It does not need region proposals and directly generates the bounding box coordinates and the probability of each class through regression [26]. Compared with R-CNN, the detection speed is therefore greatly improved. The detection principle of YOLO is shown in Figure 2. First, the input picture is divided into S × S cells. If the center of a target falls into a cell, that cell is responsible for detecting the target. Assuming that each cell produces B anchor boxes, the target location and category prediction can be expressed by a tensor of size S × S × B × (4 + 1 + C), where 4 represents the coordinates (x, y) and the width and height (w, h), 1 represents the confidence score, and C represents the number of categories. The position, confidence, and category of the final predicted target are obtained by continuous regression toward the real bounding box during training.
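As a small illustration of the tensor size above, the prediction volume can be computed directly; the grid size, anchor count, and single "tea bud" class below are example values, not the paper's actual settings:

```python
def yolo_output_size(s, b, num_classes):
    """Number of values in the YOLO prediction tensor:
    an s x s grid, b anchor boxes per cell, and for each box
    4 coordinates (x, y, w, h), 1 confidence score, and
    num_classes class probabilities."""
    return s * s * b * (4 + 1 + num_classes)

# e.g. a 13 x 13 grid with 3 anchors and 1 class ("tea bud")
print(yolo_output_size(13, 3, 1))  # 13*13*3*6 = 3042
```

The single-class case shows why small-object detection is grid-limited: each cell can only claim targets whose centers fall inside it.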

GhostNet network
A Ghost module is designed in GhostNet to replace general convolution [27]. The Ghost module splits a general convolution into two parts. First, a 1 × 1 convolution is used to obtain a condensed set of necessary features from the input. Second, a depthwise (layer-by-layer) separable convolution generates similar feature maps from the condensed features obtained in the previous step. The Ghost module is shown in Figure 3. An ordinary convolution layer computes

Y = X * f + b,

where X denotes a feature map with channel number c, height h, and width w; Y denotes a feature map with channel number n, height h′, and width w′; f denotes the n convolution kernels of size k × k; and b denotes a bias term. The Ghost module instead computes

Y′ = X * f′,
y_{ij} = Φ_{i,j}(y′_i),  i = 1, …, m,  j = 1, …, s,

where Y′ consists of the m (m ≤ n) intrinsic feature maps output by an ordinary convolution with m kernels f′ of size k × k, y′_i is the i-th intrinsic feature map in Y′, y_{ij} is the j-th ghost feature map generated from y′_i by a cheap linear transformation Φ_{i,j} with kernel size d × d (in practice d is taken equal to k), and n = m · s. The Ghost module thus first generates part of the feature maps by ordinary convolution, then applies the cheap layer-by-layer transformations to these feature maps to obtain the remaining feature maps, and concatenates the two parts. The concatenated output has the same shape as that of the original convolution layer, but a comparison of FLOPs shows that the amount of calculation in the network is reduced. The improved convolution structure is shown in Figure 4.
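The computational saving can be sketched by counting multiply-accumulate operations for the two designs; the channel counts and map sizes below are illustrative examples, and the function names are not from any library:

```python
def conv_flops(c_in, n_out, h, w, k):
    """Multiply-accumulate count of an ordinary k x k convolution
    producing an n_out x h x w output from c_in input channels."""
    return n_out * h * w * c_in * k * k

def ghost_flops(c_in, n_out, h, w, k, s, d):
    """FLOPs of a Ghost module: n_out/s intrinsic maps come from an
    ordinary convolution, and the remaining (s - 1) * n_out / s
    'ghost' maps come from cheap d x d per-map linear operations."""
    m = n_out // s                       # intrinsic feature maps
    primary = m * h * w * c_in * k * k   # ordinary convolution part
    cheap = (s - 1) * m * h * w * d * d  # cheap linear transformations
    return primary + cheap

# Example: 128 -> 256 channels on a 40 x 40 map, 3 x 3 kernels, s = 2
ordinary = conv_flops(128, 256, 40, 40, 3)
ghost = ghost_flops(128, 256, 40, 40, 3, 2, 3)
print(ordinary / ghost)  # close to s = 2 for large channel counts
```

The speed-up ratio approaches s as the input channel count grows, which matches the intuition that the cheap d × d operations are negligible next to the dense convolution.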

Coordinate attention
The attention mechanism comes from the study of human vision. In cognitive science, due to bottlenecks in information processing, humans selectively pay attention to a part of all available information and ignore the rest [28]. To make rational use of limited visual processing resources, humans select a specific part of the visual region and then focus on it. At present, the commonly used attention mechanisms are squeeze-and-excitation (SE) and the convolutional block attention module (CBAM). However, SE only remeasures the importance of each channel by modeling channel relationships and ignores location information, which is important for generating spatially selective attention maps. Conversely, although CBAM introduces local location information into the channel, it cannot capture long-range dependences in the feature map.
As a new and efficient attention module, coordinate attention (CA) [29] consists of two steps: coordinate information embedding and coordinate attention generation. First, to mitigate the loss of position information caused by 2D global pooling, the channel attention is decomposed into two parallel one-dimensional feature encoding processes (along the X and Y directions), which effectively integrates spatial coordinate information into the generated attention maps. Then the two feature maps embedded with direction-specific information are encoded into two attention maps, each of which captures the long-range dependence of the input feature map along one spatial direction. Specifically, for an input x, pooling kernels of size (H, 1) and (1, W) are used to encode each channel along the horizontal and vertical coordinate directions. The output of the c-th channel at height h is:

z_c^h(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i).    (5)

Accordingly, the output of the c-th channel at width w is:

z_c^w(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w).    (6)

Second, after the transformations in the information embedding step, the two aggregated feature maps are concatenated and transformed by a 1 × 1 convolutional transform function F_1:

f = δ(F_1([z^h, z^w])),    (7)

where δ is a nonlinear activation function. After normalization and nonlinear processing, f is split along the spatial dimension into two independent tensors f^h and f^w, which are then transformed by two 1 × 1 convolutions F_h and F_w to the same number of channels as the input X:

g^h = σ(F_h(f^h)),    (8)
g^w = σ(F_w(f^w)),    (9)

where σ represents the sigmoid activation function.
Finally, the output y of coordinate attention can be written as:

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j).    (10)

The coordinate attention module is shown in Figure 5.
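The directional pooling step, which averages each channel along one spatial axis at a time, can be illustrated in plain Python on a toy feature map; `coordinate_pooling` is an illustrative name, not a function of any framework:

```python
def coordinate_pooling(x):
    """Coordinate information embedding step of coordinate attention.
    x is a feature map as a nested list with shape [C][H][W].
    Returns (z_h, z_w): per-channel averages along the width
    (a (H, 1) pooling kernel) and along the height (a (1, W)
    pooling kernel)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    z_h = [[sum(x[c][h]) / W for h in range(H)] for c in range(C)]
    z_w = [[sum(x[c][h][w] for h in range(H)) / H for w in range(W)]
           for c in range(C)]
    return z_h, z_w

# A single-channel 2 x 3 feature map
x = [[[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0]]]
z_h, z_w = coordinate_pooling(x)
print(z_h)  # [[2.0, 5.0]]       row means, one per height position
print(z_w)  # [[2.5, 3.5, 4.5]]  column means, one per width position
```

Because each direction is pooled separately, the position of a bud along both axes survives into the attention maps, unlike 2D global pooling, which collapses both axes at once.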

BiFPN
The original YOLOv5 network adopts an FPN + PAN structure in the neck. This paper retains the original network architecture, extracts features from the upper, middle, and lower feature layers, and passes them to BiFPN to strengthen the feature fusion network. Since the standard BiFPN network has five input layers, it is simplified here to three input layers to integrate with YOLOv5 and reduce the amount of calculation. The improved BiFPN network structure is shown in Figure 6. The output of each fusion node in the improved network is calculated as follows.
P_out = Conv( (Σ_i w_i · Resize(P_in^i)) / (Σ_j w_j + ε) ),

where P_in^i and P_out represent the input and output feature layers respectively, Conv represents a convolution operation, Resize represents upsampling or downsampling of the input, w_i ≥ 0 are learnable fusion weights, and ε is a small quantity that ensures numerical stability.
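BiFPN's weighted fusion rule can be sketched in plain Python; scalar "features" stand in for whole feature maps, and the function name is illustrative:

```python
def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """BiFPN's fast normalized weighted fusion: each input feature
    is scaled by a learnable non-negative weight, and the weights
    are normalized so they sum to roughly 1. eps keeps the
    denominator away from zero. In the real network the inputs are
    whole (resized) feature maps rather than floats."""
    w = [max(0.0, wi) for wi in weights]  # weights are kept >= 0
    total = sum(w) + eps
    return sum(wi * xi for wi, xi in zip(w, inputs)) / total

# Fusing two feature levels with weights 0.7 and 0.3
print(fast_normalized_fusion([10.0, 20.0], [0.7, 0.3]))  # ~ 13.0
```

The normalization makes the fusion a convex combination of its inputs, so the network can learn how much each resolution should contribute at every node without a softmax.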
Compared with previous networks, the YOLOv5 network has many improvements in recognition accuracy and model size. However, the identification of small targets such as tea buds is still somewhat difficult. Therefore, in this paper, GhostConv is used to replace the ordinary convolution module, a CA module is used in the backbone layer, and the BiFPN structure is used in the neck layer to replace the original FPN + PAN structure. The improved YOLOv5 structure is shown in Figure 7. The improved parts are indicated by red boxes.

Loss function
In the original YOLOv5 algorithm, the loss function is composed of three parts: classification loss, confidence loss, and bounding box loss [30]. The cross-entropy loss function is used for the classification and confidence losses, while generalized intersection over union (GIOU) loss is the localization loss of the bounding box. GIOU keeps the main properties of IOU while avoiding its shortcomings: it pays attention not only to the overlapping area but also to the non-overlapping area, and so reflects the degree of coincidence between the two boxes well. GIOU first calculates the minimum closure area C of the predicted box and the real box, then calculates the proportion of the closure area that does not belong to either box, and finally subtracts this proportion from IOU, as shown in the following formula:

IOU = |A ∩ B| / |A ∪ B|,
GIOU = IOU − |C \ (A ∪ B)| / |C|,

where A represents the prediction box and B represents the real box, both of which are tensors, and C represents the minimum area enclosing A and B. CIOU loss considers the overlapping area, center-point distance, and aspect ratio of bounding box regression. However, the term v in its formula reflects the difference of aspect ratios rather than the real differences between the widths and heights and their confidences, which sometimes hinders effective optimization of the model. Given that tea buds are small targets with high aspect ratios, EIOU is introduced to integrate the overlapping area, center distance, and width/height differences of the two boxes into the loss function:

L_EIOU = 1 − IOU + ρ²(b, b^gt)/c² + ρ²(w, w^gt)/C_w² + ρ²(h, h^gt)/C_h²,

where b and b^gt are the centers of the prediction box and the real box, ρ(·) denotes the Euclidean distance, c is the diagonal length of the smallest enclosing box covering the two boxes, and C_w and C_h are the width and height of the smallest enclosing rectangle covering the prediction box and the real box, respectively.
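A minimal sketch of the EIOU formulation for a single box pair, with boxes given as corner coordinates; this is an illustrative implementation, not the paper's code:

```python
def eiou_loss(box_p, box_g):
    """EIOU loss between a predicted and a ground-truth box, each
    given as (x1, y1, x2, y2). Adds to 1 - IoU a center-distance
    term and separate width and height terms, all normalized by
    the smallest enclosing box."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    # intersection and union
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter)
    # smallest enclosing box and its squared diagonal
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw * cw + ch * ch
    # squared distance between box centers
    rho2 = ((px1 + px2 - gx1 - gx2) ** 2
            + (py1 + py2 - gy1 - gy2) ** 2) / 4
    # squared width and height differences
    dw2 = ((px2 - px1) - (gx2 - gx1)) ** 2
    dh2 = ((py2 - py1) - (gy2 - gy1)) ** 2
    return 1 - iou + rho2 / c2 + dw2 / (cw * cw) + dh2 / (ch * ch)

# Identical boxes give zero loss
print(eiou_loss((0, 0, 2, 4), (0, 0, 2, 4)))  # 0.0
```

Penalizing width and height separately, rather than through an aspect-ratio term as CIOU does, is precisely what helps with elongated targets such as tea buds.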

Dataset
The images were captured at the WuChao Mountain tea garden in Hangzhou City, Zhejiang Province, during the Qingming Festival, which is the best time for picking tea. A mobile phone camera was used as the shooting equipment, 1000 of the collected images were selected as the data set for tea bud detection, and the images were annotated with LabelImg. The data set was then split into training and validation sets at a ratio of 8:2.

Experimental platform
This experiment is carried out under the pytorch framework. The configuration parameters of software and hardware are shown in Table 1.

Evaluation indicators
There are many evaluation criteria for deep learning, such as accuracy, the confusion matrix, precision, recall, average precision, mean average precision (mAP), intersection over union (IOU), and non-maximum suppression (NMS). This paper mainly uses precision, recall, and mAP as the evaluation criteria for identifying tea buds. They are introduced separately below.
Precision (P) is the proportion of true positives among the recognized objects:

P = TP / (TP + FP).    (18)

Recall (R) is the proportion of all positive samples in the test set that are correctly identified as positive:

R = TP / (TP + FN),    (19)

where TP refers to tea buds whose category is correctly identified, FP refers to results that incorrectly identify other objects as tea buds, and FN refers to tea buds that fail to be detected.

Since precision and recall may be contradictory in some cases, this test defines the F1 score as the comparative evaluation index:

F1 = 2PR / (P + R).    (20)
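The three metrics can be computed directly from the detection counts; the counts in the example are made up for illustration:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 score from true positives (tp),
    false positives (fp), and false negatives (fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 80 buds found correctly, 20 false alarms, 20 buds missed
p, r, f1 = detection_metrics(80, 20, 20)
print(p, r, f1)  # precision, recall, and F1 are all 0.8 here
```

F1 is the harmonic mean of precision and recall, which is why it is a fair single score when the two pull in opposite directions.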

Data comparison
In order to verify the detection performance of the algorithm, the improved method is compared with the original YOLOv5 and mainstream object detection algorithms such as YOLOv5-Lite and Faster RCNN in terms of loss value and recall rate. The results are shown in Figure 8. The training loss curves of Faster RCNN, YOLOv5, YOLOv5-Lite, and GhostNet-YOLOv5 are compared in Figure 8(a). At the initial stage, the Faster RCNN loss begins to decrease significantly only after approximately 250 epochs, whereas for YOLOv5-Lite and YOLOv5 the loss starts to drop after approximately 50 epochs. The loss of GhostNet-YOLOv5, however, decreases rapidly within approximately 30 epochs, indicating the best convergence characteristics. After 200 epochs, the loss curve becomes stable and the model gradually converges. It can be seen from Figure 8(b) that after 50 epochs the recall rate of GhostNet-YOLOv5 is basically stable at 85%, while that of YOLOv5-Lite floats between 40% and 60%. For YOLOv5 and Faster RCNN, the recall rates stabilize at 80% and 75% after 250 epochs. The specific data comparison is shown in Table 2. To verify the effectiveness of each improvement to the YOLOv5 network model for detecting tea buds, the modules are compared in ablation tests, and the results are shown in Table 3. It can be seen that the detection time of the model with GhostNet alone is significantly shortened, by 0.031 s, while the model with the coordinate attention mechanism gains 1.68% in mAP. Finally, although the detection time of GhostNet-YOLOv5 increases slightly after the introduction of BiFPN, it is still 0.015 s lower than that of the original YOLOv5, and the overall accuracy is improved by 4.83%.
To further verify the effectiveness of the model, it is necessary to detect the efficiency of the algorithm in various actual environments. This paper will take the number of buds, shooting angle, and illumination angle as control variables, compare the detection effect with the original algorithm and evaluate the performance with F1 score. Thirty images will be randomly selected for each category from the test set as F1 score evaluation.
In actual tea images there are often different numbers of bud targets, which affects detection differently. For example, in a single-bud image the bud occupies a large area and its outline is complete and clear, so recognition is easy. In a multi-target image, the targets become smaller and more numerous and may occlude one another, so recognition becomes harder. Therefore, a comparison test of bud detection under different numbers is set up, covering 1-4 buds, 5-10 buds, and more than 10 buds. The detection performance of the two algorithms under different numbers is compared in Figure 9, and the statistical results are shown in Table 4. Figure 10 shows the detection results of the two algorithms at different shooting angles. As can be seen from Figure 10, the shape characteristics of tea buds are obvious in the upward view, so both algorithms detect them well. However, in the downward and side views, the original YOLOv5 misses many buds. The test adopts the same method as the previous section, randomly sampling the data set for each shooting angle; the statistical results are shown in Table 5. As an external influencing factor, the illumination angle is equally important. In this test, forward light, back light, and side light are selected for comparative analysis. Figure 11 shows the detection results of the two algorithms under different illumination angles. It can be seen from Figure 11 that the brightness of the buds is enhanced under forward light; under back light the outline of the buds is obvious, although there are some shadows; and under side light the buds are easy to distinguish from branches and leaves. Following the sampling method of the previous section, three gradient data sets are selected for comparison, and the statistical results are shown in Table 6.
To sum up, the three groups of comparative tests fully demonstrate the effectiveness and real-time performance of tea bud detection based on the GhostNet-YOLOv5 algorithm.

Conclusions
In this paper, a tea bud detection model integrating YOLOv5 and GhostNet is proposed. To improve detection efficiency, the GhostNet module first replaces the CSP module in the backbone layer, and a coordinate attention mechanism is added to enhance the feature extraction ability. To enhance multi-scale fusion, BiFPN is introduced. In addition, EIOU is used as the optimized loss function. The experimental results show that the precision of this model is improved by 4.83%, 3.59%, and 1.31% compared with YOLOv5, YOLOv5-Lite, and Faster RCNN respectively, and the recall rate is increased by 1.81%, 42.52%, and 11.82%, respectively. Moreover, under different bud numbers, shooting angles, and illumination angles in the actual environment, the F1 score of GhostNet-YOLOv5 is 7.84%, 2.88%, and 3.81% higher than that of YOLOv5. In the future, we will further optimize the network model.