An Improved Lightweight Parameters Network for Strawberry Flowers Detection

Accurate and efficient detection of target crops is crucial to the development of intelligent agriculture. A great deal of research has been devoted to improving the accuracy and efficiency of detection algorithms, but their increasing demand for computing power makes them particularly difficult to implement on embedded devices. Although some methods accelerate inference by lightening the weights of an algorithm after training, the huge computing power requirement remains a problem. In this paper, an improved lightweight parameters network is proposed, whose backbone and neck are lightweight designed with grouped convolution and which integrates convolutional (Conv) layers and Batch Normalization (BN) layers to accelerate inference. The experiments in this paper utilize the Strawberry Flower Detection dataset, Tomato dataset, Wind Turbine Detection dataset, and VOC2007 dataset to verify the performance of the proposed network. The results show that the computational cost, number of parameters, memory footprint and inference time of the improved model are all reduced, while the mean Average Precision (mAP) is increased compared with the baseline algorithm. Furthermore, the detection performance of the proposed algorithm implemented on the Jetson Nano platform indicates that it is suitable for deployment in practical scenarios, especially on embedded platforms with limited computing power.


I. INTRODUCTION
Although computer vision technology has been applied everywhere in people's lives, decoding image information as quickly and accurately as a person does is still a tricky problem [1]. In particular, object detection, which aims to classify and localize objects in images or videos [2], [3], is the most important and challenging part. Fast and accurate object detection is essential for the smooth advancement of downstream tasks, such as using robots to pollinate strawberry flowers. Achieving rapid and accurate detection of strawberry flowers is indispensable for yield estimation and the development of pollination robots [4], [5].
(The associate editor coordinating the review of this manuscript and approving it for publication was Kah Phooi (Jasmine) Seng.)
From the VJ Det (Viola-Jones Face Detector) based on manual features to the YOLO (You Only Look Once) series based on deep learning, object detection continues to develop rapidly, and detection algorithms with higher accuracy are constantly proposed by research institutions and universities [6]. However, the computing power demand is huge for both traditional and deep learning-based algorithms, which means that dedicated large computing devices are needed. That is extremely unfriendly to UAVs (Unmanned Aerial Vehicles) or mobile robots with restricted payload [7]. The computing devices equipped on mobile platforms are so lacking in computing power that they cannot match the requirements of high-precision detection algorithms. In addition, overload computing shortens the lifetime of mobile devices significantly. Therefore, algorithms with high accuracy and low computing power requirements are indispensable for mobile devices.
Although the problem of computing power demand in the object detection domain is still not completely solved, a large number of outstanding works have made good progress. The field of traditional object detection based on manual features, from VJ Det to DPM (Deformable Parts Model), has seen an obvious improvement in detection speed and accuracy [8]. With CNNs (Convolutional Neural Networks) making a splash in the field of computer vision, many works have started to apply them to improve the efficiency of object detection. From R-CNN to YOLO, detection speed and accuracy have made a qualitative leap compared to traditional algorithms [9].
Despite the good progress made by a large number of excellent works, the current field of object detection still has the following problems: 1) Some works only focused on improving the detection accuracy of the algorithm but ignored its computing power requirement, which prevented the algorithm from being successfully applied to embedded devices. 2) Some works ignored the algorithm's huge number of parameters, which is the root cause of the huge computing power requirement. They only used pruning, quantization or other methods to lighten the weights after training, which decreases detection accuracy.
In order to solve the above problems, our work aims to reduce the huge number of parameters brought by the bloated backbone of general object detection networks. A lightweight backbone following the VGG (Visual Geometry Group) paradigm is designed, which is simple enough to make the network lightweight and efficient [10]. In addition, to increase the sensitivity of the algorithm, an improved PAN (Path Aggregation Network) architecture is deployed as the neck of the detection network, which uses skip-layer connections to transfer strong localization information from the shallow layers to the deep layers [11]. Both the backbone and the neck use grouped convolution, which further reduces the parameters and accelerates training and inference [12]. After the network is trained with the above methods, the Conv layers and BN layers are further integrated to reduce the memory footprint of intermediate variables during computation and accelerate inference. The network is implemented on an embedded device to verify its feasibility, and the experimental results demonstrate that the work of this paper is both highly accurate and efficient on embedded devices.
The contributions of our work are summarized as follows: 1) Under the premise of maintaining accuracy, a lightweight backbone with few parameters based on grouped convolution is designed, which ensures the running speed of the algorithm. 2) To improve the sensitivity of the algorithm to object position information, the neck of the network, based on the PAN structure, is improved through grouped convolution and skip-layer connections. 3) Conv layers and BN layers are integrated to reduce memory footprint and accelerate inference. 4) The algorithm is deployed on an embedded device and compared with state-of-the-art methods to verify the feasibility and practicability of our work.

II. RELATED WORK
Due to its contactless and noninvasive characteristics, computer vision is widely used for crop detection, considering its advantage of protecting delicate plants, particularly fruits and flowers. In this section, a brief review of existing research on crop detection based on traditional methods and deep learning methods is presented.

A. TRADITIONAL METHODS
Lü et al. [13] used computer vision and a support vector machine (SVM) to simultaneously segment fruits and branches, and acquired a recognition rate of 92.4% for citrus fruits. Kurtulmus et al. [14] detected immature peach fruits in the natural environment using statistical classifiers and a neural network; 84.6%, 77.9% and 71.2% of the actual fruits were successfully detected using three different image scanning methods. Bulanon et al. [15] achieved an accuracy of 84.3% while monitoring flowers using 20 hyperspectral aerial images, which are sensitive to light. McCarthy et al. [16] identified maize flowering status based on color segmentation and shape analysis using images captured by in-field low-cost fixed cameras. Zhou et al. [17] used four cameras to capture strawberry flowers illuminated with UVA light. Exploiting the fluorescence of strawberry flowers, they accomplished flower detection from the captured images with an accuracy of 90% through threshold segmentation, morphological operations and object size analysis. Nowadays, computer vision has become an indispensable technology in flower detection. However, due to its poor robustness, which results in weak adaptation to natural environments, traditional computer vision technology can hardly provide effective information for downstream automated equipment, such as pollination robots and flower thinning robots.

B. DEEP LEARNING METHODS
With the development of deep learning and its application in computer vision, the accuracy of flower detection has begun to improve rapidly. Different region-based convolutional neural networks (R-CNN), including R-CNN, Fast R-CNN and Faster R-CNN, were used to detect strawberry flowers in an outdoor field in the work of Lin et al. [18]. After being trained on 400 strawberry flower images and tested on another 100 images, the networks acquired detection accuracies of 63.4%, 76.7% and 86.1% for R-CNN, Fast R-CNN and Faster R-CNN, respectively. With the goal of detecting flowers and optimizing fruit production, Dias et al. [19], [20] proposed a CNN-based model that is robust to clutter and changes of illumination by combining both color and morphological information. In order to detect apple flowers accurately, Wu et al. [24] proposed a channel pruning-based YOLO v4 deep learning algorithm, which has an inference time of 0.046 seconds and a mAP of 97.31% after being trained on apple flower images collected manually in natural environments. The detection accuracy of flowers can be improved greatly through computer vision methods based on deep learning. However, deep learning algorithms require huge computing power, resulting in a low calculation speed and a departure from real-time requirements when deployed in actual scenes, especially for automated pollination robots in precision agriculture [25]. In summary, most works did not consider implementing the algorithms on embedded devices, which made it difficult to apply them in practical scenarios. Based on the above, an improved lightweight parameters network is proposed for implementation on embedded devices.

III. METHODOLOGY
In this section, an improved lightweight parameters network for strawberry flower detection is proposed. The state-of-the-art YOLO series is chosen as the baseline against which to compare the progress of our work in this paper.

A. STEP 1: BACKBONE NETWORK LIGHTWEIGHT DESIGN
The usage of the CSP structure [26] in the backbone network of YOLO v4 (baseline) can greatly reduce the quantity of computation caused by the repetition of gradient information. This not only enhances the learning ability of the CNN, but also eliminates the computational bottleneck and accelerates inference [26]. The backbone network of the baseline effectively enhances the ability of object detection. However, the computing power required by the baseline is still huge, which makes it difficult to deploy on platforms with limited computing power.
In order to obtain high performance, the Inception network with a multi-branch structure was first proposed by Google in 2015. It can significantly deepen the network and enable different convolution kernels to obtain different receptive fields, resulting in better prediction accuracy. Subsequently, CSPDarknet53 also adopted multi-branch structures, which ensure higher accuracy and faster inference than Darknet53. Nevertheless, due to the preservation of intermediate computing results in multi-branch structures, the memory footprint increases significantly until multi-channel fusion occurs. As a result, a backbone network based on CSPDarknet53 is unfavorable for deployment on platforms with limited computing power.
In view of the reasons mentioned above, the backbone network of the improved lightweight parameters network is lightweight designed in this paper to reduce the parameters, with the aim of obtaining a more efficient object detection model. The lightweight design mainly refers to the classic classification networks VGG and re-parameterization VGG (Rep-VGG) [10]. The architecture of the backbone network is shown in Fig. 1.
The main improved strategies of the backbone network are detailed as below.
1) The topology of VGG is concise and easy to use, so it is widely applied in industry and academia. Therefore, the main part of the backbone network in this paper is also designed in the VGG style, in which the output of the previous layer is simply fed into the next layer without a large number of cross-layer branches. This topology reduces the memory footprint and ensures the simplicity and efficiency of the network [27].
2) A convolution layer with a stride of 2 is used as the HeadConv. By subtracting excess redundant information, it provides feature maps of different scales for the multi-scale detection tasks of the downstream networks, ensuring the sensitivity of the algorithm to objects of different sizes.
3) In order to increase the receptive field of the network and ensure that the downstream networks have good detection accuracy, a 5 × 5 convolution branch is added to the BodyConv on the basis of the Rep-VGG network.
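The downsampling behavior of the stride-2 HeadConv in strategy 2) can be sketched with the standard convolution output-size formula. The 416-pixel input and the 3 × 3 kernel with padding 1 below are illustrative assumptions, not values taken from this paper:

```python
def conv_out_size(n, k, s, p):
    """Spatial output size of a convolution: input size n, kernel k, stride s, padding p."""
    return (n + 2 * p - k) // s + 1

# A stride-2 3x3 convolution with padding 1 halves the feature map,
# e.g. for a hypothetical 416-pixel input:
half = conv_out_size(416, k=3, s=2, p=1)      # 208
quarter = conv_out_size(half, k=3, s=2, p=1)  # 104
```

Repeating such stride-2 layers is what yields the differently scaled feature maps consumed by the multi-scale detection heads.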
To reduce the quantity of computation, the standard convolution, as shown in Fig. 2(a), is replaced by grouped convolution in this paper [12], [28]. The comparison between them is shown in Fig. 2.
As shown in Fig. 2(b), in the grouped convolution network the input feature map is divided into g groups along the channel dimension before the convolution calculation, followed by a Concat operation. The grouped convolution kernels are learned sparsely over the channels in a block-diagonal style: kernels with higher correlation are learned in a more structured way, while those with lower correlation are no longer parameterized. The numbers of parameters and the quantities of computation of the standard convolution and the grouped convolution are shown in (1) and (2), respectively.

P_std = k² C_1 C_2,  Q_std = k² C_1 C_2 W H  (1)

P_grp = k² C_1 C_2 / g,  Q_grp = k² C_1 C_2 W H / g  (2)

where k represents the size of the convolution kernel, C_1 and C_2 respectively represent the numbers of channels of the input and output feature maps, W and H respectively represent the width and height of the feature map, and g represents the number of groups.
Compared with standard convolution, grouped convolution not only reduces the number of parameters and the quantity of computation, but also makes the convolution kernels learn more accurately and efficiently in deep networks with less overfitting. From this perspective, a network with abundant groups would appear more appropriate for a lightweight design. However, a large number of groups may lead to a significant increase in memory access cost (MAC) and a slow inference speed. In order to ensure the detection accuracy of the network, the grouped convolution method is only deployed in the multi-branch part, namely the BodyConv.
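The 1/g reduction behind grouped convolution can be sketched with a few lines of arithmetic; the layer sizes below (3 × 3 kernel, 128 → 128 channels, 52 × 52 map, 4 groups) are hypothetical values chosen only for illustration:

```python
def conv_params(k, c_in, c_out, groups=1):
    # Each group maps c_in/groups input channels to c_out/groups output channels,
    # so the standard parameter count k*k*c_in*c_out is divided by `groups`.
    assert c_in % groups == 0 and c_out % groups == 0
    return k * k * (c_in // groups) * c_out

def conv_flops(k, c_in, c_out, w, h, groups=1):
    # Multiply-accumulates for a stride-1 convolution over a W x H output map.
    return conv_params(k, c_in, c_out, groups) * w * h

# Hypothetical 3x3 layer, 128 -> 128 channels, 52 x 52 feature map:
std = conv_params(3, 128, 128)             # 147456 parameters
grp = conv_params(3, 128, 128, groups=4)   # 36864 parameters, i.e. 1/4 of standard
```

Both the parameter count and the computation drop by exactly the group factor g, matching (1) and (2), which is why grouping the BodyConv branches directly shrinks the model.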

B. STEP 2: IMPROVEMENT OF THE PAN ARCHITECTURE AS NECK
Learning features of objects at different scales is a priority for object detection algorithms, because objects usually have different sizes in images. The FPN (Feature Pyramid Network), with its top-down fusion strategy, is used as the neck of YOLO v3 to obtain feature maps combining the semantic information of the deep layers and the texture information of the shallow layers [29]. Furthermore, on the basis of FPN, the neck of the baseline is improved by the addition of PAN, which has a bottom-up fusion strategy [11]. This makes the neck a two-way fusion network and enhances the representation capability of the object detection algorithm.
For mobile platforms with limited computing power, algorithms with low computing power requirements are important. The massive standard convolution operations of PAN may lead to poor efficiency, so grouped convolutions are used to substitute for them in the neck network, apart from the sampling layers. This strategy greatly decreases the number of parameters and the amount of calculation of the network. Additionally, the original PAN network focuses on the fusion of different scales but neglects the information transfer from the shallow layers to the deep layers within a chain link [30]. In order to ensure that the deep layers acquire the spatial information existing in the shallow layers, the shallow layers and the deep layers in the same chain link are skip-layer connected for the three different scale links. The comparison between the original PAN network and the improved PAN network is shown in Fig. 3.
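The role of the skip-layer connection in one chain link can be sketched as follows; the chain of transforms stands in for the convolution blocks of a link, and the concrete toy transform is purely illustrative, not the network's actual layers:

```python
def chain_link(x, transforms, skip=True):
    """Pass a 1-D feature vector through a chain of transforms; with `skip`,
    the shallow input is added element-wise to the deep output, so shallow
    localization information reaches the deep layer directly."""
    shallow = x
    for t in transforms:
        x = t(x)
    if skip:
        x = [d + s for d, s in zip(x, shallow)]
    return x

# Toy example: even if the chain were to wash out its input entirely,
# the skip connection would still deliver the shallow features to the deep end.
feat = [1.0, 2.0, 3.0]
washed = chain_link(feat, [lambda a: [v * 0.0 for v in a]], skip=False)  # [0.0, 0.0, 0.0]
kept = chain_link(feat, [lambda a: [v * 0.0 for v in a]], skip=True)     # [1.0, 2.0, 3.0]
```

This is the same design intuition as residual connections: the addition gives spatial detail from the shallow layer a direct path to the deep layer of the same scale link.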

C. STEP 3: INTEGRATION OF CONV LAYERS AND BN LAYERS
During the training stage of a DNN, there is a notorious phenomenon named internal covariate shift, which can greatly slow down training. Internal covariate shift refers to the fact that the distribution of each layer's inputs changes along with the variation of the previous layers' parameters. Awais et al. [31] addressed this problem and accelerated the training of DNNs with the method of Batch Normalization (BN), as shown in (3)-(5).

µ = (1/n) Σ_{i=1}^{n} X_i  (3)

σ² = (1/n) Σ_{i=1}^{n} (X_i − µ)²  (4)

X̂_i = γ (X_i − µ) / √(σ² + ϵ) + β  (5)

where X̂_i represents the feature map output by the BN layer, X_i represents the i-th feature map of the batch acquired by the convolution calculation of a certain layer and 1 ≤ i ≤ n; µ and σ² represent the mean and variance of the batch, respectively; γ and β represent the scaling factor and translation factor, respectively; and ϵ represents a constant that ensures a non-vanishing divisor.
As µ, σ², γ and β are fixed during the inference stage of the algorithm, we integrate the Conv layers and BN layers in the inference stage, in order to remove the redundant BN computation and thus further improve the inference speed. The integration method is shown in (6)-(9).

X_i = w X_ii + b  (6)

ŵ = γ w / √(σ² + ϵ)  (7)

b̂ = γ (b − µ) / √(σ² + ϵ) + β  (8)

X̂_i = ŵ X_ii + b̂  (9)

where w represents the weight of the convolution kernel, b represents the bias, and X_ii represents the feature map output by the previous layer of the network. After the improvements introduced in Step 1-Step 3, the improved lightweight parameters network is acquired, and its framework is illustrated in Fig. 4. Compared with YOLOv4, our work has the following differences: 1) Our work employs a simple VGG-style network as the backbone rather than the CSPDarknet53 backbone used in YOLOv4. Considering the decrease in memory consumption achieved by avoiding the heavy use of skip connections, our work is more suitable for deployment on embedded devices. 2) YOLOv4 uses the PAN structure as its neck, while our work utilizes an improved PAN, acquired by adding a small number of skip connections on the same branch of the PAN structure. Spatial information can then propagate effectively from the shallow layers to the deeper ones. This approach is particularly effective in enhancing the performance of object detection algorithms on embedded devices.

3) The training and inference processes of YOLOv4 use the same network, while our work integrates the convolutional and BN layers of the network during inference to improve computational efficiency and reduce memory footprint.
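A minimal numeric sketch of the Conv-BN integration of Step 3, treating one channel as a scalar; the statistics and weights below are made-up values for illustration only:

```python
import math

def fuse_conv_bn(w, b, gamma, beta, mu, var, eps=1e-5):
    """Fold fixed BN statistics into the preceding convolution (per channel):
    w_hat = gamma * w / sqrt(var + eps),
    b_hat = gamma * (b - mu) / sqrt(var + eps) + beta."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mu) * scale + beta

# Check on a single scalar activation: Conv followed by BN must equal the fused Conv.
w, b = 0.8, 0.1                               # hypothetical conv weight and bias
gamma, beta, mu, var = 1.5, -0.2, 0.05, 0.4   # hypothetical fixed BN statistics
x = 2.0
conv_then_bn = gamma * ((w * x + b) - mu) / math.sqrt(var + 1e-5) + beta
w_f, b_f = fuse_conv_bn(w, b, gamma, beta, mu, var)
fused = w_f * x + b_f   # equal to conv_then_bn up to float rounding
```

Since the fused layer is a single affine map, the intermediate BN output never needs to be materialized at inference, which is where the memory and speed gains come from.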

IV. EXPERIMENTS AND RESULTS

A. MATERIALS
1) DATASET GENERATION
The subjects used in this research include three varieties of strawberry flowers: Mengxiang, Redface and Ssanta. The images used in this study were collected in a strawberry plantation located in Jiulongpo District, Chongqing, China, using a simulated view from a mobile robot or UAV. The images were photographed by a Xiaomi MI8 mobile phone (Xiaomi Technologies Co., Ltd, Beijing, China) with a resolution of 3024 pixels (horizontal) × 3024 pixels (vertical). A total of 2424 images with detection objects were obtained from 2:00 pm to 5:30 pm in April 2022. All images were collected in the natural environment of the strawberry plantation, including natural illumination conditions, natural growth orientation, natural shielding of leaves against illumination and flower overlap. Subsequently, LabelImg was used to manually label the strawberry flowers in these 2424 images, and the pistils of each flower were ensured to be located in the center of the bounding box when labeling. The label files are stored in *.xml format. 81.00% (1962 images) of the prepared dataset is used as training data for the improved lightweight parameters network, while 9.00% (219 images) and 10.00% (243 images) are used to validate and test the network, respectively. The setup of the dataset is shown in Table 1.
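An 81/9/10 partition like the one above can be produced with a short script such as the following; the shuffling seed is an assumption, and integer rounding may make the part sizes differ by one or two images from the counts reported in Table 1:

```python
import random

def split_dataset(items, ratios=(0.81, 0.09, 0.10), seed=0):
    """Shuffle a list of image paths and partition it into train/val/test parts."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for a reproducible split
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

# With 2424 images this yields parts of roughly 81%, 9% and 10%.
train, val, test = split_dataset(range(2424))
```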

2) SIMULATION PLATFORM
In order to verify the performance of the improved lightweight parameters network on strawberry flower detection, we compare it with the baseline as well as the other object detection networks. The parameters of the simulation platform used for training and testing are shown in Table 2.

B. LIGHTWEIGHT BENEFITS PRELIMINARY ASSESSMENT
The benefits of lightweighting are pre-evaluated on the simulation platform (shown in Table 2), and the evaluation metrics include the FLOPs, number of parameters, memory footprint, and inference time. Following the improvement process step by step, the pre-evaluation results of the improved lightweight parameters network are shown in Table 3.
It can be found from Table 3 that the FLOPs, number of parameters, memory footprint and inference time of the improved lightweight parameters network are reduced by 77.14%, 76.35%, 48.71% and 38.75% after the three steps, respectively, which indicates that the proposed methods are effective. It is noteworthy that the FLOPs and number of parameters decrease rapidly in Step 1 and Step 2 rather than in Step 3. These decreases can be mainly attributed to the utilization of the grouped convolution method in those two steps. Meanwhile, the memory footprint also decreases rapidly in all steps, by 10.87%, 9.76% and 28.08% compared with the previous step. This can be mainly attributed to the grouped convolution method utilized in Step 1 and Step 2, and to the integration of Conv layers and BN layers in Step 3. However, the decrease in memory footprint in Step 2 (9.76%) is lower than that in Step 1 (10.87%). In Step 1, the backbone network is lightweight designed based on the concise VGG-style topology, which decreases a large amount of memory footprint by reducing the preservation of intermediate computing results in multi-branch structures. Conversely, the complexity of the network is increased by the neck architecture modification in Step 2. Last but not least, the inference times are reduced by 35.20%, -2.24% and 5.79% compared with the previous step. For the same reasons as the decrease in memory footprint, the inference time is greatly reduced in Step 1. Nevertheless, the inference time in Step 2 is slightly increased by 2.24% compared with Step 1, caused by the skip-layer connections.
The neck is capable of generating feature maps with multi-scale information, which is crucial for improving the accuracy of object detection. Therefore, in addition to the direct numerical comparison mentioned above, our work also visualizes the neck to compare the algorithm's attention to objects on feature maps of different scales. The heatmap of feature localization generated by the Grad-CAM method is used to assess the effect of the improved PAN network [32]. The Grad-CAM heatmap provides a visualization, in the form of model gradients, that highlights what the DNN model focuses on. The heatmaps of the outputs of different scales in the baseline and our work, worked out by the Grad-CAM method, are shown in Table 4.
It can be seen that the improved PAN in our work has a better ability to focus on the detection object than the original PAN. For Head 1 of the baseline, both the original PAN and the improved PAN mistakenly pay some attention to the background rather than the object features (the mistaken feature is marked by a red dotted box), but the correct object feature that the improved PAN focuses on is more complete than that of the original PAN (the correct feature is marked by a white box). For Head 2, both the original PAN and the improved PAN are able to pay attention to the object features, but more mistakes occur in the original PAN, where backgrounds are identified as detection objects. For Head 3, both are almost completely focused on the target features, but the original PAN still mistakenly pays a little attention to the background.

C. PROPOSED NETWORK FOR STRAWBERRY FLOWER DETECTION
1) TRAINING OF THE PROPOSED NETWORK
Subsequent work is the training of the network using the images in the dataset. The training parameters are set as in Table 5.
Since only the integration of Conv layers and BN layers is utilized in Step 3, and no modification of the topological structure is involved, the weights of the improved lightweight parameters network after Step 3 are derived from the previous step, so there is no need to retrain. This means that the networks involved in Step 2 and Step 3 have the same training results. Therefore, the training-process curves of the baseline and of the first two steps of the improved lightweight network are given in Table 6. It can be seen that the results of the three models tend to flatten in the later training stage, which indicates convergent training.

2) EXPERIMENTAL RESULTS OF STRAWBERRY FLOWER DETECTION
Five criteria, including precision, recall, F1 score, mAP and inference time, are used to evaluate the performance of the algorithm for strawberry flower detection in this paper. These criteria are used to evaluate the strategies of the improved lightweight parameters network stepwise. The parameters of the simulation platform are detailed in Table 2. The weights of the networks used to test the performance are assigned as the ones that showed the best performance on the validation set, while the weights of Step 3 are derived from the previous step. The experimental results of the improved lightweight parameters network tested on the test set are shown in Table 7. As shown in Table 7, the inference time of the improved lightweight parameters network in this paper is greatly reduced compared with the baseline, and the other evaluation criteria are roughly equivalent. This means that the improved lightweight parameters network is stable and suitable to be deployed in actual scenarios, especially on mobile platforms with limited computing power.
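The precision, recall and F1 criteria can be computed from detection counts as below; the TP/FP/FN counts in the example are hypothetical, not results from Table 7:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 score from true-positive, false-positive
    and false-negative detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts: 90 correct detections, 10 false alarms, 20 missed flowers.
p, r, f1 = detection_metrics(90, 10, 20)  # p = 0.9, r = 90/110, f1 = 180/210
```

mAP additionally averages the area under the precision-recall curve over confidence thresholds (and classes), so it cannot be reduced to a single count triple like the three criteria above.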
To verify the generality of the proposed network, we compare mAP on open source datasets, including the Tomato dataset, the Wind Turbine Detection dataset and the VOC2007 dataset. As shown in Table 8, our work outperforms the baseline in mAP, Recall and F1-score.
The actual detection results of our work on different datasets are shown in Figure 5.

D. COMPARISON OF THE PROPOSED NETWORK WITH PREVIOUS STUDIES
With the rapid development of CNNs and their usage in the field of computer vision, numerous excellent CNN-based object detection algorithms have been proposed. Four object detection algorithms with excellent performance in recent years, namely Faster R-CNN [33], [34], SSD [35], YOLOX-s [36], [37] and EfficientDet [38], are selected for comparison with our work. The training set generated in Section IV-A1 is used to train the networks, while the weights of each network are set as the ones that have the best performance on the validation set. The precision, recall, F1 score, mAP and inference time of the five object detection algorithms are shown in Table 9.
It can be seen that the SSD algorithm acquires the highest mAP of 98.34%, while the YOLOX-s algorithm acquires the minimum inference time of 5.85 ms. Our work is 0.21% lower than the SSD algorithm in mAP, and 1.77 ms greater than the YOLOX-s algorithm in inference time. In other words, our work could not obtain the best score on every criterion. However, considering both the high mAP and the short inference time, the overall performance of our work is better than that of the others. Compared with the other networks, our work is suitable for strawberry flower detection in the natural environment.

E. DISCUSSION
Following the experiments on the platform with high computing power, the performance of the improved lightweight parameters network is further discussed on a platform with limited computing power. Therefore, the improved lightweight parameters network, the baseline and the other algorithms from previous studies are transferred to a Jetson Nano for inference time comparison. The hardware parameters of the Jetson Nano are shown in Table 10, while the inference speeds tested on the Jetson Nano are shown in Table 11. Remarkably, the inference time of our work is only 0.44 times that of the baseline, demonstrating the effectiveness of our approach on low computing power platforms. It can also be seen that the inference time of our work is lower than those of Faster R-CNN, SSD and EfficientDet, but slightly higher than that of YOLOX-s.
The mAP of our work is higher than that of the YOLOX-s algorithm. However, the YOLOX-s algorithm has a faster inference speed than our work, not only on the platform with high computing power but also on the embedded device. In order to explore the factors influencing the inference speed, we choose memory footprint, FLOPs and number of parameters as indicators to analyze the complexity of the algorithms, which could provide a reference for further research on lightweighting. The complexity comparison results are shown in Table 12.
As shown in Table 12, the complexity comparison results are consistent with the inference speeds tested on the Jetson Nano. Our work has an intermediate complexity among the five algorithms, as visualized by the three indicators. This may indirectly indicate that not only the accuracy but also the complexity should be taken into consideration in the lightweight design of a network. The trade-off between accuracy and complexity is essential to ensure the comprehensive performance of the network when run on platforms with limited computing power. In general, our work acquires an extremely high accuracy as well as a fast inference speed, not only on high computing power platforms but also on lightweight devices with limited computing power. This proves that our work could support yield estimation for strawberry flower pollination robots or UAVs.

V. CONCLUSION
Accurate and efficient detection of strawberry flowers is very important for yield estimation and the development of pollination robots. Hereby, the improved lightweight parameters network is proposed in this paper. After training and testing the network on the strawberry flower dataset, we compared it with the baseline as well as the other algorithms from previous studies. The conclusions are drawn as below.
1) The improved lightweight parameters network includes a lightweight-designed backbone network, a modified neck architecture and the integration of Conv layers and BN layers. As a result, the number of parameters, quantity of computation, memory footprint and inference time of the improved lightweight parameters network are all vastly reduced compared with the baseline. The results indicate that the improved lightweight parameters network is suitable for mobile pollination robots, which have a high-speed requirement for strawberry flower detection but limited computing power.
2) The improved lightweight parameters network not only has a faster inference speed than YOLOv4, but also a higher mAP. Moreover, it also has a better overall performance than the other algorithms from previous studies. This shows that the improved lightweight parameters network in this paper makes the algorithm more accurate than the baseline, and thus could provide technical support for the development of pollination robots and the yield estimation of strawberries in the natural environment.