YOLOFig detection model development using deep learning

The detection of fruit, including its accuracy and speed, is of great significance for robotic harvesting. Nevertheless, attributes such as illumination variation and occlusion have made fruit detection a challenging task. A robust YOLOFig detection model was proposed to solve these detection challenges and to improve detection accuracy and speed. The YOLOFig detection model incorporated a Leaky-activated ResNet43 backbone with a new 2,3,4,3,2 residual block arrangement, a spatial pyramid pooling network (SPPNet), a feature pyramid network (FPN), complete IoU (CIoU) loss, and distance IoU non-maximum suppression (DIoU-NMS) to improve fruit detection performance. The obtained average precision (AP) and speed (frames per second, fps) are, respectively: under the 2,3,4,3,2 residual block arranged backbone, 78.6% and 69.8 fps for YOLOv3b, 87.6% and 57.1 fps for YOLOv4b, and 89.3% and 96.8 fps for YOLOFig; under the 1,2,8,8,4 residual block arranged backbone, 77.1% and 56.3 fps for YOLOv3, 87.1% and 52.5 fps for YOLOv4, and 87.3% and 79 fps for YOLOResNet70; and under the 3,4,6,3 residual block arranged backbone, 85.4% and 77.1 fps for YOLOResNet50. This indicates that the new 2,3,4,3,2 residual block arrangement outperformed 1,2,8,8,4 by an average AP of 1.33% and a detection speed of 15.2%. Finally, the compared results showed that the YOLOFig detection model performed better than the other models at the same level of residual block arrangement. It generalizes better and is highly suitable for real-time harvesting robots.


INTRODUCTION
In the past few years, the popularity of figs has increased due to their sweetness and high nutritional value. According to Lianju et al. [1], fig fruit and its leaves contain plenty of amino acids and inorganic elements, as well as medicinal components such as flavone, rutin, and quercetin, which implies that fig plays a significant role in human health. Meanwhile, fig fruits are processed into several products, such as dried figs, preserved fruits, jam, juice, wine, powder, and so on. For these reasons, over one million tonnes of figs are produced yearly worldwide, according to the Food and Agriculture Organization of the United Nations [2]: Turkey is the largest fig producer with 305,450 tonnes, followed by Egypt with 167,622 tonnes, while China ranks 14th with 14,324 tonnes of production volume per year. Nevertheless, fig harvesting mainly relies on manual labour, and labour cost accounts for a high percentage of the total production cost. It is therefore necessary to replace manual labour with harvesting robots, because robotic harvesting offers a solution for reducing labour cost, enabling selective harvesting, optimizing harvest scheduling, and maximizing operational efficiency and profits [3]. The potential applications of vision-based fruit picking/harvesting robots in agriculture, together with future perspectives, have been reviewed and summarized by Tang et al. [4]. Robotic fruit harvesting is an exciting but challenging technology, because it requires the integration of multiple subsystems such as fruit detection, motion, and guided manipulation. Fruit detection is the most important and challenging of these subsystems. Uneven illumination in an unstructured environment, occlusion of fruit by stems and leaves, overlapping fruit, and other unpredictable factors are problems faced by fruit detection, which in other words makes it extremely difficult to develop a vision system as intelligent as humans for detection.
In addition, the different colour types, a visual appearance similar to the background (e.g. leaves), and ripeness variation from green to dark brown are contributory factors to the difficulty of fig fruit detection.
Therefore, a fig fruit detection model that makes harvesting robots environmentally adaptable is crucial to overcome these challenges. Deep learning techniques have made considerable progress in addressing the challenges in fruit detection. Koirala et al. [5] summarized deep learning approaches to fruit detection and yield estimation. Presently, deep learning detection models are mainly divided into two-stage object detectors, such as the faster region-based convolutional network (Faster R-CNN) [6], and single-stage object detectors, e.g. You Only Look Once (YOLO) and the single shot detector (SSD) [7]. Sa et al. [8] explored multi-modal (RGB and NIR) information in DeepFruits, based on Faster R-CNN, to improve the F1 score from 80.7% to 83.8% for fruit detection. However, small fruits are difficult for it to detect, and its detection speed is too slow for real-time in-field operation of harvesting robots. Missed fruit detections were noted in tight clusters of fruit tested on the model proposed by Bargoti et al. [9], in spite of the reported F1 score of more than 90%. The MangoYOLO proposed by Koirala et al. [10], a modified YOLOv3, obtained an accuracy of 98.3% on detected mango, but with a slower inference time of 15 ms. Zheng et al. [11] reported 92% detection accuracy on different crops and suggested that the YOLOv3 [12] framework has good potential for agricultural detection tasks. This indicates that single-stage fruit detectors are faster than two-stage ones. Lawal [13] modified the YOLOv3 framework using a DenseNet backbone for tomato detection and achieved an accuracy of 99.3%, but with a slower detection time of 44 ms. Because both detection accuracy and speed play significant roles in agricultural harvesting robots, it is necessary to develop a robust detection algorithm for their tasks.
YOLOv3 [12] and YOLOv4 [14], being single-stage deep learning detectors, are popular real-time object detectors in computer vision. They directly predict the bounding boxes and corresponding classes in a single-stage network at three different scales. YOLOv3 and YOLOv4 respectively apply the DarkNet53 backbone activated with the Leaky rectified linear unit (ReLU) [15] and the CSPDarkNet53 backbone activated with Mish [16]. Meanwhile, the backbone of YOLOv3 and YOLOv4, which is composed of a 1,2,8,8,4 residual block arrangement, is long overdue for improvement. Furthermore, YOLOv3 uses a feature pyramid network (FPN) [17] as the model Neck and binary cross-entropy loss, while YOLOv4 applies a spatial pyramid pooling network (SPPNet) [18] and a path aggregation network (PANet) [19] as the model Neck, with the YOLOv3 head. As reported, YOLOv4 builds on YOLOv3 with a 10% improvement in detection accuracy and a 12% improvement in detection speed. Recently, Lawal [20] reported a mix of DarkNet and DenseNet as a backbone with FPN to improve the YOLOv4 algorithm, achieving an accuracy of 98.4%, greater than YOLOv4 at 97.6%, tested on tomatoes in a natural scene. Meanwhile, in the ResNet structure [21], the identity block fast-forwards the activation from a previous layer to a deeper layer in the neural network. The skip connection (shortcut connection) introduced in ResNet was explored to solve the drop-off from saturated accuracy in deeper neural networks. The 1 × 1 convolution layers are applied at the beginning (conv layer 1) and end (conv layer 3), while the 3 × 3 convolution layer is placed in the middle (conv layer 2) of the block, in order to reduce the number of parameters without degrading network performance. According to Zheng et al. [11], ResNet is an excellent object detection backbone compared to Inceptionv4, VGGNet, SqueezeNet, and DenseNet.
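The parameter-saving claim of the 1 × 1 / 3 × 3 / 1 × 1 bottleneck design can be checked with a back-of-the-envelope count, comparing it against two plain 3 × 3 convolutions at the same width (a sketch; the channel width of 256 is an illustrative assumption, not a figure from the paper):

```python
def conv_params(k, c_in, c_out):
    # weight count of a k x k convolution (biases folded into batch norm)
    return k * k * c_in * c_out

c = 256          # illustrative channel width
mid = c // 4     # bottleneck reduces channels by 4x before the 3x3 conv

# 1x1 reduce -> 3x3 -> 1x1 expand (ResNet bottleneck block)
bottleneck = (conv_params(1, c, mid)
              + conv_params(3, mid, mid)
              + conv_params(1, mid, c))

# two plain 3x3 convolutions at full width (basic residual block)
plain = 2 * conv_params(3, c, c)

print(bottleneck, plain)  # 69632 1179648
```

At this width the bottleneck uses roughly 17x fewer weights than the plain pair, which is why deep ResNets can grow without a parameter explosion.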
In the quest for an improved single-stage fruit detector, Lawal [22] incorporated the ResNet43 backbone into the YOLOv4 algorithm for muskmelon fruit detection, and reported an accuracy of 89.6% and a speed of 96.3 frames per second (fps). However, the proposed method was not tested on other types of fruit to confirm the obtained findings.
This paper proposes a robust YOLOFig detection model to solve the shortcomings encountered in fruit detection and to improve fruit detection accuracy and speed. The model incorporates a 2,3,4,3,2 residual block arrangement of the ResNet43 backbone [21] into YOLOv4 [14] for a deeper network and richer feature extraction. The ResNet43 backbone is Leaky-activated for non-linearity, with SPPNet, FPN, complete IoU (CIoU) loss, and distance IoU NMS (DIoU-NMS) [23] added to the model for improved detection performance. Meanwhile, exploratory studies on YOLOv3 and YOLOv4 were conducted to compare the 1,2,8,8,4 and 2,3,4,3,2 residual block arrangements of their backbones. The obtained findings show that the YOLOFig detection model achieves an impressive detection accuracy and a real-time detection speed.
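The names ResNet43 and DarkNet53 encode layer counts that can be tallied from the block arrangements above (a sketch; the per-block layer counts and the stem/downsampling accounting follow the standard ResNet bottleneck and DarkNet53 designs, and are assumptions rather than statements from the paper):

```python
# Residual blocks per stage
yolofig_stages = [2, 3, 4, 3, 2]   # ResNet43 backbone of YOLOFig
darknet_stages = [1, 2, 8, 8, 4]   # DarkNet53 backbone of YOLOv3

# Assuming each ResNet bottleneck block holds 3 conv layers (1x1, 3x3, 1x1)
# plus one stem convolution at the input:
resnet43_layers = sum(yolofig_stages) * 3 + 1
print(resnet43_layers)  # 43

# DarkNet53 uses 2 conv layers per residual block, 5 downsampling convs,
# 1 stem conv, and 1 final connected layer:
darknet53_layers = sum(darknet_stages) * 2 + 5 + 1 + 1
print(darknet53_layers)  # 53
```

The 2,3,4,3,2 arrangement thus trades the long 8-block stages of DarkNet53 for a shallower, more evenly distributed stack, which is consistent with the smaller weight size reported later.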
The remainder of this paper is arranged as follows: Section 2 proposes the fig detection model. Section 3 explains the results and discussion of the proposed model, and Section 4 draws conclusions and future work.

Dataset details
The images of figs were taken from a greenhouse in Taigu county, Jinzhong city, Shanxi, China, under natural daylight conditions. The images were captured using a digital camera at a resolution of 2976 × 3968 pixels and stored in JPG format. Apart from 30 test images set aside, a total of 382 fig images were collected and randomly divided into an 80% train set and a 20% valid set. For the dataset construction, a labelling image annotation tool was used to hand-label all the ground-truth bounding boxes, and the annotation files were saved in YOLO format. Regardless of the diverse conditions of the images, such as occlusion, illumination variation and similar background, the bounding boxes were drawn around the supposed shape of each fruit, depending on what human eyes can see, as displayed in Figure
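The 80/20 split described above can be sketched as follows (a minimal illustration; the file names and seed are hypothetical, not the authors' actual pipeline):

```python
import random

def split_dataset(image_paths, train_frac=0.8, seed=0):
    """Randomly split image paths into train and valid sets."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)       # seeded for reproducibility
    n_train = int(len(paths) * train_frac)
    return paths[:n_train], paths[n_train:]

# 382 fig images (the 30 test images are assumed already held out)
images = [f"fig_{i:04d}.jpg" for i in range(382)]
train, valid = split_dataset(images)
print(len(train), len(valid))  # 305 77
```

With 382 images this yields 305 training and 77 validation images, matching the stated 80%/20% proportions.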

YOLOFig model development
The proposed YOLOFig detection model is made up of the ResNet43 backbone with the new 2,3,4,3,2 residual block arrangement, as shown in Figure 3. The non-maximum suppression (NMS) algorithm used in the YOLOFig model to select a bounding box out of many overlapping candidates, by removing redundant detections in order to find the best match, is DIoU-NMS. According to Zheng et al. [23], DIoU-NMS considers both the overlap area and the distance between the central points of two bounding boxes. The original six detection layers of the YOLOv3 head were pruned to four layers in the YOLOFig detection model to improve detection speed, with FPN [17] applied as the Neck to obtain feature pyramids. The FPN enables the model to generalize well over object scaling. Finally, the YOLOFig model applies the complete IoU (CIoU) loss function [23] for faster convergence and better performance.
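The DIoU-NMS criterion can be sketched as follows: a candidate box is suppressed only when its IoU with a kept box, penalized by the normalized squared distance between their centres, exceeds the threshold. This is a NumPy sketch of the published DIoU-NMS rule [23], not the authors' DarkNet implementation:

```python
import numpy as np

def diou(box, boxes):
    """DIoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    # intersection area
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    iou = inter / (area + areas - inter)
    # squared distance between box centres
    rho2 = ((box[0] + box[2]) / 2 - (boxes[:, 0] + boxes[:, 2]) / 2) ** 2 \
         + ((box[1] + box[3]) / 2 - (boxes[:, 1] + boxes[:, 3]) / 2) ** 2
    # squared diagonal of the smallest enclosing box
    c2 = (np.maximum(box[2], boxes[:, 2]) - np.minimum(box[0], boxes[:, 0])) ** 2 \
       + (np.maximum(box[3], boxes[:, 3]) - np.minimum(box[1], boxes[:, 1])) ** 2
    return iou - rho2 / np.maximum(c2, 1e-9)

def diou_nms(boxes, scores, threshold=0.5):
    """Greedy NMS that suppresses a box only when its DIoU with a kept box
    exceeds the threshold, so distant-but-overlapping boxes survive."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        order = rest[diou(boxes[i], boxes[rest]) <= threshold]
    return keep

# two near-duplicate boxes and one distant box
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(diou_nms(boxes, scores))  # [0, 2]
```

The centre-distance penalty means two boxes covering distinct, touching fruits (large IoU but separated centres) are less likely to be merged than under plain IoU NMS, which matters for clustered figs.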

Experiment and evaluation
The training and testing of the models were implemented on the DarkNet platform, on a computer with an Intel i7-8700 CPU @ 64-bit 3.20 GHz and 16 GB of memory. The ResNet50 backbone has a 3,4,6,3 residual block arrangement, compared to the other models in Table 1. From the annotated dataset, the k-means clustering algorithm was used to estimate the sizes of the anchor boxes before model training. The nine generated anchor boxes were incorporated into each model in descending order according to the three scales of the detection layer, i.e. the 52 × 52, 26 × 26 and 13 × 13 feature maps, to improve the fig fruit detection model. The models received an input image of 416 × 416 pixels, a learning rate of 0.001 to reduce the training loss, iterations between 0 and 4000, a batch size of 64 with 32 subdivisions to reduce memory usage, a momentum of 0.9, and a weight decay of 0.0005. Meanwhile, a random initialization method was applied to initialize the weights for training all models, including YOLOv3 and YOLOv4, to maintain training consistency.
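The anchor generation step can be sketched as k-means over annotated box widths and heights with IoU as the similarity measure, a common YOLO recipe (a sketch, not the authors' exact procedure; the synthetic box sizes are illustrative, not the fig dataset):

```python
import numpy as np

def anchor_kmeans(wh, k=9, iters=100, seed=0):
    """k-means over (width, height) pairs using IoU as the similarity."""
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # IoU assuming boxes share a corner: overlap = min(w) * min(h)
        inter = (np.minimum(wh[:, None, 0], centroids[None, :, 0])
                 * np.minimum(wh[:, None, 1], centroids[None, :, 1]))
        union = ((wh[:, 0] * wh[:, 1])[:, None]
                 + (centroids[:, 0] * centroids[:, 1])[None, :] - inter)
        assign = np.argmax(inter / union, axis=1)   # nearest = highest IoU
        new = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # sort by area so anchors map to the detection scales in ascending order
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]

# synthetic box sizes standing in for the annotated fig boxes
rng = np.random.default_rng(1)
wh = rng.uniform(10, 300, size=(400, 2))
anchors = anchor_kmeans(wh)
print(anchors.shape)  # (9, 2)
```

The smallest three anchors then serve the 52 × 52 layer, the middle three the 26 × 26 layer, and the largest three the 13 × 13 layer.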
The trained models were evaluated using precision and recall, defined as Precision = TP/(TP + FP) and Recall = TP/(TP + FN), where TP is the true positives (correct fig detections), FN the false negatives (missed fig detections), and FP the false positives (incorrect fig detections). AP is the average precision, which summarizes the overall performance of the model under various confidence thresholds, expressed as Equation (4): AP = Σ_n (r_{n+1} − r_n) max_{r̃ ≥ r_{n+1}} p(r̃), where p(r) is the measured precision at recall r.
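Equation (4) can be evaluated directly from a measured precision-recall curve; a minimal sketch, assuming the usual right-to-left precision interpolation:

```python
import numpy as np

def average_precision(precision, recall):
    """AP per Equation (4): sum over recall steps r_n of
    (r_{n+1} - r_n) * max precision at recall >= r_{n+1}."""
    r = np.concatenate(([0.0], np.asarray(recall, float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision, float), [0.0]))
    # running maximum of precision from right to left (the interpolation)
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    steps = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[steps + 1] - r[steps]) * p[steps + 1]))

# precision drops from 1.0 to 0.5 as recall rises from 0.5 to 1.0
print(average_precision([1.0, 0.5], [0.5, 1.0]))  # 0.75
```

The interpolation makes AP insensitive to local precision wiggles, so it reflects the area under the P−R curves shown later in Figures 4 and 6.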

Ablation studies on YOLOFig model
The ablation studies presented in Table 2 are supported by the P−R curves in Figure 4 for justification. The P−R curve of YOLOFigB showed a better performance, having a greater area under the curve (AUC) compared to YOLOFigA, YOLOFigF and the other variants. Detection examples are given in Figure 5, where the models detected the number of figs in the image. Figure 5 shows that the fig left undetected in Figure 5(a) is detected in Figure 5(b)−(f). The added SPPNet enhancement is responsible for this effect [18]. Nevertheless, the different modifications of the YOLOFig model are robust under different conditions. The obtained YOLOFig detection speed in fps is a function of the model weight size. The detection speed of YOLOFigA is 97.9 fps, YOLOFigB 96.8 fps, YOLOFigC 79.4 fps, YOLOFigD 94.7 fps, YOLOFigE 90.2 fps, and YOLOFigF 91.5 fps. YOLOFigA, with the smallest weight size, is faster than YOLOFigB, YOLOFigC, YOLOFigD, YOLOFigE and YOLOFigF, but the difference between YOLOFigA and YOLOFigB is very small. Therefore, YOLOFigB, which also has an outstanding AP, was selected as the YOLOFig detection model.

Residual block arrangement performance
The compared backbone residual blocks of the models stated in Table 3 show that YOLOv3b, YOLOv4b and YOLOFig with 2,3,4,3,2 have a smaller weight size than YOLOv3, YOLOv4 and YOLOResNet70 with 1,2,8,8,4, and than YOLOResNet50 with 3,4,6,3. The side-by-side comparison based on detection accuracy and speed shows that YOLOv3b with 78.6% and 69.8 fps outperformed YOLOv3 with 77.1% and 56.3 fps, YOLOv4b with 87.6% and 57.1 fps outperformed YOLOv4 with 87.1% and 52.5 fps, and YOLOFig with 89.3% and 96.8 fps outperformed YOLOResNet70 with 87.3% and 79 fps, with detection examples in Figure 7 for justification. Just as in Figure 5, the figs left undetected in Figure 7(a) and (b) are detected in Figure 7(c)−(f), showing the importance of incorporating SPPNet into the models. Furthermore, the tested AP of the YOLOFig detection model in Table 3, supported by the P−R curves shown in Figure 6, outperformed the other models, as did its detection speed. Meanwhile, applying a higher-grade GPU or a reduced image resolution to the YOLOFig model would further improve its detection speed. With this, the YOLOFig detection model can generalize well and perform excellently for real-time fruit detection, which is applicable to agricultural harvesting robots.
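The average AP gain of 1.33% quoted in the abstract for the 2,3,4,3,2 arrangement can be reproduced from the per-model APs, pairing each model with its counterpart at the same level (a simple check of the reported figure; the speed gain of 15.2% is presumably computed relatively and is not reproduced here):

```python
# AP (%) from Table 3 at matched model levels:
# (YOLOv3b vs YOLOv3), (YOLOv4b vs YOLOv4), (YOLOFig vs YOLOResNet70)
ap_23432 = [78.6, 87.6, 89.3]   # 2,3,4,3,2 backbone
ap_12884 = [77.1, 87.1, 87.3]   # 1,2,8,8,4 backbone

gains = [a - b for a, b in zip(ap_23432, ap_12884)]
avg_gain = sum(gains) / len(gains)
print(round(avg_gain, 2))  # 1.33
```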

Feature map visualization
The feature maps of the different models are presented in Figure 8 to study the mechanisms responsible for the fig fruit detection improvement, although the mechanism of a deep neural network is very difficult to understand. The feature maps, randomly taken from the activated last convolutional layers at 208 × 208, 104 × 104, 52 × 52, 26 × 26 and 13 × 13, were compared as shown in Figure 8. Across the different detection scales, and with reference to the residual block arrangement, the network goes deeper from one activated feature map to another, whereby regions previously unseen appear in the next activated feature map. The pattern of features mapped within each network, e.g. DarkNet, in Figure 8 is approximately similar. However, the output detection performance differs, as reflected in Table 3 and Figure 6.

CONCLUSIONS AND FUTURE WORK
A robust YOLOFig detection model was proposed in this paper to solve fruit detection challenges and to improve fruit detection accuracy and speed. However, the detection accuracy is not yet fully satisfactory, and the model cannot differentiate between unripe, ripe and over-ripe figs, which calls for further improvement and investigation. The practical application of a harvesting robot equipped with the proposed detection method will also be part of the focus of future work.