1 Introduction

In computer vision, identifying regions of interest (ROI) in digital images is a fundamental task that often precedes other problems, particularly in fruit image classification. Identifying the ROI is crucial for detecting and recognizing visual objects in an image.

Deep learning models for visual object detection are primarily divided into two categories: one-stage models and two-stage models. Two-stage models generate pre-selected boxes, known as region proposals (RP), which potentially contain an object to be detected; they then classify the proposed samples using convolutional neural networks. In contrast, one-stage object detection models bypass the use of RP and directly extract visual features to predict the object class and location. Examples of two-stage models include R-CNN, SPPNet, Fast R-CNN, and Faster R-CNN [21], while one-stage models include YOLO (You Only Look Once) [31, 36], SSD, and CenterNet [13, 20, 22, 41].

The motivation for this article stems from the shortage of human labor in orchards during the harvest season. This critical task requires completion within a very short time frame, thus necessitating machine vision and robots for automated picking.

The primary objective of this study is to train a fruit detection model capable of identifying the location of a fruit within a given image and distinguishing the fruit's class. The classes, labelled based on the visual characteristics of the fruit skin, include: "Ripe Apple", "Overripe Apple", "Ripe Pear", and "Overripe Pear".

Despite architectural differences between one-stage and two-stage object detection models, their training methods remain similar. Visual object detection primarily involves two phases: the training process and the testing process. The main goal of model training is to use an image dataset to derive parameters for the detection network. This training dataset includes annotated information such as the object location and class.

Figure 1 illustrates how the CenterNet model is trained based on the provided training data and how it produces prediction results. The rectangular box indicates the location of the visual object that we manually marked, along with the corresponding visual object classes. Thus, both the images and the annotation tags serve as inputs for the CenterNet model training. Once the training process concludes, the model outputs the predicted results.

Fig. 1 The progress of object detection (one-stage model)

The contributions of this article are:

  1. We utilize the YOLOv8 model for fruit detection. Additionally, we employ transfer learning, achieving an impressive accuracy rate of 99.5% and precisely classifying the ripeness of various fruits.

  2. We have created our own dataset, incorporating factors such as fruit occlusion, overlapping, and images mixing various classes.

The first part of this article introduces the background and objectives of the experiment. The second part reviews past research on visual object detection based on the YOLO and CenterNet models. The third part describes the YOLOv8 and CenterNet models in detail. The fourth part analyzes the experimental results. Finally, we summarize the contributions of this paper and envision future work.

2 Literature review

2.1 Fruit detection

Fruit detection from digital videos is currently based on computer vision and deep learning methods [39, 41], which rely on computers and digital cameras instead of manual operations [5, 7, 10, 11, 17, 28]. Many methods for fruit object detection have been proposed [2, 4, 15, 18, 33]. Fruit detection essentially requires both classification and localization, providing the class labels and bounding box coordinates of the targets [8, 14, 16, 26, 27]. In this work, visual object detection uses the YOLOv8 or CenterNet model together with a dataset of ground truths, so that the model can extract the visual features of the fruits and then output the predicted results.

Two-stage models can output better recognition results, but one-stage models achieve faster detection [37]. Faster R-CNN is a typical two-stage model. A two-stage model is a proposal-based method, which uses a mechanism such as selective search or a region proposal network to generate region proposals, and then conducts object classification and bounding box regression. Faster R-CNN has high accuracy but slow speed. In our previous experiments, the Faster R-CNN model combined with a ResNet-50 backbone achieved a precision of 93%, while the YOLOv3 model combined with a Darknet backbone achieved a precision of 99.96% [34]. Wan and Goudos [29] improved the convolutional and pooling layers of the Faster R-CNN network to increase the speed of visual object detection and obtained a mean average precision of 86.41%, while the YOLOv3 model achieved a precision of 84.89% on the same task; RGB colors were utilized as the visual feature. Sa et al. [23] explored multiple modalities, which inspired us on how to extract the features of visual objects with multiple modalities. They conducted transfer learning based on Faster R-CNN and achieved a precision of 0.807 and a recall of 0.838 for detecting sweet peppers.

The FDR model [12] was developed to deal with the problems of fruit identification. The FDR model can better overcome the complexity caused by the overlapping of fruits in a specific dataset. At the same time, the baseline convolutional neural network was improved for classification and recognition, so that the model can reduce the impact of background noise and resolution on the given dataset. This method achieved an accuracy rate of 97.83%.

Among fruit detection methods, Transformer models can also achieve satisfactory detection results. Sun et al. addressed the impact of the complex environment of real orchards on detection through a focal bottleneck transformer module [24]. The focal transformer block embeds a focal transformer layer into the original bottleneck architecture by replacing the spatial convolution layer with it. The focal transformer block is a solution for the similarity between the green apple peel and the green background of the leaves in an orchard environment. In the experiment, window-based focal multi-head self-attention is embedded into the focal transformer layer, which can filter the noise of the orchard background and enhance the local features of green apples. Sun's work attained 34.2% accuracy on the Pascal VOC dataset.

The Swin Transformer model, combined with Mask R-CNN, was used as a basis to address the impact of the natural environment on detection [35]. The model can effectively identify the size and type of tomatoes and achieved an accuracy rate of 89.4%. DenseNet-169, ResNet-50v2, and Vision Transformer models were developed to detect plant diseases and achieved an accuracy rate of 99.88% [1]. Sparse categorical cross entropy was used as the loss function to handle the multi-class classification problem.

2.2 CenterNet model

The CenterNet model is characterized by key point detection, in which the entire detected object is modeled as a point. As shown in Fig. 1, the dot is the center point of the bounding box, and the CenterNet model locates the center point of the detected target. The structure of the Residual Network (ResNet) allows the learning ability to increase with network depth, so the CenterNet model combined with a ResNet backbone is usually employed for visual object detection [9].

ResNet [38] was embedded into the backbone of CenterNet and attained an accuracy rate of 78.6%. The loss function of ResNet can assist CenterNet in better completing the object detection task. In current 2D object detection, CenterNet utilizes a grid of feature maps centered on the object. However, in practical applications, 3D object detection is prone to ambiguity in the size and orientation of the detected object, resulting in misjudgment of global information. 3D-CenterNet processes the parameters of the bounding box so that it can accurately estimate the position of the center point and help the model identify the local features of the target [30].

2.3 YOLO models

YOLO models take the entire given image as the input [25, 40, 42] and directly regress the object position and bounding box at the output layer [3, 6]. YOLO models divide the input image into s × s grid cells, predict bounding boxes for each cell, and estimate the position and the confidence that each box contains a detected object.
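As a minimal illustration of this grid-based formulation (a sketch under assumed values, not the implementation of any specific YOLO version), the snippet below maps a normalized box center to the grid cell responsible for predicting it; the grid size and box coordinates are hypothetical.

```python
# Minimal sketch of YOLO-style grid assignment (illustrative only).
# A normalized box center (cx, cy) in [0, 1] is assigned to one cell of an s x s grid;
# that cell is responsible for predicting the box and its confidence.

def assign_to_grid(cx: float, cy: float, s: int = 7):
    """Return the (row, col) of the grid cell responsible for a box center."""
    col = min(int(cx * s), s - 1)  # clamp so cx = 1.0 stays inside the grid
    row = min(int(cy * s), s - 1)
    return row, col

# Hypothetical box centered at (0.62, 0.31) on a 7 x 7 grid.
print(assign_to_grid(0.62, 0.31, s=7))  # -> (2, 4)
```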

Liu et al. [19] probed a method for detecting pineapples based on the YOLOv3 model. In that work, Darknet53 was introduced as the backbone of the YOLOv3 model, and DenseNet was added to the backbone of the Darknet model to enhance the representation ability of the feature maps. The improved YOLOv3 model can compute the disparity between the ROI and other regions, and achieved an average precision of 97.55%.

Wang et al. [32] made use of the YOLOv5 model to detect apple stems for product packaging automation. YOLOv5 uses transfer learning to obtain better detection performance. By comparing the number of detection heads and feature map sizes, and applying layer pruning and channel pruning to optimize YOLOv5s, the complexity of the model was further reduced: the model parameters and weight volume were reduced by about 71%, while the mAP dropped by only 1.57%. The optimized algorithm achieved 93.89% accuracy in stem/calyx detection for a variety of apples.

3 Methodology

3.1 One-stage and two-stage models

Visual object detection algorithms are grouped into two categories: one-stage and two-stage. Two-stage algorithms usually generate region proposals and then classify each candidate box. Because the two-stage algorithm requires multiple detection and classification passes, it is relatively slow.

In Fig. 2, the anchor box is a sliding window that traverses the image to obtain the feature map. In the two-stage algorithm, \({c}_{1},{c}_{2}, {c}_{3}, {c}_{4}\) represent ripe apple, overripe apple, ripe pear, and overripe pear, respectively. \(\left(x,y,w,h\right)\) encodes the bounding box, where (x, y) is the position of one of its corners and w and h are its width and height. T in Eq. (1) denotes an anchor box in the two-stage model.

Fig. 2 The sample of the two-stage model

$$T=(x,y, w, h,{c}_{1},{c}_{2},{c}_{3},{c}_{4})$$
(1)

Similar to the one-stage model in Fig. 4, the predictive model may produce multiple bounding boxes, represented by (A, B, C, D). The two-stage model iteratively regresses the bounding boxes toward the ground truth. The regression process is,

$${R}_{1}({A}_{1},{B}_{1},{C}_{1},{D}_{1})\to {R}_{i}({A}_{i},{B}_{i},{C}_{i},{D}_{i})\to \cdots \to {R}_{\mathrm{ground\;truth}}({A}_{1},{B}_{1},{C}_{1},{D}_{1})$$
(2)

The one-stage object detection algorithm usually sends the images to the network model once and can generate all the bounding boxes, so it is fast and very suitable for real-time detection. Thus, whether the model can achieve rapid object detection is also within the scope of our evaluations. Both CenterNet and YOLOv8 models are typical one-stage algorithms.

3.2 CenterNet & ResNet-50

CenterNet object detection represents the identified object by the center point of its bounding box and regresses the other object attributes from this center point. As shown in Fig. 3, CenterNet is an end-to-end one-stage object detection model. In Fig. 3, the CenterNet prediction module contains three branches, namely the prediction of the center-point heatmap, the prediction of the offset, and the prediction of the object size. The heatmap contains C channels, one per class. The blue shaded part in Fig. 4 indicates the center point of the target region.
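To make the three-branch structure concrete, the following is a rough PyTorch sketch (not the original CenterNet code) of the prediction heads on top of a backbone feature map; the channel counts are assumptions, with C = 4 classes as in our dataset.

```python
import torch
import torch.nn as nn

class CenterNetHeads(nn.Module):
    """Illustrative sketch of CenterNet's three prediction branches (not the original code).

    Given a backbone feature map of shape (B, F, H/4, W/4), it predicts:
      - heatmap: (B, C, H/4, W/4), one channel per class, center-point probabilities
      - offset:  (B, 2, H/4, W/4), sub-pixel correction of the center point
      - size:    (B, 2, H/4, W/4), width and height of the box
    """
    def __init__(self, in_channels: int = 64, num_classes: int = 4):
        super().__init__()
        def head(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels, out_channels, 1),
            )
        self.heatmap = head(num_classes)
        self.offset = head(2)
        self.size = head(2)

    def forward(self, feat):
        return torch.sigmoid(self.heatmap(feat)), self.offset(feat), self.size(feat)

# Hypothetical 512 x 512 input -> 128 x 128 backbone feature map.
heads = CenterNetHeads(in_channels=64, num_classes=4)
hm, off, wh = heads(torch.randn(1, 64, 128, 128))
print(hm.shape, off.shape, wh.shape)  # (1, 4, 128, 128) (1, 2, 128, 128) (1, 2, 128, 128)
```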

Fig. 3 The flowchart of the CenterNet model

Fig. 4 CenterNet bounding box

As shown in Fig. 4, if the bounding box is accurate, the probability that the blue center point is detected will be high. If the bounding box in the orange area is inaccurate, the probability of detecting the orange center point is low. Therefore, the bounding box marked by the upper-left and lower-right corner points defines a central area, and the CenterNet model detects the center point within the central area of each box in Fig. 3. In Fig. 4, the blue center points and boxes with high probability are retained, while the orange center points and boxes are deleted.

The CenterNet algorithm is implemented using a heatmap, on which the network predicts the center point as a Gaussian distribution. In Fig. 5, we use a grid to model the center point. The blue box is the ground truth, and the orange box is the predicted box. There are three possible situations between the predicted box and the ground truth box: the two boxes overlap, the ground truth box contains the predicted box, or the predicted box contains the ground truth box.
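The ground-truth heatmap itself is built by splatting a 2D Gaussian around each annotated center point. The sketch below follows this common CenterNet-style construction; the heatmap size, center coordinates, and radius are hypothetical values, not the settings used in our experiments.

```python
import numpy as np

def draw_gaussian(heatmap: np.ndarray, center, radius: int):
    """Splat a 2D Gaussian peak (value 1 at the center) onto a single-class heatmap."""
    diameter = 2 * radius + 1
    sigma = diameter / 6.0
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    gaussian = np.exp(-(x * x + y * y) / (2 * sigma * sigma))

    cx, cy = center
    h, w = heatmap.shape
    left, right = min(cx, radius), min(w - cx, radius + 1)
    top, bottom = min(cy, radius), min(h - cy, radius + 1)

    patch = gaussian[radius - top:radius + bottom, radius - left:radius + right]
    region = heatmap[cy - top:cy + bottom, cx - left:cx + right]
    np.maximum(region, patch, out=region)  # keep the larger value where peaks overlap
    return heatmap

# Hypothetical 128 x 128 heatmap with a fruit center at (40, 60) and radius 6.
hm = draw_gaussian(np.zeros((128, 128), dtype=np.float32), center=(40, 60), radius=6)
print(hm.max(), hm[60, 40])  # 1.0 at the center point
```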

Fig. 5 Heatmap ground truth

The loss function of the entire CenterNet consists of the three prediction branches. \({L}_{det}\) represents the total loss function, \({L}_{k}\) the loss of the heatmap center points, \({L}_{off}\) the loss of the object center-point offset, and \({L}_{size}\) the loss of the length and width. The prediction loss function is presented as follows,

$${L}_{\mathrm{det}}={L}_{k}+{\lambda }_{\mathrm{size}}{L}_{\mathrm{size}}+{\lambda }_{\mathrm{off}}{L}_{\mathrm{off}}\quad ({\lambda }_{\mathrm{size}}=0.1,\ {\lambda }_{\mathrm{off}}=1)$$
(3)

where \({Y}_{xyc}\) indicates the ground truth value and \({\overset\vee Y}_{xyc}\) is the value predicted by the network.

$$L_K=\frac{-1}N{\textstyle\sum_{xyc}}\left\{\begin{array}{l}{(1-{\overset\vee Y}_{xyc})}^\alpha\log({\overset\vee Y}_{xyc}),\;if\;Y_{xyc}=1\\{(1-Y_{xyc})}^\beta{({\overset\vee Y}_{xyc})}^\alpha\log(1-{\overset\vee Y}_{xyc}),\;otherwise\end{array}\right.$$
(4)

The heatmap loss function in Eq. (4) is improved on the basis of the focal loss, where α and β are two hyperparameters that balance difficult and easy samples, and N represents the number of key points.
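A minimal PyTorch sketch of the heatmap loss in Eq. (4) is given below; the values α = 2 and β = 4 are commonly used defaults and are assumptions here, and the predicted heatmap is assumed to have already passed through a sigmoid.

```python
import torch

def heatmap_focal_loss(pred: torch.Tensor, gt: torch.Tensor,
                       alpha: float = 2.0, beta: float = 4.0) -> torch.Tensor:
    """Sketch of the penalty-reduced focal loss of Eq. (4) (assumed alpha=2, beta=4).

    pred, gt: tensors of shape (B, C, H, W); gt is a Gaussian-splatted heatmap with
    value 1 exactly at each center point. pred is assumed to lie in (0, 1).
    """
    pred = pred.clamp(1e-6, 1 - 1e-6)      # numerical stability for the logs
    pos = gt.eq(1).float()                 # key-point locations (Y_xyc = 1)
    neg = 1.0 - pos

    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg

    num_pos = pos.sum().clamp(min=1.0)     # N: number of key points
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos

# Hypothetical example: a 4-class 128 x 128 heatmap pair with one key point.
pred = torch.rand(1, 4, 128, 128)
gt = torch.zeros(1, 4, 128, 128)
gt[0, 0, 60, 40] = 1.0
print(heatmap_focal_loss(pred, gt).item())
```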

The resolution of the feature map output by the CenterNet network is a quarter of that of the original input image, which introduces a large discretization error. Therefore, the center-point offset loss function in Eq. (5) uses the L1 loss to calculate the offset loss of the positive sample blocks, where \({\widehat{O}}_{\widetilde{p}}\) represents the offset value predicted by the network, p is the coordinate of the center point in the original image, R is the scaling factor of the heatmap, and \(\widetilde{p}\) is the integer coordinate of the center point after scaling. Each pixel on the output feature map corresponds to a 4 × 4 region of the original image.

$${\mathrm{L}}_{\mathrm{off}}=\frac{1}{\mathrm{N}}\sum\nolimits_{\mathrm{p}}\left|{\widehat{\mathrm{O}}}_{\widetilde{\mathrm{p}}}-(\frac{\mathrm{p}}{\mathrm{R}}-\widetilde{\mathrm{p}})\right|$$
(5)
$${\mathrm{L}}_{\mathrm{size}}=\frac{1}{\mathrm{N}}\sum\nolimits_{\mathrm{k}=1}^{\mathrm{N}}\left|{\widehat{\mathrm{S}}}_{\mathrm{pk}}-{\mathrm{s}}_{\mathrm{k}}\right|$$
(6)

The length and width loss function is shown in Eq. (6), where N represents the number of key points, \({s}_{k}\) is the real size of the target, and \({\widehat{S}}_{pk}\) is the predicted size; the whole term is calculated with the L1 loss function.
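Both terms are simple L1 losses evaluated only at the key-point locations. The sketch below assumes the per-object predictions have already been gathered from the feature maps; the coordinates and stride are hypothetical.

```python
import torch

def offset_and_size_loss(pred_offset, gt_offset, pred_size, gt_size):
    """Sketch of the L1 losses in Eqs. (5) and (6), computed per key point.

    All tensors have shape (N, 2), where N is the number of key points:
      gt_offset = p / R - p_tilde  (sub-pixel offset of each center point)
      gt_size   = (w, h) of each ground-truth box
    The predictions are assumed to have already been gathered at the key points.
    """
    n = max(pred_offset.shape[0], 1)
    l_off = torch.abs(pred_offset - gt_offset).sum() / n    # Eq. (5)
    l_size = torch.abs(pred_size - gt_size).sum() / n       # Eq. (6)
    return l_off, l_size

# Hypothetical example with N = 2 key points and heatmap stride R = 4.
p = torch.tensor([[161.0, 242.0], [87.0, 53.0]])   # centers in the input image
p_tilde = torch.floor(p / 4.0)                     # integer centers on the heatmap
gt_off = p / 4.0 - p_tilde
gt_wh = torch.tensor([[48.0, 52.0], [30.0, 28.0]])
print(offset_and_size_loss(torch.zeros(2, 2), gt_off, torch.zeros(2, 2), gt_wh))
```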

CenterNet removes the non-maximum suppression module, which enables the algorithm to achieve faster processing speed and higher detection accuracy. We observe in Fig. 3 that CenterNet's backbone uses a residual network to alleviate the problems of gradient explosion and vanishing gradients. The ResNet backbone adds deformable convolutions, increases up-sampling, and reduces the number of channels; this increases the size of the output feature map while reducing the amount of computation.

3.3 YOLOv8 model

YOLOv8 is an improvement on the previous versions of YOLO that further improves performance, making the model fast, accurate, and easy to use. The backbone of the YOLOv8 model retains the CSP module of YOLOv5. As shown in Fig. 6, the C2f module is employed to extract visual features. YOLOv8 removes the CBS 1 × 1 convolution structure used in the PAN-FPN up-sampling stage of YOLOv5 and replaces the C3 module with the C2f module. The decoupled head in YOLOv8 uses two separate convolution branches for classification and regression, and also adopts the idea of DFL.

Fig. 6 C2f block module in the YOLOv8 model

The most important update in the YOLOv8 model is the adoption of an anchor-free method and the use of task alignment learning to align the classification (cls) and regression (reg) tasks. Properly aligned anchors should be accurately positioned. The YOLOv8 model uses a new anchor alignment metric, obtained by multiplying the cls score by the IoU between the predicted box and the ground truth box. The alignment metric is integrated into the sample assignment and the loss function to dynamically optimize the prediction of each anchor. The YOLOv8 model uses the VFL loss as the classification loss and DFL loss + CIoU loss as the regression loss.
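As a hedged sketch of the alignment metric described above (the exact assigner in the official implementation may additionally raise the two terms to exponents; the exponents and scores below are assumptions), the metric can be computed per anchor as follows:

```python
import torch

def alignment_metric(cls_score: torch.Tensor, iou: torch.Tensor,
                     alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Task-alignment metric sketch: cls score combined with box IoU per anchor.

    With alpha = beta = 1 this is the plain product described in the text; task-aligned
    assigners commonly raise the two terms to exponents (values here are assumptions).
    """
    return (cls_score ** alpha) * (iou ** beta)

# Hypothetical scores for three anchors matched to the same ground-truth box.
cls = torch.tensor([0.90, 0.60, 0.95])
iou = torch.tensor([0.55, 0.80, 0.30])
t = alignment_metric(cls, iou)
print(t)                  # tensor([0.4950, 0.4800, 0.2850])
print(t.argmax().item())  # anchor 0 is the best-aligned sample
```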

$$\mathrm{VFL}(\mathrm{p},\mathrm{q})=\left\{\begin{array}{c}-q(q\mathrm{log}(\mathrm{p})+(1-\mathrm{q})\mathrm{log}(1-\mathrm{p})) ,\mathrm{ q}>0\\ -\alpha {\mathrm{p}}^{\upgamma }\mathrm{log}(1-\mathrm{p}) , q=0\end{array}\right.$$
(7)

where VFL applies an asymmetric weighting operation to handle the imbalance between positive and negative samples, whereas both FL and QFL are symmetrical. As shown in Eq. (7), p is the predicted score and q is the target value: for positive samples, q is calculated using norm_align_metric, and for negative samples, q = 0. The norm_align_metric weighting highlights the main samples.
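A minimal PyTorch sketch of Eq. (7) is shown below, with p as the predicted score and q as the target value; the α and γ values are typical defaults and are assumptions, not taken from our training configuration.

```python
import torch

def varifocal_loss(p: torch.Tensor, q: torch.Tensor,
                   alpha: float = 0.75, gamma: float = 2.0) -> torch.Tensor:
    """Sketch of Eq. (7): asymmetric weighting of positives (q > 0) and negatives (q = 0).

    p: predicted classification score in (0, 1); q: target score (e.g., the alignment
    metric for positives, 0 for negatives). alpha and gamma are assumed defaults.
    """
    p = p.clamp(1e-6, 1 - 1e-6)
    pos = (q > 0).float()
    pos_term = -q * (q * torch.log(p) + (1 - q) * torch.log(1 - p))
    neg_term = -alpha * (p ** gamma) * torch.log(1 - p)
    return (pos * pos_term + (1 - pos) * neg_term).sum()

# Hypothetical anchors: two positives with soft targets, one negative.
p = torch.tensor([0.80, 0.40, 0.10])
q = torch.tensor([0.70, 0.50, 0.00])
print(varifocal_loss(p, q).item())
```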

DFL (Distribution Focal Loss) changes the single-value coordinate regression into the output of n + 1 values, where each value represents the probability of the corresponding regression distance, and the final regression distance is obtained by integration. DFL makes the network focus faster on the values near the target y, increasing their probability.

$$\mathrm{DFL}({\mathrm{S}}_{\mathrm{i}},{\mathrm{S}}_{\mathrm{i}+1})=-(({\mathrm{y}}_{\mathrm{i}+1}-\mathrm{y})\mathrm{log}({\mathrm{S}}_{\mathrm{i}})+(\mathrm{y}-{\mathrm{y}}_{\mathrm{i}})\mathrm{log}({\mathrm{S}}_{\mathrm{i}+1}))$$
(8)

The purpose of DFL is to optimize, in the form of cross entropy, the probabilities of the two positions closest to the label y (one to its left and one to its right), so that the network can focus faster on the distribution of the area adjacent to the target position.
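To make Eq. (8) concrete, the sketch below evaluates the DFL term for a single regression target y lying between two adjacent bins y_i and y_{i+1}; the bin values and predicted probabilities are hypothetical.

```python
import torch

def dfl_term(s_i: torch.Tensor, s_i1: torch.Tensor,
             y: float, y_i: float, y_i1: float) -> torch.Tensor:
    """Sketch of Eq. (8) for one target: cross entropy over the two bins bracketing y.

    s_i, s_i1: predicted (softmax) probabilities of bins y_i and y_{i+1}.
    The weights (y_{i+1} - y) and (y - y_i) push probability mass toward the bin
    closest to the continuous regression target y.
    """
    return -((y_i1 - y) * torch.log(s_i) + (y - y_i) * torch.log(s_i1))

# Hypothetical target y = 3.7 between bins 3 and 4, with predicted probabilities 0.3 / 0.6.
print(dfl_term(torch.tensor(0.3), torch.tensor(0.6), y=3.7, y_i=3.0, y_i1=4.0).item())
```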

4 Our results

4.1 Experimental settings and evaluation method

In this project, PyTorch is adopted as the experimental platform. We used mobile phones to create the four datasets in Table 1, with a total of 4,000 images and 20,000 labels. The groups of data are sorted according to the size and quantity of the images. We found that when all the datasets were used in model training, too many visual features caused redundancy and overloaded the model. Therefore, we manually discarded the data with inconspicuous features, and finally used two thousand samples for training. We classify the ripeness of fruits according to the condition of the fruit peel: in Fig. 3, the smooth peel belongs to a ripe apple and the wrinkled peel to an overripe apple. The dataset is labeled with the software LabelImg. The image on the left side of Fig. 1 shows the labeled dataset; we manually marked the location and class of the apples with red bounding boxes. According to the characteristics of one-stage models, the training images can be input at any size, and the algorithm then resizes each image to 640 × 640.

Table 1 Dataset Description

Table 2 shows the parameter settings of this experiment. For supervised learning, we initially set a larger learning rate and then decreased it as the number of iterations increased; the learning rate is set to 0.01. All the data are fed into the network during training and the gradients are calculated. Because the gradient values differ greatly, it is difficult to use a single global learning rate. In addition, we set the batch size to 2 in order to avoid memory overflow. We use precision as the indicator for evaluating the model, as defined in Eq. (9).

Table 2 Training Parameters
$$Precision=\frac{TruePositive}{TruePositive+FalsePositive}$$
(9)
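As a hedged illustration of this training setup, the snippet below shows how YOLOv8n could be trained with the Ultralytics API using the settings discussed above; the dataset file name and class layout are assumptions, not our released configuration.

```python
# Illustrative training sketch with the Ultralytics YOLOv8 API (pip install ultralytics).
# The dataset file "fruit.yaml" (four classes: ripe/overripe apple, ripe/overripe pear)
# is a hypothetical name used here for illustration.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")   # pre-trained lightweight weights (transfer learning)
model.train(
    data="fruit.yaml",       # hypothetical dataset description file
    imgsz=640,               # images are resized to 640 x 640
    epochs=100,
    batch=2,                 # small batch size to avoid memory overflow
    lr0=0.01,                # initial learning rate, decayed during training
)
metrics = model.val()        # reports precision, mAP@0.5, and mAP@0.5:0.95
```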

As shown in Fig. 4, if the IoU (Intersection over Union) between the predicted bounding box and the real bounding box is greater than or equal to 0.5, the prediction is considered a positive sample. AP measures the detection of a single class, while mAP measures the detection of multiple classes. For AP@0.5, the IoU threshold is set to 0.5, and only the predicted boxes with IoU > 0.5 are counted. mAP@0.5:0.95 represents the average mAP over different IoU thresholds (0.5 to 0.95, step size 0.05).
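A small sketch of the IoU computation and the 0.5-threshold decision is given below; the boxes are in (x1, y1, x2, y2) format and the sample coordinates are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical prediction vs. ground truth: IoU >= 0.5 counts as a positive sample at AP@0.5.
pred = (100, 100, 200, 220)
gt = (110, 105, 205, 230)
score = iou(pred, gt)
print(round(score, 3), "positive" if score >= 0.5 else "negative")
```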

4.2 Results and analysis

Both the CenterNet model and the YOLOv8 model are basically anchor-free. For the CenterNet model, we chose ResNet-50 as the backbone. For the YOLOv8 model, we used YOLOv8n, YOLOv8m, and YOLOv8x, three sets of weights from small to large. We compare the precision values obtained by different models and numbers of epochs in the same experimental environment. At the same time, based on the real-time detection requirements of the experiment, the average inference time of detection is also employed to evaluate the quality of the models.

While training the model, we split the samples into a training set and a validation set at a ratio of 9:1; the loss calculated by the training model is then divided into the overall loss on the training set and the val loss on the validation set. From Table 3, we observe that when the number of iterations is too small, the model cannot learn the characteristics of the fruit. In Fig. 7(b), both the loss and val_loss decrease, which indicates that the training is normal and the model is approaching its optimal state. In Fig. 7(a), both the loss and val_loss are flat, which indicates that the learning process has encountered a bottleneck, the training parameters are not set properly, and the model is in the worst case. During the training of CenterNet, the backbone is frozen and the feature extraction network does not change, so more training can help jump out of the local optimum.

Table 3 The results of precisions by training CenterNet model
Fig. 7 Loss map with various epochs: (a) 30 epochs, (b) 400 epochs

The anchor-free structure (AFS) of YOLOv8 adopts task alignment learning for dynamic matching and introduces distribution focal loss (DFL) combined with CIoU loss as the loss function of the regression branch, which gives the classification and regression tasks high consistency. In the data augmentation part of training, turning off mosaic augmentation in the last 10 epochs is conducive to the stability of model convergence. In Table 4, increasing the number of training epochs from 50 to 100 makes the model training more adequate. YOLOv8 achieves lightweight and fast detection.

Table 4 The results of average precisions by training YOLOv8 model

In Table 4, the larger the pre-trained weights, the longer the average inference time the model needs. If the IoU threshold is set to 0.5, the model outputs a satisfactory result; for AP@0.5:0.95, the average precision fluctuates. Although the precision results are all higher than 80%, we observe from Fig. 8 that if the number of iterations is too small, the model does not converge well.

Fig. 8 Loss map of the YOLOv8n model: (a) 30 epochs, (b) 200 epochs

In Table 4, larger pre-trained weights lead to longer average inference times. In Tables 5, 6, and 7, the larger weights take longer on average to process each picture and do not produce better average precision. On the contrary, the large weights also bring the problem of overfitting during training and waste resources. During the training process, we chose a smaller learning rate to ensure that the model can better find the optimal point. However, if too many iterations are assigned, the model may still fail to train well.

Table 5 The results of precisions by training YOLOv8n model
Table 6 The results of precisions by training YOLOv8m model
Table 7 The results of precisions by training YOLOv8x model

In Table 8, we use the idea of ablation experiments to analyze the training results of the YOLOv8 model and the CenterNet model. As anchor-free models, both the YOLOv8 model and the CenterNet model can accomplish fruit ripeness recognition. Although the CenterNet model freezes the backbone during training, which is equivalent to performing transfer learning, the precision of the trained CenterNet model does not have much of an advantage over the YOLOv8 model. When trained for two hundred iterations, the CenterNet model requires more inference time than YOLOv8m. While the CenterNet model saves many resources during training, the C2f module of the YOLOv8 model also reduces the weights of the model. With the same training parameters, the lightweight YOLOv8n model can complete detection with a faster response after 100 iterations, whereas CenterNet is not yet able to extract the features of the fruits at 100 iterations.

Table 8 The results of comparison

5 Discussion

Sun's green apple detection method achieved an accuracy of 34.2%. Wang et al. utilized a transformer model to recognize different sizes and types of tomatoes, reaching a precision of 89.4%. Alzahrani and Alsaade attained a remarkable precision of 99.88% in fruit lesion detection using DenseNet-169. Kim et al. conducted research on detecting tiny objects in UAV images; considering the environmental noise in that experiment, their YOLOv8 model achieved a processing speed of 45.7 fps (frames per second) in the P2 layer. Similar to these experiments, our fruit object detection requires a small number of parameters. While guaranteeing a precision of 99.3%, our YOLOv8 model can keep the detection time at 2.9 ms. In practical applications, faster object detection can result in significant time savings.

6 Conclusion

From the experimental results, we observe that both YOLOv8 and CenterNet can achieve an accuracy rate of more than 90%. The C2f module of the YOLOv8 model significantly reduces the number of blocks in the largest stage of the backbone network to construct a more lightweight model. Simultaneously, the model decreases the number of output channels in the final stage, which further reduces the number of parameters and computations. In the practical application of fruit detection, detection speed is of the utmost importance. YOLOv8 performs better in terms of both speed and accuracy, with the lightweight YOLOv8n model requiring only 2.9 ms to complete accurate detection.

Although the current model can accurately locate and classify fruits, we must also consider the impact of extreme orchard environments on automatic fruit detection. The effects of severe weather conditions, such as strong winds, heavy rains, or disturbances from birds, represent a research direction that we aim to explore in future experiments.