Multi-stage tomato fruit recognition method based on improved YOLOv8

Introduction: In the field of facility agriculture, the accurate identification of tomatoes at multiple stages has become a significant area of research. However, accurately identifying and localizing tomatoes in complex environments is a formidable challenge. Complex working conditions can impair the performance of conventional detection techniques, underscoring the necessity for more robust methods.

Methods: To address this issue, we propose a novel model, YOLOv8-EA, for the localization and identification of tomato fruit. The model incorporates four enhancements. First, the EfficientViT network replaces the original YOLOv8 backbone network, which reduces the number of model parameters and improves the network's feature extraction capability. Second, partial convolutions were integrated into the C2f module to create the C2f-Faster module, which speeds up model inference. Third, the bounding box loss function was changed to SIoU, thereby accelerating model convergence and enhancing detection accuracy. Lastly, the Auxiliary Detection Head (Aux-Head) module was incorporated to augment the network's learning capacity.

Results: The precision, recall, and mean average precision of the YOLOv8-EA model on the self-constructed dataset were 91.4%, 88.7%, and 93.9%, respectively, with a detection speed of 163.33 frames/s. In comparison to the baseline YOLOv8n network, the model weight increased by 2.07 MB, while the precision, recall, and mean average precision improved by 10.9, 11.7, and 7.2 percentage points, respectively, and the detection speed increased by 42.1%. The detection precision for unripe, semi-ripe, and ripe tomatoes was 97.1%, 91%, and 93.7%, respectively. On the public dataset, the precision, recall, and mean average precision of YOLOv8-EA are 91%, 89.2%, and 95.1%, respectively, which are 4, 4.21, and 3.9 percentage points higher than the baseline YOLOv8n network. The detection time is 1.8 ms, an 18.2% improvement in detection speed, which demonstrates good generalization ability.

Discussion: The reliability of YOLOv8-EA in identifying and locating multi-stage tomato fruits in complex environments demonstrates its efficacy and provides a technical foundation for the development of intelligent tomato picking devices.


Introduction
Tomatoes, with their rich nutrients and unique flavor, are highly favored by consumers. As market demand continues to grow, so too does the production and cultivation scale of tomatoes (Su et al., 2022). Currently, harvesting still relies on manual labor, which depends on personal judgment and past experience and leads to low efficiency, high costs, and untimely harvesting (Han et al., 2022; Yang et al., 2024). Replacing human labor with intelligent robotic harvesters is therefore of great importance for the modernization of the tomato industry. Because tomatoes have a short ripening period and are not easy to store, they need to be screened by maturity stage according to actual needs, which helps increase farmers' income (Nassiri et al., 2022). The basic requirement for intelligent harvesting is the accurate identification and localization of multi-stage tomato fruits, a key step towards implementing precision agriculture (Bai et al., 2023; Lin et al., 2024). Therefore, enhancing model detection performance is crucial for automating tomato harvesting.
Traditional image processing methods extract features such as color, shape, and texture from high-resolution images and rely on hand-crafted features to match and recognize target fruits. However, these methods have limitations in automatic feature extraction, detection speed, and accuracy (Wang et al., 2022). They are susceptible to environmental influences and to variation in fruit color, and they lack reliability and robustness in complex scenarios, which makes it difficult to meet practical demands (Zhang et al., 2023a). With the continuous development of machine vision technology, Convolutional Neural Networks (CNNs) show enormous potential in agriculture due to their rapid processing capabilities and adaptability to complex scenes. The current mainstream algorithms fall into two categories: two-stage detection algorithms based on candidate regions, represented by the R-CNN series; and one-stage detection algorithms based on network regression, represented by the YOLO series. Long et al. (2021) proposed an improved Mask R-CNN model that provides a basis for detecting the maturity levels of tomatoes and for intelligent picking operations. Mu et al. (2020) integrated Faster R-CNN with transfer learning for detecting unripe tomato fruits. Li et al. (2021) proposed a recognition method that fuses YOLOv4 with HSV to segment the red areas of tomatoes; however, this approach does not perform well when multiple fruits overlap one another. Zeng et al. (2023) reconstructed the backbone network of YOLOv5 using lightweight Bneck modules and pruned it, which resulted in a 78% reduction in model parameters and an 84.15% decrease in floating-point operations, greatly increasing detection efficiency, although its performance in detecting ripe tomatoes was lower. Liu et al. (2020) proposed the multi-scale IMS-YOLO, which achieves a tomato detection accuracy of 97.13% in complex greenhouse environments but performs poorly in detecting small objects. Zhang et al. (2023b) integrated the CBAM attention mechanism into the YOLOv5s network to focus more on green tomatoes, enhancing the recognition accuracy for two types of tomatoes. Similarly, Appe et al. (2023) replaced the loss function with DIoU on this basis and achieved an average detection accuracy of up to 88.1% for overlapping and small target tomatoes. Gao et al. (2024) proposed an improved Soft-NMS algorithm for YOLOv5s that accounts for the real-time requirements of a picking robot under continuous operation, which significantly improves tomato recognition during continuous operation. Miao et al. (2023) adopted an improved YOLOv7 model to detect multi-stage cherry tomatoes, effectively reducing the number of model parameters and memory usage while speeding up inference. Chen et al. (2024a) proposed the MTD-YOLOv7 model for multi-task maturity detection of cherry tomato bunches and fruits, achieving a detection accuracy of 86.6% and an inference time of 4.9 ms. Meng et al. (2024) integrated the SimAM attention module and MobileNeXt into YOLOv7-tiny on the basis of information mapping and morphological operations, and combined it with an improved DeepSORT algorithm to propose a real-time detection algorithm for tomatoes at multiple maturity stages, with good results.
Recently, many scholars have also considered deploying improved YOLO algorithms on edge devices for tasks such as tomato fruit morphology recognition (Du et al., 2023; Fu et al., 2024), pest and disease dynamics detection (Jin et al., 2024; Wang and Liu, 2024), and growth monitoring (Chen et al., 2024b; Tian et al., 2024), where they have shown competitive performance.
The above research demonstrates the feasibility and potential application of deep learning-based multi-stage target detection for tomatoes, but the following issues still exist: occluded fruits and small targets may be missed or incorrectly identified; the model structure is complex and has a large number of parameters, leading to redundant feature extraction; and under complex environments, detection efficiency and accuracy are relatively low. Based on this, the paper proposes an improved YOLOv8 model aimed at efficiently recognizing tomatoes at different growth stages in complex greenhouse environments. By reducing the number of parameters and optimizing the network structure, a balance between model accuracy and efficiency is achieved.

Data collection
The data collection site is located at the Yuzhong Greenhouse Complex in Zhongmu County, Zhengzhou City, Henan Province, China (34.66°N, 114.06°E), as shown in Figure 1, and data collection focused on tomatoes cultivated on greenhouse ridges. This study selected the locally representative "YingFen-No.58" tomato variety as the research subject and used an EOS M50 Mark II camera to take photographs from December 14 to 27, 2023, between 9:00 AM and 5:00 PM.
To enhance the model's generalization ability and diversify the dataset, we sought to downplay the structured features of greenhouses. Batches of tomato plants were photographed in their natural environment in the greenhouse, taking into account different time periods, densities, shading conditions, light conditions, and other actual picking conditions during sampling. After screening, 716 high-resolution images (3024 pixels × 4032 pixels) were obtained. Figure 2 displays some sample images from the dataset.

Data preprocessing
This study used Roboflow to annotate the collected raw images, delineating each fruit with a minimal bounding rectangle so that each box contains only one fruit and includes as little background as possible. Fruit that was severely occluded or relatively small was not labeled. Following the experience of local farmers, the fruit was divided into three categories: "Ripe-Tomato" (bright red ripe tomatoes), "Semi-ripe Tomato" (light orange-yellow semi-ripe tomatoes), and "Unripe-Tomato" (green unripe tomatoes). After annotation, Roboflow's built-in scaling feature was used to uniformly resize the images to 640 pixels × 640 pixels. The annotations were saved as .txt files, which store the target category, the coordinates of the bounding box center point, and the box width and height.
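As a minimal sketch of the label file layout described above, the following Python snippet converts one pixel-space box into a text line holding the class index, the normalized center coordinates, and the normalized width and height. The class-index order and the helper name `to_yolo_line` are assumptions made for illustration; the actual indices are assigned by the annotation tool.

```python
# Hypothetical class-index order; the real mapping comes from the annotation export.
CLASS_IDS = {"Ripe-Tomato": 0, "Semi-ripe Tomato": 1, "Unripe-Tomato": 2}


def to_yolo_line(label, box, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to one label line.

    The line stores: class index, box center x and y, box width and height,
    all normalized by the image size so that they fall in [0, 1].
    """
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{CLASS_IDS[label]} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Example: a ripe tomato annotated on a 640 x 640 image.
example_line = to_yolo_line("Ripe-Tomato", (160, 240, 480, 400), 640, 640)
```

Normalized coordinates keep the labels valid when the image is resized, which is why the 640 × 640 rescaling does not require the boxes to be redrawn.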
The dataset was randomly divided into training, validation, and test sets in a 7:2:1 ratio. To enhance the model's robustness and resistance to interference, and to avoid overfitting, the training set was augmented with Roboflow tools using methods such as Gaussian blur and random cropping. As shown in Figure 3, each original training image generated four new images, resulting in a total of 2720 images.
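The dataset size quoted above can be checked with a short calculation. The sketch below is one reading that reproduces the reported total: it assumes the 7:2:1 split is rounded down for the training and validation sets, that only training images are augmented, and that the originals are kept alongside the four new images. The function name `split_counts` is illustrative.

```python
def split_counts(n_images, ratios=(7, 2, 1)):
    """Split n_images into train/val/test counts (floor rounding is an assumption)."""
    total = sum(ratios)
    train = n_images * ratios[0] // total
    val = n_images * ratios[1] // total
    test = n_images - train - val  # the remainder goes to the test set
    return train, val, test


train, val, test = split_counts(716)

# Each original training image yields four augmented images.
augmented = train * 4

# Originals (all three subsets) plus the augmented training images.
dataset_total = train + val + test + augmented
```

Under these assumptions the split is 501/143/72 images, the augmentation adds 2004 images, and the total is 2720, matching the figure given in the text.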

Construction of experimental platform and parameter settings
The operating system used for the experiment is Linux, with an Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz, an NVIDIA GeForce RTX3080Ti GPU, 32GB of RAM, and a 500GB HDD. The programming language is Python 3.9, using the PyTorch 1.9 deep learning framework with CUDA 11.8 GPU acceleration. The initial learning rate is set to 0.01, the momentum to 0.937, the number of training epochs to 300, the number of target classes to 3, and the batch size to 32.

In YOLOv8, the backbone network adopts the Darknet53 structure, obtaining features of different sizes through five down-samplings. The C3 module is replaced with the C2f module, which adds branches to enrich gradient backpropagation. The neck network uses a PANet, enlarging the receptive field and improving feature fusion through bidirectional integration of dual-layer features. The head network adopts an anchor-free strategy and a decoupled head structure, using parallel branches to separate localization from classification while discarding confidence prediction to accelerate model convergence.

YOLOv8 network model
Although YOLOv8 is among the latest iterations of the YOLO series, it still has some limitations. For example, feature maps have low resolution under the restricted working conditions of real scenes, so the model performs poorly in small target detection; furthermore, despite its highly efficient structural design, real-time processing on resource-constrained devices is still challenging; and lastly, its sensitivity to occlusion and lighting variations affects its robustness and reliability in practical applications.

Improvement of the network model
The improved network structure of YOLOv8-EA, shown in Figure 4, uses EfficientViT as the backbone network, incorporates partial convolutions into the original C2f module, switches to the SIoU loss function, and adds an auxiliary detection head (Aux-Detect).

a) EfficientViT network
The EfficientViT structure is shown in Figure 5A, and its core building block, the "EfficientViT Module", is shown in Figure 5B. This module consists of a lightweight MSA (Geser et al., 2006) module (as shown in Figure 5C) and an MBConv module. The lightweight MSA module uses linear projection layers to extract Q, K, V tokens and uses small convolution kernels to aggregate information into multi-scale tokens. A global self-attention mechanism based on ReLU then weights the features at each scale to capture information at various scales. Subsequently, the outputs are concatenated and sent to the final linear projection layer for feature fusion, producing more expressive and diverse global features. By aggregating nearby Q, K, and V tokens with small convolution kernels, the module strengthens multi-scale global learning while reducing computational and storage costs, balancing accuracy and efficiency. Meanwhile, the MBConv module enhances gradient propagation to better capture local information (Nascimento et al., 2019).
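To make the ReLU-based global attention described above more concrete, the following is a minimal single-head sketch in plain Python. It assumes the commonly published ReLU linear attention form, in which the similarity between a query and a key is the dot product of their ReLU activations; this is an assumption about the exact form used here, and the multi-scale token aggregation and projection layers are omitted. The function names are illustrative.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def relu_linear_attention(Q, K, V, eps=1e-6):
    """Linear-cost attention: out_i = relu(Q_i) (sum_j relu(K_j)^T V_j) / (relu(Q_i) sum_j relu(K_j)).

    Q, K are N x d lists of lists; V is N x d_v. The two key summaries are
    computed once and shared by every query, so cost grows linearly with N.
    """
    d, d_v = len(K[0]), len(V[0])
    kv = [[0.0] * d_v for _ in range(d)]  # sum_j relu(K_j)^T V_j, shape d x d_v
    k_sum = [0.0] * d                     # sum_j relu(K_j), shape d
    for k_row, v_row in zip(K, V):
        rk = [relu(x) for x in k_row]
        for a in range(d):
            k_sum[a] += rk[a]
            for b in range(d_v):
                kv[a][b] += rk[a] * v_row[b]
    out = []
    for q_row in Q:
        rq = [relu(x) for x in q_row]
        denom = sum(rq[a] * k_sum[a] for a in range(d)) + eps
        out.append([sum(rq[a] * kv[a][b] for a in range(d)) / denom for b in range(d_v)])
    return out


def quadratic_reference(Q, K, V, eps=1e-6):
    """Same result computed pair by pair (N x N weights), for comparison only."""
    out = []
    for q_row in Q:
        rq = [relu(x) for x in q_row]
        weights = [sum(a * relu(b) for a, b in zip(rq, k_row)) for k_row in K]
        denom = sum(weights) + eps
        out.append([sum(w * v_row[b] for w, v_row in zip(weights, V)) / denom
                    for b in range(len(V[0]))])
    return out
```

Because ReLU replaces softmax, the sums over keys can be factored out of the per-query computation, which is where the saving in computation and memory comes from.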
For a given input, the self-attention calculation for the EfficientViT module is as shown in Equation (1).

The angle loss $\Lambda$ is calculated as shown in Equations (2), (3). The distance loss $\Delta$ is calculated as shown in Equations (4), (5). The shape loss $\Omega$ is defined as shown in Equations (6), (7):

$$\Omega = \sum_{t=w,h} \left(1 - e^{-\omega_t}\right)^{\theta} \qquad (6)$$

In the formula, $\theta$ represents the weight of the shape loss, $\theta \in [2, 6]$. The SIoU loss function is defined as shown in Equation (8).

c) C2f-Faster module
The C2f module used in YOLOv8 enhances image feature extraction, but the stacking of Bottleneck modules inevitably leads to redundancy in information channels and an increase in inference workload. To address these issues, the Faster Block module was integrated into C2f, reducing the model's computational load and floating-point operations (Chen et al., 2023). Partial Convolution (PConv) extracts features from only some channels of the input feature map, reducing redundant operations and memory access while still capturing key spatial features. Assuming that the input and output feature maps have the same number of channels and that k is the kernel size, PConv's FLOPs (floating-point operations) and MAC (memory access cost) are calculated as shown in Equations (9), (10). The module applies convolution to only a portion of the input channels, $c_p$, which stand in for the entire feature map, while keeping the remaining channels unchanged. It then concatenates the processed channels with the untouched ones for output. Under a typical partial ratio (r = 1/4), the FLOPs of PConv are approximately 1/16 those of a regular convolution, and its memory access is around 1/4 that of a regular convolution. This design aims to reduce redundant computations, maximize channel information preservation, and enhance feature extraction. See Figure 7 for the structural layout of the C2f-Faster module.

d) Auxiliary detection head
In the YOLO series networks, downsampling reduces the size and resolution of feature maps, so fine-grained information is lost when learning complex image features. Therefore, this study introduces the Auxiliary Head strategy from YOLOv7 (Wang et al., 2023). By embedding auxiliary heads in the middle layers of the network, additional gradient signals are provided to enhance gradient backpropagation. During training, the auxiliary detection head can extract more shallow network information, obtain fine-grained feature maps, and accelerate the regression of the detection boxes. This auxiliary learning mode enhances the model's understanding of multi-scale targets and complex scenes. Meanwhile, the losses of the auxiliary branch and the main branch are combined, using a richer gradient information flow to aid network training, thereby improving detection accuracy and reducing the risk of overfitting. Assuming a is the participation rate of the auxiliary detection head, the loss calculation is as shown in Equation (11). In the formula, LOSS_G is the total model loss; LOSS_A is the backbone network loss; LOSS_M is the loss of the auxiliary detection heads.
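The way an auxiliary head contributes to training can be sketched as a weighted sum of two loss terms. The exact form of Equation (11) is not recoverable from the text, so the snippet below is only an assumption: it adds the auxiliary loss scaled by the participation rate to the main loss during training and ignores it at inference. The function name, argument names, and the default rate of 0.25 are illustrative placeholders.

```python
def combined_loss(main_loss, aux_loss, a=0.25, training=True):
    """Assumed form: total = main + a * aux, with the auxiliary term used only in training.

    a is the participation rate of the auxiliary head; 0.25 is a placeholder,
    not a value taken from the text.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("participation rate a is expected to lie in [0, 1]")
    if not training:
        return main_loss  # the auxiliary head is not called at inference time
    return main_loss + a * aux_loss
```

A small rate keeps the main head dominant while still passing extra gradient signal to the shallower layers.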

Evaluation metrics
To compare detection performance between models, precision (P), recall (R), mean average precision (mAP), frames per second (FPS), model weight (MB), and floating-point operations (FLOPs) are selected as evaluation metrics (Jiang et al., 2018; S et al., 2023).
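The first three metrics can be illustrated with a short sketch that uses their standard definitions: precision is TP / (TP + FP), recall is TP / (TP + FN), and mAP is the mean of the per-class average precision values. The detection counts in the example are hypothetical. The three per-class values are those quoted in the abstract for unripe, semi-ripe, and ripe tomatoes; treating them as per-class AP values is an assumption.

```python
def precision(tp, fp):
    """Fraction of predicted boxes that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of ground truth fruits that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mean_average_precision(per_class_ap):
    """Mean of per-class average precision values."""
    return sum(per_class_ap) / len(per_class_ap)


# Hypothetical counts: 90 correct detections, 10 false detections, 15 missed fruits.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=15)

# Per-class values quoted in the abstract (unripe, semi-ripe, ripe), in percent.
map_percent = mean_average_precision([97.1, 91.0, 93.7])
```

With these inputs the mean of the three per-class values rounds to 93.9, which agrees with the overall mean average precision reported for the self-constructed dataset.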
3 Results and analysis

Ablation experiments to improve the model
This study sets uniform training parameters and conducts 11 groups of ablation experiments to assess the impact of each improvement strategy on multi-stage tomato detection. Given the needs of actual detection scenarios, the YOLOv8n model is chosen as the baseline network. The models are compared on the evaluation metrics, with experimental results shown in Table 1.
According to the data in Table 1, Experiment 1 uses the original YOLOv8n model, achieving a precision of 80.5%, a recall of 77%, and an mAP of 86.7%, with a model weight of 5.99MB and 8.1 GFLOPs. Experiment 2, which replaced the backbone network with EfficientViT, shows increases in precision, recall, and mAP of 9.2, 11.5, and 6.9 percentage points, respectively, compared to Experiment 1. It also reduces the model weight and floating-point operations, indicating that the EfficientViT network significantly improves model performance by enhancing feature extraction capability while reducing the size and computational complexity of the model. Experiment 3 introduced C2f-Faster to optimize the feature transfer path and accelerate feature fusion, enhancing the model's response speed, with precision and recall improving by 11.42% and 14.93%, respectively, and the frame rate increasing by 49.29%. In Experiment 4, after switching to the SIoU loss function, the model's precision, recall, and mAP all improved, suggesting that SIoU helps the model converge and enhances its recognition accuracy and stability. Experiment 5, which added an auxiliary detection head, led to a 2.7 percentage point increase in mAP, slightly improving detection accuracy. However, due to the additional detection head, the model weight increased by 1.46MB and the frame rate dropped by 10.31%. Compared to the baseline network, the improved model achieves the best detection performance, with increases in precision, recall, and mAP of 10.9, 11.7, and 7.2 percentage points, respectively. Although the introduction of more modules increased the model weight and computational requirements, the detection performance improved significantly. The ablation results show that the optimization strategies proposed for the YOLOv8n network in this study are meaningful.
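Since the ablation results mix percentage-point gains with relative percentage changes, a short calculation helps keep the two apart. The sketch below uses the baseline values quoted above (80.5%, 77%, 86.7%) and the YOLOv8-EA values reported in the abstract for the self-constructed dataset (91.4%, 88.7%, 93.9%); the helper names are illustrative.

```python
baseline = {"P": 80.5, "R": 77.0, "mAP": 86.7}   # YOLOv8n, in percent
improved = {"P": 91.4, "R": 88.7, "mAP": 93.9}   # YOLOv8-EA, in percent


def point_gain(new, old):
    """Absolute difference between two percentages, in percentage points."""
    return round(new - old, 1)


def relative_gain(new, old):
    """Change relative to the old value, in percent."""
    return round((new - old) / old * 100, 1)


point_gains = {k: point_gain(improved[k], baseline[k]) for k in baseline}
relative_gains = {k: relative_gain(improved[k], baseline[k]) for k in baseline}
```

The percentage-point gains come out as 10.9, 11.7, and 7.2, matching the comparison with the baseline, while the corresponding relative gains are larger (13.5% for precision, for example), so the unit needs to be stated whenever a gain is quoted.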

Model performance comparison before and after improvements
Figure 8 compares the mean average precision (mAP) curves at different IOU thresholds and the box loss curves of YOLOv8-EA and YOLOv8n. In Figure 8A, when the IOU threshold is 0.5, YOLOv8-EA shows a significant improvement in mean average precision over YOLOv8n. As the IOU threshold increases, the gap between the two narrows, but YOLOv8-EA performs better across all thresholds.

The recognition performance of the improved model for multi-stage tomatoes is shown in Table 2. Compared to YOLOv8n, the improved YOLOv8-EA increases the recognition precision for the three stages of tomatoes by 10.1%, 17%, and 5.7%, respectively, while increasing the computational load by only 16%. It also raises mAP@0.5, mAP@0.5:0.95, and frame rate by 7.2, 7.5, and 81.25 percentage points, respectively, providing strong technical support for real-time tomato detection in complex environments.

Comparison test of different detection models
To verify the effectiveness of the proposed method, it was compared with mainstream object detection algorithms on the same dataset, with results shown in Table 3. The results show that the improved YOLOv8-EA model surpasses the other models in precision, recall, and mean average precision, indicating superior detection performance.
Additionally, the improved model achieves a detection frame rate that significantly surpasses the other models. Even though its weight and FLOPs are slightly larger than those of YOLOv8n, it still fits practical scenarios well. Across the evaluation metrics, the improved model balances speed and accuracy effectively and shows better overall performance than the other models, especially in multi-stage fruit detection.
Figure 9 depicts the recognition results of various mainstream target detection models on tomatoes at different growth stages. As observed from Figure 9, under complex conditions such as overlapping tomato fruits and occlusion by branches and leaves, the other models exhibit missed and false detections. The improved YOLOv8-EA model significantly reduces these errors. It recognizes small target tomatoes in complex environments more reliably and with higher confidence scores.

Comparative tests of different detection models on publicly available datasets
In order to comprehensively assess the enhanced YOLOv8-EA model, it was also evaluated on a publicly accessible dataset provided by Kaggle (http://www.kaggle.com). The dataset covers a diverse range of real-world scenarios, with a total of 17,345 images that show the various stages of tomato maturation, which makes it a suitable testbed for validating each detection model.
Five mainstream detection models, including YOLOv8-EA, were selected for this test, and all models were trained and tested on the same experimental platform to ensure that the results are comparable and the process is fair and consistent. The principal performance metrics are shown in Table 4. The enhanced YOLOv8-EA model performs robustly on this public dataset, with a precision of 91%, a recall of 89.2%, and a mean average precision of 95.1%. These metrics are superior to those of the other models, confirming the efficacy of the optimised model structure, particularly for complex backgrounds and highly variable fruit images. Despite the increased weight and computational requirements of the YOLOv8-EA model, its detection time is as low as 1.8 ms, indicating that the model makes effective use of computational resources while maintaining high processing efficiency. This makes it suitable for real-time processing scenarios where high efficiency and accuracy are paramount.
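The detection time quoted above can be related to a frame rate with a simple conversion. The sketch below assumes that 1.8 ms is the per-image inference time and that the 18.2% speed improvement mentioned in the abstract refers to a reduction in that time relative to the baseline; both readings are assumptions, and the function names are illustrative.

```python
def fps_from_latency_ms(latency_ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms


def latency_before_reduction(latency_after_ms, reduction):
    """Baseline latency implied by a fractional reduction, e.g. 0.182 for 18.2%."""
    return latency_after_ms / (1.0 - reduction)


implied_fps = fps_from_latency_ms(1.8)
implied_baseline_ms = latency_before_reduction(1.8, 0.182)
```

Under these assumptions, 1.8 ms corresponds to roughly 556 images per second of pure inference, and the implied baseline time is about 2.2 ms.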

Discussion
Building on a review of previous related work, we proposed the YOLOv8-EA model for detecting multi-stage tomato fruits, taking into account actual tomato picking conditions and the differences between fruit ripening stages. Section 3 demonstrates its performance and accuracy.
EfficientViT employs sandwich-layout blocks, using a single memory-efficient MHSA between efficient feed-forward networks (FFN), which improves storage efficiency while increasing the number of feature channels. It introduces a new Cascaded Group Attention module (CGA), which saves computation while ensuring high-quality extraction of key features. SIoU evaluates the overlap between predicted and ground truth boxes more reasonably, enabling the model to reach its optimal state more quickly during training. PConv exploits the redundancy in feature maps by applying a regular convolution (Conv) to only a subset of input channels and leaving the others untouched. Additionally, a pointwise convolution (PWConv) is added on top of PConv to make full use of the information from all channels. This approach reduces the number of parameters and the computational complexity while maintaining the receptive field and nonlinear representation capability. Aux-Head provides additional supervision signals at the early stages of training, enhancing feature extraction and thereby improving overall detection accuracy. This richer information feedback stream accelerates model convergence and alleviates memory pressure. Aux-Head is used to capture shallow network information, with Detect guiding Aux-Detect in matching positive detection samples, which addresses performance degradation and poor positive sample quality issues as model depth decreases. Therefore, the YOLOv8-EA detection model is both lightweight and accurate.
Although these improvements have significantly enhanced the model's performance and accuracy, some limitations remain and warrant deeper exploration in future work. First, the introduction of the EfficientViT module and the C2f-Faster module has reduced the model's parameters and computational complexity and accelerated its running speed; however, further optimization of the model is still needed. Second, although the new loss function speeds up the model's convergence, the accuracy of bounding box localization may still be insufficient for complex edges or heavily overlapping targets. For severely occluded fruits and scenes with significant lighting variations, recognition efficiency and accuracy decrease, so further research is needed to improve the loss function or to introduce newer feature extraction and fusion techniques. Furthermore, while the auxiliary detection head (Aux-Head) module enhances the network's learning capability, it also increases the model's structural complexity. More computational resources and storage space are therefore required during model training and deployment, which could pose challenges on resource-constrained edge devices. Lastly, the model proposed in this study performs excellently on the tomato datasets, but its generalization to other crop datasets remains to be verified.

Conclusion
This paper proposes the YOLOv8-EA multi-stage detection model for tomatoes, achieving rapid and accurate detection of tomato fruits in complex environments, and validates the improved model's detection performance on a self-constructed dataset. The main conclusions are as follows:
1) The architecture adopts the EfficientViT network as the backbone and introduces the SIoU loss function, the C2f-Faster module, and an auxiliary detection head. On the self-constructed dataset, compared to the baseline network YOLOv8n, with only a 2.07MB increase in model weight and a 1.3G rise in FLOPs, the detection precision for unripe, semi-ripe, and ripe tomatoes increased by 4.8, 12.6, and 4.2 percentage points, respectively; meanwhile, the detection frame rate improved by 42.1%, enhancing both detection efficiency and precision.
2) On both the self-constructed dataset and the public dataset, the YOLOv8-EA model proposed in this study outperforms the current mainstream target detection models on several metrics, shows clear advantages in overall performance, and detects multi-stage tomatoes more effectively, providing technical support for subsequent intelligent picking.
3) A visual comparison of detection results shows that YOLOv8-EA produces fewer missed and false detections of tomatoes in complex environments than the compared models. This indicates the feasibility of the proposed object detection algorithm. Subsequent work will further reduce the model's parameter count to adapt it to practical environments with limited computing resources.
YOLOv8 is the latest SOTA (State-of-the-Art) model released by the Ultralytics team in 2023. Building on the success of YOLOv5, it incorporates new improvements and features to further enhance flexibility and performance. The main changes include: replacing the original C3 module with the C2f module; removing the convolution operation in the upsampling process; and introducing a new anchor-free decoupled head structure. The network structure of YOLOv8 includes the backbone network, neck network, and head network.

FIGURE 4 YOLOv8-EA network architecture diagram. Conv represents the ordinary convolution operation; MBConv represents convolution with an inverted residual structure; C2f-Faster is a C2f module with partial convolution; Upsample refers to upsampling; Concat denotes the concatenation operation; SPPF stands for the Spatial Pyramid Pooling - Fast module; Aux-Detect is an auxiliary detection head, called only during training; Bbox.Loss and Cls.Loss stand for bounding box loss and classification loss, respectively.
In Equation (1), $K_j$ is the j-th column of matrix K; $V_j$ is the j-th column of matrix V; $W_{(Q,K,V)}$ is the learnable mapping matrix; and $f(\cdot)$ is the kernel function. By integrating multi-scale information and strengthening feature fusion, the EfficientViT network introduced here enhances the recognition of subtle features and improves robustness in complex environments, further improving the model's performance and efficiency.

b) SIoU loss function
YOLOv8 uses the CIoU (Zheng et al., 2022) loss function to optimize localization loss. Although CIoU accounts for aspect ratio and scale on the basis of GIoU (Rezatofighi et al., 2019) and DIoU (Zheng et al., 2019), it relies on the aggregation of bounding box regression indicators and neglects the mismatch in orientation between the predicted and ground truth boxes. As a result, the predicted boxes may "wander" without direction during training, which slows convergence and degrades detection performance. The SIoU (Gevorgyan, 2022) loss function (as shown in Figure 6) introduces the vector angle between the ground truth box and the predicted box, redefines the penalty metric, and improves the accuracy of the detection task. The SIoU loss function consists of four penalty terms: angle loss, distance loss, shape loss, and IOU loss.
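The four penalty terms can be combined as in the following plain-Python sketch for a single pair of boxes. It follows the SIoU formulation as commonly published (Gevorgyan, 2022), which is an assumption about the exact equations used in this paper: the angle term modulates the distance term, the distance is normalized by the smallest enclosing box, and the loss is 1 - IoU plus half the sum of the distance and shape terms. The function name, the (cx, cy, w, h) box layout, and the default shape weight of 4 are illustrative choices.

```python
import math


def siou_loss(pred, gt, theta=4.0, eps=1e-9):
    """SIoU loss for two boxes given as (cx, cy, w, h); theta is the shape weight in [2, 6]."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # IoU term from the box corners.
    inter_w = max(0.0, min(px + pw / 2, gx + gw / 2) - max(px - pw / 2, gx - gw / 2))
    inter_h = max(0.0, min(py + ph / 2, gy + gh / 2) - max(py - ph / 2, gy - gh / 2))
    inter = inter_w * inter_h
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Width and height of the smallest box enclosing both boxes.
    enc_w = max(px + pw / 2, gx + gw / 2) - min(px - pw / 2, gx - gw / 2)
    enc_h = max(py + ph / 2, gy + gh / 2) - min(py - ph / 2, gy - gh / 2)

    # Angle term: 0 when the centers are axis-aligned, 1 when they sit at 45 degrees.
    dx, dy = gx - px, gy - py
    center_dist = math.hypot(dx, dy) + eps
    sin_alpha = min(1.0, abs(dy) / center_dist)
    angle_cost = 1.0 - 2.0 * math.sin(math.asin(sin_alpha) - math.pi / 4) ** 2

    # Distance term, modulated by the angle term.
    gamma = 2.0 - angle_cost
    rho_x = (dx / (enc_w + eps)) ** 2
    rho_y = (dy / (enc_h + eps)) ** 2
    distance_cost = (1.0 - math.exp(-gamma * rho_x)) + (1.0 - math.exp(-gamma * rho_y))

    # Shape term: relative mismatch in width and height.
    omega_w = abs(pw - gw) / max(pw, gw)
    omega_h = abs(ph - gh) / max(ph, gh)
    shape_cost = (1.0 - math.exp(-omega_w)) ** theta + (1.0 - math.exp(-omega_h)) ** theta

    return 1.0 - iou + (distance_cost + shape_cost) / 2.0
```

A perfectly matched prediction gives a loss close to zero, and the loss grows as the predicted box drifts away from the ground truth box or its shape diverges.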

FIGURE 5 EfficientViT network architecture diagram.

FIGURE 6 SIoU parameters schematic diagram. $B(b_{cx}, b_{cy})$ and $B^{gt}(b^{gt}_{cx}, b^{gt}_{cy})$ represent the center coordinates of the predicted box and the ground truth box, respectively; $C_w$ and $C_h$ represent the differences in the horizontal and vertical coordinates between the points $B$ and $B^{gt}$, respectively; $\alpha$ is the horizontal angle between the center points of the two boxes; $w$, $h$, $w^{gt}$, and $h^{gt}$ represent the width and height of the predicted box and the ground truth box, respectively; $\sigma$ represents the distance between the center points of the ground truth box and the predicted box.

FIGURE 7 C2f-Faster module architecture diagram. h and w represent the height and width of the input feature map; c denotes the number of input channels; c_p denotes the number of channels participating in convolution; PConv stands for Partial Convolution; Split represents the channel splitting module.
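Using the symbols h, w, c, and c_p defined in this caption, the cost of a partial convolution can be compared with that of a regular convolution in a few lines. The sketch assumes the cost expressions commonly given for PConv (Chen et al., 2023): h × w × k² × c_p² for FLOPs and h × w × 2c_p + k² × c_p² for memory access, with a regular convolution obtained by setting c_p = c. The feature map size and channel counts in the example are hypothetical.

```python
def conv_flops(h, w, k, channels):
    """FLOPs of a k x k convolution with equal input and output channels."""
    return h * w * k * k * channels * channels


def conv_memory_access(h, w, k, channels):
    """Memory access: read and write the feature map, plus the kernel weights."""
    return h * w * 2 * channels + k * k * channels * channels


# Hypothetical layer: 80 x 80 feature map, 3 x 3 kernel, 64 channels, partial ratio 1/4.
h, w, k, c = 80, 80, 3, 64
c_p = c // 4

# PConv applies the same convolution to only c_p of the c channels.
flops_ratio = conv_flops(h, w, k, c_p) / conv_flops(h, w, k, c)
memory_ratio = conv_memory_access(h, w, k, c_p) / conv_memory_access(h, w, k, c)
```

With a partial ratio of 1/4, the FLOPs ratio is exactly (1/4)² = 1/16, and the memory access ratio is close to 1/4 because the feature map term dominates the kernel term.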

FIGURE 8 Training curves before and after model improvement. (A) Average precision curve. (B) Box loss function.

TABLE 1 Results of ablation studies for the improved model.
"×" indicates that the strategy is not used; "√" indicates that the strategy is used.

TABLE 2 Test results before and after improvements of YOLOv8n model.

TABLE 4 Key performance indicators of different models on public datasets.