YOLOv8-RCAA: A Lightweight and High-Performance Network for Tea Leaf Disease Detection

Abstract: Deploying deep convolutional neural networks on agricultural devices with limited resources is challenging due to their large number of parameters. Existing lightweight networks can alleviate this problem but suffer from low performance. To this end, we propose a novel lightweight network named YOLOv8-RCAA (YOLOv8-RepVGG-CBAM-Anchorfree-ATSS), aiming to locate and detect tea leaf diseases with high accuracy and performance. Specifically, we employ RepVGG to replace CSPDarkNet53 to enhance feature extraction capability and inference efficiency. Then, we introduce CBAM attention into the FPN and PAN of the neck layer to enhance the model's perception of channel and spatial features. Additionally, the anchor-based detection head is replaced by an anchor-free head to further accelerate inference. Finally, we adopt the ATSS algorithm to adapt the allocation strategy for positive and negative samples during training to further enhance performance. Extensive experiments show that our model achieves precision, recall, F1 score, and mAP of 98.23%, 85.34%, 91.33%, and 98.14%, outperforming traditional models by 4.22~6.61%, 2.89~4.65%, 3.48~5.52%, and 4.64~8.04%, respectively. Moreover, the model has near-real-time inference speed, which provides technical support for deployment on agricultural devices. This study can reduce the labor costs associated with the detection and prevention of tea leaf diseases. It is also expected to promote the integration of rapid disease detection into agricultural machinery, thereby advancing the implementation of AI in agriculture.


Introduction
Tea is one of the most popular beverages in China. It contains over 700 different chemical compounds, including tea polyphenols, carbohydrates, vitamins, and caffeine, which are beneficial to human health [1]. It has also become an important economic crop in China, playing a crucial role in the development of agriculture and rural areas. According to statistics from 2022, the area of tea plantations in China has increased to 3.2 million hectares [2]. However, tea leaf diseases pose a serious threat to the sustainable development of the tea industry. These diseases lead to decreased tea yields and significant economic losses, presenting substantial challenges to both tea farmers and scientists. Therefore, accurate and real-time detection of tea leaf diseases is urgently needed for timely management.
Manual methods for detecting tea leaf diseases are unreliable, inefficient, and time-consuming [3,4]. Ensuring efficiency and accuracy in the detection process is crucial for the precise and real-time identification of tea leaf diseases. The application of intelligent terminal devices shows great promise in addressing the problem of tea leaf disease detection. Traditional machine learning techniques typically necessitate the manual extraction of features such as color, texture, and shape, followed by classification for identification. Zhang Shuitang [5] successfully implemented rapid identification and classification of tea leaf diseases by combining hyperspectral imaging technology with machine learning. This provides significant theoretical support and practical value for remote sensing disease monitoring and early warning using plant protection drones. Ting Zhang et al. [6] used hyperspectral imaging and machine learning to evaluate the damage and recoverability of non-glyphosate-resistant corn plants exposed to glyphosate, achieving over 95% accuracy in classification. Alper Taner et al. [7] classified apple varieties using deep learning and machine learning, achieving 97.48% accuracy with DenseNet201, 98.28% with SVM using deep features, and 99.77% with MLP using PCA. However, traditional machine learning algorithms rely on feature engineering [8,9]. This requires the manual design and extraction of plant phenotypic features [10]. If feature extraction is insufficient or the features lack discriminative power, it is challenging to capture complex patterns effectively, which degrades the model's performance. These limitations hinder the algorithm's ability to detect diseases under varying environmental conditions [11]. In contrast, deep learning methods can achieve higher accuracy in the field of recognition.
Rapid advancements in computer vision, pattern recognition, and artificial intelligence have become key areas of interest in agricultural research. These technologies have significant potential for identifying and assessing plant diseases and pests. Researchers are now using deep learning techniques to address the limitations of traditional machine learning methods [12]. Yunfei Wang et al. [13] developed the GMM-DC module for segmenting adhesive pests in apple orchards, achieving a high segmentation accuracy of 95.75% and improving pest recognition accuracy with Mask R-CNN models. Shahrzad Zolfagharnassab et al. [14] employed an ANN to develop a thermal imaging method for assessing oil palm maturity. They found that the temperature difference between the fruit bunch and the ambient temperature effectively identified maturity levels. This method achieved a classification accuracy of 91.5%. Wen-Liang Chen et al. [15] employed IoT and AI technologies to detect rice blast disease using non-image sensors, achieving an accuracy of 89.4% in real-time data analysis and predictions. Qi Yang et al. [16] developed a near-real-time deep learning approach using UAV RGB images to detect rice phenology, achieving an accuracy rate of 83.9% and a mean absolute error of 0.18.
Deep learning techniques for object detection have advanced notably in the field of agriculture. Numerous models, including Faster R-CNN, SSD, and EfficientDet, have been utilized for identifying plant diseases. Among them, the YOLO series stands out for its speed, efficiency, global perception, and simplicity, making it ideal for applications that demand high real-time performance and have limited resources [17]. Although the YOLO series models have made progress in plant disease detection [18,19], further optimization is still needed. Balancing accuracy, speed, and a lightweight design suited to agriculture remains a challenge. Researchers have conducted numerous experiments to optimize the existing YOLO models. Md. Janibul Alam Soeb et al. [20] created a novel dataset encompassing four major tea gardens in Bangladesh and utilized data augmentation techniques to enhance dataset diversity. They developed an improved YOLOv7 model, termed YOLO-T, specifically for detecting tea leaf diseases, achieving high accuracy (97.3%), precision (96.7%), recall (96.4%), mAP (98.2%), and F1 score (0.965). This work is the first application of such techniques for tea leaf disease detection in Bangladesh, significantly contributing to the advancement of smart agriculture. Fahad Jubayer et al. [21] utilized an improved YOLOv5 algorithm for mold detection on food surfaces, achieving a precision of 98.10%, a recall of 100%, and an average precision of 99.60%. This study marks the first successful application of YOLOv5 in mold detection. Wenji Yang et al. [22] created an innovative crop pest detection model called YOLOv5s-pest, incorporating the HSPPF module, the NCBAM module, recursive gated convolution, and Soft-NMS. This model achieved an mAP of 92.5%, effectively tackling agricultural challenges and providing substantial improvements for prompt and precise pest management. Yuzhuo Zhang et al. [23] developed a transformer-based model called YOLO-Sp for detecting Achnatherum splendens using ground-based visible spectrum images, achieving high performance with AP values of 98.4% in object detection and 95.4% in image segmentation. Ye Rong et al. [24] proposed an improved YOLOv5s-ECA-ASFF algorithm for detecting tea tree diseases. This algorithm builds on YOLOv5 and incorporates the ECA channel attention module, adaptive spatial feature fusion technology, and the GIoU loss function. It achieved an average precision of 92.1% in identifying tea tree diseases under complex backgrounds. Yishen Lin et al. [25] developed AG-YOLO, an efficient citrus fruit detection method incorporating NextViT and a Global Context Fusion Module. AG-YOLO achieved a precision of 90.6% and an mAP of 81.2%, outperforming existing models while maintaining high accuracy at a speed of 34.22 FPS. Zhengyang Zhong et al. [26] created Light-YOLO, an efficient and lightweight mango detection model based on Darknet53 with bidirectional and skip-connection modules, optimizing the structure to reduce parameters and FLOPs and achieving mAP values of 64.0% and 96.1%, making it ideal for rapid and precise mango detection in agriculture.
In summary, researchers have proposed many excellent modules and networks for detecting various plant diseases and pests. However, the YOLO series algorithms still have the following issues: (1) the complex multi-branch structure of the backbone limits the speed of disease detection [27]; (2) performance drops when dealing with small and dense targets in complicated environments; and (3) the detection head relies heavily on predefined anchor boxes, resulting in slow inference [28]. Furthermore, the inflexibility of the sample allocation strategy also limits the effective utilization of features during training [29].
We introduce YOLOv8-RCAA, a high-performance, lightweight model for detecting tea leaf diseases based on an enhanced version of YOLOv8. The YOLOv8-RCAA model demonstrates significant potential for rapid and accurate identification of tea plant diseases, offering promising integration into agricultural machinery. It can reduce labor costs and foster the development of smart agriculture. Specifically, we replace the CSPDarkNet53 backbone with RepVGG blocks and apply re-parameterization to increase detection speed. Then, we introduce a convolutional block attention module (CBAM) into the feature fusion layer to compute attention across both channel and spatial dimensions, significantly enhancing the feature extraction capability. Furthermore, we adopt an anchor-free head, which does not require predefined prior boxes, to simplify the model structure. In summary, the main contributions of this study are as follows:

1. We adopt RepVGG to simplify the multi-branch structure into a single branch through structural re-parameterization, which enhances the feature extraction capability and increases efficiency.

2. We integrate the CBAM block into the FPN-PAN layer to adapt feature map weights, which enables the model to detect small and dense tea leaf diseases in complex environments.

3. An anchor-free head replaces the anchor-based head, which simplifies the detection process by dropping predefined anchor boxes and significantly increases inference speed.

4. We adopt the ATSS strategy to allocate positive and negative samples dynamically, which makes better use of effective features during training and enhances the accuracy of the model.

Experimental Materials
This work focuses on a tea plantation in Jiaokou County, Lüliang City, Shanxi Province, China (latitude 36.98, longitude 111.16), aiming to comprehensively understand the occurrence of tea leaf diseases in this region. The tea plantation scene is shown in Figure 1. The experimental site is located in the North China monsoon climate zone, characterized by significant temperature variations and predominantly poor loess soil, with some areas exhibiting saline-alkaline conditions. The tea plants in the plantation have an average age of 10 years, an average height of approximately 1.5 m, and are spaced 1.5 m apart within rows. This study selected the highly adaptable Da Qing tea (Jin tea) as the research subject, known for its flat, dark, and glossy leaves that make it suitable for harsh environmental conditions. To precisely capture the actual conditions of tea leaf diseases in the plantation, we carefully considered factors such as light intensity, weather conditions, and shooting angles. Images were collected using a Bettertree RX0212 camera mounted on a tripod at a standard height of 1 m above the tea canopy to ensure consistency. The shooting rules included taking photographs between 10 AM and 2 PM to utilize optimal natural light and avoiding rainy or excessively cloudy days to prevent reflections and distortions. Professionals carefully selected and annotated the images, resulting in a total of 995 images depicting various tea leaf diseases, including powdery mildew, algal spot, red blotch, damping-off, anthracnose, and pest infestations. A subset of these samples is presented in Figure 2.

Figure 2 panel labels: Powdery Mildew, Algal Spot, Red Blotch, Damping-Off, Anthracnose, Pest.

Data Preprocessing
Since only 995 images were initially collected, data augmentation techniques were used to expand the dataset to 4001 images, thereby enhancing its quality and diversity. The following data augmentation methods were utilized [30]: (1) Gaussian blur: applying Gaussian blur to the images to simulate the effect of out-of-focus camera shots [18]. (2) Random noise: adding randomly distributed noise to the images to mimic conditions where the camera experiences strong electromagnetic interference [31]. (3) Hue enhancement: adjusting the hue of the images randomly to simulate varying lighting conditions during photography [32]. (4) Exposure enhancement: changing the exposure of the images randomly to simulate environments with both low light and direct sunlight [33]. (5) Random splicing: performing random cropping and splicing of different images to increase diversity within a single photo, simulating more complex environments and thus enhancing model robustness [34]. The distribution of the dataset before and after data augmentation is shown in Table 1.
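The augmentations above can be sketched in a few lines of NumPy. This is a simplified illustration rather than the paper's actual pipeline; the function names, kernel sizes, and parameter values are our own, and the hue adjustment is omitted for brevity since it requires an RGB-to-HSV round trip.

```python
import numpy as np

def gaussian_blur(img, ksize=5, sigma=1.0):
    """Separable Gaussian blur; img is an (H, W, 3) float array in [0, 1]."""
    ax = np.arange(ksize) - ksize // 2
    k = np.exp(-ax**2 / (2 * sigma**2))
    k /= k.sum()
    # Convolve along rows, then along columns, for each channel.
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)
    return out

def random_noise(img, std=0.05, rng=None):
    """Additive Gaussian noise, clipped back to the valid range."""
    if rng is None:
        rng = np.random.default_rng()
    return np.clip(img + rng.normal(0.0, std, img.shape), 0.0, 1.0)

def exposure_shift(img, gain):
    """Multiplicative exposure change; gain < 1 darkens, gain > 1 brightens."""
    return np.clip(img * gain, 0.0, 1.0)

def random_splice(a, b, rng=None):
    """Stitch the left part of image a to the right part of image b."""
    if rng is None:
        rng = np.random.default_rng()
    cut = int(rng.integers(1, a.shape[1]))
    return np.concatenate([a[:, :cut], b[:, cut:]], axis=1)
```

Each function maps an image (or pair of images) to an augmented image of the same shape, so the transforms can be composed freely when expanding the dataset.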

Network Design
The network design of a tea leaf disease detection model is crucial, impacting both accuracy and inference speed.We propose an improved YOLOv8-based model that optimizes the backbone network, introduces attention mechanisms, enhances the detection head, and adjusts the training strategy.

YOLOv8 Network Architecture
YOLOv8 represents the eighth generation of the YOLO series, enhancing both accuracy and speed in object detection through a series of improvements [35]. This algorithm adopts a single-stage object detection method, treating the task as a regression problem by splitting the image into grid cells to locate and classify objects. The YOLOv8 network architecture comprises backbone, neck, and head layers. The backbone layer employs the CSPDarkNet53 network, the neck layer integrates both FPN and PAN networks, and the head layer utilizes an anchor-based head.
CSPDarkNet53 consists of CBS convolutional layers and C2f residual blocks. The CBS convolutional layer performs a 3 × 3 convolution, followed by a BN layer and a sigmoid-based activation function [36]. The C2f residual block consists of 3 × 3 convolutional layers and several bottleneck units. The output from the second-to-last residual block is combined with the original input, and the combined result is then processed through a ReLU function. The introduction of the C2f residual block effectively addresses the issues of gradient explosion and gradient vanishing. CSPDarkNet53 extracts high-, mid-, and low-level features from the image, which are sequentially input into the FPN (Feature Pyramid Network) and PAN (Path Aggregation Network) to fuse features at different levels. The FPN propagates high-level semantic information from low-resolution feature maps down to high-resolution feature maps in a top-down manner and then further integrates feature information through a cascade approach with CBS modules [37]. The fused features are then input into the PAN, which propagates localization information from high-resolution feature maps up to low-resolution feature maps in a bottom-up manner, thereby supplementing the information in the high-level feature maps [38]. Finally, the features at different levels, fused through the FPN and PAN layers, are input into the anchor-based head for prediction, yielding the height, width, and center coordinates of the objects. The anchor-based head consists of one 3 × 3 convolutional layer and two 1 × 1 convolutional layers. The 3 × 3 convolutional layer is used to integrate feature information; one of the 1 × 1 convolutional layers includes four convolutional kernels for predicting the four bounding box coordinates, and the other contains n convolutional kernels to predict the object classes.

Overall Design of the YOLOv8-RCAA Model
Drawing on the YOLOv8 architecture, we introduce a novel model called YOLOv8-RCAA designed to improve the precision and efficiency of detecting tea leaf diseases and pests. This study optimizes detection accuracy and speed by employing a multi-branch network to replace the original backbone network and by integrating structural re-parameterization techniques. Specifically, the CSPDarkNet53 backbone network of the YOLOv8 model is replaced with RepVGG to improve accuracy, and structural re-parameterization is applied to the RepVGG backbone network to increase detection speed.
In the feature fusion network layers (FPN and PAN), this study introduces the RepBlock and CBAM (Convolutional Block Attention Module) attention mechanism modules to enhance the representation and fusion capabilities of features at different scales [39]. The FPN and PAN network layers in the YOLOv8 architecture consist of multiple C2f modules, convolutional layers, and upsampling layers, with input and output feature sizes remaining unchanged. The specific fusion methods include (1) replacing the C2f modules in the feature fusion network layers with RepBlock modules and introducing the CBAM attention mechanism before the RepBlock modules and (2) adding the CBAM attention mechanism between the feature fusion network layers and the head layer to enhance the feature representation capabilities at different levels.
In the detection layer, this study adopts an anchor-free head to replace the anchor-based head, eliminating the use of multiple anchors for regressing target bounding boxes and directly regressing the bounding box from the center point. Structurally, a center-point convolutional layer is added to the original anchor-based detection layer to predict the targets' positions, significantly reducing the computational load of anchor-based regression and improving detection speed [40]. The specific improvements are illustrated in Figure 3.

I. Network structure: The RepVGG backbone network is designed based on the RepBlock module (as illustrated in the dashed box in Figure 4), which performs excellently on both GPUs and edge devices [41]. Additionally, RepVGG can streamline the network architecture while preserving robust feature extraction capabilities. This is achieved through structural re-parameterization, which decreases the number of parameters and the computational complexity [42]. During training, RepVGG splits the RepBlock module into multiple branch modules, while during inference it merges them into a single module consisting of a 3 × 3 convolutional layer and a BN layer. RepVGG draws on ideas from ResNet, using residual connections to establish the information flow of data features. RepVGG models the information flow as follows:

y = x + g(x) + f(x)

where x is the identity mapping of the information, g(x) represents the convolution operations used to match the channel dimensions, and f(x) is the residual learning component.
During training, the residual architecture of RepVGG includes a 1 × 1 convolutional layer and an identity residual branch, depicted in the dashed box in Figure 4. This design tackles problems such as gradient vanishing and explosion in deep backbone networks, thereby improving feature extraction. However, the multi-path topology of the residual structure consumes more computational resources during inference than the single-path structure due to the additional kernels, leading to slower inference speed.
II. Structural re-parameterization: Traditional network models are typically used for inference directly after training. However, RepVGG employs structural re-parameterization techniques to equivalently transform the multi-branch structure used during training into a single-path structure used during inference. This allows the model to retain both the high detection accuracy of the multi-branch structure and the fast inference speed of the single-path structure. The method first merges the convolutional layers and BN (Batch Normalization) layers, then converts each merged convolutional layer into a 3 × 3 convolutional layer, and finally merges the convolutional kernels from each branch into a single kernel based on the additivity of convolution. The process of structural re-parameterization is shown in Figure 4. Specifically, a convolutional layer can be formulated as

Conv(x) = W ∗ x + b

where W represents the convolution kernel parameters and b indicates the bias. The BN layer is calculated as

BN(x) = γ · (x − µ)/σ + β

where µ denotes the statistical mean of the data in the BN layer, σ represents the standard deviation, γ is the scaling factor, and β is the bias. We can therefore merge a convolutional layer and its BN layer as

BN(Conv(x)) = (γ/σ) · W ∗ x + ((b − µ)/σ) · γ + β

where (γ/σ)W is the new merged weight and ((b − µ)/σ)γ + β is the new bias.
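The merging rule can be verified numerically. The sketch below is our own illustration, using a 1 × 1 convolution so that the convolution reduces to a matrix multiply; it checks that the fused inference-time layer reproduces the training-time conv + BN pipeline exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
cin, cout, n = 4, 3, 10           # input/output channels, number of pixels
W = rng.normal(size=(cout, cin))  # 1x1 conv kernel, one matrix per pixel
b = rng.normal(size=cout)
mu = rng.normal(size=cout)        # BN statistics and affine parameters
sigma = rng.uniform(0.5, 2.0, size=cout)
gamma = rng.normal(size=cout)
beta = rng.normal(size=cout)

x = rng.normal(size=(cin, n))     # feature map flattened to (channels, pixels)

# Training-time pipeline: convolution followed by batch normalization.
y_conv = W @ x + b[:, None]
y_bn = gamma[:, None] * (y_conv - mu[:, None]) / sigma[:, None] + beta[:, None]

# Merged inference-time layer: W' = (gamma/sigma) W, b' = (b - mu) gamma/sigma + beta.
W_fused = (gamma / sigma)[:, None] * W
b_fused = (b - mu) * gamma / sigma + beta
y_fused = W_fused @ x + b_fused[:, None]

assert np.allclose(y_bn, y_fused)  # the two pipelines agree exactly
```

The same algebra applies per output channel of a 3 × 3 kernel, which is why the fusion can be done once offline before deployment.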

Attention Mechanism
The attention mechanism is pivotal in enabling adaptive attention within neural networks, as it emphasizes the most relevant units while suppressing less important ones during feature extraction. Prevalent attention mechanisms include SENet, CBAM, and ECA. Specifically, the CBAM structure integrates a Channel Attention Module (CAM) and a Spatial Attention Module (SAM), allowing it to concurrently focus on both channel and spatial information within an image. In this study, CBAM is utilized to mitigate the challenges posed by significant noise and uneven brightness in images of tea leaf diseases and pests, which are frequently influenced by varying weather conditions and lighting intensities.
In Figure 5, the CBAM module operates on the input feature map F using a two-step process to improve its representation. Initially, it applies channel attention to F, generating a weighted feature map F′. Next, spatial attention is applied to F′, producing the final output feature map F″. By performing attention operations on the input feature map across both channel and spatial dimensions, CBAM boosts the network's capability to extract significant features from intricate and noisy images, thereby increasing the accuracy of disease and pest detection. We let M_c represent the channel attention operation and M_s represent the spatial attention operation. The Channel Attention Module can be expressed as follows:

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))

where F represents the input, σ represents the sigmoid function, MLP denotes a shared multilayer perceptron, W_0 and W_1 represent the weights of the shared perception layers, AvgPool and MaxPool represent the average pooling and max pooling operations, and F_avg^c and F_max^c represent the average-pooled and max-pooled features of the channels.
As depicted in Figure 6, the feature map's channel information is aggregated using both average-pooling and max-pooling, producing two distinct descriptors: F_avg^c for the average-pooled features and F_max^c for the max-pooled features. These descriptors are fed into a shared network, a multi-layer perceptron (MLP) with a single hidden layer, to produce the channel attention map, and the sigmoid function yields the channel attention feature map. Finally, the original feature map is multiplied by this channel attention feature map, giving the output of the Channel Attention Module (CAM):

F′ = M_c(F) ⊗ F

where ⊗ represents element-wise multiplication. The Spatial Attention Module (SAM) uses the output feature map F′ of the CAM as its input. As shown in Figure 7, two pooling operations along the channel axis extract two features, F_avg^s and F_max^s, the average-pooled and max-pooled features. These features are concatenated, passed through a convolution layer, and fed to the sigmoid function to generate the spatial attention map:

M_s(F′) = σ(f^{7×7}([AvgPool(F′); MaxPool(F′)])) = σ(f^{7×7}([F_avg^s; F_max^s]))

where f^{7×7} denotes a convolution with a 7 × 7 kernel. Finally, to obtain the output of the SAM, and thus of the CBAM, we multiply F′ by the spatial attention feature map:

F″ = M_s(F′) ⊗ F′

where ⊗ again represents element-wise multiplication. Ultimately, the output feature map of the CBAM is the output of the SAM.
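As a concrete illustration, the two attention operations can be sketched in NumPy as follows. This is a minimal reference implementation under our own simplifying assumptions: a two-layer MLP with a ReLU hidden activation, a naive 7 × 7 convolution loop, and (C, H, W) feature maps. The paper's actual module follows the standard CBAM design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, W0, W1):
    """Mc(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))); F has shape (C, H, W)."""
    avg = F.mean(axis=(1, 2))                     # F_avg^c, shape (C,)
    mx = F.max(axis=(1, 2))                       # F_max^c, shape (C,)
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0.0)  # shared MLP, ReLU hidden layer
    return sigmoid(mlp(avg) + mlp(mx))            # per-channel weights in (0, 1)

def spatial_attention(F, k):
    """Ms(F) = sigmoid(conv7x7([avg_c(F); max_c(F)])); k has shape (2, 7, 7)."""
    C, H, W = F.shape
    desc = np.stack([F.mean(axis=0), F.max(axis=0)])  # (2, H, W) descriptors
    p = np.pad(desc, ((0, 0), (3, 3), (3, 3)))        # same-size padding
    out = np.zeros((H, W))
    for i in range(H):                                # naive 7x7 convolution
        for j in range(W):
            out[i, j] = np.sum(p[:, i:i + 7, j:j + 7] * k)
    return sigmoid(out)                               # per-pixel weights in (0, 1)

def cbam(F, W0, W1, k):
    F1 = channel_attention(F, W0, W1)[:, None, None] * F  # F' = Mc(F) ⊗ F
    F2 = spatial_attention(F1, k)[None] * F1              # F'' = Ms(F') ⊗ F'
    return F2
```

Because both attention maps lie in (0, 1), CBAM can only rescale the input feature map, never amplify it, which is visible in the shapes and magnitudes of the output.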

Anchor-Free Head
The anchor-free head is a paradigm in object detection algorithms that, unlike traditional anchor-based methods, does not require predefined sets of prior boxes (anchors). In the anchor-free method, the network directly predicts the position and size of the target. This reduces the number of design parameters, simplifies the model structure, and adapts better to targets of different scales and shapes. Additionally, it speeds up detection and training. The prediction algorithm first predicts offsets at each point of the feature map grid and then uses them to recover the center coordinates and bounding box of the target [27]. The expressions are as follows:

x_center = P_x + t_x (9)

y_center = P_y + t_y (10)

w = P_w e^{t_w} (11)

h = P_h e^{t_h} (12)

where x_center and y_center represent the predicted center coordinates of the target; w and h denote the predicted width and height of the bounding box; P_x, P_y, P_w, and P_h describe the reference point and size prior on the feature map; t_x and t_y represent the horizontal and vertical offsets of the predicted center coordinates relative to the anchor points on the feature map; and t_w and t_h denote the scaling factors for width and height. Note that t_x, t_y, t_w, and t_h are parameters learned automatically by the model during training. Score(z_i) is the confidence score for the i-th class, indicating the Intersection over Union (IoU) between the predicted bounding box and the ground truth box:

Score(z_i) = |β_{x,y} ∩ β̂_{x,y}| / |β_{x,y} ∪ β̂_{x,y}| (13)

where β_{x,y} is a tuple representing the ground truth target's corner coordinates (c_x, c_y), width w, and height h, and β̂_{x,y} is the corresponding tuple for the predicted result.
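A minimal decoding sketch for Equations (9)-(13) follows. This is our own illustration: it assumes a unit size prior P_w = P_h = stride, which is one common anchor-free convention, and boxes in (x_center, y_center, w, h) form.

```python
import numpy as np

def decode_box(grid_xy, t, stride=1.0):
    """Decode one prediction t = (tx, ty, tw, th) at anchor point grid_xy = (Px, Py).
    Offsets shift the center (Eqs. 9-10); exponentials scale a unit prior (Eqs. 11-12)."""
    tx, ty, tw, th = t
    px, py = grid_xy
    x_c = (px + tx) * stride
    y_c = (py + ty) * stride
    w = stride * np.exp(tw)   # w = Pw * e^{tw}, with Pw = stride assumed here
    h = stride * np.exp(th)
    return x_c, y_c, w, h

def iou(box_a, box_b):
    """Score of Eq. (13): IoU of two (xc, yc, w, h) boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))  # intersection height
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```

Decoding one prediction per grid cell in this way replaces the per-anchor regression of the anchor-based head, which is where the inference-speed saving comes from.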
During the classification process, the confidence scores are constrained to the range of (0,1) using the sigmoid function.When the confidence score of the predicted bounding box exceeds 0.5, it is considered to belong to a specific disease category.
In the process of bounding box prediction filtering, both score sorting and Non-Maximum Suppression (NMS) are crucial. After an NMS threshold is set, multiple high-score bounding boxes may still exist for each category, and YOLOv8-RCAA must select the bounding box with the highest probability as the final result. First, bounding boxes with confidence scores below the threshold are removed. Then, the remaining bounding boxes are sorted by score, NMS suppresses boxes that heavily overlap a higher-scoring box, and the box with the highest score is chosen as the final bounding box for the detection target.
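The filtering step can be sketched as a simplified greedy NMS in NumPy (our own illustration, with placeholder threshold values; boxes are in (x1, y1, x2, y2) corner form):

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5, score_thr=0.5):
    """boxes: (N, 4) array of (x1, y1, x2, y2); returns kept indices, best first."""
    idx = np.where(scores >= score_thr)[0]   # drop low-confidence boxes first
    order = idx[np.argsort(-scores[idx])]    # remaining boxes, highest score first
    kept = []
    while order.size > 0:
        i = order[0]
        kept.append(int(i))                  # keep the current best box
        if order.size == 1:
            break
        rest = order[1:]
        # IoU of box i against all remaining candidates (vectorized).
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        ious = inter / (area_i + area_r - inter)
        order = rest[ious < iou_thr]         # suppress heavy overlaps, keep the rest
    return kept
```

In practice this is run per category, so boxes of different disease classes never suppress each other.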

ATSS Positive and Negative Sample Allocation Strategy
Traditional anchor-free models often exhibit lower detection accuracy than anchor-based models after training, primarily due to the allocation strategy for positive and negative samples during training. The improved YOLOv8-RCAA model is an anchor-free model. Although the anchor-free head improves detection speed, it decreases tea leaf disease detection accuracy when a fixed ratio of positive and negative samples is still used. To address this issue, we employ an adaptive positive sample selection method called Adaptive Training Sample Selection (ATSS), replacing the fixed ratio of positive and negative samples. Specifically, the algorithm operates as follows:

1. For each output detection layer, compute the L2 distance between the center point of each anchor and the center point of the target, and select the K nearest anchors as candidate positive samples.

2. Calculate the Intersection over Union (IoU) between each candidate positive sample and the ground truth, and compute the mean and standard deviation of this set of IoUs.

3. Set the threshold for selecting positive samples as t = m + g, where m is the mean and g is the standard deviation.

4. Take as positive samples the candidates whose IoU is not less than t and whose centers lie inside the ground truth box; treat all remaining anchors as negative samples.
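The steps above can be sketched as follows. This is an illustrative single-level, single-ground-truth version of our own; the full ATSS algorithm gathers candidates from every detection layer and additionally requires candidate centers to lie inside the ground truth box.

```python
import numpy as np

def _iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) form."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    ua = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / ua if ua > 0 else 0.0

def atss_select(anchor_centers, anchor_boxes, gt_box, gt_center, k=9):
    """Pick positive anchors for one ground-truth box.
    anchor_centers: (N, 2); anchor_boxes: (N, 4) in corner form."""
    # Step 1: the k candidates nearest (L2) to the ground-truth center.
    d = np.linalg.norm(anchor_centers - np.asarray(gt_center), axis=1)
    cand = np.argsort(d)[:k]
    # Step 2: IoU of each candidate with the ground truth, plus mean and std.
    ious = np.array([_iou(anchor_boxes[i], gt_box) for i in cand])
    m, g = ious.mean(), ious.std()
    # Step 3: adaptive threshold t = m + g; candidates at or above it are positives.
    return cand[ious >= m + g]
```

Because the threshold adapts to the IoU statistics of each object, large and small targets each obtain a reasonable number of positive samples without any hand-tuned ratio.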

Experimental Design
As shown in Figure 8, the experimental process is systematically structured as follows: data collection, data preprocessing, model training and evaluation, comparative experiments, ablation study, visualization experiments, and result analysis. The purpose of the ablation study is to investigate in depth the impact of four technologies (RepVGG, CBAM, ATSS, and the anchor-free head) on the model's performance. By systematically removing or replacing these key components, the specific impact of each technology on the overall performance of the model can be observed, leading to a better understanding of their roles in enhancing the model's effectiveness. Two visualization experiments are conducted:
1. Confusion Matrix Visualization: This visualization evaluates and demonstrates the classification performance of YOLOv8-RCAA across different categories, providing an intuitive view of the model's accuracy, error distribution, and inter-category confusion.

2. Grad-CAM Heatmap Visualization: This visualization assesses and optimizes the feature extraction capability and detection accuracy of the YOLOv8-RCAA model by highlighting the specific areas of focus in leaf images.

Evaluation Metrics
In this study, the dataset is split into training, validation, and test sets at a ratio of 7:2:1 to facilitate model training, tuning, and evaluation. The validation set is utilized during training to evaluate model performance, aiding in the selection of optimal parameters and the adjustment of training strategies. The test set functions as an independent dataset for the final evaluation, allowing assessment of the model's generalization ability and practical effectiveness.
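The 7:2:1 split can be reproduced with a simple shuffled partition (a sketch of our own, with an arbitrary seed; the paper does not specify its splitting code):

```python
import random

def split_dataset(paths, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle file paths and split them into train/val/test at the given ratios."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)   # seeded shuffle for reproducibility
    n_train = int(len(paths) * ratios[0])
    n_val = int(len(paths) * ratios[1])
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])     # remainder goes to the test set
```

Giving the remainder to the test set ensures every image is assigned exactly once, even when the dataset size is not divisible by ten.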
Precision, recall, F1 score, and mean average precision (mAP) were selected as evaluation metrics for the tea leaf disease detection model. Regarding detection accuracy, precision, recall, and the precision-recall (P-R) curve were of interest. mAP indicates the average level of detection accuracy of the model for tea leaf diseases. Additionally, frames per second (FPS) was used to assess the real-time performance of the model, where a higher FPS value indicates better real-time performance. The formulas for calculating precision, recall, F1 score, and mAP are shown in Equations (14)-(17).
Precision = TP / (TP + FP) (14)

Recall = TP / (TP + FN) (15)

F1 = 2 × Precision × Recall / (Precision + Recall) (16)

mAP = (1/N) Σ_{i=1}^{N} ∫_0^1 P_i(r) dr (17)

where TP (True Positive) represents the number of detection boxes predicted by the model as a certain class of tea leaf disease and actually belonging to that class, FP (False Positive) represents the number of detection boxes predicted as a certain class of tea leaf disease but not actually belonging to that class, and FN (False Negative) represents the number of samples predicted as background but actually belonging to a certain class of tea leaf disease. In Equation (17), N is the number of classes, r is the integral variable (recall), and the integral gives the area under the precision-recall curve of each class over the interval from 0 to 1.
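For reference, Equations (14)-(17) translate directly into code (a sketch of our own; AP is computed here by trapezoidal integration over sampled precision-recall points, one common approximation of the integral):

```python
def precision(tp, fp):
    """Eq. (14): fraction of predicted boxes that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (15): fraction of ground-truth objects that are found."""
    return tp / (tp + fn)

def f1(p, r):
    """Eq. (16): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def average_precision(pr_points):
    """Eq. (17) for one class: area under the P-R curve, computed by
    trapezoidal integration over (recall, precision) pairs sorted by recall."""
    pts = sorted(pr_points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area
```

mAP is then simply the mean of `average_precision` over the six disease classes.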

Comparison Results of Different Models
We performed comparative experiments to validate the detection performance of the YOLOv8-RCAA model in identifying tea leaf diseases. The improved model was compared with several classic object detection models: SSD, Faster R-CNN, RetinaNet, and YOLOv8. The specific results are presented in Table 2.
As shown in Table 2, all models achieved basic detection for the six types of pests and diseases, with accuracies exceeding 91%. The YOLOv8-RCAA model achieved the best detection performance, with precision, recall, F1 score, and mean average precision (mAP) of 98.23%, 85.34%, 91.33%, and 98.14%, respectively, significantly outperforming the other models.
Among the tested models, SSD had the lowest detection performance. Faster-RCNN showed strong detection performance but had the highest number of parameters, the most floating-point operations, and the highest memory consumption, which significantly reduced its detection speed. The SSD and RetinaNet models, by contrast, had fewer parameters, fewer floating-point operations, and lower memory consumption, leading to relatively higher detection speeds. The YOLOv8 and YOLOv8-RCAA models achieved a better balance between performance and resource consumption, demonstrating relatively high detection performance while consuming fewer resources. The YOLOv8-RCAA model performed best in terms of detection speed, parameter count, floating-point operations, and memory consumption, with values of 0.035 s per frame, 0.81 M parameters, 27.23 M FLOPs, and 6.46 MB of memory. Therefore, considering both detection performance and speed, the YOLOv8-RCAA model is the best choice for detecting pests and diseases on tea leaves. Its outstanding performance may be attributed to the lightweight RepVGG backbone and the CBAM attention mechanism, which effectively enhance detection accuracy while reducing resource consumption.
We conducted comparative experiments to investigate the impact of different backbone networks on the performance of the YOLOv8 model in tea leaf disease detection. The backbone networks included CSPDarkNet53, ResNeXt, MobileNetv2, ShuffleNetv2, EfficientNetv2, and RepVGG. Notably, the YOLOv8-RCAA model is based on the YOLOv8 model, with its backbone changed from the original CSPDarkNet53 to RepVGG.
The experimental results are presented in Table 3. In the YOLOv8 architecture, replacing the backbone network with MobileNetv2, ShuffleNetv2, EfficientNetv2, or RepVGG significantly reduces the number of parameters, floating-point operations, and memory consumption. This improves detection speed and indicates the potential of these backbone networks for lightweight model architectures. However, replacing the backbone with MobileNetv2, ShuffleNetv2, or EfficientNetv2 increases detection speed but decreases detection accuracy compared to the original CSPDarkNet53 backbone: precision, recall, F1 score, and mean average precision (mAP) drop by 1.50% to 2.40%, 0.80% to 2.12%, 1.18% to 2.67%, and 0.87% to 2.93%, respectively. This suggests that lightweight backbones can come at the cost of accuracy. In contrast, the RepVGG backbone significantly reduces parameters, floating-point operations, and memory consumption while maintaining good detection performance; its precision, recall, F1 score, and mAP exceed those of the original CSPDarkNet53 backbone, reaching 96.11%, 83.83%, 96.01%, and 95.93%, respectively.
Therefore, the YOLOv8-RCAA model, which is based on the YOLOv8 model and uses the RepVGG backbone network, achieves the best performance and speed compared to other backbone networks.
We conducted comparative experiments to investigate the impact of different attention mechanisms on the performance of the YOLOv8 model in tea leaf disease detection. The attention mechanisms included CBAM, SE, CA, and ECA. Notably, the YOLOv8-RCAA model adopts the CBAM attention mechanism, building upon the YOLOv8 model. The experimental results are shown in Table 4. Incorporating these attention mechanisms into the YOLOv8 model enhanced its ability to extract disease and pest features, improving precision, recall, F1 score, and mean average precision (mAP). Comparatively, the ECA, CA, SE, and CBAM attention mechanisms increased precision, recall, F1 score, and mAP by 1.18% to 2.50%, 0.35% to 0.96%, 1.05% to 2.43%, and 0.92% to 2.37%, respectively. CBAM showed detection speed, parameter count, floating-point operations, and memory consumption similar to those of the other attention mechanisms, yet delivered the largest improvement in precision, recall, F1 score, and mAP.
Therefore, the YOLOv8-RCAA model, based on the YOLOv8 model and incorporating the CBAM attention mechanism, achieves the best performance and speed compared to other attention mechanisms.

Visualization of Analytical Results
We utilized a confusion matrix and Grad-CAM heatmaps to visually assess and analyze the YOLOv8-RCAA model's effectiveness in identifying tea leaf diseases. Figure 9 shows the confusion matrix of the model on the test set. In the confusion matrix, the diagonal elements represent correct detections and classifications by the model, while the off-diagonal elements represent instances where targets were either not detected or misclassified. The color intensity along the diagonal corresponds to the prediction accuracy for each class.
According to Figure 9, the confusion matrix of the YOLOv8-RCAA model shows the following results. Out of 146 images of Powdery Mildew, 143 were correctly detected, a detection precision of 97.95%. Among 179 images of Algal Spot, 4 were misidentified, yielding a precision of 97.77%. For 116 images of Red Blotch, 5 were misclassified, a precision of 96.52%. Among 143 images of Anthracnose, 4 were detected as Algal Spot and 4 as Red Blotch, leading to a precision of 94.41%. Out of a further 155 images, 3 were misclassified as Anthracnose, resulting in a precision of 98.06%. Among 140 images of Pest, 3 were misclassified, a precision of 97.86%. Additionally, among 142 images of Healthy Leaves, 4 were misclassified, yielding a precision of 97.18%. Overall, the model's average precision in this study is 97.11%.
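These per-class rates are the row-normalized diagonal entries of the confusion matrix. A sketch with a hypothetical two-class excerpt (the 2×2 counts below are illustrative, not the paper's full matrix):

```python
def per_class_rate(confusion, class_names):
    """Correctly detected images / total images of each class, i.e. the
    diagonal of the confusion matrix divided by its row sums.
    confusion: rows are true classes, columns are predicted classes."""
    rates = {}
    for i, name in enumerate(class_names):
        total = sum(confusion[i])
        rates[name] = confusion[i][i] / total if total else 0.0
    return rates

# hypothetical excerpt: 143/146 Powdery Mildew and 175/179 Algal Spot correct
rates = per_class_rate([[143, 3], [4, 175]], ["Powdery Mildew", "Algal Spot"])
```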
The mutual misclassification between Anthracnose and Red Blotch may be due to subtle differences in the early stages of these diseases. The training dataset might have lacked samples of these early stages, so the model may not have developed strong enough features to distinguish between the two diseases, leading to misclassification. In summary, although the YOLOv8-RCAA model does not achieve perfect detection, it accurately identifies the majority of disease instances and shows promising application prospects.

Commonly used model evaluation metrics may not intuitively depict the YOLOv8-RCAA model's attention to different regions in leaf images. To address this, we used the Grad-CAM heatmap technique to visualize the model's focus areas. Specifically, for the proposed YOLOv8-RCAA model, we selected images of diseases such as Powdery Mildew, Algal Spot, Red Blotch, and Anthracnose and visualized them using Grad-CAM heatmaps. The visualization results are presented in Figure 10.
The results indicate that the YOLOv8 model failed to accurately cover the diseased areas when detecting Powdery Mildew, Algal Spot, Red Blotch, and Anthracnose, whereas the proposed YOLOv8-RCAA model covered them accurately. For Red Blotch, both models could focus on the diseased areas, but the YOLOv8 model's attention to critical regions was less precise and included background regions. By integrating the CBAM attention mechanism, the YOLOv8-RCAA model reduced interference from complex environments and concentrated more effectively on the diseased areas. These results suggest that the proposed YOLOv8-RCAA model exhibits superior feature extraction capability and precise attention to critical regions in disease detection.
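The Grad-CAM heatmaps above follow a simple recipe: weight each channel of a chosen convolutional layer's activations by its spatially averaged gradient, sum over channels, and clip negatives. A framework-free sketch, with nested lists standing in for tensors (a minimal illustration, not the visualization code used in the paper):

```python
def grad_cam(activations, gradients):
    """Minimal Grad-CAM: channel weights are the global average of the
    class-score gradients; the heatmap is the ReLU of the weighted sum of
    activation channels. Both inputs are [C][H][W] nested lists taken
    from one convolutional layer."""
    C, H, W = len(activations), len(activations[0]), len(activations[0][0])
    # global-average-pool the gradients to get one weight per channel
    weights = [sum(v for row in gradients[c] for v in row) / (H * W)
               for c in range(C)]
    # weighted sum over channels, then ReLU to keep positive evidence only
    return [[max(0.0, sum(weights[c] * activations[c][y][x] for c in range(C)))
             for x in range(W)] for y in range(H)]
```

In a real pipeline the resulting map is upsampled to the input resolution and overlaid on the leaf image as a color heatmap.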

Ablation Study
Ablation experiments were conducted in which the enhanced modules were systematically removed to assess their impact on overall performance, validating their effectiveness for tea leaf disease detection with YOLOv8-RCAA. Specifically, we evaluated the RepVGG backbone network, the CBAM attention mechanism, the anchor-free head, and the ATSS algorithm. The baseline model uses the CSPDarkNet53 backbone network without an attention mechanism in the neck layer, employs an anchor-based head by default, and implements a fixed sampling strategy for allocating positive and negative samples. Detailed results of the ablation experiments are presented in Table 5.

The RepVGG backbone network significantly improved detection performance: precision, recall, F1 score, and mean average precision (mAP) increased by 2.1%, 1.38%, 2.26%, and 2.43%, respectively. Model complexity was also reduced, with parameters, floating-point operations, and memory consumption decreasing by 1.22 M, 43.45 M, and 23.67 MB, respectively, and the detection speed improving to 0.040 s per frame. Introducing the CBAM attention mechanism enhanced the model's feature extraction capability, improving precision, recall, F1 score, and mAP by 2.5%, 0.96%, 2.44%, and 2.37%, respectively, at the cost of a slight increase in parameters, floating-point operations, and memory consumption. Replacing the detection head with an anchor-free head raised the detection speed to 0.051 s per frame but slightly decreased detection performance, with precision, recall, F1 score, and mAP dropping by 0.7%, 0.58%, 0.7%, and 0.71%, respectively. Adopting the ATSS positive and negative sample allocation strategy improved detection accuracy, increasing precision, recall, F1 score, and mAP by 1.78%, 0.69%, 1.17%, and 1.06%, respectively, while detection speed, parameters, floating-point operations, and memory consumption remained essentially unchanged.
Under the YOLOv8 architecture, using the RepVGG backbone network, the CBAM attention mechanism, the anchor-free head, and the ATSS algorithm together improved the model's precision, recall, F1 score, and mean average precision (mAP) by 4.22%, 2.89%, 3.48%, and 4.64%, respectively, while significantly reducing parameters, floating-point operations, and memory consumption. Overall, the YOLOv8-RCAA model enhanced detection performance and reduced model complexity while improving detection speed. Specifically, the RepVGG backbone significantly improved detection performance and speed, the CBAM attention mechanism notably enhanced detection performance, the anchor-free head markedly improved detection speed, and the ATSS algorithm compensated for the performance decrease caused by the anchor-free head.

Innovations, Limitations, and Future Work
Previous studies have shown that deep learning techniques have made progress in the agricultural domain. However, their application to detecting diseases and pests in tea leaves remains relatively limited, and existing methods suffer from model complexity, slow speed, or low accuracy. For instance, Sun et al. [43] proposed an improved YOLOv4 model that utilized MobileNetv2 to reduce model parameters, but its average precision was only 93.85%, with a speed of 0.038 s per frame. Xia et al. [44] developed an improved lightweight YOLOv7 model with MobileNeXt and a dual-layer routing attention mechanism to reduce model size and increase detection speed, but its average precision was only 92.1%. Although these algorithms are suitable for the embedded devices used in agricultural machinery, their detection accuracy is too low to meet the actual requirements of tea disease detection.
To address these issues, this study proposed a detection method based on the YOLOv8-RCAA model for detecting six common and challenging diseases and pests in tea leaves. We replaced the backbone network with RepVGG and introduced the CBAM attention mechanism to enhance feature extraction, adopted the anchor-free head to improve detection speed, and employed the ATSS algorithm to compensate for the accompanying decrease in accuracy. Our approach achieved faster speed and more accurate localization than previous studies. Compared to models such as YOLOv8, SSD, Faster-RCNN, and RetinaNet, the YOLOv8-RCAA model demonstrated significant advantages, achieving a precision of 98.23%, recall of 85.34%, F1 score of 91.33%, and mean average precision (mAP) of 98.14%. Additionally, the model significantly reduced parameters, floating-point operations, and memory consumption, with a detection speed of 0.035 s per frame. These results demonstrate that the improvements effectively enhance detection accuracy while remaining applicable to the embedded devices used in agricultural machinery. The proposed YOLOv8-RCAA tea leaf disease detection method thus offers excellent performance and speed, along with strong generalization capability and robustness. It effectively reduces the cost of manual detection and control, provides practical technological support for agricultural intelligence, and is also applicable to disease and pest detection in other crops.
However, this study has several limitations. The model has so far been validated only under the climatic conditions of central China, a region dominated by a subtropical monsoon climate with hot, rainy summers, cold, dry winters, and distinct seasons. Validation in the northern, southern, western, and eastern regions of China has not yet been performed, and given the significant climatic differences across these regions, detection accuracy might decrease there. For instance, the model has not been extensively validated in large-scale tea plantations in the southern region, so its effectiveness in diverse and variable agricultural environments remains uncertain. Moreover, the actual tea production environment is dynamic and influenced by numerous unpredictable factors, such as severe leaf damage, low-light conditions on rainy days, and strong-light conditions on sunny days. In such complex real-world environments, further evaluation, optimization, and enhancement of the model's robustness are necessary.
In the future, we will expand the dataset, optimize the algorithm, and enhance feature extraction capabilities to better serve tea production. We plan to collect data from the northern, southern, western, and eastern regions of China to validate the YOLOv8-RCAA tea disease detection model and thereby enhance its robustness and generalization. Moreover, we will combine the YOLOv8-RCAA algorithm with UAV PID balance and path planning algorithms to achieve automatic identification and pesticide spraying with agricultural drones. We also plan to explore federated learning that combines YOLOv8 with the Internet of Things (IoT) for real-world tea disease detection, and to apply the YOLOv8-RCAA model to detecting diseases and pests in other crops, thereby driving the development of smart agriculture.

Conclusions
In this study, we developed a lightweight, high-performance YOLOv8-RCAA model based on the YOLOv8 framework for the detection of six common tea diseases. In the backbone layer, we replaced YOLOv8's CSPDarkNet53 backbone network with the RepVGG network and improved detection speed through re-parameterization. In the neck layer, we introduced the CBAM convolutional attention module into the FPN and PAN networks to compute attention along the channel and spatial dimensions, significantly enhancing feature extraction. In the head layer, we replaced the original anchor-based head with an anchor-free head to simplify the model structure and further improve detection speed. Additionally, we adopted the ATSS sample assignment strategy in place of the original fixed-ratio positive and negative sample assignment to further enhance detection accuracy.
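The re-parameterization step can be illustrated for a single-channel RepVGG block: the 1 × 1 branch is zero-padded to 3 × 3 and the identity branch becomes a 3 × 3 kernel with 1 at the centre, so the three training-time branches collapse into one 3 × 3 convolution at inference. This is an illustrative sketch (BatchNorm folding and multi-channel kernels are omitted), not the paper's code.

```python
def fuse_repvgg_branches(k3, k1, has_identity=True):
    """Merge a RepVGG block's three single-channel branches into one
    equivalent 3x3 kernel for fast inference."""
    merged = [row[:] for row in k3]     # copy the 3x3 branch
    merged[1][1] += k1                  # 1x1 kernel, zero-padded to the centre
    if has_identity:
        merged[1][1] += 1.0             # identity branch as a centred 1
    return merged

fused = fuse_repvgg_branches([[0.0, 0.0, 0.0],
                              [0.0, 2.0, 0.0],
                              [0.0, 0.0, 0.0]], 0.5)
```

Because convolution is linear, the fused kernel produces exactly the same output as summing the three branches, which is why the re-parameterized model is faster without losing accuracy.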
The experimental results indicate the following: (1) The YOLOv8-RCAA model exhibited the best detection performance, with precision, recall, F1 score, and mean average precision (mAP) reaching 98.23%, 85.34%, 91.33%, and 98.14%, respectively, significantly outperforming other conventional models. (2) The model achieved the fastest detection speed (0.035 s per frame), the fewest parameters (0.81 M), the lowest FLOPs (27.23 M), and the smallest memory consumption (6.46 MB). (3) The YOLOv8-RCAA model fully utilized computational resources, achieving the optimal balance between detection performance and speed; owing to its minimal parameters and FLOPs, it consumed the least computational resources and can be deployed on agricultural machinery equipped with embedded chips. These results demonstrate that the YOLOv8-RCAA model has significant advantages in tea disease detection, providing an efficient and practical solution for intelligent agricultural disease detection.

Figure 2.
Figure 2. Sample Images of Tea Leaf Diseases.
In these equations, M_C(F) represents the output of the CAM, σ represents the sigmoid function, AvgPool and MaxPool represent the average pooling and max pooling operations, F^c_avg and F^c_max represent the average-pooled and max-pooled features of the spatial dimensions, and f^{7×7} represents a convolution operation of size 7 × 7.

4. Using the threshold t for each layer, filter the truly needed positive samples out of the candidate positive samples and proceed with training. This adaptive positive-sample selection method effectively alleviates the issue of imbalanced allocation of positive and negative samples during training.

2.4. Experimental Procedure

2.4.1. Experimental Parameter Configuration

During the training process, this study employed the AdamW (adaptive momentum) optimizer with a momentum factor of 0.9 and an L2 regularization factor of 0.001. Training was conducted for 300 epochs, including 10 warm-up epochs to facilitate faster convergence. A batch size of 8 was used, and the model's convergence was evaluated by monitoring the loss value during training. Throughout training on the dataset, the model's performance and loss values were continuously monitored until the loss ceased to decrease and reached a steady state, indicating the completion of training. Through this methodology, a well-trained YOLOv8-RCAA object detection model was ultimately obtained, providing a reliable foundation for the subsequent detection of tea leaf diseases.
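Step 4's adaptive threshold follows the standard ATSS rule, t = mean + standard deviation of the candidate anchors' IoUs with the ground-truth box. A single-level sketch (the paper applies the rule per pyramid level, and the IoU values below are illustrative):

```python
import statistics

def atss_positive_samples(candidate_ious):
    """Select positives adaptively: the threshold t is the mean plus the
    standard deviation of the candidates' IoUs with the ground truth;
    candidates at or above t are kept as positive samples."""
    t = statistics.mean(candidate_ious) + statistics.pstdev(candidate_ious)
    positives = [i for i, iou in enumerate(candidate_ious) if iou >= t]
    return positives, t

positives, t = atss_positive_samples([0.12, 0.20, 0.31, 0.88])
```

Because t adapts to each object's IoU distribution, easy and hard objects each receive a reasonable number of positives, which is what alleviates the sample-imbalance problem described above.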

Figure 8.
Figure 8. Experimental design procedure. The comparative experiments include: 1. Comparison with the baseline and classical models: this step analyzes the performance of the YOLOv8-RCAA model against benchmark models and other classical models. 2. Comparison of different backbone networks: this step studies the impact of different backbone networks on the performance of the YOLOv8 model, specifically examining why RepVGG replaced the original backbone. 3. Comparison of different attention mechanisms: this step investigates the impact of different attention mechanisms on the performance of the YOLOv8 model and the rationale for choosing CBAM.

Funding:
This research was funded by the Fundamental Research Program of Shanxi Province (No. 202203021222175) and the Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi (No. 2022L086).

Table 1 .
Distribution of Tea Leaf Disease and Pest Data.

Table 2 .
Comparison Results of Different Models.

Table 3 .
Comparison Results of Different Backbones.

Table 4 .
Comparison Results of Different Attention Mechanisms.

Table 5 .
Results of Ablation Test.