PV-YOLO: Lightweight YOLO for Photovoltaic Panel Fault Detection

The rapid development of the photovoltaic industry in recent years has made efficient and accurate photovoltaic operation and maintenance a major focus of recent studies. The key to photovoltaic operation and maintenance is the accurate multifault identification of photovoltaic panel images collected using drones. In this paper, PV-YOLO is proposed, which replaces YOLOX's backbone network, CSPDarknet53, with the transformer-based PVTv2 network to obtain local connections between images and feature maps and extract more edge-detail features of similar faults. The CBAM attention mechanism is added to enhance the effective features and improve the detection accuracy of small objects. The label assignment mechanism is optimized, and the SIoU loss function is used to mitigate the uneven distribution of samples and accelerate network convergence. Experiments on the dataset show that this method is superior to existing technology: the highest mAP value is 92.56%, which is 10.46% higher than that of YOLOX, and the mAP is optimal at the same parameter magnitude, proving the model's effectiveness. Moreover, mAP is increased by over 10%, especially for small targets. In addition, we implemented a lightweight design for the model and propose four differently sized models suitable for different detection scenarios.


I. INTRODUCTION
Fossil energy is nonrenewable and poses great environmental challenges [1]. Thus, seeking renewable alternatives to fossil fuels is necessary [2]. The rapid development of the photovoltaic industry has made the efficient and accurate completion of photovoltaic operation and maintenance work a major focus of recent studies. During the power generation process of photovoltaic panels, a series of failures may occur. Common faults, such as foreign object blocking, cracking, and hot spot heating, greatly impact the power generation performance of photovoltaic panels and also introduce many hidden dangers. The operation and maintenance of traditional photovoltaic power plants are mainly based on the electrical characteristics of photovoltaic inverters, which
can only be accurate to the series connection; being accurate to specific components is difficult. Moreover, detection accuracy is greatly affected by the weather, so fault diagnosis accuracy is not high [3]. The operation and maintenance of specific photovoltaic modules rely on manual inspection. However, large-scale photovoltaic power plants have a complex distribution environment and a huge coverage area, and their layouts are cluttered and scattered owing to the terrain's influence. Manual inspection takes considerable time and effort; however, this pressure has been partially relieved by the rapid development of automated inspections based on drones. Recently, PV inspection using infrared (IR) cameras has become a common practice for observing PV hotspots [4]. IR cameras can detect different degrees of internal failure in photovoltaic modules, which is imperative for improving inspection efficiency and reducing operation and maintenance costs. The rapid development of machine learning and deep learning technology in recent years has provided new ideas for the accurate diagnosis and precise positioning of photovoltaic faults. Furthermore, computer vision technology combines image processing and deep learning and performs excellently in classification, detection, and segmentation tasks. Applying it to photovoltaic operation and maintenance will greatly improve the intelligence level of this field [5].
Traditional photovoltaic fault detection methods use line and corner detection, Canny and Sobel edge detection operators, mathematical morphological operations, and the scale-invariant feature transform, among others, to extract fault edge information and roughly detect faults such as light spots and foreign objects. Although these methods partially solve the problem of photovoltaic panel fault detection, the features they rely on must be designed manually, automatically extracting identification information is difficult, the detected fault types are relatively limited, and achieving the accuracy needed to meet basic industrial requirements is difficult. With the development of convolutional neural networks (CNNs) and deep learning in image processing, some photovoltaic panel detection methods based on deep learning have achieved good detection results. Fei et al. [6] used a CNN for fault diagnosis of IR images of photovoltaic components; their experiments show that the CNN detects several faults, such as cracks and missing corners, effectively. Menghao and Hongwei [7] combined the Faster R-CNN algorithm with transfer learning, image preprocessing, and other methods to diagnose faults in IR images of photovoltaic modules; however, the model was not sufficiently optimized, and its large number of parameters made it difficult to deploy. Kun [8] combined photovoltaic system operating parameters, such as the voltage U and current I, with the residual network ResNet to identify photovoltaic panel images with different degrees of dust accumulation, reaching an identification accuracy of 81%. Jinchao [9] used the voltage and current results of co-simulation as the fault dataset for photovoltaic modules under standard conditions and used SSA-DBN for fault diagnosis, with an average diagnosis accuracy as high as 97.71%; for photovoltaic modules under natural conditions, an improved DRN was selected for fault diagnosis, and the accuracy of the improved model reached 98.86%. Jianyu et al. [10] established a PVT-YOLOv5 model based on the one-stage target detection algorithm YOLOv5; the model reaches a test accuracy of 92.02% on an abnormal-occlusion dataset of photovoltaic modules. Winston et al. [11] used a feedforward backpropagation neural network and support vector machine techniques to detect and classify hot spots and microcracks in photovoltaic modules, with average accuracy rates of 87% and 99%, respectively. These deep learning-based detection methods have made great progress in accuracy compared with traditional image-processing methods. Nevertheless, their detection robustness under different weather conditions is poor, detecting the faults of some small targets is difficult, only a few fault types are detected, and most studies optimize the detection of a single fault type, which is not yet mature enough for industrial application. To further improve detection accuracy, higher-complexity target detection models such as FCOS [12], Faster R-CNN [13], and Cascade R-CNN [14] have emerged, which achieve great improvements in mAP and perform better on small target detection. However, these models have large numbers of parameters, and taking one of them as the research baseline would increase the difficulty of UAV deployment in practical applications.
Considering this practical constraint, YOLOX, a single-stage target detection model with fewer parameters and good accuracy, is more suitable as the research baseline.
Visible-light cameras can only detect visible faults, such as occlusion and glass breakage, whereas IR cameras can find faults caused by internal defects in photovoltaic modules. IR cameras can also judge the severity of a fault based on the temperature information at the time of shooting and can distinguish multiple failure types. However, there are three technical difficulties. First, the IR images of photovoltaic panels collected in different weather conditions and periods differ; the same fault may appear with different color and shape characteristics, making detection difficult. Second, different spots, owing to their different formation mechanisms, show only subtle differences in brightness, size, and distribution, easily leading to false detection. Third, fault detection accuracy for small targets, such as bird droppings, is poor. Consequently, in this paper, we propose a lightweight network, PV-YOLO, for photovoltaic panel fault detection.
The primary contributions of this study are as follows:
1. Reduction of the computational complexity of the network: The improved YOLOX target detection algorithm is used to detect the photovoltaic module area, and the transformer-based PVTv2 network replaces the CNN-based Focus + CSPDarknet53 network as the YOLOX feature extraction network to capture global information and rich context information.
2. Addition of the CBAM attention mechanism module: This module applies attention over both space and channel, extracting attention regions with CBAM to reduce the interference of background information during detection.
3. Optimization of the computational details of SimOTA's label assignment strategy: A weighted sum of the varifocal loss and GIoU loss is used to compute a cost matrix that improves accuracy without compromising efficiency.
4. Use of the SIoU loss function: The model concentrates on samples that are difficult to classify during training, which promotes rapid convergence and further improves regression accuracy.
The paper is structured as follows: After the introduction, Section II introduces the proposed PV-YOLO, including the PVTv2 backbone feature extraction network, CBAM attention module, SimOTA label assignment strategy optimization, and SIoU loss function. Section III discusses the experiment results. The conclusion is presented in Section IV.

II. PROPOSED PV-YOLO ARCHITECTURE
A. PV-YOLO STRUCTURE
YOLOX [15] is a single-stage object detection model and a breakthrough in the YOLO series. YOLOX switches the detection head to an anchor-free method and adopts other advanced detection techniques, namely a decoupled head and the leading label assignment strategy SimOTA, which significantly reduce the number of model parameters and surpass YOLOv5 in accuracy on the COCO dataset. Furthermore, YOLOX provides the lightweight network YOLOX-Tiny for real-time detection and YOLOX-Nano with 0.91M parameters, which are 10% and 1.8% AP higher than the corresponding YOLOv4-Tiny and NanoDet, respectively. Although YOLOX has made great progress, YOLO models based on the CNN structure lose a lot of valuable information, and the association between whole features and part features is ignored, which adversely affects the detection of similar faults and small target faults on photovoltaic panels. Specifically, the real labels in the red box in Fig. 1 are all ''dust,'' which is very similar to ''hot spot'' (a small target) in shape, color, and size; distinguishing them is a similar-target detection problem. The difference between the two is that ''dust'' is generally distributed at the edge of the panel. Therefore, when judging its category, the information around the point and the overall feature distribution of the panel must be combined to determine whether it belongs to ''dust.'' The backbone feature extraction networks of the YOLO series are designed based on CNNs. The convolution operation is good at extracting local information at low implementation cost, but it cannot capture global relationships or self-attention. The receptive field of a CNN is the size of the region in the original image mapped by a pixel on the feature map of a given layer; that is, it is the region of the original image that affects a pixel in a high-level feature map, as shown in Fig. 2. In addition, for a CNN feature, the pixels in the receptive field are not equally important: the closer a pixel is to the center of the receptive field, the greater its role in computing the output feature. The effective receptive field (ERF) has been proposed to measure the region of the input image that can influence the activation of a neuron. Luo et al. [16] examined this region by backpropagating from the central pixel and computing the partial derivative with respect to the input image. By studying a series of convolutional networks, they found that the effective receptive field is usually much smaller than its theoretical counterpart. Such local effective receptive fields cause some loss of receptive-field edge information during feature extraction, lack the ability to represent features holistically, are not conducive to the detection of similar targets, and may cause ''dust'' to be mistakenly detected as ''hot spot.''
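To make the ERF idea concrete, the following minimal sketch follows the measurement of Luo et al. [16] in spirit: backpropagate a unit gradient from the central output pixel of a small CNN and inspect how the input gradient decays away from the center. The five-layer network is hypothetical and only for illustration.

```python
import torch
import torch.nn as nn

# A small stack of 3x3 convolutions (hypothetical network, for illustration).
net = nn.Sequential(*[nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
                      for _ in range(5)])

x = torch.randn(1, 1, 64, 64, requires_grad=True)
y = net(x)

# Backpropagate a unit gradient from the central output pixel only,
# as in the ERF measurement of Luo et al. [16].
grad_seed = torch.zeros_like(y)
grad_seed[0, 0, y.shape[2] // 2, y.shape[3] // 2] = 1.0
y.backward(grad_seed)

erf = x.grad[0, 0].abs()
# Pixels near the center dominate; gradients vanish outside the 11x11
# theoretical receptive field of five 3x3 convolutions.
print(erf[32, 32].item(), erf[32, 27].item(), erf[32, 22].item())
```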
In recent years, transformers have developed rapidly in natural language processing. The self-attention mechanism allows the attention distribution to be inspected, resulting in a highly interpretable model, and each attention head can learn to perform a different task. The pyramid vision transformer (PVT) is a pure transformer backbone used as an alternative to CNN backbones in many downstream tasks, including image-level and pixel-level dense prediction [17]. PVT overcomes the difficulties of applying traditional transformers to dense prediction: it takes fine-grained image patches of 4 × 4 pixels as high-resolution input, making the structure suitable for dense prediction; as the network deepens, a progressive shrinking pyramid reduces the sequence length of the transformer, which significantly reduces the computational cost; and a spatial-reduction attention (SRA) layer is adopted to further reduce resource consumption when learning high-resolution features. In this paper, we use the improved PVTv2 [18]. As shown in Fig. 3, the model is divided into four stages, each consisting of a patch embedding layer and an Li-layer transformer encoder. The output resolution of the four stages is gradually reduced from high (stride 4) to low (stride 32) following the pyramid structure. The improved PVT is designed with overlapping patch embedding, a convolutional feedforward network, and a linear-complexity attention layer, so that PVTv2 obtains the local continuity of images and feature maps, handles variable-resolution input flexibly, and enjoys the same linear complexity as CNNs.
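A minimal sketch of the spatial-reduction attention idea is given below, assuming the PVT-style formulation in which keys and values are computed on a feature map downsampled by a strided convolution before multi-head attention; the dimensions and reduction ratio are illustrative, not the exact PVTv2 configuration.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """PVT-style attention: K and V come from a spatially reduced feature
    map, shrinking the attention cost on high-resolution inputs."""
    def __init__(self, dim=64, num_heads=2, sr_ratio=4):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, N, C) token sequence with N = h * w
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.sr(kv)                      # reduce spatial size by sr_ratio
        kv = kv.flatten(2).transpose(1, 2)    # back to (B, N/sr_ratio^2, C)
        kv = self.norm(kv)
        out, _ = self.attn(query=x, key=kv, value=kv)
        return out

tokens = torch.randn(1, 52 * 52, 64)          # a stage-3-like resolution
y = SpatialReductionAttention()(tokens, 52, 52)
print(y.shape)                                # torch.Size([1, 2704, 64])
```

With sr_ratio = 4, the 2704 query tokens attend over only 169 key/value tokens, which is where the computational saving comes from.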
Based on the above, we propose PV-YOLO (Fig. 4) to combine the advantages of the transformer and the CNN. The backbone feature extraction network of the original YOLOX, CSPDarknet53, is replaced by the PVTv2 network based on the transformer structure for multiscale feature extraction. Taking I ∈ R^(H×W×3) as the input image, a set of feature maps with strides {4, 8, 16, 32} is extracted after patch embedding and encoding are conducted four times. The effective feature layers X1, X2, X3, and X4 are passed into the FPN + PAN feature pyramid through the CBAM attention mechanism module for feature fusion, and the fused information is passed into the decoupled detection head, which implements the classification and regression tasks separately. The head structure is shown in Fig. 5. Taking a 416 × 416 × 3 input as an example, the number of detection heads is expanded from three to four to form four detection branches of 13 × 13, 26 × 26, 52 × 52, and 104 × 104, making full use of shallow feature information. For each feature layer, three prediction results are obtained: Reg(h, w, 4) gives the regression parameters of each feature point, from which the prediction box is obtained after adjustment [19]; Obj(h, w, 1) judges whether each feature point contains an object; and Cls(h, w, num_classes) determines the type of object contained in each feature point. The three prediction results are stacked to give Out(h, w, 4 + 1 + num_classes) for each feature layer: the first four parameters are the regression parameters of each feature point, the fifth parameter indicates whether the feature point contains an object, and the last num_classes parameters give the object type. The SIoU loss function is used to calculate the loss. The PVT is suitable not only for traditional single-scale image tasks but also for multiscale image feature extraction and can accurately identify fragmentation, hot spots, and plant occlusions of different shapes and sizes. Moreover, the PVT has fewer parameters than the traditional YOLOX feature extraction layer. Combining PVT and YOLOX gives full play to their respective advantages and yields better performance in both speed and accuracy.
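A minimal sketch of how the three decoupled-head outputs could be stacked per feature layer is shown below, assuming channel-last tensors; the function name and channel widths are illustrative, not the exact PV-YOLO implementation.

```python
import torch

def stack_head_outputs(reg, obj, cls):
    """Stack decoupled head outputs into Out(h, w, 4 + 1 + num_classes).

    reg: (h, w, 4)  box regression parameters per feature point
    obj: (h, w, 1)  objectness per feature point
    cls: (h, w, C)  class scores per feature point
    """
    return torch.cat([reg, obj, cls], dim=-1)

# Example for the 13 x 13 branch with 6 fault classes.
h = w = 13
num_classes = 6
out = stack_head_outputs(torch.randn(h, w, 4),
                         torch.randn(h, w, 1),
                         torch.randn(h, w, num_classes))
print(out.shape)  # torch.Size([13, 13, 11])
```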

B. CBAM
CBAM is a lightweight feedforward CNN attention module (Fig. 6). This module sequentially estimates attention maps along two independent dimensions, channel and spatial, and multiplies the input feature map by the attention maps to perform adaptive feature refinement [20].
Given an intermediate feature map F ∈ R^(C×H×W) as input, the operation of CBAM is divided into two parts: channel and spatial (Figs. 7(a) and 7(b)). First, global max pooling and global average pooling are performed on the input along the spatial dimensions. The two pooled one-dimensional vectors are sent through a shared fully connected layer, and a one-dimensional channel attention map M_C ∈ R^(C×1×1) is generated after summing. The channel attention is multiplied with the input F to obtain the adjusted feature map F′. Second, global max pooling and average pooling are performed along the channel axis; the two resulting two-dimensional maps are concatenated and passed through a convolution operation, generating a two-dimensional spatial attention map M_S ∈ R^(1×H×W), which is multiplied element-wise with F′. The CBAM attention process can thus be described as F′ = M_C(F) ⊗ F and F″ = M_S(F′) ⊗ F′, where ⊗ denotes element-wise multiplication. The extracted feature maps of different scales use the CBAM attention mechanism module to enhance effective features and suppress invalid features by reweighting the original feature channels. This enables the model to pay more attention to local analysis, focus on the texture differences between features, and improve detection accuracy. Moreover, when CBAM is applied to the IR image detection of photovoltaic panels, the model can quickly focus on fragmentation, hot spots, and plant occlusion. Since hot spots and some kinds of dust have similar characteristics in the IR state, with only subtle differences in shape and brightness, errors are easily caused. The addition of CBAM can enrich the features, learn the feature differences between the two fault types, and partially suppress the influence of negative samples.
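The following is a minimal CBAM sketch following the formulation above; the channel width and reduction ratio are illustrative choices, not the exact configuration used in PV-YOLO.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP applied to both max-pooled and average-pooled vectors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)        # M_C in R^(C x 1 x 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx = torch.amax(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_S in R^(1 x H x W)

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        x = self.ca(x) * x      # F' = M_C(F) * F
        return self.sa(x) * x   # F'' = M_S(F') * F'

f = torch.randn(1, 64, 52, 52)
print(CBAM(64)(f).shape)        # torch.Size([1, 64, 52, 52])
```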

C. LABEL ASSIGNMENT STRATEGY
The assignment of positive and negative sample labels has a crucial impact on object detectors, and most object detectors use a fixed label assignment strategy [21]. YOLOv4 and YOLOv5 select the ground-truth center point and its adjacent anchor positions as positive samples, and this label assignment strategy is invariant throughout training. SimOTA is a dynamic label assignment strategy that continuously changes with the training process and achieves good results in YOLOX.
SimOTA first determines the candidate region through the center prior, then calculates the IoU between the predicted boxes and the ground-truth boxes in the candidate region, and finally obtains the parameter k for each ground-truth box by summing its n largest IoUs. The cost matrix is obtained by directly computing the loss between all predicted boxes and the ground-truth boxes in the candidate regions. The k anchors with the minimum cost are chosen and designated as positive samples for each ground truth. The original SimOTA uses the weighted sum of the CE loss and IoU loss to compute the cost matrix. In this paper, we instead use the weighted sum of the varifocal loss [22] and GIoU loss as the cost matrix to align the cost in SimOTA with the objective function. The weight of the GIoU loss is λ, which is set to 6. The cost matrix is therefore Cost_ij = L_cls^VFL(i, j) + λ · L_GIoU(i, j).
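A minimal sketch of this modified matching step is given below, assuming the per-pair classification and regression losses have already been computed; vfl_cost and giou_cost are hypothetical placeholders for varifocal-loss and GIoU-loss matrices of shape (num_gt, num_candidates), and per-anchor conflict resolution is omitted.

```python
import torch

def simota_cost_matrix(vfl_cost, giou_cost, lam=6.0):
    """Weighted-sum cost used for matching in this sketch:
    cost = varifocal classification cost + lam * GIoU regression cost."""
    return vfl_cost + lam * giou_cost

def assign_positives(cost, ious, n=10):
    """Dynamic-k assignment: k for each ground truth is the (clamped) sum
    of its n largest IoUs; the k lowest-cost candidates become positives."""
    num_gt, num_cand = cost.shape
    topk_ious, _ = ious.topk(min(n, num_cand), dim=1)
    dynamic_k = topk_ious.sum(dim=1).int().clamp(min=1)
    mask = torch.zeros_like(cost, dtype=torch.bool)
    for g in range(num_gt):
        _, idx = cost[g].topk(int(dynamic_k[g]), largest=False)
        mask[g, idx] = True
    return mask

cost = simota_cost_matrix(torch.rand(3, 20), torch.rand(3, 20))
print(assign_positives(cost, torch.rand(3, 20)).sum())
```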

D. LOSS FUNCTION
The training of a target detection network generally needs to define at least two loss functions: classification loss and bounding box regression loss. The definition of loss function often greatly impacts detection accuracy and training speed. In recent years, commonly used bounding box regression losses include IoU, GIoU, CIoU, and DIoU loss. These loss functions measure the difference between the two boxes by considering factors such as the degree of overlap between the predicted and target boxes, the distance between the center points, and the aspect ratio. However, these methods do not consider the matching of the direction between the prediction and target boxes, which leads to slow convergence and low efficiency. Additionally, the ratio distribution of positive and negative samples and difficult samples in the PV panel failure dataset is extremely uneven, which greatly impacts the detection accuracy.
In this paper, we adopt the advanced SIoU loss function [23], which comprises a classification loss function and a regression loss function. The regression loss function considers the vector angle between the required regressions and redefines the distance loss, which effectively reduces the degrees of freedom of the regression, accelerates network convergence, and further improves regression accuracy. The classification loss uses the focal loss, so the model concentrates on samples that are hard to distinguish during training, and the convergence speed is accelerated.
The SIoU loss function comprises four cost functions: angle, distance, shape, and IoU costs. The angle cost is defined by the angles between the line connecting the centers of the predicted and real boxes and the coordinate axes: the angle with the x-axis is α, and the angle with the y-axis is β. Minimizing α or β along the x- or y-axis brings the center of the prediction box close to that of the real box. The distance cost supplements the angle loss to further bring the two box centers together; the closer α is to π/4, the greater the contribution of the distance loss. The shape cost is responsible for the convergence of the width and height of the predicted box, and the IoU cost is responsible for the coincidence of the predicted and real box areas.
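For reference, the cost terms of the SIoU loss as defined in [23] can be summarized as follows; here σ is the distance between the box centers, c_h is their vertical offset, ρ_x and ρ_y are the center distances normalized by the enclosing box, ω_w and ω_h are normalized width/height differences, and θ controls the attention paid to the shape cost.

```latex
\begin{align}
\Lambda &= 1 - 2\sin^2\!\left(\arcsin\frac{c_h}{\sigma} - \frac{\pi}{4}\right)
    && \text{(angle cost)}\\
\Delta &= \sum_{t \in \{x,\,y\}} \left(1 - e^{-(2-\Lambda)\,\rho_t}\right)
    && \text{(distance cost)}\\
\Omega &= \sum_{t \in \{w,\,h\}} \left(1 - e^{-\omega_t}\right)^{\theta}
    && \text{(shape cost)}\\
L_{box} &= 1 - \mathrm{IoU} + \frac{\Delta + \Omega}{2}
    && \text{(regression loss)}
\end{align}
```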
Finally, the regression loss function is L_box = 1 − IoU + (Δ + Ω)/2, and the total loss function is L = W_box · L_box + W_cls · L_cls. Here, Δ represents the distance loss, which incorporates the angle cost, Ω is the shape loss, L_box is the bounding box regression loss, L_cls is the classification loss (here, the focal loss), and W_box and W_cls are the box and classification loss weights, respectively [24].

III. EXPERIMENTATION
A. PHOTOVOLTAIC PANEL IR FAULT DATASET
In this paper, we adopted photovoltaic panel IR images collected by the research group with drones: about 8900 IR images collected from 13 power stations. Based on this unpublished photovoltaic panel IR fault dataset, the images were divided into obvious faults and small targets, and the two types of faults were manually marked using the LabelImg tool. There are five types of obvious faults: battery string, hot spot, dust, fracture, and plant. The small target dataset contains one fault type, occlusion, which consists mainly of bird droppings; its target pixels are extremely small, making it a typical small target detection problem. It was marked and trained separately to verify the effectiveness of the algorithm for small target detection. Fig. 8 shows the characteristics of the above six kinds of faults, with Figs. 8(a) to 8(f) corresponding to the fault types in order. (a) ''Battery string'' generally appears as a bright rectangular light strip accounting for 1/3 or 2/3 of the area of a single photovoltaic panel. (b) ''Hot spot'': the hot spots in the first row are large and bright with smooth edges, while those in the second row are small and dark with sharp edges, usually square. (c) ''Dust'': the first five images represent the majority of this category, characterized by uneven colors and blurred edges, mostly appearing in dusty desert or Gobi regions; the last three show dust similar to the hot spots mentioned above, which generally appears in mountainous areas with less dust. (d) ''Fracture'' manifests as a large number of densely distributed, irregular hot spots with obvious characteristics. (e) ''Plant'': owing to the irregular growth of vegetation, sometimes only a single photovoltaic panel is shaded and sometimes multiple panels are shaded. (f) ''Occlusion'' is bird droppings in most cases; it is a typical small target, divided into spots (the first four images) and lines (the last four images) owing to differences in climate and bird species across areas. Considering the correspondence between labels and data, the two datasets were randomly divided into training, validation, and test sets in the ratio 70%, 20%, and 10% to ensure a uniform distribution. To keep the experiment environment consistent, the final dataset is stored in the PASCAL VOC format. Furthermore, positive samples with unclear pixel regions were not labeled to prevent overfitting of the neural network [25]. The dataset involved in training and testing is shown in Fig. 9, and the labeling statistics are shown in Table 1. Fig. 8 only shows fault descriptions of single panels cropped from the images and did not participate in the training process. Fig. 9(a) was taken under insufficient light; the temperature difference between the photovoltaic panels and the ground surface was small, and after the color adjustment of the IR camera, the panels appeared orange. Fig. 9(b) was taken under sufficient light; here, the temperature difference between the panels and the ground was large, and the panels appeared purple after the color adjustment of the IR camera. Our model was trained on the combination of these two datasets of typical weather conditions.
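A minimal sketch of the 70/20/10 split described above is given below, assuming a flat directory of PASCAL VOC annotation files; the directory layout and seed are illustrative.

```python
import random
from pathlib import Path

def split_voc_dataset(ann_dir, seed=0, ratios=(0.7, 0.2, 0.1)):
    """Randomly split VOC annotation stems into train/val/test lists."""
    stems = sorted(p.stem for p in Path(ann_dir).glob("*.xml"))
    random.Random(seed).shuffle(stems)
    n = len(stems)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return (stems[:n_train],
            stems[n_train:n_train + n_val],
            stems[n_train + n_val:])

# Hypothetical layout: VOCdevkit/VOC2007/Annotations/*.xml
train, val, test = split_voc_dataset("VOCdevkit/VOC2007/Annotations")
for name, subset in [("train", train), ("val", val), ("test", test)]:
    Path(f"{name}.txt").write_text("\n".join(subset))
```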

B. TRAINING PROCESS
PVT needs to be pretrained before the target detection task. The pretrained model is trained on a large dataset to obtain strong generalization ability and serves as the benchmark model for the subsequent task; a good recognition effect can then be achieved by fine-tuning with a small amount of labeled data. The parameters of the PVT model come from the model pretrained on ImageNet. Before training, we use the ImageNet pretrained weights to initialize the backbone and Xavier initialization for the newly added layers. When training the detection model, no layers in the PVT are frozen; that is, all weights participate in the updates.
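A minimal sketch of this initialization scheme follows: load the pretrained backbone weights non-strictly, then Xavier-initialize the remaining layers. The module names (e.g., model.backbone) are hypothetical placeholders, not the actual PV-YOLO code.

```python
import torch
import torch.nn as nn

def init_model(model, backbone_ckpt):
    """Initialize the backbone from ImageNet weights, new layers with Xavier."""
    state = torch.load(backbone_ckpt, map_location="cpu")
    # Non-strict load: only keys present in the pretrained backbone are filled.
    model.backbone.load_state_dict(state, strict=False)

    # Xavier-initialize everything outside the backbone (neck, heads, CBAM).
    for name, m in model.named_modules():
        if name.startswith("backbone"):
            continue
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
    # All parameters stay trainable: nothing in the backbone is frozen.
```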
The graphics card used in this experiment was an NVIDIA GeForce RTX 3050 Ti, and the PyTorch version was 1.9.0. To ensure fair and reliable experimental data, the Adam optimizer was used for training, with a batch size of 8 and 100 epochs. Notably, the DETR-based models could only be run with a batch size of at most 2 on the 3050 Ti; to ensure fair and efficient training, gradient accumulation was adopted, expanding the effective batch size to 8 by setting the number of accumulation steps to 4. Considering that PVT was trained with a batch size of 16, a 5-epoch warmup, an initial learning rate of 1e-6, and a post-warmup learning rate of 1e-4, we reduced the learning rate for our batch size of 8 and set it to 5e-5 according to the usual rule of thumb. To further stabilize the training process, a gradual warmup strategy was also adopted to scale the learning rate [26], and after the warmup, the cosine annealing algorithm was used to adjust the learning rate in real time.
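A minimal sketch of the gradient-accumulation loop and the warmup-plus-cosine schedule described above follows; the model, loader, and loss are trivial stand-ins so the sketch runs, and should be replaced by PV-YOLO, its data loader, and its loss.

```python
import math
import torch
import torch.nn as nn

# Stand-ins so the sketch runs; replace with PV-YOLO, its loader, and loss.
model = nn.Linear(8, 1)
loader = [(torch.randn(2, 8), torch.randn(2, 1)) for _ in range(8)]
compute_loss = nn.MSELoss()

accum_steps, base_lr, warmup_epochs, epochs = 4, 5e-5, 5, 100
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

def lr_at(epoch):
    """Gradual warmup from 1e-6 to base_lr, then cosine annealing [26]."""
    if epoch < warmup_epochs:
        return 1e-6 + (base_lr - 1e-6) * epoch / warmup_epochs
    t = (epoch - warmup_epochs) / (epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * t))

for epoch in range(epochs):
    for g in optimizer.param_groups:
        g["lr"] = lr_at(epoch)
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(loader):
        # Batch size 2 on the GPU, effective batch 8 via 4 accumulation steps.
        loss = compute_loss(model(images), targets) / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```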

C. EVALUATION INDICATORS
To verify the performance of the algorithm, we conducted a series of experiments on test images using the trained PV-YOLO model. The IoU threshold between the ground-truth and predicted boxes was set to 0.5: if the IoU was greater than 0.5, the target location was considered correctly predicted. The average precision AP of each target type and the mean average precision mAP were used to compare the detection accuracy of the different algorithms, and GFLOPs were used to measure model complexity. Additionally, we compared the parameter counts of the models. The indicators are calculated as follows: precision = TP / (TP + FP), recall = TP / (TP + FN), AP = ∫₀¹ precision(recall) d(recall), and mAP = (1/N) Σᵢ APᵢ. In the formulas, precision represents the detection accuracy of a class, recall represents the recall rate, and N represents the number of detected object classes. TP, FP, TN, and FN stand for true positive, false positive, true negative, and false negative, respectively.
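For concreteness, a standard all-point interpolated AP computation, as used in PASCAL VOC-style evaluation, can be sketched as follows; the toy precision/recall values are illustrative only.

```python
import numpy as np

def voc_ap(recall, precision):
    """All-point interpolated AP: area under the precision-recall curve."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically decreasing, then sum rectangle areas.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Toy curve: three detections, two of which are true positives.
print(voc_ap(np.array([0.5, 0.5, 1.0]), np.array([1.0, 0.5, 0.66])))
```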

1) COMPARISON WITH OTHER ALGORITHMS
To verify the advanced nature of the proposed detection model, its AP, mAP, parameter count, and model complexity were compared with those of the EfficientDet-D0, EfficientDet-D1, EfficientDet-D2, and EfficientDet-D3 algorithms [27] and the optimal models of the YOLO series. In addition, since DETR [28] requires 10 to 20 times more training epochs than modern mainstream detectors to converge, the training epochs are listed in Table 2 so that the converged DETR can be compared with other DETR-based models [29], [30], [31]. The experiment results are shown in Table 2. The results show that the overall mAP of our model was better than that of the YOLOX model using CSPDarknet53 as the backbone feature extraction network, exceeding the ''S,'' ''M,'' ''L,'' and ''X'' models by 1.97%, 4.84%, 6.14%, and 8.36%, respectively; the mAP improvement grew with the number of model parameters. The mAP performance of the EfficientDet networks was poor, with a maximum of 78%, which cannot meet industrial accuracy requirements. The convergence of DETR is very slow: it needs 500 epochs, and its mAP is only 75.3%. However, the improved DETR-based algorithms achieve better mAP; among them, Deformable DETR and DAB-DETR perform similarly, reaching 82.4% and 82.1%, respectively, close to PV-YOLO-M. Moreover, the overall accuracy on the fracture fault is poor, which is caused by the uneven distribution of the samples: the total number of labels of this type was 658, only 24.5% of the average number of labels per class. In the SIoU loss function, the focal loss was used as the classification loss and partially alleviated the problem of uneven sample distribution, increasing the AP of this class by about 10% on average across the four model sizes. Comparing the network parameters, those of our model were slightly larger than those of YOLOX, by 4.2M, 4.8M, 6.6M, and 3.5M for the ''S,'' ''M,'' ''L,'' and ''X'' models, respectively, and the relative gap decreased as the model size grew: the parameters at the ''S'' magnitude increased by 46.7%, but those at the ''X'' magnitude increased by only 3.5%. The Deformable DETR and DAB-DETR models, the most accurate DETR-based algorithms, have 40M and 44M parameters, respectively; compared with PV-YOLO-M, their mAP is 2.64% and 2.94% lower, while their parameters are 33% and 46% higher. As shown in Fig. 10, although the parameters of our model are slightly larger than those of some other algorithms, it still achieves the optimal mAP at the same parameter magnitude. Compared with the DETR-based algorithms, PV-YOLO achieves better mAP with fewer parameters.
To clearly show the detection effect of this model and the YOLOX and DAB-DETR models, Fig. 11 shows the comparison of the detection effects of these three models.
The missed and false detection rates of YOLOX-M are higher than those of the other two algorithms. For example, obvious hot spots in the first row are not detected by YOLOX-M, but both DAB-DETR and our model detect them. In certain weather conditions, battery string failures may have IR signatures similar to dust, which can easily lead to false detections. Among the three algorithms, the false detection rate of our algorithm was the lowest. For example, the first two models in the first row mistakenly detect the battery string fault as dust, but our algorithm detects it accurately. This is because we use PVTv2 as the backbone feature extraction network, which generates a global receptive field; consequently, the local connections between the image and the feature map can be obtained, and more edge details of similar faults can be extracted. Additionally, our model performs well in detecting small targets. As shown in the third and fourth rows of Fig. 11, when the hot spot area is small, the first two algorithms often miss or falsely detect it, but our algorithm detects all instances. This is because the increased number of feature maps enriches the feature information of the detection target at different scales, making the model highly sensitive to small targets at long distances. Furthermore, the CBAM attention mechanism module added to the feature fusion input channel enhances the expression of deep and shallow semantic information in the feature pyramid and the model's ability to detect small target objects. The adaptability of the SIoU loss function to our model improves the network convergence speed and detection performance.

2) ABLATION EXPERIMENT
In this paper, ablation experiments were designed to verify the effectiveness of each module's improvements. The PV-YOLO-L model, with the feature extraction network replaced by PVTv2, is used as the base model for the experiments (Table 3).
From Table 3, the addition of the CBAM attention mechanism increased the mAP of the network by 3.9%.
After the cost matrix of the SimOTA allocation strategy was improved, the mAP increased by 2.9%. This shows that the CBAM attention mechanism module and the improved cost matrix are imperative for improving the mAP value. The addition of the SIoU loss function effectively reduces the degrees of freedom of the regression, accelerates network convergence, improves regression accuracy, and increases the mAP by 1.6%. Although the four-scale feature map improves the mAP by only 0.9%, this method and CBAM are imperative for detecting faults in small target classes, such as foreign object occlusion by guano. The ''occlusion'' AP increased by 3.4% after adding CBAM and by 5.0% after adding the four-scale feature map (Table 3). Fig. 12 shows the detection performance of different models when only foreign objects are detected. Fig. 12(c) shows a model without the CBAM attention mechanism module; here, smaller bird droppings cannot be detected. The model with CBAM but only three feature maps can detect smaller bird droppings but not linear ones (Fig. 12(d)). Meanwhile, the model improved with both of the above points has a better detection effect on smaller bird droppings and extremely small stains (Fig. 12(e)).

IV. CONCLUSION
To address the problem of photovoltaic panel fault detection, we redesigned the YOLOX target detection network. We used the transformer-based PVTv2 network as the backbone feature extraction network and added the CBAM attention module to the feature fusion part of the network to fuse the four expanded feature maps. The label assignment mechanism and loss function are improved for faults such as light spots and foreign objects blocking small targets. Furthermore, the problem of uneven sample distribution is mitigated, the convergence of the network is accelerated, and detection accuracy is improved. The experiment results show that although the parameters of our model are slightly larger than those of other algorithms, it still achieves the optimal mAP at the same parameter magnitude, with obvious advantages, especially in small target detection. Follow-up work will further compress the model parameters, reduce the difficulty of embedding and deploying the model, and combine it with UAV path planning to improve the practicability of the model in actual photovoltaic operation and maintenance work.
WANG YIN received the B.S., M.S., and Ph.D. degrees in guidance navigation and control from the Nanjing University of Aeronautics and Astronautics, in 2005, 2008, and 2013, respectively. From 2013 to 2017, he was a Lecturer at the Taiyuan University of Science and Technology, where he has been an Assistant Professor, since 2018. He is the author of two books, more than 30 articles, and more than ten inventions. His research interests include computer vision, intelligent control, and machine vision.