HDS-YOLOv5: An improved safety harness hook detection algorithm based on YOLOv5s

Abstract: Improper use of safety harness hooks is a major cause of safety hazards during power maintenance operations. Traditional machine vision-based detection methods have low accuracy and limited real-time performance. To quickly discern the status of hooks and reduce safety incidents in complicated operating environments, three improvements are incorporated into YOLOv5s to construct the novel HDS-YOLOv5 network. First, a HOOK-SPPF (spatial pyramid pooling fast) feature extraction module replaces the SPPF module in the backbone network. It enhances the network’s feature extraction capability with less feature loss and extracts more distinctive hook features from complex backgrounds. Second, a decoupled head that separates the confidence and regression-frame predictions is implemented to reduce negative conflicts between classification and regression, resulting in higher recognition accuracy and faster convergence. Lastly, the Scylla intersection over union (SIoU) loss is employed to optimize the loss function by utilizing the vector angle between the real and predicted frames, thereby improving the model’s convergence. Experimental results demonstrate that the HDS-YOLOv5 algorithm improves mAP@0.5 by 3 percentage points, reaching 91.2%. Additionally, the algorithm achieves a detection rate of 24.0 FPS (frames per second), demonstrating its superior performance compared to other models.


Introduction
Electrical equipment such as power towers and substations is vulnerable to environmental factors like rain, snow and hail as well as to equipment failures. To ensure the safety of transmission lines and the stability of the power grid, the power company regularly arranges inspections conducted by personnel. Typically, these inspections take place at high altitudes, making falls from height the most common safety hazard in the power industry. The proper use of safety harnesses is crucial in protecting the lives of personnel during these operations. However, workers are required to move up and down the tower, which means they need to unhook and reconnect the safety harness hook each time. Unfortunately, in an attempt to save time, some workers hang the hook improperly or even neglect to attach it at all during power operations.
The proper use of a safety harness is essential for protecting operators and preventing accidents. The non-standard use of hooks, such as hanging them on an unstable slope, at a sharp angle or without closing them, is the main cause of accidents. In electrical power operations, high-intensity work and a complex environment frequently make staff careless, increasing the probability of non-standard use of harness hooks and resulting in safety mishaps. To enhance field monitoring, machine vision technology has been implemented to detect helmets [1], faces [2], safety harnesses [3] and transmission lines [4], and to recognize anomalous operator behavior [5].
Deep learning has gained popularity as a method for enhancing computer vision performance. Currently, there are two main categories of deep detection networks. Two-stage algorithms such as R-CNN [6], Fast R-CNN [7], Faster R-CNN [8], Mask R-CNN [9], FPN [10] and SPPnet [11] first filter out candidate regions of potential targets from input images and then use convolutional neural networks to predict the classification and bounding box information. One-stage algorithms such as SSD [12], DSSD [13], EfficientDet [14], RetinaNet [15], YOLO [16][17][18][19][20] and YOLOX [21] directly produce predictions without a prefiltering stage. The recently released YOLOv5 model applies improvements such as adaptive anchor frame calculation, mosaic data enhancement and adaptive image scaling to significantly improve both processing speed and accuracy. One-stage algorithms usually have lower accuracy than two-stage algorithms, but they run faster and are better suited to real-time use, which makes them more suitable for power operation site inspection.
Machine vision technology based on deep learning has been applied to the safety control of electric power operation [22] with good results. However, further research is needed to improve the accuracy of identifying safety harness wearing by electric power operation personnel. Fang et al. [23] developed a computer vision-based method utilizing two convolutional neural network (CNN) models to determine whether workers are wearing their harnesses while working at height. Although it provides an advantage over manual examination of safety harnesses, this method is still inaccurate and relies on a large amount of data and computational resources. To increase accuracy, Fang et al. [24] proposed a harness detection algorithm based on YOLOv5 and the OpenPose network. The dataset is created from video streams of workers wearing safety harnesses, and the networks are trained to detect the safety harness. Li et al. [25] proposed a CME-YOLOv5 network to reduce environmental disturbances and mutual occlusion as well as to facilitate the detection of small targets. Zhou et al. [26] proposed a method for detecting insulator defects using an improved YOLOv7 and a multi-UAV cooperative system. However, the efficacy of the detection is contingent upon the state of the UAVs and the weather conditions. Due to its excellent detection performance, the YOLO series has also been widely applied in other fields [27,28]. Lawal [29] designed the YOLO-Tomato model for detecting tomatoes in complex environmental conditions. Roy and Bhaduri [30] provided an efficient damage classification and localization model based on YOLOv5. Their model addresses the shortcomings of existing deep learning-based damage detection models by offering highly accurate localized bounding box predictions.
Ref. [31] proposed a lightweight object detection method (Efficient-YOLOv5) for detecting safety harness wearing in general construction operations, although it has certain limitations in electric power operation scenes. Moreover, this method only verifies the presence of safety belts on the workers and does not assess the status of the hook, so the condition of the safety harness hook cannot be evaluated.
Given the above issues, this research addresses the recognition of the status of the safety harness hook in complex electric power operation scenes. Such scenes involve challenging backgrounds, such as farmland, grassland, trees, houses and other electric power facilities, which can cause interference. This research proposes an efficient one-stage deep learning network based on the YOLOv5 network. A HOOK-SPPF module is designed to enhance the backbone network and express the target features more accurately in complex backgrounds. Moreover, the decoupled head [21] is adopted to predict the confidence and regression frames independently, thereby improving the detection accuracy and accelerating the network convergence. Furthermore, the SIoU loss function [32] is introduced to further accelerate model convergence and make the loss function smoother. Finally, extensive experiments are conducted on a homemade Hook dataset to evaluate and verify the performance of the proposed model.

YOLOv5s object detection algorithm model
The algorithms in the YOLO (you only look once) series hold an important position in deep learning-based target detection. These algorithms have undergone a steady stream of innovations and improvements from versions 1 to 5. YOLOv5, the fifth version, offers higher precision and faster speed while maintaining a relatively small model size. The YOLOv5 network structure consists of three parts: backbone, neck and head. The backbone is responsible for extracting feature maps, while the head generates detection boxes and predicts classes. The backbone employs two modules, C3 and SPPF (in YOLOv7, the SPPF module is replaced with the SPPCSPC module), which effectively improve the quality and quantity of feature maps. By default, YOLOv5 uses the same coupled head as YOLOv3, while YOLOX employs a decoupled head. YOLOv5 also introduces some new features. For instance, DropBlock regularization is used in the backbone to enhance the model's robustness. Additionally, the MixUp data augmentation method is added to improve model generalization and reduce the risk of overfitting.
The YOLOv5 model comes in several versions, among them YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x, which progressively increase in depth and width. The model details are shown in Table 1. This study focuses on recognizing the hanging state of safety harness hooks, a task that requires both high real-time performance and accuracy. Of the versions listed, YOLOv5s is the smallest, in contrast to larger models like YOLOv5l and YOLOv5x. It has relatively few network layers and parameters, which gives it a higher inference speed, higher detection efficiency and lower hardware requirements while still producing accurate results. As a result, this research focuses on improving YOLOv5s.

HDS-YOLOv5 network model
This section describes the proposed improvements to YOLOv5s in greater detail, including the HOOK-SPPF module, the decoupled head and SIoU [32]. The resulting HDS-YOLOv5 network architecture is illustrated in Figure 2.

Building the HOOK-SPPF module
Current hook target detection tasks often suffer from targets being confused with the background, difficulty in extracting the features of small hooks and multiple overlapping, densely packed target categories. As shown in Figure 3, the hook target is relatively small and the color of the hook is similar to that of the power tower. This places high performance requirements on hook target detection models in practical applications. To solve this problem, this section proposes a HOOK-SPPF structure to further enhance the features extracted by the backbone network, improving the network's ability to recognize and classify safety harness hooks in complex environments.
The SPPF used in YOLOv5s is an enhanced version of the spatial pyramid pooling (SPP) [33] module, as shown in Figure 4. SPPF establishes connections between each pooling layer, preserving more feature information and enhancing the network's receptive field. This optimization retains the advantages of SPP while further improving the calculation speed.  HOOK-SPPF is a module that incorporates SPPF into cross stage partial network (CSPNet) [34], as illustrated in Figure 5. CSPNet optimizes the problem of duplicate gradients in the network by integrating the feature maps at the beginning and end of the network, achieving a 20% reduction in computational requirements while achieving the same or even higher accuracy. CSPNet divides the input feature map into two parts. These parts are then merged through the cross-stage hierarchy structure. By separating the gradient flow and propagating it through different network paths, the network achieves greater diversity in gradient combinations, thus enhancing both the speed and accuracy of network inference.
HOOK-SPPF is divided into two parts. The first part applies a convolution, batch normalization and the ReLU activation function to the feature information extracted from the backbone network. This part plays an auxiliary optimization role and also retains the positional information contained in the input feature layer. The other part first applies convolution, batch normalization and ReLU activation three times to extract deep image information. Subsequently, the SPPF structure increases the receptive field size, and two further convolutions, each followed by batch normalization and ReLU activation, extract the features. At this point, the feature layer contains more semantic information. HOOK-SPPF stacks these two parts together, greatly improving the network's ability to learn multiscale features while reducing the number of parameters and enhancing the accuracy of detection. The HOOK-SPPF structure inherits the advantages of the SPPF structure, which include adaptive size output without distortion, lower model computational complexity and faster processing speed, while avoiding repeated image feature extraction. These advantages contribute to the HOOK-SPPF structure's effectiveness in detecting hooks.
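The speed advantage of SPPF over SPP described above can be illustrated with a small pure-Python sketch. It assumes the kernel sizes used in the public YOLOv5 implementation (parallel 5 × 5, 9 × 9 and 13 × 13 pools for SPP, three chained 5 × 5 pools for SPPF), which the text does not state. The function names are hypothetical, and the sketch covers only the pooling step, not the convolutions of HOOK-SPPF.

```python
import random

NEG_INF = float("-inf")


def max_pool_same(grid, k):
    """Stride-1 max pooling with a k x k window and 'same' output size.

    Positions outside the grid are skipped, which is equivalent to
    padding the input with -inf.
    """
    r = k // 2
    h, w = len(grid), len(grid[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            best = NEG_INF
            for y in range(max(0, i - r), min(h, i + r + 1)):
                for x in range(max(0, j - r), min(w, j + r + 1)):
                    best = max(best, grid[y][x])
            row.append(best)
        out.append(row)
    return out


def spp_branches(grid):
    """SPP style: three independent pools applied to the same input."""
    return [max_pool_same(grid, k) for k in (5, 9, 13)]


def sppf_branches(grid):
    """SPPF style: three chained 5 x 5 pools, each reusing the previous output."""
    p1 = max_pool_same(grid, 5)
    p2 = max_pool_same(p1, 5)
    p3 = max_pool_same(p2, 5)
    return [p1, p2, p3]


random.seed(0)
feature_map = [[random.random() for _ in range(20)] for _ in range(20)]
same = spp_branches(feature_map) == sppf_branches(feature_map)
print("SPP and SPPF pooled maps identical:", same)
```

Under these assumptions the chained pools reproduce the larger windows exactly, so SPPF obtains the same pooled maps while only ever scanning 5 × 5 neighbourhoods.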

Decoupled head
The backbones and feature pyramids of the YOLO family have evolved, while the detection heads have remained coupled. YOLOv5s uses a coupled head in which a single 1 × 1 convolution produces the confidence, classification and regression-frame outputs together. Figure 6 shows the implementation of the coupled head, and Figure 7 illustrates the structure of the decoupled head. For a given input feature layer, the decoupled head employs a 1 × 1 convolution to reduce its dimension. It then applies two 3 × 3 convolutions in each of the parallel classification and regression branches, which are dedicated to object classification and target frame coordinate regression, respectively. This processing generates three outputs: Cls, Reg and Obj. Cls represents the category corresponding to the target frame, Reg represents the location information of the target frame and Obj indicates whether each feature point contains an object. All three output values are combined to generate the final prediction information. The decoupling operation separates the confidence and the regression frame, slightly increasing the complexity of the process. However, it alleviates the negative impact caused by the conflict between the classification and regression tasks [35,36], ultimately improving the detection accuracy of the network and accelerating training convergence.
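How the three outputs Cls, Reg and Obj might be combined for a single feature point can be sketched as follows. The text does not specify the combination rule. The sketch assumes the convention common in public YOLO inference code, where the final score is the objectness probability multiplied by the best class probability. The function name `combine_outputs` and the example values are hypothetical.

```python
import math


def sigmoid(x):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def combine_outputs(reg, obj_logit, cls_logits):
    """Merge the three branch outputs for one feature point.

    reg        -- 4 box values from the Reg branch (passed through unchanged)
    obj_logit  -- raw objectness score from the Obj branch
    cls_logits -- one raw score per category from the Cls branch
    """
    obj = sigmoid(obj_logit)
    cls_probs = [sigmoid(c) for c in cls_logits]
    class_id = max(range(len(cls_probs)), key=cls_probs.__getitem__)
    # Assumed scoring rule: objectness times best class probability.
    score = obj * cls_probs[class_id]
    return {"box": list(reg), "class_id": class_id, "score": score}


# One feature point with five categories, matching N = 5 in this paper.
pred = combine_outputs([0.5, 0.5, 0.2, 0.3], 2.0, [-1.0, 0.0, 3.0, -2.0, 0.5])
print(pred)
```

Because the score is a product, a feature point is only reported with high confidence when both the Obj branch and the Cls branch agree.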

Loss function
The loss function of YOLOv5s measures the distance between the network's predictions and the expected values (labels). The more closely the predictions match the labels, the smaller the value of the loss function. The loss consists of three components: rectangular box loss (loss_rect), confidence loss (loss_obj) and classification loss (loss_cls). The overall loss is the weighted sum of these three components, and the emphasis on each can be adjusted through its weight. The YOLOv5s loss function can be expressed as:
$$ Loss = a \cdot loss_{obj} + b \cdot loss_{rect} + c \cdot loss_{cls} \tag{1} $$
where $a$, $b$ and $c$ are the weights. YOLOv5s uses the complete intersection over union (CIoU) loss [37] to calculate the rectangular box loss (loss_rect) and the BCE loss to calculate the confidence and classification losses. The CIoU loss is calculated as:
$$ L_{CIoU} = 1 - IoU + \frac{\rho^2}{c^2} + \alpha v \tag{2} $$
$$ \alpha = \frac{v}{(1 - IoU) + v} \tag{3} $$
$$ v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \tag{4} $$
where $\rho$ is the distance between the center points of prediction bounding box A and real bounding box B. In Eq (2), $c$ represents the diagonal length of the minimum bounding rectangle of box A and box B. $v$ represents the aspect ratio similarity of box A and box B, $\alpha$ is the influence factor of $v$, and $w$, $h$ and $w^{gt}$, $h^{gt}$ are the widths and heights of the prediction and real bounding boxes.
In Eq (4), the value range of the arctan function is 0 to $\pi/2$, so the value range of $v$ is 0 to 1. When the width-to-height ratios of prediction bounding box A and real bounding box B are equal, $v = 0$. In this case, the influence factor $\alpha$ of $v$ is also equal to 0, so the term $\alpha v$ in Eq (2) has no effect and the CIoU loss function does not get a stable expression. For this reason, the SIoU loss function is chosen to replace the original CIoU loss function. SIoU additionally considers the vector angle between the real and predicted frames and redefines the associated loss function, which contains four components: angle cost, distance cost, shape cost and IoU cost. The SIoU schematic is illustrated in Figure 8.
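The vanishing aspect-ratio penalty of CIoU described above can be checked numerically with the following sketch of the standard CIoU formulation. The `(cx, cy, w, h)` box format, the function names and the example boxes are assumptions made for illustration.

```python
import math


def corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def iou(a, b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax1, ay1, ax2, ay2 = corners(a)
    bx1, by1, bx2, by2 = corners(b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union


def ciou_terms(pred, gt):
    """Return (loss, v, alpha) for the CIoU loss of one box pair."""
    px1, py1, px2, py2 = corners(pred)
    gx1, gy1, gx2, gy2 = corners(gt)
    i = iou(pred, gt)
    # Squared distance between the two box centres.
    rho2 = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2
    # Squared diagonal of the smallest rectangle enclosing both boxes.
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2
    # Aspect-ratio term v and its influence factor alpha.
    v = 4 / math.pi ** 2 * (math.atan(gt[2] / gt[3]) - math.atan(pred[2] / pred[3])) ** 2
    denom = (1 - i) + v
    alpha = v / denom if denom > 0 else 0.0
    return 1 - i + rho2 / c2 + alpha * v, v, alpha


gt_box = (50.0, 50.0, 40.0, 20.0)
same_ratio = (55.0, 52.0, 20.0, 10.0)    # 2:1 like the ground truth
other_ratio = (55.0, 52.0, 10.0, 20.0)   # 1:2, a different shape

for name, box in (("same ratio", same_ratio), ("other ratio", other_ratio)):
    loss, v, alpha = ciou_terms(box, gt_box)
    print(f"{name}: loss={loss:.4f} v={v:.4f} alpha={alpha:.4f}")
```

For the first box the prediction is half the size of the ground truth, yet both `v` and `alpha` are zero, so the shape mismatch contributes nothing to the loss.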
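Following the SIoU formulation [32], the angle cost is defined in Eqs (5) to (7):
$$ \Lambda = 1 - 2\sin^2\left(\arcsin\left(\frac{c_h}{\sigma}\right) - \frac{\pi}{4}\right) \tag{5} $$
$$ \sigma = \sqrt{\left(b_{c_x}^{gt} - b_{c_x}\right)^2 + \left(b_{c_y}^{gt} - b_{c_y}\right)^2} \tag{6} $$
$$ c_h = \max\left(b_{c_y}^{gt}, b_{c_y}\right) - \min\left(b_{c_y}^{gt}, b_{c_y}\right) \tag{7} $$
In Eq (6), $\sigma$ represents the distance between the center points of prediction bounding box A and real bounding box B, where $(b_{c_x}, b_{c_y})$ and $(b_{c_x}^{gt}, b_{c_y}^{gt})$ are the center coordinates of the prediction and real bounding boxes. In Eq (7), $c_h$ is the height difference between the two center points.
The distance cost is defined in Eqs (8) to (11):
$$ \Delta = \sum_{t=x,y}\left(1 - e^{-\gamma\rho_t}\right) \tag{8} $$
$$ \rho_x = \left(\frac{b_{c_x}^{gt} - b_{c_x}}{X_w}\right)^2 \tag{9} $$
$$ \rho_y = \left(\frac{b_{c_y}^{gt} - b_{c_y}}{X_h}\right)^2 \tag{10} $$
$$ \gamma = 2 - \Lambda \tag{11} $$
In Eqs (9) and (10), $X_w$ and $X_h$ are the width and height of the minimum bounding rectangle of the real bounding box and the prediction bounding box. As the angle increases, $\gamma$ gives higher priority to the distance term.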

Shape cost
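The shape cost is defined in Eqs (12) to (14):
$$ \Omega = \sum_{t=w,h}\left(1 - e^{-\omega_t}\right)^{\theta} \tag{12} $$
$$ \omega_w = \frac{\left|w - w^{gt}\right|}{\max\left(w, w^{gt}\right)} \tag{13} $$
$$ \omega_h = \frac{\left|h - h^{gt}\right|}{\max\left(h, h^{gt}\right)} \tag{14} $$
where $w$, $h$ and $w^{gt}$, $h^{gt}$ are the widths and heights of the prediction and real bounding boxes, and $\theta$ controls how much attention is paid to the shape cost. To avoid paying too much attention to the shape cost, which would reduce the movement of the prediction bounding box, this paper sets $\theta$ to 2.
In summary, the final definition of the SIoU loss function is shown in Eq (15):
$$ L_{SIoU} = 1 - IoU + \frac{\Delta + \Omega}{2} \tag{15} $$
where $\Delta$ is the distance cost and $\Omega$ is the shape cost. With the addition of the angle cost, the loss function is more fully expressed, reducing the likelihood of obtaining a zero-penalty term and facilitating smoother convergence of the final loss function. In turn, this enhances regression accuracy and minimizes prediction errors.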
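The four SIoU components can be put together in a short sketch that follows the published SIoU formulation [32] with the shape exponent set to 2, as in this paper. The `(cx, cy, w, h)` box format, the function names and the example boxes are assumptions for illustration. This is not the training code used in the experiments.

```python
import math


def angle_cost(dx, dy):
    """Angle cost: 0 when the centres are axis-aligned, 1 at 45 degrees."""
    sigma = math.hypot(dx, dy)                    # centre distance
    x = abs(dy) / sigma if sigma > 0 else 0.0     # sine of the angle
    return 1 - 2 * math.sin(math.asin(min(1.0, x)) - math.pi / 4) ** 2


def siou_loss(pred, gt, theta=2.0):
    """SIoU loss for two (cx, cy, w, h) boxes."""
    px1, py1 = pred[0] - pred[2] / 2, pred[1] - pred[3] / 2
    px2, py2 = pred[0] + pred[2] / 2, pred[1] + pred[3] / 2
    gx1, gy1 = gt[0] - gt[2] / 2, gt[1] - gt[3] / 2
    gx2, gy2 = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2

    # IoU cost.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    iou = inter / (pred[2] * pred[3] + gt[2] * gt[3] - inter)

    # Width and height of the smallest rectangle enclosing both boxes.
    xw = max(px2, gx2) - min(px1, gx1)
    xh = max(py2, gy2) - min(py1, gy1)

    # Distance cost, weighted by gamma = 2 - angle cost.
    dx, dy = gt[0] - pred[0], gt[1] - pred[1]
    gamma = 2 - angle_cost(dx, dy)
    rho_x, rho_y = (dx / xw) ** 2, (dy / xh) ** 2
    distance = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))

    # Shape cost with exponent theta.
    omega_w = abs(pred[2] - gt[2]) / max(pred[2], gt[2])
    omega_h = abs(pred[3] - gt[3]) / max(pred[3], gt[3])
    shape = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta

    return 1 - iou + (distance + shape) / 2


gt_box = (50.0, 50.0, 40.0, 20.0)
near = siou_loss((52.0, 50.0, 40.0, 20.0), gt_box)
far = siou_loss((70.0, 50.0, 40.0, 20.0), gt_box)
resized = siou_loss((50.0, 50.0, 20.0, 10.0), gt_box)
print(f"near={near:.4f} far={far:.4f} resized={resized:.4f}")
```

Unlike the CIoU aspect-ratio term, the shape cost here stays positive for the resized box even though its aspect ratio matches the ground truth.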

Dataset
To assess the detection performance of the improved YOLOv5s algorithm, a hook dataset was created from photographs taken by the authors and images obtained from the internet. The objective was to enable the algorithm to achieve better hook detection results under various complex scenes and extreme weather conditions. For this purpose, hooks were first hung in different ways and at various locations on a domestic electric power tower, simulating the common arrangements of safety harness hooks during the operations of electric workers. A Xiaomi 11 cell phone was then used to photograph the hooks under varying lighting conditions, time periods (noon, evening, etc.), distances and focal lengths. After collation, a total of 3378 photos of safety harness hooks were obtained, covering five hook-hanging categories: four types of violation and one safe type. Because the dataset is small, an insufficient number of samples could cause overfitting and degrade the detection of the safety harness hook. Consequently, a data enhancement tool was employed to expand the original dataset. The images were augmented through random rectangle masking and horizontal flipping, boosting the total count to 9738. This effectively increased the scale of the training set and enhanced the model's ability to generalize. Furthermore, the LabelImg annotation software was used to annotate each image in the txt format required by YOLOv5. Finally, the dataset was split into a training set, a validation set and a test set in an 8:1:1 ratio. Figure 9 displays examples of images from the dataset.
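The two augmentations and the 8:1:1 split described above can be sketched as follows. The data enhancement tool actually used is not named in the text, so every function, parameter and value below is hypothetical. Images are represented as plain lists of pixel rows, and labels follow the YOLO txt convention of normalised `(class_id, x_center, y_center, width, height)`.

```python
import random


def hflip(image, labels):
    """Mirror an image and its YOLO-format labels left to right.

    Only x_center changes under a horizontal flip.
    """
    flipped = [row[::-1] for row in image]
    new_labels = [(c, 1.0 - x, y, w, h) for c, x, y, w, h in labels]
    return flipped, new_labels


def random_rect_mask(image, rng, max_frac=0.3, fill=0):
    """Blank out one random rectangle spanning at most max_frac of each side.

    Labels are left unchanged (an assumption of this sketch).
    """
    h, w = len(image), len(image[0])
    mask_h = rng.randint(1, max(1, int(h * max_frac)))
    mask_w = rng.randint(1, max(1, int(w * max_frac)))
    top = rng.randint(0, h - mask_h)
    left = rng.randint(0, w - mask_w)
    out = [row[:] for row in image]
    for y in range(top, top + mask_h):
        for x in range(left, left + mask_w):
            out[y][x] = fill
    return out


def split_8_1_1(items, rng):
    """Shuffle and split items into train, validation and test sets (8:1:1)."""
    items = items[:]
    rng.shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


rng = random.Random(0)
image = [[1] * 8 for _ in range(6)]           # toy 6 x 8 single-channel image
labels = [(2, 0.25, 0.5, 0.1, 0.2)]           # one hook box, class 2

flipped_image, flipped_labels = hflip(image, labels)
masked_image = random_rect_mask(image, rng)
train, val, test = split_8_1_1(list(range(9738)), rng)
print(flipped_labels, len(train), len(val), len(test))
```

With integer rounding as written, the 9738 augmented images divide into 7790 training, 973 validation and 975 test images. The exact counts used in the experiments are not given in the text.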

Experimental environment and parameter setting
The hook detection method proposed in this research was implemented in a Windows 10 Professional environment. PyTorch, a deep learning framework, was used for model construction, training and testing. The development environment was PyCharm Community Edition. To expedite the model training process, CUDA and cuDNN were employed for acceleration. A comprehensive list of the training platform parameters is provided in Table 2. The training configuration for the improved YOLOv5s model is as follows: the input image size is 640 × 640, the number of epochs is 300, the batch size is 32, the initial learning rate is 0.01 and the weight decay is 0.0005.

Evaluation metrics
To provide a comprehensive and objective evaluation of the improved YOLOv5s model proposed in this paper, precision and recall, two commonly used evaluation metrics for neural network models, are employed. Precision is the ratio of correctly predicted targets to all predicted targets, while recall is the ratio of correctly predicted targets to all actual (correct) targets. Precision and recall are calculated as follows:
$$ Precision = \frac{TP}{TP + FP} $$
$$ Recall = \frac{TP}{TP + FN} $$
where true positive (TP) refers to predicted targets that are correct, false positive (FP) refers to predicted targets that are wrong and false negative (FN) refers to actual targets that are not predicted. A target prediction is considered correct when its IoU is at or above the threshold and incorrect when its IoU is below the threshold. In this paper, the detection threshold is set to 0.5, so a detection box is considered accurate when the IoU between it and the real box reaches 0.5. The two most common evaluation metrics for target detection tasks are the average precision (AP) and the mean average precision (mAP). AP is the area under the precision-recall curve, and a larger AP value indicates better detection accuracy for that category. mAP is the average of the AP values across all categories, and its value ranges from 0 to 1. The calculation formulas are as follows:
$$ AP = \int_0^1 P(R)\,dR $$
$$ mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i $$
In this paper, N = 5 is the number of target detection categories. The main measure used to evaluate the model's detection effectiveness is mAP@0.5, the mAP at an IoU threshold of 0.5. The comparison experiment additionally reports mAP@0.5:0.95, which reflects the comprehensive performance of the model under different IoU thresholds. Higher values indicate a better model and a more accurate fit between the predicted and real bounding boxes.

Ablation experiment
In this study, we conducted ablation experiments to comprehensively validate the optimization impact of each enhancement module.
Specifically, we compared YOLOv5s (SPPF + CIoU + Coupled Head), YOLOv5s-1 (SPPF + SIoU + Coupled Head), YOLOv5s-2 (SPPF + SIoU + Decoupled Head), YOLOv5s-3 (HOOK-SPPF + SIoU + Coupled Head) and HDS-YOLOv5 (HOOK-SPPF + SIoU + Decoupled Head). Table 3 shows the results of the experiment. The mAP@0.5 of the original YOLOv5s is 88.2%. YOLOv5s-1 replaces the original CIoU with SIoU, which increases the mAP to 89.0% and indicates that SIoU improves the accuracy of the network. YOLOv5s-2 combines SIoU and the decoupled head, further improving the mAP to 90.0%. YOLOv5s-3 combines SIoU and HOOK-SPPF, improving the mAP to 90.8%. HDS-YOLOv5 incorporates SIoU, the decoupled head and HOOK-SPPF, achieving a 91.2% mAP, which is 3.0 percentage points higher than the original YOLOv5s. These results demonstrate the effectiveness of the proposed network.
Figure 10 shows the change curve of the mAP@0.5 during training. To demonstrate the improved effectiveness of the model more clearly, we tested it on several images from the test set, and the results are shown in Figure 11. It is evident from Figure 11 that the improved algorithm HDS-YOLOv5 successfully extracts the desired features and recognizes the suspension state of the safety harness hook better than YOLOv5s in both backlit conditions and complex background environments.
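The precision, recall, AP and mAP definitions from the evaluation metrics section can be sketched in a few lines. The sketch assumes that each detection has already been marked as a true or false positive using the IoU threshold of 0.5. It computes the area under the precision-recall curve with all-point interpolation, a common convention that the text does not specify. Function names and example values are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(detections, num_gt):
    """Area under the precision-recall curve for one category.

    detections -- list of (confidence, is_true_positive) pairs
    num_gt     -- number of ground-truth boxes of this category
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Interpolation: precision at each point becomes the best precision
    # reachable at that recall or any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


p, r = precision_recall(tp=8, fp=2, fn=2)
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
print(f"precision={p:.2f} recall={r:.2f} AP={ap:.4f}")
print("mAP over two toy categories:", mean_average_precision([ap, 1.0]))
```

For the HDS-YOLOv5 experiments, `mean_average_precision` would receive five AP values, one per hook category.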

Comparison experiment
To assess the effectiveness of the object detection method described in this paper, comparative tests were performed between the improved model proposed in this study and various existing algorithms, including Faster R-CNN, SSD, YOLOv3, YOLOv4, YOLOv5 and YOLOv7. The comparative experiment was conducted using identical experimental settings and the same dataset, with four metrics used as measurement criteria: mAP@0.5, mAP@0.5:0.95, FPS and model size. The findings of these experiments are presented in Table 4. As shown in Table 4, compared to the SSD, Faster R-CNN and YOLOv4 models, the HDS-YOLOv5 model significantly reduces the model size while greatly improving detection speed. Compared to the original YOLOv5s and the YOLOv7 model, our improved model achieves the highest mAP value. Although the detection speed (FPS) of our improved model is lower than that of the original YOLOv5s and YOLOv7 models, a comprehensive analysis of the experimental results shows that the HDS-YOLOv5 model strikes a balance between detection speed and accuracy, resulting in superior overall performance.
To intuitively verify the effectiveness of the improved algorithm in object detection and its robustness in different complex settings, we selected the same test datasets for experimental comparison of SSD, Faster R-CNN, YOLOv4, YOLOv5 and YOLOv7. The experimental results are depicted in Figure 12. From Figure 12, it is evident that all five detection algorithms effectively identify the suspended state of the hooks. However, the superiority of the HDS-YOLOv5 model can be intuitively observed from the specific results. The HDS-YOLOv5 model exhibits a greater level of confidence in detecting the hooks. Moreover, the HDS-YOLOv5 showcases improved accuracy and stability when detecting targets, particularly in complex backgrounds, thereby substantially enhancing the detection capabilities of hook-shaped objects.

Conclusions
This article presents HDS-YOLOv5, an algorithm designed to identify the suspended state of safety harness hooks during power tower inspections. The HOOK-SPPF module is designed to enhance the feature extraction capability of the backbone network and improve the model's ability to extract deep and crucial features of the hooks. The decoupled head replaces the coupled head in the original network, reducing the negative conflict between the classification and regression tasks and thereby improving accuracy and reducing missed detections of hooks in complex environments. The CIoU loss function is replaced by the SIoU loss function. SIoU additionally considers the vector angle between the real box and the predicted box and redefines the loss with four cost terms (angle cost, distance cost, shape cost and IoU cost), thereby enhancing the regression accuracy of the model. Comparative experiments were conducted on a homemade hook dataset. Compared to the original network, the improved model increases mAP@0.5 by 3 percentage points, reaching 91.2%. However, the detection speed of the improved model is 24 FPS, which is 3 FPS slower than the original YOLOv5s. To address this issue, the next optimization plan involves adopting the more lightweight MobileNetV3 network as the backbone network for YOLOv5, further reducing model parameters and computational load to improve detection speed.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.