YOLO-SD: A Real-Time Crew Safety Detection and Early Warning Approach

Wearing a safety rope while working aloft and over the side of a ship is an effective means of protecting seafarers from accidents. However, ships currently have no active and effective method for monitoring compliance. In this article, a one-stage system is proposed to automatically monitor whether crew members are wearing safety ropes. When the system detects that a crew member enters the work area without a safety rope, it warns the supervisor. To this end, a safety rope wearing detection dataset is established. Then a data augmentation algorithm and a bounding box loss function are designed to improve the training effect and the convergence speed. Furthermore, features from different scales are extracted to obtain the final detection results. The obtained results demonstrate that the proposed approach, YOLO-SD, is effective under different on-site conditions and achieves high precision (97.4%), recall (91.4%), and mAP (91.5%) while ensuring real-time performance (38.31 FPS on average).


Introduction
To minimize human injury and loss of life at sea, the International Maritime Organization (IMO) has adopted the International Safety Management (ISM) Code as a mandatory standard and requires vessel operators to develop, implement, and maintain their own Safety Management System (SMS). Generally, the SMS requires that, when working aloft, the crew must obtain the captain's permission, and that a checklist, two or more staff, and a task manager must be assigned at the same time. However, this cannot guarantee that all crew members follow these rules. Statistics issued by the European Maritime Safety Agency (EMSA) [1] indicate that, from 2014 to 2019, maritime accidents across Europe caused more than 496 deaths and 6,210 injuries, and that slips, trips, and falls account for almost 37% of all accidents. For example, during a ship drill in November 2019, a crew member of the Japanese cargo ship ORANGE PHOENIX died after falling to the deck without a safety rope. Studies by the EMSA show that many casualties in offshore tasks originate from crew members not wearing safety ropes. Currently, checking whether safety ropes are worn is one of the workers' mutual inspections before the operation [2]. So far, no automatic method has been proposed in this regard, so it is a challenge for the captain or the shipping company manager to ensure that the crew is wearing safety ropes when working at height. Accordingly, it is of significant importance to establish a method that automatically determines whether a crew member is wearing a safety rope when entering the work area.
Working at height refers to working on ship masts (see Figure 1(a)), outside chimneys (see Figure 1(b)), or on ship cranes (see Figure 1(c)). Unlike ordinary construction sites, a ship has only a few crew members onboard, mainly because of space limitations. Accordingly, working crew members lack adequate supervision and protection during the work.
Moreover, the crew works across a wide range of vertical space on the ship, which requires safety rope detection algorithms to have multiscale detection capabilities and high robustness. On the other hand, conventional target detection algorithms often rely on cloud platforms with substantial computing power or on high-performance GPU clusters to achieve better detection rates. Considering onboard constraints, it is a challenge to apply these algorithms. Generally, safety detection and early warning systems for crews working at height should have the following capabilities: (1) determine whether all individuals in the working area are wearing a safety rope; (2) process the surveillance video in real time; (3) issue a warning message when a crew member enters the working area without wearing a safety rope. Currently, investigations of safety ropes mainly focus on the use of visual perception technology [3-6] and signal processing [7,8] to perform a qualitative or quantitative analysis of the reliability of the safety rope. In particular, detecting whether safety ropes are worn on ships has not yet been studied.
The present study specifically addresses the task of safety rope wearing detection on ships. The main objective of this article is to identify whether all crew members in the ship's high-altitude operation area are wearing a safety rope; if not, a warning message is issued on the monitoring screen. The difficulty of this task makes it problematic to rely on any multistage method with handcrafted features. To resolve this problem, a convolutional neural network (CNN) is introduced. The proposed method is inspired by the one-stage detector YOLO (You Only Look Once) [9], which performs object detection by directly regressing bounding boxes through a single CNN. Compared with conventional target detection methods, the established method automatically performs feature learning and provides excellent performance in the field of computer vision. The main contributions of this article can be summarized as follows: (1) It is the first method that introduces a CNN into crew safety detection and early warning in surveillance video. The proposed system is end-to-end trainable. (2) Based on the YOLOv5 pipeline and BN-Conv modules, a multiscale CNN framework, YOLO-SD, is proposed to improve the accuracy and speed of detection under complex surveillance conditions. (3) A Crew Safety Rope Wearing Detection Dataset is established, which contains 3,150 images and 6,583 safety rope instances. These images cover large variations in scene and scale and include examples with occlusion. Each instance in the benchmark is annotated with a class label and a bounding box.
The remainder of this article is organized as follows: the literature survey is presented in Section 2, followed by the system overview in Section 3. Then details of the proposed crew safety rope detection algorithm are discussed in Section 4. The experimental results are presented in Section 5. Finally, conclusions and main achievements are summarized in Section 6.

Literature Survey
To detect a target precisely, classification is combined with precise target localization to provide a proper understanding of the image. Conventional target detection techniques mainly rely on handcrafted features and shallow trainable structures. Consequently, these methods may easily fail in complicated scenes, in severe weather, and under different occlusions.
With the emergence of deep learning technology [10][11][12][13], many limitations of conventional target detection technologies have been resolved, so that deeper semantics and features can be learned. A target detector generally has three main modules, namely informative region selection, feature extraction, and classification. In the informative region selection process, a multiscale sliding window is usually applied to scan the entire image, which is a computationally expensive process. The feature extraction process extracts visual features that characterize the semantics of the target and thus underpins target recognition. Common features in this regard are SIFT [14], HOG [15], and Haar-like [16] features. Finally, classification is the process of distinguishing the target object from all other categories. Some common effective classifiers are AdaBoost [17], the support vector machine (SVM) [18], and the deformable part-based model (DPM) [19]. Modern target detectors can be mainly divided into two categories: (1) detectors based on region proposals, such as R-CNN [20], Fast R-CNN [21], and Faster R-CNN [22]; (2) detectors based on classification/regression, such as YOLO [10, 23, 24, 25], SSD [26], and EfficientDet [27]. In the former detectors, region proposals are generated first and each proposal is then classified. This two-stage approach extracts candidate regions through a complex and time-consuming process, which makes real-time detection difficult. On the other hand, a detector based on classification/regression has only a single feedforward CNN, eliminating region proposal generation and the subsequent feature resampling. Accordingly, it encapsulates all computations in one network, thereby making real-time detection easier to realize.
Redmon et al. used the feature map to directly predict candidate boxes and class probabilities and proposed the one-stage object detector YOLO [9]. The main idea of this detector is to divide the input image into grid cells, where each grid cell is responsible for predicting candidate boxes and the corresponding confidence values. This method can process 45 frames per second in real time, which is much faster than other target detectors of that time. However, YOLO cannot effectively deal with small objects or objects with unusual aspect ratios. To resolve these shortcomings, Liu et al. drew on anchors, the region proposal network (RPN), and multiscale representations and proposed the single-shot multibox detector (SSD) [26]. Moreover, Redmon improved the original YOLO and proposed YOLO9000 [23] and YOLO v3 [24], which achieved good results in detecting high-resolution or multiscale targets. Bochkovskiy [25] improved Redmon's YOLO framework and proposed YOLO v4, which was followed by YOLO v5. An average precision of 43.5% and a speed of 140 fps were reported on the MS-COCO data set. Figure 2 shows the framework of the proposed algorithm YOLO-SD, indicating that the proposed framework is mainly composed of three networks.

System Overview
(1) Data Augmentation Network: In the training stage, a data augmentation network is designed to artificially augment the dataset by increasing the number of multiscale safety ropes and the number of target-occluded images. This improves the network's ability to detect safety ropes at different scales and the robustness of the algorithm. When the prediction network detects a crew member entering the work area without wearing a safety rope, the system marks the crew member and issues a sound warning to remind the supervisor.

Data Augmentation Network.
During the experiments, it was found that the majority of missed targets were relatively small targets. This phenomenon may be attributed to the following two reasons: (1) The area of the safety rope in the image is relatively small. (2) The safety rope can easily be occluded by other objects. Kisantal et al. [28] showed that an algorithm's detection performance for small targets can be improved by oversampling images that contain such targets and by copy-pasting the targets as a form of data augmentation. Accordingly, a data augmentation strategy is designed to train the algorithm and overcome the foregoing problems. To this end, each image is initially oversampled, and then new training images are generated by flipping, random scaling, random cropping, and random arrangement. Figure 3 shows that the bottom image, which is used for training, is generated by randomly cropping, arbitrarily scaling, flipping, and rearranging the four top images.
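The four-image composition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of nested lists as grayscale images, the nearest-neighbour resizing, and the range of the random split point are all assumptions made for illustration, and the corresponding transformation of bounding box labels is omitted.

```python
import random

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2D list (stands in for random scaling)."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def random_crop(img, rng, min_frac=0.5):
    """Crop a random window covering at least min_frac of each side (assumed range)."""
    h, w = len(img), len(img[0])
    ch = rng.randint(max(1, int(h * min_frac)), h)
    cw = rng.randint(max(1, int(w * min_frac)), w)
    top = rng.randint(0, h - ch)
    left = rng.randint(0, w - cw)
    return [row[left:left + cw] for row in img[top:top + ch]]

def hflip(img):
    """Horizontal flip."""
    return [row[::-1] for row in img]

def mosaic(images, out_size, rng):
    """Combine four images into one out_size x out_size training image."""
    assert len(images) == 4
    order = list(images)
    rng.shuffle(order)  # random arrangement of the four tiles
    # A random split point gives each tile a different size (random scaling).
    cy = rng.randint(out_size // 4, 3 * out_size // 4)
    cx = rng.randint(out_size // 4, 3 * out_size // 4)
    cells = [(0, 0, cy, cx), (0, cx, cy, out_size - cx),
             (cy, 0, out_size - cy, cx), (cy, cx, out_size - cy, out_size - cx)]
    canvas = [[0] * out_size for _ in range(out_size)]
    for img, (top, left, h, w) in zip(order, cells):
        tile = random_crop(img, rng)
        if rng.random() < 0.5:
            tile = hflip(tile)
        tile = resize_nearest(tile, h, w)
        for r in range(h):
            canvas[top + r][left:left + w] = tile[r]
    return canvas

# Usage: four constant-valued 6x6 "images" combined into one 8x8 image.
sources = [[[v] * 6 for _ in range(6)] for v in (1, 2, 3, 4)]
combined = mosaic(sources, 8, random.Random(0))
```

In a real training pipeline, the same crop, flip, scale, and offset would also have to be applied to every bounding box so that the labels stay aligned with the composed image.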

Feature Extraction Network.
The extraction of safety rope features is the key to detecting and identifying the safety rope. In this regard, the Darknet53 network is employed to extract safety rope features. It is based on the Darknet-19 network of YOLO v2 and on the residual network. The network is deepened to 53 layers, and shortcut connections are added between layers. As the network structure deepens, more advanced features of the safety rope can be extracted. However, the computational complexity of the network increases dramatically.
The Darknet53 network introduces Cross-Stage Partial Networks (CSPNet) [29] to achieve a richer combination of gradients while reducing the amount of computation. In particular, CSPNet divides the feature map of the base layer of the Darknet53 network into two parts, where the first part passes through a dense block and the other part passes through a transition layer, and then merges them through a cross-stage hierarchical structure. The main idea is to propagate the gradient flow through different network paths by splitting the gradient flow of the upper layer. Accordingly, CSPNet can reduce the number of inference computations by about 20%. Meanwhile, the memory usage during the feature pyramid generation process can be reduced by up to 75%. The feature extraction network of YOLO v5s uses CSPDarknet53, which transforms the Darknet53 network based on the idea of CSPNet; its network structure is shown in Figure 4.
Since the safety rope may have different sizes in different images, the feature extraction network is required to perform multiscale feature extraction. In this regard, the Feature Pyramid Network (FPN) [30], which is composed of a bottom-up pathway, a top-down pathway, and lateral connections, is applied to generate high-resolution feature maps that contain low-level safety rope contour features with less semantic information and low-resolution feature maps that contain high-level safety rope texture features with rich semantic information.
The bottom-up pathway is the feedforward computation of the backbone network, which computes a feature hierarchy consisting of feature maps at multiple scales with a scaling step of 2.
The top-down pathway generates higher-resolution features by upsampling spatially coarser but semantically stronger feature maps from higher pyramid levels; these features are then enhanced through lateral connections. Each lateral connection merges feature maps of the same size from the bottom-up pathway and the top-down pathway. An example is shown in Figure 5.
The cross-stage partial network (CSPNet) reduces the number of network computations. However, the large number of CBL modules in the feature extraction network still increases the computational expense. In particular, there are 48 CBL modules in the YOLO v5s network structure. Each CBL module consists of a convolutional layer, a batch normalization (BN) layer, and an activation function (Leaky ReLU). The BN layer accelerates network convergence during training. However, it also increases the computational expense of forward inference and affects the model performance. Based on the CBL module, the BN layer is merged into the convolutional layer to form a new BN-Conv layer, which reduces the computation required for forward inference. The BN layer lies between the convolutional layer and the Leaky ReLU activation function. It normalizes each mini-batch of data to a distribution with zero mean and a variance of 1.
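This can be mathematically expressed in the form below:
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \varepsilon}} \qquad (1)$$
Moreover, the BN layer can be expressed as follows:
$$y_i = \gamma \hat{x}_i + \beta \qquad (2)$$
where $\mu$ and $\sigma^2$ are the mean and variance of the mini-batch, and $\varepsilon$, $\gamma$, and $\beta$ are the regularization parameter, scaling factor, and the offset term, respectively. Then the convolutional layer can be written in the form below:
$$x_i = w_i \cdot x \qquad (3)$$
where $x$ is the input of the layer and $w_i$ denotes the weight of the i-th layer. The BN-Conv layer is formed by merging the BN layer into the convolutional layer, where the weight parameter $w_i'$ and bias term $\beta'$ can be calculated from the following expressions:
$$w_i' = \frac{\gamma\, w_i}{\sqrt{\sigma^2 + \varepsilon}}, \qquad \beta' = \beta - \frac{\gamma\, \mu}{\sqrt{\sigma^2 + \varepsilon}} \qquad (4)$$
Finally, the BN-Conv layer can be obtained from the following expression:
$$y_i = w_i' \cdot x + \beta' \qquad (5)$$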
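The merging of the BN layer into the convolutional layer can be checked numerically with a small sketch. It assumes a bias-free 1x1 convolution applied to a single input vector, per-output-channel BN statistics, and inference-time (fixed) mean and variance; the function names `conv1x1`, `batch_norm`, and `fold_bn` are hypothetical and do not come from the YOLO v5 code base.

```python
import math

def conv1x1(weights, x, bias=None):
    """Bias-free 1x1 convolution on one pixel: one dot product per output channel."""
    out = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed per-channel statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]

def fold_bn(weights, gamma, beta, mean, var, eps=1e-5):
    """Merge BN parameters into the convolution weights and a new bias term."""
    fused_w, fused_b = [], []
    for row, g, b, m, s in zip(weights, gamma, beta, mean, var):
        scale = g / math.sqrt(s + eps)            # gamma / sqrt(var + eps)
        fused_w.append([scale * w for w in row])  # fused weight
        fused_b.append(b - scale * m)             # fused bias
    return fused_w, fused_b

# Usage: the fused layer reproduces conv followed by BN in a single step.
w = [[0.5, -1.0], [2.0, 0.25]]
gamma, beta = [1.5, 0.8], [0.1, -0.2]
mean, var = [0.3, -0.4], [0.9, 1.6]
x = [1.0, 2.0]
reference = batch_norm(conv1x1(w, x), gamma, beta, mean, var)
fw, fb = fold_bn(w, gamma, beta, mean, var)
fused = conv1x1(fw, x, fb)
```

Because the fused weights and bias are computed once after training, the inference path performs one multiply-accumulate pass per layer instead of a convolution followed by a separate normalization.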

Prediction Network.
In the safety rope prediction network, three feature maps at different scales from the feature extraction network are used to generate prediction boxes and the corresponding category predictions. Then a unique prediction box for each target is obtained through nonmaximum suppression (NMS). Furthermore, the loss function is an essential part of the model training process and the prediction network. It affects the convergence speed and the prediction error of the model. The loss function of YOLO v5s consists of three parts, namely classification loss, confidence loss, and bounding box loss. The bounding box loss function is the Generalized Intersection over Union (GIoU) [31], which is an extension of Intersection over Union (IoU).
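The role of NMS in reducing overlapping candidates to a unique prediction box can be sketched as follows. This is the standard greedy variant, not necessarily the exact procedure used in YOLO v5s; the box format `(x1, y1, x2, y2)` and the threshold value of 0.5 are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep  # indices of the retained boxes, best score first

# Usage: the second box overlaps the first heavily and is suppressed.
candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(candidates, [0.9, 0.8, 0.7])
```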
IoU is the most commonly used indicator in target detection tasks. In anchor-based methods, the main role of IoU is to determine the positive and negative samples of the safety rope and to evaluate the distance between the output box and the ground truth. IoU can be mathematically expressed in the form below:
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|} \qquad (6)$$
where A and B are the blue target box and the yellow prediction box in Figure 6, respectively. Moreover, $A \cap B$ and $A \cup B$ denote the intersection and union of A and B, respectively. Equation (6) indicates that the definition of the IoU loss function is relatively simple. However, this definition has two shortcomings: (1) When the target box and the prediction box do not intersect, equation (6) gives IoU = 0. This result cannot reflect the distance between the two boxes, so the loss function has no gradient and cannot be optimized. (2) When the target box and the prediction box each keep the same size, the IoU value is the same regardless of the relative position of A and B as long as $A \cap B$ is constant. Consequently, the IoU loss function cannot distinguish how the two boxes intersect. To resolve these two shortcomings, the GIoU loss function has been introduced into YOLO v5s.
$$\mathrm{GIoU} = \mathrm{IoU} - \frac{|A_c - (A \cup B)|}{|A_c|} \qquad (7)$$
where $A_c$ is the area of the smallest box that encloses the two boxes (the red box in Figure 7). Unlike IoU, which only considers the overlapping area, GIoU also considers the nonoverlapping areas. Accordingly, GIoU can better reflect the degree of overlap between the two boxes. It is worth noting that when the prediction box is inside the target box, $A_c$ and $A \cup B$ are equal, GIoU reduces to IoU, and the relative position of the two boxes cannot be distinguished. Although the foregoing shortcomings of the IoU loss function are resolved in the GIoU function, some other problems, including slow convergence and inaccurate regression [32], remain to be resolved. Accordingly, the Distance-IoU (DIoU) loss function is adopted in the present article to replace the GIoU loss function and resolve these problems.
$$\mathrm{DIoU} = \mathrm{IoU} - \frac{\rho^2(b, b^{gt})}{c^2} \qquad (8)$$
where $b$ and $b^{gt}$ are the center points of the prediction box and the target box, respectively. Moreover, $\rho(\cdot)$ denotes the Euclidean distance, and $c$ denotes the diagonal length of the minimum enclosing area $A_c$. In the DIoU function, the overlap area and the center point distance are considered simultaneously. When the target box encloses the prediction box, the distance between the two boxes is measured directly, so the DIoU loss function converges faster. In this regard, the black arrow in Figure 8 represents the diagonal length $c$ of the minimum enclosing area $A_c$, and the red arrow represents the Euclidean distance between the center point of the prediction box and the center point of the target box. In summary, the overlapping area and the distance between the center points of the two boxes are considered in the DIoU loss function. This loss function is expected to improve the regression accuracy of the prediction box and increase the convergence rate.
Journal of Advanced Transportation
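The differences among equations (6) to (8) can be made concrete with a short sketch that evaluates all three measures for the same pair of boxes. The box format `(x1, y1, x2, y2)` and the function name `box_metrics` are assumptions for illustration, and both boxes are assumed to have positive area; the corresponding loss values would be 1 minus each measure.

```python
def box_metrics(a, b):
    """Return (IoU, GIoU, DIoU) for two boxes given as (x1, y1, x2, y2)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest box enclosing both boxes (A_c in the text).
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclose - union) / enclose

    # Squared distance between box centres and squared enclosing diagonal.
    rho2 = (((a[0] + a[2]) - (b[0] + b[2])) ** 2
            + ((a[1] + a[3]) - (b[1] + b[3])) ** 2) / 4.0
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    diou = iou - rho2 / c2
    return iou, giou, diou

# Disjoint boxes: IoU is 0, while GIoU and DIoU still reflect the separation.
disjoint = box_metrics((0, 0, 2, 2), (3, 0, 5, 2))

# Prediction inside the target at two positions: IoU and GIoU are identical,
# but DIoU is larger for the prediction closer to the target centre.
target = (0, 0, 4, 4)
near_centre = box_metrics(target, (1, 1, 2, 2))
in_corner = box_metrics(target, (0, 0, 1, 1))
```

The second example corresponds to the case discussed above in which GIoU reduces to IoU: only the center-distance term of DIoU distinguishes the two placements of the prediction box.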

Dataset.
The training and testing results of the algorithm depend on the data set. However, no public crew safety rope data set is available. Therefore, a self-made crew safety rope data set is created in this study. Two data sources, namely onboard surveillance video and web crawlers, were prepared prior to the tests. In total, 3,150 images and 6,583 targets are gathered in the dataset. After images are extracted from the ship's surveillance video, the images that include crew members are screened out and labelled. Then the labelled targets are divided into two categories: safety harness, with 1,250 images, and dangerous cases, with 1,900 images. For training, the data set is divided into a training set, a validation set, and a test set at a ratio of 16:4:5. Figure 9 shows some images from the prepared dataset.
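For the 3,150 images and the 16:4:5 ratio stated above, the resulting subset sizes can be worked out with a small sketch. The random shuffling, the seed, and the function name `split_dataset` are assumptions; the article does not state how images were assigned to the subsets.

```python
import random

def split_dataset(items, ratios=(16, 4, 5), seed=0):
    """Shuffle items and split them into train/validation/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # assumed: random assignment
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return train, val, test

# Usage with image indices 0..3149 standing in for the image files.
train, val, test = split_dataset(range(3150))
sizes = (len(train), len(val), len(test))
```

With these inputs the three subsets contain 2,016, 504, and 630 images, respectively.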

Evaluation Indicators and Experimental Platform.
The performance of the algorithm is assessed with objective evaluation indicators. In the present article, precision, recall, mean average precision (mAP), and frame rate are considered as the primary performance evaluation indicators.
Precision and recall are used to evaluate the algorithm's accuracy in classifying the target and its capability to find the target. These indicators are defined in equations (9) and (10), respectively. The precision-recall (P-R) curve can be drawn by taking the recall as the X-axis and the precision as the Y-axis. Average precision (AP) is the area under the P-R curve of a specific category. Moreover, the mean average precision (mAP) is the average of the areas under the P-R curves of all categories. It should be indicated that mAP reflects the average detection accuracy over all categories and can be used to evaluate the performance of the DIoU loss function.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (9)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (10)$$
where TP (True Positive) refers to the number of positive samples predicted as positive, FP (False Positive) refers to the number of negative samples incorrectly predicted as positive, and FN (False Negative) refers to the number of positive samples incorrectly predicted as negative.
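The indicators defined above can be computed as in the following sketch. The AP routine uses one common convention (the area under the precision envelope of detections sorted by confidence); the article does not state which interpolation was used, so this choice, the input format, and the function names are assumptions.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(is_true_positive, num_ground_truth):
    """Area under the P-R curve for one category.

    is_true_positive: one flag per detection, sorted by descending confidence.
    """
    tp = fp = 0
    points = []  # (recall, precision) after each detection
    for flag in is_true_positive:
        if flag:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_ground_truth, tp / (tp + fp)))
    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        # Precision envelope: best precision at this recall level or beyond.
        best_precision = max(p for _, p in points[i:])
        ap += (recall - prev_recall) * best_precision
        prev_recall = recall
    return ap

def mean_average_precision(ap_per_category):
    """mAP: the mean of the per-category AP values."""
    return sum(ap_per_category) / len(ap_per_category)

# Usage: 9 correct detections, 1 false alarm, 3 missed targets.
p, r = precision_recall(tp=9, fp=1, fn=3)
ap = average_precision([True, False, True], num_ground_truth=2)
```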
All experiments are carried out on a personal computer with an Intel(R) Core(R) 7 @ 3.60 GHz CPU, 16 GB of memory, an NVIDIA GeForce GTX 1070 graphics card, CUDA v10.1, the 64-bit Windows 7 operating system, and the PyTorch 1.7 deep learning framework.

Analysis of Experimental Results.
To evaluate the advantages of the proposed algorithm, the obtained results are compared with those of the standard YOLO v5s, SSD, and EfficientDet algorithms on the same platform. Figure 10 shows sample results of the algorithm training process. It is observed that, after 250 iterations, the average loss value levels off and approaches an asymptotic value, indicating rapid convergence of the algorithm. The precision-recall (P-R) curve in Figure 10(b) shows that, at a recall of 96%, the precision can reach 90%. Moreover, Figure 10 compares the bounding box loss of the standard and the improved YOLO v5s. It is observed that the modified bounding box loss outperforms the original one in terms of convergence rate and regression accuracy.

Comparison with Standard Algorithms.
To verify the detection effect of the proposed algorithm, the detection results obtained from the standard YOLO v5s and the proposed YOLO-SD algorithms are compared. Table 1 shows that the proposed YOLO-SD algorithm improves the precision, recall, and mean average precision. Moreover, the detection speed is increased by 7.3%.

Comparisons with Other Algorithms.
To further evaluate the performance of the proposed algorithm, the detection performance of the YOLO-SD, SSD, and EfficientDet [27] algorithms is compared. The test results are presented in Table 2. It is observed that the SSD algorithm performs well, with an mAP value of 89.3%, but its detection frame rate is only about one-quarter of that of the proposed YOLO-SD algorithm. Meanwhile, the EfficientDet-d0 algorithm has the highest detection speed, up to 59.30 fps, while that of the proposed YOLO-SD algorithm is 38.31 fps. However, the mAP value of the YOLO-SD algorithm is much higher than that of the EfficientDet-d0 algorithm. The EfficientDet-d1 algorithm has a higher mAP value than the EfficientDet-d0 algorithm, but there is still a remarkable gap between it and the proposed YOLO-SD algorithm. Moreover, the detection speed of EfficientDet-d1 is lower than that of the proposed algorithm. Since the proposed YOLO-SD algorithm balances detection precision and detection speed, it is better suited to the crew safety rope detection task.
Figure 11 shows some of the crew safety detection results. From left to right, the distance between the crew and the camera increases and the imaged size of the crew gradually decreases. Each image has a different resolution. It is found that the SSD algorithm has one misdetection and two false detections. Moreover, the EfficientDet-d1 algorithm has four missed detections. Meanwhile, the proposed algorithm has only one misdetection. Table 3 indicates that the average confidence values of the SSD, EfficientDet-d1, and YOLO-SD algorithms are 78.33, 47.13, and 78.63, respectively. It is concluded that the proposed algorithm has the highest average confidence value and the most concentrated distribution of confidence values. Meanwhile, it has the best detection stability across different image resolutions and target sizes. Accordingly, it is inferred that the proposed algorithm has good robustness.
Figure 12 shows sample results of the surveillance video detection process. It is observed that the proposed YOLO-SD algorithm achieves good results in the detection process. Although the safety rope on the crew member's back was partially occluded by an iron bar at certain moments (see Figure 12, Frame 350), the YOLO-SD algorithm can still detect the safety rope.

Anti-Interference Test.
To evaluate the algorithm's detection stability, tests are conducted on images taken under different working conditions, such as low light, night, rain, and fog. The obtained results are presented in Figure 13. To quantitatively analyse the algorithm's robustness to image pollution, Perlin Noise [33] is added to the test images. Applying Perlin Noise is a widely adopted method for generating specific texture noise or for augmenting data sets [34,35]. Figure 14 shows test set images polluted by Perlin Noise at coverage rates of 10%, 20%, 30%, 40%, 50%, and 60%. The test results are shown in Table 4. It is found that when 40% noise is added to the original image, the decline in recall and mAP is less than 6%. Moreover, when the noise coverage increases to 50% and 60%, the corresponding drop in recall and mAP increases to 21% and 35%, respectively. However, the detection precision remains at a high level.
Figure 15 shows the detection demo software based on the PyQt5 framework. A surveillance camera at the port is used to test the software's functions. When the start button is pressed, the program detects the crew in the surveillance video, marks each detected crew member with a bounding box, and records the detection information in the log box. When a detected crew member without a safety rope enters the surveillance area, an alarm is issued to notify the supervisor.
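The alarm decision described above can be sketched as a simple rule applied to the detections of each frame. This is a hypothetical illustration rather than the demo software's code: the label string `"dangerous"`, the rectangular work area, the use of the box center as the position test, and the confidence threshold are all assumptions.

```python
def unsafe_crew_in_area(detections, work_area, min_confidence=0.5):
    """Return the detections that should trigger an alarm.

    detections: iterable of (label, confidence, (x1, y1, x2, y2)).
    work_area:  (x1, y1, x2, y2) rectangle of the monitored area (assumed shape).
    """
    ax1, ay1, ax2, ay2 = work_area
    alerts = []
    for label, confidence, box in detections:
        x1, y1, x2, y2 = box
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0  # box centre
        inside = ax1 <= cx <= ax2 and ay1 <= cy <= ay2
        if label == "dangerous" and confidence >= min_confidence and inside:
            alerts.append((label, confidence, box))
    return alerts

# Usage: one frame with a protected crew member and an unprotected one.
frame_detections = [
    ("safety_harness", 0.91, (100, 80, 160, 220)),
    ("dangerous", 0.84, (300, 90, 360, 230)),
    ("dangerous", 0.88, (900, 90, 960, 230)),  # outside the work area
]
alerts = unsafe_crew_in_area(frame_detections, work_area=(0, 0, 640, 480))
if alerts:
    print("ALARM: crew member without safety rope in work area")
```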

Discussion.
An automatic system for identifying the non-use of safety ropes provides an effective means to reduce the risk of falling and to improve safety in a ship's high-altitude operations. The main objective of the present article is to employ a target recognition method to detect whether the safety rope is worn or not. In this regard, extensive experiments on the self-built dataset have been carried out, which demonstrate the effectiveness of the proposed method.
It is found that CNN-based safety rope wearing detection methods are reliable and stable under a wide range of on-site conditions, such as visual range, individual posture, and occlusions. Through a series of improvements to the original algorithm, the YOLO-SD algorithm improves the mAP (91.5% vs. 89.1%). By applying the new BN-Conv modules, the network inference speed increases by 7.3% (38.31 fps vs. 35.71 fps). Experimental results show that, for the same training and testing dataset, the proposed YOLO-SD algorithm outperforms the SSD method by 8.8% (97.4% vs. 89.5%) in detection precision and by 312.4% (38.31 fps vs. 9.29 fps) in detection speed. Moreover, the SSD model has a larger weight file and more parameters, so it requires more memory space and more powerful equipment. The proposed YOLO-SD algorithm is slower than the EfficientDet-d0 algorithm (38.31 fps vs. 59.30 fps) and faster than the EfficientDet-d1 algorithm (38.31 fps vs. 26.93 fps). However, the mAP value of the YOLO-SD algorithm is significantly higher than that of the EfficientDet-d1 (91.5% vs. 48.0%) and EfficientDet-d0 (91.5% vs. 15.2%) algorithms.
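The percentage gains quoted above are relative improvements over the baseline value, which the following short calculation reproduces from the figures given in the text. The helper name `relative_gain` is introduced here only for illustration.

```python
def relative_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100.0

# Inference speed: YOLO-SD vs. standard YOLO v5s (fps).
speed_vs_yolov5s = round(relative_gain(38.31, 35.71), 1)
# Detection speed: YOLO-SD vs. SSD (fps).
speed_vs_ssd = round(relative_gain(38.31, 9.29), 1)
# Detection precision: YOLO-SD vs. SSD (percent values).
precision_vs_ssd = round(relative_gain(97.4, 89.5), 1)

print(speed_vs_yolov5s, speed_vs_ssd, precision_vs_ssd)
```

Note that the 8.8% precision gain is a relative figure; the absolute difference between 97.4% and 89.5% is 7.9 percentage points.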
Table columns: Method, Figure 11(a), Figure 11(b), Figure 11(c), Figure 11(d).
Note. The SSD algorithm incorrectly detects the crew on the right side of Figure 11(b), and the crew has two wrong classification boxes.
Figure 15: Detection demo software (real-time detection screen and log).

Conclusion
In the present study, an algorithm is designed to overcome the shortcomings of manual supervision of safety rope wearing at ship operation sites. The algorithm achieves high detection accuracy and speed under different resolutions and target sizes. The experiments show that, compared with state-of-the-art target detection algorithms, the YOLO-SD algorithm has a good detection effect, is robust, and runs in real time. The present article focuses only on the accuracy and speed of crew safety rope wearing detection; porting the algorithm to artificial intelligence chips or edge computing devices has not yet been investigated.

Data Availability
Some or all data, models, or codes that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.