Research on ship detection in optical remote sensing images based on YOLO V5

To detect ship targets in remote sensing images with complex backgrounds, this paper proposes a ship target detection algorithm based on YOLO V5. First, an improved K-means algorithm is used to cluster the data set; then DIOU_loss is used for the candidate frame selection strategy; finally, a channel attention mechanism is introduced. The experimental results show that, under the same conditions, the improved network is slightly slower than the original YOLOv5 network, while its detection performance on images with densely distributed small targets is significantly improved. On the same test set, the precision decreases by 3.84%, the recall increases by 32.73%, and mAP increases by 22.48%.


Introduction
With the continuous development of satellite technology, it has become common to use satellites to monitor the conditions of sea areas in real time. Ship target detection in remote sensing images has broad application prospects in both civil and military fields [1]. In civilian use, it can help find ships missing at sea, monitor marine fisheries, and monitor illegal ships such as pirate ships and ferry boats. In military use, it can locate enemy ships and guide long-range missiles to strike accurately; it can also monitor sea areas in wartime for illegal intrusion by warships. Because the images are captured from a long distance, most of the objects to be detected in remote sensing images appear as small targets [2]. A small target loses much of its feature information after being transformed by the model. The recognition of small targets in remote sensing images is therefore a hot topic with high research significance.
At present, ship detection methods for remote sensing images fall into two categories: traditional methods and deep learning-based methods. Deep learning is currently the mainstream approach to optical remote sensing image detection. Among two-stage target detection algorithms, Gu Jiaojiao et al. used Faster RCNN and redesigned the number and size of its anchor frames, which effectively alleviated the repeated detection of objects in remote sensing images [3]. Among single-stage detection algorithms, Ma Junjie et al. used the YOLO network to establish the nonlinear observation equation of the remote sensing ship target, transformed it into a linear equation, and then solved for the position coordinates of the ship, which improved the positioning accuracy [4].
In summary, this article improves the YOLO v5s network to address the poor recognition of dense small targets in remote sensing images. First, the improved K-means is used to cluster the data set; next, the selection strategy of the target frame is improved; finally, a channel attention mechanism is introduced. The experimental results show that, compared with the original YOLOv5 network, the improved YOLOv5s network performs better in the task of ship target recognition in remote sensing images.
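The text does not spell out how the improved K-means differs from the standard algorithm. As an illustration only, the following minimal sketch clusters labelled box sizes into anchors using the distance d = 1 - IoU, which is a common choice for anchor clustering and is an assumption here, not the variant used in this article. The function names `wh_iou` and `kmeans_anchors` and the sample box sizes are hypothetical.

```python
import random


def wh_iou(box, centroid):
    """IoU of two (width, height) boxes that share the same top-left corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs into k anchors with distance d = 1 - IoU (assumed)."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # initialise from k distinct boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            best = max(range(k), key=lambda j: wh_iou(box, centroids[j]))
            clusters[best].append(box)
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_centroids.append((w, h))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    # Return anchors ordered from smallest to largest area.
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Hypothetical box sizes in pixels, for illustration only.
boxes = [(10, 12), (11, 13), (12, 11), (40, 20), (42, 22), (38, 19), (90, 80), (95, 85)]
anchors = kmeans_anchors(boxes, k=3)
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which matters when most ship targets are small.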

Ship remote sensing detection model based on YOLO V5
2.1. YOLO V5 introduction
YOLO v5 currently has four versions: YOLO V5s, YOLO V5m, YOLO V5l and YOLO V5x. Among them, YOLO v5s is the smallest and fastest version of the model, and it is the version used in this article. Structurally, YOLOv5 can be divided into 4 modules: the input terminal, the backbone network (Backbone), the Neck, and the prediction terminal (Prediction). The network structure is shown in Figure 1. The input stage of YOLOv5 introduces preprocessing operations such as Mosaic data enhancement, adaptive image scaling, and adaptive anchor point calculation [5]. Adaptive scaling keeps the proportions of the image from being deformed when the image is resized. The Backbone stage of YOLOv5 adopts the Focus and CSPDarknet53 structures, which reduce the amount of calculation while maintaining accuracy. The Neck structure mainly generates feature pyramids. It produces pooled feature vectors of different fixed sizes and strengthens the expressive ability of the features, which helps to detect targets whose sizes vary widely.
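The adaptive image scaling mentioned above can be illustrated with a short calculation. The sketch below is an approximation of the idea, not the YOLOv5 source code: it scales the image so that its longer side matches the target size, then pads the shorter side only up to the next multiple of the network stride. The helper name `letterbox_params` and the default values `target=640` and `stride=32` are assumptions made for this example.

```python
def letterbox_params(width, height, target=640, stride=32):
    """Return the resized (w, h) and the per-side padding (pad_w, pad_h).

    Hypothetical helper: the aspect ratio is preserved, and the padding is
    reduced to the minimum that makes each side a multiple of `stride`.
    """
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    # Only the remainder modulo the stride has to be padded.
    pad_w = (target - new_w) % stride
    pad_h = (target - new_h) % stride
    # The padding is split evenly between the two sides of the image.
    return (new_w, new_h), (pad_w / 2, pad_h / 2)


size, pad = letterbox_params(1280, 720)
# size == (640, 360), pad == (0.0, 12.0): the padded image is 640 x 384.
```

Padding to a stride multiple instead of a full square reduces the number of blank pixels the network has to process.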

NMS improvements
Non-Maximum Suppression (NMS) is mainly used to eliminate redundant candidate frames. During operation, the target detection model generates many target frames, and a single target is often covered by several of them. The role of NMS is to eliminate the redundant frames so that each target retains only the frame with the highest confidence. However, when small targets lie very close to each other, the original NMS algorithm may eliminate the frame of a neighboring target, which greatly reduces the detection of ships moored in port terminals. This article therefore introduces DIoU-NMS. Unlike the original NMS, DIoU-NMS considers not only the value of IoU but also the distance between the center points of the two boxes. A new formula is used to determine whether a box is deleted:

$$s_i = \begin{cases} s_i, & IoU - R_{DIoU}(M, B_i) < \varepsilon \\ 0, & IoU - R_{DIoU}(M, B_i) \ge \varepsilon \end{cases} \tag{1}$$

Here $s_i$ is the confidence score of candidate box $B_i$, $M$ is the box with the highest score, and $\varepsilon$ is the NMS threshold. $R_{DIoU}$ is the normalized distance between the center points of the two boxes, expressed by formula (2):

$$R_{DIoU} = \frac{\rho^2(b, b^{gt})}{c^2} \tag{2}$$

where $\rho(\cdot)$ is the Euclidean distance, $b$ and $b^{gt}$ represent the center points of the two boxes, and $c$ is the diagonal length of the smallest box containing the two boxes.
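The suppression rule can be sketched in a few lines. The example below is a minimal illustration that assumes boxes in (x1, y1, x2, y2) format and a threshold named `eps`. The function names and the three sample boxes are hypothetical and are not taken from the experiments in this article.

```python
def iou_and_diou_penalty(a, b):
    """Return (IoU, R_DIoU) for two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area_a + area_b - inter)
    # Squared distance between the two box centers.
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
        + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
        + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou, rho2 / c2


def diou_nms(boxes, scores, eps=0.5):
    """Return the indices of the boxes kept by DIoU-NMS, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)  # box with the highest remaining score
        keep.append(m)
        remaining = []
        for i in order:
            iou, penalty = iou_and_diou_penalty(boxes[m], boxes[i])
            if iou - penalty < eps:  # far enough apart: not suppressed
                remaining.append(i)
        order = remaining
    return keep


# Two near-duplicate boxes on one ship and a third box on a neighboring ship.
boxes = [(0, 0, 10, 2), (0.5, 0, 10.5, 2), (3, 0, 13, 2)]
scores = [0.9, 0.8, 0.7]
keep = diou_nms(boxes, scores, eps=0.5)  # keeps boxes 0 and 2
```

In this example the third box has an IoU of about 0.54 with the first, so plain NMS with a threshold of 0.5 would delete it. The center-distance penalty lowers the value below the threshold, and the neighboring ship is retained.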

Channel attention mechanism
Convolutional neural networks use the idea of local receptive fields to fuse spatial information and channel information when extracting image features. The channel attention mechanism (Squeeze-and-Excitation (SE), SENet) improves the representation ability of the network by modeling the dependence between channels. It adjusts the features channel by channel, so that the network can learn to use global information to selectively enhance features that contain useful information and suppress useless features [6].

Figure 2. Structure diagram of YOLO v5s with SENet introduced
The algorithm structure is composed of three parts, Squeeze, Excitation and Reweight, and it explicitly constructs the interdependence between feature channels [7]. Specifically, it recalibrates the feature channels. While learning features, the network also learns the contribution of each channel to the overall feature, and then uses this contribution value to enhance useful features and suppress features that contribute little to the task [8].
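The three steps can be illustrated on a tiny feature map. The sketch below follows the usual SE formulation (global average pooling, two fully connected layers with ReLU and sigmoid, then channel-wise scaling) and uses plain Python lists so that it is self-contained. The function name `se_reweight`, the omission of biases, and the hand-picked weights are assumptions made for illustration; they are not the trained parameters of the network in this article.

```python
import math


def se_reweight(feature_map, w1, w2):
    """Apply Squeeze, Excitation and Reweight to a C x H x W nested list.

    w1 has shape (C/r) x C and w2 has shape C x (C/r); biases are omitted.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one weight per channel.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    scale = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Reweight: multiply every value of a channel by that channel's weight.
    out = [[[v * s for v in row] for row in ch] for ch, s in zip(feature_map, scale)]
    return out, scale


# Two channels of size 2 x 2 and reduction ratio r = 2 (one hidden unit).
feature_map = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[0.5, 0.5]]
w2 = [[1.0], [-1.0]]
out, scale = se_reweight(feature_map, w1, w2)
```

With these weights the first channel receives a weight close to 0.88 and the second a weight close to 0.12, which shows how informative channels are enhanced and uninformative ones are suppressed.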

Data set
The data set of this experiment comes from public remote sensing image data sets and image data collected from the Internet. The 2432 pictures in the training set and the 57 pictures in the validation set are selected from the public data set HRRSD. The test set is drawn from the aerospace remote sensing target detection data set annotated by Northwestern Polytechnical University (NWPU) and from image data produced by the authors.

Experimental platform
The deep learning framework used in this experiment is PyTorch, and the operating platform is Ubuntu 18. The machine has 16 GB of running memory and a GTX 1660 Ti graphics card with 6 GB of video memory; the GPU acceleration environment is version 10.0.

Evaluation Index
Part of the test set used in this article consists of images that contain only a single target. As a result, when the mAP value is calculated, the maximum precision corresponding to each recall level is 1, so mAP alone cannot effectively evaluate this experiment. The evaluation indicators of this experimental model are therefore mainly precision (Precision, P), recall (Recall, R) and F-score; precision and recall are calculated as shown in equations (3) and (4).

$$P = \frac{TP}{TP + FP} \tag{3}$$

$$R = \frac{TP}{TP + FN} \tag{4}$$

In the formulas, TP means that a ship is correctly detected; FP means that a target that is not a ship is mistakenly detected as a ship; TN means that the target is not a ship and the detection result is not a ship; FN means that a ship is not detected as a ship (it is missed or mistakenly detected as another target). The mean precision AP is the area value

Analysis of experimental results
After optimizing the loss function of YOLO V5 and the screening function of the prediction box and introducing the channel attention mechanism, the improved detection result is shown in Figure 11. With CIOU_LOSS as the bounding box loss function of the predicted target and DIOU_NMS as post-processing, Figure 11 shows that ships in densely arranged ports and in occluded locations can be effectively identified, and the confidence scores are also improved.

Figure 3: Improved YOLO V5 test results

There are a total of 30 pictures in the test set, which contain a total of 1437 boats. Using the improved YOLO V5 for detection, a total of 1 ships were detected, 1320 ships were correctly detected, 13 ships were falsely detected, and 122 ships were missed. For experimental comparison, the Faster RCNN, SSD, YOLO v4 and YOLOv5s models were evaluated on the same test set; the detection performance of the different models is shown in Table 1. Table 1 shows that the improved algorithm performs considerably better than the original YOLOv5s: more small targets are detected, although the number of false detections also increases. The recall increased by 32.73%, the precision decreased by 3.84%, and mAP increased by 22.48%. Similar changes are observed in comparison with the other algorithms. Overall, the performance of the improved model is significantly improved.
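The detection counts reported above can be converted into the evaluation indicators with a short calculation. The sketch below assumes that the 1320 correct detections are TP, the 13 false detections are FP, and the 122 missed ships are FN. It also assumes that the F-score is the harmonic mean of precision and recall (F1), since the text does not give its formula. The function name `detection_metrics` is hypothetical, and the values obtained this way are not guaranteed to match Table 1.

```python
def detection_metrics(tp, fp, fn):
    """Return (precision, recall, F1) from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is assumed here: the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts reported for the improved YOLO V5 on the 30-image test set.
p, r, f1 = detection_metrics(tp=1320, fp=13, fn=122)
# p is about 0.990, r is about 0.915, f1 is about 0.951
```

The high precision together with a lower recall reflects the counts above: few false detections (13) relative to missed ships (122).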