ShipYOLO: An Enhanced Model for Ship Detection



Introduction
With the rapid development of deep learning in recent years, more and more deep learning techniques have been applied to intelligent ships [1,2]. In 2020, Pan et al. proposed RMA, a fine-grained classification model based on deep learning [3], which identifies navigation marks and provides accurate navigation mark information for intelligent ships. In 2021, Du et al. developed an intelligent navigation mark recognition system using deep learning technology [4], which provided an effective solution for intelligent ships. The vision system, which uses computer vision technology to identify ships, navigation marks, and obstacles in the navigation environment, has become an essential part of the intelligent ship perception system [5]. Therefore, an effective ship detection model is of great significance for improving the safety of intelligent ships.
Researchers have proposed many traditional object detection models. These models mainly rely on region selection [6], feature extraction [7], and classification [8]. In 2006, Dalal and Triggs proposed the HOG algorithm [9], which builds features by computing histograms of gradient directions over local image regions. Subsequently, Felzenszwalb et al. proposed the DPM algorithm [10], which produces excitation templates for image features and determines the target's location according to the distribution of excitation. However, object detection predicts many redundant bounding boxes. In response to this problem, Neubeck and Van Gool proposed the NMS algorithm [11] to eliminate redundant boxes. This idea is also widely used in deep learning object detection models. Traditional object detection models are limited in many respects, and they cannot represent image features well.
The rise of deep learning in 2012 has had a massive impact on many fields, and object detection is no exception. The large number of parameters in a deep neural network allows it to extract features with better robustness and semantic relevance, and the performance of the classifier is also superior. Therefore, object detection models based on deep learning can better learn the characteristics of the image. Such models mainly exist in two forms, two-stage and one-stage. The main difference is whether the position of the object's bounding box and its category are predicted in one step. In 2014, Girshick et al. combined region candidates with CNNs to propose the two-stage detection model R-CNN [12], which opened the chapter of deep learning for object detection. Based on R-CNN, Girshick proposed Fast R-CNN [13], which realizes end-to-end detection and shared convolution. In 2015, Ren et al. proposed the Faster R-CNN [14] object detection model. It introduced anchor boxes and the region proposal network, which significantly improved the detection accuracy of the R-CNN series and won first place in many tracks of the ILSVRC and COCO competitions.

In 2018, Redmon and Farhadi proposed YOLO-V3 [15], which added many excellent ideas to the network, such as residual connections [16], multilayer feature maps [17], and the removal of pooling layers. While preserving the detection speed of the YOLO series, it improved the detection accuracy of the model. With the continuous improvement of deep learning technology, more and more methods have been proposed to enhance object detection accuracy from different angles. In 2020, in order to improve the detection of analog instruments, Huang et al. proposed an improved YOLO-V3 algorithm for robot-based inspection [18], which can effectively locate the instrument and has a good detection effect. Also in 2020, based on the original YOLO-V3, Bochkovskiy et al. integrated the best optimization strategies from recent CNN research, including data processing, backbone network, network training, activation function, and loss function, and proposed a better one-stage object detection model, YOLO-V4 [19]. Compared with YOLO-V3, the YOLO-V4 model uses richer data augmentation methods, including Mosaic data augmentation and self-adversarial training (SAT). On the basis of the backbone network of YOLO-V3, the Mish activation function and the idea of CSPNet are introduced to improve the feature extraction of the backbone network. The SPP module is added behind the backbone network to further increase the receptive field of the model and further improve the detection effect.
Similarly, researchers have proposed many ship detection models based on deep learning. Like general object detection models, ship detection models also come in two-stage and one-stage forms. Li et al. proposed a SAR image ship detection model based on an improved Faster R-CNN [20]. As a two-stage detector, it improves on the original detection accuracy, but the proposal filtering and ROI pooling operations limit the speed of the model, and it is difficult to achieve real-time detection. Wang et al. studied the application of the SSD object detection model to ship detection under complex backgrounds [21] and used transfer learning to improve detection accuracy and overall performance. However, the single feature extraction network and FPN structure did not fully consider the detection of small-scale ships. Chen et al. used the attention mechanism to propose an improved YOLO-V3 (ImYOLO-V3) [22]. Embedding the attention module into YOLO-V3 effectively improved detection accuracy, but the speed of the model was not further optimized. Jie et al. introduced the K-means clustering algorithm and the soft non-maximum suppression algorithm to adapt YOLO-V3 to the ship scene [23], but these improvements are engineering tuning techniques and do not address the accuracy problem of ship detection from the perspective of model construction. Shan et al. combined camera and inertial sensor data and proposed a new marine target detection algorithm based on camera motion posture [24]. This algorithm uses region candidates and edge detection to optimize the detection algorithm and improve the accuracy of ship detection. However, it still uses a traditional image enhancement method, and its detection rate does not meet the requirements of the actual scenes faced by intelligent ships. In 2020, Li et al. proposed LSDM, an improved ship detection algorithm based on YOLO-V3 and DenseNet [25], which reduced the model parameters to 1/3 of the original YOLO-V3. However, its backbone network uses a large number of densely connected structures, and this design still limits the inference speed of the model.
In summary, current ship detection models still suffer from poor detection speed and missed detection of small-scale ships. First, in order to improve the detection speed and allow the ship detection model to run in real time, this paper optimizes the backbone network of YOLO-V4. While maintaining accuracy, the number of model parameters is reduced, and the inference speed of the model is effectively improved. Second, in order to solve the problem of missed detection of small-scale ships, this paper designs a new receptive field amplification module and combines the attention mechanism to optimize the original feature pyramid of YOLO-V4, which effectively improves the detection of small-scale ships. In the end, we obtain ShipYOLO, a faster and more accurate model for ship detection.

Methods
The YOLO-V4 model consists of a backbone network (CSPDarknet53), a receptive field amplification module (SPP), a feature pyramid (PAFPN), and a detection head (YOLOhead) (see Figure 1). The backbone network (CSPDarknet53) uses the CSP module, composed of ResUnit components, as the feature extraction part of the overall structure. The receptive field amplification module (SPP) uses pooling layers of different sizes to fuse features of different scales and thereby amplify the receptive field. The feature pyramid module (PAFPN) follows PANet and obtains a two-layer pyramid structure. Although YOLO-V4 has good detection results overall, it was not designed specifically for ship detection, so this paper makes targeted improvements to YOLO-V4.
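The pooling-and-fusion idea behind the SPP module can be illustrated with a minimal NumPy sketch. It assumes stride-1 max pooling with "same" padding and concatenation along the channel axis; the kernel sizes (5, 9, 13) and the function names are illustrative assumptions, not values taken from this paper.

```python
import numpy as np

def max_pool_same(x, k):
    """Stride-1 max pooling with 'same' padding on a float (C, H, W) array; k must be odd."""
    pad = k // 2
    _, h, w = x.shape
    # Pad with -inf so that border windows ignore the padded area.
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), constant_values=-np.inf)
    out = np.full_like(x, -np.inf)
    for dy in range(k):
        for dx in range(k):
            out = np.maximum(out, padded[:, dy:dy + h, dx:dx + w])
    return out

def spp(x, kernel_sizes=(5, 9, 13)):
    """Concatenate the input with its pooled versions along the channel axis.

    kernel_sizes is an assumed configuration chosen for illustration.
    """
    return np.concatenate([x] + [max_pool_same(x, k) for k in kernel_sizes], axis=0)
```

Because every pooled map keeps the spatial size of the input, the maps can be stacked directly, and each output position sees context from windows of several sizes.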

Journal of Advanced Transportation

Backbone Network Based on Structured Reparameterization (RCSPDarknet).
The original ResUnit component of YOLO-V4 [16] is a typical multibranch structure (see Figure 2(a)), where CBM_N is composed of an N × N convolution, batch normalization, and an activation function (Mish) in series. The ResUnit component computes its output by adding the input feature map to the output of its stacked CBM layers. Although the multibranch topology has a good feature extraction effect, each branch's results need to be retained until they are added or concatenated, which significantly increases memory usage and seriously affects the model's inference speed. Such a structure is poorly suited to ship detection, which has high inference speed requirements. Therefore, removing the branch structure from the model can effectively improve its inference speed. For example, the classic single-path model VGG [26], composed of multiple 3 × 3 convolutions, has obvious advantages in speed, but its accuracy is far inferior to that of the ResNet structure. Therefore, this study follows the idea of RepVGG [27] and uses the structure reparameterization technology to construct the feature extraction component RepUnit (see Figures 2(b) and 2(c)). Although the multibranch structure has poor inference speed, it is more conducive to model training and feature extraction.
Therefore, in order to improve both speed and accuracy, this paper first uses a multibranch structure for training, in which the outputs of a 3 × 3 convolution branch, a 1 × 1 convolution branch, and an identity branch, each followed by batch normalization, are summed. Then, the structure reparameterization technology is used to fuse the model parameters and convert the training block into a single 3 × 3 convolution layer for inference.

In the inference stage, the block therefore consists of a single 3 × 3 convolution, which effectively improves the inference speed of the model while maintaining its accuracy. The structure reparameterization process and the calculation process of the convolution kernel are shown in Figure 3. First, the convolution layer and the batch normalization layer in the residual block are fused (this operation is performed in the inference stage of many deep learning frameworks). In this fusion, $W_i$ denotes the convolutional layer parameters before fusion, $\beta_i$ is the convolutional layer bias before fusion, $\mu_i$ is the mean of the batch normalization layer, and $\sigma_i$ is the variance of the batch normalization layer. Branch (a) directly fuses the 3 × 3 convolution layer and the batch normalization layer. Branch (b) fuses the 1 × 1 convolution layer and the batch normalization layer. Branch (c) is first expressed as a 3 × 3 convolution layer and then fused with its batch normalization layer (because this branch does not change the value of the input feature map, it is written as an identity 3 × 3 convolution whose kernel has a weight of 1 at the center position of the matching channel and 0 elsewhere, so that the output keeps the original value of the input feature map).
Then, the fused convolution layer of branch (b) is converted into a 3 × 3 convolution layer (the value of the 1 × 1 convolution kernel is used as the center point of the 3 × 3 convolution kernel, and the other positions are filled with 0). Finally, the 3 × 3 convolution layers of all branches are merged: the weights and biases of all the branches are summed to obtain a single 3 × 3 convolution layer after fusion.
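The fusion steps described above can be checked numerically with a small NumPy sketch. It is written under stated assumptions rather than as the implementation used in this paper: the convolutions have no bias, the batch normalization layers carry a scale `gamma` and shift `beta` in addition to the running mean and variance, the activation function is left out, and all function names are hypothetical.

```python
import numpy as np

def conv2d(x, w, b, pad):
    """Naive stride-1 convolution. x: (C_in, H, W), w: (C_out, C_in, k, k), b: (C_out,)."""
    c_out, _, k, _ = w.shape
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h_out, w_out = h + 2 * pad - k + 1, wd + 2 * pad - k + 1
    out = np.zeros((c_out, h_out, w_out))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + h_out, dx:dx + w_out]
            out += np.einsum("oi,ihw->ohw", w[:, :, dy, dx], patch)
    return out + b[:, None, None]

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batch normalization using fixed statistics."""
    scale = gamma / np.sqrt(var + eps)
    return (x - mean[:, None, None]) * scale[:, None, None] + beta[:, None, None]

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batch normalization into the preceding convolution's weight and bias."""
    scale = gamma / np.sqrt(var + eps)
    return w * scale[:, None, None, None], (b - mean) * scale + beta

def pad_1x1_to_3x3(w):
    """Place each 1x1 kernel value at the center of a 3x3 kernel, zeros elsewhere."""
    return np.pad(w, ((0, 0), (0, 0), (1, 1), (1, 1)))

def identity_3x3(channels):
    """3x3 kernel that copies each input channel to the same output channel."""
    w = np.zeros((channels, channels, 3, 3))
    w[np.arange(channels), np.arange(channels), 1, 1] = 1.0
    return w

def reparameterize(w3, bn3, w1, bn1, bn_id):
    """Merge the 3x3, 1x1, and identity branches into one 3x3 kernel and bias.

    Each bn argument is a (gamma, beta, mean, var) tuple.
    """
    c = w3.shape[0]
    zero = np.zeros(c)
    w_a, b_a = fuse_conv_bn(w3, zero, *bn3)                # branch (a)
    w_b, b_b = fuse_conv_bn(w1, zero, *bn1)                # branch (b)
    w_c, b_c = fuse_conv_bn(identity_3x3(c), zero, *bn_id) # branch (c)
    return w_a + pad_1x1_to_3x3(w_b) + w_c, b_a + b_b + b_c
```

Summing the three training-time branch outputs and applying the single merged 3 × 3 convolution give the same result up to floating-point error, which is why the conversion does not change the model's predictions. The identity branch requires the number of input and output channels to match.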
In the end, we used the improved feature extraction component (RepUnit) to form a new module (RCSP) and obtained a new backbone network (RCSPDarknet), which effectively improved the model's inference speed and had a better detection effect.

Spatial Pyramid Pooling Module Based on Dilated Convolution (DSPP).
YOLO-V4 was inspired by SPPNet [28] and added the SPP module (see Figure 4(b)). CBL_N is composed of an N × N convolution, batch normalization, and an activation function (Leaky) in series (the difference from CBM_N is the activation function: CBM_N uses Mish and CBL_N uses Leaky), and MaxPool_N is the Max-Pooling layer whose kernel size is equal to N. Pooling with fixed block sizes is used to stitch together different feature maps and fuse features of different sizes, which effectively improves detection on images with significant differences in target size and increases the receptive field. However, ship sizes vary widely in ship detection, and the problem of missed small-scale ships is serious. With the original SPP structure and traditional convolution structures, it is difficult to increase the receptive field while still capturing small-scale targets in space. Luo et al. studied the problem of receptive fields in deep convolutional networks [29] and pointed out that pixels at the center of the receptive field have a greater influence. In the forward pass, the center pixel has more paths through which to transmit its information to the neural node, while the edge pixels have fewer paths. Similarly, in the backward pass, the receptive field's center pixel obtains more gradients from the corresponding neural nodes. Compared with ordinary convolution, dilated convolution [30] can reduce the loss of spatial features without reducing the receptive field and can effectively take into account the feature extraction of targets of different scales.
Therefore, this paper draws on dilated convolution and SPPNet to design a new feature enhancement module (DSPP). While increasing the model's receptive field, it improves feature extraction for small-scale targets in space, which effectively reduces missed detections of small-scale ships and improves ship detection accuracy. The DSPP structure is shown in Figure 4(a). DBL_N is composed of a dilated convolution with a spatial interval span of N, a batch normalization layer, and an activation function (Leaky) in series.

First, the feature map is passed through a 1 × 1 convolution layer to reduce the number of channels and is then divided into three branches. The three branches are composed of a Max-Pooling layer, a convolution layer, and a dilated convolution layer in series (the number of convolution kernels of each branch, the dilation rates, and the kernel sizes of Max-Pooling are shown in Figure 4), and the last branch uses two 3 × 3 convolutions instead of a 5 × 5 convolution, reducing the parameters and deepening the nonlinear layers. Second, the feature maps of the three branches are concatenated and then connected to a 1 × 1 convolution layer, which is used for scale conversion of the features. Finally, following the residual structure of ResNet, we obtain the feature enhancement module DSPP.
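Two arithmetic facts behind this design can be made concrete with a short sketch: a dilated convolution covers a larger window with the same number of weights, and two stacked 3 × 3 convolutions cover the same 5 × 5 window as one 5 × 5 convolution with fewer weights. The function names and the example dilation rates and channel count are illustrative assumptions; the actual values used by DSPP are those given in Figure 4.

```python
def effective_kernel_size(k, dilation):
    """Spatial extent covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)

def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1, undilated convolutions applied in sequence."""
    return 1 + sum(k - 1 for k in kernel_sizes)

def conv_weight_count(k, c_in, c_out):
    """Number of weights (biases excluded) in a k x k convolution."""
    return k * k * c_in * c_out

# A 3x3 kernel with dilation 2 spans a 5x5 window but still has 9 weights per channel pair.
print(effective_kernel_size(3, 2))                           # 5
# Two stacked 3x3 convolutions see the same 5x5 window as one 5x5 convolution.
print(stacked_receptive_field([3, 3]))                       # 5
# With 64 input and output channels (an assumed size), the stacked form needs fewer weights.
print(2 * conv_weight_count(3, 64, 64), conv_weight_count(5, 64, 64))  # 73728 102400
```

The stacked form also inserts an extra activation between the two convolutions, which is the "deepening the nonlinear layers" effect mentioned above.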

Feature Pyramid Based on Attention Mechanism (AtFPN).
In deep learning, the fusion of features at different scales is an essential means of improving performance, and convolution layers at different depths learn semantic features of different levels. The FPN [31] structure proposed by Anthimopoulos simultaneously uses the high resolution of low-level features and the rich semantic information of high-level features and makes its predictions by fusing the features of these different layers. YOLO-V4 adopted this idea and combined it with PANet [32], adding a bottom-up feature pyramid after the FPN layer to obtain PAFPN (see Figure 1). This structure propagates robust semantic features from the top to the bottom and strong positioning features from the bottom to the top and aggregates parameters from different backbone layers into different detection layers. However, the PAFPN structure of YOLO-V4 still uses the same convolution module as the backbone network. Although this gives a good feature extraction effect, it brings the problem of an excessive number of parameters.

Therefore, this paper merges the CBAM structure with PAFPN and adds a residual structure to each semantic layer. This yields a new feature pyramid structure (AtFPN), which effectively improves the model's accuracy and reduces the number of model parameters.
CBAM [33] was proposed by Woo et al. (see Figure 5(a)). This structure provides attention maps along the channel and spatial dimensions, respectively, and is applied to intermediate feature maps, which effectively helps information flow within the network. The channel attention module aims to focus on which features are meaningful. First, the channel attention module compresses the spatial dimension of the input feature map using an Avg-Pooling layer and a Max-Pooling layer, obtaining the global context information in the feature map while reducing the interference information. It then forwards both pooled descriptors to a shared network (single-layer perceptron) and finally generates the channel attention map through a sigmoid (see Figure 5(b)). Spatial attention is complementary to channel attention and aims to assign weights to feature maps to obtain the spatial regions of interest (see Figure 5(c)). First, Avg-Pooling and Max-Pooling operations are applied along the channel axis, and their outputs are concatenated to generate an effective feature descriptor. Then, the spatial attention map is obtained through a sigmoid.

Therefore, we replaced the CBL component in the original YOLO-V4 bottom-up semantic layer with a CBAM component, which effectively reduced the model parameters and increased the weight given to the target regions to be identified in feature maps at different scales. At the same time, we again followed the residual structure of ResNet in each semantic layer and fused the corresponding pixels of the shallow feature map output by the backbone network and the deep feature map after multilayer convolution to enrich the feature maps. The AtFPN designed in this paper is shown in Figure 6.
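The channel and spatial attention steps described above can be sketched in NumPy as follows. This is a simplified illustration under assumptions, not the configuration used in AtFPN: the shared network is modeled as a small two-layer perceptron with a ReLU, a single k × k convolution is applied to the concatenated spatial descriptor before the sigmoid, and all names and shapes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(f, w1, w2):
    """f: (C, H, W). w1: (C // r, C) and w2: (C, C // r) are the shared network weights."""
    avg = f.mean(axis=(1, 2))          # global average pooling over space -> (C,)
    mx = f.max(axis=(1, 2))            # global max pooling over space -> (C,)
    shared = lambda v: w2 @ np.maximum(w1 @ v, 0.0)
    return sigmoid(shared(avg) + shared(mx))   # one weight per channel

def spatial_attention(f, kernel):
    """f: (C, H, W). kernel: (2, k, k) convolution over the stacked avg and max maps."""
    desc = np.stack([f.mean(axis=0), f.max(axis=0)])   # pooling along the channel axis
    k = kernel.shape[-1]
    pad = k // 2
    h, w = desc.shape[1:]
    dp = np.pad(desc, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            out += (kernel[:, dy, dx, None, None] * dp[:, dy:dy + h, dx:dx + w]).sum(axis=0)
    return sigmoid(out)                # one weight per spatial position

def cbam(f, w1, w2, kernel):
    """Apply channel attention first, then spatial attention."""
    f = f * channel_attention(f, w1, w2)[:, None, None]
    return f * spatial_attention(f, kernel)[None, :, :]
```

Both attention maps lie strictly between 0 and 1, so the module rescales the feature map without changing its shape, which is what allows it to replace a convolution block in the pyramid.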

ShipYOLO.
In summary, this paper designs ShipYOLO, a detection model that is better suited to ship detection; the model structure is shown in Figure 7. First, an efficient backbone network, RCSPDarknet, is designed using the structure reparameterization technology, which effectively addresses the current problem of low real-time performance in ship detection. Second, the feature enhancement module DSPP is designed using dilated convolution and Max-Pooling and is combined with the feature pyramid structure AtFPN based on the attention mechanism. While preserving the model's inference speed, this further improves the model's accuracy and effectively addresses the missed detection of small-scale ships in current ship detection models.

Evaluating Indicator.
This paper uses mAP as the model's accuracy evaluation indicator, where mAP@5:5:95 represents the average mAP over different IoU thresholds (from 0.05 to 0.95 with a step size of 0.05). mAP50 and mAP90 represent the mAP at IoU thresholds of 0.5 and 0.9, respectively. mAP (small) represents the average mAP of small objects. FPS represents the number of frames processed per second. # Params represents the number of parameters of the model. For a convolutional layer, it is calculated as

$$\text{Params}_{\text{conv}} = C_o \times (k_w \times k_h \times C_i + 1),$$

where $C_o$ is the number of output channels, $C_i$ is the number of input channels, $k_w$ is the width of the convolutional kernel, $k_h$ is the height of the convolutional kernel, and 1 accounts for the bias of each convolutional kernel.
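As a quick check of the # Params definition, the sketch below counts parameters from the quantities named above (output channels, input channels, kernel width and height, plus one bias per output). The fully connected count follows the same pattern of weights plus one bias per output. The function names and the example layer sizes are illustrative assumptions.

```python
def conv_params(c_out, c_in, k_w, k_h):
    """Weights of c_out kernels of size k_w x k_h x c_in, plus one bias per kernel."""
    return c_out * (k_w * k_h * c_in + 1)

def fc_params(m, n):
    """Weights of an n-to-m fully connected layer, plus one bias per output."""
    return m * (n + 1)

# A 3x3 convolution from 3 to 64 channels (assumed sizes).
print(conv_params(64, 3, 3, 3))   # 1792
# A fully connected layer from 512 inputs to 10 outputs (assumed sizes).
print(fc_params(10, 512))         # 5130
```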
For a fully connected layer, it is calculated as

$$\text{Params}_{\text{fc}} = m \times (n + 1),$$

where $m$ is the output vector dimension of the fully connected layer, $n$ is the input vector dimension of the fully connected layer, and 1 accounts for the bias of the fully connected layer.

8. For this dataset, we split the data into a training set and a validation set at a ratio of 8:2.

Self-Built Ship Dataset.
In order to cover a richer range of scenes, we built a ship dataset from natural scenes. It contains 2238 images in total with a single category (Ship), and some samples are shown in Figure 9. Similarly, we split it into a training set and a validation set at a ratio of 8:2.
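An 8:2 split of this kind can be sketched as below. The shuffling, the fixed seed, and the rounding rule are assumptions made for illustration; the procedure actually used to produce the split is not described here.

```python
import random

def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle the items reproducibly and split them into training and validation lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_ratio)   # rounds down
    return items[:n_train], items[n_train:]

# Hypothetical image indices standing in for the 2238 images.
train, val = split_dataset(range(2238))
print(len(train), len(val))   # 1790 448
```

With this rounding rule, 2238 images give 1790 training and 448 validation images; a different rounding or a stratified split would shift these counts slightly.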

Experiment and Result.
We conducted experiments in a 1080Ti environment. First, we applied each of our three optimization strategies to YOLO-V4 and evaluated them separately. Using our self-built dataset with an input size of 512 × 512, the experimental results are shown in Table 1. From Table 1, we can see that RCSPDarknet significantly improves the inference speed of the model and reduces the number of parameters while maintaining accuracy. Embedding the DSPP module into YOLO-V4 effectively improves the detection accuracy of the model, but the inference speed is slower than that of YOLO-V4. Finally, YOLO-V4 based on AtFPN has fewer model parameters, while the detection accuracy and inference speed are not affected.
Finally, we compared the performance of ShipYOLO, YOLO-V4, and YOLO-V3 on the two datasets and tested the detection accuracy of the three models for small targets. The experimental results are shown in Tables 2-4. From Tables 2 and 3, we can see that YOLO-V4 has better detection accuracy than YOLO-V3 at input sizes of 416 × 416 and 512 × 512, while YOLO-V3 has better detection accuracy than YOLO-V4 at an input size of 320 × 320, although YOLO-V4 has a faster inference speed. Through comparison and verification, the ship detection model ShipYOLO proposed in this paper is better than YOLO-V4 and YOLO-V3 in accuracy, FPS, and # Params. With an input size of 320 × 320, compared to YOLO-V4, ShipYOLO has a 13.6% increase in mAP@5:5:95 (10.6% in mAP90), a 23.7% increase in FPS, and # Params reduced to 188 m. From Table 4, we can also see that ShipYOLO performs better in the detection of small-scale ships.
We also selected some typical pictures for verification. As shown in Figures 10 and 11, ShipYOLO resolves the missed detections of small-scale ships seen with YOLO-V4 and YOLO-V3 and has a better bounding box regression effect. The comparison in Figures 10 and 11 and the experimental data in Tables 1 and 2 demonstrate the effectiveness of ShipYOLO in the field of ship detection. It is a faster and more accurate ship detection model.

Conclusions
This paper proposes an enhanced model based on YOLO-V4 for ship detection. First, this paper uses the structure reparameterization technology to optimize the backbone network. The new backbone network increases the model's inference speed and effectively addresses the poor speed of ship detection models. Second, this paper redesigns the receptive field amplification module of YOLO-V4 and optimizes the feature pyramid structure based on the attention mechanism. These structures improve the model's detection of small-scale ships and address the problem of missed detection of small-scale ships. Extensive experimental results show that our detection model ShipYOLO achieves a significant improvement in speed and accuracy compared to current advanced detection models and can be effectively applied to the field of ship detection. Through experiments, we have also found that extreme weather conditions during navigation, such as fog and rain, seriously affect the model's recognition of ships. Therefore, we will carry out further research in this area so that ships can be effectively identified in more complex environments.

Figure 5: (a) Schematic diagram of the CBAM module structure. (b) Schematic diagram of the channel attention module structure. (c) Schematic diagram of the spatial attention module structure.

Figure 6: Schematic diagram of the AtFPN structure.

Figure 7: Schematic diagram of the ShipYOLO model structure.

Table 1: Experimental results of different optimization strategies.

Table 4: Experimental results of the mAP (small).

Table 3: Experimental results on our self-built dataset.