A Decoupled Head and Coordinate Attention Detection Method for Ship Targets in SAR Images

Deep learning-based methods are now widely used for ship target detection in synthetic aperture radar (SAR) images. However, these methods suffer from high model complexity and poor performance when detecting small, dense targets. To address these problems, this paper proposes a ship target detection algorithm based on an improved YOLO (You Only Look Once) algorithm. Considering the real-time requirements and computational constraints of mobile applications, the YOLOv4 network is also modified to make it more lightweight, and a decoupled head and coordinate attention are introduced to preserve as much of YOLOv4's detection performance as possible after lightweighting. First, because the coupled detection head of the YOLOv4 degrades performance, this study decouples the classification and regression tasks. Second, since the channel attention mechanism ignores spatial position information, coordinate attention is used to obtain long-range dependencies and accurate position information in the spatial domain. The effects of the coordinate attention mechanism at different levels of the YOLOv4 structure are also analyzed. Furthermore, a second lightweight backbone is added alongside the YOLOv4 backbone to improve detection performance. Experimental results on the SAR ship detection dataset (SSDD) and the high-resolution SAR images dataset (HRSID) demonstrate that the proposed method achieves high detection accuracy in complex scenes. The proposed lightweight model has fewer parameters than the original YOLOv4 structure. Finally, two large-scene SAR images are used to evaluate the proposed model's migration performance. The experimental results demonstrate that the proposed model has a strong migration ability and can be used in maritime monitoring.


I. INTRODUCTION
Unlike optical, infrared, and hyperspectral sensors, synthetic aperture radars are active microwave imaging sensors with an all-day and all-weather operational capability. In recent years, the number and quality of SAR images have greatly improved due to advancements in SAR imaging technology. Therefore, the detection of ship targets has become a research hotspot. Traditional ship detection methods usually perform image preprocessing, sea-land segmentation, and candidate region extraction. Constant false alarm rate (CFAR) [1], multi-resolution [2], [3], polarization information [4], and conversion [5] methods have been commonly used. However, these methods have high computational complexity and poor transferability, and they are considerably laborious.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhongyi Guo.
It is noteworthy that convolutional neural networks (CNNs) can effectively address the problems of traditional methods by automatically learning robust features from SAR images.
Girshick introduced the region-based convolutional neural network (RCNN) [6] to the field of target identification and recognition. Subsequently, the Fast RCNN [7] realized end-to-end detection by adopting various optimizations, such as shared convolution, ROI pooling, and a multitask loss. The Faster RCNN [8] is based on the region proposal network (RPN), and it uses the anchor mechanism to connect region generation with a CNN to realize real-time detection. The YOLO [9] uses the idea of regression to complete classification and localization directly with a one-stage network. The SSD [10] uses fixed frames for region generation and multi-layer feature information to improve the detection speed and accuracy to a certain extent. By modifying the loss function, the RetinaNet [11] addresses the class imbalance problem in one-stage methods.
For the Faster RCNN, Zhang et al. [12] employed binary normed gradient and cascaded CNN to improve the accuracy. Yang et al. [13] used RetinaNet and improved the loss function to reduce the false alarm rate. The SSD was enhanced by Wang et al. [14] to improve the detection speed. Furthermore, the authors enhanced the detection performance for small targets. Zhu et al. [15] used the YOLO to design an integrated multi-scale mechanism for the detection of small ships.
The attention mechanism, modeled on the human visual system, has been a research focus in the computer vision field. Research on spatial attention has mostly been based on recurrent neural networks (RNNs) [16]. The RAM [17] and subsequent studies, such as DRAW [18], GlimpseNet [19], STN [20], DCN [21], and DCNv2 [22], use subnetworks to predict the target regions explicitly. The GeNet [23] uses the attention mechanism to predict a soft mask implicitly. Research on the channel attention mechanism has mainly been based on the SENet [24]. Improved versions of the SENet include the GSoP-Net [25], FcaNet [26] (the squeeze module is improved), ECANet [27] (the excitation module is improved), SRM [28], and GCT [29]. Combined channel and spatial attention mechanisms compute channel attention and spatial attention separately; proposed models include the CBAM [30] and the SCSE [31]. In contrast, a residual attention mechanism estimates three-dimensional attention maps directly. Follow-up work on separate channel and spatial attention has proposed triplet attention [32] for cross-dimensional interaction, coordinate attention [33] for long-range dependencies, and DANet [34] and RGA [35] for relation-aware attention. There have been fewer studies on the attention mechanism in the field of ship target detection. Lin et al. [36] improved the detection performance of the Fast RCNN by introducing an attention mechanism. To enable multi-scale ship detection, Cui et al. [37] introduced the CBAM into the detection method. Further, Zhao et al. [38] incorporated an attention mechanism into the FPN to achieve multi-scale detection. Fu et al. [39] designed a hierarchy-based attention network and a space-based attention network to improve detector performance.
To address the shortcomings of the existing research, this work introduces a coordinate attention module to ship target detection. This solves the problem that the SENet considers only the channel information while ignoring the position information. It also solves the problem that the CBAM captures only local relationships at each location across multiple channels and is unable to obtain long-range dependencies. In addition, this work draws on recent research on decoupled heads to resolve the conflict between classification and regression. The decoupled head is used to enhance the performance of a deep learning network.
The main contributions of this work can be summarized as follows: (1) The decoupled head is used to decouple the classification and regression tasks in the traditional coupled head; (2) Compared with the conventional SE and CBAM attention mechanisms, the coordinate attention mechanism used in this work can obtain long-range dependencies and accurate position information in the spatial domain; (3) A two-way trunk is used to improve the detection model's performance for small targets; (4) Lightweight networks for mobile applications are presented.
The remainder of this paper is organized as follows. Section II introduces the structure of the YOLOv4. Section III introduces the optimization methods used in this paper, including the coordinate attention mechanism, decoupled head, loss function optimization, and two-way trunk. Section IV presents the experimental results, including comparative and ablation experiments. Section V discusses the lightweight model. Section VI concludes the paper.

II. YOLOv4 ARCHITECTURE
The YOLOv4 architecture is presented in Figure 1. The YOLOv4 uses the original YOLO structure but adopts well-known optimization strategies developed in recent years, including the CIOU loss function, improved NMS [40], SPPNet [41], and PANet [42].
Compared with the YOLOv3, the backbone of the YOLOv4 is changed from the original Darknet-53 to CSPDarknet-53, which retains the original residual connection module but avoids network performance degradation. The CSPDarknet-53 backbone consists of five cross-stage partial (CSP) blocks, which form a five-stage residual network; each block is named resblock_body. The resblock_body incorporates a special convolution operation to reduce the input resolution. As shown in Figure 2, the cross-stage partial network (CSPNet) [43] solves the problem of repeated gradient information in network optimization. It also reduces the number of computations while ensuring higher accuracy.
The SPP module is used in the CSPDarknet-53's last feature layer, which uses 5 × 5, 9 × 9, and 13 × 13 max-pooling layers to conduct multi-scale fusion. The obtained feature maps are used to expand the receptive field and introduce contextual features. The YOLOv4 model uses the output feature map of the SPP structure as input of the feature pyramid. By using the fusion method PANet, the final feature map is provided to the YOLO detection head for final classification and localization.
The CIOU loss function is used in the YOLOv4 model. It considers not only the overlapping area of the predicted box and the ground truth, as in the GIOU loss function, but also the distance between the center points of the predicted box and the ground truth, as in the DIOU loss function. In addition, both the predicted box's length-width ratio and the ground truth's length-width ratio are considered:
$$L_{CIOU} = 1 - IOU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v \tag{1}$$
$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2} \tag{2}$$
$$\alpha = \frac{v}{(1 - IOU) + v} \tag{3}$$
where c is the diagonal length of the smallest box that can simultaneously cover both the ground-truth and predicted boxes; $\rho^{2}(b, b^{gt})$ denotes the squared Euclidean distance between the center points of the predicted box $b$ and the ground truth $b^{gt}$; $w^{gt}/h^{gt}$ is the ground truth's aspect ratio; and $w/h$ is the predicted box's aspect ratio.
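The CIOU terms described above (overlap, center distance, and aspect-ratio consistency) can be illustrated with a minimal Python sketch. It assumes axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates and uses the standard CIOU formulation; the function name `ciou_loss` and the small constant `eps` are illustrative choices, not taken from the paper.

```python
import math

def ciou_loss(box_pred, box_gt, eps=1e-9):
    """CIOU loss for two boxes given as (x1, y1, x2, y2); eps avoids division by zero."""
    px1, py1, px2, py2 = box_pred
    gx1, gy1, gx2, gy2 = box_gt

    # Overlap term: intersection over union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    w, h = px2 - px1, py2 - py1
    w_gt, h_gt = gx2 - gx1, gy2 - gy1
    iou = inter / (w * h + w_gt * h_gt - inter + eps)

    # Distance term: squared center distance over squared enclosing-box diagonal.
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2 + eps

    # Aspect-ratio term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(w_gt / (h_gt + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v

# Two disjoint unit squares: IOU is 0, so only the distance term adds to the base loss of 1.
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))
```

For identical boxes the loss is approximately zero, and it grows as the boxes move apart or their aspect ratios diverge.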
It should be noted that the YOLO detection head in the YOLOv4 model couples the classification and regression tasks, thus degrading network performance. In addition, the YOLOv4 model does not apply an explicit attention mechanism, which weakens its detection of dense and small targets.

III. DECOUPLED HEAD AND COORDINATE ATTENTION DETECTION METHOD
To address the limitations of the YOLO network, an improved high-resolution ship detection method is proposed. The flowchart of the proposed method is presented in Figure 3, which shows that the proposed architecture includes three main sections: a backbone, a neck, and a head. The head adopts the decoupled detection head, which is discussed in detail in Section III-A. In addition, coordinate attention is added to the residual blocks of stages 3-5 of the cross-stage partial blocks to enhance the detection of small targets, which is explained in detail in Section III-B. The input image has a size of 416 × 416. After the image passes through the backbone, processes 1-3 in the neck generate 13 × 13, 26 × 26, and 52 × 52 feature maps, respectively. Process 2 upsamples the input 13 × 13 feature maps to 26 × 26 feature maps and fuses them with the backbone's 26 × 26 feature maps. Process 3 upsamples the output feature maps of process 2 to the size of 52 × 52, fuses them with the backbone's 52 × 52 feature maps, and then passes the resulting data to the detection head. The bottom-up integration process of the PAN is implemented by processes 4 and 5. Process 4 downsamples the output feature maps of process 3 to the size of 26 × 26, fuses the obtained maps with the output feature maps of process 2, and then passes the resulting data to the detection head. Process 5 downsamples the output feature maps of process 4 to 13 × 13 feature maps, fuses them with the output feature maps of process 1, and passes them to the detection head.
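The feature-map sizes in the neck follow directly from the input size. The short Python sketch below traces them, assuming backbone strides of 32, 16, and 8 and a factor of 2 for every upsampling and downsampling step. These values are implied by the 13, 26, and 52 sizes for a 416 × 416 input but are not stated explicitly in the text, and the function names are illustrative.

```python
def feature_map_sizes(input_size=416, strides=(32, 16, 8)):
    """Spatial size of each backbone output for a square input (assumed strides)."""
    return [input_size // s for s in strides]

def pan_trace(sizes):
    """Follow processes 1-5: top-down upsampling, then bottom-up downsampling."""
    p1 = sizes[0]   # process 1: deepest map
    p2 = p1 * 2     # process 2: upsample and fuse with the mid-level backbone map
    p3 = p2 * 2     # process 3: upsample and fuse with the shallow backbone map
    p4 = p3 // 2    # process 4: downsample and fuse with the output of process 2
    p5 = p4 // 2    # process 5: downsample and fuse with the output of process 1
    return {"top_down": [p1, p2, p3], "to_head": [p3, p4, p5]}

sizes = feature_map_sizes()
trace = pan_trace(sizes)
print(sizes)             # sizes of the three backbone outputs
print(trace["to_head"])  # sizes of the three maps passed to the detection head
```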

A. DECOUPLED HEAD
The conflict between classification and regression tasks in object detection has been a widely analyzed problem [44].
The YOLOX [45] shows that the coupled head degrades network performance to a certain extent and that the decoupled head can speed up a network's convergence. The decoupled head is also crucial in end-to-end models [46], [47]. The coupled detection head has been used in the YOLOv3-YOLOv5 models, and it consists of a 1 × 1 convolution layer. For instance, when the coupled detection head of the YOLOv4 model is used for detection on the COCO dataset, three boxes are preset. For each box, the head must predict the confidence of a target, four bounding box regression parameters, and 80 category probabilities. Predicting all of these with a single coupled layer reduces network performance and makes it difficult to determine the location of a target accurately. Therefore, this study decouples the detection head into one branch responsible for finding the target object and regressing the bounding box and another branch that handles the target category. Finally, the two branches are integrated into the prediction to avoid the performance degradation of the traditional detection head. The architecture of the proposed decoupled detection head with anchors is presented in Figure 4. The decoupled head reduces the number of feature channels using a 1 × 1 convolution layer and adds two parallel branches. For the classification and regression tasks, each of the branches has two 3 × 3 convolution layers.
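The COCO example above can be checked with simple arithmetic. The following Python sketch counts the output channels per grid cell for a coupled head and shows how the same outputs are split across branches in a decoupled head. The function names and the three-way split into `cls`, `reg`, and `obj` outputs are illustrative assumptions; the paper itself groups object confidence and box regression in one branch.

```python
def coupled_head_channels(num_anchors=3, num_classes=80):
    """Output channels of a coupled head: confidence + 4 box parameters + classes per box."""
    return num_anchors * (1 + 4 + num_classes)

def decoupled_head_channels(num_anchors=3, num_classes=80):
    """The same outputs, grouped by the branch that produces them."""
    return {
        "cls": num_anchors * num_classes,  # classification branch
        "reg": num_anchors * 4,            # bounding box regression
        "obj": num_anchors * 1,            # target confidence
    }

coco = coupled_head_channels()                 # COCO setting from the text
split = decoupled_head_channels()
ship = coupled_head_channels(num_classes=1)    # single "ship" class, as in SAR ship detection
print(coco, split, ship)
```

The total number of predicted values is unchanged by decoupling; what changes is that classification and regression no longer share the final convolution layers.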

B. COORDINATE ATTENTION
The SENet structure can be briefly described as follows. First, a squeeze module applies the global average pooling operation defined in (4) to compress the global spatial information into channel-wise statistics; namely, an input of size C × H × W is converted to size C × 1 × 1:
$$z_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i, j) \tag{4}$$
where $x_c$ is the c-th channel of the input and $z_c$ is the corresponding output. The output stores the numerical distribution of the C feature maps. Second, the channel correlation is obtained by the excitation module. The C × 1 × 1 output of the squeeze module is fed to a fully connected layer, which reduces it to C/r × 1 × 1, where r represents the scaling factor. Then, after passing through a nonlinear layer and another fully connected layer, the output dimension becomes C × 1 × 1 again. Finally, according to the output weights of the excitation module, the reweighting process assigns a weight to each feature channel, which completes the recalibration of the original features in the channel dimension.
VOLUME 10, 2022
The CBAM structure is as follows. The CBAM combines channel and spatial information. Its operation begins by compressing the feature maps in the spatial dimension by using both average and maximum pooling to obtain one-dimensional vectors. Next, a multi-layer perceptron is used to process these vectors. Then, both average and maximum pooling are applied to the feature map along the channel dimension, and the two results are concatenated along the channel dimension. The result is then reduced to a single channel by a convolution. The final spatial attention feature map is obtained by multiplying this map with the input feature map. It should be noted that the SENet considers only channel information and ignores position information. The CBAM, by contrast, considers location information by introducing the weighting coefficients; however, this weighting captures only local relationships and cannot obtain long-range dependencies. Coordinate attention can effectively address both problems.
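As a rough illustration of the SENet squeeze, excitation, and reweighting steps described at the start of this subsection, the following minimal Python sketch operates on nested lists. The weight matrices `w1` and `w2`, the ReLU nonlinearity, the sigmoid gate, and the tiny input are all assumptions made for illustration; they follow the usual SE design, but the text does not fix them.

```python
import math

def squeeze(x):
    """Global average pooling: a C x H x W input becomes C channel statistics."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

def linear(v, weights):
    """Fully connected layer without bias; weights has shape (out, in)."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def excite(z, w1, w2):
    """C -> C/r -> C with an assumed ReLU in between and an assumed sigmoid gate."""
    hidden = [max(0.0, t) for t in linear(z, w1)]
    return [1.0 / (1.0 + math.exp(-t)) for t in linear(hidden, w2)]

def se_block(x, w1, w2):
    """Reweight every channel of x by its excitation weight."""
    gates = excite(squeeze(x), w1, w2)
    return [[[val * gates[c] for val in row] for row in ch] for c, ch in enumerate(x)]

# Hypothetical input with C = 2 channels of size 2 x 2, and reduction ratio r = 2.
x = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 4.0], [6.0, 8.0]]]
w1 = [[0.5, 0.5]]       # C/r x C
w2 = [[1.0], [-1.0]]    # C x C/r
y = se_block(x, w1, w2)
print(squeeze(x), y[0][0][0], y[1][1][1])
```

Every position in a channel is scaled by the same gate, which is why this form of attention carries no position information.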
The coordinate attention module decomposes the global pooling in (4) into horizontal and vertical parts, as shown in (5) and (6), respectively. Specifically, given an input x, each channel is encoded along the horizontal and vertical directions using pooling kernels with spatial extents (H, 1) and (1, W):
$$z_c^{h}(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i) \tag{5}$$
$$z_c^{w}(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w) \tag{6}$$
where $z_c^{h}(h)$ is the output of the c-th channel at height h and $z_c^{w}(w)$ is the output of the c-th channel at width w. The features are thus aggregated along the two spatial directions to build a pair of direction-aware feature maps. This enables the attention block to capture long-range dependencies along one direction while retaining accurate location information along the other, which helps the network locate an object of interest precisely.
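As shown in Figure 5, the module concatenates the results obtained by (5) and (6) and sends them to a shared 1 × 1 convolution layer, which is given by:
$$f = \delta\left(F_1\left(\left[z^{h}, z^{w}\right]\right)\right) \tag{7}$$
where δ denotes the sigmoid function, $F_1$ is the convolution function, [·] represents the concatenation operation, and f is an intermediate feature map of size C/r × (H + W), which is divided into two parts, $f^{h}$ and $f^{w}$. Following the formulation in [33], the output results are expressed by using the following mathematical expressions:
$$g^{h} = \delta\left(F_h\left(f^{h}\right)\right) \tag{8}$$
$$g^{w} = \delta\left(F_w\left(f^{w}\right)\right) \tag{9}$$
where $F_h$ and $F_w$ are 1 × 1 convolutions that restore the number of channels to C. Finally, the output of the coordinate attention module is given by:
$$y_c(i, j) = x_c(i, j) \times g_c^{h}(i) \times g_c^{w}(j) \tag{10}$$
In addition, coordinate attention can be easily integrated into mobile networks, such as MobileNetV2 [48] and EfficientNet [49]. This also facilitates the subsequent lightweighting of the model.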
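The direction-wise pooling and reweighting at the core of coordinate attention can be sketched for a single channel in plain Python. This is a simplified illustration, not the module used in the paper: the learned 1 × 1 convolutions are replaced by the identity (an assumption made to keep the example self-contained), so the gates are simply the sigmoid of the pooled values, and the 2 × 2 input is hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def pool_along_width(x):
    """One value per row: average over the width (horizontal encoding)."""
    return [sum(row) / len(row) for row in x]

def pool_along_height(x):
    """One value per column: average over the height (vertical encoding)."""
    height = len(x)
    return [sum(x[i][j] for i in range(height)) / height for j in range(len(x[0]))]

def coordinate_reweight(x, g_h, g_w):
    """Scale position (i, j) by the row gate g_h[i] and the column gate g_w[j]."""
    return [[x[i][j] * g_h[i] * g_w[j] for j in range(len(x[0]))] for i in range(len(x))]

x = [[1.0, 3.0], [5.0, 7.0]]             # one hypothetical channel, H = W = 2
z_h = pool_along_width(x)
z_w = pool_along_height(x)
g_h = [sigmoid(v) for v in z_h]          # identity in place of the learned convolutions
g_w = [sigmoid(v) for v in z_w]
y = coordinate_reweight(x, g_h, g_w)
print(z_h, z_w)
```

Unlike a single channel-wide gate, each output value here depends on its own row and column, which is how the module keeps position information.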

C. DOUBLE TRUNK
To further enhance the network's ability to extract features of small targets, a second backbone based on the VoVNet [50] is introduced alongside the CSPDarknet53, forming a two-way backbone. The VoVNet can effectively extract diverse feature information by using the one-shot aggregation (OSA) module, which connects the subsequent layers, as shown in Figure 6. Because the OSA module can capture multi-scale receptive fields, the diversified feature maps enhance the multi-scale detection ability of the model, especially for small targets.
In this paper, the feature information extracted from the input feature map by the VoVNet backbone is fused with the feature map extracted by the other YOLO backbone, the CSP8. Feature fusion yields more feature information, which helps enhance the model's detection ability for small ship targets.

D. LOSS FUNCTION
The CEIOU loss function [51] reflects the distance between the predicted box and the ground truth: the closer the predicted box is to the ground truth, the larger the CEIOU value and the smaller the loss function value. However, during the training of a network model, the gradient of the CEIOU loss function does not change, which affects the training effect. To solve this problem, the ICEIOU loss function, which is presented in Figure 7, is introduced to improve the model training effect.

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, the simulation setup, datasets, evaluation measures, and implementation of the proposed method are presented. In addition, ablation and comparative experiments are introduced.

A. DATASETS
The SAR ship detection dataset (SSDD) [12], which contains 1,160 images and 2,456 ship targets, was used in the experimental verification. This dataset was collected by the Sentinel-1, TerraSAR-X, and RadarSat-2 sensors and includes target ships in the HH, HV, VV, and VH polarization modes. The resolution of the dataset ranges from 1 m to 15 m. The images were acquired both on the open sea and in nearshore areas. The high-resolution SAR images dataset (HRSID) [52] contains 5,064 SAR images and 16,951 ship targets. It was collected by the TanDEM-X, TerraSAR-X, and Sentinel-1B sensors and includes target ships in the HH, HV, and VV polarization modes against different backgrounds. The dataset includes data with resolutions of 0.5 m, 1 m, and 3 m.

B. SIMULATION SETUP
The experiments were implemented with PyTorch 1.8.0, CUDA 11.1, and CUDNN 8805 on an Intel(R) Xeon(R) Gold 6130 CPU and a Tesla P100 GPU. The model was trained for 200 epochs with a learning rate of 0.0003 and the AdaBelief optimizer. It should be noted that optimizer selection has a significant effect on the convergence of a trained deep learning model [53], [54]. The AdaBelief optimizer was selected because it has both the fast convergence characteristics of the Adam optimizer and the good generalization capability of the SGD [55]. Further, the learning rate was adjusted with a cosine annealing strategy. The IOU threshold for detection in all experiments was 0.5.
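To make the learning-rate schedule concrete, the sketch below evaluates a plain cosine annealing curve for the stated 200 epochs and initial rate of 0.0003. The minimum rate of 0, the per-epoch update, and the absence of warm-up or restarts are assumptions, since the text does not give these details; `cosine_lr` is an illustrative name.

```python
import math

def cosine_lr(epoch, total_epochs=200, lr_max=3e-4, lr_min=0.0):
    """Cosine annealing from lr_max at epoch 0 down to lr_min at total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)

# Learning rate at the start, middle, and end of training.
for epoch in (0, 50, 100, 150, 200):
    print(epoch, cosine_lr(epoch))
```

The rate falls slowly at first, fastest around the midpoint, and slowly again near the end, which is the usual motivation for choosing this schedule over a step decay.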

C. EVALUATION METRICS
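In this study, precision, recall, F1_score, FPS, number of parameters, and GFLOPs were used to evaluate the proposed method's detection performance. The precision and recall were respectively calculated by:
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively. The F1_score is the harmonic mean of precision and recall, and it is defined by:
$$F1\_score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
Further, the AP is the area under the precision-recall curve, and it is mathematically expressed as follows:
$$AP = \int_{0}^{1} P(R)\, dR$$
Generally, mAP(0.5:0.95) denotes the mAP averaged over IOU thresholds from 0.5 to 0.95. In this study, it was calculated at IOU intervals of 0.05 and then averaged; mAP0.5 and mAP0.75 denote the mAP values for IOU thresholds of 0.5 and 0.75, respectively.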
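The metrics above can be computed with a few lines of Python. The counts and curve points in this sketch are hypothetical, and the AP is computed with all-point interpolation of the precision-recall curve, which is one common convention; the paper does not state which integration rule it uses.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def average_precision(recalls, precisions):
    """Area under the P-R curve; recalls must be sorted in ascending order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

print(precision_recall_f1(tp=8, fp=2, fn=2))
print(average_precision([0.5, 1.0], [1.0, 0.5]))
```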
The receiver operating characteristic (ROC) curve describes the relationship between the true positive rate (TPR) and the false positive rate (FPR). The AUC is defined as the area between the ROC curve and the coordinate axis. The TPR and FPR are respectively defined as follows:
$$TPR = \frac{TP}{TP + FN}$$
$$FPR = \frac{FP}{FP + TN}$$
where TN denotes the number of true negatives. The FPS represents the detection speed, and it is given by:
$$FPS = \frac{N}{T}$$
where N denotes the number of samples in the test set, and T is the amount of time required for testing the model on the test set.
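A small Python sketch of these quantities is given below. The counts, ROC points, and timing are hypothetical, and the AUC is approximated with the trapezoidal rule, which is an assumption about how the area is computed.

```python
def tpr_fpr(tp, fn, fp, tn):
    """True positive rate and false positive rate from confusion-matrix counts."""
    return tp / (tp + fn), fp / (fp + tn)

def auc_trapezoid(fprs, tprs):
    """Area under the ROC curve by the trapezoidal rule over sorted (FPR, TPR) points."""
    points = sorted(zip(fprs, tprs))
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

def frames_per_second(num_images, total_seconds):
    """Detection speed: number of test images divided by total test time."""
    return num_images / total_seconds

print(tpr_fpr(tp=9, fn=1, fp=2, tn=8))
print(auc_trapezoid([0.0, 0.2, 1.0], [0.0, 0.9, 1.0]))
print(frames_per_second(100, 2.0))
```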
The GFLOPs are used to measure the amount of computation; namely, network complexity is proportional to the number of performed calculations. The number of parameters measures the size of the network. In a neural network, parameters generally refer to the weights and biases that are learned during the training process.

D. ABLATION EXPERIMENT
To confirm the effectiveness of the decoupled head and coordinate attention module in detecting target ships in SAR images, ablation experiments were performed on the SSDD and HRSID datasets. Figure 8 presents the P-R curves of the ablation experiment on the SSDD dataset. The mAP results of the ablation experiment are given in Table 1. The results showed that replacing the coupled detection head with the decoupled detection head allowed the classification and regression tasks to be handled separately. This improved the mAP by 1.07% and enhanced network performance. In addition, inspired by the ConvNeXt [56], the coordinate attention module was added to the third stage to achieve robust feature learning. The coordinate attention module was also incorporated in the fourth and fifth stages, which have deeper semantics. This improved the mAP by 1.80% compared to the baseline network. The network obtained by combining the decoupled detection head and coordinate attention showed a mAP improvement of 2.43% compared with the baseline network. Figure 9 shows the P-R curves of the ablation experiment on the HRSID dataset, and the corresponding mAP results are presented in Table 2. Compared with the baseline, the decoupled detection head increased the mAP by 2.28%, the coordinate attention increased the mAP by 2.24%, and the combination of the decoupled detection head and coordinate attention improved the mAP by 4.32%. It is noteworthy that the HRSID dataset includes a large number of small targets and rich data, so the decoupled detection head and coordinate attention had a larger effect on the network's performance.
FIGURE 13. The detection results on the HRSID dataset.

E. COMPARATIVE EXPERIMENT
Next, the proposed algorithm was compared with the Faster RCNN, SSD, RetinaNet, and ImYOLOv4. The comparison results are presented in Tables 3 and 4. The P-R curves of different algorithms obtained on the SSDD and HRSID datasets are presented in Figure 10, and the corresponding ROC curves are presented in Figure 11. The experimental results showed that although the number of parameters in the proposed algorithm was slightly larger than in the SSD and RetinaNet, it outperformed the other methods by more than 4% on the F1_score. In terms of precision and recall, the proposed method surpassed the other methods by more than 2% and 3%, respectively. Further, in terms of mAP0.5, the performance of the proposed method was more than 1.4% higher than those of the other methods; the proposed method also achieved a significant improvement in mAP0.5:0.95. As shown in Table 4, the proposed method outperformed the other methods on the F1_score by more than 1.2% and by more than 1% and 2% on precision and recall, respectively. The proposed method improved the mAP0.5 by more than 1.1% and also improved the mAP0.75. Moreover, the proposed method had the lowest computational complexity of 32.27 GFLOPs, while those of the Faster RCNN, RetinaNet, and SSD were 109.7, 87.7, and 107.5 GFLOPs, respectively.
To further illustrate the benefits of the proposed strategy, various types of targets, such as small targets, nearshore targets, and dense targets, were selected to compare the detection results of different methods on the SSDD dataset, as shown in Figure 12, where the ground truth is denoted by the green boxes, and the predictions of the algorithms are represented by the red boxes. The panels of Figure 12, starting from Figure 12(a), show the ground truth and the detection results of the Faster RCNN, RetinaNet, SSD, and the proposed algorithm, respectively. As presented in Figures 12(b)-12(d), for nearshore and small targets, false alarms and missed detections were obvious. The proposed method reduced the probability of missed detections and false alarms and performed well. For small and dense targets, the Faster RCNN had fewer missed detections than the other algorithms, as shown in Figures 12(c) and 12(d); however, the missed detection problem was still prominent. In contrast, the proposed algorithm could effectively recognize dense and small objects with a low percentage of missed detections. The detection results obtained on the HRSID are shown in Figure 13, where the green boxes represent the ground truth, and the red boxes represent the predictions. The results suggested that the proposed model could detect nearshore, dense, and small targets.

F. ATTENTION MECHANISM EXPERIMENT
Further, the effects of coordinate attention at different levels of the YOLOv4 structure were investigated experimentally. In addition, the effects of coordinate attention, SE, and CBAM on ship target detection in SAR images were compared using the YOLOv4 as the baseline network. First, a comparative experiment was performed on the SSDD dataset using different insertion positions. In Experiment 1, coordinate attention was added before all residual layers, and in Experiment 2, coordinate attention was added to all residual layers. According to the ConvNeXt, the features learned by the model in the third stage are the most robust. At the same time, the MobileNet showed that the features learned by the model in the fourth and fifth stages have high-level semantics. Thus, in Experiment 3, coordinate attention was added to the residual layers of P3, P4, and P5. Figure 14 shows the P-R curves obtained by adding the attention mechanism to the decoupled optimization model in different positions.
The results of Experiment 1 showed that when the CA module was added before all residual layers, the mAP was reduced by 3.31% compared with the decoupled optimization model. The results of Experiment 2 demonstrated that when the CA module was added to all residual layers, the mAP increased only slightly, by 0.4%, compared with the decoupled optimization model. Thus, the effect of the attention mechanism was not obvious. The results of Experiment 3 showed that adding the attention mechanism to the residual layers of P3, P4, and P5 increased the mAP by nearly 1.4% compared with the decoupled optimization model. The three experiments showed that adding coordinate attention to the residual layers of P3, P4, and P5 was the most effective choice. Figure 15 presents the P-R curves of different attention methods obtained on the SSDD and HRSID datasets. As shown in Figure 15, the coordinate attention mechanism outperformed the SE and CBAM in the ship detection task on SAR images. Table 5 compares the performances of different attention mechanisms on the SSDD dataset. As shown in Table 5, all three attention models improved the model's mAP, but the CBAM and SE reduced the decoupled model's recall rate. The CA improved the model's mAP the most, while also improving the model's recall rate. Therefore, among the three attention mechanisms, the CA provided the greatest enhancement to the model's detection performance.
In addition, as shown in Figure 16, the intermediate feature visualization results demonstrated the benefits of the coordinate attention module used in this paper.
After introducing the coordinate attention mechanism, the model could effectively deal with the multi-scale problem in the task of ship target detection in SAR images.

G. ADAPTABILITY EXPERIMENT
To verify that the proposed model could be migrated easily, two large SAR images were selected, and their real geographical locations were marked, as presented in Figure 17. In Figure 17, the Strait of Malacca is marked in orange, and the Strait of Singapore is marked in red. Both are well-known shipping routes. The descriptions of the two large-scene images are given in Table 6. As shown in Table 6, VV-polarized images of target ships with high backscattering values, acquired in the interferometric wide swath (IW) mode of Sentinel-1, were selected. Due to the limited GPU memory, it was impossible to use the large-scale images in model training directly. Therefore, training and testing were performed on sub-images obtained by splitting the large images, following [57]. The adaptability of the proposed algorithm was also tested, and the obtained results are shown in Figure 18.
The detection results in Figure 18 show that the proposed model could successfully detect the majority of the target ships.
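Splitting a large scene into sub-images, as required above by the limited GPU memory, can be sketched as follows. The tile size of 416, the stride, and the function names are hypothetical; the actual splitting scheme of [57] is not described in the text and may differ, for example in how much neighboring tiles overlap.

```python
def tile_starts(length, tile, stride):
    """Start offsets along one axis; the last tile is shifted back to end at the border."""
    if length <= tile:
        return [0]  # an image smaller than one tile would need padding
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)
    return starts

def tile_image(width, height, tile=416, stride=416):
    """Boxes (x1, y1, x2, y2) of all sub-images covering a width x height scene."""
    return [(x, y, x + tile, y + tile)
            for y in tile_starts(height, tile, stride)
            for x in tile_starts(width, tile, stride)]

def to_scene_coords(box, tile_box):
    """Map a detection from sub-image coordinates back to the full scene."""
    x1, y1, x2, y2 = box
    ox, oy = tile_box[0], tile_box[1]
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)

tiles = tile_image(1000, 500)
print(len(tiles), tiles[-1])
print(to_scene_coords((10, 20, 30, 40), tiles[-1]))
```

Detections from overlapping tiles would normally be merged afterwards, for instance with NMS over the scene coordinates.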

V. DISCUSSION
Currently, the three popular families of target detection architectures on mobile terminals are the ShuffleNet [58], [59], FBNet [60], and MobileNet [61]. The mobile target detection architecture adopted in this work is MobileNetV3, and it is presented in Figure 19. First, the core idea of the MobileNet series is the depthwise separable convolution, which divides an ordinary convolution into a depthwise convolution and a pointwise convolution. In the depthwise convolution, the convolution kernel is split along the channel dimension, and each part is convolved with the corresponding channel of the input feature map. However, it is noteworthy that the reduction in the dimensions of the feature map leads to the loss of useful information. To address this issue, MobileNetV3 applies a pointwise convolution after the depthwise convolution to ensure the required number of channels in the output feature map. Based on these operations, the depthwise separable convolution reduces the number of computations and parameters to about one-ninth to one-eighth of those of the standard convolution at the cost of a 1% reduction in accuracy. Second, the inverted residual structure with a linear bottleneck is implemented to reduce the information loss during training caused by a low-dimensional ReLU. The pointwise (PW) convolution is used to increase the dimensions before the depthwise convolution is performed, so that the convolution extracts features in a high-dimensional space. A residual connection is introduced, and the activation function after the last layer is replaced by a linear function.
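The "one-ninth to one-eighth" figure can be checked by counting multiplications. The sketch below compares a standard convolution with a depthwise separable one; the layer shape (3 × 3 kernel, 128 input channels, 256 output channels, 52 × 52 output) is a hypothetical example and the function names are illustrative.

```python
def standard_conv_mults(k, c_in, c_out, h, w):
    """Multiplications of a standard k x k convolution producing an h x w output."""
    return k * k * c_in * c_out * h * w

def depthwise_separable_mults(k, c_in, c_out, h, w):
    """Depthwise k x k convolution per input channel, then a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise

k, c_in, c_out, h, w = 3, 128, 256, 52, 52
ratio = depthwise_separable_mults(k, c_in, c_out, h, w) / standard_conv_mults(k, c_in, c_out, h, w)
print(ratio)  # equals 1 / c_out + 1 / k**2
```

The ratio simplifies to 1/c_out + 1/k², so for 3 × 3 kernels it approaches one-ninth as the number of output channels grows and stays below one-eighth once c_out is at least 72.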
During the lightweight experiment, the backbone of the baseline network was replaced with the MobileNetV3. In addition, the decoupled head and coordinate attention mechanism were used to improve the performance of the lightweight network. Figure 20 shows the P-R curves of the lightweight model. The results in Figures 20(a) and 20(b) obtained on the SSDD dataset show that after introducing the decoupled head and coordinate attention module, the mAP increased by more than 4% compared to the lightweight baseline. The results presented in Figures 20(c) and 20(d) obtained on the HRSID dataset show that after introducing the decoupled head and coordinate attention module, the mAP increased by approximately 2% compared to the lightweight baseline. This further indicates that the decoupled head and coordinate attention module can be successfully applied to a lightweight network. Moreover, the number of parameters of the lightweight model was 7.1M, accounting for only 9.1% of the parameters of the model presented in Figure 3. Compared with that model, the accuracy on the SSDD dataset was lowered by 0.5%, whereas the accuracy on the HRSID dataset was reduced by approximately 5%. In addition, the computational complexity of the lightweight model was 3.52 GFLOPs, and its detection frame rate was 49.32 FPS.
Next, the focus layer was introduced to the lightweight model for further analysis, and the corresponding results are shown in Figure 21. Tables 7 and 8 compare various performance metrics of the MobileNetV3 lightweight baseline network, the MobileNetV3 lightweight network with coordinate attention and decoupled head, and the focus-optimized network on the SSDD and HRSID datasets, respectively.
To minimize the number of parameters in the lightweight model, the large and small versions of the original lightweight model were analyzed. The small model had 1.8M parameters, about one quarter of the parameters of the original lightweight model. The effectiveness was verified on the SSDD and HRSID datasets. Figure 22 depicts the P-R curves of the large and small models.

VI. CONCLUSION
By introducing well-known optimization strategies, this study improves the detection accuracy of the original YOLOv4. However, the complex YOLOv4 structure is not conducive to mobile deployment. To address this problem, this paper proposes a decoupled head and coordinate attention method that maintains good detection performance after the YOLOv4 network is made lightweight. A decoupled head is used to optimize model performance. Moreover, to address the channel attention mechanism's inability to gather precise position information and the CBAM's inability to capture long-range dependencies in the spatial domain, the coordinate attention module is added to the third stage, which learns the most robust features, and to the fourth and fifth stages, which have higher-level semantics. Further, a two-way trunk is introduced to improve the detection performance of the model for small targets. According to the experimental results on two public datasets, compared to the other five SAR ship detectors based on CNN, the proposed decoupled head and coordinate attention method is feasible and has higher detection performance. Moreover, the proposed method obtains satisfactory detection results on two large-scene images, indicating the strong migration ability of the proposed model in marine monitoring.
The results presented in this study can be useful for further research on SAR ship detection.