Inshore Dense Ship Detection in SAR Images Based on Edge Semantic Decoupling and Transformer

Synthetic aperture radar ship detection has recently received significant attention from scholars. However, accurately distinguishing between ships is challenging due to the significant overlap between inshore ship labels. In addition, some labeled boxes contain interference information, such as land areas, which can cause false alarms and confusion in ship feature learning. To address these challenges, this article creates an edge semantic decoupling (ESD) module, adds semantic segmentation branches, and introduces the edge semantic information of ships into the training process. As a result, the model can accurately distinguish between ship targets even when significant overlap exists between inshore labeled boxes. In addition, considering that transformer has the benefit of capturing global and contextual information, this article introduces it into the detection layer to construct a transformer detection layer (TDL) to limit the interference of land and other regions within the labeled box. Experimental results from the public SAR ship detection dataset show that the proposed ESD module and TDL detection layer effectively distinguish different ship targets in the inshore dense ship area, which is less affected by interference areas, such as land in the labeled box. The average precision improves to 96.72%, and both false alarms and miss detections inshore are reduced.

I. INTRODUCTION
Synthetic aperture radar (SAR) can conduct all-day and all-weather observations, allowing for long-term monitoring of the ocean without interference from weather conditions such as cloud cover and fog [1], [2], [3], [4], [5], [6]. With the increase in public datasets and the development of convolutional neural networks (CNNs), more and more researchers are utilizing CNNs for SAR ship detection [7], [8], [9], [10], [11], [12], [13]. Current methods for SAR ship detection using CNNs can be classified into two categories: 1) single-stage and 2) two-stage detection. The two-stage detection method involves a two-step process in which the detection box is first coarsely extracted using a region proposal network (RPN) [14], followed by regression and classification of the box. Some representative methods using this approach are faster region-CNN (R-CNN) [14] and cascade R-CNN [15]. The single-stage detection method has the advantage of being faster and more efficient, as it can perform regression and classification without the need for coarse extraction. Some representative methods using this approach are YOLOv3 [16] and YOLOv4 [17]. In consideration of practical application requirements, the current trend in SAR ship detection primarily favors the single-stage detection method due to its speed and simplicity.
The detection of ships in SAR images using CNN-based methods requires a large number of labels. However, dense inshore ship labels often overlap, making it challenging to differentiate between different ship targets. As a result, the dense region is prone to missed detection, as illustrated in Fig. 1. Tian et al. [18] attempted to solve this issue by utilizing rotating boxes for detection. Rotating box labeling mitigates a significant portion of the inshore interference. However, this approach is limited in its ability to include contextual semantic information, resulting in a higher likelihood of false alarms in inshore scenarios. Conversely, the horizontal box contains richer contextual information, but its susceptibility to higher levels of inshore interference creates a pressing issue for the detection of inshore ships. One critical challenge is therefore how to suppress the inshore interference when working with a horizontal box containing significant contextual semantic information. In another approach, Ma et al. [19] employed key point estimation to differentiate individual targets in dense inshore ships. However, this method could only highlight the central area of the ship, not the edges, resulting in a biased fit of inshore ship labels in the dense region. Su et al. [20] and Wu et al. [21] attempted to overcome the problem of dense overlap by merging the edge semantic information of ships and employing instance segmentation methods for detection. However, their algorithms were complex and inefficient to implement. Ge et al. [22] demonstrated that decoupling the detection layer can improve the regression and classification performance of the model. Decoupling can be utilized to construct a simple and easy-to-implement module that introduces inshore edge semantic information about ships, thereby addressing the missed detection problem in dense inshore scenarios.
The inshore ship labels often contain interference information, such as land, which can lead to land false alarms and mislead ship feature learning, as illustrated in Fig. 1. Wang et al. [23] and Hou et al. [24] showed that introducing contextual semantic information in combination with the scene effectively reduces land false alarms in the inshore region. Ke et al. [25] expanded the encoding and increased contextual information to obtain feature maps of multiple receptive fields, which was more effective but computationally complex and unsuitable for practical applications. Zhu et al. [26] demonstrated that introducing a transformer can capture contextual information more comprehensively, especially for high-density occluded objects, with minimal computational overhead. The transformer is composed of an encoder and a decoder, which can better obtain the contextual semantic information of the target, improve the feature extraction ability, and better locate the edges of the target in object detection [27]. Therefore, a transformer can be introduced to build a plug-and-play module for learning contextual information to reduce the land false alarms caused by land area interference in ship labels.
In this article, an SAR ship detection method based on edge semantic decoupling and transformer is proposed to address the issue of inshore dense ship detection. To tackle the challenge of miss detection caused by inshore label overlap, a semantic segmentation layer is added by decoupling the detection layer, thereby enhancing the model's ability to differentiate ship edges and reduce miss detection in dense scenes. Furthermore, to mitigate the interference from regions, such as land in the labeled boxes and facilitate feature learning, a transformer detection layer is constructed that leverages the transformer's capacity to capture global and contextual information. This enables the model to better distinguish between inshore false alarm targets and ship targets, leading to a reduction in land false alarms.
In summary, it is worthwhile to note the following contributions of the proposed method.
1) The edge semantic decoupling (ESD) module is introduced to address the challenge of distinguishing between dense inshore ship targets in SAR images. By adding semantic segmentation branches and incorporating edge semantic information, the model is able to accurately discriminate between ships even in regions with significant overlap between inshore labeled boxes.
2) The transformer detection layer (TDL) is introduced to limit interference caused by land areas and other regions within the labeled box. By taking advantage of the transformer's ability to capture global and contextual information, the TDL helps to reduce false alarms and improve the accuracy of ship target detection.
The rest of this article is organized as follows. Section II presents the proposed method. In Section III, the proposed method is validated by comparison to other methods. Section IV presents discussions. Finally, Section V concludes this article.
II. METHODOLOGY
Fig. 2 illustrates the overall structure of the proposed method. It consists of the feature extraction and detection layer parts, with the solid orange line representing the improved ESD module and the transformer-based TDL module. In Section II-A, the ESD module designed for SAR images of dense inshore ships is introduced first. Then the design details of the TDL module are presented. Finally, the decoupling loss function is presented.

A. Edge Semantic Decoupling Module
Inshore ships are known to have a dense appearance with significant labeled box overlap, which can negatively impact the model's ability to assess individual ship targets, resulting in the inclusion of several ship targets within a detection box (i.e., missed detection), as illustrated in Fig. 1(b). Conventional single-stage detection algorithms utilize a single branch to simultaneously handle the tasks of classification and detection box coordinate regression. However, the goals of classification and localization are different: classification is primarily concerned with the texture information of the target, while localization is focused on the edge information of the target. This difference in focus can lead to conflicts between the two tasks as follows.
1) Higher-level convolutional layers have a larger receptive field, allowing them to extract more global information, which is useful for classification. However, the corresponding areas in the original image become larger, which can be detrimental to localization. Therefore, while the information contained in higher-level feature maps is advantageous for classification, it is not necessarily helpful for localization.
2) Lower-level convolution and other operations correspond to smaller areas of the original image, making them more accurate for localization. However, they may only contain local information about the object and therefore are not suitable for classification. Consequently, the information contained in the lower-level feature map is suitable for target localization but not for classification.
In order to overcome the limitations of performing the two tasks in one branch, the approach proposed in this article, inspired by [22] and [28], separates the tasks of classification and detection box coordinate regression into two distinct branches. Specifically, an additional ship semantic segmentation branch is incorporated to capture edge information of ships, enhancing the model's ability to differentiate between different ship targets in dense scenes and reducing the occurrence of missed detections.
The ESD module and the conventional single branch are compared in the middle of Fig. 3. Compared with the conventional single branch, the decoupled detection head performs multiple tasks, producing the results of both ship detection and semantic segmentation simultaneously. The three branches (regression, classification, and segmentation) are separated after the backbone network, and each uses additional convolutional layers to specialize in learning its own features. This decoupling makes the learned features richer and helps to further define the ship's position and edges. It not only introduces semantic information about the ship's edge but also significantly improves the model's scalability and ability to carry out multiple tasks.
In the conventional single-branch decoding, $Infer_{single}$ denotes the prediction result of the conventional single-branch structure, and Conv denotes the 2-D convolution. $ch$ is the number of channels of the feature map extracted by the detection layer, which is set to 255 in this article. $cls_{det}$ is the number of detected target categories, which is set to 1 as there is only one ship category in the detection task of this article. $num_{reg}$ is the number of regression coordinate values. It is set to 4, which corresponds to the horizontal and vertical coordinates of the upper-left and lower-right corners of the detection box.
In the edge semantic decoupling, $Infer_{decouple}$ is the prediction result of the edge semantic decoupling structure and $cls_{seg}$ is the number of segmentation categories. Since there is only one ship category in the detection task of this article, $cls_{seg}$ is set to 1.

B. Transformer Detection Layer Module
Due to the proximity of inshore ships to land, their labeled boxes often include land, as illustrated in Fig. 1(a). This can lead to the model mistakenly identifying certain features of the land as features of the ship in the absence of contextual information, resulting in false alarms.
Compared to CNN, the transformer architecture can efficiently extract contextual information of the target by uniformly cropping the input into multiple patches and utilizing a multiheaded attention mechanism [29]. To mitigate the impact of land areas in the detection labels, this article introduces transformer to construct the TDL module, which incorporates contextual information and enhances the model's ability to distinguish land targets, thus reducing false alarms.
The input of the transformer is a 1-D sequence of token embeddings. To handle 2-D feature maps, the feature map $x \in \mathbb{R}^{h \times w \times c}$ is reshaped into a sequence of flattened 2-D patches $x_p \in \mathbb{R}^{n \times (p^2 \cdot c)}$, where $(h, w)$ is the resolution of the feature map, $c$ is the number of channels, $(p, p)$ is the resolution of each feature patch, and $n = hw/p^2$ is the resulting number of patches. In the patch embedding process, $Output_{embedding}$ denotes the embedded patches, and Part denotes the chunking operation, which divides the input feature map into patches of a specific size. Conv is used to reduce dimensions, and Flatten is used to flatten the result into a 1-D vector. The design of the transformer detection layer is illustrated in Fig. 4. It mainly consists of a multihead attention module and a feedforward multilayer perceptron (MLP) module.
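The reshaping step above can be illustrated with a small pure-Python sketch. It only shows the chunking and flattening of an $h \times w \times c$ feature map into $n = hw/p^2$ patches of length $p^2 \cdot c$; the function names `patch_sequence_shape` and `split_into_patches` are hypothetical, and the dimension-reducing Conv step is deliberately left out because its exact configuration is not given here.

```python
def patch_sequence_shape(h, w, c, p):
    """Return (n, p*p*c), the shape of the flattened patch sequence."""
    if h % p or w % p:
        raise ValueError("feature map size must be divisible by the patch size")
    n = (h * w) // (p * p)  # number of patches, n = hw / p^2
    return n, p * p * c


def split_into_patches(fmap, p):
    """Split an h x w x c nested-list feature map into n flattened patches."""
    h, w = len(fmap), len(fmap[0])
    if h % p or w % p:
        raise ValueError("feature map size must be divisible by the patch size")
    patches = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            flat = []
            for i in range(top, top + p):
                for j in range(left, left + p):
                    flat.extend(fmap[i][j])  # append the c channel values
            patches.append(flat)
    return patches


# A 4 x 4 single-channel map whose values are the flat pixel indices 0..15.
feature_map = [[[r * 4 + col] for col in range(4)] for r in range(4)]
patches = split_into_patches(feature_map, p=2)
print(len(patches), len(patches[0]))
```

With a patch size of 2, the 4 x 4 map yields four patches of four values each, matching the shape returned by `patch_sequence_shape(4, 4, 1, 2)`.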
The multihead attention module splits the input $x \in \mathbb{R}^{N \times d_{in}}$ evenly along the feature dimension to obtain several parts $x_i \in \mathbb{R}^{N \times d_i}$, $i = 1, 2, \ldots, n$, where $N$ is the sequence length, $d$ denotes the feature dimension, and $\sum_{i=1}^{n} d_i = d_{in}$. Attention is applied to each $x_i$ to obtain $n$ outputs, which are then concatenated along the feature dimension to obtain the final result. In a single attention, $Q$, $K$, and $V$ are obtained from the input $x_i$ through the fully connected layer and $B$ is the position information. In the MLP, drop denotes the dropout operation, fc denotes a fully connected layer, $x$ is the input, and act denotes the Gaussian error linear unit (GELU) activation function, $\mathrm{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution. The input of the transformer encoder is the embedded patches, and the added LayerNorm and Dropout layers are used to prevent overfitting. The TDL module mainly replaces one convolutional layer of the detection layer, which enhances the ability to capture diverse contextual information with only a minor increase in computational cost. It also leverages the self-attention mechanism to explore the potential of feature representation.
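As a rough illustration of one attention head and the GELU activation, the sketch below assumes the common scaled dot-product form $\mathrm{Softmax}(QK^{T}/\sqrt{d} + B)V$, with $B$ added to the attention scores as a position term. This form is an assumption made for illustration and is not confirmed as the exact expression used in the TDL module; the helper names are hypothetical, and the exact GELU based on the error function is used.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, B):
    """Single-head attention, assumed form Softmax(Q K^T / sqrt(d) + B) V.

    Q and K are N x d, V is N x d_v, and B is an N x N position term.
    """
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + B[i][j]
            for j, k in enumerate(K)
        ]
        weights = softmax(scores)
        out.append([
            sum(wt * v[col] for wt, v in zip(weights, V))
            for col in range(len(V[0]))
        ])
    return out


# With all-zero queries and no position term, every token attends uniformly,
# so each output row is the mean of the rows of V.
Q = [[0.0, 0.0], [0.0, 0.0]]
K = [[1.0, 2.0], [3.0, 4.0]]
V = [[1.0, 0.0], [3.0, 2.0]]
zero_bias = [[0.0, 0.0], [0.0, 0.0]]
uniform_out = attention(Q, K, V, zero_bias)
print(uniform_out)
```

A strongly negative off-diagonal $B$ makes each token attend only to itself, which shows how the position term can steer the attention weights.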

C. Decoupling Loss Function
The decoupling loss function in this article is designed to optimize both the detection and segmentation tasks in a single network structure. Instead of training and optimizing the two tasks individually, the decoupling loss allows for the inclusion of edge semantic information in the ship detection optimization process by back-propagating the loss after both ship segmentation and detection have been performed.
The designed decoupling loss function has two components, namely, 1) ship detection loss function and 2) ship semantic segmentation loss function.

1) Ship Detection Loss Function: The loss function of the ship detection component is as follows:
$$\mathrm{loss}_{det} = \mathrm{loss}_{CIoU} + \mathrm{loss}_{cls}.$$
$\mathrm{loss}_{CIoU}$ is the detection box regression loss. In order to more effectively filter out the high-quality detection results that are closer to the labeled box, this article uses CIoU [30] as the regression loss for ship detection. The CIoU loss is calculated as
$$\mathrm{loss}_{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^{2}(B_p, B_g)}{c^{2}} + \alpha v$$
where $\rho^{2}(B_p, B_g)$ represents the squared Euclidean distance between the center points of the detection box and the labeled box, $B_p$ is the detection box, $B_g$ is the labeled box, and $c$ represents the length of the diagonal between the upper-left and lower-right corners of the smallest enclosing rectangle of the detection box and the labeled box. $\alpha$ is a tradeoff parameter and $v$ measures the consistency of the aspect ratio:
$$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}$$
$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}$$
where $w$ and $h$ represent the width and height of the prediction box, respectively, and $w^{gt}$ and $h^{gt}$ represent the width and height of the labeled box, respectively. The IoU is defined as
$$\mathrm{IoU} = \frac{|B_p \cap B_g|}{|B_p \cup B_g|}.$$
$\mathrm{loss}_{cls}$ is the category classification loss for ship detection. In the dataset used in this study, there is only one category, so the foreground and background of the ships need to be separated. The binary cross-entropy function used is shown as follows:
$$\mathrm{loss}_{cls} = -\frac{1}{n}\sum_{i=1}^{n}\left[I_{obj}^{(i)}\log d^{(i)} + \left(1 - I_{obj}^{(i)}\right)\log\left(1 - d^{(i)}\right)\right] \tag{12}$$
where $I_{obj}$ is the value of the detection label: 0 for the background and 1 for the ship. $d$ is the output of the detection layer after the Sigmoid function, and $n$ is the number of detected samples.
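The CIoU regression loss can be sketched in a few lines, following the standard definition from [30]: one minus the IoU, plus the normalized squared center distance, plus the weighted aspect-ratio term. The box format `(x1, y1, x2, y2)` with upper-left and lower-right corners and the function names `iou` and `ciou_loss` are assumptions made for illustration, and boxes are assumed to have positive width and height.

```python
import math


def iou(bp, bg):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(bp[2], bg[2]) - max(bp[0], bg[0]))
    ih = max(0.0, min(bp[3], bg[3]) - max(bp[1], bg[1]))
    inter = iw * ih
    area_p = (bp[2] - bp[0]) * (bp[3] - bp[1])
    area_g = (bg[2] - bg[0]) * (bg[3] - bg[1])
    return inter / (area_p + area_g - inter)


def ciou_loss(bp, bg):
    """CIoU loss: 1 - IoU + rho^2 / c^2 + alpha * v."""
    overlap = iou(bp, bg)
    # Squared distance between the two box centers (rho^2).
    rho2 = (((bp[0] + bp[2]) - (bg[0] + bg[2])) / 2.0) ** 2 \
         + (((bp[1] + bp[3]) - (bg[1] + bg[3])) / 2.0) ** 2
    # Squared diagonal of the smallest rectangle enclosing both boxes (c^2).
    c2 = (max(bp[2], bg[2]) - min(bp[0], bg[0])) ** 2 \
       + (max(bp[3], bg[3]) - min(bp[1], bg[1])) ** 2
    w, h = bp[2] - bp[0], bp[3] - bp[1]
    wg, hg = bg[2] - bg[0], bg[3] - bg[1]
    # v measures aspect-ratio consistency; alpha is the tradeoff weight.
    v = (4.0 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    denom = (1.0 - overlap) + v
    alpha = v / denom if denom > 0 else 0.0
    return 1.0 - overlap + rho2 / c2 + alpha * v


print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))
```

Unlike a plain IoU loss, the center-distance term still provides a useful signal when the boxes do not overlap at all, as in the second call.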

2) Ship Semantic Segmentation Loss Function:
Since the ship semantic segmentation also needs to distinguish between just two types of pixels, i.e., ship and background, the same binary cross-entropy function as in (12) is used:
$$\mathrm{loss}_{seg} = -\frac{1}{n}\sum_{i=1}^{n}\left[I_{obj}^{(i)}\log s^{(i)} + \left(1 - I_{obj}^{(i)}\right)\log\left(1 - s^{(i)}\right)\right] \tag{13}$$
where $I_{obj}$ is the true value of the semantic segmentation of the ship edges. The value 0 indicates that the pixel belongs to the background, whereas 1 indicates that it belongs to the ship. $s$ is the output of the segmentation layer after the Sigmoid function and $n$ is the number of segmented samples. The joint loss for detection and segmentation in this article is shown in (14), and it is the quantity that is back-propagated for the final parameter update.
$$\mathrm{loss}_{total} = \mathrm{loss}_{det} + \mathrm{loss}_{seg}. \tag{14}$$
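A minimal sketch of the binary cross-entropy and the joint loss is given below. It assumes the predictions have already passed through a Sigmoid, clamps them by a small `eps` for numerical safety (an implementation detail added here, not taken from the article), and uses the hypothetical helper names `bce` and `joint_loss`.

```python
import math


def bce(preds, labels, eps=1e-7):
    """Mean binary cross-entropy between Sigmoid outputs and 0/1 labels."""
    total = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(preds)


def joint_loss(loss_ciou, det_preds, det_labels, seg_preds, seg_labels):
    """loss_total = (loss_CIoU + loss_cls) + loss_seg."""
    loss_det = loss_ciou + bce(det_preds, det_labels)
    loss_seg = bce(seg_preds, seg_labels)
    return loss_det + loss_seg


# One detection sample and two segmentation pixels, all predicted at 0.5.
total = joint_loss(0.2, [0.5], [1], [0.5, 0.5], [1, 0])
print(total)
```

Because both terms are summed before back-propagation, gradients from the segmentation branch reach the shared layers together with the detection gradients.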

III. EXPERIMENTAL RESULTS AND DISCUSSION
The effectiveness of the proposed method is evaluated using the publicly accessible dataset SSDD labeled by Zhang et al. [7]. This section first introduces the dataset and the hyperparameter settings of the experiments, followed by presenting the ablation experiment results of each module. Finally, the proposed method is compared with the current mainstream detection algorithms.

A. Dataset and Experimental Parameter Setting
The SSDD dataset used in this study contains SAR images with resolutions ranging from 1-15 m, sourced from RADARSAT-2, TerraSAR-X, and Sentinel-1. The dataset includes ship detection box labels as well as ship semantic segmentation labels. An example of a labeled dataset can be seen in Fig. 5.
The SSDD dataset comprises 1160 labeled images containing 2456 ship targets. Of the 1160 images, 928 were used for training and 232 were used for testing. In addition, 46 of the test images are inshore scenes and 186 are offshore scenes.

Algorithm 1: Update Parameters During Training.
The network is implemented using the PyTorch deep learning framework. The optimizer utilized is stochastic gradient descent (SGD) with momentum. A GeForce RTX 2070 GPU is used to train the model for 100 epochs, with an initial learning rate of 1e-3, momentum of 0.9, and weight decay of 5e-4. Joint training is necessary to incorporate ship edge semantic information into the optimization of the ship detection model throughout the training phase. Given that semantic segmentation and target detection have different labels, the losses are calculated separately during training. First, the segmentation and detection losses are added together for back-propagation. Then, the gradient is accumulated until a preset number of steps is reached, and a parameter update is performed. The training process is shown in Algorithm 1.
Gradient accumulation training offers the advantage of achieving large effective batches even on machines with limited video memory, thereby mitigating loss oscillations during training and allowing the best model to be obtained faster.
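The gradient accumulation idea can be shown with a toy example. The sketch below is not Algorithm 1; it replaces the summed detection and segmentation loss with a mean squared error on a one-parameter model $y = wx$ (an assumption made purely to keep the example self-contained) and shows that accumulating scaled micro-batch gradients before one update reproduces the full-batch SGD step. The names `grad_mse` and `train_with_accumulation` are hypothetical.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train_with_accumulation(w, micro_batches, accum_steps, lr):
    """Plain SGD that updates only after accum_steps micro-batches."""
    accumulated = 0.0
    for step, batch in enumerate(micro_batches, start=1):
        # Scale each micro-batch gradient so the sum is an average.
        accumulated += grad_mse(w, batch) / accum_steps
        if step % accum_steps == 0:
            w -= lr * accumulated  # one parameter update per accumulation window
            accumulated = 0.0
    # Gradients left over from an incomplete window are dropped in this sketch.
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
micro_batches = [data[:2], data[2:]]

w_accumulated = train_with_accumulation(0.0, micro_batches, accum_steps=2, lr=0.01)
w_full_batch = 0.0 - 0.01 * grad_mse(0.0, data)
print(w_accumulated, w_full_batch)
```

Both updates give the same weight, which is why accumulation behaves like a larger batch while only one micro-batch needs to fit in memory at a time.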

B. Evaluation Metric
In this article, the average precision (AP) metric is used to assess the performance of the ship detection model, which is calculated as follows:
$$\mathrm{AP} = \int_{0}^{1} P(R)\,dR \tag{15}$$
where
$$P = \frac{TP}{TP + FP} \tag{16}$$
$$R = \frac{TP}{TP + FN} \tag{17}$$
and TP, FP, and FN refer to the number of correctly predicted ship targets, the number of incorrectly predicted ship targets, and the number of ship targets judged to be nonship targets, respectively. $P$ represents the precision, which is the proportion of correct predictions among all predictions. $R$ represents the recall, which is the proportion of correctly predicted ship targets among all annotated ship targets. AP describes the area under the Precision-Recall (P-R) curve. It is a compromise between the two metrics and also shows the overall performance of different methods.
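The metric definitions can be made concrete with a short sketch. It computes precision and recall from counts and approximates AP as the area under the P-R curve by a step-wise sum over score-ranked detections. The lack of interpolation is an assumption of this sketch, since common benchmark tools use interpolated variants, and the function names and the `(score, is_true_positive)` input format are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision P = TP / (TP + FP) and recall R = TP / (TP + FN)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


def average_precision(detections, num_gt):
    """Area under the P-R curve from (score, is_true_positive) pairs.

    Detections are ranked by score, and precision is accumulated each time
    recall increases (step-wise sum, no interpolation).
    """
    ranked = sorted(detections, key=lambda det: det[0], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for _score, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


# Three detections against two annotated ships: hit, false alarm, hit.
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
print(precision_recall(8, 2, 2), ap)
```

The false alarm ranked second lowers the precision at full recall to 2/3, so the AP falls below 1 even though both ships are eventually found.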

C. Effectiveness of ESD
The aim of the ESD module is to reduce the ship's missed detection in inshore dense scenarios. To assess the module's effectiveness, comparative experiments were performed on the publicly available SSDD dataset, and the experimental results are shown in Table I.
In Table I, P, R, and AP represent the precision, recall, and average precision in the inshore region, respectively. $AP_{50}$ represents the AP of inshore ship detection calculated with an IoU threshold of 0.5. $AP_{50-95}$ means that the IoU threshold is taken from 0.5 to 0.95 in steps of 0.05, and then the average value of the APs under these IoU thresholds is calculated. Compared with $AP_{50}$, the calculation of $AP_{50-95}$ is more rigorous and better reflects the strengths and weaknesses of the model.
As can be seen from Table I, compared to the conventional single-branch detection structure, the P, R, $AP_{50}$, and $AP_{50-95}$ of the inshore ships are improved after adding the ESD module. R increases by 2.31%, indicating that the missed detection of inshore ships is alleviated. To further analyze the impact of the ESD module, feature visualization is performed in this article. The PLT image processing package is used in the model inference process to save the feature matrices at different scales by channel and colorize them, which makes the visualized feature maps easier to read than grayscale maps. The feature visualization results in Fig. 6 also demonstrate that the inclusion of the ESD module results in sharper edges of ships in dense areas and a clearer distinction between individual ships. The interference in the land area is also effectively suppressed, which helps to decrease the rate of missed detection in the dense inshore scenario.
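The averaging over IoU thresholds can be sketched as follows. The callable `ap_at`, which maps an IoU threshold to the AP measured at that threshold, is a hypothetical stand-in for a full evaluation run, and the linear AP curve in the example is made-up illustration data rather than a result from this article.

```python
def iou_thresholds():
    """IoU thresholds from 0.5 to 0.95 in steps of 0.05 (ten values)."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def ap_50_95(ap_at):
    """Mean of ap_at(t) over the ten IoU thresholds."""
    thresholds = iou_thresholds()
    return sum(ap_at(t) for t in thresholds) / len(thresholds)


# Toy AP curve that drops linearly as the IoU threshold gets stricter.
toy_value = ap_50_95(lambda t: 1.0 - t)
print(iou_thresholds(), toy_value)
```

Because stricter thresholds usually lower the AP, the averaged value is typically well below the AP at the single threshold of 0.5.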

D. Ablation Experiments of ESD and TDL Module
The proposed transformer-based TDL module is capable of effectively extracting contextual information about the target, which enables it to accurately differentiate between the ship target and land-based false alarms. Adding the TDL module to the baseline increases the number of model parameters by only 0.0064%, which is a small increase in computational burden. Ablation experiments were conducted to confirm the effectiveness of this module. According to the experimental results in Table II, the addition of the TDL detection layer improves the P, R, $AP_{50}$, and $AP_{50-95}$. Compared to the baseline algorithm, P improves by 5.96%, and when TDL is added to ESD, P improves by 0.73%, indicating that TDL can effectively combine contextual information to reduce false alarms in the inshore region. The $AP_{50-95}$ is improved by 5.16%, indicating that the model's overall performance has been optimized. In Fig. 7, the area enclosed by each P-R curve and the coordinate axes is the value of $AP_{50}$, from which the improvement in accuracy of the proposed method can be seen more intuitively.
To demonstrate the superiority of the proposed method, comparative experiments were conducted with typical two-stage detection algorithms (Faster R-CNN [14], Cascade R-CNN [15]), a rotating box algorithm (OSCD-Net [31]), and typical single-stage detection algorithms (YOLOv3 [16], YOLOv4 [17]). Among the above methods, the backbone network of Faster R-CNN, Cascade R-CNN, and OSCD-Net is ResNet, whereas YOLOv3, YOLOv4, and the method proposed in this article use DarkNet as the backbone network. The ship detection results are shown in Table III, where the experimental results of OSCD-Net are taken from [31].
The proposed method in this article has been shown to be more effective than several commonly used conventional single-stage, two-stage, and rotate box detection algorithms in detecting SAR inshore ships. The detection results of different comparison methods are shown in Fig. 8. The ground truth images are colorized for different ship targets to facilitate better differentiation of dense adjacent ship targets. The red ellipse in the result comparison graph indicates the miss detection and the yellow ellipse indicates the false alarm. These visualizations provide an intuitive demonstration of the effectiveness of the proposed method in suppressing false alarms and miss detection in the inshore scenario when compared to conventional detection algorithms.

IV. DISCUSSION
The detection of inshore ships presents a greater challenge than that of ships located solely at sea due to the higher rates of false alarms and missed detections.
The higher rates of false alarms arise because SAR images of inshore ship targets are prone to interference from nonship targets. Therefore, context information is needed to better differentiate between ships and false alarms. In this article, a transformer-based TDL detection layer is introduced to capture global and context information, and comparative experiments have shown that adding TDL can effectively reduce false alarms.
The higher rates of missed detections arise because inshore ships are densely arranged, making it difficult to distinguish between adjacent targets. To alleviate this issue, ship edge semantic information needs to be introduced to better distinguish adjacent targets. In this article, by decoupling the detection layer and adding a semantic segmentation branch to introduce ship edge information, the ship recall rate was improved and missed detections were reduced. Compared to methods that increase computational complexity, such as dilated convolution or fusion of high-resolution feature layers to add context information, the proposed TDL layer only adds a small number of parameters while achieving high accuracy. Compared to using instance segmentation to introduce ship edge information, the proposed ESD method is simple to implement and does not require complex instance labels, making it an effective and easily implementable module.
However, training the proposed method requires ship semantic segmentation labels, which increases the annotation workload for large datasets. How to reduce the dependence of the decoupled semantic segmentation layer on ship semantic labels is a direction for future algorithm improvements. Compared to SAR images of purely sea scenes, inshore scenes are more complex and ship detection is more difficult. Therefore, how to improve the detection of inshore ships while adding minimal or even no computational burden is an important and meaningful research direction.

V. CONCLUSION
This article presents a novel method for detecting dense inshore ships in SAR images using an ESD module and a transformer-based TDL layer. The ESD module incorporates edge semantic information of inshore ship targets during training, improving the model's ability to distinguish between neighboring ships and reducing miss detections. Meanwhile, the TDL layer utilizes transformer to extract contextual information and reduce false alarms caused by interference from land and other regions in the labeled boxes. The results of comparison experiments with some two-stage and single-stage detection algorithms on the SSDD dataset showed the proposed method achieved the highest AP 50 of 96.72%, demonstrating its effectiveness in detecting inshore ships. The simple structure of the single-stage detector makes it easier to perform improvements and experiments, so this article performs experimental validation on a single-stage detector. However, the TDL and ESD proposed in this article are both plug-and-play improvement modules, which are less dependent on the overall structure of the detection algorithm. Theoretically, they can be fully ported to two-stage detectors, and whether the porting is effective requires extensive experimental verification. The future work is to explore the potential of combining this method with other two-stage detection frameworks for further optimization and improvement.