Ship Detection in Synthetic Aperture Radar Images Based on BiLevel Spatial Attention and Deep Poly Kernel Network

Abstract: Synthetic aperture radar (SAR) is a technique widely used in the field of ship detection. However, due to the high ship density, foreground-background imbalance, and varying target sizes


Introduction
Radar systems are able to differentiate between different sources of radiation by means of their signals, enabling the identification and classification of objects [1]. SAR is a unique active microwave imaging radar technology [2] that can overcome the limitations of bad weather, such as clouds, fog, rain, and snow, and provide high-resolution radar images of maritime targets. In addition, the resolution of SAR images does not change with the observation distance, which enables long-distance detection and supports long-term, continuous, dynamic, and real-time monitoring of vast sea areas. Therefore, SAR image target detection has been widely researched and applied in many fields, such as agriculture, forestry, water resource management, geology, and military research. It is also worth noting that SAR ship detection technology has received considerable attention from academics both domestically and internationally.
The advantages of deep learning methods are manifested in self-learning, self-improvement, and weight sharing. One-stage algorithms such as SSD [3] and the YOLO series [4][5][6][7] and two-stage detection networks such as Faster R-CNN [8] and Cascade R-CNN [9] have been widely used for object detection. In the field of ship detection, deep learning methods are increasingly influential.
The main challenge in using deep learning models for ship detection is dealing with targets at different scales. Various approaches have been proposed to achieve accurate multiscale ship detection. The most common approaches employ attention modules (ABP [10], AMMRF [11], and A-BFPN [12]) or multilayer feature fusion (DFF-YOLOv5 [13], LMSD-YOLO [14], MAM [15], and Quad-FPN [16]). All of these detection methods perform feature fusion based on existing features. However, the existing features do not adequately account for the various sizes of the object, and detection accuracy decreases in multiscale ship target detection. In addition, the semantic details of the ships extracted by these approaches are flooded, which is not favorable for ship detection. In particular, ship targets may become blurred under the influence of clutter, and degradation of the signal-to-noise ratio makes it more challenging to distinguish ship targets. Therefore, it is necessary to capture as much information about the ship as possible and make the best use of it to reduce the impact of clutter on the detection process. Moreover, many methods have high computational complexity and consume a large amount of memory without reducing redundant parameters.
In order to effectively solve the above problems, the YOLO-MSD method is proposed in this paper. Firstly, we design the Deep Poly Kernel Network (DPK-Net), which consists of the Optimized Convolution (OC) Module and the Poly Kernel (PK) Module. PConv is introduced in the OC module, which makes the network lighter and ensures more efficient extraction of spatial features. The PK module employs multiple parallel depthwise separable convolutions to capture contextual information at different scales, which enhances the depth and breadth of feature extraction and improves the detection accuracy and scale adaptability for targets of different sizes. Secondly, we propose a BiLevel Spatial Attention Module (BSAM), which combines the BiLevel Routing Attention (BRA) and the Spatial Attention Module. The BRA first receives global information, followed by the Spatial Attention Module, which enhances the network's capacity to localize the target and acquire high-quality detailed information. Finally, we adopt the Powerful-IoU (P-IoU) loss function. This function combines a target-size-adaptive penalty factor and a gradient-adjusting function based on the anchor box quality to guide the anchor box to faster regression. Through this series of designs, YOLO-MSD demonstrates excellent performance and broad applicability in ship detection tasks.
The main contributions of our work can be summarized as follows:
1. We construct DPK-Net in the backbone network, which consists of the OC module and the PK module. The OC module aims to reduce data redundancy and optimize the efficiency of information processing. The PK module extracts dense ship features from different receptive fields. These features are adaptively fused along the channel dimension to collect contextual information more efficiently.
2. We design the BSAM attention mechanism to obtain global information through sparsity while preserving ship detail information, and we achieve faster regression through P-IoU.
3. Extensive experiments have been conducted on the SSDD and HRSID datasets, with excellent experimental results proving the effectiveness of the proposed models.

Related Work

MultiScale Ship Detection
In SAR ship detection, ship targets exhibit multiscale diversity, and attention modules can be an effective solution.
Specifically, Fu et al. [10] designed an attention-guided balanced pyramid (ABP) structure in the FBR-Net network in order to improve the attention paid to small vessels. Tang et al. [11] proposed a multiscale attention mechanism for receptive field convolutional blocks (AMMRF), which can effectively use the positional information of the feature maps to distinguish between the vessels and the background. Li et al. [12] more fully utilized semantic features and multilayer complementary features to build an attention-guided balanced feature pyramid network (A-BFPN).
Multilayer feature fusion is also a commonly used method. Li et al. [13] obtained context fusion information by cascading and juxtaposing a number of pyramid modules containing different combinations of convolutional layers. Guo et al. [14] designed a depth-adaptive spatial feature fusion module for the multiscale problem in rotating object detection. Suo et al. [15] integrated low-level spatial data with high-level semantic data to address the challenge of detecting ships with significant size variations. Zhang et al. [16] developed four distinct feature pyramid modules and arranged them sequentially to create a Quad-FPN, enhancing the model's ability to detect features across multiple scales.
Although the above neural networks are effective in multiscale ship detection applications, they still have specific problems. They usually require a large amount of computational resources, especially in the training phase, which demands large amounts of data and computing power, resulting in high cost and energy consumption. In addition, it is difficult for them to account for targets of various scales in the feature extraction process without losing detailed information, and there is still room for improvement in detection performance.

Attention Mechanism
In the field of computer vision, attention mechanisms are designed to improve the efficiency of image feature extraction by assigning weights to the spatial and channel information in neural networks. This process generates image weight coefficients that amplify targets and diminish backgrounds, thereby aiding subsequent imaging tasks. Attention mechanisms can be classified into several types: hard attention, soft attention, self-attention, global attention, local attention, and multihead attention [17]. Notable attention mechanism algorithms include Squeeze-and-Excitation Networks (SENet), Selective Kernel Networks (SKNet), Convolutional Block Attention Module (CBAM), Criss-Cross Attention Network (CCNet), Object Context Network (OCNet), and Dual Attention Network (DANet) [18][19][20][21][22]. The Transformer model, which employs an encoder-decoder architecture, has also attracted significant interest in recent years [23].
Attention mechanisms hold significant promise for ship detection applications. For instance, Chen et al. [24] enhanced the feature extraction capabilities of backbone networks by utilizing attention mechanisms and multilevel features. Zhu et al. [25] introduced a hierarchical attention-based SAR ship detection method, which integrates global and local attention modules and applies a hierarchical attention strategy at both the image and target levels. Yasir et al. [26] incorporated a convolutional block attention module into the feature fusion module of the YOLO-tiny framework, assigning different weights to each feature map image to highlight effective features. Shan et al. [27] proposed the SimAM attention mechanism to enhance spatial features in images, improving both the accuracy of ship detection and the computational efficiency of the network. Zhou et al. [28] developed a sub-flap sensing mechanism to mitigate the impact of strong scattering points and enhance ship information, thereby improving the model's ability to recognize ship targets by identifying and suppressing sub-flap noise in images.
Although attention mechanisms can enhance the performance of ship detection, they come with certain limitations. These mechanisms add to the model's complexity, leading to higher computational costs, making the training process more challenging, and hindering the real-time performance of detection tasks.

Loss Function
The loss function allows for a thorough evaluation of the deviation between the model's predicted values and actual outcomes. As a pivotal element in model training, the loss function significantly influences both the model's performance and its practical applicability. Efficient and precise ship detection can be achieved by using a well-crafted loss function. For instance, Zhu et al. [29] implemented CIoU loss to speed up convergence and enhance overall performance. Yang et al. [30] developed a novel, straightforward, and efficient E 1/2 IoU loss that balances the impact of both high-quality and low-quality samples on the loss, thereby making it more effective for SAR image ship detection using unsupervised domain adaptation. Additionally, Hu et al. [31] incorporated the Normalized Wasserstein Distance (NWD) into the loss function to improve the regression for small ships and enhance the model's capability for multiscale detection.
Nonetheless, the aforementioned approaches encounter challenges in balancing the loss across targets of varying scales and exhibit slow convergence in multiscale ship detection. To address these issues, our method integrates a penalty factor, where the target box size acts as the denominator, and utilizes a P-IoU loss function tailored to the quality of the anchor box. This strategy enhances scale adaptation and robustness.

The Overview of YOLO-MSD
Our proposed YOLO-MSD framework builds upon YOLOv7-tiny [7] as its foundational structure. The architecture comprises three main components: the backbone network, responsible for feature extraction; the feature fusion network, which reprocesses and refines the features obtained from the backbone; and the classification prediction network, which conducts the final detection predictions. In this process, the predicted outcomes are combined with the ground truth data and fed into the loss function for further computation and optimization. Ultimately, non-maximum suppression is employed to discard redundant detection boxes, ensuring the precise localization of ship targets.
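The final suppression step mentioned above can be sketched as a standard greedy, IoU-based NMS. This is a minimal illustration of the technique, not the authors' exact implementation:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        # drop remaining boxes that overlap the kept box too strongly
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thr]
    return keep
```

For example, of two heavily overlapping detections of the same ship, only the higher-scoring one survives, while a distant detection is kept.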
The overall network architecture is depicted in Figure 1. Initially, a deep poly kernel backbone network is established. This backbone network serves as the foundation for the subsequent modules and processes. Specifically, the introduction of PConv within the backbone facilitates the design of an OC module aimed at minimizing unnecessary computations and memory access. Additionally, a PK module is incorporated at the terminus of the backbone network to extract multiscale target features and capture the local context.

Subsequently, further processing and dimensionality adjustment of the feature maps are achieved through fine fusion of the ELAN-Tiny, SPP-Tiny, MP, and CBL modules at the neck. The BSAM attention mechanism is integrated at the neck to thoroughly capture global image information alongside ship details.
The Head section focuses on feature classification and regression, which not only enhances the feature representation capability through the Conv, BN, and LeakyReLU combination, the ELAN-Tiny module, and the Maxpool operation but also provides effective support for final decision-making through dimensionality reduction (e.g., halving the feature map size after Maxpool). In particular, the SP module combined with the Cat operation enables feature map splicing and fusion, which significantly increases the number of channels in the feature map and further enriches the feature information. Furthermore, a novel P-IoU loss function is implemented in the final regression phase to achieve faster convergence and enhanced accuracy.

Deep Poly Kernel Network (DPK-Net)
In this section, we introduce DPK-Net in detail. The specific structure is shown in Figure 2. Within the DPK-Net architecture, the input image undergoes initial sampling via two CBL modules, reducing its size to one-fourth of the original dimensions. Subsequently, the network integrates three optimized convolutional modules, three MP modules, one ELAN-Tiny module, and one deep poly kernel module on top of the baseline model YOLOv7-tiny. This arrangement produces three distinct feature maps, each on a different scale.


Optimized Convolution Module (OC)
Figure 3a illustrates the OC module's architecture, which features two distinct branches. The left branch enhances the network's receptive field by passing through the CBL module. Meanwhile, the right branch sequentially processes through the CBL and PBL modules, enabling comprehensive feature extraction with a lightweight design and minimizing the parameter count, which is typically increased by stacking standard convolutional modules. Ultimately, the output feature maps from both branches are concatenated, followed by dimensionality reduction to decrease the channel count for the final output. Incorporating PConv into this module enhances network efficiency and spatial feature extraction while maintaining a lightweight structure. As depicted in Figure 3c, PConv optimizes this process by selectively applying filters to some input channels, leaving the others unchanged [32]. For continuous or periodic memory access, the initial or final consecutive c_p channels are computed to represent the entire feature map. To generalize, we assume that the input and output feature maps possess an equal number of channels. Consequently, the FLOPs of PConv are only h × w × k² × c_p². With a typical partial ratio of c_p/c = 1/4, the FLOPs of PConv are only 1/16 of those of a regular convolution. In addition, PConv has a smaller memory access of h × w × 2c_p + k² × c_p² ≈ h × w × 2c_p.
That is, the memory access of PConv is 1/4 that of the regular convolution. Hence, the construction of the OC module is designed to minimize superfluous computations and optimize memory access efficiency.
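A minimal PyTorch sketch of the partial-convolution idea from [32] follows; it is an illustration of the concept (convolving only the first c_p = c/n_div channels), not the authors' exact code:

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: a regular k x k conv is applied only to the first
    c_p = channels // n_div channels; the remaining channels pass through
    untouched and are concatenated back."""
    def __init__(self, channels: int, n_div: int = 4, kernel_size: int = 3):
        super().__init__()
        self.cp = channels // n_div
        self.conv = nn.Conv2d(self.cp, self.cp, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x[:, :self.cp], x[:, self.cp:]
        return torch.cat((self.conv(x1), x2), dim=1)
```

With n_div = 4, the convolution touches only c/4 channels, so its FLOPs are (1/4)² = 1/16 of a full convolution, matching the ratio stated above.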


Poly Kernel Module (PK)
Depthwise separable convolution is composed of two stages, as illustrated in Figure 4. First, depthwise convolution is applied to the input features. In this step, each convolution kernel is linked to a specific channel, meaning that each channel undergoes a convolution operation using only its corresponding kernel. The output feature map retains the same number of channels and convolution kernels as the input feature map. Subsequently, the feature maps are processed using pointwise convolution. This involves weighting and combining the feature maps from the previous step along the channel dimension, resulting in new feature maps in which the number of convolution kernels matches the number of output channels [33].
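The two stages above map directly onto PyTorch primitives; a minimal sketch (the `groups` argument ties each kernel to one channel):

```python
import torch
import torch.nn as nn

class DWSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) conv
    followed by a 1 x 1 (pointwise) conv that mixes channels."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        # groups=in_ch links each kernel to exactly one input channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=in_ch, bias=False)
        # 1 x 1 conv weights and combines channels to out_ch maps
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))
```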
As shown in Figure 5, the PK module first utilizes a small kernel convolution to obtain local information, followed by a set of parallel depthwise separable convolutions to capture contextual information across multiple scales [34].
The PK module in the DPK block can be mathematically represented as follows:

L = Conv_{k_s × k_s}(X),
Z^(m) = DWConv_{k^(m) × k^(m)}(L), m = 1, . . . , M,

where X is the initial ship localized feature, L ∈ R^{C×H×W} is the local feature extracted by the k_s × k_s convolution, and Z^(m) ∈ R^{C×H×W} is the contextual feature extracted by the m-th k^(m) × k^(m) depthwise separable convolution (DWConv). The interrelationships between the various channels are then characterized by fusing the local and contextual features through a convolution of size 1 × 1:

P = Conv_{1×1}(L + Σ_{m=1}^{M} Z^(m)),

where P ∈ R^{C×H×W} denotes the output features. The 1 × 1 convolution acts as a channel fusion technique, enabling the integration of features with varying receptive field sizes.
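As a concrete illustration, the PK computation (small-kernel local conv, parallel depthwise convolutions, 1 × 1 fusion) can be sketched in PyTorch. The kernel sizes here are illustrative assumptions, not the paper's settings:

```python
import torch
import torch.nn as nn

class PKModule(nn.Module):
    """Poly Kernel sketch: a small-kernel conv extracts local features L,
    parallel depthwise convs extract multi-scale context Z^(m), and a 1x1
    conv fuses them into P. Kernel sizes are illustrative assumptions."""
    def __init__(self, channels: int, k_small: int = 3, dw_kernels=(5, 7, 9)):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, k_small,
                               padding=k_small // 2)
        self.context = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in dw_kernels)
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.local(x)                                 # L
        out = local + sum(dw(local) for dw in self.context)   # L + sum Z^(m)
        return self.fuse(out)                                 # P
```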
Our PK module enables the extraction of features from the backbone that encompass various scales and convolution depths, facilitating the effective detection of large, medium, and small ship targets simultaneously. Additionally, it allows the capture of extensive contextual information while preserving the integrity of the local texture features. This helps to extract ship features, especially those of small vessels, from background clutter and improves the reliability of ship detection performance.

BiLevel Spatial Attention Module (BSAM)
The attention mechanism enables an improved network to retain features with critical information by calculating the similarity or correlation between features and assigning appropriate weights to different features. At the same time, the design of the attention mechanism can also reduce the missed and false detections caused by ship targets that are too dense or that obscure each other. In this paper, the BSAM attention is designed.
Figure 6 depicts the detailed implementation of the BSAM attention. Commencing with an intermediate feature map, our module derives the attention map along two distinct dimensions sequentially. Subsequently, this attention map undergoes multiplication with the input feature map to accomplish adaptive feature refinement.

Given an intermediate feature map X ∈ R^{H×W×C} as input, the BiLevel Routing Attention module is utilized to generate an attention map M_B ∈ R^{H×W×C}, and the Spatial Attention Module is utilized to generate a 2D spatial attention map M_s ∈ R^{1×H×W}. The overall attention process can be summarized as

X′ = M_B(X) ⊗ X,
X″ = M_s(X′) ⊗ X′,

where ⊗ denotes element-wise multiplication.

BiLevel Routing Attention (BRA)
The BRA modifies attention weights according to the features of the input image. This enables the network to apply varying levels of focus to different locations or attributes, enhancing the detection of ship targets at multiple scales. Importantly, this adjustment does not overburden the model computationally [35].
BiFormer, a derivative of the Transformer [23] model, incorporates dynamic sparse attention, enhancing computational flexibility and feature discernment via BRA. Initially, it eliminates most non-essential key-value pairs at the coarse region level, preserving only a small subset of routing areas. Subsequently, it implements detailed token-to-token attention within the selected regions. As shown in Figure 7, the bi-level routing attention mechanism first divides the input feature map X ∈ R^{H×W×C} into S × S non-overlapping regions, such that X is transformed into X^r ∈ R^{S²×(HW/S²)×C} and each region contains HW/S² feature vectors. The query, key, and value tensors are obtained by linear projection:

Q = X^r W^q, K = X^r W^k, V = X^r W^v,

where W^q, W^k, W^v ∈ R^{C×C} are the projection weights of Q, K, and V, respectively. Then, the region-level Q^r and K^r are obtained by calculating the per-region mean of Q and K. The adjacency matrix A^r, which assesses the semantic similarity across the various regions, is computed as

A^r = Q^r (K^r)^T.

The matrix A^r is then filtered, and only the first k connections are kept for each region to prune the association graph, yielding the index matrix I^r:

I^r = topkIndex(A^r),

where the i-th row of I^r contains the k indexes of the regions most relevant to the i-th region. Filtering and collecting K and V by I^r yields K^g and V^g:

K^g = gather(K, I^r), V^g = gather(V, I^r).

Ultimately, the routing index matrix facilitates the application of fine-grained token-to-token attention from one region to another on Q, K^g, and V^g:

O = Attention(Q, K^g, V^g) + LCE(V),

where LCE is a depthwise separable convolution with a kernel size of 5 and a stride of 1.
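The coarse region-routing step (region-mean queries/keys, adjacency matrix, top-k pruning, and gathering of routed keys) can be sketched as follows. Shapes and the top-k value are illustrative, and the fine-grained attention itself is omitted:

```python
import torch

def route_topk_regions(q: torch.Tensor, k: torch.Tensor, topk: int = 2):
    """Region-to-region routing sketch: region-mean queries/keys form the
    adjacency A^r = Q^r (K^r)^T, and top-k keeps the most relevant regions
    per region (I^r). q, k: (num_regions, tokens_per_region, C)."""
    qr = q.mean(dim=1)                   # region-level queries Q^r: (S^2, C)
    kr = k.mean(dim=1)                   # region-level keys     K^r: (S^2, C)
    adj = qr @ kr.transpose(0, 1)        # adjacency matrix A^r: (S^2, S^2)
    idx = adj.topk(topk, dim=1).indices  # index matrix I^r: (S^2, topk)
    # gather(K, I^r): collect key tokens of the routed regions per region
    kg = k[idx]                          # (S^2, topk, tokens_per_region, C)
    return idx, kg.flatten(1, 2)         # K^g: (S^2, topk * tokens, C)
```

Token-to-token attention would then be applied between each region's queries and its gathered K^g (and an analogously gathered V^g).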
The computation of the BRA consists of three parts: linear projection, region-to-region routing, and token-to-token attention. The total amount of computation is therefore

FLOPs = FLOPs_proj + FLOPs_routing + FLOPs_attention
= 3HWC² + 2(S²)²C + 2k(HW)²C/S²
≥ 3HWC² + 3C(2k²)^{1/3}(HW)^{4/3},

where C is the token embedding dimension (i.e., the number of channels of the feature map) and k is the number of regions to attend to (the "k" in "top-k"); for comparison, ordinary attention has O((HW)²) complexity. Here, the inequality between the arithmetic and geometric means has been applied, and equality holds if and only if 2S⁴ = k(HW)²/S². Therefore,

S = (k(HW)²/2)^{1/6}.

In other words, BRA achieves O((HW)^{4/3}) complexity if we scale the region partition factor S with respect to the input resolution according to this relation.
Compared with the traditional Transformer self-attention structure, BRA is less computationally intensive and significantly reduces the memory pressure. At the same time, while keeping the model lightweight, the mechanism ensures that the model can maximize the retention of fine-grained contextual feature information. It also reduces the impact of noise interference and mitigates the limitations imposed by clutter in the SAR images. This enables remote dependencies to be captured more effectively.

Spatial Attention Module
To concentrate on global spatial information, spatial attention must be computed. Initially, average-pooling followed by max-pooling is conducted along the channel axis to consolidate the feature maps' channel information [36], generating two 2D maps: F^s_avg ∈ R^{1×H×W} and F^s_max ∈ R^{1×H×W}. Fusion along the channel axis allows for better handling of critical spatial information. Then, we concatenate them to form an information descriptor. Finally, a spatial attention map M_s(F) ∈ R^{H×W} is generated using convolutional layers and sigmoid operations to highlight key pixels and suppress clutter from interfering with ship features. In short, spatial attention is computed as

M_s(F) = σ(f^{3×3}([F^s_avg; F^s_max])),

where σ represents the sigmoid function and f^{3×3} signifies the convolution operation with a 3 × 3 filter size.
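The spatial-attention computation described above can be sketched in a few lines of PyTorch (a minimal illustration of the pool-concat-conv-sigmoid pattern):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise average- and max-pooled maps are concatenated, passed
    through a 3x3 conv and a sigmoid, and used to reweight the input."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, 3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_avg = x.mean(dim=1, keepdim=True)        # F^s_avg: (B, 1, H, W)
        f_max = x.max(dim=1, keepdim=True).values  # F^s_max: (B, 1, H, W)
        m = torch.sigmoid(self.conv(torch.cat((f_avg, f_max), dim=1)))
        return x * m                               # refined feature map
```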

Powerful-IoU Loss Function (P-IoU)
Numerous metrics employed by sophisticated detectors rely on the IoU, which is a crucial evaluation metric for current loss functions. Simply put, IoU quantifies the overlap between the detection box and the target box.
IoU = |B_a ∩ B_b| / |B_a ∪ B_b|,

where B_a and B_b denote the predicted and ground truth boxes, respectively. The loss function is defined as

L_IoU = 1 − IoU.

As shown in Figure 8a, the D-IoU and C-IoU [37] losses are defined as follows:

L_DIoU = 1 − IoU + ρ²(d, d^gt)/c²,
L_CIoU = 1 − IoU + ρ²(d, d^gt)/c² + αυ,

where d and d^gt denote the centers of the predicted and ground truth boxes, respectively, ρ(·) is the Euclidean distance, and c is the length of the diagonal of the smallest box enclosing both. Here, α serves as the loss trade-off parameter and υ evaluates the similarity between aspect ratios:

υ = (4/π²)(arctan(w^gt/h^gt) − arctan(w/h))²,

where w and h are the width and height of the bounding box, and w^gt and h^gt are the width and height of the ground truth (GT) box, respectively.
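The IoU and C-IoU quantities described above can be sketched in plain Python for a single box pair (a minimal illustration; α follows the usual CIoU weighting):

```python
import math

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def ciou_loss(pred, gt):
    """C-IoU loss = 1 - IoU + rho^2/c^2 + alpha * v (single-box sketch)."""
    # squared center distance rho^2 and enclosing-box diagonal c^2
    px, py = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    rho2 = (px - gx) ** 2 + (py - gy) ** 2
    cx = max(pred[2], gt[2]) - min(pred[0], gt[0])
    cy = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cx ** 2 + cy ** 2
    # aspect-ratio consistency term v and trade-off parameter alpha
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    i = iou(pred, gt)
    alpha = v / (1 - i + v + 1e-9)
    return 1 - i + rho2 / c2 + alpha * v
```

For a perfect prediction the loss vanishes, since all three terms are zero.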
The penalization factors used are flawed, leading to an increase in the anchor box size and slower convergence during regression. Specifically, these factors are inadequate because they do not precisely differentiate between the anchor box and target box, fail to properly consider the target size, and may underperform in certain scenarios. Employing factors based on the anchor box size and the smallest enclosing box of the target as the denominator in the penalty term is inappropriate. This causes the anchor box region to expand during regression, negatively impacting efficiency. Thus, the IoU-based regression loss function requires a more suitable penalty term to enhance the performance.
To address these shortcomings, we utilize P-IoU, integrating a penalty factor that incorporates the target box size in the denominator and considers the quality of the adapted anchor boxes [38]. This method ensures that the anchor box regresses more efficiently along a direct path, resulting in quicker convergence and improved accuracy. Here, the penalty factor P, adjusted to the target size, is defined as

P = (dw₁/w^gt + dw₂/w^gt + dh₁/h^gt + dh₂/h^gt)/4,

where dw₁, dw₂, dh₁, and dh₂ are the absolute distances between the corresponding edges of the predicted box and the target box, and w^gt and h^gt are the width and height of the target box, as shown in Figure 8b. Using P as a penalty factor in the loss function avoids expanding the anchor box. This occurs because the denominator of P relies solely on the target box size, remaining unaffected by the anchor box size or the smallest enclosing box of the target. Unlike penalty factors in other loss functions, P remains unchanged by anchor box enlargement. Furthermore, P only reaches zero when the anchor box fully overlaps with the target box. Additionally, P adapts to the target size. Consequently, we employ a penalty function that is adjusted according to the quality of the anchor box.
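The target-size-adaptive penalty factor P can be sketched as follows, assuming the averaged, target-normalized edge distances described in [38] (a minimal single-box illustration, not the authors' exact code):

```python
def p_penalty(pred, gt):
    """Target-size-adaptive penalty factor P: the four edge distances
    between predicted and target box, each normalized by the target
    width/height and averaged. P = 0 only on exact overlap."""
    dw1 = abs(pred[0] - gt[0])   # left-edge distance
    dw2 = abs(pred[2] - gt[2])   # right-edge distance
    dh1 = abs(pred[1] - gt[1])   # top-edge distance
    dh2 = abs(pred[3] - gt[3])   # bottom-edge distance
    wgt, hgt = gt[2] - gt[0], gt[3] - gt[1]
    return (dw1 / wgt + dw2 / wgt + dh1 / hgt + dh2 / hgt) / 4
```

Because only w^gt and h^gt appear in the denominators, enlarging the anchor box cannot shrink the penalty, which is exactly the property discussed above.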
The P-IoU loss function guides the anchor box to regress faster along an effective path and, thus, to converge more quickly. In particular, the combination of target-size-adaptive tuning and loss adjustment for the importance of ship targets is fine-tuned to optimize the requirements specific to ship detection. This method effectively addresses the challenge of identifying multiscale ship targets, particularly in maritime environments where target sizes vary significantly. It enhances the detection accuracy and adaptability to complex conditions, thereby increasing the precision and robustness of ship detection.
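As a concrete illustration, the following minimal Python sketch computes the P-IoU loss for one box pair, assuming the penalty P = ((dw1 + dw2)/w_gt + (dh1 + dh2)/h_gt)/4 described above and the attenuation f(P) = 1 − e^(−P²) from the P-IoU formulation [38]; the function name and the (x1, y1, x2, y2) box convention are illustrative, not the authors' exact implementation.

```python
import math

def piou_loss(pred, target):
    """Sketch of the P-IoU regression loss for axis-aligned boxes (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    w_gt, h_gt = tx2 - tx1, ty2 - ty1

    # Absolute distances between corresponding edges (cf. Figure 8b).
    dw1, dw2 = abs(px1 - tx1), abs(px2 - tx2)
    dh1, dh2 = abs(py1 - ty1), abs(py2 - ty2)

    # Target-size-adaptive penalty: its denominator depends only on the
    # target box, so enlarging the anchor box cannot shrink the penalty.
    p = ((dw1 + dw2) / w_gt + (dh1 + dh2) / h_gt) / 4.0

    # Plain IoU of the two boxes.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + w_gt * h_gt - inter
    iou = inter / union

    # Loss: 1 - IoU plus the attenuated penalty; zero only at full overlap.
    return 1.0 - iou + (1.0 - math.exp(-p * p))
```

A perfectly aligned prediction yields a loss of exactly zero, while any edge offset contributes through P no matter how large the anchor box grows.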

Experimental Platform
The detailed implementation of the proposed YOLO-MSD method is outlined as follows: the method is developed using Python, and the experiments are conducted within a deep learning framework built on PyTorch 1.11.0. YOLOv7-tiny is chosen as the baseline model for this study. The hardware setup includes an Intel(R) Xeon(R) Silver 4210R CPU running at 2.40 GHz, an NVIDIA RTX A6000 GPU, and 512 GB of RAM.

Datasets

HRSID
The HRSID dataset [39] comprises data obtained from satellite sensors such as TerraSAR-X, Sentinel-1B, and TanDEM-X. From the original 136 large-scale SAR satellite images, 5604 image patches of 800 × 800 pixels were derived. Among the 16,951 annotated ships, the distribution is 54.5% small, 43.5% medium, and 2% large vessels. The HRSID dataset includes SAR images with resolutions of 0.5 m, 1 m, and 3 m, with all ships annotated using horizontal bounding boxes (HBB). It features a variety of maritime scenes ranging from simple to complex. Throughout the model training phase, the dataset was divided into training, validation, and testing sets at an 8:1:1 ratio.

SSDD
The SSDD dataset comprises 1160 SAR images, with dimensions ranging from 190 to 526 pixels in height and 214 to 668 pixels in width, encompassing 2456 targets [40]. On average, there are 2.12 ships per image. This dataset predominantly sources its data from the RadarSat-2, Sentinel-1, and TerraSAR-X sensors, with resolutions between 1 m and 15 m. The target areas are cropped to approximately 500 × 500 pixels. The ship target positions were manually annotated using the PASCAL VOC format. The dataset primarily contains small targets that exhibit diverse features near coasts, in open seas, and across various scales, making it suitable for evaluating the robustness of models. During the training phase of the model, the dataset was segmented into training, validation, and testing subsets, adhering to an 8:1:1 split ratio.
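The 8:1:1 split used for both datasets can be reproduced with a short helper. This is a hedged sketch — the authors' actual split tooling is not specified — with a fixed seed so the partition is repeatable:

```python
import random

def split_dataset(image_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle image IDs and partition them into train/val/test subsets."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for repeatability
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

For the 5604 HRSID images, this yields 4483/560/561 train/val/test images.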

Model Evaluation
To assess the detection performance of YOLO-MSD in comparison with other methods, we evaluate the detection results using Precision (P), Recall (R), mean average precision (mAP), parameters, and FLOPs. To determine the accuracy of the prediction frames, we calculate the Intersection over Union (IoU) between the predicted frames and the ground truth [40]. IoU represents the ratio of the intersection area to the union area of the predicted frame and the ground truth, as described by Equation (24):

IoU = area(B_P ∩ B_gt) / area(B_P ∪ B_gt), (24)
where B_P represents the prediction box and B_gt represents the actual ground truth. A higher IoU value indicates more accurate detection results. For ship detection, three outcomes are possible: true positives (TP), false positives (FP), and false negatives (FN). True positives refer to the number of accurately detected ships, false positives refer to the number of erroneously detected ships, and false negatives refer to the number of missed ships. Precision is defined as the proportion of correctly detected ships out of all detected ships, while Recall is the proportion of correctly detected ships out of the total number of actual ships. Equations (25) and (26) are used to compute the detection accuracy and completeness rates:

P = TP / (TP + FP), (25)
R = TP / (TP + FN). (26)
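The IoU test above translates directly into code; the following sketch (function name illustrative) computes the IoU of a predicted box B_P against a ground-truth box B_gt, both given as (x1, y1, x2, y2):

```python
def iou(box_p, box_gt):
    """Intersection over Union of two axis-aligned boxes."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    return inter / (area_p + area_gt - inter)
```

A detection is then typically counted as a TP when its best IoU with an unmatched ground-truth box exceeds a threshold (e.g., 0.5), as an FP otherwise, and unmatched ground truths become FNs.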
Assessing the detection model based solely on Precision (P) or Recall (R) can be inadequate. Consequently, the F1 score is employed to integrate both P and R for a more holistic evaluation of the model. The formula for the F1 score is provided in Equation (27):

F1 = 2 × P × R / (P + R). (27)
Average Precision (AP) provides a more comprehensive evaluation of various detection methods. By plotting Recall on the horizontal axis and Precision on the vertical axis, AP represents the area under the Precision-Recall (P-R) curve. The calculation formula for AP is presented in Equation (28):

AP = ∫₀¹ P(R) dR. (28)
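The metric definitions above can be sketched as follows; note the AP helper uses a simple rectangular sum over (recall, precision) points, whereas COCO-style evaluation uses an interpolated variant:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and F1 from the TP/FP/FN counts."""
    p = tp / (tp + fp)           # Precision
    r = tp / (tp + fn)           # Recall
    f1 = 2 * p * r / (p + r)     # F1 score: harmonic mean of P and R
    return p, r, f1

def average_precision(pr_points):
    """Area under the P-R curve from (recall, precision) pairs
    sorted by increasing recall (rectangular approximation)."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in pr_points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```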
Using the pixel count within the ship's predicted bounding box, ships are classified as small, medium, or large based on the COCO index definitions. Subsequently, their detection accuracies are computed. Table 1 provides several definitions for the COCO index. To enhance the evaluation of the model performance, we introduce additional metrics such as frames per second (FPS), model parameters, and floating-point operations (FLOPs). The FPS is defined as

FPS = 1 / T, (29)

where T denotes the detection time for a single image. FPS indicates the average frame rate on the validation datasets. The parameters of a convolutional layer can be obtained by Equation (30):

Params = k_H × k_W × (C_in / g) × C_out, (30)
where k_H and k_W denote the convolution kernel's dimensions, C_in represents the number of input feature map channels, C_out represents the number of output feature map channels, and g is the number of convolution groups. The total model parameters are obtained by summing the parameters of all layers. The FLOPs can be obtained by Equation (31):

FLOPs = k_H × k_W × (C_in / g) × C_out × H_out × W_out, (31)

where C_out × H_out × W_out is the total number of units included in the output feature map.
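The complexity metrics above amount to a few arithmetic helpers. This sketch omits bias terms and counts one multiply-accumulate per FLOP — conventions that vary between profiling tools:

```python
def fps(t):
    """Average frame rate from T, the detection time per image in seconds."""
    return 1.0 / t

def conv_params(k_h, k_w, c_in, c_out, groups=1):
    """Parameter count of a (grouped) convolution layer, bias omitted."""
    return k_h * k_w * (c_in // groups) * c_out

def conv_flops(k_h, k_w, c_in, c_out, h_out, w_out, groups=1):
    """Per-unit kernel cost times the C_out x H_out x W_out units
    in the output feature map."""
    return conv_params(k_h, k_w, c_in, c_out, groups) * h_out * w_out
```

For example, a 3 × 3 convolution mapping 64 to 128 channels has 73,728 parameters; on a 32 × 32 output map it costs about 75.5 MFLOPs.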
These findings indicate the effectiveness of the proposed method for ship SAR image detection. Faster R-CNN [8] demonstrates the poorest performance, primarily due to its two-stage detection process, which first generates pre-selected frames that are often compromised by the scattering noise prevalent in SAR images. Among single-stage detection algorithms, SSD and EfficientDet exhibit high accuracy but fall short in Recall, achieving less than 50% on the HRSID dataset. The YOLO series, in contrast, achieves higher mAP. In anchor-free algorithms, the absence of pre-generated anchor boxes leads to only one anchor box prediction per position. This limitation may result in undetected overlapping or blurred regions, thus reducing the Recall and mAP of RetinaNet and CenterNet. YOLO-MSD achieves the highest mAP on both the SSDD and HRSID datasets among the evaluated methods, with an average accuracy of 90.2% on the HRSID dataset. However, the algorithm's performance on HRSID is inferior to that on SSDD, likely due to the complexity of the HRSID dataset, which arises from scenarios such as closely aligned ships in inshore scenes.
The PR curve illustrates the relationship between precision and recall during model training. Typically, a curve that extends closer to the upper right indicates superior model performance. Figure 9 shows a comparison of the PR curves from the ablation experiments using the HRSID and SSDD datasets. The figure demonstrates that, in comparison to the baseline YOLOv7-tiny model, the YOLO-MSD model achieves the best performance, evidenced by the largest area under the curve.
To further demonstrate the module's semantic perceptual capabilities, we compare the results of image heat maps across various scales and scenarios. These heat maps illustrate the model's activity levels in different regions or input data features, reflecting its perception of diverse semantic categories. The heat maps reveal that the improved model delineates object boundaries and texture features more precisely than the baseline model. Figure 13 presents the heat maps of the images at different scales and scenes on the SSDD and HRSID datasets.
As shown in Figure 13a, the improved model outperforms the baseline model in detecting small ships, providing a clearer depiction of the ship boundaries. This enhancement is attributed to the model's superior feature extraction capabilities, particularly in capturing ship contours and fine details. Figure 13b shows that both models perform well in medium ship detection, but our proposed method produces clearer heat maps. In Figure 13c, the improved model outperforms the baseline model in detecting large ships. The inclusion of the BSAM and P-IoU loss function significantly improves edge feature extraction for large ships with trailing shadows and image blurring, demonstrating the model's ability to integrate multiple features and enhance its recognition of larger ships.
Figure 13d illustrates that both the baseline and improved models perform well in detecting offshore ships and effectively extracting ship features. Figure 13e,f shows that the improved model surpasses the baseline in detecting inshore ships and densely packed ships. Our proposed method effectively captures image information across multiple scales and identifies ship features of various sizes and shapes, thus enhancing the model's comprehension of the image.
In summary, the YOLO-MSD model demonstrates significant improvements over the baseline model.

Discussion
While YOLO-MSD demonstrates strong detection performance in most scenarios, it still encounters issues with missed detections and inaccuracies. For instance, in situations with numerous overlapping bounding boxes, such as those depicted in Figure 12, where ships are densely docked and their bounding boxes overlap, the model struggles. The indistinct boundaries between features and the presence of redundant elements in the extracted features hinder the detection of centrally located ships, demonstrating the model's limitations in feature recognition and refinement. Furthermore, the SAR ship dataset is characterized by a scarcity of positive samples and a complex background with numerous negative samples, while deep detection models are heavily data-dependent. This results in suboptimal data utilization during detection and inadequate performance in near-shore and complex environments. Therefore, to enhance the practical utility of YOLO-MSD, it is essential to improve the balance between positive and negative samples and to bolster robustness and generalization capabilities in diverse scenarios.

Conclusions
This study introduces a novel multiscale ship detection algorithm for SAR images, leveraging the YOLO-MSD framework. By utilizing an enhanced DPK-Net as the backbone network and incorporating the BSAM attention mechanism alongside the P-IoU loss function, the proposed algorithm significantly enhances detection performance. Experimental results validate the superior capabilities of the YOLO-MSD model in SAR ship detection tasks. Specifically, when compared to the baseline YOLOv7-tiny algorithm, the proposed method shows a precision improvement of 3.8% and 7.9%, a recall improvement of 4.1% and 2.6%, and a mAP50 improvement of 5.9% and 6.2% on the HRSID and SSDD datasets, respectively.
There are several potential areas for improvement in this research. Future studies within the current framework could focus on the following key areas. First, to enhance the model's generalization ability in specific scenarios, it may be necessary to augment scenario-specific datasets, apply advanced data enhancement techniques, or implement domain adaptation methods. Second, integrating multisource data is crucial for boosting the accuracy and robustness of ship detection. Combining SAR images with data from other sensors, such as optical sensors, can provide more comprehensive target information, particularly under adverse weather conditions. By thoroughly investigating these directions, we can significantly improve the performance and practicality of ship detection technology, thereby providing robust support for related research and practical applications.

Figure 1 .
Figure 1. The overall network structure of YOLO-MSD. The DPK-Net, including the OC module and PK module, is first constructed as the backbone network; then, the BSAM is introduced at the neck; and finally, the P-IoU is introduced at the regression stage.

Figure 2 .
Figure 2. The structure of the DPK-Net. The backbone network extracts the critical characteristics of the input image, followed by the output of three different scale feature maps.

Figure 3 .
Figure 3. Detailed design of the OC module. (a) shows the structure of the OC Module. (b,c) show a detailed comparison between regular convolution and PConv.

Figure 5 .
Figure 5. The PK Module starts with a small kernel convolution for local data and then uses parallel DWConv for the multiscale context.


Z_s ∈ R^(C×H×W) denotes the local features extracted by the k_s × k_s convolution, and Z_m ∈ R^(C×H×W) denotes the contextual features extracted by the m-th k_m × k_m depthwise separable convolution (DWConv). In our experiments, we set k_s = 3 and k_m = (m + 1) × 2 + 1.
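Under the kernel-size rule above, the parallel DWConv branches use odd kernel sizes that grow linearly with the branch index m; a one-line check (the branch count of 4 is an assumption for illustration, not stated in the text):

```python
def pk_kernel_sizes(num_branches, k_s=3):
    """Kernel sizes for the PK Module: the small local kernel k_s followed
    by the m-th branch's k_m = (m + 1) * 2 + 1."""
    return [k_s] + [(m + 1) * 2 + 1 for m in range(1, num_branches + 1)]
```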

Figure 6 .
Figure 6. The structure of the BSAM attention mechanism.

Figure 7 .
Figure 7. The overall structure of BRA.

Figure 8 .
Figure 8. IoU-based losses. The loss functions described in (a) incorporate dimensional information, specifically using the diagonal length of the smallest enclosing bounding box (represented by the gray dashed box) for both the anchor and target boxes as the denominator in the loss calculation. Conversely, the P-IoU loss function outlined in (b) simplifies this approach by utilizing only the edge length of the target box as the denominator of its loss factor.

Figure 9 .
Figure 9. PR curves for ablation experiments: (a) is based on HRSID, and (b) is based on SSDD.

Figure 10
Figure 10 illustrates the training loss curves for YOLOv7-tiny and YOLO-MSD on both the HRSID and SSDD datasets. Initially, both models show a similar reduction in training loss. However, as training progresses, YOLO-MSD demonstrates a more rapid decline in training loss compared to YOLOv7-tiny after ten epochs. In conclusion, the YOLO-MSD model introduced in this study effectively reduces loss and accelerates model convergence.

Figure 10 .
Figure 10. The loss curves of the proposed YOLO-MSD and the original YOLOv7-tiny model: (a) is based on HRSID, and (b) is based on SSDD.

Figure 11 .
Figure 11. Ship detection results for the HRSID.

Figure 12 .
Figure 12. Ship detection results for the SSDD.

Figure 13 .
Figure 13. Heat map of images at different scales and in different scenes. (a) small ship. (b) medium ship. (c) large ship. (d) offshore ships. (e) inshore ships. (f) dense inshore ships.

Table 1 .
The definition of some COCO indicators.

Table 2 .
Experimental results of comparative experiments.