PVT-SAR: An Arbitrarily Oriented SAR Ship Detector With Pyramid Vision Transformer

The development of deep learning has significantly boosted ship detection in synthetic aperture radar (SAR) images. Most previous works rely on convolutional neural networks (CNNs), which extract features through local receptive fields and are sensitive to noise. Moreover, these detectors have limited performance in large-scale and complex scenes due to the strong interference of the inshore background and the variability of target imaging characteristics. In this article, a novel SAR ship detection framework is proposed that establishes the pyramid vision transformer (PVT) paradigm for multiscale feature representations in SAR images and, hence, is referred to as PVT-SAR. It breaks the limitation of the CNN receptive field and captures global dependence through the self-attention mechanism. Since the difficulties of object detection in SAR and natural images are quite different, directly applying an existing transformer structure, such as PVT-small, cannot achieve satisfactory performance for SAR object detection. Compared with the PVT, overlapping patch embedding and mixed transformer encoder modules are incorporated to overcome the problems of densely arranged targets and insufficient data. Then, a multiscale feature fusion module is designed to further improve the detection ability for small targets. Moreover, a normalized Gaussian Wasserstein distance loss is employed to suppress the influence of scattering interference at the ship's boundary. The superiority of the proposed PVT-SAR detector over several state-of-the-art oriented bounding box detectors is demonstrated in both inshore and offshore scenes on two commonly used SAR ship datasets (i.e., RSSDD and HRSID).

Synthetic aperture radar (SAR) provides an all-weather remote imaging capability, which plays a vital role in both military and civilian fields. Ship detection in SAR images is an important branch of SAR image interpretation and has been widely applied in maritime surveillance, maritime traffic control, fishery management, etc. Traditional SAR ship detection algorithms usually involve multiple steps, including sea-land segmentation, image preprocessing, candidate region extraction, and false alarm rejection. The methods of SAR ship detection can be mainly classified as threshold-based [2], [3], saliency-based [4], [5], hand-crafted feature-based [6], [7], and statistical modeling-based [8], [9] approaches. However, with the increasing amount of data and resolution of SAR images in recent years, these methods are no longer competent in terms of speed, accuracy, and robustness. Hence, it has become urgent to develop faster and more efficient SAR ship detection methods.
Currently, convolutional neural networks (CNNs) are highly effective in object detection, recognition, segmentation, tracking, etc. Most state-of-the-art general object detectors are based on CNNs [10], [11], [12]. Detectors based on the horizontal bounding box (HBB) not only have been widely used in natural images but have also obtained performance gains in the task of SAR ship detection [13], [14], [15], [16]. However, ships in SAR images are slender and arbitrarily oriented and hence cannot be effectively represented by HBBs. Inshore ships in particular are surrounded by complex land backgrounds, which confound the detectors.
In order to solve these problems, some SAR datasets [17], [18], [19], [20] with oriented bounding box (OBB) annotations have been proposed. Based on these datasets, some studies [21], [22], [23], [24], [25], [26] focused on OBB detectors to provide more accurate localization of ships in high-resolution SAR images. However, CNN-based methods still face several limitations in large-scale and complex scenes. A convolution layer only models the relationship between pixels in a small neighborhood; it cannot capture long-distance relationships in images. It has been shown that, although large-kernel convolution can obtain a broader range of responses, the shape of the response is inaccurate. In order to identify the key features in a complex background, some studies have introduced the attention mechanism [27], [28], [29], which focuses on important information and reduces false positives caused by inshore and inland interference. Su et al. [30] introduced deformable convolution to improve the ability of the network to extract key information. Nevertheless, a convolution layer applies fixed weights regardless of changes in the visual input.

Fig. 1. Compared with traditional CNN backbones, which have local receptive fields, transformers can learn long-distance feature dependence and produce a global receptive field. We visualize the heat maps through Grad-CAM [56] and select one ship for analysis. The green box represents the exact position of the ship, and the red box represents the area with a high response value. The high-response area of the CNN heat map covers only a local part of the ship, whereas the high-response area of the transformer's heat map covers not only the whole ship body but also part of the wharf.
Therefore, the features learned by CNNs are sensitive to noise and incapable of adapting to scattering changes [31]. When the key features are interfered with by noise, the performance of these CNN-based models is seriously affected. Furthermore, most current works use the L1 loss to regress OBBs. Due to the periodicity of the angle parameter, OBB detectors often suffer from boundary discontinuity. Although some losses [32], [33], [34] have been proposed to overcome the angle boundary discontinuity, they do not consider the problem of blurred ship boundaries caused by the SAR imaging mechanism [35]. In SAR images, scattering interference at the ships' boundaries leads to unsatisfactory direction regression results.
Recently, the vision transformer paradigm [36] has been developed, which can achieve excellent results in computer vision without relying on CNNs. Unlike CNNs, which only attend to a local context, transformers can learn long-distance feature dependence. Transformers have dynamic modeling and powerful representation capabilities and also show robustness to occlusion and noise [37]. Some works have attempted to apply the transformer structure to SAR image processing, e.g., classification [38] and despeckling [39]. In their transformer structures, the scale of the features is fixed; therefore, they cannot extract multiscale features of images, which is very important for SAR ship detection. To the best of our knowledge, only one work has tried to apply the transformer structure to SAR target detection [40]; however, it still employed a CNN as the backbone and used the transformer only as an attention module. Since the difficulties of object detection in SAR and natural images are quite different, directly applying an existing transformer structure cannot achieve satisfactory performance for SAR object detection. The objects in SAR images are relatively small compared with those in optical images; especially in inshore scenes, the transformer faces the problem of densely arranged small targets. Insufficient data are another bottleneck when a transformer-based detector is applied to the SAR ship detection task. In general vision, the more training data available, the more pronounced the advantages of transformer-based detectors over CNN-based detectors. However, SAR data acquisition and annotation are difficult, and there is no large-scale dataset similar to FAIR1M [41] for SAR ships.
Based on the above considerations, a novel pyramid vision transformer (PVT) [42] network is proposed for arbitrarily oriented ship detection in SAR images with complex backgrounds, which is referred to as PVT-SAR. In our work, in order to improve the dynamic modeling and representation capabilities, the backbone network is completely replaced by the transformer architecture. This is the first article introducing the transformer vision pyramid paradigm into SAR ship detection. As shown in Fig. 1, the long-distance relationships include not only the relationships between ship components but also the relationships between the ship and the wharf. However, the direct application of an existing transformer structure may lead to the following problems. First, inshore ships are small and densely arranged, so the patch embedding module leads to missed detections. Hence, we introduce an overlapping patch embedding (OPE) module to overcome this problem. OPE expands the patch window so that nearest-neighbor windows overlap by half of their area, ensuring that at least one patch can retain a full ship. Second, there is no large-scale dataset like FAIR1M for SAR ship detection, so the transformer cannot be fully trained. To solve this problem, we develop a mixed transformer encoder (MTE) module, which removes the fixed-size position encoding and introduces a convolutional feed-forward module to learn the position information. Third, since the ships in SAR images are tiny, we design a simplified multiscale feature fusion module, i.e., SFPN, to enhance the large-scale feature maps and improve the detection performance for small ships without affecting large ships. SFPN reduces both the input and output layers of the feature pyramid network (FPN) and selects higher-resolution backbone layers as inputs.
Finally, we also design a normalized Gaussian Wasserstein distance (nGWD) loss to solve the boundary discontinuity and suppress the influence of scattering interference at the ship's boundary. As shown in Fig. 2, the rotated bounding box is modeled as a 2-D Gaussian distribution. Compared with the OBB's hard boundary, the Gaussian distribution's soft boundary is more robust to scattering interference and can represent the ship location learned by the model more reasonably. We now summarize the main contributions of this work as follows.
1) A new PVT paradigm, namely PVT-SAR, is designed for SAR ship detection; this is the first time that a transformer is introduced into rotated SAR ship detection. The visualization of the network's heat maps further verifies the transformer's potential in the SAR ship detection task.
2) Since the difficulties of object detection in SAR and natural images are quite different, directly applying an existing transformer structure cannot achieve satisfactory performance for SAR object detection. Therefore, we propose the core modules, i.e., OPE, MTE, and SFPN, to overcome the problems of densely arranged small targets and insufficient data for SAR ship detection.
3) To reduce the influence of scattering interference in SAR images, the nGWD loss is introduced, which also solves the boundary discontinuity of the OBB detector.
4) The experimental results on the RSSDD and HRSID datasets show that the proposed method achieves superior detection performance compared with prevalent detectors. The ablation experiments demonstrate each component's effectiveness.
The rest of the article is organized as follows. The PVT paradigm is introduced in Section II. In Section III, we propose PVT-SAR, whose experimental results compared with other state-of-the-art methods are given in Section IV. Finally, conclusions are presented in Section V.
Notations: Throughout the article, a matrix, vector, and scalar are represented by a bold uppercase letter X, a bold lowercase letter x, and a regular letter x, respectively. N and Σ represent a 2-D Gaussian distribution and its covariance. Superscripts (·)^T and (·)^(-1) represent the transpose and inverse, respectively. We use |·|, ∘, Trace(·), det(·), and ‖·‖_F to denote the absolute value, elementwise product, trace, determinant, and Frobenius norm, respectively.

II. PYRAMID VISION TRANSFORMER
As far as we know, PVT is one of the most advanced transformer-based backbone networks for object detection.
Although some excellent backbone networks are emerging, it is still a widely recognized and representative model. PVT is the first pure transformer backbone designed for dense prediction tasks and can be used as a direct replacement for CNN backbones in detection tasks. Different from ViT, which typically yields low-resolution outputs, PVT can be trained on dense partitions of an image to achieve high resolution, which is important for dense prediction. According to the number of parameters, PVT models come in tiny, small, medium, and large sizes. To make a fair comparison with the most common CNN backbone, ResNet-50, we select the PVT-small backbone as the baseline, since they have similar GFLOPs. As shown at the top of Fig. 3, the backbone consists of four transformer blocks, each composed of one patch embedding module and several transformer encoders. We split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard transformer encoder. The transformer encoder extracts self-attention through the multihead attention (MHA) module. As shown at the bottom of Fig. 1, self-attention allows the transformer to integrate information across the entire image even in the lowest layers. The calculation process of MHA is shown in Fig. 4. First, the input sequence X is multiplied by different weight matrices W_Q, W_K, and W_V to generate the corresponding query Q, key K, and value V. Then, given Q, the correlation between Q and all K is calculated by the scaled dot product

W_A = softmax(Q K^T / sqrt(d))

where d is the dimension of the keys. The weight matrix W_A is then multiplied by V to obtain the self-attention matrix

Attention(Q, K, V) = W_A V

which is the basic unit of MHA, also known as a self-attention head.
Suppose that the shape of the three matrices Q, K, and V is n × d; then, the complexity of self-attention is O(n²d). Suppose there are h self-attention heads; we concatenate their outputs and multiply by the weight matrix W_B to obtain the expression of MHA

MHA(Q, K, V) = Concat(head_1, ..., head_h) W_B,  head_i = Attention(Q_i, K_i, V_i)

with Q_i, K_i, and V_i representing the query, key, and value matrices of the ith head, respectively. The MHA mechanism can be understood as a feature ensemble that fuses multiple self-attention features. To reduce the amount of calculation, PVT adopts a spatial-reduction attention (SRA) layer to reduce the spatial dimension of K and V. In addition, PVT introduces a progressive shrinking pyramid to reduce the sequence length of the transformer as the network deepens, which significantly reduces the computational cost. This also makes PVT flexible for learning multiscale and high-resolution features. Finally, the FPN fuses feature maps from different stages to obtain the outputs for detection.
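The scaled dot-product attention and MHA described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the SRA spatial reduction is omitted, and the function names and the per-head channel split are our own choices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    # One head: W_A = softmax(Q K^T / sqrt(d)), output = W_A V.
    d = Q.shape[-1]
    W_A = softmax(Q @ K.T / np.sqrt(d))
    return W_A @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_B, h):
    # Project X to Q, K, V, split the channels into h heads,
    # attend per head, concatenate, and project by W_B.
    n, d_model = X.shape
    heads = []
    for i in range(h):
        sl = slice(i * d_model // h, (i + 1) * d_model // h)
        heads.append(self_attention(X @ W_Q[:, sl], X @ W_K[:, sl], X @ W_V[:, sl]))
    return np.concatenate(heads, axis=-1) @ W_B
```

Each row of W_A sums to one, so every output token is a convex combination of all value vectors, which is exactly the global receptive field discussed above.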

III. PROPOSED ARCHITECTURE
The proposed PVT-SAR for SAR ship detection will be introduced in this section.

A. PVT-SAR
Transformer is currently the state-of-the-art attention technology in the general vision field because of its powerful modeling capabilities. Transformers can learn long-distance relationships and retain more global information, hence alleviating the influence of complex backgrounds on SAR inshore ship detection. In addition, the transformer also shows strong robustness to noise and occlusion. Therefore, we introduce it to handle the speckle noise and the occlusion caused by boundary scattering between densely docked ships in SAR images. However, the characteristics of objects in SAR images and natural images are quite different, and an existing transformer structure, such as PVT, is not directly applicable to SAR object detection. The aim of our work is to develop a new transformer architecture for SAR ship detection. We compare the architectures of the proposed PVT-SAR and PVT-small in Fig. 3. The structural parameters of the transformer-based backbone are shown in Table I: S_i represents the stride of the OPE in stage i; C_i indicates the number of output channels in stage i; L_i indicates the number of encoders in stage i; R_i represents the reduction ratio of SRA in stage i; N_i indicates the number of efficient self-attention heads in stage i; and E_i represents the expansion ratio of the convolutional feed-forward layer in stage i. We describe the contributions of PVT-SAR in three aspects.
1) OPE: The targets in SAR ship datasets are small and densely arranged, and using the patch embedding module directly may seriously break the target features. Therefore, the OPE module is proposed to reduce the missing rate of ships. As shown in Fig. 5, the patch window is expanded so that nearest-neighbor windows overlap by half of their area. Here, a convolution with zero padding is used to embed the overlapping patches. Specifically, given an input of size H_i × W_i × C_i, a convolution with stride S_i, kernel size 2S_i − 1, and padding size S_i − 1 is used, producing an output of spatial size (H_i/S_i) × (W_i/S_i). When the overlapping stride S_i is larger than the target feature size of the current layer, at least one patch can retain the full target.
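The convolution arithmetic above can be checked directly: with kernel 2S − 1, stride S, and padding S − 1, OPE produces the same H/S × W/S token grid as non-overlapping patch embedding while each window overlaps its neighbor by S − 1 pixels. A small sketch (the function name is ours):

```python
def ope_output_size(h, w, s):
    """Spatial output size of the OPE convolution: kernel 2s-1, stride s,
    zero padding s-1. Each window overlaps its neighbour by s-1 pixels,
    i.e., roughly half of the (2s-1)-wide window."""
    def out(n):
        # Standard convolution arithmetic: floor((n + 2*pad - kernel)/stride) + 1.
        return (n + 2 * (s - 1) - (2 * s - 1)) // s + 1
    return out(h), out(w)
```

For the 800 × 800 HRSID crops with a first-stage stride of 4, this yields a 200 × 200 grid, identical to the grid of a plain stride-4 patch embedding, so the pyramid shapes downstream are unchanged.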
2) MTE: It is difficult and time-consuming to obtain annotations for SAR images, and PVT does not show satisfactory generalization ability when trained on an insufficient dataset. The position embedding module can characterize the relative or absolute location between patches so that the transformer can establish long-distance dependence between two nonadjacent patches. However, the transformer lacks some of the inductive biases inherent to CNNs, such as translation equivariance and locality. If the position encoding is fixed, the model needs to learn position information through the patches' semantics, increasing the learning cost. As shown in Fig. 6, MTE removes the fixed-size position encoding and introduces a convolutional feed-forward module to learn the position information, which alleviates the dependence on the amount of data.
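To illustrate how a convolution inside the feed-forward path can carry positional information, the sketch below reshapes the token sequence to its 2-D grid and applies a 3 × 3 depthwise convolution with zero padding. This is an illustration only: a uniform averaging kernel stands in for the learned weights used in practice, and the function name is ours.

```python
import numpy as np

def conv_feed_forward(tokens, h, w):
    """Sketch of the positional part of a convolutional feed-forward layer:
    the token sequence (h*w tokens, c channels) is reshaped to its 2-D grid
    and a 3x3 depthwise convolution with zero padding mixes each token with
    its spatial neighbours (here with a uniform averaging kernel)."""
    n, c = tokens.shape
    grid = tokens.reshape(h, w, c)
    padded = np.pad(grid, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(grid)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :]
    return (out / 9.0).reshape(n, c)
```

Note that the zero padding makes border tokens respond differently from interior ones even for a constant input, which is one way a convolutional module exposes position to the network without any explicit position encoding.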
3) SFPN: FPN is a method that can effectively generate multiscale feature maps from a single image, extracting more expressive multiscale features for downstream tasks [43], [44]. As shown in Fig. 7, since the ships in SAR images are tiny, the small-scale layers in FPN can hardly capture the targets correctly. Therefore, SFPN is proposed, which reduces both the input and output layers of the FPN and selects backbone layers with higher resolution as inputs. This allows the model to process the features extracted by the transformer-based backbone more reasonably. As shown in Fig. 8, the neck input of SFPN is changed from {C3, C4, C5} to {C2, C3, C4}. Considering the computational cost, we reduce the small-scale feature layers and change the neck output from {P3, P4, P5, P6, P7} to {P2, P3, P4}.
Experiments show that SFPN can significantly improve the detection ability of the proposed networks, especially for small targets.
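The top-down fusion over the selected stages can be sketched as follows. This is a hedged sketch, not the paper's code: the 1 × 1 lateral and 3 × 3 smoothing convolutions of a real FPN are omitted, so all inputs are assumed to share a channel count, and the upsampling is plain nearest-neighbor.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x spatial upsampling: (H, W, C) -> (2H, 2W, C).
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def sfpn(c2, c3, c4):
    """Top-down fusion over the higher-resolution backbone stages
    {C2, C3, C4} only, producing {P2, P3, P4}."""
    p4 = c4
    p3 = c3 + upsample2x(p4)
    p2 = c2 + upsample2x(p3)
    return p2, p3, p4
```

Because the deepest input is C4 rather than C5 and the coarse P5-P7 outputs are dropped, every retained level keeps a resolution at which the tiny ships still occupy several feature-map cells.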

B. nGWD Loss
The scattering interference in SAR images is an obstacle to the ship detector and may lead to unsatisfactory direction regression of the ships. To this end, we design the nGWD loss to suppress the influence of scattering interference at the ship's boundary. As seen in Fig. 2, the rotated bounding box is modeled as a 2-D Gaussian distribution. Compared with the OBB's hard boundary, the Gaussian distribution's soft boundary is more robust to scattering interference, which helps represent the ship location learned by the model more reasonably. By calculating the Wasserstein distance between the Gaussian distributions of two OBBs, the RIoU can be approximately expressed. Specifically, a rotated bounding box B(x, y, h, w, θ) is converted to a 2-D Gaussian distribution N(m, Σ) through

m = (x, y)^T,  Σ^(1/2) = R S R^T

where R represents the rotation matrix of angle θ and S = diag(w/2, h/2) represents the diagonal matrix of eigenvalues. The Wasserstein distance between two Gaussian distributions is written as

d_w² = ‖m_1 − m_2‖₂² + Trace(Σ_1 + Σ_2 − 2(Σ_1^(1/2) Σ_2 Σ_1^(1/2))^(1/2)).

In the commutative case (e.g., the HBB detection task), this reduces to

d_w² = ‖m_1 − m_2‖₂² + ‖Σ_1^(1/2) − Σ_2^(1/2)‖_F².

Note that both boxes are horizontal here, and hence, the distance is approximately equivalent to the ℓ2 loss, which partly proves the correctness of using the Wasserstein distance as the regression loss. To speed up the calculation of the loss, we simplify further. For any positive-definite symmetric 2 × 2 matrix Z with eigenvalues λ_1 and λ_2,

Trace(Z^(1/2)) = sqrt(λ_1) + sqrt(λ_2) = sqrt(Trace(Z) + 2 sqrt(det(Z))).

Taking Z = Σ_1^(1/2) Σ_2 Σ_1^(1/2), so that Trace(Z) = Trace(Σ_1 Σ_2) and det(Z) = det(Σ_1) det(Σ_2), the distance simplifies to

d_w² = ‖m_1 − m_2‖₂² + Trace(Σ_1) + Trace(Σ_2) − 2 sqrt(Trace(Σ_1 Σ_2) + 2 sqrt(det(Σ_1) det(Σ_2))).

Thus far, the final expression of the Wasserstein distance between two OBBs under the Gaussian distribution is obtained. Since the function is differentiable, it can be taken as the regression loss directly.
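The conversion and the simplified distance can be sketched numerically as follows; the function names and parameter order are ours, and the 2 × 2 closed form for Trace(Z^(1/2)) avoids any matrix square root.

```python
import numpy as np

def obb_to_gaussian(x, y, w, h, theta):
    """OBB (centre x, y; width w; height h; angle theta in radians) to the
    Gaussian N(m, Sigma) with Sigma^(1/2) = R diag(w/2, h/2) R^T."""
    m = np.array([x, y])
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    sqrt_sigma = R @ np.diag([w / 2.0, h / 2.0]) @ R.T
    return m, sqrt_sigma @ sqrt_sigma

def gwd_squared(box1, box2):
    """Squared 2-Wasserstein distance between the Gaussians of two OBBs,
    using Trace(Z^(1/2)) = sqrt(Trace(Z) + 2 sqrt(det Z)) for 2x2 Z."""
    m1, s1 = obb_to_gaussian(*box1)
    m2, s2 = obb_to_gaussian(*box2)
    tr_z = np.trace(s1 @ s2)                       # Trace(Sigma1 Sigma2)
    det_z = np.linalg.det(s1) * np.linalg.det(s2)  # det(Sigma1) det(Sigma2)
    coupling = np.sqrt(max(tr_z + 2.0 * np.sqrt(max(det_z, 0.0)), 0.0))
    return float(np.sum((m1 - m2) ** 2)
                 + np.trace(s1) + np.trace(s2) - 2.0 * coupling)
```

A quick consequence of the Gaussian modeling is that swapping w and h while rotating by π/2 leaves the distance unchanged, which is precisely the boundary-discontinuity property discussed next.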
It is not difficult to find that the Gaussian distribution has the following two properties:
Property 1: Σ^(1/2)(w, h, θ) = Σ^(1/2)(h, w, θ − π/2);
Property 2: Σ^(1/2)(w, h, θ) = Σ^(1/2)(w, h, θ − π),
which naturally eliminate the periodicity of the angle and the exchange of edges at the boundary.
It still has another problem, namely, the advantage of scale (AoS). As shown by the green curve representing the actual Gaussian Wasserstein distance in Fig. 9, for two OBBs with the same RIoU, the larger the size of the boxes, the greater d_w. This makes the model more inclined to learn large-scale targets while ignoring the small ships in SAR datasets. To solve this problem, d_w is normalized by a scale-dependent factor to obtain d_norm. As shown by the purple curve representing the nGWD in Fig. 9, different scales yield almost the same d_norm, which means that the model can treat targets of different sizes more fairly. Since d_norm can be sensitive to errors with a high loss value, a nonlinear transformation is further applied to obtain the nGWD loss L_nGWD. The loss function of the whole detector is

L = (1/N) Σ_{n=1}^{N} [ L_cls(p_n, t_n) + λ · obj_n · L_nGWD(b_n, gt_n) ]

where N indicates the number of anchors and obj_n is a binary value (obj_n = 1 for foreground and obj_n = 0 for background, indicating no regression for background). b_n denotes the nth predicted box, and gt_n is the nth ground truth. t_n represents the label of the nth object, and p_n is the probability distribution over classes for the nth anchor calculated by the sigmoid function. The hyperparameter λ controls the tradeoff and is set to 5 by default. The classification loss L_cls is the focal loss [12].
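The overall objective can be sketched as below. This is a hedged sketch under stated assumptions: the per-anchor averaging and binary focal form are our reading of the description above, and the regression values (e.g., the nGWD terms) are passed in as precomputed inputs rather than derived from boxes.

```python
import math

def focal_loss(p, t, alpha=0.25, gamma=2.0):
    # Binary focal loss for one anchor: p = predicted probability, t in {0, 1}.
    pt = p if t == 1 else 1.0 - p
    a = alpha if t == 1 else 1.0 - alpha
    return -a * (1.0 - pt) ** gamma * math.log(max(pt, 1e-12))

def detector_loss(probs, labels, reg_losses, obj, lam=5.0):
    """Focal classification over all anchors plus lambda-weighted regression
    over foreground anchors only (obj_n = 1), averaged over the N anchors."""
    n = len(probs)
    cls = sum(focal_loss(p, t) for p, t in zip(probs, labels))
    reg = sum(o * r for o, r in zip(obj, reg_losses))
    return (cls + lam * reg) / n
```

Background anchors (obj_n = 0) contribute only to the classification term, matching the "no regression for background" convention stated above.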

IV. EXPERIMENTAL RESULTS
Following [45] and [34], we carry out experiments on the HRSID [18] and RSSDD [17] datasets. First, the two open-source datasets used for OBB detection are introduced in detail. Then, the evaluation metrics are illustrated. Next, a series of comparative experiments is designed to verify the effectiveness of PVT-SAR.

A. Dataset Description and Experimental Settings
HRSID is a dataset for ship detection and segmentation tasks in high-resolution SAR images. It contains 5604 high-resolution SAR images and 16951 ship instances. With an overlapping ratio of 25%, 136 panoramic SAR images with resolutions ranging from 1 to 5 m are cropped into 800 × 800 pixel SAR images for dataset construction. The training set contains 3623 images, and the test set contains 1955 images. RSSDD is another open-source OBB-based SAR ship detection dataset composed of SAR images of multiple resolutions, polarizations, and scenes. It contains a total of 1160 images and 2456 ship targets. The images have different shapes, ranging from 217 × 214 pixels to 526 × 646 pixels. The training set contains 928 images, and the test set contains 232 images. Refer to Table II for more details on these datasets. Fig. 10(a)-(e) shows several inshore scene images, and Fig. 10(f) gives an example of an offshore scene. In addition, it can be seen that in Fig. 10(c) and (d), the resolution is high, so the ships appear larger; the resolution is lower in the other images, making the targets very small and difficult to detect. To evaluate the performance of PVT-SAR more comprehensively and robustly, the test set is divided into two parts: 1) inshore and 2) offshore. Based on the statistics, the HRSID test set contains 312 inshore and 1643 offshore images, while the RSSDD test set contains 39 inshore and 193 offshore images. In the inshore images, the ships are docked at the shore, and the complex land background increases the difficulty of detection. Offshore, the ships are sailing on the open sea without a complex background, so the task is relatively simple.
For the HRSID dataset, the shape of the input images in both the training and test stages is consistent with the original images, i.e., 800 × 800 pixels. For the RSSDD dataset, all images of different sizes are resized to 608 × 608 pixels. ImageNet pretrained weights are used to initialize the parameters of the feature extraction backbone. All models are trained for 36 epochs with a batch size of 4. The AdamW [46] optimizer is adopted for the transformer-based models, with a weight decay of 5.0 × 10^-4 and an initial learning rate of 1.0 × 10^-4. By default, the DCNN-based models use the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 1.0 × 10^-4. The initial learning rate of SGD is set to 1.0 × 10^-3 for one-stage detectors and 5.0 × 10^-3 for two-stage detectors. The learning rate decays by a factor of 0.1 at epochs 24 and 33. The comparison experiments are conducted based on mmdetection [47]. All experiments are carried out on the SJTU student innovation center's CentOS 7.3 system with two 1080Ti GPUs.

B. Evaluation Metrics
Widely used criteria are adopted to quantitatively evaluate the detection performance, namely, the precision-recall curve (PRC), average precision (AP), and F1-score. Generally speaking, a detection result is a true positive if the RIoU overlap ratio between the predicted box and the ground truth box is more than 0.5; otherwise, the predicted box is considered a false positive. Furthermore, if several predicted boxes overlap with the same ground truth, only the box with the highest score is considered a true positive, and the others are false positives. To eliminate the difference between different OBB representations, we first convert the OBBs to polygon form and then calculate the RIoU between polygons. The precision measures the fraction of detections that are true positives, and the recall measures the fraction of positives that are correctly identified:

precision = N_tp / N_pred,  recall = N_tp / N_target

where N_tp is the number of targets correctly detected, N_pred denotes the total number of predicted boxes, and N_target represents the actual number of targets. The F1-score combines the precision and recall metrics into a single measure for evaluating the quality of an object detection method:

F1 = 2 × precision × recall / (precision + recall).

Unless otherwise specified, the abovementioned indicators are measured at an RIoU threshold of 0.5. Given the RIoU threshold, the corresponding PRC is determined, and the reported precision and recall correspond to the point on this curve that maximizes the F1-score. The AP metric quantitatively evaluates the comprehensive detection performance of the detector by calculating the area under the PRC:

AP = ∫₀¹ P(R) dR.

The higher the AP value, the better the performance, and vice versa. The APs at RIoU thresholds of 0.5 and 0.75 are recorded as AP50 and AP75, respectively.
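The metrics above can be computed directly from the counts and the sampled PRC; a minimal sketch (function names are ours, and the PR points are assumed sorted by increasing recall):

```python
def precision_recall_f1(n_tp, n_pred, n_target):
    # precision = N_tp / N_pred, recall = N_tp / N_target.
    p = n_tp / n_pred
    r = n_tp / n_target
    f1 = 2 * p * r / (p + r)
    return p, r, f1

def average_precision(precisions, recalls):
    """Area under the precision-recall curve by trapezoidal integration."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += 0.5 * (precisions[i] + precisions[i - 1]) * (recalls[i] - recalls[i - 1])
    return ap
```

For example, 8 true positives out of 10 predictions against 16 ground-truth ships gives a precision of 0.8 and a recall of 0.5, and a perfect detector whose precision stays at 1.0 over the whole recall range scores AP = 1.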
Furthermore, mAP is adopted to obtain a more comprehensive performance evaluation; it is defined as the mean of the APs computed at RIoU thresholds from 0.5 to 0.95 in steps of 0.05. Moreover, the AP of targets with an area of less than 1024 pixels is denoted AP_small, and the AP of targets with an area greater than 1024 pixels is denoted AP_medium. These two indicators are introduced to verify the impact of modifying the FPN on the detector.

C. Ablation Experiments and Further Analysis 1) Effect of PVT-SAR:
Taking RetinaNet-FPN as the baseline, we evaluate the performance of transformer-based backbones on the HRSID dataset. Besides the classic DCNN backbone, i.e., ResNet-50 [48], ResNext [49] with MHA and ResNest [50] with split attention, two advanced DCNN backbones with attention mechanisms, are also taken into account for comparison. The results are shown in Table III, which shows that OPE does not contribute to performance improvement without the participation of MTE, and vice versa. However, if OPE and MTE are both added to PVT-small, AP_small is significantly improved from 0.310 to 0.350. This is because OPE introduces redundant information between different patches, which is not considered by the original transformer encoder, while MTE introduces local information through convolution and thus reduces the negative effects of this redundant information. Therefore, the two modules are mutually complementary. After replacing the FPN with the SFPN, the model's performance is further improved. Especially for small objects, AP_small increases from 0.350 to 0.464; compared with the baseline, AP_small increases by a remarkable 49.7%. The experiments show that PVT-SAR can significantly improve the performance of rotated object detection for small targets, as seen in Fig. 11. For the medium targets in the upper-right corner, the case in which two ships were detected as one no longer occurs. Moreover, the detector's confidence score is higher for the small targets in the lower-left corner. It can also be found that not all advanced DCNN-based backbones achieve performance improvements: the performance of the two DCNN-based backbones with attention mechanisms decreases. For example, AP_small of ResNext even drops by 0.072 compared with ResNet-50. This shows that an attention mechanism effective for horizontal object detection is not necessarily effective for rotated object detection.
The features extracted by the transformer have a richer characterization ability than those of the attention mechanism, which helps to improve the performance of rotated object detection.
2) Robustness to Noise: To evaluate the robustness of the proposed backbone to noise, multiplicative speckle noise with varying signal-to-noise ratios (SNRs) is added to each test image [52]. The speckled SAR image Î is obtained according to

Î = I + N ∘ I

where N is uniformly distributed random noise with mean 0 and variance ν. Denote P(I) as the energy of an image I. Then, the SNR is defined as

SNR = 10 log₁₀ [ P(I) / P(N ∘ I) ].
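The speckle model above can be sketched as follows. This is a hedged sketch, not the paper's exact noise generator: we draw zero-mean uniform noise and rescale it so that the measured SNR matches a requested value, rather than sweeping the variance ν directly; the function names are ours.

```python
import numpy as np

def add_speckle(img, snr_db_target, rng=None):
    """Add multiplicative speckle I_hat = I + N * I, with N zero-mean uniform
    noise rescaled so that 10*log10(P(I)/P(N*I)) equals snr_db_target."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = rng.uniform(-1.0, 1.0, size=img.shape)
    target_noise_energy = np.sum(img ** 2) / (10.0 ** (snr_db_target / 10.0))
    scale = np.sqrt(target_noise_energy / np.sum((n * img) ** 2))
    return img + scale * n * img

def measure_snr_db(img, noisy):
    # SNR = 10 log10( P(I) / P(noise) ), with noise = I_hat - I.
    noise = noisy - img
    return 10.0 * np.log10(np.sum(img ** 2) / np.sum(noise ** 2))
```

Because the noise is multiplied by the image before being added, bright scatterers receive proportionally stronger perturbations, which is the defining property of speckle as opposed to additive Gaussian noise.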
Here, we compare the proposed architecture with two representative DCNN backbones with attention mechanisms; the two dashed curves represent the two backbones proposed in this article. The results are shown in Fig. 12, which illustrates that AP50 of all methods increases with higher SNR. However, PVT-SAR shows the highest AP50 for SNRs from −5 to 15 dB. We can see from Fig. 12 that AP50 of PVT-SAR is about 10%-15% higher than that of the DCNN-based backbones at low SNRs, which means that the proposed transformer-based backbone is more robust to speckle noise. This can be attributed to the transformer's ability to obtain global attention: even if the noise suppresses part of the local information of the target, the transformer can make a comprehensive judgment by combining the global information.
3) Effect of nGWD: Taking RetinaNet-HBB based on ResNet-50 as the baseline, we evaluate the performance of the nGWD loss on the HRSID dataset. The results are shown in Table IV. It turns out that nGWD brings obvious improvements to the AP_small and AP_medium indices. When λ is equal to 5, AP_small reaches 0.357 (an increase of 15.2%), and AP_medium reaches 0.908 (an increase of 1.7%). This also demonstrates that the impact of AoS cannot be ignored, as it may lead to loss instability or even nonconvergence. Hence, λ = 5 is selected for the follow-up experiments. Furthermore, the detection results using GWD and nGWD are drawn in Fig. 13. It can be seen that nGWD helps the model learn the proper orientation of the OBBs, which significantly improves the AP of the detector in the inshore scene.

4) Ablation Experiment:
The effectiveness of PVT-SAR and nGWD has been evaluated separately in the previous experiments; here, we verify the benefit of integrating them. The results are shown in Table V. When only PVT-SAR is used, AP50, AP75, and mAP increase by 7.1%, 21.1%, and 13.7%, respectively, compared with the baseline. When only nGWD is employed, AP50, AP75, and mAP increase by 4.1%, 19.8%, and 10.7%, respectively. Hence, the contribution of the transformer-based PVT-SAR is more significant than that of nGWD. It is worth emphasizing that the performance of PVT-SAR is further improved by combining it with nGWD: compared with the baseline, the combination increases AP50, AP75, and mAP by 16.6%, 42.9%, and 28.6%, respectively. Recall that PVT-SAR extracts features with richer expression, while nGWD filters the noise of SAR images. Combining the two modules therefore endows the model with a powerful information mining ability and yields a marked AP improvement. In the following experiments, the nGWD loss is added to PVT-SAR by default.

5) Comparison With Representative Methods:
In this part, PVT-SAR is compared with several state-of-the-art OBB-based aerial detectors. The quantitative comparison between PVT-SAR and the other methods on the HRSID dataset is given in Table VI. We can see that the recall, F1, and AP metrics of PVT-SAR are the highest in both inshore and offshore scenarios. Its AP50 outperforms that of ReDet [53] by 23.7% in the inshore scenario, which demonstrates the ability of the proposed method against complex backgrounds. In offshore scenes, PVT-SAR is also superior to the other methods. Compared with the baseline, PVT-SAR increases the recall, precision, F1, and AP50 by 42.1%, 23.6%, 32.9%, and 49.4%, respectively. The experimental results show that the indicators of PVT-SAR are optimal compared with the other detection models in both offshore and inshore scenes. Moreover, it has the smallest model size, requiring fewer trainable parameters than the other models. The average inference time per task on one 1080Ti GPU is also given in Table VI; it can be seen that PVT-SAR lags behind some of the DCNN-based detectors in inference time. Furthermore, no matter whether they are one-stage or two-stage, some detectors do not work well on the SAR dataset, e.g., Gliding Vertex [54] and Oriented RCNN [55]. This reflects that the gap between SAR and aerial images cannot be ignored and that it is necessary to design detectors specifically for SAR images.

6) Investigation of the Generalization Ability:
In addition to HRSID, another, smaller dataset, RSSDD, is also used to verify the performance of PVT-SAR. Table VII shows the quantitative comparison between PVT-SAR and the other methods on the RSSDD dataset. It can be seen from the table that the superiority of the transformer-based model in extracting spatial features ensures that PVT-SAR achieves the best detection results, especially for small targets: APsmall increases by 9.8% compared with the baseline. These results further verify that the proposed method generalizes well across different SAR datasets. On the other hand, RSSDD contains far fewer samples than HRSID, which shows that PVT-SAR still outperforms the other methods even when the training data are limited. Admittedly, although the image size is reduced from 800 × 800 pixels to 608 × 608 pixels, PVT-SAR still lags behind some of the DCNN-based detectors in inference time.
7) Validation on Large-Scene SAR Image: To examine the migration ability of the model trained on our dataset, we obtained panoramic Alos-2 SAR imagery containing multiple inshore and offshore ships for the experiment. Since the size of the large-scale SAR imagery does not fit the input of the detectors, the detection process is divided into several steps. First, the SAR imagery is cropped with an 800 × 800 pixel sliding window in both the horizontal and vertical directions; successive crops overlap by 160 pixels so that the stitching process can be implemented. Second, the 644 cropped SAR images are input into the detectors. Third, the detection results are stitched to form the detected panoramic SAR imagery.
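The cropping and stitching steps above can be sketched as follows. The border handling (clamping the last window to the image edge) is a common convention assumed here; the paper does not spell out its exact scheme, and the final global NMS pass is likewise an assumption:

```python
def crop_windows(height, width, win=800, overlap=160):
    """Top-left (y, x) corners of win x win crops with the given overlap.

    The last window along each axis is clamped to the image border so the
    full scene is covered (assumed convention, not the paper's exact one).
    """
    stride = win - overlap
    ys = list(range(0, max(height - win, 0) + 1, stride))
    xs = list(range(0, max(width - win, 0) + 1, stride))
    if ys[-1] + win < height:
        ys.append(height - win)
    if xs[-1] + win < width:
        xs.append(width - win)
    return [(y, x) for y in ys for x in xs]

def stitch(detections_per_crop, corners):
    """Map per-crop boxes (x1, y1, x2, y2, score) back to scene coordinates."""
    merged = []
    for dets, (oy, ox) in zip(detections_per_crop, corners):
        merged += [(x1 + ox, y1 + oy, x2 + ox, y2 + oy, s)
                   for x1, y1, x2, y2, s in dets]
    return merged  # a global NMS pass would deduplicate overlap-zone hits
```

The 160-pixel overlap guarantees that any ship shorter than the overlap appears whole in at least one crop, so stitching plus NMS can recover it.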
The visual ship detection result of PVT-SAR is shown in Fig. 14. We can see that PVT-SAR performs well in offshore scenes. However, some false alarms appear on the land areas of the Alos-2 imagery.
8) Effect of Pretrained Weights: [14] indicates that, in some cases, training from scratch is a better choice. Therefore, we conducted further experiments on the transformer-based backbone when training from scratch. Table VIII shows the quantitative comparison between training from ImageNet pretrained weights and training from scratch on the RSSDD dataset. As can be seen from the table, the pretrained weights do not play an important role in offshore scenes; for ReDet, training from scratch even achieves higher accuracy. The scenes that benefit the most from the pretrained models are the complex inshore scenes. It is commendable that the performance of the proposed PVT-SAR model shows only a slight decrease, even in the inshore scenes: the transformer-based detector can mine the complex context information of inshore scenes without relying on pretrained weights. By contrast, the detection performance of all CNN-based models in inshore scenes degrades greatly without pretrained weights. This shows the superiority of the transformer-based model in extracting complex features of SAR images.

9) Comparison With the Convolution With Large Kernels:
We compare PVT-SAR with a convolutional counterpart using larger kernels. The transformer can extract long-distance information and takes it into account when learning ship features, while convolution can also obtain a wide range of features with a large kernel. Therefore, we replace the 3 × 3 convolutions in the FPN of ResNet50-FPN with 7 × 7 convolutions. As shown in Fig. 15, although large-kernel convolution obtains a broader range of responses, the shape of the response is irrelevant to the ship. In contrast, the response of PVT-SAR tightly surrounds the ship, which means that the features extracted by PVT-SAR are more accurate. To the best of our knowledge, previous work has not analyzed the responses of the transformer and the CNN on the feature maps in detail; our experiment further verifies the advantages of the transformer.
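The gap between large-kernel convolution and self-attention can be made concrete with standard receptive-field arithmetic: a stack of convolutions grows its theoretical receptive field additively, whereas self-attention is global at every layer. This is illustrative arithmetic only, not a computation from the paper:

```python
def receptive_field(kernels, strides=None):
    """Theoretical receptive field of a conv stack.

    Standard recurrence: rf += (k - 1) * jump, then jump *= stride,
    where jump is the input-pixel spacing between adjacent outputs.
    """
    strides = strides or [1] * len(kernels)
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf
```

Four 3 × 3 layers see a 9-pixel window and four 7 × 7 layers a 25-pixel window, so enlarging kernels widens the response, but it remains a fixed local window with no content-dependent shape, which matches the diffuse responses observed in Fig. 15.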
10) Error Analysis of PVT-SAR: According to Fig. 16, there are some false alarms in the heat maps, marked with red ellipses. Two factors may lead to these false alarms: insufficient land backgrounds in the training set and image clipping during dataset production. First, deep learning is highly dependent on data; massive datasets have significantly improved model performance, a process from quantitative change to qualitative change. Due to the insufficient land backgrounds in the training set, the model may misinterpret land backgrounds that it has never seen before. To solve this problem, we plan to introduce few-shot learning techniques or to create a very large SAR ship dataset, similar to the large-scale remote sensing dataset FAIR1M, in future work. Second, we found that false detections are more likely to occur at the edge areas of the image. We believe that this is introduced during dataset production. Limited by GPU memory, the deep learning model cannot take the whole huge SAR image as input. Therefore, the dataset is generated by cutting the huge image into small images that are fed into the model for training, which may cause some objects to be split into two parts. The model learns these incomplete targets at the edge areas of the images. In order to detect such incomplete ships, the model develops a higher tolerance for features at the image boundary, which leads to more false detections. To overcome this problem, the dataset needs to be constructed from the original huge image and its annotations so that the indicators can be tested directly on the huge image.
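One common mitigation for the edge-area false detections described above is to discard, at stitching time, detections that touch a crop border (where ships are likely truncated), unless that border coincides with the full scene's border. This post-processing step is a hypothetical remedy sketched here, not part of the published pipeline:

```python
def drop_edge_boxes(dets, win=800, margin=8,
                    crop_at_image_border=(False, False, False, False)):
    """Filter per-crop detections (x1, y1, x2, y2, score).

    Boxes within `margin` pixels of a crop border are discarded as likely
    truncated, except on sides where the crop border is also the full
    scene's border (flag order: left, top, right, bottom). Hypothetical
    post-processing step, not part of the paper's pipeline.
    """
    left_b, top_b, right_b, bottom_b = crop_at_image_border
    keep = []
    for x1, y1, x2, y2, s in dets:
        if x1 < margin and not left_b:
            continue  # clipped at the crop's left edge
        if y1 < margin and not top_b:
            continue  # clipped at the crop's top edge
        if x2 > win - margin and not right_b:
            continue  # clipped at the crop's right edge
        if y2 > win - margin and not bottom_b:
            continue  # clipped at the crop's bottom edge
        keep.append((x1, y1, x2, y2, s))
    return keep
```

With an overlap larger than the margin, every discarded truncated ship reappears whole in a neighboring crop, so recall is preserved while boundary false alarms are suppressed.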

V. CONCLUSION
In this article, a novel PVT architecture for arbitrarily oriented ship detection in SAR images has been developed. Compared with the original PVT, we proposed the OPE and MTE modules to overcome the problems of small target size, dense target arrangement, and insufficient data in SAR ship detection. We also introduced the multiscale feature fusion module to enhance the utilization of large-scale features and thus improve the detection ability for small targets. The nGWD loss was further incorporated to suppress the influence of scattering interference at the ship's boundary. Experimental results on RSSDD and HRSID have verified the superiority of the proposed detector.