ROS-Det: Arbitrary-Oriented Ship Detection in High Resolution Optical Remote Sensing Images via Rotated One-Stage Detector

To address these problems, namely, the arbitrary orientations, various sizes, and dense distributions encountered in ship detection, we propose an arbitrary-oriented ship detection method based on a rotated one-stage detector (ROS-Det), which integrates a feature pyramid network (FPN) built on an improved ResNet50, rotated anchors, a classification network, and a regression network. First, to improve robustness against ships of various sizes, the FPN is used to fuse multiscale convolutional feature maps. Through several tweaks, the improved ResNet50 receives more information while reducing the computational cost. Second, to enable arbitrary-oriented ship detection, rotated anchors, skew intersection over union (IoU), and skew non-maximum suppression (NMS) are introduced into RetinaNet. Then, because arbitrary-oriented object detection methods usually suffer from a loss discontinuity problem, we improve the traditional smooth L1 loss function by introducing an IoU constant factor. Finally, aided by techniques such as data augmentation and transfer learning, we evaluate ship detection on the public HRSC2016 ship dataset. Through comparison experiments, we analyze and discuss the validity of the proposed ROS-Det, which achieves state-of-the-art performance.


I. INTRODUCTION
With the development of aerospace technologies and sensors, a large number of high-resolution, high-quality remote sensing images have become available to facilitate object detection [1]-[3]. Ship detection plays a vital role in many fields, such as harbor dynamic surveillance, traffic monitoring, maritime management, and national defense construction.
Traditional ship detection methods mainly consist of feature extraction and classification recognition, for example, voting-based pose estimation methods [4], deep forest ensemble-based methods [5], Haar-like feature-based methods [6], and Fourier HOG-based methods [7]. Most of these methods rely heavily on manual features and sliding windows, which results in low accuracy and high time cost.
In 2014, Girshick et al. [8] introduced an iconic model in the object detection field: the region-based convolutional neural network (R-CNN). To further reduce the time cost, an improved R-CNN algorithm, i.e., Fast R-CNN, was proposed [9]. Since the above two models obtain candidate regions through artificial methods, a speed bottleneck exists. Faster R-CNN [10] replaces the selective search (SS) algorithm with a region proposal network (RPN). At the same time, the RPN and the subsequent detection network share the same convolutional layers, which greatly reduces the detection time. To solve the multiscale problem, the feature pyramid network (FPN) [11] was proposed, which can be applied to Faster R-CNN. The above methods are representative two-stage detectors; others include Mask R-CNN [12], the region-based fully convolutional network (R-FCN) [13], and Cascade R-CNN [14]. Different from two-stage detectors, one-stage detectors simultaneously complete two tasks, i.e., classification and localization. You Only Look Once (YOLO) [15], the Single Shot MultiBox Detector (SSD) [16], and RetinaNet [17] are representative one-stage detectors. Specifically, YOLO simultaneously completes feature extraction, object classification and positioning based on CNNs, thereby greatly improving detection speed: after dividing the image into S×S grids, it directly predicts the location and category of each object. SSD has an improved ability to detect small objects. RetinaNet introduces a new loss function, the focal loss, thereby significantly improving detection accuracy.
In recent years, many ship detection algorithms and models based on deep learning have been proposed. For example, You et al. [18] first used a scene mask extraction network for scene segmentation and then excluded false targets based on the feature map and the inferred scene mask. Shao et al. [19] achieved ship detection by using CNNs together with saliency detection, employing comprehensive ship features, e.g., deep features, saliency maps, and coastline priors. Wu et al. [20] introduced a coarse-to-fine idea into ship detection.
The above-mentioned methods are known as horizontal region detection methods and have three main disadvantages. 1) Horizontal bounding boxes (HBBs) cannot reflect the real shape of ships with various orientations, as shown in Fig. 1(a). 2) As shown in Fig. 1(b), object and background pixels cannot be effectively separated, especially for ships with a large aspect ratio. 3) As a result of the large overlap between bounding boxes, dense ships may be missed, as shown in Fig. 1(c). Compared with HBBs, rotated bounding boxes (RBBs) can reflect the physical size of ships, contain fewer background pixels, and efficiently separate dense ships, as shown in Fig. 1(d). Several methods for arbitrary-oriented scene text detection have been proposed. For example, Jiang et al. [21] proposed a text detection method called the rotational region CNN (R2CNN) by introducing angle information. Ma et al. [22] proposed the rotation RPN (RRPN) for scene text detection, which improved the RPN and region of interest (RoI) pooling of Faster R-CNN by introducing a rotation factor. Liao et al. [23] proposed a text detector called the rotation-sensitive regression detector (RRD). Inspired by these methods, several rotated detectors for ship detection have been proposed. Liu et al. [24] introduced a method called RR-CNN for ship detection along with a public ship dataset, the 2016 High-Resolution Ship Collection (HRSC2016). Yang et al. [25] integrated a dense FPN, an RRPN, and multiscale RoI Align to achieve ship detection. Zhang et al. [26] employed a rotated RPN to generate RBBs with orientation angle information. Tian et al. [27] considered the multiscale problem in ship detection. As a result of their complex structures, these methods face a speed bottleneck. One-stage rotation detectors can be a good choice if one is willing to sacrifice some accuracy for speed; however, the loss discontinuity problem in angle regression still limits their performance.
To address the above-mentioned problems, we present a rotated one-stage detector (ROS-Det) for arbitrary-oriented ship detection based on RetinaNet, which integrates an FPN based on an improved 50-layer residual neural network (ResNet50), rotated anchors, a classification network, and a regression network. Compared with other methods, such as Faster R-CNN [10], Cascade R-CNN [14], SSD [16], RetinaNet [17], R-FCN [13], R2CNN [21], RR-CNN [24], RDFPN [25], RRPN [22], R2PN [26], Tian et al. [27], RRD [23], RoI-Transformer [28] and RSDet [29], our proposed ROS-Det method achieves higher detection performance. The main ideas and contributions are as follows: 1) Different from previous ship detection methods, the proposed ROS-Det can handle arbitrary orientations, various sizes and dense distributions in ship detection. 2) To improve robustness against various sizes of ships, we propose an FPN based on an improved ResNet50; moreover, the improved ResNet50 is able to receive more information and reduce the computational cost. 3) For arbitrary-oriented ship detection, we employ rotated anchors, the skew intersection over union (IoU) and skew non-maximum suppression (NMS) to improve the one-stage detector RetinaNet. 4) To address the loss discontinuity problem, we improve the traditional smooth L1 loss function by introducing an IoU constant factor.

II. METHODS
The architecture of the proposed ROS-Det method is shown in Fig. 2, which includes four parts: an FPN based on improved ResNet50, rotated anchors, a classification network (Cls-Net), and a regression network (Reg-Net). Specifically, the FPN based on an improved ResNet50 is used to extract and fuse multiscale feature maps. Then, we add rotated anchors of multiple aspect ratios, sizes and angles at each FPN level. Finally, Cls-Net and Reg-Net perform classification and regression, respectively.

A. FEATURE PYRAMID NETWORK BASED ON AN IMPROVED ResNet50
A model tweak [30] generally does not increase the time cost, but it may substantially improve performance. Therefore, an improved ResNet50 is used to extract feature maps over an entire input image. First, we briefly introduce ResNet50 [31], which mainly includes five block groups, as shown in Fig. 3. The first block group has a 7 × 7 × 64 convolutional layer and a 3 × 3 max pooling layer. Starting from the third block group, each block group begins with a down-sampling block followed by many residual blocks. The down-sampling block consists of paths A and B.
For a 1×1 convolutional layer with a stride of 2, only 1/4 of the input information will be obtained during the convolution operation. To ensure that no information is ignored, we switch the stride sizes of the first two convolutional layers in path A.
At the same time, in path B, we add a 2 × 2 average pooling (AvgPool) layer and change the stride of the 1 × 1 convolutional layer to 1. To simplify the calculation, we replace the 7 × 7 convolutional layer in the first block group with three 3 × 3 convolutional layers. The above tweaks are illustrated in Fig. 4.

Scale imbalance (multiscale objects or input images in the dataset) plays a non-negligible role in the overall detection accuracy [32]. Fig. 5 illustrates some ship samples from the HRSC2016 dataset, in which ships in the same image differ greatly in size. The common methods to address this problem are: (1) feature hierarchy-based methods (e.g., SSD [16], MSCNN [33]); (2) image pyramid-based methods (e.g., SNIP [32]); (3) feature pyramid-based methods (e.g., FPN [11]). Considering that the feature information varies among different layers, making predictions separately from different layers, as feature hierarchy-based methods do, is unreliable. Since an image pyramid-based method needs to make predictions separately on multiscale images, we use one only as a data augmentation method.
To avoid the issues above, we use an FPN [11] based on the improved ResNet50 described above. For the improved ResNet50, we denote the final outputs of the five block groups as Ci, i = 1, 2, 3, 4, 5, whose resolutions are reduced by factors of {2, 4, 8, 16, 32} relative to the original image. First, the top-down pathway upsamples the feature maps. Then, each lateral connection merges the feature maps from the two pathways (i.e., the bottom-up pathway and the top-down pathway). Finally, a 3 × 3 convolutional layer is added to generate the final feature maps P3 through P5. In addition, P6 is obtained via a 3 × 3 convolutional layer on C5, and P7 is obtained via rectified linear unit (ReLU) activation and a 3 × 3 convolutional layer on P6.
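The top-down merge above can be sketched with plain array operations. The NumPy sketch below illustrates only the shape arithmetic: random matrices stand in for the learned 1 × 1 lateral convolutions, and the final 3 × 3 smoothing convolutions (and the P6/P7 levels) are omitted. It is an illustration, not the paper's implementation.

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbour 2x upsampling of an (H, W, C) feature map
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_merge(c_feats, out_channels=256, seed=0):
    """Top-down FPN merge sketch.

    c_feats: [C3, C4, C5] as (H, W, C) arrays, each level half the
    spatial size of the previous one. Returns [P3, P4, P5], each with
    `out_channels` channels. A 1x1 conv is just a per-pixel channel
    projection, so a random (C_in, out_channels) matrix stands in for it.
    """
    rng = np.random.default_rng(seed)
    laterals = []
    for c in c_feats:
        w = rng.standard_normal((c.shape[-1], out_channels)) * 0.01
        laterals.append(c @ w)            # lateral 1x1 projection
    p = [laterals[-1]]                    # coarsest level: C5 -> P5
    for lat in reversed(laterals[:-1]):
        # upsample the coarser merged map and add the lateral connection
        p.insert(0, lat + upsample2x(p[0]))
    return p
```

With ResNet50-like channel counts (512, 1024, 2048), the three outputs all have 256 channels at the lateral resolutions, matching the FPN description above.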
B. ROTATED ANCHORS
Anchors are assigned to ground-truth object boxes if the IoU between the anchor and a ground-truth box is at least 0.5; anchors are assigned to the background if the IoU is in [0, 0.4). Unassigned anchors do not participate in training. For each anchor, a length-5 box regression vector represents the offsets between the anchor and its assigned ground truth.
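The assignment rule above can be summarized as a small helper; `assign_anchor` is an illustrative name, not from the paper.

```python
def assign_anchor(iou):
    """Assign an anchor by its skew IoU with its best-matching ground truth.

    Returns 1 for positive (ship), 0 for background, and -1 for ignored
    (excluded from training), following the thresholds in the text:
    IoU >= 0.5 -> positive, IoU < 0.4 -> background, [0.4, 0.5) -> ignored.
    """
    if iou >= 0.5:
        return 1   # positive: regressed toward the matched ground truth
    if iou < 0.4:
        return 0   # negative: background
    return -1      # in [0.4, 0.5): does not participate in training
```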
It would be inaccurate to use the traditional IoU computation for HBBs to compute the IoU of RBBs. Therefore, a skew IoU computation is defined based on the geometric principles of triangulation. Given two rectangles R_ABCD and R_EFGH, the geometric principles and pseudocode of the skew IoU computation are shown in Fig. 8 and Table 1, respectively. Specifically, we generate the point set PSet between R_ABCD and R_EFGH, which includes the intersection points PSet1 of the two rectangles and the vertices PSet2 of one rectangle that lie inside the other. Then, we compute the intersection area by triangulation: we form a convex polygon by sorting the points of PSet into anticlockwise order and then sum the areas of its constituent triangles. Finally, we compute the skew IoU.
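The same intersection area can be obtained by clipping one rectangle against the other instead of collecting and triangulating PSet. The sketch below uses Sutherland-Hodgman polygon clipping plus the shoelace formula, which yields the same intersection area as the triangulation described above; the function names are ours, not the paper's.

```python
def polygon_area(poly):
    # area of a simple polygon [(x, y), ...] via the shoelace formula
    a = 0.0
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        a += x1 * y2 - x2 * y1
    return abs(a) / 2.0

def clip_polygon(subject, clipper):
    # Sutherland-Hodgman clipping of convex polygon `subject` by convex
    # `clipper` (both counter-clockwise); returns their intersection polygon
    def inside(p, a, b):
        # p lies on or to the left of the directed edge a -> b
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0.0

    def intersect(p, q, a, b):
        # intersection of segment p-q with the infinite line through a and b
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / (dx1 * dy2 - dy1 * dx2)
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        polygon, output = output, []
        for j in range(len(polygon)):
            p, q = polygon[j], polygon[(j + 1) % len(polygon)]
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersect(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersect(p, q, a, b))
        if not output:
            return []
    return output

def skew_iou(rect1, rect2):
    # rect1, rect2: four (x, y) corners in counter-clockwise order
    inter_poly = clip_polygon(rect1, rect2)
    inter = polygon_area(inter_poly) if len(inter_poly) >= 3 else 0.0
    union = polygon_area(rect1) + polygon_area(rect2) - inter
    return inter / union if union > 0.0 else 0.0
```

For two unit squares offset by half a side, the intersection is 0.5 and the union 1.5, giving a skew IoU of 1/3.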

C. CLASSIFICATION AND REGRESSION NETWORKS
1) CLASSIFICATION NETWORK
The Cls-Net, a simple and small FCN, is attached to each pyramid level, and its parameters are shared across levels. Specifically, the Cls-Net consists of four 3 × 3 × 256 convolutional layers, each followed by ReLU activation, then a 3 × 3 × A convolutional layer and sigmoid activation, as shown in Fig. 9. Given a W × H × 256 feature map, it outputs A binary predictions per spatial location.

2) REGRESSION NETWORK
In parallel with the Cls-Net, the Reg-Net has exactly the same structure as the Cls-Net except for its output: for each of the A anchors at each spatial position, 5 outputs predict the relative offsets. It is worth noting that the Cls-Net and the Reg-Net use separate parameters.
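The per-level output shapes of the two heads follow directly from the description above; the helper below is only an illustrative shape check (`head_output_shapes` is our name, and the single-class case matches the ship-only setting here).

```python
def head_output_shapes(h, w, num_anchors, num_classes=1):
    """Output shapes of the two heads at one pyramid level.

    The Cls-Net emits num_classes binary scores per anchor per location;
    the Reg-Net emits 5 offsets (x, y, w, h, theta) per anchor per
    location. For ship-only detection num_classes is 1.
    """
    cls_shape = (h, w, num_anchors * num_classes)
    reg_shape = (h, w, num_anchors * 5)
    return cls_shape, reg_shape
```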

3) LOSS FUNCTION
As mentioned above, five offsets are predicted for each anchor:

t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), t_θ = θ − θ_a,

where (x, y, w, h, θ) indicate the box's center coordinates, width, height, and angle, respectively; θ_a, θ, and θ* indicate the anchor, predicted bounding box, and ground truth, respectively (likewise for x, y, w, h), and the ground-truth offsets t*_i are defined analogously.
For the Cls-Net and Reg-Net, a multitask loss function is defined as follows:

L = (1/N) Σ_n L_cls(p_n, p*_n) + (λ/N) Σ_n p*_n Σ_i L_reg(t_ni, t*_ni),
L_cls(p_n, p*_n) = FL(p_nt) = −α_t (1 − p_nt)^γ log(p_nt),

where n is the index of an anchor, N indicates the number of anchors, and λ balances the two terms. p_n is the category probability calculated by the sigmoid function, and p*_n represents the ground-truth class, i.e., ship. t_ni and t*_ni (i ∈ {x, y, w, h, θ}) represent the offsets of the predicted bounding box and the ground truth, respectively. The classification loss is the focal loss (FL) [17], in which α ∈ [0, 1] is a weighting factor and γ is a focusing parameter.
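For reference, the binary focal loss with these parameters is only a few lines; this is a direct transcription of FL from the RetinaNet paper [17], with `focal_loss` as an illustrative name.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p: sigmoid score for the positive class; y: ground-truth label in {0, 1}.
    The (1 - p_t)**gamma factor down-weights well-classified examples, so
    training focuses on the hard ones.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

A well-classified positive (p = 0.9) contributes a loss orders of magnitude smaller than a hard one (p = 0.1), which is the point of the focusing parameter γ.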
In this paper, we use α = 0.25 and γ = 2 in all experiments. The regression loss is the smooth L1 loss [9]. However, there exists a loss discontinuity for the angle parameter in rotation detection, as shown in Fig. 10. During the regression of an anchor (0, 0, 60, 30, −π/2), the optimal path is to rotate the anchor counterclockwise to obtain the predicted box (0, 0, 60, 30, −2π/3), but the smooth L1 loss between the predicted offsets and target offsets is very large as a result of the angular periodicity and edge exchange. To avoid this loss discontinuity, the model must regress along other, relatively complex paths. Specifically, the anchor needs to be rotated clockwise to obtain the yellow box (0, 0, 60, 30, −π/6), and the yellow box is then scaled to obtain the final predicted box (0, 0, 30, 60, π/6). Although the latter path is continuous, it increases the regression difficulty. SCRDet [34] has confirmed the effectiveness of introducing an IoU constant factor to address this boundary problem. Therefore, the IoU constant factor |−log(IoU)| / |L_reg(t_ni, t*_ni)| is introduced into the traditional smooth L1 loss:

L'_reg = (|−log(IoU)| / |L_reg(t_ni, t*_ni)|) · L_reg(t_ni, t*_ni),

where IoU indicates the overlap between the predicted bounding box and the ground truth. The improved regression loss thus consists of two parts: the smooth L1 term determines the gradient direction, while the IoU factor determines the loss magnitude. When angular periodicity and edge exchange occur, the predicted and ground-truth boxes nearly coincide, so the regression loss is approximately equal to 0 (|−log(IoU)| ≈ 0).
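A minimal sketch of the improved regression loss, assuming the SCRDet-style formulation described above (the smooth L1 term keeps the gradient direction while |−log(IoU)| sets the magnitude); function names and the eps guard are ours:

```python
import math

def smooth_l1(x, beta=1.0):
    # standard smooth L1 (Huber) on a single offset difference
    return 0.5 * x * x / beta if abs(x) < beta else abs(x) - 0.5 * beta

def iou_smooth_l1(pred_offsets, target_offsets, iou, eps=1e-10):
    """IoU-smooth-L1 regression loss for one box (sketch).

    Because the smooth L1 sum is divided by its own magnitude, the loss
    value equals |-log(IoU)|: when the predicted and ground-truth boxes
    coincide geometrically (IoU -> 1), the loss vanishes even if the
    angle offsets differ by a period, removing the discontinuity.
    """
    l1 = sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target_offsets))
    return abs(-math.log(max(iou, eps))) / (l1 + eps) * l1
```

With IoU = 1 the loss is 0 regardless of the raw angle offsets, which is exactly the behavior the boundary case in Fig. 10 requires.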

4) SKEW NON-MAXIMUM SUPPRESSION
NMS is extensively used as a postprocessing method in object detection to eliminate redundant bounding boxes. In rotation detection, the traditional NMS method is insufficient because of the added angle parameter. Given an IoU threshold T (e.g., 0.7), a box with an aspect ratio of 1:8 and an angle difference of 15 degrees from the ground truth has an IoU of only 0.31, which is less than T; an IoU-only criterion would therefore discard it, even though it should actually be regarded as a positive sample. Therefore, we adopt a skew NMS method with two constraints: (a) among the boxes whose IoU is larger than 0.7, we keep the one with the maximum IoU; (b) if all IoUs lie between 0.3 and 0.7, we keep the box whose angle difference is smallest, provided that difference is less than 15 degrees.
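One reading of the two constraints (following the RRPN-style skew NMS rule that this formulation builds on) is a per-object selection over overlapping candidates; the sketch below abstracts each candidate to its IoU and angle difference, and the function name is ours.

```python
def skew_nms_select(candidates, t_high=0.7, t_low=0.3, max_angle=15.0):
    """Skew NMS selection for one group of overlapping rotated boxes.

    candidates: list of (iou, angle_diff_deg) pairs.
    Rule (a): if any IoU exceeds t_high, keep the box with the maximum IoU.
    Rule (b): otherwise, among boxes with IoU in [t_low, t_high] and an
    angle difference below max_angle, keep the smallest angle difference.
    Returns the index of the kept candidate, or None if nothing qualifies.
    """
    high = [(iou, i) for i, (iou, _) in enumerate(candidates) if iou > t_high]
    if high:
        return max(high)[1]   # rule (a): highest IoU wins
    mid = [(d, i) for i, (iou, d) in enumerate(candidates)
           if t_low <= iou <= t_high and d < max_angle]
    if mid:
        return min(mid)[1]    # rule (b): smallest angle difference wins
    return None
```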

D. DATA AUGMENTATION
The performance of deep learning models depends heavily on massive numbers of well-annotated training samples. However, obtaining a large-scale well-annotated remote sensing dataset is much more laborious and expensive than obtaining a natural image dataset, so the number of labeled remote sensing images is very limited. Compared with object classification, it is more difficult to implement data augmentation strategies for object detection, mainly because distorting the image also changes the bounding-box locations and object sizes. In this paper, we adopt random rotation, graying, multiscale training and flipping.
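One reason RBB annotations combine well with rotation augmentation is that rotating the image only moves each box's center and shifts its angle, while the width and height are unchanged. A sketch, assuming angles in degrees and rotation about the image center (`rotate_rbb` is our name):

```python
import math

def rotate_rbb(box, angle_deg, cx_img, cy_img):
    """Rotate an RBB annotation (x, y, w, h, theta_deg) by angle_deg
    around the image center (cx_img, cy_img).

    The center is rotated with the image, the angle is shifted by the
    rotation, and the width/height stay fixed.
    """
    x, y, w, h, t = box
    a = math.radians(angle_deg)
    dx, dy = x - cx_img, y - cy_img
    nx = cx_img + dx * math.cos(a) - dy * math.sin(a)
    ny = cy_img + dx * math.sin(a) + dy * math.cos(a)
    return (nx, ny, w, h, t + angle_deg)
```

(In practice the resulting angle would also be wrapped back into the dataset's angle range, which this sketch omits.)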

III. EXPERIMENTS AND RESULTS
The experimental environment includes an Intel Core i9-9820X processor and four NVIDIA GeForce RTX 2080Ti GPUs. The deep learning framework is TensorFlow running on Ubuntu 16.04.

A. EXPERIMENTAL DATASET
The HRSC2016 [24] dataset contains 1061 images and 2976 ship samples collected from Google Earth. It includes 436 training images, 181 validation images and 444 testing images. Fig. 11 shows examples from the HRSC2016 dataset. The annotation files in HRSC2016 are in XML format, where the data annotations include an HBB format (box_xmin, box_ymin, box_xmax, box_ymax) and an RBB format (mbox_cx, mbox_cy, mbox_w, mbox_h, mbox_ang), as shown in Fig. 12(a). The annotation files in the Visual Object Classes (VOC) dataset are also in XML format, with HBB annotations determined by (xmin, ymin, xmax, ymax). The annotation files used by our model are likewise in XML format, with RBB annotations determined by the four corner points (x0, y0, x1, y1, x2, y2, x3, y3). Therefore, we wrote code to convert the annotation files in the HRSC2016 dataset into these two annotation formats, as shown in Fig. 12(b) and Fig. 12(c).

B. EVALUATION INDEX
The evaluation indexes are the precision-recall curve (PRC) and average precision (AP).
The precision and recall are defined as follows:

Precision = TP / (TP + FP), Recall = TP / (TP + FN),

where TP, FP, and FN denote the numbers of true positive, false positive, and false negative samples, respectively. In this paper, the IoU threshold is set to 0.5: a detection whose IoU with the ground truth exceeds this threshold counts as a true positive. The AP is defined as the area under the PRC:

AP = ∫ P(R) dR, integrated over R from 0 to 1.
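These definitions can be checked numerically. The AP helper below uses VOC-style all-points interpolation, which is a common choice; the paper does not state which interpolation it uses, so treat this as one reasonable sketch.

```python
def precision_recall(tp, fp, fn):
    # Precision = TP / (TP + FP); Recall = TP / (TP + FN)
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve (all-points
    interpolation): precision is made monotonically non-increasing from
    right to left, then the area is summed over recall intervals."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])   # envelope of the precision curve
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))
```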

C. TRANSFER LEARNING AND NETWORK TRAINING
Generally, the pretrained model can be applied to new object detection tasks through transfer learning, which is efficient and can prevent overfitting. The weights of the model are initialized by the pretrained improved ResNet50 model by default. The initial learning rate is 0.0005, and the maximum number of iterations is 100,000. The learning rate is reduced to 0.00005 after 60,000 iterations, and to 0.000005 after 80,000 iterations. In addition, we choose the momentum optimizer during optimization. The learning rate and loss value for the sequence of iterations are shown in Fig. 13.
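The schedule described above amounts to a piecewise-constant learning rate; a sketch (the function name is ours):

```python
def learning_rate(step):
    """Piecewise-constant schedule from the text: start at 5e-4, divide
    by 10 after 60k iterations and again after 80k (100k total)."""
    if step < 60000:
        return 5e-4
    if step < 80000:
        return 5e-5
    return 5e-6
```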

D. PARAMETER ANALYSIS
The settings of the anchors have a key influence on the detection results. To obtain the best detection performance during network training, we investigate the detection results of the proposed ROS-Det using different anchor settings, as shown in Table 2.
As shown in Table 2, the AP gradually increases as the number of anchors increases and plateaus at 88.2%. This demonstrates that using anchors with multiple aspect ratios and angles is an effective means of improving the AP. When using 5 and 9 aspect ratios, the APs of rows 4 and 7 are lower than those of rows 3 and 6, respectively, demonstrating that the AP drops when the number of angles is increased from 6 to 7. In addition, the AP for 11 aspect ratios and 6 angles is lower than that for 9 aspect ratios and 6 angles, suggesting that the AP drops when the number of ratios is increased from 9 to 11. Therefore, in the proposed ROS-Det, we use 9 aspect ratios and 6 angles, i.e., {1:1, 1:2, 2:1, 1:3, 3:1, 1:5, 5:1, 1:7, 7:1} and {−90, −75, −60, −45, −30, −15} degrees, respectively. In Table 2, we also summarize the training and inference times under different aspect ratios and angles: as the number of anchors increases, both times gradually increase. Although the time cost of 9 ratios and 6 angles is not the smallest, it is still efficient. Fig. 14 shows how the PRCs of the proposed ROS-Det change under different anchor settings; a higher PRC denotes a larger AP. The PRC for 3 aspect ratios and 4 angles lies innermost, showing that its AP is the worst, while the PRC for 9 aspect ratios and 6 angles is the highest, confirming that this setting is the best.

E. ABLATION STUDY
To quantitatively illustrate the effectiveness of each part, we perform a comprehensive ablation study. Table 3 shows the detailed detection results of different data augmentation schemes. As shown in Table 3, our proposed ROS-Det achieves the highest AP, i.e., 88.2%. For schemes 1, 2, 3 and 4, the AP gradually increases, which shows that these data augmentation strategies help improve the ship detection results. The PRCs of the different data augmentation schemes are shown in Fig. 15; the PRC of ROS-Det corresponds to the highest AP. Table 4 shows the results of three pretrained networks. As shown in Table 4, the improved ResNet50 achieves the best performance in terms of AP (88.2%). From MobileNet V2 to ResNet50 to the improved ResNet50, the AP gradually increases. Compared with the original ResNet50, our improved ResNet50 achieves an improvement of approximately 0.1%, because it receives more information and replaces the 7 × 7 convolutional layer with three 3 × 3 ones. This comparative experiment illustrates the role of the improved ResNet50. Fig. 16 illustrates how the PRCs of our method change with different pretrained networks. The PRC of MobileNet V2 is the lowest, indicating that its AP is the worst. The PRC of ResNet50 lies higher than that of MobileNet V2, showing that the feature representation ability of ResNet50 is better. The improved ResNet50 (ROS-Det) has the highest PRC, indicating that it achieves the highest AP. Table 5 shows the detailed detection results with different loss functions. As shown in Table 5, the improved loss function achieves better performance than the traditional smooth L1 loss function, because it solves the loss discontinuity problem of arbitrary-oriented object detection by introducing the IoU constant factor into the traditional smooth L1 loss.
This comparative experiment illustrates the role of the improved loss function. The PRCs of different loss functions are illustrated in Fig. 17. The PRC of improved smooth L1 loss lies higher than that of traditional smooth L1 loss, showing the effectiveness of the improved loss function. In addition, the visualization results of different loss functions are shown in Fig. 18. It can be seen that the improved smooth L1 loss can significantly improve the accuracy of bounding box regression.
F. COMPARISON WITH STATE-OF-THE-ART METHODS
As can be seen in Table 6, our proposed ROS-Det method achieves an AP of 88.2% on the HRSC2016 dataset, significantly outperforming other state-of-the-art methods, e.g., exceeding R-FCN and RSDet by over 1.7% in AP. These observations verify the superior performance of ROS-Det.
1) Comparison with horizontal object detection methods. The performance gaps are significant when comparing ROS-Det with horizontal object detection methods, including Faster R-CNN [10], Cascade R-CNN [14], SSD [16], RetinaNet [17] and R-FCN [13]. Specifically, the AP of ROS-Det is higher than that of Faster R-CNN by 7.5%, Cascade R-CNN by 7.3%, RetinaNet by 5.6%, SSD by 4.3%, and R-FCN by 1.7%. The main reason is that ROS-Det achieves arbitrary-oriented ship detection and can thus better handle arbitrary orientations, various sizes and dense distributions of ships.
To demonstrate the advantages of our method, we conduct a qualitative visual comparison of different methods in two scenarios, i.e., densely distributed ships and ships of different sizes. As shown in the first and second columns of Fig. 19, which contain densely distributed ships, both R2CNN and our method detect all ships; however, compared with R2CNN, our method produces no false detections and its localization is more accurate, indicating better robustness to dense distributions. As shown in the third and fourth columns of Fig. 19, which contain ships of different sizes, our method detects all ships more accurately. In other words, our method is more robust to size variation.

G. VISUALIZED RESULTS
To display the detection performance of the proposed ROS-Det, Fig. 20 shows visualized detection results on the HRSC2016 dataset. Our method detects ships well in various situations, e.g., arbitrary orientations, various sizes and dense distributions.

IV. CONCLUSION
We have proposed ROS-Det, an arbitrary-oriented ship detection method, which consists of an FPN based on an improved ResNet50, rotated anchors, a classification network, and a regression network. In consideration of the various sizes of ships, ROS-Det fuses multiscale features through the FPN based on an improved ResNet50. Through several tweaks, the improved ResNet50 is able to receive more information and reduce the computational cost. Rotated anchors, the skew IoU and skew NMS are employed to achieve arbitrary-oriented ship detection. Moreover, to mitigate the loss discontinuity problem of arbitrary-oriented object detection, an IoU constant factor is introduced into the traditional smooth L1 loss function.
Although ROS-Det achieves state-of-the-art detection performance, some missed detections and false positives remain. In addition, some small and incomplete ships are left unannotated in the HRSC2016 dataset. In future work, we will explore more efficient strategies for extremely small objects and will further expand and improve the HRSC2016 dataset.