Light-SAR-ShipNet (LSSNet): A Lightweight SAR Image Ship Detection System with Enhanced Deep Supervision

Applying deep learning methods to synthetic aperture radar (SAR) image ship detection has become a hot research topic. However, most existing detection networks are designed for optical images, so directly applying them to SAR image target detection raises the following two problems: 1) SAR images differ greatly from optical images in imaging mechanism, geometric characteristics, and radiation characteristics, which leaves the detection network with considerable redundant information; 2) real-time performance and accuracy are both essential in the SAR image target detection task, and a network designed for optical images cannot balance the two well. In response to the above problems, this paper proposes LSSNet, a lightweight network designed specifically for the SAR image ship detection task. In the earlier layers, we use depthwise separable convolution instead of conventional convolution to design the dense block; in the deeper layers, stacked dense blocks with shortcuts are used to enhance deep supervision, finally achieving high-speed and high-accuracy detection. This paper uses the SSDD dataset as the baseline for experiments, and the results show that LSSNet achieves higher accuracy and detection speed. Its lightweight structure will also ease porting to hardware devices in the future.


INTRODUCTION
SAR image target detection methods fall into two categories: traditional methods and deep learning methods. The traditional ship target detection algorithm is mainly composed of three parts: detection window design, feature selection, and classifier design. Among them, the constant false alarm rate (CFAR) detection algorithm is one of the most widely used target detection algorithms. It detects the target by modeling the statistical distribution of the background clutter. However, with the increasing maturity of SAR image acquisition technology, traditional SAR target detection methods, with their complex calculations and high manual participation, are no longer sufficient to meet people's needs for SAR image processing.
In recent years, SAR image ship detection has mainly relied on deep learning methods, but many deep learning network models have complex structures and many parameters, resulting in poor real-time performance that is not conducive to porting to hardware devices. There has therefore been rising interest in running high-quality CNN models under strict constraints on memory and computational budget, and some experts and scholars have carried out research on lightweight network design.

Depthwise Convolution
In order to design a lightweight detection network, we draw inspiration from depthwise convolution, which can greatly decrease the number of network parameters. Research on lightweight detection networks has made progress, such as the MobileNet series [1,20] and the ShuffleNet series [3,19], which all use depthwise convolution to achieve a lightweight design. Similarly, Xception [4] also draws on the idea of depthwise convolution to decrease network parameters. In the SAR image ship detection task, depthwise convolution can be introduced to redesign the feature extraction network and lighten it. Figure 1 shows the structure of depthwise convolution. Unlike conventional convolution, each convolution kernel is convolved with only one channel, which greatly reduces the number of network parameters. However, there are two problems with using depthwise convolution alone: the number of feature channels cannot be expanded, and the correlation between channels is ignored. Both lead to a loss of detection accuracy. Research on optical images [21,22] has proved that introducing a 1×1 convolution kernel after depthwise convolution can achieve almost the same accuracy as conventional convolution while greatly decreasing the number of parameters.
Figure 1 (a) Conventional convolution; (b) Depthwise convolution.
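The parameter saving described above can be made concrete with a small counting sketch; the layer sizes (3×3 kernel, 64 input channels, 128 output channels) are illustrative, not the paper's exact configuration.

```python
# Parameter counts for a single k x k convolution layer (biases ignored),
# illustrating the savings from depthwise separable convolution.

def conv_params(k, c_in, c_out):
    """Conventional convolution: every kernel spans all input channels."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """k x k depthwise (one kernel per channel) + 1x1 pointwise projection."""
    return k * k * c_in + c_in * c_out

conv = conv_params(3, 64, 128)                    # 73,728 parameters
dw_sep = depthwise_separable_params(3, 64, 128)   # 8,768 parameters
print(f"reduction: {conv / dw_sep:.1f}x")         # roughly 8x fewer parameters
```

For 3×3 kernels the separable form costs roughly 1/Cout + 1/9 of the conventional parameter count, which is why the savings grow with the number of output channels.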

Dense Connection
In order to achieve high-precision detection, we draw inspiration from dense connections; the experiments in this paper also verify the effectiveness of this method. In deep learning networks, as depth increases, gradient disappearance becomes more and more obvious [23]. To alleviate this problem, scholars connect shallow features with later features through shortcuts. The densely connected DenseBlock was first introduced in DenseNet [6]. This structure performs very well, and PeleeNet [5], DSOD [8], and others [24,25,26] have been successively proposed based on it. Figure 2 shows the densely connected structure. It concatenates the feature maps obtained in each layer to realize the efficient use of features and alleviate the gradient disappearance problem. Since the feature maps can be utilized to the greatest extent, each layer can be narrow enough to reduce the number of network parameters. In the SAR image ship detection task, efficient use of feature maps can improve detection performance and decrease redundant parameters to a certain extent, which benefits the design of lightweight networks.
Figure 3 Structure of LSSNet proposed in this paper.
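The channel bookkeeping behind "narrow layers" can be sketched as follows; the input width and growth rate here are hypothetical values for illustration, not LSSNet's actual settings.

```python
# Channel bookkeeping in a densely connected block: each layer receives the
# concatenation of all earlier feature maps but contributes only
# `growth_rate` new channels, so individual layers can stay narrow.

def dense_block_channels(c_in, growth_rate, num_layers):
    """Input channel count seen by each layer, plus the block's output width."""
    inputs = [c_in + i * growth_rate for i in range(num_layers)]
    c_out = c_in + num_layers * growth_rate
    return inputs, c_out

inputs, c_out = dense_block_channels(c_in=64, growth_rate=16, num_layers=4)
print(inputs)  # [64, 80, 96, 112]
print(c_out)   # 128
```

Because earlier maps are reused by concatenation rather than recomputed, a small growth rate still leaves every layer with access to all features extracted so far.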

OVERVIEW OF PROPOSED METHOD
Based on the existing research results on lightweight networks and drawing on their lightweight module design methods, we design LSSNet for the SAR image ship target detection task, achieving high-precision and high-speed detection. Figure 3 shows the structure of LSSNet. It consists of three submodules: the DenseBlock, the transition layer, and the stacked dense block with shortcut (ShortcutDB). In order to detect ship targets of different sizes, the network sets two detection scales: one is the feature map obtained at the end of LSSNet, and the other is the feature map obtained at the penultimate downsampling.

Transition Layer
The transition layer is designed to downsample the feature maps and increase the number of feature map channels. Shallow features have a small receptive field but contain rich image information, which is conducive to the detection of small targets. As the network deepens, the receptive field gradually becomes larger and the feature map scale becomes smaller and smaller. To prevent further loss of ship features, the number of convolutional layers and deep feature channels in the deep part of the network should increase, and the transition layer achieves this. Figure 4 shows the structure of the transition layer. The convolution layer uses a 3×3 convolution kernel with stride 2, followed by a batch normalization (BN) [27] operation, and finally the LeakyReLU activation function.
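A minimal PyTorch sketch of such a transition layer, assuming the paper's description (3×3 convolution with stride 2, BN, LeakyReLU); the channel counts, input size, and negative slope are illustrative assumptions, not LSSNet's published configuration.

```python
import torch
import torch.nn as nn

# Transition layer sketch: 3x3 stride-2 convolution (downsampling),
# batch normalization, then LeakyReLU activation.
class TransitionLayer(nn.Module):
    def __init__(self, c_in, c_out, negative_slope=0.1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2,
                              padding=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.LeakyReLU(negative_slope)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

layer = TransitionLayer(64, 128)
x = torch.randn(1, 64, 56, 56)
y = layer(x)
print(y.shape)  # spatial size halved, channels doubled: (1, 128, 28, 28)
```

Padding of 1 with stride 2 halves even spatial dimensions exactly, which keeps the two prediction scales aligned by simple factors of two.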

DenseBlock
The traditional DenseBlock module structure [6] is shown in Figure 5(a). For each layer, all the feature maps of previous layers are used as the input of the current layer, and its own feature maps are used as the input of subsequent layers, forming a full interconnection. The DenseBlock module proposed in this paper is shown in Figure 5(b). It replaces the conventional 3×3 convolutional layer in the traditional DenseBlock with a 3×3 depthwise convolution and a 1×1 pointwise convolution. Related papers [21,22] have shown that this substitution achieves almost the same accuracy as conventional convolution with far fewer parameters. This interconnected structure of upper- and lower-layer feature maps can make full use of the target information in every layer's feature maps. On the basis of retaining the feature maps of previous layers, each layer adds new feature maps. Even if the number of new feature maps per layer is small, the feature information can be fully utilized through multi-layer interconnection. Therefore, the DenseBlock designed in this paper decreases the number of new feature maps per layer, which not only ensures the full use of feature information but also helps achieve a lightweight network design.
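One dense layer of the modified block can be sketched in PyTorch as below; it follows the text's description (3×3 depthwise plus 1×1 pointwise, output concatenated onto the input), while the channel widths, growth rate, BN placement, and activation slope are illustrative assumptions.

```python
import torch
import torch.nn as nn

# One layer of the modified DenseBlock: 3x3 depthwise convolution
# (groups = channels) followed by a 1x1 pointwise convolution producing
# `growth_rate` new channels, which are concatenated onto the input.
class DenseLayerDW(nn.Module):
    def __init__(self, c_in, growth_rate):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1,
                                   groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, growth_rate, 1, bias=False)
        self.bn = nn.BatchNorm2d(growth_rate)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        new = self.act(self.bn(self.pointwise(self.depthwise(x))))
        return torch.cat([x, new], dim=1)  # dense connection: keep old maps

block = nn.Sequential(DenseLayerDW(32, 16), DenseLayerDW(48, 16))
y = block(torch.randn(1, 32, 28, 28))
print(y.shape)  # (1, 64, 28, 28): 32 input + 2 layers x 16 new channels
```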
The SE module [28] was proposed to address the accuracy loss caused by the differing importance of feature map channels during convolution. Conventional convolution treats each feature channel as equally important by default, but in actual problems their importance differs [4,18]. Introducing the SE module can effectively solve this problem. The structure of the SE module is shown in Figure 6. It mainly consists of compression and excitation. The compression operation is a global average pooling, and the excitation operation is composed of two fully connected layers, yielding the importance of each channel as a 1×1×C weight coefficient. Finally, the input feature map is multiplied by the weight coefficient to obtain a feature map that accounts for the importance of different channels.
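A compact SE-module sketch following the squeeze/excitation description above; the reduction ratio and channel count are illustrative, and the activation choices follow the original SE design [28] rather than anything stated in this paper.

```python
import torch
import torch.nn as nn

# Squeeze-and-Excitation sketch: global average pooling ("squeeze"),
# then two fully connected layers ("excitation") producing one weight
# per channel, used to rescale the input channel-wise.
class SEModule(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: global average pooling
        w = self.fc(w).view(b, c, 1, 1)  # excitation: 1x1xC weight coefficient
        return x * w                     # rescale input channel-wise

se = SEModule(64)
y = se(torch.randn(2, 64, 14, 14))
print(y.shape)  # same shape as input: (2, 64, 14, 14)
```

The module adds only 2·C²/r parameters per insertion point, so it fits a lightweight design.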

Stacked DenseBlocks With Shortcut (ShortcutDB)
In the shallow layers, a small number of channels can still contain rich ship target information, and the DenseBlock structure designed above can effectively extract ship features from the shallow structure. However, as the network deepens, the receptive field of the convolutional layers becomes larger and larger, making ship feature information prone to loss. Therefore, it is necessary to increase the number of convolutional layers and convolutional channels in the deep structure. Using the above DenseBlock structure, which concatenates upper- and lower-layer features, would make the feature maps wider and wider, which is not conducive to a lightweight network. A 1×1 convolution kernel can compress the width of the feature maps, but it loses part of the feature information. To keep the network structure lightweight without damaging the feature information, a stacked dense block structure with a shortcut is proposed below.
We name this structure of stacked DenseBlocks with a shortcut ShortcutDB. The ShortcutDB structure is shown in Figure 7(a). It stacks two of the DenseBlocks designed above, then compresses the number of feature channels through a 1×1 convolution kernel so that the number of feature maps added each time equals that of a single DenseBlock, and finally connects the input directly to the output through a 1×1 convolution kernel. The shortcut branch adopts a 1×1 convolution kernel because the numbers of input and output channels must match; otherwise the input and output feature maps cannot be added directly. It is precisely this shortcut that gives the deep features enhanced supervision. We also show stacked DenseBlocks without a shortcut in Figure 7(b), to demonstrate the superiority of the ShortcutDB module in later experiments.
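The wiring can be sketched as below. To stay self-contained, the two stacked DenseBlocks are abstracted as plain convolutions that widen the features; the channel widths and growth value are hypothetical, but the compression and 1×1-shortcut structure follows the text.

```python
import torch
import torch.nn as nn

# ShortcutDB sketch: two stacked blocks that widen the features, a 1x1
# convolution compressing the result so only `growth` channels are added
# overall, and a 1x1-convolution shortcut from the input added element-wise.
class ShortcutDB(nn.Module):
    def __init__(self, c_in, growth):
        super().__init__()
        c_out = c_in + growth
        self.block1 = nn.Conv2d(c_in, c_in + growth, 3, padding=1, bias=False)
        self.block2 = nn.Conv2d(c_in + growth, c_in + 2 * growth, 3,
                                padding=1, bias=False)
        self.compress = nn.Conv2d(c_in + 2 * growth, c_out, 1, bias=False)
        # 1x1 conv so the shortcut's channel count matches the output
        self.shortcut = nn.Conv2d(c_in, c_out, 1, bias=False)

    def forward(self, x):
        y = self.compress(self.block2(self.block1(x)))
        return y + self.shortcut(x)  # shortcut enhances deep supervision

m = ShortcutDB(64, growth=16)
y = m(torch.randn(1, 64, 14, 14))
print(y.shape)  # (1, 80, 14, 14): input width plus one growth step
```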

Activation Function
In order to train the network more fully, a batch normalization (BN) operation is performed after each convolutional layer, and the LeakyReLU function is used for activation, which helps alleviate gradient disappearance during training. The activation function LeakyReLU is defined as:

LeakyReLU(x) = x, if x ≥ 0; LeakyReLU(x) = αx, if x < 0,

where α is a small positive slope coefficient.

Anchor Box
Feature maps at two scales are used for prediction, with three anchor boxes set for each prediction feature map. The smallest feature map has the largest receptive field and is suitable for detecting larger targets, so the largest anchor boxes are used: (81,82), (135,169), (344,319). The larger feature map has a smaller receptive field, suitable for detecting small and medium-sized targets, so the corresponding anchor boxes are used: (10,14), (23,27), (37,58).
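How a ground-truth box maps onto one of these six anchors can be sketched with a width-height IoU match (the usual YOLO-style assignment; the same-center IoU criterion is an assumption, since the paper does not spell out its matching rule). The anchor values are taken directly from the text.

```python
# Matching a ground-truth box (width, height) to the best of the six
# anchors above by width-height IoU, assuming boxes share a center.
ANCHORS = {
    "small_map": [(81, 82), (135, 169), (344, 319)],  # large receptive field
    "large_map": [(10, 14), (23, 27), (37, 58)],      # small/medium targets
}

def wh_iou(wh1, wh2):
    """IoU of two boxes assumed to share the same center."""
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    union = wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter
    return inter / union

def best_anchor(gt_wh):
    candidates = [(scale, a) for scale, anchors in ANCHORS.items()
                  for a in anchors]
    return max(candidates, key=lambda sa: wh_iou(gt_wh, sa[1]))

print(best_anchor((20, 25)))    # small ship  -> ('large_map', (23, 27))
print(best_anchor((150, 160)))  # large ship  -> ('small_map', (135, 169))
```

This illustrates the division of labor stated above: small targets are handled by the larger feature map, large targets by the smallest one.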

Loss Function
The loss function of the proposed detection network draws on YOLOv3 [29]. Since there is only one target class in SAR image ship detection, the category loss can be ignored.
To achieve SAR ship detection, it is necessary to obtain the center coordinates, height, and width of the ship target bounding box together with its confidence score. Therefore, the overall loss function consists of two parts: the bounding box loss and the confidence loss.
The bounding box loss function is as follows:

L_box = Σ_{i=0}^{S×S} Σ_{j=0}^{B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²] + λ Σ_{i=0}^{S×S} Σ_{j=0}^{B} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]

where x_i, y_i represent the i-th ground-truth center coordinates and x̂_i, ŷ_i the corresponding predictions, 1_{ij}^{obj} indicates whether the j-th bounding box of the i-th grid cell is responsible for a target, λ is the width-height loss weight coefficient, B is the number of bounding boxes, and S is the number of grids (the specific grid division method is detailed in YOLOv3 [29]).
The confidence loss function adopts the binary cross-entropy form used in YOLOv3 [29]. The traditional method uses IoU to construct the localization loss. IoU is the intersection ratio of the predicted box A and the ground-truth box B, defined as:

IoU = |A ∩ B| / |A ∪ B|

This paper instead uses GIoU [30] to construct the loss function, as shown in Figure 8(b), defined as:

GIoU = IoU − |C \ (A ∪ B)| / |C|

where C is the minimum bounding rectangle of the prediction bounding box and the ground truth. The loss function constructed with GIoU focuses not only on the overlapping area but also on the non-overlapping area, solving the problem that the distance between two non-overlapping boxes cannot be evaluated.
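The GIoU definition above can be computed directly from box corners; a minimal sketch (boxes as (x1, y1, x2, y2) tuples, a convention assumed here for illustration):

```python
# GIoU as defined above: IoU minus the fraction of the minimum enclosing
# rectangle C not covered by the union of the two boxes.

def iou_and_giou(a, b):
    """Both boxes are (x1, y1, x2, y2); returns (IoU, GIoU)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Minimum enclosing rectangle C of the two boxes
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (area_c - union) / area_c
    return iou, giou

# For non-overlapping boxes IoU is 0 regardless of distance, but GIoU
# goes negative and still reflects how far apart the boxes are.
print(iou_and_giou((0, 0, 2, 2), (4, 0, 6, 2)))
```

This is exactly the property used in the text: the GIoU term supplies a gradient even when the predicted box does not overlap the ground truth.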

EXPERIMENT
The experiments in this paper are conducted on the Ubuntu 16.04 system, using PyTorch [31] as the deep learning framework, with programs written on the PyCharm software platform. The experimental hardware configuration is an Intel(R) i5-10400F CPU, a GeForce GTX 1660 SUPER GPU, and 16 GB of memory; CUDA 11.1 is used to call the GPU for training acceleration.

Dataset
The dataset used in this paper is SSDD, published by Professor Li Jianwei et al. of Naval Aviation University [9,32,33]. SSDD comes from spaceborne radar systems, with target areas cropped into images of about 500×500 pixels. It has four polarization modes (HH, HV, VV, and VH) with 1-15 m resolution, and ship targets are distributed in both open-sea and coastal areas. The dataset contains a total of 1160 images and 2456 ship targets, with an average of 2.12 ship targets per image. Because the images cover multiple polarization modes, different resolutions, and far- and near-shore scenes, the dataset can effectively verify the performance of detection algorithms and has been used by many scholars [9,18,32,33,34,35]. Unfortunately, the SSDD dataset does not give a clear data division method [18], and the relevant literature [9,18,32,33,34,35] divides the dataset according to individual needs. This paper refers to the division method of the related literature and uses the SSDD dataset with the same division for performance comparison.
Figure 9 shows part of the prediction results of LSSNet on the SSDD test set. The green boxes in the figure are the predictions of LSSNet, and the white boxes are the ground truth. It can be seen that the proposed method detects ship targets of different scales well in both far- and near-shore scenes. The detection precision reaches 96.9%, the recall reaches 97.8%, and the mAP reaches 98.6%, a significant improvement over classical lightweight networks. The mAP of LSSNet is 1.7% higher than that of ShuffleNetv2, the lightweight network with the highest average precision, and its model size is 11.7M, only 1.9M larger than the lightest model, GhostNet. Thanks to its lightweight structure, it takes only 10.1 ms to detect one SAR image, meeting basic real-time requirements.
Figure 9 Results of LSSNet on SSDD test set.

Experiment Results
In order to compare and analyze the performance of LSSNet, Table Ⅱ below shows the evaluation indexes obtained by testing different detection models on the SSDD dataset. From the results in the table, it can be seen that the lightweight networks proposed for optical images, such as ShuffleNetv2, MobileNetv2, and GhostNet, all achieve an mAP above 95%, but their precision is below 95% and their false alarm rates are high; their precision is even lower than that of the classic YOLOv3. The LSSNet proposed in this paper achieves 96.9% precision and 97.8% recall, and has the highest mAP (98.6%) among these comparison algorithms, which proves the effectiveness of the detection model for SAR ship detection tasks.

Algorithm Comparison Analysis
In order to show the real-time performance of the different detection algorithms more intuitively, Table Ⅱ provides multiple evaluation indexes for each model, including network parameters, FLOPs, and model size. The numbers of network parameters and FLOPs directly determine inference speed and are two important indexes for judging real-time performance. It can be seen from Table Ⅱ that GhostNet has the fewest network parameters and FLOPs, and its model size is only 9.8M, but its false alarm rate is 5.0% higher than that of LSSNet, which is also a lightweight network yet has a higher mAP. Table Ⅰ shows that the inference speed of LSSNet is only 3.2 ms per SAR image slower than GhostNet, while LSSNet performs significantly better, with a false alarm rate of 3.1% and a missed alarm rate of 2.2%. The model sizes of ShuffleNetv2 and LSSNet are almost the same. As shown in Table Ⅰ, although ShuffleNetv2 has the best precision among the classical lightweight networks, its precision is still worse than that of LSSNet. LSSNet is 1.9M larger than the smallest model, GhostNet, but its precision is significantly better. Compared with ShuffleNetv2, the classical lightweight network with the best precision, the model size of LSSNet is 0.1M smaller and its mAP is 1.7% higher. Figure 10 shows the speed and precision of the different lightweight networks; the superior performance of LSSNet is also evident there. It balances the speed and precision of SAR ship detection well, which also reflects the necessity of redesigning a backbone network for a specific task. All of the above detection algorithms can efficiently detect far-shore targets; however, for some near-shore ship targets, the detection results of different algorithms differ markedly.
The average precision of a detection network is often determined by its ability to detect difficult ship samples. Therefore, even an increase of about 1-2% in average precision can be considered a significant improvement in the ability to detect difficult samples. Figure 11 below shows some representative difficult samples, comparing the detection capabilities of different algorithms on these SAR images. The green boxes represent predictions and the white boxes represent the ground truth. It can be seen that the LSSNet proposed in this paper detects difficult targets better than the classic lightweight networks; it can even separate two ship targets that are very close together, which the classical lightweight networks detect as a single ship. In addition, the results of LSSNet in Figure 11 show fewer false alarms and missed detections. Compared with classic lightweight networks, LSSNet better balances speed and precision, adapting to the specific task of SAR ship detection. Furthermore, LSSNet has fewer network parameters, lower FLOPs, and a smaller model size than the method in [18], which was also designed for SAR ship detection, while its detection speed and precision are both higher. The reasons LSSNet achieves such performance may be:

5.4.1. The method densely concatenates deep and shallow features to use the ship feature maps efficiently. In the deep network, shortcuts are introduced into the stacked dense blocks, which enhance deep supervision and strengthen the information flow between upper and lower layers;
5.4.2. In the basic structure of the dense block, conventional convolution is replaced with depthwise separable convolution, and the SE module is added after it. The new structure improves detection speed and decreases model size while ensuring that precision is not compromised.

Ablation Experiment
LSSNet is a lightweight network designed specifically for SAR ship detection tasks, and the experimental results above have demonstrated its strong performance. However, in pursuit of an even better detection model, this paper also explores its structural variants, mainly in the following two directions:
5.5.1. LSSNet uses conventional convolution for downsampling. We can consider using depthwise separable convolution instead to decrease the model size, and introduce the SE module to reduce the loss of ship feature information during downsampling;
5.5.2. LSSNet uses directly stacked dense blocks in the shallow layers and stacked dense blocks with shortcuts (ShortcutDB) in the deeper layers. We can consider removing the shortcuts and exploring their impact on detection performance.
Tables Ⅲ and Ⅳ show the experimental results of the above variants. From the results, it can be seen that when depthwise separable convolution alone replaces conventional convolution for downsampling, the network parameters, FLOPs, and model size all decrease to a certain extent, but the average precision also decreases by 2.5%. Introducing the SE module behind the depthwise separable convolution does not effectively recover the average precision; it still decreases by 1.9%. Thus, using depthwise separable convolution for downsampling slightly improves detection speed but greatly reduces average precision, which confirms the rationality of LSSNet using conventional convolution for downsampling. When the shortcuts in LSSNet are removed, the average precision decreases from 98.6% to 94.2%, while the inference time per SAR image is only 1.4 ms faster, which proves the necessity of the shortcuts.
The above exploratory experiments clearly demonstrate the superiority of the designed LSSNet. LSSNet may still have room for improvement, but given our limited resources it is impossible to explore all network variants. Based on the above experiments, the structure of LSSNet offers the best balance of speed and precision for SAR ship detection, and the earlier experimental results also support this conclusion.

Supplementary Explanation
The baseline used in this paper is the SSDD dataset, which many scholars have used to verify their detection algorithms. However, many papers have pointed out that precision results will differ because the dataset has no clear division method. The SSDD dataset has 1160 SAR ship images, of which simple samples in far-shore areas account for the majority, while difficult samples with near-shore targets account for only a minority. If a randomly divided test set contains too many simple samples, the average precision of the detection network will naturally increase. Therefore, to compare and analyze the performance of different detection algorithms, the dataset must use the same division method. The proposed LSSNet and all of its comparison algorithms use the same training, validation, and test sets. The previous experimental results have shown that LSSNet has good generalization ability and strong robustness. Compared with classical lightweight detection algorithms, it has higher precision and also meets basic real-time requirements.
LSSNet is a lightweight network designed specifically for SAR ship detection tasks and achieves high-speed and high-precision detection of SAR ship images. Its model size has been decreased to 11.7M, which will help future porting to FPGA and DSP hardware. We believe the lightweight network model is already relatively small, so subsequent research should focus on improving the average precision of the lightweight network: specifically, how to improve average precision, enhance generalization, and strengthen the ability to detect difficult samples without significantly increasing the model size. We think lightweight networks with better performance can be studied from two aspects: one is redesigning the network structure to find a more efficient backbone; the other is designing anchors and a loss function better suited to SAR image targets.

CONCLUSION
This paper proposes LSSNet, a lightweight detection network for SAR ship detection tasks that achieves high-speed and high-precision detection of ship targets. The method uses a combination of dense blocks and stacked dense blocks with shortcuts to improve average precision, and uses depthwise separable convolution to achieve a lightweight design. Verified on the SSDD dataset, LSSNet shows good generalization ability and strong robustness. Moreover, the model size of LSSNet is only 11.7M, which will facilitate porting to resource-constrained hardware devices in the future.