Aircraft Detection in Remote Sensing Image based on Multi-scale Convolution Neural Network with Attention Mechanism

Detecting aircraft from remote sensing images (RSIs) is an important but challenging task due to the variations of aircraft type, size, pose, and angle, the complex background, and the small size of aircraft in RSIs. An aircraft detection method is proposed based on a multi-scale convolution neural network with attention (MSCNNA), consisting of encoder, decoder, attention, and classification modules. In MSCNNA, multiple convolutional and pooling kernels with different sizes are utilized to learn multi-scale discriminant features, and the global attention mechanism (GAM) is employed to capture spatial and channel dependencies and adaptively preserve the relationships across the entire image. Compared with a standard deep CNN, the multi-scale convolution neural network (CNN) and GAM are integrated to learn multi-scale features for aircraft detection, especially of small aircraft. Experimental results on the aircraft subset of the public EORSSD dataset show that the proposed method outperforms state-of-the-art methods on the same dataset and produces clearer edges for multi-size aircraft.


Introduction
Automatic aircraft detection from remote sensing images (RSIs) is an important research topic due to its great value in civilian and military applications, such as dynamic monitoring and military surveillance. However, it is a very difficult and challenging task because aircraft objects occupy a relatively small and often blurry proportion of an RSI, and exhibit various sizes, shapes, poses, shadows, and illuminations against complex backgrounds, as shown in Fig.1.

Fig.1 Samples of aircraft in RSIs (small, complex-background interference, blurry, multi-target, complex)

With the development of computer vision and image processing technologies, many aircraft detection methods have been proposed in recent years [1,2]. Zhao et al. [3] proposed a multiscale sliding-window framework for aircraft detection based on aggregate channel features and well-designed features. In this framework, the features are trained by Cascade AdaBoost with multiple rounds of bootstrapping, and a two-step non-maximum suppression algorithm is designed based on a given detection set. Yan et al. [4] designed an aircraft detection method using centre-based proposal regions and invariant features, consisting of three steps: first extracting proposal regions from each RSI, then training an ensemble learning classifier with the extracted rotation-invariant features for aircraft classification, and finally detecting aircraft in RSIs with the trained classifier. The detection performance of the above methods mainly relies on the extracted features, but it is difficult to extract proper discriminant features for the various aircraft objects in RSIs. Recently, with the development of big data technology and computing power, many convolution neural network (CNN) based aircraft detection methods have been presented and have achieved significant advances due to the prominent feature-extraction capability of CNNs. Bin et al. [5] proposed a cascade CNN (CCNN) framework based on transfer learning and geometric feature constraints to implement aircraft detection in large-area RSIs, and achieved high detection accuracy with relatively few samples. Guo et al. [6] proposed an end-to-end aircraft detection method, which completely eliminates proposal generation and encapsulates all calculations in one network, making it simple and efficient. Hu et al.
[7] proposed an approach for aircraft detection in RSIs based on saliency and CNN, where the regions of interest (RoIs) are obtained using a saliency algorithm, and the feature vector of the position information is obtained by a deep CNN (DCNN) from the whole image. Yang et al. [8] constructed an effective airplane detection framework called Markov random field-fully convolutional network (M-FCN) using a cascade strategy that consists of a fully convolutional network (FCN) based coarse candidate extraction stage, a multi-Markov random field (multi-MRF) based region proposal (RP) generation stage, and a final classification stage. Zhang et al. [9] constructed an effective CNN-based aircraft detection framework to detect multi-scale targets in extremely large and complicated scenes, designed a constrained Edge-Boxes approach to generate a modest number of target candidates quickly and precisely, and presented a modified GoogLeNet based on Fast Region-based CNN (R-CNN) to detect various kinds of multi-scale aircraft. Zhong et al. [10] collected RSIs of airports from Google Earth and took full advantage of data augmentation, transfer learning, and DCNNs with limited training samples to implement end-to-end trainable airplane detection. Li et al. [11] collected a multi-resolution aircraft remote sensing dataset from Google Earth and proposed an R-CNN based aircraft detection model, which they trained by end-to-end fine-tuning on the collected dataset to realize automatic aircraft detection and positioning. Wang et al. [12] proposed an aircraft detection algorithm that can detect weak and small aircraft objects of varying sizes. Fu et al. [13] proposed a feature-fusion algorithm to generate multi-scale features, which form a powerful feature representation for multi-scale aircraft detection.
They constructed a rotation-aware object detector to localize objects in RSIs, utilized oriented proposal boxes to enclose objects instead of horizontal proposals that can only coarsely locate oriented objects, and adopted an orientation RoI pooling operation to extract the feature maps of oriented proposals for the following R-CNN sub-network. Liu et al. [14] proposed an aircraft detection scheme based on corner clustering and CNN, and compared it with three methods, namely selective search (SS)+CNN, Edge-Boxes+CNN, and histogram of oriented gradients (HOG)+support vector machine (SVM). The proposed scheme contains two main steps: region proposal and classification, where candidate regions are generated by a mean-shift clustering algorithm, and feature extraction and classification of candidate regions possibly containing aircraft are realized by CNN.
Although CNNs and their improved models have achieved promising results on various image datasets, they require a large amount of labeled image data to learn, while labeling RSIs incurs high labor costs. Moreover, many standard CNN models focus only on local features while ignoring global region features, so some important global features are inevitably lost. In fact, both local and global features are important for complex image classification and recognition, but it is difficult for CNNs and their improved models to effectively extract global and local features from RSIs containing aircraft at different scales. As for small aircraft in RSIs, the pooling layers of a CNN may further reduce the amount of classification information: a 24×24 aircraft occupies only about one pixel after four pooling layers, making its dimension too low to be distinguished. That is to say, it is difficult to detect small aircraft. To overcome these shortcomings and enhance robustness and classification performance, many multi-scale CNN (MS-CNN) models have been proposed and successfully applied to complex image segmentation and multi-scale object detection tasks. Some MS-CNN models adopt multi-scale input blocks of the original image instead of directly inputting the original images [15][16][17], and some MS-CNN models use multi-scale input images in different channels convolved with multi-scale filters [18,19]. Deng et al. [20] proposed a unified and effective method for multi-class object detection in RSIs with large scale variability, where detection is performed by two sub-networks, including a multi-scale object proposal network for object-like region generation from several intermediate layers. Attention mechanisms enable a model to concentrate on the most salient features [22,23], but it is difficult to directly apply them to aircraft detection in RSIs because aircraft objects usually vary in size and are relatively small [24].
To overcome this problem, Dayananda et al. [25] designed a global attention module between the encoder and decoder parts with a multi-scale guided input.
The traditional deep learning based image segmentation and detection models such as FCN [8], SegNet, U-Net, and PSP-Net apply bilinear interpolation to recover image resolution during deconvolution [26]. The multi-scale multi-column CNN (MSMC-CNN) adopts bicubic convolution as the deconvolution in its decoder structure to restore the original image size [27], and its results validate that MSMC-CNN is superior to the other segmentation methods. Aircraft detection in RSIs is still a challenging topic: many aircraft in RSIs are small targets (dozens or even a few pixels) containing little target information; RSIs are taken from overhead, so the orientation of the target is uncertain; and RSIs contain a lot of interference, such as noise, complex backgrounds, and illumination changes. The existing CNN and MS-CNN based aircraft detection algorithms also have shortcomings, such as low recognition accuracy and long running time. Inspired by CNN, MS-CNN, MSMC-CNN, and the attention mechanism [28], a modified MS-CNN with attention (MSCNNA) is constructed to automatically detect aircraft objects of different scales in RSIs. It is an encoder-decoder network that feeds multi-scale input images into an MS-CNN instead of directly inputting the original image. In the traditional CNN model, the pooling operation compresses the feature maps and reduces computational complexity, but it results in the loss of features. Compared with CNN and MS-CNN models, MSCNNA is able to extract multi-scale features to improve the accuracy and robustness of aircraft detection. The main contributions of this paper are as follows:
- A modified MS-CNN model, namely MSCNNA, is constructed to deal with the scale variation problem of RSIs in aircraft detection.
- A global attention mechanism (GAM) is introduced into MSCNNA to capture global and local features of multi-scale aircraft and to improve the aircraft detection accuracy.
- MSCNNA is validated on a public aircraft RSI dataset.

The rest of the paper is organized as follows. Section 2 reviews related methods, including MS-CNN, the attention mechanism, and the global attention mechanism (GAM). The details of MSCNNA for aircraft detection are described in Section 3. Experiments and results are given in Section 4. Finally, some concluding remarks are drawn in Section 5.

Multi-scale CNN (MS-CNN)
To solve the problem of object scale inconsistency in object detection, many MS-CNN models have been proposed. Two common MS-CNN architectures are shown in Fig.2: Fig.2A shows the traditional CNN model and Fig.2B its convolutional operation for comparison; Fig.2C is a multi-channel input scheme using multi-scale convolutional kernels to extract multi-scale features from the input images; Fig.2D is a multi-scale input scheme that changes the original input into multi-scale images fed as different channel inputs. Both kinds of MS-CNN models can be implemented by utilizing multiple convolutions in the convolution layer to form multi-scale features for detecting objects of different sizes. Meanwhile, to ensure that the spatial feature information between the input and output of the multi-scale convolutional kernels remains unchanged, the edge of the feature graph is padded according to the size of the convolution kernel. As shown in Fig.2D, each original input image is reshaped into three images of three sizes. In the multi-scale input scheme, the three images are input into three channels with different sizes of convolution and pooling kernels, which are processed separately. Finally, the extracted feature graphs of the different channels are combined or integrated in the fully-connected layers. The features extracted by convolution kernels of large size are more global, while those of small size better reflect local characteristics, so MS-CNN can extract both global and local features.
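The multi-scale input scheme of Fig.2D can be sketched in plain numpy as below. The scale sizes, kernel sizes, and averaging kernels are illustrative assumptions standing in for learned weights; each branch runs a "same" convolution plus ReLU on one scale, and the branch features are fused as a flat concatenation in place of the fully-connected stage.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize used to build the multi-scale input pyramid."""
    h, w = img.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

def conv2d_same(img, kernel):
    """'Same' convolution: pad by the kernel radius so spatial size is kept."""
    k = kernel.shape[0]
    p = k // 2
    padded = np.pad(img, p)
    h, w = img.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

# Hypothetical multi-scale branches: one input image, three scales,
# each convolved with a kernel size matched to its scale.
img = np.random.rand(32, 32)
scales, kernels = [32, 24, 16], [5, 3, 3]
features = []
for s, k in zip(scales, kernels):
    branch_in = resize_nearest(img, s, s)
    kernel = np.ones((k, k)) / (k * k)   # averaging kernel as a stand-in for learned weights
    features.append(np.maximum(conv2d_same(branch_in, kernel), 0).ravel())  # ReLU
fused = np.concatenate(features)          # fusion in the fully-connected stage
print(fused.shape)  # (32*32 + 24*24 + 16*16,) = (1856,)
```

Large kernels on the full-resolution branch capture more global structure, while small kernels on the down-scaled branches emphasize local detail, mirroring the global/local trade-off described above.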

Attention mechanism and global attention mechanism (GAM)
The attention mechanism makes full use of the local and global features of a CNN to reduce noise and improve classification accuracy by giving higher weights to important features [23,24]. It is widely used to obtain better feature representations for CNNs: it learns weights over the extracted feature graphs and then multiplies them back onto the original feature graphs, so that the feature graphs are better represented. The channel attention module first conducts global average-pooling and global max-pooling on the original feature graphs, then passes the two pooled feature vectors through a shared fully-connected layer, adds them point-wise, and applies an activation. The spatial attention module uses 1×1 pooling, including average-pooling and max-pooling, instead of 1×1 convolution; both pooled maps are concatenated, followed by a 7×7 convolution layer and an activation [25].
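The channel and spatial attention modules described above can be sketched in numpy as follows. This is a minimal illustration, not the paper's implementation: the tensor sizes, reduction ratio, and random weight matrices are placeholder assumptions for what would be learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """Channel attention: shared MLP over global avg- and max-pooled vectors.
    x: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    avg = x.mean(axis=(1, 2))   # global average-pooling -> (C,)
    mx = x.max(axis=(1, 2))     # global max-pooling -> (C,)
    att = sigmoid(w2 @ np.maximum(w1 @ avg, 0) + w2 @ np.maximum(w1 @ mx, 0))
    return x * att[:, None, None]   # reweight each channel

def spatial_attention(x, kernel):
    """Spatial attention: concat channel-wise avg/max maps, 7x7 conv, sigmoid.
    x: (C, H, W); kernel: (2, 7, 7)."""
    maps = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    p = kernel.shape[-1] // 2
    padded = np.pad(maps, ((0, 0), (p, p), (p, p)))
    h, w = x.shape[1:]
    att = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            att[i, j] = np.sum(padded[:, i:i + 7, j:j + 7] * kernel)
    return x * sigmoid(att)[None]   # reweight each spatial position

rng = np.random.default_rng(0)
x = rng.random((8, 16, 16))
y = channel_attention(x, rng.random((4, 8)), rng.random((8, 4)))
y = spatial_attention(y, rng.random((2, 7, 7)))
print(y.shape)  # (8, 16, 16)
```

Because both attention maps pass through a sigmoid, the output is the input feature map scaled element-wise by values in (0, 1), which is exactly the "multiply back onto the original feature graphs" step in the text.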
In image segmentation and classification, the attention model is responsible for learning the attention weights. The global attention mechanism (GAM) can be used to refine feature extraction and improve the perception ability of a CNN [28]. The multi-scale information and the image classification module output are fed as input to the GAM at each encoding layer, where the input x is down-sampled using max-pooling with a stride of 2×2.
The output of the GAM is input into a max-pooling layer to reduce the dimensionality and focus on the fine details of the feature map. In the process of applying GAM to RSI segmentation, pooling indices are stored at each encoder layer so that the decoder can use this information to up-sample the feature maps. The output at each encoder layer is referred to as the down-sampling unit obtained by Eq.(3) [25].

MS-CNN with attention for aircraft detection
It is difficult to extract proper discriminant features for detection and classification of multi-scale aircraft with a standard CNN. The main process of MSCNNA is introduced as follows:
(1) Input data. The original image is resized into three input images of three different sizes, which are fed into three CNNs to extract multi-scale features of the aircraft image, respectively. The three CNNs are carried out separately, and the extracted feature graphs are integrated by the global attention mechanism (GAM) in the attention module. This integration process of GAM is similar to the existing attention mechanism in MS-CNNs [25]. In each CNN of MSCNNA, multi-scale connections are made to the feature maps of different layers to adjust the scale and viewing angle of the input image.
(2) Encoder. In the encoder module, the multi-scale input features at each encoding layer can encode both global and local features. After the original input image is convolved with a kernel, the ReLU activation function is applied to obtain the first-layer output feature maps; the specific convolution and pooling steps are the same as in a standard CNN. The convolutional operation extracts feature maps from the input images, and the output x^l of the l-th layer is calculated as

x^l = f(W^l * x^(l-1) + b^l),

where W^l and b^l are the convolution weights and bias of the l-th layer, * denotes the convolution operation, and f(·) is the ReLU activation function.
(4) Decoder. The decoder module uses several deconvolutions to up-sample the feature graphs integrated by the attention mechanism. It consists of deconvolution and upsampling operations, where deconvolution with learnable parameters transforms the encoder information similarly to a convolutional layer, and upsampling is implemented on the input feature maps using the memorized max-pooling indices from the corresponding encoder feature maps. The max-pooling indices from the encoder feature maps are used for upsampling of the input feature maps in the decoder block. To effectively preserve edge details from the original image, the entire feature maps are transferred from the encoder block to the corresponding decoder block and connected to the upsampled decoder feature maps using the pooled indices. The decoder can thus use fewer parameters and obtain better image features to recover the object information from the original image.
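The memorized max-pooling indices and the decoder's index-based upsampling can be sketched as below. This is a toy single-channel version for illustration only: the encoder records where each 2×2 maximum came from, and the decoder places each value back at exactly that position, zero-filling the rest.

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 stride-2 max-pooling that also records the argmax indices,
    as in SegNet-style encoders. x: (H, W), H and W even."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)
    idx = blocks.argmax(axis=1)
    pooled = blocks.max(axis=1).reshape(h // 2, w // 2)
    return pooled, idx

def max_unpool(pooled, idx):
    """Decoder upsampling: place each value back at its memorized position."""
    ph, pw = pooled.shape
    blocks = np.zeros((ph * pw, 4))
    blocks[np.arange(ph * pw), idx] = pooled.ravel()
    return blocks.reshape(ph, pw, 2, 2).transpose(0, 2, 1, 3).reshape(ph * 2, pw * 2)

x = np.array([[1., 5., 2., 0.],
              [3., 4., 8., 6.],
              [0., 2., 1., 1.],
              [9., 7., 3., 2.]])
pooled, idx = max_pool_with_indices(x)
restored = max_unpool(pooled, idx)
print(pooled)    # [[5. 8.] [9. 3.]]
print(restored)  # maxima restored at their original positions, zeros elsewhere
```

Restoring values at their recorded positions, rather than interpolating blindly, is what preserves the edge localization mentioned above while requiring no extra learned parameters.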
In the decoder, the different convolution blocks of the sampling layer can be connected by bilinear or bicubic interpolation [26], and the surface interpolated by bicubic interpolation is smoother than that by bilinear or nearest-neighbor interpolation [29]. Therefore, in MSCNNA, bicubic interpolation is applied to the deconvolution, which can be implemented by Lagrange polynomials, cubic splines, or the cubic convolution algorithm. The bicubic convolution interpolation kernel is defined as

W(x) = (a+2)|x|^3 - (a+3)|x|^2 + 1, for |x| <= 1,
W(x) = a|x|^3 - 5a|x|^2 + 8a|x| - 4a, for 1 < |x| < 2,
W(x) = 0, otherwise,  (6)

where f(x) is the input data and a is usually set to -0.5. By Eq.(6), the two-dimensional bicubic convolution interpolation can be calculated as

g(s1, s2) = Σ_{m=-1}^{2} Σ_{n=-1}^{2} f(x+m, y+n) W(m-s1) W(n-s2),  (7)

where g(s1, s2) is regarded as the deconvolution graph.
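The cubic convolution kernel and the two-dimensional interpolation of Eq.(7) can be sketched directly. The constant-image check at the end works because the kernel weights at any fractional offset sum to one; the function and variable names are ours, not from the paper.

```python
import numpy as np

def bicubic_kernel(x, a=-0.5):
    """Cubic convolution kernel W(x) of Eq.(6)."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def bicubic_interpolate(f, x, y, s1, s2, a=-0.5):
    """Two-dimensional cubic convolution of Eq.(7) at fractional offset
    (s1, s2) inside the cell whose reference sample is f[y, x]."""
    g = 0.0
    for m in range(-1, 3):
        for n in range(-1, 3):
            g += f[y + n, x + m] * bicubic_kernel(m - s1, a) * bicubic_kernel(n - s2, a)
    return g

# On a constant image the interpolated value reproduces the constant,
# because the 4x4 kernel weights sum to one.
f = np.full((6, 6), 3.0)
print(round(bicubic_interpolate(f, 2, 2, 0.5, 0.5), 6))  # -> 3.0
```

Compared with bilinear weights, which are piecewise linear, this cubic kernel is smooth at the sample points, which is the source of the smoother interpolated surface noted above.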
By Eqs.(6) and (7), the interpolation coefficients of the two-dimensional cubic convolution are obtained, where f(x, y) is the two-dimensional feature graph integrated by GAM.
(5) Aircraft detection and model training. The training dataset is divided into a positive sample set containing aircraft and a negative sample set without aircraft. MSCNNA is used for a binary aircraft classification task, i.e., determining whether an RSI contains aircraft or not. After training, a test RSI is input into the trained MSCNNA for forward propagation. Then, the weights corresponding to the "aircraft" classification result in the last deconvolution layer are extracted, weighted, and averaged with the output characteristic map. The highlighted area is the basis for MSCNNA to assign the aircraft category to the test image. The main difference between an RSI with aircraft and one without is that the former contains aircraft targets, so the highlighted areas correspond to the positions of the aircraft in the test RSI. Then, adaptive-threshold-based detection is used to segment a binary graph. Finally, according to the threshold, the detection of aircraft in the test RSI is completed [30].
For an RSI of size M×N, the foreground and background are segmented according to the optimal grayscale threshold, which can be evaluated by the OTSU algorithm as follows [31]: the optimal threshold T* maximizes the between-class variance σ²(T) of the foreground and background, which is calculated as

σ²(T) = (Nf / (M×N)) × (Nb / (M×N)) × (Gf - Gb)²,

where Nf and Nb are the pixel numbers in the foreground and background, respectively, and Gf and Gb are the average gray values of the foreground and background, respectively. In the RSI, the pixels larger than the threshold are marked as foreground and the rest as background. Then, the foreground and background are segmented and binarized using the optimal threshold. The training strategies mostly follow LeNet5, including multi-scale training, data augmentation, convolution with anchor boxes, and the loss function. For a fair comparison, LeNet5 is trained in the same way as MSCNNA.
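The OTSU threshold search above can be sketched as a brute-force scan over gray levels (assuming 8-bit images; the toy image and helper names are ours):

```python
import numpy as np

def otsu_threshold(img):
    """Exhaustive Otsu: pick the T maximizing the between-class variance
    sigma^2(T) = w_f * w_b * (G_f - G_b)^2."""
    pixels = img.ravel().astype(np.float64)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        fg = pixels[pixels >= t]   # candidate foreground
        bg = pixels[pixels < t]    # candidate background
        if fg.size == 0 or bg.size == 0:
            continue
        wf, wb = fg.size / pixels.size, bg.size / pixels.size
        var = wf * wb * (fg.mean() - bg.mean()) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Clearly bimodal toy image: background near 20, foreground near 200.
img = np.array([[20, 22, 21, 200],
                [19, 23, 201, 199],
                [20, 21, 200, 202],
                [22, 20, 198, 201]])
t = otsu_threshold(img)
binary = (img >= t).astype(np.uint8)
print(t, binary.sum())  # the threshold cleanly separates the 7 bright pixels
```

On the clean bimodal example the variance is maximized by any threshold between the two modes, and thresholding yields exactly the seven bright "foreground" pixels.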
(6) Model evaluation. Intersection over Union (IoU) is often used in object detection to measure the overlap between the real and predicted regions; the higher the overlap, the higher the IoU. It is calculated as the overlap rate of the predicted and ground-truth bounding boxes, that is, the ratio of the intersection to the union of the two boxes:

IoU = |A ∩ B| / |A ∪ B|,

where A and B are the predicted and ground-truth regions. In semantic segmentation, pixel regions are used instead of the marked bounding boxes to calculate the intersection ratio of the image segmentation, and the mean intersection over union (mIoU) is the average of the per-class IoU values. To evaluate the aircraft detection effect more objectively, the precision rate Precision, the recall rate Recall, and the F-value are introduced as three evaluation indexes to measure the difference between the detection results and the actual labeled images:

Precision = TP / (TP + FP), Recall = TP / (TP + FN), F = 2 × Precision × Recall / (Precision + Recall),  (12)

where TP is the overlap between the detected aircraft region and the ground-truth aircraft region, FP is the part of the detection result that does not belong to the aircraft region, and FN is the part of the aircraft region that is not detected.
In Eq.(12), Precision reflects the consistency between the detected aircraft area and the real aircraft area, Recall is the proportion of correctly segmented aircraft to the total number of aircraft, and F reflects the overall accuracy by combining Precision and Recall. Since the detection time is a key technical index for evaluating the practicability of the model, the detection time of a single image is used as a time index to measure the detection speed.
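The pixel-level metrics above can be sketched as below; the masks are a made-up 3×3 example for illustration.

```python
import numpy as np

def detection_metrics(pred, gt):
    """Pixel-level IoU, Precision, Recall and F-measure for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # correctly detected aircraft pixels
    fp = np.logical_and(pred, ~gt).sum()   # false detections
    fn = np.logical_and(~pred, gt).sum()   # missed aircraft pixels
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return iou, precision, recall, f

pred = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0]])
gt = np.array([[1, 1, 0], [0, 0, 0], [0, 1, 0]])
print(detection_metrics(pred, gt))  # tp=2, fp=1, fn=1 -> (0.5, 2/3, 2/3, 2/3)
```

Note that IoU = TP / (TP + FP + FN) penalizes both false detections and misses, which is why it is stricter than either Precision or Recall alone.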

Experiments and analysis
To validate the MSCNNA based aircraft detection method, extensive experiments are conducted on the aircraft RSI dataset and compared with five state-of-the-art methods, i.e., deep CNN (DCNN) [6], saliency and CNN (SCNN) [7], Markov random field-FCN (M-FCN) [8], CNN based weakly supervised learning (CNNWSL) [31], and CNN based semantic segmentation (CNNSS) [32]. All models are trained on the EORSSD dataset, and the main experimental configuration is listed in Table 1.

Data and Preprocessing
The EORSSD dataset (https://github.com/rmcong/EORSSD-dataset) contains 2,000 images. To improve the aircraft detection ability, each original aircraft RSI is augmented by rotation (90°, 180°, 270°) and by random cropping with an overlap of 100 pixels, and then data cleaning is adopted to remove sub-images without objects [27]. Each image is augmented into 4 images containing aircraft targets, with 3 orientations and one cropping, and the augmented dataset contains 1,548 images in total for the aircraft detection task. No image preprocessing is applied besides resizing, augmentation, and annotation. Four-fold validation experiments are conducted on the augmented dataset, and the average detection results are taken as the accuracy rates.
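The rotation-and-crop augmentation described above can be sketched as follows. The crop size and random generator here are illustrative assumptions, and the 100-pixel-overlap bookkeeping and object-presence cleaning step are omitted for brevity.

```python
import numpy as np

def augment(img, crop_size, rng):
    """Produce the 4 augmented views per image described for the EORSSD
    aircraft subset: rotations by 90/180/270 degrees plus one random crop."""
    rotations = [np.rot90(img, k) for k in (1, 2, 3)]
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    crop = img[top:top + crop_size, left:left + crop_size]
    return rotations + [crop]

rng = np.random.default_rng(0)
img = rng.random((256, 256))
out = augment(img, 128, rng)
print(len(out), out[0].shape, out[-1].shape)  # 4 (256, 256) (128, 128)
```

In a real pipeline the crop would then be filtered out if it no longer contains an aircraft, matching the data-cleaning step in the text.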

Experiments
MSCNNA is initialized by being pretrained on the EORSSD dataset with resolutions ranging from 0.1 to 1 m [33,34]; EORSSD is a benchmark dataset for evaluating salient object detection in optical RSIs. During fine-tuning [6,7], the initial learning rate of each layer is set to 0.001 and reduced to 10% of the previous value every 500 iterations. The momentum and weight decay are set to 0.9 and 0.0005, respectively. The maximum number of training iterations is 3,000, the batch size is set to 40, and the total number of iterations depends on the convergence of the network model. During testing, the proposed MSCNNA is utilized to detect the aircraft in the test RSIs.
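The step learning-rate schedule from the fine-tuning setup can be written out as a one-liner; this is our reading of "reduced to 10% every 500 iterations", not code from the paper.

```python
def learning_rate(iteration, base_lr=0.001, decay=0.1, step=500):
    """Step schedule: the learning rate is multiplied by `decay`
    every `step` iterations, starting from `base_lr`."""
    return base_lr * decay ** (iteration // step)

# Schedule over the 3,000-iteration budget described in the text.
for it in (0, 499, 500, 1000, 2999):
    print(it, learning_rate(it))
```

In a framework like PyTorch the same schedule would typically be expressed with a step-decay scheduler rather than recomputed per iteration.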

Results
In network training, the convolutional layers are used to extract features from the three-size input images, and the features extracted by different convolutional kernels differ. The shallow convolution kernels mainly extract low-level features of the image, including color and contour, while the deep convolution kernels extract more semantic features, including texture and detail features. To observe the features extracted by different convolution kernels, all convolution feature maps of the C1 and C3 layers are visualized in Fig.4, which shows that the feature graphs obtained by the C1 and C3 convolution layers display the contour features of the aircraft images.

Four-fold validation experiments of aircraft detection are performed on the augmented dataset of aircraft RSIs and compared with DCNN [6], SCNN [7], M-FCN [8], CNNWSL [31], and CNNSS [32].
The average detection results are regarded as the aircraft detection results, as shown in Table 3. DCNN and CNNSS can extract deeper features by adding convolution layers, but because aircraft occupy a relatively small part of RSIs, the deep-level features may lose the target features; therefore, these two methods are weaker than MSCNNA on the performance evaluation indexes. According to the above results, MSCNNA has a good detection effect, and its detected aircraft regions are closest to the labeled ground truth, which can meet the high-precision requirements of aircraft detection in natural conditions.
However, due to the small area of the aircraft in the image and the high similarity between some aircraft areas and background areas, it is difficult to distinguish the boundary between the aircraft area and the background. Therefore, as seen in Table 3, the aircraft detection performance evaluation index values of the algorithms are still relatively low.
In terms of single-image detection time, the detection time of MSCNNA is smaller than that of the other detection algorithms, mainly because MSCNNA adopts a shallow cascade mode for training and uses GAM to further select the obtained multi-scale features, which not only reduces the hardware requirements of the network model but also reduces the model training time. Because DCNN and CNNSS adopt deep convolutional layers, they take a long time to detect a single image. In FCN, a deconvolution operation is required to restore the resolution of the image, thus increasing the detection time of a single image. Based on class activation maps, M-FCN generates heat maps via reverse weighting to locate the aircraft object, reducing the model training time.
From the above analysis, the comparative experiments of Figs.5 to 8 and Tables 2 and 3 reveal that MSCNNA has a great advantage in small aircraft detection. From Fig.8 and Table 3, it is also seen that MSCNNA is capable of detecting aircraft targets of different resolutions and shows strong feature representation ability in the detection of small aircrafts.

Conclusion
Accurate aircraft detection is always an important but challenging task due to the various aircraft poses, sizes, and complex backgrounds in RSIs. Comparison results show that MSCNNA is effective and feasible for aircraft detection in RSIs. Still, there is room for improvement in detecting small aircraft precisely in RSIs, especially under some complicated conditions. In future work, MSCNNA will be applied to the Weakly Supervised Aircraft Detection Dataset (WSADD) for algorithm benchmarking, which does not need aircraft annotation information in the training data.