Aircraft Detection in Remote Sensing Images Based On Deep Convolutional Neural Network

Aircraft detection in remote sensing images is a long-standing research focus, yet it remains challenging because of variations in aircraft type, pose and size, as well as complex backgrounds. In this paper, we propose a region-based convolutional neural network to detect aircraft. To enhance the learning ability of the network, a multi-resolution aircraft remote sensing dataset is collected from Google Earth. The detection model is then trained end to end by fine-tuning on this dataset and realizes automatic aircraft recognition and localization. Experiments show that the proposed method outperforms the state-of-the-art method on the same dataset while also satisfying real-time requirements.


Introduction
Automatic aircraft detection in remote sensing images has been one of the research focuses due to its high application value in airport dynamic monitoring and military surveillance. With the development of computer vision and image processing technology, different detection methods have been proposed in recent years.
Traditional aircraft detection methods usually consist of three separate stages: region proposal, feature extraction and target classification. The first stage selects candidate regions in the given image for further recognition. A sliding window [1] is commonly used for object localization, but because millions of windows per image must be evaluated, it is rather time-consuming. To alleviate this problem, approaches such as selective search [2], binarized normed gradients (BING) [3], Edge Boxes [4] and objectness [5] have been proposed. However, the time consumed by region proposal is still considerable, and the process is hard to accelerate on a GPU. The features extracted in the intermediate stage are finally used to classify the candidate regions with a trained classifier. Conventional methods often apply general hand-crafted features, such as the histogram of oriented gradients (HOG) [6] and the scale-invariant feature transform (SIFT) [7], to a specific target, but such general features have difficulty distinguishing different aircraft types and maintaining invariance. Researchers have also designed rotation- and scale-invariant templates, or corresponding manual features, based on the characteristics of aircraft to detect specific aircraft targets [8][9]. Nevertheless, experiments show that those methods cannot maintain high accuracy when faced with new, complex scenes.

Figure. 1 The architecture of Faster R-CNN
However, these networks are not well suited to detecting small targets because of their target-localization strategy and their requirement of fixed-size input images. Inspired by Faster R-CNN, we detect aircraft by modifying the parameters of the model according to the characteristics of our dataset.
The rest of the paper is organized as follows. Section 3 introduces the network model. The simulation experiments and the conclusion are presented in Section 4 and Section 5, respectively.

Model Structure
Faster R-CNN is a region-based object detection network consisting of a Region Proposal Network (RPN) and the state-of-the-art object detection network Fast R-CNN. The architecture of Faster R-CNN is shown in Fig. 1. The RPN shares fully convolutional layers with the object detection network. It takes an image of arbitrary size as input and outputs a set of object proposals, each with the 4 coordinates (x, y, w and h) of the predicted bounding box and 2 scores estimating the probability of object/not-object. Cross-boundary anchors are ignored to increase detection accuracy, and the threshold for non-maximum suppression (NMS) is fixed at 0.7 to reduce the redundancy among overlapping proposals. The region proposal mechanism slides a 3×3 spatial window over the feature map produced by the last shared convolutional layer. At each sliding-window location, 9 anchor boxes (3 scales and 3 aspect ratios) are generated, corresponding to 9 region proposals in the raw image. Each sliding window is mapped to a lower-dimensional vector that is fed into two parallel fully connected layers: a box-regression layer and a box-classification layer. Considering that aircraft targets in remote sensing images are smaller than objects in natural images, we extend the anchor boxes to 12 by adding a smaller box area of 64² pixels with the same aspect ratios. The experimental results confirm the effectiveness of this change. The concrete sizes are shown in Table 1.
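The anchor enumeration described above can be sketched as follows. The 64² pixel area is taken from the text; the three larger areas (128², 256², 512²) and the aspect ratios (0.5, 1, 2) are the Faster R-CNN defaults and are assumptions here, since the paper's concrete sizes live in Table 1, which is not reproduced.

```python
import numpy as np

def make_anchors(areas=(64**2, 128**2, 256**2, 512**2),
                 aspect_ratios=(0.5, 1.0, 2.0)):
    """Generate the 12 anchor shapes (4 areas x 3 aspect ratios),
    centred at the origin, as (x1, y1, x2, y2) offsets.
    NOTE: areas beyond 64^2 and the ratios are assumed defaults."""
    anchors = []
    for area in areas:
        for ratio in aspect_ratios:
            # ratio = h / w, so area = w * h = ratio * w^2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)

anchors = make_anchors()  # one row per anchor shape, shape (12, 4)
```

At each sliding-window position these 12 offsets are added to the position's centre in the raw image to obtain the candidate boxes.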

Training Strategy
To train the unified detection network, we assign a binary label to each anchor in the RPN. An anchor is labeled positive when either of two rules is met: 1) the anchor has the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or 2) the anchor has an IoU overlap higher than 0.75 with any ground-truth box. An anchor is labeled negative when its IoU overlap is lower than 0.3 with all ground-truth boxes. Anchors that are neither positive nor negative do not serve as training samples. The sampling strategy in the RPN follows [16]: a mini-batch comes from one image containing many positive and negative anchors, and to avoid a bias towards negative samples we randomly sample 256 anchor boxes per image with a 1:1 ratio of positive to negative anchors.
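The two labeling rules can be sketched as below; this is a minimal NumPy illustration of the assignment logic, not the paper's implementation. Boxes are (x1, y1, x2, y2); label 1 is positive, 0 negative, and -1 marks anchors excluded from training.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def assign_labels(anchors, gt_boxes, pos_thr=0.75, neg_thr=0.3):
    """Apply the two positive rules and the negative rule from the text."""
    overlaps = np.stack([iou(a, gt_boxes) for a in anchors])  # (A, G)
    labels = np.full(len(anchors), -1)       # -1: not a training sample
    max_per_anchor = overlaps.max(axis=1)
    labels[max_per_anchor < neg_thr] = 0     # negative: IoU < 0.3 with all gt
    labels[max_per_anchor >= pos_thr] = 1    # rule 2: IoU >= 0.75 with some gt
    labels[overlaps.argmax(axis=0)] = 1      # rule 1: best anchor per gt box
    return labels
```

The 256-anchor mini-batch would then be drawn from the positive and negative index sets at a 1:1 ratio.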
For each anchor box, we adopt the multi-task loss function for classification and bounding-box regression used in [14]:

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*),

where the classification loss L_cls is the log loss and the regression loss L_reg is the smooth L1 loss function defined in [16]. p_i is the predicted probability that anchor i is an aircraft, and p_i^* is its ground-truth label. x, y, w and h denote the center coordinates of a box, its width and its height; x, x_a and x^* are for the predicted box, the anchor box and the ground-truth box, respectively (likewise for y, w and h). The hyperparameter λ is set to 10 to balance the two task losses. With these definitions, the detection network can predict aircraft targets over a wide range of scales and aspect ratios by minimizing the objective function.

Figure 2. Examples of the image dataset.
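As a sketch, the loss can be written out as follows. For simplicity, the regression term is normalised here by the number of positive anchors rather than by the number of anchor locations used as N_reg in [14]; that simplification, and the 4-dimensional t parameterisation per anchor, are assumptions of this illustration.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 from [16]: 0.5*x^2 if |x| < 1, else |x| - 0.5 (elementwise)."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x**2, ax - 0.5)

def multitask_loss(p, p_star, t, t_star, lam=10.0):
    """Log loss over all sampled anchors plus lambda-weighted smooth-L1
    regression, counted only on positive anchors (p_star == 1)."""
    eps = 1e-12                                  # numerical safety for log
    cls = -np.mean(p_star * np.log(p + eps)
                   + (1 - p_star) * np.log(1 - p + eps))
    reg = np.sum(p_star[:, None] * smooth_l1(t - t_star)) / max(p_star.sum(), 1)
    return cls + lam * reg
```

With perfect regression (t == t_star) the loss reduces to the pure classification term, which is the behaviour the p_i^* gating in the equation encodes.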

Dataset
The remote sensing aircraft dataset used in [11] has only 76 images in total, which is too small to train Faster R-CNN. In this paper, we collect 1000 images of different resolutions from Google Earth; about 90% of them are high-resolution. Fig. 2 shows some examples of the dataset. The image sizes vary from 600×700 to 740×1380 pixels. Research shows that data augmentation helps the network fully learn the variations of the object and enhances its robustness to changes in translation and angle. We therefore expand the dataset by applying a horizontal flip and rotations (90°, 180°, 270°) to the seed images, enlarging it 8-fold from 1000 to 8000 images. We randomly select 3000 images for training, 3000 for validation and 2000 for testing.

The proposed network is a supervised object detection network, so two additional kinds of files are needed. XML annotation files indicate the ground-truth locations of the aircraft within the images; we create them manually with the image annotation tool LabelImg, with only one category in the dataset. Apart from the annotation files, the code framework requires 4 TXT files, named train.txt, test.txt, val.txt and trainval.txt, to indicate how the images are used.

Figure 3. Detection performance with different numbers of proposals.

To avoid local optima, we fine-tune the detection network from the pre-trained VGG-16 [20] model using approximate joint training, which can be trained end to end. The main parameters of the network are shown in Table 2. All experiments are done within Caffe, and our graphics card is an NVIDIA GeForce GTX 1070 with 8 GB of GDDR5 memory.
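The 8-fold expansion described above (each seed image plus its horizontal flip, each under rotations of 0°, 90°, 180° and 270°) can be sketched with NumPy array operations; this is an illustration of the augmentation scheme, not the paper's actual preprocessing script.

```python
import numpy as np

def augment(img):
    """Return the 8 variants of a seed image: the original and its
    horizontal flip, each rotated by 0, 90, 180 and 270 degrees."""
    variants = []
    for base in (img, np.fliplr(img)):   # original + horizontal flip
        for k in range(4):               # rotate by k * 90 degrees
            variants.append(np.rot90(base, k))
    return variants
```

Applying this to 1000 seed images yields the 8000-image dataset described in the text (the rotated annotation boxes would have to be transformed accordingly).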

Results and Analysis
Considering that the number of proposals fed into Fast R-CNN affects the detection accuracy at test time [14, 21], we first analyze the effect of this parameter on detection performance. As shown in Fig. 3, the maximum is achieved when the number of proposals is fixed at 500. To evaluate the detection method quantitatively, we test the proposed model and the compared methods on the same test dataset we collected. The compared methods are the original Faster R-CNN and the FCN that achieves the best results in [11]. The detection rates and average test times are shown in Table 3.
FCN is an end-to-end detection network that can be approximately regarded as the RPN part of Faster R-CNN. Faster R-CNN and the proposed method clearly outperform FCN, which shows that the high-quality region proposals produced by the RPN greatly improve the detection capacity of the network. The detection rate of the proposed method is 1.8 percentage points higher than that of the original Faster R-CNN, indicating that the smaller box area suits the detection of small targets over a large range. Owing to its network structure, the proposed method runs at about 3 fps on the GTX 1070, slower than FCN but still fast enough to meet real-time requirements. Fig. 4 shows detection results of the proposed method on part of the test images, from which it can be seen that the method accurately detects multi-scale and multi-direction aircraft targets against complex backgrounds.

Figure 4. Aircraft detection results of the proposed method with a score threshold of 0.8.

Conclusion
In this paper, we propose a region-based convolutional neural network that uses a smaller anchor box area in Faster R-CNN to detect aircraft in remote sensing images. Experiments show that our method yields efficient detection and performs better than the state-of-the-art method. Aircraft of various types and colors are treated as a single target class in this paper, whereas real applications often need to distinguish aircraft types and detect moving targets. Building on the results achieved here, we will devote future work to improving detection precision and applying the method to other target detection tasks.