Image Small Target Detection Based on Deep Learning with SNR-Controlled Sample Generation

Abstract: A small target detection method based on deep learning is proposed. First, random background patches are sampled from cloud-sky images. Then, randomly generated target spots are added to the backgrounds with a controlled signal-to-background-noise ratio (SNR) to generate target samples. Training and testing results show that the performance of deep nets is superior to traditional small target detection techniques, and that the choice of sampling SNR has an important effect on net training performance. SNR = 1 is a good choice for deep net training, not only for small target detection but potentially for other applications as well.


Introduction
Image small target detection plays a crucial role in infrared warning and tracking systems, which require that small targets on backgrounds such as cloud-sky or sea-sky be detected effectively. There have been many approaches to this problem. Some papers proposed filtering or morphology methods. A high-pass template filter was designed for real-time small target detection by Peng and Zhou [1]. Yang et al. presented an adaptive Butterworth high-pass filter (BHPF) for infrared small target detection [2]. Wang et al. provided a real-time small target detection method based on the cubic facet model [3]. Hilliard put forward a low-pass IIR filter to predict clutter [4]. Xiangzhi Bai et al. applied the top-hat transformation to small target detection [5]. Other papers proposed classifier-based methods. Wang et al. proposed a detection method based on a high-pass filter and a least-squares support vector machine [6]. Jiajia Zhao et al. designed a detection approach based on sparse representation.
In this paper, we propose an end-to-end deep learning solution for small target detection, which can be regarded as a classifier-based method. Experimental results show that the proposed method is robust and insensitive to changes in background and target.
Nowadays, deep architectures with convolution and pooling are found to be highly effective and are commonly used in computer vision and object recognition [7][8][9][10][11][12][13][14][15][16][17][18][19]. The most impressive result was achieved in the 2015 ImageNet contest, whose training set contained 1.2 million images in 1000 object classes. On the test set of 150,000 images, the deep Convolutional Neural Network (CNN) approach described in [18] achieved a top-5 error rate of 3.57%, considerably lower than the human recognition error rate of about 5%. Furthermore, CNNs have achieved superior classification accuracy on a variety of tasks, such as handwritten digit, Latin and Chinese character recognition [8,9], traffic sign recognition [10], face detection and recognition [11], and radar target recognition [19]. These facts make us believe that deep neural nets could potentially be used in many other applications, including image small target detection.
In recent years, many papers have applied deep neural nets to target detection [20][21][22]. However, their goal is to locate and recognize large objects in images. Such objects usually span tens or hundreds of pixels and are full of complex image detail, which makes them easy for a human to locate and recognize. A small target, by contrast, spans only a few pixels: it can be located, but its class can barely be identified, and this is the focus of this paper. As far as we know, this is the first time that deep neural nets have been used in this field.

Deep Net Architectures and Configuration
We designed the nets with an input dimension of 21×21 pixels. These small-input nets are used as a moving filter window to detect small targets at all positions of an image. During training, the only preprocessing we apply is subtracting the mean value, computed over the training set, from each pixel. The image patch is passed through a stack of layers, each fully connected to the next. The hidden layers have 128 channels each, and the last layer performs 2-way classification, followed by a soft-max transform. All hidden layers are equipped with the rectification non-linearity [23]. Table 1 lists the models trained in this paper.
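The forward pass described above can be sketched as follows. This is an illustrative NumPy sketch, not the MatConvNet configuration actually used: the depth of 3 hidden layers and the He-style random initialization are assumptions for demonstration; only the 21×21 input, 128-channel fully connected hidden layers with ReLU, 2-way output, soft-max, and mean subtraction come from the text.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_net(depth=3, width=128, in_dim=21 * 21, n_classes=2, seed=0):
    """Random weights for a stack of fully connected layers (illustrative sizes)."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [width] * depth + [n_classes]
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(net, patch, mean_image):
    """Classify one 21x21 patch: subtract the training-set mean image,
    then pass through the fully connected + ReLU stack and soft-max."""
    x = (patch - mean_image).ravel()
    for w, b in net[:-1]:
        x = relu(x @ w + b)
    w, b = net[-1]
    return softmax(x @ w + b)   # [p(no target), p(target)]
```

At detection time the same net is slid over every 21×21 window of the full image, so the output map gives a per-position target score.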

Sample Set Generation
As with any neural network, an important task is to establish a dataset with enough samples for net training and validation. Unlike datasets of everyday object images, which are publicly accessible, datasets for small target detection have to be generated by ourselves. First, we randomly downloaded cloud-sky images from the internet and converted them to grey-level images as a source for background generation. Figure 1a shows some of these background images. Second, background patches of 21×21 pixels are randomly cropped from the background sources. Next, a simple program adds a target spot at the centre of half of the background patches to form target images; the other half are kept unchanged as no-target samples. The grey-level (intensity) distribution of the spot target is generated by Eq. 1~Eq. 4. Figure 1b shows some generated image samples.
Here w_x and w_y are the widths of the Gaussian distribution along two perpendicular directions, α is a random angle between 0 and π that determines the orientation of the distribution, rand() generates a uniformly distributed random value between 0 and 1, s(x,y) is the grey-level function, and (x,y) is the pixel position in a sample patch, with the patch centre at (0,0). To train the nets effectively, the target spot intensity should be chosen carefully.
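Since Eq. 1~Eq. 4 are not reproduced here, the following sketch assumes the standard rotated 2-D Gaussian form consistent with the definitions above (widths w_x and w_y, random orientation α in [0, π), unit peak amplitude at the patch centre); the exact equations in the paper may differ in detail.

```python
import numpy as np

def gauss_spot(size=21, wx=2.0, wy=3.0, alpha=None, rng=None):
    """Rotated 2-D Gaussian spot with unit peak, centred at the patch centre (0,0)."""
    rng = np.random.default_rng() if rng is None else rng
    if alpha is None:
        alpha = np.pi * rng.random()               # random orientation in [0, pi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    u = x * np.cos(alpha) + y * np.sin(alpha)       # coordinates rotated by alpha
    v = -x * np.sin(alpha) + y * np.cos(alpha)
    return np.exp(-(u ** 2 / (2 * wx ** 2) + v ** 2 / (2 * wy ** 2)))
```

Randomizing w_x, w_y and α per sample (e.g. via rand()) gives spots of varying size and elongation.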
We used two basic strategies for establishing sample sets: a random-intensity strategy and a constant-SNR intensity strategy. Several considerations apply. First, unlike common object recognition, the sampling regions of target and no-target overlap with each other under a low or unstable SNR, resulting in over-fitting during training (Figure 2c). Conversely, if the SNR is very high, a wide margin opens between the two sampling regions and the samples may even become linearly separable (Figure 2a); the nonlinear classification ability of the neural nets would then be wasted. Figure 2 illustrates the effect of SNR in a low-dimensional setting. Second, to obtain better training performance with a limited number of samples, training sample points should lie as close to the class boundaries as possible while maintaining an ideal boundary surface that does not mix the classes. This boundary-approaching effect, illustrated in Figure 3, can be obtained by holding the SNR at a specific constant within a training set. Accordingly, we choose a fixed-SNR strategy to generate target samples and compare the resulting net training with the random-SNR strategy.
First, a random background patch is normalized to unit variance as in Eq. 5. Then, the target spot and background patch are added together according to the SNR to make a target sample image, as in Eq. 6.
B_n(x,y) = (B(x,y) − mean(B(x,y))) / std(B(x,y))   (Eq. 5)

P(x,y) = B_n(x,y) + SNR · s(x,y)   (Eq. 6)

where B_n(x,y) is the normalized background image, mean(B(x,y)) and std(B(x,y)) are the mean value and standard deviation of the background patch, and P(x,y) is the generated target sample image. By controlling the SNR, we generated 140,000 sample images for each round of training and validation.
Half of the samples are positive targets and the others are negative backgrounds with no targets.
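The normalization and composition steps of Eq. 5 and Eq. 6 can be sketched as below. Because the background is reduced to unit variance and the spot has unit peak amplitude, scaling the spot by the desired SNR directly fixes the peak-signal-to-background-noise ratio of the sample; the exact form of Eq. 6 is an assumption consistent with that description.

```python
import numpy as np

def make_target_sample(background, spot, snr=1.0):
    """Compose one target sample.
    Eq. 5: normalise the background patch to zero mean and unit variance.
    Eq. 6: add a unit-peak target spot scaled by the desired (peak) SNR."""
    bn = (background - background.mean()) / background.std()
    return bn + snr * spot
```

Generating 70,000 such target samples plus 70,000 untouched normalized backgrounds reproduces the 140,000-image, half-positive/half-negative set described above.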

Experiments
We trained deep nets of different layer widths and depths on training sets of different SNRs, then evaluated the trained nets on a large sky-cloud image (1024×768 pixels) that does not belong to the background source set. 100 small spot targets generated by Eq. 1~4 were randomly added to this background to produce the final test image; the spot intensities were controlled to obtain an evenly distributed local SNR ranging from 0 to 1. Net performance is compared by Eq. 7 and Eq. 8, the small-area and large-area signal-to-noise ratio gains:

SSNR = S / C_s   (Eq. 7)

LSNR = S / C_l   (Eq. 8)

where S is the signal amplitude and C_s and C_l are the standard deviations within a local small area (21×21 pixels) and a large area (201×201 pixels), respectively. SSNR reflects the ability to enhance the signal, while LSNR reflects the ability to suppress background noise; larger values of either index imply better performance. Unlike other papers, we do not use the background suppression factor (BSF) because it depends on the signal amplification level. All training and testing were conducted on the MATLAB platform with the MatConvNet toolbox [24], on a laptop with an Intel Core i7-2630QM CPU and an Nvidia GTX 650M graphics card.
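The two evaluation indices can be computed as in the sketch below. The window-extraction details (clipping at image borders, centring on the detected spot) are our assumptions; the text specifies only the signal amplitude S and the local standard deviations C_s and C_l over 21×21 and 201×201 windows.

```python
import numpy as np

def snr_indices(image, cy, cx, amplitude, small=21, large=201):
    """SSNR = S / C_s and LSNR = S / C_l for a spot at (cy, cx):
    signal amplitude over the clutter std of a small and a large local window."""
    def local_std(side):
        half = side // 2
        win = image[max(cy - half, 0):cy + half + 1,
                    max(cx - half, 0):cx + half + 1]
        return win.std()
    return amplitude / local_std(small), amplitude / local_std(large)
```

Applying the same indices to the raw input and to the net output, and taking the ratio, gives the SSNR/LSNR gain of each detector.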
We trained all nets in Table 1 on training sets of different SNRs, then tested them and compared their performance with the traditional Max-mean and Max-median filtering methods. Table 2 lists all test results. As shown in Table 2, the performance of the deep nets is significantly better than that of the traditional Max-mean and Max-median filters. Deeper nets achieve better performance as long as the number of fully connected layers is less than or equal to 5; the training process did not converge for nets with more than 5 fully connected layers. The nets trained on samples with constant SNR (SNR ≈ 1) achieve the best performance, better than random sampling, except for the linear classification model (type A), whose performance does not change with SNR.

Moreover, we offer an explanation of why SNR ≈ 1 is the best choice. Figure 4 shows a sampling space with only two dimensions; each axis represents the grey level of one pixel of a sample image that has only two pixels. Every image, whether target or background, can be expressed as a point in this space. Random background images of equal intensity then form a circle centred at the origin, whose radius equals the background intensity. If a target is added to these backgrounds, we obtain a new circle formed by the target images. Since there is more than one type of target (there are 3 types in Figure 4), we obtain a set of circles, each representing one target type at a given intensity. Figure 4 then shows the three conditions SNR>1, SNR=1 and SNR<1 in (a), (b) and (c), respectively.
As Figure 4a shows, for totally random background noise and a sampling SNR > 2, a large gap exists between background and target samples. The samples could even be separated linearly, so the nonlinear classification capacity of the nets would be wasted. As Figure 4b shows, sampling with SNR = 2 yields a perfect separation between the classes: many samples lie near the classification surface, yet the classes do not mix, which satisfies the conditions of the boundary-approaching effect. In Figure 4c, with a sampling SNR < 2, samples of the two classes mix with each other and some background samples are misclassified as target samples. Since an image usually contains far more background pixels than target pixels, this misclassification leads to poor test performance. In short, SNR = 2 is the best choice for training in small target sample generation.
At first glance, this conclusion does not coincide with the experimental results. The key to resolving the apparent inconsistency is the different definitions of SNR. The signal intensity used in the sample generation experiments is the peak value of the signal pixels, while in Figure 4 it is the root mean square of the signal pixels. For a signal spot of Gaussian shape, the ratio between the peak value and the root mean square is about 2, which accounts for the discrepancy.

Conclusions
In our work, a new small target detection method based on deep learning is proposed. The performance of the deep nets is significantly better than that of traditional filtering methods. We find that nets trained on samples with a specific constant SNR perform better than nets trained on samples with random SNR. A reasonable choice is SNR ≈ 1 (peak signal value SNR), which obtained the best test performance across all sampling SNRs. It is well known that adding noise to the input data of a neural network during training can significantly improve generalization [25]; our conclusion is that, for the task of detecting a signal against a noisy background, adding too much noise does more harm than good. We also provide a simple explanation of why a sampling SNR ≈ 1 (peak signal value SNR) performs best, and note that SNR = 2 (root-mean-square SNR) might perform even better, which needs further verification. This conclusion might also prove useful for training nets for general object recognition and detection, which requires further research. Future work will reconstruct the nets with convolutions to obtain a parallel, faster, even real-time small target detection algorithm.