Few-shot remote sensing scene classification based on multi subband deep feature fusion

Abstract: Recently, convolutional neural networks (CNNs) have performed well in object classification and object recognition. However, due to the particularity of geographic data, labeled samples are seriously scarce, which limits the practical application of CNN methods in remote sensing (RS) image processing. To address small-sample RS image classification, a discrete wavelet-based multi-level deep feature fusion method is proposed. First, deep features are extracted from the RS images using pre-trained deep CNNs and the discrete wavelet transform (DWT). Next, a modified discriminant correlation analysis (DCA) approach, based on a between-class distance coefficient, is proposed to effectively distinguish easily confused categories. The proposed approach can effectively integrate the deep feature information of various frequency bands, thereby obtaining low-dimensional features with good discrimination, as demonstrated through experiments on four benchmark datasets. Compared with several state-of-the-art methods, the proposed method achieves outstanding performance under limited training samples, especially with one or two training samples per class.


Introduction
Remote sensing (RS) images play a significant role in urban planning, land cover and land use (LCLU) mapping, agriculture management, etc. [1−4], not only because of their high spatial resolution but also because of their abundant structural patterns. To exploit these RS images fully, remote sensing scene classification (RSSC) is imperative, and an appropriate feature representation method plays a key role in RSSC. Due to the limited number of labeled RS images, representing and classifying RS scenes with intelligent and convenient methods remains a challenging and complex issue.
Various efforts have been devoted to developing methods for feature representation. Traditional methods include the bag-of-visual-words (BoVW) model and its many improvements or extensions [1,5,6]. Since AlexNet [7] won the ImageNet Large-Scale Visual Recognition Challenge in 2012, plenty of deep learning methods have sprung up [8−11]. Generally, CNN-based methods can be classified into three categories: training CNNs from scratch, fine-tuning pre-trained CNNs, and using pre-trained CNNs as feature extractors. Full-training methods mainly focus on building deep networks to enhance accuracy; such methods usually improve currently available advanced models or rebuild the CNN structure to obtain astonishing scene classification results [12−16]. Wu et al. [16] took convolutional neural networks (CNNs) as a backbone to construct a deep-learning-based framework for multimodal RS data classification. Fine-tuning methods usually involve adjusting pre-trained CNNs or optimizing their loss functions to improve classification accuracy [17−21]. However, they generally require a significant number of labeled training samples and high-performance computing equipment, and take a very long time to fine-tune pre-trained CNNs or train a new network.
The CNN-based methods mentioned above utilize practical features to classify remote sensing scenes. However, it is nontrivial to obtain features that adequately represent a scene when training samples are few: the lack of available data makes the neural network overfit, which leads to performance degradation. To tackle this problem, many few-shot methods have been developed [22][23][24][25]. Wu et al. [26] proposed a "U-Net in U-Net" framework to detect small objects in infrared images. Mei et al. [27] presented a sparse representation-based framework and obtained satisfactory results. However, with very few samples, this method cannot adequately describe the key semantic features, and it still lacks discrimination for remote sensing images of the same category that vary in orientation and scale. Zeng et al. [28] proposed a prototype calibration to enhance the feature representation in few-shot RSSC tasks. Yang et al. [29] emphasized the importance of low-level features in small-sample classification, which improves the ability to characterize small-sample features, but at a large computational cost. Feature-wise transformation can be employed for RSSC and land-cover mapping tasks [30]. Tseng et al. [31] used feature-wise transformation layers to address few-shot classification under domain shifts for metric-based methods. Chen et al. [32] proposed a feature-wise transformation module to address cross-domain RSSC tasks with few training samples, and pointed out that transfer-based methods may outperform sophisticated few-shot learners. Chowdhury et al. [33] proposed a library of pre-trained feature extractors combined with a feed-forward network to solve few-shot image classification. Recently, few-shot learning has been applied to a wide range of such tasks.
Discriminative learning of adaptive match network (DLA-MatchNet), an end-to-end network, was proposed for boosting few-shot RSSC [34]. Deep nearest neighbor neural network (DN4) exploits deep local descriptors and an image-to-class measure for classification, and is one of the most advanced networks for few-shot scene classification of remote sensing images [35]. Huang et al. [25] proposed a meta-learning-based task-adaptive embedding network to enhance the generalization performance of the model in few-shot settings. These few-shot classification methods mostly focus on C-way K-shot problems. In addition, many studies have focused on deep learning-based fusion strategies to generate a more comprehensive feature representation [36][37][38][39]. Hong et al. [39] proposed a cross-fusion strategy to solve the multi-modality learning issue.
Although the above methods achieve high accuracy, the features extracted by deep learning approaches are usually high-dimensional and redundant. To further improve the classification performance of RSSC, there is still a thorny road between improving feature utilization and reducing computational complexity. Chaib et al. [40] adopted discriminant correlation analysis (DCA) to combine deep features extracted from different fully connected layers of CNNs, which provides an efficient and low-cost feature fusion strategy. However, categories with small between-class distances become even closer in the mapping space, which leads to overlap. Motivated by this, an improved DCA strategy is proposed in this paper. The key difference from the related method is that it reconstructs the between-class scatter matrix by introducing a distance coefficient, which helps adjust the distance between classes and avoid cross overlap in the mapping space. In our approach, features from different CNNs and different frequency bands are integrated by the improved DCA, which enriches the expression of feature semantics and overcomes the limitation imposed by the number of categories, especially for remote sensing datasets with few categories. Specifically, a discrete wavelet-based multilevel feature fusion (DWMLFF) strategy is proposed to fuse multi-subband features extracted from different CNNs for few-shot RSSC. The discrete wavelet transform (DWT) is employed as a decomposer to extend the multi-subband information of limited samples and surmount the problem of insufficient features in few-shot RSSC. A transfer-learning-based CNN model is used as a feature extractor to generate the deep features of the original image and the multiple subbands. Furthermore, an improved DCA method is proposed to integrate all the obtained deep features.
In the improved DCA method, we reconstruct the between-class scatter matrix by introducing a distance coefficient, which resolves the overlap of categories in the mapping space and helps distinguish easily confused categories. The proposed method gives full play to the advantages of different wavelet subbands, and utilizes the improved DCA strategy to deeply integrate different frequency components, obtaining low-dimensional, highly discriminative features for few-shot RSSC.

Materials and methods
The proposed method comprises the following parts: discrete wavelet transform, feature extraction with pre-trained deep CNNs, and feature fusion. Figure 1 shows the framework of our approach. The discrete wavelet transform decomposes the original image into components at different frequency intervals. Then, the original image and the generated low-low (LL) subbands at different levels are fed separately into pre-trained CNNs to obtain deep features. Next, all the obtained features are integrated by the improved DCA method. Finally, LIBSVM replaces the softmax layer of the CNN and performs supervised classification on well-known datasets.
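The workflow above can be sketched as a small skeleton. The helper callables (`extract`, `dwt_ll`, `fuse`, `train_svm`) are hypothetical stand-ins for the components described in the following subsections, not the paper's actual implementation:

```python
import numpy as np

def dwmlff_classify(images, labels, extract, dwt_ll, fuse, train_svm, k=2):
    """Skeleton of the framework in Figure 1.

    For each image, build an LL pyramid with k levels, extract deep
    features from the original image and every LL subband, fuse them,
    and hand the fused features to an SVM trainer.
    """
    fused = []
    for img in images:
        bands = [img]
        for _ in range(k):
            bands.append(dwt_ll(bands[-1]))   # LL pyramid, levels 1..k
        feats = [extract(b) for b in bands]   # deep features per band
        fused.append(fuse(feats))             # improved-DCA fusion step
    return train_svm(np.stack(fused), labels)
```

With k subband levels, each image contributes k + 1 feature vectors before fusion, which is where the extra information for few-shot classification comes from.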

Discrete wavelet transform
In image processing, the discrete wavelet transform (DWT) decomposes an image into components at different frequency intervals. Figure 2 shows an example of the low-frequency components of an RS image. Wang et al. [41] validated that the low-frequency component is much more generalizable than the high-frequency component. The low-frequency components obtained by DWT, which are used for deep feature extraction, can take full advantage of the image feature information. Inspired by this, CNN features of low-frequency components at different levels can be used to construct feature pyramids. For an input image X, the basic wavelet function w_f is applied recursively to compute the level-k coefficients:

[LL_k, LH_k, HL_k, HH_k] = DWT_{w_f}(LL_{k−1}), with LL_0 = X,

where LL_k, LH_k, HL_k and HH_k are the low-low, low-high, high-low and high-high filter coefficients of the k-th level. The low-low subbands are used for the subsequent feature extraction.
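As an illustration of the decomposition step, the following is a minimal numpy sketch of a one-level 2-D Haar DWT (the simplest choice of basic wavelet function; a library such as PyWavelets would normally be used), together with the LL pyramid used for feature extraction:

```python
import numpy as np

def haar_dwt2(image):
    """One level of an orthonormal 2-D Haar DWT.

    Returns the (LL, LH, HL, HH) subbands of an even-sized grayscale
    image, each half the size of the input.
    """
    a = image[0::2, 0::2].astype(float)
    b = image[0::2, 1::2].astype(float)
    c = image[1::2, 0::2].astype(float)
    d = image[1::2, 1::2].astype(float)
    ll = (a + b + c + d) / 2.0   # low-low: local averages (approximation)
    lh = (a - b + c - d) / 2.0   # detail subband
    hl = (a + b - c - d) / 2.0   # detail subband
    hh = (a - b - c + d) / 2.0   # diagonal detail subband
    return ll, lh, hl, hh

def multilevel_ll(image, k):
    """LL_1 ... LL_k: each level decomposes the previous LL subband."""
    lls, current = [], image
    for _ in range(k):
        current = haar_dwt2(current)[0]
        lls.append(current)
    return lls
```

Feeding a constant image through the transform leaves all detail subbands at zero, which matches the intuition that only low-frequency content survives in the LL band.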

Feature extraction
In the past decade, several typical CNN models have been developed, such as AlexNet [7], VGG-Net [9], GoogLeNet [10], ResNet [11], etc. These models have different structures and different representational abilities, and CNNs pre-trained on ImageNet can already extract powerful and rich features. Our approach focuses on fusing multiple subband deep features extracted from distinct off-the-shelf CNN models. To reduce computational complexity and improve recognition accuracy, AlexNet, VGG-Net, and ResNet 50 are used for feature extraction. The features extracted from diverse CNN models are correlated, partially redundant, and complementary, so their fusion yields a richer representation of RS images.

Improved DCA method
The basis for combining different features is their redundancy and complementarity. The DCA method [42] further disperses classes that are already far apart in the mapping space, providing an efficient and low-cost feature fusion strategy. However, categories that are less distinct from each other become even closer. To address this superposition of categories in the mapping space, we reconstruct the between-class scatter matrix.
Assume an image set I, where c is the number of categories in I and n is the number of training features; we define two feature matrices, X and Y.
The between-class scatter matrix of X is expressed as

S_bx = Σ_{i=1}^{c} n_i (x̄_i − x̄)(x̄_i − x̄)^T = Φ_bx Φ_bx^T, Φ_bx = [√n_1 (x̄_1 − x̄), …, √n_c (x̄_c − x̄)],

where x̄_i and x̄ are the means of the feature vectors in the i-th category and in the whole set X, respectively, and n_i is the number of samples in the i-th category. In the improved method, each term is additionally weighted by the between-class distance coefficient.
To separate the classes, X is projected into a new space. The projection X′ = W_bx^T X is obtained from the mapping matrix W_bx formed by the r most significant eigenvectors of S_bx, where r is the feature length of the transformed features [42]. The other feature set Y is processed in a similar way, giving Y′ = W_by^T Y. Let S′_xy = X′ Y′^T be the between-set covariance matrix of the transformed feature sets. To maximize the pairwise correlation across X and Y, S′_xy needs to be diagonalized.
Similar to the previous step, let S′_xy = U Σ V^T be its singular value decomposition, and set W_cx = U Σ^{−1/2} and W_cy = V Σ^{−1/2}. The transformed feature sets can then be described as X* = W_x^T X and Y* = W_y^T Y, where W_x = W_bx W_cx and W_y = W_by W_cy are the final transformation matrices for X and Y, respectively, thereby minimizing the between-class correlation.
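The steps above can be sketched in numpy as follows. The inverse-distance weighting used for the distance coefficient is an illustrative assumption (the paper's exact coefficient is not reproduced here); the rest follows the standard DCA construction of [42]:

```python
import numpy as np

def weighted_between_class_scatter(X, labels, alpha=1.0):
    """Between-class scatter matrix of DCA with a distance coefficient.

    X is a (d, n) matrix with one sample per column. The inverse-distance
    weight is a stand-in for the paper's distance coefficient: classes
    whose means lie close to the global mean (and are therefore easily
    confused) receive a larger weight.
    """
    classes = np.unique(labels)
    mean_all = X.mean(axis=1, keepdims=True)
    phi = []
    for c in classes:
        Xc = X[:, labels == c]
        mean_c = Xc.mean(axis=1, keepdims=True)
        dist = np.linalg.norm(mean_c - mean_all) + 1e-12
        weight = (1.0 / dist) ** alpha
        phi.append(np.sqrt(weight * Xc.shape[1]) * (mean_c - mean_all))
    phi = np.hstack(phi)                 # d x c matrix Phi_bx
    return phi @ phi.T                   # S_bx = Phi_bx Phi_bx^T

def dca_projection(X, labels, r, alpha=1.0):
    """Project X onto the r leading eigenvectors of the weighted scatter."""
    S_b = weighted_between_class_scatter(X, labels, alpha)
    eigvals, eigvecs = np.linalg.eigh(S_b)
    order = np.argsort(eigvals)[::-1][:r]
    W = eigvecs[:, order] / np.sqrt(np.maximum(eigvals[order], 1e-12))
    return W.T @ X, W

def dca_pair(Xp, Yp):
    """Diagonalise the between-set covariance S'_xy = X' Y'^T so that the
    final transformed sets are maximally correlated pair-wise."""
    U, s, Vt = np.linalg.svd(Xp @ Yp.T)
    Wcx = U / np.sqrt(np.maximum(s, 1e-12))
    Wcy = Vt.T / np.sqrt(np.maximum(s, 1e-12))
    return Wcx.T @ Xp, Wcy.T @ Yp
```

After `dca_pair`, the product of the two transformed sets equals the identity matrix, i.e. corresponding feature dimensions are perfectly correlated across the two sets while different dimensions are uncorrelated.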
The transformed features are fused to obtain the combination. There are two classic fusion approaches: the parallel strategy, which adds the feature vectors, and the serial strategy, which concatenates different features into a single feature. The final feature dimension is related to the number of classes; if the number of classes is small, the fusion features will not be rich enough, which harms subsequent classification performance. To enrich the information of the fusion features, the proposed method performs concatenation after transformation.
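A minimal sketch of the two fusion strategies (the function name is ours, for illustration only):

```python
import numpy as np

def fuse_features(x, y, strategy="serial"):
    """Fuse two transformed feature vectors.

    The parallel strategy adds the vectors (they must share a length);
    the serial strategy concatenates them, doubling the length, which
    is why it is preferred here when the number of classes is small.
    """
    if strategy == "parallel":
        return x + y
    return np.concatenate([x, y])
```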

Discrete wavelet-based multilevel feature fusion
Motivated by the idea of DWT in image processing, the discrete wavelet-based multilevel feature fusion (DWMLFF) method is proposed. This method fuses the information extracted from different wavelet subbands; Figure 3 shows its details. The maximum dimension of the initial fusion features is 2(c − 1), and the full size of the final features is 2(c − 1)(k + 1). For small-sample datasets, this value is much smaller than the dimension of the features directly extracted from a CNN model.
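The dimension bound can be checked with a one-line helper (a hypothetical convenience function): for WHU-RS19 (c = 19) with k = 3 LL levels, the fused feature has at most 2 × 18 × 4 = 144 dimensions, far below the 4096-dimensional fully connected features of AlexNet or VGG-Net.

```python
def fused_dim(c, k):
    """Maximum length of the final DWMLFF feature: 2(c - 1) per fused
    band, over the original image plus k LL subbands."""
    return 2 * (c - 1) * (k + 1)
```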
Finally, the LIBSVM library [43] replaces the softmax layer of the CNN and performs supervised classification on the well-known datasets.
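As a stand-in for LIBSVM, scikit-learn's `svm.SVC` (which wraps LIBSVM internally) can illustrate the final classification step; the fused features below are synthetic toy data, not outputs of the actual pipeline:

```python
import numpy as np
from sklearn import svm

# Toy stand-in for fused DWMLFF features: two well-separated classes
# of 36-dimensional vectors (36 = 2(c - 1) for c = 19 classes).
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 0.5, (5, 36)),
                     rng.normal(4.0, 0.5, (5, 36))])
y_train = np.array([0] * 5 + [1] * 5)

# A linear kernel suits compact, already-discriminative fused features.
clf = svm.SVC(kernel="linear")
clf.fit(X_train, y_train)
```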

Data sets
To verify the feasibility of the DWMLFF method, four well-known public datasets are employed in our experiments: the UC Merced dataset [1], the WHU-RS19 dataset [44], the AID dataset [4], and the NWPU-RESISC45 dataset [45]. The training ratios of these datasets are similar to [27].
The UC Merced dataset is extracted from large optical images of the US Geological Survey National Map Urban Area Imagery. It includes 21 scene classes, and each category contains 100 RGB images of 256 × 256 pixels. The spatial resolution of these images is 1 foot per pixel. Figure 4 shows an example image for each category. It can be clearly seen that there are many similarities among 'forest', 'medium residential' and 'mobile home park', which makes them difficult to distinguish. The WHU-RS19 dataset contains 19 challenging categories exported from Google Earth; the size of the images is 600 × 600 pixels. The image samples in the same class are collected from different regions of satellite images with various resolutions, and contain different orientations, scales, and illumination. Some example images from this dataset are shown in Figure 5. The variable resolution of the images makes RSSC more challenging on WHU-RS19 than on the UC Merced dataset. The AID dataset, which is acquired from Google Earth, consists of 10,000 images covering 30 aerial scene types. The pixel resolution varies from about 0.5 to 8 m. The size of each image in AID30 is the same as that in RS19. The scene classes include: airport, bare land, baseball field, beach, bridge, center, church, commercial, dense residential, desert, farmland, forest, industrial, and so on. The images in the AID dataset are extracted at different times and seasons under different conditions, which increases the intra-class diversity of the data.
The NWPU-RESISC45 dataset (NR45) was created by the researchers of Northwestern Polytechnical University. It contains 31,500 RGB images, covering 45 scene categories with 700 images in each class. The scene classes include: airplane, airport, baseball diamond, basketball court, beach, bridge, chaparral, church, circular farmland, cloud, commercial area, and so on. The spatial resolution varies from 0.2 to 30 m for most of the scene classes. The size of each image in NR45 is the same as that in the UC Merced dataset.
According to the above analysis, the images of these datasets show many low-level features of ordinary optical images. The features extracted by the CNNs pre-trained on ImageNet can be used for scene classification of these datasets.

Experimental results of the UC Merced dataset
The recognition accuracy of different methods on the UC Merced dataset is shown in Table 1, in which the best results are highlighted in bold. The training ratio ranges from 2 to 10%, and the results show that the proportion of training samples affects the classification accuracy: a lower proportion of training samples leads to more difficult scene recognition and a lower recognition rate. Due to the limited number of classes in the dataset, the feature dimension of the DCA method is small, so the DCA method cannot obtain sufficient discriminant information. However, the DWMLFF method, which introduces frequency-band features from multiple subbands, outperforms both the unfused single features and the DCA method. The recognition accuracy of the DWMLFF method is more than 10% higher than that of the single-CNN method. The per-class recognition accuracies using the single-feature, DCA, and DWMLFF methods are shown in Figure 6, where the training ratio is 4% and the single feature is generated from AlexNet. It can be seen that the classification accuracies of most classes are enhanced by DWMLFF. However, DCA performs better than DWMLFF on the 'runway', 'intersection' and 'storage tanks' categories. A likely explanation is that the multi-subband low-frequency components of these images cannot provide enough additional effective features but instead increase redundancy, leading to degraded classification performance.

Experimental results of the WHU-RS19 data set
The classification accuracy of different methods on the WHU-RS19 dataset is presented in Table 2, with training ratios from 2 to 10%. The experimental results demonstrate that feature fusion between different networks can effectively improve classification performance. As shown in Table 2, the recognition performance of the DWMLFF method is obviously better than that of the DCA method and the single-feature method. For example, compared to ResNet 50, 'AlexNet & ResNet 50' fused by DWMLFF improves the overall accuracy by more than 11% under different training ratios. Compared to the DCA method and individual CNNs, the DWMLFF method has outstanding advantages in small-sample RSSC. Comparing Tables 1 and 2, it is worth mentioning that our DWMLFF method achieves a more significant gain on WHU-RS19 than on UC Merced. This can be explained by the fact that the WHU-RS19 dataset has fewer categories than the UC Merced dataset, and each class of WHU-RS19 contains fewer images. Figure 7 shows the per-class recognition accuracy of the different methods on the WHU-RS19 dataset. Obviously, for almost all categories of the WHU-RS19 dataset, the recognition accuracy of the fused methods is better than that of the single-feature method. Furthermore, the DWMLFF method outperforms the DCA and single-feature methods in all categories except the 'river' class. This is because the number of classes in this dataset is so small that the dimension of the feature fused by the DCA method is too low to provide sufficient discriminant information; the DWMLFF method, which introduces frequency-band features from multiple subbands, is superior to the DCA method. Table 2. Comparison with other methods on the WHU-RS19 data set (Bold indicates the best results).

Figure 7.
Per-class recognition performance on the WHU-RS19 data set based on single feature and fusion features.
To clearly observe the dispersion of the different methods on WHU-RS19, the image features are visualized. In Figure 8, features generated by the different methods on the WHU-RS19 dataset are visualized for comparison. The colors represent the different categories in the dataset, and each point represents the feature of one image.
As shown in Figure 8, the single features extracted from AlexNet or ResNet50 form overlapping, disordered clusters. In contrast, the fusion features generated by DCA and DWMLFF form clearly separated clusters. Compared to the DCA method, the DWMLFF method, which exploits the multi-subband information of the images, achieves a better representation, higher convergence within the same class, and greater separation among different classes.

Experimental results of the AID dataset and the NR45 dataset
The experimental results of different methods on AID and NR45 are shown in Tables 3 and 4, respectively. The training ratio is from 2 to 10%. Table 3. Comparison with other methods on the AID data set (Bold indicates the best results).
Similar to the above experiments, the classification performance of the DWMLFF method on the AID dataset and the NR45 dataset is significantly superior to that of the DCA method and the single-feature method. Comparing the classification results in Tables 1-4, it can be found that the DWMLFF method achieves a more significant gain on WHU-RS19 than on the other datasets, which indicates that our method has more obvious advantages with small samples and fewer categories. Table 4. Comparison with other methods on the NR45 data set (Bold indicates the best results).

Comparison with advanced methods
To effectively analyze the performance of our method, we compared the DWMLFF method with state-of-the-art methods. The accuracy comparisons on UC Merced and WHU-RS19 are shown in Tables 5 and 6, respectively, in which the best results are highlighted in bold. In Tables 5 and 6, the methods marked with '*' follow the 5-way K-shot setting, in which K labeled samples are used to recognize samples from 5 scene classes. For UC Merced, training ratios of 1 and 5% are added in the experiments to obtain a more comprehensive comparison. Table 5. Comparison with the advanced methods on the UC Merced data set. The methods with '*' indicate that the experiments are performed around the 5-way K-shot. As shown in Tables 5 and 6, the accuracy of the DWMLFF method is superior to that of most methods, and the fewer the training samples, the more advantageous the DWMLFF method is. Although the 'GLF + SRC' method and the '5-way K-shot' methods outperform our DWMLFF method in some cases, the DWMLFF method is simpler to implement and generates smaller feature dimensions.

Impact of multi-subband number
The number of subbands k can affect the final classification accuracy. To analyze the effect of k on classification performance and obtain better fusion results, experiments were conducted with the same configuration as before, except for k. Figure 9 shows that the number of DWT subbands affects the classification accuracy. Due to the larger size and higher resolution of WHU-RS19, the low-frequency components from multiple subbands contain richer information, resulting in a more significant gain on WHU-RS19 than on UC Merced. When the number of subbands is small, the feature dimension is tiny and cannot provide satisfactory semantic information, resulting in low classification accuracy. As k increases, the feature dimension also increases, and the classification accuracy improves. However, when k is too large, redundant information is introduced, and accuracy no longer improves significantly. It is worth mentioning that if k is greater than 3, the recognition accuracy decreases to varying degrees under training ratios of 2-4%. When the training ratio is set to 2%, there are only two images per category in the UC Merced dataset as training samples.
There are even fewer training images in the WHU-RS19 dataset, with only one shot per category. In this case, adding more frequency components of the training samples to obtain more complex feature information leads to overfitting, thus reducing the recognition accuracy. The dimension of the fusion features obtained by our method is closely related to the number of wavelet subbands. According to Eq (7), the maximum dimension of the final fused feature is 2(c − 1)(k + 1) = 36(k + 1) for the WHU-RS19 dataset (c = 19). The feature dimensions of the various methods on the WHU-RS19 dataset under the 4% training ratio are listed in Table 7. Compared with the single features extracted from pre-trained CNNs, the recognition accuracy of the DWMLFF method increases dramatically while the feature dimension decreases significantly. Compared with the DCA method, the feature dimension of the DWMLFF method increases slightly, but the classification accuracy is greatly improved. As can be seen from Figure 9 and Table 7, an appropriate value of k helps improve the classification performance, which demonstrates the advantage of making full use of the different frequency components from the wavelet subbands.

Conclusions
In this paper, a multi-subband feature fusion method, namely DWMLFF, is proposed for few-shot RSSC. To surmount the problem of insufficient features in few-shot RSSC, the DWT is employed as a decomposer to obtain the multi-subband information of limited samples. The original image and the LL subbands of different levels are fed into pre-trained CNN models, which improves the feature generation capability of the pre-trained models. To maximize the difference between categories, an improved DCA strategy is proposed, in which a distance coefficient is introduced to reconstruct the between-class scatter matrix; this helps adjust the distance between classes and avoid cross overlap in the mapping space. Finally, the features extracted from different CNNs and different frequency components are fused by the improved DCA.
The proposed method gives full play to the advantages of different wavelet subbands, and utilizes the improved DCA strategy to obtain low-dimensional, highly discriminative features for RSSC. The experimental results on four well-known datasets indicate that the proposed method achieves outstanding performance in RSSC with few training data, especially with one or two training samples per category. In the future, we will focus on the relationship between different modalities to achieve automatic and accurate classification, and apply it to post-disaster identification with limited training samples.