1 Introduction

Remote sensing scene classification is a challenging research problem, where the task is to associate each aerial image, comprising a variety of land cover types and ground objects, with its respective semantic scene category. The problem is important for understanding remote sensing image data and has many potential applications, including disaster monitoring, vegetation mapping, land resource management, urban planning, traffic supervision, and environmental monitoring. Most previous methods either rely on hand-crafted visual features [13, 24, 47, 77], such as color and shape, or on mid-level holistic image representations [11, 33, 44, 75, 76] constructed by encoding hand-crafted visual features. Recently, deep Convolutional Neural Networks (CNNs) have revolutionized computer vision by significantly advancing the state-of-the-art in many areas such as image classification [8, 20, 31, 38, 43, 64], object detection/segmentation [19, 28, 30, 40, 53, 57, 68, 69, 81] and action recognition [26, 48, 51, 63]. Similarly, deep learning techniques have also made an impact on satellite image analysis, including aerial scene classification [2, 32, 54, 73] and hyperspectral image analysis [14, 15, 66].

Generally, deep convolutional neural networks or CNNs take a fixed-sized image as input to a series of convolution, local normalization and pooling operations (termed as layers). The final layers of the convolutional neural network are fully connected (FC), and are typically used to extract deep features that are generic and used for a variety of vision applications [5], including remote sensing scene classification [54, 73]. The standard input to a deep convolutional neural network is RGB pixel values, with training performed on the large-scale ImageNet dataset. Most existing remote sensing scene categorization approaches employ these CNNs, pre-trained on the ImageNet dataset, as feature extractors. The exploration of different color spaces and their combination for remote sensing scene classification is still an open research problem. The work of [56] investigated different color spaces for vehicle color identification. The work of [23] explored YCbCr and RGB color channels for image super resolution. A collaborative facial color feature learning approach, combining multiple color spaces, was proposed by [41] for face recognition. Here, we investigate a variety of color features within a deep learning framework for remote sensing scene classification.

Prior to deep learning, the impact of multiple color features was well studied for object recognition and detection [3, 4, 39, 67]. The work of [67] studied the invariance properties of different color descriptors and showed that different color features are complementary and their combination provides a consistent improvement in overall classification performance. Khan et al. [39] proposed an attention-based framework to combine multiple color features with shape features. Within the deep learning framework, the complementary characteristics of these color features are yet to be investigated for remote sensing scene classification. Figure 1 shows visualizations of filter weights from the first convolutional layer of CNNs, trained from scratch on ImageNet, using different color features. Visualizations are shown for RGB, Opponent, YCbCr, and Lab based CNNs. The visualizations show that the corresponding activation maps for different color spaces have different responses due to variations in filter weights. Further, the image representations learned by CNNs trained on different color features lie in different feature subspaces that likely provide complementary information. In this work, we investigate the effectiveness of combining multiple color features, within a deep learning framework, for remote sensing scene classification.

As discussed above, the common strategy employed by most remote sensing scene classification methods is to extract deep features from the activations of the FC layers of a pre-trained deep convolutional neural network. However, such a strategy will encounter the inherent problem of a high-dimensional final image representation when combining activations from multiple deep color convolutional neural networks. Therefore, it is desired to obtain a compact final image representation without sacrificing the improvements obtained from the complementary characteristics of multiple deep color features. Recently, Khan et al. [37] proposed to use a divisive information theoretic clustering (DITC) technique [22] to combine heterogeneous texture descriptors for texture classification. Their work showed that a notable reduction in the dimensionality of the final image representation can be obtained using the DITC technique, without significant loss in classification performance. Motivated by this, we propose to use the DITC technique to compress the dimensionality of a multi-color deep representation for remote sensing scene classification. The DITC approach has previously been employed to compress the high-dimensionality of bag-of-words based spatial pyramids and hand-crafted heterogeneous texture representations [25, 37]. To the best of our knowledge, we are the first to investigate the DITC technique to compress deep multi-color image representations for scene classification in remote sensing images.

Fig. 1

Visualization of filter weights from different color features. Top row: RGB space-based deep convolutional neural network (left) and Opponent color space-based deep convolutional neural network (right). Bottom row: YCbCr color space-based deep convolutional neural network (left) and Lab color space-based deep convolutional neural network (right). All networks employ the same architecture. (Color figure online)

Contributions In this work, we study the problem of remote sensing scene classification with the following contributions.

  • We investigate the contribution of color in a deep learning framework for scene classification in remote sensing images. We further demonstrate the effectiveness of combining multiple color features within the deep learning framework. Furthermore, we propose the usage of an information theoretic compression approach to compress high-dimensional multi-color deep features into a compact image representation.

  • Comprehensive experiments are performed on several remote sensing datasets: UC-Merced with 21 scene classes, WHU-RS19 with 19 scene types, RSSCN7 with 7 categories, AID with 30 aerial scene classes, and NWPU-RESISC45 with 45 categories. The results of our experiments clearly demonstrate that combining multi-color deep features significantly improves the classification performance compared to standard RGB deep features alone. Furthermore, our results show that multi-color deep features can be efficiently compressed without any significant loss in classification performance. Finally, our compact multi-color deep features provide competitive classification results compared to existing remote sensing image classification approaches in the literature.

The rest of this paper is organized as follows. We present related work in Sect. 2. Our method is described in detail in Sect. 3. We present the remote sensing scene classification experiments and results in Sect. 4. Finally, our conclusions are drawn along with potential future research directions in Sect. 5.

2 Related Work

The impact of color features for remote sensing scene analysis has been extensively studied [11, 24, 59, 65, 76, 78]. The work of [76] investigated integrating color information within Gabor features for remote sensing scene classification. dos Santos et al. [24] evaluated a variety of hand-crafted color and texture feature description approaches for remote sensing image classification and retrieval. Chen et al. [11] performed an evaluation of local features, such as structure, texture and color for remote sensing scene classification. The work of [65] studied a variety of feature description baselines in different color spaces for remote sensing images.

Combining multiple hand-crafted color features has also been investigated in the literature [39, 60, 67]. The work of [67] investigated integrating color and shape features, within the bag-of-words framework, for object recognition. Their evaluation recommends employing opponent color features with the SIFT descriptor for object recognition and also showed the importance of fusing multiple color representations to achieve further improvement in classification performance. Khan et al. [39] proposed an approach where multiple hand-crafted color features are employed to modulate shape features for object recognition. The work of [1] proposed an approach to combine color models that are learned to achieve color invariance for object detection. The work of [21] investigated the impact of color information for point set registration. Different to these previous works using hand-crafted features, recent works have also investigated combining multiple color features within a deep learning framework for face recognition [41], image super-resolution [23] and vehicle classification [56]. However, to our knowledge, the effectiveness of combining multiple color features within a deep learning framework is yet to be investigated for remote sensing scene analysis.

In recent years, several deep learning-based approaches have been introduced for remote sensing scene classification. The work of [54] evaluated off-the-shelf CNN features and compared their performance with low-level descriptors for remote sensing scene classification. Marmanis et al. [49] proposed a two-stream approach where pre-trained CNN features are first used to represent the images. Then, the extracted representations are input to shallow CNN classifiers. The work of [72] introduced a hybrid architecture where a multi-column stacked denoising sparse auto-encoder is combined with Fisher vectors to learn features in a hierarchical manner for land use scene classification. Yan et al. [74] proposed an approach based on an improved category-specific codebook using kernel collaborative representation based classification, which is integrated with spatial pyramid matching. Their approach then employed an SVM classifier to classify remote sensing images. The work of [73] introduced a large-scale remote sensing scene classification dataset and also evaluated several pre-trained CNNs on their large dataset. Cheng et al. [17] introduced a method based on bag of convolutional features where CNN features are employed in place of hand-crafted local features to construct a bag-of-words based image representation. The work of [2] investigated a fusion approach where standard RGB deep features are combined with LBP-based deep texture features to classify remote sensing images. Chen et al. [10] proposed a CNN based approach under the guidance of a human visual attention mechanism. In their approach, a computational visual attention model is utilised to extract salient regions. Afterwards, sparse filters are employed for learning features from the extracted regions. A superpixel-guided layer-wise embedding CNN based approach was introduced by [46] to exploit information from both labeled and unlabeled examples. The work of [29] introduced a center-based structured metric learning approach where both deep metrics and center points are taken into account to penalize pairwise correlation and class-wise information between categories. Most of these approaches employ CNN models trained using RGB patches as input. Different to these approaches based on the de facto practice of using RGB patches for CNN training, we investigate the contribution of color in a deep learning framework and demonstrate the effectiveness of integrating multiple deep color features. Our extensive experiments on five benchmarks demonstrate the effectiveness of combining multiple deep color features for remote sensing scene classification.

3 Our Approach

Here, we first discuss the motivation behind the proposed approach. Then, we describe how the deep color models are constructed. Afterwards, we investigate the fusion of deep color features and counter the problem of the high-dimensionality of the fused deep color features for classification.

Motivation As discussed earlier, most existing state-of-the-art remote sensing image classification methods are based on CNNs. Here, CNNs are generally pre-trained on a large-scale generic object recognition dataset (ImageNet) using raw RGB pixel values as input. Previously, combining multiple hand-crafted local color features has been investigated, within the bag-of-words framework, for object recognition. Motivated by these previous works, we investigate the contribution of color within a deep learning framework (CNNs) and demonstrate the impact of integrating multiple deep color features for remote sensing scene analysis. To the best of our knowledge, we are the first to investigate the effectiveness of integrating multiple color features, within a deep learning framework (CNNs), for remote sensing scene classification.

3.1 Deep Color Convolutional Neural Networks

Most existing state-of-the-art remote sensing image classification methods are based on deep models. These deep models are generally pre-trained on a large-scale generic object recognition dataset (ImageNet) using raw RGB pixel values as input. Here, we analyze a variety of color features within a deep learning framework to evaluate their impact on remote sensing scene classification. We investigate the contribution of color in a standard deep convolutional neural network (CNN) architecture [8, 43, 64].

To analyze the impact of color for remote sensing image classification, we employ a variety of color features popular in object recognition. The selected color features are based on different color space transformations: HSV, YCbCr, Opponent, C, Lab, and color names. Furthermore, the motivations behind these color representations range from photometric invariance to discriminative power.

RGB In this work, we use the standard three-channel RGB color space as the baseline.

HSV In the HSV color space, the hue (H) component is scale-invariant and shift-invariant with respect to light intensity [67]. The HSV color space has been previously investigated with the hand-crafted SIFT descriptor for scene recognition [7].

YCbCr In the YCbCr color space, Y is the luminance component and Cb and Cr are the blue-difference and red-difference chroma components. This color space is approximately perceptually uniform and has been used previously for remote sensing images [34, 65].

Lab The three dimensions of the Lab color space correspond to L for lightness and a and b for the color components green–red and blue–yellow. This color space is perceptually uniform implying that colors at an equal distance are also perceptually equally far apart.
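For concreteness, a minimal sketch of how the HSV, YCbCr and Lab network inputs can be obtained from an RGB image is given below. This is our own illustration (not the authors' code), assuming OpenCV and NumPy; note that OpenCV returns the luminance/chroma channels in Y, Cr, Cb order and scales H to [0, 179] for 8-bit input, so channels may need reordering or rescaling depending on the training pipeline.

```python
import cv2
import numpy as np

def to_color_spaces(rgb_u8):
    """Map an 8-bit RGB image (H x W x 3) to the color spaces used as CNN inputs."""
    return {
        "rgb":   rgb_u8.astype(np.float32),
        "hsv":   cv2.cvtColor(rgb_u8, cv2.COLOR_RGB2HSV).astype(np.float32),
        # OpenCV returns channels in Y, Cr, Cb order.
        "ycbcr": cv2.cvtColor(rgb_u8, cv2.COLOR_RGB2YCrCb).astype(np.float32),
        "lab":   cv2.cvtColor(rgb_u8, cv2.COLOR_RGB2Lab).astype(np.float32),
    }

# Example usage on a dummy 224 x 224 aerial patch.
patch = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
print({name: img.shape for name, img in to_color_spaces(patch).items()})
```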

Opponent In this color space, the \(O_{1}\) and \(O_{2}\) channels encode the color information in the image. The \(O_{3}\) channel describes the intensity information. The image is transformed as in [45]:

$$\begin{aligned} \begin{pmatrix} O_{1} \\ O_{2} \\ O_{3} \end{pmatrix} = \begin{pmatrix} \frac{1}{\sqrt{2}} &{} -\frac{1}{\sqrt{2}} &{} 0 \\ \frac{1}{\sqrt{6}} &{} \frac{1}{\sqrt{6}} &{} \frac{-2}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} &{} \frac{1}{\sqrt{3}} &{} \frac{1}{\sqrt{3}} \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix}. \end{aligned}$$
(1)

The opponent color representation possesses invariance with respect to specularities. In the evaluation performed by van de Sande et al. [67], the opponent color space in conjunction with the hand-crafted SIFT feature descriptor, was shown to provide improved results for visual object recognition.

C The C representation, defined as \(C = \left( \frac{O_{1}}{O_{3}},\; \frac{O_{2}}{O_{3}},\; O_{3} \right) ^{\mathrm{T}}\), aims at adding photometric invariance with respect to shadow-shading to the opponent representation. The invariance is achieved by normalizing the first two dimensions with the luminance channel \(O_{3}\). The C representation was initially proposed by [27] and later employed with the SIFT descriptor by [67].
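The opponent transform of Eq. (1) and the C representation can be implemented per pixel as in the following NumPy sketch (our own illustration; the small `eps` guarding division by zero in the C normalization is an addition of this sketch, not part of the original definition).

```python
import numpy as np

def rgb_to_opponent(rgb):
    """Opponent transform of Eq. (1); rgb is a float array of shape (H, W, 3)."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    o1 = (R - G) / np.sqrt(2.0)
    o2 = (R + G - 2.0 * B) / np.sqrt(6.0)
    o3 = (R + G + B) / np.sqrt(3.0)
    return np.stack([o1, o2, o3], axis=-1)

def opponent_to_c(opp, eps=1e-6):
    """C representation: O1 and O2 normalized by the luminance channel O3."""
    o1, o2, o3 = opp[..., 0], opp[..., 1], opp[..., 2]
    return np.stack([o1 / (o3 + eps), o2 / (o3 + eps), o3], axis=-1)
```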

Color names Most of the aforementioned color representations aim at providing specific photometric invariance properties. Different to these representations, color names are linguistic color labels assigned by humans to describe colors in the world, and their use involves the assignment of linguistic color labels to image pixels. A linguistic study by [6] identified that the 11 basic color terms of the English language are white, blue, grey, brown, orange, green, red, black, purple, yellow, and pink. The work of [70] proposed an approach to automatically learn color names from images retrieved with Google image search; the resulting descriptor is based on the 11 basic color terms. The color names representation CN [70] is defined as a feature vector comprising the probability of each color name for an image Img:

$$\begin{aligned} CN = \left\{ p\left( cns_{1}|Img\right) , p\left( cns_{2}|Img\right) ,\ldots , p\left( cns_{11}|Img\right) \right\} \end{aligned}$$
(2)

with

$$\begin{aligned} p\left( cns_{j}|Img\right) = \frac{1}{P} \sum _{x,y\in Img} p\left( cns_{j}|\mathbf{f }\left( x,y\right) \right) \end{aligned}$$
(3)

Here, \(cns_{j}\) is the j-th color name, \(\mathbf{f} = \left\{ L^{*}, a^{*}, b^{*}\right\} \), and \((x,y)\) are the spatial coordinates of the P pixels in the image Img. Further, \( p\left( cns_{j}|\mathbf{f }\right) \) is the probability of a color name given a pixel value, computed from an image dataset collected from Google. Since the images are acquired from the web, the issue of retrieving noisy images is addressed by employing the PLSA approach [70]. Figure 2 shows the proposed color fusion in a deep convolutional neural network architecture. We use the same architecture to train all deep color convolutional neural networks, and each deep color network is trained separately. The details of the underlying network architecture are provided in Sect. 4.1.
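A sketch of how the image-level color names descriptor of Eqs. (2)-(3) can be computed is given below. It assumes a precomputed lookup table `w2c` of shape (32, 32, 32, 11) that maps quantized pixel values to the probabilities of the 11 color names (the mapping released with [70] is indexed by quantized RGB values); the table and its exact indexing are assumptions of this illustration, not part of the paper.

```python
import numpy as np

def color_names_descriptor(rgb_u8, w2c):
    """Image-level color names descriptor CN of Eqs. (2)-(3).

    rgb_u8 : uint8 RGB image of shape (H, W, 3).
    w2c    : assumed (32, 32, 32, 11) table with p(cns_j | pixel value),
             e.g. the mapping learned from Google images in [70].
    """
    # Quantize each channel into 32 bins and look up per-pixel probabilities.
    idx = (rgb_u8 // 8).reshape(-1, 3)                  # (P, 3) bin indices
    per_pixel = w2c[idx[:, 0], idx[:, 1], idx[:, 2]]    # (P, 11) probabilities
    # Eq. (3): average the per-pixel probabilities over the P pixels.
    return per_pixel.mean(axis=0)                       # (11,) descriptor CN
```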

Fig. 2

The proposed color fusion in a deep convolutional neural network architecture. Here, the function f(RGB) denotes the mapping of RGB pixel values to another color-space representation. Deep color models are trained separately, from scratch, using different color spaces. We extract activations from the FC6 and FC7 layers of each deep color network, respectively. These activations are then concatenated (indicated by a ‘plus’) to be used as image features and input to a classifier. See text for more details. (Color figure online)

3.2 Compact Deep Color Features

As discussed earlier, the final layers of the deep convolutional neural network (FC layers) are typically employed to extract deep features since they are generic and have previously been used for a variety of vision applications [5], including remote sensing scene classification [54, 73]. Here, we extract 4096 dimensional activations from the FC7 (second last) and FC6 (third last) layers of each deep color network, respectively. These activations are then concatenated and used as image features, \(D =[d_{c1},d_{c2},d_{c3},\ldots ,d_{cn}]\). However, the combination of these activations from multiple deep color convolutional neural networks has the disadvantage of being high-dimensional (more than 57k in size) for each image. Here, we propose to use an information theoretic compression approach (DITC) [22] to compress the high-dimensional multi-color deep representation. The DITC algorithm works by discriminatively learning a pre-determined compact representation through minimizing the loss in mutual information between clusters and the class labels of training samples. The approach operates on the class-conditional distributions over the deep multi-color image representation, estimated as the probability distributions \(p\left( R|d\right) \), where \(R=\{r_1,r_2,\ldots ,r_{CL}\}\) is the set of CL classes. The approach then estimates the drop in mutual information MI between the combined deep color representation D and the category labels R. The high-dimensional deep multi-color image representation is then transformed to a compact representation \(D^R=\{D_1,D_2,\ldots ,D_J\}\) (where each \(D_j\) represents a collection of bins in the original uncompressed high-dimensional representation) as

$$\begin{aligned} \varDelta MI&= MI\left( R;D \right) - MI\left( R;D^R \right) \nonumber \\&= \sum \limits _{j = 1}^{J} \sum \limits _{d \in D_j } p\left( d \right) \, KL\left( p(R|d),\, p(R|D_j)\right) , \end{aligned}$$
(4)

where KL is the Kullback-Leibler (KL) divergence. The Kullback-Leibler divergence between the two distributions is defined by

$$\begin{aligned} KL(p_{1},p_{2})=\sum _{y\in {Y}} p_{1}(y) \log \frac{ p_{1}(y) }{ p_{2}(y) }. \end{aligned}$$
(5)

It is worth mentioning that the category information is exploited using only the training samples. The high-dimensional deep multi-color image representation is compressed by merging the bins, over the classes, with similar discriminative powers. We refer to [22] for additional details of the DITC algorithm.
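To make the compression step concrete, the sketch below groups the dimensions of the fused representation by minimizing a simplified form of the objective in Eq. (4): each dimension d is described by its estimated class-conditional distribution p(R|d), and dimensions are iteratively reassigned to the cluster whose distribution p(R|D_j) is closest in KL divergence. The estimator of p(R|d) from mean non-negative training activations, the random initialization, and the unoptimized memory use are simplifications of this illustration; the original DITC algorithm [22] uses a divisive (split-based) procedure.

```python
import numpy as np

def ditc_compress(X_train, y_train, n_clusters=8192, n_iter=20, eps=1e-12):
    """Simplified DITC-style grouping of feature dimensions (illustrative sketch).

    X_train : (N, D) non-negative training features (fused color CNN activations).
    y_train : (N,) integer class labels in {0, ..., CL-1}.
    Returns : (D,) array assigning each original dimension to one of n_clusters groups.
    """
    classes = np.unique(y_train)
    # Class-conditional distributions p(R|d): mean activation per class, normalized.
    M = np.stack([X_train[y_train == c].mean(axis=0) for c in classes], axis=1)  # (D, CL)
    p_r_d = (M + eps) / (M + eps).sum(axis=1, keepdims=True)
    p_d = X_train.sum(axis=0) + eps
    p_d /= p_d.sum()                                      # prior over dimensions

    D = p_r_d.shape[0]
    assign = np.random.randint(0, n_clusters, size=D)     # random initialization
    for _ in range(n_iter):
        # Cluster distributions p(R|D_j): p(d)-weighted average of member distributions.
        centroids = np.full((n_clusters, len(classes)), eps)
        for j in np.unique(assign):
            w = p_d[assign == j][:, None]
            centroids[j] = (w * p_r_d[assign == j]).sum(axis=0) / w.sum()
        centroids /= centroids.sum(axis=1, keepdims=True)
        # Reassign each dimension to the cluster minimizing KL(p(R|d) || p(R|D_j)).
        # The argmin of KL equals the argmin of the cross-entropy term.
        cross_entropy = -p_r_d @ np.log(centroids).T      # (D, n_clusters)
        assign = cross_entropy.argmin(axis=1)
    return assign

def compress(X, assign, n_clusters):
    """Sum the original dimensions within each group to obtain the compact features."""
    Xc = np.zeros((X.shape[0], n_clusters), dtype=X.dtype)
    for j in range(n_clusters):
        members = assign == j
        if members.any():
            Xc[:, j] = X[:, members].sum(axis=1)
    return Xc
```

In this sketch the group count plays the role of the target dimensionality (e.g. 8k), and the assignment learned on training features is applied unchanged to test features.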

4 Experimental Results

4.1 Experimental Setup

We first describe the underlying deep convolutional neural network architecture employed to obtain our deep color models. The deep convolutional neural network is based on the VGG architecture and is similar to [43]. It consists of 5 convolutional layers (C1, C2, C3, C4, and C5) and 3 fully-connected (FC) layers (FC6, FC7 and FC8). The network takes as input an image of size \(224 \times 224\). Throughout our experiments, images are resized to \(224 \times 224\) pixels and then input to the network. The first convolutional layer C1 contains 64 convolutional filters with a filter size of \(11 \times 11\). The convolution stride is set to 4, followed by max-pooling with a downsampling factor of 2. The second convolutional layer C2 comprises 256 convolutional filters with a filter size of \(5 \times 5\). The convolution stride is set to 1, spatial padding is set to 2, and max-pooling with a downsampling factor of 2 is applied. For the third, fourth and fifth convolutional layers C3, C4 and C5, the number of convolutional filters is 256, the filter size is \(3 \times 3\), and the convolution stride and spatial padding are 1. For the fifth convolutional layer, a max-pooling downsampling factor of 2 is employed. Furthermore, the first two FC layers (FC6 and FC7) are regularised using dropout [43] with the dropout ratio set to 0.5. Finally, the last FC layer (FC8) is a multi-class soft-max classifier. Other than the FC8 layer, the activation function for the rest of the weight layers is the Rectified Linear Unit (ReLU) [35, 43, 50].
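A PyTorch sketch of the described architecture is given below for reference. It is our own illustration rather than the authors' implementation: layer sizes follow the text (for a 224 x 224 input the feature map reaches 6 x 6 x 256 before FC6), while details such as local response normalization and the exact pooling windows are simplified and should be taken from [43].

```python
import torch
import torch.nn as nn

class DeepColorNet(nn.Module):
    """Sketch of the VGG-F style network described above (5 conv + 3 FC layers).
    One such network is trained per color space."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4), nn.ReLU(inplace=True),    # C1
            nn.MaxPool2d(2),
            nn.Conv2d(64, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),  # C2
            nn.MaxPool2d(2),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True), # C3
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True), # C4
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True), # C5
            nn.MaxPool2d(2),
        )
        self.fc6 = nn.Sequential(nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5))
        self.fc7 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5))
        self.fc8 = nn.Linear(4096, num_classes)  # multi-class soft-max classifier (with CE loss)

    def forward(self, x):
        x = self.features(x).flatten(1)
        f6 = self.fc6(x)
        f7 = self.fc7(f6)
        return self.fc8(f7), f6, f7  # logits plus the FC6/FC7 activations used as features
```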

We train all deep color convolutional neural networks, described in Sect. 3.1, from scratch on the ImageNet 2012 training set. We use the same set of hyper-parameters as in [43] during network training in our experiments. For all CNN training, the learning rate is set to 0.001 and the momentum is set to 0.9. The initial learning rate is decreased by a factor of 10 when the validation error stops decreasing. We initialize the layers from a Gaussian distribution with zero mean and variance equal to \(10^{-2}\). Similar data augmentation as in [8], in the form of random crops, horizontal flips, and RGB color jittering, is employed during training. For a fair comparison, we train the baseline standard RGB network by increasing the depth of the network architecture by a factor of seven, resulting in the same number of network parameters as our color fusion. Furthermore, the pre-trained deep color convolutional neural networks are employed as feature extractors by extracting 4096 dimensional activations from the FC7 and FC6 layers as image features. All image features are \(\mathrm {L_{2}}\)-normalized and input to a one-versus-all linear SVM classifier.
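The feature extraction and classification stage can then be sketched as follows. The `DeepColorNet` interface from the previous sketch and the SVM regularization setting are illustrative assumptions, not the exact settings of the paper.

```python
import numpy as np
import torch
from sklearn.svm import LinearSVC

@torch.no_grad()
def extract_color_fusion_features(nets, images_by_space):
    """Concatenate L2-normalized FC6/FC7 activations from each deep color network.

    nets            : dict mapping a color space name to a trained DeepColorNet.
    images_by_space : dict mapping the same names to (N, 3, 224, 224) tensors,
                      i.e. the same images transformed into each color space.
    """
    feats = []
    for name, net in nets.items():
        net.eval()
        _, f6, f7 = net(images_by_space[name])
        for f in (f6, f7):
            f = f.cpu().numpy()
            f /= (np.linalg.norm(f, axis=1, keepdims=True) + 1e-12)  # L2 normalization
            feats.append(f)
    return np.concatenate(feats, axis=1)  # (N, n_spaces * 2 * 4096)

# One-versus-all linear SVM on the fused representation (illustrative settings):
# clf = LinearSVC(C=1.0).fit(X_train, y_train)
# pred = clf.predict(X_test)
```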

Throughout all experiments, the classification results are reported in terms of average recognition accuracy over all scene categories in each remote sensing scene classification dataset. From the classifier, the scene class label providing the highest confidence is assigned to the test image. As in [16, 73], each dataset is randomly split into training and test sets for performance evaluation. For all datasets, the ratio of training to test images is set to 50:50, where images are randomly selected from each aerial scene category. To obtain a reliable performance comparison, we repeat the evaluation procedure ten times. The final classification results are then reported as the mean over these ten runs together with the standard deviation.
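A minimal sketch of this evaluation protocol (repeated stratified 50:50 splits, mean and standard deviation over ten runs) is shown below; the SVM settings are illustrative, and the sketch reports overall accuracy over all test images, with per-class averaging being a straightforward variant.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import LinearSVC

def evaluate_overall_accuracy(X, y, train_ratio=0.5, n_runs=10, seed=0):
    """Mean and std of overall accuracy over repeated random train/test splits."""
    splitter = StratifiedShuffleSplit(n_splits=n_runs, train_size=train_ratio,
                                      random_state=seed)
    scores = []
    for tr, te in splitter.split(X, y):
        clf = LinearSVC(C=1.0).fit(X[tr], y[tr])
        scores.append((clf.predict(X[te]) == y[te]).mean())
    return float(np.mean(scores)), float(np.std(scores))
```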

Fig. 3

Example images from the five high resolution remote sensing datasets used in our experiments. The top row, second row, third row, fourth row and bottom row contain example images from UC-Merced, WHU-RS19, RSSCN7, AID and NWPU-RESISC45, respectively

4.2 Datasets

We conduct experiments on multiple datasets (see Fig. 3).

UC-Merced is a commonly used remote sensing dataset [76] that is publicly available and comprises 2100 images. There are 21 classes in this dataset. Some of the scene categories in this dataset are: agriculture, golf course, baseball diamond, dense residential, medium density residential, forest, river, sparse residential, overpass, parking lot, storage tanks, and tennis courts. The images in the dataset are cropped to 256 \(\times \) 256 pixels and are collected from 20 regions across the USA.

WHU-RS19 is a public dataset [62] with 950 aerial images acquired from Google Earth imagery. There are 50 samples per scene class in the dataset whereas the images are of size 600 \(\times \) 600 pixels. There are 19 aerial scene categories in this dataset. Some of the scene categories in the dataset are: airport, meadow, pond, parking, port, beach, bridge, river, railway station, viaduct, commercial area, desert, farmland, industrial area, and park. The dataset poses several challenges due to scale and illumination variations.

RSSCN7 is a dataset [84] with seven aerial scene categories: farmland, grassland, forest, industrial region, parking lot, residential region, and river and lake. The dataset was released in 2015 and is publicly available. Each aerial scene category contains 400 images. The images in the dataset are of size 400 \(\times \) 400 pixels, sampled at four different scales.

AID is a large scale public dataset [73] with 30 classes and 10,000 images. The dataset consists of 30 aerial scene categories. Some of the scene categories in the dataset are: playground, sparse residential, medium residential, bare land, center, desert, farmland, mountain, park, parking, forest, resort, school, church, square, river, storage tanks, and viaduct. The images in the AID dataset are collected from different countries, including China, USA, Germany, France, UK, and Italy.

NWPU-RESISC45 is another large scale public dataset [16] with 31,500 images, having 700 images per category. The images in the dataset are of size 256 \(\times \) 256 pixels. The dataset comprises 45 aerial scene categories where some of the classes in the dataset are: airplane, railway, railway station, bridge, church, stadium, sparse residential, forest, ship, terrace, freeway, storage tank, golf course, lake, ground track field, baseball diamond, mountain, parking lot, wetland, river, and roundabout.

Table 1 Comparison of different deep color features (overall accuracy (OA) in \(\%\)) on the UC-Merced, WHU-RS19, RSSCN7, AID, and NWPU-RESISC45 datasets

Here, we evaluate the different deep color features, described in Sect. 3.1, on the five datasets. In all cases, we employ activations from the FC7 layer of the CNNs as deep color features. For the color fusion, we concatenate all the deep color features, resulting in a 28,672-dimensional feature vector. Table 1 shows the comparison of deep color features on the UC-Merced, WHU-RS19, RSSCN7, AID, and NWPU-RESISC45 datasets. On the UC-Merced dataset, the baseline approach provides a mean recognition rate of \(94.7\%\). Image features from the color names and HSV based CNNs achieve mean classification scores of \(93.7\%\) and \(93.8\%\), respectively. Deep features from the C and Lab color space based CNNs provide mean recognition scores of \(93.6\%\) and \(93.9\%\), respectively. Deep features from the opponent color space-based deep convolutional neural network provide an average classification accuracy of \(94.5\%\). Furthermore, the proposed deep color feature fusion significantly improves the classification performance, achieving a mean recognition score of \(96.3\%\). The proposed deep color feature fusion provides an absolute gain of \(+1.6\%\) in terms of classification performance compared to the baseline standard RGB deep features.

On the WHU-RS19 dataset, the baseline (RGB) network provides an average recognition rate of \(96.0\%\). Deep features from the color names, C and HSV color space based CNNs achieve mean recognition scores of \(95.1\%\), \(94.7\%\) and \(94.4\%\), respectively. Furthermore, image features from the YCbCr and Lab color space based CNNs provide mean recognition rates of \(95.4\%\) and \(95.0\%\), respectively. On this dataset, deep features from the opponent color space based deep convolutional neural network achieve similar performance with a mean classification score of \(96.0\%\), compared to the baseline RGB features (\(96.0\%\)). Moreover, the combined set of deep color features improves the classification performance with an absolute gain of \(+1.4\%\), compared to the baseline standard RGB deep features. Similarly, on the RSSCN7 dataset, deep features from the opponent color space-based deep convolutional neural network provide similar classification results with a mean recognition rate of \(89.4\%\), compared to the baseline RGB features (\(89.5\%\)). Furthermore, the classification performance is improved by employing the combined set of deep color features, which obtains an average recognition accuracy of \(92.3\%\).

We also evaluate different deep color features on two recently introduced large scale AID and NWPU-RESISC45 datasets. On the AID dataset, the baseline standard RGB color space-based deep convolutional neural network achieves an average classification score of \(90.3\%\). The deep features from most other color spaces provide slightly inferior results compared to the standard RGB. However, deep features from the opponent color space-based deep convolutional neural network again provide similar performance, with an average classification accuracy of \(89.9\%\), compared to the baseline RGB features. Furthermore, the proposed deep color feature fusion significantly improves the classification performance, with an absolute gain of \(+3.1\%\) in terms of classification performance, compared to the baseline standard RGB deep features. Finally on the NWPU-RESISC45 dataset, the baseline standard RGB deep network provides a mean recognition rate of \(85.7\%\). Image features from the color names, HSV and C based deep convolutional neural networks (CNNs) obtain mean classification scores of \(83.2\%\), \(83.1\%\) and \(82.7\%\), respectively. Deep features from the YCbCr and Lab color space based deep convolutional neural networks (CNNs) achieve average classification scores of \(84.3\%\) and \(83.0\%\), respectively. The proposed deep color feature fusion provides significant improvement in classification performance with an absolute gain of \(+4.3\%\), compared to the baseline standard RGB deep features.

Figure 4 shows a per-category recognition performance comparison between the deep color feature fusion and the baseline RGB deep convolutional neural network on the NWPU-RESISC45 dataset. The combined set of deep color features provides consistent improvements on 43 out of 45 aerial scene categories compared to the baseline RGB features. Particularly significant gains in classification performance are achieved for the tennis court (\(+18\%\)), palace (\(+15\%\)), commercial area (\(+12\%\)), medium residential (\(+11\%\)), and basketball court (\(+11\%\)) aerial scene categories.

Table 2 Comparison (overall accuracy (OA) in \(\%\)) of the baseline RGB and deep color feature fusion when using convolutional layers (Conv) and FC layers (FC)

4.3 Deep Color Features Evaluation

Fig. 4

Category-specific classification performance comparison (accuracy in \(\%\)) of the proposed method, compared to the baseline RGB deep features, on the large scale NWPU-RESISC45 dataset. Our deep color feature fusion improves the classification performance on 43 out of 45 aerial scene categories. Notably, a significant gain in classification accuracy is achieved for tennis court (\(+18\%\)), palace (\(+15\%\)), commercial area (\(+12\%\)), medium residential (\(+11\%\)), and basketball court (\(+11\%\)) aerial scene classes, compared to the baseline (standard RGB). (Color figure online)

We also perform a comparison between convolutional features (Conv1, Conv2, Conv3, Conv4 and Conv5) and FC features (FC6 and FC7). Table 2 shows the comparison for both the baseline RGB and our color fusion. In all cases, superior classification results are obtained using features from FC layers. Note that no significant improvement in performance is observed when combining convolutional and FC layer features.

To summarize, the deep color feature fusion provides consistent improvements on all five datasets, compared to the baseline RGB features. It is worth mentioning that the most considerable gains in performance are obtained on large-scale AID and NWPU-RESISC45 datasets. These results suggest that different deep color features possess complementary information as their combination leads to a significant performance boost for remote sensing scene classification. Furthermore, superior results are obtained using features from the FC layers, compared to convolutional layer features.

4.4 Compact Deep Color Features

As demonstrated above, the combined set of deep color features always improves the classification performance compared to the baseline RGB. However, this gain in classification performance comes at the cost of high-dimensionality. When fusing deep color features from the FC6 and FC7 layers of the networks, the resulting dimensionality becomes significantly higher (57K). To tackle this issue, we evaluate the compression of deep color feature fusion using the approach described in Sect. 3.2.

Table 3 shows the results obtained when compressing the combined set of deep color features using the DITC approach. The final dimensionality of the compact deep color fusion image representation is fixed to 8k so that it is comparable to the dimensionality of the standard RGB deep features commonly employed for classification. The DITC compression approach compresses the combined set of deep color features from 57k to 8k without any substantial deterioration in classification accuracy on all datasets. In the case of UC-Merced and WHU-RS19, there is even a slight improvement in performance when compressing the combined set of deep color features, indicating an increase in discriminative power obtained by removing redundancy. In the case of the RSSCN7, AID and NWPU-RESISC45 datasets, there is a marginal reduction in accuracy compared to the original combined set of deep color features. In all cases, the compact deep color feature fusion significantly reduces the dimensionality without sacrificing the classification accuracy.

We additionally analyze the extreme compression of the deep color feature fusion and compare it with several commonly used dimensionality reduction techniques: principal component analysis (PCA), partial least squares (PLS) and diffusion maps (DM). Among these existing approaches, PLS is a category-aware statistical dimensionality reduction technique that models relations between sets of observations by means of latent variables. We perform the comparison to obtain very low-dimensional (100 to 500 dimensional) deep color feature fusion based image representations. Figure 5 shows the results of extreme compression (down to 100 dimensions) on the UC-Merced dataset. The DITC compression technique provides superior classification results even in the case of extreme compression of the deep color feature fusion based image representation. Figure 6 shows the results on the NWPU-RESISC45 dataset.
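For reference, the PCA and PLS baselines can be obtained with scikit-learn as sketched below; this is our own illustration, and the diffusion maps baseline and the exact settings used in the comparison are not reproduced here. PLS is made category-aware by fitting its latent variables against one-hot class labels.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

def pca_compress(X_train, X_test, dim=100):
    """Unsupervised baseline: project fused features onto the top principal components."""
    pca = PCA(n_components=dim).fit(X_train)
    return pca.transform(X_train), pca.transform(X_test)

def pls_compress(X_train, y_train, X_test, dim=100, n_classes=None):
    """Category-aware baseline: PLS latent variables fitted against one-hot labels."""
    n_classes = n_classes or int(y_train.max()) + 1
    Y = np.eye(n_classes)[y_train]                       # one-hot encoding of labels
    pls = PLSRegression(n_components=dim).fit(X_train, Y)
    return pls.transform(X_train), pls.transform(X_test)
```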

Table 3 Comparison (overall accuracy (OA) in \(\%\)) between the deep color feature fusion and the compact color fusion representation
Fig. 5

Comparison of different compression techniques in terms of classification performance (overall accuracy (OA) in \(\%\)) on the UC-Merced dataset. The DITC based compression technique achieves the best results when performing extreme compression (100 dimensions from the original 57K) of the combined set of deep color features

Fig. 6

Comparison of different compression techniques in terms of classification performance (overall accuracy (OA) in \(\%\)) on the NWPU-RESISC45 dataset. The DITC based compression technique obtains favorable results when employing extreme compression (100 dimensions from the original 57K) of the combined deep color features

4.5 State-of-the-art Performance Comparison

Finally, we compare our compact deep color feature fusion representation with state-of-the-art methods in the literature. Table 4 shows the results on the five remote sensing scene classification datasets. For fair comparison, we adopt the same sampling setting as [32, 62, 76], taking 80 images per class for training for the UC-Merced dataset. In the case of the WHU-RS19 dataset, 30 images per aerial scene category are used for training and the rest for testing. For the RSSCN7 and AID datasets, 50 images per aerial scene category are employed for training. Furthermore, 20 images per class are employed for training in the case of the NWPU-RESISC45 dataset.

On the UC-Merced dataset, the work of [76] proposed an approach that extends the bag-of-visual-words (BOVW) framework with the spatial co-occurrence kernel, achieving an average classification accuracy of \(77.7\%\). In their work, the integration of color features within a Gabor representation was also investigated, leading to a mean recognition rate of \(80.5\%\). The impact of texture information on remote sensing scene classification has been investigated by previous works [9, 58, 82]. One such texture description based on multi-scale completed LBP features achieved an average classification accuracy of \(90.6\%\). A pyramidal co-occurrence feature representation, accounting for both photometric and geometric aspects of an image, was proposed by [75], achieving a classification accuracy of \(77.4\%\). With the recent advent of deep features, a considerable jump in classification performance has been observed. The work of [72] proposed deep filter banks based on CNNs and obtained a classification accuracy of \(92.7\%\). Previous works have also investigated transferring pre-trained deep features from both the FC and the convolutional (conv) layers of the CNNs. The work of [72] investigated the transferability of deep CNNs with respect to both FC and convolutional layers. In the case of FC layers (Case I: FC features), their approach achieved an accuracy of \(96.8\%\), whereas a mean recognition rate of \(96.9\%\) is obtained when using features from convolutional layers (Case II: Conv features) in conjunction with a VLAD encoding strategy. The work of [55] proposed a multi-scale deeply described correlations-based model and achieved an accuracy of \(96.9\%\). Our proposed approach, while being compact, achieves an average classification accuracy of \(98.1\%\).

Table 4 State-of-the-art performance comparison (overall recognition rate in %) on five datasets used in our evaluation. Our approach, while being compact, provides favorable results compared to the existing works in literature

On the WHU-RS19 dataset, the work of [74] based on a class-specific vocabulary employing kernel collaborative representation obtained an average classification accuracy of \(93.7\%\). Among the deep learning approaches, the CaffeNet model achieved an average classification accuracy of \(94.8\%\). Our compact deep color feature fusion approach, also employing FC layer features, obtains an average classification accuracy of \(96.6\%\). The best result (\(98.6\%\)) on this dataset is obtained by transferring deep features from the conv layers together with a VLAD encoding technique. Such an encoding of conv features is complementary to our approach using FC features. In the case of the RSSCN7 dataset, the work of [71] based on hierarchical coding vectors based classification obtained a mean recognition accuracy of \(86.4\%\). The work of [72] based on deep filter banks achieved a mean recognition rate of \(90.4\%\). Our approach outperforms state-of-the-art methods on this dataset with a mean recognition rate of \(92.9\%\).

Table 5 Absolute gain (overall accuracy in %) in classification performance obtained using the proposed compact deep color feature fusion compared to deep features using the standard RGB representation

For the AID dataset, the work of [10] proposed a deep convolutional neural network pre-training approach (SSF-AlexNet) and achieved a mean recognition accuracy of \(88.7\%\). The work of [36] proposed a fusion approach (BAFF) to integrate SIFT and deep features. Their approach achieved a mean recognition rate of \(93.6\%\). Our approach provides superior results with an average classification accuracy of \(94.0\%\). Finally, on the NWPU-RESISC45 dataset, the work of [79] based on single-scale deep features achieved a mean recognition rate of \(83.6\%\). The multi-scale variant of their approach obtained an average classification score of \(84.3\%\). The bag-of-convolutional feature approach of [17] achieved a mean recognition rate of \(84.3\%\). Our approach again provides superior classification performance by achieving an average classification accuracy of \(87.5\%\).

5 Conclusions

In this paper, we investigated the contribution of color within a deep learning framework (CNNs) for the problem of remote sensing scene classification. We demonstrated that different deep color features possess complementary information and combining them leads to a significant performance boost for the remote sensing scene classification task. Additionally, we addressed the high-dimensionality of the deep color feature fusion and compressed it to obtain a compact final image representation without a significant deterioration in classification performance. To validate our approach, we performed comprehensive experiments on five challenging remote sensing scene classification datasets. The results from our experiments clearly demonstrated the effectiveness of the proposed approach. Table 5 shows the gain in classification performance obtained using the proposed compact deep color feature fusion, compared to the standard RGB deep features, on the five remote sensing scene classification datasets.

A potential future research direction is to investigate the fusion of other available spectral bands (e.g., near infrared) besides RGB in the form of different color transformations. Additionally, integrating other visual cues, such as texture features, with color features in a deep learning framework may improve remote sensing scene classification performance and is therefore a promising research direction. Another future research direction is to investigate the impact of integrating multiple deep color features for other remote sensing image analysis tasks, such as object detection (simultaneous classification and localization).