RS-DeepSuperLearner: fusion of CNN ensemble for remote sensing scene classification

ABSTRACT Scene classification is an important problem in remote sensing (RS) and has attracted a lot of research in the past decade. Nowadays, most proposed methods are based on deep convolutional neural network (CNN) models, and many pretrained CNN models have been investigated. Ensemble techniques are well studied in the machine learning community; however, few works have used them in RS scene classification. In this work, we propose an ensemble approach, called RS-DeepSuperLearner, that fuses the outputs of five advanced CNN models, namely, VGG16, Inception-V3, DenseNet121, InceptionResNet-V2, and EfficientNet-B3. First, we improve the architecture of the five CNN models by attaching an auxiliary branch at specific layer locations. In other words, each model now has two output layers, each producing predictions, and the final prediction is the average of the two. The RS-DeepSuperLearner method starts by fine-tuning the five CNN models using the training data. Then, it employs a deep neural network (DNN) SuperLearner to learn the best way of fusing the outputs of the five CNN models by training it on the predicted probability outputs and the cross-validation accuracies (per class) of the individual models. The proposed methodology was assessed on six publicly available RS datasets: UC Merced, KSA, RSSCN7, Optimal-31, AID, and NWPU-RESISC45. The experimental results demonstrate its superior capabilities when compared to state-of-the-art methods in the literature.


Introduction
Classification of optical Very High Resolution (VHR) images is a crucial research problem in remote sensing (RS). Optical VHR images have many limitations. For example, they are affected by weather conditions, as they cannot be collected at night or under cloud cover. Other challenges are related to the geometric accuracy of the data and its positioning in a common reference system. In addition, because of the much higher resolution, it is no longer feasible or meaningful to analyse the image at the pixel level. Instead, classification now focuses on a per-object or per-scene approach.
Classification of RS scenes from urban areas is a fundamental step in many important applications, such as urban planning (Anees et al. 2020; Lynch, Blesius, and Hines 2020; Wu, Gui, and Yang 2020), natural hazards detection (Zhang, Song, and Yu 2011; Kerle and Bobrowsky 2013; Goldberg et al. 2020), environmental monitoring (Nyamugama and Qingyun 2005; Ambrosone et al. 2019), mapping (Rogan and Chen 2004; Yang et al. 2017; Li, Wang, and Jiang 2020), and object detection (Bin and Li 2004; Ammour et al. 2017; Amini et al. 2018). It is a challenging topic due to the wide variety of man-made objects and the complexity of the scenes. Deep learning solutions, in particular convolutional neural networks (CNN), are now the state-of-the-art solutions for classifying RS scenes (Song et al. 2019; Cheng, Han, and Lu 2017). However, despite the fact that a large number of deep learning models have been developed for this problem, a single model that performs perfect classification in all situations encountered by an RS analysis system is likely unattainable. For that reason, decision fusion methods are proposed to produce correct decisions from a given amount of input information (Ludmila 2014). Decision fusion methods are tailored to generate a single decision from multiple sensor or classifier decisions. Moreover, fusion can compensate for the deficiencies of one classifier by using one or more additional classifiers.
In this article, a method for scene classification in remote sensing based on ensemble learning is proposed, which fuses the predictions of five different CNN models: VGG16 (Karen and Andrew 2014), Inception-V3 (Szegedy et al. 2016), InceptionResNet-V2 (Szegedy et al. 2016), DenseNet121 (Huang et al. 2017), and EfficientNet-B3 (Tan and Le 2019). Many works using these CNN models for scene classification have been published (Ajjaji et al. 2019; Alhichri 2018; Lasloum et al. 2021). Cheng et al. have presented a good review of the state of the art in RS scene classification. However, it is impossible to comprehensively review all proposed methods in one article. Instead, the literature review included here highlights some recent examples of work from each research problem type and direction, with a special emphasis on ensemble methods in RS scene classification.
Most works have dealt with supervised classification, where RS scenes are classified into predefined categories of one dataset used during training and testing of the model. However, more and more research is now focusing on reducing the labelling effort using techniques like domain adaptation and few-shot learning. Domain adaptation aims to transfer training knowledge from a labelled source dataset to a target dataset that has a few labelled samples (semi-supervised) or is completely unlabelled (unsupervised). For example, in recent work by Lasloum et al. (Lasloum et al. 2021), a multisource semi-supervised domain adaptation (SSDAN) approach was presented. SSDAN uses a CNN model with a prototypical layer for classification and trains the model on the labelled source dataset, the few labelled target samples, and the remaining unlabelled target samples. The optimized loss function combines the standard cross-entropy loss and an entropy loss computed using the unlabelled data.
Furthermore, few-shot learning reduces the labelling effort because it only needs a few labelled samples per class. However, unlike domain adaptation, it also assumes that the target dataset contains new classes unseen during model training. For example, Li et al. have proposed a zero-shot RS scene classification (ZSRSSC) paradigm based on locality-preservation deep cross-modal embedding networks (LPDCMENs). Zero-shot means here that the target dataset has no labelled samples; instead, the method uses the semantic information about the classes and matches it with the semantic information from the labelled source dataset. In their proposed method, Li et al. have addressed the problem of class structure inconsistency between the two hybrid spaces (i.e. the visual image space and the semantic space) by designing a set of explainable constraints to optimize the LPDCMENs. In another zero-shot learning work, the authors have proposed a new remote sensing knowledge graph (RSKG), built from scratch, to support the inference recognition of unseen RS image scenes. They have generated a semantic representation of scene categories by representation learning of the RSKG (SR-RSKG). Then, they have designed a novel deep alignment network (DAN) with a series of well-designed optimization constraints to achieve robust cross-modal matching between visual features and semantic representations.
However, few works have investigated the ensemble fusion approach in RS scene classification. In (Li et al. 2017), Li et al. have proposed a novel two-stage neural network ensemble model to address the problem of low classification accuracies due to lack of labelled data. They have used a pretrained CNN for feature extraction and then fed the output features to a Restricted Boltzmann Machine (RBM) retrained network to get better feature representations. Finally, they have trained several copies of their deep model and then fused their outputs using an Ensemble Inference Network (EIN).
Moreover, Ye et al. have enhanced the performance of deep convolutional neural networks (D-CNN) by combining a Softmax loss function and a centre loss function and have built discriminative hybrid features using linear connections. Finally, they have adopted an ensemble extreme learning machine (EELM) classifier that classifies the hybrid features and fuses the classification results. Akodad et al. (Akodad et al. 2019) have proposed a novel ensemble learning approach based on the concept of multilayer stacked covariance pooling (MSCP) of CNN features obtained from a pretrained model. They have then employed an ensemble learning approach among the stacked convolutional feature maps, which aggregates the decisions of multiple learning algorithms produced on different stacked subsets.
On the other hand, Alosaimi et al. (Alosaimi and Alhichri 2020) have built an ensemble of three state-of-the-art CNN classifiers: VGG-16, SqueezeNet, and DenseNet. They have fine-tuned the pretrained models and then proposed a novel decision-level fusion approach that fuses the classification decisions using a combination of majority voting and a breaking-ties confidence measure. The work by Tombe et al. (Tombe, Viriri, and Dombeu 2020) presented an ensemble learning method that fuses the outputs of CNN features using majority voting. First, they have fine-tuned a pretrained VGG-16 model for feature extraction. Then, they have used a multi-grain forest for feature learning and for building an ensemble of classifiers. The final classification result is obtained by fusing the ensemble classifiers using majority voting. Another interesting work by Li et al. has addressed the problem of wrong labels in an RS dataset. They have proposed a remote sensing image scene classification error-tolerant deep learning (RSSC-ETDL) solution that adaptively combines an ensemble of CNNs to classify RS scenes and correct the labels of uncertain samples. To achieve these two goals simultaneously, they alternate iteratively between the two tasks of training the multiple CNNs and correcting wrong labels.
Ensemble learning, which combines the results of a set of learning models rather than selecting a single classifier, is a well-established technique that has received a lot of attention in the machine learning literature. The main types of ensemble learning approaches are bagging (Breiman 1996a; Bühlmann and Yu 2002), boosting (Schapire 1990; Freund and Schapire 1997), and stacking (Wolpert 1992; Breiman 1996b; Naimi and Balzer 2018). Bagging, short for bootstrap aggregating, is mainly applied in classification and regression. It improves accuracy by training base models (commonly decision trees) on bootstrap resamples of the data and aggregating their outputs, which reduces variance to a large extent. This variance reduction improves accuracy and combats overfitting to the training data, which is a challenge for many predictive models (Garbin, Zhu, and Marques 2020). Boosting, on the other hand, is an ensemble technique that learns from the mistakes of previous predictors to make better predictions in the future. It combines several weak base models to form one strong learner, thus significantly improving the predictability of models: the weak models are arranged in a sequence so that each can learn from the previous models in the sequence. Stacking, another ensemble method, is often referred to as stacked generalization. It uses a training algorithm to fuse the predictions of several other learning algorithms. Stacking has been successfully implemented and applied in regression, density estimation, distance learning, and classification. A basic common example of stacked generalization is averaging the outputs of several learning models. However, the averaging technique is not data-adaptive and is vulnerable to weak classifiers.
In the work of van der Laan et al. (van der Laan, Polley, and Hubbard 2007; Polley and van der Laan 2010), stacked generalization was further extended with a cross-validation-based optimization framework. They proposed the SuperLearner, which uses cross-validation to evaluate the different classifiers in the ensemble and, based on that evaluation, learns a better way to combine them. By minimizing the cross-validated risk, the SuperLearner determines the optimal combination of a set of prediction algorithms or models. The individual models must exhibit different generalization patterns for ensemble fusion to succeed; thus, diversity within the ensemble plays a crucial role in the training process. Otherwise, the members of the ensemble would make the same predictions, and the ensemble would provide no better accuracy than a single model. Van der Laan et al. (Polley and van der Laan 2010; van der Laan and Dudoit 2003; van der Laan and Rubin 2006) have theoretically shown that, assuming independent and identically distributed data, the classifier with the best estimated cross-validated performance is asymptotically equivalent to the unknown best classifier in the ensemble. Thus, cross-validation can be used as an objective measure to compare the performance of a set of classifiers.
This article proposes an ensemble fusion method to classify RS scenes, which we call RS-DeepSuperLearner. In particular, our approach fuses the predictions of five pretrained CNN models using another DNN, which is trained to perform the fusion. The inputs to the DNN are the predictions of the individual CNN models and their cross-validation accuracies. In this way, RS-DeepSuperLearner has an objective approach for learning the fusion process, which should result in an increase in total classification accuracy after fusion. The main contributions of the article include the following:
• It proposes an effective ensemble learning method, called RS-DeepSuperLearner, to classify RS scenes that fuses the predictions of five pretrained CNN models: VGG16, Inception-V3, InceptionResNet-V2, DenseNet121, and EfficientNet-B3.
• It presents a novel way of modifying and fine-tuning the five pretrained CNN models by adding an auxiliary branch to the architecture, which helps fight overfitting and increases diversity among the models in the ensemble.
• It develops a novel DNN SuperLearner model that successfully fuses the predictions of the base models in the ensemble. Unlike previous work, the inputs to the proposed DNN SuperLearner are: 1) the class-predicted probabilities of each model, 2) the cross-validation accuracies for each class and for each base model, and 3) the overall accuracy of each base model.
The rest of the article is organized as follows. Section 2 describes the proposed RS-DeepSuperLearner method, presents the ensemble of CNN models and the architecture modifications, and details the fusion algorithm. Section 3 presents the experimental results. Finally, Section 4 presents the conclusions and future research. Figure 1 illustrates the main overview of the RS-DeepSuperLearner method: five powerful CNN models are trained on the RS dataset, and then a DNN is trained to fuse the predicted probabilities of all models into a final result. In the experiments, we also compare our method to basic stacked generalization algorithms, namely naïve averaging and weighted averaging (Breiman 1996a, 1996b).

Description of the base CNN models
As previously stated, we constructed our classifier ensemble using five different CNN models recently proposed in the literature. In our work, the selected base models are as follows: VGG16 (Karen and Andrew 2014), Inception-V3 (Szegedy et al. 2016), InceptionResNet-V2 (Szegedy et al. 2016), DenseNet-121 (Huang et al. 2017), and EfficientNet-B3 (Tan and Le 2019). These models have been used separately in many works on RS scene classification; however, to the best of our knowledge, no prior work has proposed fusing them. Hundreds of CNN models have been proposed in the literature, so the number of possible ensemble combinations is very high, and performing an exhaustive study would be infeasible. Instead, this study aims to create a deep SuperLearner network that outperforms other fusion methods and to demonstrate that the proposed ensemble method achieves RS scene classification accuracy exceeding that of state-of-the-art methods. Consequently, our choice of CNN models is motivated by having as much diversity between the models as possible, so that they will not make the same mistakes in their predictions.
First, we selected CNN models that belong to five main genres: networks with inception modules (Inception-V3), networks with residual connections (InceptionResNet-V2), networks with dense connections (DenseNet121), networks with an optimized architecture (EfficientNet-B3), and a basic genre (VGG16). To increase diversity, the selected models have different numbers of weights and layers and different top-5 accuracy rates on the ImageNet dataset, as shown in Table 1. Moreover, Figure 2 presents a visual illustration of the CNN models' diversity: each model is depicted as a pie chart with the relative sizes of each property (normalized between zero and one), and the resulting charts differ markedly from model to model.
Given this diversity, we expect a high probability that for each RS scene, at least one of these CNN models makes the correct class prediction. Additionally, the proposed RS-DeepSuperLearner has the role of intelligently exploiting this fact to produce improved class predictions.
All five CNN models used are pretrained on the ImageNet dataset (Deng et al.), the largest image classification dataset, with over 14 million images divided into 1,000 object classes. For all of these pretrained models, we remove a few top layers and replace them with new layers to adapt the model to our classification problem, which does not have 1,000 classes as ImageNet does. The output layer is a fully connected layer with a Softmax activation function that produces the class probabilities. For some models, we added an auxiliary branch with a separate output layer, also with a Softmax activation function. Finally, we train the new model by fine-tuning all of its parameters, except for the VGG16 model, where we fine-tune only the newly added layers following the suggestion of (Ammour et al. 2017), which incidentally increases the diversity among the ensemble CNN members. The added layers include a combination of fully connected (FC) layers, convolutional layers, average pooling layers, BatchNormalization (BN), and activation layers. In particular, we have found that the Global Average Pooling (GAP) layer is a much more effective pooling layer towards the end of CNN models; therefore, we use the GAP layer in all five pretrained models. The BN layer is an important addition that is quite effective in combatting the problem of overfitting, hence enhancing the performance of the CNN model. Moreover, we apply the ReLU activation function and its advanced variant, the LeakyReLU function.
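The following Keras sketch illustrates this modification pattern on a small made-up backbone (it is not one of the five actual pretrained models; the layer sizes, the attachment point of the auxiliary branch, and the 21-class output are all illustrative assumptions): a GAP-based main head, an auxiliary branch attached to an intermediate layer, and a final prediction taken as the average of the two Softmax outputs.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_CLASSES = 21  # e.g. the 21 UC Merced classes

# Hypothetical small backbone standing in for a pretrained CNN
inp = layers.Input(shape=(256, 256, 3))
x = layers.Conv2D(32, 3, strides=2, activation="relu")(inp)
mid = layers.Conv2D(64, 3, strides=2, activation="relu")(x)   # "intermediate layer"
top = layers.Conv2D(128, 3, strides=2, activation="relu")(mid)

# Main head: GAP -> FC -> BN -> LeakyReLU -> Softmax output
h = layers.GlobalAveragePooling2D()(top)
h = layers.Dense(128)(h)
h = layers.BatchNormalization()(h)
h = layers.LeakyReLU(0.5)(h)
main_out = layers.Dense(NUM_CLASSES, activation="softmax")(h)

# Auxiliary branch attached to the intermediate layer:
# Conv -> BN -> ReLU -> GAP -> Dropout -> Softmax output
a = layers.Conv2D(64, 1)(mid)
a = layers.BatchNormalization()(a)
a = layers.ReLU()(a)
a = layers.GlobalAveragePooling2D()(a)
a = layers.Dropout(0.5)(a)
aux_out = layers.Dense(NUM_CLASSES, activation="softmax")(a)

# Final prediction: average of the two output layers
final = layers.Average()([main_out, aux_out])
model = Model(inp, final)
```

During fine-tuning, both heads are trained jointly, and at inference time the averaged output is used as the model's prediction.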

VGG16
VGG-Net is a family of CNN models developed in 2014 by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group (hence the name VGG) at Oxford University (Karen and Andrew 2014). The VGG16 variant consists of 16 weight layers: 13 convolutional layers and 3 fully connected layers. All convolution operations use kernels of dimension 3 × 3. Its uniform architecture makes it very appealing in the deep learning community because it is easy to understand. Figure 3 shows the modified VGG16 model. Here, we removed the top four layers from the pretrained VGG16 CNN model and then added our new layers, starting with the GAP layer. The GAP layer takes an input tensor of size 16 × 16 × 512 and produces a tensor of size 1 × 1 × 512. The next layer is an FC layer of 128 neurons connected to the GAP output, followed by a BN layer and a LeakyReLU activation function with its alpha parameter set to 0.5.

Inception-V3 and InceptionResNet-V2
Inception networks are a family of CNN models developed by Szegedy et al. (Szegedy et al. 2015, 2016). The authors have engineered several tricks that set these networks apart from other conventional CNN architectures, such as AlexNet or VGG16. In 2014, the earliest version was introduced as GoogLeNet, also known as Inception-V1 (Szegedy et al. 2015). In 2015, Szegedy et al. improved their work by introducing a new BN layer in the second version, Inception-V2 (Szegedy et al. 2016), and the concept of factorizing convolutions in the third version, Inception-V3 (Szegedy et al. 2016). Finally, in another article, they made further improvements in the Inception-V4 and InceptionResNet versions (Szegedy et al. 2016). Inception-V4 contained several simplifications, which reduced the computational costs, whereas InceptionResNet-V2 included the concept of residual blocks, inspired by the success of the ResNet architecture (He et al. 2016).
In this work, we selected Inception-V3 and modified it as shown in Figure 4. We removed the top four layers. Then, we added a GAP layer, followed by the output FC layer with a Softmax activation function. Following the recommendation made by Bazi et al. (Bazi et al. 2019), we also employed a second branch connected to an intermediate layer of the CNN model, which has been quite effective in improving the performance of very deep models such as the one at hand. The second branch is connected to layer 228 of the original model and contains new layers, including convolutional, BN, ReLU activation, GAP, and Dropout layers, in addition to the final FC layer.
As for the InceptionResNet-V2 model, we modified it as shown in Figure 5. The added layers are the same as in the Inception-V3 model, and the auxiliary branch is attached starting from layer 594.

DenseNet-121
DenseNet (Huang et al. 2017) is another family of CNN models that rely on the idea of dense connections, where each layer is connected to every other layer in a feedforward fashion. For each layer, the feature maps of all preceding layers are used as inputs, while its feature maps are used as inputs into all subsequent layers. In this work, we selected a version called DenseNet121, which provides a good compromise between accuracy and computational cost. Figure 6 shows how we modified the DenseNet121 model. We removed the top three layers and added a GAP layer followed by the output FC layer with a Softmax activation function. As shown in Figure 6, we added a second branch similar to the Inception CNNs and connected it to layer 311.

EfficientNet-B3
EfficientNet is a family of CNN models built on a new model scaling method developed by Tan and Le (Tan and Le 2019). Figure 7 shows the modified EfficientNet-B3 model, where we removed the top two layers of the original model and then added the GAP and Softmax layers. The same second branch is also added as before and is connected to layer 266.

Naïve unweighted averaging fusion
The simplest and most common way to combine an ensemble of classifiers is naïve unweighted averaging. Deep CNN classifiers are well known for having high variance and low bias in their results, due to the many sources of randomness in the sampling of the training data and in the optimization algorithm. Naïve unweighted averaging is effective in reducing this variability.
In naïve unweighted averaging, we take the unweighted average of the output probabilities of the base classifiers. For a typical DNN with a Softmax output layer, the predicted probability $p_{m,c}$ of class $c$ by the $m$-th classifier is

$$p_{m,c} = \frac{\exp(v_{m,c})}{\sum_{k=1}^{C} \exp(v_{m,k})}$$

where $C$ is the number of classes and $\vec{v}_m = [v_{m,1}, v_{m,2}, \ldots, v_{m,C}]$ is the output vector of the final layer before applying the Softmax activation function. We then compute the fused probability of class $c$ using naïve unweighted averaging as follows:

$$P_c = \frac{1}{M} \sum_{m=1}^{M} p_{m,c}$$

This can be rewritten in terms of $\vec{p}_m = [p_{m,1}, p_{m,2}, \ldots, p_{m,C}]$ (i.e. all probabilities for the $m$-th classifier) as follows:

$$\vec{P} = \frac{1}{M} \sum_{m=1}^{M} \vec{p}_m$$

The deep learning literature has concluded that naïve unweighted averaging effectively reduces the variability of CNN models of similar architectures. However, it is not suitable for heterogeneous CNN models because it is vulnerable to weaker ensemble members (Zhao and Liu 2020). Some members of an ensemble of heterogeneous CNN models may be weak in general but still outperform the others for certain classes or images. Thus, an intelligent combination approach should discover and capitalize on this observation.
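The naïve averaging above can be sketched in a few lines of NumPy (the logits are made-up numbers for illustration):

```python
import numpy as np

def softmax(v):
    """Row-wise Softmax: p_{m,c} = exp(v_{m,c}) / sum_k exp(v_{m,k})."""
    e = np.exp(v - v.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

# Pre-Softmax output vectors v_m of M = 3 classifiers for one image, C = 4 classes
logits = np.array([[2.0, 1.0, 0.1, 0.3],
                   [1.5, 1.4, 0.2, 0.1],
                   [0.2, 2.5, 0.3, 0.4]])

probs = softmax(logits)        # per-classifier probabilities p_m
fused = probs.mean(axis=0)     # naive unweighted average over the M classifiers

# The average of probability distributions is itself a distribution
assert np.allclose(fused.sum(), 1.0)
```

In this toy example the third classifier's very confident vote flips the fused decision from class 0 (favoured by the first two classifiers) to class 1, illustrating the vulnerability to individual ensemble members discussed above.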

Weighted-Averaging fusion
In this approach, the probabilities of the $M$ classifiers are fused using a weighted average, where the weights measure the confidence in the predictions of a particular model (Breiman 1996b; Large, Lines, and Bagnall 2019).
In other words, the contribution of each classifier to the final prediction is weighted by the classifier's performance. In this technique, the new prediction probability is computed using a weighted average as follows:

$$P_c = \sum_{m=1}^{M} w_{m,c} \, p_{m,c}$$

where $w_{m,c}$ is the weight parameter for a particular classifier $m$ and a particular class $c$. To compute the weights, we use a cross-validation technique, where we divide the training set into $K$ folds and then hold out one fold as a validation set while using the remaining folds for training (Large, Lines, and Bagnall 2019; Kohavi 1995; Benkeser et al. 2018). For a given fold $k$, we can compute the per-class accuracy $A_{m,c}^{(k)}$ of the current classifier $m$. Thus, at the end of these $K$ experiments, we end up with $K$ performance accuracies for each classifier, which can be averaged across the $K$ validation sets to obtain the per-class accuracy of the $m$-th classifier:

$$A_{m,c} = \frac{1}{K} \sum_{k=1}^{K} A_{m,c}^{(k)}$$

Finally, the weights per classifier $m$ and class $c$ are computed as follows:

$$w_{m,c} = \frac{A_{m,c}}{\sum_{m'=1}^{M} A_{m',c}}$$

This ensures that the weights for a particular class $c$ sum to 1, i.e. $\sum_{m=1}^{M} w_{m,c} = 1$.
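A minimal NumPy sketch of this weighted fusion (the accuracies and probabilities are made-up numbers; `acc[m, c]` plays the role of the cross-validated per-class accuracy of classifier m on class c):

```python
import numpy as np

# Cross-validated per-class accuracies A_{m,c} for M = 3 classifiers, C = 4 classes
acc = np.array([[0.90, 0.60, 0.80, 0.70],
                [0.70, 0.85, 0.75, 0.65],
                [0.60, 0.95, 0.70, 0.90]])

# Normalize per class so that the M weights of each class sum to 1
weights = acc / acc.sum(axis=0, keepdims=True)
assert np.allclose(weights.sum(axis=0), 1.0)

# Predicted probabilities p_{m,c} of the three classifiers for one image
probs = np.array([[0.6, 0.2, 0.1, 0.1],
                  [0.4, 0.4, 0.1, 0.1],
                  [0.1, 0.7, 0.1, 0.1]])

# Weighted-average fusion: P_c = sum_m w_{m,c} * p_{m,c}
fused = (weights * probs).sum(axis=0)
```

Because the weights are normalized per class rather than per image, the fused vector is not guaranteed to sum exactly to 1; it can be renormalized before being reported as a probability distribution.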

Proposed RS-DeepSuperlearner method
The SuperLearner is a type of stacking method, where the weights used to combine the different ensemble outputs are learned using another learning algorithm (van der Laan, Polley, and Hubbard 2007; Polley and van der Laan 2010). The SuperLearner uses cross-validation to make an objective evaluation of the individual ensemble classifiers and, based on that, optimizes the fusion of the classifier ensemble. In this study, we propose to use a DNN as the SuperLearner, as shown in Figure 8. Next, we present the formulation of the proposed fusion algorithm using a DNN SuperLearner.
Recall that the training data are divided into $K$ folds; then, the total cross-validated loss for a particular classifier $m$ can be defined as follows:

$$L_m = \sum_{d=1}^{K} \sum_{i \in \mathrm{val}(d)} \ell\left(y_i, \vec{p}_{i,m}^{\,-d}\right)$$

where $\mathrm{val}(d)$ is the set of samples in fold $d$ and $\vec{p}_{i,m}^{\,-d}$ is defined as the prediction vector of the $i$-th sample by the $m$-th base classifier, which is trained on all of the training set except the $d$-th fold. Thus, if we compute the final predicted probability for sample image $i$ as the weighted combination $\sum_{m=1}^{M} \alpha_m \vec{p}_{i,m}^{\,-d}$, where $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_M]$ are weights to be learned, then the total loss for the combined classifier can be expressed as follows:

$$L(\alpha) = \sum_{d=1}^{K} \sum_{i \in \mathrm{val}(d)} \ell\left(y_i, \sum_{m=1}^{M} \alpha_m \vec{p}_{i,m}^{\,-d}\right)$$

Then, the goal of the SuperLearner algorithm is to find the optimal weight vector that minimizes the above loss function; i.e.

$$\alpha^{*} = \arg\min_{\alpha} L(\alpha)$$
This is a convex optimization problem that can be solved directly. In the above original formulation found in (van der Laan, Polley, and Hubbard 2007; Polley and van der Laan 2010), we have a total of $M$ weights to optimize, or one weight per classifier. However, varying the weights depending on each class $c$ and each sample image $i$ is more effective: the weights then adapt to each class and each image, hopefully giving better results. Thus, in our proposed RS-DeepSuperLearner algorithm, we define the loss function as follows:

$$L(\alpha) = \sum_{d=1}^{K} \sum_{i \in \mathrm{val}(d)} \ell\left(y_i, \sum_{m=1}^{M} \vec{\alpha}_{i,m} \odot \vec{p}_{i,m}^{\,-d}\right)$$

where $\vec{\alpha}_{i,m}$ is the vector of $C$ (per-class) weights for sample image $i$ and classifier $m$, and $\odot$ denotes element-wise multiplication. In our proposed RS-DeepSuperLearner, we use a deep neural network to solve this optimization problem. The DNN model learns the nonlinear mapping between the prediction probabilities of the base classifiers and the final combined prediction probabilities, such that

$$\vec{P}_i = F_{\theta}\left(\vec{p}_{i,1}, \vec{p}_{i,2}, \ldots, \vec{p}_{i,M}\right)$$

where $\theta$ represents the set of parameters of the DNN model. The set $\theta$ can have thousands of parameters, many more than the $M$ weights in the weighted-averaging approach of Section 2.3. This makes it more adaptive to each class and each image, and hopefully it will find a more optimal combination of the classifier ensemble. Another contribution of this article is adding the validation accuracies of all base classifiers, $\tilde{A}_m = [A_m, A_{m,1}, A_{m,2}, \ldots, A_{m,C}]$, as an input to the DNN to provide it with more information to learn from. Thus, in summary, the proposed RS-DeepSuperLearner method uses a DNN model $F_{\theta}$ that learns the following mapping:

$$\vec{P}_i = F_{\theta}\left(\vec{p}_{i,1}, \ldots, \vec{p}_{i,M}, \tilde{A}_1, \ldots, \tilde{A}_M\right)$$

where $\vec{P}_i$ is the final fused predicted class probability vector for sample image $i$. The DNN SuperLearner is illustrated in Figure 8, which shows that it is composed of two hidden fully connected layers. Each layer is followed by a BatchNormalization layer and a ReLU activation function. Moreover, we have experimented with an additional Dropout layer after the first hidden layer.
Both BatchNormalization and Dropout effectively fight overfitting in DNN models (Garbin, Zhu, and Marques 2020).
We conducted experiments to investigate their effectiveness in our case.
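For concreteness, the input vector fed to the DNN SuperLearner for one image can be assembled as follows (a sketch with random placeholder values standing in for real model outputs and cross-validation accuracies):

```python
import numpy as np

M, C = 5, 45  # five base CNN models; e.g. the 45 classes of NWPU-RESISC45
rng = np.random.default_rng(0)

# Predicted class probabilities of each base model for one image
probs = rng.dirichlet(np.ones(C), size=M)             # shape (M, C)

# Validation statistics per base model: overall accuracy A_m
# plus per-class accuracies A_{m,1}..A_{m,C} from cross-validation
overall_acc = rng.uniform(0.70, 0.95, size=M)         # shape (M,)
class_acc = rng.uniform(0.50, 1.00, size=(M, C))      # shape (M, C)

# Concatenate everything into the SuperLearner input vector
features = np.concatenate([probs.ravel(), overall_acc, class_acc.ravel()])
assert features.shape == (M * C + M + M * C,)         # 225 + 5 + 225 = 455 inputs
```

The DNN then maps this 455-dimensional vector to the final fused class probability vector of length C.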

Experimental results
This section presents an extensive experimental analysis to show the capabilities of the proposed solution.

Dataset description
In total, six datasets were used to test our method: the University of California Merced (UC Merced) dataset (Yang and Newsam 2010), the Kingdom of Saudi Arabia (KSA) dataset (Othman et al. 2017; Alhichri 2020a), the RSSCN7 dataset (Zou et al. 2015a), the Optimal-31 dataset, the Aerial Image Dataset (AID), and the NWPU-RESISC45 dataset (Cheng, Han, and Lu 2017). We present detailed information on each dataset below; moreover, Table 2 summarizes their properties.

UC Merced dataset
The UC Merced land-use dataset comprises 2,100 overhead scene images divided into 21 land-use scene classes. Each class consists of 100 aerial images measuring 256 × 256 pixels, with a spatial resolution of 0.3 m per pixel in the red-green-blue (RGB) colour space. This dataset was extracted from aerial orthoimagery downloaded from the United States Geological Survey (USGS) National Map.

KSA dataset
This multisensor dataset was acquired over different cities of the KSA (i.e. Riyadh, Al-Qassim, Al-Rajhi farms, Al-Hufuf, and Jeddah) by three different VHR sensors, IKONOS-2, GeoEye-1, and WorldView-2, with spatial resolutions of 1 m, 0.5 m, and 0.5 m, respectively. This dataset consists of 3,250 RGB images of size 256 × 256 pixels categorized into 13 classes (250 images per class). The class labels are as follows: agriculture, beach, cargo, chaparral, dense residential, dense trees, desert, freeway, medium-density residential, parking lot, sparse residential, storage tanks, and water. This dataset is available for download online.

RSSCN7
This dataset (Zou et al. 2015b) contains 2800 remote sensing images collected from Google Earth and is divided into seven scene categories: grassland, forest, farmland, parking lot, residential region, industrial region, and river and lake. Each category consists of 400 images with a size of 400 × 400 pixels.

Optimal-31
This new dataset contains images from Google Earth imagery. The images have a size of 256 × 256 pixels and their resolution is 0.5 metres. Optimal-31 categorizes 1860 images within 31 classes, each class containing 60 images (Li, Wang, and Jiang 2020).

Aerial image dataset (AID)
This dataset is a collection of 10,000 annotated aerial images distributed in 30 land-use scene classes and can be used for image classification purposes. Compared with the UC Merced dataset, AID contains more images and covers a wider range of scene categories, thus being in line with the data requirements of modern deep learning.

NWPU-RESISC45
NWPU-RESISC45 was acquired from Google Earth imagery and created by Northwestern Polytechnical University (NWPU) (Cheng, Han, and Lu 2017). It consists of 31,500 remote sensing images divided into 45 classes. Each class has 700 images with a cropped size of 256 × 256 pixels. The spatial resolution of most classes varies from around 30 m to 0.2 m, except for a few classes with lower spatial resolutions: island, lake, mountain, and snowberg.

Experimental setup
We implement all our models in the Google Colab environment using the TensorFlow machine learning library, accessed through its Python API. First, we resized all datasets to a standard size of 256 × 256 pixels. The newly resized datasets can be downloaded from this web location (Alhichri 2020b). For performance evaluation, we present the results using the overall accuracy (OA), that is, the ratio of the number of correctly classified samples to the total number of tested samples. OA is the metric used by almost all works on RS scene classification in the literature, since the classes in RS datasets are balanced. Precision, recall, and the F1 score are mainly instrumental for unbalanced datasets, such as medical classification problems, where the samples classified as a disease are scarce compared to the normal and other samples. We split each dataset into training and testing sets. The train/test splits used in the experiments are 20%-80%, 50%-50%, and 80%-20%. However, we cannot test with large training sets for large datasets, such as AID and NWPU-RESISC45, due to limitations on computational resources (see Table 2 for dataset sizes). For these datasets, we follow the footsteps of previous researchers by applying the method on 10%-90% and 20%-80% splits only; moreover, we consider the 50%-50% split for the AID dataset.

For the training parameters, we set the batch size to 32 images and train for 60 epochs, divided into three stages of 20 epochs each. We employ a decaying learning rate strategy, such that the learning rate starts at 0.001 in the first stage and is reduced to 0.0001 and 0.00001 in stages two and three, respectively. This reduction is important to enhance the convergence of the model.
Basically, because of the initial high learning rate, the model searches a larger area of the search space. Then, once a good minimum of the loss function is found, we continue searching with a lower learning rate to locate the minimum more precisely, which in turn fine-tunes the training of the model. Furthermore, we employ the 'early-stopping' technique in stage three (the last 20 epochs, with learning rate 0.00001), which stops the training if the loss value does not improve for a given number of consecutive epochs (five epochs in our case).
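The staged training schedule described above can be sketched in Keras as follows. This is a minimal illustration, not the paper's exact training script; the model and dataset variables, layer choices, and loss are assumptions.

```python
import tensorflow as tf

def staged_fine_tune(model, train_ds, stage_lrs=(1e-3, 1e-4, 1e-5),
                     epochs_per_stage=20):
    """Train in three stages with a decaying learning rate.
    Early stopping is enabled only in the final stage."""
    for stage, lr in enumerate(stage_lrs):
        # Recompile at each stage so the Adam optimizer picks up the new rate.
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                      loss="categorical_crossentropy", metrics=["accuracy"])
        callbacks = []
        if stage == len(stage_lrs) - 1:
            # Stop if the loss does not improve for 5 consecutive epochs.
            callbacks.append(tf.keras.callbacks.EarlyStopping(
                monitor="loss", patience=5))
        model.fit(train_ds, epochs=epochs_per_stage, callbacks=callbacks,
                  verbose=0)
    return model
```

Recompiling between stages is a simple way to realize the step-wise decay; a `LearningRateScheduler` callback would be an equivalent alternative.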

Optimizing the DNN metalearning model
In this subsection, we conduct experiments to study the DNN SuperLearner model. We focus on one dataset, namely the NWPU-RESISC45 dataset with a 10%-90% train-test split, since it is the largest and most challenging dataset.
In the first set of experiments, we train the DNN for 300 epochs with a batch size of 256 and the learning rate of the Adam optimizer set to 0.001. We perform a total of four experiments in this set. In experiment 1, we use only two hidden FC layers with ReLU activation functions. In experiment 2, we add BN layers after each FC layer. In experiment 3, we remove the BN layers and replace them with Dropout layers with a probability of 0.8. Finally, in experiment 4, we add both BN and Dropout layers to the model. Figure 9 plots the loss curve and the OA on the test data over the 300 epochs. Each experiment was repeated 30 times, and the average loss and OA curves were computed. These results show that the BN and Dropout layers are effective in combating overfitting, as the OA on the test data is much better in experiments 3 and 4. Furthermore, we can conclude that Dropout is more important than BN, because when we used BN layers alone in experiment 2, the OA was worse.
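The four DNN SuperLearner variants compared above can be sketched with a single builder function. The hidden-layer widths here are illustrative assumptions (the paper does not fix them in this passage), and the dropout argument follows Keras semantics, where `rate` is the fraction of units dropped.

```python
import tensorflow as tf

def build_superlearner(input_dim, n_classes, use_bn=False, use_dropout=False,
                       hidden_units=(512, 256), dropout_rate=0.8):
    """Two hidden FC layers with ReLU; BN and/or Dropout optionally
    inserted after each hidden layer (experiments 1-4)."""
    inputs = tf.keras.layers.Input(shape=(input_dim,))
    x = inputs
    for units in hidden_units:
        x = tf.keras.layers.Dense(units, activation="relu")(x)
        if use_bn:
            x = tf.keras.layers.BatchNormalization()(x)
        if use_dropout:
            x = tf.keras.layers.Dropout(dropout_rate)(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```

Experiment 1 corresponds to both flags off, experiment 2 to `use_bn=True`, experiment 3 to `use_dropout=True`, and experiment 4 to both flags on.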
We then recompile the model with a lower Adam learning rate of 0.0001 and continue training for another 20 epochs to further enhance the OA. Finally, we execute this training 30 times and fuse the predicted class probabilities using naïve averaging, which reduces the variability of the final results and yields a further improvement.
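The final variance-reduction step, averaging the predicted class probabilities over the 30 training runs, amounts to the following (a minimal NumPy sketch; the function name is ours):

```python
import numpy as np

def fuse_runs(prob_runs):
    """prob_runs: list of (n_samples, n_classes) probability arrays,
    one per training run. Returns the naively averaged probabilities
    and the resulting predicted labels."""
    avg = np.mean(np.stack(prob_runs, axis=0), axis=0)
    return avg, np.argmax(avg, axis=1)
```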

Single-model results
First, Table 3 shows the OA results for each CNN model (our modified version, trained separately) on all datasets. We also include the highest possible OA that can be achieved for each dataset. This is computed by considering a classification correct if at least one of the classifiers in the ensemble is correct; in other words, we assume an oracle SuperLearner that always selects a correct classifier when one exists. Effectively, this represents an upper bound on the OA obtainable by fusing the given ensemble of models.
The upper bound OA is significantly higher than the OA by individual models (especially for the 20%-80% splits) and even reaches 100% in some cases. The impressive upper bound OA confirms the diversity of the selected CNN models because when they start to have wrong predictions, they are not all wrong for the same scenes; instead, in a significant portion of cases, at least one of the models is correct.
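The oracle upper bound described above can be computed directly from the models' predicted labels, as in this NumPy sketch (function name is ours):

```python
import numpy as np

def oracle_upper_bound_oa(model_preds, y_true):
    """model_preds: (n_models, n_samples) array of predicted labels.
    A sample counts as correct if at least one model in the ensemble
    predicts its true class. Returns the fraction of such samples."""
    hits = np.any(model_preds == y_true[np.newaxis, :], axis=0)
    return hits.mean()
```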

Comparison of our results with the state-of-the-art methods
In this section, we compare our results to those of state-of-the-art methods published in the literature. Our proposed method performs better than previous methods on all datasets except the UC Merced and AID datasets. For the UC Merced dataset, Table 4 compares our method with a large number of previous methods. Our proposed method outperforms all of them except BiMobileNet (MobileNetv2) and HABNet, which outperform our method on the 20%-80% split. However, we outperform them on the other two splits: when the number of training images is low, the validation accuracies do not measure the per-class performance of the models well, which reduces the RS-DeepSuperLearner's ability to learn the best way to fuse the ensemble of CNN models. The KSA, RSSCN7, Optimal-31, and NWPU-RESISC45 datasets are compared in Tables 5, 6, 7, and 8, respectively. Our method outperforms all listed methods from the literature on these datasets.
Finally, Table 9 presents the results for the AID dataset. The CAD+DenseNet method beats our method on this dataset. It uses the channel attention concept, a novel idea that enhances the performance of CNN models by learning to give different weights to the channels of the convolutional activation maps. This concept can also be incorporated into other CNN models, such as the ones used in our work, improving both their individual performance and the fused ensemble performance. Furthermore, we can always include the CAD+DenseNet model in our ensemble in place of the basic DenseNet121 model; if we do so, we expect the final classification accuracy to improve and outperform the original CAD+DenseNet method. While the proposed RS-DeepSuperLearner method outperforms many previous state-of-the-art methods, its performance is still limited: it did not reach the OA upper bounds presented in Table 3. This means that the method's ability to learn which base model has the correct prediction is limited. One reason is that we use only two folds in the cross-validation training step, which may not provide accurate estimates of the per-class accuracies of each model. These inaccurate inputs to the DNN SuperLearner lead to an inability to fuse the models properly. Thus, one way to improve the performance of RS-DeepSuperLearner is to increase the number of folds during training; however, this will increase the computational complexity.
In fact, the high computational cost and high memory requirements are the main limitations of RS-DeepSuperLearner, because it is composed of several CNN models. Two aspects of the computational complexity of the proposed algorithm can be discussed: the training time and the inference (prediction) time. The training time consists of training the five models in addition to the fusion DNN SuperLearner. The total training time for each model consists of training it on the two halves of the training set to compute the two-fold cross-validation accuracies, then training it again on the whole training set; the latter trained model is the one fused with the other models. Tables 10 and 11 show the computational costs for the AID dataset, illustrating the differences in execution time between the algorithms. Recall that we execute our algorithms on Google's Colab platform, typically using an NVIDIA Tesla K80 GPU (though this may change over time). The RS-DeepSuperLearner method takes about two to ten times more training time than the individual CNN models, which is a significant difference. However, we remind our readers that training is done offline, so slow execution times can be tolerated; furthermore, these training times can always be reduced by investing in more and faster GPUs. Table 11 shows that the inference time per image is quite small, on the order of a few milliseconds, because we are using GPU-enabled computing platforms. Again, the RS-DeepSuperLearner method is significantly slower; however, it is well within the real-time range, as it takes only about 22 milliseconds per image, which translates to about 45 images per second.
The inference times are only a problem for low-end computing platforms, such as mobile or similar embedded devices. Running several CNN models with large numbers of weights and floating-point operations can be a problem for such devices with limited memory and computational resources. However, there is rarely a need to run classification algorithms on devices such as mobile phones in remote sensing applications. Some uses of mobile phones in remote sensing are reported in (Anderson et al. 2016; Masiero et al. 2016), but they typically deal with mapping a small area from a relatively short distance, using a UAV or a kite to capture the images. The algorithms used in these settings are closer to computer vision than to remote sensing.

Conclusions
In this paper, we proposed an ensemble deep learning method, RS-DeepSuperLearner, for the classification of RS scenes. First, RS-DeepSuperLearner fine-tunes five pretrained CNN models: VGG16, Inception-V3, InceptionResNet-V2, DenseNet121, and EfficientNet-B3. These deep CNN models were chosen because they differ in many aspects, such as model depth, size, architecture, and achieved classification accuracy, which increases the diversity of the ensemble, a desirable trait in ensemble methods. Then, RS-DeepSuperLearner fuses the ensemble of models using another deep neural network trained on the predicted class probabilities and the cross-validation accuracies of the individual CNN models. In other words, the RS-DeepSuperLearner fusion approach is data-driven, as it uses deep learning techniques to learn how to fuse the ensemble of models and predict the best classification results.
To show the effectiveness of RS-DeepSuperLearner in practice, we carried out several experiments using six RS scene datasets: UC Merced, KSA, RSSCN7, Optimal-31, AID, and NWPU-RESISC45. From these experiments, we learned that the selected CNN models in the ensemble are quite diverse, as they rarely commit the same classification mistakes across all datasets. Furthermore, RS-DeepSuperLearner consistently outperformed the individual CNN models in the ensemble on all datasets.
The experimental results have also shown the capability of this solution to enhance classification accuracy: RS-DeepSuperLearner consistently outperformed other fusion methods, such as averaging, weighted averaging, and majority voting, as well as many state-of-the-art RS scene classification methods in the literature.
In the future, the proposed solution can be improved by using more advanced CNN models from the literature, because the RS-DeepSuperLearner method can be applied to any set of classification models. Moreover, more folds can be used when training the CNN models. Currently, we use a two-fold cross-validation approach, which is relatively fast but does not give accurate confidence estimates for the model predictions. Using more folds during training should improve the accuracy of these confidence estimates and, accordingly, the accuracy of the fusion step, which depends on them.

Disclosure statement
No potential conflict of interest was reported by the authors.