Embedded Self-Distillation in Compact Multi-Branch Ensemble Network for Remote Sensing Scene Classification

Remote sensing (RS) image scene classification faces many challenges because geographical elements with different characteristics interfere with one another. To address this problem, we propose a multi-branch ensemble network that enhances feature representation by fusing features at both the final output logits and the intermediate feature maps. However, simply adding branches increases model complexity and reduces inference efficiency. We therefore embed a self-distillation (SD) method that transfers knowledge from the ensemble network to its main-branch. After optimization with SD, the main-branch performs close to the ensemble network, so the other branches can be cut during inference to simplify the whole model. In this paper, we first design a compact multi-branch ensemble network that can be trained in an end-to-end manner. We then apply SD to the output logits and feature maps. Compared to previous methods, the proposed architecture (ESD-MBENet) achieves strong classification accuracy with a compact design. Extensive experiments are conducted on three benchmark RS datasets (AID, NWPU-RESISC45 and UC-Merced) with three classic baseline models (VGG16, ResNet50 and DenseNet121). The results show that ESD-MBENet can achieve better accuracy than previous, more complex state-of-the-art (SOTA) models. Moreover, extensive visualization analysis makes our method more convincing and interpretable.


I. INTRODUCTION
Remote sensing scene classification has recently become a popular task in practical applications. It reveals geographical characteristics such as land utilization and vegetation coverage [1]. Progress in RS scene classification allows research on local land planning, tree planting and afforestation to be carried out more intelligently. In recent years, with the rapid development of deep learning technology [2], [3], methods for improving RS scene classification accuracy have been continuously proposed.
In the RS scene classification task, images usually have high resolution and cover a large geographic area. The major problem is therefore the interference among geographical elements with different characteristics. To address this problem, previous works focus on fusing multi-level features to enhance the model's ability to represent complex structural information. In this paper, we integrate an ensemble learning method into CNN modules to construct a multi-branch ensemble network. To handle the different geographical elements in RS images, we use more branches to provide sufficient representation. By fusing the final logits of the different branches, we combine all perspectives and obtain a more convincing prediction. As shown in Fig. 1, we use a main-branch and a sub-branch to construct our ensemble network.
Although an ensemble network is quite effective at dealing with the above-mentioned problem, the large memory and computation cost of a multi-branch structure cannot be ignored. Especially on embedded devices such as UAVs, multi-branch models are cumbersome and hard to deploy. To construct a lighter yet highly efficient multi-branch network, we embed a self-distillation (SD) method in a multi-branch ensemble network (ESD-MBENet). The overview of ESD-MBENet is shown in Fig. 1.
arXiv:2104.00222v1 [cs.CV] 1 Apr 2021
To lighten the multi-branch structure, we design a single sub-branch in ESD-MBENet-v1. We then split the two branches into blocks and connect the blocks in a zigzag manner. With this design, we can generate a multi-logits network with only two branches. As shown in Fig. 1(a), we split the main-branch into three blocks and the sub-branch into two blocks. The images are first fed into the first block of the main-branch ("main1"). The output feature maps then pass through "sub1-sub2" and "main2" respectively. The output feature maps of "main2" then pass through "sub2" and "main3" respectively. Finally, we obtain three output logits from three paths ("main1-main2-main3", "main1-sub1-sub2", "main1-main2-sub2"). If we set more split points, we can get more diverse output logits. Besides ESD-MBENet-v1, we also design ESD-MBENet-v2, shown in Fig. 1(b), because the number of branches of ESD-MBENet-v1 is reduced when the split points are set in deeper layers. ESD-MBENet-v2 can flexibly construct multiple branches without being affected by the backward movement of the split points. Essentially, we construct weight-sharing blocks in ESD-MBENet and exploit their representation ability as fully as possible.
Even though we use weight-sharing blocks to simplify the multi-branch network, the sub-branch is still cumbersome during inference. To discard the sub-branch during inference, we introduce a self-distillation method that transfers knowledge from the multi-branch ensemble network to the main-branch. Specifically, the final ensemble logits, obtained by fusing the logits of every branch, serve as soft labels to distill knowledge into the main-branch. For the intermediate feature maps, the ensemble feature map is used to guide the feature map of the main-branch. After optimization with the SD method, the main-branch performs close to the ensemble network. Therefore, we can prune the sub-branch and use only the main-branch as the inference model.
Compared to previous single-branch networks [4], [5], [6], the inference speed of our proposed model does not become slower. Compared to previous multi-branch networks [7], [8], [9], [10], our proposed ESD-MBENet achieves better performance with a more compact structure by enhancing the capability of the main-branch to learn more and better knowledge. Extensive experiments using VGG16 [11], ResNet50 [12] and DenseNet121 [13] as baseline models on RS benchmark datasets (AID [14], NWPU-RESISC45 [15] and UC-Merced [16]) demonstrate the effectiveness of our proposed ESD-MBENet. Classification results of ESD-MBENet surpass previous methods and reach the SOTA level. To show the generalization of our model, we also conduct experiments on natural scene classification datasets (CIFAR and tiny-ImageNet). The results are also encouraging. Our main contributions can be summarized as follows:
• We propose a compact yet efficient multi-branch ensemble network embedded with a self-distillation method for RS scene classification to overcome the interference of different geographical elements in RS images.
• We insert a self-distillation method in the ensemble network to distill knowledge to the main-branch, which can further simplify the whole network.
• Our proposed ensemble networks and main-branch network all achieve better classification results than previous SOTA networks on RS image datasets and natural scene datasets.

II. RELATED WORKS
A. RS Image Classification
With the development of artificial intelligence, RS scene classification methods have transitioned from handcrafted feature extraction to deep learning feature extraction.
Since the appearance of back-propagation neural networks, deep learning has developed quickly. The works in [31], [11], [12], [13] achieve remarkable improvement in the image classification task. Based on these baseline models, RS image classification technologies have improved rapidly [32], [33], [34]. In [35], a deep-learning-based feature-selection method is proposed to achieve feature abstraction of RS images. In [36], unlabeled RS images are classified with improved speed and accuracy compared to traditional machine learning algorithms. Our proposed ESD-MBENet also uses a deep learning method for RS scene classification.

B. Multi-Branch Network
A multi-branch network can obtain abundant information from multiple perspectives of the input images, which helps the network form a more comprehensive representation of the images and improves the generalization of the classifier. In the RS scene classification task, many researchers explore the potential of multi-branch networks by fusing features of different branches [37], [38], [39]. The works in [40], [41], [42] achieve feature fusion by extracting multiple spectral and spatial features and concatenating them, which improves the accuracy of RS image classification. In [7], a multi-branch lightweight network is used to extract image features, and a graph model is built on the learned features. In [8], a fine-grained branch and a coarse branch are adopted to obtain the features in images.
Our proposed ESD-MBENet uses fewer modules and more weight-sharing blocks to build a multi-branch network, which can obtain multi-view information from multiple branches, so that the network has more references when making final decisions.
Fig. 2. Overall framework of the ESD-MBENet-v1. The bottom branch is the main-branch. The sub-branch and the main-branch use the same backbone, namely VGG16, ResNet50 or DenseNet121. Compared to the main-branch, the sub-branch adds an attention module, such as SE, CAM or Dropout. It is designed to overcome the interference of different geographic elements in RS images and the complexity of the model. Four branches are used for online training. We self-distill the final output logits and intermediate feature maps. In addition, the hard labels are also used to optimize the network. To reduce the complexity of the model, only the main-branch is used to complete the model inference.

C. Knowledge Distillation and Self-Distillation
Knowledge distillation is a concept proposed by Hinton [43]. The main purpose of knowledge distillation and self-distillation is model compression. Knowledge distillation aims to guide a simple and relatively weak student network to learn from a complex but superior teacher network [44], [45]. The student network learns from the teacher network to improve its discriminative performance for RS image classification. Self-distillation distills knowledge from the network itself, without the assistance of external networks or models. In [46], a weighted combination of multiple teacher networks is proposed to guide the student. A knowledge distillation framework that makes the outputs of the student and teacher models match is proposed in [47]. A discriminative modality distillation approach is introduced in [48], in which the teacher is trained on multimodal data and the student model then learns from the teacher model to improve the performance of RS image classification. To address the problem of network overfitting due to noisy data, a novel noisy label distillation method (NLD) is proposed in [49].
Regarding the self-distillation method, there is little research in RS image classification task. We propose an end-to-end compact multi-branch ensemble network ESD-MBENet that uses self-distillation to improve the main-branch performance.

III. PROPOSED NETWORK
A. Overview of ESD-MBENet
To overcome the interference of different geographic elements in RS images, we propose two versions of ESD-MBENet. Fig. 2 shows the structure of ESD-MBENet-v1. To construct a multi-branch network with fewer modules, we set multiple split points in the network. First, we design a network of two branches, the main-branch and the sub-branch, which use the same backbone. To counter the loss of multi-branch diversity caused by partial weight-sharing, attention modules such as SE, CAM or Dropout are added to the sub-branch. If the sub-branch shares the first "conv" with the main-branch, we can set split points after "layer1" and "layer2" of the main-branch, and then send the main-branch feature maps to the sub-branch at the positions of the split points. The main-branch and the sub-branch can also share "layer1", in which case the corresponding split points move backwards. In the training phase, the multi-branch output is obtained by fusing the output logits. To speed up inference and reduce model complexity, we use only the main-branch for inference.
As the split points move backwards, more and more weights are shared by the network, and the number of branches that can be constructed is reduced. This leads to a lack of network diversity and a deterioration of network performance. To solve this problem while still building a multi-branch network, we propose ESD-MBENet-v2. The network structure is shown in Fig. 3. ESD-MBENet-v2 has four branches. The same backbone is used in all four branches, and a different attention module (SE, CAM or Dropout) is added to each sub-branch. There is no weight-sharing among these branches. ESD-MBENet-v2 is not limited in the number of branches: at the corresponding split point, we can add as many branches as desired. Taking into account the number of parameters, model complexity and performance improvement, we choose four branches in our experiments. Similar to ESD-MBENet-v1, multiple branches are used for training, and only the main-branch is used for inference.
In short, ESD-MBENet-v1 uses fewer modules to construct multi-branch structures, and ESD-MBENet-v2 can construct multi-branch networks more flexibly. Both ESD-MBENet-v1 and ESD-MBENet-v2 aim to use as few modules as possible to build multiple branches, that is, to share as many weights as possible. In our experiments, both versions of ESD-MBENet achieve better RS image classification performance than previous methods.

B. Attention Modules
The diversity of multi-branch networks is very important for feature fusion. In our proposed ESD-MBENet, if the sub-branch and the main-branch are exactly the same, the network may lack diversity in image feature extraction and cannot describe image features from multiple perspectives. Therefore, we adopt the following design: the sub-branch and the main-branch use the same backbone, the main-branch has no extra modules, and an attention module is added to the sub-branch. Compared with tasks such as image segmentation, image classification depends less on attention between pixels. We therefore focus on enhancing the attention between feature map channels. We use the SE and CAM modules proposed in SENet [50] and DANet [51] to add attention to the feature maps.
The structures of the SE and CAM modules are shown in Fig. 4. The SE module takes the input feature maps, passes them through a global average pooling layer, two fully connected layers and the sigmoid function (Eq. 1), and then multiplies the result with the original input feature maps. In this way, the global information of the images is integrated into the feature maps, which improves the sensitivity of the network to individual channels and makes the feature maps contain richer information.
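The SE steps described above (global average pooling, two fully connected layers, sigmoid gating, channel-wise rescaling) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `squeeze_excite`, the weight matrices `w1` and `w2`, the omission of biases, and the ReLU between the two fully connected layers (taken from SENet [50]) are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def squeeze_excite(fmap, w1, w2):
    """Hypothetical SE sketch.

    fmap: list of C channels, each an H x W nested list of floats.
    w1:   reduction weights, shape (C_reduced x C)   [assumed, no bias]
    w2:   expansion weights, shape (C x C_reduced)   [assumed, no bias]
    """
    # Squeeze: global average pooling gives one value per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Scale: multiply every channel of the input by its gate.
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, fmap)]
```

With all-zero weights every gate equals sigmoid(0) = 0.5, so each channel is simply halved, which is a convenient sanity check for the rescaling step.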
The CAM module selectively emphasizes interdependent channel maps by integrating the related features among all channel maps. The input feature maps are reshaped, transposed and multiplied to obtain a C×C matrix, which, after passing through a softmax layer, is the channel attention map. This attention map is then multiplied with the reshaped input feature maps. Finally, the result is reshaped back and added to the original input feature maps. In this way, the attention mechanism is added to the feature maps. The specific process of the SE and CAM modules is consistent with [50] and [51].
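A minimal pure-Python sketch of the channel attention steps as described in this paragraph follows. It is an illustration under stated assumptions rather than the module of [51]: the function name `channel_attention` is hypothetical, and any additional details of the original module (for example a learnable scaling of the attention output) are deliberately left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(fmap):
    """Hypothetical CAM sketch. fmap: C channels, each an H x W nested list."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    # Reshape C x H x W -> C x (H*W).
    flat = [[v for row in ch for v in row] for ch in fmap]
    # Multiply by the transpose: C x C channel affinity matrix.
    energy = [[sum(a * b for a, b in zip(flat[i], flat[j])) for j in range(C)]
              for i in range(C)]
    # Row-wise softmax gives the channel attention map.
    attn = [softmax(row) for row in energy]
    # Re-weight the reshaped input with the attention map.
    mixed = [[sum(attn[i][j] * flat[j][n] for j in range(C)) for n in range(H * W)]
             for i in range(C)]
    # Reshape back to C x H x W and add the original input (residual sum).
    return [[[mixed[c][h * W + w] + fmap[c][h][w] for w in range(W)]
             for h in range(H)] for c in range(C)]
```

The output has the same shape as the input, which is what allows the module to be dropped in front of any layer of the sub-branch.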
In addition, in RS image classification each image corresponds to one category, but not all pixel values in the image provide useful information for the classification result, and some pixels may even interfere with the judgment. Therefore, we also use a Dropout module with a drop probability of 0.2, which helps improve the generalization performance of the network.
These three modules are all independent modules, which can be embedded anywhere in the network without affecting other structure of the network. They have strong flexibility.

C. Multi-Branch Ensemble
To address the interference between different geographic elements in RS images, we propose ESD-MBENet. ESD-MBENet-v1 constructs multiple branches from a two-branch network (main-branch and sub-branch) by setting different split points, and then fuses the features of the multiple branches. Assume that the main-branch and the sub-branch share the first "conv", and that we build a total of four branches as in Fig. 2. The branches are constructed as follows. The first branch is the main-branch and the second branch is the sub-branch. The third branch passes the feature maps obtained after "layer1" of the main-branch through the rest of the sub-branch, and the fourth branch passes the feature maps obtained after "layer2" of the main-branch through the rest of the sub-branch. Suppose the main-branch is divided into "conv", "main-layer1", "main-layer2", "main-layer3" and "main-layer4", denoted $f_0, f_1, f_2, f_3, f_4$, and the sub-branch is divided into "sub-layer1", "sub-layer2", "sub-layer3" and "sub-layer4", denoted $g_1, g_2, g_3, g_4$. To ensure diversity among the weight-sharing branches, the sub-branch adds the same attention module (SE, CAM or Dropout) in front of each layer, compared to the main-branch.
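The four-branch construction described above can be written out as a small path-enumeration sketch. The function name `build_v1_paths` and the encoding of a split point as "number of main-branch layers executed before switching to the sub-branch" are assumptions made for illustration; the layer names are those used in the text.

```python
def build_v1_paths(main_layers, sub_layers, split_points):
    """Hypothetical sketch: list the layer sequence of every ESD-MBENet-v1 branch.

    main_layers:  ["conv", "main-layer1", ..., "main-layer4"] (conv is shared)
    sub_layers:   ["sub-layer1", ..., "sub-layer4"]
    split_points: for each extra branch, how many main layers run before
                  the feature maps are handed to the rest of the sub-branch
    """
    stem, main_rest = main_layers[0], main_layers[1:]
    # Branch 1: the main-branch. Branch 2: the full sub-branch after the shared conv.
    paths = [[stem] + main_rest, [stem] + sub_layers]
    # Further branches: main layers up to the split point, then the remaining sub layers.
    for k in split_points:
        paths.append([stem] + main_rest[:k] + sub_layers[k:])
    return paths


main = ["conv", "main-layer1", "main-layer2", "main-layer3", "main-layer4"]
sub = ["sub-layer1", "sub-layer2", "sub-layer3", "sub-layer4"]
for path in build_v1_paths(main, sub, split_points=[1, 2]):
    print(" -> ".join(path))
```

Moving the split points to deeper layers shortens the list of distinct paths that can be formed, which is the limitation that motivates ESD-MBENet-v2.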
When the split points of ESD-MBENet-v1 gradually move backwards, the amount of weight-sharing among the branches keeps increasing and the number of branches that can be formed decreases, which hinders the extraction of image features. Therefore, we propose ESD-MBENet-v2. Compared with ESD-MBENet-v1, ESD-MBENet-v2 can flexibly add any number of branches and is not affected by the movement of split points. Suppose the split point of ESD-MBENet-v2 is set to "layer2"; we can then build four branches as shown in Fig. 3. To ensure the diversity of the multi-branch output features, although the same backbone is used in all four branches, each sub-branch adds a different attention module: the main-branch adds no additional module, sub-branch1 adds an SE module, sub-branch2 adds a CAM module, and sub-branch3 adds a Dropout module. There is no weight-sharing among the four branches. The main-branch is again divided into "conv", "main-layer1", "main-layer2", "main-layer3" and "main-layer4", denoted $f_0, f_1, f_2, f_3, f_4$. The sub-branches are denoted $l_1$, $l_2$ and $l_3$. The four branches are expressed in terms of these blocks, where x is a batch of images. The ESD-MBENet-v2 multi-branch ensemble algorithm is shown in Alg. 2.

D. Self-Distillation
With the compact multi-branch ensemble network, the accuracy of ESD-MBENet for RS image classification can be significantly improved. To shorten inference time and reduce the complexity of ESD-MBENet, we propose to use self-distillation to improve the performance of the main-branch, so that we can prune all the sub-branches and use only the main-branch for inference.
In ESD-MBENet, the self-distillation process includes two parts: distillation of the final output logits and distillation of the feature maps. For the output logits distillation, we obtain the ensemble output logits from the multiple branches as in Eq. 2 and pass them through the softmax function as in Eq. 3. The result serves as the teacher, and the output logits of the main-branch serve as the student. We use the KL loss to optimize it. The self-distillation output logits algorithm is shown in Alg. 3.
Here $v_i$ is the logits of the $i$-th branch, $N$ is the number of all branches, and $M$ is the number of all classes.
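The logits distillation step can be sketched in a few lines of pure Python. Several choices here are assumptions rather than statements of the paper's equations: the ensemble logits are taken as the plain average of the branch logits, the main-branch logits are also passed through softmax, no temperature is used, and the divergence is computed as KL(teacher || student). The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over M class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_logits(branch_logits):
    """Fuse N branch logit vectors; averaging is an assumption of this sketch."""
    n = len(branch_logits)
    return [sum(col) / n for col in zip(*branch_logits)]


def kl_divergence(p_teacher, p_student):
    """KL(p_teacher || p_student) for two discrete distributions."""
    return sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)


def logits_distillation_loss(branch_logits, main_logits):
    """Teacher: softmax of the ensemble logits. Student: softmax of the main-branch logits."""
    p_e = softmax(ensemble_logits(branch_logits))
    p_m = softmax(main_logits)
    return kl_divergence(p_e, p_m)
```

The loss is zero when the main-branch already reproduces the ensemble distribution and grows as the two distributions diverge, so a single term suffices regardless of how many sub-branches feed the ensemble.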

Algorithm 3: Self-Distillation output logits algorithm
Input: The total number of branches $N$, the $i$-th branch output logits $v_i$, main-branch output logits $v_s$.
Output: The self-distillation loss $L_{KL}^{em}$ between ensemble output logits and main-branch output logits.
1: Compute $v_t$ using Eq. 2 and Eq. 3
2: Compute $L_{KL}^{em}$ with $p_e = v_t$, $p_m = v_s$ using Eq. 12

With respect to the feature maps distillation, the output feature maps (C×H×W) of the multiple branches after "layer4" are added along the channel direction to obtain new feature maps (H×W) as in Eq. 4.
Here $f_{kc}^{H \times W}$ is the feature map of the $k$-th branch in the $c$-th channel, $C$ is the total number of channels, and $g_k^{H \times W}$ is the new feature map of the $k$-th branch.
Each of the feature maps (H×W) is then normalized: the mean value $x_a$ is subtracted from each pixel value $g_{kij}$ on the feature map, and the result is divided by the standard deviation $x_s$, as in Eq. 5, Eq. 6, Eq. 7 and Eq. 8. This gives the normalized feature maps $F_k^{H \times W}$, $k = 1, 2, ..., N$. We average these feature maps to obtain a teacher feature map $F_e^{H \times W}$ as in Eq. 9. The student is the normalized main-branch feature map $F_m^{H \times W}$. The distribution of the feature maps directly affects the output logits. Therefore, if the main-branch feature map (student) can learn knowledge equivalent to that of the multi-branch ensemble feature map (teacher), the overall performance of the main-branch will improve. The MSE loss is used to optimize it. Compared with the feature map mutual learning mechanisms proposed by previous researchers, ours is a simple feature map learning mechanism, as shown in Alg. 4.
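The feature map distillation pipeline (channel-wise sum, per-map normalization, averaging over branches, MSE against the main-branch map) can be sketched as follows. This is an illustration, not the authors' code: the function names are hypothetical, the population standard deviation is assumed, a small `eps` is added to avoid division by zero, and the MSE is averaged over the H×W positions.

```python
import math


def channel_sum(fmap):
    """Collapse C x H x W feature maps to one H x W map by summing over channels."""
    H, W = len(fmap[0]), len(fmap[0][0])
    return [[sum(ch[i][j] for ch in fmap) for j in range(W)] for i in range(H)]


def normalize(g, eps=1e-8):
    """Subtract the mean and divide by the (population) standard deviation; eps is assumed."""
    vals = [v for row in g for v in row]
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return [[(v - mean) / (std + eps) for v in row] for row in g]


def feature_distillation_loss(branch_fmaps, main_fmap):
    """MSE between the averaged normalized branch maps (teacher) and the main map (student)."""
    normalized = [normalize(channel_sum(f)) for f in branch_fmaps]
    n = len(normalized)
    H, W = len(normalized[0]), len(normalized[0][0])
    teacher = [[sum(F[i][j] for F in normalized) / n for j in range(W)] for i in range(H)]
    student = normalize(channel_sum(main_fmap))
    squared = sum((teacher[i][j] - student[i][j]) ** 2 for i in range(H) for j in range(W))
    return squared / (H * W)
```

Because every branch map is reduced to a single normalized H×W map before averaging, the loss does not depend on the number of channels of each branch or on the number of sub-branches.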

E. Backward Propagation of ESD-MBENet
The proposed ESD-MBENet is optimized by continuously reducing the total loss objective function in Eq. 10. We use the cross-entropy loss (Eq. 11), the Kullback-Leibler divergence (KL) loss (Eq. 12) and the Mean Square Error (MSE) loss (Eq. 13) to make ESD-MBENet converge quickly.

$$L_{MSE}^{em} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \big( (F_e(x))_{ij} - (F_m(x))_{ij} \big)^2 \quad (13)$$

where $L_{ce_i}$ represents the cross-entropy loss function, obtained from each branch respectively, $L_{KL}^{em}$ represents the KL loss between the ensemble output logits $p_e$ and the main-branch output logits $p_m$, and $L_{MSE}^{em}$ is the MSE loss between the ensemble feature map $F_e(x)$ and the main-branch feature map $F_m(x)$. $\alpha$, $\beta$ and $\lambda$ are the weight coefficients of each loss function. Our proposed ESD-MBENet uses only two loss functions in the distillation of output logits and feature maps, regardless of the number of sub-branches. This greatly simplifies tuning and optimization.

[Table I: structure of the backbones [11], [12], [13]. The layer name corresponds to the layers in Fig. 2 and Fig. 3.]
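As a rough illustration of how the three loss terms might be combined, the sketch below assumes that the per-branch cross-entropy losses are summed and that α, β and λ weight the cross-entropy, KL and MSE terms in that order. Both the pairing of coefficients with terms and the default values are assumptions of this sketch, and the function name `total_loss` is hypothetical.

```python
def total_loss(ce_losses, kl_loss, mse_loss, alpha=1.0, beta=1.0, lam=1.0):
    """Hypothetical weighted sum of the training objectives.

    ce_losses: cross-entropy loss of every branch against the hard labels
    kl_loss:   logits distillation loss (ensemble -> main-branch)
    mse_loss:  feature map distillation loss (ensemble -> main-branch)
    alpha, beta, lam: assumed weights for the three terms
    """
    return alpha * sum(ce_losses) + beta * kl_loss + lam * mse_loss


# Example with made-up loss values for a four-branch network.
print(total_loss([0.9, 1.1, 1.0, 1.2], kl_loss=0.05, mse_loss=0.02))
```

Note that adding another sub-branch only adds one cross-entropy term; the two distillation terms stay as they are.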

A. Datasets
We use three RS datasets (AID, NWPU-RESISC45, UC-Merced) to verify the effectiveness of ESD-MBENet in RS image classification and to allow convenient comparison with other methods. The AID dataset has 10,000 RS images in 30 categories, with an image size of 600×600 and 220∼420 images per category. This dataset was released by Huazhong University of Science and Technology and Wuhan University in 2017. The NWPU-RESISC45 dataset has 31,500 images in 45 categories, each category has 700 images, and the image size is 256×256. The dataset was created by Northwestern Polytechnical University. The UC-Merced dataset has only 2100 images in 21 categories. The image size is 256×256, and each category has 100 images. This dataset is extracted from the USGS National Map Urban Area Imagery. In addition, to verify the generalization of ESD-MBENet, we also conduct experiments on natural scene image classification datasets (CIFAR, tiny-ImageNet).

B. Implementation Details
We experiment with three backbones (VGG16, ResNet50, DenseNet121) on three datasets (AID, NWPU-RESISC45, UC-Merced). The detailed structure of the backbones is shown in Tab. I. Since most previously proposed methods use VGG16 as the backbone, we select VGG16 to facilitate comparison with them. ResNet50, proposed by He et al. [12], outperforms VGG16 and is widely used, so we also select it as one of our backbones. For deeper networks, we select DenseNet121, which allows comparison with the SOTA results of KFBNet [52]. The optimizer is Stochastic Gradient Descent (SGD) with momentum, and the momentum parameter is set to 0.9. On the AID dataset, images are resized to 256×256 during training, and resized to 288×288 and cropped to 256×256 during testing. The models are trained for 100 epochs in each experiment on the AID dataset, and the learning rate is divided by 10 at the 40th, 70th and 90th epochs. The training images of NWPU-RESISC45 and UC-Merced are resized to 224×224, and the test images are resized to 256×256 and cropped to 224×224. The models are trained for 120 epochs in each experiment on the NWPU-RESISC45 dataset, and the learning rate is divided by 10 at the 70th, 90th and 110th epochs. ImageNet pretrained parameters are loaded in each layer when training. The code is implemented using the PyTorch framework. The experiments are run on an NVIDIA GTX 1080 Ti.
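The step-wise learning rate schedule described above can be expressed as a short function. This is a sketch: the initial learning rate is not stated in the text, so `base_lr=0.01` below is a purely hypothetical value, and epochs are assumed to be counted from 0. The behaviour mirrors what a multi-step scheduler with a decay factor of 0.1 does.

```python
def step_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate at a given epoch: multiplied by gamma at every milestone reached."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)


# AID: 100 epochs, decay at epochs 40, 70 and 90 (base_lr is a hypothetical value).
aid_milestones = [40, 70, 90]
for epoch in (0, 39, 40, 70, 90, 99):
    print(epoch, step_lr(epoch, base_lr=0.01, milestones=aid_milestones))
```

For NWPU-RESISC45 the same function applies with milestones `[70, 90, 110]` over 120 epochs.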

C. Results
The experimental results are shown in Tab. II. We compare ESD-MBENet with previously proposed algorithms using the same backbone and the same dataset. To reduce experimental error, we run each experiment five times and report the mean and standard deviation of the five runs.
1) AID: To make a fair comparison with previous methods, the setting of ESD-MBENet on the AID dataset is the same as theirs. That is, either 20% of the data is randomly selected as the training set and 80% as the test set, or 50% as the training set and 50% as the test set. With VGG16 as the backbone, the results of ESD-MBENet-v1 and ESD-MBENet-v2 are 94.10% and 94.12% on 20% training data. Although they do not exceed the SOTA, they are very close to it. On 50% training data the results are 97.15% and 97.3%. With ResNet50 as the backbone and 20% and 50% of the data as the training set, ESD-MBENet-v1 reaches 96.0% and 98.54% accuracy, and ESD-MBENet-v2 reaches 95.81% and 98.66%. With DenseNet121 as the backbone, ESD-MBENet-v1 achieves 96.2% and 98.85%, and ESD-MBENet-v2 achieves 96.39% and 98.4%. Compared with KFBNet, we add no additional elements to the network in the inference stage, that is, only the main-branch is used for inference, yet our results are about 1% higher with DenseNet121.
2) NWPU-RESISC45: In this experiment, we randomly select either 20% of the dataset for training and 80% for testing, or 10% for training and 90% for testing. With VGG16 as the backbone and 10% and 20% of the data for training, the accuracy of ESD-MBENet-v1 is 90.29% and 93.48%, and the accuracy of ESD-MBENet-v2 is 90.25% and 93.42%. With ResNet50 as the backbone, ESD-MBENet-v1 achieves 92.5% and 95.58%, and ESD-MBENet-v2 achieves 93.03% and 95.24%. With DenseNet121 as the backbone, ESD-MBENet-v1 reaches 93.24% and 95.5%, and ESD-MBENet-v2 reaches 93.05% and 95.36%. In most cases, ESD-MBENet reaches the SOTA. In the few cases where it does not, it is very close to the previous SOTA results. In addition, ESD-MBENet is simpler at inference than previous methods.

3) UC-Merced: The accuracy of previously proposed methods on the UC-Merced dataset has nearly reached its limit, and the same holds for our ESD-MBENet. During the experiments, even with VGG16 as the backbone, the accuracy sometimes reaches 100%. There are a total of 420 test images. This result means that in each experiment at most one image is predicted incorrectly or all predictions are correct. The average accuracy over multiple tests reaches 99.81% or 99.86%.
4) CIFAR-10/100 and tiny-ImageNet: To verify the generalization performance of ESD-MBENet, we also experiment on the common scene classification datasets CIFAR-10/100 and tiny-ImageNet. We select ResNet20/32/44/56 and ResNet18/34 as backbones, and the detailed structure is given in Tab. I. ResNet20/32/44/56 represents a shallower network, and ResNet18/34 represents a deeper network. The experimental procedure is the same as that used for RS images.
As shown in Tab. III, the results of our proposed ESD-MBENet on CIFAR-10/100 are higher than the baseline network for all backbones. On CIFAR-100 in particular, the improvement is more than 3%. Compared with the previous methods [59], [60], our results are also more than 1% higher and reach the SOTA. Tab. IV shows the ESD-MBENet results with ResNet18 and ResNet34 on CIFAR-100 and tiny-ImageNet. Our results on the tiny-ImageNet dataset exceed the baseline by 3%∼5%.

D. Confusion Matrix
In RS image scene classification, the confusion matrix is commonly used as an evaluation criterion to judge the effect of a proposed algorithm. The confusion matrix represents the difference between the predicted labels and the true labels. It shows how many samples of each class are predicted correctly and how many are predicted incorrectly, and, for each error, which label was actually predicted in place of the true one. The confusion matrix is intuitive and clear, which is very useful for data analysis. In this paper, a confusion matrix is used to check the effectiveness of ESD-MBENet. Fig. 5 shows the confusion matrix for 20% AID training data with the DenseNet121 network. It can be seen that ESD-MBENet-v1 has a prediction error rate of less than 4% for almost all classes, and the prediction accuracy of some classes even reaches 99% and 100%. Fig. 6 shows the confusion matrix of the ESD-MBENet-v1 (ResNet50) predictions on the randomly selected 20% NWPU-RESISC45 training split. The abscissa of the confusion matrix represents the predicted label, and the ordinate represents the true label.
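For reference, a confusion matrix with the layout described above (rows for true labels, columns for predicted labels) can be computed with a few lines of Python. The function names and the toy labels are illustrative only and do not correspond to any result in the paper.

```python
def confusion_matrix(true_labels, pred_labels, num_classes):
    """Count matrix m where m[t][p] is the number of samples of true class t predicted as p."""
    m = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, pred_labels):
        m[t][p] += 1
    return m


def per_class_accuracy(m):
    """Diagonal entry divided by the row sum, i.e. the recall of each true class."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(m)]


# Toy example with two classes.
cm = confusion_matrix([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], num_classes=2)
print(cm)
print(per_class_accuracy(cm))
```

Off-diagonal entries in a row show which classes a given true class is confused with, which is how the per-class error rates in Fig. 5 and Fig. 6 are read.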

E. Ablation Study
To verify the effects of the multi-branch ensemble and self-distillation in our proposed ESD-MBENet, as well as the robustness of the network to different attention modules in the sub-branch, we conduct the following ablation experiments.
1) Comparison between ESD-MBENet and baseline: To address the interference from different characteristics of different geographical elements in RS images, ESD-MBENet introduces a multi-branch ensemble network in the training process, allowing the network to explore image information from multiple perspectives and pay more attention to the classification information of the image. Because self-distillation on the output logits and feature maps lets us keep only the main-branch at inference, there is no difference in inference speed compared with the baseline network. We use VGG16, ResNet50 and DenseNet121 as the backbones respectively, and conduct comparative experiments on the AID and NWPU-RESISC45 datasets. As shown in Tab. V, ESD-MBENet-v1 and ESD-MBENet-v2 both improve on the baseline network by more than 1%. This supports the view that ESD-MBENet learns more image information in feature extraction through multi-branch feature ensemble and self-distillation, which is helpful for RS image classification.
Fig. 9. The Grad-CAM comparison of ResNet50 baseline network and ESD-MBENet on the 20% AID training dataset. We randomly select four images from the AID dataset as representatives, and show the focus of different parts of the baseline and ESD-MBENet, such as "layer1", "layer2" and "layer3" during RS image classification. The warmer the color, the higher the degree of attention.

2) Self-distillation in ESD-MBENet feature maps: Distillation is essentially a process in which a student network continuously learns from and imitates a teacher network to enhance its own knowledge. In this experiment, we use the ESD-MBENet itself as the teacher, so the process can be called self-distillation. We mainly apply self-distillation to the output logits and the intermediate feature maps. Consider how a student learns: if the teacher only gives the standard answer each time and lets the student explore the learning process alone, the student will in most cases still learn well. However, if the teacher also provides guidance and advice in the middle of the learning process, this may help the student even more. Therefore, we propose to use self-distillation in the feature maps. To evaluate this idea, we ran comparative experiments on the AID and NWPU-RESISC45 datasets with the VGG16, ResNet50 and DenseNet121 networks respectively. As shown in Tab. V, student networks that self-distill both the output logits and the feature maps learn better than those that self-distill only the output logits.
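The combination of logit-level and feature-level self-distillation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the exact loss used in ESD-MBENet: we assume a temperature-softened KL divergence on the logits and a mean squared error on flattened feature maps, and the names `temperature`, `alpha` and `beta` are hypothetical hyperparameters introduced only for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def logit_distill_loss(teacher_logits, student_logits, temperature=4.0):
    """Assumed logit term: softened KL between ensemble teacher and main-branch."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return (temperature ** 2) * kl_divergence(p_teacher, p_student)

def feature_distill_loss(teacher_feat, student_feat):
    """Assumed feature term: mean squared error between flattened feature maps."""
    n = len(teacher_feat)
    return sum((t - s) ** 2 for t, s in zip(teacher_feat, student_feat)) / n

def self_distill_loss(teacher_logits, student_logits, teacher_feat, student_feat,
                      temperature=4.0, alpha=1.0, beta=1.0):
    """Weighted sum of the two terms; alpha and beta are hypothetical weights."""
    return (alpha * logit_distill_loss(teacher_logits, student_logits, temperature)
            + beta * feature_distill_loss(teacher_feat, student_feat))
```

Setting `beta=0.0` recovers the logits-only variant, which corresponds to the weaker configuration in the comparison above. In an actual training run the teacher tensors would normally be detached from the gradient graph so that only the student is pulled toward the teacher; that detail is outside this sketch.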
3) ESD-MBENet outputs of ensemble multi-branch and main-branch: Distillation is a process of mutual assistance. When the student learns well, the teacher may also be inspired, although the effect of this inspiration is sometimes small. Therefore, we also compare the output of the multi-branch ensemble with the output of the main-branch, as well as their parameters and FLOPs in inference. As shown in Tab. VI, the ensemble teacher output does learn slightly more than the main-branch student network, but it requires more parameters and FLOPs than the main-branch in inference. This is not friendly for practical applications. Therefore, considering all factors, we choose to use only the main-branch for inference.
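The trade-off between ensemble inference and main-branch inference can be illustrated with a small sketch. It assumes that the ensemble output is the plain average of the branch logits and that the branches share a common stem; neither the fusion rule nor the parameter counts below are taken from the paper, and all numbers are hypothetical placeholders.

```python
def ensemble_logits(branch_logits):
    """Average per-class logits across branches (assumed fusion rule)."""
    n_branches = len(branch_logits)
    return [sum(column) / n_branches for column in zip(*branch_logits)]

def predict(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)

def param_count(shared, branch_specific, branches):
    """Parameters needed at inference: shared stem plus the branches kept."""
    return shared + sum(branch_specific[name] for name in branches)

# Hypothetical logits for one image and three classes.
logits = {
    "main": [2.0, 1.0, 0.5],
    "sub1": [1.5, 1.8, 0.2],
    "sub2": [2.2, 0.9, 0.1],
}
teacher = ensemble_logits(list(logits.values()))

# Hypothetical parameter counts, in millions.
shared = 10
specific = {"main": 13, "sub1": 13, "sub2": 13}
ensemble_cost = param_count(shared, specific, ["main", "sub1", "sub2"])
main_cost = param_count(shared, specific, ["main"])
```

In this toy case the ensemble and the main-branch predict the same class while the ensemble needs every branch at inference, which mirrors the reasoning for keeping only the main-branch.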

4) Comparison of ESD-MBENet-v1 sub-branch using different attention modules: This experiment verifies that the proposed ESD-MBENet remains effective when the sub-branch uses different attention modules; it concerns ESD-MBENet-v1 only. In the comparison, we add the SE, CAM or Dropout module to the sub-branch. The experimental results are shown in Tab. VII. The multi-branch network we constructed is robust to the addition of different attention modules.
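For readers unfamiliar with the SE option, the following is a minimal sketch of a squeeze-and-excitation block on a single feature map, written with plain Python lists. It follows the standard squeeze, excitation and scale steps; the bias-free fully connected layers, the weight shapes and the function name `se_block` are assumptions made for brevity, and where exactly the module sits inside the ESD-MBENet-v1 sub-branch is not specified here.

```python
import math

def se_block(x, w1, w2):
    """Apply squeeze-and-excitation to x of shape [C][H][W].

    w1 has shape [C // r][C] and w2 has shape [C][C // r] for a reduction
    ratio r; biases are omitted in this sketch.
    """
    # Squeeze: global average pooling gives one value per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: reweight every channel by its gate.
    return [[[v * g for v in row] for row in ch] for ch, g in zip(x, gates)]
```

With all-zero weights every gate equals 0.5, so each channel is simply halved; trained weights instead emphasize informative channels and suppress the others.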
F. Visualization and Analysis
1) Training Curves: To compare the convergence of ESD-MBENet with the baseline network more intuitively, we plot the training curves. As shown in Fig. 8, we use ResNet50 as the backbone and train on 20% or 50% of the AID dataset. The curves show that both ESD-MBENet-v1 and ESD-MBENet-v2 perform better overall than the baseline, and that the difference between ESD-MBENet-v1 and ESD-MBENet-v2 is very small. We train for a total of 100 epochs. The accuracies of the baseline and ESD-MBENet networks flatten out around the 50th epoch. In the later stage of training, ESD-MBENet fluctuates less and is more stable than the baseline.
2) T-SNE: T-SNE maps data from a high-dimensional space to a low-dimensional space, which makes the difference between algorithms for RS image classification clearly visible. Therefore, we use T-SNE to show a 2-dimensional mapping of the final output results. As shown in Fig. 7, we compare the baseline, ESD-MBENet-v1 and ESD-MBENet-v2 networks with DenseNet121 as the backbone. Compared with the baseline, ESD-MBENet-v1 and ESD-MBENet-v2 obtain larger inter-class differences, which is very useful for accurate classification. The more similar two categories are, the larger the gap between them must be for good classification; with a large gap the network does not confuse the categories. Therefore, ESD-MBENet achieves a better classification effect than the baseline.
3) Grad-CAM: Grad-CAM is a popular visualization method that makes it easier to understand how convolutional neural networks learn a given task, such as image classification or image segmentation. We visualize the Grad-CAM results of ESD-MBENet and compare them with those of the baseline network. In the experiment, the baseline, ESD-MBENet-v1 and ESD-MBENet-v2 models, trained with the ResNet50 backbone on the 20% AID dataset, are used to draw Grad-CAM on four randomly selected images. To compare the learning effect of the network at different stages more clearly, we show the Grad-CAM at different depths of the network, such as "layer1", "layer2" and "layer3", which represent the learning focus of the network from the shallow stage to the deep stage. Fig. 9 shows that at "layer2" the ESD-MBENet network pays more attention to the objects to be classified than the baseline network. At "layer3", ESD-MBENet shows more focus than the baseline, which means ESD-MBENet can extract more information from the images and then transfer it to the deeper layers. This is more conducive to network learning.
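The heat maps discussed above come from the standard Grad-CAM computation: each channel of a chosen layer is weighted by the spatial average of the class-score gradient, the weighted channels are summed, and a ReLU keeps only positive evidence. The sketch below reproduces that computation with plain Python lists. It assumes the activations and gradients of the chosen layer (for example "layer2") have already been extracted from the network; the function names are hypothetical.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heat map from one layer.

    activations, gradients: nested lists of shape [K][H][W], where gradients
    holds the derivative of the class score w.r.t. each activation.
    Returns a heat map of shape [H][W].
    """
    n_channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])
    # Channel weights: global average of the gradients over spatial positions.
    weights = [sum(sum(row) for row in gradients[k]) / (height * width)
               for k in range(n_channels)]
    # Weighted sum over channels, followed by ReLU.
    return [[max(0.0, sum(weights[k] * activations[k][i][j]
                          for k in range(n_channels)))
             for j in range(width)]
            for i in range(height)]

def normalize(cam):
    """Scale a heat map to [0, 1] for display; an all-zero map is returned as is."""
    peak = max(max(row) for row in cam)
    if peak == 0:
        return cam
    return [[v / peak for v in row] for row in cam]
```

The normalized map is typically upsampled to the input resolution and overlaid on the image, with warmer colors marking higher values.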

V. CONCLUSION
In this paper, we design ESD-MBENet-v1 and ESD-MBENet-v2 as compact multi-branch ensemble networks to address the interference from the different characteristics of different geographical elements in RS images. ESD-MBENet-v1 uses as few modules as possible to build as many branches as possible, but as the split points move backwards, the number of branches that can be built decreases. Therefore, we propose ESD-MBENet-v2, which can build multiple branches flexibly and achieves the greatest possible weight-sharing. Although the multi-branch construction greatly improves network performance, it makes the model complex in the inference stage, which reduces inference efficiency and speed. We therefore propose self-distillation, distilling the logits and the intermediate feature maps, so that the main-branch network reaches the performance of the whole model. In this way, only the main-branch is used for inference. Experiments show that the proposed ESD-MBENet achieves better classification results than previous SOTA networks on RS datasets. In addition, ESD-MBENet also shows strong advantages in natural scene image classification.