EMMCNN: An ETPS-Based Multi-Scale and Multi-Feature Method Using CNN for High Spatial Resolution Image Land-Cover Classification

Land-cover information is significant for land-use planning, urban management, and environmental monitoring. This paper presented a novel extended topology-preserving segmentation (ETPS)-based multi-scale and multi-feature method using the convolutional neural network (EMMCNN) for high spatial resolution (HSR) image land-cover classification. The EMMCNN first segmented the images into superpixels using the ETPS algorithm with false-color composition and enhancement and built parallel convolutional neural networks (CNNs) with dense connections for superpixel multi-scale deep feature learning. Then, the hand-delineated features of multi-resolution segmentation (MRS) objects were extracted and mapped to superpixels for complementary multi-segmentation and multi-type representation. Finally, a hybrid network consisting of a 1-dimension CNN and a multi-layer perceptron (MLP) with channel-wise stacking and attention-based weighting was designed for adaptive feature fusion and comprehensive classification. Experimental results on four real HSR GaoFen-2 datasets demonstrated the superiority of the proposed EMMCNN over several well-known classification methods in terms of accuracy and consistency, with overall accuracy improved on average by 1.74% to 19.35% for testing images and 1.06% to 8.78% for validating images. It was found that a solution combining an appropriate number of larger scales with multi-type features is recommended for better performance. Efficient superpixel segmentation, networks with strong learning ability, an optimized multi-scale and multi-feature solution, and adaptive attention-based feature fusion were the key points for improving HSR image land-cover classification in this study.


Introduction
Land-cover information reflects the distribution of various natural and man-made ground objects, which is essential for land-use planning, urban management, and environmental monitoring. Remote sensing imagery has provided a wide-range and real-time data source for land-cover mapping in past decades. In particular, with the development of advanced satellite sensors, such as WorldView, SuperView, and GaoFen, high spatial resolution (HSR) images are becoming increasingly available and popular [1][2][3]. However, as the observation scale becomes finer, the difficulty of feature extraction increases.

ETPS Segmentation
Let $p$ denote a pixel of the image and $s = (s_1, \ldots, s_N)$ the set representing the superpixel segmentation, with $M$ the number of superpixels and $N$ the number of pixels in the image. Let $\mu_i$ and $\mu = (\mu_1, \ldots, \mu_M)$ represent the mean position of the $i$-th superpixel and of all superpixels, respectively, and $c_i$ and $c = (c_1, \ldots, c_M)$ represent the mean color of the $i$-th superpixel and of all superpixels, respectively. The energy objective function is defined as [41]

$$E_{\mathrm{mono}}(s, \mu, c) = \sum_p E_{\mathrm{col}}\left(s_p, c_{s_p}\right) + \lambda_{\mathrm{pos}} \sum_p E_{\mathrm{pos}}\left(s_p, \mu_{s_p}\right) + \lambda_b \sum_p \sum_{q \in N_8(p)} E_b\left(s_p, s_q\right) + E_{\mathrm{topo}}(s) + E_{\mathrm{size}}(s) \quad (1)$$

where $N_8(p)$ is the 8-neighborhood of pixel $p$. The energy terms above are expressed in detail as

$$E_{\mathrm{col}}\left(s_p, c_{s_p}\right) = \left(I(p) - c_{s_p}\right)^2 \quad (2)$$

$$E_{\mathrm{pos}}\left(s_p, \mu_{s_p}\right) = \left\|L(p) - \mu_{s_p}\right\|_2^2 \quad (3)$$

where $I(p)$ and $L(p)$ denote the color and location of pixel $p$, respectively, and $c_{s_p}$ and $\mu_{s_p}$ denote the mean color and mean position of the superpixel $s_p$, respectively. $E_{\mathrm{col}}$ is the appearance coherence constraint enforcing the color homogeneity of each superpixel. $E_{\mathrm{pos}}$ is the shape constraint that maintains the regularity of the superpixels: the term weighted by $\lambda_{\mathrm{pos}}$ yields relatively regular superpixels by preventing pixels from lying too far from the superpixel center. $E_b$ is the boundary-length constraint imposing that superpixels should have short boundaries.
The term weighted by $\lambda_b$ promotes good boundary adherence by penalizing overly long boundaries. $E_{\mathrm{topo}}(s)$ is the topology-preservation constraint, which keeps each superpixel connected and evaluates to $\infty$ otherwise. $E_{\mathrm{size}}(s)$ is the minimum-size constraint, which requires each superpixel to be at least 1/4 of its initial size and evaluates to $\infty$ otherwise. Through these combined constraints, the ETPS method can reach a good local optimum of the objective function quickly while better retaining the boundary information of superpixels. The ETPS algorithm uses a coarse-to-fine segmentation scheme to quickly find a local optimum of the objective function. It initializes each superpixel to a square grid cell of the same size and calculates the initial center position and mean color of each superpixel. It then builds L levels, each corresponding to a different block size: the block size at the first level is a quarter of the superpixel grid, the block size at the second level is a quarter of that at the first level, and so on until the block size equals one pixel. At each level, boundary blocks are identified and added to a first-in-first-out queue. Each block popped from the queue is tested to determine whether removing it from its superpixel would break that superpixel's connectivity; if connectivity is preserved, the block is tentatively merged with each neighboring superpixel, and the merge that lowers the objective value is accepted. After an accepted merge, the center positions and mean colors of the changed superpixels are updated, and the new boundary blocks are added to the queue. These steps are repeated until the boundary-block queue is empty. The ETPS method thus reaches a much better local optimum quickly at coarser levels and gradually approaches the final local optimum at finer levels.
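As a concrete illustration of the energy in Equations (1)–(3), the following NumPy sketch (not the authors' implementation) evaluates a simplified objective keeping only the color and position terms, and shows that a labeling whose boundary follows the color edge scores lower energy than one with a misassigned boundary pixel:

```python
import numpy as np

def superpixel_energy(image, labels, positions, lam_pos=0.5):
    """Simplified mono energy: color term plus weighted position term
    (boundary, topology, and size terms omitted for brevity)."""
    energy = 0.0
    for sp in np.unique(labels):
        mask = labels == sp
        mean_color = image[mask].mean(axis=0)     # c_{s_p}
        mean_pos = positions[mask].mean(axis=0)   # mu_{s_p}
        energy += ((image[mask] - mean_color) ** 2).sum()              # E_col
        energy += lam_pos * ((positions[mask] - mean_pos) ** 2).sum()  # E_pos
    return energy

# toy 4x4 image: left half dark, right half bright
h, w = 4, 4
img = np.zeros((h, w, 3))
img[:, 2:] = 1.0
ys, xs = np.mgrid[0:h, 0:w]
pos = np.stack([ys, xs], axis=-1).astype(float)

img_flat, pos_flat = img.reshape(-1, 3), pos.reshape(-1, 2)

good = (xs >= 2).astype(int).ravel()   # boundary follows the color edge
bad = good.copy()
bad[1 * w + 2] = 0                     # one boundary pixel misassigned

assert superpixel_energy(img_flat, good, pos_flat) < superpixel_energy(img_flat, bad, pos_flat)
```

In the full algorithm this energy is not recomputed globally; only the terms of the superpixels touched by a boundary-block move change, which is what makes the coarse-to-fine queue scheme fast.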

CNN-Based Classification
CNNs have become a popular deep learning method in many computer vision and remote sensing image interpretation tasks. Representative CNN methods have developed from LeNet [42] and AlexNet [43] to VGG [44], ResNet [45], and DenseNet [46]. These models are broadly used in remote sensing image interpretation combined with pre-training strategies [47], LBP encoding [48], attention units [49], internal classifiers [50], and multiple comparisons [51]. DenseNet, as a deep CNN model embracing dense shortcut connections between layers, reduces the problems of gradient vanishing and parameter increase, as well as enhances feature reuse and sample utilization [46], making it more appropriate for HSR images. Therefore, dense CNN is adopted to extract discriminative features from data based on different contextual scales.
The structure of CNNs mainly consists of convolutional, pooling, and fully connected layers, and a classifier. The multi-layer network learns detailed and semantic features at lower and higher layers, respectively. Convolutional layers calculate the results on feature maps using local perception and weight sharing strategies, and multiple kernels are used to learn various features. If the size of the input feature map is $m \times m$, then the size of the output feature map becomes $((m - n)/s + 1) \times ((m - n)/s + 1)$ after a convolutional layer with $n \times n$ kernels and sliding step $s$. The feature map is computed by

$$x_k = f\left(w_k * x_{k-1} + b_k\right) \quad (4)$$

where $w_k$ and $b_k$ denote the $k$-th filter and bias, respectively, $*$ is the convolution operator, $x_{k-1}$ and $x_k$ are the input and output feature maps, respectively, and $f(\cdot)$ represents the activation function. The rectified linear unit (ReLU) [43]

$$f(x) = \max(0, x) \quad (5)$$

is a commonly used and robust activation function. Pooling layers are often set following convolutional layers to reduce the size of feature maps, and maximum pooling and average pooling are widely employed operators. Then, fully connected (FC) layers are set to transform feature dimensions and feed the classifier. Finally, the classifier predicts labels based on the feature vector using softmax, MLP, support vector machine (SVM), or other methods. DenseNet, as shown in Figure 1, is a deep CNN designed with dense connections and shortcut propagation. In the structure, dense blocks and transition layers are the main components for feature extraction and reduction, respectively. Each dense block contains several densely connected convolutional blocks, which consist of batch normalization (BN) [52], ReLU activation, and 3×3 convolutional layers. When bottleneck layers are adopted to reduce the channels of input feature maps, the consecutive operations of BN-ReLU-convolution (1×1) are added before BN-ReLU-convolution (3×3).
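The output-size rule above can be checked with a minimal single-channel "valid" convolution, sketched in NumPy (as in most CNN frameworks, this computes cross-correlation, i.e., the kernel is not flipped):

```python
import numpy as np

def conv2d_valid(x, kernel, stride=1):
    """Valid 2-D convolution (single channel), illustrating the
    ((m - n) / s + 1) output-size rule."""
    m, n = x.shape[0], kernel.shape[0]
    out = (m - n) // stride + 1
    y = np.empty((out, out))
    for i in range(out):
        for j in range(out):
            patch = x[i * stride:i * stride + n, j * stride:j * stride + n]
            y[i, j] = np.sum(patch * kernel)  # cross-correlation
    return y

x = np.arange(36, dtype=float).reshape(6, 6)  # m = 6
k = np.ones((3, 3)) / 9.0                     # n = 3 (mean filter)
y = conv2d_valid(x, k, stride=1)
assert y.shape == (4, 4)                      # (6 - 3)/1 + 1 = 4
```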
In a dense block, if the number of input channels is $k_0$, and the growth rate of feature maps is $k$, then the $l$-th convolutional block will have $k_0 + k \times (l - 1)$ input feature maps, and its produced feature map $x_l$ can be calculated as

$$x_l = F_l\left([x_0, x_1, \ldots, x_{l-1}]\right) \quad (6)$$

where $F_l$ denotes the computation of the $l$-th convolutional block, and $[x_0, x_1, \ldots, x_{l-1}]$ represents the stacking of feature maps from all preceding blocks. Between two adjacent dense blocks, transition layers are set as a significant bridge for feature extraction and reduction, including BN, ReLU activation, 1×1 convolution with a compression factor, and average pooling layers. Moreover, dropout is used to improve the diversity of the network, and the global average pooling (GAP) layer is employed for global semantic extraction instead of fully connected layers [46].
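The dense-connectivity bookkeeping of Equation (6) can be sketched as follows, with a random 1×1 projection standing in for each BN-ReLU-convolution block (an illustration, not the paper's trained network):

```python
import numpy as np

def dense_block(x0, num_blocks, growth_rate, rng):
    """Channel growth in a dense block: the l-th convolutional block sees
    k0 + k*(l-1) input channels and emits k new ones."""
    feats = [x0]
    for l in range(1, num_blocks + 1):
        x_in = np.concatenate(feats, axis=-1)       # [x_0, ..., x_{l-1}]
        assert x_in.shape[-1] == x0.shape[-1] + growth_rate * (l - 1)
        w = rng.standard_normal((x_in.shape[-1], growth_rate))
        feats.append(np.maximum(x_in @ w, 0.0))     # ReLU(1x1 conv stand-in)
    return np.concatenate(feats, axis=-1)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8, 24))                # k0 = 24 initial channels
out = dense_block(x0, num_blocks=5, growth_rate=12, rng=rng)
assert out.shape == (8, 8, 24 + 5 * 12)             # 84 output channels
```

The values k0 = 24 and k = 12 match the initial feature-map count and growth rate used later in this paper.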

Methodology
In this section, the implementation of the proposed method is described in detail. The overall flowchart of our scheme is shown in Figure 2. Given an HSR multi-spectral image, superpixels were initially generated using the ETPS algorithm with false-color composition and image enhancement. Then, the multi-size patches of superpixels were selected and learned through parallel dense CNNs, and multi-scale features were extracted from GAP layers. Next, superpixel multi-scale CNN features were mapped and combined with multi-resolution segmentation (MRS) [53] object hand-delineated features. Finally, multiple features were fused for comprehensive land-cover classification upon a 1-D CNN-MLP hybrid network with channel-wise stacking and attention-based weighting design. In particular, the effect of various multi-scale and multi-feature combinations was explored to determine the optimal fusing solution.

Figure 2. Flowchart of the proposed extended topology-preserving segmentation (ETPS)-based multi-scale and multi-feature method using the convolutional neural network (CNN) for high spatial resolution (HSR) image land-cover classification.

ETPS Superpixel Segmentation
Instead of concentrating on individual pixels or irregular objects, superpixels were employed as the basic units for HSR image land-cover classification in our proposed method. ETPS was designed for natural images with red, green, and blue bands, whereas HSR images are multi-spectral with more than 3 bands. Standard false-color composition can reflect invisible environmental information and highlight vegetation, and image enhancement using standard deviation stretching can adjust brightness and contrast to emphasize important content. Therefore, standard false-color composition with standard deviation stretching was applied to the original HSR images for ETPS segmentation in this study to enhance the discriminative features and boundary adherence. The comparison between ETPS segmentation based on true color and standard false-color is shown in Figure 3, using GaoFen-2 fused images with 1 m resolution in the study area. It could be observed that the boundaries of trees were easily confused with soil, crops, or grass in true color, whereas the boundaries of different vegetation categories were clearer and more accurate in standard false-color. Considering that farmland, woodland, grassland, and vacant land were similar in appearance and had similar spectral characteristics, the false-color composition with information enhancement was more applicable to the ETPS method and highlighted its superiority. In addition, this approach could also better separate buildings from vegetation and produce better-fitting ground object boundaries.
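A minimal sketch of the preprocessing described above, assuming a two-standard-deviation stretch and the usual NIR-red-green band order for a standard false-color composite (the stretch width and band ordering are assumptions, not stated by the text):

```python
import numpy as np

def std_stretch(band, n_std=2.0):
    """Standard-deviation stretch: clip to mean +/- n_std * sigma, rescale to [0, 1]."""
    mu, sigma = band.mean(), band.std()
    lo, hi = mu - n_std * sigma, mu + n_std * sigma
    return np.clip((band - lo) / (hi - lo), 0.0, 1.0)

def false_color_composite(nir, red, green, n_std=2.0):
    """Standard false-color: NIR -> R, red -> G, green -> B, each stretched,
    so vegetation appears bright red in the composite."""
    return np.dstack([std_stretch(b, n_std) for b in (nir, red, green)])

rng = np.random.default_rng(1)
nir, red, green = (rng.normal(100, 20, (64, 64)) for _ in range(3))
comp = false_color_composite(nir, red, green)
assert comp.shape == (64, 64, 3)
assert comp.min() >= 0.0 and comp.max() <= 1.0
```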


Multi-Scale CNN Feature Extraction
Based on ETPS segmentation results, multi-scale patches of superpixels were extracted from original HSR images for feature learning. According to the center points of superpixels, multi-scale contextual windows were designed, including 24×24, 32×32, 40×40, 48×48, 56×56, 64×64, 72×72, 80×80, 88×88, 96×96, 104×104, and 112×112, to comprehensively explore the spatial effects and determine the optimal solution. Considering that multi-scale patches represent multi-scope spatial relationships and multi-level spatial semantics, parallel dense CNNs were designed to learn multi-scale features specifically. In the designed network structure, dense CNNs were built upon 4 dense blocks and 3 transition layers arranged alternately, and the number of convolutional blocks within each dense block was evaluated and set according to the complexity of land cover in the images. After training the parallel dense CNNs, multi-scale features were extracted from GAP layers for feature fusion and comprehensive classification. Taking the example of an 80×80 patch input, the two concrete dense CNN architectures employed in this study are shown in Table 1, designed with dense blocks containing 5 (DCNN-5) and 6 (DCNN-6) convolutional blocks, respectively. In the network, the growth rate was set to 12, the initial number of feature maps was set to 24, and the compression factor was set to 0.5.
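Multi-scale patch extraction around superpixel centers can be sketched as below (an illustration; the reflect padding mode and the subset of window sizes shown are assumptions, not the authors' exact implementation):

```python
import numpy as np

def extract_patches(image, center, sizes=(24, 48, 80, 112)):
    """Crop square multi-scale patches around a superpixel center point,
    reflect-padding the image so windows near borders stay full-size."""
    pad = max(sizes) // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    cy, cx = center[0] + pad, center[1] + pad
    patches = {}
    for s in sizes:
        half = s // 2
        patches[s] = padded[cy - half:cy - half + s, cx - half:cx - half + s]
    return patches

img = np.zeros((200, 200, 4))                # 4-band HSR image stand-in
p = extract_patches(img, center=(5, 5))      # superpixel near the top-left border
assert all(p[s].shape == (s, s, 4) for s in p)
```

Each patch size is then fed to its own dense CNN in the parallel arrangement described above.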

In Table 1, "conv" and "avg pool" denote "convolution" and "average pooling", respectively. Note that each "conv" block corresponds to the sequence of BN-ReLU-convolution.

Multi-Scale and Multi-Feature Combination
Ground objects were greatly heterogeneous in shape, texture, distribution, and context. For superpixels within class boundaries, the context was relatively uniform, and larger windows were needed to make a better decision. For superpixels across class boundaries, the context became complicated and disordered, and smaller windows were required to exclude confusing noise. Hence, multi-scale feature fusion would contribute to utilizing the complementary information of multiple contexts and achieve better performance. The multi-scale description of objects or superpixels is like observing them from near to far or far to near, which is more in line with visual recognition and the multi-scale nature of remote sensing. At different spatial resolutions, ground objects show different characteristics and patterns, and object information from a single-scale observation field was insufficient for accurate classification. It was proposed to capture multi-scale contextual information of objects to exploit their attributes and spatial distributions [21]. Multi-scale feature learning has been shown to improve performance in scene parsing, object categorization, and so on [33]. To perform multi-scale combination, two problems needed to be solved: which scales to combine and how to combine them. However, combining as many single-scale features as possible was not the best choice for classification because it would cause information redundancy and accuracy reduction. Considering that various multi-scale combinations have different effects of spatial complementarity, it was necessary to design and find the optimal solution. Therefore, among the 12 sets of single-scale features, 36 solutions combining 4, 6, 8, 10, or 12 sets of features were put forward to further test and discuss the fusion effect, as shown in Table 2. The 1st to 36th combinations corresponded to the 1st to 36th columns in the table and are expressed as COMB1 to COMB36 in the remainder of this paper.
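The 36 combinations in Table 2 are hand-designed and are not reproduced here; the sketch below merely illustrates how one plausible family of candidate combinations, contiguous runs of 4, 6, 8, 10, or 12 of the 12 window sizes, could be enumerated:

```python
scales = [24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112]

def contiguous_runs(scales, length):
    """All runs of `length` consecutive window sizes."""
    return [tuple(scales[i:i + length]) for i in range(len(scales) - length + 1)]

combos = []
for k in (4, 6, 8, 10, 12):
    combos.extend(contiguous_runs(scales, k))

assert (88, 96, 104, 112) in combos           # the four largest scales
assert len(contiguous_runs(scales, 12)) == 1  # only the full 12-scale set
assert len(combos) == 25                      # 9 + 7 + 5 + 3 + 1 candidates
```

Restricting candidates to structured families like this keeps the search over scale subsets tractable compared with all 2^12 possible combinations.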
HSR images were segmented into objects using the MRS algorithm, and spectral (i.e., mean and standard deviation of bands), geometrical (i.e., shape index, compactness, length/width, and density), textural (i.e., GLCM homogeneity, contrast, entropy, dissimilarity, correlation, angular second moment, mean, and standard deviation), and spectral index (i.e., normalized difference vegetation index (NDVI) and normalized difference water index (NDWI)) attributes were extracted from the objects. To integrate features from different segmentation levels, superpixels and objects were intersected to determine the object to which each superpixel maps. The segmentation boundaries of ETPS superpixels and MRS objects were both adherent to ground objects, so the boundaries were relatively coincident, and the mapping results corresponded well. Then, the ETPS superpixels were assigned the hand-delineated features of their mapped objects to integrate and utilize the comprehensive multi-segmentation and multi-type features. NDVI and NDWI are often used to interpret vegetation and water information from the images, respectively, and are defined as

$$\mathrm{NDVI} = \frac{NIR - R}{NIR + R} \quad (7)$$

$$\mathrm{NDWI} = \frac{G - NIR}{G + NIR} \quad (8)$$

where $R$, $G$, and $NIR$ denote the red, green, and near-infrared bands of the multi-spectral images, respectively.
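The two spectral indices can be computed directly from the bands; in the sketch below, a small epsilon guards against division by zero and the toy reflectance values are purely illustrative:

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """NDVI = (NIR - R) / (NIR + R); in [-1, 1], high for vegetation."""
    return (nir - red) / (nir + red + eps)

def ndwi(green, nir, eps=1e-9):
    """NDWI = (G - NIR) / (G + NIR); in [-1, 1], high for water."""
    return (green - nir) / (green + nir + eps)

# toy reflectances: vegetation reflects NIR strongly, water absorbs it
veg = ndvi(nir=np.array([0.5]), red=np.array([0.1]))
water = ndwi(green=np.array([0.3]), nir=np.array([0.05]))
assert veg[0] > 0.6 and water[0] > 0.6
```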

1-D CNN-MLP Comprehensive Classification
The softmax classifier used in the dense CNNs was mainly for feature learning, and the 1-D CNN-MLP classifier with attention-based weighting was subsequently designed for feature fusion and final classification. Feature fusion would result in high-dimension features and intricate patterns, and there was great nonlinearity between land-cover types and the multi-scale and multi-type features. The softmax classifier was simple and easy to use, but it was insufficient to fit complex relationships and not suitable for multi-scale and multi-feature fusion classification. Although SVM was good at processing large features and performed well with limited samples, it needed much more effort to optimize parameters and more time for training on large datasets. Therefore, MLP, with an easy-to-adjust network structure, steady learning performance, and robust anti-noise capacity, was chosen as the comprehensive classifier. As shown in Figure 4, the hybrid network mainly consisted of a 1-D CNN for multi-scale fusion and an MLP for multi-feature fusion. Considering that the multi-scale features are interrelated vectors of the same length and feature space with various spatial semantics, channel-wise stacking and 1-D CNN were designed for feature fusion and encoding, respectively. The 1-D CNN extracted cross-channel information from the multi-scale sequence input through alternating 1-D convolution and max-pooling operators, and convolutional layers with 32, 64, 96, and 128 filters of size 3×1 were employed in this study. After that, GAP was used to further abstract the features and transform the dimension for the following multi-feature fusion. In addition, the significance of multiple scales was unbalanced for diverse superpixels, and thus the attention-based weighting block was designed to adaptively adjust the differing contributions among multiple scales, instead of using equal weights.
The attention block contained a weighted summation for input aggregation, an FC layer with ReLU for nonlinear significance evaluation, an FC layer with Sigmoid for value normalization, and a residual addition for value enhancement to avoid response vanishing. The attention block is calculated as

$$y = \left[w_{ws}^{\mathrm{T}} x_1, \ldots, w_{ws}^{\mathrm{T}} x_m\right] \quad (9)$$

$$z = \mathrm{ReLU}\left(w_{fcr}\, y + b_{fcr}\right) \quad (10)$$

$$s = \mathrm{Sigmoid}\left(w_{fcs}\, z + b_{fcs}\right) \quad (11)$$

$$t = s + 1 \quad (12)$$

where $x$ and $w_{ws}$ represent the CNN features and summation weights, respectively, $y$ is the concatenation of the multi-scale response values, $w_{fcr}$, $b_{fcr}$, and $z$ denote the weights, bias, and output of the first FC layer, respectively, $w_{fcs}$, $b_{fcs}$, and $s$ denote the weights, bias, and output of the second FC layer, respectively, and $t$ is the final weighting factors for the multi-scale CNN features.
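A NumPy sketch of the attention-based weighting block described above, with the 2m- and m-node FC layer sizes from the parameter settings; the weights here are random stand-ins for learned parameters, and a shared per-scale dot product is assumed for the weighted summation:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def attention_weights(x, w_ws, w_fcr, b_fcr, w_fcs, b_fcs):
    """Aggregate each scale's feature vector, evaluate significance (FC + ReLU),
    normalize (FC + Sigmoid), then add a residual 1 so responses cannot vanish."""
    y = x @ w_ws                     # weighted summation per scale -> (m,)
    z = relu(w_fcr @ y + b_fcr)      # first FC layer, 2m nodes
    s = sigmoid(w_fcs @ z + b_fcs)   # second FC layer, m nodes, in (0, 1)
    return s + 1.0                   # residual addition

rng = np.random.default_rng(0)
m, d = 4, 16                         # m scales, d-dim CNN feature per scale
x = rng.standard_normal((m, d))
t = attention_weights(
    x,
    w_ws=0.1 * rng.standard_normal(d),
    w_fcr=0.1 * rng.standard_normal((2 * m, m)),
    b_fcr=0.1 * rng.standard_normal(2 * m),
    w_fcs=0.1 * rng.standard_normal((m, 2 * m)),
    b_fcs=0.1 * rng.standard_normal(m),
)
assert t.shape == (m,) and np.all(1.0 < t) and np.all(t < 2.0)
weighted = t[:, None] * x            # reweighted multi-scale features
assert weighted.shape == x.shape
```

Because the Sigmoid output lies in (0, 1), the residual addition keeps every scale's weight in (1, 2), so no scale's response is suppressed to zero.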
In comparison, multi-scale CNN and hand-delineated features were not in the same shape and feature space, and hence they were encoded through 1-D CNN and FC layer, respectively, before multi-feature fusion. In a similar way, the encoded multi-scale and hand-delineated features were adaptively weighted using a learnable attention block to emphasize the differentiating significance for each superpixel input. After that, two types of features were combined via length-wise concatenation, and finally, FC layers with ReLU and softmax were employed for feature abstraction and classification, respectively. As a result, multi-scale CNN and hand-delineated features were efficiently and adaptively fused upon a 1-D CNN-MLP hybrid network with channel-wise stacking and attention-based weighting for comprehensive land-cover classification.

Datasets
To validate the effectiveness of the proposed method, four multi-spectral HSR GaoFen-2 datasets were used for a land-cover classification demonstration. As shown in Figures 5 and 6, the four datasets were located in Beilun and Cixi Counties of Ningbo City, Xiaoshan County of Hangzhou City, and Yuecheng County of Shaoxing City, respectively, all in Zhejiang Province, China. The extracted Beilun and Cixi images mainly showed coastal urban and rural scenes, respectively, and the Xiaoshan and Yuecheng images presented urban-rural mixed scenes. These four datasets were chosen as typical land-cover distribution scenarios to verify the applicability and generalization of our method. The Beilun and Cixi images were taken on 10 and 15 February 2016, respectively, and the Xiaoshan and Yuecheng images were taken on 12 July and 21 December 2017, respectively. All images were of 1 m resolution with 4 bands (i.e., blue, green, red, and near-infrared) and generated by the fusion of panchromatic and multi-spectral GaoFen-2 images. The land-cover ground truth was provided by the Zhejiang Provincial Bureau of Surveying and Mapping, and the data update time was 2016 for Beilun and Cixi and 2017 for Xiaoshan and Yuecheng. Manual checking was carried out to adapt the ground truth to the image data.

Datasets
To validate the effectiveness of the proposed method, four multi-spectral HSR GaoFen-2 datasets were used for a land-cover classification demonstration. As shown in Figures 5 and 6, the four datasets were located in Beilun and Cixi County of Ningbo City, Xiaoshan County of Hangzhou City, and Yuecheng County of Shaoxing City, respectively, Zhejiang Province, China. The extracted Beilun and Cixi images mainly showed the coastal urban and rural scenes, respectively, and the Xiaoshan and Yuecheng images presented the urban-rural mixed scenes. These four datasets were chosen as typical land-cover distribution scenarios to verify the applicability and generalization of our method. The Beilun and Cixi images were taken on 10 and 15 February 2016, respectively, and the Xiaoshan and Yuecheng images were taken on 12 July and 21 December 2017, respectively. All images were of 1 m resolution with 4 bands (i.e., blue, green, red, and near-infrared) and generated by the fusion of panchromatic and multi-spectral GaoFen-2 images. The land-cover ground truth was provided by the Zhejiang Provincial Bureau of Surveying and Mapping, and the data update time was 2016 for Beilun and Cixi and 2017 for Xiaoshan and Yuecheng. Manual checking was carried out to adapt the ground truth to image data.    For the Beilun dataset, as shown in Figure 5c, the image covered the size of 12782×3077 m 2 . There were mainly 10 land-cover classes in this urban scene, including farmland, woodland, grassland, dense building, sparse building, road, impervious surface, facility land, vacant, and water. For the Cixi dataset, as shown in Figure 5b, the image covered the size of 10367×2991 m 2 . There were mainly 8 land-cover classes in this rural scene, including water, farmland, woodland, grassland, building, road, impervious surface, and facility land. 
For the Xiaoshan dataset, as shown in Figure 6b,c, the training and testing image covered an area of 7880 × 3855 m², and the validating image covered an area of 5757 × 3086 m². For the Yuecheng dataset, as shown in Figure 6d,e, the training and testing image covered an area of 6269 × 4683 m², and the validating image covered an area of 4150 × 3663 m². There were mainly 7 land-cover types in these two mixed scenes, including farmland, woodland, building, road, impervious surface, vacant, and water. Ground object examples of all classes for the Beilun and Cixi datasets are shown in Figure 7, and sample conditions and subclass descriptions for all datasets are presented in Tables 3-7. 60% and 40% of the samples in Figure 5b,c, as well as Figure 6b,d, were employed for training and testing, respectively, and all samples in Figure 6c,e were adopted for validation. All samples for each dataset were normalized using the z-score approach band by band.
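The band-wise z-score normalization can be sketched as follows; this is a minimal illustration of the standard procedure, with hypothetical variable names rather than the authors' code:

```python
import numpy as np

def zscore_per_band(image):
    """Normalize an (H, W, B) image so each band has zero mean and unit variance."""
    image = image.astype(np.float64)
    mean = image.mean(axis=(0, 1), keepdims=True)   # per-band mean
    std = image.std(axis=(0, 1), keepdims=True)     # per-band standard deviation
    return (image - mean) / (std + 1e-12)           # epsilon guards constant bands

# Example: a random 4-band patch (blue, green, red, near-infrared)
patch = np.random.rand(64, 64, 4) * 255
normalized = zscore_per_band(patch)
```

Normalizing each band independently keeps spectral channels on a comparable scale before they enter the CNNs.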

Parameter Settings
Considering that the land-cover categories and distribution were more intensive in the Beilun dataset than in the others, DCNN-6 was employed for the Beilun dataset, and DCNN-5 was adopted for the Cixi, Xiaoshan, and Yuecheng datasets. The growth rate k was set to 12, and the number of filters at the first convolutional layer was set to 2k. The bottleneck width was set to 4k, and the compression factor was set to 0.5 to reduce computation and increase efficiency. Dropout with a 0.2 rate was adopted to enhance network diversity and generalization, and an Adam optimizer with default parameters [54] was used to adjust the learning rate during training. The dense CNNs were trained for 500 epochs with a batch size of 128 to ensure sufficient learning. In the attention-based weighting block, the numbers of nodes at the FC layers with ReLU and Sigmoid were set to 2m and m, respectively, for nonlinear significance evaluation (m represents the number of input features). The numbers of nodes at the FC layers for hand-delineated feature encoding and final feature abstraction were both set to 32. In object segmentation, all images were segmented using eCognition 9.0 software and the MRS algorithm with 150 scale, 0.1 shape, and 0.5 compactness parameters.
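The attention-based weighting block described above (FC layers of 2m and m nodes with ReLU and Sigmoid) can be sketched as follows. This is an illustrative re-implementation under the stated layer sizes, not the authors' code, and the random matrices stand in for learned parameters:

```python
import numpy as np

def attention_weighting(features, w1, b1, w2, b2):
    """Weight an m-dimensional feature vector by channel-wise attention.

    features: (m,) stacked input features
    w1, b1:   first FC layer (m -> 2m), followed by ReLU
    w2, b2:   second FC layer (2m -> m), followed by Sigmoid
    """
    hidden = np.maximum(0.0, features @ w1 + b1)          # ReLU layer with 2m nodes
    scores = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))    # Sigmoid layer with m nodes
    return features * scores                               # channel-wise re-weighting

m = 8
rng = np.random.default_rng(0)
x = rng.normal(size=m)
w1, b1 = rng.normal(size=(m, 2 * m)) * 0.1, np.zeros(2 * m)
w2, b2 = rng.normal(size=(2 * m, m)) * 0.1, np.zeros(m)
weighted = attention_weighting(x, w1, b1, w2, b2)
```

Because the Sigmoid scores lie in (0, 1), each feature channel is scaled according to its evaluated significance before fusion.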

Comparison Methods
Five comparison methods were employed to verify the proposed EMMCNN method for HSR image land-cover classification: object-based CNN (OCNN), patch-based CNN (PCNN), SLIC-based multi-scale CNN (SMCNN) [33], SLIC-based multi-scale and multi-feature CNN (SMMCNN) [55], and object-based random forest (ORF). Moreover, single-scale and multi-scale experiments were carried out to analyze scale effect, and single-feature and multi-feature experiments were performed to explore feature complementarity.
The OCNN and PCNN methods employed the CNN with 12 convolutional layers, 3 pooling layers, a GAP layer, and a softmax classifier. The first and second halves of convolutional layers adopted 64 and 128 3×3 filters per layer, respectively. The OCNN method extracted MRS object patches from HSR images using envelopes and scaled them to a fixed size for CNN input. Various scaled sizes were tried, and 64×64 was chosen for better performance. The multi-layer CNN was used to train and predict land-cover types for objects. The PCNN method first divided images into 24×24 grids and extracted contextual patches for CNN input. Various patch sizes were tried, and 80×80 was chosen for higher accuracy. The multi-layer CNN was used to train and predict land-cover types for patches, and land-cover types for objects were obtained through mapping with grids and majority voting [21].
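The mapping-and-majority-voting step of the PCNN baseline can be sketched as below. This is a simplified illustration that assumes the grid predictions have already been rasterized to a per-pixel label map; the function and variable names are hypothetical:

```python
import numpy as np
from collections import Counter

def object_labels_by_majority_vote(grid_labels, object_ids):
    """Assign each segmentation object the majority class of the per-pixel
    CNN predictions it overlaps.

    grid_labels: (H, W) predicted class for every pixel, derived from grid patches
    object_ids:  (H, W) integer object id for every pixel
    Returns {object_id: majority class}.
    """
    labels = {}
    for oid in np.unique(object_ids):
        votes = grid_labels[object_ids == oid]
        labels[int(oid)] = Counter(votes.tolist()).most_common(1)[0][0]
    return labels

# Toy example: object 0 is mostly class 1, object 1 is entirely class 2
pred = np.array([[1, 1, 2, 2],
                 [1, 2, 2, 2]])
objs = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1]])
print(object_labels_by_majority_vote(pred, objs))  # {0: 1, 1: 2}
```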
The SMCNN method first performed pixel-based multi-scale CNNs with contextual inputs of size 15×15, 25×25, 35×35, and 45×45, and the concatenation of the CNN last pooling layers was used as input to an auto-encoder. The classification was made by a softmax classifier based on the hidden layer of the auto-encoder. The CNNs contained 5 convolutional layers with 50 filters and 2 max-pooling layers. Then, the SLIC algorithm was employed to segment the image into superpixels, and the segments were merged and classified using the prediction certainty of the classified pixels [33].
The SMMCNN method first performed pixel-based multi-scale CNNs with 16×16, 32×32, and 64×64 patch inputs, and the concatenation of the CNN last convolutional layers was used as input to a logistic classifier. The CNNs contained 4 convolutional layers with 32, 64, 96, and 128 filters, respectively, and 4 max-pooling layers. Then, per-pixel hand-delineated features (i.e., NDVI, saturation, (NIR+R+G)/3, spectral values, and entropy) were extracted and classified using random forest (RF) with 100 trees. Finally, the CNN and RF class probabilities were multiplied to obtain the combined prediction [55], and SLIC segments were also employed and merged for mapping. Taking account of the large number of pixels in HSR images, the pixel samples used in the SMCNN and SMMCNN were partially selected from the images for training and testing in this study.
The ORF method first segmented the image into MRS objects with 150 scale, 0.1 shape, and 0.5 compactness parameters, and the spectral, textural, geometrical, and spectral index attributes were extracted from each object. Then, it used the random forest classifier with 200 trees to train and recognize the land-cover types. The ORF method was set as a comparison method using only object hand-delineated features for classification, in order to prove the effectiveness of our proposed integrated multi-scale and multi-feature method.
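The per-object attribute extraction used by the ORF baseline can be sketched as below. This is an illustrative subset (band means, band standard deviations as a simple texture proxy, and mean NDVI), not the full eCognition attribute set used in the paper; bands are assumed ordered blue, green, red, near-infrared:

```python
import numpy as np

def object_features(image, object_ids):
    """Compute simple hand-crafted features for each segmentation object."""
    feats = {}
    red, nir = image[..., 2], image[..., 3]
    ndvi = (nir - red) / (nir + red + 1e-12)       # spectral index
    for oid in np.unique(object_ids):
        mask = object_ids == oid
        pixels = image[mask]                       # (n_pixels, n_bands)
        feats[int(oid)] = np.concatenate([
            pixels.mean(axis=0),                   # spectral means
            pixels.std(axis=0),                    # simple texture proxy
            [ndvi[mask].mean()],                   # mean NDVI
        ])
    return feats

img = np.random.rand(8, 8, 4)
objs = np.zeros((8, 8), dtype=int)
objs[:, 4:] = 1                                    # two toy objects
features = object_features(img, objs)
```

The resulting per-object feature vectors would then be fed to a random forest classifier, as the ORF method does with 200 trees.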

Evaluation Criteria
Overall accuracy (OA), Kappa coefficient (KC), user's accuracy (UA), producer's accuracy (PA), and average accuracy (AA) were employed to comprehensively assess the performance of the proposed and comparison methods. In addition, the confusion matrix was adopted to analyze the land-cover category confusion of classification results. It is a square matrix whose dimension equals the number of categories, where the main diagonal elements represent the correctly classified samples of each category, and the horizontal and vertical summations denote the total numbers of each type on the classification and reference maps, respectively. OA is the percentage of rightly predicted pixels in all pixels, and UA and PA are the correct percentages for each class related to the classification and reference maps, respectively. If Q_ij denotes the number of pixels of class i predicted to class j, then OA, UA, and PA are expressed as

OA = (Σ_i Q_ii) / (Σ_i Σ_j Q_ij)

UA_i = Q_ii / Σ_j Q_ij, PA_i = Q_ii / Σ_j Q_ji

where i, j = 1, 2, . . . , K, and K denotes the number of categories. AA is the average of UA over all classes. KC is a statistical value measuring the consistency of predicted labels and ground truth. Let N = Σ_i Σ_j Q_ij represent the number of all pixels; KC is then defined as

KC = (N Σ_k Q_kk - Σ_k Q_k+ Q_+k) / (N² - Σ_k Q_k+ Q_+k)

where Q_k+ = Σ_j Q_kj and Q_+k = Σ_j Q_jk, and k = 1, 2, . . . , K with K the number of classes. The values of OA, KC, UA, PA, and AA range between 0 and 1, and a higher value indicates higher accuracy and better performance.
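The criteria above can be computed directly from the confusion matrix; the following is a minimal sketch (rows taken as the classification map, columns as the reference map, matching the row/column convention stated above):

```python
import numpy as np

def accuracy_metrics(Q):
    """Compute OA, KC, UA, PA, and AA from a K x K confusion matrix Q."""
    Q = np.asarray(Q, dtype=np.float64)
    N = Q.sum()                               # total number of pixels
    diag = np.diag(Q)                         # correctly classified samples
    oa = diag.sum() / N                       # overall accuracy
    ua = diag / Q.sum(axis=1)                 # user's accuracy (per row)
    pa = diag / Q.sum(axis=0)                 # producer's accuracy (per column)
    aa = ua.mean()                            # average accuracy
    chance = (Q.sum(axis=1) * Q.sum(axis=0)).sum() / N**2
    kc = (oa - chance) / (1.0 - chance)       # Kappa coefficient
    return oa, kc, ua, pa, aa

# Toy 2-class confusion matrix
Q = np.array([[40, 10],
              [ 5, 45]])
oa, kc, ua, pa, aa = accuracy_metrics(Q)      # oa = 0.85, kc = 0.70
```

The KC expression here is algebraically the same as the matrix form above, rewritten as observed agreement minus chance agreement over one minus chance agreement.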

Experimental Analysis
Beilun Dataset
Quantitative result analysis of the Beilun dataset is shown in Table 8 and Figure 8. The EMMCNN achieved the highest OA (88.56%), KC (0.871), AA (88.59%), and UA/PA for most classes, consistently greater than the ESSCNN. The accuracy increment was much more remarkable in comparison with the OCNN, PCNN, SMCNN, SMMCNN, and ORF. The farmland, grassland, facility land, vacant, and water classes had higher accuracy, with UA/PA above 90% for EMMCNN, whereas the road class had the lowest UA/PA (77.12%/77.86%). Roads are typical linear objects distributed with other class objects along both sides, especially similar buildings and impervious surfaces, making accurate recognition difficult. Nonetheless, the EMMCNN method still achieved better precision in identifying roads than the compared methods. For the ESSCNN and EMMCNN methods that employed ETPS superpixels as analysis units, the classification accuracy was higher than the OCNN and PCNN methods that employed MRS objects by 7.37% to 17.99% for OA, 0.084 to 0.204 for KC, and 7.56% to 19.36% for AA, especially for the woodland, sparse building, and road classes, and was higher than the SMCNN and SMMCNN methods that employed SLIC segments by 2.78% to 7.01% for OA, 0.031 to 0.079 for KC, and 3.10% to 7.24% for AA, especially for the grassland and vacant classes. The performance of the EMMCNN, ESSCNN, PCNN, SMCNN, and SMMCNN methods retaining high resolution was better than the OCNN method using scaled resolution, due to the latter's loss of spatial granularity. As a comparison method using only object attributes for classification, the ORF method had the lowest accuracy, with OA 61.72%, KC 0.564, and AA 56.65%, especially for the sparse building, facility land, and vacant classes.
From Figure 8, it was observed that the OCNN and ORF methods tended to confuse similar artificial ground objects, such as dense buildings, sparse buildings, roads, impervious surfaces, and facility land, whereas the EMMCNN method recognized them best with the most focused values on the diagonal. The PCNN, SMCNN, and SMMCNN methods performed well on most classes except the roads, which were easily confused with impervious surface and facility land.
For qualitative result analysis of the Beilun dataset, as shown in Figure 9, the EMMCNN method performed better in not only recognizing land-cover types but also identifying class boundaries. The ESSCNN method made some mistakes in classifying small ground objects and continuous objects due to its insufficient consideration of multi-scale contextual information and object hand-delineated features. The OCNN, PCNN, SMCNN, and SMMCNN methods had inferior performance in classifying large-scale ground objects with complex compositions and cross-class-boundary objects with variable contexts. The boundaries of the OCNN, PCNN, and ORF methods that employed MRS objects were more irregular and circuitous, and the boundaries of the SMCNN and SMMCNN methods that employed SLIC segments were less smooth and continuous. The classification result of the ORF method lacked integrity and continuity, indicating that the distribution of ground objects was not orderly enough. In summary, the experimental results of the Beilun dataset illustrated that the complementarity of multi-scale and multi-type features, the adaptability of attention-based weighting, and the learning ability of the dense CNNs and hybrid network promoted the EMMCNN performance for HSR image land-cover classification. The ORF, OCNN, and PCNN methods had inferior performance due to the lack of deep feature representation, the loss of original resolution information, and coarse-grained object segmentation, respectively. The SMCNN, SMMCNN, and ESSCNN methods did not achieve the best accuracy since they considered single features and limited neighborhood scales without an effective parameter setting and combining solution.

Cixi Dataset
For the quantitative result analysis of the Cixi dataset, as shown in Table 9 and Figure 10, the proposed EMMCNN method obtained the best OA (88.35%), KC (0.857), AA (81.78%), and UA/PA for most classes, higher than the ESSCNN. Compared with the OCNN, PCNN, SMCNN, SMMCNN, and ORF, the improvement of EMMCNN was more dramatic. The UA/PA of the farmland, woodland, and facility land classes was more than 90% for EMMCNN, whereas the UA/PA of the road, impervious surface, and grassland classes was only 59.25%/70.54%, 74.00%/75.22%, and 70.39%/79.01%, respectively. In addition to linear road objects, impervious surface and grassland objects in the Cixi dataset were scattered and surrounded by analogous objects, making recognition difficult. Nonetheless, the EMMCNN method still achieved better precision in identifying them than the compared methods. For the ESSCNN and EMMCNN methods using ETPS superpixels, the classification performance was better than the OCNN and PCNN methods using MRS objects by 5.74% to 12.57% for OA, 0.074 to 0.156 for KC, and 15.48% to 22.57% for AA, especially for the grassland, road, and impervious surface classes, and was better than the SMCNN and SMMCNN methods using SLIC segments by 4.50% to 8.08% for OA, 0.057 to 0.101 for KC, and 10.50% to 13.76% for AA, especially for the water, road, and impervious surface classes. The performance of the OCNN method with the scaled resolution was worse than the other methods with the high resolution because of the spatial information loss. As a comparison method using only object attributes for classification, the ORF method had moderate accuracy, with OA 80.77%, KC 0.761, and AA 64.15%, but the grassland, road, and impervious surface classes had much lower UA. From Figure 10, it was observed that the OCNN method tended to confuse woodland and grassland, farmland and facility land, as well as buildings and impervious surfaces.
Besides the spectral similarity between these confusing classes, there were many greenhouses distributed in the farmland in the Cixi image, so the other methods also had a certain misclassification. The EMMCNN method achieved a relatively better and more balanced identification result, with more focused values on the diagonal.
For the qualitative result analysis of the Cixi dataset, as shown in Figure 11, the EMMCNN method achieved better performance in identifying land-cover types and class boundaries. The ESSCNN method had similar performance to the EMMCNN but was poorer at recognizing very small objects and linear continuous objects. The SMCNN and SMMCNN methods sometimes tended to classify the constituent parts of objects as different land-cover types or omit them. The OCNN and PCNN methods had inferior performance in distinguishing objects from similar surroundings, such as identifying roads among buildings and grasslands among woodlands. The ESSCNN and EMMCNN methods that used ETPS superpixels had more regular and smoother boundaries than the OCNN, PCNN, and ORF methods using MRS objects, as well as the SMCNN and SMMCNN methods using SLIC segments. In summary, the experimental results of the Cixi dataset demonstrated that the efficient superpixel segmentation, attention-based multi-scale and multi-feature fusion, and deep learning networks in EMMCNN improved the accuracy of HSR image land-cover classification. The comparison methods had inferior performance considering that they used segmentation methods with coarser granularity or less boundary adherence, classified the land-cover types based on limited features and spatial scales, and lacked comprehensive parameter optimization and feature integration.

Xiaoshan Dataset
For result analysis of the Xiaoshan testing dataset, as shown in Table 10 and Figure 12, the proposed EMMCNN method achieved the best OA (88.63%), KC (0.860), AA (85.11%), and UA/PA for most classes, better than the ESSCNN. The superiority of the EMMCNN was more significant when compared with the OCNN, PCNN, SMCNN, SMMCNN, and ORF methods. The UA/PA of the farmland, vacant, and water classes was higher than 90% in EMMCNN, whereas the road and impervious surface classes had lower precision, with UA/PA of 69.98%/75.08% and 69.97%/76.43%, respectively. In the Xiaoshan testing dataset, the road and impervious surface objects were dispersed and mixed with other similar artificial ground objects, and the comparison methods had worse accuracy on these two classes, especially the OCNN and ORF methods. For the ESSCNN and EMMCNN methods using ETPS superpixels, the recognition accuracy was better than the OCNN and PCNN methods using MRS objects by 3.43% to 13.41% for OA, 0.043 to 0.166 for KC, and 3.92% to 16.97% for AA, and was better than the SMCNN and SMMCNN methods using SLIC segments by 2.49% to 5.47% for OA, 0.031 to 0.067 for KC, and 3.16% to 6.31% for AA. The ORF method, as a comparison method using only MRS object attributes for classification, had inferior overall performance and especially worse precision for the impervious surface and vacant classes. In contrast, combining the MRS object hand-delineated features with the deep multi-scale features extracted by the dense CNNs, as an auxiliary and complementary feature description, improved the performance of the EMMCNN method. From Figure 12, it could be seen that buildings and impervious surfaces, as well as woodland and roads, were relatively more easily confused, and the ORF method showed relatively worse category confusion. The ESSCNN and EMMCNN methods had better discrimination ability, with comparatively more balanced and focused values on the diagonal.
For result analysis of the Xiaoshan validating dataset, as shown in Table 11 and Figure 12, the proposed EMMCNN method achieved the highest OA (67.20%), KC (0.588), AA (57.20%), and UA/PA for most classes. The performance of the EMMCNN method was better than the comparison methods by 1.58% to 7.14% for OA, 0.020 to 0.095 for KC, and 1.95% to 9.17% for AA. The road, impervious surface, and vacant classes, which are easily confused, had inferior precision, as shown in Figure 12. The farmland, building, and water classes had higher accuracy, having a relatively stable appearance and characteristics between separate areas. The accuracy of all methods on the validating dataset was worse than that on the testing dataset, considering that the appearance and features of ground objects changed and the patterns of object distribution varied in a separate area. Directly applying a model trained on one image to predict the land-cover types in another image would cause an accuracy reduction, and more efforts before predicting, such as fine-tuning, sample transferring, and feature transferring, could adjust the model and raise the performance. However, these aspects were not the focus of this article, and they will be researched in future work. In summary, the EMMCNN method achieved the best performance on both testing and validating images, owing to its hybrid network design, combining-solution optimization, and adaptive multi-scale and multi-feature fusion. In addition, for HSR images with complex ground objects, unbalanced land-cover types, and fragmented features, where the segment characteristics and object features change greatly, the land-cover classification accuracy of the EMMCNN method would decrease. The parameter selection and optimization would also influence the performance of the EMMCNN method. Nonetheless, the EMMCNN method could still obtain relatively better accuracy with careful design and tuning in such conditions, although the improvement was reduced.

Yuecheng Dataset
For result analysis of the Yuecheng testing dataset, as shown in Table 12 and Figure 13, the proposed EMMCNN method obtained the highest OA (87.52%), KC (0.846), AA (83.81%), and UA/PA for most classes, better than the ESSCNN. The improvement was more remarkable compared to the OCNN, PCNN, SMCNN, SMMCNN, and ORF methods. The UA/PA of farmland and water was higher than 90% in EMMCNN, and the UA/PA of impervious surface was only 64.76%/70.46%. The accuracy of road class improved in the Yuecheng dataset compared to other datasets because the road objects were wider and more complete, with more discriminative characteristics. However, the impervious surface objects in the Yuecheng dataset were more scattered and irregular, distributed among buildings, roads, and farmlands. Nonetheless, the EMMCNN method still recognized them comparatively better, and, in contrast, the OCNN and ORF methods had the worst precision. For the ESSCNN and EMMCNN methods using ETPS superpixels, the performance was better than the OCNN and PCNN methods using MRS objects by 5.32% to 16.41% for OA, 0.066 to 0.205 for KC, and 7.45% to 24.38% for AA, and was better than the SMCNN and SMMCNN methods using SLIC segments by 2.89% to 5.84% for OA, 0.037 to 0.072 for KC, and 5.50% to 8.50% for AA. The ORF method had inferior overall performance and especially lower precision for impervious surface and vacant classes. From Figure 13, it could be seen that the OCNN and ORF methods had more category confusion, whereas the ESSCNN and EMMCNN methods showed better object discrimination. The buildings, roads, and impervious surfaces were relatively easier to be confused, with similar appearance and spectral features.
For result analysis of the Yuecheng validating dataset, as shown in Table 13 and Figure 13, the proposed EMMCNN method obtained the best OA (60.93%), KC (0.521), AA (54.85%), and UA/PA for most classes. The accuracy of the EMMCNN method was higher than the comparison methods by 0.54% to 10.42% for OA, 0.008 to 0.127 for KC, and 0.82% to 11.24% for AA. The woodland, impervious surface, and vacant classes, which are easily confused, had inferior accuracy, as displayed in Figure 13. The building and water classes, whose appearance and characteristics are relatively more stable between separate areas, had better precision. The performance on the validating dataset was worse than that on the testing dataset, which could be further improved using transfer learning in future work. In summary, the EMMCNN method had the highest accuracy on both testing and validating images due to its flexible multi-scale and multi-type feature fusion, parameter and solution tuning, and effective network construction. In addition, the imbalance of sample numbers among different categories would influence the model learning ability and classification accuracy, i.e., better for majority classes and worse for minority classes. In this study, the number of samples per land-cover type was basically sufficient for network learning, with at least 909 training samples across all datasets, although the class imbalance would result in relatively biased model prediction capabilities. The solution to class imbalance in land-cover classification will be researched in future work, as we mainly focused on raising the overall performance through multi-scale and multi-feature fusion in this paper. As a result, the precision of both the majority and minority categories in the EMMCNN method was generally higher than the comparison methods, showing the overall superiority of our proposed method.

Evaluation of Single Spatial Scales
In the first discussion, we compared the land-cover classification accuracy using different scales of contextual information and analyzed the effect of multi-feature fusion in single-scale settings. Taking the Beilun and Cixi datasets as examples, as shown in Table 14, the accuracy and consistency generally improved with larger scales and more features. For vertical comparison of the Beilun dataset, the OA/KC was raised by 11.21%/0.127 and 10.99%/0.124 in single-feature and multi-feature settings, respectively, as spatial scales extended from 24×24 to 112×112. For the Cixi dataset, the OA/KC was raised by 11.11%/0.138 and 5.81%/0.072 in single-feature and multi-feature settings, respectively. The accuracy increase became steady when the scale was larger than 80×80. This indicated that at smaller scales, larger contextual information was important for recognizing land-cover types, whereas at larger scales the importance decreased because it might introduce irrelevant information. For horizontal comparison, due to the superiority of multi-feature complementation, the OA was improved by 0.15% and 1.74% on average for the Beilun and Cixi datasets, respectively, among various single scales. Additionally, the multi-feature improvement became less significant as the spatial scale increased, reflecting that a larger scale of contextual information can partially compensate for the limited perception of object features.

Evaluation of Multi-Scale Combinations
In the second discussion, we compared the performance of different combining solutions and discussed the effect of feature fusion in multi-scale settings. Taking the Beilun and Cixi datasets as examples, a total of 36 combinations and their accuracy and consistency are shown in Figure 14. For comparison among various combinations in single-feature settings, the OA/KC ranged from 83.54%/0.814 to 88.37%/0.869 and from 82.26%/0.781 to 88.04%/0.853 for the Beilun and Cixi datasets, respectively. In multi-feature settings with attention-based weighting, the OA/KC ranged from 83.67%/0.816 to 88.56%/0.871 and from 84.43%/0.808 to 88.35%/0.857 for the Beilun and Cixi datasets, respectively. For the Beilun dataset, the top five combining solutions were COMB 9, 18, 36, 17, and 4. For the Cixi dataset, the top five combinations were COMB 18, 36, 9, 35, and 4. For the Xiaoshan and Yuecheng datasets, which are not shown in the figure, the top five combinations were COMB 9, 18, 17, 4, and 35, as well as COMB 18, 9, 16, 17, and 8, respectively. It was found that most of the top combinations were composed of an appropriate number of larger scales, reflecting that combining as many scales as possible was not the best solution. Larger scales played a major role in promoting the combination accuracy, and smaller scales provided assisting contributions for the classification. Therefore, selecting appropriate and complementary contextual scales for the combination was important to achieve the best multi-scale performance.
For comparison among different settings, the multi-feature methods with attention-based weighting obtained the most accurate and consistent results in general. Over all 36 combinations, attention-based weighting improved the classification OA/KC by 0.16%/0.002 and 0.46%/0.006 on average for the Beilun and Cixi datasets, respectively, compared to the single-feature settings, and by 0.08%/0.001 and 0.37%/0.005 on average for the Beilun and Cixi datasets, respectively, compared to the multi-feature settings without attention-based weighting. The results of multi-scale combinations were better than single-scale settings, and multi-feature fusion raised the accuracy further. In addition, the improvement of multi-feature fusion in multi-scale settings was less significant than that in single-scale settings, considering that multi-scale fusion had partially compensated for the insufficient perception of object features. In conclusion, the comprehensive classification method based on multi-scale and multi-feature fusion with attention-based weighting had practical importance for promoting HSR image land-cover classification. Figure 14. Classification accuracy comparison among single-feature, multi-feature, and multi-feature with attention-based weighting methods upon multi-scale combinations. The 1st to 36th columns of the horizontal axis represent the COMB1 to COMB36 combining solutions, as shown in Table 2. OA: overall accuracy; KC: Kappa coefficient.

Conclusions
In this paper, we presented a novel extended topology-preserving segmentation (ETPS)-based multi-scale and multi-feature method using a convolutional neural network (EMMCNN) for high spatial resolution (HSR) image land-cover classification. In the proposed scheme, HSR images were first segmented into superpixels using the ETPS algorithm with false-color composition and image enhancement to improve the boundary adherence to confusing ground objects, and parallel dense convolutional neural networks (CNNs) were built for superpixel multi-scale deep and effective feature learning. Next, superpixel multi-scale CNN features were mapped with hand-delineated features extracted from multi-resolution segmentation (MRS) objects for complementary multi-segmentation and multi-type representation.