Land cover classification from remote sensing images based on multi-scale fully convolutional network

ABSTRACT Although the Convolutional Neural Network (CNN) has shown great potential for land cover classification, the frequently used single-scale convolution kernel limits the scope of information extraction. Therefore, we propose a Multi-Scale Fully Convolutional Network (MSFCN) with a multi-scale convolutional kernel as well as a Channel Attention Block (CAB) and a Global Pooling Module (GPM) in this paper to exploit discriminative representations from two-dimensional (2D) satellite images. Meanwhile, to explore the ability of the proposed MSFCN for spatio-temporal images, we expand our MSFCN to three dimensions using three-dimensional (3D) CNN, capable of harnessing each land cover category's time-series interactions from the reshaped spatio-temporal remote sensing images. To verify the effectiveness of the proposed MSFCN, we conduct experiments on two spatial datasets and two spatio-temporal datasets. The proposed MSFCN achieves 60.366% on the WHDLD dataset and 75.127% on the GID dataset in terms of the mIoU index, while the figures for the two spatio-temporal datasets are 87.753% and 77.156%. Extensive comparative experiments and ablation studies demonstrate the effectiveness of the proposed MSFCN. Code will be available at https://github.com/lironui/MSFCN.


I. INTRODUCTION
Land cover classification is a foundational technology for land resource management, cultivated area evaluation, and economic assessment, which is significant for homeland security and national economic stability [1]. Conventionally, large-scale field surveys are the primary method to obtain the condition of land use and land cover. Although the outcomes of such surveys are of high quality, the investigative procedures are time-consuming and labor-intensive. Meanwhile, information about the geographical distribution of land cover is often missing [2,3].
As a significant Earth observation technology, remote sensing is able to capture Earth's surface images via sensors on aircraft or satellites without physical contact [4]. Optical remote sensing is a major branch of remote sensing and has been applied in many fields including super-resolution land cover mapping [5], drinking water protection [6], and object detection [7]. Profiting from the abundance of remote sensing images, scholars have increasingly focused on automatic land cover classification using satellite images [8,9].
Generally, remote sensing classification models consist of two procedures, feature engineering and classifier training; the former is aimed at transforming spatial, spectral, or temporal information into discriminative feature vectors, and the latter is designed to train a general-purpose classifier to classify the feature vectors into the correct category.
When it comes to land cover classification, vegetation indices are one genre of frequently-used features extracted from multi-spectral/multi-temporal images to manifest the physical properties of land cover. The normalized difference vegetation index (NDVI) [10] and soil-adjusted vegetation index (SAVI) [11] highlight vegetation over other land resources, while the normalized difference bareness index (NDBaI) [12] and the normalized difference bare land index (NBLI) [13] emphasize bare land, and the normalized difference water index (NDWI) [14] and modified NDWI (MNDWI) [15] indicate water.
Meanwhile, the remote sensing community has tried to design assorted classifiers from diverse perspectives, from orthodox methods such as logistic regression [16], distance measure [17] and clustering [18], to advanced methods including support vector machine (SVM) [19], random forest (RF) [20], artificial neural network (ANN) [21], and multi-layer perceptron (MLP) [22]. Since extraction of the geographical distribution of land cover requires pixel-based image classification, how to precisely refine pixel features is the core of these classifiers. However, the high dependency on manual descriptors restricts the flexibility and adaptability of these methods.
Deep Learning (DL) is powerful at automatically capturing nonlinear and hierarchical features and has influenced many domains such as computer vision (CV) [23], natural language processing (NLP) [24], and automatic speech recognition (ASR) [25]. As land cover classification is a typical classification task, there have been many DL-based attempts. The authors of [26] exploited temporal features using a one-dimensional (1D) CNN to recognize intricate seasonal dynamics of economic crops and lessened the dependency on hand-crafted feature engineering for multi-temporal crop classification. Pelletier et al. [8] proposed a temporal CNN for satellite image time series and proved the significance of harnessing the information in both the spectral dimension and the temporal dimension when implementing the convolutions. Based on a fine-tuned CNN, Tong et al. [27] combined hierarchical segmentation and patch-wise classification for land cover classification. Semantic segmentation, an important and common task in computer vision, has been applied to land cover classification using satellite images. Inspired by progress in the encoder-decoder Fully Convolutional Network (FCN) framework such as U-Net, Stoian et al. [28] proposed a Fine Grained U-Net architecture for sparsely annotated images captured by Sentinel-2. Cao et al. [29] incorporated U-Net and ResNet to classify tree species using high-resolution images.
Even though the encoder-decoder FCN framework [30][31][32] has become a basic structure for land cover classification [33][34][35], the single-scale convolution kernel limits the scope of information extraction. To cope with this issue, we propose a Multi-Scale Fully Convolutional Network based on the encoder-decoder FCN structure to exploit both local and global features from satellite images for land cover classification. In each layer of the encoder, we design two branches with convolutional layers of different kernel sizes to capture multi-scale features. In addition, a channel attention block and a global pooling module [36] are adopted to enhance the channel consistency and global contextual consistency.
Currently, spatio-temporal satellite images, bolstered by their increasing availability, are at the forefront of a comprehensive effort towards automatic Earth monitoring by international agencies [37]. However, when utilizing a 2D CNN to extract features from spatio-temporal satellite images, the temporal dimension of the features generated by the convolution layer must be averaged and collapsed to a scalar, which destroys the time-series information contained in multi-temporal images. To handle this problem, many studies have been conducted motivated by progress in NLP, which must model the temporal sequences of language. Rußwurm et al. [38,39] adapted sequence encoders to model temporal sequences of Sentinel-2 images and alleviated the need for cumbersome cloud filtering. Interdonato et al. [40] designed a two-branch architecture, an RNN branch to extract temporal features and a CNN branch to extract spatial features, for time series classification. By incorporating both CNN and RNN, Rustowicz et al. [41] designed a 2D U-Net + CLSTM model for spatio-temporal satellite images. Meanwhile, for embedding time sequences, the Transformer architecture was also introduced into land cover classification using spatio-temporal satellite images by Garnot et al. [37]. All these attempts have made encouraging progress and broadened the boundaries of land cover classification.
Meanwhile, the advent of 3D CNN solves the above-mentioned dilemma from another perspective. Unlike a traditional 2D CNN, which operates on 2D images, a 3D CNN implements the convolutional operation in three dimensions, which naturally fits feature extraction from spatio-temporal satellite images and other data represented in 3D format. Thus, 3D CNN has been utilized for video understanding [42], point cloud representation [43], 3D object detection based on light detection and ranging (LiDAR) data [44], hyperspectral image classification [45], and multi-temporal image segmentation [46]. As the temporal or spectral dimensions are generally not considered independently in conventional computer vision tasks, the usage scenarios of 3D CNN are fewer than those of 2D CNN, which affects the popularization of 3D CNN for land cover classification using multi-temporal images. However, for remote sensing images that comprise abundant temporal, dynamic, or spectral information, such as the whole crop growth cycle contained in the temporal dimension, 3D CNN is a well-suited method to extract these features.
Using multi-temporal images, Ji et al. [46] designed a 3D-CNN-based segmentation model for crop classification. As the temporal dimension is preserved, the performance of the model surpassed 2D-CNN-based methods and other traditional classifiers. However, as 3D CNN is a computationally intensive operation, the pixel-by-pixel segmentation procedure of their work requires enormous computational resources. Thus, based on the idea of semantic segmentation, Ji et al. [36] proposed a novel 3D encoder-decoder FCN framework with global pooling and an attention mechanism (3D FGC), which is able to extract feature maps from the whole input and improves both accuracy and efficiency.
Based on the above-mentioned insights and progress, we extend our Multi-Scale Fully Convolutional Network to three dimensions based on 3D CNN for land cover classification using spatio-temporal satellite images. To verify the effectiveness, we compare the performance of 2D MSFCN with SegNet [31], FC-DenseNet [47], U-Net [30], Attention U-Net [48] and FGC [36], and the performance of 3D MSFCN with 1D U-Net, 2D U-Net [30], 3D U-Net [30], Conv-LSTM [38] and 3D FGC [36]. In addition, we expand 2D Attention U-Net [48] to 3D and compare its capability with MSFCN. The major contributions of this paper can be summarized as follows: 1) To expand the scope of information extraction in the spatial domain, we design a multi-scale convolutional block (MSCB), which is able to capture the local and global features of the input respectively. 2) A channel attention block and a global pooling module are adopted to enhance the channel consistency and the global contextual consistency, respectively. 3) We extend the proposed MSFCN to three dimensions based on 3D CNN, enabling it to harness the time-series interactions of each land cover category in spatio-temporal remote sensing images. 4) A 3D version of Attention U-Net is designed, and the experiments validate its effectiveness for land cover classification using spatio-temporal satellite images, especially when compared with 3D U-Net. 5) A series of quantitative experiments on two spatial datasets and two spatio-temporal datasets shows the effectiveness of the proposed MSFCN. The remainder of this paper is arranged as follows: in Section 2, taking 3D MSFCN as an example, we illustrate the detailed structure of the proposed framework. The experimental results are provided and analyzed in Section 3. Finally, in Section 4 we draw a conclusion for the entire paper.

II. METHODOLOGY

A. Feature Extraction using 3D CNN
3D CNN is capable of capturing spatial and temporal features simultaneously, and a Batch Normalization (BN) layer [49] is often appended to improve numerical stability. Thus, we take a 3D CNN with a BN layer as an example to elaborate the mechanism of 3D CNN. Suppose the input 3D feature maps have the size $(t \times h \times w, c)$ and the convolution kernel has the shape $(k_t \times k_h \times k_w)$, where $t$, $h$, $w$, and $c$ denote the dimension of the time series, height, width, and channels. The convolution operations are implemented between the convolution kernel and sliding windows of shape $(k_t \times k_h \times k_w)$, and the obtained values constitute the output 3D feature maps. Another important parameter, the stride, determines the distance in width and height traversed per slide of the sliding window. A diagrammatic sketch with one kernel can be seen in Fig. 1. Concretely, the operation of 3D CNN can be formulated as:

$$v_{ij}^{thw} = b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(t+p)(h+q)(w+r)}, \tag{1}$$

where $v_{ij}^{thw}$ denotes the $j$th feature cube at position $(t, h, w)$ in the $i$th layer, and $v_{(i-1)m}$ means the $m$th feature map generated by the $(i-1)$th layer. $w_{ijm}^{pqr}$ represents the kernel weight of the $j$th feature cube at position $(p, q, r)$ connected to the $m$th feature map, and $b_{ij}$ is the bias of the $j$th feature cube in the $i$th layer. $P_i$ denotes the extent of the convolution kernel along the temporal dimension of the input spatio-temporal satellite images, while $Q_i$ and $R_i$ respectively express the height and width of the kernel in the spatial dimension.
Then, the generated 3D feature maps are fed into the BN layer and normalized as:

$$\hat{x} = \frac{x - E(x)}{\sqrt{Var(x) + \epsilon}}, \qquad y = \gamma\,\hat{x} + \beta, \tag{2}$$

where $y$ is the output of the BN layer, and $Var(\cdot)$ and $E(\cdot)$ represent the variance and expectation of the input $x$. $\epsilon$ is a small constant to maintain numerical stability. $\gamma$ and $\beta$ are two trainable parameters: the normalized result $\hat{x}$ can be scaled by $\gamma$ and shifted by $\beta$. Afterwards, the activation

$$z = \sigma(y) \tag{3}$$

is applied, where $\sigma(\cdot)$ denotes the activation function, which is set as ReLU in our model.
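As a minimal PyTorch sketch (our own illustration with toy channel counts, not the authors' code), the Conv3D → BN → ReLU unit described by Equations (1)-(3) can be written as:

```python
import torch
import torch.nn as nn

# One 3D convolution followed by batch normalization and ReLU.
# in_channels=4 and out_channels=32 are arbitrary toy values.
conv_bn_relu = nn.Sequential(
    nn.Conv3d(in_channels=4, out_channels=32, kernel_size=3, padding=1),
    nn.BatchNorm3d(32),
    nn.ReLU(inplace=True),
)

# A toy spatio-temporal input: (batch, channels c, time t, height h, width w).
x = torch.randn(1, 4, 6, 64, 64)
y = conv_bn_relu(x)
print(y.shape)  # torch.Size([1, 32, 6, 64, 64]) -- sizes preserved by padding=1
```

Note that with `padding=1` and a (3 × 3 × 3) kernel, the temporal and spatial extents of the feature maps are unchanged; only the channel count grows.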
As the quality of extracted features limits the performance of the model and the convolution kernel size determines the receptive field, how to design the size of the convolution kernel is the crux of the network.

B. Multi-Scale Convolutional Block
Generally, a larger convolution kernel means a larger receptive field and a more global view, which augments the scope of the areas observed in the image. Conversely, decreasing the size of the convolution kernel shrinks the receptive field and yields a local view. However, both global and local visual patterns contain useful visual features. Thus, an evident imperfection of the fully convolutional network is that its convolutional kernels are all of the same size, which means the receptive field of a convolutional layer is fixed. As can be seen in Fig. 2(a), the conventional convolutional block used in FCN usually contains two stacked 3D CNNs with the activation function. To expand the receptive field, in MSFCN we design a multi-scale convolutional block (MSCB) to exploit global and local features simultaneously.
The structure of the multi-scale convolutional block can be seen in Fig. 2(b). As before, suppose the input 3D feature maps have the shape $(t \times h \times w, c)$, where $t$, $h$, $w$, and $c$ represent the time series, height, width, and channels of the input. The top branch of the block contains two stacked (3 × 3 × 3) convolution layers, whose receptive field is equivalent to that of a single (5 × 5 × 5) convolution layer, as can be seen from Fig. 3. Thus, the top branch is capable of capturing more global visual patterns. Meanwhile, the bottom branch of the block harnesses a single (3 × 3 × 3) convolution layer to capture local visual patterns. Subsequently, an element-wise addition is implemented between the outputs of the top branch and the bottom branch, yielding feature maps of size $(t \times h \times w, c')$. Finally, the extracted feature maps are fed into a (1 × 1 × 1) convolution layer with a BN layer to further increase the nonlinearity and characterization capability of the block.
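The block above can be sketched in PyTorch as follows. This is our reading of Fig. 2(b), not the authors' released code: in particular, the single-conv bottom branch and the ReLU placements are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleConvBlock(nn.Module):
    """Sketch of the multi-scale convolutional block (MSCB)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Top branch: two stacked 3x3x3 convs -> 5x5x5 effective receptive field (global view).
        self.top = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Bottom branch: a single 3x3x3 conv keeps a smaller receptive field (local view).
        self.bottom = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # 1x1x1 conv + BN fuses the summed branches and adds nonlinearity.
        self.fuse = nn.Sequential(
            nn.Conv3d(out_ch, out_ch, 1), nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Element-wise addition of the two branches, then fusion.
        return self.fuse(self.top(x) + self.bottom(x))

block = MultiScaleConvBlock(4, 32)
print(block(torch.randn(2, 4, 6, 32, 32)).shape)  # torch.Size([2, 32, 6, 32, 32])
```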

C. Channel Attention Block and Global Pooling Module
In the FCN framework, the output of the convolution operator is a score map, which indicates the probability of each class at each pixel. To attain the final score map, all channels of the feature maps are simply summed as:

$$y_k = F(x; w) = \sum_{i \in D} w_i\, x_i, \tag{4}$$

where $w$ denotes the convolution kernel, $x$ represents the feature maps generated by the network, $D$ is the set of pixel positions, and $k \in \{1, 2, \ldots, K\}$, where $K$ indicates the number of channels. Then the prediction probability is generated as:

$$\delta_k(y) = \frac{\exp(y_k)}{\sum_{j=1}^{K} \exp(y_j)}, \tag{5}$$

where $y$ denotes the output of the network and $\delta$ indicates the prediction probability. Obviously, the category with the highest probability is the final predicted label, which can be deduced from Equation (4) and Equation (5). Equation (4) implicitly assumes that all channels share equal weights. However, the features generated by different stages have different levels of discrimination, which causes different consistency in prediction.
Suppose the predicted label is $y_0$ while the corresponding true label is $y_1$; we can shift the highest probability value from $y_0$ to $y_1$ by introducing a parameter $\alpha$:

$$\bar{y} = \alpha y, \tag{6}$$

in which $\alpha = \mathrm{Sigmoid}(x; w)$ and $\bar{y}$ is the new prediction of the network. As can be seen from Equation (6), the value $\alpha$ weights the feature maps, enhancing the discriminative features and restraining the indiscriminative ones. The channel attention block is designed based on the above-mentioned insight [50,51] and is expanded here to a 3D version [36].
The structure of the CAB can be seen in Fig. 4, whose input is the concatenated feature maps extracted by the encoder and decoder. First, a 3D global average pooling layer in CAB exploits the global context of the input, and sequentially two (1 × 1 × 1) convolution layers with ReLU and sigmoid activation function adaptively realign the channel-wise dependencies. The weight vector generated by CAB models the relative significance between the channel-wise features and enhances the discriminability about features. Then a multiplication operation and an addition operation are operated between the output vector and the input feature maps. Finally, the last (1 × 1 × 1) convolution layer is designed to generate globally consistent spatio-temporal feature maps. Through reweighting the channel-wise features, 3D channel attention block (CAB) fuses the spatio-temporal features between the encoder and the decoder.
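A compact PyTorch sketch of the CAB follows (our illustration; the channel reduction ratio of 4 inside the two 1×1×1 convolutions is an assumption, and the residual "multiply + add" follows the description above):

```python
import torch
import torch.nn as nn

class ChannelAttentionBlock(nn.Module):
    """Sketch of the 3D CAB: global pooling, two 1x1x1 convs (ReLU, sigmoid),
    channel reweighting with a residual addition, then a final 1x1x1 conv."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)  # 3D global average pooling
        self.fc = nn.Sequential(
            nn.Conv3d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv3d(ch // reduction, ch, 1), nn.Sigmoid(),
        )
        self.out = nn.Conv3d(ch, ch, 1)

    def forward(self, x):
        w = self.fc(self.pool(x))    # channel-wise weight vector (the alpha of Eq. 6)
        return self.out(x * w + x)   # reweight, residual add, final 1x1x1 conv

cab = ChannelAttentionBlock(64)
print(cab(torch.randn(1, 64, 6, 16, 16)).shape)  # torch.Size([1, 64, 6, 16, 16])
```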
Meanwhile, context is salutary information that can be used to enhance performance on segmentation and detection with deep learning [52]. As for land cover classification, the local semantic information contained in each pixel is often equivocal, and by taking contextual information into consideration, the semantic information is enhanced. Global average pooling has proved to be a good method to capture the global contextual prior [52]. Based on the idea that spatio-temporal consistency can be enhanced by a global average pooling layer on the highest level of the encoder (i.e., the top semantic layer) [51], the global pooling module (GPM) is elaborately designed [36], as shown in Fig. 5. Meanwhile, with global spatio-temporal consistency, the GPM transforms the feature maps at the highest level of the encoder into the corresponding feature maps of the decoder. Just like the CAB, the effect of the GPM is to reweight feature maps, which can also be seen as an attention mechanism.
The structure of the GPM can be seen in Fig. 5. First, the input feature maps are fed into a (1 × 1 × 1) convolution layer. Then, a 3D global average pooling layer and a (1 × 1 × 1) convolution layer with sigmoid activation are attached. Finally, a multiplication operation and an addition operation are implemented between the generated vector and the output obtained from the first convolution layer, and the result is processed by a last (1 × 1 × 1) convolution layer to produce the highest layer of the decoder.
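The GPM can be sketched similarly (again our reading of the description above; the residual "multiply + add" between the gate vector and the first conv's output is how we interpret the text):

```python
import torch
import torch.nn as nn

class GlobalPoolingModule(nn.Module):
    """Sketch of the GPM: 1x1x1 conv, global pooling with a sigmoid gate,
    multiply + add against the conv output, then a final 1x1x1 conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv3d(in_ch, out_ch, 1)          # first 1x1x1 conv
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),                       # 3D global average pooling
            nn.Conv3d(out_ch, out_ch, 1), nn.Sigmoid(),    # gating vector
        )
        self.out = nn.Conv3d(out_ch, out_ch, 1)            # final 1x1x1 conv

    def forward(self, x):
        x = self.reduce(x)
        g = self.gate(x)            # global-context weight vector
        return self.out(x * g + x)  # multiply + add, then final conv

gpm = GlobalPoolingModule(256, 256)
print(gpm(torch.randn(1, 256, 6, 8, 8)).shape)  # torch.Size([1, 256, 6, 8, 8])
```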

D. Network Architecture
Based on the 3D CNN, the multi-scale convolutional block, the channel attention block, and the global pooling module, we construct the MSFCN for land cover classification from satellite images, as shown in Fig. 6. The encoder of the MSFCN comprises four multi-scale convolutional blocks with 32, 64, 128, and 256 output channels respectively; the choice of the number of layers and channels is discussed in Section Ⅲ.F. After each multi-scale convolutional block, a max-pooling layer with a (1 × 2 × 2) kernel is applied, which preserves the temporal information and condenses the spatial information. At the highest layer of the encoder, the GPM is utilized to enhance the global spatio-temporal consistency. Then, using the CAB, the feature maps from the encoder and decoder are fused, and the output of each layer in the decoder is sequentially restored to the input size via a transposed convolution layer with a (1 × 2 × 2) kernel. After each transposed convolution layer, a (3 × 3 × 3) convolution layer is attached to avoid the checkerboard pattern generated by the transposed convolution. Finally, the 3D feature maps are fed into a (t × 3 × 3) convolution layer and a (1 × 1 × 1) convolution layer to collapse the time dimension and generate the 2D segmentation maps.
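A quick shape check (with toy channel and time values of our choosing) illustrates how the (1 × 2 × 2) pooling and transposed convolution preserve the temporal dimension while halving and then restoring the spatial one:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool3d(kernel_size=(1, 2, 2))                  # halves h, w; keeps t
up = nn.ConvTranspose3d(32, 32, kernel_size=(1, 2, 2),
                        stride=(1, 2, 2))                   # restores h, w; keeps t

x = torch.randn(1, 32, 6, 64, 64)  # (batch, channels, t, h, w)
down = pool(x)
print(down.shape)       # torch.Size([1, 32, 6, 32, 32])
print(up(down).shape)   # torch.Size([1, 32, 6, 64, 64])
```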
The cross-entropy loss function is used as the quantitative evaluation and backpropagation index to measure the disparity between the obtained 2D segmentation maps and the ground truth, which is defined as:

$$L = -\frac{1}{N}\sum_{(i,j)}\sum_{c=1}^{C} y_{i,j}^{c}\, \log p_{i,j}^{c}, \tag{7}$$

where $p_{i,j}$ is the predicted category probability distribution of pixel $(i, j)$, $y_{i,j}$ is the actual category probability distribution of pixel $(i, j)$, $C$ represents the number of classes, and $N$ denotes the number of pixels.

¹ https://github.com/lironui/Multi-Scale-Fully-Convolutional-Network
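The loss above is the standard pixel-wise cross-entropy. A toy check (our own, with hypothetical shapes) confirms that writing out the definition matches PyTorch's built-in implementation:

```python
import torch
import torch.nn.functional as F

# C = 6 classes on a 4x4 map, batch of 1 (toy values).
logits = torch.randn(1, 6, 4, 4)          # network scores per class and pixel
target = torch.randint(0, 6, (1, 4, 4))   # ground-truth class index per pixel

loss = F.cross_entropy(logits, target)    # averages -log p over the N pixels

# The same quantity written out from the definition above:
# gather the log-probability of the true class at each pixel, negate, average.
manual = -F.log_softmax(logits, dim=1).gather(1, target.unsqueeze(1)).mean()
print(torch.allclose(loss, manual))  # True
```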

III. EXPERIMENTAL RESULTS
This section first introduces the datasets and experimental settings to verify the effectiveness of MSFCN, and then compares the performance between different frameworks.
A. Datasets

WHDLD contains 4940 RGB images of size 256 × 256 captured by the Gaofen-1 and ZY-3 satellites over the Wuhan urban area. Through image fusion and resampling, the resolution reaches 2 m/pixel. The images contained in WHDLD are labeled with six classes, i.e., bare soil, building, pavement, vegetation, road, and water.
GID contains 150 RGB images of size 7200 × 6800 captured by the Gaofen-2 satellite over 60 cities in China. Each image covers a geographic region of 506 km². The images contained in GID are labeled with six classes, i.e., build-up, forest, farmland, meadow, water, and others. However, as we do not have enough computing resources to cope with such an enormous number of pixels, we select only 15 of the images contained in GID. The principle of selection is to cover all six classes, and the serial numbers of the selected images will be released with our open-source code¹.
B. Experimental Settings

All of the models are implemented with PyTorch, and the optimizer is set as Adam with a 0.0001 learning rate. The batch size is set as 16 for WHDLD and GID, and 4 for the GF-2 spatio-temporal satellite images. All experiments are run on a single NVIDIA GeForce RTX 2080 Ti GPU with 11 GB RAM. For WHDLD, we randomly select 60% of the images as the training set, 20% as the validation set, and the remaining 20% as the test set. For GID, we partition each image into non-overlapping patches of size 256 × 256, discarding the edge pixels that are not divisible by 256; 10920 patches are thus obtained. We then randomly select 60% of the patches for training, 20% for validation, and the remaining 20% for testing. The training sets of WHDLD and GID are augmented by horizontal flipping, vertical flipping, color enhancement, Gaussian blur, and random noise. When training the network, if the accuracy on the validation set stops increasing for 10 epochs, we terminate the training early to restrain overfitting. The number of training, validation, and test pixels per class for WHDLD and GID is provided in Table Ⅰ. For the two spatio-temporal datasets, the samples in each category are severely imbalanced. Thus, we select the portions of the images that contain samples of all categories to train the network, as indicated by the red rectangles in Fig. 9. Since pixels in these two datasets are not abundant, we enlarge the images of the 2015 dataset to 2816 × 1536 and those of the 2017 dataset to 2304 × 1280 by zero-padding, and then segment each image into non-overlapping 256 × 256 patches to evaluate prediction accuracy. Of course, the portion selected for training is also set to zero to avoid data leakage. The number of training and test pixels per class is provided in Table Ⅱ. Each model is trained for up to 100 epochs on the training set and then evaluated on the test set.
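The GID patch extraction can be sketched as follows (our reconstruction of the procedure; note that a 7200 × 6800 image yields 28 × 26 = 728 patches, and 15 images × 728 = 10920, matching the patch count reported above):

```python
import numpy as np

def split_into_patches(image, patch=256):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles,
    discarding edge pixels not divisible by the patch size."""
    h, w = image.shape[:2]
    tiles = []
    for i in range(0, h - h % patch, patch):
        for j in range(0, w - w % patch, patch):
            tiles.append(image[i:i + patch, j:j + patch])
    return tiles

# A GID-sized toy example: 7200 x 6800 -> 28 x 26 = 728 patches per image.
tiles = split_into_patches(np.zeros((7200, 6800, 1), dtype=np.uint8))
print(len(tiles))  # 728
```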
For each dataset, the overall accuracy (OA), average accuracy (AA), Kappa coefficient (K), mean Intersection over Union (mIoU), Frequency Weighted Intersection over Union (FWIoU), and F1-score (F1) are adopted as evaluation indexes. Given the predicted segmentation maps and the ground truth, the IoU is the size of their intersection divided by the size of their union. The mIoU averages the IoU over all categories, while the FWIoU weights the IoU of each category by its frequency. We select mIoU as the main indicator, as it reflects both the overall accuracy (OA) and the consistency degree (Kappa), and is becoming a frequently-used indicator for land cover segmentation [35,56,57].

TABLE Ⅰ
THE SAMPLES FOR EACH CATEGORY FOR TRAINING, VALIDATION AND TEST.

WHDLD        Train        Val          Test
bare         7746403      2475482      2854410
building     21848819     7135568      6917771
pavement     22842445     7671979      6782834
road         8225161      2850179      2869957
vegetation   87444443     28505640     28859223
water        46141433     16110720     16465373

GID          Train        Val          Test
others       125858447    40426710     40061365
build-up     49528719     16603346     17203079
farmland     125542298    41351598     40884984
forest       37555494     12302122     13716761
meadow       25657841     9335581      8437873
water        65249073     23111267     –

TABLE Ⅱ
THE SAMPLES FOR EACH CATEGORY FOR TRAINING AND TEST.

2015         Train        Test         2017         Train        Test
rice         253286       1069586      rice         93931        356085
corn         198585       1064487      corn         320895       1206244
sorghum      102649       193686       grass        15140        63117
tree         17410        57677        tree         3941         7787
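For reference, mIoU and FWIoU can be computed from a confusion matrix as follows (a generic sketch of the definitions above, not the authors' evaluation code; the 2 × 2 matrix is an arbitrary example):

```python
import numpy as np

def iou_metrics(conf):
    """Per-class IoU, mIoU and FWIoU from a confusion matrix
    (rows = ground truth, columns = prediction)."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                           # true positives per class
    union = conf.sum(0) + conf.sum(1) - tp       # predicted + actual - intersection
    iou = tp / union
    freq = conf.sum(1) / conf.sum()              # class frequency in the ground truth
    return iou, iou.mean(), (freq * iou).sum()

iou, miou, fwiou = iou_metrics([[18, 2], [3, 7]])
print(iou)  # [18/23 ~ 0.7826, 7/12 ~ 0.5833]
```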

C. Results on WHDLD and GID
The experimental results of the different methods on WHDLD and GID are demonstrated in Table Ⅲ and Table Ⅳ. As can be seen from the tables, the performance of the proposed MSFCN transcends the other algorithms in all quantitative evaluation indexes.
For WHDLD, the proposed MSFCN brings nearly 3% improvement in both mIoU and F1-score compared with FGC, and for the GID dataset the improvements are more than 3% in mIoU and more than 2% in F1-score, respectively. Table Ⅴ and Table Ⅵ summarize the per-class F1-score performance of the different methods on WHDLD and GID. The proposed MSFCN obtains the best performance in most classes on WHDLD and in all classes on GID. Meanwhile, we investigate the confusion between each pair of classes and report the confusion matrix as heat maps for each competing method in Fig. 10 and Fig. 11. A more visible diagonal structure (dark blue blocks concentrated on the diagonal) indicates a more powerful capacity for distinguishing between classes, and the diagonal structure of MSFCN is more distinct than the others, which proves the superiority of our framework.
Moreover, the number of parameters and the computational cost are also significant when assessing the merit of a framework. The comparison of parameters and computational complexity between the different algorithms is reported in Table Ⅶ, where 'M' denotes millions of parameters and 'G' denotes Giga (10⁹) floating-point operations. The comparison demonstrates that the design of MSFCN neither introduces redundant parameters nor leads to high computational complexity.
Some visual results generated by our method and comparisons are provided in Fig. 12 and Fig. 13.

D. Results on 2015 and 2017 datasets
To train the networks, the inputs of the 1D U-Net are reshaped into (ct × 65536) tensors and the inputs of the 2D U-Net into (ct × 256 × 256) tensors, while the inputs of the Conv-LSTM, 3D U-Net, 3D FGC, 3D U-NetAtt, and 3D MSFCN are (t × c × 256 × 256) tensors, where c and t denote the number of spectral channels and time steps, respectively.
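These reshapes can be sketched as follows (toy values of c and t, and the exact axis ordering, are our assumptions; 65536 = 256²):

```python
import torch

c, t = 4, 6                              # spectral channels and time steps (toy values)
x = torch.randn(1, t, c, 256, 256)       # one spatio-temporal patch

x3d = x                                  # 3D models consume the full cube directly
x2d = x.reshape(1, c * t, 256, 256)      # 2D models: fold time into the channel axis
x1d = x.reshape(1, c * t, 256 * 256)     # 1D models: flatten space to 65536 positions
print(x2d.shape, x1d.shape)
```

Folding time into channels (x2d) or flattening space (x1d) is exactly what discards the temporal or spatial structure that the 3D models preserve.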
The experimental results of the different methods on the two datasets are demonstrated in Table Ⅷ and Table Ⅸ. Since the operation of 1D CNN destroys both the spatial and the temporal dimension, the performance of 1D U-Net is the worst. As the operation of 2D CNN ruins the temporal dimension when extracting spatio-temporal features, the models based on 3D CNN dramatically outperform the models based on 2D CNN, which prominently demonstrates the superiority of 3D CNN. The performance of Conv-LSTM transcends the 2D-based models, as the information contained in the temporal dimension is taken into consideration. Benefitting from the attention mechanism, 3D U-NetAtt performs better than 3D U-Net. Similarly, owing to the introduction of the CAB, which enhances the channel consistency, the proposed MSFCN achieves the best performance; further results are visualized in Fig. 14 and Fig. 15. Fig. 16 demonstrates the segmentation maps on the two datasets, where the first three columns are from the 2015 dataset and the remainder are from the 2017 dataset. Taking the fourth column as an example, the proposed MSFCN differentiates corn (green) and grass (yellow) better than the other models. Table Ⅻ provides the number of parameters and the computational cost, which illustrates that the complexity of the proposed MSFCN is acceptable.

E. Effectiveness of the Multi-Scale Convolutional Block and Attention Mechanisms
To verify the effectiveness of the multi-scale convolutional block and the attention mechanisms, we evaluate the proposed MSFCN without the multi-scale convolutional block (MSCB), the channel attention block (CAB), and the global pooling module (GPM) on both WHDLD and GID; the results are shown in Table ⅩⅢ and Table ⅩⅣ. The 3D U-Net obtains mIoUs of 0.55706 and 0.69417 on WHDLD and GID. When the conventional convolutional blocks are replaced by multi-scale convolutional blocks, the mIoUs reach 0.57098 and 0.71992. The introduction of the channel attention block and the global pooling module brings mIoU improvements of 0.01473/0.01510 on WHDLD and 0.01680/0.01679 on GID, respectively. The mIoUs are further improved to 0.60366 and 0.75127 when all blocks are introduced.

F. Investigation about the Number of Layers and Channels
The number of layers and channels are two vital parameters which not only impact the performance of the model but also determine the computational complexity. Thus, it is worthwhile to investigate the influence of the number of layers and channels.
To investigate the effect of the number of layers, we design a MSFCN with 3 layers (MSFCN3) and a MSFCN with 5 layers (MSFCN5), and compare their performance with the MSFCN with 4 layers (MSFCN4). As the capacity of representation is limited by too few layers, the performance of MSFCN3 is significantly weaker than that of MSFCN4. Specifically, without an enormous increase in parameters and computational complexity, MSFCN4 surpasses MSFCN3 by more than 5% in mIoU, as can be seen from Table ⅩⅤ and Table ⅩⅦ. However, notwithstanding the modest improvements brought by MSFCN5, its number of parameters is more than four times that of MSFCN4, which is not an efficient option.
To examine the impact of the number of channels, we design a narrow MSFCN (MSFCNN) with channels of [16, 32, 64, 128] and a wide MSFCN (MSFCNW) with channels of [64, 128, 256, 512], and compare their performance with the MSFCN with channels of [32, 64, 128, 256]. The results show that the performance of MSFCN surpasses MSFCNN by nearly 5% in mIoU. Meanwhile, with roughly five times the parameters and computational complexity, MSFCNW brings in only about a 1% improvement.
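The near-quadratic growth behind this trade-off is easy to check: doubling a 3D convolution layer's input and output channels roughly quadruples its parameters (a generic illustration with arbitrary channel counts, not the paper's exact layer configuration):

```python
import torch.nn as nn

def n_params(module):
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

narrow = nn.Conv3d(32, 64, kernel_size=3)   # 32*64*27 weights + 64 biases
wide = nn.Conv3d(64, 128, kernel_size=3)    # 64*128*27 weights + 128 biases

print(n_params(narrow), n_params(wide))     # 55360 221312
```

Scaling every layer's width this way compounds across the encoder, which is why MSFCNW costs roughly five times the parameters for only about a 1% mIoU gain.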
Based on the above experiments, we can conclude that the design of the proposed MSFCN delicately balances performance and complexity.

IV. CONCLUSION
In this paper, to implement land cover classification using satellite images, we propose a Multi-Scale Fully Convolutional Network (MSFCN). First, multi-scale convolutional blocks are elaborately designed to expand the scope of information extraction in the spatial domain, capturing both the local and global information of the satellite images. Second, a channel attention block and a global pooling module are included to enhance the channel consistency and the global contextual consistency, respectively. Third, we extend MSFCN to 3D for spatio-temporal satellite images, replacing the 2D FCN with a 3D CNN that adequately utilizes the time-series interactions of each land cover class along the temporal dimension.
Experiments on two spatial datasets verify the effectiveness of the proposed MSFCN, and experiments on two spatio-temporal datasets demonstrate that 3D CNN is a valid method to exploit information from spatio-temporal images. Meanwhile, we explore the impact of the number of layers and channels, which may provide beneficial references for designing FCN-based land cover classification networks.