Hierarchical Multi-Scale Convolutional Neural Networks for Hyperspectral Image Classification

Deep learning models combining spectral and spatial features have been proven to be effective for hyperspectral image (HSI) classification. However, most spatial feature integration methods only consider a single input spatial scale regardless of various shapes and sizes of objects over the image plane, leading to missing scale-dependent information. In this paper, we propose a hierarchical multi-scale convolutional neural networks (CNNs) with auxiliary classifiers (HMCNN-AC) to learn hierarchical multi-scale spectral–spatial features for HSI classification. First, to better exploit the spatial information, multi-scale image patches for each pixel are generated at different spatial scales. These multi-scale patches are all centered at the same central spectrum but with shrunken spatial scales. Then, we apply multi-scale CNNs to extract spectral–spatial features from each scale patch. The obtained multi-scale convolutional features are considered as structured sequential data with spectral–spatial dependency, and a bidirectional LSTM is proposed to capture the correlation and extract a hierarchical representation for each pixel. To better train the whole network, weighted auxiliary classifiers are employed for the multi-scale CNNs and optimized together with the main loss function. Experimental results on three public HSI datasets demonstrate the superiority of our proposed framework over some state-of-the-art methods.


Introduction
Hyperspectral images (HSIs) [1] often contain hundreds of spectral bands varying from visible wavelength to short infrared light. This rich information enables us to distinguish different materials which look similar to the naked eye or the conventional RGB cameras. To benefit from this type of data, HSI classification has become one of the most important tasks for various applications [2][3][4][5][6] and many classification methods have been proposed.
In the last decades, many pattern recognition approaches, such as decision tree [7], support vector machine (SVM) [8], random forest [9], and maximum likelihood estimate (MLE) [10] have been utilized to perform HSI classification. Among these traditional methods, SVM is considered as a stable and efficient method for HSI classification since it can handle the "curse of dimensionality" problem [11] and requires a relatively small size of training data. In recent years, contextual information [12] has been proven to be effective for HSI classification, leading to a growing interest in the research of joint spectral-spatial features. Earlier methods include Gabor filters [13], gray-level co-occurrence matrices [14] and mathematical morphological profiles [15]. Though these methods could improve the classification results to some extent, the handcrafted features are strongly dependent on the prior the sequential property and extract a hierarchical representation for each hyperspectral pixel. To better train the whole network, auxiliary classifiers are added as constraints for the multi-scale CNNs and optimized together with the main loss during the training process. The main contributions of this paper are summarized as follow: 1.
An end-to-end framework (HMCNN-AC) combining multi-scale CNNs and a bidirectional LSTM is proposed to generate discriminative features with spectral information, spatial information and scale-dependency for HSI classification.

2.
To address the limitation of the single input spatial scale and effectively utilize the spatial contexture, a multi-scale CNN based technique is introduced to extract spectral-spatial features at various input scales simultaneously. In this way, different shapes and sizes of objects in HSI could be considered during classification and the multi-scale features will make the performance more robust compared with the single scale input.

3.
A bidirectional LSTM is proposed to capture the dependency and correlation between the obtained multi-scale features from a sequential perspective and output a hierarchical representation for each pixel. Compared with the conventional concatenation method to simply concatenate multi-scale features together, we can make full use of the correlation between multi-scale features without losing scale-dependent information by adopting the bidirectional LSTM. 4.
In order to better train the whole network, weighted auxiliary classifiers are employed for multi-scale convolutional features and optimized together with the main loss during the training process. This helps to fully optimize the parameters in multi-scale CNNs and obtain robust convolutional features without adding too many extra parameters.
The rest of the paper is organized as follows. Section 2 reviews the classical deep learning models used in this paper, namely CNNs and LSTM networks. In Section 3, we give a detailed description of our proposed HMCNN-AC framework. Experimental analysis and a comparative evaluation of other baseline methods are reported in Section 4. Section 5 presents the final conclusion and gives directions to the future work.

CNNs
Convolutional neural networks (CNNs) have attracted much attention due to their outstanding performance in computer vision tasks [29,30]. Typical CNNs consist of convolutional layers, pooling layers, and a fully connected layer. The convolutional layers are composed of multiple convolutional kernels. Each kernel is a weight matrix and generates a distinct feature map via the convolutional operation with the input, which can be formulated as: where x j i is the jth channel's feature map of the present layer i, W j i is the jth convolutional kernel matrix at layer i, b j i is the bias term, x i−1 is the output feature maps of the (i − 1)th convolutional layer, the symbol * represents convolutional operation, and F(·) is an activation function.
Pooling layers are often after the convolutional layers to further reduce the redundancy by partitioning the input data into a set of non-overlapped sub-regions and returning the average or maximum values locally. Both convolutional layers and pooling layers can be repeated multiple times to obtain the representative features. Finally, a fully connected layer is followed to further process the extracted features and a multinomial logistic regression (MLR) layer is added at the end to convert the output convolutional features into category or regression results.

LSTM and Bidirectional LSTM
Recurrent neural networks (RNNs) are an important class of deep neural networks which has "memory" functionality and can remember past state information while dealing with the current state. This powerful function makes RNNs extremely suitable for processing sequential inputs. However, the traditional RNNs are difficult to train in practice due to the vanishing or exploding gradient [31].
To address this issue, the long short-term memory (LSTM) network [32] was developed to capture the long-term dependencies by designing a more sophisticated recurrent unit. The diagram of a basic LSTM building block is presented as Figure 1. Compared with the traditional RNNs, LSTM explicitly introduces a memory cell into the network together with some control gates to decide the information flow. As we can see, every unit of the network takes three inputs, namely the current input data x t , the previous hidden state h t−1 and the previous memory cell c t−1 . Then we use them to calculate the following values. The current hidden state h t is obtained by: where tanh(·) is the hyperbolic tangent function [33] and o t is the output gate at current state which determines the percentage of the exposed memory content. The o t is determined as: where the σ(·) is the sigmoid function, W oi and W oh represent the weight matrices of the input-output and the hidden-output respectively, and b o is the bias of this layer. The c t is the memory cell of the current state and it is updated by adding some new content of the candidate memory cellc t and discarding part of the previous memory cell c t−1 , as described in Equation (4): where the input gate i t controls the extent to which the candidate memory cellc t will be added to the current memory cell c t , and the forget gate f t modulates the percentage of the previous memory cell c t−1 should be forgotten. The operation is an element-wise multiplication and the new candidate memory cellc t is obtained by:c The input gate i t and forget gate f t are calculated as follows: The LSTM explores the dependencies of a current state with the previous states. However, when it comes to the non-time sequential vectors, the dependencies exist in both the forward direction and the backward direction. To solve this problem, a bi-directional LSTM [34] was proposed to utilize both forward and backward relationship of a sequential data. In that case, we first apply a forward LSTM to the input sequence to obtain a forward output, and then a backward LSTM is applied to obtain a backward output. The final output of this layer is a concatenated version of the forward vector and the backward vector.

Proposed Methods
In this section, we give a detailed description of our proposed hierarchical multi-scale CNNs with auxiliary classifiers (HMCNN-AC) for HSI classification. The whole framework is shown in Figure 2. To solve the limitation of the single input scale in conventional methods, multi-scale CNNs are employed to extract multi-scale convolutional features at different spatial scales. In addition, to fully explore the dependency and correlation of the obtained multi-scale features, a bidirectional LSTM is adopted to characterize the sequential property and extract a hierarchical representation for each hyperspectral pixel. Moreover, to better train the whole network, weighted auxiliary classifiers are employed for these multi-scale convolutional features. The whole framework is trained in an end-to-end supervised manner and all the parameters are optimized by mini-batch stochastic gradient descent (SGD) algorithm [35].

Multi-Scale Convolutional Feature Extraction
Traditional CNNs can only extract deep spectral-spatial features from receptive fields with one single input scale. Since objects in remote sensing images often appear with various shapes and sizes, the fixed input scale will limit the effectiveness of spatial feature integration and fail to capture the scale-dependent information. In order to accommodate the objects with different scales and effectively utilize the spatial contexture, multi-scale CNN based technique is introduced to our proposed framework. Specifically, our multi-scale convolutional feature extraction is composed of two main phases: multi-scale patches generation and convolutional feature extraction.
To construct multi-scale patches, a sub-region image set with S scales {P s } S s=1 is generated for each pixel from the original HSI cube. The first scale patch P 1 is just the central spectrum itself with the size of 1 × 1 × D, where D represents the number of spectral bands. The second scale patch P 2 contains P 1 but with a larger spatial scale of 3 × 3 × D. In that case, an upper scale patch always contains the lower scale patches, which results in a spatial dependency among multi-scale patches. Our last scale patch P s contains all the lower patches with a size of w × w × D, where w is the largest input scale, namely the original input window. The window size w is carefully selected so that it would include enough spatial information without increasing the outliers' disturbance and the computing complexity. For each pixel, all these generated multi-scale patches are spatial-dependent and share the same central spectrum with the same class label, as seen in Figure 2. By utilizing multi-scale patches, we would not only take multi-scale spatial information into account, but also put more emphasis on the central spectra and reduce the unwanted noise.
In our second phase, we perform convolutional operations on the generated multi-scale patches to extract spectral-spatial features at different input scales. To preserve as much spatial information as possible, we exclude pooling layers from the multi-scale CNNs and treat each scale patch as a conventional image with multiple channels on its own [23,36,37]. The convolutional operator is formulated as below: where x j i denotes the jth feature map at the ith layer; x i−1 is the output of the (i − 1)th convolutional layer; F(·) stands for the activation function; b j i refers to the bias term; and W j i is the jth convolutional kernel at the layer i with the depth equaling to the channel number of x i−1 . For the first convolutional layer, the depth of W j 1 equals to the number of spectral bands D. The detailed architecture description of the multi-scale CNNs is listed in Table 1. For the proposed model, we adopt the ReLU [38] function as the activation function which is defined as follow: with various input scales, the detailed structures of each sub-network including the numbers of convolution layers, kernel sizes and kernel numbers are specified.
After the convolutional layers, we adopt a fully connected layer for each spatial scale to obtain the multi-scale convolutional features f 1 , f 2 , ..., f S for each pixel. Since these multi-scale features contain discriminative characteristics from both spectral dimension as well as their respective spatial scales, we can use them to improve the HSI classification results.

Hierarchical Feature Learning
A conventional way to process the obtained multi-scale convolutional features is to concatenate them together into a 1-D vector directly. However, this simple concatenation-based method ignores the inherent correlation of the multi-scale features and will lead to the loss of scale-dependent information.
Since the multi-scale patches for each pixel are spatial-dependent and share the same central spectrum with the same class label, we consider the extracted multi-scale features as sequential structured data with spatial-spectral dependency. Thus, a bidirectional LSTM is proposed to characterize the sequential property and extract a hierarchical representation for each hyperspectral pixel.
We apply one or two layers of bidirectional LSTM to the obtained multi-scale features f conv , depending on the performance of different HSI datasets: where .., f S ] is the multi-scale convolutional feature sequence with length S, S is the scale number of generated multi-scale patches for each pixel; and BiLSTM is the bidirectional LSTM operation which is introduced in Section 2. The obtained hierarchical feature f is regarded as the most representative feature for each pixel and will be used for the final HSI classification.

Label Prediction
After the multi-scale CNNs and the bidirectional LSTM networks, we feed the output feature f to a MLR layer for the final classification. The output size of the MLR is the same as the total number of HSI classes and we use the Softmax function [39] as the activation function. The label of each pixel is determined by the class with the largest probability: where W MLR and b MLR are the weight matrix and bias of the MLR layer, y is the predicted label and Φ(·) is the MLR function which is defined as:

Weighted Auxiliary Classifiers
A main difference between our proposed method and the traditional deep learning-based HSI classification methods is the adoption of the weighted auxiliary classifiers for our multi-scale CNNs. The motivation of employing these auxiliary classifiers is to better train the sub-networks of the multi-scale CNNs in our proposed architecture and acquire robust multi-scale convolutional features, especially for a deep model consisting of two neural networks. Since the final output feature f is closer to the bidirectional LSTM part, the parameters in the bidirectional LSTM might have a dominant position during the training process, while the parameters in the multi-scale CNNs may not get fully optimized.
To address this issue, auxiliary classifiers are added as constraints for the multi-scale convolutional features to perform sub-classification. The predicted label for each multi-scale feature is given as: where f i is the ith scale convolutional feature from the multi-scale CNNs and i = 1, 2, ..., S, W i aux and b i aux are the weight matrix and bias of ith auxiliary classifier, Φ(·) is the MLR function which produces the ith auxiliary predicted label y i . The total loss L of our proposed HMCNN-AC contains one main loss L main as well as S weighted auxiliary losses L i aux , and measures the total difference between the predicted outputs and the real labels, which is defined as: where L main is the main loss of the whole framework, L i aux is the ith auxiliary loss for the i-th scale convolutional feature f i ; α i is the weight for its corresponding auxiliary loss and will be discussed in later experiments; S is the number of multi-scale features. The whole network is trained in an end-to-end manner and all the parameters are optimized simultaneously by minimizing the total loss function L.

Experimental Results and Analysis
In order to evaluate our proposed model, three publicly available HSI datasets [40] are utilized to perform classification. We choose several other HSI classification methods, including SVM, extended morphological profiles with SVM (EMP-SVM) [15], stacked autoencoder (SAE-PCA) [20], CNN-based structure on spatial dimension (CNN-MLR) [22], LSTM [25] and MCNN [28], as baselines for a comparative evaluation. To evaluate the effectiveness of the adopted auxiliary classifiers and the bidirectional LSTM in our proposed method, we also construct two other frameworks based on our original proposed HMCNN-AC architecture, namely HMCNN and MCNN-AC, and compare their classification results with other baseline methods.
For the evaluation metrics of all methods above, we adopt overall accuracy (OA), average accuracy (AA), and κ coefficient [41] to measure the performance. OA is the overall accuracy for all classes and is defined as follow: where test,correct is the i-th correctly classified test sample, N is the total number of test samples. AA represents the averaged accuracy of each class, and is defined as: where M is the class number of each dataset, N i is the total test sample number of the i-th class, and x (j) i,correct is the j-th correctly classified test sample of class i. Kappa coefficient [41] is a statistical measurement of agreement degree, referred as κ. The higher of all our measurement metrics the better of the classification performance. The experiments are performed on a desktop PC equipped with an Intel Core 5 CPU and four GTX 1080 GPUs.

Dataset Description
We choose Salinas, Pavia University (PaviaU) and Kenned Space Center (KSC) as our evaluation datasets and their corresponding false-color images and ground truth maps are shown as  Considering the dataset sizes, we randomly select 5% labeled samples of each class from Salinas and PaviaU as training sets and use the rest 95% samples as test sets. As for KSC, 10% samples of each class are chosen for training and the rest 90% samples for testing.  There are 16 different classes in this HSI, as shown in Figure 3. The training set and test set are presented in Table 2.  Figure 4. And the training data and test data are displayed in Table 3. (3) Kenned Space Center: Our last dataset, Kenned Space Center (KSC), was acquired by the AVIRIS sensor in Florida on 23 March 1996. The spatial dimension of this hyperspectral image is 512 × 614 pixels and the original spectral dimension is 224 spectral bands. Due to the noisy bands, we only use 176 bands for method evaluation. The image set contains 13 labeled categories, which can be seen from the ground truth map in Figure 5. The training data and test data are described in Table 4.

Compared Methods
To demonstrate the effectiveness of our proposed HMCNN-AC, several classification approaches are adopted. These compared approaches are based on two kinds of groups: one is based on classical machine learning methods, which includes (RBF-)SVM and extended morphological profiles (EMP) [15]; the other is based on deep learning models, which includes SAE-PCA (joint spectral-spatial features) [20], CNN-MLR [22], LSTM [25], and MCNN [28]. Since the 3D-CNN method requires a long training time, here we exclude this method.

Classical Machine Learning Methods
(a) The SVM method only utilizes spectral information of each pixel and we use the LibSVM [42] for the classification in our experiments. (b) For the EMP method, spatial features are exploited by the adoption of five opening and closing operations followed by morphological reconstruction on the first three principal components of HSI. We use the disk structure element to perform the morphological operations and the structure sizes range from 1 to 9. The generated features are then sent to a conventional RBF-SVM for classification.

Deep Learning Based Methods
(a) For the implementation of SAE-PCA, first four PCs of each hyperspectral dataset are extracted and flattened within a certain neighborhood region, then they are concatenated with spectral features to obtain the joint spectral-spatial features for further processing. (b) For the CNN-MLR, the number of PCs is determined by reserving at least 99.9% information of the original hyperspectral dataset, then convolutional operations are applied to the reserved PCs on spatial dimension. (c) The LSTM method only exploits spectral information and regards a spectrum as sequential data to explore the spectral correlation within different spectral channels. (d) For MCNN, first three PCs are selected and three spatial scale patches for each PC band are generated and down-sampled to the equally spatial sizes for spatial feature extraction. The obtained spatial features are concatenated with spectral information for each scale's classification and the final results are obtained by majority voting. Other parameter settings for SAE-PCA, CNN-MLR, LSTM and MCNN are set to the default values in [20,22,25,28], respectively.

Network Hyperparameters Discussion
The hyperparameters in our proposed deep model could influence the classification performance. Except the weights in the networks can be automatically learned during the training process, several other related hyperparameters need to be discussed, such as input scale number S, auxiliary weights α i and layer settings in the bidirectional LSTM. In this section, we shall present the sensitivity analysis and investigate the effects of these hyperparameters, and the overall process is shown in Figure 6. Figure 6. The flow chart of hyperparameter tuning process. For each HSI, we first fix auxiliary weights α i and layer settings in Bi-LSTM, and adjust the input scale number S to find the optimal one. Then we fix scale number S and layer settings in Bi-LSTM to find suitable auxiliary weights α i . The same procedure goes for determining suitable layer settings in Bi-LSTM. Finally, we obtain all the determined hyperparameters.

Input Scale Number S
Objects in remote sensing images often appear with various shapes and scales, it is necessary to introduce multi-scale information into HSI classification. To quantitatively analyze the effects of different input scales on the final accuracy of our proposed method, scale number S from 3 to 8 with its corresponding largest input window w from 5 to 15 is investigated in this section. Figure 7 shows the OA value changes with scale number S on three HSI datasets. It can be seen that OAs first improve as the scale number S increases, but become saturated later on. On the one hand, classification with a small S fails to collect enough scale-dependent features, resulting in relatively lower accuracy. And a larger S allows us to consider more spatial scales, which helps to improve classification accuracy to some extent. On the other hand, further increasing S does not further improve performance since we already have enough spatial information. Moreover, it could incur an additional computational expense. Accordingly, we choose a spatial scale number of S = 8 with corresponding window size w = 15 for all three datasets.

Auxiliary Weights α i
The auxiliary classifiers help to better train the multi-scale CNNs and obtain robust convolutional features. And the auxiliary weight α i is an important parameter since it determines the percentage of each auxiliary loss L i aux takes in the total loss L. In this study, we investigate the effects of auxiliary weights on the final classification results. To simplify the experiment, the auxiliary weights for each spatial scale are set to be equal α i = α. Since the upper-scale patches always contain the lower-scale patches and we want to yield prediction of the central spectrum, equal weights assignment allows us to put more emphasis on the central spectra and reduce the unwanted noise. Specially, we set α from 0.1 to 1, with an increment of 0.1. The main loss weight is set to 1, as described in Equation (14).  Figure 9 shows the OA values with different layer settings. Accordingly, the optimal hyperparameters setting for Salinas is two-layer with 64 and 128 memory cells respectively. For Pavia University, two-layer with 64 memory cells respectively achieves best results. As for KCS datasets, one-layer with 64 memory cells yields good performance due to its smaller data size.

Classification Results
In this section, we will report the classification results of the proposed HMCNN-AC along with other compared methods. The parameters are chosen based on our experimental results. To better train the whole network, batch normalization [43] and dropout technique [44] are adopted after every convolutional layer to avoid overfitting. We use RMSprop [45] as the optimization algorithm. The classification maps of different methods on three hyperspectral datasets are displayed in Figures 10-12 and the quantitative assessments are shown in Tables 5-7. As can be seen, among all the methods, HMCNN-AC achieves the best classification results on all three datasets, with OA = 99.88% for Salinas with 5% training samples, OA = 99.83% for Pavia University with 5% training samples, and OA = 98.27% for Kenned Space Center with 10% training samples, and yields the cleanest visualization results much more similar to reference maps than others. Compared with other methods, SVM-RBF and LSTM obtain relatively poor performances and exhibit noisy estimations in classification maps, since they fail to consider spatial information. In contrast, the classification results of EMP-SVM, SAE, CNN-MLR and MCNN methods show much improvement and deliver smoother appearance in classification maps by combining spectral and spatial features, especially for MCNN method with multi-scale features. [32] [      The main reasons for the superior performance of our proposed HMCNN-AC lie in two aspects. On the one hand, multi-scale CNNs are adopted to capture the scale-dependent information considering objects with different shapes and sizes. With the help of auxiliary classifiers, the whole network could be better trained and robust convolutional features are acquired. On the other hand, to fully explore the spectral-spatial correlation of the obtained multi-scale features, a bidirectional LSTM is proposed to capture the dependency and extract a hierarchical representation for each pixel, instead of simply concatenating these multi-scale features together. In order to validate the effectiveness of the adopted auxiliary classifiers and the bidirectional LSTM, we construct two other frameworks, namely HMCNN and MCNN-AC, based on our original proposed HMCNN-AC architecture, and compare their classification performances on three HSIs with other baseline methods. The detailed analysis is shown as below.

Effective Analysis of Auxiliary Classifiers
To further examine the effectiveness of the auxiliary classifiers, we construct another framework named HMCNN based on our original proposed HMCNN-AC architecture as the baseline. The network of HMCNN as well as other parameter settings is the same as our original proposed HMCNN-AC. The only difference between these two frameworks is whether adopting auxiliary classifiers. In that case, we could demonstrate the effectiveness of the auxiliary classifiers by comparing the classification performances of these two models on three HSI datasets. The quantitative assessments of HMCNN are also shown in Tables 5-7 with other baseline methods, and the visualization results can be seen in Figures 10-12, subfigure (h). To clearly demonstrate the comparison results, we plot the OA values of the two proposed frameworks on three HSIs in Figure 13. The red bars represent the results of our original proposed HMCNN-AC with weighted auxiliary classifiers, and the blue ones indicate the newly constructed HMCNN without auxiliary classifiers. From the figure, it is clear to see a decline in OA for all three datasets with HMCNN architecture, which confirms the effectiveness of our auxiliary classifiers in improving HSI classification.

Effective Analysis of Bidirectional LSTM
To analyze the effectiveness of the adopted bidirectional LSTM in exploring the spectral-spatial correlation of the multi-scale features and learning hierarchical features, we replace the bidirectional LSTM part in our original proposed architecture with one simple concatenation layer as the baseline and keep other parameter settings the same. The constructed framework is named MCNN-AC and the concatenation layer concatenates the obtained multi-scale features into 1D vector for the final classification. The quantitative metrics and visualization maps of MCNN-AC on three HSIs are also compared with other methods in Tables 5-7 and Figures 10-12, respectively. To better evaluate the effectiveness of adopting the bidirectional LSTM, we present the OA values of MCNN-AC and our original proposed HMCNN-AC on three datasets, as displayed in Figure 14. From the comparison, we can observe that the OAs of MCNN-AC decrease a little when replacing the original bidirectional LSTM with the concatenation layer, which validates the effectiveness of the bidirectional LSTM in spectral-spatial correlation exploration and hierarchical feature learning.

Conclusions
In this paper, a novel hierarchical multi-scale CNN with auxiliary classifiers (HMCNN-AC) is proposed for HSI classification. Unlike the conventional spectral-spatial classification methods where only one single input scale is considered for spatial feature integration, our proposed method could extract spectral-spatial features at various input scales simultaneously by adopting the multi-scale CNNs technique. To fully explore the spectral-spatial dependency of the obtained multi-scale features, a bidirectional LSTM is proposed to capture the correlation and extract a hierarchical representation for each hyperspectral pixel from a sequential perspective. In order to better train the whole network, weighted auxiliary classifiers are employed for the multi-scale CNNs and are optimized together with the main loss function. Experimental results on three public datasets demonstrate the superiority of our proposed method with the highest OA values and the cleanest visualization maps compared with other baseline methods.
Although our proposed HMCNN-AC model has achieved remarkable performance, some details are worth further investigation. For instance, the convolutional features from different spatial scales might exert different influence on the final classification and equal weights assignment would fail to reflect this difference in our current setup. Besides, the parameter settings in the multi-scale CNNs are also worth further discussion for better performance.