Hierarchical Spatial-Spectral Feature Extraction with Long Short Term Memory (LSTM) for Mineral Identification Using Hyperspectral Imagery

Deep learning models are widely employed in hyperspectral image processing to integrate both spatial features and spectral features, but the correlations between them are rarely taken into consideration. However, in hyperspectral mineral identification, not only the spectral and spatial features of minerals need to be considered, but also the correlations between them are crucial to further promote identification accuracy. In this paper, we propose hierarchical spatial-spectral feature extraction with long short term memory (HSS-LSTM) to explore correlations between spatial features and spectral features and obtain hierarchical intrinsic features for mineral identification. In the proposed model, the fusion spatial-spectral feature is primarily extracted by stacking local spatial features obtained by a convolution neural network (CNN)-based model and spectral information together. To better exploit spatial features and spectral features, an LSTM-based model is proposed to capture correlations and obtain hierarchical features for accurate mineral identification. Specifically, the proposed model shares a uniform objective function, so that all the parameters in the network can be optimized in the meantime. Experimental results on the hyperspectral data collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) in the Nevada mining area show that HSS-LSTM achieves an overall accuracy of 94.70% and outperforms other commonly used identification methods.


Introduction
Minerals tend to form unique spectra due to their distinct constituents [1], and the corresponding regional characteristics are formed during the process of mineralization [2]. With the development of hyperspectral remote sensing, the spectral resolution and spatial resolution have become higher and higher, so lots of novel mineral identification methods have been proposed [3,4]. Traditional mineral identification methods are mainly aimed at the spectra of minerals, including methods based on spectral similarity and methods based on spectral characteristics [5][6][7]. The mineral identification methods based on spectral similarity establish standards to measure the similarity between the identified spectrum and a reference spectrum. Spectral libraries are established to store spectra of various minerals under different conditions for reference, such as the Jet Propulsion Laboratory (JPL) spectral library and United States Geological Survey (USGS) spectral library [8,9]. The spectral angle mapper (SAM) takes the angle between the target spectral vector and the reference spectral vector in the multidimensional the correlations between features maps in different scales, and the pixels in the neighborhood were equally important as input, which greatly reduced the attention to the identified pixel.
To overcome the aforementioned drawbacks, the hierarchical spatial-spectral feature extraction with LSTM (HSS-LSTM) method is proposed to extract hierarchical spatial-spectral features in this paper, which considers correlations between primary spatial features and spectral features. The main contributions of this paper can be summarized as follows: (1) To obtain local spatial features, a CNN-based model is constructed to extract local spatial features of a pixel from the first several principle components of the pixel's neighborhood region. (2) To construct the primary fusion spatial-spectral feature, the local spatial features of the pixel and its spectrum are stacked together. (3) To reveal the correlations between the components of the fusion feature and obtain intrinsic features, an LSTM-based model is established to refine deep hierarchical spatial-spectral features.
The remainder of this paper is organized as follows. In Section 2, the structure of HSS-LSTM is introduced in detail. Experimental results and corresponding analyses are presented in Section 3. Section 4 provides the conclusions.

Proposed HSS-LSTM Framework
The architecture of HSS-LSTM for mineral identification using hyperspectral imagery is shown in Figure 1. Overall, there are three main steps in the proposed model. First, local spatial features are extracted with a CNN-based model which consists of several convolution layers, max pooling layers and a fully connected layer. Then, a fusion spatial-spectral feature is presented which stacks the spectrum of a pixel with its local spatial features. Finally, an LSTM-based model is established to further extract the deep hierarchical spatial-spectral feature from the fusion feature, followed by a softmax layer to fulfil mineral identification.
Sensors 2020, 20, x FOR PEER REVIEW 3 of 15 different scales, and the pixels in the neighborhood were equally important as input, which greatly reduced the attention to the identified pixel.
To overcome the aforementioned drawbacks, the hierarchical spatial-spectral feature extraction with LSTM (HSS-LSTM) method is proposed to extract hierarchical spatial-spectral features in this paper, which considers correlations between primary spatial features and spectral features. The main contributions of this paper can be summarized as follows: (1) To obtain local spatial features, a CNNbased model is constructed to extract local spatial features of a pixel from the first several principle components of the pixel's neighborhood region. (2) To construct the primary fusion spatial-spectral feature, the local spatial features of the pixel and its spectrum are stacked together. (3) To reveal the correlations between the components of the fusion feature and obtain intrinsic features, an LSTMbased model is established to refine deep hierarchical spatial-spectral features.
The remainder of this paper is organized as follows. In Section 2, the structure of HSS-LSTM is introduced in detail. Experimental results and corresponding analyses are presented in Section 3. Section 4 provides the conclusions.

Proposed HSS-LSTM Framework
The architecture of HSS-LSTM for mineral identification using hyperspectral imagery is shown in Figure 1. Overall, there are three main steps in the proposed model. First, local spatial features are extracted with a CNN-based model which consists of several convolution layers, max pooling layers and a fully connected layer. Then, a fusion spatial-spectral feature is presented which stacks the spectrum of a pixel with its local spatial features. Finally, an LSTM-based model is established to further extract the deep hierarchical spatial-spectral feature from the fusion feature, followed by a softmax layer to fulfil mineral identification.

Spatial Features with CNN-Based Model
CNN is a special neural network which connects the local receptive field with convolution filters, so that it is able to extract local spatial features from images. Generally, a typical CNN model is usually composed of several pairs of convolutional layers and max pooling layers followed by a few fully connected layers [38]. Since the pixels which are closer to the identified pixel have greater influences on its local spatial features, a 10 × 10 neighborhood region of the identified pixel is taken as the input data of the proposed CNN-based model. To obtain spatial features, the proposed CNNbased model consists of two convolution layers with the kernel size of 3 × 3, two max pooling layers with the kernel size of 2 × 2 and a fully connected layer, as shown in Figure 2.

Spatial Features with CNN-Based Model
CNN is a special neural network which connects the local receptive field with convolution filters, so that it is able to extract local spatial features from images. Generally, a typical CNN model is usually composed of several pairs of convolutional layers and max pooling layers followed by a few fully connected layers [38]. Since the pixels which are closer to the identified pixel have greater influences on its local spatial features, a 10 × 10 neighborhood region of the identified pixel is taken as the input data of the proposed CNN-based model. To obtain spatial features, the proposed CNN-based model consists of two convolution layers with the kernel size of 3 × 3, two max pooling layers with the kernel size of 2 × 2 and a fully connected layer, as shown in Figure 2. The illustration below each data shows the dimension of the data. For example, a@m × n indicates that the data has channels, and the height and width of the data are m and n, respectively.
Considering that hyperspectral imagery has hundreds of bands, it leads to redundant information and intensive calculations [39]. Principal components analysis (PCA) is introduced to reduce the dimensionality of the raw hyperspectral data in the spectral domain, while maintaining the spatial information of the data in the meantime [40]. The first k principle components, whose eigenvalues account for more than 99% of all the principle components, are fed into the proposed CNN-based model as input [41].
As shown in Figure 2, the proposed CNN-based model has five layers, and the output of the lth layer is denoted as H l , where ∈{1,2,3, 4,5} l . In addition, H 0 is denoted as the input data. On the condition that the lth layer is a convolution layer, a 2-D convolution operation is conducted on H l to capture features and form a set of feature maps, named as H l . The process can be formulated as: whereσ ⋅ ( ) denotes the activation function, and * denotes the convolution operation. In addition, WCl and bCl represent the weight matrix and the bias vector that connect lth layer and its previous layer, respectively. When the lth layer is a max pooling layer, the data is compressed by replacing the neighborhood region of 2 × 2 with the maximum value of the region. Thus, the nonlinearity of the model is improved. The process is formulated as: where pool(·) denotes the max-pooling operation. At the end of the proposed CNN, a fully connected layer is presented to extract the spatial features denoted as Fsaptial. Mathematically, the step can be written as: where WF and bF represent the weight matrix and the bias vector that connect the fifth layer and the fourth layer. In order to accelerate the convergence of the proposed model, the sigmoid function is selected as the activation function.
In the proposed CNN-based model, the spatial feature vector of a pixel denoted as = 1 2 { , ,..., } spatial f F p p p , which has f elements with values between 0 and 1, is extracted from the first several components of its 10 × 10 neighborhood region.

Fusion Spatial-spectral Feature
Since various minerals present certain characteristics in spectral domain and spatial domain in hyperspectral imagery, it is crucial to capture both spectral features and spatial features when identifying minerals accurately. The spectrum of a pixel contains complete information in spectral The illustration below each data shows the dimension of the data. For example, a@m × n indicates that the data has channels, and the height and width of the data are m and n, respectively.
Considering that hyperspectral imagery has hundreds of bands, it leads to redundant information and intensive calculations [39]. Principal components analysis (PCA) is introduced to reduce the dimensionality of the raw hyperspectral data in the spectral domain, while maintaining the spatial information of the data in the meantime [40]. The first k principle components, whose eigenvalues account for more than 99% of all the principle components, are fed into the proposed CNN-based model as input [41].
As shown in Figure 2, the proposed CNN-based model has five layers, and the output of the lth layer is denoted as H l , where l ∈ {1, 2, 3, 4, 5}. In addition, H 0 is denoted as the input data. On the condition that the lth layer is a convolution layer, a 2-D convolution operation is conducted on H l to capture features and form a set of feature maps, named as H l . The process can be formulated as: where σ(·) denotes the activation function, and * denotes the convolution operation. In addition, W Cl and b Cl represent the weight matrix and the bias vector that connect lth layer and its previous layer, respectively. When the lth layer is a max pooling layer, the data is compressed by replacing the neighborhood region of 2 × 2 with the maximum value of the region. Thus, the nonlinearity of the model is improved. The process is formulated as: where pool(·) denotes the max-pooling operation. At the end of the proposed CNN, a fully connected layer is presented to extract the spatial features denoted as F saptial . Mathematically, the step can be written as: where W F and b F represent the weight matrix and the bias vector that connect the fifth layer and the fourth layer. In order to accelerate the convergence of the proposed model, the sigmoid function is selected as the activation function.
In the proposed CNN-based model, the spatial feature vector of a pixel denoted as F spatial = p 1 , p 2 , . . . , p f , which has f elements with values between 0 and 1, is extracted from the first several components of its 10 × 10 neighborhood region.

Fusion Spatial-Spectral Feature
Since various minerals present certain characteristics in spectral domain and spatial domain in hyperspectral imagery, it is crucial to capture both spectral features and spatial features when identifying minerals accurately. The spectrum of a pixel contains complete information in spectral domain, which can provide sufficient spectral features for mineral identification. Let F spectral = {a 1 , a 2 , . . . , a b } denote Sensors 2020, 20, 6854 5 of 15 the spectrum of a pixel with b bands, and the fusion spatial-spectral feature F of the pixel is constructed by stacking its spatial feature vector F saptial with the spectral vector F spectral , which can be written as: Due to the fact that the fusion spatial-spectral feature F concatenates the spatial features and the spectral features primitively without considering correlations between them, it is necessary to extract deep hierarchical features to achieve high-efficiency mineral identification.

Hierarchical Spatial-Spectral Feature with LSTM-Based Model
Although the fusion feature incorporates the spectral features and the spatial features, there is still a lack of exploration on their correlations to obtain hierarchical features for accurate mineral identification. Unlike other neural networks, RNN is able to process sequential inputs, and the output of the latter subsequence is related to the previous ones due to the connections between their hidden units [42]. In this manner, RNN is generally applied to the situations that data is interrelated, especially sequential data.
As shown in Figure 3, for a simple RNN model, the input whose length is T can be denoted as X = {x 1 , x 2 , . . . , x T }, and each item x t , where t ∈ {1, 2, . . . , T}, is a subsequence. At time step t, given the previous hidden layer state h t-1 , the hidden layer state h t and the output layer state y t of the current subsequence x t can be calculated as: where W h and U h represent the input-to-hidden and the hidden-to-hidden weight matrices respectively, and W o denotes the hidden-to-output weight matrix. In addition, b h and b o denote the input-to-hidden and the hidden-to-output bias vectors separately, and σ R denotes the activation function, which is generally the tanh function.
Sensors 2020, 20, x FOR PEER REVIEW 5 of 15 domain, which can provide sufficient spectral features for mineral identification. Let a a a denote the spectrum of a pixel with b bands, and the fusion spatial-spectral feature F of the pixel is constructed by stacking its spatial feature vector Fsaptial with the spectral vector Fspectral, which can be written as: Due to the fact that the fusion spatial-spectral feature F concatenates the spatial features and the spectral features primitively without considering correlations between them, it is necessary to extract deep hierarchical features to achieve high-efficiency mineral identification.

Hierarchical Spatial-spectral Feature with LSTM-based Model
Although the fusion feature incorporates the spectral features and the spatial features, there is still a lack of exploration on their correlations to obtain hierarchical features for accurate mineral identification. Unlike other neural networks, RNN is able to process sequential inputs, and the output of the latter subsequence is related to the previous ones due to the connections between their hidden units [42]. In this manner, RNN is generally applied to the situations that data is interrelated, especially sequential data.
As shown in Figure 3, for a simple RNN model, the input whose length is T can be denoted as x x x .., } , and each item xt, where ∈ {1, 2,..., } t T, is a subsequence. At time step t, given the previous hidden layer state ht-1, the hidden layer state ht and the output layer state yt of the current subsequence xt can be calculated as: where Wh and Uh represent the input-to-hidden and the hidden-to-hidden weight matrices respectively, and Wo denotes the hidden-to-output weight matrix. In addition, bh and bo denote the input-to-hidden and the hidden-to-output bias vectors separately, and σ R denotes the activation function, which is generally the tanh function. In order to tackle vanishing gradient and exploding gradient problems in a simple RNN, a variant of RNN is proposed, namely long short-term memory (LSTM) [43,44]. In LSTM, the sigmoid function L σ is used as the activation function of three gates, which determine the information transmitted in the network and preserve the errors that can be back-propagated through sequences, as shown in Figure 4. In order to tackle vanishing gradient and exploding gradient problems in a simple RNN, a variant of RNN is proposed, namely long short-term memory (LSTM) [43,44]. In LSTM, the sigmoid function σ L is used as the activation function of three gates, which determine the information transmitted in the network and preserve the errors that can be back-propagated through sequences, as shown in Figure 4.  Compared with a simple RNN, LSTM introduces cell state dominated by three gates with the sigmoid function as the activation function to generate outputs between 0 and 1, which is used to determine how much information retained from the previous steps. The states of the three gates are calculated as: where Wf, Wi and Wo denote the weight matrices between the input xt and the forget gate, the input gate and the output gate, respectively. Moreover, bf, bi and bo correspond to the bias vectors of the three gates. The cell state t C is updated by adding in new information t C  and discarding part of the information memorized in cell state t-1 C as follows: where WC and bC correspond to the weight matrix and the bias vector between the input layer and the hidden layer.  denotes an elementwise multiplication. The hidden state ht at time step t is determined by ot and Ct written as: It is obvious that the hidden state ht incorporates the effective information of the previous t time steps in LSTM. In view of the specialty, an LSTM-based model is constructed to extract the deep hierarchical feature from the fusion spatial-spectral feature, as shown in Figure 5. Considering that the proposed LSTM-based model takes sequential data as input, the fusion feature F needs to be preprocessed. F contains a total of f b + items and is equally divided into T dices, so each dice constructs a subsequence of length . In order to facilitate subsequent calculations, l needs to be an integer, and F is denoted as: Compared with a simple RNN, LSTM introduces cell state dominated by three gates with the sigmoid function as the activation function to generate outputs between 0 and 1, which is used to determine how much information retained from the previous steps. The states of the three gates are calculated as: where W f , W i and W o denote the weight matrices between the input x t and the forget gate, the input gate and the output gate, respectively. Moreover, b f , b i and b o correspond to the bias vectors of the three gates.
The cell state C t is updated by adding in new information C t and discarding part of the information memorized in cell state C t−1 as follows: where W C and b C correspond to the weight matrix and the bias vector between the input layer and the hidden layer. denotes an elementwise multiplication. The hidden state h t at time step t is determined by o t and C t written as: It is obvious that the hidden state h t incorporates the effective information of the previous t time steps in LSTM. In view of the specialty, an LSTM-based model is constructed to extract the deep hierarchical feature from the fusion spatial-spectral feature, as shown in Figure 5. Considering that the proposed LSTM-based model takes sequential data as input, the fusion feature F needs to be preprocessed. F contains a total of f + b items and is equally divided into T dices, so each dice constructs a subsequence of length l = ( f + b)/T. In order to facilitate subsequent calculations, l needs to be an integer, and F is denoted as: F= {s 1 , s 2 , . . . , s T } where s i , i ∈ {1, 2, . . . , T} is a vector of l elements.
where , {1, 2,..., } i i T ∈ s is a vector of l elements. Since the fusion spatial-spectral feature F is decomposed into T subsequences, each subsequence contains partial features and correlates with others. Under the control of the input gate and the forget gate, the cell state t C of the LSTM-based model is able to select the last cell state t-1 C and the current subsequence t s to remain the effective features of the first t subsequences and abandon the redundant features. The output gate refines t C to acquire the hidden unit t h , which not only integrates the effective features contained in these t subsequences, but also removes the excrescent parts. As above, with the continuous extension of the proposed LSTM-based model, a growing number of features are stored in the cell state. As for the last subsequence T s , its hidden state hT captures all of the effective features in the fusion spatial-spectral feature F. In this way, hT thinks about correlations between the components of the fusion feature F and refines valid features, so it is regarded as the deep hierarchical spatial-spectral feature. In order to fulfil mineral identification, a fully connected layer is added at the end of the proposed LSTM-based model to generate an output with length n, followed by a softmax layer to form an intuitive probability distribution of the n classes of identified minerals. Unlike other models that extract spectral features and spatial features separately resulting in their losses composed of multiple parts, HSS-LSTM refines the stacked spatial-spectral feature to achieve the hierarchical spatial-spectral feature, so that the loss of the model is exclusive. Considering that the LSTM-based model is able to backpropagate the loss primely, the front CNN-based model of HSS-LSTM can also be well trained to extract effective spatial features. To train the parameters in HSS-LSTM, the cross entropy loss is selected as the loss function, and it is minimized by using the adaptive moment estimation (Adam) with back-propagation [45]. After training and fine-tuning HSS-LSTM, the deep hierarchical feature incorporates enough spatial-spectral features which can be used for accurate mineral identification, and the class label for each pixel can be eventually acquired.

Experimental Results and Analysis
In this section, a well-known hyperspectral dataset which is gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor is employed to evaluate HSS-LSTM, namely the Cuprite mine in Nevada, USA. The study area is located on the eastern edge of Esmeralda and Nye County, Nevada (37°29′ to 37°35′ North, 117°9′ to 117°17′ West) [46]. The Nevada mining dataset is covered by various minerals including silicate minerals such as kaolinite and muscovite, sulfate minerals such as alunite, and carbonate minerals such as calcite [47]. The ground truth map of the dataset was produced by the Tricorder 3.3 software [9]. Several experiments are conducted to verify Since the fusion spatial-spectral feature F is decomposed into T subsequences, each subsequence contains partial features and correlates with others. Under the control of the input gate and the forget gate, the cell state C t of the LSTM-based model is able to select the last cell state C t−1 and the current subsequence s t to remain the effective features of the first t subsequences and abandon the redundant features. The output gate refines C t to acquire the hidden unit h t , which not only integrates the effective features contained in these t subsequences, but also removes the excrescent parts.
As above, with the continuous extension of the proposed LSTM-based model, a growing number of features are stored in the cell state. As for the last subsequence s T , its hidden state h T captures all of the effective features in the fusion spatial-spectral feature F. In this way, h T thinks about correlations between the components of the fusion feature F and refines valid features, so it is regarded as the deep hierarchical spatial-spectral feature. In order to fulfil mineral identification, a fully connected layer is added at the end of the proposed LSTM-based model to generate an output with length n, followed by a softmax layer to form an intuitive probability distribution of the n classes of identified minerals.
Unlike other models that extract spectral features and spatial features separately resulting in their losses composed of multiple parts, HSS-LSTM refines the stacked spatial-spectral feature to achieve the hierarchical spatial-spectral feature, so that the loss of the model is exclusive. Considering that the LSTM-based model is able to backpropagate the loss primely, the front CNN-based model of HSS-LSTM can also be well trained to extract effective spatial features. To train the parameters in HSS-LSTM, the cross entropy loss is selected as the loss function, and it is minimized by using the adaptive moment estimation (Adam) with back-propagation [45]. After training and fine-tuning HSS-LSTM, the deep hierarchical feature incorporates enough spatial-spectral features which can be used for accurate mineral identification, and the class label for each pixel can be eventually acquired.

Experimental Results and Analysis
In this section, a well-known hyperspectral dataset which is gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor is employed to evaluate HSS-LSTM, namely the Cuprite mine in Nevada, USA. The study area is located on the eastern edge of Esmeralda and Nye County, Nevada (37 • 29 to 37 • 35 North, 117 • 9 to 117 • 17 West) [46]. The Nevada mining dataset is covered by various minerals including silicate minerals such as kaolinite and muscovite, sulfate minerals such as alunite, and carbonate minerals such as calcite [47]. The ground truth map of the dataset was produced by the Tricorder 3.3 software [9]. Several experiments are conducted to verify the parameter settings of the HSS-LSTM method and compare with other mineral identification methods.

Experimental Data
The Nevada mining dataset consists of 400 × 350 pixels and 50 bands with the wavelength range from 1990 nm to 2480 nm, considering that the spectral features of minerals are mainly concentrated in the short-wave infrared [48]. The false color image and the ground truth map for the dataset are presented in Figure 6.
Sensors 2020, 20, x FOR PEER REVIEW 8 of 15 the parameter settings of the HSS-LSTM method and compare with other mineral identification methods.

Experimental Data
The Nevada mining dataset consists of 400 × 350 pixels and 50 bands with the wavelength range from 1990 nm to 2480 nm, considering that the spectral features of minerals are mainly concentrated in the short-wave infrared [48]. The false color image and the ground truth map for the dataset are presented in Figure 6. As shown in Figure 6b, similar minerals that belong to diverse classes thanks to the differences in metal ingredients tend to be merged into the same class. To obtain sufficient training samples and testing samples, and to avoid mixed pixels due to the restriction of the spatial resolution and mineralization, seven classes of widely distributed minerals are selected to be identified, including muscovite, halloysite, calcite, kaolinite, montmorillonite, alunite and chalcedony. Training samples and testing samples are randomly chosen according to the ground truth map, as shown in Table 1.

Analysis of Parameter Settings
HSS-LSTM is mainly constructed with an LSTM-based model, and there are considerable parameters to be configured that are closely related to the efficiency of the method. The PCA data and the original hyperspectral data are preprocessed to restrict their values between 0 and 1, so that HSS-LSTM is able to converge rapidly. As for the CNN-based model, the first three principle components which contain more than 99% information of the hyperspectral imagery are fed in as input. Considering that the sizes of feature maps decrease with the progress of convolution and max pooling, so more feature maps are demanded to maintain information of the data, the sizes of feature maps in different layers are set as With a view to the number of spectral bands, c1 is set As shown in Figure 6b, similar minerals that belong to diverse classes thanks to the differences in metal ingredients tend to be merged into the same class. To obtain sufficient training samples and testing samples, and to avoid mixed pixels due to the restriction of the spatial resolution and mineralization, seven classes of widely distributed minerals are selected to be identified, including muscovite, halloysite, calcite, kaolinite, montmorillonite, alunite and chalcedony. Training samples and testing samples are randomly chosen according to the ground truth map, as shown in Table 1.

Analysis of Parameter Settings
HSS-LSTM is mainly constructed with an LSTM-based model, and there are considerable parameters to be configured that are closely related to the efficiency of the method. The PCA data and the original hyperspectral data are preprocessed to restrict their values between 0 and 1, so that HSS-LSTM is able to converge rapidly. As for the CNN-based model, the first three principle components which contain more than 99% information of the hyperspectral imagery are fed in as input. Considering that the sizes of feature maps decrease with the progress of convolution and max pooling, so more feature maps are demanded to maintain information of the data, the sizes of feature maps in different layers are set as f = c 2 = 2c 1 . With a view to the number of spectral bands, c 1 is set to 10, 15 and 20 separately, and f and c 2 are set to 20, 30 and 40 correspondingly. In the LSTM-based model, it takes a sequence of length f + b as input, where b is 50, and each subsequence possesses l elements. l is set to 5 and 10 respectively. The number of the hidden units denoted as h is also a significant parameter in the proposed LSTM-based model, which is set to 40, 50 and 60 respectively. To achieve HSS-LSTM for mineral identification, diverse parameter combinations are trained with the learning rate set to 0.0005 and the training epochs set to 1000, and the corresponding overall accuracy (OA) are shown in Table 2. In Table 2, the overall accuracy is the maximum identification accuracy that can be achieved when HSS-LSTM with the specific parameter combination tends to be stable. It is obvious that the overall accuracy of HSS-LSTM varies with different parameter combinations, and it presents a growth as values of the parameters augment. The reason for this phenomenon is that more features are extracted and maintained when f, l and h are larger. The overall accuracy reaches the peak value when f, l and h are 40, 10 and 50, respectively, and the specific parameter combination is empirically chosen in the following discussion.

Identification Results of the Nevada Dataset
For evaluating the performance of HSS-LSTM, the identification results of the proposed method are compared with traditional identification methods and other deep learning methods. Specifically, the SAM method is a typical traditional method that utilizes the angle between the spectrum of the identified pixel and the reference spectrum of minerals in the multidimensional space [10]. The average spectrums of training samples of various minerals are taken as reference, as shown in Figure 7. Then the angles between the identified pixel and the reference spectrums are calculated, and the pixel will be classified into the class with the minimum value. In addition, an LSTM-based model with the mere spectrum divided into subsequences with a length of 10 as input is constructed for comparison to explore how spatial features affect the identification accuracy [43]. Besides, a 3D-CNN model is proposed to verify the necessity of deep hierarchical spatial-spectral feature extraction for mineral identification. The 3D-CNN utilizes 3-D filters to convolve the raw hyperspectral dataset [49], and both spectral features and spatial features are extracted in the meantime. The 3D-CNN model has a similar CNN-based network with HSS-LSTM, except that the kernel sizes of the convolution layers and the max pooling layers are 3 × 3 × 11 and 2 × 2 × 2 respectively. The identification accuracies of the test samples over different identification methods are reported in Table 3.    The UA, PA and AA denote the user's accuracy, the producer's accuracy and the average accuracy respectively in Table 3. The HSS-LSTM method achieves rather satisfied identification accuracies of various minerals, and it is obviously superior to other identification methods in terms of the average accuracy and the overall accuracy. To facilitate further discussion, the identification maps of the four different identification methods are shown in Figure 8. The black part of the results indicates the background.  The UA, PA and AA denote the user's accuracy, the producer's accuracy and the average accuracy respectively in Table 3. The HSS-LSTM method achieves rather satisfied identification accuracies of various minerals, and it is obviously superior to other identification methods in terms of the average accuracy and the overall accuracy. To facilitate further discussion, the identification maps of the four different identification methods are shown in Figure 8. The black part of the results indicates the background.    The UA, PA and AA denote the user's accuracy, the producer's accuracy and the average accuracy respectively in Table 3. The HSS-LSTM method achieves rather satisfied identification accuracies of various minerals, and it is obviously superior to other identification methods in terms of the average accuracy and the overall accuracy. To facilitate further discussion, the identification maps of the four different identification methods are shown in Figure 8. The black part of the results indicates the background.

Discussion
According to the experimental results presented above, it is proved that HSS-LSTM achieves better overall and average identification results compared with SAM, LSTM and 3D-CNN. Therefore, the proposed HSS-LSTM method is conducive to promoting mineral identification using hyperspectral imagery. In addition, some further discussions are presented below.
First, due to the similarity between the reference spectrums of halloysite and kaolinite as shown in Figure 7, it is hard to discriminate them in spectral domain. There exist similar problems in [48], where the producer's accuracy of kaolinite is only 68.4% based on the Kruse and Perry's (K&P) method. The specific identification results of halloysite and kaolinite on testing samples are presented in Table 4. A large number of testing samples of halloysite are misidentified into kaolinite with SAM and 3D-CNN. The LSTM-based methods including HSS-LSTM and LSTM capture correlations between spectral features and obtain intrinsic features, which promote the overall accuracy of halloysite and kaolinite by more than 10%. More importantly, HSS-LSTM takes correlations between spectral features and spatial features into consideration, which further improves nearly 3.6% overall accuracy compared with LSTM. Besides, it is worth noting that the HSS-LSTM method can significantly improve the producer's accuracy of kaolinite, reaching 89.50%, which is 15−30% higher than other methods. Second, the reference spectrum of chalcedony possesses no distinct spectral features, such as significant absorption bands, which results in relatively low identification accuracy with SAM and LSTM. HSS-LSTM improves more than 4% identification accuracy of chalcedony by exploiting spatial features with a CNN-based model. Although 3D-CNN also extracts spectral features and spatial features, HSS-LSTM refines the spatial-spectral feature for the hierarchical feature, which prompts the proposed model to perform the best when identifying chalcedony.

Discussion
According to the experimental results presented above, it is proved that HSS-LSTM achieves better overall and average identification results compared with SAM, LSTM and 3D-CNN. Therefore, the proposed HSS-LSTM method is conducive to promoting mineral identification using hyperspectral imagery. In addition, some further discussions are presented below.
First, due to the similarity between the reference spectrums of halloysite and kaolinite as shown in Figure 7, it is hard to discriminate them in spectral domain. There exist similar problems in [48], where the producer's accuracy of kaolinite is only 68.4% based on the Kruse and Perry's (K&P) method. The specific identification results of halloysite and kaolinite on testing samples are presented in Table 4. A large number of testing samples of halloysite are misidentified into kaolinite with SAM and 3D-CNN. The LSTM-based methods including HSS-LSTM and LSTM capture correlations between spectral features and obtain intrinsic features, which promote the overall accuracy of halloysite and kaolinite by more than 10%. More importantly, HSS-LSTM takes correlations between spectral features and spatial features into consideration, which further improves nearly 3.6% overall accuracy compared with LSTM. Besides, it is worth noting that the HSS-LSTM method can significantly improve the producer's accuracy of kaolinite, reaching 89.50%, which is 15−30% higher than other methods. Second, the reference spectrum of chalcedony possesses no distinct spectral features, such as significant absorption bands, which results in relatively low identification accuracy with SAM and LSTM. HSS-LSTM improves more than 4% identification accuracy of chalcedony by exploiting spatial features with a CNN-based model. Although 3D-CNN also extracts spectral features and spatial features, HSS-LSTM refines the spatial-spectral feature for the hierarchical feature, which prompts the proposed model to perform the best when identifying chalcedony.
Third, it is noted that the identification map of HSS-LSTM achieves more accurate identification results and reveals more precise textural information according to the ground truth map, as shown in Figure 8. SAM and LSTM only consider the spectrums of each pixel, so it is difficult to form regional structural characteristics. Besides, 3D-CNN lacks further refinement of the spectral-spatial feature, which induces the rough texture of the identification map.
Finally, with the high spectral resolution of hyperspectral remote sensing, a large number of researchers focused on the intrinsic spectral features of minerals to fulfil mineral identification, who conducted experiments on the Nevada mining dataset. The band selection (BS) method achieved an overall accuracy of 87.86% [50], compared with the spectral feature fitting (SFF) method 81.43% [6]. It is obvious that the overall accuracy of 94.70% with the HSS-LSTM method is much better than the identification methods merely based on the spectral features. Although some researchers applied CNN-based models to the mineral identification using hyperspectral images obtained by a hyperspectral camera [23,51], there exist few applications of deep learning in mineral mapping of airborne hyperspectral images. Moreover, the researchers discovered that the spatial features extracted by most of deep learning methods only focus on the geological features without considering correlations with the minerals [52], which highlights the innovation of the proposed HSS-LSTM method.
As above, the HSS-LSTM method achieves better mineral identification results on the experimental dataset by exploring the correlations between spectral features and the spatial features to obtain the hierarchical feature, especially when minerals have similar spectrums or do not have distinct spectral features. In addition, the identification map which is obtain by the proposed method is more consistent with the ground truth.

Conclusions
In this paper, in order to better exploit the spectral features and the spatial features of hyperspectral imagery for efficient mineral identification, the HSS-LSTM method is proposed to explore the correlations and obtain the hierarchical spatial-spectral feature. The proposed method exploits a CNN-based model to extract local spatial features followed by an LSTM-based model to capture the correlations between the spatial features and the spectral features in the spectrum, which promotes the identification accuracy of minerals. Besides, the proposed method can backpropagate the loss properly, so all the parameters are able to be optimized in the meantime. Comparative experiments are conducted on the Nevada mining dataset. Experimental results show that the HSS-LSTM method outperforms traditional mineral identification methods and other deep learning methods.
However, the parameters in the HSS-LSTM method affect the identification results greatly, which result in an overall accuracy alteration of more than 3% as shown in Table 2. It is necessary to further optimize them for a better parameter combination. In the other hand, some other state of the art deep learning models are able to extract more efficient spatial features than the proposed CNN-based model. Therefore, our future work will mainly focus on optimizing the parameters of the HSS-LSTM method and extracting spatial features with high-efficiency deep learning models.
Author Contributions: H.Z. contributes to conceptualization, literature search, methodology, review and proofreading; K.D. contributes to conceptualization, methodology, data collection, performing data analysis and manuscript writing; N.L. contributes to methodology, data analysis and review; Z.W. contributes to methodology, data collection and revising; W.W. contributes to methodology and data collection. All authors have read and agreed to the published version of the manuscript.