Using dual-channel CNN to classify hyperspectral image based on spatial-spectral information

Abstract: In the field of remote sensing image processing, the classification of hyperspectral images (HSI) is a hot topic. Two main problems make the classification accuracy unsatisfactory. First, recent research on HSI classification is based on spectral features, so the relationship between different pixels is ignored. Second, HSI data contain no labeled data or only a small amount of it, which makes it difficult to build a good classification model. To solve these problems, a dual-channel CNN model is proposed to boost the discriminative capability for HSI classification. The proposed model has several distinct advantages. Firstly, it consists of a spectral feature extraction channel and a spatial feature extraction channel; a 1-D CNN and a 3-D CNN are used to extract the spectral and spatial features, respectively. Secondly, the dual-channel CNN fuses the spatial and spectral features, and the fused feature is input into the classifier, which effectively improves the classification accuracy. Finally, because it considers both spatial and spectral features, the model can effectively alleviate the problem of insufficient training samples. Experiments on benchmark data sets demonstrate that the proposed dual-channel CNN model considerably outperforms other state-of-the-art methods.


Introduction
In recent years, many scholars have proposed methods to extract features from hyperspectral images. These methods can be divided into three categories: spectral domain analysis, spatial domain analysis and spatial-spectral analysis.
Spectral domain analysis uses the spectral information of hyperspectral images for classification [1][2][3]. There are two kinds of spectral domain analysis. The first does not reduce the dimensionality of the hyperspectral image; the image is classified directly from the original spectral information [4][5][6][7]. The second first reduces the dimensionality of the hyperspectral image and then classifies it. Commonly used dimensionality reduction methods include PCA [2,8], ICA [3] and LDA [1]. The disadvantage of these methods is that they use only the spectral features of the hyperspectral image and ignore the relationship between different pixels, so their classification results may contain salt-and-pepper-like noise [9]. Therefore, the classification accuracy of spectral domain analysis is not ideal.
Spatial domain analysis uses spatial information for the classification of hyperspectral images. This spatial information includes color, contour and texture. Numerous studies have shown that spatial features help to improve the representation and classification accuracy of HSI data. To extract the spatial features of HSI, a spatial filter must be defined; common spatial filters include the gray level co-occurrence matrix, wavelet sign, geometric features, texture features and so on. However, these spatial features are usually designed for specific data sets, generalize weakly, and cannot be widely used. Moreover, the variability of spatial features is very large, which makes it impossible to set the classification parameters of HSI from empirical values. In recent years, deep learning has been widely used in hyperspectral image processing. Compared with the traditional manual design of spatial feature parameters, deep learning methods extract spatial features automatically, and these features are highly robust in classification tasks. Y. Chen et al. [9] fed spatial features directly into an autoencoder to classify hyperspectral images. However, this method converts the original two-dimensional image data into one-dimensional data at the input, which causes a great loss of spatial information. X. Chen et al. [10] proposed a convolutional neural network model that extracts two-dimensional spatial features and performs vehicle recognition. However, the above spatial domain analysis methods have two problems. First, different objects in a hyperspectral image often have different sizes, so a fixed-size detection window cannot meet the detection requirements of objects of different sizes. Second, spatial domain analysis ignores the spectral features of the original hyperspectral image.
Spatial-spectral analysis methods consider spectral and spatial information together. These methods have attracted great interest and have improved HSI classification accuracy significantly [11][12][13][14][15][16][17]. Camps-Valls et al. [18] proposed a Composite Kernel (CK) that easily combines spatial and spectral information to enhance the classification accuracy of HSI. Li et al. [19] extended CK to a generalized framework that offers great flexibility in combining the spectral and spatial information of HSIs. Li et al. [20] proposed the Maximizer of the Posterior Marginal by Loopy Belief Propagation (MPM-LBP), which exploits the marginal probability distribution from both the spectral and spatial information. Zhong et al. [21] developed a discriminant tensor spectral-spatial feature extraction method for HSI classification. Kang et al. [22] proposed a spectral-spatial classification framework based on Edge-Preserving Filtering (EPF), in which the filtering operation achieves a local optimization of the probabilities. Feng et al. [11] defined discriminative spectral-spatial margins (DSSMs) to reveal the local information of hyperspectral pixels and explored the global structures of both labeled and unlabeled data via low-rank representation. Zhou et al. [23] proposed a spatial and spectral regularized local discriminant embedding (SSRLDE) method for dimensionality reduction (DR) of HSIs. However, most of these methods extract spectral-spatial features using a shallow architecture, which limits their complexity and non-linearity.
Although the above methods have made progress in different areas, most of them rely on manually designed features. They require a classification strategy to be established first, and they design that strategy directly on the data without using classification label information. They are not end-to-end approaches. Therefore, these methods depend heavily on prior knowledge of specific fields and usually do not yield the optimal solution [24]. Generally, HSI classification aims at assigning each pixel to its correct class. However, pixels in smooth homogeneous regions usually have high within-class spectral variations. Consequently, it is crucial to exploit the nonlinear characteristics of HSI and to reduce these within-class variations. In recent years, the advantages of deep learning in these respects have gradually emerged, and deep learning has been applied successfully to hyperspectral image classification. Examples of unsupervised feature learning include the stacked autoencoder [9] and the deep belief network [25]. Although these unsupervised learning methods can extract deep features, they need to flatten the three-dimensional data into a one-dimensional form to meet their input requirements, and therefore they lose the spatial information [26]. Other methods are based on supervised autoencoders [27], which make use of classification label information in the learning process. These works demonstrate that deep learning opens a new window for future research and show the great potential of deep learning-based methods. However, how to design a proper deep network is still an open question in the machine learning community [28,29].
As mentioned above, compared with traditional spectral domain HSI classification methods, deep learning methods can learn data dependencies directly from the original data and build a hierarchical representation of the data. Although the above deep learning methods achieved good results, they did not make full use of both spectral and spatial information for classification. Therefore, it is necessary to combine spatial and spectral feature information to further improve the classification accuracy of hyperspectral images. To this end, this paper proposes a dual-channel CNN model to boost the discriminative capability for HSI classification. The proposed model has several distinct advantages. Firstly, it consists of a spectral feature extraction channel and a spatial feature extraction channel; each channel extracts the spectral or spatial features of the original HSI separately. Secondly, the spectral and spatial features are fused by a fully connected layer, and the fused feature is input into the classifier, which effectively improves the classification accuracy. Finally, because it considers both spectral and spatial features, the model can effectively alleviate the problem of insufficient training samples. Experiments on benchmark data sets demonstrate that the proposed dual-channel CNN model considerably outperforms other state-of-the-art methods.
The main contributions of the dual-channel CNN for classifying hyperspectral images based on spatial-spectral information can be summarized as follows:
(1) A novel end-to-end neural network architecture is proposed for modeling hyperspectral images. The architecture has fewer independent connection weights and thus requires less training data. The method is found to outperform the highest reported accuracies on a popular hyperspectral image dataset.
(2) Compared with hand-crafted feature extraction, the proposed deep model can adaptively learn spectral-spatial joint features, which contain semantic and discriminative information from both the spectral and spatial domains.
(3) The design aims at efficient spectral-spatial joint feature learning while keeping the number of parameters low. As a result, a considerable improvement in training time is observed compared with other popular architectures.

Related works
In recent years, the convolutional neural network (CNN) has made great achievements in the field of computer vision [30]. Many studies have shown that CNN-based methods can significantly improve the accuracy of hyperspectral image classification. For example, Li et al. proposed a feature extractor based on a convolutional neural network that can learn the feature representation of hyperspectral images [31].

CNN
As shown in Figure 1, a typical convolutional neural network is mainly composed of an input layer, convolutional layers, pooling layers, fully connected layers and an output layer. Normally, the input of a convolutional neural network is the original image $X$. In this paper, $H_i$ denotes the feature map of the $i$th layer of the network (with $H_0 = X$), which is calculated by Eq. (2.1):

$$H_i = f(H_{i-1} \otimes W_i + b_i) \tag{2.1}$$

Here $W_i$ is the weight vector of the convolution kernel at the $i$th layer, and the symbol $\otimes$ denotes the convolution of that kernel with the image or feature map of the $(i-1)$th layer. The output of the convolution is added to the bias $b_i$ of the $i$th layer. Finally, the feature map $H_i$ of the $i$th layer is obtained through the nonlinear activation function $f$.
The pooling layer follows the convolutional layer and samples the feature map according to certain rules. There are two main functions of the pooling layer: (1) to reduce the dimension of the feature map.
After alternating convolutional and pooling layers, the convolutional neural network classifies the extracted features with a fully connected network and obtains the probability distribution $Y$ of the input data over the labels ($l_i$ denotes the $i$th label). As shown in Eq. (2.3), a convolutional neural network is a mathematical model that maps the original matrix $H_0$ to a new feature expression $Y$ through multiple layers of data transformation or dimension reduction.
The goal of training a convolutional neural network is to minimize the loss function $L(W, b)$ of the network. After the input $H_0$ passes through forward propagation, the difference between the predicted value and the real value is calculated by the loss function. Typical loss functions include the mean squared error and the negative log-likelihood, as shown in Eq. (2.4) and Eq. (2.5).
To alleviate over-fitting, the final loss function is usually obtained by adding an L2-norm penalty, where λ is the parameter that controls the strength of the regularization, as shown in Eq. (2.6).
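As a rough illustration of how an L2 penalty enters a loss and its gradient, the sketch below fits a single weight by gradient descent on a mean squared error plus (λ/2)·w². The toy data, the function names and the λ/2 scaling are assumptions made for this example only; they are not the settings or the exact formulation used in this paper.

```python
def regularized_loss(w, xs, ys, lam):
    """Mean squared error of the model y = w * x plus an L2 penalty (lam / 2) * w**2."""
    mse = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return mse + 0.5 * lam * w * w


def gradient(w, xs, ys, lam):
    """Analytic derivative of regularized_loss with respect to w."""
    d_mse = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return d_mse + lam * w


def gradient_descent(w, xs, ys, lam, eta, steps):
    """Repeat the update w <- w - eta * dE/dw for a fixed number of steps."""
    for _ in range(steps):
        w -= eta * gradient(w, xs, ys, lam)
    return w


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # toy data generated by y = 2x
w_plain = gradient_descent(0.0, xs, ys, lam=0.0, eta=0.05, steps=200)
w_reg = gradient_descent(0.0, xs, ys, lam=1.0, eta=0.05, steps=200)
```

Without the penalty the weight converges to the value that fits the toy data exactly; with the penalty it settles at a smaller value, which is the shrinking effect that regularization is meant to provide.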
In the training process, gradient descent is the common optimization method for convolutional neural networks. The loss is back-propagated, and the training parameters $(W, b)$ of each layer are updated layer by layer. The learning rate η controls the step size of each update. The update rules for $W$ and $b$ are shown in Eq. (2.7) and Eq. (2.8).
Zhao et al. used a convolutional neural network to extract spatial features, which were combined with spectral information and with local discriminant embedding for hyperspectral image classification [32]. However, after dimensionality reduction this method takes only three principal components of the original hyperspectral image as input, so some information is still lost in the process of spatial feature extraction.

3D-CNN
To solve the above problems of the convolutional neural network model, Chen et al. extracted spectral-spatial features from the original hyperspectral image by using a 3D convolutional neural network, and the results were better than those of the aforementioned method on the same data set [33]. Li et al. further studied the 3D convolutional neural network for spatial-spectral joint features by changing the size of the hyperspectral image input cube [31].
The architecture of a 3D convolutional neural network (3D-CNN) is similar to that of a 2D convolutional neural network (2D-CNN): both are composed of convolutional layers and pooling layers. Unlike the 2D-CNN, the 3D-CNN performs the convolution with a 3D convolution kernel, which is the key difference between the two. The 3D-CNN is shown in Figure 2. The value $v$ at position $(x, y, z)$ of the $j$th feature map in layer $l$ is calculated by Eq. (2.9):

$$v_{lj}^{xyz} = f\left(\sum_{m} \sum_{p=0}^{P_l-1} \sum_{q=0}^{Q_l-1} \sum_{r=0}^{R_l-1} w_{ljm}^{pqr}\, v_{(l-1)m}^{(x+p)(y+q)(z+r)} + b_{lj}\right) \tag{2.9}$$

where $P_l$ and $Q_l$ are the length and width of the three-dimensional convolution kernel, $R_l$ is the size of the kernel in the spectral dimension, $m$ indexes the feature maps in layer $l-1$ that are connected to the current feature map, $w_{ljm}^{pqr}$ is the weight at position $(p, q, r)$ of the kernel connected to the $m$th feature map of layer $l-1$, $v_{(l-1)m}^{(x+p)(y+q)(z+r)}$ is the value of the $m$th feature map of layer $l-1$ at position $(x+p, y+q, z+r)$, $b_{lj}$ is the bias of the $j$th feature map in layer $l$, and $f$ is the activation function.
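To make the index structure of Eq. (2.9) concrete, the following sketch evaluates a single output value of a 3-D convolution with explicit loops over the connected feature maps and the three kernel axes. It is a minimal illustration on nested lists; the function name, the data layout and the choice of tanh as activation are assumptions for this example and not the implementation used in the experiments.

```python
import math


def conv3d_value(prev_maps, kernels, bias, x, y, z):
    """Value at (x, y, z) of one output feature map of a 3-D convolution layer.

    prev_maps[m][i][j][k] : value of the m-th feature map of layer l-1
    kernels[m][p][q][r]   : kernel weights linking map m to the output map
    bias                  : scalar bias of the output map
    """
    total = bias
    for m, kernel in enumerate(kernels):
        for p, plane in enumerate(kernel):
            for q, row in enumerate(plane):
                for r, weight in enumerate(row):
                    total += weight * prev_maps[m][x + p][y + q][z + r]
    return math.tanh(total)  # tanh is an assumed choice of activation


# One 2 x 2 x 2 input map of ones and one 2 x 2 x 2 averaging kernel.
ones = [[[[1.0] * 2 for _ in range(2)] for _ in range(2)]]
kernel = [[[[0.125] * 2 for _ in range(2)] for _ in range(2)]]
value = conv3d_value(ones, kernel, 0.0, 0, 0, 0)
```

The only difference from a 2-D convolution is the additional innermost loop over the spectral offset `r`.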
Compared with the previous methods, Li et al. and Chen et al. provided a more concise idea: the model can directly process the original hyperspectral image to obtain feature maps. However, as the data scale expands, the classification performance of the model decreases when the network is deepened.

Multi-level Convolutional Neural Networks for Scene Understanding
Although CNN and 3D-CNN have gained significant popularity as methods for learning image representations and have helped to improve the performance of many important computer vision tasks, the transfer of the learned knowledge from a known domain to a new domain such as scene parsing has not yet been explored.
To address this problem, Tam V. Nguyen exploited generic multi-level convolutional neural networks for the scene understanding, or image parsing, task [34]. The input of the model is an image. First, a set of similar images is retrieved from the training set based on global-level CNN feature matching. Then, the input test image and the similar images are over-segmented into superpixels. Next, the class of each superpixel of the test image is initialized by the majority vote of its k-nearest-neighbor superpixels, based on matching of regional-level CNN features and hand-crafted features. The initial superpixel parsing is later combined with per-exemplar sliding windows to roughly form the pixel labels. Finally, the labels are refined by contextual smoothing. This is a simple yet effective approach to scene understanding that takes advantage of generic convolutional neural networks for feature extraction at both the image and superpixel levels. Extensive experiments on different challenging datasets demonstrate that multi-level convolutional neural networks can extract discriminative features that improve performance significantly.
Inspired by the ideas of that paper, we propose a dual-channel model that extracts the spectral features and spatial features by using a 1-D CNN and a 3-D CNN, respectively. This method is intended to improve the classification accuracy of hyperspectral images.

Architecture and training of dual-channel CNN
To extract features from hyperspectral images, the information in both the spectral and spatial domains should be learned jointly. In this section, we propose a dual-channel deep convolutional neural network for joint spatial-spectral feature learning. Firstly, the spectral and spatial features are extracted separately. In the spectral channel, a 1-D convolutional neural network extracts the spectral features. In the spatial channel, a 3-D convolutional neural network extracts the spatial features. Then, the spectral-spatial features are obtained by using fully connected layers. Finally, the spectral-spatial features are input into a classifier to produce the classification results.

Spatial feature extraction with 3-D CNN
In this section, the HSI spatial feature extraction model based on a 3-D convolutional neural network is proposed. The model consists of one input layer, two convolutional layers, two pooling layers, two fully connected layers and one output layer, and it automatically extracts the spatial features of hyperspectral images. The model is shown in Figure 3. Let the hyperspectral image be $H \in \mathbb{R}^{h \times w \times d}$, where $h$ and $w$ are the height and width of the image and $d$ is the number of bands. The category of each pixel belongs to $K = \{1, 2, 3, \dots, k\}$, where $k$ is the number of categories. A patch of size $p \times p$ centered on each pixel of $H$ is extracted as a sample. The data set is $X = \{x_1, x_2, x_3, \dots, x_n\}$ with $x_i \in \mathbb{R}^{p \times p \times d}$, where $n = h \times w$ is the total number of samples. During extraction, positions that fall outside the image boundary are filled with 0. Each sample $x_i$ is input into the convolutional neural network, the output is $z_i$, and the corresponding category is $y_i \in K$, that is, the category of the pixel on which $x_i$ is centered. The pair $(x_i, y_i)$ thus represents the sample of size $p \times p$ centered on a pixel together with its label, and the network predicts the category of each input sample. Instead of converting the input data into one-dimensional data, the model feeds the original three-dimensional data directly into the convolutional neural network. The size of the input layer data is $p \times p \times d$.
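The sample construction described above (a p × p × d neighborhood around every pixel, with zeros outside the image boundary) can be sketched as follows. This is a minimal pure-Python illustration on nested lists; the function name `extract_patch`, the data layout and the toy image are assumptions for this example, and an odd window size p is assumed so that the patch has a well-defined center.

```python
def extract_patch(image, row, col, p):
    """Return the p x p x d patch centered on (row, col), zero-padded at the borders.

    image is a nested list indexed as image[row][col][band]; p is assumed odd.
    """
    h, w, d = len(image), len(image[0]), len(image[0][0])
    half = p // 2
    patch = []
    for r in range(row - half, row + half + 1):
        patch_row = []
        for c in range(col - half, col + half + 1):
            if 0 <= r < h and 0 <= c < w:
                patch_row.append(list(image[r][c]))
            else:
                patch_row.append([0.0] * d)  # outside the image: fill with zeros
        patch.append(patch_row)
    return patch


# Toy 3 x 3 image with 2 bands; the first band encodes the pixel position.
image = [[[float(10 * r + c), 1.0] for c in range(3)] for r in range(3)]
corner = extract_patch(image, 0, 0, 3)
```

Because every pixel, including those on the border, yields one patch, the number of samples equals h × w, in line with n = h × w above.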
Firstly, the sample $x_i$ of size $p \times p \times d$ is input into the first convolutional layer, whose kernel size is $5 \times 5 \times d$ and which has 100 kernels. This layer produces 100 feature maps of size $n_1 \times n_1$ ($n_1 = p - 4$). The second layer is a max pooling layer with a pooling kernel of size $2 \times 2$; after max pooling, the output feature map has size $n_2 \times n_2 \times 100$ ($n_2 = n_1/2$).
Secondly, the feature map is input to the third layer, a convolutional layer whose kernel size is $3 \times 3 \times d$ and which has 300 kernels. This layer produces 300 feature maps of size $n_3 \times n_3$ ($n_3 = n_2 - 2$). The fourth layer is a max pooling layer with a pooling kernel of size $2 \times 2$; after max pooling, the output feature map has size $n_4 \times n_4 \times 300$ ($n_4 = n_3/2$).
Finally, the output feature map of the fourth (max pooling) layer is converted to a one-dimensional vector $x_{pool2}$ of size $1 \times (n_4 \times n_4 \times 300)$. The fifth, sixth and seventh layers are fully connected layers, and the output of the seventh layer is a one-dimensional vector of size $1 \times K$. The fully connected operations of the fifth, sixth and seventh layers are:

$$f^{(5)} = \sigma\left(W^{(5)} x_{pool2} + b^{(5)}\right)$$
$$f^{(6)} = \sigma\left(W^{(6)} f^{(5)} + b^{(6)}\right)$$
$$f^{(7)} = \sigma\left(W^{(7)} f^{(6)} + b^{(7)}\right)$$

where $W^{(5)}$, $W^{(6)}$ and $W^{(7)}$ are the weights, $b^{(5)}$, $b^{(6)}$ and $b^{(7)}$ are the biases, and $\sigma(\cdot)$ is the nonlinear activation function. In this paper, the activation function used in the two convolutional layers and the three fully connected layers is Tanh. To simplify the notation, let $W = \{W^{(i)}\}$ and $b = \{b^{(i)}\}$ collect the weights and biases of all layers up to the seventh, where $W^{(i)}$ is the weight and $b^{(i)}$ is the bias of the $i$th layer. The parameters of the model can then be written as $(W, b)$.
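The layer sizes of the spatial channel follow directly from the window size p, and the arithmetic can be checked with a few lines of code. The sketch below assumes convolutions without padding and integer (floor) division in the 2 × 2 pooling steps, which the text does not state explicitly; the window size p = 21 is a hypothetical value chosen only for illustration.

```python
def spatial_channel_sizes(p):
    """Trace the spatial size of the feature maps through the 3-D CNN channel.

    Assumes unpadded convolutions and 2 x 2 pooling with floor division.
    """
    n1 = p - 4    # 5 x 5 convolution, 100 kernels
    n2 = n1 // 2  # 2 x 2 max pooling
    n3 = n2 - 2   # 3 x 3 convolution, 300 kernels
    n4 = n3 // 2  # 2 x 2 max pooling
    if n4 < 1:
        raise ValueError("window size p is too small for this architecture")
    flat = n4 * n4 * 300  # length of the flattened vector x_pool2
    return {"n1": n1, "n2": n2, "n3": n3, "n4": n4, "flat": flat}


sizes = spatial_channel_sizes(21)
```

For p = 21 this gives feature maps of side 17, 8, 6 and 3, so the flattened vector passed to the fifth layer has 3 × 3 × 300 = 2700 entries. Windows that are too small leave nothing after the second pooling step, which the function reports as an error.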
The output of the model, $f^{(7)} \in \mathbb{R}^{K}$, is input to the Softmax classifier for the classification of hyperspectral images based on spatial features:

$$y_{ik}^{(W,b)} = \frac{e^{f_k^{(7)}(x_i)}}{\sum_{j \in K} e^{f_j^{(7)}(x_i)}}$$

This gives the predicted value for the category of the sample of size $p \times p \times d$ centered on $x_i$. Then, the label $y_i$ and the predicted value $y_{ik}^{(W,b)}$ of the sample points are taken as inputs, and the cross entropy is calculated by using Eq. (3.4). The parameters $W$ and $b$ are optimized by stochastic gradient descent; after the $l$th iteration, $W$ and $b$ are updated as shown in Eq. (3.5) and Eq. (3.6). The back propagation algorithm is used to calculate the gradients of $W$ and $b$, and η is the learning rate.
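The Softmax and cross-entropy steps can be illustrated with the short sketch below. It uses the standard definitions of both functions; the example scores are hypothetical, and subtracting the maximum score before exponentiation is a common numerical safeguard added here, not something described in this paper.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class index `label`."""
    return -math.log(probs[label])


scores = [2.0, 1.0, 0.1]  # hypothetical outputs of the last layer
probs = softmax(scores)
loss_correct = cross_entropy(probs, 0)
loss_wrong = cross_entropy(probs, 2)
```

The loss is small when the true class receives a high probability and grows as that probability shrinks, which is what gradient descent exploits when it updates the parameters.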
The HSI spatial feature extraction model based on a 3-D convolutional neural network proposed in this paper differs from traditional convolutional neural network classification models. Traditional convolutional neural networks are mostly based on the fine-tuning technique: the network is first trained with some prepared samples, and then its parameters are fine-tuned. In contrast, the model proposed in this paper does not require pre-training on prepared samples. The parameter $W$ can be initialized from a standard Gaussian distribution, and $b$ can be initialized to 0. To prevent overfitting, dropout is applied after the fifth and sixth fully connected layers. Table 1 shows the parameters of all layers in Figure 3.
Table 1. The parameters of all layers in the spatial feature extraction model.

Spectral feature extraction with 1-D CNN
In this section, the HSI spectral feature extraction model based on a 1-D convolutional neural network is proposed. The 1-D convolutional neural network is used to extract the spectral features of hyperspectral images. Replacing the traditional 2-D convolution kernel with a 1-D convolution kernel makes it possible to extract the spectral features of hyperspectral images effectively. The model consists of one input layer, three convolutional layers, three pooling layers, two fully connected layers and one output layer, and it automatically extracts the spectral features of hyperspectral images. The model is shown in Figure 4. Table 2 shows the parameters of all layers in Figure 4. First, the data of the $3 \times 3$ neighborhood window of a pixel in the original hyperspectral image are collected. Let $L$ be the number of bands of the hyperspectral image. The $3 \times 3 \times L$ data are converted to nine 1-D vectors of size $L \times 1$. The value at position $x$ of the $j$th feature vector in the $l$th layer is given by:

$$v_{l,j}^{x} = f\left(\sum_{m} \sum_{h=0}^{H_l-1} k_{l,j,m}^{h}\, v_{(l-1),m}^{x+h} + b_{l,j}\right)$$

where $l$ is the index of the layer, $j$ is the index of the feature vector, $b_{l,j}$ is the bias of the $j$th feature vector in the $l$th layer, $f(\cdot)$ is the activation function, $m$ indexes the feature vectors of the $(l-1)$th layer that are connected to the current layer, $k_{l,j,m}^{h}$ is the $h$th value of the convolution kernel connected to the $m$th feature vector in the $(l-1)$th layer, and $H_l$ is the length of the convolution kernel. In practical applications, different types of activation functions can be chosen, such as Sigmoid, ReLU and Tanh. The effect of each activation function is analyzed through experiments to determine which is the most appropriate.
The pooling layer is usually located after the convolutional layer, and the pooling operation effectively reduces the dimension of the feature vector. The most commonly used pooling method, max pooling, is adopted in this paper. Note that the input data is a one-dimensional vector, so the convolution kernel and the pooling kernel are both one-dimensional.
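A 1-D convolution followed by 1-D max pooling, as used in the spectral channel, can be sketched as below. This is a minimal single-kernel illustration on a toy spectrum; the kernel values, the choice of ReLU and the non-overlapping pooling window are assumptions for this example rather than the configuration listed in Table 2.

```python
def conv1d(signal, kernel, bias=0.0):
    """Unpadded 1-D convolution (cross-correlation form, as in CNN layers)."""
    width = len(kernel)
    return [
        sum(kernel[h] * signal[x + h] for h in range(width)) + bias
        for x in range(len(signal) - width + 1)
    ]


def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing remainder is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


spectrum = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 1.0]  # toy spectral vector, 7 bands
features = relu(conv1d(spectrum, [1.0, 0.0, -1.0]))
pooled = max_pool1d(features, 2)
```

A kernel of length 3 shortens the 7-band vector to 5 values, and pooling with window 2 reduces it further to 2 values, which shows how each convolution and pooling stage lowers the dimension of the feature vector.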

Spatial-spectral feature extraction with dual-channel CNN
To extract the spectral and spatial features of the original hyperspectral image simultaneously, this section proposes an HSI spatial-spectral feature extraction model based on a dual-channel convolutional neural network. The model consists of two channels: the first is the spectral feature extraction channel, and the second is the spatial feature extraction channel. The architecture of the model is shown in Figure 5. In the spectral feature extraction channel, $S_n$ denotes the input data corresponding to the $n$th pixel. After a series of convolution and pooling operations, the output $F_1(S_n)$ of the spectral feature extraction channel is obtained. This output is the spectral feature extracted from the original input data. In the spectral channel, the input data is a 1-D vector, so the convolution and pooling operations are both 1-D operations.
In the spatial feature extraction channel, $P_n$ denotes the $p \times p$ neighborhood window data of the $n$th pixel, which is the input of this channel. After a series of convolution and pooling operations, the output $F_2(P_n)$ of the spatial feature extraction channel is obtained. This output is the spatial feature extracted from the original input data. In the spatial channel, the input data is a 3-D array, so the convolution and pooling operations are both 3-D operations.
After the spectral feature $F_1(S_n)$ and the spatial feature $F_2(P_n)$ have been calculated, they are fused so that both kinds of features are used jointly, as shown in Eq. (3.8), where • denotes the concatenation operation. This operation corresponds to the method keras.layers.concatenate, which was used in the experiments. The input of this method is a list of tensors to be concatenated, and the return value is a single tensor formed by concatenating all the input tensors.
The concatenated data are fed to the fully connected layer, which performs the operation shown in Eq. (3.9).
$W$ represents the weight vector of the fully connected layer, and $b$ represents its bias. The output $F(n)$ is calculated from both the spectral and spatial features, so $F(n)$ can be regarded as the spatial-spectral feature of the $n$th pixel.
Finally, $F(n)$ is input into the Softmax classifier, and the probability distribution $Y(n)$ of the $n$th pixel is calculated, as shown in Eq. (3.10). $C$ is the number of categories of the data to be classified. The category with the maximum value in $Y(n)$ is taken as the category of the pixel.
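The fusion and classification steps described above (concatenation of the two channel outputs, a fully connected layer, Softmax, and selection of the most probable category) can be sketched in plain Python as follows. Ordinary list concatenation stands in for keras.layers.concatenate, and the feature values, weights, tanh activation and two-class setting are hypothetical choices made only to keep the example small; in particular, the Softmax here is applied directly to the fused feature without additional classifier weights.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer with tanh activation; one row of weights per output unit."""
    return [
        math.tanh(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(spectral_feat, spatial_feat, weights, biases):
    """Concatenate both channel outputs, apply one dense layer, then Softmax."""
    fused_input = spectral_feat + spatial_feat  # list concatenation
    fused = dense(fused_input, weights, biases)
    probs = softmax(fused)
    return probs.index(max(probs)), probs


# Hypothetical channel outputs and hand-picked weights for two classes.
f1 = [0.2, -0.1]      # output of the spectral channel
f2 = [0.7, 0.4, 0.1]  # output of the spatial channel
weights = [[1.0, 0.0, 1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 1.0, 1.0]]
label, probs = classify(f1, f2, weights, [0.0, 0.0])
```

Since the dense layer sees the spectral and spatial entries side by side, each output unit can weight evidence from both channels, which is the purpose of the fusion step.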
It is worth noting that hyperspectral images are inevitably affected by local spatial deformation, shadow, illumination and blur, which greatly affect the classification accuracy. Because of its deep hierarchical architecture, the HSI spatial-spectral feature extraction model based on a dual-channel convolutional neural network proposed in this section can effectively reduce the influence of these factors on the classification accuracy.

The training and optimizing process
The training and optimizing process can be divided into two parts, as shown in Figure 6. In the training part, the spectral data and spatial data are input into the dual-channel network. After a series of convolution and pooling operations, the data are input to the fully connected layer. The purpose of the fully connected layer is to map the distributed feature representation to the sample label space, so that the mapped features can be classified by the Softmax classifier. The predicted value and the label are used to calculate the loss, and gradient descent with back propagation is used to adjust the network parameters. During training, the loss is minimized until the network converges. In the verification part, the trained network model is cross-validated: part of the sample data is randomly selected and provided to the network model for identification, and the overall accuracy of the network is calculated by analyzing the performance of the model.
Through the training and optimizing process, all parameters of the HSI spatial-spectral feature extraction model based on a dual-channel convolutional neural network proposed in this paper are learned. The loss function is shown in Eq. (3.11).
$N$ is the number of training samples, $c^{(n)}$ is the real category of the $n$th training sample, and $p_k^{(n)}$ is the $k$th component of the predicted distribution $p^{(n)}$, that is, the probability of assigning the $n$th sample to the $k$th category. θ represents the convolution kernels and the biases. $1\{\cdot\}$ is the indicator function, whose value is 1 when the condition in braces is satisfied and 0 otherwise.
The stochastic gradient descent algorithm is used to optimize the parameter θ. The parameter θ is initialized from a random Gaussian distribution with a mean of 0 and a standard deviation of 0.05. The bias is initialized to 0. The initial learning rate is 0.0001. The number of iterations is set to $5 \times 10^4$.
To obtain the model with the best classification accuracy, we divided the experimental dataset into two groups: a training set and a testing set. K-fold cross-validation is adopted in the process of training and testing. As shown in Figure 7, the initial sample is divided into K subsamples; one subsample is retained as the testing set, and the other K − 1 subsamples are used as the training set. The cross-validation is repeated K times, so that each subsample is used for testing exactly once. The final result is the average of the K results, as shown in Eq. (3.12). The advantage of this method is that the randomly generated subsamples are used repeatedly for training and testing and each result is verified once, which is very useful for experiments based on a single dataset. In our experiments, K = 10.
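The K-fold procedure described above can be sketched as a small index generator plus an averaging helper. The function names and the `evaluate` callback are hypothetical, the folds are taken in order (the data would need to be shuffled beforehand to obtain randomly generated subsamples), and the sample count of 25 is used only to show how uneven fold sizes are handled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds.

    Samples are assigned to folds in order; shuffle the data beforehand if
    a random split is wanted.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / k


folds = list(k_fold_indices(25, 10))
```

In a real experiment, `evaluate` would train the network on the training indices and return the accuracy measured on the test indices; every sample is then used for testing exactly once.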

Experimental results and discussions
In this section, the HSI spatial-spectral feature extraction model based on a dual-channel convolutional neural network is analyzed experimentally. The hardware and software environment used in the experiments is shown in Table 3.

Description of experimental data sets
The Indian Pines dataset was collected on 12 June 1992 at the Purdue University farm in northwest Indiana, USA, with the AVIRIS (Airborne Visible Infrared Imaging Spectrometer) sensor. Table 4 describes the relevant parameters of the dataset; the image resolution is 145 × 145. Figure 8 is the gray-scale image corresponding to the hyperspectral image, composed of band 10. The available ground truth contains 16 classes, and the number of samples is distributed unevenly across the classes. Table 5 summarizes the categories and the number of samples in each. Figure 9 shows the sample distribution of each category.

Experimental setup for classification of labeled pixels
The framework for all data sets was established as follows. Each data set was randomly divided into two groups: a training set and a testing set. The training set was used to optimize the model parameters, and the testing set was used to evaluate the model after training was completed. The batch size was set to 16 and the Adam [29] optimizer was used for stochastic optimization. We used the Xavier normal initialization method [27], also known as Glorot normal initialization, for the fully-connected layer. We used a variable learning rate that was gradually reduced during optimization, because smaller steps are needed as the optimization approaches a minimum of the loss. The number of training epochs was set to 50000 and the initial learning rate was set to 0.0001. The learning rate was halved when the loss did not decrease for 10 epochs.
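The learning-rate rule described above (start at 0.0001 and halve the rate when the loss has not decreased for 10 epochs) can be illustrated with a small framework-independent sketch. The class name `HalveOnPlateau` and the exact counting of stalled epochs are assumptions made for illustration; the original implementation may differ in detail.

```python
class HalveOnPlateau:
    """Halve the learning rate after `patience` epochs without improvement."""

    def __init__(self, initial_lr=1e-4, patience=10):
        self.lr = initial_lr
        self.patience = patience
        self.best_loss = float("inf")
        self.stale_epochs = 0  # epochs since the loss last decreased

    def step(self, loss):
        """Record the loss of one epoch and return the learning rate to use next."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= 0.5          # halve the learning rate
                self.stale_epochs = 0   # restart the patience counter
        return self.lr
```

Calling `step` once per epoch with the current training loss keeps the rate unchanged while the loss improves and halves it after each run of 10 non-improving epochs.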
The overall accuracy (OA), the average accuracy (AA), and the kappa coefficient (K) are adopted to quantitatively evaluate the classification results.
Overall accuracy: the probability that, for a random sample, the classified result is consistent with the category in the test data. The overall accuracy is equal to the number of correctly classified pixels divided by the total number of pixels. The calculation method is shown in Eq. (4.1).
Average accuracy: the average of the classification accuracies of the individual categories. The calculation method is shown in Eq. (4.2).
Kappa coefficient: another measure of classification accuracy. The Kappa coefficient lies between -1 and 1, but usually falls between 0 and 1. Kappa = 1 indicates complete agreement between the two judgments, Kappa ≥ 0.75 indicates satisfactory agreement, and Kappa < 0.4 indicates poor agreement. It is a suitable index for describing the consistency of two judgments, so it has been widely used in practical engineering. The calculation method is shown in Eq. (4.3).
In addition to these basic settings, four key factors were considered when configuring the HSI spatial-spectral feature extraction model based on dual-channel convolutional neural network, namely the effect on the classification results of (1) the convolutional kernel size; (2) the spatial neighborhood window size; (3) the activation function; and (4) the output feature vector dimension. These four factors are discussed below in terms of the OA on the Indian Pines (IP) dataset.
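For reference, the three evaluation metrics (OA, AA and the Kappa coefficient) can be computed from a confusion matrix as in the sketch below. It uses the standard textbook definitions, which we assume correspond to Eqs. (4.1)-(4.3); the function name `classification_metrics` and the convention that rows are true classes and columns are predicted classes are assumptions made for illustration.

```python
import numpy as np

def classification_metrics(confusion):
    """Return (OA, AA, kappa) from a square confusion matrix.

    Assumes rows are true classes, columns are predicted classes,
    and every class has at least one true sample.
    """
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()

    # OA: correctly classified samples divided by all samples.
    oa = np.trace(confusion) / total

    # AA: mean of the per-class accuracies.
    per_class = np.diag(confusion) / confusion.sum(axis=1)
    aa = per_class.mean()

    # Kappa: agreement corrected for the agreement expected by chance.
    expected = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total**2
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa

# Toy two-class example (not data from the paper).
oa, aa, kappa = classification_metrics([[8, 2], [1, 9]])
```

In the toy example, 17 of 20 samples are classified correctly, so OA is 0.85; the per-class accuracies are 0.8 and 0.9, so AA is also 0.85; and the chance agreement is 0.5, giving a Kappa of 0.7.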
First, the convolutional kernel size affects the OA of the classification results. During the experiment, the convolutional kernel size of the first convolutional layer is fixed, and the convolutional kernel size of the second convolutional layer is varied to evaluate its effect on the classification results. Table 6 shows the experimental results. It can be seen that increasing the convolutional kernel size can improve the classification accuracy under certain conditions. However, the classification accuracy does not increase linearly with the convolutional kernel size. On this dataset, as the convolutional kernel size increases, the classification accuracy first rises and then falls. The experimental results show that the classification accuracy is highest when the convolutional kernel size is 3 × 3. Therefore, the convolutional kernel size of the second convolutional layer is set to 3 × 3. It can also be seen from Table 6 that, as the convolutional kernel size increases, the computational complexity of the model increases and the classification time gradually increases. Figure 10 shows the curve of classification accuracy during the training process. It can be seen from Figure 10 that, when the number of iterations is less than 15,000, the classification accuracy increases rapidly with the number of iterations; when the number of iterations exceeds 15,000, the classification accuracy increases very slowly and gradually converges.
Second, the spatial neighborhood window size affects the classification results. This experiment analyzes the effect of the spatial neighborhood window size on the classification results. Five neighborhood window sizes, 7 × 7 pixels, 9 × 9 pixels, 11 × 11 pixels, 13 × 13 pixels and 15 × 15 pixels, were selected to analyze the classification results. Figure 11 compares the classification results.
It can be seen from Figure 11 that the overall classification accuracy does not increase monotonically with the spatial neighborhood window size: it first increases and then decreases, reaching its best value at 11 × 11 pixels. The reason is as follows. When the spatial neighborhood window is small, it contains few spatial features that reflect the relationship between adjacent pixels and cannot describe the spatial relationship between pixels very well, so the overall accuracy is low. As the spatial neighborhood window size increases, it contains more and more spatial features that reflect the relationship between adjacent pixels, but it also brings in a lot of redundant information or noise, which degrades the classification accuracy. Therefore, once the spatial neighborhood window reaches a certain size, the overall classification accuracy declines as the window size continues to increase.
Figure 11. The effect of spatial neighborhood window size on classification results.
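To make the notion of a spatial neighborhood window concrete, the following sketch extracts a window × window patch centered on a given pixel of a hyperspectral cube. The helper names `pad_cube` and `extract_patch`, the (rows, columns, bands) layout and the use of reflection padding at the image border are assumptions for illustration; the text does not state how border pixels were handled.

```python
import numpy as np

def pad_cube(cube, window):
    """Pad the two spatial axes so that every pixel has a full neighborhood."""
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    # Reflection padding is an assumed choice for border pixels.
    return np.pad(cube, ((half, half), (half, half), (0, 0)), mode="reflect")

def extract_patch(padded, row, col, window):
    """Return the window x window x bands patch centered on pixel (row, col).

    (row, col) are coordinates in the original, unpadded image.
    """
    return padded[row:row + window, col:col + window, :]

# Hypothetical cube: 145 x 145 pixels; the band count here is illustrative only.
cube = np.zeros((145, 145, 200), dtype=np.float32)
padded = pad_cube(cube, window=11)
patch = extract_patch(padded, row=0, col=0, window=11)  # shape (11, 11, 200)
```

A larger window increases the spatial context of each sample, but the patch size, and hence the input to the spatial channel, grows quadratically with the window size.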
Third, the activation function affects the classification results. During the experiment, all other parameters of the HSI spatial-spectral feature extraction model based on dual-channel convolutional neural network were fixed, and the activation function was set to ReLU, Sigmoid and Tanh in turn. Figure 12 shows the classification accuracy curves corresponding to the three activation functions. Figure 13 shows the corresponding loss function curves. It can be seen from Figure 12 and Figure 13 that, as the number of iterations increases, the classification accuracy corresponding to all three activation functions gradually increases. However, the classification accuracy corresponding to the Sigmoid function is significantly lower than that of ReLU and Tanh. The classification accuracies of Tanh and ReLU are basically the same, but Tanh converges faster. Therefore, Tanh was selected as the activation function of the HSI spatial-spectral feature extraction model based on dual-channel convolutional neural network.
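For reference, the three activation functions compared above have the following standard definitions, written here as a short NumPy sketch. The function names are illustrative; a deep learning framework would normally supply its own implementations.

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x); output range [0, +inf)."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Sigmoid: 1 / (1 + exp(-x)); output range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Tanh: hyperbolic tangent; output range (-1, 1), zero-centered."""
    return np.tanh(x)

x = np.array([-1.0, 0.0, 2.0])
activations = {"ReLU": relu(x), "Sigmoid": sigmoid(x), "Tanh": tanh(x)}
```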
Fourth, in convolutional neural networks, the dimension of the output feature vector of the last layer has a great impact on the classification accuracy. Therefore, this experiment tests the relationship between the dimension of the output feature vector and the classification results of the HSI spatial-spectral feature extraction model based on dual-channel convolutional neural network. The dimension of the output feature vector of the model was varied from 50 to 150. Then, 50, 100 and 200 samples were randomly selected from the data set as training samples, and the number of test samples was 300. The model was trained and the final classification accuracy was recorded. The classification results are shown in Figure 14. During the experiment, the learning rate was set to 0.0001 and the batch size was set to 16. We found that the influence of the other parameters of the model on the experimental results was negligible, so we did not conduct further experimental analysis on these parameters.

Classification results and discussion
The joint representation learning of information from the two channels is one of the main contributions of this paper. In this section, we conduct an experiment to compare the performance of the proposed dual-channel method with that of the single-channel sub-models. In order to verify the classification performance of the proposed dual-channel model, the classification accuracies of the convolutional neural network classification model based on spectral feature extraction, the convolutional neural network classification model based on spatial feature extraction, and the dual-channel convolutional neural network classification model based on spatial-spectral feature extraction were compared and analyzed through experiments. Table 7 lists the classification accuracy of the three models for each category, together with the OA. Figure 15 is the histogram comparing the classification accuracy of the three models for each category.
Figure 15. The histogram comparing the classification accuracy of the three models for each category.
As can be seen from Table 7 and Figure 15, the classification accuracy of the proposed dual-channel convolutional neural network classification model based on spatial-spectral feature extraction is higher than that of the other two classification models: its OA reaches 90.12%, whereas the OA of the other two models does not exceed 90.00%. Category 2 has the worst classification result, while categories 7, 9 and 16 have the best. For categories 7, 9 and 16, the classification accuracy of all three classification models reaches 100%. Over all 16 categories, the three classification models, ranked from low to high in terms of OA, are: the convolutional neural network classification model based on spectral feature extraction, the convolutional neural network classification model based on spatial feature extraction, and the dual-channel convolutional neural network classification model based on spatial-spectral feature extraction. The experimental results show that the proposed method can effectively improve the classification accuracy of hyperspectral images.
The classification results are then compared with those of several available feature extraction methods, namely SVM [35], MLRsub [35], SVM-GC [35] and MLRsubMLL [35]. They are also compared with two deep learning based methods for spectral-spatial feature extraction: a stacked AE based method (J-SAE) [36] and a 3-D CNN based method (3-D-CNN) [26]. We report the results of these feature extraction methods on the experimental datasets. The parameters of these comparison methods are set as provided in the corresponding references.
Firstly, Table 8 presents the classification results of the different methods.
Secondly, Figure 16 is the visualization of the hyperspectral image with different categories and Table 9 is the confusion matrix for the different categories. It can be seen from Figure 16 and Table 9 that the accuracy of the HSI spatial-spectral feature extraction model based on dual-channel convolutional neural network is more evenly distributed across categories. For categories 1, 4, 5, 7, 9, 10, 11, 13, 14, 15 and 16, the classification accuracy is 100%. Category 12 has the lowest classification accuracy, 38.79%, because categories 11, 12 and 13 are all different kinds of soybeans, which have similar spectral characteristics.
Figure 16. The visualization of the hyperspectral images with different categories.
Thirdly, regarding the computational time, the times required by the different methods are shown in Figure 17. Clearly, J-SAE required the longest time for training. SVM and SVM-GC required a large number of parameters to achieve their best performance, yet their accuracy was still not the best. Although MLRsub and MLRsubMLL required the shortest time, their overall accuracies were the worst. The 3-D-CNN requires about the same time as the dual-channel model, but its overall accuracy is lower; the proposed dual-channel method achieved the best overall accuracy.
Finally, the classification accuracy of SVM, SVM-GC and MLRsubMLL for category 9 is 0%. The main reason is that this category has only a few training samples (20), so it is not possible to construct a good classification model, resulting in a low classification accuracy. However, the HSI spatial-spectral feature extraction model based on dual-channel convolutional neural network proposed in this paper achieves 100% for this category.
In terms of classification accuracy, the proposed method clearly outperforms the other algorithms, which indicates that the dual-channel CNN can effectively extract the spectral and spatial features of the original samples and can alleviate the problem of insufficient training samples. We therefore conclude that the proposed method achieves better classification accuracy than the other feature extraction methods.

Conclusion
In this paper, we have proposed a novel dual-channel CNN model. It contains two CNN channels, one learning features from the spectral domain and the other from the spatial domain, from which a joint spatial-spectral feature is obtained for classification. The model has several distinct advantages. Firstly, the model consists of a spectral feature extraction channel and a spatial feature extraction channel; a 1-D CNN and a 3-D CNN are used to extract the spectral and spatial features, respectively. Secondly, the dual-channel CNN fuses the spatial and spectral features, and the fused feature is input into the classifier, which effectively improves the classification accuracy. Finally, because it considers both the spectral and the spatial features, the model can effectively alleviate the problem of insufficient training samples. The proposed method was compared with other well-known classification methods. The experimental results on well-known data sets show that the proposed method has better performance in terms of overall classification accuracy, average classification accuracy and Kappa coefficient.
There is still plenty of room to improve the proposed method, for example through more effective strategies for multi-scale feature fusion and more robust classification in boundary regions. In addition, parallel and distributed fusion strategies, such as [37], could help improve computational efficiency in practice.