Multiple Kernel-Based SVM Classification of Hyperspectral Images by Combining Spectral, Spatial, and Semantic Information

In this study, we present a hyperspectral image classification method that combines spectral, spatial, and semantic information. The main steps of the proposed method are as follows. First, a principal component analysis transform is conducted on the original image to produce its extended morphological profile, Gabor features, and superpixel-based segmentation map. To model spatial information, the extended morphological profile and Gabor features are used to represent structure and texture features, respectively. Moreover, mean filtering is performed within each superpixel to maintain the homogeneity of the spatial features. Then, k-means clustering and entropy rate superpixel segmentation are combined to produce semantic feature vectors by using a bag of visual-words model for each superpixel. Next, three kernel functions are constructed to describe the spectral, spatial, and semantic information, respectively. Finally, the composite kernel technique is used to fuse all the features into a multiple kernel function that is fed into a support vector machine classifier to produce a final classification map. Experiments demonstrate that the proposed method is superior to the most popular kernel-based classification methods in terms of both visual inspection and quantitative analysis, even if only very limited training samples are available.


Introduction
Hyperspectral images (HSIs) have been widely used for various applications, such as precision agriculture [1], anomalous target detection [2], and environmental monitoring [3]. It is necessary to accurately discriminate between different objects in HSIs. The difficulties of HSI classification are caused by the following problems. First, only limited ground truth data are available. Second, information redundancy and the Hughes phenomenon [4] mean that an increase in the data dimension leads to a decrease in classification accuracy when the number of training samples is fixed, which hinders the improvement of classification accuracies. Finally, HSIs are often contaminated by different types of noise and are dominated by mixed pixels. To overcome these problems, many pixel-wise classification methods have been proposed, such as the support vector machine (SVM) [5,6] and multinomial logistic regression [7][8][9]. However, such methods only consider the spectral characteristics of each pixel and ignore the spatial relationship between pixels, which can easily cause classification maps to contain a lot of "salt-and-pepper" noise.
To further improve the classification accuracy of HSIs, various spectral-spatial methods have been presented [10][11][12][13][14][15][16][17][18][19][20][21][22][23]. Unlike the pixel-wise methods, the spectral-spatial methods perform image classification by employing both the spectral and spatial information in the image. It has been confirmed that the inclusion of spatial information can greatly improve the classification accuracy of HSIs [24][25][26]. Typically, these methods can be divided into three branches. The first branch is based on spatial information extraction. Specifically, the spatial and spectral information is combined either by using a single kernel with stacked vectors or by using multiple kernels. The spatial feature extraction can be conducted using mean filtering [27], area filtering [13], Gabor filtering [28], a gray-level co-occurrence matrix [29], edge-preserving filtering (EPF) [17], and extended morphological profiles (EMPs) [30]. The main disadvantage is that these methods cannot effectively avoid the "salt-and-pepper" noise effect caused by misclassified pixels. In the second branch, Markov random field (MRF) models are used to combine the spectral and spatial information because they are intuitive and effective. By applying the maximum a posteriori (MAP) decision rule, HSI classification can be converted into the problem of minimizing a MAP-MRF energy function. The most typical representative of this branch is the combination of SVM and MRF [16][31][32][33][34][35][36]. The main problem with this branch is the assumption that neighboring pixels should have the same class labels, which leads to over-smoothed classification maps, meaning that the object boundaries between different classes cannot be determined effectively. To solve this problem, the third branch of object-based classification methods has been developed [37].
The main idea of these methods is to partition the image into different regions to effectively preserve the object boundaries and overcome the misclassification effect. Then, a final classification map is obtained by combining its pixel-wise classification map and its segmentation map by employing a majority voting algorithm. The segmentation map can be obtained by using the most popular unsupervised algorithms, such as mean-shift [38], watershed [39], hierarchical segmentation [23,40], minimum spanning forest [41], and graph cut [42]. The main challenge is to obtain accurate object-based segmentation results of HSIs.
To obtain accurate classification results, kernel-based methods have been widely used to fuse different features because they can overcome several problems of HSI classification, such as the curse of dimensionality, limited ground truth data, and noise contamination. As a result, these methods have become conventional classification techniques. The most representative method is the SVM classifier employing a linear, polynomial, sigmoid, or Gaussian radial basis function (RBF) kernel, which is a function of the spectral characteristics only. Although single-kernel SVMs can be improved by using stacked vectors to include both spectral and spatial information, classifying HSIs with the improved SVMs requires more computational time. To deal with this problem, Camps-Valls et al. [27] formulated a general classification framework for composite kernels. Specifically, two kernel functions are presented to integrate the spectral and spatial information by direct summation and by weighted summation. Afterwards, Li et al. [43] formulated a generalized composite kernel (GCK) framework for HSI classification, and modeled spatial information using the EMPs. Later, Wang et al. [44] presented an HSI classification method in which spectral, spatial, and hierarchical structure information was combined into an SVM classifier by using composite kernels. Based on this work, they presented an improved version that uses multiscale information fusion [45].
To produce more accurate classification results, the two techniques of multiple kernels and superpixels have been combined for HSI classification in several previous studies. Fauvel et al. [13] proposed a novel SVM characterized by a customized spectral-spatial kernel, where the spatial features are represented by the median value of each superpixel defined using morphological area filtering. Fang et al. [46] extended this work by taking into account the spatial features among different superpixels using weighted average filtering, and constructed three kernels to represent the spectral features, the spatial features within the superpixels, and the spatial features among the superpixels, respectively. More recently, we presented a spectral-texture SVM method for HSI classification, in which the texture features are extracted within each superpixel by using a local spectral histogram [47]. The integrated methods mentioned above use superpixels only to fine-tune the extracted spectral or spatial features, and to reduce the heterogeneity between features by using mean filtering or the average features of the pixels within each superpixel. Therefore, using superpixels as a simple spatial boundary constraint does not take full advantage of the superpixel attributes. Although the noise effect in the classification maps produced by these methods can be effectively reduced, classification accuracy cannot be further improved.
The purpose of this study is to solve the above two problems and retain the integration framework. To obtain more accurate classification results, some new discriminative features should be intuitively obtained as a supplement to the spectral and spatial information, because they not only consider the characteristics of the pixels inside the superpixel, but also express the relationship between adjacent superpixels. Therefore, high classification accuracy can be achieved compared with the aforementioned methods using simply superpixels to improve the quality of spatial information, which means that they reduce interferences by using mean filtering or simply average features to process pixels inside each superpixel. It is necessary to treat each single superpixel as an independent entity rather than a simple spatial boundary constraint. For object-oriented classification methods, a superpixel composed of highly uniform pixels actually represents a specific scene in an image. Inspired by this idea, the semantic features related to superpixels are naturally produced to provide a specific meaning to each superpixel. Semantic analysis has attracted wide attention in the field of object classification and clustering of remote sensing images [48,49]. In this work, an effective SVM classification method is presented, which involves combining the spectral, spatial, and semantic features of HSIs. The main contributions of this work are summarized as follows: First, the proposed method attempts to use superpixels to extract the semantic information of HSIs as a very important supplement in addition to spectral and spatial information. Second, the proposed method is introduced by integrating spectral, spatial and semantic information into the SVM classifier through multiple kernels. 
Specifically, the spectral information is defined by directly using the spectral features of each pixel, and the spatial information is modeled by combining the EMP and Gabor features of HSIs to construct a stacked vector. Pesaresi and Benediktsson [50] reported that the sizes of different structures in HSIs can be represented by using the EMP, and a set of Gabor filters with different frequencies and orientations has been widely used for the texture representation of images. The semantic information is obtained by using a bag of visual words (BOVW) model.
The rest of the paper is organized as follows: Section 2 reviews the related techniques, Section 3 describes the proposed method, Section 4 provides a comparison of the proposed method with other state-of-the-art HSI classification methods, Section 5 discusses some issues, and the concluding remarks and future work are summarized in the last section.

Related Techniques
Spatial information is usually divided into two categories: texture and shape features. The two features are stacked to fully model the spatial information and two techniques of superpixel segmentation and BOVW are used to obtain semantic features. In addition, all the features are then fused into a composite kernel and fed into the SVM classifier. In this section, spatial feature extraction techniques, superpixel segmentation, BOVW, and the principles of composite kernel methods are briefly introduced.

SVM Model and Kernel Functions
where φ(·) is a nonlinear mapping function which transforms the input samples x_i into a higher dimensional space, C controls the generalization capability of the classifier, ω and b define the linear classifier in the feature space, and ξ_i are positive slack variables dealing with permitted errors. The optimal hyperplane is identified by solving the following Lagrangian dual problem: To combine the spectral and spatial features for HSI classification, Camps-Valls et al. [27] presented a spectral-spatial composite kernel as a weighted summation of (6) and (7) as follows: where μ is a weight used to balance the spectral and spatial kernels. After incorporating (8) into (4), a new decision function for classification can be obtained. In reference [27], the mean and variance within a fixed-size window are computed for each pixel to model the spatial information. The SVM classifier with the composite kernel (8) can effectively combine the spectral and spatial information, which provides a reasonable way to improve the classification accuracy. In this paper, the SVM is adopted as the basic classifier.
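As an illustration of the weighted summation described above, the following minimal sketch evaluates a spectral-spatial composite kernel for a pair of pixels. The Gaussian RBF form exp(-||a - b||^2 / (2σ^2)), the (μ, 1 - μ) weighting, and all function names, bandwidths, and toy feature vectors are assumptions made for this example; they are not taken from equations (5)-(8).

```python
import math


def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel between two equal-length feature vectors."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def composite_kernel(spec_i, spec_j, spat_i, spat_j, mu,
                     sigma_spec=1.0, sigma_spat=1.0):
    """Weighted summation mu * K_spectral + (1 - mu) * K_spatial (assumed form)."""
    k_spec = rbf_kernel(spec_i, spec_j, sigma_spec)
    k_spat = rbf_kernel(spat_i, spat_j, sigma_spat)
    return mu * k_spec + (1.0 - mu) * k_spat


# Hypothetical pixels: a short spectral vector and a short spatial vector each.
spec_a, spec_b = [0.2, 0.4, 0.1], [0.3, 0.1, 0.5]
spat_a, spat_b = [1.0, 2.0], [1.5, 1.0]

k_ab = composite_kernel(spec_a, spec_b, spat_a, spat_b, mu=0.4)
k_ba = composite_kernel(spec_b, spec_a, spat_b, spat_a, mu=0.4)
k_aa = composite_kernel(spec_a, spec_a, spat_a, spat_a, mu=0.4)
```

Because each term is a valid kernel and the weights are non-negative, the sum is again a valid kernel, which is why it can be passed to an SVM in place of a single RBF kernel.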

Gabor Filter
Gabor filters are band-pass filters inspired by a multiband filtering theory for processing visual information in the human visual system. This technique has been widely used for feature extraction and texture representation because it is capable of performing multi-resolution decomposition due to its localization in both the spatial and spatial-frequency domains. Gabor filters are a set of scale-direction filters, which are obtained using: from a mother wavelet: in which: are the scale and direction indices of wavelets, Ul and Uh are the minimum and maximum center frequencies of the filters on the u axis in the two-dimensional frequency domain, and x0 and y0 are the center coordinates of the filters in the spatial domain, respectively [51]. The Gabor filters constitute the texture part of the spatial features.

Morphological Profiles
The main function of the EMP is to reconstruct the spatial information by using morphological (opening/closing) operators while preserving image edge features. Let k and n be the numbers of the required principal components (PCs) and morphological operators, respectively, ψ and η be the opening and closing operations in morphology, respectively, I be a single-band image, and B be the number of spectral bands of an HSI. Then the morphological profile (MP) for I can be defined as follows: According to (12), the MP is a (2n + 1)-band image. In principle, we could construct the MP for each spectral band of the HSI without feature selection, but this causes the following limitations. First, information redundancy cannot be avoided in the resulting B(2n + 1)-band image, which may reduce the classification accuracies. Second, the classification process using such high-dimensional data leads to much higher computational costs. To solve these problems, the principal component analysis (PCA) [28] technique is used in this work, as it was in the original work for producing the EMP. For each PC, the MP is a (2n + 1)-band image. Then, the EMP can be obtained by stacking the MPs as follows: where the EMP is a k(2n + 1)-dimensional vector that contains both the spectral and spatial information and models the shape information of the spatial features.
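The band counts stated above, 2n + 1 per MP and k(2n + 1) for the EMP, can be checked with a small sketch. It makes several simplifying assumptions that the text does not confirm: plain grayscale opening and closing, a flat square window rather than a disk-shaped structuring element, windows truncated at the image border, and a band order of openings, original image, then closings. All function names are hypothetical.

```python
def _window_filter(img, radius, pick):
    """Apply pick (min or max) over a square window, truncated at the borders."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                img[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out[r][c] = pick(window)
    return out


def opening(img, radius):
    """Erosion followed by dilation."""
    return _window_filter(_window_filter(img, radius, min), radius, max)


def closing(img, radius):
    """Dilation followed by erosion."""
    return _window_filter(_window_filter(img, radius, max), radius, min)


def morphological_profile(img, radii):
    """Stack n openings, the image itself, and n closings: 2n + 1 bands."""
    openings = [opening(img, r) for r in reversed(radii)]
    closings = [closing(img, r) for r in radii]
    return openings + [img] + closings


def extended_morphological_profile(pcs, radii):
    """Stack the MPs of k principal components: k(2n + 1) bands."""
    bands = []
    for pc in pcs:
        bands.extend(morphological_profile(pc, radii))
    return bands


# Toy stand-in for one principal component: a bright square on a dark background.
pc = [[1, 1, 1, 1],
      [1, 9, 9, 1],
      [1, 9, 9, 1],
      [1, 1, 1, 1]]
radii = [1, 2]                      # n = 2 structuring-element sizes
profile = morphological_profile(pc, radii)
emp = extended_morphological_profile([pc, pc, pc], radii)   # k = 3
opened = opening(pc, 1)
```

With n = 2 and k = 3, the profile has 5 bands and the EMP has 15, matching the k(2n + 1) count.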

Entropy Rate Superpixel (ERS)
Superpixel segmentation is an important module for many computer vision tasks such as object recognition and image segmentation [52], and can be formulated as an optimization problem on an undirected graph G = (V, E), where V is the set of pixels of the base image and E represents the pairwise similarity of adjacent pixels. We can consider image segmentation as a graph partitioning problem, i.e., selecting a subset A from E to construct compact, homogeneous, and balanced subgraphs, each of which corresponds to a superpixel, by maximizing the objective function as follows: where λ is the weight to balance the two terms, NA is the number of connected components in the graph, and H(A) is the entropy rate of the random walk on the graph, which is used to obtain compact and homogeneous clusters and can be represented as follows: where wi is the sum of the weights of the edges connected to the ith vertex and is used for normalization, and |V| is the total number of vertices in the graph. Let ZA and NA be the distribution of the cluster membership and the total number of connected components with respect to A in the graph, respectively. The distribution of ZA is expressed as follows: Then, the balancing term in (14) is presented to favor clusters with similar sizes as follows:
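To give a feel for how a balancing term of this kind favors clusters with similar sizes, the sketch below assumes the form commonly given for the original entropy rate superpixel formulation: the entropy of the cluster-membership distribution, with the probability of cluster i taken as its share of all vertices, minus the number of connected components NA. This form and the function name are assumptions for illustration, since the corresponding equations are not reproduced in the text above.

```python
import math


def balancing_term(cluster_sizes):
    """Assumed form: entropy of the cluster-size distribution minus the cluster count."""
    total = sum(cluster_sizes)
    entropy = -sum((s / total) * math.log(s / total)
                   for s in cluster_sizes if s > 0)
    return entropy - len(cluster_sizes)


# Two hypothetical partitions of 100 pixels into four clusters.
balanced = balancing_term([25, 25, 25, 25])
skewed = balancing_term([97, 1, 1, 1])
```

For the same number of clusters, the evenly sized partition scores higher than the skewed one, which is the behavior the balancing term is meant to encourage.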

Bag of Visual Words (BOVW)
The BOVW method is derived from the bag of words (BOW) model, which has been used in the field of text classification. In the BOW model, a set of words is selected based on the text document to be classified to construct a word dictionary, and the document is then encoded into a histogram to indicate the number of occurrences of each selected word. It is reasonable to speculate that an image can be characterized by using a histogram of visual word counts.
To apply the BOVW method to image classification, a series of images can be considered as a document. Unlike text classification, the BOVW method does not start from a given visual word dictionary, also known as a codebook. The two main steps to construct a visual word dictionary are summarized as follows. First, each image is characterized as a set of feature vectors by a feature detection technique such as the scale-invariant feature transform. These features are usually regarded as low-level features, also known as visual words. By clustering all these low-level features into k groups, the visual word dictionary of all the images is built from the k cluster centers. Then, each image can be represented as a histogram feature vector by counting the number of occurrences of low-level features assigned to each entry of the visual word dictionary. Histogram vectors can be regarded as semantic features and can be further used for image classification. In fact, semantic information is capable of providing "medium-level" features, which helps to bridge the huge semantic gap between the low-level features extracted from images and the high-level concepts to be classified. The main procedures of the BOW algorithm are summarized in Algorithm 1.

Algorithm 1: BOW algorithm
Input: a set of images, the number of cluster centers k.
Step 1: Local feature detection. Apply feature detection techniques to each image to extract key points, which are also called visual words (vws).
Step 2: Visual word dictionary construction. Clustering methods are adopted to divide all vws into k groups; the cluster centers constitute the visual word dictionary.
Step 3: Histogram feature vector construction. Count the numbers of vws belonging to the different elements of the visual word dictionary (generally by calculating the Euclidean distance between feature vectors) to construct the histogram feature vectors.
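The steps of Algorithm 1 can be sketched as follows, assuming that the low-level features have already been extracted (Step 1 is not implemented here). The k-means routine is a plain Lloyd iteration seeded with the first k features, which is a simplification chosen to keep the example deterministic; the function names and toy features are hypothetical.

```python
def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_center(feature, centers):
    """Index of the dictionary entry closest to the feature (Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda idx: squared_distance(feature, centers[idx]))


def kmeans(features, k, iterations=10):
    """Step 2: cluster all low-level features; the centers form the dictionary."""
    centers = [list(f) for f in features[:k]]   # simplistic seeding (assumption)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for f in features:
            groups[nearest_center(f, centers)].append(f)
        for idx, group in enumerate(groups):
            if group:   # keep the previous center if a cluster becomes empty
                centers[idx] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def bovw_histogram(image_features, centers):
    """Step 3: count how many features of one image fall on each visual word."""
    hist = [0] * len(centers)
    for f in image_features:
        hist[nearest_center(f, centers)] += 1
    return hist


# Two toy "images", each described by three 2-D low-level features.
image_a = [[0.0, 0.1], [5.0, 5.1], [0.1, 0.0]]
image_b = [[5.1, 4.9], [4.9, 5.0], [0.05, 0.05]]

dictionary = kmeans(image_a + image_b, k=2)
hist_a = bovw_histogram(image_a, dictionary)
hist_b = bovw_histogram(image_b, dictionary)
```

Each histogram has length k and sums to the number of features in its image, so images of different sizes are mapped to vectors of the same dimension.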

Proposed Method
In this work, we present an SVM classifier with the spectral, spatial, and semantic kernels method (SVM-SSSK) for HSI classification. First, a PCA transform is performed on the original HSI to obtain the first three principal components (PCs 1-3). Then, the PCs 1-3 are used to obtain the EMP, Gabor features, and an ERS segmentation map. Next, to model the spatial information, the EMP and the Gabor features are combined to construct a new feature map, where each pixel is a stacked feature vector composed of structure and texture features. To ensure uniform spatial characteristics, mean filtering is performed within each superpixel. Thereafter, to model the semantic information, a k-means clustering map and the ERS segmentation map are integrated to produce a feature vector for each superpixel. Subsequently, three single kernels are constructed to represent the spectral, spatial, and semantic information, respectively, and a composite kernel function is defined as the weighted sum of the three kernel functions. Finally, the classification map is obtained by using the SVM classifier with the composite kernel. The flowchart of the proposed method is illustrated in Figure 1.

Spatial Feature Extraction
To completely represent the spatial information, texture and structure features are directly combined for each pixel as a stacked vector. To this end, we use the EMP as the structure features of HSIs. The Gabor filter has been widely used for feature extraction from grayscale images, and many studies have extracted the Gabor features from the first PC of HSIs. However, the texture information of some ground objects in PC 1 may not be sufficient to ensure good classification results. To avoid this problem, we performed the Gabor filtering on the PCs 1-3 in this work to extract the multiband Gabor (MultiGabor) features, because these three PCs include over 99% of the total variation in the image. A superpixel is usually defined as a uniform region in the image whose shape and size can adapt to different spatial structures. Nevertheless, each pixel corresponds to a stacked vector in the structure-texture feature image, and these vectors may differ from each other inside a superpixel. To ensure uniform spatial features within each superpixel, mean filtering is performed in each band of the structure-texture feature image. Eventually, the filtered spatial features of each pixel are used to construct the spatial kernel function.
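The mean filtering within each superpixel can be illustrated on a single feature band. In this minimal sketch, the band and the superpixel label map are small nested lists; the function name and the toy values are hypothetical, and a real feature image would apply the same operation to every band of the stacked structure-texture vector.

```python
def superpixel_mean_filter(band, superpixels):
    """Replace each pixel by the mean of its superpixel, for one feature band."""
    sums, counts = {}, {}
    for band_row, label_row in zip(band, superpixels):
        for value, label in zip(band_row, label_row):
            sums[label] = sums.get(label, 0.0) + value
            counts[label] = counts.get(label, 0) + 1
    means = {label: sums[label] / counts[label] for label in sums}
    return [[means[label] for label in label_row] for label_row in superpixels]


# One 2 x 3 feature band and a label map with two superpixels (0 and 1).
band = [[1.0, 3.0, 10.0],
        [2.0, 2.0, 14.0]]
labels = [[0, 0, 1],
          [0, 0, 1]]

filtered = superpixel_mean_filter(band, labels)
```

After filtering, all pixels that share a superpixel label carry the same value in this band, which is the uniformity the spatial kernel relies on.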

Semantic Feature Extraction
Once the HSI has been segmented into nonoverlapping superpixels, each superpixel of the HSI can be considered as a separate image. Semantic features can represent similarities or differences between different superpixels. To extract the semantic features, the BOVW model was used in this work. Specifically, each superpixel can be regarded as a patch of an image in addition to being a document, and the spectral feature vector can also be used as a low-level feature. To group low-level features, the k-means clustering algorithm is used because of its simplicity, which indicates that the construction of a visual dictionary is an unsupervised process. Furthermore, it is more flexible to label the visual words since the number of the centers can be manually specified.
Once the visual word dictionary is constructed, the histogram information of visual words is used to describe the meaning of each superpixel. Specifically, the number of pixels inside each superpixel belonging to each cluster class is counted. Therefore, each superpixel can be represented by using a k × 1 histogram vector. In this way, the histogram vector of each superpixel can be used to represent semantic features. It should be noted that the pixels inside each superpixel should have the same semantic feature due to the homogeneity of the superpixel. The main procedures of semantic feature extraction are summarized in Algorithm 2.
Algorithm 2: Semantic feature extraction
Input: An original HSI u, the ERS segmentation map us consisting of N superpixels, the number of cluster centroids k.
Step 1: Perform the k-means algorithm to cluster u into k cluster centers.
Step 2: For i = 1, 2, …, N
(a) Count the number of pixels inside the ith superpixel belonging to each cluster.
(b) Construct the k × 1 feature histogram vector as the semantic feature for the ith superpixel.
End
Step 3: Obtain the semantic feature extraction map.
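Steps 2 and 3 of Algorithm 2 can be sketched as follows, assuming that Step 1 has already produced a k-means cluster index for every pixel. The cluster map, the superpixel map, and the function names are hypothetical toy inputs for illustration.

```python
def semantic_features(cluster_map, superpixel_map, k):
    """Step 2: build one k-bin histogram of cluster labels per superpixel."""
    histograms = {}
    for cluster_row, label_row in zip(cluster_map, superpixel_map):
        for cluster, label in zip(cluster_row, label_row):
            hist = histograms.setdefault(label, [0] * k)
            hist[cluster] += 1
    return histograms


def semantic_feature_map(superpixel_map, histograms):
    """Step 3: every pixel inherits the histogram of its superpixel."""
    return [[histograms[label] for label in row] for row in superpixel_map]


# Toy inputs: k = 3 clusters, two superpixels on a 2 x 3 image.
cluster_map = [[0, 0, 2],
               [1, 0, 2]]
superpixel_map = [[0, 0, 1],
                  [0, 0, 1]]

histograms = semantic_features(cluster_map, superpixel_map, k=3)
feature_map = semantic_feature_map(superpixel_map, histograms)
```

The histograms here hold raw counts; whether they should be normalized by superpixel size before entering the kernel is not stated in the text and is left open in this sketch.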

SVM-SSSK Method
So far, three different types of features have been obtained, namely, spectral, spatial, and semantic. To fuse these features, the aforementioned composite kernel method was used to construct a novel HSI classification framework. The composite kernel can be represented as follows: which can be defined through the weighted summation of the spectral, spatial, and semantic kernel functions as follows: for the samples x_1, x_2, ..., x_n. The widely used Gaussian RBF kernel was utilized to compute the kernel matrix for each type of information. As described in Section 2.1, the spectral and spatial kernels are the same as (6) and (7), and the semantic kernel function was defined as follows: The main procedures of the proposed algorithm are summarized in Algorithm 3.

Algorithm 3: SVM-SSSK
Input: An original HSI u, the available training/verification samples.
Step 1: Obtain PCs 1-3 of u.
Step 2: Perform the Gabor filtering on the PCs 1-3 to extract the MultiGabor features.
Step 3: Build the EMP by computing the MPs for the PCs 1-3 from Step 1 as described in Section 2.3.
Step 4: Conduct ERS as described in Section 2.4 to obtain the segmentation map with superpixels.
Step 5: Construct the spatial features as described in Section 3.1.
Step 6: Extract the semantic features by using Algorithm 2.
Step 8: Construct the spectral, spatial, and semantic kernels as described in Section 3.3.
Step 9: Apply the SVM classifier with the proposed composite kernel in (20) to classify u using the training samples by choosing the optimized C and γ.
Step 10: Obtain the final classification map.
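The kernel construction in Steps 8 and 9 can be sketched as a Gram matrix built from a weighted sum of three Gaussian RBF kernels. The weights summing to one, the separate bandwidth per feature type, the function names, and the toy samples are all assumptions for illustration and are not taken from equation (20).

```python
import math


def rbf(a, b, sigma):
    """Gaussian RBF kernel between two equal-length feature vectors."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sssk_kernel(sample_i, sample_j, weights, sigmas):
    """Weighted sum of the spectral, spatial, and semantic RBF kernels."""
    return sum(w * rbf(fi, fj, s)
               for w, fi, fj, s in zip(weights, sample_i, sample_j, sigmas))


def gram_matrix(samples, weights, sigmas):
    """Kernel matrix over all sample pairs, usable as a precomputed SVM kernel."""
    return [[sssk_kernel(a, b, weights, sigmas) for b in samples]
            for a in samples]


# Each sample is a (spectral, spatial, semantic) triple of feature vectors.
samples = [
    ([0.2, 0.4], [1.0, 2.0, 0.5], [3, 1, 0]),
    ([0.3, 0.1], [1.5, 1.0, 0.5], [2, 2, 0]),
    ([0.9, 0.8], [4.0, 0.0, 1.0], [0, 0, 4]),
]
weights = (0.3, 0.4, 0.3)   # assumed to sum to one
sigmas = (1.0, 2.0, 2.0)    # hypothetical bandwidths, one per feature type

K = gram_matrix(samples, weights, sigmas)
```

A matrix of this form can be handed to any SVM implementation that accepts a precomputed kernel, with C and the bandwidths chosen by cross validation.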

Descriptions of Datasets
To validate the effectiveness of the proposed method, we report experiments on two widely used datasets, the Indian Pines and the University of Pavia. The AVIRIS Indian Pines image covers different agricultural/forest land covers, and 16 classes are recorded in its ground truth data. The University of Pavia image was obtained over an urban area in Italy, and its ground truth data consist of nine typical urban structures. The RGB false color images and the corresponding ground truth data for the two datasets are illustrated in Figure 2. For the Indian Pines dataset, 10% of the known samples per class in the ground truth data were randomly selected as the training set and the rest of the known samples made up the validation set. If the number of training samples of a certain class was less than 10, then we fixed the number at 10. For the University of Pavia dataset, the same number of known samples for each class were randomly chosen for training and the rest were used for validation.
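The sampling rule for the Indian Pines dataset (10% per class, with a floor of 10 training samples) can be sketched as below. Rounding the 10% share up, capping the count at the class size, the fixed random seed, and the toy class sizes are assumptions; the text does not specify them.

```python
import random


def training_count(class_size, percent=10, minimum=10):
    """Per-class training size: percent of the class (rounded up), at least minimum."""
    needed = -(-class_size * percent // 100)   # integer ceiling division
    return min(class_size, max(minimum, needed))


def split_indices(labels, percent=10, minimum=10, seed=0):
    """Randomly split sample indices into training and validation sets per class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, validation = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = training_count(len(indices), percent, minimum)
        train.extend(indices[:n_train])
        validation.extend(indices[n_train:])
    return sorted(train), sorted(validation)


# Hypothetical ground truth: one large class and one tiny class.
labels = ["corn"] * 200 + ["oats"] * 20
train, validation = split_indices(labels)
```

With these toy sizes, the large class contributes 20 training samples (10% of 200), while the tiny class is raised from 2 to the floor of 10.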

Experimental Setting
To verify the superiority of the proposed method, several kernel-based HSI classification methods were chosen for comparison, including three single kernel methods of SVM, EMP [30], and an edge-preserving filter (EPF) [17], two composite kernel methods of SVM-CK [27] and GCK [43], and two superpixel-based classification methods using a spectral-spatial kernel (SC-SSK) [13] and multiple kernels (SC-MK) [46]. Moreover, we constructed a single kernel SVM classifier using both the spectral and texture information of each pixel for comparison, in which the texture information is represented using the MultiGabor features. For simplicity, we refer to this method as "MultiGabor". The criteria of overall accuracy, average accuracy, kappa coefficient (κ), and class-specific accuracy were used for quantitative evaluation. For objective comparison, we employed the optimized parameters for each method to obtain the optimal classification results, which are comparable to those from the original references for the two datasets with the same number of training samples. In the following experiments, the parameters for each method are provided as follows: (1) For SVM, only the spectral features were used in the RBF kernel and such kernels were employed by all of the other methods, except for GCK. The optimal C and γ for each method were obtained by using five-fold cross validation with (2) The PCs 1-3 were used by the EMP, EPF, and MultiGabor methods for different purposes. Specifically, they were used for EMP to construct the MPs, which were computed using a flat disk-shaped structuring element with a radius from 1 to 11 with an interval of 2. Thus, EPF can form a guidance image with the following parameter settings: a 5 × 5 window for the bilateral filter,

The Indian Pines Dataset
The classification results obtained by different methods are illustrated in Figure 3. The SVM classification result in Figure 3a was corrupted by a great deal of isolated class noise. The noise can be alleviated by the EMP, SVM-CK, SC-SSK, and GCK methods, but cannot be thoroughly removed by them, especially in the upper-left part of the figures, as shown in Figure 3b,e-g. The classification maps obtained by the rest of the methods were comparable and the noise effect was effectively minimized. However, some misclassified areas appeared in Figure 3c,d,h at the upper part of the image for the EPF, MultiGabor, and SC-MK methods. For instance, the boundaries of regions such as the Corn-no till in the upper left part of the image were contaminated with class noise because the feature extraction in the EPF and MultiGabor methods was performed in a predefined sliding window in the image. Some classification errors involving the Soybeans-no till in the middle right part of the image cannot be effectively corrected by the SC-MK method. In comparison, the SVM-SSSK method can achieve a more homogeneous classification map with smoother boundaries and almost no isolated noise in terms of visual inspection. As a result, only a few classification errors can be observed in Figure 3i.
The classification accuracies of this dataset are reported in Table 1 for quantitative evaluation; they are the averages over 10 experiments conducted using different training samples. Some observations can be made from this table as follows. First, the proposed method can obtain very good classification accuracies above 98.16% in terms of overall accuracy, average accuracy, and κ, which are better than those of all the other methods. Second, the proposed method can achieve the highest class-specific accuracies for eleven classes, the exceptions being Corn, Grass/pasture-mowed, Soybeans-clean till, Woods, and Stone-steel towers. Class-specific accuracies of 100% were obtained, including for the two classes of Hay-windrowed and Oats, validating the effectiveness of the proposed method for both large and tiny areas in the image. It is worth noting that the MultiGabor method can demonstrate better classification performance than the SVM, EMP, SVM-CK, and GCK methods in terms of visual inspection and classification accuracies, since the texture features for the agricultural covers are much more effective for describing the spatial structures of the Indian Pines dataset. Therefore, the texture information is significant for improving HSI classification and can be employed in the SVM-SSSK method as well.
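For reference, the overall accuracy, average accuracy, κ, and class-specific accuracies discussed above can all be derived from a confusion matrix. The sketch below uses the standard definitions; the row/column convention, the function name, and the toy matrix are assumptions and do not reproduce any values from Table 1.

```python
def accuracy_metrics(confusion):
    """confusion[i][j]: number of class-i reference samples labeled as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    diagonal = [confusion[i][i] for i in range(n_classes)]
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]

    class_acc = [diagonal[i] / row_sums[i] for i in range(n_classes)]
    overall = sum(diagonal) / total          # fraction of all samples correct
    average = sum(class_acc) / n_classes     # mean of the per-class accuracies
    # Chance agreement expected from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (overall - expected) / (1.0 - expected)
    return overall, average, kappa, class_acc


# Toy two-class confusion matrix with 50 reference samples per class.
confusion = [[45, 5],
             [10, 40]]
overall, average, kappa, class_acc = accuracy_metrics(confusion)
```

Average accuracy weights every class equally, so it is more sensitive than overall accuracy to errors in small classes such as those with only a handful of ground truth samples.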
In this work, McNemar's test [53] was used to analyze the statistical significance of the differences between the proposed method and the other methods. This test has been commonly used in the field of remote sensing and is based upon the standardized normal test statistic:
Z = (f12 − f21) / √(f12 + f21) (22)
where f12 indicates the number of samples classified correctly by classifier 1 and incorrectly by classifier 2, and f21 indicates the number of samples classified correctly by classifier 2 and incorrectly by classifier 1. The difference in accuracy between classifiers 1 and 2 is said to be statistically significant if |Z| > 1.96. The sign of Z indicates whether classifier 1 is more accurate than classifier 2 (Z > 0) or vice versa (Z < 0). Table 2 lists the statistical significance of the classification results obtained by different methods for the Indian Pines dataset using McNemar's test. It can be observed that the Z values between the SVM-SSSK method and the other classification methods were in the range of 1.90-12.01, which means that the differences between the proposed method and the other methods are significant, except for EPF. In particular, the SVM-SSSK method demonstrated a significant difference with |Z| > 4 against the state-of-the-art SVM, EMP, SVM-CK, and GCK methods.
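The test statistic in (22) can be computed directly from the reference labels and the predictions of two classifiers, as in the following minimal sketch. The function name, the handling of the case where the two classifiers never disagree, and the toy label vectors are assumptions for illustration.

```python
import math


def mcnemar_z(reference, predicted_1, predicted_2):
    """Z = (f12 - f21) / sqrt(f12 + f21) for two classifiers on the same samples."""
    f12 = sum(1 for y, a, b in zip(reference, predicted_1, predicted_2)
              if a == y and b != y)   # classifier 1 right, classifier 2 wrong
    f21 = sum(1 for y, a, b in zip(reference, predicted_1, predicted_2)
              if a != y and b == y)   # classifier 1 wrong, classifier 2 right
    if f12 + f21 == 0:
        return 0.0                    # no discordant samples (assumed convention)
    return (f12 - f21) / math.sqrt(f12 + f21)


# Toy example with 20 validation samples that all belong to class 0.
reference = [0] * 20
predicted_1 = [0] * 18 + [1] * 2              # classifier 1: 18 correct
predicted_2 = [0] * 9 + [1] * 9 + [0] * 2     # classifier 2: 11 correct

z = mcnemar_z(reference, predicted_1, predicted_2)
significant = abs(z) > 1.96
```

Here f12 = 9 and f21 = 2, so Z = 7/√11, which exceeds 1.96 and is positive, indicating that classifier 1 is significantly more accurate on this toy data.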

The University of Pavia Dataset
The classification maps produced by all the methods are shown in Figure 4. We can observe that there was a great deal of noise in the SVM map in Figure 4a. Although the noise effect can be reduced by the other methods, there was still isolated noise in the EMP, MultiGabor, SVM-CK, SC-SSK, and GCK maps in Figure 4b,d-g, respectively. For instance, the noise can be seen clearly in a large area belonging to Bare soil in the center of the previously mentioned maps. The noise was completely minimized by the EPF, SC-MK, and SVM-SSSK methods, as shown in Figure 4c,h,i, respectively. However, the classification errors were observed in a large area belonging to Meadows in the bottom of the EPF map, and other small areas in the left part of this map. The classification results of the SC-MK and SVM-SSSK methods are both quite good, but a few classification errors appeared in the aforementioned large area belonging to Meadows in the bottom of the image and the Self-blocking bricks in the middle of the image for the SC-MK method. In contrast, the SVM-SSSK method can effectively avoid the noise and misclassification, and its result is very close to the ground truth data in Figure 4d.
The classification accuracies for the University of Pavia dataset are reported in Table 3 for quantitative evaluation, based on 10 experiments conducted using different training samples. First, the highest overall accuracy, average accuracy, and κ were reached by the proposed method. Second, the proposed method achieved the highest class-specific accuracies for all the classes, except for Bitumen. Finally, the EMP, SVM-CK, GCK, and SVM-SSSK methods achieved very high classification accuracies, because this dataset contains many typical urban features whose shapes are well represented by the EMP. Table 4 lists the statistical significance of the classification results obtained by different methods for the University of Pavia dataset using McNemar's test. All the Z values between the SVM-SSSK method and the other methods were higher than 2, indicating that the differences between the proposed method and the other methods are significant.
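The three summary measures used in this evaluation can be computed from a confusion matrix. The sketch below uses the standard definitions of overall accuracy, average accuracy (the mean of the per-class accuracies), and Cohen's κ. The function name and the toy matrix are illustrative assumptions, not values from Table 3, and every class is assumed to have at least one reference sample.

```python
def accuracy_metrics(confusion):
    """Return (overall accuracy, average accuracy, kappa).

    `confusion` is a square nested list; rows are reference classes and
    columns are predicted classes.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]
    correct = sum(confusion[i][i] for i in range(n_classes))

    overall = correct / total
    per_class = [confusion[i][i] / row_sums[i] for i in range(n_classes)]
    average = sum(per_class) / n_classes
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(row_sums[k] * col_sums[k] for k in range(n_classes)) / total ** 2
    kappa = (overall - expected) / (1 - expected)
    return overall, average, kappa

# Toy 3-class confusion matrix, made up for illustration only.
toy = [[50, 2, 0],
       [3, 10, 1],
       [0, 1, 5]]
oa, aa, kappa = accuracy_metrics(toy)
print(f"OA = {oa:.4f}, AA = {aa:.4f}, kappa = {kappa:.4f}")
```

Because the average accuracy weights all classes equally, it falls below the overall accuracy whenever small classes are classified less reliably than large ones, as in the toy matrix.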
Based on the previously mentioned experiments on the two datasets, some similar conclusions can be drawn from Figures 3 and 4 and Tables 1-3. First, both the SC-MK and SVM-SSSK methods, which use multiple kernels, were better than the SVM-CK, SC-SSK, and GCK methods, which use only spatial and spectral kernels, in terms of both the visual effect and the classification accuracies, because integrating more information into the kernel functions improves the classification performance. Second, the SVM-SSSK method achieved higher classification accuracies than the SC-MK method. The multiple kernels of the SC-MK method are mainly constructed using both the spectral and spatial features of HSIs, whereas the proposed multiple kernels are defined using the spectral, spatial, and semantic features.

Discussion
In this section, the sensitivity of the parameters in the SVM-SSSK method is analyzed. Our experiments showed that the number of superpixels S, the dimensionality of the semantic features D, and the weights in the spectral-spatial-semantic kernel greatly influence the performance of the proposed method. Furthermore, a sensitivity analysis with respect to different training sets is provided for all the methods. In the following experiments, the training-validation sets for each dataset and the default parameter settings of the SVM-SSSK method were the same as those in Sections 4.3 and 4.4.

Influence of S
As mentioned in Section 2.4, different numbers of superpixels can be obtained by the ERS algorithm, which may influence the region homogeneity and the semantic features. In the first experiment, we ran the proposed method to analyze the impact of S on classification performance. Figure 5 shows the overall accuracy, average accuracy, and κ for the two datasets obtained by the SVM-SSSK method with the number of superpixels ranging from 20 to 100 with a step size of 10 and from 100 to 1000 with a step size of 100. The classification accuracies of the two datasets mainly show an upward trend as S is increased from 20 to 100, and a downward trend as S is increased from 100 to 1000. When S is small, the image is divided into large areas by the ERS algorithm and the semantic features are not fully extracted. However, when S is large, the image is divided into many small-scale regions, and the semantic features extracted from different regions belonging to the same class may fluctuate, so some miscellaneous components can easily be misclassified as other land covers. Consequently, the final classification accuracies often decrease as the number of superpixels increases. If there is no prior knowledge, we recommend choosing a relatively small value such as S = 100 to obtain near-optimal classification results.

Influence of D
As described in Section 3.2, different dimensions of the semantic features can be obtained by adjusting the number of centroids in the k-means algorithm, which can significantly affect the classification performance of the proposed method. In the second experiment, we ran the proposed method to analyze the impact of D on classification performance. Figure 6 shows the overall accuracy, average accuracy, and κ for the two datasets obtained by the SVM-SSSK method with D ranging from 10 to 150. For the Indian Pines dataset, when D is increased from 10 to 50, the overall accuracy, average accuracy, and κ rise rapidly from 96.87%, 97.14%, and 96.42% to 98.58%, 98.21%, and 98.39%, respectively. When D is increased from 50 to 150, the classification accuracies remain steady. This means that more semantic information is used to improve the classification performance as D is increased. However, if D is very large, too many semantic features can easily cause information redundancy. For the University of Pavia dataset, as D is increased, the overall accuracy, average accuracy, and κ generally show a slight rising trend. For instance, the overall accuracy, average accuracy, and κ increase by 0.13%, 0.14%, and 0.18%, respectively, when D ranges from 10 to 150.
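To illustrate how D determines the length of the semantic feature vector, the sketch below builds one bag-of-visual-words histogram per superpixel from per-pixel visual-word indices. It assumes that each pixel has already been assigned to its nearest k-means centroid and to an ERS superpixel, and it normalizes each histogram to sum to one. The function name, the normalization, and the toy inputs are assumptions made for illustration and may differ from the exact procedure of Section 3.2.

```python
from collections import defaultdict

def bovw_histograms(word_ids, superpixel_ids, n_words):
    """Normalized visual-word histogram (length n_words) for each superpixel.

    word_ids[i]        index of the centroid nearest to pixel i (0..n_words-1)
    superpixel_ids[i]  label of the superpixel that contains pixel i
    """
    counts = defaultdict(lambda: [0] * n_words)
    for word, region in zip(word_ids, superpixel_ids):
        counts[region][word] += 1
    histograms = {}
    for region, hist in counts.items():
        size = sum(hist)  # number of pixels in this superpixel
        histograms[region] = [c / size for c in hist]
    return histograms

# Toy input, made up for illustration: 8 pixels, D = 3 words, 2 superpixels.
word_ids = [0, 0, 1, 2, 2, 2, 1, 2]
superpixel_ids = [0, 0, 0, 0, 1, 1, 1, 1]
features = bovw_histograms(word_ids, superpixel_ids, n_words=3)
for region, hist in sorted(features.items()):
    print(region, hist)
```

Every pixel in a superpixel would then share that superpixel's D-dimensional histogram, so a larger D yields a finer but potentially more redundant description.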

Impact of Weights
In the SVM-SSSK method, the weights in the spectral-spatial-semantic kernel indicate the contribution of the spectral, spatial, and semantic information to classification. To uncover the interaction of the three weights in (20), two of the weights were varied, giving a total of 66 combinations used for parameter settings. Based on the statistics outlined in Figure 7, the overall accuracy was better than 98% for the Indian Pines dataset in 42 of 66 (63.6%) cases, and the overall accuracy was higher than 99% for the University of Pavia dataset in all cases except for certain weight combinations.
To further verify the effectiveness of the semantic features, we performed McNemar's tests for the Indian Pines and University of Pavia datasets in five different situations: three using one of the spectral, spatial, and semantic features alone, and two using composite kernels that combine two types of the features. Tables 5 and 6 list the statistical significance of the classification results obtained by the SVM-SSSK method using different types of features for the Indian Pines and University of Pavia datasets, respectively, using McNemar's test. The Z values between the SVM-SSSK method using all the features and using any one or two types of the features were in the range of 2.35-12.13 and 4.01-14.29 for the Indian Pines and University of Pavia datasets, respectively, which confirms that the classification results were significantly improved when using the semantic features. Therefore, integrating the spectral, spatial, and semantic information is an effective way to obtain better classification results than using one or two kernels with the SVM classifier.
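One weight grid that is consistent with the count of 66 combinations is sketched below. It assumes that the three weights are non-negative, sum to one, and are sampled with a step of 0.1, so that two weights are free and the third follows from them. It also assumes that the kernel in (20) is a weighted sum of the three kernel matrices. Both assumptions, along with the function names, are illustrative and are not confirmed by the text of this section.

```python
def weight_grid(steps=10):
    """All (mu_spe, mu_spa, mu_sem) on a 1/steps grid whose sum is 1."""
    grid = []
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            k = steps - i - j  # the third weight is fixed by the other two
            grid.append((i / steps, j / steps, k / steps))
    return grid

def composite_kernel(k_spe, k_spa, k_sem, weights):
    """Weighted sum of three square kernel matrices given as nested lists."""
    mu_spe, mu_spa, mu_sem = weights
    n = len(k_spe)
    return [[mu_spe * k_spe[r][c] + mu_spa * k_spa[r][c] + mu_sem * k_sem[r][c]
             for c in range(n)]
            for r in range(n)]

grid = weight_grid()
print(len(grid))  # 66 combinations on this assumed grid

# Toy 2 x 2 kernel matrices, made up for illustration only.
k_spe = [[1, 0], [0, 1]]
k_spa = [[1, 1], [1, 1]]
k_sem = [[2, 0], [0, 2]]
print(composite_kernel(k_spe, k_spa, k_sem, (0.5, 0.25, 0.25)))
```

Under these assumptions, a combined matrix of this kind could be passed to an SVM implementation that accepts precomputed kernels, and the sensitivity analysis would repeat training once per grid entry.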

Sensitivity to Different Training Sets
In this section, the sensitivity of the proposed method with respect to different training samples is analyzed. In this experiment, all the classification methods described in Section 4.2 were compared on the two datasets, and the parameter settings were consistent with those in Sections 4.3 and 4.4. For Indian Pines, different proportions ranging from 2.5% to 30% of the ground truth data were randomly selected for each class as training samples. In particular, the number of training samples was fixed to 10 per class for some minority classes in all cases. For University of Pavia, different numbers of known samples varying from 50 to 350 with a step size of 50 were randomly selected for each class as training samples. Figure 8 plots the impact of different training samples on overall accuracy for the two datasets. In this figure, the overall accuracy results achieved by all the methods are positively related to the number of training samples. For the Indian Pines dataset, the SVM-SSSK method achieved the best overall accuracy results with respect to different training sets. In particular, the SVM-SSSK method outperformed the other methods when the number of training samples was very limited. For instance, an overall accuracy of 94.47% was obtained by the proposed method, which is 3.46-19.7% higher than the other methods, when the proportion of training samples was only 2.5%. Furthermore, when the proportion was more than 20%, the overall accuracy results of the MultiGabor and SVM-SSSK methods were comparable, indicating that the texture information is important for representing the spatial features for this dataset. For the University of Pavia dataset, the SVM-SSSK method was clearly superior to the other methods in all cases. Specifically, the SVM-SSSK method obtained an overall accuracy of 99.03% when the number of training samples was only 50 per class, and this overall accuracy result is better than that of all the other methods under different training sets, except for the SC-MK method when the number of training samples is more than 200 per class. Meanwhile, the overall accuracy achieved by the SVM-SSSK method was stable at around 99.9% when the number of training samples was more than 150 per class. Based on the above observations, the proposed method is capable of achieving better classification accuracies for different HSIs than the other methods.
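The proportional sampling used for Indian Pines can be sketched as follows. The rule used here, drawing the given proportion of each class but never fewer than 10 samples, is only one possible reading of how minority classes were handled. The function name, the default values, and the toy labels are likewise assumptions made for illustration.

```python
import random

def sample_training_indices(labels, proportion, min_per_class=10, seed=0):
    """Randomly pick training indices class by class.

    Each class contributes round(proportion * class size) samples, raised to
    `min_per_class` for small classes and capped at the class size.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train = []
    for label, indices in sorted(by_class.items()):
        wanted = max(round(proportion * len(indices)), min_per_class)
        wanted = min(wanted, len(indices))  # cannot draw more than exist
        train.extend(rng.sample(indices, wanted))
    return sorted(train)

# Toy labels, made up for illustration: one large class and one minority class.
labels = [0] * 400 + [1] * 20
train = sample_training_indices(labels, proportion=0.05)
per_class = {c: sum(1 for i in train if labels[i] == c) for c in (0, 1)}
print(per_class)
```

Repeating the draw with different seeds gives the independent training sets over which accuracies can be averaged.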

Conclusions
In this paper, we propose an HSI classification method that combines spectral, spatial, and semantic information in an SVM classifier using multiple kernels. Specifically, the spectral kernel is defined using spectral features, the spatial kernel is constructed by stacking the structure and texture information of each pixel, and the semantic kernel is developed by performing the BOVW algorithm within each superpixel of the image. The main advantages of the proposed method are twofold. First, semantic information is an important supplement to the commonly used spectral and spatial information. Second, additional information integrated into the kernel function can improve classification performance. The experimental results on two hyperspectral datasets confirm the following conclusions: (1) The combination of spectral, spatial, and semantic features in the SVM-SSSK method can effectively improve classification accuracy. For instance, the overall accuracy values obtained by the SVM-SSSK method when $\mu_{\mathrm{SPE}} + \mu_{\mathrm{SPA}} < 1$ were much higher than those obtained when $\mu_{\mathrm{SEM}} = 0$. (2) The SVM-SSSK method can produce more accurate classification results than all the kernel-based HSI classification methods mentioned in Section 4.2, even with limited training samples. For example, our method improves the overall accuracy for the Indian Pines dataset by 3.46-19.7% when the proportion of the ground truth data used for training is 2.5% per class, and by 1.62-14.12% for the University of Pavia dataset with 50 training samples per class. (3) The SVM-SSSK method achieves stable classification accuracy for most choices of the weights of the three kernels. In short, SVM-SSSK is a very promising multiple kernel learning technique for HSI classification. Future work mainly includes developing more effective learning methods to integrate features in HSIs.