Broad Learning System with Locality Sensitive Discriminant Analysis for Hyperspectral Image Classification

In this paper, we propose a new method for hyperspectral image (HSI) classification that aims to take advantage of both manifold-learning-based feature extraction and neural networks by stacking layers that apply locality sensitive discriminant analysis (LSDA) onto a broad learning system (BLS). BLS has proven to be a successful model for various machine learning tasks owing to the high feature representation capacity introduced by its numerous randomly mapped features. However, these features are also redundant and not discriminative, which lowers performance and causes heavy computing demand, especially when the input data are high dimensional. In our work, a manifold learning method is integrated into the BLS by inserting two LSDA layers, one after the input layer and one before the output layer, so that the spectral-spatial HSI features are fully utilized to achieve state-of-the-art classification accuracy. Extensive experiments show the superiority of our method.


Introduction
Hyperspectral images (HSIs) are produced by hyperspectral sensors that capture reflectance values on tens or even hundreds of spectral bands for each pixel. The increased spectral resolution of HSIs makes them essential for many remote sensing tasks in various fields, such as agriculture [1], environment [2], and military [3]. To obtain semantic abstraction from HSIs, classification requires a mapping from pixel values to land-use and/or land-cover descriptions, which is nontrivial because the high spectral redundancy detrimentally affects the classification process through the curse of dimensionality [4] and noisy labels [5]. Moreover, with increasing spatial resolution, the widespread adoption of integrated spatial and spectral information in HSI analysis has further increased the dimensionality of the input data [6]. It has been shown in many cases that the spectral information useful for HSI classification lies on a nonlinear submanifold embedded in the original feature space, which can be retrieved by manifold-learning-based feature extraction methods [7,8]. Sun et al. modified isometric mapping (ISOMAP) by accelerating its process to reduce the dimensionality of HSIs [9]. Fauvel et al. investigated kernel principal component analysis (KPCA) combined with a linear classifier for HSI classification and showed its advantage over the original principal component analysis method [10]. In contrast to these global approaches, locally based methods such as locally linear embedding (LLE) [11] and Laplacian eigenmaps (LE) [12] attempt to preserve only the local geometrical structure of the data, which brings two prominent advantages: computational efficiency and representation capacity [13]. Some recent studies have formulated locally based manifold learning with a supervised regularization, thereby creating discriminative and compact feature representations [14][15][16].
Locality sensitive discriminant analysis (LSDA) [17] was developed as an extension of linear discriminant analysis (LDA) that integrates the discriminative properties of LDA with a nearest neighborhood graph (NNG) modeling the local geometrical structure of the underlying manifold. Unlike other NNG-based approaches (e.g., locality preserving projections (LPP) [18] and LE [12]), LSDA uses a within-class graph and a between-class graph to obtain good between-class separation while also preserving the within-class local structure. It can therefore be expected to be a useful feature reduction method for supervised classification tasks.
During the past decades, machine learning methods have been widely used to achieve higher semantic prediction accuracy on HSIs. For example, kernel machines such as the support vector machine (SVM) and kernel Fisher discriminant analysis (KFDA) have been used successfully for HSI classification [19]. Ensemble learning methods such as random forest [20] and rotation forest [21] have also shown their benefits, especially when the available labeled training samples are limited [22]. Inspired by the biological nervous system, neural network models have achieved great success in general media information processing [23] and HSI analysis [24,25]. Moreover, neural networks with random weights (NNRW), such as random vector functional link networks (RVFL) [26], Schmidt's method [27], and the extreme learning machine (ELM) [28], set arbitrary weights and biases for the hidden layer while the weights of the output layer are obtained analytically. As noniterative artificial neural network (ANN) based frameworks, the NNRW algorithms enable high training efficiency while still retaining a powerful representation learning capacity [29]. In the field of HSI classification, Xia et al. [30] reported that the general ELM was more accurate and much faster than SVM. Zhou et al. compared ELM with the composite kernel (ELM-CK) to SVM with CK (SVM-CK) and showed that the ELM-based method still holds its advantages [31].
Recently, a new NNRW method called the broad learning system (BLS), which broadly extends the hidden layer of RVFL, has been introduced [32,33]. The main distinctive feature of BLS is that the input data are randomly mapped to features in "feature nodes," which are subsequently transformed by a nonlinear activation function to form "enhancement nodes." Such a higher-order network structure provides an alternative way of learning deep features. In addition, the universal approximation capability of the broad learning system has been proven [33]. Jin et al. developed a robust broad learning system (RBLS) by modifying the regularization terms of the cost function in order to improve its generalization performance when modeling contaminated data [34]. By replacing the feature nodes with Takagi-Sugeno (TS) fuzzy subsystems, Feng and Chen crafted a neurofuzzy model called the fuzzy broad learning system for regression and classification tasks [35]. Kong et al. applied BLS to HSI classification for the first time. The semisupervised framework enabled the proposed method (i.e., semisupervised BLS (SBLS)) to leverage limited labeled samples and substantial unlabeled samples [36]. Although SBLS has shown its advantages over many approaches, including deep-learning-based methods, the potential of BLS in HSI classification is still far from fully exploited.
In this paper, we propose a new framework for HSI classification called BLS-LSDA, which integrates hierarchical spectral-spatial information abstraction, a manifold learning method, and BLS. Our method first extracts spectral-spatial features by iteratively abstracting each pixel's neighborhood in a hierarchical manner. Then the features are input into manifold learning nodes implementing LSDA. The dimension-reduced features, which are discriminative and locality preserving, are sent to the feature nodes and afterward to the enhancement nodes of BLS. A following layer, identical to the previous LSDA layer, is adopted to exploit the intrinsic structure of the high-order features produced by random mapping. Finally, the weights of the output nodes are acquired by a ridge regression learning algorithm. Our contributions are highlighted as follows:
(1) Our method integrates a manifold learning algorithm with BLS in a multilayer neural network model, thus providing enhanced feature representation capacity to BLS.
(2) A novel implementation of spectral-spatial response (SSR) [37], consisting of a Gabor filter and an adaptive weighted filter (AWF), is developed to extract deep features of HSIs without a deep learning scheme.
(3) With comparative experiments conducted on 3 standard HSI datasets, we show the proposed approach's advantage in classification accuracy over state-of-the-art methods.
The rest of this paper is organized as follows. Section 2 gives a brief overview of BLS. In Section 3, we present our method along with the details of the learning algorithm. Section 4 compares the performance of our method with several prominent approaches on three benchmark datasets and analyzes the experimental results. Finally, discussions and conclusions are given in Section 5.

Broad Learning System
Unlike deep neural networks (DNNs), BLS does not need to search gradually for optimized model parameters with backpropagation (BP). The learning procedure of BLS consists of only one step, i.e., performing a matrix inversion to obtain the weights of the links between the nodes of the hidden layer and the output layer. As a single hidden layer feedforward neural network (SLFN), the most prominent characteristic of BLS is the adoption of mapped feature nodes to construct the enhancement nodes, which brings higher feature representation capability. Figure 1 shows the framework of the original BLS. Given the input data set $X$, the $i$th group of mapped feature nodes can be established by the following:

$$Z_i = \varphi_i(X W_{ei} + \beta_{ei}), \quad i = 1, 2, \ldots, n, \tag{1}$$

where $W_{ei}$ is the randomly chosen weights and $\beta_{ei}$ is the bias. It should be noted that different functions $\varphi_i$ can be adopted for the $n$ different groups of the mapped nodes.
Concatenating the mapped nodes, we get the following:

$$Z^n = [Z_1, Z_2, \ldots, Z_n]. \tag{2}$$

Then $Z^n$ is fed into the enhancement nodes to produce further abstraction of the input data as follows:

$$E^m = \phi(Z^n W_{hm} + \beta_{hm}), \tag{3}$$

where $W_{hm}$ and $\beta_{hm}$ are weights and biases, respectively, and $\phi$ is the activation function. Usually, the sigmoid function is used. Eventually, the hidden layer of BLS is as follows:

$$H = [Z^n \mid E^m]. \tag{4}$$

Then the output layer can be obtained by the following:

$$Y = H W_n^m, \tag{5}$$

where $W_n^m$ is the connection weights between the BLS's hidden layer nodes and output layer nodes. Since $H$ and $Y$ are already known in the learning procedure, we can calculate $W_n^m$ by the ridge regression approximation of the pseudoinverse as follows:

$$W_n^m = H^{+} Y = [Z^n \mid E^m]^{+} Y. \tag{6}$$
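The one-step training procedure of equations (1) to (6) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `train_bls` and `predict_bls`, the identity mapping used for the feature nodes, the default node counts, and the closed-form ridge solution $(H^T H + C I)^{-1} H^T Y$ used for the pseudoinverse are all choices made here for illustration.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable form of 1 / (1 + exp(-x)).
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def train_bls(X, Y, n_groups=5, nodes_per_group=10, n_enhance=50, C=1e-3, seed=0):
    """Fit a minimal BLS: random feature nodes, sigmoid enhancement nodes,
    and output weights from ridge regression. All defaults are illustrative."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Equation (1): n groups of randomly mapped feature nodes
    # (the mapping function is taken to be the identity here, an assumption).
    W_e = [rng.standard_normal((d, nodes_per_group)) for _ in range(n_groups)]
    b_e = [rng.standard_normal(nodes_per_group) for _ in range(n_groups)]
    Z = np.hstack([X @ W + b for W, b in zip(W_e, b_e)])   # equation (2)
    # Equation (3): enhancement nodes with a sigmoid activation.
    W_h = rng.standard_normal((Z.shape[1], n_enhance))
    b_h = rng.standard_normal(n_enhance)
    E = sigmoid(Z @ W_h + b_h)
    H = np.hstack([Z, E])                                   # equation (4)
    # Equation (6), with the pseudoinverse approximated by ridge regression.
    W_out = np.linalg.solve(H.T @ H + C * np.eye(H.shape[1]), H.T @ Y)
    return {"W_e": W_e, "b_e": b_e, "W_h": W_h, "b_h": b_h, "W_out": W_out}

def predict_bls(model, X):
    """Equation (5): push new data through the same random maps."""
    Z = np.hstack([X @ W + b for W, b in zip(model["W_e"], model["b_e"])])
    E = sigmoid(Z @ model["W_h"] + model["b_h"])
    return np.hstack([Z, E]) @ model["W_out"]

# Toy usage on two well-separated Gaussian blobs with one-hot targets.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 4)), rng.normal(2, 1, (50, 4))])
labels = np.repeat([0, 1], 50)
model = train_bls(X, np.eye(2)[labels])
acc = (predict_bls(model, X).argmax(axis=1) == labels).mean()
```

No iterative optimization appears anywhere: the only learned quantity is `W_out`, obtained from a single linear solve, which is what makes training fast compared with backpropagation.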

MultiScale Composite Spatial Features.
Spatial information has been utilized for hyperspectral image classification for many years [38,39], along with the recognition that a smoother classification map generally leads to higher classification accuracy [40]. However, for pixels lying along edges, a smoothing filter may jeopardize the gain in classification accuracy. Thus, in most cases, either the spatial information derived by smoothing filters was combined with the raw spectral band values to achieve a trade-off between context-based and isolated pixel values [31], or context-sensitive adaptive filters were designed to produce edge-preserving maps [41]. In this work, we utilize the adaptive weighted filter (AWF) proposed by Zhou and Wei [42] to extract spatial information from HSIs. Meanwhile, given its deficiency in capturing differential information, and inspired by the success of Gabor features in hyperspectral image analysis, where they enhance spatial discrimination in highly contrastive areas [43,44], we exploit the benefit of integrating a simple two-dimensional Gabor filter with the AWF for feature extraction. By assuming that neighboring pixels with similar spectral distributions are more likely to belong to the same class, AWF obtains the weight of each pixel in the neighborhood by evaluating the similarity $s_{i,j}$ between it and the central pixel of the filter. The similarity $s_{i,j}$ is calculated by a Gaussian radial basis function of the difference between $p_{central}$ and $p_{i,j}$, where $p_{central}$ is the central pixel, $p_{i,j}$ is the pixel located at the $i$th row and $j$th column of the neighborhood, and $\sigma$ is the standard deviation of the pixels' differences. Derived from the computational model of the human visual cortical channels, the 2D Gabor filter (https://en.wikipedia.org/wiki/Gabor_filter) has been widely used in computer vision for various low-level tasks [45,46].
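The weighting idea behind the AWF can be illustrated with a short NumPy sketch. It is written under explicit assumptions, since the exact formulas are those of Zhou and Wei [42] and are not reproduced here: the similarity is taken as $\exp(-\lVert p_{i,j} - p_{central}\rVert^2 / (2\sigma^2))$, the weights are the similarities normalized to sum to one, and $\sigma$ is the standard deviation of the spectral distances inside the window. The function names `awf_pixel` and `awf` are hypothetical.

```python
import numpy as np

def awf_pixel(window):
    """Filter one neighbourhood of shape (k, k, bands), k odd.
    Returns the similarity-weighted average spectrum for the central pixel."""
    k = window.shape[0]
    center = window[k // 2, k // 2]
    # Spectral distance of every pixel in the window to the central pixel.
    diff = np.linalg.norm(window - center, axis=2)
    sigma = diff.std()
    if sigma == 0:                       # perfectly uniform window
        return center.astype(float)
    # Assumed Gaussian RBF similarity and normalised weights.
    s = np.exp(-diff**2 / (2 * sigma**2))
    w = s / s.sum()
    return np.tensordot(w, window, axes=([0, 1], [0, 1]))

def awf(image, size=5):
    """Apply the filter to an (rows, cols, bands) image with reflect padding."""
    pad = size // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    out = np.empty(image.shape, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = awf_pixel(padded[r:r + size, c:c + size])
    return out

# Step-edge image: left half 0, right half 10, three bands.
edge = np.zeros((8, 8, 3))
edge[:, 4:] = 10.0
smoothed = awf(edge, size=5)
```

On the step-edge image, a pixel just left of the edge stays close to 0 because the dissimilar pixels across the edge receive small weights, whereas a plain 5 × 5 mean filter would pull it to 4. This is the edge-preserving behaviour the paragraph above describes.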
It is a directional sinusoidal function modulated by a Gaussian envelope on a 2D $(h, v)$ plane, which can be expressed in the complex form as follows:

$$G(h, v; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{h'^2 + \gamma^2 v'^2}{2\sigma^2}\right) \exp\left(i\left(2\pi \frac{h'}{\lambda} + \psi\right)\right),$$

where

$$h' = h\cos\theta + v\sin\theta, \quad v' = -h\sin\theta + v\cos\theta,$$

and where $\lambda$ denotes the wavelength of the sinusoidal factor, $\theta$ is the orientation of the normal to the parallel stripes of the Gabor function, $\psi$ is the phase offset, $\sigma$ denotes the standard deviation of the Gaussian envelope, and $\gamma$ is the spatial aspect ratio that specifies the ellipticity of the support of the Gabor kernel. The two spatial features are then integrated into a multiscale framework. We extract AWF and Gabor features through 5 × 5, 7 × 7, 9 × 9, 11 × 11, and 13 × 13 filters, respectively. At each scale, the convolution is conducted 3 times iteratively, and then the obtained features are sent to the next step. Figure 2 shows a brief view of the extraction of the multiscale composite spatial features.
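As a rough illustration of the Gabor branch of the multiscale framework, the sketch below builds the complex Gabor kernel in its standard form and applies its real part three times at each of the five scales. Several details are assumptions made here, not taken from the paper: the parameter defaults, the use of the real part only, the normalization of each kernel by the sum of its absolute values, the reflect padding, and the helper names `gabor_kernel`, `filter2d`, and `multiscale_gabor`.

```python
import numpy as np

def gabor_kernel(size, lam, theta, psi=0.0, sigma=2.0, gamma=0.5):
    """Complex 2D Gabor kernel on a size x size grid (size odd)."""
    half = size // 2
    v, h = np.mgrid[-half:half + 1, -half:half + 1]
    h_r = h * np.cos(theta) + v * np.sin(theta)     # rotated coordinates
    v_r = -h * np.sin(theta) + v * np.cos(theta)
    envelope = np.exp(-(h_r**2 + gamma**2 * v_r**2) / (2 * sigma**2))
    carrier = np.exp(1j * (2 * np.pi * h_r / lam + psi))
    return envelope * carrier

def filter2d(band, kernel):
    """Same-size sliding-window correlation of one band with a real kernel."""
    pad = kernel.shape[0] // 2
    padded = np.pad(band, pad, mode="reflect")
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def multiscale_gabor(band, sizes=(5, 7, 9, 11, 13), depth=3, lam=4.0, theta=0.0):
    """Filter one band `depth` times at every scale; one output channel per scale."""
    feats = []
    for size in sizes:
        k = gabor_kernel(size, lam, theta).real
        k = k / np.abs(k).sum()          # assumed normalisation to keep values bounded
        out = band.astype(float)
        for _ in range(depth):
            out = filter2d(out, k)
        feats.append(out)
    return np.stack(feats, axis=-1)

K = gabor_kernel(5, lam=4.0, theta=0.0)
band = np.random.default_rng(0).random((16, 16))
features = multiscale_gabor(band)
```

In a full pipeline the same loop would run over every spectral band (or a reduced set of components) and over several orientations; the text does not state which orientations were used, so a single `theta` is shown.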

BLS-LSDA.
We believe that the effectiveness of BLS stems largely from its numerous randomly constructed hidden nodes, which form a "broad" neural network structure. However, excessive nodes generally cause heavy computational and storage consumption, especially when the number of input nodes grows. Moreover, randomly produced nodes have often been criticized for their arbitrariness, which may degrade performance in real-world applications [47]. The underlying structure of the input data that is useful for classification can be retained by discriminant analysis methods, which seek feature representations that emphasize interclass separation. BLS therefore needs a compromise between the arbitrarily created nodes and their usefulness in differentiating input features of different classes. An effective way to achieve this is to introduce LSDA into BLS. Derived from LDA, LSDA improves on its prototype by additionally revealing the local geometrical structure of the data manifold. In this work, we craft a novel neural network model named BLS-LSDA by inserting two layers applying LSDA, as shown in Figure 3. Details of the layers are listed as follows.
(1) A layer is added between the input layer and the hidden layer of BLS to decrease the dimensionality of the input data as well as to enhance its separability; (2) By inserting an extra layer applying LSDA between the hidden layer and the output layer, we further introduce a groupwise feature mapping which benefits the computation of the output weights.
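For each LSDA layer, given $m$ samples $x_1, x_2, \ldots, x_m$ and their labels $l(x_1), l(x_2), \ldots, l(x_m)$ as input, LSDA splits the set of nearest neighbors $N(x_i) = \{x_i^1, \ldots, x_i^k\}$ of $x_i$ into $N_w(x_i)$ and $N_b(x_i)$, which are the nearest neighbors of $x_i$ with the same label and with different labels, respectively:

$$N_w(x_i) = \{x_i^j \mid l(x_i^j) = l(x_i),\ 1 \le j \le k\}, \quad N_b(x_i) = \{x_i^j \mid l(x_i^j) \ne l(x_i),\ 1 \le j \le k\},$$

where $k$ denotes the number of nearest neighbors of $x_i$. Based on $N_w(x_i)$ and $N_b(x_i)$, the within-class graph $G_w$ and the between-class graph $G_b$ are constructed with weight matrices $W_w$ and $W_b$:

$$W_{w,ij} = \begin{cases} 1, & x_i \in N_w(x_j) \text{ or } x_j \in N_w(x_i), \\ 0, & \text{otherwise}, \end{cases} \quad W_{b,ij} = \begin{cases} 1, & x_i \in N_b(x_j) \text{ or } x_j \in N_b(x_i), \\ 0, & \text{otherwise}. \end{cases}$$

In order to map the points in feature space to a line so that the within-class points stay as close as possible while the between-class points stay as far apart as possible, given the map $y = (y_1, y_2, \ldots, y_m)^T$ with $y^T = a^T X$, where $X$ is an $n \times m$ matrix, the projection vector $a$ can be retrieved by

$$a^{*} = \arg\max_{a}\ a^T X \left(\alpha L_b + (1 - \alpha) W_w\right) X^T a \quad \text{s.t.} \quad a^T X D_w X^T a = 1,$$

where $L_b = D_b - W_b$ is the Laplacian matrix of $G_b$, $D_w$ is a diagonal matrix with $D_{w,ii} = \sum_j W_{w,ij}$ (and $D_b$ is defined likewise from $W_b$), and $\alpha$ is a scalar between 0 and 1. See [17] for details.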
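The LSDA layer described above can be sketched with NumPy as follows. This is an illustrative reading of the procedure in [17], not the authors' code: samples are stored as rows (so `X` here is the transpose of the $n \times m$ matrix in the text), neighbors are found by brute force, a small ridge term `eps` is added so that the constraint matrix is invertible, and the function name `lsda` and its defaults are hypothetical.

```python
import numpy as np

def lsda(X, y, d=2, k=5, alpha=0.5, eps=1e-6):
    """Return an (n_features, d) projection matrix.
    X: (m, n_features) samples as rows, y: (m,) integer labels."""
    m, n = X.shape
    # Brute-force k nearest neighbours (self excluded).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)
    nn = np.argsort(sq, axis=1)[:, :k]
    # Within-class and between-class graphs, symmetrised with "or".
    W_w = np.zeros((m, m))
    W_b = np.zeros((m, m))
    for i in range(m):
        for j in nn[i]:
            if y[i] == y[j]:
                W_w[i, j] = W_w[j, i] = 1.0
            else:
                W_b[i, j] = W_b[j, i] = 1.0
    D_w = np.diag(W_w.sum(axis=1))
    L_b = np.diag(W_b.sum(axis=1)) - W_b
    # Generalised eigenproblem A a = mu B a; keep the d largest eigenvalues.
    A = X.T @ (alpha * L_b + (1 - alpha) * W_w) @ X
    B = X.T @ D_w @ X + eps * np.eye(n)      # eps: assumed regulariser
    L = np.linalg.cholesky(B)
    L_inv = np.linalg.inv(L)
    C = L_inv @ A @ L_inv.T                  # symmetric standard problem
    vals, vecs = np.linalg.eigh((C + C.T) / 2)
    top = np.argsort(vals)[::-1][:d]
    return L_inv.T @ vecs[:, top]

# Two classes that differ only along the first of five dimensions.
rng = np.random.default_rng(0)
X0 = rng.normal(0, 1, (40, 5))
X1 = rng.normal(0, 1, (40, 5)) + np.array([5.0, 0, 0, 0, 0])
X, y = np.vstack([X0, X1]), np.repeat([0, 1], 40)
P = lsda(X, y, d=2, k=5)
z = X @ P[:, 0]                              # 1D embedding along the leading direction
gap = abs(z[y == 0].mean() - z[y == 1].mean())
spread = max(z[y == 0].std(), z[y == 1].std())
```

The Cholesky whitening step turns the generalized problem into an ordinary symmetric one, which avoids complex eigenvalues from a non-symmetric solve. For image-sized training sets the brute-force distance matrix would need to be replaced by a proper neighbor search.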
Algorithm 1 depicts the framework of the BLS-LSDA training process. Let $X^{g}_{tr}$ be the Gabor features and $X^{f}_{tr}$ be the AWF features of the training samples $X_{tr}$. They are fed into the first LSDA layer separately to produce the dimension-reduced features $X^{g'}_{tr}$ and $X^{f'}_{tr}$. Then the two features are weighted and concatenated by equation (15) to give $X^{s}_{tr}$. According to equations (1) and (3), we now get the groups of feature nodes and enhancement nodes of BLS-LSDA as follows:

$$Z_i = \varphi(X^{s}_{tr} W_{ei} + \beta_{ei}), \quad i = 1, 2, \ldots, n, \qquad E^m = \phi(Z^n W_{hm} + \beta_{hm}).$$

Each group of mapped features and the enhancement features are fed into the second LSDA layer separately, and then we concatenate the outputs to get $[Z^{n\prime} \mid E^{m\prime}]$ (see Algorithm 1). Finally, the weights of the output layer are calculated with equation (6).

Experimental Result and Analysis
For the first dataset, only 200 bands were used in our experiment in order to avoid the effect of water absorption. The second dataset is the Pavia University dataset collected by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor with 115 bands. The dataset has a spatial size of 610 × 340 pixels with 9 labeled land-cover classes. Due to noise, 12 bands are discarded in our experiment.
The Salinas dataset was also captured by the AVIRIS sensor and is characterized by high spatial resolution (3.7 m/pixel). The ground truth contains 16 classes. 20 water absorption bands were also removed from the Salinas dataset in the experiment.

Parameter Settings.
To quantitatively compare the classification results of BLS-LSDA with some prominent or state-of-the-art methods, including SVM, KELM, SVM-CK, KELM-CK [31], HiFi-We [48], and MASR [49], three frequently used indices, namely overall accuracy (OA), average accuracy (AA), and the kappa coefficient (κ), are adopted in our experiment.

Classification Results. We present our experimental results on each dataset in Tables 2, 3, and 4. Based on the results shown in all tables, we can find that our proposed BLS-LSDA is superior to the classic methods (i.e., SVM and KELM) and their derivations with the spectral-spatial kernel method (i.e., SVM-CK and KELM-CK), as well as to recent prominent methods (i.e., HiFi-We and MASR) that focus on exploring the advantage of spectral-spatial filters in HSI classification. Moreover, in order to explore the utility of the LSDA layers inserted into the BLS model, the classification results using the original BLS are also provided.
For the Indian Pines dataset, our method achieves up to 94.2 ± 0.95% OA, 97.0 ± 0.49% AA, and 93.3 ± 1.10% κ when 40 training samples are used. The advantage of our method is much more obvious over the classic methods than over the other methods; moreover, MASR is better than the proposed method on AA by nearly 0.3%. For the University of Pavia dataset, with the same number of training samples, the classification accuracies are 94.8 ± 0.79% OA, 96.0 ± 0.66% AA, and 93.0 ± 1.09% κ, which generally surpass all the chosen comparative methods by 3-4%. For the Salinas dataset, our BLS-LSDA also achieves higher OA (98.0 ± 0.43%), AA (99.1 ± 0.18%), and κ (97.7 ± 0.48%) than all other compared methods. The advantage of BLS-LSDA is more obvious when the number of training samples is limited (i.e., 5, 10, and 15). As an example, in Table 2, when 10 random samples are used for training, our method achieves an OA nearly 4% higher than that of HiFi-We, which has the highest OA among the comparative methods (81.6 ± 2.26%). We believe that this is due to the high representation learning capacity of our proposed network structure. Moreover, the standard deviations of the classification accuracies of our method are overall lower than those of the other methods, which means that the proposed model is more robust to the randomly chosen training data. However, an exception to this statement can be found in Table 3, indicating that when training samples are extremely limited, LSDA may fail to capture representative features. Figures 5, 6, and 7 visually show the classification results of BLS-LSDA and the other compared methods when r = 40. By visual evaluation, we can conclude that our proposed method is good at balancing the classification accuracy of pixels in both sharp and smooth regions.
Taking the Salinas dataset (Figure 7) as an example, it can be easily seen that our method surpasses the other methods on the homogeneous areas (i.e., the two smooth patches marked with grey circles), while the edges and acute angles are also well preserved.
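For readers who want to reproduce the three indices reported in this section, the following sketch computes OA, AA, and the kappa coefficient from predicted and true labels using their standard definitions. The function name `classification_scores` is hypothetical, and the sketch assumes integer labels in `0..n_classes-1` with every class present in the ground truth.

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Return (OA, AA, kappa) from integer label arrays."""
    cm = np.zeros((n_classes, n_classes), dtype=float)
    np.add.at(cm, (y_true, y_pred), 1)          # rows: truth, columns: prediction
    total = cm.sum()
    oa = np.trace(cm) / total                   # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))  # mean of per-class accuracies
    # Expected chance agreement from the row and column marginals.
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / total**2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1])
oa, aa, kappa = classification_scores(y_true, y_pred, n_classes=2)
# oa = 5/6, aa = 0.875, kappa = 2/3 for this small example
```

AA weights every class equally, so it can exceed OA when small classes are classified well, which is consistent with the pattern of AA being higher than OA in the results above.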

Parameters' Sensitivities Analysis.
We have evaluated the impact of different values of the model's parameters shown in Table 1. The evaluation reveals that the classification results are not sensitive to different h, C, and s. Also, since N1, N2, and N3 are interdependent, we choose these parameters empirically. Besides, it is well recognized that the dimensionality of the reduced subspace is crucial in manifold learning. Here, we investigate the performance of BLS-LSDA with different subspace dimensions on the three benchmark datasets. For each dataset and each class, 20 training samples were randomly selected and the remaining samples were used for testing. All experiments were performed 10 times in order to get the average results. Figure 8 depicts the relationship between the classification results (OA) and the dimensions of the reduced subspace on the three datasets.
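The evaluation protocol in this paragraph (a fixed number of random training samples per class, the remainder for testing, repeated 10 times) can be sketched as below. The helper names `split_per_class` and `repeated_runs` are hypothetical, and the handling of classes with fewer samples than requested is an assumption: the text does not discuss that case, so the sketch simply caps the draw at the class size.

```python
import numpy as np

def split_per_class(labels, n_train=20, seed=0):
    """Pick n_train random indices per class for training; the rest are for testing."""
    rng = np.random.default_rng(seed)
    train = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        # Assumed edge-case handling: never draw more than the class contains.
        train.extend(rng.choice(idx, size=min(n_train, len(idx)), replace=False))
    train = np.sort(np.array(train))
    test = np.setdiff1d(np.arange(len(labels)), train)
    return train, test

def repeated_runs(evaluate, labels, n_runs=10, n_train=20):
    """Call evaluate(train_idx, test_idx) on n_runs random splits;
    return the mean and standard deviation of the returned scores."""
    scores = [evaluate(*split_per_class(labels, n_train, seed=run))
              for run in range(n_runs)]
    return float(np.mean(scores)), float(np.std(scores))

labels = np.repeat([0, 1, 2], 30)
train_idx, test_idx = split_per_class(labels, n_train=20)
# Dummy evaluation function, standing in for "train the model and return OA".
mean_score, std_score = repeated_runs(lambda tr, te: len(te) / len(labels), labels)
```

Using a different seed per run makes each split reproducible while still varying across runs, which is what the "± standard deviation" figures in the result tables summarize.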
For Indian Pines, we can find that when the dimension of the reduced subspace is less than 15, the overall accuracy goes up as the dimension increases; beyond that point, the curve becomes flat. Similar curves can also be observed when the other two datasets are taken. However, their turning points are found when the number of dimensions equals 10.

(1)-(2) [...]
(3) [...] (15), yielding $X^{s}_{tr}$;
(4) for i = 1; i ≤ n; i++ do
(5) Assign random values to $W_{ei}$ and $\beta_{ei}$;
(6) Calculate the mapped feature values $Z_i = \varphi(X^{s}_{tr} W_{ei} + \beta_{ei})$;
(7) end for
(8) Concatenate the mapped feature values to get a mapped feature group $Z^n = [Z_1, Z_2, \ldots, Z_n]$;
(9) Assign $W_{hm}$ and $\beta_{hm}$ with random values;
(10) Calculate the enhancement nodes with $E^m = \phi(Z^n W_{hm} + \beta_{hm})$;
(11) Apply LSDA to each $Z_i$ in $Z^n$ and to $E^m$ in another LSDA layer to get $Z^{n\prime}$ and $E^{m\prime}$;
(12) [...]
(13) Calculate the connection weights between the BLS's hidden layer and the output layer with $W_n^m = [Z^n \mid E^m]^{+} Y$.
ALGORITHM 1: BLS-LSDA training algorithm.

Table 1 (only the rows for g and f are recoverable; one column per dataset):
g: 5, 7, 9, 11, 13 | 5, 7, 9, 11, 13 | 5, 7, 9, 11, 13
f: 5, 7, 9, 11, 13 | 5, 7, 9, 11, 13 | 5, 7, 9, 11, 13
h is the convolution depth of the Gabor and AWF filters, g and f are the neighborhood sizes of Gabor and AWF, respectively, λ is the weight of the spatial information in equation (15), d is the number of dimensions of the reduced subspace in the LSDA algorithm, C and s are the penalty parameter and the enhancement node scaling in BLS, and N1, N2, and N3 are the number of feature node groups, feature nodes per group, and enhancement nodes in BLS, respectively.

Mathematical Problems in Engineering
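The overall data flow of Algorithm 1 can be sketched end to end as follows. This is a structural illustration only, with several loudly stated assumptions: the dimensionality reduction layer is passed in as `reduce_fn`, and the default `pca_projection` is a plain PCA stand-in, not LSDA; the weighting `[lam * Xg', (1 - lam) * Xf']` is an assumed form, because the exact weighting of equation (15) is not available here; the output weights are computed from the reduced features, following the description in the text; and all names and default sizes are hypothetical.

```python
import numpy as np

def pca_projection(F, y, d):
    """Stand-in reducer (plain PCA, ignores the labels y).
    An LSDA routine with the same signature would be used in the real model."""
    Fc = F - F.mean(axis=0)
    _, _, Vt = np.linalg.svd(Fc, full_matrices=False)
    return Vt[:d].T

def sigmoid(x):
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def train_bls_lsda(Xg, Xf, y, n_classes, reduce_fn=pca_projection, lam=0.1,
                   d=4, n_groups=3, nodes=8, n_enh=20, C=1e-3, seed=0):
    """Return the reduced hidden representation H and the output weights."""
    rng = np.random.default_rng(seed)
    Y = np.eye(n_classes)[y]
    # First reduction layer, applied to Gabor and AWF features separately.
    Xg_r = Xg @ reduce_fn(Xg, y, d)
    Xf_r = Xf @ reduce_fn(Xf, y, d)
    # ASSUMED weighting; the paper's equation (15) is not reproduced here.
    Xs = np.hstack([lam * Xg_r, (1 - lam) * Xf_r])
    # Steps (4)-(8): groups of randomly mapped feature nodes.
    Z = [Xs @ rng.standard_normal((Xs.shape[1], nodes)) + rng.standard_normal(nodes)
         for _ in range(n_groups)]
    Zn = np.hstack(Z)
    # Steps (9)-(10): enhancement nodes.
    E = sigmoid(Zn @ rng.standard_normal((Zn.shape[1], n_enh))
                + rng.standard_normal(n_enh))
    # Step (11): second reduction layer on each feature group and on E.
    H = np.hstack([G @ reduce_fn(G, y, d) for G in Z + [E]])
    # Step (13): ridge-regression output weights.
    W_out = np.linalg.solve(H.T @ H + C * np.eye(H.shape[1]), H.T @ Y)
    return H, W_out

# Toy usage: two classes, six-dimensional stand-ins for Gabor and AWF features.
rng = np.random.default_rng(2)
y = np.repeat([0, 1], 50)
shift = np.where(y[:, None] == 0, -2.0, 2.0)
Xg = rng.normal(0, 1, (100, 6)) + shift
Xf = rng.normal(0, 1, (100, 6)) + shift
H, W_out = train_bls_lsda(Xg, Xf, y, n_classes=2)
train_acc = ((H @ W_out).argmax(axis=1) == y).mean()
```

A usable model would also have to store the random weights and both sets of projection matrices so that test samples can be pushed through the same layers; that bookkeeping is omitted to keep the sketch short.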
We also investigate the change of classification accuracies with λ. Figure 9 shows that with λ = 0.1 the model acquires its best performance.

Conclusion
In this paper, we present a novel method for HSI classification which is based on BLS and LSDA. The two algorithms are integrated into a multilayered neural network model. To utilize both the spatial and spectral information, the Gabor filter and the AWF are adopted to produce the input of the BLS-LSDA. Our experiments on three open benchmark datasets have shown its advantages over the compared methods in terms of OA, AA, and κ. Also, our work has shown that high classification accuracy can be achieved with the low-dimensional features produced by the LSDA layers in the model, which translates into computational efficiency in real applications.
We believe that BLS-LSDA is a successful improvement on the original BLS for HSI classification; however, there are still some problems to be tackled. Our future work would address the initialization of weights and offsets with heuristic algorithms instead of random assignments.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.