Hyperspectral Dimensionality Reduction by Tensor Sparse and Low-Rank Graph-Based Discriminant Analysis

Recently, sparse and low-rank graph-based discriminant analysis (SLGDA) has yielded satisfactory results in hyperspectral image (HSI) dimensionality reduction (DR), for which sparsity and low-rankness are simultaneously imposed to capture both the local and global structure of hyperspectral data. However, SLGDA fails to exploit the spatial information. To address this problem, a tensor sparse and low-rank graph-based discriminant analysis (TSLGDA) is proposed in this paper. By regarding the hyperspectral data cube as a third-order tensor, small local patches centered at the training samples are extracted for the TSLGDA framework to maintain the structural information, resulting in a more discriminative graph. Subsequently, dimensionality reduction is performed on the tensorial training and testing samples to reduce data redundancy. Experimental results on three real-world hyperspectral datasets demonstrate that the proposed TSLGDA algorithm greatly improves the classification performance in the low-dimensional space when compared to state-of-the-art DR methods.


Introduction
A hyperspectral image contains a wealth of spectral information about different materials by collecting the reflectance of hundreds of contiguous narrow spectral bands from the visible to infrared electromagnetic spectrum [1][2][3]. However, the redundant information in a hyperspectral image not only increases computational complexity but also degrades classification performance when training samples are limited. Some research has demonstrated that the redundancy can be reduced without a significant loss of useful information [4][5][6][7]. As such, reducing the dimensionality of hyperspectral images is a reasonable and important preprocessing step for subsequent analysis and practical applications.
Dimensionality reduction (DR) aims to reduce the redundancy among features and simultaneously preserve the discriminative information. In general, existing DR methods fall into one of three categories: unsupervised, supervised, and semisupervised. Unsupervised methods do not take the class label information of training samples into consideration. The most commonly used unsupervised DR algorithm is principal component analysis (PCA) [8], which finds a linear transformation by maximizing the variance in the projected subspace. Linear discriminant analysis (LDA) [9] is a representative supervised method that exploits label information. In this paper, a tensor sparse and low-rank graph-based discriminant analysis (TSLGDA) is proposed for DR of hyperspectral images. The main contributions of this work are threefold. (1) To the best of our knowledge, this is the first time that tensor theory, sparsity, and low-rankness are combined in a graph embedding framework. (2) The tensorial structure contains the spectral-spatial information, the sparse and low-rank representation reveals both local and global structure, and the graph preserves the manifold structure; the integration of these three techniques remarkably promotes the discriminative ability of the reduced features in low-dimensional subspaces. (3) The proposed method can effectively deal with the small-training-size problem, even for classes with only two labeled samples.
The rest of this paper is organized as follows. Section 2 briefly describes the tensor basics and some existing DR methods. The proposed TSLGDA algorithm for DR of hyperspectral imagery is provided in detail in Section 3. Parameters discussions and experimental results compared with some state-of-the-art methods are given in Section 4. Finally, Section 5 concludes this paper with some remarks.

Related Work
In this paper, if not specified otherwise, lowercase italic letters denote scalars, e.g., i, j, k, bold lowercase letters denote vectors, e.g., x, y, bold uppercase letters denote matrices, e.g., U, X, and bold uppercase letters with underline denote tensors, e.g., A, X.

Tensor Basics
A multidimensional array is defined as a tensor, represented as $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_n \times \cdots \times I_N}$. We regard $\underline{\mathbf{A}}$ as an $N$-order tensor, corresponding to an $N$-dimensional data array, with its elements denoted as $a_{i_1 \cdots i_n \cdots i_N}$, where $1 \le i_n \le I_n$ and $1 \le n \le N$. Some basic definitions related to tensor operations are provided as follows [28,33,34].

Definition 1. (Frobenius norm): The Frobenius norm of a tensor $\underline{\mathbf{A}}$ is defined as
$\|\underline{\mathbf{A}}\|_F = \left( \sum_{i_1=1}^{I_1} \cdots \sum_{i_N=1}^{I_N} a_{i_1 \cdots i_N}^2 \right)^{1/2}.$

Definition 2. (Mode-n matricizing): The n-mode vectors of an N-order tensor $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_n \times \cdots \times I_N}$ are the $I_n$-dimensional vectors obtained by fixing all indices except $i_n$. The n-mode matrix $\mathbf{A}^{(n)} \in \mathbb{R}^{I_n \times (I_1 \cdots I_{n-1} I_{n+1} \cdots I_N)}$ is composed of all the n-mode vectors arranged as columns. The obtained n-mode matrix is also known as the n-mode unfolding of the tensor $\underline{\mathbf{A}}$.

Definition 3. (Mode-n product): The mode-n product of a tensor $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ with a matrix $\mathbf{U} \in \mathbb{R}^{J_n \times I_n}$ yields $\underline{\mathbf{C}} = \underline{\mathbf{A}} \times_n \mathbf{U} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J_n \times I_{n+1} \times \cdots \times I_N}$, whose entries are computed by
$c_{i_1 \cdots i_{n-1} j_n i_{n+1} \cdots i_N} = \sum_{i_n=1}^{I_n} a_{i_1 \cdots i_n \cdots i_N} \, u_{j_n i_n},$
where $i_k = 1, 2, \ldots, I_k$ $(k \neq n)$ and $j_n = 1, 2, \ldots, J_n$. The mode-n product can also be expressed in terms of the unfolded tensors, $\mathbf{C}^{(n)} = \mathbf{U} \mathbf{A}^{(n)}$, where $\times_n$ denotes the mode-n product between a tensor and a matrix.
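As a concrete illustration, the n-mode unfolding and mode-n product defined above can be sketched in NumPy. Note that the column ordering of the unfolding is a convention; this sketch moves mode n to the front and flattens the remaining modes in C order.

```python
import numpy as np

def unfold(A, n):
    """Mode-n unfolding: move axis n to the front, then flatten the rest."""
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def fold(M, n, shape):
    """Inverse of unfold: reshape and move the first axis back to position n."""
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape(full), 0, n)

def mode_n_product(A, U, n):
    """Mode-n product A x_n U, computed as U @ unfold(A, n), then refolded."""
    shape = list(A.shape)
    shape[n] = U.shape[0]
    return fold(U @ unfold(A, n), n, tuple(shape))
```

With this convention, the identity $\mathbf{C}^{(n)} = \mathbf{U}\mathbf{A}^{(n)}$ holds by construction.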
The condition for tensor contraction is that the two tensors should have the same size along the contracted modes. For example, when the contraction is conducted on all indices except the index n of tensors $\underline{\mathbf{A}}, \underline{\mathbf{B}} \in \mathbb{R}^{I_1 \times \cdots \times I_n \times \cdots \times I_N}$, this operation can be denoted as $[\underline{\mathbf{A}} \otimes \underline{\mathbf{B}}; (n)(n)]$. According to the property of tensor contraction, we have
$[\underline{\mathbf{A}} \otimes \underline{\mathbf{B}}; (n)(n)] = \mathbf{A}^{(n)} \left(\mathbf{B}^{(n)}\right)^T \in \mathbb{R}^{I_n \times I_n}.$
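This contraction property can be checked numerically; a small sketch (the `unfold` helper uses the same move-to-front convention as above):

```python
import numpy as np

def unfold(A, n):
    """Mode-n unfolding: move axis n to the front, then flatten the rest."""
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def contract_except(A, B, n):
    """Contract A and B over every mode except n: equals A_(n) @ B_(n)^T."""
    return unfold(A, n) @ unfold(B, n).T
```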

Sparse and Low-Rank Graph-Based Discriminant Analysis
In [19], sparse graph-based discriminant analysis (SGDA), a supervised DR method, was proposed to extract important features from hyperspectral data. Although SGDA can successfully reveal the local structure of the data, it fails to capture the global information. To address this problem, sparse and low-rank graph-based discriminant analysis (SLGDA) [23] was developed to preserve the local neighborhood structure and the global geometrical structure simultaneously by combining sparse and low-rank constraints. The objective function of SLGDA can be formulated as
$\arg\min_{\mathbf{W}^{(l)}} \; \beta \|\mathbf{W}^{(l)}\|_* + \lambda \|\mathbf{W}^{(l)}\|_1 + \frac{1}{2} \|\mathbf{X}^{(l)} - \mathbf{X}^{(l)} \mathbf{W}^{(l)}\|_F^2,$
where β and λ are two regularization parameters that control the effect of the low-rank term and the sparse term, respectively, $\mathbf{X}^{(l)}$ represents the samples from the lth class in vector form, and $l \in \{1, 2, \ldots, c\}$, in which c is the number of classes. After obtaining the complete graph weight matrix $\mathbf{W} = \mathrm{diag}(\mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \ldots, \mathbf{W}^{(c)})$, the projection operator can be solved from
$\arg\min_{\mathbf{P}} \; \mathrm{tr}\!\left(\mathbf{P}^T \mathbf{X} \mathbf{L}_s \mathbf{X}^T \mathbf{P}\right) \quad \mathrm{s.t.} \; \mathbf{P}^T \mathbf{X} \mathbf{L}_p \mathbf{X}^T \mathbf{P} = \mathbf{I},$
where $\mathbf{L}_s = \mathbf{D} - \mathbf{W}$ is the Laplacian matrix, $\mathbf{D}$ is a diagonal matrix with the ith diagonal entry $D_{ii} = \sum_{j=1}^{N} W_{ij}$, and $\mathbf{L}_p$ may be a simple scale normalization constraint [13]. This can be solved as the generalized eigendecomposition problem
$\mathbf{X} \mathbf{L}_s \mathbf{X}^T \mathbf{p} = \lambda \mathbf{X} \mathbf{L}_p \mathbf{X}^T \mathbf{p}.$
The bth projection vector $\mathbf{p}_b$ is the eigenvector corresponding to the bth smallest nonzero eigenvalue. The projection matrix can be formed as $\mathbf{P} = [\mathbf{p}_1, \ldots, \mathbf{p}_B] \in \mathbb{R}^{d \times B}$, $B \ll d$. Finally, the reduced features are denoted as $\tilde{\mathbf{X}} = \mathbf{P}^T \mathbf{X} \in \mathbb{R}^{B \times M}$.
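Given a learned weight matrix, the projection step above can be sketched as follows. This is a sketch under stated assumptions: $\mathbf{L}_p = \mathbf{D}$ is taken as the scale-normalization constraint, and a small diagonal jitter is added for numerical definiteness (both implementation choices, not prescribed by the text).

```python
import numpy as np
from scipy.linalg import eigh

def slgda_projection(X, W, B, eps=1e-6):
    """Solve X Ls X^T p = lam X Lp X^T p for the B smallest nonzero
    eigenvalues. X: d x M data, W: M x M learned graph weights.
    Lp = D is assumed as the scale-normalization constraint."""
    W = 0.5 * (W + W.T)                  # symmetrize the learned graph
    D = np.diag(W.sum(axis=1))
    Ls = D - W                           # graph Laplacian
    A = X @ Ls @ X.T
    Bm = X @ D @ X.T + eps * np.eye(X.shape[0])  # jitter for definiteness
    lam, vecs = eigh(A, Bm)              # eigenvalues in ascending order
    keep = np.where(lam > 1e-10)[0][:B]  # drop (near-)zero eigenvalues
    return vecs[:, keep]                 # d x B projection matrix P
```

The reduced features are then `P.T @ X`, matching the last equation above.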

Multilinear Principal Component Analysis
In order to obtain a set of multilinear projections that map the original high-order tensor data into a low-dimensional tensor subspace, MPCA directly maximizes the total scatter on each mode n, with the projection matrices $\mathbf{U}^{(i)}$ $(i \neq n)$ of the other modes fixed:
$\arg\max_{\mathbf{U}^{(n)}} \; \mathrm{tr}\!\left(\mathbf{U}^{(n)} \mathbf{S}_T^{(n)} \mathbf{U}^{(n)T}\right),$
where $\mathbf{S}_T^{(n)} = \sum_{k=1}^{M} \mathbf{X}_k^{(n)} \mathbf{X}_k^{(n)T}$ and $\mathbf{X}_k^{(n)}$ is the n-mode unfolding matrix of tensor $\underline{\mathbf{X}}_k$.
The optimal projections of MPCA can be obtained from the eigendecomposition
$\mathbf{S}_T^{(n)} \mathbf{U}^{(n)} = \mathbf{U}^{(n)} \mathbf{D}^{(n)},$
where $\mathbf{U}^{(n)} = [\mathbf{u}_1^{(n)}, \ldots, \mathbf{u}_{d_n}^{(n)}]$ is the eigenvector matrix and $\mathbf{D}^{(n)} = \mathrm{diag}(\lambda_1^{(n)}, \ldots, \lambda_{d_n}^{(n)})$ is the eigenvalue matrix of $\mathbf{S}_T^{(n)}$, in which the eigenvalues are ranked in descending order and $\lambda_j^{(n)}$ is the eigenvalue corresponding to the eigenvector $\mathbf{u}_j^{(n)}$. The optimal projection matrix for mode n is composed of the eigenvectors corresponding to the first $B_n$ largest eigenvalues, i.e., $\mathbf{U}^{(n)} = [\mathbf{u}_1^{(n)}, \ldots, \mathbf{u}_{B_n}^{(n)}]$. After obtaining the projection matrix for each mode, the reduced features can be formulated as
$\underline{\mathbf{Y}}_k = \underline{\mathbf{X}}_k \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \cdots \times_N \mathbf{U}^{(N)},$
where $\mathbf{U}^{(n)} \in \mathbb{R}^{B_n \times I_n}$ $(B_n \le I_n)$.
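A minimal, single-pass sketch of this procedure follows. Full MPCA alternates over the modes until convergence; here each mode-n scatter is built once from the raw unfoldings, a simplifying assumption for illustration.

```python
import numpy as np

def unfold(A, n):
    """Mode-n unfolding: move axis n to the front, then flatten the rest."""
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def mpca_projections(tensors, ranks):
    """One pass of MPCA: for each mode n, eigendecompose the mode-n total
    scatter S_T^n = sum_k X_k^(n) X_k^(n)T and keep the leading B_n
    eigenvectors. tensors: equally sized ndarrays; ranks: size per mode."""
    N = tensors[0].ndim
    Us = []
    for n in range(N):
        S = sum(unfold(X, n) @ unfold(X, n).T for X in tensors)
        lam, vecs = np.linalg.eigh(S)             # ascending order
        Us.append(vecs[:, ::-1][:, :ranks[n]].T)  # B_n x I_n, largest first
    return Us

def project(X, Us):
    """Reduced features Y = X x_1 U_1 x_2 ... x_N U_N via mode products."""
    Y = X
    for n, U in enumerate(Us):
        Y = np.moveaxis(np.tensordot(U, Y, axes=(1, n)), 0, n)
    return Y
```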

Tensor Sparse and Low-Rank Graph-Based Discriminant Analysis
Consider a hyperspectral image as a third-order tensor $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, in which $I_1$ and $I_2$ refer to the width and height of the data cube, respectively, and $I_3$ represents the number of spectral bands ($I_3 = d$). Assume that the kth small patch is composed of the kth training sample and its $i_1 \times i_2$ neighbors, denoted as $\underline{\mathbf{X}}_k \in \mathbb{R}^{i_1 \times i_2 \times d}$. M patches construct the training set $\{\underline{\mathbf{X}}_k\}_{k=1}^{M}$. The training patches belonging to the lth class are expressed as $\{\underline{\mathbf{X}}_{k,l}\}_{k=1}^{M_l}$, where $M_l$ represents the number of patches belonging to the lth class and $l \in \{1, 2, \ldots, c\}$. For convenience of expression, a fourth-order tensor $\underline{\mathbf{X}}^{(l)} \in \mathbb{R}^{i_1 \times i_2 \times d \times M_l}$ is defined to represent these $M_l$ patches, and $\underline{\mathbf{X}} \in \mathbb{R}^{i_1 \times i_2 \times d \times M}$ denotes all training patches for the c classes, where $M = \sum_{l=1}^{c} M_l$. A visual illustration of 3-mode vectors, 3-mode unfolding, and 3-mode product is shown in Figure 1.
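The patch extraction step can be sketched as below. `extract_patches` is a hypothetical helper, and reflect-padding at image borders is an implementation choice not specified in the text.

```python
import numpy as np

def extract_patches(cube, coords, w):
    """cube: I1 x I2 x d HSI; coords: (row, col) positions of training
    pixels; w: odd window size. Returns a fourth-order tensor of shape
    w x w x d x M (borders handled by reflect-padding, an assumption)."""
    r = w // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    patches = [padded[i:i + w, j:j + w, :] for (i, j) in coords]
    return np.stack(patches, axis=-1)
```

Each patch is centered at its training pixel, so the central fiber of patch k equals the spectrum of the kth training sample.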

[Figure 1: HSI cube $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times d}$ and its 3-mode vectors, 3-mode unfolding, and 3-mode product.]

Tensor Sparse and Low-Rank Graph
The previous SLGDA framework can capture the local and global structure of hyperspectral data simultaneously by imposing both sparse and low-rank constraints. However, it may lose some important structural information of hyperspectral data, which presents an intrinsic tensor-based data structure. To overcome this drawback, a tensor sparse and low-rank graph is constructed with the objective function
$\arg\min_{\mathbf{W}^{(l)}} \; \beta \|\mathbf{W}^{(l)}\|_* + \lambda \|\mathbf{W}^{(l)}\|_1 + \frac{1}{2} \left\|\underline{\mathbf{X}}^{(l)} - \underline{\mathbf{X}}^{(l)} \times_4 \mathbf{W}^{(l)}\right\|_F^2, \quad (12)$
where $\mathbf{W}^{(l)} \in \mathbb{R}^{M_l \times M_l}$ denotes the graph weight matrix using labeled patches from the lth class only. As such, with the help of class-specific labeled training patches, the global graph weight matrix $\mathbf{W}$ can be designed with a block-diagonal structure
$\mathbf{W} = \mathrm{diag}\left(\mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \ldots, \mathbf{W}^{(c)}\right). \quad (13)$
To obtain the lth class graph weight matrix $\mathbf{W}^{(l)}$, the alternating direction method of multipliers (ADMM) [35] is adopted to solve problem (12). Two auxiliary variables $\mathbf{Z}^{(l)}$ and $\mathbf{J}^{(l)}$ are first introduced to make the objective function separable:
$\arg\min_{\mathbf{W}^{(l)}, \mathbf{Z}^{(l)}, \mathbf{J}^{(l)}} \; \beta \|\mathbf{Z}^{(l)}\|_* + \lambda \|\mathbf{J}^{(l)}\|_1 + \frac{1}{2} \left\|\underline{\mathbf{X}}^{(l)} - \underline{\mathbf{X}}^{(l)} \times_4 \mathbf{W}^{(l)}\right\|_F^2 \quad \mathrm{s.t.} \; \mathbf{W}^{(l)} = \mathbf{Z}^{(l)}, \; \mathbf{W}^{(l)} = \mathbf{J}^{(l)}. \quad (14)$
The augmented Lagrangian function of problem (14) is given as
$L\left(\mathbf{Z}^{(l)}, \mathbf{J}^{(l)}, \mathbf{W}^{(l)}\right) = \beta \|\mathbf{Z}^{(l)}\|_* + \lambda \|\mathbf{J}^{(l)}\|_1 + \frac{1}{2} \left\|\underline{\mathbf{X}}^{(l)} - \underline{\mathbf{X}}^{(l)} \times_4 \mathbf{W}^{(l)}\right\|_F^2 + \left\langle \mathbf{D}_1, \mathbf{W}^{(l)} - \mathbf{Z}^{(l)} \right\rangle + \left\langle \mathbf{D}_2, \mathbf{W}^{(l)} - \mathbf{J}^{(l)} \right\rangle + \frac{\mu}{2} \left( \left\|\mathbf{W}^{(l)} - \mathbf{Z}^{(l)}\right\|_F^2 + \left\|\mathbf{W}^{(l)} - \mathbf{J}^{(l)}\right\|_F^2 \right), \quad (15)$
where $\mathbf{D}_1$ and $\mathbf{D}_2$ are Lagrangian multipliers, and µ is a penalty parameter. By minimizing the function $L(\mathbf{Z}^{(l)}, \mathbf{J}^{(l)}, \mathbf{W}^{(l)})$, each variable is alternately updated with the other variables fixed. The updating rules are expressed as
$\mathbf{Z}^{(l)}_{t+1} = \Omega_{\beta/\mu_t}\!\left(\mathbf{W}^{(l)}_t + \mathbf{D}_1/\mu_t\right), \quad (16)$
$\mathbf{J}^{(l)}_{t+1} = S_{\lambda/\mu_t}\!\left(\mathbf{W}^{(l)}_t + \mathbf{D}_2/\mu_t\right), \quad (17)$
where $\mu_t$ denotes the learning rate, $\Omega_\tau(\Delta) = \mathbf{Q} S_\tau(\Sigma) \mathbf{V}^T$ is the singular value thresholding (SVT) operator with $\mathbf{Q} \Sigma \mathbf{V}^T$ the singular value decomposition of $\Delta$, and $S_\tau(x) = \mathrm{sgn}(x) \max(|x| - \tau, 0)$ is the soft thresholding operator [36]. By fixing $\mathbf{Z}^{(l)}_{t+1}$ and $\mathbf{J}^{(l)}_{t+1}$, the update of $\mathbf{W}^{(l)}_{t+1}$ can be written as
$\mathbf{W}^{(l)}_{t+1} = \left(\mathbf{G} + \mu_t \left(\mathbf{Z}^{(l)}_{t+1} + \mathbf{J}^{(l)}_{t+1}\right) - \mathbf{D}_1 - \mathbf{D}_2\right) \left(\mathbf{G} + 2\mu_t \mathbf{I}\right)^{-1}, \quad (18)$
where $\mathbf{G} = \mathbf{X}^{(4)} \mathbf{X}^{(4)T}$ and $\mathbf{X}^{(4)}$ is the 4-mode unfolding of $\underline{\mathbf{X}}^{(l)}$. The global similarity matrix $\mathbf{W}$ is obtained from Equation (13) once each sub-similarity matrix corresponding to each class is calculated from problem (12). At this point, a tensor sparse and low-rank graph $G = \{\underline{\mathbf{X}}, \mathbf{W}\}$ is completely constructed with vertex set $\underline{\mathbf{X}}$ and similarity matrix $\mathbf{W}$. The remaining task is to obtain a set of projection matrices $\{\mathbf{U}_n \in \mathbb{R}^{B_n \times I_n}, B_n \le I_n, n = 1, 2, \ldots, N\}$.
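The two proximal operators named above (SVT and soft thresholding) and the W-step can be sketched as follows. The closed-form W update is derived under the assumption that the fidelity term is $\frac{1}{2}\|X^{(4)} - W X^{(4)}\|_F^2$ in the mode-4 unfolding, an interpretation of the objective rather than a formula given verbatim in the text.

```python
import numpy as np

def soft_threshold(X, tau):
    """Elementwise S_tau(x) = sgn(x) * max(|x| - tau, 0)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding: shrink the singular values of X by tau."""
    Q, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (Q * np.maximum(s - tau, 0.0)) @ Vt

def update_w(G, Z, J, D1, D2, mu):
    """W-step under the assumed fidelity term 0.5*||X4 - W X4||_F^2,
    where G = X4 @ X4.T is the Gram matrix of the mode-4 unfolding.
    Setting the gradient to zero gives W (G + 2 mu I) = rhs."""
    rhs = G + mu * (Z + J) - D1 - D2
    A = G + 2.0 * mu * np.eye(G.shape[0])
    return np.linalg.solve(A, rhs.T).T  # rhs @ inv(A); A is symmetric
```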

Tensor Locality Preserving Projection
The aim of tensor LPP is to find transformation matrices $\{\mathbf{U}_1, \mathbf{U}_2, \ldots, \mathbf{U}_N\}$ that map the original tensors into a low-dimensional tensor subspace while preserving the local manifold structure. The optimization problem for tensor LPP can be expressed as
$\arg\min_{\mathbf{U}_1, \ldots, \mathbf{U}_N} \; \frac{1}{2} \sum_{i,j} \left\| \underline{\mathbf{X}}_i \times_1 \mathbf{U}_1 \cdots \times_N \mathbf{U}_N - \underline{\mathbf{X}}_j \times_1 \mathbf{U}_1 \cdots \times_N \mathbf{U}_N \right\|_F^2 W_{ij} \quad \mathrm{s.t.} \; \sum_i \left\| \underline{\mathbf{X}}_i \times_1 \mathbf{U}_1 \cdots \times_N \mathbf{U}_N \right\|_F^2 C_{ii} = 1, \quad (19)$
where $C_{ii} = \sum_j W_{ij}$. It can be seen that the corresponding tensors in the embedded tensor space are expected to be close to each other if the original tensors $\underline{\mathbf{X}}_i$ and $\underline{\mathbf{X}}_j$ are highly similar.
To solve the optimization problem (19), an iterative scheme is employed [33]. First, we assume that $\mathbf{U}_1, \ldots, \mathbf{U}_{n-1}, \mathbf{U}_{n+1}, \ldots, \mathbf{U}_N$ are known and project each tensor along all modes but the nth, i.e., $\underline{\mathbf{X}}_{i,(n)} = \underline{\mathbf{X}}_i \times_1 \mathbf{U}_1 \cdots \times_{n-1} \mathbf{U}_{n-1} \times_{n+1} \mathbf{U}_{n+1} \cdots \times_N \mathbf{U}_N$. With the properties of the tensor trace, the objective function (19) is rewritten as
$\arg\min_{\mathbf{U}_n} \; \mathrm{tr}\!\left(\mathbf{U}_n \Phi \mathbf{U}_n^T\right) \quad \mathrm{s.t.} \; \mathrm{tr}\!\left(\mathbf{U}_n \Psi \mathbf{U}_n^T\right) = 1, \quad (20)$
with
$\Phi = \frac{1}{2} \sum_{i,j} W_{ij} \left(\mathbf{X}^{(n)}_i - \mathbf{X}^{(n)}_j\right)\left(\mathbf{X}^{(n)}_i - \mathbf{X}^{(n)}_j\right)^T, \qquad \Psi = \sum_i C_{ii} \, \mathbf{X}^{(n)}_i \mathbf{X}^{(n)T}_i,$
where $\mathbf{X}^{(n)}_i$ denotes the n-mode unfolding of tensor $\underline{\mathbf{X}}_{i,(n)}$. Finally, the optimal solution of problem (20) is given by the eigenvectors corresponding to the first $B_n$ smallest nonzero eigenvalues of the generalized eigenvalue problem
$\Phi \mathbf{u} = \lambda \Psi \mathbf{u}. \quad (21)$
To solve this problem, the function eig(·) in MATLAB (R2013a, The MathWorks, Natick, MA, USA) is adopted, i.e., [u, Λ] = eig(Φ, Ψ), and the eigenvectors in u corresponding to the first $B_n$ smallest nonzero eigenvalues in Λ are chosen to form the projection matrix. The other projection matrices can be obtained in a similar manner. The complete TSLGDA algorithm is outlined in Algorithm 1.
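The MATLAB eig(Φ, Ψ) step has a direct SciPy analogue; a sketch for one mode (the small diagonal jitter on Ψ is an implementation choice for numerical definiteness):

```python
import numpy as np
from scipy.linalg import eigh

def lpp_mode_projection(Xn, W, Bn, tol=1e-10):
    """One mode of tensor LPP. Xn: list of M partially projected n-mode
    unfoldings X_i^n; W: M x M graph weights. Solves Phi u = lam Psi u
    and keeps eigenvectors of the Bn smallest nonzero eigenvalues."""
    M = len(Xn)
    C = W.sum(axis=1)
    Phi = 0.5 * sum(W[i, j] * (Xn[i] - Xn[j]) @ (Xn[i] - Xn[j]).T
                    for i in range(M) for j in range(M))
    Psi = sum(C[i] * Xn[i] @ Xn[i].T for i in range(M))
    lam, U = eigh(Phi, Psi + tol * np.eye(Psi.shape[0]))  # ascending order
    keep = np.where(lam > tol)[0][:Bn]   # smallest nonzero eigenvalues
    return U[:, keep].T                  # Bn x I_n projection matrix
```

Iterating this over the three modes until the projections stabilize yields the set {U_1, U_2, U_3}.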

Algorithm 1. Tensor sparse and low-rank graph-based discriminant analysis (TSLGDA).
1. Input: training patches $\{\underline{\mathbf{X}}_k\}_{k=1}^{M}$ with class labels, test patches, and parameters β and λ.
2. for l = 1, 2, . . . , c do
3. Initialize $\mathbf{W}^{(l)}_0$, $\mathbf{Z}^{(l)}_0$, $\mathbf{J}^{(l)}_0$, $\mathbf{D}_1$, $\mathbf{D}_2$ to zero; t = 0.
4. repeat
5. Update $\mathbf{Z}^{(l)}_{t+1}$, $\mathbf{J}^{(l)}_{t+1}$, and $\mathbf{W}^{(l)}_{t+1}$ alternately with the other variables fixed.
6. Check convergence conditions on $\mathbf{W}^{(l)}_{t+1}$.
7. Update the multipliers $\mathbf{D}_1$, $\mathbf{D}_2$ and the penalty parameter $\mu_t$.
8. until convergence conditions are satisfied or t > maxIter.
9. end for
10. Construct the block-diagonal weight matrix W according to (13).
11. Compute the projection matrices {U 1 , U 2 , U 3 } according to (21).
12. Compute the reduced features for the training and test patches: $\underline{\mathbf{Y}} = \underline{\mathbf{X}} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3$.
13. Determine the class label of Y by the NN classifier.
14. Output: The class labels of test patches.

Experiments and Discussions
In this section, three hyperspectral datasets are used to verify the performance of the proposed method. The proposed TSLGDA algorithm is compared with several state-of-the-art approaches, including unsupervised methods (e.g., PCA [8], MPCA [25]) and supervised methods (e.g., LDA [9], LFDA [10], SGDA [19], GDA-SS [15], SLGDA [23], and G-LTDA (local tensor discriminant analysis with Gabor filters) [29]). SGDA is implemented using the SPAMS (SPArse Modeling Software) toolbox [38]. The nearest neighbor (NN) classifier is used to classify the projected features obtained by these DR methods. The class-specific accuracy, overall accuracy (OA), average accuracy (AA), and kappa coefficient (κ) are reported for quantitative assessment after ten runs. All experiments are implemented on an Intel Core i5-4590 CPU personal computer (Santa Clara, CA, USA).
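The three summary metrics can be computed from a confusion matrix with the standard formulas; `accuracy_metrics` is an illustrative helper.

```python
import numpy as np

def accuracy_metrics(y_true, y_pred, num_classes):
    """Overall accuracy, average (per-class) accuracy, and kappa
    coefficient from integer labels in {0, ..., num_classes-1}."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                      # rows: true, cols: predicted
    n = cm.sum()
    oa = np.trace(cm) / n                  # fraction correctly labeled
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))  # mean per-class recall
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```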

Experimental Datasets
The first dataset [39] was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over northwest Indiana's Indian Pines test site in June 1992. The AVIRIS sensor generates 220 spectral bands covering the wavelength range of 0.4–2.45 µm. After removing 20 water-absorption bands (bands 104–108, 150–163, and 220), a total of 200 bands are used in the experiments. The image, with 145 × 145 pixels, represents a rural scenario with 16 different land-cover classes. The numbers of training and testing samples in each class are listed in Table 1. The second dataset [39] is the University of Pavia image, collected by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor in Italy. The image has 103 bands after removing 12 noisy bands, with a spectral coverage from 0.43 to 0.86 µm, and covers a region of 610 × 340 pixels. There are nine ground-truth classes, from which we randomly select training and testing samples as shown in Table 1.
The third dataset [39] was also collected by the AVIRIS sensor, over the Valley of Salinas, Central Coast of California, in 1998. The image comprises 512 × 217 pixels with a spatial resolution of 3.7 m, and only 204 bands are preserved after removing 20 water-absorption bands. Table 2 lists the 16 land-cover classes and the numbers of training and testing samples.

Parameters Tuning
For the proposed method, four important parameters (i.e., the regularization parameters β and λ, the window size, and the spectral dimension), which can be divided into three groups, need to be determined before proceeding to the following experiments. β and λ control the effect of the low-rank term and the sparse term in the objective function, respectively, and can be tuned together, while the window size and the spectral dimension form another two groups that can be determined separately. When analyzing one group of parameters, the parameters of the other groups are fixed at their chosen values. Following many existing DR methods [22][23][24] and tensor-based studies [26,28], the window size is first set as 9 × 9 for the Indian Pines and Salinas datasets and 7 × 7 for the University of Pavia dataset; the initial value of the spectral dimension is set as 30 for all three datasets, at which the performance basically reaches a steady state.

Regularization Parameters for TSLGDA
With the initial values of window size and spectral dimension fixed, β and λ are first tuned to achieve better classification performance. Figure 2 shows the overall classification accuracy with respect to different β and λ, obtained by fivefold cross-validation, for the three experimental datasets. It can be clearly seen that the OA reaches its maximum for certain combinations of β and λ. Accordingly, the optimal values of (β, λ) are set as (0.01, 0.1) for the Indian Pines dataset, which is also an appropriate choice for the University of Pavia dataset, while (0.001, 0.1) is chosen for the Salinas data.

Window Size for Tensor Representation
For tensor-based DR methods, i.e., MPCA and TSLGDA, the window size (or patch size) is another important parameter. Note that small windows may fail to cover enough spatial information, whereas large windows may contain multiple classes, resulting in complicated analysis and a heavy computational burden. Therefore, the window size is searched in the range of {3 × 3, 5 × 5, 7 × 7, 9 × 9}. β and λ are fixed at their tuned values, while the spectral dimensions are still set to their initial values for the three datasets. Figure 3 presents the variation of the classification performance of MPCA and TSLGDA with different window sizes for the experimental datasets. It can be seen that the window size for both MPCA and TSLGDA can be chosen as 9 × 9 for the Indian Pines and Salinas datasets, while the optimal values are 5 × 5 and 7 × 7, respectively, for the University of Pavia dataset. This may be because the former two datasets represent rural scenarios containing large homogeneous regions, while the University of Pavia data were obtained from an urban area with small homogeneous regions. To evaluate the classification performance using the low-dimensional data, the 1NN classifier is adopted in this paper.

The Number of Spectral Dimension for TSLGDA
According to [28], {1, 1} is set as the reduced dimensionality of the first two (i.e., spatial) dimensions. The third (i.e., spectral) dimension is considered carefully by keeping the tuned values of β, λ, and window size fixed. Figure 4 shows the overall classification accuracy with respect to the spectral dimension for the three hyperspectral datasets. Obviously, due to the spatial information contained in the tensor structure, tensor-based DR methods (i.e., MPCA, TSLGDA) outperform vector-based DR methods (i.e., PCA, SGDA, GDA-SS, SLGDA). According to [29,37], G-LTDA can automatically obtain the optimal reduced dimensions during the optimization procedure; therefore, the spectral dimension of G-LTDA is not discussed here. For the Indian Pines dataset, the performance of all considered methods increases with the spectral dimension and then stabilizes at the maximum value. Similar results can also be observed for the University of Pavia and Salinas datasets. In all cases, TSLGDA outperforms the other DR methods even when the spectral dimension is as low as 5. In the following assessment, {1, 1, 30} and {1, 1, 20} dimensions are used for classification for the two AVIRIS datasets and the one ROSIS dataset, respectively.

Classification Accuracy
Tables 3-5 present the classification accuracy of each individual class, OA, AA, and the kappa coefficient for the three experimental datasets, respectively. Obviously, the proposed method provides better results than the other compared methods on almost all classes; meanwhile, its OA, AA, and kappa coefficient are also higher. Specifically, compared with all considered methods, TSLGDA yields about 2% to 30%, 5% to 20%, and 2% to 12% gains in OA with limited training sets for the three datasets, respectively. Even for classes with few labeled training samples, such as classes 1, 7, and 9 in the Indian Pines data, the proposed TSLGDA algorithm offers a great improvement in performance as well. Besides TSLGDA, MPCA and G-LTDA also obtain much higher accuracies than the other vector-based methods, which effectively demonstrates the advantage of tensor-based techniques. In addition, SLGDA yields better results than SGDA (about 3%, 1%, and 0.6% gains) by simultaneously exploiting the properties of sparsity and low-rankness, while GDA-SS is superior to SGDA because it considers a spectral similarity measurement based on spectral characteristics when constructing the graph.
Table 3. Classification accuracy (%) and standard deviation of different methods for the Indian Pines data when the reduced dimension is 30.

Classification Maps
In order to show the classification results more directly, classification maps of all considered methods are provided in Figures 5-7 for the three experimental datasets, respectively. From Figure 5, it can be clearly seen that the proposed method obtains much smoother classification regions than the other methods, especially for class 1 (Alfalfa), class 2 (Corn-notill), class 3 (Corn-mintill), and class 12 (Soybean-clean), whose spectral characteristics are highly correlated with those of other classes. Similar results can also be observed in Figures 6 and 7, where class 1 (Asphalt), class 6 (Bare Soil), and class 8 (Self-Blocking Bricks) in the second dataset, and class 8 (Grapes untrained) and class 15 (Vineyard untrained) in the third dataset, are labeled more precisely. These observations are consistent with the quantitative results listed in Tables 3-5.

The Influence of Training Size
To show the influence of the training size, several of the considered DR methods are tested with varying numbers of training samples. The results are given in Figure 8, which shows that the OA values of all methods improve as the number of training samples increases for the three datasets. Due to the spatial structure information contained in the tensor, the proposed method always performs better than the other methods. In addition, with the label information, the supervised DR methods (i.e., SGDA, GDA-SS, SLGDA, G-LTDA, TSLGDA) achieve better results than the unsupervised DR methods (i.e., PCA, MPCA).

The Analysis of Computational Complexity
For the comparison of computational complexity, we take the Indian Pines data as an example. Table 6 shows the time requirements of all considered methods, from which it can be clearly seen that traditional methods (e.g., PCA, LDA, LFDA) run faster than other recently proposed methods. In addition, due to complicated tensor computation, tensor-based DR methods (e.g., MPCA, G-LTDA, TSLGDA) cost more time than vector-based methods (e.g., SGDA, GDA-SS, SLGDA). Although TSLGDA has the highest computational complexity, it yields the best classification performance. In practice, the general-purpose graphics processing units (GPUs) can be adopted to greatly accelerate the TSLGDA algorithm.

Conclusions
In this paper, we have proposed a tensor sparse and low-rank graph-based discriminant analysis method (i.e., TSLGDA) for dimensionality reduction of hyperspectral imagery. The hyperspectral data cube is taken as a third-order tensor, from which sub-tensors (local patches) centered at the training samples are extracted to construct the sparse and low-rank graph. On the one hand, by imposing both the sparse and low-rank constraints on the objective function, the proposed method is capable of capturing the local and global structure simultaneously. On the other hand, due to the spatial structure information introduced by tensor data, the proposed method can improve the graph structure and enhance the discriminative ability of reduced features. Experiments conducted on three hyperspectral datasets have consistently confirmed the effectiveness of our proposed TSLGDA algorithm, even for small training size. Compared to some state-of-the-art methods, the overall classification accuracy of TSLGDA in the low-dimensional space improves about 2% to 30%, 5% to 20%, and 2% to 12% for three experimental datasets, respectively, with increased computational complexity.