Sketch-Based Subspace Clustering of Hyperspectral Images

Sparse subspace clustering (SSC) techniques provide the state-of-the-art in clustering of hyperspectral images (HSIs). However, their computational complexity hinders their applicability to large-scale HSIs. In this paper, we propose a large-scale SSC-based method, which can effectively process large HSIs while also achieving improved clustering accuracy compared to the current SSC methods. We build our approach on the emerging concept of sketched subspace clustering, which, to our knowledge, has not been explored in hyperspectral imaging before. Moreover, there are only scarce results on any large-scale SSC approaches for HSI. We show that a direct application of sketched SSC does not provide satisfactory performance on HSIs, but that it provides an excellent basis for an effective and elegant method, which we build by extending this approach with a spatial prior and deriving the corresponding solver. In particular, a random matrix constructed by the Johnson-Lindenstrauss transform is first used to sketch the self-representation dictionary as a compact dictionary, which significantly reduces the number of sparse coefficients to be solved, thereby reducing the overall complexity. In order to alleviate the effect of noise and within-class spectral variations of HSIs, we employ a total variation constraint on the coefficient matrix, which accounts for the spatial dependencies among the neighbouring pixels. We derive an efficient solver for the resulting optimization problem, and we theoretically prove its convergence under mild conditions. The experimental results on real HSIs show a notable improvement in comparison with the traditional SSC-based methods and the state-of-the-art methods for clustering of large-scale images.


Introduction
Hyperspectral images (HSIs), acquired by hyperspectral cameras, record the spectrum of materials over a wide range of wavelengths. The rich spectral information of HSIs enables discriminating materials that are often visually indistinguishable, which has led to a number of applications in remote sensing, such as target detection [1,2], environmental monitoring [3], geosciences, and defense and security [4]. It is often desired to categorize the pixels in the imaged scene into different classes, corresponding to different materials or different types of objects. When no training data are available, this task is called clustering. Hence, clustering is also referred to as unsupervised classification.
The two most popular clustering methods are fuzzy c-means (FCM) [5] and k-means [6,7], owing to their simplicity and computational efficiency. They group data points by minimizing the distance between each data point and the cluster centroids, which are updated iteratively. However, their performance is sensitive to initialization and noise.
Recently, spectral clustering-based methods [8][9][10][11][12][13] have achieved great success and have been widely applied in various fields due to their excellent performance and robustness to noise [14]. In general, these methods first define a similarity matrix to construct a graph of data points, which is learned from the input data under different criteria. Then, the resulting similarity matrix is used within the spectral clustering framework. The performance of spectral clustering heavily depends on the similarity matrix [14]; hence, its construction is a crucial step. Many of these methods, like local subspace affinity (LSA) [15], spectral local best-fit flats (SLBF) [16] and locally linear manifold clustering (LLMC) [17], build the similarity matrix with the k nearest neighbours (KNN), using the angle or Euclidean distance between two data points. This approach tends to treat erroneously the data points near the intersection of two subspaces, because their closest points often lie in another subspace.
The recent sparse subspace clustering (SSC) method [11] constructs the similarity matrix based on the self-expressiveness model, where the input data is employed as the representation dictionary. SSC models a high-dimensional data space as a union of low-dimensional subspaces. The key insight is that, for each data point in a subspace S_i, the global solution of the sparse coding problem with the self-representation dictionary automatically selects data points from the same subspace S_i. Thus, each data point gets automatically represented as a sparse linear or affine combination of other points in the same subspace. This is called the subspace preserving property and is explicitly expressed by the non-zero entries of the coefficient matrix C: the i-th and j-th data points are in the same subspace if C_{i,j} ≠ 0. The coefficient matrix leads directly to the similarity matrix for spectral clustering.
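To make the subspace preserving property concrete, the following minimal numpy experiment (an illustrative sketch, not the paper's solver) draws noiseless points from two 1-D subspaces and sparsely codes them over themselves with a simple ISTA solver for the ℓ1-penalized self-representation; the recovered coefficients connect only same-subspace points:

```python
import numpy as np

rng = np.random.default_rng(0)
# Ten noiseless points on two 1-D subspaces (lines through the origin) in R^3.
b1, b2 = rng.standard_normal(3), rng.standard_normal(3)
Y = np.hstack([np.outer(b1, rng.uniform(1, 2, 5)),
               np.outer(b2, rng.uniform(1, 2, 5))])        # 3 x 10 data matrix

def ssc_coefficients(Y, lam=0.1, n_iter=2000):
    """ISTA for min_C lam*||C||_1 + 0.5*||Y - Y C||_F^2 with diag(C) = 0."""
    N = Y.shape[1]
    C = np.zeros((N, N))
    step = 1.0 / np.linalg.norm(Y, 2) ** 2      # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        C -= step * (Y.T @ (Y @ C - Y))         # gradient step on the fidelity term
        C = np.sign(C) * np.maximum(np.abs(C) - lam * step, 0)  # soft-threshold
        np.fill_diagonal(C, 0)                  # exclude the trivial self-representation
    return C

C = ssc_coefficients(Y)
# Subspace preserving: each point is represented by points from its own line,
# so the cross-subspace ("off-block") coefficients (nearly) vanish.
in_block = np.abs(C[:5, :5]).sum() + np.abs(C[5:, 5:]).sum()
off_block = np.abs(C[5:, :5]).sum() + np.abs(C[:5, 5:]).sum()
```

The dominant coefficients form a block-diagonal pattern aligned with the two subspaces, which is exactly the structure that the similarity matrix inherits.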
As the SSC model calculates sparse coefficients individually and independently for each input data point, its clustering performance is sensitive to noise. In order to solve this problem, various extensions have been proposed with the aim to encode the spatial dependencies among neighbouring data points in hyperspectral data, and thereby obtain more accurate similarity matrices and improved clustering results [18][19][20][21][22][23][24][25]. Guo et al. [18,19] focus on the clustering of 1-D drill hole hyperspectral data and regularize the coefficients of data points neighbouring in depth to be similar by an ℓ1-norm-based smoothing regularization. For 2-D spatial-wise hyperspectral images, a smoothing strategy was introduced in Reference [20] by minimizing the difference between the coefficients corresponding to the central pixel and to the mean of the pixels in a local square window. A kernel version of SSC incorporating max pooling of the sparse coefficient matrix was presented in Reference [21]. The spectral-spatial SSC method of Reference [22] integrates an ℓ2 spatial regularizer with the SSC model (L2-SSC), to penalize abrupt differences between the coefficients of nearby pixels. In References [23,25], an ℓ1,2-norm constraint on the coefficients of pixels in each local region was incorporated in the SSC model. Based on the collaborative representation with an ℓ2-norm constraint on the coefficients, a novel model with a locally adaptive dictionary was proposed in Reference [24].
While showing excellent performances, the above mentioned methods are also of considerable computational complexity, resulting from iterative optimization. The time complexity in each iteration is typically in the range of O((MN)^3), where M and N are the number of rows and columns in each band. For large-scale HSIs with millions of pixels in each band, this bound can thus exceed 10^18 elementary operations per iteration, and such processing often becomes infeasible on common computing platforms. The approaches reported in References [26,27] addressed this problem by constructing a graph based on a set of selected representative samples. In combination with modified spectral clustering methods, a lower complexity has been reached, but the clustering results are sensitive to the initially selected samples. Recently, some generalized large-scale methods [28][29][30] based on SSC have been proposed for clustering tasks in computer vision. In Reference [28], a scalable SSC method was designed for large-scale data sets, where a small part of the samples are first randomly selected and clustered with the SSC model, and then clustering of the remaining samples is performed by sparse coding with respect to the dictionary constructed from the previously selected samples. The work in Reference [29] studied an efficient SSC model based on orthogonal matching pursuit (OMP) and discussed theoretical conditions for subspace preserving representation. A recent sketched SSC model of Reference [30] lowers the computational burden of SSC by using a clever random projection technique to sketch and compress the input data to a computationally affordable level. While these large-scale SSC-based methods demonstrated success in real applications with facial images, handwritten text and news corpus data, to the best of our knowledge none of them was applied before in the clustering of HSIs.
Our experiments show that despite the scalability of these methods, their clustering performance in HSIs turns out to be poor. This can be attributed to the complex spatial structure of HSIs, spectral noise and spectral variability.
In view of this, we propose a sketched sparse subspace clustering method with total variation (TV) spatial regularization, termed Sketch-SSC-TV, which can handle large-scale HSIs while achieving a high level of clustering performance. A sketching matrix constructed as a random matrix is first employed to build a sketched dictionary, which is much smaller than the self-representation dictionary, resulting in a significant reduction of the number of coefficients to be solved. By incorporating the spatial constraint as the TV norm on the coefficient matrix, the proposed model greatly promotes the connectivity of neighbouring pixels and improves the piecewise smoothness of the clustering maps. Furthermore, we propose an algorithm with theoretically guaranteed convergence to solve the resulting optimization problem. By adopting the sketching matrix, the optimization complexity of the TV-related sub-problem reduces from O((MN)^2 log(MN)) to O(MN·n·log(MN)) (n ≪ MN), thus greatly facilitating the processing of large-scale data. The similarity matrix is constructed by applying KNN on the obtained coefficient matrix, and further employed within the spectral clustering method. Experiments conducted on four HSIs show superior clustering performance compared to both traditional SSC-based methods and the related large-scale clustering methods. The major contributions of the paper can be summarized as follows.

1.
The most important contribution of this paper is a new SSC-based framework, which can be applied to large-scale HSIs while achieving excellent clustering accuracy. To the best of our knowledge, this is the first work to address the large-scale clustering problem of HSIs based on the SSC model.

2.
Different from the traditional SSC-based methods, which use all the input data as a dictionary, we adopt a compressed dictionary by using a random projection technique to reduce the dictionary size, which effectively enables a scalable subspace clustering approach.

3.
To account for the spatial dependencies among the neighbouring pixels, we incorporate a powerful TV regularization in our model, leading to a more discriminative coefficient matrix. The resulting model proves to be more robust to spectral noise and spectral variability.

4.
We develop an efficient algorithm to solve the resulting optimization problem and prove its convergence property theoretically.
The rest of this paper is organized as follows. Section 2 briefly introduces the clustering of HSIs with the SSC model. Section 3 describes the proposed Sketch-SSC-TV model and the resulting optimization problem. Experimental results on real HSIs are presented in Section 4. Section 5 concludes the paper.

HSI Clustering with the SSC Model
Let a B-band HSI be denoted as Y ∈ R^{B×MN}, where the i-th column y_i ∈ R^B represents the spectral signature of the i-th pixel in the HSI and MN is the total number of pixels. Sparse subspace clustering (SSC) partitions the high-dimensional data space into a union of lower dimensional subspaces. Concretely, it assumes that all high-dimensional data points y_i, that is, the spectral signatures of all the pixels from a given HSI Y, are drawn from a union of subspaces, each of which corresponds to a particular class.
The key idea is that, among the infinitely many possibilities to represent a data point y_i in terms of other points, a sparse representation will select a few points that belong to the same subspace as y_i. This is known as the subspace preserving property. Thus, SSC starts from a self-representation model where the input data matrix Y is employed as a dictionary: Y = YC, and infers the coefficient matrix C ∈ R^{MN×MN} by solving the sparse coding problem (requiring that C is sparse) and ensuring that the trivial solution, where each sample would simply be represented by itself, is excluded. The non-zero entries in C then indicate directly which data points lie within common subspaces. Formally, SSC solves the following optimization problem:

arg min_C λ||C||_1 + (1/2)||Y − YC||_F^2  s.t.  diag(C) = 0,  1^T C = 1^T,  (1)

where 1 is an all-one vector; diag(C) is the diagonal matrix that retains the main diagonal of C, with entries outside the main diagonal being zero; and λ is a parameter, which controls the balance between the data fidelity and the sparsity of the coefficient matrix. The constraint diag(C) = 0 is introduced to avoid the trivial solution of representing a sample by itself, and the second constraint 1^T C = 1^T ensures that each data point is an affine combination of other data points. The problem in (1) can be solved by the ADMM algorithm [31], with a time complexity of O((MN)^2 B + (MN)^3 (I + 1)), where I is the number of iterations. The coefficient matrix C yields directly the dependence structure among the data points: a non-zero entry C_{ij} indicates that the samples y_i and y_j are in the same class. Thus, it is reasonable to construct the similarity matrix W ∈ R^{MN×MN} as

W = |C| + |C|^T,  (2)

where |C| takes the absolute values of the entries of C. The symmetric structure of W ensures that each pair of samples is connected if either sample is selected to represent the other, which results in a strengthened connection of the graph. The similarity matrix W is then used as an input to spectral clustering [32] to produce the clustering result.
Specifically, the Laplacian matrix L ∈ R^{MN×MN} is first formed as

L = D − W,  (3)

where D ∈ R^{MN×MN} is a diagonal matrix with D_ii = ∑_j W_ij [33]. Then the c eigenvectors {v_k}_{k=1}^{c} of L corresponding to the c smallest non-zero eigenvalues of L are calculated via singular-value decomposition (SVD). Finally, the clustering result is obtained by running k-means clustering on the rows of the MN × c matrix V = [v_1, v_2, ..., v_c].
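The spectral clustering pipeline outlined above (similarity matrix → Laplacian → eigenvectors → k-means) can be sketched as follows. This is a minimal numpy illustration, assuming the unnormalized Laplacian L = D − W and a simple farthest-point-initialized k-means; it is not the exact implementation used in the paper:

```python
import numpy as np

def spectral_clustering(W, c):
    """Cluster the nodes of a similarity graph W into c groups."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                          # unnormalized graph Laplacian
    evals, evecs = np.linalg.eigh(L)            # eigenvalues in ascending order
    V = evecs[:, :c]                            # embedding: c smallest eigenvectors
    # Plain k-means on the rows of V (farthest-point init for determinism).
    centers = [V[0]]
    for _ in range(1, c):
        d2 = np.min([((V - ctr) ** 2).sum(1) for ctr in centers], axis=0)
        centers.append(V[int(np.argmax(d2))])
    centers = np.array(centers)
    for _ in range(50):
        labels = np.argmin(((V[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(c):
            if np.any(labels == k):
                centers[k] = V[labels == k].mean(axis=0)
    return labels

# Two well-separated groups: a block-diagonal similarity matrix.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
labels = spectral_clustering(W, 2)
```

For a graph with two disconnected components, the two smallest eigenvalues of L are zero and the corresponding eigenvectors are constant on each component, so k-means trivially separates the groups.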

Sketch-SSC-TV Model for HSIs
In this section, we first introduce our new SSC-based model with TV regularization (SSC-TV), to effectively account for the spatial dependencies between the input data points. Next, we incorporate a random sketching technique into this model, leading to our unified Sketch-SSC-TV model for large-scale HSIs. Finally, we develop an efficient optimization algorithm for the resulting model based on ADMM.

The SSC-TV Model
HSIs not only record the spectrum of materials in the spectral domain but also capture the distribution of ground-objects in the spatial domain. As the distribution of ground materials typically shows some continuity, HSIs are composed of various nearly homogeneous regions made of pixels that belong to the same class with a very high probability [34][35][36][37][38][39][40]. For this reason, spectral signatures of pixels within small local regions are typically very similar. Conversely, the pixels belonging to different classes are more likely to occupy different spatial locations and exhibit significantly different spectral characteristics. HSI clustering assigns pixels to distinct groups according to the spectral similarities, such that the pixels from the same group are more similar to each other than to those from other groups. Here, each cluster is viewed as a subspace. Thus, by conducting subspace clustering, the pixels of HSIs in local homogeneous regions are likely to be grouped together in the same cluster. At the same time, the pixels showing significant spectral differences, which are typically also spatially separated, are assigned to different clusters. This way, subspace clustering results in a meaningful interpretation of the spatial content of HSIs. Thus, ideally, the subspaces in the spectral domain correspond to the cluster structure in the spatial domain. However, due to noise and spectral variability, the actual results of the subspace clustering model differ from the ideal cluster structure, and do not agree perfectly with the spatial content. Such sensitivity to noise and to spectral variability is inherent to all the methods that perform pixel-wise processing, and thus also to the SSC model in (1), where sparse coefficients are calculated independently for each pixel.
Random variations in the recorded spectral responses affect the solution of the sparse coding problem such that in the resulting sparse representation some data points may be represented by data points from different subspaces. This degrades the construction of the similarity matrix, thereby deteriorating spectral clustering performance. We aim to alleviate this problem by imposing a spatial constraint that makes the model less sensitive to random spectral variations of individual pixels.
To accommodate the fact that pixels within a local homogeneous region are likely to belong to the same class, we explicitly require that the sparse coefficients of nearby pixels be mutually similar, that is, that they select similar sets of pixels in the subspace-sparse representation. Formally, this means that the coefficient matrix C exhibits a certain local smoothness. Recall that pixels y_i and y_j are likely to belong to the same class if C_{ij} ≠ 0. In reality, C_{ij} is rarely exactly 0, but the larger |C_{ij}|, the more likely it is that y_i and y_j are from the same class. Ideally, y_i, as an atom in the dictionary, contributes only to the representation of pixels in the same class. Since neighbouring pixels from a local region of the input image Y usually belong to the same class, in the ideal case all of them are likely to select the same atoms in the subspace representation. Hence, any row of C, c_i = [C_{i1}, C_{i2}, ..., C_{iMN}], composed of the coefficients that correspond to an atom y_i, will reflect some aspect of the spatial structure of the HSI. In other words, the ideal coefficients should reflect the local smoothness and discontinuities that are present in the original HSI, as shown in Figure 1, where each c_i is reshaped to an M × N 2-D slice. This motivates us to introduce TV spatial regularization on the sparse coefficients, which effectively promotes piece-wise smoothness while preserving sharp transitions among distinct regions.
Let x ∈ R^{MN} denote a vector of raster-scanned pixel values from a grayscale image of size M × N and define the anisotropic TV norm as

||x||_TV = ||H_x x||_1 + ||H_y x||_1,  (4)

where H_x and H_y are the forward finite-difference operators in the horizontal and vertical directions, respectively, with periodic boundary conditions. (An alternative isotropic TV norm formulation is ||x||_TV = ∑_i ((H_x x)_i^2 + (H_y x)_i^2)^{1/2}.) For a 2-D matrix Y ∈ R^{B×MN} reshaped from a 3-D HSI Y ∈ R^{M×N×B}, the corresponding TV norm is formulated as the sum of the TV norms of its rows:

||Y||_TV = ∑_{b=1}^{B} ||y_b||_TV,  (5)

where y_b denotes the b-th row of Y. Now we incorporate the spatial constraint into the SSC model. In particular, we impose the TV norm as defined above on the rows of the sparse coefficient matrix C, and formulate our SSC-TV model as

arg min_C λ||C||_1 + (1/2)||Y − YC||_F^2 + λ_tv ||C||_TV  s.t.  diag(C) = 0,  1^T C = 1^T,  (6)

where λ and λ_tv are two penalty parameters corresponding to the sparsity and the spatial constraint, respectively. As with the standard SSC, the similarity matrix is obtained from C by applying (2), and fed into spectral clustering. The TV norm imposed on C promotes the local smoothness of the resulting subspace-sparse representation, which encourages neighbouring pixels to select a common set of pixels from the same class. Since pixels belonging to the same class tend to be spatially clustered as well (within one or multiple local regions), this locally smooth coefficient structure will also lead to an improved agreement of the resulting spectral clustering with the underlying spatial structure.
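As a concrete illustration of the anisotropic TV norm with periodic boundary conditions, the following minimal numpy sketch evaluates ||H_x x||_1 + ||H_y x||_1 on a flat image (TV = 0) and on an image with a single vertical step edge; the function name is illustrative:

```python
import numpy as np

def tv_aniso(img):
    """Anisotropic TV of a 2-D image: ||H_x x||_1 + ||H_y x||_1 with
    forward finite differences and periodic boundary conditions."""
    dx = np.roll(img, -1, axis=1) - img    # horizontal forward difference
    dy = np.roll(img, -1, axis=0) - img    # vertical forward difference
    return np.abs(dx).sum() + np.abs(dy).sum()

flat = np.ones((4, 4))                     # constant image: TV = 0
step = np.zeros((4, 4))
step[:, 2:] = 1.0                          # one vertical step edge
# With periodic boundaries, the step contributes 4 at the interior edge
# and 4 more at the wrap-around edge, for a total TV of 8.
```

A piece-wise constant image incurs a TV cost proportional only to the length of its edges, which is exactly why the TV penalty preserves sharp region boundaries while flattening noise.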

Figure 1.
A motivation for applying the TV norm in the sparse subspace clustering (SSC) model. In the ideal case, the coefficient matrix of a hyperspectral image (HSI) should be piece-wise smooth in local regions and exhibit edges similar to those of the original HSI, which becomes apparent after reshaping it to a 3-D cube. Observe that each M × N slice in this cube corresponds to one row of the 2-D matrix C and resembles the spatial structures in the original HSI. In order to preserve such local smoothness and the edge structure of the coefficient matrix, the TV spatial constraint is employed.

The Sketch-SSC-TV Model for Large-Scale HSIs
The problem at this point is to compute the sparse coefficient matrix C from the cost function in (6). However, as the number of pixels in an HSI, MN, is typically very large, the matrix C ∈ R^{MN×MN} in (6) is huge. The optimization problem of the SSC-TV model thus cannot be efficiently solved in practice, due to its prohibitively high computational complexity. The traditional SSC-based methods [11,[20][21][22][23]41,42] also suffer from the same problem. One key obstacle is that, with the ADMM algorithm, they have to calculate and store in memory the inverse of the entire large matrix (Y^T Y + µI) ∈ R^{MN×MN}, whose time complexity reaches O((MN)^3), which is infeasible for large-scale data sets. In addition, for the TV-regularized model in (6), the complexity of solving the subproblem with respect to the TV norm is O((MN)^2 log(MN)), which further increases the computational burden. Despite the effectiveness of the TV norm in tasks such as HSI unmixing [43,44], superresolution [45] and denoising [46][47][48], directly exploiting TV regularization in the SSC model is impractical, especially for large-scale HSIs. In the following parts, the sketched SSC (Sketch-SSC) [30] method designed for large-scale data sets is first introduced, and then our Sketch-SSC-TV model is presented.

The Sketch-SSC Model
The recently proposed Sketch-SSC method [30], which was explored in the context of computer vision, employs a random projection matrix R ∈ R^{MN×n} to sketch the input data, compressing the self-representation dictionary Y in (1) to a compact dictionary D := YR ∈ R^{B×n}. The objective function of Sketch-SSC with respect to the sparse coefficient matrix A ∈ R^{n×MN} can be formulated as

arg min_A λ||A||_1 + (1/2)||Y − DA||_F^2.  (7)

By using the random sketching matrix R, the number of optimization variables in the sparse matrix is significantly reduced, making the Sketch-SSC model applicable to large-scale data sets. We illustrate this pictorially in Figure 2. After obtaining the sparse matrix A, the similarity matrix W is built via the KNN graph of A for spectral clustering. The random matrix R ∈ R^{MN×n} used here is known as a Johnson-Lindenstrauss transform (JLT), which can compress Y to a very small dictionary while preserving the major information in Y. The typically used JLTs are matrices with independent and identically distributed (i.i.d.) ±1 entries multiplied by 1/√n [49]. It was proved in Reference [30] that, with a properly selected sketching matrix R, the compressed dictionary D has an expressive capability equal to that of Y, since it preserves the column space of Y with high probability.
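A minimal numpy sketch of the JLT-based compression step (the dimensions below are illustrative): the ±1/√n random matrix R compresses the MN-atom dictionary Y into D = YR with only n atoms, while preserving the column space of Y and approximately preserving row norms:

```python
import numpy as np

rng = np.random.default_rng(0)
B, MN, n = 50, 10000, 100                    # bands, pixels, sketch size (illustrative)
Y = rng.standard_normal((B, MN))             # synthetic stand-in for the HSI matrix

# JLT sketching matrix: i.i.d. +/-1 entries scaled by 1/sqrt(n).
R = rng.choice([-1.0, 1.0], size=(MN, n)) / np.sqrt(n)
D = Y @ R                                    # compact dictionary, B x n

# The sketch approximately preserves the norm of each row of Y ...
rel_err = abs((D[0] @ D[0]) / (Y[0] @ Y[0]) - 1.0)
# ... and, generically, D spans the same column space as Y (here all of R^B).
rank_D = np.linalg.matrix_rank(D)
```

Because D has only n columns, the subsequent sparse coding solves for n coefficients per pixel instead of MN, which is the source of the complexity reduction described in the text.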

The Sketch-SSC-TV Model
By using the sketching technique of Reference [30], we convert the SSC-TV model in (6) to the following Sketch-SSC-TV model:

arg min_A λ||A||_1 + (1/2)||Y − DA||_F^2 + λ_tv ||A||_TV.  (8)

Compared with (6), the self-representation dictionary Y is replaced with the sketched dictionary D, and the constraint diag(C) = 0 is no longer necessary, because I is not a trivial solution of (8). For simplicity, we also remove the affine subspace constraint 1^T C = 1^T. D serves as the basis to represent the whole data set, and thus the pixels in the HSI now lie in the union of subspaces described by D. Similarly to the self-representation in SSC, the coefficients with respect to D should preserve the smoothness of pixel values in local image regions. Since neighbouring pixels are often in the same class, they ideally select the same or a similar set of atoms in D, which together constitute that particular class. As for the computational complexity, the heaviest part of the traditional SSC-based methods, solving the inverse of (Y^T Y + µI) ∈ R^{MN×MN}, is replaced with the inverse of (D^T D + µI) ∈ R^{n×n} in (8), which reduces the complexity from O((MN)^3) to O(n^3). In addition, the complexity of the solver for the TV term is reduced from O((MN)^2 log(MN)) in (6) to O(MN·n·log(MN)) in (8). Note that n is much smaller than MN. Our experimental results, shown later, indicate that there is no obvious performance improvement when n exceeds 100, and thus n can be empirically set to a value around 100, which can be more than a thousand times smaller than MN in large-scale HSIs. Therefore, the computational complexity of the Sketch-SSC-TV model is significantly reduced.
We solve the resulting model by the ADMM algorithm, as described in the following subsection. After obtaining the sparse coefficient matrix A, we cannot apply it directly to construct the similarity matrix in the same way as the traditional SSC-based methods, since A is of size n × MN and thus does not explicitly indicate the connections between the input data points.
Here we use a KNN graph to build the similarity matrix from the sparse matrix A. For each column a_i of A, the k nearest neighbours in Euclidean distance are located, denoted as N_k(a_i). Then the similarity matrix W is calculated as

W_ij = w_ij, if a_i ∈ N_k(a_j) or a_j ∈ N_k(a_i); 0, otherwise,  (9)

where w_ij is obtained with a Gaussian kernel function:

w_ij = exp(−||a_i − a_j||_2^2 / (2σ^2)).  (10)

For large-scale HSIs, the construction of the KNN graph may result in a high computational burden. However, various methods [50][51][52] can be used to speed up this procedure. The obtained sparse similarity matrix W serves as an input to the spectral clustering framework to produce the clustering result. The complete procedure of the proposed Sketch-SSC-TV method is summarised in Algorithm 1.
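The KNN-based similarity construction can be sketched as follows. This is a dense, brute-force numpy illustration for small inputs; the accelerated KNN variants [50][51][52] mentioned in the text would replace the pairwise distance computation:

```python
import numpy as np

def knn_similarity(A, k, sigma2):
    """Symmetric KNN similarity graph built from the columns a_i of A (n x MN):
    W_ij = exp(-||a_i - a_j||^2 / (2*sigma2)) if i, j are k-nearest neighbours."""
    X = A.T                                         # one row per pixel
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)      # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                    # a point is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]              # indices of k nearest neighbours
    W = np.zeros_like(d2)
    for i in range(len(X)):
        W[i, nn[i]] = np.exp(-d2[i, nn[i]] / (2 * sigma2))
    return np.maximum(W, W.T)                       # symmetrize: i in N_k(j) OR j in N_k(i)

A = np.array([[0.0, 0.1, 5.0, 5.1],
              [0.0, 0.0, 5.0, 5.0]])                # two obvious groups of columns
W = knn_similarity(A, k=1, sigma2=1.0)
```

Columns 1-2 and columns 3-4 form mutual nearest-neighbour pairs, so W is block-diagonal, which is exactly the structure that spectral clustering exploits.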

Optimization
In order to solve model (8), three auxiliary variables B, Z ∈ R^{n×MN} and U ∈ R^{2MN×n} are introduced, and model (8) becomes

arg min_{B,A,Z,U} λ||Z||_1 + (1/2)||Y − DB||_F^2 + λ_tv ||U||_1  s.t.  A = B,  A = Z,  HA^T = U,  (11)

where H = [H_x ; H_y] is the TV operator in the spatial direction of the HSI. Based on the efficient ADMM algorithm, the optimization problem (11) can be solved by minimizing the resulting augmented Lagrangian function:

L(B, A, Z, U, Y_1, Y_2, Y_3) = λ||Z||_1 + (1/2)||Y − DB||_F^2 + λ_tv ||U||_1 + ⟨Y_1, A − B⟩ + ⟨Y_2, A − Z⟩ + ⟨Y_3, HA^T − U⟩ + (µ/2)(||A − B||_F^2 + ||A − Z||_F^2 + ||HA^T − U||_F^2),  (12)

where Y_1, Y_2 ∈ R^{n×MN} and Y_3 ∈ R^{2MN×n} are the Lagrange multipliers, and µ is a weighting parameter.
To this end, the following subproblems can be solved iteratively. In each subproblem, a variable is updated with others being fixed.

Update B
The objective function with respect to B is given by:

arg min_B (1/2)||Y − DB||_F^2 + ⟨Y_1, A − B⟩ + (µ/2)||A − B||_F^2.  (13)

The solution can be obtained by setting the first-order derivative to zero:

B = (D^T D + µI)^{−1}(D^T Y + Y_1 + µA).  (14)
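Assuming a standard ADMM splitting with a coupling term of the form ⟨Y1, A − B⟩ + (µ/2)||A − B||_F^2 (the sign convention is our assumption, as is every numeric value below), the B-update has a closed form; the following numpy sketch computes it and verifies stationarity:

```python
import numpy as np

rng = np.random.default_rng(0)
B_dim, n, MN, mu = 8, 5, 20, 1.0             # illustrative dimensions
D = rng.standard_normal((B_dim, n))          # sketched dictionary
Y = rng.standard_normal((B_dim, MN))         # data matrix
A = rng.standard_normal((n, MN))             # current coefficient iterate
Y1 = rng.standard_normal((n, MN))            # Lagrange multiplier (hypothetical values)

# Closed-form minimizer of 0.5*||Y - D B||_F^2 + <Y1, A - B> + (mu/2)*||A - B||_F^2,
# obtained by setting the gradient with respect to B to zero.
B = np.linalg.solve(D.T @ D + mu * np.eye(n), D.T @ Y + Y1 + mu * A)

# Stationarity check: the gradient should vanish at the solution.
grad = D.T @ (D @ B - Y) - Y1 + mu * (B - A)
```

Note that the n × n system is tiny compared to the MN × MN system of the unsketched model, which is exactly the complexity gain discussed in the text.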

Update A
The objective function with respect to A is given by:

arg min_A ⟨Y_1, A − B⟩ + ⟨Y_2, A − Z⟩ + ⟨Y_3, HA^T − U⟩ + (µ/2)(||A − B||_F^2 + ||A − Z||_F^2 + ||HA^T − U||_F^2).  (15)

By setting the first-order derivative to zero, we obtain

A(2µI + µH^T H) = µ(B + Z) − Y_1 − Y_2 + (µU^T − Y_3^T)H.  (16)

As H acts as a convolution on each row of A, the above problem can be efficiently solved by using the fast Fourier transform (FFT):

A = F^{−1}( F(G) / (2µ + µ(|F(h_x)|^2 + |F(h_y)|^2)) ),  (17)

where G = µ(B + Z) − Y_1 − Y_2 + (µU^T − Y_3^T)H denotes the right-hand side of (16), the FFT and the division are applied row-wise (with each row reshaped to an M × N image), h_x and h_y are the convolution kernels of H_x and H_y, and F(·) and F^{−1}(·) denote the FFT and the inverse FFT, respectively.
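The FFT trick used here can be sketched in isolation: under periodic boundary conditions, H^T H is diagonalized by the 2-D FFT, so a system of the form (a·I + H^T H)x = rhs is solved per row by one forward and one inverse FFT. The constant a and the kernels below are illustrative; in the actual update, the Fourier-domain denominator carries the µ-weighted constants:

```python
import numpy as np

M, N = 8, 8
rng = np.random.default_rng(0)
rhs = rng.standard_normal((M, N))     # one row of the right-hand side, as an M x N image
a = 2.0                               # the constant in (a*I + H^T H), illustrative

# Periodic forward differences are circular convolutions, so H^T H is diagonalized
# by the 2-D FFT; its eigenvalues are |F(h_x)|^2 + |F(h_y)|^2 for the kernels below.
hx = np.zeros((M, N)); hx[0, 0] = -1; hx[0, 1] = 1    # horizontal difference kernel
hy = np.zeros((M, N)); hy[0, 0] = -1; hy[1, 0] = 1    # vertical difference kernel
denom = a + np.abs(np.fft.fft2(hx)) ** 2 + np.abs(np.fft.fft2(hy)) ** 2
x = np.real(np.fft.ifft2(np.fft.fft2(rhs) / denom))   # solves (a*I + H^T H) x = rhs

# Verify by applying a*x + H^T H x explicitly with np.roll (periodic boundaries).
def Hx(v): return np.roll(v, -1, axis=1) - v
def Hy(v): return np.roll(v, -1, axis=0) - v
def HxT(v): return np.roll(v, 1, axis=1) - v          # adjoint under periodic BC
def HyT(v): return np.roll(v, 1, axis=0) - v
residual = a * x + HxT(Hx(x)) + HyT(Hy(x)) - rhs
```

Each row of A is solved with one O(MN log(MN)) FFT pair instead of a dense O((MN)^2) solve, which is where the O(MN·n·log(MN)) total cost of the TV subproblem comes from.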

Update Z
The objective function with respect to Z is given by:

arg min_Z λ||Z||_1 + ⟨Y_2, A − Z⟩ + (µ/2)||A − Z||_F^2.  (18)

By introducing the soft-thresholding operator

R_τ(x) = sign(x) · max(|x| − τ, 0),  (19)

applied element-wise, the problem in (18) can be solved by

Z = R_{λ/µ}(A + Y_2/µ).  (20)
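The soft-thresholding operator is simple to state in code; a minimal numpy version with a worked example follows (the variable names are illustrative):

```python
import numpy as np

def soft_threshold(X, tau):
    """Element-wise soft-thresholding: the proximal operator of tau*||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

x = np.array([-2.0, -0.3, 0.0, 0.3, 2.0])
y = soft_threshold(x, 0.5)    # shrinks every entry toward zero by 0.5, clipping at 0
```

Entries with magnitude below the threshold are set exactly to zero, which is what produces the sparsity of Z.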

Update U
The objective function with respect to U is given by:

arg min_U λ_tv ||U||_1 + ⟨Y_3, HA^T − U⟩ + (µ/2)||HA^T − U||_F^2.  (21)

Similarly, U can be updated by

U = R_{λ_tv/µ}(HA^T + Y_3/µ).  (22)

Update Other Parameters
The next step is to update the multipliers Y_1, Y_2, Y_3 and the penalty parameter µ:

Y_1 ← Y_1 + µ(A − B),  Y_2 ← Y_2 + µ(A − Z),  Y_3 ← Y_3 + µ(HA^T − U),  µ ← min(ρµ, µ_max).  (23)

The above five steps are iterated until the stopping criterion is satisfied, that is, until the constraint residuals ||A − B||_∞, ||A − Z||_∞ and ||HA^T − U||_∞ fall below a tolerance ε, or the number of iterations reaches the predefined maximum MaxIter. Algorithm 2 summarizes the whole optimization procedure of the Sketch-SSC-TV model: after initialization, it iteratively updates B by (14), A by (17), Z by (18), U by (22), and the other parameters by (23).
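The dual ascent step and the stopping test can be sketched as follows. The penalty schedule µ ← min(ρµ, µ_max) and the residual-based test are our assumptions about details not fully specified in the text, and H_AT stands for the product H A^T:

```python
import numpy as np

def update_multipliers(Y1, Y2, Y3, A, B, Z, U, H_AT, mu, rho=1.1, mu_max=1e6):
    """One ADMM dual ascent step (sketch; assumed sign conventions)."""
    Y1 = Y1 + mu * (A - B)
    Y2 = Y2 + mu * (A - Z)
    Y3 = Y3 + mu * (H_AT - U)
    mu = min(rho * mu, mu_max)          # gradually increase the penalty weight
    return Y1, Y2, Y3, mu

def converged(A, B, Z, H_AT, U, eps=1e-4):
    """Stop when all constraint residuals fall below a tolerance eps."""
    res = max(np.abs(A - B).max(), np.abs(A - Z).max(), np.abs(H_AT - U).max())
    return res < eps

# At a feasible point the residuals vanish, so the multipliers stay unchanged.
A0 = np.ones((2, 3)); B0 = A0.copy(); Z0 = A0.copy()
HAT0 = np.zeros((6, 2)); U0 = HAT0.copy()
done = converged(A0, B0, Z0, HAT0, U0)
Y1n, Y2n, Y3n, mu_n = update_multipliers(np.zeros_like(A0), np.zeros_like(A0),
                                         np.zeros_like(HAT0),
                                         A0, B0, Z0, U0, HAT0, 1.0)
```

When the constraints A = B, A = Z and HA^T = U hold, the dual variables are stationary, which is consistent with the KKT analysis in the next subsection.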

Convergence Analysis
The convergence of the ADMM algorithm has been theoretically proven for the case where two blocks of variables are alternately updated [31,53,54]. However, it is difficult to guarantee the convergence of ADMM in cases with more than two blocks [55]. In our problem (11), there are four blocks of variables {B, A, Z, U}. We show a weak convergence property of our algorithm by proving that the solution obtained by Algorithm 2 converges to a Karush-Kuhn-Tucker (KKT) point under some mild conditions. We call these conditions "mild" because they are fulfilled most of the time in practice, as evidenced by the experimental results shown later. This weak convergence property is stated in Theorem 1 below. We first introduce a lemma from Reference [56] that we will need in proving this theorem.

Lemma 1 ([56]). Let X be a real Hilbert space endowed with an inner product ⟨·,·⟩ and a norm ||·|| with dual norm ||·||_dual, and let y ∈ ∂||x||, where ∂f(·) denotes the subgradient of f(·). Then ||y||_dual = 1 if x ≠ 0, and ||y||_dual ≤ 1 if x = 0.
Theorem 1. Let Γ^r = {B^r, A^r, Z^r, U^r, Y_1^r, Y_2^r, Y_3^r} be the sequence derived from Algorithm 2. If lim_{r→∞} µ(Z^{r+1} − Z^r) = 0 and lim_{r→∞} µ((U^{r+1})^T − (U^r)^T) = 0, then the sequence {Γ^r}_{r=1}^∞ is bounded, and its accumulation points converge to a KKT point.
Proof. We first prove the boundedness of the variable sequences {B^r, A^r, Z^r, U^r, Y_1^r, Y_2^r, Y_3^r}. With the definition of L(·) in (12) and the solver of the U-subproblem (22), we obtain 0 ∈ ∂L_U(U^{r+1}), where ∂L_U(·) is the subgradient of the non-smooth function L(·) with respect to U. Based on the above-stated Lemma 1, we derive (24), so the sequence {Y_3^{r+1}} is bounded; the boundedness of {Y_1^r} and {Y_2^r} follows analogously from the updates of Z and A, where ∇L_A denotes the gradient of the smooth function L(·) with respect to A. With the updating rules for Y_2 and Y_3 in (23), we reformulate the equation in (26) as (27). Letting r vary from 1 to t and summing both sides of (28), we get (29). Because of the boundedness of Y_1^r, Y_2^r and Y_3^r, we conclude that L(B^{r+1}, A^{r+1}, Z^{r+1}, U^{r+1}, Y_1^r, Y_2^r, Y_3^r) is also bounded. With (12), we obtain Equation (30). Due to the boundedness of L(B^{r+1}, A^{r+1}, Z^{r+1}, U^{r+1}, Y_1^r, Y_2^r, Y_3^r), Y_1^r, Y_2^r and Y_3^r, the left side of (30) is bounded, and thus the right side of (30) is bounded as well, which implies that each term on the right side of (30) is bounded. Therefore, the sequences {B^r}, {A^r}, {Z^r} and {U^r} are bounded, which completes the proof of the boundedness of the variable sequences. Based on the Bolzano-Weierstrass theorem [57], a bounded sequence has at least one accumulation point; we denote one such point by Γ* = {B*, A*, Z*, U*, Y_1*, Y_2*, Y_3*}. Next, we prove that the accumulation point Γ* satisfies the KKT conditions [58], which means that {Γ^r}_{r=1}^∞ converges to a KKT point. The proof is similar to that in References [59,60]. A KKT point of (11) must satisfy the KKT conditions (31)-(38). According to the updating rules in (23), and since lim_{r→∞}(Y_1^{r+1} − Y_1^r) = lim_{r→∞}(Y_2^{r+1} − Y_2^r) = lim_{r→∞}(Y_3^{r+1} − Y_3^r) = 0, the primal feasibility conditions (32)-(34) are satisfied. Based on the updating rule in (14), left-multiplying by (D^T D + µI) and considering lim_{r→∞}(B^{r+1} − B^r) = 0, and since A* = B*, we infer that D^T D B* − D^T Y − Y_1* = 0, which satisfies the KKT condition (35).
Similarly, combining the updating rule (17) with lim_{r→∞}(A^{r+1} − A^r) = 0 and the proved conditions (32)-(34), we obtain (Y_3*)^T H + Y_1* + Y_2* = 0, which satisfies the KKT condition (36). To prove the condition (37), we reformulate it using the scalar function Θ_{λ/µ}(t) = t + (λ/µ)|t|, applied to Z element-wise. Based on Reference [61], the condition (37) is transformed equivalently to (47). Then, from the updating rule in (18), as lim_{r→∞}(Z^{r+1} − Z^r) = 0 and A* = Z*, we derive that Z* = R_{λ/µ}(Z* + Y_2*/µ), which satisfies the condition (37). The last KKT condition (38) can be proved similarly to (37): we first reformulate it into an equivalent condition; then, with the updating rule (22), taking lim_{r→∞}(U^{r+1} − U^r) = 0 and HA*^T − U* = 0, we conclude that U* = R_{λ_tv/µ}(U* + Y_3*/µ), which proves the condition (38). Overall, the accumulation point Γ* satisfies all the KKT conditions. This completes the proof of Theorem 1.
Theorem 1 assures the theoretical convergence of our algorithm under mild conditions. In the next section, we demonstrate the convergence empirically through experiments on real data sets.

Experimental Settings
We conduct experiments on three widely used benchmark data sets: Indian Pines, Pavia University and Salinas. The results of two classical clustering methods, FCM [5] and k-means [7], the random swap clustering (RSC) [62], the original SSC method [11], the SSC-based extensions L2-SSC [22] and JSSC [23], and the state-of-the-art large-scale clustering methods SSSC [28], SSC-OMP [29] and Sketch-SSC [30] are reported and analysed. The clustering methods FCM, k-means, RSC, SSC, SSSC, SSC-OMP and Sketch-SSC yield their results based on spectral information alone, while the L2-SSC, JSSC and Sketch-SSC-TV methods employ both spatial and spectral information.
We conduct four independent experiments using the three data sets. The traditional SSC-based methods [11,22,23] cannot cluster large-scale data sets. For this reason, we first test all the methods on a cropped version of Indian Pines. In order to compare with the methods [28-30] designed for large-scale data sets, we also test the performance on the original large-scale HSIs. We refer to the cropped data set as the small HSI and to the original data sets as the large-scale HSIs.
Two commonly used quantitative metrics, the overall accuracy (OA) and the Kappa coefficient (κ), are employed to evaluate the clustering performance. In addition, we report the running time (t) for all the methods. For a data set Y = [y_1, y_2, ..., y_N] ∈ R^{B×N} with N samples, the OA is obtained as ∑_{i=1}^N δ(map(r_i), l_i)/N, where r_i and l_i are the cluster label obtained by clustering and the true label of y_i, respectively, and δ(x, y) equals one if x = y and zero otherwise. map(·) is a pair-wise mapping function that finds the best match between the clustering results and the ground truth; we apply the Hungarian algorithm [63] to derive the best mapping function. For more details about cluster matching, we refer to Reference [64]. The obtained mapping function assigns a label to each pixel, so κ can be computed directly from the corresponding confusion matrix [65]. The running time covers the whole clustering procedure for each method. The optimal parameters for the traditional SSC-based methods SSC, L2-SSC and JSSC on the small HSI are set according to References [20,22,23]. The parameters of the other analysed methods were tuned to produce the best results in terms of OA, to guarantee a fair comparison. The total number of randomly selected samples in the SSSC method is equal to n for the small HSI; for simplicity, the number of samples for the large-scale HSIs is set to 10 per class. In order to avoid biased clustering results caused by randomness, the methods FCM, k-means, SSSC, Sketch-SSC and Sketch-SSC-TV are repeated five times and the averaged performance is reported. In the spectral clustering step, to reduce the computational complexity, we only calculate c eigenvectors of the Laplacian matrix L, whose time complexity is O(c(MN)²). For the Sketch-SSC and Sketch-SSC-TV models, the sketching matrices R are shared in each simulation. We set n = 70 and σ² = ∑_{i,j} ‖a_i − a_j‖
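The OA computation described above, including the Hungarian-algorithm matching between cluster labels and ground-truth labels, can be sketched as follows (a hedged illustration; the function name and the assumption of 0-based integer labels are our own):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def overall_accuracy(pred, truth):
    """OA = sum_i delta(map(r_i), l_i) / N, with the best label mapping map(.)
    found by the Hungarian algorithm on the confusion matrix."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    n_cls = int(max(pred.max(), truth.max())) + 1
    # Confusion matrix: C[p, t] counts samples with cluster label p and true label t.
    C = np.zeros((n_cls, n_cls), dtype=np.int64)
    for p, t in zip(pred, truth):
        C[p, t] += 1
    # Hungarian algorithm maximizes the total matched count (minimize -C).
    row, col = linear_sum_assignment(-C)
    return C[row, col].sum() / len(truth)
```

The Kappa coefficient can then be computed from the confusion matrix permuted according to the same mapping.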

Pavia University
This data set was collected by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor during a flight campaign over Pavia, northern Italy. The typically used image consists of 610 × 340 pixels (207,400 pixels in total) with 103 spectral reflectance bands. The spatial resolution is 1.3 m per pixel and the number of ground-truth classes is 9. The false color image and ground truth are shown in Figure 6a,b.

Salinas
The third image was acquired by the AVIRIS sensor over the Salinas Valley, CA, USA. The geometric resolution is 3.7 m per pixel, and the image size is 512 × 217 × 224. There are 16 ground-truth classes. Twenty bands (108-112, 154-167 and 224) are removed due to water absorption. The false color image and ground truth are shown in Figure 7a,b.
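The band-removal step described above can be illustrated with a short NumPy sketch; the cube here is a zero-filled stand-in for the actual Salinas data, and the 1-based band indices from the text are converted to 0-based:

```python
import numpy as np

# Stand-in Salinas cube: height x width x bands (zeros instead of real data).
cube = np.zeros((512, 217, 224), dtype=np.float32)

# Water-absorption bands 108-112, 154-167 and 224 (1-based), i.e. 20 bands.
bad = set(range(107, 112)) | set(range(153, 167)) | {223}
keep = [b for b in range(cube.shape[2]) if b not in bad]

clean = cube[:, :, keep]  # 204 bands remain after removal
```

After this step each pixel is represented by a 204-dimensional spectral vector.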

Experiments on the Small HSI
In this part, the experiments are conducted on the data in Figure 3, which is cropped from the original Indian Pines as indicated by the yellow box in Figure 5a. The specific class names and their corresponding clustering results are shown in Table 1, where the best result is marked in bold and the second-best result is underlined.
The results reveal that our proposed Sketch-SSC-TV model achieves the best performance in terms of clustering accuracy and κ. The optimal parameters of the Sketch-SSC-TV in terms of OA are λ = 10⁻³ and λ_tv = 10⁻². Compared with the classical clustering methods FCM and k-means, the SSC-based methods SSC, L2-SSC, JSSC, Sketch-SSC and Sketch-SSC-TV usually yield higher accuracy, showing the superior capability of the SSC model in this clustering task. RSC is a k-means-based clustering method consisting of a sequence of centroid swaps and fine-tuning of the centroids with k-means. It performs better than k-means in Table 1 in terms of OA and κ. The number of iterations in RSC is set by default to 5000 in the source code (http://www.uef.fi/web/machine-learning/software), resulting in a longer running time than k-means and FCM. The classification accuracy of k-means for the class "Soybean-notill" is zero, which means that all the pixels belonging to class 3 are wrongly assigned to other classes. Compared with the original SSC model, the extensions L2-SSC and JSSC, which incorporate different types of spatial information, achieve higher accuracy, which indicates the importance of spatial information in HSI clustering. Interestingly, and perhaps surprisingly, the proposed Sketch-SSC-TV method yields higher clustering accuracy than the L2-SSC and JSSC methods, even though those methods not only use the uncompressed self-representation dictionary but also take the spatial information into account. Compared with the large-scale clustering methods SSSC, SSC-OMP and Sketch-SSC, our method yields a significant accuracy improvement of more than 20%. The SSSC model relies heavily on the initially selected samples, so when the data sets are contaminated by noise or highly diverse within each class, its performance may be greatly degraded.
Compared with the Sketch-SSC model, our method offers a significant improvement as well, which demonstrates the effectiveness of our approach. Figure 4 shows the similarity matrices obtained by different clustering methods. For a better visual comparison, we randomly select 75 samples per class and arrange them in sequential order by class. Ideally, the similarity matrix should be block-diagonal, as only the samples of the same class are connected in the graph [11]. The results in Figure 4 indicate that our method (Figure 4f) best preserves this block-diagonal structure, which is also the main reason why our approach achieves the highest accuracy in the spectral clustering in Table 1. In general, the similarity matrices in Figure 4e,f constructed by KNN are sparser than those constructed by (2) in Figure 4a-c, yet they achieve comparable or even better spectral clustering performance, demonstrating the efficiency of sparse graphs in spectral clustering. The similarity matrix in Figure 4d is overly sparse due to the strict sparsity constraint in the OMP algorithm, leading to poor clustering performance. There are clearly many wrong connections between class 3 and class 4 in Figure 4a-c,e, which results in the low accuracy for these classes reported in Table 1. This effect is much less pronounced in Figure 4f, which shows a much cleaner block-diagonal structure and achieves the highest accuracy for classes 3 and 4. This improved graph connectivity mainly benefits from the TV spatial regularization.
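A KNN-based sparse similarity graph of the kind discussed above can be sketched as follows. This is a hedged illustration with Gaussian weights on the k nearest coefficient vectors; the paper's exact construction (and its choice of σ²) may differ:

```python
import numpy as np

def knn_affinity(A, k, sigma2):
    """Sparse symmetric affinity graph over the N columns of the coefficient
    matrix A (n x N): connect each sample to its k nearest columns with
    Gaussian weights, then symmetrize."""
    sq = np.sum(A ** 2, axis=0)
    # Pairwise squared distances between columns, clipped for float safety.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * A.T @ A, 0.0)
    np.fill_diagonal(d2, np.inf)          # no self-loops
    N = A.shape[1]
    W = np.zeros((N, N))
    for i in range(N):
        nb = np.argsort(d2[i])[:k]        # k nearest neighbours of sample i
        W[i, nb] = np.exp(-d2[i, nb] / (2.0 * sigma2))
    return np.maximum(W, W.T)             # symmetric sparse similarity matrix
```

Keeping only k connections per sample is what makes the graphs in Figure 4e,f visibly sparser than the fully dense constructions.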
The results in Table 1 show that the SSSC method achieves the shortest computational time. k-means is known to be an efficient clustering algorithm with a complexity of O(IBcMN), where I is the number of iterations. The time complexity of SSSC is O(I_1 B n³ + I_2 n c² + MN n²), where n is set to 70 in the experiment. k-means took a slightly longer running time than SSSC because it needed many more iterations to converge on this data set. However, the results on the big data sets such as Pavia University and Salinas, shown later, indicate that k-means is consistently the fastest algorithm. We also observed that the computation time t of the SSSC method increases when the number of initially selected samples becomes larger. Among the clustering methods designed for large-scale data, SSC-OMP takes the longest time. In general, the large-scale clustering methods are much faster than the traditional SSC-based methods, with more than a hundred-fold speed improvement. The reason for this significant speed-up is that the traditional SSC-based methods SSC, L2-SSC and JSSC use the self-representation dictionary Y, which is typically huge for large-scale data and involves many time-consuming matrix multiplications and inverse calculations on the large dense matrix Y^T Y ∈ R^{MN×MN} in the optimization loop, while the scalable clustering methods SSSC, Sketch-SSC and Sketch-SSC-TV employ a compressed dictionary, enabling a much lower computational complexity. In our method, the column size of the sketched dictionary D is 70, so the cost of matrix multiplications and inverse calculations on the new matrix D^T D ∈ R^{70×70} is significantly reduced in comparison with those on the huge matrix Y^T Y in the traditional SSC-based methods. That is why our method takes only 6 seconds to obtain the clustering result, while the traditional SSC-based methods take around 10 minutes.
The computation time of our sketching-based method is comparable to that of FCM and k-means, and hence much smaller than that of the traditional SSC-based methods. Compared with the other large-scale clustering methods, the computational cost of Sketch-SSC-TV is similar.

Experiments on the Large-Scale HSIs
In this part, we conduct three more experiments on the entire HSIs. Due to the high memory requirements of the traditional SSC-based methods on large-scale HSIs, the SSC, L2-SSC and JSSC methods cannot be run on our computer for Pavia University and Salinas. Table 2 reports the memory, estimated with MATLAB, required for merely storing the large matrix Y^T Y for the three HSIs. The required memory for Pavia University is 320.5 GB, without even considering the extra memory needed for operations such as matrix multiplications and inverse calculations, which is unaffordable for normal computational devices. We report the experimental results in Tables 3-5, and the clustering maps are shown in Figures 5-7. The optimal parameters of the Sketch-SSC-TV model are λ = 10⁻³, λ_tv = 10⁻¹ for Indian Pines; λ = 5 × 10⁻², λ_tv = 5 × 10⁻¹ for Pavia University; and λ = 10⁻³, λ_tv = 10⁻⁴ for Salinas.
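The memory estimate for storing Y^T Y can be reproduced with a two-line calculation (double precision, 8 bytes per entry; the GB figure here follows the binary 2^30 convention):

```python
# Memory needed just to store the dense matrix Y^T Y (MN x MN) in float64,
# for the Pavia University image of 610 x 340 pixels.
MN = 610 * 340                       # number of pixels (207,400)
bytes_needed = 8 * MN ** 2           # 8 bytes per float64 entry
gib = bytes_needed / 2 ** 30
print(f"{gib:.1f} GB")               # ~320.5 GB, matching Table 2
```

By contrast, the sketched Gram matrix D^T D is only 70 × 70, i.e. about 38 KB.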
The results in Tables 3-5 reveal that our method consistently achieves the highest clustering accuracy on the three HSIs, which confirms its effectiveness. Clustering of Indian Pines is a very challenging task, as some of the spectral signatures from different classes are very close and parts of the spectrum are highly mixed due to the low spatial resolution [20]. As depicted in Table 3, most of the approaches achieve quite low clustering accuracy, while our method yields a much better result, with an accuracy of 60.48%. FCM, k-means and RSC produce similar accuracy on Indian Pines and Pavia University, but k-means is more efficient in terms of computation time than FCM and RSC. The traditional SSC-based methods SSC, L2-SSC and JSSC can be run only on Indian Pines; our method, in contrast, is not only capable of running on all three large-scale HSIs but also improves the clustering performance, which mainly benefits from the exploitation of the TV-norm spatial constraint and the sketching technique. Table 3 also shows that the computation time of the Sketch-SSC-TV method is reduced by at least 600 times compared to the SSC, L2-SSC and JSSC methods, indicating the efficiency of using a compressed dictionary instead of the large self-representation dictionary. Among the clustering methods designed for large-scale data, SSC-OMP takes much longer than SSSC, Sketch-SSC and our method, because the sparse coding for each sample is performed serially; this computation time could be reduced by running the sparse coding in parallel. Compared with the SSC clustering map in Figure 5f, the L2-SSC, JSSC and Sketch-SSC-TV methods show less impulse noise in their clustering maps, due to the use of spatial information that promotes the connectivity between neighbouring pixels, leading to a more robust similarity matrix.
The clustering results in Table 4 show that the accuracy of our method on "Self-Blocking-Bricks" is much lower than that of the reference methods. This can mainly be attributed to the over-smoothed clustering results shown in Figure 6i, where the "Painted Metal Sheets" and the neighbouring "Self-Blocking-Bricks" are merged. This can be alleviated by relaxing the spatial constraint with a smaller λ_tv, at the possible cost of reduced overall accuracy. The large-scale clustering methods SSSC and SSC-OMP typically yield worse accuracy than k-means and RSC on the three large-scale HSIs, which indicates the limitations of their performance in HSI clustering. In Figures 5i, 6f and 7f, the SSSC clustering maps are seriously deteriorated by impulse noise, which is caused by the limited discriminative information in the spectral domain. The SSC-OMP method suffers from the same problem on Indian Pines and Pavia University, as shown in Figures 5j and 6g. Compared with the Sketch-SSC method, our method achieves a significant accuracy improvement on the three large-scale HSIs, especially on Indian Pines, with an accuracy enhancement of 23.7%. The cost is a slight increase in computation time due to the TV-norm regularization. The Sketch-SSC model also suffers from the same impulse noise problem as the SSSC and SSC-OMP approaches in the clustering maps shown in Figures 5k, 6h and 7h, while in our method this problem is greatly alleviated, as the connections between neighbouring pixels are strengthened by the TV-norm constraint.

Analysis of Parameters
In this part, we analyse the effect of the parameters λ, λ_tv, n and k on the clustering performance of the Sketch-SSC-TV method on the large-scale HSIs.

Effect of λ and λ_tv
The parameters λ and λ_tv in (8) control the sparsity level and the spatial constraint of the sparse matrix, respectively, and are two important parameters in the model. We vary λ in the range {1 × 10⁻⁴, 5 × 10⁻⁴, 1 × 10⁻³, 5 × 10⁻³, 1 × 10⁻², 5 × 10⁻²} and λ_tv in the range {1 × 10⁻⁴, 5 × 10⁻⁴, 1 × 10⁻³, 5 × 10⁻³, 1 × 10⁻², 5 × 10⁻², 1 × 10⁻¹, 5 × 10⁻¹}. The clustering results with respect to λ and λ_tv are shown in Figure 8 for the three large-scale HSIs. The results indicate that the clustering performance is more stable with respect to λ than to λ_tv. According to the experimental results, we recommend setting λ = 10⁻³ for all the data. For the parameter λ_tv, the optimal value may differ between data sets, but the clustering accuracy is stable and superior to the other methods over a wide range: λ_tv ∈ [5 × 10⁻³, 10⁻¹] for Indian Pines, λ_tv ∈ [5 × 10⁻³, 5 × 10⁻¹] for Pavia University and λ_tv ∈ [10⁻⁴, 5 × 10⁻³] for Salinas. The results for Salinas in Figure 8c differ markedly from those for Indian Pines and Pavia University. For Salinas, the results are typically better when λ_tv and λ have similar values, which means that the sparsity constraint and the spatial constraint are equally important. In contrast, for Indian Pines and Pavia University our method performs better when λ_tv is larger than λ, indicating that the spatial constraint is more important than the sparsity. The reason may lie in the different types of HSIs and different levels of data quality. As each crop in Salinas is planted regularly in blocks, there are more homogeneous regions and fewer edges than in Indian Pines and Pavia University, resulting in a much smaller value of the TV-norm for Salinas.
In addition, due to the high data quality of Salinas, spatial information may be less important than for the other two HSIs. Thus, a larger value of λ_tv can over-smooth the clustering result, leading to lower accuracy. Among the three terms of the Sketch-SSC-TV model, the data fidelity term is the most important for Salinas. Overall, based on the results in Figure 8, our method is robust and stable with respect to λ and λ_tv.

Effect of the Parameter n
n is the number of columns of the sketching matrix R, which determines the sketched dictionary size and thereby the computational efficiency of the proposed method. We vary n in the range {10, 20, 40, 70, 100, 140} for the three large-scale HSIs. The results, reported in Figure 9, show that a larger n typically yields a better clustering result, because a larger sketched dictionary better preserves the original column space of the input data. For Indian Pines and Salinas, the number of classes is 16; when n = 10, the sketched dictionary cannot represent the input data space well, which results in a drop of accuracy compared to n = 20. Figure 9 also reveals that a small value of n (20, for example) is sufficient to obtain satisfactory clustering performance on all three HSIs, which coincides with the fact that the data of HSIs actually lie in a low-dimensional subspace.
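The role of n can be illustrated with a toy sketching step. A Gaussian random matrix is used here as a stand-in for the paper's Johnson-Lindenstrauss construction, and the pixel count is reduced for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
B, MN, n = 103, 2000, 70            # bands, pixels (toy size), sketch width

Y = rng.standard_normal((B, MN))    # stand-in data matrix, one pixel per column
# Gaussian JL sketching matrix R (MN x n); the paper's exact JL construction
# may differ, but any JL transform approximately preserves the column space.
R = rng.standard_normal((MN, n)) / np.sqrt(n)

D = Y @ R                           # compact dictionary: B x n instead of B x MN
```

Sparse coding against D then produces an n × N coefficient matrix rather than the MN × MN matrix of standard SSC, which is where the complexity reduction comes from.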

Effect of the Parameter k
We investigate the effect of the number of neighbours k on the clustering performance of our method by varying k in the range {5, 10, 15, 20, 30, 50} for the three large-scale HSIs. The results shown in Figure 10 reveal that a larger k yields a higher clustering accuracy in general. The accuracy curves with k < 20 rise more significantly than those with k ≥ 20 on the three HSIs; when k ≥ 20, the accuracy becomes much more stable. Based on these results, we set the value of k to 30 in this paper. Figure 11 shows the squared Frobenius norms of the differences in Z and U between each two subsequent iterations: ‖Z^{r+1} − Z^r‖²_F and ‖U^{r+1} − U^r‖²_F. We refer to these distances between the values of Z and U in two successive iterations as updating errors. The results show that the updating errors tend to a very small value after a sufficient number of iterations, meaning that the solutions for Z and U eventually become stable. Moreover, on all three data sets, the updating errors decline monotonically after a certain number of iterations. Thus, we have lim_{r→∞} µ(Z^{r+1} − Z^r) = 0 and lim_{r→∞} µ(U^{r+1} − U^r) = 0, where µ is a constant. This empirically demonstrates that the conditions in Theorem 1 are satisfied for all three analysed data sets, and it is thus reasonable to assume that they will be satisfied in a similar manner for most other HSIs in practice. It can be observed that the updating errors of Z and U on some data sets are zero at the beginning, which is mainly caused by the small values of A^{r+1} + Y_2^r/µ in (18) and HA^{(r+1)T} + Y_3^r/µ in (22) during the first several iterations, leading to zero matrices at the output of the thresholding operator R(·). After a certain number of iterations, their values increase and the output of the thresholding operator is no longer zero, resulting in a temporary increase of the updating errors, as shown in Figure 11, before they finally settle to a value close to zero.
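The updating errors plotted in Figure 11 are straightforward to compute from stored iterates. A minimal sketch, where the iterates are placeholders rather than actual solver output:

```python
import numpy as np

def updating_errors(iterates):
    """Squared Frobenius norms ||X^{r+1} - X^r||_F^2 between successive
    iterates (e.g. the Z or U matrices of the ADMM solver)."""
    return [np.linalg.norm(iterates[r + 1] - iterates[r], 'fro') ** 2
            for r in range(len(iterates) - 1)]

# Placeholder iterates standing in for {Z^r}: the error between an all-zeros
# and an all-ones 2x2 matrix is 4.0.
errs = updating_errors([np.zeros((2, 2)), np.ones((2, 2))])
```

A curve of these values over r that decays toward zero is the empirical counterpart of the limit conditions assumed in Theorem 1.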
The diagrams in Figure 12 show the evolution of the objective function values with respect to the number of iterations. The results reveal that the objective function decreases monotonically to a stable level on the three data sets, demonstrating the practical convergence of our optimization algorithm. In particular, the curves in Figure 12a,c drop sharply in the first few iterations and then saturate. These results coincide with the aforementioned theoretical convergence analysis.

Conclusions
In this paper, the problem of large-scale clustering of HSIs based on the SSC model is addressed for the first time, and a novel clustering method, named Sketch-SSC-TV, is proposed. It incorporates a random-projection-based sketching technique to significantly reduce the number of optimization variables. In addition, a TV-norm constraint on the sparse coefficient matrix promotes the dependencies between neighbouring pixels, which enhances the block-diagonal structure of the similarity matrix, thereby improving the performance of spectral clustering. We derived an efficient solver based on the ADMM algorithm for the resulting model and proved its convergence property theoretically. Unlike the traditional SSC-based methods, which cannot be applied to large-scale HSIs due to their extremely high computational burden, the proposed method is not only applicable to big data sets but also achieves a high level of clustering accuracy. Extensive experimental results clearly demonstrate that our method outperforms the state-of-the-art clustering methods.