A new Kmeans clustering model and its generalization achieved by joint spectral embedding and rotation

K-means clustering and spectral clustering are two popular methods for grouping similar data points together according to their similarities. However, the performance of K-means clustering can be quite unstable due to the random initialization of the cluster centroids. Spectral clustering methods generally employ a two-step strategy of spectral embedding followed by discretization postprocessing to obtain the cluster assignment, and the postprocessing step can easily drift far from the true discrete solution. In this paper, based on the connection between K-means clustering and spectral clustering, we propose a new K-means formulation that jointly performs spectral embedding and spectral rotation, an effective postprocessing approach to discretization; we term it KMSR. Further, instead of directly using the dot-product data similarity measure, we generalize KMSR to incorporate more advanced data similarity measures and call the generalized model KMSR-G. An efficient optimization method is derived to solve the KMSR (KMSR-G) model objective, and its complexity and convergence are analyzed. We conduct experiments on extensive benchmark datasets to validate the performance of our proposed models, and the experimental results demonstrate that our models perform better than the related methods in most cases.


INTRODUCTION
Clustering is an important research topic in many communities such as data mining and pattern recognition. Basically, it aims to group data points into different clusters according to their similarities or densities (Ubukata, 2019; Ren, Zhang & Zhang, 2019). Over the past decades, a number of clustering algorithms have been proposed, such as K-means clustering, spectral clustering (Ng, Jordan & Weiss, 2001), min-max cut (Ding et al., 2001; Nie et al., 2010), subspace clustering (Xie et al., 2020), and multi-view clustering (Nie, Tian & Li, 2018; Cai et al., 2013). Among the existing clustering methods, the most popular is the K-means algorithm due to its simplicity and efficiency; it learns cluster centroids that minimize the within-cluster data distances. However, the clustering performance of K-means suffers greatly from the random initialization of the cluster centroids.
By characterizing the data connection with an appropriate graph whose vertices represent the data points and whose weights represent the connection between data pairs, spectral clustering tries to partition the vertices into different clusters by minimizing the cut information. Popular spectral clustering algorithms include the ratio cut (RCut) (Hagen & Kahng, 1992), normalized cut (NCut) (Shi & Malik, 1997), clustering with adaptive neighbors (CAN) and its projected version (PCAN) (Nie, Wang & Huang, 2014), multiclass spectral clustering (Yu & Shi, 2003), constrained Laplacian rank, and nonnegative matrix factorization (Peng et al., 2018). Given a built graph, existing spectral clustering methods usually employ a two-step strategy to complete the clustering: one step performs eigen-decomposition on the graph Laplacian matrix to obtain the scaled cluster indicator matrix, based on which the other step makes discretization to get the final cluster assignment. The former step is recognized as spectral embedding and the latter as postprocessing. Generally, the two existing approaches to the postprocessing task of recovering the final discrete cluster indicators from the relaxed continuous spectral vectors are K-means clustering and spectral rotation. As pointed out by Chen et al. (2017), using spectral rotation as the postprocessing step usually obtains better clustering performance than K-means postprocessing. However, such a two-stage process has an obvious disadvantage: the final assignments may deviate far from the true discrete solution.
As mentioned above, both K-means clustering and spectral clustering have limitations in their respective fields. To this end, in this paper, we first derive the underlying connection between K-means clustering and spectral clustering, and then propose a new K-means formulation that jointly performs spectral embedding and spectral rotation. The resultant KMSR model can effectively alleviate the drawback of random initialization of cluster centroids in K-means. Moreover, the two sub-objectives of spectral embedding and spectral rotation are jointly optimized, so they can co-evolve to the optimum and avoid the sub-optimality caused by the two-step strategy. Since KMSR originates from K-means clustering, it measures the data similarity by directly using the dot-product weighting scheme, whose performance is limited on complicated data sets. To accommodate more advanced graphs and thereby improve the performance of KMSR, we extend it by replacing the dot-product of the data matrix with predefined graphs such as CAN and PCAN (Nie, Wang & Huang, 2014), leading to the generalized version KMSR-G. Mathematically, the KMSR model objective involves three variables respectively corresponding to the relaxed continuous cluster indicator matrix, the discrete cluster indicator matrix, and the orthogonal transformation matrix that bridges them; therefore, under the coordinate blocking framework, we design an efficient optimization method to alternately update them, whose complexity and convergence properties are also analyzed. We conduct extensive experiments on representative benchmark data sets to evaluate the performance of our proposed models. Comparisons with related models show that both KMSR and KMSR-G perform better in most cases.
The remainder of this paper is organized as follows. We give brief introductions to related works including K-means, spectral clustering and spectral rotation in 'Related Work'. In 'The Proposed Model', we first derive the model formulation of KMSR based on the connection between K-means clustering and spectral clustering, and then provide the detailed optimization process for the KMSR model objective. Besides, the complexity and convergence analysis of KMSR and its generalization to KMSR-G are included. In 'Experiments', extensive experiments are conducted on representative benchmark data sets to evaluate the effectiveness of KMSR and KMSR-G in data clustering. 'Conclusion' concludes the whole paper and puts forward future work.
Notations. In this paper, matrices are written as boldface uppercase letters and vectors as boldface lowercase letters. For example, the (i,j)th element of matrix W is w_ij. The squared Frobenius norm of a matrix W ∈ R^{n×m} is ||W||_F^2 = Σ_{i=1}^n Σ_{j=1}^m w_ij^2. By default, we use w_i to denote the ith column of W and w^j to denote its jth row. We use R and B to represent the real and binary domains, respectively.

RELATED WORK

K-means clustering
Given a data matrix X = [x_1, x_2, ..., x_n] ∈ R^{d×n}, K-means clustering aims to partition X into c (1 ≤ c ≤ n) clusters C = {C_1, C_2, ..., C_c} such that the within-cluster sum of squared distances is minimized (equivalently, the between-cluster sum of squared distances is maximized). Mathematically, the objective function of K-means clustering is

min Σ_{i=1}^{c} Σ_{x_j ∈ C_i} ||x_j − µ_i||_2^2,   (1)

where µ_i is the centroid corresponding to the cluster C_i. To optimize objective Eq. (1), the membership of each data point and the centroid of each cluster are alternately updated.
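The alternating scheme for Eq. (1) can be sketched in Python/NumPy as follows. The paper contains no code, so the function name and signature are our own illustrative choices; the random centroid initialization is exactly the instability discussed in the introduction.

```python
import numpy as np

def kmeans(X, c, n_iter=100, seed=0):
    """Lloyd-style alternating optimization of the K-means objective
    in Eq. (1): assign points to the nearest centroid, then recompute
    each centroid as its cluster mean. X is d x n (columns are samples)."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    # Random centroid initialization -- the source of K-means' instability.
    centroids = X[:, rng.choice(n, size=c, replace=False)].astype(float)
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Assignment step: squared distance of every point to every centroid.
        dists = ((X[:, :, None] - centroids[:, None, :]) ** 2).sum(axis=0)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # memberships stable: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its cluster.
        for j in range(c):
            if np.any(labels == j):
                centroids[:, j] = X[:, labels == j].mean(axis=1)
    return labels, centroids
```

On two well-separated groups of points, any initialization of this sketch recovers the grouping, but on realistic data different seeds can land in different local optima, which motivates the stability analysis later in the paper.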

Spectral clustering
For spectral clustering, we first need to construct a graph affinity matrix A ∈ R^{n×n} according to certain similarity measures to depict the connection between data pairs. Let y^i (i = 1, ..., n) be the ith row vector of the matrix Y = [y^1; y^2; ···; y^n] ∈ B^{n×c}, which is the cluster indicator vector for x_i: the jth element of y^i is 1 if x_i ∈ C_j, and 0 otherwise. Define the scaled cluster indicator matrix F as F = Y(Y^T Y)^{−1/2}, whose jth column f_j has entries

f_ij = 1/√n_j if x_i ∈ C_j, and 0 otherwise,   (2)

where n_j is the number of data points in the jth cluster. Then the objective function of spectral clustering can be formulated as

min_F Tr(F^T L F).   (3)

Here L = D − A is the Laplacian matrix, where D is the diagonal degree matrix with its ith diagonal element defined as d_ii = Σ_{j=1}^n a_ij.
Under the relaxed orthogonality constraint F^T F = I, the embedding F can be obtained by stacking the eigenvectors of L corresponding to its c smallest eigenvalues. However, F is a real-valued matrix, so a postprocessing step such as K-means clustering or spectral rotation is necessary to perform discretization.
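The embedding step above is a small eigendecomposition; a minimal sketch (function name is ours, not the paper's):

```python
import numpy as np

def spectral_embedding(A, c):
    """Spectral embedding step: the relaxed scaled indicator matrix F is
    formed by the eigenvectors of L = D - A associated with the c
    smallest eigenvalues (Eq. (3) under the relaxation F^T F = I)."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    # eigh handles symmetric matrices and returns eigenvalues in
    # ascending order, so the first c columns are the ones we need.
    _, vecs = np.linalg.eigh(L)
    return vecs[:, :c]  # n x c, real-valued: still needs discretization
```

The returned F has orthonormal columns but is not a 0/1 indicator, which is precisely why the postprocessing step is required.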
It is easy to find that the solution to Eq. (3) is not unique: for any solution F, FR is another solution where R is an arbitrary orthogonal matrix. Spectral rotation therefore aims at finding a proper orthogonal matrix R such that the resulting FR is closer to the discrete indicator matrix solution set than the F fed to K-means. Mathematically, it minimizes the following objective

min_{Y, R} ||Y(Y^T Y)^{−1/2} − FR||_F^2   s.t.  Y ∈ B^{n×c}, Y1_c = 1_n, R^T R = I,   (4)

where 1_c and 1_n are all-one column vectors of sizes c × 1 and n × 1, respectively. Objective Eq. (4) can be solved by alternating optimization, yielding the final cluster assignment.
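The alternation for Eq. (4) can be sketched as follows. This is our own illustrative sketch: the Y-step uses the common row-wise argmax heuristic (as in multiclass spectral clustering) rather than an exact minimizer, and the R-step is the orthogonal Procrustes solution.

```python
import numpy as np

def spectral_rotation(F, n_iter=30):
    """Alternating sketch for Eq. (4): fix R and pick Y by a row-wise
    argmax heuristic, then fix Y and solve R by orthogonal Procrustes."""
    n, c = F.shape
    R = np.eye(c)
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Y-step: assign each point to the largest entry of its row of F R.
        G = F @ R
        labels = G.argmax(axis=1)
        Y = np.zeros((n, c))
        Y[np.arange(n), labels] = 1
        # Scaled indicator Y (Y^T Y)^{-1/2} (guard against empty clusters).
        Ys = Y / np.sqrt(np.maximum(Y.sum(axis=0), 1))
        # R-step: maximize Tr(R^T F^T Ys) -> R = U V^T from SVD(F^T Ys).
        U, _, Vt = np.linalg.svd(F.T @ Ys)
        R = U @ Vt
    return labels
```

Given an exact scaled indicator matrix multiplied by an arbitrary orthogonal matrix, this recovers the underlying partition (up to a relabeling of clusters).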

THE PROPOSED MODEL
In this section, we formulate the model objective function of KMSR and derive its optimization method. Besides, the complexity and convergence analysis are provided.

Model formulation
By introducing two matrices U = [µ_1, µ_2, ..., µ_c] ∈ R^{d×c} and Y (Y ∈ B^{n×c}, Y1_c = 1_n) to respectively represent the cluster centroids and indices, the K-means objective in Eq. (1) can be reformulated as

min_{U, Y} ||X − UY^T||_F^2.   (5)

Since the optimal solution for U is U = XY(Y^T Y)^{−1}, we have Tr(YU^T UY^T) = Tr(X^T UY^T) = Tr(F^T X^T X F), and then Eq. (5) can be written as

max_Y Tr(F^T X^T X F),   (6)

where F ∈ R^{n×c} and F = Y(Y^T Y)^{−1/2}. A usual way to solve Eq. (6) is to relax the binary constraint on F to the real domain while keeping its orthogonality intact. Then we obtain the following formulation

max_{F^T F = I} Tr(F^T X^T X F).   (7)

Note that F in Eq. (7) is a continuous relaxation which preserves the orthogonality but loses the discrete nature of Y. If we use K-means to find the cluster assignment, i.e., to jointly find the cluster indicator matrix Y and centroids C by min_{Y ∈ B^{n×c}, Y1_c = 1_n, C} ||F − YC||_F^2, it can only guarantee that YC best approximates the relaxed continuous matrix F; it cannot guarantee that the yielded Y best approximates the true indicator matrix. As mentioned in 'Spectral clustering', the relaxed solution to Eq. (7) is not unique: for any solution F, FQ is another solution where Q is an arbitrary orthonormal matrix. The goal of spectral rotation is to find a proper Q such that the resulting FQ is closer to the discrete indicator matrix solution set than the F used in K-means. Therefore, we can take the idea of spectral rotation into account and perform the postprocessing operation on the optimal F* of Eq. (7) to obtain the final cluster indicator matrix. Inspired by Chen et al. (2017), we aim at finding an orthonormal matrix Q ∈ R^{c×c} that minimizes the discrepancy between Y(Y^T Y)^{−1/2} and F*Q as the postprocessing step. Mathematically, this can be achieved by solving the following objective

min_{Y, Q} ||Y(Y^T Y)^{−1/2} − F*Q||_F^2   s.t.  Y ∈ B^{n×c}, Y1_c = 1_n, Q^T Q = I.   (8)

From the above analysis, an intuitive way to handle a spectral clustering task is to get F* by solving Eq. (7) and then obtain an appropriate Q from Eq. (8) to recover the final cluster indicator matrix Y. To avoid the sub-optimality caused by such a two-step process, in this paper, we propose to jointly optimize the objectives of Eq. (7) and Eq. (8), which respectively correspond to spectral embedding and rotation, leading to the following new K-means formulation (termed KMSR):

max_{F, Q, Y} Tr(F^T X^T X F) − λ||Y(Y^T Y)^{−1/2} − FQ||_F^2   s.t.  F^T F = I, Q^T Q = I, Y ∈ B^{n×c}, Y1_c = 1_n,   (9)

where λ > 0 is a regularization parameter to control the balance between the two terms.
In spectral clustering, the normalized Laplacian matrix L_n is defined as

L_n = I − D^{−1/2} A D^{−1/2}.   (10)

If we replace L in Eq. (3) with L_n, it becomes the objective function of the normalized cut. Since Tr(F^T F) is a constant, we can find an interesting point: there exists an equivalence between K-means clustering and the normalized cut. That is, the graph affinity matrix in K-means clustering employs the simple dot-product weighting scheme, i.e., X^T X, while it is Ã = D^{−1/2} A D^{−1/2} in the normalized cut. We know that graph quality plays an important role in spectral clustering, and we sometimes need to learn a more robust graph to characterize the connection among data points than simply constructing one by fixed rules. Therefore, to further enhance the model performance, we generalize KMSR by introducing Ã as the graph matrix; that is, we can incorporate more advanced graphs instead of the simple dot-product weighting scheme.
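Constructing the symmetrically normalized affinity Ã = D^{−1/2} A D^{−1/2} from any affinity matrix A is a one-liner; a minimal sketch (function name is ours):

```python
import numpy as np

def normalized_affinity(A):
    """Build A~ = D^{-1/2} A D^{-1/2}: the graph that KMSR-G consumes
    in place of the dot-product graph X^T X used by KMSR."""
    d = A.sum(axis=1)                              # degrees d_ii
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return A * np.outer(d_inv_sqrt, d_inv_sqrt)    # elementwise scaling
```

The normalized Laplacian of Eq. (10) is then simply `np.eye(len(A)) - normalized_affinity(A)`.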
We name the generalized model KMSR-G, which has the following objective function:

max_{F, Q, Y} Tr(F^T Ã F) − λ||Y(Y^T Y)^{−1/2} − FQ||_F^2   s.t.  F^T F = I, Q^T Q = I, Y ∈ B^{n×c}, Y1_c = 1_n.   (11)

Obviously, KMSR-G is a general model that accommodates any predefined (or pre-learned) graph, and it is expected to achieve better performance than KMSR, especially on complicated data sets. Figure 1 intuitively shows the framework of our proposed models: KMSR jointly performs spectral embedding and rotation on a specified graph, while KMSR-G generalizes it to accommodate other advanced graphs, leading to better clustering performance.

Model optimization
The only difference between the model objectives of KMSR in Eq. (9) and KMSR-G in Eq. (11) is the graph affinity matrix; therefore, they share an identical optimization procedure. Taking KMSR-G as an example, we show its detailed optimization steps based on the alternating framework (Tang et al., 2020); that is, we update one variable while fixing the others.
Update Q with Y and F fixed. The sub-objective associated with Q is

min_{Q^T Q = I} ||M_1 − FQ||_F^2,

where M_1 = Y(Y^T Y)^{−1/2}. Since ||M_1 − FQ||_F^2 = Tr(M_1^T M_1) + Tr(Q^T F^T F Q) − 2Tr(M_1^T F Q), and the first two terms are constants under the constraints, it can be further reformulated into

max_{Q^T Q = I} Tr(M_2 Q),

where M_2 = M_1^T F. Suppose that the singular value decomposition of M_2 is M_2 = UΣV^T; then we have

Tr(M_2 Q) = Tr(UΣV^T Q) = Tr(ΣV^T Q U) = Tr(ΣE) = Σ_{i=1}^{c} σ_ii e_ii,

where E = V^T Q U, with σ_ii and e_ii the (i,i)th elements of Σ and E, respectively. Since E is orthonormal, e_ii ≤ 1, so Tr(M_2 Q) ≤ Σ_{i=1}^{c} σ_ii, and the equality holds when e_ii = 1 (1 ≤ i ≤ c). That is to say, Tr(M_2 Q) reaches its maximum when E = I_c = V^T Q U. Then we obtain the optimal solution of Q as Q = VU^T.

Update F with Y and Q fixed. The sub-objective associated with F is

max_{F^T F = I} Tr(F^T Ã F) − λ||M_1 − FQ||_F^2,   (17)

where the symmetric matrix Ã ∈ R^{n×n} and M_1 = Y(Y^T Y)^{−1/2} ∈ R^{n×c}. Since Q ∈ R^{c×c} is a square orthonormal matrix and F^T F = I_c, the term Tr(Q^T F^T F Q) in the expansion of Eq. (17) is a constant, and we can get the simplified version of objective Eq. (17) as

max_{F^T F = I} Tr(F^T Ã F) + 2λ Tr(F^T M_1 Q^T).   (19)

The corresponding Lagrangian function of problem Eq. (19) is

L(F, Λ) = Tr(F^T Ã F) + 2λ Tr(F^T M_1 Q^T) − Tr(Λ(F^T F − I)),

where Λ is a Lagrangian multiplier in matrix form. The resulting KKT condition is difficult to solve directly. Essentially, problem Eq. (19) is a relaxed form of the quadratic optimization problem on the Stiefel manifold (QPSM). In optimization theory, the standard form of QPSM is min_{P^T P = I_k} Tr(P^T H P − 2P^T K), where P ∈ R^{m×k}, K ∈ R^{m×k}, and the symmetric matrix H ∈ R^{m×m}. This objective can be relaxed into max_{P^T P = I_k} Tr(P^T H̃ P) + 2Tr(P^T K) by introducing H̃ = αI_m − H ∈ R^{m×m}, which matches the form of Eq. (19). Inspired by the work of Nie, Zhang & Li (2017), we employ the generalized power iteration (GPI) method to optimize Eq. (19) and summarize the detailed procedure in Algorithm 1.
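The Q-step above is a closed-form orthogonal Procrustes solution; a minimal sketch (function name is ours):

```python
import numpy as np

def update_Q(F, Y):
    """Closed-form Q-step: with M1 = Y(Y^T Y)^{-1/2} and M2 = M1^T F,
    the maximizer of Tr(M2 Q) over orthonormal Q is Q = V U^T,
    where M2 = U S V^T is the singular value decomposition."""
    M1 = Y / np.sqrt(np.maximum(Y.sum(axis=0), 1))  # column-scaled Y
    M2 = M1.T @ F
    U, _, Vt = np.linalg.svd(M2)
    return Vt.T @ U.T  # V U^T
```

At the optimum, Tr(M_2 Q) equals the sum of the singular values of M_2, matching the bound derived in the text.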

Algorithm 1 The GPI-based optimization to problem (19)
Input: symmetric matrix Ã ∈ R^{n×n}, orthogonal matrix Q ∈ R^{c×c}, indicator matrix Y ∈ B^{n×c}, regularization parameter λ; Output: the scaled indicator matrix F.
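The body of Algorithm 1 can be sketched as the standard GPI iteration. This is our own sketch under two stated assumptions: we shift Ã by αI to make the quadratic term positive definite (this only adds the constant αc to the objective, so the maximizer is unchanged), and we initialize F randomly rather than from a warm start.

```python
import numpy as np

def gpi_update_F(A_tilde, Q, Y, lam, n_iter=50, seed=0):
    """GPI sketch (after Nie, Zhang & Li, 2017) for the F-subproblem
    max_{F^T F = I} Tr(F^T A~ F) + 2*lam*Tr(F^T M1 Q^T) of Eq. (19)."""
    n, c = A_tilde.shape[0], Q.shape[0]
    M1 = Y / np.sqrt(np.maximum(Y.sum(axis=0), 1))
    K = lam * (M1 @ Q.T)
    # Positive-definite shift: argmax is unchanged, GPI's condition holds.
    alpha = np.abs(np.linalg.eigvalsh(A_tilde)).max() + 1e-6
    H = A_tilde + alpha * np.eye(n)
    F = np.linalg.qr(np.random.default_rng(seed).standard_normal((n, c)))[0]
    for _ in range(n_iter):
        M = 2 * H @ F + 2 * K
        # Polar-decomposition step: F <- U V^T from the compact SVD of M.
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        F = U @ Vt
    return F
```

Each iteration keeps F on the Stiefel manifold (F^T F = I) by construction, since U V^T always has orthonormal columns.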
Update Y with F and Q fixed. Similar to the optimization process of updating Q, the sub-objective associated with Y is

min_{Y ∈ B^{n×c}, Y1_c = 1_n} ||Y(Y^T Y)^{−1/2} − FQ||_F^2.   (22)

Denote G = FQ; then optimizing Eq. (22) is equivalent to optimizing the following one

max_{Y ∈ B^{n×c}, Y1_c = 1_n} Tr((Y^T Y)^{−1/2} Y^T G).   (23)

Motivated by Chen et al. (2018), objective Eq. (23) can be represented as

max_{Y ∈ B^{n×c}, Y1_c = 1_n} Σ_{j=1}^{c} y_j^T g_j / √(y_j^T y_j),   (24)

where y_j and g_j denote the jth columns of Y and G, respectively. Since y_j^T y_j involves all rows of Y, we can solve Y row-wise; that is, we can update one row of Y by fixing the others as constants. Suppose we have obtained the solution Ȳ in the last iteration, with the corresponding objective function value J_old(Ȳ). The elements of each row vector y^i are composed of 1 or 0, where the unique 1 indicates the cluster membership of the ith data point. To solve the ith row y^i, we only need to consider the increment of the objective function value from y_ij = 0 to y_ij = 1, which can be calculated as

s_ij = (y_j^T g_j + g_ij) / √(y_j^T y_j + 1) − y_j^T g_j / √(y_j^T y_j),   (25)

where the ith row is excluded from y_j when computing y_j^T y_j and y_j^T g_j; a graphical illustration is given in Fig. 2.
Then it can be verified that the optimal solution of y^i is

y_ij = [[ j = argmax_{j'} s_ij' ]],   (26)

where [[·]] is 1 if the argument is true and 0 otherwise, and s_ij is defined by Eq. (25). As a whole, we summarize the complete procedure for solving the objective function Eq. (11) of KMSR-G in Algorithm 2.

Algorithm 2 The optimization procedure for the KMSR-G objective function in (11).
Input: Data matrix X ∈ R^{d×n}, the number of clusters c, and the regularization parameter λ;
Output: The binary cluster indicator matrix Y ∈ B^{n×c}.
1: Construct the graph similarity matrix A ∈ R^{n×n} and its normalized version Ã = D^{−1/2} A D^{−1/2};
2: Compute the normalized graph Laplacian matrix L_n = I − Ã, where D ∈ R^{n×n} is the diagonal degree matrix with its ith diagonal element d_ii = Σ_{j=1}^n a_ij;
3: Form the matrix F* by stacking the eigenvectors of L_n corresponding to its c smallest eigenvalues;
4: Initialize Y according to Y* = diag(F*F*^T)^{−1/2} F* and y_ij = [[ j = argmax_{j' ∈ [1,c]} y*_ij' ]];
5: Initialize Q randomly such that Q^T Q = I_c;
6: while problem (11) has not converged do
7:   Update Q via the SVD-based closed form;
8:   Update F by Algorithm 1;
9:   Calculate G = FQ;
10:  while problem (24) has not converged do
11:    Update Y row by row according to Eq. (26);
12:  end while
13: end while
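The inner Y-loop of Algorithm 2 (the row-wise coordinate updates of Eqs. (24)-(26)) can be sketched as follows; the function name and sweep limit are our own choices.

```python
import numpy as np

def update_Y(G, Y, n_sweeps=10):
    """Row-wise update of the binary indicator Y to increase
    sum_j y_j^T g_j / sqrt(y_j^T y_j) (Eq. (24)), deciding each row
    by the largest increment s_ij of Eq. (25)."""
    n, c = Y.shape
    Y = Y.copy()
    for _ in range(n_sweeps):
        changed = False
        for i in range(n):
            old = int(np.argmax(Y[i]))
            Y[i] = 0  # take row i out, then score each candidate cluster
            yty = Y.sum(axis=0)        # y_j^T y_j without row i
            ytg = (Y * G).sum(axis=0)  # y_j^T g_j without row i
            base = ytg / np.sqrt(np.maximum(yty, 1e-12))
            s = (ytg + G[i]) / np.sqrt(yty + 1) - base  # Eq. (25)
            j = int(np.argmax(s))                        # Eq. (26)
            Y[i, j] = 1
            changed |= (j != old)
        if not changed:
            break  # a full sweep left Y unchanged: inner loop converged
    return Y
```

Caching `y_j^T y_j` and `y_j^T g_j` as running sums (rather than recomputing per row, as this plain sketch does) gives the O(nc) cost claimed in the complexity analysis.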

Model complexity and convergence analysis
In terms of computational complexity, if we ignore the specific graph construction process in KMSR-G, KMSR and KMSR-G share similar complexities because they involve the same optimization procedure. The complexity analysis below is based on Algorithm 2.
• Updating the variable F. We use the generalized power iteration method to update F in Eq. (19). According to the analysis in Nie, Zhang & Li (2017), the complexity of updating F is O(n^2 c).
• Updating the variable Q. The complexity of updating Q mainly comes from the singular value decomposition of M_2 ∈ R^{c×c}, which has the complexity of O(c^3).
• Updating the variable Y. We need O(nc) time to obtain Y because we deal with Y row by row. What's more, y_j^T y_j and y_j^T g_j can be calculated before solving Y and incrementally updated after solving each row y^i according to Eq. (26).
Assuming that T_1 is the maximum number of iterations of KMSR-G, and r_1 and r_2 are the average numbers of iterations to update F and Y respectively, the overall computational complexity of KMSR-G is O(T_1(n^2 c r_1 + c^3 + n c r_2)). In general cases, we have c < T_1 << n. Compared with spectral clustering methods, which usually have a time complexity of O(n^3) (n is the number of samples), KMSR and KMSR-G have a lower computational complexity in dealing with large-sized data sets.
Obviously, KMSR and KMSR-G have similar convergence properties due to their identical optimization procedure. Here we also give the analysis based on Algorithm 2. When solving the variable F, the GPI method is utilized, whose optimization procedure is summarized in Algorithm 1. According to the appendix of Nie, Zhang & Li (2017), the GPI method converges to a global minimum of the quadratic problem on the Stiefel manifold, which guarantees the convergence of Algorithm 1 in updating F. When updating the variable Q, the analytical solution is obtained directly via the singular value decomposition. For the variable Y, we optimize it in a row-by-row manner according to Eq. (26) because the membership of each sample can be determined independently, so the updating of Y decomposes into n independent subproblems. Further, since each row of Y is a binary vector in one-hot encoding, each subproblem has finitely many candidate solutions and therefore an optimal one, which guarantees the convergence of the updating of Y. In summary, we conclude that the optimization of KMSR-G is expected to converge over the iterations.

Discussions
We can find that the only difference between KMSR and KMSR-G is how the similarity matrix (also called the affinity matrix) is represented in the respective objective functions. If the normalized graph affinity matrix Ã is represented as X^T X, KMSR-G degenerates to KMSR. It is obvious that clustering performance heavily depends on the quality of the input data graph in graph-based clustering. In the experiments, we adopt one rule-based method (the 'Heatkernel' weighting scheme) and two learning-based methods (CAN and PCAN (Nie, Wang & Huang, 2014)) to obtain three different affinity matrices, and then examine how much the graph quality influences the performance of KMSR-G. Therefore, this paper proposes two models, KMSR and KMSR-G, where the latter can be seen as the generalization of the former. As a summary, below we list the main contributions of this paper.
• We propose a novel K-means formulation (termed KMSR) by exploring the underlying equivalence between K-means clustering and spectral clustering, achieved by jointly performing spectral embedding and rotation. Mathematically, the objective of KMSR consists of two terms, which respectively calculate the scaled cluster indicator matrix and perform discretization to obtain the final discrete cluster indicator matrix. Compared with K-means clustering, the randomness is effectively alleviated in KMSR because it jointly searches for an optimal rotation in the discretization process.
• By investigating the connection between the graph affinity matrices respectively employed in KMSR and spectral clustering, we generalize the KMSR model to accommodate more advanced graphs, and formulate the KMSR-G model. KMSR-G is also a unified model for jointly completing the spectral embedding and rotation steps, and it is often expected to obtain performance superior to KMSR.
• We propose an efficient algorithm to optimize the objective function of KMSR-G (also KMSR since they share the same optimization procedure). In the iterative procedure, there are three blocks respectively corresponding to the three variables (i.e., F, Q and Y) involved in KMSR-G. They are co-optimized toward the optimum. Besides, we provide detailed computational complexity and convergence analysis to the optimization algorithm.
• To evaluate the performance of KMSR and KMSR-G in data clustering, we conduct extensive experiments on twelve representative benchmark data sets. The experimental results show that both KMSR and KMSR-G perform better than the closely related counterparts.

EXPERIMENTS
In this section, we conduct experiments on twelve representative benchmark data sets to evaluate the performance of the proposed KMSR and KMSR-G models in data clustering.

Evaluation metrics
To evaluate the clustering results, we compare the obtained label of each sample with the label provided by the data set. We use three popular metrics, i.e., Accuracy (Acc), Normalized Mutual Information (NMI) and Purity (Huang, Nie & Huang, 2015), to measure the clustering performance of different models. Below we give the definitions of these three metrics in turn. Given a data point x_i, we use r_i to denote the obtained cluster label and s_i to denote the ground-truth label provided by the data set. Then Acc is defined as

Acc = Σ_{i=1}^{n} δ(s_i, map(r_i)) / n,

where n is the sample size, δ(x, y) is the indicator function that equals one if x = y and zero otherwise, and map(r_i) is the permutation mapping function which maps each cluster label r_i to the equivalent class label from the data set. Let C denote the set of clusters obtained from the ground truth and C' denote the set of clusters obtained from the given model. Then the mutual information MI(C, C') is defined as

MI(C, C') = Σ_{c_i ∈ C} Σ_{c'_j ∈ C'} p(c_i, c'_j) log( p(c_i, c'_j) / (p(c_i) p(c'_j)) ),

where p(c_i) and p(c'_j) denote the probabilities that a sample arbitrarily selected from the data set belongs to the clusters c_i and c'_j, respectively, and p(c_i, c'_j) is the joint probability that the selected sample belongs to the clusters c_i and c'_j simultaneously. The NMI is then given as

NMI(C, C') = MI(C, C') / max(H(C), H(C')),

where H(C) and H(C') denote the entropies of C and C', respectively. To compute the Purity, each cluster is assigned to the class which is most frequent in the cluster, and the accuracy of this assignment is measured by counting the number of correctly assigned points and dividing by n:

Purity = (1/n) Σ_{j=1}^{c} max_i n_i^j,

where c is the number of clusters, n is the total number of data points, and n_i^j is the number of points of the ith input class assigned to the jth cluster.
It is easy to check that all Acc, NMI and Purity metrics range from zero to one and a higher value indicates a better clustering result.
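The three metrics above are straightforward to implement; a minimal sketch (function names are ours; the `map(.)` step in Acc is realized by brute-forcing label permutations, which is adequate for the small numbers of clusters used here, while large-scale code would use the Hungarian algorithm instead):

```python
import numpy as np
from itertools import permutations

def purity(true, pred):
    """Purity: map each predicted cluster to its most frequent true
    class and count the fraction of correctly assigned points."""
    true, pred = np.asarray(true), np.asarray(pred)
    hits = sum(np.bincount(true[pred == j]).max() for j in np.unique(pred))
    return hits / len(true)

def accuracy(true, pred):
    """Acc with map(.): brute-force the best permutation matching
    between cluster labels and class labels (fine for small c)."""
    true, pred = np.asarray(true), np.asarray(pred)
    clusters = np.unique(pred)
    best = 0.0
    for perm in permutations(np.unique(true), len(clusters)):
        mapping = dict(zip(clusters, perm))
        best = max(best, np.mean([mapping[p] == t for p, t in zip(pred, true)]))
    return best

def nmi(true, pred):
    """NMI(C, C') = MI(C, C') / max(H(C), H(C'))."""
    true, pred = np.asarray(true), np.asarray(pred)
    n = len(true)
    mi = 0.0
    for a in np.unique(true):
        for b in np.unique(pred):
            p_ab = np.mean((true == a) & (pred == b))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(true == a) * np.mean(pred == b)))
    ent = lambda x: -sum(p * np.log(p) for p in np.bincount(x) / n if p > 0)
    return mi / max(ent(true), ent(pred), 1e-12)
```

All three scores are invariant to a relabeling of the predicted clusters, which is why a perfectly clustered but permuted labeling still scores 1.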

Data sets and experimental settings
Twelve real-world data sets were used in the following experiments, including nine image data sets (COIL20, umist, AT&T, YaleB, Yale, PIE, AR, MNIST and jaffe) and three non-image data sets (ecoli, abalone and scale). Their basic characteristics, including sample size, dimensionality, and number of clusters, are summarized in Table 1.
The following experiments can be divided into two parts. The former part demonstrates the effectiveness of the KMSR model in comparison with the traditional one; the latter part evaluates the effectiveness of the generalized KMSR-G model and provides some insight into the influence of graph quality on clustering performance. PART 1. Experimental settings to evaluate the performance of the KMSR model. To investigate the effectiveness of the KMSR model, we perform a pairwise comparison between KMSR and the traditional K-means clustering. Since K-means clustering is sensitive to the initialization, we independently repeat it 50 times. For our KMSR model, we repeat it 20 times. The number of clusters is set to the ground-truth number. Since there is a free regularization parameter in the proposed KMSR model, we tune it over the candidate values {10^-3, 10^-2, ..., 10^3} to let it achieve the best results. PART 2. Experimental settings to evaluate the performance of the KMSR-G model. First, we compare the generalized model KMSR-G with two closely related spectral clustering algorithms, NCut and RCut, to evaluate its effectiveness. Two commonly used postprocessing methods (i.e., K-means and spectral rotation) are adopted in spectral clustering; therefore, we implement these two versions of NCut, respectively named NCut+KM and NCut+SR. Similarly, we have two corresponding versions of RCut, RCut+KM and RCut+SR. The affinity matrix in these models is constructed by the 'Heatkernel' function, in which the number of nearest neighbors is set to five and the bandwidth parameter is set to one. Second, by taking the two more advanced learning-based graph affinity matrices, CAN and PCAN, as the input graphs to KMSR-G, we respectively obtain two variant models termed KMSR-GC and KMSR-GPC. The neighborhood parameter k required in CAN and PCAN is tuned from {5, 10, 15, 20}.
The CAN and PCAN methods are run only once since their clustering results are stable (Nie, Wang & Huang, 2014). Both KMSR-GC and KMSR-GPC are repeated 20 times. We use CAN and PCAN as graph learning methods in our experiments for two reasons. One is that both are joint models for graph construction and scaled cluster indicator matrix learning, leading to superior graph quality. The other is that they can adaptively determine the number of neighbors in graph construction. For all the above-mentioned models, the number of clusters is set to the ground-truth number of each data set, and the clustering performance is evaluated with the metrics Acc, NMI and Purity. Besides, we set the maximum numbers of iterations for updating F and Y in our proposed models to 50 and 10, respectively. The free regularization parameters in related models are tuned over {10^-3, 10^-2, ..., 10^2, 10^3} to let the models achieve their best results. The average clustering results and standard deviations are reported for comparison.

Experimental results and analysis
Here we show the experimental results and provide the corresponding analysis. PART 1. The experimental results of the pairwise comparison between K-means and KMSR are shown in Table 2, which includes the average results and standard deviations over multiple runs. From this table, we can find that the best results, highlighted in boldface, are all from the proposed KMSR model. Therefore, we can conclude that KMSR significantly outperforms K-means on all the used data sets in terms of all three clustering evaluation metrics, which indicates the effectiveness of jointly performing spectral embedding and rotation.
Besides the mean values, we can observe that the standard deviations of KMSR on all the data sets are much smaller than those of K-means, which is shown more explicitly by the statistical box diagrams in Fig. 3. This means that KMSR is superior to K-means in model stability. We attribute this improvement to the optimization of the orthonormal rotation matrix in place of the random initialization of cluster centroids in K-means clustering.
PART 2. By constructing the graph affinity matrix with the 'Heatkernel' scheme, we show the experimental results of comparing KMSR-G with the four closely related models, NCut+KM, NCut+SR, RCut+KM and RCut+SR, in Table 3, where the best results are highlighted in boldface. It is obvious that KMSR-G obtains better clustering performance than the other models in most cases. For example, KMSR-G performs particularly well on the ecoli, COIL20, AT&T, PIE and MNIST data sets, achieving improvements of 7.78%, 3.76%, 2.7%, 12.83%, and 4.05% respectively over the second-best method in terms of the Acc metric.
Besides the theoretical analysis of the convergence of KMSR-G in 'Model complexity and convergence analysis', we empirically show the decrease of its objective function values on the six data sets of abalone, scale, umist, YaleB, PIE and jaffe in Fig. 4, where the results are obtained with the regularization parameter λ set to 10^-1. We can find that KMSR-G has the desired convergence property and usually converges within a few iterations. In the description of experimental settings, we mentioned that the regularization parameter λ is tuned over the candidate values {10^-3, 10^-2, ..., 10^3}. Here we explore the impact of this parameter on the clustering performance of KMSR-G. We show the clustering accuracy of KMSR-G as λ varies on eight of the used data sets in Fig. 5. Generally, it shows that KMSR-G prefers a small value of λ to achieve better clustering accuracy.
The optimization of the binary cluster indicator matrix Y is also iterative (i.e., the inner loop in Algorithm 2). Taking the AT&T and COIL20 data sets as examples, we show the number of iterations needed to update Y in Fig. 6. Since the number of clusters in each data set is usually small, updating Y is a very fast process, usually taking fewer than 10 iterations.
Besides the experiments on the rule-based graph (i.e., 'Heatkernel' function), we further try two learning-based graphs which are the CAN and PCAN (Nie, Wang & Huang, 2014). CAN can adaptively learn the graph affinity matrix from data by simultaneously considering the non-negativity, normalization and rank constraint (Peng et al., 2020) properties of a desirable graph. PCAN is its projected version, which takes the three-fold constraints into account in a subspace. We present the clustering results of CAN, KMSR-GC, PCAN and KMSR-GPC in Table 4. From the obtained results, we have the following two findings.
• By comparing the results in Tables 3 and 4, we can find that KMSR-G with CAN (or PCAN) obtains performance superior to that with the 'Heatkernel' function in most cases. This indicates that graph quality is the leading factor in graph-based clustering models. Even a single graph construction method functions differently on different data sets according to our experimental results.
• We can generally state that KMSR-GC is better than its baseline method CAN; similarly, KMSR-GPC outperforms PCAN on all the data sets. Though CAN and PCAN jointly perform graph learning and clustering, the indicator matrices they learn are still real-valued; therefore, a postprocessing step is necessary for discretization. This limitation is avoided in our proposed KMSR-G model, and thus improved performance is obtained.

CONCLUSION
In this paper, based on the connection between K-means clustering and spectral clustering, we proposed a new K-means formulation that jointly performs spectral embedding and spectral rotation. The formulated KMSR model can not only improve the clustering performance but also enhance the model stability in comparison with traditional K-means clustering. Further, the KMSR model was generalized to KMSR-G, which can take any predefined graph as input and output the final discrete cluster indicator matrix. An efficient method under the coordinate blocking framework was designed to optimize the proposed KMSR (KMSR-G) model objective. Extensive experiments were conducted on representative data sets to show the effectiveness of the proposed models in data clustering. As future work, we will consider unifying the three components of graph construction, spectral embedding and spectral rotation in graph-based clustering into a complete framework; that is, we will incorporate the graph learning process into the present KMSR-G model.