Fast Constrained Spectral Clustering and Cluster Ensemble with Random Projection

Constrained spectral clustering (CSC) can greatly improve clustering accuracy by incorporating constraint information into spectral clustering, and it has therefore attracted wide academic attention. In this paper, we propose a fast CSC algorithm that encodes landmark-based graph construction into a new CSC model and applies random sampling to reduce the data size after spectral embedding. Compared with the original model, the new algorithm yields asymptotically similar results as its model size increases; compared with the most efficient CSC algorithm known, it runs faster and suits a wider range of data sets. We also propose a scalable semisupervised cluster ensemble algorithm that combines our fast CSC algorithm with dimensionality reduction by random projection in the process of spectral ensemble clustering. Theoretical analysis and empirical results show that the new cluster ensemble algorithm is both efficient and effective. Furthermore, the approximate preservation of clustering accuracy under random projection, proved for the consensus clustering stage, also holds for weighted k-means clustering and thus gives a theoretical guarantee for this special kind of k-means clustering, in which each point has its own weight.


Introduction
With the arrival of the big data era, data has become an important asset, and analysing large-scale data efficiently is becoming a major challenge [1,2]. As an underlying method for data analysis, clustering partitions a data set into several subsets according to the similarities of points [3], and it has become a basic tool for image analysis [4,5], community detection [6,7], disease diagnosis [8], and so on. Therefore, more and more attention has been paid to the design of efficient and effective clustering algorithms.
Constrained clustering improves the accuracy of the clustering result by encoding constraint information into unsupervised clustering. It is an important area of clustering, and many constrained clustering algorithms [9][10][11][12][13][14][15][16][17] have been proposed. Since spectral clustering often has high clustering accuracy and suits a wide range of geometries [18,19], constrained spectral clustering (CSC) [11][12][13][14][15][16][17] usually performs better than other constrained clustering algorithms. However, the $O(n^2)$ space complexity and $O(n^3)$ time complexity of many CSC algorithms [11][12][13][14][15] restrict their application to large-scale data sets, where $n$ is the number of data points. The most efficient CSC algorithm known is the SCACS algorithm [16], which reduces the space and time complexities to be linear in $n$ by incorporating landmark-based graph construction [20,21] into the constrained normalized cuts problem [15]. Note, however, that the constrained normalized cuts problem [15] forces the SCACS algorithm to solve the generalized eigenvector problem twice. In 2016, Cucuringu et al. [17] proposed a new CSC algorithm that empirically achieves better accuracy and shorter running time than the constrained normalized cuts problem. By adopting a new encoding technique for constraint information, the new CSC model needs to compute eigenvectors only once.
By means of integrating many basic partitions into a unified partition, ensemble clustering has many excellent properties such as the improvement of clustering quality, the robustness and stability of clustering results, the handling of noise, the reuse of knowledge [3], and the suitability to multisource and heterogeneous data [22]. Researchers have proposed many ensemble clustering algorithms [22][23][24][25][26][27][28][29].
Since notation differs across the literature, in the following we refer to the integration of basic partitions as ensemble clustering or consensus clustering, and to the union of the basic clustering stage and the ensemble clustering stage as cluster ensemble. Among the different ensemble clustering methods, the method based on the coassociation matrix has become a landmark [22]. Specifically, the coassociation matrix is constructed to represent the similarities of pairs of points across the basic partitions, and the final partition is computed by applying a graph partition method to this matrix. Thus, this kind of method suffers from high space and time complexity. Recently, Liu et al. [22] equivalently transformed spectral clustering on the coassociation matrix into weighted $k$-means clustering over a specific binary matrix, which vastly decreased the space and time complexities. However, when the number of basic partitions or clusters is large, the corresponding binary matrix is high dimensional.
In their seminal work, Johnson and Lindenstrauss [30] pointed out that the random projection produced by a random orthogonal matrix can approximately preserve the pairwise distances of a data set in reduced dimensions. Subsequently, many studies constructed further matrices with this property: the random Gaussian matrix [31], the random sign matrix [32], the random matrix based on the randomized Hadamard transform [33], the random matrix based on block random hashing [34], and so on. In addition, dimensionality reduction with random projection has been widely applied to data mining methods such as classification [35], clustering [36][37][38], and anomaly detection [39]. In terms of the objective function, several works [36][37][38] prove that random projection approximately maintains the accuracy of $k$-means clustering. Since the objective function of weighted $k$-means clustering differs from that of $k$-means clustering, theoretical analysis of the influence of random projection on weighted $k$-means clustering is still scarce.
Our Contribution. Our contributions fall into three parts: the first is a fast CSC algorithm that suits a wide range of data sets; the second is an analysis of the effect of random projection on spectral ensemble clustering; the third is a scalable semisupervised cluster ensemble algorithm. More specifically, the contributions are as follows:
(i) We propose a fast CSC algorithm whose space and time complexities are linear in the size of the data set: we compress the original model proposed by Cucuringu et al. [17] by encoding landmark-based graph construction into it, and we further improve efficiency via random sampling in the process of $k$-means clustering. In addition, we prove that the new CSC algorithm asymptotically yields a clustering result comparable to that of the original model. Experimental results show that the new algorithm not only utilizes the constraint information effectively, but also costs less running time and fits a wider range of data sets than the state-of-the-art SCACS method.
(ii) With respect to the change in the objective function caused by random projection, we give a detailed proof that random projection keeps the clustering quality of spectral ensemble clustering within a small factor. Based on this theoretical analysis, we design a spectral ensemble clustering algorithm whose dimensions are reduced by sparse random projection. Experiments over different data sets also verify our theoretical results. Moreover, since the theoretical analysis also applies to ordinary weighted $k$-means clustering, the influence of random projection on weighted $k$-means clustering is obtained as well.
(iii) We propose a scalable semisupervised cluster ensemble algorithm by combining the fast CSC algorithm with the spectral ensemble clustering algorithm with random projection. The efficiency and effectiveness of the new cluster ensemble algorithm are also demonstrated theoretically and empirically.
The remainder of the paper is organized as follows. In Section 2, we introduce the CSC model of Cucuringu et al. [17], landmark-based graph construction, and two related components of our cluster ensemble algorithm: spectral ensemble clustering and random projection. In Section 3, we present our fast CSC algorithm and give its asymptotic property. The algorithm formulation and theoretical analysis of spectral ensemble clustering with random projection are presented in Section 4. In Section 5, we show the experimental results of our algorithms. Finally, we draw conclusions and put forward future directions in Section 6.

Preliminaries
In this section, we present the CSC algorithm proposed by Cucuringu et al. [17] and introduce landmark-based graph construction [20,21] which will be applied to our fast CSC algorithm. In addition, we also introduce spectral ensemble clustering algorithm [22] and sparse random projection [34] which can be used to speed up the spectral ensemble clustering.

Constrained Spectral Clustering.
Here, we first introduce the notion of an undirected graph, which is central to constrained spectral clustering, and then show the CSC model proposed by Cucuringu et al. [17]. Let $G = (V, E, w)$ be an undirected graph, where $V = \{v_1, v_2, \ldots, v_n\}$ is the vertex set, $E$ is the edge set, and $w$ is the weight set with respect to the edges. Specifically, $w_{ij} = w_{ji}$ is the nonnegative weight of the edge between the vertices $v_i$ and $v_j$, indicating the level of "affinity" between $v_i$ and $v_j$. If $w_{ij} = 0$, there is no edge between the vertices $v_i$ and $v_j$. We denote by $\mathbf{L}_G = \mathbf{D} - \mathbf{W}$ the Laplacian matrix of $G$, where the diagonal matrix $\mathbf{D}$ has diagonal entries $\mathbf{D}(i,i) = \sum_{j \neq i} w_{ij}$ and $\mathbf{W}$ is the adjacency matrix with $\mathbf{W}(i,j) = \mathbf{W}(j,i) = w_{ij}$.
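The Laplacian definition above can be illustrated with a minimal sketch. The toy weights and the helper names `laplacian` and `quadratic_form` are assumptions made for this illustration only; they are not part of the original work. The sketch also checks the standard identity that, for a 0/1 indicator vector $\mathbf{x}$, the quadratic form $\mathbf{x}^T \mathbf{L}_G \mathbf{x}$ equals the total weight of the edges leaving the indicated cluster.

```python
def laplacian(W):
    """Return L = D - W for a symmetric, nonnegative weight matrix W."""
    n = len(W)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # D(i, i) is the total weight of the edges incident to vertex i.
        degree = sum(W[i][j] for j in range(n) if j != i)
        for j in range(n):
            L[i][j] = degree if i == j else -W[i][j]
    return L


def quadratic_form(L, x):
    """Return x^T L x."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))


# Hypothetical toy graph: a triangle v1-v2-v3 plus a weak edge v3-v4.
W = [
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.0],
]
L = laplacian(W)

# Indicator vector of the cluster {v1, v2, v3}; only the v3-v4 edge is cut.
x = [1.0, 1.0, 1.0, 0.0]
cut_weight = quadratic_form(L, x)
```

Here `cut_weight` evaluates to the weight of the single cut edge, which is why ratios of such quadratic forms can measure how well a partition respects the data graph and the constraint graphs.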
Constrained spectral clustering involves three undirected graphs: one data graph $G_D$ and two knowledge graphs $G_{ML}$ and $G_{CL}$. In the data graph $G_D = (V, E_D, w_D)$, each weight indicates the similarity level of the vertices of the corresponding edge. The "must-link" (ML) graph $G_{ML} = (V, E_{ML}, w_{ML})$ gives the "must-link" information of vertices: each edge in $G_{ML}$ indicates that the corresponding vertices should be in the same group, and the level of "must-link" belief is described by the weight. The "cannot-link" (CL) graph $G_{CL} = (V, E_{CL}, w_{CL})$ has components analogous to those of $G_{ML}$. The weights in the two knowledge graphs are nonnegative and are set according to constraint information such as prior knowledge. For example, assuming that the weights range from 0 to 1, if we know that points $v_1$ and $v_2$ are in the same group, their corresponding weight is $w_{ML,12} = 1$. If we have only 40% confidence in the constraint information that the two points are in the same group, the weight is $w_{ML,12} = 0.4$, and if we have no constraint information about these two points, $w_{ML,12} = w_{CL,12} = 0$.
Viewing pairwise similarities of vertices as an implicit declaration of ML constraints, Cucuringu et al. [17] defined a generalized ML graph $\tilde{G}_D = (V, E_D \cup E_{ML}, w_D + \beta \cdot w_{ML})$, where $\beta$ is the level of trust in the ML constraints. Let $k$ be the number of clusters and $\mathbf{x}_{C_i}$ be the indicator vector of cluster $C_i$, such that $\mathbf{x}_{C_i}(j) = 1$ if the $j$th data point belongs to cluster $C_i$ and $\mathbf{x}_{C_i}(j) = 0$ otherwise. In order to violate as few ML constraints as possible and meet as many CL constraints as possible, the constrained $k$-way cuts problem [17] can be described as
$$\arg\min_{\mathbf{x}_{C_1}, \ldots, \mathbf{x}_{C_k}} \max_{i} \frac{\mathbf{x}_{C_i}^T \mathbf{L}_{\tilde{G}_D} \mathbf{x}_{C_i}}{\mathbf{x}_{C_i}^T \mathbf{L}_{G_{CL}} \mathbf{x}_{C_i}} \quad \text{s.t. } \mathbf{x}_{C_i} \in \{0, 1\}^n, \ \sum_{i=1}^{k} \mathbf{x}_{C_i} = \{1\}^n. \quad (1)$$
To solve the problem in (1) approximately, Cucuringu et al. [17] relaxed the condition "$\mathbf{x}_{C_i} \in \{0, 1\}^n$, $\sum_{i=1}^{k} \mathbf{x}_{C_i} = \{1\}^n$" to allow real vectors. Thus, the solution vectors of the relaxed problem are the first $k$ nontrivial generalized eigenvectors of the problem
$$\mathbf{L}_{\tilde{G}_D} \mathbf{x} = \lambda \mathbf{L}_{G_{CL}} \mathbf{x}. \quad (2)$$
After the generalized eigenvectors are obtained, an additional embedding phase embeds the row vectors of the eigenvector matrix onto the $k$-dimensional sphere and provides the theoretical guarantees on the clustering result. The detailed embedding procedure can be found in [17]. However, the construction and storage costs of the data graph for large-scale data sets are both huge ($O(n^2)$). Moreover, if $k$-means clustering on the embedded eigenvector matrix needs many iterations, this step is also time-consuming over large-scale data sets.

Landmark-Based Graph Construction.
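Based on sparse coding theory [40], landmark-based graph construction [20,21] scales linearly with the number of data points and suits large-scale data sets very well. Let the data set be $\mathbf{A} \in \mathbb{R}^{n \times d}$ and let the row vectors $\mathbf{a}_i$ of $\mathbf{A}$ be the data points; the sparse coding problem is defined as follows:
$$\min_{\mathbf{U}, \mathbf{Z}} \|\mathbf{A}^T - \mathbf{U}\mathbf{Z}\|_F^2, \quad (3)$$
where each column vector of $\mathbf{U} \in \mathbb{R}^{d \times p}$ is a basis vector, the column vectors of $\mathbf{Z} \in \mathbb{R}^{p \times n}$ are the representations of the data points over $\mathbf{U}$, and $p$ is the number of basis vectors. To avoid the high time complexity of solving the sparse coding problem, landmark-based graph construction simply samples $p$ points randomly from the input data $\mathbf{A}$ as basis vectors. In the process of computing $\mathbf{Z}$, if $\mathbf{u}_j$ is among the $r$ nearest basis vectors of the data point $\mathbf{a}_i$, $\mathbf{Z}(j, i)$ is computed as
$$\mathbf{Z}(j, i) = \frac{K_\sigma(\mathbf{a}_i, \mathbf{u}_j)}{\sum_{j' \in N(i, r)} K_\sigma(\mathbf{a}_i, \mathbf{u}_{j'})}, \quad (4)$$
where $N(i, r)$ is the index set of the $r$ nearest basis vectors of $\mathbf{a}_i$ and $K_\sigma(\cdot)$ is the Gaussian kernel function with bandwidth $\sigma$; otherwise $\mathbf{Z}(j, i) = 0$.
After obtaining the sparse representation $\mathbf{Z} \in \mathbb{R}^{p \times n}$, the graph affinity matrix is constructed as follows:
$$\mathbf{W} = \hat{\mathbf{Z}}^T \hat{\mathbf{Z}}, \quad (5)$$
where $\hat{\mathbf{Z}} = \mathbf{D}^{-1/2}\mathbf{Z}$ and $\mathbf{D}$ is a diagonal matrix with diagonal entries $\mathbf{D}(i, i) = \sum_j \mathbf{Z}(i, j)$. Since Chen and Cai [20,21] have pointed out that $\mathbf{W}$ is automatically normalized, the normalized graph Laplacian matrix for $\mathbf{A}$ is $\mathbf{I} - \hat{\mathbf{Z}}^T\hat{\mathbf{Z}}$. Considering $p \ll n$, the $O(npd)$ time of computing $\hat{\mathbf{Z}}$ is much less than the $O(n^2 d)$ time of nearest-neighbor graph construction.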
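The construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: the function names, the toy data, the bandwidth `sigma = 0.5`, and the kernel form $\exp(-\|\mathbf{a} - \mathbf{u}\|^2 / (2\sigma^2))$ are all choices made here. The dense matrix `W` is formed only to check the "automatically normalized" property (every row of $\hat{\mathbf{Z}}^T\hat{\mathbf{Z}}$ sums to 1); a scalable implementation would keep only the sparse $\hat{\mathbf{Z}}$.

```python
import math
import random


def landmark_representation(points, landmarks, r, sigma):
    """Return Z (p x n): column i holds kernel weights of the r nearest landmarks of point i."""
    p, n = len(landmarks), len(points)
    Z = [[0.0] * n for _ in range(p)]
    for i, a in enumerate(points):
        sq = [sum((x - y) ** 2 for x, y in zip(a, u)) for u in landmarks]
        nearest = sorted(range(p), key=lambda j: sq[j])[:r]
        kernel = {j: math.exp(-sq[j] / (2.0 * sigma ** 2)) for j in nearest}
        total = sum(kernel.values())
        for j in nearest:
            Z[j][i] = kernel[j] / total  # each column sums to 1
    return Z


def normalized_affinity(Z):
    """Return Z_hat = D^{-1/2} Z (D holds the row sums of Z) and W = Z_hat^T Z_hat."""
    p, n = len(Z), len(Z[0])
    row_sums = [sum(row) for row in Z]
    Z_hat = [
        [Z[j][i] / math.sqrt(row_sums[j]) if row_sums[j] > 0 else 0.0 for i in range(n)]
        for j in range(p)
    ]
    # Dense W is built for demonstration only; it costs O(n^2 p) time.
    W = [
        [sum(Z_hat[j][a] * Z_hat[j][b] for j in range(p)) for b in range(n)]
        for a in range(n)
    ]
    return Z_hat, W


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(30)]  # toy data
landmarks = rng.sample(points, 6)                           # p = 6 random landmarks
Z = landmark_representation(points, landmarks, r=3, sigma=0.5)
Z_hat, W = normalized_affinity(Z)
```

Because each column of `Z` sums to 1 and `Z_hat` divides by the square roots of the row sums, the rows of `W` sum to 1 without any further degree normalization.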

Spectral Ensemble Clustering.
To obtain a unified result from different basic partitions, spectral ensemble clustering applies spectral clustering to the coassociation matrix [24] derived from the basic partitions. In 2015, Liu et al. [22] transformed spectral ensemble clustering into weighted $k$-means clustering over a specific binary matrix. This transformation effectively decreases the time and space complexities, and our new ensemble clustering method is based on it.
Given the basic clustering results $\Pi = \{\pi_1, \pi_2, \ldots, \pi_g\}$ of the data set $\mathbf{A} \in \mathbb{R}^{n \times d}$, the coassociation matrix $\mathbf{C}$ is constructed in the following way:
$$\mathbf{C}(j, l) = \sum_{i=1}^{g} \delta(\pi_i(\mathbf{a}_j), \pi_i(\mathbf{a}_l)),$$
where $\pi_i(\mathbf{a}_j)$ is the label of $\mathbf{a}_j$ in the $i$th clustering result $\pi_i$, and
$$\delta(a, b) = 1 \text{ if } a = b, \quad \delta(a, b) = 0 \text{ otherwise}.$$
Viewing this coassociation matrix as an adjacency matrix, spectral ensemble clustering uses spectral clustering to obtain the final clustering result. In the transformation from spectral clustering to weighted $k$-means clustering, the binary matrix $\mathbf{B} = \{\mathbf{b}(\mathbf{a})\}$ [22] is built as follows:
$$\mathbf{b}(\mathbf{a}) = \langle \mathbf{b}(\mathbf{a})_1, \ldots, \mathbf{b}(\mathbf{a})_g \rangle, \quad \mathbf{b}(\mathbf{a})_i = \langle b(\mathbf{a})_{i1}, \ldots, b(\mathbf{a})_{ik_i} \rangle, \quad b(\mathbf{a})_{ij} = 1 \text{ if } \pi_i(\mathbf{a}) = j \text{ and } 0 \text{ otherwise},$$
where $\langle \cdot \rangle$ indicates a row vector and $k_i$ is the number of clusters of $\pi_i$. The following lemma [22] presents the connection between spectral ensemble clustering and weighted $k$-means clustering.
Lemma 1 (see [22]). Given a basic partitions set $\Pi$, let the corresponding coassociation matrix be $\mathbf{C}$, the diagonal matrix whose diagonal elements are the row sums of $\mathbf{C}$ be $\mathbf{D}_1$, and the diagonal element set of $\mathbf{D}_1$ be $\{w_{\mathbf{b}(\mathbf{a})}\}$. Then normalized cuts spectral clustering on the coassociation matrix $\mathbf{C}$ has an objective function equivalent to that of weighted $k$-means clustering on the data set $\{\mathbf{b}(\mathbf{a}) / w_{\mathbf{b}(\mathbf{a})}\}$ with the weight set $\{w_{\mathbf{b}(\mathbf{a})}\}$.
Through Lemma 1, the space and time complexities of spectral ensemble clustering can be decreased dramatically. However, when the number of basic partitions and the number of clusters are large, the binary matrix $\mathbf{B}$ is a high dimensional data set, resulting in a long running time for weighted $k$-means clustering.
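The relationship between the coassociation matrix and the binary matrix can be checked on a small example. The sketch below is an illustration under assumptions: the three toy partitions and the helper names are hypothetical, the binary matrix is taken to be the concatenation of one-hot encodings of each basic partition, and the rescaled points follow one reading of the lemma (each row of $\mathbf{B}$ divided by the corresponding row sum of $\mathbf{C}$).

```python
def coassociation(partitions):
    """C(j, l) = number of basic partitions that put points j and l in the same cluster."""
    n = len(partitions[0])
    return [
        [sum(1 for labels in partitions if labels[j] == labels[l]) for l in range(n)]
        for j in range(n)
    ]


def binary_matrix(partitions):
    """Concatenate one-hot encodings of every basic partition, one row per point."""
    rows = [[] for _ in partitions[0]]
    for labels in partitions:
        clusters = sorted(set(labels))
        for j, label in enumerate(labels):
            rows[j].extend(1 if label == c else 0 for c in clusters)
    return rows


# Three hypothetical basic partitions of five points (g = 3).
partitions = [
    [0, 0, 1, 1, 2],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1],
]
C = coassociation(partitions)
B = binary_matrix(partitions)

# The Gram matrix of the binary rows reproduces the coassociation matrix,
# so C never has to be stored explicitly.
gram = [[sum(x * y for x, y in zip(bj, bl)) for bl in B] for bj in B]

# Weights are the row sums of C; the rescaled rows are the weighted k-means inputs.
weights = [sum(row) for row in C]
scaled = [[x / w for x in row] for row, w in zip(B, weights)]
```

The matrix `B` has $n$ rows and $\sum_i k_i$ columns (here 3 + 2 + 2 = 7), with exactly $g$ nonzeros per row, which is the sparsity that makes the $k$-means formulation cheaper than working with the $n \times n$ matrix `C`.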

Random Projection.
Recently, random projection has become a common technique of dimensionality reduction [36][37][38][39][41]. Random projection often has low computational complexity and can approximately preserve the structure of the original data. In this paper, we use the sparse random projection proposed by Kane and Nelson [34]. When most of the elements of the data are zero, the sparse random projection can exploit the sparsity of the data effectively and speed up the process of dimensionality reduction.
Lemma 2 (see [34]). For any 0 < , And the random matrix R can be constructed as follows: The number of nonzero (nnz) elements of sparse random matrix R is , and the time complexity of AR is nnz(A) . Lemma 2 implies that the sparse random projection can preserve the length of data points approximately. Thus, for data points, since there are ( − 1)/2 pairwise distances, we can conclude that the pairwise distances squares can be preserved within a factor of 1 ± with = Θ(2 −1 log( / )).
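The length-preservation property can be observed numerically with a simplified stand-in for a sparse random projection. This sketch is not the construction of [34]: it assumes a matrix in which every input coordinate is mapped to `s` randomly chosen output coordinates with random signs and magnitude $1/\sqrt{s}$, and the sizes `d`, `t`, and `s` are hypothetical. It also shows why sparse inputs are cheap to project: the cost is proportional to the number of nonzeros times `s`.

```python
import math
import random


def sparse_sign_matrix(d, t, s, rng):
    """For each of the d input coordinates, pick s of the t output coordinates with random signs."""
    scale = 1.0 / math.sqrt(s)
    return [
        [(col, rng.choice((-scale, scale))) for col in rng.sample(range(t), s)]
        for _ in range(d)
    ]


def project(x, R, t):
    """Project a sparse vector x (dict: index -> value); only the nonzeros of x are touched."""
    y = [0.0] * t
    for idx, value in x.items():
        for col, entry in R[idx]:
            y[col] += value * entry
    return y


def sq_norm(values):
    return sum(c * c for c in values)


rng = random.Random(0)
d, t, s = 200, 512, 8          # hypothetical sizes chosen for the demonstration
R = sparse_sign_matrix(d, t, s, rng)

# A sparse binary vector with 10 nonzeros, similar in spirit to a row of a binary matrix.
x = {i: 1.0 for i in rng.sample(range(d), 10)}
y = project(x, R, t)
ratio = sq_norm(y) / sq_norm(x.values())  # close to 1 when the squared length is preserved
```

In this simplified construction the expected squared length of the projected vector equals the original squared length, so `ratio` concentrates around 1 as `t` grows.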

Fast Constrained Spectral Clustering Framework
In this section, we introduce our fast CSC framework for large-scale data sets. Inspired by [20,21], we compute the sparse representation $\hat{\mathbf{Z}}$ and obtain the approximate adjacency matrix $\mathbf{W} = \hat{\mathbf{Z}}^T\hat{\mathbf{Z}}$, where $\hat{\mathbf{Z}} \in \mathbb{R}^{p \times n}$ and $p \ll n$. Our fast framework then decreases the size of the graph Laplacian through this approximate graph reconstruction. Finally, we analyse the asymptotic property of our new CSC algorithm.

Framework Formulation.
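To obtain the generalized eigenvector $\mathbf{x}$ approximately, we let $\mathbf{x} = \hat{\mathbf{Z}}^T\mathbf{y}$, where $\hat{\mathbf{Z}} \in \mathbb{R}^{p \times n}$ is the sparse representation in (5) and $\mathbf{y} \in \mathbb{R}^p$. Substituting this $\mathbf{x}$ back into (1) decreases the size of the problem considerably if $p \ll n$. Specifically, we use $\mathbf{Q}$ to denote the constraint matrix, where $\mathbf{Q}(i, j) = 1$ if points $i$ and $j$ have an ML constraint, $\mathbf{Q}(i, j) = -1$ if they have a CL constraint, and $\mathbf{Q}(i, j) = 0$ otherwise. Let the adjacency matrix be computed approximately by $\mathbf{W} = \hat{\mathbf{Z}}^T\hat{\mathbf{Z}}$. Next, we substitute $\mathbf{x} = \hat{\mathbf{Z}}^T\mathbf{y}$ into (1) and relax the solution to real vectors. Thus, we reformulate the original problem as the following problem.
Problem 3. One has
$$\arg\min_{\mathbf{y}_1, \ldots, \mathbf{y}_k} \max_{i} \frac{\mathbf{y}_i^T \hat{\mathbf{Z}} \mathbf{L}_{\tilde{G}_D} \hat{\mathbf{Z}}^T \mathbf{y}_i}{\mathbf{y}_i^T \hat{\mathbf{Z}} \mathbf{L}_{G_{CL}} \hat{\mathbf{Z}}^T \mathbf{y}_i}, \quad \mathbf{y}_i \in \mathbb{R}^p. \quad (11)$$
As shorthand, we denote $\hat{\mathbf{Z}} \mathbf{L}_{\tilde{G}_D} \hat{\mathbf{Z}}^T$ by $\mathbf{L}_{CGD}$ and $\hat{\mathbf{Z}} \mathbf{L}_{G_{CL}} \hat{\mathbf{Z}}^T$ by $\mathbf{L}_{CCL}$. Thus, the first $k$ nontrivial generalized eigenvectors of the problem
$$\mathbf{L}_{CGD}\,\mathbf{y} = \lambda \mathbf{L}_{CCL}\,\mathbf{y} \quad (12)$$
are the solution vectors of (11).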
In order to speed up the $k$-means clustering on the embedded eigenvector matrix, we sample row vectors of the eigenvector matrix randomly and obtain $k$ centers through $k$-means clustering over the selected row vectors. According to the distances between the centers and the row vectors, we can then partition all the row vectors into different clusters. Cucuringu et al. [17] have pointed out that the specific embedding process applied after the generalized eigenvectors are obtained concentrates the row vectors of the eigenvector matrix onto the $k$-dimensional sphere, so a simple partition algorithm such as $k$-means clustering can be applied to obtain the final clustering result. Since random sampling is a popular scalability method for $k$-means clustering [42], we adopt it to improve the efficiency of the clustering on the row vectors of the eigenvector matrix. The experimental results in Section 5 also show that random sampling has little influence on the clustering results and makes the algorithm more efficient than the original one.
Algorithm 1: Fast constrained spectral clustering.
Input: data set $\mathbf{A} \in \mathbb{R}^{n \times d}$, the number of landmark points $p$, constraint matrix $\mathbf{Q}$, cluster number $k$, confidence parameter $\beta$, sample rate $s$;
Output: the grouping result.
(1) Compute the sparse representation $\hat{\mathbf{Z}} \in \mathbb{R}^{p \times n}$ in Equation (5);
(2) Compute $\mathbf{L}_{CGD} = \hat{\mathbf{Z}} \mathbf{L}_{\tilde{G}_D} \hat{\mathbf{Z}}^T$ and $\mathbf{L}_{CCL} = \hat{\mathbf{Z}} \mathbf{L}_{G_{CL}} \hat{\mathbf{Z}}^T$, where $\mathbf{L}_{\tilde{G}_D}$ is the Laplacian matrix of $\tilde{G}_D$ and $\mathbf{L}_{G_{CL}}$ is the Laplacian matrix of $G_{CL}$;
(3) Solve for the first $k$ nontrivial generalized eigenvectors $\mathbf{Y}$ of Equation (12);
(4) Compute $\mathbf{X} = \hat{\mathbf{Z}}^T\mathbf{Y}$;
(5) Embed $\mathbf{X}$ into a $k$-dimensional sphere $\tilde{\mathbf{X}}$ using the embedding process in [17];
(6) Sample $s \times n$ row vectors of $\tilde{\mathbf{X}}$ randomly and run $k$-means clustering on the sampled row vectors;
(7) Obtain the clustering result from the distances between the centers of the $k$-means clustering and the row vectors of $\tilde{\mathbf{X}}$.
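The sampling idea used in the last two steps of the algorithm can be sketched independently of the spectral part. The code below is a minimal illustration under assumptions: it runs plain Lloyd iterations on a random sample and then assigns every row to its nearest center; the farthest-first initialisation, the fixed iteration count, the toy two-blob data, and all function names are choices made here and are not taken from the original implementation.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(row, centers):
    """Index of the center closest to the given row."""
    return min(range(len(centers)), key=lambda c: sq_dist(row, centers[c]))


def kmeans(rows, k, iters=20):
    """Plain Lloyd iterations with a farthest-first initialisation (an assumption of this sketch)."""
    centers = [rows[0]]
    while len(centers) < k:
        centers.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for r in rows:
            groups[nearest(r, centers)].append(r)
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
    return centers


def sampled_kmeans(rows, k, rate, rng):
    """Cluster a random sample of the rows, then label all rows by their nearest center."""
    sample_size = max(k, int(rate * len(rows)))
    centers = kmeans(rng.sample(rows, sample_size), k)
    return [nearest(r, centers) for r in rows]


rng = random.Random(0)
# Two well-separated toy blobs standing in for the embedded rows.
blob_a = [(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)) for _ in range(100)]
blob_b = [(10 + rng.uniform(-0.5, 0.5), 10 + rng.uniform(-0.5, 0.5)) for _ in range(100)]
rows = blob_a + blob_b
labels = sampled_kmeans(rows, k=2, rate=0.2, rng=rng)
```

Only the sampled rows take part in the iterative phase, so the per-iteration cost drops in proportion to the sample rate, while the final assignment is a single pass over all rows.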
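Our fast CSC framework is shown in Algorithm 1. In the new algorithm, the parameter $\beta$ (in $\mathbf{L}_{\tilde{G}_D}$ of Step (2)) stands for the trust level on the constraint information. Since the corresponding parameter of the original problem (see (2)) was taken to be a constant in the previous work [17], we also set $\beta$ as a constant.
The complexity analysis of Algorithm 1 is as follows. The time of computing $\hat{\mathbf{Z}}$ is $O(npd)$. In Step (2), $\mathbf{L}_{CGD}$ is computed as follows:
$$\mathbf{L}_{CGD} = \hat{\mathbf{Z}} \mathbf{L}_{\tilde{G}_D} \hat{\mathbf{Z}}^T = \hat{\mathbf{Z}} (\mathbf{I} - \hat{\mathbf{Z}}^T\hat{\mathbf{Z}}) \hat{\mathbf{Z}}^T + \beta\, \hat{\mathbf{Z}} \mathbf{L}_{G_{ML}} \hat{\mathbf{Z}}^T.$$
Let the number of data points with constraint information be $m$; then the time cost of computing $\hat{\mathbf{Z}} \mathbf{L}_{G_{ML}} \hat{\mathbf{Z}}^T$ is $O(pm^2 + p^2 m)$.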

Asymptotic Property of the Framework.
In this subsection, we show that the partition result of our fast CSC algorithm is comparable to that of the original model [17] as $p$ converges to $n$.

Theorem 4. Assuming the adjacency matrix $\mathbf{W}$ in the original model is full rank, the result of Step (4) in Algorithm 1 will converge to the generalized eigenvectors of (2) as $p$ converges to $n$.
Proof. From the construction of the sparse representation $\hat{\mathbf{Z}}$, we get that
$$\hat{\mathbf{W}} = \hat{\mathbf{Z}}^T\hat{\mathbf{Z}},$$
where $\hat{\mathbf{W}}$ is the normalized adjacency matrix. Equation (12) can be rewritten as
$$\hat{\mathbf{Z}} \mathbf{L}_{\tilde{G}_D} \hat{\mathbf{Z}}^T \mathbf{y} = \lambda \hat{\mathbf{Z}} \mathbf{L}_{G_{CL}} \hat{\mathbf{Z}}^T \mathbf{y}.$$
Equivalently, we have that
$$\hat{\mathbf{Z}} \left( \mathbf{L}_{\tilde{G}_D} \hat{\mathbf{Z}}^T \mathbf{y} - \lambda \mathbf{L}_{G_{CL}} \hat{\mathbf{Z}}^T \mathbf{y} \right) = \mathbf{0}.$$
Since the rank of $\hat{\mathbf{Z}}$ will be equal to $n$, $\hat{\mathbf{Z}}$ can be removed. Thus the equation becomes
$$\mathbf{L}_{\tilde{G}_D} \hat{\mathbf{Z}}^T \mathbf{y} = \lambda \mathbf{L}_{G_{CL}} \hat{\mathbf{Z}}^T \mathbf{y}. \quad (19)$$
This equation shows that $\hat{\mathbf{Z}}^T\mathbf{y}$ in Step (4) of Algorithm 1 and $\lambda$ are indeed an eigenvector and an eigenvalue of (2), respectively. Moreover, the number of eigenvectors of (19) converges to $n$ as $p$ converges to $n$. Hence Algorithm 1 can also obtain all the eigenvectors of (2) asymptotically.
Since the eigenvectors of our framework converge to those of the original CSC model [17] and random sampling has little influence on the clustering result of the embedded eigenvector matrix, our new CSC algorithm generates a partition result comparable to that of the original framework. The reason for the assumption in Theorem 4 is that each row vector of the adjacency matrix is the similarity representation of a certain point over the whole data set, and those representations are often linearly independent. In the experiments, we have verified this assumption empirically on the 30-nearest-neighbor adjacency matrices of three data sets.

Spectral Ensemble Clustering with Random Projection
In this section, we propose an improved spectral ensemble clustering algorithm with random projection. The new ensemble clustering not only improves the efficiency of spectral ensemble clustering algorithm designed by Liu et al. [22], but also can theoretically preserve the approximate clustering result.

Algorithm Formulation.
In this subsection, we give the detailed procedure of our new spectral ensemble clustering algorithm. We denote the original spectral ensemble clustering [22] by SEC and our improved spectral ensemble clustering with random projection by SECRP. From the description in Section 2.3, the SEC algorithm transforms spectral clustering on the coassociation matrix into weighted $k$-means clustering on the specific binary matrix $\mathbf{B}$. The dimension of the binary matrix $\mathbf{B}$ is $\sum_{i=1}^{g} k_i$, where $k_i$ is the cluster number of the basic partition $\pi_i$. When the number of clusters and/or basic partitions is large, $\mathbf{B}$ is likely to be a high dimensional matrix on which weighted $k$-means clustering runs slowly.
To avoid the high dimensionality of $\mathbf{B}$, we design an improved SEC algorithm that uses random projection for dimensionality reduction. The new algorithm SECRP is shown in Algorithm 2.
The complexity analysis of the new algorithm is as follows. Obviously, the running time of Steps (1) and (2) is very short (compared with that of Step (3)). The time of Step (3) is ( (B) ) = ( ), where is the number of basic partitions; () denotes the number of nonzero entries. Another common method of dimensionality reduction is singular value decomposition (SVD). The time of running SVD on binary matrix B is (( 耠 ) 3 + ( 耠 ) 2 ), and that of the product between eigenvectors and B is ( 耠 V ). Since ≈ 耠 / , random projection with sparse random matrix is a cost-effective method of dimensionality reduction. With respect to the weighted -means clustering, dimensionality reduction of random projection can decrease the running time of each iteration from ( 耠 ) to ( V ).
As a basic module, Algorithm 2 can be combined with different basic partition methods to produce different cluster ensemble algorithms. Thus, taking Algorithm 1 as the basic partition algorithm for Algorithm 2 yields an efficient constrained cluster ensemble method with high accuracy (both the basic partitions and the final clustering are spectral clustering). Moreover, the last two steps of Algorithm 2 are simply weighted $k$-means clustering with sparse random projection, which also suits any other application of weighted $k$-means clustering.

Theoretical Analysis of New Ensemble Algorithm.
In this subsection, we demonstrate that our new algorithm SECRP can maintain the clustering result of SEC approximately.
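For the theoretical analysis, we give the formal definition of the weighted $k$-means clustering problem in matrix notation.
Definition 5 (weighted $k$-means clustering problem). Given a set $\mathbf{B}$ of $n$ points (each row is a data point), a diagonal matrix $\mathbf{W}_B$ whose diagonal entries $\{w_{\mathbf{b}}\}$ are the weights, and the number of clusters $k$, find an $n \times k$ indicator matrix $\mathbf{X}_{opt}$ such that
$$\mathbf{X}_{opt} = \arg\min_{\mathbf{X}} \left\| \mathbf{W}_B^{1/2}\mathbf{B} - \mathbf{W}_B^{1/2}\mathbf{X}\mathbf{X}^T\mathbf{W}_B\mathbf{B} \right\|_F^2, \quad (20)$$
where $\|\cdot\|_F^2$ denotes the square of the Frobenius norm and $\mathbf{X}$ is selected from the set of all indicator matrices. An indicator matrix has one nonzero element in each row. Specifically, if the $i$th point belongs to the $j$th cluster, $\mathbf{X}(i, j) = 1/\sqrt{s_{C_j}}$, where $s_{C_j}$ denotes the sum of the weights of the points in cluster $C_j$.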
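The matrix form of the weighted objective can be checked against the familiar sum over points. The sketch below assumes that the objective reads $\|\mathbf{W}_B^{1/2}\mathbf{B} - \mathbf{W}_B^{1/2}\mathbf{X}\mathbf{X}^T\mathbf{W}_B\mathbf{B}\|_F^2$ with $\mathbf{X}(i, j) = 1/\sqrt{s_{C_j}}$, which is one reading consistent with the indicator matrix described above; under that assumption each row of $\mathbf{X}\mathbf{X}^T\mathbf{W}_B\mathbf{B}$ is the weighted centroid of the cluster of the corresponding point. The toy points, weights, and function names are hypothetical.

```python
import math


def cluster_mass(w, labels, k):
    """Sum of weights in every cluster."""
    mass = [0.0] * k
    for weight, c in zip(w, labels):
        mass[c] += weight
    return mass


def weighted_cost(B, w, labels, k):
    """Direct form: sum_i w_i * ||b_i - weighted centroid of its cluster||^2."""
    dim = len(B[0])
    mass = cluster_mass(w, labels, k)
    sums = [[0.0] * dim for _ in range(k)]
    for b, weight, c in zip(B, w, labels):
        for t in range(dim):
            sums[c][t] += weight * b[t]
    centroids = [[v / mass[c] for v in sums[c]] for c in range(k)]
    return sum(
        weight * sum((x - y) ** 2 for x, y in zip(b, centroids[c]))
        for b, weight, c in zip(B, w, labels)
    )


def matrix_cost(B, w, labels, k):
    """Matrix form with the indicator matrix X(i, j) = 1 / sqrt(mass of cluster j)."""
    n, dim = len(B), len(B[0])
    mass = cluster_mass(w, labels, k)
    X = [[1.0 / math.sqrt(mass[c]) if labels[i] == c else 0.0 for c in range(k)] for i in range(n)]
    # M = X^T W B (k x dim), then P = X M (n x dim) holds each point's weighted centroid.
    M = [[sum(X[i][c] * w[i] * B[i][t] for i in range(n)) for t in range(dim)] for c in range(k)]
    P = [[sum(X[i][c] * M[c][t] for c in range(k)) for t in range(dim)] for i in range(n)]
    return sum(w[i] * (B[i][t] - P[i][t]) ** 2 for i in range(n) for t in range(dim))


# Hypothetical toy instance: four weighted points in two clusters.
B = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]]
w = [1.0, 3.0, 2.0, 2.0]
labels = [0, 0, 1, 1]
direct = weighted_cost(B, w, labels, k=2)
via_matrix = matrix_cost(B, w, labels, k=2)
```

Both functions return the same value on this instance, which illustrates why bounds on the Frobenius-norm expression translate directly into bounds on the weighted clustering cost.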
Since computing $\mathbf{X}_{opt}$ is an NP-hard problem, we focus on approximation algorithms for weighted $k$-means clustering. The corresponding definition is as follows.
Definition 6 (weighted $k$-means approximation algorithm). An algorithm is called a "$\gamma$-approximation" for the weighted $k$-means clustering problem if it takes $\mathbf{B}$, $k$, and $\mathbf{W}_B$ as input and outputs an indicator matrix $\mathbf{X}_\gamma$ such that
$$\Pr\left[ \left\| \mathbf{W}_B^{1/2}\mathbf{B} - \mathbf{W}_B^{1/2}\mathbf{X}_\gamma\mathbf{X}_\gamma^T\mathbf{W}_B\mathbf{B} \right\|_F^2 \le \gamma \min_{\mathbf{X}} \left\| \mathbf{W}_B^{1/2}\mathbf{B} - \mathbf{W}_B^{1/2}\mathbf{X}\mathbf{X}^T\mathbf{W}_B\mathbf{B} \right\|_F^2 \right] \ge 1 - \delta_\gamma, \quad (21)$$
where $\gamma$ is the approximation factor and $\delta_\gamma$ is the failure probability of the "$\gamma$-approximation" weighted $k$-means clustering algorithm.
Although $\gamma$-approximation $k$-means clustering algorithms exist, such as [43], it is unclear whether a $\gamma$-approximation weighted $k$-means clustering algorithm exists. To facilitate the proof of our theory, we assume that such an approximation algorithm exists and use the definition of the approximation algorithm in the proof. In the following experiments, we take the weighted version of the classical $k$-means clustering algorithm [44] as the weighted $k$-means clustering to verify our theoretical results.

Theorem 7. Let the $n \times d'$ matrix $\mathbf{B}$, the weight set $\{w_{\mathbf{b}(\mathbf{a})}\}$, and the cluster number $k$ be the inputs of Algorithm 2. Let $\epsilon \in (0, 1/3)$. Assuming that a $\gamma$-approximation weighted $k$-means clustering algorithm exists, the output $\mathbf{X}_{\tilde{\gamma}}$ of Algorithm 2 satisfies with probability of at least $0.97 - \delta_\gamma$: In the above, $\tilde{\mathbf{B}} = \mathbf{W}_B^{-1}\mathbf{B}$ is the result of Step (2) in Algorithm 2; $\mathbf{X}_{opt}$ is the optimal solution of weighted $k$-means clustering on $\tilde{\mathbf{B}}$.
This theorem reveals that random projection not only can be used to improve the efficiency of spectral ensemble clustering with lower dimensions, but also maintains its final result approximately.
In the following, we present a useful lemma which is needed in the proof of Theorem 7. The results of the lemma are based on the results of [36] and Lemma 2.
(2) (Lemma 4 of [36]) For any $n \times d'$ matrix $\mathbf{G}$, with probability of at least 0.99,
(3) (Combination of Lemmas 2 and 3 of [36]) With probability of at least 0.99,
These conclusions all concern the influence of the random matrix $\mathbf{R}$ on the norms of different matrices, which is useful for bounding the norms of the matrices in Theorem 7. In the following proof of Theorem 7, we start by decomposing the norm term in (22). Then, based on the influence of the random matrix described in Lemma 8, we bound the norms of the different terms in the decomposition.
Proof. Using the notation of Lemma 8, (22) can be decomposed into where $\mathbf{H}_{\rho-k} = \mathbf{H} - \mathbf{H}_k$. The last equality follows from the orthogonality of $\mathbf{H}_k$ and $\mathbf{H}_{\rho-k}$. We first bound the second term of (26). According to our definition of the indicator matrix, B is a projector matrix; namely, its $\ell_2$ norm is 1. As a result, we get where the second inequality follows from the fact that $\mathrm{rank}(\mathbf{W}_B^{1/2}\mathbf{X}_{opt}\mathbf{X}_{opt}^T\mathbf{W}_B^{1/2}) \le k$ and the optimality of SVD. We next bound the first term of (26). From the first statement of Lemma 8, we get

From Definition 6 and the meaning of $\mathbf{X}_{opt}$ in Theorem 7, we get Using statement 2 of Lemma 8, (29) can be transformed to Combining statement 3 of Lemma 8 and (30), we get From (28) and (31), and rescaling $\epsilon$, we can get Finally, combining (27) and (32) concludes the proof.
It is easy to check that the above theoretical analysis also applies to ordinary weighted $k$-means clustering, indicating that dimensionality reduction with random projection approximately preserves the clustering quality of weighted $k$-means clustering. Furthermore, Theorems 4 and 7 together imply that the new semisupervised cluster ensemble method (the combination of Algorithms 1 and 2) can achieve an encouraging clustering result.

Experiments
In this section, we present the experimental results of our new algorithms in Sections 3 and 4. We implemented all the related algorithms in Matlab and conducted our experiments on a Windows machine with the Intel Core 3.6 GHz processor and 16 GB of RAM.

Data Sets and Experimental Settings.
To facilitate the comparison, we performed experiments on three data sets that can be obtained from public web sites (http://archive.ics.uci.edu/ml/), (http://www.cad.zju.edu.cn/home/dengcai/). Table 1 summarizes their basic information. The constraint information is generated from the real labels of the data sets: in our experiments, we sample the labeled points randomly from each data set. The constraint matrix $\mathbf{Q}$ is constructed as $\mathbf{Q}(i, j) = 1$ if the labeled points $i$ and $j$ have the same label, $\mathbf{Q}(i, j) = -1$ if they have different labels, and $\mathbf{Q}(i, j) = 0$ otherwise.
The validation measures of the partition result used in our experiments are cluster accuracy (CA) [45] and normalized mutual information (NMI) [25]. The CA is computed as
$$\mathrm{CA} = \frac{1}{n} \sum_{i=1}^{k} \max(\mathrm{cluster}_i \mid \mathrm{label}),$$
where $k$ is the cluster number of the clustering result, $n$ is the number of data points, and $\max(\mathrm{cluster}_i \mid \mathrm{label})$ is the maximum number of points with the same true label in the $i$th cluster. To compute the NMI, we construct two random variables $X$ and $Y$ from the clustering result and the true labels, respectively. The probability distributions of the random variables are the proportions of the different clusters (or classes) over the whole data set. The NMI is computed as follows:
$$\mathrm{NMI}(X, Y) = \frac{\mathrm{MI}(X, Y)}{\sqrt{H(X) H(Y)}} = \frac{\sum_{c, l} n_{c,l} \log \frac{n \cdot n_{c,l}}{n_c n_l}}{\sqrt{\left( \sum_{c} n_c \log \frac{n_c}{n} \right) \left( \sum_{l} n_l \log \frac{n_l}{n} \right)}},$$
where $\mathrm{MI}(X, Y)$ denotes the mutual information of the random variables $X$ and $Y$, $H(\cdot)$ denotes the entropy of a random variable, $n$ is the number of data points, $n_{c,l}$ is the number of points in both cluster $c$ and class $l$, $n_c$ is the number of points in cluster $c$, and $n_l$ is the number of points in class $l$. The values of CA and NMI both lie between 0 and 1, and a higher value means a better clustering solution.
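The two validation measures can be computed directly from the label vectors. The sketch below follows the verbal definitions above; the normalization of the mutual information by $\sqrt{H(X) H(Y)}$ is an assumption of this sketch, and the function names and toy label vectors are hypothetical.

```python
import math
from collections import Counter


def cluster_accuracy(pred, true):
    """CA: for each cluster, count the points of its most frequent true label; divide by n."""
    total = 0
    for c in set(pred):
        counts = Counter(t for p, t in zip(pred, true) if p == c)
        total += max(counts.values())
    return total / len(pred)


def nmi(pred, true):
    """NMI with geometric-mean normalization: MI / sqrt(H(pred) * H(true))."""
    n = len(pred)
    n_c = Counter(pred)
    n_l = Counter(true)
    n_cl = Counter(zip(pred, true))
    mi = sum(v / n * math.log(n * v / (n_c[c] * n_l[l])) for (c, l), v in n_cl.items())
    h_c = -sum(v / n * math.log(v / n) for v in n_c.values())
    h_l = -sum(v / n * math.log(v / n) for v in n_l.values())
    if h_c == 0.0 or h_l == 0.0:
        return 0.0  # convention used here when one side has a single group
    return mi / math.sqrt(h_c * h_l)


# A clustering that matches the true classes up to a renaming of the labels.
perfect_ca = cluster_accuracy([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])
perfect_nmi = nmi([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])

# A clustering with one misplaced point.
partial_ca = cluster_accuracy([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
partial_nmi = nmi([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```

Both measures are invariant to how the clusters are named, which is why the first example scores perfectly even though its cluster identifiers are swapped relative to the true labels.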

Comparisons of Different Constrained Spectral Clustering.
In this subsection, we compare our fast CSC (constrained spectral clustering) algorithm with other spectral clustering algorithms. The algorithms in the comparison are listed below:
(i) LSC-R [20,21]: the unsupervised spectral clustering baseline with landmark-based graph construction.
(ii) SCACS [16]: the most efficient CSC algorithm known, set as the CSC baseline over the MNIST and CoverType data sets.
(iii) CCS [17]: the original CSC model proposed in [17], set as the CSC baseline over the LetterRec data set. (Since constructing the nearest neighbors graphs is time-consuming on both the MNIST and CoverType data sets, we do not run the CCS algorithm on these two data sets.)
(iv) CCS-L: our improved CCS algorithm with landmark-based graph construction.
(v) CCS-LS: our improved CCS algorithm with landmark-based graph construction and random sampling.
In the process of the landmark-based graph construction, we fix the number of landmark points = 500 and the number of nearest neighbors = 3. The parameters in SCACS algorithm that we used are 0 = 0.1, which is the same as those in [16]. Since in the original model CCS [17] it has been pointed out that could be a constant number and was set to 5 in their implementation code, we also set = 5 in CCS, CCS-L, and CCS-LS.
First, we investigate the influence of the number of labeled points on the performance of algorithms. We vary the value of from 100 to 1000 with step size 100. For each value of , we select the labeled points randomly to produce constraint information and repeat 20 trials with different labeled points sets. The corresponding experimental results are presented in Figure 1. Figures 1(a), 1(b), and 1(c) are related to CA of clustering results, Figures 1(d), 1(e), and 1(f) are related to NMI, and Figures 1(g), 1(h), and 1(i) are related to running time. We can see that our algorithm CCS-LS outperforms LSC-R on all data sets and the values of CA and NMI increase with the growth of constraint information. Those indicate that our algorithm can employ the constraint information appropriately. Compared with SCACS, our algorithm has the similar performances on LetterRec and MNIST data sets and superior performances on CoverType data set, indicating that our algorithm adapts a wider range of geometries. Over the three data sets, the performances of CCS-LS are all close to CCS-L. What is more, our algorithm runs fastest among these algorithms.
Next, we study the influence of random sampling (Step (5) of Algorithm 1), which is shown in Figure 2. In the experiments, we fix = 500 and vary the sample rate from 0.1 to 1 with step size 0.1. To account for the randomness, we again run 20 independent trials and report the means of the validity measures. We can see that the values of CA and NMI vary only slightly as the sample rate grows, verifying the feasibility of random sampling.
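One plausible reading of clustering with random sampling after the spectral embedding is sketched below: a random subset of the embedded points is clustered, and every point is then assigned to its nearest centroid. This is an assumption made for illustration and may differ from Step (5) of Algorithm 1 in its details; the plain Lloyd iteration, the function names, and the toy one-dimensional embedding are all hypothetical.

```python
import random


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centers[j])))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd iterations; returns the final list of centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the previous center if a cluster becomes empty.
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers


def cluster_with_sampling(embedding, k, sample_rate, seed=0):
    """Cluster a random sample, then label all points by nearest center."""
    rng = random.Random(seed)
    size = min(len(embedding), max(k, round(sample_rate * len(embedding))))
    sample = rng.sample(embedding, size)
    centers = kmeans(sample, k, seed=seed)
    return [nearest(p, centers) for p in embedding]


# Toy embedding: two well-separated groups of 100 points each.
rng = random.Random(3)
group_a = [(rng.uniform(-0.1, 0.1),) for _ in range(100)]
group_b = [(100.0 + rng.uniform(-0.1, 0.1),) for _ in range(100)]
embedding = group_a + group_b
labels = cluster_with_sampling(embedding, k=2, sample_rate=0.1)
```

With a sample rate of 0.1, the iterative part of the clustering touches only a tenth of the points, while the final assignment is a single pass over the data.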

Performance of the Spectral Ensemble Clustering with Random Projection. Since a cluster ensemble consists of two parts, basic partition clustering and ensemble clustering, we combine different basic partition clustering algorithms with different ensemble clustering algorithms to obtain different cluster ensemble algorithms. In this way, the performance of both the new ensemble clustering algorithm (Algorithm 2) and the new cluster ensemble algorithm (the combination of Algorithms 1 and 2) can be demonstrated. The cluster ensemble algorithms in the comparison are listed as follows:
(i) CK-SE: the basic partition clustering algorithm "CK" is the constrained k-means clustering algorithm [9], and the ensemble clustering algorithm "SE" is the spectral ensemble clustering (SEC) algorithm [22].
(ii) SCACS-SE: the basic partition clustering algorithm is SCACS [16], and the ensemble clustering algorithm is also SE [22].
(iii) CCSS-SE: the basic partition clustering algorithm "CCSS" is our fast CSC algorithm (Algorithm 1), and the ensemble clustering algorithm is also SE [22].
(iv) CCSS-SER: the basic partition clustering algorithm is CCSS, and the ensemble clustering algorithm "SER" is our spectral ensemble clustering with random projection (Algorithm 2).
In the phase of basic partition clustering, we fix the number of basic partitions at 50, and the parameters of the basic clustering algorithms are the same as those in the last subsection. In addition, similar to the operation of SE [22], the basic partitions are obtained by varying the cluster number from k − 5 to k + 4. We repeat each cluster ensemble algorithm 10 times and report the average results.
First, we compare the cluster ensemble algorithms under different amounts of constraint information in Figure 3. Here the dimensionality to which CCSS-SER reduces the data by random projection is 40, and we vary the number of labeled points from 100 to 1000 with step size 100. Figures 3(a)-3(c) report CA and Figures 3(d)-3(f) report NMI. As in the last subsection, CCSS-SE performs similarly to SCACS-SE on the LetterRec and MNIST data sets and much better on the CoverType data set. Comparing Figures 1 and 3, we can see that both validity measures are markedly higher than those of the basic partitions, verifying that ensemble clustering improves clustering quality. CCSS-SE and CCSS-SER both perform significantly better than CK-SE, which indicates that the basic partitions have an obvious impact on the final result and also verifies the high accuracy of our new constrained spectral cluster ensemble method. In addition, the small difference in performance between CCSS-SE and CCSS-SER implies that random projection approximately preserves the results of spectral ensemble clustering under different amounts of constraint information.
Second, we inspect the influence of the dimensionality of random projection on the performance of our algorithm in Figure 4 and Table 2. In Figure 4, "SEC-SVD" denotes the SEC algorithm with dimensionality reduction by SVD, and "SECRP" denotes our algorithm. When the reduced dimensionality is above a certain bound, the validity measures of SECRP are almost stable and similar to those of SEC on all three data sets. This indicates that the accuracy of the clustering algorithm is retained once the dimensionality exceeds a certain bound, which verifies Theorem 7. The small value of this bound (a dimensionality of 40) also reveals the effectiveness of dimensionality reduction by random projection.
With respect to SEC-SVD, although it can also preserve the accuracy of the clustering algorithm, its running time is not encouraging. Even with the reduced dimensionality set to 20, the running times of the original algorithm and the SVD method on the three data sets are 3.47 s versus 10.85 s, 4.91 s versus 14.54 s, and 22.06 s versus 326.61 s. These results may be caused by the slowness of SVD on large matrices and by the loss of sparsity of the binary matrix B. In Table 2, the decrease in running time verifies the efficiency of our new spectral ensemble clustering. Combining this with Figures 1(g)-1(i), the efficiency of the new constrained cluster ensemble method is also verified. In addition, the reduction in running time achieved by random projection shrinks as the dimensionality grows, indicating that a relatively small dimensionality is preferable for random projection.
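To make the role of sparsity concrete, the sketch below encodes a few basic partitions as a sparse binary membership matrix and reduces its dimensionality with a sparse random projection. The encoding (one block of indicator columns per basic partition), the Achlioptas-style entry distribution, and all names are assumptions of this sketch rather than the exact construction of Algorithm 2; the toy target dimensionality of 4 stands in for the value of 40 used in the experiments.

```python
import math
import random


def binary_membership(partitions):
    """Sparse rows of B: row i lists the columns that equal 1 for point i.

    Assumes each partition labels its clusters 0, 1, ..., K - 1.
    """
    offsets, total = [], 0
    for labels in partitions:
        offsets.append(total)
        total += max(labels) + 1
    n = len(partitions[0])
    rows = [[off + labels[i] for off, labels in zip(offsets, partitions)]
            for i in range(n)]
    return rows, total


def sparse_projection_matrix(num_columns, target_dim, seed=0):
    """Entries are +s, 0, -s with probabilities 1/6, 2/3, 1/6, s = sqrt(3/t)."""
    rng = random.Random(seed)
    scale = math.sqrt(3.0 / target_dim)
    return [[scale * rng.choice((1, 0, 0, 0, 0, -1)) for _ in range(target_dim)]
            for _ in range(num_columns)]


def project(rows, R):
    """Compute B R by adding, for each point, the rows of R at its active columns."""
    t = len(R[0])
    return [[sum(R[c][j] for c in active) for j in range(t)] for active in rows]


# Three toy basic partitions of six points with 3, 2, and 2 clusters.
partitions = [[0, 0, 1, 1, 2, 2],
              [0, 1, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 1]]
rows, num_columns = binary_membership(partitions)
R = sparse_projection_matrix(num_columns, target_dim=4)
Y = project(rows, R)
```

Each projected point costs only one pass over its few active columns, so the projection exploits the sparsity of the membership matrix directly, whereas an SVD-based reduction must factor the whole matrix and yields dense factors.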

Conclusion
To handle large-scale data sets, we propose a fast CSC algorithm. The new algorithm decreases the space and time complexity of a recently introduced CSC model through landmark-based graph construction and further improves its efficiency by random sampling. The new algorithm not only retains the properties of the original model asymptotically but is also, empirically, the most efficient of the compared algorithms and suitable for a wide range of data sets. Taking the new CSC algorithm as the basic partition algorithm, we design an efficient semisupervised cluster ensemble algorithm. In the stage of consensus clustering, we reduce the dimensionality of the input of spectral ensemble clustering by sparse random projection and prove that sparse random projection approximately preserves the clustering quality. The experimental results on several data sets also verify the efficiency and effectiveness of the new cluster ensemble algorithm. Moreover, the analysis of the influence of dimensionality reduction by random projection in spectral ensemble clustering also gives a theoretical guarantee for weighted k-means clustering with random projection. In the future, we will use techniques such as applying several different basic partition methods, selecting among the results of the basic partitions, and assigning different weights to the basic partitions to further improve the performance of our cluster ensemble algorithm.

Conflicts of Interest
The authors declare that they have no conflicts of interest.