A Gaussian Process Latent Variable Model for Subspace Clustering

Effective feature representation is the key to the success of machine learning applications. Recently, many feature learning models have been proposed. Among these models, the Gaussian process latent variable model (GPLVM) for nonlinear feature learning has received much attention because of its superior performance. However, most existing GPLVMs are designed for classification and regression tasks and thus cannot be used for data clustering tasks. To address this issue and extend the application scope, this paper proposes a novel GPLVM for clustering (C-GPLVM). Specifically, by combining the GPLVM with the subspace clustering method, our C-GPLVM can obtain more representative latent variables for clustering. Moreover, by introducing a back constraint in the model, it can directly predict the latent variables of new samples, which makes it more suitable for big data learning tasks such as the analysis of chaotic time series. In the experiments, we compare it with related GPLVMs and clustering algorithms. The experimental results show that the proposed model not only inherits the feature learning ability of the GPLVM but also has superior clustering accuracy.


Introduction
In machine learning tasks, data are often distributed in a high-dimensional space and have many redundant features. Training machine learning models on such high-dimensional data may result in not only higher computational and storage complexities but also model overfitting [1]. Existing research studies have shown that high-dimensional data are often embedded in a low-dimensional manifold. We can therefore use dimension reduction and feature learning methods to learn the low-dimensional manifold and obtain more representative features, which improves the accuracy and efficiency of machine learning models. Thus, effective feature representation is the key to the success of machine learning applications.
In the past decade, many related methods have been proposed, such as dictionary learning [2], the autoencoder [3], the Gaussian process latent variable model (GPLVM) [4], Isomap [5], and locally linear embedding [6]. Among these models, the GPLVM for nonlinear feature learning has received extensive attention because of its superior feature learning ability and has been used in many applications such as dynamical systems [7] and the modelling and control of nonlinear systems [8]. Given a few training samples, it can effectively learn the low-dimensional manifold that is embedded in the high-dimensional space and thus has been widely used in dimension reduction and data visualization tasks [9,10].
Although it has the abovementioned advantages, the conventional GPLVM is a fully unsupervised feature learning model and thus cannot meet the demands of real-world applications when dealing with specific machine learning tasks, such as the analysis of chaotic time series, dynamical systems [7], and the modelling and control of nonlinear systems, in which we also observe the response values of the inputs. How to modify the GPLVM and improve its performance is the key content of the related research studies. To date, the extensions of this model mainly focus on supervised and unsupervised learning methods [9,11,12]. The supervised methods assume that, apart from the input features, we also observe the labels of the samples. With these extensions, the GPLVM can effectively utilize the supervised information to improve the classification accuracy of the learned latent variables. However, in real-world applications, we may also deal with unsupervised clustering tasks in which we cannot obtain the label information or any other auxiliary information, which brings more challenges to the application of the GPLVM in clustering tasks.
In order to address the abovementioned issues, this paper proposes a fusion model that combines the GPLVM with the subspace clustering model [13] to simultaneously obtain more representative features and accurate clustering results. Moreover, we also use the back constraint trick [14] in the model, which allows the model to predict new samples directly and makes it more suitable for big data learning tasks such as the analysis of chaotic time series. In the experiments, we verify the performance of the proposed model on multiple datasets. The experimental results show that our model has much better clustering performance than the other related models.

Related Work
In this paper, $h_n \in \mathbb{R}^Q$ denotes the latent variable of $x_n$. Obviously, the GPLVM can realize dimension reduction by learning the latent variables. Specifically, the GPLVM assumes that the generation process of $x_n$ is as follows:
$$x_{nd} = f_d(h_n) + \epsilon, \quad f_d \sim \mathcal{GP}(0, K), \tag{1}$$
where $\epsilon$ is Gaussian noise with variance $\sigma^2$ and $K$ is the kernel matrix computed on the latent variables. Integrating out $f_d$ gives the marginal likelihood
$$p(X \mid H, \theta) = \prod_{d=1}^{D} \mathcal{N}\left(x_{:,d} \mid 0, K + \sigma^2 I\right), \tag{2}$$
where $x_{:,d}$ is the $d$th column of $X$ and $\theta$ denotes the hyperparameters involved in the kernel function and noise distribution. In the model optimization process, the GPLVM learns the latent variables and hyperparameters jointly by maximizing the above likelihood function and finally obtains the low-dimensional representation.
From the abovementioned generation process, as a fully unsupervised dimension reduction model, the GPLVM cannot embed auxiliary information when dealing with specific machine learning tasks and thus cannot meet the demands of real-world applications. For example, in the analysis of chaotic time series, data that are close in time will have similar features. If it can utilize this knowledge, the GPLVM will learn more representative features for the task and significantly improve the prediction accuracy. The existing methods for the extension of the GPLVM mainly focus on embedding supervised information to improve its classification and regression accuracy, for example, the discriminative GPLVM (D-GPLVM) and supervised GPLVM (S-GPLVM). For the extension to the clustering task, there are far fewer related works. The existing unsupervised GPLVMs focus only on how to preserve the local distance and learn better latent variables or features. For example, the local preserving projection GPLVM (LPP-GPLVM) combines the objective of local preserving projection with that of the GPLVM, thus simultaneously learning the low-dimensional representation and preserving the local structures [15]. The GPLVM with back constraints (B-GPLVM) introduces a back constraint (from the observed space to the latent space) into the GPLVM. In this way, it can also preserve the local distance.
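As a minimal sketch of the likelihood described above, the following NumPy code evaluates the negative log marginal likelihood of a GPLVM with an RBF kernel. The function names, the kernel parameterization (`variance`, `lengthscale`), and the noise variance `noise_var` are illustrative assumptions and not the implementation used in this paper; in practice the latent variables and hyperparameters would be optimized with a gradient-based method.

```python
import numpy as np

def rbf_kernel(H, variance=1.0, lengthscale=1.0):
    """RBF kernel: K_ij = variance * exp(-||h_i - h_j||^2 / (2 * lengthscale^2))."""
    sq_norms = np.sum(H ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * H @ H.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

def gplvm_neg_log_likelihood(X, H, variance=1.0, lengthscale=1.0, noise_var=0.1):
    """-ln p(X | H, theta) after integrating out the latent functions f_d."""
    N, D = X.shape
    K = rbf_kernel(H, variance, lengthscale) + noise_var * np.eye(N)
    chol = np.linalg.cholesky(K)
    # alpha = K^{-1} X, computed through the Cholesky factor
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, X))
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return 0.5 * D * log_det + 0.5 * np.sum(X * alpha) + 0.5 * N * D * np.log(2.0 * np.pi)

# Toy data (hypothetical sizes): N = 20 samples, D = 5 features, Q = 2 latent dims
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
H = rng.standard_normal((20, 2))
nll = gplvm_neg_log_likelihood(X, H)
```

The Cholesky factorization is used here instead of an explicit inverse because it is the usual numerically stable way to evaluate Gaussian log-likelihoods.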

Subspace Clustering.
The goal of subspace clustering is to segment a set of data samples into different subspaces, so that similar samples are in the same subspace while dissimilar samples are in different subspaces. Over the past decade, subspace clustering has been used in various clustering tasks and many well-designed algorithms have been proposed, such as Gaussian mixture model- (GMM-) based methods [16,17], matrix factorization- (MF-) based methods [18,19], algebra-based methods [20], and spectral clustering methods [13,21,22]. Among these models, the subspace clustering method based on spectral clustering has been widely applied because of its concise implementation and reliable performance. It uses a low-rank representation to construct the affinity matrix for spectral clustering. Its objective is to find the low-rank representation of the input data $X$ by optimizing the following function:
$$\min_{Z} \operatorname{rank}(Z) \quad \text{s.t.} \quad X = ZX, \tag{3}$$
where we assume that each sample can be expressed as a linear combination of the other samples. The above low-rank penalty term can be considered as a global constraint on the subspace structure of the samples and makes similar samples have similar weights. In general, we can use the following nuclear norm to replace the penalty term:
$$\min_{Z} \|Z\|_{*} \quad \text{s.t.} \quad X = ZX, \tag{4}$$
where we use the nuclear norm $\|Z\|_{*}$ to approximate the rank of $Z$. Considering that the data often contain noise, we use the following formulation to learn the self-representation matrix $Z$:
$$\min_{Z} \|X - ZX\|_F^2 + \lambda \|Z\|_{*}, \tag{5}$$
where $\lambda > 0$ balances the reconstruction error and the low-rank penalty.
In low-rank subspace clustering, we can first construct the affinity matrix $W$ and the Laplace matrix $L$ and then use spectral clustering to cluster the data. $W$ and $L$ can be constructed as follows:
$$W = \frac{|Z| + |Z^T|}{2}, \quad L = D - W, \tag{6}$$
where $D$ denotes a diagonal matrix and $D_{ii} = \sum_{j=1}^{N} W_{ij}$. After obtaining the Laplace matrix, we can optimize the following objective function to obtain the latent variable $H \in \mathbb{R}^{N \times Q}$:
$$\arg\min_{H} \operatorname{tr}\left(H^T L H\right) \quad \text{s.t.} \quad H^T H = I. \tag{7}$$
Obviously, $H$ is composed of the eigenvectors of $L$ corresponding to the $Q$ smallest eigenvalues. Finally, we can run the k-means algorithm on the learned $H$ and obtain the clustering result.
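The spectral step described above can be sketched as follows, assuming a self-representation matrix `Z` has already been learned. The function name `spectral_embedding` and the toy block-structured `Z` are illustrative assumptions; the symmetrization $(|Z| + |Z^T|)/2$ is a common choice for building the affinity matrix and may differ in detail from a given implementation.

```python
import numpy as np

def spectral_embedding(Z, Q):
    """Build W and L from Z and return the Q eigenvectors of L with smallest eigenvalues."""
    W = 0.5 * (np.abs(Z) + np.abs(Z.T))   # symmetric, nonnegative affinity matrix
    degree = np.diag(W.sum(axis=1))       # D_ii = sum_j W_ij
    L = degree - W                        # unnormalized graph Laplace matrix
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues are returned in ascending order
    return eigvecs[:, :Q], L

# Toy example: two groups of three samples that only represent each other
Z = np.zeros((6, 6))
Z[:3, :3] = 1.0
Z[3:, 3:] = 1.0
np.fill_diagonal(Z, 0.0)

H, L = spectral_embedding(Z, Q=2)
# Rows of H belonging to the same group coincide, so k-means on H separates the groups.
```

Because the toy graph has two connected components, the two smallest eigenvalues of `L` are zero and the embedding is constant within each group, which is the property that makes the subsequent k-means step effective.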

Model Construction and Optimization
3.1. Designing of the Gaussian Process Latent Variable Clustering Model.
Assuming that there are $N$ observed samples denoted as $X = [x_1, \ldots, x_N]^T$, our goal is to learn the low-dimensional latent variables $H = [h_1, \ldots, h_N]^T$ corresponding to these observed variables and to make the latent variables better suited to clustering (i.e., make common clustering algorithms obtain accurate clustering results on the learned $H$).
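In order to achieve the above goal, we assume that the latent variable $H$ has the following prior distribution:
$$p(H) = p_0 \exp\left(-R(H)\right), \tag{8}$$
where $p_0$ is a constant that makes $\int_H p(H)\,dH = 1$ and $R(H)$ has the following form:
$$R(H) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} W_{ij} \|h_i - h_j\|^2, \tag{9}$$
where $W_{ij}$ is the element in the $i$th row and $j$th column of the affinity matrix $W$. For a symmetric $W$, equation (9) can be written as follows:
$$R(H) = \operatorname{tr}\left(H^T L H\right), \tag{10}$$
where $L$ is the Laplace matrix constructed from $W$.
In this paper, we assume that the generation process of the observed variables from the latent variable can be described by the conditional distribution $p(X \mid H)$. Thus, from Bayes' rule, we can obtain the posterior distribution of the latent variable as
$$p(H \mid X) = \frac{p(X \mid H)\, p(H)}{p(X)}. \tag{11}$$
Since $p(X)$ is a constant, we can obtain the optimal latent variable by maximizing the following joint distribution:
$$\arg\max_{H} \; p(X \mid H)\, p(H). \tag{12}$$
To introduce the GPLVM into this model, we assume that $X$ is generated by a latent function that follows a Gaussian process prior. Thus, taking the negative logarithm and dropping constants, equation (12) can be written as
$$\arg\min_{H, \theta, \sigma} \; \frac{D}{2} \ln\left|K + \sigma^2 I\right| + \frac{1}{2} \operatorname{tr}\left(\left(K + \sigma^2 I\right)^{-1} X X^T\right) + \operatorname{tr}\left(H^T L H\right), \tag{13}$$
where $D$ is the number of observed features, $K$ is the kernel matrix computed on $H$, $\theta$ denotes the hyperparameters involved in the kernel function, and $\sigma^2$ denotes the variance of the Gaussian noise distribution. With the above modelling process, the GPLVM can effectively embed the sample similarity information when learning the latent variable, thus improving the clustering ability of its latent variables. However, how to learn the affinity matrix remains an open problem for this paper and for related approaches such as self-representation learning and subspace clustering. In this paper, we borrow the idea of low-rank self-representation learning and introduce the following low-rank subspace constraints into the model:
$$\min_{Z} \|Z\|_{*} \quad \text{s.t.} \quad X = ZX. \tag{14}$$
It is worth noting that, in this paper, we assume that $W = Z$, i.e., we directly use the matrix $Z$ as the affinity matrix.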
This setting is the same as that of [23], and its role is similar to that of the affinity matrix in the original subspace clustering. The C-GPLVM is very similar to the LPP-GPLVM. However, in the LPP-GPLVM, the Laplace matrix is fixed, whereas the Laplace matrix in our C-GPLVM is learned during training. Thus, our C-GPLVM is expected to perform better than the LPP-GPLVM.
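The regularizer on the latent variables can be checked numerically. The sketch below assumes the usual graph-smoothness form, in which the pairwise penalty $\frac{1}{2}\sum_{ij} W_{ij}\|h_i - h_j\|^2$ equals $\operatorname{tr}(H^T L H)$ for a symmetric affinity matrix $W$ with $L = \operatorname{diag}(W\mathbf{1}) - W$; the random `W` and `H` are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Q = 8, 3

# Symmetric, nonnegative affinity matrix with an empty diagonal
A = rng.random((N, N))
W = 0.5 * (A + A.T)
np.fill_diagonal(W, 0.0)

H = rng.standard_normal((N, Q))

# Pairwise form: 0.5 * sum_ij W_ij * ||h_i - h_j||^2
diff = H[:, None, :] - H[None, :, :]
pairwise_form = 0.5 * np.sum(W * np.sum(diff ** 2, axis=2))

# Trace form with the unnormalized Laplace matrix L = diag(W 1) - W
L = np.diag(W.sum(axis=1)) - W
trace_form = np.trace(H.T @ L @ H)
```

Both quantities agree up to floating-point error, which is why the prior on $H$ can be expressed through the Laplace matrix. The equality relies on $W$ being symmetric.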
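One important limitation of the GPLVM and of self-representation is that they cannot effectively predict new samples. To mitigate this problem, we introduce a back constraint into the proposed model. Thus, given a new sample, the model can predict the corresponding low-dimensional latent variable using the constraint function. Specifically, given an observed sample $x_n$, we assume that we can use a function $g(\cdot)$ to obtain the latent variable $h_n$:
$$h_n = g(x_n; \Phi), \tag{15}$$
where $g(\cdot)$ is a neural network function with learnable parameters $\Phi$. Finally, we obtain the objective of the proposed model as follows:
$$\arg\min_{\Phi, \theta, \sigma, Z} \; \frac{D}{2} \ln\left|K + \sigma^2 I\right| + \frac{1}{2} \operatorname{tr}\left(\left(K + \sigma^2 I\right)^{-1} X X^T\right) + \operatorname{tr}\left(H^T L H\right) + \|Z\|_{*} \quad \text{s.t.} \quad X = ZX, \; h_n = g(x_n; \Phi), \; n = 1, \ldots, N, \tag{16}$$
where $D$ is the number of observed features, $K$ is the kernel matrix computed on $H$, $\sigma^2$ is the noise variance, and $L$ is the Laplace matrix constructed from the affinity matrix $W = Z$. The whole model structure is shown in Figure 1.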
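To illustrate how a back constraint enables out-of-sample prediction, the following sketch maps observed samples to latent variables with a small one-hidden-layer network. The architecture, the `tanh` activation, the layer sizes, and the names `init_mlp` and `back_constraint` are assumptions made for illustration; the source states only that $g(\cdot)$ is a neural network with learnable parameters $\Phi$.

```python
import numpy as np

def init_mlp(n_features, n_hidden, n_latent, rng):
    """Randomly initialize the parameters Phi of a one-hidden-layer network."""
    return {
        "W1": rng.standard_normal((n_features, n_hidden)) * 0.1,
        "b1": np.zeros(n_hidden),
        "W2": rng.standard_normal((n_hidden, n_latent)) * 0.1,
        "b2": np.zeros(n_latent),
    }

def back_constraint(X, phi):
    """Map observed samples (rows of X) to latent variables h = g(x; Phi)."""
    hidden = np.tanh(X @ phi["W1"] + phi["b1"])
    return hidden @ phi["W2"] + phi["b2"]

rng = np.random.default_rng(0)
phi = init_mlp(n_features=10, n_hidden=16, n_latent=2, rng=rng)

X_train = rng.standard_normal((50, 10))
H_train = back_constraint(X_train, phi)   # latent variables of the training samples

x_new = rng.standard_normal((1, 10))
h_new = back_constraint(x_new, phi)       # a new sample needs only a forward pass
```

Because $H$ is a deterministic function of $X$, training optimizes $\Phi$ instead of the latent variables themselves, and no additional optimization is required at prediction time.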

Model Optimization
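In order to optimize (16), we transform it into the following optimization problem:
$$\arg\min_{\Phi, \theta, \sigma, Z} \; \frac{D}{2} \ln\left|K_\sigma\right| + \frac{1}{2} \operatorname{tr}\left(K_\sigma^{-1} X X^T\right) + \lambda_1 \operatorname{tr}\left(H^T L H\right) + \lambda_2 \|X - ZX\|_F^2 + \lambda_3 \|Z\|_{*}, \tag{17}$$
where $K_\sigma = K + \sigma^2 I$, $H = g(X; \Phi)$, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are regularization coefficients. With this formulation, we can use the alternating iterative optimization method to learn all the parameters. First, we fix $Z$ and write (17) as
$$\arg\min_{\Phi, \theta, \sigma} \; \mathcal{L}_1 = \frac{D}{2} \ln\left|K_\sigma\right| + \frac{1}{2} \operatorname{tr}\left(K_\sigma^{-1} X X^T\right) + \lambda_1 \operatorname{tr}\left(H^T L H\right).$$
This problem can be solved effectively by using gradient-based methods, and its gradient with respect to $\Phi$ can be computed by the chain rule as
$$\frac{\partial \mathcal{L}_1}{\partial \Phi} = \sum_{n=1}^{N} \left(\frac{\partial h_n}{\partial \Phi}\right)^T \frac{\partial \mathcal{L}_1}{\partial h_n}.$$
The gradients with respect to $\theta$ and $\sigma$ are similar to the above formulation. For the sake of brevity, we omit their derivations. We can then fix $\Phi$, $\theta$, and $\sigma$ and write (17) as
$$\arg\min_{Z} \; \lambda_1 \operatorname{tr}\left(H^T L H\right) + \lambda_2 \|X - ZX\|_F^2 + \lambda_3 \|Z\|_{*}.$$
Since $W = Z$, the Laplace matrix is $L = \tilde{D} - Z$ with the diagonal matrix $\tilde{D}_{nn} = Z_{n,:}\, e$, where $Z_{n,:}$ denotes the $n$th row of the matrix $Z$ and $e \in \mathbb{R}^N$ denotes the vector whose elements are all 1. The gradient of the first term with respect to $Z$ can then be computed as
$$\lambda_1 \left(s\, e^T - H H^T\right), \quad s_n = \|h_n\|^2.$$
The gradient of the second term can be computed as
$$-2 \lambda_2 \left(X - ZX\right) X^T.$$
A subgradient of the third term is
$$\lambda_3\, U V^T,$$
where $Z = U \Sigma V^T$ is the singular value decomposition of $Z$ restricted to its nonzero singular values. With the above derivations, we can learn the whole model, as described in Algorithm 1.
The main computational cost of the C-GPLVM is the inversion of the kernel matrix, which has a complexity of $O(N^3)$, where $N$ is the number of training samples. The main storage cost is the kernel matrix, which has a complexity of $O(N^2)$. Thus, both the computation and storage complexities are the same as those of the conventional GPLVM.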
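The gradients used in the update of $Z$ can be verified by finite differences. The sketch below assumes that the three $Z$-dependent terms are $\operatorname{tr}(H^T L H)$ with $L = \operatorname{diag}(Ze) - Z$, the reconstruction error $\|X - ZX\|_F^2$, and the nuclear norm $\|Z\|_*$; the regularization weights are omitted, and all names and toy sizes are illustrative. The test matrix `Z` is built with singular values between 1 and 2 so that the nuclear norm is differentiable there.

```python
import numpy as np

def numerical_grad(f, Z, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at Z."""
    G = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        for j in range(Z.shape[1]):
            step = np.zeros_like(Z)
            step[i, j] = eps
            G[i, j] = (f(Z + step) - f(Z - step)) / (2.0 * eps)
    return G

rng = np.random.default_rng(0)
N, D, Q = 5, 4, 2
X = rng.standard_normal((N, D))
H = rng.standard_normal((N, Q))

# Full-rank Z with known singular values in [1, 2]
U0, _ = np.linalg.qr(rng.standard_normal((N, N)))
V0, _ = np.linalg.qr(rng.standard_normal((N, N)))
Z = U0 @ np.diag(np.linspace(1.0, 2.0, N)) @ V0.T

def laplacian_term(Z):
    return np.trace(H.T @ (np.diag(Z.sum(axis=1)) - Z) @ H)

def reconstruction_term(Z):
    return np.sum((X - Z @ X) ** 2)

def nuclear_term(Z):
    return np.sum(np.linalg.svd(Z, compute_uv=False))

# Analytic expressions
HHt = H @ H.T
grad_laplacian = np.outer(np.diag(HHt), np.ones(N)) - HHt   # s e^T - H H^T
grad_reconstruction = -2.0 * (X - Z @ X) @ X.T
U, _, Vt = np.linalg.svd(Z)
grad_nuclear = U @ Vt                                       # U V^T
```

Comparing each analytic expression with `numerical_grad` applied to the corresponding term is a cheap sanity check before plugging the gradients into an alternating optimization loop.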

Experimental Setup.
To verify the effectiveness of the C-GPLVM, we use 8 datasets in the experiments. The detailed information of these datasets is given in Table 1. YEAST is a dataset for the prediction of protein localization sites. USPS is a digits dataset that was gathered at the Center of Excellence in Document Analysis and Recognition at SUNY Buffalo, as part of a project sponsored by the US Postal Service. YALE, JAFFE, and ORL are three face recognition datasets, as shown in Figure 2. TR11, TR41, and TR45 are three textual datasets.
In order to fully verify the advantage of the C-GPLVM, we compare it with the related Gaussian process latent variable models (i.e., GPLVM, B-GPLVM, and LPP-GPLVM) and with clustering methods, namely the spectral clustering method (SC) [24], kernel spectral clustering (KSC) [25], and simplex sparse representation learning (SSR) [21]. All the kernel-based models (GPLVM, LPP-GPLVM, KSC, and C-GPLVM) use the radial basis function (RBF) as the kernel function. It is worth noting that other kernel functions, such as the linear kernel, Laplacian kernel, and circular kernel, can also be used in the proposed model. Furthermore, all the hyperparameters in these kernel functions can be learned in the same way as described in this paper. In the experiments, the hyperparameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ are chosen from $\{0.01, 0.1, 1, 10, 100\}$. The hyperparameters involved in the other models are set to be the same as those in the original papers. We use the Gaussian process toolkit GPflow to implement the GP-based models. The other related models are all implemented in Python. All the algorithms are tested on a Windows computer with an i7-9700 CPU and 16 GB of RAM. In Tables 2-4, the best results are given in bold.
From Tables 2-4, we can observe that the GPLVM, as an unsupervised dimension reduction model, usually obtains latent variables that have poor clustering performance. The B-GPLVM and LPP-GPLVM can preserve the local distance of samples during the feature learning process, thus obtaining more representative latent variables. Meanwhile, the LPP-GPLVM obtains much better results than the B-GPLVM, which indicates that graph Laplace regularization is more suitable for clustering than back constraints. In general, spectral clustering and subspace clustering methods perform better than the GPLVM: SC, KSC, and SSR outperform the GPLVM, B-GPLVM, and LPP-GPLVM. The proposed C-GPLVM combines subspace clustering with the GPLVM, thus effectively improving the clustering performance of the GPLVM. As shown in the experimental results, the C-GPLVM obtains better clustering results than the other related models in most cases.

Conclusion and Future Work
This paper proposes a joint model that combines the low-rank subspace with the back-constrained GPLVM to address the poor clustering performance of the conventional GPLVM. The proposed C-GPLVM can not only obtain low-dimensional latent variables but also directly predict new samples, thus effectively extending the application scope of the GPLVM to tasks such as the analysis of chaotic time series. The experimental results show that the C-GPLVM has much better latent variable learning ability and superior clustering performance. In future work, we will further extend the C-GPLVM to make it suitable for much larger datasets and for supervised tasks such as classification and regression, improving its efficiency and application scope.
Given the observed samples $X = [x_1, \ldots, x_N]^T \in \mathbb{R}^{N \times D}$ (where $x_n \in \mathbb{R}^D$ denotes the $n$th training sample), our objective is to learn the corresponding low-dimensional latent variable.

$f_d \sim \mathcal{GP}(0, K)$, where $K$ is the kernel matrix that is computed by using the kernel function $k(\cdot, \cdot)$ on the latent variables in $H$. The element in the $i$th row and $j$th column of $K$ is computed as $K_{ij} = k(h_i, h_j)$. By integrating out the intermediate variable $f_d$, we can obtain the marginal likelihood function of the GPLVM.
$x_{nd}$ is the $d$th feature of the $n$th training sample, and $\epsilon$ is the noise term.

Table 1: The detailed information of the datasets.

Table 2
In the experiments, we use clustering accuracy, purity, and normalized mutual information (NMI) as the clustering measurements. At the clustering stage, the latent variables learned by the different methods are used as inputs and the k-means algorithm is used to obtain the final clustering results. The latent dimension and the number of clusters are set to be the same as the number of classes. In order to mitigate the sensitivity of k-means to its initial values, we randomly initialize and run the k-means method 20 times. Finally, we calculate the mean and standard deviation of these 20 runs. The experimental results are shown in Tables 2-4.
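The evaluation protocol described above can be sketched as follows. The metric definitions are the common ones and are assumptions about details the text does not specify: accuracy uses the best one-to-one matching between clusters and classes (found here by brute force, which is only practical for a small number of clusters), and NMI is normalized by the geometric mean of the two entropies. The simple `kmeans` function and the toy two-blob data are illustrative stand-ins for the learned latent variables.

```python
import numpy as np
from itertools import permutations

def contingency(y_true, y_pred):
    """Count table: rows are true classes, columns are predicted clusters."""
    _, ti = np.unique(np.asarray(y_true).ravel(), return_inverse=True)
    _, pi = np.unique(np.asarray(y_pred).ravel(), return_inverse=True)
    table = np.zeros((ti.max() + 1, pi.max() + 1))
    np.add.at(table, (ti, pi), 1.0)
    return table

def clustering_accuracy(y_true, y_pred):
    table = contingency(y_true, y_pred)
    k = max(table.shape)
    padded = np.zeros((k, k))
    padded[:table.shape[0], :table.shape[1]] = table
    best = max(sum(padded[perm[c], c] for c in range(k))
               for perm in permutations(range(k)))
    return best / table.sum()

def purity(y_true, y_pred):
    table = contingency(y_true, y_pred)
    return table.max(axis=0).sum() / table.sum()

def nmi(y_true, y_pred):
    p_ij = contingency(y_true, y_pred)
    p_ij = p_ij / p_ij.sum()
    p_i, p_j = p_ij.sum(axis=1), p_ij.sum(axis=0)
    nz = p_ij > 0
    mi = np.sum(p_ij[nz] * np.log(p_ij[nz] / np.outer(p_i, p_j)[nz]))
    h_i = -np.sum(p_i * np.log(p_i))
    h_j = -np.sum(p_j * np.log(p_j))
    return mi / np.sqrt(h_i * h_j) if h_i > 0 and h_j > 0 else 0.0

def kmeans(H, k, rng, n_iter=100):
    """Plain Lloyd iterations from a random choice of k samples as centers."""
    centers = H[rng.choice(len(H), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((H[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([H[labels == c].mean(axis=0) if np.any(labels == c)
                                else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy latent variables: two blobs of 30 points each
rng = np.random.default_rng(0)
H = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(3.0, 0.3, (30, 2))])
y_true = np.repeat([0, 1], 30)

# 20 randomly initialized k-means runs, then mean and standard deviation
scores = [clustering_accuracy(y_true, kmeans(H, 2, rng)) for _ in range(20)]
mean_acc, std_acc = np.mean(scores), np.std(scores)
```

For larger numbers of clusters, the brute-force matching would normally be replaced by the Hungarian algorithm.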