Bayesian Variable Selection for Globally Sparse Probabilistic PCA

Sparse versions of principal component analysis (PCA) have imposed themselves as simple, yet powerful ways of selecting relevant features of high-dimensional data in an unsupervised manner. However, when several sparse principal components are computed, the interpretation of the selected variables is difficult since each axis has its own sparsity pattern and has to be interpreted separately. To overcome this drawback, we propose a Bayesian procedure called globally sparse probabilistic PCA (GSPPCA) that allows to obtain several sparse components with the same sparsity pattern. This allows the practitioner to identify the original variables which are relevant to describe the data. To this end, using Roweis' probabilistic interpretation of PCA and a Gaussian prior on the loading matrix, we provide the first exact computation of the marginal likelihood of a Bayesian PCA model. To avoid the drawbacks of discrete model selection, a simple relaxation of this framework is presented. It allows to find a path of models using a variational expectation-maximization algorithm. The exact marginal likelihood is then maximized over this path. This approach is illustrated on real and synthetic data sets. In particular, using unlabeled microarray data, GSPPCA infers much more relevant gene subsets than traditional sparse PCA algorithms.


Introduction
From the children test results of the seminal paper of Hotelling (1933) to the challenging analysis of microarray data (Ringnér, 2008) and the recent successes of deep learning (Chan et al., 2015), principal component analysis (PCA) has become one of the most popular tools for data-preprocessing and dimension-reduction.The original procedure consists in projecting the data onto a "principal" subspace spanned by the leading eigenvectors of the sample covariance matrix.It was later shown that this subspace could also be retrieved from the maximum-likelihood estimator of a parameter, in a particular factor analysis model called probabilisitic PCA (PPCA) (Roweis, 1998;Tipping and Bishop, 1999).This probabilistic framework led to diverse Bayesian analysis of PCA (Bishop, 1999a;Minka, 2000;Nakajima et al., 2011).

Local and global sparsity
A potential drawback of PCA is that the principal components are linear combinations of every single original variable, and can therefore be difficult to interpret.To tackle this issue, several procedures have been designed to project the data onto subspaces generated by sparse vectors while retaining as much variance as possible.Many of them were based on convex or partially convex relaxations of cardinality-constrained PCA problems -among these techniques are the popular ℓ 1 -based SPCA algorithm of Zou et al. (2006) or the semidefinite relaxation of d 'Aspremont et al. (2008).Another strategy is to use a sparsity-inducing prior distributions on the coefficients of the projection matrix (Archambeau and Bach, 2009;Guan and Dy, 2009;Khanna et al., 2015).
However, when several principal components are computed, these various techniques do not enforce them to have the same sparsity pattern, and each component has to be interpreted individually.While individual interpretation is particularly natural in several caseswhen PCA serves visualization, for example -, it is not adapted to situations where the practitioner aims at globally selecting which features are relevant.In these situations, a simple and popular approach has been to consider that the relevant variables correspond to the sparsity pattern of the first principal component (Zou et al., 2006;Zhang et al., 2012).However, this procedure is limited, and several important aspects of the data may lie in the next principal components.For example, in the colon cancer data set studied by d 'Aspremont et al. (2008), the most relevant genes were the ones selected not by the first but by the second principal component.Another motivation for global sparsity is the fact that, in many reallife situations, the sparsity pattern of the axes computed by a sparse PCA algorithm are extremely close.This is for example the case of the three axes of the template attacks application considered by Archambeau and Bach (2009).In this setting, forcing these patterns to be equal will give the practitioner a precise idea of which variables are relevant.Another interesting feature of global sparsity is the fact that, once the common sparsity pattern has been determined, performing PCA on the relevant variables yields orthogonal and uncorrelated principal components -conversely to most sparse PCA procedures.

Related work
Since the seminal papers of Jolliffe (1972Jolliffe ( , 1973) ) and Robert and Escoufier (1976), several methods have been designed to discard features in PCA (see e.g.Brusco (2014) for a recent review).However, these techniques were designed to eliminate redundant, rather that irrelevant variables, and are based on combinatorial algorithms that are not really suitable for high-dimensional problems.
A simple and scalable way of performing variable selection for PCA is to simply keep the features that have the largest marginal variance.In certain cases, this technique is theoretically sound, and was applied for instance to the analysis of electrocardiogram (ECG) data (Johnstone and Lu, 2009).Zhang and El Ghaoui (2011) also proved that it could be used as an efficient preprocessing technique to reduce the dimensionality of ultra-high dimensional problems before applying a traditional sparse PCA algorithm.However, this technique has two main drawbacks.First, it is not robust to simple transformations of the data since simply multiplying a variable by a constant may wrongfully select (or discard) it.An unfortunate consequence of this is the fact that this technique can not be applied to scaled data.Moreover, since it ignores non-marginal information, this technique will behave badly in the case of correlated features.
A more refined approach to global sparsity is ℓ 1 -based regularization, which has imposed itself as one of the most versatile and efficient approaches to sparse statistical learning (Hastie et al., 2015).In a context of structured sparse PCA, Jenatton et al. (2009) proposed to recast sparse PCA as a penalized matrix factorization problem and suggested that limiting the number of sparsity patterns allowed within the principal vectors could improve the feature extraction quality -particularly in face recognition problems.Using the ℓ 1 − ℓ 2 norm, they derived an algorithm (hereafter referred as SSPCA) that allows to compute d sparse components with exactly m ≤ d sparsity patterns.However, they only considered cases where m is larger than 2 and therefore did not focus on global sparsity.They were followed by Khan et al. (2015) who, in a very close framework, argued that global sparsity (which they called joint sparsity) led to better representations of hyperspectral images.Other similar approaches based on structured composite norms have been conducted by Masaeli et al. (2010), Gu et al. (2011) and Xiaoshuang et al. (2013).Ulfarsson andSolo (2008, 2011) used sparsity inducing penalties together with a PPCA model to enforce global sparsity.They proposed an algorithm called sparse variable noisy PCA (hereafter refered as svnPCA) and fixed the amount of penalization using the Bayesian information criterion (BIC) of Schwarz (1978).
Eventually, it is worth mentioning that global sparsity has also been investigated in other contexts, such as partial least squares regression (Liu et al., 2013) or electroencephalography (EEG) imaging (Wipf and Nagarajan, 2009;Gramfort et al., 2013).

Contributions and organization of the paper
We present in Section 2 a Bayesian approach that allows to project the data onto a globally sparse subspace (i.e a subspace spanned by vectors with the same sparsity pattern) while preserving a large part of the variance.To this end, we use the noiseless PPCA model introduced by Roweis (1998) together with an isotropic gaussian prior on the projection matrix and a binary vector that segregates relevant from irrelevant variables.While past Bayesian PCA frameworks relied on variational (Bishop, 1999b;Archambeau and Bach, 2009;Guan and Dy, 2009) or Laplace (Bishop, 1999a;Minka, 2000) methods to approximate the marginal likelihood, we derive here a closed-form expression for the evidence based on the multivariate Bessel distribution.In order to avoid the drawbacks of discrete model selection and to treat high-dimensional data, we also present a relaxation of our model by replacing the binary vector with a continuous one.Inference of this relaxed model can be performed using a variational expectation-maximization (VEM) algorithm.Such a procedure allows to find a path of models.The exact evidence is eventually maximized over this path, relying on Occam's razor (MacKay, 2003, chap. 28), to select the relevant variables.
We illustrate the behaviour of our algorithm and compare it to other methods in Section 3. In particular, we show that Bayesian model selection empirically outperforms ℓ 1 −ℓ 2 -based regularization on a series of tasks.
Sections 4 and 5 are devoted to two applications showcasing the features of our method.The first one concerns signal denoising with wavelets, and shows how global sparsity can surpass traditional sparse PCA algorithms within this context.The second one treats about unsupervised gene selection.Given an (unlabeled) microarray data matrix, we show how GSPPCA can select biologically relevant subsets of genes.Interestingly, we exhibit an important correlation between our exact marginal likelihood expression and a criterion of biological relevance based on pathway enrichment.
Note that this paper is an extended version of previous work (Mattei et al., 2016) published in the Proceedings of the 19 th Conference on Artificial Intelligence and Statistics.

Bayesian variable selection for PCA
Let us assume that a centered i.i.d.sample x 1 , ..., x n ∈ R p is observed which one wishes to project onto a d-dimensional subspace while retaining as much variance as possible.All the observations are stored in the n × p matrix X = (x 1 , ..., x n ) T .

Probabilistic PCA
The PPCA model assumes that each observation is driven by the following generative model where y ∼ N (0, I d ) is a low-dimensional Gaussian latent vector, W is a p × d parameter matrix called the loading matrix and ε ∼ N (0, σ 2 I p ) is a Gaussian noise term.This model is a particular instance of factor analysis and was first introduced by Lawley (1953).Following Theobald (1975), Tipping and Bishop (1999) confirmed that this generative model is equivalent to PCA in the sense that the principal components of X can be retrieved using the maximum likelihood (ML) estimator W ML of W. Indeed, if A is the p × d matrix of ordered principal eigenvectors of X T X and if Λ is the d × d diagonal matrix with corresponding eigenvalues, we have where R is an arbitrary orthogonal matrix.
Several Bayesian treatments of this model have been conducted by using different priors on the loading matrix.However, the marginal likelihood of these models appeared to be untractable.To tackle this issue, several computational techniques were considered.The automatic relevance determination (ARD) prior was used together with Laplace (Bishop, 1999a) or variational (Bishop, 1999b;Archambeau and Bach, 2009) approximations.Minka (2000) introduced more complex conjugate priors to perform Bayesian model selection on the dimension d of the latent space using the Laplace approximation.Combined with variational inference, several sparsity inducing priors such as the Laplace (Guan and Dy, 2009), the generalized hyperbolic (Archambeau and Bach, 2009) or the spike-and-slab (Lázaro-Gredilla and Titsias, 2011) prior were also chosen for W.
In this work, we aim at avoiding these approximations.Our approach is to investigate in which cases the marginal likelihood can be analytically computed.To this end, we will use the fact that, within the PPCA model (1), the limit noiseless setting σ → 0 also allows to recover the principal components.This convenient framework was first studied by Roweis (1998) and has proven to be useful in several situations.The noiseless PPCA model was used for instance to facilitate inference in the presence of missing data (Yu et al., 2010;Ilin and Raiko, 2010).More importantly in our context, it was successfully used by Sigg and Buhmann (2008) to enforce sparsity within an ℓ 1 penalized PPCA frameworkwhich means that getting rid of the noise term is likely to be compatible with variable selection.

A general framework for globally sparse PPCA
In a classical (locally) sparse PCA context, the loading matrix W would be expected to contain few nonzero coefficients.However, to reach global sparsity, several entire rows of W have to be further constrained to be null.In this work, we handle variable selection using a binary vector v ∈ {0, 1} p whose nonzero entries correspond to relevant variables.For technical purposes, we also denote v the binary vector of {0, 1} p whose support is exactly the complement of Supp(v).We denote q = ||v|| 0 the number of relevant variables.In the PPCA framework, this leads to the following model for each observation where V = diag(v).Notice that the rows of VW, corresponding to the zero entries of v, are null.Therefore, the principal subspace will be generated by a basis of vectors which shares the sparsity pattern of v.Such spaces spanned by a family of vectors sharing the same sparsity pattern will be called globally sparse subspaces.This definition of global sparsity is closely related to the notion of row sparsity introduced by Vu and Lei (2013).
We further assume that the coefficients of the matrix W are endowed with the Gaussian priors w ij ∼ N (0, 1/α 2 ), for all i, j.Following the empirical Bayes framework leads to seeking the parameters v, α and σ that maximizes the marginal likelihood or evidence In previous Bayesian PCA models, the marginal likelihood was never derived because it was too difficult to compute in practice or even intractable.Here, conversely, the evidence of the model can be expressed analytically as a univariate integral using the isotropy of the prior onW.In the following, x v denotes the subvector of x where only the columns corresponding to the nonzero indexes of v have been kept.Given a real order ν, we denote by J ν and K ν the Bessel function of the first kind (Abramowitz and Stegun, 1965, chap. 10 and 11).
Theorem 1 The density of x is given by A proof of this theorem is given in Appendix A. While reducing the dimension of the integration domain to one appears to be a valuable improvement, the integral of Equation ( 4), albeit univariate, falls within the category of Hankel-like integrals known to be particularly delicate to compute.This is due to the fact that the integrand has singularities near the real axis (Ogata, 2005).To overcome this limitation, we investigate in the following subsection the use of the noiseless PPCA model to obtain a tractable expression.

A closed-form evidence for globally sparse noiseless PPCA
To obtain a closed-form expression of the marginal likelihood, we consider the following modification of Model (3).For the relevant variables, we use the noiseless PPCA model, and we assume that the irrelevant variables are generated by a Gaussian white noise.More specifically, we write where ε 1 ∼ N (0, σ 2 1 I p ) is the noise of the inactive variables and ε 2 ∼ N (0, σ 2 2 I p ) is the noise of the active variables, having in mind that we aim at investigating the noiseless limit σ 2 → 0. We will see that, with this particular formulation of the problem, the evidence has a closed form expression which involves the multivariate Bessel distribution, introduced by Fang et al. (1990, Def. 2.5).
Definition 2 A random vector is said to have a symmetric multivariate Bessel distribution with parameters β > 0 and ν > −k/2 if its density is Theorem 3 In the noiseless limit σ 2 → 0, x converges in probability to a random variable x whose density is This theorem (proved in Appendix B) allows us to efficiently compute the noiseless marginal log-likelihood defined as Regarding hyper-parameter tuning, if we assume that v is known, the regularization parameter α can be optimized efficiently using univariate gradient ascent.In fact, as stated by next proposition (proved in Appendix C), the marginal log-likelihood is even a strictly concave function of α.
The unique optimal value α can therefore be found easily using univariate convex programming.
The noise variance σ 1 can be estimated using ( 6) by computing the standard error of the variables which were not selected by v.However, since model ( 3) is a particular instance of PPCA, it is possible to use any regular PPCA noise variance estimator.A discussion on which estimator to choose is provided in subsection 2.7

High-dimensional inference through a continuous relaxation
In spite of the results of the previous subsection, maximizing the evidence, even in the noiseless case, is particularly difficult (because of the discreteness of v which can take 2 p possible values).We therefore consider a simple continuous relaxation of the problem by replacing v by a continuous vector u ∈ [0, 1] p .This relaxation is close to the one considered by Latouche et al. (2016) in a sparse linear regression framework.Denoting U = diag(u), this relaxed model can be written as We denote θ = (u, α, σ) the vector of parameters.In order to maximize the evidence p(X|θ), we adopt a variational approach (Bishop, 2006, chap. 10).We view y 1 , ...y n andW as latent variables.
Given a (variational) distribution q over the space of latent variables, the variational free energy is given by where H denotes the differential entropy, and is an upper bound to the negative log-evidence To minimize F q (X|θ), the following mean-field approximation is made on the variational distribution q(Y, W) = q(Y)q(W).
With this factorization, a variational expectation-maximization (VEM) algorithm can be derived.For the E-step, the variational posterior distribution q * , which minimizes the free energy, is computed.

Proposition 5 The variational posterior distribution of the latent variables which minimizes the free energy is given by
and where, for all i ∈ {1, ..., n} and k ∈ {1, ..., p} It is worth noticing that two factorizations arise naturally.The four equations of Proposition (5) (proved in Appendix D) will constitute the E-step of the VEM algorithm used to minimized the free energy.
We can now compute the negative free energy which will be maximized during the Mstep.
Proposition 6 Up to unnecessary additive constants, the negative free energy is given by Minimizing the free energy leads to the following M-step updates and, for k ∈ {1, ..., p}, Note that the objective function of the optimization problem ( 15) is simply a univariate polynomial.

The GSPPCA algorithm
Once the VEM algorithm has converged, the continuous vector u still needs to be transformed into a binary one.To do so, the following simple procedure, summarized in Algorithm 1, is considered: • a family of p nested models is built using the order of the coefficients of u as a way of ranking the variables.Specifically, for each k ≤ p, the k-th element of this family is the binary vector v (k) such that the k top coefficients of u are set to 1 and the others to 0.
• the marginal likelihood L of the non-relaxed model (computed using the formula of Theorem 3) is then maximized over this family of models.
• the model v with the largest marginal likelihood is kept.
Algorithm 1: GSPPCA algorithm for unsupervised variable selection Input: data matrix X ∈ R n×p , dimension of the latent space d ∈ N * Output: sparsity pattern v ∈ {0, 1} p // VEM algorithm to infer the path of models Initialize u, α, σ, µ 1 , ..., µ n , m 1 , ..., m p , S 1 , ..., S p and Σ ; repeat E-step from Proposition 5; M-step from equations ( 13),( 14),( 15); until convergence of the variational free energy; // Model selection using the exact marginal likelihood Compute σ 1 ; k) , α, σ 1 )} using gradient ascent ; Once the model is estimated, the globally sparse principal components of X can be computed by simply performing PCA on X v .This type of post-processing is similar to the variational renormalization introduced by Moghaddam et al. (2005).In the case of local sparsity, variational renormalization can be achieved using an alternating maximization scheme (Journée et al., 2010).However, the global sparsity structure greatly simplifies this procedure by reducing it to performing PCA on the relevant variables.

Links with other sparsity-inducing Bayesian procedures
Spike-and-slab models Model (3) may be rewritten x = Wy + ε where W = VW.The prior distribution for the parameter W is similar to the spike-and-slab prior introduced by Mitchell and Beauchamp (1988) in a linear regression framework.Indeed, each coefficient wij follows a priori either a Dirac distribution with mass at zero (if v i = 0) which is usually called the spike or a Gaussian distribution with variance 1/α 2 (if v i = 1) which is usually called the slab.However, contrary to standard spike-and-slab models which would assume a product of Bernoulli prior distributions over v, we see v here as a deterministic parameter to be inferred from the data.It is worth noticing that spike-and-slab priors have already been applied to locally sparse PCA by Lázaro-Gredilla and Titsias (2011) and Mohamed et al. (2012).
Automatic relevance determination Introduced in the context of feedforward neural networks (MacKay, 1994;Neal, 1996), automatic relevance determination (ARD) is a popular empirical Bayes procedure to induce sparsity.ARD was applied to Bayesian PCA models together with VEM algorithms in order to obtain automatic dimensionality selection (Bishop, 1999b) of local sparsity (Archambeau and Bach, 2009).In order to obtain global sparsity, ARD may be built using Model (1) together with Gaussian priors w i ∼ N (0, a i I d ) for i ∈ {1, ..., p}.Similarly to Tipping (2001), maximizing the marginal likelihood would discard irrelevant variables by leading several variance parameters a i to vanish.Interestingly, this model is somehow related to the relaxed GSPPCA model.Indeed the relaxed model ( 7) assumes that the i-th line of the loading matrix UW follows a priori a N (0, u 2 i /α 2 I d ) distribution.The relaxed model will consequently inherit the good properties of ARD -listed for example by Wipf et al. (2011).However, similarly to Latouche et al. (2016), using the exact marginal likelihood to eventually obtain a sparse solution will avoid many classical drawbacks of ARD.First, as pointed out by Wipf and Nagarajan (2008), convergences of EM algorithms are extremely slow in the case of the ARD models.However, with our approach, since we only need the ordering of the coefficients of u, we do not have to wait for the complete convergence of this parameter.In practice, in all the experiments that we carried out, we only had to perform less than a few hundreds of iterations of the algorithm to obtain convergence of the free energy in order to perform variable selection.It is worth mentioning that the fact that the objective function converges faster than the parameters of the model is a quite general property of EM algorithms (Xu and Jordan, 1996).Our procedure also avoids the lack of flexibility of ARD by computing posterior probabilities of models rather than simply giving an estimate of the best sparse model.Combined with a greedy technique similar to Occam's window (Madigan and Raftery, 1994), this feature could allow for example to perform Bayesian model averaging, which is not possible with ARD.Eventually, in the context of Bayesian PCA, ARD models such as the ones of Bishop (1999a,b) or Archambeau and Bach (2009) have to rely on approximations of the marginal likelihood while we use an exact expression.

Computational considerations
Intrinsic dimension estimation Since model (3) is a particular instance of PPCA, any intrinsic dimension estimator for PCA can be applied to estimate beforehand the intrinsic dimension d.Although the problem of finding d is of critical importance, we assume in this work that a reasonable choice of dimension has already been made by the practitioner.While it could be tempting to use the exact noiseless marginal likelihood to select d, the close relationship existing between the noise level and d in PPCA (Tipping and Bishop, 1999;Nakajima et al., 2011) suggests that loosing the noise information is likely to be prejudicial for intrinsic dimension estimation.
Initialization strategies for the VEM algorithm Regarding the initialization of the relaxed model parameter u, we chose to initialize all its coefficients to one.This allows to avoid premature vanishing of these coefficients which is a common drawback of ARD-like techniques (Wipf and Nagarajan, 2008).The noise standard error can be simply initialized using any classical PPCA noise estimator (cf.subsection 2.3).Similarly to Latouche et al. (2016), the slab precision parameter α controls the sparsity of the VEM solution and a too small initial value is likely to lead to a too sparse solution such as the useless local optimum u = 0. Following Biernacki et al. (2003), we chose to perform short VEM runs (with less than 5 iterations) on a small grid (typically α ∈ {0.1, 1, 10}) and to select the value of α that led to the lowest free energy.The posterior means of the PCA loadings m 1 , ..., m p and of the corresponding scores µ 1 , ..., µ n can be initialized using the singular vectors of X.If the size of the data forbids to perform this SVD, using random standard Gaussian coefficients as starting points does not significantly alter the results.Finally, the initial values chosen for the posterior covariance matrices are Σ = I d and S 1 = ... = S p = α −2 I d .
Computational cost of VEM iterations Thanks to the factorizations that arised naturally during variational inference, the cost of each VEM iteration is of order O(pnd 3 ) which is linear both in sample size and dimensionality and therefore particularly suitable for high-dimensional inference.
Estimation of the noise variance As mentioned in seubsection 2.3, the standard error σ 1 of irrelevant predictors can be estimated using any regular PPCA estimator.Specifically, three important estimators are considered: the maximum likelihood estimator (Tipping and Bishop, 1999), its unbiased correction (Passemier et al., 2015), or simply the median of the variances of all features (Johnstone and Lu, 2009).Since the ML estimator is known to be biased in the high-dimensional regime, it is usually preferable to use its bias-corrected version.Both of these estimators can also be computed using the singular value decomposition (SVD) of X.Note that since the median estimator does not need to perform this decomposition, it is therefore more suitable for large-scale inference.
Large scale inference In the GSPPCA algorithm, SVD is used twice.Indeed, the top d singular vectors can be used to initiate the VEM algorithm and the p − d smallest singular values can be used to estimate the noise variance (both as a VEM starting point for σ and as an estimator for σ 1 ).This can be done efficiently using a truncated SVD algorithm.We chose specifically the R interface (Qiu and Mei, 2016) of the Spectra1 C++ library.However, for very large scale problems, even a fast truncated SVD algorithm appears computationally prohibitive.To tackle this issue, we offer two alternatives.First, the covariance matrices initialized using the eigenvectors can be initialized using random standard Gaussian coefficients.Moreover, following Johnstone and Lu (2009), the noise variance can be estimated using the median of the variable variances.This leads to a "SVD-free" version of the GSPPCA algorithm suitable for very large scale problems.
Model selection speedup The model selection step of the GSPPCA algorithm requires to perform p univariate gradient ascents, which can be computationally expensive when p is large.A simple way to reduce the number of gradient ascents is to rely on the links between our relaxed model and ARD.Specifically, we can discard before the model selection step all the variables corresponding to the subset {i ∈ {1, ..., p}|u i = 0} where u is the relaxed model parameter obtained after convergence of the VEM algorithm.When u is sparse, this will bring about a substantial speedup.Notice that, since ARD is known to converge slowly, u is unlikely to be sparse enough and the model selection step is still necessary.

Evaluation of Bessel functions
The modified Bessel function of the second kind, which is used to compute the exact marginal likelihood and it gradient with respect to α, can be delicate to compute as soon as its order or its argument is large.In our experiments, we tackled this issue by using an asymptotic expansion based on Debye polynomials (Abramowitz and Stegun, 1965, formula 9.8.7).This is in particular implemented in the R package Bessel (Mäechler, Variables Evidence 2013).We found this approximation to be extremely accurate in all the experiments that we carried out.

Numerical simulations
This section aims at highlighting the specific features and abilities of the proposed GSPPCA approach on simulated and real data sets.

An introductory example
We consider here a simple introductory example to illustrate the proposed combination between a relaxed VEM algorithm and the closed-form expression of the marginal likelihood.For this experiment, n = 50 observations are simulated according to (3) with p = 30, d = 5 and q = 10.Each coefficient of W is drawn at random according to a standard Gaussian distribution and the noise variance is equal to 0.1.Figure 1 presents the results of GSPPCA on this toy data set.The left panel presents in dark blue the coefficients of the estimated u obtained after running the VEM algorithm (sorted in decreasing order) and the corresponding true values of v (pale blue points) used in the simulations.The right panel shows the values of evidence computed on the family of models inferred by the order of the coefficients of u.On this simple example, u captures the true ranking of the variables and the model with the largest evidence is actually the true one.

Range of the noiseless assumption
In all the experiments that we carried out, since the noiseless PPCA model is not a true generative p-dimensional model (the random variable x belongs to a strict subspace of R p ), we chose not to use it to generate data in our experiments.We rather chose the more realistic and natural Model (3).Since this model includes a nonzero noise, it is important to know the limits of the noiseless assumption.(16) In this simple simulation scheme, the signal-to-noise ratio (SNR) may be defined as SNR = We chose a linear grid of 20 SNR ranging from 0.1 (most difficult scenario) to 3 (easiest scenario) and generated 100 datasets for each noise level.To evaluate the quality of the variable selection, we computed the F-score between v and v on 100 runs.We recall that the F-score is the harmonic mean of precision and recall, and is closer to 1 when the selection is faithful.Unsurprisingly, when the SNR gets close to zero, the quality of the variable selection diminishes.However, GSPPCA appears to be quite robust to noise, even though the data are not generated according to the underlying noiseless model.Indeed, even in the case where n = 40, we observe an almost perfect recovery as long as SNR>0.5.

Model selection
In this subsection, we compare the model selection accuracies of two global methods -GSPPCA, SSPCA (Jenatton et al., 2009) -and a local one -SPCA (Zou et al., 2006).
Simulation setup While the simple simulation setup of Subsection 3.2 conveniently allowed to compute the SNR in closed formed in order to assess the range of the noiseless assumption, we introduce here a more realistic scheme by considering a finer correlation structure as well as a non-Gaussian noise.Specifically, first we generate n i.i.d observations (z 1 , ..., z n ) following multivariate normal distribution N (0, R) where R = diag(R 1 , ..., R 4 ) is a 4-blocks diagonal matrix where R ℓ is such that r ℓii = 0.3 and r ℓij = ρ for i, j = 1, . . ., p/4 and i = j.Then, a globally sparse PCA model is obtained as followed.First, PPCA is performed on the sample (z 1 , ..., z n ), which leads to a non-sparse ML estimate W ML for the loading matrix.Then, given a sparsity pattern v ∈ {0, 1} p and denoting V = diag(v) as before, the loading matrix matrix is "globally sparsified" by considering VW ML .The final observations are eventually generated according to the non-noiseless model The simple sparsity pattern ( 16) is kept and the vectors y 1 , ..., y n are standard Gaussian as in regular PPCA.Regarding the noise term ε, we consider two scenarios.A first one with Gaussian noise and a second one with Laplacian noise, both centered with unit variance.
We choose p = 200, d = 10, q = 20 and consider five cases for the sample size: n = p/5, p/4, n = ⌊p/3⌋, n = p/2 and n = p.More classical n > p cases are not presented here since regular PCA is known to perform well in this context and variable selection thus may not be of great use (Johnstone and Lu, 2009).Each experiment was repeated 50 times.
Model selection criteria Regarding SSPCA, we used the Matlab code available at the main author's webpage and chose the tuning parameter using 5-fold cross-validation on the reconstruction error.We constrained the algorithm in order to obtain globally sparse solutions.For SPCA, we used the elasticnet R package and an ad-hoc method by selecting enough variables to explain 99% of the total variance.We also tried to apply another globally sparse algorithm, vsnPCA-ℓ 0 from Ulfarsson and Solo (2011).However, their use of the Bayesian information criterion (BIC) led to selecting very few variables.This is not very surprising: since BIC is an asymptotic sparsity criterion, it is thus likely to perform poorly when p is larger than n.
Results Tables 1 and 2 reports the mean and standard error of the F-score for the experiments described is this subsection.The two globally sparse methods vastly outperform SPCA, which is unable to identify the particular structure of the data.When p is larger than n/2, both globally sparse algorithms perform very well, GSPPCA being slightly better in the Gaussian noise case.It is not surprising to see SSPCA adapt efficiently to Laplacian noise because cross-validation is a model-free technique and is more likely to outperform model-based techniques when the data is not generated according to the model distribution.However, when n is smaller than p/2, GSPPCA significantly outperforms SSPCA in both noise scenarios.This reminds the fact that, is many p ≫ n situations, Bayesian model selection empirically outperforms ℓ 1 -based methods (Celeux et al., 2012;Latouche et al., 2016).

Global versus local
Here, we illustrate on real data sets how using GSPPCA instead of computing the leading sparse principal component for model selection can lead to selecting more relevant variables -i.e variables that retain more variance or are more interpretable.Explained variance We consider the breast cancer data base from the breastCancerVDX R package (Schroeder et al., 2011), consisting in expression levels of p = 5391 genes for n = 344 breast cancer patients.More details regarding this data set -including the preprocessing technique used -are given in Appendix F. Given a cardinality q, we applied three methods to select relevant genes: • we computed the first q-sparse principal component using SPCA (Zou et al., 2006) • we computed the support of the globally q-sparse subspace of dimension d = 10 using GSPPCA and SSPCA For each method, we projected the data onto a 10-dimensional globally q-sparse subspace using the sparsity pattern found by the algorithm and computed the percentage of explained variance using the criterion introduced by Shen and Huang (2008) -for each method, we applied the post-processing technique of Moghaddam et al. (2005).The results are plotted on Figure 3.4.It is important to notice that both global methods explain much more variance than SPCA.This fact is not surprising since the data is indeed projected onto a globally sparse subspace, but the significance of this variance gap highlights the fact that different dimensions lead to very different sparsity patterns.This means that projecting the data onto a single sparse axis is likely to lead to an important information loss (this fact is confirmed in section 5).The variables selected by GSPPCA retain significantly more variance than the ones selected by SSPCA, and may consequently be of superior interest.
Interpretability Inspired by Hastie et al. (2015, section 8.2.3.1),we consider the problem of learning which features are relevant on three data sets of handwritten digits.We consider n = 500 gray-scale images (with p = 758 pixels) of handwritten sevens from three data sets introduced by Larochelle et al. ( 2007): • mnist-basic which is simply a subsample of sevens from the original MNIST data set, • mnist-back-rand in which random backgrounds were inserted in the images.Each pixel value of the background was generated uniformly between 0 and 255, • mnist-back-image in which random patches extracted from a set of 20 grey-scale natural images were used as backgrounds for the sevens.
On these three data sets, we apply SPCA (with d = 1), SSPCA and GSPPCA (both with d = 100) in order to select q = 200 relevant pixels.On mnist-basic, even if SPCA's result is a little bit more erratic than the two others, all selections are interpretable and we can easily recognize a seven.On mnist-back-rand however, while the two globally sparse selections are still consistent, SPCA's pixels are more scattered and it is harder to recognize the shape of a seven.Eventually, on mnist-back-image, GSPPCA's selection is less smooth but a seven can still be recognized, whereas SPCA appears to randomly select pixels almost everywhere but near the mean seven.SSPCA seems to notice that the zone occupied by the upper bars of the sevens is of interest, but its selection does not appear interpretable.

Application to signal denoising
In this section, we focus on a first possible application of GSPPCA for signal denoising through the sparsification of a wavelet decomposition.PCA is indeed a popular way to denoise multivariate signals (Aminghafari et al., 2006;Johnstone and Lu, 2009).To illustrate the potential interest of GSPPCA in this context, we consider hereafter two simulation scenarios, each using a specific form of signal and wavelet.The simulation scenarios are as follows: • Scenario A: it consists in a square wave signal with 6 states of different lengths.The observed signal is sampled with a time step of 5 × 10 −3 with an additional Gaussian noise with zero mean and 0.2 standard deviation.The Haar wavelet is used here for signal reconstruction.
• Scenario B: the original signal is here a mixture of 4 Gaussian densities.The observed signal is also sampled with a time step of 5 × 10 −3 with an additional Gaussian noise with zero mean and 0.2 standard deviation.The Daubechies D8 wavelet is used here for signal reconstruction.As an illustration, we plotted on Figure 4 the denoising results for newly sampled signals A and B with GSPPCA.We used the same projection-reconstruction protocol for PCA, thresholded PCA (PCA loading smaller than 1 × 10 −3 are set to 0) and SPCA (λ is chosen such that 99% of the PCA projected variance is conserved).Denoising results obtained with those methods are also supplied on Figure 4. First, on both signal A and B, PCA achieves a very satisfying denoising and thus confirms his validity in this context.One can also show that a simple thresholding of the PCA loadings allows a clear denoising improvement and turns out to be competitive with the one performed by SPCA.The SPCA result is here somehow disappointing due to the fact that the sparsity is not global and most wavelet Finally, Table 4 presents the reconstruction error (sum of squared errors) averaged on 50 test signal reconstructions, on the two simulation scenarios.The results confirms the observations made on Figure 4. GSPPCA achieves particularly good performances on both scenarios and thus imposes itself as a competitive tool for signal denoising.Moreover, the GSPPCA reconstruction uses fewer wavelet levels and is therefore visually smoother.

Application to unsupervised gene selection
Considering again the breast cancer data set previously studied in Section 3, we address here the issue of the biological significance of the selected genes.To this end, we will use the pathway enrichment index (PEI) introduced by Teschendorff et al. (2007) and used in a sparse PCA framework by (Journée et al., 2010).

Pathway enrichment as a measure of biological significance
In this subsection, we briefly review how the PEI can be computed in order to evaluate the quality of a given subset of genes.For more details on the PEI, see Teschendorff et al. (2007) or Journée (2009), and on hypergeometric tests and enrichment, see Rivals et al. (2007).
Suppose that using a microarray data matrix X ∈ R n×p where each variable corresponds to a gene, an algorithm infers a subset s ⊂ {1, ..., p} of genes.A way to assess its biological significance is to compare s to many other subsets which are known to be biologically relevant.In this case, the biologically relevant subsets are defined by biological pathways, and are therefore groups of genes involved in series of biochemical reactions linked to a certain biological function.Let us denote these known subsets b 1 , ..., b N ⊂ {1, ..., p}.For our breast cancer experiment, we use the N = 1116 pathways from the Reactome database (Fabregat et al., 2016) included in the R package reactomePA (Yu and He, 2016).For k ≤ N , the enrichment of s in the k-th pathway of this list is the statistical significance of its overlap with b k , evaluated using the hypergeometric test.More specifically, for each k ≤ N , the null hypothesis of this test is that the genes in s are chosen uniformly at random from the total gene population.Under this hypothesis, the test statistic #(s ∩ b k ) follows a hypergeometric distribution and a p-value can be computed to assess the statistical significance of the overlap.Because we are conducting one test for each pathway considered, these p-value are then adjusted using the Benjamini-Hochberg procedure to control the false discovery rate (Benjamini and Hochberg, 1995).The subset s is eventually declared

Results
We compare in  2010) SPCA appears to give slightly better results than thresholded PCA.GSPPCA significantly outperforms the two other methods.This means that the genes selected by GSPPCA are consistently more associated with the Reactome pathways, and are therefore more interpretable.This highlights the fact that projecting the data onto a globally sparse subspace of dimension higher than one leads to significantly more interpretable and biologically plausible results.Regarding the estimation of the sparsity level, choosing the one that explains 99% of the variance led SPCA to selecting 4810 genes, which is difficult to interpret.For thresholded PCA, we selected the sparsity level using a criterion proposed by Teschendorff et al. (2007).Even though it led to the sparsest solution, its PEI was very small.Regarding GSPPCA, the noiseless marginal log-likelihood and the PEI of the corresponding models are plotted on Figure 5.We can see that the marginal likelihood peak corresponds to highly interpretable genes: more than 5% of the biological pathways in the Reactome family have a significant overlap with the genes selected by GSPPCA.Furthermore, models with a lower marginal likelihood have generally a lower PEI.To a certain extend, this shows that our marginal likelihood expression can stand as an indicator of biological significance.

Conclusion
Unsupervised feature selection is an hazy and exciting problem.It becomes particularly difficult and ill-posed when no specific learning task (such as clustering) is driving it.We have proposed in this paper a new method for unsupervised feature selection based on the idea that the data may lie close to a subspace of moderate dimension spanned by a basis with a shared sparsity pattern.On several real data sets, this approach outperforms a popular method which consists in finding the sparsity pattern of the single leading principal vector of the data.These results suggest that, on many real-life high-dimensional data sets, an important part of the information cannot be captured by one-dimensional subspace approximations.

Number of genes Pathway enrichment index
While building our framework, we derived the first closed-form expression of the marginal likelihood of a Bayesian PCA model, using the noiseless model of Roweis (1998).Regarding future work, it would be interesting to see if more complex priors can be used and to what extend our expression can lead to a simultaneous estimation of the sparsity level and the dimension of the latent space.Indeed, intrinsic dimension estimation, which was beyond the scope of this paper, has an enduring relationship with probabilistic versions of PCA (Minka, 2000;Bouveyron et al., 2011;Nakajima et al., 2015) and would be an interesting direction.
Finally, since the noise term and Wy are independent, the characteristic function of x will be ϕ x (u) = ϕ Wy (u)ϕ ε (u) = e −σ 2 ||u|| 2 2 (1 + α||u|| 2 2 ) d/2 .The density of x is then given by the Fourier transform of its characteristic function p(x) = 1 (2π) p R p ϕ x (u)e ix T u du, but, since ϕ x (u) is a radial function (i.e a function that only depends on the norm of its argument), its Fourier transform can be expressed as a univariate integral (Schaback and Wu, 1996) and we can write which is the desired form for the case with no inactive variable.In the general case, v is not necessarily equal to (1, 1, ..., 1) but we can notice that, since x v and x v are independent, we can write p(x) = p(x v)p(x v ).Applying (20) to x v allows us to compute p(x v ) and to eventually obtain the expression of the density given by the theorem.

Appendix B. Proof of Theorem 2
We begin by proving the following lemma, which links the distribution of the product between a Gaussian matrix and a Gaussian vector with the Bessel distribution.This result may be of independent interest.Lemma 7 Let A be a q × d random matrix such that a ij ∼ N (0, s 2 ) with s > 0 for all i, j and let b ∼ N 0, I d ).Then Ab follows a Bessel distribution with parameters s and (d − q)/2.
Proof Using the decomposition arguments from the proof of Theorem 1, the characteristic function of Ab is, for all u ∈ R k , ϕ Ab (u) = 1 (1 + ||u|| 2 2 /s) d/2 , which is exactly the characteristic function of the symmetric multivariate Bessel distribution Fang et al. (1990, Def. 2.5).
We can now prove Theorem 2. Proof Let us first consider the case where all variables are active and assume that v = (1, 1, ..., 1).Using Lévy's continuity theorem, ε 2 weakly converges to zero when σ 2 vanishes.Since zero is a constant, this convergence also happens to be in probability ( Van der Vaart, 2000, p. 10).The variable x therefore converges in probability to Wy, which follows a Bessel(1/α, (d − q)/2) distribution according to our lemma.
In the general case when v is not necessarily equal to (1, 1, ..., 1) we can prove (6) by invoking the independence between x v and x v, similarly to the proof of Theorem 1. x i,j u j w T j E q(y i ) which leads to the factorization q * (W) = j≤p q * (w i ) and to the desired expression.
• among the genes obtained, only the genes listed in the Reactome database (Fabregat et al., 2016) were kept in order to eventually perform pathway enrichment, • finally, the data was centered but not standardized.
The resulting data matrix contains 5391 variables (genes) and 344 observations (patients).

Figure 1 :
Figure 1: Variable selection with GSPPCA on the introductory example.

Figure 2 :
Figure 2: Median, first and third quartiles of the F-score for the experiment of subsection 3.2, based on 100 runs

Figure 3 :
Figure 3: Percentage of variance explained by the data projected onto a 10-dimensional globally sparse subspace

Figure 4
Figure 4 presents the original signals and observed signals for scenarios A and B. In both cases, n = 100 signals were sampled during the training phase and decomposed as p = 175 wavelet coefficients.For signal denoising, GSPPCA is applied on the n × p wavelet coefficient matrix to extract d = 10 globally sparse principal axes.Then, a new sampled signal is projected on those extracted principal axes and back-projected in the original wavelet domain.It is worth mentioning that the estimated value for q = v 0 is 17 on scenario A and 15 on scenario B.As an illustration, we plotted on Figure4the denoising results for newly sampled signals A and B with GSPPCA.We used the same projection-reconstruction protocol for PCA, thresholded PCA (PCA loading smaller than 1 × 10 −3 are set to 0) and SPCA (λ is chosen such that 99% of the PCA projected variance is conserved).Denoising results obtained with those methods are also supplied on Figure4.First, on both signal A and B, PCA achieves a very satisfying denoising and thus confirms his validity in this context.One can also show that a simple thresholding of the PCA loadings allows a clear denoising improvement and turns out to be competitive with the one performed by SPCA.The SPCA result is here somehow disappointing due to the fact that the sparsity is not global and most wavelet

Table 1 :
F-score×100 for the model selection experiment of subsection 3.3 with Gaussian noise

Table 3 :
Larochelle et al. (2007)CA and GSPPCA for the three datasets ofLarochelle et al. (2007), selected variables are in white

Table 4 :
Reconstruction error (sum of squared errors) for wavelet signal denoising on the two simulation scenarios (results are averaged on 50 signal reconstructions).Standard deviations are also provided.levels stay active in the final reconstruction.Finally, the global sparsity of GSPPCA retains only a few wavelet levels and achieves here the best reconstruction in both scenarios.

Table 5
certain pathway if the adjusted p-value of the corresponding hypergeometric test is lower than 0.01.The PEI is finally defined as the percentage of enriched pathways in the Reactome family.
Table 5 the PEI obtained by GSPPCA with d = 10, SPCA and thresholded PCA for several fixed cardinalities.Similarly to Zou et al. (2006), the two local methods are computing a single sparse axis.As in Journée et al. (