1 Introduction

Principal component analysis (PCA) plays an important role in applications arising from data analysis, dimension reduction, bioinformatics, etc. PCA finds a few linear combinations of the original variables. These linear combinations, which are called principal components (PCs), are orthogonal to each other and explain most of the variance of the data. Specifically, let ξ=(ξ_1,⋯,ξ_p) be a p-dimensional random vector. Then for a given data matrix \(M\in \mathbb {R}^{p\times n}\), which consists of n samples of the p variables, PCA corresponds to a singular value decomposition (SVD) of M or an eigenvalue decomposition of a sample covariance matrix Σ. Without loss of generality, assume M is centered, i.e., the means of the rows of M are all zeros; then a commonly used sample covariance matrix is Σ=MM^⊤/(n−1). If the eigenvalue decomposition of Σ is given by Σ=VΛV^⊤, then the components of ξV are the PCs and the columns of V are called the loading vectors of the corresponding PCs. Thus, finding the PC that explains the largest variance of the variables corresponds to the following eigenvalue problem:

$$ x^* := \arg\max x^\top \varSigma x, \quad \mbox{s.t.}\ \|x\|_2 \leqslant 1, $$
(1.1)

where ∥x∥_2 is the Euclidean norm of the vector x. Problem (1.1) in fact gives the eigenvector that corresponds to the largest eigenvalue of Σ. However, the loading vector x is not expected to have many zero coefficients, which makes the PC hard to interpret. For example, in the text classification problem, we are given a binary data matrix \(M\in \mathbb {R}^{p\times n}\) that records the occurrences of p words in n postings, i.e., M_{ij}=1 if the ith word appears in the jth posting and M_{ij}=0 otherwise. Standard PCA cannot tell which words contribute most to the explained variance since the loadings are linear combinations of all the variables. Thus, sparse PCs are needed because they make it easier to identify which variables contribute most to the explained variance.
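To make the standard PCA step concrete, the following is a minimal NumPy sketch of (1.1): it centers the rows of the data matrix, forms the sample covariance Σ=MM^⊤/(n−1), and returns the loading vector of the largest eigenvalue. The function name leading_pc and its interface are ours, for illustration only.

```python
import numpy as np

def leading_pc(M):
    """Leading principal component of a p-by-n data matrix M (rows = variables).

    A minimal sketch of (1.1): center each row, form Sigma = M M^T / (n - 1),
    and return the eigenvector associated with the largest eigenvalue.
    """
    p, n = M.shape
    Mc = M - M.mean(axis=1, keepdims=True)   # center the rows of M
    Sigma = Mc @ Mc.T / (n - 1)              # sample covariance matrix
    vals, vecs = np.linalg.eigh(Sigma)       # eigenvalues in ascending order
    return vecs[:, -1], vals[-1]             # loading vector x* and explained variance
```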

Many techniques have been proposed to extract sparse PCs from a given sample covariance matrix Σ or sample data matrix M. One natural idea is to impose a cardinality constraint on (1.1), which leads to the following formulation for sparse PCA:

$$ x^* := \arg\max x^\top \varSigma x, \quad \mbox{s.t.}\ \|x\|_2 \leqslant 1, \quad \|x\|_0 \leqslant K, $$
(1.2)

where ∥x∥_0 (the ℓ_0 norm of x) counts the number of nonzero entries of x and the integer K controls the sparsity of the solution. Note that the cardinality constraint ∥x∥_0⩽K makes the problem nonconvex, and thus usually numerically challenging to solve. Other nonconvex models, and algorithms for solving them, are considered in [8, 21, 23, 24, 42]. It should be pointed out that although these algorithms are quite efficient, their convergence results are usually not very strong; in particular, no global convergence results are available since the models are nonconvex.

In this paper, we focus on a convex relaxation of (1.2) that was proposed by d’Aspremont et al. in [9]. The convex relaxation in [9] is a semidefinite programming (SDP) problem based on the lifting-and-projection technique, a standard technique for approximating combinatorial optimization problems by SDPs (see e.g., [1, 3, 34]). Note that if we let X=xx^⊤, then (1.2) can be rewritten as

$$ \max_{X\in \mathbb {R}^{p\times p}}\bigl\{ \langle \varSigma, X\rangle, \ \mbox{s.t.}\ \mathbf {Tr}(X) = 1, \|X\|_0 \leqslant K^2, X\succeq 0, \operatorname{rank}(X) = 1\bigr\}, $$
(1.3)

where Tr(X) denotes the trace of the matrix X. The rank constraint is then dropped and the cardinality constraint is replaced by an ℓ_1 norm constraint, which leads to the following convex problem, an SDP:

$$ \max_{X\in \mathbb {R}^{p\times p}}\bigl\{\langle \varSigma, X\rangle, \ \mbox{s.t.}\ \mathbf {Tr}(X) = 1, \|X\|_1 \leqslant K, X \succeq 0\bigr\}, $$
(1.4)

where the ℓ_1 norm of X is defined as ∥X∥_1:=∑_{ij}|X_{ij}|, and ∥X∥_1⩽K is imposed to promote the sparsity of the solution. This is inspired by the recent emergence of compressed sensing (see e.g., [5, 10]). Note that ∥X∥_1⩽K is used in (1.4) instead of ∥X∥_1⩽K^2. This is due to the fact that, when X=xx^⊤ and Tr(X)=1, we have ∥X∥_F=1, and that if ∥X∥_0⩽K^2, then ∥X∥_1⩽K∥X∥_F. After the optimal solution X^* of (1.4) is obtained, the vector \(\hat{x}\) from the rank-one decomposition of X^*, i.e., \(X^{*}=\hat{x}\hat{x}^{\top}\), is used as an approximation of the solution of (1.2). This is the whole lifting-and-projection procedure. Although standard methods such as interior point methods can be used to solve the SDP (1.4) (see e.g., [1, 3, 34]), it is not advisable to do so because (1.4) is a nonsmooth problem, and transforming it into a standard SDP increases the size of the problem dramatically.
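For completeness, the bound ∥X∥_1⩽K∥X∥_F invoked above follows from the Cauchy–Schwarz inequality applied to the nonzero entries of X:

$$\|X\|_1=\sum_{(i,j):\,X_{ij}\neq 0}|X_{ij}| \leqslant \sqrt{\|X\|_0}\,\|X\|_F \leqslant \sqrt{K^2}\,\|X\|_F = K\|X\|_F. $$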

It is known that (1.4) is equivalent to the following problem with an appropriately chosen parameter ρ>0:

$$ \max_{X\in \mathbb {R}^{p\times p}} \bigl\{ \langle \varSigma, X\rangle - \rho\|X\|_1 \ \mbox{s.t.}\ \mathbf {Tr}(X) = 1, X\succeq 0\bigr\}. $$
(1.5)

Note that (1.5) can be rewritten as

$$ \max_{X\succeq 0, \mathbf {Tr}(X) = 1} \min_{\|U\|_\infty \leqslant \rho} \langle \varSigma+U,X\rangle, $$
(1.6)

where ∥U∥_∞ denotes the largest component of U in magnitude, i.e., ∥U∥_∞=max_{ij}|U_{ij}|. The dual problem of (1.5) is given by interchanging the max and min in (1.6), i.e.,

$$\min_{\|U\|_\infty \leqslant \rho} \max_{X\succeq 0, \mathbf {Tr}(X) = 1} \langle \varSigma+U,X \rangle, $$

which can be further reduced to

$$ \min_{U\in \mathbb {R}^{p\times p}} \lambda_{\max}(\varSigma+U), \quad \mbox{s.t.}\ \|U \|_\infty \leqslant \rho, $$
(1.7)

where λ_max(Z) denotes the largest eigenvalue of the matrix Z. d’Aspremont et al. [9] proposed to solve the dual problem (1.7) using Nesterov’s first-order algorithm (see e.g., [27, 28]), which is an accelerated projected gradient method. However, since the objective function of (1.7) is nonsmooth, one needs to smooth it in order to apply Nesterov’s algorithm. Thus, the authors of [9] actually solve an approximation of the dual problem (1.7), which can be formulated as follows:

$$ \min f_\mu(U), \quad \mbox{s.t.}\ \|U\|_\infty \leqslant \rho, $$
(1.8)

where μ>0 is the smoothing parameter, f_μ(U):=max{〈Σ+U,X〉−μd(X), s.t. Tr(X)=1, X⪰0} and d(X):=Tr(X log X)+log(n). It is shown in [9] that an approximate solution X^k to the primal problem (1.5) can be obtained as X^k=∇f_μ(U^k), where U^k is an approximate solution of (1.8). It is easy to see that X^k is not guaranteed to be a sparse matrix. Moreover, although there is no duality gap between (1.5) and (1.7), the authors solve only an approximation of (1.7). It should also be noted that Nesterov’s algorithm as used in [9] cannot solve the constrained problem (1.4). Although (1.4) and (1.5) are equivalent for appropriately chosen parameters K and ρ, in many applications it is usually easier to choose an appropriate K, since we know how many nonzeros are preferred in the sparse PCs.

Note that (1.2) only gives the largest sparse PC. In many applications, several leading sparse PCs are needed in order to explain more variance. Multiple sparse PCs are usually found by solving a sequence of sparse PCA problems (1.2), with Σ constructed via the so-called deflation technique for each sparse PC.

In this paper, we propose an alternating direction method based on a variable-splitting technique and an augmented Lagrangian framework for solving the primal problems (1.4) and (1.5) directly. Our method solves two subproblems in each iteration. One subproblem has a closed-form solution that corresponds to projecting a given matrix onto the simplex of the cone of semidefinite matrices; this projection requires an eigenvalue decomposition. The other subproblem has a closed-form solution that corresponds to a vector shrinkage operation (for Problem (1.5)) or a projection onto the ℓ_1 ball (for Problem (1.4)). Thus, our method produces two iterates at each iteration: one is a semidefinite matrix with trace equal to one, and the other is a sparse matrix. Eventually these two iterates converge to the same point, and thus we obtain an optimal solution that is both sparse and semidefinite. Compared with Nesterov’s first-order method suggested in [9] for solving the approximate dual problem (1.8), our method solves the nonsmooth primal problems (1.4) and (1.5) in a unified way. Also, since we deal with the primal problems directly, the ℓ_1 norm in the constraint or the objective function guarantees that our solution is a sparse matrix, while Nesterov’s method in [9] does not guarantee this since it solves the approximate dual problem.

The rest of the paper is organized as follows. In Sect. 2, we introduce our alternating direction method of multipliers for solving the nonsmooth SDP problems (1.4) and (1.5). We discuss some practical issues, including the deflation technique for computing multiple sparse PCs, in Sect. 3. In Sect. 4, we apply our alternating direction method of multipliers to sparse PCA problems arising from different applications, such as classification of text data and senate voting records, to demonstrate the efficacy of our method. We draw some conclusions in Sect. 5.

2 Alternating Direction Method of Multipliers

We first introduce some notation. We use \(\mathcal {C}\) to denote the simplex of the cone of semidefinite matrices, i.e., \(\mathcal {C}=\{X\in \mathbb {R}^{p\times p}\mid \mathbf {Tr}(X)=1,X\succeq 0\}\). We use \(\mathcal {B}\) to denote the ℓ_1-ball with radius K in \(\mathbb {R}^{p\times p}\), i.e., \(\mathcal {B}=\{X\in \mathbb {R}^{p\times p}\mid \|X\|_{1}\leqslant K\}\). \(I_{\mathcal {A}}(X)\) denotes the indicator function of a set \(\mathcal {A}\), i.e.,

$$ I_\mathcal {A}(X) = \left \{ \begin{array}{l@{\quad}l} 0 & \mbox{if}\ X\in \mathcal {A}, \\ +\infty & \mbox{otherwise}. \end{array} \right . $$
(2.1)

We know that \(I_{\mathcal {C}}(X)\) and \(I_{\mathcal {B}}(X)\) are both convex functions since \(\mathcal {C}\) and \(\mathcal {B}\) are both convex sets. We then can reformulate (1.4) and (1.5) uniformly as the following unconstrained problem:

$$ \min -\langle\varSigma,X \rangle + I_\mathcal {C}(X) + h(X), $$
(2.2)

where \(h(X)=I_{\mathcal {B}}(X)\) for (1.4) and h(X)=ρ∥X∥_1 for (1.5). Note that h(X) is convex in both cases. Problem (2.2) can also be viewed as the following inclusion problem:

$$ \mbox{Find}\ X, \quad \mbox{s.t.}\ 0\in -\varSigma + \partial I_\mathcal {C}(X) + \partial h(X). $$
(2.3)

Problem (2.3) amounts to finding a zero of the sum of two monotone operators. Methods based on operator-splitting techniques, such as the Douglas–Rachford and Peaceman–Rachford methods, are commonly used to solve such problems (see e.g., [6, 7, 11, 13, 14, 22, 30]). From the convex optimization perspective, the alternating direction method of multipliers (ADMM) for solving (2.2) is a direct application of the Douglas–Rachford method. ADMM has been successfully used to solve structured convex optimization problems arising from image processing, compressed sensing, machine learning, semidefinite programming, etc. (see e.g., [2, 15–19, 26, 33, 36–39]). We now show how ADMM can be used to solve the sparse PCA problem (2.2).

ADMM for solving (2.2) is based on a variable-splitting technique and an augmented Lagrangian framework. By introducing a new variable Y, (2.2) can be rewritten as

$$ \begin{array}{r@{\ }l} \min & -\langle\varSigma,X \rangle + I_\mathcal {C}(X) + h(Y) \\ \mbox{s.t.} & X = Y. \end{array} $$
(2.4)

Note that although the number of variables is increased, the two nonsmooth functions \(I_{\mathcal {C}}(\cdot)\) and h(⋅) are now separated since they are associated with different variables. For this equality-constrained problem, the augmented Lagrangian method is a standard solution approach. A typical iteration of the augmented Lagrangian method for solving (2.4) is given by

$$ \left \{ \begin{array} {l} \bigl(X^{k+1},Y^{k+1}\bigr) := \displaystyle\arg\min _{(X,Y)} \mathcal {L}_\mu\bigl(X,Y;\varLambda^k \bigr), \\[3mm] \varLambda^{k+1} := \displaystyle \varLambda^k - \frac{1}{\mu}\bigl(X^{k+1}-Y^{k+1}\bigr), \end{array} \right . $$
(2.5)

where the augmented Lagrangian function \(\mathcal {L}_{\mu}(X,Y;\varLambda)\) is defined as:

$$ \mathcal {L}_\mu(X,Y;\varLambda) := - \langle \varSigma,X\rangle + I_\mathcal {C}(X) + h(Y) - \langle \varLambda, X-Y \rangle + \frac{1}{2\mu}\|X-Y\|_F^2, $$
(2.6)

where μ>0 is a penalty parameter and Λ is the Lagrange multiplier associated with the linear constraint X=Y. Note that it is usually hard to minimize the augmented Lagrangian function \(\mathcal {L}_{\mu}(X,Y;\varLambda^{k})\) with respect to X and Y simultaneously; in fact, this is as difficult as solving the original problem (2.4). However, if we minimize the augmented Lagrangian function with respect to X and Y alternately, we obtain two subproblems in each iteration, both of which are relatively easy to solve. This results in the following alternating direction method of multipliers:

$$ \left \{\begin{array} {l} \displaystyle X^{k+1} := \arg\min_{X}\mathcal {L}_\mu \bigl(X,Y^k;\varLambda^k\bigr), \\[1mm] \displaystyle Y^{k+1} := \arg\min_Y \mathcal {L}_\mu\bigl(X^{k+1},Y;\varLambda^k\bigr), \\[1mm] \varLambda^{k+1} := \varLambda^k - \bigl(X^{k+1}-Y^{k+1} \bigr)/\mu. \end{array} \right . $$
(2.7)

It can be shown that the two subproblems in (2.7) are both relatively easy to solve for the sparse PCA problem. Before showing this, we state two useful properties of the indicator function (2.1).

Property 1

The proximal mapping of the indicator function \(I_{\mathcal {A}}(\cdot)\) is the Euclidean projection onto \(\mathcal {A}\), i.e.,

$$ \mathrm {prox}_{I_\mathcal {A}}(X) \equiv \mathcal {P}_{\mathcal {A}}(X), $$
(2.8)

where

$$ \mathrm {prox}_{I_\mathcal {A}}(X) := \arg\min _U \biggl\{I_\mathcal {A}(U)+\frac{1}{2}\|U-X \|_F^2\biggr\}, $$
(2.9)

and

$$ \mathcal {P}_{\mathcal {A}}(X) := \arg\min_U \biggl\{\frac{1}{2}\|U-X\|_F^2, \ \mbox{s.t.}\ U\in \mathcal {A}\biggr \}. $$
(2.10)

Property 2

The optimality conditions for Problem (2.10) are given by

$$ X-U^* \in \partial I_\mathcal {A}\bigl(U^*\bigr), $$
(2.11)

which is equivalent to

$$ \bigl\langle X-U^*,Z-U^* \bigr \rangle \leqslant 0, \quad \forall Z\in \mathcal {A}, $$
(2.12)

where U^* is the optimal solution of (2.10).

Now, the first subproblem in (2.7) can be reduced to

$$ X^{k+1} := \arg\min \biggl\{ \mu I_\mathcal {C}(X) + \frac{1}{2} \bigl\|X-\bigl(Y^k+\mu\varLambda^k+ \mu\varSigma\bigr)\bigr\|_F^2 \biggr\}, $$
(2.13)

which can be further reduced to projection onto \(\mathcal {C}\) using Property 1,

$$ X^{k+1} = \mathcal {P}_\mathcal {C}\bigl(Y^k+\mu\varLambda^k+\mu\varSigma \bigr) := \arg\min \biggl\{ \frac{1}{2} \bigl\|X-\bigl(Y^k+\mu \varLambda^k+\mu\varSigma\bigr)\bigr\|_F^2, \ \mbox{s.t.}\ X\in \mathcal {C}\biggr\}. $$
(2.14)

When \(h(Y)=I_{\mathcal {B}}(Y)\) as in Problem (1.4), the second subproblem in (2.7) can be reduced to

$$ Y^{k+1} := \arg\min \biggl\{\mu I_\mathcal {B}(Y) + \frac{1}{2}\bigl\|Y-\bigl(X^{k+1}-\mu\varLambda^k \bigr)\bigr\|_F^2 \biggr\}, $$
(2.15)

which can be further reduced to projection onto \(\mathcal {B}\) using Property 1,

$$ Y^{k+1} = \mathcal {P}_\mathcal {B}\bigl(X^{k+1}-\mu\varLambda^k\bigr) := \arg\min \biggl\{\frac{1}{2} \bigl\|Y-\bigl(X^{k+1}-\mu\varLambda^k\bigr)\bigr\|_F^2, \ \mbox{s.t.}\ Y\in \mathcal {B}\biggr\}. $$
(2.16)

When h(Y)=ρ∥Y∥_1 as in Problem (1.5), the second subproblem in (2.7) can be reduced to

$$ Y^{k+1} := \arg\min \biggl\{\mu\rho \|Y\|_1 + \frac{1}{2} \bigl\|Y-\bigl(X^{k+1}-\mu\varLambda^k \bigr)\bigr\|_F^2 \biggr\}. $$
(2.17)

Problem (2.17) has a closed-form solution that is given by

$$ Y^{k+1} = \mathrm {Shrink}\bigl(X^{k+1}-\mu\varLambda^k,\mu\rho\bigr), $$
(2.18)

where the shrinkage operator is defined as

$$ \bigl(\mathrm {Shrink}(Z,\tau)\bigr)_{ij} := \mathrm {sgn}(Z_{ij})\max\bigl\{|Z_{ij}|-\tau,0\bigr\}, \quad \forall i,j. $$
(2.19)
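The shrinkage operator (2.19) is a simple entrywise soft-thresholding and can be implemented in a few lines; the following NumPy sketch (with the hypothetical name shrink) illustrates the closed-form update (2.18).

```python
import numpy as np

def shrink(Z, tau):
    """Entrywise soft-thresholding (2.19): sgn(Z_ij) * max(|Z_ij| - tau, 0).

    In (2.18) this is applied to Z = X^{k+1} - mu * Lambda^k with tau = mu * rho.
    """
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)
```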

In the following, we show that (2.13) and (2.15) are easy to solve, i.e., that the two projections (2.14) and (2.16) can be computed efficiently. First, since the problem of projection onto \(\mathcal {C}\)

$$ \mathcal {P}_\mathcal {C}(X) = \arg\min \biggl\{ \frac{1}{2} \|Z-X\|_F^2, \ \mbox{s.t.}\ \mathbf {Tr}(Z) =1, Z\succeq 0 \biggr\} $$
(2.20)

is unitary-invariant, its solution is given by \(\mathcal {P}_{\mathcal {C}}(X)=U\operatorname{diag}(\gamma)U^{\top}\), where \(X=U\operatorname{diag}(\sigma)U^{\top}\) is the eigenvalue decomposition of X, and γ is the projection of σ onto the simplex in the Euclidean space, i.e.,

$$ \gamma := \arg\min \Biggl\{\frac{1}{2}\|\xi-\sigma \|_2^2, \ \mbox{s.t.}\ \sum_{i=1}^p \xi_i = 1, \xi \geqslant 0\Biggr\}. $$
(2.21)

We consider a slightly more general problem:

$$ \xi^* := \arg\min \Biggl\{\frac{1}{2}\|\xi- \sigma\|_2^2, \ \mbox{s.t.}\ \sum _{i=1}^p \xi_i = r, \xi \geqslant 0\Biggr\}, $$
(2.22)

where the scalar r>0. Note that (2.21) is a special case of (2.22) with r=1. From the first-order optimality conditions for (2.22), it is easy to show that the optimal solution of (2.22) is given by

$$\xi^*_i:=\max\{\sigma_i-\theta,0\}, \quad \forall i=1, \cdots,p, $$

where the scalar θ is the solution of the following piecewise linear equation:

$$ \sum _{i=1}^p\max\{\sigma_i-\theta,0\}=r. $$
(2.23)

It is known that the piecewise linear equation (2.23) can be solved quite efficiently, and thus (2.22) can be solved easily. In fact, the following procedure (Algorithm 1) gives the optimal solution of (2.22); we refer the reader to [31] for a proof of its validity. It is easy to see that Algorithm 1 has O(p log p) complexity. Linear-time algorithms for solving (2.22) are studied in [4, 12, 29]. Thus, solving (2.13) amounts to an eigenvalue decomposition and a projection onto the simplex in Euclidean space, both of which can be done efficiently.

Algorithm 1

Projection onto the simplex in the Euclidean space
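The algorithm figure itself is not reproduced here. Assuming Algorithm 1 follows the standard O(p log p) sort-based procedure of [31], a minimal NumPy sketch (the function name project_simplex is ours) reads as follows: it finds the θ solving the piecewise linear equation (2.23) and sets ξ_i^*=max{σ_i−θ,0}.

```python
import numpy as np

def project_simplex(sigma, r=1.0):
    """Projection of sigma onto {xi : sum(xi) = r, xi >= 0}, cf. (2.22)."""
    sigma = np.asarray(sigma, dtype=float)
    s = np.sort(sigma)[::-1]                  # sort entries in decreasing order
    cssv = np.cumsum(s) - r                   # cumulative sums shifted by r
    idx = np.arange(1, s.size + 1)
    rho = idx[s - cssv / idx > 0][-1]         # number of positive entries in the solution
    theta = cssv[rho - 1] / rho               # theta solving (2.23)
    return np.maximum(sigma - theta, 0.0)
```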

Solving (2.15) (or equivalently (2.16)) corresponds to a projection onto the ℓ_1-ball ∥Y∥_1⩽K. It has been shown in [12, 35] that the projection onto the ℓ_1-ball can be computed easily. In fact, the solution of

$$ \hat{\gamma}= \arg\min \biggl\{\frac{1}{2}\|\xi-\hat{\sigma}\|_2^2, \ \mbox{s.t.}\ \|\xi\|_1\leqslant r\biggr\} $$
(2.24)

is given by \(\hat{\gamma}_{i}=\mathrm {sgn}(\hat{\sigma}_{i})\gamma_{i}, \forall i=1,\cdots,p\), where γ is the solution of

$$\min \frac{1}{2}\bigl\|\gamma-|\hat{\sigma}|\bigr\|_2^2, \quad \mbox{s.t.}\ \sum_{i=1}^p \gamma_i=r, \quad \gamma \geqslant 0, $$

i.e., the projection of \(|\hat{\sigma}|\) (elementwise absolute value of \(\hat{\sigma}\)) onto the simplex. Thus, (2.15) can be rewritten as

$$ \mathrm {vec}\bigl(Y^{k+1}\bigr) = \arg \min \biggl\{\frac{1}{2}\bigl\|y-\mathrm {vec}\bigl(X^{k+1}-\mu\varLambda^k\bigr) \bigr\|_2^2, \ \mbox{s.t.}\ \|y\|_1\leqslant K\biggr\}, $$
(2.25)

which reduces to a projection onto the simplex in the Euclidean space as described above; here vec(Y) denotes the vector form of Y, obtained by stacking the columns of Y into a long vector.
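Assuming the project_simplex sketch given after Algorithm 1, the ℓ_1-ball projection (2.24) can be sketched as below; the only additional ingredient is restoring the signs, and the sketch also handles explicitly the trivial case in which the point already lies inside the ball (the projection is then the identity). The matrix subproblem (2.16) is obtained by vectorizing, projecting and reshaping, as in (2.25).

```python
import numpy as np

def project_l1_ball(sigma_hat, r):
    """Projection onto {xi : ||xi||_1 <= r}, cf. (2.24)."""
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    if np.abs(sigma_hat).sum() <= r:                 # already feasible: nothing to do
        return sigma_hat.copy()
    gamma = project_simplex(np.abs(sigma_hat), r)    # project |sigma_hat| onto the simplex
    return np.sign(sigma_hat) * gamma                # restore the signs

# Matrix version used in (2.16), cf. (2.25):
# Y_next = project_l1_ball((X_next - mu * Lam).ravel(), K).reshape(X_next.shape)
```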

To summarize, our ADMM for solving (1.4) and (1.5) can be uniformly described as Algorithm 2.

Algorithm 2

ADMM for solving (1.4) and (1.5)
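Since the algorithm figure is not reproduced here, the following NumPy sketch illustrates one possible implementation of the iteration (2.7), assuming the project_simplex and project_l1_ball sketches above. The function names, the default value of μ, and the simplified stopping rule (a single tolerance on ∥r^k∥_F and ∥s^k∥_F instead of (3.2)–(3.3)) are ours, not part of the authors’ MATLAB code.

```python
import numpy as np

def project_spectral_simplex(W):
    """Projection onto C = {X : Tr(X) = 1, X >= 0} via (2.20)-(2.21)."""
    vals, vecs = np.linalg.eigh((W + W.T) / 2)       # eigenvalue decomposition
    gamma = project_simplex(vals, 1.0)               # project the spectrum onto the simplex
    return (vecs * gamma) @ vecs.T                   # U diag(gamma) U^T

def admm_sparse_pca(Sigma, K=None, rho=None, mu=0.5, max_iter=1000, tol=1e-4):
    """Sketch of the ADMM iteration (2.7) for (1.4) (pass K) or (1.5) (pass rho)."""
    assert (K is None) != (rho is None), "pass exactly one of K or rho"
    p = Sigma.shape[0]
    X = np.eye(p) / p
    Y = X.copy()
    Lam = np.zeros((p, p))
    for _ in range(max_iter):
        X = project_spectral_simplex(Y + mu * Lam + mu * Sigma)          # X-subproblem (2.14)
        W = X - mu * Lam
        if rho is not None:                                              # (1.5): shrinkage (2.18)
            Y_new = np.sign(W) * np.maximum(np.abs(W) - mu * rho, 0.0)
        else:                                                            # (1.4): l1-ball projection (2.16)
            Y_new = project_l1_ball(W.ravel(), K).reshape(p, p)
        Lam = Lam - (X - Y_new) / mu                                     # multiplier update
        primal = np.linalg.norm(X - Y_new, 'fro')                        # r^k
        dual = np.linalg.norm(Y_new - Y, 'fro') / mu                     # s^k
        Y = Y_new
        if primal < tol and dual < tol:
            break
    return X, Y
```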

Remark 2.1

Although Algorithm 2 suggests that we need to compute the eigenvalue decomposition of Y^k+μΛ^k+μΣ in order to get the solution to (2.13), we actually only need to compute the positive eigenvalues and corresponding eigenvectors of Y^k+μΛ^k+μΣ.

We have the following global convergence result for Algorithm 2.

Theorem 2.2

The sequence {(X^k,Y^k,Λ^k)} produced by (2.7) (Algorithm 2) from any starting point converges to an optimal solution of Problem (2.4).

Proof

The proof of this convergence result is a direct application of the well-studied convergence theory for the Douglas–Rachford operator splitting method (see [13, 14]). We include a proof specialized to Problem (2.4) in the Appendix for the sake of completeness. □

3 The Deflation Techniques and Other Practical Issues

It should be noticed that the solution of Problem (1.1) only gives the largest eigenvector (the eigenvector corresponding to the largest eigenvalue) of Σ. In many applications, the largest eigenvector is not enough to explain the total variance of the data, so one usually needs to compute several leading eigenvectors to explain more of the variance. Hotelling’s deflation method [32] is usually used to extract the leading eigenvectors sequentially. Hotelling’s deflation method extracts the rth leading eigenvector of Σ by solving

$$x_r = \arg\max \bigl\{x^\top \varSigma_{r-1} x, \ \mbox{s.t.}\ \|x\|_2\leqslant 1\bigr\}, $$

where Σ_0:=Σ and

$$\varSigma_r = \varSigma_{r-1} - x_rx_r^\top \varSigma_{r-1}x_rx_r^\top. $$

It is easy to verify that Hotelling’s deflation method preserves the positive-semidefiniteness of the matrix Σ_r when x_r is an exact eigenvector of Σ_{r−1}, as in standard PCA. However, as pointed out in [25], positive-semidefiniteness is no longer preserved for the sparse PCA problem (1.2), because the solution x_r is no longer an eigenvector of Σ_{r−1}. Thus, the second leading eigenvector produced by solving the sparse PCA problem may not explain the variance of the data well. We should point out that the deflation method used in [9] is Hotelling’s deflation method.

Several deflation techniques to overcome this difficulty for sparse PCA were proposed by Mackey in [25]. In our numerical experiments, we chose to use the Schur complement deflation method in [25]. The Schur complement deflation method updates matrix Σ r by

$$ \varSigma_r = \varSigma_{r-1} - \frac{\varSigma_{r-1}x_r x_r^\top \varSigma_{r-1}}{x_r^\top \varSigma_{r-1} x_r}. $$
(3.1)

The Schur complement deflation method has the following properties, as shown in [25]: (i) it preserves positive-semidefiniteness, i.e., Σ_r⪰0; (ii) it renders x_s orthogonal to Σ_r for s⩽r, i.e., Σ_r x_s=0, ∀ s⩽r.
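As an illustration, the Schur complement update (3.1) is a rank-one modification of Σ_{r−1} and can be sketched as follows. The short driver extracts r sparse loading vectors sequentially in the way described in the next paragraph, reusing the hypothetical admm_sparse_pca sketch from Sect. 2 and the rank-one post-processing discussed later in this section; the function names are ours.

```python
import numpy as np

def schur_deflate(Sigma, x):
    """Schur complement deflation (3.1): Sigma - (Sigma x x^T Sigma) / (x^T Sigma x)."""
    Sx = Sigma @ x
    return Sigma - np.outer(Sx, Sx) / (x @ Sx)

def leading_sparse_pcs(Sigma, r, K, mu=0.5):
    """Sequentially extract r sparse loading vectors with Schur complement deflation."""
    loadings = []
    S = Sigma.copy()
    for _ in range(r):
        X_opt, _ = admm_sparse_pca(S, K=K, mu=mu)   # solve (1.4) on the deflated matrix
        vals, vecs = np.linalg.eigh(X_opt)
        x = vecs[:, -1]                             # rank-one factor of X* (post-processing)
        loadings.append(x)
        S = schur_deflate(S, x)                     # deflate before extracting the next PC
    return np.column_stack(loadings)
```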

When we want to find the leading r sparse PCs of Σ, we use ADMM to solve sequentially r problems of the form (1.4) or (1.5), with Σ updated by the Schur complement deflation method (3.1). We denote the leading r sparse PCs obtained by our ADMM by X_r=(x_1,⋯,x_r). Usually the total variance explained by X_r is given by \(\mathbf {Tr}(X_{r}^{\top}\varSigma X_{r})\). However, because we do not require the loading vectors to be orthogonal to each other when we sequentially solve the SDPs (1.4) or (1.5), these PCs are correlated. Thus, \(\mathbf {Tr}(X_{r}^{\top}\varSigma X_{r})\) overestimates the total variance explained by x_1,⋯,x_r. To alleviate this overestimation, Zou et al. [42] suggested that the explained total variance be computed by the following procedure, called the adjusted variance:

$$\mathit{AdjVar}(X_r) := \mathbf {Tr}\bigl(R^2\bigr), $$

where M^⊤X_r=QR is the QR decomposition of M^⊤X_r. In our numerical experiments, we always report the adjusted variance as the explained variance.

It is also worth noting that the problems we solve are convex relaxations of the original problem (1.2). Hence, one needs to post-process the matrix X obtained by solving (1.4) or (1.5) to get an approximate solution of (1.2). To obtain the solution of the original sparse PCA problem (1.2) from the solution X of the convex SDP, we simply perform a rank-one decomposition of X, i.e., X=xx^⊤. Since X is a sparse matrix, x is a sparse vector. This post-processing technique is also used in [9].

For the stopping criterion of ADMM, we consider both the primal and dual residuals, as suggested by Boyd et al. in [2]. In our problem, the primal residual at iteration k is measured by

$$r^k := X^k - Y^k $$

and the dual residual at iteration k is measured by

$$s^k := \bigl(Y^{k-1}-Y^k\bigr)/\mu. $$

We thus chose to terminate our ADMM when

$$ \bigl\|r^k\bigr\|_F < \epsilon^p \quad \mbox{and} \quad \bigl\|s^k\bigr\|_F < \epsilon^d, $$
(3.2)

where

$$ \epsilon^p := p \epsilon_1 + \epsilon_2 \max\bigl\{\bigl\|X^k \bigr\|_F, \bigl\|Y^k\bigr\|_F\bigr\} \quad \mbox{and} \quad \epsilon^d := p\epsilon_1 + \epsilon_2 \bigl\|\varLambda^k\bigr\|_F, $$
(3.3)

and ϵ_1 and ϵ_2 are tolerance parameters that will be specified later.

4 Numerical Results

In this section, we use our ADMM to solve the SDP formulations (1.4) and (1.5) of sparse PCA on both synthetic and real data sets. We mainly compare the performance of ADMM with that of DSPCA [9] for solving (1.5). We also include a comparison with the ALSPCA method proposed in [23], but we note that ALSPCA solves a completely different, nonconvex model, and thus its global convergence is not guaranteed. The MATLAB codes of DSPCA and ALSPCA were downloaded from the authors’ websites. Our codes were written in MATLAB. All experiments were run in MATLAB 7.6.0 on a laptop with an Intel Core i3 2.30 GHz CPU and 6 GB of RAM.

4.1 Random Examples

We created some random examples to test the speed of ADMM and compared it with DSPCA [9] and ALSPCA [23]. The sample covariance matrix Σ was created by adding a small amount of noise to a sparse rank-one matrix. Specifically, we first created a sparse vector \(\hat{x}\in \mathbb {R}^{p}\) with s nonzeros randomly chosen from the Gaussian distribution \(\mathcal{N}(0,1)\). We then formed the sample covariance matrix \(\varSigma = \hat{x}\hat{x}^{\top}+\sigma vv^{\top}\), where σ denotes the noise level and \(v\in \mathbb {R}^{p}\) is a random vector with entries uniformly drawn from [0,1]. We applied DSPCA and ADMM to find the largest sparse PC of Σ. We report the comparison results in Tables 1 and 2, which correspond to noise levels σ=0.01 and σ=0.1, respectively. The parameters of ALSPCA were set to their default values. When using DSPCA to solve (1.5), we set different ρ’s to obtain solutions with different sparsity levels; specifically, we tested ρ=0.01, 0.1 and 1 in Tables 1 and 2. When using ADMM to solve (1.4), we set different K’s to control the sparsity level. We set ϵ_1=10^{−3} and ϵ_2=10^{−4} in (3.3) for the stopping criterion (3.2) of ADMM, and the parameter μ was set to p/200. In both Tables 1 and 2, we tested four data sets with dimension p and sparsity s set to (p,s)=(100,10), (100,20), (200,10) and (200,20).
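For reproducibility, the synthetic covariance matrices used here can be generated along the following lines; the function name and seed argument are ours, and the random placement of the s nonzeros is one possible reading of “randomly chosen”.

```python
import numpy as np

def make_synthetic_cov(p, s, sigma, seed=0):
    """Sigma = x_hat x_hat^T + sigma * v v^T with s-sparse Gaussian x_hat and uniform v."""
    rng = np.random.default_rng(seed)
    x_hat = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)  # random support of size s
    x_hat[support] = rng.standard_normal(s)         # nonzeros drawn from N(0, 1)
    v = rng.uniform(0.0, 1.0, size=p)               # noise direction, entries in [0, 1]
    return np.outer(x_hat, x_hat) + sigma * np.outer(v, v)
```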

Table 1 Comparisons of ADMM, DSPCA and ALSPCA on random examples with σ=0.01
Table 2 Comparisons of ADMM, DSPCA, and ALSPCA on random examples with σ=0.1

We report the cardinality of the largest sparse PC (Card), the percentage of explained variance (PEV), and the CPU time in Tables 1 and 2. The objective function value 〈Σ,X〉 for both ADMM and DSPCA is also reported; we do not include the objective value for ALSPCA because it solves a different model. From Table 1 we see that, for σ=0.01, DSPCA is sensitive to the parameter ρ that controls the sparsity. ρ=0.01 always gave the best results for DSPCA, and the explained variance is very close to that of standard PCA. ρ=0.1 still provided relatively good solutions for DSPCA in terms of both sparsity and explained variance. When ρ was increased to 1, the solutions given by DSPCA sometimes had more nonzeros than the desired sparsity level (when (p,s)=(100,10)), and even when the solutions had the desired sparsity level, the explained variance degraded significantly (when (p,s)=(100,20) and (200,20)). For ADMM, different K’s were tested for the different settings. The results in Table 1 indicate that when K is slightly greater than s/2, ADMM usually produced good results. For example, for (p,s)=(100,20), both K=13 and K=12 produced good results in the sense that the cardinality of the largest sparse PC matched the desired one and the explained variance was very close to that of standard PCA. When K=11, ADMM produced a sparser solution, i.e., the cardinality of the largest PC was 19, but the explained variance degraded only slightly.

From Table 2 we see that, for σ=0.1, i.e., when the noise level was larger, DSPCA was more sensitive to the noise than it was when σ=0.01. In contrast, the performance of ADMM for σ=0.1 was consistent with its performance for σ=0.01, i.e., it was not very sensitive to the noise.

From both Tables 1 and 2, we see that ADMM was significantly faster than DSPCA, and comparable to the speed of ALSPCA.

4.2 Text Data Classification

Sparse PCA can also be used to classify keywords in text data. This application has been studied by Zhang, d’Aspremont and El Ghaoui in [40] and by Zhang and El Ghaoui in [41]. In this section, we show that by using our ADMM to solve the sparse PCA problem, we can also classify the keywords from text data very well. The data set we used is a small version of the “20-newsgroups” data,Footnote 1 which is also used in [40]. This data set consists of the binary occurrences of 100 specific words across 16242 postings, i.e., the data matrix M is of size 100×16242 with M_{ij}=1 if the ith word appears at least once in the jth posting and M_{ij}=0 otherwise. These words can be approximately divided into different groups such as “computer”, “religion”, etc. We want to find the words that contribute as much variance as possible and also discover which words belong to the same category. Viewing each posting as a sample of the 100 variables, we have 16242 samples, and thus the sample covariance matrix \(\varSigma\in \mathbb {R}^{100\times 100}\). Using standard PCA, it is hard to interpret which words contribute to each of the leading eigenvalues since the loadings are dense. Sparse PCA, however, can explain much of the variance explained by the standard PCs while clearly identifying which words contribute together to the corresponding variance. We applied our ADMM to solve (1.4) to find the first three sparse PCs, setting (μ,K)=(0.05,5), (0.01,3) and (0.01,4), respectively, in the three resulting problems, and ϵ_1=ϵ_2=10^{−4} in the stopping criterion (3.2). The resulting three sparse PCs have 8, 12 and 19 nonzeros, respectively. The total variance explained by these three sparse PCs is 11.64 %, while the variance explained by the largest three PCs of standard PCA is 19.10 %.

The words corresponding to the first three sparse PCs generated by our ADMM are listed in Table 3. From Table 3 we see that the words in the first sparse PC fall approximately in the category “school”, the words in the second sparse PC approximately in the category “religion”, and the words in the third sparse PC approximately in the category “computer”. So our ADMM classifies the keywords into appropriate categories very well.

Table 3 Words associated with the first three sparse PCs using ADMM

4.3 Senate Voting Data

In this section, we use sparse PCA to analyze the voting records of the 109th US Senate, which were also studied by Zhang, d’Aspremont and El Ghaoui in [40]. The votes are recorded as 1 for “yes” and −1 for “no”, with missing votes recorded as 0. There are 100 senators (55 Republicans, 44 Democrats and one independent) and 542 bills involved in the data set. However, there are many missing votes in the data set. To obtain a meaningful data matrix, we only keep the bills for which the number of missing votes is at most one; there are 66 such bills among the 542. So our data matrix M is a 66×100 matrix with entries 1, −1 and 0, and each column of M corresponds to one senator’s votes. The sample covariance matrix Σ=MM^⊤ in our test is a 66×66 matrix.

To see how standard PCA and sparse PCA perform in classifying the voting records, we implemented the following procedure, as suggested in [40]. We used standard PCA to find the largest two PCs (denoted v_1 and v_2) of Σ. We then projected each column of M onto the subspace spanned by v_1 and v_2, i.e., we found \(\bar{\alpha}_{i}\) and \(\bar{\beta}_{i}\) for each column M_i such that

$$(\bar{\alpha}_i,\bar{\beta}_i):= \arg\min _{(\alpha_i,\beta_i)} \|\alpha_i v_1+ \beta_i v_2 - M_i\|. $$

We then plotted each column M_i as the point \((\bar{\alpha}_{i},\bar{\beta}_{i})\) in the two-dimensional subspace spanned by v_1 and v_2. The left plot in Fig. 1 shows the 100 points. We see from this figure that the senators are separated very well by partisanship. However, it is hard to interpret which bills are responsible for the explained variance, because all the bills are involved in the PCs. By using sparse PCA, we can attribute the explained variance to just a few bills. We applied our ADMM to find the first two sparse PCs (denoted s_1 and s_2) of Σ, setting (μ,K)=(0.05,5) for both problems and ϵ_1=ϵ_2=10^{−4} in the stopping criterion (3.2).
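The two-dimensional coordinates \((\bar{\alpha}_{i},\bar{\beta}_{i})\) are simply the least-squares coefficients of each column of M in the basis {v_1,v_2} (respectively {s_1,s_2} for the sparse PCs); a minimal NumPy sketch with the hypothetical name plane_coordinates is given below.

```python
import numpy as np

def plane_coordinates(M, v1, v2):
    """Least-squares coordinates (alpha_i, beta_i) of each column M_i in span{v1, v2}."""
    V = np.column_stack([v1, v2])                   # p-by-2 basis matrix
    coords, *_ = np.linalg.lstsq(V, M, rcond=None)  # minimizes ||alpha*v1 + beta*v2 - M_i|| column-wise
    return coords.T                                 # row i = (alpha_i, beta_i)
```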

Fig. 1

Projection of the senate voting records onto the subspace spanned by the top 2 principal components: Left: standard PCA; Right: sparse PCA

The resulting two sparse PCs s_1 and s_2 produced by our ADMM have eight and six nonzeros, respectively. We projected each column of M onto the subspace spanned by these two sparse PCs; the right plot in Fig. 1 shows the resulting 100 projections. We see from this figure that the senators are still separated well by partisanship. Since only a few bills are involved in the two sparse PCs, we can now identify which bills are most responsible for the classification. The bills involved in the first two sparse PCs are listed in Table 4. From Table 4 we see that the most controversial issues between Republicans and Democrats are topics such as “Budget” and “Appropriations”. Other controversial issues involve topics like “Energy”, “Abortion” and “Health”.

Table 4 Bills involved in the first two PCs by ADMM

5 Conclusion

In this paper, we proposed an alternating direction method of multipliers for solving an SDP relaxation of the sparse PCA problem. Our method incorporates a variable-splitting technique to separate the ℓ_1 norm constraint, which controls the sparsity of the solution, from the positive-semidefiniteness constraint. This results in two relatively simple subproblems with closed-form solutions in each iteration. Global convergence was established for the proposed method. Numerical results on both synthetic data and real data from classification of text data and senate voting records demonstrated the efficacy of our method. Compared with Nesterov’s first-order method DSPCA for sparse PCA studied in [9], our ADMM solves the primal problems directly and guarantees sparse solutions. The numerical results also indicate that ADMM is much faster than DSPCA.