Reconstruction of a high-dimensional low-rank matrix

Abstract: We consider the problem of recovering a low-rank signal matrix in high-dimensional situations. The main issue is how to estimate the signal matrix in the presence of huge noise. We introduce the power spiked model to describe the structure of singular values of a huge data matrix. We first consider the conventional PCA to recover the signal matrix and show that the estimation of the signal matrix holds consistency properties under severe conditions. The conventional PCA is heavily subjected to the noise. In order to reduce the noise, we apply the noise-reduction (NR) methodology and propose a new estimation of the signal matrix. We show that the proposed estimation by the NR method holds the consistency properties under mild conditions and improves the error rate of the conventional PCA effectively. Finally, we demonstrate the reconstruction procedures by using a microarray data set.


Introduction
In this paper, we address the problem of recovering an unknown d × n low-rank matrix, A = [a_1, . . . , a_n], called the signal matrix. Let r = rank(A), where r is unknown. We assume r (< min{d, n}) is fixed. For high-dimensional data, the estimation of the low-rank matrix is quite important in many fields such as genomics, image denoising, recommendation systems and so on. Negahban and Wainwright [5] and Rohde and Tsybakov [6] considered the problem for high-dimensional regression models. Shabalin and Nobel [7] considered the estimation of A when observations have a Gaussian noise. In this paper, we consider the problem of recovering A when observations have a non-Gaussian noise.
Suppose we have a d × n data matrix, X = [x_1, . . . , x_n], where

X = n^{1/2} A + W.   (1.1)

Here, W = [w_1, . . . , w_n] is a d × n noise matrix, where the w_j, j = 1, . . . , n, are independent and identically distributed (i.i.d.) as a d-dimensional distribution with mean zero and covariance matrix Σ_W (≠ O). Note that x_j − n^{1/2} a_j, j = 1, . . . , n, are i.i.d. Let Σ_A = AA^T. Then, it holds that E(XX^T)/n = Σ_A + Σ_W (= Σ, say). Shabalin and Nobel [7] considered (1.1) in a high-dimensional setting where the data dimension d and the sample size n increase at the same rate, i.e., n/d → c > 0. They assumed that the elements of W are i.i.d. normal random variables. We note that conditions such as "n/d → c > 0" and the Gaussianity of the noise are often strict in real high-dimensional analyses. In this paper, we consider (1.1) in high-dimensional settings without assuming those conditions. We assume the divergence condition d → ∞ either when n is fixed or n → ∞. The divergence condition includes both high-dimension, low-sample-size (HDLSS) settings such as "n/d → 0" and high-dimension, large-sample-size settings such as "n/d → c > 0" or "n/d → ∞ as d → ∞".
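As a numerical illustration, the model (1.1) can be simulated as follows, taking (1.1) to be X = n^{1/2}A + W (consistent with x_j − n^{1/2}a_j being i.i.d.). The dimensions, the rank, the spiked values λ_{j(A)} and the uniform (non-Gaussian) noise are all illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 500, 20, 2   # illustrative sizes with d >> n (HDLSS setting)

# Rank-r signal A = sum_j lambda_{j(A)}^{1/2} u_{j(A)} v_{j(A)}^T.
U, _ = np.linalg.qr(rng.standard_normal((d, r)))   # orthonormal left singular vectors
V, _ = np.linalg.qr(rng.standard_normal((n, r)))   # orthonormal right singular vectors
lam_A = np.array([float(d), d ** 0.7])             # spiked eigenvalues of Sigma_A = A A^T
A = U @ np.diag(np.sqrt(lam_A)) @ V.T

# Non-Gaussian noise: columns i.i.d., mean zero, Sigma_W = I_d (unit-variance uniform).
W = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(d, n))

X = np.sqrt(n) * A + W   # the data matrix of model (1.1)
```

By construction, Σ_A = AA^T has the nonzero eigenvalues lam_A, so the signal part dominates the noise part in the spiked sense discussed below.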
The eigen-decomposition of Σ_W is given by Σ_W = U_W Λ_W U_W^T, where Λ_W is a diagonal matrix of eigenvalues, λ_{1(W)} ≥ · · · ≥ λ_{d(W)} (≥ 0), and U_W is an orthogonal matrix of the corresponding eigenvectors. Let W = U_W Λ_W^{1/2} Z. Then, Z is a d × n sphered data matrix from a distribution with the identity covariance matrix. Here, we write Z = [z_1, . . . , z_d]^T and z_j = (z_{j1}, . . . , z_{jn})^T, j = 1, . . . , d.
Note that E(z_{jk} z_{j'k}) = 0 (j ≠ j') and Var(z_j) = I_n, where I_n is the n-dimensional identity matrix. We assume that the fourth moments of each variable in Z are uniformly bounded. The singular value decomposition of A is given by A = Σ_{j=1}^r λ_{j(A)}^{1/2} u_{j(A)} v_{j(A)}^T, where λ_{1(A)} ≥ · · · ≥ λ_{r(A)} > 0 and u_{j(A)} (or v_{j(A)}) denotes a unit left- (or right-) singular vector of A. Also, note that the λ_{j(A)}s depend not only on d but also on n. When r ≥ 2, we assume that the λ_{j(A)}s are distinct in the sense that lim inf_{d→∞} λ_{j(A)}/λ_{j'(A)} > 1 when n is fixed or n → ∞ for all j < j' (≤ r).
In this paper, we consider the problem of recovering the signal matrix A in high-dimensional settings such as d → ∞ either when n is fixed or n → ∞. In Section 2, we introduce the power spiked model to describe the structure of the eigenvalues of Σ. In Section 3, we consider using the conventional PCA to recover A and show that the estimation of A holds consistency properties under severe conditions. In Section 4, we consider the noise reduction (NR) methodology by Yata and Aoshima [11] in (1.1) and apply it to recovering A. We show that the estimation of A by the NR method holds the consistency properties under mild conditions and improves the error rate of the conventional PCA. In Section 5, we discuss the choice of unknown rank r by using the consistency properties. In Section 6, we give several simulation results to recover signal matrices. Finally, in Section 7, we give an application of (1.1) and demonstrate reconstruction procedures by using a microarray data set.

PCA consistency for the power spiked model
In this section, we assume (1.1). The sample covariance matrix is given by S = n^{−1} XX^T. We consider the dual sample covariance matrix defined by S_D = n^{−1} X^T X. Let m = min{d, n}. Note that S_D and S share non-zero eigenvalues and rank(S) = rank(S_D) ≤ m. Let λ̂_1 ≥ · · · ≥ λ̂_m ≥ 0 be the eigenvalues of S_D. The eigen-decompositions of S and S_D are given by S = Σ_{j=1}^m λ̂_j û_j û_j^T and S_D = Σ_{j=1}^m λ̂_j v̂_j v̂_j^T, where û_j (or v̂_j) denotes a unit left- (or right-) singular vector of X/n^{1/2} corresponding to λ̂_j^{1/2}. Note that û_j can be calculated by û_j = (n λ̂_j)^{−1/2} X v̂_j from the fact that X/n^{1/2} = Σ_{j=1}^m λ̂_j^{1/2} û_j v̂_j^T.

Jung and Marron [3] and Yata and Aoshima [10] investigated consistency properties of the conventional PCA for HDLSS data. Yata and Aoshima [11] gave consistent estimators both of the eigenvalues and eigenvectors together with the principal component (PC) scores by a method called the noise-reduction methodology. Shen et al. [8] gave a consistent estimator of the first eigenvector under a sparsity assumption. Zhou and Marron [13] investigated consistency properties of some estimators for the first eigenvector in outlier contaminated data.

Now, we consider the power spiked model in Σ. The eigen-decomposition of Σ is written as Σ = UΛU^T, where Λ is a diagonal matrix of eigenvalues, λ_1 ≥ · · · ≥ λ_d (≥ 0), and U = [u_1, . . . , u_d] is an orthogonal matrix of the corresponding eigenvectors. Let Σ = Σ_{(1)} + Σ_{(2)}, where Σ_{(1)} = Σ_{j=1}^{r_0} λ_j u_j u_j^T and Σ_{(2)} = Σ_{j=r_0+1}^d λ_j u_j u_j^T with some unknown and positive fixed integer r_0. Here, Σ_{(1)} is regarded as an intrinsic part and Σ_{(2)} is regarded as a noise part. Then, if there exists a positive fixed integer k_{r_0} such that lim_{d→∞} tr(Σ_{(2)}^{k_{r_0}})/λ_{r_0}^{k_{r_0}} = 0, the eigenvalues are called the power spiked model. See Section 2 in Yata and Aoshima [12] for the details. When r_0 ≥ 2, we assume that lim inf_{d→∞} (λ_j/λ_{j'}) > 1 for all j < j' (≤ r_0). Yata and Aoshima [12] gave the following results.
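The dual relation û_j = (nλ̂_j)^{−1/2} X v̂_j, which lets one carry out PCA through the n × n matrix S_D instead of the d × d matrix S, can be checked numerically; the sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 300, 15
X = rng.standard_normal((d, n))              # any d x n data matrix

S_D = X.T @ X / n                            # dual sample covariance (n x n)
evals, evecs = np.linalg.eigh(S_D)
order = np.argsort(evals)[::-1]              # sort as lambda_hat_1 >= ... >= lambda_hat_n
lam_hat, V_hat = evals[order], evecs[:, order]

# Left singular vectors of X / n^{1/2} via u_hat_j = (n lambda_hat_j)^{-1/2} X v_hat_j.
U_hat = X @ V_hat / np.sqrt(n * lam_hat)

# Cross-check: the singular values of X / n^{1/2} are lambda_hat_j^{1/2}.
s = np.linalg.svd(X / np.sqrt(n), compute_uv=False)
```

Since d ≫ n, the n × n eigen-decomposition is far cheaper than the d × d one, which is the computational point of the dual formulation.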

Reconstruction of the signal matrix by conventional PCA
In this section, we consider recovering the signal matrix A by using the conventional PCA in high-dimensional settings such as d → ∞ either when n is fixed or n → ∞. We reconstruct A by using the λ̂_js, û_js and v̂_js. We assume û_j^T u_{j(A)} ≥ 0 and v̂_j^T v_{j(A)} ≥ 0 for all j (≤ r) without loss of generality. We assume the power spiked model for (1.1) as follows: there exists a positive fixed integer k_r such that

lim_{d→∞} tr(Σ_W^{k_r})/λ_{r(A)}^{k_r} = 0 either when n is fixed or n → ∞, (3.1)

so that the λ_{j(A)}s are much larger than any eigenvalues of Σ_W. We consider (3.1) for j (≤ r) either when n is fixed or n → ∞ in the two cases (I) and (II). We note that λ_{j(A)} in (I) is larger than that in (II). See Remark 2.1 for the detail. Also, Murayama et al. [4] considered the estimation of A for a special case of (I). We consider the conditions (C-i) to (C-iv) when d → ∞ while n is fixed or n → ∞. Note that (C-i) and (C-ii) hold under (C-iii) when W is Gaussian or z_{1k}, . . . , z_{dk} (k = 1, . . . , n) are independent.
Note that (C-iii) does not hold for (II) when n is fixed. If (3.2) holds, (C-i) is met even when n is fixed for j (≤ r) in (I). Let κ_j = tr(Σ_W)/(n λ_{j(A)}) for j = 1, . . . , r. We have the following results.
Theorem 3.1. For j (≤ r), under (C-i) in (I) or under (C-i) to (C-iii) in (II), it holds that λ̂_j/λ_{j(A)} = 1 + κ_j + o_p(1) as d → ∞ either when n is fixed or n → ∞.
Remark 3.2. If (3.2) holds, Theorem 3.1 is claimed even when n is fixed for j (≤ r) in (I).

Corollary 3.1. For j (≤ r), under (C-i) and (C-iv) in (I) or under (C-i) to (C-iv) in (II), it holds that û_j^T u_{j(A)} = 1 + o_p(1) and v̂_j^T v_{j(A)} = 1 + o_p(1) as d → ∞ either when n is fixed or n → ∞.
Note that the v̂_js hold the consistency property without (C-iv), contrary to the λ̂_js and û_js. Based on this theoretical background, we consider recovering the signal matrix A by Â_r = Σ_{i=1}^r λ̂_i^{1/2} û_i v̂_i^T. In Section 5.1, we discuss the choice of r in Â_r. We define a loss function by L(Â_r | A) = ||Â_r − A||_F^2, where || · ||_F denotes the Frobenius norm. Let ψ = tr(Σ_W)/n. Then, we have the following results.
Theorem 3.2. Under (C-i) in (I) with j = r or under (C-i) to (C-iii) in (II) with j = r, it holds that L(Â_r | A) = rψ{1 + o_p(1)} + o_p(λ_{r(A)}) as d → ∞ either when n is fixed or n → ∞.

Corollary 3.2. Under (C-i) and (C-iv) in (I) with j = r or under (C-i) to (C-iv) in (II) with j = r, it holds that
as d → ∞ either when n is fixed or n → ∞.
From Theorem 3.2, if (C-iv) does not hold, the loss of Â_r becomes r tr(Σ_W)/n asymptotically. In order to reduce the noise, we apply the NR method to recovering the signal matrix in Section 4.
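The conventional-PCA reconstruction Â_r = Σ_{i=1}^r λ̂_i^{1/2} û_i v̂_i^T and its Frobenius loss can be sketched on a synthetic instance of (1.1); all sizes and eigenvalues below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, r = 400, 25, 2

# Toy instance of (1.1): spiked rank-2 signal plus unit-variance noise (Sigma_W = I_d).
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
lam_A = np.array([2.0 * d, float(d)])
A = U @ np.diag(np.sqrt(lam_A)) @ V.T
X = np.sqrt(n) * A + rng.standard_normal((d, n))

S_D = X.T @ X / n
evals, evecs = np.linalg.eigh(S_D)
order = np.argsort(evals)[::-1]
lam_hat, V_hat = evals[order], evecs[:, order]
U_hat = X @ V_hat / np.sqrt(n * lam_hat)          # u_hat_j = (n lam_hat_j)^{-1/2} X v_hat_j

# Conventional-PCA reconstruction A_hat_r = sum_{i<=r} lam_hat_i^{1/2} u_hat_i v_hat_i^T.
A_hat = (U_hat[:, :r] * np.sqrt(lam_hat[:r])) @ V_hat[:, :r].T
loss = np.linalg.norm(A_hat - A, "fro") ** 2      # L(A_hat_r | A)
```

With Σ_W = I_d the quantity ψ = tr(Σ_W)/n equals d/n, so the loss floor r tr(Σ_W)/n discussed above is r d/n here.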

Reconstruction of the signal matrix by NR method
We consider applying the noise-reduction (NR) methodology by Yata and Aoshima [11] to recover the signal matrix A. By using the NR method, we obtain an estimator of λ_{j(A)} as

λ̃_j = λ̂_j − {tr(S_D) − Σ_{i=1}^j λ̂_i}/(n − 1 − j)   (j = 1, . . . , n − 2).   (4.1)

Note that the second term in (4.1) is an estimator of ψ. See Lemma 5.1 in Section 5.1 for the details. Then, we have the following result; from it, λ̃_j holds the consistency property without (C-iv). Remember that λ̂_j requires (C-iv) to hold the consistency property.

Theorem 4.1. For j (≤ r), under (C-i) in (I) or under (C-i) to (C-iii) in (II), it holds that λ̃_j/λ_{j(A)} = 1 + o_p(1) as d → ∞ either when n is fixed or n → ∞.
Remark 4.1. For estimating eigenvalues, the NR method can improve the conventional PCA even when d is not sufficiently large (e.g., d is about 10). See Figure 1 in Ishii et al. [2] for example. Now, we consider an adjustment λ̃_{j(r)} of the λ̃_js. Then, we have the following result.

Corollary 4.1. For j (≤ r), under (C-i) in (I) or under (C-i) to (C-iii) in (II), it holds that λ̃_{j(r)}/λ_{j(A)} = 1 + o_p(1)
as d → ∞ either when n is fixed or n → ∞.

We consider recovering A by Ã_r = Σ_{i=1}^r λ̃_{i(r)}^{1/2} û_i v̂_i^T. In Section 5.2, we discuss the choice of r in Ã_r. For the loss function L(Ã_r | A) = ||Ã_r − A||_F^2, we have the following results.

Corollary 4.2. Under (C-i) and (C-iv) in (I) with j = r or under (C-i) to (C-iv) in (II) with j = r, it holds that
as d → ∞ either when n is fixed or n → ∞.
From Theorems 3.2 and 4.2, one can compare the error rates of Â_r and Ã_r; the NR-based Ã_r improves the loss of the conventional Â_r asymptotically.
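A numerical sketch of the comparison between the two reconstructions follows. It assumes (4.1) has the form λ̃_j = λ̂_j − {tr(S_D) − Σ_{i=1}^j λ̂_i}/(n − 1 − j) as in Yata and Aoshima [11], and it uses λ̃_i in place of the adjusted λ̃_{i(r)}; all problem sizes are illustrative.

```python
import numpy as np

def reconstructions(X, r):
    """Rank-r reconstructions by conventional PCA (A_hat_r) and the NR method."""
    d, n = X.shape
    S_D = X.T @ X / n                               # dual sample covariance
    evals, evecs = np.linalg.eigh(S_D)
    order = np.argsort(evals)[::-1]
    lam_hat, V_hat = evals[order], evecs[:, order]
    U_hat = X @ V_hat[:, :r] / np.sqrt(n * lam_hat[:r])
    # NR estimates (assumed form of (4.1)); lam_tilde_i stands in for lam_tilde_{i(r)}.
    j = np.arange(1, r + 1)
    psi_hat = (lam_hat.sum() - np.cumsum(lam_hat[:r])) / (n - 1 - j)
    lam_tilde = np.maximum(lam_hat[:r] - psi_hat, 0.0)
    A_pca = (U_hat * np.sqrt(lam_hat[:r])) @ V_hat[:, :r].T
    A_nr = (U_hat * np.sqrt(lam_tilde)) @ V_hat[:, :r].T
    return A_pca, A_nr

rng = np.random.default_rng(4)
d, n, r = 500, 10, 1
u = np.zeros(d); u[0] = 1.0
v = np.ones(n) / np.sqrt(n)
A = np.sqrt(float(d)) * np.outer(u, v)              # lambda_{1(A)} = d and tr(Sigma_W) = d
loss_pca, loss_nr = [], []
for _ in range(10):                                  # average over replications
    X = np.sqrt(n) * A + rng.standard_normal((d, n))
    A_pca, A_nr = reconstructions(X, r)
    loss_pca.append(np.linalg.norm(A_pca - A, "fro") ** 2)
    loss_nr.append(np.linalg.norm(A_nr - A, "fro") ** 2)
```

Here κ_1 = tr(Σ_W)/(nλ_{1(A)}) = 1/n, and on average the NR losses sit below the conventional-PCA losses, in line with the comparison of the two methods above.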

Choice of the rank r
In this section, we discuss the choice of r in A r and A r .

Choice of r in Â_r
Let r_* (> 0) be a candidate (fixed) integer for r, where r_* < min{d, n}. We write that Â_{r_*} = Σ_{i=1}^{r_*} λ̂_i^{1/2} û_i v̂_i^T. Proposition 5.1 evaluates L(Â_{r_*} | A) in the two cases r_* < r and r_* ≥ r, as d → ∞ either when n is fixed or n → ∞.
From Proposition 5.1, it is not always true that r_* = r gives the smallest L(Â_{r_*} | A). In fact, for a power spiked model such as (λ_{1(A)}, λ_{2(A)}) = (d, d^{2/3}), r = 2 and tr(Σ_W) = d, L(Â_1 | A) is smaller than L(Â_2 | A) as d → ∞ when n is fixed. From Proposition 5.1, one may choose r_* as the first integer i (= r_1, say) satisfying ψ > λ_{i+1(A)} (i.e., κ_{i+1} > 1). Then, r_* = r_1 gives the smallest L(Â_{r_*} | A) asymptotically among the candidate integers. Note that r_1 ≤ r. Now, we consider estimating ψ by ψ̂_j = {tr(S_D) − Σ_{i=1}^j λ̂_i}/(n − 1 − j), the second term in (4.1). Then, we have the following result.

Lemma 5.1. Under (C-i) in (I) or under (C-i) to (C-iii) in (II), it holds that
Let r̂_1 be the first integer i satisfying ψ̂_{i+1} > λ̂_{i+1}. Then, from Lemma 5.1 we have the following result.
If (C-iv) with j = r holds, ψ̂_i/λ̂_i becomes small for a large integer i; that is, r̂_1 becomes a large integer as r̂_1 = O(n). Hence, if one has an upper bound for r_* as r_* ≤ r_u with an integer r_u (< ∞), one may use r̂_{1u} = min{r̂_1, r_u} instead of r̂_1.

Choice of r in Ã_r
We write that Ã_{r_*} = Σ_{i=1}^{r_*} λ̃_{i(r_*)}^{1/2} û_i v̂_i^T. We have the following result.
On the other hand, similar to Section 5.1, from Proposition 5.3, one may choose r_* as the first integer i (= r_2, say) satisfying 2γ_{i+1} > 1 (i.e., κ_{i+1} > 3). Note that r_1 ≤ r_2 ≤ r. Let r̂_2 be the first integer i satisfying ψ̂_{i+1} > 3λ̂_{i+1}. Note that r̂_1 ≤ r̂_2. Then, from Lemma 5.1, we have the following result. Hence, one may use Ã_{r_*} with r_* = r̂_2. Here, one should note that r̂_2 tends to be large if (C-iv) with j = r holds. Also, note that r_2 → r if (C-iv) with j = r holds. From Proposition 5.3, for the loss function, Ã_{r_*} with r_* > r is asymptotically equivalent to Ã_r. Hence, when r = r_2, one may choose a relatively large r_* in Ã_{r_*} as r_* > r. On the other hand, from Proposition 5.1, the loss of Â_{r_*} with r_* > r is larger than that of Â_r asymptotically, so that one should not choose a relatively large r_* in Â_{r_*}. Let r̂_{2u} = min{r̂_2, r_u}, where r_u is given in Section 5.1. When r = r_2 and r ≤ r_u, r_* = r̂_{2u} gives the smallest L(Ã_{r_*} | A) among the candidate integers. Hence, for a relatively large r_u, we recommend using r_* = r̂_{2u} instead of r̂_2 in Ã_{r_*}.
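The two selectors above can be sketched as follows. The sketch assumes ψ̂_{i+1} is compared against λ̂_{i+1} with threshold factor 1 for r̂_1 and factor 3 for r̂_2, capped at r_u as in r̂_{1u} = min{r̂_1, r_u} and r̂_{2u} = min{r̂_2, r_u}; the data sizes are illustrative.

```python
import numpy as np

def choose_rank(X, factor=1.0, r_u=5):
    """First integer i with psi_hat_{i+1} > factor * lam_hat_{i+1}, capped at r_u.

    factor = 1 corresponds to r_hat_1 and factor = 3 to r_hat_2
    (assuming the comparison uses the PCA eigenvalues lam_hat).
    """
    d, n = X.shape
    S_D = X.T @ X / n
    lam_hat = np.sort(np.linalg.eigvalsh(S_D))[::-1]
    trace = lam_hat.sum()
    for i in range(1, n - 2):
        j = i + 1
        psi_hat = (trace - lam_hat[:j].sum()) / (n - 1 - j)   # estimator of psi
        if psi_hat > factor * lam_hat[j - 1]:                 # lam_hat_{i+1}
            return min(i, r_u)
    return r_u                                                # fall back to the cap

rng = np.random.default_rng(5)
d, n, r = 800, 30, 2
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
A = U @ np.diag(np.sqrt([4.0 * d, 2.0 * d])) @ V.T
X = np.sqrt(n) * A + rng.standard_normal((d, n))

r1 = choose_rank(X, factor=1.0)   # r_hat_{1u}
r2 = choose_rank(X, factor=3.0)   # r_hat_{2u}
```

Since the factor-3 condition is harder to satisfy, r̂_1 ≤ r̂_2 by construction, matching the ordering r_1 ≤ r_2 noted above.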

Simulations
We used computer simulations to compare the performance of Â_{r_*} with Ã_{r_*}. We set r_u = 5, with r_* = r̂_{1u} for Â_{r_*} and r_* = r̂_{2u} for Ã_{r_*}. See Section 5 for the details. We set r = 3. Figure 1 shows the simulation results of (i) to (iv) in (a) and (b) together with the corresponding theoretical values, denoted by the solid lines. See Theorems 3.2, 4.2 and Propositions 5.1 and 5.3 for the details. The theoretical value of (iv) was not described for (b) because it is the same as that of (iii). We also calculated the variances of the simulation results over the 2000 replications. The variances of (i) to (iv) in (a) and (b) were quite small, especially when d is large; for example, when d = 2^t for t ≥ 11, all the variances in (a) were smaller than 0.006. Figure 2 shows the averages of M(λ̂_j), M(λ̃_j) and M(λ̃_{j(r)}) in (a) and (b). The dashed lines denote the simulation results. For j = 3 both in (a) and (b), the average of M(λ̃_{j(r)}) was not described because λ̃_3 is the same as λ̃_{3(r)}. Note that the average of M(λ̂_j) is an estimated value of the mean squared error (MSE), E(|λ̂_j/λ_{j(A)} − 1|^2), and similarly for M(λ̃_j) and M(λ̃_{j(r)}). From Theorem 3.1, we gave the corresponding theoretical value, κ_j^2, for the MSE of λ̂_j, denoted by the solid lines.
The simulation results appeared close to the theoretical values, and the approximations seemed good when d is large. As expected theoretically, we observed that Ã_r and Ã_{r_*} with r_* = r̂_{2u} give more preferable performances compared to Â_r and Â_{r_*} with r_* = r̂_{1u}, even when n is fixed. The main reason must be the bias κ_j of λ̂_j. See Sections 4 and 5.2 for the details. In fact, from Figure 2, the MSEs of the λ̂_js were quite large, especially when n is small, because the κ_js are large in the HDLSS settings. In contrast, the estimators by the NR method gave excellent performances even when n is small. For the estimation of A by the conventional PCA, Â_{r_*} with r_* = r̂_{1u} gave a better performance compared to Â_r because r_1 < r. See Section 5.1 for the details.

Example
In this section, we consider an application of (1.1) to a mixture model. We demonstrate the reconstruction procedures for the mixture model by using a microarray data set.

Application
We suppose that there are l classes, Π_i, i = 1, . . . , l, each having unknown mean vector μ_i. We assume that an observation is sampled from one of the Π_is and the label of the class is missing. Let n_i = #{j | x_j ∈ Π_i for j = 1, . . . , n} for i = 1, . . . , l, where #S denotes the number of elements in a set S. We define μ_(j) = μ_i according to x_j ∈ Π_i for j = 1, . . . , n. We consider the following mixture model.

Note that A = [μ_(1), . . . , μ_(n)]/n^{1/2}. If μ_1, . . . , μ_l are linearly independent and n_i > 0 for all i, the rank of A becomes just l (i.e., r = l).
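The mixture construction can be sketched as follows, taking (7.1) to be x_j = μ_(j) + w_j so that n^{1/2} a_j = μ_(j). The class means below are random draws (hypothetical), chosen so that μ_1, . . . , μ_l are linearly independent with probability one.

```python
import numpy as np

rng = np.random.default_rng(7)
d, l = 200, 3
n_i = [5, 5, 5]                                  # class sizes (illustrative)
mus = [3.0 * rng.standard_normal(d) for _ in range(l)]
labels = np.repeat(np.arange(l), n_i)
n = labels.size

M = np.column_stack([mus[c] for c in labels])    # [mu_(1), ..., mu_(n)]
A = M / np.sqrt(n)                               # signal matrix: sqrt(n) a_j = mu_(j)
X = M + rng.standard_normal((d, n))              # x_j = mu_(j) + w_j, a case of (1.1)
```

With linearly independent means and n_i > 0 for all i, rank(A) = l, so the number of classes plays the role of the rank r.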

Demonstration
We analyzed gene expression data by Bhattacharjee et al. [1] in which the data set consisted of five lung carcinoma types having 3312 genes (d = 3312). The data set is given in Yang et al. [9]. See [1] and [9] for details of the data set. We used four classes: Π_1: adenocarcinomas (139 samples), Π_2: normal lung (17 samples), Π_3: squamous cell lung carcinomas (21 samples) and Π_4: pulmonary carcinoids (20 samples). We consider the cases when r_* = 1, . . . , 7 and l = 1, . . . , 4. Here, from Section 7.1, l can be regarded as the rank, r. Note that μ_(j) = μ_1 for all j in (7.1) when l = 1. We considered two cases: (a) n_1 = · · · = n_l = 5 and (b) n_1 = · · · = n_l = 10. Note that n = 5l in (a) and n = 10l in (b). We set r_u = 5. For each r_* and l, we constructed Â_{r_*} and Ã_{r_*} by using the first 5 samples in (a) or 10 samples in (b) from each class. We investigated their accuracies by using the remaining samples of each class as a test data set. We defined μ̂_(j) = μ̂_i according to x_j ∈ Π_i for j = 1, . . . , n, where μ̂_i is the sample mean vector of the test data set for each i. For each r_* and l, we constructed Ȃ = [μ̂_(1), . . . , μ̂_(n)]/n^{1/2} as an estimator of A when the labels of the data set are known. Hence, L(Â_{r_*} | Ȃ) = ||Â_{r_*} − Ȃ||_F^2 and L(Ã_{r_*} | Ȃ) = ||Ã_{r_*} − Ȃ||_F^2 can be regarded as estimators of L(Â_{r_*} | A) and L(Ã_{r_*} | A). We gave the values of (L(Â_{r_*} | Ȃ), L(Ã_{r_*} | Ȃ)) for r_* = 1, . . . , 7 and l = 1, . . . , 4 in Table 1, together with the values of (r̂_{1u}, r̂_{2u}) for each l.

Table 1: Values of (L(Â_{r_*} | Ȃ), L(Ã_{r_*} | Ȃ)) as estimates of (L(Â_{r_*} | A), L(Ã_{r_*} | A)) for the microarray data set given by [1]. For each l, the values of (r̂_{1u}, r̂_{2u}) are also given in the bottom line. When r_* > n for Â_{r_*} or r_* > n − 1 for Ã_{r_*}, those values were not available because λ̂_j or λ̃_j is not available when j > n or j > n − 1.
As expected theoretically, we observed that Ã_{r_*} gave more preferable performances than Â_{r_*} for most cases of (r_*, l) in (a) and (b). Also, for each r_* and l with r_* ≥ l (= r), most values in (b) are smaller than those in (a). This is probably because the sample size in (b) is larger than that in (a). See Propositions 5.1 and 5.3 for the details. On the other hand, we observed that r̂_{1u} and r̂_{2u} were larger than r. However, Ã_{r_*} gave adequate performances even when r_* > r. See Section 5.2 for the theoretical reason. Hence, we recommend using Ã_{r_*} with r_* = r̂_{1u} or r̂_{2u}.

Appendix A
In this section, we give several lemmas with their proofs, together with the proof of Lemma 5.1.
as d → ∞ either when n is fixed or n → ∞.
Proof. When n → ∞, from Lemma 5 given in Yata and Aoshima [12], (A.1) holds for j (≤ r) under (C-i) in (I) or under (C-i) to (C-iii) in (II). On the other hand, when n is fixed, (C-iii) does not hold in (II). Hence, we consider only the case of (I) when n is fixed. By using Markov's inequality, for any τ > 0 and j (≤ r), under (C-i) in (I), we have that (A.1) holds as d → ∞ when n is fixed. Thus it concludes the result.

Proof of Lemma A.2. We write that u_{i(A)}^T W e_{1n} = Σ_{k=1}^n e_{1k} w_k^T u_{i(A)}. Note that λ_{1(W)} = o(λ_{r(A)}) under (3.1). Also, note that u_{i(A)}^T Σ_W u_{i(A)} ≤ λ_{1(W)} for i = 1, . . . , r. By using Markov's inequality, for any τ > 0 and i = 1, . . . , r, under (3.1), we have that Σ_{k=1}^n (w_k^T u_{i(A)})^2/n = o_p(λ_{r(A)}) as d → ∞ either when n is fixed or n → ∞. Then, we can conclude the result.

Lemma A.3. For j (≤ r), under (C-i) in (I) or under (C-i) to (C-iii) in (II), it holds that
for i = 1, . . . , r as d → ∞ either when n is fixed or n → ∞.
Proof. By combining (A.2) with Lemma A.1 and (A.3), it holds, under (C-i) in (I) or under (C-i) to (C-iii) in (II), that v̂_1^T v_{1(A)} = 1 + o_p(1) from the assumption that the λ_{j(A)}s are distinct.
When j = 2, we note thatv T It holds that for i ≥ 2

as d → ∞ either when n is fixed or n → ∞. It concludes the result.

Lemma A.5. For j (≤ r), under (C-i) in (I) or under (C-i) to (C-iii) in (II), it holds that
as d → ∞ either when n is fixed or n → ∞.
Proof. From Lemma A.1, under (C-i) in (I) or under (C-i) to (C-iii) in (II), (A.8) holds for j (≤ r) as d → ∞ either when n is fixed or n → ∞. Then, from Lemmas A.2 and A.3, (A.9) holds as well. By combining (A.8) and (A.9), we can claim the assertion under (C-i) in (I) or under (C-i) to (C-iii) in (II). Here, there exist a random variable ε_i ∈ [−1, 1] and a random unit vector y_i such that (A.12) holds under (C-i) in (I) or under (C-i) to (C-iii) in (II). Thus, it follows that (A.13) holds for i = 1, . . . , j. On the other hand, from (A.12), the remaining bound holds for i = 1, . . . , j under (C-i) in (I) or under (C-i) to (C-iii) in (II). It concludes the results.

Lemma A.6. For j (≤ r), under (C-i) in (I) or under (C-i) to (C-iii) in (II), it holds that
Proof. Note that û_i = (n λ̂_i)^{−1/2} X v̂_i. From (A.13) and Lemmas A.2 and A.5, under (C-i) in (I) or under (C-i) to (C-iii) in (II), we have the assertions for j (≤ r) as d → ∞ either when n is fixed or n → ∞. From Lemma A.5, we can conclude the results.
Proof of Lemma 5.1. We write that tr(W^T W/n) = n^{−1} Σ_{j=1}^d λ_{j(W)} z_j^T z_j.

Appendix B
In this section, we give proofs of the theorems, corollaries and propositions in Sections 3 to 5.