Bootstraps Regularize Singular Correlation Matrices

I show analytically that the average of $k$ bootstrapped correlation matrices rapidly becomes positive-definite as $k$ increases, which provides a simple approach to regularize singular Pearson correlation matrices. If $n$ is the number of objects and $t$ the number of features, the averaged correlation matrix is almost surely positive-definite if $k>\frac{e}{e-1}\frac{n}{t}\simeq 1.58\frac{n}{t}$ in the limit of large $t$ and $n$. The probability of obtaining a positive-definite correlation matrix with $k$ bootstraps is also derived for finite $n$ and $t$. Finally, I demonstrate that the number of required bootstraps is always smaller than $n$. This method is particularly relevant in fields where $n$ is orders of magnitude larger than the size of data points $t$, e.g., in finance, genetics, social science, or image processing.


Introduction
Correlation and covariance matrices are fundamental dependence estimators in statistical inference. Their uses include risk minimization in finance [1], the analysis of functional genomics [2], and image processing [3]. However, when the number of objects under study ($n$) exceeds the number of available data points ($t$), these matrices cannot be inverted. As a result, many standard inference methods cannot be applied directly. A large literature on eigenvalue regularization has addressed this issue over the last decades. The most relevant methods are the Ledoit-Wolf linear shrinkage [4] and the more recent non-linear shrinkage [5]. Apart from regularizing singular correlation matrices, these methods attempt to reduce the noise due to finite sample size. In addition, Ref. [6] proposes a recursive algorithm that aims to find the positive-definite matrix most similar to an initial matrix that is not positive-definite. Like the method proposed here, this approach does not try to denoise the target matrix but corrects the eigenvalue distribution by removing the non-positive eigenvalues.
In this work, I propose a simple alternative approach based on bootstrap resampling to regularize correlation matrices with $z > 0$ degenerate zero eigenvalues. In particular, I prove that the probability of obtaining a positive-definite matrix from the average of $k$ bootstrap resamplings converges rapidly to one as $k$ increases, provided that $k$ is larger than $\frac{e}{e-1}\frac{n}{t}$.

The Bootstrap Average Correlation Matrix
Let $X \in \mathbb{R}^{n\times t}$ be the data matrix and $C \in \mathbb{R}^{n\times n}$ its Pearson correlation matrix. We assume that no column or row of $X$ is a linear combination of the others; this implies that $C$ has rank $r = \min\{n, t-1\}$. Let $X^{(b)} \in \mathbb{R}^{n\times t}$ be a bootstrap copy of $X$ obtained by sampling the columns of $X$ with replacement, and $C^{(b)}$ its correlation matrix. A generic element of $X^{(b)}$ is $x^{(b)}_{ij} = x_{i\,h^{(b)}_j}$, where $h^{(b)}$ is a vector of dimension $t$ obtained by random sampling with replacement of the elements of the vector $(1, 2, \ldots, t)$.
This paper derives an approximate expression for the probability that the smallest eigenvalue $\lambda_0$ of the average correlation matrix $\bar{C}$ of the bootstrap copies is larger than zero, as a function of the number of bootstrap copies $k$. The minimum number of bootstrap copies $k_+$ that guarantees $\bar{C}$ to be positive-definite within a chosen confidence level shows a real transition in the large-system limit, defined here as $n, t \to \infty$ at fixed $q = n/t$.
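The construction described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used for the paper: the function name `bootstrap_average_corr`, the seeds, and the toy sizes ($n = 20$, $t = 10$, $k = 20$) are hypothetical choices made for the example.

```python
import numpy as np

def bootstrap_average_corr(X, k, rng):
    """Average the Pearson correlation matrices of k bootstrap copies of X.

    X has shape (n, t): rows are objects, columns are data points.
    Each bootstrap copy resamples the t columns with replacement.
    """
    n, t = X.shape
    C_bar = np.zeros((n, n))
    for _ in range(k):
        h = rng.integers(0, t, size=t)   # resampled column indices h^(b)
        C_bar += np.corrcoef(X[:, h])    # np.corrcoef treats rows as variables
    return C_bar / k

rng = np.random.default_rng(0)           # hypothetical seed
n, t, k = 20, 10, 20                     # toy sizes with n > t
X = rng.standard_normal((n, t))

C = np.corrcoef(X)                       # singular: rank is at most t - 1
C_bar = bootstrap_average_corr(X, k, rng)

# eigvalsh returns eigenvalues in ascending order, so index 0 is the smallest
lam_min_C = np.linalg.eigvalsh(C)[0]
lam_min_C_bar = np.linalg.eigvalsh(C_bar)[0]
print(f"smallest eigenvalue of C:     {lam_min_C:.2e}")
print(f"smallest eigenvalue of C_bar: {lam_min_C_bar:.2e}")
```

The smallest eigenvalue of the plain correlation matrix is zero up to floating-point error, while the bootstrap average is expected to have a strictly positive one when $k$ is large enough.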

The Distribution of the Number of Null Eigenvalues
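The first step is to obtain the probability distribution of the number of zero eigenvalues $z_b$ of a given bootstrap correlation matrix $C^{(b)}$. One has
$$z_b = \max\{0,\; n - u_b + 1\}, \tag{1}$$
where $u_b$ is the number of unique column indices sampled from $X$ in the $b$-th bootstrap copy. The exact probability distribution of $u_b$ is known to be [7]
$$P(u_b) = \frac{S_2(t, u_b)\, t!}{t^t\,(t - u_b)!}, \tag{2}$$
where $S_2(t, u_b)$ is the Stirling number of the second kind. This distribution has mean and variance
$$\mu(t) = t\left[1 - \left(1 - \frac{1}{t}\right)^{t}\right], \qquad \sigma^2(t) = t\left(1 - \frac{1}{t}\right)^{t} + t(t-1)\left(1 - \frac{2}{t}\right)^{t} - t^2\left(1 - \frac{1}{t}\right)^{2t}. \tag{3}$$
In the limit of large $t$, Eqs. (3) become
$$\mu(t) \simeq \frac{e-1}{e}\, t, \qquad \sigma^2(t) \simeq \frac{e-2}{e^2}\, t. \tag{4}$$
Furthermore, it is worth noticing that the deviation of the empirical $P(u_b)$ from a normal distribution $\mathcal{N}(\mu(t), \sigma(t))$ is negligible even for moderately large $t$ [7], as reported in the right-hand plot of Fig. 1.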
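The distribution of the number of unique indices can be checked numerically. The sketch below assumes the standard occupancy-problem results, namely $P(u) = S_2(t,u)\,t!/(t^t (t-u)!)$, mean $t[1-(1-1/t)^t]$, and large-$t$ limits $t(e-1)/e$ and $t(e-2)/e^2$. All function names, the seed, and the sizes are hypothetical.

```python
import math
from fractions import Fraction
import numpy as np

def stirling2_row(t):
    """Stirling numbers of the second kind S_2(t, u) for u = 0..t (exact integers)."""
    row = [1]                                     # S_2(0, 0) = 1
    for m in range(1, t + 1):
        new = [0] * (m + 1)
        for u in range(1, m + 1):
            prev_same = row[u] if u < m else 0    # S_2(m-1, u)
            new[u] = u * prev_same + row[u - 1]   # S(m,u) = u S(m-1,u) + S(m-1,u-1)
        row = new
    return row

def unique_count_pmf(t):
    """Exact P(u) for u = 0..t, returned as Fractions."""
    S = stirling2_row(t)
    return [Fraction(S[u] * math.factorial(t), t**t * math.factorial(t - u))
            for u in range(t + 1)]

def mu_exact(t):
    return t * (1 - (1 - 1 / t) ** t)

def var_exact(t):
    p1 = (1 - 1 / t) ** t
    p2 = (1 - 2 / t) ** t
    return t * p1 + t * (t - 1) * p2 - t**2 * p1**2

def mu_large_t(t):
    return t * (math.e - 1) / math.e

def var_large_t(t):
    return t * (math.e - 2) / math.e**2

# Monte Carlo check: count unique indices in bootstrap index vectors
rng = np.random.default_rng(1)                    # hypothetical seed
t = 200
draws = rng.integers(0, t, size=(2000, t))
u = np.array([np.unique(row).size for row in draws])
print("mean:", u.mean(), mu_exact(t), mu_large_t(t))
print("var: ", u.var(), var_exact(t), var_large_t(t))
```

Exact rational arithmetic is used for the probability mass function so that normalization and moments can be verified without rounding error for small $t$.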
If we consider a condition characterized by an abundance of expected zero eigenvalues, i.e., $n \gg t$, then the probability distribution of $z_b$ according to Eq. (1) can be approximated by a normal distribution
$$P(z_b) \simeq \mathcal{N}\big(n - \mu(t) + 1,\; \sigma(t)\big). \tag{5}$$
Now that the distribution of the number of zero eigenvalues for a single bootstrap copy is known, we can answer the original question and consider $k$ bootstrap copies of $X$ such that
$$\bar{C} = \frac{1}{k}\sum_{b=1}^{k} C^{(b)}.$$
To make further progress, it is necessary to recall the geometrical properties of the space associated with degenerate eigenvalues. Let us suppose that $C^{(i)}$ has $z_i$ zero eigenvalues. Then the set of eigenvectors associated with these zero eigenvalues defines a hyper-plane $V_i$ of dimension $z_i$ embedded in an $n$-dimensional space. Each vector $w$ that lies in $V_i$ verifies $w^\top C^{(i)} w = 0$; however, if there is at least one other $j \neq i$ whose $z_j$ zero eigenvalues of $C^{(j)}$ define a hyper-plane $V_j$ such that $\dim(V_i \cap V_j) < 1$, then $w^\top C^{(j)} w > 0$, and thus $w^\top \bar{C} w > 0$ for every non-zero vector $w$ that lies in $V_i$ or $V_j$.
It is important to point out that the eigenvectors associated with the $z_i$ zero eigenvalues can be assumed to be "randomly" chosen, with the constraint of being orthogonal to $V_i^{\perp}$, the space defined by the eigenvectors associated with the $n - z_i$ non-zero eigenvalues; this is because they do not carry any information about the correlation matrix $C^{(i)}$, since they explain zero variance. Therefore, every rotation of the basis of $V_i$ constrained to be orthogonal to $V_i^{\perp}$ will produce exactly the same matrix $C^{(i)}$.

In the $k = 2$ case, the probability that $\dim(V_1 \cap V_2) < 1$ will be approximately 1 if $z_1 + z_2 \le n$ and 0 otherwise. It is possible to visualize this relationship easily in a three-dimensional space, i.e., $n = 3$. In the case of two random straight lines, which have dimensions $z_1 = 1$ and $z_2 = 1$, the probability that they intersect in a straight line is almost zero, since they would have to coincide; in contrast, two random planes, $z_1 = 2$ and $z_2 = 2$, intersect in a straight line almost surely, except in configurations in which they are parallel. The above-discussed approximation, in the case of the spectral decomposition, is valid if the probability that the orthogonal spaces $V_1^{\perp}$ and $V_2^{\perp}$, defined from the $n - z_1$ and $n - z_2$ non-zero eigenvalues, perfectly overlap is negligible. In a bootstrap resampling, when $t$ is sufficiently large, this probability is approximately zero, as it requires sampling the same column indices of $X$ for both bootstrap realizations, in other words, $C^{(1)} = C^{(2)}$.

More generally, for $k$ bootstrap copies, every hyper-plane $V_i$ will verify $\dim(V_i \cap V_j) < 1$ for at least one $j \neq i$ with probability 1 if
$$\zeta = \sum_{i=1}^{k} z_i \le (k-1)\, n. \tag{6}$$
If the above inequality holds, then the average matrix $\bar{C}$ has no zero eigenvalue. From Eq. (6), one can derive an upper bound for the number of bootstrap copies required. In fact, even if all bootstrap correlation matrices have $n - 1$ null eigenvalues, so that $\zeta = k(n-1)$, the inequality holds as soon as $k \ge n$: no more than $k = n$ bootstrap copies are necessary to obtain a positive-definite matrix $\bar{C}$.
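The three-dimensional picture of lines and planes can be reproduced numerically. The sketch below is an idealization: it draws generic Gaussian subspaces rather than the null spaces of actual bootstrap correlation matrices. It uses the identity $\dim(V_1 \cap V_2) = z_1 + z_2 - \dim(V_1 + V_2)$. The function name `intersection_dim` and the seed are hypothetical.

```python
import numpy as np

def intersection_dim(z1, z2, n, rng):
    """Dimension of the intersection of two random subspaces of R^n.

    dim(V1 + V2) is computed as the rank of the stacked basis matrix,
    and dim(V1 ∩ V2) = z1 + z2 - dim(V1 + V2).
    """
    B1 = rng.standard_normal((n, z1))   # columns span a random z1-dimensional subspace
    B2 = rng.standard_normal((n, z2))   # columns span a random z2-dimensional subspace
    rank_sum = np.linalg.matrix_rank(np.hstack([B1, B2]))
    return int(z1 + z2 - rank_sum)

rng = np.random.default_rng(2)          # hypothetical seed
print("two random lines in R^3: ", intersection_dim(1, 1, 3, rng))
print("two random planes in R^3:", intersection_dim(2, 2, 3, rng))
```

For generic subspaces the intersection is trivial whenever $z_1 + z_2 \le n$ and has dimension $z_1 + z_2 - n$ otherwise, which matches the rule used in the text for $k = 2$.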
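According to Eq. (5), the distribution of $\zeta$ can be approximated by a sum of $k$ identical normal distributions, which converges to
$$P(\zeta) \simeq \mathcal{N}\big(k\,(n - \mu(t) + 1),\; \sqrt{k}\,\sigma(t)\big). \tag{7}$$
Therefore, the probability that the smallest eigenvalue $\lambda_0$ of the average matrix $\bar{C}$ is larger than zero can be obtained from the cumulative distribution function of $P(\zeta)$ evaluated at $(k-1)\,n$, that is
$$P(\lambda_0 > 0) \simeq \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{k\,(\mu(t) - 1) - n}{\sigma(t)\sqrt{2k}}\right)\right]. \tag{8}$$
The above equation suggests setting a threshold $\alpha$ such that $P(\lambda_0 > 0) > 1 - \alpha$, i.e., $1 - \mathrm{erf}(a) = \alpha$ (for example, $a \approx 1.82$ for $\alpha = 0.01$). One can then define the number of bootstraps required to achieve $P(\lambda_0 > 0) > 1 - \alpha$ by setting the argument of the erf function to $a$, which gives
$$k_+(a) = \left[\frac{a\,\sigma(t) + \sqrt{a^2\sigma^2(t) + 2n\,(\mu(t) - 1)}}{\sqrt{2}\,(\mu(t) - 1)}\right]^2. \tag{9}$$
A two-dimensional map of the values of $k_+(a)$ with $a = 1.82$ as a function of $n$ and $t$, shown in Fig. 2 (left), indicates that the number of bootstrap copies $k_+$ required to obtain a positive-definite $\bar{C}$ is quite small, at least for not too extreme values of $q = n/t$.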
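The probability estimate and the threshold $k_+(a)$ can be evaluated directly. The sketch below rests on assumptions spelled out here: the sum of zero-eigenvalue counts over $k$ copies is taken as normal with mean $k\,(n - \mu(t) + 1)$ and standard deviation $\sqrt{k}\,\sigma(t)$, and $k_+(a)$ is the positive root obtained by setting the erf argument $(k(\mu - 1) - n)/(\sigma\sqrt{2k})$ equal to $a$. The function names and the example sizes are hypothetical.

```python
import math

def mu(t):
    """Exact mean number of unique indices in a bootstrap of size t."""
    return t * (1 - (1 - 1 / t) ** t)

def sigma(t):
    """Exact standard deviation of the number of unique indices."""
    p1 = (1 - 1 / t) ** t
    p2 = (1 - 2 / t) ** t
    return math.sqrt(t * p1 + t * (t - 1) * p2 - t**2 * p1**2)

def prob_positive_definite(k, n, t):
    """Normal approximation of P(lambda_0 > 0) for the average of k bootstraps."""
    m, s = mu(t), sigma(t)
    arg = (k * (m - 1) - n) / (s * math.sqrt(2 * k))
    return 0.5 * (1 + math.erf(arg))

def k_plus(a, n, t):
    """Value of k at which the erf argument equals a (quadratic in sqrt(k))."""
    m, s = mu(t), sigma(t)
    root_k = (a * s + math.sqrt(a**2 * s**2 + 2 * n * (m - 1))) / (math.sqrt(2) * (m - 1))
    return root_k**2

n, t, a = 1000, 100, 1.82               # hypothetical example sizes
kp = k_plus(a, n, t)
print("k_plus:", kp)
print("P at ceil(k_plus):", prob_positive_definite(math.ceil(kp), n, t))
```

In practice one would round $k_+$ up to the next integer, since the number of bootstrap copies is an integer.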
To obtain a rough estimate of the transition point $k_+$ for $P(\lambda_0 > 0) \approx 1$ in the limit of large $n$, we can substitute $t = n/q$ and compute the value $k^*$ at the inflection point of the error function, obtained when the argument of erf equals zero:
$$k^* = \frac{n}{\mu(n/q) - 1}. \tag{10}$$
In the large-system limit, the slope of $P(\lambda_0 > 0)$ at the inflection point of the error function diverges to infinity:
$$\lim_{n \to \infty} \left.\frac{\partial P(\lambda_0 > 0)}{\partial k}\right|_{k = k^*} = \infty. \tag{11}$$
This means that $P(\lambda_0 > 0)$ has a real transition in the large-system limit. The value of the inflection point of Eq. (10), in the large-system limit, converges to
$$k^* \to \frac{e}{e-1}\, q = \frac{e}{e-1}\,\frac{n}{t} \simeq 1.58\,\frac{n}{t}. \tag{12}$$
The right-hand plot of Fig. 2 shows that this approximation provides a quite accurate estimate of the magnitude of $k_+$ even for small $n$, when $q$ is not extremely large.
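The convergence of the inflection point to its large-system value can be illustrated numerically. The sketch below assumes $k^* = n/(\mu(t) - 1)$ with $t = n/q$ and $\mu(t) = t[1 - (1 - 1/t)^t]$, and compares it with $\frac{e}{e-1}\,q$. The function names and the choice $q = 10$ are hypothetical.

```python
import math

def mu(t):
    """Mean number of unique indices in a bootstrap of size t."""
    return t * (1 - (1 - 1 / t) ** t)

def k_star(n, q):
    """Inflection point of the error function at fixed q = n / t."""
    t = n / q
    return n / (mu(t) - 1)

q = 10                                   # hypothetical ratio n / t
limit = math.e / (math.e - 1) * q        # large-system value of k*
for n in (10**2, 10**3, 10**4, 10**5, 10**6):
    print(f"n = {n:>8d}  k* = {k_star(n, q):.4f}  limit = {limit:.4f}")
```

At fixed $q$, the finite-size value approaches the limit from above as $n$ grows.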
In summary, both the approximate probability $P(\lambda_0 > 0)$ of Eq. (8) and the bounds $k_+$ of Eq. (9) show very good agreement with the observations reported in Fig. 3.

Discussion
I have shown that the average correlation matrix of $k$ bootstrap copies converges to a positive-definite matrix for $k$ much smaller than the order of the matrix. Such a matrix can be used in many applications that require inverting the correlation matrix, such as risk optimization. An extensive comparative analysis of the performance of these approaches will be addressed in future work.

Figure 1: The left plot shows the exact (Eq. (3)) and approximate (Eq. (4)) $t$-dependence of the first two moments of the distribution $P(u_b)$ of Eq. (2). The right plot shows the approximate normal cumulative distribution function (CDF) against the observed CDF obtained with $10^4$ random samplings for every integer value of $u_b \in [0, t]$.

Figure 2: The left plot shows a color map of the analytical value of $k_+(a)$, with $a = 1.82$, for different $n$ and $t$. The right plot shows the analytical (dots) and large-system limit (dotted line) values of $k_+$ obtained for $t$ spanning a geometric progression in $[10, 10^4]$ such that $q \in [1, 100]$. The large-system limit is obtained from Eq. (12).

Figure 3: Observed and predicted probability that the average matrix $\bar{C}$ has no zero eigenvalues with $k$ bootstrap copies, for various $n$ and $t$. The figure shows the predicted $k_+(a)$ limit, with $a = 1.82$. The simulations are obtained by sampling $X$ from a standardized multivariate normal distribution.