Feynman-Hellmann Theorem and Signal Identification from Sample Covariance Matrices

Open Access

Lucy J. Colwell, Yu Qin, Miriam Huntley, Alexander Manta, and Michael P. Brenner

Phys. Rev. X 4, 031032 – Published 27 August 2014

Abstract

A common method for extracting true correlations from large data sets is to look for variables with unusually large coefficients on those principal components with the biggest eigenvalues. Here, we show that even if the top principal components have no unusually large coefficients, large coefficients on lower principal components can still correspond to a valid signal. This contradicts the typical mathematical justification for principal component analysis, which requires that eigenvalue distributions from relevant random matrix ensembles have compact support, so that any eigenvalue above the upper threshold corresponds to signal. The new possibility arises via a mechanism based on a variant of the Feynman-Hellmann theorem, and leads to significant correlations between a signal and principal components when the underlying noise is not both independent and uncorrelated, so the eigenvalue spacing of the noise distribution can be sufficiently large. This mechanism justifies a new way of using principal component analysis and rationalizes recent empirical findings that lower principal components can have information about the signal, even if the largest ones do not.
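
The compact-support argument mentioned above can be illustrated numerically. The following is a minimal sketch, assuming independent, unit-variance Gaussian noise and a sample covariance normalized by the number of samples. The function name `mp_support`, the sizes `p` and `n`, and the random seed are illustrative choices, not taken from the article. It compares the extreme eigenvalues of a pure-noise sample covariance matrix with the edges of the Marchenko-Pastur bulk, which is the upper threshold that standard principal component analysis relies on.

```python
import numpy as np


def mp_support(gamma, sigma2=1.0):
    """Edges of the Marchenko-Pastur bulk for aspect ratio gamma = p / n."""
    lower = sigma2 * (1.0 - np.sqrt(gamma)) ** 2
    upper = sigma2 * (1.0 + np.sqrt(gamma)) ** 2
    return lower, upper


rng = np.random.default_rng(0)
p, n = 200, 400                        # illustrative: p variables, n samples

X = rng.standard_normal((p, n))        # pure noise, no signal (assumption)
C = X @ X.T / n                        # sample covariance matrix (p x p)
eigvals = np.linalg.eigvalsh(C)        # eigenvalues in ascending order

lower, upper = mp_support(p / n)
print(f"MP bulk edges:        [{lower:.3f}, {upper:.3f}]")
print(f"observed eigenvalues: [{eigvals.min():.3f}, {eigvals.max():.3f}]")
```

With noise of this kind, the observed eigenvalues should sit close to the predicted edges, so an eigenvalue far above `upper` would conventionally be read as signal.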

Received 17 September 2013

DOI: https://doi.org/10.1103/PhysRevX.4.031032

This article is available under the terms of the Creative Commons Attribution 3.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article’s title, journal citation, and DOI.

Published by the American Physical Society

Authors & Affiliations

Lucy J. Colwell¹, Yu Qin¹, Miriam Huntley¹, Alexander Manta², and Michael P. Brenner¹

¹School of Engineering and Applied Sciences and Kavli Institute for Bionano Science and Technology, Harvard University, Cambridge, Massachusetts 02138, USA
²Roche Diagnostics GmbH, Penzberg 82377, Germany

Popular Summary

Technological advances have made it possible to measure an ever-increasing number of variables during an experiment. Determining correlations between different variables yields insights into the system, potentially leading to predictive theoretical models. For example, the multidimensional data sets generated by the Cancer Genome Atlas allow for the detection of correlated genetic perturbations that result in cancer phenotypes.

However, when a data set measures many more quantities (e.g., the expression levels of different genes in a genome) than the number of measurements that are made, there is a chance that a measured correlation could be spurious. It is therefore necessary to develop a rigorous procedure for determining when an observed correlation is spurious and when it is statistically reliable. A common technique is to look for variables with unusually large coefficients on principal components with the biggest eigenvalues. We show that even if the top principal components have no unusually large coefficients, large coefficients on lower principal components can still correspond to a valid signal. The mechanism that allows for this methodology is based on a variant of the Feynman-Hellmann theorem, developed for level splittings in quantum mechanics.

Our findings suggest that information about the structure of true covariance between variables can be recovered by examining the component distributions of different eigenvectors (not necessarily those with the largest eigenvalues, as is the case in principal component analysis).

Issue

Vol. 4, Iss. 3 — July - September 2014

Subject Areas

Statistical Physics

Images

Figure 1
(a),(b) Analysis of the lymphoma gene expression microarray data set [2], with 4026 genes measured under 96 different conditions. (a) The eigenvalue distribution of the covariance matrix (red histogram) does not follow the Marchenko-Pastur (MP) law (blue curve) that would be expected for signal superimposed on Gaussian, uncorrelated background noise. (b) Principal component biplots of 1st versus 2nd and 3rd versus 4th eigenvector components. The colored dots show the positions of clusters identified in the original analysis [2], corresponding to different cancer types. The different clusters show up as somewhat distinct regions in the principal component biplots. The dashed circle indicates a threshold on the eigenvector component size that would be expected for Gaussian, MP distributed background noise. The calibration of the circle radii is explained in the text. (c),(d) Analysis of a serine protease sequence alignment data set [23]. (c) Eigenvalue distribution compared with the MP law. (d) Typical principal component biplots. Colored points identify variables in clusters identified in Ref. [2].
Figure 2
Simulations where Gaussian distributed noise is superimposed upon a rank-1 signal perturbation. (a) The spectrum is well described by the MP distribution (blue curve), with a single eigenvalue that is far above the noise band. (b) Principal component biplot, showing the first eigenvector component plotted against the second, demonstrating that the magnitude of components in the first eigenvector captures the signal. Panels (c),(d) show that this information is not contained in the lower principal components; other lower principal component biplots are similar (data not shown).
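
The setting of this figure can be reproduced in outline. The following is a minimal sketch under stated assumptions: the signal is modeled as a spiked population covariance $I + \theta u u^{T}$, where the direction `u`, its support size `k`, the strength `theta`, and the matrix sizes are all hypothetical choices (the caption does not give the simulation parameters).

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 200, 400            # illustrative sizes (assumption)
k, theta = 20, 10.0        # hypothetical: signal on k variables, strength theta

# Unit-norm signal direction supported on the first k variables.
u = np.zeros(p)
u[:k] = 1.0 / np.sqrt(k)

# Samples with population covariance I + theta * u u^T:
# (I + c u u^T)^2 = I + theta u u^T when c = sqrt(1 + theta) - 1.
Z = rng.standard_normal((p, n))
X = Z + (np.sqrt(1.0 + theta) - 1.0) * np.outer(u, u @ Z)

C = X @ X.T / n
vals, vecs = np.linalg.eigh(C)         # ascending eigenvalues

mp_upper = (1.0 + np.sqrt(p / n)) ** 2  # upper edge of the MP noise band
top_overlap = abs(vecs[:, -1] @ u)      # alignment of the 1st eigenvector

print(f"largest eigenvalue {vals[-1]:.2f} vs MP edge {mp_upper:.2f}")
print(f"overlap of top eigenvector with signal direction: {top_overlap:.3f}")
```

In this regime a single eigenvalue separates from the noise band and the first eigenvector carries the signal, which is the conventional picture that the panels describe.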
Figure 3
Simulations where background noise that is not MP distributed is superimposed upon a rank-1 signal perturbation. The noise is generated by drawing $n = 80$ vectors of length $p = 800$ from a Gaussian distribution and then multiplying each vector by a (different) Gaussian random variable with unit variance. The signal strength $s = 35.5$ is near the fourth eigenvalue. (a) The eigenvalue distribution does not agree with MP (blue curve), but is well described by the solution to a nonlinear integral equation for the density distribution (black curve); see the Appendix for details. (b),(c) Biplots of the first and second pairs of principal components do not identify the signal. (d),(e) However, the components of the fourth principal component clearly identify variables (shown in green) involved in the signal $s$. (f),(g) No information about the signal is present in the fifth, sixth, or seventh eigenvectors.
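
The noise construction described in this caption can be sketched as follows. This is a hedged illustration, not the authors' code: the normalization of the covariance by $n$, the random seed, and the variable names are assumptions, and the signal term is omitted so that only the noise spectrum is examined.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 80, 800                         # values quoted in the caption

Z = rng.standard_normal((p, n))        # n Gaussian vectors of length p (columns)
g = rng.standard_normal(n)             # one unit-variance Gaussian factor per vector
X = Z * g                              # column j of Z is multiplied by g[j]

# The nonzero eigenvalues of X X^T / n equal those of the smaller
# n x n matrix X^T X / n, which is cheaper to diagonalize.
eig_scaled = np.linalg.eigvalsh(X.T @ X / n)
eig_plain = np.linalg.eigvalsh(Z.T @ Z / n)    # unscaled Gaussian reference

mp_upper = (1.0 + np.sqrt(p / n)) ** 2          # MP upper edge for the reference
print(f"MP upper edge:                   {mp_upper:.2f}")
print(f"top eigenvalue, Gaussian noise:  {eig_plain.max():.2f}")
print("top four eigenvalues, scaled noise:", np.round(eig_scaled[-4:][::-1], 2))
```

Because each vector carries its own random scale, the largest eigenvalues are no longer confined to the MP band and are widely spaced, which is the kind of noise the caption refers to as not MP distributed.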
Figure 4
(a) Eigenvalues of a diagonal matrix $W$, with eigenvalues $1, 3, 5, \dots, 29$, perturbed by the matrix $S$, as a function of signal strength $s$. As $s$ increases, different eigenvectors become parallel to the signal eigenvector. The red curve shows the typical behavior of an eigenvalue with increasing $s$. (b) As (a), except here $W$ is no longer diagonal: $W(i, 1) = W(1, i) = \varepsilon_{i}$, where $\varepsilon_{i}$ are random numbers with $0 \leq \varepsilon_{i} \leq 0.5$. Note the “level repulsion” now visible between the different eigenvalues, so that the red curve continuously varies from 19 to 21. (c) Eigenvector alignment $v_{i} \cdot e_{S}$ (red curve) for the eigenvalue highlighted in red in (b). The blue dots correspond to the theory of Eq. (7).
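
The toy model in panels (b) and (c) can be explored with a short script. This is a sketch under assumptions that the caption does not state: the signal matrix is taken to be $S = e_{S} e_{S}^{T}$ with $e_{S}$ along the first coordinate axis, the diagonal entry $W(1, 1)$ is left at 1, the couplings are drawn uniformly from $[0, 0.5]$, and the grid of $s$ values and the tracked level index `k` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

d = np.arange(1.0, 30.0, 2.0)          # diagonal entries 1, 3, ..., 29
m = d.size
eps = rng.uniform(0.0, 0.5, size=m - 1)

W = np.diag(d)
W[0, 1:] = eps                         # W(1, i) = eps_i for i > 1
W[1:, 0] = eps                         # W(i, 1) = eps_i for i > 1

e_S = np.zeros(m)
e_S[0] = 1.0                           # assumed signal direction
S = np.outer(e_S, e_S)                 # assumed rank-1 signal matrix

k = 9                                  # sorted level that starts near 19
s_grid = np.linspace(0.0, 40.0, 401)
level = np.empty_like(s_grid)
align = np.empty_like(s_grid)
for j, s in enumerate(s_grid):
    vals, vecs = np.linalg.eigh(W + s * S)
    level[j] = vals[k]                     # eigenvalue of the tracked level
    align[j] = abs(vecs[:, k] @ e_S)       # |v_k . e_S| for that level

s_peak = s_grid[np.argmax(align)]
print(f"tracked level moves from {level[0]:.2f} to {level[-1]:.2f}")
print(f"alignment peaks at s = {s_peak:.1f} with |v . e_S| = {align.max():.2f}")
```

Under these assumptions the tracked eigenvalue rises smoothly between two neighboring unperturbed levels instead of crossing them, and its eigenvector aligns with the signal direction only over a limited range of $s$.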
Figure 5
(a) Eigenvalues of a random matrix $M = W + S$ as a function of signal strength $s$. Note the same qualitative characteristics of the eigenvalue as that for the matrix in Fig. 4. (b) Location of eigenvalue where the maximum eigenvector alignment occurs for $M = W + S$, as a function of signal strength, with excellent agreement with the $\lambda = s$ law (solid line). (c) Eigenvalues of $M M^{T}$, where $M = X + S$, with $X$ a $p \times n$ noise matrix, as a function of $s$. Note the same qualitative structure as the other cases. (d) Location of eigenvalue where the maximum eigenvector alignment occurs for $M = X + S$, as a function of signal strength, with excellent agreement with the $\lambda = s^{2}$ law (solid line).
Figure 6
Comparison of $v_{i} \cdot e_{S}$ with Eq. (11). Here, $X$ has $n = 400$ samples of $p = 200$ variables, as discussed in the text, with signal matrix $sS$, where $S S^{T}$ is rank 1.

Physical Review X