Expression reflects population structure

Population structure in genotype data has been extensively studied, and is revealed by looking at the principal components of the genotype matrix. However, no similar analysis of population structure in gene expression data has been conducted, in part because a naïve principal components analysis of the gene expression matrix does not cluster by population. We identify a linear projection that reveals population structure in gene expression data. Our approach relies on the coupling of the principal components of genotype to the principal components of gene expression via canonical correlation analysis. Our method is able to determine the significance of the variance in the canonical correlation projection explained by each gene. We identify 3,571 significant genes, only 837 of which had been previously reported to have an associated eQTL in the GEUVADIS results. We show that our projections are not primarily driven by differences in allele frequency at known cis-eQTLs and that similar projections can be recovered using only several hundred randomly selected genes and SNPs. Finally, we present preliminary work on the consequences for eQTL analysis. We observe that using our projection coordinates as covariates results in the discovery of slightly fewer genes with eQTLs, but that these genes replicate in GTEx matched tissue at a slightly higher rate.


Principal Component Analysis
Let $X$ be an $N \times D$ matrix of rank $r \le N < D$. The singular value decomposition of $X$ is a factorization of the form $X = U \Lambda V^\top$, where $U$ and $V$ are $N \times r$ and $D \times r$ matrices with orthonormal columns, respectively, and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_r)$ is a diagonal matrix containing the singular values of $X$. The columns of $U$ and $V$ are called the left and right singular vectors, respectively [1]. While keeping the full spectrum of singular values gives an exact decomposition of $X$, one can instead keep only the top $l < r$ singular values, using $U_l$ and $V_l$ to denote the first $l$ columns of $U$ and $V$, respectively, and $\Lambda_l$ to denote the diagonal matrix of the first $l$ singular values. This gives an approximation to $X$, which we call $\tilde{X} = U_l \Lambda_l V_l^\top \approx X$.
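The truncated decomposition above can be sketched in a few lines of numpy; the matrix sizes here are arbitrary toy values, not from the study. The Eckart–Young theorem guarantees that the residual of the rank-$l$ truncation equals the norm of the discarded singular values.

```python
import numpy as np

# Toy data matrix: N = 5 samples, D = 8 features (illustrative values only).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))

# Thin SVD: X = U @ diag(lam) @ Vt, with U orthonormal columns.
U, lam, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-l truncation: keep only the top l singular values/vectors.
l = 2
X_tilde = U[:, :l] @ np.diag(lam[:l]) @ Vt[:l, :]

# The Frobenius-norm residual equals the norm of the dropped singular values.
residual = np.linalg.norm(X - X_tilde)
assert np.isclose(residual, np.linalg.norm(lam[l:]))
```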
Let $\Sigma = X^\top X / (N-1)$ be the covariance of $X$. We use $A \cdot_\Sigma B = A \Sigma^{+} B$ to denote matrix multiplication with respect to the Mahalanobis metric with covariance $\Sigma$, where $\Sigma^{+}$ is the Moore–Penrose pseudoinverse. When $X$ is column-mean subtracted, the principal components of $X$ are the $l$-dimensional coordinates defined by the rows of $U_l$. These coordinates have basis $F = V_l \Lambda_l / \sqrt{N-1}$, which is orthonormal with respect to the Mahalanobis metric on $X$. We note that this definition of principal component analysis (PCA) differs slightly from the classical one. Specifically, the classical formulation takes $U_l \Lambda_l$ as its coordinates and $V_l$ as the orthonormal Euclidean basis [2]. However, in genetics the basis scaling we use here is more common. Specifically, if $X$ is a genotype matrix, this is equivalent to defining the principal components as the eigenvectors of the realized relationship matrix, as in [3,4].
The projection of $X$ onto the space spanned by $F$ and its complement are then given by $\tilde{X} = (X \cdot_\Sigma F) F^\top = U_l \Lambda_l V_l^\top$ and $\tilde{X}^\perp = X - \tilde{X}$. Next, we extend these concepts to canonical correlation analysis before combining them to define explicitly the method we use.
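A minimal numerical check of these identities, with toy dimensions of our own choosing: the basis $F$ is orthonormal under the Mahalanobis metric, the coordinates $X \cdot_\Sigma F$ recover $\sqrt{N-1}\,U_l$, and the projection recovers the rank-$l$ truncation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, l = 50, 10, 3  # illustrative sizes, not from the study

# Column-mean-subtracted data matrix.
X = rng.standard_normal((N, D))
X -= X.mean(axis=0)

U, lam, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Basis F = V_l Lambda_l / sqrt(N-1), and the pseudoinverse of the covariance.
F = V[:, :l] @ np.diag(lam[:l]) / np.sqrt(N - 1)
Sigma_pinv = np.linalg.pinv(X.T @ X / (N - 1))

# F' Sigma^+ F = I: orthonormality in the Mahalanobis metric.
assert np.allclose(F.T @ Sigma_pinv @ F, np.eye(l))

# Coordinates X ._Sigma F equal sqrt(N-1) U_l, and the projection onto
# span(F) recovers the rank-l truncation U_l Lambda_l V_l'.
coords = X @ Sigma_pinv @ F
assert np.allclose(coords, np.sqrt(N - 1) * U[:, :l])
X_proj = coords @ F.T
assert np.allclose(X_proj, U[:, :l] @ np.diag(lam[:l]) @ Vt[:l, :])
```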

Canonical Correlation Analysis
Given two data matrices $X$ and $Y$ with the same number of rows representing distinct but related data, canonical correlation analysis (CCA) finds maximally correlated linear combinations of the columns of $X$ and $Y$. CCA identifies matrices $A$ and $B$ such that the sequence of correlations $\rho_{XY,i} = \mathrm{corr}(X A^{(i)}, Y B^{(i)})$, where $A^{(i)}$ is the $i$th column of matrix $A$, is successively maximized so that the correlation matrix $\rho_{XY} = \mathrm{corr}(XA, YB)$ is diagonal [2].
Specifically, let $X$ and $Y$ be $N \times D_X$ and $N \times D_Y$ data matrices of rank $r_X$ and $r_Y$, respectively. As in the previous section, we denote the singular value decompositions of $X$ and $Y$ by $X = U_X \Lambda_X V_X^\top$ and $Y = U_Y \Lambda_Y V_Y^\top$. The canonical correlations of $X$ and $Y$ are given by the singular values of the matrix $M = U_X^\top U_Y = U_M \Lambda_M V_M^\top$, and the canonical bases $F_X = V_X \Lambda_X U_M / \sqrt{N-1}$ and $F_Y = V_Y \Lambda_Y V_M / \sqrt{N-1}$ are again orthonormal with respect to the Mahalanobis metrics defined by $\Sigma_X$ and $\Sigma_Y$, respectively [5]. As with PCA, one can instead keep only the top $k$ singular vectors, which we denote by $U_{M,k}$ and $V_{M,k}$ to indicate the first $k$ columns of $U_M$ and $V_M$, respectively. In this case, the bases are $F_{X,k} = V_X \Lambda_X U_{M,k} / \sqrt{N-1}$ and $F_{Y,k} = V_Y \Lambda_Y V_{M,k} / \sqrt{N-1}$. The coordinates of the data matrices in $k$-dimensional CCA space, which we refer to as the canonical variables, are given by $X_{C,k} = X \cdot_{\Sigma_X} F_{X,k} = \sqrt{N-1}\, U_X U_{M,k}$ and $Y_{C,k} = Y \cdot_{\Sigma_Y} F_{Y,k} = \sqrt{N-1}\, U_Y V_{M,k}$. Similarly, the projections of the data matrices onto the canonical bases are given by $\tilde{X} = X_{C,k} F_{X,k}^\top$ and $\tilde{Y} = Y_{C,k} F_{Y,k}^\top$. As with PCA, the complements are given by $\tilde{X}^\perp = X - \tilde{X}$ and $\tilde{Y}^\perp = Y - \tilde{Y}$.

From this, it is straightforward to interpret linear discriminant analysis (LDA) as a special case of CCA [6]. Let $Y$ be a matrix of data observations and let $L$ be a length-$N$ categorical vector with $K$ categories. Consider the $N \times (K-1)$ indicator matrix $X$ whose $i$th column is the indicator vector $\mathbb{I}[L = i]$, sometimes called the one-hot encoding of the data. Then LDA between $X$ and $Y$ is equivalent to learning the CCA projection $Y_{C,k}$, and one can project out effects correlated with the categories in $L$ by computing $Y^\perp_{C,k}$. Indeed, we can replace the regression correction for batch and gender in the main text with this approach and obtain similar visualizations. See Supplementary Figure 1.
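This SVD-based route to CCA can be checked numerically on toy data (the sizes below are arbitrary): the canonical variables built from $U_X U_M$ and $U_Y V_M$ have a diagonal cross-correlation matrix whose diagonal equals the singular values of $M$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, Dx, Dy, k = 100, 6, 5, 2  # illustrative sizes only

X = rng.standard_normal((N, Dx)); X -= X.mean(axis=0)
Y = rng.standard_normal((N, Dy)); Y -= Y.mean(axis=0)

Ux, sx, Vxt = np.linalg.svd(X, full_matrices=False)
Uy, sy, Vyt = np.linalg.svd(Y, full_matrices=False)

# Canonical correlations are the singular values of M = U_X' U_Y.
M = Ux.T @ Uy
Um, rho, Vmt = np.linalg.svd(M)

# Canonical variables: sqrt(N-1) U_X U_M and sqrt(N-1) U_Y V_M.
Xc = np.sqrt(N - 1) * Ux @ Um[:, :k]
Yc = np.sqrt(N - 1) * Uy @ Vmt.T[:, :k]

# Their cross-correlation matrix is diagonal with the top-k rho on it.
C = np.corrcoef(Xc.T, Yc.T)[:k, k:]
assert np.allclose(np.diag(C), rho[:k])
assert np.allclose(C - np.diag(np.diag(C)), 0, atol=1e-8)
```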
We combine PCA with CCA as in [7] to arrive at our final projection. Specifically, we work with the first $l_X$ PCA components of $X$ and the first $l_Y$ PCA components of $Y$. In this case, $M = U_{X,l_X}^\top U_{Y,l_Y}$, such that the coordinates in $k$-dimensional CCA space are $X_{C,k} = \sqrt{N-1}\, U_{X,l_X} U_{M,k}$ and $Y_{C,k} = \sqrt{N-1}\, U_{Y,l_Y} V_{M,k}$, with bases $F_{X,k} = V_{X,l_X} \Lambda_{X,l_X} U_{M,k} / \sqrt{N-1}$ and $F_{Y,k} = V_{Y,l_Y} \Lambda_{Y,l_Y} V_{M,k} / \sqrt{N-1}$. As before, the projections of the data matrices onto the canonical bases and their complements are given by $\tilde{X} = X_{C,k} F_{X,k}^\top$, $\tilde{Y} = Y_{C,k} F_{Y,k}^\top$, $\tilde{X}^\perp = X - \tilde{X}$, and $\tilde{Y}^\perp = Y - \tilde{Y}$.
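The combined PCA+CCA step can be sketched as a single function; the function name, signature, and toy dimensions are our own, not from [7], and the inputs are assumed column-mean subtracted.

```python
import numpy as np

def pca_cca(X, Y, lX, lY, k):
    """Sketch of combined PCA+CCA: run CCA between the top-lX genotype PCs
    and top-lY expression PCs, keeping k canonical pairs. Assumes X and Y
    are column-mean subtracted."""
    N = X.shape[0]
    Ux, sx, Vxt = np.linalg.svd(X, full_matrices=False)
    Uy, sy, Vyt = np.linalg.svd(Y, full_matrices=False)
    # CCA restricted to the leading PCA subspaces: M = U_{X,lX}' U_{Y,lY}.
    Um, rho, Vmt = np.linalg.svd(Ux[:, :lX].T @ Uy[:, :lY])
    # Canonical coordinates sqrt(N-1) U U_M and Mahalanobis-orthonormal bases.
    Xc = np.sqrt(N - 1) * Ux[:, :lX] @ Um[:, :k]
    Yc = np.sqrt(N - 1) * Uy[:, :lY] @ Vmt.T[:, :k]
    Fx = Vxt.T[:, :lX] @ np.diag(sx[:lX]) @ Um[:, :k] / np.sqrt(N - 1)
    Fy = Vyt.T[:, :lY] @ np.diag(sy[:lY]) @ Vmt.T[:, :k] / np.sqrt(N - 1)
    return Xc, Yc, Fx, Fy, rho[:k]
```

The returned `Xc` and `Yc` are the $k$-dimensional canonical variables; the projections and complements then follow from `Xc @ Fx.T` and `X - Xc @ Fx.T` as in the formulas above.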

Leave-One-Out Cross-Validation
Let the genotype matrix of the sample be $X$ and the expression matrix of the sample be $Y$. For individual $i$, the full matrices $X$ and $Y$ are reduced by removing row $i$ to create $X_{-i}$ and $Y_{-i}$. We then learn the canonical bases $F_{X,-i}$, $F_{Y,-i}$ of the data matrices using the combination of PCA and CCA described above. Next, we project the held-out individual's gene expression vector $y_i$ into this space to get $y_i^C = (y_i \cdot_{\Sigma_{Y,-i}} F_{Y,-i}) F_{Y,-i}^\top$. After repeating this for all individuals, we form the data matrix $Y_C$, where each row $i$ is the projection of individual $i$'s gene expression vector into the CCA-gene space learned without using $i$. A plot of the first two principal components of this data matrix shows that the population structure learned by this method is valid with respect to held-out samples (Figure 2, Supplementary Figure 3).
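The leave-one-out loop can be sketched as follows. This is a self-contained illustration under our own assumptions (function name, centering with training-fold means, and toy dimensions are ours); each fold has $N-1$ samples, so its covariance uses divisor $N-2$.

```python
import numpy as np

def loo_expression_projection(X, Y, lX, lY, k):
    """Sketch of leave-one-out CCA: row i of the result is individual i's
    expression vector projected onto the basis F_{Y,-i} learned without i."""
    N = X.shape[0]
    Y_C = np.empty((N, Y.shape[1]))
    for i in range(N):
        keep = np.arange(N) != i
        # Center the training fold with its own means.
        Xi = X[keep] - X[keep].mean(axis=0)
        my = Y[keep].mean(axis=0)
        Yi = Y[keep] - my
        # PCA+CCA on the fold (as in the main derivation).
        Ux, sx, _ = np.linalg.svd(Xi, full_matrices=False)
        Uy, sy, Vyt = np.linalg.svd(Yi, full_matrices=False)
        _, _, Vmt = np.linalg.svd(Ux[:, :lX].T @ Uy[:, :lY])
        Fy = Vyt.T[:, :lY] @ np.diag(sy[:lY]) @ Vmt.T[:, :k] / np.sqrt(N - 2)
        # Project the held-out vector: (y ._Sigma F_Y) F_Y'.
        Sigma_pinv = np.linalg.pinv(Yi.T @ Yi / (N - 2))
        Y_C[i] = (Y[i] - my) @ Sigma_pinv @ Fy @ Fy.T
    return Y_C
```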

Determining the Significance of a Gene for the CCA Projection
Let $F_{Y,(j)}$ be the $j$th row of the basis matrix $F_Y$. The variance $v_j$ for gene $j$ is given by $v_j = \|F_{Y,(j)}\|_2^2$. Therefore, to determine whether the variance of a gene is non-null, we can perform a permutation test. Specifically, in each permutation $p$ we shuffle the genotype principal components and compute the permuted variance $v_j^p$ for each gene. The p-value for the test that the variance is greater than the null is then the fraction of permutations in which the permuted variance is greater than the true variance. That is, for $N_p$ permutations, $p_j = \frac{1}{N_p} \sum_{p=1}^{N_p} \mathbb{I}[v_j^p \geq v_j]$.
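The permutation test can be sketched as below; the function name, toy sizes, and permutation count are our own illustrative choices. Shuffling the rows of the genotype PC matrix breaks the genotype–expression pairing, giving a null distribution for each gene's variance $v_j$.

```python
import numpy as np

def gene_variance_pvalues(X, Y, lX, lY, k, n_perm=1000, seed=0):
    """Sketch of the permutation test: shuffle the genotype principal
    components across individuals, recompute v_j = ||F_Y,(j)||^2 for each
    gene, and report the fraction of permutations with v_j^p >= v_j."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]

    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Ux, _, _ = np.linalg.svd(Xc, full_matrices=False)
    Uy, sy, Vyt = np.linalg.svd(Yc, full_matrices=False)

    def gene_variances(Ux_l):
        # CCA between the (possibly permuted) genotype PCs and expression
        # PCs, then squared row norms of the expression basis F_Y.
        _, _, Vmt = np.linalg.svd(Ux_l.T @ Uy[:, :lY])
        Fy = Vyt.T[:, :lY] @ np.diag(sy[:lY]) @ Vmt.T[:, :k] / np.sqrt(N - 1)
        return (Fy ** 2).sum(axis=1)

    v = gene_variances(Ux[:, :lX])
    exceed = np.zeros_like(v)
    for _ in range(n_perm):
        perm = rng.permutation(N)              # shuffle individuals' PC rows
        exceed += gene_variances(Ux[perm, :lX]) >= v
    return exceed / n_perm                     # p_j per gene
```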

A regression based approach
While performing CCA between the projected data matrices is one approach to visualizing the common structure underlying the principal components of two data matrices, there are possibly many projections that yield such a result. One alternative assumes that the principal components of one matrix have a linear relationship to the principal components of the other with additive noise. That is, if $U_Y = U_X \beta + \epsilon$, the least squares solution is $\hat{\beta} = U_X^\top U_Y$, since the columns of $U_X$ are orthonormal.
This time, the coordinates of the data matrix in the projection are $\hat{U}_Y = U_X \hat{\beta}$. Compared to the CCA approach, the regression approach presumes a different noise model for the data. Rather than modeling the principal components of both gene expression and genotype as linearly related to an underlying hidden factor with additive noise, the regression approach implicitly models the genotype principal components as the underlying factor that influences the gene expression principal components with additive noise. Ultimately, the choice of model should reflect the "noise" structure emerging from the underlying biology and the nature of the measurements. One notable drawback of the regression-based approach is that it uses the location of the genotype in genotype-PCA space to approximate the location of the expression vector in expression-PCA space, whereas the CCA projection only needs to know the global relationship between the points in expression-PCA and genotype-PCA space. A consequence of this is that any held-out individual must have their genotype projected into genotype-PCA space, which is computationally infeasible when working with millions of SNPs. Therefore, we have omitted the leave-one-out cross-validation for this model. However, the visualization obtained when applying this model is depicted in Supplementary Figure 2.
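The regression variant reduces to a few lines on toy data (sizes are illustrative only). Because $U_X$ has orthonormal columns, the closed form $\hat{\beta} = U_X^\top U_Y$ agrees with a generic least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(3)
N, Dx, Dy, l = 80, 6, 5, 3  # illustrative sizes only

X = rng.standard_normal((N, Dx)); X -= X.mean(axis=0)
Y = rng.standard_normal((N, Dy)); Y -= Y.mean(axis=0)

# Top-l left singular vectors (principal component coordinates, unscaled).
Ux = np.linalg.svd(X, full_matrices=False)[0][:, :l]
Uy = np.linalg.svd(Y, full_matrices=False)[0][:, :l]

# Least squares for U_Y = U_X beta + eps. Since U_X'U_X = I, the normal
# equations (U_X'U_X)^{-1} U_X'U_Y collapse to U_X'U_Y.
beta_hat = Ux.T @ Uy
assert np.allclose(beta_hat, np.linalg.lstsq(Ux, Uy, rcond=None)[0])

# Predicted expression-PC coordinates from the genotype PCs.
Uy_hat = Ux @ beta_hat
```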