variancePartition: interpreting drivers of variation in complex gene expression studies

Background As large-scale studies of gene expression with multiple sources of biological and technical variation become widely adopted, characterizing these drivers of variation becomes essential to understanding disease biology and regulatory genetics. Results We describe a statistical and visualization framework, variancePartition, to prioritize drivers of variation based on a genome-wide summary, and identify genes that deviate from the genome-wide trend. Using a linear mixed model, variancePartition quantifies variation in each expression trait attributable to differences in disease status, sex, cell or tissue type, ancestry, genetic background, experimental stimulus, or technical variables. Analysis of four large-scale transcriptome profiling datasets illustrates that variancePartition recovers striking patterns of biological and technical variation that are reproducible across multiple datasets. Conclusions Our open source software, variancePartition, enables rapid interpretation of complex gene expression studies as well as other high-throughput genomics assays. variancePartition is available from Bioconductor: http://bioconductor.org/packages/variancePartition. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1323-z) contains supplementary material, which is available to authorized users.

Statistical methods to apportion variance to multiple experimental variables have recently yielded valuable insight into complex gene expression datasets: • 't Hoen, et al.
[1] used a xed eects ANOVA model in R to examine the contribution of individual and laboratory in the GEUVADIS RNA-seq data, and examined the eect of technical factors to expression variability in a second analysis.
• Melé, et al. [2] used a linear mixed model t with REML in R to quantify the contribution of variation across individuals and tissues to expression variation in GTEx.
• Burrows, et al. [3] used a linear mixed model in R to decompose the variation in expression and methylation in human induced pluripotent stem cells into the contributions of donor individual and cell type of origin.
• Rouhani, et al. [4] examined how multiple dimensions of variation aect expression variation in human induced pluripotent stem cells. This analysis uses a sophisticated custom algorithm to give a genome-wide summary of the contribution of each component. However, this method does not give gene-level results and software is not currently available.
• Trabzuni and Thomson [5] propose a linear mixed model with a nite mixture model extension to test for dierential expression. However, the software requires extensive statistical experience and is not very accessible to non-experts.
Yet no user-friendly software is available. This valuable line of analysis is currently only possible for analysts with advanced knowledge of computational statistics, linear mixed models, ecient parallel computing in R, and visualization in R.
variancePartition addresses this gap in the eld. By providing a computational workow involving a few lines of R code, variancePartition allows anyone with basic prociency in R to apply this powerful analysis framework to a range of gene expression datasets.

Interpretation of percent variance explained
In the main text, we considered a model with only one random eect. Here we extend the intuition about intra-class correlation to a model with two random eects. Consider the i th sample from the k th individual and the c th cell type: Modeling the separate eects of individual and tissue gives the following covariance structure between samples when a linear mixed model is used: The covariance matrix is symmetric so that blank entries take the value on the opposite side of the diagonal. The covariance can be converted to correlation by dividing by σ 2 id +σ 2 cell +σ 2 ε , and this gives the results from above. This example generalizes to any number of variance components [6].

1.3
Variation within multiple subsets of the data The linear mixed model underlying variancePartition allows the eect of one variable to depend on the value of another variable. Statistically, this is called a varying coecient model [6,7]. This model arises in variancePartition analysis when the variation explained by individual depends on tissue or cell type.
A given sample is only from one cell type, so this analysis asks a question about a subset of the data. The data is implicitly divided into subsets base on cell type and variation explained by individual is evaluated within each subset. The data isn't actually divided into subsets, but the statistical model essentially considers samples within each cell type. This subsetting means that the variance fractions no longer sum to 1.
Consider a concrete example with variation from across individual and cell types (T-cells and monocytes). Modeling the variation across individuals within cell type corresponds to the model for the i th sample, k th individual, s th sex and c th cell type: This model has corresponding variance components: In this case the covariance matrix is no longer`proper', because the diagonal values are not equal. But the covariance structure can still be decomposed into the contribution of each variable.
Since the dataset is now divided into multiple subsets, direct interpretation of the fraction of variation explained (FVE) as intra-class correlation does not apply. Instead, we compute an ad hoc pseudo-FVE by approximating the total variance attributable to cell type by using a weighted average of the within cell type variances weighted by the sample size within each cell type. Thus the values of pseudo-FVE do not have the simple interpretation as in the standard application of variancePartition, but nonetheless allows ranking of variables based on genome-wide contribution to variance and enables analysis of gene-level results. The variance fractions were estimated with all methods and estimates were compared to the true variance fractions to evaluate performance of the competing statistical methods (Supplementary Figure 1). The overall performance of each method was evaluated under each simulation condition using the root mean squared error (rMSE) across all variance fraction estimates (Supplementary Figure 2). where it is very competitive. This good overall performance is likely due to the fact that LMM-LM integrates out the parameter values for each category so that it only estimates k+1 parameters (i.e. k variance components, plus the residuals) when k variables are included in the model [6,8]. The linear mixed model improves performance by borrowing information across the multiple categories within the same variable. The benet of this regularization is most evident for variables with many categories. Also notable is that fact that the accuracy and precision of the estimates are not very dependent on the number of categories in a variable for the LMM-LM method.
Conversely, restricted maximum likelihood (REML) and xed eects ANOVA methods performed very poorly under some conditions. REML is designed to produce unbiased estimates of individual variance components [9], but it appears to perform poorly when estimating ratios of variance components. REML shows substantial deviation from the true parameter values and is inferior to LMM-LM in all simulations. REML also shows similar deviation from the true parameter values when only random eects are used (results not shown).
The xed eects ANOVA performs well when variables contain few categories (i.e. simulation 1), but this method performs very poorly in simulations 2 and 3 where variables have up to 100 categories. This is due to the fact that the xed eects ANOVA must estimate a regression coecient for each category in each variable and then use these values to construct variance component estimates [7,8]. As the number of parameters approaches the number of samples, this approach performs poorly since the model overts the data.

Nested variables
Complex study designs often involve correlated variables and especially variables that are nested within another variable.`Nesting' [6] refers to a situation that arises when considering variation across individuals and any properties of these individuals that do not vary in the dataset. For example, sex, age, ancestry and BMI are static properties of each individual and do not vary within individual (assuming the study is not longitudinal). The problem with nested variables is that the BMI of each individual is constant so that BMI can be constructed as a linear combination of individual. Therefore, a xed eects ANOVA model cannot consider both variation across individual and BMI because the model is singular and degenerate.
Yet the linear mixed model can accurately decompose variance even in cases where ANOVA is not applicable. The regularization used by the linear mixed model allows it to estimate the contribution of variables even when variables are correlated or nested. Repeating the previous simulation study except generating the continuous variable so that it is nested within the rst variable (i.e. like BMI) indicates that the linear mixed model estimated with maximum likelihood produces unbiased estimates of the variance fractions under all conditions (Supplementary Figure 3). The rMSE from the maximum likelihood estimates is slightly smaller than from REML (Supplementary Figure 4). We note that the xed eects ANOVA could not be t on these data because the nested variables make the model degenerate.

1.6
Processing expression data GEUVADIS RNA-seq data was downloaded from EBI (http://www.geuvadis.org/web/geuvadis/ RNAseq-project) and processed with limma/voom [10]. The variables Individual, Lab, Ancestry and Sex were modeled as random eects since they are discrete.
GTEx RNA-seq data (phs000424.v4.p1) (i.e. version 4) downloaded from dbGAP and GTEx website (http://www.gtexportal.org/) and processed with limma/voom [10]. Only RIN was modeled as a continuous xed eect. All other variables were modeled as discrete random eects as they are coded in the GTEx metadata: Age is discretized into 10 year bins and Ischemic time is discretized according to Supplementary

ImmVar
Expression array data was downloaded from GEO (GSE56035) and normalized with RMA [11] in the oligo Bioconductor package [12]. Analysis included experiments from individuals observed in both cell types. Age was modeled as a xed eect because it is continuous. All other variables were modeled as random eects. SEQC RNA-seq data was downloaded from GEO (GSE47774) and processed with limma/voom [10]. All varibles are discrete and were modeled as random eects.