The Architecture of Gene Regulatory Variation across Multiple Human Tissues: The MuTHER Study

While there have been studies exploring regulatory variation in one or more tissues, the complexity of tissue-specificity in multiple primary tissues is not yet well understood. We explore in depth the role of cis-regulatory variation in three human tissues: lymphoblastoid cell lines (LCL), skin, and fat. The samples (156 LCL, 160 skin, 166 fat) were derived simultaneously from a subset of well-phenotyped healthy female twins of the MuTHER resource. We discover an abundance of cis-eQTLs in each tissue similar to previous estimates (858 or 4.7% of genes). In addition, we apply factor analysis (FA) to remove effects of latent variables, thus more than doubling the number of our discoveries (1,822 eQTL genes). The unique study design (Matched Co-Twin Analysis—MCTA) permits immediate replication of eQTLs using co-twins (93%–98%) and validation of the considerable gain in eQTL discovery after FA correction. We highlight the challenges of comparing eQTLs between tissues. After verifying previous significance threshold-based estimates of tissue-specificity, we show their limitations given their dependency on statistical power. We propose that continuous estimates of the proportion of tissue-shared signals and direct comparison of the magnitude of effect on the fold change in expression are essential properties that jointly provide a biologically realistic view of tissue-specificity. Under this framework we demonstrate that 30% of eQTLs are shared among the three tissues studied, while another 29% appear exclusively tissue-specific. However, even among the shared eQTLs, a substantial proportion (10%–20%) have significant differences in the magnitude of fold change between genotypic classes across tissues. Our results underline the need to account for the complexity of eQTL tissue-specificity in an effort to assess consequences of such variants for complex traits.


Introduction
Gene expression is an essential cellular function whose regulation determines a significant proportion of the phenotypic variance. Using microarrays and recently second generation sequencing (RNA-seq) [1,2], major progress has been made in understanding the genetics of human gene expression and identifying loci that drive differential expression across individuals [3,4], populations [5][6][7] and tissues [7][8][9][10][11]. This development is especially valuable for the biological analysis of genome-wide association (GWAS) signals [12], which often map to non-genic regions and are thus hard to interpret in the absence of additional information [13].
Transcript abundance is a very proximal endophenotype affected by genetic variation and has already facilitated the identification of candidate susceptibility genes for metabolic disease traits [14], asthma [15] or Crohn's disease [16]. This has been mostly possible when the tissue of expression was relevant to the interrogated complex trait, as disease phenotypes manifest themselves only in certain tissues. eQTLs discovered in LCLs have primarily helped explain GWAS associations with immunityrelated disorders [17,18] while associations with obesity-related traits were mostly observed when gene expression was quantified in adipose tissue [9]. Nevertheless, our guess of tissue relevance is yet far from satisfactory [19], reinforcing thus the incontestable value of measuring expression in multiple cell-types (including primary tissues reflecting in vivo patterns).
Transcriptional regulatory networks are expected to dictate tissue-specificity of regulatory effects [20], but the extent of this is still under debate. Depending on the cell-types compared and the eQTL discovery methods used, current estimates for tissuespecificity of eQTLs range from ,30% (liver, adipose tissue) [21] to 70-80% (LCL, fibroblasts, T cells) [7].
In this study we investigated various aspects of tissue-specificity and we emphasize the importance of accounting not only for statistical significance but also for continuous biological properties of regulatory variants, such as fold change in expression. We explored the complexity of the human cis-regulatory variation landscape in three tissues (LCL, skin and fat) derived from a subset of female Caucasian twins aged between 40 and 87 years old (mean 62 years) from the UK Adult Twin registry [22]. The present study represents the pilot phase of the MuTHER project (Multiple Tissue Human Expression Resource-http://www.muther.ac.uk/), a major resource initiated to enhance our knowledge about common trait susceptibility by providing genome-wide expression, methylation and eventually transcriptome sequencing information for 855 extensively phenotyped twins (clinical, anthropometric, life-style information as well as a wide range of biological measurements are available).

Results
Gene expression was quantified in LCL, skin and fat using Illumina's whole genome expression array (HumanHT-12 version 3)  We tested for SNP-gene expression associations (eQTLs) separately in each tissue. We considered only unrelated individuals at a time by separating twins from the same pair and thus performing two independent eQTL analyses per tissue. This study design, hereafter named Matched Co-Twin Analysis (MCTA), permits immediate replication and validation of eQTL discoveries. We used Spearman Rank Correlation (SRC) to detect associations and restricted our search to cis effects located within 1Mb on either side of a gene's transcription start site (TSS). Statistical significance was assessed at different thresholds using permutations (10,000 per gene) [5]. We detected an abundance of cis eQTLs (Table S1A) per tissue at a comparable rate to other studies of similar sample size [5,7]. The reported eQTLs appear robust as they replicate well between individuals of the two co-twin groups per tissue. We measured the eQTL overlap in a continuous fashion by taking the significant SNP-gene associations from one co-twin set and estimating the proportion of true associations (p 1 statistic [23], see Materials and Methods) on the distribution of corresponding pvalues in the reciprocal co-twin validation set. High levels of eQTL replication were observed across co-twins, with a mean p 1 of 0.93 in skin and 0.98 in LCL and fat (Table 1). We also measured the estimated proportion of true positives among the subset of genes that did not replicate in the co-twin at the same threshold. This too is high (p 1 = 0.84 for skin and 0.94 for LCL and fat), suggesting that exact overlap of genes at a given permutation threshold (PT) is an underestimate of eQTL replication due to winner's curse. In other words, we detected eQTLs in the co-twin that clearly replicated the initial findings, but at p-values that marginally missed the initial discovery threshold. To further confirm the robustness of our discoveries, we overlapped the MuTHER LCL results with available eQTL data from two recent independent studies. 40% of the genes for which we detect LCL eQTLs overlap

Author Summary
Regulation of gene expression is a fundamental cellular process determining a large proportion of the phenotypic variance. Previous studies have identified genetic loci influencing gene expression levels (eQTLs), but the complexity of their tissue-specific properties has not yet been well-characterized. In this study, we perform cis-eQTL analysis in a unique matched co-twin design for three human tissues derived simultaneously from the same set of individuals. The study design allows validation of the substantial discoveries we make in each tissue. We explore in depth the tissue-dependent features of regulatory variants and estimate the proportions of shared and specific effects. We use continuous measures of eQTL sharing to circumvent the statistical power limitations of comparing direct overlap of eQTLs in multiple tissues. In this framework, we demonstrate that 30% of eQTLs are shared among tissues, while 29% are exclusively tissuespecific. Furthermore, we show that the fold change in expression between eQTL genotypic classes differs between tissues. Even among shared eQTLs, we report a substantial proportion (10%-20%) of significant tissue differences in magnitude of these effects. The complexities we highlight here are essential for understanding the impact of regulatory variants on complex traits.  [24] overlap with genes reported in our study. Given the differences in gender distribution, sample preparation or even cell-type tested (LCL versus leukocytes) across these studies, the gene overlap observed is reassuring. The observed variation in gene expression is not entirely due to genetic effects. Experimental noise and environmental conditions also affect transcript levels in a global manner. Therefore, it is desirable to remove the effects of such random variables and thus increase the power to detect eQTLs. For this purpose, we employed factor analysis (FA) on each tissue separately and corrected for global latent effects on all individuals in each tissue [25]. We fitted various parameters such as number of learned factors and proportion of variance explained, in order to maximize for replication of eQTLs per tissue between twin sets. After performing standard SRC eQTL analysis on the factor-corrected expression data (SRC-FA), we obtained a substantial improvement in eQTL discovery at each of the standard permutation thresholds used (Table S1B). The improvement (twice as many eQTLs at 10 23 PT) is consistent in all tissues. The high eQTL replication between twin sets persists after FA, with an additional improvement of true positives detection in skin: p 1 = 0.95 ( Table 1). As expected, FA correction recovers the majority of the eQTLs discovered with the initial analysis (90% of LCL and fat and 80% of skin) ensuring that proximal genetic effects have not been corrected out. The FA correction enabled the discovery of additional signals (Table S2) likely representing real effects that could not be detected initially due to low power. This is supported by the significant overrepresentation of low association p-values (p 1 = 0.99, Figure 1) estimated in the uncorrected data for eQTLs detected only after FA correction.
Direct tissue overlap of significant eQTLs supports an extensive level of tissue-specificity for the three tissues, with very similar proportions in both the SRC and SRC-FA analyses ( Figure 2). In the first co-twin set we discovered 858 eQTL genes (nonredundant union) at 10 23 PT in all three tissues ( Table 2). Of these, 106 genes (12.35%) are shared across all tissues, 139 (16.2%) are shared in at least two tissues and 613 genes (71.44%) are detected in only one tissue. In skin we detect proportionally fewer tissue-specific effects (10.02% of skin eQTLs are specific to skin at 10 23 PT), an observation likely due to tissue heterogeneity and larger variety of present cell-types. SRC-FA results confirm the estimated ,30% of eQTLs to be shared in at least two tissues based on threshold eQTL discovery (Table S3).
Tissue-specific effects are largely not due to tissue-specific expression of the underlying transcripts. We detected regulatory variants active only in one tissue for genes that are expressed at high levels in the other two tissues ( Figure S1). The strength of tissue-specificity was investigated further by performing a joint repeated-measures ANOVA analysis with the tissue modelled as a categorical predictor variable (i.e. tissue type comprised the repeated measure). We assessed the relationship to the genotype by inspecting the SNP6tissue interaction p-value term. As expected, we detected a large enrichment of significant SNP6tissue interaction p-values for all associations (p 1 = 0.56) with tissue-specific effects having higher enrichment (p 1 = 0.6) than shared ones (p 1 = 0.41) ( Figure S2). The enrichment in the shared category suggests additional attributes of tissue-specificity beyond statistical significance, as presented in the succeeding fold change analysis.
The direction of allelic effects for shared eQTLs (10 23 and 10 22 PT) is consistent across the three given tissues ( Figure S3). As expected, for eQTLs significant in one tissue only the SRC correlation coefficient rho (reflecting direction and magnitude of effects) explains a substantially higher fraction of gene expression variation in the tissue of discovery compared to the other two tissues (identical SNP-gene associations - Figure S4). On the other hand, the amount of expression variance explained by shared eQTLs (10 23 PT) is comparable across tissues.
To refine regulatory signals and describe independently acting variants, we mapped eQTLs to recombination hotspot intervals and filtered markers in high LD (Materials and Methods). We found that ,7% of the genes tested are regulated by more than one independent cis eQTL, with similar estimates obtained from the standard and factor eQTL analysis ( Figure S5). For finer comparison of eQTL effects, we conducted an analysis where sharing was required for both the gene and the genomic interval harboring the eQTL. This analysis yielded similar counts of tissueshared and specific effects (Tables S4, S5), suggesting that the vast majority of shared genes also share regulatory variants across tissues. Furthermore, as shown previously [7], we observed that eQTLs cluster symmetrically around the TSS, with shared effects being distributed tightly around the TSS and tissue-specific effects spanning a greater range of distances ( Figures S6, S7).
The results described so far are based on thresholds, which are driven by statistical significance. Overlaps at these levels are heavily dependent on power and affected by winner's curse. In addition, eQTLs sharing statistical significance may still have notable effect differences on gene expression levels across tissues, with potentially different biological consequences. Given these caveats, we examined tissue-specificity in a continuous manner by quantifying the proportion of true positives estimated from the enrichment of low pvalues (p 1 ). Specifically, the p-value distribution of significant SNPprobe pairs (10 23 PT) from a reference tissue was investigated in the other two tissues. The p-value distribution in the other tissues indicates a high degree of tissue sharing (53 to 80%) both with the SRC and SRC-FA, varying slightly depending on the reference tissue in the comparison (Table S6). This suggests that there are effect size differences (both fold change and amount of variance explained) among tissues for the same regulatory variants, which is the basis for the previously described higher eQTL tissue-specificity estimates [7]. Overall, 29% of eQTLs (1-mean p 1 ) are estimated with the continuous approach to be tissue-specific, when comparing the three tissues studied.
As described above, tissue overlap of eQTLs should encompass not only sharing of a statistically significant regulatory effect, but also a similar effect size (fold change in expression) of that variant across tissues. In this respect, we report the fold change as the difference between the gene expression means of the heterozygous and major homozygous genotypic classes. Within the same tissue, the two co-twin sets are only slightly different in their fold change estimates. These minor differences reflect most probably the winner's curse effect (0.96 Pearson's correlation of fold change between Twin 1 and Twin 2 in LCL, 0.93 in skin and 0.93 in fat - Figure 3, Figures S8, S9). The difference in estimated effect size is much more apparent however across tissues (e.g. LCL eQTLs have a 0.69 and 0.77 fold change correlation with skin and fat eQTLs respectively, skin eQTLs have a 0.69 fold change correlation with fat eQTLs). This is largely a consequence of eQTL tissue-specificity, but a small effect of winner's curse is also expected (as observed in the comparison of co-twin sets). Furthermore, additional possible hidden tissue-specific effects are implied by the fact that shared eQTLs (at the same threshold of significance) don't always share the same effect size across tissues (LCL fold change correlation of 0.78 in skin and 0.84 in fat for shared eQTLs i.e. up to 20% difference in fold change magnitude between tissues compared to within-tissue difference). This suggests that even statistically tissue-shared eQTLs have additional dimensions of tissue-specificity and their mere discovery in multiple tissues does not guarantee similar magnitude of consequences.

Discussion
We have performed eQTL analysis in one cell-line (LCL) and two primary tissues of clinical importance (skin -previously unchar-acterized and fat). For each tissue we report robust eQTLs replicating in independent samples with identical (MZ) or on average 50% similar (DZ) genetic background using a matched cotwin design (MCTA). To further increase our power to detect eQTLs and uncover smaller genetic effects, we applied factor analysis accounting for global variance components in the data. We refined our signals to detect independently acting cis eQTLs and for shows that the signal existed in the uncorrected data but wasn't called significant due to low power. In each tissue, the exact SNP-gene combinations (eQTLs) tested are presented for both co-twin sets (Twin 1-first column, Twin 2-second column). doi:10.1371/journal.pgen.1002003.g001 most genes we found single associated regulatory variants. When these variants are shared across tissues, they also share the same direction of allelic effects and map to the same recombination hotspot interval. Using threshold-based criteria, tissue overlap of eQTLs supports a large degree of tissue-specificity for the three tissues studied. However, this estimate is dependent on power and we therefore put forth a continuous measure of tissue-specificity that provides a refined view of the decay of statistical significance as well as fold change effect on gene expression. Using this approach we observed a significant overrepresentation of low p-values in all  Table 2. Tissue-shared and tissue-specific gene associations (10 23 PT), SRC analysis. pairwise tissue comparisons, indicating larger proportions of shared statistically significant regulatory effects, some yet to be discovered with bigger sample sizes. However, we also observed significant eQTLs at the same threshold exhibiting differential fold changes in expression between genotypes across tissues. These cases represent tissue-specific effects as well, since differential fold change in expression is likely to have different biological consequences.
Overall biological interpretation of regulatory effects -much like in the case of complex traits -is tissue-dependent, highlighting the value of multiple tissue expression datasets. Understanding such complexities and context-dependent effects in the genetic architecture of gene expression and other cellular phenotypes is essential for the interpretation of the biological properties of disease causing variants.

Materials and Methods
All samples and information were collected with written and signed informed consent. The project has been approved by the local ethics committees of all institutions involved.

Sample collection
All individuals recruited in this study were Caucasian female twins aged between 40 and 87 years old (mean age 62). Skin punch biopsies (N = 196) were taken from a relatively photo-protected area adjacent and inferior to the umbilicus. The fat sample was then carefully dissected from the same skin biopsy incision. A peripheral blood sample to generate lymphoblastoid cell lines (LCL) was taken contemporaneously. For a full description of the biopsy technique see Text S1.

Gene expression measurements and genotyping
RNA levels were measured in LCL, skin and fat using Illumina's whole-genome expression array HumanHT-12 version 3 as previously described [5]. Each sample had three technical replicates. Illumina's v3 probes were mapped to unique Ensembl gene IDs by combining and cross-checking two methods. The first approach used Illumina's probe annotation to RefSeq IDs. These were further queried with BioMart (Ensembl 54) for corresponding Ensembl genes. RefSeq IDs mapping to multiple EnsGenes were excluded. The second approach used BLAT to map the 50-mer probe sequences to Ensembl transcripts and to extract genomic locations matching for all 50 bases of the probe sequence. Probes with unique perfect match to the genome and corresponding transcripts matching to the same genes were kept. The union of the two mappings after excluding 196 conflictingly matching probes resulted in 27,499 probes corresponding to 18,170 autosomal genes available for association analysis.
Genotyping has been performed in parallel using Illumina's 1M-Duo and 1.2M-Duo custom chips on different subsets of individuals. Before further filtering, there were 106 samples with call rate (CR)$0.90 on the 1.2M and 88 samples with CR$0.90 on the 1M chip. Combined intensity files were created for Illuminus [26] by retaining on a per-chromosome basis only SNPs common to both chips. Additionally, any SNPs that moved position between the two chips were removed. Following further quality checks (Hardy-Weinberg p.10 24 , MAF.1%), 865,544 SNPs were kept for analysis.
The overlapping set of successfully genotyped samples with available expression data amounted to 156 (LCL), 160 (skin) and 166 (fat) individuals.

Post-experimental normalization of gene expression data
Log 2 -transformed expression signals were normalized separately per tissue as follows: quantile normalization was performed across the 3 replicates of each individual followed by quantile normalization across all individuals.

Genotype-gene expression associations and multiple testing correction
The eQTL analysis was done separately for each tissue. Within each tissue, twins from the same pair were separated by id in two samples analyzed independently. This separation resulted in the following sample size for LCL, skin and fat respectively: Twin 1 (74, 76, 79) and Twin 2 (82,84,87). Associations between SNP genotypes and normalized expression values were conducted using Spearman Rank Correlation (SRC). We considered only SNPs in cis, i.e. within a 1MB window from the TSS. We assess the statistical significance of the nominal associations using permuta-tions as previously described [5]. We call an eQTL significant at 10 23 permutation threshold (PT) if the nominal association Pvalue is greater than the 0.001 tail of the minimal P-value distribution resulting from the SNP's associations with 10,000 permuted sets of expression values for each gene.

Factor analysis
We applied a Bayesian factor analysis model [25] to the expression data in each tissue. This approach uses an unsupervised linear model to account for global variance components in the data, and yields a residual expression dataset that can be used in further analysis.
We tested a wide range of parameter settings for the model, controlling the amount of variance explained by it. This was achieved by setting the parameters of the prior distributions for gene expression precision (inverse variance) and factor weight precision. These random variables are modelled using Gamma distributions, thus we varied their natural exponential family parameters -the prior mean and number of prior observations. We varied the prior mean from 10 26 to 10 22 , and number of prior observations from N*10 23 to N, where N is the number of observations from data, and learned 120 latent factors. In the subsequent analysis, we used for each tissue the residual dataset that gave the best eQTL overlap between the two twin samples. The prior values used for each dataset are given in Table S7. The eQTL analysis on the corrected expression data was performed identically to the standard analysis: SRC followed by permutation testing.

Proportion of true positives from p-value distribution
For quantifying eQTL replication and tissue sharing in a continuous way, we used Storey's QVALUE software [23] (implemented in the R package qvalue 1.20.0, default recommended settings). The program takes a list of p-values and computes their estimated p 0 -the proportion of features that are truly null -based on their distribution (the assumption used is that p-values of truly alternative cases tend to be close to zero, while pvalues of null features will be uniformly distributed among [0,1]). The quantity p 1 = 12p 0 estimates the lower bound of the proportion of truly alternative features, i.e. the proportion of true positives (TP). Replication and sharing between two samples is reported as the proportion of TP (p 1 ) estimated from the p-value distribution of independent eQTLs discovered in sample 1 in the second sample (exact SNP-probe combinations are tested).

Recombination hotspot interval mapping and LD filtering
We refined the eQTL signals in order to characterize likely independent effects per gene. For this purpose, we mapped all common autosomal SNPs to recombination hotspot intervals as defined by McVean et.al [27]. We map significant eQTLs to recombination hotspot intervals and save the most significant SNP per gene. For each gene, SNPs resulting from this mapping are in addition filtered for LD in a pairwise manner (for each pair with D9.0.5 the least significant SNP is ignored). This filtering ensures that true shared effects (interval-gene combinations) are compared and not just genes. Figure S1 Median expression values of tissue-specific genes in the tissue of discovery and the other two tissues. Tissue-specific effects are not restricted to genes expressed in a tissue-specific manner. 10.1371/journal.pgen.1002003.s001(TIFF) Figure S2 SNP6tissue interaction p-value from repeated measures ANOVA for all, shared and tissue-specific eQTLs respectively. Greater enrichment of significant SNP6tissue pvalues is observed for tissue-restricted effects. 10.1371/journal.pgen.1002003.s002(TIFF) Figure S3 eQTLs (10 22 PT, SRC) shared in all three tissues tested have the same direction of allelic effect (SRC rho) across tissues. 10.1371/journal.pgen.1002003.s003(TIFF) Figure S4 Cumulative SRC rho distribution across tissues for tissue-specific and shared eQTLs (10 23 PT, Twin1). eQTLs discovered in one tissue only have distinctively higher variance in the tissue of discovery compared to shared effects.