Population-scale proteome variation in human induced pluripotent stem cells

Version of Record

Accepted for publication after peer review and revision.


Version of Record published: August 25, 2020 (this version)
Accepted Manuscript updated: August 12, 2020
Accepted Manuscript published: August 10, 2020
Accepted: August 8, 2020
Received: March 30, 2020

Abstract

Human disease phenotypes are driven primarily by alterations in protein expression and/or function. To date, relatively little is known about the variability of the human proteome in populations and how this relates to variability in mRNA expression and to disease loci. Here, we present the first comprehensive proteomic analysis of human induced pluripotent stem cells (iPSC), a key cell type for disease modelling, analysing 202 iPSC lines derived from 151 donors, with integrated transcriptome and genomic sequence data from the same lines. We characterised the major genetic and non-genetic determinants of proteome variation across iPSC lines and assessed key regulatory mechanisms affecting variation in protein abundance. We identified 654 protein quantitative trait loci (pQTLs) in iPSCs, including disease-linked variants in protein-coding sequences and variants with trans regulatory effects. These include pQTL linked to GWAS variants that cannot be detected at the mRNA level, highlighting the utility of dissecting pQTL at peptide level resolution.

Introduction

Induced pluripotent stem cells (iPSC) hold great promise for advancing basic research and biomedicine. By enabling the in vitro reconstitution of development and cell differentiation, iPS cells allow the investigation of mechanisms underlying development and the aetiology of many forms of genetic disease. To realize this potential, it is essential to characterise genetic and non-genetic sources of variability of molecular and cellular phenotypes in human iPSCs.

Recently, multiple reference panels of human iPSC lines have been established (Kilpinen et al., 2017; Panopoulos et al., 2017; Carcamo-Orive et al., 2017), providing a valuable resource for functional experiments in pluripotent cells. These cell lines, together with associated data, have enabled the characterisation of variability in iPSC transcriptomes, identifying genetic and non-genetic determinants of expression variation, including expression quantitative trait loci (eQTL) (Kilpinen et al., 2017; Rouhani et al., 2014; DeBoever et al., 2017) in cis.

While RNA-centric analyses are informative for studying gene regulatory mechanisms at the transcriptional level, most cellular phenotypes ultimately involve downstream mechanisms that are mediated by proteins. Several proteogenomics studies, primarily in cancer (Zhang et al., 2014; Mertins et al., 2016), have underlined the relevance of protein measurements to interpreting how genomic changes act at the phenotypic level. Moreover, recent evidence has shown that genetic alterations can have effects on RNA that are attenuated at the protein level (Gonçalves et al., 2017; Roumeliotis et al., 2017). Vice versa, the mapping of protein quantitative trait loci (pQTL), predominantly in lymphoblast cell lines (Battle et al., 2015; Stark et al., 2014; Wu et al., 2013) and for the plasma proteome (Sun et al., 2018; Yao et al., 2018; Liu et al., 2015; Johansson et al., 2013; Lourdusamy et al., 2012), has revealed genetic effects on protein traits that do not manifest at the RNA level. However, the extent of RNA-independent protein regulation is not yet understood, with previous analyses performed only at gene-level resolution and, in some cases, without comparing protein and RNA data from the same cellular material.

Here, we report the first comprehensive, population-scale, combined proteomic and transcriptomic analysis of human iPSC lines. Our data provide matched quantitative proteomic (Tandem Mass Tag Mass Spectrometry) and transcriptomic (RNA-Seq) profiles of 202 iPSC lines, derived from 151 donors from the HipSci project (Kilpinen et al., 2017). We identify both genetic and non-genetic effects associated with variability in protein expression between individuals and describe the first high-resolution pQTL map in human iPSCs, including loci not detected as eQTLs at the RNA level.

Results

A population reference proteome for human iPSCs

A set of 217 iPSC lines from the HipSci project (Kilpinen et al., 2017), derived from 163 distinct donors, was selected for protein analysis, using material from the identical batches of cells that were used for RNA-Seq and other assays (Materials and methods). Quantitative mass spectrometry was carried out in batches of 10 lines, using tandem mass tagging (TMT, Thompson et al., 2003), with one common reference sample shared across batches (Brenes et al., 2018) (Materials and methods). Collectively, we identified 255,015 distinct (unmodified) peptide sequences, corresponding to 16,773 protein groups (groups of protein isoforms with no discriminating peptides; hereon denoted proteins) with median sequence coverage of 46%.

After quality control, 202 lines (from 151 donors) with matched genotype, RNA-Seq and proteome data were selected for further analysis (Figure 1a; Figure 1—figure supplement 1; Figure 1—source data 1). We identified 11,140 recurrently detected proteins (detected in at least 30 lines; Materials and methods), corresponding to 10,198 genes, and RNA expression for 12,363 protein-coding genes (population average TPM >1). Of these, 9013 protein-coding genes were detected at both the RNA and protein levels (Figure 1—source data 2 and 3).
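The two detection filters quoted above (a protein quantified in at least 30 lines; a gene with population-average TPM >1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names, the dictionary layout and the use of `None` for a missing quantification are all hypothetical.

```python
from typing import Dict, List, Optional

MIN_LINES_DETECTED = 30  # threshold quoted in the text
MIN_MEAN_TPM = 1.0       # threshold quoted in the text


def recurrently_detected(
    protein_values: Dict[str, List[Optional[float]]],
    min_lines: int = MIN_LINES_DETECTED,
) -> List[str]:
    """Return ids of proteins quantified (non-missing) in at least `min_lines` lines."""
    return sorted(
        pid
        for pid, values in protein_values.items()
        if sum(v is not None for v in values) >= min_lines
    )


def expressed_genes(
    tpm: Dict[str, List[float]], min_mean: float = MIN_MEAN_TPM
) -> List[str]:
    """Return ids of genes whose average TPM across lines exceeds `min_mean`."""
    return sorted(
        gene
        for gene, values in tpm.items()
        if values and sum(values) / len(values) > min_mean
    )


# Toy data (hypothetical ids and values, 40 lines per protein).
toy_proteins = {
    "P_A": [1.0] * 35 + [None] * 5,   # detected in 35 lines: kept
    "P_B": [1.0] * 10 + [None] * 30,  # detected in 10 lines: dropped
}
toy_tpm = {"G_A": [2.0, 3.0], "G_B": [0.5, 1.0]}

print(recurrently_detected(toy_proteins))
print(expressed_genes(toy_tpm))
```

Genes passing both filters would then be obtained by intersecting the two lists after mapping protein ids to gene ids.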

Figure 1 (with 6 supplements). Characterising variation in the iPSC proteome and transcriptome.

(a) Experimental design and assays considered in this study. Genotype, RNA-Seq and quantitative proteomics data were obtained from the same cell material of 202 iPSC lines derived from 151 unrelated donors. (b) Variance component analysis of RNA and protein abundances across genes, considering different technical and biological factors. Shown is the distribution of the fraction of variance explained by different factors (upper panel) across proteins, and the number of genes with substantial variance contribution for each factor (>20% contribution; lower panel). Also shown is the number of genes that retain greater than 20% contribution after adjusting for the effect of the corresponding RNA profiles on protein abundance (light blue; see Materials and methods). (c) Association of protein level with X chromosome inactivation (XCI) status across 110 female iPSC lines. Shown are lowess regression curves for the 322 and 312 proteins that were identified as significantly up-regulated (red) and down-regulated (blue), respectively, with loss of XCI in female iPSC lines (lower panel; 10% FDR). Selected gene ontology enrichments for these sets of proteins are shown (right-hand panel; Materials and methods). XCI status was estimated as the average fraction of allele-specific expression for the inactive chromosome across all chromosome X genes (Materials and methods). (d) Scatter plot of the fraction of variance explained by donor at the RNA (x-axis) versus protein (y-axis) level. Encoded in colour is the fraction of variance explained by donor effects at the protein level after adjusting for the effect of the corresponding RNA profiles on protein abundance (Materials and methods).

Figure 1—source data 1. HipSci proteomics iPSC lines. The public ids, TMT batch, donor, gender, age and growth media for the HipSci iPSC lines used in this study are shown. https://cdn.elifesciences.org/articles/57390/elife-57390-fig1-data1-v3.xls
Figure 1—source data 2. RNA gene level expression across the 202 lines for genes recurrently detected at the protein level. Lines are indexed by protein Ensembl gene Id. Columns are the line public names. https://cdn.elifesciences.org/articles/57390/elife-57390-fig1-data2-v3.xls
Figure 1—source data 3. Protein abundance values across the 202 and reference lines for genes recurrently detected at the protein level and with RNA expression (TPM >1). Lines are indexed by protein Uniprot Id. First 229 columns contain intensity values after quality line filtering, batch correction and quantile normalisation. Line names are encoded as follows: [line public name]@[TMT batch]@[TMT channel]. Last columns include protein information: 'gene_chromosome', 'gene_start', 'gene_end', 'ensembl_gene_id', 'gene_name', 'gene_strand', 'number_of_peptides', 'in_CORUM'. https://cdn.elifesciences.org/articles/57390/elife-57390-fig1-data3-v3.xls
Figure 1—source data 4. Protein and RNA variance components. Variance decomposition for 6009 genes with high RNA expression (TPM >1) and detected in lines at the protein level. https://cdn.elifesciences.org/articles/57390/elife-57390-fig1-data4-v3.xls
Figure 1—source data 5. Protein and RNA correlation with X chromosome inactivation. Correlation with XCI status of protein and RNA profiles for the 6336 genes (6406 proteins) with high RNA expression (TPM >1) and detected in all female lines at the protein level. https://cdn.elifesciences.org/articles/57390/elife-57390-fig1-data5-v3.xls
Figure 1—source data 6. Functional enrichment of genes with protein or RNA profiles correlated with XCI. This table enumerates the significant Gene Ontology terms and DNA regulatory motifs (FDR 0.05; fields 'source' and 'term_name') for different gene sets (fields 'molecular_layer' and 'change_direction'): 1) RNA positively correlated with XCI, 2) RNA negatively correlated with XCI, 3) proteins positively correlated with XCI and without RNA nominal significance, and 4) proteins negatively correlated with XCI and without RNA nominal significance. https://cdn.elifesciences.org/articles/57390/elife-57390-fig1-data6-v3.xls

Collectively, these data provide the most comprehensive analysis of the human iPSC proteome reported to date, and one of the most comprehensive proteomic datasets reported for any human primary, or derived, cell type (Supplementary file 1). When overlaying our data with the Human Proteome Map (Kim et al., 2014), the iPSC proteome was most similar to foetal and reproductive organs (Figure 1—figure supplement 2), which is consistent with the expected expression of pluripotency markers in these tissues (Kerr et al., 2008a; Kerr et al., 2008b). We also assessed differences between healthy donors and disease-bearing donors, identifying no systematic expression differences in the iPSC state (Supplementary file 2; Figure 1—figure supplement 3).

RNA and proteome variability

We assessed a range of factors to explain the variation in protein expression between iPSC lines. Leveraging our experimental design with data from two or more lines for 34% of donors, we assessed the effects of donor, alongside age, sex, and the contributions of technical and cell culture related factors (on 6009 genes; Figure 1b,d; Figure 1—source data 4; using a linear mixed model; Materials and methods). Overall, the fraction of variance explained by biological factors was lower for protein levels, compared to RNA variation, which points to higher assay noise and/or stochastic variability of protein abundance. Consistent with previous results using RNA data (Kilpinen et al., 2017), we identified donor genome (i.e. DNA sequence variation) as the most relevant factor, followed by culture medium (Figure 1b). Critically, however, significant donor effects remained after accounting for RNA variability (Figure 1b,d; Figure 1—figure supplement 4; Materials and methods). This indicates that (i) genetic differences between individual donors and experimental differences between culture conditions play an important role in causing the observed variation in proteome expression between the iPSC lines and (ii) post-transcriptional mechanisms also contribute to these effects. Notably, many of the proteins showing strong donor effects were previously identified as differentially expressed between reprogrammed iPS cells and embryonic stem cells (ESCs) (Phanstiel et al., 2011; Munoz et al., 2011), suggesting that some of these previously reported differences could be due to genetic variation between donors, rather than intrinsic differences between the iPSC and ESC cell types (Figure 1—figure supplement 5).
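The idea of a donor variance component can be illustrated with a much simpler estimator than the linear mixed model used in the study. The sketch below assumes a single factor (donor), a balanced design (the same number of lines per donor) and the classical one-way random-effects ANOVA estimator; it ignores age, sex, culture and technical factors entirely. The function name and the toy values are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def donor_variance_fraction(values_by_donor: Dict[str, List[float]]) -> float:
    """Fraction of variance attributed to donor for one gene or protein.

    Simplifying assumptions (not the study's model): donor is the only
    factor, and every donor contributes the same number (>= 2) of lines.
    """
    groups = list(values_by_donor.values())
    k = len(groups)
    n = len(groups[0]) if groups else 0
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need >= 2 donors, each with the same number (>= 2) of lines")

    grand_mean = mean(v for g in groups for v in g)
    # Mean squares between donors and within donors.
    ms_between = n * sum((mean(g) - grand_mean) ** 2 for g in groups) / (k - 1)
    ms_within = sum((v - mean(g)) ** 2 for g in groups for v in g) / (k * (n - 1))

    # Method-of-moments estimate of the donor variance, truncated at zero.
    var_donor = max((ms_between - ms_within) / n, 0.0)
    total = var_donor + ms_within
    return var_donor / total if total > 0 else 0.0


# Toy example: lines from the same donor agree closely, donors differ strongly.
toy = {"donor_1": [1.0, 1.2], "donor_2": [3.0, 3.2], "donor_3": [5.0, 5.2]}
print(donor_variance_fraction(toy) > 0.2)  # would count as a "substantial" donor component
```

Under this simplification, a gene would be flagged as having a substantial donor component when the returned fraction exceeds the 20% cut-off used in Figure 1b.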

The sex of the donor affected proteome expression, including a subset of proteins uniquely encoded on the male-specific Y chromosome. There was also a strong (i.e. >20%) sex-related effect on the expression of a subset of 88 proteins (Figure 1b), which are enriched for proteins encoded on the X chromosome (Odds ratio = 24.8, PV = 3×10⁻³², Fisher’s exact test). This reflected the partial erosion of X chromosome inactivation (XCI) observed in a subset of iPSC lines derived from female donors, as confirmed both by quantification of allele-specific expression and XIST expression in these lines (Figure 1—figure supplement 6). Incomplete XCI has been linked previously to poor iPSC differentiation and changes in RNA levels (Mekhoubad et al., 2012; Salomonis et al., 2016). However, our data provide the first opportunity to link the XCI status of 110 distinct female iPSC lines (as inferred from allele-specific expression; Materials and methods; Figure 1—figure supplement 6), with changes in the abundance of both proteins and RNAs. This identified 1374 genes for which either protein, or RNA levels, or both, showed changes correlated with XCI status (Figure 1c, FDR < 10%; Figure 1—source data 5). Further analysis indicated that XCI status preferentially impacts catabolic processes and mitochondria at the protein level, while this was not observed at the RNA level (Gene Ontology; Materials and methods; Figure 1c; Figure 1—source data 6). These data thus reveal an important effect of XCI status on global gene expression in iPSC lines from female donors, including specific effects at the protein level that are not detected by transcriptomic analysis.
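One way to picture an XCI status score derived from allele-specific expression is sketched below. It assumes, as a simplification that is not taken from the Materials and methods, that the less-expressed allele at each heterozygous chromosome X gene comes from the inactive chromosome, so the score is near 0 with intact XCI and approaches 0.5 with full erosion. The function name and the `min_reads` coverage filter are hypothetical.

```python
from typing import Iterable, Tuple


def xci_erosion_score(
    allele_counts: Iterable[Tuple[int, int]], min_reads: int = 10
) -> float:
    """Average fraction of reads from the less-expressed allele across X genes.

    `allele_counts` holds (allele_1_reads, allele_2_reads) per heterozygous
    chromosome X gene. Genes with fewer than `min_reads` total reads are
    skipped (hypothetical coverage filter).
    """
    fractions = []
    for reads_1, reads_2 in allele_counts:
        total = reads_1 + reads_2
        if total >= min_reads:
            fractions.append(min(reads_1, reads_2) / total)
    if not fractions:
        raise ValueError("no gene passed the coverage filter")
    return sum(fractions) / len(fractions)


# Toy lines (hypothetical read counts).
intact_line = [(100, 0), (50, 0), (3, 2)]  # last gene is filtered out by coverage
eroded_line = [(50, 50), (30, 30)]

print(xci_erosion_score(intact_line))
print(xci_erosion_score(eroded_line))
```

A per-line score of this kind could then be correlated with protein or RNA abundance across the female lines, which is the type of association summarised in Figure 1c.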

Mapping cis genetic effects on protein abundance

Next, we mapped cis quantitative trait loci at both the RNA and protein levels, considering 8543 autosomal protein coding genes that were quantified at both levels (MAF >5%; within ±250 kb of the gene boundaries; using a linear mixed model; Materials and methods). The number of pQTLs identified was greatly increased by adapting the PEER-based adjustment, which was previously developed for mapping of eQTLs (Stegle et al., 2012), for use with proteomic data (Materials and methods; Figure 2—figure supplement 1). Across all autosomal genes, we report 654 genes with at least one cis pQTL and 3487 genes with a cis eQTL (FDR < 10% for both eQTL and pQTL mapping; lead variants only; Figure 2a; Figure 2—source data 1 and 2). Among these, 273 genes were shared and had identical or correlated lead variants, whereas 82 genes showed evidence for an eQTL and pQTL with independent lead variants (LD-based criterion, r² <0.1, Figure 2—source data 1). Genes with substantial donor components, as identified based on the variance component analysis (>20%; Figure 1b), were enriched for significant cis pQTL (215 genes out of 962; Odds ratio = 4.2; Figure 2—figure supplement 2).
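The selection of candidate variants for cis mapping (MAF >5%, within 250 kb of the gene boundaries) can be sketched as a simple filter. Only the two thresholds come from the text; the `Variant` and `Gene` records, the function name and all coordinates below are hypothetical, and the association test itself (a linear mixed model in the study) is not shown.

```python
from dataclasses import dataclass
from typing import Iterable, List

CIS_WINDOW_BP = 250_000  # window around the gene boundaries, from the text
MIN_MAF = 0.05           # minor allele frequency threshold, from the text


@dataclass(frozen=True)
class Variant:
    id: str
    chrom: str
    pos: int
    maf: float


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int
    end: int


def cis_candidates(
    gene: Gene,
    variants: Iterable[Variant],
    window: int = CIS_WINDOW_BP,
    min_maf: float = MIN_MAF,
) -> List[Variant]:
    """Keep common variants on the gene's chromosome within the cis window."""
    return [
        v
        for v in variants
        if v.chrom == gene.chrom
        and v.maf > min_maf
        and gene.start - window <= v.pos <= gene.end + window
    ]


# Hypothetical gene and variants.
gene = Gene("GENE_A", "1", 1_000_000, 1_050_000)
variants = [
    Variant("v_in", "1", 800_000, 0.20),        # inside the upstream window
    Variant("v_far", "1", 700_000, 0.20),       # more than 250 kb upstream
    Variant("v_rare", "1", 1_010_000, 0.01),    # fails the MAF filter
    Variant("v_other", "2", 1_010_000, 0.30),   # different chromosome
    Variant("v_edge", "1", 1_300_000, 0.10),    # exactly at the window edge
]
print([v.id for v in cis_candidates(gene, variants)])
```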

Figure 2 (with 4 supplements). Human iPSC *cis* protein and RNA QTLs.

(a) Number of genes with a protein (blue) or RNA (green) *cis* QTL (FDR < 10%) and pairwise replication of genetic effects. Left: Number of genes with a pQTL, either with (dark blue) or without (light blue) replicated RNA effect. Right: Number of genes with an eQTL, either with (dark green) or without (light green) replicated protein effect. Replication defined by assessing nominal significance (PV <0.01) of QTL in the respective other layer. (b) Local Manhattan plots displaying negative log p-values (PV) from *cis* RNA (top) and protein (bottom) QTL mapping for *PEX6*. The dashed line and the grey box indicate the genomic positions of the lead QTL and of the gene. Boxplots show RNA and protein expression for different alleles at the pQTL lead variant rs11752813, a variant in LD (r² = 1, 1000 Genomes European populations phase 3) with the Alzheimer risk variant rs1129187 (Jun et al., 2016) (OR 1.13). (c) Cumulative fraction of eQTLs with replicated protein effects as a function of the eQTL effect size (from highest to lowest). (d) Prediction of protein replication of eQTLs, considering features derived from gene annotations, eQTL, RNA and protein data. Predictions were obtained using a random forest model trained on the protein replication status of eQTL (as in a; Materials and methods). Left: Feature importance scores. Right: Precision-recall curve for the model, evaluated in independent test fractions. The model performance was assessed by random sampling of training/testing data with an 80/20 split, performed 50 times. Shown in red is the average precision-recall across all sampled training/test splits and in thin grey lines results of individual folds.

Figure 2—source data 1. pQTL_results. The list of significant (FDR < 10%) genes with a pQTL provided as a supplementary file. Data fields are described in the table below. https://cdn.elifesciences.org/articles/57390/elife-57390-fig2-data1-v3.xls
Figure 2—source data 2. eQTL_results. Reported are genes with a significant (FDR < 10%) QTL. The table consists of variants mapped at the RNA level, at gene resolution, for genes detected at both RNA and protein levels. This table includes the features used in the prediction of the pQTL status. The table columns are analogous to Figure 2—source data 1 pQTL_results. https://cdn.elifesciences.org/articles/57390/elife-57390-fig2-data2-v3.xls

To identify DNA sequence variants with effects at both the RNA and protein levels, we considered the pairwise replication of pQTLs at the RNA level and vice versa (lead QTL variants, defining ‘replication’ as nominal PV <0.01 with consistent effect direction; Materials and methods), which is more sensitive than assessing overlapping QTL variants. This identified 473 pQTLs (72%) with replicated eQTL effects. Conversely, 893 eQTLs (26%) had replicated protein effects, with globally concordant effect size directions and distance from gene boundaries (Figure 2—figure supplement 3).
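The replication criterion described above (nominal PV <0.01 in the other molecular layer, with a consistent effect direction) and the quoted replication rates can be written out as a short sketch. The function names are hypothetical; the threshold and the counts are taken from the text.

```python
NOMINAL_ALPHA = 0.01  # nominal significance threshold quoted in the text


def is_replicated(
    lead_beta: float,
    other_beta: float,
    other_pvalue: float,
    alpha: float = NOMINAL_ALPHA,
) -> bool:
    """True if the lead QTL variant is nominally significant in the other
    layer and both effects have the same sign."""
    same_direction = lead_beta * other_beta > 0
    return other_pvalue < alpha and same_direction


def replication_rate(n_replicated: int, n_total: int) -> float:
    """Percentage of QTLs that replicate in the other layer."""
    return 100.0 * n_replicated / n_total


# Counts quoted in the text.
print(round(replication_rate(473, 654)))   # pQTLs with replicated eQTL effects
print(round(replication_rate(893, 3487)))  # eQTLs with replicated protein effects
```

The two printed values round to the 72% and 26% reported above. Requiring a consistent effect direction means that a variant with a nominally significant but opposite-sign effect in the other layer is not counted as replicated.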

Lack of replication of eQTLs at the protein level could arise from a combination of technical and/or biological factors. We identified the eQTL effect size as a strong predictor for protein replication, with larger effects being associated with increased replication rates (Figure 2c). To systematically characterise the determinants of eQTL replication, we considered a random forest model trained to predict the protein replication status (Figure 2d). In addition to the eQTL effect size, this identified other predictive factors, including the protein coefficient of error (estimated from technical replicate samples; Materials and methods) and the protein coefficient of variation across lines (Figure 2d; Figure 2—source data 2).

To explore the physiological relevance of iPSC pQTL variants, we examined their overlap with variants identified in genome-wide association studies (GWAS). Specifically, we probed for QTLs that tag GWAS variants contained in the GWAS catalogue (MacArthur et al., 2017) (i.e. are in LD r² >0.8), identifying 136 (of 654) pQTL signals that tag a known GWAS variant (Figure 2—source data 1). In addition, we assessed the statistical evidence for co-localisation of pQTL and GWAS signals for 51 studies for which full summary statistics were obtained (using eCAVIAR; Hormozdiari et al., 2016; Materials and methods; Figure 2—source data 1), yielding 49 pQTLs with evidence of co-localisation (i.e. cumulative co-localisation probability greater than 0.1). Among these, examples of pQTLs with corresponding effects at the RNA level include the variant rs7872034, a pQTL for SMC2 with co-localisation evidence for serous invasive ovarian cancer (Phelan et al., 2017), and the variant rs11752813, a pQTL for PEX6 that is in LD with rs1129187, a risk variant for Alzheimer's disease in APOE e4+ carriers (Jun et al., 2016; Figure 2—figure supplement 4; Figure 2b).
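The LD criteria used in this section (r² >0.8 for tagging a GWAS variant, r² <0.1 for independent lead variants) rest on the squared correlation between two variants. The sketch below computes r² from unphased genotype dosages (0, 1 or 2 copies of the alternative allele), which approximates, but is not identical to, haplotype-based LD from a reference panel such as 1000 Genomes. The function names and genotype vectors are hypothetical.

```python
from typing import Sequence

TAGGING_R2 = 0.8      # tagging threshold quoted in the text
INDEPENDENT_R2 = 0.1  # independence threshold quoted in the text


def ld_r2(g1: Sequence[float], g2: Sequence[float]) -> float:
    """Squared Pearson correlation between two genotype dosage vectors."""
    if len(g1) != len(g2) or len(g1) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    var1 = sum((a - m1) ** 2 for a in g1)
    var2 = sum((b - m2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        raise ValueError("monomorphic variant: r2 is undefined")
    return cov * cov / (var1 * var2)


def tags(g_qtl: Sequence[float], g_gwas: Sequence[float]) -> bool:
    """True if the QTL variant tags the GWAS variant at r2 > 0.8."""
    return ld_r2(g_qtl, g_gwas) > TAGGING_R2


# Toy genotypes for four individuals (hypothetical).
qtl_variant = [0, 1, 2, 0]
linked_variant = [2, 1, 0, 2]    # same information with the alleles swapped
unlinked_variant = [1, 2, 1, 0]

print(tags(qtl_variant, linked_variant))
print(ld_r2(qtl_variant, unlinked_variant) < INDEPENDENT_R2)
```

Because r² is a squared correlation, it does not depend on which allele is counted, which is why the allele-swapped variant is still in perfect LD.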

Notably, for 33 pQTLs linked to GWAS variants, either via co-localisation or LD tagging, no replicated effect was identified at the RNA level, suggesting protein-specific regulation (Figure 2—source data 1). For example, rs11601507 has no RNA effects, and is associated with TRIM5 protein abundance and with coronary artery disease risk (van der Harst and Verweij, 2018; Figure 2—figure supplement 4). Such cases raise the question of the mechanisms by which these variants modulate protein abundance and, ultimately, phenotypic traits, as addressed below.

pQTL linked to isoform-specific transcript expression

To investigate the mechanisms that underlie discordant eQTLs and pQTLs in more detail, we performed transcript isoform and protein peptide QTL analyses. cis QTL mapping of 33,050 reference transcript isoforms (Zerbino et al., 2018) (quantified using Salmon; Patro et al., 2017; Materials and methods) and 119,747 peptides identified 3810 genes with a transcript QTL (tQTL) and 566 genes with a peptide QTL (pepQTL), respectively (Figure 3—figure supplement 1, Materials and methods, Figure 3—source data 1 and 2).

Transcript-level QTL mapping could explain the lack of protein effects for a small fraction of the 2594 eQTL without a replicated pQTL effect (Figure 2a). For 48 of these, the eQTL variant was identified as exclusively associated with abundance changes of non-coding transcript isoforms (nominal PV <0.01), which explains the absence of protein effects (Figure 3—figure supplement 2). Furthermore, when considering 1262 transcript QTL that neither replicate at the eQTL, nor at the pQTL level, in 45 instances we observed consistent replication when considering peptide QTL (Figure 3—figure supplement 2b).

Among 181 pQTLs without eQTL replication (Figure 2a), 61 had nominally significant transcript QTLs (PV <0.01; Figure 3a). For 12 of these, including a pQTL for MMAB (Figure 3b), we observed genetic effects with opposite directions on coding and non-coding transcript isoforms, which explains the lack of genetic effects when considering gene-level RNA abundance.

Figure 3 (with 4 supplements). Putative mechanisms of pQTL from transcript isoform regulation and protein-altering variants.

(a) Categorisation of 654 pQTL into four classes according to their putative mechanism: gene expression effect (i.e. replicated at eQTL level), transcript-isoform specific effect (i.e. not replicated at eQTL level, but significant at transcript isoform level), protein-altering variant (i.e. at least one inframe variant in LD with lead pQTL variant) without expression effect at RNA level, and without any putative mechanism identified. (b) Example pQTL without eQTL replication (rs6663; gene *MMAB*), with a directional opposite effect on a coding and non-coding isoform (cyan: ENST00000540016; grey: ENST00000537496), resulting in no overall change in gene expression level. (c) The pQTL variant (rs1051061) is a protein-altering variant associated with VRK2 protein abundance (below), and lacks detectable effect on RNA expression. The pQTL signal is observed across 15 peptides spanning the VRK2 protein sequence (above, left). This variant is associated with schizophrenia risk, and is located at the kinase active site, proximal to the proton acceptor residue (above, right). The dashed line and the grey box indicate the genomic positions of the lead QTL and of the gene. (d) Enrichment of RNA-independent pQTL in different categories of predicted variant effects, using gene variants in high LD with pQTLs (proxy gene variants; r² >0.8; within the *cis* gene boundaries). Enrichment calculated using Fisher’s exact test.

Figure 3—source data 1. tQTL_results. The table consists of variants mapped at the RNA level, at transcript isoform resolution, for genes detected at both RNA and protein levels. The table columns are analogous to Figure 2—source data 1 pQTL_results. https://cdn.elifesciences.org/articles/57390/elife-57390-fig3-data1-v3.xls
Figure 3—source data 2. pepQTL results. The table consists of variants mapped at the protein level, at peptide resolution, for genes detected at both RNA and protein levels. The table columns are analogous to Figure 2—source data 1 pQTL_results. https://cdn.elifesciences.org/articles/57390/elife-57390-fig3-data2-v3.xls

pQTL arising from protein-altering variants

Next, we set out to characterise further the remaining 120 pQTL without replication at either eQTL, or transcript QTL levels. When classifying the corresponding lead pQTL variants based on their predicted functional effect, we identified 24 inframe variants, a striking enrichment compared to pQTL with replicated RNA effects (3.8-fold enrichment; PV = 4×10⁻⁵, Fisher’s exact test; Figure 3c and d). These findings are in line with previous observations in lymphoblast cell lines (Battle et al., 2015). Of note, peptides containing protein-altering variants were excluded from the quantifications (Materials and methods), and the reported pQTL effects were observed for multiple peptides from the same proteins (Figure 3—figure supplement 3), providing further confidence in genuine regulatory effects. We assessed whether the 24 pQTL have effects at the RNA level (eQTL) in other cell types, and for 11 of these pQTL we did not find evidence of nominally significant eQTL effects in any of the 48 GTEx tissues (PV <0.01/48; Battle et al., 2017; Figure 1—source data 1), which further points to RNA-independent mechanisms.
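The stepwise categorisation of the 654 pQTLs described in this and the preceding section (and summarised in Figure 3a) can be followed with a short sketch. The order in which the criteria are applied mirrors the order of the text and is an assumption, as are the function and label names. The counts 473, 61 and 24 are quoted in the text; the size of the last class is obtained here by subtraction and is not stated in the source.

```python
from collections import Counter


def classify_pqtl(
    eqtl_replicated: bool, tqtl_nominal: bool, tags_inframe_variant: bool
) -> str:
    """Assign a pQTL to one putative mechanism, checked in the order
    used in the text (assumed precedence)."""
    if eqtl_replicated:
        return "gene expression effect"
    if tqtl_nominal:
        return "transcript-isoform specific effect"
    if tags_inframe_variant:
        return "protein-altering variant"
    return "no putative mechanism identified"


TOTAL_PQTL = 654
counts = Counter(
    {
        "gene expression effect": 473,              # replicated at eQTL level
        "transcript-isoform specific effect": 61,   # of the 181 without eQTL replication
        "protein-altering variant": 24,             # of the remaining 120
    }
)
# Derived by subtraction, not quoted in the source.
counts["no putative mechanism identified"] = TOTAL_PQTL - sum(counts.values())

for label, n in counts.items():
    print(f"{label}: {n} ({100 * n / TOTAL_PQTL:.0f}%)")
```

The intermediate totals agree with the text: 654 − 473 = 181 pQTLs without eQTL replication, and 181 − 61 = 120 without replication at either level.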

Inframe variants have the potential to affect protein function. We estimated whether a variant is likely to be deleterious to protein function using SIFT scores, which capture evolutionary conservation and amino acid similarity (Ng and Henikoff, 2003). This revealed a clear enrichment of the 24 RNA-independent pQTL that tag inframe variants, 10 of which have predicted deleterious effects (SIFT score <0.05), compared to four among all other pQTL (Odds ratio = 27.5, PV = 3.8×10⁻⁸, Fisher’s exact test; Figure 3d; Figure 1—source data 1). Putative effects of these variants on protein function include loss of enzymatic activity and disruption of protein structure. For example, the variant rs1051061 in VRK2 lies in a conserved sequence in the kinase domain, proximal to the proton acceptor residue, likely impacting kinase activity (Figure 3c). The identical variant has been identified as GWAS risk variant for schizophrenia (Yu et al., 2017) (OR 1.17), with the risk allele being associated with decreased protein abundance. The effect direction is consistent with previous studies that have linked decreased VRK2 expression to neurological disorders including schizophrenia (Azimi et al., 2018; Tesli et al., 2016).

These data show important roles of transcriptional regulation underlying cis pQTL effects, while also highlighting how isoform-specific effects, which are invisible to standard eQTL mapping approaches, can be detected at the protein level. For a substantial subset of pQTLs, we identified linked protein-altering variants, many with deleterious effects. Together with previous observations, these results suggest that proteomics information can aid understanding of pathogenic mechanisms of deleterious variants.

Proteome-wide effects of cis QTLs

Building on the compendium of cis pQTL identified here in iPS cells, we set out to characterise downstream proteome-wide changes. We mapped proteome-wide trans pQTL, considering 654 cis pQTL variants. This identified 51 cis-pQTL lead variants with trans effects on a total of 68 proteins (FDR < 10%; Figure 4—source data 1; Materials and methods). To rule out synthetic associations, we discarded associations with evidence for sequence similarity between cis and trans proteins, and we verified the consistency of the identified trans effects across multiple independent peptides (Materials and methods; Figure 3—figure supplement 3). The detected pairs of proteins with shared genetic regulation were strongly enriched for known protein-protein interactions (CORUM, Ruepp et al., 2010; IntAct, Orchard et al., 2014; StringDB, Szklarczyk et al., 2017; Odds ratio = 9.1, PV = 1.5×10⁻¹⁰, Fisher’s exact test; Figure 4b). The cis and trans effects had similar effect directions and effect sizes, consistent with genetic effects mediated via stabilising protein-protein interactions (Figure 4c). This interpretation of our data in human iPSCs is consistent with the significant donor variance component we observed for many protein complexes (Figure 4d). It is also consistent with previous observations in an outbred mouse cross, showing that protein modules sharing genetic effects in trans are enriched in protein interactions (Ruepp et al., 2010), and identification of trans protein effects due to somatic aberrations in human cancer cell lines (Gonçalves et al., 2017; Roumeliotis et al., 2017). Importantly, our results generalise these previous observations to genetic effects of common variants that segregate in human populations.
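Enrichments of this kind are reported as an odds ratio with a p-value from Fisher's exact test on a 2×2 contingency table. The sketch below shows both quantities for a one-sided (enrichment) test using only the standard library. The underlying table for the protein-protein interaction enrichment is not given in the text, so the counts in the example are hypothetical, and the source does not state whether its tests were one- or two-sided.

```python
from math import comb


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio for the 2x2 table [[a, b], [c, d]]."""
    if b == 0 or c == 0:
        raise ZeroDivisionError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test p-value for enrichment.

    Probability, under the hypergeometric null with fixed margins, of
    observing a top-left count of at least `a`.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(n, col1)


# Hypothetical table: rows = pairs linked / not linked by a trans pQTL,
# columns = pairs with / without a catalogued protein-protein interaction.
a, b, c, d = 3, 1, 1, 3
print(odds_ratio(a, b, c, d))
print(fisher_exact_greater(a, b, c, d))
```

Exact integer arithmetic via `math.comb` keeps the p-value accurate even for the very small values reported in the text, where a floating-point product of factorials would underflow or lose precision.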

Figure 4. *Trans* effects on the iPSC proteome.

(a) Strategy for mapping *trans* genetic effects on protein abundance. Lead *cis* pQTL variants were considered for proteome-wide association analysis. (b) Enrichment of previously catalogued protein-protein interactions among significant *trans* pQTLs. Shown is the fraction of *cis-trans* gene pairs linked by a *trans* pQTL with evidence of protein-protein interactions (based on the union of CORUM, IntAct, and StringDB), as a function of the considered FDR threshold for *trans* pQTL discovery. The dashed lines correspond to FDR < 10%. Numbers indicate the number of *trans* pQTL identified for each FDR threshold. (c) Comparison of genetic effect sizes, in *cis* and *trans*, for significant (FDR < 10%) *trans* pQTLs. Red points indicate *cis-trans* pairs with evidence for protein-protein interactions defined as in b. (d) Left: Protein co-expression of protein complex subunits defined based on CORUM. Right: i) subunit with the most significant *cis* pQTL; ii) fraction of subunits in association with the *cis* pQTL at nominal significance (PV <0.01); iii) fraction of the average cluster protein expression level explained by donor effects. (e) *Trans* regulation of the PEX26-PEX6-PEX1 complex. The variant *rs11752813* (LD r² = 1 with *rs1129187*) is associated in *cis* with changes in the RNA and protein abundance of PEX6 and in *trans* with changes in the protein abundance of PEX1 and PEX26.

Figure 4—source data 1. trans-pQTL_results. Reported are the trans pQTL (FDR < 10%). https://cdn.elifesciences.org/articles/57390/elife-57390-fig4-data1-v3.xls

In summary, the trans effects we detected appear to induce strong correlations across protein complex subunits (Figure 4d), whereby a variant associated in cis with one subunit was also associated in trans with other subunits. This is illustrated by PEX26-PEX6-PEX1, a protein complex involved in peroxisome biogenesis. As noted above, the underlying pQTL variant rs11752813 is associated in cis with an increase in both PEX6 RNA and protein abundance (Figure 2b) and is in perfect LD (r² = 1) with rs1129187, a known risk variant for Alzheimer's disease in APOE e4+ carriers (Jun et al., 2016). This cis pQTL in turn induces downstream associations on the remaining complex subunits, PEX26 and PEX1 (Figure 4e), suggesting that PEX6 acts as a limiting subunit of this complex in iPSCs. Thus, our results provide a potential biological mechanism underlying this risk variant, namely acting through changes in the abundance of the PEX26-PEX6-PEX1 complex. Notably, there is prior evidence for an implication of peroxisomal function in the development of Alzheimer’s disease and in other neurodegenerative processes (Lizard et al., 2012; Berger et al., 2016), providing further support for this hypothesis.

Discussion

Here, we report the first in-depth characterisation of the human iPSC proteome, connecting genetic variation to changes in RNA and protein levels. Beyond the relevance for iPS cell biology, this study, to our knowledge, provides the most detailed population-level analysis of parallel RNA/protein profiles in human cells. By quantifying genome-wide protein and transcript expression variation across more than 200 human iPSC lines, we identified both genetic and non-genetic mechanisms that underlie variation in both protein and RNA levels. We have mapped more than 600 cis protein quantitative trait loci (pQTLs) and analysed how these relate to cis eQTLs, how they impact other proteins in trans, and how pQTLs link to human disease variants.

The variance component analysis explained a lower overall fraction of variance in the protein data compared to RNA variation, which likely reflects larger technical effects and/or stochasticity in protein expression levels. Among the explainable fraction of variance, donor-specific genetic factors are a major contributor to the differences in protein expression detected across the iPSC lines. The corollary is that protein expression variation across iPS cells reflects genetic diversity in the human population. Consistent with this, we identified 654 common genetic variants associated with changes in protein abundance.
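As a rough illustration of what a donor variance component measures, the sketch below estimates the fraction of variance in one protein's abundance attributable to donor, using lines from the same donor as replicates. It is a simplified method-of-moments estimator (one-way random-effects ANOVA), assumed here for illustration only; it is not the mixed-model analysis used in this study, and the function name and input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def donor_variance_fraction(values, donors):
    """Estimate the fraction of variance explained by donor for one protein.

    values: abundance of the protein in each iPSC line
    donors: donor label for each line (same order as values)
    """
    groups = defaultdict(list)
    for value, donor in zip(values, donors):
        groups[donor].append(value)

    n_lines = len(values)
    n_donors = len(groups)
    if n_donors < 2 or n_lines <= n_donors:
        raise ValueError("need >= 2 donors and at least one donor with several lines")

    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    ms_between = ss_between / (n_donors - 1)
    ms_within = ss_within / (n_lines - n_donors)

    # Effective number of lines per donor, which handles unbalanced designs.
    n0 = (n_lines - sum(len(g) ** 2 for g in groups.values()) / n_lines) / (n_donors - 1)

    # Negative estimates are truncated at zero.
    var_donor = max((ms_between - ms_within) / n0, 0.0)
    total = var_donor + ms_within
    return var_donor / total if total > 0 else 0.0
```

For example, `donor_variance_fraction([1.0, 1.0, 3.0, 3.0], ["A", "A", "B", "B"])` attributes all variance to donor, whereas lines that differ as much within donors as between them give a fraction of zero.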

Globally, there were substantially fewer pQTLs than eQTLs, and while most pQTLs had effects in the same direction as the corresponding eQTLs, only 30% of eQTLs were nominally significant at the protein level. Technical factors arising from the protein measurement methods may contribute, at least in part, to attenuating the signal detected at the protein level. However, considering our data alongside results from previous studies, some of which employed protein detection technologies other than the MS methods used here, we suggest that the signal attenuation between eQTL and pQTL levels is not exclusively the result of limitations in protein measurements. Instead, many eQTLs may reflect variation in RNA abundance that does not cause significant changes in steady state protein levels.
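The two summary statistics discussed here, the share of eQTLs that replicate at the protein level and the share with concordant effect direction, can be computed as in the following sketch. The input layout, the function name, and the nominal threshold of 0.01 are assumptions made for illustration, and the toy effect sizes are invented.

```python
def replication_summary(pairs, alpha=0.01):
    """Summarise how eQTL effects carry over to the protein level.

    pairs: iterable of (eqtl_beta, protein_beta, protein_pvalue), one entry
           per eQTL variant-gene pair tested against protein abundance
    alpha: nominal significance threshold (assumed value)
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no eQTL-protein pairs supplied")
    n_significant = sum(1 for _, _, p in pairs if p < alpha)
    # Effects agree in direction when the two betas share a sign.
    n_same_direction = sum(1 for b_rna, b_prot, _ in pairs if b_rna * b_prot > 0)
    return {
        "replicated": n_significant / len(pairs),
        "same_direction": n_same_direction / len(pairs),
    }


# Toy input with invented effect sizes and p-values.
toy_pairs = [
    (0.5, 0.3, 0.001),
    (0.4, 0.1, 0.2),
    (-0.6, -0.2, 0.5),
    (0.3, -0.1, 0.8),
]
toy_summary = replication_summary(toy_pairs)
```

Reporting both numbers separately matters because an effect can be directionally consistent without reaching nominal significance, which is the pattern expected when the protein-level signal is attenuated rather than absent.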

Through systematic comparison of matched protein and RNA data, including detailed analysis of separate isoforms, we demonstrated that isoform-resolution protein and RNA phenotypes are indispensable for fully understanding the propagation of genetic effects to proteins. In particular, this approach identified additional RNA-dependent regulation that manifests in protein QTL, thereby improving the ability to identify genuine RNA-independent pQTL.

We showed that pQTLs for which no corresponding changes in transcript levels were detected are enriched in deleterious missense variants. This result suggests that the phenotypic effects of such variants may be exerted through changes in protein abundance. Because most deleterious variants, and in particular pathogenic variants, are rare, larger sample sizes will be required to fully assess the protein components of this class of regulatory genetic effects.

Our study presents the first comprehensive map of pQTLs at peptide resolution, considering a total of 119,747 peptides from 8543 proteins for genetic analysis. This identified 566 peptide QTL (pepQTL), several of which were not detectable when considering whole protein expression levels, as illustrated by the variant rs12795503, a pepQTL for the gene CTTN (Figure 3—figure supplement 1). While we mapped fewer significant pepQTLs than pQTLs, peptide level analyses were shown here to overcome potential artefacts arising from protein quantification, in particular when mapping trans pQTL, and are invaluable in identifying isoform-specific effects.

Our data highlight the ability of protein-protein interactions to propagate genetic effects in human populations. A long-standing hypothesis has been that certain protein complexes may have a rate-limiting subunit that determines complex abundance, with any excess subunits produced being rapidly degraded (e.g. because of exposure of hydrophobic residues). This implies that cis eQTLs affecting the levels of rate-limiting subunits should also have effects in trans on the abundance of the whole complex, and on most, if not all, subunits therein. While trans genetic effects were previously reported to be mediated by protein interactions in high heterozygosity samples, that is, outbred mice (Chick et al., 2016), and for somatic aberrations in cancer cell lines (Gonçalves et al., 2017; Roumeliotis et al., 2017), to our knowledge this study provides the first example that such effects act through common genetic variants in untransformed human cells. In the future, the approach we have taken here could be extended by Mendelian randomisation-based approaches to formally assess the mediating role of the cis pQTL on protein complex members.
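One simple form such a Mendelian randomisation analysis could take is the Wald ratio, which divides the variant's trans effect on a complex member by its cis effect on the putative limiting subunit. The sketch below is an assumption about how this extension might look, not an analysis performed in this study; the function name and the example effect sizes are hypothetical, and the standard error uses a second-order delta-method approximation that ignores covariance between the two estimates.

```python
import math


def wald_ratio(beta_cis, se_cis, beta_trans, se_trans):
    """Wald ratio estimate of the effect of the cis protein on the trans protein.

    beta_cis, se_cis:     variant effect on the cis protein and its standard error
    beta_trans, se_trans: variant effect on the trans protein and its standard error
    """
    if beta_cis == 0:
        raise ValueError("the instrument must have a non-zero cis effect")
    estimate = beta_trans / beta_cis
    # Delta-method standard error, assuming independent effect estimates.
    variance = (se_trans ** 2) / (beta_cis ** 2) + (
        (beta_trans ** 2) * (se_cis ** 2) / (beta_cis ** 4)
    )
    return estimate, math.sqrt(variance)


# Invented effect sizes, for illustration only.
estimate, std_error = wald_ratio(beta_cis=0.5, se_cis=0.05, beta_trans=0.25, se_trans=0.05)
```

With a single instrument the estimate cannot distinguish mediation from horizontal pleiotropy, so in practice it would need to be combined with colocalisation or additional independent variants.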

Understanding the mechanisms through which genetic variation acts in the human population is of great relevance to characterising risk factors and susceptibility to disease. There is ongoing interest in the potential for studying disease mechanisms using disease-relevant tissues derived from panels of iPSCs (Cayo et al., 2017; Li et al., 2018; D'Aiuto et al., 2014; Schwartzentruber et al., 2018). This study provides important information showing how direct analysis of human iPSCs can advance our understanding of the genetic regulation of protein expression and how this influences cell phenotypes and disease mechanisms.

Materials and methods

Key resources table

Reagent type (species) or resource	Designation	Source or reference	Identifiers
Cell line (Homo sapiens)	iPSC	www.hipsci.org	RRID:SCR_003909
Software, algorithm	MaxQuant	https://www.maxquant.org/	RRID:SCR_014485
Software, algorithm	Trim Galore	https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/	RRID:SCR_011847
Software, algorithm	STAR	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3530905/	RRID:SCR_015899
Software, algorithm	Salmon	https://combine-lab.github.io/salmon/	NA
Software, algorithm	g:Profiler	http://biit.cs.ut.ee/gprofiler/	RRID:SCR_006809
Software, algorithm	eCAVIAR	http://zarlab.cs.ucla.edu/tag/ecaviar/	NA
Software, algorithm	VEP	https://www.ensembl.org/info/docs/tools/vep/index.html	RRID:SCR_007931
Software, algorithm	MutFunc	http://www.mutfunc.com/	NA
Software, algorithm	Limix	https://github.com/limix/limix	NA
Software, algorithm	PEER	http://www.sanger.ac.uk/resources/software/peer/	RRID:SCR_009326
Software, algorithm	Scikit-learn	http://scikit-learn.org/	RRID:SCR_002577


Author details

Bogdan Andrei Mirauta
Daniel D Seaton
Dalila Bensaddek
Alejandro Brenes
Marc Jan Bonder
Helena Kilpinen
HipSci Consortium
Oliver Stegle
Angus I Lamond
