Genome‐ and epigenome‐wide studies of plasma protein biomarkers for Alzheimer's disease implicate TBCA and TREM2 in disease risk

Abstract
Introduction: The levels of many blood proteins are associated with Alzheimer's disease (AD) or its pathological hallmarks. Elucidating the molecular factors that control circulating levels of these proteins may help to identify proteins associated with disease risk mechanisms.
Methods: Genome‐wide and epigenome‐wide studies (n individuals ≤ 1064) were performed on plasma levels of 282 AD‐associated proteins, identified by a structured literature review. Bayesian penalized regression estimated contributions of genetic and epigenetic variation toward inter‐individual differences in plasma protein levels. Mendelian randomization (MR) and co‐localization tested associations between proteins and disease‐related phenotypes.
Results: Sixty‐four independent genetic and 26 epigenetic loci were associated with 45 proteins. Novel findings included an association between plasma triggering receptor expressed on myeloid cells 2 (TREM2) levels and a polymorphism and cytosine‐phosphate‐guanine (CpG) site within the MS4A4A locus. Higher plasma tubulin‐specific chaperone A (TBCA) and TREM2 levels were significantly associated with lower AD risk.
Discussion: Our data inform the regulation of biomarker levels and their relationships with AD.


INTRODUCTION
Alzheimer's disease (AD) is one of the leading causes of disease burden and death globally. 1,2 Blood-based methods for assessing disease risk are potentially more cost-effective and less invasive than neuroimaging methods or lumbar punctures for collecting cerebrospinal fluid (CSF). Approaches that use genomics and untargeted proteomics have suggested that there are signals in blood that can supplement targeted assays, and contribute to the understanding and prediction of AD. 3,4 However, the relevance of many candidate protein markers identified by untargeted approaches to AD remains unclear. 5,6 Understanding the molecular factors that regulate the levels of AD-associated proteins may identify proteins that bear relevance to disease risk mechanisms.
Unlike genetic factors, which remain largely stable over the lifecourse, differential DNA methylation (DNAm) profiles are influenced by genetic and non-genetic factors. 7 DNAm is characterized by the addition of methyl groups to DNA, typically in the context of cytosine-phosphate-guanine (CpG) nucleotide base pairings. Clusters of CpG sites termed CpG islands are located near 70% of gene promoters. CpG island methylation is typically associated with reduced gene expression. However, it is important to note that DNAm is dynamic, tissue-specific, and cell-specific. 8 DNAm data may capture independent information beyond genetic factors in explaining inter-individual variation in circulating protein levels. Several genome-wide association studies (GWAS) have catalogued polymorphisms associated with plasma protein levels and identified proteins that correlate with risk scores for various disease states including AD. [9][10][11] Zaghlool et al. (2020) performed the only large-scale epigenome-wide association study (EWAS) to date on plasma protein levels (>1000 proteins). 12 Few studies have combined GWAS and EWAS data to quantify the independent and combined contributions of genetic and epigenetic factors toward differential protein biomarker levels. [13][14][15] We performed a structured literature review of studies that report associations between plasma proteins and AD diagnosis or related traits such as amyloid beta (Aβ) burden and cortical atrophy. [16][17][18][19][20][21][22][23][24][25][26][27] We focused on studies that measured plasma protein levels using the SOMAscan affinity proteomics platform (SomaLogic Inc.), as this matches the protocol used in our study, Generation Scotland. We identified 282 proteins that were also measured in our sample (n ≤ 1064).
Our first aim was to conduct GWAS and EWAS on plasma levels of 282 AD-associated proteins. Using Bayesian penalized regression, we estimated the proportion of inter-individual variability in plasma protein levels that can be accounted for by variation in genetic and DNAm factors. BayesR+ implicitly adjusts for probe intercorrelations and data structure, including relatedness. 28 For our second aim, we used Mendelian randomization (MR) and co-localization analyses to test for relationships between plasma protein levels and AD phenotypes.

Study cohort
Analyses were performed using blood samples from participants of the

Protein quantification
The 5k SOMAscan v4 array was used to quantify the levels of plasma proteins in GS participants (n = 1065). This highly multiplexed platform uses chemically modified aptamers termed SOMAmers (Slow Off-rate Modified Aptamers) [...]
[...] and SOMAmers are shown in Table S1. Residualized RFUs were transformed by rank-based inverse normalization. We refer to these as protein levels; however, they reflect RFUs that have undergone a number of quality control, transformation and pre-correction steps.
[...] and DNAm data must have had the same number of prior variances (n = 3 each). Mixture variances for SNP data were set to 0.01, 0.1, and 0.2 in combined analyses. Input data were scaled to mean zero and unit variance. Gibbs sampling was used to sample over the posterior distribution conditional on input data and 10000 samples were used.
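As an illustration of the rank-based inverse normal transformation applied to the residualized RFUs, the following minimal sketch maps values to normal quantiles via their ranks. The function name, the Blom offset of 3/8, and the use of average ranks for ties are assumptions made for illustration; the text does not state which variant was used.

```python
from statistics import NormalDist


def rank_inverse_normal(values, offset=3 / 8):
    """Rank-based inverse normal transformation (illustrative sketch).

    ASSUMPTIONS (not stated in the text): Blom offset of 3/8 and
    average ranks for tied values.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # Convert each rank to a quantile in (0, 1), then to a standard normal deviate.
    return [std_normal.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]


# Hypothetical residualized RFU values for three individuals.
transformed = rank_inverse_normal([10.0, 30.0, 20.0])
```

The transformation preserves the ordering of individuals while forcing the distribution toward a standard normal shape, which limits the influence of outlying RFU values.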

GWAS
The first 5000 samples of burn-in were removed and a thinning of five samples was applied to reduce autocorrelation. SNPs that exhibited a posterior inclusion probability (PIP) ≥95% were deemed significant.
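The burn-in, thinning, and PIP threshold described above can be sketched as follows. This is a minimal illustration, not the BayesR+ implementation: it assumes each Gibbs draw stores a single effect size per SNP, and that a draw counts as "included" when the sampled effect is non-zero. The function names and the simulated chain are hypothetical.

```python
def retained_draws(chain, burn_in=5000, thin=5):
    """Drop the burn-in draws, then keep every `thin`-th remaining draw."""
    return chain[burn_in::thin]


def posterior_inclusion_probability(effect_chain, burn_in=5000, thin=5):
    """Fraction of retained draws in which the marker has a non-zero effect.

    ASSUMPTION: a non-zero sampled effect means the marker was assigned
    to a non-null mixture component in that draw.
    """
    kept = retained_draws(effect_chain, burn_in, thin)
    if not kept:
        raise ValueError("no draws left after burn-in and thinning")
    return sum(1 for beta in kept if beta != 0.0) / len(kept)


# Hypothetical chain of 10000 draws for one SNP: excluded throughout burn-in,
# then included in most of the remaining draws.
chain = [0.0] * 5000 + [0.0 if i % 100 == 0 else 0.3 for i in range(5000)]

n_kept = len(retained_draws(chain))           # 1000 draws remain
pip = posterior_inclusion_probability(chain)  # 950 of the 1000 kept draws are non-zero
is_significant = pip >= 0.95
```

With 10000 draws, a burn-in of 5000, and a thinning interval of five, 1000 draws contribute to each PIP.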

EWAS
Blood DNAm in Generation Scotland participants was quantified using the Illumina HumanMethylationEPIC BeadChip Array. Blood DNAm was assessed in two separate sets. After quality control, 793706 and 773860 CpG sites remained in sets 1 and 2, respectively. In total, 772619 CpG sites were shared across sets. Each set was truncated to these overlapping probes.
In the stand-alone EWAS and combined GWAS/EWAS, mixture variances were set to 0.001, 0.01, and 0.1 (n = 778). Missing DNAm data were mean imputed separately within each set as BayesR+ cannot accept missing values. Both sets were combined and adjusted for DNAm batch, set, age, and sex. Each CpG site was scaled to mean zero and unit variance. Houseman-estimated white blood cell proportions were included as fixed-effect covariates. 31 CpG sites that had a PIP ≥95% were deemed significant.
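The DNAm preparation steps (within-set mean imputation, restriction to shared probes, combination, and scaling) can be sketched as below. This is a simplified illustration under stated assumptions: the data layout (a dictionary from probe ID to a list of values, with `None` for missing), the probe IDs, and the use of the population standard deviation are hypothetical, and the adjustment for batch, set, age, and sex is omitted.

```python
from statistics import fmean, pstdev


def mean_impute(set_data):
    """Replace missing values (None) with the probe mean within one set."""
    imputed = {}
    for probe, values in set_data.items():
        observed = [v for v in values if v is not None]
        probe_mean = fmean(observed)
        imputed[probe] = [probe_mean if v is None else v for v in values]
    return imputed


def preprocess_dnam(set1, set2):
    """Impute within each set, keep shared probes, combine, and scale.

    ASSUMPTION: covariate adjustment (batch, set, age, sex) is skipped
    here, and probes with zero variance are dropped.
    """
    set1, set2 = mean_impute(set1), mean_impute(set2)
    shared_probes = sorted(set(set1) & set(set2))
    combined = {}
    for probe in shared_probes:
        values = set1[probe] + set2[probe]
        mu, sd = fmean(values), pstdev(values)
        if sd == 0:
            continue  # a constant probe cannot be scaled to unit variance
        combined[probe] = [(v - mu) / sd for v in values]
    return combined


# Hypothetical probe IDs and methylation values.
set_1 = {"cg_a": [0.1, None, 0.3], "cg_b": [0.5, 0.6, 0.7]}
set_2 = {"cg_a": [0.4, 0.6], "cg_c": [0.2, 0.9]}
scaled = preprocess_dnam(set_1, set_2)
```

Imputing before combining keeps each set's missing values tied to that set's own distribution, which mirrors the order of operations described in the text.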
Sensitivity EWAS analyses were performed using linear mixed-effects models and the lmekin function from the R coxme package (version 2.2-16). 32 DNAm data were pre-corrected for age, sex, batch, and set. Houseman-estimated white blood cell proportions were incorporated as fixed-effect covariates and a kinship matrix was fitted to account for relatedness among individuals in STRADL.

Co-localization analyses
Formal Bayesian tests of co-localization were used to determine whether a shared causal variant likely underpinned two traits of interest. 33

Mendelian randomization (MR)
Bidirectional Mendelian randomization (MR) was used to test for associations between (1) gene expression and plasma protein levels, (2) DNAm and plasma protein levels, and (3) plasma protein levels and AD risk or related biomarkers. Pruned variants (r² < 0.1) were used as instrumental variables (IVs) in MR analyses. Analyses were conducted using MR-base. 40 Two-sample MR was applied and relationships were assessed using the Wald ratio test. Further information on the IVs used is provided in the supplementary information.
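For a single instrument, the Wald ratio divides the variant-outcome effect by the variant-exposure effect. The sketch below uses the common first-order approximation for the standard error, which ignores uncertainty in the exposure estimate, and a two-sided normal P value. The function name and the summary statistics are hypothetical and are not taken from this study.

```python
import math


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Wald ratio estimate for one instrumental variable.

    ASSUMPTION: first-order standard error (uncertainty in the
    variant-exposure effect is ignored).
    """
    if beta_exposure == 0:
        raise ValueError("the instrument must have a non-zero exposure effect")
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    z = estimate / se
    # Two-sided P value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return estimate, se, p_value


# Hypothetical summary statistics: the variant raises the protein level by
# 0.5 SD and lowers the log-odds of the outcome by 0.1 (SE = 0.02).
estimate, se, p_value = wald_ratio(beta_exposure=0.5, beta_outcome=-0.1, se_outcome=0.02)
```

A negative estimate in this example would be read as higher genetically predicted protein levels being associated with lower risk of the outcome.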

Identification of plasma proteins associated with AD
Twelve studies were identified that reported associations between SOMAscan plasma proteins and AD or related traits (Figure 1). Three hundred fifty-nine unique proteins were identified and 22 (6.1%) were reported in more than one study (Tables S2-S4). In the Generation Scotland dataset, there were 308 SOMAmers (Slow Off-rate Modified Aptamers) that targeted 282 of 359 proteins of interest (Table S5 and Figure S1). The 282 unique proteins were brought forward for analyses (UniProt IDs and Seq-ids are shown in Table S6).

GWAS on AD-associated proteins
There were 1064 individuals with genotype and proteomic data in Generation Scotland. The mean age of the sample was 59.9 years (standard deviation [SD] = 5.9) and 59.1% of the sample was female. In the BayesR+ GWAS, 64 independent variants (or protein quantitative trait loci, pQTLs) were associated with 41 SOMAmers that mapped to 39 unique protein targets (PIP ≥ 95%; Table S7). The phenotypic correlation structure of these 41 SOMAmers is presented in Figure S2. [...] Figure S3).
Fifty-seven pQTLs were previously reported in GWAS of blood protein levels ([...]). Thirty-three pQTLs were associated with at least one trait in the GWAS Catalog at P < 5 × 10⁻⁸ (range = 1 to 96 associations; [...]).

Co-localization of protein QTLs with expression QTLs
The 36 cis pQTLs identified in BayesR+ were annotated to 23 unique proteins. For 12 of 23 proteins, at least one pQTL was previously reported to be an expression QTL for the respective gene in blood tissue (eQTL consortium database). 34 The R package coloc 33 [...] (Table S11). MR analyses provided evidence for reciprocal associations between changes in gene expression and circulating levels of these proteins (Table S12). Three proteins had weaker evidence for co-localization (PP ≥75% for GM2A, LYZ, PDCD1LG2) and seven proteins had strong evidence for separate variants underlying gene expression and protein levels.

FIGURE 1 Structured literature review of SOMAscan plasma proteins that were associated with AD in the literature, and assessment of their molecular architectures and relationships with AD in the present study. The MEDLINE, Embase, Web of Science databases, and preprint servers were queried to identify studies that reported associations between SOMAscan-measured plasma proteins and AD. GWAS, EWAS, and causal inference analyses were performed to identify molecular correlates of 282 AD-associated plasma protein levels and to probe their associations with AD and related traits. AD, Alzheimer's disease; EWAS, epigenome-wide association studies; GWAS, genome-wide association studies. Figure created using Biorender.com.

EWAS on AD-associated proteins
There were 778 individuals with DNAm and proteomic data in the Generation Scotland sample. The mean age of the sample was 60.2 (SD = 8.8) years and 56.4% of the sample were female. Twenty-six CpGs were associated with the levels of 20 unique proteins (PIP >95%, Table S13 and Figure S4). The median correlation coefficient between measured protein levels was 0.16. The associations consisted of 10 cis CpG sites and 16 trans CpG loci (Figure 3). The cg07839457 probe in the NLRC5 locus was associated with IL18BP and CSF1R levels, and the smoking-associated probe cg05575921 in AHRR was associated with GHR, PIGR, and WFDC2 levels.
We used linear mixed-effects models that accounted for relatedness to perform sensitivity analyses for the 26 CpG associations identified in BayesR+ (Table S14). 32 Effect sizes were highly correlated with those from BayesR+ and showed full directional concordance (r = 0.95, 95% CI = 0.90, 0.98; Figure S5). Twenty-one associations were replicated at a genome-wide significance threshold of P < 3.6 × 10⁻⁸, and the remaining five associations were replicated at P < 2.0 × 10⁻³.

Co-localization of protein QTLs with methylation QTLs
Fourteen proteins had both genome-wide significant pQTL and CpG associations in our study. There were 39 possible SNP-CpG pairs across these proteins. For each pair, we used linear regression to test if the SNP was associated with CpG methylation at P < 5 × 10⁻⁸, thereby representing an mQTL effect (Table S17). We also performed look-up analyses of mQTL databases including GoDMC and PhenoScanner. 35,36 In instances where an mQTL effect was identified in more than one database, coefficients from the study with the largest sample size were brought forward for co-localization analyses. In addition, in instances where two or more mQTLs were associated with the same CpG site in a given locus, only the most significant mQTL was brought forward for co-localization analyses (n = 19 mQTLs, 13 proteins; Table S18).
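The two selection rules for mQTLs described above can be sketched as a pair of passes over the candidate records. This is an illustrative sketch only: the record fields, variant and probe identifiers, database labels, and numbers are hypothetical, and each CpG is assumed to belong to a single locus.

```python
def select_mqtls(records):
    """Pick one mQTL per CpG site for co-localization (illustrative sketch).

    Rule 1: for a SNP-CpG pair found in several sources, keep the record
            from the study with the largest sample size.
    Rule 2: for a CpG with several mQTLs, keep the most significant one.
    ASSUMPTION: each CpG maps to exactly one locus.
    """
    best_by_pair = {}
    for rec in records:
        key = (rec["snp"], rec["cpg"])
        if key not in best_by_pair or rec["n"] > best_by_pair[key]["n"]:
            best_by_pair[key] = rec

    best_by_cpg = {}
    for rec in best_by_pair.values():
        cpg = rec["cpg"]
        if cpg not in best_by_cpg or rec["p"] < best_by_cpg[cpg]["p"]:
            best_by_cpg[cpg] = rec
    return best_by_cpg


# Hypothetical candidate mQTL records from two sources.
candidates = [
    {"snp": "rs_a", "cpg": "cg_1", "source": "db1", "n": 1000, "p": 1e-10},
    {"snp": "rs_a", "cpg": "cg_1", "source": "db2", "n": 5000, "p": 1e-9},
    {"snp": "rs_b", "cpg": "cg_1", "source": "db1", "n": 1000, "p": 1e-12},
    {"snp": "rs_c", "cpg": "cg_2", "source": "db1", "n": 1000, "p": 1e-20},
]
selected = select_mqtls(candidates)
```

Applying the sample-size rule first means the significance comparison in the second pass uses one set of coefficients per SNP-CpG pair.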
For six proteins, we observed strong evidence in coloc that a single cis-acting variant might underpin differential DNAm levels and protein abundances (PP >95%, Table S19). The six proteins were ANXA2, F7, MATN3, PCSK7, PLA2G2A, and SERPINA3. MR analyses provided evidence that relationships between methylation and protein levels were bidirectional (Table S20).

DISCUSSION
We identified seven novel protein QTLs and 19 novel CpG sites that associated with plasma levels of 18 AD-related proteins. Using BayesR+, we provided estimates for associations between common genetic and DNAm variation and inter-individual differences in plasma levels of 282 AD-related proteins. We integrated our data with publicly available gene expression and methylation QTL databases and highlighted molecular mechanisms that might link pQTLs to differential levels of six proteins. We observed strong associations between plasma levels of TREM2 or TBCA and AD risk. These associations were driven by trans pQTLs in MS4A4A and APOE, respectively.
We show that a trans pQTL (rs1530914) in the MS4A4A locus associates with higher plasma TREM2 levels. It is in strong LD (r² ∼ 0.9) with the variant rs1582763, which has been associated with higher CSF TREM2 levels and lower AD risk. 3,43 It is also in moderate LD (r² = 0.6) with a variant in the 3′UTR region of MS4A6A (rs610932), which was associated with plasma TREM2 levels in a sample of 35,559 Icelanders. 11 Polymorphisms in MS4A4A were shown to alter MS4A4A expression and subsequently modulate TREM2 concentration in human macrophages. 44 We supplement existing data by identifying a novel blood CpG correlate of plasma TREM2 levels (cg02521229) located near MS4A4A that was previously associated with dementia risk in Generation Scotland participants. 45 Our data suggest that risk mechanisms arising from MS4A4A polymorphisms and TREM2 levels can be captured in plasma assays and that these mechanisms involve differential methylation in the MS4A4A locus.
We observed associations between plasma levels of three proteins (CSF3, MAPKAPK5, and TBCA) and trans pQTLs in the TOMM40-APOE-APOC2 locus. Furthermore, we identified two pQTLs and three CpG [...]

[...] Lines indicate an association between a CpG site and SOMAmer. AD, Alzheimer's disease; CpG, cytosine-phosphate-guanine; EWAS, epigenome-wide association studies.

TABLE 1 MR analyses of plasma protein levels and AD-associated traits (Bonferroni-corrected P < 6.10 × 10⁻⁵)

Protein  Trait  Method  Beta  SE  P  Reference

CONCLUSIONS
Our strategy of integrating multiple omics measures determined the degree to which molecular factors can explain inter-individual differences in blood levels of possible biomarkers for AD, and advanced understanding of mechanisms underlying AD risk.