Machine Learning Models of Polygenic Risk for Enhanced Prediction of Alzheimer Disease Endophenotypes

Background and Objectives Alzheimer disease (AD) has a polygenic architecture, for which genome-wide association studies (GWAS) have helped elucidate sequence variants (SVs) influencing susceptibility. Polygenic risk score (PRS) approaches show promise for generating summary measures of inherited risk for clinical AD based on the effects of APOE and other GWAS hits. However, existing PRS approaches, based on traditional regression models, explain only modest variation in AD dementia risk and AD-related endophenotypes. We hypothesized that machine learning (ML) models of polygenic risk (ML-PRS) could outperform standard regression-based PRS methods and therefore have the potential for greater clinical utility. Methods We analyzed combined data from the Mayo Clinic Study of Aging (n = 1,791) and the Alzheimer's Disease Neuroimaging Initiative (n = 864). An AD PRS was computed for each participant using the top common SVs obtained from a large AD dementia GWAS. In parallel, ML models were trained using those SV genotypes, with amyloid PET burden as the primary outcome. Secondary outcomes included amyloid PET positivity and clinical diagnosis (cognitively unimpaired vs impaired). We compared performance between ML-PRS and standard PRS across 100 training sessions with different data splits. In each session, data were split into 80% training and 20% testing, and then five-fold cross-validation was used within the training set to ensure the best model was produced for testing. We also applied permutation importance techniques to assess which genetic factors contributed most to outcome prediction. Results ML-PRS models outperformed the AD PRS (r2 = 0.28 vs r2 = 0.24 in test set) in explaining variation in amyloid PET burden. Among ML approaches, methods accounting for nonlinear genetic influences were superior to linear methods. ML-PRS models were also more accurate when predicting amyloid PET positivity (area under the curve [AUC] = 0.80 vs AUC = 0.63) and the presence of cognitive impairment (AUC = 0.75 vs AUC = 0.54) compared with the standard PRS. Discussion We found that ML-PRS approaches improved upon standard PRS for prediction of AD endophenotypes, partly related to improved accounting for nonlinear effects of genetic susceptibility alleles. Further adaptations of the ML-PRS framework could help to close the gap of remaining unexplained heritability for AD and therefore facilitate more accurate presymptomatic and early-stage risk stratification for clinical decision-making.


Introduction
Genetic factors are estimated to explain approximately 71 percent of the risk of Alzheimer's disease (AD). 1,2The genetic influences of AD are widely presumed to be polygenic in nature, with numerous genetic factors acting collectively to influence susceptibility. 3The apolipoprotein E e4 allele (APOEe4) is the strongest known genetic risk factor of sporadic AD and is associated with increased risk by a factor of 2-3 with 1 allele copy and by a factor of 8-12 with 2 copies. 46][7] For a highly heritable disease, there remains great interest in developing tools to characterize the polygenic risk architecture underlying AD more fully to guide clinical risk stratification.
Polygenic risk score (PRS) methods have shown promise for characterizing this complex genomic background. 8,9These methods generate a summary measure of risk based on an individual's relevant genetic profile (i.e., their genotypes at known susceptibility variants across the genome).Many methods for PRS calculation exist. 8Most standard PRS approaches calculate the combined sum of effect sizes for risk variants taken from GWAS data sets applied toward the dosages of those variants in an individual.In theory, this approach facilitates aggregating the effects of APOEe4 and other AD susceptibility alleles into a quantitative measure, which can be related to outcomes of interest.For AD, PRS approaches have shown strong associations of risk with clinically diagnosed AD dementia, [9][10][11] cognitive decline, 12,13 and important disease endophenotypes such as amyloid and tau burden. 14,15limitation of standard PRS methods is that they are fundamentally linear.The weighted sum of risk alleles and their effects does not include nonlinear genetic effects such as epistatic (gene × gene interaction) influences. 16For example, we cannot expect that the combined effect of SV A and SV B is equal to the sum of the effects of SV A and SV B when considering the activation of biological pathways.Moreover, weights for risk variants drawn from GWAS [5][6][7] reflect effects across the population and therefore may underestimate clinical subpopulations where specific combinations of gene variants may be driving susceptibility.Potential gene-gene interaction terms prove difficult to compute because generating weights for them presupposes the relationship between the constituent SVs.Therefore, improvements that address these limitations of existing PRS approaches may better explain the heterogeneity of AD and thus have greater clinical utility. 17eviously, machine learning (ML) methods have been used to assign functional annotations to genes, as well as to understand mechanisms of gene expression by modeling interactions within a network framework. 18In addition, ML models have shown usefulness in drawing novel insights from multiomics data, including for cancer 19 and other non-AD traits. 20We hypothesized that ML models (especially those accounting for nonlinear interactions) would outperform standard PRS for prediction of important AD endophenotypes.Furthermore, we hypothesized that established approaches for understanding ML models would illuminate the relative importance of individual SVs and SV × SV interactions toward model prediction to provide insights on underlying disease pathways and mechanisms.

Participant Selection and Description
The Mayo Clinic Study of Aging (MCSA) is a populationbased prospective study of older adults residing in Olmsted County, Minnesota. 21Individuals were identified for recruitment using the Rochester Epidemiology Project medical records linkage system. 22e Alzheimer's Disease Neuroimaging Initiative (ADNI) is a longitudinal multicenter study to facilitate development of clinical, imaging, genetic, and biochemical biomarkers for the early detection and tracking of AD. 23,24 Individuals were recruited from more than 50 sites across the United States and Canada.Further information about the ADNI can be found at adni.loni.ucs.edu/.
Data from the MCSA and ADNI were harmonized where applicable and combined for analyses to maximize statistical power and provide the data set with most dynamic range.The MCSA alone, due to its nature as an epidemiologic study, does not have a large proportion of cases with cognitive impairment.This flaw is rectified by combination with the ADNI data set, which is recruited as a more traditional prospective study with a structure similar to clinical trials.Primary inclusion criteria included the presence of genome-wide SV genotype data and cross-sectional amyloid PET data.In addition, participants of the MCSA study were assigned a clinical diagnosis by consensus approach, which was applied to differentiate cognitively unimpaired vs cognitively impaired (i.e., mild cognitive impairment or dementia) individuals for use as a complimentary outcome of interest.

MCSA Genomic Data
Genomic DNA were extracted from stored peripheral blood samples for MCSA participants.A "GWAS backbone" data set was generated through a custom automated workflow for genotyping-by-sequencing (GxS) developed at the Regeneron Corporation [Tarrytown, NY].The GxS workflow generated a forced call set for specific 1000 Genomes project 30x biallelic variants in regions intended to capture common tagging variation.Variant sites with zero depth were removed.Postsequencing quality control included checks for gender discordance, duplicate samples, and contamination >5% estimated using VerifyBamID.
Genetic ancestry analyses used a subset of 4,874 high-quality SVs that overlapped with HapMap3 SVs.A kernel density estimator approach was applied to principal component analysis (PCA) trained on individual HapMap3 populations to generate ancestry assignments.To limit confounding, analyses were restricted to individuals with European ancestry, and PCA eigenvectors of ancestry were generated for use as covariates in regression and ML modeling.
Assessments for cryptic relatedness were performed using identity-by-descent analysis in PLINK, version 1.9. 25Relatedness analyses were performed subsetting to high-quality common SVs, with minor allele frequency (MAF) > 10% and genotype call rate >95%.Overly related samples were removed based on a conservative threshold of PI HAT ≥0.1875.
The processed GxS data were then subjected to additional quality control, including exclusion for SV genotyping rate <95%, sample genotyping rate <98%, monomorphic SVs, or Hardy-Weinberg equilibrium p < 1 × 10 −6 .This resulted in 1,352,354 SVs available for 5,092 MCSA participants with European ancestry.Imputation was performed on this set using the TOPMed Imputation Server 26,27 and TOPMed GRCh38/hg38 build reference panel.Monomorphic variants and SVs with low imputation quality (r 2 < 0.8) were removed.This resulted in a total of 31,371,642 SVs, of which 8,094,342 had MAF >1%.

ADNI Genomic Data
For ADNI participants, GWAS array data were acquired and filtered for standard quality control metrics, as previously described. 28,29Processed genotype files for participants from the various ADNI study phases were downloaded from the LONI web data sharing platform.Overly related samples were removed as previously described. 15Genome-wide imputation using the TOPMed Imputation Server (with identical protocols to the MCSA set) was performed separately within each batch (by GWAS array) and then merged.This resulted in 16,502,548 variants (8,054,769 with MAF ≥1%) for 1,661 individuals within the ADNI data set.

Neuroimaging Data
In the MCSA, amyloid PET scans were created using an in-house fully automated image-processing pipeline, described here. 30maging was performed with Pittsburgh compound B (PiB). 31 For each scan, a standardized uptake value ratio (SUVR) measure of global cortical amyloid burden was generated by normalizing median tracer uptake across a collection of defined regions of interest (prefrontal, orbitofrontal, parietal, temporal, anterior cingulate, and posterior cingulate/precuneus) to the cerebellar crus gray matter.
In the ADNI, amyloid PET was performed with 18 F-florbetapir (AV-45) using acquisition and processing protocols as described at adni-info.organd with summary measures of global cortical amyloid load downloaded from the ADNI database. 32 used the Centiloid (CL) scale to harmonize measurements of amyloid PET burden across the MCSA and ADNI data sets.Conversion from SUVR units to the CL scale was performed as recommended 33 and as used in previous publications. 30In brief, the CL scale ranges from 0 to 100, anchored by amyloidnegative controls younger than 45 years at the low end (average CL value of zero) and patients with typical Alzheimer dementia at the high end (average CL value of 100).The level-1 stepwise approach defined in a previous study 33 allows for conversion of PiB-PET SUVR measures to CL values, while level-2 approaches facilitate conversion of other amyloid PET tracer SUVR measures to CL values.The primary outcome for our analyses was amyloid PET CL as a continuous measure.As a secondary outcome, amyloid PET positivity was defined by CL ≥ 17. 34

Data Set Creation
To optimize the number of SVs under consideration for this proof-of-concept work, we focused on the top 20 independent SVs defined by having the largest effect sizes (i.e., largest Abs [Ln(odds ratio)]) showing genome-wide significant association with clinically diagnosed AD dementia in a recent large case/ control GWAS. 5 This choice allowed us to have few enough features (even after accounting for gene-gene interaction terms) that we could reasonably apply even the most data-intensive machine learning techniques.We also included APOEe4 and APOEe2 in this list given their well-known associations with amyloid levels in the population. 4,35,36Most machine learning methods struggle with sparse features (i.e., features that contain a large preponderance of zeros), so to mitigate this, SVs with a MAF <5% were removed.This resulted in a set of 12 SVs (listed in Figure 1) to be used for the rest of the study.Age, sex, and the first 5 genetic principal component eigenvectors were included as covariates in all models.A representation of all data sets created for this study is shown in Figure 1.
From this set of SVs, a standard PRS was computed according to the equation r = β 1 s 1 + β 2 s 2 +…+β n s n , where r is risk score, β are effect sizes as determined by the GWAS (denoted by ln[odds ratio]), and s are dosage values for individual SVs.Linear regression was used to test the association of this AD PRS with amyloid PET burden, accounting for covariates specified earlier.In addition, interaction terms (as will be described shortly) were added as covariates for a separate regression.For ML model testing, a data set containing the genotype data for the 12 target SVs, covariates, and amyloid PET CL data was created.
As 1 aspect of the ML model approach, we used a novel quadratic encoding schema to consider the question of potential gene-gene interactions, shown in Equation 1a In this case, a and b are both dosage amounts for a given SV within the genotype data.This encoding method has an advantage over traditional coding methods because it preserves the relative dosage of the components in a way that standard methods would not 37 and was used to create interaction terms for APOEe4 with the other SVs used in this study.The interaction terms were then used to create a data set with the SV genotype data, interaction terms, covariates, and outcome.To observe whether information from these models was fully retained within the interaction terms themselves, we also created a comparator data set leaving out the individual SV genotypes themselves.A similar schema was used to assess SV-age interactions.

ML Model Selection and Training
We applied a series of ML models, each of which was trained on the same data set and for which the summary statistics of each model could be compared with select optimal performers.ML included in the stack were chosen primarily for known robustness to multicollinearity, which we expect to appear with the introduction of the derived gene-gene interaction terms.Several basic linear regressions were applied as part of the same pipeline for comparison with more complex ML models.These consisted of an ordinary least squares regression, a Lasso regression, an elastic net regression, and a partial least squares regression.
To test for potential nonlinear effects, tree-based models including a Random Forest, AdaBoost, and XGBoost were used to construct the other half of the model stack.Each of these works by taking many weak learners (such as basic decision trees or linear models) and combining them in algorithmspecific ways to generate a prediction.Because they are averaging across many weak learners, not all of which are linear, these models accounted for nonlinear effects in the genotype data.Finally, a stacked model was trained.This stacked model works by using the results of the 4 best previous models as features in a new regression, allowing interaction between model classes.All models except for XGBoost were sourced from the scikit-learn library. 38XGBoost was sourced from the xgboost package. 39 classify for amyloid PET status (negative vs positive) and cognitive status (unimpaired vs impaired), a separate stack of classification models was constructed.Because comparison between model types was not an outcome for the classification sets, model inclusion criteria were relaxed.Included models were logistic regression, decision tree, K-nearest neighbor, linear discriminant, quadratic discriminant, Gaussian naïve Bayes, support vector machines, stochastic gradient descent, random forest, gradient boosted trees, perceptron neural net, and a stacked model that functions similarly to the regression variant.
Twenty percent of the data was held out as a test set, and then models were trained and optimized on the remaining 80 percent using a grid search using five-fold cross-validation.
Once trained, models were used to predict amyloid PET CL score on the test set, and the results of this final prediction are reported here.This process was repeated 100 times for each data set, with varying sets of points excluded in the test set, which allowed us to create a distribution for each of the metrics of interest instead of relying on a single report.
The r 2 correlation between predicted and actual amyloid CL values were collected for each run.Tests were conducted for the normality of the r 2 distribution to ensure that significance metrics are valid, and a two-sided independent t test was conducted to determine the significance of any separation between groups of models.For instances where predictions were binary categories, a receiver operating characteristic (ROC) curve was computed using class probabilities, and the area under the curve (AUC) was used as the outcome.Feature importances for models were computed using a permutation scheme where each feature was randomized, and then models were retrained on the data set with the randomized feature.Decrease in model r 2 was the output of interest, with higher decrease meaning a more important feature.

Standard Protocol Approvals, Registrations, and Patient Consents
All study protocols were approved by the Mayo Clinic and Olmsted Medical Center Institutional Review boards (for the MCSA) and by ADNI participating site Institutional Review Boards (for the ADNI).Written informed consent was obtained from all participants or their surrogates.

Data Availability
Anonymized data will be available on reasonable request from a qualified investigator in accordance with the MCSA datasharing protocol.

Study Sample
Demographic information for the participants included in these analyses are summarized in the Table .The frequency of APOEe4 positivity was lower in the population-based MCSA sample when compared with that of the ADNI sample, which was recruited in a manner similar to clinical trials.Mean education and amyloid PET burden were overall higher in ADNI participants compared with MCSA participants.

ML-PRS Outperforms Standard PRS in Predicting Amyloid PET Burden
As shown in Figure 2, predicting amyloid CL value from trained ML-PRS models performed significantly better than a standard AD PRS generated from the same SVs (p < 0.0001).Specifically, compared with AD PRS [r 2 = 0.242(0.241,0.244)],ML-PRS (specifically XGBoost based ML-PRS) explained a larger proportion of the variation in amyloid PET levels across the sample [r 2 = 0.279(0.277,0.281)]on the test set.Both models (standard PRS and ML-PRS) included covariates.Even once gene-gene interaction terms were included as covariates in the standard PRS, ML-PRS continues to significantly outperform (p < 0.0001).This pattern held across individual ML models as well as the aggregate of all tested ML approaches.

Nonlinear ML-PRS Models Perform Better than Linear Models
Linear ML models of all classes performed similarly to each other (Figure 3).Nonlinear ML-PRS models generally performed significantly better than linear ML-PRS models, particularly in considering a stacked regression of all approaches (p < 0.0001).There was variation in performance across nonlinear ML-PRS approaches.For example, AdaBoost, which relies on weak linear regressions as its foundation, underperformed compared with fully linear methods, while XGBoost, which uses small decision trees, showed significantly stronger performance.

Addition of Gene-Gene Interaction Terms
Provides Biological Insights ML-PRS models based on both SV genotype data and SV-SV interaction terms overall performed better than the standard AD PRS but slightly worse than ML-PRS models without interaction terms (Figure 2).Removal of SV genotypes from these models (leaving the SV-SV interaction terms and covariates only) further degraded the r 2 value, suggesting that not all the relevant information from individual genotypes is adequately captured within SV-SV interaction terms.Comparison of feature importance plots for the ML-PRS models without vs with interaction terms for APOEe4 x other SVs yielded potentially relevant relationships (Figure 4).With genotype data alone, age and APOEe4 were the dominant factors in prediction of amyloid PET burden, with sex, CR1, and APOEe2 having more minor but nonzero influences (Figure 4A).
When APOEe4 interaction terms were added for SV genotypes and age (Figure 4B), a different distribution of feature importances was observed.The influence of age was split between the age term alone and the APOEe4 x age term, consistent with a meaningful interaction between these 2 factors on amyloidosis.Furthermore, the APOEe4 genotype term was less important than 7 of the interaction terms between APOEe4 and other risk alleles.Included among these was CASS4 (cas scaffold protein family member 4), whose genotype alone was not a highly influential feature in the noninteraction model but where the interaction between APOEe4 and CASS4 was much more important.These statistical findings may suggest a potential biological relationship, akin to a "two-hit" model where the true effect of APOEe4 on amyloid accumulation in the population overall may be distributed across subgroups where specific subcombinations of other AD risk variants are present.

ML-PRS Methods Improve Classification of Amyloid PET Positivity and Cognitive Impairment vs Standard PRS
We next assessed the predictive value of ML-PRS against other relevant AD outcomes.In classification of amyloid PET status (Figure 5), ML-PRS provides a better AUC (0.795) than APOEe4 alone (AUC = 0.629) or the standard AD PRS including APOEe4 and other risk variants (AUC = 0.628).Given that the variants analyzed in this study were drawn from GWAS of clinically diagnosed AD dementia, we also assessed the ML-PRS in predicting the presence of cognitive impairment (MCI or dementia vs cognitively unimpaired status).We observed that the ML-PRS (AUC = 0.752) outperformed APOEe4 (AUC = 0.572) and a standard AD PRS (AUC = 0.539) in prediction of cognitive diagnosis (Figure 6).We observed that APOEe4 and the standard PRS were both around chance levels in prediction of cognitive status impairment and that in fact the standard PRS was worse than APOEe4 alone for this outcome.In both cases, the standard PRS displayed slightly degraded performance compared with APOEe4 alone, but the differences are small and thus likely due to nuances of the specific small set of SVs chosen for this study.

Discussion
We found that ML approaches performed better than standard PRS approaches for predicting amyloid PET burden (a key early biomarker of AD) based on risk variants for clinically diagnosed AD dementia.This improvement was present with both linear and nonlinear ML techniques but was particularly evident with ML methods that accounted for nonlinear genetic effects.These improvements persisted when using alternative AD-relevant phenotypes, including amyloid PET positivity and cognitive status (impaired vs unimpaired).
We had also hypothesized that inclusion of SV-SV interaction terms in our ML models would further improve their performance by explicitly accounting for epistatic effects within the genome.We did not find this to be the case because ML-PRS models with the addition of these interaction terms explained slightly less phenotypic variation than models based on genotype data alone (though the interaction-containing models still outperformed the standard PRS).However, ML-PRS methods using interaction terms allowed for further investigation of how the models (and their underlying genetic variants) realized their predictions.This is a particular advantage of ML methods over typical PRS approaches, the latter of which do not allow for the application of such post hoc explainer algorithms.
As an example, by comparing feature importance between models, we observed that the CASS4-APOEe4 interaction term was highly important in model prediction compared with the relative importance of CASS4 rs6014724 genotype in isolation.CASS4 encodes a cell adhesion molecule important for axonal transport and has been implicated in APP (amyloid precursor protein) metabolism. 40,41Although CASS4 was identified as a susceptibility gene for clinically diagnosed AD dementia, our results are illuminating toward a mechanistic explanation by specifically suggesting a relationship between APOE and CASS4 on influencing amyloidosis.It is also highly plausible that genetic interaction effects are not limited to simple two-term SV × SV interactions.Instead, there may be more complex multi-SV combinations (e.g., 3-5 SVs) driving heterogeneous patterns of susceptibility to AD.These influences are not likely to be adequately captured by standard PRS approaches that amalgamate large combinations of variants across populations and therefore are likely to miss particular combination "hits," which are driving disease at a precision level.Future work is planned extending our ML-PRS framework to these clustered combination scenarios.
Artificial intelligence approaches have been previously considered toward genetic risk prediction.A model of high-order Interactions-aware Polygenic Risk Score (hiPRS) has been proposed which uses a standard PRS as a base and then finds interactions of interest with Frequent Itemset Mining techniques. 42This approach was applied to risk of developing colorectal cancer, but in allowing interactions of any order to be selected, may miss interactions that are rare but highly impactful.Models from a published preprint attempted to predict clinically diagnosed late-onset AD through multifactor dimensionality reduction techniques, which break SV-SV combinations into high and low risk, and then test association with the phenotype. 43This approach, in reducing SV-SV combinations to a binary outcome, is limited by failing to capture the full variability of interactions possible between 2 SVs.
This work has limitations.The SVs chosen for use in this analysis were preselected from summary statistics for a published AD case/control GWAS, which is not a perfect proxy for biologically defined AD nor for amyloidosis as a key AD biomarker. 44In addition, to avoid complications to our ML approaches from data sparsity, we focused on a subset of common variants having strong effect sizes on clinically diagnosed AD dementia.Future extensions of this proof-of-concept work applying a broader list of input variants (including common and rare variants) may further improve performance.
We also elected to combine data from 2 cohorts (the MCSA and ADNI) to maximize statistical power for these analyses.However, these 2 data sets have different recruitment (populationbased vs a structure similar to clinical trials), clinical composition (predominantly cognitively unimpaired MCSA participants vs majority MCI or dementia diagnoses in ADNI participants), and amyloid PET tracers used (PiB in the MCSA vs AV-45 in ADNI).Although we did not use a true external (i.e., independent from MCSA and ADNI) validation set, we did attempt to mitigate this limitation by using test data sets of MCSA/ADNI participants who were held out from the samples used to train the ML models.We also used the CL scale to harmonize (across MCSA and ADNI) the primary endophenotype used in these analyses, acknowledging that differences in the PET tracers and imaging protocols used in those 2 studies could still theoretically influence results when using a combined approach as in this work.In addition, our study was cross-sectional in design, reflecting the high clinical import of predicting susceptibility to amyloidosis (as a key early marker of AD pathophysiology and with emerging treatments targeting amyloid levels) 45 and the sigmoidal nature  of amyloid accumulation over time in the population (limiting the applicability of longitudinal change in amyloid burden as an endophenotype for this genetic work). 46However, extensions of the ML-PRS approach to other AD endophenotypes more amenable to a longitudinal study design (e.g., tau PET burden, cognition) represent natural future directions.
For AD and other complex diseases with genetic and lifestyle/ environmental influences, the PRS approach has shown promise. 8,47There is clear attraction in the development of a blood-based summary measure that can be used in both the presymptomatic and symptomatic stages of disease (given that most elements of target genetic variation are present in both) for risk stratification.For genetic risk stratification of AD to have lasting clinical utility, predictive models will need to approach much higher accuracy (particularly for a disease presumed to have heritability between 60% and 80% based on twin studies 48 ) than current AD PRS approaches achieve.Our analyses support that ML methods for genetic risk assessment may have greater value than standard PRS methods and therefore represent a high yield target in the development of a precision medicine approach to AD.
In summary, we show that ML methods applied to genetic data can do better than the standard PRS when predicting amyloid PET burden and other relevant AD phenotypes.Furthermore, methods that consider nonlinear genetic effects outperform linear methods, arguing for ML-PRS as an enhanced approach to capturing true genetic influences on a complex and heterogeneous disease.With further optimizations, our findings offer a path toward improved screening and risk stratification for AD.

Figure 1
Figure 1 Flowchart of Data Set Creation for Model Testing

Figure 2
Figure 2 Performance of ML-PRS on Varying Data vs Standard PRS for Prediction of Amyloid PET Burden

Figure 3
Figure 3 Performance of Linear vs Nonlinear ML-PRS Models for Genetic Risk Prediction

Figure 4
Figure 4 Relative Importances of Genetic and Demographic Factors Toward Predicting Amyloid PET Burden

Figure 5 ROC
Figure 5 ROC Curves for Amyloid Case Status