AI-enabled evaluation of genome-wide association relevance and polygenic risk score prediction in Alzheimer's disease

Summary GWAS focuses on significance loosing false positives; machine learning probes sub-significant features relying on predictivity. Yet, these are far from orthogonal. We sought to explore how these inform each other in sub-genome-wide significant situations to define relevance for predictive features. We introduce the SVM-based RubricOE that selects heavily cross-validated feature sets, and LDpred2 PRS as a strong contrast to SVM, to explore significance and predictivity. Our Alzheimer’s test case notoriously lacks strong genetic signals except for few very strong phenotype-SNP associations, which suits the problem we are exploring. We found that the most significant SNPs among ML and PRS-selected SNPs captured most of the predictivity, while weaker associations tend also to contribute weakly to predictivity. SNPs with weak associations tend not to contribute to predictivity, but deletion of these features does not injure it. Significance provides a ranking that helps identify weakly predictive features.


INTRODUCTION
In the search for genetic causes of disease, the assembly of a cohort represents a random sampling of the population.Within that random sample, individuals with various alleles and phenotypes will be collected with variations in proportions due to the stochastic variation in the sampling process.One question is whether overrepresentation of an allele state among those with a phenotype vs. those without is really meaningful, or is an artifact of sampling.It is possible to estimate the probability that such an overrepresentation could have emerged by chance instead of being due to an actual association (which is the null hypothesis).Such a probability is called its significance.Thresholds are set to exclude false negatives (significance thresholds), but at the cost of excluding a number of true active alleles, resulting in false negative rejections.One basic problem in standard genome-wide association studies (GWAS) is that thresholds required to guarantee no neutral SNPs were accepted above a significance threshold across the entire genome.For new whole-genome sequence (WGS) technologies, those thresholds are very severe, resulting in large numbers of false negatives.
There is hope that machine learning (ML) and polygenic risk score methods (PRS) may discover variants that may account for the missing heritability of some diseases.These methods offer the possibility of probing false negative SNPs below genome-wide significance thresholds, at the expense of including false positives, by seeking allele sets that maximize predictive power for a phenotype given a set of alleles (referring to measures of recall and precision, 1 including measures such as F 1 , confusion matrices, false positive rates, etc.).However, the predictivity and statistical significance of simple linear regression coefficients are tightly tied to each other, which we show in the ranking section.However, given large numbers of features, even relatively strongly significant SNPs may be lost among false positives.A primary goal of this study is to understand the relationship between the statistical significance of features, and their predictivity in ML and PRS approaches.The question is relevant in that therapeutics require identification of targetable pathways likely active in the population in order to be useful.
A major premise of GWAS and other similar approaches is that they implicate genes and pathways leading to the etiology of disease and therapeutic targets.A challenging aspect of Alzheimer's disease (AD) is that genome-wide significant SNPs do not account for the heritable variation in risk, 2,3 with some efforts seeking some of that missing variability in epistatic coexpressions. 4 An attractive idea is to capture relevant SNPs excluded by severe genome-wide thresholds by probing for SNPs below those genome-wide thresholds that maximize predictivity.PRS analyses share much in common with ML approaches in adopting this approach.
Unfortunately, false positives among non-genome-wide significant SNPs spuriously predict disease variation appearing as overfitting in PRS and ML models that include large numbers of variants. 5,6ML investigations, seeking the most reliably predictive variables, employ methods such as regularization and cross-validation to weed out SNPs with less generalizable predictive information.Typical ML approaches employ cross-validation with feature ranking and selection performed on a training dataset, and scoring of those selected models on a test dataset.The approach is finding interest in GWAS-like studies as well. 7,8he nonlinearity of artificial intelligence approaches obscures the contribution of individual SNPs to predictivity. 9At the same time, these approaches seek to solve similar problems, such as overfitting.In this study, we seek to understand how model selection for ML and PRS pipelines relates to GWAS-based significance and effect strength (odds ratios) estimations, what is different about the approaches' selections, and how their predictive power varies.We note that some formulations of PRS start by ranking SNPs by significance, and select thresholds according to some measure of predictivity, possibly including heritability.The methods are very similar to support vector machine (SVM).We note LDpred 10 and LDpred2 11 take a different approach, computing scores according to assigned weights.
To that end, we contrast predictive power vs. significance, in LDpred2 and in a SVM solution we call RubricOE (as an exemplar of SVM/ML approaches for predictive omics epidemiology).The contrast between approaches offers parallel but not identical rankable metrics for comparison.We use the heritability/significance ranking from ridge regression we used in our pipeline, RubricOE, and we ordered the LDpred2 weights as a comparable ranking.Both of these are predictivity-related scales.
These approaches to feature selection cover more diversity in methodology, while being extendable in computational scale.Curiously, while ML literature focuses on feature importance in prediction, there is little connection to feature significance.
AD, with its notoriously short list of strongly associated SNPs, serves as a test case that can be used to probe associations between various models and their contribution to predictivity precisely because of AD's short list.
To this extent, it is important to recognize that this study is not a standard comparison between approaches (e.g., LDpred2 and RubricOE); instead, the focus is on understanding how significance and effect relate to the predictivity in each of the two methodologies-RubricOE and LDpred2-which were selected to span approaches that have been applied to solve similar problems in distinct ways.So the primary comparisons for both methods are against GWAS significance instead of a direct comparison between methods.Having said this, we do explore how RubricOE and LDpred2's selection yields different features in their response to significance, and we compare them on predictivity vs. significance accordingly.

RubricOE
RubricOE is a cross-validated SVM pipeline with feature ranking described in the ranking section, and multiple levels of cross-validation, with an overview shown in Figure 1.The primary methodological adaptation is ranking features by heritability within the context of a Ridge regularized regression.
First, we set aside validation data, which does not play a part in any subsequent analysis by the pipeline.
From the working dataset, we generate multiple splits of the data.First, we rank features using heritability estimated from regularized linear regressions.We then score features against test sets by using models comprised of leading ranked SNPs with linear kernel SVMs to yield Youden 12 scores.This second part is performed with multiple data splits as well.Top scoring SNPs are selected according to their Youden scores, and the characteristics of the Youden curve.These steps (after the initial split into validation/working data) are performed several times to select SNPs that appear in multiple iterations.SNP stability is characterized by their absolute frequency counting their appearance in multiple cross-validation replications.Figure 1 also shows a potential use of the outputs of RubricOE for Gene Ontology computations.
Figure S1 shows a schematic view of the pipeline.RubricOE has been implemented in python and it is included in the Geno4SD framework as open source at: https://github.com/ComputationalGenomics/Geno4SD/.

RubricOE feature ranking
In what follows, we order features by estimating their contribution to heritability 13 by extending predictability in standard linear regression 14 to identifying heritability in ridge regression. 15henotypes or disease outcomes y are measured for a row of samples representing individual subjects, each predicted by columns representing characteristics X related to disease such as SNPs and available relevant elements such as age, sex, diet, family history, etc.These columns may be scaled genotypes (values f0;1;2g), or binary values representing sex, or age above a threshold, diet, and exercise, or could be scaled ''quantitative'' traits representing continuous variables, such as BMI, if available.Such a linear model would posit y = Xm + e, where m indicates a contribution of the column in X to the prediction of y, and e represents random variations due to sampling or other sources with EðeÞ = 0, covðe; e T Þ = Dy 2 , a diagonal matrix, and an offset is included as a column X 0 = 1.
A measure of goodness of fit for m can be constructed as E 2 = e T Dy À 2 e = ðy À XmÞ T Dy À 2 ðy À XmÞ.Defining the least-squares coefficient estimator as m = ðX T Dy À 2 XÞ À 1 X T Dy À 2 y; and the residual error as E 2 res = y T Dy À 2 y À y T Dy À 2 XðX T Dy À 2 XÞ À 1 X T Dy À 2 y; then E 2 = E 2 res + ðm À mÞ T ðX T Dy À 2 XÞðm À mÞ.So the minimum error is E 2 res , obtained when m = m.The covariance is covðm;m T Þ = ðX T Dy À 2 XÞ À 1 .The residual may be written: E 2 res = y T Dy À 2 y À m T ½covðm; m T Þ À 1 m = y T Dy À 2 y À ðXmÞ T Dy À 2 ðXmÞ.So the predictive power of the individual features is effectively tied to their Z scores, which establishes a strong connection between significance and predictivity.This motivated the current study.The amount of predictive power, feature by feature, is characterized as the fraction of variation y T Dy ð À 2Þy accounted for by ðXmÞ T Dy ð À 2ÞðXmÞ.Since the expression includes the impact of covariance between features, highly correlated features cannot be treated independently in computing their predictivity to y.These effects may be managed by using linkage disequilibrium (LD) pruning or clumping.
In underdetermined systems, X T Dy À 2 X is likely to have null eigenvalues and will not be invertible.Contributions to E 2 by m in the directions of those null eigenvectors will be zero, so the regression offers no information about variations in m and the covariance diverges in those directions.Further, since the non-null eigenvectors completely span the sampled space, the value of E 2 res is 0, and the set of features splines any set of values to be predicted: a situation called overfitting.One approach to control for this overfitting is by noting that splined m tends to have larger values.The impact of larger values is controlled by applying an L 2 regularization term to the regression so that of E 2 = ðy À XmÞ T Dy À 2 ðy À XmÞ + Cm T m.We chose L 2 over L 1 due to its tractability for error propagation.The first impact of the regularization term is to ensure that there are no divergent components of covðm;m T Þ, and second to define a measure of ''unusual'' for ''unusually large'' components of m typical of splining.In this case, m = ðX T Dy À 2 X+CIÞ À 1 X T Dy À 2 y, E 2 res = y T Dy À 2 y À ðXmÞ T Dy 2 ðXmÞ À Cm T m, E 2 = E 2 res + ðm À mÞ T ðX T Dy À 2 X + CIÞðm À mÞ, and covðm; m T Þ = ðX T Dy À 2 X+CIÞ À 1 X T Dy À 2 XðX T Dy À 2 X+CIÞ À 1 .Note that the residual error contains another contributor Cm T m beyond the predictive ðXmÞ T Dy À 2 ðXmÞ reflecting the impact of the regularization term.Further ðXmÞ T Dy À 2 ðXmÞsm T ½covðm; m T Þ À 1 m, so the predictive power is less tightly connected to the Z scores of the features.We use the individual diagonal components from ðXmÞ T Dy À 2 ðXmÞ for LD pruned or clumped sets to estimate the heritability of individual SNPs in the context of a ridge regularized linear regression.
Estimators for the coefficients we use for ranking are: The residual error yields a quantity tied to the heritability term in simple linear regression, so we select the measures reducing the residual error by each coefficient to rank the SNPs.We use the individual diagonal components from ðXmÞ T Dy À 2 ðXmÞ for LD pruned or clumped sets to estimate the heritability of individual SNPs in the context of a ridge regularized linear regression.The residual may be written: E 2 res = y T Dy À 2 y À m T ½covðm; m T Þ À 1 m = y T Dy À 2 y À ðXmÞ T Dy À 2 ðXmÞ.So the predictive power of the individual features is effectively tied to their Z scores, which establishes a strong connection between significance and predictivity.This motivated the current study.The amount of predictive power, feature by feature, is characterized as the fraction of variation y T Dy ð À 2Þy accounted for by ðXmÞ T Dy À 2 ðXmÞ.Since the expression includes the impact of covariance between features, highly correlated features cannot be treated independently in computing their predictivity to y.These effects may be managed by using LD pruning or clumping.
Youden's J statistic is defined as the J = sensitivity À falsepositiverate = sensitivity + specificity À 1.A plot of J scores for ranked data as a function of a ranking threshold in a sensitivity vs. false positive rate will indicate a point where threshold adjustment to include more samples to acquire more true positives starts including more false positives.We proposed to rank features (SNPs) according to the estimate of simple heritability accounted for in a regularized linear regression.

Cross-validation
For a fraction f of biologically active SNPs, a process identifying biologically active SNPs, i.e., an odds ratio threshold, has a probability a of falsely identifying a true SNP, and a probability b of rejecting a biologically active SNP.So, the fraction of SNPs that the test would identify as positive would be f ,ð1 À bÞ + ð1 À f Þ,a, and the fraction of SNPs identified as being valid is For N replications, the proportion of false positives supported in all replications would be a N , and for true positives ð1 À bÞ N .So, the fraction of SNPs we would expect to be true positives across N replications would behave as z The chances that replication will be primarily populated with true positives are when ð1 À f Þ,a N ( f ,ð1 À bÞ N .The fraction f of biologically active SNPs is likely very small compared to 1 for Alzheimer's.So, the number of required replications required is where Cross-validation is applied heavily in our ''RubricOE'' pipeline.

Analysis of RubricOE output
RubricOE feature selection assigns Youden scores to the ranked feature sets.In the final output, RubricOE presents a set of SNPs, each with an associated number of replications in which the SNP appeared (support).We apply a logistic regression to each SNP predicting the phenotype to evaluate OR's and p values, and contrasted predictivity of the entire set, of those that were supported across all replications, and of the most significant SNPs, where predictivity was evaluated using linear kernel SVMs from sklearn. 16Genes associated with the SNPs were obtained from the appropriate version of the gene location table from https://www.ncbi.nlm.nih.gov/genome for GRCh38.

Analysis of LDpred2 output
LDpred2 11 performed quality control (QC) to the imputed set, including PCA and LD filtering, and produced a set of b scores describing the association between the selected 22,506 SNPs and the phenotypes, applied to compute the individual's phenotype scores.LDPred2 offers predictivity measures of correlation and AUC between the individual's predicted phenotype scores and their actual phenotypes, as well as F 1 .
LDpred2 assigns weights to all SNPs using a Gibbs sampler.LDpred2 explores the space of hyperparameters in terms of sampled grids of hyperparameter values.LDpred2 presents weights, b, for each of the full set of SNPs.The ''auto'' option scores all the SNPs that passed QC.A histogram of the numbers of SNPs vs. their scores shows a very large spike near neutral b = 0. Since b = 0 represents effects close to neutrality, and a large number of SNPs are essentially neutral, exclusion of SNPs with b % 2310 À 4 does not affect risk scores for any given individual.We selected all SNPs from any of the 50 bins that represented less than 2310 À 4 of the SNPs.For the dataset, MAF was filtered at 0.05, LD filtered with window size of 50 kb, step size 5, r 2 threshold of 0.5, and HWE threshold of 1 3 10 À 12 .Beagle was applied for imputation. 18,19

RubricOE feature selection
Ranking was based on the estimation of the contribution of each SNP to heritability in a linear ridge regression, as described previously.Ranking-defined nested subsets of features were scored by RubricOE, using a Youden index, 12 J = sensitivity À ð1 À specificityÞ, essentially a measure of the amount predicted above what would be expected by chance.In AD, chromosome (chr) 19 contains APOE, whose APOE4 genotype (joint genotype of rs429358 and rs7412) is the most strongly associated with disease expression, with a number of linked and related variants in APOE and nearby genes.Therefore, chr 19 is guaranteed to have a relatively small set of very strong SNP associations.Chr 11 had a number of relatively weak associations that appeared in a meta-analysis with 74,046 samples, 20 some of which were supported in a much larger meta-analysis with 788,989 samples. 21We applied RubricOE to chromosomes (chrs) 11 which has had four variants associated with it in large metastudies, 20,22 and to a shuffled set of chr 19 to contrast with chr 11's weaker SNPs.The Youden indices for chrs 11 and 19 are displayed in Figure 2. The extraction of a Youden curve for a shuffled set was not performed.
Initially, the Youden scores yield a bimodal profile, rapidly accumulating predictivity with included features along the increasing abscissa, up to a local maximum beyond which scores temporarily decrease (the ''shoulder''), followed by a larger swell of predictivity scores, as seen in Figure 2. The initial growth for chr 11 is smaller than that for chr 19, suggesting contributions from APOE and linked alleles.Those left shoulders are presumed to show enrichment in true positives compared to the rest of the SNPs, which we take as enriched in true positive candidates for RubricOE input.

LDpred2 feature selection
The histograms (Figure 3) of the LDpred2 Auto scores for the AD data (Figure 3) and shuffled data (Figure 3) show a very striking spike centered around a neutral score b = 0 for the unshuffled set, compared to the shuffled set.These weak spike SNPs contribute very little to the predictive value for any given individual.We also note the strength of the signal represented by the b 0 s is much larger in the unshuffled data compared to the shuffled set.We selected the SNPs outside of the central spike in bins representing less than 2310 À 4 of the total number of SNPs, including a number of SNPs that were not genome-wide significant.

Stable set analysis
We sought to apply a similar analysis for both RubricOE and LDpred2.In both cases, the output is expressed in terms of ranked features.We applied logistic regression to measure the significance of these features, and used SVMs to measure the predictivity of the selected feature sets, with F 1 being used as the metric for comparison.We tested predictivity of real data vs. shuffled data, and for a strongly associated chromosome (chr 19) and a weakly associated chromosome (chr 11).We also tested predictivity of the whole selected subsets against the significant SNPs within the selected subsets.
RubricOE assigns to each SNP, selected in the aforementioned feature selection process, the number of stable sets that the SNP was found under cross-validation.The ''stable sets'' are selected according to the number of replications the features were included, which we call ''support.''The features in Figure 4 were ranked by support, displayed in green, and shown in the right-hand scale.The log of the p value computed by logistic regression is displayed in red, according to the scale on the left side.This figure therefor contrasts significance vs. support.Figure 4F serves as a random control by analyzing randomized data.LDpred2's plot is identical in form, except b replaces support (Figure 4E).The relationship between these SNPs and their p values is displayed in Figure 4.
Chr 19 shows decreasing significance as support declined from a maximum of 50 (Figure 4A) with a rough correlation between them.Specifically, the strength of the log (p values) and density of significant SNPs tend to trend with support, being quenched in low-support SNPs.By contrast, chr 11 has no support reaching a level of 50, and the relationship between significance and support is relatively uncorrelated (Figure 4B). Figure 4D shows RubricOE's result run on the data preprocessed for LDpred2, showing an expanded range of Figure 4A. Figure 4E shows the results for LDpred2, indicating that inflection points in ordered scores appear to occur at spikes in logistic regression p values where the b scores are strongest.These inflections tend to be lost at weaker b weights.This indicates an interaction between the impact of significance on stronger LDpred2 weights.LDpred2 run on shuffled phenotypes from chr 19 shows much weaker association p values, and much less correspondence between p values and the structure of b (see Tables 1 and 2).In summary, the relationship between support and significance shown in Figure 4 is generally that more significant and more strongly significant SNPs appear with larger support, or larger b.It is worth noting the shifts in scales between actual chr 19 samples and shuffled or chr 11 samples, as well as the change in dropoff in À logðpÞ for more weakly supported samples.For example, in Figure 4E, the weaker SNPs are roughly 3310 6 less significant than the SNPs associated with stronger weights b.
Logistic regression results are tabulated together with support for each of the selected SNPs from chr 11 in Table S1 and for chr 19 in Table S4.From the four previously reported SNPs from chr 11, rs3740688, rs7933202, rs3851179, and rs11218343, only two (rs7933202 and rs3851179) passed quality control missingness, and each of those was insignificant in the dataset (p value of 0.1153 and 0.2864, respectively), and was not selected.
The predictive information in the entire selected set was contrasted with subsets with a p value threshold and with a support threshold (Table 2).The p values and supports among the chr 19 selections were much stronger than those of chr 11, so weaker threshold was chosen for chr 11.The predictions are constructed using a linear kernel SVM from sklearn, 16 with predictivity measured by F 1 scores.
In chr 19, the prediction by the most significant SNPs ( À logðpÞ > 50) or by the most strongly supported SNPs (support = 50) yielded as much predictivity as the entire set of predictive SNPs (F 1 z0:62).(Note, F 1 = 2, precision,recall precission+recall : the harmonic mean of precision and recall.)This would suggest that the strongest SNPs carried most of the predictivity for the chromosome.In chr 11, none of the SNPs showed either the p value or the support of the strongest SNPs in chr 19.In this case, the strongest p value and most supported SNPs had a predictivity of F 1 z 0:54, while the whole collection of selected SNPs had a predictivity of F 1 z0:58, stronger than that of the SNPs with support and stronger p values.
LDpred2 predictivity is broken down by jbj bin in Table 1, displaying F 1 , OR, and p value for each bin range.The ORs are computed on the confusion matrix relating the alleles vs. phenotype.The p values are computed from the standard result that the log of randomly sampled ORs will be normally distributed with mean log . The bins show a similar behavior as the Youden curves for RubricOE in that the leading edge shows a predictive peak, a dip, and then a swell from the weaker bins.In contrasting significance and support as plotted in Figure 4, we note that greatest evidence for interactions with significance reflected in the inflection points are occurring in regions that contribute relatively less to the overall risk scores.We note this predictivity structure makes it difficult to implicate specific pathways that might be useful for targeting therapies.

DISCUSSION
AD is notorious for difficulty identifying pathogenic genetic factors promoting the disease in spite of its heritability.The wealth of new information available in next generation sequencing only increases the challenge by raising the threshold required to clearly distinguish true positives from false positives.Approaches to the problem that probe below stiff genome-wide thresholds, seeking to identify falsely rejected SNPs by testing their predictivity, make ML and PRS approaches very attractive.So this paper explores what ML and PRS approaches may have to say about AD and other diseases with similar SNP association profiles.
The answer to that question might best be answered by turning it around: what does the AD SNP association profile reveal about how ML and PRS respond to that kind of signal?In this case, chr 19 is notorious for having a set of strong SNPs generally considered to derive much of their significance from linkage with APOE4.These present a distinctive differentiating character even below genome-wide significance.Meanwhile, chr 11 has relatively weak OR's, 20,22 with little to distinguish weak true SNPs among the false positives from true negatives, and they do not represent predictive power in the selected set.In this sense, they are not distinguishable from overfitting false positives.
A prime question is what relation does predictive SNPs have with possible biological processes in AD or other diseases.In standard linear regression, we show in the Supplement that those SNPs with strongest associations are also those that also account for the most in describing phenotypic variation.To that end, we explored the amount of phenotypic variation accounted by SNPs in ridge regularized regression -heritability, and used those scores in the RubriOE pipeline to rank the SNPs (features) identifying an optimal subset of SNPs to predict phenotypes using Youden indices.Youden scores displayed in Figure 2 reveal a transition from predictivity from stronger SNPs to weaker SNPs, with a separation marked by a notch in the predictivity.It is possible this notch is partly due to the structure of AD associations, which possesses several very strong genotypes, and many weak ones mixed with false positives, with biological effect strength (e.g., odds ratios or regression coefficients) separating the strong vs. weak SNPs.Among weaker SNPs, predictive power of false positives and true positives is essentially indistinguishable.Table 1 shows a corresponding result in LDpred2 b weights, where the strongest weights account for predictivity, which passes through a transition of lowered predictivity below which there is a swell similar to that seen in the RubricOE Youden curves.RubricOE has also been applied to find stable SNPs which led to an increased prediction accuracy of severe COVID-19. 23L and (generic) PRS use cross-validation to exclude false positive SNPs.A short description of the mechanism is presented in the Supplement.RubricOE heavily uses cross-validation to identify stable sets of SNPs.Note that rare alleles will also be lost to cross-validation.
We also provided a null model by shuffling the phenotypes in chr 19 for RubricOE, constructed by shuffling the phenotypes.While spurious associations are expected, RubricOE's SVM cost functions that discriminated the shuffled phenotype tended to be flat, offering little advantage in minimizing the cost by varying the placement of the SVM margin.Convergence was very slow for the shuffled set.Once identified, they were relatively unlikely to be supported in other cross-validation splits.
A histogram of LDpred2's b 0 s histogram (Figure 3) showed much weaker values for b in the shuffled set compared to the unshuffled signal.More, the central spike, marking neutral SNPs, was narrower, but held a higher proportion of SNPs in the real data than in the shuffled set.Measuring association significance with logistic regression, Figure 4 shows the regression p values against RubricOE's stability support (number of cross-validation runs that each SNP appeared in the final results) and LDpred2's b weights.For chr 19, RubricOE's most strongly supported SNPs associate with the most significant SNPs (Figures 4A and 4D).The appearance of the association in LDpred2 (Figure 4E) shows strong inflection points from stronger to weaker b weights at the significant SNPs.The relationship between association and support or b weights broke down in chr 11 and the shuffled sets (Figures 4B, 4C, and 4F).
We further explored the predictive power of the sets selected by RubricOE, and those with non-negligible weights from LDpred2.This was done by applying a linear kernel SVM to the sets, and scoring with F 1 .For each of the selected sets, we also scored subsets that showed significant regressions, and subsets with the strongest support or jbj.Table 2 shows results of these tests.In RubricOE results, chr 19, whether applied to raw data or imputed data QCed for LDpred2, showed F 1 scores around 0.6 (noting Figure 4E indicates relatively few significant SNPs were included in the model), whether for the entire set, or for the subset with significant associations identified by logistic regression, or for the subset with the strongest support.Therefore, many of the other SNPs are not contributing significantly more predictive power to the models, but their inclusion does not injure predictivity.They also have persistently survived cross-validation.As such, they represent compatible candidates that may be biologically relevant, but which are not significant in their own right.The RubrcOE and LDpred2 for shuffled phenotypes F 1 's scored around 0.55 regardless of significance and support.Chr 11 showed a stronger F 1 for the whole model, but actually lost F 1 for more strongly supported or more significant SNPs.Chr 19 SNPs that contributed significantly shared by RubricOE and LDpred2 listed in the Supplement include APOE, APOC, and NECTIN2.
RubricOE is relatively expensive to run due to its heavy use of nested cross-validation without offering large improvements to F 1 over LDpred2.If predictivity was the end of the story, there might be little value to use it.However, it offers a very different model for SNP selection, finding a much more concentrated set of SNPs that yield competitive predictivity.This may be valuable for researchers seeking to implicate relevant pathways for therapeutic targets.It also offers a different dissection of the significance/predictivity from LDPred2, which may help inform researchers of the shape of that space in general in a broader sense.
We conclude that association determined significant SNPs substantially accounts for predictivity of SNP sets selected by ML and PRS approaches.More, the contributions from true positive SNPs with weak associations are both statistically indistinguishable from true negative SNPs, but the tendency for these SNPs to spline, or overfit, predictivity leaves a burden in the end to sort out which of the identified SNPs may be relevant to disease processes.However, the non-significant SNPs that are identified by these approaches tend to have been consistently replicated in RubricOE and LDpred2, and do not injure predictivity by their inclusion.They represent additional candidates that are compatible with the significant SNPs, and may reflect relevant biological processes.Lastly, defining relevance of SNPs in terms of their statistical significance together with their contributions to predictivity helps to understand what ML methods in general (including neural nets), and PRS and support vector machine pipelines specifically may offer to understand AD, as well as to set reasonable expectations for what ML has to say.

Limitations of the study
The primary purpose of this study is to connect both statistical significance and predictivity and their interaction in the evaluation of relevant features for methods seeking to probe below genome-wide significance.We probed two methods that exemplify ML approaches -SVMs as embodied in our RubricOE pipeline and LDpred2.These are certainly not exhaustive, but do represent some extreme points.Other ML methods do exist, and some of them include both significance and predictive components, such as random forests.The study reported here was intended to be demonstrative more than exhaustive.
AD is challenging since so many of its SNPss associate so weakly with disease, 2,21,22 and heritability has not been fully accounted for.WGS data sampled from more diverse populations offer a chance to contain those features, but identifying them is difficult.The statistical power required for identification of significant association between Alzheimer's disease and SNPs required 700K in imputed metastudies with custom validations.However, the WGS data available provided only around 4K samples with much weaker statistical power, while presenting a vastly expanded number of multiple tests, resulting in much more demanding significance thresholds.This drives the question of what ML methods are probing into features far below genome-wide significance.The methods we selected approach some of those issues in very different ways.Still, pre-analysis validation filters much of the more rare alleles that might be players in inducing Alzheimer's.Failure to filter, and to impose cross-validation support will result in overfitting which is the result of being unable to distinguish between real biological features, or statistical artifacts virtually guaranteed to be present in these sample sizes.This is a major limitation of this study -yet it was also a primary motive for exploring how these methods performed in probing sub-genome-wide significant data.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following: We now consider the sets thus taking only genotype and phenotype information for which sample IDs are available.
Since we have an ordering on L complete , we can refer to the first, second, etc. sample, both in terms of genotype or phenotype information, by considering for genotype data the vectors X complete ½i, and Y complete ½i for phenotype data.
For rubricOE, X complete and Y complete are the starting data (here distinguished from the originally unmatched phenotype-genotype data that potentially has more samples but no labels or viceversa).To further decouple possible identifying characteristics of the data, we do not use sample IDs beyond this point.
We then split X complete and Y complete into pairs X working , X validation and Y working , Y validation , respectively, where w and v are short for working and validation data.Other than knowing the IDs of samples in the validation set, rubricOE does not have information on those samples, and will only use X working ; Y working for all computations.Let us also denote by n features the number of columns of X working , and we define I working = f0; 1; .; X working g as the set of indices and I features = f0; .; n features À 1g as the features (or columns) indices.
Assume that we have defined

# :
We now describe rubricOE's flow (see Figure S1C), with the following steps performed in each iteration.
(1) Generate n folds splits ðI train ; I test Þ of I working .These splits are generated by randomly choosing s test , I working elements of I working to serve as indices of test elements and, for each fold, we define (2) in X train = X working ½I tra (3) st X test = X working ½I te (4) in Y train = Y working ½I tra (5) st Y test = Y working ½I te (6) (note that we are omitting indices indicating current folds to reduce notation clutter).In addition we require that these sets have the same proportion of case/control samples in g Y workin .(7) For each fold, we fit a regressor (Ridge regression with error bars, in our particular case) to ðX train ;Y train Þ.As a result we obtain for 0 % i < n folds , 0 % j < n feat coefficients c i;j > 0 that represent the importance ascribed to each one of the features by the regressors.(8) Define c j = P i c i;j =n folds .We assign importance in decreasing order to each feature by considering a stable permutation s : I features / I features such that sðj 1 Þ > sðj 2 Þ if and only if c j1 < c j2 .The interpretation of this is that feature j is the sðjÞ-th feature ranked by importance in decreasing order, according to the regressor.Equivalently, feature s À 1 ðjÞ is the j-th feature ranked by importance in decreasing order.(9) We consider a nested sequence of subsets of features F 0 ; .; F nsteps À 1 .For 0 % i < n increment À 1 these sets are defined by

Figure 1 .
Figure 1.Overview of RubricOE pipelineWe begin by splitting the entire dataset into working and validation datasets, then performing feature ranking and scoring multiple times across multiple splits and aggregate results, and finally we validate.The bottom right arrow indicates that the output of the pipeline could be used for Gene Ontology computations.

Figure 2 .
Figure 2. Youden curves for chromosomes 11 and 19 based on heritability ordering from ridge regression

Figure 4 .
Figure 4. Logistic Regression computed p values and support of selected features ranked by support for chrs 19 and 11, respectively (A) RubricOE Chr 19 with shuffled phenotypes, (B) RubricOE Chr 11, (C) RubricOE on Chr 19 with shuffled phenotypes, (D) RubricOE on Chr 19 with LDpred2 QC applied, (E) LDpred2 Chr 19, and (F) LDpred2 on Chr 19 shuffled."Support" is the number of RubricOE final cross-validation replications showing the SNP for figures (A-D); we use LDpred2's b as "support" in figures (E and F).
17ta were acquired on June 26, 2021 from the Alzheimer's Disease Sequencing Project repository at National Institute of Aging Alzheimer's Data Storage Site (https://www.niagads.org/adsp/),comprised of data not excluded for use by commercial entities,17release R3 from 14474 subjects consenting to access by commercial entities, excluding family set data, retaining 78815437 SNPs marked as ''PASS'' by variant quality score recalibration in the GATK suite.SNPs showing systematic missingness were identified and filtered.Phenotypes available in the dataset include ethnicity, race, age, sex, AD status, whether diagnosed on enrollment, acquired it during the study, BRAAK stage, and the APOE4 genotype (determined separately from the WGS analysis).We retained age, sex, and AD status.Females comprised 63.3%, 6,446 of the samples, with 45.9%, 4,365 AD diagnosed subjects.The average age of the subjects was 75.4 years, with a standard deviation of 8.9 years.

Table 1 .
jbj histogram bin predictive measures F 1 , OR, and p values

Table 2 .
F 1 for predictions of AD from RubricOE SNPs

TABLE
Note that rows in X original and Y original are not necessarily in the same order; furthermore: on plink files row contents and their associated sample IDs are kept in separate files, albeit with enough information to define l X original .We assume that if the ID information is thought of as a pair of sets, then L complete = LX original XLY original sB and we impose a fixed ordering on L complete , so that we can enumerateL complete = n l 0 ; l 1 ; .; l n samples À 1Here is a diagram of what we have so far:L complete / LX original / d RESOURCE AVAILABILITY B Lead contact B Materials availability B Data and code availability d METHOD DETAILS B Detailed description of RubricOE workflow o :

( 1 )
Ranking number of folds ds n fol (2) Ranking test size proportion <10 < s test (3) Scoring number of folds ds m fol (4) Scoring test size proportion <10 < t test (5) Number of curve steps n steps ,