A Machine Learning Model to Aid Detection of Familial Hypercholesterolemia

Background People with monogenic familial hypercholesterolemia (FH) are at an increased risk of premature coronary heart disease and death. With a prevalence of 1:250, FH is relatively common, but currently there is no population screening strategy in place and most carriers are identified late in life, delaying timely and cost-effective interventions. Objectives The purpose of this study was to derive an algorithm to identify people with suspected monogenic FH for subsequent confirmatory genomic testing and cascade screening. Methods A least absolute shrinkage and selection operator logistic regression model was used to identify predictors that accurately identified people with FH in 139,779 unrelated participants of the UK Biobank. Candidate predictors included information on medical and family history, anthropometric measures, blood biomarkers, and a low-density lipoprotein cholesterol (LDL-C) polygenic score (PGS). Model derivation and evaluation were performed in independent training and testing data. Results A total of 488 FH variant carriers were identified using whole-exome sequencing of the low-density lipoprotein receptor, apolipoprotein B, apolipoprotein E, and proprotein convertase subtilisin/kexin type 9 genes. A 14-variable algorithm for FH was derived, with an area under the curve of 0.77 (95% CI: 0.71-0.83), where the top 5 most important variables included triglyceride, LDL-C, apolipoprotein A1 concentrations, self-reported statin use, and LDL-C PGS. Excluding the PGS as a candidate feature resulted in a 9-variable model with a comparable area under the curve: 0.76 (95% CI: 0.71-0.82). Both multivariable models (with and without the PGS) outperformed screening prioritization based on LDL-C adjusted for statin use. Conclusions Detecting individuals with FH can be improved by considering additional predictors. This would reduce the sequencing burden in a 2-stage population screening strategy for FH.

Familial hypercholesterolemia (FH) is an autosomal dominant disorder caused by variants in the low-density lipoprotein receptor (LDLR), apolipoprotein B (APOB), proprotein convertase subtilisin/kexin type 9 (PCSK9), or apolipoprotein E (APOE) genes. It is characterized by elevated low-density lipoprotein cholesterol (LDL-C) concentration and premature coronary heart disease (CHD). 1 FH-causing variants are found in about 1 in 250 individuals (95% CI: 1:345-1:192) 2 ; however, the condition remains highly underdiagnosed worldwide, with only an estimated 1% to 10% of cases diagnosed. 3,4 Affected individuals are at increased risk of premature CHD due to lifelong exposure to elevated levels of LDL-C, so early initiation of lipid-lowering treatment is paramount for risk management. 3,6-8
Currently, patient diagnosis often happens after the development of CHD symptoms, or by opportunistic measurement of the lipid profile at the discretion of clinicians. Diagnosis is made using tools such as the Dutch Lipid Clinical Network and the Simon Broome criteria, which were not designed to be used as population screening tools. 1 In 2016, Wald et al 9 suggested screening children at 15 months of age by measurement of total cholesterol or LDL-C to systematically identify index monogenic FH cases in the general population as a prelude to testing parents and other family members. Futema et al 10 showed that measurement of LDL-C alone at age 9 may be insufficiently accurate in distinguishing FH-variant carriers from those with elevated cholesterol as a consequence of diet and lifestyle factors or carriage of a high burden of common cholesterol-raising alleles, and suggested adding a confirmatory targeted-sequencing step to reduce the number of false positive cases detected.
The increased availability of routine health checks in adults, either through workplace schemes or local health care providers, offers an opportunity to systematically identify adult carriers of FH-causing variants. 11 Positioning adult FH screening within routine health checks, which typically record a substantial number of other clinical measurements, offers the opportunity to consider additional predictors for FH.
This may be important because, while the effect of FH on CHD risk is mediated through elevated circulating LDL-C concentration, it is well known that LDL-C concentration is associated with other variables such as blood and liver biomarkers, diet, and common genetic variants. 12 Combining multiple environmental factors and a polygenic score (PGS) for LDL-C-raising genetic variants may improve the detection of people with monogenic FH for prioritization for confirmatory genetic testing. 13,14 This is because individuals with monogenic FH are likely to have a measured LDL-C concentration that is higher than can be accounted for by these other variables.
In the current manuscript we utilize the UK Biobank data to evaluate the detection rate and testing burden of 4 prioritization strategies to identify people with suspected FH-causing variants for confirmatory genetic testing: 1) no prioritization (ie, referring all participants for sequencing); 2) a plasma LDL-C-based prioritization model adjusting for statin treatment; 3) a multivariable machine learning prioritization model with nongenetic variables; and 4) a multivariable machine learning prioritization model which includes a PGS for LDL-C (Central Illustration).

METHODS
AVAILABLE GENOMICS DATA AND FH ASCERTAINMENT. We identified 472,147 UK Biobank participants of White British ancestry (data-field 21000) as part of the approved project identifications 40721 and 44972.1). The pathogenic variant p.Leu167del in APOE associated with FH was extracted. 15 A total of 488 pathogenic and likely pathogenic FH variants were identified (Supplemental Table 2). Additionally, 660 participants were found to carry FH variants of uncertain significance (VUS) (Supplemental Table 3). These were excluded from the analysis because more evidence is required to interpret the effect of those VUS.
LDL-C PGS GENERATION. We next generated a PGS for LDL-C concentration using an independent data subset of 173,672 White British participants without lipid-lowering medication or WES data (Supplemental Figure 1). To reduce the number of potentially redundant variants and optimize LDL-C prediction, we next applied a least absolute shrinkage and selection operator (LASSO) regression algorithm using the biglasso package in R. 17 The degree of penalization was determined through 15-fold cross-validation, maximizing the explained variance (R²), which resulted in a 1,466 genetic variant LDL-C PGS.
CHD and statin use, family history of CHD and alcohol use, and family history of CHD and hypertension. The limited missing data (Supplemental Table 4) were singly imputed using the R package MICE. 18 Model derivation was performed using the WES data, with an 80% training split of 111,824 subjects and the remaining 20% held out as testing data (27,955 subjects, including 93 carriers) to obtain an unbiased evaluation of model performance (Supplemental Figure 1, Supplemental Table 5). To prevent potential model instability, highly correlated (ie, multicollinear) variables were removed. These included apolipoprotein B and total cholesterol (Supplemental Figure 2).
Variables were standardized to mean 0 and standard deviation (SD) 1 (Supplemental Tables 6 and 7).
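As a rough illustration of the two ingredients used in this section, standardization to mean 0 and SD 1 and a LASSO (L1) penalty on a logistic regression, the following Python sketch fits a penalized model to a small toy dataset. It is not the authors' implementation: the study used the biglasso package in R with 15-fold cross-validation, whereas this sketch swaps in a plain proximal gradient solver with a fixed penalty, and the function names, step size, and toy data are all assumptions made for illustration.

```python
import math

def standardize(column):
    """Rescale a list of numbers to mean 0 and (population) SD 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / sd for x in column]

def soft_threshold(value, penalty):
    """Proximal operator of the L1 norm: shrink toward 0 and clip at 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def fit_lasso_logistic(rows, labels, penalty, step=0.5, n_iter=2000):
    """L1-penalized logistic regression by proximal gradient descent.

    rows: list of standardized feature tuples; labels: list of 0/1.
    The intercept is left unpenalized.
    """
    n, p = len(rows), len(rows[0])
    intercept, coefs = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Prediction error (predicted probability minus label) per subject.
        errors = []
        for row, y in zip(rows, labels):
            z = intercept + sum(c * x for c, x in zip(coefs, row))
            errors.append(1.0 / (1.0 + math.exp(-z)) - y)
        intercept -= step * sum(errors) / n
        # Gradient step on each coefficient, followed by soft-thresholding.
        coefs = [
            soft_threshold(
                c - step * sum(e * row[j] for e, row in zip(errors, rows)) / n,
                step * penalty,
            )
            for j, c in enumerate(coefs)
        ]
    return intercept, coefs

# Hypothetical toy data: two features and a binary outcome.
x1 = standardize([1, 2, 3, 4, 5, 6, 7, 8])
x2 = standardize([3, 1, 4, 1, 5, 9, 2, 6])
rows = list(zip(x1, x2))
labels = [0, 0, 0, 1, 0, 1, 1, 1]

_, weak_penalty_coefs = fit_lasso_logistic(rows, labels, penalty=0.01)
_, strong_penalty_coefs = fit_lasso_logistic(rows, labels, penalty=1.0)
print("weak penalty:  ", weak_penalty_coefs)
print("strong penalty:", strong_penalty_coefs)
```

With a strong enough penalty every coefficient is held at exactly zero, which is the mechanism by which LASSO discards predictors; in practice the penalty would be tuned by cross-validation rather than fixed by hand.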
Finally, we applied a binomial regression model with LASSO penalization to derive a discrimination-optimized FH prediction model. Specifically, optimal penalization was determined through 15-fold cross-validation maximizing the area under the receiver operating characteristic curve (AUC). 17
EVALUATING THE FH SEQUENCING STRATEGIES THROUGH DECISION CURVE ANALYSIS. We next determined at which probability threshold the net benefit of the various models was larger than the "sequence all" strategy (Figure 3). The net benefit of the "sequence all" strategy was lower than that of the other models tested at a threshold of 0.0013 (0.13%). This implies that model-based prioritization for confirmatory FH sequencing is more beneficial if one decided to screen 1/0.0013 = 769 or more people to detect one FH case. Irrespective of the probability threshold, the multivariable machine learning models had a larger net benefit than the LDL-C adjusted for statin use model. At a threshold of 0.0050 (0.5%), the multivariable model with the LDL-C PGS had the largest net benefit out of all the models tested (Figure 3).
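The quantity compared in a decision curve analysis can be sketched in a few lines of Python. The sketch assumes the standard definition of net benefit (true positives minus false positives weighted by the odds of the probability threshold, both per person assessed); the text does not spell out the exact formula used, and the function names are hypothetical. Only the "sequence all" strategy is evaluated, using the testing-data counts quoted earlier (93 carriers among 27,955 participants), because per-threshold counts for the models are not given in the text.

```python
def net_benefit(true_pos, false_pos, n_total, threshold):
    """Net benefit of referring everyone at or above a probability threshold.

    Assumes the standard definition: TP/N - FP/N * (pt / (1 - pt)).
    """
    odds = threshold / (1.0 - threshold)
    return true_pos / n_total - (false_pos / n_total) * odds

def net_benefit_sequence_all(n_cases, n_total, threshold):
    """'Sequence all': every carrier is a true positive and every
    noncarrier is a false positive."""
    return net_benefit(n_cases, n_total - n_cases, n_total, threshold)

# Testing data described in the text: 93 carriers among 27,955 participants.
N_CASES, N_TOTAL = 93, 27_955

for pt in (0.0013, 0.0050, 0.0060):
    nb = net_benefit_sequence_all(N_CASES, N_TOTAL, pt)
    print(
        f"threshold={pt:.4f}  sequence-all net benefit={nb:.5f}  "
        f"(1 case per {1 / pt:.0f} people sequenced)"
    )
```

A model-based strategy is preferred at a given threshold when its own net benefit, computed from its true and false positive counts at that threshold, exceeds the value returned for "sequence all".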

PRIORITIZING INDIVIDUALS FOR FH GENOMIC TESTING IN A 2-STAGE POPULATION SCREENING STRATEGY. As an illustrative example, we evaluated the impact of a 2-stage population screen for FH where the second stage consisted of targeted sequencing of FH variants, comparing the multivariable model with a PGS to the statin-adjusted LDL-C model (Supplemental Figure 5, Supplemental Table 8). In this example, we employed a common probability threshold of 0.006 (0.6%), which falls within the plausible range found using the decision curve analysis (Figure 3). On average, 7 additional FH carriers would be detected for 100,000 individuals screened when using the multivariable model with
Furthermore, if we assume that FH has a population prevalence of 1 in 286 (equal to our cohort's prevalence) and that one FH case has on average 1.5 first-degree relatives ([2 children + 1 sibling]/2) who are also affected by FH (discovered through cascade testing), 21 then overall one FH case would be identified for every ~219 people screened when using the multivariable model with LDL-C PGS, compared to one FH case for every ~228 individuals screened with the LDL-C and statin use model.
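The cascade-testing arithmetic in this paragraph can be made explicit with a short Python sketch. The prevalence (1 in 286) and the relative count ([2 children + 1 sibling]/2 = 1.5) come from the text; reading the division by 2 as a 50% chance of carriership per first-degree relative is an interpretation, and the sensitivities below are purely hypothetical placeholders, since the detection rates at the 0.006 threshold are reported in the supplement rather than here.

```python
def relatives_per_index_case(n_children=2, n_siblings=1, carrier_fraction=0.5):
    """Expected affected first-degree relatives per index case,
    assuming half of the children and siblings carry the variant."""
    return (n_children + n_siblings) * carrier_fraction

def screened_per_case_found(n_screened, index_cases, relatives_per_index):
    """People screened per FH case identified, counting index cases plus
    relatives found through cascade testing."""
    total_cases = index_cases * (1 + relatives_per_index)
    return n_screened / total_cases

relatives = relatives_per_index_case()   # (2 + 1) / 2
carriers_per_100k = 100_000 / 286        # expected carriers at a prevalence of 1 in 286

# HYPOTHETICAL sensitivities, for illustration only (not taken from the paper).
for label, sensitivity in (("strategy A", 0.50), ("strategy B", 0.52)):
    index_cases = carriers_per_100k * sensitivity
    ratio = screened_per_case_found(100_000, index_cases, relatives)
    print(f"{label}: one FH case per {ratio:.0f} people screened")
```

Because each index case brings 1.5 further cases with it, a small gain in first-stage sensitivity is multiplied by 2.5 in the final yield.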

DISCUSSION
In the current manuscript, we derived a multivariable machine learning model to identify people with
Above a classification threshold of 0.0013 (0.13%), the multivariable algorithm that contained the LDL-C PGS showed the highest net benefit out of all the models tested (Figure 3) and was able to decrease the number of subjects referred for genetic sequencing. As an example, for a predicted probability threshold of carrying a variant for monogenic FH of 0.006, the number referred fell from 100,000 individuals without any prioritization, to 14,730 with prioritization using the LDL-C and statin use model, and to 12,033 with prioritization using the multivariable model; this is equivalent to an approximately 18% decrease in individuals needing to be sequenced between the last 2 models (Supplemental Figure 5).
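The approximately 18% figure follows directly from the referral counts quoted above, as the following Python sketch shows. The helper names are hypothetical, and the per-million scaling simply assumes that the same referral rates would hold in a larger screened population.

```python
def relative_reduction(baseline, alternative):
    """Fractional decrease in referrals when moving from baseline to alternative."""
    return (baseline - alternative) / baseline

# Referrals per 100,000 people screened at a threshold of 0.006 (from the text).
SCREENED = 100_000
REFERRED_LDL_STATIN = 14_730
REFERRED_MULTIVARIABLE = 12_033

def sequencing_tests_avoided(population):
    """Sequencing tests avoided by the multivariable model relative to the
    LDL-C and statin use model, assuming referral rates scale linearly."""
    return (REFERRED_LDL_STATIN - REFERRED_MULTIVARIABLE) * population // SCREENED

reduction = relative_reduction(REFERRED_LDL_STATIN, REFERRED_MULTIVARIABLE)
print(f"Relative reduction in referrals: {reduction:.1%}")
print(f"Tests avoided per million screened: {sequencing_tests_avoided(1_000_000):,}")
```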
These differences become more substantial when extrapolated to a population-wide scale comprising millions of participants screened. The choice of screening method for FH is very much dependent on the threshold chosen (Figure 3) and on the resources available. This manuscript explored the differences in performance of possible FH screening strategies in adults, and our results provide support for opportunistic screening and seeding of cascade testing for FH using the multivariable algorithms derived here, which could be integrated within existing health checks offered by employers or local health care providers (eg, the National Health Service vascular checks in the UK). 11 Previously, Banda et al 22
Our multivariable model included 3 terms for LDL-C (LDL-C itself, LDL-C squared, and an interaction with statin prescription), which combined make it the most important predictor. Additionally, our model identified novel predictors for FH such as triglyceride and Apo-A1 concentrations, with triglycerides having the strongest association per SD (odds ratio 0.60). In our study, FH carriers had significantly lower triglyceride concentrations than noncarriers (Table 1), which resulted in a negative association, indicating that triglyceride concentrations can be useful in identifying individuals who have hypercholesterolemia due to lifestyle factors or other causes (eg, combined hyperlipidemia) rather than an FH-causing variant. We also found that higher concentrations of Apo-A1, a protein found on HDL particles, were associated with a decreased probability of FH. Apo-A1 concentration can also be readily replaced by HDL-C concentration without impacting model performance, as shown in the Supplemental Results. Finally, we note that our multivariable FH model retained a squared term for LDL-C, suggesting that LDL-C is not linearly related with carrying an FH variant, but rather has a quadratic relationship (Supplemental Figure 4). The variables included in our multivariable algorithm should not be interpreted as causal risk factors for monogenic FH; they simply help to distinguish nonmonogenic sources of variation in LDL-C concentrations from monogenic causes (as discussed in more detail above for triglyceride concentrations). This also provides the rationale for including an LDL-C PGS in the model: a large discrepancy between predicted LDL-C concentrations (by the LDL-C PGS) and observed LDL-C concentrations might be indicative of FH carriership, 13,14 demonstrated here by a negative coefficient for LDL-C PGS in the model (Supplemental Table 7). We note that a previous LDL-C PGS by Wu et al 24

An initial list of 10,137 genetic variants with a P value threshold of <5 × 10⁻⁴ was obtained from the Global Lipids Genetics Consortium genome-wide association study summary statistics for LDL-C. 16

CENTRAL ILLUSTRATION A New Prediction Model to Improve the Detection of Familial Hypercholesterolemia Variant Carriers Was Developed in This Study Using Machine Learning (Least Absolute Shrinkage and Selection Operator)
Gratton J, et al. JACC Adv. 2023;2(4):100333.
This model improves the prioritization of individuals for familial hypercholesterolemia-variant genomic sequencing confirmation. The model, developed and derived in the UK Biobank (with 139,779 whole-exome sequenced participants including 488 familial hypercholesterolemia variant carriers), included 14 predictor variables such as low-density lipoprotein cholesterol, apolipoprotein A1, triglyceride, alanine aminotransferase, C-reactive protein concentrations, statin use, low-density lipoprotein cholesterol polygenic score, age, diastolic blood pressure, body mass index, prevalent type 2 diabetes, family history of coronary heart disease, and interaction terms. It performed better than a model using low-density lipoprotein cholesterol or low-density lipoprotein cholesterol and statin use only. The green icons represent unaffected individuals, while those in orange represent familial hypercholesterolemia carriers. ALT = alanine aminotransferase; Apo-A1 = apolipoprotein A1; BMI = body mass index; CHD = coronary heart disease; CRP = C-reactive protein; DBP = diastolic blood pressure; LASSO = least absolute shrinkage and selection operator; LDL-C = low-density lipoprotein cholesterol; PGS = polygenic score; T2D = type 2 diabetes.

A first multivariable model was derived with nongenetic variables only (ie, without LDL-C PGS), and a second model was generated with the inclusion of LDL-C PGS. Model performance was evaluated using the 20% testing data based on its discriminative ability (AUC), appropriate calibration of predicted and observed probability of having an FH variant (using calibration plots, calibration-in-the-large [CIL], and calibration slope [CS]), and classification metrics (sensitivity, specificity [or its complement, the false positive rate], positive predictive value, and negative predictive value).
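The evaluation metrics listed above have simple definitions that can be sketched in Python. The toy labels and predicted probabilities below are hypothetical, the function names are illustrative, and the calibration helper is only a crude stand-in (observed event rate minus mean predicted probability) for calibration-in-the-large; it does not reproduce the calibration slope or the confidence intervals reported in the study.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count true/false positives and negatives when referring p >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        referred = p >= threshold
        if referred and y == 1:
            tp += 1
        elif referred:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def classification_metrics(y_true, y_prob, threshold):
    """Assumes both classes are present and at least one person is referred."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

def auc(y_true, y_prob):
    """Probability that a random carrier is ranked above a random noncarrier
    (ties count as one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def mean_calibration_gap(y_true, y_prob):
    """Observed event rate minus mean predicted probability (crude CIL proxy)."""
    return sum(y_true) / len(y_true) - sum(y_prob) / len(y_prob)

# Hypothetical toy data: 1 = carrier, 0 = noncarrier.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.10, 0.20, 0.30, 0.40, 0.35, 0.60, 0.70, 0.90]

metrics = classification_metrics(y_true, y_prob, threshold=0.5)
print(metrics)
print("AUC:", auc(y_true, y_prob))
print("Calibration gap:", mean_calibration_gap(y_true, y_prob))
```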

FIGURE 2 Discrimination and Calibration of a Multivariable Algorithm Predicting FH Carriership Using Independent Testing Data

FIGURE 1 Feature Importance of the Variables Retained by LASSO Regression Predicting Monogenic FH, and the Density Predicted Probability Distributions From This Model for Unaffected and Affected FH Individuals in White British Participants of the UK Biobank

SENSITIVITY ANALYSES. We further investigated whether the model was better at predicting APOB or LDLR FH-causing variants. Using the test data, the AUC for predicting APOB FH-causing variants (which in 98% of the cases was the p.Arg3527Gln change) was 0.81 (95% CI: 0.69-0.94), and for predicting LDLR FH-causing variants was 0.76 (95% CI: 0.70-0.82). Additionally, we explored model performance across age groups (Supplemental Results), which did not differ significantly. Finally, we considered a model with HDL-C concentration instead of Apo-A1, because the former is more readily available in most clinical settings; this model performed comparably to our original multivariable model (test data AUC of 0.77 [95% CI: 0.71-0.82]).
LDL-C PGS compared to the LDL-C and statin use model.This multivariable model would refer 12,033 individuals (12%) for genetic sequencing, compared to 14,730 (15%) with the LDL-C and statin use model, resulting in an 18% reduction in genetic testing for this specific threshold.

FIGURE 3 Decision Curve Analysis of the Multivariable Models

used a machine learning method to detect monogenic FH cases from electronic health records. While their model showed an impressive AUC of 0.94, one of their most important features was referral to a cardiology clinic, which is in very close proximity to confirmatory FH testing, limiting the model's utility as a prospective tool for FH diagnosis. Besseling et al 23 developed a multivariable model to identify FH carriers validated in study participants consisting of FH cases and their relatives, again limiting applicability to the general population. Our model instead considers FH prioritization in a non-general practice-referred population and is more generalizable as a systematic population screening tool.
had a substantially larger R-squared (0.21 [95% CI: 0.20-0.22]) than reported here (0.14 [95% CI: 0.13-0.15]). Unlike Wu et al, who identified genetic variants from an internal UK Biobank LDL-C genome-wide association study overlapping with the PGS training data, we identified variants based on an independent dataset from the Global Lipids Genetics Consortium, 16 guarding against overfitting

TABLE 1 UK Biobank Participant Characteristics Stratified by Carrying an FH-Causing Variant
IQR). The P values shown in the table are from the Kruskal-Wallis Rank Sum test for continuous variables and from the Mann-Whitney U test for binary variables. BMI = body mass index; CHD = coronary heart disease; CVD = cardiovascular disease; FH = familial hypercholesterolemia; HDL-C = high-density lipoprotein cholesterol; LDL-C = low-density lipoprotein cholesterol; PGS = polygenic score.