Comparison of blood-based liver fibrosis scores in the Mount Sinai Health System, MASLD Registry, and NHANES 2017–2020 study

Background: Liver fibrosis is a critical public health concern, necessitating early detection to prevent progression. This study evaluates the recently developed LiverRisk score and steatosis-associated Fibrosis Estimator (SAFE) score against established indices for prognostication and/or fibrosis prediction in 4 diverse cohorts, including participants with metabolic dysfunction–associated steatotic liver disease (MASLD). Methods: We used data from the Mount Sinai Data Warehouse (32,828 participants without liver disease diagnoses), the Mount Sinai MASLD/MASH Longitudinal Registry (422 participants with MASLD), and National Health and Nutrition Examination Survey 2017–2020 (4133 participants representing the general population) to compare LiverRisk score, FIB-4 index, APRI, and SAFE score. Analyses included Cox proportional hazards regressions, Kaplan-Meier estimates, and classification metrics to evaluate performance in prognostication and fibrosis prediction. Results: In Mount Sinai Data Warehouse, LiverRisk score was significantly associated with future liver-related outcomes but did not significantly outperform FIB-4 or APRI for predicting any of the outcomes. In the general population, LiverRisk score and SAFE score outperformed FIB-4 and APRI in identifying fibrosis, but LiverRisk score underperformed among participants who were non-White or had type 2 diabetes. Among participants with MASLD, SAFE score outperformed FIB-4 and APRI in 1 of 2 cohorts, but there were generally few significant performance differences between all 4 scores. Conclusions: LiverRisk score does not consistently outperform existing predictors in diverse populations, and further validation is needed before adoption in settings with significant differences from the original derivation cohorts. It remains necessary to replicate the ability of these scores to predict liver-specific mortality, as well as to develop diagnostic tools for liver fibrosis that are accessible and substantially better than current scores, especially among patients with MASLD and other chronic liver conditions.


INTRODUCTION
The global burden of liver disease, particularly liver fibrosis and its progression to cirrhosis and HCC, presents a substantial public health challenge. Early detection of liver fibrosis is critical for preventing disease progression yet remains challenging due to the asymptomatic nature of early-stage fibrosis and the limitations of current diagnostic tools. Liver biopsy is invasive, subject to sampling error, and has large interobserver variability. Elastography-based methods are reasonably accurate but not widely available, whereas existing blood-based tests like the fibrosis-4 index (FIB-4) and aspartate aminotransferase (AST) to Platelet Ratio Index (APRI) have limited accuracy to rule in significant fibrosis. [1,2] The LiverRisk score, derived by Serra-Burriel et al [3] from a prospective international cohort study, is a novel, noninvasive approach for identifying individuals at risk of liver-related morbidity and mortality based on age, sex, and 6 laboratory measurements. Initial validation of the LiverRisk score in 2 cohorts demonstrated superiority in predicting liver stiffness and future liver-related outcomes compared to FIB-4 index and APRI. Importantly, the derivation and validation cohorts comprised primarily White European individuals, and prognostic evaluation was performed only in the UK Biobank, which is subject to a healthy volunteer selection bias. [4] Thus, it is unknown whether LiverRisk score is generalizable to other ethnic groups, health care settings, and individuals with metabolic dysfunction-associated steatotic liver disease (MASLD), currently the most common cause of chronic liver disease. [5] It is also unknown how LiverRisk score compares to steatosis-associated Fibrosis Estimator (SAFE) score, [6] a novel MASLD-specific fibrosis score.
In this study, we address these limitations by conducting a comparison of LiverRisk score, FIB-4 index, APRI, and SAFE score among large, racially and ethnically diverse cohorts, including 32,828 participants without liver disease diagnoses from the Mount Sinai Data Warehouse (MSDW), 422 participants with clinical MASLD diagnoses from the Mount Sinai MASLD/metabolic dysfunction-associated steatohepatitis Longitudinal Registry (MASLD Registry), and 4133 participants from the National Health and Nutrition Examination Survey (NHANES) 2017-2020 cycle, of whom 829 likely have MASLD. MSDW permits prognostic evaluation in a real-world health care population with substantially different demographic composition compared to the initial LiverRisk score validation cohorts; NHANES permits a comparison of the scores among participants representative of the diverse US population; and the MASLD Registry permits a focused assessment of liver fibrosis prediction among patients with MASLD, which is important for monitoring disease progression. Using results from all cohorts, we provide insight into the use of blood-based tests to identify and manage individuals at risk for liver fibrosis.
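For orientation, the two established indices compared in this study have simple closed forms. The sketch below uses the commonly cited definitions of FIB-4 and APRI; the function names, argument names, and the default AST upper limit of normal of 40 U/L are illustrative assumptions, not values taken from this study. The LiverRisk and SAFE scores are not reproduced because their coefficients are not given in this text.

```python
import math


def fib4(age_years: float, ast_u_l: float, alt_u_l: float,
         platelets_1e9_l: float) -> float:
    """FIB-4 index = (age x AST) / (platelets x sqrt(ALT)).

    Age in years, AST and ALT in U/L, platelets in 10^9/L.
    """
    return (age_years * ast_u_l) / (platelets_1e9_l * math.sqrt(alt_u_l))


def apri(ast_u_l: float, platelets_1e9_l: float,
         ast_uln_u_l: float = 40.0) -> float:
    """APRI = (AST / upper limit of normal AST) x 100 / platelets.

    The default upper limit of normal (40 U/L) is an assumption; laboratories
    use different reference ranges.
    """
    return (ast_u_l / ast_uln_u_l) * 100.0 / platelets_1e9_l


if __name__ == "__main__":
    # Hypothetical patient: 50 years old, AST 40 U/L, ALT 25 U/L,
    # platelets 200 x 10^9/L.
    print(fib4(50, 40, 25, 200), apri(40, 200))
```

Both indices rise with AST and fall with platelet count, which is why they tend to rank patients similarly; only FIB-4 also depends on age and ALT.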

Study populations
We compared LiverRisk score, FIB-4 index, APRI, and SAFE score among participants from the Mount Sinai Health System and the NHANES 2017-2020 cycle. We evaluated prognostication among a retrospective cohort of 32,828 nonhospitalized participants from Mount Sinai Data Warehouse (MSDW) without liver disease at baseline. We evaluated fibrosis prediction among 3 cross-sectional cohorts: a cohort of all eligible NHANES participants (NHANES [all]; n = 4133), a subcohort of NHANES participants who likely have MASLD (NHANES [MASLD]; n = 829), and a cohort of 422 participants with MASLD diagnoses from the Mount Sinai MASLD Registry. All research was conducted in accordance with both the Declarations of Helsinki and Istanbul; research in MSDW was exempt from IRB review as all data were deidentified, while use of the MASLD Registry in this study was approved by the Icahn School of Medicine Institutional Review Board under study 22-00080. All participants in the MASLD Registry provided written consent.
MSDW consists of Epic-derived electronic health records for more than 11 million patients from 6 facilities across the Mount Sinai Health System (Mount Sinai Hospital, Mount Sinai Queens, Mount Sinai West, Mount Sinai Morningside, Mount Sinai Brooklyn, Mount Sinai Beth Israel). From MSDW, we identified 49,224 patients with alanine aminotransferase (ALT), AST, gamma-glutamyl transferase (GGT), and platelet measurements available from any single health care encounter between January 1, 2000, and December 31, 2019, as well as random glucose and cholesterol measurements within 1 year prior to this baseline date.
Although fasting glucose measurements were unavailable, our use of random glucose is consistent with Serra-Burriel and colleagues' initial validation in the UK Biobank, which also used random glucose measurements. For patients with complete measurements available from more than one encounter, we used the earliest encounter as the baseline to maximize follow-up time. Of the 49,224 patients, 46,567 were between 19 and 92 years of age, 42,547 did not have exclusionary diagnoses, and 32,828 were not hospitalized at baseline. We did not assess SAFE score in MSDW due to the limited availability of globulin (ie, total protein minus albumin) measurements and because of its intended specificity for patients with MASLD.
The MASLD Registry represents 699 pediatric and adult participants with a clinical MASLD diagnosis being followed at the outpatient liver clinic of any Mount Sinai Health System hospital; 546 of these participants have received transient elastography (FibroScan) measurements as of March 1, 2024. All participants have alcohol consumption below 140 g/wk for women and 210 g/wk for men. In this study, we included 422 of the 546 participants who were between 19 and 92 years of age at the date of their FibroScan measurement and who had all 6 LiverRisk score measurements available within an 18-month window containing this date.
The NHANES 2017-2020 study includes 15,560 participants from across the United States, of whom 9023 had complete transient elastography measurements. A total of 7577 of these participants were between 19 and 92 years of age. We included 4557 of these participants who had all 6 LiverRisk score measurements available and who reported a fasting time ≥ 8 hours. We identified 829 of the 4557 participants as likely having MASLD using a controlled attenuation parameter score cutoff of 302 to define S1 steatosis, which in the study by Eddowes et al [7] optimized sensitivity (0.80) and specificity (0.83). In this subcohort, we included only participants who reported consuming alcohol once a week or less frequently or who reported drinking one or fewer beverages on days they consumed alcohol. All 829 participants met cardiometabolic criteria for MASLD.
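The NHANES inclusion steps described above can be sketched as two filters. The record layout and field names below are hypothetical (they are not NHANES variable names), and treating the age range and the controlled attenuation parameter cutoff as inclusive is an assumption. The cardiometabolic criteria are not modeled.

```python
from dataclasses import dataclass

CAP_CUTOFF_DB_M = 302   # controlled attenuation parameter cutoff for S1 steatosis
MIN_AGE, MAX_AGE = 19, 92
MIN_FASTING_HOURS = 8


@dataclass
class Participant:
    """Hypothetical per-participant record used only for this illustration."""
    age: int
    fasting_hours: float
    has_all_labs: bool            # all 6 LiverRisk score measurements present
    has_elastography: bool        # complete transient elastography measurement
    cap_db_m: float               # controlled attenuation parameter score
    drinks_weekly_or_less: bool   # alcohol once a week or less frequently
    drinks_per_drinking_day: float


def eligible(p: Participant) -> bool:
    """NHANES (all): elastography, age range, complete labs, fasting >= 8 h."""
    return (p.has_elastography
            and MIN_AGE <= p.age <= MAX_AGE
            and p.has_all_labs
            and p.fasting_hours >= MIN_FASTING_HOURS)


def likely_masld(p: Participant) -> bool:
    """NHANES (MASLD): eligible, steatosis by CAP, and low alcohol intake."""
    low_alcohol = p.drinks_weekly_or_less or p.drinks_per_drinking_day <= 1
    return eligible(p) and p.cap_db_m >= CAP_CUTOFF_DB_M and low_alcohol
```

Defining steatosis from an objective measurement rather than from a clinical diagnosis is what lets this subcohort avoid referral-based selection.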

Statistical analyses
We performed Cox proportional hazards regressions and Kaplan-Meier estimates using the Python lifelines (version 0.28.0) package. For Cox regressions, we adjusted for age, gender, and self-reported race/ethnicity, used the date of the most recent health care encounter as the censoring date, and estimated robust errors using the Huber sandwich estimator to account for time-covariate interactions. [10] We calculated metrics using the Python scipy (version 1.12.0) and scikit-learn (version 1.4.1) packages. Except for the cutoff-dependent analyses, where we used the empirical bootstrap to reduce computation time, we calculated 95% CIs for all metrics using the bias-corrected and accelerated procedure with 1000 bootstraps. We compared predictor performance using paired permutation tests with 1000 iterations. [11] We considered p < 0.05 significant.
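As an illustration of the interval estimation mentioned above, the following is a minimal standard-library sketch of an empirical (pivotal) bootstrap confidence interval for AUROC. It is not the authors' code, which used scipy and scikit-learn; the function names, the fixed seed, and the choice to redraw resamples that contain only one class are assumptions made for this example.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def empirical_bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower bound, upper bound).

    The empirical bootstrap takes quantiles of the resampled deviations from
    the point estimate and reflects them around that estimate.
    """
    rng = random.Random(seed)
    n = len(labels)
    point = auroc(labels, scores)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # AUROC needs both classes in the resample
            deltas.append(auroc(ys, [scores[i] for i in idx]) - point)
    deltas.sort()
    lo_delta = deltas[int((alpha / 2) * n_boot)]
    hi_delta = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return point, point - hi_delta, point - lo_delta
```

The bias-corrected and accelerated procedure additionally needs a jackknife pass over all participants to estimate the acceleration term, which is consistent with the cheaper empirical bootstrap being preferred for the cutoff-dependent analyses.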

Prognostication in a real-world health care population
In the MSDW cohort, there were 22,102, 8673, 1512, and 541 participants in the minimal, low, medium, and high-risk groups for LiverRisk score, respectively. Median follow-up periods for liver-related hospitalization, nonliver-related hospitalization, any liver outcome, and cirrhosis were 6.3 (IQR 3.1-9.1), 5.3 (1.2-8.4), 6.1 (2.6-9.0), and 6.3 (3.1-9.1) years, respectively. Participants in the low-risk, medium-risk, and high-risk groups had significantly higher HRs for all 4 outcomes compared to those in the minimal-risk group (Figure 1A-F, Supplemental Table S1, http://links.lww.com/HC9/B16). However, there was a continuous increase in HR from the low to high-risk groups only for the liver-related outcomes and not for nonliver-related hospitalization, replicating the specificity of LiverRisk score for liver-related morbidity. Among 541 participants in the high-risk group, there were HRs of 13.25 (95% CI: 7.26-24.2) for liver-related hospitalization, 3.46 (2.59-4.63) for any liver-related outcome, and 14.64 (8.52-25.18) for cirrhosis. The HR for liver-related hospitalization was significantly lower than that reported in the UK Biobank (> 100).
For FIB-4 index, there were 20,562, 9475, 942, and 1849 participants in the < 1.30, 1.30-2.67, > 2.67-3.25, and > 3.25 groups, respectively, while for APRI, there were 28,671, 2715, 538, and 904 participants in the < 0.5, 0.5-1.0, > 1.0-1.5, and > 1.5 groups (Supplemental Table S1, http://links.lww.com/HC9/B16). Compared to LiverRisk score, both scores had similar event rates in the lowest-risk groups but had substantially more participants assigned to their respective highest-risk groups. Accordingly, for the highest-risk groups of FIB-4 index and APRI, HRs for liver-related hospitalization and any liver outcome were closer to those of the medium-risk rather than high-risk LiverRisk score group (Supplemental Figure S1, http://links.lww.com/HC9/B17; Supplemental Table S1, http://links.lww.com/HC9/B16). However, HRs for cirrhosis were approximately equal for participants in the high-risk LiverRisk score group, > 3.25 FIB-4 index group, and > 1.5 APRI group.
To directly compare LiverRisk score, FIB-4 index, and APRI, we partitioned participants into quintiles separately for each score and examined the HR per quintile increase for each outcome (Supplemental Table S2, http://links.lww.com/HC9/B16). Unlike the UK Biobank, we did not observe significant differences in HRs between the 3 scores for any of the outcomes, with all 3 scores being significantly associated with liver-related hospitalization, any liver outcome, and cirrhosis. Additionally, for nonliver-related hospitalization, neither FIB-4 index nor APRI had HRs significantly > 1, suggesting that specificity for liver-related outcomes is not unique to the LiverRisk score. These results were generally consistent across subsets of ethnicity, gender, and presence of nonliver comorbidities. However, HRs per quintile increase in LiverRisk score for any liver outcome were significantly larger for males compared to females (p = 6.8 × 10⁻⁴) and non-Hispanic White compared to not non-Hispanic White participants (p = 6.3 × 10⁻⁶) (Supplemental Table S2, http://links.lww.com/HC9/B16), whereas there were no significant differences for FIB-4 index for these comparisons (p = 0.53 and 0.07, respectively). Likewise, for liver hospitalization and cirrhosis, there was a nonsignificant trend toward larger HRs per quintile increase in LiverRisk score among male and non-Hispanic White participants. Among participants who reported daily alcohol consumption, while LiverRisk score and APRI were significantly associated with all 3 liver-related outcomes, FIB-4 index was only significantly associated with cirrhosis.
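Partitioning each score into quintiles puts the 3 scores on a common ordinal scale, so that an HR "per quintile increase" is comparable across them. A minimal sketch of rank-based quintile assignment follows; the function name is hypothetical, and breaking ties by input order is an assumption, since the tie-handling rule is not stated in the text.

```python
def quintile_ranks(values):
    """Assign each value a quintile from 1 (lowest) to 5 (highest) by rank.

    Ties are broken by input order because Python's sort is stable.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    quintiles = [0] * n
    for rank, i in enumerate(order):
        quintiles[i] = rank * 5 // n + 1
    return quintiles


if __name__ == "__main__":
    # Hypothetical score values for 10 participants.
    print(quintile_ranks([5, 1, 9, 3, 7, 0, 8, 2, 6, 4]))
```

The resulting quintile number can then enter a Cox model as a single numeric covariate, which is one way to obtain a per-quintile HR.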
Multivariable Cox regressions of all LiverRisk score components with each outcome showed that besides age and GGT, none of the features was consistently associated with all outcomes (Supplemental Table S4, http://links.lww.com/HC9/B16). For example, AST was significantly associated with liver hospitalization (p = 0.003) and cirrhosis (p = 6.62 × 10⁻⁶) but not any liver outcome (p = 0.06). Association of glucose, cholesterol, and GGT with FIB-4 index and of glucose, cholesterol, ALT, and GGT with APRI demonstrated that these features still had significant associations with at least one outcome, suggesting that LiverRisk score components missing in FIB-4 index and APRI are important for prognostication in MSDW.

Fibrosis prediction in the general population
Among all eligible NHANES participants, LiverRisk score and SAFE score significantly outperformed FIB-4 index and APRI in both AUROC and area under the precision-recall curve (AUPRC) for distinguishing stiffness ≥ 8 and ≥ 10 kPa (hereafter referred to as classification tasks), while only SAFE score outperformed FIB-4 index and APRI in AUROC for the ≥ 14 kPa classification task (Table 2; Supplemental Tables S5, S6, http://links.lww.com/HC9/B16). However, there were no significant differences in AUROC between LiverRisk score and SAFE score for any of the 3 classification tasks, and SAFE score outperformed LiverRisk score in AUPRC for all 3 classification tasks.
While LiverRisk score still had consistently higher AUROC and AUPRC than FIB-4 index and APRI among participant subsets, the differences were smaller than among all participants (Supplemental Table S5, http://links.lww.com/HC9/B16). Further, differences in AUROC between LiverRisk score, FIB-4 index, and APRI were not statistically significant for any of the 3 classification tasks among participants who were non-Hispanic Black, Hispanic, who had type 2 diabetes, who reported daily alcohol consumption, or who did not have steatosis (Supplemental Table S6, http://links.lww.com/HC9/B16). In contrast, SAFE score still significantly outperformed FIB-4 index in AUROC among participants who were non-Hispanic Black, Hispanic, or who had type 2 diabetes for the ≥ 8 kPa classification task. SAFE score also significantly outperformed LiverRisk score in AUPRC for at least 1 of the 3 classification tasks in all subsets analyzed except for non-Hispanic Asians.
We assessed threshold-dependent performance for each score while using F1 score, the harmonic mean of precision (positive predictive value) and recall (sensitivity), as an optimization metric (Supplemental Table S7, http://links.lww.com/HC9/B16). For the ≥ 8, ≥ 10, and ≥ 14 kPa classification tasks, F1 score was highest for LiverRisk scores of 7, 7, and 9; FIB-4 indices of 1.45, 2.5, and 3.25; APRI of 0.5, 0.5, and 0.75; and SAFE scores of 100, 150, and 250, respectively. However, real-world cutoff selection requires context of the use case, with each of these cutoffs resulting in a different tradeoff between precision and recall. For example, for the ≥ 8 kPa task, a LiverRisk score cutoff of 7 yields a precision of 0.294 and recall of 0.345, while a SAFE score cutoff of 100 yields a precision of 0.416 and recall of 0.248 (Supplemental Table S7, http://links.lww.com/HC9/B16).
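The cutoff selection described above can be sketched as a search over candidate cutoffs that maximizes F1. This is an illustration rather than the authors' code: the function names are hypothetical, and classifying a participant as positive when the score is at or above the cutoff is an assumption.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(labels, scores, cutoff):
    """Classify score >= cutoff as positive (assumed) and compute the metrics."""
    pred = [s >= cutoff for s in scores]
    tp = sum(1 for y, p in zip(labels, pred) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, pred) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, pred) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


def best_cutoff(labels, scores, candidates):
    """Return the candidate cutoff with the highest F1."""
    return max(candidates,
               key=lambda c: precision_recall_f1(labels, scores, c)[2])
```

Applying `f1_from` to the precision and recall values quoted above gives about 0.317 for the LiverRisk score cutoff of 7 and about 0.311 for the SAFE score cutoff of 100. Nearly equal F1 values can therefore hide quite different precision and recall tradeoffs.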

Fibrosis prediction among participants with MASLD
Among NHANES participants with MASLD, LiverRisk score significantly outperformed FIB-4 index for the ≥ 8 kPa classification task in AUROC and AUPRC, but not among participants who were not non-Hispanic White, who had type 2 diabetes, or who had less than daily alcohol consumption (Tables S8, S9, http://links.lww.com/HC9/B16). For the ≥ 10 and ≥ 14 kPa classification tasks, LiverRisk score did not have a significantly different performance compared to either FIB-4 index or APRI, either among all participants or among participant subsets. While SAFE score outperformed FIB-4 index and APRI in both AUROC and AUPRC among all participants, there were no significant differences among participants who had type 2 diabetes or reported no alcohol consumption. Among MASLD Registry participants, there were no significant differences in AUROC among any of the 4 scores, either among all participants or within participant subsets (Tables S10, S11, http://links.lww.com/HC9/B16). In contrast to both NHANES cohorts, both FIB-4 index and SAFE score outperformed LiverRisk score in AUPRC for all 3 classification tasks, including among Hispanic participants.
This study has several limitations. First, our prognostication evaluation in MSDW was retrospective and included only participants with all 6 LiverRisk score measurements available. Given that AST, ALT, and GGT are not routinely ordered measurements, this likely selected for participants being evaluated for liver disease. In contrast, these measurements were uniformly collected for the LiverRisk score derivation and validation cohorts, creating differences in participant ascertainment. Nevertheless, 69% of MSDW participants had ALT measurements within normal ranges (< 33 U/L for men and < 25 U/L for women), and we did not observe significant differences in performance between the 47% with and 53% without nonliver comorbidities for LiverRisk score, FIB-4 index, or APRI, suggesting our results remain relevant for general patient populations. Further, LiverRisk score is intended for "general use in clinical practice," including population screening of patients with chronic conditions, [3] and many clinicians may generate LiverRisk score predictions using pre-existing measurements rather than obtaining new measurements; thus, our study may still be representative of real-world applications. Second, we could not compare liver cancer and mortality prognostication of the 4 scores due to low incidence and limited availability of biomarker-linked liver-specific mortality data outside of the UK Biobank, respectively. While Liu et al [14] demonstrated that LiverRisk score predicted diabetes-specific mortality in the NHANES III cohort, liver outcome and liver-specific mortality data are unavailable for this cohort. With liver cancer being a top 5 cause of cancer death globally [15] and with liver-specific mortality generally reflecting liver disease severity, it remains necessary to evaluate these scores among large, diverse cohorts with these data available. Fourth, because FIB-4 index and APRI are routinely used in the diagnosis and management of MASLD, the MASLD Registry may be subject to selection bias for participants with high values of these scores. Addressing this, we created a subcohort of NHANES participants with MASLD using objective controlled attenuation parameter scores to avoid this bias. Fifth, we observed large confidence intervals for AUROC and AUPRC when analyzing fibrosis prediction among participant subsets in the MASLD Registry, likely due to the small size of our cohort (422 participants); thus, our study may not have sufficient statistical power to detect differences in predictor performance between subsets.
Together, our results suggest that first, given the differences in results between our cohorts and the original validation cohorts, additional replication studies and possibly recalibration are needed before LiverRisk score can be widely adopted, particularly in health care settings with large non-European populations. Second, FIB-4 index and APRI will likely remain relevant in the clinical setting, especially when glucose, cholesterol, and GGT measurements are unavailable. Among patients with MASLD, FIB-4 index is easier to calculate than SAFE score as it does not require body mass index or globulin, and we did not consistently observe significant differences in AUROC or AUPRC between the 2 scores. Third, more accurate blood-based predictors of liver fibrosis are still needed, especially among patients with MASLD and other chronic liver diseases. These predictors should ideally be derived using large, diverse cohorts with histologically assessed fibrosis.
Laboratories. He consults, advises, owns stock, and is employed by Pensieve Health. The remaining authors have no conflicts to report.

FIGURE 1 Prognostication among 32,828 Mount Sinai Data Warehouse participants without liver disease at baseline. (A, B) HRs with 95% CIs for liver-related hospitalization and nonliver-related hospitalization (A) or any liver outcome and cirrhosis (B) for participants in the low, medium, and high-risk groups of LiverRisk score compared to participants in the minimal-risk group. (C-F) Kaplan-Meier curves with 95% CIs showing the cumulative incidence of liver-related hospitalization (C), any liver outcome (D), nonliver-related hospitalization (E), and cirrhosis (F) for participants in the minimal, low, medium, and high-risk groups of LiverRisk score. (G-H) HRs with 95% CIs for liver-related hospitalization and nonliver-related hospitalization (G) or any liver outcome and cirrhosis (H) for each quintile increase in LiverRisk score, FIB-4 index, or APRI.
TABLE 1 Composition of the 4 cohorts used in this study
Note: Values are either counts with (percentages) or medians with [interquartile ranges]. Nonliver comorbidities were defined as those included in the Charlson Comorbidity Index. NHANES (all) represents all eligible NHANES participants while NHANES (MASLD) represents the subset of NHANES participants who likely have MASLD.
Abbreviations: ALT, alanine aminotransferase; APRI, AST to Platelet Ratio Index; AST, aspartate aminotransferase; CAP, controlled attenuation parameter; FIB-4, fibrosis-4 index; GGT, gamma-glutamyl transferase; MASLD, metabolic dysfunction-associated steatotic liver disease; MSDW, Mount Sinai Data Warehouse; NHANES, National Health and Nutrition Examination Survey; SAFE, steatosis-associated Fibrosis Estimator.
For FIB-4 index, there were 20,562, 9475, 942, and 1849 participants in the < 1.30, 1.30-2.67, > 2.67-3.25, and > 3.25 groups, respectively, while for APRI, there
for the liver-related outcomes and not for nonliver-related hospitalization, replicating the specificity of LiverRisk score for liver-related morbidity. Among 541 participants in the high-risk group, there were HRs of 13.25 (95% CI: 7.26-24.2) for liver-related hospitalization, 3.46 (2.59-4.63) for any liver-related outcome, and 14.64 (8.52-25.18) for cirrhosis. The HR for liver-related hospitalization was significantly lower than that reported in the UK Biobank (> 100).
TABLE 2 Performance metrics among all participants in each cohort