Accuracy of cytokeratin 18 (M30 and M65) in detecting non-alcoholic steatohepatitis and fibrosis: A systematic review and meta-analysis

Introduction Association between elevated cytokeratin 18 (CK-18) levels and hepatocyte death has made circulating CK-18 a candidate biomarker to differentiate non-alcoholic fatty liver from non-alcoholic steatohepatitis (NASH). Yet studies produced variable diagnostic performance. We aimed to provide summary estimates with increased precision for the accuracy of CK-18 (M30, M65) in detecting NASH and fibrosis among non-alcoholic fatty liver disease (NAFLD) adults. Methods We searched five databases to retrieve studies evaluating CK-18 against a liver biopsy in NAFLD adults. Reference screening, data extraction and quality assessment (QUADAS-2) were independently conducted by two authors. Meta-analyses were performed for five groups based on the CK-18 antigens and target conditions, using one of two methods: linear mixed-effects multiple thresholds model or bivariate logit-normal random-effects model. Results We included 41 studies, with data on 5,815 participants. A wide range of disease prevalence was observed. No study reported a pre-defined cut-off. Thirty of 41 studies provided sufficient data for inclusion in any of the meta-analyses. Summary AUC [95% CI] were: 0.75 [0.69–0.82] (M30) and 0.82 [0.69–0.91] (M65) for NASH; 0.73 [0.57–0.85] (M30) for fibrotic NASH; 0.68 (M30) for significant (F2-4) fibrosis; and 0.75 (M30) for advanced (F3-4) fibrosis. Thirteen studies used CK-18 as a component of a multimarker model. Conclusions For M30 we found lower diagnostic accuracy to detect NASH compared to previous meta-analyses, indicating a limited ability to act as a stand-alone test, with better performance for M65. Additional external validation studies are needed to obtain credible estimates of the diagnostic accuracy of multimarker models.

Introduction
Non-alcoholic fatty liver disease (NAFLD), a condition with a complex and multifactorial etiology, has rapidly emerged as the most common cause of chronic liver disease in the United States and Europe [1,2]. The global prevalence is approximately 25%, representing a wide histological spectrum from simple steatosis (NAFL) through non-alcoholic steatohepatitis (NASH) [3] to hepatic fibrosis. Fibrosis is the strongest predictor of long-term clinical outcomes in NAFLD patients and is therefore a key target for patient stratification and clinical trial recruitment [4].
The clinical reference standard for detecting NASH activity and fibrosis stages is a liver biopsy, a practice with well-established limitations [5][6][7]. As such, only patients at highest risk should be pre-selected for such an invasive and resource intensive procedure. The discovery of less invasive methods with performance comparable to liver biopsy has become essential.
Several blood-based biomarkers have been studied for their ability to identify NASH or fibrosis. Cytokeratin 18 (CK-18) is the main intermediate filament protein in hepatocytes and is released upon the initiation of cell death. The association between elevated CK-18 levels and cell death in the liver [8,9] has made circulating CK-18 (both M30 and M65 antigens) a candidate marker for detecting NASH and fibrosis [10], as a stand-alone test and, more recently, as part of multimarker models.
Although the M30 and M65 antigens are of the same protein, there is a mechanistic distinction between the two. M30 measures the caspase-cleaved CK-18 revealed during apoptosis, while M65 measures the full-length protein, including both caspase-cleaved and intact CK-18, which is released from cells undergoing necrosis [11].
In recommendations by the EASL-EASD-EASO Clinical Practice Guidelines [12] the performance of CK-18 M30 to differentiate NASH from NAFL was judged modest, as per data from a meta-analysis of 11 studies [13]. The Asia-Pacific Working Party on NAFLD [14] similarly concluded modest performance, referencing a meta-analysis of 10 studies [15]. A single study mentioned in both guidelines criticized CK-18 for its limited performance for detecting NASH at a threshold of 165 U/L [10]. However, it is not clear what thresholds would then maximize the test's sensitivity or specificity.
We found several limitations and methodological concerns in the above-mentioned meta-analyses. One performed a meta-analysis on only the M30 antigen in detecting NASH, with the rationale that M65 performed similarly [13]. However, it has been shown that M65 outperforms M30 [9]. Further, the systematic review by Chen et al. raised additional concerns, such as overlapping patient populations included in the meta-analysis [15].
An updated and more methodologically robust meta-analysis would be able to generate, in principle, summary estimates with increased precision and more general validity. To address this need, we aimed to conduct a systematic review and meta-analysis of the accuracy of both CK-18 antigens (M30 and M65) in identifying NASH, fibrotic NASH, and fibrosis stages among NAFLD adults.

Materials and methods
This systematic review was conducted as part of the evidence synthesis efforts of the LITMUS (Liver Investigation: Testing Marker Utility in Steatohepatitis) project, funded by the European Union's IMI2, aiming to evaluate biomarkers for use in NAFLD. The protocol of the complete systematic review is available in PROSPERO (registration number: CRD42018106821). This study report was prepared using the PRISMA-DTA statement; see the PRISMA checklist in S1 Table in S1 File.

Search strategy
A comprehensive search strategy, containing words in the title/abstract or text words across the record and the medical subject heading (MeSH), was developed with a search specialist. MEDLINE (via OVID), EMBASE (via OVID), PubMed, Science Citation Index, and CENTRAL (The Cochrane Library) were searched to retrieve potentially eligible studies from inception to August 2018 (see S2 Table in S1 File). We further conducted a manual screening of relevant systematic reviews and reference lists and contacted partners within the LITMUS consortium. The search was updated in May 2019, and again in June 2020.

Study selection
Search results of all databases were merged and deduplicated using Endnote. Titles were screened by one reviewer (YV); a second reviewer independently screened 10% (MHZ). Abstract and full text screening was conducted by two independent reviewers (JL and YV), following pre-established inclusion and exclusion criteria. Any discrepancies were resolved by discussion between the two reviewers. Title and abstract screening phases were conducted on Rayyan QCRI (https://rayyan.qcri.org).

Inclusion and exclusion criteria
We searched for studies including adults (≥18 years) with clinical suspicion or biopsy proven NAFLD, with paired data on liver histology and CK-18 (M30 or M65). Diagnostic accuracy studies reported in full articles in peer-reviewed journals, or as conference abstracts, in any language were eligible. Studies with insufficient information for making decisions on inclusion, for evaluating methodological quality, or for calculating diagnostic accuracy were excluded. Study groups with a mix of conditions (e.g. viral hepatitis) were only included if outcomes were separately reported for NAFLD patients.
The target conditions for this systematic review were NASH, fibrotic NASH, and liver fibrosis. The NAFLD Activity Score (NAS) [16] is the most commonly used pathologic criterion for evaluating NASH. We considered a threshold value of NAS ≥4 with at least one point for each criterion of steatohepatitis for the characterization of NASH. See S3 Table in S1 File for different histological scoring systems developed to characterize NAFLD progression. Fibrotic NASH was defined as meeting the above-mentioned criteria for NASH together with a fibrosis stage of F1 or higher.
A five-point scoring system (F0-F4), developed by the NASH clinical research network (NASH CRN) [17], is the most commonly used for fibrosis staging. Studies assessing significant (≥F2) and advanced (≥F3) fibrosis were included. See S4 Table in S1 File for different scoring systems for liver fibrosis, and S5 Table in S1 File for a conversion grid of the different scoring systems.

Data extraction and quality assessment
The following information was extracted: study characteristics, clinical characteristics, index test features, liver biopsy features, and data that allowed construction of a 2x2 contingency table (true positives, true negatives, false positives and false negatives) to assess the performance of the index test. For studies that reported accuracy data for multiple thresholds, all data were extracted.
When pertinent data were not reported, the corresponding study author was contacted. Data were extracted independently and cross-checked by two reviewers (JL and YV).
The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool [18] was used to assess methodological quality of all available full text studies. Two reviewers (JL and YV) independently evaluated the risk of bias and concerns about applicability of the included primary studies using the four domains of QUADAS-2, assigning each study with a judgement of 'low', 'high', or 'unclear' risk.
Sensitivity and specificity estimates from each study, with respective 95% confidence intervals (95% CI), were graphically illustrated as forest plots, for each reported threshold, using RevMan.
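As an illustration of how study-level estimates are derived from a 2x2 contingency table, the following minimal Python sketch computes sensitivity and specificity with approximate 95% confidence intervals. The Wilson score interval is used here as one common choice; it is an assumption of this sketch and not necessarily the interval method implemented in RevMan. The function names and the counts are hypothetical.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% when z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half

def accuracy_from_2x2(tp, fp, fn, tn):
    """Return (estimate, (lower, upper)) for sensitivity and specificity."""
    sens = tp / (tp + fn)   # proportion of diseased participants testing positive
    spec = tn / (tn + fp)   # proportion of non-diseased participants testing negative
    return {
        "sensitivity": (sens, wilson_interval(tp, tp + fn)),
        "specificity": (spec, wilson_interval(tn, tn + fp)),
    }

# Hypothetical counts for a single study at a single CK-18 threshold
result = accuracy_from_2x2(tp=45, fp=20, fn=15, tn=80)
for name, (estimate, (lower, upper)) in result.items():
    print(f"{name}: {estimate:.2f} [{lower:.2f}-{upper:.2f}]")
```

One such pair of estimates is obtained for every reported threshold in every study, which is what the forest plots display.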
Two different meta-analytical methods were applied for the combinations of CK-18 antigens and target conditions based on the number of reported threshold values. For groups 1-3, we applied a linear mixed-effects multiple thresholds model (diagmeta package in R), as a majority of the primary studies reported multiple threshold values. The multiple thresholds model utilizes the number of true and false positives and true and false negatives at every threshold to produce summary receiver operating characteristic (SROC) curves. With the model, we could calculate estimates of sensitivity and specificity at any given threshold. We calculated the threshold value that would maximize Youden's J statistic (also called Youden's index): the sum of sensitivity and specificity minus 1.
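The Youden criterion can be sketched in a few lines of Python. This is a simplified illustration over a discrete set of operating points; the multiple thresholds model instead evaluates the criterion on the fitted distributions, so the sketch does not reproduce the diagmeta implementation. The function name and all operating points below are hypothetical.

```python
def youden_optimal(points):
    """Return (threshold, J) for the point maximizing J = sensitivity + specificity - 1.

    `points` is an iterable of (threshold, sensitivity, specificity) tuples.
    """
    threshold, sens, spec = max(points, key=lambda p: p[1] + p[2] - 1)
    return threshold, sens + spec - 1

# Hypothetical operating points (threshold in U/L, sensitivity, specificity)
points = [
    (150, 0.88, 0.35),
    (250, 0.70, 0.62),
    (350, 0.55, 0.82),
    (450, 0.38, 0.91),
]

best_threshold, best_j = youden_optimal(points)
print(f"Youden-optimal threshold: {best_threshold} U/L (J = {best_j:.2f})")
```

Because J weights sensitivity and specificity equally, the Youden threshold is a neutral default; a rule-out or rule-in application would instead fix one of the two at a high value.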
We computed estimates of positive and negative predictive values in settings with different disease prevalence. We further assessed thresholds of the index test required to achieve prespecified high values of sensitivity and specificity. The minimally acceptable performance level for the AUC, sensitivity, and specificity of the index test was set at 0.80, so that it would exceed that of other NAFLD-related screening and diagnostic biomarkers.
As a majority of the primary studies in groups 4 and 5 reported only a single threshold value, we applied a bivariate logit-normal random-effects model (mada package in R) to compute summary estimates of sensitivity and specificity. SROC curves were constructed to represent the overall diagnostic accuracy of the index test.
Publication bias was not formally evaluated as no accepted statistical tests can reliably discriminate publication bias from other sources of bias in diagnostic meta-analyses [19]. Heterogeneity between and within studies was incorporated by calculating 95% prediction intervals [20]. The confidence interval around the summary point reflects the statistical imprecision around the mean. The prediction region around the summary point indicates the region where we would expect results from a new study in the future to lie. It reflects both the uncertainty around the mean and the between-study heterogeneity and is therefore wider than the confidence region.
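The difference between a confidence interval and a prediction interval can be illustrated with a deliberately simplified, univariate Python sketch on the logit scale. The analyses in this review use bivariate or multiple-thresholds models that produce joint regions, so this is not the method applied here; a normal quantile is also used in place of the t-distribution that is often recommended for prediction intervals. All names and input values are hypothetical.

```python
import math

def expit(x):
    """Inverse logit: maps a logit-scale value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))

def summary_intervals(mu_logit, se_mu, tau2, z=1.96):
    """Approximate 95% confidence and prediction intervals for a pooled proportion.

    mu_logit: pooled estimate on the logit scale
    se_mu:    standard error of the pooled estimate
    tau2:     between-study variance on the logit scale
    """
    ci_half = z * se_mu                            # imprecision of the mean only
    pi_half = z * math.sqrt(tau2 + se_mu ** 2)     # adds between-study heterogeneity
    ci = (expit(mu_logit - ci_half), expit(mu_logit + ci_half))
    pi = (expit(mu_logit - pi_half), expit(mu_logit + pi_half))
    return ci, pi

# Hypothetical pooled sensitivity of 0.75 with moderate heterogeneity
mu = math.log(0.75 / 0.25)
ci, pi = summary_intervals(mu, se_mu=0.15, tau2=0.30)
print(f"95% CI: {ci[0]:.2f}-{ci[1]:.2f}")
print(f"95% PI: {pi[0]:.2f}-{pi[1]:.2f}")
```

Whenever the between-study variance is greater than zero, the prediction interval is strictly wider than the confidence interval, which is the property described above.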
We investigated the influence of studies with compromised methodological quality by excluding those at high risk of bias or with applicability concerns in a sensitivity analysis. We further evaluated the effect of pooling data from various ELISA assays by excluding studies that either did not disclose the assay used or used one from a manufacturer that was not PEVIVA. Sensitivity analysis was also conducted among solely biopsy-proven NAFLD patients, excluding those with clinically suspected NAFLD.
All analyses were conducted using R for Windows (Version 3.6.0; R Foundation for Statistical Computing, Vienna, Austria).

Search results
Our initial search of all biomarkers identified 6,220 studies post deduplication. Following the pre-defined inclusion and exclusion criteria, 778 studies were eligible for abstract screening, of which 265 underwent full-text review. A total of 46 study reports were included for CK-18. Following the exclusion of 10 and inclusion of five studies from the two search updates, a total of 41 studies (5,815 participants) could be included in the present systematic review (Fig 1). Thirty studies were included in one or more of the meta-analyses.

Study characteristics
Characteristics of the included studies can be found in Table 1. A majority of the studies (32/41) had included NAFLD patients with mean BMI <35. A relatively wide range of disease prevalence was observed: 21% to 85% for NASH, 21% to 62% for fibrotic NASH, 18% to 59% for significant fibrosis and 19% to 36% for advanced fibrosis. The publication year spanned from 2006 to 2020; 27 studies were published after 2012.
Thirty-two studies investigated the accuracy of M30 in detecting NASH, and three for fibrotic NASH. The accuracy of M30 in detecting significant and advanced fibrosis was studied in six and seven studies, respectively. We further identified eight diagnostic accuracy studies of M65 for NASH and one study of M65 for significant fibrosis.

Quality assessment
The methodological quality of the 41 studies, assessed with QUADAS-2, is summarized in S1 and S2 Figs in S1 File. Ten studies were scored as high risk of bias in the patient selection domain [9,10,29,31,36,37,40,47,50,58]. No study had low risk of bias in the index test domain, with 22 judged as high risk, due to the lack of a pre-established threshold value for CK-18.
Seven studies were scored as unclear risk of bias in the reference standard domain, for failing to report whether biopsy reviewers were blinded to clinical data [8,9,29,32,40,49,57]. Only three studies were classified as at high risk of bias for flow and timing [10,25,40]. We further graded four studies with high concern regarding applicability in the patient selection domain [30,46,54,57].    (Fig 2A).
Using the multiple thresholds model, we calculated the positive predictive value (PPV) and the negative predictive value (NPV) under different clinical settings (5% to 70% NASH prevalence) for desired levels of sensitivity and specificity (Table 2). Optimizing sensitivity (0.80 to When fixing specificity values (0.80 to 0.90), the corresponding sensitivity ranged from 0.48 to 0.61 (Table 2) with threshold values between 304 and 399 U/L. High NPV (0.87 to 0.95) were again seen for low prevalence settings (10 to 20%). A graphical representation of the predictive values in different prevalence settings can be seen in Fig 3A and 3B.
Accuracy of CK-18 M65 in detecting NASH. In the meta-analysis of M65 in detecting NASH, we analyzed six studies with a total of 414 participants (220 with NASH) (S4 Fig in S1 File). Eleven unique threshold values were included in the model, ranging from 340 to 1183 U/L. The combined AUC was 0.82 (95% CI: 0.69-0.91) with a mean sensitivity of 0.75 (95% CI: 0.51-0.90) and mean specificity of 0.76 (95% CI: 0.49-0.91) at a Youden-threshold of 478 U/L (Fig 2C).
We again investigated the PPV and NPV in various clinical settings (Table 3, Fig 3C and 3D). Fixing sensitivity from 0.80 to 0.90, the specificity ranged from 0.70 to 0.51 at threshold values of 337 to 437 U/L (Table 3). NPV in lower prevalence settings (10-20%) ranged from 0.93 to 0.98 with corresponding PPV from 0.17 to 0.40. Similar patterns were observed when optimizing specificity over sensitivity (Table 3). At a NASH prevalence of 10% or 20%, PPV and NPV ranged from 0.28 to 0.56 and from 0.88 to 0.96, respectively.
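The dependence of predictive values on prevalence follows directly from Bayes' theorem and can be checked with a short Python sketch. The function name is hypothetical, and the pairing of each sensitivity and specificity value with a specific prevalence is our reading of the ranges reported above rather than an explicit statement in Table 3.

```python
def predictive_values(sens, spec, prev):
    """Return (PPV, NPV) for a given sensitivity, specificity and prevalence."""
    tp = sens * prev                # true positives per participant tested
    fn = (1 - sens) * prev          # false negatives
    tn = spec * (1 - prev)          # true negatives
    fp = (1 - spec) * (1 - prev)    # false positives
    return tp / (tp + fp), tn / (tn + fn)

# Sensitivity 0.80 with specificity 0.70, assumed at 20% NASH prevalence
ppv_a, npv_a = predictive_values(0.80, 0.70, 0.20)
# Sensitivity 0.90 with specificity 0.51, assumed at 10% NASH prevalence
ppv_b, npv_b = predictive_values(0.90, 0.51, 0.10)

print(f"prevalence 20%: PPV = {ppv_a:.2f}, NPV = {npv_a:.2f}")
print(f"prevalence 10%: PPV = {ppv_b:.2f}, NPV = {npv_b:.2f}")
```

Under these assumed pairings the two scenarios give the end points of the reported ranges (PPV 0.17 to 0.40, NPV 0.93 to 0.98), showing how a low prevalence drives NPV up and PPV down for the same test.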
Accuracy of CK-18 M30 in detecting fibrotic NASH. Three studies provided sufficient data for analysis of M30 in detecting fibrotic NASH, with a combined total of 1,271 participants (343 with fibrotic NASH) (S5 Fig in S1 File). Two studies investigated M30 as part of multimarker models; the authors of both studies [24,55] provided accuracy data for M30 at seven threshold values that we selected based on the data from the present meta-analysis (133, 200, 248, 292, 356, 395, and 464 U/L). This allowed us to apply the multiple thresholds model (15 thresholds) and to calculate an AUC of 0.73 (95% CI: 0.57-0.85), a mean sensitivity of 0.63 (95% CI: 0.39-0.82) and a mean specificity of 0.73 (95% CI: 0.51-0.88) at a Youden-threshold value of 371 U/L.

Fibrosis
Accuracy of CK-18 M30 in detecting significant and advanced fibrosis. We identified several studies that investigated CK-18 for fibrosis staging. For significant fibrosis, we included a single threshold value (ranging from 122 to 285 U/L) from five studies [27,43,47,55,59] with a total of 1,155 participants (554 had significant fibrosis) (S6 Fig in S1 File). The resulting AUC was 0.68. See S7A Fig in S1 File for SROC curve and corresponding 95% CI and prediction region. One study [50] assessed the ability of M65 to detect significant fibrosis; at a threshold of 244 U/L, sensitivity was 0.71 for a specificity of 0.71 (AUC: 0.74).

PLOS ONE
One study had to be excluded from the meta-analysis of both significant and advanced fibrosis due to discrepancies in the 2x2 contingency table [30].

Multimarker models including CK-18
Thirteen studies additionally used CK-18 as a component of a multimarker model (Table 4). Most interest was in detecting NASH (8/13 studies), with AUCs among the eight models ranging from 0.79 to 0.96. The highest performance was observed for the NASH-score (BMI, alanine aminotransferase (ALT), aspartate aminotransferase (AST), alkaline phosphatase (ALP), HOMA-IR, M65 and adiponectin), which produced an AUC of 0.96 [57]. One model, MACK-3, was developed with the aim of detecting fibrotic NASH [55]. Composed of three components (HOMA, AST and CK-18), it achieved an AUC of 0.85 in the validation group (n = 846) and an AUC of 0.80 when evaluated in a separate study [24].
Two studies [27,43] investigated the combined use of M30 with transient elastography (TE) (FibroScan) to detect fibrosis. One study found that combining TE and M30 to detect significant (AUC: 0.89) and advanced fibrosis (AUC: 0.93) did not significantly improve diagnostic ability over either TE or CK-18 as a stand-alone test [27]. Another study, however, found some improvement from adding M30 to TE compared to TE alone: the AUC increased by 0.03 for significant fibrosis and by 0.05 for advanced fibrosis [43].

Main findings
Among NAFLD adults, the diagnostic accuracy of M30 to distinguish NASH from NAFL was under the minimally acceptable performance level, fixed a priori at AUC of 0.80. More promising results were observed for M65 and NASH, although it is of note that only six studies could be included in this meta-analysis, compared to 22 for M30. The superior performance of M65 should further be interpreted with caution, as its ability to detect fibrotic NASH, the most clinically relevant target condition, is limited.
At lower prevalence, mirroring primary care settings, high NPVs above 0.85 were achieved for both M30 and M65 antigens at fixed sensitivity and specificity values above 0.80 (Tables 2 and 3).
Our meta-analysis on the accuracy of M30 in detecting fibrotic NASH also showed modest performance. MACK-3 showed more promise for detecting fibrotic NASH, but the evidence is still limited to two studies, and the model presents with limitations such as adequate performance among subgroups with metabolic syndrome and a large gap of patients who lie between the high and low threshold values [24,55].
Results for both significant and advanced fibrosis were below the minimally acceptable performance level, demonstrating sub-optimal ability of M30 to function as a stand-alone test for fibrosis staging, even more so when considering the available accurate elastography methods and multimarker models for detecting liver fibrosis.
As expected, we observed a wide range of reported threshold values for both CK-18 antigens. This can be explained by the variability of methods employed for choosing a threshold and the general lack of established recommendations. With our meta-analysis we suggest high and low thresholds for M30 and M65, which can be selected in accordance with the intended use (ruling-in or ruling-out NASH). It is of note that the threshold suggestions for the M30 and M65 antigens apply strictly to results produced by the PEVIVA assays, as different CK-18 assays are understood to show poor inter-test reliability and the majority of our studies used CK-18 assays from PEVIVA [42].

Strengths and limitations
By employing novel meta-analytical methods, we were able to incorporate all data available in the primary studies, eliminating arbitrary selection of a single threshold for our meta-analyses. This allowed greater freedom to investigate which clinical setting would optimize the use of CK-18. A more comprehensive evaluation of the clinical performance, including projections of accuracy data (sensitivity, specificity, PPV, NPV) in various prevalence settings was possible. The multiple thresholds model further allowed us to assess the diagnostic accuracy of CK-18 at threshold values not investigated in the original studies. We were however limited in the sense that the data projected by our models are based on the cumulative distribution of CK-18 in the diseased and non-diseased populations of the primary studies, which had higher prevalence than one would expect in a primary care setting.
The approaches for selecting either a single 'optimal' threshold value or a set of thresholds were very heterogeneous in our included studies. While some used the Youden or equivalent methods, others chose to optimize either the sensitivity or specificity, and a concerning few did not report how a threshold value was calculated. This was however anticipated, as there is no recommended threshold for CK-18. We further observed sparse reporting of the histological procedure, including quality of biopsies and expertise of histological evaluation (S6 Table in S1 File), which raises concerns regarding the reliability of the reference standard test.

In context of published literature
For M30 and NASH (22 studies), we found lower diagnostic accuracy compared to previous meta-analyses. He (2017) (14 studies) reported an AUC of 0.82 [60]; Kwok (2014) (seven studies) reported a summary sensitivity of 0.66, at a specificity of 0.82 [13]; Chen et al. (nine studies) found an AUC of 0.84 [15]; and Musso (2010) (nine studies) found an AUC of 0.82 [61]. Parameters such as mean age, BMI and disease prevalence were not sources of major heterogeneity between the present and previously published meta-analyses [61]. Our meta-analysis did however include a greater number of studies, incorporating more recent publications with lower performance. Among the six studies published after 2017, the AUC ranged between 0.59 and 0.77 for M30 in detecting NASH, a noticeable drop compared to pioneering work from 2008-10 (AUC: 0.71 to 0.88). The lowest AUC (0.59) was found in the largest study (N = 846), conducted in 2018. Interestingly, this study also found M30 to be most accurate in detecting patients with fibrotic NASH, achieving an AUC of 0.72 [55]. In parallel with the incrementally less impressive results, the excitement for CK-18 as a NAFLD biomarker has tempered with each subsequent study, serving as an exemplar of the entire biomarker space.
The only other meta-analysis performed on the diagnostic ability of both M30 and M65 concluded that both antigens had similar ability to distinguish NASH from NAFL (M30 had AUC of 0.82, M65 had AUC of 0.80) [60]. Among the three studies that investigated both M30 and M65 within the same cohort, all found better performance for M65 compared to M30 [9,26,57]. Although M30 has been more popularly studied as a diagnostic biomarker for NASH, our meta-analysis demonstrates the need for more evidence to establish the performance of M65. Further studies conducting head-to-head comparisons of M30 and M65 within the same cohort would be valuable for assessing superior performance of either antigen.
Fibrotic NASH has become an emerging target condition of interest in NAFLD research [17]. Despite the established role of hepatocyte apoptosis in the progression of liver damage [11], there have been contradictory opinions regarding the usefulness of CK-18 for fibrosis staging. Our results showed limited ability of CK-18 to function as a stand-alone test for detecting fibrotic patients compared to existing biomarkers.
Even so, the involvement of CK-18 in the disease pathway of NAFLD indicates potential for CK-18 to be used in combination with other biomarkers. Several promising models that included CK-18 (M30 and/or M65) were identified in our systematic review, most of which exceeded the minimally acceptable performance level of an AUC ≥0.80. Unfortunately, most models are limited to a single validation within the original studies, with the exception of M30 with TE and MACK-3, which raises the concern of how well the models would perform in practice. Additional validation studies for the proposed multimarker models should be conducted to ensure reliability of their performance. We do acknowledge that other studies including CK-18 in a composite scoring system may exist, despite not being eligible for inclusion in the present systematic review [62,63]. For example, a recent study developed a model for distinguishing NASH from NAFL, finding an AUC of 0.73 (0.66-0.81), with even better accuracy for detecting advanced fibrosis [63].

Implications for current practice and future research
Both the EASL-EASD-EASO and Asia-Pacific Working Party guidelines suggest that CK-18 has limited ability to function as a stand-alone test for distinguishing NASH from NAFL given its modest performance [12,14]. However, in a setting with 20% prevalence, a sensitivity of 0.90 and an NPV of 0.91 were achieved at a threshold value of 127 U/L (M30), demonstrating high negative predictive values for ruling out NASH. In such a scenario CK-18 could be of value as a first-line test at a primary care level for further evaluation by a specialist, even more so when considering the low cost and accessibility. This however comes at the cost of lower specificity, resulting in a high number of false positive results, as well as the compromise of 62% misclassified patients in the same setting with 20% prevalence. Alternatively, should CK-18 be used to rule-in NASH, a higher threshold of 399 U/L would be more appropriate. The trade-off between sensitivity and specificity as well as predictive values should be considered before selecting a threshold to be used in clinical practice, as a substantial number of patients without NASH could be referred for further, more invasive and risky evaluation.
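The rule-out trade-off described above can be made concrete with a short Python sketch. The specificity at the 127 U/L threshold is not quoted in this passage, so the sketch back-calculates it from the stated sensitivity (0.90), NPV (0.91) and prevalence (20%); the resulting value is therefore approximate and depends on the rounding of the NPV. The function names are hypothetical.

```python
def implied_specificity(sens, npv, prev):
    """Solve NPV = TN / (TN + FN) for specificity, given sensitivity and prevalence."""
    fn = (1 - sens) * prev        # false negatives per participant tested
    tn = npv * fn / (1 - npv)     # true negatives implied by the stated NPV
    return tn / (1 - prev)

def misclassified_fraction(sens, spec, prev):
    """Proportion of all tested participants who are false negative or false positive."""
    return (1 - sens) * prev + (1 - spec) * (1 - prev)

sens, npv, prev = 0.90, 0.91, 0.20
spec = implied_specificity(sens, npv, prev)
misclassified = misclassified_fraction(sens, spec, prev)

print(f"implied specificity: {spec:.2f}")
print(f"misclassified: {misclassified:.0%}")
```

The back-calculated specificity of roughly 0.25 is consistent with the 62% misclassification rate quoted above: nearly all misclassified patients are false positives, which is the price of the high sensitivity at this threshold.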
CK-18 can potentially improve risk stratification in combination with other synergistic markers, such as TE or NFS, by testing for elevated M30 levels among patients under the low threshold or in patients with intermediate TE/NFS values (between the high and low threshold) [64]. In the study by Liebig et al., risk stratification was considerably improved with this approach: more than 70% of patients with low TE/NFS but elevated M30 were found to have NASH (mostly with fibrosis). As with CK-18, other highly validated tests also run the risk of misclassifying patients, for example, those with low or intermediate risk by TE who would not be considered for a biopsy despite the presence of NASH. In such a step-wise diagnostic regime, a high cut-off for M30 should be selected to optimize specificity and rule-in those with NASH.