Fragility index of positive phase II and III randomised clinical trials of treatments for hepatocellular carcinoma (2002–2022)

Background & Aims
The fragility index (FI), i.e. the minimum number of best survivors reassigned to the control group required to turn a statistically significant clinical trial result into a non-significant one, is a metric for evaluating the robustness of randomised controlled trials (RCTs). We aimed to assess the FI in the field of HCC.

Methods
This is a retrospective analysis of phase II and III RCTs for the treatment of HCC published between 2002 and 2022. For the FI calculation, we included two-arm studies with 1:1 randomisation and significant positive results for a primary time-to-event endpoint. The calculation involves the iterative addition of a best survivor from the experimental group to the control group until positive significance (p <0.05, log-rank test) is lost.

Results
We identified 51 positive phase II and III RCTs, of which 29 (57%) were eligible for FI calculation. After reconstruction of the Kaplan-Meier curves, 25/29 studies remained significant, and the analysis was performed on these studies. The median (interquartile range [IQR]) FI was 5 (2-10) and the median fragility quotient (FQ) was 3% (1%-6%). Ten trials (40%) had an FI of 2 or less. FI was positively associated with blind assessment of the primary endpoint (median FI 9 with blind assessment vs. 2 without, p = 0.01) and positively correlated with the number of reported events in the control arm (RS = 0.45, p = 0.02) and with the impact factor (RS = 0.58, p = 0.003).

Conclusions
Several phase II and III RCTs in HCC have a low FI, underlining the limited robustness of the conclusion that the tested treatments are superior to control treatments. The FI might provide an additional tool to assess the robustness of clinical trial data in HCC.

Impact and implications
The fragility index is a method to assess the robustness of a clinical trial and is defined as the minimum number of best survivors reassigned to the control group required to turn a statistically significant clinical trial result into a non-significant one.
Among 25 randomised controlled trials in HCC, the median fragility index was 5, and 10 of the 25 trials (40%) had a fragility index of 2 or less, indicating substantial fragility.


Introduction
HCC is the third most common cause of cancer-related death and occurs mainly in chronic liver disease at the cirrhosis stage. 1 The Barcelona Clinic Liver Cancer classification is the most commonly used staging system for HCC in Western countries, linking tumour burden, liver function, and performance status with prognosis and therapeutic management. 2 In 2022, the Barcelona Clinic Liver Cancer group updated its treatment algorithm to reflect recent advances, especially regarding systemic treatment strategies. 2 All the treatments of HCC (namely, radiofrequency ablation, transhepatic chemoembolisation, anti-angiogenic tyrosine kinase inhibitors, and immune checkpoint inhibitor combinations such as atezolizumab [programmed death-ligand 1 (PD-L1) inhibitor] + bevacizumab [anti-vascular endothelial growth factor] or durvalumab [PD-L1 inhibitor] + tremelimumab [cytotoxic T-lymphocyte-associated protein 4 (CTLA4) inhibitor]) were validated in randomised controlled trials (RCTs).
RCTs are designed to assess a specific intervention's safety and efficacy, and are considered to produce highly reliable evidence if appropriate methodologies are used. Although clinicians often rely on the reported p values to interpret results and establish significance in RCTs, this practice remains debated. 3 In addition to the p value, the unit fragility index (FI) offers an easy tool to evaluate the numerical stability of a contrasted difference between two proportions. 4 Indeed, outcomes that meet the arbitrary threshold of a p value less than 0.05 might not be clinically relevant and might reach significance on the basis of a low number of events in the experimental arm. The FI was defined as the minimum number of patients whose status would have to change from a non-event to an event to turn a statistically significant result into a non-significant result. 5 Bomze et al. 6 introduced a simple and intuitive FI for survival analysis, defined as the minimum number of best survivors reassigned from the experimental group to the control group required to lose significance. Consequently, the FI has been recommended as an additional statistical method to present and interpret the results of RCTs. Therefore, our study aimed to assess the FI of positive phase II and III RCTs in the treatment of HCC in the past two decades and to identify the characteristics of RCTs associated with the FI.

Materials and methods
Study design and selection of RCTs
To identify positive RCTs relevant to this study, we searched MEDLINE on PubMed, the Cochrane Library, and the Clinical Trials database using the following terms: 'hepatocellular carcinoma' and 'HCC', as free-text words and/or combined with 'trial', 'prospective', 'phase II', 'phase 2', 'phase III', 'phase 3', 'randomized', 'randomised', 'controlled'.
We screened for prospective phase II and III RCTs published between 1 January 2002 and 30 June 2022 with a statistically significant result based on time-to-event data (primary

Data extraction
The following characteristics of each study were collected: RCT phase (II, II/III, or III), year of publication, journal of publication and impact factor, sample size, number of enrolling centres, disease stages, treatment arms, type of endpoints, outcomes of interest, and response assessment. Studies were stratified according to quality using a modified version of the Jadad score and the Delphi list consisting of five and nine items, respectively. 7,8 Studies were defined as high quality with a Jadad score ≥6 and a Delphi score ≥5.
Individual survival data from studies were extracted from the Kaplan-Meier curves published using the Digitizer software application (https://automeris.io/WebPlotDigitizer/). 9,10 The reconstructed curves were then compared with the published data to confirm the accuracy of the reconstructed data.
Statistical analysis and calculation of the FI
Continuous data are described as median (IQR) and categorical data as frequency and percentage. Continuous variables were compared using the Mann-Whitney test, and categorical variables using the Chi-square or Fisher exact test.
The FI for survival curves was calculated by iteratively reassigning the best survivors from the experimental group to the control group until positive significance (defined as p <0.05) was lost. The best survivor is defined as the patient with the longest follow-up time, regardless of whether the patient had an event or was censored. 6 Values of p were assessed using a two-tailed log-rank test. A smaller FI indicates a less robust study result. Studies reported as significant in the original publication that turned out to be non-significant after reconstruction of the Kaplan-Meier curves were excluded from the main analysis.
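The iterative procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the analysis (which was run in GraphPad Prism and R): the function names `logrank_p_value` and `survival_fragility_index` are hypothetical, each patient is represented as a `(time, event)` pair with `event = 1` for an event and `0` for censoring, an unstratified two-tailed log-rank test is used, and ties between patients with the same follow-up time are broken arbitrarily.

```python
import math


def logrank_p_value(group_a, group_b):
    """Two-tailed, unstratified log-rank p value for two lists of (time, event)."""
    data = [(t, e, 0) for t, e in group_a] + [(t, e, 1) for t, e in group_b]
    event_times = sorted({t for t, e, _ in data if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for ti, _, g in data if ti >= t and g == 0)
        at_risk_b = sum(1 for ti, _, g in data if ti >= t and g == 1)
        deaths_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        deaths_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n = at_risk_a + at_risk_b
        d = deaths_a + deaths_b
        observed_minus_expected += deaths_a - d * at_risk_a / n
        if n > 1:
            variance += d * (at_risk_a / n) * (at_risk_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0  # no information to separate the groups
    chi_square = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi_square / 2))


def survival_fragility_index(experimental, control, alpha=0.05):
    """Number of best survivors moved from experimental to control until p >= alpha.

    Returns None when the initial comparison is not significant.
    """
    experimental = sorted(experimental, key=lambda patient: patient[0])
    control = list(control)
    if logrank_p_value(experimental, control) >= alpha:
        return None
    moved = 0
    while experimental:
        # The best survivor is the patient with the longest follow-up,
        # whether an event occurred or the patient was censored.
        control.append(experimental.pop())
        moved += 1
        if logrank_p_value(experimental, control) >= alpha:
            return moved
    return moved
```

The sketch checks only whether the p value crosses the threshold; it does not verify the direction of the effect, so it presumes, as in the trials analysed here, that the experimental arm is the better-performing arm at baseline.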
To overcome the effect of sample size in interpreting the FI, we calculated the fragility quotient (FQ), which is the FI divided by the sample size. 11,12 The FQ expresses the proportion of patients (best survivors) who need to be reassigned for the result to lose significance. A smaller FQ also indicates a less robust study result.
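Written as a formula, with N denoting the total sample size of the trial, the quotient is as follows; the numerical example uses hypothetical values and does not correspond to any trial in this analysis.

```latex
\mathrm{FQ} = \frac{\mathrm{FI}}{N} \times 100\%,
\qquad \text{e.g. } \mathrm{FI} = 5,\ N = 200
\ \Rightarrow\ \mathrm{FQ} = \frac{5}{200} \times 100\% = 2.5\%
```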
To evaluate associations between the FI and FQ, and trial characteristics, we used the Spearman rank order correlation coefficient (RS) for continuous variables. The Kruskal-Wallis test was used for parameters with more than two modalities, and the Wilcoxon-Mann-Whitney test was used for those with two modalities.
Values of p <0.05 were considered significant. Statistical analyses were performed using GraphPad Prism 7.0 (La Jolla, CA, USA) and R Project for Statistical Computing, version 3.5.2 software (The R Foundation for Statistical Computing, Vienna, Austria; http://www.r-project.org/).
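For readers less familiar with the Spearman coefficient used for the continuous trial characteristics, the following Python sketch shows how it can be computed: both variables are converted to ranks (tied values share the average rank) and the Pearson correlation of the ranks is taken. It is an illustration only, not the code run for this study; the function names `average_ranks` and `spearman_rs` are hypothetical, and the sketch returns the coefficient without its p value.

```python
import math


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rank_x) / n, sum(rank_y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rank_x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in rank_y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return covariance / (spread_x * spread_y)
```

Because only ranks enter the calculation, the coefficient is insensitive to outlying values such as a single trial with a very large FI, which is one reason a rank-based measure suits a small set of trials.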

General characteristics of positive phase II and III prospective RCTs
The characteristics of the 51 positive phase II and III prospective RCTs included are summarised in Tables 1 and 2.

FI analysis
Among the 29 studies with a 1:1 allocation ratio eligible for FI calculation (see Table 1 for the characteristics of these studies), 13 were multicentre (46%), and most were performed in patients with an early or intermediate stage of HCC (88%) and in Eastern populations (79%). The median Jadad and Delphi scores were 8 (IQR 7-8) and 6 (IQR 6-6), respectively. After the reconstruction of the Kaplan-Meier curves, 25/29 studies remained significant, and four studies had a non-significant p value. In three of these four studies, the original p value had been evaluated using Cox proportional hazards regression models rather than the log-rank test, 13-15 and in the fourth study, 16 the p value had been assessed using a log-rank test stratified by the random assignment stratification factors.
Among the 25 studies with a remaining significant p value after the reconstruction of the Kaplan-Meier curves (see Table 1 for the characteristics of these studies), the median FI was 5 (IQR 2-10), and the median FQ was 3% (IQR 1-6%). Ten studies had an FI of ≤2. The distribution of the FI of the remaining 25 studies is represented in Fig. 2. We performed subgroup analysis according to the types of treatment received: curative intent treatment (n = 8; median FI 5 [IQR 2-7.2]), adjuvant treatment (n = 4; median FI 3 [IQR 2-5.5]), locoregional treatments with non-curative intent (n = 10; median FI 5 [IQR 2-11.5]), and systemic treatments in advanced stages (n = 3; median FI 8 [IQR 5-8]) (p = 0.9, Kruskal-Wallis non-parametric test). Of note, nine positive RCTs with a 2:1 randomisation ratio were not initially included in the FI calculation because correlation with trial features could not be performed. Seven of them remained significant after reconstruction of the Kaplan-Meier curves, and for these studies, the median FI and median FQ were 4 (IQR 2.5-14.5) and 1% (IQR 0.6-2%), respectively.
Among the 25 studies included in the FI analysis, FI was associated with a blind assessment of the primary endpoint (median FI 9 [IQR 8-12] with blind assessment vs. 2 [IQR 2-6] without blind assessment; p = 0.01). FI was also positively correlated with the number of reported events in the control arm (RS = 0.45, p = 0.02) and the impact factor (RS = 0.58, p = 0.003) and was negatively correlated with the p value (RS = -0.83, p <0.0001) (Table 2). There was no significant correlation between the size of the experimental or control group and the FI, and there was no difference in terms of FI between academic and industry sponsorship of the study or across the type of treatment assessed (curative, adjuvant, non-curative locoregional, and systemic) (Table 3). Next, we focused on the correlation between the FQ and the characteristics of clinical trials. The FQ (%) was significantly different between phase II and III studies (median FQ 6.4 [IQR 2.8-9.4] in phase II vs.  (Table 3).

Discussion
The FI is an easy method to quantify the robustness of a trial but should be interpreted with other parameters reported in RCTs such as p value, hazard ratio, absolute difference, and power. Moreover, the effect size is often unstable in small trials, and loss to follow-up can decrease confidence in the significance of the effect. The FI is an absolute measure of stability, irrespective of trial size, and we also included in our study the FQ (defined by the absolute FI number divided by the total sample size) to consider the trial sample size.
Our study assessed the FI and FQ of phase II and III RCTs on the treatment of HCC available in the literature between 2002 and 2022. To our knowledge, this is the largest systematic review evaluating the FI and FQ of RCTs to assess the quality of trials in the field of HCC treatment. Among the 51 positive phase II and III prospective RCTs we identified, only 29 were eligible for the calculation of the FI, 4 of which lost significance after reconstruction of patient data from the Kaplan-Meier curves. The use of a stratified log-rank test or a Cox proportional hazards model in the original studies may explain the differences that we found after the reconstruction of the Kaplan-Meier curves for these four RCTs. We could also hypothesise that the results of these studies have limited robustness, as the significance of the main results varies according to the statistical test performed.
The main findings of our study are as follows: (1) the median FI in positive RCTs in HCC treatments was 5, and the median FQ was 3%; (2) FI was positively correlated with a blind assessment of the primary endpoint, the number of reported events in the control arm, and the impact factor, and was negatively correlated with the p value; and (3) FQ was negatively correlated with the p value, the number of patients and number of reported events in the experimental arm, and the number of patients in the control arm.
In our study, the median FI was 5, which indicates that at least five best survivors from the experimental arm must be reassigned to the control arm to change the statistically significant result to a non-significant result. As the FI is an absolute measure and does not consider the sample size, we calculated the FQ, which is the FI divided by the sample size. 11,12 The FQ expresses the proportion of patients (best survivors) who need to be reassigned for the result to lose significance. A smaller FQ also indicates a less robust study result. The median FQ in our study was 3%; consequently, 3% of the participants would have to be reassigned to lose significance. Overall, the larger the FI and FQ, the more robust the trial's results. Our median FI is slightly higher than the median FI of 2 recently reported by Del Paggio and Tannock 17 in phase III RCTs of FDA-approved anticancer drugs globally (drugs approved by the FDA between 1 January 2014 and 31 December 2018). Only one study had already assessed the FI of RCTs in the HCC field, but it included only six RCTs in its analysis, decreasing the applicability of its results. 18 Moreover, the FI has been applied to RCTs in other fields such as oncology, critical care, or heart failure, showing that several RCTs were considered fragile, regardless of the field of research. 11,19-21 Several investigators have recommended the routine inclusion of the FI in reporting clinical trial outcomes and developing clinical guidelines. 11 Although an FI value of 1 indicates extreme fragility, there is no specific cut-off value or lower limit of the FI to classify a study as either 'fragile' or 'robust'. In our study, two RCTs had an FI value of 0-1, indicating extreme fragility, and 10 RCTs had an FI of ≤2, which could be considered as 'fragile' RCTs.
FI was also correlated with the impact factor (p = 0.003). In a recent study of all 2,544 RCTs published between 2014 and 2021 in five high-impact journals (New England Journal of Medicine, The Lancet, Journal of the American Medical Association, British Medical Journal, and Annals of Internal Medicine), the 643 trials eligible for FI analysis showed that statistical significance was dependent on a median of 12 (IQR 3-28) events. 22 In the past decade, statistical significance of RCTs in high-impact journals has become more robust. However, 25% of RCTs are still dependent on three or fewer outcome events. 22 In addition, the impact factor of journals is not a valid measure of RCT quality, unlike the Jadad score 7 and Delphi list, 8 which were not correlated with the FI in our study. Moreover, FI was higher in RCTs with a blind assessment, suggesting more robust results in these trials. This corroborates evidence in the literature showing that unblinded assessment of an endpoint is subject to bias. Finally, we observed no significant difference in terms of median FI between the types of clinical trials (curative intent treatment, adjuvant treatment, locoregional treatments with non-curative intent, and systemic treatments in advanced stages). However, the low number of studies included in each subgroup decreases the robustness of this analysis.
Although the FI may improve our understanding of trial results, this method has some limitations, one of which is that the FI can only be calculated in the context of an RCT when outcomes are compared between two groups. Furthermore, the interpretation of the FI can be problematic when the number of participants who drop out for unknown reasons is large. RCTs with small samples and RCTs in which the event of interest is rare tend to be fragile. Another limitation of this study is that we included only RCTs with a two-arm parallel design or two-by-two factorial design and with available Kaplan-Meier curves with time-to-event data for FI measurement. Consequently, we did not assess the FI of RCTs with a non-inferiority design or of RCTs including more than two arms. This may lead to some uncertainty in generalising our data to all RCTs available in the field of HCC treatments. However, in our study, we used an adequate statistical methodology for survival data. Indeed, the reconstruction of individual patient data from published Kaplan-Meier curves allowed us to consider not only the events but also the timing of events, which is an essential piece of information to evaluate the effect of treatment on these types of endpoints. A statistical test (log-rank test) adapted to the survival data was also used to evaluate the p value and calculate an unbiased FI. Indeed, the original FI proposed by Walsh et al. 5 is based on binary results and the Fisher exact test, which could provide incorrect results for time-to-event data.
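To make the contrast with the survival-based FI concrete, the binary FI can be sketched as follows in Python. This is an illustrative approximation rather than a reference implementation: the function names `fisher_exact_two_sided` and `binary_fragility_index` are hypothetical, the two-sided Fisher p value sums all tables no more probable than the observed one, and converting non-events to events in the arm with fewer events is our reading of the original procedure. Note that the inputs are only event counts and arm sizes, so the timing of events and censoring play no role.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]]."""
    row_1, row_2, col_1 = a + b, c + d, a + c
    total_tables = comb(row_1 + row_2, col_1)

    def probability(x):
        return comb(row_1, x) * comb(row_2, col_1 - x) / total_tables

    observed = probability(a)
    lowest, highest = max(0, col_1 - row_2), min(row_1, col_1)
    p_value = sum(
        probability(x)
        for x in range(lowest, highest + 1)
        if probability(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def binary_fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Non-event-to-event conversions needed to lose significance.

    Returns None when the initial comparison is not significant.
    """
    a, b = events_1, n_1 - events_1
    c, d = events_2, n_2 - events_2
    if fisher_exact_two_sided(a, b, c, d) >= alpha:
        return None
    modify_first_arm = a <= c  # assumption: modify the arm with fewer events
    conversions = 0
    while True:
        if modify_first_arm:
            if b == 0:
                return None
            a, b = a + 1, b - 1
        else:
            if d == 0:
                return None
            c, d = c + 1, d - 1
        conversions += 1
        if fisher_exact_two_sided(a, b, c, d) >= alpha:
            return conversions
```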
In conclusion, our study suggests that several phase II and III RCTs in HCC treatment have a low FI, resulting in uncertainty regarding their robustness and potential clinical benefit. A systematic calculation of the FI could help interpret RCTs and guide their application in daily practice for patients with HCC.

Financial support
This study received no financial support.

Conflicts of interest
JCN has received research funding from Bayer and Ipsen. SS, NS, CC, JG, and FD have no conflicts of interest. NG-C has received honoraria from AbbVie, Bayer, Gilead, Ipsen, Roche, and Shionogi. MR has received educational fees from Canon Medical Systems, GE Healthcare, Ipsen, Guerbet, and Sirtex. VL has no conflicts of interest.
Please refer to the accompanying ICMJE disclosure forms for further details.

Data availability statement
Not applicable.