The statistical fragility of the management options for reverse shoulder arthroplasty: a systematic review of randomized control trial with fragility analysis

Reverse shoulder arthroplasty (RSA) is used in the treatment of traumatic and arthritic pathologies, with expanding clinical indications; as a result, clinical research on the topic has increased. The purpose of this study was to examine the statistical fragility of randomized control trials (RCTs) reporting outcomes from RSA. A systematic search was undertaken to find RCTs investigating RSA. The Fragility Index (FI) was calculated using Fisher’s exact test, by sequentially altering the number of events until there was a reversal of significance. The Fragility Quotient (FQ) was calculated by dividing the FI by the trial population. Each trial was assigned an overall FI and FQ, calculated as the median result of its reported findings. Overall, 19 RCTs warranted inclusion in the review, representing 1146 patients, of whom 41.2% were male, with a mean age of 74.2 ± 4.3 years and mean follow-up of 22.1 ± 9.9 months. The median RCT population was 59, with a median of 9 patients lost to follow-up. The median FI was 4.5, and the median FQ was 0.083, indicating that more patients did not complete the trial than the number of outcomes which would have to change to reverse the finding of significance. This review found that the RCT evidence for RSA management may be vulnerable to statistical fragility, with a handful of events required to reverse a finding of significance.

Reverse shoulder arthroplasty (RSA) is used in the treatment of traumatic and arthritic pathologies. RSA was developed in the 1970s to address poor outcomes associated with anatomic shoulder arthroplasty and shoulder hemiarthroplasty in managing rotator cuff deficient shoulders. By reversing the anatomic position of the articulating glenoid and humeral head, it was hoped that maximizing deltoid function would improve range of motion and strength while limiting the risk of dislocation. 14 RSA case volume has been increasing: between 2011 and 2017 there was an almost 200% increase in the number of RSAs performed in the United States, with an annual incidence of 20/100,000 persons. 28,46 This trend is being replicated across the developed world and is expected to continue over the coming decades, with growth in shoulder arthroplasty far outstripping that of hip and knee arthroplasty. 25,46 Evidence-based medicine has become imperative to safe and effective clinical decision-making since the concept was introduced by Cochrane. 5 The randomized control trial (RCT) forms level I evidence at the top of the pyramid of evidence. 3 Orthopedics is a challenging area of medicine in which to ensure that high-quality evidence is available, owing to often small sample sizes, difficulty in blinding, and patient rejection of randomization. 27 This is borne out in reviews of orthopedic evidence, which have found serious issues with methodological and statistical rigor. 32 Newer topics such as RSA, for which the published evidence is limited, are at the greatest risk of suffering from an underdeveloped evidence base.
To the authors' knowledge, Fragility Index (FI) and Fragility Quotient (FQ) statistical analysis has not been applied to RCT level I evidence assessing RSA. The FI is the minimum number of events which must be reversed to change the significance finding for a given outcome, while the FQ expresses fragility relative to the size of the trial population. The purpose of this study was to examine the statistical fragility of RCTs reporting outcomes from RSA. Our hypothesis was that included studies would be consistently fragile to a reversal of their stated findings and that the FI would be comparable to the number lost to follow-up (LTFU).
No Ethical Committee approval was deemed to be necessary for this article as this was a review of publicly available published data and did not include any patient data.

Search strategy
In reference to Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines, 2 independent reviewers (T.D. and E.H.) performed a systematic review of the literature in August 2022, including 2 databases (PubMed and Embase). 29 The search terms used were "Arthroplasty, Replacement, Shoulder" [Mesh] AND "reverse shoulder arthroplast*" OR "reverse shoulder replacement" OR "reverse total shoulder arthroplast*" OR "reverse total prosthetic." The texts discovered using this search strategy were screened by both independent reviewers, with removal of duplicate studies, followed by application of our eligibility criteria.

Eligibility criteria
The inclusion criteria were (1) RCTs that investigate the management of RSA; (2) reporting dichotomous outcomes and statistical significance; (3) full-text studies, published in the last 20 years; (4) published in peer-reviewed journals; and (5) published in the English language. The exclusion criteria were (1) RCTs without a clear randomization protocol, (2) review articles, (3) studies in vitro, and (4) studies involving animals. In cases of disagreement between the 2 independent reviewers as to whether a study met the inclusion or exclusion criteria, the disagreement was to be resolved by the senior author.

Assessment of evidence
All included studies were assessed for their reported level of evidence, using The Journal of Shoulder and Elbow Surgery criteria. 22 The Risk of Bias II (ROB II) tool was used to assess the quality of evidence of the included RCTs. 38 All studies were assessed for the presence and nature of a statistical power analysis. The latest impact factor of the publication journal was recorded.

Data extraction
Following application of the predetermined inclusion and exclusion criteria, both reviewers collected information on the following variables from included studies in a password-protected database on Microsoft Excel (Microsoft Corporation, Redmond, WA, USA): (1) year of publication; (2) randomization methods; (3) statistical power analysis (type of analysis and reported power); (4) the primary and secondary outcomes as specified in the trial protocol; (5) length of follow-up (months); (6) number of participants included in each of the treatment arms; (7) mean age of participants (years); (8) sex of participants; (9) number as protocol, number per protocol, and numbers LTFU; (10) the reported significance of each event; and (11) all dichotomous outcomes of relevance. As protocol describes the number of patients in a trial who were randomized to a study arm and received the assigned treatment. Per protocol is hereby defined as the number of patients who complete the trial and remain at the end of the follow-up period.

Statistical analysis
The FI was calculated using GraphPad open source online software (GraphPad, San Diego, CA, USA). 17 For dichotomous outcomes, both the events and nonevents for each treatment arm were entered into a 2 × 2 grid, and a 2-tailed Fisher's exact test was used to calculate the P value, with α = 0.05. This recalculation is critical, as some of the originally reported P values will have been calculated using the chi-squared test. To calculate the FI, the 2 × 2 grid is manipulated until there is a reversal of the original significance finding (Fig. 1). For an outcome reported as significant, the grid was manipulated by adding 1 to the events in the treatment arm which had fewer events, while subtracting 1 from the nonevents to maintain the overall population of that treatment arm. This process was repeated until the result became nonsignificant (P > .05). Conversely, for outcomes which were not significant, the number of events required to decrease P to < .05 was calculated by adding 1 to the events in the treatment arm which had more events and subtracting 1 from the nonevents to maintain the population of that treatment arm, repeating until the result became significant. The number of events changed was recorded as the FI for that outcome. The FI for all outcomes reported in an RCT was calculated in this manner. The median, with interquartile range (IQR), of the FIs of the outcomes in a trial was recorded as the overall FI for that RCT. For each finding, the FQ was calculated in Microsoft Excel by dividing the FI by the per-protocol number for that RCT. The overall median FQ and IQR for each study were calculated in the same manner as the FI. We used Pearson's correlation coefficient when assessing for direct correlation.
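The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GraphPad and Excel workflow used in this review: the function names (`fisher_exact_two_sided`, `fragility_index`, `fragility_quotient`, `trial_summary`) are hypothetical, the two-tailed P value is computed by summing all tables no more likely than the observed one, "fewer events" is taken to mean the lower event count, and a tie in event counts is resolved toward the first arm.

```python
from math import comb
from statistics import median


def fisher_exact_two_sided(a, b, c, d):
    """Two-tailed Fisher's exact P value for the 2 x 2 grid [[a, b], [c, d]].

    a, b = events and nonevents in arm 1; c, d = events and nonevents in arm 2.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in arm 1 with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every possible grid that is no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Number of event changes needed to reverse the significance finding.

    Returns None if the finding cannot be reversed by changing one arm.
    """
    events = [events_1, events_2]
    sizes = [n_1, n_2]

    def p_value():
        return fisher_exact_two_sided(
            events[0], sizes[0] - events[0], events[1], sizes[1] - events[1]
        )

    significant = p_value() < alpha
    fewer = 0 if events[0] <= events[1] else 1  # assumption: ties go to arm 1
    # Significant finding: add events to the arm with fewer events.
    # Nonsignificant finding: add events to the arm with more events.
    arm = fewer if significant else 1 - fewer
    changes = 0
    while (p_value() < alpha) == significant:
        if events[arm] == sizes[arm]:
            return None  # no nonevents left to convert
        events[arm] += 1  # +1 event and -1 nonevent, so the arm size is fixed
        changes += 1
    return changes


def fragility_quotient(fi, per_protocol_n):
    """FQ = FI divided by the per-protocol trial population."""
    return fi / per_protocol_n


def trial_summary(outcome_fis, per_protocol_n):
    """Overall (median FI, median FQ) across the outcomes of one trial."""
    fqs = [fragility_quotient(fi, per_protocol_n) for fi in outcome_fis]
    return median(outcome_fis), median(fqs)
```

As a usage example with hypothetical numbers, `fragility_index(1, 30, 12, 30)` evaluates an outcome with 1 of 30 events in one arm and 12 of 30 in the other, and `trial_summary([2, 4, 6], 50)` summarizes a trial with three outcome FIs and 50 per-protocol patients.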

Literature search
Following our initial search, a total of 3594 studies were returned. Following manual removal of duplicate studies, 2663 studies remained for application of our eligibility criteria. Thereafter, the titles and abstracts were evaluated, yielding 178 studies for full-text review. Nineteen RCTs met the eligibility criteria warranting inclusion in this systematic review (Fig. 2). The included RCTs represented 1146 patients, with 41.2% being male, a mean age of 74.2 ± 4.3 years, a mean body mass index of 29.9 ± 1.6 kg/m², and a mean follow-up of 22.1 ± 9.9 months.

Assessment of evidence
The quality of evidence was assessed using the ROB-II tool. 38 No RCTs were found to be at a high ROB, 13 were found to have a low ROB, 9,10,15,19,20,23,24,33,40,43-45,48 while for 6 there were some concerns about potential bias 7,16,18,35,41,42 (Supplementary Appendix S1). The current impact factor of the journals in which the included RCTs are published had a mean of 3.6 ± 0.6, with 14 (74%) of the RCTs published in the Journal of Shoulder and Elbow Surgery, 2 in the Journal of Bone and Joint, and 1 each in Journal of Orthopaedic Research, Journal of Orthopaedic Trauma, and Archives of Orthopaedic and Trauma Surgery.

Fragility index and quotient
From the 19 included RCTs, there were 85 reported dichotomous outcomes. The overall median FI was 4.5 (IQR, 4-5), and the median FQ was 0.083 (0.065-0.098). The median number of patients LTFU was 9 (range, 3-12). In 13 RCTs (68%), the number LTFU was greater than the median trial FI, while in 6 RCTs it was less than the median FI. A subgroup analysis is shown in Table I. The FI and FQ of these subgroups show that primary outcomes, significant findings, and outcomes where the FI < LTFU were consistently more prone to fragility when compared to the overall median FI and FQ. It also highlights that the majority of reported dichotomous outcomes were secondary and not significant.

Power analysis
All 19 publications reported a power analysis, with 2 post-hoc analyses 10,45 and 17 a priori power analyses. The post-hoc group showed greater fragility, with a median FI of 3.5 (3.25-3.75), when compared to the a priori group, with a median FI of 5 (4-5). Eleven RCTs (57.9%) were appropriately statistically powered (ASP), meaning they recruited a sufficient sample size to satisfy a requirement of at least 80% power, 9,10,16,19,20,24,33,35,40,43,48 while 8 RCTs (42%) were statistically underpowered (SUP), as they did not recruit a population sufficient to achieve 80% power. 7,15,18,23,41,42,44,45 The ASP subgroup had a greater median trial FI than the SUP group at 5 (4.50-5) vs. 4 (3.75-4.13). We observed an association between higher-powered studies and those with higher FIs, with data shown fully in Table II. There was not a strong relationship between the median FI and the As Protocol (AP) or Per Protocol (PP) population of a trial. The Pearson's correlation coefficient between AP trial population and median trial FI was R(19) = 0.26, P = .256, and between PP trial population and median FI was R(19) = 0.25, P = .302. This suggests a weak nonsignificant positive correlation between having more participants and reporting less fragile results. The number of participants LTFU showed a weak nonsignificant positive correlation to the median FI at R(19) = 0.21, P = .410. The correlation between the publishing journal's impact factor and median FI was very weakly positive at R(19) = 0.11, P = .665. A moderate positive correlation was detected between the AP trial population and participants LTFU, which was significant at R(19) = 0.63, P < .004. These data are summarized in Table III.
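For readers who wish to reproduce this kind of correlation analysis, the following is a minimal sketch of Pearson's correlation coefficient in Python. The function names `pearson_r` and `t_statistic` are hypothetical, and the trial-level values behind the correlations reported here are not reproduced in the text, so any inputs would have to come from the included trials. The sketch returns the coefficient and its t statistic only; the corresponding P value would be read from a t distribution with n - 2 degrees of freedom, which is not computed here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length, at least 3 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))
```

For example, `pearson_r(populations, median_fis)` would take one trial population and one median FI per included RCT, where both list names are placeholders.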

Discussion
The most important finding of this review was that level I RSA clinical evidence was vulnerable to statistical fragility, with a median FI of 4.5 indicating that the reversal of just a handful of outcomes was sufficient to reverse a finding of statistical significance. This should be viewed in the context of the median number of patients LTFU being equal to 9. The median trial lost more patients to follow-up than the number of outcomes which would have to be changed to reverse a finding of significance. These figures add uncertainty to the true validity of a finding of significance, as approximately two-thirds of included events may have had reversed significance findings had there been a more complete follow-up. We cannot know what outcome a patient LTFU had, but it stands to reason that had the trial been completed without their loss, the finding of significance may have been reversed. Events with an FI more than the number LTFU for that trial were more robust than those with an FI less than the number LTFU. These results support the conclusion that a number LTFU > FI is an indicator of potential fragility. Comparative trials of shoulder surgery should consider reporting the FI, FQ, and P value for findings to better demonstrate the statistical evidence which informs clinical decision-making.
Almost all published RCTs will report on statistical significance using P values, with α = 0.05 arbitrarily set as the cut-off for significance. The P value has recently been criticized due to limitations in its clinical relevance. 47 Due to the small sample size of many RCTs reporting dichotomous outcomes in orthopedics, trials often rely on a small number of events to calculate significance. The FI is a statistical tool first described by Feinstein. For any given outcome, the FI is the minimum number of events which must be reversed to change the significance of the findings using Fisher's exact test. Unlike a P value, the FI has no arbitrary point at which it is deemed significant, and it exists independently of the sample size from which it is calculated. 13 A lesser FI indicates a fragile result, while a greater FI indicates a more robust result. The FQ described by Ahmed is produced by dividing the FI by the trial population. This expresses the fragility of the finding relative to the size of the trial, giving added context and allowing for more standardized comparison between trials. 1
All included RCTs reported a statistical power analysis, which is a positive indicator of statistical rigor in the RSA literature. Of the 19 RCTs, 58% were appropriately statistically powered (ASP) while 42% were statistically underpowered (SUP). The ASP group displayed more robust results, with greater median FI and FQ, as seen in Table II. This finding is in keeping with the assumption that well-designed trials will produce more statistically certain results, while underpowered trials will produce more fragile results as they are at risk of type II errors. There were 17 a priori power analyses and 2 post-hoc analyses. The a priori analysis is considered to be the most appropriate method to conduct a power analysis, and this convention is supported by the fact that this group had a greater median FI and FQ than the post-hoc group. 36
This review found a nonsignificant weak correlation between both AP and PP trial population and FI. This highlights that the absolute number of participants is not a reliable guide to estimating fragility. The number of participants required will be determined by the size of the clinical effect being measured and its standard deviation. This review found a weak nonsignificant positive correlation between the impact factor of the journal and FI. This highlights that readers should not assume articles are statistically rigorous based solely on the reputation of the publishing journal, although this conclusion is limited in this instance by the prevalence of articles from a single journal in this review. There was also a very weak nonsignificant positive correlation between number LTFU and FI. This may be explained by larger RCTs having more patients LTFU in absolute terms and also reporting greater, more robust FIs.
A fragility analysis in 2018 of the RCTs cited by the American Academy of Orthopaedic Surgeons clinical practice guidelines as "strong evidence" reported a median FI of 2 and a median FQ of 0.022, with 53% of the RCTs statistically underpowered. 4 A previously published analysis of 12 surgical fragility analyses found the median FI to be 3 and the median FQ to be 0.039. 8 For the purpose of a more focused comparison, we conducted a search for fragility analyses which focus primarily on shoulder surgery. This returned 6 reviews, which report a median FI of 4 (4-4). 6,12,26,30,31,34 These figures suggest that the RSA RCT evidence base is comparable to the wider orthopedic literature, if not mildly more robust. It should be noted, however, that the RSA literature in general remains fragile, with a small number of events required to reverse a finding of statistical significance.
In 2016, the American Statistical Association issued a policy statement confirming that conclusions should not be reached on the basis of whether a P value reached a specific arbitrary threshold. 47 The P value does not measure the probability of a true result, the importance of a finding, or the size of an effect. On this basis, the authors endorse triple reporting of P values, FI, and FQ as the new standard for RCTs.

Limitations
One potential limitation of this analysis is the exclusive review of RCTs; this excludes other comparative studies which may have been informative. However, it is the opinion of these authors that fragility analyses should be reserved for RCTs to avoid the risk of selection bias and confounding variables, which are sources of fragility found in nonrandomized studies. 2,39 A further limitation of this review is that it includes fewer RCTs than some previously published analyses. 8,11,21,37 However, this is an accurate reflection of the RSA evidence pool that is currently available. The primary limitation of fragility analyses is that only dichotomous variables may be included. This led to the exclusion of continuous variables such as the Constant and ADLER scores, which are important outcome metrics in shoulder surgery. Such variables cannot be included unless there is a cut-off score which indicates that a certain outcome has been achieved, as the data then become dichotomous. Another limitation is the high prevalence of included secondary outcomes. Trials are usually powered for the detection of their primary outcomes, and so may be underpowered with regard to secondary outcomes. However, many secondary outcomes are highly clinically relevant, and so their analysis is both justified and important.

Conclusion
This review found that the RCT evidence for RSA management may be vulnerable to statistical fragility, with a handful of events required to reverse a finding of significance.

Disclaimers:
Funding: No funding was disclosed by the authors. Conflicts of interest: The authors, their immediate families, and any research foundation with which they are affiliated have not received any financial payments or other benefits from any commercial entity related to the subject of this article.