The ellipse of insignificance, a refined fragility index for ascertaining robustness of results in dichotomous outcome trials

There is increasing awareness throughout biomedical science that many results do not withstand the trials of repeat investigation. The growing abundance of medical literature has only increased the urgent need for tools to gauge the robustness and trustworthiness of published science. Dichotomous outcome designs are vital in randomized clinical trials, cohort studies, and observational data for ascertaining differences between experimental and control arms. It has however been shown with tools like the fragility index (FI) that many ostensibly impactful results fail to materialize when even small numbers of patients or subjects in either the control or experimental arms are recoded from event to non-event. Critics of this metric counter that there is no objective means to determine a meaningful FI. As currently used, FI is not multidimensional and is computationally expensive. In this work, a conceptually similar geometrical approach is introduced, the ellipse of insignificance. This method yields precise deterministic values for the degree of manipulation or miscoding that can be tolerated simultaneously in both control and experimental arms, allowing for the derivation of objective measures of experimental robustness. More than this, the tool is intimately connected with sensitivity and specificity of the event/non-event tests, and is readily combined with knowledge of test parameters to reject unsound results. The method is outlined here, with illustrative clinical examples.


Introduction
Biomedical science is crucial for human well-being, but there is an increasing awareness that many published results are less robust than desirable [1][2][3]. In fields from psychology [4] to cancer research [5], a substantial volume of research fails to replicate. There is an urgent need to address this, as spurious findings can not only obscure important research directions, but can even misinform potentially life-or-death decisions. While there are many reasons why published research might prove untrustworthy (including poorly conducted experiments, publish-or-perish pressure, and overt fraud in the form of data and image manipulation), inappropriate or misapplied statistical methods account for a large portion of misleading results. Even a properly performed statistical analysis may fail to adequately identify situations where data might lack robustness. P values are routinely misunderstood and misapplied, leading to confused research outputs [6][7][8].
Dichotomous outcome trials and studies are crucial in many avenues of biomedicine, from pre-clinical observational studies to randomized controlled trials. The essential principle is that they contrast experimental and control groups for some intervention, comparing the numbers positive for some specific endpoint in both arms. This is integral to modern medicine for ascertaining significant differences, but some authors have voiced concern that seemingly significant findings in these trials can often disappear with the recoding of even small numbers of patients from endpoint positive to negative in either arm. The fragility index (FI) is a measure of how many subjects must be recoded to change a trial outcome from statistically significant to non-significant. It is calculated by recoding one patient or subject at a time in the experimental group (or control group) from event to non-event, and applying Fisher's exact test after each recoding until significance is lost. The number of patients recoded when this occurs is the fragility index.
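The iterative procedure described above can be sketched in a few lines. This is a minimal illustration only, not a reference implementation: the function names `fisher_exact_p` and `fragility_index` are hypothetical, the two-sided p-value is computed directly from the hypergeometric distribution, and recoding is assumed to run from event to non-event in the experimental arm, which is taken to be the arm with the higher event rate.

```python
from math import comb


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, events = a + b, c + d, a + c
    total = comb(row1 + row2, events)

    def prob(k):
        # Hypergeometric probability of k events in the first row.
        return comb(row1, k) * comb(row2, events - k) / total

    observed = prob(a)
    lo, hi = max(0, events - row2), min(row1, events)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))


def fragility_index(a, b, c, d, alpha=0.05):
    """Recodings (event -> non-event, experimental arm) needed to lose significance."""
    recoded = 0
    # Assumes the experimental arm has the higher event rate.
    while a - recoded > 0 and fisher_exact_p(a - recoded, b + recoded, c, d) < alpha:
        recoded += 1
    return recoded


if __name__ == "__main__":
    print(fragility_index(15, 5, 5, 15))
```

Each pass through the loop repeats a full exact test, which is the source of the computational cost noted for FI on larger samples.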
The concept of FI has existed in various forms since at least the work of Feinstein [9], and in general the higher the fragility index is, the more robust an experiment is deemed. Applications of FI have shown some concerning results; in a study of 399 randomised controlled trials (RCTs) in high-impact medical journals, Walsh et al [10] found that median FI was 8 (range: 0-109), with 25% having FI ≤ 3. In 53% of these trials, numbers lost to follow-up exceeded FI. A meta-analysis of spinal surgery studies [11] found a median FI of 2, with 65% of trials having loss to follow-up greater than FI. A review of critical care trials [12] and a 2018 review of phase 3 cancer trials [13] both found median FIs of 2, and a 2020 review of epilepsy research [14] yielded a median FI of 1.5. A recent fragility analysis of COVID-19 trials found a median FI of only 4, despite the large numbers of patients involved [15]. This suggests that many results are not robust, and teeter on the edge of statistical significance. While a very useful metric, FI has some substantial faults. There is considerable debate over whether it is appropriate for time-to-event cases [16][17][18][19]. More directly, there is no simple FI cut-off that designates studies as either robust or fragile, though some authors suggest the fragility quotient (FQ), the ratio of FI to sample size, as an extension [20]. In addition, FI and FQ can be computationally expensive to run, typically requiring multiple iterations of Fisher's exact test to converge. As Fisher's exact test relies on factorials, it is typically not suited to larger trials or studies. It also implicitly considers only either control or experimental groups in isolation, even though it is possible that miscoding can occur in both cohorts. Nor does FI relate directly to test parameters between non-events and events, such as sensitivity or specificity.
Many of these objections and counterpoints to them are discussed in recent work by Baer et al [21].
With FI and FQ becoming increasingly commonly reported in the literature, it is worthwhile to introduce a related, refined metric with new applications. In this work, I introduce a geometric refinement of the concept underpinning FI which overcomes some difficulties associated with FI analysis, considering recoding in both control and experimental groups in tandem. This ellipse of insignificance (EOI) approach is exact and computationally inexpensive, yielding objective measures of experimental robustness. There are two major differences and situational advantages to such a formulation. Firstly, it can handle huge data sets with ease and consider both control and experimental arms simultaneously, which traditional fragility analysis cannot. Previously, fragility has typically been considered in the case of relatively small numbers in randomized controlled trials, which, as previous commentators have noted, are often fragile by design. The method outlined here handles massive numbers with ease, rendering it suitable for analysis of observational trials, cohort studies, and general preclinical work, to detect dubious results and fraud. This sets it apart in both intention and application from existing measures, and makes it unique in this regard.
Secondly, this methodology is not solely a new, robust fragility index; it also goes further by linking the concept of fragility to test sensitivity and specificity. This allows an investigator to probe not only whether a result is arbitrarily fragile, but whether certain results are even possible. This renders it less arbitrary than existent measures, as it directly ties statistically measurable quantities to stated results, and is sufficiently powerful to rule out suspect findings in many dichotomous trials and studies. It can accordingly be used to detect likely fraud or inappropriate manipulation of results if the statistical properties of the tests used are known. This is unfortunately highly relevant, as unsound or otherwise manipulated results have become an increasingly recognised problem in biomedical research, and means to detect them are vital. The ellipse of insignificance (EOI) analysis is outlined here for any 2 × 2 dichotomous outcome trial or study, with an experimental arm consisting of a subjects with endpoint positive outcomes and b without, and a control arm with c subjects with endpoint positive versus d without. The EOI analysis outlined in the methodology section allows rapid determination of the effects of recoding in all arms simultaneously, and ties this explicitly to test sensitivity and specificity, with illustrative examples of application demonstrated.

Methods
The ellipse of insignificance approach is based upon the principles of a chi-squared analysis. Consider an experimental group containing $a$ participants with a given endpoint and $b$ participants without that endpoint. In the control group, there are $c$ participants with the given endpoint, and $d$ without. The total number of participants is given by $n = a + b + c + d$. For a 2 by 2 contingency table, the chi-squared statistic is given by

$$\chi^2 = \frac{n(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}. \tag{1}$$

When this statistic is greater than a specified threshold, results are deemed significant and differences between the control and experimental group are considered indicative of real differences. The initial question this work concerns itself with is ascertaining how many patients or subjects would have to be recoded to transform an ostensibly significant result into one where the null hypothesis is not rejected. This recoding can be achieved in two ways: by subtracting $x$ participants from $a$ (experimental group, endpoint positive) or by adding $y$ participants to $c$ (control group, endpoint positive). As recoding leaves the size of each arm unchanged, the experimental arm becomes $(a - x, b + x)$ and the control arm $(c + y, d - y)$. These configurations are given in table 1. Applying the same statistic outlined in equation 1, with a threshold critical value for significance of $\nu_c$, the resulting identity is

$$\frac{n\left[(a-x)(d-y) - (b+x)(c+y)\right]^2}{(a+b)(c+d)(a+c-x+y)(b+d+x-y)} = \nu_c. \tag{2}$$

This form can be expanded, with the resultant equation being a conic section [22] of the form $Ax^2 + Bxy + Cy^2 + Dx + Ey + F = 0$. This corresponds specifically to an inclined ellipse, with coefficients A-F given by

$$A = (c+d)\left[(c+d)n + (a+b)\nu_c\right], \tag{3}$$
$$B = 2(a+b)(c+d)(n - \nu_c), \tag{4}$$
$$C = (a+b)\left[(a+b)n + (c+d)\nu_c\right], \tag{5}$$
$$D = (c+d)\left[2(bc - ad)n + (a+b)(b + d - a - c)\nu_c\right], \tag{6}$$
$$E = (a+b)\left[2(bc - ad)n + (c+d)(a + c - b - d)\nu_c\right], \tag{7}$$
$$F = (bc - ad)^2 n - (a+b)(c+d)(a+c)(b+d)\nu_c. \tag{8}$$

Any points on or inside this ellipse of insignificance (EOI) will fall below the threshold to reject the null hypothesis, and the ellipse is effectively the bound of all values of $x$ and $y$ sufficient to cause a loss of significance at a threshold critical value of $\nu_c$, calculated from the chi-squared distribution at a given level of significance with one degree of freedom.
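As a hedged illustration of this construction, the sketch below computes the chi-squared statistic and the conic coefficients A-F obtained by expanding the recoded chi-squared identity with both arm sizes held fixed. It is not the supplementary OCTAVE/MATLAB code: the function names are hypothetical, and the critical value is obtained from the standard normal quantile, using the fact that a chi-squared variable with one degree of freedom is the square of a standard normal variable.

```python
from statistics import NormalDist


def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2 x 2 table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def critical_value(alpha=0.05):
    # With one degree of freedom, the chi-squared critical value is z squared.
    return NormalDist().inv_cdf(1 - alpha / 2) ** 2


def eoi_coefficients(a, b, c, d, alpha=0.05):
    """Coefficients (A, B, C, D, E, F) of the ellipse of insignificance."""
    n, nu = a + b + c + d, critical_value(alpha)
    r1, r2 = a + b, c + d  # arm sizes, unchanged by recoding
    A = r2 * (r2 * n + r1 * nu)
    B = 2 * r1 * r2 * (n - nu)
    C = r1 * (r1 * n + r2 * nu)
    D = r2 * (2 * (b * c - a * d) * n + r1 * (b + d - a - c) * nu)
    E = r1 * (2 * (b * c - a * d) * n + r2 * (a + c - b - d) * nu)
    F = (b * c - a * d) ** 2 * n - r1 * r2 * (a + c) * (b + d) * nu
    return A, B, C, D, E, F


def conic(coeffs, x, y):
    """Value of the conic at (x, y); zero or negative means on or inside the ellipse."""
    A, B, C, D, E, F = coeffs
    return A * x * x + B * x * y + C * y * y + D * x + E * y + F


if __name__ == "__main__":
    coeffs = eoi_coefficients(770, 230, 550, 450)
    print(chi_squared(770, 230, 550, 450), conic(coeffs, 0, 0) > 0)
```

A positive value of `conic` at the origin indicates that the unrecoded result lies outside the ellipse, that is, it is significant at the chosen level.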

FECKUP point and vector
Finding the minimum distance from the origin to the ellipse of insignificance allows us to ascertain the minimal error which would render results insignificant. To find this, we take the implicit derivative of the distance vector from the origin to this unknown point, and the implicit derivative of the equation of the inclined ellipse whose coefficients are given in equations 3-8.
Setting $y'$ equal in both equations leads to the pair of simultaneous equations for the unknown point $(x_e, y_e)$ of

$$(2Ax_e + By_e + D)\,y_e - x_e(Bx_e + 2Cy_e + E) = 0, \tag{9}$$
$$Ax_e^2 + Bx_e y_e + Cy_e^2 + Dx_e + Ey_e + F = 0. \tag{10}$$

Solving this system yields a quartic equation with four solutions, one of which will be the minimum distance point $(x_e, y_e)$. This can be readily checked, and the solution pair will correspond to the absolute minimum pair value to lose significance at a given threshold. This resultant point and vector denote the Fewest Experimental / Control Knowingly Uncoded Participants (FECKUP), with length $f_{min}$. An illustration of this is shown in figure 1(a). Accordingly, the points $x_e$ and $y_e$ can be understood as the resolution of vector $f_{min}$ in the experimental and control directions respectively. If both experimental and control participants can be miscoded, the theoretical minimum number that could be miscoded before a seemingly significant result dissipated, $d_{min}$, is the sum of the opposite and adjacent lengths of the right-angled triangle formed by hypotenuse $f_{min}$. As there are only integer numbers of participants, it thus follows that $d_{min} = \lfloor |x_e| + |y_e| \rfloor$.
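One way to locate the FECKUP point numerically is sketched below. It is an illustration under stated assumptions rather than the method used for the supplementary code: instead of eliminating one variable from the simultaneous equations, it uses the equivalent Lagrange-multiplier condition (the position vector is parallel to the gradient of the conic), which also reduces to a quartic, here in the multiplier. The function names are hypothetical, NumPy is assumed to be available, and the coefficient expressions are those obtained by expanding the recoded chi-squared identity with arm sizes held fixed.

```python
from math import floor
from statistics import NormalDist

import numpy as np
from numpy.polynomial import Polynomial


def eoi_coefficients(a, b, c, d, alpha=0.05):
    """Conic coefficients of the ellipse of insignificance (arm sizes fixed)."""
    n = a + b + c + d
    nu = NormalDist().inv_cdf(1 - alpha / 2) ** 2  # chi-squared critical value, 1 d.o.f.
    r1, r2 = a + b, c + d
    return (r2 * (r2 * n + r1 * nu),
            2 * r1 * r2 * (n - nu),
            r1 * (r1 * n + r2 * nu),
            r2 * (2 * (b * c - a * d) * n + r1 * (b + d - a - c) * nu),
            r1 * (2 * (b * c - a * d) * n + r2 * (a + c - b - d) * nu),
            (b * c - a * d) ** 2 * n - r1 * r2 * (a + c) * (b + d) * nu)


def feckup_point(coeffs):
    """Point on the ellipse closest to the origin, via a quartic in a multiplier."""
    scale = coeffs[0] + coeffs[2]  # rescaling leaves the curve unchanged
    A, B, C, D, E, F = (v / scale for v in coeffs)
    m, U = np.linalg.eigh(np.array([[A, B / 2], [B / 2, C]]))  # M = U diag(m) U^T
    w = U.T @ np.array([D / 2, E / 2])  # linear term in the rotated frame
    m1, m2, w1, w2 = float(m[0]), float(m[1]), float(w[0]), float(w[1])
    lam = Polynomial([0.0, 1.0])
    p1, p2 = 1 - m1 * lam, 1 - m2 * lam
    # Conic evaluated at u_i = lam * w_i / (1 - lam * m_i), cleared of denominators.
    quartic = (m1 * w1 ** 2 * lam ** 2 * p2 ** 2 + 2 * w1 ** 2 * lam * p1 * p2 ** 2
               + m2 * w2 ** 2 * lam ** 2 * p1 ** 2 + 2 * w2 ** 2 * lam * p2 * p1 ** 2
               + F * p1 ** 2 * p2 ** 2)
    best = None
    for root in quartic.roots():
        if abs(root.imag) > 1e-9 * max(1.0, abs(root)):
            continue  # complex roots are not points on the ellipse
        t = root.real
        point = U @ np.array([t * w1 / (1 - m1 * t), t * w2 / (1 - m2 * t)])
        if best is None or np.hypot(*point) < np.hypot(*best):
            best = point
    return float(best[0]), float(best[1])


if __name__ == "__main__":
    x_e, y_e = feckup_point(eoi_coefficients(770, 230, 550, 450))
    print(x_e, y_e, floor(abs(x_e) + abs(y_e)))
```

Rescaling the coefficients and rotating into the eigenbasis of the quadratic part are numerical conveniences only; they keep the polynomial coefficients of moderate size for large samples.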
If we instead only consider inaccuracies in the experimental group as possible, we may set $y = 0$ and $x = x_i$ in the equation of the ellipse, yielding the quadratic identity $Ax_i^2 + Dx_i + F = 0$, readily solvable to determine $x_i$. This is the point nearest the origin where the ellipse intercepts the x-axis. Conversely, we may consider a situation where only inaccuracies in the control group exist. By similar reasoning, setting $x = 0$ and $y = y_i$ yields the quadratic $Cy_i^2 + Ey_i + F = 0$, whose solution nearest the origin is $y_i$, the intercept of the ellipse with the y-axis. All these vectors are illustrated in figure 1(b), and are the maximum limits of miscoding theoretically possible before significance is lost.
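The two single-arm intercepts follow from the quadratic formula, as the short sketch below illustrates. The function names are hypothetical, and the coefficient expressions are again those obtained by expanding the recoded chi-squared identity with arm sizes held fixed; the root of smaller magnitude is taken as the intercept nearest the origin.

```python
from math import sqrt
from statistics import NormalDist


def eoi_coefficients(a, b, c, d, alpha=0.05):
    """Conic coefficients of the ellipse of insignificance (arm sizes fixed)."""
    n = a + b + c + d
    nu = NormalDist().inv_cdf(1 - alpha / 2) ** 2  # chi-squared critical value, 1 d.o.f.
    r1, r2 = a + b, c + d
    return (r2 * (r2 * n + r1 * nu),
            2 * r1 * r2 * (n - nu),
            r1 * (r1 * n + r2 * nu),
            r2 * (2 * (b * c - a * d) * n + r1 * (b + d - a - c) * nu),
            r1 * (2 * (b * c - a * d) * n + r2 * (a + c - b - d) * nu),
            (b * c - a * d) ** 2 * n - r1 * r2 * (a + c) * (b + d) * nu)


def nearest_root(p, q, r):
    """Real root of p*t**2 + q*t + r = 0 with the smallest magnitude."""
    disc = q * q - 4 * p * r
    if disc < 0:
        raise ValueError("the ellipse does not cross this axis")
    roots = ((-q - sqrt(disc)) / (2 * p), (-q + sqrt(disc)) / (2 * p))
    return min(roots, key=abs)


def axis_intercepts(coeffs):
    """(x_i, y_i): recoding needed in one arm alone to lose significance."""
    A, _, C, D, E, F = coeffs
    return nearest_root(A, D, F), nearest_root(C, E, F)


if __name__ == "__main__":
    print(axis_intercepts(eoi_coefficients(770, 230, 550, 450)))
```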

Metrics for fragility of results
To ascertain if a trial or study is robust against the miscoding of patients or subjects, we introduce metrics to quantify this. Considering only inaccuracies in the experimental group, we define the tolerance threshold for error in the experimental group as the fraction of subjects in the experimental group that can be miscoded while significance is maintained, given by

$$\varepsilon_E = \frac{|x_i|}{a + b}. \tag{11}$$

This identity is intimately related to the existent fragility index, yielding the traditional fragility quotient. For example, an experiment with $\varepsilon_E = 0.1$ after EOI analysis would inform us that up to 10% of experimental participants could be miscoded before the result lost significance. By similar reasoning, the tolerance threshold for error allowable in the control group is then

$$\varepsilon_C = \frac{|y_i|}{c + d}. \tag{12}$$

Finally, errors in both the coding of the experimental and control group can be combined with FECKUP point knowledge.
While $f_{min}$ gives a minimum vector distance to the ellipse, we instead take the summed lengths of its vector components, $d_{min}$, to yield an absolute accuracy threshold of

$$\varepsilon_A = \frac{d_{min}}{n} = \frac{\lfloor |x_e| + |y_e| \rfloor}{n}. \tag{13}$$

Relating test sensitivity and specificity to miscoding thresholds
The identities derived thus far give a measure of the absolute accuracy required for confidence in the robustness of stated results. If details of the specific tests employed to determine endpoints in the experimental and control cohorts are known, then robustness can be directly related to the sensitivity and specificity of the tests employed. If the sensitivity ($s_{ne}$) and specificity ($s_{pe}$) of the test used to ascertain cases in the experimental group are known, then the observed number of cases with endpoint positive, $a$, is related to the true number of endpoint positive cases, $a_o$, by

$$a = a_o s_{ne} + (a + b - a_o)(1 - s_{pe}). \tag{14}$$

Solving for $a_o$, and measuring miscoding in the same direction as $x$ (subjects moved out of $a$), it follows that the minimum miscoded cases in the experimental group is given by

$$x_m = a - a_o = a - \frac{a - (a + b)(1 - s_{pe})}{s_{ne} + s_{pe} - 1}. \tag{15}$$

A similar relationship can be derived for the control group, with sensitivity $s_{nc}$, specificity $s_{pc}$ and true number of endpoint positive cases $c_o$. Measuring miscoding in the same direction as $y$ (subjects added to $c$), the minimum miscoded cases in the control group is given by

$$y_m = c_o - c = \frac{c - (c + d)(1 - s_{pc})}{s_{nc} + s_{pc} - 1} - c. \tag{16}$$

The values $(x_m, y_m)$ denote the minimum miscoding that exists in reported figures because of inherent test limitations, and it follows that if this pair-value lies within the ellipse of insignificance, then any ostensible results of the study are not robust. The forms given in equations 15 and 16 are general forms. In many cases the same test is used for endpoint determination in the experimental and control groups, so that $s_{ne} = s_{nc}$ and $s_{pe} = s_{pc}$. However, in observational and cohort trials in particular, there are instances where accrued data derive from different tests on various cohorts, an example of which will be introduced later in this work.
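The check described above can be sketched as follows. This is an illustration under explicit assumptions, not the author's implementation: the true positive count is recovered by inverting the relation between observed and true counts, the minimum miscoding is assumed to be measured in the same coordinates as $x$ and $y$ (that is, $x_m = a - a_o$ and, for a true control count $c_o$, $y_m = c_o - c$), and a point is treated as on or inside the ellipse when the conic is zero or negative there. All function names and the sensitivity and specificity values are hypothetical.

```python
from statistics import NormalDist


def eoi_coefficients(a, b, c, d, alpha=0.05):
    """Conic coefficients of the ellipse of insignificance (arm sizes fixed)."""
    n = a + b + c + d
    nu = NormalDist().inv_cdf(1 - alpha / 2) ** 2  # chi-squared critical value, 1 d.o.f.
    r1, r2 = a + b, c + d
    return (r2 * (r2 * n + r1 * nu),
            2 * r1 * r2 * (n - nu),
            r1 * (r1 * n + r2 * nu),
            r2 * (2 * (b * c - a * d) * n + r1 * (b + d - a - c) * nu),
            r1 * (2 * (b * c - a * d) * n + r2 * (a + c - b - d) * nu),
            (b * c - a * d) ** 2 * n - r1 * r2 * (a + c) * (b + d) * nu)


def true_positives(observed_pos, arm_size, sensitivity, specificity):
    # Invert observed = true * sens + (arm_size - true) * (1 - spec) for the true count.
    return (observed_pos - arm_size * (1 - specificity)) / (sensitivity + specificity - 1)


def minimum_miscoding(a, b, c, d, sens_e, spec_e, sens_c, spec_c):
    """(x_m, y_m) in ellipse coordinates; the sign convention is an assumption."""
    x_m = a - true_positives(a, a + b, sens_e, spec_e)
    y_m = true_positives(c, c + d, sens_c, spec_c) - c
    return x_m, y_m


def inside_ellipse(coeffs, x, y):
    """True when (x, y) lies on or inside the ellipse of insignificance."""
    A, B, C, D, E, F = coeffs
    return A * x * x + B * x * y + C * y * y + D * x + E * y + F <= 0


if __name__ == "__main__":
    table = (770, 230, 550, 450)
    # Hypothetical case: different tests in each arm.
    x_m, y_m = minimum_miscoding(*table, 0.99, 0.80, 0.80, 0.99)
    print(x_m, y_m, inside_ellipse(eoi_coefficients(*table), x_m, y_m))
```

With the hypothetical mixed-test values in the usage example, the implied miscoding falls inside the ellipse, so the result would not be considered robust; with an identical test of 95% sensitivity and specificity in both arms, it falls outside.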

Method inversion
It is important to note that the analysis presented here can be used not only to ascertain miscoding between endpoint positive and negative situations, but also can be inverted for situations where, for example, endpoint positive or negative might be known with high certainty but there are concerns over miscoding between control and experimental groups. In this case, simply reassigning endpoint positive, experimental and control groups respectively as (a, b) and endpoint negative experimental and control groups as (c, d) allows straightforward application of EOI analysis as outlined.
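Read literally, this reassignment amounts to transposing the 2 × 2 table before the analysis is run. The sketch below illustrates that reading only; `invert_table` is a hypothetical helper, and the interpretation of the relabelled arms is an assumption of the example.

```python
def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2 x 2 table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def invert_table(a, b, c, d):
    """Swap the roles of outcome and group by transposing the table.

    Input:  a, b = experimental (positive, negative); c, d = control (positive, negative).
    Output: a, b = endpoint positive (experimental, control);
            c, d = endpoint negative (experimental, control).
    """
    return a, c, b, d


if __name__ == "__main__":
    original = (770, 230, 550, 450)
    inverted = invert_table(*original)
    # The statistic is unchanged; only the meaning of recoding differs.
    print(inverted, chi_squared(*original), chi_squared(*inverted))
```

After inversion, a displacement x moves subjects between experimental and control groups within the endpoint positive row, rather than between outcomes within an arm.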

Polygon of insignificance
The ellipse of insignificance yields a continuously valued boundary. As only integer values are generally of concern, we can also define an irregular polygon of insignificance (POI) by considering the largest integer-valued polygon encompassing the EOI. Similarly, we can also take the floor values of $x_e$, $y_e$, $x_i$ and $y_i$ in such an approach. This is readily derived from EOI analysis, and code to produce such a shape is included in the supplementary material.
[...] 05. An EOI analysis shows that a displacement of less than 2 subjects would be enough to undo this seeming significance, as shown in figure 2, and that the absolute tolerance threshold was only $\varepsilon_A$ = 0.22%, as given in table 2. This rendered the actual result highly fragile, given that inspection of the supplied tables in the paper in question demonstrated that at least 9 subjects had been miscoded in the initial analysis. These weaknesses, coupled with the lack of a plausible biophysical hypothesis and a non-physical dose response curve, suggest such findings were likely spurious [23].
EOI statistic (α = 0.05) | Derived value
Experimental group tolerance, $x_i$ | 6.9 subjects
Control group tolerance, $y_i$ | 1.9 subjects
FECKUP vector length | 1.9 subjects
Tolerance threshold for error (experimental group), $\varepsilon_E$ | 0.99%
Tolerance threshold for error (control group), $\varepsilon_C$ | 0.89%
Absolute tolerance threshold for error (all subjects), $\varepsilon_A$ | 0.22%

Illustrative example 2 - EOI robustness analysis of similar results
Consider two hypothetical experiments that yield highly similar $\chi^2$ statistics. Experiment 1 has $(a_1, b_1, c_1, d_1) = (770, 230, 550, 450)$ and experiment 2 gives $(a_2, b_2, c_2, d_2) = (144, 856, 20, 980)$, both of which correspond to $\chi^2 \approx 100$ and p-values < 0.00001. We can employ EOI analysis to ascertain how robust these seemingly strong respective results are for different values of α. The EOI analysis and FECKUP vectors are illustrated in figure 3 for α = 0.05, and relevant statistics for various values of α are given in table 3. It can be seen from this that despite the similar test statistics, Experiment 1 is consistently more robust, and would require the miscoding of at least 178 participants (8.9% of the entire sample) to lose significance, relative to 99 (≈ 5% of the entire sample) in Experiment 2 at α = 0.05, a trend that continues even with lower values of α.
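The headline statistics for the two hypothetical experiments can be checked directly. The sketch below is illustrative only: the function names are hypothetical, no continuity correction is applied, and the p-value uses the identity that the upper tail of a chi-squared distribution with one degree of freedom equals erfc(sqrt(chi2 / 2)).

```python
from math import erfc, sqrt


def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2 x 2 table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value(chi2):
    # Upper tail of the chi-squared distribution with one degree of freedom.
    return erfc(sqrt(chi2 / 2))


if __name__ == "__main__":
    experiments = {"Experiment 1": (770, 230, 550, 450),
                   "Experiment 2": (144, 856, 20, 980)}
    for label, table in experiments.items():
        stat = chi_squared(*table)
        print(label, round(stat, 1), p_value(stat) < 1e-5)
```

The two statistics come out at roughly 107.8 and 102.1, consistent with the stated $\chi^2 \approx 100$, and the quoted 178 and 99 participants correspond to 8.9% and about 5% of the 2000 subjects in each experiment.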

Discussion
The analysis presented here is a deterministic way to ascertain the fragility of a given dichotomous outcome study by considering experimental and control groups in concert. This method is geometrical in origin and computationally inexpensive. It can also explicitly relate outcome fragility to the sensitivity and specificity of the tests employed, when these are known, aiding clinicians and meta-researchers in interpreting the trustworthiness of a given study. Sample OCTAVE and MATLAB code and stand-alone Windows applications are provided to run the analysis outlined in this work, available in the electronic supplementary material.
There are a number of limitations of this work that should be explicitly discussed, and caveats to be elucidated. The EOI analysis handles potential miscoding, but cannot be used to infer anything about patients or subjects lost to follow-up. This is a weakness of all FI/FQ methods, as it is not a priori knowable from reported data alone why patients dropped out, or why attrition might have occurred in particular subgroups. Redaction bias [25] can occur if subjects leave a particular subset at an elevated rate, and while beyond the scope of this work, it is important to realise that explicit connections between EOI/FI/FQ analysis and numbers lost to follow-up cannot be directly made.
The method outlined is deterministic and rapid, but currently applicable only to dichotomous outcome trials and studies, and should be applied very cautiously to time-to-event data, where it may not be suitable. FI itself is also typically calculated using Fisher's exact test, whose results are usually close to those of a chi-squared test. However, for small trials, the p value derived from Fisher's exact test can be discrepant from the chi-squared result. When Fisher's exact test produces a non-significant p value without any recoding, an FI of 0 results, suggesting a distinct lack of robustness of the underlying data. As EOI analysis is built upon chi-squared statistics, it is possible in edge cases of small numbers to have discordant results between EOI and Fisher's exact test also. The chief advantage of the method outlined here, however, is that it handles extremely large data sets with ease. In large data sets, Fisher's exact test breaks down due to its dependence on factorials, and a chi-squared approximation is more appropriate. This is fitting, given that EOI is built upon the chi-squared distribution. The important caveat is that for rare events in small trials, a fragility index approach built upon Fisher's exact test may be more appropriate [21].
The usage of FI/FQ itself remains contested in the literature, and one frequent objection is that the mere existence of a small FI might be an artefact of trial design [26]. With clinical RCTs in particular, experimenters often design trials to minimize exposure of patients or subjects to as-yet-unknown harms, while seeking to ensure enough of them participate so that clinically relevant causal effects can be reliably detected. From this vantage point, RCTs might be fragile 'by design'. This view is countered by other authors [21] who argue that there is no evidence p-value distributions tend to cluster around the significance threshold after a sample size calculation, and that the fragility index in well-designed studies is not always low [27]. This work does not comment on the absolute applicability of the FI, but offers new metrics for quantification of results in context. More importantly, EOI analysis has definite application for dichotomous outcome results not derived just from fragile-by-design RCTs, but from ecological studies, cohort trials, and preclinical work which should in principle be far more resilient to investigation than RCTs.
There is a less edifying but important reason why EOI analysis might be conducted: the detection of questionable research practices and fraud. While most scientists and clinicians operate ethically, poor conduct and inappropriate statistical manipulation can and do occur. By some estimates, up to three quarters of all biomedical science is affected by poor practice [28], casting doubt on results to the detriment of science and the public, often a consequence of publish-or-perish pressure [3]. During the COVID-19 pandemic, a number of dubious high-profile results have come to light, particularly on drugs like Ivermectin [29,30]. EOI analysis has a potential role in detecting manipulations that nudge results towards significance, and identifying inconsistencies in data. EOI analysis is perhaps ideal for this purpose, as it explicitly relates known test sensitivity and specificity to projected error tolerance, allowing detection of suspect results in even large data sets, as illustrated by the real examples in this work.
Despite its caveats on usage, the fragility index has seen growing application in analysis of trial outcomes, and the EOI system presented here should allow this to be applied more thoroughly in a multidimensional way. Regardless of whether appropriate research practice has been observed or not, it is important to be able to estimate the soundness of results in biomedical science, to ascertain what level of confidence one can ascribe to them. This need has seen the recent resurgence of FI analysis, and the EOI analysis presented here can help uncover questionable results and experimental inconsistencies, with wide potential application in meta-research and reproducible research.