P-value, compatibility, and S-value

Misinterpretations of P-values and 95% confidence intervals are ubiquitous in medical research. Specifically, the terms significance or confidence, extensively used in medical papers, ignore biases and violations of statistical assumptions and hence should be called overconfidence terms. In this paper, we present the compatibility view of P-values and confidence intervals; the P-value is interpreted as an index of compatibility between data and the model, including the test hypothesis and background assumptions, whereas a confidence interval is interpreted as the range of parameter values that are compatible with the data under background assumptions. We also suggest the use of a surprisal measure, often referred to as the S-value, a novel metric that transforms the P-value, for gauging compatibility in terms of an intuitive experiment of coin tossing.


Introduction
A recent multicenter randomized trial at 130 sites in 18 countries hypothesized that ticagrelor, in combination with aspirin for 1 month, followed by ticagrelor alone, improves outcomes after percutaneous coronary intervention compared with standard antiplatelet regimens [1]. The primary endpoint at 2 years was a composite of all-cause mortality or new Q-wave myocardial infarction. The intention-to-treat rate ratio (RR) estimate using the Mantel-Cox method was 0.87 [95% confidence interval (CI): 0.75-1.01] with a two-sided P-value of 0.073. The authors concluded that "In our multicenter randomized trial, ticagrelor in combination with aspirin for 1 month followed by ticagrelor alone for 23 months was not superior to standard 1-year dual antiplatelet therapy followed by aspirin monotherapy in terms of the composite endpoint of all-cause mortality or new Q-wave myocardial infarction after percutaneous coronary intervention" [1]. This conclusion is based on comparing the P-value of 0.073 to the default cutoff value of 0.05. Also, the paper freely uses the term "significantly", including the expression "did not differ significantly between … groups" four times.
Such misinterpretations of the P-value, which rest on the cutoff value of 0.05 and disregard the association measure estimate and its 95% confidence interval, are not uncommon in medical research. They are a consequence of using overconfidence terms such as significance or confidence. In this paper, we argue that P-values and confidence intervals should be interpreted as measures of the compatibility of different parameter values with the data, and we suggest using an alternative measure known as the S-value, which better facilitates the compatibility view.

P-value as a measure of compatibility
The P-value is often defined as the probability of the observed or more extreme results if the test hypothesis is true. This definition implicitly relies on background assumptions, including the population distribution of the outcome variable (e.g., Normal distribution), random sampling or randomization of the participants, random measurement error in the exposure and outcome variables, and no bias in the design, execution, analysis, and reporting. In fact, a statistical-testing procedure tests both the test hypothesis and the background assumptions, which together we refer to as the model. The P-value is an index of compatibility between the data and the model, ranging from 0 (completely incompatible) to 1 (completely compatible) [2][3][4][5][6]. For a sufficiently small P-value, we conclude that the model is incorrect, that is, either the test hypothesis or the background assumptions or both are incorrect; otherwise we must assume that a rare event has occurred [2]. Thus a very small P-value does not necessarily indicate a false test hypothesis if some background assumptions are violated. For a sufficiently large P-value, on the other hand, we can only say that the data are compatible with the model predictions; we cannot conclude that the model is correct, as the P-value is not an index of support for the tested model [2,3,7].
In clinical studies, there is no guarantee that the background assumptions embedded in the model are correct, and in fact many assumptions are often violated in practice. In the example mentioned above, the model assumes absence of all Cochrane biases [8], including selection bias, performance bias, detection bias, attrition bias, and reporting bias, as well as random confounding [9][19-22]. Also, the Mantel-Cox test used in the paper is based on the following assumptions [10] [23]: censoring is independent of the outcome, the survival probabilities do not vary with follow-up time, and the events occurred at specified times. Censoring due to deaths (about 3% in each group) and lack of blinding may violate some of these assumptions. Moreover, adherence to the allocated intervention was not perfect: some participants in both groups did not receive or complete the allocated intervention, so the analysis was intention-to-treat (ITT). The ITT approach does not invalidate the hypothesis testing, however [8].

S-value
To avoid misinterpretations of the P-value, we suggest transforming it to a quantity known as the Shannon information, surprisal, or self-information, called the S-value [3][4][5][6][11][12][13] (see Appendix 1): $S = -\log_2(P)$. With base 2 for the logarithm, the S-value is scaled in bits (binary digits) of information, where "bit" refers to the information capacity of a binary (0,1) digit. Thus the S-value is the number of bits of information in the data against the model, including background assumptions and the test hypothesis. Fig. 1 shows that the S-value rises increasingly steeply as the P-value goes to zero. In the limits, the S-value = 0 when the P-value = 1, which implies that the data provide no information against the model; even so, as with any large P-value, we cannot conclude that the model is correct. The S-value approaches infinity as the P-value approaches zero, which indicates that the data provide infinite information against the model, leading one to the more decisive conclusion that the model is incorrect.
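The transformation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the base-2 definition S = −log2(P); the function name `s_value` is hypothetical and not taken from the paper or its appendices.

```python
import math


def s_value(p: float) -> float:
    """Surprisal in bits, assuming S = -log2(P) for 0 < P <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("P-value must lie in the interval (0, 1]")
    # max() only guards against returning the float -0.0 when p == 1.
    return max(0.0, -math.log2(p))


# Tabulate a few P-values, including those discussed in the text.
for p in (1.0, 0.5, 0.073, 0.05, 0.001):
    print(f"P = {p:<6} ->  S = {s_value(p):.2f} bits")
```

Running the loop gives 0 bits at P = 1 and values that round to the 3.8 and 4.3 bits quoted in the text for P = 0.073 and P = 0.05.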
Unlike the P-value, the S-value has an intuitive interpretation in terms of a physical coin-tossing experiment. Suppose we are concerned about the fairness of a coin, so we toss it 4 times and the result turns out to be 4 heads. The P-value would be $(1/2)^4 = 0.0625$, and the S-value $-\log_2 (1/2)^4 = 4$ bits. Thus an S-value of 4 conveys the same evidence against the model as seeing all heads in 4 independent tosses of a coin conveys against the hypothesis that the coin is fair [3]. As an example, the S-value of 4.3 bits corresponding to an observed P-value of 0.05 is hardly more surprising than seeing all heads in 4 fair tosses. This shows that the common dichotomization of the P-value at 0.05 overstates the evidence against the model, as the amount of information that a P-value of 0.05 conveys is small [3,4]. Significance testing has been popular largely because of its simplicity, as it allows researchers and clinicians to make decisions based on the cutpoint of 0.05. In fact, more stringent cutpoints are used outside the health sciences. For example, the 5-sigma criterion for discovery in physics, as used for the Higgs boson, corresponds to a one-sided P-value of about 1 per 3.5 million, with a corresponding S-value of 21.7 bits [14]. Another advantage of the S-value is that log scaling makes information additive, e.g., two independent studies with the same test hypothesis, each yielding a P-value of 0.05, together provide 4.3 + 4.3 = 8.6 bits of information against the model. Finally, the S-value resolves some misconceptions about the P-value, as shown in Table 1 [3,15,16,17]. The reported P-value of 0.073 in the case study translates to an S-value of 3.8 bits, which is hardly less surprising than seeing all heads in 4 fair tosses. This S-value clearly suggests that it is unjustified to treat P-values of 0.073 and 0.05 differently, as the S-value, unlike the P-value, is a metric that does not involve any cutpoint.
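The coin-tossing benchmark, the additivity property, and the 5-sigma figure can all be checked numerically. The sketch below assumes S = −log2(P) and a standard Normal reference distribution for the 5-sigma criterion; the helper names are hypothetical and chosen only for illustration.

```python
import math


def s_value(p: float) -> float:
    """Surprisal in bits, assuming S = -log2(P)."""
    return -math.log2(p)


def p_all_heads(k: int) -> float:
    """Probability of k heads in k independent tosses of a fair coin."""
    return 0.5 ** k


def upper_tail_normal(z: float) -> float:
    """One-sided upper-tail probability of a standard Normal deviate."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Coin benchmark: four heads in four fair tosses.
p4 = p_all_heads(4)
s4 = s_value(p4)

# Additivity: information from two independent results with P = 0.05 each.
s_two_studies = s_value(0.05) + s_value(0.05)

# 5-sigma criterion, one-sided, under the Normal assumption.
p_5sigma = upper_tail_normal(5.0)
s_5sigma = s_value(p_5sigma)

print(f"4 heads:     P = {p4}, S = {s4} bits")
print(f"two studies: S = {s_two_studies:.1f} bits")
print(f"5 sigma:     P = 1 in {1.0 / p_5sigma:,.0f}, S = {s_5sigma:.1f} bits")
```

The sum of the two S-values equals −log2(0.05 × 0.05), which is why the log scale makes the information additive; the sketch is not meant as a formal method for pooling studies.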

Testing alternative hypotheses
Researchers tend to report P-values only for the null hypothesis, which often corresponds to no association between two variables in the population. However, they can and should test alternative hypotheses, especially those that correspond to minimal clinically important differences [18], and compare the compatibility of different parameter values with the data [3]. As an example, the P-value for an RR of 0.8 for the primary endpoint in our example is 0.27 (see Appendix 2 for the computations), which translates to an S-value of $-\log_2 0.27 = 1.9$ bits. Therefore, a 20% reduction in the rate of the primary endpoint of the study is more compatible with the data than a rate ratio of 1 (S-value = 3.8). Also, the paper reports an RR of 0.8 [95% CI: 0.60-1.07] with a P-value of 0.14 for the endpoint of new Q-wave myocardial infarction, with a corresponding S-value of $-\log_2 0.14 = 2.8$ bits. The authors concluded that "The frequency of … new Q-wave myocardial infarction … did not differ significantly between groups". However, we can verify that the P-value for an RR of 0.75 equals 0.66, with an S-value of $-\log_2 0.66 = 0.60$ bits. Thus, the information against an RR of 1 is 2.2 bits higher than that against an RR of 0.75, which undermines the conclusion of the paper.
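One way to reproduce P-values for alternative hypotheses from the published summary alone is a Wald-type approximation on the log rate-ratio scale. This is an assumption made here for illustration and is not necessarily the computation in Appendix 2: the standard error is backed out of the reported 95% CI, and the function name `wald_p_value` is hypothetical.

```python
import math

Z_975 = 1.959963984540054  # 97.5th percentile of the standard Normal


def s_value(p: float) -> float:
    """Surprisal in bits, assuming S = -log2(P)."""
    return max(0.0, -math.log2(p))


def wald_p_value(rr_hat: float, ci_low: float, ci_high: float,
                 rr_test: float) -> float:
    """Two-sided P-value for the hypothesis RR = rr_test.

    Assumes the log rate ratio is approximately Normal and that the
    reported interval is a 95% Wald interval on the log scale.
    """
    se_log_rr = (math.log(ci_high) - math.log(ci_low)) / (2.0 * Z_975)
    z = (math.log(rr_hat) - math.log(rr_test)) / se_log_rr
    return math.erfc(abs(z) / math.sqrt(2.0))


# Primary endpoint: RR 0.87 (95% CI 0.75-1.01).
p_primary_null = wald_p_value(0.87, 0.75, 1.01, 1.0)
p_primary_08 = wald_p_value(0.87, 0.75, 1.01, 0.8)

# New Q-wave myocardial infarction: RR 0.8 (95% CI 0.60-1.07).
p_mi_075 = wald_p_value(0.80, 0.60, 1.07, 0.75)

for label, p in [("primary, RR = 1", p_primary_null),
                 ("primary, RR = 0.8", p_primary_08),
                 ("Q-wave MI, RR = 0.75", p_mi_075)]:
    print(f"{label:<22} P = {p:.2f}, S = {s_value(p):.1f} bits")
```

Under these assumptions the sketch gives values that round to the P-values of 0.27 and 0.66 quoted above. For the null hypothesis it gives a value close to, but not identical with, the reported Mantel-Cox P-value of 0.073, as expected when working from rounded interval limits.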

Compatibility intervals
The 95% confidence interval is often interpreted as the range of values which include the parameter of interest with the probability of 95%. However, in the presence of biases, the background assumptions are not met (e.g., the assumptions of random sampling and randomization are violated in observational studies), and thus confidence intervals would be more accurately termed overconfidence intervals. We prefer to use the term compatibility intervals, with the following interpretation: the 95% confidence interval includes the range of values which are compatible with the data, that is, statistical testing of these values provides no more than 4.3 bits of information against them, assuming the background assumptions are correct. In our case study, statistical testing provides no more than 4.3 bits of information against the rate ratios in the range of 0.75-1.01 (there are 4.3 bits of information against the rate ratio limits of 0.75 and 1.01). Moreover, there is no information against a 13% decrease in the rate of the primary endpoint in the experimental group compared to the control group (RR = 0.87, P-value = 1, and S-value = 0).

Fig. 1. S-value vs. P-value.

Table 1
Some misinterpretations of P-values and their resolution using S-values.

| Misinterpretations of P-values | Clarification by S-values |
| --- | --- |
| P-value is the probability that the result is due to chance | S-value is not bounded to be between 0 and 1 so it is not confused with this probability |
| P-value is an error probability resembling the alpha level | S-value is not bounded to be between 0 and 1 so it is not confused with this probability |
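The compatibility-interval reading of the case-study result (RR = 0.87, 95% CI 0.75-1.01) can be illustrated by tabulating the S-value across hypothesized rate ratios. As a sketch only, this assumes a Wald-type approximation on the log scale with the standard error backed out of the reported interval; `s_against` is a hypothetical helper, not the authors' method.

```python
import math

Z_975 = 1.959963984540054  # 97.5th percentile of the standard Normal
RR_HAT, CI_LOW, CI_HIGH = 0.87, 0.75, 1.01  # reported primary-endpoint result

# Standard error of log(RR), assuming a 95% Wald interval on the log scale.
SE_LOG_RR = (math.log(CI_HIGH) - math.log(CI_LOW)) / (2.0 * Z_975)


def s_against(rr_test: float) -> float:
    """Bits of information against the hypothesis RR = rr_test."""
    z = (math.log(RR_HAT) - math.log(rr_test)) / SE_LOG_RR
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided P-value
    return max(0.0, -math.log2(p))          # max() avoids printing -0.0


# S is zero at the point estimate, about 4.3 bits at the interval limits,
# and larger for rate ratios outside the interval.
for rr in (0.70, 0.75, 0.80, 0.87, 0.95, 1.01, 1.10):
    print(f"RR = {rr:.2f}  ->  S = {s_against(rr):.1f} bits")
```

Because the published limits are rounded, the S-values at 0.75 and 1.01 come out close to, rather than exactly at, the 4.3 bits that correspond to P = 0.05.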

Conclusion
The P-value should be interpreted as an index of compatibility between the data and the model, including the test hypothesis and background assumptions. The confidence interval should be named the compatibility interval and interpreted as the range of values which are compatible with the data. The S-value represents the information in the data against the model, facilitating the compatibility interpretation. Moreover, it is not subject to many of the misinterpretations of the P-value, and should be used in practice along with the P-value and compatibility interval. This is especially the case when interpreting results of clinical studies.
Table 1 (continued)

| Misinterpretations of P-values | Clarification by S-values |
| --- | --- |
| Large P-values indicate test hypothesis is plausible and small P-values indicate test hypothesis is implausible | S-values provide refutational information against the model including both background assumptions and test hypothesis |
| A P-value <0.05 implies test hypothesis is false and a P-value >0.05 implies test hypothesis is correct | S-value has an intuitive interpretation based on observing all heads in fair coin tossing to gauge the evidence against the model without any reference to an arbitrary cutpoint. S-value shows that the amount of information in the P = 0.05 is small (only 4.3 bits) |
| Equal intervals in P-value represent equal changes in the evidence as measured by the SD change | Equal intervals in S-value represent equal changes in the evidence as measured by the information |

M.A. Mansournia et al.