Addressing researcher degrees of freedom through minP adjustment

When different researchers study the same research question using the same dataset they may obtain different and potentially even conflicting results. This is because there is often substantial flexibility in researchers’ analytical choices, an issue also referred to as “researcher degrees of freedom”. Combined with selective reporting of the smallest p-value or largest effect, researcher degrees of freedom may lead to an increased rate of false positive and overoptimistic results. In this paper, we address this issue by formalizing the multiplicity of analysis strategies as a multiple testing problem. As the test statistics of different analysis strategies are usually highly dependent, a naive approach such as the Bonferroni correction is inappropriate because it leads to an unacceptable loss of power. Instead, we propose using the “minP” adjustment method, which takes potential test dependencies into account and approximates the underlying null distribution of the minimal p-value through a permutation-based procedure. This procedure is known to achieve more power than simpler approaches while ensuring a weak control of the family-wise error rate. We illustrate our approach for addressing researcher degrees of freedom by applying it to a study on the impact of perioperative paO2 on post-operative complications after neurosurgery. A total of 48 analysis strategies are considered and adjusted using the minP procedure. This approach allows researchers to selectively report the result of the analysis strategy yielding the most convincing evidence while controlling the type 1 error, and thus the risk of publishing false positive results that may not be replicable.


Introduction
In recent years, the scientific community has become increasingly aware that there is a high analytical variability when analysing empirical data, i.e. there are plenty of sensible ways to analyse the same dataset for addressing a given research question, and they may yield (substantially) different results (Gelman and Loken, 2014; Silberzahn et al., 2018). If combined with selective reporting, this variability may lead to an increased rate of overoptimistic results, e.g. (depending on the context) false positive test results and inflation of effect sizes (Simmons et al., 2011; Wasserstein and Lazar, 2016; Ioannidis, 2005b), or, beyond the context of testing and effect estimation, to exaggerated measures of predictive performance (Boulesteix and Strobl, 2009) or clustering validity (Ullmann et al., 2023). Hoffmann et al. (2021) outline six sources of uncertainty that are omnipresent in empirical sciences and lead to variability of results in empirical research regardless of the considered discipline, namely sampling, measurement, model, parameter, data pre-processing, and method uncertainty. Failure to take these various uncertainties into account may lead to unstable, supposedly precise, but overoptimistic and thus potentially unreplicable results. Most importantly, model, parameter, data preprocessing and method uncertainties lead to the analytical variability mentioned above. In this context, Simmons et al. (2011) denote the flexibility researchers have regarding the different aspects of the analysis strategy as "researcher degrees of freedom".
While it is clear that selective reporting of the "most favorable results" out of a multitude of results is a questionable research practice that invalidates statistical inference, it is less clear how researchers should deal with their degrees of freedom in practice. In this study, we suggest tackling this issue from the perspective of multiple testing. More precisely, for analyses based on hypothesis testing we formalize researcher degrees of freedom as a multiple testing problem. We further propose to use an adjustment procedure to correct for the over-optimism resulting from the selection of the lowest p-value out of a variety of analysis strategies.
As the results of different analysis strategies addressing the same research question with the same data are usually highly dependent, a naive approach such as the Bonferroni correction is inappropriate: it would lead to an unacceptable loss of power. Instead, we propose resorting to the single-step "minP" adjustment method (Westfall et al., 1993; Westfall and Young, 1993) and discuss its use in this context. The power achieved by the minP procedure is typically larger than with simpler approaches while ensuring a weak control of the family-wise error rate. This is because the procedure is based on the distribution of the minimal p-value, which is affected by the level of correlation between the tests.
The minP procedure has the major advantage that its principle is relatively intuitive, as illustrated by the following example. In a comment on a study by Mathews et al. (2008) claiming that breakfast cereal intake before pregnancy is positively associated with the probability of conceiving a male fetus, Young et al. (2009) reinterpret the small p-value of 0.0034 obtained in the original article. They notice that Mathews et al. (2008) did not only analyse the association between fetal sex and the consumption of breakfast cereals, but also many other food items, a typical case of multiple testing. Based on the analysis of permuted data (i.e. data with randomly shuffled fetal sex status), Young et al. (2009) argue that "one would expect to see a p-value as small as 0.0034 approximately 28 percent of the time when nothing is going on". Implicitly, they apply the minP procedure for adjusting the smallest raw p-value of 0.0034 to 0.28 in this context where multiple tests are performed to investigate multiple food items. Our suggestion consists of translating this approach into the context of the analytical researcher degrees of freedom towards addressing the statistical factors of the replication crisis.
The minP procedure as used in the example by Young et al. (2009) and considered in this paper is based on an approximation of the null distribution of the minimal p-value through a permutation-based procedure. We note, however, that such a permutation-based procedure is not always possible, and that resorting to theoretical asymptotic results on the distribution of the minimal p-value (or maximal statistic) is more appropriate in some cases, as will be discussed later.
The goal of this paper can be seen as building bridges between two scientific communities. On the one hand, the metascientific community has long recognized that the replication crisis in science is partly related to multiplicity issues, but has to date neither formalized the issue in terms of multiple testing nor applied known adjustment procedures for reducing the occurrence of false positive results. On the other hand, the multiple testing community is increasingly developing theoretically founded general approaches to multiple testing taking into account the dependence of the tests; see Ristl et al. (2020) for a recent important milestone. These approaches are however not yet routinely used to adjust for researcher degrees of freedom in practice. The reasons are manifold. The lack of communication between the two communities and the methodological complexity of these methods certainly play an important role. Another reason is that these approaches, even if increasingly efficient and general, do not address all types of analyses but only regression models, and require assumptions regarding the data format that may not always be fulfilled in practice. In this context, the present paper aims to formalize and demonstrate the use of minP to adjust for researcher degrees of freedom in simple situations not only involving linear models, while hopefully creating a common basis fostering communication between the two communities towards the development (by statistical researchers) and routine use (by applied data analysts) of more complex approaches. This paper aims to establish an easy approach designed to prevent the detection of false-positive findings in the context of fishing expeditions.
The rest of this paper is structured as follows. Problems related to researcher degrees of freedom are outlined in more detail in Section 2, including potential approaches for handling them in practice that were proposed in the literature. As a motivating example, Section 3 presents a study on the impact of perioperative partial arterial pressure of oxygen (paO2) on post-operative complications after neurosurgery that uses routinely collected real-world data. Our suggested approach is described in Section 4, while Section 5 shows its results on the example dataset and Section 6 briefly discusses limitations of the approach and possible extensions.
2 Background: researcher degrees of freedom

Overview
When analysing biomedical data, researchers are often confronted with a number of decisions that may appear trivial at first view, but often have a considerable impact on study results. Which confounders should we adjust for? How should we handle missing values and outliers? Should we log-transform a continuous variable? What about categorical variables with categories that include no more than a handful of patients? Should these small categories be merged? Is a parametric or non-parametric test more appropriate? The term "researcher degrees of freedom" (Simmons et al., 2011) denotes, in a broad sense, this flexibility arising from the many analytical choices researchers face when analysing data in practice.
In most cases, neither theory nor precise practical guidance from the literature can reliably point researchers to the "best way" to analyse their data. Model selection techniques based, e.g., on the Akaike Information Criterion (AIC) and diagnostic tools (e.g., to assess whether a variable is normally distributed) may be helpful in some cases. However, they most often do not provide definitive clear-cut answers to all the arising questions. Furthermore, the choice of these techniques is itself affected by uncertainty: there usually exist several suitable variants of them. For example, should we prefer the AIC or the Bayesian Information Criterion (BIC) for model selection? Should we use a QQ-plot or apply a test (if yes, which one and at which level?) to assess normality of a variable? Combined with selective reporting, researcher degrees of freedom can lead to an increased rate of false positive results, inflation in effect sizes, and overoptimistic results (Ioannidis, 2005b; Simmons et al., 2011; Wasserstein and Lazar, 2016; Hoffmann et al., 2021). The terms "p-hacking" and "fishing for significance" have been used in the context of hypothesis testing to denote the selective reporting of the most significant results out of a multitude of results arising through the multiplicity of analysis strategies. The resulting optimism is however not limited to the context of hypothesis testing. "Fishing expeditions" (also termed "cherry-picking" or "data dredging") are common issues in all types of analyses beyond hypothesis testing (Ullmann et al., 2023).
The multiplicity of possible analysis strategies particularly affects studies involving electronic health records and administrative claims data, which currently raise hopes and promises of "real-world" evidence and personalized treatment regimes. With data that have not been primarily collected for research purposes, uncertainties related to the analysis strategies may indeed be even more pronounced compared to the analysis of classical observational research data. In the last few years, contradictory results have been published in this setting, which can be viewed as a consequence of the uncertainties in a broad sense. See for example the conflicting results by Fields et al. (2019), Childers and Maggard-Gibbons (2019, 2021) and Turner et al. (2019) on infectious complications associated with laparoscopic appendectomies and those by Jivanji et al. (2020) and Shah et al. (2021) on the association between cardiovascular disease and marijuana consumption. In both cases, different teams of researchers used the same data set to answer the same research question and found contradictory results which can be explained by seemingly trivial choices.

Partial solutions and related work
There are a number of approaches that have been proposed to deal with uncertainty regarding the analysis strategy and are preferable to the selective reporting of the preferred results.
A natural approach is to fix the analysis strategy in advance, i.e. prior to running the analyses, to avoid obtaining multiple results in the first place. For more transparency, this may be done within a publicly available pre-registration document (Nosek et al., 2018; Munafò et al., 2017; Hardwicke and Wagenmakers, 2023), thus preventing result-dependent selective reporting (Naudet et al., 2024). This type of pre-registration is the standard for clinical trials (Chan et al., 2013). However, even in the strictly regulated context of clinical trials, there is some controversy about the question whether statistical analysis plans of clinical trials are detailed enough (Greenberg et al., 2018) to prevent potential selective reporting. Fixing the analysis strategy in advance tends to be even more difficult for exploratory research questions and for complex data sets and research questions.
The opposite approach consists of transparently acknowledging uncertainty and reporting the variety of results obtained with the considered analysis strategies. This concept has been proposed in different variants in the last decade: it encompasses, e.g., the vibration of effect framework (Patel et al., 2015; Klau et al., 2023), multiverse analyses (Steegen et al., 2016) and the specification curve analysis (Rohrer et al., 2017; Simonsohn et al., 2020). With these approaches, the multiple reported results might be conflicting, sometimes yielding a confusing picture and a paper without a clear-cut take-home message. In other words, the pitfalls of selective reporting are avoided, but this comes at a high price in terms of interpretability and clarity.
Finally, let us mention the approach of conducting various analyses and selecting the preferred result but, instead of reporting it in a cherry-picking fashion, publishing it only if it can be qualitatively confirmed by running the exact same analysis on independent "validation" data (Daumer et al., 2008). This is the approach Ioannidis (2005a) indirectly recommends when claiming "Without highly specified a priori hypotheses, there are hundreds of ways to analyse the dullest dataset. Thus, no matter what my discovery eventually is, it should not be taken seriously, unless it can be shown that the same exact mode of analysis gets similar results in a different dataset." This approach, however, requires setting apart (or subsequently obtaining) a validation dataset of adequate size. This might not always be possible, and even in cases where it is possible, splitting the data may imply a substantial loss of power compared to the analyses that would have been performed using the totality of the data (Daumer et al., 2008).
In the context of analyses strongly affected by uncertainties where none of these simple approaches seems applicable, we suggest an alternative approach based on multiple testing correction. More specifically, we view researcher degrees of freedom from a multiple testing perspective and propose to apply correction for multiple testing to the preferred result to reduce the risk of type 1 error, as outlined in Sections 4.1 and 4.2.

Data
As a motivating example, we use a current research project on the effect of partial arterial pressure of oxygen (paO2) during craniotomy on post-operative complications among neurosurgical patients. This study is based on a routinely collected dataset from a Munich University Hospital preprocessed as described in Becker-Pennrich et al. (2022).
While the irreversible damage to the brain caused by reduced levels of oxygen in the blood (hypoxemia) has been the topic of extensive research, the potential harm caused by an increased amount of oxygen (hyperoxemia) is comparatively not well understood. The dangers of over-supplementation of oxygen during surgical procedures are still debated among anesthesiologists and a topic of current research (McIlroy et al., 2022; Weenink et al., 2020).
The dataset under consideration was extracted from routine clinical care data of n = 3,163 surgical procedures performed on lung-healthy neurosurgical patients. Vital data were measured at several timepoints during surgery for each surgical procedure. As outlined in Becker-Pennrich et al. (2022), measuring paO2 continuously is not feasible, in contrast to other vital parameters. To obtain a reliable assessment of hyperoxemia during the surgical procedure, the paO2 values thus have to be imputed using a surrogate model based on proxy variables that can be measured continuously using non-invasive techniques. Becker-Pennrich et al. (2022) suggest using machine learning methods for this purpose and identify random forest and regularized linear regression as well-performing candidates.
In this paper, we consider the assessment of the effect of paO2 on the binary outcome defined as the occurrence of post-operative complications after surgery. Even if we ignore model choice issues arising from the selection of a set of potential confounders, this analysis is characterized by a large number of uncertain choices. They are described in more detail in Section 3.2 along with the options considered in our illustrative study in Section 5.

Researcher degrees of freedom
In our study, we focus on the following choices, depicted in the form of a decision tree in Figure 3.2: (i) missing value imputation, (ii) surrogate model for the unobserved paO2-values, (iii) parameter choice approach, (iv) aggregation procedure, and (v) coding of the exposure variable paO2 and testing method. Uncertainty (ii) is discussed in more detail by Becker-Pennrich et al. (2022). In this study, we use the data preprocessed as described in Becker-Pennrich et al. (2022). Uncertainties (i) to (iv) can be seen as preprocessing uncertainty in the terminology of Hoffmann et al. (2021). For the missing value imputation (i), the two considered options are to either drop the missing values or impute them using multiple imputation in the 'mice' package (van Buuren and Groothuis-Oudshoorn, 2011). For surrogate modelling of the unobserved paO2-values (ii), we either use random forest or a regularized general linear model, either using the default parameter values or the parameter values obtained through tuning via random search using predefined tuning spaces (iii) as implemented in the 'mlr3' package (Lang et al., 2019).
After obtaining a prediction of unobserved paO2 values through surrogate modelling, for each surgery the paO2 measurements are aggregated to a single value over multiple measurements for a single patient: either the mean or the median (iv). Finally (v), we either consider paO2 as a continuous variable and use a logistic regression model to assess its effect on the binary outcome, we dichotomize it using the clinically meaningful cutoff value of 200 mmHg, or we categorize it into a three-category variable using the clinically meaningful cutoff values of 200 mmHg and 250 mmHg and use Fisher's exact test. The latter choice can be seen as referring both to preprocessing and method uncertainty, since the choice of the test is related to the transformation of the variable paO2.
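To make the size of this decision tree concrete, the following minimal Python sketch enumerates all combinations of the options listed above. The option labels are hypothetical names introduced for illustration only, and the sketch assumes (consistent with the total of 48 analysis strategies considered in this paper) that every option of one choice can be combined with every option of the others.

```python
from itertools import product

# Hypothetical labels for the options of the five choices (i) to (v).
choices = {
    "missing_values": ["drop", "multiple_imputation"],
    "surrogate_model": ["random_forest", "regularized_glm"],
    "parameters": ["default", "tuned"],
    "aggregation": ["mean", "median"],
    "exposure_coding": ["continuous", "dichotomized_200", "three_categories_200_250"],
}

# One analysis strategy = one option per choice (full factorial combination).
strategies = [dict(zip(choices, combo)) for combo in product(*choices.values())]

print(len(strategies))  # 2 * 2 * 2 * 2 * 3 = 48
```

Each element of `strategies` is a dictionary describing one path through the decision tree, which could then be passed to an analysis function returning a p-value.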

Researcher degrees of freedom as a multiple testing problem
In the remainder of this paper, we will focus on analyses that consist of statistical tests. We consider a researcher investigating a (possibly vaguely defined) research hypothesis such as "paO2 has an impact on post-operative complications", as opposed to the null- and alternative hypotheses of a formal statistical test, which are precisely formulated in mathematical terms. From now on, we assume that the research hypothesis the researcher wants to establish corresponds to the formal alternative hypothesis of the performed tests.
In this context, the term "analysis strategy" refers to all steps performed prior to applying the statistical test as well as to the features of the test itself. The following aspects can be seen as referring to preprocessing uncertainty in the terminology by Hoffmann et al. (2021): transformation of continuous variables, handling of outliers and missing values, or merging of categories. Aspects related to the test itself refer to model and method uncertainty in the terminology of Hoffmann et al. (2021). They include, for example, the statistical model underlying the test, the formal hypothesis under consideration, or the test (variant) used to test this null-hypothesis.
In the context of testing, an analysis strategy can be viewed as a combination of such choices. Different analysis strategies will likely yield different p-values and possibly different test decisions (reject the null-hypothesis or not). Applying different analysis strategies successively to address the same research question amounts to performing multiple tests. From now on, we denote by $m$ the number of analysis strategies considered by a researcher. The null-hypotheses tested through each of the $m$ analyses are denoted as $H_0^i$, $i = 1, \ldots, m$. These null-hypotheses and the associated alternative hypotheses can be seen as (possibly different) mathematical formalizations of the vaguely defined research hypothesis ("paO2 has an impact on post-operative complications" in our example). One may decide to formalize this research hypothesis as "$H_0$: the mean paO2 is equal in the groups with and without post-operative complications" versus "$H_1$: the mean paO2 is not equal in these two groups". But it would also be possible to formalize it as "$H_0$: the post-operative complication rates are equal for patients with paO2 < 200 mmHg and those with paO2 ≥ 200 mmHg" versus "$H_1$: the post-operative complication rates are not equal for patients with paO2 < 200 mmHg and those with paO2 ≥ 200 mmHg". Analysis strategies may thus differ in the exact definition of the considered null- and alternative hypotheses.
They may, however, also differ in other aspects, some of which were mentioned above (for example the handling of missing values or outliers). If two analysis strategies $i_1$ and $i_2$ (with $1 \le i_1 < i_2 \le m$) consider exactly the same null-hypothesis, we have $H_0^{i_1} = H_0^{i_2}$. Of course, it may also happen that the research hypothesis is not vaguely defined but already formulated mathematically as null- and alternative hypotheses, and that the $m$ analysis strategies thus only differ in other aspects such as the handling of missing values or outliers. In this case the $m$ null-hypotheses would all be identical. Regardless of whether the hypotheses $H_0^i$ ($i = 1, \ldots, m$) are (partly) distinct or all identical, a typical researcher who exploits the degrees of freedom by "fishing for significance" performs the $m$ testing analyses successively. They hope that at least one of them will yield a significant result, i.e. that the smallest p-value, denoted as $p_{(1)}$, is smaller than the significance level $\alpha$. If it is, they typically report it as convincing evidence in favor of their vaguely defined research hypothesis. It must be noted that in this hypothetical setting the researcher is not interested in identifying the "best" model or analysis strategy but only in reporting the lowest p-value that supports the hypothesis at hand. Considering this scenario from the perspective of multiple testing, it is clear that the probability to thereby make at least one type 1 error, denoted as Family Wise Error Rate (FWER), is possibly strongly inflated. In particular, even if all tested null-hypotheses are true, we have a probability greater than $\alpha$ that the smallest p-value $p_{(1)}$ is smaller than $\alpha$; this is precisely the result researchers engaged in fishing for significance will report. This problem can be seen as one of the explanations as to why the proportion of false positive test results among published results is substantially larger than the considered nominal significance level of the performed tests (Ioannidis, 2005b).
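As a rough benchmark for the size of this inflation, two idealized reference cases can be worked out. Both assume that all $m$ null-hypotheses are true and that each test has exact level $\alpha$; neither is claimed to describe the example data.

```latex
% Reference case 1: m mutually independent tests (idealized assumption)
\mathrm{FWER} = P\bigl(p_{(1)} < \alpha\bigr) = 1 - (1 - \alpha)^m,
\qquad \text{e.g. } 1 - 0.95^{48} \approx 0.915 \text{ for } m = 48,\ \alpha = 0.05.

% Reference case 2: perfectly dependent tests (all m p-values identical)
\mathrm{FWER} = P\bigl(p_{(1)} < \alpha\bigr) = \alpha = 0.05.
```

For positively dependent tests, as is typical when different analysis strategies are applied to the same data, the inflation would be expected to lie between these two values.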

Controlling the Family-Wise Error Rate (FWER)
Following the formalization of researcher degrees of freedom as a multiple testing situation, we now consider the problem of adjusting for multiple testing in order to control the FWER. More precisely, we want to control the probability $P(\text{Reject at least one true } H_0^i)$ to make at least one type 1 error when testing $H_0^1, \ldots, H_0^m$, i.e. the FWER. In fact, we primarily want to control the FWER in case all null-hypotheses are true. To see why, imagine a case where some of the null-hypotheses are false and there is at least one false positive result. On the one hand, if $p_{(1)}$ is not among the falsely significant p-values, the false positive test result(s) typically do(es) not affect the results ultimately reported by the researchers (who focus on $p_{(1)}$). This situation is not problematic. On the other hand, if $p_{(1)}$ is falsely significant, $H_0^{(1)}$ is wrongly rejected, and strictly speaking a false positive result ("$p_{(1)} < \alpha$") is reported. However, some of the $m - 1$ remaining null-hypotheses, which are closely related to $H_0^{(1)}$ (because they formalize the same vaguely defined research hypothesis), are false. Thus, rejecting $H_0^{(1)}$ is not fundamentally misleading in terms of the vaguely defined research hypothesis. As a result, in the context of the researcher degrees of freedom, false positives have to be avoided primarily in the case when all null-hypotheses are true.
In other words, we need to control the probability $P\bigl(\text{Reject at least one true } H_0^i \mid \bigcap_{i=1}^m H_0^i\bigr)$ to have at least one false positive result given that all null-hypotheses are true, i.e. we want to achieve a weak control of the FWER. Various adjustment procedures exist to achieve strong or weak control of the FWER; see Dudoit et al. (2003) for concise definitions of the most usual ones (including those mentioned in this section).
The most well-known and simple procedure is certainly the Bonferroni procedure. It achieves strong control of the FWER, i.e. it controls $P(\text{Reject at least one true } H_0^i)$ under any combination of true and false null-hypotheses. This procedure adjusts the significance level to $\tilde{\alpha} = \alpha/m$; or equivalently it adjusts the p-values $p_i$ ($i = 1, \ldots, m$) to $\tilde{p}_i = \min(m p_i, 1)$. However, the Bonferroni procedure is known to yield low power in rejecting false null-hypotheses in the case of strong dependence between the tests. The so-called Holm stepwise procedure, which is directly derived from the Bonferroni procedure, has better power. However, the Holm procedure adjusts the smallest p-value $p_{(1)}$ to exactly the same value as the Bonferroni procedure. This implies that, if none of the $m$ tests leads to rejection with the Bonferroni procedure, the same holds for the Holm procedure. The latter can thus not be seen as an improvement over Bonferroni in terms of power in our context, where the focus is on the smallest p-value $p_{(1)}$.
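The relation between the two procedures can be checked with a minimal Python sketch of the standard Bonferroni and Holm adjusted p-values. The function names and the four raw p-values are hypothetical and serve as an illustration only.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: min(m * p_i, 1)."""
    m = len(pvalues)
    return [min(m * p, 1.0) for p in pvalues]


def holm(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1);
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, min((m - rank) * pvalues[i], 1.0))
        adjusted[i] = running_max
    return adjusted


raw = [0.001, 0.01, 0.02, 0.04]  # hypothetical raw p-values
print(bonferroni(raw))
print(holm(raw))
```

The Holm-adjusted values are never larger than the Bonferroni-adjusted ones, but the smallest p-value is multiplied by m in both cases, which is why Holm brings no gain when only the smallest p-value is reported.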

The minP-procedure
The permutation-based minP adjustment procedure for multiple testing (Westfall et al., 1993) indirectly takes the dependence between tests into account by considering the distribution of the minimal p-value out of $p_1, \ldots, p_m$. This increases its power in situations with high dependencies between the tests, and thus makes it a suitable adjustment procedure to be applied in the present context. In the general case it controls the FWER only weakly, but as outlined above we do not view this as a drawback in the present context.
The rest of this section briefly describes the single-step minP adjustment procedure based on the review article by Dudoit et al. (2003). The following description is not specific to researcher degrees of freedom considered in this paper. However, for simplicity we further use the notations ($p_i$, $H_0^i$, for $i = 1, \ldots, m$) already introduced in Section 4.1 in this context.
In the single-step minP procedure, the adjusted p-values $\tilde{p}_i$, $i = 1, \ldots, m$, are defined as

$$\tilde{p}_i = P\Bigl(\min_{1 \le \ell \le m} P_\ell \le p_i \Bigm| \bigcap_{i=1}^m H_0^i\Bigr), \qquad (1)$$

with $P_\ell$ being the random variable for the unadjusted p-value for the $\ell$th null-hypothesis $H_0^\ell$ (Dudoit et al., 2003). The adjusted p-values are thus defined based on the distribution of the minimal p-value out of $p_1, \ldots, p_m$, hence the term "minP". In the context of the researcher degrees of freedom considered here, the focus is naturally on $\tilde{p}_{(1)} = P\bigl(\min_{1 \le \ell \le m} P_\ell \le p_{(1)} \bigm| \bigcap_{i=1}^m H_0^i\bigr)$. In many practical situations, including the one considered in this paper, the distribution of $\min_{1 \le \ell \le m} P_\ell$ is unknown. The probability in Eq. (1) thus has to be approximated using permuted versions of the data that mimic the global null-hypothesis $\bigcap_{i=1}^m H_0^i$. More precisely, the adjusted p-value $\tilde{p}_i$ is approximated as the proportion of permutations for which the minimal p-value is lower than or equal to the p-value $p_i$ observed in the original data set. The number of permutations has to be large for this proportion to be estimated precisely. In the example described in Section 3 involving only two variables (paO2 and post-operative complications), permuted data sets are simply obtained by randomly shuffling one of the variables. More complex cases will be discussed in Section 6.
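The permutation approximation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the analyses in this paper: the function names, the toy data, and the two simple z-tests (stand-ins for the logistic regression and Fisher's exact test of the example) are all hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def p_mean_difference(x, y):
    """Two-sided p-value (normal approximation) comparing mean x between y == 1 and y == 0."""
    g1 = [xi for xi, yi in zip(x, y) if yi == 1]
    g0 = [xi for xi, yi in zip(x, y) if yi == 0]
    if len(g1) < 2 or len(g0) < 2:
        return 1.0
    se = (variance(g1) / len(g1) + variance(g0) / len(g0)) ** 0.5
    if se == 0:
        return 1.0
    z = (mean(g1) - mean(g0)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def p_dichotomized(x, y, cutoff=200.0):
    """Two-sided two-proportion z-test of the outcome rate for x >= cutoff vs x < cutoff."""
    high = [yi for xi, yi in zip(x, y) if xi >= cutoff]
    low = [yi for xi, yi in zip(x, y) if xi < cutoff]
    if not high or not low:
        return 1.0
    pooled = (sum(high) + sum(low)) / (len(high) + len(low))
    se = (pooled * (1 - pooled) * (1 / len(high) + 1 / len(low))) ** 0.5
    if se == 0:
        return 1.0
    z = (sum(high) / len(high) - sum(low) / len(low)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def minp_adjust(x, y, strategies, n_perm=1000, seed=0):
    """Single-step minP: share of permutations whose minimal p-value is <= each observed p-value."""
    rng = random.Random(seed)
    observed = [strategy(x, y) for strategy in strategies]
    counts = [0] * len(strategies)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # shuffling the outcome mimics the global null-hypothesis
        min_p = min(strategy(x, y_perm) for strategy in strategies)
        for i, p_obs in enumerate(observed):
            counts[i] += min_p <= p_obs
    return observed, [c / n_perm for c in counts]


# Hypothetical toy data: exposure values and a binary outcome unrelated to them.
rng = random.Random(42)
x = [rng.gauss(200, 40) for _ in range(200)]
y = [int(rng.random() < 0.3) for _ in range(200)]
raw, adjusted = minp_adjust(x, y, [p_mean_difference, p_dichotomized], n_perm=200)
print(raw, adjusted)
```

Note that every strategy is applied to the same permuted outcome within one permutation; this is what lets the procedure capture the dependence between the tests.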

Study design
The study aims at illustrating the use and behavior of the minP-based approach when used to adjust for the multiplicity arising through researcher degrees of freedom. We use the original as well as permuted versions of the paO2 data set. The 48 specifications of the analysis strategy outlined in Section 3 are successively applied. P-values are either left unadjusted, or adjusted using the Bonferroni procedure, or adjusted using the recommended minP procedure with 1000 permutations. All analyses are performed for different sample sizes. Subsets of each considered size are randomly drawn from the original data set without replacement.
The study consists of two distinct parts. In the first part, we assess the family-wise error rate (FWER) for different sample sizes with the three approaches (no adjustment, Bonferroni adjustment, and minP adjustment). For this purpose, we generate data without association between the two variables of interest (paO2 and the outcome "post-operative complications") by using a paO2 covariate vector drawn without replacement from the true dataset but randomly generating the binary outcome variable from a binomial distribution (p = 0.5) to break the association between the outcome and paO2. This procedure is repeated 1000 times for every n ∈ {100, 200, 300, 500, 2000, 3000}. For each run, we calculate unadjusted, minP-adjusted, and Bonferroni-adjusted p-values as outlined above and check whether there is at least one false positive, i.e. whether at least one of the respective p-values of the 48 specifications is significant at the 5% level. The proportion of the 1000 runs for which this happens yields an estimate of the FWER of the three approaches.
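The logic of this first part can be sketched as follows. The sketch is a simplified stand-in under stated assumptions rather than the actual study code: the exposure is simulated instead of being drawn from the real data, the analysis strategies are reduced to five hypothetical dichotomization cutoffs tested with a two-proportion z-test, and only the unadjusted and Bonferroni-adjusted approaches are compared.

```python
import random
from statistics import NormalDist


def two_proportion_p(exposed, y):
    """Two-sided z-test p-value comparing the outcome rate between exposure groups."""
    n1 = sum(exposed)
    n0 = len(y) - n1
    if n1 == 0 or n0 == 0:
        return 1.0
    e1 = sum(yi for ei, yi in zip(exposed, y) if ei)
    e0 = sum(y) - e1
    pooled = (e1 + e0) / (n1 + n0)
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n0)) ** 0.5
    if se == 0:
        return 1.0
    z = (e1 / n1 - e0 / n0) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def estimate_fwer(n=200, n_runs=300, cutoffs=(180, 190, 200, 210, 220), alpha=0.05, seed=1):
    """Share of null data sets with at least one significant (raw / Bonferroni) p-value."""
    rng = random.Random(seed)
    m = len(cutoffs)
    hits_raw = hits_bonf = 0
    for _ in range(n_runs):
        x = [rng.gauss(200, 30) for _ in range(n)]       # hypothetical exposure values
        y = [int(rng.random() < 0.5) for _ in range(n)]  # outcome generated independently of x
        pvals = [two_proportion_p([xi >= c for xi in x], y) for c in cutoffs]
        hits_raw += min(pvals) < alpha
        hits_bonf += min(m * min(pvals), 1.0) < alpha
    return hits_raw / n_runs, hits_bonf / n_runs


fwer_raw, fwer_bonf = estimate_fwer()
print(fwer_raw, fwer_bonf)
```

Because a Bonferroni-adjusted p-value is never smaller than the raw one, the estimated FWER without adjustment is by construction at least as large as the Bonferroni estimate in every such simulation.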
In the second part, the original data set is analysed. Based on medical knowledge we expect a strong relationship between paO2 and the outcome to be present, but do not formally know the truth. For each of the three approaches (no adjustment, Bonferroni adjustment, and minP adjustment), we calculate the proportion of significant p-values at the 1%, 5% and 10% level among the 48 specifications. This was repeated 1000 times for each sample size n ∈ {50, 100, 150, 200, 250, 300}. As the association in our example study becomes highly significant for larger sample sizes and all p-values are then very close to zero, we only focus on these small sample sizes here. The code for reproducing the analyses can be found on GitHub.

Results
Figure 2 shows the estimated FWER for different sample sizes along with the Newcombe confidence intervals (Newcombe, 1998). In the absence of adjustment, false-positive results appear in at least one of the 48 specifications for about 70% of the data sets of size n = 100 and 76% of the data sets of size n = 3000, which aligns with the results of Simonsohn et al. (2020). If we adjust the p-values using the minP approach (green), the 5% level is maintained for all considered sample sizes. As expected, the Bonferroni adjustment (blue) is more conservative: its confidence intervals for the FWER do not include 0.05 and overlap with those of the minP procedure only for a sample size of n = 3000.

Figure 3 presents the proportion of significant p-values at the 1%, 5% and 10% level over the 48 specifications for the three approaches and different sample sizes. These proportions are averaged over 1000 runs. As we expect a highly significant association between the two variables of interest, we focus on small sample sizes only. The observed trend is not surprising: for all n ∈ {50, 100, 150, 200, 250, 300, 500} and all α ∈ {0.01, 0.05, 0.1}, the proportion of results significant at level α, averaged over the 1000 runs, is higher for the unadjusted p-values than for the adjusted p-values. Furthermore, the Bonferroni approach is more conservative than the minP adjustment.
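For readers who want to reproduce interval estimates of this kind, the following sketch computes a Wilson score interval for a proportion such as an estimated FWER. It rests on an assumption: the text does not state which of the intervals discussed by Newcombe (1998) was used, and the Wilson score interval without continuity correction is only one candidate. The function name `wilson_interval` is chosen for this example.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion.

    successes: number of runs with at least one false positive
    n:         total number of simulation runs
    z:         standard normal quantile (default corresponds to 95% coverage)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

With 1000 runs, an observed count of 50 runs containing a false positive yields an interval that covers the nominal 5% level, whereas a count of around 700 clearly does not.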

Discussion
In this work, we described a framework for performing valid statistical inference in the presence of researcher degrees of freedom through adjustment for multiple testing. Our results on simulated data and in an application concerning paO2 and post-operative complications suggest that the minP procedure is appropriate for this purpose.
The use of permutation-based procedures has already been recommended by Simonsohn et al. (2020) to address researcher degrees of freedom. There are, however, fundamental differences between this approach and ours. Simonsohn et al. (2020) address the problem of researcher degrees of freedom by specifying all plausible specifications (analysis strategies in our terminology) and ultimately evaluating the joint distribution of the estimated effects of interest across these model specifications. This evaluation is done graphically through the so-called specification curve, but also through a permutation test addressing null-hypotheses such as "the median effect across the specifications is zero". This approach, while interesting and at first view similar to ours, differs in several respects. Firstly, permutations are used by Simonsohn et al. (2020) as part of a permutation-based test and not within a multiple testing adjustment procedure. Our suggestion is precisely to formalize the multiplicity of analysis strategies as a multiple testing problem and to benefit from various methodological results obtained in the field, for example on the weak control of the FWER through the minP procedure. That said, minP adjustment can be viewed as a simple permutation test for the test statistic "minimal p-value", hence the apparent similarity with the permutation test for the median effect.

[Figure: results for sample sizes n ∈ {50, 100, 150, 200, 250, 300, 500}. Line colors indicate results based on unadjusted (red), minP-adjusted (green) and Bonferroni-adjusted (blue) p-values.]
Secondly, and more importantly, the focus on the median effect makes the procedure by Simonsohn et al. (2020) sensitive to misspecifications that do not model the data properly and thus fail to show an effect even if there is one. Imagine a fictitious example where one runs 99 fully inappropriate analyses yielding non-significant results and one meaningful analysis that identifies a highly significant (truly existing) effect. The true median effect is zero, and the permutation test by Simonsohn et al. (2020) will certainly not reject the null. In contrast, with our approach the truly existing effect is likely to be detected by the meaningful analysis. This is because the minP procedure focuses on the minimal p-value, which is very small in this fictitious example. This focus on the minimal p-value better accounts for the fact that, in practice, one would often include some analysis strategies that are in fact inappropriate to detect the effect of interest. It also better reflects the common p-hacking practice that consists of selecting and reporting the smallest p-value. However, our approach raises a number of questions that may be addressed in future research.
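The contrast between a median-based and a minimum-based permutation test in the fictitious example can be mimicked with a toy simulation. The sketch below is not the procedure of Simonsohn et al. (2020) nor the minP implementation of this paper; it is an analogue under simplifying assumptions: each "analysis" is the absolute Pearson correlation between the outcome and one candidate feature, only the first feature carries signal, 20 instead of 100 analyses are used to keep the run time short, and the maximal absolute correlation stands in for the minimal p-value (the two are equivalent here only because all analyses use the same test and sample size). All names are hypothetical.

```python
import math
import random
import statistics


def corr(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def permutation_pvalues(features, y, n_perm=200, rng=None):
    """Permutation p-values for the median and for the maximal absolute effect."""
    if rng is None:
        rng = random.Random(0)

    def summaries(y_vals):
        effects = [abs(corr(f, y_vals)) for f in features]
        return statistics.median(effects), max(effects)

    obs_median, obs_max = summaries(y)
    y_perm = list(y)
    ge_median = ge_max = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # permuting the outcome breaks every association
        perm_median, perm_max = summaries(y_perm)
        ge_median += perm_median >= obs_median
        ge_max += perm_max >= obs_max
    return (1 + ge_median) / (n_perm + 1), (1 + ge_max) / (n_perm + 1)


rng = random.Random(1)
n = 80
signal = [rng.gauss(0, 1) for _ in range(n)]
y = [s + 0.2 * rng.gauss(0, 1) for s in signal]        # outcome driven by `signal`
noise_features = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(19)]
features = [signal] + noise_features                    # 1 meaningful, 19 uninformative

p_median, p_max = permutation_pvalues(features, y, n_perm=200, rng=rng)
print(f"median-based p-value: {p_median:.3f}, max-based p-value: {p_max:.3f}")
```

In such a setting the max-based test rejects because a single strong effect dominates the maximum, while the median is determined by the uninformative analyses and carries essentially no information about that effect.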
Firstly, specifying an appropriate permutation procedure that takes the data and the specificity of the research question into account is not always easy, or even possible. Let us consider the following example: the null-hypothesis of interest is that the means of a variable are equal in two groups, while the variances may be different in the two groups. By permuting the group labels, one also inevitably enforces equality of the variances, which is a stronger assumption than the null-hypothesis of interest (Dudoit et al., 2003). Defining a permutation scheme that reflects the global null-hypothesis $\bigcap_{i=1}^{m} H_0^{i}$ may also be intricate in the case of multivariable regression models involving confounders in addition to the exposure of interest whose effect on the dependent variable is to be investigated. On the one hand, permuting only the exposure of interest will destroy the association between this exposure and the confounders. On the other hand, permuting the outcome will destroy not only the association between the exposure variable and the dependent variable, but also the association between the confounders and the outcome. In principle, neither of these simple permutation procedures is suitable: both enforce more than the considered null-hypothesis of no effect of the exposure on the outcome. Complex alternative permutation procedures may be preferred (Berrett et al., 2020). Alternatively, if all analysis strategies are based on marginal generalized estimating equation models, one may resort to asymptotic results on the distribution of the maximally selected statistic to derive adjusted p-values, thus avoiding time-consuming and methodologically complex permutation procedures; see for example Ristl et al. (2020). Even though this approach is extremely powerful in most cases, it comes at the cost of some assumptions that are not applicable in our case (restrictions regarding the input data and focus on parametric tests).
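The problem with the two simple permutation schemes in the presence of confounders can be made concrete with a small simulation. The sketch below uses invented data (a single confounder that drives both the exposure and the outcome, with no direct effect of the exposure on the outcome); variable names and effect sizes are assumptions made purely for illustration.

```python
import math
import random
import statistics


def corr(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def permuted(values, rng):
    """Return a randomly shuffled copy of `values`."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(42)
n = 500
confounder = [rng.gauss(0, 1) for _ in range(n)]
exposure = [c + 0.5 * rng.gauss(0, 1) for c in confounder]  # exposure depends on confounder
outcome = [c + rng.gauss(0, 1) for c in confounder]         # null holds: no exposure effect

exposure_perm = permuted(exposure, rng)  # scheme 1: permute the exposure only
outcome_perm = permuted(outcome, rng)    # scheme 2: permute the outcome

print("confounder-exposure, original:         ", round(corr(confounder, exposure), 2))
print("confounder-exposure, exposure permuted:", round(corr(confounder, exposure_perm), 2))
print("confounder-outcome,  original:         ", round(corr(confounder, outcome), 2))
print("confounder-outcome,  outcome permuted: ", round(corr(confounder, outcome_perm), 2))
```

Both schemes leave the null-hypothesis of no exposure effect intact, but each also removes a dependence that is present in the real data, so the permuted data sets are drawn from a more restrictive model than the null-hypothesis of interest.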
Secondly, it would be interesting to investigate the behavior of our suggested approach compared to the validation approach mentioned in Section 2.2, which consists of splitting the data into two parts, applying all candidate analysis strategies to the first part, and validating the preferred result by applying the analysis strategy that was used to obtain it to the second part of the data. Both this splitting procedure and the adjustment for multiple testing suggested in this paper imply a loss of power compared to the unadjusted analysis one would perform with the selected analysis strategy on the whole dataset. Researchers may prefer to run analyses on the whole dataset without arbitrary splitting, which is a clear argument in favor of our adjustment approach. However, the concept of validation using independent data may also seem attractive. Preference for one or the other approach is a matter of perspective, but a comparison of the power achieved by the two approaches may yield a decisive argument in favor of one of them.
Finally, note that our paper should not be understood as a plea for the use of p-values in general. We merely claim that, if statistical testing is used and several analysis variants are performed, it certainly makes sense to adjust for multiplicity before interpreting these p-values. Our approach makes it possible to selectively report the results of the analysis strategy yielding the most convincing evidence while controlling the type 1 error, and thus the risk of publishing false positive results that may not be replicable. In future research, our approach could in principle be extended beyond the context of hypothesis testing.

Figure 2: FWER with Newcombe confidence intervals (computed over 1000 simulation runs) for different sample sizes without an association between post-operative complications and paO2. The dashed red line indicates the 5% significance level.
resulting from the different surrogate modelling strategies. Overview of the different researcher degrees of freedom. All in all, 48 specifications were analyzed. Green depicts the data pre-processing decisions while brown depicts the method choices.