Exact model comparisons in the plausibility framework

Plausibility is a formalization of exact tests for parametric models and generalizes procedures such as Fisher's exact test. The resulting tests are based on cumulative probabilities of the probability density function, evaluate consistency with a parametric family, and provide exact control of the $\alpha$ level in finite samples. Model comparisons, however, are inefficient in this approach. We generalize plausibility by incorporating weighting, which makes model comparisons possible. We show that one weighting scheme is asymptotically equivalent to the likelihood ratio test (LRT) and, unlike the LRT, guarantees the test size in finite samples under the null hypothesis. We confirm the theoretical properties in simulations that mimic the data set of our data application. We apply the method to a retinoblastoma data set and demonstrate a parent-of-origin effect. Weighted plausibility also has applications in high-dimensional data analysis, where it yields P-values for penalized regression models. We demonstrate superior performance compared to a data-splitting procedure in a simulation study and apply weighted plausibility to a high-dimensional gene-expression, case-control prostate cancer data set. We discuss the flexibility of the approach by relating weighted plausibility to targeted learning, the bootstrap, and sparsity selection.


Plausibility functions
We start this section by briefly reiterating important definitions and results from the plausibility framework. Results are taken from previous work unless stated otherwise [5]. We add an asymptotic result at the end of the section. We assume data Y to be sampled from a member of a parametric family of distributions P_θ, θ ∈ Θ. First, we define the statistic T as

T_{Y,θ} := exp(c(Y) − l(Y, θ)).

Here, l is a loss function, in the following taken to be the negative log-likelihood, and c(Y) is a normalizing term, usually taken to be c(Y) = l(Y, θ̂), where θ̂ is the maximum likelihood estimator (MLE). The normalizing term l(Y, θ̂) simplifies the development of the theory by guaranteeing that the statistic T_{Y,θ} has support [0, 1] for any P_θ, θ ∈ Θ. However, this is non-essential; we also consider c(Y) = 0 later. The plausibility function is defined as

pl_Y(A) := sup_{θ ∈ A} F_θ(T_{Y,θ}), A ⊂ Θ,

where F_θ is the distribution function of T_{Y,θ}. We call θ* = θ*(T_{Y,θ}) := arg sup_{θ ∈ A} F_θ(T_{Y,θ}) the plausibility estimate, when it exists. To shorten notation, we define the distribution function of pl_Y(A) as Pl_{Y,θ}(α) := Pl_θ(α) := P_θ(pl_Y(A) ≤ α). The reference to Y is omitted when it is clear which plausibility function is used. We also abbreviate pl_Y(θ) := pl_Y({θ}) for θ ∈ Θ.

Theorem 1 (Martin 2015). Let A ⊂ Θ. For any θ ∈ A, α ∈ [0, 1], and Y ∼ P_θ, pl_Y(A) is stochastically larger than uniform, i.e.

sup_{θ ∈ A} Pl_θ(α) ≤ α.

PLAUSIBLE MODEL COMPARISONS - SEPTEMBER 13, 2021

We now assume that Y = (Y_1, ..., Y_n) is an i.i.d. sample with Y_1 ∼ P_θ and denote the plausibility function and its CDF by pl_n and Pl_n, respectively, to indicate sample size. If l does not have discontinuities, pl_n is uniformly distributed on (0, 1) for all n, as shown previously. Otherwise, convergence holds in distribution as n → ∞ when the following uniqueness condition is met.

Definition 1. Likelihood L(y; θ) has unique point masses if and only if, for each point mass α_m, the set Y_m := {y ∈ R^n | pl_{y,θ}(A) = α_m} only contains observations that are identical up to ordering, i.e.

y, y′ ∈ Y_m ⇒ (y_(1), ..., y_(n)) = (y′_(1), ..., y′_(n)),

where x_(i) denotes the ith order statistic of vector x = (x_1, ..., x_n).
Lemma 1. If L has unique point masses, then under the other assumptions of the previous paragraph, pl_n(θ) converges weakly to the standard uniform distribution U(0, 1).
The proof is given in the appendix.
The restriction to unique point masses guarantees the uniqueness of the plausibility estimate. It is theoretically strong but usually not restrictive in practice, as the following example shows. To illustrate the problems that might occur, consider i.i.d. Bernoulli variables for which the likelihood is modelled as L(θ; Y) = ∏_{i=1}^N θ^{Y_i}(1 − θ)^{1−Y_i}. For θ = 0.5, every data set has the same probability, and therefore both θ = 0.5 and the MLE θ = θ̂ maximize the plausibility function for every data set. We call a value that maximizes the plausibility function for every data set a non-plausible value. The uniqueness condition is usually not fulfilled for discrete distributions. For example, the binomial distribution with parameter 0.5 would be non-unique due to symmetry around 0.5. Convergence to the uniform would still hold, as non-uniqueness would be restricted to a small subset of events (i.e. a pair, see appendix A.1). We do not try to optimally characterize conditions on the likelihood that guarantee unique estimates. Instead, we see lemma 1 as a guiding principle. For example, a solution for the Bernoulli example is to add the binomial coefficient to the likelihood, which guarantees unique estimates (up to symmetries). More details are given in the appendix (section A.1).
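To make the construction concrete, the plausibility function of a single binomial parameter can be evaluated exactly by enumerating the sample space. The following sketch is ours, not the authors' implementation; it uses the relative likelihood as the statistic T and includes the binomial coefficient, as suggested above:

```python
from math import comb

def binom_pmf(k, n, theta):
    # likelihood including the binomial coefficient, which makes
    # plausibility estimates unique up to symmetry (see text)
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def rel_lik(k, n, theta):
    # T_{y,theta}: likelihood relative to its maximum at the MLE k/n
    return binom_pmf(k, n, theta) / binom_pmf(k, n, k / n)

def plausibility(k_obs, n, theta):
    # pl_y({theta}) = F_theta(T_{y,theta}): probability, under theta, of a
    # relative likelihood no larger than the observed one
    t_obs = rel_lik(k_obs, n, theta)
    return sum(binom_pmf(k, n, theta)
               for k in range(n + 1) if rel_lik(k, n, theta) <= t_obs)

print(plausibility(2, 10, 0.2))  # maximal when theta equals the MLE
print(plausibility(2, 10, 0.8))  # small for an implausible theta
```

A value θ is then judged implausible when pl_y({θ}) falls below the chosen α level.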

Plausible model comparisons
We prepare model comparison by considering a real-valued, measurable function w that acts on realizations of Y. We assume w : Y → R to be free of θ. Recalling the definition of T, we can construct plausibility functions based on w(Y) by replacing the exponentiated loss function exp ∘ (−l) by w to get T^w_{y,θ} := w(y)/c_w(y). If normalization is desired, c_w(y) is taken to be c_w(y) = sup_y w(y), and 1 otherwise. The distribution function of T^w can then be written as F_θ(t) = P_θ({y | w(y) ≤ t}), which induces pl^w_y(A) = sup_{θ ∈ A} F_θ(T^w_{y,θ}) and Pl^w_θ(α) = P_θ(pl^w_y(A) < α). If w is bijective, a strict order is imposed on the events and the CDF is calculated under this ordering. This concept generalizes the standard or unweighted plausibility introduced above, as w(y) := pl_y(A) leads to pl^w_y(A) = pl_y(A) (see appendix A.2). Note that w is a fixed function independent of θ; it can be pre-computed for each potential realization y before seeing the data. This definition merely serves to show the generalization; it is not helpful for performing the computations in unweighted plausibility.

Lemma 2. With the notation from the previous paragraph, let θ ∈ Θ, A ⊂ Θ, and let w : Y → R be a test statistic that is free of θ. Then

sup_{θ ∈ A} Pl^w_θ(α) ≤ α.

Proof. By definition of Pl^w_θ and since pl^w_Y(A) ≥ F_θ(T^w_{Y,θ}) for θ ∈ A,

Pl^w_θ(α) = P_θ(pl^w_Y(A) < α) ≤ P_θ(F_θ(T^w_{Y,θ}) ≤ α).

As F_θ is the distribution function of T^w_{Y,θ}, P_θ(F_θ(T^w_{Y,θ}) ≤ α) ≤ α by definition. Taking the supremum over θ ∈ A completes the proof.
For the model comparison below, we take w to be an LR-type statistic and consider the test U := pl^w_Y(Θ_0) of a null hypothesis Θ_0 ⊂ Θ. For discrete distributions, U is a cumulative sum of probabilities. In these cases, the cumulative sum proceeds by summing event probabilities that are likely under the alternative but whose probabilities are evaluated under the null hypothesis. Intuitively, U will therefore be more likely to reject the null if the observation was indeed drawn under the alternative. To formally characterize U, we now show that it is asymptotically equivalent to the likelihood ratio (LR) test for the same model comparison. U can therefore be considered an exact version of the LR test. We base our argument on the comparison of the rejection regions of the LR and weighted plausibility tests.
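For a concrete discrete case, the following sketch (our illustration; the sign convention for w is ours, chosen so that alternative-like events receive small weights and are accumulated first) computes a weighted plausibility for a binomial null:

```python
from math import comb, log

def binom_pmf(k, n, theta):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def weighted_pl(k_obs, n, theta0):
    # LR-type weight: log-likelihood under the null minus log-likelihood at
    # the MLE k/n; events favored by the alternative get SMALL weights, so
    # F_{theta0}(w <= w_obs) accumulates exactly those events
    def w(k):
        return log(binom_pmf(k, n, theta0)) - log(binom_pmf(k, n, k / n))
    w_obs = w(k_obs)
    # cumulative sum of NULL probabilities over events ordered by the weight
    return sum(binom_pmf(k, n, theta0)
               for k in range(n + 1) if w(k) <= w_obs + 1e-12)

# 9 successes in 10 trials is unlikely under theta0 = 0.5
print(weighted_pl(9, 10, 0.5))  # 22/1024 ≈ 0.0215, rejects at alpha = 0.05
```

The probabilities being summed are always computed under the null model; only the ordering of events comes from the weight.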
We give the proof in the appendix. From this asymptotic equivalence, some properties of the LR procedure are inherited by the weighted plausibility test.

Corollary 1. In the one-parameter situation, i.e. dim(θ) = 1, the test U = pl^w_Y(Θ_0) is asymptotically efficient.
In the next section, we deal with the problem of constructing confidence regions for the unrestricted parameter θ ∈ Θ 1 .

Marginal Plausibility Functions
In the context of nested model comparisons, it is often possible to express the null and alternative hypotheses by splitting the parameter vector θ = (ψ, λ) ∈ Ψ × Λ = Θ and constraining ψ to a subset Θ^ψ_0 ⊂ Ψ under the null while leaving λ free; λ can be seen as a nuisance parameter. In this context, it is natural to consider the relative profile likelihood

T_{Y,ψ} := L(Y; (ψ, λ̂(ψ)))/L(Y; θ̂),

where λ̂(ψ) is the MLE of λ for fixed ψ, which allows us to define the so-called marginal plausibility function mpl_Y(ψ) := F_ψ(T_{Y,ψ}). In principle, it is possible to base inference on the plausibility region for ψ, {ψ : mpl_Y(ψ) > α}. However, in order to be exact, the distribution of T_{Y,ψ} has to be free of λ. This means that T_{Y,ψ} has to be an ancillary statistic for λ, which is a strong limitation in practice. The reason is that T_{Y,ψ} is evaluated at (ψ, λ̂(ψ)) instead of the true (ψ, λ(ψ)). To make coverage exact, the distribution of λ̂(ψ) would have to be known, which is difficult in practice.
To prepare an alternative approach, we first introduce the profile plausibility, which is constructed analogously to the profile likelihood. If θ is partitioned as above and A = {ψ} × A_λ for some ψ ∈ Ψ, we call ppl_Y(ψ) := sup_{λ ∈ A_λ} F_{(ψ,λ)}(T_{Y,(ψ,λ)}) the profile plausibility function of likelihood L_Y((ψ, λ)) for ψ.
We now use the weighted plausibility framework to construct an alternative marginal plausibility function. The weighting function is parametrized by ψ, and the statistic T therefore becomes a function of ψ and λ. We use the following weighting function:

w(ψ, y) = w_y(ψ) = l(y, (ψ, λ*)) − l(y, (ψ*, λ*)),

where (ψ*, λ*) is the plausibility estimate for data y. For T^w_{Y,ψ;λ} we use the weighted profile plausibility, which is defined as above by using T^w instead of T and is denoted by ppl^w_Y(ψ). To emphasize that w is independent of any data-generating parameter, we write (ψ°, λ°) for data-generating parameters and ψ, λ for function arguments.
We now use the profile plausibility function for ψ to construct a weighted marginal plausibility region.
This leads to the weighted marginal plausibility region

Π^w_α := {ψ ∈ Ψ : ppl^w_Y(ψ) > α}.

With the weighting function w defined as above, the following lemma holds.

Lemma 3. The coverage probability of the weighted marginal plausibility region is nominal for ψ, i.e. for α ∈ (0, 1), P_{(ψ,λ)}(ψ ∈ Π^w_α) ≥ 1 − α.

Proof. We give an informal argument here and refer to the appendix for the formal proof. For each fixed ψ, the weighted marginal plausibility can be interpreted as a weighted plausibility function and is therefore stochastically larger than uniform. The special form of the weighting function, which is the likelihood ratio evaluated at the plausibility estimates of the free parameters, guarantees that a value of ψ is selected into the plausibility region if the observed data belong to the probability-α events close to the plausibility estimate. This argument is analogous to the equivalence between constructing confidence intervals and testing point null hypotheses.

The normal model
Plausibility can be applied to high-dimensional data (N < p), i.e. data for which the number of predictors (p) exceeds the number of observations (N). We first give a motivation based on linear models and then introduce penalized models. For a linear model with data Y, we assume a fixed design with design matrix X, so that Y = Xβ + ε, with ε = (ε_1, ..., ε_N), ε_i i.i.d. N(0, σ²). We take the estimate θ̂ of (β, σ²) to be the ML estimate, where the variance estimate is bias-corrected. The plausibility estimate of the parameter vector θ = (β, σ²) ∈ A can then be found as θ* = arg sup_{(β,σ²) ∈ A} F_{(β,σ²)}(T_{Y,(β,σ²)}).

Remark 3. In the linear model above, the plausibility estimate does not exist for A = R^{p+1} × R_+.
The remark follows from F_{(β,σ²)}(T_{y,(β,σ²)}) → 1 as σ² → ∞ for any fixed β, which in turn is due to the fact that any tail probability of the normal distribution converges to 1 with increasing variance. We call such parameters non-plausible. One possible approach to non-plausible parameters is to plug in an estimate obtained by fixing the other parameters. For example, in the case of the linear model, the unbiased residual variance estimate σ̂²(β) can be used. We call a plausibility function based on such an estimate an ML profile-plausibility to indicate that part of a partitioned parameter vector is fixed at the ML estimate.

Lemma 4. In the linear model above, the profile-plausibility estimate for β using the unbiased variance estimator for σ² coincides with the ML estimate, i.e. β* = β̂.
The proof uses elementary calculations and is given in the appendix.

Remark 4. In the linear model above, the profile-plausibility function is degenerate.

The statement is due to the fact that each data set has likelihood (2πσ̂²)^{−n/2} exp(−(n−1)/2) when evaluated at θ̂. All potential data sets with different estimates have lower likelihood than the observed data. We note that, conditional on any estimate θ* = (β*, σ̂²), the data are uniformly distributed on the sphere σ̂ S^{N−1} + Xβ*. If the data are standardized first, the uniformity holds on S^{N−1} directly. In most situations, the scale of the variable is not of interest or even arbitrary. In these cases, it is justified to use the conditional distribution Y | σ̂² as the null distribution in the plausibility model, i.e. the density is that of a multivariate normal distribution with independent components and variance fixed at σ̂². If we call the corresponding weighted profile-plausibility function pl^{w,σ̂²}_Y(Θ_0), this observation motivates the following lemma, using notation from lemma 2. We add σ̂² to the superscript to emphasize that an ML profile-plausibility is used; otherwise it is a standard weighted plausibility function.
Lemma 5. With the notational conventions from lemma 2 and the paragraphs above,

sup_{θ ∈ Θ_0} Pl^{w,σ̂²}_θ(α) ≤ α.

Intuitively, the lemma follows from the fact that, conditional on the profile-plausibility estimate, the data are uniformly distributed. The proof is given in the appendix. Draws from this conditional distribution for any β can be generated as an i.i.d. sample from an arbitrary normal distribution which is then re-standardized. The same draw can be re-standardized to new parameter values, which is important when stochastic integration is used (see below).
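The sampling step just described can be sketched in a few lines (our sketch): draw from any normal distribution, re-standardize to mean 0 and unit bias-corrected variance, which places the draw on the sphere S^{N−1}, and re-standardize the same draw to new parameter values when needed.

```python
import math
import random

def sphere_draw(n, rng):
    # draw from an arbitrary normal, then re-standardize to mean 0 and
    # (bias-corrected) variance 1: the result lies on the sphere S^{n-1}
    z = [rng.gauss(3.0, 7.0) for _ in range(n)]  # any mu, sigma works
    m = sum(z) / n
    s = math.sqrt(sum((zi - m) ** 2 for zi in z) / (n - 1))
    return [(zi - m) / s for zi in z]

def rescale(y, mu, sigma):
    # re-standardize the same draw to new parameter values (mu, sigma),
    # so one sample can be reused during stochastic integration
    return [mu + sigma * yi for yi in y]

y = sphere_draw(10, random.Random(1))
```

Reusing one standardized draw across parameter values avoids adding fresh Monte Carlo noise at every evaluation point.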

Global test
To motivate a testing procedure, we first assume that Y has known variance σ². If we are interested in a global test without nuisance covariates, the null hypothesis of interest is β = 0 against β ≠ 0. If the alternative β_a ≠ 0 is known, the likelihood ratio (LR) statistic is

Λ = ∏_{i=1}^N φ(Y_i; x_i^T β_a, σ²)/φ(Y_i; 0, σ²),

where φ(·; μ, σ²) is the density of the normal distribution with parameters (μ, σ²). The test Λ > c is uniformly most powerful for appropriate c due to the Neyman-Pearson lemma. This property of the LR motivates the use of the following weighting statistic:

w(y) = l(y, β̂_0) − l(y, (β̂_{a0}, β̂_a)),   (3)

where X partitions into nuisance covariates and predictors, X = (X_0, X_a), β̂_0 denotes the MLE under the null, and (β̂_{a0}, β̂_a) is the MLE under the alternative, partitioned by the covariate components.
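For nested Gaussian models with known variance, an LR-type weight of this kind reduces to a scaled difference of residual sums of squares between the null and alternative fits. A minimal sketch (ours, for the simplified case of an intercept-only null against one added predictor):

```python
def rss(x, y, with_slope):
    # residual sum of squares for the intercept-only (null) or
    # intercept-plus-slope (alternative) least-squares fit
    n = len(y)
    my = sum(y) / n
    if not with_slope:
        return sum((yi - my) ** 2 for yi in y)
    mx = sum(x) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def lr_weight(x, y, sigma2=1.0):
    # log-LR between the alternative and the null fit; for Gaussian errors
    # with known sigma^2 this is (RSS_0 - RSS_a) / (2 sigma^2) >= 0
    return (rss(x, y, False) - rss(x, y, True)) / (2.0 * sigma2)

print(lr_weight([0, 1, 2, 3], [0.0, 1.0, 2.0, 3.0]))  # 2.5: perfect fit
```

Larger weights mark data sets for which the alternative explains markedly more variation than the null.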

High-dimensional data
We now consider the situation where N < p, otherwise keeping the linear model from the previous section. First, we investigate the problem of testing the global null hypothesis β = 0 for the model Y = Xβ + ε introduced above, where we assume Y to be centered to justify β = 0. As the problem is ill-posed, one solution is to use penalized regression for the estimation of β̂. As β̂ is then no longer an ML estimate, standard likelihood theory no longer applies and the distribution of β̂ has to be recovered by additional steps. One approach is data-splitting, as reviewed in the introduction; we will use one implementation of data-splitting as a comparison to our plausibility approach [9]. For our weighted plausibility approach, we use the same weighting function (3) as above, plugging in penalized estimates. We consider the lasso [10], elastic net [11], and ridge penalties [12] as implemented in glmnet [13]. Under certain conditions, the penalized estimates converge to the true parameter values in a limiting process where both N and p tend to infinity (see references given in [14]). In the finite-sample situation, the LRT statistic using penalized estimates needs to separate the models well to be a useful weighting function in the plausibility approach. Theoretical guarantees are difficult to attain. In this paper, we rely on simulations to investigate the properties of this approach.
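Our sketch does not reproduce the glmnet fits; to illustrate the sparsity mechanism at play, recall that for an orthonormal design the lasso has the well-known soft-thresholding closed form:

```python
def soft_threshold(b_ols, lam):
    # lasso solution for one coordinate of an orthonormal design:
    # shrink the OLS coefficient toward zero and clip at zero
    if b_ols > lam:
        return b_ols - lam
    if b_ols < -lam:
        return b_ols + lam
    return 0.0

# coefficients below the penalty are set exactly to zero (sparsity)
print([soft_threshold(b, 0.5) for b in (2.0, 0.3, -1.0)])  # [1.5, 0.0, -0.5]
```

This exact zeroing is what makes sparse penalties behave differently from ridge when plugged into the weighting function.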

Evaluation of the plausibility function
In almost all cases it is not possible to derive a closed-form expression for the plausibility function, and numerical evaluation is required. Discrete distributions may allow exact evaluation for small data sets. In general, stochastic integration is needed.
To evaluate the plausibility function for a point null hypothesis ϑ_0, random draws Y^(j) are generated from the null model, Y^(j) ∼ F_{ϑ_0}. The plausibility function is then approximated by

F̂_{ϑ_0}(T^w_{y,ϑ_0}) = (1/M) ∑_{j=1}^M 1(T^w_{Y^(j),ϑ_0} ≤ T^w_{y,ϑ_0})

for M approximation samples. In general, the supremum has to be taken over the full set Θ_0. In principle, a new sample can be drawn for each evaluation point θ ∈ Θ_0. However, this adds additional sampling variation, which can dominate the variation of the function close to the supremum. To avoid this problem, the samples (Y^(j))_j drawn from an initial θ_0 are reused as follows:

F̂_θ(T^w_{y,θ}) = (1/M) ∑_{j=1}^M 1(T^w_{Y^(j),θ} ≤ T^w_{y,θ}) (dP_θ/dP_{θ_0})(Y^(j)).

This can be seen as the evaluation of an integral using importance sampling: F_{θ_0} takes the role of the proposal distribution to evaluate F_θ. The approximation is justified by re-writing F_θ(t) as an expectation under the proposal, F_θ(t) = E_{θ_0}[1(T^w_{Y,θ} ≤ t) (dP_θ/dP_{θ_0})(Y)]. We use a grid search around the MLE to find the supremum; using the R function optim usually also works reliably. After finding the optimum, new samples can be drawn at the optimum and the procedure can be iterated. In practice, this turns out not to be necessary, and for the analyses and simulations in this paper the optimization was not iterated.
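The basic Monte Carlo step (without the importance-sampling reuse) can be sketched as follows; this is our toy example with a Gaussian location null and a hypothetical weight, here simply the sample mean:

```python
import random

def mc_plausibility(y_obs, theta0, M, rng, w=lambda y: sum(y) / len(y)):
    # draw M data sets from the null model P_{theta0} and estimate
    # F_{theta0}(w_obs) by the fraction of draws with w(Y) <= w(y_obs)
    n = len(y_obs)
    w_obs = w(y_obs)
    hits = sum(w([rng.gauss(theta0, 1.0) for _ in range(n)]) <= w_obs
               for _ in range(M))
    return hits / M

rng = random.Random(0)
print(mc_plausibility([0.1] * 10, 0.0, 1000, rng))  # close to P(mean <= 0.1)
```

The Monte Carlo error of this estimate scales with 1/√M, so M also bounds the smallest P-value that can be resolved.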
For the plausibility region, a grid search is performed to bracket the α level set of the plausibility function. In our implementation, this search has exponential running time in the number of parameters, as we consider a joint grid over all parameters centered around the MLE or plausibility estimate. The efficient construction of plausibility regions for many parameters needs to be addressed in future research.

Retinoblastoma
RB is a childhood tumor of the eye that follows a dominant inheritance pattern [6]. The disease has given rise to the so-called two-hit hypothesis: a tumor suppressor gene needs to acquire two mutations to inactivate both copies available on autosomal chromosomes. In familial cases, one variant copy is inherited and only a second mutation is necessary to initiate tumor formation. If the probability of this second hit is high, most individuals inheriting the first mutation will develop a tumor and the disease appears dominant, as expressed in the penetrance of the disease (disease probability given presence of the first mutation). In RB, families with reduced penetrance are known, and one question is whether characteristics of the first variant introduced by a mutation in the parents can explain this variation. As a second important question, the RB1 gene has been shown to be imprinted at least in some constitutional cells, meaning that only one parental copy is preferentially active in these cells. This can lead to allelic imbalance of expression in cells showing RB imprinting. However, this has not yet been shown for the putative precursor cells of retinoblastoma [15]. A statistical analysis can help clarify this question by analyzing the effect of parental origin on disease penetrance.

The Knudson model
Let Y_i ∈ {0, 1, 2} denote the number of affected eyes in individual i = 1, ..., N, Y = (Y_1, ..., Y_N). We assume that the number of tumors X_ij that individual i develops in eye j follows a Poisson distribution with mean λ, independently across eyes. We assume that the number of tumors is not known and only the presence of tumors is recorded, making Y_i the sum of two indicator variables, Y_i = 1(X_i1 > 0) + 1(X_i2 > 0). Each eye is therefore affected with probability p = 1 − e^{−λ}, and the likelihood of the sample is determined by the numbers N_U and N_B of unilaterally and bilaterally affected individuals, respectively. Here, λ is the average tumor count per eye. If measured per individual (as in the Knudson paper), we re-parametrize as λ_I = 2λ (called m in the Knudson paper) and obtain the per-eye disease probability p = 1 − e^{−λ_I/2}. To model covariates, we use a logistic model for p. For individual i, we define the disease probability p_i as

p_i = exp(x_i^T β)/(1 + exp(x_i^T β)),

where x_i is the covariate vector of individual i and β is the vector of regression coefficients. We assume x_{i1} = 1 for an intercept model. In our context, relevant covariates are family membership, as a proxy for variant type, and the parental origin of the variant allele. An example pedigree is shown in figure 1. Note that families are ascertained, i.e. at least one member is affected by RB. This fact can be modeled by an ascertainment correction in the likelihood. Founders, i.e. individuals without parents in the pedigree, are ignored, as they may have acquired the mutation and, if so, may carry it in a mosaic state, i.e. the variant allele would be present in only part of the cells of the body. In our notation, we assume that founders have already been removed, i.e. N represents the effective number of individuals in the pedigree.
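The two-hit outcome model above can be sketched in a few lines (our sketch): Poisson tumor counts per eye give the per-eye probability p = 1 − exp(−λ), covariates enter through the logistic model, and the outcome is the number of affected eyes.

```python
import math
import random

def p_eye(lam):
    # probability that an eye develops at least one tumor when tumor
    # counts are Poisson(lam): p = 1 - P(X = 0) = 1 - exp(-lam)
    return 1.0 - math.exp(-lam)

def logistic_p(x, beta):
    # covariate model for the per-eye probability, p_i = expit(x_i' beta)
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-eta))

def draw_outcome(p, rng):
    # Y_i in {0, 1, 2}: number of affected eyes, Binom(2, p)
    return sum(rng.random() < p for _ in range(2))

print(round(p_eye(math.log(2)), 3))        # 0.5
print(logistic_p([1.0, 1.0], [0.0, 0.0]))  # 0.5 for an all-zero beta
```

In the simulations described below, p would come from the logistic model per individual before the two eye indicators are drawn.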
In total, the following likelihood is used:

L(β; Y) = (∏_{i=1}^N P(Y_i | x_i; β)) / A(β),

where the ascertainment correction A(β) = 1 − P(Y = 0)^N = 1 − (∫_x P(Y = 0 | X = x) P(X = x) dx)^N represents the event that at least one individual is affected. If ascertainment is modeled, the covariate distribution needs to be modeled as well. In the following, we assume a random design, i.e. X is drawn from the underlying population.

Simulations
All families in our data set contain several affected members (e.g. figure 1). This implies that the ascertainment correction in formula (6) will be close to one, i.e. the probability of no affected family members will be small. As a consequence, we did not model ascertainment in the simulations and used parameter values that emulate this characteristic. This also makes it easier to compare with other standard tests, such as a χ² goodness-of-fit test, which does not account for ascertainment in its standard form.
For our simulations, we implemented the calculation of pl^w_y by exact computation, which was feasible for the data considered here. An alternative is stochastic integration as described above.
First, we regroup (6) into sets with identical covariate vectors, considering only discrete covariates. Setting c(y) to 0, F_β(T^w_{Y,β}) for a given covariate combination with n individuals and observed counts y = (y_0, ..., y_n) becomes a sum over ∆^n_N, where ∆^n_N is the standard (n − 1)-simplex scaled to n and restricted to N^n to represent all integer partitions of n (∆^n_N = n∆^n ∩ N^n). For several covariate combinations, the contributions multiply. For sample sizes up to N = 30, the plausibility function can still be efficiently evaluated exactly without resorting to stochastic integration.
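The exact evaluation enumerates the integer count vectors of the scaled simplex; a small sketch (ours) generating all nonnegative integer vectors with a fixed sum:

```python
def compositions(total, parts):
    # all nonnegative integer vectors of length `parts` summing to `total`;
    # these index the points of the scaled simplex used in the exact sum
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

# for 4 individuals and 3 outcome categories (0, 1, or 2 affected eyes)
# there are C(4 + 3 - 1, 3 - 1) = 15 count vectors
print(len(list(compositions(4, 3))))  # 15
```

The count grows combinatorially with the sample size, which is why exact evaluation is only practical for small N.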
Families were simulated by deterministically distributing the sample size across generations, adding a new founder per generation and drawing inheritance indicators for non-founders as i.i.d. Binom(1, 0.5). Parent-of-origin was added as an additional covariate and inferred from the simulated data. Finally, the outcome was drawn from the model specified above according to the effects considered in the simulation scenarios. The sample size was set to N = 8 for all simulations, with two families and two generations per family.
We compared the following procedures: (1) unweighted plausibility (Plausibility), (2) weighted plausibility with LR weighting (W-Plausibility LR), (3) parametric bootstrap (Bootstrap), (4) the Pearson goodness-of-fit statistic comparing expected binomial counts under the logistic model with observed counts (Chi-Square), (5) the likelihood-ratio test (LR), and finally (6) weighted plausibility with "relative" weights (W-Plausibility Rel). Relative plausibility is a weighted plausibility where the parameters for the LR weights are estimated from the data to be tested; it serves as an example where the weights are not free of θ. Unweighted plausibility evaluates goodness-of-fit to a model in which effects for the variables of interest are set to zero (e.g. family). For the simulations, 200 replications and 10^3 bootstrap samples (for procedure Bootstrap) were used. Figure 2 shows simulation results under the null hypothesis when an intercept model is compared to a model containing a family effect. Unweighted plausibility is conservative; weighted plausibility precisely exhausts the α-level up to a level of roughly 0.5, whereas relative plausibility is highly anti-conservative. Both the Pearson and the bootstrap tests perform well but show both conservative and anti-conservative deviations from the α-level. The LR test is similar to the bootstrap and Pearson tests, except that the deviations of size from the α-level are stronger.
Under the alternative, several scenarios with intercept values of 0.5 and 1 and family effects between 0.5 and 2 (on the log-OR scale) were evaluated at an α-level of 0.05 (figure 3). Relative plausibility performs best but has to be discounted due to its anti-conservative behavior under the null hypothesis. Otherwise, the LR test performs best, but very similarly to weighted plausibility; the difference is best explained by the slightly anti-conservative behavior of the LR test at the 0.05 level. The bootstrap and Pearson's test show power close to the 0.05 level and seem unable to cope with the small sample size. Unweighted plausibility has some power when the intercept is small or when the family effect is large (log-OR 2), but its power always stays below 15%. The simulations confirm that the LR test and weighted plausibility behave very similarly.

Table 1B: Parental origin (pat/mat) of the variant allele among mutation carriers by outcome Y (number of affected eyes).

Y | pat | mat
0 | 13 | 12
1 |  4 | 11
2 |  1 |  5

Table 2: Results of data analysis. Method: test used. F0: null model in the notation outcome ∼ covariates; fid: factor for family; poo: parent-of-origin. F1: model under the alternative. Intercept: estimated coefficient. Fid: range of coefficients for the families. Poo: coefficient for parent-of-origin. P: P-value.

Data analysis
We used data from a larger database on retinoblastoma collected from the literature. Initially, we selected the largest families, as they should be the most informative. To restrict the computational burden, the five smallest families of this subset were then selected. Outcome data were explicitly ignored when making this decision. The outcome distribution is shown in table 1A for complete families and for mutation carriers. Parent-of-origin (i.e. the sex of the transmitting parent) for mutation carriers is summarized in table 1B. From this table it is apparent that parental origin strongly influences tumor status.
To be able to compute plausibility statistics, stochastic integration was used, as complete enumeration of all possible events was infeasible. A grid search over all parameters was also not possible, as the grid grows exponentially with the number of parameters. The R function optim with the Nelder-Mead algorithm was used to find plausibility estimates. The LR statistic was computed by fitting nested models and using the R function anova with the χ² statistic. Results are shown in table 2. Heterogeneity between families could not be demonstrated (first half of the table). Notably, the P-value of standard plausibility is much larger than that of weighted plausibility. This reflects the fact that standard plausibility rejects against a wider class of alternatives, whereas weighted plausibility focuses power on a small class of alternatives. Analysis of the parent-of-origin (POO) effect shows statistically significant findings for weighted plausibility and the LR test; the P-value for standard plausibility is not significant. Plausibility estimates for family effects are close to, but not identical to, the ML estimates. Technically, the ML estimates are corrected for POO whereas the plausibility estimates are not, which explains part of the discrepancy apart from differences in methodology. In all cases, plausibility P-values are larger than LR-based P-values.

High-dimensional data
In this sub-section, we investigate finite sample properties of plausible model comparisons in the high-dimensional setting, using simulations with a normally distributed outcome. We also apply the global test constructed above to a well-known prostate cancer data set where a binary outcome is modeled with a logistic model.

Simulations
In order to evaluate the behavior of the compared tests, high-dimensional data were simulated. The sample size was chosen as either N = 200 or N = 500, and p = 500 covariates were simulated. Covariates were drawn in independent blocks of 10 with an exchangeable correlation structure of 0.1 (low) or 0.9 (high). Under the null, the outcome was independently drawn from a standard normal distribution. 5 × 10^3 replications were used to determine test size. For the stochastic integration, 10^3 samples were used, and the mixing parameter of the elastic net regression was set to α = 0.9. Data splits were used for lms. Figure 4 shows the test sizes. Ridge regression perfectly exhausts the α-level, whereas elastic net and lasso exhaust the α-level up to 0.75 for low correlation, above which the procedures become conservative. This is due to the sparsity induced by these methods. For high correlation, this behavior is slightly more pronounced and the conservative behavior starts at an α-level of roughly 0.6. Under the null, elastic net and lasso behave almost identically. lms has poor power in these scenarios.
In the sparse scenario, lasso and elastic net performed very similarly and were the most powerful procedures in all scenarios that were considered. lms could outperform ridge regression for the scenario of a single, strong effect and low correlation between covariates. In all other scenarios, lms was the least powerful procedure. In general, power was again poor for lms.
For a sample size of 100, plausibility-based methods had sufficient power in scenarios that matched the method (dense vs. sparse). For a sample size of 200, power was still below 80% in many scenarios.
In general, the correlation structure was very important, as power increases substantially when comparing low- and high-correlation scenarios; e.g. power for lasso with an effect size of 0.075, low correlation, and a sample size of 200 is ∼55% and increases to >80% under high correlation (dense scenario).

Data analysis
As an illustration, we analyze a prostate cancer data set [16] as provided by R package sda [17]. The data set contains healthy (N = 50) and prostate cancer samples (N = 52) and measurements of 6033 gene expression values. We analyze the data set using a logistic model and the penalties used in the simulation section as well as the lms method.
To get lms to run, the penalty parameter had to be increased from the default choice (λ = √N + p/5 instead of λ = √N + p/10). With this penalty, lms did not select any predictor and resulted in a P-value of 1. All plausibility models resulted in P-values < 10^-3, where the P-value was limited by the number of stochastic integration samples.
To illustrate the methods, figure 6 shows the regression coefficients from the penalized plausibility models evaluated at the median penalty parameter of the models generated during stochastic integration. The figure clearly reflects the different sparsities of the methods. While lasso and elastic net (α = 0.9) select few variables, elastic net with α = 0.1 selects more and ridge selects all variables. Effect sizes change correspondingly (large for sparse methods, small for dense methods). We discuss this finding later.

Discussion
In this paper, we have extended the plausibility framework by a weighting component. For discrete data, the weighting leads to a re-ordering of data sets, so that cumulative probabilities are no longer accumulated according to ordered probabilities but according to an (arbitrary) re-ordering induced by the weighting. Intuitively, it is clear that properties of plausibility carry over to weighted plausibility as long as the weighting does not depend on the data. Ordering by probability thus turns out to be just one possible ordering, corresponding to goodness-of-fit evaluations; comparing models corresponds to weighting by the LR. More precisely, a plausibility ratio should be considered, but this is computationally expensive. This aspect is discussed further below.
The flexibility of weighing is illustrated by our data analysis in the high-dimensional setting. Regression coefficients in this case strongly depend on the sparsity of the method. This fact is well known among practitioners and can lead to difficulties in model interpretation, as it is often unclear how to choose the sparsity. To the authors' knowledge, a theoretical underpinning is lacking. A weighted plausibility can be constructed for which the weighing function tunes the sparsity parameter. This would create a procedure reflecting the steps taken in practice while providing guarantees for the α level. Other weighing schemes could consider non-nested models by evaluating the data under one of the models and weighing data sets by a contrast between the models, such as prediction accuracies, cross-validated LRs, or information criteria. The possibilities are clearly endless.
On the other hand, the resulting tests have less direct interpretations compared to alternative approaches. In the high-dimensional data analysis, there are some important differences between the considered plausibility tests and the lms method. lms uses models coming from a conditional lasso model and can thereby reject a single covariate while controlling the family-wise error rate. The only test we considered was a global test based on the linear predictor of all covariates. Rejecting the null hypothesis would therefore not entail the rejection of any single covariate, and further steps are needed. One approach is to hierarchically split the covariates according to some outcome-independent procedure (say, hierarchical clustering) and test covariates along the tree. This can lead to efficient procedures, e.g. [18,19].

The plausibility framework does have some important limitations. We mention non-plausible parameters and non-plausible parameter values. Non-plausible parameters have to be handled by a different estimation procedure, which might involve iterated estimation between, say, plausibility and ML. In the normal model, this limitation is non-essential. Non-plausible parameter values seem to be rather a technical problem, although the problem manifested itself in simulations during the preparation of this manuscript. Some care therefore needs to be taken when deriving the likelihood to be used.

A major limitation is certainly computation time. While this limitation is shared with other methods such as cross-validation as used in targeted learning, bootstrapping, or data splitting (lms), the problem is usually more severe for plausibility. During stochastic integration, a full penalized analysis including parameter tuning through cross-validation has to be performed for every data set. In our analyses of high-dimensional data, we did not include covariates under the null, which made the evaluation of plausibility relatively cheap.
A single data set was analyzed in a matter of minutes. There are several ways to improve efficiency. Using importance sampling during stochastic integration seems to be a promising approach. We have also implemented some shortcuts, for example evaluating the penalty parameter first and using it throughout stochastic integration. Certainly, efficiency remains a challenge.
Plausibility can guarantee test sizes under the null for finite samples. A fully non-parametric treatment leads back to a non-parametric bootstrap procedure, as data would be drawn from the empirical distribution function (EDF). Every data set has maximal probability under its own EDF, and the plausibility approach would therefore not be useful. A compromise, say using kernel density estimates, could mitigate this problem. However, test sizes cannot be guaranteed under such an approach, as the density estimate would involve an estimation error. Further research is needed to explore non-parametric or semi-parametric models.
Targeted analysis defines so-called target parameters for which statistical inference is required. Typically, this would be the difference between a treatment effect and a counterfactual opposite treatment decision in the case of clinical studies [20]. Testing the target parameter corresponds to a model comparison of a null effect with a model including a treatment effect. Given that the target parameter can be defined using high-dimensional data, the model comparison is high-dimensional. The proposed way to evaluate the global test is the influence function using cross-validation [21]. Weighted plausibility is a natural alternative for evaluating the test statistic in targeted analyses. Such an analysis would be similar to the one used in section 3.3. Weighted plausibility is therefore an alternative way to perform statistical inference for the ensembles learned in targeted learning. Such an analysis could be beneficial in some cases, as the full data is used, in contrast to the cross-validation proposed for targeted learning. Empirical studies would have to show under which circumstances plausibility offers advantages or disadvantages.
In conclusion, the plausibility framework allows exact model comparisons in small-sample situations such as our RB data set. In these cases, legitimate concerns can arise about the validity of asymptotic or empirical P-values computed by other means, as confirmed by our simulations. The same principles can be applied in the high-dimensional setting, enabling statistical inference in these situations. The conceptual similarity to the bootstrap is reflected by the fact that any data-independent test statistic can be used as a weighing function, allowing further applications in high-dimensional data analysis, targeted analysis, or non-nested model comparisons.

Software
We provide an implementation for analyzing data using weighted plausibility. The use of this package is documented in appendix A.4.

A.1 Non-plausible parameters
If we take the likelihood for an i.i.d. Bernoulli experiment to be $L(\theta; Y) = \prod_{i=1}^N \theta^{Y_i} (1-\theta)^{1-Y_i}$, then $L(0.5; Y) = 0.5^N$. Therefore, $pl_Y([0,1]) = 1$ and the supremum is achieved at $\theta = 0.5$ and $\theta = \hat\theta$. For this likelihood, every data set would have a plausibility of 1 even if we exclude $\theta = 0.5$, so the non-plausible value of 0.5 would not be important; however, the example can easily be extended.
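This example can be checked numerically by exact enumeration. The sketch below assumes the normalization $c(Y) = 0$ mentioned in section 2, so that $T_{Y,\theta}$ is the negative log-likelihood itself:

```python
# Exact check that every Bernoulli data set has plausibility 1 at theta = 0.5
# (normalization c(Y) = 0, so T is the negative log-likelihood).
import itertools
from math import log, exp

N = 5  # small enough to enumerate all 2^N data sets exactly

def loglik(y, th):
    s = sum(y)
    return s * log(th) + (N - s) * log(1 - th)

def T(y, th):
    return -loglik(y, th)  # loss = negative log-likelihood, c(Y) = 0

def F(t, th):
    # CDF of T_{Y,th} under P_th, by exact enumeration of the sample space
    return sum(exp(loglik(y, th))
               for y in itertools.product([0, 1], repeat=N)
               if T(y, th) <= t + 1e-12)

# At th = 0.5 the likelihood is constant (0.5^N), so T_{y,0.5} is identical
# for every y and F_{0.5}(T_{y,0.5}) = 1 for all data sets; the supremum
# over theta is therefore 1, i.e. pl_y([0,1]) = 1.
worst = min(F(T(y, 0.5), 0.5) for y in itertools.product([0, 1], repeat=N))
```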
If we iterate the experiment and draw K groups of N i.i.d. Bernoulli variables, the non-plausible value starts to matter. The likelihood can be modeled as $L(\theta; Y) = \prod_{j=1}^K \prod_{i=1}^N \theta_j^{Y_{ij}} (1-\theta_j)^{1-Y_{ij}}$. Restricted to $A = \{\theta \mid \theta_j = \theta_1\ \forall j\}$, the likelihood reduces as above for $\theta_1 = 0.5$ and $pl_Y(A) = 1$ for all data sets. However, the plausibility restricted under $A$ is smaller than 1 in general if we exclude $\theta_1 = 0.5$ from $A$ and the $\theta_j$'s are truly different. If we now consider the nested experiment to come from a binomial distribution, where $Y_j$ is the number of outcomes 1 in each block, we get $L(\theta; Y) = \prod_{j=1}^K \binom{N}{Y_j} \theta_j^{Y_j} (1-\theta_j)^{N-Y_j}$ for, in general, non-exchangeable observations. The assumption requiring unique point masses is therefore too strong, as the plausibility function can still be readily evaluated. We hypothesize that it is sufficient to require unique point masses on a dense subset of the parameter space, although this is not straightforward to show.

A.2 Plausibility is a special case of weighted plausibility
To show that plausibility is a special case of weighted plausibility, we choose the weighing function as $w(y) = pl_y(A)$. For observation $y$, we then get $pl^w_y(A) = \sup_{\vartheta \in A} F_\vartheta(T^w_{y,\vartheta})$, and $F_\vartheta(T^w_{y,\vartheta})$ becomes
$$F_\vartheta(T^w_{y,\vartheta}) = P_\vartheta\big(pl_Y(A) \le pl_y(A)\big) = P_\vartheta\big(F_{\vartheta^*}(T_{Y,\vartheta^*}) \le F_{\vartheta_*}(T_{y,\vartheta_*})\big). \qquad (1)$$
Here, $\vartheta_*$ and $\vartheta^*$ are the plausibility estimates for $y$ and $Y$, respectively. The distribution function $F_{\vartheta_*}$ can be dropped in (1), as $F_{\vartheta_*}(T_{Y,\vartheta_*}) \le F_{\vartheta_*}(T_{y,\vartheta_*})$ iff $T_{Y,\vartheta_*} \le T_{y,\vartheta_*}$. The second inequality follows due to $F_{\vartheta_*}(T_{Y,\vartheta_*}) \le F_{\vartheta^*}(T_{Y,\vartheta^*})$. The first inequality is shown indirectly. First, we note that we evaluate the event $Z := \{Y \mid pl_Y(A) \le pl_y(A)\}$, which is independent of $\vartheta$, once under $P_\vartheta$ and once under $P_{\vartheta_*}$. Let us assume that there exists a $\vartheta$ so that $P_\vartheta(Z) > P_{\vartheta_*}(Z)$. Assuming that $T$ is either discrete or continuous (but not mixed), we can find an outcome $z := \arg\max_{z' \in Z} T_{z',\vartheta}$. But then, $pl_z(A) \ge F_\vartheta(T_{z,\vartheta}) > pl_y(A)$, as $\{Y \mid T_{Y,\vartheta} \le T_{z,\vartheta}\} \supset Z$. This is a contradiction to the definition of $Z$.
Supremizing over $\vartheta$ shows that $\sup_\vartheta F_\vartheta(pl_y(A)) \ge pl_y(A)$, as $\vartheta_*$ is among the $\vartheta$'s, which concludes the proof.
Note that we do not need the plausibility estimates $\vartheta_*$, $\vartheta^*$, or $z$ to be unique; they only need to achieve the supremum or maximum in the respective terms.

A.3.1 Proof of lemma 1
Proof. If $Pl_\theta(\alpha)$ is continuous in $\alpha$, i.e. $Pl_\theta(\alpha) = Pl_\theta(\alpha-)$, then $Pl_\theta(\alpha) = \alpha$ by the properties of the CDF. Otherwise, assume $\theta$ known and define $pl_{Y,\theta} := F_\theta(T_{Y,\theta})$ with $Pl_{Y,\theta}$ the corresponding CDF. Let $\Delta := \sup\{Pl_{1,\theta}(\alpha) - Pl_{1,\theta}(\alpha-)\}$ be the supremum over the discontinuities of $Pl_{1,\theta}(\alpha)$, which we assume to be bounded away from 1. Let $\alpha_m := \arg\sup\{Pl_{n,\theta}(\alpha) - Pl_{n,\theta}(\alpha-)\}$ and $Y_m := \{y \mid pl_{y,\theta}(A) = \alpha_m\}$. Then for each $y_m \in Y_m$, the point mass satisfies $P_\theta(\{y_m\}) \le \Delta^n$, as $T = T_n$ is derived from the product measure of $T_1$, and $P_\theta(Y_m) \le |Y_m| \Delta^n$ ($|\cdot|$ denotes cardinality). Due to the uniqueness assumption on point masses, each $y \in Y_m$ contains exchangeable observations, i.e. different vectors $y$ are identical up to ordering. The size $|Y_m|$ is given by a multinomial coefficient, which can be upper bounded by the binomial coefficient $\binom{n}{n/2}$. The maximal discontinuity for this case is achieved for class probabilities close to 0.5, and therefore $\Delta \approx 0.25$. By Stirling's approximation, $\binom{n}{n/2} \sim \sqrt{2/(\pi n)}\, 2^n$, so that $P_\theta(Y_m) \to 0$ at a rate of at least $\sqrt{n}$. This implies that $Pl_{n,\theta}(\alpha)$ converges pointwise to the CDF of the uniform distribution. Applying the portmanteau theorem completes the proof for known $\theta$.
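The decay of the point-mass bound can be checked numerically. This is a sketch of the quantity $\binom{n}{n/2}\Delta^n$ with $\Delta = 0.25$ as in the proof:

```python
# Numeric check: the point-mass bound |Y_m| * Delta^n, with
# |Y_m| <= C(n, n/2) and Delta ~ 0.25, vanishes as n grows,
# at least as fast as 1/sqrt(n) (in fact exponentially here).
from math import comb, sqrt

def point_mass_bound(n):
    return comb(n, n // 2) * 0.25 ** n

ns = [10, 20, 40, 80]
vals = [point_mass_bound(n) for n in ns]
```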
Now let $\theta^*_n$ be a sequence of plausibility estimates with $\theta^*_n \xrightarrow{P} \theta^*$ (see proof of theorem 3 in [5]), or equivalently, $pl_n(\theta) \to 0$ with $P_{\theta^*}$-probability for $\theta \ne \theta^*$. $\alpha_m(\theta)$ can therefore be bounded by a continuous function in an appropriately chosen neighborhood $U$ of $\theta^*$, which completes the proof.

A.3.2 Lemma 6
To prepare the proof of lemma 2, we start with a lemma about event sequences.
Lemma 6. Under the assumptions of section 2.2, let $E_1, ..., E_n$ and $E'_1, ..., E'_n$ be two sequences of events for which either $E_n \subset E'_n$ or $E'_n \subset E_n$. Let $D_n = E_n \,\triangle\, E'_n$ be the symmetric difference. If the probabilities of both sequences converge for some $\theta \in \Theta$, $\alpha := \lim_{n\to\infty} P_\theta(E_n)$, $\beta := \lim_{n\to\infty} P_\theta(E'_n)$, then $\alpha = \beta$ if and only if $P_\theta(D_n) \to 0$.
Otherwise, $E'_i \supset E_i$ infinitely often. By applying the above argument to the sub-sequence for which this inclusion holds, the same contradiction arises. "⇐": Assume $|\beta - \alpha| > \epsilon > 0$. If $E_n \supset E'_n$ for all $n > n_1$, then with $U_n := E_n \cup E'_n$ and $I_n := E_n \cap E'_n$, $P_\theta(E_n) = P_\theta(U_n) = P_\theta(I_n) + P_\theta(D_n) \xrightarrow{n\to\infty} \beta$, contradicting $P_\theta(E_n) \to \alpha$; the same contradiction arises for the sub-sequence for which $E'_i \supset E_i$.

A.3.3 Proof of theorem 2
Proof. $Y = (Y_1, ..., Y_n)$, $Y_i \overset{iid}{\sim} P^0_\theta$, $P_\theta = (P^0_\theta)^n$. The rejection region of the LR test is given by $R^{LR}_n = \{y \mid w(y) < c\}$, where $c$ is the appropriately transformed quantile of a $\chi^2$ distribution. For the rejection region of the plausibility test, we have $R^{pl}_n = \{y \mid w(y) < c'\}$, where $c'$ is chosen smallest so that $P_\theta(R^{pl}_n) \le \alpha$. By definition, we therefore have $R^{pl}_n \subset R^{LR}_n$ or $R^{LR}_n \subset R^{pl}_n$. Let $D_n$ be the symmetric difference between the rejection regions, i.e. $D_n = R^{pl}_n \,\triangle\, R^{LR}_n$. From standard likelihood theory, we have $P_\theta(R^{LR}_n) \to \alpha$. Using lemma 1, we also have $P_\theta(R^{pl}_n) \to \alpha$. Applying lemma 6 completes the proof.
For a given $\theta$, the rejection region of the LR test is composed of $y$'s for which the LR is large, i.e. $w(y)$ is small. As $w(y)$ is also used as the weighing function in the plausibility test, the rejection regions overlap and thereby fulfill the conditions of lemma 6.

A.3.4 Proof of lemma 3
Proof. For data-generating parameter $\theta^\bullet = (\psi^\bullet, \lambda^\bullet)$, it is to be shown that $P_{\theta^\bullet}(\Pi^w_Y(\alpha) \ni \psi^\bullet) \ge 1 - \alpha$. For the event $\Pi^w_Y(\alpha) \ni \psi^\bullet$ and some appropriately chosen constant $c_\alpha \in \mathbb{R}$, we have a chain of inclusions. The inclusion in the last step holds, as $l(\psi^\bullet, \lambda^*) \ge l(\psi^\bullet, \lambda^\bullet)$. $c_\alpha$ is chosen to fulfill the constraint imposed by $\alpha$. The final event $Y_{c_\alpha}$ is exactly the event for which the plausibility function for $\theta^\bullet$ has a value greater than or equal to $1 - \alpha$. As