Multiple testing of interval composite null hypotheses using randomized p-values

One class of statistical hypothesis testing procedures is equivalence tests, whose main objective is to establish practical equivalence rather than the usual statistically significant difference. These tests are common in bioequivalence studies, where one would wish to show that, for example, an existing drug and a new one under development have the same therapeutic effect. In this article, we consider a two-stage randomized (RAND2) p-value utilizing the uniformly most powerful (UMP) p-value in the first stage when multiple two-one-sided hypotheses are of interest. We investigate the behavior of the distribution functions of the two p-values when there are changes in the boundaries of the null or alternative hypothesis or when the chosen parameters are too close to these boundaries. We also consider how the power functions respond to an increase in sample size. Specifically, we investigate how the level of conservativity depends on the sample size, to see whether we control the type I error rate when using either of the two p-values for any sample size. In multiple tests, we evaluate the performance of the two p-values in estimating the proportion of true null hypotheses. We control the family-wise error rate using an adaptive Bonferroni procedure with a plug-in estimator to account for the multiplicity that arises from the multiple hypotheses under consideration. We verify the various claims in this research using a simulation study and a real-world data analysis.


Introduction
Equivalence tests are testing procedures for establishing practical equivalence rather than the usual statistically significant difference. Within the frequentist framework, this test uses the fact that failing to reject a given null hypothesis of no difference is not equivalent to accepting the null hypothesis. Equivalence studies are common in the medical field, for example, where one would wish to show that an existing drug and a new one under development have the same therapeutic effect. We refer to such studies as "bioequivalence studies" and classify them into three categories according to the distance measure between two populations. The categories are individual, population, and average equivalence. Another common area of application is in genetics, where they can be used to identify non-DE (differentially expressed) genes (cf. Qiu and Cui (2010)) or to test for Hardy-Weinberg equilibrium (HWE) in the case of multiple alleles as in Ostrovski (2020). Other areas of application of equivalence tests include the comparison of similarity between two Kaplan-Meier curves, which estimate the survival functions in two populations. See Sect. 1.3 of Wellek (2010) for an in-depth discussion of these applications. We can state equivalence as a difference or ratio between two means. Rejecting the null hypothesis is the same as declaring an equivalence. This rejection is similar to the interval under the alternative hypothesis containing a zero (for the difference in means hypothesis) or a one (for the ratio of means hypothesis). Some studies on equivalence testing include Romano (2005), which provides bounds for the asymptotic power of equivalence tests and constructs efficient tests that attain those bounds. The same author also gives an asymptotically UMP test based on Le Cam's notion of convergence of experiments for testing the mean of a multivariate normal. Equivalence tests can also use intersection-union tests (cf. Berger and Hsu (1996)) since the null is a union of several null hypotheses, and the alternative is an intersection of many rejection regions. Berger and Hsu (1996) consider this intersection-union test for the simultaneous assessment of equivalence on multiple endpoints. This test requires that all the $(1 - 2\alpha)100\%$ simultaneous intervals fall within the equivalence bounds for an overall $\alpha$ level test. This approach can be conservative depending on the correlation structure among the endpoints and the study power. Another popular approach to equivalence testing is the Two One-Sided Test (TOST) procedure. We can use this procedure as an alternative to the goodness-of-fit tests. However, since it is sensitive to the noise level in the data, it can have low power for data sets with a high variance. Alternatively, we can use a distance measure such as the Euclidean distance between two probability vectors. Munk (1996) considered equivalence tests for Lehmann's alternative, which are unbiased for equal sample sizes within the two groups. An extension of the expected p-value of a test (EPV) to univariate equivalence tests was considered by Pflüger and Hothorn (2002). Since this procedure is independent of the distribution of the test statistic under the null hypothesis, it avoids the problem of looking for this distribution for the test statistic. Furthermore, the EPV is independent of the nominal level $\alpha$.
Equivalence tests are univariate, and we can apply them to each characteristic of interest without a multiplicity adjustment. However, the probability of making false claims of equivalence (type I errors) increases when we analyze multiple characteristics without a multiplicity adjustment. Leday et al (2023) proposed a familywise error rate (FWER) control based on Hochberg's method. The same authors also showed that Hommel's method performs as well as Hochberg's and that an "adaptive" version of Bonferroni's method outperforms Hommel's in terms of power for equivalence testing. Giani and Finner (1991) and Giani and Straßburger (1994), on the other hand, considered simultaneous equivalence tests in the $k$-sample case and proposed tests based on the range statistic. Qiu and Cui (2010) and Qiu et al (2014) consider multiple equivalence tests based on the average equivalence criterion to identify non-DE genes. Both articles investigate the power and false discovery rate (FDR) of the TOST. Since the variance estimator in the TOST procedure can become unstable and lead to low power for small sample sizes, the latter article proposes a shrinkage variance estimator to improve the power. Huang et al (2006) also applied an average equivalence test criterion but adjusted for the multiplicity using the simultaneous confidence interval approach.
Multiple test procedures that utilize p-values are valid only if the p-value statistics are uniformly distributed on (0, 1) under the null hypothesis. Since we use the p-values many times, any non-uniformity in their distribution quickly accumulates and reduces the power of the overall procedure. We can decompose the equivalence hypothesis into two one-sided hypotheses, each leading to a composite null hypothesis. The p-values from such a hypothesis can fail to follow the uniform distribution if we do not compute them under the least favorable parameter configurations (LFCs). Furthermore, we can have categorical data, for example, in genetic association studies that generate discrete data in counts, leading to test statistics with discrete distributions. Since the p-value is a deterministic transformation of the test statistic, this leads to discretely distributed p-values that are also nonuniform under the null hypothesis.
The problems of composite nulls and discrete test statistics can lead to conservative p-values, which implies that the p-value is stochastically larger than the UNI(0, 1) distribution under the null hypothesis. To our knowledge, no research has previously considered two-stage randomized p-values in testing equivalence hypotheses. In this article, we propose a two-stage randomized p-value for multiple testing of equivalence hypotheses to address these two issues. The two-stage procedure uses the UMP p-value in the first stage to remove the discreteness of the test statistic. The randomized p-value proposed in Hoang and Dickhaus (2022) for a continuous test statistic is then used in the second stage to deal with the composite null hypothesis.
When utilizing the non-randomized version of the Two One-Sided Test (TOST) UMP p-value in discrete models, Finner and Strassburger (2001) showed that it is possible for the power function based on a sample of size $n$ to coincide on the entire parameter space with the corresponding power function based on size $n + i$ for small $i \in \mathbb{N}$. We illustrate that the power function of a test based on the two-stage randomized (RAND2) p-value for discrete models, just like the one for the UMP randomized p-value, is strictly increasing with an increase in the sample size. We further illustrate that for small sample sizes, it is possible that the power functions of the test based on the two p-values (UMP and RAND2) do not strictly increase with an increase in the sample size.
We also investigate the behavior of the distribution function for the UMP and RAND2 p-values under the null and alternative hypothesis. Three objectives are of interest: first, to find if the power and level of conservativity of the p-values depend on the size of the equivalence limit; second, to investigate the behavior of the CDFs when the chosen parameter is close to the boundary of the null or alternative hypothesis; and third, to find out if the level of conservativity of the p-values depends on the sample sizes. Finally, we consider multiple testing of equivalence hypotheses where we assess the performance of our p-values in estimating the proportion of true null hypotheses using an empirical-CDF-based estimator. An adaptive version of the Bonferroni procedure that utilizes the plug-in estimator of Finner and Gontscharuk (2009) is used for familywise error control.
The rest of this paper is organized as follows. General preliminaries are provided in Section 2. The definitions, CDFs, and investigations of the behaviors of those CDFs under the null and alternative hypothesis for the UMP and the two-stage randomized p-values are considered in Section 3. We also investigate if the power function of the p-values is monotonically increasing with an increase in the sample size in the same section. Furthermore, we give the parameter value that maximizes the power of a test based on the p-values in the same section. We defer all matters concerning multiple testing until Section 4, where we consider a real-world data analysis and a simulation study to assess the performance of the p-values in estimating the proportion of true null hypotheses. Finally, we discuss our results and give recommendations for future research in Section 5.

General preliminaries
Let $\mathbf{X} = (X_1, \ldots, X_n)^\top$ denote our random data, where each $X_r$, $1 \le r \le n$, is a real-valued, observable random variable, and let $\mathcal{X}$ denote the support of $\mathbf{X}$. We assume all $X_r$ are stochastically independent and identically distributed (i.i.d.) with a known parametric distribution. The marginal distribution of $X_1$ is assumed to be $P_\theta$, where $\theta \in \Theta \subseteq \mathbb{R}$ is the model parameter. The distribution of $\mathbf{X}$ under $\theta$ is as a result given by $P_\theta^{\otimes n} =: \mathbb{P}_\theta$. We will be concerned with an interval hypothesis test problem of the form

$$H: \theta \le \theta_1 \ \text{or} \ \theta \ge \theta_2 \quad \text{versus} \quad K: \theta_1 < \theta < \theta_2 \tag{1}$$

for given numbers $\theta_1, \theta_2 \in \Theta$ such that $\theta_1 < \theta_2$. When $k$ hypotheses are of interest, they will be expressed as $H_j: \theta \notin \Delta_j$ versus $K_j: \theta \in \Delta_j$, where $\Delta_j$ denotes the range of values in the $j$th interval between θ [...] the difference between the $j$th true parameter θ(j) and θ(j) [...] constant for all the $k$ hypotheses, then this is referred to as the "average equivalence" criterion. We can sometimes make the interval in (1) symmetric to achieve equivariance to the permutation of groups, for example, the choice $\theta_2 = \theta_1^{-1}$ in Pflüger and Hothorn (2002) and Munk (1996).
As mentioned before, one method for testing this hypothesis is the Two One-Sided Test (TOST) procedure, where one tests for the alternatives $\theta < \theta_1$ and $\theta > \theta_2$ separately at size $\alpha$ and in no particular order. TOST is a particular case of the intersection-union test proposed by Berger (1982) where the null hypothesis is a union of disjoint sets, and the alternative hypothesis is an intersection of the complements of those sets. For this reason, we conduct the separate individual tests at size $\alpha$ without a multiplicity adjustment like $\alpha/2$. Practical equivalence is declared if one rejects both tests and otherwise non-equivalence. These procedures suffer from a lack of power, and an alternative that is more powerful but too complicated has been suggested in the literature by Berger and Hsu (1996) and Brown et al (1997). Since alternative tests are difficult to implement, we use TOST in this research.
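The decision rule described above can be sketched for a binomial count. This is a minimal, non-randomized illustration and not the procedure studied in the remainder of the paper: the function names `binom_pmf` and `tost_binomial` are hypothetical, and the hypothesis is assumed to have the form "θ ≤ θ1 or θ ≥ θ2" against "θ1 < θ < θ2", with each one-sided test run at the full level α.

```python
from math import comb


def binom_pmf(t, n, theta):
    """P(T = t) for T ~ Bin(n, theta)."""
    return comb(n, t) * theta**t * (1.0 - theta) ** (n - t)


def tost_binomial(t, n, theta1, theta2, alpha=0.05):
    """Two one-sided exact binomial tests (illustrative helper, not from the paper).

    Returns (p_lower, p_upper, equivalence_declared).
    """
    # One-sided test of theta <= theta1: large counts speak against this null.
    p_lower = sum(binom_pmf(s, n, theta1) for s in range(t, n + 1))
    # One-sided test of theta >= theta2: small counts speak against this null.
    p_upper = sum(binom_pmf(s, n, theta2) for s in range(0, t + 1))
    # Both tests use level alpha (no alpha/2 split); equivalence needs both rejections.
    return p_lower, p_upper, (p_lower <= alpha and p_upper <= alpha)


# 25 successes out of 50 with equivalence bounds 0.25 and 0.75.
print(tost_binomial(25, 50, 0.25, 0.75))
```

Declaring equivalence only when both one-sided p-values fall below α is the same as comparing their maximum with α, which is the form in which the p-value is used later in the text.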
We consider test statistics $T(\mathbf{X})$, where $T: \mathcal{X} \to \mathbb{R}$ is a measurable mapping. Furthermore, the test statistics $T_r$ for $r = 1, \ldots, n$ are also assumed to be mutually independent. The marginal p-value $p(\mathbf{X})$ resulting from $T(\mathbf{X})$ is assumed to be valid, meaning that $\mathbb{P}_\theta(p(\mathbf{X}) \le \alpha) \le \alpha$ holds true for all $\alpha \in [0, 1]$ and for any parameter value $\theta$ in the null hypothesis. Valid p-values are stochastically larger than UNI(0, 1), as investigated by, among many others, Habiger and Pena (2011) and Dickhaus et al (2012). On the same note, we call a p-value conservative if it is valid and $\mathbb{P}_\theta(p(\mathbf{X}) \le \alpha) < \alpha$ holds true for some $\alpha \in (0, 1)$. Throughout the article, we refer to the CDF of a p-value under the alternative hypothesis as a power function because we reject the null hypothesis for small p-values. Finally, we also make use of the (generalized) inverses of certain non-decreasing functions mapping from $\mathbb{R}$ to $[0, 1]$. In this regard, we follow Appendix 1 in Reiss (1989): If $F$ is a real-valued, non-decreasing, right-continuous function, and similarly $G$ is a real-valued, non-decreasing, left-continuous function where we define both $F$ and $G$ on $\mathbb{R}$, then [...]

Interval composite hypothesis

Introduction
In this article, we are interested in the (interval) composite null hypothesis of the form in (1). We test this hypothesis using two different p-values, whose definitions and CDFs we now give.
Definition 1 (First stage randomization). Let $U$ be a UNI(0, 1)-distributed random variable independent of the data $\mathbf{X}$. Further assume that $T(\mathbf{X})$ is our test statistic whose distribution has monotone likelihood ratio (MLR). The UMP-based p-value [...] (2) where $\theta$ is the chosen true parameter while $\gamma_n$ and $\delta_n$ are the randomization constants.
The critical constants $C_n, D_n \in \mathbb{R}$ and the randomization constants [...] For large sample sizes, the critical and the randomization constants are [...] We can use the p-value defined in Equation (2) with models possessing monotone likelihood ratio (MLR), for example, any one-dimensional exponential family and the location family of folded normal distribution. For continuous models, the critical constants $C_n$ and $D_n$ are slightly modified, for example, by introducing the variance in the case of a normal distribution. Moreover, the randomization constants in (3) are such that $\gamma_n = \delta_n = 0$ for such continuous models. Next, we give a lemma whose proof is in the Appendix to show that the UMP p-value in Definition 1 is the maximum of the p-values for a lower- and an upper-tailed test.
Lemma 1. For a fixed but arbitrary significance level $\alpha \in [0, 1]$ and a chosen true parameter under the null hypothesis $\theta_0 = \theta_1$ or $\theta_0 = \theta_2$, the UMP p-value in Equation (2) is the maximum of the p-values for a lower- and an upper-tailed test.
In calculating the UMP p-value in (2), using either $\theta_1$ or $\theta_2$ leads to the same result for the p-value. The UMP p-value is used in the first stage of randomization to deal with the discreteness of the test statistics. We conduct a second randomization to deal with the composite null hypothesis. The second stage randomized p-value (RAND2) (cf. Hoang and Dickhaus (2022)) is defined as follows.
Definition 2 (Second stage randomization). Let $U$ and $\tilde{U}$ be two different UNI(0, 1)-distributed random variables that are stochastically independent of the data $\mathbf{X}$ and of each other. Assume also that we have a constant $c \in (0$ [...] (4) where $P^{UMP}(\mathbf{X}, U)$ is the UMP p-value in the first stage as defined in Equation (2).
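To make the second randomization step concrete, the following is a minimal sketch. It assumes that the second stage takes the form used for randomized p-values in Hoang and Dickhaus (2022), namely that a first-stage p-value at or above c is replaced by the independent uniform variate and a first-stage p-value below c is divided by c, with c in (0, 1]. Both the form and the range of c are assumptions of this sketch, and the name `rand2_p_value` is hypothetical.

```python
import random


def rand2_p_value(p_first_stage, u_tilde, c=0.5):
    """Second-stage randomization (assumed form; see the hedged lead-in)."""
    if not 0.0 < c <= 1.0:
        raise ValueError("c is assumed to lie in (0, 1]")
    # Large first-stage p-values are replaced by fresh uniform noise,
    # small ones are rescaled by 1/c so that they stay in [0, 1).
    return u_tilde if p_first_stage >= c else p_first_stage / c


# If the first-stage p-value is exactly uniform, the output stays uniform:
# P(result <= t) = c*t + (1 - c)*t = t. A quick Monte Carlo look at the mean:
rng = random.Random(0)
draws = [rand2_p_value(rng.random(), rng.random()) for _ in range(20000)]
print(sum(draws) / len(draws))  # should be near 0.5
```

Under this assumed form, the constant c controls how often the data-driven part of the p-value is kept: c = 1 returns the first-stage p-value unchanged, while small c replaces most values by uniform noise.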
With our p-values so defined, we are now ready to use them to test our hypothesis. We first describe an example of a discrete model that we use to illustrate our randomized p-values in practice.
Example 1 (Binomial distribution). Assume that our (random) data is given by $\mathbf{X} = (X_1, \ldots, X_n)^\top$, where each $X_r$ is a real-valued, observable random variable, $1 \le r \le n$, and all $X_r$ are stochastically independent and identically distributed (i.i.d.) Bernoulli variables with parameter $\theta_i \in (0, 1)$ for $i = 1, 2$, Bernoulli($\theta_i$) for short. A sufficient test statistic for testing the hypothesis in (1) is $T(\mathbf{X}) = \sum_{r=1}^{n} X_r$, which is a binomial random variable with parameters $n$ and $\theta_i$, $i = 1, 2$, and we shall denote this by Bin($n, \theta_i$). The respective p-values with their CDFs are calculated using Equations (2), (3), (4), and (5). The critical constants $C_n$ and $D_n$ are given by [...] $F^{-1}_{\mathrm{Bin}(n,\theta_2)}(t)$ for $t \in (0, 1)$, where $F^{-1}(\cdot)$ denotes the quantile function of a binomial random variable with parameters $n$ and $\theta$. The randomization constants $\gamma_n$ and $\delta_n$ for large sample sizes and for arbitrary $t \in (0, 1)$ are given by [...], where $F_{\mathrm{Bin}(n,\theta)}$ denotes the CDF and $f_{\mathrm{Bin}(n,\theta)}$ the probability mass function of a binomial variable with parameters $n$ and $\theta$.
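For the binomial model, Lemma 1 suggests a direct way to compute a first-stage p-value as the maximum of two randomized one-sided p-values. The sketch below is an illustration under assumptions, not the paper's Equation (2): it uses the textbook randomized tail probabilities, it reuses the same uniform variate `u` in both tails, and the names `ump_p_value`, `p_l` and `p_u` are chosen here for readability.

```python
from math import comb


def binom_pmf(t, n, theta):
    """P(T = t) for T ~ Bin(n, theta)."""
    return comb(n, t) * theta**t * (1.0 - theta) ** (n - t)


def ump_p_value(t, n, theta1, theta2, u):
    """Maximum of two randomized one-sided binomial p-values (assumed form).

    t is the observed number of successes and u a realization of UNI(0, 1).
    """
    # Test of theta <= theta1 (reject for large t), evaluated at theta1.
    p_l = sum(binom_pmf(s, n, theta1) for s in range(t + 1, n + 1))
    p_l += u * binom_pmf(t, n, theta1)
    # Test of theta >= theta2 (reject for small t), evaluated at theta2.
    p_u = sum(binom_pmf(s, n, theta2) for s in range(0, t))
    p_u += u * binom_pmf(t, n, theta2)
    return max(p_l, p_u)


# Observed count 25 out of n = 50 with theta1 = 0.25 and theta2 = 0.75.
print(ump_p_value(25, 50, 0.25, 0.75, u=0.5))
```

Setting `u = 1` recovers the non-randomized maximum of the two exact tail probabilities, so the randomized value is never larger than the non-randomized one.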
In this section, as mentioned before, we consider the individual test problem where $k = 1$. We are interested in finding if randomization is beneficial when the equivalence limit $\Delta$ increases or decreases and if the power functions for the p-values are monotonic in sample size. Furthermore, we seek to find if the level of conservativity of the p-values depends on the sample sizes.

Sample size versus power
We expect that the power function for a test would be strictly increasing with an increase in sample size. A power function that is strictly increasing with an increase in the sample size is ideal for sample size planning since an additional observation cannot lower the power. In the case of discrete models, Finner and Strassburger (2001) showed that it is possible for the power of the least favorable configuration (LFC)-based p-value at a sample of size $n$ to coincide over the entire parameter space with that of size $n + i$, for small $i \in \mathbb{N}$. We illustrate in the second panel of Figure 1 and for the model in Example 1 that this paradoxical behavior can also occur for the UMP p-value and cannot be corrected even by use of randomization. The problem occurs for small samples with the chosen true parameter $\theta$ too close to the boundary of the alternative hypothesis. To generate Figure 1, we set the tuning parameter $c = 0.5$, $\theta_1 = 0.25$, and $\theta_2 = 0.75$ in both panels. Furthermore, we choose $\theta = 0.5$ as the true parameter under the alternative hypothesis in the left panel and $\theta = 0.4$ in the right. On the left panel in Figure 1, both power functions are strictly increasing with an increase in the sample size. On the right panel, both power functions are not monotonically increasing with larger sample sizes. We further illustrate in Figure 2 that this paradoxical behavior of the power function of the UMP p-value in the right panel of Figure 1 does not occur for small equivalence limit $\Delta$. To generate Figure 2, we maintain the parameter settings as in the right panel of Figure 1 but only change $\theta_1$ to 0.35 so that the resulting $\Delta$ is decreased compared to the initial one.

Theorem 2 (Monotonicity of the power functions). The CDFs of the UMP and RAND2 p-values are strictly increasing with an increase in the sample size $n$ for any fixed parameter value $\theta$ under the alternative hypothesis. Consequently, for any significance level and a fixed parameter value $\theta$ under the alternative hypothesis, the power of the corresponding test is monotonically increasing with an increase in the sample size $n$.
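Whether power grows with every additional observation can be checked numerically in small discrete models. The following sketch computes the exact power of the non-randomized two one-sided binomial test over a range of sample sizes. It is a simplified stand-in and not the code behind Figures 1 and 2: the randomization constants are ignored, and the helper names and the level α = 0.05 are assumptions.

```python
from math import comb


def binom_pmf(t, n, theta):
    """P(T = t) for T ~ Bin(n, theta)."""
    return comb(n, t) * theta**t * (1.0 - theta) ** (n - t)


def tost_p_value(t, n, theta1, theta2):
    """Non-randomized maximum of the two exact one-sided p-values."""
    p_l = sum(binom_pmf(s, n, theta1) for s in range(t, n + 1))
    p_u = sum(binom_pmf(s, n, theta2) for s in range(0, t + 1))
    return max(p_l, p_u)


def power(n, theta, theta1, theta2, alpha=0.05):
    """Exact P_theta(p-value <= alpha), summed over the rejection region."""
    return sum(
        binom_pmf(t, n, theta)
        for t in range(n + 1)
        if tost_p_value(t, n, theta1, theta2) <= alpha
    )


# Settings borrowed from the right panel of Figure 1: theta = 0.4 inside (0.25, 0.75).
powers = {n: power(n, 0.4, 0.25, 0.75) for n in range(5, 41)}
# Sample sizes at which one more observation lowered the exact power.
drops = [n for n in range(6, 41) if powers[n] < powers[n - 1]]
print(drops)
```

An empty `drops` list would indicate monotone power over the inspected range, while any entries mark sample sizes at which the discreteness of the rejection region costs power.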

Conservativity of the p-values
As mentioned in the introduction, we expect that the distribution of a p-value under the null hypothesis is close to that of a UNI(0, 1) distribution. A p-value can fail to meet this requirement and hence be conservative, meaning it is stochastically greater than the uniform distribution. We illustrate for the model in Example 1 that among the two p-values, only the RAND2 p-value comes close to meeting this requirement and is therefore less conservative than the UMP p-value. We illustrate in Figure 3 that utilizing the two-stage randomized p-value reduces the conservativeness of the UMP p-value. In this figure, we consider two cases where we have set $\theta_1 = 0.25$ and $\theta_2 = 0.75$ in the first case and $\theta_1 = 0.3$ and $\theta_2 = 0.75$ in the second case. For both cases, we use a sample of size $n = 50$ and set the tuning parameter to $c = 0.5$. The chosen parameter $\theta$ is 0.2 under the null and 0.35 under the alternative hypothesis. Notice that the equivalence limit $\Delta$ in the first case is larger than in the second case. The reason for using these two equivalence limits is to find if the p-values will become more or less powerful (or conservative) depending on the size of the equivalence limit $\Delta$. From Figure 3, the CDF of the UMP p-value under the null hypothesis is far from the UNI(0, 1) line compared to the one for the RAND2 p-value in both cases. Therefore, the UMP p-value is more conservative compared to the RAND2 p-value. Under the alternative hypothesis, the CDF of the UMP p-value is also far from the UNI(0, 1) line compared to the one for the RAND2 p-value in both cases. Therefore, as expected, the power of the UMP p-value exceeds that of the RAND2 p-value. This power loss is the price for using our randomized p-value. We can use conditioning (cf. Zhao et al (2019)) to improve the power of tests based on these p-values.
Under the same parameter configurations and only shrinking the equivalence limit $\Delta$, the UMP p-value becomes less powerful and more conservative. The two-stage randomized p-value also becomes less powerful, but the conservativeness of the p-value reduces even further. Notice that we shrink the equivalence limit by increasing $\theta_1$ while holding $\theta_2$ constant. Since the chosen parameter $\theta$ under the null hypothesis is also constant, this parameter will now be too far from the boundary of the resulting equivalence limit. Shrinking the equivalence limit by increasing $\theta_1$ and reducing $\theta_2$ lowers the power but does not affect the level of conservativeness of either p-value.
Furthermore, holding $\theta_1$ constant and reducing the equivalence limit by decreasing $\theta_2$ affects neither the power nor the level of conservativeness of either p-value.
These observations on the CDF for the two p-values under the null (alternative) for the model in Example 1 depend on whether the chosen parameter under the null (alternative) is such that $\theta \le \theta_1$ or $\theta \ge \theta_2$ ($\theta < 0.5$ or $\theta > 0.5$), and we cannot provide a general statement.
A similar trend in Figure 3 occurs when the equivalence limit is kept constant with the chosen true parameter under the null too far from the null boundary or the one under the alternative too close to the boundary. Furthermore, the same behavior in Figure 3 occurs when the chosen true parameter under the null or alternative hypothesis is held constant and $\Delta$ is shifted by an $\epsilon \in \mathbb{R}$ so that the new interval is of the form $(\theta_1 + \epsilon, \theta_2 + \epsilon)$. Shifting the equivalence limit and the chosen parameter under the null or alternative hypothesis with an $\epsilon$ leads to different behaviors for the CDFs. We illustrate this in Figure 4 using $n = 50$ and the tuning parameter set at $c = 0.5$. Furthermore, we consider two cases where in the first one, we set $\theta_1 = 0.2$, $\theta_2 = 0.7$, and the chosen true parameter $\theta = 0.15$ under the null and $\theta = 0.25$ under the alternative hypothesis. In the second case, we shift the parameters by an $\epsilon_1 = 0.1$ so that $\theta_1 = 0.3$ and $\theta_2 = 0.8$. The true parameters are shifted by $\epsilon_2 = 0.12$ so that $\theta = 0.27$ under the null and $\theta = 0.37$ under the alternative hypothesis. Notice that $\epsilon_2 > \epsilon_1$.

Fig. 4 The CDFs of the UMP and RAND2 p-values against $t$ for $n = 50$ and $c = 0.5$. Furthermore, we set $\theta_1 = 0.2$, $\theta_2 = 0.7$, and the chosen true parameters are $\theta = 0.15$ under the null and $\theta = 0.25$ under the alternative hypothesis in the first case (I). In the second case (II), we set $\theta_1 = 0.3$, $\theta_2 = 0.8$, and the chosen true parameters are $\theta = 0.27$ under the null and $\theta = 0.37$ under the alternative hypothesis.
From Figure 4 and with the parameters shifted as described, the CDF for the UMP p-value under the null hypothesis moves closer to the UNI(0, 1) line while there is no change in the one for the RAND2 p-value. The CDFs for both p-values under the alternative hypothesis move away from the UNI(0, 1) line. These results hold true for any $\epsilon_1$ and $\epsilon_2$ as long as $\epsilon_2 > \epsilon_1$. For $\epsilon_2 \le \epsilon_1$, the CDFs for both the p-values under the null and alternative hypothesis behave exactly as in Figure 3. Next, we give Figure 5 to illustrate the behavior of the CDFs for the two p-values under the null hypothesis using the same parameter configurations as in Figure 3 except that the sample size $n$ is not constant. Again, we consider two cases but with $n = 50$ in the first case (I) and $n = 100$ in the second case (II). From Figure 5, the CDF of the UMP p-value moves away while the one for the RAND2 p-value moves closer to the UNI(0, 1) line as the sample size increases. Therefore, the UMP p-value becomes more conservative while the RAND2 p-value becomes less conservative as the sample size increases.
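The CDF comparisons in this section can be explored numerically for the binomial model. The sketch below computes exact CDFs under two assumptions that are not taken from the paper's Equations (2) to (5): the first-stage p-value is the maximum of two randomized one-sided binomial p-values sharing one uniform variate, and the second stage replaces a first-stage value at or above c by an independent uniform variate and divides it by c otherwise. The function names are hypothetical.

```python
from math import comb


def binom_pmf(t, n, theta):
    """P(T = t) for T ~ Bin(n, theta)."""
    return comb(n, t) * theta**t * (1.0 - theta) ** (n - t)


def _ratio(num, den):
    # Largest u with a + u*b <= x; if b underflows to 0 the event ignores u.
    if den > 0.0:
        return num / den
    return float("inf") if num >= 0.0 else float("-inf")


def ump_cdf(x, n, theta, theta1, theta2):
    """P_theta(first-stage p-value <= x), with the uniform variate integrated out."""
    total = 0.0
    for t in range(n + 1):
        # Given T = t: p_l = a1 + U*b1 and p_u = a2 + U*b2, both increasing in U.
        b1, b2 = binom_pmf(t, n, theta1), binom_pmf(t, n, theta2)
        a1 = sum(binom_pmf(s, n, theta1) for s in range(t + 1, n + 1))
        a2 = sum(binom_pmf(s, n, theta2) for s in range(0, t))
        u_max = min(_ratio(x - a1, b1), _ratio(x - a2, b2))
        total += binom_pmf(t, n, theta) * min(1.0, max(0.0, u_max))
    return total


def rand2_cdf(x, c, n, theta, theta1, theta2):
    """P_theta(second-stage p-value <= x) under the assumed second stage."""
    # Either the first stage is below c and at most c*x, or it is at least c
    # and the fresh uniform variate is at most x.
    below = ump_cdf(c * x, n, theta, theta1, theta2)
    at_least_c = 1.0 - ump_cdf(c, n, theta, theta1, theta2)
    return below + at_least_c * x


# Null parameter 0.2 with theta1 = 0.25, theta2 = 0.75, n = 50, c = 0.5.
for x in (0.1, 0.25, 0.5):
    print(x, ump_cdf(x, 50, 0.2, 0.25, 0.75), rand2_cdf(x, 0.5, 50, 0.2, 0.25, 0.75))
```

Comparing each printed value with its argument `x` shows how far each CDF sits from the UNI(0, 1) line at that point, which is the quantity read off the figures in this section.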

Maximum power
We find the parameter value that maximizes the CDF of the two p-values under the alternative hypothesis for a given equivalence limit $\Delta$. Once we get this parameter, we can choose it as our parameter under the alternative hypothesis, so we always get the maximum power. Furthermore, one may wonder if the value of this parameter depends on $\Delta$ or if two or more such parameters exist within the alternative parameter space.
We generate Figure 6 to address these questions for Example 1, where we have set $c = 0.5$ and used $n = 50$ as our sample size. Furthermore, we use $\theta_1 = 0.15$ and $\theta_2 = 0.45$ in the left panel of Figure 6 and $\theta_1 = 0.25$ and $\theta_2 = 0.45$ in the right one.

[...] procedure, ABON for short. Since in practice we never really know the number $k_0$ (or the proportion $\pi_0$) of true null hypotheses, we make use of ABON combined with the plug-in (ABON+plug-in) procedure of Finner and Gontscharuk (2009) to estimate $\pi_0$. The ABON+plug-in procedure, unlike closed testing procedures (like Hommel (1988) and Hochberg (1988)), provides a theoretical guarantee to control the type I error rate at the desired level.
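The idea behind an adaptive Bonferroni rule can be sketched in a few lines. This is a generic illustration under the assumption that the rule rejects a hypothesis when its p-value is at most α divided by an estimate of the number of true nulls. The specific plug-in estimator of Finner and Gontscharuk (2009) is not reproduced here, the function name is hypothetical, and clamping the estimate to [1, k] is a choice made for this sketch.

```python
def adaptive_bonferroni(p_values, k0_hat, alpha=0.05):
    """Reject H_j when p_j <= alpha / k0_hat (sketch; k0_hat is supplied by the caller)."""
    k = len(p_values)
    # Sketch-level safeguard: keep the estimate between 1 and k so that the
    # threshold is never looser than alpha nor stricter than plain Bonferroni.
    k0 = min(max(float(k0_hat), 1.0), float(k))
    threshold = alpha / k0
    return [p <= threshold for p in p_values]


# Three hypotheses with an estimated two true nulls: threshold 0.05 / 2 = 0.025.
print(adaptive_bonferroni([0.001, 0.02, 0.5], k0_hat=2.0))
```

With `k0_hat` equal to the total number of hypotheses the rule reduces to the ordinary Bonferroni correction, so any gain comes entirely from a good estimate of the number of true nulls.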
One classical but still commonly used estimator for $k_0$ is the Schweder and Spjøtvoll (1982) estimator. It is given by

$$\hat{k}_0 \equiv \hat{k}_0(\lambda) = k \cdot \frac{1 - \hat{F}_k(\lambda)}{1 - \lambda}, \tag{6}$$

where $\lambda \in [0, 1)$ is a tuning parameter and $\hat{F}_k$ is the empirical CDF (ecdf) of the $k$ marginal p-values. It is often suggested in practice to choose $\lambda = 0.5$. One crucial prerequisite for the applicability of this estimator is that the marginal p-values $p_1, \ldots, p_k$ are (approximately) uniformly distributed on (0, 1) under the null hypothesis; see, e.
The randomized p-values considered in this work are close to meeting the uniformity assumption, whereas the non-randomized p-values are over-conservative when testing two one-sided composite null hypotheses, especially in discrete models. Typically, the estimated value of $k_0$ becomes too large if many null p-values are conservative and the estimator from (6) is employed.
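A small sketch shows how such an estimate is computed from a list of p-values. It assumes the usual Schweder-Spjøtvoll form, in which the number of p-values exceeding λ is divided by 1 − λ; the function name is hypothetical and the p-values in the usage line are made up for illustration.

```python
def schweder_spjotvoll(p_values, lam=0.5):
    """Estimate the number of true nulls as #{p_j > lam} / (1 - lam)."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    k = len(p_values)
    n_at_most_lam = sum(p <= lam for p in p_values)
    # Same quantity as k * (1 - ecdf(lam)) / (1 - lam), written with counts.
    return (k - n_at_most_lam) / (1.0 - lam)


# Made-up p-values: four of six exceed 0.5, so the estimate is 4 / 0.5 = 8.
print(schweder_spjotvoll([0.01, 0.02, 0.6, 0.7, 0.8, 0.9]))
```

The made-up example returns an estimate larger than the number of hypotheses, which is exactly the kind of overshoot that conservative null p-values produce: they pile up above λ and inflate the count in the numerator.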

Empirical distributions
To illustrate the implication of using our proposed two-stage randomized p-value in multiple testing, we employ a graphical algorithm in computing $\pi_0$. This algorithm connects the point $(\lambda, \hat{F}_k(\lambda))$ with the point (1, 1). We draw a straight line to connect the two points and extend this line to intersect the y axis at the point $1 - \hat{\pi}_0$. The best p-value for use in the estimation of $\pi_0$ is that for which the resulting straight line meets the y axis at a point that is very close to the actual $1 - \pi_0$. We require the empirical CDF of the p-value not to lie below the UNI(0, 1) line for this to work. Another way to look at this is to find the gradient of the resulting straight line between the points $(\lambda, \hat{F}_k(\lambda))$ and (1, 1), which gives an estimate of $\pi_0$.
If the resulting ECDF line lies below the UNI(0, 1) line between those two points, then it has to be a curve whose gradient can only be at a tangent and hence will give a poor estimate of $\pi_0$. To generate Figure 8, we set the number of hypotheses to $k = 1000$, set the tuning parameters $c$ and $\lambda$ both to 0.5, and use a sample of size $n = 50$. We take the proportion of true null hypotheses to be $\pi_0 = 0.7$ and set $\theta_1 = 0.25$ and $\theta_2 = 0.75$. Furthermore, to calculate the UMP-based p-value, the parameter $\theta$ under the null and alternative hypothesis is chosen as 0.18 and 0.37, respectively.

Fig. 8 Empirical CDF of the UMP p-value (black curve) and the two-stage randomized (RAND2) p-value (grey curve) for $k = 1000$, $\lambda = 0.5$, $c = 0.5$, and $\pi_0 = 0.7$. We set $\theta_1 = 0.25$ and $\theta_2 = 0.75$. Furthermore, we choose the true parameter under the null as $\theta = 0.18$ and otherwise $\theta = 0.37$. The dashed vertical line intersects the x axis at the value of $\lambda$. (Axes: $t$ against the proportion of p-values $\le t$; legend: UNI[0,1], UMP, RAND2.)
From Figure 8, the RAND2 p-value outperforms the UMP p-value since its ECDF lies above the UNI(0, 1) line for all values of $t \in (0, 1)$. Furthermore, an extension of a straight line from the point (1, 1) to $(\lambda, \hat{F}_k^{RAND2}(\lambda))$, as mentioned earlier, meets the y axis at a point which is close to $1 - \pi_0$.
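The graphical construction used in this section can be written out as a short calculation. This is a sketch, assuming the estimator has the usual Schweder-Spjøtvoll form; the value $\hat{F}_k(0.5) = 0.64$ in the last display is hypothetical and is not read off Figure 8.

```latex
% Straight line through (\lambda, \hat F_k(\lambda)) and (1, 1):
\[
  y(t) = 1 - s\,(1 - t), \qquad s = \frac{1 - \hat F_k(\lambda)}{1 - \lambda}.
\]
% Its intercept with the y axis (t = 0) identifies the estimate:
\[
  y(0) = 1 - s = 1 - \hat\pi_0
  \quad\Longrightarrow\quad
  \hat\pi_0 = s, \qquad \hat k_0 = k\,\hat\pi_0 .
\]
% Hypothetical numbers: k = 1000, \lambda = 0.5, \hat F_k(0.5) = 0.64.
\[
  \hat\pi_0 = \frac{1 - 0.64}{1 - 0.5} = 0.72, \qquad \hat k_0 = 1000 \times 0.72 = 720 .
\]
```

The slope and the intercept carry the same information, which is why the text can describe the estimate either as the gradient of the line or through the point where the line meets the y axis.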

Simulation study
We [...] $\Delta$. Since we are utilizing randomized p-values in (6), we average the estimated value of $k_0$ over the 10,000 Monte Carlo repetitions. For exemplary purposes, we present ten choices of $\theta_1$ and $\theta_2$ in Table 1. A detailed description of the simulation is provided in Algorithm 1.
The results from our simulation study based on the Algorithm 1 are presented in Table 1.
Algorithm 1 Computation of the proportion of true null hypotheses
1) For each of the $k = 47$ regions in the COVID-19 data set, find the proportions of recoveries $\theta_i$, $i \in \{1, \ldots, 47\}$, and use these as the assumed true proportions (i.e., as the assumed ground truth) in the simulations.
2) For each $\theta_i$ from step 1) and for each of the sample sizes $n_i$, $i \in \{1, \ldots, 47\}$, simulate a single data point $x_i$ from Bin($n_i, \theta_i$).
3) Select two proportions $\theta_1, \theta_2 \in [0, 1]$ such that $\theta_1 < \theta_2$ as the null values to be tested against. Take $k_0$ as the number of values $i \in \{1, \ldots, k\}$ fulfilling that $\theta_i \le \theta_1$ or $\theta_i \ge \theta_2$. We use the selected $\theta_1, \theta_2$ as well as the numbers $x_i$ and $n_i$ from step 2) in the computation of the p-values, where $i \in \{1, \ldots, k\}$. This step generates a pair of p-values for the UMP p-value since we decompose the null hypothesis in (1) into a lower- and an upper-sided test. Denote these p-values by $p_l$ and $p_u$, respectively. In each case, pick $\max(p_l, p_u)$, which is the maximum of the two p-values.
4) Compute the statistic in Equation (6) $r = 10{,}000$ times for the UMP and RAND2 p-values and take the mean.

From Table 1, for whatever value of $\Delta$, the RAND2 p-value outperforms the UMP p-value by giving estimates which are on average close to the actual value of $k_0$. Also, as is expected, the number of true null hypotheses $k_0$ decreases with an increase in the interval $\Delta$.
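The steps of Algorithm 1 can be sketched end to end on synthetic inputs. Everything specific below is an assumption made for illustration: the 47 "true" proportions and the common sample size of 40 are invented and are not the COVID-19 data, the first stage is taken as the maximum of two randomized one-sided binomial p-values, the second stage uses the "replace by a uniform variate above c, divide by c below" form, and only 500 repetitions are run instead of 10,000.

```python
import random
from math import comb


def binom_pmf(t, n, theta):
    """P(T = t) for T ~ Bin(n, theta)."""
    return comb(n, t) * theta**t * (1.0 - theta) ** (n - t)


def tail_coefficients(x, n, th1, th2):
    """(a_l, b_l, a_u, b_u) such that p_l = a_l + U*b_l and p_u = a_u + U*b_u."""
    a_l = sum(binom_pmf(s, n, th1) for s in range(x + 1, n + 1))
    a_u = sum(binom_pmf(s, n, th2) for s in range(0, x))
    return a_l, binom_pmf(x, n, th1), a_u, binom_pmf(x, n, th2)


def ss_estimate(p_values, lam):
    """Number of p-values above lam, divided by 1 - lam."""
    return sum(p > lam for p in p_values) / (1.0 - lam)


def simulate(thetas, sizes, th1, th2, c=0.5, lam=0.5, reps=500, seed=1):
    rng = random.Random(seed)
    # Step 2: one binomial draw per unit, kept fixed over all repetitions.
    xs = [sum(rng.random() < th for _ in range(n)) for th, n in zip(thetas, sizes)]
    # Step 3: number of true nulls implied by the assumed ground truth.
    k0 = sum(th <= th1 or th >= th2 for th in thetas)
    coefs = [tail_coefficients(x, n, th1, th2) for x, n in zip(xs, sizes)]
    mean_ump = mean_rand2 = 0.0
    # Step 4: average the estimator over the randomization only.
    for _ in range(reps):
        p_ump = []
        for a_l, b_l, a_u, b_u in coefs:
            u = rng.random()
            p_ump.append(max(a_l + u * b_l, a_u + u * b_u))
        p_rand2 = [p / c if p < c else rng.random() for p in p_ump]
        mean_ump += ss_estimate(p_ump, lam) / reps
        mean_rand2 += ss_estimate(p_rand2, lam) / reps
    return k0, mean_ump, mean_rand2


# Invented ground truth: 47 proportions spread evenly over [0.1, 0.9].
thetas = [0.1 + 0.8 * i / 46 for i in range(47)]
sizes = [40] * 47
k0, est_ump, est_rand2 = simulate(thetas, sizes, 0.25, 0.75)
print(k0, round(est_ump, 2), round(est_rand2, 2))
```

Precomputing the tail coefficients once per unit keeps the repetition loop cheap, since only the uniform variates change between repetitions while the simulated counts stay fixed, as in step 2 of the algorithm.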

Role of the tuning parameter λ
In this section, we investigate the role of the tuning parameter $\lambda$ in the estimator given in (6) when using the two p-values. Proper choice of this parameter is important since a smaller $\lambda$ will lead to high bias and low variance while a larger one leads to low bias and high variance of the proportion estimator. Based on this bias-variance trade-off, we take the optimal $\lambda$ to be the one that minimizes the MSE. Other research in this direction includes the use of change-point concepts for choosing $\lambda$ in the Storey

Discussion
In this research, we have considered the UMP and randomized p-value (RAND2) in the interval composite null hypothesis. Using large sample sizes, we have illustrated that the power functions for the UMP and RAND2 p-values are monotonically increasing in sample size. We have also found that it is possible for the power function of the UMP and the two-stage randomized p-value for a sample of size n and that of n + i for small i ∈ N to coincide on the entire parameter space. This problem occurs when dealing with relatively small samples for ∆ too wide while the chosen parameter θ under the alternative is too far from the boundary. This problem does not occur when ∆ is too narrow while the chosen parameter θ is too close to one of the boundaries of the alternative hypothesis (see Figure 2). This problem only occurs if the interval ∆ gets smaller from one end while the other one is kept constant, for example, by holding θ_2 constant and increasing θ_1.
The problem of nonmonotonicity of the power functions gets worse if the equivalence limit decreases from both ends. A similar observation in Qiu and Cui (2010) is that when the equivalence limit is too narrow, the ROC curve of the TOST procedure is nonmonotonic for small sample sizes. A complete characterization of this paradox will be considered in future research following the ideas in Finner and Strassburger (2001) and Finner and Roters (1993). Of course, in practical problems, the equivalence limits are determined before the data collection and remain fixed throughout the experiment. The adjustments made here are to illustrate the behavior of the p-values and their CDFs under different equivalence limits.
A plot of the CDFs for the UMP and RAND2 p-values under the null and alternative hypotheses illustrates that the UMP p-value is more conservative but less powerful than RAND2 p-value. The conservativeness of the UMP p-value increases while that of RAND2 reduces with a further decrease in ∆. Furthermore, the power functions for the p-values are decreasing with ∆. This decrease in power implies there is more benefit in using RAND2 as ∆ reduces. A similar trend for the CDFs, which leads to the same conclusion, occurs when ∆ is kept constant and the chosen parameter under the null (alternative) is too far from (close to) the boundary of the null (alternative).
Increasing both the parameters θ_1 and θ_2 by ϵ_1 and θ by ϵ_2 leads to an increase in power for both the p-values, a decrease in conservativity of the UMP p-value, and no change in the level of conservativity of RAND2 p-value. A similar trend occurs for a large equivalence limit, provided ϵ_2 > ϵ_1. The power increases for a large equivalence limit since θ moves closer to the midpoint of ∆, which is the parameter that gives the maximum power for both UMP and RAND2 p-values under this condition. We were also interested in finding the parameter value that maximizes the CDFs under the alternative hypothesis for the two p-values. We found that for large ∆, the parameter value that maximizes the CDFs of both p-values occurs at the midpoint of ∆. For small ∆, however, this parameter value can be too close to the boundary of the alternative hypothesis for RAND2 p-value while the one for the UMP p-value is always at the midpoint.
Concerning the level of conservativity of the p-values with respect to the sample size, we found that the CDF for the UMP p-value moves further away from the UNI(0, 1) line, while the one for RAND2 p-value moves closer to it, with an increase in the sample size. Therefore, the UMP p-value becomes more conservative while the level of conservativity for RAND2 p-value remains the same with an increase in the sample size. Furthermore, Munk (1996) and Wellek (2010), Sect. 1.2 (p. 5), argue that equivalence tests require much larger sample sizes to achieve a reasonable power compared to the one- or two-sided tests, unless ∆ is chosen so wide that even "nonequivalent" hypotheses would be declared "equivalent." Therefore, it would be better to consider RAND2 p-value for multiple equivalence tests since it is less conservative even for large sample sizes.
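The effect of the sample size on the conservativity of the UMP p-value can be checked exactly in the binomial model, since the CDF of the p-value at a point t is a finite sum over the sample space. The sketch below is illustrative only: it assumes a single binomial observation, takes the UMP p-value to be max(p_l, p_u) as in step 3 of Algorithm 1, and borrows the settings θ_1 = 0.25, θ_2 = 0.75, θ = 0.2 with n = 50 and n = 100 from Fig. 5. The function names are hypothetical, and RAND2 is not covered.

```python
import math


def binom_pmf(x, n, theta):
    """Binomial pmf for 0 < theta < 1, computed in log space."""
    log_p = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
             + x * math.log(theta) + (n - x) * math.log1p(-theta))
    return math.exp(log_p)


def tost_p_value(x, n, theta1, theta2):
    """max(p_l, p_u) for the interval null theta <= theta1 or theta >= theta2."""
    p_l = sum(binom_pmf(j, n, theta1) for j in range(x, n + 1))
    p_u = sum(binom_pmf(j, n, theta2) for j in range(0, x + 1))
    return max(p_l, p_u)


def p_value_cdf(t, n, theta, theta1, theta2):
    """Exact P_theta(p-value <= t): add the pmf over every x whose p-value is <= t."""
    return sum(binom_pmf(x, n, theta) for x in range(n + 1)
               if tost_p_value(x, n, theta1, theta2) <= t)


# theta = 0.2 lies in the null; a UNI(0, 1) p-value would give exactly t = 0.05.
cdf_n50 = p_value_cdf(0.05, 50, 0.2, 0.25, 0.75)
cdf_n100 = p_value_cdf(0.05, 100, 0.2, 0.25, 0.75)
print(cdf_n50, cdf_n100)
```

Both values fall below t = 0.05, and the value for n = 100 is the smaller of the two, which matches the statement that the CDF of the UMP p-value moves further from the UNI(0, 1) line as n grows, at least at this t and for these settings.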
A plot of the empirical CDFs of the p-values evaluates their performance when used with the estimator in (6). The ECDF of RAND2 p-value, unlike the one for the UMP p-value, is above the UNI(0, 1) line throughout. Furthermore, the slope between the points (1, 1) and (λ,
The choice of the tuning parameter λ for the estimator in (6) has also been of great concern in the recent literature. The sensitivity of this estimator to λ is more pronounced for conservative p-values than for non-conservative ones. Since the UMP p-value is more conservative than RAND2 p-value, the choice of λ is critical for obtaining estimates that are close to the actual number of true null hypotheses when using the UMP p-value. Based on the results from our simulation study, we recommend a small value of λ when utilizing the UMP p-value. Assuming we are using a small α, this choice is similar to the recommended choice of λ = α in the previous literature. When using RAND2 p-value, any choice of λ which is not close to one is recommended. We recommend this choice since the estimate of k_0 based on RAND2 p-value oscillates wildly around the value of k_0 as λ → 1.
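The bias-variance trade-off behind the choice of λ can be made concrete with a small Monte Carlo experiment. The following is a toy sketch rather than the simulation design of this paper: it assumes a Storey-type form #{p_i > λ} / (1 − λ) for the estimator in (6), left uncapped so that the variance is visible, uses uniform p-values under the null and the hypothetical choice u^4 (u uniform) under the alternative, and fixes k = 47 with an arbitrary k_0 = 20.

```python
import random


def estimate_k0(p_values, lam):
    """Assumed Storey-type estimate of the number of true nulls."""
    return sum(p > lam for p in p_values) / (1.0 - lam)


def mse_by_lambda(k, k0, lambdas, reps=2000, seed=7):
    """Empirical (bias, variance, MSE) of the estimate for each lambda on a grid."""
    rng = random.Random(seed)
    draws = {lam: [] for lam in lambdas}
    for _ in range(reps):
        # Toy model (an assumption): uniform nulls, alternatives distributed as u**4.
        ps = ([rng.random() for _ in range(k0)]
              + [rng.random() ** 4 for _ in range(k - k0)])
        for lam in lambdas:
            draws[lam].append(estimate_k0(ps, lam))
    out = {}
    for lam, vals in draws.items():
        mean = sum(vals) / reps
        var = sum((v - mean) ** 2 for v in vals) / reps
        out[lam] = (mean - k0, var, (mean - k0) ** 2 + var)
    return out


lambdas = [0.1, 0.3, 0.5, 0.7, 0.9]
results = mse_by_lambda(k=47, k0=20, lambdas=lambdas)
best_lambda = min(lambdas, key=lambda lam: results[lam][2])  # MSE-minimizing lambda
for lam in lambdas:
    print(lam, [round(v, 2) for v in results[lam]])
```

In this toy model the bias shrinks and the variance grows as λ moves toward one, so the MSE-minimizing λ depends on how the two balance. Conservative p-values such as the UMP p-value would need their own null distribution in place of the uniform draws.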
The recommendation in Dickhaus (2013), Habiger and Pena (2011), and Habiger (2015) that randomized p-values are nonsensical for a single hypothesis also applies to our RAND2 p-value, and in that case the usage of the UMP p-value is advocated for.
Furthermore, we caution the practitioner against using randomized p-values in bioequivalence studies. Some general extensions of this research include using randomized
j ∈ {1, ..., k} and k is the multiplicity of the problem. Denote the resulting k p-values by p_1, ..., p_k. We consider the case k = 1 in Section 3 and defer the multiple test problem till Section 4. When

Fig. 3 The CDFs of the UMP and RAND2 p-values against t for n = 50 and c = 0.5. We choose the true parameter θ = 0.2 under the null hypothesis and θ = 0.35 under the alternative hypothesis. Furthermore, we set θ_1 = 0.25 and θ_2 = 0.75 in the first case (I) and θ_1 = 0.3 and θ_2 = 0.75 in the second case (II).

Fig. 5 The CDFs of the UMP and RAND2 p-values against different values of t with c = 0.5, θ_1 = 0.25, θ_2 = 0.75, and with θ = 0.2 as the chosen parameter under the null hypothesis. Furthermore, we use n = 50 in the first case (I) and n = 100 in the second case (II).

Fig. 6 The CDFs for the UMP and RAND2 p-values against the chosen parameter θ for c = 0.5 and n = 50. We set θ_1 = 0.15 and θ_2 = 0.45 in the left panel and θ_1 = 0.25 and θ_2 = 0.45 in the right one. The vertical lines intersect the respective CDF curves at their maximum and the x axis at the parameter value that maximizes those CDFs. The bold vertical line is for the UMP p-value while the thin dotted line is for RAND2 p-value. Furthermore, the thin dotted horizontal lines intersect the y axis at the value of α.
We now conduct a simulation study based on real-world data to support the claim in Section 4.2 that RAND2 p-value outperforms the UMP p-value in estimating the proportion of true null hypotheses in multiple testing. We use the publicly available Coronavirus Disease 2019 (COVID-19) data taken from https://github.com/CSSEGISandData/COVID-19 (cf. Dong et al. (2020)). It consists of confirmed COVID-19 cases and recoveries for the United States of America as of 12th May 2020. The data set has k = 58 regions. After cleaning the data by removing all the missing values, we have k = 47 regions for our analysis. We select an interval of recovery rates with endpoints θ_1 and θ_2 and conduct a TOST to determine whether the true rates from the data set lie in this interval. We use a Monte Carlo simulation to assess the (average) performance of the UMP and RAND2 p-values in estimating k_0. We set the constant c and the tuning parameter λ in (6) to 0.5 for all the simulations. The recovery rates from the data set are assumed to be the true proportions. Using these rates and the number of confirmed cases, we generate a new data set on the computer for calculating the p-values. Using different values of θ_1 and θ_2, we obtain various values of k_0 ∈ {0, ..., 47} depending on

) for RAND2 p-value is closer to π_0 than the one for the UMP p-value. Therefore, RAND2 p-value outperforms the UMP p-value in estimating the proportion of true null hypotheses. To further justify this claim, we have given a real example and provided a simulation study, showing that RAND2 p-value outperforms the UMP p-value for all values of ∆ by giving estimates that are closer on average to the true proportions.

Table 1 Estimates of the number of true null hypotheses.