Comparison of Methods for Estimating the Proportion of Null Hypotheses π 0 in High Dimensional Data When the Test Statistics is Continuous

Advances in Genomics have re-energized interest in multiple hypothesis testing procedures but have simultaneously created new methodological and computational challenges. In Genomics for instance, it is now commonplace for experiments to measure expression levels in thousands of genes creating large multiplicity problems when thousands of hypotheses are to be tested simultaneously. Within this context we seek to identify differentially expressed genes, that is, genes whose expression levels are associated with a particular response or covariate of interest. The False Discovery Rate (FDR) is the preferred measure since the Family Wise Error Rates (FWERs) are usually overly restrictive. In the FDR methods, estimation of the proportion of null hypotheses (π0) is an important parameter that needs to be estimated. In this paper, we compare the effectiveness of 12 methods for estimating π0 when the test statistics are continuous using simulated data with independent, weak dependence, and moderate dependence structures. Declared True Declared False H0 True Correct Decision (1-α) Type I Error (α) H0 False Type II Error (β) Correct Decision (1-β) Table 1: Possible outcomes for a single hypothesis test. Significant Not Significant Total True Null V U m0 False Null S T m1 Total R W m Table 2: Possible outcomes for m hypothesis tests. Citation: Dialsingh I, Cedeno SP (2017) Comparison of Methods for Estimating the Proportion of Null Hypotheses π0 in High Dimensional Data When the Test Statistics is Continuous. J Biom Biostat 8: 343. doi: 10.4172/2155-6180.1000343


Introduction
A hypothesis is a claim or assertion either about a population parameter.In the case of a single hypothesis, we typically test the null hypothesis H 0 versus an alternative hypothesis H a based on some test statistic.We reject H 0 in favor of H a whenever the test statistic lies in the rejection region specified by some rejection rule.The test statistic is a function of the sample data on which the decision to reject or not to reject H 0 will be based.Within the realm of hypothesis testing, there are two possible types of errors that can be committed.A Type I error, is committed if we reject H 0 when H 0 is true while a Type II error occurs when we fail to reject H 0 when H 0 is false, Table 1 summarizes the error possibilities.
The problem associated with single hypothesis testing is compounded in the realm of multiple hypothesis testing.Multiple hypothesis testing is concerned with controlling the rate of false positives when we are testing multiple hypotheses simultaneously.When conducting multiple hypothesis tests, the probability of making at least one Type I error is substantially higher than the nominal level used for each test, particularly when the number of total tests, m, is large.
It is not unusual to have thousands of tests being done simultaneously.This implies that the probability of getting at least one significant result approaches 1.These methods of multiple hypothesis testing require an adjustment of α in some way so that the probability of observing at least one significant event, due largely to chance, remains below the chosen level of statistical significance.(3) Many methods have been created modifying the FWER concepts.A review of these techniques are found in [1,2].The majority of these methods involves the use of step-up or step-down procedures [3][4][5].Controlling FWER results in tests with low power.The FDR is less restrictive and allows a certain amount of false positives [6].A number of methods have been established for estimation of the proportion of null hypotheses when the test statistics are discrete [7].In this paper, we compare 12 methods for comparing false discovery rates when the test statistics are continuous under varying dependence structures.

Methods for controlling false discovery rate
Modern approaches in multiple hypothesis testing focus on False Discovery Rate (FDR) control [6], which is a simple step-up procedure using the ordered p-values of the tests.False Discovery Rate originated in two papers that dealt with multiple hypothesis testing.Schweder and Spjotvoll [8] first suggested that the ranked p-values be plotted.The assessment of the number of true null hypotheses m 0 would then be done via an eye fitted straight line starting from the largest p-values.The p-values that deviate (outliers) from this straight line would then correspond to the false null hypotheses.The density of the p-values can be expressed as: Where π 0 and π 1 represent the proportion of p-values under the null density f 0 and under the alternative density f 1 respectively.For continuous tests, p-values are uniformly distributed on the interval (0,1).The distribution of the p-values under the alternative hypothesis is unknown.Methods for estimating π 0 when the test statistics are continuous have been developed by coupling the mixture model with the assumption that either the density of marginal p-values, f(p), or the density of p-values under the alternative, f 1 (p) is non-increasing.
The first procedure than controls FDR is the BH Algorithm [6].To control FDR at level q, reject all null hypotheses where: It has been shown that when the test statistics are continuous and independent, this procedure controls the FDR at level π 0 q where π 0 is the proportion of true null hypotheses.In this procedure, each of the m hypotheses are treated as null so in this case, π 0 =1.There are many other methods that were created.We briefly describe 12 of these methods for which we do our comparisons on.

Storey's Smoother Method
The proposed estimator is: Where λ ϵ [0,1) and is called a tuning parameter [9].Beyond 0.5 the histogram of the p-values is flat and this suggests that mostly null p-values are there.Taking λ=0 gives ( ) π λ = which is overly conservative and as λ→ 1 equation ( 6) above indicates that the variance of ( ) 0 π λ increases and this makes the estimated q-values unreliable.
The general algorithm for estimating q-values from the p-values is: 1.
Assume that we have m ordered p-values according to their evidence against the null hypotheses, i.e. p (1) ≤ p (2) ≤… ≤ p (m) .
Let f be a natural cubic spline with 3 degrees of freedom of ( ) Set the estimate of π 0 to be ( ) Let t be a threshold where 0 < t ≤ 1 calculate: For i=m-1, m-2, …, 1 calculate: 3. Equation ( 8) above gives the estimated q-value for the i th most significant feature.This method is referred to as "smoother" throughout this paper.

Storey's bootstrap method
Storey's bootstrap method [10] resamples the m ordered p-values p (1) ≤ p (2) ≤ … ≤ p (m) with replacement and creates pseudo-data sets for some number of bootstrap resamples B. The p-values are then calculated for each pseudo-data set and a bootstrap estimate is done for the FDR.The method enables a counter to record whether the minimum p-value from the pseudo-data set is less than or equal to the actual p-value for each base test.For m tests there would be m such counters.This process is repeated a large number of times, say B, and the proportion of resampled data sets where the minimum pseudo p-values is less than or equal to an actual p-value is the adjusted p-value.This method has the ability to incorporate all sources of correlation from both the multiple contrasts and the multivariate structure.The resulting adjusted p-values incorporate all correlations and distributional characteristics.The bootstrap method provides strong control when the joint distribution of the p-values for any subset of the null hypotheses is identical to that under the complete null hypothesis, that is, when the subset pivotality condition holds.It always provides weak control of the FWER.This method is referred to as "st.boot" throughout this paper.

Two-step procedure of Jiang and Doerge
This two-step procedure proposed by Jiang and Doerge [11] increase the power of detecting differentially expressed genes in microarray data.The null hypothesis of equality across the mean expression levels for all treatments is tested in step one.In the second step pairwise comparisons are done on genes for which the treatment means were shown to be statistically significant in step one.This approach estimates the overall FDR used in both steps in such a way that the overall FDR is controlled below a pre-specified FDR significance level.The twostep approach has increased power over a one-step procedure and it controls the FDR at the desired level of significance.The algorithm for this two-step procedure is as follows: 1.
Test the null hypothesis that a gene is not differentially expressed across each treatment condition.This can be done using the global F-test from an ANOVA model for instance.For m tests corresponding to m genes an FDR control procedure is applied to control FDR at the α 1 level.Tests that produce p-values ≤ some specified c 1 are considered to be statistically significant.If we get L significant tests and M genes are declared to have statistically significant treatment effects then we go to step 2. However if L=0 then we stop and conclude that there are no statistically significant pairwise comparisons and therefore no differentially expressed genes.

2.
For genes that form collection M we would perform N pairwise comparisons for each gene.We would apply an FDR control procedure at the α 2 level to these L*N tests.Any comparisons found that produce p-values ≤ some specified c 2 are considered to be statistically significant.
It should be noted that this procedure assumes that genes that show no significant treatment effect (step 1) will also show no statistically significant pairwise comparisons (step 2).However, a gene having a significant treatment effect may or may not have statistical significance in pairwise comparisons.This method is referred to as "jiang" throughout this paper.

Nettleton's Histogram method
Nettleton's Method [12] estimates π 0 by estimating the proportion of the observed p-values that follow a uniform distribution.The steps in the algorithm are as follows: Assume all null hypotheses are true, and set ( ) Calculate the expected number of p-values for each bin given the current estimate of the number of true null hypotheses.
Beginning with the leftmost bin, sum the number of p-values in excess of the expected until a bin with no excess is reached.
Use the excess sum as an updated estimate of m 1 , and then use that to update the estimate of m 0 =m -m 1 .

Return to
Step (iii) and repeat the procedure until convergence is reached.
The number of bins is used as a tuning parameter and the authors suggested using B=20.In their simulation study the histogram based estimator with 20 bins tended to be conservative for independent and autoregressive correlation structures in that it tended to overestimate the number of true null hypotheses for certain correlations that was considered in their simulation study.This method is referred to as "histo" throughout this paper.

Convex decreasing p-value estimate (Langaas)
Langass and Landqvist's estimator is based on the non-parametric MLE of the p-value density [13].The authors restricted their attention to decreasing and convex decreasing densities because of their many mathematically attractive properties.The derived estimator of π 0 assumed independence of the test statistics.These estimators were found to be robust with respect to the independence assumption but also worked well for test statistics that showed moderate levels of dependence.The aim is to derive a NPMLE (non-parametric maximum likelihood estimate) for a convex decreasing density on [0, 1).The mixture density given in ( 4) is modified by requiring f(p) be a convex decreasing function with f(1)=0.The algorithm is as follows: Specify a convex decreasing initial function ( )
If this optimality is not found then the next iterate is: Where The procedure employed by this algorithm is similar to the "steepest descent" algorithm used to optimize a function on the Euclidean n-space n R .The next iterate in each step of the algorithm is the optimal convex combination of the current iterate and the mixing density fθ that corresponds to the most negative directional derivative.This method is referred to as "langaas" throughout this paper.

Robust estimation (Pounds)
Pounds and Cheng's estimator of π 0 [14] when the test statistics are continuous is given by: ( ) ( ) for two sided tests min 1, 2 t ˆ for one sided tests Where is the average p-value for all m hypotheses, [ ] and min(a,b) is the minimum of a and b.This estimator is biased upward but the bias is small when pf 1 (p) is small or when π 0 is close to 1.This method is referred to as "pounds" throughout this paper.

Lowest slope method
This stepwise procedure was developed in ref. [15] first estimates the value of m 0 using the data and this estimate is used in the adaptive procedure itemized in the algorithm that follows: i.
For level q and i=1, 2, …., k we compare each p (i) with is found to be smaller we do not reject any null hypotheses and stop.
Compute the slopes Starting with i=1 proceed as long as S i ≥ S i-1 .When we find S j < S j-1 stop.
Set This method is referred to as "abh" throughout this paper.

Sliding linear method
A sliding linear model (SLIM) to estimate π 0 in datasets with dependence structures [16].Whenever undetected dependence structures exist they distort the distribution of the p-values making any estimate of π 0 less effective.The SLIM method of FDR estimation is based on a linear model transformed from the non-linear λ estimator of Storey.The method functions by partitioning the data into local dependence blocks.This data partitioning and optimization allows SLIM to utilize information from a broader range of p-value distributions for π 0 estimation.The employed optimization scheme exploits the non-static relationship between the p-values and the q-values by minimizing the difference between the fractions of tests called significant by the p-value and q-value methods.SLIM handles the hidden dependence in the data without the need to empirically adjust the null p-value distributions.This method is referred to as "SLIM" throughout this paper.

SPLOSH method
The Spacings LOESS histogram (SPLOSH) estimates the conditional false discovery rate (cFDR) [17].This cFDR is the expected proportion of false positives conditioned on k "significant" findings.SPLOSH is designed to be more stable than Storey's approach in that the q-value is based on an unstable estimator of the positive false discovery rate (pFDR).SPLOSH is applicable in a wider variety of settings than BUM.The SPLOSH algorithm is as follows: Let p (1) ≤ p (2) ≤ … ≤ p (k) be k ordered p-values.Let ( ) be their adjusted ranks to control the type I error rate.
Apply the arc-sine transform for i=1,2,…,k to the ranked p-values using: Apply LOWESS (LOcally WEighted Scatter-plot Smoother) to For j=1,…,u to obtain the estimated derivative of the CDF up to a unitizing constant c. Let be an estimate of the PDF at p i for i=1,2,…,k.The trapezoidal rule is used to estimate c using: where be the CDF and for k=l + 1,..,u let be an estimate of ( ) ˆ( ) j F p  obtained using the trapezoidal rule in approximate integration.Take ( ) ( ) ( ) Define a monotone quantity ( ) ( ) ( ) based on the cFDR estimates r (i) for i=1, … , k.This method is referred to as "splosh" throughout this paper.

Beta-Uniform Mixture (BUM) Model
Mixture modeling of the p-value distribution was first proposed in ref. [18].This method models the p-value distribution as a mixture of a Uniform (0,1) distribution, corresponding to the true null hypotheses and a Beta(α,β) distribution corresponding to the false null hypotheses.The Beta-Uniform Mixture (BUM) Model was proposed by Pounds and Morris [19].This is basically a mixture of a Uniform(0,1) distribution and a Beta(α,1) distribution.The Beta distribution is chosen because of its flexibility in modeling any distribution on the interval [0,1].This mixture model assumes independence of gene expression levels across genes.Thus, under the null hypothesis the p-values are uniformly distributed on the interval [0,1] regardless of the statistical test being used, as long as the test is valid, and regardless of the size of the sample.Under the alternative hypothesis the distribution of the p-values tends to cluster closer to zero than one.By referring to the entire distribution of the p-values obtained in the sample we can then address whether there is statistically significant evidence that any of the genes under study exhibits a difference in expression across the groups.This is usually done by performing an omnibus test of whether the observed distribution of the p-values is significantly different from a uniform distribution.This method is referred to as "BUM" throughout this paper.

Grenander density estimator
The Grenander density estimator [20] is a non-parametric maximum likelihood estimator (NPMLE) similar to an empirical cumulative density function (ECDF).Unlike the ECDF it has the added constraint of an underlying density.The Grenander estimator uses the ECDF as an estimator of the associated distribution function F(p).This estimator is the decreasing piecewise-constant function equal to the slopes of the least concave majorant (LCM) of the ECDF.Monotone regression with weights is used to compute the LCM of the ECDF in order to obtain this estimator.If x i and γ i are used to denote coordinates for the ECDF and ∆x i =x i+1 -x i and ∆γ i =γ i+1 -γ i , then the slopes of the LCM are given by antitonic regression and the raw slopes ∆x i and ∆γ i with weight ∆x i [21].This method is referred to as "strimmer" throughout this paper.

Two stage adaptive procedure
Adaptive procedures first estimate the number of null hypotheses m 0 and then use this estimate to revise a multiple test procedure.Knowledge of m 0 can be used to improve upon the performance of the FDR controlling procedure.The two stage adaptive procedure makes use of a linear step-up procedure making use of m ordered p-values p (1) ≤ p (2 ) ≤ … ≤ p (m).Let . If this k exists then we reject all the k hypotheses associated with p (1) ≤ p (2) ≤ … ≤ p (k), otherwise do not reject any of the hypotheses.If m 0 were known, then the linear set-up procedure with: would control the FDR at precisely the desired level q in the independent and continuous case, and would then be more powerful in rejecting hypotheses for which the alternative holds.The adaptive Benjamini-Hochberg procedure [6] at the level q is as follows: If  0 0 m = reject all hypotheses; otherwise, test the hypotheses using the linear set-up procedure at level  0 qm m .

1.
Use the linear set-up procedure at level q, and if no hypothesis is rejected stop; otherwise, proceed.
This method is referred to as "TSBKY" throughout this paper.

Results and Discussion
The authors conducted a simulation study to compare the twelve (12) methods for estimating the proportion of null hypotheses π 0 under varying levels of dependence.Simulations were drawn from a MVN (μ, Σ) distribution where μ is the mean vector and Σ is the covariance matrix using the R library MASS.The study involved simulating the expression values of 1000 "genes" from 20 samples.The first 10 samples for each "simulated expression value" came from the case subjects and the last 10 samples came from the control subjects.The simulations were done for π 0 values of 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, and 1.00.The data was simulated for three cases of dependency within the genes: i.
Independent Genes -No dependence structure within or between the genes.In this case the leading diagonal of Σ contained all 1's and the off-diagonal entries, denoted as ρ, were all zero.A 1000 x 12 matrix of p-values was generated for each value of π 0 .Each column of the matrix would represent 1000 p-values extracted from each of the "m" matrices.This procedure was replicated 1000 times.The p-value matrices were then used as the input test statistics to compute the row means and standard deviations for each π 0 estimation method being compared.This data was then bound into one matrix for each estimation method for plotting. ii.
Weak Dependence -In this case the leading diagonal of Σ contained all 1's and the off-diagonal entries, denoted as ρ, were all 0.1.The simulation was done in the same manner as for independence.
iii.Moderate Dependence -In this case the leading diagonal of Σ contained all 1's and the off-diagonal entries, denoted as ρ, were all 0.5.
All of the simulations were done using R version 3.2.3.Source codes for most of the estimation methods were available freely online.
In Figure 1 the solid black line that runs from bottom left to top right represents "the truth", meaning that if the true value of π 0 is 0.10 then the estimated value  0 π should correspond to 0.10.Departures either above or below this 45° line would allow one to make graphical inferences about the performance of the proposed estimator.This graphical approach has the advantage of allowing for easy visualization of the data and identification of possible trends and patterns that it contains that might not be readily obvious with tabular approaches.All of the methods compared tended to overestimate  0 π for low values of π 0 .The estimators all performed better as π 0 increased.This is well documented in literature."SLIM" performed very well for small values of π 0 but tended to underestimate as π 0 increased.It was found that "smoother", "st.boot", "Langaas" and "histo" all performed very well for most values of π 0 ."TSBKY" and "abh" overestimated  0 π for every value of π 0 .For example, Table 3 shows that when π 0 was 0.30 "TSBKY" gave an estimate of  0 0.805 π = with an estimated standard deviation of 0.022 while "abh" gave an estimate of  0 0.607 π = with an estimated standard deviation of 0.047.
Figure 2 shows the comparison of all twelve methods for weakly dependent test statistics.Most of the estimators overestimated  0 π for small values of π 0 , however, "TSBKY" and "abh" significantly overestimated  0 π for most values of π 0 .For example, Table 4 shows that when π 0 was 0.30 "TSBKY" gave an estimate of  0 0.814 π = with an estimated standard deviation of 0.125 while "abh" gave an estimate of  0 0.619 π = with an estimated standard deviation of 0.116.For weak dependent test statistics "BUM" initially overestimated  0 π but performed better for π 0 ≥ 0.60.With weak dependence in the test statistics "smoother", "st.boot", "Langaas" and "histo" all performed very well for most values of π 0 .for small values of π 0 , however, "TSBKY" and "abh" significantly overestimated  0 π for most values of π 0 .For example, Table 5 shows that when π 0 was 0.30 "TSBKY" gave an estimate of  0 0.837 π = with an estimated standard deviation of 0.228 and "abh" gave an estimate of  0 0.668 π = with an estimated standard deviation of 0.249.In this case "smoother", "st.boot", and "jiang" performed better than most of the other estimators even though they all had a high standard deviation.A small standard deviation is highly desirable because this gives us an indication that the data points are packed very tightly around the mean value that we are estimating, so there is little spread in the data.The table shows that all of the methods appear to perform very badly for π 0 =0.40, 0.60, 0.90, 0.99, and 1.00.This might be due to the methods ignoring the correlation effect among the test statistics.

Conclusion
A simulation study to compare 12 methods for estimating the proportion of null hypotheses under different configurations of dependence.Dependence among genes exists.This paper shows which of the methods would be most appropriate under various dependence configurations.The estimation of the proportion of null hypotheses is important especially when it comes to estimation of the false discovery rate.There is no doubt that this field will continue to evolve as datasets becomes larger, the demands on statistics becomes greater, and computing hardware and software improves.

Volume 8 •
) Then starting with the largest p-value p (m) , compare each p (i) with 0 iq m until we arrive at the first p-value satisfying ( ) Issue 2 • 1000343 J Biom Biostat, an open access journal ISSN: 2155-6180would reject all k hypotheses whose p values are smaller than p (k) .
values that are equal to zero we use L'Hôspital's Rule to justify ( ) ˆ0 f π as an estimate of the cFDR:

Figure 1 :
Figure 1: Comparison of π 0 estimation methods using independent test statistics.

Figure 3 :
Figure 3: Comparison of π 0 estimation methods using moderately dependent test statistics.

Figure 3
Figure 3 shows the comparison of all twelve methods for moderately dependent test statistics.Most of the estimators overestimated  0 π

Table 1 :
Possible outcomes for a single hypothesis test.

Table 2 :
Possible outcomes for m hypothesis tests.

Table 3 :
Mean (a)and Standard Deviation (b) for compared estimation methods using independent test statistics.

Table 4 :
Mean (a)and Standard Deviation (b) for compared estimation methods using weak dependent test statistics.

Table 5 :
Mean (a)and Standard Deviation (b) for compared estimation methods using moderately dependent test statistics.