Estimating the false discovery rate using mixed normal distribution for identifying differentially expressed genes in microarray data analysis.

The recent development of DNA microarray technology allows us to measure simultaneously the expression levels of thousands of genes and to identify truly correlated genes with anticancer drug response (differentially expressed genes) from many candidate genes. Significance Analysis of Microarray (SAM) is often used to estimate the false discovery rate (FDR), which is an index for optimizing the identifiability of differentially expressed genes, while the accuracy of the estimated FDR by SAM is not necessarily confirmed. We propose a new method for estimating the FDR assuming a mixed normal distribution on the test statistic and examine the performance of the proposed method and SAM using simulated data. The simulation results indicate that the accuracy of the estimated FDR by the proposed method and SAM, varied depending on the experimental conditions. We applied both methods to actual data comprised of expression levels of 12,625 genes of 10 responders and 14 non-responders to docetaxel for breast cancer. The proposed method identified 280 differentially expressed genes correlated with docetaxel response using a cut-off value for achieving FDR <0.01 to prevent false-positive genes, although 92 genes were previously thought to be correlated with docetaxel response ones.


Introduction
Genetic markers are promising for our ability to predict the anticancer drug response in individual patients.The recent development of DNA microarray technology allows us to measure simultaneously the expression levels of thousands of genes and to identify truly correlated genes with the anticancer drug response, called differentially expressed genes, from many candidate genes by comparing the gene expression levels between cells or tissues under different conditions.However, since a typical microarray experiment measures the expression levels of thousands of genes with a small sample-size simultaneously, identifying differentially expressed genes poses complex multiple testing problems, and it is diffi cult to precisely identify differentially expressed genes using traditional statistical methods.The traditional methods such as the two-sample t-test have been used to identify differentially expressed genes [15].However, such tests often provide unreliable and inaccurate results due to strong parametric assumptions and multiple testing problems.In contrast, Bonferroni correction [5] controlling the family-wise error rate (FWER) is often too conservative, failing to identify differentially expressed genes.In order to solve this problem, the false discovery rate (FDR) is increasingly used.The FDR is defi ned as the expected proportion of false-positive genes among total identifi ed genes as an index for optimizing the identifi ability of differentially expressed genes [4].Many statistical methods have been proposed for estimating the FDR, i.e. empirical Bayes (EB) method [12], Signifi cance Analysis of Microarray (SAM) [34], and mixture model method (MMM) [24].Among them, SAM is most widely used for cancer outcome by its attractive advantages in microarray data analysis [10].Actually, the diffi culty of multiplicity problems in simultaneous testing of a large number of genes with a small sample-size data is relieved by SAM through estimating the number of false-positive genes based on a permutation procedure without strict parametric assumptions and replacing the usual t-test statistic with a SAM-statistic [34] or t-type score [24].Available computer software specifi c for SAM, also help biological researchers for managing SAM [9].The precision of estimated FDR in SAM have been examined by many researchers [14,22,23,36].Among them, Xie et al. (2005) pointed out that the permutation-based methods for FDR estimation such as SAM might overestimate FDR in a certain condition.This suggests the importance of the examination of factors such as target FDR, sample-size, and proportion of differentially expressed genes which may affect the bias and variance of estimated FDR in SAM.If the bias and variance of estimated FDR differ, depending on the experimental condition, we have to choose a suitable method for the experimental condition in a confronted case.We therefore, conducted a simulation study to examine the bias and variance of estimated FDR in SAM.
In this paper, we also propose a new method for estimating the FDR.The proposed method assumes a mixed normal distribution on t-type score, estimating the FDR for a cut-off value based on the numerical integration of probability distribution.Here, the t-type score is a test statistic with a correction term added to the denominator of the Welch type t-statistic in order to stabilize the variation of the denominator [24].We compared both bias and variance of the estimated FDR between the proposed method and SAM through the simulation study.Additionally, both methods are applied to actual data comprised of the expression levels of 12,625 genes of 10 responders and 14 non-responders to docetaxel for breast cancer (Accession No: GDS360) [20].Although 92 correlated genes with the docetaxel response were previously identifi ed using a two-sample t-test with the signifi cance level 0.001 [7], there are many falsenegative genes among unidentifi ed genes because the adopted signifi cance level is too low to get reasonable result.We, therefore, examined the FDR in this actual data using the proposed method.

t-type score
For each gene i, i = 1, 2, …, g, the expression level is X i1 , …, X im from m samples collected from cells or tissues under Condition 1, and Y i1 , …, Y in from n samples collected from cells or tissues under Condition 2. A traditional method for testing for a difference in the means between two conditions assuming a normal distribution is the two-sample t-test.However, since thousands of genes are evaluated simultaneously; when some of them have a very small sum of squares under two conditions, their absolute t-statistic becomes very large even though their mean difference is not large.This disadvantage is exacerbated due to the small sample-size.In the case where two-sample t-test is used, therefore, many non-differentially expressed genes are identifi ed as differentially expressed genes.In order to avoid this problem, a new statistic with a correction term added to the denominator of the Welch type t-statistic in order to stabilize the variation of its denominator, called t-type score, has been proposed [24].We use the t-type score as a test statistic for identifying the differentially expressed genes.Let z i denote the t-type score for gene i,

Signifi cance analysis of microarray (SAM)
SAM is often used to estimate the FDR for identifying the differentially expressed genes for cancer outcome [10].The FDR is estimated through the replications of permutation among all samples for a total of B times.For the bth permutated data, the t-type score is calculated and denoted by z i b , i = 1, …, g.When FDRsam denotes the two-sided FDR estimator, FDRsam can be written as where c 1 (Ͼ0) and c 2 (Ͻ0) are the cut-off values, respectively.We can identify over-and under-expressed genes simultaneously using the FDRsam.On the other hand, we formulate the one-sided FDR estimator for each cut-off value (c 1 , c 2 ) in order to correspond to the FDR estimator of the proposed method.When FDRsam(c 1 ) and FDRsam(c 2 ) denote the one-sided FDR estimator for c 1 and c 2 respectively, FDRsam(c 1 ) and FDRsam(c 2 ) can be written as and respectively.
The FDRsam(c 1 ) is used in order to identify the differentially expressed genes that the gene expression levels under Condition 1 over-express more than under Condition 2. On the other hand, the FDRsam(c 2 ) is used in order to identify the differentially expressed genes that the gene expression levels under Condition 1 under-express more so than under Condition 2.

Proposed FDR estimation method
We propose estimating the FDR assuming a Kcomponent mixed normal distribution on t-type score z i , i = 1, ..., g.The probability density function of K-component mixed normal distribution is where To obtain the maximum likelihood estimate θ, the Newton-Raphson method is used.The one-sided FDR for each cut-off value (c 1 , c 2 ) is estimated using the parameter estimates θ.When P TP1 and P TP2 denote the proportion of total identifi ed positive genes for each cut-off value (c 1 , c 2 ) respectively, P TP1 and P TP2 can be written as respectively.
Let P FP1 and P FP2 denote the proportion of falsepositive genes for each cut-off value (c 1 , c 2 ) respectively, P FP1 and P FP2 can be written as ˆˆˆˆ.
Note that f 0 (z; Δ 0 , V 0 ) denotes the normal distribution with the smallest absolute mean among and FDRp(c 2 ) denote the one-sided FDR estimator for each cut-off value (c 1 , c 2 ) respectively, FDRp(c 1 ) and FDRp(c 2 ) can be written as and respectively.
We can determine the cut-off value for the target one-sided FDR by changing c 1 and c 2 sequentially using Formula (11) and Formula (12).

Simulation study to examine the performance of the proposed method and SAM
In usual microarray experiments, we evaluate the gene expression levels of thousands of genes simultaneously under various experimental conditions.Specifi cally, target FDR for determining the cut-off value, the sample-size, and the proportion of differentially expressed genes are varied depending on the experimental conditions.We therefore, examined the bias and variance of estimated FDR in both the proposed method and SAM under various experimental conditions through a simulation study.Although we conducted simulation experiments using a three-component model with over-expressed genes and under-expressed genes as well as a twocomponent model, this paper discusses the result obtained using the two-component model because the results of them were similar.
As the framework of simulation, we set the following simulation conditions.

Simulation condition 1
The simulation study was designed to have g (i = 1, …, g) genes in total, with s differentially expressed and g-s non-differentially expressed.Each condition had an equal sample-size N (N = m = n).We generated, for j = 1, …, N, and Since each population mean of differentially expressed genes was different respectively, we assumed a random effect model, that is,

Simulation condition 2
The total number of replication of permutation (B) was 400 times in SAM.

Simulation condition 3
The proposed method assumes a two-component mixed normal distribution on the t-type score, estimating the parameters ( θ) by the Newton-Raphson method.
The procedure for conducting the simulation study was as follows: Step 1. Generate X ij and Y ij (i = 1, …, g, j = 1, …, N) according to Simulation Condition 1, calculating the t-type score (z i ) of g genes including the s differentially expressed genes and g-s non-differentially expressed genes.
Step 2. Determine a cut-off value (c 1 ) for target FDR (tFDR) by changing the cut-off value sequentially.
Step 3. In SAM, calculate the t-type score (z i b , i = 1, …, g, b = 1, …, 400) using 400 permutated data according to Simulation Condition 2. In the proposed method, estimate the parameters (θ) of two-component mixed normal distribution according to Simulation Condition 3.
Step 4. Estimate the FDR using Formula (3) in SAM and Formula (11) in the proposed method for a cut-off value (c 1 ).
Step 5. Repeat Steps 1 -4 1,000 times, calculating the average of the bias of the estimated FDR and the variance of the estimated FDR in both methods.
The three situations of the simulation study were as follows: Simulation situation 1 Each value is set as g = 3,000, s = 150, and N = 20, calculating the bias and variance of the estimated FDR in both methods when target FDR is set as tFDR = 0.01, 0.05, 0.1, 0.2, and 0.5 respectively.

Simulation situation 2
Each value is set as tFDR = 0.1, g = 3,000, and s = 150, calculating the bias and variance of the estimated FDR in both methods when sample-size is set as N = 5, 10, 20, 40, and 80 respectively.

Simulation situation 3
Each value was set as tFDR = 0.1, g = 3,000, and N = 20, calculating the bias and variance of the estimated FDR in both methods when the number of differentially expressed genes of the total genes is set as s = 30, 75, 150, 300, and 600 respectively.

Results of simulation study
The bias and variance of the estimated FDR by both methods under each simulation situation are shown in Table 1, Table 2, and Table 3 respectively.Table 1 suggests that the bias and variance increase as target FDR becomes high in SAM, whereas the bias and variance were almost constant regardless of the target FDR in the proposed method.Table 2 suggests that the bias increases as the sample-size becomes large in SAM, whereas the bias decreased in the proposed method.In both methods, the variance was almost constant regardless of the sample-size.Table 3 suggests that the absolute bias increases as the number of the differentially expressed genes becomes large in SAM, whereas the bias decreases in the proposed method.In both methods, the variance decreases as the number of differentially expressed genes becomes large.Additionally, when tFDR = 0.5 or s = 600 in SAM and N = 5 or 10 in the proposed method, the absolute bias is larger than 0.01.The variance is smaller than that of SAM under all situations in the proposed method, except for N = 5.

Application to actual data
We applied the proposed method and SAM to actual data comprised of the expression levels of 12,625 genes of 10 responders and 14 non-responders to docetaxel for breast cancer (Accession No: GDS360) [20].This actual data was measured and analyzed in order to identify the correlated genes with the docetaxel response for predicting anti-tumor activity of individual patients [7].Although 92 correlated genes with the docetaxel response were previously identifi ed using a two-sample t-test (signifi cance level 0.001), it was expected that there would be many false-negative genes among the genes that were not identifi ed because a very strict signifi cance level was used.We identifi ed the correlated genes with docetaxel response based on the FDR using the proposed method and SAM.
In the proposed method, we assumed fi ve mixed normal distributions on the t-type score with K = 2, …, 5, comparing their fi tness by using Akaike Information Criterion (AIC) [1].AIC is the most well-known criterion for determining the number of components in the model.As a result, we selected a two-component mixed normal distribution from the viewpoint of simplicity of interpretation, although AIC of the two-component model is almost equal to that of a three-component model.The density function of the two-component mixed normal distribution is f(z) = 0.319 f 1 (z; 0.659, 0.476) + 0.681 f 0 (z; − 0.057, 0.251).Figure 1 shows a histogram of the t-type score of 12,625 genes and the density function of a two-component mixed normal distribution.As shown in Figure1, the twocomponent mixed normal distribution fi ts the t-type score well.We also calculated the order statistics of z i s from raw data, and the expected order statistics of z i b s from 1,000 permutated data.Figure 2 shows the scatter plot of the ordered t-type score versus the expected ordered t-type score in SAM.As shown in Figure 2, it is indicated that there are many differentially expressed genes.

Discussion
While numerous research has been undertaken related to the bias of the estimated FDR by SAM [14,22,23,36], little is known about the variance of the estimated FDR by SAM.Jung and Jang (2006) [14] noted that SAM can accurately estimate FDR when the target FDR is smaller than 0.1, which is an appropriate value in usual microarray data analysis.Pan (2002Pan ( , 2003) ) [22,23] and Xie et al. (2005) [36] indicated the permutation-based methods for FDR estimation caused overestimation of FDR.In this paper, we examined both bias and variance of the estimated FDR by SAM under various experimental conditions through the simulation study in order to clarify the features of SAM.As a result of the simulation study, we uncovered some problems related to the SAM method.Estimating the distribution of non-differentially expressed genes using permutated data may not lead to precise estimation of FDR.Such a distribution based on the permutation is more dispersed than the true distribution of nondifferentially expressed genes, resulting in overestimation of the number of false-positive genes.
In particular, this disadvantage was infl uenced by the target FDR and the sample-size.In contrast, the proposed method estimates directly the distribution of non-differentially expressed genes assuming the mixed normal distribution on the t-type score.Although the estimated FDR by the proposed method was underestimated, the degree of bias of the estimated FDR in both the proposed method and SAM were almost same and the variance of the estimated FDR by the proposed method was smaller than that of SAM under all simulation situations, except for N = 5.The distribution based on the mixed normal distribution might be not more dispersed than the distribution based on the permutation.From the viewpoint of over-dispersion, therefore, the proposed method might precisely estimate the FDR than SAM.
-1.8 -1.35 -0.9 -0.45 0.0 0   In the simulation study, FDR tended to be underestimated in the proposed method and overestimated in SAM.Although the underestimation was not so large, this may cause the increase of false-positive genes.For instance, when 100 genes are identifi ed as differentially expressed genes with the target FDR 0.1, truly false-positive genes are only 10 with the unbiased FDR, whereas more than 10 false-positive genes may be included in 100 by the underestimation of the FDR.To the contrary, the overestimation may cause the decrease of truepositive genes.
Our simulation study also made clear the different strength of the proposed method and SAM.When the sample-size was as small as 10, the absolute bias in SAM was smaller than that in the proposed method, while the variance was almost the same between them.This strength of SAM may be attractive because microarray experiments are often conducted with small sample-sizes.When the number of differentially expressed genes was as small as 10% of the total genes, FDR were more accurately estimated in SAM than the proposed method.An additional simulation experiment with no differentially expressed genes, i.e. s = 0, revealed that the bias and variance of estimated FDR in SAM were slightly smaller than that in the proposed method.When the sample-size or the number of differentially expressed genes was large, however, both the bias and variance in the proposed method were smaller than those in SAM, probably because SAM could not accurately estimate the distribution of non-differentially expressed genes.The proposed method has an advantage over SAM when the sample-size is greater than 20 or the number of differentially expressed genes is greater than 10% of the total genes.Thus, the proposed method outperforms SAM when the sample-size of each group is more than 20 or the proportion of differentially expressed genes is more than 10% irrespective of the target FDR.Otherwise, SAM outperforms the proposed method.
There would be many over-expressed genes in responder group relative to non-responder group based on both Figures 1-2 in the actual data, whereas under-expressed genes would be few.Table 4 shows the estimated FDR and the number of identifi ed genes in both methods when the cutoff value is changed from 0.1 to 2.0 by 0.1.The number of identifi ed genes was equal between the two methods, because the same t-type score and cut-off value was used.According to the result of simulation study, FDR by the proposed method may be slightly underestimated since the samplesize of the responder group and non-responder group were 10 and 14, respectively, in the actual data.However, the degree of underestimation would not be so large that its infl uence might be cancelled by taking a slightly smaller value of estimated FDR than the target FDR.For instance, the estimated FDR by the proposed method is 0.007, which corresponded to the cut-off value 1.7 in Table 4, may be appropriate for the target FDR 0.01.If so, 280 genes were identified as the differentially expressed genes correlated with the docetaxel response.
variances for gene i, a 0 is the 90th percentile of s m s n

Figure 1 .
Figure 1.Histogram of the t-type score and the density function of a two-component mixed normal distribution.The solid line is f, the dotted line is f 0 , and the broken line is f 1 in a two-component mixed normal distribution.

Figure 2 .
Figure 2. Scatter plot of the ordered t-type score versus the expected ordered t-type score in SAM.
and variance V k , and mixed proportion p k .θ represents all unknown parameters {p k , Δ k , V k : k = 1, ..., K} in a K-component mixed normal model.To estimate the all unknown parameters, given z 1 , …, z g , the following log-likelihood function is maximized.log ( ; ) log ( ; ) 1

Table 2 .
Results of simulation situation 2.

Table 1 .
Results of simulation situation 1.

Table 3 .
Results of simulation situation 3

Table 4 .
Results of application of the proposed method and SAM.The estimated FDR in both methods, and the number of identifi ed genes for each cut-off value.