Sequential interim analyses of survival data in DNA microarray experiments

Background Discovery of biomarkers that are correlated with therapy response, and thus with survival, is an important goal of medical research on severe diseases such as cancer. Frequently, microarray studies are performed to identify genes whose expression levels in pretherapeutic tissue samples are correlated with the survival times of patients. Typically, such a study can take several years until the full planned sample size is available. Therefore, interim analyses are desirable, offering the possibility of stopping the study earlier or of performing additional laboratory experiments to validate the role of the detected genes. While many methods that correct the multiple testing bias introduced by interim analyses have been proposed for studies of one single feature, there are still open questions about interim analyses of multiple features, particularly of high-dimensional microarray data, where the number of features clearly exceeds the number of samples. Therefore, we examine false discovery rates and power rates in microarray experiments performed during interim analyses of survival studies. In addition, early stopping based on the interim results of such studies is evaluated. As stopping criterion we employ the achieved average power rate, i.e. the proportion of detected true positives, for which a new estimator is derived and compared to existing estimators.
Results In a simulation study, pre-specified levels of the false discovery rate are maintained in each interim analysis, and reduced levels as used in classical group sequential designs of one single feature are not necessary. Average power rates increase with each interim analysis, and many studies can be stopped prior to their planned end when a certain pre-specified power rate is achieved. The new estimator for the power rate deviates slightly from the true power rate but is comparable to other estimators.
Conclusions Interim analyses of microarray experiments can provide evidence for early stopping of long-term survival studies. The developed simulation framework, which we also offer as a new R package 'SurvGenesInterim' available at http://survgenesinter.R-Forge.R-Project.org, can be used for sample size planning of the evaluated study design.


Background
A frequent objective of cancer-related studies is to detect genes or biomarkers that can predict the outcome of therapy. The hardest criterion of success for therapies is the survival of patients. To identify predictive genes, the expression levels in samples of tumor or normal tissue are measured by DNA microarrays before the therapy is applied. Then, expression levels are compared to survival data of the patients. Usually, tissue samples become available only at distinct points in time, and it can take several years until the full planned sample size is available and follow-up is complete. In such long-lasting studies it would be beneficial to obtain interim results before their planned end. An early detection of survival-related genes within an interim analysis would, for example, allow their further laboratory validation before the end of the study. In addition, if an interim analysis provides evidence for the early stopping of the study, it would save time and costs, and would spare further patients from being involved or allow them better treatment. Classically, interim analyses are performed in studies with group sequential designs. In such designs, interim analyses are performed when a certain fraction of the full planned sample size N has been reached, for example when 1/3 · N and 2/3 · N samples are available. There are numerous articles that deal with group sequential designs in the case of one single feature, e.g. [1,2]. As the repeated testing of the same hypothesis in interim analyses is one form of multiple testing and thus inflates the overall type I error [3], both articles propose reduced significance levels to solve this problem. DNA microarray data, however, comprise expression levels for thousands of genes, meaning that there are more features than available samples. For this case of high-dimensional data, very few approaches have been published to date.
Marot and Mayer [4], for example, propose a method for combining p-values from several independent microarray analyses and show that the overall false discovery rate is not inflated when a large number of hypotheses is tested repeatedly. A similar result was obtained by Posch et al. [5]. We make use of these results and apply them to studies in which survival data are correlated with gene expression data in interim analyses. In order to detect the survival-related genes, we use gene-wise Cox regression analyses [6].
An important issue in interim analyses of multiple features is the choice of a stopping criterion. In the case of testing one single feature, the study is simply stopped when a significant result is observed. Similarly, Victor and Hommel [7] use gene-wise stopping rules in the case of high-dimensional microarray data. Marot and Mayer [4] and Posch et al. [5] propose to stop the study when a certain proportion of true positive genes has been detected. We pick up the latter idea and derive an estimator for the proportion of true positive findings. This new estimator is compared to a variant based on [8] and to an estimator proposed in [4].
The article is structured as follows. The study design, the detection of survival-related genes, the problem of performing interim analyses in the case of multiple hypothesis testing, and the stopping criterion are detailed in the Methods section. Subsequently, a simulation study evaluates the behavior of the false discovery rate and of the estimator for the proportion of true positive detections. The simulation study covers several settings of survival-focused microarray experiments with interim analyses. After presenting the results from these simulations, we apply our methods to gene expression data from a breast cancer study by van de Vijver et al. [9]. Parameters from this breast cancer study are then used in one more simulation. Finally, the results are discussed and further ideas are given.

Methods
This section starts with an illustration of the particular study design we are considering in this article. As survival is a special focus of this work, we detail afterwards the methods for the detection of survival-related genes. Next, an overview of common methods for multiple hypothesis testing and their application in group sequential interim analyses is given. In this context, we also describe rules for the early stopping of such studies.
First, let us introduce the basic notation. Let N be the total number of subjects that would be involved if the study were not stopped after any interim analysis. For each of the corresponding tissue samples, the expression levels of d genes are measured by means of DNA microarrays. We denote the complete (d × N) data matrix with all genes and all samples by X. As is typical for microarray data, each row represents one gene and each column represents one sample. Thus, $x_{ij}$ denotes the expression level of gene i in the tissue sample of subject j (i = 1, ..., d; j = 1, ..., N). An overview of all notations is given in Table 6.

Study Design
Assume that the whole study is not stopped after any interim analysis. Then the N patients will have individual arrival times $a = (a_1, ..., a_N)'$, e.g. in months after the start of the study. We denote the arrival time of the last patient to enter the study by $l_1 = \max(a)$. Thus, the first part of the study, during which patients are recruited, will take place in the time frame $[0, l_1]$.
Let us consider a second study episode which serves as a follow-up time of length $l_2$, where no new patients enter the study but the patients' survival data are still observed and updated. Thus, the length of the full study without any early stopping would be $L = l_1 + l_2$, and the time frame of the second study part would be $[l_1, L]$. The study design is visualized in Figure 1.
Assume further that $M_1$ interim analyses are planned to take place during the first study part. An interim analysis is always performed when $(1/M_1) \cdot N$ new samples are available. This ensures that each analysis adds the same number of new samples. In addition, $M_2$ interim analyses are planned for the second study part, where an interim analysis is always performed when a proportion $(1/M_2) \cdot l_2$ of the planned follow-up time has passed. Thus, analyses are performed at equally spaced time steps during this second part of the study. We denote the times at which interim analyses are performed by $t_m$ (m = 1, ..., M, where $M = M_1 + M_2$).
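The schedule of analysis times described in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `interim_times` and its arguments are hypothetical, and the sketch assumes that M_1 divides N so that every analysis in the recruitment part adds the same number of samples.

```python
def interim_times(arrivals, m1, l2, m2):
    """Return the analysis times t_1, ..., t_M with M = m1 + m2.

    arrivals : arrival times a_1, ..., a_N (any order)
    m1, m2   : planned analyses in the recruitment and follow-up parts
    l2       : length of the follow-up part
    Assumption of this sketch: m1 divides N exactly.
    """
    n = len(arrivals)
    if n % m1 != 0:
        raise ValueError("this sketch assumes that m1 divides N")
    ordered = sorted(arrivals)
    l1 = ordered[-1]                 # arrival of the last patient
    per_look = n // m1               # new samples per recruitment analysis
    # Recruitment part: analyse as soon as another N/m1 samples have arrived.
    recruitment = [ordered[k * per_look - 1] for k in range(1, m1 + 1)]
    # Follow-up part: equally spaced time steps of length l2/m2 after l1.
    follow_up = [l1 + k * l2 / m2 for k in range(1, m2 + 1)]
    return recruitment + follow_up
```

With six patients arriving at months 3, 10, 1, 7, 5 and 12, two recruitment analyses and three follow-up analyses over six months, the sketch yields the times 5, 12, 14, 16 and 18.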

Detecting Survival-related Genes
Disease-specific survival is the hardest control for measuring the success of cancer therapies. Therefore, we modelled the survival information in dependence on the gene expression data using Cox proportional hazards regression, as was proposed for example by Simon et al. [10]. Survival information consists for each patient j of a pair $(s_j, z_j)$, where $s_j$ is the observed survival time of the patient, and $z_j \in \{0, 1\}$ specifies whether or not the patient has already died at the time the analysis takes place. According to the Cox regression model, the survival information is modelled in terms of the hazard function at time t in dependence on gene i by
$$h(t \mid x_{ij}) = h_0(t) \exp(b_i x_{ij}), \tag{1}$$
where $h_0$ is an unspecified function of t, called the baseline hazard. The hazard function h(t) can be interpreted as the patient's risk of dying in a short time frame, [t, t + ε), assuming the patient has survived thus far [11]. More precisely, the hazard function is defined as
$$h(t) = \lim_{\varepsilon \to 0} \frac{P(t^* \in [t, t + \varepsilon) \mid t^* \ge t)}{\varepsilon}, \tag{2}$$
where t* is the patient's observed survival time [12]. The influence of gene i on survival can be determined by testing the hypothesis $H_{0i}: b_i = 0$ in the related Cox model. The d resulting p-values from the gene-wise survival analyses can then be adjusted for multiple testing as described below.
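To make the gene-wise test of H_0i: b_i = 0 concrete, the following Python sketch computes a p-value for a single gene. It is a simplified stand-in, not the procedure used in this article: instead of fitting the Cox model it uses the score test of the partial likelihood at b_i = 0, which needs no iterative fitting, and it treats tied death times in the Breslow manner. The function name `cox_score_test` is hypothetical.

```python
import math

def cox_score_test(times, events, x):
    """Score test of H0: b = 0 in a univariate Cox model.

    times  : observed survival times s_j
    events : z_j, 1 if the death was observed, 0 if censored
    x      : expression levels of one gene for the same patients
    Returns (chi-square statistic with 1 df, p-value).
    """
    u = 0.0  # score: sum over deaths of (x_j - mean of x in the risk set)
    v = 0.0  # information: sum over deaths of the risk-set variance of x
    for t_j, z_j, x_j in zip(times, events, x):
        if z_j != 1:
            continue  # censored patients enter only through the risk sets
        risk = [x_k for t_k, x_k in zip(times, x) if t_k >= t_j]
        mean = sum(risk) / len(risk)
        u += x_j - mean
        v += sum((x_k - mean) ** 2 for x_k in risk) / len(risk)
    if v == 0.0:
        return 0.0, 1.0  # no variation in x among patients at risk
    stat = u * u / v
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-square, 1 df
    return stat, p_value
```

Applying this function to each row of X gives d raw p-values. In practice an established implementation of Cox regression would normally be preferred over this sketch.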
In general, other models than the Cox model can be considered to detect survival-related genes. Park et al. [13] for example propose to use partial least squares regression to account for the presence of covariates.

Interim Analyses of Multiple Endpoints
In each interim analysis one statistical test is performed per gene in order to detect those genes which correspond to the studied response variable (e.g. overall survival).
If the d hypotheses were all true and independent, and each of them were tested at the same level α, the expected number of false positive detections would be given by α · d. In whole-genome microarray experiments, where d is typically about 40,000, the expected number of false positive detections (2000 at α = 0.05) would be too large to be tolerable. In multiple testing situations, it is therefore common to reduce the number of false positive decisions by controlling a pre-specified type I error rate. Note that the notion of type I error rate is not used consistently in the literature. Following Dudoit et al. [14], we will use the term type I error rate to name the superordinate concept of different types of such error rates, among which there are the family-wise error rate (FWER) and the false discovery rate (FDR).
In microarray experiments the FDR, introduced by Benjamini and Hochberg [15], is the most commonly considered type I error rate. The FDR is defined as the expected proportion of false positives among all positive test decisions, i.e. FDR = E(FP/R), where R > 0 denotes the total number of rejected null-hypotheses. The proportion of false positives (FP) among all positives itself is also known as false discovery proportion (FDP = FP/ R). In the special case that R = 0, i.e. no positive test decisions were found, the FDP as well as the FDR are defined to be zero.
The FDR can be controlled by adjusting the raw p-values resulting from the gene-wise tests. The adjusted p-values are then compared with a pre-specified level α of the FDR that is desired to be controlled. We denote the unadjusted p-value for gene i by $p_i$ and the respective adjusted p-value by $\tilde{p}_i$. In our simulation study, we consider the adjusting procedure proposed by Benjamini and Hochberg [15]. Other adjusting procedures are detailed in [14].
As an alternative to comparing the adjusted p-values with the pre-specified FDR level α, genes can be selected by comparing the raw p-values with adjusted α-levels.
According to the procedure in [15], the raw p-values are ordered by increasing size, i.e. $p_{(1)} \le p_{(2)} \le ... \le p_{(d)}$, and the largest k (k = 1, ..., d) for which $p_{(k)} \le (k/d) \cdot \alpha$ has to be determined. All hypotheses associated with $p_{(1)}$ to $p_{(k)}$ are then rejected. We denote the adjusted α-level that corresponds to this largest value of k by $\alpha_{BH}$.
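The step-up rule just described can be written down directly. The following Python sketch is a minimal illustration, with the hypothetical function name `benjamini_hochberg`; it returns the indices of the rejected hypotheses together with the adjusted level (k/d) · α for the largest admissible k, which is set to zero when nothing is rejected (an assumption of this sketch).

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure on raw p-values.

    Returns (sorted indices of rejected hypotheses, adjusted level alpha_BH).
    """
    d = len(p_values)
    order = sorted(range(d), key=lambda i: p_values[i])  # ranks by p-value
    k = 0
    for rank, i in enumerate(order, start=1):
        # Keep the LARGEST rank that satisfies p_(rank) <= (rank / d) * alpha.
        if p_values[i] <= rank / d * alpha:
            k = rank
    alpha_bh = k / d * alpha
    rejected = sorted(order[:k])
    return rejected, alpha_bh
```

Note that the loop does not stop at the first failure: a p-value with a high rank can still satisfy its threshold, in which case all smaller p-values are rejected as well. For example, with the p-values 0.03, 0.04, 0.045 and 0.049 and α = 0.05, only the largest one meets its threshold, yet all four hypotheses are rejected.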
As in multiple hypothesis testing, the control of type I error rates is an important issue in group sequential interim analyses. In clinical trials on one single feature (for example when only one gene is tested), interim analyses have been studied in depth. As was shown by Armitage et al. [3], performing interim analyses increases the probability of making a type I error. In order to avoid such an increase and to maintain a pre-specified type I error, the tests at each interim analysis are performed at lower nominal levels $\tilde{\alpha}_m$ (m = 1, ..., M). The first authors who proposed α-level adjustments for group sequential interim analyses were Pocock [1] and O'Brien and Fleming [2].
At first glance, performing interim analyses in microarray experiments seems to require two adjustments: one to account for the interim analyses and one to account for multiple testing. However, the two recently published articles by Posch et al. [5] and by Marot and Mayer [4] show that the adjustment for interim analyses can be omitted when d is large, while the FDR remains controlled. The result in [4] is based on the observation that the correlation between a single p-value and the empirical distribution of all p-values approaches zero when d gets large. The results in [5] are based on the findings of Storey et al. [16], who proved that, under certain assumptions,
$$\limsup_{d \to \infty} \mathrm{FDP}_m \le \alpha \quad \text{almost surely.} \tag{3}$$
This holds for each interim analysis m independently. Let $\hat{m} \in \{1, ..., M\}$ denote the random interim analysis at which the study is stopped. Posch et al. [5] proved that equation (3) also holds at this interim analysis $\hat{m}$, since
$$\mathrm{FDP}_{\hat{m}} \le \max_{m = 1, ..., M} \mathrm{FDP}_m. \tag{4}$$
With equations (3) and (4) and the Lemma of Fatou it follows that the FDR is controlled asymptotically when d gets large.
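The last step of this argument can be spelled out as a short chain of inequalities. The following is a sketch under the assumption that equation (3) is the almost-sure bound $\limsup_{d \to \infty} \mathrm{FDP}_m \le \alpha$ for every fixed m, and that equation (4) bounds the FDP at the random stopping analysis $\hat{m}$ by the maximum over all analyses. Because the FDP never exceeds 1, the reverse form of Fatou's lemma applies.

```latex
\begin{align*}
\limsup_{d \to \infty} \mathrm{FDR}_{\hat{m}}
  &= \limsup_{d \to \infty} E\left[\mathrm{FDP}_{\hat{m}}\right] \\
  &\le E\left[\limsup_{d \to \infty} \mathrm{FDP}_{\hat{m}}\right]
     && \text{(reverse Fatou, since } 0 \le \mathrm{FDP} \le 1\text{)} \\
  &\le E\left[\limsup_{d \to \infty} \max_{m = 1, \dots, M} \mathrm{FDP}_m\right]
     && \text{(equation (4))} \\
  &= E\left[\max_{m = 1, \dots, M} \limsup_{d \to \infty} \mathrm{FDP}_m\right]
     && \text{(finitely many analyses)} \\
  &\le \alpha
     && \text{(equation (3))}
\end{align*}
```

The exchange of maximum and limit superior in the fourth line is valid because M is finite and does not grow with d.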
This argument is not valid under the global null hypothesis (no gene significant), but it can be extended to cover that case as well [5].
When performing interim analyses in experiments with multiple endpoints, one has to decide which data to base the interim analysis on. We decided to use all available accumulated data in each interim analysis.
This approach ensures that every analysis uses the maximum amount of available data. However, one has to be aware of its drawback: it requires renormalization of the data in each analysis, which leads to inconsistencies across the interim analyses.

Stopping Rules
One important point in studies with planned interim analyses is the stop criterion. Each interim analysis provides the possibility to stop the study prior to its planned end, entailing the mentioned ethical and financial benefits. In studies on one single feature the study is usually stopped if that feature is found to be significant. In studies on a large number of features one could stop the study as soon as a pre-specified fixed number of features has been found to be significant. In the case of microarray analyses, this criterion might be useful when the number of genes that can be further validated by laboratory experiments is limited by costs or time.
Here, we follow the approach of Marot and Mayer [4], who consider as stopping criterion the achieved proportion of detected true positives, the so-called average power rate (APR),
$$\mathrm{APR} = E\left[\frac{TP}{d_1}\right], \tag{5}$$
where TP is the number of true positive test decisions and $d_1$ is the number of false null hypotheses. At each interim analysis the achieved APR is estimated, and the study is stopped if this estimate exceeds a predefined level, e.g. 80%. In the case of $d_1 = 0$ the APR is defined to be zero.
We employ a new estimator of the APR, similar to the FDR estimator of Storey and Tibshirani [8], which is based on the following relations:
$$\mathrm{APR} = \frac{E[TP]}{E[d_1]} = \frac{E[R] - E[FP]}{E[d_1]}. \tag{6}$$
The three components E[R], E[FP] and E[$d_1$] can be estimated as in [8], as shown in the following. The expectation of R can simply be estimated by the observed number of rejected hypotheses, i.e.
$$\hat{E}[R] = R. \tag{7}$$
The estimation of E[FP] is based on the fact that p-values belonging to true null hypotheses are uniformly distributed within [0, 1]. Thus, the probability that a p-value which belongs to a true null hypothesis is smaller than a threshold t (t ∈ [0, 1]) is exactly t. Therefore, if the significance level is chosen to be α' = t and $d_0$ null hypotheses are true, E[FP] can be estimated by
$$\hat{E}[FP] = \alpha' \cdot d_0 = \alpha' \cdot \pi_0 \cdot d, \tag{8}$$
where $\pi_0$ is the fraction of true null hypotheses. For α' one can, for example, choose $\alpha_{BH}$ as defined in the previous subsection. The unknown fraction $\pi_0$ of true null hypotheses can be estimated by
$$\hat{\pi}_0(\vartheta) = \frac{\#\{p_i > \vartheta; \; i = 1, ..., d\}}{d \cdot (1 - \vartheta)}. \tag{9}$$
Here, ϑ serves as a tuning parameter that balances bias versus variance. For a well-chosen ϑ, the p-values in [ϑ, 1] will belong 'mostly' to true null hypotheses, and therefore equation (9) estimates the fraction of true null hypotheses. Again, the argument is the uniform distribution of p-values belonging to true null hypotheses. See Figure 2 for a graphical illustration.
According to our simulations (see below), setting ϑ = 0.5 results in a good estimate $\hat{\pi}_0$. Other automated ways to choose ϑ have been proposed in [17] and [8]. In both cases the estimate of $\pi_0$ is used to estimate not the APR but the FDR. Storey [17] calculates the mean squared error (MSE) of the FDR estimator for a range of values of ϑ and takes the one minimizing this MSE. The calculation of this MSE is in turn based on a plug-in estimate of the FDR. The method presented in [8] does not need the FDR but is based on $\hat{\pi}_0$ only. The underlying observation is that the bias in the estimator of $\pi_0$ vanishes in the extreme choice ϑ = 1. Thus, the approach there is to set $\hat{\pi}_0 = \lim_{\vartheta \to 1} \hat{\pi}_0(\vartheta)$.
A non-parametric estimator of $\pi_0$ is given by Langaas et al. [18]. Based on this estimator and on the empirical distribution of the p-values, Marot and Mayer [4] construct an APR estimator analogous to the beta-uniform model presented in [19].
In any case, the expectation of $d_1$ can be estimated by
$$\hat{E}[d_1] = (1 - \hat{\pi}_0) \cdot d, \tag{10}$$
such that the final estimator for the APR is given by
$$\widehat{\mathrm{APR}} = \frac{R - \alpha' \cdot \hat{\pi}_0 \cdot d}{(1 - \hat{\pi}_0) \cdot d}. \tag{11}$$
We will use $\widehat{\mathrm{APR}}$ to denote the estimator that results from setting ϑ = 0.5 and $\widehat{\mathrm{APR}}^{(S)}$ to denote the estimator which is based on the procedure for estimating $\pi_0$ presented in [8]. The estimator by Marot and Mayer [4] will be denoted by $\widehat{\mathrm{APR}}^{(L)}$.
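The plug-in construction of this subsection can be illustrated with a short Python sketch. It reflects our reading of the estimator: R is the number of p-values at or below the level α', E[FP] is estimated by α' · π̂_0 · d, and E[d_1] by (1 − π̂_0) · d. The function names are hypothetical, and truncating π̂_0 at 1 and the APR estimate to the interval [0, 1] are assumptions added here for numerical robustness.

```python
def estimate_pi0(p_values, theta=0.5):
    """Fraction of true null hypotheses, estimated from p-values above theta."""
    d = len(p_values)
    above = sum(1 for p in p_values if p > theta)
    return min(1.0, above / (d * (1.0 - theta)))  # truncation is an assumption

def estimate_apr(p_values, alpha_prime, theta=0.5):
    """Plug-in estimate of the average power rate at significance level alpha'."""
    d = len(p_values)
    r = sum(1 for p in p_values if p <= alpha_prime)  # observed rejections R
    pi0 = estimate_pi0(p_values, theta)
    expected_fp = alpha_prime * pi0 * d               # estimate of E[FP]
    d1 = (1.0 - pi0) * d                              # estimate of E[d_1]
    if d1 <= 0.0:
        return 0.0  # APR is defined to be zero when d_1 = 0
    apr = (r - expected_fp) / d1
    return min(1.0, max(0.0, apr))                    # clamping is an assumption
```

As a small check, take 100 p-values of which 60 lie on an even grid over (0, 1), 20 equal 0.001 and 20 equal 0.3. With ϑ = 0.5 the estimate of π_0 is 0.6, and at α' = 0.05 there are 23 rejections, of which 3 are expected to be false, so the estimated APR is (23 − 3)/40 = 0.5.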

Simulation Study

Data Generation and Settings
In order to simulate a study of the design proposed in subsection 'Study Design', we set the following parameters. First, we chose the total number of samples to be N = 50. Furthermore, we set the intended length of the recruitment part of the study and the length of the follow-up part to 60 months each, i.e. $l_1 = l_2 = 60$. We assume that the arrival times a are distributed uniformly during the recruitment part. Thus, they were drawn from $U(0, l_1)$. The patients' survival times were assumed to follow an exponential distribution Exp(1/λ), where λ is the mean survival time. Here, we set λ = 60 months.
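The patient-level part of this setup can be sketched in Python using only the standard library. The function names and the seed are hypothetical. The rule used to derive the observed pair (s_j, z_j) at an analysis time t is an assumption of this sketch (administrative censoring at the analysis time), chosen to match the description that survival data are updated at each analysis.

```python
import random

def simulate_patients(n=50, l1=60.0, mean_survival=60.0, seed=1):
    """Draw arrival times from U(0, l1) and survival times from Exp(1/mean)."""
    rng = random.Random(seed)
    arrivals = [rng.uniform(0.0, l1) for _ in range(n)]
    survival = [rng.expovariate(1.0 / mean_survival) for _ in range(n)]
    return arrivals, survival

def observed_at(t, arrivals, survival):
    """Observed pairs (s_j, z_j) for all patients recruited by analysis time t.

    Assumption: a patient still alive at time t is censored at t - a_j.
    """
    data = []
    for a_j, t_j in zip(arrivals, survival):
        if a_j > t:
            continue                      # not yet recruited
        follow_up = t - a_j               # time under observation so far
        if t_j <= follow_up:
            data.append((t_j, 1))         # death already observed
        else:
            data.append((follow_up, 0))   # still alive: censored
    return data
```

For instance, at analysis time 10 a patient who arrived at time 0 and survives 4 months contributes (4, 1), a patient who arrived at time 5 with a true survival time of 100 months contributes (5, 0), and a patient arriving at time 20 is not yet included.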
As we wanted to generate a set of genes which correspond to the simulated survival times, we split the individuals into two groups. One group comprises the subjects with survival times shorter than the specified λ, the other group the subjects with survival times equal to or longer than λ. The gene expression data for the two groups were then drawn from multivariate normal distributions $N_d(\mu_k, \Sigma)$, k = 1, 2. Expression levels were simulated for d = 10000 genes. The different mean vectors of the two groups represent the differentially expressed genes between subjects with short or long survival. For both groups, the same covariance matrix Σ with an autoregressive structure was generated. This structure reflects the fact that some genes are highly correlated with each other while others behave rather independently. In detail, we set Σ according to equation (12). The expectation vector $\mu_1$ of one group was set to be the null vector, while a fraction of τ = 50% randomly chosen genes was altered in the other group. The alterations in $\mu_2$ were drawn from 'discretized' normal distributions. Larger fold changes were simulated via a higher standard deviation of this normal distribution. The structure of $\mu_2$ is illustrated in Figure 3.
In this way we simulate an effect of a fixed size in the gene expression depending on whether or not the patient belongs to the group of long-term survivors. Of course, in biology the inverse direction is true, i.e. survival is regulated by gene expression. However, we think that for our purpose it does not matter in which order the survival and expression data are generated. In addition, it is typical in biology that a gene is either up- or down-regulated. Hence, only the direction of regulation but not the strength of regulation influences the outcome. Therefore, we chose to model the relationship between single genes and survival by a discrete function and not by a continuous one.
At each interim analysis, a gene-wise Cox regression was performed to detect the survival correlated genes, and resulting p-values were adjusted to control the FDR at a level of 5%. Following the results of Marot and Mayer [4] and of Posch et al. [5], no additional adjustment for interim analyses was performed. The number of simulation runs was set to 1000 for each setting. All simulations were performed with the free software R in version 2.10 [20].
We simulated two different setups. One, where 2 interim analyses are scheduled for both study parts, i.e. M 1 = M 2 = 2, and one where 5 analyses are planned to take place per part, i.e. M 1 = M 2 = 5. In each simulation run, our power estimator described in Section 'Stopping Rules' was applied and the study was stopped when an estimated APR of 80% was achieved.
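The stopping rule applied in each simulation run, and the tally of stopping points over many runs, can be sketched as follows. The function names are hypothetical, and the sketch treats "achieved" as "greater than or equal to the target", which is an assumption about how the boundary case is handled.

```python
from collections import Counter

def first_stop(apr_estimates, target=0.8):
    """1-based index of the first analysis whose estimated APR reaches the
    target, or None if the study runs to its planned end without stopping."""
    for m, apr in enumerate(apr_estimates, start=1):
        if apr >= target:
            return m
    return None

def stop_distribution(runs, target=0.8):
    """Fraction of simulation runs stopped at each analysis.

    runs : list of per-run sequences of estimated APRs, one value per analysis
    """
    counts = Counter(first_stop(run, target) for run in runs)
    return {m: c / len(runs) for m, c in counts.items()}
```

A run with estimated APRs of 0.1, 0.5, 0.82 and 0.9 would thus be stopped at the third analysis. In `stop_distribution`, the key `None` collects the runs that never reached the target.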
As a more extreme setting we additionally simulated in the second setup (M 1 = M 2 = 5) the situation of a smaller fraction τ = 5% of altered genes.

Adherence to False Discovery Rate
As no interim-specific adjustments were applied, the question arises whether the pre-specified FDR level of 5% was maintained at each analysis. Figure 4 displays the mean simulated FDR at each interim analysis in two different settings. Table 1 and Table 2 contain the mean and standard deviation of the FDR for these cases. Both results were obtained with the fold changes for the genes between long-term and short-term survivors being generated as shown in Figure 3(b). The pre-specified FDR level of 5%, indicated by the dashed line, is maintained at nearly every analysis.
One can observe that with only a small number of patients available the problem is harder, such that only a small number of genes is detected. In such cases each false positive gets more weight in the calculation of the FDR, and one has to expect higher FDR levels. In later interim analyses, the simulated FDR stabilizes at a more conservative level. In the setting of M = 10 interim analyses, the FDR is considerably small in the very first interim analysis. This observation can be explained by the fact that the FDR is defined to be zero when no genes are found at all.
In the more extreme setting with only τ = 5% survival-related genes, the overall course of the FDR over the interim analyses, as shown in Figure 6(a) and Table 3, stays the same, but the peak in the early analyses becomes more prominent, and the specified FDR level is not strictly maintained during the later interim analyses either.

Average Power Rate and Early Stopping
In Figure 5, the estimated and true APR at each interim analysis are shown. The corresponding descriptive values can be found in Tables 1 and 2. In both setups, half of the interim analyses take place during the recruitment part of the study and the other half during the follow-up part. In both cases, the estimated and the true APR increase with each interim analysis. In addition, true and estimated power do not diverge dramatically; however, our estimator ($\widehat{\mathrm{APR}}$) appears to be slightly liberal. The estimator $\widehat{\mathrm{APR}}^{(S)}$, where we plug in the $\pi_0$ estimation procedure of Storey and Tibshirani [8], performs comparably. The power estimator by Marot and Mayer [4] ($\widehat{\mathrm{APR}}^{(L)}$) overestimates the real power. The pre-specified stopping criterion, an estimated APR of 80%, is represented by the dashed line. On average, this criterion is achieved at the 3rd analysis in the case of M = 4 planned interim analyses, and at the 7th analysis in the case of M = 10 planned interim analyses. In addition, true and estimated APR do not rise much above the desired level of 80%. In particular, the power increases when new samples are included during the recruitment part, but nearly stagnates in the follow-up part, where only the survival data are updated.
One main interest of our simulation was to find out whether interim analyses can allow an early stopping in such survival studies. Figure 7 shows for each interim analysis the fraction of simulation runs which could be stopped at this point. Both panels show the simulations with M = 10 analyses. The fold changes in these two settings were generated either with small effects or with large effects (compare Figure 3). In the case of small effects (Figure 7(a)), only 60% of all simulated studies reached the planned final analysis, while 40% were stopped at an interim analysis. In the case of large effects (Figure 7(b)), more than 80% of the simulated studies were stopped before the final analysis.
The average power rate and its estimates in the harder setting with only τ = 5% survival-related genes are shown in Figure 6(b). In this setting neither the true APR nor its estimates reach the 80% level, such that no study was stopped at earlier analyses. While $\widehat{\mathrm{APR}}^{(L)}$ again overestimates the true APR, the other estimators become conservative.

Application to Breast Cancer Data
In order to evaluate our method on real data, we analyzed gene expression levels from 295 patients suffering from breast cancer [9]. The data contain expression levels of 24496 genes. In this study, patients were recruited between 1984 and 1995. Thus, the length of the recruitment part of the study was $l_1 = 11$ years. Exact arrival times were not given in the publicly available data set; thus, we drew these times randomly from a uniform distribution $U(0, l_1)$. To account for random effects which might have been introduced by drawing the arrival times, we repeated the simulated study based on these data 1000 times with newly drawn arrival times and took the average of the resulting error and power estimates.
In the data, minimum and maximum survival was 0.5 and 18 years, respectively. Median survival was 7 years. We analyzed the data set with M 1 = 5 interim analyses during the recruitment part of the study and M 2 = 5 analyses within the follow-up part. We intended to control the FDR at a level of 5% and to stop the study when the estimated APR exceeds 50%.
The raw p-values from the final analysis are displayed in Figure 2. From this figure it can be seen that the suggested [8] choice of ϑ = 0.5 is indeed a good choice for this data, as the histogram resembles a uniform distribution very well in the range [0.5, 1]. Figure 8 shows the estimated FDR and estimated APR at each interim analysis. The pre-specified FDR level of 5% is not exceeded, and the estimated APR increases with each interim analysis. As in the simulation study, the estimator $\widehat{\mathrm{APR}}^{(L)}$ seems to overestimate the real APR. The estimators $\widehat{\mathrm{APR}}$ and $\widehat{\mathrm{APR}}^{(S)}$ again perform comparably. In the first interim analysis, no significant genes are detected. Thus, FDR and APR are equal to zero.
Interestingly, the stopping criterion of an estimated APR of 50% is not achieved in any interim analysis when the more reliable APR estimators are used. In detail, the maximum achieved APR was only 27% (with 1900 detected genes). Therefore, the study could not be stopped early with this criterion. Figure 8(b) illustrates the different character of the recruitment and follow-up parts of the study. The increase of power is much stronger in the recruitment part than during the follow-up part, meaning that in this study not the available survival data but the sample size was the more restraining factor.
Table caption: The descriptive values for each interim analysis, including the number of detected genes, the simulated FDR, the real APR, and the APR estimates, all in the setting of τ = 50% altered genes with large fold changes and ten interim analyses. These values correspond to Figure 4(b) and Figure 5(b).
Table caption: The descriptive values for each interim analysis, including the number of detected genes, the simulated FDR, the real APR, and the APR estimates, all in the setting of τ = 5% altered genes and ten interim analyses.

Simulation with Parameters from Breast Cancer Data
In order to perform the simulation also with different distributional assumptions, we performed an additional simulation, where parameters were taken from the breast cancer data described in the previous section. To this end we simulated the patient data and the gene expression data for 295 patients. Because the recruitment time of the real study was 11 years, the arrival times were again drawn randomly from a uniform distribution U(0, 11). A Weibull, a gamma, and a log-normal distribution were fitted to the survival times of the real data using the fitdistcens function from the fitdistrplus R package [21]. We employed the Akaike Information Criterion (AIC) to select among the fitted distributions, which led to the log-normal distribution, from which the survival times were then drawn. We wanted to set the proportion of survival-related genes according to the real data set. We therefore used the results from the previous section, where in the last analysis 1900 genes were found and the APR was estimated to be 27%. Thus, the total number of survival-related genes in this data set was estimated to be 7037 (1900/0.27 ≈ 7037).
To generate the gene expression data, we divided the patients -as before -into two groups along their median survival time. The gene expression data was again multivariate normally distributed in both groups. The mean vector was set to 0 in one group. For the other group we used a discretization of the difference between the empirical mean vectors in both groups of the real data set. We chose the discretization grid to consist of steps with a width of 0.04, which resulted in 8542 survival related genes.
The distribution of the resulting mean vector of the second group is given in Table 4.
We first estimated the covariance matrix as proposed by Schäfer and Strimmer [22], using the implementation in the R package corpcor, but this resulted in a maximum power of only 2.6% in the final analysis.
Therefore, we again employed a covariance matrix based on equation (12), as it was used in the other simulations. Because the effects in this simulation (see Table 4) were rather small, we reduced the simulated variance by a factor of 10 compared to the previous simulations.
The results are shown in Figure 9 and Table 5. As in the simulation setting with τ = 5% survival related genes, the FDR peaks at the 3rd interim analysis, where only 14 genes were found on average. The APR estimators are comparable to the APR estimators in the real data, but show an erroneous aberration in the second analysis, even though there are no genes detected in that analysis. This aberration is corrected, though, in the third analysis as soon as there are some genes found.

Discussion
Typically, survival studies require long time spans from recruitment of the first patients until the availability of first results. Therefore, there is a strong desire to obtain results prior to the planned end of the study, not only for financial aspects but also for ethical ones. Classical group sequential designs exhibit a methodology for interim analyses including the potential for an early stopping of a trial. Whereas these classical methods concentrate on studies with one single feature, there has little been done for the case of multiple features, particularly the high-dimensional case. However, many survival studies now concentrate on correlating observed survival times with high-throughput data from genomics or proteomics experiments which yield expression levels for thousands of features measured in a small number of samples.
Based on the findings of Marot and Mayer [4] and Posch et al. [5] we simulated the possibility of early stopping in interim analyses of survival data in microarray experiments. In line with these prior findings, we observed that a pre-specified false discovery rate is maintained during interim analyses without particular adjustments. That is, adjustment appears to be necessary only for multiple testing but not additionally for interim analysis. While it was shown in the two mentioned articles that this principle holds asymptotically when the number of tested hypotheses is large, we have seen in further simulations, beyond those presented in the section on the simulation, that it also works for rather small numbers of hypotheses (e.g. testing 500 genes).
We used the Benjamini-Hochberg procedure for the multiple testing adjustment, even though this procedure does not control the FDR under arbitrary dependency structures. However, in our simulations and in the real data example, this procedure mostly controlled the FDR. We believe that in microarray studies a strict control of the FDR is of minor importance, as microarray studies are mainly used for hypothesis generation and thus need further validation anyway. In cases where a stricter control of the FDR is required, the more conservative procedure of Benjamini and Yekutieli [23] might be more appropriate.
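The two adjustments mentioned above can be sketched as follows. This is a minimal NumPy illustration of the standard step-up adjusted p-values, not the code of our R package; the function name `fdr_adjust` and its `dependent` switch are hypothetical.

```python
import numpy as np

def fdr_adjust(pvalues, dependent=False):
    """Return step-up adjusted p-values.

    dependent=False gives the Benjamini-Hochberg adjustment.
    dependent=True gives the Benjamini-Yekutieli adjustment, which multiplies
    by the harmonic sum 1 + 1/2 + ... + 1/m and is therefore more conservative.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    ranks = np.arange(1, m + 1)
    order = np.argsort(p)
    scaled = p[order] * m / ranks
    if dependent:
        scaled = scaled * np.sum(1.0 / ranks)
    # Enforce monotonicity, starting from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

# Hypothetical p-values of four genes at one interim analysis.
pvals = [0.01, 0.04, 0.03, 0.005]
bh = fdr_adjust(pvals)
by = fdr_adjust(pvals, dependent=True)
# Genes with an adjusted p-value at or below the FDR level are declared detected.
detected_bh = bh <= 0.05
detected_by = by <= 0.05
```

With these illustrative values, all four genes are detected at an FDR level of 5% under the Benjamini-Hochberg adjustment, whereas only two remain under the Benjamini-Yekutieli adjustment.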
An important issue in interim analyses of high-dimensional data is the choice of an adequate stopping criterion. Here, we chose the achieved average power rate as stopping criterion, which is defined as the proportion of false null hypotheses that are detected. We derived a new estimator for the average power rate that comes close to the true proportion of true positive findings. However, this estimator behaved slightly liberally when the data contained many survival related genes and conservatively when the data contained few survival related genes. We also tried other methods such as the more sophisticated ϑ estimator given in [8] and the APR estimation method proposed in [4], which resulted in comparable and worse approximations, respectively. Improvements therefore remain necessary. With this criterion we observed that early stopping can be achieved in certain studies, depending on the actual proportion of false null hypotheses and the effect sizes (size of fold changes). We applied the methods to gene expression data from a microarray study on breast cancer. In this analysis we obtained an estimated average power of 20% at the fifth interim analysis (i.e., roughly five years after the beginning of the study) and of 27% at the eighth interim analysis (i.e., roughly after eight years). These estimated proportions seem to be rather small. However, the estimated APR of 27% in the final analysis corresponds to about 1900 genes detected by Cox regression (see Figure 10). This set might provide a signature that makes it possible to build a survival predictor of sufficient quality. Predictor quality and classification accuracy are other interesting stopping criteria for interim analyses of high-dimensional data.
Table caption: In the simulation based on the parameters from the breast cancer data, the mean vector of the gene expressions of the long-time survivors is set to this discretized version of the difference of the empirical means of the two groups in the real data.
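The stopping criterion can be illustrated with a short sketch for the simulation setting, where the truly survival related genes are known. It computes the realized proportions from the true labels and does not reproduce the APR estimator derived in this paper; the function names `realized_rates` and `first_stop` and the example values are hypothetical.

```python
import numpy as np

def realized_rates(rejected, is_false_null):
    """Return (false discovery proportion, average power rate) for one analysis.

    The FDR is the expectation of the false discovery proportion over
    repeated simulation runs; a single run only yields the proportion.
    """
    rejected = np.asarray(rejected, dtype=bool)
    is_false_null = np.asarray(is_false_null, dtype=bool)
    true_pos = np.sum(rejected & is_false_null)
    n_rejected = rejected.sum()
    n_false_null = is_false_null.sum()
    fdp = (n_rejected - true_pos) / n_rejected if n_rejected > 0 else 0.0
    apr = true_pos / n_false_null if n_false_null > 0 else 0.0
    return float(fdp), float(apr)

def first_stop(apr_by_interim, target):
    """Return the 1-based index of the first interim analysis reaching `target`."""
    for k, apr in enumerate(apr_by_interim, start=1):
        if apr >= target:
            return k
    return None  # the target is never reached, so the study runs to its planned end

# Five genes: which were detected and which are truly survival related.
fdp, apr = realized_rates([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
# Hypothetical APR values over three interim analyses with a target of 20%.
stop_at = first_stop([0.05, 0.20, 0.27], target=0.20)
```

In a real study the set of false null hypotheses is unknown, so the realized APR in this sketch would have to be replaced by an estimate.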
The prognostic value of survival models based on gene expression signatures was, for example, studied by Hielscher et al. [24]. Evaluation of such alternative stopping criteria remains an open point which we are going to study in our further research. The necessary sample size is another important point in planning microarray studies in combination with survival data. Our simulation study provides the basis for such sample size considerations. With certain information from pilot studies, including expected distributions of fold changes and expected survival times, our simulation approach can be used to study the development of the APR in interim analyses. Therefore, we made our R code available on the CRAN repository (http://cran.r-project.org) as the package survGenesInterim.
Table caption (column header: Interim Analysis): The descriptive values for each interim analysis, including the number of detected genes, the simulated FDR, the real APR, and the APR estimates, all in the setting where parameters were taken from the breast cancer data.
Several extensions of our simulation framework can be considered. In the analysis of microarray experiments, normalization is an important pre-processing step to make the single arrays comparable. As a methodological improvement, we therefore intend to add different normalization approaches to our simulation framework. When making interim analyses, one can for example consider a re-normalization with each set of new array data, or use the normalization parameters obtained in a previous interim analysis. Another improvement can be considered with regard to the survival analysis. While we have used the proportional hazards model here, its assumption may not always hold, and other models with time-variant effects may be more reliable.

Conclusions
Group sequential interim analyses of microarray experiments in survival studies are frequently performed without considering whether the overall error rate is maintained. Our simulation framework helps to evaluate the behaviour of error rates and power rates in such experiments. The framework also makes it possible to study how results develop when survival data are updated at subsequent times during studies that take several years.