Robust Likelihood-Based Survival Modeling with Microarray Data

Gene expression data can be associated with various clinical outcomes. In particular, these data can be important for discovering survival-associated genes for medical applications. As alternatives to traditional statistical methods, sophisticated methods and software programs have been developed to overcome the high-dimensional difficulty of microarray data. Nevertheless, new algorithms and software programs are needed that include practical functions such as the discovery of multiple sets of survival-associated genes and the incorporation of risk factors, and that run in the R environment, with which many statisticians are familiar. For survival modeling with microarray data, we have developed a software program (called rbsurv) which can be used conveniently and interactively in the R environment. This program selects survival-associated genes based on the partial likelihood of the Cox model and separates training and validation sets of samples for robustness. It can discover multiple sets of genes by iterative forward selection rather than one large set of genes, and it allows adjustment for risk factors in microarray survival modeling. The rbsurv package can thus be used to discover survival-associated genes with microarray data conveniently.


Introduction
Gene expression can be associated with clinical outcomes such as survival. Genes associated with clinical outcomes can serve as biomarkers for medical uses, and many investigators have made efforts to discover such biomarker genes. For these purposes, the development of microarray technology presents a challenge to quantitative researchers as well as biological researchers.
To discover survival-associated genes, statistical methods for survival analysis such as the Cox model, the log-rank test, and the Wald test have been applied to various disease studies with microarray experiments (Rosenwald et al. 2002; Beer et al. 2002; Wigle et al. 2002; Jenssen et al. 2002; Freije et al. 2004; Sanchez-Carbayo et al. 2006; Mandruzzato et al. 2006; Matsui 2006). Score or Mantel tests were also employed (Shannon et al. 2002; Goeman et al. 2005; Jung et al. 2005). L1- or L2-penalized estimation for the Cox model or transformed models was applied to improve performance (Bair and Tibshirani 2004; Gui and Li 2005; Xu et al. 2005), and partial least squares and LASSO were utilized for data reduction (Nguyen and Rocke 2002; Park et al. 2002). Bayesian approaches were also applied (Tadesse et al. 2005; Sha et al. 2006).
Although such various algorithms exist, we wanted to develop a new software program that can be used conveniently and interactively in the R environment (R Development Core Team 2008), because R is a widely used statistical package and practical functions for microarray survival analysis need to be included. We utilized the partial likelihood of the Cox model, which has been the basis for many of the aforementioned methods. Our algorithm is simple and straightforward, but its functions, such as the generation of multiple gene models and the incorporation of risk factors, are practical. For robustness, it selects survival-associated genes by separating training and validation sets of samples, because such a cross-validation technique is essential in predictive modeling for data with large variability. The program employs forward selection to generate a series of gene models and then selects an optimal model. Furthermore, iterative runs after putting aside the previously selected genes can discover masked genes that may be missed by forward selection. This software package, the rbsurv package, is available from the Bioconductor website at http://www.bioconductor.org/ (Gentleman et al. 2004), which provides many bioinformatics packages for the R environment (R Development Core Team 2008). A programming example can be found in the package's vignette.

Robust likelihood-based survival modeling
Suppose the data consist of G genes and N samples, and each sample has an observed (survival or censoring) time and a censoring status. Thus, the data consist of the triples (Y_j, \delta_j, X_j), j = 1, \ldots, N, where Y_j and \delta_j are the observed time and censoring status (usually 1 = died, 0 = censored) for the j-th sample, respectively, and X_j = (X_{1j}, X_{2j}, \ldots, X_{Kj}) is the vector of expression values of K genes for the j-th sample (K < N). Let Y_{(1)} < Y_{(2)} < \cdots < Y_{(D)} denote the ordered times with D distinct values, and let X_{(i)k} be the value of the k-th gene for the sample corresponding to Y_{(i)}. The Cox proportional hazards model (Cox 1972) is

    h(y \mid X_1, X_2, \ldots, X_K) = h_0(y) \exp\Big( \sum_{k=1}^{K} \beta_k X_k \Big),    (1)

where h(y | X_1, X_2, \ldots, X_K) is the hazard rate at time y for a sample with risk vector (X_1, X_2, \ldots, X_K), h_0(y) is an arbitrary baseline hazard rate, and \beta_k is the coefficient for the k-th gene. The partial likelihood for the Cox model is

    L(\beta) = \prod_{i=1}^{D} \frac{ \exp\big( \sum_{k=1}^{K} \beta_k X_{(i)k} \big) }{ \sum_{j \in R(Y_{(i)})} \exp\big( \sum_{k=1}^{K} \beta_k X_{kj} \big) },

where R(Y_{(i)}) is the set of all samples that are still under study at a time just prior to Y_{(i)}. Maximizing the partial likelihood provides the maximum likelihood estimates (MLEs) of the coefficients, denoted by \hat\beta_1, \hat\beta_2, \ldots, \hat\beta_K. Then, as a goodness-of-fit measure, we can use the fitted log partial likelihood:

    loglik = \log L(\hat\beta).    (2)

The negative log-likelihood (-loglik) is greater than zero, so the smaller the -loglik, the better the model. For robustness, however, the model should be evaluated with independent validation samples rather than with the training samples used for fitting the model:

    loglik* = \log L^*(\hat\beta),

where * indicates the use of the validation samples and the estimates \hat\beta_1, \hat\beta_2, \ldots, \hat\beta_K are obtained from the training samples. For robust gene selection, we thus use training samples for model fitting and validation samples for model validation, and this cross-validation is repeated many times independently. In other words, we fit the Cox model with a gene (or genes) and select the gene (or genes) maximizing the mean loglik* (i.e., minimizing the mean negative loglik*).
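The partial likelihood and its validation-based evaluation can be sketched in a few lines of code. The following is a minimal illustration in Python (not rbsurv's implementation, which is in R), assuming a single gene and no tied death times; the grid search stands in for the Newton-Raphson step a real Cox fit would use.

```python
import math

def cox_partial_loglik(beta, times, status, x):
    """Cox log partial likelihood for one gene, assuming no tied death times.

    Each death contributes beta*x_i - log(sum over the risk set of exp(beta*x_j)).
    """
    ll = 0.0
    for t_i, d_i, x_i in zip(times, status, x):
        if d_i != 1:  # censored samples contribute no term of their own
            continue
        # Risk set R(t_i): samples still under study just prior to t_i.
        risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
        ll += beta * x_i - math.log(sum(math.exp(beta * x_j) for x_j in risk))
    return ll

# Toy data: 5 samples with observed times, status (1 = died, 0 = censored),
# and one gene's (centered) expression values.
times  = [2.0, 3.5, 4.0, 5.0, 7.0]
status = [1,   0,   1,   1,   0]
x      = [1.2, 0.4, 0.9, -1.1, -0.3]

# Crude grid search for the MLE beta-hat over [-3, 3].
betas = [b / 100.0 for b in range(-300, 301)]
beta_hat = max(betas, key=lambda b: cox_partial_loglik(b, times, status, x))
```

Evaluating `cox_partial_loglik` with a training-set estimate on held-out samples gives the loglik* used above for robust gene selection.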

Robust gene and model selections
Independent validation samples are usually unavailable. Therefore, the given samples are randomly divided into training and validation sets. To avoid a bias arising from a particular partition of the samples, such partitions are repeated many times and the measurements are averaged. By robust likelihood-based modeling with repeated random partitions, we select the best predictive gene g(1), and then the next best predictive gene g(2) while retaining the first selected gene g(1). This forward gene selection generates a series of predictive gene models: g(1), g(1) + g(2), g(1) + g(2) + g(3), and so on. This can be continued until fitting is impossible because of the lack of samples. It is also important to select an optimal model among this series of gene models. In microarray studies, sufficient samples for double-partitioning, as well as independent samples, are often unavailable because of the limited supply of biological samples or the high cost of the experiments. Double-partitioning means that the samples are partitioned into training and validation sets for selecting predictive genes and a test set for selecting an optimal model. Thus, we may have to reuse the samples used for gene selection. For this, it is essential to employ an error measure that can prevent over-fitting. For instance, if we test the series of gene models on the same samples with loglik, we always select the largest model. To avoid this over-fitting, we employ the Akaike information criterion (AIC), -2 loglik + a k, where k is the number of parameters in the model and a is a pre-specified constant (usually a = 2). We select the model with the smallest AIC. In this way we can select an optimal robust model, which may not be the largest model that would be selected using loglik alone. We demonstrate this in the examples section.
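The repeated-random-partition evaluation and the AIC criterion can be sketched as follows (a Python illustration with hypothetical placeholders, not rbsurv code: `fit` stands for a Cox fit on the training samples and `loglik` for its log partial likelihood scored on the validation samples):

```python
import random

def mean_validation_loglik(samples, fit, loglik, b=100, p=1/3, seed=1234):
    """Average loglik* over b random partitions: (1-p) training, p validation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(b):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        n_val = int(len(shuffled) * p)
        validation, training = shuffled[:n_val], shuffled[n_val:]
        beta_hat = fit(training)               # estimates from the training set only
        total += loglik(beta_hat, validation)  # loglik* scored on the validation set
    return total / b

def aic(loglik_value, n_params, a=2):
    """AIC = -2*loglik + a*k; the model with the smallest AIC is selected."""
    return -2.0 * loglik_value + a * n_params
```

Averaging over many partitions, rather than trusting one split, is what makes the resulting gene ranking robust to how the samples happen to be divided.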

Multiple optimal sets of genes
As described above, an optimal model consists of several survival-associated genes, which can be selected and then utilized. However, other genes may also be associated with survival yet not be members of the model because of a masking effect. For instance, suppose there exist two genes similarly associated with survival. If one gene is selected, the other may not improve the model further and thus may not be selected. Nevertheless, the other survival-associated gene may still play an important role. Thus, it is more meaningful to select and provide multiple optimal sets of genes rather than a single optimal set, as indicated by Ein-Dor et al. (2005). We therefore put aside the genes in the first model and construct another optimal model. Again, we can construct a third optimal model after putting aside the genes in the first and second optimal models. In this manner, we can build multiple optimal models. This iterative procedure can also mitigate the limitation of forward gene selection, which depends highly on the previously selected genes. The number of optimal models can be determined from a practical point of view; the cost and time of confirmation experiments in each lab can be factors in determining it. Among the selected models, the first optimal one might be the best statistically, but this does not mean that it is the best biologically.

Adjusting for risk factors
Survival may be associated with risk factors, such as age and disease stage, rather than with certain genes. Survival modeling without adjustment for risk factors may therefore find genes associated with those risk factors rather than with survival itself. We can improve the ability to discover truly survival-associated genes by modeling genes after adjusting for known risk factors. Thus, we allow adjustment for risk factors in the robust likelihood-based survival modeling described above.
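In the notation of the modeling section, adjusting for risk factors Z_1, \ldots, Z_q amounts to keeping the risk-factor terms in every candidate model (this display is our own sketch of the adjusted model, with \gamma_m denoting the coefficient of risk factor Z_m; it is not a formula quoted from the package):

```latex
h(y \mid X_1, \ldots, X_K, Z_1, \ldots, Z_q)
  = h_0(y) \exp\Big( \sum_{k=1}^{K} \beta_k X_k + \sum_{m=1}^{q} \gamma_m Z_m \Big),
```

where the \gamma_m Z_m terms remain fixed in every model fit while genes are selected over the \beta_k.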

Reducing computing time
Forward gene selection by repeatedly swapping training and validation samples substantially inflates computing time when modeling high-throughput data such as microarray gene expression data with tens of thousands of genes. Therefore, it is crucial in practice to reduce computing time without losing meaningful information. A fundamental step is to filter out uninformative probe sets from the microarray experiments. For instance, probe sets may be kept if their coefficient of variation is greater than 0.2 and at least 10% of samples have an expression intensity greater than 500 (Freije et al. 2004). However, many probe sets are often left after such initial filtering, and it wastes computing time to evaluate all of them because many may not be associated with survival. Thus, univariate survival modeling and evaluation with the whole set of samples can reduce the number of candidate genes without losing important ones. This univariate survival pre-selection is adopted in our software program.
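As an illustration of the filtering rule cited above (the thresholds are the ones quoted from Freije et al. 2004), a minimal Python sketch; the sample standard deviation is used for the coefficient of variation:

```python
import statistics

def keep_probe(intensities, cv_cutoff=0.2, level=500, frac=0.10):
    """Keep a probe set if CV > cv_cutoff and >= frac of samples exceed `level`."""
    mean = statistics.mean(intensities)
    cv = statistics.stdev(intensities) / mean   # coefficient of variation
    n_high = sum(1 for v in intensities if v > level)
    return cv > cv_cutoff and n_high >= frac * len(intensities)

def filter_probes(expr):
    """expr: dict of probe-set ID -> list of intensities across samples."""
    return {pid: vals for pid, vals in expr.items() if keep_probe(vals)}
```

A probe set that is nearly constant across samples (CV near zero) carries no information for survival modeling and is dropped regardless of its absolute intensity.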
We can also consider a biological approach if other data sets from related studies are available. For instance, suppose we have a microarray data set with drug A-treated and untreated samples, in addition to a microarray data set for cancer patients with survival. Examining the differential expression under the two experimental conditions provides drug A-induced genes. This handful of drug A-induced genes can then be used for robust survival modeling, and the finally selected genes are both drug A-induced and associated with survival.

Algorithm for modeling microarray data with survival
We now summarize the algorithm described above. Suppose the data consist of expression values X for G genes and N samples. Each sample has an observed (survival or censoring) time Y and censoring status δ. We assume that the expression values have already been normalized and transformed in appropriate ways. Prior gene selection, such as univariate survival modeling and/or biological pre-selection, can also be performed if necessary; univariate survival modeling can be performed within our software program. Our algorithm is summarized as follows.
1. Randomly divide the samples into a training set with N(1 − p) samples and a validation set with Np samples (e.g., p = 1/3). Fit the Cox model with a single gene to the training set and obtain the parameter estimate β̂_i for that gene. Then evaluate loglik* with the estimate β̂_i and the validation set of samples (Y*_j, δ*_j, X*_j). Perform this evaluation for each gene.
2. Repeat the above procedure B times (e.g., B = 100), thus obtaining B values of loglik* for each gene. Then select the best gene, namely the one with the smallest mean negative loglik* (equivalently, the largest mean loglik*). The best gene is the most survival-associated one selected by the robust likelihood-based approach.
3. Let g(1) be the best gene selected in the previous step. Adjusting for g(1), find the next best gene by repeating the previous two steps. In other words, evaluate g(1) + g_j for every remaining gene j and select an optimal two-gene model, g(1) + g(2).
4. Continue this forward selection, adding one gene at a time, to generate a series of candidate models M_1, M_2, . . ., M_K, where M_k contains the first k selected genes.
5. Compute the AICs of all K candidate models M_1, M_2, . . ., M_K and select the optimal model with the smallest AIC.
6. Put aside the genes in the optimal model from the previous step and repeat steps 1-5. This can be done several times sequentially (e.g., 3 times), generating multiple optimal models.
In addition, suppose that risk factors Z_1, Z_2, . . ., Z_q are available for each sample. Then the risk factors can be included in every model fit of the above algorithm.
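The six steps can be sketched end-to-end as follows (a Python illustration with hypothetical helpers, not rbsurv's internal code: `score(genes)` stands for the mean validation loglik* of steps 1-2, and `loglik_of(model)` for the fitted loglik used in the AIC of step 5):

```python
def forward_select(all_genes, score, max_size):
    """Steps 2-4: greedy forward selection producing candidate models M_1..M_K."""
    selected, models = [], []
    candidates = set(all_genes)
    while candidates and len(selected) < max_size:
        best = max(candidates, key=lambda g: score(selected + [g]))
        selected = selected + [best]
        candidates.discard(best)
        models.append(list(selected))  # M_k contains the first k selected genes
    return models

def optimal_model(models, loglik_of, a=2):
    """Step 5: choose the model with the smallest AIC = -2*loglik + a*k."""
    return min(models, key=lambda m: -2 * loglik_of(m) + a * len(m))

def multiple_models(all_genes, score, loglik_of, max_size, n_seq=3):
    """Step 6: put aside the selected genes and rerun, yielding multiple models."""
    remaining, optimal = set(all_genes), []
    for _ in range(n_seq):
        models = forward_select(sorted(remaining), score, max_size)
        if not models:
            break
        best = optimal_model(models, loglik_of)
        optimal.append(best)
        remaining -= set(best)  # mask the chosen genes for the next run
    return optimal
```

Because each sequential run starts from the genes not yet chosen, later models can surface survival-associated genes that an earlier model masked.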

Software
The above algorithm was implemented in an R package (called rbsurv), which employs two other R packages, Biobase (Gentleman et al. 2004) and survival (Therneau and Lumley 2008). The rbsurv package can be used conveniently and interactively in the R environment (R Development Core Team 2008). For instance, the package can be run as follows.
R> library("rbsurv")
R> fit <- rbsurv(time = time, status = status, x = x, z = z, alpha = 0.05,
+      gene.ID = NULL, method = "efron", max.n.genes = 100, n.iter = 100,
+      n.fold = 3, n.seq = 3, seed = 1234)

The required arguments time and status are vectors of survival times and survival status (0 = censored, 1 = event), and x is a matrix of expression values (genes in rows, samples in columns). The optional argument z is a matrix of risk factors. To include only the significant risk factors, a significance level less than 1 is given as alpha, e.g., alpha = 0.05. In addition, there are several control arguments. gene.ID is a vector of gene IDs; if NULL, row numbers are assigned. method is a character string specifying the method for handling ties; one of "efron", "breslow", or "exact" can be chosen, and if there are no tied death times all three methods are equivalent. In the algorithm described above, n.fold is the number of partitions of the samples in step 1, n.iter is the number of iterations for gene selection in step 2, and n.seq is the number of sequential runs (or multiple models) in step 6. seed is a seed for the sample partitioning. max.n.genes is the maximum number of genes considered; as described in the section on reducing computing time, if the number of input genes exceeds this maximum, it is reduced by fitting individual Cox models and selecting the genes with the smallest p-values. The input arguments of rbsurv are summarized in Table 1. The major output fit$model contains the survival-associated gene models with gene IDs, nlogliks, and AICs. The genes in the optimal model with the smallest AIC are marked with asterisks (*).
The open-source R statistical package is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/ and our program rbsurv is available at the Bioconductor website (http://www.bioconductor.org/). A programming example can be found in the accompanying vignette.

Examples
We now demonstrate our algorithm with a microarray data set for patients with gliomas. This real data set consists of gene expression values for 85 patients with gliomas (Freije et al. 2004). In that study, Affymetrix U133A and U133B chips were used, and dChip was used to convert the data files (.CEL) into expression values with median intensity normalization. As suggested in the paper, we first selected about 8,000 genes with a coefficient of variation greater than 0.2 and at least 10% of the samples having an expression intensity greater than 500. We then ran our software program rbsurv to discover survival-associated genes for the 85 patients with gliomas.

Figure 1: Negative log-likelihood and AIC (1st run). These plots show negative log-likelihoods and AICs against the number of genes. The two plots at the bottom show standardized negative log-likelihoods and standardized AICs, each divided by the value of the model with no genes. The asterisk (*) indicates the smallest value.

Negative log-likelihoods and AICs for two iterative runs
Figure 1 shows that the negative log-likelihoods (nloglik) always decrease as the number of genes increases. Thus, the largest model is always selected under nloglik because it has the smallest value. However, that model could be over-fitted, i.e., consist of too many genes. In contrast, the AICs tend to decrease for a while and then increase with the number of genes. This implies that the optimal gene model is not necessarily very large. In this run, a 24-gene model was selected (Table 2). Among the genes in the model, BCAN (brevican)/BEHAB (brain enriched hyaluronan binding) is a member of the lectican family of proteins. Together with hyaluronan and tenascin-R, it forms the extracellular matrix of the brain (Gary et al. 2000; Yamaguchi 2000). Although its role is not completely understood, BCAN/BEHAB is considered to play an important role in the structural maintenance of the brain's extracellular matrix (Nakada et al. 2005). In normal adult brain the expression level of BCAN/BEHAB is very low; however, its expression is increased in tumors of glial origin, including glioblastoma (Jaworski et al. 1996; Held-Feindt et al. 2006). In rat model studies, over-expression of BCAN/BEHAB was reported to be associated with tumor invasion, and rats carrying intracranial grafts of a BCAN/BEHAB-transfected glioma cell line showed a worse prognosis (Jaworski et al. 1996; Nutt et al. 2001). Taken together, over-expression of BCAN/BEHAB may be associated with aggressiveness and a worse survival rate in patients with glioblastoma. We re-ran rbsurv after putting aside the genes in the model selected at the first run. Note that these iterative runs can be conducted automatically according to the user's choice in rbsurv. In a similar way, we obtained another model containing 14 survival-associated genes (Figure 2 and Table 3). The nlogliks and AICs in Figure 2 showed trends similar to those in Figure 1.

Figure 2: Negative log-likelihood and AIC (2nd run). These plots show negative log-likelihoods and AICs against the number of genes. The two plots at the bottom show standardized negative log-likelihoods and standardized AICs, each divided by the value of the model with no genes. The asterisk (*) indicates the smallest value.

Figure 3: Negative log-likelihood and AIC (with age). These plots show standardized negative log-likelihoods and standardized AICs with age, each divided by the value of the model with no genes. The asterisk (*) indicates the smallest value.