Cancerclass: An R package for development and validation of diagnostic tests from high-dimensional molecular data

Progress in molecular high-throughput techniques has made comprehensive monitoring of biomolecules in medical samples possible. In the era of personalized medicine, these data form the basis for the development of diagnostic, prognostic and predictive tests for cancer. Because a high number of features is measured simultaneously in a relatively low number of samples, supervised learning approaches are prone to overfitting and performance overestimation. Bioinformatic methods have been developed to cope with these problems, including control of accuracy and precision. However, there is demand for easy-to-use software that integrates methods for classifier construction, performance assessment and development of diagnostic tests. To help fill this gap, we developed a comprehensive R package for the development and validation of diagnostic tests from high-dimensional molecular data. An important focus of the package is careful validation of the classification results. To this end, we implemented an extended version of the multiple random validation protocol, a previously introduced validation method. The package includes methods for continuous prediction scores. This is important in a clinical setting, because scores can be converted to probabilities and help to distinguish between clear-cut and borderline classification results. The functionality of the package is illustrated by the analysis of two cancer microarray data sets.


Introduction
Progress in molecular high-throughput techniques has made it possible to monitor hundreds or thousands of biomolecules simultaneously in medical samples, e.g. using microarrays. In the era of personalized medicine, these data form the basis for the development of prognostic and predictive tests. Because of the high dimensionality of the data and the associated multiple testing problem, the development of molecular tests is prone to model overfitting and performance overestimation. Bioinformatic methods have been developed to cope with these problems, e.g. the multiple random validation protocol presented in [1].
Cancerclass integrates methods for the development and validation of diagnostic tests from high-dimensional molecular data. Simple classifiers have been shown to perform well on high-dimensional data compared to more sophisticated methods [2]. Therefore, the protocol of cancerclass uses simple classification methods, while much attention is paid to validation and visualization of the classification results. In short, the protocol starts with feature selection by a filtering step. Then, a predictor is constructed using the nearest-centroid method. The accuracy of the predictor can be evaluated using training and test set validation, leave-one-out cross-validation or a multiple random validation protocol. Methods for the calculation and visualization of continuous prediction scores make it possible to balance sensitivity and specificity and to define a cutoff value according to clinical requirements.
In the following, the functionality of cancerclass is illustrated using two sets of cancer gene expression data. A gene expression data set of two types of leukemia (ALL and AML) [3] is delivered with cancerclass. Gene expression data of breast cancer with good and poor prognosis [4,5] are obtained from the ExperimentData package cancerdata.
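The filtering and nearest-centroid steps described above can be sketched in base R. This is a minimal illustration on simulated data, not the cancerclass implementation: the data, the helper predict_nc and the choice of Euclidean distance are assumptions made for the example.

```r
# Simulated expression matrix: 200 genes (rows) x 40 samples (columns).
set.seed(1)
labels <- factor(rep(c("ALL", "AML"), each = 20))
x <- matrix(rnorm(200 * 40), nrow = 200)
# Make the first 10 genes informative by shifting them in the AML group.
x[1:10, labels == "AML"] <- x[1:10, labels == "AML"] + 2

train <- c(1:15, 21:35)        # 15 samples per class for training
test  <- setdiff(1:40, train)  # remaining 10 samples for testing

# Filtering step: rank genes by the absolute Welch t statistic,
# computed on the training samples only.
is_all <- labels[train] == "ALL"
welch_t <- apply(x[, train], 1, function(g) {
  unname(t.test(g[is_all], g[!is_all])$statistic)  # Welch test is the default
})
selected <- order(abs(welch_t), decreasing = TRUE)[1:10]

# Nearest-centroid predictor: one mean expression profile per class.
centroids <- sapply(levels(labels), function(cl) {
  rowMeans(x[selected, train[labels[train] == cl], drop = FALSE])
})

# Hypothetical helper: assign a profile to the class with the closest
# centroid (squared Euclidean distance, an assumption of this sketch).
predict_nc <- function(profile) {
  d <- colSums((centroids - profile)^2)
  names(which.min(d))
}
predicted <- apply(x[selected, test, drop = FALSE], 2, predict_nc)
accuracy <- mean(predicted == labels[test])
print(accuracy)
```

Selecting the genes on the training samples only is the important design choice here: filtering on all samples before splitting would leak information from the test set and overestimate the accuracy.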

Multiple random validation protocol
First, the package cancerclass and an example data set are loaded. GOLUB1 is a gene-filtered version of gene expression data from 72 leukemia patients [3,1]. Using a protocol similar to [1], we investigate the dependence of classification accuracy on the number of features (Fig. 1). The classification task is to distinguish between two types of leukemia, ALL and AML. Fig. 1 shows the overall classification accuracy, the sensitivity for prediction of ALL and the sensitivity for prediction of AML. The confidence interval of the overall classification rate is estimated from 200 random splits into training and test sets.
In order to reduce the computing time for the generation of the vignette, the gene expression data set has been reduced to the first 200 genes out of a total of 3571 features. Classification rates will improve when the calculation is done for the complete data set.
Next, we evaluate the dependence of the performance of 10-gene predictors on the size of the training set (Fig. 2).
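The idea of estimating the classification rate from repeated random splits can be sketched in base R as follows. The sketch runs on simulated data and does not call cancerclass; the function split_accuracy, the Euclidean nearest-centroid rule and the percentile interval are assumptions of this example, and the package may compute its confidence intervals differently.

```r
# Simulated data: 100 genes x 40 samples, first 10 genes informative.
set.seed(2)
labels <- factor(rep(c("ALL", "AML"), each = 20))
x <- matrix(rnorm(100 * 40), nrow = 100)
x[1:10, labels == "AML"] <- x[1:10, labels == "AML"] + 1.5

# Hypothetical helper: fit on the training columns, predict the
# remaining columns and return the fraction of correct predictions.
split_accuracy <- function(x, labels, train, n_genes) {
  a <- x[, train[labels[train] == "ALL"], drop = FALSE]
  b <- x[, train[labels[train] == "AML"], drop = FALSE]
  # Welch t statistic per gene, computed on the training samples only.
  t_stat <- (rowMeans(a) - rowMeans(b)) /
    sqrt(apply(a, 1, var) / ncol(a) + apply(b, 1, var) / ncol(b))
  sel <- order(abs(t_stat), decreasing = TRUE)[seq_len(n_genes)]
  centroids <- cbind(ALL = rowMeans(a[sel, , drop = FALSE]),
                     AML = rowMeans(b[sel, , drop = FALSE]))
  test <- setdiff(seq_along(labels), train)
  pred <- apply(x[sel, test, drop = FALSE], 2, function(p) {
    names(which.min(colSums((centroids - p)^2)))
  })
  mean(pred == labels[test])
}

# 200 random splits, each with an equal number of ALL and AML
# patients in the training set.
n_splits <- 200
n_train_per_class <- 10
acc <- replicate(n_splits, {
  train <- c(sample(which(labels == "ALL"), n_train_per_class),
             sample(which(labels == "AML"), n_train_per_class))
  split_accuracy(x, labels, train, n_genes = 10)
})

# Mean accuracy and empirical 95% interval over the random splits.
summary_acc <- c(mean = mean(acc), quantile(acc, c(0.025, 0.975)))
print(summary_acc)
```

Repeating the loop for several values of n_genes or n_train_per_class gives curves of the kind shown in Fig. 1 and Fig. 2.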

Predictor construction and validation
Two gene expression data sets of breast cancer are loaded. Both data sets were generated using the same type of microarrays. VEER is the original data set of 78 breast cancer samples [4]. VIJVER is a larger data set of 295 breast cancer samples including some of the profiles of the original data set [5]. An independent validation set VIJVER2 is obtained by removing the samples of VEER from VIJVER. A predictor of distant metastasis is fitted using the VEER data and validated in VIJVER2. Four methods, dist = "euclidean", "center", "angle" and "cor", are available for calculating the distance between test samples and the centroids (see documentation of predict-method).
> library(cancerdata)
> data(VEER)
> data(VIJVER)
> VIJVER2 <- VIJVER[, setdiff(sampleNames(VIJVER), sampleNames(VEER))]
> predictor <- fit(VEER, method="welch.test")
> prediction <- predict(predictor, VIJVER2, positive="DM", dist="cor")
The result of the prediction is a continuous score for each of the breast cancer patients. Three methods, score = "z", "zeta" and "ratio", are available for calculating the prediction score (see documentation of prediction-class). The prediction score turns out to be significantly increased for patients who developed a distant metastasis within 5 years after surgery (Fig. 3). In fact, only three patients with prediction score zeta > 0.5 developed a distant metastasis.
ROC analysis makes it possible to trade off sensitivity against specificity for the prediction of distant metastases. Indeed, there is a cutoff point for the prediction score yielding a sensitivity above 90% at a specificity of about 50% (Fig. 4). Confidence intervals of sensitivity and specificity are calculated by the Wilson procedure. The ROC curve runs significantly above the diagonal with an area under the curve (AUC) of 0.74.
Finally, a logistic regression model is fitted to the prediction score. Using Fig. 5, the probability of developing a distant metastasis within 5 years can be estimated from the gene expression based prediction score.
> plot(prediction, type="logistic", positive="DM", score="zeta")
Call: glm(formula = y ~ x, family = binomial)
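The quantities used in this evaluation (Wilson intervals for sensitivity and specificity, the AUC, and a logistic model that maps a score to a probability) can be reproduced in base R. The following is a sketch on simulated scores, not on the VIJVER2 predictions; the functions wilson_ci and auc and the cutoff of 0 are assumptions introduced for illustration.

```r
# Wilson score interval for a binomial proportion (x successes in n trials).
wilson_ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p <- x / n
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(lower = centre - half, upper = centre + half)
}

# AUC as the probability that a positive sample scores higher than a
# negative one (ties count one half); equivalent to the Mann-Whitney statistic.
auc <- function(score, positive) {
  pos <- score[positive]
  neg <- score[!positive]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

# Simulated prediction scores: 40 patients with and 80 without the event.
set.seed(3)
positive <- rep(c(TRUE, FALSE), times = c(40, 80))
score <- rnorm(120, mean = ifelse(positive, 1, 0))

# Sensitivity and specificity at an illustrative cutoff, with Wilson intervals.
cutoff <- 0
tp <- sum(score > cutoff & positive)
tn <- sum(score <= cutoff & !positive)
sens_ci <- wilson_ci(tp, sum(positive))
spec_ci <- wilson_ci(tn, sum(!positive))
auc_value <- auc(score, positive)

# Logistic regression converts the continuous score into a probability.
model <- glm(positive ~ score, family = binomial)
prob_at_1 <- predict(model, newdata = data.frame(score = 1), type = "response")

print(rbind(sensitivity = sens_ci, specificity = spec_ci))
print(auc_value)
print(prob_at_1)
```

Lowering the cutoff raises sensitivity at the cost of specificity, which is the trade-off the ROC curve summarizes over all cutoffs.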

Figure 1: Misclassification rates as a function of the number of genes.

Figure 2: Misclassification rates for 10-gene predictors as a function of the training set size. For each training set size, 200 splits into training and test sets were randomly drawn. Each training set contains an equal number of ALL and AML patients.

Figure 3: Histogram of the prediction score zeta for patients who developed a distant metastasis within the first 5 years (DM) and patients who remained free of distant metastasis.

Figure 4: ROC curve for the prediction of distant metastases, with 95% confidence intervals for sensitivity (red lines) and specificity (green lines). AUC = area under the curve.