Staging of prostate cancer using automatic feature selection, sampling and Dempster-Shafer fusion.

A novel technique of automatically selecting the best pairs of features and sampling techniques to predict the stage of prostate cancer is proposed in this study. The problem of class imbalance, which is prominent in most medical data sets is also addressed here. Three feature subsets obtained by the use of principal components analysis (PCA), genetic algorithm (GA) and rough sets (RS) based approaches were also used in the study. The performance of under-sampling, synthetic minority over-sampling technique (SMOTE) and a combination of the two were also investigated and the performance of the obtained models was compared. To combine the classifier outputs, we used the Dempster-Shafer (DS) theory, whereas the actual choice of combined models was made using a GA. We found that the best performance for the overall system resulted from the use of under sampled data combined with rough sets based features modeled as a support vector machine (SVM).


Introduction
Prostate cancer is the leading type of cancer in men with a 27% incidence. The Canadian Cancer Society estimates about 72,700 cancer deaths, of which 3500 are in men with prostate cancer. 1 There are several different types of treatments for prostate cancer, although some of the treatment techniques can substitute each other. The choice of a treatment is dependant upon the stage of the disease, i.e. the extent of cancer spread and whether the cancer has spread beyond the prostate. This information results in staging the cancer. In essence, on the information gathered about the disease through biopsy and/or prostatectomy, staging categorizes a patient. Since different treatments result in variable outcomes, staging helps assess the risk of cancer progression and death based on the current extent of cancer, tumor characteristics, metastasis in the lymph nodes, and the spread of the disease to other parts of the body. Staging also helps in establishing a tradeoff between the risks of death due to cancer and death or medical complications due to treatment. This is particularly true for prostate cancer since a majority of men diagnosed with the disease are older adults, often suffering from multiple comorbidities.
Typically, medical doctors assess the clinical and the pathological data about the individual patient to assign a cancer stage and to choose the most appropriate treatment procedure. This is also known as clinical decision making and it is a complex process. Cancer staging performed by a doctor involves weighing multiple variables and processing information gathered by patient examination and by conducting various tests. However, this process may be subjective and therefore depends heavily on the doctor's experience, skills and knowledge. Machine learning techniques can also be employed to learn and model the underlying theory when provided with relevant information and data. A variety of statistical, probabilistic and optimization tools under the umbrella of machine learning can be used to learn from past examples and a number of information systems have been developed to aid the clinical decision-making process. 2 An automated cancer staging system was developed and is described in this paper. A classifi er based framework was used to draw a distinction between two primary stages of the disease: organ-confi ned disease and extraprostatic disease. The classifi er was modeled using past data from patients diagnosed with prostate cancer who are selected to undergo surgery (or prostatectomy). One of the key issues with such data sets is the class imbalance, resulting from the number of patients with organ-confi ned disease, which considerably exceeds the number of patients with extraprostatic disease. Data imbalances pose problems when less observed patterns are of higher relevance, because most of the machine learning techniques tend to generalize the patterns observed over the majority of data and ignore those observed over smaller portions of the data. 3 Three approaches of dealing with the class imbalance problem were explored in this study. We also investigated the feature space reduction, because the data can potentially contain more information than required to perform the classifi cation.
There are a number of machine learning techniques that can be used to develop a classifi er, but it is well known that specifi c techniques are more suitable for certain domains. That is, for a particular problem, specifi c techniques have superior performance, while other techniques are only able to produce mediocre or acceptable results. Even within one problem domain, different techniques can differ in the effi ciency for different (partial) ranges of the problem space. This leads to the conclusion that it is most appropriate when solving a problem, such as the one considered here, to primarily rely on actual data and information about the problem, rather than trying to generalize the performance of certain machine learning techniques in a generic manner.
Lately, a number of machine learning based tools have been developed for cancer diagnosis and prediction. And a majority of this work can be categorized into those which are heavily dependant upon expert domain knowledge or on extensive historic data. Garzotto et al. 4 and Spurgeon et al. 5 sequentially developed a decision tree approach, specifi cally CART (classifi cation and regression tree) to classify patients with aggressive prostate cancer based on ultrasound and biopsy markers. Sensitivity and specifi city were adopted as performance metrics in the study. Zlotta et al. 6 developed a set of two artifi cial neural networks (ANN), where the fi rst ANN predicts the pathological stage of the patient and the second ANN classifies patients in groups based on cancer stage and the model performance was measured using the Receiver Operating Characteristic (ROC) curve. Veltri et al. 7 favorably compared an ANN model to logistic regression for prostate cancer prediction and staging. Percentage of correctly classifi ed cases was adopted as the performance metric.
In this paper, we propose a novel system which selects the features automatically and couples appropriate techniques in order to maximize the system performance. Specifi cally, performance of three re-sampling techniques and three feature selection techniques were evaluated. A GA is used to search for optimal pairs of feature selection and sampling techniques; where optimality is based on the performance of the system. Once such pairs are identifi ed, respective classifiers are developed and a DS method 8 is used to combine the component classifi er performances. The performance of such a system has been shown for two different types of classifi ers; the k-nearest neighbor (KNN) method and the support vector machines (SVM). ROC curves 9 are used as performance measures for the proposed cancer staging system. The next section describes the data used for this study and Section 3 provides justifi cation for the use of ROC as a performance metric. Section 4 introduces the proposed method along with brief descriptions of the component classifi ers and feature extraction and sampling methods. Section 5 details the classifi er fusion and the obtained results. Section 6 provides details on the generic applicability of this method by validating it on a simulated prostate cancer dataset and on two publicly provided datasets of breast and lung cancer.

Study Population
Staging and analysis of prostate cancer may be regarded as a function of the information (predictors) gathered during the diagnosis of every patient. Data from a total of 1174 patients with prostate cancer positive biopsies matched with their radical prostatectomies were used for this study. All specimens were processed in one laboratory between 07/2000 and 04/2005 and were reported using standard synoptic reports. The biopsy and the prostatectomy data were collected from the information system (Cerner) of the for the disease, and thereby the stage data was converted into a binary format where extraprostatic disease represented by stages pT3 and pT4 was denoted by 1 and the organ-confi ned disease represented by stage pT2 was represented with a 0. Table 1 presents the statistical description of the data used in this study. Of the 1174 patient records, 1054 records were retained after removing the ones with missing variables; 934 patients in the organ-confi ned stage and a 120 with an extra prostatic extension.

Performance Metric
A classifi er's performance typically refl ects how well it can discriminate between the objects belonging to different classes (two in this case). But the proposed system is expected to play a critical role, particularly for patients where treatment modality may have a substantial impact on the post-treatment prognosis and survival. Therefore, in terms of the classifi cation performance, the cost of wrongly classifying a patient with extraprostatic disease is much higher than the cost of wrongly classifying one in the organ-confi ned stage. Conventional measures of performance therefore will not provide a relevant estimate of the classifi er performance; therefore an ROC curve and the area under the curve (AUC) were used as performance indices in this study. An ROC curve is obtained by plotting the true positive rate (TPR) against the false positive rate (FPR) for varying decision thresholds. As shown in Table 2, TPR (and FPR) is representative of the number of positive (negative) examples correctly The decision threshold or boundary for binary classifi cation refers to a threshold, below which the object is classifi ed as negative and above which it is classifi ed as positive. Such a threshold can be adjusted to trade off the cost of TP against the cost of FP, and each threshold setting provides a (FP, TP) pair. A series of such pairs produced by varying the decision threshold are used to plot the ROC curve. One should note that, depending on the outcome of misclassifi cation, ideal decision thresholds may vary. For example, if the cost of misclassifying a patient with extraprostatic disease is lower than misclassifying a patient with organ-confi ned disease, then a reciprocal ROC curve (to the one discussed above) will be preferred. But in this study, a classifi er with an ROC tending towards the top-left of the graph indicates better performance than the ones with a lower ROC. In addition, ROC curves for different classifi ers tend to intersect each other; in which case AUC is used as an alternate metric. AUC ranges between the interval [0, 1] and greater the value of AUC, better is the technique. The AUC of a classifi er is equivalent to the probability that the classifi er will rank a randomly chosen positive example higher than a randomly chosen negative example. AUC as a measure has been proved to be equivalent of the Wilcoxon test statistic 10 and the Gini Index 11 i.e. unlike a typical measure of classifi cation accuracy, the AUC quantifi es the likelihood that the underlying method will assign a higher probability of success to a patient having extraprostatic disease compared to a patient where the cancer is contained. Therefore such a measure will provide a true insight even in the case of imbalanced data. Another important advantage is that the respective ROC is invariant to monotone transformations of feature values, 12 which renders fl exibility in manipulating the feature set if necessary.

Proposed System and Methodologies
The proposed system consists of four major parts: feature extraction, data sampling, classifi cation and classifi er fusion. Feature extraction helps identify the most prominent features in the search space thereby reducing the required computational and interpretational effort. Data sampling provides a mechanism to eliminate the bias or imbalance that exists in the data by over-sampling the minority class or under-sampling the majority class or a selective combination of both. Feature extraction and sampling enable the implementation of a classifi er in order to model the class disparity in the data. A GA is then used to identify compatible sets of features, sampling techniques and classifi ers in order to maximize the performance of subsequent DS classifi er fusion. The adopted methods are individually described in the following subsections.

A. Feature selection
Features extracted or selected from the input data, can be categorized into continuous, discrete or projected features. Existing processes of selecting features can be classifi ed as: those based on statistical information, those based on empirical information and those based on search in the sample space. Following this approach, three techniques associated with the major types of features have been adopted. Selection of RS based discretized features relies on empirical information about the system, PCA based transformed feature selection relies on the statistics of the data and GA based continuous feature selection relies on intelligent search through the sample space.

RS features
Rough Set Theory 13 adopts an equivalence relation such that two objects (x, y) form an indiscernible pair over the attribute a, only if a(x) = a( y); and assuming that (x, y) Є Ra, then Ra would be called the a-indiscernibility relation and denoted by the symbol INDa. Given that, a lower approximation set consists of all objects which can certainly be classifi ed as elements of X over an indiscernibility The lower approximation set is also known as the Positive region i.e.

POS X R X
The signifi cance of a variable is then expressed as a function of the dependency (γ) of a variable in classifying the objects into the classes of U IND D | ( ). The dependency of decision variable D on independent variable R is given as: where U denotes the cardinality of set U , i.e. the number of elements contained in that set. The signifi cance of a variable a is the increase of dependency between the independent variables and the decision variable after the addition of a, i.e.
Because dependency (defi ned by Equation 3) only considers the number of rules that cover various instances and not the number of instances that each of the rules represents, a parameterized average support heuristic method has been adopted to include both, the number of rules and the number of instances supporting each rule in computing the average support function (similar to the measure of dependency), given as: The signifi cance of a variable (Equation 4) is redefi ned as: 2. PCA features PCA 14 reduces the feature space by projecting the complete feature set onto fewer variables known as the principal components with the objective of maximizing the variance in a least squares sense. This produces uncorrelated components with minimal information loss. X denotes a {n × p} matrix for n instances of a system represented through a p-dimensional feature space. Applying PCA begins by fi rst normalizing X into a feature set with zero mean and unit variance. PCA aims to transform this p-dimension into an m-dimensional feature space where m Յ p, but typically the fi rst few components represent the largest portion (~90%) of the original information, therefore effectively using only the fi rst m* (ϽϽ p) components. The correlation matrix S X of X is given as: If Y represents the {n × m} matrix for n instances of a system represented through a reduced m-dimensional feature space, the transformation is achieved by weighting the original features using m number of principal components. The m components are identifi ed as the eigenvectors corresponding to the fi rst m largest eigenvalues of the correlation matrix S X . The transformation of the feature space is therefore given as: where V m is a { p × m} matrix made up of the m eigenvectors. Singular value decomposition of X is performed as: where U is column-orthogonal matrix of size {n × p}, L is a square diagonal matrix of size { p × p} where the diagonal elements are square roots of the eigenvalues of the correlation matrix S X and V is also a square matrix of size { p × p} where the columns correspond to the eigenvectors of the correlation matrix. V m consists of the fi rst m eigenvectors in V. The number of principal components or eigenvectors (m) is determined as per a set threshold.

GA features
GA 15 based search depends on a user defi ned fi tness function; in this study, the product of the number of features and the average error in the predicted output class has been adopted as the fi tness function. GA based search uses a chromosome representation of the solutions and a set of genetic manipulations in order to arrive at an optimum solution. First, a chromosome of the length equal to the total number of features in the input space is created. The value of the bit associated with a feature is set to 0 to indicate that the feature is not considered, whereas a bit value of 1 indicates that the feature will be considered. The search process begins with an initial generation where the population is generated randomly; all of the chromosomes in this generation are evaluated against the fi tness function and the best chromosomes (representing better solutions) are chosen to propagate into the next generation. Through heuristic manipulation of the chromosome structures in every generation, it is ensured that the newer generations always have an average fi tness higher than the previous generations. The search stops either when a fi tness threshold is achieved, or when the search runs out of the threshold on the number of generations. In this study, the population size per generation was set to 25 and the limit on the generations used in the search was set to 1000. These numbers were selected after a short sequence of random trials. Mutation and crossover operators were utilized to generate off springs for the next generation, and a simple natural selection based on the current fi tness values was used to identify potential parents.

B. Data sampling
Two re-sampling methods are often used in order to overcome an imbalance; one is to under-sample the majority class to match the size of the minority class and the other is to over-sample the minority class to match the size of the majority class.
Over-sampling and under-sampling techniques have been previously evaluated for imbalanced datasets, 16 and a conclusion that both methods were effective was drawn. In one study, 17 combined over-sampling of the minority class and under-sampling of the majority class was used, but the combination did not provide significant improvement in the performance. Over-sampling in these cases was done by duplicating the original examples from the minority class, which does not cause minority class decision boundary to spread into the majority class region, but instead creates decision regions similar to those existing for the minority class. This shortfall may be overcome using an approach called SMOTE, as proposed in one study. 18 SMOTE is an over-sampling of the minority class by creating In this study, we used three re-sampling techniques: 1) under-sampling 2) SMOTE and 3) combined under-sampling and SMOTE in the multi-classifi er fusion diagnosis system.

C. Classifi ers
Classification, an operation of assigning an unknown sample to one of the output classes can typically be performed by either fi tting a model around the independent variables or through averaging or majority voting. In order to exemplify the proposed system for both types, we used SVM, which identifi es a nonlinear model of the input variables, and KNN, which is based on majority voting.

Support vector machines
SVM 19 is a supervised machine learning methodology used for classification and regression. Compared with the traditional statistical and neural network methods, SVM has the advantage to effectively avoid a local minimum due to its convexity property. On the other hand, SVM uses the kernel trick to apply linear classification techniques to nonlinear classification problems. When Gaussian kernels are used, the resulting SVM corresponds to a radial basis function (RBF) network with Gaussian radial basis functions. In comparison with traditional RBF network, SVM has the ability to automatically determine the model size by selecting the support vectors based on quadratic programming (QP) procedure. Hidden neurons correspond to the support vectors. The support vectors serve as the centers of basis function in the RBF network. For linear SVM, the decision function is given in a linear form as: The decision value produced by SVM is not the estimate of posterior probability. Here, the binning technique is used to transform the output of the SVM into calibrated probability. 19 The binning technique proceeds by fi rst sorting the training examples according to their decision values, and then dividing the value range into k equal sized intervals or bins. Given an example x, place it in a bin according to its decision value. The conditional probability of x is estimated as:

K-nearest neighbor
Subsets A ∈ 2 Θ that satisfy the condition m(A) Ͼ 0 are called the focal elements of mass function m. Since a subset A is the disjunction of all its elements. If the proposition B ⊆ A is true, then the proposition A is also true. Hence, given a subset A, the belief bel (A) is defi ned as the sum of all the masses of its subsets: The belief value bel(A) indicates the degree that evidence supports the proposition A. When two evidences exist, there will be two different mass functions m 1 and m 2 . If these two evidences are independent of each other, then the two mass functions can be combined into a new mass function m using Dempster's rule 21 : where m k (A i ) indicates the degree of belief that proposition A i is true, provided by classifi er ck, k = 1, 2. Combining the evidence provided by the individual classifi ers, the belief value of each proposition is given as: where m is the combined mass function calculated by the sequential use equation ( The fi nal decision is made according to the approach dealing with the imbalanced data. If re-sampling is used in the fused diagnosis system, the fi nal decision is made by assigning the label of the class with the largest belief value: If the imbalance is dealt with by changing the decision threshold, the class of the example is determined by changing the threshold on belief value of the positive class: where T is the threshold.

Results
Experiments were done to assess the performance of all of the feature extractor-sampling-classifi er pairs. The available data set was divided into two: one for building the models and the other to test the developed models. Appropriate ratio of the training and testing data sizes for SVM was identifi ed by running different trials as shown in Figure 1. The best testing performance was observed when 70% of the total samples were used for training. The dip in the performance of the SVM beyond the training data size of 70% can be attributed to overfi tting, when the trained model lost its ability to generalize, and was rather rigid to the training data. As the performance of the KNN depends on the number of neighbors considered in the output class allocation, the optimum number of neighbors was identifi ed by running trials with different sizes, and 5 was the most optimal. Although higher number of neighbors may seem to have the ability to generalize, it is the separability of the data according to assigned classes that has the highest infl uence on the appropriate number of neighbors. The AUC curves were generated for all of the feature-sample-classifi er sets by altering the decision     threshold. When the outputs of the classifi ers for all combinations were transformed into conditional probabilities, altering the decision threshold simply meant altering the respective probability threshold. By varying the probability threshold, the testing examples are re-labeled, giving out a series of (FP, TP) pairs. Each pair of (FP, TP) is a point on the ROC curve. For individual classifi ers, ROC curves are created by changing the threshold on conditional probability p (c(x) = 1|x). ROC curve for the fused classifi er is created by changing the threshold on belief value of the positive class. The outputs of SVM and KNN are then fused using DS fusion approach and ROC curves are plotted for the individual classifi ers (Figures 3-14). AUC of the generated plots were tabled per classifi er in Tables 3 and 4.
To show the effect of combining multiple sets of features and the effect of combining different types of sampling, DS fusion of the classifi er outputs was performed over various combinations of sampling and feature selection methods. The following notation was adopted to refer to the classifi ers developed in this study;            US (under-sampling + SMOTE)}. Therefore SVM-P-S would imply the classifi er based on SVM trained over PCA features obtained from SMOTE sampled data.
It is evident in Table 3 Table 4. In general, DS fusion improved the performance over any single model. We also note that under sampling proved a more efficient method of re-sampling the data than generating synthetic samples. As it has been observed previously, 18 generating synthetic samples can cause the decision boundary to spread, resulting in a poor performance. Despite the fact that fewer samples make parameter estimation difficult in SVM and neural networks, under sampling of the data should be preferred, since performance degradation is higher when using synthetic sampling than using SVM trained over smaller data sets.

B. Comparison between different methods
Methodology proposed in this study relies on the use of GA to identify the most optimal set of classifi ers for fusion, where fi tness is defi ned as the overall fused performance. A total of 18 models (9 each for SVM and KNN with variations in the sampling method and features) were developed, and GA was used to choose the best set of fusion classifi ers. Once an x set of classifi ers for best fused performance were identifi ed, the AUC (of ROC for changing thresholds) was determined for the test samples alone. The results for the overall performance from all model combinations obtained by fusing 2, 3, 4 and 5 models are shown in Table 5 (A) with the highest classification accuracy at 90.1% and a respective ROC AUC of 0.8640. Clearly, the number of fused models does not have a considerable impact on the performance, which is the reason why the trials were stopped after the fusion of 5 models. A combination of the outputs of 4 models has the best overall performance on the testing data. In this study, under sampling was observed to be the most effi cient for data sampling. KNN (with average performance for all models much lower than SVM) seems to have contributed equally in the best overall performance. As observed from Tables 3 and 4, overall effi ciency of rough sets based features was highest of all subsets. Rough sets based discretized features increase the distance between different output classes, and thus Table 5. Performance of GA optimized fusion for PCa.  tend to impact the overall classifi cation performance in a favorable manner. In order to contrast the performance of the proposed method with existing techniques, a C4.5 decision tree and a two-layer neural network have been additionally developed using the same data. C4.5, similar to the likes of expert systems and nomograms, is a simple and transparent technique which can also be regarded as means of knowledge representation. ANN, similar to the likes of SVM and nonlinear regression, on the other hand is a complex nonlinear model. These two models have been chosen to represent the majority of present day classifi ers. As can be observed from Table 5 (B), the proposed method fairs very well compared to the two other. Moreover, the performance of a 2-model combination (AUC: 0.8617, Accuracy: 89.4%) is better than that of the DS (C4.5 + ANN) method (AUC: 0.8580, Accuracy: 88.5%).

Validation with Other Datasets
A number of such techniques have been reported in the literature and these techniques perform very well for individual datasets that the techniques have been built for. Data used in the previous section is a large, unbiased and consecutive patient cohort originating from one institution. Therefore it should be comparable to other current patient data obtained from patients who undergo prostatectomy in the larger North American centers. Although currently we have no access to data sets from different hospitals, we present the results of applying the proposed method to three different cancer datasets to testify that the proposed method is generic in applicability and that it outperforms other existing techniques, Firstly, we considered a simulated prostate cancer (SimPCa) dataset. 1000 patient records were synthetically generated using a combination of the under-sampling and SMOTE methods. It was ensured that the overall statistics of the data remained consistent with the dataset reported in Table 1. In order to facilitate comparison, the same classifi ers, feature extractors and sampling techniques were adopted. ROC curves were generated and the respective AUC given in Tables 6-8. The performance of the classifi er is very similar to the results in Section 5. SVM trained over RST identifi ed features had the best performance and KNN remained a weak classifi er for this dataset. Consistent with the other results, GA optimization identifi es the same four model fusion.
The other two datasets considered for this purpose have been adopted from the University of California-Irvine data repository. Using the two public datasets we demonstrate that the proposed method compares favorably to the existing techniques. The Wisconsin Breast Cancer (BCa)  and Tables 13-14 for the lung  cancer data. Tables 11 and 15 depict the performance of the GA optimized classifi er fusion. It can be observed from the above Table 9 that SVM outruns KNN as a classifi er and the best SVM classifi er was built using GA for feature extraction and under-sampling as a remedy for data imbalance. Unlike in the prostate cancer data, GA optimization identifi ed a three model fusion with the same AUC as for higher combinations.
The best overall classification accuracy was 99.4%. Table 12 summarizes the performance of different methods in the literature for the same data. Not all the works listed in Table 12 use ROC as a metric; therefore classifi cation accuracy has been used to draw a comparison between all of them. Quinlan 24 and Pena-Reyes and Sipper 25 applied C4.5 decision tree and fuzzy-GA methods. Although these methods are relatively simpler and yield a transparent and user-friendly model, the overall accuracy of the methods is compromised. On the other hand, Goodman et al. 26 Ubeyli, 27 Polat and Gunes 28 and Akay 29 applied thoroughly non-linear techniques. But it is evident that the proposed method, although marginal compared to Akay, 29 has a superior performance than the other techniques.
For the lung cancer data, KNN and SVM have a similar performance. The combination of under-sampling and SMOTE with PCA feature Table 11. Performance of the GA optimized fusion for BCa.    Table 15 with an overall classifi cation accuracy of 97.5%. Table 16 compares the performance of the proposed method to two other methods from the existing literature. The proposed method has a much better performance when compared to that of Aeberhard's 30 RDA model and the neuro-fuzzy model adaptation used by Luukka. 31 Although Luukka 31 reports a 99.99% accuracy for a larger training-testing ratio, a performance of 65.48% is given for a 70:30 ratio, which is the same as used for this work. Therefore, it is concluded that the proposed GA based optimization of multiple model fusion is generically applicable across a wide range of data and in addition ensures better performance than most existing techniques.

Conclusions
A number of classifi er models based on KNN and SVM have been developed to test the automatic prediction of cancer stage using different features and data samples. We propose a novel approach of using a GA to select the best models and to combine their outputs using the DS theory. Owing to DS fusion and GA optimization, the overall performance improved in all tested models. In particular, three sampling and three feature selection methods have been employed in this study. Under-sampling and rough sets based features were identifi ed to be most useful in improving overall performance of the system.

Disclosure
The authors report no confl icts of interest.