A Comparison of Artificial Neural Network and Decision Trees with Logistic Regression as Classification Models for Breast Cancer Survival

In the field of medicine, several recent studies have shown that artificial neural networks, decision trees, and logistic regression play a major role as prediction and classification methods. This research has been extended to estimating the incidence of breast, lung, liver, ovarian, cervical, bladder, and skin cancer. The main aim of this paper is to develop logistic regression, artificial neural network, and decision tree models using the same input and output variables and to compare their success in predicting breast cancer survival in women. To find the best model for breast cancer survival, the sensitivity and specificity of all these models are measured and evaluated with their respective confidence intervals and ROC values.

Keywords: Artificial neural networks, Logistic regression, Breast cancer, Decision trees, Cancer survival.


Introduction
In the field of clinical diagnostics, computer models play a prominent role in differentiating between healthy and ill patients. The accuracy of such models is critical, as it supports correct decisions about disease risk based on the patient's characteristics. The numerous data models designed, assessed, and improved include both statistical and non-statistical methods. Every methodology uses different assumptions and, depending on the data context, may or may not produce similar results. Three frequently used methods are regression, decision trees, and artificial neural networks. Regression methodology estimates the relation between independent and dependent variables; regression methods are also referred to as dependence analysis techniques (Agresti, 2010). The study of regression models is a crucial part of several research projects. Such models are widely applied to assess the survival of severely ill patients admitted to the intensive care unit (Gellar et al., 2014). Linear models and logistic regression models are the two main categories of regression methods. The technique of logistic regression is very widely used in data analysis. It is considered a well-known classification model that enables probabilistic decisions to be made and shows promising results on several problems. A logistic model is also often considered clinically interpretable while providing a comparably good fit.
Four models are developed in this paper using the same output and set of input variables with decision trees, logistic regression, and artificial neural networks. The performance of these models is assessed in breast cancer survival prediction. The four models are developed using cancer data available in SEER. Surveillance, Epidemiology, and End Results (SEER) is one of the most widely used cancer registry databases. This data has been pre-processed (cleansed) to eliminate or account for missing data and repetitions. The final data set considered in this research contained 47,167 records of malignant tumors (Mudunuru, 2016).
In this paper, the survival of a woman with breast cancer is estimated using the important attribute variables available in the database. Initially, we selected all the independent variables, including the size of the tumor, age of the patient, stage of the breast cancer, treatment administered, duration, tumor grade, marital status of the patient, and the count of primary tumors recorded. Logistic regression found variables such as the size of the tumor, grade of the cancer, and marital status of the patient to be statistically insignificant in predicting the survival of women with breast cancer. We developed our four models by adding one of the remaining significant variables at a time. All these models provide a binary output for each case: either 1 (alive/survived) or 0 (dead/not survived). A number ranging between 0 and 1 estimates the probability of the predicted value.

Logistic Regression
One of the primary assumptions of linear logistic regression is that the independent covariates have a linear association with the natural logarithm of the odds. The three main components that define the logistic model are the systematic component, the link function, and the random component. The binary nature of the dependent variable, which makes the model eminently suited for such data, is another important characteristic of the logistic function. The target of logistic regression is therefore slightly different: we estimate the probability that the response variable equals one, given the values of the independent variables (McFadden, 1973).
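Concretely, for independent covariates x1, …, xk and survival probability p, the logistic model estimates:

```latex
\operatorname{logit}(p) \;=\; \ln\frac{p}{1-p}
  \;=\; \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k ,
\qquad
p \;=\; \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
```

Here the linear predictor is the systematic component, the logit is the link function, and the binomial response is the random component.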
In our current research, breast cancer survival is the output variable, predicted from the patient's age, size of tumor, stage of cancer, treatment administered, and duration.

Survival Prediction using Logistic Modelling
The two most general and widely used survival analysis techniques are event history models and logistic regression. Survival time is treated as a continuous variable in event history models, whereas it is treated as discrete in logistic regression models. A dichotomous measure (survived or not) is thus the target. In this section, a logistic regression model is applied as a classifier to predict breast cancer survival in women.
Age and size of the tumor are used in the first model (model-1). Age, size of the tumor and stage of the cancer are used in the second model (model-2). Along with the three variables selected in model-2, treatment is added to develop the third model (model-3). Age, size of the tumor, cancer stage, treatment and duration are all used in the fourth model (model-4). Our primary aim is to measure or estimate the women's breast cancer survival.
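The four nested models described above can be sketched with scikit-learn; the synthetic data and the column ordering below are illustrative stand-ins for the SEER variables, not the actual data or field names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in for the SEER records used in the paper.
X = np.column_stack([
    rng.normal(60, 12, n),       # age
    rng.normal(20, 8, n),        # tumor size (mm)
    rng.integers(1, 5, n),       # cancer stage
    rng.integers(0, 2, n),       # treatment administered (0/1)
    rng.normal(36, 10, n),       # duration (months)
])
y = (rng.random(n) < 0.7).astype(int)  # survived = 1, not survived = 0

# Each model adds one more significant variable, as in the paper.
feature_sets = {
    "model-1": [0, 1],
    "model-2": [0, 1, 2],
    "model-3": [0, 1, 2, 3],
    "model-4": [0, 1, 2, 3, 4],
}
models = {}
for name, cols in feature_sets.items():
    clf = LogisticRegression(max_iter=1000).fit(X[:, cols], y)
    models[name] = clf
    print(name, "training accuracy:", round(clf.score(X[:, cols], y), 3))
```

On the real data, the reported accuracies rise as variables are added, with duration contributing most.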
The overall accuracy of the models serves as a reliability measure of the given estimate. Specificity and sensitivity, along with 95% confidence intervals and overall accuracy, for the developed logistic models are given in Table 1. The ROC area values, along with specificity at 95% sensitivity, for the developed logistic models are given in Table 2. The results showed that the overall accuracy for model-3 is 70.42%, which increases to 80% for model-4. As expected, the tumor duration has large significance for accurately predicting survival during the study period. The ROC area of logistic regression model-1 is 68.8%, and a survival sensitivity of 95% yielded a specificity of just 25%; model-2, with a ROC area of 71% and a survival sensitivity of 95%, provided a 30% specificity rate. The ROC area values for the remaining two models are 71.8% and 85.5%, respectively, and the 95% survival sensitivity rate has 29% and 61% overall specificity, respectively. The performance of the fourth logistic model is particularly attractive, offering 80% overall accuracy. The specificity and sensitivity for this model are 81.54% and 76.82%, respectively. At 95% sensitivity this model has 61% specificity. Figure 1 displays the ROC graphs for the logistic models.

ANN Perceptron Classification and Back Propagation
A perceptron is an arrangement of one input layer of McCulloch-Pitts neurons feeding forward into one output layer of McCulloch-Pitts neurons. A feedforward neural network of at least two layers is characterized as a multilayer perceptron. Adding perceptron layers makes it possible to solve nonlinearly separable problems by splitting the given data into small, linearly separable parts (Widrow and Lehr, 1990). The hard-limiting (step) function is the most preferred activation function for producing the outputs. The second most important function is the sigmoid function, which prevents the input information from overwhelming the inner neurons. In modelling complex data, the step function is replaced with a sigmoid function. The output of a multilayer perceptron is formed by combining each individual perceptron with a sigmoid function and feeding the result into the next series of perceptrons. The architecture and design of the network is intended to be flexible: for example, no immediate connection between the input and target layers is assumed, nor are associations within a layer; the number of outputs need not equal the number of inputs; and there is no limitation on the number of hidden layers or units. Hidden units can even outnumber, or be fewer than, the input and output units (Jabri and Flower, 1992).
The most widely used learning algorithm for training feedforward artificial neural networks is the backpropagation algorithm. In this technique, the signals received at the current layer are the independently weighted and summed outputs of the previous layer's neurons. The weights between neurons are optimized using the backpropagation technique to generate the best network. Multilayer perceptrons thus have two essential characteristics: generalization and fault tolerance. Neural networks are extremely tolerant of faults, a feature also called graceful degradation (Mohr et al., 2000): even if certain interconnections between certain neurons within the layers fail, the network keeps working. We developed the ANN models in this research to exploit these characteristics.
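The weighted-sum and error-propagation steps above can be sketched with a minimal NumPy implementation: a two-layer sigmoid network trained on XOR. The layer sizes, learning rate, and data are illustrative, not the networks used in this study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # XOR inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets

W1 = rng.normal(0.0, 1.0, (2, 4))  # input-to-hidden weights
W2 = rng.normal(0.0, 1.0, (4, 1))  # hidden-to-output weights
lr = 1.0                           # learning rate (illustrative)

def mse():
    return float(np.mean((sigmoid(sigmoid(X @ W1) @ W2) - y) ** 2))

initial_loss = mse()
for _ in range(5000):
    # Forward pass: each layer weights and sums the previous layer's outputs.
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)
    # Backward pass: propagate the output error back through the weights.
    d_out = (out - y) * out * (1.0 - out)
    d_h = (d_out @ W2.T) * h * (1.0 - h)
    W2 -= lr * (h.T @ d_out)
    W1 -= lr * (X.T @ d_h)
final_loss = mse()
print("loss:", round(initial_loss, 4), "->", round(final_loss, 4))
```

The gradient updates are exactly the "weights multiplied by received signals" step described in the text, applied layer by layer in reverse.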

Survival Prediction using ANN Modelling
When modelling non-linear data, artificial neural networks (ANNs) are efficient estimators. Constructing an ANN requires minimal domain knowledge in mathematics and statistics. The type of ANN used in this study is the multilayer perceptron (MLP), or multilayer feed-forward network, which propagates input signals forward and returns error signals. During this process the weights are adjusted to make predictions more accurate. The method is vulnerable to overfitting. To prevent overfitting, the network is trained on part of the data and its performance is then tested on the remaining data. We split the data 70%-30%: 70% for training our ANN models and 30% for testing and validation.
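The 70%-30% split described above can be sketched with scikit-learn; the synthetic data and MLP settings below are placeholders, not this study's SEER data or network architecture.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))           # stand-in for the five predictors
y = (X[:, 0] + X[:, 4] > 0).astype(int)  # stand-in survival label

# Hold out 30% of the records for testing, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                    random_state=1).fit(X_train, y_train)
print("test accuracy:", round(mlp.score(X_test, y_test), 3))
```

Evaluating only on the held-out 30% guards against the overfitting the text warns about.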
The ANN models consist of an input layer, a hidden layer, and an output layer. Table 3 summarizes the training results of the ANN models, including specificity and sensitivity along with 95% confidence intervals and overall accuracy. The specificity, sensitivity, 95% confidence intervals, and overall accuracy of the developed ANN models during testing are summarized in Table 4. The architecture of the developed artificial neural network models, with their ROC values and specificity at 95% sensitivity, is given in Table 5. From the training results we notice that the overall accuracy increased from 71.12% for model-3 to 82.80% for model-4. Neural network model-1 yielded a ROC area of 72.1%, and a survival sensitivity of 95% yielded a specificity of 30%; model-2 reported a ROC area under the curve of 73.1%, and a survival sensitivity of 95% provided 32% specificity. The ROC areas for the remaining two models are 73.8% and 87.4%, respectively, and the 95% survival sensitivity yielded 39% and 66% specificity rates, respectively.
Comparing at 95% sensitivity, all four artificial neural network models have higher specificity than the logistic models. Figure 2 shows the area-under-curve ROC graphs for the four neural network models.

Decision Tree Classification
An effective classification approach combines data mining tools with decision trees, assisting physicians and medical specialists with a clear and easy understanding of classification rules. False positive and false negative decisions can be minimized with the help of data mining methods (Quinlan, 1987).
A decision tree is a classification method, or classifier, used to assess an acceptable action. Root, internal (test), and leaf nodes form the core of a simple decision tree. The final decisions for the target class are obtained at the leaf nodes by performing split tests within the internal nodes. In complex cases, the leaf node contains the target value or a probability vector for the result (Zantema and Bodlaender, 2000).
A decision tree is usually composed of continuous or nominal attributes. When working with nominal attributes, one outcome is given per target value, while continuous attributes have two outcomes, one for each interval. Decision-makers typically prefer a less complex decision tree. Any path from root to leaf of the decision tree generates a rule by accumulating the tests along the path that grant the terminal node's class prediction (Friedman et al., 1996).
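The root-to-leaf rule extraction just described can be sketched with scikit-learn's `export_text`; the data set and feature names below are illustrative, not the SEER fields.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each printed root-to-leaf path is one classification rule:
# a conjunction of split tests ending in a class prediction.
rules = export_text(
    tree, feature_names=["age", "tumor_size", "stage", "duration"])
print(rules)
```

The printed tree makes the "clear and easy understanding of classification rules" concrete: every leaf line states the predicted class reached by the tests above it.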

Splitting Techniques
Decision trees also employ univariate splitting, i.e., the split at each internal node is centred on a single attribute. In splitting, the inducer searches for the best attribute at the internal node. The splitting procedures are used in numerous ways, depending on the tree's impurity measure and the level of the tree. For univariate splitting, the choice of splitting criterion does not affect the performance of the tree.
Multivariate splits are usually based on a linear combination of the input variables. Methods used to find optimal splits include the greedy search method (Breiman et al., 1984), linear programming (Duda et al., 2012), linear discriminant analysis (Friedman, 1977), and many more.
An alternative approach for optimizing the tree is to allow it to grow fully and then trim it back using pruning methods. Pruning methods aim to yield simple trees at a relatively small cost in accuracy. There are different pruning methods, such as cost-complexity pruning, reduced-error and minimal-error pruning, optimal pruning, etc.
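Cost-complexity pruning, one of the methods listed above, can be sketched with scikit-learn's `ccp_alpha` parameter; the data set and the choice of alpha are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

# Grow the tree fully, then compute the pruning path.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)

# Refit with a mid-range alpha: a larger alpha trades accuracy on the
# training data for a smaller, simpler tree.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
print("nodes:", full.tree_.node_count, "->", pruned.tree_.node_count)
```

In practice the alpha is chosen by validating candidate subtrees on held-out data, matching the grow-then-prune strategy described in the text.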

Survival Prediction using Decision Trees
The training and testing results (specificity and sensitivity along with 95% confidence intervals and overall accuracy) using both the CHAID and CRT methods for the four decision tree models are summarized in Table 6 and Table 7. Figure 3 shows the ROC graphs for the decision trees developed using the CHAID technique. From the results in Table 8, we notice a significant increase in the area under the curve for model-3 and model-4. There is approximately an 88% chance that model-4 can distinguish between positive and negative cases. Figure 4 shows the ROC graphs for the decision trees developed using the CRT technique. As in the CHAID models, the CRT-based models show a significant increase in the area under the curve for model-3 and model-4, with model-4 having an 87.4% chance of distinguishing between positive and negative cases. By default, CHAID uses multiway splits while CRT uses binary splits; this disparity affects even the decision tree structures. Segmentation and classification have various significant applications.
When the dependent variable is categorical, CHAID (Chi-square Automatic Interaction Detector) splits the tree based on chi-square tests. CRT, on the other hand, uses impurity reduction as its splitting measure. To produce a smaller tree rather than an exhaustive one, CHAID uses a forward stepwise stopping rule. CRT intentionally overfits and then uses validation data to prune back to the best and smallest tree.
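The impurity-reduction measure used by CRT can be illustrated with a small hand-worked Gini computation; the class counts below are invented for illustration.

```python
# CRT/CART chooses the split that maximizes the decrease in Gini impurity.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

parent = [1, 1, 1, 0, 0, 0]          # 3 survived, 3 not survived
left, right = [1, 1, 1, 0], [0, 0]   # a candidate binary split

# Impurity of the children, weighted by how many cases each receives.
weighted_child = (len(left) * gini(left)
                  + len(right) * gini(right)) / len(parent)
reduction = gini(parent) - weighted_child
print("parent Gini:", gini(parent), "reduction:", reduction)  # 0.5 and 0.25
```

CHAID would instead score candidate (possibly multiway) splits with a chi-square test of independence between the split and the class variable.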
Finally, if the aim is to identify or explain the relationship between a response variable and a collection of explanatory variables, one may prefer CHAID, while CRT is best suited for constructing a regression model. From this viewpoint, we conclude that for the survival classification of breast cancer in women, the CHAID decision tree performed marginally better than CRT. The central concern of implementing the various modelling applications in this paper is to determine which of the proposed techniques enhances predictive accuracy: even a small fraction of a percent of improvement can turn into substantial savings or increased revenues.
The performance of the logistic regression, artificial neural network, and decision tree models in this work is evaluated based on sensitivity, specificity, accuracy, and ROC area-under-curve values. Sensitivity is the proportion of true positives that the model correctly classifies; specificity is the proportion of true negatives that the model correctly classifies.
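These two definitions can be computed directly from a confusion matrix; the counts below are illustrative, not the paper's results.

```python
# Sensitivity (true-positive rate) and specificity (true-negative rate)
# from confusion-matrix counts.
def sensitivity_specificity(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)  # correctly classified positives
    specificity = tn / (tn + fp)  # correctly classified negatives
    return sensitivity, specificity

sens, spec = sensitivity_specificity(tp=820, fn=180, tn=760, fp=240)
print(sens, spec)  # 0.82 0.76
```

Reporting both matters here because survival (the positive class) dominates the data, so overall accuracy alone can hide poor specificity.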
In general, we compare the area under the ROC curve, which is an easy way to evaluate predictive binary classification models when the analyst or decision-maker has no knowledge of the cost or extent of classification errors. We compared the results of the four logistic models with those of the four artificial neural network models and the four decision tree models, respectively. Figure 5 and Figure 6 depict the comparison of specificity and overall accuracy of the developed models. Comparing the specificity and overall accuracy of all models, model-4 clearly stands out as the better model; models 1, 2, and 3 produced similar accuracy, specificity, and ROC values. Table 9 includes the performance assessment of the logistic, artificial neural network, and decision tree techniques. Compared with the logistic models, the neural network and decision tree techniques reported almost the same overall accuracy for correct classification of breast cancer survival in women. Nonetheless, the specificity of the logistic model results is higher than that of the neural network and decision tree models. The developed models are ranked in Table 9 based on their performance and accuracy in classification. For all four models, the area under the ROC curves of the decision tree methods is comparatively higher than that of the other two techniques. Table 10 provides the specifics of comparing the ROCs of the three different approaches, along with their ranking. These models predict survival from the attributes of a woman with breast cancer, including age, tumor size, cancer stage, treatment administered, and duration.
(International Journal of Mathematical, Engineering and Management Sciences, Vol. 5, No. 6, 1170-1190, 2020, https://doi.org/10.33889/IJMEMS.2020.5.6.089)
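The ROC-area comparison across model families can be sketched with scikit-learn; the data and model settings below are illustrative placeholders, not the paper's SEER models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 5))  # placeholder predictors
y = (X[:, 0] - X[:, 1] + rng.normal(0, 0.5, 600) > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.30, random_state=2)

for name, clf in [("logistic", LogisticRegression(max_iter=1000)),
                  ("decision tree", DecisionTreeClassifier(max_depth=4,
                                                           random_state=2))]:
    clf.fit(Xtr, ytr)
    # Rank models by area under the ROC curve on held-out data.
    auc = roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])
    print(name, "ROC area:", round(auc, 3))
```

Because the AUC summarizes the whole sensitivity-specificity trade-off, it requires no assumption about misclassification costs, which is why it is used as the headline comparison here.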
Four models with different attribute variables were developed and compared using logistic regression, ANN, and decision trees (using the CHAID and CRT algorithms). The degree of generalization (the precision of predictive ability) was determined, and the developed models are ranked by predictive ability in this order: Model-4, Model-3, Model-2, Model-1.
To maintain consistency in comparing the implemented models, we considered the accuracy of the three techniques when classifying breast cancer survival. Receiver operating characteristic (ROC) curves were plotted and the areas under the curves estimated. The results indicated that Model-4 yielded the best performance for the logistic regression, ANN, and decision tree methods, with classification accuracies of 79.98%, 82.31%, and 82.6%, respectively. We find no significant difference between the results of the models developed using the ANN and decision tree methods. However, the ANN method provided better specificity than the logistic and decision tree models at 95% sensitivity. In this study, the artificial neural network and decision tree methods reported a modest improvement in outcomes compared with logistic regression. Figure 7 compares the ROC curves of model-4 using all three techniques. We conclude from this analysis that the ANN and decision tree models have higher predictive ability than the logistic model. Figure 9 explains the steps involved in modelling a logistic regression. Figure 10 gives the steps for generating ROC curves for a logistic regression model. Figure 11 is a step-by-step breakdown of performing a neural network analysis along with generating ROC curves. Figure 12 has the steps involved in performing a decision tree analysis by both the CHAID and CRT techniques. Figure 13 gives the steps for generating ROC curves for a decision tree model.