Combining the Performance Strengths of the Logistic Regression and Neural Network Models: A Medical Outcomes Approach

The assessment of medical outcomes is important in the effort to contain costs, streamline patient management, and codify medical practices. As such, it is necessary to develop predictive models that will make accurate predictions of these outcomes. The neural network methodology has often been shown to perform as well as, if not better than, the logistic regression methodology in terms of sample predictive performance. However, the logistic regression method is capable of providing an explanation regarding the relationship(s) between variables. This explanation is often crucial to understanding the clinical underpinnings of the disease process. Given the respective strengths of the methodologies in question, the combined use of a statistical (i.e., logistic regression) and machine learning (i.e., neural network) technology in the classification of medical outcomes is warranted under appropriate conditions. The study discusses these conditions and describes an approach for combining the strengths of the models.


INTRODUCTION
The study of patient outcomes has become increasingly important in the effort to contain medical costs, streamline patient management, and codify practices. Such study has aided the recent efforts to implement disease management programs (i.e., the application of outcomes principles to the practices of healthcare providers). By being able to predict the likelihood of a patient event prior to its occurrence, a case manager may learn to forestall or delay the event.
To study the feasibility of such an approach, a patient population has been selected (Medicare beneficiaries suffering from congestive heart failure [CHF] with identifiable outcomes [mortality within a specified period after discharge]). We develop neural network models and a model based on the logistic regression method, compare these approaches, and recommend a possible combined approach.
Neural networks have been used extensively in many industries, including health care. They have, for instance, been the subject of extensive study and application in biomedicine and have been used to create diagnostic aids, analyze medical images, and develop drugs. However, their use in the management of health care has been limited, so we investigate that possibility here.
For instance, issues such as the quality and management of patient care, the allocation of scarce resources, and the funding and reimbursement of institutional and provider services have dominated our discussions of the health care delivery system for the last 20 years. Since the formulation, analysis, and hoped-for conclusion of these issues depend on the accurate portrayal of patient and administrative data, it is imperative that such data be efficiently and accurately mined for their contribution to these issues. This study tests the presumption that the use of neural networks to predict patient outcomes is valid and that its accuracy is comparable to that of conventional statistical approaches.

Neural Nets
A neural network (NN) is a software- and/or hardware-based system of interconnected nodes (or neurons) that "learns" by modifying the connection strengths (or weights) between its elements in order to match the input-output behavior of the network to the process or system being modeled [1]. NNs are frequently composed of many computational elements operating in parallel. These computational nodes or neurons are connected via weights that are typically adapted to improve the performance of the network.
NN models are developed via training, the process of weight adaptation as prescribed by a set of well-defined rules [2]. The most common forms of learning include: (1) supervised learning (e.g., backpropagation) and (2) unsupervised learning (e.g., Kohonen self-organizing NN). Supervised learning requires the pairing of an input vector with a target vector (a training pair). With supervised learning, an NN is trained so that the application of a set of vector inputs will produce the desired set of vector outputs. Often, the NN "learns" through the minimization of error between the actual and estimated outputs. This occurs over many training iterations (epochs) since: (1) the initial weights are unlikely to provide the desired outputs and (2) many training pairs may be presented to a network. Incremental adjustments are made to the weights of a network until they gradually converge on an optimal set of values. This paper compares the performance of two supervised learning techniques. The training algorithms include: (1) multilayer perceptron (MLP) or backpropagation and (2) radial basis function (RBF).
The backpropagation method [3] is a very popular training algorithm in which an input vector is disseminated through a network in a forward fashion and the corresponding target output (actual) is compared with the network-derived output (estimated). Typically, the difference between the actual and the estimated outputs is calculated and the weights are adjusted so as to minimize the sum-squared error. The modified errors are propagated back to the preceding nodes and layers (excluding the input layer), whereby the attached weights are adjusted to further minimize the sum-squared error. This process is repeated for the entire training set or matrix. As such, the input vectors are applied sequentially, errors calculated, and weights adjusted for each vector until either the error for the entire training set is at an acceptably low level (determined by the investigator) or a predetermined training duration has been reached.
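As a minimal sketch of the procedure just described (not the software used in the study), the following pure-Python example trains a network with one hidden layer by propagating each input forward, comparing the estimated output with its target, and propagating the error backward to adjust the weights. The toy training pairs, layer size, learning rate, and function names are illustrative assumptions, and plain gradient descent on the squared error is used for simplicity.

```python
import math
import random

def train_mlp(pairs, n_hidden=3, lr=0.1, epochs=500, seed=0):
    """Train a one-hidden-layer network by backpropagation.

    Hidden nodes use tanh; the single output node is linear.
    Returns (sse_before, sse_after, predict).
    """
    rng = random.Random(seed)
    n_in = len(pairs[0][0])
    # w_h[j] holds the input weights of hidden node j, with the bias last.
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
           for _ in range(n_hidden)]
    # w_o holds the hidden-to-output weights, with the bias last.
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(x):
        h = [math.tanh(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1])
             for ws in w_h]
        y = sum(w * v for w, v in zip(w_o[:-1], h)) + w_o[-1]
        return h, y

    def sse():
        return sum((t - forward(x)[1]) ** 2 for x, t in pairs)

    sse_before = sse()
    for _ in range(epochs):
        for x, t in pairs:                  # apply training pairs sequentially
            h, y = forward(x)
            err = y - t                     # estimated minus actual output
            # Error propagated back to each hidden node (tanh' = 1 - h^2),
            # computed before the output weights are changed.
            delta_h = [err * w_o[j] * (1.0 - h[j] ** 2)
                       for j in range(n_hidden)]
            for j in range(n_hidden):       # adjust output-layer weights
                w_o[j] -= lr * err * h[j]
            w_o[-1] -= lr * err
            for j in range(n_hidden):       # adjust hidden-layer weights
                for i in range(n_in):
                    w_h[j][i] -= lr * delta_h[j] * x[i]
                w_h[j][-1] -= lr * delta_h[j]
    return sse_before, sse(), lambda x: forward(x)[1]

# Hypothetical training pairs (logical OR), used only to exercise the code.
toy_pairs = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
             ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
before, after, predict = train_mlp(toy_pairs)
print(f"sum-squared error before: {before:.3f}, after: {after:.3f}")
```

Training stops here after a fixed number of epochs; an error threshold chosen by the investigator could be added as a second stopping rule.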
The RBF network [4] is a supervised, feed-forward NN with a single hidden layer. Unlike multilayer networks, which transform a weighted summation of inputs, radial basis networks determine the outputs (each of which represents a basis function) of the hidden layer by measuring the distance between the network input and the center (or centroid) of each RBF.

Logistic Regression
Logistic regression can be characterized as a modeling approach, which can be used to describe the relationship of several independent variables to a dichotomous dependent variable [5]. As such it lends itself to the classification of a dichotomous outcome such as the presence and occurrence or absence and nonoccurrence of a disease or event.
The logistic function, $f(z) = 1/(1 + e^{-z})$, is a squashing function that transforms an input with a value between $-\infty$ and $+\infty$ into an output in the range [0,1]. The function $f(z)$ indicates the risk for the presence or absence of a disease or event, where $z = b + w_1 p_1 + w_2 p_2 + \dots + w_k p_k$ is an index of combined risk factors. Since $f(z)$ ranges from 0 to 1 as $z$ ranges from $-\infty$ to $+\infty$, it can be used to describe the probability or risk of an event or disease occurring.
The S-shaped curve of the logistic function provides a mechanism to consider the impact of a threshold on the likelihood of an outcome or risk for disease. The shape of the curve indicates that the effect of $z$ is negligible as $z$ tends to $-\infty$. However, as $z$ increases, its impact increases until a threshold is reached, so the risk or likelihood of the event (disease) rises rapidly over an intermediate range of values and then remains extremely high as $z$ tends to $+\infty$.
The logistic function can thus be converted to a logistic model, $P(D = 1 \mid x_1, x_2, \dots, x_k)$, whereby $D = 1$ represents the occurrence of the disease and the predictors $x_1, \dots, x_k$ play the role of the inputs $p_1, \dots, p_k$ in $z$.
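The mapping from the risk index z to a probability can be sketched in a few lines of Python. The coefficient and predictor values below are hypothetical and chosen only to show the calculation; the function names are not taken from the study.

```python
import math

def logistic(z):
    """Logistic (squashing) function f(z) = 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)              # avoids overflow for large negative z
    return ez / (1.0 + ez)

def risk(b, weights, predictors):
    """P(D = 1 | predictors) for the index z = b + sum(w_i * p_i)."""
    z = b + sum(w * p for w, p in zip(weights, predictors))
    return logistic(z)

# Hypothetical coefficients and predictor values, for illustration only.
print(risk(-2.0, [0.8, 1.5], [1.0, 0.0]))   # here z = -1.2
```

Because f(z) is monotonic in z, a positive weight raises the predicted risk as its predictor increases and a negative weight lowers it.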

Congestive Heart Failure
CHF is described as a disease process associated with profound symptoms and a poor long-term prognosis [6,7]. Its symptoms are characterized by abnormalities of the left and right ventricular function and are generally accompanied by changes in neurohormonal regulation, effort intolerance, fluid retention, and decreased survival. CHF is neither rare nor benign. It is often terminal, occurring after all reserve capacity and compensatory mechanisms of the myocardium and peripheral circulation have been exhausted. For many patients, the predominant symptom of CHF is a reduction of functional capacity due to poor exercise tolerance resulting from limited cardiac reserve [8,9].

DATA SOURCES
The abstract patient data used for model development are provided by Louisiana Health Care Review (LHCR). Note: all unique patient and provider identifiers have been deleted from the data set and replaced with integers randomly assigned to each individual. The data form a subset of the Medicare (Part A) patient discharge database (fiscal 1990 to 1994) and include all Louisiana hospital beneficiaries who are residents of the state.
Because the data set contains longitudinal patient data, it can be used to track a patient's passage through the prospective payment system. The data set contains all admission records (CHF or otherwise) of Louisiana patients with at least one principal diagnostic code (the admitting diagnosis) of CHF (n = 53,289). In addition, a smaller clinical data set is used to supplement the analysis. The clinical data set contains information from the LHCR's statewide baseline review. There are 1,068 records within the clinical data set. A listing of the variables included within the data sets is provided in Appendix A.
The study population includes all Medicare (Part A) beneficiaries who were admitted to a Louisiana hospital during the fiscal years 1991 through 1994 (October 1, 1990, through September 30, 1994). The study sample includes all Louisiana residents (within the study population) with at least one recorded episode (a principal diagnostic code) of CHF.
The definition of CHF (as provided by LHCR) includes the principal diagnostic codes shown in Table 1. It should be noted that the definition differs from the New York Heart Association's (NYHA) classification of CHF [10].

MODEL DEVELOPMENT
The study design is retrospective, correlational, and nonexperimental. We first surveyed the literature to identify a preliminary set of independent variables.

Training, Testing, and Validation of Logistic Regression Model
In this phase, outcome or dependent variables were created that identify the mortality of a beneficiary within 90 and 365 days of discharge. The outcome variables are dichotomous and identify survival or death within a specified period after discharge. The variables were coded 0 (event did not occur) and 1 (event did occur), respectively, i.e., patient did not die within a specified time period vs. patient did die within a specified time period after discharge. Where appropriate, both the claims and clinical data sets were linked to create a combined data set (n = 1,024). In addition, dummy variables were developed for race (White, Black), principal diagnosis (CHF, Unspecified with CHF), principal procedure (No Procedure, Operations on the Cardiovascular System), left ventricular hypertrophy from EKG and past MI from EKG. Dummy variables were also created for EKG-related categories in order to account for instances where an EKG was not administered. Interaction terms for the sociodemographic variables age, sex, race, and Metropolitan Statistical Area (MSA) were also developed as possible candidates for the development of the logistic regression model(s).
Two subsets of the database were created for the training and testing of the logistic regression models. The training set was used explicitly for model development, while the testing set was used to compare the predictive performances of the various (90- and 365-day) logistic regression and NN models.
In order to develop the models, candidate variables of significance first had to be identified. Tests for association between the outcome and the independent variables were therefore performed using the chi-square test (categorical variables) and the Student t test (continuous variables). A significance level of 0.2 was chosen for each test; results are shown in Tables 2 and 3, respectively.
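The screening step for a categorical predictor can be sketched as follows. The example assumes a dichotomous predictor cross-tabulated against the outcome (a 2 x 2 table) and a Pearson chi-square without continuity correction; the counts are hypothetical, and the study does not state whether a correction was applied.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value); with 1 degree of freedom the upper-tail
    probability is erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Hypothetical counts: rows = predictor absent/present, columns = survived/died.
stat, p = chi_square_2x2(70, 30, 50, 50)
is_candidate = p < 0.2        # screening threshold used in the study
print(f"chi-square = {stat:.3f}, p = {p:.4f}, candidate = {is_candidate}")
```

A liberal threshold such as 0.2 keeps weakly associated variables in the candidate pool so that the stepwise procedure, rather than the univariate screen, decides their fate.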
The development of the "final" model was a three-step process involving the creation of two intermediary and a conclusive logistic regression model. A forward (conditional) stepwise approach was applied to the development of the models. The entry of variables was based on the significance of the score statistic (<0.05). The removal (or exit) of variables was based on the significance (<0.1) of the likelihood ratio statistic (using conditional parameter estimates). The method of contrast was the "Indicator" method, which indicates the absence or presence of membership within a category. The reference group is by default the last category. The first intermediary model was generated by "entering" all first-order terms in Block 1 of the logistic regression model. All second-order terms were included in Block 2 of the model. A forward (conditional) stepwise approach was applied to Block 2 of the model (entry: 0.05; removal: 0.1). The second intermediary model was generated by "entering" in Block 1 all first- and second-order terms relevant to Block 2 of the previous or first intermediary model. The remaining first-order terms were included in Block 2 of the model. A forward (conditional) stepwise approach was applied to Block 2 of the model (entry: 0.05; removal: 0.1).
The "final" or conclusive model was generated by entering: (1) all first- and second-order terms from Block 1 of the second intermediary model and (2) all second-order terms relevant to Block 2 of the second intermediary model (entry: 0.05; removal: 0.1). Probabilities for the occurrence (or nonoccurrence) of the event were developed for the beneficiaries within the testing set and receiver operating characteristic (ROC) curves were used to identify the cut-off point. In addition, cross-tabulations were used to compare the predicted outputs of the logistic regression models and the actual occurrences of the event.
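Indicator contrasts with the last category as the reference group can be illustrated with a short sketch. The category labels and the helper name are hypothetical; only the coding rule (membership coded 1, reference category coded as all zeros) follows the description above.

```python
def indicator_code(values, categories):
    """Indicator (dummy) coding with the last category as the reference group.

    Each value becomes a list of len(categories) - 1 zeros and ones; the
    reference category is coded as all zeros.
    """
    non_reference = categories[:-1]
    return [[1 if v == c else 0 for c in non_reference] for v in values]

# Hypothetical three-level variable; "none" is listed last, so it is the reference.
levels = ["present", "absent", "none"]
coded = indicator_code(["absent", "none", "present"], levels)
print(coded)   # [[0, 1], [0, 0], [1, 0]]
```

Each coefficient estimated for an indicator column is then interpreted relative to the reference category.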

Development of the Neural Network
Predictors identified by the previous models were used to create the NNs. They provide the input layer of the NN; the output node comprises the patient outcome (it is binary). The following training algorithms were applied to the training set: (1) MLP or backpropagation and (2) RBF. The choice of activation functions, number of hidden layers and nodes, and the final mean-squared error at termination were recorded. Random initialization of the weights was performed during model development. As with the development of the logistic regression models, the probabilities of the occurrence (or nonoccurrence) of the event were developed for the beneficiaries within the testing set and ROC curves were used to identify the probability cut-off point. Again, cross-tabulations compared predicted outputs of the neural network models and actual event occurrences.

COMPARING PERFORMANCES OF LOGISTIC REGRESSION AND NEURAL NETWORK MODELS
The models were used to predict outcomes within the testing set. All ROC curves were reviewed and analyzed in order to compare the predictive performances of the various logistic regression and NN models. The measures of comparison include the following: overall accuracy, sensitivity, specificity, and Area Under Curve (Az).
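The measures of comparison can be sketched as follows for a given probability cut-off, together with a simple scan for the cut-off that maximizes overall accuracy. The predicted probabilities and outcomes are hypothetical, and the convention that a probability at or above the cut-off predicts the event is an assumption.

```python
def classification_measures(probs, outcomes, cutoff):
    """Accuracy, sensitivity, and specificity when P >= cutoff predicts the event."""
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= cutoff and y == 1)
    fn = sum(1 for p, y in zip(probs, outcomes) if p < cutoff and y == 1)
    tn = sum(1 for p, y in zip(probs, outcomes) if p < cutoff and y == 0)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= cutoff and y == 0)
    return {
        "accuracy": (tp + tn) / len(outcomes),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def best_cutoff(probs, outcomes):
    """Scan the observed probabilities; ties go to the lowest cut-off."""
    return max(sorted(set(probs)),
               key=lambda c: classification_measures(probs, outcomes, c)["accuracy"])

# Hypothetical predicted probabilities and outcomes (1 = died, 0 = survived).
probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
cut = best_cutoff(probs, outcomes)
print(cut, classification_measures(probs, outcomes, cut))
```

Maximizing accuracy alone can favor the majority class when events are rare, which is why sensitivity and specificity are reported alongside it.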

Logistic Regression Model
We used several tests to evaluate the 90- and 365-day logistic regression models. They include the likelihood, the Hosmer-Lemeshow test, ROC curves, and a z statistic. We first present a detailed discussion and results for the 90-day model and then a synopsis of the 365-day results.

90-Day Model
A sample of 484 cases was used to develop the 90-day logistic regression model. The variables (including interaction terms) that were identified as significant (p < 0.05) to the prediction of death within 90 days of discharge are shown in Table 4A. The variable age was also included in order to develop the interaction terms: age by absence of left ventricular hypertrophy (EKG); age by presence of left ventricular hypertrophy (EKG). The probability of the observed results (given the parameters of the model) is identified as the likelihood. Since the likelihood is less than 1, it is customary to use -2 multiplied by the log of the likelihood, or -2 log likelihood (-2LL), as a measure of model fit. The likelihood of a perfect fit for a model (to its data set) is 1 (the value of -2LL is 0). The -2LL for the 90-day logistic regression model (constant only) is 452.50. The -2LL for the model is 338.83; as the model improves, the value of -2LL decreases.
The model chi-square measures the difference between the -2LL for the model with constant only and the -2LL for the current model. It tests the null hypothesis that the coefficients for variables added at the last step are 0. Note: this measure is comparable to the overall F statistic in linear regression. The model chi-square for the 90-day logistic regression model is 113.67 with 12 degrees of freedom. Since the test statistic is significant (p < 0.0001), we can reject the null hypothesis.
Goodness of fit measures indicate how well a model fits the data. We therefore next used the Hosmer-Lemeshow test, which is a variant of the goodness of fit statistic [11]. It tests the null hypothesis that the model fits the data; if a model fits its data well, the difference between the observed and predicted values based on the fitted model will be small and (as a consequence) the goodness of fit statistic will be nonsignificant. The Hosmer-Lemeshow statistic for the 90-day model is 8.928 with p > 0.3, where a small p value (p < 0.1) would have indicated a poor fit between the observed and expected outcomes, as seen in Table 5A. Table 6A shows the distribution of probabilities for the test.
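The relationship between -2LL and the model chi-square can be checked with a short sketch. The first call uses the two -2LL values reported for the 90-day model; the outcomes in the second part are hypothetical and only show how -2LL is computed from predicted probabilities.

```python
import math

def minus_two_log_likelihood(probs, outcomes):
    """-2 * log likelihood of binary outcomes given predicted probabilities."""
    ll = sum(math.log(p) if y == 1 else math.log(1.0 - p)
             for p, y in zip(probs, outcomes))
    return -2.0 * ll

def model_chi_square(m2ll_constant_only, m2ll_model):
    """Improvement in fit: -2LL (constant only) minus -2LL (current model)."""
    return m2ll_constant_only - m2ll_model

# Values reported for the 90-day model.
print(round(model_chi_square(452.50, 338.83), 2))   # 113.67

# Hypothetical outcomes: a constant-only model predicts the event rate for everyone.
outcomes = [1, 0, 0, 1, 0, 0, 0, 1]
rate = sum(outcomes) / len(outcomes)
print(minus_two_log_likelihood([rate] * len(outcomes), outcomes))
```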
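The p value attached to the Hosmer-Lemeshow statistic can be reproduced with a small sketch. The grouped form of the statistic below is the standard one and is an assumption about how the study's software computed it; the upper-tail formula is the closed form for a chi-square distribution with an even number of degrees of freedom.

```python
import math

def hosmer_lemeshow(observed, expected, group_sizes):
    """Hosmer-Lemeshow statistic from per-group event counts.

    observed[g] and expected[g] are the observed and model-expected numbers of
    events in group g; non-events enter through the group size.
    """
    stat = 0.0
    for o, e, n in zip(observed, expected, group_sizes):
        stat += (o - e) ** 2 / e + ((n - o) - (n - e)) ** 2 / (n - e)
    return stat

def chi_square_upper_tail(x, df):
    """P(chi-square with df degrees of freedom > x); closed form, even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch handles positive even df only")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total

# Statistic reported for the 90-day model, with 8 degrees of freedom.
print(round(chi_square_upper_tail(8.928, 8), 4))
```

With ten groups the statistic has 8 degrees of freedom (groups minus 2), and the value obtained for 8.928 is close to 0.348, consistent with the reported p > 0.3.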
An ROC curve [12] was generated to assess the predictive performance of the 90-day logistic regression model against the testing set of 294 cases (see Fig. 1A). Cut-offs that maximized the overall accuracy of the model were identified and, using a cut-off of 0.10, the accuracy of the [...]
[Table note: The test statistic follows a chi-square distribution with 8 df; p-value = 0.348389, indicating good fit. A small p-value (< 0.10) would indicate expected counts far from observed, i.e., a poor fit. To get the p-value, use CHIDIST(sum of (O-E)^2/E, degrees of freedom) = CHIDIST(9.49372, 8).]
[...] equal sample size, cases in same order) utilizes the degree of correlation of the curves in order to increase the statistical power of the comparison [13]. Both approaches were used to compare the performances of the various models under consideration. The Mann-Whitney U statistic has a well-characterized distribution that approaches normality as sample size increases. This feature enables the use of the z statistic to determine the degree to which the test or predictive model is superior to the "line of no information" (Az = 0.5). It tests the null hypothesis that the observed Az was obtained by chance. One- and two-tailed p values are then generated by way of the z statistic.
A z statistic was calculated by dividing the difference between the Az and the line of no information by the standard error of the Mann-Whitney U statistic in order to determine the degree to which the model is superior to the line of no information. It tests the null hypothesis that the Az was obtained by chance. One- and two-tailed p values were then generated by way of the z statistic. The Az for the 90-day logistic regression model was 0.55 (one-tail p: 0.1237; two-tail p: 0.2474). Since the test statistic is nonsignificant, we cannot reject the above null hypothesis (see Table 7A for detail).
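The z statistic can be sketched as follows. The study does not state which formula was used for the standard error; the Hanley-McNeil approximation below is an assumption, as are the hypothetical probabilities and outcomes and the function names.

```python
import math

def area_under_curve(probs, outcomes):
    """Az as the Mann-Whitney probability that an event case outranks a non-event case."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg)), len(pos), len(neg)

def z_against_chance(az, n_pos, n_neg):
    """z = (Az - 0.5) / SE, using the Hanley-McNeil standard error (assumed here).

    Returns (z, one_tail_p, two_tail_p). Not defined for Az = 1 (SE is 0).
    """
    q1 = az / (2.0 - az)
    q2 = 2.0 * az ** 2 / (1.0 + az)
    var = (az * (1.0 - az)
           + (n_pos - 1) * (q1 - az ** 2)
           + (n_neg - 1) * (q2 - az ** 2)) / (n_pos * n_neg)
    z = (az - 0.5) / math.sqrt(var)
    one_tail = 0.5 * math.erfc(z / math.sqrt(2.0))   # upper tail of the normal
    two_tail = min(1.0, 2.0 * min(one_tail, 1.0 - one_tail))
    return z, one_tail, two_tail

# Hypothetical predicted probabilities and outcomes (1 = died, 0 = survived).
probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
az, n_pos, n_neg = area_under_curve(probs, outcomes)
z, one_tail, two_tail = z_against_chance(az, n_pos, n_neg)
print(f"Az = {az:.4f}, z = {z:.2f}, one-tail p = {one_tail:.6f}")
```

The two-tailed p value is twice the one-tailed value, which matches the pattern of the reported pairs (for example, 0.1237 and 0.2474).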

365-Day Model
Our analysis follows that for the 90-day model above. A sample of 485 cases was used to develop the 365-day logistic regression model, and the variables identified as significant (p < 0.05) to the prediction of death within 365 days of discharge are in Table 4B. Again, the variable age was also included in order to develop the interaction terms. Here the -2LL for the 365-day logistic regression model (constant only) is 609.26 and the -2LL for the model is 494.77. The model chi-square for 365 days is 114.50 with 14 degrees of freedom. Since this test statistic is significant (p < 0.0001), we can reject the null hypothesis, as for 90 days.
The Hosmer-Lemeshow statistic is now 7.06 with p > 0.5; again, a small p value (p < 0.1) would have indicated a poor fit between the observed and expected outcomes (Table 5B). Goodness of fit results are in Table 6B.
The ROC curve assessed the 365-day logistic regression model against the testing set of 256 cases (see Fig. 1B). With a cut-off of 0.41, the overall accuracy of the model was 57% (sensitivity: 70%; specificity: 50%).
One- and two-tailed p values were generated again and the Az for the 365-day logistic regression model was 0.60 (one-tail p: 0.0048; two-tail p: 0.0096). Since the test statistic is significant, we can reject the null hypothesis. Table 7B illustrates this in contrast to the 90-day case.

Neural Network Models
To develop the models, the variables significant (p < 0.05) to the prediction of mortality within 90/365 days of discharge were used as input to develop the respective MLP and RBF models (Table 8A). A training set of 152 cases was used to train the 90-day NN models and 285 cases for the 365-day models. A validation set of 18 cases was used to cross-validate the performance of the models during training for the 90-day model and 32 for the 365-day model.
Two layers (including the hidden and output layers) were used to develop the MLP or backpropagation network. The number of nodes within each layer was selected automatically. The input layer has 18 nodes and the hidden layer 4 for the 90-day model; the corresponding numbers for the 365-day model are 21 and 5. The activation function for the hidden layer is a tangent sigmoid and the activation function for the output layer is linear.
[Table note: The test statistic follows a chi-square distribution with 8 df; p-value = 0.529989, indicating good fit. A small p-value (< 0.10) would indicate expected counts far from observed, i.e., a poor fit.]

90-Day Model
A total of 86 weights were generated by the 90-day MLP model. The conjugate gradient descent method was used to alter nodal weights during training; the initialization of weights was random.
The conjugate gradient method measures the gradient of the error surface after each backward and forward propagation of the model. It then adjusts nodal weights in order to minimize the mean-square error. After training of the NN model was terminated, the root mean square (RMS) error was recorded. The RMS error for the 90-day MLP network was 0.52.
Two layers (including the hidden and output layers) were used to develop the 90-day RBF network. The number of nodes within the input layer totals 18. The activation function for the output layer is a spline function whereby $f(x) = d^2 \log d$ (where d = distance of x from a centroid). The positional strategy for the initial centroids was based on a sampling of data points and the distance measure in use was Euclidean. A total of 30 centroids (reflecting the optimal number of centers within the data set) were generated by the 90-day RBF model. Note: the number of centroids chosen can range from 5 to 50. The RMS error for the 90-day RBF network was 0.62.
An ROC curve was again used to assess the performance of the 90-day MLP model (see Fig. 2A). Using a cut-off of 0.82, the accuracy of the model was 72% (sensitivity: 41%; specificity: 78%). The Az for the 90-day MLP network was 0.55 (one-tail p: 0.1397; two-tail p: 0.2795). Since the test statistic is nonsignificant, we cannot reject the null hypothesis (see Table 7A for details). For the 90-day RBF model (see Fig. 2B), with a cut-off of 0.51, the accuracy was 62% (sensitivity: 68%; specificity: 61%). The Az was 0.65 (one-tail p: 0.0006; two-tail p: 0.0012); since the test statistic is significant, the null hypothesis is rejected (Table 7A).
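The spline basis function and the Euclidean distance to each centroid can be sketched as follows. The input vector and centroids are hypothetical, the natural logarithm is assumed (the base is not stated), and the value at d = 0 is set to the function's limiting value of 0.

```python
import math

def thin_plate_spline(d):
    """Basis function f = d^2 * log(d), taken as 0 at d = 0 (its limiting value)."""
    return 0.0 if d == 0.0 else d * d * math.log(d)

def rbf_outputs(x, centroids):
    """One basis-function output per centroid, from the Euclidean distance to x."""
    return [thin_plate_spline(math.dist(x, c)) for c in centroids]

# Hypothetical two-dimensional input and centroids.
print(rbf_outputs([1.0, 0.0], [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]))
```

The three distances here are 1, 0, and 2, so the first two outputs are 0 and the third is 4 log 2.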

365-Day Model
The significant variables for the 365-day model are shown in Table 8B; again, age was included in order to develop the interaction terms. A total of 122 weights were used for the MLP model, again with conjugate gradient descent for training and random initialization of weights. The RMS error after training was 0.61.
As for the 90-day example, two layers were used to develop the 365-day RBF network with a similar approach. A total of 20 centroids were generated and the RMS error here was 0.59.
The ROC curve for the 365-day MLP model is shown in Figure 3A. With a cut-off of 0.58, the accuracy of the model was 72% (sensitivity: 57%; specificity: 80%). The Az was 0.69 (one-tail p < 0.0001; two-tail p < 0.0001) and the null hypothesis was rejected due to the significance of the test statistic (Table 7B).
For the RBF model, the ROC curve is shown in Figure 3B. With a cut-off of 0.53, the model's accuracy was 67% (sensitivity: 58%; specificity: 72%) and the Az was 0.67 (one-tail p < 0.0001; two-tail p < 0.0001). Again the test statistic is significant, and we can reject the null hypothesis, as seen in Table 7B.

Comparison of Logistic Regression and Neural Network Models
A comparison of the 90-day logistic regression and NN models reveals the following: (1) we accept (do not reject) the null hypothesis that no difference exists between the Az of the logistic regression and MLP network (chi-square = 0.0038; df = 1; one-tail p = 0.4753; two-tail p = 0.9506); (2) we reject the null hypothesis that no difference exists between the Az of the logistic regression and RBF network (chi-square = 2.2177; df = 1; one-tail p = 0.0682; two-tail p = 0.1364).
Similarly for the 365-day cases we see: (1) we reject the null hypothesis that no difference exists between the Az of the logistic regression and MLP network (chi-square = 5.8766; df = 1; one-tail p = 0.0077; two-tail p = 0.0153); (2) we reject the null hypothesis that no difference exists between the Az of the logistic regression and RBF network (chi-square = 4.2225; df = 1; one-tail p = 0.0200; two-tail p = 0.0399).
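The p values quoted for these comparisons can be recomputed from the reported chi-square statistics. The sketch below assumes only that each statistic has 1 degree of freedom, as stated, and that the one-tailed value is half the two-tailed value; the function name is illustrative.

```python
import math

def p_values_df1(chi_square):
    """(one-tail, two-tail) p for a chi-square statistic with 1 degree of freedom."""
    two_tail = math.erfc(math.sqrt(chi_square / 2.0))
    return two_tail / 2.0, two_tail

# Chi-square statistics reported for the Az comparisons.
reported = {
    "90-day LR vs MLP": 0.0038,
    "90-day LR vs RBF": 2.2177,
    "365-day LR vs MLP": 5.8766,
    "365-day LR vs RBF": 4.2225,
}
for label, stat in reported.items():
    one_tail, two_tail = p_values_df1(stat)
    print(f"{label}: one-tail p = {one_tail:.4f}, two-tail p = {two_tail:.4f}")
```

The recomputed values agree with the reported p values to within the rounding of the chi-square statistics.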
Based on the findings for the study population, we accept (do not reject) the assertion stating that NNs represent a valid approach to predicting patient outcomes. Given the parameters of the study, it has also been shown that NNs can outperform the logistic regression model in terms of sample prediction. Table 9 provides a comparison of the sensitivity, specificity, positive predictive value, and negative predictive value of the various models.

Limitations
The study has determined that: (1) NNs represent a valid approach to predicting the mortality of the study population within a specified period after discharge and (2) NNs are able to outperform the logistic regression in terms of sample prediction.
Several limiting factors should be taken into account when considering the results of this study. These limitations relate to the study's methodological approach, the paucity of clinical predictors, the reliance on default parameters in the development of the NN models, and the nature of the disease process identified as CHF.
For example, the study's approach to the development of the logistic regression model carries inherent limitations that may hinder its performance. It should be noted that a logistic regression model forces a linear relationship between a continuous predictor and the predicted log odds of its outcome variable. Information is lost when a nonlinear relationship is thus forced into a linear relationship. Since the logistic regression models contain continuous predictors that may harbor a nonlinear relationship with their outcome variable, the threat of such a loss must be acknowledged. If a nonlinear relationship does exist between a continuous predictor and its outcome variable, it is advisable to capture it by converting the continuous variable into categorical variables. The paucity of clinical predictors is another limiting component of model development. Many of the variables available simply had too many missing values to be of use. Note: one example includes the numerous predictors that pertain to ejection fraction or its subjective assessment.
The predictors eventually selected via model development include historical (pertaining to the presence or absence of a condition), radiographic (pertaining to the presence or absence of a condition or a treatment regimen), and laboratory data (pertaining to BUN, creatinine, and potassium values); it should be noted that these values have the advantage of simplicity and availability. Little information was captured with respect to ejection fraction (a variable widely thought to be an indicator of poor health within a population suffering from CHF) or its corollaries.
The reliance on default parameters in the development of the NN models is another limiting factor of model development. In order to enforce a uniformity of approach to network development (and thus support the study's comparative efforts), it was decided to rely only on the default parameters available to the software program. As such, the optimization of each NN model was sacrificed for uniformity. It is anticipated that individual efforts to optimize the performance of each network will result in a marked improvement of their performances. This effort is recommended if the models are to be used within an applied setting.
The nature of CHF is another limiting component of model development. Rightly identified as a syndrome, its lack of specificity is reflected in the data and may have hindered model development.

CONCLUSIONS
Neural networks represent a broad class of nonlinear regression models (e.g., MLP or backpropagation network), discriminant models (e.g., ADALINE), and data reduction and nonlinear dynamical systems (e.g., Kohonen self-organizing network) [14]. They are oftentimes similar or identical to popular statistical techniques (e.g., generalized linear models, principal components, cluster analysis, polynomial regression), particularly if the emphasis is on the prediction, rather than the explanation, of outcomes. Such similarities may account for their strengths. As such, they have much to offer by way of data analysis and classification.
It has been shown that, given the parameters of the study and its sample, the NN methodology is superior to the logistic regression methodology in terms of sample prediction performance. It must be noted, however, that the use of the logistic regression methodology (or statistical methods in general) should not be overlooked. On the contrary, the logistic regression methodology is capable of providing an explanation (either intuitive or explicit) regarding the relationship(s) between: (1) the outcome and the independent variables (multivariate analyses) and (2) the dependent variables (bivariate analyses). Statistical software programs can also be used to produce confidence intervals, prediction intervals, diagnostics, and various graphical displays, features that are rarely provided by NN technologies [14].
The study therefore recommends that the combined use of statistical and machine learning technologies in the classification of medical/patient outcomes is warranted when: (1) information regarding the relationship(s) between variables is unclear, i.e., the exercise of producing a logistic regression is helpful to the understanding of relationships (both statistical and clinical) between variables; and (2) improving the predictive performance of a model is critical for clinical or disease management purposes. It also recommends an approach that is similar to the study's methodology.
The study recommends that future research efforts should include the application of NN technology within a patient or disease management setting. In addition, other candidates for further research may include: (1) the prediction of continuous outcomes from recorded healthcare data, (2) the prediction of discrete outcomes from transient (or real time) healthcare data, and (3) the prediction of continuous outcomes from transient (or real time) healthcare data.