Use of Logistic Regression Approach to Determine the Effective Factors Causing Renal Failure Disease

The main goal of this research is to determine the impact of several variables believed to be important causes of renal failure disease, using the logistic regression approach. The study includes eight explanatory variables and a binary response variable (infected, uninfected). The statistical program SPSS is used to perform the required calculations.
Keywords: Renal failure disease, logistic regression, Wald test, Hosmer and Lemeshow test.


Introduction
Regression models are among the most popular statistical models. One type of regression model is logistic regression. It is suitable when the response variable is binary, while the explanatory variables may be categorical or continuous. In practice, situations involving categorical outcomes are quite common. In a medical setting, for example, an outcome might be the presence or absence of a disease. Logistic regression is based on the logit transformation of the dependent variable. This transformation is necessary because dichotomous dependent data violate the assumptions of least squares; furthermore, the error terms are not normally distributed, so tests that rely on normality become invalid [1].

Assumptions of Logistic Regression
The main difference between linear regression analysis and logistic regression analysis is that a normally distributed dependent variable and homogeneity of variance are not required in logistic regression analysis. Probabilities and the nature of the log curve are the basis of the theory underlying logistic regression. The only assumptions of logistic regression are that the resulting logit transformation is linear, that the resultant logarithmic curve does not include outliers, and that the dependent variable is categorical with mutually exclusive and exhaustive categories, so that every case is a member of exactly one category [2].

Types of Logistic Regression
Two types of logistic regression can be identified, namely binary and multinomial logistic regression. Binary logistic regression is a predictive model that can be used when the categorical response variable consists of two categories, for example live or die, or presence or absence of a disease. Multinomial logistic regression is an extension of the binary type that allows the response variable to include more than two categories; however, the focus of this paper is on binary logistic regression.

The Logit Model
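Assume that $y_1, y_2, \ldots, y_n$ denote n independent observations on a binary response variable y, and let $\pi_i$ be the probability that $y_i = 1$. Then, for p explanatory variables $x_{i1}, \ldots, x_{ip}$, the logit model is defined as:

$$\ln\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \qquad (1)$$

Solving (1) for $\pi_i$ gives

$$\pi_i = \frac{e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}} \qquad (2)$$

Equation (2) can be simplified as follows:

$$\pi_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})}} \qquad (3)$$

This equation has the desired property that whatever values we substitute for the β's or the x's, the value of $\pi_i$ will always be a number between 0 and 1.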
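The logit transformation and its inverse can be illustrated with a minimal Python sketch. The function names and the coefficient values in the usage line are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math

def logit(p):
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Probability implied by a linear predictor eta (the form of Equation (3))."""
    # Two algebraically equal branches avoid overflow in exp() for large |eta|.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def predicted_probability(beta, x):
    """beta = [b0, b1, ..., bp] (intercept first); x = [x1, ..., xp]."""
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return inverse_logit(eta)

# Hypothetical coefficients and predictor values, for illustration only.
print(predicted_probability([-1.0, 0.8, 0.5], [1.0, 0.0]))
```

Whatever values are substituted, the result stays between 0 and 1, and applying `logit` to the output recovers the linear predictor.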

Maximum Likelihood Estimator
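Data with a binary response variable are one case that the maximum likelihood method handles very nicely. Let $x_i = (1, x_{i1}, \ldots, x_{ip})'$ be the vector of explanatory variables for observation i, with a leading 1 for the intercept, and let $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$. The likelihood of observing the values of y for all of the n observations can be written as:

$$L = \prod_{i=1}^{n} \pi_i^{y_i} (1-\pi_i)^{1-y_i}$$

Then we take the natural logarithm of both sides to get:

$$\ln L = \sum_{i=1}^{n} \left[ y_i \ln \pi_i + (1-y_i) \ln(1-\pi_i) \right] = \sum_{i=1}^{n} \left[ y_i x_i'\beta - \ln\left(1 + e^{x_i'\beta}\right) \right]$$

Taking the derivative of $\ln L$ with respect to $\beta$ and setting it equal to zero, we get:

$$\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{n} x_i (y_i - \hat{y}_i) = 0 \qquad (8)$$

where $\hat{y}_i = 1/(1 + e^{-x_i'\beta})$ is the predicted probability of y for a given value of $x_i$.
Equation (8) is in fact a system of (p+1) equations, one for each element of $\beta$.
There is no explicit solution to (8); instead, we must use iterative methods, which yield successive approximations to the solution until the approximations converge.
One of the most common iterative methods is the Newton-Raphson method, which can be explained as follows [3]. Let $u(\beta)$ be the vector of first derivatives of $\ln L$ with respect to $\beta$, that is:

$$u(\beta) = \frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{n} x_i (y_i - \hat{y}_i)$$

Let $I(\beta)$ be the matrix of second derivatives of $\ln L$ with respect to $\beta$:

$$I(\beta) = \frac{\partial^2 \ln L}{\partial \beta \, \partial \beta'} = -\sum_{i=1}^{n} x_i x_i' \hat{y}_i (1 - \hat{y}_i)$$

The Newton-Raphson algorithm is then

$$\beta^{(j+1)} = \beta^{(j)} - I^{-1}(\beta^{(j)}) \, u(\beta^{(j)}) \qquad (9)$$

In practice, we need a set of initial values $\beta^{(0)}$, which can be taken with all coefficients equal to zero. These initial values are substituted into the right-hand side of Equation (9), which gives the result of the first iteration, $\beta^{(1)}$. These values are then substituted back into the right-hand side of Equation (9), where the first and second derivatives are recomputed, and the result is $\beta^{(2)}$. This process is repeated until the maximum change in each parameter estimate from one step to the next is less than the tolerance specified for the logistic regression procedure.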
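The iteration in Equation (9) can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the routine used by SPSS: the function names, the tolerance, the iteration cap and the toy data are all hypothetical, and the data are not the study data.

```python
import math

def sigmoid(eta):
    """Predicted probability for a linear predictor eta."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_logistic(rows, y, tol=1e-8, max_iter=25):
    """rows: predictor values per case (without the constant); y: 0/1 outcomes."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    beta = [0.0] * k                     # initial values: all coefficients zero
    for iteration in range(1, max_iter + 1):
        y_hat = [sigmoid(sum(b * v for b, v in zip(beta, xi))) for xi in X]
        # u(beta) = sum_i x_i (y_i - y_hat_i)
        u = [sum(xi[j] * (yi - pi) for xi, yi, pi in zip(X, y, y_hat))
             for j in range(k)]
        # -I(beta) = sum_i x_i x_i' y_hat_i (1 - y_hat_i)
        neg_info = [[sum(xi[r] * xi[c] * pi * (1.0 - pi) for xi, pi in zip(X, y_hat))
                     for c in range(k)] for r in range(k)]
        step = solve(neg_info, u)        # equals -I^{-1}(beta) u(beta)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break                        # largest parameter change below tolerance
    return beta, iteration

# Hypothetical toy data (one predictor, eight cases), not the study data.
x_demo = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]]
y_demo = [0, 0, 1, 0, 1, 1, 0, 1]
beta_hat, n_iter = fit_logistic(x_demo, y_demo)
print(beta_hat, n_iter)
```

At convergence the fitted probabilities satisfy the score equations (8), which offers a simple check on the result. The sketch assumes the data are not perfectly separable; otherwise the maximum likelihood estimates do not exist and the iteration will not settle.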

Statistical Hypotheses and Model Fitting
In analyzing logistic regression, two hypotheses are of interest:
1-The null hypothesis, under which all logistic regression coefficients are equal to zero, i.e. there is no relationship between the binary response variable and the predictor variables.
2-The alternative hypothesis, under which the logistic regression coefficients differ significantly from zero, i.e. there exists a significant effect of the predictors on the binary response variable.
There are two popular approaches for testing the null hypothesis. The first is performed by calculating the Wald statistic, which is similar to the t test in multiple linear regression. This statistic is the parameter estimate divided by the standard error of that estimate. For large sample sizes the distribution of the estimated coefficient approaches normality; the standard errors are therefore asymptotic and should be regarded as approximate for small sample sizes. Accordingly, the Wald statistic is more reliable for large sample sizes.
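As a small illustration of the Wald statistic described above, the following Python sketch divides an estimate by its standard error and refers the ratio to the standard normal distribution. The function name and the numeric inputs are hypothetical and are not values from Table 4.

```python
import math

def wald_test(estimate, std_error):
    """Return (z, two_sided_p) for H0: coefficient = 0, using the normal approximation."""
    z = estimate / std_error
    # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical coefficient and standard error, for illustration only.
z, p = wald_test(1.2, 0.4)
print(z, z ** 2, p)
```

Some packages, SPSS among them, report the square of this ratio and refer it to a chi-square distribution with 1 degree of freedom; the two forms lead to the same p-value.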

Hosmer and Lemeshow (H -L) Goodness of Fit Test
A popular test for the goodness of fit of the logistic regression model, based on the chi-square statistic, is the Hosmer and Lemeshow test [4]. In this test, the subjects are divided into ten ordered groups constructed on the basis of their estimated probability: those with estimated probability between 0 and 0.1 form one group, and so on, up to those with probability between 0.9 and 1. The test statistic is based on two components, namely the observed component, which does not depend on any theoretical distribution, and the expected component, obtained from the estimated logistic model. A p-value is computed from the chi-square distribution with 8 degrees of freedom to test the fit of the logistic model. We do not reject the null hypothesis if this p-value is greater than 0.05. This means that there is no significant difference between the observed and model-predicted values, which implies that the model's estimates fit the data at an acceptable level.
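The grouping described above can be sketched in Python as follows. This is a minimal illustration that assumes the fixed cut-points 0.1, 0.2, ..., 0.9 stated in the text; many packages instead form groups from deciles of the fitted probabilities, so results may differ from software output. The function names are hypothetical, and the closed-form tail probability used here is valid only for an even number of degrees of freedom (such as 8).

```python
import math

def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variable; closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j          # (x/2)^j / j!
        total += term
    return math.exp(-half) * total

def hosmer_lemeshow(y, p_hat, groups=10):
    """Return (statistic, df) using equal-width probability groups."""
    observed = [0.0] * groups     # observed number of events per group
    expected = [0.0] * groups     # sum of fitted probabilities per group
    counts = [0] * groups
    for yi, pi in zip(y, p_hat):
        g = min(int(pi * groups), groups - 1)   # a probability of 1.0 joins the last group
        observed[g] += yi
        expected[g] += pi
        counts[g] += 1
    stat, used = 0.0, 0
    for o, e, n in zip(observed, expected, counts):
        if n == 0 or e <= 0.0 or e >= n:
            continue              # skip empty or degenerate groups
        # Contributions from events and from non-events in this group.
        stat += (o - e) ** 2 / e + ((n - o) - (n - e)) ** 2 / (n - e)
        used += 1
    return stat, used - 2
```

With all ten groups occupied the degrees of freedom are 8, and `chi2_sf_even(stat, 8)` gives the p-value that is compared with 0.05.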

Omnibus Test of Model Coefficients
This test is applied to determine whether the overall model with all predictors is significantly different from the model with only the intercept. The omnibus test is interpreted as a test of the capability of all predictors in the model jointly to predict the response variable. The test is based on the difference between the log likelihood of the overall model and the log likelihood of the model with only the constant term; multiplied by 2, this difference follows approximately a chi-square distribution whose degrees of freedom equal the number of predictors [5].
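The omnibus test can be illustrated with a short Python sketch. The function names are hypothetical. The full-model value uses the -2 log likelihood of 81.207 reported in the practical study; the intercept-only value is not reported directly and is back-calculated here, as an assumption, from that figure and the reported chi-square of 77.791.

```python
import math

def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variable; closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j          # (x/2)^j / j!
        total += term
    return math.exp(-half) * total

def likelihood_ratio_test(loglik_null, loglik_full, df):
    """Chi-square statistic -2 (lnL0 - lnL1) and its p-value (even df only)."""
    stat = -2.0 * (loglik_null - loglik_full)
    return stat, chi2_sf_even(stat, df)

loglik_full = -81.207 / 2.0                 # from the reported -2 log likelihood
loglik_null = -(81.207 + 77.791) / 2.0      # assumption: implied by the reported chi-square
stat, p_value = likelihood_ratio_test(loglik_null, loglik_full, df=8)
print(stat, p_value)
```

The resulting p-value is far below 0.001, in line with the significance reported for the fitted model.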

The Practical Study
In our practical study, we attempted to assess the impact of some factors (explanatory variables) that we consider important causes of renal failure disease by employing logistic regression procedures. The real data were collected at Al Kindi Hospital in Baghdad from 60 patients suffering from renal failure and 55 people not suffering from the disease. The logistic regression analysis was then performed with two groups (infected and uninfected) and 8 predictor variables that we believe cause the disease. The variables are:
A. The dependent variable, coded 1 for infected (group 1) and 0 for uninfected (group 2).
B. Eight independent variables, described below:
1-Diabetes (0 uninfected, 1 infected)
2-Blood pressure (0 uninfected, 1 infected)
3-(Glp) Glomerulonephritis (renal syndrome) (0 uninfected, 1 infected)
4-(UTI) Chronic urinary tract infection (0 uninfected, 1 infected)
5-Genetics (0 does not exist, 1 exists)
6-Age
7-Sex (1 for male, 0 for female)
8-Kidney stones (0 do not exist, 1 exist)
The statistical program SPSS was used to perform the required calculations.

Logistic Regression Analysis
Employing the statistical program SPSS, the required results were obtained and arranged in the following tables. Table 1 summarizes the data entered into the analysis, the sample size studied and the missing data. The coding of the dependent variable values is displayed in Table 2.

Table 3 includes the iterations on the derivatives of the likelihood function carried out to obtain the minimum value of -2 log likelihood, which is required to get the optimal estimates of the model coefficients. The minimum value of -2 log likelihood, 81.207, was obtained at the sixth iteration. The process was stopped at this iteration since the differences between the values of the coefficients became very small (less than 0.001). In fact, the variation between the estimated coefficients became very small after the fourth iteration, as shown in Table 3. The coefficients estimated at the sixth iteration are therefore taken to be the best estimates that can be obtained.

Table 4 describes the estimates of the parameters of the optimal model obtained from the sixth iteration given in Table 3. All estimated coefficients, as well as the standard error and the Wald statistic for each estimated coefficient and the upper and lower bounds for exp(β), are included.

The omnibus test compares the value of the likelihood function of the full model with that of the reduced model. The value of χ² was found to be 77.791 with 8 degrees of freedom, which is significant at a level α less than 0.001 (sig = 0) and confirms the significance of the whole fitted model, as shown in Table 5. Another test for the goodness of fit of the model, depending upon the χ² statistic, is presented in Table 6. The test statistic here is based on two components, as in the Hosmer and Lemeshow test.

1) The explanatory variables X1, X2, X3, X4 and X8 contribute significantly to the prediction; their significance values are less than 0.05, as shown in Table 4.
2) The Hosmer and Lemeshow (H-L) goodness of fit test is one of the most popular methods in logistic regression for testing the null hypothesis that there is no difference between the observed and model-predicted values. If the significance value of the test is more than 0.05, as is desired for a well-fitting model, we do not reject the null hypothesis. In our practical study, the H-L test has a significance value of 0.244, as shown in Table 7. This indicates that our model fits well: there are no significant differences between the observed and predicted values.
3) In addition to using a goodness of fit statistic, we are often interested in the proportion of outcomes that the model classifies correctly. For this we need to look at the classification table. In our study, 81.5% were correctly classified for the uninfected group and 85.5% for the infected group. Overall, 83.5% were correctly classified. This indicates a good performance of the logistic regression model.
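A classification table of the kind referred to in item 3) can be computed as in the following Python sketch. The function name and the toy data are hypothetical and are not the study data; the cut-off of 0.5 is a common default and is an assumption here, since the cut value used in the analysis is not stated.

```python
def classification_table(y, p_hat, cutoff=0.5):
    """Cross-tabulate observed outcomes (0/1) against predictions at a probability cut-off."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}  # keys: (observed, predicted)
    for yi, pi in zip(y, p_hat):
        predicted = 1 if pi >= cutoff else 0
        counts[(yi, predicted)] += 1
    n0 = counts[(0, 0)] + counts[(0, 1)]   # observed uninfected
    n1 = counts[(1, 0)] + counts[(1, 1)]   # observed infected

    def pct(hits, total):
        return 100.0 * hits / total if total else float("nan")

    return {
        "counts": counts,
        "pct_correct_uninfected": pct(counts[(0, 0)], n0),
        "pct_correct_infected": pct(counts[(1, 1)], n1),
        "pct_correct_overall": pct(counts[(0, 0)] + counts[(1, 1)], n0 + n1),
    }

# Hypothetical outcomes and fitted probabilities, for illustration only.
y_demo = [0, 0, 0, 0, 1, 1, 1, 1]
p_demo = [0.1, 0.3, 0.6, 0.2, 0.9, 0.4, 0.8, 0.7]
print(classification_table(y_demo, p_demo))
```

The three percentages correspond to the per-group and overall rates of correct classification discussed above.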

Unlike the usual linear regression model, the logit model contains no random error term. This does not mean that the model is deterministic, since there is still room for random variation, represented by the probabilistic relationship between $\pi_i$ and $y_i$. Solving the logit model for $\pi_i$ gives $\pi_i = e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}} / (1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}})$.

Table 2. Dependent Variable Encoding

Table 4. Variables in the Equation

Table 5. Omnibus Tests of Model Coefficients