Performance of a New Restricted Biased Estimator in Logistic Regression

It is known that the variance of the maximum likelihood estimator (MLE) is inflated when the explanatory variables are correlated, a situation called the multicollinearity problem. As a result, the estimates produced by the model may not be trustworthy. Therefore, this paper introduces a new restricted estimator (RLTE) that may be applied to mitigate multicollinearity in logistic regression when the parameters lie in some linear subspace. The mean squared errors (MSE) and the matrix mean squared errors (MMSE) of the estimators considered in this paper are given. A Monte Carlo experiment is designed to evaluate the performances of the proposed estimator, the restricted MLE (RMLE), the MLE and the Liu-type estimator (LTE), with MSE as the criterion of performance. Moreover, a real data example is presented. According to the results, the proposed estimator performs better than the MLE, RMLE and LTE.


Introduction
The binary logistic regression model has become a popular method of analysis when the outcome variable is discrete or dichotomous. Although its original acceptance was in the field of epidemiologic research, it has become a commonly employed method in applied sciences such as engineering, health policy, biomedical research, business and finance, criminology, ecology, linguistics and biology [9]. In the analysis of a dichotomous dependent variable, many distribution functions can be used, see [5]. However, the logistic distribution, being extremely flexible, easy to use and providing clinically meaningful interpretations, has become the popular choice in this research area [9]. Now, consider the following binary logistic regression model with intercept, where the dependent variable is distributed as Bernoulli Be(π) such that

π = exp(Xβ) / (1 + exp(Xβ)),   (1)

where X is the n × (p + 1) design matrix with n > p, β = [β_0, β_1, ..., β_p]′ is the (p + 1) × 1 coefficient vector and p is the number of explanatory variables. In order to estimate the coefficient vector β, the following log-likelihood function needs to be maximized:

ℓ(β) = Σ_{j=1}^n [y_j ln(π_j) + (1 − y_j) ln(1 − π_j)].   (2)

The log-likelihood function can be maximized by differentiating it with respect to β and setting the resulting expressions, called the likelihood equations, equal to zero. Since the likelihood equations are nonlinear in β, one may use the iteratively weighted least squares (IWLS) algorithm. Therefore, the maximum likelihood estimator (MLE) of β can be obtained using the IWLS algorithm, whose tth step is given as follows [21]:

β^{t+1} = β^t + (X′W_t X)^{-1} X′(y − π^t),

where π^t is the vector of estimated values of π using β^t and W_t = diag(π^t_j (1 − π^t_j)), with π^t_j the jth element of π^t.
In the final step of the algorithm, one gets the maximum likelihood estimator

β̂_MLE = S^{-1} X′W ẑ,

where S = X′WX, ẑ = (ẑ_1, ẑ_2, ..., ẑ_n)′ with ẑ_j = η̂_j + (y_j − π̂_j)/(π̂_j(1 − π̂_j)) and η̂_j = x_j β̂. The weighted sum of squares can be minimized approximately by using the MLE. However, this estimator becomes unstable when the regressor variables are correlated, a problem called multicollinearity. Thus, due to the high variance and very low t-ratios, the estimates of the MLE are no longer trustworthy. This is because the matrix S becomes ill-conditioned when there is multicollinearity. There are some solutions to this ill-conditioning problem. Ridge regression, first defined by [7] for the linear model, is a very popular method. The ridge estimator has been successfully generalized to binary logistic regression by [22] as follows:

β̂_Ridge = (S + kI)^{-1} S β̂_MLE,

where k > 0 and I is the (p + 1) × (p + 1) identity matrix. The authors applied the ridge estimators defined by [7] and [8] in logistic regression. Recently, a number of logistic ridge estimators have been applied and investigated by [16]. Moreover, see the following studies for different characterizations of this method in different models: [19], [23] and [24]. Another solution to the problem is to use the Liu estimator defined by [12]. The logistic version of this estimator was defined by [15]. The authors showed that the logistic Liu estimator has better performance than the MLE according to the mean squared error (MSE) criterion. Since the logistic Liu estimator uses a shrinkage parameter, its length is smaller than that of the MLE. The logistic Liu-type estimator (LTE) defined by [10], which uses two parameters and can be seen as a combination of the Liu estimator and the ridge estimator, can also be used as a solution to the problem. The LTE is defined as follows:

β̂_LTE = (S + kI)^{-1} (S − dI) β̂_MLE,

where k > 0 and −∞ < d < ∞. Different methods to select the parameters (k, d) used in the LTE are proposed by [3].
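As an illustration of the procedure above, the IWLS algorithm and the Liu-type estimator can be sketched in numpy as follows; the function names, convergence settings and simulated data are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def iwls_mle(X, y, max_iter=50, tol=1e-8):
    """Fit logistic regression by iteratively weighted least squares (IWLS).

    Returns the MLE of beta and the information matrix S = X'WX
    evaluated at the final step.
    """
    n, p1 = X.shape
    beta = np.zeros(p1)
    S = X.T @ X / 4.0  # value at beta = 0, overwritten below
    for _ in range(max_iter):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))      # pi = exp(eta) / (1 + exp(eta))
        w = pi * (1.0 - pi)                  # diagonal of W
        z = eta + (y - pi) / w               # working response
        S = X.T @ (w[:, None] * X)           # S = X'WX
        beta_new = np.linalg.solve(S, X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta, S

def lte(X, y, k, d):
    """Liu-type estimator: (S + kI)^{-1} (S - dI) beta_MLE."""
    beta_mle, S = iwls_mle(X, y)
    I = np.eye(S.shape[0])
    return np.linalg.solve(S + k * I, (S - d * I) @ beta_mle)
```

With k = d = 0 the LTE reduces to the MLE, which provides a quick sanity check of the implementation.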
In statistical research, there may be prior information regarding the variables considered in the statistical analysis.
Such information may arise from different sources, such as past experience or expertise in the area (see [18]). Therefore, in this paper, we also consider imposing some restrictions on the parameter space of the coefficient vector. The purpose of this paper is to propose a restricted estimator by imposing restrictions on the LTE, and to compare the estimators considered in this study with the new restricted Liu-type estimator (RLTE) by designing a Monte Carlo simulation study and a real data application. The organization of the paper is as follows: In Section 2, the derivation of the proposed estimator is given and the MSE characteristics of the listed estimators are presented; moreover, the optimal shrinkage parameters of the new estimator are obtained. In Section 3, the details of the Monte Carlo simulation are demonstrated, a discussion regarding the results of the simulation is provided, and a real-life application is performed to show the benefits of the new method. Finally, a brief summary and conclusion are given in Section 4.

Definition of the new estimators
Consider the following restrictions on the parameter space of the coefficient vector β:

Hβ = h,

where H is a q × (p + 1) matrix of known elements and h is a q × 1 vector of known elements. The restricted MLE (RMLE) was proposed by [6] by imposing these restrictions on the log-likelihood function (2). Therefore, the following objective function should be maximized:

ℓ(β) + λ′(Hβ − h),

where λ is a vector of Lagrangian multipliers. A Newton-Raphson method can be applied to find the solution β_* ([11], [20]), computing the derivatives of the objective with respect to β and λ and iterating. Multiplying both sides of the resulting update equation by H and using Hβ^{t+1}_* = h, together with the positive definiteness of HS^{-1}H′, λ can be estimated as

λ̂ = (HS^{-1}H′)^{-1}(h − Hβ̂_MLE).

Therefore, letting β_* = β̂_RMLE in the final step of this weighted procedure, the RMLE is obtained as

β̂_RMLE = β̂_MLE + S^{-1}H′(HS^{-1}H′)^{-1}(h − Hβ̂_MLE).

Now, following [6], [13] and [20], a penalized log-likelihood function is considered, obtained by augmenting the objective above with a Liu-type penalty involving ‖β‖, the norm of β. Taking the derivatives with respect to β and λ, applying the same Newton-Raphson argument to the solution β_{**}, and multiplying both sides by H with Hβ^{t+1}_{**} = h, the estimator of λ becomes, after some algebra,

λ̂ = (HS_k^{-1}H′)^{-1}(h − Hβ̂_LTE).

At the final iteration, letting β_{**} = β̂_RLTE, the proposed estimator RLTE is given by

β̂_RLTE = β̂_LTE + S_k^{-1}H′(HS_k^{-1}H′)^{-1}(h − Hβ̂_LTE),

where S_k = S + kI. There is also an alternative expression of the RLTE, obtained by substituting β̂_LTE = S_k^{-1}(S − dI)β̂_MLE into the formula above.
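The restricted estimators above share one algebraic pattern: an unrestricted estimate is corrected so that the restriction Hβ = h holds exactly. A minimal numpy sketch of that correction, assuming the Lagrangian form derived above (metric A = S applied to the MLE gives the RMLE; A = S + kI applied to the LTE gives the RLTE):

```python
import numpy as np

def restrict(beta_hat, A, H, h):
    """Correct an unrestricted estimate beta_hat so that H beta = h holds,
    using the Lagrangian solution with positive definite metric A:
    beta_hat + A^{-1} H' (H A^{-1} H')^{-1} (h - H beta_hat)."""
    A_inv_Ht = np.linalg.solve(A, H.T)                     # A^{-1} H'
    lam = np.linalg.solve(H @ A_inv_Ht, h - H @ beta_hat)  # Lagrange multipliers
    return beta_hat + A_inv_Ht @ lam
```

By construction the returned vector satisfies the restriction exactly, and an estimate that already satisfies Hβ = h is left unchanged.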

MSE characteristics of estimators
Since the MMSE and MSE functions contain all relevant information about an estimator, they are used in the literature to compare estimators. The MMSE and MSE of an estimator β̃ are defined respectively by

MMSE(β̃) = E[(β̃ − β)(β̃ − β)′],
MSE(β̃) = tr(MMSE(β̃)) = E[(β̃ − β)′(β̃ − β)],

where tr(·) is the trace of a matrix. In this subsection, the MMSE and MSE functions of the estimators are obtained. To obtain these functions, the covariance matrices and bias vectors of the estimators are computed first. Since the MLE is asymptotically unbiased, its covariance matrix, MMSE and MSE are given as follows (see [16]):

Cov(β̂_MLE) = MMSE(β̂_MLE) = S^{-1},  MSE(β̂_MLE) = Σ_{i=1}^{p+1} 1/λ_i,

where the λ_i's are the eigenvalues of the matrix S. The RMLE has the theoretical properties given in [1], where Cov(η) denotes the covariance matrix and Bias(η) the bias of an estimator η, m_jj is the jth diagonal element of V′MV and δ_j is the jth component of V′δ, the columns of V being the eigenvectors of S.

The bias and covariance of the LTE, together with its MMSE and MSE, are presented in [1]. Using the alternative definition of the RLTE, the MMSE and MSE of the RLTE can be computed in the same manner, where m^{(k)}_jj is the jth diagonal element of V′M_kV and λ_j is the jth eigenvalue of S.
Since the values of k and d are not known in real data, it is useful to compare the estimators for some fixed values of these parameters. Therefore, we obtain the MSE differences between the estimators. Using equations (28) and (40), we compute the difference Δ_1 = MSE(β̂_MLE) − MSE(β̂_RLTE). Similarly, using (32) and (40), we compute Δ_2 = MSE(β̂_RMLE) − MSE(β̂_RLTE). Finally, using (36) and (40), the difference Δ_3 = MSE(β̂_LTE) − MSE(β̂_RLTE) is obtained. If the differences Δ_1, Δ_2 and Δ_3 can be shown to be positive, then the RLTE is superior to the other estimators. However, we skip the detailed theoretical comparisons and refer to [2] for similar ones. Instead, we design a Monte Carlo experiment to compare the estimators in Section 3.

How to choose k and d
This subsection presents how to choose the biasing parameters used in the RLTE. Since the MSE function is a quadratic function of the parameter d but a nonlinear function of the parameter k, the optimal value of d can be obtained for a fixed value of k: it suffices to minimize the MSE function given in the previous subsection by differentiating it with respect to d and solving the resulting expression for d. Since an optimal way of choosing the value of k cannot be obtained, the value of k is computed as k = (p + 1)/(β̂′_MLE β̂_MLE), due to [22].
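As a sketch of this fix-k-then-optimize-d recipe, the code below computes k by the rule of [22] and then solves dMSE/dd = 0 in closed form. Note that, as an assumption of this sketch, the MSE expression minimized here is that of the unrestricted LTE with asymptotic covariance S^{-1}; the paper's own formula for d_RLTE (which accounts for the restrictions) is not reproduced.

```python
import numpy as np

def schaefer_k(beta_mle):
    """k = (p + 1) / (beta'beta), the rule of [22] used in the paper."""
    return len(beta_mle) / (beta_mle @ beta_mle)

def optimal_d_lte(S, beta_mle, k):
    """d minimizing the estimated scalar MSE of the unrestricted LTE:
    sum_j (lam_j - d)^2 / (lam_j (lam_j + k)^2)          (variance term)
    + (d + k)^2 * sum_j alpha_j^2 / (lam_j + k)^2         (squared bias).
    The MSE is quadratic in d, so dMSE/dd = 0 has a closed-form root."""
    lam, V = np.linalg.eigh(S)
    alpha2 = (V.T @ beta_mle) ** 2          # squared components in the eigenbasis
    num = np.sum(1.0 / (lam + k) ** 2) - k * np.sum(alpha2 / (lam + k) ** 2)
    den = np.sum(1.0 / (lam * (lam + k) ** 2)) + np.sum(alpha2 / (lam + k) ** 2)
    return num / den
```

Because the leading coefficient of the quadratic is positive, the root is a global minimizer of the estimated MSE for the fixed k.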

Accordingly, the optimal parameter d_RLTE is computed by setting the derivative of the MSE with respect to d equal to zero and solving for d.

Numerical Experiments
In order to evaluate the performances of the listed estimators, a Monte Carlo simulation experiment is conducted. The details and results of the simulation are presented in this section.

Details of the simulation
In a simulation study, identifying the important design factors is crucial. The main factor in this study is the degree of correlation ρ among the independent variables; in the experiment, ρ takes the values 0.90, 0.99 and 0.999. The sample size and the number of regressor variables, also being crucial factors, are varied as in many studies, for example see [14], [15], [16] and [25].
The following equation is used to produce a dataset with different strengths of correlation:

x_ij = (1 − ρ²)^{1/2} z_ij + ρ z_{i,p+1},

where i = 1, 2, ..., n, j = 1, 2, ..., p and the z_ij are random numbers produced from the standard normal distribution. The response variable is generated from the Bernoulli distribution Be(π_i) where

π_i = exp(x_i β) / (1 + exp(x_i β)),

and x_i is the ith row of the data matrix X.
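A minimal numpy sketch of this data-generating scheme (the seed and function names are illustrative); note that the scheme yields pairwise correlations of approximately ρ² between the regressors.

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed

def make_design(n, p, rho):
    """x_ij = sqrt(1 - rho^2) * z_ij + rho * z_{i,p+1}, with z_ij standard
    normal; pairwise correlation between regressors is approximately rho^2."""
    Z = rng.standard_normal((n, p + 1))
    return np.sqrt(1.0 - rho ** 2) * Z[:, :p] + rho * Z[:, [p]]

def make_response(X, beta):
    """Draw y_i from Be(pi_i) with pi_i = exp(x_i beta) / (1 + exp(x_i beta))."""
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return rng.binomial(1, pi)
```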
To impose some restrictions on the parameter space, following [17], the following restriction matrices are chosen for p = 4 and p = 8 respectively, with h = (0, 0)′ for both cases.
The performances of the estimators are investigated via simulated MSE values, computed by the following equation:

MSE(β̃) = (1/R) Σ_{r=1}^R (β̃_r − β)′(β̃_r − β),

where β̃_r − β is the difference between the estimate obtained at the rth replication of the simulation and the true coefficient vector, for each estimator considered in this study, and R is the number of replications. Finally, the parameters of the LTE are estimated following [1], and the parameters of the RLTE are estimated by the proposed method.
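The simulated MSE criterion above reduces to a one-line numpy computation over the R stored estimates (the function name is illustrative):

```python
import numpy as np

def simulated_mse(estimates, beta_true):
    """MSE = (1/R) * sum_r (beta_r - beta)'(beta_r - beta), where
    `estimates` is a list or array of R estimated coefficient vectors."""
    diffs = np.asarray(estimates) - beta_true
    return float(np.mean(np.sum(diffs ** 2, axis=1)))
```

The same function is applied to the stored estimates of each estimator (MLE, RMLE, LTE, RLTE), so all methods are judged on the same replications.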

A real data application
In this subsection, a real data application is presented to show the usefulness of the new estimator. The dataset, taken from [9], is called the "Myopia Study". There are 618 observations and 17 explanatory variables; however, for illustrative purposes, only the following variables are considered in the analysis. The response variable indicates whether a subject has myopia (coded as 1) or not (coded as 0). Since all the explanatory variables are on the same scale (mm), the design matrix is not centered and standardized. The correlation matrix is presented in Table 3. Moreover, the condition number of the matrix X′WX is computed as 14426, which shows that there is a collinearity problem in the data [4]. The restriction matrix H_1 = [1 −1 −1 1 0] with h = 0 is used in order to compare the variables having correlations of opposite sign; note that the correlation between AL and VCD is 0.94. Moreover, another restriction matrix H_2 is also considered. The results are reported in Table 4, from which it is observed that the RLTE has the lowest MSE value while the MSE of the MLE is inflated due to multicollinearity.
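Given the design matrix and the final IWLS weights, the condition number used above can be reproduced as the ratio of the largest to the smallest eigenvalue of S = X′WX (a sketch; note that some authors report the square root of this ratio instead):

```python
import numpy as np

def condition_number(X, w):
    """kappa = lambda_max / lambda_min of S = X'WX, where w holds the
    diagonal entries of W; large values signal ill-conditioning."""
    S = X.T @ (w[:, None] * X)
    lam = np.linalg.eigvalsh(S)   # eigenvalues in ascending order
    return lam[-1] / lam[0]
```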

Conclusion
In this paper, a new restricted estimator is proposed for the logistic regression model. Theoretical properties of the new estimator are investigated, and its MMSE and MSE functions are obtained. By means of a Monte Carlo simulation, the estimators MLE, RMLE, LTE and RLTE are compared in terms of simulated MSE values. According to the results of the simulation, the new estimator RLTE performs better than the others, especially when the degree of correlation is high and the sample size is small. Furthermore, the methods are also applied to a real-life example, in which the RLTE attains the smallest MSE value. Therefore, the RLTE is a better alternative when multicollinearity is present in the data.