ESTIMATION IN POLYTOMOUS LOGISTIC MODEL: COMPARISON OF METHODS

The logistic regression model is a powerful method for modeling the relationship between a categorical variable and a set of explanatory variables. In practice, however, the existence of maximum likelihood estimates is known to depend on the data configuration. In fact, the Maximum Likelihood Estimators (MLE) of the unknown parameters exist if, and only if, the data overlap. The Hidden Logistic Regression (HLR) model is an alternative under which the observed response is related to an unobservable true response. The Maximum Estimated Likelihood (MEL) method is also proposed, since it is immune to complete or quasi-complete separation of the data. The Principal Component Logistic Regression (PCLR) model is useful for reducing the dimension of a logistic regression model with continuous covariates while avoiding multicollinearity. In this paper we present extensions of the HLR and PCLR models as means of solving problems with polytomous responses. The main purpose is to compare the classificatory performance of these models with that of the Classical Logistic Regression (CLR) and Individualized Logistic Regression (ILR) models in the case of polytomous responses, and to propose an alternative approach to the parameter estimation problem in polytomous logistic models when the data groups are completely separated. Simulation results based on data sets taken from the literature show that the proposed approach is feasible.


1.
Introduction. The logistic regression model is a powerful method for modeling the relationship between a categorical - or ordinal - dependent variable and a set of explanatory variables, involving both discrete and continuous variables. However, the estimation of the logistic model parameters can be influenced by peculiarities in the data. Albert and Anderson [2] classified data configurations into three categories: overlap, quasi-complete separation and complete separation, and proved that the MLE does not exist under complete or quasi-complete separation.

2.
Polytomous logistic regression model. We consider that a sample of n independent observations is available from the groups G_1, G_2, ..., G_s. Let X be the vector of explanatory variables, X^T = (x_0, x_1, ..., x_p), where x_0 ≡ 1 for convenience, and let Y denote the polytomous dependent variable with s possible outcomes. The model is given by:

P(Y = k | X = x) = exp(B_k^T x) / Σ_{j=1}^{s} exp(B_j^T x),

where k = 1, 2, ..., s − 1 and B_s = 0. In this paper the group s is called the reference group. The model involves (s − 1)(p + 1) unknown parameters. The conditional likelihood function is given by:

L(B) = Π_{i=1}^{n} Π_{k=1}^{s} [P(Y = k | x_i)]^{Y_{ki}},

where B = (B_1^T, ..., B_{s−1}^T)^T and Y_i = (Y_{1i}, ..., Y_{si}), with Y_{ki} = 1 if Y = k and Y_{ki} = 0 otherwise. Taking the logarithm, the log-likelihood function is:

l(B) = Σ_{i=1}^{n} Σ_{k=1}^{s} Y_{ki} log P(Y = k | x_i).

The Maximum Likelihood Estimator (MLE) B̂ is obtained by setting the derivatives to zero and solving for B; the most widely used iterative method is Newton-Raphson. In practice, however, the estimation of the unknown parameters is affected by the properties of the data. Albert and Anderson [2] suggested a sample classification into three categories: complete separation, quasi-complete separation and overlap, and proved that the MLE does not exist under complete or quasi-complete separation. An algorithm to distinguish separation from multicollinearity is proposed in [15]. Another important reference is [11], who argued that, in practice, monitoring the variance during the iteration process is sufficient to detect separation.

3.
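The divergence under complete separation can be illustrated with a short numerical sketch (toy data and a plain NumPy implementation, not the paper's program): the Newton-Raphson iterations push the coefficient norm toward infinity instead of converging.

```python
import numpy as np

def newton_logistic(X, y, n_iter=15):
    """Plain Newton-Raphson for the binary logistic log-likelihood."""
    beta = np.zeros(X.shape[1])
    norms = []
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                  # score vector
        hess = (X.T * (p * (1.0 - p))) @ X    # observed information matrix
        beta = beta + np.linalg.solve(hess, grad)
        norms.append(np.linalg.norm(beta))
    return beta, norms

# Completely separated toy data: x < 0 -> y = 0, x > 0 -> y = 1 (no overlap).
X = np.column_stack([np.ones(6), [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
beta, norms = newton_logistic(X, y)
print(norms[0], norms[-1])  # the norm keeps growing: the MLE does not exist
```

Each Newton step increases the slope coefficient, since the likelihood can always be improved by sharpening the decision boundary; no finite maximizer exists.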
Individualized logistic regression model. For the polytomous logistic regression model, [4] proposed an alternative to maximum likelihood estimation, called the Individualized Logistic Regression (ILR) model. For a response variable with s groups, they suggested fitting (s − 1) binary logistic regressions, each comparing one group d, d = 1, ..., s − 1, with the reference group. The coefficients of the polytomous logistic model are obtained from the (s − 1) separately fitted binary models. According to the authors, the estimators obtained in this way are consistent and approximately equal to those of the CLR model. The criterion used to evaluate the performance of the estimators is the Asymptotic Relative Efficiency (ARE), defined as the ratio of the estimators' asymptotic variances; the authors showed that the loss in ARE is small. According to [12], the individualized fit can also be useful for variable selection. Under this method, a series of (s − 1) conditional likelihoods are maximized separately, where each likelihood is given by:

L_d(θ_d) = Π_i [π_d(x_i)]^{y_i} [1 − π_d(x_i)]^{1 − y_i}, with π_d(x) = exp(θ_d^T x) / (1 + exp(θ_d^T x)),

the product running over the observations in groups d and s. The estimates θ̂_d are obtained by setting the derivatives to zero and solving for θ_d. A modification of Begg and Gray's method is suggested in [17]: the s(s − 1)/2 pairwise binary logistic regressions are fitted and the estimators are obtained by weighted least squares. The modified procedure requires more computation, but has a higher ARE than Begg and Gray's method, according to [17].

4.
Hidden logistic regression model. The CLR model assumes observable independent responses Y_i with Bernoulli distributions. The Hidden Logistic Regression (HLR) model is a robust estimation method presented in [18], which assumes that the true response is unobservable due to a stochastic mechanism. In the classical approach, with a dichotomous response, the true response has two possible values, usually denoted success and failure. The HLR model assumes that the true status T, which can be s (success) or f (failure), is unobservable.
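The individualized fitting strategy can be sketched as follows, assuming scikit-learn and simulated overlapping data (the paper's own implementation was written in Visual Basic). Note that scikit-learn applies a small ridge penalty by default, a slight departure from pure maximum likelihood:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def individualized_fit(X, y, reference):
    """ILR sketch: one binary logistic model per non-reference group vs. the reference."""
    coefs = {}
    for d in sorted(set(y) - {reference}):
        mask = (y == d) | (y == reference)
        model = LogisticRegression(max_iter=1000)   # default small ridge penalty
        model.fit(X[mask], (y[mask] == d).astype(int))
        coefs[d] = np.concatenate([model.intercept_, model.coef_.ravel()])
    return coefs

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = rng.integers(0, 3, size=150)       # three overlapping groups 0, 1, 2
coefs = individualized_fit(X, y, reference=2)
print(sorted(coefs))                   # models fitted for groups 0 and 1
```

Each of the s − 1 fits uses only the observations from group d and the reference group, exactly as the method prescribes.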
However, there is an observable variable Y that is related to T. If the true status is T = s, then Y = 1 with probability P(Y = 1 | T = s) = δ_1; hence the probability of misclassification is P(Y = 0 | T = s) = 1 − δ_1. If T = f, then Y = 1 with probability P(Y = 1 | T = f) = δ_0, and Y = 0 with probability P(Y = 0 | T = f) = 1 − δ_0. The authors assume that 0 < δ_0 < 0.5 < δ_1 < 1.
Following [18], [7] considered the general case of n observations. In the notation adopted by Rousseeuw and Christmann, there are n unobservable independent random variables T_i that result from a CLR model with a finite parameter vector θ = (β_0, β_1, ..., β_p)^T. The response T_i has a Bernoulli distribution with success probability

π(x_i) = exp(θ^T x_i) / (1 + exp(θ^T x_i)).

The observable responses Y_i are related to the T_i through a mechanism called hidden logistic regression, because the true statuses T_i are hidden by the stochastic structure. From the maximum likelihood estimator of T given Y, the conditional probability that Y = 1 is (1 − y) δ_0 + y δ_1, where y is the observed value of Y; denoting this probability by Ỹ, we have Ỹ = (1 − Y) δ_0 + Y δ_1. In a model with n observations y_i, we set ỹ_i = (1 − y_i) δ_0 + y_i δ_1,

where the pseudo-observation ỹ_i is the success probability, conditioned on the estimate of t_i. Let us keep in mind that t_i is the true status.
In order to fit a logistic regression to the pseudo-observations we can apply the maximum likelihood principle to the function

EL(θ) = Π_{i=1}^{n} [π(x_i)]^{ỹ_i} [1 − π(x_i)]^{1 − ỹ_i},

which is called the Estimated Likelihood, because the true likelihood is unknown and depends on the unobservable t_1, ..., t_n. In Rousseeuw and Christmann's terminology, the maximizer of the Estimated Likelihood function is called the Maximum Estimated Likelihood (MEL) estimator. Moreover, the authors showed that, unlike the classical Maximum Likelihood Estimator, the MEL estimator always exists and is unique, even when the data set has no overlap.
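A minimal sketch of the MEL computation in the binary case, assuming the symmetric choice δ_0 = 0.01, δ_1 = 0.99: Newton-Raphson applies unchanged to the fractional pseudo-observations, and now converges even on the completely separated toy data that defeats classical maximum likelihood.

```python
import numpy as np

def mel_fit(X, y, delta0=0.01, delta1=0.99, n_iter=50, tol=1e-8):
    """MEL sketch: Newton-Raphson on pseudo-observations instead of y."""
    y_tilde = (1 - y) * delta0 + y * delta1   # pseudo-observations in (0, 1)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        step = np.linalg.solve((X.T * (p * (1 - p))) @ X, X.T @ (y_tilde - p))
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Completely separated data: x < 0 -> y = 0, x > 0 -> y = 1.
X = np.column_stack([np.ones(6), [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
beta = mel_fit(X, y)
print(beta)  # a finite, unique estimate
```

Because the pseudo-observations lie strictly inside (0, 1), the estimated log-likelihood attains a finite maximum and the iteration stops at it.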

4.1.
Hidden logistic regression model applied to the polytomous case. Our approach to the multiple-group problem is a simple and direct generalization of the HLR model. We consider n unobservable independent variables T_i, each taking values in the groups G_1, ..., G_s. In a model with n responses y_ij, i = 1, ..., n and j = 1, ..., s, where y_ij = 1 if Y_i = j and y_ij = 0 otherwise, we define the pseudo-observations

ỹ_ij = Σ_{k=1}^{s} y_ik δ_kj,

where δ_kj = P(Y_i = j | T_i = k). Let us keep in mind that in the CLR model δ_jj = 1 and δ_jk = 0 if j ≠ k. The purpose is to maximize the estimated likelihood; taking the logarithm, the log-likelihood function becomes

l(B) = Σ_{i=1}^{n} Σ_{j=1}^{s} ỹ_ij log P(Y = j | x_i).

The MEL estimators are the maximizers of this log-likelihood function, which is strictly concave. For more details about the maximization of the log-likelihood function, the interested reader is referred to [3]. In the related literature, different approaches to robust estimation are given by [10], [13] and [14], to name just a few.

4.2.
Choice of δ. According to [18], [6] found that accurately estimating δ_0 and δ_1 in the binary case is very difficult unless n is extremely large. The symmetric approach is to choose a constant γ > 0 and to set δ_0 = γ and δ_1 = 1 − γ, where γ is small enough that terms in γ^2 can be ignored and δ_0 < π̄ < δ_1, with π̄ = (1/n) Σ_{i=1}^{n} y_i. For a detailed explanation and discussion see [6], [13] and [18]. In this paper, we consider that the probability of observing the true status, given by P(Y_i = j | T_i = j) = δ_jj, should be higher than 0.5, that is, 0.5 < δ_jj < 1, and furthermore Σ_{k=1, k≠j}^{s} δ_jk < δ_jj. Therefore, we cannot take the estimate π̄_j = (1/n) Σ_{i=1}^{n} y_ij, j = 1, ..., s, since π̄_j can be smaller than 0.5. Our default choice is δ = 0.99, setting δ_jj = δ and δ_jk = (1 − δ)/(s − 1).
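The polytomous pseudo-observation scheme with the default δ = 0.99 can be sketched directly (hypothetical labels, NumPy only): the observed group keeps probability δ and the remaining 1 − δ is split evenly over the other s − 1 groups.

```python
import numpy as np

def pseudo_observations(y, s, delta=0.99):
    """Map labels y in {0, ..., s-1} to an n x s matrix of pseudo-observations."""
    y_onehot = np.eye(s)[y]                  # the indicators y_ij
    off = (1.0 - delta) / (s - 1)            # delta_jk for k != j
    return delta * y_onehot + off * (1.0 - y_onehot)

Y = pseudo_observations(np.array([0, 2, 1]), s=3)
print(Y)  # each row sums to 1; the observed group gets 0.99, the others 0.005
```

These rows replace the one-hot indicators in the polytomous log-likelihood, which is then maximized exactly as in the classical case.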

5.
Principal component logistic regression.
In order to improve the parameter estimation under multicollinearity and to reduce the dimension of the problem, [1] proposed to use a reduced set of optimum principal components of the original covariates as covariates of the logistic regression model. This approach, called the Principal Component Logistic Regression (PCLR) model, provides an accurate estimation of the parameters in the presence of multicollinearity. Some features of this model in the binary case are given below. Let us consider n observations of p continuous variables, collected in the matrix X = (x_ij), whose columns are standardized to zero mean and unit variance, and let S be the sample covariance matrix. It is well known that S can be written as S = V Λ V^T, where Λ = diag(λ_1, ..., λ_p) and V is orthogonal, its columns v_1, ..., v_p being the eigenvectors of S associated with the eigenvalues λ_1 ≥ ... ≥ λ_p. Let Z be the matrix whose columns are the principal components, given by Z = XV, so that the observation matrix can be written as X = ZV^T. The binary logistic regression model can then be expressed in terms of the principal components as

P(Y = 1 | z_i) = exp(Σ_{k=1}^{p} γ_k z_ik) / [1 + exp(Σ_{k=1}^{p} γ_k z_ik)],

where γ_k = Σ_{j=1}^{p} v_jk β_j, k = 1, ..., p, and the z_ik are the elements of Z = XV, i = 1, ..., n, k = 1, ..., p. Partitioning Z and V into the first s and the remaining r = p − s columns, the PCLR model in terms of the first s principal components of X is obtained by truncating the sum at k = s. According to [1], the principal components with the largest variance are not necessarily the most efficient predictors, because principal components with small variances can be correlated with the response variable. There are different methods for including principal components in the regression, and the interested reader should see [9] for a detailed explanation of the linear case.
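The PCLR pipeline can be sketched under stated assumptions (scikit-learn available, simulated nearly collinear covariates; the paper's own program was in Visual Basic): standardize, project onto the leading principal components, fit the logistic model in component space, and map the coefficients back through the loadings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p, k = 200, 5, 2
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, p)) + 0.01 * rng.normal(size=(n, p))  # nearly collinear columns
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize the covariates
pca = PCA(n_components=k).fit(Xs)
Z = pca.transform(Xs)                          # first k principal components
clf = LogisticRegression(max_iter=1000).fit(Z, y)
gamma = clf.coef_.ravel()                      # coefficients in component space
beta = pca.components_.T @ gamma               # back to the original covariates: beta = V gamma
# (the intercept is fitted separately and needs no back-mapping)
print(beta.shape)
```

Fitting in the low-dimensional, orthogonal component space sidesteps the instability that the collinear columns of X would cause in a direct fit.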

5.1.
Principal component logistic regression model extended to polytomous response. The generalization of the PCLR model for polytomous responses does not require a complex formulation. We begin by computing the covariance matrix S; the matrix X can then be written as X = ZV^T, as before, and the model is expressed in terms of the parameters β_tj, where j = 0, ..., p, t = 1, ..., s and β_sj = 0.
Making γ_tj = Σ_{k=1}^{p} v_kj β_tk, the PCLR model extended to polytomous responses is given by:

P(Y = t | z_i) = exp(Σ_j γ_tj z_ij) / Σ_{u=1}^{s} exp(Σ_j γ_uj z_ij), t = 1, ..., s, with γ_sj = 0.

In order to estimate the parameters of the principal components model, one can apply the Maximum Likelihood method. In the dichotomous case, [1] also proposes two methods to solve the problem of choosing the optimum principal components to include in the model. In this paper the purpose is only to investigate the PCLR model's classificatory performance in polytomous cases.

6.
Examples. In this section we consider three benchmark data sets: Iris Data, taken from [8], Fatty Acid Data, taken from [5], and Insulating Oil Data, taken from [16]. The purpose is to compare the results provided by the four models. We applied the CLR, ILR, HLR and PCLR models to the data sets. The computer program that implements the approaches described above was written in Visual Basic 6.0 and was run on an HP Pavilion b1040br computer. The results obtained, in terms of estimates and performance, are described below.

Example 1: Iris Data. This data set includes three groups: Iris Setosa (G_1), Iris Versicolor (G_2) and Iris Virginica (G_3). For each group there are 50 observations and four independent variables: Sepal Length, Sepal Width, Petal Length and Petal Width, all measured in mm. The reference group is Iris Virginica. It is well known that two groups (Iris Versicolor and Iris Virginica) overlap and form a cluster that is completely separated from the Iris Setosa group. There is no MLE for the CLR model. For the ILR model there is only the MLE β̂_2, which corresponds to the Iris Versicolor group. Table 1 displays the estimates, with their standard errors in brackets, obtained by the ILR model, together with the MEL estimates for the HLR model. One can see that the coefficients obtained from the ILR model are close to those from the HLR model. The correct classification rates for the HLR and PCLR models are summarized in Table 4.
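The Iris setting can be reproduced in outline with scikit-learn's bundled copy of the data set (an illustration only, not the paper's Visual Basic program; scikit-learn's default ridge penalty keeps the fit finite despite the separated Setosa group):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# Setosa (class 0) is completely separated from the overlapping
# Versicolor (1) / Virginica (2) cluster.
clf = LogisticRegression(max_iter=1000).fit(X, y)  # penalized, hence estimable
acc = (clf.predict(X) == y).mean()
print(round(float(acc), 3))
```

An unpenalized maximum likelihood fit would fail here for the reasons discussed above, which is precisely what motivates the MEL and penalized alternatives.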
From Table 4, we can conclude that both models have high classification capability in terms of correct classification rates.

Example 2: Fatty Acid Data. There are 120 observations, five groups and seven variables that represent the percentage levels of seven fatty acids, including palmitic, stearic, oleic, linoleic, eicosanoic and eicosenoic acids. In this paper we consider five groups: rapeseed (G_1), sunflower (G_2), peanut (G_3), corn (G_4) and pumpkin (G_5) oils. The original data set has eight groups and was used by [5] to develop and implement an automated method for classifying vegetable oil samples. According to [5], Principal Component Analysis (PCA) was successful in distinguishing clusters of different oil samples, but is not suitable for automatic prediction of vegetable oil classes, because PCA requires visual inspection and the final decision has to be made by an expert. The complete table of the original data can be found in the cited paper. In this paper we use group 5 (pumpkin oil) as the reference group. Table 5 displays the estimates and their standard errors obtained by the ILR and HLR models. Table 6 displays the principal components and their cumulative percentage of the total variance. Table 7 displays the estimates obtained with six principal components, and Table 8 displays the classification matrix for the HLR and PCLR models.

Example 3: Insulating Oil Data. This data set contains 2,567 observations, three groups and six variables that represent the physical and chemical characteristics observed in order to evaluate oil deterioration in power transformers: the neutralization number (NN), given in mg/KOH, the power factor (PF), given in percent, the dielectric strength (DS), in kV, the interfacial tension (IFT), in dyne/cm, the water content (WA), in p.p.m., and the oil temperature (OT), given in °C. Insulating oils in service are subject to deterioration due to mechanical and chemical conditions, and may oxidise, producing acids and sludge. In order to evaluate oil condition, oil samples should be taken from power transformers and tested. Insulating oils in service can be classified as: oils in satisfactory condition (G_1), oils that have low breakdown voltage and require reconditioning (G_2), and oils with low interfacial tension, high acidity and low electrical resistivity, which should be replaced (G_3). In this paper we applied the four models in order to investigate their performance when the data overlap. Table 9 displays the estimates and standard errors obtained for the CLR, ILR and HLR models. The reference group is G_3 (oil to be replaced). Table 10 displays the principal components and their cumulative percentage of the total variance. Table 11 displays the estimates, and their standard errors, obtained by the PCLR model with six principal components, and Table 12 summarizes the correct classification rates. Based on the results, we conclude that logistic regression provides an alternative for developing a computational tool to evaluate insulating oil condition.

7.
Conclusion. The application of the Classical Logistic Regression model and the Individualized Logistic Regression model has been reported in many studies, involving bankruptcy prediction and cancer classification, among other applications. However, while these techniques work well in many situations, they may fail when the data set has no overlap. Moreover, the logistic model becomes unstable when there is dependence between the explanatory variables. The purpose of this work is to develop and implement a direct generalization of two alternative methods, the Hidden Logistic Regression and Principal Component Logistic Regression models, to polytomous responses, and to explore the performance of both models compared to the Classical Logistic Regression and Individualized Logistic Regression models. The main advantage of the Hidden Logistic Regression model is that the MEL estimator always exists and is unique, even when the data set has no overlap. The Principal Component Logistic Regression model allows a reduction of the number of dimensions of a logistic regression model with continuous variables while avoiding multicollinearity among these variables. One can see that the model proposed by [18], employed to model the relationship between a dichotomous response variable and a set of covariates, can be used with a few modifications when the response variable is polytomous. For practical purposes, the main advantage is the existence and uniqueness of the estimators. Furthermore, if the data set presents overlap, the estimators are close to those obtained from the Classical Logistic Regression and Individualized Logistic Regression models. With respect to the model proposed by [1], we can see that this model, when applied to a polytomous response case, outperforms the Hidden Logistic Regression model in at least one simulation, although it is not immune to complete or quasi-complete separation.
In the future we intend to study the behavior of the models considered here with respect to aspects such as their performance on data sets with a reduced number of observations and the bias of the estimators obtained. In the particular case of the HLR model, the purpose is to study the influence that the