Modeling Governance KB with CATPCA to Overcome Multicollinearity in the Logistic Regression

Multicollinearity is a common problem in logistic regression modeling. Multicollinearity among the explanatory variables biases the parameter estimates and also leads to classification errors. Stepwise regression is generally used to overcome multicollinearity in regression. Another approach, which retains all variables for prediction, is Principal Component Analysis (PCA). However, classical PCA applies only to numeric data. When the data are categorical, one method that solves the problem is Categorical Principal Component Analysis (CATPCA). The data used in this research were part of the 2012 Indonesia Demographic and Health Survey (IDHS). This research focuses on the characteristics of women in relation to their use of contraceptive methods. Classification results were evaluated using Area Under Curve (AUC) values; the higher the AUC value, the better. Based on AUC values, classification of contraceptive method use with the stepwise method (58.66%) is better than with the logistic regression model (57.39%) and CATPCA (57.39%). Evaluation by sensitivity shows the opposite: the CATPCA method (99.79%) is better than the logistic regression method (92.43%) and the stepwise method (92.05%). Because this study focuses on classifying the major class (using a contraceptive method), the selected model is CATPCA, since it raises the classification accuracy for the major class.


Introduction
In modeling studies, associations between variables are often found: between the response variable and the explanatory variables, among the response variables, and among the explanatory variables. Association among the response variables causes the residuals to be correlated. As a result, residual plots cannot describe the actual state of the data, so their interpretation becomes incorrect. However, association among the response variables can still be controlled when the number of residual degrees of freedom is large [2].
Relationships among more than two explanatory variables are often referred to as multicollinearity. A linear relationship among the explanatory variables makes the regression parameter estimates unreliable, even when the coefficient of determination is large and significant. In the logistic regression model, multicollinearity among the explanatory variables likewise invalidates the parameter estimates. This is because the entries of the variance-covariance matrix become interrelated, so the parameter estimates are not appropriate [1].
L Khikmah, H Wijayanto and U D Syafitri
Corresponding author: aisyah.salsabila17@gmail.com

Some researchers carry out multicollinearity detection before further analysis. If the explanatory variables are numeric, multicollinearity is detected using correlation matrices, the variance inflation factor (VIF), and the eigenvalues of the correlation matrix. If the explanatory variables are categorical, multicollinearity is detected using association analysis. Techniques for handling multicollinearity fall into two groups: removal methods (discarding the correlated explanatory variables) and variable selection methods. The variable selection methods generally used are the best subset method and the backward stepwise regression method. However, the stepwise method and VIP still do not handle multicollinearity adequately, so several extensions have been developed. For linear data, multicollinearity can be handled with PLS-VIP, PLS-BETA, LASSO, and stepwise regression [3]. Bastien [4] examined PLS-LR and Aguilena [5] examined PCLR. Both methods handle multicollinearity in logistic regression well. For explanatory variables in the form of nominal and ordinal data (nonlinear), Kemalbay [1] developed an approach using Categorical Principal Component Analysis (CATPCA). This study compares stepwise regression and CATPCA in a case study of governance KB modeling with logistic regression.
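The association analysis mentioned above for categorical explanatory variables can be illustrated with a short sketch. The source does not say which association measure was used; Cramér's V, computed from the chi-square statistic of a contingency table, is assumed here as one common choice, and the function name and sample data are hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Cramér's V between two equal-length sequences of category labels.

    Assumed illustration: values near 0 suggest little association,
    values near 1 suggest strong association between the two variables.
    """
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    row_totals = Counter(x)
    col_totals = Counter(y)
    observed = Counter(zip(x, y))

    # Pearson chi-square: sum over all cells of (observed - expected)^2 / expected.
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            chi2 += (observed.get((r, c), 0) - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        return 0.0  # a constant variable carries no association
    return sqrt(chi2 / (n * k))


# Hypothetical category labels, not taken from the IDHS data.
education = ["low", "low", "high", "high"]
knowledge = ["no", "no", "yes", "yes"]
v_strong = cramers_v(education, knowledge)

residence = ["urban", "urban", "rural", "rural"]
working = ["yes", "no", "yes", "no"]
v_none = cramers_v(residence, working)
```

In this toy data the first pair of variables moves together perfectly and the second pair is unrelated, which is the kind of contrast a pairwise association screen is meant to surface before fitting the logistic regression.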

Categorical Principal Component Analysis (CATPCA)
The goal of traditional principal component analysis (PCA) is to reduce m variables to a smaller number p of uncorrelated variables, called principal components, which account for as much of the variance in the data as possible. PCA is suitable for continuous variables scaled at a numerical level of measurement, such as interval or ratio, and it assumes linear relationships among the variables, so it is not an appropriate dimension reduction method for categorical variables. As an alternative, categorical (also known as nonlinear) principal components analysis (CATPCA) has been developed for data with mixed measurement levels, such as nominal, ordinal, or numeric, which may not be linearly related to each other. For categorical variables, CATPCA uses an optimal scaling process that transforms the category labels into numerical values while maximizing the variance accounted for among the quantified variables [6]. We refer to [7] for a historical review of CATPCA using optimal scaling. For continuous numeric variables, the optimal scaling process is the same as in the traditional case.
Suppose we have measurements of n individuals on m variables, given as an $n \times m$ observed scores matrix H in which each variable is denoted by $X_j$, $j = 1, \ldots, m$, that is, the jth column of H. If the variable $X_j$ is of nominal or ordinal measurement level, a nonlinear transformation called optimal scaling is required, in which each observed score is transformed into a category quantification given by:

$q_j = \varphi_j(X_j)$ (1)

where Q, with columns $q_j$, is the matrix of category quantifications. Let S be the $n \times p$ matrix of object scores, which are the scores of the individuals on the principal components obtained by CATPCA. The object scores are multiplied by a set of optimal weights called component loadings. Let A be a $p \times m$ matrix of the component loadings, whose jth column is denoted by $a_j$.
The loss function to be minimized, which measures the difference between the original data and the principal components, can then be given as follows:
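In the optimal scaling literature, the CATPCA loss is commonly written in the form below. It is given here as a sketch under stated assumptions rather than as this paper's own equation: $q_j$ is taken to be the n-vector of quantified scores for variable j, $a_j$ the p-vector of component loadings for variable j, and S the $n \times p$ matrix of object scores.

```latex
L(Q, A, S) = n^{-1} \sum_{j=1}^{m}
  \operatorname{tr}\left[ \left( q_j a_j^{\top} - S \right)^{\top}
                          \left( q_j a_j^{\top} - S \right) \right]
```

Each term $q_j a_j^{\top}$ is an $n \times p$ matrix, so the trace sums the squared differences between the weighted quantified variable and the object scores; minimizing over Q, A, and S yields the quantifications, loadings, and scores jointly.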

Multinomial Logistic Regression
Logistic regression analysis is used to model a categorical response variable Y based on explanatory variables X that may be numerical or categorical. Logistic regression analysis makes several assumptions about the nature of the data: (1) the dependent variable should be categorical; (2) there is no significant correlation between the independent variables; (3) the number of observations for each variable should be adequate, and the overall sample size should be large enough. A sample of at least 400 units is needed to obtain a well-fitting model [8]. The conditional probability can be written:

$\pi(x) = \dfrac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p)}$

Equivalently, the logit has the following linear form:

$\ln\left(\dfrac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p$

which is the link function connecting the response to the explanatory variables [9].
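The relationship between the conditional probability and the logit can be checked numerically. The following is a minimal sketch; the coefficient values and the observation are hypothetical and are not estimates from this study.

```python
from math import exp, log


def linear_predictor(beta0, betas, x):
    """beta0 + beta1*x1 + ... + betap*xp."""
    return beta0 + sum(b * xi for b, xi in zip(betas, x))


def logistic(eta):
    """pi = exp(eta) / (1 + exp(eta)), written to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + exp(-eta))
    z = exp(eta)
    return z / (1.0 + z)


def logit(p):
    """Log-odds ln(p / (1 - p)), the inverse of the logistic function."""
    return log(p / (1.0 - p))


# Hypothetical coefficients and one observation with two explanatory variables.
beta0, betas = -1.0, [0.8, -0.5]
x = [1.0, 0.0]

eta = linear_predictor(beta0, betas, x)  # value on the logit (linear) scale
pi = logistic(eta)                       # conditional probability pi(x)
```

Applying `logit` to `pi` recovers `eta`, which is the sense in which the logit is linear in the explanatory variables even though the probability itself is not.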
The parameters are tested simultaneously using a likelihood ratio test.
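A likelihood ratio test compares the log-likelihood of a model without explanatory variables to that of the fitted model. The sketch below is an assumed illustration on hypothetical data with a single binary predictor, for which the maximum-likelihood fitted probabilities are simply the observed proportions in each group; the statistic is compared with 3.841, the 5% critical value of the chi-square distribution with one degree of freedom.

```python
from math import log


def bernoulli_loglik(y, p):
    """Log-likelihood of 0/1 outcomes y under fitted probabilities p."""
    return sum(log(pi) if yi == 1 else log(1.0 - pi) for yi, pi in zip(y, p))


# Hypothetical data: binary predictor x and binary response y.
x = [0] * 10 + [1] * 10
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# Null model (intercept only): every fitted probability is the overall mean.
p_null = [sum(y) / len(y)] * len(y)


def group_rate(g):
    """Observed proportion of y == 1 among observations with x == g."""
    ys = [yi for xi, yi in zip(x, y) if xi == g]
    return sum(ys) / len(ys)


# Model with x: fitted probability equals the proportion within each group.
rates = {g: group_rate(g) for g in set(x)}
p_full = [rates[xi] for xi in x]

# Likelihood ratio statistic G = -2 (ln L0 - ln L1).
G = -2.0 * (bernoulli_loglik(y, p_null) - bernoulli_loglik(y, p_full))

CHI2_CRIT_DF1_05 = 3.841  # chi-square critical value, 1 df, alpha = 0.05
reject_null = G > CHI2_CRIT_DF1_05
```

With several explanatory variables the same statistic is used, with degrees of freedom equal to the number of parameters being tested; fitting such a model requires an iterative estimator, which is outside the scope of this sketch.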

Data
The data used in this research are secondary data, taken from the 2012 Indonesian Demographic Health Survey (IDHS) and obtained from Demographic and Health Surveys (DHS). The unit of observation is the household, specifically women of productive age (15-49 years), totaling 29,882 people. Eight categorical variables were used; they are shown in Table 1.

Classification Table
Classification accuracy results can be seen in Table 2.
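The measures reported in the classification tables (accuracy, sensitivity, and specificity) can be computed from the four cells of a two-class confusion matrix. The sketch below assumes that "using a contraceptive method" is the positive class; the counts are hypothetical and are not the values in Table 2.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp: positives classified as positive   fn: positives classified as negative
    fp: negatives classified as positive   tn: negatives classified as negative
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of the positive class found
        "specificity": tn / (tn + fp),  # share of the negative class found
    }


# Hypothetical counts for an imbalanced two-class problem.
metrics = classification_metrics(tp=90, fn=10, fp=30, tn=20)
```

In this toy example sensitivity is high while specificity is low, the same pattern the text describes when minority-class observations are absorbed into the majority class.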

Analysis Method
The steps of data analysis performed in this study are as follows: (1) Explore the data to obtain a general overview of contraceptive users in the 2012 IDHS data. (2) Model the choice between using and not using a contraceptive method with logistic regression, using the stepwise regression method and CATPCA. In this step, logistic regression is also fitted while retaining the association between the explanatory variables, so that the difference between modeling results with and without that association is visible. The model is built according to equation 6.

Based on Table 2, the classification accuracy of the logistic regression model is adequate at 61.54%. The AUC value of 57.39% likewise indicates adequate accuracy. Sensitivity is high at 92.43%, which means that users of a contraceptive method are mostly classified correctly, although minority-class observations may still be misclassified into the majority class.

The logistic regression model with the stepwise method is built by entering the variables most associated with Y, namely age category, wife's education, knowledge of contraceptives, number of living children, place of residence, and current working status, which gives Table 3.

Result
This section measures the classification performance of the multinomial logistic regression models that have been fitted.
After the fit of the obtained models was tested, classification accuracy was assessed. Based on Table 3, the classification accuracy of the stepwise logistic regression model is adequate at 60.80%. The AUC value of 58.66% likewise indicates adequate accuracy. Sensitivity is high at 92.05%, which means that users of a contraceptive method are mostly classified correctly, although minority-class observations may still be misclassified into the majority class. With CATPCA, we get Table 4.

Table 4. Classification Performance with CATPCA Logistic Regression Model
Based on Table 4, the classification accuracy of the CATPCA logistic regression model is adequate at 61.16%. The AUC value of 57.39% likewise indicates adequate accuracy. Sensitivity is very high at 99.79%, which means that users of a contraceptive method are almost always classified correctly, although minority-class observations may still be misclassified into the majority class.

Comparison Model
The models obtained are then compared by their level of accuracy using AUC values. The AUC value of the stepwise logistic regression model (58.66) is higher than that of the logistic regression model (57.39) and CATPCA (57.39). This shows that the stepwise logistic regression model is more accurate than the logistic regression and CATPCA models. In addition to comparing AUC values, the sensitivity and specificity of each model may be considered, since the AUC value is composed of sensitivity and specificity. This study focuses on sensitivity because, in this case, use of a contraceptive method is more common than non-use. According to Table 4, the CATPCA model improves sensitivity considerably, although specificity decreases. The sensitivity of the CATPCA model (99.79) is higher than that of the logistic regression model (92.43) and the stepwise model (92.05). Thus, based on sensitivity, the CATPCA model predicts the major class considerably better.
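The AUC used in this comparison can be read as the probability that a randomly chosen positive observation receives a higher predicted score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison, which is an assumed illustration and not necessarily the procedure used to produce the values above; the scores and labels are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts, over all (positive, negative) pairs, how often the positive
    observation has the higher score; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities and true classes (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
auc_value = auc(scores, labels)
```

An AUC of 0.5 corresponds to scores that do not separate the classes at all, which is why values only slightly above 50% indicate weak discrimination even when sensitivity is very high.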

Conclusions and Recommendations
This research concludes that, based on AUC values, classification of contraceptive method use with the stepwise logistic regression model is better than with the logistic regression model and CATPCA, while based on sensitivity the CATPCA model is better than the logistic regression and stepwise models. Because this study focuses on the major class (use of a contraceptive method), CATPCA is preferred, as it raises the sensitivity for the major class. In general, when the data are imbalanced, CATPCA performs less well in overall classification accuracy but is able to classify the major class correctly. In the actual data there are four response categories, two of which have small percentages, so the data are imbalanced. Further research should therefore apply methods for handling imbalanced data in order to optimize classification performance.