Regularized Feature Selection in Categorical PLS for Multicollinear Data

This article presents an algorithm that models categorical multicollinear data while balancing model accuracy on test data against the number of features selected in the model. Multicollinear data are generated in all scientific fields, where some variables are noise and others are influential with reference to the response variable. In mathematical and statistical modeling of public health data, both the features and the response are often categorical. These datasets are usually collinear, which makes partial least squares (PLS) a potential method; however, PLS does not perform feature selection by default and is designed for quantitative features. Recently, categorical PLS (Cat-PLS) was introduced. We have implemented regularized feature selection in Cat-PLS, combining filter-based feature selection with categorical association measures: Cramer's V, Phi coefficient, Tschuprow's T coefficient, Contingency Coefficient, Yule's Q, and Yule's Y. Monte Carlo simulation with 100 runs indicates that CramerV∗VIP is the better choice in terms of model performance, number of selected features, and interpretation for modeling stillbirths, which is taken as the case study. The framework can be used in related areas to explore and model related data structures.


Introduction
Many sciences now produce multicollinear datasets, where the task is to establish meaningful relations for better understanding and interpretation of real-life processes [1][2][3][4]. As in other sciences, in public health the researcher is interested in selecting the influential features from $X_{n \times p}$ [2,5,6] that explain the variation in the response or outcome $y_{n \times 1}$, where n is the sample size and p is the number of features. Here, the data usually comprise correlated features of a categorical nature [7]. Logistic regression-type models are the well-known candidates for modeling a categorical response [8]. In the presence of multicollinearity, the variance of the logistic regression estimates becomes very large; hence, logistic regression does not work optimally if the features are correlated. Ridge logistic regression controls the variance of the estimates but is unable to identify the influential features [9] and is not designed for categorical X. An alternative to logistic-type regression models is partial least squares (PLS) regression, a statistical learning approach specifically designed to model correlated X [10]. The PLS algorithm continues to be developed, and a recent contribution is its extension to categorical X features [7], called Categorical PLS (CPLS). In PLS, loading weights have a pivotal role in model building, since they reflect the mutual correlation of each X feature with the response y. The PLS loading weights are closely related to the Pearson coefficient of correlation, which in CPLS is replaced with Cramer's V, Phi coefficient, Tschuprow's T coefficient, Contingency Coefficient, Yule's Q, or Yule's Y. Hence, CPLS is a potential candidate for modeling a categorical response y with multicollinear X. Several methods have been proposed for influential feature selection in PLS and are reviewed in [11].
Feature selection in PLS can be grouped into three broad categories: filter, embedded, and wrapper methods. Filter feature selection is a two-step procedure: at the first stage, PLS is fitted, and at the second stage, a filter measure is computed. Features above a threshold are marked as influential. In embedded feature selection, the filter selection is embedded in the iterative computational loop of PLS. In wrapper feature selection, an external loop is placed around the filter selection, and in each pass of the loop, selection is carried out, the model is updated, and performance is evaluated. All three types of feature selection methods have their own advantages and disadvantages. For instance, filter methods are very fast but may have low performance. Embedded and wrapper methods are computationally more expensive but are expected to perform better than filter methods. The regularized elimination procedure for feature selection in PLS [12] is a potential candidate among the wrapper selection methods. It selects a model next to the optimal one, provided that the selected model's performance is not significantly different from that of the optimal model while the selected model has fewer features. In this article, we propose a modification of the regularized elimination procedure that operates on categorical PLS instead of standard PLS.
We have applied the proposed regularized elimination in categorical PLS to stillbirth, which is a crucial issue in developing countries. Although considerable progress has been observed in the last 25 years [13], the issue still needs considerable attention [14]. Several surveys cover issues regarding stillbirth in Pakistan, but there is a need to determine the causes of stillbirth [15]. The most comprehensive and reliable source of data on stillbirths and related features is the Pakistan Demographic and Health Survey (PDHS), conducted by the National Institute of Population Studies (NIPS) with technical assistance and funding from USAID. Although the case study taken here is from public health, the proposed method is applicable to other categorical multicollinear data. Possible application areas include engineering, robotics, gaming, chemometrics, and bioinformatics.

Data Set.
The data set was obtained from the 2017-18 Pakistan Demographic and Health Survey (PDHS) (https://www.nips.org.pk/PDHS-Data-Set.htm), which was designed to provide population and health indicators at the national and regional levels.
The sample design contained specific indicators for each of the five provinces (Punjab, Sindh, Khyber Pakhtunkhwa (KPK), Balochistan, and Gilgit Baltistan) of Pakistan. Following the WHO definition of late fetal deaths for international comparison, the sample of stillbirths for this study was restricted to births at 28 or more weeks of gestation. Women with incomplete information were excluded from the sample, after which 752 women who experienced stillbirths and 1504 women who had live births were included in the analysis. The women included in the case group were mothers of newborn babies without signs of life after at least 28 weeks of pregnancy, while the women included in the control group had live births. The response variable y of this study was the occurrence (labeled as 1) or nonoccurrence (labeled as 0) of stillbirth among women of childbearing age (15-49 years). Maternal features, such as social, economic, and other health features related to stillbirths, were taken as explanatory features, i.e., the X matrix.

Categorical Partial Least Squares (CPLS).
Categorical Partial Least Squares (CPLS) [7] models categorical data sets and is an extension of standard PLS [10]. The algorithm for CPLS starts with the centered feature data $X_0 = X - 1\bar{x}'$ and the centered response $y_0 = y - 1\bar{y}$. CPLS is an iterative procedure based on C iterations called components. In each CPLS component $c = 1, 2, \ldots, C$, the loading weights, score vector, X-loadings, y-loading, and deflated X and y are computed as follows.
(1) Compute the loading weights $w_c$. They can be defined through Cramer's V $w_{CV}$ [16], Phi coefficient $w_{PC}$ [16], Tschuprow's T coefficient $w_{TC}$ [17], Contingency Coefficient $w_{CC}$ [18], and Yule's Q $w_{YQ}$ and Yule's Y $w_{YY}$ [19] as
$$w_{CV} = \sqrt{\frac{\chi^2}{n \min(r - 1, c - 1)}},$$
where $\chi^2$ is derived from Pearson's chi-squared test, n is the total number of observations, and r and c denote the number of categories in the response and in the respective feature;
$$w_{PC} = \sqrt{\frac{\chi^2}{n}},$$
which is referred to as the mean square contingency coefficient;
$$w_{TC} = \sqrt{\frac{\phi^2}{\sqrt{(r - 1)(c - 1)}}},$$
which is the refined form of the Phi loading weights, where r and c again denote the number of categories in the response and in the respective feature, and $\phi^2$ is the mean square contingency defined as
$$\phi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(\pi_{ij} - \pi_{i+}\pi_{+j})^2}{\pi_{i+}\pi_{+j}},$$
where $\pi_{ij}$ is the proportion of the sample in the (i, j)th cell of the r × c contingency table and $\pi_{i+}$ and $\pi_{+j}$ are the corresponding row and column proportions;
$$w_{CC} = \sqrt{\frac{\chi^2}{\chi^2 + n}},$$
which measures the strength of association between categorical features; and
$$w_{YQ} = \frac{OR - 1}{OR + 1}, \qquad w_{YY} = \frac{\sqrt{OR} - 1}{\sqrt{OR} + 1},$$
which determine the strength of the relationship between the feature and the response based on the odds ratio (OR). The loading weights are then normalized, $w_c \leftarrow w_c / \|w_c\|$.
(2) Compute the score vector $t_c$ by $t_c = X_{c-1} w_c$.
(3) Compute the X-loading $p_c$ and the y-loading $q_c$ by regressing $X_{c-1}$ and $y_{c-1}$ on the score vector:
$$p_c = \frac{X_{c-1}' t_c}{t_c' t_c}, \qquad q_c = \frac{y_{c-1}' t_c}{t_c' t_c}.$$
(4) Deflate $X_{c-1}$ and $y_{c-1}$ by subtracting the contribution of $t_c$:
$$X_c = X_{c-1} - t_c p_c', \qquad y_c = y_{c-1} - t_c q_c.$$
(5) If $c < C$, return to (1).
The loading weights, score vectors, and loadings computed in each component are stored in the respective matrices/vectors W, T, P, and q.
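The association measures named in step (1) can be illustrated with a short sketch. It assumes the standard textbook definitions of each coefficient, computed from the contingency table of one feature against the response; the function names and the example table are hypothetical and are not taken from the authors' R code.

```python
import math


def chi_square(table):
    """Pearson chi-squared statistic and sample size for an r x c count table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Assumes no empty row or column, so expected counts are positive.
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n


def association_measures(table):
    """Association measures for one response-by-feature contingency table."""
    chi2, n = chi_square(table)
    r, c = len(table), len(table[0])
    phi2 = chi2 / n  # mean square contingency
    out = {
        "cramer_v": math.sqrt(phi2 / min(r - 1, c - 1)),
        "phi": math.sqrt(phi2),
        "tschuprow_t": math.sqrt(phi2 / math.sqrt((r - 1) * (c - 1))),
        "contingency": math.sqrt(chi2 / (chi2 + n)),
    }
    if r == 2 and c == 2:
        # Yule's Q and Y are defined for 2 x 2 tables (nonzero cells assumed).
        (a, b), (cc, d) = table
        odds_ratio = (a * d) / (b * cc)
        out["yule_q"] = (odds_ratio - 1) / (odds_ratio + 1)
        out["yule_y"] = (math.sqrt(odds_ratio) - 1) / (math.sqrt(odds_ratio) + 1)
    return out


def normalize(weights):
    """Scale a loading-weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


# Hypothetical 2 x 2 table: rows = response (0/1), columns = feature level.
example = [[30, 10], [10, 30]]
measures = association_measures(example)
```

In a CPLS component, one such value would be computed per feature and the resulting vector passed through `normalize` before the scores are formed.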

Regularized Elimination in CPLS.
Regularized elimination is a wrapper feature selection method. The current version is a modification and simplification of regularized elimination in PLS [12]. Here, the filter measures need to be attached to CPLS in a wrapper function. We have considered the following filter measures, which reflect the level of importance of each explanatory feature for the response:
(i) Loading weights (LW): PLS LW reflect the covariance of feature j with the response; hence, the importance of feature j at CPLS component c can be measured by $r_j = |w_{c,j} / \max(w_c)|$. Features having $|LW| < u$ for some user-defined fixed threshold u can be eliminated from the model.
(ii) Regression coefficients (RC): RC is an established and well-known measure for feature selection, defined as $RC = W(P'W)^{-1}q$. Features having $|RC| < u$ for some user-defined fixed threshold u can be eliminated from the model.
(iii) Variable importance on projections (VIP): VIP for feature j is defined according to [20] as
$$VIP_j = \sqrt{\frac{p \sum_{c=1}^{c^*} q_c^2 \, t_c' t_c \, (w_{cj} / \|w_c\|)^2}{\sum_{c=1}^{c^*} q_c^2 \, t_c' t_c}},$$
where $c = 1, 2, \ldots, c^*$, $w_{cj}$ is the loading weight for feature j using c components, and $t_c$, $w_c$, and $q_c$ are the CPLS scores, loading weights, and y-loadings, respectively, corresponding to the cth component. Feature j can be eliminated if $VIP_j < u$ for some user-defined threshold $u \in [0, \infty)$.
(iv) Selectivity ratio (SR): SR is based on the target projection approach [21], which is a postprojection of the predictor features onto the fitted response vector from the estimated model. For each feature j, $SR_j$ can be computed as
$$SR_j = \frac{Var_{exp,j}}{Var_{res,j}},$$
where $Var_{exp,j}$ is the explained variance and $Var_{res,j}$ is the residual variance for feature j from the target projection model. Feature j can be eliminated if $SR_j < u$ for some user-defined threshold $u \in [0, \infty)$.
(v) Significance multivariate correlation (SMC): SMC [22] is defined as the ratio of the mean square of regression to the mean square of residuals:
$$SMC_j = \frac{MS_{j,\mathrm{regression}}}{MS_{j,\mathrm{residual}}},$$
where feature j can be eliminated if $SMC_j < u$ for some user-defined threshold $u \in [0, \infty)$.
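As an illustration of how a filter measure turns a fitted model into a selection rule, the sketch below computes VIP from stored loading weights, scores, and y-loadings and keeps the features at or above a threshold u. It assumes the commonly used VIP form from the PLS literature; the function names and the small numeric example are hypothetical.

```python
import math


def vip_scores(W, T, q):
    """VIP value for each feature.

    W: list of C loading-weight vectors, each of length p
    T: list of C score vectors, each of length n
    q: list of C y-loadings
    """
    p = len(W[0])
    # Response variation explained by each component: q_c^2 * t_c' t_c
    ss = [q_c ** 2 * sum(t * t for t in t_c) for q_c, t_c in zip(q, T)]
    total = sum(ss)
    vip = []
    for j in range(p):
        weighted = 0.0
        for w_c, ss_c in zip(W, ss):
            norm2 = sum(w * w for w in w_c)  # squared length of w_c
            weighted += ss_c * w_c[j] ** 2 / norm2
        vip.append(math.sqrt(p * weighted / total))
    return vip


def select_by_threshold(scores, u):
    """Indices of features whose filter value is at least u."""
    return [j for j, s in enumerate(scores) if s >= u]


# Hypothetical two-component fit with three features and four samples.
W = [[0.8, 0.6, 0.0], [0.0, 0.6, 0.8]]
T = [[1.0, -1.0, 1.0, -1.0], [0.5, 0.5, -0.5, -0.5]]
q = [0.9, 0.4]
vip = vip_scores(W, T, q)
kept = select_by_threshold(vip, u=1.0)
```

A useful sanity check on this form is that the squared VIP values sum to the number of features p, so a threshold of 1 corresponds to the average squared importance.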
Once the filter measures are defined, the elimination procedure for removing the 'worst' features from the CPLS model can be presented. Let $Z_0 = X$ and let $F_j$ be any filter measure from LW, RC, VIP, SR, or SMC.
(1) For iteration g, run y and $Z_g$ through cross-validated CPLS and compute the performance $P_g$. The matrix $Z_g$ has $p_g$ columns, and for the filter measure in use, we get $p_g$ criterion values, which are sorted as $s_{(1)}, \ldots, s_{(p_g)}$.
(2) There will be M criterion values below the threshold u, i.e., M noninfluential features. Let $N = fM$ for some fraction $f \in (0, 1]$. Eliminate the features corresponding to the N most extreme criterion values.
(3) If more than one feature is left, let $Z_{g+1}$ contain these features, and return to (1).
The fraction f determines the pace of the elimination algorithm: a small f eliminates only a few features in each iteration. With each iteration, the number of influential features in $Z_g$ decreases, but the performance may increase or decrease. An increase in performance is due to the removal of noise features, and a decrease in performance is due to the removal of relevant features. Beyond the optimal iteration $g^*$ with performance $P^* = P_{g^*} = \max_g P_g$, the number of features is reduced further at the cost of a modest drop in performance. Hence, by eliminating beyond $g^*$, one could obtain a much simpler model with a small loss of performance. To conduct this regularization, the McNemar test can be used. The predictions of the optimal model are compared with those of the models beyond iteration $g^*$. If the prediction difference is not significant, and this holds over several iterations beyond $g^*$, then the selected model is the one having the smallest number of features.
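The two building blocks of this procedure, choosing which features to drop in one iteration and comparing two models with the McNemar test, can be sketched as follows. The function names are hypothetical, and two readings of the text are assumptions rather than statements of the authors' implementation: N = fM is rounded up to a whole number, and "most extreme" is taken to mean the smallest criterion values.

```python
import math


def features_to_eliminate(criteria, u, f):
    """Pick the features to drop in one elimination iteration.

    criteria: dict mapping feature name -> filter value
    u: threshold below which a feature counts as noninfluential
    f: elimination fraction in (0, 1]
    """
    below = sorted((v, name) for name, v in criteria.items() if v < u)
    m = len(below)
    # Assumption: N = f * M is rounded up, so at least one feature is dropped
    # whenever any value is below u; the small offset guards against
    # floating-point overshoot such as 0.1 * 30 = 3.0000000000000004.
    n_drop = math.ceil(f * m - 1e-9)
    # Assumption: "most extreme" means the smallest criterion values.
    return [name for _, name in below[:n_drop]]


def mcnemar_p_value(y_true, pred_a, pred_b):
    """McNemar test with continuity correction for two sets of predictions."""
    # b: only model A is correct; c: only model B is correct
    b = sum(a == y and p != y for y, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != y and p == y for y, a, p in zip(y_true, pred_a, pred_b))
    if b + c == 0:
        return 1.0  # the two models never disagree on correctness
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Upper tail of the chi-squared distribution with 1 degree of freedom
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical filter values for five features.
criteria = {"x1": 0.2, "x2": 1.4, "x3": 0.7, "x4": 0.1, "x5": 0.9}
dropped = features_to_eliminate(criteria, u=1.0, f=0.5)
```

In the wrapper loop, a reduced model would be kept as a candidate for the selected model whenever `mcnemar_p_value` against the optimal model's predictions stays above the chosen significance level.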

Mathematical Problems in Engineering

Model Fitting and Validation.
The regularized elimination in CPLS has several parameters to tune, for instance, the elimination fraction f, the number of CPLS components c, and the threshold u used for the filter measure. The elimination fraction f affects the number of iterations in regularized elimination in CPLS; hence, it mostly affects the computational time, so we can take $f = 0.1$, which means that, in each iteration, only 10% of the features with extreme criterion values are eliminated. Both u and c can affect the model performance; hence, they need to be tuned. We have considered $c = 1, 2, \ldots, 20$ and four distribution-based levels, i.e., quantiles, for u. For this, we first computed the respective filter measure for all features in the model; then, the four quantiles of the filter measure were used as the different levels of u. For model evaluation and parameter tuning, we have adopted a cross-validation procedure. The full data set was divided into training (70%) and test (30%) data. Using the training data, the CPLS model was fitted for all possible combinations of u and c, and the performance on the test data was computed. We have used accuracy as the performance measure, i.e., how well CPLS predicts the response on the test data. Since the split of the data into training and test sets is random, we have used Monte Carlo simulation with 100 runs to minimize the effect of this randomness; in each run, the CPLS model was fitted and evaluated as described above.
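The validation scheme described above can be sketched as repeated random 70/30 splits combined with a grid over u and c. The function names, the fixed seed, and the placeholder u levels are assumptions for illustration; in the study the u levels come from the quantiles of the fitted filter measure.

```python
import random


def monte_carlo_splits(n, runs=100, train_frac=0.7, seed=0):
    """Yield (train_idx, test_idx) for repeated random train/test splits."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    n_train = round(train_frac * n)
    for _ in range(runs):
        idx = list(range(n))
        rng.shuffle(idx)
        yield sorted(idx[:n_train]), sorted(idx[n_train:])


def accuracy(y_true, y_pred):
    """Share of samples whose class is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def tuning_grid(u_levels, max_components=20):
    """All (u, c) pairs to evaluate in each run."""
    return [(u, c) for u in u_levels for c in range(1, max_components + 1)]


splits = list(monte_carlo_splits(n=141))
# Placeholder threshold levels; not the quantiles used in the study.
grid = tuning_grid(u_levels=[0.25, 0.5, 0.75, 1.0])
```

Each run would fit the model on the training indices for every pair in `grid` and record `accuracy` on the test indices.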

Computations.
All methods are implemented in the R computing environment (http://www.r-project.org/), and the code is available from the corresponding author upon request.

Results and Discussion
The data set contains a total of 2256 births with 752 stillbirths, and we observed several outliers in the data. Moreover, for modeling, we assume that the samples are independent of each other. So, in order to remove outliers and to ensure that the samples are independent, we have used k-means clustering over the stillbirth and live birth samples separately. This yielded 141 independent samples, with 94 live births and 47 stillbirths, and 34 features covering maternal features, placental deficiency, fetal growth limitations, fetal growth features, and congenital features related to stillbirths, which were taken as explanatory features, i.e., the X matrix. In CPLS, there are six categorical measure-based loading weights, i.e., Cramer's V $w_{CV}$, Phi coefficient $w_{PC}$, Tschuprow's T coefficient $w_{TC}$, Contingency Coefficient $w_{CC}$, Yule's Q $w_{YQ}$, and Yule's Y $w_{YY}$. Each CPLS was fitted within regularized elimination, which utilizes five filter measures: loading weights (LW), regression coefficients (RC), variable importance on projections (VIP), selectivity ratio (SR), and significance multivariate correlation (SMC). Hence, there are 6 × 5 = 30 regularized elimination in CPLS models to fit and compare. For this, 100 Monte Carlo simulations were executed. The response y and the explanatory feature matrix X were divided into training (70%) and test (30%) data sets in each Monte Carlo simulation. Each of the 30 regularized eliminations in the CPLS model was fitted over the training data, while the test data were used to tune the model parameters and to measure the model performance, i.e., accuracy. Hence, each model was fitted and evaluated 100 times.
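The 30 model variants are simply every pairing of a loading-weight measure with a filter measure. A minimal sketch of that enumeration is given below; the label format follows the names used in the text (for example, CramerV * VIP written without spaces), and the variable names are hypothetical.

```python
from itertools import product

# Short labels for the six loading-weight measures and five filter measures.
LOADING_WEIGHTS = ["CramerV", "Phi", "TschuprowT", "ContCoef", "YuleQ", "YuleY"]
FILTER_MEASURES = ["LW", "RC", "VIP", "SR", "SMC"]

# One label per regularized elimination in CPLS model, e.g. "CramerV*VIP".
models = [f"{lw}*{fm}" for lw, fm in product(LOADING_WEIGHTS, FILTER_MEASURES)]

# One list of test accuracies per model, to be filled over the Monte Carlo runs.
results = {model: [] for model in models}
```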
Regularized elimination in CPLS selects a model beyond the optimal model whose response predictions are not significantly different from those of the optimal model. Hence, we have two models, named the optimal model and the selected model, for each of the 30 regularized eliminations in CPLS over each Monte Carlo run. In regularized elimination, we have used the McNemar test at a significance level of 0.05. Since several filter measures and categorical measures are used in regularized categorical PLS, we have used the Kruskal-Wallis test to study their effect on the accuracy on test data. The accuracy on test data varies significantly with the filter measures (p-value ≤ 0.01) and also with the categorical measures (p-value ≤ 0.01). Figure 1 presents the distribution and comparison of accuracy from both models.
This indicates that CPLS based on the Contingency Coefficient $w_{CC}$ with filter measure SMC (ContCoef * SMC) and with filter measure SR (ContCoef * SR) has low accuracy in the optimal model and consequently in the selected model. The left-hand panel of Figure 1 is a magnified view of the upper right part of the right-hand panel.
This indicates that CPLS based on Tschuprow's T coefficient $w_{TC}$ with filter measure LW (TschuprowT * LW), and on the Phi coefficient $w_{PC}$ with filter measure LW (Phi * LW) and with filter measure VIP (Phi * VIP), achieves the best accuracy for the optimal model but relatively low accuracy for the selected model. Yule's Q $w_{YQ}$ with filter measure LW (Yule Q * LW) and Cramer's V $w_{CV}$ with filter measure VIP (Cramer V * VIP) have reasonably good performance, which is duly supported by the Wilcoxon rank sum test with continuity correction (p-value = 0.047).
When choosing a model for feature selection, the stability of the model is an important aspect to consider. Figure 2 presents the standard deviations of accuracy for all fitted models. The variation is smaller for Yule Y * LW, Yule Q * SMC, Yule Q * SR, and Cramer V * VIP. Combining the accuracy analysis of Figure 1 with the stability analysis of Figure 2, we can conclude that Cramer V * VIP achieves good average accuracy for both the optimal model and the selected model and, at the same time, has higher stability in accuracy for the selected model, since it shows a lower standard deviation of accuracy.
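The stability comparison amounts to summarizing each model's test accuracy over the Monte Carlo runs by its mean and standard deviation. The sketch below shows one way to do this; the function names are hypothetical, and the accuracies are made-up values for two placeholder models, not results from the study.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Mean and sample standard deviation of accuracy for each model."""
    return {model: (mean(vals), stdev(vals)) for model, vals in accuracies.items()}


def most_stable(summary):
    """Model label with the smallest standard deviation of accuracy."""
    return min(summary, key=lambda model: summary[model][1])


# Hypothetical accuracies from four runs of two placeholder models.
runs = {
    "ModelA": [0.80, 0.82, 0.78, 0.80],
    "ModelB": [0.90, 0.70, 0.85, 0.75],
}
summary = summarize_runs(runs)
stable = most_stable(summary)
```

Here both placeholder models have the same mean accuracy, so the standard deviation is what separates them, which mirrors the role of Figure 2 alongside Figure 1.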
When it comes to a parsimonious model, model size together with the accuracy of the fitted model is important. A smaller number of features in the fitted model makes the model better suited for interpretation and for understanding the real-life phenomenon. The distribution of the number of features used in the selected and optimal models is presented in Figure 3. All selected models have relatively low numbers of features compared to the optimal model, which is the expected pattern from the regularized elimination in CPLS algorithm. Cramer V * VIP has a smaller number of features in the selected model as well as good and consistent accuracy; hence, it can be used as a potential candidate for modeling the occurrence or nonoccurrence of stillbirth among women of childbearing age. The influential features obtained by fitting Cramer V * VIP are presented in Table 1. The count and percentage of these features over the occurrence and nonoccurrence of stillbirth, together with their odds ratio (OR) and significance, are also presented. Since only a small subset of refined cases is considered here, the presented effects should not be used to quantify trends.
Results indicate that stillbirths are 1.51 times more likely in Balochistan province than in Punjab province. Punjab is a more developed province than Balochistan. Health and related facilities are much better in Punjab than in Balochistan [23], and the same trend is reflected in the results. Similarly, stillbirths are 2.5 times more likely in rural areas than in urban areas. This is again plausible, since cities and big towns are expected to have better health facilities. Stillbirths are 0.685 times as likely when the mother's age increases from ≤19 to 27-33 years. These results support the findings reported in [24]. With attendance by nurses and by traditional attendants, stillbirths are 0.979 and 0.901 times as likely, respectively. With an increase in antenatal care (ANC) visits up to 3, stillbirths are 0.441 times as likely. Stillbirths are 1.089 times more likely for women with pregnancy complications. With the use of iron tablets during pregnancy, stillbirths are 0.916 times as likely. If a mother is socially dependent for medical assistance, the chances of stillbirth increase by 2.53 times. With primary, secondary, and higher levels of the husband's education, stillbirths are 0.57, 0.43, and 0.31 times as likely, respectively, compared to an illiterate husband. Working women are observed to be 0.312 times as likely to have stillbirths. Compared to a pregnancy order of 1-3, women with a pregnancy order of 4-6 and ≥7 are 2.46 and 5.99 times more likely to have stillbirths. Most importantly, education is reported to reflect socioeconomic position, and the improved socioeconomic status that comes with higher education and better working status of women leads to healthier mothers and children [25].
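For readers less familiar with odds ratios, the sketch below shows how an OR and an approximate 95% confidence interval are obtained from a 2 × 2 table of cases and controls. The source does not state how the ORs and their significance in Table 1 were computed, so the log-based (Woolf) interval is an assumption, the function name is hypothetical, and the counts are made up for illustration.

```python
import math


def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds ratio with a 95% log-based (Woolf) confidence interval."""
    a, b, c, d = exposed_cases, exposed_controls, unexposed_cases, unexposed_controls
    or_hat = (a * d) / (b * c)
    # Standard error of log(OR); requires all four counts to be nonzero.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - 1.96 * se_log)
    upper = math.exp(math.log(or_hat) + 1.96 * se_log)
    return or_hat, (lower, upper)


# Hypothetical counts, not taken from Table 1:
# exposed group: 20 stillbirths, 30 live births;
# reference group: 10 stillbirths, 30 live births.
or_hat, (lower, upper) = odds_ratio(20, 30, 10, 30)
```

An OR above 1 means stillbirth is more likely in the exposed group than in the reference group, an OR below 1 means it is less likely, and an interval that contains 1 indicates that the difference is not significant at the 5% level.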
Notably, the proposed method is applicable over the categorical multicollinear data only; if data conditions vary, the performance of the proposed method may vary.

Conclusion
A comprehensive comparison of filter-based feature selection and categorical PLS loading weights in the framework of regularized elimination in PLS is conducted. Monte Carlo simulation with 100 runs indicated that Cramer V * VIP is the better choice for modeling the occurrence or nonoccurrence of stillbirths in terms of model performance, number of selected features, and interpretation. The influential features that affect the occurrence of stillbirth cover maternal social, economic, and health facilitation-related features. The proposed method is applicable to categorical multicollinear data only; if data conditions vary, the performance of the proposed method may vary. The framework can be used in related areas to explore and model health-related issues.

Data Availability
The data used to support the findings of the study are available from the corresponding author upon request.

Conflicts of Interest
The author declares no conflicts of interest.