The Sparse Group Log Ridge for the Selection of Variable Groups

In recent years, sparse models have been a research hotspot in the field of artificial intelligence. The Lasso model ignores the group structure among variables and can only select scattered individual variables, while Group Lasso can only select whole groups of variables. To address this problem, the Sparse Group Log Ridge (SGLR) model is proposed, which can select both groups of variables and variables within a group. The model is solved by the MM algorithm combined with the block coordinate descent algorithm. Finally, the advantages of the model in terms of variable selection and prediction are demonstrated through experiments on synthetic datasets.


Introduction
In fields such as artificial intelligence and machine learning, the space of explanatory variables in many classification and regression problems tends to be very high-dimensional, or even ultra-high-dimensional. High-dimensional data, however, can lead to overfitting in machine learning and thus to poor generalization performance of statistical models. Therefore, dimensionality reduction of the variable space and variable selection are two problems that must be solved. The objective of variable selection is twofold: one is to achieve accurate prediction and classification, and the other is to make the model more interpretable and to reduce the complexity of statistical models. Interpretability here refers to the simplicity of the model; clearly, the lower the dimensionality of the variable space, the better the interpretability of the statistical model. The Lasso [1], proposed by Tibshirani and based on the L1-norm penalty, can effectively reduce the dimension of the variable space but can only select scattered individual variables. In many cases, there is a certain structure among the variables. For example, in gene microarray analysis there are often multiple variants in one gene; when identifying the genetic variants associated with a disease, it is more reasonable to group the variants belonging to the same gene together. Given this, some scholars conduct variable selection by taking the structure among variables as prior information. Group Lasso [2] is a sparse model that uses the group structure of variables as prior information and can achieve the selection of variable groups. Sparse models such as SCAD [3], MCP [4], and adaptive Lasso [5] improve the variable-selection consistency of the Lasso, and the Elastic Net [6] can select groups of highly correlated variables simultaneously. Models such as the Lasso have also been applied in the field of support vector machines (SVMs) to constitute sparse SVMs [7][8][9][10][11]. The SVM has good classification performance and has been studied extensively in the past ten years; however, the standard L2-norm SVM classifier uses a ridge penalty function, so feature selection cannot be implemented. Feature selection is very important in many situations: for example, selecting important genes is a central issue in bioinformatics, and in many cases eliminating noise variables can improve the classification accuracy of the classifier. Therefore, how to achieve sparse variable selection that exploits the group structure among variables is the problem addressed in this paper.

The Sparse Group Log Ridge Model

Consider the linear regression model

$$y = X\beta + \varepsilon, \qquad (1)$$

where $\beta$ is the coefficient vector of dimension $P \times 1$, $X$ is the design matrix of order $n \times P$, $y$ is the output vector of dimension $n \times 1$, and the noise $\varepsilon$ follows a Gaussian distribution. The penalty adopted in this paper is the log penalty, which can be written as

$$p_a(t) = \frac{\log(1 + at)}{\log(1 + a)}, \qquad t \ge 0,$$

where $a > 0$ is an adjustable parameter. From the plot of the log penalty it can be seen that the log penalty tends to the L1-norm penalty as $a \to 0$. As the absolute values of the coefficients increase, the marginal log penalty gradually decreases, i.e., the degree to which the variable coefficients are compressed gradually decreases, which is similar to the principle of the SCAD and MCP penalties. This property can effectively reduce the bias of the estimated regression coefficients.
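To make the limiting behaviour concrete, the following minimal Python sketch evaluates the log penalty in the form $\log(1+at)/\log(1+a)$ assumed above (the name log_penalty is illustrative): for $a$ near 0 the values coincide with $t$ itself, i.e., the L1 penalty, while for large $a$ the penalty flattens and large coefficients are compressed less.

```python
import numpy as np

def log_penalty(t, a):
    """Log penalty p_a(t) = log(1 + a*t) / log(1 + a) for t >= 0."""
    return np.log1p(a * t) / np.log1p(a)

t = np.linspace(0.0, 5.0, 6)
for a in (1e-6, 1.0, 100.0):
    print(f"a = {a:g}:", np.round(log_penalty(t, a), 3))
# For a -> 0 the printed values approach t itself (the L1 penalty);
# for large a the penalty flattens, so large coefficients are shrunk less.
```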
First, the $P$ variables are divided into $J$ groups. Let $\beta_j$ denote the coefficient vector of the $j$-th variable group, $X_j$ the sub-design matrix of the $j$-th variable group, and $d_j$ the number of variables in the $j$-th variable group. Without loss of generality, we may assume that the data are standardized, and let $x_{jp}$ denote the observations of the $p$-th variable in the $j$-th variable group. Then the SGLR can be written in the following form:

$$\min_{\beta} \; \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda_1 \sum_{j=1}^{J} \frac{\log(1 + a\|\beta_j\|_1)}{\log(1 + a)} + \lambda_2 \|\beta\|_2^2, \qquad (2)$$

where the second term is the log penalty applied to the group-wise L1 norms, and $\lambda_1$, $\lambda_2$, and $a$ are adjustable parameters. The integration of the group structure of the penalty function with the L1-norm penalty enables the SGLR to achieve two-layer variable selection, that is, to select both the objective variable groups and the objective variables within the objective variable groups. The SGLR proposed in this paper is inspired by the hierarchical penalty function, whose penalty can be written as

$$\sum_{j=1}^{J} f_{\mathrm{out}}\Big(\sum_{p=1}^{d_j} f_{\mathrm{in}}(|\beta_{jp}|)\Big),$$

where $d_j$ is the number of variables contained in the $j$-th variable group. For SGLR, the outer-layer function $f_{\mathrm{out}}$ is the log penalty and the inner-layer function $f_{\mathrm{in}}$ is the L1-norm penalty.
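The objective can be stated compactly in code. The following is a minimal sketch assuming the reconstructed form of Eq. (2); the names sglr_objective, groups, lam1, lam2, and a are illustrative, not taken from the original paper.

```python
import numpy as np

def sglr_objective(beta, X, y, groups, lam1, lam2, a):
    """SGLR objective under the form assumed in Eq. (2): squared loss
    + log penalty on group-wise L1 norms + ridge term.
    `groups` is a list of index arrays partitioning the columns of X."""
    n = X.shape[0]
    loss = 0.5 / n * np.sum((y - X @ beta) ** 2)
    log_pen = sum(np.log1p(a * np.sum(np.abs(beta[g]))) / np.log1p(a)
                  for g in groups)
    return loss + lam1 * log_pen + lam2 * np.sum(beta ** 2)
```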

Solution Algorithm of the SGLR Model
In SGLR, a solution cannot be obtained by directly applying the block coordinate descent algorithm, because the problem has no explicit solution with respect to a single variable. In this paper, SGLR is solved by the MM (Majorize-Minimization) algorithm combined with the block coordinate descent algorithm.
Since the log penalty is a concave function on $[0, \infty)$, it lies below its first-order Taylor expansion at the current point $\beta_j^{(t)}$:

$$\frac{\log(1 + a\|\beta_j\|_1)}{\log(1+a)} \le \frac{\log(1 + a\|\beta_j^{(t)}\|_1)}{\log(1+a)} + w_j^{(t)}\big(\|\beta_j\|_1 - \|\beta_j^{(t)}\|_1\big), \qquad (3)$$

where $w_j^{(t)} = a / \big[(1 + a\|\beta_j^{(t)}\|_1)\log(1+a)\big]$ is the first-order derivative of the log penalty at the current point $\beta_j^{(t)}$. Since equality holds in (3) at the current point, the right-hand side can be used as a majorizing function of the log penalty. Dropping the constant terms, which does not affect the optimal solution of the optimization problem, the optimization problem for the $j$-th variable group can be written as follows:

$$\min_{\beta_j} \; \frac{1}{2n}\Big\|y - \sum_{k \neq j} X_k\beta_k - X_j\beta_j\Big\|_2^2 + \lambda_1 w_j^{(t)}\|\beta_j\|_1 + \lambda_2\|\beta_j\|_2^2. \qquad (4)$$

In fact, Eq. (4) is a weighted Lasso-type problem, which can be solved by means of the block coordinate descent method. Since Eq. (4) is not differentiable at 0, the sub-gradient and the subdifferential are used to derive the optimal solution; the subdifferential of $|\beta_{jp}|$ at zero is the interval $[-1, 1]$. The necessary and sufficient condition for $\hat{\beta}_{jp}$ to be a solution of Eq. (4) is derived as follows.
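Under this majorization, each group contributes a single scalar weight to the subproblem. A minimal sketch, assuming the log-penalty form above (the name mm_weight is illustrative):

```python
import numpy as np

def mm_weight(beta_j, a):
    """Derivative of the log penalty at the current group L1 norm:
    w_j = a / ((1 + a * ||beta_j||_1) * log(1 + a)).
    This is the per-group weight of the majorizing weighted-L1 term."""
    return a / ((1.0 + a * np.sum(np.abs(beta_j))) * np.log1p(a))
```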
Let $r^{(jp)} = y - \sum_{(k,q) \neq (j,p)} x_{kq}\beta_{kq}$ denote the partial residual and let $z_{jp} = \frac{1}{n}x_{jp}^{\top}r^{(jp)}$. Since the data are standardized, $\frac{1}{n}x_{jp}^{\top}x_{jp} = 1$, and setting the sub-gradient of Eq. (4) with respect to $\beta_{jp}$ to zero leads to the following equation:

$$-z_{jp} + (1 + 2\lambda_2)\beta_{jp} + \lambda_1 w_j^{(t)} s_{jp} = 0, \qquad s_{jp} \in \partial|\beta_{jp}|,$$

where $s_{jp} = \mathrm{sign}(\beta_{jp})$ for $\beta_{jp} \neq 0$ and $s_{jp} \in [-1, 1]$ for $\beta_{jp} = 0$. For $\beta_{jp} > 0$ and $s_{jp} = 1$, the following can be obtained: $\beta_{jp} = (z_{jp} - \lambda_1 w_j^{(t)})/(1 + 2\lambda_2)$, which requires $z_{jp} > \lambda_1 w_j^{(t)}$. For $\beta_{jp} < 0$ and $s_{jp} = -1$, the following can be obtained: $\beta_{jp} = (z_{jp} + \lambda_1 w_j^{(t)})/(1 + 2\lambda_2)$, which requires $z_{jp} < -\lambda_1 w_j^{(t)}$. Then it can be known that $\beta_{jp} = 0$ when $|z_{jp}| \le \lambda_1 w_j^{(t)}$. The explicit solution with respect to a single variable is therefore

$$\hat{\beta}_{jp} = \frac{\mathrm{sign}(z_{jp})\,\max\big(|z_{jp}| - \lambda_1 w_j^{(t)},\, 0\big)}{1 + 2\lambda_2}.$$
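In code, this explicit update is a soft-thresholding step followed by a ridge shrinkage. The sketch below follows the reconstructed solution above; the function names are illustrative.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def coordinate_update(z_jp, w_j, lam1, lam2):
    """Explicit single-variable solution of the MM subproblem:
    soft-threshold at lam1 * w_j, then shrink by the ridge factor."""
    return soft_threshold(z_jp, lam1 * w_j) / (1.0 + 2.0 * lam2)
```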
Once the explicit solution of SGLR with respect to a single variable is obtained, the problem can be solved by the block coordinate descent algorithm. The algorithm for solving the SGLR is given below.
(a) Input the response vector $y$, the design matrix $X$, and the initial value $\beta^{(0)}$ of the regression coefficient vector.
(b) For each variable group $j = 1, \dots, J$: compute the MM weight $w_j^{(t)}$ at the current iterate, and then update each coefficient in the group using the explicit single-variable solution above.
(c) Obtain the regression coefficient vector after traversing all the groups once, and determine whether the predefined convergence condition or the maximum number of iterations is satisfied. If not, go to step (b); otherwise, end.
(d) Output the regression coefficient vector $\hat{\beta}$.
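Putting the steps together, the following NumPy sketch implements the MM plus block coordinate descent loop for the reconstructed objective; it is a sketch under the assumptions stated above (the name sglr_fit is illustrative), not the authors' reference implementation.

```python
import numpy as np

def sglr_fit(X, y, groups, lam1, lam2, a, max_iter=100, tol=1e-6):
    """MM + block coordinate descent for the reconstructed SGLR objective.
    `groups` is a list of index arrays partitioning the columns of X.
    Assumes the columns of X are standardized so (1/n) * x_p.T @ x_p = 1."""
    n, P = X.shape
    beta = np.zeros(P)
    r = y - X @ beta                      # full residual
    for _ in range(max_iter):
        beta_old = beta.copy()
        for g in groups:
            # MM weight: derivative of the log penalty at the
            # current group-wise L1 norm
            w = a / ((1.0 + a * np.sum(np.abs(beta[g]))) * np.log1p(a))
            for p in g:
                r += X[:, p] * beta[p]    # remove variable p from residual
                z = X[:, p] @ r / n       # partial correlation
                beta[p] = (np.sign(z) * max(abs(z) - lam1 * w, 0.0)
                           / (1.0 + 2.0 * lam2))
                r -= X[:, p] * beta[p]    # put variable p back
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

One cycle over all groups corresponds to step (b) of the algorithm; the convergence check on the coefficient change corresponds to step (c).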

Experiments with Synthetic Datasets
This section generates three different types of datasets: synthetic dataset 1, synthetic dataset 2, and synthetic dataset 3. Synthetic dataset 1 is sparse at both the variable-group level and the within-group variable level; synthetic dataset 2 is sparse only at the variable-group level; synthetic dataset 3 is sparse at both levels and extremely sparse at the within-group level. All three datasets are generated from the linear regression model $y = X\beta + \varepsilon$ of Eq. (1). For each of the three synthetic datasets, the samples are randomly divided into two sub-datasets of equal size, one used as the training dataset and the other as the test dataset. This division is repeated 30 times to obtain 30 experimental results, and the mean of the 30 results is taken as the final result. The experimental results are shown in Table 1, Table 2, and Table 3, where n denotes the number of training samples, P the total number of variables, Size the total number of selected variables, Rel the number of identified objective variables, Noi the number of eliminated redundant variables, and MSE the predicted mean squared error. The experimental results on the three synthetic datasets show that SGLR performs well in terms of both prediction accuracy and variable selection.
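As an illustration of this evaluation protocol, the sketch below generates one toy dataset, repeats the random equal split 30 times, and averages the test MSE. The group sizes, coefficients, and tuning parameters are illustrative stand-ins, not the paper's exact settings, and the sketch reuses the sglr_fit function sketched in the previous section.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: 30 groups of 10 variables, a few non-zero signals.
n, J, d = 300, 30, 10
P = J * d
beta_true = np.zeros(P)
beta_true[:3] = 1.0                      # sparse signal in the first group
X = rng.standard_normal((n, P))
y = X @ beta_true + rng.standard_normal(n)
groups = [np.arange(j * d, (j + 1) * d) for j in range(J)]

mses = []
for _ in range(30):                      # 30 random equal splits
    idx = rng.permutation(n)
    tr, te = idx[: n // 2], idx[n // 2:]
    b = sglr_fit(X[tr], y[tr], groups, lam1=0.1, lam2=0.01, a=10.0)
    mses.append(np.mean((y[te] - X[te] @ b) ** 2))
print("mean test MSE over 30 splits:", np.mean(mses))
```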
To verify SGLR's capability for two-layer variable selection, the first of the 30 experiments conducted on synthetic dataset 1 is selected, and the non-zero regression coefficients fitted in that experiment are reported in Tables 4 and 5. Comparing the experimental results of Lasso and Group Lasso shows that Lasso only achieves scattered variable selection, while Group Lasso only achieves the selection of variable groups, where the coefficients of the variables belonging to the same group are either all zero or all non-zero.
SGLR's capability for within-group variable selection can be seen by comparing its experimental results with those of Group Lasso. Group Lasso only selects whole variable groups, so the coefficients of the variables in the same group are either all zero or all non-zero, whereas SGLR performs within-group variable selection, so individual coefficients within a selected group can be zero or non-zero.
SGLR's capability for variable-group selection can be seen by comparing its experimental results with those of Lasso. The true regression coefficient of V40 is 0.3, so V40 is an objective variable expected to be selected, and Lasso did select it. However, the regression coefficient of V40 estimated by SGLR was 0, i.e., V40 was not selected by SGLR. The reason is that in SGLR, V40 belongs to the same variable group as V31-V39, nine non-significant variables whose true regression coefficients are zero. The large number of non-significant variables makes the whole group non-significant; V40 is affected by the nine non-significant variables in the same group, and as a result it is not selected. In other words, the 10 variables in the same group are not selected at the same time in SGLR, i.e., group sparsity and the selection of variable groups are achieved. Therefore, the experimental results of Lasso, Group Lasso, and SGLR show that SGLR achieves the selection of both variable groups and within-group variables, that is, SGLR has the two-layer variable selection capability.