Selective regularization of a linear regression model

This paper discusses the construction of a linear regression model with regularization of the system matrix of the normal equations. In contrast to conventional ridge regression, where a positive parameter is added to all diagonal terms of the matrix, in the proposed method only those diagonal entries that correspond to highly correlated data are increased. This leads to a decrease in the condition number of the matrix and, therefore, to a decrease in the corresponding coefficients of the regression equation. The selection of the entries to be increased is based on the triangular decomposition of the correlation matrix of the original dataset. The effectiveness of the method is tested on a well-known dataset, and it is compared not only with ridge regression, but also with the results of applying the widespread LARS and Lasso algorithms.


Introduction
To determine the coefficients of a linear regression model, the least squares (LS) method is used, which reduces to solving a system called the normal equations with a positive definite symmetric matrix:

$$W^t W b = W^t y,$$

where $W$ is the matrix of observed predictor values, $y$ is the vector of observed responses, $b$ is the vector of regression coefficients, and $t$ is the transpose symbol. After determining $b$ and returning to the observed values, we obtain a linear LS model. If, after centering, the observed variables are also normalized, then $W^t W$ becomes the matrix of pairwise correlations of the predictors. The diagonal entries of this matrix are equal to 1, and all values lie between -1 and 1. The values of the off-diagonal terms can be used to judge whether the predictors are mutually correlated (collinear). The greater the correlation between the $i$-th and $j$-th predictors, the closer (in absolute value) the off-diagonal entry with index $ij$ is to 1, and the larger are the condition number of the matrix and the variance of the response values. If such an entry is equal to 1, then the corresponding predictors are linearly dependent and the matrix is singular. The correlation matrix is often used to determine the values of the vector $b$.
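As an illustration of this setup, the following minimal Python sketch (not the authors' code; the toy observation matrix is an assumption for demonstration) builds the normal equations from a centered and normalized observation matrix and checks that the system matrix is a correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
W0 = rng.normal(size=(13, 4))              # toy stand-in for an observation matrix
y = rng.normal(size=13)
y = y - y.mean()                           # center the response as well

Wc = W0 - W0.mean(axis=0)                  # center the predictor columns
W = Wc / np.linalg.norm(Wc, axis=0)        # normalize the predictor columns
A = W.T @ W                                # matrix of pairwise correlations
b = np.linalg.solve(A, W.T @ y)            # LS coefficients from the normal equations

print(np.allclose(np.diag(A), 1.0))        # diagonal entries equal 1
print(np.linalg.cond(A))                   # grows as the predictors become collinear
```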
Algorithms for the analysis and construction of regression models such as Ridge, LASSO, and LARS are part of the standard machine learning toolkit and are included in Python, R, and Matlab libraries.
In the Ridge model [2], the diagonal matrix $\lambda I$ is added to the matrix $W^t W$:

$$(W^t W + \lambda I)\, b = W^t y,$$

where $\lambda > 0$ is a positive number called the regularization parameter.
Adding a parameter to all diagonal terms of a positive definite symmetric matrix reduces its condition number. The term $\lambda \|b\|^2$ is called the penalty imposed on the coefficients of the model. With an increase in $\lambda$, the variance decreases, but the bias of the regression coefficients $b(\lambda)$ from the LS values increases. The Ridge coefficients can be viewed as linear combinations of the LS coefficients. LASSO [3] consists in introducing a constraint on the norm of the vector of model coefficients, which drives some of them to 0. The LARS (Least Angle Regression) algorithm [4] can reduce the size of the model by analyzing the contribution of predictors to the response, similarly to the stepwise regression method, which in [2] is considered the best way to build a linear model.

Further, a method for reducing the condition number of the matrix is considered, which differs from the Ridge model in that the increase is applied only to those diagonal terms of the matrix that cause significant collinearity of the model, and the added parameters may differ between terms. This method of regularization will be called selective regularization (SR).
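For reference, the three models just described are available in scikit-learn; the sketch below fits each of them on a deliberately collinear toy dataset (the data and parameter values are assumptions for illustration, not the paper's settings):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, Lars

rng = np.random.default_rng(1)
X = rng.normal(size=(13, 4))
X[:, 3] = X[:, 1] + 0.01 * rng.normal(size=13)     # make predictors 2 and 4 collinear
y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + 0.1 * rng.normal(size=13)

for model in (Ridge(alpha=0.013), Lasso(alpha=0.01), Lars(n_nonzero_coefs=2)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))
```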

Selecting and increasing the diagonal entries of the matrix
We denote the matrix $W^t W$ by $A$ and write down the result of its triangular decomposition by the Cholesky method, which is one of the popular methods for solving systems of linear algebraic equations with a symmetric matrix:

$$A = L^t L,$$

where $L$ is the upper triangular matrix. The condition number of the matrix $A$ can be reduced if the condition number of its factor is reduced, since

$$\mathrm{cond}(A) \le \mathrm{cond}(L^t)\,\mathrm{cond}(L) = \mathrm{cond}(L)^2.$$

It was shown in [5] that the lower bound for the condition number of a non-singular triangular matrix has the following form:

$$\mathrm{cond}(L) \ge \frac{\max_i |l_{ii}|}{\min_i |l_{ii}|}.$$
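A small numerical check of these two facts, on an assumed correlation-like matrix, can be written as follows (numpy returns the lower triangular Cholesky factor, whose transpose is the upper triangular $L$ of the text):

```python
import numpy as np

A = np.array([[1.0, 0.9, 0.7],
              [0.9, 1.0, 0.8],
              [0.7, 0.8, 1.0]])            # an assumed positive definite correlation matrix

L = np.linalg.cholesky(A).T                # upper triangular factor, A = L^t L
print(np.isclose(np.linalg.cond(A, 2), np.linalg.cond(L, 2) ** 2))

d = np.abs(np.diag(L))
print(np.linalg.cond(L, 2) >= d.max() / d.min())   # the lower bound from [5]
```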
To obtain the coefficients, two systems with triangular matrices are solved successively:

$$L^t z = W^t y, \qquad L b = z.$$

It was shown in [7] that an increase in the $i$-th diagonal entry of a triangular matrix in the process of the triangular decomposition is equivalent to decomposing the matrix $A$ with an increased $a_{ii}$, without changing the other diagonal entries. It follows that if several diagonal entries of the triangular matrix were increased in the process of factorization, then such a decomposition is an exact decomposition of the following complete matrix:

$$A + S = L^t L, \qquad (3)$$

where $S$ is a diagonal matrix whose nonzero entries are located on the main diagonal and whose indices coincide with the indices of the diagonal entries of the triangular matrix that were increased during the decomposition. The converse is also true: it follows from equation (3) that increasing the selected entries $a_{ii}$ by the entries of $S$ before the factorization yields the same triangular factor $L$.
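The equivalence stated in equation (3) is easy to verify numerically. The sketch below is a minimal illustration, not the paper's implementation: the matrix, the choice of which entry to shift, and the shift size are assumptions.

```python
import numpy as np

def cholesky_with_shifts(A, shifts):
    """Upper triangular L such that A + diag(shifts) = L^t L (outer-product form)."""
    n = A.shape[0]
    R = A.copy() + np.diag(shifts)         # shift selected diagonal entries up front
    L = np.zeros_like(R)
    for i in range(n):
        L[i, i] = np.sqrt(R[i, i])
        L[i, i + 1:] = R[i, i + 1:] / L[i, i]
        R[i + 1:, i + 1:] -= np.outer(L[i, i + 1:], L[i, i + 1:])
    return L

A = np.array([[1.0, 0.9, 0.7],
              [0.9, 1.0, 0.8],
              [0.7, 0.8, 1.0]])
s = np.array([0.0, 0.05, 0.0])             # increase only the second diagonal entry

L = cholesky_with_shifts(A, s)
print(np.allclose(L.T @ L, A + np.diag(s)))                # exact decomposition of A + S
print(np.linalg.cond(A + np.diag(s)) < np.linalg.cond(A))  # condition number drops
```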

The computational experiments
To analyze the effectiveness of the proposed method, we used Hald's data, analyzed in detail in [2]. In this dataset, the number of observations is 13 and the number of predictors is 4. The correlation matrix of the predictors was computed first (all calculations and graphing are done in Python). When performing the calculations, it should be borne in mind that an increase in $l_{ii}$ caused by increasing $a_{ii}$ makes the off-diagonal terms $l_{ij}$ decrease and the subsequent diagonal entries $l_{jj}$ ($j > i$) increase. Therefore, it is advisable to rearrange the predictor columns so that the diagonal entries of the corresponding triangular matrix do not increase; in our case, this condition is fulfilled.

Table 1 shows the characteristics of the SR algorithm in comparison with Ridge regression as the parameter $\lambda$ and the corresponding diagonal entries of the matrix increase. The optimal value of $\lambda$ chosen in [2] for the Hald data is 0.013; in Table 1, this value corresponds to point 3. The values of the regression coefficients for Ridge regression and SR are summarized in Table 2. It follows from the last column that under SR the coefficient of the fourth predictor decreases, due to the fact that the fourth predictor is the most correlated one. Figure 2 shows the dependence of the SSE values on the decrease in the condition number of the matrix: the SSE values for Ridge regression increase sharply as the regularization parameter grows, while the SSE for SR, in which the coefficients of the most collinear predictors were taken equal to 0, is less than for LARS at point 10.
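The starting point of the experiment can be sketched as follows; the Hald cement data are reproduced here as commonly tabulated (e.g. in [2]), and the script computes only the correlation matrix and its condition number, not the full Table 1:

```python
import numpy as np

# Hald cement data: 13 observations, 4 predictors, heat of hardening as response.
X = np.array([
    [ 7, 26,  6, 60], [ 1, 29, 15, 52], [11, 56,  8, 20], [11, 31,  8, 47],
    [ 7, 52,  6, 33], [11, 55,  9, 22], [ 3, 71, 17,  6], [ 1, 31, 22, 44],
    [ 2, 54, 18, 22], [21, 47,  4, 26], [ 1, 40, 23, 34], [11, 66,  9, 12],
    [10, 68,  8, 12]], dtype=float)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])

Xc = X - X.mean(axis=0)
W = Xc / np.linalg.norm(Xc, axis=0)            # centered and normalized predictors
A = W.T @ W                                    # correlation matrix of the predictors
print(np.round(A, 3))                          # note the strong correlation involving x4
print("cond(A) =", np.linalg.cond(A))
```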

Conclusion
The results obtained are as follows:
• it is shown that an increase in several diagonal terms of the correlation matrix, or of the matrix of normal equations, increases the stability of the regression equation in the same way as increasing all diagonal entries does in Ridge regression;
• unlike Ridge, the increase in selected diagonal entries leads to a decrease in the values of the regression coefficients corresponding to collinear predictors;
• the efficiency of the method in terms of reducing the size of the model is comparable to the well-known LARS and Lasso algorithms; at the same time, the method is much simpler and, on the presented experimental data, shows a better result in terms of the standard deviation.