Mitigating collinearity in linear regression models using ridge, surrogate and raised estimators

Collinearity in the design matrix is a frequent problem in linear regression models, for example, with economic or medical data. Previous standard procedures to mitigate the effects of collinearity included ridge regression and surrogate regression. Ridge regression perturbs the moment matrix $X'X \to X'X + kI_p$, while surrogate regression perturbs the design matrix $X \to X_S$. More recently, the raise estimators have been introduced, which allow the user to track geometrically the perturbation in the data with $X \to \tilde{X}$. The raise estimators are used to reduce collinearity in linear regression models by raising a column in the experimental data matrix, which may be nearly linear with the other columns, while keeping the basic OLS regression model. We give a brief overview of these three ridge-type estimators and discuss practical ways of choosing the required perturbation parameters for each procedure.

Subjects: Mathematical Statistics; Mathematics & Statistics; Science; Statistical Computing; Statistics; Statistics & Probability


PUBLIC INTEREST STATEMENT
Collinearity is a frequent problem in the statistical analysis of data, for example, with ordinary least squares linear regression models of economic or medical data. Standard procedures to mitigate the effects of collinearity include ridge regression and surrogate regression. Ridge regression is based on a standard numerical technique that is used in computing an inverse of a nearly singular matrix. Surrogate regression is based on perturbing the data in a way that allows for more accurate numerical solutions. More recently, the raise estimators have been introduced. This technique also perturbs the data, but it allows the researcher to track the changes in the data while retaining the basic ordinary least squares regression model. We give a brief overview of these three ridge-type estimators and discuss practical ways of choosing the required perturbation parameters for each procedure. Our case study indicates an advantage for using the raise estimators.

We consider the linear model $y = X\beta + \varepsilon$, where the design matrix $X$ is $n \times p$ and the response vector $y$ is $n \times 1$, consisting of the observed data. The ordinary least squares (OLS) estimators $\hat\beta_L$ are solutions of

$$X'X\beta = X'y \qquad (1)$$

given by

$$\hat\beta_L = (X'X)^{-1}X'y. \qquad (2)$$

The solutions $\hat\beta_L$ are unbiased with variance matrix $V(\hat\beta_L) = \sigma^2 (X'X)^{-1}$. For convenience, we take $\sigma^2 = 1$. The OLS solutions require that $(X'X)^{-1}$ be accurately computed.

Ridge and surrogate estimators
With economic or medical data, the predictor variables in the columns of $X$ may have a high level of collinearity; that is, there may be a nearly linear relationship among the predictor variables. In this case, $X'X$ in Equation (1) is nearly singular and thus $(X'X)^{-1}$ will be numerically difficult to evaluate.
It was observed by Riley (1955) that the perturbed matrix $X'X + kI_p$ with $k > 0$ is better conditioned than the matrix $X'X$, and he suggested using the perturbed matrix in Equation (1). With $k > 0$ large enough, $(X'X + kI_p)^{-1}$ can be accurately computed with standard numerical procedures. Using $X'X \to X'X + kI_p$, Hoerl (1964) dubbed this procedure ridge regression, with ridge estimators

$$\hat\beta_R(k) = (X'X + kI_p)^{-1}X'y. \qquad (3)$$

Near dependency among the columns of $X$ causes ill-conditioning in $X'X$, which results in OLS solutions with inflated squared lengths $\|\hat\beta_L\|^2$, with $\hat\beta_L$ of questionable signs (±) and with $\hat\beta_L$ being "very sensitive to small changes in $X$" (Belsley, 1986). With ill-conditioning in $X'X$, the OLS solutions at $k = 0$ in Equation (3) are known to be unstable, with a slight movement away from $k = 0$ giving completely different estimates of the coefficients $\beta$.
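Riley's observation can be illustrated numerically. The following is a minimal sketch, assuming synthetic nearly collinear data; the helper name `ridge_estimator`, the seed, and the value $k = 0.1$ are illustrative choices and are not taken from the cited references.

```python
import numpy as np

def ridge_estimator(X, y, k):
    """Solve (X'X + k I_p) beta = X'y; k = 0 gives the OLS solution."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Synthetic data (assumption): two nearly collinear predictors.
rng = np.random.default_rng(0)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)   # correlation form: unit-length columns
y = rng.normal(size=n)
y = y - y.mean()

k = 0.1                              # illustrative ridge parameter
M = X.T @ X
cond_ols = np.linalg.cond(M)                     # ratio of extreme eigenvalues
cond_ridge = np.linalg.cond(M + k * np.eye(2))   # better conditioned for k > 0
beta_ols = ridge_estimator(X, y, 0.0)
beta_ridge = ridge_estimator(X, y, k)

print("condition numbers:", cond_ols, cond_ridge)
print("squared lengths:", beta_ols @ beta_ols, beta_ridge @ beta_ridge)
```

With $X$ held fixed, the squared length of the ridge solution does not exceed that of the OLS solution, which is the shrinkage effect described above.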
In The International Encyclopedia of Statistical Science, Hadi (2011) discusses two standard remedies for addressing collinearity in linear regression; namely (1) the ridge system $\{(X'X + kI_p)\beta = X'y;\ k \ge 0\}$ (Hoerl & Kennard, 1970) with solutions $\{\hat\beta_R(k);\ k \ge 0\}$ and (2) the surrogate system $\{(X'X + kI_p)\beta = (X_k'X_k)\beta = X_k'y;\ k \ge 0\}$ (Jensen & Ramirez, 2008) with solutions $\{\hat\beta_S(k);\ k \ge 0\}$. The ridge estimators come from modifying $X'X \to X'X + kI_p$ on the left side of Equation (1), while the Jensen and Ramirez surrogate estimators modify the design matrix $X \to X_k$ on both sides of Equation (1).

In matrix notation, ridge regression comes from perturbing the eigenvalues of $X'X$ as $\lambda_i \to \lambda_i + k$, while surrogate regression comes from perturbing the singular values of $X$ as $\xi_i \to \sqrt{\xi_i^2 + k}$. From the singular value decomposition $X = P D(\xi_i) Q'$, the surrogate design is

$$X_k = P\, D\!\left(\sqrt{\xi_i^2 + k}\right) Q',$$

with $D$ a diagonal matrix of dimension $n \times p$, the columns of $P$ the left-singular vectors and the columns of $Q$ the right-singular vectors. The surrogate transformation $X \to X_k$ preserves the ridge moments, with $X_k'X_k = X'X + kI_p$, allowing for comparison between the two methods.

Ridge regression has a long history of use in the statistical literature. The earliest detailed expositions of ridge estimators are found in Marquardt (1963) and Hoerl and Kennard (1970), with Marquardt (1963) acknowledging that Levenberg (1944) had observed that a perturbation of the diagonal improved convergence in steepest descent algorithms. The history of the early use of matrix diagonal increments in statistical problems is given in the article by Piegorsch and Casella (1989).
To alleviate the problems inherent with a small singular value, say $\xi_p$, which indicates collinearity in $X$, the surrogate transformation converts $\xi_p \to \sqrt{\xi_p^2 + k}$, moving the singular value away from zero.
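The surrogate construction can be sketched in a few lines. This is a minimal illustration on synthetic data, using the thin singular value decomposition (which yields the same $X_k$ as the full $n \times p$ diagonal form); the function name `surrogate_design` and the symbol names are assumptions for illustration.

```python
import numpy as np

def surrogate_design(X, k):
    """Return X_k = P D(sqrt(xi_i^2 + k)) Q' from the thin SVD X = P D(xi_i) Q'."""
    P, xi, Qt = np.linalg.svd(X, full_matrices=False)
    return P @ np.diag(np.sqrt(xi**2 + k)) @ Qt

# Synthetic collinear design (assumption).
rng = np.random.default_rng(1)
n, p = 50, 3
Z = rng.normal(size=(n, p))
Z[:, 2] = Z[:, 0] + Z[:, 1] + 0.01 * rng.normal(size=n)
X = Z - Z.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
y = rng.normal(size=n)
y = y - y.mean()

k = 0.05
X_k = surrogate_design(X, k)

# The surrogate design reproduces the ridge moment matrix X'X + k I_p.
moment_gap = np.max(np.abs(X_k.T @ X_k - (X.T @ X + k * np.eye(p))))

# Same left-hand side, different right-hand sides: X'y (ridge) vs X_k'y (surrogate).
beta_ridge = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
beta_surrogate = np.linalg.solve(X_k.T @ X_k, X_k.T @ y)

smallest_sv = np.linalg.svd(X_k, compute_uv=False).min()
print(moment_gap, smallest_sv, np.sqrt(k))
```

Every singular value of $X_k$ is at least $\sqrt{k}$, which is the sense in which the transformation moves small singular values away from zero.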
Principal Component Regression (PCR) does the opposite: it replaces $\xi_p$ with 0 and regresses $y = P D(\xi_1, \ldots, \xi_{p-1}, 0)\theta + \varepsilon$ with $\theta = Q'\beta$. Hadi and Ling (1998) have noted "that it is possible for the PCR to fail miserably." Their example is constructed with the response variable being highly correlated with the deleted eigenvector associated with the deleted singular value. This deletion results in the remaining explanatory variables being unable to provide a good fit for the response variable.
Since ridge regression is based on a numerical analysis technique, the ridge estimators may lack desirable statistical properties. Three such desirable statistical properties follow.
(1) The condition number for a square $p \times p$ matrix $A$ is a measure of the ill-conditioning in $A$ and is defined as the ratio of the largest to the smallest eigenvalue, denoted $\kappa(A) = \lambda_1/\lambda_p$. Since perturbation procedures are designed to improve the regression model, one would expect that $\kappa(V(\hat\beta_R(k))) \to 1$ as $k \to \infty$. However, as shown in Jensen and Ramirez (2010a), $\kappa(V(\hat\beta_R(k))) \to \kappa(V(\hat\beta_R(0)))$ as $k \to \infty$. Initially, as $k$ increases, the ill-conditioning in the variance matrix starts to get better but then returns to the original (bad) value. The surrogate system, by contrast, does have the desirable monotone property that $\kappa(V(\hat\beta_S(k))) \to 1$ as $k \to \infty$. This assures the user of surrogate estimators that, regardless of the chosen value for $k$, the variance matrix for the surrogate estimators will be more "orthogonal" than the original OLS variance matrix.
(2) An "ideal" predictor variable in column $j$ would be orthogonal to the other predictor variables in $X$, with $X'X$ being zero for all off-diagonal values in the $j$th row and $j$th column. In this "ideal" case, the "ideal" variance for $\hat\beta_{L,j}$ would be $1/(X'X)_{j,j}$. The variance inflation factors (VIF) are the ratios of the actual variances to the "ideal" variances had the columns of $X$ been orthogonal, with $\mathrm{VIF}(\hat\beta_{L,j}) = 1$ for the ideal orthogonal case. Marquardt and Snee (1975) have identified VIF as "the best single measure of the conditioning of the data." Again, since perturbation procedures are designed to improve the regression model, one would expect that $\mathrm{VIF}(V(\hat\beta_{R,j}(k))) \to 1$ as $k \to \infty$. Jensen and Ramirez (2010a) also showed that $\mathrm{VIF}(V(\hat\beta_{R,j}(k))) \to \mathrm{VIF}(V(\hat\beta_{R,j}(0)))$ for the ridge estimators but that $\mathrm{VIF}(V(\hat\beta_{S,j}(k))) \to 1$ as $k \to \infty$ for the surrogate estimators, resulting in less collinearity between the surrogate estimators than exists between the OLS estimators.
(3) ... indicating that the ridge model should not be used. The Hoerl and Kennard (1970) result assures that for some positive value of $k$, the ridge model is an improved model. Jensen and Ramirez (2010a) have shown that for any $k \in (0, \infty)$ the corresponding result holds for surrogate estimators. A further improvement with surrogate estimators is given by $\mathrm{MSE}(\hat\beta_S(k)) \le \mathrm{MSE}(\hat\beta_R(k))$; that is, for any value of $k$, the surrogate estimators have predicted values closer to the original data than the ridge estimators.

As the ridge and surrogate estimators are not equivariant under scaling, the common convention is to scale $X'X$ to correlation form, with the explanatory variables centered and scaled to unit length.
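The contrasting behaviour of the two condition numbers can be checked numerically. The sketch below assumes $\sigma^2 = 1$ and uses the variance matrices $(X'X + kI_p)^{-1} X'X (X'X + kI_p)^{-1}$ for the ridge estimators and $(X'X + kI_p)^{-1}$ for the surrogate estimators, both of which follow from applying each linear estimator to a response with identity covariance; the synthetic data and the grid of $k$ values are assumptions for illustration.

```python
import numpy as np

# Synthetic two-column design in correlation form (assumption).
rng = np.random.default_rng(2)
n = 60
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.2 * rng.normal(size=n)])
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
M = X.T @ X
I = np.eye(2)

def kappa_ridge(k):
    """Condition number of V(ridge) = (M + kI)^-1 M (M + kI)^-1."""
    A = np.linalg.inv(M + k * I)
    return np.linalg.cond(A @ M @ A)

def kappa_surrogate(k):
    """Condition number of V(surrogate) = (M + kI)^-1."""
    return np.linalg.cond(np.linalg.inv(M + k * I))

grid = [0.0, 0.01, 0.1, 1.0, 10.0, 1e3, 1e6]
kr = [kappa_ridge(k) for k in grid]
ks = [kappa_surrogate(k) for k in grid]

for k, a, b in zip(grid, kr, ks):
    print(f"k = {k:>9g}   ridge kappa = {a:10.4f}   surrogate kappa = {b:10.4f}")
```

On such data the ridge column first improves and then drifts back toward its value at $k = 0$, while the surrogate column decreases steadily toward one.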
Remark 1 Scaling $X'X$ to correlation form can lead to some anomalies, as noted in Jensen and Ramirez (2008). For example, the map $k \to \|\hat\beta_R(k)\|^2$ is known to be monotonically decreasing with $X$ centered but unscaled. Using Proc Reg in SAS with the Ridge option, this monotone property can be lost, as (1) the original $X'X$ moment matrix is scaled into correlation form, (2) the ridge estimators are computed using the correlation form for $X'X$ and (3) the ridge solutions are mapped back into the original scale. This scaling-rescaling can cause $k \to \|\hat\beta_R(k)\|^2$ to lose its monotonicity, as in the example in Jensen and Ramirez (2008).
Remark 2 Let $X$ be mean-centered. Let $D^2$ be the diagonal matrix with entries $1/(X'X)_{j,j}$, $1 \le j \le p$; then the scaling $X \to XD$ has $(XD)'(XD)$ in correlation form, that is, with diagonal entries all having value one. This is the scaling we have used. Sardy (2008) has suggested a covariance-based scaling using the diagonal matrix $D_\Sigma^2$ with entries $[(X'X)^{-1}]_{j,j}$, $1 \le j \le p$. We note that in this case $(XD_\Sigma)'(XD_\Sigma)$ has diagonal entries which are the variance inflation factors $\mathrm{VIF}(\hat\beta_j)$. The variance inflation factors are the ratios of the variances of $\hat\beta_j$ to the "ideal" variances of $\hat\beta_j$ assuming the explanatory variables are orthogonal; that is,

$$\mathrm{VIF}(\hat\beta_j) = \frac{[(X'X)^{-1}]_{j,j}}{1/(X'X)_{j,j}} = (X'X)_{j,j}\,[(X'X)^{-1}]_{j,j}.$$

Remark 3 When the regression model retains the parameter $\beta_0$ for the constant term, with the design matrix containing a unit constant column, the user needs to be careful with defining $\mathrm{VIF}(\hat\beta_j)$ when the data have not been mean-centered. In short, $\mathrm{VIF}(\hat\beta_j)$ is based on comparing the $(j, j)$ entry of the variance matrix to the corresponding entry of an "ideal" covariance matrix. The inverse of the "ideal" covariance matrix is denoted as the "ideal" moment matrix. The "ideal" $\hat\beta_j$ is uncorrelated with the other estimators $\hat\beta_i$, $0 < i \ne j$. Thus, the constraints on the "ideal" covariance matrix are that (1) the off-diagonal $(i, j)$ and $(j, i)$ entries for $\mathrm{cov}(\hat\beta_i, \hat\beta_j)$ are zero where $0 < i \ne j$. Note that the "ideal" covariance matrix is not a diagonal matrix, as the entries relating to $\hat\beta_0$ in the first row and column are retained because the data have not been centered. Additionally, the constraints on the "ideal" moment matrix are that (2) the entries in the first row and first column are the first-order moments determined from the data and (3) the entries down the diagonal $(j, j)$ with $j \ge 0$ are the second-order moments determined from the data. Jensen and Ramirez (2013) have given an easy-to-compute algorithm for the "ideal" covariance matrix that satisfies constraints (1), (2) and (3).
The variance inflation factors, which are the standard measure for collinearity, have a geometric interpretation which allows them to be conveniently computed as a ratio of determinants. We assume that the variables are centered. Reorder $X = [X_{[p]}, X_{(p)}]$ with $X_{(p)} = x_p$ the $p$th column and $X_{[p]}$ the design matrix without the $p$th column, dubbed the resting columns. Garcia, Garcia and Soto (2011) introduced the metric number to measure the effect of adding the last column $X_{(p)}$ to the resting columns $X_{[p]}$. An ideal $p$th column would be orthogonal to the other columns, with the off-diagonal elements of the $p$th row and $p$th column of $X'X$ all zeros, giving the idealized $X'X$ moment matrix

$$M_p = \begin{bmatrix} X_{[p]}'X_{[p]} & 0 \\ 0' & x_p'x_p \end{bmatrix}.$$

The metric number is defined by

$$\mathrm{MN}(x_p) = \left(\frac{\det(X'X)}{\det(M_p)}\right)^{1/2}$$

and it measures the effect of enlarging the design matrix by adding the $p$th explanatory column. The metric number is easy to compute and is functionally equivalent to the VIF statistics, with $\mathrm{VIF}(\hat\beta_p) = \det(M_p)/\det(X'X) = 1/\mathrm{MN}(x_p)^2$; see, for example, O'Driscoll and Ramirez (2015).
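The determinant form of the variance inflation factors can be verified against the product of diagonal entries of the moment matrix and its inverse. The following is a small sketch for centered synthetic data; the function names are illustrative, and the square-root form used in `metric_number` is an assumption chosen to agree with the stated functional equivalence to VIF.

```python
import numpy as np

def vif_diag(X):
    """VIF_j as the product of the (j, j) entries of X'X and of its inverse."""
    M = X.T @ X
    return np.diag(M) * np.diag(np.linalg.inv(M))

def ideal_moment(X, j):
    """X'X with the off-diagonal entries of row j and column j set to zero."""
    M = X.T @ X
    M_ideal = M.copy()
    M_ideal[j, :] = 0.0
    M_ideal[:, j] = 0.0
    M_ideal[j, j] = M[j, j]
    return M_ideal

def vif_det(X, j):
    """VIF_j as a ratio of determinants: det(idealized moment matrix) / det(X'X)."""
    return np.linalg.det(ideal_moment(X, j)) / np.linalg.det(X.T @ X)

def metric_number(X, j):
    """Assumed form: square root of det(X'X) / det(idealized moment matrix)."""
    return np.sqrt(np.linalg.det(X.T @ X) / np.linalg.det(ideal_moment(X, j)))

# Synthetic centered data with one nearly dependent column (assumption).
rng = np.random.default_rng(3)
n = 40
Z = rng.normal(size=(n, 3))
Z[:, 2] = Z[:, 0] + Z[:, 1] + 0.1 * rng.normal(size=n)
X = Z - Z.mean(axis=0)

by_diag = vif_diag(X)
by_det = np.array([vif_det(X, j) for j in range(3)])
mn = np.array([metric_number(X, j) for j in range(3)])
print(by_diag, by_det, 1.0 / mn**2)
```

The three printed vectors agree up to rounding, since $\det(X'X)$ factors into the determinant of the resting-column moment matrix times the residual sum of squares of the remaining column.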
In spite of the established usage of ridge regression, it is now known that the surrogate estimators have superior statistical properties over the ridge estimators. Indeed, for their statistical analysis, Woods et al. (2012) used the Jensen-Ramirez surrogate estimates for modelling of diabetes in stock rats.
A crucial question for both the ridge estimators and the surrogate estimators is: What value of $k$ should be used? McDonald (2009, 2010) has suggested that $k$ can be determined by controlling the correlation between the observed values and the predicted values from ridge regression. We extend this methodology to surrogate regression and compare the two procedures. McDonald (2009, 2010) showed that the square of the correlation coefficient $R^2(\hat y_R(k))$ between the observed values $y$ and the ridge predicted values $\hat y_R(k) = X\hat\beta_R(k)$ is a monotone decreasing function of the ridge parameter $k$. The corresponding result holds for surrogate regression: the square of the correlation coefficient $R^2(\hat y_S(k))$ between the observed values $y$ and the surrogate predicted values $\hat y_S(k) = X\hat\beta_S(k)$ is a monotone decreasing function of the surrogate parameter $k$, as shown in Garcia and Ramirez (in press). This allows the user to determine a unique value for $k$ by controlling the decrease in correlation between the observed and predicted values. The user can set a lower bound for the reduction in $R^2(\hat y_R(k))$ and $R^2(\hat y_S(k))$ and numerically compute the associated ridge and surrogate parameters. For example, to preserve 95% of the OLS squared correlation, we solve $R^2(k) = 0.95\,R^2(0)$.
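Because the squared correlation is monotone decreasing in $k$, the equation $R^2(k) = 0.95\,R^2(0)$ can be solved by bisection. The sketch below is one possible implementation under stated assumptions: the data are synthetic and built from orthonormal directions, the surrogate predicted values are taken as $X\hat\beta_S(k)$, and the function names, bracketing rule and tolerance are illustrative choices.

```python
import numpy as np

def r2(y, yhat):
    """Squared correlation between centered vectors y and yhat."""
    return (y @ yhat) ** 2 / ((y @ y) * (yhat @ yhat))

def fitted(X, y, k, method):
    """Predicted values X beta(k) for the ridge or the surrogate system."""
    p = X.shape[1]
    if method == "ridge":
        beta = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
    else:  # surrogate: perturb the singular values of X
        P, xi, Qt = np.linalg.svd(X, full_matrices=False)
        X_k = P @ np.diag(np.sqrt(xi**2 + k)) @ Qt
        beta = np.linalg.solve(X_k.T @ X_k, X_k.T @ y)
    return X @ beta

def solve_k(X, y, method, retain=0.95, tol=1e-12):
    """Bisection for R^2(k) = retain * R^2(0), using monotonicity in k."""
    target = retain * r2(y, fitted(X, y, 0.0, method))
    lo, hi = 0.0, 1e-3
    while r2(y, fitted(X, y, hi, method)) > target:   # expand the bracket
        hi *= 2.0
        if hi > 1e12:
            raise ValueError("target squared correlation is not attainable")
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if r2(y, fitted(X, y, mid, method)) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Synthetic centered data with correlation 0.98 between the columns (assumption).
rng = np.random.default_rng(4)
Z = rng.normal(size=(60, 3))
Z -= Z.mean(axis=0)
U, _ = np.linalg.qr(Z)                       # centered orthonormal directions
rho = 0.98
X = np.column_stack([rho * U[:, 0] + np.sqrt(1 - rho**2) * U[:, 1], U[:, 0]])
y = U[:, 0] + 0.5 * U[:, 1] + U[:, 2]

k_ridge = solve_k(X, y, "ridge")
k_surrogate = solve_k(X, y, "surrogate")
print(k_ridge, k_surrogate)
```

The explicit bracket check matters because the squared correlation tends to a positive limit as $k \to \infty$, so a very demanding reduction may not be attainable for a given data set.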
With the computed value for $k$, we can measure the reduction in collinearity using the VIF statistic or using the condition number of $X'X + kI_p$. For our case study, we use the example in McDonald (2010), which is known to have severe collinearity. We report the improvements in collinearity for both methods.

Raise estimators
We assume that the columns of $X = [x_1, x_2, \ldots, x_p]$ are centered and standardized; that is, $X'X$ is in correlation form with $\|x_j\|^2 = 1$. For the $n \times p$ matrix $X = [x_1, x_2, \ldots, x_p]$, the column span is denoted by $\mathrm{Sp}(X)$, with $X_{(j)}$ denoting the $j$th column vector $x_j$ and $X_{[j]}$ denoting the $n \times (p - 1)$ matrix formed by deleting $X_{(j)}$ from $X$. For the linear model $y = X\beta + \varepsilon$, central to a study of collinearity is the relationship between $X_{(j)}$ and $\mathrm{Sp}(X_{[j]})$.
The raise estimators are based on perturbing a column $x_j \to \tilde{x}_j = x_j + \lambda_j e_j$ by a $\lambda_j$ multiple of a vector $e_j$ orthogonal to the span of the remaining resting columns. We follow the notation from Garcia and Ramirez (in press). The regression of $x_j$, viewed as the response vector, using the remaining resting columns as the explanatory vectors, has an error vector $e_j$ with the required properties. The raise estimators are constructed sequentially, one column at a time (Step 1 raises $x_1$, Step 2 raises $x_2$, and so on).
Some desirable properties of the raised regression method are as follows.
(1) Raising a column vector in $X$ does not affect the basic OLS regression model, as the raised vector remains in the original $\mathrm{Sp}(X)$: $e_j = X_{(j)} - X_{[j]}\hat\alpha_{(j)} \in \mathrm{Sp}(X)$, where $\hat\alpha_{(j)}$ denotes the OLS coefficients from regressing $X_{(j)}$ on $X_{[j]}$, so $\mathrm{Sp}(\tilde{X}) = \mathrm{Sp}(X)$, as shown in Garcia et al. (2011).
(2) Garcia et al. (2011) have shown that the raise estimators satisfy the MSE Admissibility Condition, assuring an improvement in mean squared error $\mathrm{MSE}(\tilde\beta(\lambda))$ for some $\lambda \in (0, \infty)$, and thus the raise estimators can be said to be of ridge-type.
(4) Starting with $X'X$ in correlation form, the successive raising steps yield a final raised matrix $\tilde{X}$ whose moment matrix $\tilde{X}'\tilde{X}$ is a perturbation of $X'X$. The raised regression perturbation matrix is equivalent to a generalized ridge regression perturbation matrix and, conversely, any generalized ridge regression matrix has a corresponding raised regression matrix, as in Garcia and Ramirez (in press).
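The first property, that raising a column leaves the column span and hence the OLS fitted values unchanged while reducing collinearity, can be checked with a short sketch. The synthetic data, the raise parameter value 0.5, and the function names are assumptions for illustration.

```python
import numpy as np

def raise_column(X, j, lam):
    """Return a copy of X with column j replaced by x_j + lam * e_j, where
    e_j is the residual from regressing x_j on the remaining columns."""
    rest = np.delete(X, j, axis=1)
    coef, *_ = np.linalg.lstsq(rest, X[:, j], rcond=None)
    e_j = X[:, j] - rest @ coef
    X_raised = X.copy()
    X_raised[:, j] = X[:, j] + lam * e_j
    return X_raised

def ols_fit(X, y):
    """OLS fitted values, i.e. the projection of y onto the column span of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

def vif(X):
    """Variance inflation factors for centered X."""
    M = X.T @ X
    return np.diag(M) * np.diag(np.linalg.inv(M))

# Synthetic collinear design in correlation form (assumption).
rng = np.random.default_rng(5)
n = 60
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.2 * rng.normal(size=n)])
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
y = rng.normal(size=n)
y = y - y.mean()

X_raised = raise_column(X, 0, 0.5)
same_fit = np.allclose(ols_fit(X, y), ols_fit(X_raised, y))
print(same_fit, vif(X)[0], vif(X_raised)[0])
```

The fitted values agree because the raised column is a linear combination of the original columns, while the raised column's residual against the resting columns has grown by the factor $1 + \lambda_j$, which lowers its VIF.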
The raise estimators allow the user to specify, for each of the variables, a precision $\pi_j$ that the data will retain during the raising stages by restricting the mean absolute deviation (MAD) in the $j$th column of $X - \tilde{X}$:

$$\frac{1}{n}\sum_{i=1}^{n} |x_{ij} - \tilde{x}_{ij}| = \frac{\lambda_j}{n}\sum_{i=1}^{n} |e_{ij}| = \pi_j. \qquad (5)$$

Thus, given a specified precision $\pi_j > 0$, the user can raise column $j$ in $X_{\langle 1,\ldots,j\rangle}$ to $\tilde{x}_j(\lambda_j) = x_j + \lambda_j e_j$, where $\lambda_j$ is solved from Equation (5). The precision values should be based on the researcher's belief in the accuracy of the data. The raise parameters $\lambda_j$ are thus constrained to assure that the original data have not been perturbed more than the researcher has permitted.
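Solving for the raise parameter from a given precision amounts to dividing the precision by the mean absolute residual. The sketch below raises the columns successively; it assumes that at each step the residual is computed against the current design, in which earlier columns have already been raised, and the function name `successive_raise` and the synthetic data are illustrative.

```python
import numpy as np

def successive_raise(X, precisions):
    """Raise columns 0, 1, ... in turn so that the mean absolute deviation of
    each raised column from the original column equals its precision."""
    X_t = X.astype(float).copy()
    lams = []
    for j, pi_j in enumerate(precisions):
        rest = np.delete(X_t, j, axis=1)            # current resting columns
        coef, *_ = np.linalg.lstsq(rest, X_t[:, j], rcond=None)
        e_j = X_t[:, j] - rest @ coef               # orthogonal to the resting columns
        lam_j = pi_j / np.mean(np.abs(e_j))         # MAD(lam_j * e_j) = pi_j
        X_t[:, j] = X_t[:, j] + lam_j * e_j
        lams.append(lam_j)
    return X_t, np.array(lams)

# Synthetic collinear design in correlation form (assumption).
rng = np.random.default_rng(6)
n = 60
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.2 * rng.normal(size=n)])
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)

precisions = np.array([0.01, 0.01])                 # illustrative precisions
X_raised, lams = successive_raise(X, precisions)
mad = np.mean(np.abs(X_raised - X), axis=0)
print(lams, mad)
```

The procedure needs a nonzero residual at every step, which is why the columns of the design must be linearly independent.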
Remark 4 The ridge and surrogate procedures do not require $X$ to be of full rank. For example, with the surrogate transformation $\xi_i \to \sqrt{\xi_i^2 + k}$, any zero singular value will be mapped to $\sqrt{k} > 0$, with $X_k$ now of full rank. On the other hand, the raise procedure does require the columns of $X$ to be independent, as the crucial step $x_1 \to \tilde{x}_1(\lambda_1) = x_1 + \lambda_1 e_1$ moves $x_1$ in the direction of the orthogonal complement of $\mathrm{Sp}(X_{[1]})$.

Case study
Our case study is the numerical example in McDonald (2010). Here, $n = 60$ and $p = 2$ with $X = [x_1, x_2]$, where $x_1$ is the nitrogen oxide pollution potential, $x_2$ is the hydrocarbon pollution potential and $y$ is the total mortality rate in 60 US metropolitan areas. The original data set had 15 explanatory variables. Following McDonald (2010), we concentrate on the two variables which have the highest correlation, $\rho = 0.9838$. Since $X$ is assumed to contain only explanatory variables, the vectors $x_1$, $x_2$, $y$ are all mean-centered and scaled to have unit length.

Table 2 (columns: OLS, Step 1, Step 2). OLS and raise regression with precision $\pi_j = 0.009158$: squared correlation $R^2(y, \hat{y}(k))$, computed parameters $\lambda_j$, estimated coefficients $\hat\beta$, squared lengths $\hat\beta'\hat\beta$, condition numbers $\kappa$, variance inflation factors VIF, and mean absolute deviation for $X - \tilde{X}$ for the raise design.

For Table 1, the ridge and surrogate parameters $k$ were each chosen to retain 95% of the OLS squared correlation between $y$ and $\hat{y}_R = X\hat\beta_R(k)$ for the ridge parameter and between $y$ and $\hat{y}_S = X\hat\beta_S(k)$ for the surrogate parameter. Thus, both methods have the same small decrease in $R^2(y, \hat{y}(k))$ down to 0.3086, shown in Row 1 of Table 1, with the associated parameters in Row 2 of Table 1. This allows us to compare the improvement in collinearity between the two procedures. The estimated coefficients are shown in Row 3 of Table 1.
Ridge-type procedures are designed (1) to decrease the squared length of the estimated coefficients $\hat\beta'\hat\beta$, which is given in Row 4 of Table 1; (2) to decrease the condition number $\kappa(X'X + kI_p)$ of the matrix which needs to be inverted, which is given in Row 5 of Table 1; and (3) to decrease the variance inflation factors VIF, given in Row 6. Since $p = 2$, both VIFs have a common value, so only one value appears in Row 6. For each of these three criteria, surrogate regression is shown to be a superior procedure, achieving a model with smaller collinearity with comparable loss of squared correlation $R^2(y, \hat{y}(k))$.
The standard method for computing VIF for ridge regression in correlation form follows the procedure suggested by Marquardt (1970), which is to use the values on the main diagonal of $(X'X + kI_p)^{-1} X'X (X'X + kI_p)^{-1}$. Although this is the correct expression for $k = 0$, it has been shown to be in error for $k > 0$ by Garcia et al. (2015), as the Marquardt expression allows inadmissible values less than one. Thus, we have used Equations (22) and (25) in Garcia et al. (2015) to compute the corrected values for VIF for ridge regression.
From Table 1, we see that the mean absolute deviation (MAD) for $X - \tilde{X}$ from the surrogate system is 0.009158. To compute a comparable raise system of estimators, we set the precision $\pi_j = 0.009158$ in Equation (5). The OLS values from Table 1 are shown in Column 1 of Table 2 for comparison. Using $\pi_1 = 0.009158$ in Step 1, we solve for $\lambda_1 = 0.5671$ to raise $x_1 \to \tilde{x}_1$. With this value, the squared length $\hat\beta'\hat\beta = 8.04$, $\kappa = 51.19$ and VIF = 13.29, all showing an improvement in collinearity. The angle between the two column vectors in the design has improved from 10.31° to 15.92°. The corresponding $R^2(y, \hat{y}(\lambda)) = 0.3169$ indicates that 97.5% of the squared correlation has been retained. For Step 2, we solve for $\lambda_2 = 0.3898$ to raise $x_2 \to \tilde{x}_2$. This is the final raised design $\tilde{X} = [\tilde{x}_1, \tilde{x}_2]$. With these values, the squared length $\hat\beta'\hat\beta = 4.15$, $\kappa = 27.43$ and VIF = 7.36, all showing an improvement in collinearity. The angle between the two column vectors in the design has improved to 21.62°. The corresponding $R^2(y, \hat{y}(\lambda)) = 0.3147$ indicates that 96.9% of the squared correlation has been retained. Row 8 records MAD, which is 0.009158 by construction.
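The reported angles and variance inflation factors depend only on the correlation 0.9838 and the two raise parameters, so they can be cross-checked without the original data. The sketch below builds two arbitrary centered unit vectors with that correlation (a stand-in for the pollution data, which is an assumption) and applies the two raising steps, with Step 2 regressing $x_2$ on the already raised first column.

```python
import numpy as np

rho, lam1, lam2 = 0.9838, 0.5671, 0.3898    # values reported in the text

# Two centered unit vectors with correlation rho (stand-in data, assumption).
rng = np.random.default_rng(0)
Z = rng.normal(size=(60, 2))
Z -= Z.mean(axis=0)
U, _ = np.linalg.qr(Z)                       # centered orthonormal directions
x2 = U[:, 0]
x1 = rho * U[:, 0] + np.sqrt(1 - rho**2) * U[:, 1]

def residual(v, w):
    """Residual of v after regression on the single vector w."""
    return v - (v @ w) / (w @ w) * w

def angle_deg(v, w):
    """Angle between v and w in degrees."""
    c = (v @ w) / (np.linalg.norm(v) * np.linalg.norm(w))
    return np.degrees(np.arccos(c))

def vif(v, w):
    """Common VIF for a two-column design: 1 / (1 - squared cosine)."""
    c2 = (v @ w) ** 2 / ((v @ v) * (w @ w))
    return 1.0 / (1.0 - c2)

x1_r = x1 + lam1 * residual(x1, x2)          # Step 1: raise x1
x2_r = x2 + lam2 * residual(x2, x1_r)        # Step 2: raise x2 against raised x1

a0, a1, a2 = angle_deg(x1, x2), angle_deg(x1_r, x2), angle_deg(x1_r, x2_r)
v1, v2 = vif(x1_r, x2), vif(x1_r, x2_r)
print(a0, a1, a2)
print(v1, v2)
```

The computed angles and VIFs agree with the reported values to within a few hundredths, the small differences being consistent with the rounding of the correlation and of the raise parameters to four decimal places.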
The values in Column 3 of Table 2 are comparable to the values for the surrogate model from Table 1. However, following Marquardt (1970, p. 610), VIF should be less than 10 and thus, for this example, we would favor the raise estimators as the ridge-type method to be used.