The Principal Problem with Principal Components Regression

Principal components regression (PCR) reduces a large number of explanatory variables down to a small number of principal components. PCR is often thought to become more useful as the number of potential explanatory variables grows. The reality is that a large number of candidate explanatory variables does not make PCR more valuable; instead, it magnifies the failings of PCR.


Pearson (1901) and Hotelling (1933, 1936) independently developed principal component analysis, a statistical procedure that creates an orthogonal set of linear combinations of the variables in an n x m data set X via the singular value decomposition X = UΣV', where U is an n x m matrix with orthonormal columns, Σ is an m x m diagonal matrix with the ordered singular values, and V is an m x m orthonormal matrix. The non-negative eigenvalues of X'X are the squared diagonal elements of Σ, the eigenvectors of X'X are the columns of V, and the principal components of X are given by XV. Hotelling (1957) and Kendall (1957) recommended replacing the original explanatory variables in a multiple regression model with their principal components. This replacement evolved into a recommendation by several prominent statisticians that components with small variances can be safely omitted from a regression model (Hocking 1976; Mansfield, Webster, and Gunst 1977; Mosteller and Tukey 1977). Thus, principal components regression (PCR) discards the eigenvectors that have the smallest eigenvalues, in contrast to other procedures, such as surrogate regression (Jensen and Ramirez 2010) and raise regression (Garcia, Garcia, and Soto 2011), that instead increase the magnitude of the small eigenvalues.
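As a concrete illustration of this decomposition, here is a minimal numpy sketch; the data matrix and seed are made up for the example, and only standard numpy calls are used:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))   # an illustrative n x m data matrix
X = X - X.mean(axis=0)          # center each column

# Singular value decomposition: X = U diag(s) V'
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                        # columns of V are the eigenvectors of X'X

eigenvalues = s**2              # non-negative eigenvalues of X'X
components = X @ V              # the principal components of X

# Sanity checks: X'X V = V diag(s^2), and the components are orthogonal
assert np.allclose(X.T @ X @ V, V @ np.diag(eigenvalues))
assert np.allclose(components.T @ components, np.diag(eigenvalues))
```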
PCR enthusiasts evidently believe that components with small variances are of little use in predicting variations in the dependent variable. Mansfield, Webster, and Gunst explicitly state that "The small magnitude of the latent root indicates that the data contain very little information on the predictiveness of those linear combinations" (page 38). Mosteller and Tukey argued that:

A malicious person who knew our x's and our plan for them could always invent a y to make our choices look horrible. But we don't believe nature works that way; more nearly that nature is, as Einstein put it (in German), "tricky, but not downright mean." (pp. 397-398)

Hadi and Ling (1998) show by theory and example that PCR may discard a principal component that is perfectly correlated with the variable being predicted, while retaining components that are completely uncorrelated with the dependent variable. Our point is more general. The principal problem with principal components regression is that it imposes constraints on the coefficients of the underlying independent variables that have nothing whatsoever to do with how these variables affect the dependent variable in the regression model.
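Hadi and Ling's point is easy to reproduce numerically. The construction below is a hypothetical illustration of ours, not their example: the dependent variable is built to coincide exactly with the lowest-variance principal component, so any rule that discards low-variance components discards the only predictive one.

```python
import numpy as np

rng = np.random.default_rng(1)
# Four variables with wildly unequal variances; the fourth is tiny
X = rng.normal(size=(200, 4)) @ np.diag([10.0, 5.0, 2.0, 0.1])
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
pcs = X @ Vt.T                  # principal components, ordered by variance

y = pcs[:, -1]                  # y is exactly the lowest-variance component

for k in range(4):
    r = np.corrcoef(pcs[:, k], y)[0, 1]
    print(f"PC{k+1}: variance share {s[k]**2 / np.sum(s**2):.4f}, corr with y {r:.2f}")
# The first three components capture essentially all of the variance in X yet
# are uncorrelated with y; the component a variance rule discards has corr 1.
```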
Hadi and Ling note that PCR advocates argue that, "Because the PCs . . . are orthogonal, the problem of multicollinearity disappears completely, and no matter how many PCs are actually used, the regression equation will always contain all of the variables in X (because each PC is a linear combination of the variables in X)." The problem we highlight is that, while all of the original explanatory variables may be retained, their estimated coefficients are distorted by PCR in ways that diminish the accuracy of the model when it is used to make predictions with fresh data.

Principal components regression is now commonplace. A principal components transformation of the original explanatory variables is used to create a set of orthogonal eigenvectors, with the corresponding eigenvalues representing the fraction of the variance in the original data that is captured by each eigenvector. The principal components included in the multiple regression model are then selected by a rule such as taking the components with the largest eigenvalues that together capture at least 80 percent of the total variance. A few examples from a wide variety of fields are Cowe and McNicol (1985), Stock and Watson (2002), Price, Patterson, Plenge, Weinblatt, Shadick, and Reich (2006), Dray (2008), Sanguansat (2012), Sainani (2014), Qi and Roe (2015), and Sabharwal and Anjum (2016).
Some argue that PCR solves the multicollinearity problem created by high correlations among the original explanatory variables; for example, Kudyba (2014) and Alibuhtto and Peiris (2015). However, a transformation that retains all of the principal components does not affect the implicit estimated values or standard errors of the coefficients of the original variables, or the predicted values of the dependent variable. The regression model is affected only if some of the principal components are omitted but, as will be illustrated later, this is because restrictions with no theoretical basis are imposed on the original parameters.
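This claim is easy to verify numerically: regressing on all of the principal components and rotating the estimated coefficients back through the eigenvector matrix reproduces the ordinary least squares estimates exactly. A minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)
y = X @ np.array([3.0, -1.0, 0.5, 2.0]) + rng.normal(size=50)
y = y - y.mean()

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
gamma, *_ = np.linalg.lstsq(X @ V, y, rcond=None)  # regress on ALL components

# Rotating back recovers the OLS coefficients, and hence the same predictions
assert np.allclose(V @ gamma, beta_ols)
```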
More recently, PCR has become popular in exploratory data analysis where there is a dauntingly large number of candidate explanatory variables and the researcher wants to let the data determine the final model; for example, Sakr and Gaber (2014), Taylor and Tibshirani (2015), Jolliffe and Cadima (2016), Verhoef, Kooge, and Walk (2016), George, Osinga, Lavie, and Scott (2016), Chen, Zhang, Petersen, and Müller (2017).
Among others, Gimenez and Giusanni (2017) emphasize that it is difficult to interpret the coefficients of the principal components because they are weighted averages of the coefficients of the underlying explanatory variables. Others criticize PCR for its linearity and propose a variety of nonlinear weighting schemes; for example, Liu, Li, McAfee, and Deng (2012), Deng, Tian, and Chen (2013), Yuan et al. (2015), Bitetto, Mangone, Mining, and Giannossa (2016), and Yu and Khan (2017).
These issues are not the most serious problem with principal components regression. The eigenvector weights depend solely on the correlations among the explanatory variables, with no regard for the dependent variable that the model will be used to predict. As a consequence, PCR may constrain the coefficients of the original explanatory variables in ways that cause the model to fare poorly with fresh data. Specifically, the constraints that the eigenvector weights impose on the implicit estimates may cause the estimated coefficients of nuisance variables to be large, while the estimated coefficients of important explanatory variables may be very small or have the wrong sign.
The Appendix uses a very simple model to provide a detailed example of the practice and pitfalls of principal components regression. We also use a Monte Carlo simulation model to demonstrate how this core problem with principal components regression is exacerbated in large data sets.

A Simulation Model
All the explanatory variables in our Monte Carlo simulations were generated independently in order to focus on the fact that a principal components analysis can be fooled by purely coincidental, temporary correlations among the candidate explanatory variables, some of which are nuisance variables that are independent of the true explanatory variables and of the variable being predicted, and may therefore be useless, or worse, out-of-sample. Two hundred observations for each candidate explanatory variable were determined by a Gaussian random walk process:

$$X_{i,t} = X_{i,t-1} + \varepsilon_{i,t} \qquad (1)$$

where the initial value of each explanatory variable was zero and the innovation ε was normally distributed with mean 0 and standard deviation σx. The central question is how effective principal components regression is at estimating models that can be used to make reliable predictions with fresh data. So, in each simulation, the first 100 observations were used to estimate the model's coefficients, and the remaining 100 observations were used to test the model's reliability.
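A sketch of this data-generating step, under our reading of Equation 1 (the seed and array names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs, n_vars, sigma_x = 200, 100, 5.0

# Equation 1: each candidate variable is a Gaussian random walk that starts
# at zero, with innovations drawn from N(0, sigma_x^2)
steps = rng.normal(0.0, sigma_x, size=(n_obs, n_vars))
X = np.cumsum(steps, axis=0)

X_in, X_out = X[:100], X[100:]  # estimate on the first half, test on the second
```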
All of the data were centered by subtracting the sample means. The in-sample data were centered on the in-sample means and the out-of-sample data were centered on the out-of-sample means so that the out-of-sample prediction errors would not be inflated if the in-sample and out-of-sample means differed.
Five randomly selected explanatory variables (the true variables) were used to determine the values of the dependent variable:

$$Y_t = \beta_1 X_{1,t} + \beta_2 X_{2,t} + \beta_3 X_{3,t} + \beta_4 X_{4,t} + \beta_5 X_{5,t} + \varepsilon_t \qquad (2)$$

where the five true variables are relabeled X1 through X5 for convenience, the value of each coefficient was randomly drawn from a uniform distribution ranging from 2 to 4, and ε is normally distributed with mean 0 and standard deviation σy. The range 0 to 2 was excluded because the true variables presumably have substantial effects on the dependent variable. Negative values were excluded so that we can compare the average values of the estimated coefficients to the true values. The other candidate variables are nuisance variables that have no effect on Y, but might be coincidentally correlated with Y.
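Continuing the sketch (and reusing the arrays from the previous snippet), Equation 2 might be implemented as:

```python
sigma_y, n_true = 20.0, 5

true_idx = rng.choice(n_vars, size=n_true, replace=False)  # the five true variables
beta_true = rng.uniform(2.0, 4.0, size=n_true)             # coefficients in [2, 4]

# Equation 2: Y depends on the five true variables plus noise; every other
# candidate is a nuisance variable whose true coefficient is zero
Y = X[:, true_idx] @ beta_true + rng.normal(0.0, sigma_y, size=n_obs)
Y_in, Y_out = Y[:100], Y[100:]
```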
A principal components analysis was applied to the in-sample data to determine the eigenvalues, eigenvectors, and principal components. The multiple regression model was then estimated using the principal components associated with the largest eigenvalues, such that at least 80 percent of the variation in the explanatory variables is captured by these components.
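The whole estimation step can be sketched as follows, with the centering described above, a cumulative-variance cutoff of 0.80, and the implicit coefficients on the original variables recovered by rotating the fitted coefficients back through the retained eigenvectors (again reusing names from the earlier snippets):

```python
# Center each half on its own means, as described above
Xc_in, Xc_out = X_in - X_in.mean(axis=0), X_out - X_out.mean(axis=0)
Yc_in, Yc_out = Y_in - Y_in.mean(), Y_out - Y_out.mean()

_, s, Vt = np.linalg.svd(Xc_in, full_matrices=False)
share = np.cumsum(s**2) / np.sum(s**2)
k = int(np.searchsorted(share, 0.80)) + 1   # smallest k reaching 80% of variance

V_k = Vt.T[:, :k]                           # retained eigenvectors
gamma, *_ = np.linalg.lstsq(Xc_in @ V_k, Yc_in, rcond=None)

beta_implicit = V_k @ gamma                 # implicit coefficients on the X's
pred_out = Xc_out @ beta_implicit           # out-of-sample predictions
```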
Our base case was σx = 5, σy = 20, and 100 candidate variables, but we also considered all combinations of σx = 5, 10, or 20; σy = 10, 20, or 30; and 10, 50, 100, 500, or 1,000 candidate variables. One million simulations were run for each parameterization of the model.

Results
The number of principal components included in a multiple regression equation is not affected by the standard deviation of Y, since the eigenvalues do not depend on Y, just on the correlations among the candidate explanatory variables. For the same reason, the number of included principal components does not depend on whether the candidate variables truly affect the dependent variable or are merely nuisance variables.
In our simulations, it also turned out that the assumed standard deviation of the explanatory variables hardly mattered either, at least for the range of values considered here; so, we only report the results for our base case of σx = 5 and σy = 20.
With 100 candidate variables, the average PCR equation had 3.01 principal components. Table 1 shows that the average number of components retained increased with the number of candidate variables.
We used the estimated coefficients of the principal components included in the multiple regression model to calculate the implicit estimates of the coefficients of the five true variables and each of the nuisance variables. The expected value of the coefficient of each of the five true variables is 3.0; the true coefficient of each nuisance variable is 0. Forming principal components from eigenvector weights imposes unwelcome constraints on the estimated coefficients of the explanatory variables. As the number of candidate variables increases, the true and nuisance variables become essentially indistinguishable, with implicit estimates that average near zero and consequently do not capture the importance of the true explanatory variables that determine the dependent variable. As the coefficient estimates become essentially noise, the model becomes less useful for making predictions.

Table 2 uses three metrics to compare the in-sample and out-of-sample prediction errors. The first is the simple correlation between the actual and predicted values of the dependent variable.
The second metric is the mean absolute error:

$$\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \left| Y_t - \hat{Y}_t \right|$$

The third metric is the root mean square error:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( Y_t - \hat{Y}_t \right)^2}$$
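In code, with the same notation as the earlier sketches:

```python
def correlation(y, yhat):
    return np.corrcoef(y, yhat)[0, 1]       # simple correlation

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))        # mean absolute error

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat)**2))  # root mean square error
```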
The first row of Table 2, labeled "5M," is a baseline that uses multiple regression estimates with the five true explanatory variables. The other rows use the principal components with the largest eigenvalues. The principal components models consistently performed far worse out-of-sample than in-sample. As the number of candidate variables increased, the in-sample fit worsened somewhat, while the out-of-sample fit deteriorated substantially.
The results are robust with respect to the number of observations. An increase in the number of observations improves the precision of the estimated coefficients of the principal components, but does not materially affect the results, because the flaw in PCR is that the eigenvector weights depend only on the correlations among the candidate variables, not on their relationship to the dependent variable.

The conclusions are also little affected by in-sample correlations among the explanatory variables. We focused on independent candidate variables because we wanted to emphasize the reality that PCR will often give large weights to nuisance explanatory variables even if they are independent of the true explanatory variables. For comparison, we also considered the case of candidate variables with 0.9 pairwise correlations. Table 3 compares three scenarios: no correlation among the candidate variables, high correlation in-sample only, and high correlation both in-sample and out-of-sample. In the first two scenarios, the independence of the explanatory variables out-of-sample exposed the PCR pitfall of putting inappropriate weights on the explanatory variables. On the other hand, Table 3 also shows that PCR did relatively well when the explanatory variables happened to be highly correlated both in-sample and out-of-sample; the inappropriate weights are then not as costly because it matters less whether the estimation procedure can distinguish between true variables and nuisance variables.
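One standard way to induce such pairwise correlations, used here purely as an illustrative sketch rather than as the exact construction behind Table 3, adds a common random-walk factor to otherwise independent walks:

```python
rho = 0.9
# A shared random-walk factor plus an idiosyncratic walk gives each pair of
# candidate variables a pairwise correlation of approximately rho
common = np.cumsum(rng.normal(0.0, sigma_x, size=(n_obs, 1)), axis=0)
own = np.cumsum(rng.normal(0.0, sigma_x, size=(n_obs, n_vars)), axis=0)
X_corr = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * own
```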

Conclusion
The promise of principal components regression is that it is an efficient way of selecting a relatively small number of explanatory variables from a vast array of possibilities, based on the correlations among the explanatory variables. The problem is that the eigenvector weights on the candidate variables have nothing to do with their relationship to the variable being predicted.
Mildly important variables may be given larger weights than important variables. Nuisance variables may be given larger weights than the true explanatory variables. The coefficients of the true explanatory variables may be given the wrong signs.
It might be thought that the larger the number of possible explanatory variables, the more useful the data reduction provided by principal components. The reality is that principal components regression becomes less effective and more likely to be misleading as the number of potential explanatory variables grows.

Appendix: A Principal Components Regression Example
Equations 1 and 2 were used to generate twenty observations for four explanatory variables, of which two, X1 and X2, were used with randomly determined coefficients (3.092 and 3.561, respectively) to determine the values of the dependent variable Y. The other two explanatory variables, X3 and X4, were nuisance variables. To keep the standard errors comparable to those in the main paper, we used σx = 5 and σy = 5. The first ten observations were used for the in-sample statistical analysis, with the remaining ten observations reserved for an out-of-sample test of the model. These data are shown in Table A1.
The eigenvectors and eigenvalues for the four explanatory variables are shown in Table A2.
The sum of the eigenvalues is 1,778.42, with the first and second eigenvalues accounting for fractions 0.601 and 0.287 of the total, respectively. Because these two components together capture 0.888 of the total variance, more than the 0.80 threshold, the two corresponding principal components were used in the multiple regression equation.
The first two principal components are linear combinations of the four explanatory variables, with the eigenvector weights shown in Table A2. The absolute values of the weights were larger for the first explanatory variable than for the second, even though the true coefficient of the second variable was larger than the true coefficient of the first variable (3.092 versus 3.561). The weights given the two nuisance variables were comparable to the weights given the true variables. Notice also that, in the first principal component, the weights for the first and second explanatory variables have opposite signs, even though their true coefficients have the same sign. The inescapable problem is that the principal component weights are derived from the correlations among the explanatory variables with no concern for how the dependent variable is related to the explanatory variables.
The matrix multiplication of the original data by the eigenvector weights gives the principal components shown in Table A3. Using the 0.80 rule, Y was regressed on the first two principal components.
Substituting the eigenvector definitions of the two principal components into this estimated regression gives the implicit estimates of the coefficients of the original explanatory variables shown in Table A4.
The coefficient of X2, the variable with the largest true coefficient, has the wrong sign, and the coefficients of the two nuisance variables are substantial.
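In matrix terms, this substitution is the same rotation used in the simulation sketch: stack the first two eigenvectors from Table A2 into a 4 x 2 matrix and multiply by the two fitted coefficients. A sketch with placeholder numbers (the actual Table A2 and regression values are assumptions, not the values from the example):

```python
import numpy as np

# Hypothetical stand-ins: V2 would hold the first two eigenvector columns
# from Table A2 (4 x 2) and gamma the two estimated regression coefficients;
# the numbers below are placeholders only
V2 = np.array([[ 0.5,  0.5],
               [-0.5,  0.5],
               [ 0.5,  0.5],
               [ 0.5, -0.5]])
gamma = np.array([1.0, 2.0])

beta_implicit = V2 @ gamma   # implicit coefficients on X1..X4 (Table A4 analogue)
print(beta_implicit)
```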
The estimated regression was used to make out-of-sample predictions for observations 11 through 20. Table A5 shows that the out-of-sample prediction errors were much larger than the in-sample errors, no doubt because the model's estimated coefficients were so inaccurate. For comparison, Table A5 also reports the errors of a naive model that completely ignores the explanatory variables and simply predicts that Y will equal its average value (0).

Table 3. One Hundred Highly Correlated Candidate Variables, σx = 5, σy = 20. Correlation among candidate variables: None; In-Sample Only; In- and Out-of-Sample.