Variable Selection in Multivariable Regression Using SAS/IML

This paper introduces a SAS/IML program that selects among candidate multivariate models based on several well-known multivariate model selection criteria. Both stepwise regression and all-possible-regression are considered. The program is user friendly: the user pastes or reads the data at the beginning of the module, supplies the names of the dependent and independent variables (the y's and the x's), and then runs the module. The program produces the candidate multivariate models based on the following criteria: Forward Selection, Forward Stepwise Regression, Backward Elimination, Mean Square Error, Coefficient of Multiple Determination, Adjusted Coefficient of Multiple Determination, Akaike's Information Criterion, the Corrected Form of Akaike's Information Criterion, the Hannan and Quinn Information Criterion, the Corrected Form of the Hannan and Quinn (HQ_C) Information Criterion, Schwarz's Criterion, and Mallows' C_p. The output includes both detailed and summarized results.


Introduction
Applications where several quantities are to be predicted using a common set of predictor variables are becoming increasingly important in various disciplines (Breiman & Friedman, 1997; Bilodeau & Brenner, 1999). For instance, in a manufacturing process one may want to predict various quality aspects of a product from the parameter settings used in manufacturing.
Or, given the mass spectra of a sample, the goal may be to predict the concentrations of several chemical constituents in the sample (Breiman & Friedman, 1997). A natural class of models that accommodates this is the generalization of the univariate multiple regression model called multivariate multiple regression (MMR). In MMR, q dependent variables (y_1, y_2, …, y_q) are to be predicted by linear relationships with k independent variables (x_1, x_2, …, x_k).

The statistical linear model for the MMR model is

Y = XΒ + E,

where Y represents n (independent) observations of a q-variate normal random variate, X represents the design matrix of rank k+1 with its first column being the vector 1, Β is a matrix of parameters to be estimated, and E represents the matrix of residuals.
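For concreteness, the following SAS/IML fragment sketches how the least-squares estimate of Β and the error sum-of-squares-and-cross-products matrix can be computed for a given set of predictors. The data set work.example and the variable names y1, y2, x1-x3 are placeholders; this is an illustration of the model above, not a reproduction of the paper's program.

proc iml;
  /* Read the responses and the candidate predictors (placeholder names). */
  use work.example;
  read all var {y1 y2}    into Y;
  read all var {x1 x2 x3} into X;
  close work.example;

  n = nrow(Y);
  X = j(n, 1, 1) || X;            /* design matrix: leading column of 1s       */
  B = inv(X` * X) * X` * Y;       /* least-squares estimate of the matrix B    */
  E = Y - X * B;                  /* matrix of residuals                       */
  Sigma = E` * E;                 /* error sums of squares and cross products  */
  print B Sigma;
quit;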
In practice, MMR applications often include a large number of predictors, some of which may be only slightly correlated with the y's or may be redundant because of high correlations with other x's (Sparks et al., 1985). The use of poor or redundant predictors can be harmful because the potential gain in accuracy attributable to their inclusion is outweighed by the inaccuracies associated with estimating their proper contribution to the prediction (Sparks et al., 1985).
The problem of determining the "best" subset of independent variables in multiple linear regression has long been of interest to applied statisticians, and it continues to receive considerable attention in the statistical literature (McQuarrie & Tsai, 1998). Two approaches have been suggested to deal with this problem. The first is to find the "best" set of predictors for each individual response variable using one (or more) of the model selection criteria available in most statistical packages, such as S-Plus, SAS, and SPSS. In this approach, researchers perform the model selection procedure on a univariate basis q times, where q is the number of y's in the model. This can lead to q different subsets of predictors, one for each y. The second approach is to find the "best" set of predictors for all response variables simultaneously: a single subset of predictors that "best" predicts all y's is selected using an analogous matrix expression of one of the univariate variable selection criteria. Sparks et al. (1985) criticized the univariate model selection methodology in comparison with multivariate techniques and gave two reasons for dealing with the target variables jointly rather than separately. One reason is simply computational efficiency, because the number of times the necessary model selection computations must be carried out is reduced from q to one. A second reason is that researchers sometimes need to establish which subset of predictors can be expected to perform well for all target variables, especially if there are costs associated with sampling the predictors.
Although the second approach is becoming increasingly important in various disciplines, to date statistical software packages such as SAS and SPSS do not provide a direct way to implement it (SAS/STAT User's Guide, 1990; SPSS Base System, 1992). In this paper, we present a SAS module that selects the "best" subset of predictors for predicting all y's jointly, using popular multivariate model selection criteria. The module performs model selection using three automatic search procedures (Forward Selection, Forward Stepwise Regression, and Backward Elimination) and nine all-possible-regression criteria (MSE, R², AdjR², AIC, AIC_C, HQ, HQ_C, BIC, and C_p). Sparks et al. (1983) were the first to introduce a multivariate version of variable selection, using the multivariate C_p statistic. Later, Sparks et al. (1985) presented a multivariate selection method that uses the mean squared error of prediction rather than tests of hypotheses as the basis for selection, and discussed the relationship between the two approaches. Bedrick and Tsai (1994) developed a small-sample criterion (AIC_C), which adjusts the Akaike information criterion (AIC) to be an exact unbiased estimator of the expected Kullback-Leibler information, for selecting MMR models. Another modification of AIC and C_p was proposed by Fujikoshi and Satoh (1997); their modifications were intended to reduce bias in situations where the collection of candidate models includes both underspecified and overspecified models. More recently, McQuarrie and Tsai (1998) presented and compared the performance of several multivariate as well as univariate variable selection criteria for two special models and gave comprehensive details on model selection.

Description of Model Selection Methods
Stepwise regression and all-possible-regression are the two types of variable selection procedures employed by most statistical software packages and used in practice. In the former, investigators delete or add variables one at a time using a stepwise method; in the latter, they examine all possible subsets and choose one model based on some criterion.
Before presenting a detailed description of each procedure, it is helpful to introduce standard notation for the variables, vectors, matrices, and functions used in defining the criteria. Table 1 presents these notations and definitions.

Table 1. Notations and definitions of variables and functions used.

Symbol   Definition
n        The number of observations
p        The number of parameters, including the intercept
k        The number of x's in the "full model"
q        The number of y's
Y        The matrix of dependent variables
X        The matrix of all candidate independent variables, with its first column being the vector 1
X_p      The submatrix of X containing the vector 1 and the columns corresponding to the selected variables x_p in the model
J        The q × q matrix of ones
Λ        The Wilks' Λ statistic, analogous to the F random variable, which is defined as the ratio of two independent chi-square random variables divided by their respective degrees of freedom
Σ        The sum of squared errors (error SSCP matrix) for the "full model", including the intercept
Σ_p      The sum of squared errors (error SSCP matrix) for a model with p parameters, including the intercept
ln       The natural logarithm
|•|      The determinant function

Stepwise Regression
Stepwise regression consists of three procedures: Forward Selection, Forward Stepwise Regression, and Backward Elimination (Barrett & Gray, 1994; Rencher, 1995). Although forward stepwise regression is probably the most widely used procedure (Neter et al., 1996), all three procedures are presented and implemented in the module.
The usual criterion for adding (or deleting) an x variable is either the partial Wilks' Λ or the partial F statistic. Our SAS module employs only the partial Wilks' Λ. Wilks' Λ is analogous to the F random variable, which is defined as the ratio of two independent chi-square random variables divided by their respective degrees of freedom (Rencher, 1998). For a candidate variable, the partial Wilks' Λ is the ratio of the determinant of the error matrix for the model that includes the variable to the determinant of the error matrix for the model that excludes it. A variable is a candidate for addition when the minimum partial Wilks' Λ value falls below a predetermined threshold value, and a candidate for deletion when the maximum partial Wilks' Λ value exceeds a predetermined value.

Forward Selection
The forward selection technique begins with no variables in the model. For each of the independent variables, the forward method calculates a partial Wilks' Λ statistic that reflects the variable's contribution to the model if it is included. The minimum of these partial Λ statistics is compared to a predetermined threshold value. If no Λ statistic falls below the threshold, forward selection stops. Otherwise, the forward method adds the variable with the lowest partial Wilks' Λ statistic to the model. It then recalculates the partial Wilks' Λ statistics for the variables still outside the model, and the evaluation process is repeated. Thus, variables are added one by one until no remaining variable produces a significant partial Wilks' Λ statistic. Once a variable is in the model, it stays there (for more details, the reader is referred to Rencher, 1995).
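As an illustration, the fragment below sketches the first forward step: a partial Wilks' Λ is computed for each candidate x as the ratio of the determinant of the error matrix with the candidate included to that of the intercept-only model, and the candidate with the smallest value would be added if it falls below the chosen threshold. The data set and variable names are placeholders, and this is a simplified sketch of the procedure described above, not the paper's actual code.

proc iml;
  use work.example;                        /* placeholder data set name           */
  read all var {y1 y2}    into Y;
  read all var {x1 x2 x3} into X;
  close work.example;

  n = nrow(Y);  k = ncol(X);
  ones = j(n, 1, 1);

  /* error matrix of the intercept-only model (no x's in the model yet)           */
  E0     = Y - ones * inv(ones` * ones) * ones` * Y;
  Sigma0 = E0` * E0;

  /* partial Wilks' Lambda for each candidate x at the first forward step         */
  lambda = j(k, 1, .);
  do j = 1 to k;
    Xj        = ones || X[, j];
    Ej        = Y - Xj * inv(Xj` * Xj) * Xj` * Y;
    lambda[j] = det(Ej` * Ej) / det(Sigma0);
  end;
  print lambda;    /* the smallest value identifies the candidate for addition,
                      provided it falls below the predetermined threshold          */
quit;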

Forward Stepwise Regression
The stepwise method is a modification of the forward selection technique; it differs in that variables already in the model do not necessarily stay there. As in forward selection, variables are added one by one, and the partial Wilks' Λ statistic for a variable to be added must be significant for entry (i.e., the minimum of the partial Λ statistics falls below a predetermined threshold value). After a variable is added, however, the stepwise method examines all the variables already in the model and deletes any variable whose partial Λ statistic is no longer significant for staying (i.e., its partial Λ value exceeds a predetermined value). Only after this check is made and the necessary deletions accomplished can another variable be added to the model. The stepwise process ends when none of the variables outside the model has a partial Λ statistic significant for entry and every variable in the model meets the criterion to stay, or when the variable to be added to the model is the one just deleted from it.

Backward Elimination
The backward elimination method begins with all x's included in the model and deletes one variable at a time using the partial Λ statistic. At the first step, the partial Λ for each x is calculated, and the variable with the largest partial Λ statistic that exceeds the predetermined threshold value is deleted.
At the second step, a partial Wilks' Λ is calculated for each of the k − 1 remaining variables, and again the least important variable in the presence of the others is eliminated. This process continues until a step is reached at which the largest partial Λ is "significant" (i.e., does not exceed the predetermined value), indicating that the corresponding variable is apparently not redundant in the presence of the other variables in the model.
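A corresponding sketch of one backward-elimination step follows: for each x currently in the model, the partial Λ is computed as the ratio of the determinant of the error matrix for the current model to that of the model with that x removed, and the variable with the largest ratio would be dropped if it exceeds the threshold. The data set and variable names are placeholders, and the sketch shows only a single step.

proc iml;
  use work.example;                        /* placeholder data set name            */
  read all var {y1 y2}    into Y;
  read all var {x1 x2 x3} into X;
  close work.example;

  n = nrow(Y);  k = ncol(X);
  ones = j(n, 1, 1);

  /* error SSCP matrix for the regression of Y on a given design matrix            */
  start errSSCP(Y, Xd);
    E = Y - Xd * inv(Xd` * Xd) * Xd` * Y;
    return(E` * E);
  finish;

  SigmaFull = errSSCP(Y, ones || X);       /* all x's currently in the model        */

  /* partial Wilks' Lambda for deleting each x from the current model              */
  lambda = j(k, 1, .);
  do j = 1 to k;
    others    = setdif(1:k, j);            /* indices of the remaining x's          */
    lambda[j] = det(SigmaFull) / det(errSSCP(Y, ones || X[, others]));
  end;
  print lambda;    /* the largest value identifies the candidate for deletion,
                      provided it exceeds the predetermined threshold               */
quit;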

All-Possible-Regression
The all-possible-regression procedure calls for considering all possible subsets of the pool of potential predictors and identifying a few "good" subsets for detailed examination according to some criterion (Neter et al., 1996). The criteria used by the program for comparing the candidate regression models are described below.

Residual Mean Square Error
The residual mean square error is the variance estimator for each model and is defined by

MSE_p = Σ_p / (n − p),

where Σ_p is the sum of squared errors for a model with p parameters, including the intercept. It is often suggested that the researcher choose the model with the minimal value of MSE.
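A minimal sketch of the MSE criterion for one candidate subset follows. The error matrix values are placeholders, and the determinant is used as the scalar summary of the MSE matrix; the paper does not show which scalar function of the matrix it minimizes, so that choice is an assumption.

proc iml;
  /* Sigma_p: error matrix of a candidate model with p parameters (placeholder).  */
  n = 25;  p = 3;
  Sigma_p = {4.2 1.1,
             1.1 3.7};

  MSE_p   = Sigma_p / (n - p);     /* residual mean square error matrix            */
  critMSE = det(MSE_p);            /* scalar summary (assumed: determinant)        */
  print MSE_p critMSE;
quit;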

R² Selection Criterion
R² is the coefficient of multiple determination; this method finds the subsets of independent variables that best predict the dependent variables by linear regression in the given sample. It efficiently performs all possible subset regressions and displays the models in decreasing order of the magnitude of the R² matrix within each subset size. The R² method differs from the other selection methods in that it always identifies the "best" model as the one with the largest R² for each number of variables considered.

Adjusted R² Selection Criterion
Since the number of parameters in the regression model is not taken into account by R², which does not decrease as p increases, the adjusted coefficient of multiple determination (AdjR²) has been suggested as an alternative criterion. The AdjR² method is similar to the R² method; it finds the "best" models, those with the highest AdjR², within each subset size.
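The paper's matrix form of AdjR² is not reproduced above. The sketch below therefore assumes the matrix analogue of the familiar univariate adjustment, AdjR² = J − ((n − 1)/(n − p))(J − R²), with J the q × q matrix of ones from Table 1; both the assumed formula and the R² matrix values are placeholders, not the paper's expressions.

proc iml;
  /* R2mat: an R-squared matrix for a candidate subset (placeholder values);      */
  /* n, p, q as in Table 1.                                                        */
  n = 25;  p = 3;  q = 2;
  R2mat = {0.80 0.10,
           0.10 0.70};

  Jq    = j(q, q, 1);                             /* q x q matrix of ones          */
  AdjR2 = Jq - ((n - 1)/(n - p)) * (Jq - R2mat);  /* assumed matrix analogue       */
  print AdjR2;
quit;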

Akaike's Information Criterion (AIC)
The AIC procedure (Akaike, 1973) is used to evaluate how well the candidate model approximates the true model by assessing the difference between the expectations of the vector y under the true model and under the candidate model, using the Kullback-Leibler (K-L) distance, i.e., the distance between the true density and the estimated density for each model. The model that best predicts the y's jointly under this procedure is the one that has the minimum AIC value.
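The paper's exact multivariate AIC expression is not reproduced above, so the sketch below assumes one common form, AIC = n·ln|Σ_p/n| + 2pq; the penalty term in particular is an assumption and should be checked against the original source. The error matrix values are placeholders.

proc iml;
  /* Sigma_p: error matrix of a candidate model with p parameters (placeholder).  */
  n = 25;  p = 3;  q = 2;
  Sigma_p = {4.2 1.1,
             1.1 3.7};

  SigmaHat = Sigma_p / n;                      /* ML estimate of the error covariance */
  AIC = n * log(det(SigmaHat)) + 2 * p * q;    /* assumed penalty: 2pq                */
  print AIC;
quit;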

The Corrected Form of Akaike's Information Criterion (AIC_C)
Bedrick & Tsai (1994) pointed out that Akaike's information criterion might lead to overfitting in small samples. They therefore proposed a corrected version of AIC, AIC_C. The "best" subset of x's under this procedure is the one that has the minimum AIC_C value.

Hannan and Quinn (HQ)
The HQ information criterion, introduced by Hannan and Quinn (1979), is similar in form to AIC but uses a penalty that grows with ln(ln n). The "best" model is the one that corresponds to the minimum HQ value.

The Corrected Form of Hannan and Quinn (HQ_C) Information Criterion
HQ_C adjusts the HQ criterion for small samples in the same spirit in which AIC_C adjusts AIC. Similarly to the criteria above, the procedure identifies the "best" subset of the x's as the one that yields the smallest HQ_C value.

Schwarz's Criterion (BIC)
The function computes Schwarz's Bayesian information criterion for each model using the Kullback-Leibler (K-L) distance (Schwarz, 1978; SAS/STAT User's Guide, 1990), which can be utilized to identify the "best" model. The "best" model under this procedure is the one that corresponds to the minimal BIC value.

Mallows' C_p
The C_p criterion was initially suggested by Mallows (1973) for univariate regression and extended by Sparks et al. (1983) to multivariate multiple regression. It evaluates the total mean squared error of the n fitted values for each subset regression. The criterion is obtained from

C_p = (n − k − 1) Σ^(-1) Σ_p − (n − 2p) I,

where Σ and Σ_p are the error matrices defined in Table 1 and I is the identity matrix of size q × q. The procedure identifies the "best" subsets of the x's as those that give both a small C_p (in terms of a scalar function of the matrix, such as the determinant) and a C_p near pI (Sparks et al., 1983; Rencher, 1995). In some situations, however, |C_p| is negative and unreliable (Sparks et al., 1983). Hence, a modification of |C_p| was suggested by Sparks et al. (1983) to remedy this problem: the quantity |Σ^(-1) Σ_p|, which is always positive, can be written in terms of C_p. When the bias is 0, C_p = pI, and subsets are sought for which the modified criterion is close to the value it takes at C_p = pI.
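A sketch of the multivariate C_p matrix as reconstructed above follows; the scaling constants (n − k − 1) and (n − 2p) mirror the univariate Mallows formula and should be verified against Sparks et al. (1983). All numeric values are placeholders.

proc iml;
  /* Sigma: full-model error matrix; Sigma_p: subset-model error matrix.          */
  /* n, k, p, q as in Table 1; all values below are placeholders.                  */
  n = 25;  k = 5;  p = 3;  q = 2;
  Sigma   = {3.0 0.8,
             0.8 2.5};
  Sigma_p = {4.2 1.1,
             1.1 3.7};

  Cp   = (n - k - 1) * inv(Sigma) * Sigma_p - (n - 2*p) * I(q);
  dist = sqrt(ssq(Cp - p * I(q)));   /* one way to measure how close Cp is to pI   */
  print Cp dist;
quit;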

Practical Example
Anderson (1983) considered this data set and found the "best" subsets of predictors based on the multivariate C_p criterion. We use the same data set here to obtain the "best" subset(s) of predictors using all the criteria mentioned above.
Three straightforward steps are needed to run the program properly for these data: (1) paste the data at the beginning of the program and read them using a DATA statement, (2) name the dependent and independent variables, and (3) run the program. Table 3-2 shows the resulting values at each p.
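In outline, the three steps look as follows; the data values, the data set name, and the variable names are placeholders, and the program's selection routines themselves (not reproduced here) would be invoked at the point indicated.

/* Step 1: paste (or read) the data and create a SAS data set                    */
data example;                        /* placeholder data set and variable names  */
  input y1 y2 x1 x2 x3;
  /* the two rows below are dummy values used only to illustrate the layout      */
  datalines;
1.2 3.4 0.5 2.1 7.0
2.3 3.1 0.7 1.9 6.4
;
run;

/* Step 2: name the dependent and independent variables inside PROC IML          */
proc iml;
  use example;
  read all var {y1 y2}    into Y;    /* the q dependent variables                */
  read all var {x1 x2 x3} into X;    /* the k candidate predictors               */
  close example;

  /* Step 3: run the program's selection module(s) from this point               */
quit;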

Summary and Conclusion
A SAS/IML program has been written to locate the multivariate candidate models for several multivariate model selection criteria. Three straightforward steps are needed to run the program properly for a new data set: (1) paste (or read) the new data in place of the example data at the beginning of the program and read them using a DATA statement, (2) change the names of the dependent and independent variables in the IML procedure, and (3) run the program.