Journal of Biometrics & Biostatistics

In order to use any method as a model selection algorithm, it is needed to check the adequacy and stability of a model selected by the algorithm. Adequacy of a robust model was not checked by giving outliers in various ways. In this paper, several contamination cases have been introduced to check the adequacy of the robust model


Introduction
When the number d of candidate predictor variables is small, a linear prediction model can be chosen by computing a reasonable criterion (e.g., AIC, Mallow's C p , BIC, CV) for all possible subsets of the predictor variables. However, the computational burden of this approach increases very quickly as d increases. This is one of the main reasons why step-by-step algorithm like forward selection (FS) is popular. But when the data sets contain outliers and other contaminations, classical FS procedure yields poor results, and often fails to select important covariates that would have been chosen if there were no outliers and other contaminations in the data sets. On the other hand, it is not logical to predict future outliers without having knowledge about the radical mechanism that produces these outliers.
Classical FS algorithm has been expressed in terms of sample means, variances and correlations [1]. Robust FS is obtained by replacing these classical sample quantities by their robust counterparts [1]. As a stopping rule, partial F-test procedure is used. The focus of this work is not to fit a final model but to check the adequacy and stability of the robust model. Khan et al. [1] checked the adequacy of the robust model by giving outliers to the noise variables and their corresponding response values. In this study, the adequacy of the robust model has been checked by giving outliers in several ways through a simulation study.
The rest of the paper is organized as follows. In §2, the classical FS is reviewed. In §3, the robust version of the algorithm is reviewed. In §4, a simulation study is presented for the comparison of robust FS and standard FS. In §5, a real data application is presented. And finally, §6 is the conclusion.

FS Algorithm Expressed in terms of Correlations
Let X 1 ,X 2 ,…,X d be n dimensional predictor variables and Y be the n dimensional response variable. Each variable is standardized with mean 0 and variance 1. The FS procedure begins with the assumption that there are no predictor variables in the model other than the intercept. The first predictor (X 1 , say) selected for entry into the equation is the one which has the largest absolute correlation |r 1Y | with Y, and the residual vector Y -r 1Y X 1 is obtained. For entering all the other predictor variables into the competition, they are 'adjusted for X 1 '. That is, each X j is regressed on X 1 , and the corresponding residual vector Z j.1 (which is orthogonal to X 1 ) is obtained. The correlations of these Z j.1 with the residual vector Y -r 1Y X 1 called the partial correlations between X j and Y 'adjusted for X 1 ' decide the next variable (X 2 , say) to enter the regression model. All the other predictor variables are then 'adjusted for the first two selected variables X 1 and X 2 ' for entering into further competition, and so on. This procedure of adding one predictor variable at each step is continued, until a stopping criterion is met. Let the correlation between X j and Y be r jY and R X be the correlation matrix of the predictors X 1 ,X 2 ,…,X d Let us assume, without loss of generality, X 1 has the maximum absolute correlation with Y Then, X 1 is the first variable that enters the regression model. The predictors in the current regression model are active predictors a. The remaining candidate predictors (d-a) are inactive predictors. The second predictor X 2 (say) that enters the regression model is the one that has the maximum absolute partial correlation |r jY.1 | with Y.

FS steps in correlations
FS algorithm is summarized in terms of correlations among the original variables as follows [1]: 1. To select the first covariate X m1 , determine m 1 =argmax |r j | 2. To select the kth covariate (k=2,3,…), calculate 1 ( 1) . k jY m m r −   , which is proportional to the partial correlation between X j and Y adjusted for X m1 …,X m(k-1) and then determine

Stopping rule for FS
At each FS step, once the "best" covariate (among the remaining covariates) is identified, a partial F-test can be performed to decide whether to include this covariate in the model (and continue the process) or to stop. The new "best" covariate enters the model only if the partial F-value, denoted by F partial , is greater than F(0.90,1,n-k-1) (say), where k is the current size of the model including the new covariate. Here again, the required quantities can be expressed in terms of correlations among the original variables, which is shown below.
When (k-1) covariates X 1 ,X 2 ,…,X k-1 are already in the model, and without loss of generality X k has the largest absolute partial correlation with Y after adjusting for X 1 ,X 2 ,…,X k-1 the partial F-statistic for X k can be expressed as:

Robustification of FS Algorithm
Simple robustification of FS algorithm is achieved by replacing the non-robust ingredients of FS algorithm by their robust counterparts [1,2]. For the initial standardization, the choices of first countable robust center and scale measures are straight forward: median (med) and median absolute deviation (mad). Most available robust correlation estimators are computed from the d-dimensional data and therefore are very time consuming [3]. On the other hand, robust univariate approaches [4] are very sensitive to correlation outliers (outliers that are not detected by univariate analyses but affect the classical correlation).
One solution is to derive correlations among pairs of variables from an affine-equivariant bivariate covariance estimator. A computationally efficient choice is the bivariate M-estimator defined by Maronna [5]. Alternatively, the robust correlation estimator of Gnanadesikan and Kettnring [6] or the related orthogonalized Gnanadesikan Kettnring estimator [7] can be used. For very large, high-dimensional data sets, we need an even faster robust correlation estimator. Huber [4] introduced the idea of univariate winsorization of the data and suggested that classical correlation coefficients be calculated from the winsorized data. Alqallaf, et al. [8] re-examined this approach for the estimation of individual elements of a high-dimensional correlation matrix. For n univariate observations X 1 ,X 2 ,…,X n , the transformation is given by where the Huber score function c (x) is defined as ψ c (x)=min{max{c,x},c}, with c a tuning constant chosen by the user (e.g., c=2 or c=2. 5). Note that in our case, med(x i )=0 and mad(x i )=1, because med and mad are used to robustly standardize the data. This univariate winsorization approach can be computed very rapidly, but unfortunately it does not take into account the orientation of the bivariate data.
To remedy this problem, Khan et al. [2] proposed a bivariate winsorization of the data based on an initial robust bivariate correlation matrix R 0 and a corresponding tolerance ellipse. Outliers are shrunken to the border of this ellipse by using the bivariate transformation is the Mahalanobis distance based on R 0 . Notice that c is a tuning constant that was chosen to be c=5.99 the 95% quantile of the 2 2 χ distribution. The choice of R 0 is discussed later.

The initial correlation estimate
Choosing an appropriate initial correlation matrix R 0 is essential for bivariate winsorization. In principle, we could use any robust bivariate scatter estimate, but for computational convenience, Khan et al. [2] proposed a new method called adjusted winsorization. This method considers quadrants relative to the coordinate-wise medians (which is considered as 0 due to the robust standardization of the data) and uses two tuning constants to perform univariate winsorization of the data. A larger tuning constant, c 1 , is used to winsorize the points lying in the two diagonally opposed quadrants that contain majority of the standardized data (called the "major quadrants"). A smaller tuning constant c 2 is used to winsorize the remaining data in the other two quadrants. In this article, we used c 1=2 and 2 1 c hc = , where h=n 2 /n 1 ,n 1 is the number of observations in the major quadrants and n 2 =n-n 1 . The initial correlation matrix R 0 is obtained by computing the classical correlation matrix of the adjusted winsorized data. The adjusted winsorization handles correlation outliers much better than univariate winsorization. By using bivariate winsorization, the outliers are shrunken to the boundary of the larger ellipsoid and thus appropriately down-weighted so that a robust correlation estimate is obtained. Although the initial adjusted winsorization and the resulting bivariate winsorization are not affine-equivariant, they can be computed very rapidly and can appropriately handle correlation outliers

Stopping rule for RFS
The classical correlations in the partial F statistic are replaced by their robust counterparts to form a robust partial F statistic. For stopping rule, standard F distribution is used as in §2.

A Simulation Study
A simulation study is accomplished analogous to Khan et al. [1] so that the performance of robust FS and classical FS can be compared. To perform the simulation study, d=50 candidate predictor variables are considered out of which a=9 or a=15 are non-zero target predictor variables. No-correlation case and two different correlation cases i.e., moderate-correlation and high-correlation cases which exist among the target predictor variables are considered. These cases are described below: For the no-correlation case, independent predictor variable X j~N (0,1) is considered and the response variable Y is generated using the a non-zero predictor variables with coefficients [1,2,9] which is repeated three times for a=9 and five times for a=15. The rest of the candidate predictors are considered as noise variables whose coefficients are zero. The variance of the error term is chosen in such a way that the signal to noise ratio equals to 2.
For the moderate-correlation case, three latent variables are introduced which are responsible for the systematic variation of both the response and the covariates, but are not active covariates. The linear model is created as follows: where L j~N (0,1) , i=1,2,3 and ε is a normal variable with mean 0 and standard deviation 110 . 2 = σ When a=9, a set of candidate predictor variables d=50 is created as follows. Let, where all e ij~N (0,1) and u k~N (0,1). Thus, the true correlation between these covariates is 0.5.
Similarly, for the high-correlation case, a similar linear model is created as in moderate-correlation case and a set of candidate predictor variables d=50 for a=9 is created as follows. Let, Here, δ is a fixed constant which is chosen to be 0.5 so that the correlation between these covariates is 0.80.
For the no-correlation, moderate-correlation and high-correlation cases, 1000 data sets each of size 200 is generated. Each data set is randomly divided into a training sample of size 100 and a test sample of size 100. We considered data without outliers, and also with 10% and 15% outliers or bad leverage points. 10% and 15% of bad leverage points are obtained by generating the errors with mean 50 and standard deviation 1. While at the same time all or parts of the corresponding predictor variables are contaminated. The contaminated predictor variables are generated with mean 500 and standard deviation 1.

Process of contamination of the training data
For contamination of the training data, at first a number of rows is randomly chosen and the covariates of these rows were replaced by large positive numbers. The corresponding response values were also replaced by large positive numbers. The corresponding response values were also replaced by large numbers. To contaminate the training sample with 10% of bad leverage points, the probability that any specific row of the training sample will be contaminated is 1-(1-p) z .
That is, For each of the no-correlation, moderate-correlation and highcorrelation cases, the training data sets are contaminated in different ways for measuring the adequacy of the robust model. Different cases of contaminations are given below: • Case 1: All candidate predictor variables are contaminated.
• Case 2: All active predictor variables plus 5 first noise variables are contaminated.
• Case 3: All active predictor variables are contaminated.
• Case 4: All active predictor variables related to the two most important latent variables L 1 ,L 2 are contaminated.
• Case 5: Two active predictor variables related to each of the three latent variables L 1 ,L 2 and L 3 are contaminated.
• Case 6: Most important active predictor variables plus first 10 noise variables are contaminated.
• Case 7: First three active predictor variables related to the most important latent variable L 1 are contaminated.
At first the training data is used for fitting the obtained models by applying each of the classical and robust FS methods. Then the test data is used for testing the significance of the fitted models. Both the classical and robust models are fitted by using a regression MMestimator because of its high breakdown point which is 0.5, and high efficiency at the normal distribution [10].
For each simulated data set, 10% trimmed mean of squared prediction error on the test sample is recorded. The average, standard deviation (SD) and median absolute deviation (mad) of the three quantities i.e., mean squared prediction error (MSPE), noise variables and target variables are shown in parentheses.
At first the performance of the classical and robust methods in clean data for the no-correlation, moderate-correlation and highcorrelation cases is presented. Table 1 depicts that the performance of classical FS and robust FS is comparable for the no-correlation, moderate-correlation and high correlation cases in clean data.

All candidate predictor variables
In this case, the values of the d=50 candidate predictor variables and their corresponding response values are contaminated. Table 2 represents the results for the no-correlation case in contaminated data. It shows that the test error produced by robust FS is much smaller than for the classical FS for both 10% and 15% outliers cases. The median absolute deviations and standard deviations are much smaller for the robust method than for the classical one. Also, the model selected by robust FS contains less noise variables than the classical FS. At the same time, more we increase the percentage of outliers in the training data, more the robust method performs very well while the performance of classical method is quite poor. Because classical FS selects more noise variables in the final model as the percentage of bad leverage points is increased. On the other hand, robust method selects less noise variables for the cases of 10% and 15% outliers. For example, for a=5 when we increase outliers from 10% to 15%, the average of the noise variables decreases from 0.9 to 0.6. Thus, we say that the robust method fits the final model with a small number of predictor variables by producing less test error compared to the classical method.
Tables 3 and 4 present the results for the moderate-correlation and high-correlation cases respectively. Here, the conclusions of the results for both the correlation cases are same as the no-correlation case [11][12][13][14].

All active predictor variables plus 5 first noise variables
In this case, the first 14 and 20 predictor variables are contaminated for a=9 and a=15 respectively, and the corresponding values of the response variable are also contaminated. Table 5 represents the results for the no-correlation case. It shows that the robust FS has less MSPE than the classical FS when a=9 active covariates are considered for both 10% and 15% outliers cases. Also the robust method fits the model with less noise variables than the classical method. But when we increased the number of active predictor variables from a=9 to a=15 in the model, the robust FS produces more test error than the classical FS. Despite of producing more test error, robust method includes less noise variables in the final model than the classical method. The conclusions of the results for the moderate correlation and high-correlation cases are also similar as in the no-correlation case.
In all other contamination cases, the conclusions of the simulation results are similar as in Case 1 and Case 2. So, the results of other contamination cases are not included.

Real Data Application
In this section, a real-data set is used to evaluate the robustness and scalability of robust FS method.

Breast cancer data
This data set was used for the KDD-cup 2008. We considered the    training data set which consists of a total of n=102,294 candidates, each described by 117 feature variables. We used the first 101 feature variables with a total of n=50,000 observations in our analysis. The first variable is considered as the response variable, and the remaining 100 variables are considered as the candidate predictor variables. n=50,000 observations are divided into a training sample of size n=25,000 and a test sample of size n=25,000. When the classical FS and robust FS are applied to this training data set, the classical FS selects a huge model with the following 63 covariates: (32, 59, 19, 42, 33, 23, 11, 89, 14, 81, 30, 51, 8, 9, 66, 17, 34, 5,  The 10% trimmed mean of squared prediction error for both the methods are 0.005. That is, the robust FS fits a good model by using only 13 predictor variables, while the classical FS does the same thing by using 63 predictor variables. We really don't know whether this data set is contaminated or not. To check the scalability and robustness of robust FS, this data set is contaminated in three different ways. These contamination cases are described below: This data set is contaminated by replacing one small value of the predictor variable 32 (say 15023 th value 0.5222431) by a large value 522. When both the classical FS and robust FS methods are applied to this contaminated data set, robust method selects the same model as before but classical FS selects a different model containing 62 covariates. This data set is contaminated by replacing one small value of the predictor variables 20 (say 9001 th value 0.692335) by a large value 160 and the corresponding value of the response variable (0.1446717) by a large value 180. Again in this case, robust FS selects the same model as before, but classical FS selects a model containing 52 covariates which is different from the previous model.

Conclusions
In this study, we considered the problem which arises when we select a linear prediction model for large high-dimensional data sets that may be clean or possibly contain a fraction of contaminations. At the same time, our goal was to achieve robustness and scalability. The performance of the classical FS and robust FS is compared through a simulation study and real data application. In simulated data sets, the performance of robust FS is comparable to standard FS for the nocorrelation, moderate-correlation and high-correlation cases in clean data. We also compared the performance of robust FS and classical FS by contaminating the simulated data sets in different ways. Robust FS has performed much better than standard FS. As we increased the percentage of bad leverage points in the simulated data sets, the robust FS has performed much better than the standard FS for the no-correlation, moderate-correlation and high-correlation cases. In almost all the contamination cases, the classical FS produced more test error, and also included more noise variables than the robust FS. In some contamination cases, the robust FS produced almost same test error but included less noise variables than the classical FS. Overall the performance of robust FS is better than the classical FS. In real data set, when we replaced some observations by bad leverage points, the model selected by classical FS changes frequently and produces more test error than robust FS. From the simulation study and real data example, it is proved that the robust FS outperforms the classical FS in contaminated data.