Combining some Biased Estimation Methods with Least Trimmed Squares Regression and its Application

When multicollinearity and outliers occur together in regression analysis, researchers need to deal with both problems simultaneously. Biased methods based on robust estimators are useful for estimating the regression coefficients in such cases. In this study we examine some robust biased estimators on datasets from the literature with outliers in the x direction and with outliers in both the x and y directions, by means of the R package ltsbase. Instead of a complete data analysis, the robust biased estimators are evaluated using the capabilities and features of this package.


The Least Trimmed Squares
Least Trimmed Squares (LTS), or Least Trimmed Sum of Squares, is one of a number of methods for robust regression (Rousseeuw & Leroy 1987). Several algorithms for calculating the LTS estimates exist in the literature: Ruppert & Carroll (1980), Neykov & Neytchev (1991), Tichavsky (1991), Atkinson & Weisberg (1991), Ruppert (1992), Stromberg (1993), Hawkins (1994), Hossjer (1995), Rousseeuw & van Driessen (1999), Agullo (2001), Hawkins & Olive (2002), Willems & van Aelst (2005), Jung (2005), Li (2005), Cizek (2005), Rousseeuw & van Driessen (2006). Peter Rousseeuw introduced several robust estimators, including LTS, in his works. LTS is a robust statistical technique for fitting a linear regression model to a set of n points given a trimming parameter h (n/2 ≤ h ≤ n), and it is insensitive to outliers. More formally, the LTS estimator is defined through an objective function and is given by
$$\hat{\beta}_{LTS} = \arg\min_{\beta} \sum_{i=1}^{h} (e^2)_{i:n},$$
where $(e^2)_{i:n}$ is the i-th smallest squared residual (or distance) when the squared residuals are ordered in ascending order. As h is the number of good data points, the LTS estimator obtains a robust estimate by trimming the (n − h) data points having the largest residuals from the data set. Note that, when h = n, it is equivalent to the ordinary least squares estimator. It is also possible to take h close to the number of good points, since the estimates become more accurate as h approaches the number of good points. For small sample sizes the existing algorithms work well; however, the computation time increases with the size of the data set. Hence other ways of fitting have been considered. Rousseeuw & van Driessen (1999) proposed a fast algorithm based on random sampling for computing LTS, which was finally published as Rousseeuw & van Driessen (2006). In this study, only the FAST-LTS algorithm proposed by Rousseeuw and van Driessen is considered.
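The LTS objective described above can be sketched in a few lines of base R. This is only an illustration of the criterion, not the FAST-LTS algorithm; the function name lts_objective and the simulated data are assumptions made for this example.

```r
# Sum of the h smallest squared residuals for a candidate coefficient vector.
lts_objective <- function(beta, X, y, h) {
  sq_res <- as.vector(y - X %*% beta)^2   # squared residuals e_i^2
  sum(sort(sq_res)[seq_len(h)])           # keep only the h smallest
}

set.seed(1)
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.3)
y[1:3] <- y[1:3] + 20                     # three artificial outliers in y
X <- cbind(1, x)                          # design matrix with intercept

# Evaluate the criterion at the OLS coefficients.
beta_ols <- solve(crossprod(X), crossprod(X, y))
rss_full <- lts_objective(beta_ols, X, y, h = n)      # h = n: ordinary RSS
rss_trim <- lts_objective(beta_ols, X, y, h = n - 3)  # drops the 3 largest
```

With h = n the criterion is the ordinary residual sum of squares, which is why LTS reduces to OLS in that case; a full LTS fit would additionally search over beta for the smallest trimmed sum.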
The paper is organized as follows: Section 2 outlines the contributions to LTS in the presence of multicollinearity. Section 3 explains some robust biased estimators. The next section introduces the ltsbase package and gives a statistical analysis of the example datasets in subsections. Finally, the last section presents the notable differences between ltsbase and previous algorithms in R.

Contributions to LTS in the Presence of Multicollinearity
Multicollinearity is a common problem in many areas, e.g., economic, technical and medical applications. It has been examined in the literature from different points of view, such as estimation and hypothesis testing of parameters, removal, and diagnostic tools. Several diagnostic tools, such as the condition number, condition indices, variance inflation factors and singular value decomposition, have been suggested and used for the detection of multicollinearity (Belsley 1991; Heumann, Shalabh, Rao & Toutenburg 2008; Wissmann, Toutenburg & Shalabh 2007). In this study, we focus exclusively on the Variance Inflation Factor for $\hat{\beta}_i$, $VIF_i = 1/(1 - R_i^2)$, and on the Condition Number, $\lambda_{max}/\lambda_{min}$, in order to diagnose multicollinearity. Here, $R_i^2$ is the coefficient of determination obtained by regressing the i-th regressor on the remaining regressors, and $\lambda_{max}$, $\lambda_{min}$ refer to the maximum and minimum eigenvalues of the corresponding matrix, respectively.
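Both diagnostics can be computed in base R, as in the following sketch. It assumes that the "corresponding matrix" is the correlation matrix of the regressors; the simulated data and variable names are illustrative only.

```r
set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly collinear with x1
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)

R   <- cor(X)                    # correlation matrix of the regressors
vif <- diag(solve(R))            # VIF_i = 1 / (1 - R_i^2)

# The same quantity for x1, computed directly from its definition.
r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# Condition number: ratio of the largest to the smallest eigenvalue.
ev <- eigen(R, symmetric = TRUE, only.values = TRUE)$values
cn <- max(ev) / min(ev)
```

The diagonal of the inverse correlation matrix gives all VIFs at once, which avoids fitting one auxiliary regression per regressor.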
When multicollinear datasets also contain outliers, researchers are forced to deal with both problems simultaneously. For this purpose, Kan, Alpu & Yazici (2013) studied the effectiveness of some robust biased estimators via a simulation study for different types of outliers. They also provided a dataset with outliers in the y direction to show the performance of biased estimators based on LTS.
Kan Kilinc and Alpu (2013) introduced a new package, ltsbase, implemented in the R system for statistical computing and available at http://CRAN.R-project.org/package=ltsbase. It can be used to perform biased estimation based on a robust method.
In contrast to the earlier work, we expand on some robust biased estimators for datasets with outliers in the x direction and with outliers in both the x and y directions by means of the ltsbase package. Hence this study helps close the considerable gap in the estimation of the Ridge and Liu parameters in the presence of multicollinearity and outliers by using the LTS method.

Robust Biased Estimators
In standard linear regression, consider the model
$$y = X\beta + \varepsilon,$$
where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$ is the unknown parameter vector, $X$ is a fixed $n \times (p+1)$ matrix of observations of full rank, and the $\varepsilon_i$ are iid random variables with mean 0 and variance $\sigma^2$, so that $\varepsilon$ has covariance matrix $\sigma^2 I_n$. The estimate of the regression coefficients, $\hat{\beta}$, is generally obtained by the Ordinary Least Squares (OLS) method. However, a large number of regressors in multiple linear regression analysis can cause serious problems in estimation and prediction.
A serious ill-conditioning problem is characterized by the fact that the smallest eigenvalue of $X'X$ is much smaller than unity. In other words, the matrix $X'X$ has a determinant close to zero, which makes it ill-conditioned, so that it cannot be inverted reliably. Here, the least squares solution is still unbiased but is plagued by a large variance. Hence the OLS solution yields a coefficient vector $\hat{\beta}$ whose elements are too large in absolute value (Marquardt & Snee 1975).
Ridge and Liu regression penalize the size of the regression coefficients. Their tuning (biasing) parameters, k for Ridge and d for Liu, control the strength of the penalty term.
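As a minimal sketch, the textbook forms of the two estimators, $\hat{\beta}_{Ridge}(k) = (X'X + kI)^{-1}X'y$ and $\hat{\beta}_{Liu}(d) = (X'X + I)^{-1}(X'y + d\hat{\beta}_{OLS})$, can be coded in base R as follows. These forms, the function names ridge_est and liu_est, and the simulated data are assumptions made for illustration; ltsbase may parameterize or compute the estimators differently.

```r
# Textbook ridge estimator: (X'X + kI)^{-1} X'y
ridge_est <- function(X, y, k) {
  p <- ncol(X)
  solve(crossprod(X) + k * diag(p), crossprod(X, y))
}

# Textbook Liu estimator: (X'X + I)^{-1} (X'y + d * beta_OLS)
liu_est <- function(X, y, d) {
  p <- ncol(X)
  b_ols <- solve(crossprod(X), crossprod(X, y))
  solve(crossprod(X) + diag(p), crossprod(X, y) + d * b_ols)
}

set.seed(3)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)      # strongly correlated with x1
X  <- scale(cbind(x1, x2))         # centered and scaled regressors
y  <- x1 + x2 + rnorm(n)
y  <- y - mean(y)                  # centered response, so no intercept

b_ols   <- ridge_est(X, y, k = 0)  # k = 0 gives the OLS solution
b_ridge <- ridge_est(X, y, k = 0.5)
b_liu   <- liu_est(X, y, d = 0.5)
```

Comparing sum(b_ridge^2) with sum(b_ols^2) shows the shrinkage that the penalty induces.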
In our study, the biasing parameters klts and dlts and the MSE values of the two robust biased estimators $\hat{\beta}_{ltsRidge}$ and $\hat{\beta}_{ltsLiu}$ are examined when outliers and multicollinearity exist in the dataset. Here klts and dlts are the robust choices of the biasing parameters k and d. In application, different robust biased estimates are obtained because the robust biasing parameters change with the user-specified increment. Thus, we choose the biasing parameters klts and dlts that minimize the MSE value. To illustrate the performance of the robust biased estimators, the MSE criterion is used:
$$\mathrm{MSE}(\hat{\beta}_{\bullet}) = \mathrm{tr}\,\mathrm{Cov}(\hat{\beta}_{\bullet}) + \mathrm{bias}(\hat{\beta}_{\bullet})'\,\mathrm{bias}(\hat{\beta}_{\bullet}),$$
where tr denotes the trace and $\hat{\beta}_{\bullet}$ stands for any of the robust biased estimators.
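The idea of choosing a biasing parameter by minimizing the MSE over a grid can be sketched for the classical (non-robust) ridge estimator. The sketch uses the standard canonical-form expression, $\mathrm{MSE}(k) = \sigma^2 \sum_i \lambda_i/(\lambda_i + k)^2 + k^2 \sum_i a_i^2/(\lambda_i + k)^2$, with OLS plug-in estimates for the unknown $\sigma^2$ and canonical coefficients $a_i$. The names ridge_mse, a_hat and k_grid are illustrative, and this is not claimed to be the computation used inside ltsbase.

```r
set.seed(4)
n  <- 60
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X  <- scale(cbind(x1, x2, x3)) / sqrt(n - 1)  # X'X is the correlation matrix
y  <- x1 + x2 + x3 + rnorm(n)
y  <- y - mean(y)

eig    <- eigen(crossprod(X), symmetric = TRUE)
lambda <- eig$values
Z      <- X %*% eig$vectors                   # canonical regressors, Z'Z = diag(lambda)
a_hat  <- as.vector(crossprod(Z, y)) / lambda # OLS estimates in canonical form
sigma2 <- sum((y - Z %*% a_hat)^2) / (n - ncol(X) - 1)

ridge_mse <- function(k) {
  sum(sigma2 * lambda / (lambda + k)^2) +     # trace of the covariance matrix
    sum(k^2 * a_hat^2 / (lambda + k)^2)       # squared bias
}

k_grid <- seq(0, 1, by = 0.001)               # increment plays the role of `by`
mse    <- vapply(k_grid, ridge_mse, numeric(1))
k_best <- k_grid[which.min(mse)]              # grid value with the smallest MSE
```

A finer increment gives a more precise minimizer at the cost of more evaluations.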

The ltsbase Package: Features and Functions
The R system has many packages and functions, e.g., MASS::lqs() (Venables & Ripley 2002), robustbase::ltsReg() (Rousseeuw & van Driessen 1999), and sparseLTSEigen:RcppEigen() (Alfons, A. 2013), to perform least trimmed squares regression and related statistical methods. The ltsbase package has a number of features not available in current R packages and fills an existing gap in the R statistical environment, namely the convenient comparison of biased estimations based on the LTS method.
The ltsbase package includes centering, scaling, singular value decomposition (svd) and the least trimmed squares method. Hence the user does not need to center or scale the data. Moreover, when computing $\hat{\beta}_{Ridge}$ numerically, matrix inversion is avoided because inverting $X'X$ can be computationally expensive; the svd is used instead to estimate the regression coefficients of each model. The package has three functions serving three purposes. First, the minimum MSE (Mean Squared Error) value is extracted by calling the ltsbase() function. Second, the fitted values and residuals of the corresponding model are returned by the ltsbaseDefault() function. Finally, the biasing parameters and regression coefficients of the corresponding model at the minimum MSE value are extracted with the ltsbaseSummary() function. Furthermore, the ltsbase package was designed especially to create "comparison of MSE" graphics based on the methods used in the analysis. Hence it allows users to see visual output without creating each graphic individually.
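How the svd replaces the matrix inversion can be sketched as follows: with $X = UDV'$, the ridge estimate is $V\,\mathrm{diag}\{d_j/(d_j^2 + k)\}\,U'y$. The function name ridge_svd and the simulated data are assumptions for illustration; the package's internal code may differ.

```r
# Ridge coefficients from the singular value decomposition X = U D V'.
ridge_svd <- function(X, y, k) {
  s      <- svd(X)
  shrink <- s$d / (s$d^2 + k)                  # d_j / (d_j^2 + k)
  as.vector(s$v %*% (shrink * crossprod(s$u, y)))
}

set.seed(5)
n <- 40
X <- scale(matrix(rnorm(n * 3), n, 3))         # centered and scaled regressors
y <- as.vector(X %*% c(1, -1, 0.5) + rnorm(n))
y <- y - mean(y)

k <- 0.3
b_svd    <- ridge_svd(X, y, k)
# Reference value from the normal equations, for comparison only.
b_direct <- as.vector(solve(crossprod(X) + k * diag(3), crossprod(X, y)))
```

Once the decomposition is available, estimates for a whole sequence of k values only require recomputing the shrink vector, which suits a grid search over biasing parameters.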
The ltsbase() function is the main function of the ltsbase package. This function computes the minimum MSE values for six methods: OLS, Ridge, Ridge based on LTS, LTS, Liu, and Liu based on LTS, for sequences of biasing parameters. It returns a comprehensive output presenting the biasing parameters and the coefficients of the models at the minimum MSE value. The main function takes the arguments xdata, y, print, plot, alpha and by. Here, xdata is a data frame of regressors and y is the response variable. Setting the plot and print parameters to TRUE returns the MSE values and a plot comparing the MSE values of four methods (Ridge, Ridge based on LTS, Liu, Liu based on LTS) as lines with different colours and line types. The argument alpha is (roughly) the proportion of squared residuals whose sum is minimized by the LTS regression method; it requires a value between 0.5 and 1. The last argument, by, is a number giving the increment of the sequence over which the biasing parameters are defined.
In the following two sections the usage of the ltsbase package is illustrated by two examples presenting two different cases of outliers.

Case Study 1: Outliers in x Direction
An artificial dataset with outliers, containing 75 observations on three regressors x1, x2, x3 and one response variable y, was created by Hawkins, Bradu & Kass (1984); the raw data (hereafter referred to as the hbk data) are given in Appendix A.1. Since hbk is a well-known data set, the analysis of variance and the OLS parameter estimates are not shown here. However, some diagnostic measures for the OLS analysis may be found in Appendix A.2. Of particular interest is the placement of leverage points among the remaining data points. Mason & Gunst (1985) showed that collinearity can be increased without bound by increasing the leverage of a point. They also showed that a q-variate leverage point can produce q − 1 independent collinearities (Chatterjee & Hadi 2006). A closer look at the diagnostics of the points is given in Figure 1, where multiple high leverage points that may cause multicollinearity are observed. The figure identifies all 14 leverage points. Four of them are good leverage points, with small standardized LTS residuals but a large robust distance, and the other 10 (numbered 1, 2, . . . , 10) are bad leverage points, with large standardized LTS residuals and a large robust distance (see Appendix A.3).
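The effect described by Mason & Gunst can be illustrated with a small simulation: adding a single extreme point to two otherwise unrelated regressors makes them appear collinear. The data, the function cond_number and the use of the correlation matrix for the condition number are assumptions made for this sketch; it does not use the hbk data.

```r
set.seed(6)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)                        # generated independently of x1

# Condition number of the correlation matrix of the regressors.
cond_number <- function(X) {
  ev <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
  max(ev) / min(ev)
}

X_clean <- cbind(x1, x2)
X_lev   <- rbind(X_clean, c(50, 50))  # one extreme point in the x space

cn_clean <- cond_number(X_clean)
cn_lev   <- cond_number(X_lev)

# Leverage: diagonal of the hat matrix for a model with an intercept.
Xd  <- cbind(1, X_lev)
lev <- diag(Xd %*% solve(crossprod(Xd), t(Xd)))
```

The added observation has by far the largest leverage, and the condition number after adding it is much larger than before, even though the remaining 50 points are unchanged.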

ltsbase Function
Let y denote the vector of response values and xdata the regressors. The regressors are assumed to be given in a data frame (not in a matrix or an array). To fit Ridge and Liu regression models based on LTS, we call the ltsbase function.
R> model1 = ltsbase(xdata, y, print=FALSE, plot=TRUE, alpha=0.875, by=0.001)
Here, when print is TRUE the user obtains all the values calculated in the analysis, and when plot is TRUE the function draws the curves of all MSE values versus the biasing parameters. The argument alpha is (roughly) the proportion of squared residuals whose sum is minimized, here 0.875, and by is the increment of the sequence, by default 0.001. The LTS regression method minimizes the sum of the h smallest squared residuals, where h > n/2, i.e. more than half of the observations must be used. The default value of h (when alpha = 1/2) is roughly n/2, where n is the total number of observations; by setting alpha, the user may choose higher values up to n.
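To make the roles of alpha and by concrete, the following sketch converts them into a rough subset size h and a grid of biasing parameter values for the hbk setting. The rounding rule floor(alpha * n) is an assumption made for illustration, since the exact rule used by the package is not stated here; the grid range from 0 to 1 follows the scale on which the MSE curves are traced.

```r
n     <- 75                       # number of observations in the hbk data
alpha <- 0.875                    # rough proportion of residuals retained
by    <- 0.001                    # increment of the biasing parameter sequence

h_rough <- floor(alpha * n)       # assumed rounding: about alpha * n residuals
n_trim  <- n - h_rough            # observations left out of the trimmed sum
grid    <- seq(0, 1, by = by)     # candidate values for k, klts, d and dlts
```

Under this rounding, alpha = 0.875 keeps 65 of the 75 squared residuals and evaluates each method at 1001 biasing parameter values.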
As reported in the previous section, the hbk data are used to highlight the specific features of ltsbase and how to interpret the results. The aim of this analysis is to find the minimum MSE value among the methods OLS, Ridge, Ridge based on LTS, LTS, Liu and Liu based on LTS. The returned output contains three elements. In the plot, the colors and line types of the curves distinguish the methods, each curve showing the MSE values versus the values of the biasing parameter. For instance, the black line is obtained by ridge regression and the blue dotted curve by Liu estimation (see the legend at the top right of the figure). Because the plot argument of the ltsbase function lays out the MSE values versus the biasing parameters for four methods, it gives immediate visual information about the MSE values. Note that each curve also has its own line type, so the curves remain distinguishable when printed in gray.
As can be readily seen in Figure 2, the model is identical to the OLS regression model at (k,klts,d,dlts)=(0,0,0,0). The plot serves as an exploratory tool to show the sensitivity of the MSE values to the methods used here. In the figure, each method is traced along its biasing parameter scale, beginning at 0 and ending at 1. As k increases, the MSE values associated with Ridge regression increase and then become almost horizontal after a certain value of k. The same pattern is followed by Ridge regression based on LTS. However, the MSE values obtained by Ridge regression based on LTS are much smaller than those obtained by Ridge regression as the biasing parameter klts increases. On the other hand, following the blue dotted curve produced by Liu estimation, the MSE values rise at low levels of d and fall steeply as the biasing parameter d increases.
Observing the MSE values of Liu based on the LTS method as dlts increases, note how the MSE value decreases slightly and then levels out.

ltsbaseSummary Function
The ltsbaseSummary function produces a summary of the analysis, showing the biasing parameter at the minimum MSE value. The summary of the biased LTS method gives three results: (1) the best biasing parameter, which gives the minimum MSE among all the methods, (2) the regression coefficients of the corresponding regression model at the best biasing parameter, and (3) the minimum MSE value. It can also be seen in Figure 2 that the MSE value begins to stabilize at around dlts = 0.65 and shows a slight downward trend at dlts = 0.67, where it attains the minimum among all the methods.

ltsbaseDefault Function
The fitted values and residuals of the corresponding model are also extracted as one of the outputs returned by the ltsbase package (see Appendix A.4).
As seen, there are substantial differences among the available LTS-related packages in R, and ltsbase is currently the only one to offer these features together.

Case Study 2: Outliers in Both x and y Direction
Maguna, Nunez, Okulik & Castro (2003) reported the results of a QSPR study whose aim was to predict the toxicity of carboxylic acids on the basis of several molecular descriptors, and obtained quite reasonable estimates compared to the previous theoretical calculations.
One of the concerns is how well our method performs when the data have outliers in both directions. We explore this on a data frame with 38 observations on the 10 variables used in the application; the description of the data set is given in Table 1. In the table, the toxicity is defined as the response variable and the remaining variables are considered as regressors.
In Figure 3, the placements of the outliers are presented and the points are identified by numbers. While observations 23, 28, 32, 34, 35, 36, and 37 are identified as outliers in the x direction, observations 11, 12, and 13 are identified as outliers in the y direction. The remaining data are all well-behaved or good leverage points. Secondly, we examine the data to determine whether there is multicollinearity among the regressors. The procedure used for the hbk data is repeated for the toxicity data in terms of multicollinearity and outliers. To detect multicollinearity in the toxicity data, the same measures given in Appendix A.2 are used and interpreted in Appendix B.2. Considering all indicators together, there is severe multicollinearity, which can be expected to affect the results considerably.
Due to the presence of both multicollinearity and outliers in the toxicity data, neither MASS::lqs nor robustbase::ltsReg in R is suitable to cope with these problems. Currently, the ltsbase package deals with multicollinearity and outliers simultaneously and offers a wide array of features, including a graphical comparison for the analysis.

ltsbase Function
This subsection illustrates the ltsbase code for the toxicity data, which returns the components of the biased estimation based on the LTS method. In Figure 4, MSE values versus the biasing parameters for the four methods obtained by ltsbase are presented when there are outliers in both the x and y directions. The figure shows approximately which method attains the smallest MSE value: as seen in Figure 4, the minimum MSE value is obtained by the LTS Liu method. Afterwards the user may obtain the exact results by calling the ltsbaseSummary function.

ltsbaseSummary Function
The ltsbaseSummary function is designed to summarize the whole analysis. From the output, among all the biasing parameters, the one that gives the minimum MSE is obtained by LTS Liu as 0.712.

ltsbaseDefault Function
The fitted values and residuals of the model summarized by the ltsbaseSummary function are given in Appendix B.4.

Conclusions
The package ltsbase fills an existing gap in the R statistical environment and provides a convenient comparison of biased estimations based on the LTS method. The package has four important features, for both users and package developers, that are not available in at least some of the alternatives, MASS::lqs (Venables & Ripley 2002) and robustbase::ltsReg (Rousseeuw et al. 2012). First, the package provides the estimation of Ridge and Liu parameters based on the LTS method for datasets in which multicollinearity and outliers exist at the same time. Second, the estimates of the biasing parameters at the minimum MSEs are calculated automatically. Third, the user can easily obtain the MSE values of each model for comparison. Fourth, a graph of the MSE values versus the biasing parameters for four biased methods is plotted as well.
Moreover, we present not only a package that implements some biased techniques based on the LTS method but also a comparative analysis, with interpretation, of well-known datasets from the literature in which outliers occur in different directions. Hence analysts can practice with those datasets and, hopefully, gain confidence in ltsbase.
Received: April 2014 - Accepted: March 2015
As seen, the VIFs are greater than 10, which means there is a multicollinearity problem.
Detecting multicollinearity via Condition Number: The degree of multicollinearity can also be measured using the Condition Number (CN), which is the ratio of the maximum eigenvalue to the minimum eigenvalue ($\lambda_{max}/\lambda_{min}$). As a rule of thumb, if the CN is between 100 and 1000 there is moderate multicollinearity, and if it exceeds 1000 there is severe multicollinearity.
Appendix B.1. Tabulated Data
Table 1 describes the response variable and several molecular descriptors with 38 observations (Maguna et al. 2003).
A closer look at the toxicity data is briefly given by head():