Subset selection in multiple linear regression in the presence of outlier and multicollinearity
Introduction
Consider the multiple linear regression model $y = X\beta + \varepsilon$ (1.1), where $y$ is an $n \times 1$ vector of observations on the response variable, $X$ is an $n \times K$ matrix of observations on the regressor variables with 1's in the first column, $\beta$ is a $K \times 1$ vector of unknown regression parameters and $\varepsilon$ is an $n \times 1$ vector of unknown random errors assumed to follow a normal distribution with zero mean and constant variance $\sigma^2$. Without loss of generality, we assume that the regressor variables are standardized in such a way that $X'X$ is in the form of a correlation matrix.
In the literature, various subset selection methods based on the least squares (LS) estimator are available, such as Mallows' $C_p$ [13] and the stepwise selection methods. Mallows' $C_p$ is one of the most popular subset selection methods. It is defined as $C_p = \mathrm{RSS}_p/\hat{\sigma}^2 - (n - 2p)$, where $\mathrm{RSS}_p$ is the residual sum of squares of the subset model based on $p$ regressor variables (counting the column of 1's), $\sigma^2$ is the error variance and is replaced by its suitable estimate $\hat{\sigma}^2 = \mathrm{RSS}_K/(n - K)$, where $\mathrm{RSS}_K$ is obtained from the LS estimator of $\beta$ in the full model based on $K$ regressor variables.
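As an illustration of the definition above, the following sketch computes Mallows' $C_p$ for a chosen subset, with $\sigma^2$ estimated from the full-model residuals; the function and variable names are ours, not from the paper.

```python
import numpy as np

def mallows_cp(y, X_full, subset_cols):
    """Mallows' C_p = RSS_p / sigma_hat^2 - (n - 2p) for a subset model.

    sigma^2 is estimated from the residuals of the full model, as in the text.
    `subset_cols` indexes the regressor columns kept in the subset model; an
    intercept column of 1's is prepended to both models.
    """
    n = len(y)
    ones = np.ones((n, 1))
    Xf = np.hstack([ones, X_full])
    Xs = np.hstack([ones, X_full[:, list(subset_cols)]])

    # least squares fits via lstsq (numerically safer than explicit inverses)
    rss = lambda X: np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    rss_full, rss_sub = rss(Xf), rss(Xs)

    sigma2_hat = rss_full / (n - Xf.shape[1])  # full-model variance estimate
    p = Xs.shape[1]                            # parameters in the subset model
    return rss_sub / sigma2_hat - (n - 2 * p)
```

A quick sanity check of the definition: for the full model itself, $\mathrm{RSS}_p = \mathrm{RSS}_K$, so $C_p = (n - K) - n + 2K = K$ exactly, the number of parameters.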
It is well known that the $C_p$ statistic is based on the LS estimator, and the LS estimator is very sensitive to the presence of outliers or to violation of the assumption of normality of the error variable (see Huber [9]). In the past three decades, many robust parameter estimation methods as well as subset selection methods have been devised. For instance, Ronchetti [17] proposed a robust version of AIC called RAIC, Ronchetti and Staudte [18] proposed a robust version of Mallows' $C_p$, Sommer and Huggins [19] proposed a criterion based on the Wald test statistic, Kim and Hwang [12] defined a related method, and Kashid and Kulkarni [11] proposed the $S_p$ criterion, a more general criterion for subset selection in the presence of outliers in the data. The $S_p$ criterion is operationally simple to implement compared to the other robust subset selection methods; it is defined as $S_p = \sum_{i=1}^{n}(\hat{y}_{iK} - \hat{y}_{ip})^2/\hat{\sigma}^2 + (2p - K)$, where $\hat{y}_{iK}$ and $\hat{y}_{ip}$ are the predicted values of $y_i$ based on the full model and the subset model respectively. The unknown $\sigma$ is replaced by its suitable estimate based on the full model as $\hat{\sigma} = \mathrm{median}_i\,|e_i|/0.6745$, where $e_i$ is the $i$th residual.
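A useful feature of the $S_p$ form is that it depends only on fitted values, so it can be evaluated for any estimator (LS, M, ridge, ...). The sketch below implements it under that reading, together with the common robust scale $\mathrm{median}|e_i|/0.6745$; treat the exact scale choice as an assumption on our part rather than the paper's prescription.

```python
import numpy as np

def robust_sigma(residuals):
    # Common robust scale estimate: median absolute residual / 0.6745.
    # (Assumed here; the text says only "a suitable estimate" from the
    # full-model residuals.)
    return np.median(np.abs(residuals)) / 0.6745

def sp_criterion(yhat_full, yhat_sub, sigma, p, K):
    """S_p = sum_i (yhat_iK - yhat_ip)^2 / sigma^2 + (2p - K).

    yhat_full / yhat_sub: fitted values from the full (K-parameter) and
    subset (p-parameter) models, produced by any estimator.
    """
    return np.sum((yhat_full - yhat_sub) ** 2) / sigma ** 2 + (2 * p - K)
```

With LS fits and $\hat\sigma^2 = \mathrm{RSS}_K/(n-K)$, nested-model orthogonality gives $\sum(\hat y_{iK}-\hat y_{ip})^2 = \mathrm{RSS}_p - \mathrm{RSS}_K$, so $S_p$ reduces algebraically to Mallows' $C_p$, matching the equivalence noted in the text.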
The presence of multicollinearity is also one of the most serious and frequently encountered problems in multiple linear regression. Due to the presence of multicollinearity, the variance of the LS estimator gets inflated; consequently, the LS estimator becomes unstable and gives misleading results. To overcome this problem, Hoerl and Kennard [5], [6] proposed the ordinary ridge regression (ORR) estimator. Recently, Dorugade and Kashid [3] proposed a statistic for subset selection based on the ORR estimator of $\beta$. It is defined analogously to $S_p$ as $\sum_{i=1}^{n}(\hat{y}_{iK}^{R} - \hat{y}_{ip}^{R})^2/\hat{\sigma}^2 + (2p - K)$, where $\sigma^2$ is the error variance and is replaced by its suitable estimate, and $\hat{y}_{iK}^{R}$ is obtained from the ORR estimator of $\beta$ based on the full model. The biasing constants $k_i \geq 0$ in the diagonal matrix $K = \mathrm{diag}(k_1, k_2, \ldots)$ are known as ridge parameters. Note that the above $S_p$ and ridge-based statistics are equivalent to Mallows' $C_p$ when the LS estimator is used. Though the $C_p$, $S_p$ and ridge-based statistics are used for correct subset selection in different situations, the subset selection procedure for these three statistics is the same and is given as follows.
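The ORR estimator referred to above has the standard closed form $\hat\beta_R = (X'X + kI)^{-1}X'y$. A minimal sketch (with a single scalar ridge parameter $k$, a simplification of the general diagonal-matrix case):

```python
import numpy as np

def ridge_estimator(X, y, k):
    """Ordinary ridge regression (ORR): beta_R = (X'X + kI)^(-1) X'y.

    X is assumed standardized so X'X is (close to) a correlation matrix, as
    in the text; k >= 0 is the ridge (biasing) parameter, and k = 0 recovers
    the LS estimator.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
```

For $k > 0$ every component of $\hat\beta_R$ in the eigenbasis of $X'X$ is shrunk, which is what stabilizes the estimator under multicollinearity at the cost of bias.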
Subset selection procedure based on the $C_p$, $S_p$ and ridge-based statistics
Step I. Compute the value of the statistic for all possible subset models.
Step II. Select a subset of minimum size for which the value of the statistic is close to '$p$', or plot the values of the statistic vs. '$p$' for all possible subset models and select the subset which lies closest to the line 'statistic $= p$'.
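The two steps above can be sketched as an exhaustive search, here with the $C_p$ statistic and a tolerance for "close to $p$"; the tolerance and the tie-breaking by subset size are our own illustrative choices.

```python
import numpy as np
from itertools import combinations

def select_subset(y, X, tol=2.0):
    """Step I-II sketch: compute C_p for every subset model and return the
    smallest subset whose C_p lies within `tol` of p (parameter count,
    including the intercept)."""
    n, q = X.shape
    ones = np.ones((n, 1))

    def rss(cols):
        Xs = np.hstack([ones, X[:, list(cols)]])
        b = np.linalg.lstsq(Xs, y, rcond=None)[0]
        return np.sum((y - Xs @ b) ** 2)

    sigma2 = rss(range(q)) / (n - q - 1)   # full-model variance estimate
    for size in range(1, q + 1):           # smallest subsets first (Step II)
        for cols in combinations(range(q), size):
            p = size + 1
            cp = rss(cols) / sigma2 - (n - 2 * p)
            if abs(cp - p) <= tol:
                return list(cols), cp
    # Fall back to the full model, whose C_p equals q + 1 exactly.
    return list(range(q)), q + 1.0
```

For the $S_p$ or ridge-based statistics the same loop applies with the corresponding statistic substituted in Step I.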
Many researchers have pointed out that the M-estimator is a better alternative to the LS estimator in the presence of outliers (see Birkes and Dodge [1]) and that the ORR estimator performs better in the presence of multicollinearity (see [5], [6], [7]). Birkes and Dodge [1] and Montgomery et al. [16] have described these methods in the context of parameter estimation. However, these methods give misleading results when outliers and multicollinearity occur simultaneously in the data (see Jadhav and Kashid [10]).
To overcome the problem of simultaneous occurrence of outliers and multicollinearity, very recently, Jadhav and Kashid [10] proposed an estimator known as the Jackknifed Ridge M-estimator (JRM). They showed that the performance of the JRM estimator is better in the mean squared error sense when outliers and multicollinearity are present in the data.
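One plausible construction of such an estimator, sketched below under explicit assumptions, combines a Huber M-estimate (computed by IRLS) with the classical jackknifed-ridge shrinkage operator $[I - k^2(X'X + kI)^{-2}]$. The exact JRM definition is in Jadhav and Kashid [10]; this sketch only illustrates the idea of stacking a ridge-type shrinkage on a robust estimate.

```python
import numpy as np

def huber_m_estimator(X, y, c=1.345, n_iter=50):
    """Huber M-estimator via iteratively reweighted least squares (IRLS)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # LS starting value
    for _ in range(n_iter):
        r = y - X @ beta
        s = np.median(np.abs(r)) / 0.6745 + 1e-12      # robust scale
        u = np.abs(r / s)
        w = np.minimum(1.0, c / np.maximum(u, 1e-12))  # Huber weights
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        if np.allclose(beta_new, beta, atol=1e-10):
            break
        beta = beta_new
    return beta

def jrm_estimator(X, y, k, c=1.345):
    """Sketch of a jackknifed-ridge M (JRM) estimator: apply the shrinkage
    operator [I - k^2 (X'X + kI)^(-2)] to the M-estimate.
    (This specific form is an assumption; see Jadhav and Kashid [10].)"""
    p = X.shape[1]
    A = np.linalg.inv(X.T @ X + k * np.eye(p))
    beta_m = huber_m_estimator(X, y, c=c)
    return (np.eye(p) - k ** 2 * A @ A) @ beta_m
```

With $k = 0$ the shrinkage operator is the identity and the M-estimate is returned unchanged; for $k > 0$ the coefficients are shrunk, mirroring the LS-to-ridge relationship.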
In this article, we propose a generalized criterion for subset selection based on the JRM estimator for the case where outliers and multicollinearity occur simultaneously in the data.
The rest of the article is organized as follows. In Section 2, the effect of the presence of multicollinearity and outliers on the existing subset selection criteria is demonstrated. Section 3 briefly introduces the various estimators used in this article. In Section 4, the motivation for a new subset selection criterion is presented and a subset selection criterion based on the JRM estimator is defined. Some results and the equivalence of the proposed statistic with the $C_p$ and $S_p$ statistics are presented in Section 5. In Section 6, simulated data sets are considered to illustrate the performance of the proposed method; the correct model selection ability of the proposed statistic and the performance of various robust estimates of $\sigma$ are also presented there. Finally, the article ends with some concluding remarks.
Section snippets
The problem
This section illustrates the problem of outliers and multicollinearity from the viewpoint of subset selection. Its purpose is to highlight the effect of the simultaneous occurrence of outliers and multicollinearity on the subset selection criteria based on the LS estimator ($C_p$), the M-estimator ($S_p$) and the ORR estimator.
A simulation design given by McDonald and Galarneau [15] is used to introduce multicollinearity in the regressor variables as follows.
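The McDonald-Galarneau design generates regressors as $x_{ij} = (1-\gamma^2)^{1/2} z_{ij} + \gamma z_{i,q+1}$ with independent standard normal $z$'s, so that any two regressors have correlation close to $\gamma^2$. A sketch (function name ours):

```python
import numpy as np

def mcdonald_galarneau_X(n, q, gamma, seed=None):
    """McDonald-Galarneau design:
        x_ij = sqrt(1 - gamma^2) * z_ij + gamma * z_i,(q+1),
    with z's i.i.d. standard normal, giving unit-variance regressors whose
    pairwise correlation is approximately gamma^2."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, q + 1))
    # Shared column Z[:, q] induces the common correlation.
    return np.sqrt(1 - gamma ** 2) * Z[:, :q] + gamma * Z[:, [q]]
```

Choosing $\gamma$ close to 1 (e.g. $\gamma = 0.99$, correlation $\approx 0.98$) produces the severe multicollinearity needed for the experiments described in this section.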
The estimators
In multiple linear regression, an important task is to estimate the unknown regression parameters using an appropriate method of estimation. In this section, some existing methods for estimating $\beta$ are briefly discussed.
Least squares (LS) estimator
For the multiple linear regression model given in Eq. (1.1), the LS estimator of the unknown regression parameters is defined as $\hat{\beta} = (X'X)^{-1}X'y$. Any standard textbook of regression, like Draper and Smith [4] or Montgomery et al.
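In code, the normal-equations form of the LS estimator is one line; solving the system directly is preferable to forming the explicit inverse:

```python
import numpy as np

def ls_estimator(X, y):
    """LS estimator beta_hat = (X'X)^(-1) X'y, computed by solving the
    normal equations rather than inverting X'X explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```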
Proposed method
Consider the multiple linear regression model given in Eq. (1.1). Then the vector of predicted values of $y$ based on the JRM estimator of $\beta$ is $\hat{y}_K = X\hat{\beta}_{JRM} = H_{JRM}\,y$, where $H_{JRM}$ is the prediction matrix based on the full model. The full model is the one which contains all regressor variables.
The model given in Eq. (1.1) can be written as $y = X_p\beta_p + X_r\beta_r + \varepsilon$, where $X$ and $\beta$ are partitioned as $X = [X_p\ X_r]$ and $\beta = (\beta_p', \beta_r')'$. The matrix $X_p$ is of order $n \times p$ with 1's in the first column and the matrix $X_r$ is
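For intuition about the prediction-matrix idea used above, the familiar LS case is easy to exhibit: $H = X(X'X)^{-1}X'$ satisfies $\hat y = Hy$, is symmetric and idempotent, and its trace equals the number of parameters. (The JRM prediction matrix differs, but plays the same role.)

```python
import numpy as np

def prediction_matrix(X):
    """LS prediction (hat) matrix H = X (X'X)^(-1) X', so y_hat = H y.

    H is symmetric and idempotent (H @ H == H), and trace(H) equals the
    number of columns of X when X has full column rank.
    """
    return X @ np.linalg.solve(X.T @ X, X.T)
```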
Some results
In this section, we present some results to support the use of the proposed criterion for selecting the correct subset model. We also derive the equivalence of the proposed statistic with the $C_p$ and $S_p$ statistics. Result 5.1. If the subset model is adequate, then
Proof. Let $\hat{y}_{iK}$ and $\hat{y}_{ip}$ be the $i$th predicted values of $y$ based on the JRM estimator for the full model and the subset model, respectively. Then we can write
Simulation study
A simulation study is carried out to illustrate the performance of the proposed method. It is divided into three subsections. Section 6.1 illustrates the performance of the proposed and existing criteria through numerical examples for all combinations of absence and presence of outliers and multicollinearity. The correct model selection ability of these criteria is evaluated in Section 6.2. Various choices of the estimator of $\sigma$ are considered in Section 6.3 and their
Conclusion
We have developed a subset selection procedure based on the JRM estimator of the unknown regression parameters. This method works well for clean data, in the presence of only outliers, only multicollinearity, or both outliers and multicollinearity. The performance of the proposed method is also evaluated in the presence of more than one outlying observation together with multicollinearity in the data. The correct model selection ability of the proposed method is also obtained. It
Acknowledgments
The authors thank the anonymous reviewers for their valuable comments and constructive suggestions, which substantially improved the quality of the manuscript. This research was supported by the University Grants Commission, New Delhi, India under the Major Research Project Scheme.
References (19)
- E. Ronchetti, Robust model selection in regression, Statist. Probab. Lett. (1985)
- D. Birkes, Y. Dodge, Alternative Methods of Regression (1993)
- S. Chatterjee, A.S. Hadi, Sensitivity Analysis in Linear Regression (1988)
- A.V. Dorugade, D.N. Kashid, Variable selection in linear regression based on ridge estimator, J. Stat. Comput. Simul. (2010)
- N.R. Draper, H. Smith, Applied Regression Analysis (2003)
- A.E. Hoerl, R.W. Kennard, Ridge regression: biased estimation for nonorthogonal problems, Technometrics (1970)
- A.E. Hoerl, R.W. Kennard, Ridge regression: applications to nonorthogonal problems, Technometrics (1970)
- A.E. Hoerl, R.W. Kennard, K.F. Baldwin, Ridge regression: some simulations, Commun. Stat. (1975)
- P.J. Huber, Robust regression: asymptotics, conjectures and Monte Carlo, Ann. Statist. (1973)