Statistical Methodology

Volume 19, July 2014, Pages 44–59

Subset selection in multiple linear regression in the presence of outlier and multicollinearity

https://doi.org/10.1016/j.stamet.2014.02.002

Abstract

Various subset selection methods are based on the least squares parameter estimation method. These methods do not perform reasonably well in the presence of outliers, multicollinearity, or both. A few subset selection methods based on the M-estimator are available in the literature for data containing outliers, and very few account for the problem of multicollinearity through the ridge regression estimator.

In this article, we develop a generalized version of the Sp statistic based on the jackknifed ridge M-estimator for subset selection in the presence of outliers and multicollinearity. We establish the equivalence of this statistic with the existing Cp, Sp and Rp statistics. The performance of the proposed method is illustrated through numerical examples, and its correct model selection ability is evaluated using a simulation study.

Introduction

Consider the multiple linear regression model $$Y = X\beta + \varepsilon, \qquad (1.1)$$ where $Y$ is a vector of $n$ observations on the response variable, $X$ is an $n \times k$ matrix of $n$ observations on the $(k-1)$ regressor variables with 1's in the first column, $\beta = (\beta_0, \beta_1, \ldots, \beta_{k-1})'$ is a vector of $k$ unknown regression parameters and $\varepsilon$ is an unknown random error assumed to follow a normal distribution with zero mean and constant variance $\sigma^2$. Without loss of generality, we assume that the regressor variables are standardized in such a way that $X'X$ is in the form of a correlation matrix.
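For concreteness, this standardization can be sketched as follows (a hypothetical helper, not from the paper; it is applied to the regressor columns, not to the intercept column):

```python
import numpy as np

def standardize(X):
    """Center and scale each regressor column of X so that Z'Z has
    unit diagonal and sample correlations off the diagonal."""
    Z = X - X.mean(axis=0)
    return Z / np.sqrt((Z ** 2).sum(axis=0))
```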

In the literature, various subset selection methods based on the least squares (LS) estimator are available, such as Mallows' Cp [13] and the stepwise selection methods. Mallows' Cp is one of the most popular subset selection methods. It is defined as $$C_p = \frac{RSS_p}{\hat\sigma^2} - (n - 2p),$$ where $RSS_p$ is the residual sum of squares of the subset model based on $(p-1)$ regressor variables and the error variance $\sigma^2$ is replaced by its suitable estimate $\hat\sigma^2 = (Y - X\hat\beta_{LS})'(Y - X\hat\beta_{LS})/(n-k)$, with $\hat\beta_{LS}$ the LS estimator of $\beta$ for the full model based on all $(k-1)$ regressor variables.
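As an illustration, a minimal sketch of this computation (hypothetical helper `mallows_cp`; both design matrices are assumed to contain the intercept column):

```python
import numpy as np

def mallows_cp(X_full, X_sub, y):
    """Mallows' Cp = RSS_p / sigma2_hat - (n - 2p), with sigma^2
    estimated from the full model as RSS_full / (n - k)."""
    n, k = X_full.shape                     # k columns, intercept included
    p = X_sub.shape[1]                      # p columns, intercept included
    beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
    beta_sub, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
    rss_full = np.sum((y - X_full @ beta_full) ** 2)
    rss_sub = np.sum((y - X_sub @ beta_sub) ** 2)
    sigma2_hat = rss_full / (n - k)
    return rss_sub / sigma2_hat - (n - 2 * p)
```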

It is well known that the Cp statistic is based on the LS estimator, and the LS estimator is very sensitive to the presence of outliers or to violation of the normality assumption on the error variable (see Huber [9]). In the past three decades, many robust parameter estimation methods as well as subset selection methods have been devised. For instance, Ronchetti [17] proposed a robust version of AIC called RAIC, Ronchetti and Staudte [18] proposed a robust version of Mallows' Cp called RCp, Sommer and Huggins [19] proposed the RTp criterion based on the Wald test statistic, Kim and Hwang [12] defined the Cp(k) method, and Kashid and Kulkarni [11] proposed the Sp criterion, a more general criterion for subset selection in the presence of outliers in the data. The Sp criterion is operationally simpler to implement than the other robust subset selection methods; it is defined as $$S_p = \frac{\sum_{i=1}^{n}(\hat Y_{ik} - \hat Y_{ip})^2}{\hat\sigma^2} - (k - 2p),$$ where $\hat Y_{ik}$ and $\hat Y_{ip}$ are the predicted values of $Y_i$ based on the full model and the subset model, respectively. The unknown $\sigma$ is replaced by its suitable estimate based on the full model, $\hat\sigma = 1.4826\,\mathrm{median}_i\,|r_i - \mathrm{median}(r_i)|$, where $r_i$ is the $i$th residual.
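A sketch of the Sp computation is given below; it uses statsmodels' RLM with Huber's psi function as a stand-in for the paper's M-estimator (the authors' exact psi function and tuning constants may differ):

```python
import numpy as np
import statsmodels.api as sm

def sp_statistic(X_full, X_sub, y):
    """Sp = sum_i (Yhat_ik - Yhat_ip)^2 / sigma_hat^2 - (k - 2p),
    with the MAD-type scale estimate from the full-model residuals."""
    k = X_full.shape[1]
    p = X_sub.shape[1]
    fit_full = sm.RLM(y, X_full, M=sm.robust.norms.HuberT()).fit()
    fit_sub = sm.RLM(y, X_sub, M=sm.robust.norms.HuberT()).fit()
    r = fit_full.resid
    sigma_hat = 1.4826 * np.median(np.abs(r - np.median(r)))
    diff2 = np.sum((fit_full.fittedvalues - fit_sub.fittedvalues) ** 2)
    return diff2 / sigma_hat ** 2 - (k - 2 * p)
```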

The presence of multicollinearity is also one of the most serious and frequently encountered problems in multiple linear regression. Due to multicollinearity, the variance of the LS estimator gets inflated; consequently, the LS estimator becomes unstable and gives misleading results. To overcome this problem, Hoerl and Kennard [5], [6] proposed the ordinary ridge regression (ORR) estimator. Recently, Dorugade and Kashid [3] proposed the Rp statistic for subset selection based on the ORR estimator of $\beta$. It is defined as $$R_p = \frac{\sum_{i=1}^{n}(\hat Y_{ik} - \hat Y_{ip})^2}{\hat\sigma^2} - \mathrm{tr}(H_R'H_R) + \mathrm{tr}(H_{RA}'H_{RA}) + p,$$ where $\sigma^2$ is the error variance, replaced by its suitable estimate $\hat\sigma^2 = (Y - X\hat\beta_R)'(Y - X\hat\beta_R)/(n-k)$, and $\hat\beta_R$ is the ORR estimator of $\beta$ based on the full model. Here $H_R = X(X'X + rI)^{-1}X'$ and $H_{RA} = X_A(X_A'X_A + r_AI)^{-1}X_A'$, where $r$ and $r_A$ are the biasing constants known as ridge parameters. Note that the above Sp and Rp statistics are equivalent to Mallows' Cp when the LS estimator is used. Although the Cp, Sp and Rp statistics are used for correct subset selection in different situations, the subset selection procedure for all three statistics is the same and is given as follows.
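A sketch following the reconstruction above (hypothetical helpers `ridge_hat` and `rp_statistic`; the choice of the ridge parameters $r$ and $r_A$ is left to the user, as the paper's rule for selecting them is not shown in this excerpt):

```python
import numpy as np

def ridge_hat(X, r):
    """Ridge prediction matrix H_R = X (X'X + r I)^{-1} X'."""
    return X @ np.linalg.solve(X.T @ X + r * np.eye(X.shape[1]), X.T)

def rp_statistic(X_full, X_sub, y, r_full, r_sub):
    """Rp = sum_i (Yhat_ik - Yhat_ip)^2 / sigma2_hat
            - tr(H_R' H_R) + tr(H_RA' H_RA) + p."""
    n, k = X_full.shape
    p = X_sub.shape[1]
    H_R = ridge_hat(X_full, r_full)
    H_RA = ridge_hat(X_sub, r_sub)
    beta_r = np.linalg.solve(X_full.T @ X_full + r_full * np.eye(k), X_full.T @ y)
    sigma2_hat = np.sum((y - X_full @ beta_r) ** 2) / (n - k)
    diff2 = np.sum((H_R @ y - H_RA @ y) ** 2)
    return (diff2 / sigma2_hat
            - np.trace(H_R.T @ H_R) + np.trace(H_RA.T @ H_RA) + p)
```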

Subset selection procedure based on Cp,Sp and Rp statistics

Step I. Compute the value of the statistic for all possible subset models.

Step II. Select a subset of minimum size for which the value of the statistic is close to $p$; alternatively, plot the values of the statistic against $p$ for all possible subset models and select the subset closest to the line 'statistic $= p$' (a sketch of this procedure is given below).
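One simple operationalization of Steps I and II (hypothetical helper `select_subset`; `criterion` may be any of the statistics above, and column 0 of `X_full` is taken to be the intercept):

```python
from itertools import combinations
import numpy as np

def select_subset(X_full, y, criterion):
    """Step I: compute the criterion for every subset model that keeps
    the intercept. Step II: return the subset whose criterion value is
    closest to p, preferring smaller subsets on ties."""
    n, k = X_full.shape
    best_gap, best_cols = np.inf, None
    for p in range(2, k + 1):                    # intercept + (p - 1) regressors
        for cols in combinations(range(1, k), p - 1):
            X_sub = X_full[:, (0,) + cols]
            gap = abs(criterion(X_full, X_sub, y) - p)
            if gap < best_gap:                   # strict '<' keeps the smaller subset on ties
                best_gap, best_cols = gap, cols
    return best_cols
```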

Many researchers have pointed out that the M-estimator is a better alternative to the LS estimator in the presence of outliers (see Birkes and Dodge [1]) and that the ORR estimator performs better in the presence of multicollinearity (see [5], [6], [7]). Birkes and Dodge [1] and Montgomery et al. [16] have described these methods in the context of parameter estimation. However, these methods give misleading results when outliers and multicollinearity occur simultaneously in the data (see Jadhav and Kashid [10]).

To overcome the problem of the simultaneous occurrence of outliers and multicollinearity, Jadhav and Kashid [10] recently proposed an estimator known as the jackknifed ridge M- (JRM) estimator. They showed that the JRM estimator performs better in the mean square error sense when outliers and multicollinearity are present in the data.

In this article, we propose a generalized Sp criterion, called the GSp criterion, for subset selection based on the JRM estimator when outliers and multicollinearity occur simultaneously in the data.

The rest of the article is organized as follows. In Section 2, the effect of the presence of multicollinearity and outliers on the existing subset selection criteria is demonstrated. Section 3 briefly introduces the various estimators used in this article. In Section 4, the motivation for a new subset selection criterion is presented and a subset selection criterion based on the JRM estimator is defined. Some results and the equivalence of the GSp statistic with the Cp, Rp and Sp statistics are presented in Section 5. In Section 6, simulated data sets are considered to illustrate the performance of the proposed method; the correct model selection ability of the GSp statistic and the performance of various robust estimates of $\sigma^2$ are also presented there. Finally, the article ends with some concluding remarks.


The problem

This section illustrates the problems of outliers and multicollinearity from the viewpoint of subset selection. The purpose of this section is to highlight the effect of the simultaneous occurrence of outliers and multicollinearity on the subset selection criteria based on the LS estimator (Cp), the M-estimator (Sp) and the ORR estimator (Rp).

A simulation design given by McDonald and Galarneau [15] is used to introduce multicollinearity in the regressor variables as follows: $$x_{ij} = (1-\rho^2)^{1/2} z_{ij} + \rho z_{i(l+1)}, \qquad i = 1, 2, \ldots, n,$$ where the $z_{ij}$ are independent standard normal variates.
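A sketch of this design (hypothetical helper `mcdonald_galarneau`; the z's are taken to be independent standard normal variates, as in McDonald and Galarneau [15]):

```python
import numpy as np

def mcdonald_galarneau(n, l, rho, seed=None):
    """Generate n observations on l regressors with
    x_ij = sqrt(1 - rho^2) * z_ij + rho * z_i(l+1),
    so that rho^2 governs the correlation between regressors."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, l + 1))
    return np.sqrt(1 - rho ** 2) * Z[:, :l] + rho * Z[:, [l]]
```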

The estimators

In multiple linear regression, an important task is to estimate the unknown regression parameters $\beta$ using an appropriate method of estimation. In this section, some of the existing methods for estimating $\beta$ are briefly discussed.

Least squares (LS) estimator

For the multiple linear regression model given in Eq. (1.1), the LS estimator of the unknown regression parameters $\beta$ is defined as $$\hat\beta_{LS} = (X'X)^{-1}X'Y.$$ Any standard textbook on regression, such as Draper and Smith [4] or Montgomery et al.

Proposed method

Consider the multiple linear regression model given in Eq. (1.1). Then the vector of predicted values of $Y$ based on the JRM estimator of $\beta$ is $\hat Y_k = X\hat\beta_{JRM} = HY$, where $H = XR(X'WX)^{-1}X'W$ is the prediction matrix based on the full model. The full model is the one which contains all $(k-1)$ regressor variables.
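A minimal sketch of this prediction matrix, taking the JRM ingredients R (the ridge/jackknife adjustment matrix) and W (the diagonal matrix of M-estimation weights) as given, since their construction is detailed in Jadhav and Kashid [10] rather than in this excerpt:

```python
import numpy as np

def jrm_hat(X, R, W):
    """Prediction matrix H = X R (X' W X)^{-1} X' W, so that
    Yhat_k = H @ y; R and W come from the fitted JRM estimator."""
    return X @ R @ np.linalg.solve(X.T @ W @ X, X.T @ W)
```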

The model given in Eq. (1.1) can be written as $$Y = X_A\beta_A + X_B\beta_B + \varepsilon,$$ where $X$ and $\beta$ are partitioned as $X = [X_A : X_B]$ and $\beta = [\beta_A' : \beta_B']'$. The matrix $X_A$ is of order $n \times p$ with 1's in the first column and the matrix $X_B$ is

Some results

In this section, we present some results to support the use of the proposed criterion for selecting the correct subset model. We also derive the equivalence of the proposed GSp statistic with the Cp, Sp and Rp statistics.

Result 5.1

If the subset model is adequate, then $$E\!\left(\sum_{i=1}^{n}(\hat Y_{ik} - \hat Y_{ip})^2\right) \approx \sigma^2\,\mathrm{tr}[(H - H_1)'(H - H_1)].$$

Proof

Let $\hat Y_{ik}$ and $\hat Y_{ip}$ be the $i$th predicted values of $Y$ based on the JRM estimator for the full model and the subset model, respectively. Then we can write $$\sum_{i=1}^{n}(\hat Y_{ik} - \hat Y_{ip})^2 = (\hat Y_k - \hat Y_p)'(\hat Y_k - \hat Y_p) = \ldots$$
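In the LS special case (W = I and R = I, so that H and H1 reduce to the usual hat matrices), the relation in Result 5.1 holds with equality for an adequate subset and can be checked by Monte Carlo; a sketch with assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, p, sigma = 50, 4, 3, 1.0
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
X_sub = X[:, :p]                                # subset model: first p columns
beta = np.array([1.0, 2.0, -1.5, 0.0])          # zero coefficient outside the subset -> adequate
H = X @ np.linalg.pinv(X)                       # full-model LS hat matrix
H1 = X_sub @ np.linalg.pinv(X_sub)              # subset-model LS hat matrix

mc = np.mean([np.sum(((H - H1) @ (X @ beta + sigma * rng.standard_normal(n))) ** 2)
              for _ in range(20000)])
print(mc)                                          # Monte Carlo E[sum (Yhat_ik - Yhat_ip)^2]
print(sigma**2 * np.trace((H - H1).T @ (H - H1)))  # sigma^2 tr[(H - H1)'(H - H1)]
```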

Simulation study

A simulation study is carried out to illustrate the performance of the proposed method. The study is divided into three subsections. Section 6.1 illustrates the performance of the Cp, Sp, Rp and GSp criteria through numerical examples for all combinations of absence and presence of outliers and multicollinearity. The correct model selection ability of these criteria is evaluated in Section 6.2. Also, various choices of the estimator of $\sigma^2$ are considered in Section 6.3 and their

Conclusion

We have developed a subset selection procedure based on the JRM estimator of the unknown regression parameters. This method works well for clean data, as well as in the presence of only outliers, only multicollinearity, or both. Also, the performance of the proposed method is evaluated in the presence of more than one outlying observation together with multicollinearity in the data. The correct model selection ability of the proposed method is also obtained. It

Acknowledgments

The authors thank the anonymous reviewers for their valuable comments and constructive suggestions, which substantially improved the quality of the manuscript. This research was supported by the University Grants Commission, New Delhi, India under the Major Research Project Scheme.

References (19)

  • E. Ronchetti, Robust model selection in regression, Statist. Probab. Lett. (1985)
  • D. Birkes et al., Alternative Methods of Regression (1993)
  • S. Chatterjee et al., Sensitivity Analysis in Linear Regression (1988)
  • A.V. Dorugade et al., Variable selection in linear regression based on ridge estimator, J. Stat. Comput. Simul. (2010)
  • N.R. Draper et al., Applied Regression Analysis (2003)
  • A.E. Hoerl et al., Ridge regression: biased estimation for nonorthogonal problems, Technometrics (1970)
  • A.E. Hoerl et al., Ridge regression: applications to nonorthogonal problems, Technometrics (1970)
  • A.E. Hoerl et al., Ridge regression: some simulations, Commun. Stat. (1975)
  • P.J. Huber, Robust regression: asymptotics, conjectures, and Monte Carlo, Ann. Statist. (1973)
