Statistical Methodology

Volume 19, July 2014, Pages 44–59

Subset selection in multiple linear regression in the presence of outlier and multicollinearity

https://doi.org/10.1016/j.stamet.2014.02.002

Abstract

Various subset selection methods are based on the least squares parameter estimation method. These methods do not perform reasonably well in the presence of outliers, multicollinearity, or both. A few subset selection methods based on the M-estimator are available in the literature for data containing outliers, and very few account for the problem of multicollinearity through the ridge regression estimator.

In this article, we develop a generalized version of the Sp statistic based on the jackknifed ridge M-estimator for subset selection in the presence of outliers and multicollinearity. We establish the equivalence of this statistic with the existing Cp, Sp and Rp statistics. The performance of the proposed method is illustrated through numerical examples, and its correct model selection ability is evaluated using a simulation study.

Introduction

Consider the multiple linear regression model $$Y = X\beta + \varepsilon, \qquad (1.1)$$ where $Y$ is a vector of $n$ observations on the response variable, $X$ is an $n \times k$ matrix of $n$ observations on the $(k-1)$ regressor variables with 1's in the first column, $\beta = (\beta_0, \beta_1, \ldots, \beta_{k-1})'$ is a vector of $k$ unknown regression parameters and $\varepsilon$ is an unknown random error assumed to follow a normal distribution with zero mean and constant variance $\sigma^2$. Without loss of generality, we assume that the regressor variables are standardized in such a way that $X'X$ is in the form of a correlation matrix.
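For concreteness, this standardization can be sketched as follows (a hypothetical helper, not from the paper; it is applied to the regressor columns, not to the intercept column):

```python
import numpy as np

def standardize(X):
    """Center and scale each regressor column of X so that Z'Z has
    unit diagonal and sample correlations off the diagonal."""
    Z = X - X.mean(axis=0)
    return Z / np.sqrt((Z ** 2).sum(axis=0))
```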

In the literature, various subset selection methods based on the least squares (LS) estimator are available, such as Mallows' Cp [13] and the stepwise selection methods. Mallows' Cp is one of the most popular subset selection methods. It is defined as $$C_p = \frac{RSS_p}{\hat\sigma^2} - (n - 2p),$$ where $RSS_p$ is the residual sum of squares of the subset model based on $(p-1)$ regressor variables and the error variance $\sigma^2$ is replaced by its suitable estimate $\hat\sigma^2 = (Y - X\hat\beta_{LS})'(Y - X\hat\beta_{LS})/(n-k)$, with $\hat\beta_{LS}$ the LS estimator of $\beta$ for the full model based on all $(k-1)$ regressor variables.
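As an illustration, a minimal sketch of this computation (hypothetical helper `mallows_cp`; both design matrices are assumed to contain the intercept column):

```python
import numpy as np

def mallows_cp(X_full, X_sub, y):
    """Mallows' Cp = RSS_p / sigma2_hat - (n - 2p), with sigma^2
    estimated from the full model as RSS_full / (n - k)."""
    n, k = X_full.shape                     # k columns, intercept included
    p = X_sub.shape[1]                      # p columns, intercept included
    beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
    beta_sub, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
    rss_full = np.sum((y - X_full @ beta_full) ** 2)
    rss_sub = np.sum((y - X_sub @ beta_sub) ** 2)
    sigma2_hat = rss_full / (n - k)
    return rss_sub / sigma2_hat - (n - 2 * p)
```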

It is well known that the Cp statistic is based on the LS estimator, and the LS estimator is very sensitive to the presence of outliers or to violation of the normality assumption on the error variable (see Huber [9]). In the past three decades, many robust parameter estimation methods as well as subset selection methods have been devised. For instance, Ronchetti [17] proposed a robust version of AIC called RAIC, Ronchetti and Staudte [18] proposed a robust version of Mallows' Cp called RCp, Sommer and Huggins [19] proposed the RTp criterion based on the Wald test statistic, Kim and Hwang [12] defined the Cp(k) method, and Kashid and Kulkarni [11] proposed the Sp criterion, a more general criterion for subset selection in the presence of outliers in the data. The Sp criterion is operationally simpler to implement than the other robust subset selection methods; it is defined as $$S_p = \frac{\sum_{i=1}^{n}(\hat Y_{ik} - \hat Y_{ip})^2}{\hat\sigma^2} - (k - 2p),$$ where $\hat Y_{ik}$ and $\hat Y_{ip}$ are the predicted values of $Y_i$ based on the full model and the subset model, respectively. The unknown $\sigma$ is replaced by its suitable estimate based on the full model, $\hat\sigma = 1.4826\,\mathrm{median}_i\,|r_i - \mathrm{median}(r_i)|$, where $r_i$ is the $i$th residual.
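A sketch of the Sp computation is given below; it uses statsmodels' RLM with Huber's psi function as a stand-in for the paper's M-estimator (the authors' exact psi function and tuning constants may differ):

```python
import numpy as np
import statsmodels.api as sm

def sp_statistic(X_full, X_sub, y):
    """Sp = sum_i (Yhat_ik - Yhat_ip)^2 / sigma_hat^2 - (k - 2p),
    with the MAD-type scale estimate from the full-model residuals."""
    k = X_full.shape[1]
    p = X_sub.shape[1]
    fit_full = sm.RLM(y, X_full, M=sm.robust.norms.HuberT()).fit()
    fit_sub = sm.RLM(y, X_sub, M=sm.robust.norms.HuberT()).fit()
    r = fit_full.resid
    sigma_hat = 1.4826 * np.median(np.abs(r - np.median(r)))
    diff2 = np.sum((fit_full.fittedvalues - fit_sub.fittedvalues) ** 2)
    return diff2 / sigma_hat ** 2 - (k - 2 * p)
```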

The presence of multicollinearity is also one of the most serious and frequently encountered problems in multiple linear regression. Due to multicollinearity, the variance of the LS estimator gets inflated; consequently, the LS estimator becomes unstable and gives misleading results. To overcome this problem, Hoerl and Kennard [5], [6] proposed the ordinary ridge regression (ORR) estimator. Recently, Dorugade and Kashid [3] proposed the Rp statistic for subset selection based on the ORR estimator of $\beta$. It is defined as $$R_p = \frac{\sum_{i=1}^{n}(\hat Y_{ik} - \hat Y_{ip})^2}{\hat\sigma^2} - \mathrm{tr}(H_R'H_R) + \mathrm{tr}(H_{RA}'H_{RA}) + p,$$ where $\sigma^2$ is the error variance, replaced by its suitable estimate $\hat\sigma^2 = (Y - X\hat\beta_R)'(Y - X\hat\beta_R)/(n-k)$, and $\hat\beta_R$ is the ORR estimator of $\beta$ based on the full model. Here $H_R = X(X'X + rI)^{-1}X'$ and $H_{RA} = X_A(X_A'X_A + r_AI)^{-1}X_A'$, where $r$ and $r_A$ are the biasing constants known as ridge parameters. Note that the above Sp and Rp statistics are equivalent to Mallows' Cp when the LS estimator is used. Although the Cp, Sp and Rp statistics are used for correct subset selection in different situations, the subset selection procedure for all three statistics is the same and is given as follows.
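A sketch following the reconstruction above (hypothetical helpers `ridge_hat` and `rp_statistic`; the choice of the ridge parameters $r$ and $r_A$ is left to the user, as the paper's rule for selecting them is not shown in this excerpt):

```python
import numpy as np

def ridge_hat(X, r):
    """Ridge prediction matrix H_R = X (X'X + r I)^{-1} X'."""
    return X @ np.linalg.solve(X.T @ X + r * np.eye(X.shape[1]), X.T)

def rp_statistic(X_full, X_sub, y, r_full, r_sub):
    """Rp = sum_i (Yhat_ik - Yhat_ip)^2 / sigma2_hat
            - tr(H_R' H_R) + tr(H_RA' H_RA) + p."""
    n, k = X_full.shape
    p = X_sub.shape[1]
    H_R = ridge_hat(X_full, r_full)
    H_RA = ridge_hat(X_sub, r_sub)
    beta_r = np.linalg.solve(X_full.T @ X_full + r_full * np.eye(k), X_full.T @ y)
    sigma2_hat = np.sum((y - X_full @ beta_r) ** 2) / (n - k)
    diff2 = np.sum((H_R @ y - H_RA @ y) ** 2)
    return (diff2 / sigma2_hat
            - np.trace(H_R.T @ H_R) + np.trace(H_RA.T @ H_RA) + p)
```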

Subset selection procedure based on Cp,Sp and Rp statistics

Step I. Compute the value of the statistic for all possible subset models.

Step II. Select a subset of minimum size for which the value of the statistic is close to $p$; alternatively, plot the values of the statistic against $p$ for all possible subset models and select the subset closest to the line 'statistic $= p$' (a sketch of this procedure is given below).
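One simple operationalization of Steps I and II (hypothetical helper `select_subset`; `criterion` may be any of the statistics above, and column 0 of `X_full` is taken to be the intercept):

```python
from itertools import combinations
import numpy as np

def select_subset(X_full, y, criterion):
    """Step I: compute the criterion for every subset model that keeps
    the intercept. Step II: return the subset whose criterion value is
    closest to p, preferring smaller subsets on ties."""
    n, k = X_full.shape
    best_gap, best_cols = np.inf, None
    for p in range(2, k + 1):                    # intercept + (p - 1) regressors
        for cols in combinations(range(1, k), p - 1):
            X_sub = X_full[:, (0,) + cols]
            gap = abs(criterion(X_full, X_sub, y) - p)
            if gap < best_gap:                   # strict '<' keeps the smaller subset on ties
                best_gap, best_cols = gap, cols
    return best_cols
```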

Many researchers have pointed out that the M-estimator is a better alternative to the LS estimator in the presence of outliers (see Birkes and Dodge [1]) and that the ORR estimator performs better in the presence of multicollinearity (see [5], [6], [7]). Birkes and Dodge [1] and Montgomery et al. [16] have described these methods in the context of parameter estimation. However, these methods give misleading results when outliers and multicollinearity occur simultaneously in the data (see Jadhav and Kashid [10]).

To overcome the problem of the simultaneous occurrence of outliers and multicollinearity, Jadhav and Kashid [10] recently proposed an estimator known as the jackknifed ridge M- (JRM) estimator. They showed that the JRM estimator performs better in the mean square error sense when outliers and multicollinearity are present in the data.

In this article, we propose a generalized Sp criterion, called the GSp criterion, for subset selection based on the JRM estimator when outliers and multicollinearity occur simultaneously in the data.

The rest of the article is organized as follows. In Section 2, the effect of the presence of multicollinearity and outliers on the existing subset selection criteria is demonstrated. Section 3 briefly introduces the various estimators used in this article. In Section 4, the motivation for a new subset selection criterion is presented and a subset selection criterion based on the JRM estimator is defined. Some results and the equivalence of the GSp statistic with the Cp, Rp and Sp statistics are presented in Section 5. In Section 6, simulated data sets are considered to illustrate the performance of the proposed method; the correct model selection ability of the GSp statistic and the performance of various robust estimates of $\sigma^2$ are also presented there. Finally, the article ends with some concluding remarks.


The problem

This section illustrates the problems of outliers and multicollinearity from the viewpoint of subset selection. The purpose of this section is to highlight the effect of the simultaneous occurrence of outliers and multicollinearity on the subset selection criteria based on the LS estimator (Cp), the M-estimator (Sp) and the ORR estimator (Rp).

A simulation design given by McDonald and Galarneau [15] is used to introduce multicollinearity in the regressor variables as follows: $$x_{ij} = (1-\rho^2)^{1/2} z_{ij} + \rho z_{i(l+1)}, \qquad i = 1, 2, \ldots, n,$$ where the $z_{ij}$ are independent standard normal variates.
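A sketch of this design (hypothetical helper `mcdonald_galarneau`; the z's are taken to be independent standard normal variates, as in McDonald and Galarneau [15]):

```python
import numpy as np

def mcdonald_galarneau(n, l, rho, seed=None):
    """Generate n observations on l regressors with
    x_ij = sqrt(1 - rho^2) * z_ij + rho * z_i(l+1),
    so that rho^2 governs the correlation between regressors."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, l + 1))
    return np.sqrt(1 - rho ** 2) * Z[:, :l] + rho * Z[:, [l]]
```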

The estimators

In multiple linear regression, an important task is to estimate the unknown regression parameters $\beta$ using an appropriate method of estimation. In this section, some of the existing methods for estimating $\beta$ are briefly discussed.

Least squares (LS) estimator

For the multiple linear regression model given in Eq. (1.1), the LS estimator of the unknown regression parameters $\beta$ is defined as $$\hat\beta_{LS} = (X'X)^{-1}X'Y.$$ Any standard textbook on regression, such as Draper and Smith [4] or Montgomery et al.

Proposed method

Consider the multiple linear regression model given in Eq. (1.1). Then the vector of predicted values of $Y$ based on the JRM estimator of $\beta$ is $\hat Y_k = X\hat\beta_{JRM} = HY$, where $H = XR(X'WX)^{-1}X'W$ is the prediction matrix based on the full model. The full model is the one which contains all $(k-1)$ regressor variables.
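A minimal sketch of this prediction matrix, taking the JRM ingredients R (the ridge/jackknife adjustment matrix) and W (the diagonal matrix of M-estimation weights) as given, since their construction is detailed in Jadhav and Kashid [10] rather than in this excerpt:

```python
import numpy as np

def jrm_hat(X, R, W):
    """Prediction matrix H = X R (X' W X)^{-1} X' W, so that
    Yhat_k = H @ y; R and W come from the fitted JRM estimator."""
    return X @ R @ np.linalg.solve(X.T @ W @ X, X.T @ W)
```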

The model given in Eq. (1.1) can be written as $$Y = X_A\beta_A + X_B\beta_B + \varepsilon,$$ where $X$ and $\beta$ are partitioned as $X = [X_A : X_B]$ and $\beta = [\beta_A' : \beta_B']'$. The matrix $X_A$ is of order $n \times p$ with 1's in the first column and the matrix $X_B$ is

Some results

In this section, we present some results to support the use of the proposed criterion for selecting the correct subset model. We also derive the equivalence of the proposed GSp statistic with the Cp, Sp and Rp statistics.

Result 5.1

If the subset model is adequate, then $$E\!\left(\sum_{i=1}^{n}(\hat Y_{ik} - \hat Y_{ip})^2\right) \approx \sigma^2\,\mathrm{tr}[(H - H_1)'(H - H_1)].$$

Proof

Let $\hat Y_{ik}$ and $\hat Y_{ip}$ be the $i$th predicted values of $Y$ based on the JRM estimator for the full model and the subset model, respectively. Then we can write $$\sum_{i=1}^{n}(\hat Y_{ik} - \hat Y_{ip})^2 = (\hat Y_k - \hat Y_p)'(\hat Y_k - \hat Y_p) = \ldots$$
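In the LS special case (W = I and R = I, so that H and H1 reduce to the usual hat matrices), the relation in Result 5.1 holds with equality for an adequate subset and can be checked by Monte Carlo; a sketch with assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, p, sigma = 50, 4, 3, 1.0
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
X_sub = X[:, :p]                                # subset model: first p columns
beta = np.array([1.0, 2.0, -1.5, 0.0])          # zero coefficient outside the subset -> adequate
H = X @ np.linalg.pinv(X)                       # full-model LS hat matrix
H1 = X_sub @ np.linalg.pinv(X_sub)              # subset-model LS hat matrix

mc = np.mean([np.sum(((H - H1) @ (X @ beta + sigma * rng.standard_normal(n))) ** 2)
              for _ in range(20000)])
print(mc)                                          # Monte Carlo E[sum (Yhat_ik - Yhat_ip)^2]
print(sigma**2 * np.trace((H - H1).T @ (H - H1)))  # sigma^2 tr[(H - H1)'(H - H1)]
```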

Simulation study

A simulation study is carried out to illustrate the performance of the proposed method. The study is divided into three subsections. Section 6.1 illustrates the performance of the Cp, Sp, Rp and GSp criteria through numerical examples for all combinations of absence and presence of outliers and multicollinearity. The correct model selection ability of these criteria is evaluated in Section 6.2. Also, various choices of the estimator of $\sigma^2$ are considered in Section 6.3 and their

Conclusion

We have developed a subset selection procedure based on the JRM estimator of the unknown regression parameters. This method works well for clean data, as well as in the presence of only outliers, only multicollinearity, or both. Also, the performance of the proposed method is evaluated in the presence of more than one outlying observation together with multicollinearity in the data. The correct model selection ability of the proposed method is also obtained. It

Acknowledgments

The authors thank the anonymous reviewers for their valuable comments and constructive suggestions, which substantially improved the quality of the manuscript. This research was supported by the University Grants Commission, New Delhi, India under the Major Research Project Scheme.

References (19)

  • E. Ronchetti, Robust model selection in regression, Statist. Probab. Lett. (1985)
  • D. Birkes et al., Alternative Methods of Regression (1993)
  • S. Chatterjee et al., Sensitivity Analysis in Linear Regression (1988)
  • A.V. Dorugade et al., Variable selection in linear regression based on ridge estimator, J. Stat. Comput. Simul. (2010)
  • N.R. Draper et al., Applied Regression Analysis (2003)
  • A.E. Hoerl et al., Ridge regression: biased estimation for nonorthogonal problems, Technometrics (1970)
  • A.E. Hoerl et al., Ridge regression: applications to nonorthogonal problems, Technometrics (1970)
  • A.E. Hoerl et al., Ridge regression: some simulations, Commun. Stat. (1975)
  • P.J. Huber, Robust regression: asymptotics, conjectures, and Monte Carlo, Ann. Statist. (1973)
