The bias and skewness of M-estimators in regression

Abstract: We consider M-estimation of a regression model with a nuisance parameter and a vector of other parameters. The unknown distribution of the residuals is not assumed to be normal or symmetric. Simple and easily estimated formulas are given for the dominant terms of the bias and skewness of the parameter estimates. For the linear model these are proportional to the skewness of the 'independent' variables. For a nonlinear model, its linear component plays the role of these independent variables, and a second term must be added, proportional to the covariance of its linear and quadratic components. For the least squares estimate with normal errors this term was derived by Box [1]. We also consider the effect of a large number of parameters, and the case of random independent variables.


1. Introduction
The asymptotic theory of M-estimates for regression models has been the subject of many papers. We refer the reader to the excellent book by Maronna et al. [7] for a comprehensive review.
In this note, we give formulas for the dominant terms of the bias and skewness of M-estimates in linear and nonlinear regression models. These formulas could have applications in many areas, including bias reduction, confidence regions and Edgeworth expansions. For the least squares estimate the formula for bias is just that given by Box [1] for the case of normal errors. The main results are in Section 2, with proofs deferred to Section 6. The exact regularity conditions are not given, but some sufficient conditions are discussed. Section 3 gives some applications. Section 4 considers the effect of a large number of parameters on bias and skewness, and shows how to adapt the results when the 'independent' variables are random. Section 5 gives some simulation results for an $L_1$-estimate. Section 7 extends our results to the case where the residuals may have different distributions.

2. Main results
First consider the linear model: we observe
$$y_N = \alpha + x_N'\beta + e_N, \qquad N = 1, \ldots, n,$$
where $x_N$ and $\beta$ are, respectively, known and unknown $p$-vectors, $\{e_N\}$ are random residuals with an unknown distribution $F$, with density $f$ (if it exists) not necessarily symmetric, and $\alpha$ is a nuisance parameter centering the residuals around zero in some way. We estimate the unknown parameters $\phi = (\alpha, \beta)$ by $(\hat\alpha, \hat\beta)$ minimising
$$\sum_{N=1}^n \rho(y_N - \alpha - x_N'\beta),$$
where $\rho$ is some smooth function for which a minimum exists. By smooth we mean that its derivatives exist except at a finite number of points. It turns out, as is well known, that for $(\hat\alpha, \hat\beta)$ to be consistent, $\alpha$ must satisfy the centering condition
$$\rho_1 = \int \rho^{(1)}(\nu - \alpha)\,dG(\nu) = 0, \eqno(2.3)$$
where $\rho^{(r)}$ denotes the $r$th derivative of $\rho$. In general, set
$$\rho_{rs\ldots} = \int \rho^{(r)}(e)\,\rho^{(s)}(e)\cdots\,dF(e),$$
where the number of indices on the left-hand side is the same as the number of factors in the integrand on the right-hand side. For example, $\rho_2 = \int \rho^{(2)}(e)\,dF(e)$, $\rho_{11} = \int \rho^{(1)}(e)^2\,dF(e)$ and $\rho_{123} = \int \rho^{(1)}(e)\,\rho^{(2)}(e)\,\rho^{(3)}(e)\,dF(e)$. The condition in (2.3) makes $\alpha$ identifiable.
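For concreteness, the moment functionals $\rho_{rs\ldots}$ can be approximated by quadrature. The sketch below is our own illustration (the function names are ours, not the paper's); it evaluates them for the least squares case $\rho(e) = e^2/2$ under a standard normal $F$, where $\rho_2 = 1$, $\rho_{11} = E e^2 = 1$, and hence $c_1 = \rho_{11}\rho_2^{-2} = 1$.

```python
import numpy as np

def trapezoid(y, x):
    """Plain trapezoidal rule (avoids version-specific NumPy helpers)."""
    return float(0.5 * np.sum((y[1:] + y[:-1]) * np.diff(x)))

def rho_moment(orders, d_rho, e, f):
    """rho_{rs...} = integral of rho^(r)(e) rho^(s)(e) ... dF(e): one index per factor.

    orders : tuple of derivative orders, e.g. (1, 1) for rho_11
    d_rho  : dict mapping order r to the vectorized derivative rho^(r)
    e, f   : grid of residual values and the density f(e) on that grid
    """
    prod = np.ones_like(e)
    for r in orders:
        prod = prod * d_rho[r](e)
    return trapezoid(prod * f, e)

# Least squares: rho(e) = e^2/2, so rho^(1)(e) = e and rho^(2)(e) = 1.
d_rho = {1: lambda e: e, 2: lambda e: np.ones_like(e)}
e = np.linspace(-12.0, 12.0, 20001)
f = np.exp(-e ** 2 / 2) / np.sqrt(2 * np.pi)   # standard normal density

rho_2 = rho_moment((2,), d_rho, e, f)     # approx. 1
rho_11 = rho_moment((1, 1), d_rho, e, f)  # approx. 1 (the residual variance)
c_1 = rho_11 / rho_2 ** 2
```

In practice $F$ is unknown, so these integrals would be replaced by averages over the fitted residuals; the quadrature version is only a check against known values.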
We now show that the bias and skewness of $\hat\beta$ are essentially proportional to the 'skewness' (third central moments) of the $\{x_N\}$. Set $M$ as in (2.4). Then, for $\rho$, $F$ and $\{x_N\}$ suitably regular as $n \to \infty$, the expansions (2.6)–(2.8) hold, where $c_1 = \rho_{11}\rho_2^{-2}$. In particular, $\hat\beta$ has bias, covariance, third cumulants and skewness as given there, where $K^{ab} = c_1 m^{ab}$ for $1 \le a, b \le p$. In the third cumulants,
$$\sum^3_{abc} f_{abc} = f_{abc} + f_{bca} + f_{cab},$$
while a plain $\sum$ sums repeated pairs of suffixes over their range (that is, $i, j, k$ over $1, \ldots, p$ in (2.7), (2.8)).
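The cyclic-sum notation $\sum^3_{abc}$ can be pinned down with a two-line helper (our illustration; the name `cyclic_sum3` is ours):

```python
def cyclic_sum3(f, a, b, c):
    """Sum of f over the three cyclic permutations of its suffixes:
    f(a,b,c) + f(b,c,a) + f(c,a,b), as in the text's definition."""
    return f(a, b, c) + f(b, c, a) + f(c, a, b)

# A non-symmetric f shows that all three permutations contribute:
f = lambda a, b, c: 100 * a + 10 * b + c
total = cyclic_sum3(f, 1, 2, 3)   # 123 + 231 + 312 = 666
```

For an $f$ symmetric in its suffixes the cyclic sum is simply three times a single term.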
Note that the $M$ on the right-hand side of (2.6) should be interpreted as the limit of the $M$ defined by (2.4) as $n \to \infty$.
These formulas for bias and skewness have an immediate application to experimental design: if the $\{x_N\}$ are chosen to have skewness zero (or $\sim n^{-1}$), then the bias and skewness (third moments) of $\hat\beta$ are reduced from $\sim n^{-1}$ and $\sim n^{-2}$ to $\sim n^{-2}$ and $\sim n^{-3}$, and so the nominal level of the one-sided confidence interval for $\beta_a$ based on approximate normality has error reduced from $\sim n^{-1/2}$ to $\sim n^{-1}$.
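Checking a candidate design for zero skewness is straightforward. The sketch below is ours: since the paper's displays (2.4)–(2.5) are not reproduced in this excerpt, we assume the natural definition $m_{abc} = n^{-1}\sum_N (x_N - \bar x)_a (x_N - \bar x)_b (x_N - \bar x)_c$.

```python
import numpy as np

def design_third_moments(X):
    """Third central moments m_{abc} of the rows x_N of the n x p design
    matrix X (assumed definition: mean of centered triple products)."""
    Z = X - X.mean(axis=0)
    return np.einsum('na,nb,nc->abc', Z, Z, Z) / Z.shape[0]

# A symmetric one-dimensional design has third moment ~0 ...
sym = design_third_moments(np.linspace(-1.0, 1.0, 11)[:, None])
# ... while a lopsided one does not.
skw = design_third_moments(np.array([0.0, 0.0, 0.0, 1.0])[:, None])
```

By the result above, the symmetric design removes the leading $\sim n^{-1}$ bias and skewness terms, while the lopsided one does not.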
We now consider the general non-linear model in which the regression functions $\{x_N'\beta\}$ are replaced by smooth functions $\{f_N(\beta)\}$, so that we observe $y_N = \alpha + f_N(\beta) + e_N$. We shall see that the role of $x_N$ in Theorem 2.1 is now played by the gradient $\dot f_N(\beta) = \partial f_N(\beta)/\partial\beta$, but there is an additional term in the bias and skewness, proportional to the covariances between the linear and quadratic components of the model.
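The linear and quadratic components of a given $f_N$ can be extracted numerically. The sketch below is our illustration (central finite differences; the helper name is ours): the gradient takes over the role of $x_N$, while the Hessian carries the quadratic component entering the extra bias term.

```python
import numpy as np

def linear_quadratic_components(f, beta, h=1e-4):
    """Central finite-difference gradient and Hessian of a scalar model
    function f at beta: the 'linear' and 'quadratic' components of f."""
    beta = np.asarray(beta, dtype=float)
    p = beta.size
    grad = np.zeros(p)
    hess = np.zeros((p, p))
    step = np.eye(p) * h
    for a in range(p):
        grad[a] = (f(beta + step[a]) - f(beta - step[a])) / (2 * h)
        for b in range(p):
            hess[a, b] = (f(beta + step[a] + step[b]) - f(beta + step[a] - step[b])
                          - f(beta - step[a] + step[b]) + f(beta - step[a] - step[b])) / (4 * h * h)
    return grad, hess

# Example model f(beta) = exp(beta_1) * beta_2, evaluated at beta = (0, 1).
g, H = linear_quadratic_components(lambda b: np.exp(b[0]) * b[1], [0.0, 1.0])
```

For a linear model $f_N(\beta) = x_N'\beta$ the Hessian vanishes and the extra term disappears, recovering the linear-model result.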

3. Applications
Then for $\rho$ suitably regular, the estimate in (3.2) has bias $\sim n^{-2}$ and variance $\sim n^{-1}$, with a corresponding confidence region. Unlike the jackknife or bootstrap versions of $\hat\beta_a$, which require $\sim n^2$ or more calculations to reduce the bias to $\sim n^{-2}$ (and retain variance $\sim n^{-1}$), the estimate in (3.2) only requires $\sim n$ calculations. Estimates for which $\rho^{(1)}$ or $\rho^{(2)}$ is discontinuous, such as the $L_1$-estimate, are considered in Section 5. Our expressions for bias and skewness enable us to calculate the first term of the Edgeworth expansion for the distribution of $Y_n = n^{1/2}(\hat\beta - \beta)$ and its Studentised version, where $\Phi$ and $\phi$ denote the distribution and density of a unit normal random variable. If we drop the last term in (3.5) we must replace $O(n^{-1})$ by $O(n^{-1/2})$; c.f. Withers [10, 11]. If $p = 1$, (3.4) takes a simpler form; see Withers [10] for more details on this type of application. Note that $K_a$ and $K_{abc}$ may also be used to obtain 'small sample asymptotics' for the density and tails of $\hat\beta$ by the method of Easton and Ronchetti [3].
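The cost comparison can be made concrete. The leave-one-out jackknife below is a generic sketch of ours (not the paper's estimator): it performs $n$ refits, each costing $\sim n$, hence $\sim n^2$ work in all, whereas the plug-in correction of (3.2) needs one fit plus an $\sim n$ evaluation of the estimated $K_a$.

```python
import numpy as np

def ols_slope(y, x):
    """Least squares slope through the origin -- a stand-in estimator."""
    return float(x @ y / (x @ x))

def jackknife_bias_corrected(estimator, y, x):
    """n * (full-sample estimate) - (n - 1) * (mean of leave-one-out
    estimates): n refits, so ~n^2 arithmetic for an ~n-cost estimator."""
    n = len(y)
    full = estimator(y, x)
    loo = [estimator(np.delete(y, i), np.delete(x, i)) for i in range(n)]
    return n * full - (n - 1) * float(np.mean(loo))

x = np.arange(1.0, 9.0)
y = 2.0 * x                      # noiseless data: the slope estimate is exact
beta_jack = jackknife_bias_corrected(ols_slope, y, x)   # remains 2
```

Here the estimator is already unbiased, so the correction leaves it unchanged; the point is only the operation count.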

5. A numerical example
We noted that our formula for bias in the nonlinear model extends one of Box [1] for the least squares estimate. He gives a numerical example for $F$ normal and $n = 3$ showing excellent agreement between the exact bias (as estimated by 52,000 simulations) and our formula for $K_a$. Here, we make a similar comparison for the linear model with $p = 1$, $x_N = x_{N-m-1}$, $x_i = i/n + \tau(i/n)^2$ for $|i| \le m$, where $n = 2m + 1$ and $\tau$ is assumed known. By varying $\tau$ this allows an arbitrary value of $\mu_{3x}$. We take $\rho(e) = |e|$ and $F(e) = G(1 + e)$, where $G(\nu) = 1 - \nu^{-A}/2$ for $\nu \ge 2^{-1/A}$ and $A > 0$. This is chosen as an example of a one-sided heavy-tailed distribution. (As noted in Example 2.1, $\alpha$ transforms $G$ so that $F$ has median zero.) So, $\hat\beta$ has bias $n^{-1}K_1 + O(n^{-2})$, where $K_1 = c_2\,\mu_{2x}^{-2}\mu_{3x}$ is zero at $\tau = 0$ or $\infty$, $\mu_{2x} = \mu_X + \tau^2\mu_{2X}$, $\mu_{3x} = 3\tau\mu_{2X} + \tau^3\mu_{3X}$, and $\mu_X$, $\mu_{rX}$ are the mean and $r$th central moment of $X$. Figures 1 and 2 plot the estimated biases of $\hat\beta$, $\hat\beta - n^{-1}K_1$ and $\hat\beta - n^{-1}K_1^*$ (labelled 1, 2 and 3) against $A$ and $\tau$. Estimates were obtained from two separate runs of $10^5$ simulations each, for $\beta = 1$, $\alpha = 0$ and $n = 11, 21$. Calculations were done using NAG routine E02GAF, and took nearly 48 hours of CPU time on a VAX 780. The large number of simulations was required to obtain good accuracy, as indicated by the small variation between runs; the need for so many simulations may be due to the non-uniqueness of the $L_1$-estimate, or more fundamentally to the fact that $\rho(e) = |e|$ has a discontinuous derivative. The bias estimates $n^{-1}K_1$ and $n^{-1}K_1^*$ are seen to be excellent, and almost indistinguishable at this number of simulations.
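The heavy-tailed $G$ of this example is easy to sample by inverting its distribution function: solving $u = 1 - \nu^{-A}/2$ gives $\nu = (2(1-u))^{-1/A}$. The sketch below (ours; function names are not from the paper) also constructs the design points used above.

```python
import numpy as np

def sample_G(u, A):
    """Inverse-CDF transform for G(v) = 1 - v^(-A)/2 on v >= 2^(-1/A):
    maps uniform u in [0, 1) to a G-distributed draw."""
    return (2.0 * (1.0 - np.asarray(u))) ** (-1.0 / A)

def design_points(n, tau):
    """x_i = i/n + tau * (i/n)^2 for |i| <= m, where n = 2m + 1."""
    m = (n - 1) // 2
    t = np.arange(-m, m + 1) / n
    return t + tau * t ** 2

v = sample_G(0.5, 2.0)     # median of G: G(1) = 1 - 1/2, so v = 1
lo = sample_G(0.0, 2.0)    # lower endpoint of the support, 2^(-1/2)
x = design_points(11, 0.0) # tau = 0: symmetric design, mu_3x = 0
```

Feeding `sample_G` with pseudo-random uniforms would reproduce a small-scale version of the simulation above, with an off-the-shelf $L_1$ fitter replacing E02GAF.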

7. Extensions to non-identical residual distributions
Here, we extend Theorems 2.1 and 2.2 to the case where, instead of being i.i.d. $F$, the residuals $e_N$ are independent with possibly different distributions $F_N$. We assume that each $F_N$ is centered so that $E\rho^{(1)}(e_N) = 0$. Set $\rho^N_{rs\ldots} = E\,\rho^{(r)}(e_N)\,\rho^{(s)}(e_N)\cdots$ and define the linear operator $\varrho_{rs\ldots}$ to act on the $g$-arrays by weighting their $N$th terms by $\rho^N_{rs\ldots}$, so that $\varrho_{rs\ldots}$ reduces to multiplication by $\rho_{rs\ldots}$ in the i.i.d. case. Set $a_{ij} = \varrho_2\,g_{ij}$ and $(a^{ij}) = (a_{ij})^{-1}$.
The other results of Theorem 2.2 hold with $K_a$, $K_{abc}$ replaced by
$$K_a = a^{ai}\left\{\left(-a^{jk}\varrho_{12} + V^{jk}\varrho_3\right)g_{ijk} + \left(a^{jk}\varrho_{11} - V^{jk}\varrho_2\right)g_{j,ki} - V^{jk}\varrho_2\,g_{i,jk}/2\right\},$$
Similarly, $ER_{ijk} = -\varrho_3\,g_{ijk} + \varrho_2\sum^3_{ijk} g_{i,jk}$, so $C_{1h} = a^{jk}a^{hi}(-\varrho_{12}\,g_{ijk} + \varrho_{11}\,g_{j,ik}) - V^{jk}a^{hi}(\cdots)$. Also $a_{ij} = \varrho_2\,g_{ij}$ and $V^{ij} = n\,\mathrm{cov}(S_i, S_j) = a^{ik}a^{jl}\,n\,\mathrm{cov}(R_k, R_l) = a^{ik}a^{jl}\varrho_{11}\,g_{kl}$. Now write the last term for $C_{1h}$ as
$$\left\{a^{hi}\varrho_2\,g_{i,jk}V^{jk} + a^{hi}\varrho_2\,g_{j,ki}V^{jk} + \varrho_2\,g_{k,ij}a^{hi}V^{jk}\right\}/2.$$

C. Withers and S. Nadarajah/The bias and skewness of M -estimators in regression 13
So, the second plus fourth terms simplify to $a^{hi}\left\{\left(a^{jk}\varrho_{11} - V^{jk}\varrho_2\right)g_{j,ki} - V^{jk}\varrho_2\,g_{i,jk}/2\right\}$, while the first plus third terms add to $a^{hi}\left\{-a^{jk}\varrho_{12} + V^{jk}\varrho_3\right\}g_{ijk}$, since $V^{jk} = a^{jl}(\varrho_{11}\,g_{lm})a^{mk}$. This proves (6.2). Put
$$n^2\kappa(S_a, S_b, S_c) = a^{ai}a^{bj}a^{ck}\,n^2\kappa(R_i, R_j, R_k) = -a^{ai}a^{bj}a^{ck}\varrho_{111}\,g_{ijk}.$$

So,
$$K_{abc} = a^{ai}a^{bj}a^{ck}\varrho_{111}\,g_{ijk} + \sum^3_{abc} V^{bj}a^{ai}a^{ck}\left(-\varrho_{12}\,g_{ijk} + \varrho_{11}\,g_{i,jk}\right) + \cdots$$
Of these five terms, the first plus second simplify to $a^{ai}a^{ck}\left(a^{bj}\varrho_{111} - 3V^{bj}\varrho_{12}\right)g_{ijk}$ and the fourth simplifies to $3a^{ai}V^{bj}V^{ck}\varrho_3\,g_{ijk}/2$, so the first plus second plus fourth give $a^{ai}\left\{a^{bj}a^{ck}\varrho_{111} - 3V^{bj}a^{ck}\varrho_{12} + 3V^{bj}V^{ck}\varrho_3/2\right\}g_{ijk}$. The third plus fifth terms give the last two of the three terms in (6.3).