A generic algorithm for reducing bias in parametric estimation

: A general iterative algorithm is developed for the computation of reduced-bias parameter estimates in regular statistical models through adjustments to the score function. The algorithm uniﬁes and provides ap- pealing new interpretation for iterative methods that have been published previously for some speciﬁc model classes. The new algorithm can use- fully be viewed as a series of iterative bias corrections, thus facilitating the adjusted score approach to bias reduction in any model for which the ﬁrst- order bias of the maximum likelihood estimator has already been derived. The method is tested by application to a logit-linear multiple regression model with beta-distributed responses; the results conﬁrm the eﬀectiveness of the new algorithm, and also reveal some important errors in the existing literature on beta regression.


Introduction
Suppose that interest is in the estimation of the q-vector of parameters β = (β 1 , . . . , β q ) of a parametric model. If l(β) is the log-likelihood for β, the maximum likelihood estimatorβ solves the score equations where n is the sample size or some other measure of how information accumulates for the parameters of the model. This means that the bias of the maximum likelihood estimator vanishes as n → ∞. Nevertheless, in practice the bias ofβ may be considerable for small or moderate values of n.
An approach to the correction of the bias of the maximum likelihood estimator is to define a bias-corrected estimatorβ =β − b(β), where b(β) is the O(n −1 ) term in the asymptotic expansion of the bias of the maximum likelihood estimator. It may be shown thatβ has bias of asymptotic order O(n −2 ) [see, for example , 11]. An extensive literature has been devoted to obtaining analytical expressions for b(β) and studying the properties of the bias-corrected estimator, especially for classes of models where the bias ofβ is large enough to affect inferences appreciably. Characteristic examples of such studies are Cox and Snell [9], Schaefer [29], Gart et al. [15], Cook et al. [4], Cordeiro and McCullagh [6], Breslow and Lin [2], Lin and Breslow [20], Cordeiro and Vasconcellos [7] and Cordeiro and Toyama Udo [5].
An alternative family of estimators β * with O(n −2 ) bias was developed in Firth [14]. These estimators differ from the bias-corrected estimatorβ in that they are not computed directly from the maximum likelihood estimator. The latter fact has motivated the study and use of the bias-reduced estimator β * instead ofβ [for example, 3,16,17,19,22,25,28,32], especially in models where there is a positive probability that the maximum likelihood estimate is on the boundary of the parameter space. Leading examples are log-linear, logit-linear and similar models for counts, where the bias-corrected estimator is undefined whenever the maximum likelihood estimate has one or more infinite components [see, for example 1, for multinomial response models]. The estimator β * results from the solution of a set of adjusted score equations, and hence in all but the simplest cases an iterative scheme needs to be employed to obtain the bias-reduced estimate. For some specific families of models efficient estimation schemes have been developed by exploiting the specific structure of the adjustments. For example, for generalized linear models Kosmidis and Firth [19] suggest an iterative scheme that operates through appropriate adjustment of the maximum likelihood "working observations" [see also , 13], and for the particular case of binomial regression models Kosmidis [18] develops an appealing iterative scheme based on iterative adjustment of the binomial counts. More generally, however, the special structure needed for the existence of such iterative adjustment schemes is absent.
In the current paper a generic procedure for obtaining the bias-reduced estimate is developed. The procedure directly depends on b(β) and hence it can be easily implemented for all the models for which b(β) has already been obtained in the literature. Furthermore, as will be shown, for certain prominent members of the family of bias-reduced estimators the algorithm provides a unified computational framework for bias correction and bias reduction.
The new algorithm is then tested through an application to nonlinear regression with beta-distributed responses, a situation in which bias correction of the maximum likelihood estimator has received considerable recent attention in the literature. In addition to demonstrating the effectiveness of the algorithm developed here, a thorough numerical study reveals some errors in the recent literature on such models. The design and analysis of the simulation experiment conducted to detect such errors have special features associated with the largesample behaviour of bias and variance, and form a template for the numerical study of asymptotic properties more generally.

Bias reduction via adjusted score functions
Firth [14] showed that an estimator with O(n −2 ) bias may be obtained through the solution of an adjusted score equation in the general form where A(β), suitably chosen, is O p (1) in magnitude as n → ∞. Firth [14] described two specific instances of the general bias-reducing adjustment A, denoted by A (E) and A (O) , based respectively on the expected and observed information matrix. The components of these two alternatives are given by and } are higher order joint null moments of log-likelihood derivatives.
Kosmidis and Firth [19] gave a more general family of bias-reducing adjustments to the score vector. The general adjustment is of the form where G(β) is either F (β) or I(β) or some other matrix with expectation F (β), and R(β) is a q × q matrix with expectation of order O(n 1/2 ). The vector is the O(n −1 ) asymptotic bias. It is immediately apparent that if G(β) = F (β) with R(β) = 0 then the A (E) adjustment results, and if G(β) = I(β) with R(β) = 0 the A (O) adjustment results.

Bias reduction as iterated bias correction
A full Newton-Raphson iteration for obtaining the bias-reduced estimate would require the evaluation of the matrix I(β) + ∇ T β A(β). Even for relatively simple models a closed form expression for ∇ T β A(β) requires cumbersome algebra and, depending on the complexity of the resultant expression, may also be difficult to implement. For this reason, the following quasi Newton-Raphson iteration is proposed: where β (j) is the candidate value for β * at the jth iteration. An alternative to the above iteration is the modified Fisher scoring iteration proposed in Kosmidis and Firth [19] in the specific context of generalized nonlinear models. The key difference between (3.1) and the iteration proposed in Kosmidis and Firth [19] is the use of F (β) instead of I(β) for calculation of the direction. Either iteration may be used but (3.1) seems closer in spirit to a Newton-Raphson iteration; because of the omission of the term ∇ T β A(β), though, the convergence rate is generally linear instead of quadratic.
By substituting (2.1) and (2.3) into (3.1), the iteration may be re-expressed in the form Note that the first two terms on the right hand side of the above expression correspond to a Newton-Raphson iteration for maximizing the log-likelihood and hence (3.2) may be re-expressed as whereβ (j+1) is the candidate value for the maximum likelihood estimate that would be obtained by taking a single Newton-Raphson step from β (j) .
Convergence or otherwise of the above iteration depends on the properties of the specific model under consideration. Nevertheless, assuming that it does converge, it is apparent from (3.1) that at convergence iteration (3.3) gives the solution to the equations S * (β) = 0.
In regular statistical models the maximum likelihood estimator differs from the bias-reduced estimator by a quantity of order O(n −1 ). Typically, then, the maximum likelihood estimate is a good starting value for the iterative scheme, provided that none of its components is on the boundary of the parameter space.
In the special case of bias reduction based on the A (O) adjustment, iteration (3.3) can be usefully re-expressed as simply This has a rather appealing interpretation: at each step, the next candidate value of the maximum likelihood estimate is corrected by subtracting the O(n −1 ) bias evaluated at the current value of the bias-reduced estimate. Hence, if β (0) =β, the first step of the proposed scheme delivers the bias-corrected maximum likelihood estimate; and iterating until convergence yields the bias-reduced estimate based on adjustment A (O) . For bias reduction based on the A (E) adjustment, iterated bias correction as in (3.4) can still be used, but with the symbols in (3.4) having a different meaning. In that caseβ (j+1) represents the candidate value for the maximum likelihood estimate obtained by taking a single Fisher-scoring step from β (j) , instead of a Newton-Raphson step. This provides a useful new interpretation of the modified Fisher scoring iteration that was suggested in Kosmidis and Firth [19].

Beta generalized linear model
As an illustrative application of the bias-reduction algorithm, consider the case of a generalized linear model with Beta-distributed responses [for example, 10,12]. Suppose that Y 1 , . . . , Y n are independent Beta-distributed random variables, the density of Y i being For the purposes of the current paper it will be assumed that the precision quantities δ i + ǫ i (i = 1, . . . , n) are all equal, and we will write 1/(1 + δ i + ǫ i ) = σ 2 < 1. The response dispersion, relative to the common variance function In some applications this constant-dispersion assumption might need to be relaxed, for example, as in Smithson and Verkuilen [31] or Simas et al. [30]. The dependence of the response mean µ i upon a p-vector x i of covariate values is commonly modeled through a link function g(.) to a linear predictor η i (i = 1, . . . , n). Since µ i ∈ (0, 1), the obvious candidate link functions in this context are inverse cumulative distribution functions (logit, probit and suchlike). The assumed relationship between the expected response and the covariate values is then and so the parameters of this model are the vector of regression coefficients γ = (γ 1 , . . . , γ p ) and the dispersion parameter σ 2 . Ferrari and Cribari-Neto [12] parameterized the model in terms of the precision parameter φ = 1/σ 2 − 1 and that representation will also be used here, with β = (γ 1 , . . . , γ p , φ) being the full vector of model parameters.

Bias reduction
For Beta regression models, Ospina et al. [23] express the vector b(β) of the first-order biases ofβ as the estimator of the regression coefficients of an appropriately weighted linear regression. A similar result is obtained in Simas et al. [30] for nonlinear Beta regression models with dispersion covariates. Despite the analytical elegance of such expressions for b(β), patterned after similar results for generalized linear models in Cordeiro and McCullagh [6], they seem to offer no benefit in terms of efficient implementation. In what follows a more direct approach is taken. For general families of models, equations (2.2), (2.4) and (3.3) suggest that bias correction and any bias-reduction method can be directly implemented if F (β), I(β), P t (β) + Q t (β) (t = 1, . . . , q) and the required form of G(β) + R(β) are all available in closed form; the matrix inversions and multiplications necessary for implementation can then be done numerically.
For the Beta regression model the log-likelihood can be written in the form are the sufficient statistics for the Beta distribution with natural parameters δ i and ǫ i , respectively (i = 1, . . . , n). Direct differentiation of l(β) with respect to φ and γ gives that where the dependence on β of the quantities appearing in the right hand side has been suppressed for notational convenience. Here, 1 n is an n-vector of ones, X is the n × p matrix with x i as its ith row, D = diag{d 1 , . . . , d n } with d i = dµ i /dη i (i = 1, . . . , n), and M = diag{µ 1 , . . . , µ n }. Furthermore,Ū andZ are the nvectors with ith components the centered sufficient Expressing all likelihood derivatives in terms of the sufficient statistics U i and Z i facilitates the calculation of P t + Q t because the derivation of higher-order joint cumulants of Z i and U i is merely a simple algebraic exercise; any joint cumulant of Z i and U i results from appropriate-order partial differentiation of the cumulant transform of the Beta distribution with respect to the natural parameters δ i and ǫ i .
Further differentiation of l(β) gives the observed information on β, 3) is immediately recognised to be the expected information on β, since the expectation of the second summand in the right hand side of (4.2) is zero. Here, A careful examination of expressions (4.1) and (4.2) reveals that . . , n). Re-expressing the above expectations as sums of joint cumulants of U i and Z i , direct differentiation of the cumulant transform of the Beta distribution with respect to δ i and ǫ i gives

Some algebra then gives
and where

Numerical study
As a partial check on the correctness of the bias-reduction algorithm a model with is fitted to the n = 32 observations of the gasoline yield data of Prater [26]. The response variable is the proportion of crude oil converted to gasoline after distillation and fractionation, and s i1 , . . . , s i9 are the values of 9 binary covariates which represent the 10 distinct experimental settings in the data set for the triplet i) temperature in degrees Fahrenheit at which 10% of crude oil has vaporized, ii) crude oil gravity, and iii) vapor pressure of crude oil (i = 1, . . . , n). Lastly, t i is the temperature in degrees Fahrenheit at which all gasoline has vaporized for the ith observation (i = 1, . . . , n). The same model was also used for illustration in Ospina et al. [23]. The parameters β = (α, γ 1 , . . . , γ 9 , δ, φ) are estimated using maximum likelihood, bias correction and bias reduction with A (O) and A (E) adjustments using the expressions (4.4) and (4.5) to implement iteration (3.3). The results for the actual data are shown in Table 1, while Table 2 presents the results of a simulation study based on this example. Some remarks on these results follow.

Remark 1: Checking correctness of the implementation
The bias-corrected estimates and the bias-reduced estimates with A (E) adjustments in Table 1 differ appreciably from the corresponding values reported in  Table 6] (labeled "CBCE" and "PBCE", respectively, therein) while the maximum likelihood estimates are the same, at least to five significant digits. After some investigation it was found that the differences arise from two distinct sources.
The reason for the fairly substantial difference in the reported bias-reduced estimates is an elementary but serious error: in equation (3.5) of Ospina et al. [23] the sign of the adjustment term is the opposite of that suggested in Firth [14]. As a result of this, and as is also apparent in the simulation studies reported in Ospina et al. [23], the estimator labeled "PBCE" therein approximately doubles the bias of the maximum likelihood estimator instead of eliminating it. The same mistake was made also in equation (12) of Simas et al. [30], with the same unfortunate consequence.
That mistake does not, however, account for the differences seen also between the reported bias-corrected estimates here and in Table 6 of Ospina et al. [23]. Those differences, while relatively small, are still too large to be attributed to presentational rounding error: at least one of the two implementations of the O(n −1 ) bias term b(β) must therefore be incorrect. A brief account follows of an extensive simulation exercise designed to determine which of the two reported sets of bias-corrected estimates is incorrect.
The bias of the maximum likelihood estimator can be written in the form B(β)/n + O(n −2 ), where b(β) = B(β)/n is as in (2.4). The following simulation experiment relies on the fact that as n increases the bias ofβ is almost completely determined by the value of B(β).
Let X be the n × p model matrix for (4.6) and denote by Z(j) a n j × p model matrix with n j = nj, whose rows result from repeating j times each row of X (j = 1, 2, . . . ; n = 32; p = 11). Then the bias of the maximum likelihood estimatorβ  Table 1 for the current implementation, and the bias-corrected estimatesβ (Osp) reported in Ospina et al. [23, Table 6], respectively.
For any j ∈ {1, 2, . . .}, consider now simulating some number N j of samples from the model, using the maximum likelihood estimates in Table 1 as the true value for β. The bias ofβ [j] can then be estimated using maximum likelihood fits to those samples and, after multiplication by n j , can be compared to B (cur) (β) and B (Osp) (β). The standard error of the estimator of n j times the bias ofβ [j] is O n j /N j . Hence, the order of that standard error can be stabilized to O(1/ √ N ) by choosing N j = N n j , for some sufficiently large N . The plots in Figure 1 give the estimated values of the components of n j times the bias ofβ [j] that correspond to α, γ 1 , . . . , γ 9 , and δ, for N = 5000 and j = 1, . . . , 15. To provide an indication of the uncertainty due to simulation error, the estimates are accompanied by 99% confidence intervals based on asymptotic normality. For every parameter the estimate is very close to B (cur) (β). This experiment provides clear evidence of either an algebraic or implementation error in Ospina et al. [23] as far as b(β) is concerned, at least for the parameters α, γ 1 , . . . , γ 6 and δ; the differences found here are too large to be accounted for by the presentational rounding to 5 decimal places in Table 6 of Ospina et al. [23].

Remark 2: Impact of bias in this example
The results in Table 2 are as expected: the bias found in the maximum likelihood estimatorβ is reduced by all three of the alternative methods which aim to improve the bias. By far the most important effect of bias reduction here is to reduce substantially the estimated value of φ; in regard to estimation of the regression parameters (α, γ, δ), on the other hand, the bias in this example is rather slight and is therefore of no real consequence. The smaller value of φ does, however, result in an appreciable increase in the estimated standard errors for all the regression parameter estimates. A further simulation exercise, not reported in detail here, was conducted to check the accuracy of the asymptotic standard errors reported in Table 1: it was found that those standard errors all agree with simulation-based standard errors to at least 3 decimal places. Because of the large bias in the estimated precision parameterφ, the usual standard errors based on the maximum likelihood analysis are systematically too small. The principal effect of bias reduction in this example, then, is to produce more realistic standard errors for the estimates of all the parameters.
It should be noted that the estimated value of φ in this example, even after reduction of the bias, is quite large: that is, the residual dispersion in the model is quite small. This accounts for the rather small bias found in the maximum likelihood estimator for the regression parameters. In a different situation with more substantial residual dispersion present, biases in the regression estimates themselves (i.e., not only in their standard errors) would likely become more important.

Remark 3: Parameterization
In general, bias reduction will typically make most sense when applied to estimators whose distribution is approximately symmetric, since it will then most often improve the accuracy of inferences made when using first-order asymptotic normal approximations. In the present application, the distributions for all parameters except φ are close to symmetric;φ exhibits substantial positive skewness, as is often found in the estimation of positive-valued parameters.
In this model it seems preferable, then, to consider bias reduction instead for a transformed version of φ, the most obvious candidate being log φ. The distribution of logφ is much closer to being symmetric than that ofφ, a fact confirmed by graphical summaries (not presented here) of the simulation experiment underlying Table 2.
Consider a general re-parameterization from β to ω = (α, γ 1 , . . . , γ 9 , δ, ζ), with ζ = h(φ) for some appropriate function h : ℜ + → H ⊂ ℜ. Because the maximum likelihood estimator is equivariant under re-parameterization, the components of the bias vector -and hence of the vector of first-order biases -corresponding to α, γ 1 , . . . , γ 9 and δ will be the same in both the ω and β parameterizations. Assuming that h(.) is at least three times differentiable, and using the consistency of the maximum likelihood estimatorφ of φ,ζ − ζ admits the expansion if r is odd and O(n −r/2 ) if r is even [see, for example 24, Section 9.4 for the asymptotic expansion of the maximum likelihood estimator], taking expectations in both sides of (4.7) gives that the bias ofζ can be written as Here b φ (β) and F −1 φφ (β) denote the components of b(β) and {F (β)} −1 which correspond to φ. Furthermore, the expected information matrix on ω is F * (ω) = J(ω)F (β(ω))J(ω), where β(ω) = (α, γ 1 , . . . , γ 9 , δ, h −1 (ζ)) and Expression (4.8) can be used to obtain the first-order bias of h(φ) for every h(.), by merely using the first-order bias ofφ, the inverse of F (β) and the first two derivatives of h(.). Also, the bias-reduced estimate of ω based on A (E) adjustments can be obtained by using iteration (3.4) witĥ where S * (ω) = J(ω)S(β(ω)) is the score vector in the ω parameterization. While the maximum likelihood and bias-corrected estimates for α, γ 1 , . . . , γ 9 and δ will be exactly the same in both the β and ω parameterizations, the corresponding bias-reduced estimates will generally differ slightly between parameterizations. Table 3 gives the maximum likelihood, bias-corrected and bias-reduced estimates of ω when h(φ) = log φ, along with estimated standard errors based on F * (ω) evaluated at the corresponding estimates. The principal differences between Table 3 and Table 1 are in the implied estimates of φ, and consequently in the standard errors for estimates of the regression parameters α, γ 1 , . . . , γ 9 and δ. For example, the bias-reduced estimate of φ from Table 3   Table 3 Estimates of the parameters of model (4.6) using maximum likelihood, bias correction and bias reduction based on A (E) , for the re-parameterization with ζ = log(φ). In parentheses are the corresponding estimated standard errors based on the expected information matrix F * (ω) is exp(5.61608) = 274.8; this is slightly larger than the corresponding value 261.0 from Table 1, resulting in slightly smaller estimated standard errors for the regression parameters in Table 3.

Concluding remarks
The new algorithm developed here unifies various iterative methods that have been made available previously for specific models, and extends them to cover any new situation for which the O(1/n) bias of the maximum likelihood estimator can be derived. The method was tested and demonstrated here in the context of beta-response nonlinear regression, and was found to perform robustly in all of the very large number of samples that were used in simulation studies. The particular illustrative example presented here is just one of several betaregression applications that the authors have worked through carefully, and the results were qualitatively the same in all of them. Bias in estimation of the regression parameters in such models is typically so small as to be of no consequence, at least when the precision parameter φ is not unreasonably small; but the standard errors in a maximum-likelihood analysis are systematically under-estimated, with the likely consequence that spuriously strong conclusions would often be drawn. Reducing the bias in the estimated precision parameter increases the estimated standard errors in such a way that they reflect better the true variability of the estimated regression parameters.
The calculations described here were all programmed in R [27], and the code is available on request from the first author.