Normalized estimating equation for robust parameter estimation

Robust parameter estimation has been discussed as a method for reducing the bias caused by outliers. An estimating equation using a weighted score function is often used. A typical estimating equation is non-normalized, but this paper considers a normalized estimating equation, which is corrected to ensure that the mean of the weights is one. In robust parameter estimation, it is important to control the difference between the target parameter and the limit of the robust estimator, which is referred to as the latent bias in this paper. The latent bias is usually discussed in terms of the influence function and the breakdown point. It is illustrated by some examples that the latent bias can be close to zero for the normalized estimating equation even if the proportion of outliers is not small, but not close to zero for the non-normalized estimating equation. Furthermore, this behavior of the normalized estimating equation can be proved under mild conditions. The asymptotic normality of the robust estimator is also presented, and it is then shown that the outliers are naturally ignored, for an appropriate proportion of outliers, from the viewpoint of asymptotic variance. The results can be extended to the regression case. The behaviors of the latent bias and the mean squared error are investigated by numerical studies.


Introduction
Maximum likelihood estimation is a typical form of parameter estimation. However, if outliers are present in the observations, they often cause a severe bias. To overcome this problem, many methods for robust parameter estimation against outliers have been proposed (Hampel et al., 1986; Maronna, Martin and Yohai, 2006; Huber and Ronchetti, 2009).
Let the parametric density be denoted by $f(x;\theta) = f_\theta(x)$. Let the log-likelihood and score functions be denoted by $l(x;\theta) = \log f(x;\theta)$ and $s(x;\theta) = (\partial/\partial\theta) l(x;\theta)$, respectively. The maximum likelihood estimator is then a root of the estimating equation $\sum_{i=1}^n s(x_i;\theta) = 0$, where $x_1,\ldots,x_n$ are the observations. To weaken the adverse effect of outliers, an estimating equation using a weighted score function, $\sum_{i=1}^n w(x_i;\theta) s(x_i;\theta) = 0$, can be considered for robust parameter estimation (Field and Smith, 1994). The weight $w(x_i;\theta)$ is small when $x_i$ is an outlier. However, a bias correction is necessary to ensure Fisher consistency. The bias-corrected estimating equation is given by

  $\frac{1}{n}\sum_{i=1}^n w(x_i;\theta) s(x_i;\theta) = \mathrm{E}_{f_\theta}[w(x;\theta) s(x;\theta)].$

Recently, the density power weight $w(x;\theta) = f(x;\theta)^\gamma$ ($\gamma > 0$) has been discussed for robust parameter estimation, because $f(x_i;\theta)$ is close to zero when $x_i$ is an outlier. Basu et al. (1998) proposed this type of estimating equation and discussed the corresponding divergence. The divergence with $\gamma = 1$ coincides with the $L_2$-divergence, a well-known divergence that generates a strongly robust estimator (Scott, 2001), and it converges to the KL-divergence as $\gamma$ goes to zero. The tuning parameter $\gamma$ controls the trade-off between bias and variance. The divergence has been applied to independent component analysis (Minami and Eguchi, 2002), mixtures of independent component analysis models (Mollah, Minami and Eguchi, 2006), Gaussian graphical models (Miyamura and Kano, 2006), model selection (Mattheou, Lee and Karagrigoriou, 2009) and kernel principal component analysis (Huang, Yeh and Eguchi, 2009). A general weight $\xi(l(x;\theta))$, including the density power weight, was also discussed by Eguchi and Kano (2001) and Murata et al. (2004):

  $\frac{1}{n}\sum_{i=1}^n \xi(l(x_i;\theta)) s(x_i;\theta) = \mathrm{E}_{f_\theta}[\xi(l(x;\theta)) s(x;\theta)].$   (1.1)

It is easy to construct the corresponding divergence, which belongs to the class of Bregman divergences.
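As a small illustration of how such a weighted score behaves (a sketch under the normal model $N(\mu,1)$, where $s(x;\mu) = x - \mu$; the numbers below are illustrative and not from the paper), the density power weight drives an outlier's contribution to the estimating equation toward zero:

```python
import numpy as np

# sketch under the normal model N(mu, 1): l(x; mu) = log f(x; mu),
# s(x; mu) = x - mu, and the density power weight xi(l) = f(x; mu)^gamma
def dp_weight(x, mu, gamma=0.5):
    f = np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)
    return f ** gamma

def weighted_score_terms(x, mu, gamma=0.5):
    # per-observation terms xi(l(x_i; mu)) * s(x_i; mu) of the estimating equation
    return dp_weight(x, mu, gamma) * (x - mu)

x = np.array([-0.3, 0.1, 0.4, 8.0])   # the last observation is an outlier
terms = weighted_score_terms(x, mu=0.0)
print(terms)  # the outlier's term is negligible; its unweighted score would be 8.0
```

As $\gamma \to 0$ the weights tend to one and the sum reduces to the ordinary score equation of maximum likelihood.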
In the estimating equation with the weight $\xi(l(x;\theta))$, the weight is not always normalized; that is, the mean of the weights is typically not one, or more precisely, $(1/n)\sum_{i=1}^n \xi(l(x_i;\theta)) \neq 1$. In this paper, this type of estimating equation is called a non-normalized estimating equation. Let us consider another estimating equation obtained by replacing the weights $\xi(l(x_i;\theta))$ and $\xi(l(x;\theta))$ by $\xi(l(x_i;\theta))/\{(1/n)\sum_{j=1}^n \xi(l(x_j;\theta))\}$ and $\xi(l(x;\theta))/\mathrm{E}_{f_\theta}[\xi(l(x;\theta))]$, respectively:

  $\frac{(1/n)\sum_{i=1}^n \xi(l(x_i;\theta)) s(x_i;\theta)}{(1/n)\sum_{i=1}^n \xi(l(x_i;\theta))} = \frac{\mathrm{E}_{f_\theta}[\xi(l(x;\theta)) s(x;\theta)]}{\mathrm{E}_{f_\theta}[\xi(l(x;\theta))]}.$   (1.2)

The normalized estimating equation with the density power weight was proposed by Windham (1995). Jones et al. (2001) constructed the corresponding divergence, which was further explored by Fujisawa and Eguchi (2008). The divergence is related to Tsallis entropy (Tsallis, 1988, 2009; Ferrari and Yang, 2010; Ferrari and La Vecchia, 2012; Eguchi and Kato, 2010; Eguchi, Komori and Kato, 2011). Additionally, two types of divergences were further discussed by Cichocki and Amari (2010) and Eguchi and Kato (2010) and applied to vector quantization by Villmann and Haase (2011).
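On the sample side, the normalization in (1.2) amounts to rescaling the weights by their average so that they have mean exactly one; the population-side term is rescaled analogously by $\mathrm{E}_{f_\theta}[\xi(l(x;\theta))]$. A minimal sketch with hypothetical weight values:

```python
import numpy as np

def normalize(w):
    # rescale weights so that their sample mean is exactly one;
    # relative weights between observations are unchanged
    return w / np.mean(w)

w = np.array([0.9, 0.8, 0.85, 1e-6])  # tiny weight for an outlying observation
wn = normalize(w)
print(np.mean(wn))  # 1 up to floating-point rounding
```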
Here, we remark on the difference between the non-normalized and normalized estimating equations. It is easy to construct the corresponding divergence for the non-normalized estimating equation, as described already, but it is not easy (often impossible) for the normalized estimating equation, except for the density power weight. For this reason, there seems to have been no discussion of weights other than the density power weight for the normalized estimating equation. The non-existence of a corresponding divergence for an estimating equation was discussed in the framework of generalized estimating equations (McCullagh and Nelder, 1983).
In robust parameter estimation, it is important to control the difference between the target parameter and the limit of the robust estimator, which is referred to as the latent bias in this paper. The latent bias is usually discussed in terms of the influence function and the breakdown point. A distinguishing feature of the normalized estimating equation with the density power weight is that the latent bias becomes arbitrarily small when the occurrence probability of the outlier becomes arbitrarily small in a certain sense, even if the proportion of outliers is not small (Fujisawa and Eguchi, 2008). It should be noted that this favorable property was proved by using a specific property of the corresponding divergence. In this paper, this result is extended to a normalized estimating equation with any weight, which enables us to use various weights, including the logistic weight (Eguchi and Kano, 2001; Murata et al., 2004; Takenouchi and Eguchi, 2004). The approach of the proof is different from the divergence-based one, because we cannot use a convenient property of the divergence. It is further illustrated by some examples that the latent bias can be close to zero for the normalized estimating equation, but not always for the non-normalized estimating equation.
This paper is organized as follows. The non-normalized and normalized estimating equations are described in Section 2. The corresponding estimators can be regarded as M-estimators. In Section 3, the latent bias is discussed for non-normalized and normalized estimating equations and it is shown that the latent bias can be arbitrarily small for a normalized estimating equation even if the proportion of outliers is not small. Asymptotic properties of the robust estimators are presented in Section 4. These results are extended to the regression case in Section 5. Numerical examples are illustrated in Section 6. Some discussions are given in Section 7.

Estimating equation
The non-normalized estimating equation given by (1.1) can be expressed as

  $\frac{1}{n}\sum_{i=1}^n \psi_U(x_i;\theta) = 0, \quad \psi_U(x;\theta) = \xi(l(x;\theta)) s(x;\theta) - \mathrm{E}_{f_\theta}[\xi(l(x;\theta)) s(x;\theta)].$   (2.1)

The robust estimator $\hat\theta_U$ is defined as a root of this estimating equation, which is an M-estimator. To weaken the adverse effect of an outlier, we assume that the weight $\xi(l(x;\theta))$ is close to zero for an outlier $x$; more precisely,

  $\lim_{a \to -\infty} \xi(a) = 0,$   (2.2)

because $f(x;\theta)$ is close to zero for an outlier $x$ and $l(x;\theta) = \log f(x;\theta)$ goes to minus infinity as $f(x;\theta)$ approaches zero. Suppose that the function $\xi(a)$ and the density function $f(x;\theta)$ satisfy some conditions, including differentiability, integrability, and so on, which are described in the subsequent sections. Various properties, including asymptotic properties and robustness of the estimator and test, can be easily obtained from the theory of M-estimators (Maronna, Martin and Yohai, 2006; Heritier and Ronchetti, 1994).

Some types of weights have been discussed. One is the density power weight, $\xi(l(x;\theta)) = \exp(\gamma l(x;\theta)) = f(x;\theta)^\gamma$, as described in Section 1. The density power weight for an outlier $x^*$ decreases with increasing $\gamma$. The logistic weight, $\xi(l(x;\theta)) = \exp(l(x;\theta))/(\exp(l(x;\theta)) + \eta) = f(x;\theta)/(f(x;\theta) + \eta)$, was considered by Eguchi and Kano (2001) and used in a divergence related to boosting. The tuning parameter $\eta$ was referred to as the value of saturation in multilayer perceptron models in neural networks.
The logistic weight decreases with increasing $\eta$, and it is essentially proportional to $f(x;\theta)$ for a sufficiently large $\eta$. The threshold type of weight, $\xi(l(x;\theta)) = \min\{f(x;\theta), c\}$, can also be used; a similar type of weight was applied to a logistic model by Croux and Haesbroeck (2003).
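The three weights above can be written as functions of the density value $f = f(x;\theta)$; all of them vanish as $f \to 0$, i.e., as $l = \log f \to -\infty$, which is the property (2.2). A sketch with arbitrary illustrative tuning values:

```python
import numpy as np

# the three weights from this section, as functions of the density value f
def density_power_weight(f, gamma=0.5):
    return f ** gamma                # xi(l) = exp(gamma * l) = f^gamma

def logistic_weight(f, eta=0.1):
    return f / (f + eta)            # xi(l) = exp(l) / (exp(l) + eta)

def threshold_weight(f, c=0.05):
    return np.minimum(f, c)         # xi(l) = min(f, c)

f_inlier, f_outlier = 0.3, 1e-12    # typical vs. outlying density values
for wfun in (density_power_weight, logistic_weight, threshold_weight):
    print(wfun(f_inlier), wfun(f_outlier))
```

All three print a moderate weight for the inlier and a vanishingly small weight for the outlier; they differ in how quickly the weight decays as $f$ shrinks.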
The normalized estimating equation given by (1.2) can be rewritten as

  $\frac{1}{n}\sum_{i=1}^n \psi_N(x_i;\theta) = 0, \quad \psi_N(x;\theta) = \xi(l(x;\theta)) s(x;\theta) - \frac{\mathrm{E}_{f_\theta}[\xi(l(x;\theta)) s(x;\theta)]}{\mathrm{E}_{f_\theta}[\xi(l(x;\theta))]}\,\xi(l(x;\theta)).$   (2.3)

The robust estimator $\hat\theta_N$ is defined as a root of this estimating equation, which is also an M-estimator.

Latent bias
Let $f(x) = f(x;\theta^*)$ be the target density and let $\delta(x)$ be the contamination density related to outliers. Suppose that the observations are drawn from the underlying density given by

  $g(x) = (1-\varepsilon) f(x) + \varepsilon\,\delta(x),$

where $\varepsilon$ is the proportion of outliers. Let $\hat\theta_\psi$ be the estimator defined as a root of the estimating equation $\sum_{i=1}^n \psi(x_i;\theta) = 0$. We assume Fisher consistency. Let $\theta^*_\psi$ be the limit of $\hat\theta_\psi$. The bias caused by contamination can be expressed as $\theta^*_\psi - \theta^*$, which is hereafter referred to as the latent bias. The latent bias $\theta^*_\psi - \theta^*$ can be approximated by

  $\theta^*_\psi - \theta^* \approx \varepsilon \int \mathrm{IF}_\psi(x;\theta^*)\,\delta(x)\,dx$

(Huber and Ronchetti, 2009), where $\mathrm{IF}_\psi(x;\theta)$ denotes the influence function. It is favorable that the influence function $\mathrm{IF}_\psi(x;\theta)$ approaches zero as $|x|$ goes to infinity, because the latent bias can then be approximated by zero for a large value of $|x|$. The function $\psi(x;\theta)$ is said to be redescending when it approaches zero as $|x|$ goes to infinity, which implies the above favorable property of the influence function, because $\mathrm{IF}_\psi(x;\theta)$ is proportional to $\psi(x;\theta)$ up to a nonsingular matrix factor.
Consider the simple case where the target density is an exponential distribution with mean one. Figure 1 shows the influence functions for the non-normalized and normalized estimating equations with the density power weight ($\gamma = 1$). The influence function for the normalized estimating equation is redescending, because it is easily shown that $\xi(l(x;\theta))$ and $\xi(l(x;\theta)) s(x;\theta)$ in the formula (2.3) approach zero as $x$ goes to infinity. This property is generalized in what follows. However, the influence function for the non-normalized estimating equation is not redescending from the formula (2.1), because the bias-correction term for $\psi_U$, $\mathrm{E}_{f_\theta}[\xi(l(x;\theta)) s(x;\theta)]$, is not zero. Suppose that

  $\int \delta(x)\,\xi(l(x;\theta))\,dx \approx 0 \quad\text{and}\quad \int \delta(x)\,\xi(l(x;\theta)) s(x;\theta)\,dx \approx 0$   (3.1)

in a neighborhood of $\theta = \theta^*$. These conditions hold for various combinations of weights and distributions. Here we consider the simple case where $\delta$ is the Dirac function at a sufficiently large $x^*$. We can suppose that $f(x^*;\theta) \approx 0$ because $x^*$ can be regarded as an outlier. The first condition becomes $\xi(l(x^*;\theta)) \approx 0$, which immediately follows from the property (2.2). The second condition becomes $\xi(l(x^*;\theta)) s(x^*;\theta) \approx 0$, which holds for various combinations of weights and distributions.
(Some examples are given in Appendix A.) Consequently, from the formula (2.3), the condition (3.1) implies the redescending property of $\psi_N$. We therefore see that the condition (3.1) is more general than the redescending property of $\psi_N$, because the redescending property is considered under the restricted situation in which $\theta$ is the true parameter $\theta^*$ and $\delta$ is the Dirac function at $x^*$. Under the condition (3.1), we obtain a stronger property than usual, as described later. Let us consider the limit of the normalized estimating equation, given by

  $\frac{\mathrm{E}_g[\xi(l(x;\theta)) s(x;\theta)]}{\mathrm{E}_g[\xi(l(x;\theta))]} = \frac{\mathrm{E}_{f_\theta}[\xi(l(x;\theta)) s(x;\theta)]}{\mathrm{E}_{f_\theta}[\xi(l(x;\theta))]}.$   (3.2)

We see that $\mathrm{E}_g[\,\cdot\,] = (1-\varepsilon)\mathrm{E}_{f_{\theta^*}}[\,\cdot\,] + \varepsilon \int \delta(x)(\,\cdot\,)\,dx$. From the condition (3.1), the normalized estimating equation (3.2) is roughly expressed as

  $\frac{\mathrm{E}_{f_{\theta^*}}[\xi(l(x;\theta)) s(x;\theta)]}{\mathrm{E}_{f_{\theta^*}}[\xi(l(x;\theta))]} \approx \frac{\mathrm{E}_{f_\theta}[\xi(l(x;\theta)) s(x;\theta)]}{\mathrm{E}_{f_\theta}[\xi(l(x;\theta))]}.$   (3.3)

Note that the proportion of outliers, $\varepsilon$, vanishes. If this approximation is replaced by equality, then $\theta^*$ is a root. Let the limit of the normalized estimating equation be denoted by $\lambda_g(\theta) = \mathrm{E}_g[\psi_N(x;\theta)]$. Let $\theta^*_N$ be the root of $\lambda_g(\theta) = 0$. The formula (3.3) implies that $\theta^*_N \approx \theta^*$, which shows the possibility that the latent bias can be close to zero even if the proportion of outliers is not small.
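The vanishing of $\varepsilon$ in (3.3) can be checked numerically at the population level. The sketch below (illustrative, not the paper's computation) solves the normalized equation for the mean of a normal model with known variance under $g = 0.8\,N(0,1) + 0.2\,N(5,1)$; for the mean parameter, $\mathrm{E}_{f_\theta}[\xi s] = 0$ by symmetry, so the equation reduces to a weighted-mean fixed point:

```python
import numpy as np

# population-level sketch: normalized equation for the mean (sigma = 1 known)
# under g = (1 - eps) N(0,1) + eps N(5,1) with the density power weight
eps, gamma = 0.2, 1.0
x = np.linspace(-10.0, 15.0, 50001)
dx = x[1] - x[0]
phi = lambda t: np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)
g = (1 - eps) * phi(x) + eps * phi(x - 5.0)

mu = 1.0  # start at the contaminated mean E_g[x] = 1.0
for _ in range(200):
    w = phi(x - mu) ** gamma                       # density power weight
    mu = np.sum(g * w * x) * dx / (np.sum(g * w) * dx)

print(mu)  # stays near the target mean 0, not near E_g[x] = 1
```

Even with $\varepsilon = 0.2$, the root stays very close to the target mean 0, illustrating that the latent bias can be small when the proportion of outliers is not.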
Let us give a clear statement of the above discussion. Before that, we calculate the differential of $\lambda_{f_{\theta^*}}(\theta)$ at $\theta = \theta^*$. It follows from straightforward but lengthy calculations that

  $\frac{\partial}{\partial\theta^T}\lambda_{f_{\theta^*}}(\theta)\Big|_{\theta=\theta^*} = -\left( \mathrm{E}_{f_{\theta^*}}[\xi(l(x;\theta^*)) s(x;\theta^*) s(x;\theta^*)^T] - \frac{\mathrm{E}_{f_{\theta^*}}[\xi(l(x;\theta^*)) s(x;\theta^*)]\, \mathrm{E}_{f_{\theta^*}}[\xi(l(x;\theta^*)) s(x;\theta^*)]^T}{\mathrm{E}_{f_{\theta^*}}[\xi(l(x;\theta^*))]} \right).$

This is non-positive definite by the Cauchy-Schwarz inequality and is usually negative definite for various weights and density functions. The following theorem can be shown from this assumption and the implicit function theorem. The proof is given in Appendix B.

Asymptotic property
The M-estimator $\hat\theta_\psi$, which is a root of the estimating equation $\sum_{i=1}^n \psi(x_i;\theta) = 0$, has consistency and asymptotic normality under mild conditions (van der Vaart, 1998). We can use the following theorem to obtain the asymptotic properties of $\hat\theta_U$ and $\hat\theta_N$.
Theorem 4.1 (Theorems 5.41 and 5.42 of van der Vaart (1998)). Suppose that $x_1,\ldots,x_n$ are randomly drawn from the underlying density $g$. We assume:
(a) The function $\psi(x;\theta)$ is twice continuously differentiable with respect to $\theta$ for every $x$.
(b) The equation $\mathrm{E}_g[\psi(x;\theta)] = 0$ has a root $\theta^*_\psi$.
(c) $\mathrm{E}_g[\psi(x;\theta^*_\psi)\psi(x;\theta^*_\psi)^T]$ exists.
(d) The matrix $J_g(\theta^*_\psi) = \mathrm{E}_g[(\partial/\partial\theta^T)\psi(x;\theta^*_\psi)]$ exists and is nonsingular.
(e) The second-order differentials of $\psi(x;\theta)$ with respect to $\theta$ are dominated by a fixed integrable function $\ddot\psi(x)$ in a neighborhood of $\theta = \theta^*_\psi$.
Then there exists a sequence of roots, $\{\hat\theta_n\}_{n=1}^\infty$, such that $\hat\theta_n \to \theta^*_\psi$ in probability and

  $\sqrt{n}(\hat\theta_n - \theta^*_\psi) \overset{d}{\to} N\!\left(0,\; J_g(\theta^*_\psi)^{-1} K_g(\theta^*_\psi) J_g(\theta^*_\psi)^{-T}\right), \quad K_g(\theta) = \mathrm{E}_g[\psi(x;\theta)\psi(x;\theta)^T].$

There are many assumptions in the above theorem. The assumptions (a), (c) and (e) are very easy to verify, but the assumptions (b) and (d) require verification. When the estimating equation is non-normalized, or normalized with the density power weight, the assumptions (b) and (d) are easy to verify because the corresponding divergence exists (Basu et al., 1998; Jones et al., 2001). Consider the case of the normalized estimating equation under the aforementioned conditions. The assumption (b) directly follows from Theorem 3.1. The assumption (d) can be verified by adding an extra condition like (3.1) (Appendix C). Therefore, all the assumptions hold for both non-normalized and normalized estimating equations. Furthermore, the asymptotic variance for the normalized estimating equation can be expressed as follows. The derivation is given in Appendix D.
Theorem 4.2. Consider a normalized estimating equation. Assume the same conditions as in Theorem 3.1, the continuity of $J_f(\theta)$, $J_\delta(\theta)$, $K_f(\theta)$ and $K_\delta(\theta)$, and that $J_\delta(\theta^*) \approx O$ and $K_\delta(\theta^*) \approx O$. It then holds that

  $\tau^2_g(\theta^*_N) \approx \frac{\tau^2_f(\theta^*)}{1-\varepsilon},$

where $\tau^2_g(\theta) = J_g(\theta)^{-1} K_g(\theta) J_g(\theta)^{-T}$ denotes the asymptotic variance.

This theorem implies that the outliers are naturally ignored for an appropriate proportion of outliers. Suppose that the number of outliers is $m$, so that the proportion of outliers is $\varepsilon = m/n$. The asymptotic variance of the robust estimator based on the $n-m$ observations without outliers is given by $\tau^2_f(\theta^*)/(n-m) = \tau^2_f(\theta^*)/\{n(1-\varepsilon)\} \approx \tau^2_g(\theta^*_N)/n$, which is the asymptotic variance of the robust estimator based on all the observations.
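The statement can be read as saying that the $m$ outliers are effectively discarded: the asymptotic variance based on all $n$ observations matches that based on only the $n-m$ clean observations. A one-line arithmetic check with hypothetical values of $\tau^2_f(\theta^*)$, $n$ and $m$:

```python
# illustrative values, not from the paper
tau2_f = 2.0          # hypothetical tau_f^2(theta*)
n, m = 100, 20        # sample size and number of outliers
eps = m / n           # proportion of outliers

var_without_outliers = tau2_f / (n - m)          # using the n - m clean observations
var_all_observations = (tau2_f / (1 - eps)) / n  # tau_g^2(theta_N*)/n with tau_g^2 = tau_f^2/(1-eps)

print(var_without_outliers, var_all_observations)  # identical by the identity n - m = n(1 - eps)
```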

Regression case
The robust parameter estimation for the regression case has been discussed in Cantoni and Ronchetti (2001), Copt and Heritier (2007) and Croux, Gijbels and Prosdocimi (2012). The content of the previous sections can be extended to the regression case in a similar manner. Let $x$ and $y$ be the explanatory and response variables, respectively. Let $f(y|x;\theta)$ be the parametric conditional density of $y$ given $x$. Let the log-likelihood and score functions be denoted by $l(y|x;\theta) = \log f(y|x;\theta)$ and $s(y|x;\theta) = (\partial/\partial\theta) l(y|x;\theta)$, respectively. The estimating equation for the maximum likelihood estimator is $\sum_{i=1}^n s(y_i|x_i;\theta) = 0$. The downweighted estimating equation is $\sum_{i=1}^n \xi(l(y_i|x_i;\theta)) s(y_i|x_i;\theta) = 0$ and then the bias-corrected (non-normalized) estimating equation is given by

  $\frac{1}{n}\sum_{i=1}^n \xi(l(y_i|x_i;\theta)) s(y_i|x_i;\theta) = \frac{1}{n}\sum_{i=1}^n \mathrm{E}_{f_\theta(\cdot|x_i)}[\xi(l(y|x_i;\theta)) s(y|x_i;\theta)].$

The normalized estimating equation is expressed as

  $\frac{\sum_{i=1}^n \xi(l(y_i|x_i;\theta)) s(y_i|x_i;\theta)}{\sum_{i=1}^n \xi(l(y_i|x_i;\theta))} = \frac{\sum_{i=1}^n \mathrm{E}_{f_\theta(\cdot|x_i)}[\xi(l(y|x_i;\theta)) s(y|x_i;\theta)]}{\sum_{i=1}^n \mathrm{E}_{f_\theta(\cdot|x_i)}[\xi(l(y|x_i;\theta))]}.$
Let $z = (x, y)$ and $z_i = (x_i, y_i)$. These equations can also be expressed as

  $\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \psi_U(z_i, z_j) = 0 \quad\text{and}\quad \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \psi_N(z_i, z_j) = 0,$

with the kernels

  $\psi_U(z_i, z_j) = \xi(l(y_i|x_i;\theta)) s(y_i|x_i;\theta) - \mathrm{E}_{f_\theta(\cdot|x_j)}[\xi(l(y|x_j;\theta)) s(y|x_j;\theta)],$
  $\psi_N(z_i, z_j) = \xi(l(y_i|x_i;\theta)) s(y_i|x_i;\theta)\, \mathrm{E}_{f_\theta(\cdot|x_j)}[\xi(l(y|x_j;\theta))] - \xi(l(y_i|x_i;\theta))\, \mathrm{E}_{f_\theta(\cdot|x_j)}[\xi(l(y|x_j;\theta)) s(y|x_j;\theta)].$

It was shown in Section 3 that the latent bias can become arbitrarily small for a normalized estimating equation under some conditions. The same result holds for the regression case under conditions similar to those in Section 3. Let the underlying density of $x$ and the conditional density of $y$ given $x$ be denoted by $g(x)$ and $g(y|x) = (1-\varepsilon) f(y|x;\theta^*) + \varepsilon\,\delta(y|x)$, respectively. The necessary conditions for the regression case can be obtained by replacing $f(x;\theta^*)$ and $\delta(x)$ by $f(y|x;\theta^*) g(x)$ and $\delta(y|x) g(x)$ in Section 3.
Note that the non-normalized estimating equation is based on a U-statistic, while the normalized estimating equation is based on a V-statistic, which can be handled by replacing $\psi_N(z_i, z_j)$ with the symmetrized kernel $(\psi_N(z_i, z_j) + \psi_N(z_j, z_i))/2$. The asymptotic properties of U- and V-statistics have been investigated in Serfling (1980) and Lee (1990). Hence, we expect that, under additional conditions, the asymptotic distributions of the robust estimators derived from the non-normalized and normalized estimating equations can be obtained by suitable extensions of the proof for the i.i.d. case.
The weight $\xi(l(y|x;\theta))$ can downweight the score function only when $y$ is an outlier. Robustness against an outlier in $x$ is not incorporated in the above estimating equations. It can be incorporated by replacing $\xi(l(y|x;\theta))$ by $\xi(l(y|x;\theta)) w(x)$, where $w(x)$ is small when $x$ is an outlier. This idea is frequently adopted in robust parameter estimation.
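A sketch of this doubly weighted scheme for simple linear regression $y = \beta_0 + \beta_1 x + e$, $e \sim N(0,1)$: the weight $\xi$ downweights residual outliers, while a hypothetical leverage weight $w(x)$, built here from the median and MAD of the $x_i$, downweights outlying $x$ (this particular choice of $w$ is an assumption for illustration, not from the paper):

```python
import numpy as np

def xi(resid, gamma=0.5):
    # density power weight on the conditional density of y given x (sigma = 1)
    f = np.exp(-0.5 * resid ** 2) / np.sqrt(2.0 * np.pi)
    return f ** gamma

def w_x(x):
    # hypothetical leverage weight: small when x is far from the bulk of the data
    z = (x - np.median(x)) / (np.median(np.abs(x - np.median(x))) + 1e-12)
    return 1.0 / (1.0 + z ** 2)

x = np.array([0.0, 0.5, 1.0, 1.5, 20.0])   # the last x is a leverage point
y = np.array([0.1, 0.6, 0.9, 1.6, -30.0])  # the last y is also a response outlier
resid = y - (0.0 + 1.0 * x)                # residuals at trial parameters (b0, b1) = (0, 1)
weights = xi(resid) * w_x(x)               # combined weight multiplying s(y_i | x_i; theta)
print(weights)                             # last weight is essentially zero
```

The combined weights multiply the score terms $s(y_i|x_i;\theta)$ in the estimating equation, so the outlying pair contributes essentially nothing.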

Numerical examples
The latent bias and mean squared error were investigated when the target distribution was normal with mean zero and variance one, the contamination distribution was normal with mean five and variance one, and the parametric model was normal with mean $\mu$ and variance $\sigma^2$. Two types of weights were used: the density power weight and the logistic weight. The proportion of outliers was set to $\varepsilon = 0.05, 0.2$. The root of the estimating equation was obtained through an iterative algorithm (Appendix E).

Figure 2 illustrates the latent bias in the case of the density power weight. For the mean parameter, the latent biases for the normalized and non-normalized estimating equations both move closer to zero as the tuning parameter $\gamma$ becomes larger. They are almost the same when $\varepsilon = 0.05$, but slightly different around $\gamma = 0.4$ when $\varepsilon = 0.2$. For the standard deviation parameter, the latent bias for the normalized estimating equation moves closer to zero as $\gamma$ becomes larger. In contrast, the latent bias for the non-normalized estimating equation behaves differently: it cannot be close to zero, and it moves farther from zero after attaining its minimum. Additionally, when $\varepsilon = 0.2$, the minimum value is larger than 0.2, which is very large in comparison to the target parameter $\sigma = 1$. Figure 3 depicts the latent bias in the case of the logistic weight. The behaviors were similar to those for the density power weight, except that the latent bias for the non-normalized estimating equation can be much closer to zero.
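The iterative algorithm of Appendix E is not reproduced here, but a plausible fixed-point sketch for the normalized equation with the density power weight under the normal model is as follows; the factor $1+\gamma$ in the scale update comes from the fact that the density-power-weighted model distribution is $N(\mu, \sigma^2/(1+\gamma))$:

```python
import numpy as np

def robust_normal_fit(x, gamma=0.5, n_iter=100):
    # fixed-point sketch for N(mu, sigma^2) with the density power weight;
    # initialized at the median and the MAD-based scale estimate
    mu = np.median(x)
    sigma = 1.4826 * np.median(np.abs(x - mu))
    for _ in range(n_iter):
        w = np.exp(-0.5 * gamma * ((x - mu) / sigma) ** 2)  # proportional to f(x; theta)^gamma
        mu = np.sum(w * x) / np.sum(w)
        sigma = np.sqrt((1 + gamma) * np.sum(w * (x - mu) ** 2) / np.sum(w))
    return mu, sigma

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 80), rng.normal(5, 1, 20)])  # eps = 0.2
mu_hat, sigma_hat = robust_normal_fit(x)
print(mu_hat, sigma_hat)
```

On this contaminated sample the fit should stay near $(\mu, \sigma) = (0, 1)$, whereas the sample mean is pulled toward 1.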

Latent bias
In Figure 4, the mean of the contamination distribution was changed to $\mu_{\mathrm{out}} = 10$, and the density power weight was investigated. In this scenario, the curves of the latent bias shift to the left. This is because, as $x^*$ increases, $f(x^*;\theta^*)$ decreases, and thus a smaller value of the tuning parameter suffices to give the same weight value for an outlier.

Mean squared error
The root mean squared error (RMSE) was investigated for the density power weight. The sample size was set to $n = 40$. The RMSE was estimated from 500 replications.

Figure 5 illustrates the RMSE for the normalized estimating equation. The trade-off between bias and variance was observed, and the minimum RMSE was attained at a certain value of the tuning parameter. Let $\hat\gamma_\mu$ and $\hat\gamma_\sigma$ be the optimal tuning parameters that minimized the RMSE for the parameters $\mu$ and $\sigma$, respectively. It should be noted that the latent bias was small enough when the tuning parameter was $\hat\gamma_\mu$ or $\hat\gamma_\sigma$, as seen in Figure 2. The optimal tuning parameter would correspond to the case where the latent bias is small enough and the variance of the estimator is as small as possible. In addition, the value $\hat\gamma_\sigma$ was slightly larger than $\hat\gamma_\mu$. This might be because the latent bias for the standard deviation parameter was larger than that for the mean parameter.

Figure 6 depicts the RMSE for the non-normalized estimating equation. The minimum RMSE for the non-normalized estimating equation is larger than that for the normalized estimating equation. The RMSE for the non-normalized estimating equation slowly increased with increasing $\gamma$ after the minimum was attained.
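A Monte Carlo sketch in the same spirit (simplified: only the mean parameter, with $\sigma = 1$ treated as known, so this is not the paper's exact experiment) shows the RMSE advantage of the weighted estimator over the sample mean at $n = 40$:

```python
import numpy as np

# RMSE sketch: n = 40, g = 0.8 N(0,1) + 0.2 N(5,1), density power weight gamma = 0.5
rng = np.random.default_rng(1)

def weighted_mean(x, gamma=0.5, n_iter=50):
    # fixed-point iteration for the mean with sigma = 1 known
    mu = np.median(x)
    for _ in range(n_iter):
        w = np.exp(-0.5 * gamma * (x - mu) ** 2)   # proportional to f(x; mu)^gamma
        mu = np.sum(w * x) / np.sum(w)
    return mu

reps = 500
err_robust, err_mle = [], []
for _ in range(reps):
    x = np.where(rng.random(40) < 0.2, rng.normal(5, 1, 40), rng.normal(0, 1, 40))
    err_robust.append(weighted_mean(x) ** 2)       # squared error about the target mean 0
    err_mle.append(np.mean(x) ** 2)

rmse_robust = np.sqrt(np.mean(err_robust))
rmse_mle = np.sqrt(np.mean(err_mle))
print(rmse_robust, rmse_mle)
```

The sample mean carries the full contamination bias (about $\varepsilon \mu_{\mathrm{out}} = 1$), while the weighted estimator's RMSE is dominated by its variance.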

Discussion
In this paper, a normalized estimating equation using a weighted score function was presented and compared with a non-normalized estimating equation. It was shown that the latent bias can be close to zero even if the proportion of outliers is not small. The latent bias and mean squared error were illustrated by some examples. In this section, the selection of the weight is further discussed. To obtain a robust estimate, we must set the tuning parameter. Recall that the tuning parameter controls the trade-off between bias and variance, as described in Section 6.2. As seen in Section 6.1, when the tuning parameter is larger than a certain value, the latent bias is close to zero; in other words, the estimate is close to the true value. Consider the set of tuning parameters that are larger than this value and yield similar estimates. In this set, a smaller value of the tuning parameter would be favorable, because the latent bias is close to zero and the variance is smaller. We can also use a robust model selection criterion to select a good tuning parameter.
There are many candidates for the weight function, and a natural question is what type of weight function is better. For example, among the weight functions satisfying the assumption in Theorem 3.1, we might consider a weight function with a smaller maximal bias to be better. Note that this type of condition has not been assumed so far. This additional condition might pose a new problem about the optimality of the maximal bias, which will be a future issue.