Influence Function Analysis of the Restricted Minimum Divergence Estimators : A General Form

Minimum divergence estimators have proved to be useful tools in robust inference. The robustness of such estimators is measured using the classical influence function. However, many complex problems, such as testing a composite hypothesis using divergences, require the estimators to be restricted to some subspace of the parameter space. The robustness of these restricted minimum divergence estimators is essential for the overall inference to be robust. In this paper we provide a comprehensive description of the robustness of such restricted estimators in terms of their influence function, for a general class of density-based divergences, along with their unrestricted versions. In particular, the robustness of some popular minimum divergence estimators is demonstrated under certain common restrictions. The paper thus provides a general framework for the influence function analysis of a large class of minimum divergence estimators, with or without restrictions on the parameters.


Introduction
The minimum divergence approach has proved to be very useful in parametric statistical inference. The idea is to quantify the discrepancy between the sample data and the parametric model through an appropriate divergence and to minimize this discrepancy over the parameter space. Two kinds of quantification appear in the literature: through the distribution functions, or through the probability density functions (with respect to some suitable measure). Most density-based minimum divergence methods are particularly useful because they combine strong robustness with high efficiency.
However, in many complex statistical problems we need to estimate the parameter of interest under some pre-specified restrictions on the parameter space. For example, when testing a composite hypothesis, we need to estimate the parameter under the restriction imposed by the null hypothesis. In such cases the divergence must be minimized only over a restricted subspace of the parameter space. Simpson (1989), Lindsay (1994) and Basu et al. (2013) used such restricted minimum divergence estimators for testing statistical hypotheses and derived their asymptotic properties. They did not, however, examine the robustness of these restricted estimators through the usual indicators, although this is essential for a robust solution to the overall inference problem. Indeed, the robustness of restricted minimum divergence estimators has not been well studied in the literature. In this paper we address this issue and describe the robustness of general minimum divergence estimators through influence function analysis.
The rest of this paper is organized as follows. Section 2 describes the concept of minimum divergence estimators and presents a general form for their influence function analysis. In Section 3 we derive a general form of the influence function of the restricted minimum divergence estimators. Finally, in Section 4, we apply the general results to some popular divergences (disparities, the density power divergence and the S-divergence) under certain common restrictions on the parameter of interest.

Density-Based Minimum Divergence Estimation and Influence Function : General Form
Let us begin with a general parametric estimation problem. We have $n$ independent and identically distributed observations $X_1, \ldots, X_n$ from a distribution $G$, which we want to model by a parametric family of distributions $\mathcal{F} = \{F_\theta : \theta \in \Theta \subseteq \mathbb{R}^p\}$. Without loss of generality, assume that $G$ and the parametric model $F_\theta$ have the same support. Let both $G$ and $F_\theta$ belong to $\mathcal{G}$, the class of all distributions having densities with respect to an appropriate $\sigma$-finite measure $\mu$ on the measurable space $(\Omega, \mathcal{A})$, and let $g$ and $f_\theta$ be the densities of $G$ and $F_\theta$ with respect to $\mu$. We want to estimate the parameter $\theta$ from the available sample. In density-based minimum divergence estimation this is done by choosing the model element that provides the closest match to the data, where the separation between the model and the data is quantified by a nonnegative function $\rho(\cdot,\cdot)$ from $\mathcal{G} \times \mathcal{G}$ to $[0, \infty)$ that equals zero if and only if its arguments are identically equal. Such a function $\rho(\cdot,\cdot)$ is called a statistical divergence, and the estimator $\hat\theta$ of $\theta$ obtained by minimizing $\rho(\hat g, f_\theta)$ over $\theta \in \Theta$, where $\hat g$ is some nonparametric estimator of $g$ based on the sample, is called the Minimum Divergence Estimator (MDE). In terms of statistical functionals, the minimum divergence functional $T_\rho(G)$ corresponding to the divergence $\rho(\cdot,\cdot)$ is defined by the relation $\rho(g, f_{T_\rho(G)}) = \min_{\theta \in \Theta} \rho(g, f_\theta)$, provided such a minimum exists.
Many density-based divergences can be written in the form
$$\rho(g, f) = \int D(g(x), f(x))\, d\mu(x) \qquad (1)$$
for some suitable function $D(\cdot,\cdot) : \mathbb{R} \times \mathbb{R} \to [0, \infty)$. In this paper we restrict our attention to divergences satisfying Equation (1). The estimating equation of the MDE is then given by
$$\int \nabla D(g(x), f_\theta(x))\, d\mu(x) = 0, \qquad (2)$$
where $\nabla$ represents the derivative with respect to $\theta$. Note that this estimating equation does not necessarily give an M-estimator; it does so only when the part of $\nabla D(g, f_\theta)$ that involves $f_\theta$ contains $g$ only linearly, or through a constant independent of $g$. The number of divergences satisfying this condition is limited (see, e.g., Patra et al., 2013), so we cannot always apply the theory of M-estimators to describe the properties of MDEs. However, every MDE obtained as a solution of (2) is Fisher consistent by the definition of $\rho(\cdot,\cdot)$.
MDEs are popular mainly because of their strong robustness, and a useful tool in this context is the influence function (Hampel, 1968, 1974), which is an indicator of their classical first-order robustness as well as of their asymptotic efficiency. To obtain the influence function of the minimum divergence estimator based on the divergence $\rho(\cdot,\cdot)$, we consider the $\epsilon$-contaminated version of the true density $g$ given by $g_\epsilon(x) = (1-\epsilon) g(x) + \epsilon \chi_y(x)$, and similarly $G_\epsilon(x) = (1-\epsilon) G(x) + \epsilon \wedge_y(x)$; here $\chi_y(x)$ and $\wedge_y(x)$ are, respectively, the density and the distribution function of the degenerate distribution at $y$. Let $\theta_g = T_\rho(G)$ and $\theta_\epsilon = T_\rho(G_\epsilon)$ be the functionals obtained by minimizing $\rho(g, f_\theta)$ and $\rho(g_\epsilon, f_\theta)$ respectively. The influence function of the minimum divergence functional $T_\rho(\cdot)$ is then defined as
$$IF(y, T_\rho, G) = \left.\frac{\partial \theta_\epsilon}{\partial \epsilon}\right|_{\epsilon = 0}.$$
By its definition, $\theta_\epsilon$ must satisfy the estimating equation (2) with $g$ replaced by $g_\epsilon$.
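Now, substituting $g_\epsilon$ and $\theta_\epsilon$ in place of $g$ and $\theta$ in (2) and differentiating with respect to $\epsilon$ at $\epsilon = 0$, we get
$$\left[\int D^{(2,2)}(g, f_{\theta_g}) (\nabla f_{\theta_g})(\nabla f_{\theta_g})^T d\mu + \int D^{(2)}(g, f_{\theta_g}) \nabla^2 f_{\theta_g}\, d\mu\right] IF(y, T_\rho, G) + D^{(1,2)}(g(y), f_{\theta_g}(y)) \nabla f_{\theta_g}(y) - \int D^{(1,2)}(g, f_{\theta_g}) \nabla f_{\theta_g}\, g\, d\mu = 0.$$
Here $D^{(i)}(\cdot,\cdot)$ denotes the first-order partial derivative of $D(\cdot,\cdot)$ with respect to its $i$-th argument, $D^{(i,j)}(\cdot,\cdot)$ denotes its second-order partial derivative with respect to the $i$-th and $j$-th arguments ($i, j = 1, 2$), $\nabla^2$ denotes the second derivative with respect to $\theta$, and we have assumed that the standard regularity conditions hold for the densities, so that all the above derivatives exist and can be interchanged with the integrals. Thus the influence function of the minimum divergence functional $T_\rho$ simplifies to
$$IF(y, T_\rho, G) = N(\theta_g)^{-1} \left[\xi(\theta_g) - M(y; \theta_g)\right], \qquad (3)$$
where
$$N(\theta) = \int D^{(2,2)}(g, f_\theta)(\nabla f_\theta)(\nabla f_\theta)^T d\mu + \int D^{(2)}(g, f_\theta) \nabla^2 f_\theta\, d\mu,$$
$$M(y; \theta) = D^{(1,2)}(g(y), f_\theta(y)) \nabla f_\theta(y), \qquad \xi(\theta) = \int M(x; \theta)\, g(x)\, d\mu(x).$$
In particular, when the true distribution $G$ belongs to the parametric model, so that $g(x) = f_{\theta_0}(x)$ for some $\theta_0 \in \Theta$, we get $\theta_g = \theta_0$ and the influence function becomes $IF(y, T_\rho, F_{\theta_0}) = N(\theta_0)^{-1}[\xi(\theta_0) - M(y; \theta_0)]$. Therefore, the influence function of the MDE is bounded at the model for all those divergences for which $|M(y; \theta)|$ is bounded in $y$ for all $\theta$.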
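The general expression can be evaluated numerically once the partial derivatives of $D$ are supplied. The following is a minimal sketch, not part of the original derivation: it assumes the form $IF = N^{-1}[\xi - M]$ with $N$, $\xi$ and $M$ built from $D^{(2)}$, $D^{(1,2)}$ and $D^{(2,2)}$ as above, a scalar location parameter in the $N(\theta, 1)$ model, and Simpson quadrature on a truncated range. The function names and the truncation limits are illustrative assumptions, and the density power divergence of Section 4 is used only as a test case.

```python
import math

def normal_pdf(x, theta):
    """Density of N(theta, 1)."""
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2.0 * math.pi)

def simpson(func, lo=-12.0, hi=12.0, n=2000):
    """Composite Simpson rule on [lo, hi]; n must be even."""
    h = (hi - lo) / n
    total = func(lo) + func(hi)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * func(lo + k * h)
    return total * h / 3.0

def general_if(y, theta, g, d2, d12, d22):
    """IF = (xi - M) / N for a scalar theta in the N(theta, 1) model (assumed form)."""
    f = lambda x: normal_pdf(x, theta)
    grad_f = lambda x: (x - theta) * f(x)                 # derivative of f in theta
    hess_f = lambda x: ((x - theta) ** 2 - 1.0) * f(x)    # second derivative in theta
    big_n = simpson(lambda x: d22(g(x), f(x)) * grad_f(x) ** 2
                    + d2(g(x), f(x)) * hess_f(x))
    xi = simpson(lambda x: d12(g(x), f(x)) * grad_f(x) * g(x))
    m_y = d12(g(y), f(y)) * grad_f(y)
    return (xi - m_y) / big_n

# Test case: density power divergence with tuning parameter alpha, at the model.
alpha = 0.5
d2 = lambda a, b: (1 + alpha) * b ** (alpha - 1) * (b - a)
d12 = lambda a, b: -(1 + alpha) * b ** (alpha - 1)
d22 = lambda a, b: (1 + alpha) * (alpha * b ** (alpha - 1)
                                  - (alpha - 1) * b ** (alpha - 2) * a)
model_density = lambda x: normal_pdf(x, 0.0)

value = general_if(1.0, 0.0, model_density, d2, d12, d22)
# Closed form for this test case at y = 1: (1 + alpha)^{3/2} * y * exp(-alpha * y^2 / 2).
closed_form = (1 + alpha) ** 1.5 * math.exp(-0.5 * alpha)
```

Under these assumptions the quadrature value and the closed form should agree to numerical precision, which gives a quick sanity check on the signs of $\xi$ and $M$.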
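Further, as expected from the interpretation of the influence function given by Hampel et al. (1986), under suitable regularity conditions $\sqrt{n}(\hat\theta - \theta_g)$ is asymptotically normal with mean zero and variance
$$N(\theta_g)^{-1}\, Var_g\left[M(X; \theta_g)\right] N(\theta_g)^{-1},$$
where $Var_g(\cdot)$ denotes the variance under the distribution with density $g$.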

The Influence Function of Restricted MDE : General Case
We now consider restricted minimum divergence estimators and derive a general expression for their influence function, extending the ideas of the previous section. Consider the set-up of Section 2, but suppose that we now want to estimate the parameter $\theta$ only over a restricted (proper) subspace $\Theta_0$ of the whole parameter space $\Theta$. In most cases we can define the subspace $\Theta_0$ by a set of $r$ restrictions of the form
$$h(\theta) = 0 \qquad (4)$$
for some function $h : \mathbb{R}^p \to \mathbb{R}^r$ such that the $p \times r$ matrix $H(\theta) = \partial h(\theta)/\partial\theta$ exists, has rank $r$ and is continuous in $\theta$. Thus, under $\Theta_0$, the parameter $\theta$ essentially contains $p - r$ independent parameters.
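We can solve the above estimation problem by minimizing $\rho(\hat g, f_\theta)$ over $\theta \in \Theta_0$; the resulting estimator is called the Restricted Minimum Divergence Estimator (RMDE). Lindsay (1994) and Basu et al. (2013) derived the asymptotic distribution of the RMDE for the disparity family and the density power divergence respectively. Let us define the restricted minimum divergence functional, again denoted $T_\rho(G)$, by the relation $\rho(g, f_{T_\rho(G)}) = \min_{\theta \in \Theta_0} \rho(g, f_\theta) = \min_{h(\theta) = 0} \rho(g, f_\theta)$, provided such a minimum exists. This minimization problem can be solved by the method of Lagrange multipliers.
To derive the influence function of the restricted minimum divergence functional we consider, as before, the $\epsilon$-contaminated density $g_\epsilon(x)$ and let $\theta_g = T_\rho(G)$ and $\theta_\epsilon = T_\rho(G_\epsilon)$. Note that $\theta_\epsilon$ is the minimizer of $\rho(g_\epsilon, f_\theta)$ subject to (4). Consider restrictions that can be substituted explicitly into the expression of $\rho(g_\epsilon, f_\theta)$ before taking its derivative with respect to $\theta$; the corresponding derivative is then zero at $\theta = \theta_\epsilon$ and, proceeding as in Section 2, we get
$$N_0(\theta_g)\, IF(y, T_\rho, G) = \xi_0(\theta_g) - M_0(y; \theta_g), \qquad (5)$$
where $N_0(\theta)$, $\xi_0(\theta)$ and $M_0(y; \theta)$ are the same as $N(\theta)$, $\xi(\theta)$ and $M(y; \theta)$ respectively, but with the additional restriction $h(\theta) = 0$. Also, since $\theta_\epsilon$ must satisfy (4), differentiating $h(\theta_\epsilon) = 0$ with respect to $\epsilon$ at $\epsilon = 0$ yields
$$H(\theta_g)^T\, IF(y, T_\rho, G) = 0. \qquad (6)$$
We need to solve the two equations (5) and (6) together to get a general expression for the influence function $IF(y, T_\rho, G)$. Combining them, we get
$$\begin{pmatrix} N_0(\theta_g) \\ H(\theta_g)^T \end{pmatrix} IF(y, T_\rho, G) = \begin{pmatrix} \xi_0(\theta_g) - M_0(y; \theta_g) \\ 0 \end{pmatrix}.$$
Premultiplying both sides by $\left(N_0(\theta_g)^T \;\; H(\theta_g)\right)$ and simplifying, we get the general expression for the influence function of the restricted minimum divergence functional, which is presented in the following theorem.
Theorem 3.1. Under the set-up described above, the influence function of the restricted minimum divergence functional is given by
$$IF(y, T_\rho, G) = \left[N_0(\theta_g)^T N_0(\theta_g) + H(\theta_g) H(\theta_g)^T\right]^{-1} N_0(\theta_g)^T \left[\xi_0(\theta_g) - M_0(y; \theta_g)\right],$$
provided $N_0(\cdot)$, $\xi_0(\cdot)$ and $M_0(\cdot)$ can be defined as above.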
In particular, if the true density belongs to the model family and the imposed restrictions are valid, i.e., $g = f_{\theta_0}$ for some $\theta_0$ satisfying $h(\theta_0) = 0$, then we simply put $\theta_g = \theta_0$ to obtain the corresponding influence function. Therefore, the influence function of the RMDE is bounded at the model for all those divergences for which $|M_0(y; \theta)|$ is bounded in $y$ for all $\theta$. In particular, whenever the influence function of the MDE is bounded at the model, the influence function of the RMDE is also bounded at the model for any restrictions; the converse is not true.

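Remark 3.1. It is easy to check that $E_g[IF(X, T_\rho, G)] = 0$. If, in addition, the RMDE $T_n$ is $\sqrt{n}$-consistent, then it follows (Hampel et al., 1986) that $\sqrt{n}(T_n - T(G))$ is asymptotically normal with mean zero and variance
$$Var_g\left[IF(X, T_\rho, G)\right] = P(\theta_g)\, N_0(\theta_g)^T\, Var_g\left[M_0(X; \theta_g)\right] N_0(\theta_g)\, P(\theta_g), \qquad P(\theta) = \left[N_0(\theta)^T N_0(\theta) + H(\theta) H(\theta)^T\right]^{-1}.$$
At the model $g = f_{\theta_0}$, for some $\theta_0$ satisfying $h(\theta_0) = 0$, the asymptotic variance is obtained by evaluating this expression at $\theta_g = \theta_0$.
We will now explore a couple of particular cases of restrictions that are commonly used in parametric estimation.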
Example 1: First we consider a simple, and perhaps the most popular, case of restrictions, where a few components of the parameter $\theta$ are pre-specified. Precisely, let $\theta = (\theta_1^T, \theta_2^T)^T$, where $\theta_1$ is an $r$-vector whose value is specified as $\theta_{1,0}$. Thus we consider the RMDE of $\theta$ under the restriction $\theta_1 = \theta_{1,0}$. Intuitively, in this case the RMDE of $\theta_1$ must be fixed at $\theta_{1,0}$, with zero influence function, and the influence function analysis of the RMDE of $\theta_2$ should be the same as that of the unrestricted MDE with $\theta_2$ as the only parameter of interest. We now apply the general formulas derived above to this simple case to verify that the general results agree with this intuition.
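The intuition of Example 1 can be checked numerically. The sketch below is a minimal illustration under stated assumptions, not the paper's own derivation: it assumes that the restricted influence function solves $[N_0^T N_0 + H H^T]\, IF = N_0^T[\xi_0 - M_0]$, that substituting $\theta_1 = \theta_{1,0}$ before differentiating zeroes the rows and columns of $N$ and the entries of $\xi - M$ belonging to $\theta_1$, and that $H = (I_r, O)^T$. The numerical entries and the helper names (`solve`, `restricted_if`) are hypothetical.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= factor * aug[c][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def restricted_if(n0, h, rhs0):
    """Solve [N0^T N0 + H H^T] IF = N0^T rhs0, where rhs0 = xi0 - M0(y)."""
    n0_t = transpose(n0)
    gram = matmul(n0_t, n0)
    hh_t = matmul(h, transpose(h))
    lhs = [[gram[i][j] + hh_t[i][j] for j in range(len(gram))]
           for i in range(len(gram))]
    right = [sum(n0_t[i][k] * rhs0[k] for k in range(len(rhs0)))
             for i in range(len(n0_t))]
    return solve(lhs, right)

# p = 3, r = 1: theta_1 is fixed, theta_2 has two free components.
n22 = [[3.0, 1.0], [1.0, 4.0]]        # hypothetical block of N for theta_2
rhs2 = [0.7, -1.2]                    # hypothetical value of xi_2 - M_2(y)
n0 = [[0.0, 0.0, 0.0], [0.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
h = [[1.0], [0.0], [0.0]]             # H = dh/dtheta for h(theta) = theta_1 - theta_{1,0}

restricted = restricted_if(n0, h, [0.0] + rhs2)
unrestricted_theta2 = solve(n22, rhs2)
```

Under these assumptions the first component of `restricted` is zero and the remaining components equal `unrestricted_theta2`, which matches the intuitive description given in Example 1.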
Remark 3.2. Theorem 3.1 can be applied only when the restrictions are such that $rank(H(\theta_g)) = r$. In many practical situations, however, we need to consider restrictions for which the rank is strictly less than $r$, and then Theorem 3.1 cannot be applied directly to obtain the influence function of the corresponding RMDE. Nevertheless, the arguments used to derive the theorem can still be applied with small modifications. One such common case is presented in Example 2.
Example 2: Let us now consider a slightly more complicated case of restrictions, where the first $r$ components of $\theta$ depend on each other through a single unknown parameter, say $\beta$. Such restrictions are common in multivariate normal models with mean $\mu$ and covariance matrix $\sigma^2 I_p$ when we consider the restriction $\mu = \beta\mu_0$ with known $\mu_0$. Estimation of the parameters $\beta$ and $\sigma^2$ under such restrictions is important for various composite testing problems with $p$ independent normal populations. For example, when testing for homogeneity of the means of $p$ normal populations with unknown equal variances, we have to consider the above restriction with $\mu_0 = (1, \cdots, 1)^T$ under the null hypothesis. In general, let $\theta = (\theta_1^T, \theta_2^T)^T$, where $\theta_1$ is an $r$-vector, and assume that $\theta_1 = \phi(\beta)$ for a known function $\phi : \mathbb{R} \to \mathbb{R}^r$. We assume that $\phi(\beta) = (\phi_1(\beta), \cdots, \phi_r(\beta))^T$, where each $\phi_i$ is a twice differentiable real function with non-zero derivative. Here also we consider the partitions of $N(\theta)$, $\xi(\theta)$ and $M(y; \theta)$ in terms of $\theta_1$ and $\theta_2$, as in Example 1.
To derive the influence function of the RMDE in this case, note that $h(\theta) = \theta_1 - \phi(\beta)$, so that $H(\theta)$ is expressed through the $r \times r$ matrix $B = \partial\phi(\beta)/\partial\theta_1$. The $(i, j)$-th element of $B$ is $b_{ij} = \phi_i'(\beta)/\phi_j'(\beta)$, where $\phi_i'$ denotes the first derivative of $\phi_i$ with respect to $\beta$. Next, simple differentiation expresses the restricted quantities through a $p \times p$ matrix $B^*$ and the $r \times r^2$ matrix $B^{(1)} = \partial^2\phi(\beta)/\partial\theta_1^2$: we have $M_0(y; \theta) = B^* M(y; \theta)$ and $\xi_0(\theta) = B^* \xi(\theta)$, with a corresponding expression for $N_0(\theta)$. Now, $rank(H(\theta_g)) = r - 1$, so we cannot apply Theorem 3.1 directly to obtain the influence function of the RMDE in this case. However, we can restart from the pair of equations (5) and (6) with $\theta = \theta_g = (\phi(\beta_g)^T, \theta_{g,2}^T)^T$ and solve them for the influence function. To this end, let us partition the influence function $IF(y, T_\rho, G)$ of $T_\rho$ in terms of those of the functionals $T_{\rho,1}$ and $T_{\rho,2}$ corresponding to $\theta_1$ and $\theta_2$ respectively, as $IF(y, T_\rho, G) = \left(IF(y, T_{\rho,1}, G)^T, IF(y, T_{\rho,2}, G)^T\right)^T$.
Equation (10) can then be used to simplify Equation (9) further. We need to solve the resulting equation for the first partition $IF(y, T_{\rho,1}, G)$ subject to $B^T IF(y, T_{\rho,1}, G) = IF(y, T_{\rho,1}, G)$, and then use Equation (12) to get the remaining second partition $IF(y, T_{\rho,2}, G)$ of the influence function.
In particular, if $N_{12}(\theta) = O$, then the estimators of $\theta_1$ and $\theta_2$ become asymptotically independent and their influence functions also become independent of each other. The influence function of the RMDE of $\theta_2$ is then of the same form as the corresponding influence function in the unrestricted case, and the influence function of the RMDE of $\theta_1$ is obtained by solving the remaining equation for the first partition subject to the restriction $B^T IF(y, T_{\rho,1}, G) = IF(y, T_{\rho,1}, G)$.
Now let us derive the influence function for the motivating case of this example, namely the $p$-variate normal model with mean $\mu$ and covariance matrix $\sigma^2 I_p$ under the restriction $\mu = \beta\mu_0$. Here $\phi_i(\beta) = \beta(\mu_0)_i$ for all $i = 1, \cdots, p$. Hence $b_{ij}$ is constant for all $i, j$, and so $B^{(1)} = O$. Further, taking $\theta_1 = \mu$ and $\theta_2 = \sigma^2$, in this case we have $N_{12}(\theta) = 0$. Thus, from the above, the influence function of $\sigma^2_g$ has the same form as in the unrestricted case, and the influence function of $\mu_g$ is a solution of the equation for the first partition subject to the restriction $B^T IF(y, \mu_g, G) = IF(y, \mu_g, G)$. We therefore get a non-zero influence function of $\mu_g$ only if the matrix $B^T$ has 1 as one of its eigenvalues; in that case the influence function is given by the eigenvector of $B^T$ corresponding to the eigenvalue 1 that satisfies Equation (16). After simplification, the influence function in that case is determined by a vector $v$ in the null space of the matrix $B$.
For the special choice $\mu_0 = (1, \cdots, 1)^T$, we have $b_{ij} = 1$ for all $i, j$, so that the matrix $B$ does not have the eigenvalue 1 and hence $IF(y, \mu_g, G) = 0$.

Applications : Some Particular Divergences
Based on the general results of the two previous sections, one can carry out the influence function analysis and derive the asymptotic distribution of any MDE or RMDE, provided only that its $\sqrt{n}$-consistency can be proved. In this section we apply those results to some common divergence measures and model families. Throughout this section we use the following standard notation from likelihood theory: $L(\theta; \Theta) = \ln f_\theta(x)$ for $\theta \in \Theta$ is the log-likelihood function, $u_\theta(x) = \nabla L(\theta; \Theta)$ is the likelihood score function, and $I(\theta) = \int u_\theta u_\theta^T f_\theta\, d\mu$ is the Fisher information matrix. We also define similar quantities under a proper subspace $\Theta_0 \subset \Theta$ (defined by the restrictions $h(\theta) = 0$): $L(\theta; \Theta_0)$ is the restriction of $L(\theta; \Theta)$ to the subspace $\Theta_0$, $u^0_\theta(x) = \nabla L(\theta; \Theta_0)$ and $I_0(\theta) = \int u^0_\theta (u^0_\theta)^T f_\theta\, d\mu$.

Disparity Measures
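One of the most popular families of divergences is the disparity family (Lindsay, 1994), which yields fully efficient and robust estimators upon minimization. It is defined in terms of a non-negative, thrice differentiable, strictly convex function $C$ on $[-1, \infty)$ with $C(0) = 0$ and $C'(0) = 0$, called the disparity generating function, as
$$\rho_C(g, f) = \int C(\delta(x)) f(x)\, d\mu(x), \qquad \delta(x) = \frac{g(x)}{f(x)} - 1.$$
It is of the form of the general divergences defined in Equation (1) with $D(a, b) = C\left(\frac{a}{b} - 1\right) b$, so that we can apply all the results derived above. Using the same notation and writing $\delta = g/f_\theta - 1$, we have
$$M(y; \theta) = -A'(\delta(y)) u_\theta(y), \qquad \xi(\theta) = -\int A'(\delta) u_\theta\, g\, d\mu, \qquad N(\theta) = \int A'(\delta) u_\theta u_\theta^T g\, d\mu - \int A(\delta) \nabla^2 f_\theta\, d\mu,$$
where the function $A(\delta) = C'(\delta)(\delta + 1) - C(\delta)$ is known as the Residual Adjustment Function in the context of minimum disparity estimation and plays a crucial role in its robustness (Lindsay, 1994). Thus, using (3), the influence function of the minimum disparity estimator is given by
$$IF(y, T_C, G) = \left[\int A'(\delta) u_{\theta_g} u_{\theta_g}^T g\, d\mu - \int A(\delta) \nabla^2 f_{\theta_g}\, d\mu\right]^{-1} \left[A'(\delta(y)) u_{\theta_g}(y) - \int A'(\delta) u_{\theta_g}\, g\, d\mu\right],$$
which is the same as that obtained independently by Lindsay (1994). In particular, at the model $g = f_{\theta_0}$, and under the usual standardization $C''(0) = 1$ (so that $A(0) = 0$ and $A'(0) = 1$), this influence function simplifies to $I(\theta_0)^{-1} u_{\theta_0}(y)$, which is independent of the disparity generating function $C(\cdot)$ and so is the same as that of the MLE. It is an unbounded function for most of the common model families.
Now let us consider restricted minimum disparity estimation under the restrictions $h(\theta) = 0$. Using the notation of Section 3, it is easy to see that
$$M_0(y; \theta) = -A'(\delta(y)) u^0_\theta(y), \qquad \xi_0(\theta) = -\int A'(\delta) u^0_\theta\, g\, d\mu,$$
with $N_0(\theta)$ defined analogously to $N(\theta)$ under the restrictions. We can then derive the influence function of the restricted minimum disparity estimator from Theorem 3.1 and the above expressions. The interesting case, however, is when the true density belongs to the model family, i.e. $g = f_{\theta_0}$. In that case we have $M_0(y; \theta_0) = -u^0_{\theta_0}(y)$, $\xi_0(\theta_0) = 0$ and $N_0(\theta_0) = I_0(\theta_0)$.
We then get a simple expression for the influence function of the restricted minimum disparity functional $T_C$ corresponding to the disparity generated by $C(\cdot)$:
$$IF(y, T_C, F_{\theta_0}) = \left[I_0(\theta_0)^T I_0(\theta_0) + H(\theta_0) H(\theta_0)^T\right]^{-1} I_0(\theta_0)^T u^0_{\theta_0}(y).$$
Note that this expression is independent of the choice of the disparity generating function, and hence it also gives the influence function of the restricted maximum likelihood estimator.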
Further, this helps us derive the asymptotic distribution of the restricted minimum disparity estimator $\theta_C$, including that of the restricted maximum likelihood estimator. Following the argument of Lindsay (1994), one can prove the $\sqrt{n}$-consistency of the restricted minimum disparity estimators. Then, as pointed out in Remark 3.1, the asymptotic distribution of $\sqrt{n}(\theta_C - \theta_0)$ at the model $g = f_{\theta_0}$ is normal with mean zero and with the variance given in Remark 3.1, evaluated with $M_0(y; \theta_0) = -u^0_{\theta_0}(y)$, $\xi_0(\theta_0) = 0$ and $N_0(\theta_0) = I_0(\theta_0)$. This coincides with the asymptotic distribution of the restricted maximum likelihood estimator obtained independently from likelihood theory, and hence provides a justification of the general results obtained in this paper.
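The quantities used in this subsection can be checked by direct differentiation. The following worked steps are a sketch that assumes the representation $D(a, b) = C(a/b - 1)\, b$ and the general definition $M(y; \theta) = D^{(1,2)}(g(y), f_\theta(y)) \nabla f_\theta(y)$ from Section 2:

```latex
\begin{align*}
D(a,b) &= C(\delta)\, b, \qquad \delta = \frac{a}{b} - 1, \qquad
\frac{\partial \delta}{\partial a} = \frac{1}{b}, \qquad
\frac{\partial \delta}{\partial b} = -\frac{\delta + 1}{b}, \\
D^{(2)}(a,b) &= C(\delta) - C'(\delta)(\delta + 1) = -A(\delta), \\
D^{(1,2)}(a,b) &= -\frac{A'(\delta)}{b}, \qquad
D^{(2,2)}(a,b) = \frac{A'(\delta)(\delta + 1)}{b}, \\
M(y;\theta) &= -\frac{A'(\delta(y))}{f_\theta(y)}\, \nabla f_\theta(y) = -A'(\delta(y))\, u_\theta(y), \\
A'(\delta) &= C''(\delta)(\delta + 1), \qquad A(0) = 0, \qquad A'(0) = C''(0).
\end{align*}
```

At the model $\delta \equiv 0$, so with $C''(0) = 1$ these steps give $M(y; \theta_0) = -u_{\theta_0}(y)$, $\xi(\theta_0) = -\int u_{\theta_0} f_{\theta_0}\, d\mu = 0$ and $N(\theta_0) = I(\theta_0)$, in agreement with the simplifications stated above.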

Density Power Divergence
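In recent decades, arguably the most popular divergence measure in robust minimum divergence estimation has been the Density Power Divergence (Basu et al., 1998). Its increasing popularity is mainly due to the fact that the corresponding minimum divergence estimation does not require any kernel smoothing for continuous models, which is a major drawback of disparity measures. The density power divergence is defined in terms of a non-negative tuning parameter $\alpha$ as
$$d_\alpha(g, f) = \int \left[f^{1+\alpha} - \left(1 + \frac{1}{\alpha}\right) f^\alpha g + \frac{1}{\alpha} g^{1+\alpha}\right] d\mu, \qquad \alpha > 0.$$
The case $\alpha = 0$, defined as the limit $\alpha \to 0$, gives the likelihood disparity, and so the influence function of the corresponding minimum divergence estimator has already been discussed in the previous subsection. Let us now consider the case $\alpha > 0$. For any fixed $\alpha > 0$, this divergence also belongs to the general family of divergences defined in Equation (1), with
$$D(a, b) = b^{1+\alpha} - \left(1 + \frac{1}{\alpha}\right) b^\alpha a + \frac{1}{\alpha} a^{1+\alpha}.$$
We can therefore apply all the results derived above to the density power divergence, with
$$M(y; \theta) = -(1+\alpha) u_\theta(y) f_\theta^\alpha(y), \qquad \xi(\theta) = -(1+\alpha) \int u_\theta f_\theta^\alpha g\, d\mu,$$
$$N(\theta) = (1+\alpha) \left[\int u_\theta u_\theta^T f_\theta^{1+\alpha} d\mu + \int (i_\theta - \alpha u_\theta u_\theta^T)(g - f_\theta) f_\theta^\alpha\, d\mu\right],$$
where $i_\theta = -\nabla u_\theta$. Thus, from (3), the influence function of the minimum density power divergence functional $T_\alpha$ can be written as
$$IF(y, T_\alpha, G) = \left[\int u_{\theta_g} u_{\theta_g}^T f_{\theta_g}^{1+\alpha} d\mu + \int (i_{\theta_g} - \alpha u_{\theta_g} u_{\theta_g}^T)(g - f_{\theta_g}) f_{\theta_g}^\alpha\, d\mu\right]^{-1} \left[u_{\theta_g}(y) f_{\theta_g}^\alpha(y) - \int u_{\theta_g} f_{\theta_g}^\alpha g\, d\mu\right],$$
which is exactly as derived in Basu et al. (1998). In particular, if we assume $g = f_{\theta_0}$, then the influence function becomes
$$IF(y, T_\alpha, F_{\theta_0}) = \left[\int u_{\theta_0} u_{\theta_0}^T f_{\theta_0}^{1+\alpha} d\mu\right]^{-1} \left[u_{\theta_0}(y) f_{\theta_0}^\alpha(y) - \int u_{\theta_0} f_{\theta_0}^{1+\alpha} d\mu\right]. \qquad (20)$$
This influence function is bounded for all $\alpha > 0$ and most of the common model families.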
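For a concrete case, consider the mean of the $N(\theta, 1)$ model with true value $\theta_0 = 0$. The sketch below is a minimal illustration under these assumptions: it evaluates the closed form that the influence function at the model takes here, $(1+\alpha)^{3/2}\, y\, e^{-\alpha y^2/2}$, and compares it with a finite-difference version of the definition $\partial\theta_\epsilon/\partial\epsilon$ at $\epsilon = 0$, obtained by solving the estimating equation under the contaminated distribution by bisection. The function names, the bracket $[-1, 1]$ and the step $\epsilon = 10^{-6}$ are illustrative choices.

```python
import math

def dpd_if_normal_mean(y, alpha, theta0=0.0):
    """Closed-form IF at the model for the mean of N(theta, 1)."""
    z = y - theta0
    return (1.0 + alpha) ** 1.5 * z * math.exp(-0.5 * alpha * z * z)

def estimating_function(theta, y, eps, alpha):
    """Integral of u_theta * f_theta^alpha under (1 - eps) N(0, 1) + eps * (point mass at y),
    divided by the positive constant (2 pi)^(-alpha / 2)."""
    model_part = (-theta * (1.0 + alpha) ** -1.5
                  * math.exp(-alpha * theta ** 2 / (2.0 * (1.0 + alpha))))
    point_part = (y - theta) * math.exp(-0.5 * alpha * (y - theta) ** 2)
    return (1.0 - eps) * model_part + eps * point_part

def contaminated_functional(y, eps, alpha, lo=-1.0, hi=1.0):
    """Root of the estimating function by bisection (it decreases in theta on the bracket
    for small eps)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if estimating_function(mid, y, eps, alpha) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def finite_difference_if(y, alpha, eps=1e-6):
    """(theta_eps - theta_0) / eps with theta_0 = 0."""
    return contaminated_functional(y, eps, alpha) / eps

numeric = finite_difference_if(1.0, 0.5)
exact = dpd_if_normal_mean(1.0, 0.5)
```

In this example the two values should agree up to a term of order $\epsilon$. The closed form also makes the boundedness visible: for $\alpha > 0$ it redescends to zero as $|y| \to \infty$, while for $\alpha = 0$ it reduces to $y$, the unbounded influence function of the MLE.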
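Next, we consider restricted minimum density power divergence estimation under the restrictions $h(\theta) = 0$. Again using the notation of Section 3, we have
$$M_0(y; \theta) = -(1+\alpha) u^0_\theta(y) f_\theta^\alpha(y), \qquad \xi_0(\theta) = -(1+\alpha) \int u^0_\theta f_\theta^\alpha g\, d\mu,$$
$$N_0(\theta) = (1+\alpha) \left[\int u^0_\theta (u^0_\theta)^T f_\theta^{1+\alpha} d\mu + \int \left(i^0_\theta - \alpha u^0_\theta (u^0_\theta)^T\right)(g - f_\theta) f_\theta^\alpha\, d\mu\right],$$
with $i^0_\theta = -\nabla u^0_\theta$. Theorem 3.1 then gives the influence function of the restricted minimum density power divergence estimator. In particular, if $g = f_{\theta_0}$, then the influence function of the restricted minimum density power divergence functional $T_\alpha$ simplifies to
$$IF(y, T_\alpha, F_{\theta_0}) = \left[N_0(\theta_0)^T N_0(\theta_0) + H(\theta_0) H(\theta_0)^T\right]^{-1} N_0(\theta_0)^T \left[\xi_0(\theta_0) - M_0(y; \theta_0)\right], \qquad (21)$$
with $M_0(y; \theta_0) = -(1+\alpha) u^0_{\theta_0}(y) f_{\theta_0}^\alpha(y)$, $\xi_0(\theta_0) = -(1+\alpha) \int u^0_{\theta_0} f_{\theta_0}^{1+\alpha} d\mu$ and $N_0(\theta_0) = (1+\alpha) \int u^0_{\theta_0} (u^0_{\theta_0})^T f_{\theta_0}^{1+\alpha} d\mu$. Again, this influence function is generally bounded for all $\alpha > 0$.
Finally, we can derive the asymptotic distribution of the restricted minimum density power divergence estimator $\theta_\alpha$ from Remark 3.1. The $\sqrt{n}$-consistency of the restricted minimum density power divergence estimator follows from a modification of the argument used by Basu et al. (1998) to prove the same for the unrestricted minimum density power divergence estimator. So, if $g = f_{\theta_0}$, then the asymptotic distribution of $\sqrt{n}(\theta_\alpha - \theta_0)$ is normal with mean zero and with the asymptotic variance given in Remark 3.1, evaluated with the above $M_0$, $\xi_0$ and $N_0$ at $\theta_0$.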
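To make the estimation step itself concrete, the following minimal sketch computes the unrestricted minimum density power divergence estimate of a normal mean with known unit variance. It assumes the empirical estimating equation $\sum_i (x_i - \theta) \exp\{-\alpha (x_i - \theta)^2/2\} = 0$, which holds for this location model because $\int u_\theta f_\theta^{1+\alpha} d\mu = 0$. The fixed-point iteration started at the sample median, the function name and the data are illustrative choices, not taken from the paper.

```python
import math
import statistics

def mdpde_normal_mean(data, alpha, tol=1e-12, max_iter=500):
    """Fixed-point iteration for the minimum DPD estimate of the mean of N(theta, 1)."""
    theta = statistics.median(data)
    for _ in range(max_iter):
        # Each observation is down-weighted by f_theta(x)^alpha, up to a constant.
        weights = [math.exp(-0.5 * alpha * (x - theta) ** 2) for x in data]
        new_theta = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta

sample = [-0.3, 0.1, 0.4, -0.2, 0.0, 10.0]     # the last value is a gross outlier
sample_mean = sum(sample) / len(sample)        # the MLE, pulled towards the outlier
robust_estimate = mdpde_normal_mean(sample, alpha=0.5)
```

With $\alpha = 0$ all weights equal one and the iteration returns the sample mean, while for $\alpha > 0$ the outlier receives a negligible weight, which is the finite-sample counterpart of the bounded influence function discussed above.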

S-Divergence
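We now consider a recent family of divergences, namely the S-divergence family, developed by Ghosh et al. (2013). This is a general super-family containing both the density power divergence (Basu et al., 1998) and the Cressie-Read family of power divergences (Cressie and Read, 1984), together with many other useful divergences. It is defined in terms of two parameters $\lambda \in \mathbb{R}$ and $\alpha \geq 0$ as
$$\rho(g, f) = S_{(\alpha,\lambda)}(g, f) = \frac{1}{A} \int f^{1+\alpha} d\mu - \frac{1+\alpha}{AB} \int f^B g^A\, d\mu + \frac{1}{B} \int g^{1+\alpha} d\mu,$$
where $A = 1 + \lambda(1-\alpha)$ and $B = \alpha - \lambda(1-\alpha)$. For either $A = 0$ or $B = 0$, it is defined by the corresponding continuous limit of the divergences (see Ghosh et al., 2013, for details). Again, this large family of divergences can be written in the form of Equation (1) with
$$D(a, b) = \frac{1}{A} b^{1+\alpha} - \frac{1+\alpha}{AB} b^B a^A + \frac{1}{B} a^{1+\alpha}.$$
Then we have
$$M(y; \theta) = -(1+\alpha) u_\theta(y) f_\theta^B(y) g^{A-1}(y), \qquad \xi(\theta) = -(1+\alpha) \int u_\theta f_\theta^B g^A\, d\mu,$$
$$N(\theta) = \frac{1+\alpha}{A} \left[A \int u_\theta u_\theta^T f_\theta^{1+\alpha} d\mu + \int (i_\theta - B u_\theta u_\theta^T)(g^A - f_\theta^A) f_\theta^B\, d\mu\right].$$
We then get the influence function of the minimum S-divergence functional $T_{(\alpha,\lambda)}$ from Equation (3) as
$$IF(y, T_{(\alpha,\lambda)}, G) = \left[A \int u_{\theta_g} u_{\theta_g}^T f_{\theta_g}^{1+\alpha} d\mu + \int (i_{\theta_g} - B u_{\theta_g} u_{\theta_g}^T)(g^A - f_{\theta_g}^A) f_{\theta_g}^B\, d\mu\right]^{-1} A \left[u_{\theta_g}(y) f_{\theta_g}^B(y) g^{A-1}(y) - \int u_{\theta_g} f_{\theta_g}^B g^A\, d\mu\right],$$
which is again exactly the same as that obtained in Ghosh et al. (2013). For the special case $g = f_{\theta_0}$, this influence function coincides with that of the density power divergence given by Equation (20). Finally, the influence function of the restricted minimum S-divergence estimator under the restrictions $h(\theta) = 0$ can be derived from Theorem 3.1. It is then easy to see that, at the model $g = f_{\theta_0}$, the influence function of the restricted minimum S-divergence estimator coincides with that of the restricted minimum density power divergence estimator derived in Equation (21).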
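The coincidence with the density power divergence at the model can be seen in one step. The following sketch assumes the form of the minimum S-divergence influence function given in Ghosh et al. (2013), with $A + B = 1 + \alpha$ and $A \neq 0$:

```latex
\begin{align*}
g = f_{\theta_0} \;\Longrightarrow\;
& f_{\theta_0}^{B}\, g^{A-1} = f_{\theta_0}^{A+B-1} = f_{\theta_0}^{\alpha}, \qquad
  f_{\theta_0}^{B}\, g^{A} = f_{\theta_0}^{1+\alpha}, \qquad
  g^{A} - f_{\theta_0}^{A} = 0, \\
IF(y, T_{(\alpha,\lambda)}, F_{\theta_0})
&= \left[ A \int u_{\theta_0} u_{\theta_0}^T f_{\theta_0}^{1+\alpha}\, d\mu \right]^{-1}
   A \left[ u_{\theta_0}(y) f_{\theta_0}^{\alpha}(y) - \int u_{\theta_0} f_{\theta_0}^{1+\alpha}\, d\mu \right] \\
&= \left[ \int u_{\theta_0} u_{\theta_0}^T f_{\theta_0}^{1+\alpha}\, d\mu \right]^{-1}
   \left[ u_{\theta_0}(y) f_{\theta_0}^{\alpha}(y) - \int u_{\theta_0} f_{\theta_0}^{1+\alpha}\, d\mu \right].
\end{align*}
```

The factor $A$ cancels, so at the model the influence function depends on $\alpha$ only and not on $\lambda$.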

Conclusion
This work presents the derivation of the influence function of restricted and unrestricted minimum divergence estimators for a general class of density-based divergences. It will help researchers to derive the robustness properties of any minimum divergence estimator under a variety of restrictions on the parameters. As examples, we have examined some popular minimum divergence estimators, namely those based on the disparity, density power divergence and S-divergence families; we have also presented an example with a set of linearly dependent restrictions for a general model family. Further, this paper suggests several directions for future work, including the influence function of more general classes of divergences, possibly based on the distribution functions; the author hopes to address these problems in subsequent research.