UvA-DARE (Digital Academic Repository)

Semiparametrically efficient estimation of constrained Euclidean parameters

Abstract: Consider a quite arbitrary (semi)parametric model for i.i.d. observations with a Euclidean parameter of interest, and assume that an asymptotically (semi)parametrically efficient estimator of it is given. If the parameter of interest is known to lie on a general surface (the image of a continuously differentiable vector-valued function), we have a submodel in which this constrained Euclidean parameter may be rewritten in terms of a lower-dimensional Euclidean parameter of interest. An estimator of this underlying parameter is constructed based on the given estimator of the original Euclidean parameter, and it is shown to be (semi)parametrically efficient. It is proved that the efficient score function for the underlying parameter is determined by the efficient score function for the original parameter and the Jacobian of the function defining the general surface, via a chain rule for score functions. Efficient estimation of the constrained Euclidean parameter itself is considered as well. Our general estimation method is applied to location-scale, Gaussian copula and semiparametric regression models, and to parametric models.


Introduction
Let X_1, ..., X_n be i.i.d. copies of X taking values in the measurable space (X, A) in a semiparametric model with Euclidean parameter θ ∈ Θ, where Θ is an open subset of R^k. We denote this semiparametric model by

    P = {P_{θ,G} : θ ∈ Θ, G ∈ G}. (1.1)

Typically, the nuisance parameter space G is a subset of a Banach or Hilbert space. This space may also be finite dimensional, thus resulting in a parametric model. We assume that an asymptotically efficient estimator θ̂_n = θ̂_n(X_1, ..., X_n) of the parameter of interest θ is given, which under regularity conditions means that θ̂_n is asymptotically linear,

    √n (θ̂_n − θ) = n^{−1/2} Σ_{i=1}^n ℓ̃(X_i; θ, G, P) + o_P(1), (1.2)

in the efficient influence function

    ℓ̃(·; θ, G, P) = I^{−1}(θ, G, P) ℓ̇(·; θ, G, P), (1.3)

where ℓ̇(·; θ, G, P) is the corresponding efficient score function at P_{θ,G} for estimation of θ within P and I(θ, G, P) the corresponding efficient Fisher information matrix. The topic of this paper is asymptotically efficient estimation when it is known that θ lies on a general surface, or equivalently, when it is known that θ is determined by a lower-dimensional parameter via a continuously differentiable function, which we denote by

    θ = f(ν). (1.4)

Here f : N ⊂ R^d → R^k with d < k is known, N is open, the Jacobian

    ḟ(ν) = (∂f_i(ν)/∂ν_j)_{i=1,...,k; j=1,...,d} (1.5)

of f is assumed to be of full rank on N, and ν is the unknown d-dimensional parameter to be estimated. Thus, we focus on the (semi)parametric model

    Q = {P_{f(ν),G} : ν ∈ N, G ∈ G} ⊂ P. (1.6)

In order for ν to be identifiable we have to assume that f(·) is invertible; note that θ itself is identifiable, as it is assumed that it can be estimated efficiently.
The first main result of this paper is that a semiparametrically efficient estimator of ν, the parameter of interest, has to be asymptotically linear with efficient score function for estimation of ν equal to

    ℓ̇(·; ν, G, Q) = ḟ^T(ν) ℓ̇(·; θ, G, P). (1.7)

Such a semiparametrically efficient estimator of the parameter of interest can be defined in terms of f(·) and the efficient estimator θ̂_n of θ; see equation (4.1) in Section 4. This is our second main result. How (1.7) is related to the chain rule for differentiation will be explained in Section 2, which proves this chain rule for score functions within regular parametric (sub)models. The semiparametric lower bound for estimators of ν is obtained via the Hájek-LeCam Convolution Theorem for regular parametric models, without projection techniques, in Section 3. In Section 4 efficient estimators within Q of ν and θ = f(ν) are constructed. The generality of our results facilitates the analysis of numerous statistical models. We discuss several such parametric and semiparametric models and related literature in Section 5.
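The chain rule (1.7) can be checked numerically on a toy parametric model. The following sketch uses a hypothetical constraint f(ν) = (ν, ν²)^T (not from the paper) and the model X ~ N(θ, I_2), whose score for θ is x − θ; it compares ḟ^T(ν) ℓ̇(x; f(ν)) with a finite-difference derivative of the log-density along the curve.

```python
import numpy as np

# Chain rule for score functions on a toy model: X ~ N(theta, I_2) with
# score l_dot(x; theta) = x - theta, constrained to the hypothetical
# curve theta = f(nu) = (nu, nu^2)^T.

def log_density(x, theta):
    return -0.5 * np.sum((x - theta) ** 2)  # up to an additive constant

def f(nu):
    return np.array([nu, nu ** 2])

def f_dot(nu):
    return np.array([1.0, 2.0 * nu])  # Jacobian of f, here a 2x1 vector

x = np.array([0.3, -1.2])
nu0 = 0.7

# Score for nu via the chain rule (1.7): f_dot(nu)^T l_dot(x; f(nu)).
score_chain = f_dot(nu0) @ (x - f(nu0))

# Score for nu by numerical differentiation of the log-density.
h = 1e-6
score_num = (log_density(x, f(nu0 + h)) - log_density(x, f(nu0 - h))) / (2 * h)

assert abs(score_chain - score_num) < 1e-4
```

Both computations agree up to discretization error, which is the content of (1.7) for this one-dimensional ν.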

Technicalities are collected in Appendices A and B.
Several examples of estimation under constraints for nonparametric models have been studied in [2]. The efficient influence function for such a constrained nonparametric model is determined by projecting the efficient influence function for the unconstrained model onto the tangent space of the constrained model; see e.g. Example 3.3.3 of [2]. The constraints are formulated via equations that the distributions should satisfy. Some such equations for distributions can be reformulated for semiparametric models as equations for the Euclidean parameter. For semiparametric models constrained by such equations, the efficient influence function can also be determined by projection of the efficient influence function for the unconstrained semiparametric model, as we will show in a companion paper; see [12]. Many semiparametric models with constraints given by equations the Euclidean parameters should satisfy can be reparametrized as in (1.6), but not all. Likewise, not all submodels (1.6) can be phrased via equations. Simple counterexamples proving these claims are given in the companion paper. In these counterexamples the condition that N be open is crucial. This looks like a minor feature. However, in asymptotic statistics one typically assumes the parameter space to be open in order to avoid (interesting) pathologies at the boundary.
Therefore the topics of the present paper and its companion do not coincide completely. We do not use projection techniques in the present paper, but base our approach directly on the concept of a least favorable submodel and on the parametric version of the Hájek-LeCam convolution theorem, not on its generalization to the semiparametric situation as given by Theorems 3.3.2 and 3.4.1 of [2] and by Theorem 25.20 of [19]. This new approach is well suited to the formulation of the main Theorem 3.1 and it makes our proofs elementary and quite straightforward.
Most, if not all, papers on estimation in constrained parametric models focus on constrained (or restricted) maximum likelihood estimation implemented via Lagrange multipliers. The first paper on this subject seems to be [1]. Another early treatise related to the theme of the present paper for the parametric case is [15]. This book studies classical and Bayesian estimation for parametric models under constraints in terms of equalities and inequalities.
The topic of the present paper should not be confused with estimation of the parameter θ when it is known to lie in a subset with nonempty interior of the original parameter space. This situation corresponds to d = k with f(·) the identity and ν = θ in (1.4)–(1.6). If an asymptotically efficient estimator θ̂_n is given for the unconstrained model, this estimator is also asymptotically efficient within the constrained model, as N is open and hence θ̂_n takes values in N with probability tending to 1. A comprehensive treatment of finite sample estimation problems with N a proper subset of Θ of the same dimension may be found in [20].

The chain rule for score functions
The basic building block for the asymptotic theory of semiparametric models as presented in e.g. [2] is the concept of a regular parametric model. Let P_Θ = {P_θ : θ ∈ Θ} with Θ ⊂ R^k open be a parametric model with all P_θ dominated by a σ-finite measure μ on (X, A). Denote the density of P_θ with respect to μ by p(θ) = p(·; θ, P_Θ) and the L_2(μ)-norm by ‖·‖_μ. If for each θ_0 ∈ Θ there exists a k-dimensional column vector ℓ̇(θ_0, P_Θ) of elements of L_2(P_{θ_0}), the so-called score function, such that the Fréchet differentiability

    ‖√p(θ) − √p(θ_0) − ½ (θ − θ_0)^T ℓ̇(θ_0, P_Θ) √p(θ_0)‖_μ = o(|θ − θ_0|)

holds and the k × k Fisher information matrix

    I(θ_0, P_Θ) = ∫ ℓ̇(θ_0, P_Θ) ℓ̇^T(θ_0, P_Θ) dP_{θ_0}

is nonsingular, and, moreover, the map θ ↦ ℓ̇(θ, P_Θ) √p(θ) from Θ to L_2^k(μ) is continuous, then P_Θ is called a regular parametric model. Often the score function may be determined by computing the logarithmic derivative of the density with respect to θ; cf. Proposition 2.1.1 of [2]. We will call P from (1.1) a regular semiparametric model if for all G ∈ G the parametric model P_{Θ,G} = {P_{θ,G} : θ ∈ Θ} is a regular parametric model.
Proposition 2.1 is also valid for parametric models, as may be seen by choosing G finite dimensional or even degenerate. The basic version of the chain rule for score functions is for such a parametric model P_Θ. We have chosen the more elaborate formulation of Proposition 2.1 since we are going to apply the chain rule to such parametric submodels P_ψ of semiparametric models P.

Convolution theorem and main result
An estimator θ̂_n of θ within the regular semiparametric model P is called (locally) regular at P_0 = P_{θ_0,G_0} if it is (locally) regular at P_0 within P_ψ for all regular parametric submodels P_ψ of P contained in P_{Θ,G_0}. According to the Hájek-LeCam Convolution Theorem for regular parametric models (see e.g. Section 2.3, in particular the Note on page 27, of [2]) this implies that for such a regular estimator θ̂_n of θ within P the normed estimation error √n(θ̂_n − θ_0) has a limit distribution under P_0 that is the convolution of a normal distribution with mean 0 and covariance matrix I^{−1}(θ_0, P_ψ) and another distribution, for any regular parametric submodel P_ψ containing P_0. If there exists ψ = ψ_0 such that this last distribution is degenerate at 0, we call θ̂_n (locally) efficient at P_0 and P_{ψ_0} a least favorable parametric submodel for estimation of θ within P at P_0. Then the Hájek-LeCam Convolution Theorem also implies that θ̂_n is asymptotically linear in the efficient influence function ℓ̃(θ_0, P_{ψ_0}) = I^{−1}(θ_0, P_{ψ_0}) ℓ̇(θ_0, P_{ψ_0}).

The argument above can be extended to the more general situation in which there exists a least favorable sequence of parametric submodels indexed by ψ_j, j = 1, 2, ..., such that the corresponding score functions ℓ̇(θ_0, P_{ψ_j}) for θ at θ_0 within model P_{ψ_j} converge in L_2^k(P_0) to ℓ̇(θ_0, G_0, P) = ℓ̇(·; θ_0, G_0, P), say. A regular estimator θ̂_n of θ within P is then called efficient if it is asymptotically linear,

    √n (θ̂_n − θ_0) = n^{−1/2} Σ_{i=1}^n ℓ̃(X_i; θ_0, G_0, P) + o_P(1), (3.2)

with efficient influence function ℓ̃(θ_0, G_0, P) = ℓ̃(·; θ_0, G_0, P) satisfying

    ℓ̃(·; θ_0, G_0, P) = I^{−1}(θ_0, G_0, P) ℓ̇(·; θ_0, G_0, P), I(θ_0, G_0, P) = E_{P_0}[ℓ̇ ℓ̇^T(X; θ_0, G_0, P)]. (3.3)

Indeed, by the Convolution Theorem for regular parametric models the convergence in distribution

    (√n(θ̂_n − θ_0) − n^{−1/2} Σ_{i=1}^n ℓ̃(X_i; θ_0, P_{ψ_j}), n^{−1/2} Σ_{i=1}^n ℓ̃(X_i; θ_0, P_{ψ_j}))^T → (R_{P,j}^T, Z_{P,j}^T)^T (3.4)

holds with the k-vectors R_{P,j} and Z_{P,j} independent and Z_{P,j} normal with mean 0 and covariance matrix I^{−1}(θ_0, P_{ψ_j}).
Taking limits as j → ∞ we see, by tightness arguments and by the convergence of ℓ̇(θ_0, P_{ψ_j}) to ℓ̇(θ_0, G_0, P) in L_2^k(P_0), that

    (√n(θ̂_n − θ_0) − n^{−1/2} Σ_{i=1}^n ℓ̃(X_i; θ_0, G_0, P), n^{−1/2} Σ_{i=1}^n ℓ̃(X_i; θ_0, G_0, P))^T → (R_P^T, Z_P^T)^T (3.5)

holds with R_P and Z_P independent k-vectors and Z_P normally distributed with mean 0 and covariance matrix I^{−1}(θ_0, G_0, P). To be more precise, consider the difference between the left hand sides of (3.5) and (3.4). The second half of this vector of differences equals minus the first half. Both halves converge in distribution by the central limit theorem to a normal distribution with mean 0 and covariance matrix

    E_{P_0}[(ℓ̃(X; θ_0, G_0, P) − ℓ̃(X; θ_0, P_{ψ_j}))(ℓ̃(X; θ_0, G_0, P) − ℓ̃(X; θ_0, P_{ψ_j}))^T]. (3.6)

This implies that the vector of differences is tight and, as the left hand side of (3.4) is tight, that the left hand side of (3.5) is tight as well. Let (R_P^T, Z_P^T)^T be a limit point of the left hand side of (3.5). As (3.6) converges to 0 for j → ∞, this limit point is also the limit in distribution of (R_{P,j}^T, Z_{P,j}^T)^T. Consequently, all limit points (R_P^T, Z_P^T)^T have the same distribution. By the independence of R_{P,j} and Z_{P,j} for all j we obtain the independence of R_P and Z_P, and hence (3.5) holds with R_P and Z_P independent. This independence turns (3.5) into a convolution theorem. If R_P is degenerate at 0, then θ̂_n is locally asymptotically efficient at P_0 within P and the sequence of regular parametric submodels P_{ψ_j} is least favorable indeed.

Now, let us assume that such a least favorable sequence and efficient estimator θ̂_n exist at P_0 = P_{θ_0,G_0} with θ_0 = f(ν_0) and f(·) from (1.4) and (1.5) continuously differentiable. By the chain rule for score functions from Proposition 2.1 the score function ℓ̇(ν_0, Q_{ψ_j}) for ν at ν_0 within Q_{ψ_j} satisfies

    ℓ̇(ν_0, Q_{ψ_j}) = ḟ^T(ν_0) ℓ̇(θ_0, P_{ψ_j}),

and hence the corresponding influence function ℓ̃(ν_0, Q_{ψ_j}) satisfies

    ℓ̃(ν_0, Q_{ψ_j}) = (ḟ^T(ν_0) I(θ_0, P_{ψ_j}) ḟ(ν_0))^{−1} ḟ^T(ν_0) ℓ̇(θ_0, P_{ψ_j}),

which converges in L_2^d(P_0) to the efficient influence function

    ℓ̃(·; ν_0, G_0, Q) = (ḟ^T(ν_0) I(θ_0, G_0, P) ḟ(ν_0))^{−1} ḟ^T(ν_0) ℓ̇(·; θ_0, G_0, P). (3.11)

Let ν̂_n be a locally regular estimator of ν at P_0 within the regular semiparametric model Q. By the Convolution Theorem for regular parametric models the convergence in distribution

    (√n(ν̂_n − ν_0) − n^{−1/2} Σ_{i=1}^n ℓ̃(X_i; ν_0, Q_{ψ_j}), n^{−1/2} Σ_{i=1}^n ℓ̃(X_i; ν_0, Q_{ψ_j}))^T → (R_{Q,j}^T, Z_{Q,j}^T)^T

holds with the d-vectors R_{Q,j} and Z_{Q,j} independent and Z_{Q,j} normal with mean 0 and covariance matrix (ḟ^T(ν_0) I(θ_0, P_{ψ_j}) ḟ(ν_0))^{−1}, and the argument leading to (3.5) yields the convolution theorem

    (√n(ν̂_n − ν_0) − n^{−1/2} Σ_{i=1}^n ℓ̃(X_i; ν_0, G_0, Q), n^{−1/2} Σ_{i=1}^n ℓ̃(X_i; ν_0, G_0, Q))^T → (R_Q^T, Z_Q^T)^T (3.12)

with R_Q and Z_Q independent.
Note that Z_Q has a normal distribution with mean 0 and covariance matrix

    (ḟ^T(ν_0) I(θ_0, G_0, P) ḟ(ν_0))^{−1}. (3.13)

Under an additional condition on f(·) we shall construct an estimator ν̂_n of ν based on θ̂_n for which R_Q is degenerate. This construction of ν̂_n will be given in the next section, together with a proof of its efficiency, and this will complete the proof of our main result, formulated as follows.
Theorem 3.1. Let P from (1.1) be a regular semiparametric model with P_0 = P_{θ_0,G_0} ∈ P, θ_0 = f(ν_0), and f(·) from (1.4) and (1.5) continuously differentiable. Furthermore, let f(·) have an inverse on f(N) that is differentiable with a bounded Jacobian. If there exist a least favorable sequence of regular parametric submodels P_{ψ_j} and an asymptotically efficient estimator θ̂_n of θ satisfying (3.5) with R_P = 0 a.s., then there exist a least favorable sequence of regular parametric submodels Q_{ψ_j} of the restricted model Q from (1.6) and an asymptotically efficient estimator ν̂_n of ν taking values in N, satisfying (3.12) with R_Q = 0 a.s. and attaining the asymptotic information bound (3.13).
Note that (3.11) and (3.12) with R_Q = 0 a.s. imply the chain rule for score functions as formulated in (1.7).

Efficient estimator of the parameter of interest
For many specific types of (semi)parametric problems methods to construct efficient estimators have been devised. A general approach is upgrading a √n-consistent estimator as in Sections 2.5 (the parametric case) and 7.8 (the general case) of [2]. A somewhat different upgrading approach is used in the following construction.
Theorem 4.1. Consider the situation of Theorem 3.1, where θ̂_n is an efficient estimator of θ within the model P. If the symmetric positive definite k × k-matrix Î_n is a consistent estimator of I(θ, G, P) within P and ν̄_n is a √n-consistent estimator of ν within Q, then

    ν̂_n = ν̄_n + (ḟ^T(ν̄_n) Î_n ḟ(ν̄_n))^{−1} ḟ^T(ν̄_n) Î_n (θ̂_n − f(ν̄_n)) (4.1)

is efficient, i.e., it satisfies (3.12) with R_Q = 0 a.s.
Proof. The continuity of ḟ(·) and the consistency of ν̄_n and Î_n imply that

    K̂_n = (ḟ^T(ν̄_n) Î_n ḟ(ν̄_n))^{−1} ḟ^T(ν̄_n) Î_n (4.2)

converges in probability under P_0 to

    K_0 = (ḟ^T(ν_0) I(θ_0, G_0, P) ḟ(ν_0))^{−1} ḟ^T(ν_0) I(θ_0, G_0, P). (4.3)

This means that K̂_n consistently estimates K_0. In view of (4.1), (3.11), (3.3), and (3.5) with R_P = 0 we obtain

    √n(ν̂_n − ν_0) − n^{−1/2} Σ_{i=1}^n ℓ̃(X_i; ν_0, G_0, Q)
    = [√n(ν̄_n − ν_0) − K̂_n √n(f(ν̄_n) − f(ν_0))] + (K̂_n − K_0) n^{−1/2} Σ_{i=1}^n ℓ̃(X_i; θ_0, G_0, P) + o_P(1). (4.4)

By the consistency of K̂_n the second term at the right hand side of (4.4) converges to 0 in probability under P_0 in view of the central limit theorem. Because f(ν̄_n) = f(ν_0) + ḟ(ν_0)(ν̄_n − ν_0) + o_P(‖ν̄_n − ν_0‖) holds and K_0 ḟ(ν_0) equals the d × d identity matrix, the first part of the right hand side of (4.4) also converges to 0 in probability under P_0.
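The one-step construction (4.1) can be sketched in code. The example below is a simulation under assumed ingredients, not the paper's general setting: X ~ N(f(ν), I_2) with the hypothetical curve f(ν) = (ν, ν²)^T, the sample mean as efficient estimator θ̂_n, Î_n = I_2, and the first coordinate of θ̂_n as the √n-consistent preliminary estimator.

```python
import numpy as np

# One-step construction (4.1) for X ~ N(f(nu), I_2) with the hypothetical
# curve f(nu) = (nu, nu^2)^T; theta_hat is the sample mean (efficient for
# theta), I_hat = I_2, and nu_bar a root-n consistent preliminary estimator.

rng = np.random.default_rng(0)
nu_true = 0.7
f = lambda nu: np.array([nu, nu ** 2])
f_dot = lambda nu: np.array([[1.0], [2.0 * nu]])  # k x d Jacobian, k=2, d=1

n = 4000
X = rng.standard_normal((n, 2)) + f(nu_true)
theta_hat = X.mean(axis=0)          # efficient estimator of theta in P
I_hat = np.eye(2)                   # consistent estimator of I(theta, G, P)
nu_bar = np.array([theta_hat[0]])   # preliminary root-n consistent estimator

# nu_hat = nu_bar + (f_dot^T I f_dot)^{-1} f_dot^T I (theta_hat - f(nu_bar))
D = f_dot(nu_bar[0])
K = np.linalg.solve(D.T @ I_hat @ D, D.T @ I_hat)
nu_hat = nu_bar + K @ (theta_hat - f(nu_bar[0]))

# nu_hat pools the information in both coordinates of theta_hat.
assert abs(nu_hat[0] - nu_true) < 0.1
```

The matrix K computed here is exactly the K̂_n of (4.2), specialized to this toy model.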
To prove Theorem 3.1 with the help of Theorem 4.1 we will construct a √n-consistent estimator ν̄_n of ν and subsequently a consistent estimator Î_n of I(θ, G, P). Let ‖·‖ be a Euclidean norm on R^k. We choose ν̄_n in such a way that

    ‖θ̂_n − f(ν̄_n)‖ ≤ inf_{ν ∈ N} ‖θ̂_n − f(ν)‖ + n^{−1} (4.5)

holds. Of course, if the infimum is attained, we choose ν̄_n as the minimizer. There are many numerical optimization techniques that will yield a ν̄_n satisfying (4.5). By the triangle inequality and the √n-consistency of θ̂_n we obtain

    ‖f(ν̄_n) − f(ν_0)‖ ≤ ‖f(ν̄_n) − θ̂_n‖ + ‖θ̂_n − f(ν_0)‖ ≤ 2 ‖θ̂_n − f(ν_0)‖ + n^{−1} = O_P(n^{−1/2}). (4.6)

The assumption from Theorem 3.1 that f(·) has an inverse on f(N) that is differentiable with a bounded Jacobian suffices to conclude that (4.6) guarantees √n-consistency of ν̄_n.
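A minimal sketch of the minimum-distance choice of ν̄_n: a crude grid search over N that (nearly) minimizes ‖θ̂_n − f(ν)‖, again for the hypothetical curve f(ν) = (ν, ν²)^T and a made-up value of θ̂_n.

```python
import numpy as np

# Preliminary estimator via (4.5): pick nu_bar (nearly) minimizing
# ||theta_hat - f(nu)|| over N. A crude grid search suffices as a sketch;
# f(nu) = (nu, nu^2)^T is a hypothetical constraint surface.

f = lambda nu: np.array([nu, nu ** 2])
theta_hat = np.array([0.68, 0.52])   # pretend efficient estimate of theta

grid = np.linspace(-2.0, 2.0, 400001)   # step 1e-5 over N = (-2, 2)
dists = np.hypot(theta_hat[0] - grid, theta_hat[1] - grid ** 2)
nu_bar = grid[np.argmin(dists)]

# The true minimizer solves nu^3 - 0.02 nu - 0.34 = 0, i.e. nu ~ 0.70751.
assert abs(nu_bar - 0.70751) < 1e-3
```

In practice any numerical optimizer achieving the slack in (4.5) would do; the grid is only the simplest device that visibly satisfies it.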
In regular parametric models without nuisance parameters any consistent estimator of θ yields a consistent estimator of the Fisher information matrix I(θ, P) = I(θ, G, P) by substitution. Typically, in regular semiparametric models the construction of an efficient estimator is accompanied by an estimator of the efficient influence function, which can be transformed straightforwardly into a consistent estimator of the Fisher information matrix. Nevertheless, in order to formally complete the proof of Theorem 3.1 we shall construct a consistent estimator of the Fisher information matrix based on the given efficient estimation method θ̂_n alone, although this estimator will probably have little practical value. In constructing this estimator we split the sample into blocks as follows. Let (k_n), (ℓ_n), and (m_n) be sequences of integers such that k_n = ℓ_n m_n, k_n/n → κ, 0 < κ < 1, and ℓ_n → ∞, m_n → ∞ hold as n → ∞. Such sequences of integers exist. For j = 1, ..., ℓ_n let θ̂_{n,j} be the efficient estimator of θ based on the observations X_{(j−1)m_n+1}, ..., X_{j m_n}, and let θ̂_{n,0} be the efficient estimator of θ based on the remaining observations X_{k_n+1}, ..., X_n. Consider the "empirical" characteristic function

    φ̂_n(t) = ℓ_n^{−1} Σ_{j=1}^{ℓ_n} exp(i t^T √m_n (θ̂_{n,j} − θ̂_{n,0})), t ∈ R^k, (4.7)

which we rewrite as

    φ̂_n(t) = exp(−i t^T √m_n (θ̂_{n,0} − θ_0)) ℓ_n^{−1} Σ_{j=1}^{ℓ_n} exp(i t^T √m_n (θ̂_{n,j} − θ_0)) = exp(−i t^T √m_n (θ̂_{n,0} − θ_0)) φ̃_n(t), say. (4.8)

In view of m_n/(n − k_n) → 0 and (3.5) with R_P = 0 a.s. we see that the first factor at the right hand side of (4.8) converges to 1 as n → ∞. The efficiency of θ̂_n in (3.5) with R_P = 0 a.s. also implies

    √m_n (θ̂_{n,j} − θ_0) → Z_P in distribution (4.9)

as n → ∞, with Z_P normally distributed with mean 0 and covariance matrix I^{−1}(θ_0, G_0, P). Some computation shows

    Var(φ̃_n(t)) ≤ ℓ_n^{−1} → 0. (4.10)

It follows by Chebyshev's inequality that φ̃_n(t), and hence φ̂_n(t), converges under P_0 = P_{θ_0,G_0} to the characteristic function of Z_P at t,

    E exp(i t^T Z_P) = exp(−½ t^T I^{−1}(θ_0, G_0, P) t). (4.11)

For every t ∈ R^k we obtain

    −2 log |φ̂_n(t)| → t^T I^{−1}(θ_0, G_0, P) t in probability. (4.12)

Choosing k(k + 1)/2 appropriate values of t we may obtain from (4.12) an estimator of I^{−1}(θ_0, G_0, P) and hence of I(θ_0, G_0, P).
Indeed, with t equal to the unit vectors u_i we obtain estimators of the diagonal elements of I^{−1}(θ_0, G_0, P), and an estimator of its (i, j) element is obtained via

    u_i^T I^{−1}(θ_0, G_0, P) u_j = ½ [(u_i + u_j)^T I^{−1}(θ_0, G_0, P)(u_i + u_j) − u_i^T I^{−1}(θ_0, G_0, P) u_i − u_j^T I^{−1}(θ_0, G_0, P) u_j].

When needed, the resulting estimator of I(θ_0, G_0, P) can be made positive definite by changing appropriate components of it by an asymptotically negligible amount, while the symmetry is maintained. Under a mild uniform integrability condition it has been shown in [11] that the existence of an efficient estimator θ̂_n of θ in P implies the existence of a consistent and √n-unbiased estimator of the efficient influence function ℓ̃(·; θ, G, P). Basing this estimator on one half of the sample and averaging this estimated efficient influence function over the observations from the other half of the sample, we could have constructed another estimator of the efficient Fisher information. However, this estimator would have been more involved and, moreover, it needs this extra uniformity condition.
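The split-sample characteristic-function construction can be illustrated by simulation. The sketch below assumes the simplest possible setting, which is not the paper's general one: a 1-dimensional model X ~ N(θ, σ²) in which the efficient estimator of θ is the sample mean, so that I^{−1} = σ² should be recovered from −2 log |φ̂_n(t)| / t².

```python
import numpy as np

# Split-sample characteristic-function estimator of I^{-1}(theta, G, P),
# sketched for the 1-dimensional model X ~ N(theta, sigma^2), where the
# efficient estimator of theta is the sample mean and I^{-1} = sigma^2.

rng = np.random.default_rng(1)
theta0, sigma = 1.0, 1.5
n = 200_000
m_n, l_n = 100, 1500                # block size m_n and number of blocks l_n
k_n = m_n * l_n                     # k_n / n -> kappa = 0.75 in (0, 1)
X = rng.normal(theta0, sigma, n)

# theta_hat on each of the l_n blocks, and on the remaining n - k_n points.
blocks = X[:k_n].reshape(l_n, m_n).mean(axis=1)
theta_rest = X[k_n:].mean()

t = 1.0
phi = np.mean(np.exp(1j * t * np.sqrt(m_n) * (blocks - theta_rest)))
inv_info = -2.0 * np.log(np.abs(phi)) / t ** 2   # estimates I^{-1} = sigma^2

assert abs(inv_info - sigma ** 2) < 0.6
```

As the proof indicates, this estimator is mainly of theoretical interest; its Monte Carlo error decays only at the rate ℓ_n^{−1/2}.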
With the help of Theorem 4.1, the estimator ν̄_n of ν from (4.5), and the construction via (4.12) of an estimator Î_n of the efficient Fisher information, we have completed our construction of an efficient estimator ν̂_n, as in (4.1), of ν. This estimator can be turned into an efficient estimator of θ = f(ν) within the model Q from (1.6) by

    θ̌_n = f(ν̂_n) (4.13)

with efficient influence function

    ḟ(ν_0) ℓ̃(·; ν_0, G_0, Q) = ḟ(ν_0) (ḟ^T(ν_0) I(θ_0, G_0, P) ḟ(ν_0))^{−1} ḟ^T(ν_0) ℓ̇(·; θ_0, G_0, P) (4.14)

and asymptotic information bound

    ḟ(ν_0) (ḟ^T(ν_0) I(θ_0, G_0, P) ḟ(ν_0))^{−1} ḟ^T(ν_0). (4.15)

Indeed, according to Section 2.3 of [2], θ̌_n is efficient for estimation of θ under the additional information θ = f(ν).
Remark 4.1. If f(·) is linear, i.e., θ = f(ν) = Lν + α with L a known k × d-matrix of full rank and α a known vector in R^k, then

    ν̄_n = (L^T L)^{−1} L^T (θ̂_n − α) (4.16)

attains the infimum at the right hand side of (4.5). So, the estimator (4.1) becomes

    ν̂_n = (L^T Î_n L)^{−1} L^T Î_n (θ̂_n − α) (4.17)

with efficient influence function (3.11) and asymptotic information bound (3.13) with ḟ(ν_0) = L, and the estimator from (4.13) becomes

    θ̌_n = L ν̂_n + α. (4.18)

Note that θ̌_n is the projection of θ̂_n on the flat {θ ∈ R^k : θ = Lν + α, ν ∈ R^d} under the inner product determined by Î_n (cf. Appendix A) and that the covariance matrix of its limit distribution equals the asymptotic information bound

    L (L^T I(θ_0, G_0, P) L)^{−1} L^T. (4.19)

Another way to describe this submodel Q with θ = Lν + α is by the linear restrictions

    R^T θ = β, θ ∈ Θ, (4.20)

where R^T α = β holds and the k × d-matrix L and the k × (k − d)-matrix R are matching in the sense that the columns of L are orthogonal to those of R and the k × k-matrix (L R) is of rank k. Note that the open subset N of R^d determines the open subset Θ of R^k and vice versa. See [4], [18], [14], and [10] for some examples of estimation under linear restrictions. In terms of the restrictions described by R and β the efficient estimator θ̌_n of θ from (4.18) within the submodel Q can be rewritten as

    θ̌_n = θ̂_n − Î_n^{−1} R (R^T Î_n^{−1} R)^{−1} (R^T θ̂_n − β) (4.21)

with asymptotic information bound

    I^{−1}(θ_0, G_0, P) − I^{−1}(θ_0, G_0, P) R (R^T I^{−1}(θ_0, G_0, P) R)^{−1} R^T I^{−1}(θ_0, G_0, P), (4.22)

as will be proved in Appendix A.
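A small numerical sketch of the restricted estimator in the form involving R and β, as in (4.21): the unrestricted estimate is corrected by an Î_n-weighted projection so that the linear restriction holds exactly. The matrices below are made up for illustration.

```python
import numpy as np

# Restricted efficient estimator under the linear restriction R^T theta = beta,
# in the projection form of (4.21).

I_hat = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])       # estimated efficient information
theta_hat = np.array([1.1, 2.3, -0.4])   # unrestricted efficient estimate
R = np.array([[1.0], [-1.0], [0.0]])     # restriction theta_1 - theta_2 = beta
beta = np.array([-1.0])

I_inv = np.linalg.inv(I_hat)
correction = I_inv @ R @ np.linalg.solve(R.T @ I_inv @ R, R.T @ theta_hat - beta)
theta_check = theta_hat - correction

# The restricted estimate satisfies the restriction exactly.
assert np.allclose(R.T @ theta_check, beta)
```

The correction term vanishes precisely when θ̂_n already satisfies the restriction, so the procedure never moves an estimate that is already in the submodel.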

Examples
In this section we present five examples, which illustrate our construction of (semi)parametrically efficient estimators. We shall discuss location-scale, Gaussian copula, and semiparametric regression models, and parametric models under linear restrictions.

Example 5.1. Multivariate normal with common mean
Let G be the collection of nonsingular k × k-covariance matrices and let the parametric starting model be the collection of nondegenerate normal distributions with mean vector θ and covariance matrix Σ,

    P = {N(θ, Σ) : θ ∈ R^k, Σ ∈ G}.

Efficient estimators of θ and Σ are the sample mean X̄_n = n^{−1} Σ_{i=1}^n X_i and the sample covariance matrix Σ̂_n = (n − 1)^{−1} Σ_{i=1}^n (X_i − X̄_n)(X_i − X̄_n)^T, respectively. Note that X̄_n attains the finite sample Cramér-Rao bound and the asymptotic information bound with I(θ, Σ, P) = Σ^{−1}.
The parametric submodel we consider is the common-mean model with θ = μ 1_k, where 1_k denotes the k-vector with all components equal to 1, so that L = 1_k and α = 0. In view of (4.17) and (3.13),

    μ̂_n = (1_k^T Σ̂_n^{−1} 1_k)^{−1} 1_k^T Σ̂_n^{−1} X̄_n

is an efficient estimator of μ within Q that attains the asymptotic lower bound

    (1_k^T Σ^{−1} 1_k)^{−1}.

In case the covariance matrix Σ is diagonal with its variances denoted by σ_1^2, ..., σ_k^2, we are dealing with the Graybill-Deal model as presented on page 88 of [20]. With X̄_{i,n} = n^{−1} Σ_{j=1}^n X_{j,i}, S_{i,n}^2 = n^{−1} Σ_{j=1}^n (X_{j,i} − X̄_{i,n})^2, and Σ̂_n = diag(S_{1,n}^2, ..., S_{k,n}^2) we obtain the Graybill-Deal estimator

    μ̂_n = (Σ_{i=1}^k S_{i,n}^{−2})^{−1} Σ_{i=1}^k S_{i,n}^{−2} X̄_{i,n}.
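The Graybill-Deal estimator can be sketched in a few lines; the simulation below uses hypothetical variances and checks that the inverse-variance-weighted combination of coordinate means recovers the common mean.

```python
import numpy as np

# Graybill-Deal estimator for the common-mean model with diagonal Sigma:
# each coordinate mean is weighted by the inverse of its sample variance.

rng = np.random.default_rng(2)
mu, sds = 3.0, np.array([0.5, 1.0, 2.0])   # hypothetical common mean, sds
n = 5000
X = mu + rng.standard_normal((n, 3)) * sds  # k = 3 independent coordinates

xbar = X.mean(axis=0)
s2 = X.var(axis=0)                            # divisor n, as in the text
mu_gd = np.sum(xbar / s2) / np.sum(1.0 / s2)  # Graybill-Deal estimator

assert abs(mu_gd - mu) < 0.05
```

Coordinates with small variance dominate the weighted average, which is exactly how the estimator attains the bound (1_k^T Σ^{−1} 1_k)^{−1} in the diagonal case.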
Example 5.2. Location-scale model with known coefficient of variation

Consider a location-scale model with parameter θ = (μ, σ)^T, density σ^{−1} g((x − μ)/σ), efficient estimators μ̂_n and σ̂_n of μ and σ, and efficient Fisher information matrix I = (I_{ij})_{i,j=1,2}. We consider the submodel with the coefficient of variation known to be equal to a given constant c = σ/μ and with ν = μ the parameter of interest. According to Theorem 4.1 the estimator ν̂_n = μ̃_n of μ from (4.1), with ν̄_n = μ̂_n and θ̂_n = (μ̂_n, σ̂_n)^T, is efficient, and some computation shows

    μ̃_n = (I_{11} + 2cI_{12} + c^2 I_{22})^{−1} [(I_{11} + cI_{12}) μ̂_n + (I_{12} + cI_{22}) σ̂_n]. (5.5)

In case the density g(·) is symmetric around 0, the Fisher information matrix is diagonal and μ̃_n from (5.5) becomes

    μ̃_n = (I_{11} + c^2 I_{22})^{−1} [I_{11} μ̂_n + c I_{22} σ̂_n].

In the normal case with g(·) the standard normal density, μ̃_n reduces to

    μ̃_n = (1 + 2c^2)^{−1} (μ̂_n + 2c σ̂_n),

with μ̂_n and σ̂_n equal to e.g. the sample mean and the sample standard deviation, respectively; cf. [8], [5], and [9].
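As a numerical sketch of the normal case: specializing (5.5) with I_12 = 0, I_11 = 1/σ² and I_22 = 2/σ² gives the combination (sample mean + 2c · sample standard deviation)/(1 + 2c²), which the simulation below (with made-up parameter values) illustrates.

```python
import numpy as np

# Normal model with known coefficient of variation c = sigma / mu.
# Specializing (5.5) with I_12 = 0, I_11 = 1/sigma^2, I_22 = 2/sigma^2
# gives mu_hat = (mu_n + 2c * s_n) / (1 + 2c^2), with mu_n and s_n the
# sample mean and standard deviation.

rng = np.random.default_rng(3)
c = 0.5                       # hypothetical known coefficient of variation
mu_true = 2.0
sigma_true = c * mu_true
n = 10_000
X = rng.normal(mu_true, sigma_true, n)

mu_n, s_n = X.mean(), X.std()
mu_hat = (mu_n + 2 * c * s_n) / (1 + 2 * c ** 2)

# Pooling mean and scale information leaves mu_hat close to mu_true.
assert abs(mu_hat - mu_true) < 0.05
```

The scale estimate contributes because knowing c makes σ̂_n/c a second estimator of μ; (5.5) weighs the two by their Fisher information.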

Example 5.3. Gaussian copula models
Let X_1 = (X_{1,1}, ..., X_{1,m})^T, ..., X_n = (X_{n,1}, ..., X_{n,m})^T be i.i.d. copies of X = (X_1, ..., X_m)^T. For i = 1, ..., m, the marginal distribution function of X_i is continuous and will be denoted by F_i. It is assumed that (Φ^{−1}(F_1(X_1)), ..., Φ^{−1}(F_m(X_m)))^T has an m-dimensional normal distribution with mean 0 and positive definite correlation matrix C(θ), where Φ denotes the one-dimensional standard normal distribution function. Here the parameter of interest θ is the vector in R^{m(m−1)/2} that summarizes all correlation coefficients ρ_{rs}, 1 ≤ r < s ≤ m. We take this general Gaussian copula model as our semiparametric starting model P, i.e.,

    P = {P_{θ,G} : θ = (ρ_{12}, ..., ρ_{(m−1)m})^T, G = (F_1(·), ..., F_m(·)) ∈ G}. (5.8)

The unknown continuous marginal distributions are the nuisance parameters collected in G ∈ G. Theorem 3.1 of [13] shows that the normal scores rank correlation coefficient is semiparametrically efficient in P for the 2-dimensional case, with normal marginals with unknown variances constituting a least favorable parametric submodel. As [7] explains at the end of its Section 1 and in its Section 4, its Theorem 4.1 proves that normal marginals with unknown, possibly unequal variances constitute a least favorable parametric submodel also for the general m-dimensional case. Since the maximum likelihood estimators are efficient for the parameters of a multivariate normal distribution, the sample correlation coefficients are efficient for estimation of the correlation coefficients based on multivariate normal observations. But each sample correlation coefficient, and hence its efficient influence function, involves only two components of the multivariate normal observations. Apparently, the other components of the multivariate normal observations carry no information about the value of the respective correlation coefficient.
Effectively, for each correlation coefficient we are in the 2-dimensional case, and invoking again Theorem 3.1 of [13] we see that also in the general m-dimensional case the normal scores rank correlation coefficients are semiparametrically efficient. They are defined as

    ρ̂_{rs}^{(n)} = [Σ_{i=1}^n Φ^{−1}(F̂_r(X_{i,r})) Φ^{−1}(F̂_s(X_{i,s}))] / [Σ_{i=1}^n (Φ^{−1}(i/(n+1)))^2], (5.9)

with F̂_r and F̂_s empirical distribution function estimators of F_r and F_s, respectively, rescaled so as to take values in (0, 1), 1 ≤ r < s ≤ m. The Van der Waerden or normal scores rank correlation coefficient ρ̂_{rs}^{(n)} from (5.9) is a semiparametrically efficient estimator of ρ_{rs} with the corresponding efficient influence function. This means that θ̂_n = (ρ̂_{12}^{(n)}, ..., ρ̂_{(m−1)m}^{(n)})^T is a semiparametrically efficient estimator of θ. For the equicorrelation submodel θ = ρ 1_k with k = m(m−1)/2 we get the efficient Fisher information matrix by simple but tedious calculations (see Appendix B). Each matrix with the components of each column vector adding to 1 has the property that the sum of all row vectors equals the vector with all components equal to 1, and hence the components of each column vector of its inverse also add up to 1. This implies that (4.17) with L = 1_k yields the explicit efficient estimator

    ρ̂_n = (1_k^T Î_n 1_k)^{−1} 1_k^T Î_n θ̂_n, (5.16)

which attains the asymptotic information bound (cf. (3.13))

    (1_k^T I(θ, G, P) 1_k)^{−1}. (5.17)

[7] proved the efficiency of the pseudo-likelihood estimator for ρ in dimension m = 4. [17] extended this result to general m and presented the efficient lower bounds for m = 3 and m = 4 in its Example 5.3. However, its maximum pseudo-likelihood estimator is not as explicit as our (5.16).
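The normal scores rank correlation coefficient of (5.9) is easy to implement with ranks rescaled by n + 1. The sketch below uses the standard library quantile function and checks invariance of the estimate under monotone marginal transformations by simulating from a Gaussian copula; the sample size and correlation value are made up.

```python
import numpy as np
from statistics import NormalDist

# Van der Waerden / normal scores rank correlation (5.9), using ranks
# rescaled by n + 1 and the standard normal quantile function.

def normal_scores_corr(x, y):
    n = len(x)
    ppf = NormalDist().inv_cdf
    scores = np.array([ppf(i / (n + 1)) for i in range(1, n + 1)])
    rx = x.argsort().argsort()       # zero-based ranks of x
    ry = y.argsort().argsort()       # zero-based ranks of y
    return np.sum(scores[rx] * scores[ry]) / np.sum(scores ** 2)

rng = np.random.default_rng(4)
n, rho = 3000, 0.6
Z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
# Monotone marginal transforms leave the copula (and the estimate) intact.
x, y = np.exp(Z[:, 0]), Z[:, 1] ** 3

rho_hat = normal_scores_corr(x, y)
assert abs(rho_hat - rho) < 0.1
```

Because the statistic depends on the data only through the ranks, the heavy-tailed marginals produced by exp and the cubic transform do not affect it, in line with the nuisance parameters being the unknown marginals.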

Example 5.4. Partial spline linear regression
Here the observations are realizations of i.i.d. copies of the random vector X = (Y, Z^T, U^T)^T with Y, Z, and U 1-dimensional, k-dimensional, and p-dimensional random vectors with the structure

    Y = θ^T Z + ψ(U) + ε, (5.24)

where the measurement error ε is independent of Z and U, has mean 0, finite variance, and finite Fisher information for location, and where ψ(·) is a real-valued function on R^p. [16] calls this partially linear additive regression, [2] mentions it as partial spline regression, whereas [3] talks about the partial smoothing spline model. Under the regularity conditions of its Theorem 8.1, [16] presents an efficient estimator of θ and a consistent estimator of I(θ, G, P). Consequently, our Theorem 4.1 may be applied directly in order to obtain an efficient estimator of ν in appropriate submodels with θ = f(ν), without our construction of an estimator of I(θ, G, P) via characteristic functions. Note that for submodels with θ restricted to a linear subspace, θ = Lν say, our approach is not needed, since the reparametrization Y = ν^T L^T Z + ψ(U) + ε brings the estimation problem back to its original form (5.24).

Example 5.5. Restricted maximum likelihood estimator
As mentioned in the Introduction, most papers on estimation in constrained parametric models focus on constrained (or restricted) maximum likelihood estimation implemented via Lagrange multipliers; cf. [1]. Maximum likelihood estimation of the generalized linear model under linear restrictions on the parameters is done in [14] via an iterative procedure using a penalty function. [10] introduces the restricted EM algorithm for maximum likelihood estimation under linear restrictions. Our approach as described in Remark 4.1, with θ̂_n a(n unrestricted) maximum likelihood estimator, avoids such iterative procedures.

Appendix A: Proof of bound subject to linear restriction
In this appendix proofs will be presented of (4.21) and (4.22).
Since Î_n has been chosen to be symmetric and positive definite, x^T Î_n y, x, y ∈ R^k, is an inner product on R^k. Define the k × k-matrices Π_{n,L} and Π_{n,R} by

    Π_{n,L} = L (L^T Î_n L)^{−1} L^T Î_n, Π_{n,R} = Î_n^{−1} R (R^T Î_n^{−1} R)^{−1} R^T.

With the above inner product these matrices are projection matrices onto the linear subspaces spanned by the columns of L and Î_n^{−1} R, respectively. Indeed,

    Π_{n,L} Π_{n,L} = Π_{n,L}, Π_{n,R} Π_{n,R} = Π_{n,R},
    (x − Π_{n,L} x)^T Î_n Π_{n,L} x = 0, x ∈ R^k,
    (y − Π_{n,R} y)^T Î_n Π_{n,R} y = 0, y ∈ R^k,
    Π_{n,L} L x = L x, x ∈ R^d, and Π_{n,R} Î