On the role of parameterization in models with a misspecified nuisance component

Significance. Statistical models are often chosen based on a combination of scientific understanding and flexibility or mathematical convenience. While the aspects of core scientific relevance may be relatively securely specified in terms of interpretable interest parameters, the rest of the formulation is often chosen somewhat arbitrarily. In many statistical models this formulation includes so-called nuisance parameters, which are of no direct subject-matter concern, but are needed to complete the model or reflect the complexity of the data. This paper contributes to the foundations of statistics by studying the interplay between model structure and likelihood inference, under misspecification of the nuisance component.


Supporting Information Text
Models and definitions. We assume that the true distribution has density function $m$, specified in part by the true value $\psi^*$ of an interest parameter $\psi$. The model to be used for inference has likelihood function $L(\psi, \lambda)$, where $\psi$ is the interest parameter and $\lambda$ is a nuisance parameter. We assume the model is misspecified in the sense that there are no values of $\lambda$ in the parameter space for the assumed model for which the true distribution is recovered. The limiting value of the maximum likelihood estimate $(\hat\psi, \hat\lambda)$ obtained by maximizing the assumed likelihood function is also the value which minimizes the Kullback-Leibler divergence between the fitted model and the true distribution, and is written $(\psi_{0m}, \lambda_{0m})$. Our interest is in studying model structure which leads to consistency of $\hat\psi$, i.e. $\psi_{0m} = \psi^*$.
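As a concrete numerical sketch of this setup (a hypothetical example, not taken from the paper: a normal working model with mean $\psi$ and variance $\lambda$ fitted to data from a Student-t distribution), the pseudo-true value $(\psi_{0m}, \lambda_{0m})$ may be computed by minimizing the expected negative log-likelihood under $m$, which is equivalent to minimizing the Kullback-Leibler divergence:

```python
# A minimal sketch (hypothetical models, not from the paper): the limiting
# value of the maximum likelihood estimator under a misspecified model is
# the minimizer of the Kullback-Leibler divergence from the true density m.
import numpy as np
from scipy import integrate, optimize, stats

PSI_TRUE = 1.0  # true value psi* of the interest parameter (a location)

def m_density(y):
    # True density m: Student-t with 5 degrees of freedom, centred at psi*.
    return stats.t.pdf(y, df=5, loc=PSI_TRUE)

def expected_neg_loglik(theta):
    # E_m[-log f(Y; psi, lam)] for an assumed N(psi, lam) working model;
    # minimizing this over (psi, lam) minimizes KL(m || f(.; psi, lam)).
    psi, lam = theta
    if lam <= 0:
        return np.inf
    integrand = lambda y: -stats.norm.logpdf(y, psi, np.sqrt(lam)) * m_density(y)
    return integrate.quad(integrand, -np.inf, np.inf)[0]

res = optimize.minimize(expected_neg_loglik, x0=[0.0, 1.0], method="Nelder-Mead")
psi_0m, lam_0m = res.x
print(f"psi_0m = {psi_0m:.3f} (psi* = {PSI_TRUE}), lam_0m = {lam_0m:.3f}")
# psi_0m = psi* by symmetry, so the interest parameter is consistently
# estimated even though no value of lam recovers the t distribution.
```

In this toy case $\lambda_{0m}$ equals the variance of the t distribution, $5/3$, while $\psi_{0m} = \psi^*$: the nuisance component is misspecified but the interest parameter is still consistently estimated.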
Definition 0.1 (symmetric parametrization). Let $Y_1$ and $Y_0$ be independent random variables with probability measures in $\mathcal{P}_G$ and density functions $f_1$ and $f_0$ respectively. Their joint distribution is said to be parametrized $\psi$-symmetrically with respect to $(\psi, \gamma)$ if $g = g_\psi \in G$ depends only on $\psi$ and if the density functions $f_1$ and $f_0$ relate to $f_U$ by
\[
f_U(u; \gamma)\,du = f_1(gu; g\gamma)\,d(gu) = f_0(g^{-1}u; g^{-1}\gamma)\,d(g^{-1}u).
\]
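As an illustration (an assumed instance, not spelled out in the text above), take $G$ to be the translation group on the real line acting by $g_\psi u = u + \psi/2$, with the same action on $\gamma$. The defining identity then reads
\[
f_U(u; \gamma)\,du = f_1\!\left(u + \tfrac{\psi}{2};\, \gamma + \tfrac{\psi}{2}\right) d\!\left(u + \tfrac{\psi}{2}\right) = f_0\!\left(u - \tfrac{\psi}{2};\, \gamma - \tfrac{\psi}{2}\right) d\!\left(u - \tfrac{\psi}{2}\right),
\]
and, since the Jacobian of a translation is one, $f_1(y; g\gamma) = f_U(g^{-1}y; \gamma)$ and $f_0(y; g^{-1}\gamma) = f_U(gy; \gamma)$: $Y_1$ is distributed as $gU$ and $Y_0$ as $g^{-1}U$, the two components being shifted by $\psi/2$ in opposite directions.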
By definition of $(\psi_{0m}, \lambda_{0m})$ as the minimizer of the Kullback-Leibler divergence, $E_m\{\nabla_\psi \ell(\psi_{0m}, \lambda_{0m})\} = 0$, so that, by the mean value theorem for vector-valued functions,
\[
E_m\{\nabla_\psi \ell(\psi^*, \lambda_{0m})\} = B(\psi_{0m} - \psi^*), \qquad B = -\int_0^1 E_m\{\nabla_{\psi\psi}\ell(\psi^* + t(\psi_{0m} - \psi^*), \lambda_{0m})\}\,dt,
\]
with the integral taken elementwise. Since $B$ is positive definite, $\psi_{0m} = \psi^*$ if and only if $E_m\{\nabla_\psi \ell(\psi^*, \lambda_{0m})\} = 0$.
A Taylor expansion of $\nabla_\psi \ell(\psi^*, \lambda)$ around $\lambda_{0m}$ for fixed $\psi^*$ shows that
\[
E_m\{\nabla_\psi \ell(\psi^*, \lambda)\} = E_m\{\nabla_\psi \ell(\psi^*, \lambda_{0m})\} + A(\lambda - \lambda_{0m}) + O(\|\lambda - \lambda_{0m}\|^2), \tag{1}
\]
where $A = E_m\{\nabla_{\psi\lambda}\ell(\psi^*, \lambda_{0m})\}$. Nullity of the left-hand side of Eq. (1) for general $\lambda$ entails $A = 0$, for which a sufficient condition is $\psi^* \perp_m \Lambda$. Without such orthogonality, positive and negative parts must cancel in the integral defining $A$, which cannot be simultaneously achieved for all $\lambda$, establishing necessity of $\psi^* \perp_m \Lambda$ if $E_m\{\nabla_\psi \ell(\psi^*, \lambda)\} = 0$ is to hold for all $\lambda$.
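Continuing the hypothetical normal-versus-Student-t sketch above, the cross-derivative term can be checked numerically: for the normal working model, $\nabla_{\psi\lambda} \log f(y; \psi, \lambda) = -(y - \psi)/\lambda^2$, whose expectation under a density $m$ symmetric about $\psi^*$ vanishes for every $\lambda$, consistent with orthogonality that is local at $\psi^*$ and global in the nuisance.

```python
# Numerical check (same hypothetical models as in the earlier sketch) that
# A(lam) = E_m[ d^2 log f(Y; psi*, lam) / dpsi dlam ] = E_m[ -(Y - psi*)/lam^2 ]
# vanishes for every lam, i.e. the orthogonality is global in lambda.
import numpy as np
from scipy import integrate, stats

PSI_TRUE = 1.0

def A(lam):
    integrand = lambda y: -(y - PSI_TRUE) / lam**2 * stats.t.pdf(y, df=5, loc=PSI_TRUE)
    return integrate.quad(integrand, -np.inf, np.inf)[0]

for lam in (0.5, 1.0, 5 / 3, 4.0):
    print(f"lam = {lam:.2f}:  A = {A(lam):+.2e}")  # ~0 up to quadrature error
```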
Proof. It is notationally convenient to suppose that the interest parameter $\psi$ is scalar, although the results generalize straightforwardly to higher dimensions. We therefore write $\ell_\psi$ for the derivative of the log-likelihood function with respect to $\psi$ and $\ell_s$ for the derivative with respect to the $s$th component of $\lambda$, and similarly for other quantities. We use the convention that Roman letters appearing both as subscripts and superscripts in the same product are summed. Thus, for the purpose of this calculation, write $(\hat\lambda - \lambda)^s$ for the $s$th component of $\hat\lambda - \lambda$.
Let $\bar\ell = \ell/n$ and $\hat{\bar\ell} = \bar\ell(\hat\psi, \hat\lambda)$, etc. The rescaling by $n$ is immaterial and is used only to aid presentation of the derivation.
Taylor expansion of $\bar\ell_\psi$ around $(\psi^*, \lambda)$ followed by evaluation at $(\hat\psi, \hat\lambda)$, where $\hat{\bar\ell}_\psi = 0$, gives
\[
0 = \bar\ell_\psi + \bar\ell_{\psi\psi}(\hat\psi - \psi^*) + \bar\ell_{\psi s}(\hat\lambda - \lambda)^s + \cdots. \tag{2}
\]
Let $\bar j^{\,rs}$ denote the components of the inverse of the observed information matrix at $(\psi^*, \lambda)$, whose components are $\bar j_{rs} = -\bar\ell_{rs}$, so that $n\bar j_{rs} = -\ell_{rs}$. Inversion of the previous equation as in Barndorff-Nielsen and Cox (1994, p. 149) gives
\[
\hat\psi - \psi^* = \bar j^{\,\psi\psi}\bar\ell_\psi + \bar j^{\,\psi s}\bar\ell_s + O_p(n^{-1}),
\]
provided that the dimension of $\lambda$ is treated as fixed. This is not in general the case when $\lambda = \lambda_{0m}$, meaning that successive replacements in Eq. (2) do not produce terms of decreasing orders of magnitude as in equation (5.19) of Barndorff-Nielsen and Cox (1994).
Suppose that
\[
i^{\psi\psi} g_\psi + i^{\psi s} g_s = 0 \tag{3}
\]
for any $\lambda$, where $g_\psi = E_m(\bar\ell_\psi)$ and $g_s = E_m(\bar\ell_s)$ are the components of the limiting mean score at $(\psi^*, \lambda)$ and $i$ is the corresponding expected information matrix. Again, the second- and higher-order terms in Eq. (2) do not converge to zero in general. The restriction to a neighbourhood of $\lambda_{0m}$ ensures that $\hat\psi$ converges to $\psi_{0m} = \psi^*$ at the same rate as if the model were correctly specified. But if Eq. (3) holds for all $\lambda$, it holds a fortiori for $\lambda$ in a neighbourhood of $\lambda_{0m}$, thereby proving the claim.
In the special case that $\psi$ and $\lambda$ are both scalar parameters, $i^{\psi\psi} = \det^{-1} i_{\lambda\lambda}$ and $i^{\psi\lambda} = -\det^{-1} i_{\psi\lambda}$, where $\det = i_{\psi\psi} i_{\lambda\lambda} - i_{\psi\lambda}^2$, so that the general condition becomes $i_{\lambda\lambda}\, g_\psi = i_{\psi\lambda}\, g_\lambda$.
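Written out, reading the general condition as $i^{\psi\psi} g_\psi + i^{\psi\lambda} g_\lambda = 0$ in Eq. (3), the scalar reduction follows from the explicit inverse of a $2 \times 2$ information matrix:
\[
i^{-1} = \frac{1}{\det}
\begin{pmatrix}
i_{\lambda\lambda} & -i_{\psi\lambda} \\
-i_{\psi\lambda} & i_{\psi\psi}
\end{pmatrix},
\qquad
\det = i_{\psi\psi} i_{\lambda\lambda} - i_{\psi\lambda}^2,
\]
so that
\[
i^{\psi\psi} g_\psi + i^{\psi\lambda} g_\lambda
= \frac{i_{\lambda\lambda}\, g_\psi - i_{\psi\lambda}\, g_\lambda}{\det} = 0
\quad\Longleftrightarrow\quad
i_{\lambda\lambda}\, g_\psi = i_{\psi\lambda}\, g_\lambda.
\]
In the hypothetical normal-location sketch above the condition holds trivially for all $\lambda$: $g_\psi = 0$ by symmetry of $m$ about $\psi^*$, and $i_{\psi\lambda} = 0$ because the mean and variance of a normal model are orthogonal parameters.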
Proof. In Eq. (5), $f(\gamma)$ is an unknown density function for the nuisance parameter. All quantities are, for present purposes, evaluated at the same value of $\psi$, and the superscript is henceforth suppressed.

Next consider the orthogonality condition of Proposition 1.1. The global orthogonality equation in the matched comparison problem is
\[
E_m\{\nabla_{\psi\lambda}\ell(\psi^*, \lambda)\} = 0,
\]
where $m(y_1, y_0) = m(y_1, y_0; \psi^*)$ is as defined in Eq. (6) and $\psi^*$ has been reintroduced for clarity. Since $m$ does not depend on $\lambda$, the expectation may be written as the derivative with respect to $\lambda$ of a double integral over $(y_1, y_0)$ against $m$. The double integral on the right-hand side is, apart from the evaluation at $\psi^*$, that of Eq. (5). Orthogonality at $\psi^*$ thus follows by an identical argument to that establishing Eq. (5). The orthogonality is therefore local at $\psi^*$ and global in $\lambda$, as required in Proposition 1.1.
Since $T_1$ and $T_0$ have a joint distribution that is parametrized $\psi$-symmetrically with respect to $(\psi, \gamma)$, and such that the log-likelihood derivative, when expressed in terms of $u_1 = g^{-1} t_1$ and $u_0 = g t_0$, is antisymmetric, both conditions of the proposition are satisfied.

Suppose further that the analysis is overstratified, so that under the assumed model the log-likelihood function is $\ell(\psi, \lambda)$, with $d_i = K'(\eta_i) - K'(\xi^*_i)$. On letting $D^* = \operatorname{diag}\{K''(\xi^*_1), \ldots, K''(\xi^*_n)\}$, the components of the information matrix $i$ at $(\psi^*, \lambda)$ follow directly, and the relevant components of $i^{-1}$ are those required by Proposition 1.2. If derivatives of $K$ of order higher than two are null, as for linear regression, then $d = D^* W \lambda$ by a Taylor expansion of $K'(\eta)$, so that $i^{\psi\psi} g_\psi = i^{\psi\psi} X^T D^* W \lambda$ and $i^{\psi\lambda} g_\lambda = -i^{\psi\psi} X^T D^* W \lambda$, verifying Proposition 1.2. More generally, however, $d \neq D^* W \lambda$, and the consistency arises instead because $\lambda_{0m} = \lambda^* = 0$, at which point the condition is trivially satisfied.
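The consistency mechanism under overstratification can be illustrated by a simulation. The following sketch uses a hypothetical Neyman-Scott-type matched-pair model, not the paper's exact specification: each pair has its own nuisance mean $\xi_i$, so the number of nuisance parameters grows with $n$, yet the interest parameter (the within-pair difference $\psi$) is estimated consistently while the incidental-parameter effect leaves the variance estimate inconsistent.

```python
# Hypothetical overstratified matched-pair (Neyman-Scott-type) simulation:
# Y_i1 ~ N(xi_i + psi/2, sigma^2), Y_i0 ~ N(xi_i - psi/2, sigma^2),
# with one nuisance mean xi_i per pair. The MLE of psi is the mean pair
# difference (consistent); the MLE of sigma^2 converges to sigma^2/2.
import numpy as np

rng = np.random.default_rng(0)
psi_true, sigma2_true = 1.0, 1.0

for n in (100, 1000, 10000):
    xi = rng.normal(size=n)                     # pair-specific nuisance means
    y1 = rng.normal(xi + psi_true / 2, np.sqrt(sigma2_true))
    y0 = rng.normal(xi - psi_true / 2, np.sqrt(sigma2_true))
    d = y1 - y0
    psi_hat = d.mean()                          # MLE of psi after profiling xi
    sigma2_hat = ((d - psi_hat) ** 2).sum() / (4 * n)  # MLE of sigma^2
    print(f"n={n:6d}  psi_hat={psi_hat:.3f}  sigma2_hat={sigma2_hat:.3f}")
# psi_hat -> psi* = 1 while sigma2_hat -> sigma^2/2 = 0.5: the interest
# parameter remains consistent under overstratification even though the
# nuisance component of the fit does not.
```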