Gradient statistic: higher-order asymptotics and Bartlett-type correction

We obtain an asymptotic expansion for the null distribution function of the gradient statistic for testing composite null hypotheses in the presence of nuisance parameters. The expansion is derived using a Bayesian route based on the shrinkage argument described in Ghosh and Mukerjee (1991). Using this expansion, we propose a Bartlett-type corrected gradient statistic with chi-square distribution up to an error of order o(n^{-1}) under the null hypothesis. We also use the expansion to modify the percentage points of the large-sample reference chi-square distribution. A small Monte Carlo experiment and various examples are presented and discussed.


Introduction
The most common hypothesis tests for large samples are the likelihood ratio (Wilks, 1938), the Wald (Wald, 1943), and the Rao score (Rao, 1948) tests. These tests are widely used in areas such as economics, biology, and engineering, among others, since exact tests are not always available. An alternative test uses the gradient statistic proposed by Terrell (2002). An advantage of the gradient statistic over the Wald and the score statistics is that it does not involve knowledge of the information matrix, neither expected nor observed. Additionally, the gradient statistic is quite simple to compute. This has been emphasised by C.R. Rao (Rao, 2005), who wrote: 'The suggestion by Terrell is attractive as it is simple to compute. It would be of interest to investigate the performance of the [gradient] statistic'.
Let $x_1, \ldots, x_n$ be a random sample of size $n$ with common probability density function $f(\cdot; \theta)$, which depends on a $p$-dimensional vector of unknown parameters $\theta = (\theta_1, \ldots, \theta_p)^\top$. Let $\ell(\theta) = n^{-1}\sum_{i=1}^{n} \log f(x_i; \theta)$ and $U(\theta) = \partial\ell(\theta)/\partial\theta$ be the log-likelihood function and the score vector, respectively; notice that, for convenience, both are divided by $n$. We wish to test the null hypothesis $H_0: \theta_1 = \theta_{10}$ against the two-sided alternative hypothesis $H_a: \theta_1 \neq \theta_{10}$, where $\theta_{10}$ is a fixed $q$-dimensional vector, $\theta_1 = (\theta_1, \ldots, \theta_q)^\top$ and $\theta_2 = (\theta_{q+1}, \ldots, \theta_p)^\top$. The partition of $\theta$ induces the corresponding partition of the score vector: $U(\theta) = (U_1(\theta)^\top, U_2(\theta)^\top)^\top$. Let $\hat\theta = (\hat\theta_1^\top, \hat\theta_2^\top)^\top$ and $\tilde\theta = (\theta_{10}^\top, \tilde\theta_2^\top)^\top$ be the unrestricted and the restricted (under $H_0$) maximum likelihood estimators of $\theta = (\theta_1^\top, \theta_2^\top)^\top$, respectively. The gradient statistic for testing $H_0$ is defined as
$$S = n\,U(\tilde\theta)^\top(\hat\theta - \tilde\theta), \qquad (1)$$
and can also be written as $S = n\,U_1(\tilde\theta)^\top(\hat\theta_1 - \theta_{10})$, since $U_2(\tilde\theta) = 0$.
Like the likelihood ratio, the Wald, and the score statistics, the gradient statistic has an asymptotic $\chi^2_q$ distribution under the null hypothesis, $q$ being the number of restrictions imposed by $H_0$.
Equation (1) is the inner product of the score vector evaluated at $H_0$ and the difference between the unrestricted and the restricted maximum likelihood estimators of $\theta$. Although the gradient statistic was derived by Terrell (2002) from the score and the Wald statistics, it is of a different nature. The score statistic measures the squared length of the score vector evaluated at $H_0$ using the metric given by the inverse of the Fisher information matrix, whereas the Wald statistic gives the squared distance between the unrestricted and the restricted maximum likelihood estimators of $\theta$ using the metric given by the Fisher information matrix. Moreover, both are quadratic forms. The gradient statistic, on the other hand, is not a quadratic form and measures the distance between the unrestricted and the restricted maximum likelihood estimators of $\theta$ from a different perspective. It measures the orthogonal projection of the score vector at $H_0$ on the vector $\hat\theta - \tilde\theta$.
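To make the definition concrete, the sketch below evaluates the gradient statistic in a model that is not among the paper's examples: a normal sample with unknown mean (the parameter of interest) and unknown variance (the nuisance parameter). Under this assumed model the restricted estimator of the variance is the mean squared deviation from the hypothesised mean, and the statistic reduces to $n(\bar x - \mu_0)^2/\tilde\sigma^2$. The function name and the data are hypothetical.

```python
from statistics import fmean


def gradient_statistic_normal_mean(x, mu0):
    """Gradient statistic for H0: mu = mu0 in a N(mu, sigma^2) model.

    Illustrative assumption: this normal model is a stand-in chosen for
    its closed-form estimators; it is not taken from the paper.
    """
    n = len(x)
    mu_hat = fmean(x)  # unrestricted MLE of mu
    # Restricted MLE of sigma^2 (mu fixed at mu0).
    sigma2_tilde = fmean([(xi - mu0) ** 2 for xi in x])
    # Score for mu of the log-likelihood divided by n, at the restricted MLE.
    u1_tilde = (mu_hat - mu0) / sigma2_tilde
    # S = n * U_1(theta_tilde) * (mu_hat - mu0); no information matrix needed.
    return n * u1_tilde * (mu_hat - mu0)


sample = [1.0, 2.0, 3.0, 4.0]  # hypothetical data
s = gradient_statistic_normal_mean(sample, mu0=2.0)
print(s)
```

Only the restricted and unrestricted estimates and the score enter the computation, which is the practical simplicity the text refers to.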
Recently, the gradient test has been the subject of some research papers. In particular, Lemonte and Ferrari (2012a) obtained the local power of the gradient test under Pitman alternatives (a sequence of alternative hypotheses converging to the null hypothesis at the rate $n^{-1/2}$). The authors compared the local power of the gradient test with those of the likelihood ratio, the Wald, and the score tests. They showed that none of the tests is uniformly more powerful than the others; therefore, the gradient test is not only very simple to compute but also competitive with the others in terms of local power. Other recent works in which the gradient test is investigated are Lemonte (2011, 2012) and Lemonte and Ferrari (2011, 2012b,c).
The main result in Lemonte and Ferrari (2012a) regarding the local power of the gradient test up to an error of order $o(n^{-1/2})$ represents the first step in the study of higher-order asymptotic properties of the gradient test. In the present paper, we go further by deriving the second-order approximation to the null distribution of the gradient statistic. In other words, our aim is to obtain an asymptotic expansion for the cumulative distribution function of the gradient statistic under the null hypothesis up to an error of order $o(n^{-1})$.
The usual route for deriving expansions for the distribution of asymptotic chi-square test statistics involves multivariate Edgeworth series expansions. Although such a route has been followed by many authors, it is extremely lengthy and tedious (see, for example, Hayakawa, 1977; Harris, 1985).
Here, on the other hand, in order to derive an asymptotic expansion for the null distribution of the gradient statistic up to order $n^{-1}$, we follow a Bayesian route based on a shrinkage argument originally suggested by Ghosh and Mukerjee (1991) and described later in Mukerjee and Reid (2000). Although it uses a Bayesian approach, this technique can be used to solve frequentist problems, such as the derivation of Bartlett corrections and tail probabilities (Datta and Mukerjee, 2003).
Additionally, we obtain a Bartlett-type correction factor for the gradient statistic from the results in Cordeiro and Ferrari (1991). Under the null hypothesis, the corrected statistic is distributed as chi-square up to an error of order $o(n^{-1})$, while the uncorrected gradient statistic has a chi-square distribution up to an error of order $o(n^{-1/2})$; that is, the Bartlett-type correction factor reduces the approximation error from $o(n^{-1/2})$ to $o(n^{-1})$. For a detailed survey on Bartlett and Bartlett-type corrections, the reader is referred to Cordeiro and Cribari-Neto (1996).
The paper unfolds as follows. In Section 2, we present our main results, namely an asymptotic expansion for the cumulative distribution function of the gradient statistic and its Bartlett-type correction. In Sections 3 and 4, we particularise our general results to one-parameter families and to families with two orthogonal parameters, respectively. A small Monte Carlo study is also presented in Section 4. Section 5 closes the paper with a brief discussion. Technical details are collected in two appendices.

The main result
First, let us introduce some notation. Let $D_j = \partial/\partial\theta_j$ ($j = 1, \ldots, p$) be the differential operator. We define $U_j = D_j\ell(\theta)$, $U_{jr} = D_jD_r\ell(\theta)$, $U_{jrs} = D_jD_rD_s\ell(\theta)$, and so on. We make the same assumptions as those fully outlined by Hayakawa (1977), such as the regularity of the first four derivatives of $\ell(\theta)$ with respect to $\theta$ and the existence and uniqueness of the maximum likelihood estimator of $\theta$. Let $\kappa_{j,r} = E(U_jU_r)$, $\kappa_{jr} = E(U_{jr})$, $\kappa_{jrs} = E(U_{jrs})$, $\kappa_{jrsu} = E(U_{jrsu})$, $\kappa_{j,rs} = E(U_jU_{rs})$, $\kappa_{jrs,u} = E(U_{jrs}U_u)$, $\kappa_{ju,rs} = E(U_{ju}U_{rs}) - \kappa_{ju}\kappa_{rs}$, $\kappa_{j,u,rs} = E(U_jU_uU_{rs}) + \kappa_{ju}\kappa_{rs}$, etc., denote the cumulants of log-likelihood derivatives. The cumulants are not functionally independent; for instance, $\kappa_{j,r} = -\kappa_{jr}$ and $\kappa_{jr,s} + \kappa_{jrs} = \kappa_{jr}^{(s)}$, where $\kappa_{jr}^{(s)} = \partial\kappa_{jr}/\partial\theta_s$. Relations among them were first obtained by Bartlett (1953a,b). Further, let $K$ be the Fisher information matrix, partitioned conformably with $\theta$,
$$K = ((\kappa_{j,r})) = -((\kappa_{jr})) = \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix},$$
with $K^{-1} = ((\kappa^{j,r}))$ denoting its inverse. Finally, define the matrices
$$A = ((a_{jr})) = \begin{pmatrix} 0 & 0 \\ 0 & K_{22}^{-1} \end{pmatrix}, \qquad M = ((m_{jr})) = K^{-1} - A.$$
In what follows, we use the Einstein summation convention, where $\sum'$ denotes summation over all components of $\theta$; that is, the indices $j$, $r$, $s$, $k$, $l$ and $u$ range over 1 to $p$. We now establish the following theorem.
Theorem 1. The asymptotic expansion for the null distribution of the gradient statistic for testing $H_0: \theta_1 = \theta_{10}$ against $H_a: \theta_1 \neq \theta_{10}$ is given by expansion (2), where $G_z(x)$ is the cumulative distribution function of a chi-square random variable with $z$ degrees of freedom, and the coefficients involve the cumulant terms κ jrs κ klu a lu 3m jk a rs + m jr κ s,k + 2a sk + 6 ′ κ jrs,u m jr a su − 6 ′ κ jrsu + κ jrs,u m jr κ s,u + 2m ju a rs + 6 ′ κ klu + κ kl,u 2 κ jrs + κ jr,s κ s,j κ r,k κ l,u − a sj a rk a lu + κ s,k κ l,j κ r,u − a sk a lj a ru − κ jrs κ s,u + a su κ j,k κ l,r − a jk a lr + m jr a sk a lu + κ s,k κ l,u + 2a rs κ j,k κ l,u − a jk a lu + 2a rk a ls m ju + 12 ′ κ jrsu + κ j,rsu + κ jsu,r + κ ju,rs + κ j,u,rs κ j,s κ u,r − a js a ur ,
Proof. The proof is presented in Appendix 1.
To prove Theorem 1, we follow a Bayesian route based on a shrinkage argument. This argument is described in Appendix 2.
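For orientation, expansions of this kind are usually written in the form used by Harris (1985) and Cordeiro and Ferrari (1991), both cited in this paper. The display below is a sketch under that assumption: the explicit factor $n^{-1}$ and the way $R_0, \ldots, R_3$ are built from $A_1$, $A_2$, and $A_3$ are taken from that general framework and are not transcribed from the display in Theorem 1.

```latex
% Assumed general form (Harris, 1985; Cordeiro and Ferrari, 1991);
% the n^{-1} scaling of the A's is an assumption about the convention used.
\Pr(S \le x) = G_q(x) + \frac{1}{24n}\sum_{i=0}^{3} R_i\, G_{q+2i}(x) + o(n^{-1}),
\qquad
\begin{aligned}
R_1 &= 3A_3 - 2A_2 + A_1, \\
R_2 &= A_2 - 3A_3, \\
R_3 &= A_3, \\
R_0 &= -(R_1 + R_2 + R_3) = A_2 - A_1 - A_3 .
\end{aligned}
```

Because the $R_i$ sum to zero, the correction term vanishes as $x \to \infty$, as it must for a distribution function.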
If the null hypothesis is simple, we have $q = p$, $A = 0$ and $M = K^{-1}$. Therefore, an immediate consequence of Theorem 1 is the following corollary.

Corollary 1. The asymptotic expansion for the null distribution of the gradient statistic for testing a simple null hypothesis is given by (2) with $q = p$, and the A's are
$$A_3 = \frac{1}{12}\sum{}' \kappa_{jrs}\kappa_{klu}\left(9\kappa^{j,r}\kappa^{s,k}\kappa^{l,u} + 6\kappa^{j,k}\kappa^{r,l}\kappa^{s,u}\right),$$
A 1 = −6 ′ κ jrsu + κ jrs,u κ j,r κ s,u + 6 ′ κ klu + κ kl,u 2 κ jrs + κ jr,s κ s,j κ r,k κ l,u + κ s,k κ l,j κ r,u − κ jrs κ s,u κ j,k κ l,r + κ j,r κ s,k κ l,u + 12 ′ κ jrsu + κ j,rsu + κ jsu,r + κ ju,rs + κ j,u,rs κ j,s κ u,r ,
We are now able to present a Bartlett-type corrected gradient statistic. A Bartlett-type correction is a multiplying factor, depending on the statistic itself, that yields a modified statistic whose distribution is chi-square with approximation error of order smaller than $n^{-1}$. Cordeiro and Ferrari (1991) obtained a general formula for a Bartlett-type correction for a wide class of statistics that have a chi-square distribution asymptotically. A special case is when the cumulative distribution function of the statistic can be written as (2), independently of the coefficients $R_1$, $R_2$, and $R_3$. Hence, from Theorem 1 and the results in Cordeiro and Ferrari (1991), we have the following corollary.

Corollary 2. The modified statistic
$$S^* = S\{1 - (c + bS + aS^2)\} \qquad (3)$$
has a $\chi^2_q$ distribution up to an error of order $o(n^{-1})$ under the null hypothesis.
The factor $\{1 - (c + bS + aS^2)\}$ in (3) can be regarded as a Bartlett-type correction factor for the gradient statistic, in the sense that the null distribution of $S^*$ is better approximated by the reference $\chi^2$ distribution than the distribution of the uncorrected gradient statistic.
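The correction in (3) is a cubic polynomial in $S$ and is straightforward to apply once $a$, $b$, and $c$ are available. The sketch below applies the factor as stated in (3). The helper that maps $A_1$, $A_2$, $A_3$ to $(a, b, c)$ uses the general Cordeiro and Ferrari (1991) formulas with an explicit $1/n$; that mapping and its scaling are assumptions here, not transcribed from this paper, and both function names are hypothetical.

```python
def bartlett_type_coefficients(A1, A2, A3, q, n):
    """Map A1, A2, A3 to (a, b, c).

    ASSUMPTION: general Cordeiro-Ferrari (1991) form, with the A's treated
    as O(1) quantities so that 1/n appears explicitly. Check against the
    paper's own display before relying on it.
    """
    a = A3 / (12.0 * n * q * (q + 2) * (q + 4))
    b = (A2 - 2.0 * A3) / (12.0 * n * q * (q + 2))
    c = (A1 - A2 + A3) / (12.0 * n * q)
    return a, b, c


def corrected_gradient_statistic(S, a, b, c):
    """Bartlett-type corrected statistic S* = S{1 - (c + bS + aS^2)}, as in (3)."""
    return S * (1.0 - (c + b * S + a * S ** 2))


# Hypothetical values of the A's, for illustration only.
a, b, c = bartlett_type_coefficients(A1=6.0, A2=-3.0, A3=1.5, q=1, n=10)
print(corrected_gradient_statistic(2.5, a, b, c))
```

In practice the A's depend on unknown parameters and would be evaluated at estimates before the factor is computed.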
Instead of modifying the test statistic as in (3), we may modify the reference $\chi^2$ distribution using the inverse expansion formula in Hill and Davis (1968). To be specific, let $\gamma$ be the desired level of the test, and $x_{1-\gamma}$ be the $1 - \gamma$ percentile of the $\chi^2$ limiting distribution of the test statistic. From expansion (2), we have the following corollary.
Corollary 3. The asymptotic expansion for the $1 - \gamma$ percentile of $S$ to order $n^{-1}$ is given by (4).
In general, equations (3) and (4) depend on unknown parameters. In this case, we can replace these unknown parameters by their maximum likelihood estimates obtained under $H_0$. It should be noticed that the improved gradient test of the null hypothesis $H_0$ may be performed in three ways: (i) by referring the corrected statistic $S^*$ in (3) to the $\chi^2_q$ distribution; (ii) by referring the gradient statistic $S$ to the approximate cumulative distribution function (2); (iii) by comparing $S$ with the modified upper percentile in (4). These three procedures are equivalent to order $n^{-1}$.
Finally, the first three moments of the gradient statistic, up to order $n^{-1}$ under the null hypothesis, are presented in the following corollary.
Corollary 4. The first three moments of the gradient statistic, up to order $n^{-1}$ under the null hypothesis, are expressed in terms of $q$ and the coefficients $A_1$, $A_2$, and $A_3$.
In the next sections, we consider some applications of the general results derived in this section in two special cases: a one-parameter model and a two-parameter model under orthogonality of parameters.

The one-parameter case
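We initially assume that the model is indexed by a scalar unknown parameter, say $\phi$. The interest lies in testing the null hypothesis $H_0: \phi = \phi_0$ against $H_a: \phi \neq \phi_0$, where $\phi_0$ is a fixed value. From (1), the gradient statistic is $S = n\,U(\phi_0)(\hat\phi - \phi_0)$, where $\hat\phi$ is the maximum likelihood estimator of $\phi$. Here, $A_1$, $A_2$, and $A_3$ given in Corollary 1 reduce to scalar expressions in the cumulants of the log-likelihood derivatives with respect to $\phi$; the expressions for $A_1$ and $A_2$ are labelled (5) and (6). We now present some examples.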

Power
The A's, the first three approximate moments, and the Bartlett-type corrected statistic coincide with those obtained for the Pareto distribution.

Models with two orthogonal parameters
The two-parameter families of distributions under orthogonality of the parameters (Cox and Reid, 1987), say $\phi$ and $\beta$, are the subject of this section. The null hypothesis under test is $H_0: \phi = \phi_0$, where $\phi_0$ is a fixed value, and $\beta$ acts as a nuisance parameter. The orthogonality between $\phi$ and $\beta$ leads to considerable simplification in the formulas of $A_1$, $A_2$, and $A_3$. Here, $\kappa_{\phi\phi\beta} = E(\partial^3\ell(\theta)/\partial\beta\,\partial\phi^2)$, $\kappa_{\phi\phi\beta}^{(\beta)} = \partial\kappa_{\phi\phi\beta}/\partial\beta$, etc. After some algebra, $A_1$ and $A_2$ are expressed through $A_{1\phi}$ and $A_{2\phi}$, which are equal to $A_1$ and $A_2$ given in (5) and (6), respectively, and through the additional terms $A_{1\phi\beta}$ and $A_{2\phi\beta}$ given in (8). The expressions for $A_{1\phi\beta}$ and $A_{2\phi\beta}$ in (8) can be regarded as the additional contribution introduced in the expansion of the cumulative distribution function of the gradient statistic owing to the fact that $\beta$ is unknown and has to be estimated from the data. In the following, we present some examples.

Example 5. (Two-parameter Birnbaum-Saunders distribution)
The two-parameter Birnbaum-Saunders distribution was proposed by Birnbaum and Saunders (1969) and has cumulative distribution function of the form $F(x; \phi, \beta) = \Phi(v)$, $x > 0$, where $v = \phi^{-1}\{(x/\beta)^{1/2} - (\beta/x)^{1/2}\}$ and $\Phi(\cdot)$ is the standard normal cumulative distribution function; $\phi > 0$ and $\beta > 0$ are the shape and scale parameters, respectively. We wish to test $H_0: \phi = \phi_0$ against the alternative hypothesis $H_a: \phi \neq \phi_0$, where $\phi_0$ is a known positive constant. The gradient statistic to test $H_0$ is $S = n\,U_\phi(\phi_0, \tilde\beta)(\hat\phi - \phi_0)$, where $U_\phi$ is the score component for $\phi$, $\hat\phi$ is the unrestricted maximum likelihood estimator of $\phi$, and $\tilde\beta$ is the maximum likelihood estimator of $\beta$ obtained under $H_0$. We have $\kappa_{\phi\phi} = -2/\phi^2$ and $\kappa_{\phi\beta} = 0$, along with the higher-order cumulants required by the formulas above. Since the quantities needed to obtain the A's were derived, a Bartlett-corrected gradient statistic may be obtained from Corollary 2. It is interesting to note that the A's do not depend on the unknown scale parameter $\beta$.
Next, we present a small Monte Carlo simulation regarding the test of the null hypothesis $H_0: \phi = 1$. The simulations were performed by setting $\beta = 1$ and sample sizes ranging from 5 to 22 observations. All results are based on 10,000 replications. The size distortions (i.e. estimated minus nominal sizes) for the 5% nominal level of the gradient statistic and its Bartlett-corrected version for different sample sizes are plotted in Figure 1(a). It is clear from this figure that the Bartlett-corrected test displays smaller size distortions than the original gradient test.
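The design of such a size study can be sketched in a few lines. The code below is not the Birnbaum-Saunders experiment: since that model requires numerical maximisation, a normal model with unknown mean (tested) and unknown variance (nuisance) is used as an assumed stand-in, and only the uncorrected gradient test is simulated. The function names and the seed are hypothetical; the number of replications and the range of sample sizes follow the text.

```python
import random
from statistics import fmean

# 0.95 quantile of the chi-square distribution with 1 degree of freedom.
CHI2_1_CRIT_5PCT = 3.841458820694124


def gradient_stat(x, mu0):
    """Gradient statistic for H0: mu = mu0 in a N(mu, sigma^2) model (stand-in)."""
    n = len(x)
    xbar = fmean(x)
    sigma2_tilde = fmean([(xi - mu0) ** 2 for xi in x])  # restricted MLE
    return n * (xbar - mu0) ** 2 / sigma2_tilde


def estimated_size(n, replications=10_000, mu0=0.0, seed=12345):
    """Fraction of samples generated under H0 for which the test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replications):
        x = [rng.gauss(mu0, 1.0) for _ in range(n)]
        if gradient_stat(x, mu0) > CHI2_1_CRIT_5PCT:
            rejections += 1
    return rejections / replications


# Size distortion = estimated size minus nominal size (5%).
for n in (5, 10, 22):
    print(n, estimated_size(n) - 0.05)
```

A Bartlett-corrected version would be added by transforming each simulated statistic with the correction factor before comparing it with the critical value.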
Finally, we set $n = 10$ and consider the first-order approximation ($\chi^2_1$ distribution) for the distribution of the gradient statistic and the expansion obtained in this paper. Figure 1(b) presents the curves. The difference between the curves is evident from this figure, and hence, the $\chi^2_1$ distribution may not be a good approximation for the null distribution of the gradient statistic in testing the null hypothesis $H_0: \phi = 1$ for the two-parameter Birnbaum-Saunders model if the sample is small.

Discussion
Lemonte and Ferrari (2012a) showed that the gradient test can be an interesting alternative to the classic large-sample tests, namely the likelihood ratio, the Wald, and the Rao score tests, since none is uniformly superior to the others in terms of second-order local power. Additionally, as remarked before, the gradient statistic does not require obtaining, estimating, or inverting an information matrix, unlike the Wald and the Rao score statistics. Its formal simplicity is always an attraction. The exact null distribution of the gradient statistic is usually unknown and the test relies upon an asymptotic approximation. The chi-square distribution is used as a large-sample approximation to the true null distribution of this statistic. However, for small sample sizes, the chi-square distribution may be a poor approximation to the true null distribution; that is, the asymptotic approximation may deliver inaccurate inference. In order to overcome this shortcoming, an alternative strategy is to use a higher-order asymptotic theory.
The asymptotic expansion up to order $n^{-1}$ for the null distribution function of the gradient statistic was derived in this paper. A Bayesian route based on the shrinkage argument (Ghosh and Mukerjee, 1991; Mukerjee and Reid, 2000) proved to be extremely useful in this context. The expansion is very general in the sense that the null hypothesis can be composite in the presence of nuisance parameters. We show that the coefficients which define this expansion depend on the joint cumulants of log-likelihood derivatives for the full data. Unfortunately, these coefficients are very difficult to interpret in generality.
Cordeiro and Ferrari (1991) showed that, quite generally, continuous statistics having a chi-square distribution asymptotically can be modified by a suitable correction term that makes the modified statistic have a chi-square distribution to order $n^{-1}$. Their work can be viewed as an extension of Bartlett corrections to the likelihood ratio statistic (Lawley, 1956) to other statistics having a chi-square distribution asymptotically. The correction term comes from the coefficients of the $O(n^{-1})$ term in the expansion of the cumulative distribution function of the test statistic in such a way that it becomes better approximated by the reference chi-square distribution. It is known as the Bartlett-type correction. Bartlett and Bartlett-type corrections have become a widely used method for improving the large-sample chi-square approximation to the null distribution of the likelihood ratio and Rao score statistics, respectively. In recent years there has been a renewed interest in Bartlett factors and several papers have been published giving expressions for computing these corrections for special models. Some references are Zucker et al. (2000), Lagos and Morettin (2004), Tu et al. (2005), van Giersbergen (2009), Bai (2009), Lagos et al. (2010), and Noma (2011).
From the general expansion derived in this paper and using results in Cordeiro and Ferrari (1991), we also obtained a Bartlett-type correction factor for the gradient statistic. Our results are very general and not tied to special classes of models. They allow the parameter vector to be multidimensional and are valid regardless of whether nuisance parameters are present or not. Additionally, as the coefficients in the expansion, and consequently in the Bartlett-type correction factor, are written as functions of cumulants of log-likelihood derivatives, they can be obtained for all the classes of parametric models for which those cumulants can be determined. Therefore, applications of our general results in several parametric models, such as the generalised linear models and extensions, can be studied in future research.

Lemma 1. An asymptotic expansion under the null hypothesis for the gradient statistic
Proof. Using a procedure analogous to that of Chang and Mukerjee (2011), the result holds.