Higher-order properties of Bayesian empirical likelihood

Abstract: Empirical likelihood serves as a good nonparametric alternative to the traditional parametric likelihood. The former involves far fewer assumptions than the latter, yet very often attains the same asymptotic inferential efficiency. While empirical likelihood has been studied quite extensively in the frequentist literature, the corresponding Bayesian literature is somewhat sparse. Bayesian methods hold promise, however, especially when historical information is available, since such information can often be used successfully for the construction of priors. In addition, Bayesian methods very often overcome the curse of dimensionality by providing suitable dimension reduction through judicious use of priors and analyzing data with the resultant posteriors. In this paper, we provide an asymptotic expansion of posteriors for a very general class of priors along with the empirical likelihood and its variations, such as the exponentially tilted empirical likelihood and the Cressie-Read version of the empirical likelihood. Other than obtaining the celebrated Bernstein-von Mises theorem as a special case, our approach also aids in finding non-subjective priors based on empirical likelihood and its variations as mentioned above. A simulation study substantiates the robustness of our nonparametric approaches against model misspecification.


Introduction
Empirical likelihood, over the years, has become a very popular topic of statistical research. The name was coined by Owen in his classic 1988 paper, although similar ideas can be found even earlier in the works of [11], [19], [16] and others. The main advantage of empirical likelihood is that it involves fewer assumptions than a regular likelihood, and yet shares the same asymptotic properties as the latter.
Research in this area has primarily been frequentist with a long list of important theoretical developments accompanied by a large number of applications. To our knowledge, the first Bayesian work in this general area appeared in the article of [13] followed by some related work in [17,18], the latter introducing the concept of "exponentially tilted empirical likelihood". [13] suggested using empirical likelihood as a substitute for the usual likelihood and carrying out Bayesian analysis in the usual way.
Baggerly [1] viewed empirical likelihood as a method of assigning probabilities to an n-cell contingency table in order to minimize a goodness-of-fit criterion. He selected the Cressie-Read power divergence statistics as one such criterion for the construction of confidence regions in a number of situations, and also pointed out how the usual empirical likelihood, the exponentially tilted empirical likelihood and others could be viewed as special cases of the Cressie-Read criterion by appropriate choice of the power parameter. This was also discussed in [15], who pointed out that all members of the Cressie-Read family led to "empirical divergence analogues of the empirical likelihood in which asymptotic χ² calibration held for the mean".
The objective of this article is to provide an asymptotic expansion of the posterior distribution based on empirical likelihood and its variations under certain regularity conditions and a mean constraint. The work is inspired by that of [5], who provided a somewhat different expansion subject to a mean constraint. Unlike [4,5], our result is based on the derivatives of the pseudo likelihood with respect to the parameter of interest evaluated at the maximum empirical likelihood estimator, and a rigorous expansion is provided with particular attention to the remainder terms. Moreover, we consider a general estimating equation which includes the mean example of [5] as a special case. The need for different pseudo-likelihoods for statistical inference is felt all the more these days, especially for the analysis of high-dimensional data, where the usual likelihood-based analysis is hard to perform. These alternative likelihoods are equally valuable for approximate Bayesian computation, a topic which has only recently surfaced in the statistics literature (see e.g. [3]).
An asymptotic expansion of the posterior based on a regular likelihood was given earlier in [12], and later in [8]. We follow their approach with many necessary modifications, in view of the fact that any meaningful prior needs to have support in a data-driven compact set which grows with the number of observations. As a special case of our result, we obtain the celebrated Bernstein-von Mises theorem. The latter was mentioned in [13] for the special case of empirical likelihood, but here we provide a rigorous derivation with the needed regularity conditions in a general framework. The asymptotic expansion can also be used to obtain asymptotic expansions of posterior moments, quantiles and other quantities of interest. Moreover, we utilize this asymptotic expansion to find some moment matching priors, given earlier in [9] based on the regular likelihood. In contrast to [9], the moment matching prior does not depend on the expectation of the derivatives of the log-likelihood function, but depends instead on the second and third central moments of the unbiased estimating function, say g(X, θ). In the particular case g(X, θ) = X − θ, the prior depends only on knowledge of the second and third central moments of the distribution, and does not require specification of a full likelihood. The moment matching priors also differ from the reference priors introduced in [2]. The latter is an analogue of Jeffreys' prior under most circumstances, with the Godambe information matrix [10] replacing the Fisher information matrix.

Basic settings
Suppose X_1, . . . , X_n are independent and identically distributed random vectors satisfying E{g(X_1, θ)} = 0, where θ ∈ R. In this context, [14] formulated empirical likelihood as a nonparametric likelihood of the form ∏_{i=1}^n w_i(θ), where w_i is the probability mass assigned to X_i (i = 1, . . . , n) satisfying the constraints

w_i ≥ 0 (i = 1, . . . , n), Σ_{i=1}^n w_i = 1, Σ_{i=1}^n w_i g(X_i, θ) = 0. (2.1)

The target is to maximize ∏_{i=1}^n w_i, or equivalently Σ_{i=1}^n log w_i, with respect to w_1, . . . , w_n subject to the constraints given in Eq. (2.1). Applying the Lagrange multiplier method, the solution turns out to be

ŵ_i^EL(θ) = 1/[n{1 + ν g(X_i, θ)}], (2.2)

where ν, the Lagrange multiplier, satisfies

(1/n) Σ_{i=1}^n g(X_i, θ)/{1 + ν g(X_i, θ)} = 0. (2.3)

Closely related to the empirical likelihood is the exponentially tilted empirical likelihood, where the objective is to maximize the Shannon entropy −Σ_{i=1}^n w_i log w_i under the same constraints in Eq. (2.1). The resulting solution is

ŵ_i^ET(θ) = exp{ν g(X_i, θ)} / Σ_{j=1}^n exp{ν g(X_j, θ)},

where ν, the Lagrange multiplier, satisfies

Σ_{i=1}^n g(X_i, θ) exp{ν g(X_i, θ)} = 0. (2.4)

The exponentially tilted empirical likelihood is related to the Kullback-Leibler divergence between two empirical distributions, one with weights w_i assigned to the n sample points, and the other with uniform weights 1/n assigned to the sample points. The general Cressie-Read divergence criterion is given by

CR(λ) = 2/{λ(λ + 1)} Σ_{i=1}^n {(n w_i)^{−λ} − 1}/n. (2.5)

We focus on the cases λ ≥ 0 and λ ≤ −1 because, in these cases, CR(λ) is a convex function of the w_i (i = 1, . . . , n), and hence the minimization problem will produce a unique solution. The limiting cases λ → 0 and λ → −1 correspond to the usual empirical likelihood and the exponentially tilted empirical likelihood as defined earlier.
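For concreteness, both inner optimizations reduce to one-dimensional root-finding in the multiplier ν. The sketch below (function names and the safeguarded Newton iteration are our own) computes the empirical likelihood and exponentially tilted weights for the mean constraint g(x, θ) = x − θ:

```python
import numpy as np

def el_weights(g, iters=100):
    """Empirical likelihood weights w_i = 1/[n{1 + nu*g_i}]: Newton in nu."""
    n = len(g)
    nu = 0.0
    for _ in range(iters):
        d = 1.0 + nu * g
        step = np.sum(g / d) / np.sum(g**2 / d**2)   # Newton step for (2.3)
        while np.any(1.0 + (nu + step) * g <= 0):    # keep all weights positive
            step *= 0.5
        nu += step
    return 1.0 / (n * (1.0 + nu * g))

def et_weights(g, iters=100):
    """Exponentially tilted weights w_i proportional to exp(nu*g_i)."""
    nu = 0.0
    for _ in range(iters):
        e = np.exp(nu * g)
        nu -= np.sum(g * e) / np.sum(g**2 * e)       # Newton step for (2.4)
    e = np.exp(nu * g)
    return e / e.sum()

x = np.array([0.2, 0.5, 1.1, 1.9, 2.3])
theta = 1.0                        # mean constraint: g(x, theta) = x - theta
w = el_weights(x - theta)
print(w.sum(), np.dot(w, x - theta))   # constraints: sum = 1, weighted mean = 0
```

At the root of (2.3), the identity Σ_i 1/{1 + ν g(X_i, θ)} = n forces the empirical likelihood weights to sum to one automatically, which makes a convenient convergence check.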

X. Zhong and M. Ghosh
For convex CR(λ), the minimum is attained in a compact set H_n determined by the data. The Lagrange multiplier method now gives the weights

ŵ_i^CR(θ) = (1/n){μ + ν g(X_i, θ)}^{−1/(λ+1)},

where we abbreviate μ(θ) as μ and ν(θ) as ν, and the multipliers satisfy

Σ_{i=1}^n ŵ_i^CR(θ) = 1, Σ_{i=1}^n ŵ_i^CR(θ) g(X_i, θ) = 0. (2.6)

We now introduce the posterior based on an empirical likelihood. The basic idea was first introduced by [13] with several numerical examples. The intuition relies on the close relationship between the empirical likelihood and the empirical distribution. [15] formulated the two concepts under the same optimization framework, that is, they share the same objective function, but the former is solved under parametric constraints while the latter is not. Given this similarity, we can use the empirical likelihood as a valid distribution parameterized by some inferential target. Within the Bayesian paradigm, writing ŵ_i(θ) as generic notation for either ŵ_i^EL, ŵ_i^ET or ŵ_i^CR, and taking a prior with probability density function ρ(θ) supported on H_n, the profile (pseudo) posterior is

π(θ | X_1, . . . , X_n) = ρ(θ) ∏_{i=1}^n ŵ_i(θ) / ∫_{H_n} ρ(t) ∏_{i=1}^n ŵ_i(t) dt. (2.7)

The main objective of this paper is to provide an asymptotic expansion of π(θ | X_1, . . . , X_n). This will include, in particular, the Bernstein-von Mises theorem. Towards our main result, we develop a few necessary lemmas in the next section. Some of these lemmas are also of independent interest, as they point out some interesting features of empirical likelihood.
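The profile posterior in Eq. (2.7) can be evaluated numerically on a grid in the interior of H_n. A minimal sketch for the empirical likelihood under a standard Cauchy prior, with illustrative data of our own choosing:

```python
import numpy as np

def log_el(x, theta, iters=60):
    """Profile log empirical likelihood sum_i log w_i(theta), mean constraint."""
    g = x - theta
    n = len(g)
    nu = 0.0
    for _ in range(iters):
        d = 1.0 + nu * g
        step = np.sum(g / d) / np.sum(g**2 / d**2)   # Newton step for (2.3)
        while np.any(1.0 + (nu + step) * g <= 0):    # keep weights positive
            step *= 0.5
        nu += step
    return -np.sum(np.log(n * (1.0 + nu * g)))

rng = np.random.default_rng(0)
x = rng.standard_t(df=6, size=80)

# Evaluate (2.7) on a grid well inside H_n = (min x, max x),
# with a standard Cauchy prior rho.
grid = np.linspace(np.quantile(x, 0.25), np.quantile(x, 0.75), 200)
dx = grid[1] - grid[0]
log_prior = -np.log(np.pi * (1.0 + grid**2))
log_post = np.array([log_el(x, t) for t in grid]) + log_prior
post = np.exp(log_post - log_post.max())
post /= post.sum() * dx                    # normalize the posterior density
mode = grid[np.argmax(post)]
print(mode)                                # close to the sample mean
```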

Lemmas
We first point out the natural domain of θ in the empirical likelihood setting. In practice, some values of θ will result in an empty feasible set under the constraints in Eq. (2.1). The set of θ values which guarantees a non-empty feasible set, and thus a solution of the optimization problem, constitutes the natural domain of the empirical likelihood. One may question whether the natural domain is large enough to contain the true value. The following lemma alleviates this worry.
Lemma 1. Assume g(·, ·) is a continuous function. Then the natural domain defined by the constraints in Eq. (2.1) is a compact set and is nondecreasing in the sample size n.
Proof. By the third constraint of Eq. (2.1), θ is a continuous function of w_1, w_2, . . . , w_n, and the w_i are defined on a simplex, which is a compact set by the first two constraints of Eq. (2.1). Recall that a continuous function maps compact sets to compact sets. Hence θ is naturally defined on a compact set, denoted by H_n.
If, for a given θ, all the g(X_i, θ), i = 1, . . . , n, are non-positive, or all are non-negative, then the constraints in Eq. (2.1) cannot be satisfied, and such θ must be excluded. Hence we define the domain as H_n = [∩_{i=1}^n {θ : g(X_i, θ) ≤ 0}]^c ∩ [∩_{i=1}^n {θ : g(X_i, θ) ≥ 0}]^c. As n increases, both sets [∩_i {θ : g(X_i, θ) ≤ 0}]^c and [∩_i {θ : g(X_i, θ) ≥ 0}]^c will increase, and so will their intersection H_n.
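For the mean constraint the natural domain has a simple closed form, which makes the monotonicity in Lemma 1 easy to check numerically (a small sketch with our own simulated data):

```python
import numpy as np

# For g(x, theta) = x - theta, a value of theta is feasible exactly when
# 0 lies in the convex hull of {x_i - theta}, i.e. min(x) <= theta <= max(x).
# So H_n = [min(x), max(x)], which can only grow as observations accumulate.
rng = np.random.default_rng(1)
x = rng.normal(size=20)
H_n = (x.min(), x.max())

x_more = np.concatenate([x, rng.normal(size=20)])    # double the sample
H_2n = (x_more.min(), x_more.max())
print(H_n, H_2n)     # the second interval contains the first
```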
Although intuitively we expect the empirical likelihood to behave like the true likelihood, we need some theoretical support to show that the former enjoys some of the basic properties of the latter. In particular, we need to verify that ν and μ are smooth functions of θ and that the (pseudo) Fisher information based on the empirical likelihood is positive.
We first establish the positivity of the Fisher information. We consider the three cases separately to introduce more transparency and continuity into our approach.
Our first lemma shows that the Lagrange multipliers ν(θ) and μ(θ) are both smooth functions of θ, under mild regularity assumptions (Assumptions 1 and 2).
Proof. We first consider the empirical likelihood and observe that ν(θ) is an implicit function of θ in view of Eq. (2.3). Further, the derivative of the left-hand side of Eq. (2.3) with respect to ν is strictly negative, so that, by the implicit function theorem, ν is differentiable in θ. Moreover, differentiating both sides of Eq. (2.3) with respect to θ, one gets an equation for the derivative,

which, on simplification, leads to an explicit expression for dν/dθ. Next, for the exponentially tilted empirical likelihood, in view of Eq. (2.4) and the fact that the derivative with respect to ν of the left-hand side of Eq. (2.4) equals Σ_{i=1}^n g²(X_i, θ) exp{ν g(X_i, θ)} > 0, the implicit function theorem once again guarantees the differentiability of ν in θ. Further, differentiating both sides of Eq. (2.4) with respect to θ gives the corresponding expression for dν/dθ. A similar conclusion holds for ν(θ) and μ(θ) defined in Eq. (2.6) in connection with CR(λ): the relevant Jacobian is nonsingular, so the implicit function theorem yields differentiability of μ(θ) and ν(θ) with respect to θ.
The next result shows that all the derivatives of the Lagrange multipliers ν(θ) and μ(θ) are smooth functions of θ ∈ H_n. We provide a unified proof for all three cases, utilizing the previous lemma, under an assumption (Assumption 3) stronger than Assumption 2.
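For the empirical likelihood case, the implicit function theorem computation sketched above can be written out explicitly (our own reconstruction of the standard argument, with F(ν, θ) denoting the left-hand side of Eq. (2.3); the 1/n factor cancels in the ratio):

```latex
\frac{\partial F}{\partial \nu}
  = -\sum_{i=1}^{n} \frac{g^{2}(X_i,\theta)}{\{1+\nu g(X_i,\theta)\}^{2}} < 0,
\qquad
\frac{\partial F}{\partial \theta}
  = \sum_{i=1}^{n} \frac{\partial_{\theta} g(X_i,\theta)}{\{1+\nu g(X_i,\theta)\}^{2}},

\frac{d\nu}{d\theta}
  = -\,\frac{\partial F/\partial \theta}{\partial F/\partial \nu}
  = \frac{\sum_{i=1}^{n} \partial_{\theta} g(X_i,\theta)\,\{1+\nu g(X_i,\theta)\}^{-2}}
         {\sum_{i=1}^{n} g^{2}(X_i,\theta)\,\{1+\nu g(X_i,\theta)\}^{-2}}.
```

Since ∂F/∂ν never vanishes, ν(θ) is differentiable wherever g(·, θ) is.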

Lemma 3. Under Assumptions 1 and 3, all orders of derivatives of ν(θ) and μ(θ) are smooth functions of θ.
Proof. The result is proved by induction. We have already seen in Lemma 2 that the first derivatives of ν(θ) and μ(θ) are smooth functions of θ. Suppose the result holds for all kth derivatives of ν(θ) and μ(θ), k = 1, . . . , K. Then, writing the (K + 1)th derivative as ∂h_K/∂θ, it is also a smooth function of θ by the induction hypothesis and Lemma 1. A similar proof works for μ(θ).
We know that when the number of constraints and the dimension of the parameter are the same, the corresponding empirical likelihood is maximized at θ = θ̂, the M-estimator of θ based on Σ_{i=1}^n g(X_i, θ) = 0. Thus ν(θ̂) = 0 and μ(θ̂) = 1. We next show that l̃(θ) has a negative second-order derivative when evaluated at θ̂.
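For a scalar θ with a single constraint, θ̂ can be computed by any scalar root-finder applied to θ ↦ Σ_i g(X_i, θ). A bisection sketch (our own illustration; for g = x − θ the root is simply the sample mean):

```python
import numpy as np

def m_estimate(g, x, lo, hi, iters=200, tol=1e-12):
    """Bisection root of theta -> sum_i g(x_i, theta) on [lo, hi]."""
    f_lo = np.sum(g(x, lo))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = np.sum(g(x, mid))
        if f_lo * f_mid <= 0:      # sign change in [lo, mid]
            hi = mid
        else:                      # root lies in [mid, hi]
            lo, f_lo = mid, f_mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=200)
theta_hat = m_estimate(lambda x, t: x - t, x, x.min(), x.max())
print(theta_hat, x.mean())    # for g = x - theta the M-estimator is the mean
```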
Finally, a similar argument applies in the Cressie-Read case. The main result is proved in the next section.

Main result
Before stating the main theorem, we need some notation. We assume that the prior density ρ(θ) has a Kth-order continuous derivative at θ̂, allowing a Taylor approximation of the prior density. We further denote the higher-order derivatives of the (pseudo) log empirical likelihood at θ̂, and define a summation index set for collecting terms of the expansion. Let y = √n b(θ − θ̂) be the normalized posterior random variable, with Y_(1) and Y_(n) the normalized lower and upper bounds of the support of the distribution. Now, for any ξ ∈ (Y_(1), Y_(n)) and H_n = [h_1, h_2], the relevant quantities are defined accordingly. To control the higher-order error terms, we need the following assumptions.
Moreover, we also need an assumption to guarantee the consistency of the M-estimator.

Assumption 5. g(·, θ) is either bounded or monotone in θ.
Now we state the main theorem of this section.
Theorem 1 (Fundamental Theorem for Expansion). Let X_1, X_2, . . . , X_n be independent and identically distributed. Assume the prior density ρ(θ) has support containing H_n and has a (K + 1)th-order continuous derivative. Then, under Assumptions 1, 3, 4 and 5, there exist constants such that the asymptotic expansion of the posterior holds for all sufficiently large n.
This theorem can be used not only to prove the asymptotic expansion of the posterior cumulative distribution function, the main result of this paper, but also to find asymptotic expansions of the posterior mean, quantiles and many other quantities of interest, as in [12] and [21].
Next we write the posterior cumulative distribution function in terms of the standard normal density ϕ(y), restricted to R_n. Define the polynomials γ_i(ξ, n), i = 1, . . . , n, recursively; the first two terms of γ_i(ξ, n) have explicit forms. We now provide the next important result of this section, namely the asymptotic expansion of the posterior distribution function.
Theorem 2 (Asymptotic Expansion of the Posterior Cumulative Distribution Function). Under the same assumptions as in Theorem 1, there exist constants N_2 and M_2 such that, for any n ≥ N_2 and ξ ∈ (Y_(1), Y_(n)), the expansion in Eq. (4.2) holds almost surely.
Proof. All the terms in Eq. (4.3) and Eq. (4.4) are integrals of continuous functions over bounded closed intervals. Hence they are almost surely bounded below by some constant C_1 and above by some constant C_2, for all n > N_1. Next we find the quotient series of P_K(ξ, n)/P_K(Y_(n), n). By the definition of γ_i(ξ, n), a simple calculation gives the corresponding expression. By the discussion following Lemma 7 in Appendix B, all γ_i are almost surely uniformly bounded for all large n. Thus there exists a constant M_3 such that the quotient terms are bounded. Combining Eq. (4.7) and Eq. (4.8) yields Eq. (4.2).
Letting K = 2, we obtain the asymptotic normality of the posterior distribution.

Corollary 1 (Bernstein-von Mises Theorem).
Under the assumptions of Theorem 1 with K = 2, the posterior distribution converges almost surely to a normal distribution; that is, √n b(θ − θ̂) | X_1, . . . , X_n → N(0, 1) a.s.
Theorem 1 builds a strong foundation for asymptotic expansions of many other quantities based on the posterior, such as the mean and higher-order posterior moments. This follows simply by replacing the prior density ρ(θ) with an appropriate function. Here we use the posterior mean as an example; more examples can be found in [12].
Applying the same argument as in the proof of Theorem 2 gives the asymptotic expansion of the posterior mean. A moment matching prior [9] is then found as the solution of the resulting equation. For the empirical likelihood and the exponentially tilted empirical likelihood, some heavy algebra yields the explicit form.

Using the strong law of large numbers, the required limits hold almost surely. Hence, we have the following corollary.
In the special case g(x, θ) = x − θ, Corollary 2 gives the moment matching prior in explicit form.

Remark 1.
Because the posterior mean is the most widely used Bayesian estimator, this result can provide a useful tool for higher-order analysis. For example, our result reproduces the result of [20], which estimates a quantile using the posterior mean of the Bayesian empirical likelihood based on a smoothed estimating equation.
Let K(x) = ∫_{−∞}^x k(u) du be a kernel smoothing function, and K_h(·) = K(·/h). The smoothed constraint of the empirical likelihoods for estimating the α quantile θ is g(X, θ) = K_h(θ − X) − α, and the asymptotic variance follows from the moments of g(θ, α). If the prior on θ is normal with parameters μ_0 and σ_0, then, by Eq. (4.9), we obtain the same formula as in Remark 1 of [20].
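A sketch of the smoothed quantile estimating equation, using a logistic kernel for K as an illustrative choice of our own (the kernel, bandwidth and data are not taken from [20]):

```python
import numpy as np

def K(u):
    """Logistic smoothing kernel CDF (an illustrative choice of K)."""
    return 1.0 / (1.0 + np.exp(-u))

def smoothed_eq(theta, x, alpha, h):
    # smoothed quantile constraint: (1/n) sum_i K((theta - X_i)/h) - alpha = 0
    return np.mean(K((theta - x) / h)) - alpha

rng = np.random.default_rng(4)
x = rng.normal(size=500)
alpha, h = 0.5, 0.2

lo, hi = x.min(), x.max()          # the equation is increasing in theta
for _ in range(100):               # bisection
    mid = 0.5 * (lo + hi)
    if smoothed_eq(mid, x, alpha, h) < 0:
        lo = mid
    else:
        hi = mid
theta_hat = 0.5 * (lo + hi)
print(theta_hat, np.quantile(x, alpha))   # smoothed and raw medians, both near 0
```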

Simulation results
In this section, we present some simulation results.

Heavy tailed distribution
First we take g(X_i, θ) = X_i − θ, i = 1, . . . , n. Setting K = 3, we compare the first-order approximation with the normal approximation and with the second-order approximation. Heavy algebra shows that, for all three empirical likelihoods, the quantity Sol^(3)(X) is the same, so the three empirical likelihoods are asymptotically equivalent up to the second order. The true cumulative distribution function is calculated by numerical integration. The normal approximation polynomial is Φ(ξ | R_n), and the second-order approximation polynomial is obtained from Theorem 2.

We take samples of sizes n = 10 and 80 from a t distribution with 6 degrees of freedom, with a Cauchy prior, and set the Cressie-Read divergence parameter to λ = 2. The results are given in Figure 1 and Figure 2. In both plots, the red line is the normal approximation of the posterior cumulative distribution function, the blue line the first-order approximation, the green line the posterior based on the empirical likelihood, the purple line the posterior based on the exponentially tilted empirical likelihood, and the black line the posterior based on the Cressie-Read divergence empirical likelihood. We see that even when the sample size is 10, the three types of empirical likelihood posteriors are quite close to each other, which supports the fact that they are equivalent at least up to the second order, and the second-order approximation works well. The first-order approximation is closer than the normal approximation, which lends credence to our theorem. When the sample size increases to 80, all the lines nearly coincide, meaning the approximations are quite successful.
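The closeness of the empirical likelihood and exponentially tilted posteriors can be reproduced in outline as follows (our own sketch, covering only these two cases at sample size 80 with a standard Cauchy prior):

```python
import numpy as np

def log_el(x, theta, iters=60):
    g = x - theta
    n = len(g)
    nu = 0.0
    for _ in range(iters):                    # safeguarded Newton for (2.3)
        d = 1.0 + nu * g
        step = np.sum(g / d) / np.sum(g**2 / d**2)
        while np.any(1.0 + (nu + step) * g <= 0):
            step *= 0.5
        nu += step
    return -np.sum(np.log(n * (1.0 + nu * g)))

def log_et(x, theta, iters=60):
    g = x - theta
    nu = 0.0
    for _ in range(iters):                    # damped Newton for (2.4)
        e = np.exp(nu * g)
        nu -= np.clip(np.sum(g * e) / np.sum(g**2 * e), -0.5, 0.5)
    w = np.exp(nu * g)
    return float(np.sum(np.log(w / w.sum())))

rng = np.random.default_rng(5)
x = rng.standard_t(df=6, size=80)
grid = np.linspace(np.quantile(x, 0.2), np.quantile(x, 0.8), 200)
dx = grid[1] - grid[0]
log_prior = -np.log(np.pi * (1.0 + grid**2))    # standard Cauchy prior

cdfs = []
for log_lik in (log_el, log_et):
    lp = np.array([log_lik(x, t) for t in grid]) + log_prior
    p = np.exp(lp - lp.max())
    p /= p.sum() * dx
    cdfs.append(np.cumsum(p) * dx)

print(np.max(np.abs(cdfs[0] - cdfs[1])))   # small: the two posteriors nearly agree
```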

Skewed distribution and different priors
Here we use a gamma distribution with shape parameter 2 and scale parameter 0.2, so that the skewness is 2/√2 = √2 and the mean is 0.4. We still use the one-dimensional constraint g(X, θ) = X − θ to estimate the mean. The priors are Cauchy distributions with different location parameters μ_0 and different scale parameters σ_0. We use an accuracy measure to quantify the performance of our approximations P̂(θ ≤ y | X) with respect to the true Bayesian empirical likelihood posteriors. The results are summarized in Table 1, which is organized as follows: the two columns under "Empirical Likelihood" mean that the true posteriors are based on the empirical likelihood, and the other columns can be interpreted similarly. The table supports our theoretical results well.

Comparison with parametric Bayesian model
By our Corollary 1, the asymptotic variance of the Bayesian empirical likelihood posterior is the inverse of the Godambe information number, while in a parametric Bayesian model the asymptotic variance is the inverse of the Fisher information number. Generally speaking, the Fisher information number will be larger than the Godambe information number. The difference between the asymptotic variances serves as a "payment" for using a semiparametric model instead of a fully parametric model. We also conduct simulations to further support this statement. Let X_i ∼ Exp(λ_0) be i.i.d. samples, where EX_1 = λ_0^{−1} = 1. Let the prior on λ be a gamma distribution with shape parameter α = 2 and scale parameter β = 1. Then the parametric posterior of λ is Γ(α + n, β + Σ_{i=1}^n X_i) in the rate parameterization.
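The conjugate update can be verified numerically. The following sketch (rate parameterization, with simulated data of our own) compares a brute-force grid posterior with the Γ(α + n, β + Σ X_i) density:

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(6)
alpha, beta, n = 2.0, 1.0, 50
x = rng.exponential(1.0, size=n)        # true rate lambda_0 = 1

# Conjugate update: Gamma(alpha, rate beta) prior with Exp(lambda) data
a_post, b_post = alpha + n, beta + x.sum()

# Check against a brute-force grid posterior: prior * likelihood, normalized
lam = np.linspace(1e-3, 5.0, 2000)
dl = lam[1] - lam[0]
log_post = (alpha - 1 + n) * np.log(lam) - (beta + x.sum()) * lam
post = np.exp(log_post - log_post.max())
post /= post.sum() * dl
gamma_pdf = np.exp(a_post * np.log(b_post) - lgamma(a_post)
                   + (a_post - 1) * np.log(lam) - b_post * lam)
print(np.max(np.abs(post - gamma_pdf)))    # close to 0: the posterior is Gamma
```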

Fig 2. Posterior cumulative distribution function when sample size is 80
The parameter of interest is still the population mean, which is the inverse of λ_0. We compare the fully parametric approaches under the true model and under a misspecified model with the empirical likelihood approaches, by measuring the Bayes risk of the mean under squared error loss. In the fully parametric approach under the true model, we can directly calculate the theoretical Bayes risk. In the fully parametric approach under the misspecified model, we assume the data are drawn from χ²_ν. We use an inverse gamma distribution with α = 2 and β = 1 as the prior on ν, so that the prior distributions on the population mean are the same under the two fully parametric approaches and the three empirical likelihood approaches. The resulting posterior kernel is that of an inverse Gaussian distribution with mean parameter μ = 2/(n log 2) and scale parameter λ = 2. Letting V be an inverse Gaussian random variable with these parameters, we can calculate the Bayes risk by Monte Carlo integration. For the empirical likelihood approaches, simulation is needed. The results are summarized in Table 2. We can see that the Bayes risks of all three empirical likelihoods are larger than that of the fully parametric approach under the true model, but smaller than that of the misspecified model, which substantiates the robustness of our nonparametric approaches against model misspecification.

Discussion
The paper provides an asymptotic expansion of the posterior based on an empirical likelihood subject to a linear constraint. The Bernstein-von Mises theorem and asymptotic expansions of the cumulative distribution function and the posterior mean are obtained as corollaries. Future work will include an extension to the multivariate case, as well as expansions subject to multiple constraints. Another potential topic of research is the asymptotic expansion of posteriors under regression constraints, extending the arguments of [7,6].

Appendix A: Behavior of log empirical likelihood in the tail

The Taylor expansion consists of expanding the log empirical likelihood and the prior density around the mean and then controlling the tail part of the log empirical likelihood. To implement this idea, the tail part of l̃(θ) must vanish faster than the required polynomial order. In this section, we show that the tail part of l̃(θ) indeed vanishes at an exponential rate.

Appendix B: Higher-order derivatives
In order to expand around the mean, we need to control the remainder terms in the Taylor expansion, which involves the finiteness of higher-order derivatives ofl.

Lemma 6.
Under Assumptions 1 and 3, for any k = 2, . . . , K + 3, the kth derivative of l̃ admits a representation in which the P_k are polynomial functions, all the r_j < C(k) for some constant C(k) depending only on k, and the M_j are weighted averages of higher-order derivatives of g.
Proof. The result is proved by induction on k. From Lemma 2, we know the form of the first derivative, so the lemma holds for k = 1. Assume the lemma holds for some k. Then, for k + 1, the partial derivative of P_k is still a polynomial, and D itself is a polynomial in n^{−1} Σ_{i=1}^n ω_i^r g(X_i, θ)², μ, and ν. As in the calculation of the first-order derivative of the log empirical likelihood, the dω_i/dθ are polynomials involving terms like the M_i. So for k + 1, the higher-order derivatives of the log empirical likelihood are of the same form. Hence, by mathematical induction, the lemma holds for all k.
By Lemma 6, the higher-order derivatives of the log empirical likelihood are rational functions of the sample moments of the higher-order derivatives of g. We can thus anticipate that the higher-order derivatives of the log empirical likelihood can be bounded in a small neighborhood of the true parameter when the sample size is large, provided the population moments of the higher-order derivatives of g are finite. This we prove in the following lemma.

Lemma 7.
Under Assumptions 3, 4 and 5, there exist constants δ_2, C_3 and N_4 such that the bound (B.1) holds for any |b(θ − θ̂)| ≤ δ_2, n > N_4, and j = 1, . . . , k.
Proof. All ω_i in Lemma 6 are equal to 1 when evaluated at θ̂. Under the assumption of finite moments, the strong law of large numbers and the strong consistency of the M-estimator θ̂ give convergence of the relevant sample moments. By Lemma 3, the higher-order derivatives are continuous functions. Hence, for any small ε_2 > 0, there exists a constant δ_2 such that the derivatives change by at most ε_2 whenever |b(θ − θ̂)| ≤ δ_2. By Lemma 6, there exists a constant N_4 such that the bound holds whenever n > N_4.
By assumption, all the moments are bounded when k ≤ K + 3. Then there exists a constant C_3 such that the stated bound holds, which leads to (B.1).
This leads to the desired bound on l̃. By Lemma 9, the first term on the right-hand side is bounded as required; by Lemma 8 and a Taylor expansion of ρ(θ), so is the second term. Hence (C.5) holds.

Appendix D: Proof of the fundamental theorem for expansion
We first derive P_K(ξ, n) intuitively. First, we expand the log empirical likelihood, then multiply the expansion by ρ_K. For the third term on the right-hand side of the resulting equation, we change the summation index, after which that term can be rearranged. Similarly, for the fourth term, letting Σ_{u=3}^{K+3} m_{u,i}(u − 2) + j = h, the summation can be rearranged as well. We collect terms of the same order in n, and denote the sum of all terms of order higher than K by R_K(Y). Then we obtain the product expansion. Integrating over the interval (Y_(1), ξ], we get the polynomial P_K(ξ, n). Now we prove Theorem 1.

Proof. Let
For the second term on the right-hand side, we add and subtract R_K(y) in the integrand and apply a Taylor expansion. We need δ_4 sufficiently small so that there exist constants C_6 and C_7 satisfying the required bounds. Adding all the parts, we get the inequality in Theorem 1.