Exact and efficient inference for Partial Bayes problems

Bayesian methods are useful for statistical inference. However, applying Bayesian methods to real-world problems can be challenging when the data analyst has only limited prior knowledge. In this paper we consider a class of problems, called Partial Bayes problems, in which the prior information is only partially available. Taking the recently proposed Inferential Model approach, we develop a general inference framework for Partial Bayes problems, and derive both exact and efficient solutions. In addition to the theoretical investigation, numerical results and real applications are used to demonstrate the superior performance of the proposed method.


Introduction
In many real-world statistical problems, the information that is available to the data analysts can be organized in a hierarchical structure. That is, there exists some past experience about the parameter(s) of interest, and data relevant to the parameter(s) are also collected. For this type of problems, the standard approach to statistical inference is the Bayesian framework.
However, in many applications, the data analysts have only limited prior knowledge. For instance, the prior information may be insufficient to form a known distribution, so that data analysts need to assume some unknown distributional components in the Bayesian setting.
This class of problems has posed many challenges to statisticians; see, for example, Lambert and Duncan (1986), Meaux et al. (2002), and Moreno et al. (2003). To study such problems systematically, in this article we refer to them as Partial Bayes problems, highlighting that only partial information is available in the Bayesian prior distribution. Partial Bayes problems have drawn a lot of attention in the statistics literature. One popular type of Partial Bayes problem is the case where a Bayesian hierarchical model contains an unknown prior distribution, either parametric or non-parametric. A widely used approach to this type of model is Empirical Bayes, first proposed by Robbins (1956) for non-parametric prior distributions, and later developed by Efron and Morris (1971, 1972a,b, 1973, 1975) for parametric prior distributions. Another kind of Partial Bayes problem was studied by Xie et al. (2013), in which the joint prior distribution of a parameter vector is missing, but some marginal distributions are known. For clarity, we will refer to this type as the marginal prior problem. In Xie et al. (2013), the solution to the marginal prior problem is based on the Confidence Distribution approach (Xie et al., 2011), which provides a unified framework for meta-analysis.
The Empirical Bayes and Confidence Distribution approaches both have successful real-world applications. However, one fundamental problem in scientific research, exact inference about the parameter of interest, remains an open question for Partial Bayes problems. As pointed out by many authors (Morris, 1983; Laird and Louis, 1987; Carlin and Gelfand, 1990), Empirical Bayes in general underestimates the uncertainty associated with the interval estimators, so these authors have proposed various methods to correct the bias of the coverage rate. However, even though these methods perform better, they still achieve the target coverage rates only approximately. The same issue arises in the Confidence Distribution framework. Confidence Distribution provides a novel way to combine different inference results, but these individual inferences may or may not be exact.
All of these indicate that the exact inference for Partial Bayes problems is highly non-trivial.
Recently, the Inferential Model (Martin and Liu, 2013, 2015a,c) has been proposed as a new framework for statistical inference, which not only provides Bayesian-like probabilistic measures of uncertainty about the parameter, but also has an automatic long-run frequency calibration property. In this paper, we use this framework to derive interval estimators for the parameters of interest in Partial Bayes problems, and demonstrate their important statistical properties, including exactness and efficiency. When comparing with other approaches, we refer to the proposed estimators as Partial Bayes solutions for brevity.
The remaining part of this article is organized as follows. In Section 2 we study a hierarchical normal-means model as a motivating example of Partial Bayes problems. In Section 3 we provide a brief review of the Inferential Model framework as the theoretical foundation of our analysis. Section 4 is the main part of this article, where we introduce a general framework for studying Partial Bayes problems, and deliver our major theoretical results.
We revisit some popular Partial Bayes models in Section 5, and conduct simulation studies in Section 6 to numerically compare the proposed solutions with other methods. In Section 7 we consider an application to a basketball game dataset, and finally in Section 8 we conclude with a few remarks. Proofs of theoretical results are given in the appendix.

A Motivating Example
In this section, we use a motivating example to demonstrate what a typical Partial Bayes problem is, and how its solution differs from the existing method. Consider the well-known normal hierarchical model for the observed data $X = (X_1, \ldots, X_n)$. The model introduces $n$ unobservable means $\mu_1, \ldots, \mu_n$, one for each observation, and assumes that conditional on the $\mu_i$'s, the $X_i$'s are mutually independent with $X_i \mid \{\mu_1, \ldots, \mu_n\} \sim N(\mu_i, \sigma^2)$ for $i = 1, \ldots, n$, where the common variance $\sigma^2$ is known. In addition, the $\mu_i$'s are i.i.d. with $\mu_i \sim N(\mu, \tau^2)$ for $i = 1, \ldots, n$, where the variance $\tau^2$ is known but the mean $\mu$ is an unknown hyper-parameter.
The problem of interest here is to make inference about the individual means $\mu_i$, and for simplicity we focus on $\mu_1$ without loss of generality. The aim of inference is to construct an interval estimator for $\mu_1$ that satisfies the following condition: using the terminology in Morris (1983), a sample-based interval $C_\alpha(X)$ is an interval estimator for $\mu_1$ with $100(1-\alpha)\%$ confidence level if it satisfies $P_{\mu_1, X}(C_\alpha(X) \ni \mu_1) \ge 1 - \alpha$ for all $\mu$, where the probability, which gives the coverage rate, is computed over the joint distribution of $(X, \mu_1)$.
The standard Empirical Bayes approach to this problem can be found in Efron (2010).
Compared with Empirical Bayes, the proposed interval has the same center but is slightly wider for small n. For a numerical illustration, we fix α to be 0.05, and take σ 2 = τ 2 = 1. Figure 1 shows the theoretical coverage rates of both the Empirical Bayes solution and the Partial Bayes solution as a function of n. It can be seen that the coverage probability of the Empirical Bayes interval is less than the nominal value 1 − α, and is close to the target only when n is sufficiently large. On the contrary, the Partial Bayes solution correctly matches the nominal coverage rate for all n.
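The coverage comparison described above can be reproduced in closed form under a stated assumption. The sketch below is not taken from the paper: it assumes that the naive Empirical Bayes interval plugs $\bar X$ in for $\mu$ and keeps the posterior standard deviation unchanged, and it compares this with a hypothetical widened interval whose variance also includes the extra term caused by estimating $\mu$. All function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (stdlib only)


def error_variances(n, sigma2=1.0, tau2=1.0):
    """Variance components of mu_1 minus the shrinkage center.

    Returns (post_var, extra_var): the posterior variance of mu_1 when mu is
    known, and the additional variance from replacing mu by the sample mean.
    """
    shrink = sigma2 / (sigma2 + tau2)               # B = sigma^2 / (sigma^2 + tau^2)
    post_var = sigma2 * (1.0 - shrink)              # Var(mu_1 | X_1, mu)
    extra_var = shrink ** 2 * (sigma2 + tau2) / n   # Var(B * (X-bar - mu))
    return post_var, extra_var


def coverage(n, inflate, sigma2=1.0, tau2=1.0, alpha=0.05):
    """Coverage over the joint distribution of (X, mu_1).

    inflate=False: assumed naive plug-in interval (posterior sd only).
    inflate=True:  hypothetical widened interval (posterior + plug-in variance).
    """
    z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    post_var, extra_var = error_variances(n, sigma2, tau2)
    true_var = post_var + extra_var
    interval_var = true_var if inflate else post_var
    return 2.0 * STD_NORMAL.cdf(z * sqrt(interval_var / true_var)) - 1.0


if __name__ == "__main__":
    for n in (2, 5, 10, 50):
        print(n, round(coverage(n, False), 4), round(coverage(n, True), 4))
```

Under these assumptions the naive interval falls short of the nominal level for small $n$ and approaches it as $n$ grows, while the widened interval attains the nominal level for every $n$ by construction.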

A Brief Review of Inferential Models
Since our inference for Partial Bayes problems is based on the recently developed Inferential Models, in this section we provide a brief introduction to this new framework, with more details given in Martin and Liu (2013). Inferential Model is a new framework designed for exact and efficient statistical inference. The exactness of Inferential Models guarantees that under a particular definition, the inference made by Inferential Models has a controlled probability of error, for example, in hypothesis testing problems the Type I error should be no greater than a pre-specified level. In addition, Inferential Models provide a systematic way to combine information in the data for efficient statistical inference.
Formally, Inferential Models draw statistical conclusions on an assertion A, a subset of the parameter space, about the parameter of interest θ. For example, the subset A = {0} stands for the assertion θ = 0, and A = (1, +∞) corresponds to θ > 1. In the Inferential Model framework, two quantities are used to represent the knowledge about A contained in the data: the belief function, which describes how much evidence in the data supports the claim that "A is true", and the plausibility function, which quantifies how much evidence does not support the claim that "A is false".
Like Fisher's fiducial inference, Inferential Models make use of auxiliary or unobserved random variables to represent the sampling model. Unlike Fisher's fiducial inference, however, Inferential Models predict the unobserved realizations of the auxiliary variables using random sets, and propagate this uncertainty to the space of $\theta$ in order to obtain meaningful probabilistic inferential results. Technically, an Inferential Model is formulated as a three-step procedure to produce the inferential results:

Association step. This step specifies an association function $X = a(\theta, U)$ to connect the parameter $\theta \in \Theta$, the observed data $X \in \mathbb{X}$, and the unobserved auxiliary random variable $U \in \mathbb{U}$, with $U$ following a known distribution $P_U$. This relationship implies that the randomness in the data is represented by the auxiliary variable $U$.

Prediction step. Let $u^*$ be the true but unobserved value of $U$ that "generates" the data. This step constructs a valid predictive random set, $S$, to predict $u^*$. $S$ is valid if the quantity $Q_S(u^*) = P_S(S \ni u^*)$, interpreted as the probability that $S$ successfully covers $u^*$, is stochastically no smaller than a uniform random variable when evaluated at $U \sim P_U$, that is, $P_U\{Q_S(U) \le \alpha\} \le \alpha$ for all $\alpha \in (0, 1)$.

Combination step. This step transforms the uncertainty from the $\mathbb{U}$ space to the $\Theta$ space by defining $\Theta_x(S) = \bigcup_{u \in S} \Theta_x(u) = \bigcup_{u \in S} \{\theta : x = a(\theta, u)\}$, a mapping from $\mathbb{U}$ back to $\Theta$ after incorporating the uncertainty represented by $S$. Then for an assertion $A$, its belief function is defined as $\mathrm{bel}_x(A) = P_S\{\Theta_x(S) \subseteq A \mid \Theta_x(S) \neq \emptyset\}$, and similarly, its plausibility function is $\mathrm{pl}_x(A) = 1 - \mathrm{bel}_x(A^c)$.

The plausibility function is very useful for deriving frequentist-like confidence regions for the parameter of interest (Martin, 2015). If we let $A$ be a singleton assertion $A = \{\theta\}$ and denote $\mathrm{pl}_x(\theta) \equiv \mathrm{pl}_x(\{\theta\})$, then a $100(1-\alpha)\%$ frequentist-like confidence region, which is termed a plausibility region in the Inferential Model framework (or a plausibility interval as a special case), is given by $\mathrm{PR}_x(\alpha) = \{\theta : \mathrm{pl}_x(\theta) \ge \alpha\}$. In the Inferential Model framework, the exactness of the inference is formally termed validity. For example, the validity property guarantees that the above region $\mathrm{PR}_x(\alpha)$ has at least $100(1-\alpha)\%$ long-run coverage probability.
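To make the three steps concrete, here is a minimal sketch for the simplest possible model, a single observation $X = \theta + U$ with $U \sim N(0, 1)$. The model, the symmetric "default" predictive random set, and all function names are assumptions chosen for illustration; they are not taken from this paper.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def plausibility(theta, x):
    """pl_x({theta}) for the toy model X = theta + U, U ~ N(0, 1).

    Association: u = x - theta.
    Prediction (assumed default random set):
        S = {u : |Phi(u) - 0.5| <= |U' - 0.5|}, with U' ~ Unif(0, 1),
        which covers a fixed u with probability 1 - |2 * Phi(u) - 1|.
    Combination: theta is plausible exactly when S covers x - theta.
    """
    return 1.0 - abs(2.0 * STD_NORMAL.cdf(x - theta) - 1.0)


def plausibility_interval(x, alpha=0.05):
    """Closed form of {theta : pl_x(theta) >= alpha} for the toy model."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return x - z, x + z


if __name__ == "__main__":
    x_obs = 1.3
    low, high = plausibility_interval(x_obs)
    print(low, high, plausibility(low, x_obs), plausibility(high, x_obs))
```

In this toy case the plausibility interval coincides with the classical z-interval, which illustrates the long-run coverage guarantee mentioned above.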
It is worth mentioning that Inferential Models also have a number of extensions for efficient inference. When the model has multiple parameters but only some of them are of interest, the Marginal Inferential Models (MIM, Martin and Liu, 2015c) appropriately integrate out the nuisance parameters. For models where the dimension of auxiliary variables is higher than that of the parameters, the Conditional Inferential Models (CIM, Martin and Liu, 2015a) could be used to combine information in the data such that efficient inference can be achieved.
Both MIM and CIM are used extensively in our development of exact and efficient inference for Partial Bayes problems.

Inference for Partial Bayes Problems
In this section we build a general model framework for studying Partial Bayes problems.
The derivation of our interval estimator is described in detail using the Inferential Model framework, and some of its key statistical properties are also studied.

Model Specification
Our attempt here is to provide a simple model framework that is general enough to describe a broad range of Partial Bayes problems introduced in Section 1.
Let $X$ be the observed data, whose distribution $f$ relies on an unknown parameter vector $\theta$. The information on $\theta$ that comes from the collected data is expressed by the conditional distribution of $X$ given the parameter: $X \mid \theta \sim f(x \mid \theta)$. In many cases, we have prior knowledge about $\theta$ that can be characterized as a prior distribution $\pi_0(\theta)$. When $\pi_0(\theta)$ is fully specified, the standard Bayesian method can be used to derive the posterior distribution of $\theta$. In other cases, there is only partial prior information available. Formally, assume that the parameter $\theta$ can be partitioned into two blocks, $\theta = (\tilde\theta, \theta^*)$, so that the desired fully specified prior of $\theta$ can be accordingly decomposed as $\pi_0(\theta) = \pi(\tilde\theta \mid \theta^*)\,\pi^*(\theta^*)$, where $\pi(\tilde\theta \mid \theta^*)$ is the conditional density function of $\tilde\theta$ given $\theta^*$, and $\pi^*(\theta^*)$ is the marginal distribution of $\theta^*$. We call the prior information partial if only the conditional distribution $\tilde\theta \mid \theta^* \sim \pi(\tilde\theta \mid \theta^*)$ is available, but $\pi^*(\theta^*)$ is missing. In general, inference is made on $\tilde\theta$ or a component of $\tilde\theta$, i.e., $\tilde\theta$ can be further partitioned into $\tilde\theta = (\eta, \xi)$, with $\eta$ denoting the parameter of interest and $\xi$ denoting the additional nuisance parameters. In this article we focus on the case that $\eta$ is a scalar, which is of interest for many practical problems. For better presentation, we summarize these concepts and the proposed model structure in the following table:

Sampling model: $X \mid \theta \sim f(x \mid \theta)$, with $\theta = (\tilde\theta, \theta^*)$
Partial prior: $\tilde\theta \mid \theta^* \sim \pi(\tilde\theta \mid \theta^*)$
Component without prior: $\theta^*$
Parameter of interest: $\eta$, with $\tilde\theta = (\eta, \xi)$
Despite its simplicity, the above model includes the well-known hierarchical models as an important class of practically useful models. Moreover, the formulation goes beyond hierarchical models, and also includes the marginal prior problem. As described in Section 1, our target of inference is to construct a sample-based interval $C(X)$ that satisfies some validity conditions. Specifically, the following two types of validity properties are considered:

Definition 1. $C(X)$ is said to be an unconditionally valid interval estimator for $\eta$ with $100(1-\alpha)\%$ confidence level, if $P_{X,\theta}(C(X) \ni \eta) \ge 1 - \alpha$ for all $\pi^*(\theta^*)$, where the probability is computed over the joint distribution of $(X, \theta)$.

Definition 2. $C(X)$ is said to be a conditionally valid interval estimator for $\eta$ given $H(X)$ with $100(1-\alpha)\%$ confidence level, if $P_{X,\theta \mid H(X)}(C(X) \ni \eta \mid H(X) = h) \ge 1 - \alpha$ for all $\pi^*(\theta^*)$ and $h$, where $H(X)$ is a statistic of the data, and the probability is computed over the joint distribution of $(X, \theta)$ given $H(X) = h$.
Definition 1 is a rephrasing of the validity condition in Morris (1983), and Definition 2 comes from Carlin and Gelfand (1990). It should be noted that the second condition is stronger than the first, since it can be reduced to Definition 1 by averaging over H(X). In this article, we aim to produce the second type of interval estimators, but the first validity property is studied when different interval estimators for η are compared with each other.

Inferential Models for Partial Bayes Problems
In this section we describe a procedure to analyze Partial Bayes problems in the Inferential Model framework, and develop intermediate results that are used to derive the proposed interval estimator in Section 4.3. The procedure consists of the three steps introduced in Section 3, and outputs a plausibility function for η, the parameter of interest.

The Association Step
The association step has three sub-steps, and we highlight their tasks at the beginning of each sub-step.
Constructing data and prior associations. The first association equation comes from the data sampling model $X \mid \theta \sim f(x \mid \theta)$, for which we write $X = a_1(\theta, W_1)$, where $a_1(\cdot)$ is the "data association" function, and $W_1$ is an unobservable auxiliary variable that has a known distribution. Since $\theta$ can be partitioned into $\theta = (\tilde\theta, \theta^*)$ with $\tilde\theta \mid \theta^* \sim \pi(\tilde\theta \mid \theta^*)$, the equation that represents this partial information can be written as $\tilde\theta = a_2(\theta^*, W_2)$, where $a_2(\cdot)$ is the "prior association" function, and $W_2$ is another auxiliary variable independent of $W_1$. Substituting the prior association into the data association, we get $X = a_1((a_2(\theta^*, W_2), \theta^*), W_1)$. To avoid overly complicated notation, we simply write this relation as $X = a(\theta^*, W)$, where $W = (W_1, W_2)$.
As described in Section 4.1, we are only interested in an element of the $\tilde\theta$ vector, so we assume that $\tilde\theta = a_2(\theta^*, W_2)$ can be equivalently decomposed as $\eta = a_\eta(\theta^*, V_\eta)$ and $\xi = a_\xi(\theta^*, V_\xi)$, where $a_\eta(\cdot)$ and $a_\xi(\cdot)$ are the decomposed associations and $W_2 = (V_\eta, V_\xi)$. Therefore, the model for Partial Bayes problems can be summarized by the following system of three equations:
$$X = a(\theta^*, W), \qquad \eta = a_\eta(\theta^*, V_\eta), \qquad \xi = a_\xi(\theta^*, V_\xi). \qquad (2)$$
Note that $\xi$ can be regarded as a nuisance parameter, and (2) is "regular" in the sense of Definition 2 of Martin and Liu (2015c). Then according to the general theory of MIM in that paper (Theorems 2 and 3), the third equation in (2) can be ignored without loss of efficiency.
Decomposing data association. Next, since the sample $X$ usually contains multiple observations, the dimension of $W$ can often be very high. In order to reduce the number of auxiliary variables, assume that the relationship $X = a(\theta^*, W)$ admits a decomposition
$$T(X) = a_T(\theta^*, \tau(W)), \qquad H(X) = \rho(W), \qquad (3)$$
for one-to-one mappings $x \mapsto (T(x), H(x))$ and $w \mapsto (\tau(w), \rho(w))$. Martin and Liu (2015a) show that this decomposition exists for a large number of models, and in case (3) is not available, we simply write $H(X) = 1$ and $\rho(W) = 1$. Equation (3) implies that when the collected data have a realization $x$, the auxiliary variable $W_H := \rho(W)$ is fully observed with the value $h := H(x)$. By conditioning on $W_H = h$, we obtain the following two conditional associations
$$T(X) = a_T(\theta^*, W_T), \quad W_T \sim P_{W_T \mid h}, \qquad (4)$$
$$\eta = a_\eta(\theta^*, V_\eta), \quad V_\eta \sim P_{V_\eta \mid h}, \qquad (5)$$
where $W_T := \tau(W)$, and the notation $Z \sim P_{Z \mid h}$ means that the random variable $Z$ has a distribution $P_{Z \mid h}$ given $W_H = h$. In the rest of Section 4.2, when we discuss the distribution of a random variable that depends on $W_T$ or $V_\eta$, the condition $W_H = h$ is implicitly added.
Obtaining the final association. Finally, to make inference about $\eta$, the unknown quantity $\theta^*$ needs to be marginalized out of the equations. We seek a real-valued continuous function $b(\cdot, \cdot)$ such that when its first argument is fixed to some value $t$, the mapping $\eta \mapsto b(t, \eta)$ is one-to-one. At the current stage we simply take $b$ as an arbitrary function, and we defer the discussion of its optimal choice to Section 4.4. As a result, associations (4) and (5) are equivalent to
$$T(X) = a_T(\theta^*, W_T), \qquad (6)$$
$$b(T(X), \eta) = W_b(\theta^*), \qquad (7)$$
where $W_b(\theta^*) := b(a_T(\theta^*, W_T), a_\eta(\theta^*, V_\eta))$. Conditional on $\theta^*$, $W_b(\theta^*)$ is a random variable whose c.d.f. $F_{W_b(\theta^*) \mid h}$ is indexed by the unknown parameter $\theta^*$. If the function $b$ is chosen such that $\theta^*$ has only little effect on $F_{W_b(\theta^*) \mid h}$, the first equation (6) provides little or even no information about $\eta$, and hence it can be ignored according to the theory of MIM. The final association equation (7) thus completes the association step.

The Prediction Step
The aim of this step is to introduce a predictive random set $S_h$ conditional on $W_H = h$ that can predict $W_b(\theta^*)$ with high probability. The following two situations are considered.
The first situation is that W b (θ * ) is in fact free of θ * . This can be easily achieved if θ * has the same dimension as η, and if the mapping η = a η (θ * , V η ) can be inverted as θ * = a θ * (η, V η ).
The second situation is more general and thus more challenging, in which case F W b (θ * )|h relies on the unknown parameter θ * . Typically this occurs when the dimension of θ * is higher than that of η. To deal with this issue, we generalize the Definition 5 of Martin and Liu (2015c) to define the concept of stochastic bounds for tails.
Definition 3. Let Z and Z * be two random variables with c.d.f. F Z and F Z * respectively, and denote by med(Z) the median of Z. Z is said to be stochastically bounded by Z * in The difference between this definition and the one in the literature is that here the medians of Z and Z * are not required to be zero.
Assume that we have found a random variable W * b such that given W H = h, W b (θ * ) is stochastically bounded by W * b in tails for any θ * . Note that the first situation discussed earlier can be viewed as a special case, since any random variable is stochastically bounded by itself in tails. To shorten the argument, we only consider this more general case for later discussion. There are various ways to construct such a random variable W * b , see the examples in Martin and Liu (2015c). Here we provide a simple approach, by defining the c.d.f. to be provided that the resulting function is a c.d.f..
Given $F_{W_b^* \mid h}$, a standard conditional predictive random set $S_h$ can be chosen for the prediction of $W_b(\theta^*)$. For the purpose of constructing two-sided interval estimators, we first define the generalized inverse c.d.f. of a random variable $Z$ as $F_Z^{-1}(u) = \inf\{x : F_Z(x) \ge u\}$, and then construct $S_h$ as in (8). This completes the prediction step; other choices of the predictive random set for different purposes are discussed in Martin and Liu (2013).
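As an illustration of how a two-sided predictive random set can be built from a generalized inverse c.d.f., the sketch below uses the common symmetric "default" construction $[F^{-1}(0.5 - |U' - 0.5|),\ F^{-1}(0.5 + |U' - 0.5|)]$ with $U' \sim \mathrm{Unif}(0, 1)$. This form is an assumption made for illustration and is not claimed to be the exact display (8); the normal bounding distribution and all names are hypothetical.

```python
import random
from statistics import NormalDist


def draw_random_set(quantile, rng):
    """Draw one realization of the assumed default two-sided random set.

    `quantile` is the generalized inverse c.d.f. of the bounding variable.
    """
    # |U' - 0.5| with U' ~ Unif(0, 1); capped so probabilities stay in (0, 1).
    radius = min(abs(rng.random() - 0.5), 0.5 - 1e-12)
    return quantile(0.5 - radius), quantile(0.5 + radius)


def covering_probability(cdf, w):
    """Exact probability that the random set contains w (continuous c.d.f.)."""
    return 1.0 - abs(2.0 * cdf(w) - 1.0)


def estimate_covering_probability(quantile, w, draws=20_000, seed=1):
    """Monte Carlo estimate of P(S contains w), for checking the closed form."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        lower, upper = draw_random_set(quantile, rng)
        hits += lower <= w <= upper
    return hits / draws


if __name__ == "__main__":
    bound = NormalDist(mu=0.3, sigma=1.2)  # hypothetical bounding distribution
    w = 1.5
    print(covering_probability(bound.cdf, w),
          estimate_covering_probability(bound.inv_cdf, w))
```

The closed-form covering probability is what a combination step would turn into a plausibility function; the simulation only confirms that the two agree for this assumed random set.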

The Combination Step
In what follows, to avoid notational confusions we use η to represent the parameter of interest as a random variable, and denote byη the possible values of η. In the final combination step, denote by Θ T (x) (w) the set ofη values that satisfy the association equation (7) with Then the conditional plausibility function for η is obtained as which completes the combination step.

Interval Estimator and Validity of Inference
In Section 4.2.3 a conditional plausibility function for the $\eta$ parameter has been derived under the Inferential Model framework, and in this section it is used to construct the proposed interval estimator. Similar to the construction of the plausibility region introduced in Section 3, we define the following set-valued function of $x$:
$$C_\alpha(x) = \{\eta : \mathrm{cpl}_{T(x) \mid h}(\eta) \ge \alpha\}. \qquad (10)$$
From (9) it can be seen that $\mathrm{cpl}_{T(x) \mid h}(\eta)$ depends on the data in two respects: the random set $S_h$ depends on $h = H(x)$, and the association function $\Theta_{T(x)}(w)$ depends on $T(x)$. As a result, we define our Partial Bayes interval estimator for $\eta$ to be $C_\alpha(X)$, obtained by plugging the random sample $X$ into $C_\alpha(x)$.
In the typical case that η is a fixed value, the Inferential Model theory guarantees that C α (X) is a valid 100(1 − α)% frequentist confidence interval for η. However in our case, the joint distribution of the parameter and data is considered, as in Definitions 1 and 2.
Therefore, the validity of C α (X) does not automatically follow from the Inferential Model theory, and hence needs to be studied separately. The result is summarized as Theorem 1.
When $H(X)$ is trivial, for example $H(X) = 1$, Theorem 1 reduces to the unconditional result corresponding to Definition 1.

Optimality and Efficiency
Theorem 1 states that the proposed interval estimator $C_\alpha(X)$ defined in (10) satisfies the validity condition. Another important property, the efficiency of the estimator, is discussed in this section. We claim two facts about the proposed interval estimator:
1. If $\pi^*(\theta^*)$ is known, then with a slight modification to the predictive random set $S_h$, the optimal interval estimator $C^o_\alpha(X)$ can be constructed.
2. If $\pi^*(\theta^*)$ is unknown, then under some mild conditions, $C_\alpha(X)$ approximates $C^o_\alpha(X)$ well.
The discussion also guides the choice of the $b$ function in (7).
First consider the ideal scenario that π * (θ * ), the marginal distribution of θ * , is known, in which case a full prior distribution for θ is available. On one hand, it is well known that given a fully-specified prior distribution, the optimal inference for the parameter is via its posterior distribution given the data. On the other hand, given this new information, the approach introduced in Section 4.2 can still be used to derive an interval estimator, with some slight modifications shown below. Later this result is compared with the Bayesian solution.
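Since $\pi^*(\theta^*)$ is now known, $\theta^*$ can itself be represented by an association with an auxiliary variable of known distribution. Combining it with (6) and (7), we obtain a system of three associations (11), which involves two auxiliary variables, $Z_T$ and $W_b$. Again, the second equation implies that given the data $x$, $Z_T$ is fully observed with value $t := T(x)$, so the auxiliary variable $W_b$ can be predicted using its conditional distribution given $W_H = h$ and $Z_T = t$, which we denote by $F_{W_b \mid h,t}$. Similar to the prediction step in Section 4.2.2, we construct a predictive random set $S_{h,t}$ as in (8), with $F_{W_b \mid h,t}$ in place of $F_{W_b^* \mid h}$, and proceed with the same combination step to obtain the conditional plausibility function $\mathrm{cpl}_{T(x) \mid h,t}(\eta)$.

As a result, the interval estimator for $\eta$ is obtained as $C^o_\alpha(x) = \{\eta : \mathrm{cpl}_{T(x) \mid h,t}(\eta) \ge \alpha\}$. Comparing the $\mathrm{cpl}_{T(x) \mid h,t}(\eta)$ function that defines $C^o_\alpha(X)$ and the $\mathrm{cpl}_{T(x) \mid h}(\eta)$ function in (9), it can be seen that they only differ in the distributions assigned to the predictive random sets. The following theorem shows that with this slight change, $C^o_\alpha(X)$ matches the Bayesian posterior credible interval.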
Theorem 2. Assume that $\pi^*(\theta^*)$ is known and $\eta$ has a continuous distribution function. Then $C^o_\alpha(X)$ is optimal in the sense that it matches the Bayesian posterior credible interval.

Theorem 2 implies that, by choosing a proper predictive random set $S_{h,t}$ for the $W_b$ auxiliary variable, the inference result can attain optimality. This fact suggests that even when $\pi^*(\theta^*)$ is missing, as long as there exists a predictive random set close to $S_{h,t}$, the resulting interval estimator would be as efficient as the optimal one, at least approximately.
Recall that the optimal predictive random set $S_{h,t}$ is induced by the distribution $F_{W_b \mid h,t}$, and when $\pi^*(\theta^*)$ is missing, only $F_{W_b(\theta^*) \mid h}$ is available. Therefore, the next question is to find the conditions under which $F_{W_b(\theta^*) \mid h}$ is close to $F_{W_b \mid h,t}$. Since they are both conditional on $W_H = h$, to simplify the analysis we remove this condition from both distributions, and then study the closeness between $F_{W_b(\theta^*)}$ and $F_{W_b \mid t}$, where $F_{W_b(\theta^*)}$ is the distribution of $W_b(\theta^*)$ in (7), and $F_{W_b \mid t}$ stands for the distribution of $W_b$ defined in (11) given $Z_T = t$.
In most real applications, the association relation for T (X) changes with the data size n.
To emphasize the dependence on n, in what follows we write W bn (θ * ), Z Tn , and W bn in place of W b (θ * ), Z T , and W b , respectively. The following definition from Xiong and Li (2008) is needed to study the large sample property of a conditional distribution.
Definition 4. Given two sequences of random variables X n and Y n , the conditional distribution function of X n given Y n , a random c.d.f. denoted by F Xn|Yn , is said to converge weakly to a non-random c.d.f. F Z in probability, denoted by X n |Y n d.P → Z, if for every continuous This definition is a generalization to the usual concept of weak convergence. Then we have the following result: Theorem 3. Let g n , h n , and p n denote the densities of W bn , Z Tn , and (W bn , Z Tn ), respectively. Also define l n (w, z) = p n (w, z)/[g n (w)h n (z)]. If (a) for fixed u, a T (u, W Tn ) is seen as a fixed value.
Remark 1. Conditions (a) and (b) are intentionally expressed in a simple form. In fact they can be replaced by a T (u, W Tn ) where f 1 and f 2 are one-to-one functions, and the limiting distribution is changed to f 2 (V η ) accordingly.
Remark 2. The three conditions are easy to check. Condition (a) states that T (X) should be a consistent estimator for θ * if θ * is seen as fixed. Condition (b) guides the choice of the b , and a sufficient condition for (c) is that the density of (W bn , Z Tn ) also converges to that of (V η , U ), which is satisfied by most parametric models.
To summarize, Theorem 3 indicates that W bn (θ * ) and W bn |Z Tn converge to the same limiting distribution, in which sense the random sets S h and S h,t have approximately identical distributions when n is sufficiently large. As a result, the proposed interval estimator C α (X) defined in (10) can be seen as an approximation to the optimal solution C o α (X). Combining Theorem 1 and Theorem 3, it can be concluded that the proposed interval estimator possesses the favorable properties of both validity and efficiency.

Popular Models Viewed as Partial Bayes Problems
In this section we apply the methodology in Section 4 to a collection of popular models viewed as Partial Bayes problems, and show how their Partial Bayes solutions are developed.

The Normal Hierarchical Model
The normal hierarchical model is extremely popular in the Empirical Bayes literature, partly due to its simplicity and flexibility; see for example Efron and Morris (1975); Morris (1983); Casella (1985); Efron (2010). The model setting has been given in Section 2, and without loss of generality we set $\sigma^2 = 1$, since the $X_i$'s can always be scaled by a constant to achieve an arbitrary variance. We will consider both the cases where $\tau^2$ is known and unknown, and our parameter of interest is $\mu_1$. To summarize, we write

Sampling model: $X_i \mid \mu_i \sim N(\mu_i, 1)$, $i = 1, \ldots, n$
Partial prior: $\tilde\theta = (\mu_1, \ldots, \mu_n)$, $\mu_i \mid \theta^* \overset{iid}{\sim} N(\mu, \tau^2)$
Component without prior: $\theta^* = \mu$ if $\tau^2$ is known, and $\theta^* = (\mu, \tau^2)$ otherwise
Parameter of interest: $\eta = \mu_1$

As a first step, this model can be expressed by the following association equations: $\mu_i = \mu + \tau\varepsilon_i$ and $X_i = \mu_i + e_i$ for $i = 1, \ldots, n$, where $\varepsilon_i \overset{iid}{\sim} N(0, 1)$, $e_i \overset{iid}{\sim} N(0, 1)$, and $e_i$ and $\varepsilon_i$ are independent. An equivalent expression for these associations is $\mu_i = \mu + \tau\varepsilon_i$, $X_i = \mu + \tau\varepsilon_i + e_i$, in which the data are directly linked to the unknown $\mu$. Since the focus is on $\mu_1$, equations related to $\mu_2, \ldots, \mu_n$ can be ignored. In the following two subsections we discuss the cases with known and unknown $\tau^2$, respectively.

The case with a known τ 2
This case corresponds to the motivating example presented in Section 2, and we are going to derive formula (1) with $\sigma^2 = 1$. Since $\tau$ is known, let $W_i = \tau\varepsilon_i + e_i$, $i = 1, 2, \ldots, n$, and then the system of associations $X_i = \mu + \tau\varepsilon_i + e_i$ can be rewritten as $\bar X = \mu + \bar W$ and $X_i - X_1 = W_i - W_1$ for $i = 2, \ldots, n$, where $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ and $\bar W = \frac{1}{n}\sum_{i=1}^n W_i$. Therefore, by denoting $T(X) = \bar X$ and $H(X) = X_{(-1)} - X_1 \mathbf{1}_{n-1}$, where $X_{(-1)} = (X_2, \ldots, X_n)$ and $\mathbf{1}_{n-1}$ is a vector of all ones, the decomposition in equation (3) is achieved. The associated auxiliary variable for $H(X)$ is $W_H = W_{(-1)} - W_1 \mathbf{1}_{n-1}$, where $W_{(-1)} = (W_2, \ldots, W_n)$.
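Next, we keep the following two associations $\bar X = \mu + \bar W$ and $\mu_1 = \mu + \tau\varepsilon_1$, where $\bar W \sim P_{\bar W \mid h}$ and $\varepsilon_1 \sim P_{\varepsilon_1 \mid h}$ conditional on $W_H = h \equiv H(x)$. The last step is to take $b(\bar X, \mu_1) = \bar X - \mu_1$, and the final association equation is $b(\bar X, \mu_1) = W_b := \bar W - \tau\varepsilon_1$. It can be verified that the conditional distribution of $W_b$ given $W_H = h$ is
$$W_b \mid \{W_H = h\} \sim N\left(m_h, s_n^2\right), \qquad m_h = \frac{\tau^2(\bar x - x_1)}{1 + \tau^2}, \qquad s_n^2 = \frac{n\tau^2 + 1}{n(1 + \tau^2)},$$
and the predictive random set (8) can be constructed accordingly. As a result, the conditional plausibility function for $\mu_1$ is obtained as
$$\mathrm{cpl}_{\bar x \mid h}(\mu_1) = 1 - \left| 2\Phi\left(\frac{\bar x - \mu_1 - m_h}{s_n}\right) - 1 \right|,$$
where $\Phi$ is the standard normal c.d.f., and hence the interval estimator for $\mu_1$ is
$$C_\alpha(X) = \frac{\bar X + \tau^2 X_1}{1 + \tau^2} \pm z_{\alpha/2} \sqrt{\frac{n\tau^2 + 1}{n(1 + \tau^2)}},$$
where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution.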

The Poisson Hierarchical Model
The Poisson hierarchical model is useful for analyzing discrete data such as counts. Assume that given parameters $\lambda_i > 0$, the observed data $X = (X_1, \ldots, X_n)$ satisfy $X_i \mid \lambda_i \sim \mathrm{Pois}(\lambda_i t_i)$, $i = 1, \ldots, n$, where $t_i > 0$ are known constants. In real-world problems, $\lambda_i$ can be interpreted, for example, as the rate of events in unit time, and $t_i$ is the length of the time window. It is also assumed that the $\lambda_i$'s follow a common prior, $\lambda_i \overset{iid}{\sim} \gamma\,\mathrm{Gamma}(s)$, where $s$ is a known shape parameter and $\gamma$ is an unknown scale parameter. In this setting the parameter of interest is $\lambda_1$. This model can also be expressed using the formulation in Section 4.1:

Sampling model: $X_i \mid \lambda_i \sim \mathrm{Pois}(\lambda_i t_i)$, $i = 1, \ldots, n$
Partial prior: $\tilde\theta = (\lambda_1, \lambda_2, \ldots, \lambda_n)$, $\lambda_i \mid \theta^* \overset{iid}{\sim} \gamma\,\mathrm{Gamma}(s)$
Component without prior: $\theta^* = \gamma$
Parameter of interest: $\eta = \lambda_1$

The data and prior associations are written in terms of auxiliary variables $U = (U_1, \ldots, U_n)$ and $V = (V_1, \ldots, V_n)$, where $V_i \overset{iid}{\sim} \mathrm{Gamma}(s)$, and $U$ and $V$ are independent. After plugging the prior associations into the data associations and ignoring irrelevant parameters, the association equations (18) are kept without loss of information. A fundamental difference between this Poisson model and the normal model studied earlier is that, due to the discreteness of the $X_i$'s and the heterogeneity of the $t_i$ values, it is unlikely that there exists a non-trivial function $H(x)$ such that the distribution of $H(X)$ is free of $\gamma$. This is an example in which the decomposition (3) is not available, and hence we trivially take $H(X) = 1$ and $T(X) = X$. As a result, the next step is to seek the $b$ function in (7) such that $b(T(X), \lambda_1)$ only weakly relies on $\lambda_1$. The idea is as follows.
Note that in the associations (18), $X_i$ can also be written as a function of $\lambda_1$, $U$, and $V$, and we express this relation as $X = a(\lambda_1, U, V)$ for simplicity. Therefore, given the $b$ function, the final association (7) then becomes $b(X, \lambda_1) = W_b(\lambda_1)$, where the auxiliary variable is $W_b(\lambda_1) = b(a(\lambda_1, U, V), \lambda_1)$. Let $G_{\lambda_1}$ be the c.d.f. of $W_b(\lambda_1)$ conditional on $\lambda_1$, and then the unconditional plausibility function for $\lambda_1$ is $\mathrm{pl}_x(\lambda_1) = 1 - G_{\lambda_1}(b(x, \lambda_1))$. Finally, the interval estimator for $\lambda_1$ is obtained by inverting the plausibility function, i.e., $C_\alpha(x) = \{\lambda : \mathrm{pl}_x(\lambda) \ge \alpha\}$. The computation details are given in Appendix A.6.
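The inversion $C_\alpha(x) = \{\lambda : \mathrm{pl}_x(\lambda) \ge \alpha\}$ is often carried out numerically on a grid. The sketch below shows that step only. The plausibility function used here is a simple two-sided tail function for a single Poisson count, chosen as a stand-in so that the example is self-contained; it is not the $G_{\lambda_1}$-based function of this section, and all names are hypothetical.

```python
from math import exp


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean); returns 0 for k < 0."""
    if k < 0:
        return 0.0
    term = exp(-mean)
    total = term
    for j in range(1, k + 1):
        term *= mean / j
        total += term
    return min(total, 1.0)


def toy_plausibility(lam, x, t=1.0):
    """Stand-in plausibility of rate `lam` given one count x over exposure t."""
    lower_tail = poisson_cdf(x, lam * t)            # P(X <= x)
    upper_tail = 1.0 - poisson_cdf(x - 1, lam * t)  # P(X >= x)
    return min(1.0, 2.0 * min(lower_tail, upper_tail))


def invert_plausibility(pl, grid, alpha=0.05):
    """Return the hull (min, max) of grid points with pl >= alpha, or None.

    If the retained set is not an interval, the hull is a conservative summary.
    """
    kept = [value for value in grid if pl(value) >= alpha]
    return (min(kept), max(kept)) if kept else None


if __name__ == "__main__":
    grid = [i / 1000.0 for i in range(1, 20001)]  # candidate rates in (0, 20]
    print(invert_plausibility(lambda lam: toy_plausibility(lam, x=3), grid))
```

The grid resolution bounds the accuracy of the endpoints; a root finder could replace the grid search when the plausibility function is known to be unimodal.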
The choice of the b function is not unique, and the one used here is inspired by Martin (2015). Due to the choice of T (X) = X, Theorem 3 no longer applies to this case, but the simulation result in Section 6 suggests that the interval estimator derived in this section is indeed very efficient. Also, it is worth mentioning that the validity property always holds regardless of the choice of b.
Similar to the association steps of previously studied models, we first plug the prior association into the data association, resulting in $X = F^{-1}_{m,\,p_1(U,\omega)}(U_1)$, $Y = F^{-1}_{n,\,p_2(U,\omega)}(U_2)$, and $\delta = U$.
Again due to the discreteness of $X$ and $Y$, it is unlikely that a function $H(X, Y)$ exists whose distribution is free of $\omega$, so the goal is to seek the $b$ function as in the Poisson model.
As in the Poisson case, we first find an approximation $\hat\delta$ to $\delta$, and then solve the functional equations $b(x, y, \hat\delta) = 0$ and $\partial b/\partial\delta\,|_{\delta=\hat\delta} = 0$. However, this model has two significant differences from the Poisson case: first, $\delta$ has a genuine prior $\delta \sim \pi$, and second, there is one more unknown parameter $\omega$. Our proposal here is to use the maximum a posteriori estimator for $\delta$ as the approximation, derived as follows: let $f(x, y, \delta; \omega)$ be the joint density function of $(X, Y, \delta)$ and define $\ell(\delta, \omega; x, y) = \log f(x, y, \delta; \omega)$.

Simulation Study
In this section we conduct several simulation studies to compare Partial Bayes solutions with other existing methods such as Empirical Bayes and Confidence Distribution approaches.
Specifically, given the observed data from a model and the parameter of interest, each method computes an interval estimator for the parameter. Data are simulated 10,000 times in order to calculate the empirical coverage percentage and the mean interval width for all the methods compared. The nominal coverage rate is set to 95% for all experiments. In the following part, the three popular models studied in Section 5 are considered.
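The evaluation loop described above can be sketched in a few lines of Python. This is only an illustration of how the empirical coverage percentage and mean interval width are computed: the function names `coverage_and_width` and `z_interval` are hypothetical, and the known-variance z-interval is a stand-in estimator, not one of the methods compared in this section.

```python
import random
import statistics

def z_interval(sample, sigma=1.0, z=1.959964):
    """Stand-in 95% interval: known-variance z-interval for a normal mean."""
    center = statistics.fmean(sample)
    half = z * sigma / len(sample) ** 0.5
    return center - half, center + half

def coverage_and_width(interval_fn, true_value, draw_sample, n_rep=10_000, seed=0):
    """Return (empirical coverage, mean interval width) over n_rep simulated datasets."""
    rng = random.Random(seed)
    hits, total_width = 0, 0.0
    for _ in range(n_rep):
        lo, hi = interval_fn(draw_sample(rng))
        hits += lo <= true_value <= hi   # True counts as 1
        total_width += hi - lo
    return hits / n_rep, total_width / n_rep

# Usage: 20 standard normal observations per simulated dataset (illustrative setting).
cov, width = coverage_and_width(
    z_interval,
    true_value=0.0,
    draw_sample=lambda rng: [rng.gauss(0.0, 1.0) for _ in range(20)],
)
```

Any interval estimator with the same call signature can be passed in place of `z_interval`, so the same loop serves every method in a comparison.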
The Normal Hierarchical Model
The normal hierarchical model in Section 5.1 is extremely popular in the literature. In this experiment the Partial Bayes solution is compared with the naive Empirical Bayes and other improved methods, including the full Bayes method with flat prior (Deely and Lindley, 1981), the approach used by Morris (1983) and Efron (2010), the Bootstrap method (Laird and Louis, 1987), and the Conditional Bias Correction method (Carlin and Gelfand, 1990). In this model, both hyper-parameters $\mu$ and $\tau^2$ are assumed to be unknown, with the same setting as in Laird and Louis (1987): the true $\mu$ is fixed at 0, and two values of $\tau$, 0.5 and 1, are considered. For the Partial Bayes solution, the $\gamma$ constant in (16) is fixed at 1/3. The empirical coverage percentage and the mean interval width for the different methods are summarized in Figure 2.
Figure 2: The empirical coverage percentage (the top two panels) and mean interval width (the bottom two panels) for $\mu_1$ in the normal hierarchical model with an increasing sample size $n$ and two parameter settings, among 10,000 simulation runs. Among all the methods compared, only the Partial Bayes solution guarantees the nominal coverage rate for all $n$.
Figure 2 shows that, among all the methods compared, only the Partial Bayes solution achieves the nominal coverage rate for all sample sizes. In terms of interval width, the Partial Bayes solution gives wider interval estimates than the other methods, which is the price of the coverage guarantee; however, as the sample size increases, the gaps between the methods shrink, indicating that all methods are asymptotically efficient.

The Poisson Hierarchical Model
The second simulation experiment is for the Poisson hierarchical model discussed in Section 5.2. For simplicity, we set all the $t_i$'s to 1, and fix the true value of θ at 1. Two different values of $s$, $s = 2$ and $s = 10$, and a sequence of sample sizes, $n = 10, 15, \ldots, 50$, are considered. There are fewer existing results for the Poisson model than for the normal one, and here the Partial Bayes solution is compared with the naive Empirical Bayes and full Bayes approaches, with the results illustrated in Figure 3. The pattern of the simulation results is very similar to that of the normal model. As expected, the other two solutions have narrower interval estimates than the Partial Bayes solution, but they do not preserve the nominal coverage rate. In contrast, the Partial Bayes solution has coverage percentages above 95%, and its interval width approaches that of the other two as the sample size increases. The simulation results again support both the exactness and the efficiency of the Partial Bayes solution.
The Binomial Rates-Difference Model
In the last experiment we consider the binomial model studied in Section 5.3. The prior of $\delta \equiv p_1 - p_2$ is chosen to have the same distribution as $2\beta - 1$ with $\beta \sim \mathrm{Beta}(a, b)$ for some known value of $(a, b)$. This choice of prior guarantees that the support of $\pi(\delta)$ is $[-1, 1]$. For each simulated $\delta$, the value of $\tau \equiv p_1 + p_2$ is created as $\tau = 1 + (1 - |\delta|)\omega$ with $\omega \sim \mathrm{Unif}(-1, 1)$. The corresponding true values of $p_1$ and $p_2$ used to simulate the data are then determined accordingly. Two settings of the prior distribution parameters, $(a, b) = (2, 2)$ and $(2, 5)$, and a sequence of binomial sizes, $m = n = 20, 30, \ldots, 100$, are considered. Since the typical Empirical Bayes methods do not apply to this problem, in Figure 4 we give the results of the Partial Bayes and Confidence Distribution solutions.
Figure 4: The empirical coverage percentage (the top two panels) and mean interval width (the bottom two panels) for $\delta$ in the binomial rates-difference model with an increasing binomial size $n$ and two parameter settings, among 10,000 simulation runs. Partial Bayes and Confidence Distribution solutions are compared, showing that the Partial Bayes solution guarantees the nominal coverage for all $n$.
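The data-generating step of this experiment can be sketched as follows. Since $\delta = p_1 - p_2$ and $\tau = p_1 + p_2$, the rates are recovered as $p_1 = (\tau + \delta)/2$ and $p_2 = (\tau - \delta)/2$; the function names `draw_binomial_rates` and `draw_data` are hypothetical and chosen only for this illustration.

```python
import random

def draw_binomial_rates(a, b, rng):
    """Draw delta = 2*beta - 1 with beta ~ Beta(a, b), then tau, then (p1, p2)."""
    beta = rng.betavariate(a, b)
    delta = 2.0 * beta - 1.0
    omega = rng.uniform(-1.0, 1.0)
    tau = 1.0 + (1.0 - abs(delta)) * omega   # tau = p1 + p2
    p1 = (tau + delta) / 2.0                 # solves p1 - p2 = delta, p1 + p2 = tau
    p2 = (tau - delta) / 2.0
    return delta, p1, p2

def draw_data(m, n, p1, p2, rng):
    """Simulate X ~ Bin(m, p1) and Y ~ Bin(n, p2) as sums of Bernoulli draws."""
    x = sum(rng.random() < p1 for _ in range(m))
    y = sum(rng.random() < p2 for _ in range(n))
    return x, y

# Usage with one of the settings considered in the text: (a, b) = (2, 5), m = n = 20.
rng = random.Random(1)
delta, p1, p2 = draw_binomial_rates(2, 5, rng)
x, y = draw_data(20, 20, p1, p2, rng)
```

The construction of $\tau$ keeps both rates inside $[0, 1]$, because $\tau$ always lies between $|\delta|$ and $2 - |\delta|$.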
Similar to the Empirical Bayes solutions in the previous two simulation studies, the Confidence Distribution method does not attain the desired coverage, while the Partial Bayes method provides exact inference results. This is because the Confidence Distribution method for this model relies on large-sample theory and may not work well for small samples. The Partial Bayes interval is slightly wider than that of the Confidence Distribution method, but the difference is small; as expected, the width decreases as the sample size increases, which again indicates the efficiency.

Application
In this section we apply the Partial Bayes model to a dataset of National Basketball Association (NBA) games. In basketball competitions, a three-point shot, if made, rewards the highest score in one single attempt. Therefore, as the game comes to an end, three-point shots are more valuable for a team that has very limited offensive possessions and needs to overcome the deficit in score. When the game is decided by the last possession, a three-point shot is usually beneficial or even necessary for such teams, and the choice of player that will make the attempt is crucial to the outcome of the game.
Typically, the player to be chosen should have the highest success rate of three-point shots, and historical data can be used to evaluate each player's performance. If $X_i$ is the number of three-point shots made in $n_i$ attempts by player $i$, then usually $X_i$ can be modeled by a binomial distribution $\mathrm{Bin}(n_i, p_i)$ or a Poisson distribution $\mathrm{Pois}(n_i p_i)$, where $p_i$ stands for the success rate. In this application we choose the latter for simplicity. Given this model, a classical point estimator for $p_i$ is $\hat p_i = X_i/n_i$, and a 100(1 − α)% frequentist confidence interval for $p_i$ is G X i α for each player are computed from this dataset.
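For readers who want to reproduce a classical interval of this kind, the following is a minimal sketch of one standard exact construction for a Poisson count, obtained by inverting the two one-sided tail probabilities with bisection. It is an assumption that this matches the interval used here in form; the function names are hypothetical.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Pois(mu); returns 0 for k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for j in range(1, k + 1):
        term *= mu / j
        total += term
    return min(total, 1.0)

def solve_decreasing(g, lo, hi, tol=1e-10):
    """Bisection for g(mu) = 0, where g is decreasing with g(lo) > 0 >= g(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def exact_poisson_interval(x, n, alpha=0.05):
    """Exact 100(1 - alpha)% interval for p when X ~ Pois(n * p) and X = x is observed."""
    hi = x + 1.0
    while poisson_cdf(x, hi) > alpha / 2:      # expand until the upper limit is bracketed
        hi *= 2.0
    upper = solve_decreasing(lambda mu: poisson_cdf(x, mu) - alpha / 2, 0.0, hi)
    if x == 0:
        lower = 0.0
    else:
        # P(X >= x) is increasing in mu, so alpha/2 - P(X >= x) is decreasing.
        lower = solve_decreasing(
            lambda mu: alpha / 2 - (1.0 - poisson_cdf(x - 1, mu)), 0.0, float(x)
        )
    return lower / n, upper / n
```

The limits are first found on the scale of the Poisson mean $n_i p_i$ and then divided by $n_i$, mirroring the point estimator $X_i/n_i$.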
To take the prior information into account, we first use the Empirical Bayes method to analyze this dataset, similar to the analysis in Efron and Morris (1975) for baseball games, but with a Poisson model instead of a normal one. The $p_i$'s are assumed to follow a common exponential prior $\mathrm{Exp}(\theta)$, where $\theta > 0$ stands for the mean. The MLE of $\theta$ is obtained as $\hat\theta = 0.410$ using the marginal distribution of $X_i$. As a result, the point estimator for $p_i$ is taken to be the posterior mean $(X_i + 1)/(\hat\theta^{-1} + n_i)$, and the approximate 100 . Finally, the Partial Bayes model in Section 5.2 is used to derive an interval estimator for $p_i$, and the point estimator is chosen as the value of $p_i$ that maximizes $\mathrm{pl}_x(p_i)$. The comparison of the three methods mentioned above is shown in Figure 5 for five representative players. Stephen Curry, as a third case, is almost unaffected by the shrinkage. This is because he made a large number of shot attempts, so that his personal performance dominates the overall estimate. It is worth noting that David West has a higher point estimate of success rate than Stephen Curry under the classical method, but their rankings are reversed under the Empirical Bayes and Partial Bayes methods.
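The Empirical Bayes step can be sketched as below. Under $X_i \mid p_i \sim \mathrm{Pois}(n_i p_i)$ and an exponential prior with mean $\theta$, integrating out $p_i$ gives the marginal probability $(n_i\theta)^{x}/(1 + n_i\theta)^{x+1}$, which is maximized numerically here. The data in the usage lines are made up for illustration and are not the NBA dataset; the function names are hypothetical.

```python
import math

def marginal_loglik(theta, xs, ns):
    """Log marginal likelihood of theta: sum of x*log(n*theta) - (x+1)*log(1 + n*theta)."""
    return sum(
        x * math.log(n * theta) - (x + 1) * math.log1p(n * theta)
        for x, n in zip(xs, ns)
    )

def marginal_mle(xs, ns, lo=-20.0, hi=5.0, iters=200):
    """Ternary search over log(theta); the log-likelihood is concave on that scale."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if marginal_loglik(math.exp(m1), xs, ns) < marginal_loglik(math.exp(m2), xs, ns):
            lo = m1
        else:
            hi = m2
    return math.exp((lo + hi) / 2.0)

def eb_posterior_mean(x, n, theta_hat):
    """Posterior mean (x + 1) / (1/theta_hat + n), as given in the text."""
    return (x + 1.0) / (1.0 / theta_hat + n)

# Usage with made-up counts (shots made) and attempts for three hypothetical players.
made, attempts = [3, 40, 150], [10, 100, 350]
theta_hat = marginal_mle(made, attempts)
estimates = [eb_posterior_mean(x, n, theta_hat) for x, n in zip(made, attempts)]
```

The posterior mean pulls a player with few attempts strongly toward the common prior, while a player with many attempts is left almost unchanged, which is the shrinkage behavior discussed above.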
The comparison of the three methods also highlights the advantage of the Partial Bayes method. It is known that the classical confidence interval is exact, but is wider than that of the other two methods. The Empirical Bayes solution is more efficient, but theoretically it is only approximate. The Partial Bayes solution, in contrast, combines the advantages of the other two methods, providing both exact and efficient inference results. This example hence suggests that the Partial Bayes model framework is useful for real-life data analysis tasks.

Conclusion and Discussion
This article considers the statistical inference for Partial Bayes problems, i.e., Bayesian models without fully-specified prior distributions. We have developed a general model framework for studying such problems, and have provided theoretical justification for both the exactness and the efficiency of the inference results. Compared with other existing methodologies dealing with partial prior information, such as Empirical Bayes and Confidence Distribution, our proposed method has shown superior performance.
Indeed, statisticians and scientists do care about exact inference for such useful models. For example, pioneering work in the Empirical Bayes literature, such as Morris (1983), Laird and Louis (1987), and Carlin and Gelfand (1990), has shown that Empirical Bayes estimators can underestimate the uncertainty, and these authors all emphasized the importance of providing exact inference for such problems. To some extent our discussion sheds new light on this issue and shows promising results. From this perspective, Partial Bayes models are powerful extensions of conventional Bayesian models, as they allow more flexibility in the prior specification without sacrificing the exactness of inference. As a result, they can be used to combine different types of information that are difficult to handle with other existing methods.
Of course, "There is no such thing as a free lunch." Exact and efficient inference for Partial Bayes problems is very useful yet challenging. As illustrated by the three example models, the construction of the interval estimators can sometimes be quite technical and non-trivial. Also, as with hierarchical Bayesian models, the computational cost of Partial Bayes solutions may be substantial when the model structure is complex. Despite these obstacles, we believe that the Partial Bayes model framework is useful in real data analysis, and we expect that more research in this direction will be fruitful as far as exact and efficient probabilistic inference is concerned.

A.2 Proof of Theorem 2
Similar to (20), we have cpl is one-to-one by definition, so the mapping must be monotone. Without loss of generality we assume b(t, η) is increasing in η, since otherwise we can use −b in place of b.

A.3 Proof of Theorem 3
We first show that $Z_{T_n} \overset{P}{\to} U$ and $W_{b_n} \overset{P}{\to} V_\eta$ under conditions (a) and (b). Let $P_U$ be the probability measure of $U$. Since $U$ and $W_{T_n}$ are independent, we have that for any $\varepsilon > 0$, indicates that $f_n \to 0$, and then by $|f_n| \le 1$ and the dominated convergence theorem, we have where $Z_\eta = a_\eta(U, V_\eta)$. Then by the continuous mapping theorem and condition (b) we obtain
Next we prove that $E(f(W_{b_n}) \mid Z_{T_n}) \overset{P}{\to} E(f(V_\eta))$ for any bounded continuous function $f$, where the notation $E(X \mid Y)$ stands for the conditional expectation of $X$ given $Y$. The main tool to prove this result is Theorem 2.1 of Goggin (1994). Let $Q_n$ be a probability measure under which $W_{b_n}$ and $Z_{T_n}$ are independent, i.e., where $F_{W_{b_n}}$ and $F_{Z_{T_n}}$ are the corresponding marginal c.d.f.'s. Then for any $\varepsilon > 0$, under the $Q_n$ measure, $P_{Q_n}(|l_n(W_{b_n}, Z_{T_n}) - 1| > \varepsilon) = \int I_{A_n}\, dQ_n$, where $I_{A_n}$ is the indicator function of the set $A_n = \{(w, z) : |l_n(w, z) - 1| > \varepsilon\}$. Condition (c) implies that $I_{A_n} \to 0$ pointwise, so by the dominated convergence theorem we have $\int I_{A_n}\, dQ_n \to 0$. As a result, under the $Q_n$ measure, $l_n(W_{b_n}, Z_{T_n}) \overset{P}{\to} 1$ and hence $(W_{b_n}, Z_{T_n}, l_n(W_{b_n}, Z_{T_n})) \overset{d}{\to} (V_\eta, U, 1)$. Then Theorem 2.1 of Goggin (1994) claims that $E(f(W_{b_n}) \mid Z_{T_n}) \overset{d}{\to} E(f(V_\eta) \mid U)$ for any bounded continuous function $f$. Since $U$ and $V_\eta$ are independent, we have $E(f(V_\eta) \mid U) = E(f(V_\eta))$ and hence Finally, Theorem 2.1 of Xiong and Li (2008) shows that $E(f(W_{b_n}) \mid Z_{T_n})$

A.4 Proof of (12), (13), and (14)
Let $0_k$ denote the $k \times 1$ zero vector, $I_k$ be the $k \times k$ identity matrix, and $J_k$ be a $k \times k$ matrix with all elements being one. It is easy to show that $(W_b, W_H')' = A(e', \varepsilon')'$, where
$$A = \begin{pmatrix} \frac{1}{n} & \frac{1}{n} 1_{n-1}' & \left(\frac{1}{n} - 1\right)\tau & \frac{\tau}{n} 1_{n-1}' \\ -1_{n-1} & I_{n-1} & -\tau 1_{n-1} & \tau I_{n-1} \end{pmatrix},$$
$e = (e_1, \ldots, e_n)'$, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$. Since $(e', \varepsilon')' \sim N(0_{2n}, I_{2n})$, we have $(W_b, W_H')' \sim N(0_n, \Sigma)$ with
$$\Sigma = AA' = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{pmatrix},$$
where $\Sigma_{11} = \{1 + (n-1)\tau^2\}/n$, $\Sigma_{12} = \tau^2 1_{n-1}'$, and $\Sigma_{22} = (\tau^2 + 1)(J_{n-1} + I_{n-1})$.
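The stated covariance blocks can be checked numerically. The sketch below assumes that $A$ has the block layout with first row $(1/n,\ (1/n)1_{n-1}',\ (1/n - 1)\tau,\ (\tau/n)1_{n-1}')$ and lower block $(-1_{n-1},\ I_{n-1},\ -\tau 1_{n-1},\ \tau I_{n-1})$, and that $e$ and $\varepsilon$ are independent standard normal vectors, so that the covariance is $AA'$. The helper names are hypothetical.

```python
def build_A(n, tau):
    """n x 2n matrix A acting on (e_1..e_n, eps_1..eps_n), under the assumed block layout."""
    row0 = [1.0 / n] * n + [(1.0 / n - 1.0) * tau] + [tau / n] * (n - 1)
    rows = [row0]
    for i in range(n - 1):
        e_part = [-1.0] + [1.0 if j == i else 0.0 for j in range(n - 1)]
        eps_part = [-tau] + [tau if j == i else 0.0 for j in range(n - 1)]
        rows.append(e_part + eps_part)
    return rows

def gram(A):
    """Compute A A' with plain Python lists."""
    return [[sum(a * b for a, b in zip(r, s)) for s in A] for r in A]

def sigma(n, tau):
    """Sigma assembled from the blocks Sigma_11, Sigma_12, Sigma_22 stated in the text."""
    S = [[0.0] * n for _ in range(n)]
    S[0][0] = (1.0 + (n - 1) * tau ** 2) / n
    for i in range(1, n):
        S[0][i] = S[i][0] = tau ** 2
        for j in range(1, n):
            S[i][j] = (tau ** 2 + 1.0) * (1.0 + (1.0 if i == j else 0.0))
    return S

# Usage: largest entrywise discrepancy between A A' and Sigma for one (n, tau).
n, tau = 5, 0.7
G, S = gram(build_A(n, tau)), sigma(n, tau)
max_diff = max(abs(G[i][j] - S[i][j]) for i in range(n) for j in range(n))
```

Agreement up to rounding error for several choices of $n$ and $\tau$ gives a quick sanity check on the three blocks before they are used to derive the interval estimator.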
The interval estimator then follows directly.
Finally, the auxiliary variable to predict is and (17) follows immediately.
To obtain $G_{\lambda_1}$, the c.d.f. of $W_b(\lambda_1)$, we first use the Monte Carlo method to simulate $U$ and $V$ and obtain a random sample of $W_b(\lambda_1)$; $G_{\lambda_1}$ is then approximated by $\hat G_{\lambda_1}$, the empirical c.d.f. of this sample. Finally, the interval estimator is computed using a grid search on $\mathrm{pl}_x(\lambda_1)$.
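The Monte Carlo and grid-search steps can be sketched as follows. Several pieces are assumptions made only for illustration: the map `a` uses $\lambda_i = \lambda_1 V_i / V_1$ with $V_i$ i.i.d. $\mathrm{Gamma}(s)$ and an inverse-c.d.f. Poisson draw driven by $U_i$, and the function `b` is a deliberately naive placeholder (distance between the pooled rate and $\lambda_1$), not the $b$ function derived in the paper. The same simulated $(U, V)$ are reused at every grid point.

```python
import math
import random

def poisson_quantile(u, mu):
    """Smallest k with P(Pois(mu) <= k) >= u, by sequential search on the c.d.f."""
    if mu <= 0.0:
        return 0
    log_mu = math.log(mu)
    k_max = int(mu + 50.0 * math.sqrt(mu) + 50.0)   # safety cap on the search
    k, cdf = 0, math.exp(-mu)
    while cdf < u and k < k_max:
        k += 1
        cdf += math.exp(k * log_mu - mu - math.lgamma(k + 1))
    return k

def a(lam1, U, V, t):
    """Assumed association: X_i is the Pois(lam1 * V_i / V_1 * t_i) quantile at U_i."""
    return [poisson_quantile(u, lam1 * v / V[0] * ti) for u, v, ti in zip(U, V, t)]

def b(x, lam1, t):
    """Placeholder b function (hypothetical): |pooled rate - lam1|."""
    pooled = sum(xi / ti for xi, ti in zip(x, t)) / len(x)
    return abs(pooled - lam1)

def plausibility(x, lam1, t, draws):
    """pl_x(lam1) = 1 - G_hat(b(x, lam1)), with G_hat the empirical c.d.f. of W_b(lam1)."""
    obs = b(x, lam1, t)
    w = [b(a(lam1, U, V, t), lam1, t) for U, V in draws]
    return 1.0 - sum(wi <= obs for wi in w) / len(w)

def interval(x, t, s, alpha=0.05, grid=None, n_draws=500, seed=0):
    """Grid search for {lam : pl_x(lam) >= alpha}; returns (min, max) of the kept points."""
    rng = random.Random(seed)
    n = len(x)
    draws = [
        ([rng.random() for _ in range(n)], [rng.gammavariate(s, 1.0) for _ in range(n)])
        for _ in range(n_draws)
    ]
    grid = grid or [0.1 * k for k in range(1, 41)]
    kept = [lam for lam in grid if plausibility(x, lam, t, draws) >= alpha]
    return (min(kept), max(kept)) if kept else None
```

For example, `interval([2, 1, 3, 0, 2], [1.0] * 5, s=2)` returns the end points of the retained grid values for made-up counts. A finer grid or more Monte Carlo draws sharpens the end points at a proportional computational cost.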
The remaining part of the computation proceeds similarly to the Poisson model, by simulating (U 1 , U 2 , U ) and computing the distribution of W b (ω), and hence the details are omitted.