Bayesian inference in partially identified models: Is the shape of the posterior distribution useful?

Partially identified models are characterized by the distribution of observables being compatible with a set of values for the target parameter, rather than a single value. This set is often referred to as an identification region. From a non-Bayesian point of view, the identification region is the object revealed to the investigator in the limit of increasing sample size. Conversely, a Bayesian analysis provides the identification region plus the limiting posterior distribution over this region. This purports to convey varying plausibility of values across the region. Taking a decision-theoretic view, we investigate the extent to which having a distribution across the identification region is indeed helpful.


Partial identification
Limitations in terms of what variables can be observed, and how well they can be measured, can result in a statistical model which is nonidentified. That is, multiple values of the parameters can give rise to the same distribution of observables. Say the statistical model at hand is parameterized by θ ∈ Θ, where θ has p components. If the likelihood function depends on θ only through φ = s(θ) having q < p components, for some non-injective function s(), then the model is nonidentified. In this paper we consider situations where the model for the data given φ obeys standard asymptotic regularity conditions, so that √n-consistent estimation of φ is possible. We also presume that interest focusses on a scalar inferential target, denoted ψ = g(θ).
In what follows, we generically use square brackets to denote taking the image of a set under a function, as opposed to simply evaluating the function at a point in its domain. The identification region for the target parameter is defined as I(φ) = g[{θ ∈ Θ : s(θ) = φ}]. Intuitively, say the true parameter values giving rise to the data are θ = θ_0, with φ_0 = s(θ_0). In the large-sample limit the data reveal the value of φ_0. Thus the corresponding identification region I(φ_0) is all values of the target that remain compatible with the data in this limit.
For simplicity of exposition, and without very much loss of generality, we restrict our interest to models under which the identification region is in fact guaranteed to be an interval of finite length, i.e., I(φ) is an interval for all φ ∈ s[Θ]. More fundamentally, we consider only the sub-class of nonidentified models and choices of target parameter for which the target is partially identified in the following sense. By construction, for every θ ∈ Θ, g[{θ}] ⊆ I(s(θ)) ⊆ g[Θ]. We say the target is partially identified if both inclusions are strict, i.e., g[{θ}] ⊊ I(s(θ)) ⊊ g[Θ], for at least one θ ∈ Θ. Note that this corresponds very literally to a sense of partial information. For a sequence of data arising under such a θ, at least one a priori plausible value of the target is ruled out as the data accumulate, while at least one incorrect value of the target remains plausible.

Example: Imperfect compliance in a randomized trial
As a more involved example of a partially identified model, we consider a version of the imperfect compliance model with binary variables considered by various authors, including Chickering and Pearl [3], Imbens and Rubin [16], Pearl [24, Ch. 8], and Richardson et al. [26]. Clinical trial participants are randomly sampled from a population comprised of never-takers, always-takers, and compliers, in unknown proportions ω_NT, ω_AT, and ω_CO = 1 − ω_NT − ω_AT respectively. Each subject is randomly assigned to either control or treatment. As the labels suggest, never-takers will not take treatment regardless of their assignment, always-takers will take treatment regardless of their assignment, and compliers will follow their assignment. We exclude the possibility of defiers in the population, though the general version of the problem allows for them.
Assume that a participant's binary response is Y(0) if treatment is not taken, and Y(1) if treatment is taken, regardless of treatment assignment. Then a participant's observable response is Y = (1 − X)Y(0) + XY(1), where X indicates receipt of treatment. Against this, let Z indicate randomization to treatment, with the possibility that X ≠ Z. For compliance type indicated by C ∈ {NT, AT, CO}, let γ_{C,i} be the mean of Y(i) amongst the sub-population of that type. We consider inference about the population average causal effect (ACE), given as
$$\mathrm{ACE} = E\{Y(1)\} - E\{Y(0)\} = \sum_{C \in \{NT, AT, CO\}} \omega_C \,(\gamma_{C,1} - \gamma_{C,0}).$$
It is easy to verify that the present set-up gives a nonidentified model, with p = 8, q = 6, θ = (ω_NT, ω_AT, γ), and φ = (ω_NT, ω_AT, γ_{NT,0}, γ_{AT,1}, γ_{CO,0}, γ_{CO,1}). Particularly, the form of the invertible map from φ to the (Y, X|Z) cell probabilities is readily established (see Appendix A for details). Unsurprisingly, the parameters absent from φ, namely γ_{NT,1} and γ_{AT,0}, are the intuitively unestimable quantities: the mean outcomes for never-takers who take treatment and for always-takers who don't take treatment.
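It is also straightforward to verify that this model is partially identified when the ACE is the inferential target. Defining
$$a(\phi) = \omega_{CO}(\gamma_{CO,1} - \gamma_{CO,0}) + \omega_{NT}(1/2 - \gamma_{NT,0}) + \omega_{AT}(\gamma_{AT,1} - 1/2), \qquad b(\phi) = (\omega_{NT} + \omega_{AT})/2,$$
the identification interval for the ACE is a(φ) ± b(φ), as obtained by letting the unestimable means γ_{NT,1} and γ_{AT,0} range over [0, 1]. Thus, unless the population happens to contain only compliers, uncertainty about the ACE will remain no matter how much data accumulates. We will investigate the limiting behavior of Bayesian inference for the ACE in Section 3.1.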
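The bounds on the ACE can be checked numerically. The following is a minimal sketch, not the paper's own code: the function names `ace` and `ace_interval` are hypothetical, and the interval is obtained simply by setting the two unestimable means γ_{NT,1} and γ_{AT,0} to their extreme values 0 and 1.

```python
import numpy as np

def ace(omega_nt, omega_at, gamma):
    """Average causal effect: sum over types of weight * (mean Y(1) - mean Y(0))."""
    omega = {"NT": omega_nt, "AT": omega_at, "CO": 1.0 - omega_nt - omega_at}
    return sum(omega[c] * (gamma[c][1] - gamma[c][0]) for c in omega)

def ace_interval(omega_nt, omega_at, gamma):
    """Range of the ACE as gamma[NT][1] and gamma[AT][0] vary over [0, 1]."""
    omega_co = 1.0 - omega_nt - omega_at
    # Part of the ACE that depends only on estimable quantities.
    known = (omega_co * (gamma["CO"][1] - gamma["CO"][0])
             - omega_nt * gamma["NT"][0] + omega_at * gamma["AT"][1])
    lower = known - omega_at   # gamma[NT][1] = 0, gamma[AT][0] = 1
    upper = known + omega_nt   # gamma[NT][1] = 1, gamma[AT][0] = 0
    return lower, upper

rng = np.random.default_rng(0)
omega_co, omega_nt, omega_at = rng.dirichlet([1, 1, 1])
# gamma[c][i] is the mean of Y(i) for compliance type c (hypothetical values).
gamma = {c: rng.uniform(size=2) for c in ("NT", "AT", "CO")}
lower, upper = ace_interval(omega_nt, omega_at, gamma)
psi = ace(omega_nt, omega_at, gamma)
print(lower, psi, upper)
```

The interval has length ω_NT + ω_AT, so it collapses to a point only when the population contains compliers alone.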

Example: Inferring gene-environment interaction
As another example of a partially identified model, consider binary disease status Y, binary environmental exposure X, and binary genotype G. As a variant of a problem studied by Gustafson [10] and Gustafson and Burstyn [13], interest lies in the (Y|X, G) relationship when only (Y, G) data are available, but certain assumptions can be invoked. The first of these is the gene-environment independence assumption, that X and G are independent in the source population. Second, the disease risk amongst the unexposed is assumed to not vary by genotype, i.e., any impact of genotype is only via modification of the exposure effect, a so-called gene-environment interaction. Third, while (Y, X, G) data are not available, information about the X prevalence in the population is presumed to be available. So the problem can be viewed as one of "ecological inference," as we wish to infer a property of the joint (Y, G, X) distribution from information about the (Y, G) and X marginals. As one example of an inferential target, say the task is to estimate ψ = Pr(Y = 1|X = 1, G = 1) − Pr(Y = 1|X = 0, G = 1), the risk difference associated with exposure amongst those with genotype G = 1.
To gain a foothold in this problem, let θ = (µ_0, µ_10, µ_11) parameterize the distribution of (Y|X, G), according to µ_0 = Pr(Y = 1|X = 0) = Pr(Y = 1|X = 0, G = g), for g = 0, 1, and µ_{1g} = Pr(Y = 1|X = 1, G = g), for g = 0, 1. We take r = Pr(X = 1) as a fixed constant, and define φ_g = Pr(Y = 1|G = g) = (1 − r)µ_0 + rµ_{1g}, for g = 0, 1. Thus the likelihood arising from the (Y|G) data depends on θ only through φ = (φ_0, φ_1). Hence we have a nonidentified model with p = 3 and q = 2. (Note that here we have left the marginal distribution of G unmodeled, but it makes no material difference if we include Pr(G = 1) as a further parameter and then have a nonidentified model with p = 4 and q = 3.) In Section 3.2 we will determine the identification interval I(φ) for the target parameter ψ = µ_11 − µ_0, and we will consider the limiting behavior of Bayesian inference in this setting.
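To make the nonidentification concrete, the sketch below exhibits two parameter vectors that give the same φ, and hence the same distribution of observables, yet different values of the target ψ. The value r = 0.5 and the particular θ values are hypothetical, chosen only for illustration.

```python
import numpy as np

def phi_from_theta(mu0, mu10, mu11, r):
    """phi_g = Pr(Y=1 | G=g) = (1 - r) * mu0 + r * mu_{1g}, for g = 0, 1."""
    return np.array([(1 - r) * mu0 + r * mu10, (1 - r) * mu0 + r * mu11])

r = 0.5  # hypothetical exposure prevalence
theta_a = (0.20, 0.40, 0.60)
# Shift mu0 by delta and compensate mu10 and mu11 so that phi is unchanged.
delta = 0.10
shift = (1 - r) * delta / r
theta_b = (0.20 + delta, 0.40 - shift, 0.60 - shift)

phi_a = phi_from_theta(*theta_a, r)
phi_b = phi_from_theta(*theta_b, r)
psi_a = theta_a[2] - theta_a[0]
psi_b = theta_b[2] - theta_b[0]
print(phi_a, phi_b)   # same distribution of observables
print(psi_a, psi_b)   # different values of the target
```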

Inferential approaches to partially identified models
There is a considerable literature on non-Bayesian approaches to partially identified models. See, for instance, Manski [21], Imbens and Manski [15], Romano and Shaikh [27], Vansteelandt et al. [29], Zhang [30], Tamer [28]. Typically the endeavor is split into two tasks. For a given problem, first one determines the form of the identification interval. Then the interval endpoints are viewed as the parameters of interest. Inference is considered as a separate exercise, comprised of estimating the endpoints and/or reporting a confidence set for the identification interval. As a side note, there is an interesting distinction between confidence sets designed to have nominal or better coverage for the true value of the target versus those designed to have nominal or better coverage of the whole identification interval. More importantly for present purposes, these approaches do not naturally lend themselves to a sense of some target values being more plausible than others in light of the data. Conceptually, if the investigator were handed an infinite number of datapoints, and hence perfect knowledge of the distribution of observables and the value of φ, then the identification interval I(φ) would simply be reported as "the answer."
Bayesian inferences in partially identified settings, and nonidentified models in general, have received considerable attention recently. In part this is due to needs arising in observational epidemiology. Study and data limitations which preclude identification are commonplace in this field. Works promulgating Bayesian or Bayes-like inference in such settings include Joseph et al. [17], Dendukuri and Joseph [5], Greenland [7], Hanson et al. [14], Greenland [8], MacLehose et al. [20]. One theme in the broader literature is that identification and inference are very integrated under a Bayesian analysis (see, for instance, Barankin [1], Kadane [18], Dawid [4], Neath and Samaniego [23], Poirier [25], Gustafson [9]).
Based on a sample of size n, the investigator carries out prior-to-posterior updating, yielding a marginal posterior distribution on the target parameter. As n increases, this distribution converges to a non-degenerate distribution with support equal to the identification interval. Given an infinite number of datapoints then, "the answer" is this limiting posterior distribution, which constitutes a relative weighting of points in the identification interval.
Thus there is a fundamental discrepancy between non-Bayesian and Bayesian inference in partially identified models. This discrepancy is more extreme than for identified models, where the identification interval is typically a single point, and therefore does not admit a weighting of its elements. As an example of this, Gustafson [12] considers the large-sample limit of frequentist coverage for Bayesian (1 − α) credible intervals, in the partially identified context. He shows this limit is one over a large subset of the parameter space, and zero over its complement, where large means having prior probability 1 − α. More generally, both Liao and Jiang [19] and Gustafson [10] suggest that obtaining a posterior distribution across the identification interval is a strength of the Bayesian approach. In contrast Moon and Schorfheide [22], who draw some large-sample comparisons between Bayes and non-Bayes procedures, are much more guarded about the prospect of reporting a posterior distribution across an identification interval as opposed to simply estimating the interval. None of these authors, however, attempt any sort of quantification of the potential utility of the shape of the posterior distribution over the identification interval.
It might be tempting to intuit that the force of the data is completely used up in determining the identification interval, so that the shape of the limiting posterior distribution across the interval is driven exclusively by the choice of prior distribution. Indeed, this is the case in some problems. In the simple example of Section 1.1, for instance, the relationship between φ and I(φ) is clearly bijective. Consequently, with a fixed prior distribution over Θ, knowledge of the identification interval for the target completely determines the limiting posterior distribution over the interval.
We can quickly establish, however, that other problems, such as the examples in Sections 1.2 and 1.3, exhibit more complex behavior. There can be distinct points φ_1 and φ_2 in s[Θ] such that I(φ_1) = I(φ_2), but, starting with the same prior distribution over Θ, the limiting posterior distribution arising for true values of θ such that s(θ) = φ_1 differs from that arising if s(θ) = φ_2. This directly corresponds to the data having a say in the shape of the limiting posterior distribution of the target, as well as having a say in the support of this distribution. In turn this gives a sense in which there can be more to take away from an (infinite-sized) dataset than just the identification interval, whereas non-Bayesian approaches seem to suppose the opposite. Thus the situation is nuanced, and warrants investigation.
In the remainder of the paper we investigate the inferential utility of the shape of the posterior distribution, by taking a decision-theoretic view. In the large-sample limit, we focus on the typical height of the marginal posterior density for the target parameter, at the true value of the target. This can be compared to the typical height of other densities over the identification interval. More technically, we consider the expected score for a probabilistic forecast of the target parameter, under a logarithmic scoring rule. The difference in expected score between the Bayesian forecast and an ad-hoc choice of distribution over the identification interval can be decomposed into two terms. The first term speaks to the value of Bayesian processing of the information in the data about the identification interval. The second term reflects the additional information that can be recovered from using all the data. This decomposition is worked out for both the trial compliance example and the gene-environment interaction example. It is hoped that this will be found relevant by researchers from both Bayesian and non-Bayesian backgrounds. For the Bayesian, it seems important to recognize that the usual decision-theoretic optimality of Bayesian procedures has a different "look-and-feel" in the partially identified case, particularly as the relevant posterior distributions are not converging to point masses. For the non-Bayesian, it seems important to recognize that there may be a sense in which the data speak beyond just estimating the identification interval.

Methodology
Starting with the parameter vector θ, say the investigator is willing to specify a prior distribution having a smooth density function π(θ) over Θ. With respect to an appropriate measure we write the density of data given parameters as π(d|θ), also assumed to be a smooth function of θ. Thus we can unambiguously refer to the joint density π(d, θ) = π(d|θ)π(θ) induced by the prior and model. In what follows, when useful we will write d_n to emphasize observable data comprised of n observations which are independent and identically distributed given θ. Also, we occasionally use upper-case notation for functions of data and parameters when it is helpful to stress random variable interpretations, e.g., inside expectations.
We frame our discussion in terms of how well we can generate a probabilistic forecast for the target parameter ψ = g(θ). For a finite sample of size n, say a family of density functions h(·; ·) is used, such that h(·; d_n) is the probabilistic forecast of the value of ψ, based on observing data D_n = d_n. We summarize the utility of the forecasting procedure by the expected score (ES) under a logarithmic scoring rule,
$$ES^{(n)}_{\pi,h} = E_{\pi}\{\log h(\Psi; D_n)\}. \tag{3}$$
Note here that by taking the expectation with respect to π, we are evaluating what would happen on average across repeated instantiations of both parameter and data values, with this ensemble of parameter values distributed according to the prior distribution. Note also that (3) is following the well-studied path of preferring forecasts with the highest expected score, which is essentially the same idea as preferring inferential schemes with the highest expected utility. But we can also think more prosaically simply in terms of exp(ES^{(n)}_{π,h}) being the typical height of the forecast density at the true value, with typical being in the sense of a geometric mean. For a general discussion of scoring rules for density forecasts, see Gneiting and Raftery [6]. Also, Bernardo [2] gives a sense in which all members of a class of scoring rules with desirable properties are equivalent to a logarithmic scoring rule.
The usual decision-theoretic optimality of Bayesian procedures applies here, despite the fact that this optimality is more commonly seen expressed for estimators than for probabilistic forecasts. The choice of family of densities h which maximizes (3) is the marginal posterior density of the target, h(ψ; d n ) = π(ψ|d n ). This can be seen as an immediate consequence of the non-negativity of Kullback-Leibler divergence.
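The optimality claim can be illustrated in a discrete analogue, where probability mass functions play the role of densities. The sketch below uses a hypothetical joint pmf over (data, target) and compares the expected log score of the posterior with that of arbitrary competing forecasts; the inequality it checks is the non-negativity of Kullback-Leibler divergence.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical discrete joint pmf over (data d, target psi): rows index d, columns psi.
joint = rng.dirichlet(np.ones(12)).reshape(3, 4)
posterior = joint / joint.sum(axis=1, keepdims=True)   # pi(psi | d)

def expected_log_score(forecast):
    """E{log h(Psi; D)} under the joint pmf, where forecast[d, psi] = h(psi; d)."""
    return float(np.sum(joint * np.log(forecast)))

es_posterior = expected_log_score(posterior)

# Any other family of forecasts scores no better than the posterior.
es_others = []
for _ in range(200):
    other = rng.dirichlet(np.ones(4), size=3)   # an arbitrary pmf over psi for each d
    es_others.append(expected_log_score(other))

print(es_posterior, max(es_others))
```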
Because we are studying partially identified problems in which the posterior distribution of the target converges to a non-degenerate distribution as the sample size grows, the limiting version of (3) is immediate. Observation of an infinite amount of data corresponds to knowledge of φ, so in the limit we are concerned with a family of density functions of the form h(·; φ), and the corresponding expected score:
$$ES^{\infty}_{\pi,h} = E_{\pi}\{\log h(\Psi; \Phi)\}. \tag{4}$$
Bear in mind here that ψ and φ are both functions of θ, and the expectation is with respect to the prior density π(θ). Again the non-negativity of Kullback-Leibler divergence immediately implies that (4) is maximized by h(ψ; φ) = π(ψ|φ). That is, the optimal probabilistic forecast is the conditional prior distribution of ψ given the value of φ that is gleaned from the data. Equivalently, this is the limit of the marginal posterior distribution of ψ, as the sample size goes to infinity. Thus the optimality of the marginal posterior distribution on the target extends smoothly in the limit. Note also that the lack of full identification typically implies that π(ψ|φ) is not degenerate, hence the maximized value of (4) will be finite.
Continuing to work in the large-sample limit, we can use (4) as a starting point for understanding the "information flow" in the partially identified model. Henceforth it is useful to write the identification interval explicitly as
$$I(\phi) = [\phi^{*}_{L}, \phi^{*}_{U}], \qquad \phi^{*} = (\phi^{*}_{L}(\phi), \phi^{*}_{U}(\phi)),$$
so that φ* collects the lower and upper endpoints of the interval as functions of φ. Much of the non-Bayesian literature on partial identification treats the identification interval as the bivariate target of inference, with the consequent notion that knowledge of this interval is either all that should be gleaned, or all that can be gleaned, upon observation of an infinite-sized dataset. Thus it might be viewed that knowledge of φ* is just as good as knowledge of φ, even if the map from φ to φ* is not invertible. Bearing this in mind, we term a family of densities indexed by φ* to be an ad-hoc probabilistic forecast for the target, in the limiting case.
To fix ideas, one example of an ad-hoc scheme would be
$$h(\psi; \phi^{*}) = \frac{1\{\psi \in I(\phi)\}}{|I(\phi)|},$$
where |I(φ)| denotes the length of the interval, corresponding to a uniform distribution over the identification interval. This would arise as the large-sample limit of forecasting a uniform distribution between estimates of the identification interval endpoints φ*. Or, given that performance is measured on average with respect to prior π, an ad-hoc attempt to "do better where it counts" would involve truncating the prior distribution to the identification interval, i.e.,
$$h(\psi; \phi^{*}) = \frac{\pi(\psi)\, 1\{\psi \in I(\phi)\}}{\int_{I(\phi)} \pi(u)\, du},$$
where π(ψ) is the marginal prior density of the target.
Again, this could be thought of as the large-sample limit arising from truncating the prior distribution according to estimated values for φ*. For a given ad-hoc procedure h(ψ; φ*), let ES^∞_{π,AH} = E_π{log h(Ψ; Φ*)} be the expected score with respect to prior π. In contrast, let ES^∞_{π,B} be the optimal expected score arising from h(ψ; φ) = π(ψ|φ), with the subscript B reminding us that this is the limit of the Bayesian procedure. We want to decompose ES^∞_{π,B} − ES^∞_{π,AH} ≥ 0 in a way that sheds light on the utility of the shape of the limiting posterior distribution over the identification interval.
In investigating ad-hoc schemes, we are considering taking only information about φ* from the data, which may "leave behind" some information about φ. We can easily elucidate that the best possible ad-hoc scheme is h(ψ; φ*) = π(ψ|φ*), i.e., the same argument used above immediately reveals that the prior conditional distribution of the target given φ* maximizes E_π{log h(Ψ; Φ*)}. We refer to this as the coarsened Bayes (CB) procedure, and denote its expected score as ES^∞_{π,CB}. Note that in some problems it may literally be possible to regard the CB procedure as arising in the limit when Bayesian inference is applied only to a coarsened version of the data. That is, there might exist a function t such that coarsened data D*_n = t(D_n) have the following properties: (i), the distribution of D*_n depends on φ only through φ*, and (ii), the distribution of D*_n given φ* supports √n-consistent estimation of φ*. By construction then, using only data D*_n suffices to estimate the identification interval for the target, and the posterior distribution of (Ψ|D*_n) must converge to the conditional prior distribution given by π(ψ|φ*). However, regardless of whether we can actually exhibit such a function t(), we can interpret π(ψ|φ*) as the limiting Bayesian knowledge about the target were we to extract just enough of the data to estimate the identification interval, and not an iota more.
Now we are in a position to try to understand the worth of the shape of the limiting posterior distribution of the target across the identification interval. For a given ad-hoc procedure, we immediately have
$$ES^{\infty}_{\pi,B} - ES^{\infty}_{\pi,AH} = \left(ES^{\infty}_{\pi,CB} - ES^{\infty}_{\pi,AH}\right) + \left(ES^{\infty}_{\pi,B} - ES^{\infty}_{\pi,CB}\right), \tag{5}$$
where both terms on the right-hand side of (5) are guaranteed to be nonnegative. This follows trivially from the optimality of the CB procedure amongst AH procedures, and the global optimality of the Bayesian procedure.
So the first term on the right in (5) reflects the value of Bayesian processing of information about the identification interval, relative to ad-hoc processing of this information. The second term represents the value of using all the information in the data, not just the information about the identification interval. Put another way, the second term reflects information "left on the table" by supposing that the data can only speak to the location of the identification interval. The second term is of particular interest, since non-Bayesian approaches to partially identified models are predicated on the idea that knowledge of the identification interval is indeed all that can be obtained in the limit of infinite sample size.
Yet another interpretation is that the second term in (5) reflects the utility of the fact that multiple θ values in Θ can lead to the same identification interval but different limiting posterior distributions over this region. In the special case that the map from φ to φ* is invertible, there is only one limiting posterior distribution corresponding to a given identification interval, and the second term in (5) is zero. However, when there is indeed coarsening (i.e., the map from φ to φ* is not invertible), there is no general reason to expect the second term to be zero.
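A small discrete analogue may help fix the roles of the two terms. In the sketch below, the joint prior pmf over (φ, ψ) and the coarsening map are hypothetical: two values of φ share the same support for ψ (the same "identification region") but imply different conditional distributions, so both terms of the decomposition come out strictly positive.

```python
import numpy as np

# Hypothetical joint prior pmf: rows index phi in {0, 1, 2}, columns psi in {0, 1, 2, 3}.
joint = np.array([
    [0.10, 0.20, 0.00, 0.00],   # phi = 0: support {0, 1}
    [0.25, 0.05, 0.00, 0.00],   # phi = 1: support {0, 1}, different shape
    [0.00, 0.00, 0.15, 0.25],   # phi = 2: support {2, 3}
])
coarsen = np.array([0, 0, 1])   # phi* = f(phi); phi = 0 and phi = 1 share a support

def conditional(rows):
    """Conditional pmf of psi given that phi lies in the given rows."""
    mass = joint[rows].sum(axis=0)
    return mass / mass.sum()

def expected_score(forecast_for_phi):
    """E{log h(Psi)}, with forecast_for_phi[phi] the pmf used when phi is the truth."""
    total = 0.0
    for phi in range(joint.shape[0]):
        h = forecast_for_phi[phi]
        keep = joint[phi] > 0
        total += np.sum(joint[phi][keep] * np.log(h[keep]))
    return float(total)

bayes = [conditional([phi]) for phi in range(3)]                       # pi(psi | phi)
coarse = [conditional(np.flatnonzero(coarsen == coarsen[phi]))         # pi(psi | phi*)
          for phi in range(3)]
uniform = [(c > 0) / np.count_nonzero(c) for c in coarse]              # ad-hoc uniform

es_b, es_cb, es_ah = map(expected_score, (bayes, coarse, uniform))
first_order = es_cb - es_ah    # Bayesian versus ad-hoc use of phi*
second_order = es_b - es_cb    # information in phi beyond phi*
print(first_order, second_order)
```

If the rows for φ = 0 and φ = 1 were proportional, the coarsened and full conditionals would coincide and the second term would vanish.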
In the next section, we simply compute the decomposition (5) in the examples from Sections 1.2 and 1.3. In doing so, we will refer to the first term as describing the first-order Bayes advantage. In particular, exp(ES^∞_{π,CB} − ES^∞_{π,AH}) is interpreted as the typical density ratio comparing the coarsened posterior density to the ad-hoc density, at the true value of the target parameter. Thus we can think of the first-order advantage as follows. Presuming that we choose to measure performance by averaging across the parameter space with respect to π(θ), and presuming that we are only allowed to take information about I(φ) from the data, then we are quantifying the gain from doing a Bayesian analysis with π(θ) as the prior. Similarly, we can view exp(ES^∞_{π,B} − ES^∞_{π,CB}) as quantifying the second-order Bayes advantage, again on the density ratio scale. This gives us the further gain achieved if we allow ourselves to hear all that the data have to say, rather than just taking the information about the identification interval.
We consider Bayesian inference under a uniform prior distribution; particularly, a prior under which ω ∼ Dirichlet(1, 1, 1) and independently each of the six components of γ follow a Unif(0, 1) distribution. Under this prior, Gustafson [11] shows that the limiting posterior distribution of the target, π(ψ|φ), has a trapezoidal-shaped density. In particular, whereas the bottom edge of the trapezoid is the identification interval a(φ) ± b(φ), the top edge of the trapezoid is a(φ) ± c(φ), where c(φ) = |ω_NT − ω_AT|/2. In this problem it is readily apparent that multiple values of φ ∈ s[Θ] can produce the same values of a() and b() but different values of c(). Or, put another way, the mapping from φ to φ* is not invertible. An explicit illustration of this appears in Figure 1. While for this problem the form of π(ψ|φ) is very simple to characterize, determination of the coarsened limiting posterior distribution, π(ψ|φ*), is rather more involved, as described in Appendix A. The coarsened distribution is also depicted in Figure 1.
The various limiting expected scores for this problem are reported in Table 1. These are computed in a direct Monte Carlo fashion. That is, we generate m independent and identically distributed realizations θ^{(1)}, . . . , θ^{(m)} according to the prior π(θ). Then, for each realization we determine the information available from an infinite-sized dataset, φ^{(i)} = s(θ^{(i)}), and the target value ψ^{(i)} = g(θ^{(i)}). For any probabilistic forecast then, the expected score is numerically approximated by the Monte Carlo average m^{-1} Σ_{i=1}^{m} log h(ψ^{(i)}; φ^{(i)}). Moreover, the numerical error involved is easily quantified by the standard error associated with this average, and similarly the error involved in computing the difference in two expected scores is described by the standard error arising from averaging m differences.
Table 1. Expected scores in the imperfect compliance example. These are computed as Monte Carlo averages across 10,000 realized values of θ drawn from the prior distribution. Simulation standard errors are given in parentheses. The labels AH(TP) and AH(U) refer to the ad-hoc truncated prior and ad-hoc uniform procedures respectively.
From Table 1 we see that the first-order Bayes advantage is appreciable in this problem. Using only information about the identification interval, the coarsened posterior density over this interval is typically considerably higher at the true value than an ad-hoc density (with a typical density ratio of exp(0.14) ≈ 1.15 compared to the truncated prior density and 1.09 compared to a uniform density). Finally, we see there is a modest second-order Bayes advantage. By using all the information rather than just the information about I(φ), the fully Bayesian posterior garners a further 2.4% improvement over the coarsened posterior, on the density ratio scale. Moreover, this improvement is calculated with simulation-significance, i.e., we have computed with sufficient accuracy to be convinced that ES^∞_{π,B} > ES^∞_{π,CB}. Generally then we see the shape of the posterior distribution is helpful in this problem. The magnitude of the first-order effect corresponds to a practical advantage in an applied statistics sense. The second-order effect is much more modest in magnitude, but the important point here is that the data can make a helpful contribution beyond the direct information they convey about the identification interval.
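The Monte Carlo recipe can be sketched for the comparison between the full limiting posterior and the ad-hoc uniform forecast. This is an illustrative sketch rather than the code behind Table 1: the function names are hypothetical, and the trapezoid is taken to have centre a(φ) = ω_CO(γ_{CO,1} − γ_{CO,0}) + ω_NT(1/2 − γ_{NT,0}) + ω_AT(γ_{AT,1} − 1/2), bottom half-width b(φ) = (ω_NT + ω_AT)/2 and top half-width c(φ) = |ω_NT − ω_AT|/2, which is what independent Unif(0, 1) priors on the two unestimable means imply. The coarsened Bayes forecast is not attempted here.

```python
import numpy as np

def simulate(m, rng, alpha=(1.0, 1.0, 1.0)):
    """Draw theta from the prior; return the target psi and the trapezoid parameters."""
    omega_co, omega_nt, omega_at = rng.dirichlet(alpha, size=m).T
    g_nt0, g_nt1, g_at0, g_at1, g_co0, g_co1 = rng.uniform(size=(6, m))
    psi = (omega_co * (g_co1 - g_co0) + omega_nt * (g_nt1 - g_nt0)
           + omega_at * (g_at1 - g_at0))
    a = (omega_co * (g_co1 - g_co0) + omega_nt * (0.5 - g_nt0)
         + omega_at * (g_at1 - 0.5))
    b = (omega_nt + omega_at) / 2
    c = np.abs(omega_nt - omega_at) / 2
    return psi, a, b, c

def log_density_trapezoid(psi, a, b, c):
    """Log density of the trapezoid with bottom edge a +/- b and top edge a +/- c."""
    dist = np.abs(psi - a)
    height = 1.0 / (b + c)                            # makes the area equal to one
    ramp = np.clip((b - dist) / (b - c), 0.0, 1.0)    # equals 1 on the top edge
    return np.log(height * ramp)

rng = np.random.default_rng(2)
psi, a, b, c = simulate(10_000, rng)
# Per-draw score difference: full limiting posterior minus uniform on a +/- b.
diff = log_density_trapezoid(psi, a, b, c) - np.log(1.0 / (2 * b))
estimate = diff.mean()
std_error = diff.std(ddof=1) / np.sqrt(diff.size)
print(estimate, std_error)
```

The printed pair is the Monte Carlo estimate of ES^∞_{π,B} − ES^∞_{π,AH(U)} and its simulation standard error, computed from the m per-draw differences as described above.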
In this problem it is easy to see that (φ, ψ) = (φ_0, φ_1, ψ) constitutes a linear reparameterization of θ = (µ_0, µ_10, µ_11), with the inverse of the mapping given as
$$\mu_0 = \phi_1 - r\psi, \qquad \mu_{10} = \frac{\phi_0 - (1-r)\phi_1}{r} + (1-r)\psi, \qquad \mu_{11} = \phi_1 + (1-r)\psi. \tag{6}$$
Thus for given φ the identification interval is all values of ψ under which (6) yields a value in [0, 1]^3, i.e., the identification interval endpoints are:
$$\max\left\{ \frac{\phi_1 - 1}{r},\; -\frac{\phi_1}{1-r},\; -\frac{\phi_0 - (1-r)\phi_1}{r(1-r)} \right\} \tag{7}$$
and
$$\min\left\{ \frac{\phi_1}{r},\; \frac{1-\phi_1}{1-r},\; \frac{r - \phi_0 + (1-r)\phi_1}{r(1-r)} \right\}. \tag{8}$$
As we will demonstrate explicitly below, the mapping from φ to φ* is not invertible, which leaves open the possibility that ES^∞_{π,B} > ES^∞_{π,CB}.
We consider Bayesian inference using the prior distribution having (µ_0, µ_10, µ_11) independent and identically distributed as Beta(k_1, k_2). In what follows, b_{k_1,k_2}() is used to denote the density of this Beta distribution. The linear map between θ and (φ, ψ) immediately gives the joint prior density of (φ, ψ) as
$$\pi(\phi, \psi) = \frac{1}{r}\, b_{k_1,k_2}(\phi_1 - r\psi)\; b_{k_1,k_2}\!\left(\frac{\phi_0 - (1-r)\phi_1}{r} + (1-r)\psi\right) b_{k_1,k_2}\big(\phi_1 + (1-r)\psi\big), \tag{9}$$
with support restricted to (φ, ψ) such that ψ ∈ I(φ). The limiting posterior distribution for the target is given by the conditional prior π(ψ|φ), with conditioning on the true value of φ. At least up to a normalizing constant, this conditional density can be read off from the joint density (9), by regarding this expression as a function of ψ for fixed φ.
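The interval endpoints follow mechanically from requiring each implied probability to lie in [0, 1]. The sketch below, with hypothetical function names and hypothetical values of r and θ, inverts the linear map from (µ_0, µ_10, µ_11) to (φ_0, φ_1, ψ) and takes the tightest of the three pairs of bounds on ψ.

```python
import numpy as np

def theta_from_phi_psi(phi0, phi1, psi, r):
    """Invert the linear map (mu0, mu10, mu11) -> (phi0, phi1, psi)."""
    mu0 = phi1 - r * psi
    mu11 = phi1 + (1 - r) * psi
    mu10 = (phi0 - (1 - r) * mu0) / r
    return mu0, mu10, mu11

def identification_interval(phi0, phi1, r):
    """Values of psi for which all three implied probabilities lie in [0, 1]."""
    k = (phi0 - (1 - r) * phi1) / r          # so that mu10 = k + (1 - r) * psi
    lower = max((phi1 - 1) / r, -phi1 / (1 - r), -k / (1 - r))
    upper = min(phi1 / r, (1 - phi1) / (1 - r), (1 - k) / (1 - r))
    return lower, upper

r = 0.3                      # hypothetical exposure prevalence
theta = (0.2, 0.5, 0.6)      # hypothetical (mu0, mu10, mu11)
phi0 = (1 - r) * theta[0] + r * theta[1]
phi1 = (1 - r) * theta[0] + r * theta[2]
psi = theta[2] - theta[0]
lower, upper = identification_interval(phi0, phi1, r)
print(lower, psi, upper)
```

At either endpoint, at least one of the three implied probabilities sits exactly on the boundary of [0, 1].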
An obvious choice of prior specification for this problem is (k_1, k_2) = (1, 1), corresponding to a uniform distribution on each of the three outcome probabilities. In this special case, it is immediate from (9) that for every φ ∈ s[Θ], π(ψ|φ) is the uniform distribution on the identification interval I(φ). The limiting posterior is therefore always the same as the coarsened limiting posterior, and there can be no second-order Bayes advantage.
For other choices of prior distribution, the situation is more nuanced. As an example, in Appendix B we examine in detail the specification k_1 = k_2 = 2, which gives slightly more prior weight to mid-range values of the response probabilities. Using f() to denote the map from φ to φ*, we prove that for every value of φ* ∈ s[Θ] there is either (i), two distinct point solutions to f(φ) = φ*, which we denote as φ_A, φ_B, or (ii), a line-segment of solutions of the form Note that as one of infinitely many solutions, the further point solution φ_B does not contribute to (10). We are then able to compute π(ψ|φ*) for a given φ*. For the (k_1, k_2) = (2, 2) case, Figure 2 compares the Bayes, coarsened Bayes, and ad-hoc probabilistic forecasts to both the true value of the target and the marginal prior density of the target, for some selected values of θ. Note that the full and coarsened limiting posterior densities are virtually indistinguishable in each case, while being quite different from both the ad-hoc forecasts (uniform distribution over the identification interval, prior marginal distribution truncated to the identification interval).
As in the previous example, the various expected scores are computed as Monte Carlo averages across a large number of draws of θ from π(θ), with results reported in Table 2. Again we have "simulation significance" to attest to ES^∞_{π,B} > ES^∞_{π,CB}. However, the difference between these two expected scores is so small as to be negligible in any practical sense. This jibes with the close agreement seen between π(ψ|φ) and π(ψ|φ*) in Figure 2. So there is a tiny, but non-zero, second-order Bayes advantage. On the other hand, the first-order Bayes advantage is very substantial in this example. The optimal-shaped density over the identification interval tends to be 17% higher at the true value than the truncated marginal prior, and 21% higher compared to the uniform distribution over the identification interval.
Table 2. Expected scores in the gene-environment example with hyperparameters k_1 = k_2 = 2. These are computed as Monte Carlo averages across 10,000 realized values of θ drawn from the prior distribution. Simulation standard errors are given in parentheses. The labels AH(TP) and AH(U) refer to the ad-hoc truncated prior and ad-hoc uniform procedures respectively.
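The comparison between the full limiting posterior and the ad-hoc uniform forecast can be sketched for the Beta(2, 2) prior by normalizing the product of three Beta densities numerically over the identification interval. Everything specific here is an assumption made for illustration: the exposure prevalence r = 0.5, the grid size, the number of draws, and the function names; the coarsened Bayes forecast, which requires the analysis of Appendix B, is not attempted.

```python
import numpy as np

R = 0.5  # hypothetical exposure prevalence, chosen only for illustration

def beta22(x):
    """Beta(2, 2) density, 6 x (1 - x) on [0, 1] and zero outside."""
    x = np.clip(x, 0.0, 1.0)
    return 6.0 * x * (1.0 - x)

def unnormalized_posterior(psi, phi0, phi1, r=R):
    """Joint prior density of (phi, psi), up to a constant, as a function of psi."""
    mu0 = phi1 - r * psi
    mu11 = phi1 + (1 - r) * psi
    mu10 = (phi0 - (1 - r) * mu0) / r
    return beta22(mu0) * beta22(mu10) * beta22(mu11)

def interval(phi0, phi1, r=R):
    """Identification interval for psi given phi."""
    k = (phi0 - (1 - r) * phi1) / r
    lower = max((phi1 - 1) / r, -phi1 / (1 - r), -k / (1 - r))
    upper = min(phi1 / r, (1 - phi1) / (1 - r), (1 - k) / (1 - r))
    return lower, upper

def log_score_difference(theta, r=R, grid_size=2001):
    """log pi(psi | phi) minus log of the uniform density on I(phi), at the true psi."""
    mu0, mu10, mu11 = theta
    phi0 = (1 - r) * mu0 + r * mu10
    phi1 = (1 - r) * mu0 + r * mu11
    psi = mu11 - mu0
    lower, upper = interval(phi0, phi1, r)
    grid = np.linspace(lower, upper, grid_size)
    dens = unnormalized_posterior(grid, phi0, phi1, r)
    norm = np.sum((dens[1:] + dens[:-1]) / 2) * (grid[1] - grid[0])  # trapezoid rule
    return np.log(unnormalized_posterior(psi, phi0, phi1, r) / norm) + np.log(upper - lower)

rng = np.random.default_rng(3)
draws = rng.beta(2, 2, size=(2000, 3))   # prior draws of (mu0, mu10, mu11)
diffs = np.array([log_score_difference(theta) for theta in draws])
print(diffs.mean(), diffs.std(ddof=1) / np.sqrt(diffs.size))
```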

Robustness
In general the decision-theoretic optimality of any Bayesian procedure stems from using the same distribution over the parameter space ("the prior") to both (i), average the procedure's performance across possible true parameter values, and (ii), use as an input to form the posterior distribution for a given dataset. An obvious question to ask then is how quickly does the optimality fade if two different distributions are used? That is, what happens if "Nature's prior distribution" used to average performance across the parameter space differs from the "investigator's prior distribution" used to determine the posterior distribution upon receipt of data? We retain π(θ) as notation for the investigator's prior, but consider what happens when Nature's prior is π*_λ(θ) for some choice of λ. We assume the class of possible choices for Nature's prior is centered around the investigator's prior, i.e., π*_0(θ) = π(θ). Specifically we look at the comparison between the limiting posterior marginal distribution over the identification interval and a uniform distribution over the identification interval, as the investigator's prior stays fixed but Nature's prior moves away from it. Let
$$t(\lambda) = E_{\pi^{*}_{\lambda}}\{\log \pi(\Psi \mid \Phi) - \log h_{U}(\Psi; \Phi^{*})\}, \tag{11}$$
where h_U denotes the uniform density over the identification interval and the expectation is with respect to π*_λ(θ). Clearly then t(0) = ES^∞_{π,B} − ES^∞_{π,AH(U)} > 0, and the magnitude of λ required to make t(λ) ≤ 0 reflects the stability of the Bayes advantage.
When λ has more than one component, it may become complicated to evaluate (11) in many different directions away from λ = 0. Thus we propose computing the gradient t′(0), as given in (12), where $s(\theta) = \partial \log \pi^*_\lambda(\theta)/\partial\lambda \,|_{\lambda=0}$. Then evaluating (11) for values of λ proportional to this gradient corresponds to looking along the direction in which (11) changes most rapidly with λ, locally at zero.
Returning to the compliance example of Section 3.1, we assume that Nature and the investigator agree on a uniform prior for the components of γ. However, whereas the investigator uses $\omega = (\omega_{CO}, \omega_{NT}, \omega_{AT}) \sim$ Dirichlet(1, 1, 1), Nature uses $\omega \sim$ Dirichlet$(1 + \lambda_1, 1 + \lambda_2, 1 + \lambda_3)$. Numerical evaluation of (12) indicates that t′(0) ∝ (0, 1, 1)′. Thus we focus attention on the case that Nature's prior is Dirichlet(1, 1 + λ, 1 + λ), for a scalar value of λ. For selected values of λ, t(λ) is given in Figure 3. We see that the advantage of the Bayesian procedure is maintained even when the discrepancy between Nature's prior and the investigator's prior is given by λ = −0.9. This suggests considerable robustness, since in practical terms the Dirichlet(1, 0.1, 0.1) distribution is fairly extreme and far from Dirichlet(1, 1, 1). In particular, this distribution puts considerable weight on extremely small values of $\omega_{NT}$ and $\omega_{AT}$.
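The gradient calculation can be illustrated numerically. The sketch below assumes that t(λ) is an expectation, under Nature's prior, of an integrand that does not itself depend on λ, so that the gradient at zero equals the investigator-prior expectation of the integrand times the score s(θ). For the Dirichlet(1 + λ_1, 1 + λ_2, 1 + λ_3) family the score at λ = 0 has components log ω_j + 3/2. The integrand `toy_score_difference` is a hypothetical stand-in chosen so the answer can be checked by hand. It is not the score difference of the compliance example, and all function names are illustrative.

```python
import numpy as np

def dirichlet_score_at_zero(omega):
    """Score of Dirichlet(1+l1, 1+l2, 1+l3) with respect to lambda at lambda = 0.

    The derivative of the log density in lambda_j is
    digamma(3) - digamma(1) + log(omega_j), and digamma(3) - digamma(1) = 3/2.
    """
    return np.log(omega) + 1.5

def toy_score_difference(omega):
    # Hypothetical integrand (an assumption for illustration only).
    return omega[:, 1] + omega[:, 2]

def gradient_at_zero(integrand, n_draws=200_000, seed=0):
    """Monte Carlo estimate of E_pi{integrand(omega) * score(omega)}."""
    rng = np.random.default_rng(seed)
    # Draws from the investigator's prior, Dirichlet(1, 1, 1).
    omega = rng.dirichlet([1.0, 1.0, 1.0], size=n_draws)
    products = integrand(omega)[:, None] * dirichlet_score_at_zero(omega)
    grad = products.mean(axis=0)
    sim_se = products.std(axis=0, ddof=1) / np.sqrt(n_draws)
    return grad, sim_se

grad, sim_se = gradient_at_zero(toy_score_difference)
print("gradient estimate:", grad)
print("simulation SE:    ", sim_se)
print("unit direction:   ", grad / np.linalg.norm(grad))
```

For this toy integrand the exact gradient is (−2/9, 1/9, 1/9), because the mean of ω_j under Dirichlet(1 + λ_1, 1 + λ_2, 1 + λ_3) is (1 + λ_j)/(3 + λ_1 + λ_2 + λ_3). The Monte Carlo estimate can therefore be checked against a known answer before the same recipe is applied to a real score difference.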

Discussion
Returning to the question posed in the title of this paper, we have seen that the shape of the posterior distribution in partially identified models is indeed useful.
Most of the benefit lies in what we have termed the first-order Bayes advantage.
If we take only the data that inform the identification interval, then processing these data in a Bayesian way, as we have termed the coarsened Bayes procedure, tends to yield a higher posterior density for the target, evaluated at the true value. In the two examples, gains of 10-20% in density height were seen, relative to ad-hoc choices of distributions across the identification interval. We have also seen, both theoretically and empirically, that there can be a further second-order advantage that arises from using all the data. This arises because, in some problems, different datasets (in the limiting sense) can produce the same identification interval but different posterior distributions over this interval. In turn, this gives a tangible sense in which the data themselves speak to the relative plausibility of different values inside the identification interval, and a tangible sense in which the shape of the posterior distribution is not just a consequence of the choice of prior distribution. Of course in both examples, and particularly in the second example, the second-order advantage is small, both in absolute terms and in comparison to the first-order advantage. Thus the impact lies in the conceptual and theoretical understanding of how inference works in the partially identified setting, rather than in finding hitherto inaccessible gains in estimator performance.
One issue arising is that the first-order Bayes advantage might in part be a self-fulfilling prophecy, since the same prior distribution is used both to derive the posterior distribution and to weight the averaging of performance across the parameter space. And of course this is the general underpinning of the decision-theoretic optimality of all Bayesian procedures. While a full look at this issue is beyond the scope of this article, a modest evaluation of "robustness" was conducted in Section 4, examining what happens when two different distributions over the parameter space are used to fulfill the two roles mentioned above.
Of course any assessment of a statistical procedure involves choices concerning how performance is quantified. Arguably our choice of logarithmic scoring of probabilistic forecasts for the target parameter has intuitive appeal. If two forecast distributions always have the same support, then it seems appealing to deem the one with the highest average density at the true value as having the more useful shape. Obviously other assessments could be made though, say based on point-estimator performance. For instance, say $m_h(\phi) = \int \psi\, h(\psi; \phi)\, d\psi$ is the mean of the probabilistic forecast, reducing to the limiting posterior mean when h(ψ; φ) = π(ψ|φ). Then performance of various schemes (Bayesian or not) could be based on average mean-squared error $E_\pi[\{m_h(\Phi) - \Psi\}^2]$. While investigation of this is beyond the scope of the present paper, it is easy to note that this would change things substantially in our imperfect compliance example. Here the Bayes, coarsened Bayes, and truncated uniform procedures all yield distributions which are symmetric about a(φ) as given in (1). Hence all three give rise to $m_h(\phi) = a(\phi)$, and consequently identical performance.
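The contrast between the two criteria can be seen in a small numerical sketch. It takes a hypothetical identification interval [0.2, 0.8] and two forecast shapes that are symmetric about its midpoint, a uniform and a triangular density. Neither is taken from the examples in this paper, and the interval, the assumed true value 0.45 and the function names are all illustrative choices. Both forecasts have the same mean, so a mean-squared-error criterion cannot separate them, whereas their log densities at the true value differ.

```python
import numpy as np

def mean_and_log_density(shape, lower, upper, psi_true, n_grid=20001):
    """Normalize an unnormalized forecast shape on [lower, upper] with the
    trapezoid rule, then return its mean and its log density at psi_true."""
    grid = np.linspace(lower, upper, n_grid)
    dx = grid[1] - grid[0]
    h = shape(grid)
    h = h / (np.sum((h[:-1] + h[1:]) / 2.0) * dx)   # now integrates to one
    integrand = grid * h
    mean = np.sum((integrand[:-1] + integrand[1:]) / 2.0) * dx
    log_density = np.log(np.interp(psi_true, grid, h))
    return mean, log_density

# Hypothetical identification interval (an assumption for illustration).
lower, upper = 0.2, 0.8
midpoint = 0.5 * (lower + upper)
half_width = 0.5 * (upper - lower)

def uniform_shape(psi):
    return np.ones_like(psi)

def triangular_shape(psi):
    # Peaks at the midpoint and falls linearly to zero at the endpoints.
    return 1.0 - np.abs(psi - midpoint) / half_width

psi_true = 0.45  # assumed true value of the target, inside the interval
for name, shape in [("uniform", uniform_shape), ("triangular", triangular_shape)]:
    mean, log_density = mean_and_log_density(shape, lower, upper, psi_true)
    print(f"{name}: mean = {mean:.4f}, log density at truth = {log_density:.4f}")
```

Any other pair of shapes symmetric about the same point would give the same tie in the means. The log score separates the two forecasts only through how much density each places near the true value.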
A referee has also raised the question of evaluating procedures in terms of predictive distributions. For instance, in the Bayesian case the forecast distribution of the next data point D* given the first n datapoints $d_n$ tends to $\pi(d^*|\phi) = \int \pi(d^*|\theta)\,\pi(\theta|\phi)\,d\theta$, as n goes to infinity. In general this is not so tightly connected to the present work, since our focus is on π(ψ|φ), which is only a marginal distribution associated with the π(θ|φ) required to form the predictive distribution. In some cases the two do coincide though, particularly should (φ, ψ) constitute a reparameterization of θ (this happens in the gene-environment interaction example, but not in the imperfect compliance example). When they do coincide, a possibility would be to shift the evaluation of performance from (4) to $E_\pi\{\log \int \pi(D^*|\Phi, \psi)\, h(\psi; \Phi)\, d\psi\}$.
A limitation of this work is that only the asymptotic limit is considered. We point out, however, that this limit is often approached rather quickly, in the following sense. For given data $d_n$, the posterior variance of the target parameter is
$$\mathrm{Var}(\Psi \mid d_n) = E\{\mathrm{Var}(\Psi \mid \Phi) \mid d_n\} + \mathrm{Var}\{E(\Psi \mid \Phi) \mid d_n\}.$$
The first term on the right-hand side will tend to $\mathrm{Var}_\pi(\Psi \mid \Phi = \phi) > 0$, while the second term falls off as $n^{-1}$. As soon as the second term is small relative to the first we are "near" the limit, with minimal scope for further reduction in the width of the target posterior distribution as further data are collected. Thus we can reasonably expect that what we learn in the limit applies at realistic sample sizes.
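The relative size of the two variance terms can be made concrete in a toy nonidentified model with closed-form answers. This is an assumption for illustration, not one of the paper's examples: θ_1 and θ_2 are independent N(0, 1) a priori, each observation is N(θ_1 + θ_2, 1), so that φ = θ_1 + θ_2 is identified, and the target is ψ = θ_1. Then Var(ψ | φ) = 1/2 regardless of n, while the posterior variance of φ is 1/(n + 1/2).

```python
import numpy as np

def variance_terms(n):
    """Two terms of the posterior variance of psi = theta1 in the toy model
    (theta1, theta2 iid N(0, 1); y_i | theta ~ N(theta1 + theta2, 1))."""
    var_psi_given_phi = 0.5                   # Var(psi | phi): free of phi and n
    post_var_phi = 1.0 / (0.5 + n)            # posterior variance of phi
    var_of_cond_mean = 0.25 * post_var_phi    # since E(psi | phi) = phi / 2
    return var_psi_given_phi, var_of_cond_mean

def direct_posterior_variance(n):
    """Posterior variance of theta1 from the joint normal posterior, as a check."""
    ones = np.ones((2, 1))
    precision = np.eye(2) + n * (ones @ ones.T)   # prior precision + data precision
    return np.linalg.inv(precision)[0, 0]

for n in [1, 10, 100, 1000]:
    first, second = variance_terms(n)
    print(n, first, second, first + second, direct_posterior_variance(n))
```

In this toy model the total posterior variance is (n + 1)/(2n + 1), and by n = 100 the second term is roughly half a percent of the first. How quickly the limit is approached in other models depends on the prior and the likelihood.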
A final comment is that the general topic of inference in the absence of identification may be perceived by some as rather esoteric. Indeed, there is often a feeling that in order to tackle an applied problem, one needs to make enough modeling assumptions so as to attain an identified model. Unfortunately, in many scientific domains this can promote the use of quite dubious modeling assumptions. Arguably then, we need to make peace with models that are only partially identified for the target parameter, and we need to understand the workings of inference in such settings.
Appendix A: Further details of the imperfect compliance example
The map from φ to (Y, X|Z) cell probabilities is given via

Expressed in this form, the mapping is readily seen to be invertible.
Appendix B: Details of the gene-environment model with hyperparameters $k_1 = k_2 = 2$
Recall that f is the map from φ to φ*. For a given c* in the image of f, we need to characterize solutions to f(φ) = c*. Note that the domain of f is the subset of the unit square U given by $S = \{\phi \in U : |\phi_0 - \phi_1| < r\}$. The form of (7) and (8) is such that S can be partitioned as S = A ∪ B ∪ C as depicted in the left panel of Figure 4, with $\phi^*_L$ being continuous and piecewise-linear on these subsets. Similarly, S = D ∪ E ∪ F as in the right panel, with $\phi^*_U$ being linear on these partition sets. The two dotted reference lines on both panels are the $\phi^*_L = 0$ and $\phi^*_U = 0$ level sets, with $\phi^*_L > 0$ above the upper reference line and $\phi^*_U < 0$ below the lower reference line. Let $S_1 \subset S$ be the region between the reference lines, for which the identification interval crosses zero. Note that the gradient of $\phi^*_L$ points straight up on B and straight down on A. Thus a level set for a negative value of $\phi^*_L$ has an open-parallelogram shape, as exemplified in the left panel of Figure 4. We can then speak unambiguously of the "bottom," "spine," and "top" of such a level set. In contrast, a level set for a positive value of $\phi^*_L$ corresponds to a line parallel to, and above, the upper reference line. A mirror-image situation applies to $\phi^*_U$, as depicted in the right panel of the figure.
Let $\tilde\phi$ be one solution to f(φ) = c*. (If it is helpful, one can think of $\tilde\phi$ as the true value of φ.) Then we have three possible cases.

Case 1. Say that $\tilde\phi \in S - S_1$, i.e., the identification region is to one side of zero. Without loss of generality, say $\tilde\phi$ lies above the upper reference line. Then $\phi^*_L$ remains constant along the line through $\tilde\phi$ which is parallel to the upper reference line. Along this line, $\phi^*_U$ takes the value one at the boundary between D and E, decreasing linearly from here in both directions. Moreover, it is easily verified that $\phi^*_U$ has a common value at both intersections of this line with the boundary of S. Therefore, there must be exactly two point solutions to f(φ) = c* in total.

Case 2. Say that $\tilde\phi \in S_1 \cap B^C \cap E^C$. By inspection, it must be that either $\tilde\phi \in A \cap F \cap S_1$ or $\tilde\phi \in D \cap C \cap S_1$. Without loss of generality, assume the former. Then the base of the level set for $\phi^*_L$ intersects the spine of the level set for $\phi^*_U$ at $\tilde\phi$. Given this, exactly one further solution is generated, as either the spine extends up far enough to hit the top of the level set for $\phi^*_L$, or, failing this, the top of the level set for $\phi^*_U$ hits the spine for $\phi^*_L$.

Case 3. Say that $\tilde\phi \in S_1 \cap (B \cup E)$. Without loss of generality, say that $\tilde\phi$ is in B rather than E. Then, intersecting the tops of the level sets for both $\phi^*_L$ and $\phi^*_U$ gives a horizontal line segment of solutions of the form $\phi_0 \in (1 - r, \tilde\phi_1)$, $\phi_1 = \tilde\phi_1$. We can also see from the shape of the level sets that there will be an additional point solution somewhere to the "southwest" of B, where the higher of the two bases of the two level sets crosses the spine of the other.

As claimed then, for a given c* in the image of f, either there are two point solutions to f(φ) = c*, or one horizontal line segment of solutions plus an additional point solution.