Case-deletion importance sampling estimators: Central limit theorems and related results

Case-deleted analysis is a popular method for evaluating the influence of a subset of cases on inference. The use of Monte Carlo estimation strategies in complicated Bayesian settings leads naturally to the use of importance sampling techniques to assess the divergence between full-data and case-deleted posteriors and to provide estimates under the case-deleted posteriors. However, the dependability of the importance sampling estimators depends critically on the variability of the case-deleted weights. We provide theoretical results concerning the assessment of the dependability of case-deleted importance sampling estimators in several Bayesian models. In particular, these results allow us to establish whether or not the estimators satisfy a central limit theorem. Because the conditions we derive are of a simple analytical nature, the dependability of the estimators can be assessed routinely before estimation is performed. We illustrate the use of the results in several examples.


Introduction
Complex Bayesian models are fit with simulation techniques. A Monte Carlo method is used to generate a sample from the posterior distribution, and this sample is used to estimate many quantities, such as posterior means and variances of parameters, posterior probabilities of events, predictive distributions of future cases, etc. For a complete analysis, one examines the data, looking for outliers and influential cases. One also considers information external to the model which suggests groups of cases that may depart from the model. When interesting groups of cases are found, they are dropped from the data set, and estimates are recomputed. The resulting case-deleted posterior distribution and the case-deleted estimates are of interest, as are the changes in the posterior and estimates. Substantial changes in posterior or estimates may lead to refinement of the model. Cross-validation also relies on case-deletion, as formalized by the conditional predictive ordinate (CPO) (see, for example, p. 47 and p. 284 of [4]).
Case-deleted posterior distributions are examined through importance sampling. The large sample from the full posterior distribution is reweighted, as suggested for example in [21] and [23], to compute summaries with respect to the case-deleted posterior distribution. Examples of this and similar approaches are presented in [3,15,16,25,26] and [27]. As shown in [13], it is essential for the importance sampling weights to have finite variance. If the 2nd moment of the weights does not exist, typical estimators will not follow an n^{1/2} asymptotic, nor will they obey a central limit theorem.
It is shown in [19] that, for the case of a popular Bayesian linear model with conjugate priors, whether or not the weight function for a single case-deletion has finite 2nd moment depends on simple conditions involving the scale parameter of the prior distribution of the error variance, the leverage of the observation being deleted, its residual, and the total residual sum of squares. In this article, we expand upon the results of [19] in several directions. We first analyze the situation of multiple case-deletions and provide necessary and sufficient conditions for the rth (r > 1) posterior moment of the weight function to be finite. This allows us to treat a group of observations coherently, thereby capturing synergistic effects of similar cases. We extend the results to much broader classes of prior distributions, so that we can handle nonconjugate as well as conjugate priors. This is accomplished by formally defining classes of distributions that are thick or thin tailed with respect to the conjugate priors. This extension is coupled with two devices, bounding functions and adjustment of the prior, to allow us to establish a connection between a finite rth moment of the weight function and the finiteness of the 2nd moment for a variety of functions. The existence of two moments for these functions implies that a central limit theorem holds for an estimator. As in [19], the conditions are on sample size, leverage and an adjusted residual sum of squares.
In addition to the linear model, we provide results for the Michaelis-Menten (MM) model. The MM model is nonlinear, but has the property that, conditional on one parameter, the mean structure is linear in the remaining parameters. Making use of conditional linearity, we develop uniform versions of the conditions for the linear model that ensure existence of the weight function's rth moment in the MM model. Many other models are conditionally linear (among them linear regression used in conjunction with Box-Cox transformations or linear regression along with Box-Tidwell transformations). We pursue further extensions of the linear model, deriving results for the logistic regression model.
Our results have a very practical implication. They let us determine, quickly and analytically, whether central limit theorems hold for particular functionals. If central limit theorems hold, then we can pursue the strategy of fitting the model to the full data set and using importance sampling to estimate the functionals under case-deleted posteriors. If central limit theorems do not hold, we must alter our inferential strategy, either using more sophisticated importance sampling techniques (such as the importance link function technique introduced in [17]) or fitting the model for particular case-deleted data sets with separate Monte Carlo simulations.
By providing conditions under which r moments of the case-deleted weight function exist, our theorems go beyond the typical central limit theorem results that rely on the existence of second moments. This is important for two reasons. First, one may be interested in functionals where higher order moments of the case-deleted weight function come into play (see, for example, estimation of χ^2 divergence in [26] and [27]). Second, the number of moments which exist for the deletion of particular cases can be used as a measure of their influence, thus allowing one to assess influence along a continuum. The connection between influence and moment conditions is elucidated by applying results presented in [7] and [8], which, for an arbitrary, non-negative random variable X, contain the definition of a quantity called the moment index of X. Denoting by W the weight function resulting from the deletion of a given set of cases, its moment index r* is the least upper bound on the number of moments which exist. This represents a quantitative summary of the limiting tail behavior of the case-deleted weight function in the sense that, as stated in [7] and [8], r* = lim inf_{t→∞} [log P(W > t)]/[log(1/t)]. A larger moment index corresponds to a larger class of functions for which the central limit theorem holds. Practical illustrations of these ideas are presented in Sections 4 and 6.
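The moment index r* can also be estimated empirically from realized weights. The sketch below uses the Hill estimator, a standard tail-index tool that is our illustration rather than anything prescribed in this article; it assumes the weights have a Pareto-like tail P(W > t) ~ t^{-r*}.

```python
import numpy as np

def hill_moment_index(w, k):
    """Hill estimator of the moment index r* from realized weights w.

    Assumes a Pareto-like tail P(W > t) ~ t^{-r*}; the estimate uses the
    k largest order statistics. A standard tail-index estimator, offered
    as an illustration, not the paper's own computation.
    """
    w = np.sort(np.asarray(w, dtype=float))[::-1]   # descending order
    return 1.0 / np.mean(np.log(w[:k] / w[k]))

# Weights with a Pareto tail of index 3 (so r* = 3): W = U^{-1/3}, U uniform
rng = np.random.default_rng(0)
w = rng.uniform(size=200_000) ** (-1.0 / 3.0)
r_star_hat = hill_moment_index(w, k=2_000)
```

For these simulated weights the estimate concentrates near the true value 3 as the sample grows; the choice of k trades bias against variance, as usual for tail-index estimation.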
This article is laid out as follows. Section 2 contains preliminary results and formal definitions of thick and thin tails. Section 3 provides conditions for the (non)existence of the rth moment of the case deletion weight function in the linear model. Section 4 gives conditions for the existence of moments in the MM model, and Section 5 gives parallel results for the logistic regression model. Section 6 shows how the results can be used to establish central limit theorems. A summary of sufficient conditions on the weight function's moments to ensure a central limit theorem for several popular Bayesian measures of influence is presented in Table 2. The section also shows the results in action, investigating both measures of influence and their impact on model development in a multiple linear regression setting. The final section contains concluding remarks. Technical details of proofs are left to the appendix.

Notation and preliminary results
Each Bayesian model considered in this article depends on a finite dimensional parameter vector s = (s_1, ..., s_k). Suppose that a set of observations y = (y_1, ..., y_n) is collected and let p(s) = p(s|y) denote the full posterior density for s. Let I denote the set of indices to be deleted from the analysis and let I be its cardinality. Let y_\I represent the n − I observations remaining after the indices in I are omitted, with p_\I(s) = p_\I(s|y_\I) denoting the corresponding case-deleted posterior density. Furthermore, let q(s) = q(s, y) and q_\I(s) = q(s, y_\I) denote functions computable at every point (s, y) and proportional to the joint densities (prior × data likelihood) of (s, y) and (s, y_\I), respectively.
Suppose that a sample z_1, ..., z_M from p(s) is available. In a typical application this will be either an independent sample or a dependent sample from an ergodic Markov chain. We wish to construct an estimate of E_{p_\I}[g(s)] = ∫ g(s) p_\I(s) ds, for some real valued function g(s) such that ∫ |g(s)| p_\I(s) ds < ∞. This can be done by computing a Monte Carlo sum in which the individual elements g(z_m) are reweighted. Typically, p(s) and p_\I(s) are not available because their normalizing constants are unknown and only q(s) and q_\I(s) are directly computable. In that case we can define the weight function w_\I(s) = q_\I(s)/q(s) and estimate the expectation by

Ê_{p_\I}[g(s)] = Σ_{m=1}^M g(z_m) w_\I(z_m) / Σ_{m=1}^M w_\I(z_m).    (2.1)

The denominator in Equation (2.1) divided by M estimates the ratio of the two unknown normalizing constants. Thus, if p(s) and p_\I(s) are available, w_\I(s) can be replaced by w*_\I(s) = p_\I(s)/p(s) in the numerator and the denominator can be replaced by M, resulting in the related estimator that we denote by Ê*_{p_\I}[g(s)]. In both cases, the resulting estimators are consistent under mild assumptions (see [13] for the case of i.i.d. samples and [24] for the case of samples from ergodic Markov chains). Throughout the article we refer to estimators of the form Ê_{p_\I}[g(s)] and Ê*_{p_\I}[g(s)] as case-deleted importance sampling estimators.
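In code, the self-normalized estimator of Equation (2.1) is a one-liner; the minimal sketch below assumes the log-weights log w_\I(z_m) have already been computed, and works on the log scale for numerical stability (the helper name is ours):

```python
import numpy as np

def case_deleted_estimate(g_vals, log_w):
    """Self-normalized importance sampling estimate of the case-deleted
    expectation, Equation (2.1).

    g_vals : g(z_m) evaluated at posterior draws z_m from p(s)
    log_w  : log of the unnormalized weights, log q_del(z_m) - log q(z_m)
    """
    log_w = np.asarray(log_w, dtype=float)
    # subtract the maximum for numerical stability; the shift cancels
    # in the ratio of the two Monte Carlo sums
    w = np.exp(log_w - log_w.max())
    return float(np.sum(w * np.asarray(g_vals, dtype=float)) / np.sum(w))
```

With equal weights the estimate reduces to the plain Monte Carlo average of the g(z_m), as it should.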
The prior distribution plays a large role in determining whether the Estimator (2.1) is asymptotically normal. To ensure asymptotic normality for Ê_{p_\I}[g(s)], we need both ∫ w^2_\I(s) g^2(s) p(s) ds < ∞ and ∫ w^2_\I(s) p(s) ds < ∞. Finiteness of these integrals is unchanged by substitution of w*^2_\I(s) for w^2_\I(s). (See Section 6 for further discussion of conditions for the asymptotic normality of both Ê_{p_\I}[g(s)] and Ê*_{p_\I}[g(s)].) In many instances, a prior distribution with sharp enough tails will ensure that these integrals are finite, while a flatter tailed prior will lead to infinite integrals.
The upcoming lemma enables us to work easily with priors having different tails. In particular, it enables us to derive preliminary results for conjugate prior distributions, and then to quickly extend the results to non-conjugate prior distributions. Use of the lemma is demonstrated in the examples.
To set up the lemma, we first define the basic notation. Let for i = 0, 1. The functions f, π_i and h are assumed to be non-negative. The constants c_i are assumed to be finite and positive. Let 0 < b < B < ∞.
A device that we have found useful is a formal description of thinner and thicker tailed distributions. Since the prior distributions that we consider here are all absolutely continuous with respect to Lebesgue measure on R k , we use a simple definition that suffices for our purposes. We describe the result in terms of a distribution for a parameter since that is how we will use the result.
Consider a parameter s ∈ S. The parameter space S is taken to be R^k. Let F represent a set of distributions on s, all of which have densities with respect to Lebesgue measure. The following definition concerns the relationship between another distribution, g, and the set of distributions F.

Definition 2.1. The density g is said to be thick-tailed with respect to F if, for each f ∈ F and for each sequence s_t with ||s_t|| → ∞ as t → ∞, lim_{t→∞} g(s_t)/f(s_t) = ∞.
Definition 2.2. The density g is said to be thin-tailed with respect to F if, for each f ∈ F and for each sequence s_t with ||s_t|| → ∞ as t → ∞, lim_{t→∞} g(s_t)/f(s_t) = 0.
We note that these definitions capture the general notion of which distributions are thicker or thinner tailed than others. For example, a t distribution will be thicker tailed than the class of normal distributions. A one-dimensional normal distribution will be thinner tailed than the Laplace distribution. A t distribution with 5 degrees of freedom will be thicker tailed than a t distribution with 7 degrees of freedom, etc. We also note that a normal distribution with variance σ^2 is thicker tailed than a normal distribution with variance cσ^2 if c < 1.
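Definition 2.1 can be checked numerically: the ratio of a t density (5 degrees of freedom) to a standard normal density grows without bound along any sequence with ||s_t|| → ∞, so the t is thick-tailed with respect to the normal. A small sketch of our own, using density kernels since normalizing constants do not affect the limit:

```python
import math

def t_kernel(s, nu=5):
    # Student-t density kernel (normalizing constant omitted)
    return (1.0 + s * s / nu) ** (-(nu + 1) / 2.0)

def normal_kernel(s):
    # standard normal density kernel (normalizing constant omitted)
    return math.exp(-s * s / 2.0)

# The ratio grows without bound as |s| increases: the t distribution is
# thick-tailed with respect to the normal in the sense of Definition 2.1.
ratios = [t_kernel(s) / normal_kernel(s) for s in (2.0, 5.0, 10.0)]
```

The polynomial decay of the t kernel is eventually dominated by the exponential decay of the normal kernel, which is exactly what drives the ratio to infinity.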

A Bayesian linear model
In [19] the author considers a standard specification of the Bayesian linear model and derives necessary and sufficient conditions for the variance of the case-deleted importance sampling weight function to be finite when a single observation is omitted. Loosely, the conditions for a finite variance stated in [19] can be described as (a) small leverage for the deleted case, (b) large enough sample size, and (c) small enough residual for the deleted case. In this section we extend the results of [19] in two different directions: we analyze the situation of multiple case-deletions and provide necessary and sufficient conditions for the rth (r > 1) posterior moment of the case-deleted weight function to be finite. Our conditions are also on leverage, sample size and residual. In addition, we extend the results to nonconjugate models by considering the tail behavior of the prior distribution. In Section 6, these results are used to establish central limit theorems for a broad class of importance sampling estimators.
Let the n × 1 vector of observations Y be distributed as

Y | θ, σ^2 ~ N(Xθ, σ^2 I),    (3.1)

where I denotes the identity matrix and X denotes an n × k design matrix of rank k. Assume that the variance σ^2, having an inverse gamma prior distribution with known positive parameters α and β, is independent of the k × 1 vector of regression parameters θ = (θ_0, ..., θ_{k−1})^T having a proper prior density π_1 with full support R^k, i.e.,

π(θ, σ^2) = π_1(θ) IG(σ^2; α, β).    (3.2)

To describe conditions under which moments of the case-deleted weight function exist, we introduce some additional quantities. Let H = X(X^T X)^{−1} X^T and RSS = y^T (I − H) y denote the projection matrix and the residual sum of squares from the least squares fit of the full data set, respectively. The index set, I, consists of the indices of the I cases to be deleted. Given the index set I, let Y_I be the I × 1 random vector of observations Y_i, with i ∈ I, and let X_I^T be the I × k submatrix of the I rows of X indexed by I. Define the leverage of set I to be the principal minor of H corresponding to I: H_I = X_I^T (X^T X)^{−1} X_I, and define e_I to be the I × 1 vector of the elements indexed by I in the vector of the ordinary residuals e = (I − H) y, i.e., e_I = y_I − X_I^T (X^T X)^{−1} X^T y. Finally, for each r > 0, if the I × I matrix (I − r H_I) is non-singular, let RSS*_\I(r) = RSS − r e_I^T (I − r H_I)^{−1} e_I. When I = 1, so that I = {i}, H_I = x_i^T (X^T X)^{−1} x_i is the leverage of the ith observation, say h_ii, e_i = y_i − x_i^T (X^T X)^{−1} X^T y is the residual of observation i, and RSS*_\i(r) = RSS − r e_i^2/(1 − r h_ii). When r = 1, RSS*_\I(r) is the residual sum of squares from the least squares fit of the case-deleted data set.
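The quantities just defined are cheap to compute. The following sketch (a hypothetical helper of ours, using 0-based indices) returns the largest eigenvalue λ_I of H_I and RSS*_\I(r), the two ingredients of the moment conditions discussed below:

```python
import numpy as np

def deletion_quantities(X, y, I, r):
    """Largest eigenvalue of H_I and the adjusted residual sum of squares
    RSS*(r) for a deleted index set I.

    A sketch of the Section 3 quantities: X is the n x k design matrix,
    y the response, I a list of 0-based row indices, r > 1. The helper
    name and interface are ours, not the paper's.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T                 # projection (hat) matrix
    e = y - H @ y                         # ordinary residuals, e = (I - H) y
    RSS = float(y @ e)                    # y^T (I - H) y
    H_I = H[np.ix_(I, I)]                 # principal minor: leverage of set I
    e_I = e[list(I)]
    lam_I = float(np.max(np.linalg.eigvalsh(H_I)))
    M = np.eye(len(I)) - r * H_I
    RSS_star = RSS - r * float(e_I @ np.linalg.solve(M, e_I))
    return lam_I, RSS_star
```

For a single deleted case the returned value reduces to RSS − r e_i^2/(1 − r h_ii), matching the scalar formula above.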
Letting s = (θ, σ^2), the unnormalized importance sampling weight function resulting from the deletion of the I cases indexed by I is given by

w_\I(θ, σ^2) = (σ^2)^{I/2} exp{ (y_I − X_I^T θ)^T (y_I − X_I^T θ) / (2σ^2) }.

This functional form of the weight results from ignoring normalizing constants not depending on the model parameters and from canceling the common factors in the numerator and the denominator represented by the prior and by the portion of the Gaussian likelihood which corresponds to the undeleted cases.
For the Bayesian linear model specified by Equations (3.1) and (3.2) the following theorem holds.
(ii) If the noninformative prior π(θ, σ^2) ∝ 1/σ^2 is used, then conditions (a) and (a′) remain unchanged, and conditions (b), (c), (b′) and (c′) become: (b) n > rI + k, (b′) n ≤ rI + k, (c) RSS*_\I(r) > 0 and (c′) RSS*_\I(r) < 0.

Remark 3.1. Theorem 3.1 includes the problem investigated in [19] as a special case. There, the author takes r = 2 and specifies the prior distribution on (θ, σ^2) as θ | Σ ~ N(θ_0, Σ), σ^2 ~ IG(α, β) and Σ ~ IW(νR, ν), with conditional independence at all stages of the model. The parameter θ_0 ∈ R^k is a known mean vector, α and β are known positive constants, and IW(νR, ν) is an inverse Wishart distribution with ν a known integer greater than or equal to k and R a known k × k positive definite matrix.

Remark 3.2. The statement of Theorem 3.1 involves the eigenvalues of the I × I matrix H_I. In typical applications, the cardinality I of the set of observations being deleted will be fairly small and the calculation of the eigenvalues can be accomplished quickly with standard software. For the illustrative examples presented in the article, we computed all eigenvalues using the R function eigen().
Theorem 3.1 holds for any proper prior distribution on θ having full support on R k , provided the parameters θ and σ 2 are independent and the prior for σ 2 is IG(α, β). This follows from the form of the likelihood function, which, for fixed σ 2 , is an exponential function with quadratic argument in θ, and, for fixed θ, is the product of a power and an exponential function in 1/σ 2 . Recognizing a connection with the integral needed to normalize the kernel of an inverse gamma density suggests how to extend the results to the case of non-conjugate prior distributions. The next two corollaries make this extension, placing the focus on the tails of the prior distribution.
The corollaries assume independence between θ and σ^2, and so we consider their tail behavior separately. Let π_11 denote a (proper) prior distribution on θ, and let F_1 be the family of all nondegenerate multivariate normal distributions on R^k. Corollary 3.2 distinguishes between priors that are thick-tailed with respect to F_1 and those that are not. Let π_12 denote a (proper) prior distribution on σ^2, and let F_2 be the family of all inverse gamma distributions IG(α, β), α > 0, β > 0. Exploiting the connection mentioned in the previous paragraph, the proof of Theorem 3.1 shows that conditions (a), (a′), (c) and (c′) determine the integrability (or lack thereof) of a certain function of σ^2 in a neighborhood of zero. For σ^2 going to infinity, a suitable number of observations guarantees integrability. Thus, the corollaries focus on the tail for σ^2 near 0, or equivalently the tail for the precision, 1/σ^2, tending to ∞. A distribution π_12 which is thick-tailed with respect to F_2 has the property that lim_{σ^2→0} π_12(σ^2)/π_02(σ^2) = ∞ for all π_02 ∈ F_2; a distribution that is thin-tailed with respect to F_2 satisfies lim_{σ^2→0} π_12(σ^2)/π_02(σ^2) = 0 for all π_02 ∈ F_2.
Before proceeding, we summarize the notational conventions just introduced and the assumptions common to both corollaries.
The first corollary deals with thick-tailed prior distributions π 12 on σ 2 and covers the case of all proper prior distributions π 11 on θ.
Corollary 3.1. Assume 1.-6. above and let π_12(σ^2) be thick-tailed with respect to F_2. If λ_I < 1/r and RSS*_\I(r) > 0, then the case-deleted weight function has finite rth moment with respect to the full posterior distribution. On the other hand, if λ_I > 1/r or RSS*_\I(r) < 0, then the rth moment of the case-deleted weight function is infinite.
The next corollary applies to thin-tailed distributions π_12(σ^2). It provides only a sufficient condition if π_11(θ) is thin-tailed, and necessary and sufficient conditions if π_11(θ) is thick-tailed with respect to F_1.
Corollary 3.2. Assume 1.-6. above and let π_12(σ^2) be thin-tailed with respect to F_2. If λ_I < 1/r, then the case-deleted weight function has finite rth moment with respect to the full posterior distribution. If π_11(θ) is thick-tailed with respect to F_1, then λ_I > 1/r implies that the case-deleted weight function has infinite rth moment.
If λ_I > 1/r and both the prior π_11 on θ and the prior π_12 on σ^2 are thin-tailed, we cannot draw any conclusions about the finiteness of the full posterior rth moment of w_\I(θ, σ^2), as shown in the following example.
Finally, consider the case of a prior distribution π_11(θ) having bounded support. Arguing as in the proof of Corollary 3.2, it is easy to verify that the rth moment of w_\I, E(w^r_\I(θ, σ^2) | y), always exists if π_12(σ^2) is thin-tailed with respect to F_2. On the other hand, if π_12(σ^2) is either in F_2 or is thick-tailed with respect to F_2, the finiteness of E(w^r_\I(θ, σ^2) | y) depends essentially on the value of RSS*_\I(r).
Then one can prove the following:

A nonlinear model
To illustrate some of the issues that arise when the fitted model is nonlinear, we revisit a Bayesian analysis of the Puromycin data presented in [17]. The data come from a biochemical reaction and are described in [5], p. 425. For a group of cells not treated with the drug Puromycin, there are n = 11 measurements of the initial velocity of a reaction, V_i, obtained when the concentration of the substrate was set at a given positive value, c_i. The observations are recorded in Table 1 and plotted in Figure 1. The Bayesian model fit in [17] assumes a nonlinear regression of velocity on concentration given by the Michaelis-Menten (MM) relation:

V = m c / (κ + c).    (4.1)

According to this relation, when the concentration of the substrate equals the Michaelis parameter, κ, the velocity reaches half of its maximal value, m, which is also the limiting velocity as the concentration goes to infinity. Following [17], we model the n observations as independent realizations from normal distributions with means given by Equation (4.1) and common variance σ^2. All three parameters m, κ, and σ^2 are constrained to be positive and their prior distribution is specified as π(m, κ, σ^2) = π_1(m, σ^2) π_2(κ), with π_1(m, σ^2) ∝ 1/σ^2 representing a noninformative prior density for (m, σ^2) and π_2 representing a proper prior density for κ such that ∫_0^∞ κ π_2(κ) dκ < ∞. This requirement guarantees that the posterior is proper.
The MM model is, conditional on κ, a linear regression model with no intercept and covariate x_i(κ) = c_i/(κ + c_i). Conditional on κ, the results of Section 3 therefore govern the moments of the conditional weight function. The unconditional rth moment may be infinite for a different reason: the finite conditional rth moments may integrate to infinity. Thus, the conditions will need to be strengthened to ensure a finite rth moment. To avoid this route to infinity, the conditions on leverage and residual are applied uniformly in κ. Finally, an apparent infinite moment will sometimes be finite due to the restriction on the support of m.
Define the conditional design matrix X(κ). Proceeding as in Section 3, define the matrix H_I(κ), and concentrate on its largest eigenvalue. The conditional leverage, l(I, κ), is the only non-zero eigenvalue. The condition on the residual can be expressed in terms of simpler functions which will prove useful later. Define the functions A(I, r, κ), B(I, r, κ) and C(I, r). The adjusted, conditional residual sum of squares is RSS*_\I(r, κ) = C(I, r) − B^2(I, r, κ)/A(I, r, κ). The set of zeroes of A(I, r, κ), with I and r held fixed, contains at most 2(n − 1) points, a set of Lebesgue measure 0, and so we need not worry about the apparent division by 0. One last quantity is needed to handle the partial support of m. Define g(I, κ), the product of covariate and response summed over the deleted cases divided by the same quantity summed over all cases. The results on the finiteness of E(w^r_\I(m, σ^2, κ) | v) are summarized in the following theorem.

Remark 4.3. The strategy applied to the MM model applies to an array of models that are, conditional on some set of parameters, linear. We impose the leverage and residual conditions uniformly across the parameters that render the model nonlinear. Important classes of models are linear regression models that allow for Box-Cox transformation of the response variable and/or Box-Tidwell transformation of the explanatory variables.
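The uniform-in-κ conditions can be examined by scanning a grid of κ values. The sketch below assumes the conditional covariate x_i(κ) = c_i/(κ + c_i) implied by the MM mean m c_i/(κ + c_i); the helper and the grid-based scan are our illustration, not the paper's computation:

```python
import numpy as np

def mm_conditional_checks(c, v, I, kappas):
    """Conditional leverage l(I, kappa) and ratio g(I, kappa) over a grid.

    Assumes the conditional covariate x_i(kappa) = c_i / (kappa + c_i);
    c: concentrations, v: velocities, I: 0-based deleted indices.
    For a single-covariate, no-intercept regression the leverage of set I
    is the ratio of the deleted cases' squared covariates to the total.
    """
    c = np.asarray(c, dtype=float)
    v = np.asarray(v, dtype=float)
    lev, g = [], []
    for kappa in kappas:
        x = c / (kappa + c)
        lev.append(float(np.sum(x[I] ** 2) / np.sum(x ** 2)))   # l(I, kappa)
        g.append(float(np.sum(x[I] * v[I]) / np.sum(x * v)))    # g(I, kappa)
    return np.array(lev), np.array(g)
```

Suprema over κ, such as the leverage condition's sup l(I, κ) < 1/r, can then be approximated by taking the maximum over a fine grid.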
The authors of [17] specified a t distribution on 3 degrees of freedom restricted to [0, +∞) as a prior π_2 for the parameter κ and fit the model to the Puromycin data using the program BUGS (see [22]). (Due to some technical restrictions of BUGS, they had to use approximations for some of the prior specifications.) They considered deletion of single cases and computed the corresponding case-deleted weight functions. They reported detailed estimation results based on the deletion of case 1, an observation that produces highly variable realized weight functions, and illustrated how a transformation based approach (the Importance Link Function method) can effectively reduce the variability of the weight functions and lead to improved estimation.
We discuss the implications of the results developed in this section on the analysis presented in [17]. We consider, as was done in [17], deletion of single observations and focus on the case r = 2, so that the sample size condition (b) of Theorem 4.1 is satisfied for all cases. An examination of the leverage condition (a) shows that observation 11 has large leverage (for κ = 2, l(I, κ) = 0.5065 > 1/r = 1/2), and so by Remark 4.1, the posterior variance of the case-deleted weight function for case 11 is infinite. All remaining cases have leverages bounded above, uniformly in κ, by a constant strictly less than 1/2, and so satisfy condition (a).
Turning to the residual conditions (c) and (d), we find that all observations other than 1 and 11 satisfy condition (c), thus ensuring finite variances for their case-deleted weight functions. For observation 1, the adjusted residual sum of squares is negative for values of κ near 0.08, violating condition (c). Condition (d) is also violated since sup_{κ>0} g(1, κ) = 0.05501 < 1/2. Consequently, Theorem 4.1 implies that the case-deleted weight function for observation 1 has infinite variance.
In addition to r = 2, we can examine other moments of the case-deleted weight functions. Table 1 displays, for every case-deletion i, the moment index r*, i.e., the least upper bound on the value of r for which the rth moment exists (see [7] and [8]). If the influence of the ith observation on the posterior distribution p(m, σ^2, κ) is assessed by the χ^2 divergence measure between the case-deleted and full posteriors,

χ^2(p_\i, p) = ∫ (p_\i(s) − p(s))^2 / p(s) ds,

then, as suggested in [26] and [27], we can estimate χ^2 by means of the Monte Carlo sum appearing in Table 2. As indicated in Section 6, this estimator is asymptotically normal if E(w^4_\i(s)|y) < ∞. According to the values of r* displayed in Table 1, a central limit theorem holds only for the estimators of χ^2 corresponding to observations 3, 4, 6, 7, and 8.

Table 2 (caption excerpt; see Section 6): KL represents Kullback-Leibler divergence, L1 is integrated L_1 loss, L2 is integrated L_2 loss, ∆1 is change in the first moment of a parameter θ, ∆2 is change in the second moment of a parameter θ, Hel is Hellinger distance, ChSq is chi-square distance, CPO is the Conditional Predictive Ordinate, and Bdd is a bounded function. As a shorthand for the notation introduced in Section 2, a subscript m means that a function is evaluated at z_m (e.g., w_m = w(z_m), etc.). The symbol L_I(s) represents the likelihood of the observations in I evaluated at the point s, L_\I represents the likelihood of the observations not in set I, and π represents the prior density. The expression 2 + δ in the table means that it is sufficient, for some δ > 0, that 2 + δ moments exist. R̂ = Σ_{m=1}^M w_m/M. Ĉ is an estimator of C = ∫ q(s) ds. There are many estimators of C, with some based on a different simulation than that used to fit the model.
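A common Monte Carlo form of the χ^2 estimator built from unnormalized weights is (mean of w^2)/(mean of w)^2 − 1, which uses E_p[w*^2] − 1 with w* = p_\i/p; the sketch below assumes this is the form referred to in Table 2:

```python
import numpy as np

def chi2_divergence_estimate(log_w):
    """Monte Carlo estimate of the chi-square divergence between the
    case-deleted and full posteriors from unnormalized log-weights.

    Based on chi^2 = E_p[(w*)^2] - 1 with w* the normalized weight;
    for unnormalized weights w this equals (mean w^2)/(mean w)^2 - 1.
    A sketch; asymptotic normality needs four weight moments (Section 6).
    """
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())       # stabilized; the shift cancels
    return float(np.mean(w ** 2) / np.mean(w) ** 2 - 1.0)
```

Equal weights give an estimate of zero, as the two posteriors then coincide on the sampled support.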

Bayesian logistic regression
We now switch our focus to generalized linear models, concentrating on the study of a logistic regression model. Assume that, for each of n subjects, we have available a k × 1 vector of covariate information, x_i, and we observe a 0-1 outcome, Y_i. Suppose that the Y_i are independently distributed as Bernoulli random variables taking on value 1 with probability

p_i(β) = exp(x_i^T β) / (1 + exp(x_i^T β)).

The case-deleted weight function is proportional to

w_\I(β) = ∏_{i∈I} [ p_i(β)^{y_i} (1 − p_i(β))^{1−y_i} ]^{−1}.    (5.1)

The following theorem covers prior distributions with the exponential tails that match the logistic regression likelihood. Subsequent corollaries cover thinner and thicker tailed prior distributions.
The theorem can be applied to prior distributions proportional to exp{−|β|^T ε}, where ε is a vector of positive numbers. In this instance, a rescaling of the covariates to obtain a prior with a single real ε results in the type of prior for which the theorem is stated.
The theorem may be strengthened somewhat by explicitly considering the case of max_{β: |β|^T 1 = 1} h(β, r, ε) = 0, although the statement of precise conditions under which the case-deleted rth moment is infinite becomes messy. The conditions in Theorem 5.1 are easy to check since the maximum of h(β, r, ε) may be found via linear programming methods.
As in the case of the linear model, we will investigate the rth moment of the case-deleted weight function under thick-tailed and thin-tailed prior distributions. The main tool for the proofs is, once again, Lemma 2.1. The first corollary deals with thick-tailed distributions.

Corollary 5.1. If h(β, r, 0) < 0 for all β such that |β|^T 1 = 1, the case-deleted weight function has finite rth moment (r > 0) with respect to the full posterior p(β). If, for some vector β such that |β|^T 1 = 1, h(β, r, 0) > 0, then the case-deleted weight function has infinite rth moment.
The next corollary applies to thin-tailed distributions.

Applying the corollaries
The preceding corollaries enable us to determine quickly whether the case-deleted weight function has finite or infinite rth moment. Consider an arbitrary logistic regression model where the prior distribution on β is taken to be the normal distribution with mean β_0 and variance Σ, with Σ of full rank. This distribution is thin-tailed with respect to the family of prior distributions used in Theorem 5.1. To verify this, write the ratio of priors, with g representing the normal prior density and f representing the prior density under a member of the exponential-tailed class:

g(β)/f(β) ∝ exp{ −(β − β_0)^T Σ^{−1} (β − β_0)/2 + ε |β|^T 1 } ≤ exp{ −||β − β_0||^2/(2λ_1) + ε |β|^T 1 } → 0 as ||β|| → ∞,

where λ_1 is the largest eigenvalue of Σ. Applying Corollary 5.2 with a normal prior distribution, we find that all positive moments of the case-deleted weight function are finite. This result holds even if all of the cases are deleted. Suppose instead that the prior distribution on β is taken to be a multivariate t distribution with ν degrees of freedom, location vector β_0 and scale matrix Σ, with Σ of full rank. This t distribution is thick-tailed with respect to the family of prior distributions used in Theorem 5.1. A formal verification of this follows from an examination of the ratio of prior density functions. To establish finiteness or infiniteness of the case-deleted moments, use Theorem 5.1 with ε = 0.
We note that Theorem 5.1 can be of help in establishing whether or not the rth moment of the case-deleted weight function will be infinite, even when the prior distribution is improper. If the prior density for β is uniform on R^k, for example, we merely apply the theorem with ε = 0. The conclusion of a finite case-deleted rth moment is conditional upon the propriety of the posterior distribution. This propriety is not guaranteed, as use of the uniform prior distribution may lead to an improper posterior distribution (see [12] and [18]). However, since the weight function w_\I(β) in Equation (5.1) always exceeds one, if the first moment of the case-deleted weight function is finite, so is the normalizing constant for the posterior: the posterior distribution is proper if any case-deleted weight function has finite first moment.
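Since the case-deleted weight is the reciprocal of the deleted cases' Bernoulli likelihood, its logarithm is simply minus their log-likelihood and is always non-negative, which is the fact used above. A sketch of our own, assuming the weight form of Equation (5.1):

```python
import numpy as np

def logistic_deletion_log_weight(beta, X_I, y_I):
    """log of the case-deleted weight for logistic regression: minus the
    log-likelihood of the deleted cases, so the weight always exceeds one.

    Sketch assuming the weight form of Equation (5.1); X_I holds the
    deleted cases' covariate rows and y_I their 0-1 outcomes.
    """
    eta = np.asarray(X_I, dtype=float) @ np.asarray(beta, dtype=float)
    y = np.asarray(y_I, dtype=float)
    # Bernoulli log-likelihood: y*eta - log(1 + exp(eta)), written stably
    loglik = float(np.sum(y * eta - np.logaddexp(0.0, eta)))
    return -loglik
```

Because each Bernoulli likelihood factor is below one, the returned log-weight is non-negative for every β, regardless of the deleted set.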

Central limit theorems
The previous sections provide results that enable us to calculate the number of moments which exist for the case-deleted weight function. The results apply to classes of prior distributions, and so can be quickly used to establish asymptotic normality of the importance sampling estimator Ê_{p_\I}[g(s)] given in Equation (2.1) and of the related estimator Ê*_{p_\I}[g(s)]. In this section, we indicate how these results apply to a variety of measures of case influence. We also present two techniques which are generally useful for applying the results.
Central limit theorems (CLTs) for importance sampling estimators, when the parameter vectors s are generated as i.i.d. samples or arise from a uniformly ergodic Markov chain, are described in [13] and [24], respectively. Under either source for the sample, the estimator Ê*_{p\I}[g(s)] satisfies a CLT provided that

∫ w²_{\I}(s) g²(s) p(s|y) ds < ∞   (6.1)

and

∫ w²_{\I}(s) p(s|y) ds < ∞.   (6.2)

These conditions are explicitly presented for i.i.d. samples in [13]. A slight technical extension of the CLT in [24] helps to establish the result for ergodic samples. The extension consists of an application of the Cramér-Wold device to establish the joint asymptotic normality of the estimator of the normalizing constant for the weight function and of an estimator proportional to Ê*_{p\I}[g(s)], followed by an application of the delta method (e.g., see [9]).
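The estimator Ê*_{p\I}[g(s)] referred to here is a self-normalized importance sampling estimator: draws from the full-data posterior are reweighted by the case-deleted weights, and the weights are normalized to sum to one. A sketch under simplified stand-in choices (the Gaussian "posterior" and exponential tilt below are illustrative only, not the paper's model):

```python
import math
import random

def snis_estimate(draws, weight_fn, g):
    """Self-normalized importance sampling estimate of E_{p\\I}[g(s)]
    from draws generated under the full-data posterior."""
    ws = [weight_fn(s) for s in draws]
    return sum(w * g(s) for w, s in zip(ws, draws)) / sum(ws)

random.seed(1)
draws = [random.gauss(0.0, 1.0) for _ in range(5000)]   # stand-in posterior
tilt = lambda s: math.exp(0.1 * s)                      # stand-in weight w(s)
est = snis_estimate(draws, tilt, g=lambda s: s)
# the exp(0.1 s) tilt of N(0, 1) is N(0.1, 1), so est should be near 0.1
assert abs(est - 0.1) < 0.2
```

Whether such an estimate is trustworthy is exactly what the moment conditions above decide: they control the variability of the weights `ws`.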
The first technique for establishing a CLT recognizes that the g²(s) term in the integral in condition (6.1) can be grouped with p(s|y), yielding, say, p*(s|y). The quantity p*(s|y) is the formal posterior distribution for s given the data, provided that it is integrable. It corresponds to a proper Bayesian analysis with a g²(s)-adjusted prior distribution proportional to g²(s)π(s), provided that 0 < ∫ g²(s)π(s) ds < ∞. We note that this integral is against the prior distribution, and so is typically easy to evaluate.
To facilitate application of the theorems, we wish to preserve full support of the function-adjusted prior distribution. When g(s) is never equal to zero, so that the g²(s)-adjusted prior has full support, checking the asymptotic normality of Ê*_{p\I}[g(s)] requires only verifying condition (6.1). In all other cases we act as if the prior distribution had density proportional to (1 + g²(s))π(s). This preserves full support of the function-adjusted prior distribution when g(s) is not always different from zero, and it also allows us to verify conditions (6.1) and (6.2) at once when we wish to determine whether Ê_{p\I}[g(s)] is asymptotically normal.
The second technique that we find useful is to establish the finiteness of integrals under case-deleted posteriors for a bounding function, which then implies finiteness for interesting classes of functions. The relation log²(x) ≤ (C_ε + x^{−ε} + x^{ε})², valid for some constant C_ε and all x > 0, connects moments of the case-deletion weight function to finiteness of integrals for several influence measures. The moment generating function is also a useful bounding function. Hence, we consider g(s) = exp(s^T t) for all t in some open neighborhood of 0, say U. If ∫ w²_{\I}(s) exp(s^T t) p(s|y) ds < ∞ for all t ∈ U, then condition (6.1) is satisfied for any polynomial s_1^{n_1} · · · s_k^{n_k} and any constant. We note that condition (6.1) implies condition (6.2), and the CLT applies to the importance sampling estimators of any mixed and marginal moments of s.
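The logarithmic bound quoted above is easy to spot-check numerically; for ε = 0.5, for example, the constant C_ε = 1 suffices over a wide range of x (the particular ε and C_ε here are illustrative choices, not values from the paper):

```python
import math

# Spot-check of log^2(x) <= (C_eps + x**(-eps) + x**eps)**2 for x > 0.
eps, C = 0.5, 1.0
for k in range(-80, 81):
    x = 10.0 ** (k / 10.0)          # x ranges from 1e-8 to 1e8
    lhs = math.log(x) ** 2
    rhs = (C + x ** (-eps) + x ** eps) ** 2
    assert lhs <= rhs, (x, lhs, rhs)
```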
Formal Bayesian techniques that describe the influence of a set of cases on an analysis focus on a one-dimensional summary of the difference between the case-deleted posterior distribution and the full posterior distribution. Bayesian measures of model fit focus on case-deleted measures of predictive accuracy and cross-validation. A plethora of summaries exist. In this subsection, we show how our results can be used to verify that a CLT holds for the summaries estimated on the basis of a Monte Carlo sample. We illustrate this point with a discussion at the end of Example 6.1 concerning estimation of the conditional predictive ordinate (CPO).
This approach can be applied to many Bayesian case influence measures. Table 2 contains a summary of results. Each row of the table corresponds to a measure of influence. The measure is given under the column headed Estimand, and a formula for estimation is given under the heading Estimator. The last three columns present sufficient conditions for the estimator to follow a CLT. The column headed Mom's gives a number of moments of the case-deleted weight function; the column headed Adjmnt presents the function used to adjust the prior distribution, if needed, and the column headed Adj-Mom's gives a number of moments of the case-deleted weight function against the function-adjusted prior distribution. If the given numbers of moments and adjusted moments both exist, then a CLT holds for the estimator.

Examples
Example 6.1. This example illustrates the practical use of the results presented in Section 3. We fit a linear model to data assembled by the authors of [20] to investigate growth rates across mammalian species. Gestational time is known to be an important factor in determining growth rate. The data set contains 96 entries with complete information on growth rates and possibly related covariates for mammalian species. There is one marsupial that we excluded from our analysis. Three of the remaining species exhibit delayed implantation, a phenomenon by which the blastocyst, after reaching the uterus, remains dormant and unattached to the uterine lining for an extended period of time. An examination of the covariate gestation time (in days) led us to conclude that the recorded gestational time for the grizzly and polar bears (Ursus arctos and Thalarctos maritimus) included the preimplantation time, while the recorded gestational time for the nine-banded armadillo (Dasypus novemcinctus) did not. This last gestational time was adjusted to include preimplantation. After this adjustment, the recorded gestation time for each species included in the analysis covers the time from egg fertilization to birth.
The response variable is a species' advancement, defined as the ratio of neonatal to adult body weight. We built a linear model including an intercept and three covariates: the natural logarithms of gestation time, litter size, and adult body weight (centering all three covariates around their respective means). The least squares fit of this model yields a multiple R-square of 0.4344. Based on a Bayesian analysis with noninformative prior distributions for the model parameters, the 95% highest posterior density (hpd) intervals for the coefficients for log litter size and log body weight include only negative values while the 95% hpd interval for the coefficient for log gestation time includes only positive values. This indicates that heavier species, species with larger litter sizes, and species with shorter gestation times give birth to relatively immature offspring.
We use the theoretical results of the preceding sections for three purposes: we examine the influence of a preselected group of cases on inference, we screen all groups of cases of a certain size for their influence, and we verify the stability of cross-validatory estimators of summary measures. First, consider the three species with delayed implantation. We interpret the moment index r*_I, i.e., the cut-off value for the existence or non-existence of the rth moment of the case-deletion weights (see [7] and [8]), as a measure of influence of the cases being excluded. This cut-off value is given by the minimum of the cut-off value r*_{a,I} between the leverage conditions (a) and (a′) and the cut-off value r*_{c,I} between the distance conditions (c) and (c′). Dropping the three species leads to the values r*_{a,I} = 4.74 and r*_{c,I} = 2.93. Thus r*_I = 2.93 for this set of species. This number is small, suggesting that this group of cases is influential. A glance at Table 2 shows us that a central limit theorem will not hold for the chi-square distance, but that it will hold for the other measures listed in the table.
As with traditional measures of influence, we consider where our set of observations falls on the measures r*_{a,I} and r*_{c,I} with respect to other sets of similar size. We scanned all triples of species, computing cut-offs for each triple. Ordering the triples of dropped species according to their increasing values of r*_{a,I}, we found that the nine-banded armadillo belongs to 99 of the top 100 triples (all but the 31st), while the grizzly bear belongs to 2 of the top 100 triples (the 18th and the 99th), and the polar bear belongs to one of the top 100 triples (the 38th). Our three species in combination rank 1343rd out of the 138,415 triples, with an r*_{a,I} value of 4.74. Ordering the triples of dropped species according to their increasing values of r*_{c,I}, we find that both bear species belong to each of the top 93 triples and that the grizzly bear belongs to each of the top 167 triples. Dropping all three species with delayed implantation at once yields the 6th smallest value for r*_{c,I}. From this comparative analysis, we conclude that the three species with delayed implantation may be influential: the nine-banded armadillo due mainly to its leverage and the two bear species due mainly to their outlyingness. This set of three species stands out, as there is a common underlying factor that may differentiate them from the other species.
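The screening step is a brute-force enumeration over deletion sets. The sketch below shows only the bookkeeping, with a placeholder cut-off function standing in for the model-specific computation of r*_{a,I} or r*_{c,I}; note that with 95 retained species there are C(95, 3) = 138,415 triples, matching the count quoted above:

```python
from itertools import combinations
from math import comb

# 95 retained species give the triple count quoted in the text
assert comb(95, 3) == 138415

def rank_deleted_sets(n_cases, size, cutoff_fn):
    """Order all deletion sets of the given size by increasing cut-off.

    `cutoff_fn` is a placeholder; a real screen would plug in the
    computation of r*_{a,I} or r*_{c,I} for each candidate set.
    """
    return sorted(combinations(range(n_cases), size), key=cutoff_fn)

# toy illustration on 8 cases with a made-up cut-off function
toy = rank_deleted_sets(8, 3, cutoff_fn=sum)
assert toy[0] == (0, 1, 2)
```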
Pursuing the potential influence of our triple of cases, we examine whether inclusion of the dormant period in the total gestation time affects the conclusions that we draw from the model. To answer this question we adjusted the gestation times of these species to account only for the period of actual development and reconsidered the linear model described above. The least squares fit now yields a multiple R-square of 0.5267. The 95% hpd interval for the coefficient for log litter size now contains 0, suggesting possible simplification of the model, although the qualitative interpretations of the impact of species weight, litter size, and gestation time remain the same.
Repeating the earlier exercise of dropping triples of cases, we find that the leverage of the nine-banded armadillo is somewhat diminished, as it now enters only 7 of the top 100 triples for r*_{a,I}. The grizzly bear has a little more leverage, as it now belongs to 4 of the top 100 triples (the 7th, 10th, 17th and 65th), while the polar bear belongs to just one of the top 100 triples (the 98th). The three species in combination rank 1916th with an r*_{a,I} value of 5.24. The smallest value of r*_{c,I}, which now equals 3.65, is attained when the three species Papio papio, Ursus arctos, and Thalarctos maritimus are dropped. Both bear species belong to each of the top 22 triples; the grizzly bear belongs to 96 of the top 100 triples and the polar bear belongs to 97 of the top 100 triples. Dropping all three species with delayed implantation at once yields the 23rd smallest value for r*_{c,I}. Thus, the three species with delayed implantation are still influential when the model is fit to the adjusted gestation times, although the extent of their influence is slightly diminished. According to both analyses, the two bear species are highly influential, due mainly to their large residuals. This not only confirms the well known fact that bears have an unusually small advancement but also reveals that the dormant gestation period by itself cannot account for it. Quoting from a January 27, 2004 New York Times article (see [1]): Polar bears share with all bears an extreme disparity between the size of their mother, in the quarter-ton range, and that of a newborn cub (about a pound). "It's a dramatic trait in the bear family," Dr. Paetkau said. "They are off the chart among placental mammals, and closer to marsupials like the kangaroo."

Model fit is commonly assessed via k-fold cross-validation.
The data are partitioned at random into k subsets of approximately equal sizes, and each of the k subsets is used in turn as a test set, with the union of the remaining k − 1 subsets serving as a training set. The model is fit to the data in the training sets and its predictions are compared to the actual values of the observations in the test sets by computing some measure of predictive ability averaged over the k sets of predictions. In a Bayesian analysis, CPO provides a measure of overall predictive ability. The cross-validated CPO can be estimated with draws from the full posterior and importance sampling weights. However, as noted in Table 2, for a given partition, the central limit theorem will not hold if r*_I drops below 2 when any of the k subsets of observations is excluded.
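For a single deleted case, the case-deleted weight is proportional to 1/f(y_i|θ), and the importance sampling estimate of CPO_i from full-posterior draws reduces to a harmonic mean of likelihood values. A sketch of this standard estimator (the k-fold version discussed here deletes whole subsets rather than single cases):

```python
def cpo_harmonic(lik_draws):
    """Importance sampling estimate of CPO_i = f(y_i | y_{-i}).

    lik_draws[s] = f(y_i | theta_s) for draws theta_s from the full
    posterior.  Reweighting by w(theta) = 1/f(y_i | theta) and
    normalizing yields the harmonic-mean form below.
    """
    return len(lik_draws) / sum(1.0 / lik for lik in lik_draws)

# sanity check: constant likelihood values recover that constant
assert abs(cpo_harmonic([0.2] * 100) - 0.2) < 1e-12
```

The instability of this estimator when the weights have heavy tails is precisely what the r*_I diagnostics above are designed to detect.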
To investigate how often the central limit theorem breaks down for CPO, we considered the case of 5-fold cross-validation for the model fit to the data used in the first analysis and simulated 10,000 random partitions of the data into 5 subsets of size 19 each. For each split we computed five values of r*_I. Out of the 10,000 simulated partitions, there were 658 partitions where r*_I dropped below 2 for exactly one of the five case deletions and there was one partition where it dropped below 2 for two of the five case deletions. The value of r*_I never dropped below 2 for more than two of the five case deletions.
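The partition simulation requires only repeated random splits of the case indices; computing r*_I for each held-out subset is the model-specific step and is omitted below. A sketch of the splitting (function names are illustrative):

```python
import random

def random_partition(n, k, rng):
    """Randomly split indices 0..n-1 into k near-equal subsets."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[j::k] for j in range(k)]

rng = random.Random(0)
parts = random_partition(95, 5, rng)
# 95 cases split into 5 subsets of 19, as in the simulation described
# in the text; each simulated partition would then yield five r*_I values
assert sorted(len(p) for p in parts) == [19] * 5
assert sorted(i for p in parts for i in p) == list(range(95))
```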
If it is established that, for a particular partition, no central limit theorem holds for importance sampling estimation of CPO, then the analyst must turn to other methods of estimation. For example, sampling from a mixture distribution, with components given by the full posterior and by the case-deleted posteriors conditional on those subsets for which r*_I ≤ 2, ensures the existence of a central limit theorem for the estimate of CPO.
Example 6.2. The authors of [11], in their influential paper on Bayesian model selection/model averaging, put a prior distribution over a collection of Bayesian linear models. There have been a host of extensions of their model, most of which are amenable to the treatment below. Formally, we describe a prior distribution having the form of Equation (3.2). The likelihood for the model follows Equation (3.1). The prior distribution on the error variance is σ² ∼ IG(α, β). The prior distribution on the regression coefficients is described in two stages. At the first stage, there is an indicator vector recording whether each regressor "appears in" the model. The indicators are independent Bernoulli(p_j) variates. If the jth regressor does not appear, then the conditional prior distribution on its coefficient θ_j is N(0, τ²) with small τ; if it does appear, then the conditional prior distribution on θ_j is N(0, cτ²), with large c > 1. Marginalizing over the indicators, the prior distribution on an individual coefficient is θ_j ∼ (1 − p_j)N(0, τ²) + p_j N(0, cτ²). The resulting prior distribution remains absolutely continuous with respect to Lebesgue measure while effectively allowing regressors to be included in or excluded from the model.
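Sampling a coefficient from this two-component prior is straightforward; a sketch with illustrative hyperparameter values (τ is treated as a standard deviation here, so the "slab" component has standard deviation √c · τ):

```python
import random

def draw_coefficient(p_j, tau, c, rng):
    """One draw of theta_j from (1 - p_j) N(0, tau^2) + p_j N(0, c tau^2)."""
    if rng.random() < p_j:                  # regressor "appears": wide slab
        return rng.gauss(0.0, (c ** 0.5) * tau)
    return rng.gauss(0.0, tau)              # regressor absent: narrow spike

rng = random.Random(42)
draws = [draw_coefficient(0.5, 0.01, 10_000.0, rng) for _ in range(2000)]
# spike draws cluster near zero; slab draws are on the scale sqrt(c)*tau = 1
assert any(abs(d) > 0.1 for d in draws)
assert any(abs(d) < 0.05 for d in draws)
```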
The regression analysis is used to estimate the regression coefficients and the associated posterior expected loss. Pursuing a decision theoretic approach, we ask when the case-deleted importance sampling estimators follow CLTs. We use the standard sum-of-squared-error loss, so that L(θ, a) = Σ_{j=0}^{k−1} (θ_j − a_j)². The Bayes action, a, is the posterior mean vector. Here, we focus on the posterior expected loss. The posterior expected loss is E[L(θ, a)|y] = Σ_{j=0}^{k−1} Var(θ_j|y), and then, for the asymptotic normality of Ê_{p\I}[g(s)] or Ê*_{p\I}[g(s)], the function g(θ) to be considered is g(θ) = Σ_{j=0}^{k−1} θ_j². We now proceed with the technique. First, we verify that the function-adjusted prior distribution is proper. Since the prior distribution on the regression coefficients is a finite mixture of normals,

∫ (1 + g²(θ))π(θ) dθ < ∞.   (6.3)

Next, we consider Theorem 3.1 as applied to the function-adjusted prior distribution with r = 2. If conditions (a), (b) and (c) of the theorem hold for a particular case deletion, then the case-deleted weight function has finite second moment, or equivalently,

∫ w²_{\I}(θ, σ²)(1 + g²(θ)) p(θ, σ²|y) dθ dσ² < ∞,

establishing conditions (6.1) and (6.2) and hence a CLT for the estimators Ê_{p\I}[g(s)] and Ê*_{p\I}[g(s)].

The impact of the prior distribution's tail behavior on decision rules is discussed in [2]. Robustness considerations suggest that it is often wise to use a prior distribution with thicker tails than the likelihood. For MCMC algorithms, a convenient replacement of the normal distribution is a t-distribution; see for example [4]. The technique used above can be directly applied and yields the same results when the prior distribution for θ_j is (1 − p_j)N(0, τ²) + p_j T(d, 0, cτ²), with the latter term in the mixture a t-distribution with d > 4 degrees of freedom, center 0 and scale cτ². The requirement d > 4 guarantees that condition (6.3) holds.
Example 6.3. The results of a study used to estimate the survival distribution for leukemia patients are presented in [10]. The response variable is survival time (from diagnosis), and explanatory variables are white blood cell count at diagnosis (WBC) and whether "Auer rods and/or significant granulature of the leukemic cells in the bone marrow at diagnosis" were present (AG positive) or absent (AG negative). The authors of [10] develop estimates of the survival distribution based upon presumed exponential distributions which are allowed to depend on the covariates. The authors of [6] dichotomize the survival times by defining a new response which indicates survival past 50 weeks. They analyze the data with the frequentist counterpart to the logistic regression model described in Section 5, where there are k = 3 covariates: an intercept, WBC and AG. The authors of [6] identify one case, a patient with a high WBC count and a survival time of more than 50 weeks, as having extremely large influence. They also note that altering the model (to predict survival based on log(WBC) and AG) can reduce the influence of the case.
We examine influence under a product of double-exponential prior distributions for β. The distribution has scale parameter 10 in each direction (and hence a prior mean for β_i, given β_i > 0, of 10). Case 15, diagnosed in [6] as an influential observation, is easily found to have an infinite variance for its case-deleted weight function. The value of the criterion h(β, 2, ε) is found to be h((0, −1, 0)^T, 2, −0.1) = 45.15. This value is well in excess of 0, and indicates that the choice of ε = 0.1 for the prior distribution has little to do with why the case-deleted weight function has infinite variance. On the other hand, the value of the criterion in the positive direction for β_2 is less than 0, indicating that this tail of the distribution of β_2 is well-behaved. No other case results in an infinite variance for its case-deleted weight function.
For case 15, condition (6.2) does not hold and the estimators Ê_{p\I}[g(s)] and Ê*_{p\I}[g(s)] are not asymptotically normal. Condition (6.2) holds for all remaining observations. Thus we can establish the asymptotic normality of their associated estimators by showing that condition (6.1) holds as well. We do this by using the bounding strategy described above: h(β, 2, ε) < 0 for all β with ‖β‖_1 = 1 implies the existence of an open neighborhood U of 0 such that ∫ w²_{\I}(s) exp(s^T t) p(s|y) ds < ∞ for all t ∈ U. Indeed, it follows that ∫ exp{h(β, 2, ε) + 2β^T t} dβ is finite for all t in U, which in turn, arguing as in the proof of Theorem 5.1, implies that ∫ w²_{\I}(s) exp(s^T t) p(s|y) ds < ∞ for all t ∈ U.
It is interesting to note that the analysis above is not strictly tied to the particular choice of the prior distribution as a product of double exponentials. Indeed, in light of Corollary 5.1, if the (proper) prior distribution on β is thick-tailed with respect to the family of products of double exponential distributions, then h(β, 2, 0) < 0 for all β with ‖β‖_1 = 1 still implies both conditions (6.1) and (6.2). This is true even when the noninformative prior distribution π(β) ∝ 1 is assigned. Finally, if π(β) is thinner-tailed than any product of double exponentials, then conditions (6.1) and (6.2) are always satisfied.

Conclusions
The development of effective computational tools for fitting hierarchical models has spurred the growth of Bayesian data analysis. As with its classical counterpart, a complete Bayesian data analysis investigates sensitivity of inferences to changes in the data set, with particular consideration given to excluding observations from the analysis. This exclusion is most often accomplished through the use of importance sampling based on case-deleted weight functions. The theoretical results in Sections 3 through 5 provide conditions under which importance sampling estimators of various functionals will follow central limit theorems. Further results along these lines may be obtained for other likelihoods (particularly those in the exponential family) and for other specific model structures (as in Section 4). The techniques in Section 6 provide a simple means of verifying the conditions of the earlier theorems. We have found that the combination of these techniques and the theorems allows us to easily verify (or disprove) asymptotic normality of many estimators.
The results can be used to evaluate computational strategies. In many situations, computations can be hastened by sampling from a formal model that uses a nicely structured prior distribution, say π_s(s), in place of the actual prior distribution, π(s). This change may be motivated by the speed of programming conjugate calculations or by the speed of execution of the algorithm (e.g., see [17]) used to fit the model. With the altered model, inference is made through use of importance sampling with weights w_p(s) = π(s)/π_s(s). When concerned about the effects of groups of cases, these importance sampling weights can be combined with the case-deletion weights to produce inference under the case-deleted posterior distribution. The weights are w(s) = w_p(s) w_{\I}(s). Suppose that the weights due to the prior distribution have r_p moments and the case-deletion weights have r_I moments (under the model with prior distribution π_s). Then a straightforward calculation shows that the combined weights have at least (r_p^{−1} + r_I^{−1})^{−1} moments. Thus, the suitability for quick and efficient data analysis based on the computational strategy where π is replaced by π_s for the sampling algorithm can be evaluated.
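One way to see the stated moment count for the combined weights is via Hölder's inequality with exponents p = r_p/r and q = r_I/r (a sketch of the calculation, with expectations taken under the model with prior π_s):

```latex
\mathbb{E}\bigl[(w_p w_{\setminus I})^{r}\bigr]
  \le \bigl(\mathbb{E}\, w_p^{\,rp}\bigr)^{1/p}
      \bigl(\mathbb{E}\, w_{\setminus I}^{\,rq}\bigr)^{1/q}
  = \bigl(\mathbb{E}\, w_p^{\,r_p}\bigr)^{r/r_p}
      \bigl(\mathbb{E}\, w_{\setminus I}^{\,r_I}\bigr)^{r/r_I} < \infty,
\qquad \frac1p + \frac1q = \frac{r}{r_p} + \frac{r}{r_I} = 1.
```

The conjugacy constraint 1/p + 1/q = 1 is met exactly at r = (r_p^{-1} + r_I^{-1})^{-1}, and smaller moments follow by monotonicity.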
There is a strong connection between the tail of the prior distribution relative to the likelihood and the robustness of inference based on the model. Sentiment generally favors prior distributions with thicker tails than the likelihood. With a thick-tailed prior distribution, when there is a clash between likelihood and prior, inference is dominated by the likelihood (e.g., see [2], Chapter 4). Our preference is to select a prior distribution that reflects the analyst's beliefs. Often, this will be a thick-tailed prior distribution, leading to simplified conditions such as those in Corollaries 3.1 or 5.1. While our preference is to select the prior distribution on the basis of modeling considerations, we do note that the results of this paper could be used to select a prior with tails thin enough to guarantee existence of a targeted number r of moments.
The results we derive apply to broad classes of models. As an example, the specification of the normal theory linear model in (3.1) and (3.2) can mask a much richer hierarchical model. The richer model may include further parameters, say γ, where the prior distribution on θ depends on γ. As long as the likelihood is a function only of θ and σ², the case-deleted weight function will also be a function of these parameters. The theorems are applied with the marginal prior distribution of θ and σ². The prior specifications in [11] and [19] may be viewed in this light.
Models which combine different studies provide a less evident match for these theorems. A typical linear model used for such combination will allow the regression coefficients to vary from study to study. Such variation is captured with a hierarchical model that links the coefficients across studies by means of hyperparameters. The overall model can be expressed in graphical form as a hierarchical model. The advantage of the general conditions in the theorems that describe only the tail behavior of the prior distribution becomes apparent in this setting. For case deletions involving only one study, and referring to the notation of the previous paragraph, γ includes the parameters specific to the other studies, the data specific to the other studies, and the hyperparameters. Thus the marginal prior distribution on θ and σ 2 to be used in the theorems is the marginal distribution on these parameters, posterior to the data from the other studies. While this distribution is usually inaccessible in closed form, one can often verify that its tails behave like some (unspecified) normal distribution or that they are thicker than the class of normal distributions. This is sufficient for application of the theoretical results.

Proof of Lemma 2.1
To prove the first part of the lemma, note that To prove the second part of the lemma, note that The third part of the lemma follows from the first two parts.
The proof of Theorem 3.1 relies on the following two lemmas.

Proof. To prove the lemma we use a formula for matrix inversion given in [14]. For every square matrix W and any conforming rectangular matrices U and V, assuming that each of the stated inverses exists,

(W + UV)^{−1} = W^{−1} − W^{−1} U (I + V W^{−1} U)^{−1} V W^{−1}.   (A.1)

By applying formula (A.1) to the matrices W = X^T X, U = −rX_I and V = X_I^T, an expression for the inverse of (X^T X − rX_I X_I^T), when it exists, is given by

(X^T X − rX_I X_I^T)^{−1} = (X^T X)^{−1} + r (X^T X)^{−1} X_I (I − rH_I)^{−1} X_I^T (X^T X)^{−1},   (A.2)

where H_I = X_I^T (X^T X)^{−1} X_I. On the other hand, substituting W = I, U = −r(X^T X)^{−1} X_I and V = X_I^T into Equation (A.1) yields an analogous expansion.

Part (i)
The assumption that λ_i ≠ 1/r for all i = 1, . . . , I implies that (X^T X − rX_I X_I^T)^{−1} is well defined, in view of formula (A.2) and Lemma A.1, and the posterior rth moment of w_{\I}(s), E(w^r_{\I}(s)|y), is proportional to an explicit integral. If condition (a) holds, then, by Lemma A.2, (X^T X − rX_I X_I^T) is positive definite, and the integral converges. Using the expression for (X^T X − rX_I X_I^T)^{−1} given in Equation (A.2) and the property that H_I commutes with (I − H_I), we obtain the stated form. To avoid heavy algebra, we consider only the case θ_0 = θ̂, although the result is true for an arbitrary θ_0.
The proof of Theorem 5.1 relies on the following lemma, which relates a bound in terms of polar coordinates to the finiteness of the integral.

Lemma A.3. Suppose that f(β) is continuous in β, β ∈ R^k, and that, for some M < ∞ and b < 0, |f(β)| ≤ exp(b‖β‖) for all β such that ‖β‖ ≥ M. Then ∫_{R^k} |f(β)| dβ < ∞.
Proof. Split the integral into two portions. For β such that ‖β‖ ≤ M, we have the integral of a continuous function over a compact set. This integral is finite. The integral over the remaining portion of the space is also finite:

∫_{‖β‖ ≥ M} |f(β)| dβ ≤ ∫_M^∞ c_k r^{k−1} exp(br) dr < ∞,

where c_k r^{k−1} is the surface area of the k-dimensional sphere of radius r; the last integral converges because b < 0.

Proof of Theorem 5.1
The rth moment of the case-deleted weight function can be written as an integral, against the prior distribution, of a product of likelihood terms raised to appropriate powers. In order to apply Lemma A.3, we consider a ray emanating from the origin in an arbitrary direction, specified by a particular β under the constraint that ‖β‖_1 = 1. In this fixed direction, the rate of decay (or increase) of the tail is determined by the maximum contribution, either 1 or exp{β^T x_i}, from each term of the form 1 + exp{β^T x_i} in the products above. Collecting terms, we find that the rate of decay is governed by the function h(β, r, ε). Whenever the term inside the exponential is negative, we obtain a (decreasing) exponential bound on the tail. If the corresponding expression is negative for every direction, we can construct a uniform bound which satisfies the assumption of the lemma and which, in turn, allows us to conclude that the rth moment of the case-deleted weight function is finite.
The infinite rth moment case involves a positive value of h(β, r, ε) for some direction specified by β. In this event, since h(β, r, ε) is continuous in β, we conclude that there is a neighborhood of directions in which the integral along a ray is infinite. Thus, the integral is infinite, and so is the rth moment of the case-deleted weight function.