An exact predictive recursion for Bayesian nonparametric analysis of incomplete data

Abstract: This paper presents a new derivation of nonparametric distribution estimation with right-censored data. It is based on an extension of predictive inference to compound evidence. The estimate is recursive and exact, and no stochastic approximation is needed: it simply requires that the censored data be processed in decreasing order. Only in this case does the recursion provide exact posterior predictive distributions for subsequent samples under a Dirichlet process prior. The resulting estimate is equivalent to the Susarla-Van Ryzin estimator and to the beta-Stacy process.


Preliminaries on Dirichlet processes with incomplete data
An important extension of Bayesian nonparametric theory is the treatment of incomplete data, which involve indirect observations or latent variables. The first fully Bayesian approach to the problem of dealing with observations censored on the right was made by Susarla and Van Ryzin [15], who used a Dirichlet process [5,2] as a prior for the random distribution F and obtained in closed form the mean of the posterior distribution of F given the data. Blum and Susarla [3] complemented this result by showing that the conditional survival function turns out to be a mixture of Dirichlet processes, as considered by Antoniak [1], with specified transition and mixing measures.
In the class of models under consideration, instead of the "precise, microscopic" values of the random sample t_1, ..., t_n i.i.d. ∼ F, with F the unknown distribution, the evidence is given by less precise "fuzzy, macroscopic" observations. That is, the t's are not directly observed; instead we only know that t_1 ∈ A_1, ..., t_n ∈ A_n, where the A's are subsets of R. In the case of right-censoring, A_i is a singleton if t_i is uncensored, while A_i = (c_i, ∞) is an interval containing t_i > c_i, where c_i is the censoring time.
In the case of "precise" evidence, standard Bayesian calculations are induced by placing a Dirichlet process prior distribution with parameter α on F, denoted D_α. This prior is indexed by a positive finite measure α(·) defined on the range Ω of the t_i. The base distribution F_0(·) = α(·)/α(Ω) is obtained by normalizing the measure α(·), whose total mass α(Ω) measures the strength of the prior belief. The meaning of F_0 is that of initial (predictive) probability for the first observation, that is P(t_1 ∈ B) = F_0(B) for a measurable set B ⊂ Ω. Moreover F_0 can be seen as a prior guess about the unknown distribution F, as F_0(·) = E[F(·)] is the expectation of F with respect to the prior D_α. Conditional on the precise data D_p = {t_i}, the posterior mean F_n(·) = E[F(·) | D_p] of the random probability measure F is the expectation with respect to the posterior D_{α + Σ_{i=1}^n δ_{t_i}}, and its meaning is the posterior predictive distribution P(t_{n+1} ∈ · | D_p) = F_n(·) = (α(·) + Σ_{i=1}^n δ_{t_i}(·)) / (α(Ω) + n).
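As a minimal sketch, the precise-data predictive above can be coded directly. The choices α(Ω) = 4 and an exponential F_0 with rate 0.12 mirror the Susarla-Van Ryzin example discussed later; the function name is ours.

```python
import math

# Posterior predictive CDF P(t_{n+1} <= t | t_1, ..., t_n) under a D_alpha
# prior with alpha(.) = a * F0(.), F0 exponential with rate lam
# (illustrative values, matching the example of Susarla and Van Ryzin).
def predictive_cdf(t, data, a=4.0, lam=0.12):
    F0 = 1.0 - math.exp(-lam * t)           # prior guess F0((0, t])
    emp = sum(1 for x in data if x <= t)    # unit point masses at the data
    return (a * F0 + emp) / (a + len(data))
```

With no data the predictive reduces to F_0; each precise observation adds a unit point mass and shifts the predictive toward the empirical distribution.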
In the case of "compound evidence", that is conditional on the data D_c = {A_i}, the posterior turns out to be a mixture of Dirichlet processes, which is cumbersome [14]. (This type of mixture comes from the incomplete data, and it differs in principle from mixtures with a Dirichlet process mixing distribution, which have been studied only with the advent of Markov chain Monte Carlo (among others [8,9]) and related solutions, such as sequential importance sampling [13].)
In order to escape the complication of mixtures induced by incomplete data, one can use the alternative and powerful approach of Blackwell and MacQueen [2], who showed that the extension of the Polya updating rule to a continuous space, based only on the predictive distribution, is equivalent to Ferguson's setting. In this framework the predictive distributions F_n(·) = P(t_{n+1} ∈ · | D_p) are easy to calculate, and F_n is a simple recursive function of F_{n−1}. Extending this approach to compound evidence, Newton et al. [10,11,12] proposed a class of computationally efficient methods to approximate mixtures of Dirichlet processes, involving a recursive algorithm. The advantage is that the predictive uncertainty can be encoded in a single measure approximating the Bayes estimator, avoiding the high computational cost required both by an exact solution and by Markov chain Monte Carlo calculations. Unfortunately the typical problem of Newton's estimate is that F_n(·) depends on the order in which the data are processed, i.e. the recursive algorithm is order dependent. Hence one can approximate the final result by averaging over (a sample of) all permutations. A recent paper by Martin and Ghosh [7] explores this fast recursive algorithm as a special case of stochastic approximation.
In the present note, in the right censoring case A 1 ⊂ · · · ⊂ A n , we prove that no stochastic approximation is necessary, as there is a sole rational pattern which recursively produces the exact posterior predictive distributions for subsequent samples under a Dirichlet process prior.
By "exact" we mean that our result coincides with the Susarla-Van Ryzin estimator under squared error loss.

On n = 1
On the basis of a single observation, the Dirichlet process implies that the (exact) updating rule for the measure is linear:

    α_1(B) = α(B) + δ_{t_1}(B),    (1)

and the associated predictive probability is

    P(t_2 ∈ B | t_1) = F_1(B) = (α(B) + δ_{t_1}(B)) / (α(Ω) + 1).    (2)

The extension to missing data, posing α_1(B) := α(B | t_1 ∈ A_1), is given by

    α_1(B) = E[ α(B | t_1) | t_1 ∈ A_1 ],    (3)

where α_1(B) is determined as the expectation of α(B | t_1) weighted by P(t_1 | A_1), see Appendix A; explicitly,

    α_1(B) = α(B) + α(B ∩ A_1) / α(A_1).    (4)

(We reserve the subscript n for measures and probabilities conditioned on the observables A_1, ..., A_n.) The associated probability is

    P(t_2 ∈ B | t_1 ∈ A_1) = F_1(B) = α_1(B) / (α(Ω) + 1).    (5)

This result was first derived in [1], where missing data are analyzed in the framework of mixtures of Dirichlet processes. It is interesting that Bayes' rule enters the second term of (4) through probability calculus. The corresponding posterior calculations are based on the initial predictive F_0 for t_1.
The usual Polya scheme (2) is contained in (5) if A_1 reduces to the singleton {t_1}. If the information is complete, the update amounts to the addition of a unit mass concentrated at the observed value, which coincides with the discreteness property of F. If the information is not complete, in general one needs to add an entire probability distribution to the urn, whose support is the observed A_1. In this case F is modified proportionally to its value.
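For a right-censored observation A_1 = (c, ∞), the update (4) acts on the cumulative weight function G(t) = α((0, t]): below c nothing changes, while above c the measure grows proportionally to its own tail mass. A minimal sketch (the helper name and the exponential prior guess with α(R+) = 4 and rate 0.12, taken from the example below, are illustrative):

```python
import math

# Update of the cumulative measure G(t) = alpha((0, t]) after one observation
# censored on the right at c, i.e. A_1 = (c, oo): below c the measure is
# unchanged; above c it grows proportionally to its own tail (eq. (4)).
def update_censored(G, c):
    tail = G(math.inf) - G(c)        # alpha((c, oo)), the tail mass
    def G_new(t):
        if t <= c:
            return G(t)
        return G(t) + (G(t) - G(c)) / tail
    return G_new

# Illustrative prior: alpha = a * F0 with F0 exponential of rate lam
a, lam = 4.0, 0.12
G0 = lambda t: a * (1.0 - math.exp(-lam * t))
G1 = update_censored(G0, 1.0)        # one observation censored at c = 1.0
```

The update adds exactly one unit of total mass, all of it spread over the tail (c, ∞) in proportion to the prior measure there.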

Newton's approximate recursive algorithm
After precise evidence t_1, ..., t_n the updating rule for the measure is

    α_n(B) = α(B) + Σ_{i=1}^n δ_{t_i}(B),    (6)

which can be easily transformed into a recursive algorithm:

    α_n(B) = α_{n−1}(B) + δ_{t_n}(B).    (7)

Does the idea that the predictive uncertainty can be encoded in a single measure, and that learning occurs by adding measure, hold also for missing data? This would lead us to the following recursive generalization of (4),

    α_n(B) = α_{n−1}(B) + α_{n−1}(B ∩ A_n) / α_{n−1}(A_n),    (8)

which approximates the uncertainty in F with a single Dirichlet process centered at F_{n−1}, on the basis of the well-known Polya sequence characterization. Again Bayes' rule enters in the second term, taking F_{n−1} = α_{n−1}(·)/(α(Ω) + n − 1) as the updated prior distribution for t_n. As constructed, F_n(B) is not the exact posterior predictive distribution for t_{n+1} in general, except for the case n = 1. It is exact only in the case of no information loss, that is when A_i = {t_i}, where (8) and (6) coincide. In fact the value depends on the order in which A_1, ..., A_n are processed.
The main hindrance of this recursive method is the dependence of the result on the order of the data. The algorithm suggested in [10,11,12] is to process the data in some order through (8), to re-calculate the result over permutations of the data, and to average the results. Does a privileged order, providing an exact result, exist?

An exact recursive algorithm
If no information is lost, the updating of F_0(·) and α(·) after n data is given as follows, since the added measure is a point mass at t_i for each datum, unconditioned on any other quantity:

    F_n(B) = ( α(B) + Σ_{i=1}^n δ_{t_i}(B) ) / (α(Ω) + n),    (9)

    α_n(B) = α(B) + Σ_{i=1}^n δ_{t_i}(B).    (10)

In fact δ_{t_i}(·) is a point-mass distribution concentrated at t_i.
Otherwise, if there is information loss, that is D_c = (t_1 ∈ A_1) ∩ · · · ∩ (t_n ∈ A_n), A_i ⊂ Ω, the exact generalization of (4) is, with α_n(B) := α(B | D_c) = α(B | A_1, ..., A_n):

    α_n(B) = α(B) + Σ_{(t_1,...,t_n)} ( Σ_{i=1}^n δ_{t_i}(B) ) P(t_1, ..., t_n | A_1, ..., A_n),    (11)

where P is based on the initial probability α(·)/α(Ω). Equation (11) holds for all cases of predictive inference based on uncertain or indirect evidence. If Ω is finite the predictive probability can be computed exactly, but in general an exact recursive method does not exist [10]. The computation is easy to perform, although its length increases rapidly as a function of #Ω. One must compute the probability of all unobserved patterns (t_1, ..., t_n), i.e. the elementary events compatible with the evidence, by means of the initial measure α(·), and then normalize to the evidence, obtaining the joint conditional distribution P(t_1, ..., t_n | A_1, ..., A_n) = P(t_1, ..., t_n) / P(A_1, ..., A_n) that rules (11). Hence the empirical weights result as an average over the probabilities of the patterns. If Ω is continuous this procedure is impossible to carry out, as in (11) the sums are replaced by integrals.
Fortunately the case of right-censoring is peculiar, because the compound events {A_i} are all of the same kind: they are tail events. We can show that (11) can be represented exactly in the form (8), if the data are reordered and processed so that A_1 ⊂ · · · ⊂ A_n, that is A_i = (c_i, ∞) with c_1 ≥ · · · ≥ c_n. Recall that we are speaking about the order in which the data are processed, not the order in which they are collected, which is inessential due to the exchangeability of the process.
To this end we re-write (10) in the following form, interchanging E and Σ:

    α_n(B) = α(B) + Σ_{i=1}^n P(t_i ∈ B | D_c),    (12)

where Σ_{i=1}^n P(t_i ∈ B | D_c) is the expected number of unobserved events that fall in B, conditioned on D_c.

On n = 2
Given two compound events, D_c = {t_1 ∈ A_1, t_2 ∈ A_2} with A_1 ⊂ A_2, that is c_1 ≥ c_2, (12) implies:

    α_2(B) = α(B) + P(t_1 ∈ B | D_c) + P(t_2 ∈ B | D_c).    (13)

Let

    P(t_1 ∈ B | D_c) = P(t_1 ∈ B | A_1) = α(B ∩ A_1) / α(A_1),
    P(t_2 ∈ B | D_c) = α_1(B ∩ A_2) / α_1(A_2),    (14)

since A_2 ⊃ A_1 carries no further information on t_1, while t_2 is ruled by the predictive measure α_1 restricted to A_2. Hence, substituting (14) in (13), we have

    α_2(B) = α_1(B) + α_1(B ∩ A_2) / α_1(A_2),    (15)

that is exactly (8) for n = 2.
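The n = 2 identity can be checked numerically on a finite Ω, where the exact mixture posterior is computable from the joint Polya-urn distribution conditioned on the compound evidence. The sketch below (with arbitrary illustrative weights) verifies that the recursion processing A_1 ⊂ A_2 first reproduces the exact α_2, while the reversed order does not:

```python
# Finite-Omega check of the n = 2 identity: the exact posterior measure,
# computed from the joint Polya-urn distribution conditioned on the compound
# evidence, coincides with recursion (8) when A_1 ⊂ A_2 is processed first.
alpha = [0.5, 0.8, 0.3, 0.9, 0.6, 0.4]   # illustrative base measure on {0,...,5}
A = sum(alpha)
A1, A2 = {3, 4, 5}, {2, 3, 4, 5}         # nested tails: c_1 >= c_2

def joint(x, y):
    # P(t1 = x, t2 = y) under the Polya sequence with base measure alpha
    return alpha[x] / A * (alpha[y] + (x == y)) / (A + 1.0)

# exact route: condition the joint on t1 in A1, t2 in A2, then apply (12)
Z = sum(joint(x, y) for x in A1 for y in A2)
exact = [alpha[i]
         + (i in A1) * sum(joint(i, y) for y in A2) / Z
         + (i in A2) * sum(joint(x, i) for x in A1) / Z
         for i in range(len(alpha))]

# recursive route: eq. (8), one compound datum at a time
def censor_update(w, Ac):
    tail = sum(w[j] for j in Ac)
    return [w[i] + (w[i] / tail if i in Ac else 0.0) for i in range(len(w))]

rec = censor_update(censor_update(alpha, A1), A2)   # decreasing order: exact
rev = censor_update(censor_update(alpha, A2), A1)   # increasing order: not exact
```

Both routes add total mass 2 to α(Ω); only the decreasing-order recursion matches the exact mixture computation pointwise.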
If A_{n+1} contains all previous evidence (that is, c_{n+1} precedes all other censoring times), then P(t_i ∈ B | D_n, A_{n+1}) = P(t_i ∈ B | D_n) for i = 1, ..., n, so the first n terms give just α_n(B). The last term is

    P(t_{n+1} ∈ B | D_n, A_{n+1}) = P(t_{n+1} ∈ B ∩ A_{n+1} | D_n) / P(t_{n+1} ∈ A_{n+1} | D_n) = α_n(B ∩ A_{n+1}) / α_n(A_{n+1}),

so that the exact update is again of the form (8). The solution of the recursion, that is the final predictive probability, is equivalent to the Susarla-Van Ryzin estimator [15]. Moreover it appears as a predictive generalization of the alternative to Efron's algorithm proposed by Dinse [4].

Example
We briefly illustrate the algorithm on the dataset used by Susarla and Van Ryzin [15], which is none other than the "classical" dataset of the original paper by Kaplan and Meier [6], re-used in many subsequent works (see e.g. Walker and Muliere [16,17], whose approach, based on the beta-Stacy process and right-neutrality, is predictive but not recursive). The data are 0.8, 1.0+, 2.7+, 3.1, 5.4, 7.0+, 9.2, 12.1+, where a "+" denotes a censored observation. Our illustration is intended as a numerical comparison of our algorithm with the exact results obtained by Susarla and Van Ryzin. They took F_0 to be the exponential distribution with mean 1/0.12, and for α(R+) they considered three cases, α(R+) = 4, 8 and 16. They obtained the mean of the posterior distribution of F, under squared error loss, as t ranges over the eight censored and uncensored points. For the sake of maximizing discrepancies we make our comparison only for the case α(R+) = 4.
The recursion proceeds as follows: first of all the four precise data are processed, the order being inessential. Introducing the cumulative weight function G(t) = α((0, t]), from (7) one gets, for n = 1, ..., 4,

    G_n(t) = G_{n−1}(t) + Θ(t − t_n),

where Θ is Heaviside's step function. The survival function after the precise data is given by S_4(t) = 1 − G_4(t)/G_4(∞), see Fig. 1. Restarting from G_4(t), the recursive equation for G_n(t), now processing the censored data c_n, is:

    G_n(t) = G_{n−1}(t) + Θ(t − c_n) (G_{n−1}(t) − G_{n−1}(c_n)) / (G_{n−1}(∞) − G_{n−1}(c_n)).

Fig. 1. The thin gray graph is S_4(t), conditioned on the deaths only; the two in black are S_8(t), conditioned on the losses too. The thicker curve, obtained by processing the censored data in decreasing order, is the correct one; the other curve is obtained by processing them in increasing order.
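The two recursions above can be sketched end-to-end on this dataset (closures represent G_n; the helper names are ours), comparing the decreasing and increasing processing orders of the censored data:

```python
import math

# Survival curve on the Kaplan-Meier dataset via the recursion: the four
# precise data first (their order is inessential), then the four censored
# data, in decreasing (exact) or increasing (permuted) order of c_i.
a, lam = 4.0, 0.12
deaths = [0.8, 3.1, 5.4, 9.2]
losses = [1.0, 2.7, 7.0, 12.1]              # censoring times c_i

def step(G, x):
    # precise update, eq. (7): add a unit point mass at the death time x
    return lambda t: G(t) + (1.0 if t >= x else 0.0)

def censor(G, c):
    # censored update, eq. (8) for the tail event A = (c, oo)
    Gc, tail = G(c), G(math.inf) - G(c)
    return lambda t: G(t) + max(0.0, G(t) - Gc) / tail

def survival(censor_order):
    G = lambda t: a * (1.0 - math.exp(-lam * t))   # G_0 from the prior guess
    for x in deaths:
        G = step(G, x)
    for c in censor_order:
        G = censor(G, c)
    total = G(math.inf)
    return lambda t: 1.0 - G(t) / total            # S_8(t) = 1 - G_8(t)/G_8(oo)

S_dec = survival(sorted(losses, reverse=True))     # correct: decreasing order
S_inc = survival(sorted(losses))                   # permuted: increasing order
```

Below the smallest censoring time the two curves coincide; above it they separate, with the decreasing-order curve dominating, as in Fig. 1.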

Table 1
Maximum differences between the correct survival function and the permuted ones: the value and the abscissa of the maximum difference.

The result of our recursive algorithm is exactly the same only if the censored data are processed in decreasing order. Any permutation of this privileged order produces appreciable deviations, and the maximum deviation is obtained with the reversed order (Fig. 1). These deviations increase in the limit α(R+) → 0, where one obtains the Kaplan-Meier estimate in the empirical range. All "permuted" survival functions are dominated by the correct one, so that any average over them produces a systematic error in the final estimate. Table 1 reports the value and the abscissa of the maximum difference between the correct survival function and each "permuted" one. This deviation roughly increases with the lag of the last observed value from the first place and, conditioned on it, with the lag of the second-last one, and so on.
To magnify the effect of permutation on the recursion, we consider the simplified case of Fig. 2, with the same prior guess F_0, now with α(R+) = 0.4, and only two censored data, 7.0+ and 12.1+. In the first two graphs the dotted line denotes the prior, the thin black line the first updated survival function, and the thick one the second (final) one. At each newly processed datum a cusp appears: the first update contains one cusp, the final one two. The first graph shows S obtained in the correct order, that is {12.1+, 7.0+}; the second shows S* in the reversed order. The third graph magnifies the difference between the two final survival functions: it is null on the domain [0, 7.0] and maximal at the end of the data domain.