Conjugacy properties of time-evolving Dirichlet and gamma random measures

We extend classic characterisations of posterior distributions under Dirichlet process and gamma random measures priors to a dynamic framework. We consider the problem of learning, from indirect observations, two families of time-dependent processes of interest in Bayesian nonparametrics: the first is a dependent Dirichlet process driven by a Fleming-Viot model, and the data are random samples from the process state at discrete times; the second is a collection of dependent gamma random measures driven by a Dawson-Watanabe model, and the data are collected according to a Poisson point process with intensity given by the process state at discrete times. Both driving processes are diffusions taking values in the space of discrete measures whose support varies with time, and are stationary and reversible with respect to Dirichlet and gamma priors respectively. A common methodology is developed to obtain in closed form the time-marginal posteriors given past and present data. These are shown to belong to classes of finite mixtures of Dirichlet processes and gamma random measures for the two models respectively, yielding conjugacy of these classes to the type of data we consider. We provide explicit results on the parameters of the mixture components and on the mixing weights, which are time-varying and drive the mixtures towards the respective priors in absence of further data. Explicit algorithms are provided to recursively compute the parameters of the mixtures. Our results are based on the projective properties of the signals and on certain duality properties of their projections.

In the context of this article, the most relevant strand of this literature attempts to build time evolution into standard random measures for semiparametric time-series analysis, combining the merits of flexible exchangeable modelling afforded by random measures with those of mainstream generalised linear and time series modelling. For the case of Dirichlet processes, the reference model in Bayesian nonparametrics introduced by Ferguson (1973), the time evolution has often been built into the process by exploiting its celebrated stick-breaking representation (Sethuraman, 1994). For example, Dunson (2006) models the dependent process as an autoregression with Dirichlet distributed innovations; Caron et al. (2008) model the noise in a dynamic linear model with a Dirichlet process mixture; Caron et al. (2007) develop a time-varying Dirichlet mixture with reweighting and movement of atoms in the stick-breaking representation; and Rodriguez and ter Horst (2008) induce the dependence in time only via the atoms in the stick-breaking representation, by making them into a heteroskedastic random walk. See also Caron and Teh (2012); Caron et al. (2016); Griffin and Steel (2006); Gutierrez et al. (2016); Mena and Ruggiero (2016). The stick-breaking representation of the Dirichlet process has demonstrated its versatility for constructing dependent processes, but makes it hard to derive any analytical information on the posterior structure of the quantities involved. Parallel to these developments, random measures have been combined with hidden Markov time series models, either to allow the size of the latent space to evolve in time using transitions based on a hierarchy of Dirichlet processes, e.g. Beal et al. (2002); Van Gael et al. (2008); Stepleton et al. (2009); Zhang et al. (2014), or to build flexible emission distributions that link the latent states to data, e.g. Yau et al. (2011); Gassiat and Rousseau (2016).
From a probabilistic perspective, there is a fairly canonical way to build stationary processes with marginal distributions specified as random measures using stochastic differential equations. This more principled approach to building time series with given marginals has been well explored, both probabilistically and statistically, for finite-dimensional marginal distributions, either using processes with discontinuous sample paths, as in Barndorff-Nielsen and Shephard (2001) or Griffin (2011), or using diffusions, as we undertake here. The relevance of measure-valued diffusions in Bayesian nonparametrics has been pioneered in Walker et al. (2007), whose construction naturally allows for separate control of the marginal distributions and the memory of the process.
The statistical models we investigate in this article, introduced in Section 2, can be seen as instances of what we call hidden Markov measures, since the models are formulated as hidden Markov models where the latent, unobserved signal is a measure-valued infinite-dimensional Markov process. The signal in the first model is the Fleming-Viot (FV) process, denoted $\{X_t, t \ge 0\}$ on some state space $\mathcal{Y}$ (also called type space in population genetics), which admits the law of a Dirichlet process on $\mathcal{Y}$ as marginal distribution. At times $t_n$, conditionally on $X_{t_n} = x$, observations are drawn independently from $x$, i.e.,

$$ Y^i_{t_n} \mid x \overset{iid}{\sim} x, \qquad i = 1, \ldots, m_{t_n}. \tag{1.1} $$

Hence, this statistical model is a dynamic extension of the classic Bayesian nonparametric model for unknown distributions of Ferguson (1973) and Antoniak (1974). The signal in the second model is the Dawson-Watanabe (DW) process, denoted $\{Z_t, t \ge 0\}$ and also defined on $\mathcal{Y}$, which admits the law of a gamma random measure as marginal distribution. At times $t_n$, conditionally on $Z_{t_n} = z$, the observations are a Poisson process $Y_{t_n}$ on $\mathcal{Y}$ with random intensity $z$, i.e., for any collection of disjoint measurable sets $A_1, \ldots, A_K \subset \mathcal{Y}$ and $K \in \mathbb{N}$,

$$ Y_{t_n}(A_i) \mid z \overset{ind}{\sim} \mathrm{Po}(z(A_i)), \qquad i = 1, \ldots, K. $$

Hence, this is a time-evolving Cox process and can be seen as a dynamic extension of the classic Bayesian nonparametric model for Poisson point processes of Lo (1982). The Dirichlet and the gamma random measures, used as Bayesian nonparametric priors, have conjugacy properties to observation models of the type described above, which have been exploited both for developing theory and for building simulation algorithms for posterior and predictive inference. These properties, reviewed in Sections 2.1.1 and 2.2.1, have propelled the use of these models into mainstream statistics, and have been used directly in simpler models or to build updating equations within Markov chain Monte Carlo and variational Bayes computational algorithms in hierarchical models.
In this article, for the first time, we show that the dynamic versions of these Dirichlet and gamma models also enjoy certain conjugacy properties. First, we formulate such models as hidden Markov models where the latent signal is a measure-valued diffusion and the observations arise at discrete times according to the mechanisms described above. We then obtain that the filtering distributions, that is the laws of the signal at each observation time conditionally on all data up to that time, are finite mixtures of Dirichlet and gamma random measures respectively. We provide a concrete posterior characterisation of these marginal distributions and explicit algorithms for the recursive computation of the parameters of these mixtures. Our results show that these families of finite mixtures are closed with respect to the Bayesian learning in this dynamic framework, and thus provide an extension of the classic posterior characterisations of Antoniak (1974) and Lo (1982) to time-evolving settings.
The techniques we use to establish the new conjugacy results are detailed in Section 4, and build upon three aspects: the characterisations of Dirichlet and gamma random measures through their projections; certain results on measure-valued diffusions related to their time-reversal; and some very recent developments in Papaspiliopoulos and Ruggiero (2014) that relate optimal filtering for finite-dimensional hidden Markov models with the notion of duality for Markov processes, reviewed in Section 4.1. Figure 1 schematises, from a high level perspective, the strategy for obtaining our results. In a nutshell, the essence of our theoretical results is that the operations of projection and propagation of measures commute. More specifically, we first exploit the characterisation of the Dirichlet and gamma random measures via their finite-dimensional distributions, which are Dirichlet and independent gamma distributions respectively.

Figure 1: Strategy for obtaining the results for FV and DW signals, proved in Theorems 3.1 and 3.2. In this figure, $X_t$ is the latent measure-valued signal. Given data $Y_{1:n}$, the future distribution of the signal $\mathcal{L}(X_{t_{n+k}} \mid Y_{1:n})$ at time $t_{n+k}$ is determined by taking its finite-dimensional projection $\mathcal{L}(X_{t_n}(A_1, \ldots, A_K) \mid Y_{1:n})$ onto an arbitrary partition $(A_1, \ldots, A_K)$, evaluating the relative propagation $\mathcal{L}(X_{t_{n+k}}(A_1, \ldots, A_K) \mid Y_{1:n})$ at time $t_{n+k}$, and by exploiting the projective characterisation of the filtering distributions.
Then we exploit the fact that the dynamics of these finite-dimensional distributions induced by the measure-valued signals are the Wright-Fisher (WF) diffusion and a multivariate Cox-Ingersoll-Ross (CIR) diffusion. Next, we extend the results in Papaspiliopoulos and Ruggiero (2014) to show that filtering these finite-dimensional signals on the basis of observations generated as described above results in mixtures of Dirichlet and independent gamma distributions. Finally, we use again the characterisations of Dirichlet and gamma measures via their finite-dimensional distributions to obtain the main results in this paper: under the observation models considered, the filtering distributions in the Fleming-Viot model evolve in the family of finite mixtures of Dirichlet processes, and those in the Dawson-Watanabe model evolve in the family of finite mixtures of gamma random measures. The validity of this argument is formally proved in Theorems 3.1 and 3.2. The resulting recursive procedures for Fleming-Viot and Dawson-Watanabe signals, which describe how to compute the parameters of the mixtures at each observation time, are given in Propositions 3.1 and 3.2, and the associated pseudo codes are outlined in Algorithms 1 and 2.
The paper is organised as follows. Section 1.2 briefly introduces some basic concepts on hidden Markov models. Section 1.3 provides a simple illustration of the underlying structures implied by previous results on filtering one-dimensional WF and CIR processes. These will be the reference examples throughout the paper and provide relevant intuition on our main results in terms of special cases, since the WF and CIR model are the one-dimensional projections of the infinite-dimensional families we consider here. Section 2 describes the two families of dependent random measures which are the object of this contribution, the Fleming-Viot and the Dawson-Watanabe diffusions, from a non technical viewpoint. Connections of the dynamic models with their marginal or static sub-cases given by Dirichlet and gamma random measures, well known in Bayesian nonparametrics, are emphasised. Section 3 exposes and discusses the main results on the conjugacy properties of the two above families, given observation models as described earlier, together with the implied algorithms for recursive computation. All the technical details related to the strategy for proving the main results and to the duality structures associated to the signals are deferred to Section 4.

Hidden Markov models
Since our time-dependent Bayesian nonparametric models are formulated as hidden Markov models, we introduce here some basic related notions. A hidden Markov model (HMM) is a double sequence $\{(X_{t_n}, Y_n), n \ge 0\}$ where $X_{t_n}$ is an unobserved Markov chain, called latent signal, and $Y_n := Y_{t_n}$ are conditionally independent observations given the signal. Figure 2 provides a graphical representation of an HMM. We will assume here that the signal is the discrete time sampling of a continuous time Markov process $X_t$ with transition kernel $P_t(x, dx')$. The signal parametrises the law of the observations $\mathcal{L}(Y_n \mid X_{t_n})$, called emission distribution. When this law admits a density, this will be denoted by $f_x(y)$.
Filtering optimally an HMM requires the sequential exact evaluation of the so-called filtering distributions $\mathcal{L}(X_{t_n} \mid Y_{0:n})$, i.e., the laws of the signal at different times given past and present observations, where $Y_{0:n} = (Y_0, \ldots, Y_n)$. Denote $\nu_n := \mathcal{L}(X_{t_n} \mid Y_{0:n})$ and let $\nu$ be the prior distribution for $X_{t_0}$. The exact or optimal filter is the solution of the recursion

$$ \nu_0 = \phi_{Y_{t_0}}(\nu), \qquad \nu_n = \phi_{Y_{t_n}}\big(\psi_{t_n - t_{n-1}}(\nu_{n-1})\big), \quad n \in \mathbb{N}. \tag{1.2} $$

This involves the following two operators acting on measures: the update operator, which in case a density exists takes the form

$$ \phi_y(\nu)(dx) = \frac{f_x(y)\,\nu(dx)}{p_\nu(y)}, \qquad p_\nu(y) = \int f_x(y)\,\nu(dx), \tag{1.3} $$

and the prediction operator

$$ \psi_t(\nu)(dx') = \nu P_t(dx') = \int \nu(dx)\, P_t(x, dx'). \tag{1.4} $$

The update operation amounts to an application of Bayes' Theorem to the currently available distribution conditional on the incoming data. The prediction operator propagates the current law of the signal forward over a time interval of length $t$, according to the transition kernel of the underlying continuous-time latent process. The above recursion (1.2) then alternates updates given the incoming data and predictions of the latent signal as follows:

$$ \mathcal{L}(X_{t_0}) \xrightarrow{\text{update}} \mathcal{L}(X_{t_0} \mid Y_0) \xrightarrow{\text{prediction}} \mathcal{L}(X_{t_1} \mid Y_0) \xrightarrow{\text{update}} \mathcal{L}(X_{t_1} \mid Y_{0:1}) \xrightarrow{\text{prediction}} \cdots $$

If $X_{t_0}$ has prior $\nu = \mathcal{L}(X_{t_0})$, then $\nu_0 = \mathcal{L}(X_{t_0} \mid Y_0)$ is the posterior conditional on the data observed at time $t_0$; $\nu_1$ is the law of the signal at time $t_1$ obtained by propagating $\nu_0$ over an interval of length $t_1 - t_0$ and conditioning on the data $Y_0, Y_1$ observed at times $t_0$ and $t_1$; and so on.
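The alternation of update and prediction in recursion (1.2) can be sketched on a finite state space. This is only a stand-in for illustration: the signals in this paper are measure-valued diffusions, whereas here the law of the signal is a probability vector, the prediction operator is multiplication by a transition matrix, and the update operator reweights by the emission likelihood. The function names, the two-state matrix `P` and the Bernoulli emissions are hypothetical choices made for this example.

```python
def update(nu, likelihood):
    """Update operator: Bayes' theorem on a finite state space.
    nu[i] is the current probability of state i, likelihood[i] is f_x(y) at state i."""
    unnorm = [p * l for p, l in zip(nu, likelihood)]
    total = sum(unnorm)  # marginal likelihood p_nu(y)
    return [u / total for u in unnorm]


def predict(nu, P):
    """Prediction operator: propagate nu through the transition matrix P (rows sum to 1)."""
    n = len(nu)
    return [sum(nu[i] * P[i][j] for i in range(n)) for j in range(n)]


def filter_recursion(prior, P, likelihoods):
    """Alternate update and prediction; returns the filtering laws nu_0, nu_1, ..."""
    nu = update(prior, likelihoods[0])
    out = [nu]
    for lik in likelihoods[1:]:
        nu = update(predict(nu, P), lik)
        out.append(nu)
    return out


# Hypothetical example: two states with Bernoulli success probabilities 0.2 and 0.8.
P = [[0.9, 0.1], [0.1, 0.9]]
prior = [0.5, 0.5]
obs = [1, 1, 0]
liks = [[0.2, 0.8] if y == 1 else [0.8, 0.2] for y in obs]
filters = filter_recursion(prior, P, liks)
```

Each element of `filters` is the law of the signal at one observation time given all data up to that time, which mirrors the role of $\nu_n$ above.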

Illustration for CIR and WF signals
In order to appreciate the ideas behind the main theoretical results and the Algorithms we develop in this article, we provide some intuition on the corresponding results for one-dimensional hidden Markov models based on Cox-Ingersoll-Ross (CIR) and Wright-Fisher (WF) signals. These are the one-dimensional projections of the DW and FV processes respectively, so informally we could say that a CIR process stands to a DW process as a gamma distribution stands to a gamma random measure, and a one-dimensional WF stands to a FV process as a Beta distribution stands to a Dirichlet process. The results illustrated in this section follow from Papaspiliopoulos and Ruggiero (2014) and are based on the interplay between computable filtering and duality of Markov processes, summarised later in Section 4.1. The developments in this article rely on these results, which are extended to the infinite-dimensional case. Here we highlight the mechanisms underlying the explicit filters with the aid of figures, and postpone the mathematical details to Section 4. First, let the signal be a one-dimensional Wright-Fisher diffusion on $[0,1]$, with stationary distribution $\pi = \mathrm{Beta}(\alpha, \beta)$ (see Section 2.1.2), which is also taken as the prior $\nu$ for the signal at time 0. The signal can be interpreted as the evolving frequency of type-1 individuals in a population of two types whose individuals generate offspring of the same type as the parent, which may be subject to mutation. The observations are assumed to be Bernoulli with success probability given by the signal state. Upon observation of $y_{t_0} = (y_{t_0,1}, \ldots, y_{t_0,m})$, assuming it gives $m_1$ type-1 and $m_2$ type-2 individuals with $m = m_1 + m_2$, the prior $\nu = \pi$ is updated as usual via Bayes' theorem to $\nu_0 = \phi_{y_{t_0}}(\nu) = \mathrm{Beta}(\alpha + m_1, \beta + m_2)$. Here $\phi_y$ is the update operator (1.3).
A forward propagation of this distribution over a time interval $t$ by means of the prediction operator (1.4) yields the finite mixture of Beta distributions

$$ \psi_t(\nu_0) = \sum_{(0,0) \le (i,j) \le (m_1,m_2)} p_{(m_1,m_2),(i,j)}(t)\, \mathrm{Beta}(\alpha + i, \beta + j), $$

whose mixing weights depend on $t$ (see Lemma 4.1 below for their precise definition). The propagation of $\mathrm{Beta}(\alpha + m_1, \beta + m_2)$ to time $t_0 + t$ thus yields a mixture of Beta distributions with $(m_1 + 1)(m_2 + 1)$ components. The Beta parameters range from $i = m_1$, $j = m_2$, which represent the full information provided by the collected data, to $i = j = 0$, which represent the null information on the data so that the associated component coincides with the prior. It is useful to identify the indices of the mixture with the nodes of a graph, as in Figure 3-(b), where the red node represents the component with full information, and the yellow nodes the other components, including the prior identified by the origin. The time-varying mixing weights are the transition probabilities of an associated (dual) 2-dimensional death process, which can be thought of as jumping to lower nodes in the graph of Figure 3-(b) at a specified rate in continuous time. The effect of these weights on the mixture is that, as time increases, the probability mass is shifted from components with parameters close to the full information $(\alpha + m_1, \beta + m_2)$ to components which bear little or no information on the data. The mass shift reflects the progressive obsolescence of the data collected at $t_0$, as evaluated by the signal law at time $t_0 + t$ as $t$ increases, and in the absence of further data the mixture converges to the prior/stationary distribution $\pi$.
Note that it is not obvious that (1.4) yields a finite mixture when P t is the transition operator of a WF process, since P t has an infinite series expansion (see Section 2.1.2). This has been proved rather directly in Chaleyat-Maurel and Genon-Catalot (2009) or by combining results on optimal filtering with some duality properties of this model (see Papaspiliopoulos and Ruggiero (2014) or Section 4 here).
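To make the structure of the propagated Beta mixture concrete, the following sketch enumerates its components and evaluates its mean for given mixing weights. The weights themselves are the transition probabilities of the dual death process defined in Lemma 4.1, which is not reproduced here, so they are taken as an input; all function names are hypothetical.

```python
def beta_update(alpha, beta, observations):
    """Conjugate update of a Beta(alpha, beta) law given Bernoulli data
    (1 = type-1 individual, 0 = type-2 individual)."""
    m1 = sum(observations)
    m2 = len(observations) - m1
    return alpha + m1, beta + m2, (m1, m2)


def mixture_components(alpha, beta, m1, m2):
    """One component Beta(alpha + i, beta + j) per node (i, j) with
    0 <= i <= m1 and 0 <= j <= m2, i.e. (m1 + 1)(m2 + 1) components."""
    return {(i, j): (alpha + i, beta + j)
            for i in range(m1 + 1) for j in range(m2 + 1)}


def mixture_mean(weights, components):
    """Mean of a finite Beta mixture; weights maps node -> probability (assumed given)."""
    return sum(w * components[node][0] / sum(components[node])
               for node, w in weights.items())


a0, b0, (m1, m2) = beta_update(1.0, 2.0, [1, 1, 0])
components = mixture_components(1.0, 2.0, m1, m2)
```

Placing all the weight on the node `(m1, m2)` gives the mean of the updated law, while placing it on `(0, 0)` gives the prior mean, which are the two extremes between which the death process moves the mass.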
Consider now the model where the signal is a one-dimensional CIR diffusion on $\mathbb{R}_+$, with gamma stationary distribution (and prior at $t_0 = 0$) given by $\pi = \mathrm{Ga}(\alpha, \beta)$ (see Section 2.2.2). The observations are Poisson with intensity given by the current state of the signal. If the first data are collected at time $t_1 > t_0$, the forward propagation of the signal distribution to time $t_1$ yields the same distribution by stationarity. Upon observation at time $t_1$ of $m \ge 1$ Poisson data points with total count $y$, the prior $\nu = \pi$ is updated via Bayes' theorem to

$$ \nu_0 = \phi_y(\nu) = \mathrm{Ga}(\alpha + y, \beta + m), \tag{1.5} $$

yielding a jump in the measure-valued process; see Figure 4-(a). A forward propagation of $\nu_0$ yields the finite mixture of gamma distributions

$$ \psi_t(\nu_0) = \sum_{i=0}^{y} p_{y,i}(t)\, \mathrm{Ga}(\alpha + i, \beta + S_t), \tag{1.6} $$

whose mixing weights also depend on $t$ (see Lemma 4.2 below for their precise definition). At time $t_1 + t$, the filtering distribution is a mixture with $y + 1$ components, with the first gamma parameter ranging from full ($i = y$) to null ($i = 0$) information with respect to the collected data (Figure 4-(b)). The time-dependent mixture weights are the transition probabilities of a certain associated (dual) one-dimensional death process, which can be thought of as jumping to lower nodes in the graph of Figure 3-(a) at a specified rate in continuous time.
Similarly to the WF model, the mixing weights shift mass from components whose first parameter is close to the full information, i.e. $(\alpha + y, \beta + S_t)$, to components which bear little or no information, i.e. $(\alpha, \beta + S_t)$. The time evolution of the mixing weights is depicted in Figure 5, where the cyan and blue lines are the weights of the components with full and no information on the data respectively. As a result of the impact of these weights on the mixture, the latter converges, in the absence of further data, to the prior/stationary distribution $\pi$ as $t$ increases, as shown in Figure 4-(c). Unlike the WF case, in this model there is a second parameter controlled by a deterministic (dual) process $S_t$ on $\mathbb{R}_+$ which subordinates the transitions of the death process; see Lemma 4.2. Roughly speaking, the death process on the graph controls the obsolescence of the observation counts $y$, whereas the deterministic process $S_t$ controls that of the sample size $m$. At the update time $t_1$ we have $S_0 = m$ as in (1.5), but $S_t$ is a deterministic, continuous and decreasing process, and in the absence of further data $S_t$ converges to 0 as $t \to \infty$, restoring the prior parameter $\beta$ in the limit of (1.6). See Lemma 4.2 in the Appendix for the formal result for the one-dimensional CIR diffusion. When more data samples are collected at different times, the update and propagation operations are alternated, resulting in jump processes for both the filtering distribution and the deterministic dual $S_t$ (Figure 6).

Preliminary notation
Although most of the notation is better introduced in the appropriate places, we collect here that which is used uniformly over the paper, to avoid recalling these objects several times throughout the text. In all subsequent sections, $\mathcal{Y}$ will denote a locally compact Polish space which represents the observations space, $\mathcal{M}(\mathcal{Y})$ is the associated space of finite Borel measures on $\mathcal{Y}$ and $\mathcal{M}_1(\mathcal{Y})$ its subspace of probability measures. A typical element $\alpha \in \mathcal{M}(\mathcal{Y})$ will be such that

$$ \alpha = \theta P_0, \qquad \theta > 0, \quad P_0 \in \mathcal{M}_1(\mathcal{Y}), \tag{1.7} $$

where $\theta = \alpha(\mathcal{Y})$ is the total mass of $\alpha$, and $P_0$ is sometimes called centering or baseline distribution. We will assume here that $P_0$ has no atoms. Furthermore, for $\alpha$ as above, $\Pi_\alpha$ will denote the law on $\mathcal{M}_1(\mathcal{Y})$ of a Dirichlet process, and $\Gamma^\beta_\alpha$ that on $\mathcal{M}(\mathcal{Y})$ of a gamma random measure, with $\beta > 0$. These will be recalled formally in Sections 2.1.1 and 2.2.1. We will denote by $X_t$ the Fleming-Viot process and by $Z_t$ the Dawson-Watanabe process, to be interpreted as $\{X_t, t \ge 0\}$ and $\{Z_t, t \ge 0\}$ when written without argument. Hence $X_t$ and $Z_t$ take values in the space of continuous functions from $[0, \infty)$ to $\mathcal{M}_1(\mathcal{Y})$ and $\mathcal{M}(\mathcal{Y})$ respectively. We will write $X_t(A)$ and $Z_t(A)$ for their respective one dimensional projections onto the Borel set $A \subset \mathcal{Y}$, whereas discrete measures $x(\cdot) \in \mathcal{M}_1(\mathcal{Y})$ and $z(\cdot) \in \mathcal{M}(\mathcal{Y})$ will denote the marginal states of $X_t$ and $Z_t$. We adopt boldface notation to denote vectors, with the following conventions:

$$ \mathbf{x} = (x_1, \ldots, x_K) \in \mathbb{R}_+^K, \qquad \mathbf{m} = (m_1, \ldots, m_K) \in \mathbb{Z}_+^K, \qquad \mathbf{x}^{\mathbf{m}} = x_1^{m_1} \cdots x_K^{m_K}, \qquad |\mathbf{x}| = \sum_{i=1}^K x_i, $$

where the dimension $2 \le K \le \infty$ will be clear from the context unless specified. Accordingly, the Wright-Fisher model, closely related to projections of the Fleming-Viot process onto partitions, will be denoted $\mathbf{X}_t$. We denote by $\mathbf{0}$ the vector of zeros and by $\mathbf{e}_i$ the vector whose only non zero entry is a 1 at the $i$th coordinate. Let also "<" define a partial ordering on $\mathbb{Z}_+^K$, so that $\mathbf{m} < \mathbf{n}$ if $m_j \le n_j$ for all $j \ge 1$ and $m_j < n_j$ for some $j \ge 1$. Finally, we will use the compact notation $y_{1:m}$ for vectors of observations $y_1, \ldots, y_m$.
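The partial ordering "<" on vectors of multiplicities and the unit vectors defined above are used repeatedly in what follows; a minimal sketch of both, with hypothetical function names, is:

```python
def precedes(m, n):
    """True if m < n in the partial ordering: m_j <= n_j for all j
    and m_j < n_j for at least one j."""
    if len(m) != len(n):
        raise ValueError("vectors must have the same dimension")
    pairs = list(zip(m, n))
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)


def unit_vector(i, K):
    """e_i: vector of length K whose only non-zero entry is a 1 at coordinate i (1-based)."""
    return tuple(1 if j == i else 0 for j in range(1, K + 1))
```

Note that the ordering is only partial: for instance neither `(2, 0)` nor `(1, 2)` precedes the other.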

The static model: Dirichlet processes and mixtures thereof
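The Dirichlet process on a state space $\mathcal{Y}$, introduced by Ferguson (1973) (see Ghosal (2010) for a recent review), is a discrete random probability measure $x \in \mathcal{M}_1(\mathcal{Y})$. The process admits the series representation

$$ x(\cdot) = \sum_{i \ge 1} W_i\, \delta_{Y_i}(\cdot), \qquad W_i = \frac{Q_i}{\sum_{j \ge 1} Q_j}, \qquad Y_i \overset{iid}{\sim} P_0, \tag{2.1} $$

where $(Y_i)_{i \ge 1}$ and $(W_i)_{i \ge 1}$ are independent and $(Q_i)_{i \ge 1}$ are the jumps of a gamma process with mean measure $\theta y^{-1} e^{-y}\, dy$. We will denote by $\Pi_\alpha$ the law of $x(\cdot)$ in (2.1), with $\alpha$ as in (1.7). Mixtures of Dirichlet processes were introduced in Antoniak (1974). We say that $x$ is a mixture of Dirichlet processes if

$$ x \mid u \sim \Pi_{\alpha_u}, \qquad u \sim H, $$

where $\alpha_u$ denotes the measure $\alpha$ conditionally on $u$, or equivalently

$$ x \sim \int_{\mathcal{U}} \Pi_{\alpha_u}\, dH(u). \tag{2.2} $$

With a slight abuse of terminology we will also refer to the right hand side of the last expression as a mixture of Dirichlet processes. The Dirichlet process and mixtures thereof have two fundamental properties that are of great interest in statistical learning (Antoniak, 1974):
• Conjugacy: let $x$ be as in (2.2) and, given $x$, let $y_i \mid x \overset{iid}{\sim} x$ for $i = 1, \ldots, m$. Then

$$ x \mid y_{1:m} \sim \int_{\mathcal{U}} \Pi_{\alpha_u + \sum_{i=1}^m \delta_{y_i}}\, dH_{y_{1:m}}(u), $$

where $H_{y_{1:m}}$ is the conditional distribution of $u$ given $y_{1:m}$. Hence a posterior mixture of Dirichlet processes is still a mixture of Dirichlet processes with updated parameters.
• Projection: let $x$ be as in (2.2). For any measurable partition $A_1, \ldots, A_K$ of $\mathcal{Y}$, we have

$$ (x(A_1), \ldots, x(A_K)) \sim \int_{\mathcal{U}} \pi_{\boldsymbol{\alpha}_u}\, dH(u), $$

where $\boldsymbol{\alpha}_u = (\alpha_u(A_1), \ldots, \alpha_u(A_K))$ and $\pi_{\boldsymbol{\alpha}}$ denotes the Dirichlet distribution with parameter $\boldsymbol{\alpha}$.
Letting $H$ be concentrated on a single point of $\mathcal{U}$ recovers the respective properties of the Dirichlet process as special case, i.e. $x \sim \Pi_\alpha$ and $y_i \mid x \overset{iid}{\sim} x$ imply $x \mid y_{1:m} \sim \Pi_{\alpha + \sum_{i=1}^m \delta_{y_i}}$.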
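Combining the conjugacy and projection properties for a single Dirichlet process gives a simple recipe: the posterior of $(x(A_1), \ldots, x(A_K))$ given $y_{1:m}$ is Dirichlet with parameters $\theta P_0(A_i)$ plus the number of observations falling in $A_i$. The sketch below illustrates this under assumptions made only for the example: the type space is the unit interval, $P_0$ is uniform, the partition has two cells, and a Dirichlet vector is drawn by normalising independent gamma variables. All names are hypothetical.

```python
import random


def posterior_partition_parameters(theta, p0_masses, observations, cell_of):
    """Dirichlet parameters of (x(A_1), ..., x(A_K)) given y_{1:m}:
    theta * P0(A_i) + number of observations in A_i."""
    params = [theta * p for p in p0_masses]
    for y in observations:
        params[cell_of(y)] += 1
    return params


def sample_dirichlet(params, rng):
    """Draw a Dirichlet vector by normalising independent Ga(a_i, 1) variables."""
    g = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(g)
    return [v / total for v in g]


# Hypothetical setting: Y = [0, 1), P0 uniform, partition A_1 = [0, 0.5), A_2 = [0.5, 1).
cell_of = lambda y: 0 if y < 0.5 else 1
params = posterior_partition_parameters(2.0, [0.5, 0.5], [0.1, 0.7, 0.2], cell_of)
draw = sample_dirichlet(params, random.Random(1))
```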

The Fleming-Viot process
Fleming-Viot (FV) processes are a large family of diffusions taking values in the subspace of M 1 (Y) given by purely atomic probability measures. Hence they describe evolving discrete distributions whose support also varies with time and whose frequencies are each a diffusion on [0, 1]. Two states apart in time of a FV process are depicted in Figure 7. See Ethier and Kurtz (1993) and Dawson (1993) for exhaustive reviews. Here we restrict the attention to a subclass known as the (labelled) infinitely many neutral alleles model with parent independent mutation, henceforth for simplicity called the FV process, which has the law of a Dirichlet process as stationary measure (Ethier and Kurtz, 1993, Section 9.2).
One of the most intuitive ways to understand a FV process is to consider its transition function, found in Ethier and Griffiths (1993). This is given by

$$ P_t(x, dx') = \sum_{m=0}^{\infty} d_m(t) \int_{\mathcal{Y}^m} \Pi_{\alpha + \sum_{i=1}^m \delta_{y_i}}(dx')\, x^m(dy_1, \ldots, dy_m), \tag{2.3} $$

where $x^m$ denotes the $m$-fold product measure $x \times \cdots \times x$ and $\Pi_{\alpha + \sum_{i=1}^m \delta_{y_i}}$ is a posterior Dirichlet process as defined in the previous section. The expression (2.3) has a nice interpretation from the Bayesian learning viewpoint. Given the current state of the process $x$, with probability $d_m(t)$ an $m$-sized sample from $x$ is taken, and the arrival state is sampled from the posterior law $\Pi_{\alpha + \sum_{i=1}^m \delta_{y_i}}$. Here $d_m(t)$ is the probability that an $\mathbb{N}$-valued death process which starts at infinity at time 0 is in $m$ at time $t$, if it jumps from $m$ to $m-1$ at rate $\lambda_m = \frac{1}{2} m(\theta + m - 1)$. See Tavaré (1984) for details. Hence a larger $t$ implies sampling a lower amount of information from $x$ with higher probability, resulting in fewer atoms shared by $x$ and $x'$. The starting and arrival states thus have correlation which decreases in $t$ as controlled by $d_m(t)$. As $t \to 0$, infinitely many samples are drawn from $x$, so $x'$ will coincide with $x$ and the trajectories are continuous in total variation norm (Ethier and Kurtz, 1993). As $t \to \infty$, the death process which governs the probabilities $d_m(t)$ in (2.3) is eventually absorbed in 0, which implies that $P_t(x, dx') \to \Pi_\alpha$ as $t \to \infty$, so $x'$ is sampled from the prior $\Pi_\alpha$. Therefore this FV is stationary with respect to $\Pi_\alpha$ (in fact, it is also reversible). It follows that, using terms familiar to the Bayesian literature, under this parametrisation the FV can be considered as a dependent Dirichlet process with continuous sample paths. Constructions of Fleming-Viot and closely related processes using ideas from Bayesian nonparametrics have been proposed in Walker et al. (2007); Favaro et al. (2009); Ruggiero and Walker (2009a,b).
Different classes of diffusive dependent Dirichlet processes or related constructions based on the stick-breaking representation (Sethuraman, 1994) are proposed in Mena and Ruggiero (2016); Mena et al. (2011).
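The death process with rates $\lambda_m = \frac{1}{2} m(\theta + m - 1)$ can be explored numerically. The sketch below computes its transition probabilities when started from a finite state `M`, which is a truncation introduced here for illustration: the probabilities $d_m(t)$ in the text refer to a process started at infinity, for which see Tavaré (1984). The computation uses uniformization, a standard numerical technique for continuous-time Markov chains that is not part of the paper; function names are hypothetical.

```python
import math


def death_rates(M, theta):
    """Rates lambda_m = m (theta + m - 1) / 2 for m = 0..M; state 0 is absorbing."""
    return [0.5 * m * (theta + m - 1) for m in range(M + 1)]


def death_probs(M, theta, t, tol=1e-12, max_terms=100000):
    """P(D_t = m | D_0 = M) for m = 0..M, computed by uniformization."""
    lam = death_rates(M, theta)
    big = max(lam)
    v = [0.0] * (M + 1)
    v[M] = 1.0                      # law after k jumps of the uniformized chain
    if big == 0.0 or t == 0.0:
        return v
    weight = math.exp(-big * t)     # Poisson(big * t) probability of k = 0 jumps
    acc = weight
    out = [weight * p for p in v]
    k = 0
    while 1.0 - acc > tol and k < max_terms:
        k += 1
        step = [0.0] * (M + 1)
        for m in range(M + 1):
            down = lam[m] / big     # probability of moving from m to m - 1
            step[m] += v[m] * (1.0 - down)
            if m > 0:
                step[m - 1] += v[m] * down
        v = step
        weight *= big * t / k
        acc += weight
        out = [o + weight * p for o, p in zip(out, v)]
    return out
```

For larger `t` the returned vector puts more mass on low states, in line with the interpretation that less information from the current state is retained as time passes.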
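Projecting a FV process $X_t$ onto a measurable partition $A_1, \ldots, A_K$ of $\mathcal{Y}$ yields a $K$-dimensional Wright-Fisher (WF) diffusion $\mathbf{X}_t$, which is reversible and stationary with respect to the Dirichlet distribution $\pi_{\boldsymbol{\alpha}}$, for $\alpha_i = \theta P_0(A_i)$, $i = 1, \ldots, K$. See Dawson (2010); Etheridge (2009). This property is the dynamic counterpart of the projective property of Dirichlet processes discussed in Section 2.1.1. Consistently, the transition function of a WF process is obtained as a specialisation of the FV case, yielding

$$ P_t(\mathbf{x}, d\mathbf{x}') = \sum_{m=0}^{\infty} d_m(t) \sum_{\mathbf{m} \in \mathbb{Z}_+^K :\, |\mathbf{m}| = m} \binom{m}{\mathbf{m}}\, \mathbf{x}^{\mathbf{m}}\, \pi_{\boldsymbol{\alpha} + \mathbf{m}}(d\mathbf{x}'), \tag{2.4} $$

where $\binom{m}{\mathbf{m}} \mathbf{x}^{\mathbf{m}} = \frac{m!}{m_1! \cdots m_K!} \prod_{i=1}^K x_i^{m_i}$ is the multinomial probability of the multiplicities $\mathbf{m}$, with analogous interpretation to (2.3). See Ethier and Griffiths (1993).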
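The sampling interpretation of the WF transition function can be sketched as follows: given the current frequencies and a sample size `m`, draw multinomial multiplicities from the current state and then draw the arrival state from the Dirichlet posterior with parameters $\boldsymbol{\alpha} + \mathbf{m}$. In this sketch `m` is supplied by the caller, whereas in the transition function it is random with probabilities $d_m(t)$; the function name and the use of normalised gamma variables to draw the Dirichlet vector are choices made for this example.

```python
import random


def sample_wf_arrival(x, alpha, m, rng):
    """Given current frequencies x and Dirichlet parameters alpha (all positive),
    draw an m-sized sample from x, then the arrival state from Dirichlet(alpha + counts).
    Returns (arrival state, counts)."""
    K = len(x)
    counts = [0] * K
    # Multinomial multiplicities: m independent draws of a type with probabilities x.
    for k in rng.choices(range(K), weights=x, k=m):
        counts[k] += 1
    # Dirichlet(alpha + counts) via normalised independent gamma variables.
    g = [rng.gammavariate(a + c, 1.0) for a, c in zip(alpha, counts)]
    total = sum(g)
    return [v / total for v in g], counts


state, counts = sample_wf_arrival([0.2, 0.3, 0.5], [1.0, 1.0, 1.0], 10, random.Random(0))
```

With `m = 0` no information from `x` is used and the arrival state is a draw from the stationary Dirichlet distribution, which matches the large-time behaviour described above.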
For statistical modelling it is useful to introduce a further parameter σ that controls the speed of the process. This can be done by defining the time change X τ (t) with τ (t) = σt. In such parameterisation, σ does not affect the stationary distribution of the process, and can be used to model the dependence structure.

The static model: Gamma random measures and mixtures thereof
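Gamma random measures (Lo, 1982) can be thought of as the counterpart of Dirichlet processes in the context of finite intensity measures. A gamma random measure $z \in \mathcal{M}(\mathcal{Y})$ with shape parameter $\alpha$ as in (1.7) and rate parameter $\beta > 0$, denoted $z \sim \Gamma^\beta_\alpha$, admits the representation

$$ z(\cdot) = \beta^{-1} \sum_{i \ge 1} Q_i\, \delta_{Y_i}(\cdot), \tag{2.5} $$

with $(Q_i)_{i \ge 1}$ and $(Y_i)_{i \ge 1}$ as in (2.1). Similarly to the definition of mixtures of Dirichlet processes (Section 2.1.1), we say that $z$ is a mixture of gamma random measures if $z \sim \int_{\mathcal{U}} \Gamma^\beta_{\alpha_u}\, dH(u)$, and with a slight abuse of terminology we will also refer to the right hand side of the last expression as a mixture of gamma random measures. Analogous conjugacy and projection properties to those seen for mixtures of Dirichlet processes hold for mixtures of gamma random measures:
• Conjugacy: let $z$ be a mixture of gamma random measures and, given $z$, let the data be a Poisson point process with intensity $z$, i.e.

$$ y_i \mid z, m \overset{iid}{\sim} z/|z|, \quad i = 1, \ldots, m, \qquad m \mid z \sim \mathrm{Po}(|z|), \tag{2.6} $$

where $|z| := z(\mathcal{Y})$ is the total mass of $z$. Then

$$ z \mid y_{1:m} \sim \int_{\mathcal{U}} \Gamma^{\beta+1}_{\alpha_u + \sum_{i=1}^m \delta_{y_i}}\, dH_{y_{1:m}}(u), \tag{2.7} $$

where $H_{y_{1:m}}$ is the conditional distribution of $u$ given $y_{1:m}$. Hence mixtures of gamma random measures are conjugate with respect to Poisson point process data.
• Projection: for any measurable partition $A_1, \ldots, A_K$ of $\mathcal{Y}$, we have

$$ (z(A_1), \ldots, z(A_K)) \sim \int_{\mathcal{U}} \prod_{i=1}^K \mathrm{Ga}(\alpha_{u,i}, \beta)\, dH(u), $$

where $\alpha_{u,i} = \alpha_u(A_i)$, and $\mathrm{Ga}(\alpha, \beta)$ denotes the gamma distribution with shape $\alpha$ and rate $\beta$.
Letting $H$ be concentrated on a single point of $\mathcal{U}$ recovers the respective properties of gamma random measures as special case, i.e. $z \sim \Gamma^\beta_\alpha$ and $y_i$ as in (2.6) imply $z \mid y_{1:m} \sim \Gamma^{\beta+1}_{\alpha + \sum_{i=1}^m \delta_{y_i}}$. Finally, it is well known that (2.1) and (2.5) satisfy the relation in distribution

$$ x(\cdot) \overset{d}{=} \frac{z(\cdot)}{z(\mathcal{Y})}. \tag{2.8} $$

This extends to the infinite dimensional case the well known relationship between beta and gamma random variables. See for example Daley and Vere-Jones (2008), Example 9.1(e). See also Konno and Shiga (1988) for an extension of (2.8) to the dynamic case concerning FV and DW processes, which requires a random time change.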
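For a single gamma random measure, the conjugacy and projection properties combine into a cell-wise update: on a partition, the prior masses are independent $\mathrm{Ga}(\alpha(A_i), \beta)$ and, after observing a Poisson point process with $n_i$ points in $A_i$, the posterior masses are independent $\mathrm{Ga}(\alpha(A_i) + n_i, \beta + 1)$. Normalising the masses gives a Dirichlet vector, in line with the relation between gamma and Dirichlet random measures. The sketch below uses hypothetical names; note that Python's `random.gammavariate` is parametrised by shape and scale, so the scale is the reciprocal of the rate.

```python
import random


def posterior_gamma_parameters(alpha_masses, beta, cell_counts):
    """(shape, rate) of each z(A_i) given the counts of one Poisson point process:
    Ga(alpha(A_i) + n_i, beta + 1)."""
    return [(a + n, beta + 1.0) for a, n in zip(alpha_masses, cell_counts)]


def sample_projection(params, rng):
    """Draw independent gamma masses; gammavariate takes (shape, scale = 1 / rate)."""
    return [rng.gammavariate(shape, 1.0 / rate) for shape, rate in params]


def normalise(z):
    """Normalised masses z(A_i) / z(Y), a probability vector."""
    total = sum(z)
    return [v / total for v in z]


post = posterior_gamma_parameters([1.0, 2.0], 3.0, [2, 0])
masses = sample_projection(post, random.Random(7))
freqs = normalise(masses)
```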

The Dawson-Watanabe process
Dawson-Watanabe (DW) processes can be considered as dependent models for gamma random measures, and are, roughly speaking, the gamma counterpart of FV processes. More formally, they are branching measure-valued diffusions taking values in the space of finite discrete measures. As in the FV case, they describe evolving discrete measures whose support varies with time and whose masses are each a positive diffusion, but relaxing the constraint of their masses summing to one to that of summing to a finite quantity. See Dawson (1993) and Li (2011) for reviews. Here we are interested in the special case of subcritical branching with immigration, where subcriticality refers to the fact that in the underlying branching population, which can be used to construct the process, the mean number of offspring per individual is less than one. Specifically, we will consider DW processes with transition function

$$ P_t(z, dz') = \sum_{m=0}^{\infty} d^{|z|,\beta}_m(t) \int_{\mathcal{Y}^m} \Gamma^{\beta + S^*_t}_{\alpha + \sum_{i=1}^m \delta_{y_i}}(dz')\, (z/|z|)^m(dy_1, \ldots, dy_m), \qquad S^*_t := \frac{\beta}{e^{\beta t/2} - 1}. \tag{2.9} $$

See Ethier and Griffiths (1993b). The interpretation of (2.9) is similar to that of (2.3): conditional on the current state given by the measure $z$, $m$ iid samples are drawn from the normalised measure $z/|z|$ and the arrival state $z'$ is sampled from $\Gamma^{\beta + S^*_t}_{\alpha + \sum_{i=1}^m \delta_{y_i}}$. Here the main structural difference with respect to (2.3), apart from the different distributions involved, is that since in general $S^*_t$ is not an integer quantity, the interpretation as sampling the arrival state $z'$ from a posterior gamma law is not formally correct; cf. (2.7). The sample size $m$ is chosen with probability $d^{|z|,\beta}_m(t)$, which is the probability that an $\mathbb{N}$-valued death process which starts at infinity at time 0 is in $m$ at time $t$, if it jumps from $m$ to $m - 1$ at rate $(m\beta/2)(1 - e^{-\beta t/2})^{-1}$. See Ethier and Griffiths (1993b) for details. So $z$ and $z'$ will share fewer atoms the farther they are apart in time. The DW process with the above transition is known to be stationary and reversible with respect to the law $\Gamma^\beta_\alpha$ of a gamma random measure; cf. (2.5). See Shiga (1990); Ethier and Griffiths (1993b).
The Dawson-Watanabe process has been recently considered as a basis to build time-dependent gamma process priors with Markovian evolution in Caron and Teh (2012) and Spanò and Lijoi (2016).
The DW process satisfies a projective property similar to that seen in Section 2.1.2 for the FV process. Let $Z_t$ have transition (2.9). Given a measurable partition $A_1, \ldots, A_K$ of $\mathcal{Y}$, the vector $(Z_t(A_1), \ldots, Z_t(A_K))$ has independent components $z_{t,i} = Z_t(A_i)$, each driven by a Cox-Ingersoll-Ross (CIR) diffusion (Cox et al., 1985). These are also subcritical continuous-state branching processes with immigration, reversible and ergodic with respect to a $\mathrm{Ga}(\alpha_i, \beta)$ distribution, with transition function

$$ P_t(z_i, dz_i') = \sum_{m_i=0}^{\infty} \mathrm{Po}(m_i \mid z_i S^*_t)\, \mathrm{Ga}(dz_i' \mid \alpha_i + m_i, \beta + S^*_t), \tag{2.10} $$

with $S^*_t$ as in (2.9). As for FV and WF processes, a further parameter $\sigma$ that controls the speed of the process can be introduced without affecting the stationary distribution. This can be done by defining an appropriate time change that can be used to model the dependence structure.
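A one-dimensional CIR transition can be simulated through a Poisson-gamma mixture. The sketch below assumes the following form, which should be checked against Ethier and Griffiths (1993b) before being relied upon: given the current state $z$, a count $M \sim \mathrm{Po}(z S^*_t)$ is drawn with $S^*_t = \beta/(e^{\beta t/2} - 1)$, and the arrival state is $\mathrm{Ga}(\alpha + M, \beta + S^*_t)$. Under this assumption the conditional mean reduces to $\alpha/\beta + (z - \alpha/\beta)e^{-\beta t/2}$, which gives a simple numerical check. Function names are hypothetical, and the Poisson sampler is Knuth's multiplication method, adequate only for small means.

```python
import math
import random


def s_star(beta, t):
    """Assumed form S*_t = beta / (exp(beta * t / 2) - 1), for t > 0."""
    return beta / math.expm1(beta * t / 2.0)


def conditional_mean(z, alpha, beta, t):
    """E[z' | z] under the mixture: (alpha + E[M]) / (beta + S*_t), with E[M] = z S*_t."""
    s = s_star(beta, t)
    return (alpha + z * s) / (beta + s)


def sample_poisson(mean, rng):
    """Knuth's multiplication method for a Poisson draw (small means only)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_cir_transition(z, alpha, beta, t, rng):
    """Draw the arrival state: M ~ Po(z S*_t), then Ga(alpha + M, rate beta + S*_t)."""
    s = s_star(beta, t)
    m = sample_poisson(z * s, rng)
    return rng.gammavariate(alpha + m, 1.0 / (beta + s))  # scale = 1 / rate
```

As `t` grows, `s_star` tends to 0 and the draw approaches a sample from the stationary $\mathrm{Ga}(\alpha, \beta)$ law.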

Filtering Fleming-Viot signals
Let the latent signal $X_t$ be a FV process with transition function (2.3). We assume that, given the signal state, observations are drawn independently from $x$, i.e. as in (1.1) with $X_t = x$. Since $x$ is almost surely discrete (Blackwell, 1973), a sample $y_{1:m} = (y_1, \ldots, y_m)$ from $x$ will feature ties among the observations with positive probability, hence $K_m \le m$ distinct values. Denote by $(y^*_1, \ldots, y^*_{K_m})$ the distinct values in $y_{1:m}$ and by $\mathbf{m} = (m_1, \ldots, m_{K_m})$ the associated multiplicities, so that $|\mathbf{m}| = m$. When an additional sample $y_{m+1:m+n}$ with multiplicities $\mathbf{n}$ becomes available, we adopt the convention that $\mathbf{n}$ adds up to the multiplicities of the types already recorded in $y_{1:m}$, so that the total multiplicities count is

$$ \mathbf{m} + \mathbf{n} = (m_1 + n_1, \ldots, m_{K_m} + n_{K_m}, n_{K_m+1}, \ldots, n_{K_{m+n}}). \tag{3.1} $$

The following Lemma states in our notation the special case of the conjugacy for mixtures of Dirichlet processes which is of interest here; see Section 2.1.1. To this end, let

$$ \mathcal{M} = \{\mathbf{m} = (m_1, \ldots, m_K) \in \mathbb{Z}_+^K\} $$

be the space of multiplicities of $K$ types, with partial ordering defined as in Section 1.4. Denote also by $\mathrm{PU}_\alpha(y_{m+1:m+n} \mid y_{1:m})$ the joint distribution of $y_{m+1:m+n}$ given $y_{1:m}$ when the random measure $x$ is marginalised out, which is determined by the Blackwell-MacQueen Pólya urn predictive scheme (Blackwell and MacQueen, 1973)

$$ Y_{m+1} \mid y_{1:m} \sim \frac{\theta P_0 + \sum_{i=1}^{K_m} m_i\, \delta_{y^*_i}}{\theta + m}. $$

Here "∝" denotes proportionality. The updated distribution is thus still a mixture of Dirichlet processes with different multiplicities and possibly new atoms in the parameter measures $\alpha + \sum_{i=1}^{K_{m+n}} (m_i + n_i)\, \delta_{y^*_i}$. The following Theorem formalises our main result on FV processes, showing that the family of finite mixtures of Dirichlet processes is conjugate with respect to discretely sampled data as in (1.1) with $X_t = x$.
The transition operator of the FV process thus maps a Dirichlet process at time t 0 into a finite mixture of Dirichlet processes at time t 0 + t. The mixing weights are the transition probabilities of a death process on the K m dimensional lattice, with K m being as in (3.7) the number of distinct values observed in previous data. The result is obtained by means of the argument described in Figure 1, which is based on the property that the operations of propagating and projecting the signal commute. By projecting the current distribution of the signal onto an arbitrary measurable partition, yielding a mixture of Dirichlet distributions, we can exploit the results for finite dimensional WF signals to yield the associated propagation (Papaspiliopoulos and Ruggiero, 2014). The propagation of the original signal is then obtained by means of the characterisation of mixtures of Dirichlet processes via their projections. See Section 4.2 for a proof. In particular, the result shows that under these assumptions, the prediction operation (1.4) with the transition function (2.3) reduces to a finite sum.
Iterating the update and propagation operations provided by Lemma 3.1 and Theorem 3.1 allows one to perform sequential Bayesian inference on a hidden signal of FV type by means of a finite computation. Here the finiteness refers to the fact that the infinite dimensionality due to the transition function of the signal is avoided analytically, without resorting to any stochastic truncation method for (2.3), such as those of Walker (2007) and Papaspiliopoulos and Roberts (2008), and the computation can be conducted in closed form.
The following Proposition formalises the recursive algorithm that sequentially evaluates the marginal posterior laws L(X tn |Y 1:n ) of a partially observed FV process by alternating the update and propagation operations, and identifies the family of distributions which is closed with respect to these operations. Define the family of finite mixtures of Dirichlet processes and p m,n (t) as in (3.8).
Note that the update operation (3.10) preserves the number of components in the mixture, while the prediction operation (3.7) increases it. The intuition behind this point is analogous to the illustration in Section 1.3, where the prior (node (0, 0)) is updated to the posterior (node (2, 1)) and propagated into a mixture (coloured nodes), with the obvious difference that here the maximum number of distinct values is unbounded and not fixed.
Algorithm 1 describes in pseudo-code the implementation of the filter for FV processes.

Filtering Dawson-Watanabe signals
Let now the signal Z t follow a DW process with transition function (2.9), with invariant measure given by the law Γ β α of a gamma random measure; see (2.5). We assume that, given the signal state, observations are drawn from a Poisson point process with intensity z, i.e., as in (2.6) with Z t = z. Analogously to the FV case, since z is almost surely discrete, a sample y 1:m = (y 1 , . . . , y m ) from (2.6) will feature ties among the observations with positive probability, so that the number K m of distinct values satisfies K m ≤ m. We adopt the same notation as in Section 3.1.
The following Lemma states in our notation the special case of the conjugacy for mixtures of gamma random measures which is of interest here; see Section 2.2.1.

Algorithm 1: Filtering algorithm for FV signals
Data: y t j = (y t j ,1 , . . . , y t j ,m t j ) at times t j , j = 0, . . . , J, as in (1.1)
Set prior parameters α = θP 0 , θ > 0,

The updated distribution is thus still a mixture of gamma random measures with updated parameters and the same number of components.
The following Theorem formalises our main result on DW processes, showing that the family of finite mixtures of gamma random measures is conjugate with respect to data as in (2.6) with Z t = z.
The transition operator of the DW process thus maps a gamma random measure into a finite mixture of gamma random measures. The time-varying mixing weights factorise into the binomial transition probabilities of a one-dimensional death process starting at the total size of previous data |m| and into a hypergeometric pmf. The intuition is that the death process regulates how many levels down the K m dimensional lattice are taken, and the hypergeometric probability chooses which admissible path down the graph is chosen given the arrival level. In Figure 3 we would have K m = 2 distinct values with multiplicities m = (2, 1) and total size |m| = 3. Then, e.g., p (2,1),(1,1) (t) is given by the probability Bin(1; 3, p(t)) that the death process jumps down one level from 3 in time t (Figure 3-(a)), times the probability p((1, 1); (2, 1), 2), conditional on going down one level, of reaching (1, 1) from (2, 1) instead of (2, 0), i.e. of removing one item from the pair and not the singleton observation. The Binomial transition of the one-dimensional death process is subordinated to a deterministic process S t which modulates the sample size continuously in (3.14), starts at the value S 0 = s (cf. the left hand side of (3.14)) and converges to 0 as t → ∞.
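The path-selection step in the example above can be checked numerically. The following Python sketch computes the multivariate hypergeometric probability of each admissible arrival node at a given level; the function names are hypothetical, and the binomial factor is deliberately left out since the expression of p(t) is not reproduced here.

```python
from itertools import product
from math import comb


def hypergeometric_pmf(n, m):
    """Probability of retaining n_i of the m_i items of type i when |n| of
    the |m| items are retained uniformly at random without replacement."""
    ways = 1
    for n_i, m_i in zip(n, m):
        ways *= comb(m_i, n_i)
    return ways / comb(sum(m), sum(n))


def arrival_distribution(m, level):
    """Admissible nodes n <= m with |n| = level and their probabilities,
    conditional on the death process arriving at that level."""
    nodes = product(*(range(k + 1) for k in m))
    return {n: hypergeometric_pmf(n, m) for n in nodes if sum(n) == level}


# Multiplicities (2, 1): going down one level from |m| = 3 to level 2.
level_two = arrival_distribution((2, 1), 2)
```

For m = (2, 1) the two admissible nodes at level 2 are (1, 1) and (2, 0), with probabilities 2/3 and 1/3: removing an item from the pair is twice as likely as removing the singleton.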
The result is obtained by means of a similar argument to that used for Theorem 3.1, jointly with the relation (2.8) (which here suffices to be applied at the margin of the process). In particular, we exploit the fact that the projection of a DW process onto an arbitrary partition of the space yields a vector of independent CIR processes. See Section 4.3 for a proof. Analogously to the FV case, the result shows that under the present assumptions, the prediction operation (1.4) with the transition function (2.9) reduces to a finite sum.
The following Proposition formalises the recursive algorithm that evaluates the marginal posterior laws L(Z tn |Y 1:n ) of a partially observed DW process, allowing one to perform sequential Bayesian inference on a hidden signal of DW type by means of a finite computation and within the family of finite mixtures of gamma random measures. Define such family as with M as in (3.2).
Proposition 3.2. Let Z t be a DW process with transition function (2.9) and invariant law Γ β α defined as in Section 2.2.1, and suppose data are collected as in (2.6) with Z t = z. Then F Γ is closed under the application of the update and prediction operators (1.3) and (1.4). Specifically, (3.17) with t(y, M) as in (3.9), ŵ n as in Proposition 3.1, and p m,n (t) as in (3.15) and S t as in (3.16).
Algorithm 2 describes in pseudo-code the implementation of the filter for DW processes.

Computable filtering and duality
A filter is said to be computable if the sequence of filtering distributions (the marginal laws of the signal given past and current data) can be characterised by a set of parameters whose computation is achieved at a cost that grows at most polynomially with the number of observations. See, e.g., Chaleyat-Maurel and Genon-Catalot (2006). Special cases of this framework are finite dimensional filters for which the computational cost is linear in the number of observations, the Kalman filter for linear Gaussian HMMs being the reference model in this setting.
Let X denote the state space of the HMM. Papaspiliopoulos and Ruggiero (2014) showed that the existence of a computable filter can be established if the following structures are embedded in the model:
Conjugacy: there exists a function h(x, m, θ) ≥ 0, where x ∈ X , m ∈ Z K + for some K ∈ N, and θ ∈ R l for some l ∈ N, and functions t 1 (y, m) and t 2 (y, θ) such that $\int_{\mathcal{X}} h(x, m, \theta)\,\pi(dx) = 1$ for all m and θ, and $\phi_y\big(h(x, m, \theta)\,\pi(dx)\big) = h(x, t_1(y, m), t_2(y, \theta))\,\pi(dx)$.

Algorithm 2: Filtering algorithm for DW signals
Data: (m t j , y t j ) = (m t j , y t j ,1 , . . . , y t j ,m t j ) at times t j , j = 0, . . . , J, as in (2.6)
Set prior parameters α = θP 0 , θ > 0,

Here h(x, m, θ)π(dx) identifies a parametric family of distributions which is closed under Bayesian updating with respect to the observation model. Two types of parameters are considered, a multi-index m and a vector of real-valued parameters θ. The update operator φ y maps the distribution h(x, m, θ)π(dx), conditional on the new observation y, into a density of the same family with updated parameters t 1 (y, m) and t 2 (y, θ). Typically π(dx) is the prior and h(x, m, θ) is the Radon-Nikodym derivative of the posterior with respect to the prior, when the model is dominated. See, e.g., (4.6) below for an example of such h when π is the Dirichlet distribution.
Duality: there exists a two-component Markov process (M t , Θ t ) with state space Z K + × R l and infinitesimal generator
$$(Ag)(m, \theta) = \lambda(|m|)\,\rho(\theta) \sum_{i=1}^{K} m_i \big[ g(m - e_i, \theta) - g(m, \theta) \big] + \sum_{i=1}^{l} r_i(\theta)\, \frac{\partial g(m, \theta)}{\partial \theta_i}$$
acting on bounded functions, such that (M t , Θ t ) is dual to X t with respect to the function h, i.e., it satisfies
$$E^{x}\big[h(X_t, m, \theta)\big] = E^{(m, \theta)}\big[h(x, M_t, \Theta_t)\big] \qquad (4.1)$$
for all x ∈ X , m ∈ Z K + , θ ∈ R l , t ≥ 0. Here M t is a death process on Z K + , i.e. a non-increasing pure-jump continuous time Markov process, which jumps from m to m − e i at rate λ(|m|)m i ρ(θ) and is eventually absorbed at the origin; Θ t is a deterministic process assumed to evolve autonomously according to a system of ordinary differential equations r(Θ t ) = dΘ t /dt for some initial condition Θ 0 = θ 0 and a suitable function r : R l → R l , whose ith coordinate is denoted by r i in the generator A above and modulates the death rates of M t through ρ(θ). The expectations on the left and right hand sides are taken with respect to the law of X t and (M t , Θ t ) respectively, conditional on the respective starting points.
The duality condition (4.1) hides a specific distributional relationship between the signal process X t , which can be thought of as the forward process, and the dual process (M t , Θ t ), which can be thought of as unveiling some features of the time reversal structure of X t . Informally, the death process can be considered as the time reversal of collecting data points if they come at random times, and the deterministic process, in the CIR example (see Section 1.3), can be considered as a continuous reversal of the sample size process, which instead increases by steps. For example, in the well known duality relation between the WF diffusion and the block counting process of Kingman's coalescent, the latter describes the number of surviving non mutant lines of descent in the tree backwards in time which tracks the ancestors of a sample of individuals in the current population. See Griffiths and Spanò (2010). See also Jansen and Kurt (2014) for a review of duality structures for Markov processes.
Under the above conditions, Proposition 2.3 of Papaspiliopoulos and Ruggiero (2014) shows that given the family of distributions F = {h(x, m, θ)π(dx), m ∈ Z K + , θ ∈ R l }, if ν ∈ F, then the filtering distribution ν n which satisfies (1.2) is a finite mixture of distributions in F with parameters that can be computed recursively. This in turn implies that the family of finite mixtures of elements of F is closed under the iteration of update and prediction operations. The interpretation is along the lines of the illustration of Section 1.3. Here π, the stationary measure of the forward process, plays the role of the prior distribution and is represented by the origin of Z K + (see Figure 3), which encodes the lack of information on the data generating distribution. Given a sample from the conjugate observation model, a single component posterior distribution is identified by a node different from the origin in Z K + . The propagation operator then gives positive mass at all nodes which lie beneath the current nodes with positive mass. By iteration of these operations, the filtering distribution evolves within the family of finite mixtures of elements of F.
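The way the support of the mixture moves on the lattice can be sketched in a few lines of Python. The function names are hypothetical, and the update is taken to be a shift of each node by the new multiplicities, as in the Dirichlet case; for other conjugate families the map t 1 (y, m) may differ, so this is only an illustration of the support and not of the mixing weights.

```python
from itertools import product


def nodes_beneath(node):
    """All multi-indices n with 0 <= n <= node componentwise."""
    return set(product(*(range(k + 1) for k in node)))


def propagate_support(active):
    """Support after a prediction step: every node lying beneath some
    node that currently carries positive mass."""
    support = set()
    for node in active:
        support |= nodes_beneath(node)
    return support


def update_support(active, counts):
    """Support after an update step, assuming the update shifts each node
    by the multiplicities `counts` of the new observations."""
    return {tuple(a + c for a, c in zip(node, counts)) for node in active}


prior = {(0, 0)}                           # origin of the lattice
posterior = update_support(prior, (2, 1))  # single node (2, 1)
predicted = propagate_support(posterior)   # all nodes beneath (2, 1)
```

A single node m propagates into (m 1 + 1) · · · (m K + 1) nodes, so the number of components after a prediction step is bounded by a product of terms linear in the multiplicities, while an update step leaves the number of components unchanged.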

Computable filtering for Fleming-Viot processes
In the present and the following Section we adopt the same notation used in Section 3. We start by formally stating the precise form for the transition probabilities of the death processes involved in the FV filtering. Here the key point to observe is that since the number of distinct types observed in the discrete samples from a FV process is K m ≤ m, we only need to consider a generic death process on Z Km + and not on Z ∞ + . For FV processes, the deterministic component Θ t is constant: here we set Θ t = 1 for every t and we omit θ from the arguments of the duality function h.
The following Lemma will provide the building block for the proof of Theorem 3.1. In particular, it shows that the transition probabilities of the dual death process are of the form required as coefficients in the expansion (3.8).
Proof. Since |m 0 | < ∞, for any such m 0 the proof is analogous to that of Proposition 2.1 in Papaspiliopoulos and Ruggiero (2014).
The following Proof of the conjugacy for mixtures of Dirichlet processes is due to Antoniak (1974) and outlined here for the ease of the reader.
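As preparatory for the main result on FV processes, we derive here in detail the propagation step for WF processes, which is due to Papaspiliopoulos and Ruggiero (2014). Let
$$A_K = \frac{1}{2} \sum_{i,j=1}^{K} x_i(\delta_{ij} - x_j) \frac{\partial^2}{\partial x_i \partial x_j} + \frac{1}{2} \sum_{i=1}^{K} (\alpha_i - \theta x_i) \frac{\partial}{\partial x_i} \qquad (4.4)$$
be the infinitesimal generator of a K-dimensional WF diffusion, with α i > 0 and $\sum_{i=1}^{K} \alpha_i = \theta$. Here δ ij denotes Kronecker delta and A K acts on C 2 (Δ K ) functions, with
$$\Delta_K = \Big\{ x \in [0, 1]^K : \sum_{i=1}^{K} x_i = 1 \Big\}. \qquad (4.5)$$
Proposition 4.1. Let X t be a WF diffusion with generator (4.4) and Dirichlet invariant measure on (4.5) denoted π α . Then, for any m ∈ Z ∞ + such that |m| < ∞, the prediction operator (1.4) maps π α+m into the mixture $\sum_{0 \le i \le m} p_{m, m-i}(t)\, \pi_{\alpha + m - i}$, with p m,m−i (t) as in (4.1).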

Proof.
Define (4.6) which is in the domain of A K . A direct computation shows that Hence, by (4.2), the death process M t on Z K + , which jumps from m to m − e i at rate m i (θ + |m| − 1)/2, is dual to the WF diffusion with generator A K with respect to (4.6). From the definition (1.4) of the prediction operator now we have where the second equality holds in virtue of the reversibility of X t with respect to π α , the fourth by the duality (4.1) established above together with (4.3) and the fifth from Lemma 4.1.
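The transition probabilities of the dual death process can also be obtained numerically, which is useful as a sanity check on closed-form weights. The Python sketch below uses uniformization, a generic technique for continuous-time Markov chains and not the closed-form expression of Lemma 4.1, with the jump rate m i (θ + |m| − 1)/2 from the proof above. The function name and the tolerance parameters are hypothetical, and θ > 0 is assumed.

```python
from math import exp


def death_transition_probs(m, theta, t, tol=1e-12, max_terms=10_000):
    """P(M_t = n | M_0 = m) for the death process jumping from n to n - e_i
    at rate n_i (theta + |n| - 1) / 2, computed by uniformization.
    Intended for moderate values of the top rate times t."""
    m = tuple(m)
    size = sum(m)
    top_rate = size * (theta + size - 1) / 2  # largest total exit rate
    if top_rate == 0 or t == 0:
        return {m: 1.0}

    def jump(dist):
        """One step of the uniformized discrete-time chain."""
        out = {}
        for n, prob in dist.items():
            s = sum(n)
            exit_rate = s * (theta + s - 1) / 2
            out[n] = out.get(n, 0.0) + prob * (1 - exit_rate / top_rate)
            for i, n_i in enumerate(n):
                if n_i > 0:
                    child = n[:i] + (n_i - 1,) + n[i + 1:]
                    rate = n_i * (theta + s - 1) / 2
                    out[child] = out.get(child, 0.0) + prob * rate / top_rate
        return out

    result, dist = {}, {m: 1.0}
    poisson = exp(-top_rate * t)  # Poisson weight for k = 0 jumps
    covered, k = 0.0, 0
    while covered < 1 - tol and k < max_terms:
        for n, prob in dist.items():
            result[n] = result.get(n, 0.0) + poisson * prob
        covered += poisson
        k += 1
        poisson *= top_rate * t / k  # Poisson weight for k jumps
        dist = jump(dist)
    return result


weights = death_transition_probs((2, 1), theta=1.5, t=0.3)
```

The returned dictionary assigns a weight to every node beneath m, and these weights sum to one up to the truncation tolerance.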
The following proves the propagation step for FV processes by making use of the previous result and by exploiting the strategy outlined in Figure 1.
Proof of Theorem 3.1. Fix an arbitrary partition (A 1 , . . . , A K ) of Y with K classes, and denote by m̃ the multiplicities resulting from binning y 1:m into the corresponding cells. Then on (A 1 , . . . , A K ). Since the projection onto the same partition of the FV process is a K-dimensional WF process (see Section 2.1.2), from Proposition 4.1 we have p m̃,n (t) π α+n .
Furthermore, since a Dirichlet process is characterised by its finite-dimensional projections, now it suffices to show that p m̃,n (t) π α+n so that the operations of propagation and projection commute. Given (4.7), we only need to show that the mixture weights are consistent with respect to fragmentation and merging of classes, that is
The last needed result to obtain the recursive representation of Proposition 3.1 reduces now to a simple sum rearrangement.
Proof of Proposition 3.1. The update operation (3.10) follows directly from Lemma 3.1. The prediction operation (3.11) for elements of F Π follows from Theorem 3.1 together with the linearity of (1.4) and a rearrangement of the sums, so that

Computable filtering for Dawson-Watanabe processes
The following Lemma, used later, recalls the propagation step for one dimensional CIR processes.
As preparatory for proving the main result on DW processes, assume the signal Z t = (Z 1,t , . . . , Z K,t ) is a vector of independent CIR components Z i,t each with generator acting on C 2 ([0, ∞)) functions which vanish at infinity. See Kawazu and Watanabe (1971). The next proposition identifies the dual process for Z t .
Proof. Throughout the proof, for ease of notation we will write h C i instead of h C αi . Note first that for all m ∈ Z K + we have where x i = z i /|z|, which follows from direct computation by multiplying and dividing by the correct ratios of gamma functions and by writing We show the result for K = 2, from which the statement for the general K case follows easily. From the independence of the CIR processes, the generator of (Z 1,t , Z 2,t ) applied to the left hand side of (4.10) is (4.11) A direct computation shows that Substituting in the right hand side of (4.11) and collecting terms with the same coefficients gives with α = α 1 + α 2 and m = m 1 + m 2 . From (4.10) we now have an application of (4.9) on h(z, m, s) shows that (Bh(z, ·, ·))(m, s) equals the right hand side of (4.12), so that (4.2) holds, giving the result.
The previous Theorem extends the gamma-type duality shown for one-dimensional CIR processes in Papaspiliopoulos and Ruggiero (2014). Although the components of Z t are independent, the result is not entirely trivial. Indeed the one-dimensional CIR process is dual to a two-component process given by a one-dimensional death process and a one-dimensional deterministic dual. The previous result shows that the dual of K independent CIR processes is not given by K independent copies of the CIR dual, but by a death process on Z K + modulated by a single deterministic process. Specifically, here the dual component M t is a K-dimensional death process on Z K + which, conditionally on S t , jumps from m to m − e i at rate 2m i (β + S t ), and S t ∈ R + is a nonnegative deterministic process driven by the logistic type differential equation (4.13)
The next Proposition formalises the propagation step for multivariate CIR processes. Denote by Ga(α, β) the product of gamma distributions Ga(α 1 , β) × · · · × Ga(α K , β), with α = (α 1 , . . . , α K ).
Using now the fact that a product of Binomials equals the product of a Binomial and a hypergeometric distribution, we have |m| i=0 Bin(|m| − i; |m|, p(t)) 0≤i≤m,|i|=i which, using (2.8), yields (4.14). Furthermore, (3.16) is obtained by solving (4.13) and by means of the following argument. The one-dimensional death process that drives |M t | in Theorem 4.1 jumps from |m| to |m| − 1 at rate |m|(β + S t )/2, see (4.9). The probability that |M t | remains in |m| in [0, t] if it is in |m| at time 0, here denoted P (|m| | |m|, S t ), is then $P(|m| \mid |m|, S_t) = \exp\big\{ -\tfrac{|m|}{2} \int_0^t (\beta + S_u)\, du \big\}$. Iterating the argument leads to the conclusion that the death process jumps from |m| to |m| − i in [0, t] with probability Bin(|m| − i | |m|, p(t)).
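The factorisation used at the start of this argument, namely that a product of Binomials with common success probability equals a Binomial on the totals times a multivariate hypergeometric term, can be verified numerically. The Python sketch below does so over the whole lattice beneath m; the function names are hypothetical, and p is an arbitrary stand-in for p(t).

```python
from itertools import product
from math import comb, isclose


def binomial_pmf(k, n, p):
    """Probability of k successes out of n with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def product_of_binomials(n, m, p):
    """Independent thinning of each type: prod_i Bin(n_i; m_i, p)."""
    out = 1.0
    for n_i, m_i in zip(n, m):
        out *= binomial_pmf(n_i, m_i, p)
    return out


def binomial_times_hypergeometric(n, m, p):
    """Bin(|n|; |m|, p) times the multivariate hypergeometric pmf of n."""
    ways = 1
    for n_i, m_i in zip(n, m):
        ways *= comb(m_i, n_i)
    hypergeometric = ways / comb(sum(m), sum(n))
    return binomial_pmf(sum(n), sum(m), p) * hypergeometric


m, p = (2, 1), 0.6
identity_holds = all(
    isclose(product_of_binomials(n, m, p), binomial_times_hypergeometric(n, m, p))
    for n in product(*(range(k + 1) for k in m))
)
```

The identity holds because the factor p^{|n|} (1 − p)^{|m| − |n|} is common to both sides, and the binomial coefficient on the totals cancels with the normalising constant of the hypergeometric term.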
Note that when s ∈ N, Ga(α i + m, β + s) is the posterior of a Ga(α i , β) prior given s Poisson observations with total count m. Hence the dual component M i,t is interpreted as the sum of the observed values of type i, and S t ∈ R + as a continuous version of the sample size. In particular, (4.14) shows that a multivariate CIR propagates a vector of gamma distributions into a mixture whose kernels factorise into a gamma and a Dirichlet distribution, and whose mixing weights are driven by a one-dimensional death process with Binomial transitions together with hypergeometric probabilities for allocating the masses.
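The gamma-Poisson update behind this interpretation is a one-line computation, sketched here in Python. The function name is hypothetical, and the gamma distribution is parameterised by shape and rate, as the notation Ga(α i , β) suggests.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Posterior (shape, rate) of a Ga(alpha, beta) prior on a Poisson mean
    after observing the i.i.d. Poisson counts in `counts`.

    The shape grows by the total count m and the rate by the number s of
    observations, giving Ga(alpha + m, beta + s).
    """
    return alpha + sum(counts), beta + len(counts)


# Three Poisson observations with total count 5.
shape, rate = gamma_poisson_update(2.0, 1.0, [3, 0, 2])
posterior_mean = shape / rate
```

In the dual process the role of the integer sample size s is played by the continuous quantity S t , so the rate parameter of each mixture component decreases smoothly between observation times instead of moving by unit steps.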
The following Proof of the conjugacy for mixtures of gamma random measures is due to Lo (1982) and outlined here for the ease of the reader.
where n are the multiplicities of the distinct values in y 1:n . Finally, by the independence of |z m | and z m /|z m |, the conditional distribution of the mixing measure follows by the same argument used in Proposition 3.1.
We are now ready to prove the main result for DW processes.
Since the inner sum is the only term which depends on multiplicities and since Dirichlet processes are characterised by their finite-dimensional projections, we are only left to show that where ñ denotes the projection of n onto (A 1 , . . . , A K ). This is the consistency with respect to merging of classes of the multivariate hypergeometric distribution, and so the result now follows by the same argument at the end of the proof of Theorem 3.1.
We conclude by proving the recursive representation of Proposition 3.2, whose argument is analogous to the FV case.
Proof of Proposition 3.2. The update operation (3.17) follows directly from Lemma 3.2. The prediction operation for elements of F Γ follows from Theorem 3.2 together with the linearity of (1.4) and a rearrangement of the sums, so that
As a final comment concerning the strategy followed for proving the propagation result in Theorems 3.1 and 3.2, one could be tempted to work directly with the duals of the FV and DW processes (Dawson and Hochberg, 1982; Ethier and Kurtz, 1993; Etheridge, 2000). However, this is not optimal, due to the high degree of generality of such dual processes. The simplest path for deriving the propagation step for the nonparametric signals appears to be resorting to the corresponding parametric dual by means of projections and by exploiting the filtering results for those cases.