A Bayesian approach to disease clustering using restricted Chinese restaurant processes

: Identifying disease clusters (areas with an unusually high inci- dence of a particular disease) is a common problem in epidemiology and public health. We describe a Bayesian nonparametric mixture model for disease clustering that constrains clusters to be made of adjacent areal units. This is achieved by modifying the exchangeable partition probability function associated with the Ewen’s sampling distribution. We call the resulting prior the Restricted Chinese Restaurant Process, as the associated full conditional distributions resemble those associated with the standard Chinese Restaurant Process. The model is illustrated using synthetic data sets and in an application to oral cancer mortality in Germany.


Introduction
A disease cluster is a higher-than-expected incidence of a particular disease or disorder occurring in close proximity in terms of both time and geography. Although communicable diseases (those that can be spread from one person to another, such as the flu or HIV) often occur in clusters, clusters of noncommunicable disease are rare and their presence might indicate the presence of a harmful environmental factor or other hazard. Therefore, identification of cancer clusters is a key task in epidemiology and public health.
A strand of the statistics literature on disease clustering focuses on methods for confirmatory cluster analysis. Sometimes called focused tests, these methods are concerned with determining whether the rate of disease in a pre-specified area (which usually contains some putative health hazard) is higher than expected (e.g., see Stone, 1988, Besag & Newell, 1991, Tango, 1995, Morton-Jones et al., 1999. In contrast, the focus of this paper is on methods for de novo identification of disease clusters in datasets in which the presence of such clusters is not known. Methods based on scan statistics (e.g., see Weinstock, 1981, Kulldorff, 1997, Tango & Takahashi, 2005 are well known examples of this type of approaches. Implementations of classical approaches to disease cluster analysis are widely available in a number of platforms. One example is the R package DCluster (Gómez-Rubio et al., 2005).
Methods for disease clustering can also be classified according to whether they are designed to work with point-referenced or with spatially aggregated (areal) data. In the case of point-referenced data, it is common to distinguish between distance-based methods (Whittemore et al., 1987, Besag & Newell, 1991, and Tango, 1995, which derive tests based on the distribution of the time/distance between locations on which events occurred, and quadrat-based methods (e.g, Openshaw et al., 1987, Kulldorff & Nagarwalla, 1995, which study the variability of case counts in certain subsets of the region of interest (called quadrats). In the case of areal data, frequency tests similar to those used in quadrat-based methods are frequently used (e.g., see Whittinghill, 1966a andWhittinghill, 1966b). Bayesian methods for disease clustering in spatially aggregated data have been proposed by Knorr-Held & Raßer (2000), Gangnon & Clayton (2000), Green & Richardson (2002), Gómez-Rubio et al. (2018), Wakefield & Kim (2013), and Anderson et al. (2014). Other recent contributions to the field include the work of Moraga & Montes (2011), Charras-Garrido et al. (2012, Heinzl & Tutz (2014), and Wang & Rodríguez (2014). Kulldorff et al. (2003), Waller et al. (2006), and Goujon-Bellec et al. (2011) present detailed comparisons of various methods for disease clustering.
It is worth noting that the main goals of disease clustering methods are similar but distinct from those of disease mapping. Typically, disease mapping applications deal with the estimation of smooth covariate-adjusted risk measures, but do not aim at identifying discontinuities in the risk function. On the other hand, the whole point of methods for de novo identification of cancer cluster is to pinpoint such discontinuities. Of course, these two objectives are not necessarily opposed (e.g., see Knorr-Held & Raßer, 2000, Green & Richardson, 2002, and Anderson et al., 2014, but most techniques designed for disease mapping are not directly applicable in the context of disease clustering. The literature on disease clustering is also related to, but distinct from, the literature on boundary analysis in areal data (sometimes referred to as "areal wombling", e.g., see Lu & Carlin, 2005;Lu et al., 2007;Fitzpatrick et al., 2010;Li et al., 2015b;Guhaniyogi, 2017).
In this paper we develop a Bayesian approach for de novo identification of disease clusters in areal data. Our approach uses a restricted version of the Exchangeable Partition Probability Function (EPPF) associated with a species sampling model (SSM) (Pitman, 1995(Pitman, , 1996 as a prior on the partition of areal units. To simplify our exposition, we focus here on the SSM associated with the Dirichlet process (Ferguson, 1973;Blackwell & MacQueen, 1973;Antoniak, 1974;Lee et al., 2013;Rodríguez & Quintana, 2015), which is sometimes referred to as the Chinese restaurant process (CRP). However, the formulation is more general and our key results (particularly around the form of the full conditional distributions associated with the prior) extend to other SSMs such as the Generalized CRP induced by the two-parameter Poisson-Dirichlet process (Pitman & Yor, 1997).
The restricted prior we introduce in this paper is specifically designed to enforce clusters made of adjacent spatial units (which we call admissible). The approach we develop in this paper is related to those developed in Fuentes-García et al. (2010) and Martínez et al. (2014) in the context of time series data. Fuentes-García et al. (2010) consider a special case of our model that assumes ordered observations and uses reversible jump Markov chain Monte Carlo algorithms for inference. More recently, Martínez et al. (2014) propose a change-point model constructed by restricting a Generalized CRP (Pitman, 1995;Gnedin & Pitman, 2006), but their proposal differs from ours in the way the probability associated with inadmissible partitions is redistributed. Our model can be seen as generalizing the ideas in Fuentes-García et al. (2010) and Martínez et al. (2014) to situations in which the EPPF is restricted to partitions driven by general neighborhood graphs. We also show that, for our construction, the full conditional distributions associated with the restricted prior take a simple and appealing form, making the use of reversible jump algorithm unnecessary.
The model we introduce here is also related to the literature on spatiallydependent mixture models. Fernández & Green (2002) consider the use of a Potts model as the joint prior on the cluster indicators of a finite mixture model, leaving the question of how the number of clusters is to be selected open. Loschi & Cruz (2005), Müller et al. (2011), andPage et al. (2016) consider extensions of Hartigan's product partition model (PPM) (Hartigan, 1990) in which the so-called coherence functions account for temporal and/or spatial dependence. These models cannot be easily generalized to other SSMs beyond the CRP, where the prior on the partition cannot be written in terms of the product of coherence functions. Dahl (2008), Blei & Frazier (2011), Ghosh et al. (2011), and Dahl et al. (2017 consider Chinese restaurant processes in which co-clustering probabilities are functions of the distance between observations. In a similar spirit, Li (2015), Li et al. (2013), Li et al. (2014), Li et al. (2015a), Li et al. (2016a), Li et al. (2016b), andLi et al. (2016c) generalize the approach of Blei & Frazier (2011) so that the clustering probabilities depend on side information. A common feature of all these approaches is their focus on soft constraints that encourage nearby areas to cluster together but still allow clusters to be disconnected. In contrast, our focus is on ensuring that clusters are fully con-nected. This type of constraint is the most natural one in the context of our application to disease clustering, and cannot be easily enforced with any of the models discussed above. For example, Page et al. (2016) note that computational challenges arise when a restricted cohesion function in a PPM is considered in order to assign zero probability to "non-desirable" cluster configurations. Our proposal overcomes these challenges.
The remaining of the paper is organized as follows: Section 2 presents our model and discusses its properties. Section 3 describes our computational approach. Sections 4 and 5 present the analysis of two simulated data sets and an application to oral cancer mortality in Germany, respectively. Finally, Section 6 discusses the limitations of our model as well as future research directions.

The model
Suppose that areal data in the form of pairs (y i , h i ) are available, where y i records the observed number of cases in region i, and h i represents the expected number of cases in region i, obtained by internal or external standardization, for i = 1, . . . , n. As is common in the literature, we assume that the counts in region i are independently Poisson distributed and model their rates as a function of h i and their log-scale relative risk, η i , log-RR for short. More specifically, where η i = x t i θ i , x t i is the transpose of a p-dimensional vector of covariates associated to region i, θ i ∈ R p is the random effect associated with region i, and P ois(λ) denotes the Poisson distribution with rate λ > 0. When no covariates are available, i.e., x i,j = 1, for all i, j, the log-RR reduces to η i = θ i , with θ i ∈ R.
Our approach assumes that the geographic information associated with the data set is encoded in a known n × n binary adjacency matrix, W = [w i,i ]. This adjacency matrix can be interpreted as defining an unweighted, undirected graph G whose nodes correspond to the different geographical regions under study. In our illustration we focus on first-order neighborhood matrices in which w i,i is equal to 1 if regions i and i share a common boundary, and equal to 0 otherwise. However, our methodology applies more generally to higher order neighborhoods, or to other ways to define the (binary) adjacency relationship, e.g., by means of the distances between centroids or of distances based on length of shared boundary.
Recall that our interest is to identify clusters of spatially connected regions that share the same relative log-risk. Therefore, regions i and j can belong to the same cluster k only if G contains a path that connects them for which all the nodes in the path also belong to cluster k. In what follows we will propose a spatially restricted prior distribution for the cluster membership random variable and illustrate how specific adjacency matrices impact the probability mass function of the number of clusters.
To construct our prior on the random effects we borrow ideas from the modelbased clustering literature. More specifically, we augment the model in (1) with Table 1 All possible partitions and cluster configurations for a sample of size n = 3.  Table 1). If no spatial information were available (or, alternatively, if W corresponds to the complete graph), it would be convenient to assign c a Chinese restaurant process prior (CRP) (Pitman, 1995), where is the number of labels having value k or, equivalently, the number of observations in cluster k and Γ(z) = ∞ 0 t z−1 e −t dt denotes the Gamma function.
The corresponding full conditionals are given by where c −i = (c 1 , . . . , c i−1 , c i+1 , . . . , c n ). Therefore, each c i is either an already existing label, with probability proportional to n k (c −i ), or a new label, with probability proportional to α. Clearly, the concentration parameter α controls the number of clusters, with larger values of α favoring larger numbers of clusters a priori.
In order to define a spatially restricted prior distribution that enforces connected clusters, we propose to modify (2) by giving zero probability to configurations that involve clusters with non-connected components. More specifically, let G A k be the subgraph of G involving only the nodes that belong to the set A k . We call a partition A 1 , . . . , A K admissible if G A k is a connected subgraph (but not necessarily complete) for every k = 1, . . . , K. If we define the function Q(c, W ) as being equal to 1 whenever c is an admissible cluster configuration under W , our prior takes the form where the normalizing constant C(α, W ) is given by We call this prior the Restricted Chinese Restaurant Process with parameters α and W , denoted c | α, W ∼ RCRP (α, W ).
In general, there is no closed form expression for C(α, W ); two exceptions are provided in Appendix A. However, we can still make some general statements. For example, we note that C(α, W ) is a polynomial function of degree n in α, i.e., we can write where the coefficient f l (W ) is a weighted sum over the admissible partitions involving l clusters, In particular, f 1 (W ) = Γ(n) for any W , f n (W ) = 1 for any W that implies a graph G that is connected (and f n (W ) = 0 otherwise), and f l (W ) = |S n,l |, the unsigned Stirling number of the second kind, when W implies the complete graph. Figure 1 presents the probability associated with the number of clusters K(c) for the unrestricted CRP, as well as for the neighborhood structures associated with a star and a linear graphs (see Appendix A), when n = 6. Recall that the prior on partitions implied by the model of Fuentes-García et al. (2010) corresponds to the linear graph setting, while the model in Martínez et al. (2014) is constructed to ensure that the prior on K(c) matches the one for the unrestricted CRP. It is clear from the graph that the probability of the partitions depends on the underlying adjacency matrix. For both restricted cases, the probability of K = 1 and K = n are higher than in the non-restricted case.
For smaller values of α (α = 0.1 and α = 1), the linear graph seems to favor a smaller number of clusters than the star graph. This pattern is reversed for the larger values of α (α = 3 and α = 10).
While an explicit expression for C(α, W ) is generally not available, the full conditional distributions associated with (4) take a particularly simple, appealing, and computationally convenient form (see Appendix B): Hence, when the assignment of a region to a particular cluster leads to an admissible configurations, (7) and (3) agree. On the other hand, for assignments that lead to inadmissible configurations, the full conditional is zeroed out. As before, α controls the number of clusters.
The CRP prior has sometimes been criticized in the context of clustering applications because of their tendency to create clusters of unbalanced size. In disease clustering applications, where the disease clusters can usually be expected to be rare and small a priori, this behavior (which will carry out to the restricted model we discuss below) is an appealing feature of the model.
Having defined the spatially restricted prior distribution for the labeling random variables, the rest of the model is specified byθ k iid ∼ N (μ, σ 2 ), where μ ∈ R and σ 2 ∈ R + are the mean and variance of the normal distribution. Additionally, we consider hyperprior distributions for α, μ, and σ 2 . Our model can finally be written as with one more level for the hyperpriors where the parameters κ, φ 2 , a, and b are fixed, IG(a, b) denotes the inverse gamma distribution with shape and scale parameters a > 0 and b > 0, respectively, and π is a distribution on R + .

Computational aspects
We use Markov Chain Monte Carlo (MCMC) algorithms (Smith & Roberts, 1993;Robert & Casella, 2005) to generate samples from the posterior distributions associated with the proposed model. This Section describes the full conditional distributions involved. The full conditional distribution for the components of the indicator vector c takes the relatively simple form: σ 2 ) and f (y i |θ k ) denotes the probability mass function of a Poisson distribution with rate parameter h i eθ k . This resembles algorithm 8 from Neal (2000). Note that there is an implicit and very standard relabeling of vector c every time a cluster becomes empty (i.e., n k = 0, for some k), as in the "no gap" algorithm of MacEachern & Müller (1998). The Markov chain that results from cycling through these full conditional distributions is irreducible: we can move from between any two admissible configurations by first breaking each cluster (one region at a time, starting with the "periphery" to ensure that admissibility is preserved), and then reassembling the new clusters.
While the need to repeatedly check on the admissibility of the configurations might suggest that the computational cost of implementing this algorithm would be high, that is not the case. Using the fact that the current configuration must be admissible, a careful implementation of the algorithm only requires that, for each i, we check that the cluster currently containing observation i remains fully connected if that observation is removed (which, in the worst case scenario, can be done in quadratic time on the size of that cluster). Then c i is updated with the label of any of its neighbours (which can be directly identified from W in linear time) or with a new label according to the probabilities given by (10).
As with a standard mixture model, posterior distributions for log-RR parametersθ 1 , . . . ,θ K(c) are conditionally independent and take the form where y + k = {i:ci=k} y i and h + k = {i:ci=k} h i . Since these posterior distributions do not belong to any tractable family of distributions, one must resort to algorithms such as random walk Metropolis-Hastings (M-H) or Hamiltonian Monte Carlo to sample from them. However, these algorithms require that the user selects a number of tuning parameters (such as the variance of the random walk in random walk M-H algorithms, or the size and number of steps in Hamiltonian Monte Carlo algorithms). Instead, we resort to slice samplers algorithms (Damien et al., 1999), which do not require the selection of any tuning param-eters and therefore facilitate the use of our approach by practitioners. More specifically we introduce unit-rate exponentially distributed auxiliary random variables, u k , leading to truncated exponential and truncated normal conditional distributions for u k andθ k , respectively, where Exp(· | λ) denotes the exponential distribution with rate λ. On the other hand, the hyperparameters μ and σ 2 are sampled from their conjugate posterior distributions Finally, we discuss the process of generating samples from the full conditional distribution of α. This step is particularly difficult because the full conditional is double intractable: additionally to the posterior not belonging to any known distribution, it involves the computation of the intractable normalizing constant C(α, W ). In this paper we use the noisy exchange algorithm (NEA), proposed by Alquier et al. (2016), for sampling from (12). The NEA updates α using a M-H step replacing the ratio of normalizing constants C(α, W )/C(α * , W ) by an unbiased importance sampling estimator. Note that, as a consequence of importance sampling, where E c |α,W denotes the expectation under c | α, W . Therefore, the ratio of normalizing constants is approximated by where c 1 , . . . , c N are samples from (4), obtained by running a second MCMC algorithm based on the full conditionals given by (7). In order to speed up computations, we considered a discrete prior distribution for α with support on a relatively small number of point masses, and computed the ratios of normalizing constants in advance.

Simulated data
We conduct two simulation studies to ascertain the performance of our model. Both scenarios assume that the spatial association is given by the first-order neighborhood structure of the counties in the U.S. state of Ohio, and that h i = 100 for every i = 1, . . . , 88. Scenario I involves eight snake-shaped clusters with log-RR θ i ∈ {−2, 0, 2} (see Figure 2, top row). Note that in this case some clusters share same disease risk. Scenario II is formed by 4 round-shaped connected clusters with log-RR θ i ∈ {−1.5, −0.5, 0.5, 1.5}, so each cluster has a different disease risk (see Figure 2, bottom row).

A comparison model
We compare the performance of our model against the boundary distance (BD) model proposed by Knorr-Held & Raßer (2000). The BD model uses a Poisson likelihood similar to ours, but assigns a different prior distribution on the partition indicators c 1 , . . . , c n that is inspired by K-means clustering. More specifically, their prior is specified hierarchically through a prior distribution on the number of clusters, K, a uniform prior distribution on the set of cluster centers, denoted (g 1 , . . . , g K ), with g k ∈ {1, . . . , n}, and a measure of distance between regions, which in their case is given by the minimum number of boundaries that need to be crossed to move between the two regions. Given K and (g 1 , . . . , g K ), each region i is assigned to the cluster whose center is closest. If one region is equally distant from two cluster centers, then it is assigned to the cluster with the smallest index position. As we will show in our simulations, this prior strongly favors round shaped cluster configurations. In terms of the prior for the log-RR, Knorr-Held & Raßer (2000) make a choice that is similar to ours. In particular, they set θ i =β ci , with logβ k ∼ N (μ, σ 2 ). Their posterior sampling MCMC scheme is based on a reversible jump MCMC iterating over birth, death, shift, switch, height and hyper steps. This algorithm tends to mix very slowly, and requires a large number of iterations (in the order of millions of samples) to produce approximations with a reasonably small Monte Carlo error.

Prior specification and comparison criteria
In our simulation studies, we compare the performance of the RCRP model and BD model, under different specifications of the prior distributions. For the RCRP model, we fix α = 4 (resulting on a prior distribution on the number of clusters centered roughly around K = 5), and a product of independent priors for (μ, σ 2 ), π 1 (μ, σ 2 ) = N (μ | q 0.5 , s 2 n /2)IG(σ 2 | 2, s 2 n /2), where q 0.5 and s 2 n denote the median and unbiased sample variance of log(y i /h i ). The hyperparameters κ = q 0.5 and a = 2 were chosen such that the prior mean of the log RR are centered at q 0.5 and the prior variance of σ 2 is infinity. A sensitivity analysis involving three more combinations of hyperparameters for φ 2 and b is included in the supplementary material (Wehrhahn et al., 2020). Results appeared to be robust to the different prior specifications.
To ensure that the comparison between models is fair, we slightly modify the hyperpriors for the BD model from those originally used in Knorr-Held & Raßer (2000). In particular, we assign (μ, σ 2 ) the same hyperpriors as the RCRP model. Additionally, rather than the original geometric prior used by Knorr-Held & Raßer (2000), we consider two slightly different prior distributions for K that more closely resemble the prior implied by our model. These two priors correspond to a truncated Poisson and a truncated Negative Binomial, respectively.
For each of the two models, we report point estimates for the cluster configuration,ĉ, heat maps of the posterior probability of two regions belonging to the same cluster, π(c i = c j | y), i = j (which provide a measure of the uncertainty associated with the point estimates), and a comparison between the prior and posterior distributions over the number of clusters K. The point estimateĉ is obtained by minimizing (using iterative componentwise optimization) a slightly modified version of the expected loss function discussed in Lau & Green (2007), Note that the ratio w 1 /w 2 controls the relative loss of incorrectly clustering or separating a pair of regions, and the multiplier Q(ĉ, W ) ensures that our point estimate corresponds to an admissible partition. In our illustrations we set w 1 /w 2 = 1. We evaluate the ability of the models to identify clusters using the adjusted random index (ARI, Hubert & Arabie, 1985) of the posterior cluster configurations. The ARI evaluates the agreement in cluster assignment between two cluster configurations. It ranges between −1 and 1, larger values indicating agreement between cluster configurations. On the other hand, estimates of the log-RR are evaluated through the mean squared error (MSE) of the posterior mean of the log-RR.

Results
For each model, a single Markov chain was generated. In all cases, the inferences presented below are based on 10,000 samples obtained after burn-in and thinning. The amount of burn-in varied; we discarded 40,000 samples in both instances of the RCRP model, 200,000 for the BD model in Scenario I, and 300,000 for the BD model in Scenario II. Convergence was evaluated by standard convergence tests, as implemented in the CODA R package (Plummer et al., 2009), and by examining the trace plot of the log-posterior distribution, the number of clusters K(c), and the hyperparameters μ and σ 2 . Figure 3 displays point estimates for the cluster configurations for Scenario I and Scenario II. Under Scenario I, the BD model struggles even though the rates associated with the clusters are quite well differentiated. In particular, note that the BD model breaks down the 8 snake-like clusters into a large number of very small, round clusters. Under scenario II, the BD model improves its performance substantially, but still tends to slightly overestimate the number of clusters. In particular, note that the upper left and bottom right clusters are being broken down by the BD model into two subclusters each. In contrast, the RCRP model is able to recover the true cluster structure in both scenarios. Figure 4 provides further insight into the estimates of the cluster structure by displaying the heat maps of the posterior probability of two regions belonging to the same cluster. To facilitate visualization, regions are ordered according to the true cluster configuration. In general, there is very little uncertainty associated with the point estimates presented in Figure 3, particularly for the RCRP model. Along similar lines, Figure 5, displays boxplots of the posterior distribution of the ARI. We can see that, while the posterior distribution of the ARI for the RCRP model is concentrated around 1 (further confirming that the model places high probability on the true clustering configuration), the values for the BD model tend to be much smaller, particularly in Scenario I. It is also worth noting that, for the BD model, π 2 (K) leads to much higher variability in the quality of the cluster estimates.
Finally, we compute the MSE of the log-RR with respect to the true log-RR. Under Scenario I the MSE for the RCRP model and α = 4, BD model and π 1 (K), and BD model and π 2 (K) were 0.00145, 0.00579, and 0.00731, respectively. On the other hand, under Scenario II, the respective MSEs were 0.00057, 0.00073 and 0.00074. For both scenarios the RCRP model has the best performance; in the best case scenario, the MSE of the BD model, was 3.99 and 1.28 times bigger that the MSE for the RCRP model in Scenario I and Scenario II, respectively. Also, note that, under the BD model, prior π 1 (K) shows the best performances.
Further results comparing the performance of the models regarding the number of clusters and the log-RRs can be found in the online supplementary material (Wehrhahn et al., 2020).

An application to oral cancer in Germany
As a second illustration, this Section reports our analysis of the oral cancer mortality data discussed in Knorr-Held & Raßer (2000) (see Figure 6). The     data set registers the observed and expected number of deaths during the period 1986-1990 across 544 administrative districts in Germany. As in the simulation studies, this analysis emphasizes a comparison between the RCRP and the BD models.

Prior specification
Under the RCRP model, we employ a discrete uniform prior distribution for α. This prior has support on the set {16, 20, 24, 28, 32} and is denoted π 1 (α). This choice results in a prior distribution for the number of clusters centered around 16. As we discussed before, the use of a discrete prior enables additional flexibility while containing the computational cost of the MCMC algorithm (by allowing us to pre-compute approximations to the ratio of intractable normalizing constants used in the noisy exchange algorithm). As in the case of the simulated data, the hyperprior on (μ, σ 2 ) is given by (14). For the BD model, two prior specifications for K were considered: π 3 (K) = NB(K | p = 0.65, r = 92.84)1 (K ≥ 1) , π 4 (K) = NB(K | p = 0.01, r = 1)1 (K ≥ 1) .

Results
As before, we generate a single Markov chain for each model specification. For the RCRP model, one sample of size 10,000 was generated, obtained by saving 1 out of every 10 iterations and after a burn-in period of 20,000 samples. In order to approximate the intractable ratio of normalizing constants, one sample of size 10,000 was generated for each pair of values of α in the support of the discrete prior, after a burn-in period of 15,000 iterations. Based on these samples, the ratios of normalizing constants, used in the M-H updating step of α, were estimated as in (13). For the BD model, we also generated posterior samples of size 10,000. However, following Knorr-Held & Raßer (2000), in this case the samples were obtained after a burn-in period of 1,000,000 iterations by saving 1 out of every 10,000 iterations. As in the case of the simulated data, convergence of these chains was evaluated by standard convergence tests, as implemented in the CODA R package (Plummer et al., 2009), and by examining the trace plot of the log-posterior distribution, the number of clusters in the data, and the hyperparameters μ and σ 2 . Figure 7 presents our point estimate of the partition structure under each model. The differences are again striking. The RCRP model identifies five clusters: a main one, formed by the vast majority of regions, two small ones (formed by 110 and 9 regions, respectively), and a couple of singleton clusters. In contrast, the BD model identifies 84 and 98 clusters under π 4 (K) and π 3 (K), respectively. The shape of the clusters suggests that we might be in a situation that is similar to the one in our first simulated data sets, where the BD model artificially splits large, non-circular clusters into a large number of smaller, roughly circular ones.
To further emphasize the difference in the reported partition structure, we present in Figure 8 heat maps of the posterior probability of two regions belonging to the same cluster. From these graphs we can see that, while the level of uncertainty in the point estimates is somewhat larger in this case when compared to the simulation studies, it is clear that all models are quite certain about the main features of their reported partition estimates.
Finally, Figure 9 displays the posterior mean log-RR estimate under both models. There are some clear similarities in the estimated rates. For example, both the RCRP and the BD models estimate a higher incidence of the disease in the Southwest corner of Germany, and a lower incidence in the East of the country. However, it is clear that the RCRP model smooths the rates much more than the BD model. This is not surprising given the very different estimates of the partition structures induced by these models.
Further results comparing the performance of the models regarding the number of clusters can be found in the supplementary material (Wehrhahn et al., 2020).

Discussion
We have proposed a restricted mixture model for detecting clusters of non communicable diseases. The restriction is imposed in the CRP prior for the cluster membership vector, constraining the space of possible configurations only to those resulting in connected clusters, i.e., those where there is a path joining any pair of regions that includes only regions that belong to the cluster. We show that the model is very flexible and less computationally demanding that the alternative BD model.
A number of extensions of this model are possible. For example, while we have focused here on restrictions of a CRP prior, the basic approach could be used to restrict any other exchangeable SSM. Our key result around the structure of the full conditional distribution of the restricted model should extend in a straightforward fashion. Similarly, the model could easily accommodate different likelihood functions and other more general definitions of the binary neighborhood matrix W . Furthermore, while this paper has focused on applications to disease clustering, it is clear that our approach can be extended to applications in time series and image segmentation. Another extension would involve using dependent priors for the cluster specific parametersθ 1 , . . . ,θ K . Indeed, we might expect that nearby clusters have more similar rates than clusters located far away. For example, we could consider a (proper) conditionally autoregressive prior forθ 1 , . . . ,θ K . However, such an approach introduces complex identifiability issues that make prior elicitation complex. In particular, note that a model in which there is a single cluster is equivalent to the (limit) model in which we allow for any number of clusters but make the spatial correlation in the prior goes to 1. This means that the prior distribution would have an even more critical effect on the model, one that would be hard to measure a priori. In that sense, the use of independent priors can be seen as a kind of "maximum separation" prior that will maximize the ability of the model to identify a disease cluster. Finally, an anonymous referee suggested the use of a more general form for Q(c, W ) that would allow for continuous values on [0, 1].
Our model works by restricting an EPPF, which leads to a model that is partially exchangeable. That is, the probability distribution implied by the model is invariant to simultaneous permutations of the observations and the rows and columns of the neighborhood matrix W . A referee pointed out that using an exchangeable PPF as the starting point is not required. While this is true as a general modeling strategy, the appropriateness of such an approach will depend on the application at hand. For example, in applications to time series data (where there is a natural ordering to the observations), the use of such a nonexchangeable PPF as our starting point might make sense. However, in spatial applications such as the ones considered in this paper, partial exchangeability would seem to be the right assumption: why would one want to have a different model when counties are listed alphabetically in the database vs. when they are listed by, say, population size?
One shortcoming of our model is that it uses a single set of random effects to capture both overdispersion in the Poisson model and spatial dependence. One way to address this limitation is to use two sets of random effects: one set in which they are independent (meant to capture over-dispersion), and another set in which they are dependent and modeled using our restricted CRP (therefore capturing the spatial effects in the data). See, for example, Banerjee et al. (2014). Such an approach could be easily incorporated into our disease clustering model.

Appendix A: Two special restrictions
In what follows we consider two special adjacency graphs for which the normalizing constant C(α, W ) can be computed in closed form: (1) when the underlying graph G is the star graph, and (2) when the underlying graph G is linear. The former is of interest for its simplicity, and the latter because of its potential application to modeling ordered/time series data.
The analysis of the these two special graphs is also useful in terms of highlighting the differences between our model and that of Martínez et al. (2014). Indeed, if the approach in Martínez et al. (2014) was extended beyond time series models to accommodate general adjacency graphs, it would lead to exactly the same prior distribution on partitions under either graphs. That is because the number of admissible partitions for any size happens to be the same under both graphs. In contrast, our model leads to quite different specifications for these two graphs because our prior weights partitions according to the size of the clusters.

A.1. Restriction under a star graph
Under this adjacency structure, in any cluster configuration with l clusters there will be exactly l −1 singleton clusters, and one cluster with n−l +1 observations (which must include the root node). This is because any cluster of size greater than one must necessarily include the root node in order to be formed of adjacent regions. See the top row of Table 2 for an example with n = 4. Hence, f l (W star ) = n − 1 n − l Γ(n − l + 1) = (n − 1)! (l − 1)! , and C (α, W star ) = n l=1 (n − 1)! (l − 1)! α l .

A.2. Restriction under a linear graph
In a linear graph, there are also a total of n−1 l−1 configurations involving exactly l clusters. Indeed, note that choosing l adjacent clusters is equivalent to picking l − 1 breakpoints out in n − 1 possible positions for them. However, unlike the star case, each of these configurations involves clusters of different sizes (see the bottom row of Table 2). configuration, then we must consider two cases. For l ≤ K(c −i ), This follows from the fact that Γ (n k (c −i ) + 1(k = l)) = Γ (n k (c −i )) k = l, n k (c −i )Γ (n k (c −i )) k = l.
On the other hand, if l = K(c −i ) + 1, The simplified expression in (7) are obtained by noting that both (16) and (17)  The previous derivation assumes that there is at least one value of c i in the set {1, . . . , K(c −i ) + 1} that leads to an admissible configuration. Otherwise the normalizing constant is zero and the full conditional is not well defined. Since our MCMC algorithm must start in an admissible configuration, and the full conditionals maintain admissibility, this is not an issue for our purposes.