Original Research
Bayesian cluster detection via adjacency modelling

https://doi.org/10.1016/j.sste.2015.11.005

Highlights

  • We develop a two-stage spatial model for clustering disease risk.

  • The first stage produces potential cluster structures; the second stage fits these within a CAR model.

  • Model outperforms alternatives in a simulation study.

  • Applied to COPD mortalities in English local authorities, the model identifies a north–south divide.

Abstract

Disease mapping aims to estimate the spatial pattern in disease risk across an area, identifying units which have elevated disease risk. Existing methods use Bayesian hierarchical models with spatially smooth conditional autoregressive priors to estimate risk, but these methods are unable to identify the geographical extent of spatially contiguous high-risk clusters of areal units. Our proposed solution to this problem is a two-stage approach, which produces a set of potential cluster structures for the data and then chooses the optimal structure via a Bayesian hierarchical model. The first stage uses a spatially adjusted hierarchical agglomerative clustering algorithm. The second stage fits a Poisson log-linear model to the data to estimate the optimal cluster structure and the spatial pattern in disease risk. The methodology was applied to a study of chronic obstructive pulmonary disease (COPD) in local authorities in England, where a number of high risk clusters were identified.

Introduction

Disease risk varies geographically as a result of many factors, including differences in environmental exposures, and cultural and behavioural differences between the inhabitants of different areas. Within a country such as England, there are substantial inequalities in terms of health and disease risk, with poverty being one of the most important reasons for these differences (Marmot et al., 2010). Disease maps allow us to illustrate these differences graphically. Such maps are produced by partitioning the study region into n non-overlapping areal units such as electoral wards or census tracts, and then calculating the overall risk of disease for the population living in each areal unit. Health agencies routinely produce such maps for numerous diseases, including cancer (Public Health England, 2010) and cardiovascular disease (Centers for Disease Control and Prevention, 2011). These maps are mainly used by public health officials to visually identify high-risk clusters of areal units, so that resources can be focused on the areas exhibiting elevated disease risk.

Many different approaches have been proposed for the identification of the spatial extent of high-risk clusters in spatial disease maps, including Bayesian hierarchical modelling (Charras-Garrido et al., 2012), scan statistics (Kulldorff, 1997) and point process methodology (Diggle et al., 2005). The first of these is typically based on a Poisson log-linear model, where covariates and/or a set of random effects are used to represent the spatial disease risk pattern. The random effects are included to account for spatial autocorrelation in the response that was not captured by the covariates, and are typically modelled by a conditional autoregressive (CAR) prior. These priors were proposed by Besag et al. (1991) and developed by Leroux et al. (1999), and are a type of Gaussian Markov random field. CAR priors make the naive assumption of global correlation between all pairs of random effects in geographically adjacent areal units, and therefore produce a spatially smooth risk surface. However, such smoothing is detrimental to our main aim, which is to identify groups of areas which have much higher (or lower) risks compared with surrounding areas, so an alternative approach is required.
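To make the smoothing behaviour of a CAR prior concrete, the following is a minimal sketch rather than the specification used in this paper. It assumes one common parameterisation associated with Leroux et al. (1999), in which the precision matrix of the random effects is [ρ(D − W) + (1 − ρ)I] / τ², where W is a binary adjacency matrix and D is the diagonal matrix of neighbour counts. The function name, the four-unit region and the value ρ = 0.9 are hypothetical and chosen only for illustration.

```python
import numpy as np

def leroux_precision(W, rho, tau2=1.0):
    """Precision matrix [rho * (D - W) + (1 - rho) * I] / tau2 (assumed Leroux-type form)."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))          # number of neighbours of each unit
    n = W.shape[0]
    return (rho * (D - W) + (1.0 - rho) * np.eye(n)) / tau2

# Hypothetical region: four areal units arranged in a line, 0-1-2-3.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

Q = leroux_precision(W, rho=0.9)

# Implied prior correlations between the random effects.
cov = np.linalg.inv(Q)
sd = np.sqrt(np.diag(cov))
corr = cov / np.outer(sd, sd)
```

In this sketch the prior correlation between adjacent units (`corr[0, 1]`) exceeds that between the two ends of the line (`corr[0, 3]`), yet both are positive: every pair of units is correlated, which is the global smoothing that blurs the boundaries of high-risk clusters.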

Therefore, this paper outlines new methodology which allows for the estimation of the spatial pattern in disease risk, whilst simultaneously detecting the spatial extent of high or low risk clusters. In doing so the cluster structure is accounted for when estimating disease risk, so that high risk clusters are not smoothed towards their geographical neighbours that do not exhibit elevated risks. The methodology brings together hierarchical agglomerative clustering techniques and conditional autoregressive models in a two-stage approach. The first stage is a spatially-adjusted hierarchical agglomerative clustering algorithm first proposed in Anderson et al. (2014), which respects the spatial contiguity of the study region. This algorithm is applied to disease data preceding the study period to elicit n candidate cluster configurations containing between 1 and n clusters. The second stage fits an extended Poisson log-linear model to the study data, where Markov Chain Monte Carlo (MCMC) simulation methods are used to estimate both the optimal cluster structure and disease risk.
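The first stage can be illustrated with a simplified sketch of contiguity-constrained agglomerative clustering. This is not the algorithm of Anderson et al. (2014), whose details are not reproduced here; it assumes a single risk value per unit, uses the absolute difference between cluster means as the dissimilarity, and only allows geographically adjacent clusters to merge. The function name and the six-unit example are hypothetical.

```python
import numpy as np

def spatial_agglomerative(values, W):
    """Return candidate partitions with n, n-1, ..., 1 clusters.

    Only clusters that share at least one pair of adjacent units may merge.
    """
    values = np.asarray(values, dtype=float)
    W = np.asarray(W)
    n = len(values)
    labels = np.arange(n)               # start with every unit in its own cluster
    partitions = [labels.copy()]
    for _ in range(n - 1):
        means = {c: values[labels == c].mean() for c in np.unique(labels)}
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = labels[i], labels[j]
                if W[i, j] and a != b:  # adjacent units in different clusters
                    gap = abs(means[a] - means[b])
                    if best is None or gap < best[0]:
                        best = (gap, a, b)
        if best is None:                # no adjacent clusters left (disconnected map)
            break
        _, a, b = best
        labels = np.where(labels == b, a, labels)   # merge cluster b into cluster a
        partitions.append(labels.copy())
    return partitions

# Hypothetical line of six units, 0-1-2-3-4-5, with prior-period risk estimates.
W_line = np.zeros((6, 6), dtype=int)
for k in range(5):
    W_line[k, k + 1] = W_line[k + 1, k] = 1
risk = [1.0, 1.1, 0.2, 0.1, 0.15, 1.05]

candidates = spatial_agglomerative(risk, W_line)
```

The list `candidates` holds one configuration for each number of clusters from n down to 1, matching the n candidate structures described above. Note that unit 5 has a risk similar to units 0 and 1 but is not adjacent to them, so the three-cluster configuration keeps it separate; an unconstrained clustering would have grouped them together.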

Applying the clustering algorithm to the study data itself would necessitate the information in the data being used twice, once for eliciting a set of candidate cluster configurations and again for estimating the model parameters. To overcome this issue, a second data set is required for the clustering stage, and emphasis should be placed on obtaining a dataset which is as similar as possible to the study data. Possible choices include data on disease risk in the time period prior to the study period or data on a different disease from the same time period as the study data. This study utilises the former choice, because it is unlikely that there has been any substantial change in the spatial patterns in the population characteristics governing disease risk (such as poverty) unless substantial urban regeneration has taken place. The approach proposed in this paper is thus appropriate for data on chronic diseases whose risk factors are spatially stable, but would be unsuitable for epidemic diseases such as influenza, where the spatial pattern in disease risk in the years prior to an outbreak would be vastly different to the pattern during an outbreak.

The remainder of this paper is organised as follows. Section 2 gives a brief introduction to Bayesian disease mapping, and discusses the existing methods of cluster identification that have been proposed in this context. Section 3 proposes our new methodological extension, while Section 4 establishes its efficacy via simulation. Section 5 presents the motivating application for our methodology, a study of chronic obstructive pulmonary disease (COPD) mortalities in English local authorities in 2010. Finally, Section 6 discusses the implications of this paper and ideas for future work.

Section snippets

Study design and modelling

The study region A is partitioned into n non-overlapping areal units $A = \{A_1, \ldots, A_n\}$, and $Y = (Y_1, \ldots, Y_n)$ and $E = (E_1, \ldots, E_n)$ represent the observed and expected numbers of disease cases in each unit during the study period. The latter are constructed by external standardisation, based on the age and sex demographics of the population living in each areal unit. A Poisson log-linear model is commonly used to estimate disease risk, and a general form is given by

$$Y_i \mid E_i, R_i \sim \text{Poisson}(E_i R_i), \quad i = 1, \ldots, n, \qquad \ln(R_i) = x_i^{T}\beta + \phi_i$$
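As a small numerical illustration of the Poisson log-linear model described above, the sketch below computes the standardised incidence ratio Y/E for each unit and evaluates the Poisson log-likelihood for given values of the regression coefficients and random effects. It assumes one random effect per areal unit; the counts, the intercept-only design matrix and the function name are hypothetical, and no prior or MCMC step is included.

```python
import numpy as np
from math import lgamma

def poisson_loglik(Y, E, X, beta, phi):
    """Log-likelihood of Y_i ~ Poisson(E_i * R_i) with ln(R_i) = x_i' beta + phi_i."""
    Y = np.asarray(Y, dtype=float)
    E = np.asarray(E, dtype=float)
    log_mu = np.log(E) + X @ beta + phi             # log of the Poisson mean E_i * R_i
    log_fact = np.array([lgamma(y + 1.0) for y in Y])
    return float(np.sum(Y * log_mu - np.exp(log_mu) - log_fact))

# Hypothetical observed and expected counts for three areal units.
Y = np.array([12, 30, 7])
E = np.array([10.0, 25.0, 9.5])

# Standardised incidence ratio: the simplest estimate of the risk R_i.
sir = Y / E

# Intercept-only design matrix with the intercept fixed at zero (illustrative choice).
X = np.ones((3, 1))
beta = np.array([0.0])

ll_flat = poisson_loglik(Y, E, X, beta, np.zeros(3))   # R_i = 1 everywhere
ll_sat = poisson_loglik(Y, E, X, beta, np.log(sir))    # R_i = Y_i / E_i
```

Setting each random effect to the log of the standardised incidence ratio reproduces the data exactly and so gives the largest attainable likelihood. The role of a spatial prior on the random effects is to pull these noisy unit-level estimates towards a more stable risk surface.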

Method

We propose a two-stage approach for estimating the spatial pattern in disease risk and identifying spatially contiguous clusters that exhibit either elevated or reduced disease risks. In the first stage (Section 3.1) we utilise the spatially adjusted hierarchical agglomerative clustering algorithm proposed by Anderson et al. (2014), and use it to elicit a set of candidate cluster configurations for the data. In the second stage (Section 3.2) we propose a hierarchical Bayesian model for the

Simulation study

A simulation study was conducted to establish the efficacy of the two-stage modelling approach outlined in the previous section. The template for the study was based on the set of 324 local authorities in England, which is also the study region for the motivating application presented in Section 5. A study was conducted comparing the two-stage approach proposed here with existing alternatives, and the results are summarised below.

Study design

The study region is the country of England, which is the largest of the four constituent nations of the United Kingdom and has a population of approximately 53 million people. The country is divided into n=324 local authorities, containing populations of between 7338 and 1,061,074 people with a median value of 124,781. The disease data are the numbers of mortalities with a primary diagnosis of chronic obstructive pulmonary disease (COPD) in each local authority in 2010. The expected mortality

Discussion

The main aim of this paper was to develop statistical methodology to simultaneously estimate the spatial pattern in disease risk and identify clusters of areas exhibiting high (and low) risk. To achieve this aim a new methodology has been developed which fuses together spatial agglomerative hierarchical clustering techniques with an extended conditional autoregressive model, with inference based on Markov-Chain Monte Carlo simulation. This approach allows us to identify an optimal cluster

Acknowledgments

We would like to thank the editor and two referees whose comments have improved the motivation for and presentation of this paper. The work of the first author was funded initially by the Carnegie Trust and then by the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS).

References (21)

  • Anderson C. et al. Identifying clusters in Bayesian disease mapping. Biostatistics (2014)
  • Besag J. et al. Bayesian image restoration, with two applications in spatial statistics. Ann Inst Stat Math (1991)
  • Centers for Disease Control and Prevention. National Cardiovascular Disease Surveillance System. Technical report....
  • Charras-Garrido M. et al. Classification method for disease risk mapping based on discrete hidden Markov random fields. Biostatistics (2012)
  • Charras-Garrido M. et al. On the difficulty to delimit disease risk hot spots. J Appl Earth Obs Geoinf (2013)
  • Diggle P. et al. Point process methodology for on-line spatio-temporal disease surveillance. Environmetrics (2005)
  • Fraley C, Raftery AE, Murphy TB, Scrucca L. mclust version 4 for R: normal mixture modeling for model-based clustering,...
  • Green P. et al. Hidden Markov models and disease mapping. J Am Stat Assoc (2002)
  • Hastie T, Tibshirani R, Friedman J. The elements of statistical learning; Springer New York Inc. [chapter...
  • Knorr-Held L. Bayesian modelling of inseparable space-time variation in disease risk. Stat Med (2000)
There are more references available in the full text version of this article.
