Original Research
Bayesian cluster detection via adjacency modelling
Introduction
Disease risk varies geographically as a result of many factors, including differences in environmental exposures, and cultural and behavioural differences between the inhabitants of different areas. Within a country such as England, there are substantial inequalities in terms of health and disease risk, with poverty being one of the most important reasons for these differences (Marmot et al., 2010). Disease maps allow us to illustrate these differences graphically. Such maps are produced by partitioning the study region into n non-overlapping areal units such as electoral wards or census tracts, and then calculating the overall risk of disease for the population living in each areal unit. Health agencies routinely produce such maps for numerous diseases, including cancer (Public Health England, 2010) and cardiovascular disease (Centers for Disease Control and Prevention, 2011). The main use of these maps is to let public health officials visually identify high-risk clusters of areal units, so that resources can be focused on those areas exhibiting elevated disease risk.
Many different approaches have been proposed for the identification of the spatial extent of high-risk clusters in spatial disease maps, including Bayesian hierarchical modelling (Charras-Garrido et al., 2012), scan statistics (Kulldorff, 1997) and point process methodology (Diggle et al., 2005). The first of these is typically based on a Poisson log-linear model, where covariates and/or a set of random effects are used to represent the spatial disease risk pattern. The random effects are included to account for spatial autocorrelation in the response that was not captured by the covariates, and are typically modelled by a conditional autoregressive (CAR) prior. These priors were proposed by Besag et al. (1991) and developed by Leroux et al. (1999), and are a type of Gaussian Markov random field. CAR priors make the naive assumption of global correlation between all pairs of random effects in geographically adjacent areal units, and therefore produce a spatially smooth risk surface. However, such smoothing is detrimental to our main aim, which is to identify groups of areas which have much higher (or lower) risks compared with surrounding areas, so an alternative approach is required.
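The smoothing behaviour of a CAR prior can be made concrete with a small numerical sketch. The code below builds the precision matrix in the form associated with Leroux et al. (1999), Q = rho (D - W) + (1 - rho) I, where W is a binary adjacency matrix and D is the diagonal matrix of neighbour counts, and then evaluates the conditional prior mean of one random effect given its neighbours. The four-unit map, the value rho = 0.9, the random effect values and the function names are illustrative assumptions, not quantities from this paper.

```python
import numpy as np

def leroux_precision(W, rho):
    """Precision matrix Q = rho * (D - W) + (1 - rho) * I for adjacency W."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))          # number of neighbours of each unit
    return rho * (D - W) + (1.0 - rho) * np.eye(W.shape[0])

def conditional_mean(W, rho, phi, k):
    """Prior mean of phi[k] given all other random effects under this CAR form."""
    W = np.asarray(W, dtype=float)
    phi = np.asarray(phi, dtype=float)
    # W[k, k] is zero, so phi[k] itself does not enter the weighted sum.
    return rho * (W[k] @ phi) / (rho * W[k].sum() + 1.0 - rho)

# Hypothetical map: four areal units in a line (0 - 1 - 2 - 3).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
rho = 0.9                                # assumed spatial dependence parameter
Q = leroux_precision(W, rho)

# Units 0-1 form a low-risk group, units 2-3 a high-risk group.
phi = np.array([0.0, 0.0, 2.0, 2.0])
m = conditional_mean(W, rho, phi, k=1)
print(round(m, 3))
```

In this toy example unit 1 sits on the boundary of the low-risk group, yet its conditional prior mean is pulled close to the midpoint of its low-risk and high-risk neighbours. This is the across-boundary smoothing that motivates the alternative approach.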
Therefore, this paper outlines new methodology which allows for the estimation of the spatial pattern in disease risk, whilst simultaneously detecting the spatial extent of high or low risk clusters. In doing so the cluster structure is accounted for when estimating disease risk, so that high risk clusters are not smoothed towards their geographical neighbours that do not exhibit elevated risks. The methodology brings together hierarchical agglomerative clustering techniques and conditional autoregressive models in a two-stage approach. The first stage is a spatially adjusted hierarchical agglomerative clustering algorithm first proposed in Anderson et al. (2014), which respects the spatial contiguity of the study region. This algorithm is applied to disease data preceding the study period to elicit n candidate cluster configurations containing between 1 and n clusters. The second stage fits an extended Poisson log-linear model to the study data, where Markov chain Monte Carlo (MCMC) simulation methods are used to estimate both the optimal cluster structure and disease risk.
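The first stage can be illustrated with a minimal sketch of contiguity-constrained agglomerative clustering. It starts with every areal unit in its own cluster and repeatedly merges the two spatially adjacent clusters whose mean values are closest, recording the configuration after each merge, so that a connected map of n units yields n candidate configurations with n, n - 1, ..., 1 clusters. The merge criterion (absolute difference in cluster means), the toy data and the function name are assumptions made for illustration; this is not a reproduction of the algorithm of Anderson et al. (2014).

```python
import numpy as np

def contiguous_agglomeration(x, W):
    """Return candidate cluster configurations, from n clusters down to 1.

    x : one value per areal unit (e.g. a risk estimate from an earlier period)
    W : binary adjacency matrix; only adjacent clusters may be merged
    """
    x = np.asarray(x, dtype=float)
    W = np.asarray(W)
    n = len(x)
    labels = np.arange(n)                 # every unit starts as its own cluster
    configs = [labels.copy()]
    while len(np.unique(labels)) > 1:
        means = {c: x[labels == c].mean() for c in np.unique(labels)}
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = labels[i], labels[j]
                if W[i, j] and a != b:    # units touch and lie in different clusters
                    d = abs(means[a] - means[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
        if best is None:                  # disconnected map: nothing left to merge
            break
        _, a, b = best
        labels[labels == b] = a           # merge cluster b into cluster a
        configs.append(labels.copy())
    return configs

# Hypothetical data: six units in a line with a jump in risk between units 2 and 3.
x = [0.1, 0.0, 0.2, 1.5, 1.6, 1.4]
W = np.zeros((6, 6), dtype=int)
for k in range(5):
    W[k, k + 1] = W[k + 1, k] = 1

configs = contiguous_agglomeration(x, W)
for c in configs:
    print(len(np.unique(c)), c.tolist())
```

In the two-stage approach described above, a set of configurations like this would be elicited from data preceding the study period, and the second-stage model would then choose among them.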
Applying the clustering algorithm to the study data itself would necessitate the information in the data being used twice, once for eliciting a set of candidate cluster configurations and again for estimating the model parameters. To overcome this issue, a second data set is required for the clustering stage, and emphasis should be placed on obtaining a dataset which is as similar as possible to the study data. Possible choices include data on disease risk in the time period prior to the study period or data on a different disease from the same time period as the study data. This study utilises the former choice, because it is unlikely that there has been any substantial change in the spatial patterns in the population characteristics governing disease risk (such as poverty) unless substantial urban regeneration has taken place. The approach proposed in this paper is thus appropriate for data on chronic diseases whose risk factors are spatially stable, but would be unsuitable for epidemic diseases such as influenza, where the spatial pattern in disease risk in the years prior to an outbreak would be vastly different to the pattern during an outbreak.
The remainder of this paper is organised as follows. Section 2 gives a brief introduction to Bayesian disease mapping, and discusses the existing methods of cluster identification that have been proposed in this context. Section 3 proposes our new methodological extension, while Section 4 establishes its efficacy via simulation. Section 5 presents the motivating application for our methodology, a study of chronic obstructive pulmonary disease (COPD) mortalities in English local authorities in 2010. Finally, Section 6 discusses the implications of this paper and ideas for future work.
Section snippets
Study design and modelling
The study region is partitioned into n non-overlapping areal units $\mathcal{A} = \{\mathcal{A}_1, \ldots, \mathcal{A}_n\}$, and $\mathbf{Y} = (Y_1, \ldots, Y_n)$ and $\mathbf{E} = (E_1, \ldots, E_n)$ represent the observed and expected numbers of disease cases in each unit during the study period. The latter are constructed by external standardisation, based on the age and sex demographics of the population living in each areal unit. A Poisson log-linear model is commonly used to estimate disease risk, and a general form is given by
Method
We propose a two-stage approach for estimating the spatial pattern in disease risk and identifying spatially contiguous clusters that exhibit either elevated or reduced disease risks. In the first stage (Section 3.1) we utilise the spatially adjusted hierarchical agglomerative clustering algorithm proposed by Anderson et al. (2014), and use it to elicit a set of candidate cluster configurations for the data. In the second stage (Section 3.2) we propose a hierarchical Bayesian model for the
Simulation study
A simulation study was conducted to establish the efficacy of the two-stage modelling approach outlined in the previous section. The template for the study was the set of 324 local authorities in England, which is also the study region for the motivating application presented in Section 5. The study compares the two-stage approach proposed here with existing alternatives, and the results are summarised below.
Study design
The study region is the country of England, which is the largest of the four constituent nations of the United Kingdom and has a population of approximately 53 million people. The country is divided into local authorities, containing populations of between 7,338 and 1,061,074 people with a median value of 124,781. The disease data are the numbers of mortalities with a primary diagnosis of chronic obstructive pulmonary disease (COPD) in each local authority in 2010. The expected mortality
Discussion
The main aim of this paper was to develop statistical methodology to simultaneously estimate the spatial pattern in disease risk and identify clusters of areas exhibiting high (and low) risk. To achieve this aim, a new methodology has been developed which fuses spatial agglomerative hierarchical clustering techniques with an extended conditional autoregressive model, with inference based on Markov chain Monte Carlo simulation. This approach allows us to identify an optimal cluster
Acknowledgments
We would like to thank the editor and two referees whose comments have improved the motivation for and presentation of this paper. The work of the first author was funded initially by the Carnegie Trust and then by the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS).
References (21)
- Anderson et al. Identifying clusters in Bayesian disease mapping. Biostatistics (2014)
- Besag et al. Bayesian image restoration, with two applications in spatial statistics. Ann Inst Stat Math (1991)
- Centers for Disease Control and Prevention. National Cardiovascular Disease Surveillance System. Technical report....
- Charras-Garrido et al. Classification method for disease risk mapping based on discrete hidden Markov random fields. Biostatistics (2012)
- Charras-Garrido et al. On the difficulty to delimit disease risk hot spots. J Appl Earth Obs Geoinf (2013)
- Diggle et al. Point process methodology for on-line spatio-temporal disease surveillance. Environmetrics (2005)
- Fraley C, Raftery AE, Murphy TB, Scrucca L. mclust version 4 for R: normal mixture modeling for model-based clustering,...
- Green et al. Hidden Markov models and disease mapping. J Am Stat Assoc (2002)
- Hastie T, Tibshirani R, Friedman J. The elements of statistical learning; Springer New York Inc. [chapter...
- Knorr-Held. Bayesian modelling of inseparable space-time variation in disease risk. Stat Med (2000)