A tutorial on spatio-temporal disease risk modelling in R using Markov chain Monte Carlo simulation and the CARBayesST package

https://doi.org/10.1016/j.sste.2020.100353Get rights and content

Highlights

  • We present a new tutorial of Bayesian spatio-temporal disease risk modelling in R.

  • The tutorial describes data input and visualisation, model fitting, model assessment and results presentation.

  • The tutorial is illustrated by a new worked example on pneumonia mortality risk in England between 2002 and 2017.

Abstract

Population-level disease risk varies in space and time, and is typically estimated using aggregated disease count data relating to a set of non-overlapping areal units for multiple consecutive time periods. A large research base of statistical models and corresponding software has been developed for such data, with most analyses being undertaken in a Bayesian setting using either Markov chain Monte Carlo (MCMC) simulation or integrated nested Laplace approximations (INLA). This paper presents a tutorial for undertaking spatio-temporal disease modelling using MCMC simulation, utilising the CARBayesST package in the R software environment. The tutorial describes the complete modelling journey, starting with data input, wrangling and visualisation, before focusing on model fitting, model assessment and results presentation. It is illustrated by a new case study of pneumonia mortality at the local authority level in England, and answers important public health questions including the effect of covariate risk factors, spatio-temporal trends, and health inequalities.

Introduction

Population-level disease risk varies in space and time between sub-populations, due to variation in environmental exposures and the prevalence of risk inducing behaviours such as smoking. Estimating the spatio-temporal variation in disease risk is an important endeavour for researchers and policy-makers alike, because it allows them to answer important public health questions such as: (i) where are the highest-risk areas for disease that can be targeted for an intervention; (ii) is disease risk increasing or decreasing over time; (iii) how big are the health inequalities, and are they changing over time; and (iv) what environmental or social factors affect the risk of disease. In the above, health inequalities are the differences in disease risk between different communities (World Health Organisation, 2013) and are often driven by poverty, with poorer populations typically exhibiting higher disease risks than more affluent populations.

Data summarising population level disease incidence are commonly used to answer these questions, which have been spatio-temporally aggregated to a set of K non-overlapping areal units for N consecutive time intervals. These data thus comprise a K × N matrix of spatio-temporal observed disease counts, which are augmented by matrices of expected counts computed using indirect standardisation to allow for varying population demographics, and covariate risk factors. A large number of models have been developed for estimating the spatio-temporal variation in disease risk, and comprehensive reviews are given by Lawson and Lee (2017) and Lawson (2018). The class of conditional autoregressive (CAR, Besag et al., 1991) models are commonly used to represent the spatially correlated variation in disease risk, and are used as a prior distribution for a set of spatially structured random effects. Spatio-temporal extensions of CAR models have been proposed by numerous authors, including (Bernardinelli, Clayton, Pascutto, Montomoli, Ghislandi, Songini, 1995, Knorr-Held, 2000, Rushworth, Lee, Mitchell, 2014). These models assume that disease risk varies smoothly in space and time, and thus account for the inherent spatio-temporal autocorrelation typically observed amongst the disease data. A Bayesian approach to inference is typically adopted, using either Markov chain Monte Carlo (MCMC, Robert and Casella, 2010) simulation or Integrated Nested Laplace Approximations (INLA, Rue et al., 2009)).

A number of software packages have been developed to allow researchers to fit spatio-temporal conditional autoregressive type models in a Bayesian setting, including WinBUGS (Lunn et al., 2000) and STAN (Morris et al., 2019). However, these packages can be difficult to use for novice users, and also lack the ability to undertake additional analyses, such as data visualisation and wrangling, within the same software environment. Therefore the R software environment (https://cran.r-project.org) has a number of packages for spatio-temporal areal unit modelling, including the INLA package (Martins et al., 2013) using integrated nested Laplace approximations, and CARBayes (spatial modelling, Lee, 2013) and CARBayesST (spatio-temporal modelling, Lee et al., 2018) both using MCMC simulation. Both INLA and CARBayes / CARBayesST can fit a range of different spatio-temporal disease risk models for areal unit data, while INLA can also model geostatistical (Lindgren et al., 2011) and point process (Illian et al., 2012) spatio-temporal data. The main advantage of INLA is computational speed, because it uses Laplace approximations to estimate the posterior distribution of the unknown parameters, rather than drawing repeated samples from the posterior distribution as MCMC does. In contrast, while slower than INLA, CARBayesST is potentially easier to use for novice users, because: (i) all models can be implemented via a simple one-line function call requiring the user to specify few arguments for a default analysis, even when specifying the spatio-temporal data structures; and (ii) important epidemiological quantities such as posterior exceedance probabilities are straightforward to produce because you have direct access to samples from the posterior risk distribution. Additionally, CARBayeST allows users to fit localised spatial autocorrelation models such as that proposed by Rushworth et al. (2017), which is not possible using the INLA package. These localised correlation models recognise that disease risk in geographically adjacent areal units may not always be similar (correlated), which stems from the seminal work of (Womble, 1951).

Excellent tutorials for how to use the INLA package for spatio-temporal modelling have been written by Blangiardo et al. (2013) and Moraga (2018), but no such tutorial exists for spatio-temporal modelling with CARBayesST using MCMC simulation. Therefore this paper fills that gap, and presents a tutorial describing how to undertake spatio-temporal Bayesian areal unit modelling via MCMC simulation using the CARBayesST package. Lee et al. (2018) provides a general vignette for the CARBayesST software including a description of the suite of models that the package can fit, where as here we provide a specific tutorial on spatio-temporal disease risk modelling aimed at applied public health researchers. We also note that the purely spatial modelling package CARBayes has an identical syntax to CARBayesST, and hence will be straightforward to use following a similar approach to that described here.

This tutorial is illustrated by a new study of pneumonia mortality in England at the local authority level. The data and the questions motivating the analysis are described in Section 2, while exploratory analysis is presented in Section 3. Spatio-temporal model fitting and model assessment is outlined in Section 4, while the results of the analysis are presented in Section 5. Finally, this tutorial finishes with an outline of future developments in Section 6. The data, shapefiles and R code used in this tutorial are available to download from https://github.com/duncanplee/Spatio-temporal-modelling-tutorials, which also includes a video tutorial illustrating the analysis presented here.

Section snippets

Motivating study

The study region is mainland England, which has a population of around 55 million people and has been partitioned into K=322 local authorities. Data are available at yearly intervals between 2002 to 2017 inclusive, yielding N=16 time periods. The disease data {Ykt} are counts of the numbers of pneumonia mortalities (International Classification of disease ICD-10 codes J12–J18) in local authority k during year t, and were obtained from NHS digital https://digital.nhs.uk/. These counts vary

Reading in and visualising the spatio-temporal trends in the data

The disease and covariate data are stored in EnglandLUAdata.csv, which has a unique local authority code (Code) and year (Year) combination for each row of the data set. The local authority code is accompanied by the name of local authority (Name), and the remaining 4 variables were described in the previous section. The data can be read into R and the variable names visualised using the following commands.

Note, it is simplest in practice to put the data in the same folder as the R

Bayesian spatio-temporal modelling using MCMC simulation

As summarised in the introduction, a large number of models have been proposed for estimating the spatio-temporal trends in disease risk, and as our disease outcome variable is a count, they have the general formYktPoisson(Ektθkt)fork=1,,K,t=1,,Nln(θkt)=β0+β1IMDk+β2PM25kt+ψkt,where ψkt is the random effect for local authority k and time period t. The regression parameters are typically assigned independent and weakly informative normal priors, and the default prior specification in CARBayesST

Inference from the model

We now answer the three questions of interest motivating the analysis posed in Section 2. Firstly, the effects of IMD and PM25 on pneumonia mortality risk are quantified as relative risks, for a fixed increase ξ in each covariates value. For example, the relative risk for a ξ increase in IMD is computed byRR(IMD,ξ)=RiskofdiseaseifIMDincreasedbyψRiskofdiseasegiventhecurrentvalueofIMD=exp(β0+β1(IMDk+ξ)+β2PM25kt+ψkt)exp(β0+β1IMDk+β2PM25kt+ψkt)=exp(β1ξ).

Here we use the standard deviation of each

Conclusions

In this tutorial we have illustrated how to analyse spatio-temporal areal unit data using the CARBayesST package in R, which fits models in a Bayesian setting via MCMC simulation. The worked example on pneumonia mortality in England describes a complete spatio-temporal analysis, beginning with data input, wrangling and visualisation, and then modelling and drawing inference from the fitted model to answer epidemiologically important questions. The package allows the estimation of disease risk

References (28)

  • A. Gelman et al.

    Bayesian Data Analysis

    (2013)
  • J. Geweke

    Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments

    Bayesian Statistics

    (1992)
  • J. Illian et al.

    A toolbox for fitting complex spatial point process models using integrated nested Laplace approximation (INLA)

    Ann. Appl. Stat.

    (2012)
  • L. Knorr-Held

    Bayesian modelling of inseparable space-time variation in disease risk

    Stat. Med.

    (2000)
  • Cited by (0)

    View full text