Local variation in childhood diarrheal morbidity and mortality in Africa, 2000-2015

Background Diarrheal diseases are the third leading cause of morbidity and mortality in children under 5 in Africa, responsible for an estimated 30,000,000 (95% Uncertainty Interval [UI], 27,000,000 - 33,000,000) severe cases and 330,000 (95% UI, 270,000 - 380,000) deaths in 2015. Developing targeted approaches to combat this burden is hampered by the lack of comprehensive, fine-scale diarrhea estimates between and within countries. Methods Annual estimates of diarrheal prevalence, incidence, and mortality were produced with high geographic detail (5-km2) across Africa from 2000 to 2015. Estimates were created using Bayesian geostatistical techniques, and were calibrated to the results from the Global Burden of Diseases, Injuries, and Risk Factors Study 2016. Results The results revealed geographic inequality in diarrhea risk in Africa. Of the estimated 330,000 childhood deaths attributable to diarrhea in 2015, over 50% occurred in only 55 (of the 782) first administrative subdivisions. In 2015, mortality rates between first administrative subdivisions in Nigeria exhibited six-fold differences. The case fatality ratio is highly variable at the national level across Africa, with the highest values observed in Lesotho, Mali, Benin, and Nigeria. Conclusions Our findings show concentrated areas of morbidity and mortality associated with diarrhea across countries with consistently high burden as well as countries that experienced considerable national-level improvements. In the era of precision public health, the distribution of limited resources can be optimized with proven interventions, targeted at locations most likely to have a high impact, reducing the avertable burden of diarrheal diseases.

In contrast to the declines observed in diarrheal mortality, there were many regions with negligible decreases or even increases in diarrheal cases from 2000 to 2015. The calculated annualized rates of change (AROC) for incidence of severe diarrhea cases shows low to moderate gains ( Figure S1, Panel A), and reductions exceeding 5% per year were rare [p<.05]. As illustrated by the direct comparison between 2000 and 2015 severe incidence rates, 118 first 35 administrative subdivisions demonstrated an increase in severe incidence rate ( Figure S1, Panel A).

Mortality Rates of Change
Average annual rates of change for mortality rates have exceeded 7.5% per year across much of Africa ( Figure S2, Panel A and B). Across Sahelian regions with high mortality rates in 2000, 40 many locations in Western sub-Saharan Africa have significant decline in mortality (e.g., Ghana, Senegal, and Liberia). Conversely, gains in regions of some of the worst Sahelian countries has been limited (Niger, Nigeria, Chad, Central African Republic, South Sudan, and Somalia). In central and southern Africa, there have been limited gains in Zimbabwe and Namibia.

45
Using posterior draws for 2015 to assess progress towards the GAPPD mortality goal of less than 1 in 1,000 deaths attributable to diarrhea, we find that multiple countries have already surpassed this threshold ( Figure S2, Panel C and D). Countries that are estimated to be close to the goal may or may not have correspondingly high or low posterior probabilities of achieving the goal as the probability incorporates uncertainty in the estimates. For example, within 50 Mozambique across first administrative subdivisions, mean mortality rates vary from 0.8 to 1.1 per 1,000 in 2015. However, due to non-symmetric uncertainty, all posterior probabilities that the true value is less than 1 in 1,000 are greater than 30%.
Several countries began the millennium with such a high mortality rate -either uniformly or 55 within subdivisions -that, despite substantial gains, results in the regions' mortality remaining well above the GAPPD target for diarrheal mortality. Cameroon, Rwanda, and much of Western sub-Saharan Africa have subnational mean mortality AROCs of 5% or greater ( Figure  S2, Panels A and B), but due to their high initial mortality rates (Figure 1, Panel A and B), estimated mortality remains high in 2015 (Figure 1, Panels C and D) and the probability they 60 have achieved the GAPPD goal is very low (Figure S2, Panels C and D). As with other countries with extremely high mortality rates in 2010, even with the substantial gains made on the western portion of the border between Niger and Nigeria, the posterior probability that they have already met the mortality goal there is very low.
Using the calculated AROCs (either at the pixel level or calculated on the aggregated first administrative subdivision level), projections of GAPPD mortality goal attainment can be estimated ( Figure S2, Panels E and F). In 2015, many locations that had achieved great gains in reduction of mortality were still above the established goals ( Figure S2, Panels C and D). Conversely, given 10 more years of similar relative reductions in mortality the picture changes 70 substantially ( Figure S2, Panels E and F). With steady decline in mortality, DRC appears on track to reduce mortality well below the 1 in 1,000 threshold, along with Ethiopia, Benin, Togo, and much of Angola. South western Nigeria is projected to have pockets that are very likely to have achieved the mortality goal. Northern Cameroon started 2000 with rates comparable to both neighboring Chad and Nigeria, but due to significant declines in mortality risk, they are 75 nearing the mortality goal.

Limitations
Our analysis assumes that the same proportion of incident cases are severe across all countries and years. This lack of spatio-temporal variation in severity over time prohibits our 2025 80 projection analysis from definitively concluding that this goal is unattainable. If diarrhea incidence is proportional to severe diarrhea incidence, as we have assumed in this analysis, it is unlikely that the majority of Africa will achieve a 75% reduction from the 2010 baseline by 2015.

Data Sources
The data sources used to model diarrhea prevalence indicators are described below. Information on survey locations, years, source, polygons, and/or geopositioned survey clusters 90 can be found in Table S2. Data availability for each indicator can be found in Figure S2.
Select data sources that were identified to contain diarrhea prevalence within the geographic area of interest were excluded for the following reasons: missing survey weights for areal data, incomplete sampling (e.g., only a specific age range), or untrustworthy data (as determined by the survey administrator or by inspection). Within each source, administrative units with a 95 sample size of one were excluded.

Survey cluster combination and spatial integration over polygon records 100
Our individual-level data were collapsed (summarized) into clusters if they could be georeferenced to latitude-longitude pairs. Otherwise, we collapsed our individual-level data to the smallest polygon that could be referenced. We used survey weights and the survey package in R to account for matching our observations to a higher resolution than the representative resolution of the survey. 1

105
Data without latitude and longitude, but that could be geolocated to an administrative area, were resampled to generate candidate point locations based on the underlying population of the administrative area. The main concept is to leverage covariate values across the polygon when performing the regression, while simultaneously accounting for a population-driven 110 survey design. The methods used for the resampling are consistent with those used in geospatial modelling of under-5 mortality, published previously. 2 For each polygon-level observation, 10,000 points were randomly sampled from within the polygon (regardless of the polygon's area) using the WorldPop total population raster 3 to 115 weight the locations of the draws. K-means clustering was performed on the candidate points to generate integration points (1 per 1,000 pixels) used in the modelling. Weights were assigned to each integration point proportionally to the number of candidate points that entered into the kmeans cluster, such that the weight of each point represented the number of populationsampled locations contained within the K-means cluster location, divided by the number of 120 sampled points generated (10,000). Each point generated by this process is assigned the diarrhea prevalence observed from the survey for that polygon. These sample weights are used in model fit. 4

Model geographies
A total of five models were run for each indicator based on continuous geographic regions within Africa chosen to align with the regions used in the Global Burden of Disease Study, which determines regions based on both proximity and epidemiological similarity (see Figure  S3 for listing of regions and countries). Minor changes were made to the GBD regions to ensure 130 spatial contiguity across Africa. Initial investigation of regional configurations identified both the Democratic Republic of Congo and Egypt as outliers in their respective regions. As such, each was fit as individual regions, see Figure S5 for an illustration of the modelling regions. All data within the spatial region, and within a one-degree buffer from the boundaries of each region, were included in each model to minimize edge effects.

135
As this study was limited to mainland Africa and African island nations, select countries were excluded from the North Africa and Middle East region (Afghanistan, Bahrain, Iran, Iraq, Jordan, Kuwait, Lebanon, Oman, Palestine, Qatar, Saudi Arabia, Syria, Turkey, UAE, and Yemen). Western Sahara was included as part of the North region. Several countries were moved to East (Lesotho and Swaziland from South, Sudan from North) to make high-income 140 status more similar in the North and South regions.

Ensemble covariate modelling
An ensemble covariate modelling method was implemented in order to both select covariates and capture possible non-linear effects and complex interactions between them. 5 For each 145 region, four sub-models were fit to our dataset, using all of our covariate data as explanatory predictors: generalised additive models, boosted regression trees, lasso regression, and ridge regression. Sample weights are used in sub-models, where applicable, such that survey cluster locations with latitude and longitude had a sample weight of 1, while survey cluster locations where the latitude and longitude was generated by the polygon resampling process had a 150 weight based on the K-means clustering process (refer to section 3.1).
Each sub-model is fit using five-fold cross-validation to avoid overfitting. The out-of-sample predictions from across the five holdouts are compiled into a single comprehensive set of predictions from that model. Additionally, the same sub-models were also run using 100% of 155 the data, and a full set of in-sample predictions were created. The five sets of out-of-sample submodel predictions are fed into the full geostatistical model as the explanatory covariates when performing the model fit. The in-sample predictions from the sub-models are used as the covariates when generating predictions using the fitted full geostatistical model. A recent study has shown that this ensemble approach can improve predictive validity by up to 25% over an 160 individual model. 5 Predictions from each sub-model are generated based on patterns and relationships between the raw covariates and prevalence data, while predictions from the full geostatistical model are generated based on patterns and relationships between the predictions from the ensemble of 165 sub-models and prevalence data. To discover the relationships between the sub-model prediction layers (used as covariates in the full geostatistical model) and the prevalence data, the only values of the covariates (sub-model prediction layers) "seen" by the model are the values underlying the locations of surveys. As such, it is possible that estimates will be generated in areas where the values of the covariates exceed the minimum and maximum 170 values observed by the model. In these areas, the estimates are generated by extrapolating from the patterns observed within the range of covariates underlying the survey and census data.

Model Description 175
Binomial count data are modelled within a Bayesian hierarchical modelling framework using a logit link function and a spatially and temporally explicit hierarchical generalised linear regression model to fit prevalence of each of our indicators in seven regions of Africa as illustrated in Figure S5. 6 For each GBD region, we explicitly write the hierarchy that defines our Bayesian model as follows: For each geospatial risk factor and region, we model the number of children at survey cluster , among a sample size, , who are afflicted with a risk factor as binomial count data, . We have suppressed the notation, but the counts, , probabilities, , predictions from the five 200 submodels , and residual terms * are all indexed at a space-time coordinate. The probabilities, represent both the annual prevalence at the space-time location and the probability that an individual child will be afflicted with the risk factor given that they live at that particular location. The logit of annual prevalence, , of our indicators was modelled as a linear combination of the three sub-models (generalized Additive Model , boosted regression 205 tree, elastic net penalized regression), , a correlated spatio-temporal error term, , and an independent error term, . Coefficients, , on the sub-models represent their respective predictive weighting in the mean logit link and are constrained to sum to one. The joint error term, , accounts for residual spatio-temporal autocorrelation between individual data points that remains after accounting for the predictive effect of the sub-model covariates, is a 210 country random effect, and , which is an independent error term. The residuals, , are modelled as a three-dimensional Gaussian process in space-time centered at zero and with a covariance matrix constructed from a Kroenecker product of spatial and temporal covariance kernels. The spatial covariance, Σ space , is modelled using an isotropic and stationary Matérn function, 7 and temporal covariance, Σ time , as an annual autoregressive order 1 (AR1) function 215 over the 16 years represented in the model. This approach leveraged the data's residual correlation structure to more accurately predict prevalence estimates for locations with no data, while also propagating the dependence in the data through to uncertainty estimates. 8 The posterior distributions were fit using computationally efficient and accurate approximations in R-INLA 9,10 (integrated nested Laplace approximation) with the stochastic partial differential 220 equations (SPDE) 11 approximation to the Gaussian process residuals.

Priors
The following priors were used for all four of our diarrhea models: We used the uncorrelated multivariate normal priors that INLA automatically determines (based on the finite elements mesh) for the log-transformed spatial hyperparameters and .
The mean (µ) and variance (σ 2 ) parameters for the hyperpriors selected by INLA for the meshes 235 in each region can be found in Table S4. In our parameterization we represent and in the loggamma distribution as scale and shape, respectively.

Mesh construction
We constructed the finite elements mesh for the stochastic partial differential equation 240 approximation to the Gaussian process regression using a simplified polygon boundary (in which coastlines and complex boundaries were smoothed) for each of the regions within our model. We set the inner mesh triangle maximum edge length (the mesh size for areas over land) to be 0.2 degrees, and the buffer maximum edge length (the mesh size for areas over the ocean) to be 5.0 degrees. An example finite elements mesh constructed for central sub-Saharan mesh 245 can be found in Figure S6.

Model fitting and estimate generation
Models were fit in INLA with methods consistent with those used in geospatial modelling of 250 under-5 mortality, published previously. 2 Resampling K-means weights (refer to section 3.2) were used within the INLA fit by multiplying the corresponding log-likelihood evaluation for the specific observation by the observation's K-means weight. Data points that could be georeferenced to latitude-longitude 255 locations were assigned a weight of 1, ensuring that when the log-likelihood contribution from that observation was evaluated it contributed only to the log-likelihood at the observation's space-time location. For survey cluster locations generated based on the polygon resampling process, the log-likelihood of those points contributed proportionate to the K-means weights, effectively diffusing the evaluation of the observation across the polygon.

260
As part of the ensemble modelling process (section 3.3.2), prediction surfaces from the out-ofsample ensemble sub-models were used as covariates in the spatio-temporal model. Estimates of the fixed effects beta coefficients derived from the contribution of each of the sub-models to INLA's predicted prevalence estimates, in conjunction with parameter estimates of the 265 contribution of location and time (based on estimated parameters described in model description (refer to section 3.3.3) were generated and can be found in Table S5. To create final estimates, the in-sample prediction surfaces of prevalence from the sub-models (serving as covariates) were used to calculate estimates of prevalence for each pixel in each year.

270
All estimates were generated by taking 1,000 draws from the posterior distribution. For estimates at the pixel level, these draws were used directly to generate estimates and uncertainty. Aggregated estimates, in which estimates at the pixel level were summarized to administrative boundaries, were generated by creating population-weighted averages for each administrative boundary, for each draw. 95% uncertainty intervals around the mean of our 275 estimates ( Figure S7) were generated by taking the 2.5% and 97.5% quantiles of each of the draws, at the pixel or administrative level.

280
Fitted parameters and hyperparameters, as well as their 95% uncertainty intervals are shown by indicator and region in Table S5. Spatial hyperparameters ( and ) and their uncertainties have been transformed into more interpretable nominal variance and range parameters. Nominal variance, approximating the variance at any single point, is calculated as . = 4 2 2 , and nominal range, approximating the distance before spatial correlation decays by 90%, as In order to assess the predictive validity of our estimates, we validated our models using 299 spatially stratified five-fold out-of-sample cross-validation. 12 We used a modified bi-tree Using these out-of-sample predictions, we then calculated mean error (ME, or bias), root-mean-  Estimates produced by MBG models were also compared to raw estimates from the DHS series,  incidence reach 75% of the severe incidence rate in 2010. Pixels with fewer than ten people per 369 1-km 2 and classified as "barren or sparsely vegetated" are colored in grey. to the first administrative subdivision using population weighting. Pixels with fewer than ten 383 people per 1-km 2 and classified as "barren or sparsely vegetated" are colored in grey. Pixels with fewer than ten people per 1-km 2 and classified as "barren or sparsely vegetated" are 419 colored in grey. (purple and circled orange), we combine the predictions from step 2 and calibrate them such that the population weighted mean 503 diarrhea prevalence for a particular country-year from our model matches the GBD estimates. 14,15 Finally (orange), we aggregate our 504 estimates to first administrative units. Using these aggregate estimates and the previously calibrated pixel estimates, we are able to 505 convert prevalence estimates into mortality and incidence estimates and otherwise generate maps of these values.

Data Inputs
For all data inputs from multiple sources that are synthesized as part of the study: 3 Describe how the data were identified and how the data were accessed.

6-7 4
Specify the inclusion and exclusion criteria. Identify all ad-hoc exclusions.

6; S.4 5
Provide information on all included data sources and their main characteristics. For each data source used, report reference information or contact name/institution, population represented, data collection method, year(s) of data collection, sex and age range, diagnostic criteria or measurement method, and sample size, as relevant.

S.36-S.41
6 Identify and describe any categories of input data that have potentially important biases (e.g., based on characteristics listed in item 5).

6-8
For data inputs that contribute to the analysis but were not synthesized as part of the study: For all data inputs: 8 Provide all data inputs in a file format from which data can be efficiently extracted (e.g., a spreadsheet rather than a PDF), including all relevant meta-data listed in item 5. For any data inputs that cannot be shared because of ethical or legal reasons, such as third-party ownership, provide a contact name or the name of the institution that retains the right to the data.

15
Provide published estimates in a file format from which data can be efficiently extracted.
Raster files for spatial data and CSVs of admin 1 and admin 2 estimates to be made available at ghdx.healthdata.org 16 Report a quantitative measure of the uncertainty of the estimates (e.g. uncertainty intervals).
9-11; S.19-S.20; S47-50 17 Interpret results in light of existing evidence. If updating a previous set of estimates, describe the reasons for changes in estimates.

12-15
18 Discuss limitations of the estimates. Include a discussion of any modelling assumptions or data limitations that affect interpretation of the estimates.

12-15
524 525   Lower, median, and upper quantiles (0.025%, 0.50%, 0.975%) are displayed for the main parameters by region. The first four 542 rows provide information on the fixed effects: the intercept (int) and the the covariates (gam, gbm, and enet) corresponding 543 to the predicted ensemble rasters. Fitted values for the spatio-temporal field hyperparameters and the precisions (inverse 544 variance) for our random effects are shown in the bottom four rows.