Vibrio cholerae O1 transmission in Bangladesh: insights from a nationally representative serosurvey

Summary Background Pandemic Vibrio cholerae from cholera-endemic countries around the Bay of Bengal regularly seed epidemics globally. Without reducing cholera in these countries, including Bangladesh, global cholera control might never be achieved. Little is known about the geographical distribution and magnitude of V cholerae O1 transmission nationally. We aimed to describe infection risk across Bangladesh, making use of advances in cholera seroepidemiology, therefore overcoming many of the limitations of current clinic-based surveillance. Methods We tested serum samples from a nationally representative serosurvey in Bangladesh with eight V cholerae-specific assays. Using these data with a machine-learning model previously validated within a cohort of confirmed cholera cases and their household contacts, we estimated the proportion of the population with evidence of infection by V cholerae O1 in the previous year (annual seroincidence) and used Bayesian geostatistical models to create high-resolution national maps of infection risk. Findings Between Oct 16, 2015, and Jan 24, 2016, we obtained and tested serum samples from 2930 participants (707 households) in 70 communities across Bangladesh. We estimated national annual seroincidence of V cholerae O1 infection of 17·3% (95% CI 10·5–24·1). Our high-resolution maps showed large heterogeneity of infection risk, with community-level annual infection risk within the sampled population ranging from 4·3% to 62·9%. Across Bangladesh, we estimated that 28·1 (95% CI 17·1–39·2) million infections occurred in the year before the survey. Despite having an annual seroincidence of V cholerae O1 infection lower than much of Bangladesh, Dhaka (the capital of Bangladesh and largest city in the country) had 2·0 (95% CI 0·6–3·9) million infections during the same year, primarily because of its large population. Interpretation Serosurveillance provides an avenue for identifying areas with high V cholerae O1 transmission and investigating key risk factors for infection across geographical scales. Serosurveillance could serve as an important method for countries to plan and monitor progress towards 2030 cholera elimination goals. Funding The Bill & Melinda Gates Foundation, National Institutes of Health, and US Centers for Disease Control and Prevention.


S1. Cholera seroincidence model and inference
The primary goal of these analyses is to estimate the proportion of the population infected by Vibrio cholerae O1 in the previous year, which we refer to as the 'seroincidence rate' and denote as π. We use a previously validated random forest model to classify whether each member of a recent nationally-representative serosurvey was infected in the year before the before the survey. We treat the binary outcome of this model like a diagnostic test, which when summarized at the population-level can be adjusted for sensitivity and specificity. As detailed in the paper, the serosurvey was a two-stage cluster survey, with 70 communities selected with probability proportional to each community's population and at least 10 households (with at least 40 total samples) sampled from each community.
We make three estimates of the seroincidence rate, all of which rely on a Bayesian hierarchical model that is specified below. For the 'survey estimate', we assume that the nationwide estimate of seroincidence is equivalent to the in-sample estimate of seroincidence, since the serosurvey sample was nationally representative. For the 'post-stratified estimate', we use the parameter estimates from the model to predict the seroincidence in the sampled communities based on their demographics. For the 'spatial estimate', we extend the poststratified estimate to the rest of the country using a logistic regression model including covariates and a Matern spatial covariance function.

S1.1. Bayesian hierarchical model
We model the random forest predictions of seropositivity (see Section S1.3) for each individual, z i (i = 1, . . . , n, from household h = 1, . . . , H in community c = 1, . . . , C), with a Bayesian hierarchical model similar to that of Makela, Si, and Gelman, [1] augmented to account for the sensitivity and specificity of the random forest model: logit(π i ) = α ci + α hi + X i β (2) α c ∼ Normal(α 0 + γ log(N c ), σ 2 c ) (3) We separate the probability that an individual is predicted to be seropositive by the random forest model into two parts: the true positive rate, where the underlying probability of being seropositive π i is multiplied by the sensitivity θ 1|1 , and the false positive rate, calculated as the underlying probability of being seronegative (1 − π i ) multiplied by one minus the specificity (1 − θ 0|0 ). We assume that the underlying probability of being seropositive π i for each individual is logit-normal with random effects for their community α ci and household α hi and fixed effects β for their age and sex X i . The community-level random effect has a mean determined by the sum of a country-level intercept, α 0 , and linear term to account for the (log) population of the community (N ci with coefficient γ) and a variance σ 2 c . Since the probability that a community was sampled was proportional to its population, we need to account for any relationship between community population and seroincidence rate to control for confounding. The household-level random effect is centered at zero with a variance of σ 2 h . We use weak standard normal priors for α 0 , γ, and the β coefficients and t distributions truncated to be positive as priors for σ c and σ h . Section S1. 4 has more details about estimating the sensitivity and specificity.
Upon estimating π i , we can estimate the seroincidence of each community π survey c and the whole survey π survey :ŷ i = Bernoulli(π i ) (5) π survey = 1 n n i=1ŷ i (6) π survey c = i∈cŷ i i∈c n i .
The serostatus of each individual is determined by a draw from Bernoulli with probability π i . These draws can be averaged across each community to calculate the community-specific survey estimates π survey c or across all individuals to calculate the survey estimate π survey .
For each sampled community, we estimated the seroincidence π poststrat c by post-stratifying the posterior samples of our parameter estimates to match the demographics of that community. For every combination of age category (a = 1, . . . , A = 3), sex (s = 0, 1), and community c, we estimated the probability of seropositivity, π a,s,c for each posterior draw of β, α c , and σ 2 h . We can estimate the community seroincidence π poststrat c by taking a weighted average of the π a,s,c , where weights are determined by the demographic distribution of the community. π a,s,c = 1 0 logit −1 (α c + X a,s,c β + σ h * Φ −1 (t))dt (8) π poststrat c = A a=1 1 s=0 N a,s,c π a,s,c N c (9) where Φ −1 (t) is the quantile function of a standard normal distribution. Since communities were chosen at random (weighted by population), we assume that a simple average of the sampled communities is sufficient for estimating the overall post-stratified seroincidence π poststrat . We fit the Bayesian hierarchical model in the Stan probabilistic programming language and used the rstan package in R to run the model and analyse outputs. We ran 5,000 iterations (4 chains with 1,500 iterations each with 250 for warm-up) and assessed convergence visually and using the R-hat statistic. [2,3]

S1.2. Integrated nested Laplace approximations and the spatial estimate
Our primary estimate of the country-wide seroincidence used in the main analyses is the spatial estimate. To produce this estimate, we extend the community-specific post-stratified estimates to the entire country using a logistic regression model with a Matern spatial covariance function and covariates with integrated nested Laplace approximations (INLA), [4] as described in the main text. To choose the covariates, we conduct a backwards stepwise regression, based on Wantabe-Akaike Information Criteria (WAIC), where the outcome is the average seroincidence in each sampled community. The covariates considered were log population, proportion female, proportion age 0-9, proportion age 10-19, distance to major body of water, a poverty index, travel time to a major city, and altitude. After step-wise selection, we compared the predictive skill of the best model with covariates to a null model with only a Matern spatial covariance function (no covariates) by conducting a leave-one-community-out cross validation and evaluating the results with mean absolute error (MAE). To make the nationwide estimates, we fit the INLA model with the lowest MAE to each of 1,000 posterior draws of community seroincidence from the Bayesian hierarchical model (described above) and then predict the seroincidence for all 5km by 5km grid-cells across the country. In the end, we generate 1,000 maps of cholera seroincidence rates and we take a population-weighted average to produce the nationwide spatial estimates. Our primary results are the median spatial estimate and the 95% credible interval.

S1.3. Random forest predictions
Azman et al. fitted a random forest model using age, sex, vibriocidal titers (Ogawa and Inaba), anti-LPS IgG and IgA antibodies, and anti-CTB IgG and IgA antibodies, and blood group to classify the seropositive status of individuals from a longitudinal cohort study in Bangladesh. [5] Since no blood group information was collected in the serosurvey, we fit a new random forest model to the cohort data using all of the remaining covariates. For each observation in the cohort study, we use the proportion of trees that predict that the observation is seropositive as the probability of seropositivity. From these probabilities, we calculate the receiver operating characteristic (ROC) curve for the cohort predictions and calculate the cutoff that maximizes the Youden's J statistic, i.e. the sum of the sensitivity and the specificity. [6] We use the random forest model to predict the seropositivity status of each participant in the serosurvey; the participants whose probability of seropositivity exceed the Youden cutoff are classified as seropositive, z i in Equation 1.

S1.4. Specificity and sensitivity of the random forest predictions
As with all imperfect tests, population-level (e.g., aggregated) random forest model seroincidence estimates can be corrected for the test's specificity and sensitivity, when known. To estimate the specificity and sensitivity of this random forest model we conducted leave-one-individual-out cross validation (LOOCV) on the original cohort data used in Azman et al.. Each participant j (j = 1, . . . , J) was sampled at multiple times t, where the recent infection status y cohort j,t was known For each individual in the cohort, we fit a random forest model to the rest of the cohort, calculate the Youden cutoff, and predict the seropositivity for all of the timepoints of the left-out individual, which we call LOOCV predictions and denote z cohort To estimate the specificity, θ 0|0 in Equation 1, we include the LOOCV predictions in our Bayesian hierarchical model: We found that there was correlation between the LOOCV predictions of the seronegative observations within an individual, but that there was no effect of time since infection. 1 We model these predictions as Bernoulli random variables with the probability of seropositivity equal to (1 − θ 0|0 j ), where the individual specificity θ 0|0 j is distributed logit-normal with a mean α θ 0|0 j and standard deviation σ 2 θ 0|0 . The sensitivity of the random forest predictions varies across days since infection due to the decay in antibody response over time, while specificity remains constant given that it is defined by looking at test performance in uninfected individuals. After infection with V. cholerae O1, most antibodies rise including vibriocidals, one of the most informative markers, which peak around 7-10 days post-infection. As time since infection increases, the antibody profile of an individual, in general, returns to pre-infection levels. The vibriocidal titers decay quickly in the first three months before decaying more slowly over the following three years. This decay is illustrated by the decline in raw sensitivity of the random forest predictions over time (Table S1). To account for this decay, we estimate the sensitivity as a time-varying quantity rather than as a static quantity and rewrite the overall sensitivity as a joint probability: where Z is the result of the test (i.e. the random forest model), Y is the true seropositive status of the individual, and T is the time since infection in days. We need to estimate the time-varying sensitivity, P(Z = 1 | Y = 1, T = t), and the probability of being infected T = t days ago, P(T = t | Y = 1). Since sensitivity only concerns seropositive individuals (i.e. Y = 1) and seropositivity is by our definition infection over the past 365 days, T is restricted to be less than or equal to 365 days for all components of θ 1|1 .

S1.4.1. Time-varying sensitivity
We estimate the time-varying sensitivity of the random forest predictions in the cohort study using a logistic regression model with a random coefficient for each individual α θ 1|1 j and a cubic polynomial for the log of days since infection, similar to the method used by Leisenring et al.: [7] [z cohort j,t | y cohort logit(θ 1|1 j (t)) = α θ 1|1 j + β 1 log(t) + β 2 log(t) 2 + β 3 log(t) 3 (16) The posterior median and 95% credible interval for the sensitivity at each time since infection from 5 to 365, P(Z = 1 | Y = 1, T = 5, . . . , 365), is shown Figure S1. Days 1 through 4 were excluded as the serological biomarkers for cases had yet to rise and thus tests conducted during this time frame were considered negative baseline measurements as in Azman et al.

S1.4.2. Daily probability of infection
The estimates of time-varying sensitivity allow us to calculate the overall sensitivity given the time since infection, however we do not know this time for any individual in the serosurvey. We assume that individuals only get infected once in the past year, such that the daily probabilities sum to one across T = 1, . . . , 365. In our primary analyses we assume that the risk of infection for each individual was uniformly distributed over the year before sample collection: We set the expected value of any given time since infection to be equal to 1 365 , however the Bayesian hierarchical model allows for variability around each estimate. S1.4.2.1. Alternate analysis: incorporating seasonality. Past work on clinical cholera has shown that there is seasonal variation of cholera in Bangladesh, which varies regionally across the country. [8] To estimate the probability of infection by day for each sampled community P(T = t | Y = 1, C = c), we combine sentinel surveillance testing data from Khan et al. (the proportion of acute watery diarrhea (AWD) cases that are cholera) [9] with district-level acute watery diarrhea data from Hegde et al. (in prep).
We estimate the monthly proportion of AWD cases that are cholera at each serosurvey site, using a logistic generalized additive model with a thin-plate spline for longitude and latitude, a cubic cyclic spline for month, and a random effect for year (from 2014-2016) as implemented in the brm package in R. These estimates are aggregated to the district-level (second administrative unit), where they can be multiplied by the number of AWD cases per district to calculate the number of cholera cases per district per month. For each district that contains a sampled community, we fit a model with a monotonically increasing p-spline for the cumulative number of cholera cases from the beginning of 2014 through 2016, with the number of cumulative cases assigned to the last day of each month. From this model, we can predict the daily number of cholera cases for each location for the 365 days preceding the serosurvey for each sampled community (Figure S2). We assume that the daily probability of infection is proportional to the daily number of cases and that a person can only be infected once per year. To make seroincidence estimates incorperating seasonality of infection risk, we use these P(T = t | Y = 1, C = c) in place of P(T = 1 | Y = 1) in a slightly modified version of the Stan model.

S1.5. Other estimators and time frames
We fit several alternative models and observe the differences in their resulting seroincidence estimates. We use an 'unadjusted' model, where our random forest estimates are not adjusted by sensitivity and specificity (i.e. π i θ 1|1 + (1 − π i )(1 − θ 0|0 ) in Equation 1 is replaced by π i ). Previous work showed that a vibriocidal titer (either Inaba and Ogawa) of at least 320 was the best threshold for maximimizing sensitivity and specificity for identifying individuals infected in the previous year; thus we fit a 'vibriocidal' model which used these predictions in place of the random forest predictions for z i in Equation 1. To see how the seroincidence changed over multiple time frames, we also fit the random forest and vibriocidal models to 100 and 200 days infection windows. We use the raw aggregated random forest predictions to make the unadjusted estimate and use the same Stan framework as in the main analysis to estimate the seroincidence using the vibriocidal cutoff and for other time windows. S6 S1. 6

. Risk factors for seroposivity
We used a series of logistic regression models with a Matern spatial covariance function to explore the association between seropositivity (random forest positive for individuals) and various individual-, household-and community-level covariates. We explored both univariate relationship and multivariate (linear) relationships between the covariates and the binary seropositivity outcome using models with and without different random effects and spatial correlation. The 'full' model, used for the primary analyses in manuscript, included a Matern spatial random field and random effects for both households and communities (assumed to be independent and identically distributed with log-gamma priors). We also estimated the relationship between the covariates and seropositivity with a model including no random effects for household or community and only spatial correlation, and another model including only random effects for household and community without spatial correlation.

S1.7. Results
The three methods produced similar estimates for the nationwide seroincidence rate for the 365-day infection window before the serosurvey ( Figure S3A). The median spatial estimate from the INLA 5km by 5km grid-cell maps was 17.3% (95% CI: 10. 5  Seroincidence rate, by estimator B Figure S3: The estimated annual seroincidence rate medians and distributions by estimator for the whole population (A) and by community (B). The unadjusted random forest estimates (orange) do not account for sensitivity or specificity. The adjusted survey estimates (light blue) are the survey estimates from the Bayesian hierarchical model which accounts for the sensitivity and specificity, of the random forest estimates, as well as age, sex, and the community population size. The post-stratified estimates (green) are from the same model but extrapolating to the unsampled population in each community based on its demographics. The spatial estimates (dark blue) extend the survey estimates to the rest of the country using a logistic regression with a Matern spatial covariance function. There are no community-specific estimates for the spatial model in (B).

S1.7.1. Alternate analyses
The unadjusted 365-day random forest model nationwide estimate is similar to the median estimates from the adjusted and post-stratified models (median: 19.9%), however there is considerable variability between locations ( Figure S3B). By comparison, the vibriocidal model yields lower seroincidence rate estimates (median: 15.0%, 95% CI: 10. [8][9][10][11][12][13][14][15][16][17][18][19].6%) than the models based on random forest estimates despite the fact that the proportion of the serosurvey that had vibriocidal titers greater than or equal to 320 is similar to the proportion that was classified as seropositive by the random forest model (19.5% vs. 19.9%). This is due to the vibriocidal estimates having a lower specificity than the random forest models ( Figure S4).    Figure S4: The estimated seroincidence rate across varying infection window sizes and estimators. We estimate the seroincidence rate with two different estimators across three infection time windows (100, 200, and 365 days). The random forest model (RF) uses age, sex, and measurements of six antibodies (vibriocidal Inaba, vibriocidal Ogawa, anti-CTB IgG, anti-CTB IgA, anti-LPS IgG and anti-LPS IgA) to classify individuals as seroincident. As a comparison, we use the historical convention where those with either vibriocidal titers greater than or equal to 320 is classified as seroincident. The vibriocidal titer method has lower specificity, which yields lower estimates of seropositivity than those from the random forest model. The estimate of the median seroincidence rate with each estimator is displayed above its distribution. The estimates of the adjusted sensitivity and specificity for each estimator and window size are presented in the table below the figure.
The model including seasonality produced overall post-stratified estimates that were very similar to the main estimates (median: 20.1%, 95% CI: 15.5-24.6%). The community-level estimates were also similar, with the largest difference between the estimates in any community being 4.1%. 80.0% of estimates accounting for seasonality of risk were within 1% of estimates not accounting for seasonality ( Figure S5). Estimates adjusted for seasonality were also similar to the main estimates at 100 and 200 days.

S1.7.2. LOOCV
As described above, to choose the covariates for the model to make the spatial maps, we used a backwardsstepwise selection using WAIC. The model that minimized the WAIC had two covariates, the proportion of the grid cell that is female and poverty. We compared this model to a null model that had a spatial field but no linear covariates and a naive model that used the average of all other communities using leave-one-community-out cross validation and evaluating their predictions using mean absolute error (MAE). The null model had the least error (MAE: 8.5%) and was similar to the model with covariates (MAE: 8.5%); both had smaller MAE than the naive model (MAE: 9.8%), thus we used this null model in our primary analyses. Figure S6 shows a comparison of the community-specific results. Seroincidence rate Predicted seroincidence rate Figure S6: The results from leave-one-community-out cross validation. Results from cross-validation where each community was held out of the INLA model, one at a time, and the posterior predictive mean for that location was estimated (y-axis). The green dashed line illustrates the best fit line for predictions from the model with proportion female and poverty as covariates. The blue dashed line illustrates the best fit line for the null model predictions. The solid orange line is the best fit line for a naive model that predicts the average of the mean of the other sampled communities.

S1.7.3. Mapping
Using the null model (only spatial random field and no covariates), we made grid-cell estimates of seroincidence for all of Bangladesh. From these we made a series of maps to observe the geographic variability of cholera throughout Bangladesh. Maps of the median seroincidence rate and estimated number of annual infections by grid cell are in the main manuscript ( Figure 2). To help identify high-risk regions and our confidence in the estimates, we calculated proportion of posterior grid-cell seroincidence estimates with a relative risk greater than two ( Figure S7). As in the manuscript, we see higher risk in the Bay of Bengal and in pockets in the northwest and north, though less so in the northeast. S10 0.00 0.25 0.50 0.75 1.00 Proportion of samples with relative risk >2 Figure S7: Proportion of posterior samples with relative risk greater than 2 for each 5km x 5km grid cell. Figure S8 investigates the variance of our estimates. As variance scales with the size of the estimate, it is difficult to interpret. The coefficient of variation, the standard deviation divided by the mean, is another measure often used but it can become very large for places where mean estimates are very small. Instead, we use the width of the logged relative risk credible interval in this map. To do this, we first bound all posterior samples of the relative risk to be between 0.25 and 4 (or −2 and 2 on the log 2 scale), which represent reasonable cutoffs for very high and very low risk as only 2.1% of upper bounds across all grid cells are greater than 4, though 93.4% of lower bounds are less than 0.25. Next, we take the log 2 difference between the upper and lower bounds of the 95% credible interval for each grid cell. Grid cells with a log 2 difference of 4 have an upper bound relative risk that is greater than 4 and a lower bound relative risk less than 0.25, indicating that we are very uncertain of the true risk in that grid cell; 1.0% of the grid cells in our map have a log 2 difference of 4 and are displayed in white. Grid cells with a log 2 difference of 0 have both upper and lower bounds either above 4 or below 0.25 and would be indicated on the map with maximum opacity if there were any. The colors on the map indicate the posterior median for that grid cell and the opacity indicates the log 2 difference between the upper and lower bounds. This map demonstrates how the certainty in our estimates fades with distance from the sampled communities.   Figure S9: Estimates of odds ratios for seropositivity from different models.

S2. Additional descriptive analyses and figures
In this section we present additional descriptive analyses to illustrate the distributions of each of the antibody levels in different ways and characteristics of the cohort.