Spatiotemporal variation in risk of Shigella infection in childhood: a global risk mapping and prediction model using individual participant data

Summary Background Diarrhoeal disease is a leading cause of childhood illness and death globally, and Shigella is a major aetiological contributor for which a vaccine might soon be available. The primary objective of this study was to model the spatiotemporal variation in paediatric Shigella infection and map its predicted prevalence across low-income and middle-income countries (LMICs). Methods Individual participant data for Shigella positivity in stool samples were sourced from multiple LMIC-based studies of children aged 59 months or younger. Covariates included household-level and participant-level factors ascertained by study investigators and environmental and hydrometeorological variables extracted from various data products at georeferenced child locations. Multivariate models were fitted and prevalence predictions obtained by syndrome and age stratum. Findings 20 studies from 23 countries (including locations in Central America and South America, sub-Saharan Africa, and south and southeast Asia) contributed 66 563 sample results. Age, symptom status, and study design contributed most to model performance followed by temperature, wind speed, relative humidity, and soil moisture. Probability of Shigella infection exceeded 20% when both precipitation and soil moisture were above average and had a 43% peak in uncomplicated diarrhoea cases at 33°C temperatures, above which it decreased. Compared with unimproved sanitation, improved sanitation decreased the odds of Shigella infection by 19% (odds ratio [OR]=0·81 [95% CI 0·76–0·86]) and open defecation decreased them by 18% (OR=0·82 [0·76–0·88]). Interpretation The distribution of Shigella is more sensitive to climatological factors, such as temperature, than previously recognised. Conditions in much of sub-Saharan Africa are particularly propitious for Shigella transmission, although hotspots also occur in South America and Central America, the Ganges–Brahmaputra Delta, and the island of New Guinea. These findings can inform prioritisation of populations for future vaccine trials and campaigns. Funding NASA, National Institutes of Health–The National Institute of Allergy and Infectious Diseases, and Bill & Melinda Gates Foundation.


Training database 1.Study inclusion criteria
This analysis used an Independent Participant Data Meta Analysis (IPD-MA) framework in which raw, individual participant-level data were pooled from studies that used molecular diagnostics to diagnose Shigella in stool samples collected from children aged under 5 years in Low-and Middle-Income Countries (LMICs, as defined by the Organisation for Economic Co-operation and Development [1], excluding those in Europe). The IPD-MA design is considered the gold standard in systematic reviews, offering numerous advantages over aggregate data meta-analyses, and a greater potential for generalizable inferences compared with individual studies [2,3]. In this IPD-MA, studies which tested for Shigella using quantitative or real-time polymerase chain reaction (PCR) were identified through nonsystematic, exploratory literature review and professional networks and investigators from eligible studies were contacted with requests for access to their data. No specific search strategy was used. If investigators were responsive and agreed to participate, data use agreements were executed between the University of Virginia and the collaborating institution and study-specific datasets were securely transferred and stored, deidentified and combined into a central database with a standardized format and list of variables. Analyses were conducted on the pooled data. Table S1 summarizes key features of studies that contributed data. No protocol has been registered or published for this study.

Household-and participant-level covariate data
Most contributing studies collected household-and participant-level data through baseline and follow-up assessments, including numerous variables relevant to enteropathogen infection risk. Covariate ascertainment was not consistent across all studies as some variables were not ascertained in certain studies, while in others information was collected but was incomplete. Furthermore, some anthropometric measurements and feeding status assessments were not contemporaneous with the dates of sample collection. To obtain a complete set of values for all covariates, missing participant-level variable data were interpolated, and household-level variable data imputed. Since some studies lacked data for entire variables, equivalent data was extracted from individual child-level microdata from all standard Demographic and Health Surveys (DHS) [24] and Multiple Indicator Cluster Surveys (MICS) [25] carried out in the same countries dating back to 2005 for the same survey stratum (region and urban/rural status) in which the study sites were located. Survey databases were coded identically to the pooled study database and then appended to it to add more locally relevant information prior to interpolation/imputation.

Imputation of missing time-fixed household-level covariate data
Missing data for the household-level covariates were simultaneously imputed using multivariate normal regression (MVN) with an iterative Monte Carlo method with country, survey region, urbanicity, source (survey or study) and linear and quadratic terms for time as additional predictors. These variables were assumed to be static (time-fixed) and, for children for whom they were ascertained at multiple time points during follow-up, the first available value was used, for consistency with those studies that only assessed them at baseline.

Interpolation of missing time-varying anthropometric and feeding status data
Missing data relating to the feeding variables were imputed and interpolated using predictions from Cox proportional hazard models that modeled age of first introduction of complementary foods and then of full weaning adjusting for country, survey region, urbanicity, sex, time (with linear and quadratic terms), and source (survey or study). Anthropometric Z-scores were calculated for each child at each available anthropometric assessment based on their length/height, weight and age using the WHO Child Growth Standards STATA igrowup package, with implausible values recoded as missing [26]. Then, linear mixed effects models were fitted to each anthropometric measure in turn (LAZ, WAZ, WHZ) with fixed effects for sex, age (with terms up to fourth order polynomial), time (with linear and quadratic terms), and source (survey or study) and participant-and countryspecific random effects. Predictions from these models were used to interpolate missing Zscores to the child and date of sample collection.

Relative humidity
Average daily relative humidity in the 7-day period from t-9 to t-3 days Percentage GLDAS [44] Soil moisture Average daily soil moisture in the 7-day period from t-9 to t-3 days Percentage GLDAS [44] Solar radiation Average daily solar radiation in the 7-day period from t-9 to t-3 days Watts per square meter GLDAS [44] Specific humidity Average daily specific humidity in the 7-day period from t-9 to t-3 days Grams/kilogram GLDAS [44] Surface pressure deviations Deviations from the mean daily surface pressure in the 7-day period from t-9 to t-3 days Millibars GLDAS [44] Surface runoff Average daily surface runoff in the 7-day period from t-9 to t-3 days Millimeters GLDAS [44] Temperature Average daily temperature in the 7-day period from t-9 to t-3 days Degrees Celsius GLDAS [44] Wind speed Average daily wind speed in the 7-day period from t-9 to t-3 days Meters per second GLDAS [44] 1 For categorical variables, the reference category is underlined.     For environmental spatial covariate extraction, stool samples were georeferenced to the approximate location of the child's residence. For some study sites, the exact coordinates of the child's household locations were available, otherwise, children were georeferenced to the approximate centroid of their neighborhood, village, or district or, where such information was unavailable, the location of the health facility at which they were enrolled. For samples georeferenced to health facilities, covariates were averaged over a theoretical catchment area represented by a 20km buffer around the facility location using the ArcMap Zonal Statistics tool, otherwise they were extracted to household or community locations using the Extract Values to Points tool [45]. All continuous spatial covariates were normalized and standardized to their overall distributions.

Hydrometeorological variable extraction, standardization, and lagging:
The effect of weather on enteropathogen outcomes is lagged [46,47] and the incubation period for shigellosis can be as short as 12 hours, with an average interval of three days from exposure [48] Therefore, daily hydrometeorological variables were averaged over a 7-day lagged period of exposure from 3 to 9 days prior to the date of sample collection (t-9 to t-3, where t0 is the date of sample collection). Two of the hydrometeorological exposure variables -precipitation and surface pressure -were standardized to their local distributions by recalculating each one as the deviation from its site-specific mean value over the period 2005 -2019. For precipitation, this was to account for the distribution being highly rightskewed, while for surface pressure, which was narrowly distributed within each site, it reduced between-site relative to within-site variability [46].

Prediction databases 2.1. Creation of covariate rasters:
To predict Shigella infection probability from the model across all LMICs, it was necessary either to obtain raster datasets of the values for each covariate predictor that are coextensive with that domain, or to fix the prediction at a certain value of that variables. For the hydrometeorological and static spatial covariates, rasters are available covering the entire domain of interest. For several of the household-and participant-level variables, raster predictions are available from IHME's Local Burden of Disease (LBD) project website [33], though most do not cover the entire domain. For variables for which such rasters were available, predictions for the most recent year (2017) were used. For other such variables, and to fill gaps in coverage of the LBD rasters, subnational unit-level values were calculated using microdata from each LMIC's most recent nationally representative household survey (Standard DHS, MICS or, in small number of cases, country-specific surveys [49,50]), merged with unit polygon shapefiles and converted to raster format. For countries for which no eligible surveys were available in the public domain (e.g., North Korea, Eritrea, Iran), averages across all other countries in that country's World Bank region [51] were substituted. All covariate rasters were clipped and resampled to match the extent and resolution of the GLDAS grids. Figures S1, S2, and S3 show the geographical variation in, respectively, averages of the hydrometeorological variables, the time-static environmental spatial covariates, and the participant-and household-level covariates.

Supplementary results:
Output model predictions of probabilities and standard errors are available for each of nine age groups/symptom stratum combinations at the following GitHub repository https://github.com/joshcolston/Badr_Shigella_predictions. Methods -objectives and scope section 2. List the funding sources for the work Financial support section Data inputs 3. Describe how the data were identified and how the data were accessed.

Guideline compliance
Methods section and supplementary tables S1 and S2 4. Specify the inclusion and exclusion criteria. Identify all ad-hoc exclusions.
Methods section and supplementary appendix section 1.1 5. Provide information on all included data sources and their main characteristics. For each data source used, report reference information or contact name/institution, population represented, data collection method, year(s) of data collection, sex and age range, diagnostic criteria or measurement method, and sample size, as relevant.
Methods section and supplementary tables S1 and S2 6. Identify and describe any categories of input data that have potentially important biases (e.g., based on characteristics listed in item 5).
Methods section and supplementary appendix sections 1 and 2 7. Describe and give sources for any other data inputs.
Not applicable 8. Provide all data inputs in a file format from which data can be efficiently extracted (e.g., a spreadsheet rather than a PDF), including all relevant meta-data listed in item 5. For any data inputs that cannot be shared because of ethical or legal reasons, such as third-party ownership, provide a contact name or the name of the institution that retains the right to the data.
See data availability statement.

Data analysis
9. Provide a conceptual overview of the data analysis method. A diagram may be helpful.
Methods -statistical analysis section 10. Provide a detailed description of all steps of the analysis, including mathematical formulae. This description should cover, as relevant, data cleaning, data pre-processing, data adjustments and weighting of data sources, and mathematical or statistical model(s).
Methods -statistical analysis section; Supplementary equation S1 11. Describe how candidate models were evaluated and how the final model(s) were selected.
Methods -statistical analysis section 12. Provide the results of an evaluation of model performance, if done, as well as the results of any relevant sensitivity analysis. Figure 2 13. Describe methods of calculating uncertainty of the estimates. State which sources of uncertainty were, and were not, accounted for in the uncertainty analysis.
Methods -statistical analysis section

Discussion section
18. Discuss limitations of the estimates. Include a discussion of any modelling assumptions or data limitations that affect interpretation of the estimates. If applicable, describe how any studies for which IPD were not available were dealt with. This should include whether, how and what aggregate data were sought or extracted from study reports and publications (such as extracting data independently in duplicate) and any processes for obtaining and confirming these data with investigators.

Discussion section
Not applicable.
11. Describe how the information and variables to be collected were chosen. List and define all study level and participant level data that were sought, including baseline and follow-up information. If applicable, describe methods of standardizing or translating variables within the IPD datasets to ensure common scales or measurements across studies.
Supplementary appendix section 1, supplementary tables S1 & S2 Describe what aspects of IPD were subject to data checking (such as sequence generation, data consistency and completeness, baseline imbalance) and how this was done.
Not applicable.
12. Describe methods used to assess risk of bias in the individual studies and whether this was applied separately for each outcome. If applicable, describe how findings of IPD checking were used to inform the assessment. Report if and how risk of bias assessment was used in any data synthesis.
Not applicable.
13. State all treatment comparisons of interests. State all outcomes addressed and define them in detail. State whether they were pre-specified for the review and, if applicable, whether they were primary/main or secondary/additional outcomes. Give the principal measures of effect (such as risk ratio, hazard ratio, difference in means) used for each outcome.
Not applicable.
14. Describe the meta-analysis methods used to synthesize IPD. Specify any statistical methods and models used. Report any important issues identified in checking IPD or state that there were none.
Not applicable 19. Present data on risk of bias assessments. If applicable, describe whether data checking led to the up-weighting or down-weighting of these assessments. Consider how any potential bias impacts on the robustness of meta-analysis conclusions.
Methods, Data sources and outcome variable section; Discussion 20. For each comparison and for each main outcome (benefit or harm), for each individual study report the number of eligible participants for which data were obtained and show simple summary data for each intervention group (including, where applicable, the number of events), effect estimates and confidence intervals. These may be tabulated or included on a forest plot. 21. Present summary effects for each meta-analysis undertaken, including confidence intervals and measures of statistical heterogeneity. State whether the analysis was pre-specified, and report the numbers of studies and participants and, where applicable, the number of events on which it is based.
Results, figure 3, supplementary table S1 When exploring variation in effects due to patient or study characteristics, present summary interaction estimates for each characteristic examined, including confidence intervals and measures of statistical heterogeneity. State whether the analysis was pre-specified. State whether any interaction is consistent across trials.

Not applicable
Provide a description of the direction and size of effect in terms meaningful to those who would put findings into practice.
Results, figure 3 22. Present results of any assessment of risk of bias relating to the accumulated body of evidence, including any pertaining to the availability and representativeness of available studies, outcomes or other variables.
Effect estimate of study design, figure 3 23. Give results of any additional analyses (e.g., sensitivity analyses). If applicable, this should also include any analyses that incorporate aggregate data for studies that do not have IPD. If applicable, summarize the main meta-analysis results following the inclusion or exclusion of studies for which IPD were not available.

Discussion
24. Summarize the main findings, including the strength of evidence for each main outcome.

Discussion
25. Discuss any important strengths and limitations of the evidence including the benefits of access to IPD and any limitations arising from IPD that were not available.

Discussion
26. Provide a general interpretation of the findings in the context of other evidence.

Discussion
Consider relevance to key groups (such as policy makers, service providers and service users). Consider implications for future research.

Discussion
Funding 27. Describe sources of funding and other support (such as supply of IPD), and the role in the systematic review of those providing such support.
Abstract and financial support section