Seasonal dynamics of Anopheles stephensi and its implications for mosquito detection and emergent malaria control in the Horn of Africa

Significance Invasion of the malaria vector Anopheles stephensi across sub-Saharan Africa poses a threat to disease control efforts, particularly in cities where malaria transmission has historically been low but where this invasive vector can thrive. We collate longitudinal catch data to systematically characterize the species’ seasonal dynamics in areas at risk of invasion, which is necessary to guide surveillance and control activities. An. stephensi’s temporal abundance is highly variable and, in contrast to dominant vectors across Africa, poorly predicted by patterns of rainfall, instead being shaped by temperature and patterns of land use. This variation has material consequences for effective control of this invasive vector and highlights an urgent need for longitudinal entomological monitoring of the vector in its new environments.


31
In this supplementary document we outline the methods and data used to explore and analyse 32 the patterns and drivers of Anopheles stephensi population dynamics across South Asia and 33 the Middle East. In Supplementary Information 1, we present an overview of the systematic 34 search strategy used to collate the references containing the extracted and analysed data, as 35 well as details about the initial pre-processing steps applied to said data. In Supplementary  36 Information 2, we describe the statistical methodologies used to process and analyse this 37 extracted data. The output of these analyses forms the basis for the results presented in the 38 main text. Finally, in Supplementary Information 3, we present a number of additional figures 39 and tables to support the work detailed in the main text.  44 We collated references from two previously published systematic reviews of literature relating 45 to Anopheles stephensi (focusing on its presence/absence across a wide geographical range 1 46 and its seasonal dynamics in India 2 respectively), and updated these previous searches ( (i.e. date range of 1990-2020 rather than 2017-2020) to ensure completeness of the collated 57 references, and fill in countries not included during previous reviews. Our searches identified 58 a total of 926 records, which were screened according to the following Inclusion/Exclusion 59 criteria: 60 Inclusion Criteria: 61 • Reference contains temporally disaggregated adult mosquito catch data for 62 An. stephensi, at a temporal resolution of monthly or finer. 63 • The time-period spanned by the survey must be at least 10 consecutive months in 64 duration and have caught at least a total of 25 An. stephensi over the period for which 65 catches were being carried out. 66 Exclusion Criteria: 67 • Mosquito catch data is not temporally disaggregated to a sufficient extent (e.g. catches 68 were done yearly or seasonally rather than monthly). 69 • Mosquito catch data was collected as part of a trial assessing a vector control 70 intervention (which would perturb the natural dynamics of the vector, rendering the 71 data unrepresentative of the population dynamics in the absence of control). 72 • The reference only contained information on immature/larval mosquito life cycle stages 73 rather than mature adults. 74 • The reference contained insufficient information to geolocate the area in which the 75 study was conducted to at least the administrative unit 2 level. 76 Overall, a total of 34 references were collated containing 62 time-series from catch surveys 77 carried out in distinct locations from across Afghanistan (n=2), Djibouti (n=1), India (n=30), 78 Iran (n=17), Myanmar (n=5) and Pakistan (n=7) from the systematic review. These were 79 further supplemented with 2 references (from Pakistan and India respectively, yielding a total 80 of 3 time-series) collated as part of a review of the bionomics of An. stephensi previously 81 carried out 3 , yielding a total of 65 time-series from these 36 references. The next section 82 describes in further detail about extraction and collation of the data associated with each study. 83

Entomological Data Extraction 85
For each reference, we extracted all relevant entomological catch data provided that pertained 86 specifically to An. stephensi. Where data were presented in a table, data was copied directly  87  from the table. Where the data were in a graph, data were extracted using the DataThief TM  88 software. This yielded a total of 65 time series of monthly mosquito catch data (no reference  89 presented data at a finer temporal resolution), ranging in length from 10 -60 months, with a 90 mean time-period of 15.6 months and a median time-period of 12 months, a mean catch size 91 of 758 and a median catch size of 289. 92 The primary focus of these analyses was to characterise annual and seasonal patterns of 105 variation in An. stephensi abundance. Given this, and also that variations in time-series length 106 are a factor known to affect their statistical properties 4 (and therefore limit the comparability of 107 the time series gathered and analysed here), all time-series were standardised to be 12 108 months in length. For time series containing more than 12 time points (i.e. time series that 109 spanned longer than a single year), we averaged the recorded catches for a given month. For 110 time-series containing less than 12 months of data, this was not carried out. Where the study 111 has been initiated in a month other than January, and concluded in a month other than 112

Supplementary
December, the recorded counts were reordered to yield a complete time series running from 113 January to December (and then subsequently adjusted so that the month of peak vector 114 density is arbitrarily set to month 7 when plotting the time-series, to enable graphical 115 comparability, see Fig.2A more information about how each time-series was processed). 129

Study Geolocation and Environmental Covariate Extraction 130
For each study where geolocation was possible, we recorded the location at both the 131 administrative unit 1 and 2 level, based on information provided in the reference. A number of 132 the references identified in our review had previously been utilised as part of previous 133 reviews 1,2 -where this data was available, these descriptions of study location were used. For 134 each location, we then extracted a suite of satellite-derived environmental covariates. These 135 environmental covariates consist of raster layers spanning all countries in which studies had 136 been conducted in (i.e. Afghanistan, Djibouti, India, Iran, Myanmar and Pakistan) at a 2.5 arc-137 minute (~5km by 5km, depending on the exact location and distance from the equator) spatial 138 resolution. The covariates utilised here were initially selected from a set of 19 derived from the 139 BioClimatic variables (a suite of biological relevant covariates defined from monthly rainfall 140 and temperature satellite data 5 , making the strong assumption that these variables, which 141 represent location averages over the period 1970-2000, adequately describe the climactic 142 factors present in the periods spanned by our studies, which were predominantly conducted 143 after 2000) as well as measures of landcover and urbanicity 6 , population density 7,8 and 144 enhanced vegetation index 9,10 . This provided a total of 43 covariates, many of which were 145 highly correlated with one another. To reduce the degree of this multicollinearity, we generated 146 a reduced subset of covariates using tools available in the tidymodels collection of R 147 packages 11 that aim to minimise the Spearman rank correlation coefficients between retained 148 covariates, and also exclude covariates where there is minimal variation for that covariate 149 across the full dataset, leaving 19 covariates in total. In addition to the environmental 150 covariates described above, for each of the administrative units a survey had been carried out 151 in, we also collated daily rainfall estimates for the time-period the survey had been conducted 152 in, using the "The Climate Hazards Group Infrared Precipitation With Stations" (CHIRPS) 153 dataset 12 . These data were aggregated up to the same temporal resolution as the An. 154 stephensi catch data (i.e. monthly). These rainfall data were used to calculate the cross-155 correlation coefficient between mosquito catches and rainfall. Country survey had been carried out in Calculated empirically for each timeseries (grouped into "India", "Iran" and "other").

NA
Note: There are 43 covariates total here, as Landcover contains 19 distinct covariates (each 160 describing the proportion of cover attributable to a particular landcover class in a given area). In-line with previously work modelling the seasonal dynamics of different Anopheline mosquito 167 species from across India 2 , we utilise a flexible Gaussian Process modelling framework to 168 temporally interpolate between the monthly-catch datapoints and smooth the raw, noisy and 169 overdispersed catch data. Gaussian processes specify a distribution over functions such that 170 any finite set of function values ( 1 ), ( 2 ), … ( ) have a joint Gaussian distribution 13 . The 171 Gaussian process is entirely specified by its mean function: 172 and by its covariance function: 174 The covariance function is also known as the kernel and defines, based on the Euclidean 176 distance between any two points, their covariance (and thus the covariance matrix of the 177 Gaussian Process when all pairwise combinations of points are considered). Many different 178 forms of the kernel are possible that each encode different prior information about how we 179 expect two datapoints ( and ′ in this instance) to be similar, and the distance over which we 180 expect this similarity to persist. Given that mosquito population dynamics are typically 181 characterised by seasonally repeating patterns occurring either, a periodic kernel function was 182 used to define the covariance between pairs of points: 183 where represents the period over which we would expect points to show similar dynamics 185 (i.e. a period of twelve would imply we expect points separated by 12 months to be most 186 similar), specifies the magnitude of the covariance, and represents a lengthscale 187 parameter further constraining the extent to which two values separated by a given time can 188 co-vary. 189 Bayesian inference and fitting of normal Gaussian Processes typically follow this hierarchical 190 formulation: 191 where represents a vector of hyperparameters involved in defining the kernel's properties, 195 is a distribution of functions from a zero-mean Gaussian Process with covariance function 196 , (x) are function evaluations at times , and the observed data. We modify this structure 197 to account for specific characteristics of the mosquito data being utilised -specifically that the 198 data are integer counts, that mosquito catch data is rarely normally distributed and frequently 199 displays high levels of overdispersion (a common property of biological systems generally). 200 We therefore adapted the above framework to accommodate a Negative Binomial likelihood, 201 leading to the following inferential framework: 202 , , ~ ( , , ) 203 where ( ) is used to reflect the fact that we use a log link between the observed counts and 207 the underlying latent process reflecting the population dynamics, and represents the 208 overdispersion parameter of the Negative Binomial distribution. 209

Prior Specification 210
Per previous work 2 , prior distributions for the estimated parameters were defined as follows: 211 Weakly informative priors were set on the scaling factor , the period, , and the 216 overdispersion parameter, . The prior for the kernel period ( ) was centred on 12 (a value of 217 the period that would represent annual variation being the dominant temporal modality) to 218 reflect our prior belief that observed variation in mosquito abundance is likely to cycle annually. 219 However, recognising that other temporal patterns of fluctuating abundance are possible, we 220 placed a large standard deviation on to allow the model to accommodate instances of 221 bimodality or periods operating across timescales longer than a year. We placed lower and 222 upper bounds on at 4 and 18 months respectively, to avoid identifiability issues arising from 223 the lack of data at temporal resolutions below and above these bounds. 224

Model Fitting and Parameter Inference 225
This Negative Binomial Gaussian Process were fitted using a Bayesian framework 226 implemented in STAN, a probabilistic programming language for statistical inference written 227 in C++ that employs the No-U-Turn sampler, a variant of the gradient-based Hamilton Monte 228 Carlo algorithm for inference 14 . For each time-series, 2 chains of 20,000 iterations were run 229 for purposes of model fitting and parameter inference. Half of each chain's iterations were 230 discarded as burn-in/the adaptive phase of the sampling, leaving a total of 20,000 iterations 231 available for inference. Measures of MCMC convergence such as the Gelman-Rubin statistic 232 were monitored in all cases and were all consistently < 1.02. 233

Fitted Time Series Normalisation and Von Mises Distribution Fitting 234
After having fitted and smoothed the mosquito catch time-series, we normalised each in the 235 following way: 236 where is the proportion of the annual catch recorded at timepoint . This was done in order 238 to establish comparability across the time series (which varied substantially in the absolute 239 numbers of Anopheles stephensi caught). We then further characterised the periodic 240 properties of these time series by fitting Von Mises distribution to the time-series. The Von-241 Mises distribution is a continuous probability distribution that exists on the circle, with range 0 242 to 2 . It is the circular analogue of the normal distribution (which exists on the line), with the 243 probability density function for the angle given by: 244 where 0 ( ) is the modified Bessel function of order 0, the parameter is a measure of location 246 (analogous to the mean of the normal distribution, describing where on the circle the 247 distribution is clustered around) and describes the concentration of density around (and 248 thus its inverse is a measure of dispersion, analogous to 2 for the normal distribution. We 249 fitted two sets of Von Mises densities to the normalised time series, the first containing a single 250 component: 251 and another with two-components, formulated as: 253 where represents the normalised mosquito count formulated as a random variable on the 255 circle (i.e. = 2 12 ). Fitting was undertaken using the optim function in R, with the root mean 256 squared error as the loss function. The outputs from this fitting were then included in the 257 process generating aggregate summaries of the temporal properties of the time-series, a 258 process described in further detail below. 259

Time Series Characterisation and Analysis 260
To characterise the temporal properties of each time-series, we calculated a series of 261 summary statistics for each, drawing on previous work carried out exploring the empirical 262 structure of time series 12 . In doing this, we can make explicit comparisons between time-series 263 about key aspects of their temporal properties (e.g., the degree or timing of seasonality), and 264 in doing so, identify time-series with similar statistical and temporal properties. These 265 summary statistics were the following: 266  In-keeping  305 with previous, operationally aligned estimates of malaria seasonality 15 , we calculated 306 using a sliding 3-month window the maximum percentage of the total annual catch that 307 was caught in any 3 month period. 308

Principal Components Analysis and Clustering 309
Principal Components Analysis (PCA) is a statistical procedure that utilises an orthogonal 310 transformation to convert a set of correlated variables (in this case the outputs of the 7 311 mathematical operations described above for each of the time series) into a set of linearly 312 uncorrelated variables (known as the "principal components"). In doing so, this allows us to 313 summarise this set of variables with a smaller number of representative variables that together 314 explain the majority of the variability in the variables. Reducing the dimensionality of the 315 dataset in this way facilitates visualisation of time series properties (as defined by the 316 mathematical operations) as well as clustering of the time series into groups which share 317 similar properties (clustering algorithms typically perform poorly in high dimensional settings, 318 necessitating the use of PCA as described here). Clustering was then undertaken using the 319 k-means clustering algorithm, using the first four PCA components that together described 320 85% of the total variation present in the data. 321

Random Forest Modelling and Prediction of Seasonality 322
Random Forests are a machine learning, ensemble-based method that work by constructing 323 a collection of decision trees that together explain the results (where results are either a 324 continuous outcome variable in the regression context, or a binary indicator in the classification 325 context) 16 . The outputs of these decision trees are subsequently aggregated in a statistically 326 principled and coherent way to produce a "forest" (or ensemble) of trees that together produce 327 predictions for comparison with data. They have previous been shown to provide significant 328 improvements in accuracy over traditional linear regression based approaches, particularly in 329 contexts where non-linear relationships or interactions between covariates are likely present 330 and to be relevant to prediction of an outcome 17 . 331 We used a Random Forest based approach to either 1) classify time-series cluster 332 membership (i.e. predict whether a time-series belonged to either Cluster 1 or Cluster 2, as 333 defined via the PCA and k-means clustering analysis described above); or 2) predict An. predictions on each training sample using only the trees that did not have that training sample 357 in their bootstrap sample. We also calculated both permutation variable importance and 358 generated partial dependency plots 20 for each model to assess the contribution of specific, 359 individual environmental covariates to whether a time-series had a single seasonal peak or 360 not. Together these methods allow evaluation of the importance of each included covariate to 361 model predictive accuracy, and in turn, allows us to "rank" covariates according to their 362 contribution to the predictive performance of the model. This entire process was repeated 25 363 times in order to average over the stochasticity and variation inherent in the Random Forest 364 fitting process. 365 We also carried out an additional sensitivity analysis where a set of the available data (n=12 366 time-series) was held-out at the onset, and the random forest model trained (using 6-fold 367 cross-validation) on the remaining available data (n=53 time-series total, with 43 time-series 368 used in model fitting and 10 time-series used for performance evaluation in each of the cross-369 validation folds). Optimal hyperparameters were selected in the same way as described 370 above, and then a final model fitted to the full, non-held out data (n=53 time-series), and model 371 predictive accuracy assessed by evaluating performance on the held-out data (n=12 time-372 series). 373  from which the probability of not sampling An. stephensi (i.e. the total number of An. stephensi 408 caught is equal to 0) during that sampling period can be calculated. 409

Probability of Detecting
For each time-series, we then identified the month in which monthly rainfall peaked, and the 410 month in which vector density was highest (noting that these months were very rarely the 411 same month). We then calculated the cumulative probability of An. stephensi detection under 412 a range of different surveillance strategies. Specifically three strategies were simulated: 413 • Vector-Peak Timed: Starting the survey at the month with peak vector density (noting 414 that in the absence of pre-existing detailed entomological information this is largely a 415 hypothetical quantity, meant to illustrate the maximum detection probability that could 416 be achieved). 417 • Rainfall-Peak Timed: Starting the survey at the month with peak rainfall. 418 • Random Month Timed: The expected cumulative probability of detection achieved if 419 the survey was started during a random month (calculated in practice by simulating 420 survey starting in each of the year's 12 months and then calculating the average 421 cumulative probability of these surveys). 422 In addition to varying the timing of the survey (which varies according to the surveillance 423 strategy considered, as described directly above), we also varied the amount of sampling effort 424 (number of days sampled within each month) and the overall duration of the (i.e. how many 425 consecutive months were sampled). Note that the aim here is not to describe the exact 426 probability of missing An. stephensi in any given entomological survey, as this will depend on 427 a wide array of other, poorly defined and heterogeneous factors (such as type of catch 428 methodology used etc). Instead, the aim is to highlight how variation in seasonal dynamics 429 can influence the nature of surveillance required to successfully An. stephensi. 430

Modelling of Malaria Transmission and the Impact of Anopheles stephensi 431
We integrated the temporal profiles of An. stephensi abundance into a well-established 432 deterministic compartmental model of Plasmodium falciparum malaria transmission and 433 disease 21-23 to explore the implications of the vector's establishment and seasonality on the 434 dynamics of malaria transmission, with a particular focus on areas where malaria transmission 435 is currently low or absent. What follows is a description of the mathematical modelling 436 framework in general terms, followed by specific details about how exactly this framework was 437 used to model malaria transmission underpinned by An. stephensi in settings where malaria 438 is currently absent or only minimally present. 439 The deterministic malaria model used here considers both human and mosquito populations. 440 Humans begin as Susceptible (S), and upon infection (at a rate which is dependent on the 441 force of infection they experience), progress to either Asymptomatic (A) or clinical disease, 442 with the comparative probability of these two outcomes depending on the degree of acquired 443 natural immunity due to previous exposure to the parasite. In all cases, the impact was calculated by comparing the reduction in malaria burden (as 496 measured by total annual incidence in the 12-month period following spraying) compared to a 497 counterfactual of no IRS. Stratified by Cluster. For each study location and time-series, we calculated the time-535 difference (in months) between the peak vector density and peak monthly rainfall. Our results 536 highlighted systematic differences between clusters, but also significant variation within 537 clusters. we re-ran the k-means clustering algorithm this time specifying 4 clusters. The less seasonal 547 cluster from the 2 cluster analysis in the main text (Cluster 2 in the main text results) was 548 retained (here Cluster 3), and Cluster 1 from the main text was further disaggregated into 3 549 different clusters (here, Clusters 1, 2 and 4), each defined by different peak timings (mean 550 timing of vector density peak 7, 8.25 and 5.86 months after January for Clusters 1, 2 and 4 551 respectively) and the timing of the vector peak relative to peaks in rainfall (rainfall peak on 552 average 1.03 and 2.32 months before vector density peak for Clusters 1 and 2, 1.09 months 553 after vector density peak on average for Cluster 4). time-series to create a dataset with equal numbers of time-series belonging to each cluster. 584 As a sensitivity analysis, we also carried out the random forest fitting without upsampling and 585 assessed both model fit (as measured by AUC, (A)) and variable importance (B). Model 586 performance was somewhat reduced compared to the upsampled data (mean AUC of 0.81 vs 587 mean AUC >0.9 for the upsampled dataset), whilst variable importance results were broadly 588 consistent across both analyses, with population per square kilometre and various land-cover 589 measures all emerging as important predictive variables. We also present partial dependence 590 plots for all of the included covariates (C). The y-axis on the left shows the probability of the 591 time-series belonging to Cluster 2 (i.e. a high probability indicates the time-series is predicted 592 to likely belong to Cluster 2, a low probability indicates the time-series likely belongs to Cluster 593 1). The x-axis describes the value of the (scaled, normalised) covariate. the results presented in the main text were generated using a random forest-based workflow where final model fitting (using hyperparameters 598 tuned using 6-fold cross-validation) utilised the entirety of the dataset. As a sensitivity analysis, we also carried out the random forest fitting 599 holding out a small portion of the dataset (n = 9) during model fitting, with model performance subsequently evaluated on this held-out data. 600 Results presented above are in the case where data was upsampled to address class imbalance (top) and where no upsampling was carried out 601 (bottom). Collated time-series are displayed above coloured according to 1) whether or not the study was carried out in an urban or rural location; and 2) 618 which cluster they were assigned to. 619