Towards accurate spatial prediction of Glossina pallidipes relative densities at country-scale in Kenya



Introduction
Vector-borne diseases (VBDs) pose a significant and pervasive threat to public health worldwide, affecting both humans and livestock. These diseases are transmitted through the bites of various vectors, predominantly insects. According to the World Health Organization (WHO), the impact of most VBDs can be mitigated through the proper implementation of various vector control strategies (World Health Assembly, 2017). Quantifying the distribution and relative abundance of disease vectors plays a pivotal role in guiding decision-making processes and executing timely and efficient control measures. While research has made reasonable progress in understanding the environmental factors that shape the geographic distribution of various disease vectors and has advanced extensively in producing spatial and temporal maps of their suitable habitats, only a few studies have attempted to assess the spatial relative abundance of disease vectors (Waldock et al., 2022). Understanding the spatial variation in disease vector populations is crucial, as these numbers serve as indicators of disease risk and persistence.
Large-scale, in-situ monitoring of disease vectors to understand their spatial and temporal relative abundance, identify areas of potential disease risk, and consequently prioritize areas for control is often impractical. Remote sensing technology enables the mapping of environmental and weather information over extensive regions, and remote sensing derived datasets have become readily available over the past years. Despite challenges related to integrating in-situ data with satellite-based estimates, such as data quality and model complexities, the improved integration of geospatial and remote sensing expertise into VBD control programs has notably advanced global research on the environmental and weather factors that affect disease vector abundance (Carrasco-Escobar et al., 2022; Dlamini et al., 2019; Kalluri et al., 2007; Mechan et al., 2023; Palaniyandi et al., 2021). In addition, the rapid evolution of technology has led to the development of increasingly sophisticated predictive modeling methodologies, particularly through the use of machine learning (Kaur et al., 2021, 2022; Keshavamurthy et al., 2022; Yu et al., 2022), which offers an advantage over traditional geostatistical methods by capturing intricate non-linear interactions among variables (Taconet et al., 2021).
Among the disease vectors, studies aiming to map and understand their spatial relative densities have focused mostly on mosquitoes and ticks. In addition to the commonly employed classical geostatistical models (Mudele et al., 2021; Rosà et al., 2019; Shutt et al., 2022; Talbot et al., 2019), these studies have increasingly turned to machine-learning technologies (González Jiménez et al., 2019; Joshi and Miller, 2021; Makridou et al., 2023; Rahman et al., 2021; Schneider et al., 2022). A significant advantage of machine learning over traditional geostatistical methods is its innate capacity to capture complex associations among variables, often resulting in higher predictive accuracies. For instance, Ibañez-Justicia and Cianci (2015) showed more accurate performance of random forest over other models in predicting mosquito abundance, a finding that was reiterated by Rahman et al. (2021) when predicting the abundance of Aedes aegypti female mosquitoes in Thailand. For tick abundance in Southern Scandinavia, Jung Kjaer et al. (2019) employed Boosted Regression Trees (BRT) with gridded environmental and weather variables. Their approach yielded higher accuracy for tick larvae and nymphs (R² of 0.69) but less accurate results for adult ticks (R² of 0.1). Ceia-Hasse et al. (2023) demonstrated an improved performance of deep learning over classical machine learning (area under curve values 0.83 and 0.75, respectively) in predicting the abundance of yellow fever mosquitoes in Madeira, Portugal. These findings collectively highlight the promising potential of various machine learning techniques in developing high-performing models for assessing disease vector abundance.
This study focuses on tsetse flies. While other insects can mechanically transmit trypanosome pathogens (Desquesnes and Dia, 2003; Mihok et al., 1995), tsetse flies are the sole biological vectors of the trypanosome pathogens that cause African Trypanosomiasis in humans and livestock across Sub-Saharan Africa, and they are by far the most significant vectors. While environmental factors influencing tsetse fly distribution are well-documented (Bishop et al., 2021; De Beer et al., 2021; Gachoki et al., 2021), building on the pioneering work by Rogers and Randolph (1986), only a few studies have assessed spatial variations in tsetse numbers, and these are limited to the use of standard geostatistical models. For instance, Lord et al. (2018) used generalized linear models (GLMs) and satellite data to forecast tsetse numbers inside and outside Serengeti National Park in Tanzania. However, their model revealed varying prediction accuracy due to a temporal mismatch between tsetse data collection (2010 and 2015) and the satellite predictors used (2015). Mugenyi et al. (2021) employed standard Poisson and zero-inflated Poisson models to predict tsetse numbers per trap per day in Uganda. This study found that during dry seasons high tsetse numbers were concentrated in low-lying areas, animal reserves, wooded landscapes, and shrub-covered regions. However, the models failed to capture tsetse abundance patterns in the wet season, which was associated with an increased dispersal rate of tsetse flies during this period.
Among the many machine learning techniques, here we selected the classical random forest method due to its well-documented success in predicting the abundance of other disease vectors, such as mosquitoes. To enhance the precision of our method, we implemented two feature elimination techniques, namely Recursive Feature Elimination (RFE; Darst et al., 2018; Khun, 2022) and Variable Selection Using Random Forests (VSURF; Genuer et al., 2022; Speiser, 2021; Speiser et al., 2019). RFE retains variables based on their importance in explaining the response variable (Khun, 2022), while the VSURF method goes a step further by assessing how these important variables contribute to predicting the response variable and retains only those that lead to a reduction in prediction error (Genuer et al., 2022). To the best of our knowledge, there have been no previous attempts to develop random forest models for predicting relative tsetse fly numbers by integrating in-situ tsetse catches with satellite-based variables. Furthermore, the impact of different feature elimination techniques on tsetse predictions has not been assessed previously.
We employed freely and readily available satellite-based estimates of human population, land cover, soil properties, elevation, precipitation, and land surface temperature. In previous studies, these environmental variables have been effectively associated with tsetse relative abundance (Gachoki et al., 2023a; Lord et al., 2018; Mugenyi et al., 2021; Ngonyoka et al., 2017a, 2017b) and distribution (Gachoki et al., 2021). Our main aim was to assess how well a random forest model, utilizing different sets of predictor variables, could predict the relative spatial abundance of Glossina pallidipes tsetse species across a broader region of Kenya, while utilizing geographically constrained tsetse data. The tsetse trap locations used in this study have previously been employed to predict tsetse habitats at different life stages (Gachoki et al., 2021), evaluate the accuracy of transferring these habitats between regions (Gachoki et al., 2023b), and comprehend the environmental factors driving the observed temporal dynamics in tsetse numbers (Gachoki et al., 2023a). Here, our specific objectives were to: a) leverage readily available satellite-based estimates of environmental factors and train a random forest model for predicting the relative spatial abundance of G. pallidipes, b) apply the trained model beyond the spatial domain of the data used for training and evaluate the reliability of the spatial predictions, and c) identify limitations and opportunities for accurately mapping the relative spatial densities of tsetse flies beyond the monitored localities.

Tsetse density data
The tsetse fly data used to train the random forest model were collected from three geographically separated areas in Kenya (Fig. 1a, b, and c) during different months in 2021 (Table 1). Tsetse monitoring involved a mix of biconical (Brightwell et al., 1987) and NGU (Dransfield et al., 1991) traps baited with cow urine and acetone, except in the Nguruman conservancy (Fig. 1b), where only NGU traps were deployed. In all regions, the traps were emptied every two days, with two or four repetitions per month (see Table 1). Although previous studies (Gachoki et al., 2021, 2023a, 2023b) used data from traps at the same, or similar, locations, those studies did not use the 2021 data used here. G. pallidipes was the sole species that occurred in all sampling sites and accounted for the majority of trapped tsetse flies (Table 1). While we have highlighted the percentages of other tsetse fly species captured per region, all analyses in this study utilize only the G. pallidipes data. The datasets for the three areas were combined into a single database. To standardize the data and ensure comparability across the diverse monitoring periods of our traps, we implemented a systematic approach to the data analysis. First, we calculated the number of flies per trap per day (FTD) by dividing the total G. pallidipes count by the number of days the trap was monitored. Given that the trapping period ranged between 8 and 32 days in our data (Table 1), we wanted to avoid the effect of monitoring effort on our tsetse fly abundance observations entering the model. To achieve this, we assumed a baseline expectation of at least one tsetse fly catch every eight days. In instances where traps were monitored for periods exceeding eight days with only one tsetse fly trapped, these were treated as if no flies were caught. While it is known that the number of tsetse flies entering a trap is usually low (Lindh et al., 2009) and that biconical traps are less efficient in trapping G. pallidipes compared to the NGU trap (Asfaw et al., 2022; Dransfield and Brightwell, 2001), this adjustment allowed us to align the data and facilitate meaningful comparisons across all monitoring periods, excluding the effect of monitoring effort. For modeling purposes, we converted the FTD values into log10(FTD + 1) to address data skewness and stabilize the variance (Feng et al., 2014), adding 1 to avoid undefined values when no flies were caught.
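The catch standardization described above can be sketched in a few lines. This is a minimal illustration in Python, not the authors' code (the study's analyses were run in R). The function names and the `baseline_days` parameter are hypothetical, and the eight-day default is an assumption taken from the baseline of one fly every eight days reported in the results.

```python
import math


def flies_per_trap_per_day(total_catch, days_monitored, baseline_days=8):
    """Return flies per trap per day (FTD) after the effort adjustment.

    baseline_days is a hypothetical parameter name; its default of 8 is an
    assumption based on the one-fly-per-eight-days baseline in the text.
    """
    if days_monitored <= 0:
        raise ValueError("days_monitored must be positive")
    # A single fly caught over a period longer than the baseline is
    # treated as if no flies were caught.
    if total_catch == 1 and days_monitored > baseline_days:
        total_catch = 0
    return total_catch / days_monitored


def log_ftd(ftd):
    """Transform FTD to log10(FTD + 1); the +1 keeps zero catches defined."""
    return math.log10(ftd + 1)


# Illustrative traps only (not data from the study): (catch, days monitored).
traps = [(0, 8), (1, 8), (1, 32), (144, 16)]
responses = [log_ftd(flies_per_trap_per_day(c, d)) for c, d in traps]
```

Passing the baseline as a parameter makes it easy to check how sensitive the zero-catch share is to that choice.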

Environmental and weather predictor variables
To minimize model complexity, we utilized predictor variables that have been demonstrated to correlate with tsetse occurrence and population dynamics in previous studies (Gachoki et al., 2021, 2023a; Lord et al., 2018; Mugenyi et al., 2021). These included land cover fractions for multiple classes, elevation, average daily annual rainfall, soil moisture, land surface temperature, human population density, sand, and silt content (Table 2). We created two databases: one for 2021 (henceforth referred to as "2021") and another covering multi-annual averages from 2011 to 2021 (referred to as "2011-2021"). For land cover fractions, we calculated, for each land cover class, the percentage of 10 m-by-10 m pixels of the European Space Agency (ESA) global land cover data (Zanaga et al., 2021) within a 1 km-by-1 km grid. The aggregation of land cover classes into a 1 km-by-1 km grid size was chosen based on findings that the distribution of G. pallidipes is significantly influenced by the abundance of vegetation cover within these distances (Gachoki et al., 2021). Additionally, this resolution aligns with previous research suggesting the potential for tsetse flies to travel distances of up to 1 km within their geographic range in a day (Vale et al., 1984; Williams et al., 1992). All the other predictor variables were sourced from openly available resources (Table 2) and resampled to a 1 km-by-1 km spatial resolution by taking the average value.
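The land cover fraction calculation can be illustrated with a small sketch, assuming the 10 m land cover layer is available as a 2-D grid of integer class codes. The function name, the `block` parameter, and the class codes in the toy grid are illustrative assumptions, not the authors' implementation; at the resolutions described above, one 1 km cell corresponds to a block of 100 by 100 pixels.

```python
def class_fraction(grid, class_code, block=100):
    """Percentage of pixels equal to class_code within each block-by-block cell.

    grid is a list of equal-length rows of integer class codes. With 10 m
    pixels, block=100 corresponds to a 1 km-by-1 km cell. Incomplete blocks
    at the right and bottom edges are ignored in this sketch.
    """
    rows, cols = len(grid), len(grid[0])
    fractions = []
    for r0 in range(0, rows - block + 1, block):
        out_row = []
        for c0 in range(0, cols - block + 1, block):
            hits = sum(
                1
                for r in range(r0, r0 + block)
                for c in range(c0, c0 + block)
                if grid[r][c] == class_code
            )
            out_row.append(100.0 * hits / (block * block))
        fractions.append(out_row)
    return fractions


# Toy 4x4 grid aggregated with 2x2 blocks; the codes are placeholders.
toy = [
    [10, 10, 40, 40],
    [10, 40, 40, 40],
    [30, 30, 10, 10],
    [30, 30, 10, 10],
]
tree_fraction = class_fraction(toy, class_code=10, block=2)
# tree_fraction == [[75.0, 0.0], [0.0, 100.0]]
```

Running the function once per class yields one fractional layer per land cover class, which is the form in which the predictors enter the model.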
We extracted the values of each environmental and weather variable for both 2021 and 2011-2021 at the trap locations (n = 660) and at a random sample of the same size, and plotted histograms to compare each predictor variable at the trap level against the random sample. The two databases differed solely in terms of temporally varying variables.

Variable elimination
While machine learning methods can partially address multicollinearity (Fig. …), variable elimination helps prevent potential performance degradation of the model and avoids overfitting, which can limit the ability to make good extrapolations to unseen regions (Duque-Lazo et al., 2016). In this study, we employed two common variable elimination techniques for random forests: 1) RFE (Recursive Feature Elimination; Khun, 2022) and 2) VSURF (Variable Selection Using Random Forests; Genuer et al., 2022). In RFE, the user defines a termination condition for model performance, and the algorithm iteratively removes one variable at a time while evaluating its impact on the model's performance. This process continues until the algorithm reaches the best value of the predefined performance measure. In this study, Root Mean Square Error (RMSE) was used as the termination condition. As a result, RFE retained all variables whose removal led to a deterioration in the best RMSE value. VSURF follows a three-step process. First, it utilizes random forests to evaluate variable importance and systematically removes those with very low importance. Second, the retained variables are ranked based on their importance scores, and the top-ranked variables that contribute most to the predictive power of the model are retained. Finally, these selected variables are employed for making predictions of the response variable, retaining only those that reduce the prediction error. Both the RFE and VSURF variable elimination methods were independently applied to the two sets of data, resulting in four retained databases: 1) RFE 2021, 2) RFE 2011-2021, 3) VSURF 2021, and 4) VSURF 2011-2021. All analyses were performed in R using the caret (Khun, 2022) and VSURF (Genuer et al., 2022) packages.
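The idea of removing one variable at a time until RMSE stops improving can be sketched generically. The code below is a plain greedy backward elimination, not the RFE routine of the caret package or the VSURF procedure. The callable `rmse_of`, the predictor names, and the toy error function are hypothetical stand-ins for refitting and cross-validating a model on each candidate subset.

```python
def backward_elimination(features, rmse_of):
    """Greedy backward elimination driven by RMSE.

    rmse_of is a hypothetical callable that returns the cross-validated RMSE
    of a model fitted on the given list of feature names. At each step the
    variable whose removal gives the lowest RMSE is dropped; the loop stops
    as soon as every possible removal would worsen the best RMSE so far.
    """
    current = list(features)
    best_rmse = rmse_of(current)
    while len(current) > 1:
        trials = [
            (rmse_of([f for f in current if f != drop]), drop)
            for drop in current
        ]
        trial_rmse, drop = min(trials)
        if trial_rmse > best_rmse:
            break  # removing any remaining variable deteriorates RMSE
        current.remove(drop)
        best_rmse = trial_rmse
    return current, best_rmse


# Toy error function (an assumption for illustration): two informative
# predictors lower the error, every other predictor adds a small penalty.
GAIN = {"tree_cover": 0.3, "precipitation": 0.2}


def toy_rmse(feats):
    gain = sum(GAIN.get(f, 0.0) for f in feats)
    noise = sum(1 for f in feats if f not in GAIN)
    return 1.0 - gain + 0.01 * noise


kept, score = backward_elimination(
    ["tree_cover", "precipitation", "sand", "silt"], toy_rmse
)
# kept == ["tree_cover", "precipitation"]
```

In a real workflow `rmse_of` would wrap model fitting and resampling, which is where nearly all of the computational cost lies.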

Spatial cross-validation, model training, and spatial mapping
To create spatial clusters for the spatial cross-validation, we divided the tsetse monitoring traps in each area into two distinct sets by placing a diagonal line across each area, ensuring that both presence and absence samples were present in each set (Fig. 1; grey diagonal lines). We partitioned the data into training and testing datasets with the "CreateSpacetimeFolds" function from the CAST package (Meyer et al., 2019), resulting in a single dataset with six distinct training ("index") and corresponding testing ("indexOut") subsets, which were used directly as part of the tuning parameters during the model training process. We used the log-transformed FTD as our response variable and applied the ranger method (a faster implementation of random forest; Wright and Ziegler, 2017) within the caret package to fit the random forest model with each set of predictor variables.
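The structure of the six training and testing subsets can be illustrated with a short leave-one-cluster-out sketch. This is not the CAST implementation. The function name and the cluster labels are hypothetical, and the dictionary keys simply mirror the "index" and "indexOut" terminology used above.

```python
def leave_cluster_out_folds(cluster_ids):
    """Build one fold per spatial cluster.

    cluster_ids holds one label per trap (for example, area plus side of the
    diagonal line). Each fold trains on every cluster except one ("index")
    and tests on the held-out cluster ("indexOut").
    """
    folds = []
    for held_out in sorted(set(cluster_ids)):
        index = [i for i, c in enumerate(cluster_ids) if c != held_out]
        index_out = [i for i, c in enumerate(cluster_ids) if c == held_out]
        folds.append({"index": index, "indexOut": index_out})
    return folds


# Hypothetical labels: three areas (A, B, C), each split into two sets.
clusters = ["A1", "A2", "B1", "B2", "C1", "C2", "A1", "B2"]
folds = leave_cluster_out_folds(clusters)
# len(folds) == 6; folds[0]["indexOut"] == [0, 6]
```

Because whole clusters are held out, test traps are never direct neighbours of training traps from the same half-area, which is the property that distinguishes spatial from random cross-validation.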
The model performance was assessed based on the "out-of-bag" R² and RMSE values of the final trained model (average of the six models), which indicate how well the model generalizes to unseen data. Variable importance was determined using the permutation method, where variables are ranked according to how much the model's performance degrades when the values of specific variables are randomly shuffled (Breiman, 2001). We used the 'pdp' package (Greenwell, 2022) to generate partial dependence plots, which show the impact of each predictor on the response variable while holding all other predictors constant.
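The two accuracy measures and the permutation idea can be written down compactly. The sketch below is a generic illustration, not the ranger or caret implementation (ranger computes permutation importance per tree on out-of-bag samples). R² is given here in its 1 - SS_res/SS_tot form, which is an assumption because other definitions exist. The `predict` callable and the toy data are hypothetical.

```python
import math
import random


def rmse(obs, pred):
    """Root mean square error between observations and predictions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r_squared(obs, pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, rows, obs, seed=0):
    """Increase in RMSE when each predictor column is shuffled in turn.

    predict is a hypothetical callable mapping one row (a list of predictor
    values) to a prediction.
    """
    rng = random.Random(seed)
    baseline = rmse(obs, [predict(r) for r in rows])
    scores = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        scores.append(rmse(obs, [predict(r) for r in shuffled]) - baseline)
    return scores


# Toy model that uses only the first predictor; the second is irrelevant.
rows = [[float(i), float(i % 3)] for i in range(8)]
obs = [2.0 * r[0] for r in rows]
importance = permutation_importance(lambda r: 2.0 * r[0], rows, obs)
```

Shuffling the irrelevant second column leaves the predictions unchanged, so its score is zero, while any reordering of the first column increases the error.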
To create spatial maps of tsetse relative abundance, we applied the trained models to the respective predictor variables used during their training, covering the entire land area of Kenya. Because of the geographical constraints within our training dataset, we utilized the "aoa" (Area of Applicability) function from the CAST package to demarcate areas for which the environmental conditions are sufficiently represented by our training data (Meyer and Pebesma, 2021). The delineation of these areas depends on a threshold value that is internally calculated by measuring the dissimilarity between the predictor variables in the training data and those used in the model extrapolation. Predictor variables are assigned weights in this dissimilarity measurement according to their significance in explaining the response variable.
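The dissimilarity measure behind this delineation can be sketched as follows. This is a simplified illustration of the general idea (standardize the predictors, weight them by importance, and relate the distance to the nearest training point to the average distance among training points). It is not the CAST implementation, which also derives the threshold from cross-validated training data. All names and the toy values are assumptions.

```python
import math


def dissimilarity_index(train, new, weights):
    """Weighted distance of each new point to its nearest training point,
    divided by the mean pairwise distance among the training points.

    train and new are lists of rows of predictor values, and weights holds
    one importance weight per predictor. Predictors are standardized with
    the training mean and standard deviation before weighting.
    """
    n, p = len(train), len(train[0])
    means = [sum(r[j] for r in train) / n for j in range(p)]
    sds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in train) / (n - 1)) or 1.0
        for j in range(p)
    ]

    def scale(row):
        return [w * (x - m) / s for x, m, s, w in zip(row, means, sds, weights)]

    t = [scale(r) for r in train]
    pairwise = [math.dist(a, b) for i, a in enumerate(t) for b in t[i + 1:]]
    mean_pairwise = sum(pairwise) / len(pairwise)
    return [min(math.dist(scale(x), a) for a in t) / mean_pairwise for x in new]


# Toy example with two predictors of equal weight.
train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
di = dissimilarity_index(train, [[0.0, 0.0], [5.0, 5.0]], weights=[1.0, 1.0])
# di[0] is 0.0 (identical to a training point); di[1] is far larger.
```

Locations whose index exceeds a chosen threshold would be flagged as outside the area of applicability; how that threshold is set is deliberately left out of this sketch.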

Distribution of predictor variable values
Thirty-two percent of the 660 trapping locations had at least one fly trapped every eight days. The distribution of predictor variables at the trap level represented only a fraction of the broader range found within Kenya for most of the variables (Fig. 2). For example, temperature is a key factor influencing tsetse population dynamics (Are and Hargrove, 2020), and the range of daytime land surface temperature within our dataset is ~30 °C to 45 °C, while in larger Kenya it is ~20 °C to 50 °C. This implies that extrapolating the trained model to areas where predictor variables have values outside the training data range may result in underestimation or overestimation of tsetse relative abundance, depending on how those variables influence the relative abundance of tsetse.
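A simple per-variable range check makes this extrapolation risk concrete. The sketch below is illustrative only: the function name is hypothetical and the temperature values merely echo the approximate ranges quoted above.

```python
def outside_training_range(train_values, new_values):
    """Flag new values that fall outside the min-max range of the training data."""
    low, high = min(train_values), max(train_values)
    return [not (low <= v <= high) for v in new_values]


# Approximate daytime land surface temperatures in degrees Celsius.
trap_lst = [30.0, 38.0, 45.0]           # range observed at trap locations
kenya_lst = [20.0, 30.0, 44.9, 50.0]    # example grid cells across Kenya
flags = outside_training_range(trap_lst, kenya_lst)
# flags == [True, False, False, True]
```

A check on one variable at a time cannot detect unseen combinations of predictors, which is why a multivariate dissimilarity measure is the more informative diagnostic.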

Feature elimination and model performance
The two elimination techniques retained different sets of predictor variables, but with some overlaps. Notably, the RFE method retained a higher number of variables compared to the VSURF elimination method. Precipitation (P), population density (Pd), soil moisture (Sm), and tree cover fraction (Tlc) were retained for both elimination techniques across the two datasets (Table 3). Models trained using the predictor variables from 2021 measured an "Out-Of-Bag" R² and RMSE value of 0.41 and 0.52, respectively. In contrast, models trained using multi-year averages from 2011 to 2021 exhibited a lower average R² (0.38) and higher average RMSE value (0.55). This suggests that the spatial variability of tsetse fly numbers is better explained by environmental and weather conditions near the time of sampling.

Variable importance and partial dependence plots
The ranking of key predictors explaining the spatial variability in relative tsetse numbers differed across the four sets of predictors (Fig. 3). Within the subset of predictors determined through the VSURF feature elimination method, tree cover percentages and precipitation emerged as the foremost variables explaining G. pallidipes numbers, both for yearly (2021) and multi-annual (2011-2021) average predictors. Among the RFE-retained predictors, soil moisture and human density were among the top predictors for 2021, while for 2011-2021 the important predictors included cropland fractions and precipitation.
We observed that tsetse numbers exhibited an increase with higher tree cover fractions and a decrease as population density, croplands, soil moisture and rainfall increased (Fig. 4). The presence of abundant tree cover creates favorable conditions for tsetse populations, as it provides shaded areas for their resting and breeding. Conversely, in densely populated regions, humans may alter environments that were originally conducive to tsetse flies through activities such as the removal of woodlands and wild hosts, resulting in a reduced population of flies. Additionally, we observed a unimodal response with soil moisture, suggesting that both very dry and very wet soils are unsuitable for tsetse flies (Fig. 4). The plausible explanation is that excessively dry soils are too hard, preventing larvae from burrowing and pupating, while overly wet conditions can lead to drowning. Furthermore, excessive rainfall may cause flooding or water accumulation, posing risks to both adult tsetse flies and their larvae, or washing away burrowed pupae. Ultimately, these factors contribute to a decline in tsetse populations.

Tsetse relative density maps
The extrapolated predictions for Kenya reveal significant disparities among the four models (Fig. 5). The hatched black lines show areas that fall outside the range of environmental conditions observed in our training data based on the area of applicability analysis, and predictions in these regions should be regarded as less reliable. Note that the hatched areas also differ between maps because different predictor variables are used. Models generated using variables retained through VSURF for the year 2021 (Fig. 5b) exhibit more prominent tsetse hotspots outside of monitored regions when compared to the predictions based on other sets of variables (Fig. 5a, c and d). The majority of these hotspots (>6 FTD; Fig. 5b and c) are within known tsetse fly belts in Kenya (DeVisser and Messina, 2009). However, without tsetse ground data, it is impossible to definitively conclude that this reflects the actual situation.

Discussion
The primary objective of this study was to evaluate how well a classical random forest machine learning model together with satellite-based environmental estimates can predict relative tsetse abundance in all of Kenya using a spatially limited set of tsetse trapping data. Based on our results, in this section we also pinpoint areas of improvement and opportunities to enhance the precision and reliability of predictions of tsetse fly relative densities within Kenya.

National scale tsetse mapping with spatially limited data
Different sets of predictors revealed distinct important variables (Fig. 3). In the RFE datasets for 2021, the top predictors were soil moisture, human density, and tree cover. In contrast, for the long-term averages (2011-2021), highly ranked predictors included croplands, mean annual precipitation, and tree cover percentages. In the VSURF dataset, only four variables were retained, and the notable difference in their ranking was that soil moisture held a higher rank than human density in the 2021 set of predictors. Tsetse numbers started to decline when the daily annual average rainfall exceeded 2 mm/day and the volumetric soil moisture exceeded 10 mm (Fig. 4). This finding is consistent with prior research that examined temporal patterns of abundance, such as by Gachoki et al. (2023a), who found that tsetse numbers rose with increased rainfall but then declined when rainfall increased for more than a month. Intense rainfall can lead to excessive water accumulation, increasing soil moisture, which, in turn, can lead to the submerging or dislodging of buried pupae, ultimately causing a decrease in tsetse populations (Lukaw et al., 2014; Ngonyoka et al., 2017a, 2017b; Omoogun et al., 1989; Signaboubo et al., 2021). Additionally, during periods of heavy rainfall, the behavioral activity of tsetse flies actively seeking a host for feeding is likely to decrease, thereby lowering the probability of entrapment.
Fig. 3. Variable importance plot for the different dataset combinations and feature elimination techniques.
S. Gachoki et al.
In densely populated areas, heightened human activities such as clearing land for cultivation and settlements are likely to disrupt the favorable environments for tsetse fly resting and breeding, which explains the observed negative relationship. Conversely, a higher percentage of tree cover offers suitable conditions for tsetse resting and breeding, and other research also found tsetse numbers to correlate positively with abundant vegetated areas. For example, Lord et al. (2018) reported that high G. pallidipes abundance in Serengeti National Park, Tanzania, correlated with areas rich in vegetation, and Mugenyi et al. (2021) documented that another tsetse species, G. fuscipes fuscipes, also exhibited high numbers in vegetated regions. Shaded areas, such as those with ample tree cover, create a cooler microclimate that is essential for tsetse fly breeding and resting (Gachoki et al., 2021; Isherwood and Duffy, 1959). These areas also provide refuge for the animal hosts that tsetse flies rely on for blood meals (Isherwood et al., 1961). The combination of these factors may explain the observed positive relationship.
The attained performance of the trained models (R² values ranging from 0.38 to 0.41) demonstrates the potential for predicting tsetse fly relative numbers using machine learning methodologies. Our analysis indicates that using environmental and weather data near the period of tsetse monitoring yields more accurate predictions compared to longer-term averages. When extrapolating tsetse number predictions based on different sets of predictors, significant disparities emerge. VSURF-retained variables of 2021 reveal pronounced tsetse hotspots (>6 FTD) within known tsetse belts (McCord et al., 2012). On the other hand, predictions based on RFE-retained variables did not identify prominent hotspots, and most of these predictions fell outside the range of the environmental and weather data used for model training. Notably, for RFE-retained variables and VSURF long-term variables, extrapolations indicate higher tsetse number predictions (>2 FTD) in the northwestern region of Kenya (Fig. 5), which is not a historically known tsetse fly belt.
Previous research such as Lord et al. (2018) also reported overestimations of tsetse numbers by GLM-based models in regions beyond those monitored in Serengeti National Park. They attributed this overestimation to a mismatch between the period tsetse data was collected and when the environmental variables were estimated. However, in our study, most of the predictions of high tsetse numbers occurred in regions for which no tsetse data were available and thus not included in the model training, and where environmental predictors had values outside the range of our training data. Using predictive modeling techniques to extrapolate beyond the training data can lead to less accurate predictions due to the model's limited fitting of the response variable in those conditions (Gutzwiller and Serno, 2023; Muckley et al., 2023). We expected that creating a mask to delineate the "area of applicability" for the trained model (Meyer and Pebesma, 2021) would successfully filter out a significant portion of regions lacking training data, particularly in the Northern and Eastern regions where high tsetse predictions were evident. However, we found that most of these areas still fell within the range where the trained model's accuracy remained valid.

Prospects for enhancing large-scale spatial prediction of tsetse abundance
While this analysis establishes a basis for predicting tsetse numbers for large areas, the reliability of the current predictions remains uncertain. Consequently, to ensure that future national-level spatial maps of tsetse abundance are accurate and reliable, it is imperative to undertake several critical steps.
As mentioned earlier, extrapolating predictive modeling techniques beyond the data range used for model training can result in poor predictions because the model lacks knowledge of how the response variable behaves in such conditions (Gutzwiller and Serno, 2023; Muckley et al., 2023). In this study, the utilized tsetse fly data did not encompass all environmental conditions in Kenya, highlighting the need for additional trapping data covering a wide range of such conditions. The recently published Kenya tsetse atlas reveals that additional data exist from various sources (Ngari et al., 2020). However, trapping data is lacking in certain areas, particularly in the northern and eastern regions, where our current models consistently predict high tsetse abundance (Fig. 5a, c, d). Without trap data from these localities, it becomes impossible to validate the current models, and this may equally hinder the development of improved predictive models, even when incorporating data from the atlas. While initiatives like COMBAT (Controlling and progressively Minimizing the Burden of Animal Trypanosomiasis; Boulangé et al., 2022) can use these research findings to identify areas requiring increased sampling efforts, these regions might still be extensive, leading to high tsetse sampling costs. A cost-effective alternative would be to consider the implementation of citizen mapping (Hamer et al., 2018). In this approach, local communities would receive training on identifying tsetse flies in set traps and reporting their findings over time. Similar programs have proven effective in mapping other disease vectors, such as mosquitoes (Cohnstaedt et al., 2016; Palmer et al., 2017) and ticks (Laaksonen et al., 2017; Xu et al., 2016).
Tsetse flies rely on blood from both wild and domesticated mammals for survival (Ducheyne et al., 2009; Rogers, 1979), but this study lacked information on the distribution of animal hosts that tsetse flies feed on. Consequently, the predictive results in this study only explain tsetse densities based on environmental variables. A drawback is that unsampled areas may have environmental and weather conditions favorable to tsetse yet harbor no tsetse flies because hosts are absent. Therefore, incorporating data on the distribution of animal hosts can help exclude such areas, refining the extent of tsetse distribution. However, obtaining animal distribution data is challenging. While animal tracking seems like a viable method, the associated costs and potential reluctance from wildlife managers, who view tsetse flies as "guardians of Africa's biodiversity," make this approach less likely (Rogers and Randolph, 1988). An alternative solution could involve using publicly available information on protected zones as a predictor variable, given that many of these areas serve as refuges for wildlife. If localities beyond the protected zones are identified as hosting high tsetse fly densities, ground-truthing efforts may be necessary.
Another improvement to consider is developing a land cover map with classes specifically associated with the tsetse species under consideration. In this study, the primary tsetse species was G. pallidipes, which is positively correlated with woodlands, a land cover class that was absent from the freely available land cover layer we utilized. Also, the way satellite-based data is integrated into the models is of paramount importance. The prevailing approach in most tsetse predictive mapping models involves establishing a direct correlation between tsetse presence or abundance and various attributes related to vegetation cover. These attributes are derived from the actual and static land cover observed at the trapping site. However, given that tsetse flies move within their geographic range in search of a host to feed on (Brightwell et al., 1992) and tsetse traps are strategically positioned in areas where tsetse flies perceive them as potential hosts (Fuentes, 2017), it becomes highly likely that the land cover at tsetse trap locations may not accurately depict the genuine environmental conditions sustaining the tsetse population. While studies like the one conducted by Lord et al. (2018) made attempts to incorporate the tsetse dispersal range by employing a buffer to calculate averages of dynamic variables like LST, when it comes to land cover classes, it might prove more advantageous to calculate the overall percentage of each land cover class within a radius that corresponds to the typical movement range of tsetse flies (Gachoki et al., 2021). This adjustment has the potential to significantly enhance model performance, particularly when dealing with categorical data such as land cover.
The selection of appropriate parameters for model tuning is crucial when constructing predictive models. While it is widely acknowledged that machine learning methods such as random forest handle multicollinear data well, the inclusion of irrelevant variables can significantly diminish model performance. Furthermore, the variable elimination method should be chosen carefully, taking into account how well the training data represent the broader environmental conditions to which the model will be extrapolated. Our study demonstrates that a variable elimination method that retains variables according to how well they reduce the prediction error yields more reliable results (judged against known tsetse belts; McCord et al., 2012) than methods that retain variables solely based on their importance in explaining the training data. Future studies should also explore aligning tsetse observations with environmental and weather data collected during the same period as tsetse monitoring, as this is likely to enhance model performance.
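The idea of retaining a variable only when it reduces prediction error can be illustrated with a deliberately simplified sketch. It is not the VSURF procedure and does not use a random forest: as a stand-in, it uses greedy forward selection with a 1-nearest-neighbour predictor and leave-one-out error. The data, the variable names `signal` and `noise`, and both function names are hypothetical.

```python
def loo_error(X, y, cols):
    """Leave-one-out mean squared error of a 1-nearest-neighbour predictor
    that measures distance using only the columns listed in cols."""
    n = len(y)
    total = 0.0
    for i in range(n):
        nearest = min(
            (j for j in range(n) if j != i),
            key=lambda j: sum((X[i][c] - X[j][c]) ** 2 for c in cols),
        )
        total += (y[i] - y[nearest]) ** 2
    return total / n


def forward_select(X, y, names):
    """Greedily add the variable that lowers the held-out error the most;
    stop as soon as no remaining variable lowers it further."""
    selected, best = [], float("inf")
    remaining = list(range(len(names)))
    while remaining:
        err, col = min((loo_error(X, y, selected + [c]), c) for c in remaining)
        if err >= best:
            break
        selected.append(col)
        remaining.remove(col)
        best = err
    return [names[c] for c in selected], best


# Hypothetical data: the response equals the first variable; the second
# variable alternates between 0 and 10 and carries no information.
X = [[float(i), 10.0 * (i % 2)] for i in range(8)]
y = [float(i) for i in range(8)]

kept, error = forward_select(X, y, ["signal", "noise"])
print(kept, error)
```

Here the uninformative variable is dropped because adding it raises the held-out error, regardless of how much it might appear to "explain" within the training data.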
Lastly, when evaluating model performance, the common practice is to separate the training and test data beforehand. However, this approach can introduce bias, particularly when the data contain many zero values. We therefore strongly recommend that future research adopt spatial cross-validation to bolster model robustness. Spatial cross-validation randomly selects spatial blocks of data for training while reserving the others for testing, and repeats this process according to the number of specified folds. Because each block is held out in turn, the resulting estimate of prediction error is more realistic, which in turn supports better model selection and tuning (Meyer et al., 2019).
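A minimal sketch of block-based spatial cross-validation is given below, under stated assumptions: blocks are squares of a fixed size on a regular grid, blocks (not individual traps) are shuffled and dealt into folds, and the coordinates are hypothetical. The function names are invented for this example and do not correspond to any particular package.

```python
import random
from collections import defaultdict


def spatial_block_folds(points, block_size, n_folds, seed=0):
    """Group (x, y) points into square blocks, then assign whole blocks
    to folds at random so that a block is never split across folds."""
    blocks = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        blocks[(int(x // block_size), int(y // block_size))].append(idx)
    keys = sorted(blocks)
    random.Random(seed).shuffle(keys)  # random but reproducible block order
    folds = [[] for _ in range(n_folds)]
    for k, key in enumerate(keys):
        folds[k % n_folds].extend(blocks[key])
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx


# Hypothetical trap coordinates falling into four 5 x 5 blocks.
points = [
    (0.5, 0.5), (0.9, 0.2),
    (5.1, 0.3), (5.7, 0.8),
    (0.2, 5.5), (0.4, 5.9),
    (5.5, 5.5), (5.9, 5.1),
]
folds = spatial_block_folds(points, block_size=5.0, n_folds=2)
splits = list(train_test_splits(folds))
for train_idx, test_idx in splits:
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

Keeping neighbouring traps in the same fold is the design choice that matters: it prevents spatially autocorrelated observations from appearing on both sides of a split.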

Conclusion
Our research presents a framework for predicting relative G. pallidipes densities over large areas. Our findings indicate that additional tsetse sampling is essential to achieve a more reliable map of relative tsetse abundance in Kenya. This necessity arises because our models predicted high tsetse numbers in regions lacking in-situ tsetse trap data, potentially indicating that the environmental conditions of those regions are underrepresented in the training data. The method used to eliminate irrelevant variables is crucial when extrapolating predictions beyond monitored regions. The VSURF elimination method, which retains variables based on their ability to reduce prediction errors, offered a more reliable approach for extrapolation. While the accuracy of the extrapolated predictions in this analysis remains uncertain, our map of relative tsetse densities for Kenya is a tool that relevant organizations can use to plan and strategically deploy their surveillance efforts.

Fig. 1. Sampling sites for tsetse flies in Homabay (a), Kajiado (b) and Kwale (c) counties in Kenya. The flies per trap per day (FTD) values refer to trapped Glossina pallidipes. The light green boundaries in a) and c) are the Ruma National Park and Shimba Hills National Reserve boundaries, respectively. The grey diagonal lines indicate the split of each cluster for spatial cross-validation. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 2. Histograms showing the distribution of the various predictor variables at trap level and at random-sample level within Kenya. The sky-blue (random) and red (traps) bars represent the static predictor variables among the two data combinations. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 4. Partial dependence plots for the top four predictor variables, showing their estimated influence on tsetse abundance for the different sets of predictors.

Fig. 5. Extrapolated G. pallidipes FTD. The hatched black lines show areas that were outside the range of the environmental conditions in the training data according to the Area of Applicability method (Meyer and Pebesma, 2021).

Table 1
Details on tsetse traps for the three geographic regions in Kenya. n = number of tsetse flies observed.

Table 2
Predictor variables used and their sources. The ranges indicate the minimum and maximum values for the whole of Kenya. In the variable column, the italicized letters in brackets are the acronyms used to refer to these variables in this study.

Table 3
Retained variables based on the various feature elimination techniques.