Ground truthing global-scale model estimates of groundwater recharge across Africa

improved. However, global-scale models which reflected stronger climatic controls on their recharge estimates compared more favourably to ground-based estimates. Given this significant uncertainty in recharge estimates from current global-scale models, we stress that groundwater recharge prediction across Africa, for both research investigations and operational management, should not rely upon estimates from a single model but instead consider the distribution of estimates from different models. Our work will be of particular interest to decision makers and researchers who consider using such recharge outputs to make groundwater governance decisions or investigate groundwater security, especially under the potential impact of climate change.

Publisher's page: http://dx.doi.org/10.1016/j.scitotenv.2022.159765

Introduction
Global groundwater resources dwarf all alternative sources of freshwater (Gleeson et al., 2016) and are used by billions of people worldwide for drinking water, for securing global food production (Dalin et al., 2017), and for industrial water supply. Because recharge is a key component in determining sustainable groundwater use (Gleeson et al., 2020), understanding how groundwater recharge varies in space and time is essential for ensuring groundwater security (Aeschbach-Hertig and MacDonald et al., 2021). However, we cannot directly measure this critical variable across larger domains (or arguably even at small scales); thus, continental and global-scale investigations into groundwater security and variability rely on groundwater recharge estimates from commonly used global-scale models. Recharge estimates from these models have helped to investigate global groundwater depletion (Döll et al., 2014; Wada et al., 2010), its key drivers (Dalin et al., 2017; Döll, 2009), and alternative constraints on sustainable usage (de Graaf et al., 2019). Similarly, de Graaf et al. (2015) and Reinecke et al. (2019) used recharge outputs from a global-scale model to drive global gradient-based groundwater models. Cuthbert et al. (2019) used recharge estimates from a single global-scale model to explore global patterns in climate-groundwater interactions, though they also approximated the range of likely uncertainty in recharge by comparison to a second global recharge model dataset.
However, recharge outputs from these models are poorly constrained or unconstrained, as such models are typically either uncalibrated or calibrated against streamflow records (de Graaf et al., 2015; Döll and Fiedler, 2008). Calibration of global-scale models to groundwater heads, as done by de Graaf et al. (2017), suffers from commensurability issues due to scale differences between observations and modelled variables (Reinecke et al., 2020). Furthermore, any model calibration is typically biased towards data-rich regions of the world such as the USA and Europe (Döll and Fiedler, 2008). Therefore, in data-sparse regions such as Africa, recharge estimates are more dependent upon the choice of model structures and uncalibrated parameterisations directly derived from global datasets of relevant soils and other properties. Thus, such models typically reflect a limited representation of those processes not easily captured in globally available data, e.g., subsurface heterogeneity produced by karst (Hartmann et al., 2017), or transmission losses in drylands (Quichimbo et al., 2021). Uncertainties in these model structures and parameterisations are then propagated into subsequent analyses, which rarely quantify the influence of potentially divergent recharge estimates between models (Reinecke et al., 2021). Instead, authors usually select recharge outputs from a single model (Wada et al., 2010; de Graaf et al., 2015; Reinecke et al., 2019). If recharge outputs from several models disagree considerably, this likely suggests that our current understanding of groundwater behaviour and security attained by subsequent analyses is uncertain and possibly not robust.
Syntheses of ground-based estimates compiled from the literature could help us understand whether global model estimates appear reasonable or at least plausible. Recharge rates in these ground-based studies are estimated by an array of different methods with different strengths and weaknesses, including environmental tracers, chloride mass balance, water table fluctuations, soil moisture balance models, water balance models, and groundwater models (Scanlon et al., 2002). Mohan et al. (2018), Moeck et al. (2020) and MacDonald et al. (2021) have recently used literature compilation datasets to understand how and why recharge rates vary across continental to global scales. However, these ground-based estimates have rarely been used to assess whether recharge outputs from global-scale models are reasonable. Döll and Fiedler (2008) provide an example of how such estimates can be used to evaluate global-scale models, comparing annual recharge rates estimated by the global hydrologic model WaterGAP to those from 25 chloride profiles in arid and semi-arid regions. The authors found that WaterGAP overestimated recharge in dry settings and subsequently adjusted the model in these environments to only allow recharge on days when rainfall exceeded 10 mm. Jasechko et al. (2014) used ground-based estimates to evaluate another global model, PCR-GLOBWB, assessing whether seasonal biases in recharge ratios (recharge/precipitation) agreed with those identified using groundwater isotopes. However, neither study evaluated the models according to their ability to convert reasonable proportions of rainfall into recharge. Doing so could highlight whether discrepancies are associated with the partitioning of rainfall to recharge (i.e., model structure) or perhaps differences in precipitation forcing.
In this study, we compare, for the first time, groundwater recharge estimates of eight global hydrological models (Reinecke et al., 2021) with a recently compiled dataset of over 100 ground-based estimates of groundwater recharge across the African continent (MacDonald et al., 2021). By comparing multiple models to ground-based estimates, we can examine where different modelling frameworks seem plausible. We focus on Africa as it is a data-sparse region where the use of large-scale hydrological models is particularly important and because groundwater is essential for water security in many parts of the continent (MacDonald et al., 2012; Calow et al., 2010). Three research questions guide our investigations into how model structure uncertainty influences recharge estimates from global-scale models in data-sparse regions.
• Where do model estimates of average annual recharge and recharge ratio agree or disagree?
• Which environmental controls describe the recharge patterns produced by the different models?
• How do model estimates compare to ground-based estimates compiled from the literature?

Data and methods
In this section we discuss the key datasets and methods used in our analysis. This includes ground-based estimates of groundwater recharge from the literature, groundwater recharge outputs from global-scale models and the use of Random Forests to investigate the environmental controls on recharge outputs from global-scale models.

Global models
We compare historical recharge estimates from eight global-scale models within the Inter-sectoral Impact Model Intercomparison Project (ISIMIP), simulation round 2b (Reinecke et al., 2021). ISIMIP is a model intercomparison framework that enables the comparison of climate impact projections in different sectors (Frieler et al., 2017). We use the recharge estimates available through this project, all of which are provided at the same 0.5° × 0.5° uniform grid (approx. 50 × 50 km at the equator), and do not run the model simulations ourselves. Six of the eight models incorporated time-varying historical land use and water abstractions in their simulations. Two took a different approach: for CLM 4.5, abstractions and land use are fixed to 2005, and JULES-W1 does not model any abstractions. Telteu et al. (2021) provide a complete description of all the models included within ISIMIP and Reinecke et al. (2021) discuss how each model calculates groundwater recharge. Below we summarize the relevant model features reported in each paper (Table 1).
All model simulations used in this study are driven by the HadGEM2-ES Global Circulation Model, developed by the UK Met Office Hadley Centre (Collins et al., 2008). The Global Circulation Model was bias-adjusted by Frieler et al. (2017) using a trend preserving algorithm and the EWEMBI data as a baseline climate condition (Lange, 2018).

Ground-based groundwater recharge estimates
C. West et al. Science of the Total Environment 858 (2023) 159765
Ground-based recharge estimates in Africa were initially compiled from the literature by MacDonald et al. (2021) and are shown here in Fig. 4. MacDonald et al. (2021) undertook a thorough quality assurance when compiling the dataset, which includes comprehensive meta-information such as uncertainty ranges on the recharge estimates. For these reasons, and because of its focus on Africa, we selected this database above other meta-datasets (Moeck et al., 2020; Mohan et al., 2018). We further screened the dataset to only include georeferenced and time-stamped findings between 1979 and 2005, enabling a comparison to global models. This meant that we ultimately use 124 (out of originally 134) ground-based estimates of annual recharge across Africa. Our analysis of recharge ratio uses fewer data points, as only 106 of these estimates report corresponding mean annual precipitation values. Spatially, 28 of these estimates reflect recharge rates over spatial scales <100 km², and a further 39, 29, and 28 are for spatial scales of 100-2500 km², 2500-62,500 km² and >62,500 km², respectively.
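The screening step described above can be illustrated with a minimal sketch. The record layout (`GroundEstimate` and its field names) is hypothetical and not taken from the MacDonald et al. (2021) dataset, and requiring both the start and end year of a study to fall inside 1979-2005 is one possible reading of "time-stamped findings between 1979 and 2005".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GroundEstimate:
    # Hypothetical record layout; field names are illustrative only.
    lat: Optional[float]
    lon: Optional[float]
    start_year: Optional[int]
    end_year: Optional[int]
    recharge_mm: float                 # annual recharge (mm/year)
    precip_mm: Optional[float] = None  # mean annual precipitation (mm/year)


def screen(records: List[GroundEstimate],
           first_year: int = 1979,
           last_year: int = 2005) -> List[GroundEstimate]:
    """Keep georeferenced, time-stamped records inside the model period."""
    kept = []
    for r in records:
        georeferenced = r.lat is not None and r.lon is not None
        time_stamped = r.start_year is not None and r.end_year is not None
        if (georeferenced and time_stamped
                and first_year <= r.start_year and r.end_year <= last_year):
            kept.append(r)
    return kept


def recharge_ratio(r: GroundEstimate) -> Optional[float]:
    """Recharge ratio = annual recharge / annual precipitation.

    Returns None when no precipitation value is reported, which is why
    the recharge-ratio analysis uses fewer data points.
    """
    if r.precip_mm is None or r.precip_mm <= 0:
        return None
    return r.recharge_mm / r.precip_mm
```

Records without a precipitation value remain usable for the annual recharge comparison and are simply skipped for the recharge ratio comparison.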

Random forests and predictor variables
We used random forests to predict the simulated long-term annual recharge and recharge ratio estimates using environmental attributes (Table A1 in the appendix) to assess environmental controls on modelled outputs. A random forest is a supervised machine learning algorithm that combines multiple trees to produce an ensemble of predictions (Breiman et al., 1984; Breiman, 2001), which link predictor variables (environmental attributes) to a response (global model outputs). As discussed by Addor et al. (2018), the advantages of random forests include no assumptions about the form of relationships, the allowance of non-linear relationships between multiple predictors, reduced risk of overfitting compared to individual regression trees, and computational efficiency. Each regression tree in the ensemble model is trained on observations (model grid cells) randomly selected with replacement from a sub-sample of 70 % of the total observations ('in-bag' observations). In each forest, we use 250 trees, each of which can make a maximum of 400 decision splits. Greater numbers of trees or splits did not improve the accuracy of the predictions.
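To make the in-bag sampling concrete, the sketch below draws the training set for one tree. It takes the sentence above literally: a 70 % sub-sample is selected first, and observations are then drawn from it with replacement. The number of draws (equal to the sub-sample size), the function name `draw_in_bag` and the constant names are assumptions, not details given in the paper.

```python
import random

N_TREES = 250                # trees per forest (from the text)
MAX_SPLITS = 400             # maximum decision splits per tree (from the text)
MAX_LEAVES = MAX_SPLITS + 1  # a binary tree with k splits has k + 1 leaves


def draw_in_bag(n_obs, fraction=0.7, seed=0):
    """Return (in_bag, out_of_bag) observation indices for one tree."""
    rng = random.Random(seed)
    n_sub = int(round(fraction * n_obs))
    # Step 1: choose the 70 % sub-sample without replacement.
    subsample = rng.sample(range(n_obs), n_sub)
    # Step 2: draw the tree's training observations with replacement
    # (assumption: as many draws as the sub-sample has members).
    in_bag = [rng.choice(subsample) for _ in range(n_sub)]
    # Observations never drawn are 'out of bag' for this tree.
    out_of_bag = sorted(set(range(n_obs)) - set(in_bag))
    return in_bag, out_of_bag


# One independent draw per tree in the forest.
bags = [draw_in_bag(1000, seed=tree) for tree in range(N_TREES)]
```

Because each tree leaves some grid cells out of its training set, every grid cell can be predicted by the trees that never saw it, which is what makes out-of-bag evaluation possible.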
To interpret the random forest models, we follow an approach taken by Addor et al. (2018) and group predictor variables according to climate, landcover, topography, soils and geology. Independent random forest models are then developed for each predictor group when estimating recharge outputs from the global models. Determining the R² values for the 'out of bag' (i.e., not training data) predictions from each ensemble model then allows us to see how much of the variability in global model outputs can be explained by either climate, landcover, topography, soils, or geology, as well as jointly using all variables. Information about the predictor variables/attributes in each predictor group and the datasets used to characterise these variables can be found in Table A1 in the appendix.
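The per-group evaluation can be sketched as follows. The sketch assumes that R² is computed as one minus the ratio of residual to total sum of squares; the paper does not state which definition of the coefficient of determination was used. The function names and the dictionary layout are illustrative.

```python
def r_squared(simulated, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed form)."""
    mean_sim = sum(simulated) / len(simulated)
    ss_res = sum((s - p) ** 2 for s, p in zip(simulated, predicted))
    ss_tot = sum((s - mean_sim) ** 2 for s in simulated)
    return 1.0 - ss_res / ss_tot


def explained_by_group(simulated, oob_predictions):
    """Score the out-of-bag predictions of each predictor group.

    simulated:        global model output per grid cell
    oob_predictions:  dict mapping a group name (e.g. 'climate', 'soils',
                      'all') to that forest's out-of-bag predictions
    """
    return {group: r_squared(simulated, preds)
            for group, preds in oob_predictions.items()}
```

Under this definition, a forest that reproduces the global model output exactly scores 1, and one that does no better than predicting the mean scores 0.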

Results and discussion
We organise the following results and discussion according to the three research questions presented at the end of the introduction. For each question, the results are presented and then immediately discussed.

Where do recharge estimates from global-scale models agree or disagree?
We assess the agreement of annual recharge and recharge ratio estimates from the eight global models by investigating the standard deviations (using absolute values) and coefficients of variation (also called relative standard deviation, using normalized values) across the model outputs (Fig. 1). In absolute values, we find that annual recharge estimates from the eight models disagree considerably in wetter regions of Africa (Fig. 1.a). Disagreements in annual recharge estimates are greatest in Central Africa, the Ethiopian highlands, Madagascar, and along the west African coastline, with 24 % of pixels having standard deviations above 100 mm/year. In contrast, disagreements in annual recharge estimates are <10 mm/year for 38 % of pixels, predominantly extending across dry regions of the continent such as the Sahara, Southern Africa, and the Horn of Africa. However, even in some very dry regions, we find places where annual recharge estimates disagree, predominantly along the river Nile and irrigated agriculture regions of the Sahara. In these locations, CLM 4.5, CWatM, H08, LPJmL, and MATSIRO all estimate recharge ratios greater than one (Figs. S1 & S2), likely reflecting modelled transmission losses or irrigation return flows.
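The two agreement measures can be computed per pixel as in the following sketch. The recharge values are invented for illustration and are not taken from the ISIMIP outputs, and the use of the population (rather than sample) standard deviation is an assumption.

```python
import statistics


def model_spread(estimates):
    """Return (standard deviation, coefficient of variation in %)
    across the model estimates for a single pixel."""
    mean = statistics.fmean(estimates)
    sd = statistics.pstdev(estimates)  # population form (assumption)
    cv = 100.0 * sd / mean if mean > 0 else float("nan")
    return sd, cv


# Invented annual recharge estimates (mm/year) from eight models.
wet_pixel = [200, 400, 300, 500, 250, 350, 450, 150]
dry_pixel = [0.5, 8, 1, 0.2, 3, 0.1, 12, 0.4]

sd_wet, cv_wet = model_spread(wet_pixel)
sd_dry, cv_dry = model_spread(dry_pixel)
```

With these invented numbers the wet pixel has a large standard deviation but a modest coefficient of variation, while the dry pixel shows the opposite, so absolute and relative disagreement are worth mapping separately.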
Recharge ratio estimates disagree across both wet and dry regions (Fig. 1.b). They disagree across much of the Sahara, with standard deviations >0.1. However, most of this divergence derives from only one model, PCR-GLOBWB, which estimates recharge ratios >0.2 for nearly all of Northern Africa (Fig. S2). The other seven models estimate recharge ratios below 0.05 for most of the Sahara. Modelled recharge ratio estimates also disagree throughout Central Africa and the Ethiopian Highlands, such that 64 % of pixels have a standard deviation >0.1 for recharge ratio. For 32 % (5 %) of the pixels, standard deviations in recharge ratio estimates from the eight models are below 0.05 (0.01), primarily distributed across the Sahel, Southern Africa, and the Horn of Africa.
Table 1. List of ISIMIP 2b global-scale models included in our analysis with details about some structural properties, recharge definition and calibration procedure. Model type abbreviations stand for Global Hydrological Model (GHM), Land Surface Model (LSM) and Dynamic Global Vegetation Model (DGVM). Columns: Model; Model type; PET scheme; Runoff scheme; Capil. rise; Pref. flow; Soil layers; Total soil depth. Surviving entries: calibrated against monthly and daily streamflow data in 12 catchments (model name not preserved); H08: GHM, PET scheme bulk transfer coefficient (Hanasaki et al., 2008), runoff scheme leaky bucket (Manabe, 1969).
The picture changes if we compare estimates of mean annual recharge (Fig. 1.c) and recharge ratio (Fig. 1.d) relative to their ensemble mean. The relative disagreement between models is high where estimated recharge rates are low, and vice versa. Therefore, model differences tend to be either high in magnitude (large standard deviation, small coefficient of variation) or high relative to estimated recharge (small standard deviation, large coefficient of variation). Identifying regions where model disagreement is high relative to recharge flux is important, especially if the models are later used to estimate climate change impacts (e.g., Hartmann et al., 2017). Here, the coefficient of variation (i.e., relative standard deviation) between modelled estimates is >100 % (200 %) for 66 % (29 %) of pixels in Africa, mostly occurring in Northern and Southern Africa as well as the Horn of Africa, whilst only 6 % of pixels have a coefficient of variation <50 %. Furthermore, as relative differences in modelled annual recharge estimates are equal to relative differences in modelled recharge ratio estimates, they show the same spatial patterns (Fig. 1.c & d). This highlights how differences in annual recharge estimates are caused by differences in precipitation-recharge conversion rates, which could be attributed to varying model structures, parameterisation schemes or datasets used to characterise the land surface. Further analysis investigating model differences at the global scale is included in the supplemental information (Figs. S3 & S4), along with a discussion about previous model comparisons for groundwater recharge in other parts of the world. Fig. 2 shows that both absolute and relative differences in groundwater recharge estimates vary strongly with long-term mean annual P-PET.
In disagreement with previous empirical studies (MacDonald et al., 2021; West et al., 2022), recharge ratio estimates from PCR-GLOBWB and CWatM are surprisingly high (>0.2) across desert regions where P-PET is low, leading to large absolute model disagreements for recharge ratio throughout dry and wet regions (Fig. 1.b & Fig. S2). The reasons for this deviation are not obvious.
Fig. 1. … Table 1 and discussed in detail in Reinecke et al. (2021). (a) The standard deviation of long-term average annual recharge rates (mm/year); (b) the standard deviation of long-term average recharge ratios (−); (c) the coefficient of variation of long-term average annual recharge rates (%); (d) the coefficient of variation of long-term average recharge ratios (%). The colouring scheme we have selected is intended to highlight where our understanding of recharge varies or converges. Lighter colours highlight where model estimates diverge and hence where our understanding of recharge processes is incoherent. Darker colours show where models agree, though this does not mean they are correct.
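The statement in this section that relative differences in annual recharge equal relative differences in recharge ratio can be seen from a short derivation. It is a sketch under two assumptions that the text suggests but does not spell out: all eight models receive the same long-term precipitation $P$ in a given pixel (they share the HadGEM2-ES forcing), and the recharge ratio of model $i$ is its long-term mean recharge $R_i$ divided by $P$. The symbols $R_i$, $P$, $\mu$ and $\sigma$ are introduced here for illustration.

```latex
\begin{align}
\mathrm{CV}\!\left(\tfrac{R}{P}\right)
  &= \frac{\sigma\!\left(\tfrac{R_1}{P},\dots,\tfrac{R_8}{P}\right)}
          {\mu\!\left(\tfrac{R_1}{P},\dots,\tfrac{R_8}{P}\right)}
   = \frac{\tfrac{1}{P}\,\sigma(R_1,\dots,R_8)}
          {\tfrac{1}{P}\,\mu(R_1,\dots,R_8)}
   = \mathrm{CV}(R)
\end{align}
```

Dividing every model estimate by the same positive constant rescales the standard deviation and the mean equally, so the coefficient of variation is unchanged, whereas the standard deviation itself is scaled by $1/P$.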

Which environmental controls describe the recharge patterns produced by the different models?
We use random forest models to investigate which environmental controls, as identified from datasets with global coverage, describe the spatial patterns evident in the recharge outputs from global-scale models. The logic is that recharge outputs from global-scale models should reflect process controls similar to those identified in continental to global-scale empirical studies found in the literature. If global-scale models show very different controls to those found through empirical studies, then this may indicate that the model structures of global-scale models do not adequately represent dominant recharge controls.
Climate variables explain most of the spatial variability in annual recharge and recharge ratio estimates from all eight models, with landcover and soil properties also showing some explanatory power (Fig. 3). The static descriptors of climate, landcover, and soils used in the random forest models explain at least 80 % of the modelled spatial variability of annual recharge and at least 70 % of the spatial variability of recharge ratio. Climate alone explains between 67 % and 89 % of the spatial variability in long-term annual recharge estimates from the eight global-scale models. For five of the models, climate also explains >70 % of the spatial variability in recharge ratio estimates. For the remaining three models (PCR-GLOBWB, LPJmL, CWatM), just over 50 % of the spatial variability of recharge ratio can be explained by climate alone (53 %, 65 %, 51 %). Landcover attributes are second in explaining the spatial variability in recharge outputs from all eight global models. These attributes explain on average 61 % (56 %) of annual recharge (recharge ratio) estimates, whilst soil properties explain on average 33 % (34 %) of their spatial variability.
Considering the interactions between climate, landcover, topography, soils and geology allows us to explain only small additional proportions of the spatial variability in both annual recharge and recharge ratio estimates from the global models. For annual recharge (recharge ratio), the additional explanatory power when considering interactions between all predictor variables in contrast to climate controls alone, is on average 10 % (13 %) across each of the global-scale models. This might be expected as the co-evolution of climate, landcover and soils causes these properties to co-vary in space and form large-scale landscape patterns (Pelletier et al., 2013;Troch et al., 2013). Therefore, using climate variables alone to predict modelled recharge outputs implicitly considers some information about landcover and soils.
The importance of climate controls on global model outputs is consistent with the findings of MacDonald et al. (2021) for ground-based recharge estimates. Their regression showed that ground-based annual recharge estimates vary throughout Africa according to mean annual precipitation, which explains 82 % of this variability. Yet in contrast to our global model findings, they do not find that including additional variables beyond climate improves their prediction of ground-based recharge estimates across Africa, though they suggested that it could be important locally. The low level of predictability of climate controls on recharge estimates from PCR-GLOBWB, LPJmL and CWatM shows that these models are inconsistent with the controls found for ground-based estimates in Africa (MacDonald et al., 2021). Other empirical studies, analysing annual recharge rates globally, do however highlight the importance of vegetation and soils for partitioning precipitation to recharge (Kim and Jackson, 2012; Mohan et al., 2018). The differences between these global analyses and MacDonald et al. (2021) could highlight particularly strong climate controls on the spatial variability of recharge in Africa. Precipitation gradients in other regions may not be as great as those in Africa, which may explain why landcover and soil properties have helped to explain the variability of recharge globally. The difference may also be attributed to spatial patterns emerging more clearly in larger (>600 datapoints) datasets covering more environments, or to more rigorous data quality assurance by MacDonald et al. (2021). Quality assurance procedures by MacDonald et al. (2021) led to the removal of 182 (from an initial 316) data points, which may have reduced the variability in recharge estimates and therefore produced a clearer climatic control.
Fig. 2. … Table 1 and shown in Fig. 1. (a) Long-term average P-PET (mm/year) against the standard deviation of long-term average annual recharge rates (mm/year); (b) long-term average P-PET (mm/year) against the standard deviation of long-term average recharge ratios; (c) long-term average P-PET (mm/year) against the coefficient of variation of long-term average annual recharge rate; (d) long-term average P-PET (mm/year) against the coefficient of variation of long-term average recharge ratios.

How do model estimates compare to ground-based estimates compiled from the literature?
Ground-based recharge estimates are distributed unevenly across the African continent and are predominantly located in dryland landscapes whilst wet tropical landscapes are underrepresented (Fig. 4). For desert, dryland, wet tropical, and wet tropical forest landscapes we have 21 (16), 71 (61), 29 (26), and 3 (3) ground-based estimates of annual recharge (recharge ratio), respectively. We compare these estimates to those from the global-scale models for the relevant period. A description of how we delineated the four landscapes is provided by West et al. (2022).
Discrepancies between global-scale model and ground-based estimates of annual recharge and recharge ratio are larger in wetter than in drier recharge landscapes, though relative discrepancies are similar across the different environmental settings (Fig. 4). The median magnitudes of discrepancies in annual recharge (recharge ratio) in desert, dryland, wet tropical and wet tropical forest landscapes are 0.4 mm/year (0.007), 4.26 mm/year (0.01), 31.7 mm/year (0.02) and 105 mm/year (0.06), respectively, when looking across all eight global models. In contrast, the median magnitudes of relative discrepancies in annual recharge (recharge ratio) in desert, dryland, wet tropical and wet tropical forest landscapes are 23 % (30 %), 29 % (30 %), 29 % (22 %) and 12 % (30 %), respectively. Furthermore, relative discrepancies in annual recharge estimates are often similar if not identical to relative discrepancies in recharge ratio. For each model, linear correlations between the relative discrepancies in annual recharge and recharge ratio vary between 0.96 and 0.99 (Fig. S5).
Landscape-specific discrepancies between global models and ground-based recharge estimates are more noticeable for several global models. CLM 4.5 and CWatM show the greatest overpredictions in wet tropical forest landscapes, while H08, JULES, and WaterGAP 2 significantly underpredict in this domain. PCR-GLOBWB and CWatM show the largest overpredictions in terms of recharge ratios in desert landscapes. Climate controls on recharge ratio estimates from these two models were less dominant than for the other models (Fig. 3), which together with generally large overestimations in recharge ratio suggests that they do not represent recharge controls in desert regions adequately. CLM 4.5 also displays larger overpredictions in recharge ratio in wet tropical forest landscapes. It is also interesting to note that in some model and landscape combinations, discrepancies to ground-based estimates show very little bias but a lot of variability, whilst in others the bias can be high but with much less variability. This is likely due to the modelling decisions made by each of the model development groups, such as model structure, calibration procedures and dataset selection (Telteu et al., 2021). However, diagnosing which of these decisions are responsible for our findings is non-trivial and will likely require further discussion and assessment between and by model developers (Gudmundsson et al., 2012). Additionally, comparing global model outputs to ground-based information also raises questions about how comparable the model spatial resolution is to the representative area of the ground-based data. However, we did not find a clear relationship between the magnitude of global model discrepancies and the varying spatial scales of ground-based recharge estimates (Fig. S6).
We find that model similarity to ground-based estimates varies considerably and inconsistently throughout the different Recharge Landscapes of Africa (Fig. 4). Even though six models predominantly underestimate mean annual groundwater recharge rates, this is not a consistent pattern, with over-estimates present for most models and in each landscape (Fig. 5). JULES-W1 appears to be the exception to this, as 87 % (92 %) of its annual recharge (recharge ratio) estimates are below ground-based findings. Across all eight models, we found that 56 % (55 %) of annual recharge (recharge ratio) discrepancies are underestimates, 18 % (18 %) are overestimates, and 26 % (27 %) fell within the uncertainty range of the ground-based estimate. Similar over and underestimation statistics for recharge rates and recharge ratios reflect that overestimation (underestimation) in annual recharge mostly corresponds to overestimation (underestimation) in recharge ratio. The tendency to underestimate recharge rates is particularly noticeable in dryland recharge landscapes, where 60 % of the global model estimates are below ground-based estimates, though this varies from 28 % to 92 % across the models.
Fig. 3. Coefficient of determination when predicting global model estimates for Africa of (a) long-term average annual recharge and (b) recharge ratio. Out of bag predictions using random forest models were performed for individual predictor groups (climate, landcover, topography, soils, and geology) as well as by considering the interaction between all predictor attributes (all). Random forest models are an ensemble of 250 regression trees each with a maximum of 400 decision splits. When predicting recharge ratios, we excluded pixels with estimates greater than one as their inclusion led to very low coefficient of determination scores when using all predictor attributes. Percentage of pixels excluded when predicting recharge ratio estimates were <0.5 % for all models. Horizontal lines highlight the three models which consistently reflect the strongest levels of climate, vegetation and soil controls on their annual recharge and recharge ratio estimates.
Fig. 4. Map of Africa organised into four recharge landscapes (West et al., 2022) with the distribution of ground-based annual recharge and recharge ratio (annual recharge/annual precipitation) estimates superimposed. Discrepancy between global model and ground-based estimates of (b) annual recharge (mm/year) and (c) recharge ratio (−) organised according to these recharge landscapes. Discrepancy is defined as global model estimate − ground-based estimate. Relative discrepancy between global model and ground-based estimates of (d) annual recharge (%) and (e) recharge ratio (%) organised according to these recharge landscapes. Relative discrepancy is defined as 100 × (global model estimate − ground-based estimate)/ground-based estimate. Scatter points show the individual discrepancies or relative discrepancies, and boxplots show their inter-quartile range in each landscape. Some ground-based estimates have uncertainty ranges. If global model estimates fall within this range the discrepancy and relative discrepancy is zero.
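The discrepancy definitions given in the figure caption (model minus ground-based estimate, set to zero inside a reported uncertainty range, and expressed relative to the ground-based estimate) can be written as a short sketch. The function and argument names are illustrative, and labelling a zero discrepancy as "within range" is a simplification for the example.

```python
def discrepancy(model, ground, lower=None, upper=None):
    """Global model estimate minus ground-based estimate.

    If the ground-based estimate comes with an uncertainty range
    [lower, upper] and the model estimate falls inside it, the
    discrepancy is zero.
    """
    if lower is not None and upper is not None and lower <= model <= upper:
        return 0.0
    return model - ground


def relative_discrepancy(model, ground, lower=None, upper=None):
    """100 x (model - ground) / ground; requires ground > 0."""
    return 100.0 * discrepancy(model, ground, lower, upper) / ground


def classify(model, ground, lower=None, upper=None):
    """Label a comparison as under-, overestimate, or within range."""
    d = discrepancy(model, ground, lower, upper)
    if d < 0:
        return "underestimate"
    if d > 0:
        return "overestimate"
    return "within range"
```

For example, a model estimate of 30 mm/year against a ground-based estimate of 50 mm/year gives a discrepancy of −20 mm/year and a relative discrepancy of −40 %, whereas a model estimate of 45 mm/year against the same ground-based estimate with a 40-60 mm/year uncertainty range counts as zero.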
Global models which reflect the strongest levels of climate controls on their recharge outputs agree with ground-based estimates more than the other models. Recharge outputs from H08, MATSIRO, and WaterGAP 2 consistently reflect more climate controls than the other models and show lower average discrepancies. The median absolute discrepancies in annual recharge (recharge ratio) for these three models are 0.36 mm/year (0), 1.56 mm/year (0.006) and 2.9 mm/year (0.006), respectively, in contrast to between 5 mm/year and 10 mm/year (0.016 and 0.03) for the other models. This agrees with previous empirical analyses of ground-based recharge estimates by MacDonald et al. (2021) and West et al. (2022), who both found climate to be the dominant control on spatial variability in Africa. Moreover, Zaherpour et al. (2018) found these three models performed better than others when estimating mean annual runoff at 40 catchments distributed throughout the world. This potentially suggests the greater plausibility of these model structures for long-term hydrological partitioning. Median absolute discrepancies for each model are generally low as >70 % of datapoints are in desert or dryland landscapes. Interestingly, the median absolute discrepancy in annual recharge (recharge ratio) estimates for the ensemble mean of the models is 1.11 mm/year (0.005), which is similar to H08, MATSIRO and WaterGAP 2.
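The comparison between individual models and the ensemble mean can be reproduced in outline as below. This is a simplified sketch: it ignores the uncertainty ranges of the ground-based estimates, the model names `model_a` and `model_b` are placeholders, and all numbers are invented rather than taken from the study.

```python
import statistics


def median_abs_discrepancy(model_values, ground_values):
    """Median of |model - ground| over all ground-based sites."""
    return statistics.median(
        abs(m - g) for m, g in zip(model_values, ground_values)
    )


def ensemble_mean(per_model):
    """Site-by-site mean across models.

    per_model maps a model name to its estimates at the same sites,
    listed in the same order for every model.
    """
    return [statistics.fmean(values) for values in zip(*per_model.values())]


# Invented annual recharge values (mm/year) at three sites.
ground = [10, 20, 30]
per_model = {"model_a": [8, 20, 35], "model_b": [14, 26, 27]}

scores = {name: median_abs_discrepancy(vals, ground)
          for name, vals in per_model.items()}
scores["ensemble_mean"] = median_abs_discrepancy(ensemble_mean(per_model), ground)
```

In this invented case the two models err in opposite directions at most sites, so averaging them yields a lower median absolute discrepancy than either model alone; whether that happens in practice depends on how the individual model errors are distributed.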
Based on these findings, we suggest that predicting groundwater recharge across continental scales for research or operational management should not rely on one specific model (Fletcher et al., 2019). No model consistently compares well with ground-based estimates in a given landscape and therefore we cannot rely upon individual global models to estimate recharge. If groundwater governance in wet landscapes depends on recharge estimates from one global-scale model, this could potentially lead to large over or under-utilization of groundwater resources, as absolute discrepancies between global models and ground-based estimates are often large in these settings. Although absolute discrepancies between global model and ground-based estimates are often small in dry and desert landscapes, these discrepancies are often large relative to the ground-based estimate and to the total available water resources of the region. It is perhaps not surprising that global models are unreliable in wet tropical regions as so few studies have investigated recharge processes in these settings (Mohan et al., 2018; Moeck et al., 2020; MacDonald et al., 2021). Nonetheless, recharge studies are much more common in dry settings, and this is still not leading towards improved recharge estimation, in relative terms, by global-scale models. This highlights the need to advance recharge estimation by global-scale models across a wide range of environmental settings. Interestingly, the performance in absolute terms of the ensemble mean of the global-scale models is similar to that of the individually better performing models (H08, MATSIRO and WaterGAP 2), for both annual recharge and recharge ratio. This is perhaps not surprising as these models reflect the greatest level of climate controls, and by taking the ensemble mean across all the global-scale models, we essentially de-emphasize how individual models partition water at the land surface and instead emphasize the climate forcing.
However, it is still unclear how we can improve recharge estimation by global-scale models, as their discrepancies from ground-based estimates are inconsistent, with both over- and under-estimates by models in a given landscape. If biases were systematically either positive or negative, this could help guide further diagnostic model evaluation considering other hydrologic fluxes (Niraula et al., 2017), but this is not the case. Previous comparison studies have shown that developing consistent modelling protocols that harmonise model inputs, parameter estimation, initial conditions and so on is very difficult even for simple models (Ceola et al., 2015). In the long run, modular modelling approaches will enable a better implementation of different model structures for global hydrologic models. Such strategies have been widely used for lumped hydrologic models at the catchment scale (e.g. Clark et al., 2008; Knoben et al., 2019; Leavesley et al., 2002; Wagener et al., 2001). However, developing such frameworks for distributed global-scale models will not be straightforward and would best be done as a large-scale collaboration between different modelling groups. Successful implementation would exceed the levels of diagnostic analysis currently possible in existing model intercomparison projects, such as ISIMIP. Further still, if this modular framework facilitated spatially variable model structures, this would allow the simultaneous implementation of landscape-specific models (Hartmann et al., 2017; Quichimbo et al., 2021), which may be more plausible than models which uniformly apply the same model structure over entire continents or globally. Developing a sensible separation of the landscape for specific models and capturing the expert understanding of how these different systems function is an interesting current challenge.
Using expert knowledge to develop more plausible models is particularly important, as even a direct calibration of models to available data may not improve recharge estimation due to sparsity of recharge data (see Section 2 of the supplemental information for further discussion).

Conclusions
We set out to examine where and how global-scale model estimates of groundwater recharge agree or disagree with one another and with ground-based estimates across the African continent. We did so using the outputs of eight global models that were previously run within the ISIMIP framework (Frieler et al., 2017;Reinecke et al., 2021) and over 100 ground-based estimates of recharge compiled from the literature (MacDonald et al., 2021).
We found that global-scale model estimates of long-term mean annual recharge rates and recharge ratio disagree significantly throughout much of Africa. In absolute terms (using standard deviation), models disagree more in wet tropical regions and agree more in dry regions, whilst in relative terms (using coefficient of variation) the opposite is true. However, absolute model disagreement for recharge ratios is also high throughout the Sahara Desert, though this is mostly attributed to surprisingly high estimates by PCR-GLOBWB and CWatM in this region. When investigating controls on global-scale model recharge outputs for Africa using Random Forests, we found that climate controls on average explain 75 % and 68 % of the spatial variability of annual recharge and recharge ratio estimates, respectively. However, climate controls only explained approximately 60 % of the spatial variability in recharge estimates from PCR-GLOBWB and CWatM, which is significantly less than the 82 % reported by MacDonald et al. (2021) in their empirical analysis. H08, MATSIRO and WaterGAP 2 reflected the strongest level of climate control on their recharge outputs and showed the greatest similarity to ground-based estimates. The ensemble mean of all eight global-scale models performed similarly to these three best-performing models.
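The two measures of model disagreement referred to above can be sketched as follows for a single grid cell. The cell values are hypothetical, and the use of the sample standard deviation is an assumption of this example rather than a statement of the exact estimator used in the analysis.

```python
from statistics import mean, stdev


def model_disagreement(estimates):
    """Return (standard deviation, coefficient of variation) across models.

    Uses the sample standard deviation (an assumption of this sketch).
    The coefficient of variation is NaN when the ensemble mean is zero.
    """
    spread = stdev(estimates)
    centre = mean(estimates)
    cv = spread / centre if centre != 0 else float("nan")
    return spread, cv


# Hypothetical recharge estimates (mm/year) from four models in two grid cells.
sd_dry, cv_dry = model_disagreement([1.0, 2.0, 3.0, 6.0])
sd_wet, cv_wet = model_disagreement([100.0, 150.0, 200.0, 150.0])

print(f"dry cell: SD = {sd_dry:.2f} mm/year, CV = {cv_dry:.2f}")
print(f"wet cell: SD = {sd_wet:.2f} mm/year, CV = {cv_wet:.2f}")
```

In this toy case the wet cell has the larger standard deviation while the dry cell has the larger coefficient of variation, mirroring the contrast between absolute and relative disagreement summarised above.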
Our work adds further evidence to previous studies which showed that the robustness of global hydrologic model simulations can vary considerably, suggesting that we should be aware of these problems when utilizing these model outputs for subsequent studies. Studies regularly use the output of such models (and often just one of them) as input to their follow-up analysis, with only an occasional caveat in the discussion section about potential limitations and robustness issues in the model predictions.
Rather, highlighting where model robustness is low and where models deviate from our current perception of hydrologic systems are important strategies to guide future research towards the areas with the greatest knowledge gaps.

Data availability
Data will be made available on request.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements
This work was funded as part of the Water Informatics Science and Engineering Centre for Doctoral Training (WISE CDT) under a grant from the Engineering and Physical Sciences Research Council (EPSRC), grant number EP/L016214/1. MOC gratefully acknowledges funding for an Independent Research Fellowship from the UK Natural Environment Research Council (NE/P017819/1). Support for RR was partially provided by the International Atomic Energy Agency (CRP D12014). Support for TW and RR was provided by the Alexander von Humboldt Foundation in the framework of the Alexander von Humboldt Professorship endowed by the German Federal Ministry of Education and Research. We would also like to thank the ISIMIP project for supplying the simulated recharge data and the ISIMIP modelling community for their contributions towards this project.

Table A1
List of environmental attributes used in analysis organised by predictor groups (i.e., climate, landcover, topography, soils, geology), along with information about the global datasets used to characterise attributes.   (Williams and Ford, 2006)