Comparison of multimodel ensembles of global and regional climate models projections for extreme precipitation over four major river basins in southern Africa— assessment of the historical simulations

This study assesses the performance of large ensembles of global (CMIP5, CMIP6) and regional (CORDEX, CORE) climate models in simulating extreme precipitation over four major river basins (Limpopo, Okavango, Orange, and Zambezi) in southern Africa during the period 1983–2005. The ability of the model ensembles to simulate seasonal extreme precipitation indices is assessed using three high-resolution satellite-based datasets. The results show that all ensembles overestimate the annual cycle of mean precipitation over all basins, although the intermodel spread is large, with CORDEX being the closest to the observed values. Generally, all ensembles overestimate the mean and interannual variability of rainy days (RR1), maximum consecutive wet days (CWD), and heavy and very heavy precipitation days (R10mm and R20mm, respectively) over all basins during all three seasons. Simple daily rainfall intensity (SDII) and the number of consecutive dry days (CDD) are generally underestimated. The lowest Taylor skill scores (TSS) and spatial correlation coefficients (SCC) are depicted for CDD over Limpopo compared with the other indices and basins, respectively. Additionally, the ensembles exhibit the highest normalized standard deviations (NSD) for CWD compared to other indices. The intermodel spread and performance of the RCM ensembles are lower and better, respectively, than those of GCM ensembles (except for the interannual variability of CDD). In particular, CORDEX performs better than CORE in simulating extreme precipitation over all basins. Although the ensemble biases are often within the range of observations, the statistically significant wet biases shown by all ensembles underline the need for bias correction when using these ensembles in impact assessments.


Introduction
Most extreme climate events, especially those related to precipitation, have profound effects on human populations, ecosystems, and the environment. In many parts of the globe, substantial changes in the intensity and frequency of extreme precipitation events have already been reported (Wan et al. 2021; Seneviratne et al. 2021). Because of poor adaptive capacity due to restricted access to climate-related information, technology, finance, and capital assets, developing nations are particularly vulnerable to the effects of precipitation extremes, including floods and droughts (Stephenson et al. 2010; Sylla et al. 2016; Yaduvanshi et al. 2021; Abiodun et al. 2020; Akinsanola et al. 2021). This is particularly evident in southern African nations; for instance, during the 2015-16 rainy season in southern Africa, a severe drought and subsequent dry spells caused widespread crop failure, which resulted in severe food insecurity in the region (approximately 40 million people required humanitarian support; SADC 2016). Recently, thousands of people have died, millions of people have been forced to evacuate, and infrastructure has been destroyed by heavy precipitation events caused by tropical cyclones (e.g., cyclones Idai in 2019, Chalane in 2020, and Eloise in 2021).
River basins are not exempt from the effects of climate change, and they contribute substantially to socioeconomic development (Jain and Singh 2020). For instance, agricultural production was weakened over the Zambezi basin during droughts that occurred during the rainy seasons of 1991-1992 and 1994-1995 (SADC-WD/ZRA 2008). In addition, river levels in sub-Saharan Africa were very low in 2019, reducing the water available in Kariba to run the hydropower plant to 10% (Hulsman et al. 2021). Climate change and human population growth are expected to impose more stress on the ecology of the Zambezi basin (SADC-WD/ZRA 2008). The Limpopo and Orange basins are also expected to experience significant desiccation and a decline in water levels due to changes in precipitation and temperature (Mitchell 2013).
The construction of climate information, especially when relevant for decision-making at local and regional scales, must be based on multiple lines of evidence, including but not limited to the analysis of the results of different classes of climate models (Doblas-Reyes et al. 2021). In fact, using different classes of model ensembles is very useful for detecting areas of disagreement and agreement in climate information across various ensembles (Dosio et al. 2021a;Doblas-Reyes et al. 2021).
The World Climate Research Programme (WCRP) launched several coordinated programs to provide historical and future climate projections using large ensembles of global climate models (GCMs) and regional climate models (RCMs). The most prominent among these coordinated programs are Coupled Model Intercomparison Project Phase 5 (CMIP5; Taylor et al. 2012), Phase 6 (CMIP6; Eyring et al. 2016), and the Coordinated Regional Climate Downscaling Experiment (CORDEX; Giorgi and Gutowski 2015). CMIP experiments provide historical and future climate projections from a large ensemble (approximately 30) of GCMs (Luo et al. 2022). CMIP6 was launched as an improvement to CMIP5, particularly due to improved physical processes, parameterizations, increased spatial resolutions, and additional biogeochemical processes (Eyring et al. 2016). However, for regionally and locally tailored impact assessments, high-resolution climate projections are required (Doblas-Reyes et al. 2021). The simulation of regional phenomena, especially those impacted by complex topography, land use heterogeneity, coastal lines, and mesoscale convection, is often poor in GCMs because of their low horizontal resolution (typically on the order of a hundred kilometers or more). To this extent, although dynamic downscaling does not always add value, compared to GCMs, in the simulation of mean quantities (e.g., Dosio et al. 2015), RCMs improve the simulation of precipitation characteristics, especially for extreme events (e.g., Gibba et al. 2019).
Under the CORDEX initiative, RCMs were used to dynamically downscale several CMIP5 GCMs to an ~ 50 km horizontal resolution over several domains (Giorgi et al. 2021). More recently, the CORDEX-CORE (Coordinated Output for Regional Evaluations) initiative was launched, aiming at producing climate projections in a more homogeneous framework, where all participating RCMs were required to downscale the same set of driving GCMs (in contrast to CORDEX, where the choice of GCMs was left to the individual RCM modeling groups). Additionally, to make the CORE results more suitable for application in impact studies, the horizontal resolution was set twice as high as that of CORDEX (~ 25 km).
Climate projections from CMIP and CORDEX are an important tool for generating information on climate change, which is needed for policymaking and developing adaptation strategies. However, before their use, it is essential to validate the performance of climate models over a historical reference period, especially when climate simulations are used as inputs to impact models. Most studies evaluating the ability of CMIP, CORDEX, and CORE to simulate extreme precipitation over Africa have focused on continental or regional scales (Pinto et al. 2016; Abiodun et al. 2017; Gibba et al. 2019; Abiodun et al. 2020; Dosio et al. 2021a; Ogega et al. 2020; Faye and Akinsanola 2022; Akinsanola et al. 2021; Ayugi et al. 2021; Dosio et al. 2022a, b). Additionally, most of these studies are based on climate models from one or two coordinated projects or use a limited subset of the model ensembles. Recently, studies have been conducted in Africa using ensembles of CMIP5, CMIP6, CORDEX, and CORE. For instance, extreme precipitation from large CMIP (approximately 30 CMIP5 and CMIP6 models), CORDEX (24), and CORE (9) simulations was compared by Dosio et al. (2021a) over Africa, but their analysis was focused on future projections. Focusing on southern Africa, Karypidou et al. (2022) assessed the performance of CMIP5, CMIP6, CORDEX, and CORE ensembles in simulating mean and extreme precipitation. However, their findings were confined to a relatively small subset of the CMIP ensembles (13 CMIP5 and 8 CMIP6 models), and the study's main emphasis was on mean precipitation.
Several studies have evaluated how well GCMs and RCMs can simulate extreme rainfall in African river basins. Diatta et al. (2020) investigated the Rossby Center Regional Climate Model's (RCA4) ability to simulate extreme precipitation over the Casamance river basin. Salaudeen et al. (2021) evaluated the CMIP5 GCMs' ability to reproduce extreme precipitation in the Gongola Basin. Agyekum et al. (2022) investigated the performance of CMIP6 in simulating extreme precipitation over the Volta Basin. Samuel et al. (2022) evaluated CORE's capacity to reproduce Zambezi's extreme precipitation.
Although prior studies have evaluated climate models' capacity to reproduce extreme precipitation over southern Africa, to our knowledge, no studies have compared the seasonal performance of large ensembles over southern Africa and its major river basins. In this study, we investigate how well GCM (CMIP5 and CMIP6) and RCM (CORDEX and CORE) ensembles can reproduce observed extreme precipitation during the rainy season. This study focuses on DJF and the transitional seasons (SON and MAM). These seasons are chosen because of their impact on southern Africa's rain-fed agriculture. Understanding how climate model simulations perform during these seasons is vital for southern African policymakers and climate information users. We employed six indices to characterize mean and extreme precipitation as established by the Expert Team on Climate Change Detection and Indices (ETCCDI, Zhang et al. 2011), focusing on indices for identifying excessive dryness and moderately wet conditions. The results of this research provide information that is beneficial to the scientific and user communities, especially regarding the use of these ensembles as inputs in impact models. The remainder of this paper is organized as follows: Sect. 2 presents the study area, data, and methods. The results and discussion are presented in Sect. 3, and a summary and concluding remarks are presented in Sect. 4.

Definition of the study area and sub-regions
In this study, we define southern Africa as the region that lies between 10-35°S and 10-40°E, focusing on four major river basins (the Limpopo, Okavango, Orange, and Zambezi basins), as shown in Fig. 1. In fact, major economic activities, such as agriculture and power production, occur within the basins, making them vital in socioeconomic activities across the region (Abiodun et al. 2019).

Observational data
Several studies (Gibba et al. 2019; Abiodun et al. 2020; Dosio et al. 2021b; Hamadalnel et al. 2022; Olusegun et al. 2022) have highlighted the lack of reliable high-quality in situ datasets with spatiotemporal coverage suitable for model evaluation as a key challenge in Africa. Despite the considerable differences between merged satellite and gauged station data, they are widely used as references for model evaluation over Africa (e.g., Abiodun et al. 2020; Ayugi et al. 2021; Klutse et al. 2021; Samuel et al. 2022). Discrepancies among the observations make it difficult to choose a specific dataset as a reference for model evaluation. Therefore, the mean of multiple observational datasets has been used as a reference for model evaluation in previous studies (e.g., Abiodun et al. 2020; Wan et al. 2021; Karypidou et al. 2022; Ilori and Balogun 2021).
Fig. 1 Topography of southern Africa. The black lines represent the four major river basins (Limpopo, Okavango, Orange and Zambezi) in southern Africa
In this study, we used gridded data based on merged satellite and gauge observations. In particular, we used daily precipitation datasets obtained from the Climate Hazards Group Infrared Precipitation with Station data (CHIRPS version 2, Funk et al. 2015), with a spatial resolution of 0.05° × 0.05°, the Tropical Applications of Meteorology using SATellite and ground-based observations (TAMSAT version 3.1, Maidment et al. 2017), with a spatial resolution of 0.04° × 0.04°, and the African Rainfall Climatology (ARC version 2, Novella and Thiaw 2013) from the Famine Early Warning System, with a spatial resolution of 0.1° × 0.1°. These datasets have been evaluated against gauge stations over southern Africa and have demonstrated better performance compared to other existing gridded observational data over the region (Maidment et al. 2017). Their high spatial resolution and superior performance over southern Africa make these datasets suitable for climate model evaluation, particularly over small regions such as river basins.

Climate model simulations
In this study, we used historical daily precipitation simulations from both global (CMIP5, CMIP6) and regional (CORDEX, CORE) climate models. Tables S1-4 provide a list of the models and their basic descriptions. In particular, we used 30 simulations from CMIP5, 26 simulations from CMIP6, 25 simulations based on six RCMs downscaling 13 CMIP5 GCMs under the CORDEX experiment, and 9 simulations based on three RCMs downscaling three CMIP5 GCMs under the CORE experiment, obtained from the Earth System Grid Federation (ESGF) servers. The models are selected based on the availability of both historical and future (SSP5-8.5 for CMIP6 and RCP 8.5 for CMIP5, CORDEX, and CORE) simulations at the time of writing. The models selected here are used to project future changes in the second part of our study. To simplify the evaluation, we used simulations of one ensemble member for each model.

Extreme precipitation indices
This study analyzes six extreme precipitation indices (Table 1) as defined by the Expert Team on Climate Change Detection and Indices (ETCCDI) (Zhang et al. 2011). We used Climate Data Operators (CDO, https://code.zmaw.de/projects/cdo) to compute all the indices. The selected indices provide information on the present and future characteristics of both wet and dry conditions in terms of intensity and duration. These indices have been widely used to define extreme precipitation (Gibba et al. 2019; Akinsanola et al. 2021; Zhu et al. 2021a, b; Abiodun et al. 2020; Dosio et al. 2021a; Ayugi et al. 2021; Yao et al. 2021; Dike et al. 2022; Luo et al. 2022; Samuel et al. 2022). The indices we used can be classified into three categories: duration indices, frequency indices, and intensity indices (Table 1). We analyze the indices for each year during December-January-February (DJF), March-April-May (MAM), and September-October-November (SON) from each observational dataset and climate model simulation on their native grids.
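As an illustration of what the six indices measure, the following is a minimal Python sketch for one season of daily precipitation at a single grid point. It is not the CDO workflow used in this study. Table 1 is not reproduced here, so the thresholds (a wet day is a day with at least 1 mm, and R10mm and R20mm count days with at least 10 mm and 20 mm) are assumed to follow the standard ETCCDI definitions, and the function names are hypothetical.

```python
import numpy as np

WET_DAY_MM = 1.0  # assumed ETCCDI wet-day threshold (mm/day)


def longest_run(mask):
    """Length of the longest run of consecutive True values in a 1-D sequence."""
    best = current = 0
    for flag in mask:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def seasonal_indices(pr):
    """Six precipitation indices for one season of daily totals (mm/day)."""
    pr = np.asarray(pr, dtype=float)
    wet = pr >= WET_DAY_MM
    n_wet = int(wet.sum())
    return {
        "CDD": longest_run(~wet),            # longest dry spell (days)
        "CWD": longest_run(wet),             # longest wet spell (days)
        "RR1": n_wet,                        # number of rainy days
        "R10mm": int((pr >= 10.0).sum()),    # heavy precipitation days
        "R20mm": int((pr >= 20.0).sum()),    # very heavy precipitation days
        # mean intensity on wet days; undefined if the season has no wet day
        "SDII": float(pr[wet].sum() / n_wet) if n_wet else float("nan"),
    }


example = seasonal_indices([0, 0, 5, 12, 25, 0.5, 0, 0, 0, 3])
```

In this sketch a spell that crosses the season boundary is truncated at the boundary; whether the operational computation treats such spells the same way is not stated in the text.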

Evaluation methods
The performance of the CMIP5, CMIP6, CORDEX, and CORE simulations in representing historical extreme precipitation indices over southern Africa is evaluated for 23 years (1983–2005), a period common to both the observations and the climate model simulations. The spatial resolution differs across the individual models for CMIP5 and CMIP6. Although individual simulations for CORDEX and CORE are available on 0.5° and 0.25° grids, respectively, the grid types differ across climate models. Hence, for CMIP5 and CMIP6, the indices are regridded onto a 1.32° × 1.32° grid using the bilinear method, while for CORDEX and CORE, they are remapped to a common grid type (latitude × longitude) at their original resolution. The equal-weighted method is used for computing multimodel ensemble means (MMEs). We acknowledge that the equal-weighted technique used here is limited because models generated by the same institute or GCMs downscaled by the same RCM may have similar structural biases. This method has been used by most studies dealing with ensembles of climate models over different regions of the world, including the "Africa-box" in the recent IPCC Special Report on 1.5 °C warming (Hoegh-Guldberg et al. 2018) and AR6 (Gutiérrez et al. 2021). Weigel et al. (2010) found that, for many applications, equal weighting may be the more transparent way to combine models and is preferable to a weighting that does not appropriately represent the true underlying uncertainties, as "optimum weighting" requires both accurate knowledge of the single model skill and the relative contributions of the joint model error and unpredictable noise; both issues are still open to discussion.
To compute the mean of the three observations (OBSE) and evaluate the MMEs using statistical methods, we regridded the indices for each observation onto the corresponding MME grid using a conservative method.
The mean bias (MB; MME minus OBSE) is used to evaluate the performance of the MMEs in reproducing the spatial distribution of the magnitude of extreme precipitation. To assess the statistical significance of the bias, we used the method defined by Dosio et al. (2021a), which is similar to that developed for the Intergovernmental Panel on Climate Change (IPCC) 6th Assessment Report (AR6, Gutiérrez et al. 2021). Briefly, the bias of each simulation is considered statistically significant if its absolute value is greater than the interannual variability of the observations (regardless of the sign of the bias). The interannual variability is defined as γ = √(2/23) × 1.645 × σ, where σ is the standard deviation of the linearly detrended annual time series of the observations. If more than 66% of the individual models exhibit significant bias and more than 66% of those models agree on the sign of the bias, the bias of the MMEs is deemed significant. If more than 66% of simulations show significant bias but less than 66% agree on its sign, the bias of the MMEs is deemed conflicting.
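The significance test described above can be sketched at a single grid point as follows. This is a minimal illustration under stated assumptions, not the implementation of Dosio et al. (2021a): the function names are hypothetical, σ is taken as the sample standard deviation (ddof=1) of the residuals from a least-squares linear fit, the factor √(2/23) is written as √(2/N) with N the number of years, and the sign-agreement fraction is computed among the models with significant bias.

```python
import numpy as np


def variability_threshold(obs_series, z=1.645):
    """gamma = sqrt(2/N) * z * sigma, sigma from the linearly detrended series."""
    obs = np.asarray(obs_series, dtype=float)
    years = np.arange(obs.size)
    slope, intercept = np.polyfit(years, obs, 1)   # least-squares linear trend
    residuals = obs - (slope * years + intercept)
    sigma = residuals.std(ddof=1)                  # assumption: sample SD
    return np.sqrt(2.0 / obs.size) * z * sigma


def classify_ensemble_bias(model_means, obs_series, agreement=0.66):
    """Label the ensemble bias 'significant', 'conflicting' or 'not significant'."""
    obs = np.asarray(obs_series, dtype=float)
    biases = np.asarray(model_means, dtype=float) - obs.mean()
    significant = np.abs(biases) > variability_threshold(obs)
    if significant.mean() <= agreement:
        return "not significant"
    # assumption: sign agreement is evaluated among the significant models only
    signs = np.sign(biases[significant])
    same_sign = max((signs > 0).mean(), (signs < 0).mean())
    return "significant" if same_sign > agreement else "conflicting"


# 23 synthetic "observed" seasonal values, for illustration only
obs_demo = 10.0 + np.array([(-1.0) ** i for i in range(23)])
label = classify_ensemble_bias([12.0, 12.0, 12.0, 12.0], obs_demo)
```

Applied to every grid cell, the three labels correspond to the significant, conflicting, and non-significant areas of a bias map.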
We further evaluated the performance of the model ensembles in simulating the observed extreme precipitation indices spatially averaged over four major river basins in southern Africa. The Taylor diagram (Taylor 2001) is used to evaluate the ability of the models to reproduce the observed spatial patterns of extreme precipitation; it summarizes three statistics (spatial correlation coefficient: SCC; standard deviation: SD; and root mean square error: RMSE). The Taylor skill score (TSS; Wang et al. 2018) is used to further quantify the similarities between the ensembles and the observations, and the skill of the ensembles is also quantified using the Kling-Gupta efficiency (KGE; Gupta et al. 2009). In these scores, PC is the SCC between the OBSE and the models, and PC_0 is the maximum SCC (here, we used 1); σ_o and σ_m are the SDs of the OBSE and models, respectively; and μ_o and μ_m are the means of the OBSE and models, respectively.
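For readers who want to reproduce the two scores, the sketch below uses forms that are common in the literature: TSS = 4(1 + PC)^2 / [(σ_m/σ_o + σ_o/σ_m)^2 (1 + PC_0)^2] and KGE = 1 − sqrt[(PC − 1)^2 + (σ_m/σ_o − 1)^2 + (μ_m/μ_o − 1)^2]. Both are assumptions here; in particular, the exponent in the TSS differs between references, so it should be checked against Wang et al. (2018) before use. The function names are hypothetical, and the inputs are assumed to be co-located fields without missing values and without area weighting.

```python
import numpy as np


def _flatten(field):
    """Return a field as a flat float array."""
    return np.asarray(field, dtype=float).ravel()


def taylor_skill_score(model, obs, pc0=1.0):
    """TSS in the assumed form 4(1+PC)^2 / ((s + 1/s)^2 (1+PC0)^2), s = sd_m/sd_o."""
    m, o = _flatten(model), _flatten(obs)
    pc = np.corrcoef(m, o)[0, 1]          # spatial correlation coefficient
    s = m.std() / o.std()                 # ratio of standard deviations
    return 4.0 * (1.0 + pc) ** 2 / ((s + 1.0 / s) ** 2 * (1.0 + pc0) ** 2)


def kling_gupta_efficiency(model, obs):
    """KGE = 1 - sqrt((PC-1)^2 + (sd_m/sd_o - 1)^2 + (mean_m/mean_o - 1)^2)."""
    m, o = _flatten(model), _flatten(obs)
    pc = np.corrcoef(m, o)[0, 1]
    alpha = m.std() / o.std()             # variability ratio
    beta = m.mean() / o.mean()            # bias ratio
    return 1.0 - np.sqrt((pc - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)


obs_field = np.array([1.0, 2.0, 3.0, 4.0])
tss_demo = taylor_skill_score(2.0 * obs_field, obs_field)
kge_demo = kling_gupta_efficiency(2.0 * obs_field, obs_field)
```

Both scores equal 1 for a perfect match. The TSS is bounded between 0 and 1, whereas the KGE has no lower bound and also penalizes a bias in the mean.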
The standard deviation (SD) of basin-averaged time series is used to evaluate the ability of the models to represent the magnitude of the observed interannual variability of each extreme precipitation index. The SD has been used to assess interannual variability in previous studies (e.g., Rajendran et al. 2022;Dosio et al. 2022a, b) and the IPCC AR6 report (Gutiérrez et al. 2021).

Results and discussion
The main focus of this section is on the performance of multimodel ensembles in reproducing extreme precipitation, with a brief evaluation of the monthly mean precipitation annual cycle. Figure 2 shows the annual cycles of monthly averaged daily precipitation for the OBSE and multimodel ensemble means (MMEs) for CMIP5, CMIP6, CORDEX, and CORE. The results have been spatially averaged over the four major river basins of southern Africa (Limpopo, Okavango, Orange, and Zambezi) shown in Fig. 1. Generally, all MMEs can reproduce the annual cycle of precipitation over all four river basins (Fig. 2). This is consistent with the findings of Karypidou et al. (2022) over southern Africa and Dosio et al. (2021a) over western southern Africa and eastern southern Africa. Although MMEs generally capture the temporal evolution of the precipitation cycle, they overestimate precipitation over all basins, especially between November and March. CMIP6 (CORDEX) exhibits the largest (smallest) biases over all basins except the Orange basin. Similar to the findings of previous studies (Lim Kam Sian 2022; Karypidou et al. 2022), the wet biases of the MMEs are larger during the peak of the rainy season (DJF). Generally, the intermodel spread is very large, with larger uncertainties for CMIP5 than for the other ensembles. More specifically, CMIP5 (CORDEX) shows the largest (smallest) intermodel uncertainty over the Okavango and Orange (Limpopo and Zambezi) basins, while CMIP6 (CORE) shows the smallest (largest) intermodel uncertainty over the Limpopo (Zambezi) basin. The performance of the RCM MMEs is better than that of the GCM MMEs, especially during the peak (DJF) of precipitation. CORDEX shows good agreement with the observed precipitation peak during DJF over the Zambezi basin.
In agreement with a previous study (Karypidou et al. 2022), CORDEX and CORE MMEs perform better than CMIP5 and CMIP6, which shows the added value of downscaling in simulating annual precipitation cycles over the four basins. However, the better performance of CORDEX compared with CORE shows that, in addition to resolution, the physical configuration of the models plays a critical role in improving the performance of climate models in simulating precipitation. Similar findings were reported by Wu et al. (2020). For instance, the better performances in CCLM and REMO might be associated with improvements in the convective scheme under CORE (Tiedtke with modifications) compared to under CORDEX (Tiedtke). In fact, Olusegun et al. (2022) noted that the modified Tiedtke cumulus convection scheme is more suitable for West Africa than cumulus convection. Figure 6 in Panitz et al. (2014) shows that increasing the resolution of CCLM from ~50 to ~25 km has no impact on the simulation of annual precipitation cycles over southern Africa.

Daily extreme indices
Here, we present the results of the performance of the ensembles in simulating extreme precipitation indices during the main rainy season (DJF) only, with results for other seasons (SON and MAM) available as supplementary materials (Figs. S1-S10). Figure 3 shows the spatial distribution of the climatological biases of CDD, CWD, and RR1 from CMIP5, CMIP6, CORDEX, and CORE MMEs. Similar maps for the MAM and SON seasons are shown in Figs. S1 and S2, respectively. The results show that all the ensembles tend to significantly underestimate the observed values of CDD over most of southern Africa (Fig. 3d-g). In particular, all ensembles largely underestimate CDD values over most of the region, with negative biases of more than 10 days over coastal Angola, the northeastern Orange basin, and the eastern Limpopo basin. However, CORDEX and CORE overestimate CDD values by up to 6 days over the eastern Zambezi basin, northern Mozambique, and in some areas over southwestern coastal South Africa and the southwestern Orange basin. CMIP5 overestimates CDD by 6 days over northern Mozambique and in some areas over the northeastern Zambezi basin, southwestern Orange basin, and southwestern coastal South Africa. The MME biases for CDD are smaller over the Zambezi basin than over the other three basins. Similar to the DJF season, all the ensembles significantly underestimate CDD over all basins during MAM (Fig. S1d-g). In contrast, during SON, the ensembles exhibit larger areas of overestimation of CDD (Fig. S2d-g). For instance, during SON, the ensembles exhibit larger areas of statistically significant overestimation (underestimation) of CDD values over the Orange basin (northern Mozambique and southern Tanzania). Overall, for CDD, the ensembles exhibit smaller biases over the Zambezi basin than over the other three basins.
All the ensembles tend to overestimate CWD (Fig. 3h-k) and RR1 (Fig. 3l-o). Overall, CORDEX has better performance than CMIP5, CMIP6, and CORE in simulating CWD over southern Africa. The greater extent of overestimation of CWD in CORE than in CORDEX could be associated with a greater moisture supply in CORE than in CORDEX. Pinto et al. (2016) associated the wet biases in CORDEX RCMs with poor representation of atmospheric circulation patterns such as the Angola low and hence increased moisture input from the Atlantic Ocean. In contrast to CDD, MME biases are larger over the Zambezi basin compared to the other three basins for CWD, especially in CMIP5, CMIP6, and CORE. The largest positive biases of RR1 are located in southeastern Zimbabwe, the southern Okavango basin, and the eastern Orange and Limpopo basins (Figs. 3, S1, and S2). Similar to previous studies (Abiodun et al. 2020; Karypidou et al. 2022; Luo et al. 2022), larger biases over the Drakensberg Mountains illustrate the influence of complex topography on precipitation. Furthermore, the lower biases in the RCM MMEs in this region demonstrate their ability to resolve fine-scale regional processes better than GCMs (e.g., complex topography; Mishra et al. 2014; Karypidou et al. 2022). CMIP6 shows a larger overestimation of RR1 over a larger area than CMIP5. For instance, CMIP6 exhibits an overestimation of RR1 by 24 days over large areas of southeastern Zimbabwe and the Okavango and Limpopo basins (eastern coastal areas of southern Africa) during DJF (MAM) (Figs. 3 and S1). Overall, RCM MMEs demonstrate better performance than GCM MMEs in simulating CWD and RR1, with slightly better performance in CORDEX. The better performance of CORDEX and CORE compared with CMIP5 and CMIP6 could be a result of their ability to represent topography better than GCMs and hence to better capture northerly moisture transport into southern Africa (Munday and Washington 2018; Karypidou et al. 2022).
However, the smaller number of ensemble members in CORE compared to CORDEX and the larger overestimation of RR1 and CWD in RegCM simulations may partly be responsible for the slight underperformance of CORE compared to CORDEX.

Spatial distribution of biases
The spatial distributions of the climatological biases of SDII, R10mm, and R20mm are shown in Fig. 4. All ensembles underestimate the observed SDII values over southern Africa, except over a few areas where they slightly overestimate SDII. A common region with the largest underestimation of SDII values is shown in all MMEs. In particular, all MMEs exhibit a negative bias of more than 6 mm/day over southern coastal Mozambique and the southern and eastern Limpopo basin (Fig. 4d-g). Conversely, all MMEs show larger areas of slight overestimation over the southeastern Orange basin. Additionally, the CORDEX (CORE) MME shows areas of slight overestimation of SDII over southern Angola, north-coastal Namibia, and the southeastern Limpopo basin (southern Angola, the northeastern Okavango basin, and most of the Orange basin). A comparison of the three seasons shows that all MMEs exhibit larger biases during MAM than during DJF and SON, particularly over southern coastal Mozambique, the Limpopo basin, and the northern Orange basin (Figs. 4, S3, and S4). It is important to note that GCM and RCM MMEs show similarities in the magnitude of biases for SDII. CMIP5 and CMIP6 MMEs show statistically significant positive biases for R10mm and R20mm in most of southern Africa and small negative biases over southern Tanzania, northern Mozambique, and northern Angola. CORDEX and CORE MMEs tend to overestimate R10mm over the Okavango, Orange, and Limpopo basins and western coastal Angola but show a strong underestimation (up to 12 days) over northern Angola, northern Mozambique, southern Tanzania, and the eastern Zambezi basin (Fig. 4j, k).
For R20mm, the CORDEX and CORE MMEs exhibit positive biases over most of southern Africa, with few areas of negative biases, particularly over Mozambique. Comparing the biases of the DJF season to those of the MAM and SON seasons shows that all MMEs perform better in simulating R10mm and R20mm during MAM and SON (Figs. 4, S3, and S4). Furthermore, there are larger areas of statistically insignificant biases in all MMEs for R10mm and R20mm during the MAM and SON seasons compared to DJF. The MMEs show a common larger wet bias over Lesotho and southeast coastal South Africa. Diallo et al. (2015) associated the wet bias shown over Lesotho and southeast coastal South Africa with the overestimation of southerly wind flux and the effects of complex topography on convection triggering. More specifically, the area of overestimation over Lesotho (southeastern coastal South Africa) is slightly larger in CMIP6 and CORE than in CMIP5 and CORDEX. However, the magnitude of the negative bias for R10mm and R20mm over northern Mozambique is slightly higher in CMIP5 than in CMIP6. Despite the persistence of statistically significant biases in MME simulations for the six extreme precipitation indices, they reasonably reproduce the spatial distribution of extreme precipitation over southern Africa (figure not shown). This implies that the MMEs can capture key climate systems that influence precipitation over southern Africa. Figure 5 shows the comparison between the observed and simulated extreme precipitation indices averaged over the four major river basins of southern Africa shown in Fig. 1. The differences among the observed mean CDD values are larger during MAM than during DJF and SON, with the largest difference over the Limpopo and Orange basins (Figs. 5a, S5a, and S6a). The MMEs tend to underestimate the observed mean CDD over all basins during the three seasons, with few cases of overestimation.
The interquartile range of MMEs is smaller over the Zambezi basin, with ranges of 2.3, 2.0, 1.8, and 2.3 days for CMIP5, CMIP6, CORDEX, and CORE, respectively. Notably, all MMEs overestimate CDD relative to TAMSAT, while CMIP5, CMIP6, and CORE (CORDEX) underestimate (overestimate) mean CDD over the Orange basin during SON relative to ARC and CHIRPS (Fig. S6a). The CMIP5, CMIP6, CORDEX, and CORE medians overestimate the mean CWD and RR1 over all basins (Fig. 5b, c). In particular, for RR1 and CWD, the interquartile ranges of CMIP5 and CMIP6 are outside the maximum range of the observations over all basins, whereas CORDEX and CORE medians are generally higher than the largest mean of the observations (Fig. 5b, c).

Regional analysis
In contrast to CDD, for CWD, the interquartile range for MMEs is larger over Zambezi compared to other basins, particularly for CMIP5 (24.4 days) and CMIP6 (24.0 days). The differences between observations are larger for CDD (with a maximum difference of 4.6 days over the Limpopo basin) than for CWD (with a maximum difference of 2.7 days over the Zambezi basin). For R10mm, R20mm, and SDII, the biases of MMEs differ depending on the reference observation and river basin (Fig. 5d, e, f). Previous studies have reported that the performance of climate models depends on the choice of reference observation and geographical location (Dosio et al. 2021a, b; Faye and Akinsanola 2022). The differences between the observations are larger than the intermodel spread for the mean SDII over the Limpopo basin compared to other basins, with means of 10.4, 14.1, and 18.3 mm/day for ARC, CHIRPS, and TAMSAT, respectively. This illustrates that the agreement of the observations is less than that of the model simulations (see Fig. 5f and, e.g., Dosio et al. 2021b). Interestingly, CORE shows a larger overestimation of R20mm than CMIP5 and CMIP6 over the Orange basin (all basins) during DJF (MAM and SON, except for Zambezi during MAM). Additionally, the intermodel spread of CORE is larger than those of CMIP5, CMIP6, and CORDEX during MAM for R20mm (Fig. S5e). Samuel et al. (2022) noted that RegCM (under the CORE experiment) has challenges in simulating precipitation indices over the Zambezi River basin, therefore largely contributing to the interquartile spread of CORE. A comparison between CMIP5 and CMIP6 shows that CMIP5 performs slightly better for most indices, except SDII and CWD, over the Zambezi basin. However, the interquartile range of CMIP5 is generally larger than that of CMIP6. Overall, the intermodel spread is larger during MAM and SON than during DJF for most of the indices over all basins.
Generally, RCMs show better performance in simulating the magnitude of the observed extreme precipitation indices and have lower interquartile spreads than GCMs for most indices, with some exceptions, particularly for CORE. In particular, CORDEX shows better performance than CORE in simulating extreme precipitation over the four basins.

[Figure caption, beginning missing: … Fig. 1. The results are shown for each observation (solid circles) and for the CMIP5, CMIP6, CORDEX, and CORE ensembles (box and whisker plots). The boxes indicate the interquartile (25th and 75th percentile) model range, the solid marks within the boxes show the multimodel median, and the whiskers indicate the total intermodel range.]

Figures 6 and 7 show Taylor diagrams of the spatial correlation coefficient (SCC), normalized standard deviation (NSD), and centered root-mean-square difference (CRMSD) of the simulated indices (for the individual models and the multimodel ensemble means) against CHIRPS for the four river basins. To assess the uncertainty among the observations, the results for ARC and TAMSAT are also shown. The performance of individual models and MMEs varies with basin and index (Figs. 6 and 7). For instance, individual models and MMEs show lower SCCs over Limpopo (< 0.4) and Zambezi (< 0.6) for CDD. However, agreement among observations is also weak over the two basins for CDD. The spread among the models is larger for CWD than for CDD, RR1, R10mm, R20mm, and SDII. Notably, the uncertainty of the observations is larger over the Orange and Zambezi basins. In particular, the NSDs for most of the models are > 1.5 for CWD (RR1) over all river basins (Orange River basin). Among the ensembles, the CORDEX MME shows better performance than CMIP5, CMIP6, and CORE in replicating CHIRPS NSDs for CWD over the Limpopo and Okavango basins, with NSD values of 1.0 and 1.66, respectively. The CORDEX MME performs better than the CMIP5, CMIP6, and CORE MMEs in simulating most of the extreme precipitation indices.
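The three statistics summarized in a Taylor diagram can be illustrated with a minimal sketch. It assumes that the simulated and reference fields are already on a common grid and that all grid cells are weighted equally (no area weighting); the function name `taylor_statistics` is hypothetical and is not taken from the study. The CRMSD is normalized by the reference standard deviation, which is an assumption made here so that it is on the same scale as the NSD.

```python
import numpy as np


def taylor_statistics(sim, ref):
    """Return (SCC, NSD, normalized CRMSD) of a simulated field against a reference.

    Assumption: both fields share one grid and grid cells are equally weighted.
    """
    sim = np.asarray(sim, dtype=float).ravel()
    ref = np.asarray(ref, dtype=float).ravel()

    # Keep only grid cells that are valid in both fields.
    valid = np.isfinite(sim) & np.isfinite(ref)
    sim, ref = sim[valid], ref[valid]

    # Centered (mean-removed) spatial patterns.
    sim_anom = sim - sim.mean()
    ref_anom = ref - ref.mean()

    sd_sim = np.sqrt(np.mean(sim_anom ** 2))
    sd_ref = np.sqrt(np.mean(ref_anom ** 2))

    scc = np.mean(sim_anom * ref_anom) / (sd_sim * sd_ref)  # spatial correlation
    nsd = sd_sim / sd_ref                                    # normalized std. dev.
    crmsd = np.sqrt(np.mean((sim_anom - ref_anom) ** 2)) / sd_ref
    return scc, nsd, crmsd
```

With this normalization the three quantities satisfy CRMSD² = 1 + NSD² − 2·NSD·SCC, which is the geometric relation that allows them to be displayed on a single diagram.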
The SCCs (NSDs) for individual models and MME range from 0.6 to 0.95 (0.5 to 1.5) for R10mm and R20mm. Similar to CDD, most individual models and MMEs exhibit lower SCC (< 0.7) over the Limpopo and Zambezi basins compared to the Orange and Okavango basins for R10mm and R20mm. More specifically, the spread among models is larger over Okavango for R10mm. Most of the individual models underestimate CHIRPS NSDs for SDII, particularly over the Limpopo, Orange, and Zambezi basins. The model spread is smaller for CDD and SDII than for CWD, RR1, R10mm, and R20mm over the four river basins. Overall, individual models and MMEs exhibit the worst performance in simulating extreme precipitation over the Limpopo and Zambezi basins, although the uncertainty among the observations is also larger over the two river basins for most of the indices. Taking the uncertainty among the observations into account, all ensembles generally reproduce the spatial patterns of extreme precipitation over the four river basins, with better performance in MMEs compared to individual models.

Figure 8 shows the Taylor skill score (TSS) for the model ensembles relative to the mean of the observations (OBS) during DJF, averaged over the four basins. For CDD, the MMEs show the best (worst) skill in representing extreme precipitation over Okavango and Orange (Limpopo), with TSS greater than 0.8 (less than 0.2). For example, the TSS for CDD for CMIP5, CMIP6, CORDEX, and CORE ranges between 0.01 and 0.46, 0.03 and 0.28, and 0.05 and 0.39, respectively. The higher TSS over the Limpopo basin during MAM and SON (> 0.40 in all ensembles with the exception of CMIP6 during MAM) compared to DJF (< 0.2 in all ensembles) demonstrates the better performance of the ensembles for CDD during MAM and SON compared to DJF (Figs. 8, S7, and S8). For CWD in DJF, MMEs exhibit the highest and lowest skill over Limpopo and Zambezi, respectively.
The range of TSS is higher for CWD over the Zambezi basin despite lower skill in the ensembles, with TSS ranging from 0.05 to 0.7 for CMIP5, 0.08 to 0.66 for CMIP6, 0.03 to 0.58 for CORDEX, and 0.02 to 0.42 for CORE. The TSS of the ensembles is generally greater than 0.5 for both R10mm and R20mm over all the basins. Over the Orange basin, the ensembles show TSS values less than 0.4 for SDII, with TSS values for individual models ranging from 0.10 to 0.72 for CMIP5, 0.06 to 0.72 for CMIP6, 0.12 to 0.62 for CORDEX, and 0.13 to 0.70 for CORE. Figures 8, S7, and S8 illustrate that MMEs can reasonably represent extreme precipitation indices over the four basins, especially in DJF and SON. However, the range of TSS among individual models is large for most extreme precipitation indices. Notably, the results show that CORDEX has better skill than the other ensembles in representing extreme precipitation during DJF, MAM, and SON.

Table 2 shows the Kling-Gupta efficiency (KGE) values of extreme precipitation for the ensembles over each basin. Generally, the ensembles show poor skill in simulating CWD, with all ensembles showing negative KGE values over the basins except for CORDEX over Limpopo and Okavango. In agreement with the Taylor diagram and the TSS discussed previously, the ensembles show poor skill in simulating CDD over the Limpopo basin. However, the ensembles show good skill in simulating CDD over Okavango, Orange, and Zambezi. The skill of the GCM ensembles (CORE) is poor in simulating RR1 over the Limpopo and Orange basins (Limpopo basin). On the other hand, CORDEX shows good skill in simulating RR1 over these basins. Generally, the KGE values for R10mm, R20mm, and SDII for the ensembles are positive over all basins, with few exceptions (Table 2). Overall, the ensembles show good skill in simulating CDD, RR1, SDII, R10mm, and R20mm.
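For readers who wish to reproduce scores of this kind, the sketch below shows one common way to compute a Taylor skill score and the Kling-Gupta efficiency. This section does not state which form of the TSS was used, so the version here (with a configurable exponent and a maximum attainable correlation `r0`) is an assumption, as are the function names. The KGE follows the standard decomposition into correlation, variability ratio, and mean ratio.

```python
import numpy as np


def taylor_skill_score(scc, nsd, r0=1.0, exponent=4):
    """Taylor skill score from a correlation (scc) and normalized std. dev. (nsd).

    Assumptions: r0 is the maximum attainable correlation, and `exponent`
    (typically 1 or 4) sets how strongly the correlation is weighted.
    The score is 1 for a perfect match (scc = r0 = 1 and nsd = 1).
    """
    return 4.0 * (1.0 + scc) ** exponent / (
        (nsd + 1.0 / nsd) ** 2 * (1.0 + r0) ** exponent
    )


def kling_gupta_efficiency(sim, ref):
    """KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2)."""
    sim = np.asarray(sim, dtype=float).ravel()
    ref = np.asarray(ref, dtype=float).ravel()
    r = np.corrcoef(sim, ref)[0, 1]   # linear correlation
    alpha = sim.std() / ref.std()     # variability ratio
    beta = sim.mean() / ref.mean()    # bias (mean) ratio
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)
```

As an illustration of why KGE can become negative, a simulation that doubles every reference value has r = 1 but alpha = beta = 2, which gives KGE = 1 − √2 ≈ −0.41 even though the pattern is perfectly correlated.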
Figure 9 shows the box and whisker plots of the interannual variability of extreme precipitation indices calculated using the standard deviation (SD) of the time series for the MMEs. The circles show the interannual variability of each observation. The results for CDD show the largest differences among the observations (4.9, 2.3, and 4.1 days for ARC, CHIRPS, and TAMSAT, respectively) over the Orange basin and the lowest (1.3, 1.4, and 1.7 days for ARC, CHIRPS, and TAMSAT, respectively) over the Zambezi basin (Fig. 9a). The sign of MME biases for the interannual variability in CDD differs depending on the observation (Figs. 9a, S9a, S10a). For instance, all MMEs overestimate (underestimate) the interannual variability relative to TAMSAT (ARC and CHIRPS) over the Orange basin. Similar to the observations, the interquartile ranges of CDD for CMIP5 (0.7 days), CMIP6 (0.5 days), CORDEX (0.4 days), and CORE (0.8 days) are lower over the Zambezi basin and larger (1.8, 1.5, 1.2, and 1.0 days, respectively) over the Orange basin. The larger intermodel spread shown for the interannual variability in CDD during MAM and SON is similar to that for the mean CDD (Figs. 9a, S9a, and S10a). It is interesting to note that CMIP5 and CMIP6 exhibit smaller biases than CORDEX and CORE for CDD over the Limpopo, Okavango, and Zambezi basins (Fig. 9a). For CWD, all MMEs tend to overestimate the interannual variability relative to all observations, except over the Zambezi basin, where CORE underestimates interannual variability relative to ARC. Similar to CDD, the biases of RR1, R10mm, and R20mm vary depending on the observation (Fig. 9c, d, e, f). MMEs overestimate the observed interannual variability for CWD, with larger positive biases in GCM ensembles compared to RCM ensembles over all basins (Fig. 9b-d). Generally, the intermodel spread is larger in GCM ensembles than in RCM ensembles for most of the indices over the basins.
A comparison of CMIP5 and CMIP6 shows that CMIP5 exhibits a larger intermodel spread than CMIP6 for most indices over all basins.
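The quantities displayed in the box and whisker plots can be sketched as follows. This is a minimal illustration, assuming one seasonal index value per year for each model; the function names are hypothetical, and the choice between the population (`ddof=0`) and sample (`ddof=1`) standard deviation is not specified in this section and is left as a parameter.

```python
import numpy as np


def interannual_sd(yearly_values, ddof=0):
    """Interannual variability as the standard deviation of a yearly time series."""
    return float(np.std(np.asarray(yearly_values, dtype=float), ddof=ddof))


def ensemble_summary(member_values):
    """Median, interquartile range, and full range across ensemble members."""
    values = np.asarray(member_values, dtype=float)
    q25, q50, q75 = np.percentile(values, [25, 50, 75])
    return {
        "median": float(q50),                               # mark inside the box
        "iqr": float(q75 - q25),                            # height of the box
        "full_range": float(values.max() - values.min()),   # whisker extent
    }
```

In this sketch, `interannual_sd` would be applied to each ensemble member (and to each observation) before `ensemble_summary` is applied across the members of one ensemble.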

Summary and concluding remarks
The evaluation of historical climate simulations is fundamental, especially before they are used to assess the future impacts of extreme precipitation on key sectors such as agriculture and water resources. This is particularly important over southern Africa because of the difficulties of climate models in representing precipitation over the region (Desbiolles et al. 2020; Samuel et al. 2022). This study investigates the performance of large ensembles of global (CMIP5 and CMIP6) and regional (CORDEX and CORE) climate models in simulating extreme precipitation over the four major river basins (Limpopo, Okavango, Orange, and Zambezi) of southern Africa. The performance of climate models in simulating extreme precipitation was evaluated for 23 years (1983–2005) during DJF, MAM, and SON using six extreme precipitation indices (CDD, CWD, RR1, R10mm, R20mm, and SDII) defined by ETCCDI. Three satellite-based observations (ARC, CHIRPS, and TAMSAT) are used. The assessment is mainly focused on the performance of MMEs compared to the mean of the three observations (OBSE) during the peak of the rainy season (DJF). However, we considered the spread of the observations and ensembles to assess their respective uncertainties. Several statistical metrics were used to quantify the performance of MMEs over the four basins.
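As a reminder of how the six indices are defined, the following minimal sketch computes them from a single daily precipitation series (for example, one grid cell and one season). It assumes the usual wet-day threshold of 1 mm and a series without missing values; the function names are hypothetical, and the sketch is not the processing chain used in this study.

```python
import numpy as np

WET_DAY_MM = 1.0  # a day counts as wet when precipitation is at least 1 mm


def longest_run(mask):
    """Length of the longest run of consecutive True values."""
    longest = current = 0
    for flag in mask:
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return longest


def precipitation_indices(daily_mm):
    """Compute RR1, SDII, R10mm, R20mm, CDD, and CWD from daily totals (mm)."""
    pr = np.asarray(daily_mm, dtype=float)
    wet = pr >= WET_DAY_MM
    n_wet = int(wet.sum())
    return {
        "RR1": n_wet,                              # number of rainy days
        "SDII": float(pr[wet].sum() / n_wet) if n_wet else float("nan"),  # mm/day
        "R10mm": int((pr >= 10.0).sum()),          # heavy precipitation days
        "R20mm": int((pr >= 20.0).sum()),          # very heavy precipitation days
        "CDD": longest_run(~wet),                  # longest dry spell (days)
        "CWD": longest_run(wet),                   # longest wet spell (days)
    }
```

Because SDII divides the wet-day total by RR1, a model that simulates too many light-rain days tends to overestimate RR1 and CWD while underestimating SDII and CDD, which matches the sign of the biases reported above.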
The results show that all MMEs can reproduce the precipitation peak during DJF over all basins, albeit with wet biases in all ensembles. The spread of the ensembles is generally larger than that of the observations, with a larger spread in the GCMs than in the RCMs. CORDEX is closer to the observations than the other three ensembles. The spatial distributions of the biases of extreme precipitation are consistent across the ensembles. In particular, all the ensembles overestimate (underestimate) CWD, RR1, R10mm, and R20mm (CDD and SDII) over all basins, except for CORDEX and CORE over the eastern region of the Zambezi basin for R10mm. Lower biases in CORDEX and CORE compared to CMIP5 and CMIP6 show the added value of dynamic downscaling. In particular, the biases of CORDEX simulations are lower than those of CORE, despite CORDEX having a lower spatial resolution than CORE. This may be partly because the number of CORE simulations is very limited (only 3 RCMs downscaling 3 GCMs), but a detailed analysis of the CORE performance over southern Africa is still missing.
The intermodel spread is larger than the observational spread for most of the indices over all basins. In particular, the spread of the CMIP5 and CMIP6 ensembles is larger than those of CORDEX and CORE.
The biases of interannual variability of extreme precipitation indices are generally consistent with those of the mean of extreme precipitation over the four basins, with regional models usually performing better than the GCMs, apart from CDD.
We used several statistical metrics to quantify the performance of the ensembles in simulating extreme precipitation spatially averaged over the four basins. The lowest SCC and TSS are observed over the Limpopo basin for CDD compared with the other three basins. However, MMEs largely overestimate the NSD relative to CHIRPS, with NSD values greater than three for most ensemble members over all basins. Generally, the ensembles show good skill in simulating extreme precipitation over the basins, except for CDD and CWD over the Limpopo basin and all basins, respectively.
In summary, RCM ensembles perform better than GCM ensembles for most extreme precipitation indices over all basins, which illustrates the added value of dynamic downscaling in simulating extreme precipitation. Generally, the intermodel spread is very large for all indices over all basins.
Without any claim for completeness, we acknowledge several caveats about this study. First, the results of this study provide a first-order assessment of multimodel ensemble performances in reproducing the observed extreme precipitation. However, persistent biases in the ensembles indicate the need for additional study on model evaluation over basins using a process-based approach. Second, multimodel ensembles, which do not represent the performance of individual models, are the primary focus of this study. To account for individual model performance, we computed the intermodel spread for each ensemble. Nonetheless, a thorough assessment of individual model performance may be helpful in better understanding ensemble biases. Third, due to the lower resolution of GCM simulations, regionally averaged assessments are restricted to basin averaging, which disregards spatial heterogeneity of precipitation within the basin borders. Even CORDEX simulations (at ~ 50 km resolution) may still be considered too coarse for subdividing basins into subregions with homogeneous precipitation. As the main aim of our study is to compare different classes of models (with different resolutions), we believe that our choice is a valid compromise and that, while spatial heterogeneity of precipitation within basins is not considered in the basin-averaged results, the information on model performance is still informative, as shown in previous studies (e.g., Abiodun et al. 2019; Zhu et al. 2021a, b). Finally, annual cycles of precipitation are evaluated using climatological monthly means, which may introduce some uncertainty into the results. Despite its shortcomings, this approach is frequently used to evaluate the ability of climate models to simulate annual precipitation cycles (Hamadalnel et al. 2022; Dike et al. 2022).
Despite the aforementioned caveats, the study still provided important information on the ability of CMIP5, CMIP6, CORDEX, and CORE to reproduce observed extreme precipitation over southern Africa's major river basins. Hence, we believe the results in this study are robust and provide important information to the scientific community and policymakers on the capabilities and limitations of CMIP5, CMIP6, CORDEX, and CORE in representing extreme precipitation over southern Africa's major river basins.
In particular, this study shows that wet biases persist in all model ensembles across all of the basins for most indices except SDII, and that these biases have not always been reduced (and have sometimes been increased) by model development (CMIP6 vs. CMIP5) or increased resolution (CORE vs. CORDEX).