Establishing uncertainty ranges of hydrologic indices across climate and physiographic regions of the Congo River Basin


Introduction
Hydrologic indices, or signatures, characterize a sub-basin's long-term hydrologic behavior. They reflect the dynamics of the different components of the catchment water balance, such as climate, water storage and different runoff processes. They have been used in many hydrologic applications, including direct runoff prediction (Kult et al., 2014; Zhang et al., 2018), model evaluation and optimization (Shafii and Tolson, 2015), model selection (Jothityangkoon et al., 2001; McMillan et al., 2011), uncertainty analysis (Westerberg and McMillan, 2015; Westerberg et al., 2016), environmental flow assessment (Olden and Poff, 2003), catchment classification (Ley et al., 2011; Sawicz et al., 2011) and ensemble predictions (Yadav et al., 2007; Zhang et al., 2008; ...). [...] streamflow data that is available to quantify the indices. Ultimately, the indices are intended to constrain the outputs from hydrologic models and consequently reduce the uncertainty in predictions and the risks associated with water resources decision making.

The study area
The Congo River Basin covers a drainage area of approximately 3.7 × 10⁶ km². It is the world's second largest river basin, after the Amazon, in both size and discharge. In Africa, it is second only to the Nile River in length. The climate is warm and humid with two distinctive wet and dry seasons that vary with distance from the equator (Bultot, 1974; Samba et al., 2008). The mean temperature is approximately 25 °C. The mean annual rainfall is ∼2 000 mm y⁻¹ in the central parts of the basin, decreasing both northward and southward to ∼1 100 mm y⁻¹. The annual potential evapotranspiration is between 1 100 and 1 200 mm y⁻¹ and varies little across the basin (Alsdorf et al., 2016). Land cover varies from tropical evergreen forest, with little seasonal variation, in the central parts, to savannas in the north and south (Mayaux et al., 2000; Hansen et al., 2008). Similarly, the heterogeneity of the soil types and that of the geological settings are two of the factors affecting the spatial variability in hydrologic dynamics across the basin (Tshimanga and Hughes, 2014).
The basin has four main drainage systems (Oubangui River in the north east, Sangha River in the north west, Kasai River in the south west and Lualaba River in the south east) that converge to form the main Congo River. The details of the basin topography are well documented in Runge (2007) and are not repeated here. A previous study delineated the basin into 99 modeling units, 83 resulting from an analysis of the dominant slopes and elevations, while 16 sub-basins were based on the locations of the key gauging stations (Tshimanga, 2012). As a result, the smallest modeling unit was 533 km², while the biggest was 185 835 km². In a recent study (Tshimanga et al., 2018), a similar delineation procedure was undertaken but using a revised Digital Elevation Model (MERIT DEM) (Yamazaki et al., 2017), which was corrected to remove vegetation height effects. This is particularly important for the Congo River Basin given the extent of dense tropical forest. The new delineation resulted in 403 sub-basins (Fig. 1) that are considered appropriate to represent natural hydrologic variability and that account for the current water resources management needs within the basin.

Hydro-climatological data
Streamflow time series for 58 gauging stations (Fig. 1), with different periods of record, were obtained from several sources including the Global Runoff Data Centre (Fekete et al., 1999), the Office National de Recherche et du Developpement (Lempicka, 1971), Hydrosciences Montpellier-Système d'Informations Environnementales (SIEREM, http://hydrosciences.fr/sierem) and the Annuaire hydrologique du Congo Belge (Devroey, 1951-1959). Roughly a quarter of the gauging stations (15 of 58) represent non-impacted headwater flow regimes (Fig. 1 and Table 1), while the majority are located in the downstream parts of the basin and represent cumulative streamflow characteristics from large catchment areas. Hydrologic indices obtained from headwater gauging station data are useful for establishing regional variations, but indices obtained from downstream stations will include mixtures of different upstream sub-basin hydrologic responses and will therefore be less directly useful. Gauging stations downstream of some known large wetland systems, or of any major water resources infrastructure, were excluded and not used to quantify hydrologic indices.
The climate data used (rainfall and evapotranspiration) are from the Climatic Research Unit (CRU TS 3.10) data for the period 1901-2014 (Harris et al., 2014), at a spatial resolution of 0.5°. They are used to derive the runoff ratio and an aridity index (the ratio of mean annual evapotranspiration to mean annual rainfall), the latter as a potential predictor of hydrologic behavior, as reported in many hydrologic studies (Beck et al., 2015; Tumbo and Hughes, 2015; Ndzabandzaba and Hughes, 2017; Zhang et al., 2018). The UNIDEL (University of Delaware) rainfall dataset (covering the same period and spatial resolution) was used to check the appropriateness of the CRU rainfall data in specific areas (Sun et al., 2018). The use of global climate datasets is justified by the lack of adequate ground-based information available for long periods and with good spatial coverage. However, the paucity of rainfall gauges over the Congo River Basin suggests that only limited observed records are used to construct and validate the global datasets, contributing to potential errors and input uncertainties (Tshimanga, 2012). While these uncertainties are likely to be quite important for hydrologic modeling, they are less likely to have a large impact on the derivation of climate indices.
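As a minimal sketch of the aridity index defined above, the snippet below computes the ratio of mean annual potential evapotranspiration to mean annual rainfall from monthly series. The function names and the two-year input series are hypothetical and only illustrate the arithmetic; they are not taken from the CRU data or from the study's processing chain.

```python
def mean_annual_total(monthly_values):
    """Mean annual total from a monthly series covering whole years."""
    if not monthly_values or len(monthly_values) % 12 != 0:
        raise ValueError("expected a non-empty series of complete years")
    n_years = len(monthly_values) // 12
    return sum(monthly_values) / n_years


def aridity_index(monthly_pet_mm, monthly_rain_mm):
    """Mean annual potential evapotranspiration divided by mean annual rainfall."""
    rain = mean_annual_total(monthly_rain_mm)
    if rain <= 0:
        raise ValueError("mean annual rainfall must be positive")
    return mean_annual_total(monthly_pet_mm) / rain


# Hypothetical two-year series for one grid cell (mm per month).
pet = [95.0] * 24    # 1 140 mm per year
rain = [160.0] * 24  # 1 920 mm per year
ai = aridity_index(pet, rain)
print(round(ai, 3))
```

Values below 1 indicate that rainfall exceeds the evaporative demand on average, which is the humid end of the range discussed in this study.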

Physiographic data
Physiographic data that have potential relationships with sub-basin hydrologic response characteristics are used for the classification of the sub-basins and include the topographic wetness index (TWI), slope, soil textures (fractions of silt, sand and clay) and curve number (CN). The TWI and the slope were derived from the 90 m MERIT DEM (Yamazaki et al., 2017), while soil texture data were obtained from ISRIC, the World Soil Information website (Batjes, 2017). The curve number was extracted from the United States Natural Resources Conservation Service (NRCS) Runoff Curve Number (CN) dataset (Zeng et al., 2017). These physiographic properties (Table 2) have been previously reported as being important in understanding and regionalizing sub-basin runoff responses. The soil clay content has been used as a predictor for the base flow index (Beck et al., 2015), while the topographic wetness index is widely used to approximate relative soil moisture patterns (Buchanan et al., 2014) and quantify topographic control on hydrologic processes (Sørensen et al., 2006). The curve number is used in many hydrologic models (Williams et al., 2012; Zeng et al., 2017; Peña-Arancibia et al., 2019; Yang et al., 2019) to simulate surface runoff generation, and is calculated based on factors such as hydrologic soil group type, land use and land cover, hydrologic surface condition and antecedent moisture condition (Zeng et al., 2017).

Methods
The approach involves: (i) pre-processing of hydrologic data, (ii) catchment classification, (iii) quantification of hydrologic indices and development of regression relationships, and (iv) establishing uncertainty ranges of hydrologic indices. The steps are shown in Fig. 2.

Pre-processing of hydrologic data
The available observed streamflow time series are relatively short and have different record periods that might represent different sequences of dry and wet climatic conditions. This situation can affect the representativeness of derived hydrologic indices and therefore the data were pre-processed using a spatial interpolation approach (Hughes and Smakhtin, 1996) to extend the observed flow series to common record periods. This approach is based on the assumption that flows occurring simultaneously at sites in reasonably close proximity to each other correspond to similar percentage points on their respective duration curves. Its application requires the identification of key gauging stations within each region that have the longest record periods so that they can be used as source gauges for extending the flow series of gauging stations in their vicinity. While the method allows for the use of up to 5 source gauges with different weighting factors (Hughes and Smakhtin, 1996), only one source gauge has been used in the Congo application due to the limited number of available gauges. The outputs consist of both a patched (filling missing data periods) and extended time series, as well as a time series representing estimates for all months (substitute time series). The latter can then be compared with the original observed flow data, and the reliability of the method assessed using typical objective functions (such as the Nash coefficient of efficiency).
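The duration-curve assumption described above can be sketched as follows. This is a simplified illustration, not the published procedure: it assumes a single duration curve per site, one source gauge, Weibull plotting positions and linear interpolation, and all function names are hypothetical. A Nash coefficient of efficiency is included to show how a substitute series could be compared with observations.

```python
from bisect import bisect_left


def duration_curve(flows):
    """Flows ranked from high to low with their exceedance percentages."""
    ranked = sorted(flows, reverse=True)
    n = len(ranked)
    pct = [100.0 * (i + 1) / (n + 1) for i in range(n)]  # Weibull position
    return pct, ranked


def interp(x, xs, ys):
    """Piecewise-linear interpolation with clamping; xs must be non-decreasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # guarantees xs[i - 1] < x <= xs[i]
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def transfer_flows(source_series, source_obs, dest_obs):
    """Estimate destination flows at the same exceedance percentage as the source."""
    s_pct, s_ranked = duration_curve(source_obs)
    d_pct, d_ranked = duration_curve(dest_obs)
    s_flows_asc, s_pct_asc = s_ranked[::-1], s_pct[::-1]
    estimates = []
    for q in source_series:
        p = interp(q, s_flows_asc, s_pct_asc)       # source flow -> percentage
        estimates.append(interp(p, d_pct, d_ranked))  # percentage -> destination flow
    return estimates


def nash_efficiency(observed, simulated):
    """Nash coefficient of efficiency; 1 means a perfect match."""
    mean_obs = sum(observed) / len(observed)
    num = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    den = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - num / den


# Hypothetical monthly flows at a source gauge and a nearby destination gauge.
source = [10.0, 20.0, 30.0, 40.0, 50.0]
dest = [20.0, 40.0, 60.0, 80.0, 100.0]
substitute = transfer_flows(source, source, dest)
print(substitute, nash_efficiency(dest, substitute))
```

In practice the source series would cover a longer period than the destination record, so the same mapping both fills gaps and extends the destination series.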

Catchment classification
Due to the largely ungauged nature of the Congo River Basin, the available gauging stations are not sufficient to represent the variability of the hydrologic response characteristics across the different climate and physiographic regions. It is therefore necessary to assume that more readily available climate and physiographic (surrogate) information can be used in a catchment classification approach to group sub-basins into regions of expected similar hydrologic response. The limited available gauging station data can then be used to quantify the expected response characteristics of the sub-basins within the regions, albeit with a degree of uncertainty. Self-Organizing Maps (SOM) are able to analyze, organize and cluster various types of data through non-linear relationships, which represent the internal similarity of the variables. They have been used in many hydrologic applications (Hall et al., 2002; Srinivas et al., 2008; Herbst and Casper, 2008; Toth, 2009; Di Prinzio et al., 2011; Ley et al., 2011; Toth, 2013). In this study, the Viscovery SOMine software (https://www.viscovery.net/somine/) is used for the sub-basin scale similarity analysis.
The details of SOM methods are well documented in previously reported studies (Ley et al., 2011; Di Prinzio et al., 2011). A SOM consists of two layers of interconnected neurons (nodes), the input and the output layers (Kalteh et al., 2008), where the input layer represents the sub-basin attributes used in the classification, and the output layer corresponds to the number of classes to which the sub-basins are assigned. Although there are no well-defined guidelines on the appropriate number of clusters to be formed, SOM allows for both automatic (Liu et al., 2011) and user-defined cluster numbers based on the dataset and the level of detail required in the classification. For the classification to be coherent, only a few parameters need to be specified during the training cycle of a SOM. These are the map size, training parameters and clustering method. Three types of clustering methods are embedded in the Viscovery SOMine software: SOM-Ward, Ward and SOM-Single-Linkage. SOM-Ward is generally used because it is considered the most efficient technique (Kohonen, 2001) and is implemented in this study. The quality of a well-trained SOM is evaluated by means of a quantization error (known also as Euclidean distance), defined as the average of the squared distance of all data records associated with a node in the output layer. It should be as small as possible and is often used as the basis for assigning input vectors (sub-basins) to nodes. Therefore, sub-basins that have similar quantization error are assigned to the same class. An independent evaluation of the accuracy of the classification is achieved with the ANOSIM statistic R, which tests whether there is a significant difference between the identified clusters (Clarke, 1993; Warton et al., 2012).
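The ANOSIM statistic mentioned above can be illustrated with a short sketch. It assumes the usual rank-based form, R = (mean between-group rank − mean within-group rank) / (M / 2), with M the number of sample pairs, and it assumes Euclidean distance on already-scaled attributes. The permutation test that yields the p-value is omitted, and the function name and sample data are hypothetical.

```python
import math
from itertools import combinations


def anosim_r(points, labels):
    """Rank-based ANOSIM R for points (tuples of attributes) and cluster labels."""
    pairs = list(combinations(range(len(points)), 2))
    dists = [math.dist(points[i], points[j]) for i, j in pairs]

    # Rank dissimilarities from smallest (rank 1); ties share the average rank.
    order = sorted(range(len(dists)), key=dists.__getitem__)
    ranks = [0.0] * len(dists)
    k = 0
    while k < len(order):
        m = k
        while m + 1 < len(order) and dists[order[m + 1]] == dists[order[k]]:
            m += 1
        average_rank = (k + m) / 2 + 1
        for t in range(k, m + 1):
            ranks[order[t]] = average_rank
        k = m + 1

    within = [r for r, (i, j) in zip(ranks, pairs) if labels[i] == labels[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if labels[i] != labels[j]]
    if not within or not between:
        raise ValueError("need both within-group and between-group pairs")
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (len(pairs) / 2)


# Hypothetical one-attribute example with two clearly separated groups.
pts = [(0.0,), (0.1,), (10.0,), (10.2,)]
groups = [0, 0, 1, 1]
print(anosim_r(pts, groups))
```

Values of R close to 1 indicate that samples within a group are more similar to each other than to samples in other groups, while values near 0 indicate no separation.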
Prior to the SOM training, all attributes need to be standardized in order to suppress the effect of their different orders of magnitude and ensure that they all have equal importance in calculating a meaningful Euclidean distance between two points. Viscovery SOMine provides two scaling methods (Variance and Range), one of which is selected for each attribute depending on its internal distribution. In both methods, the mean value of the attribute is subtracted from each value of the attribute so that the mean of the scaled attribute is zero. In variance scaling, the difference between each value and the mean of the attribute is divided by the standard deviation of the attribute, so that the variance of the scaled attribute is always 1. In range scaling, the difference is multiplied by 8/(maximum − minimum) of the attribute, such that the new range is always 8. This has the advantage of reducing the impact of outliers in the training process, thus speeding up learning and leading to faster convergence. Range scaling is automatically activated if the difference between the maximum and minimum values of an attribute is smaller than 8 times the standard deviation; otherwise variance scaling applies.
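The scaling rule described above can be written out as a short sketch. It follows the description in the text only; it is not the Viscovery SOMine implementation, and the use of the population standard deviation and the function name are assumptions.

```python
from statistics import mean, pstdev


def scale_attribute(values):
    """Centre an attribute and apply range or variance scaling as described."""
    centre = mean(values)
    spread = pstdev(values)  # assumption: population standard deviation
    span = max(values) - min(values)
    if span == 0:
        return [0.0] * len(values), "constant"
    if span < 8 * spread:
        factor, method = 8.0 / span, "range"      # scaled range becomes 8
    else:
        factor, method = 1.0 / spread, "variance"  # scaled variance becomes 1
    return [(v - centre) * factor for v in values], method


# Hypothetical attributes: an evenly spread one and one with a strong outlier.
even, even_method = scale_attribute([1.0, 2.0, 3.0, 4.0, 5.0])
skewed, skewed_method = scale_attribute([0.0] * 99 + [100.0])
print(even_method, max(even) - min(even))
print(skewed_method, round(pstdev(skewed), 3))
```

With this rule, an attribute whose range is narrow relative to its standard deviation is stretched to a fixed range of 8, while an attribute with a very wide range falls back to unit variance.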
The classification strategy adopted here follows a four-step approach:
- Step (1) classifies all 403 sub-basins based on the climate and physiographic attributes;
- Step (2) identifies, in the homogeneous regions formed in step (1), all the selected headwater gauged sub-basins;
- Step (3) performs two independent classifications of the selected headwater gauged sub-basins. The first is based on the climate and physiographic attributes, and the second on the indices of hydrologic behavior. The hydrologic classification was performed 5 times, including each index separately and all indices together, in order to identify the indices responsible for the highest affinity with the climate and physiographic attributes;
- Step (4) compares the two independent classifications obtained in step (3) to assess the level of overlap. This was done using an index of affinity developed by Rand (1971).
The Rand affinity index (Rand, 1971; Di Prinzio et al., 2011; Ssegane et al., 2012) was calculated using the following expression:

$$R = \frac{a + b}{a + b + c + d}$$

where R varies between 1 (perfect agreement between the two pools of clusters) and 0 (no agreement). The terms a, b, c and d are defined as follows. Consider two classifications (C1 and C2) of the same dataset; a pair of sub-basins can be assigned to the same cluster or to different clusters in C1 and in C2. Then "a" is the number of sub-basin pairs that are in the same cluster in C1 and in the same cluster in C2; "b" is the number of pairs that are in different clusters in C1 and in different clusters in C2; "c" is the number of pairs that are in the same cluster in C1 but in different clusters in C2; and "d" is the number of pairs that are in different clusters in C1 but in the same cluster in C2.
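A minimal sketch of the pair-counting definitions given above, assuming the standard form of the Rand index, R = (a + b) / (a + b + c + d). The function name and the example labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_c1, labels_c2):
    """Rand index between two classifications of the same sub-basins."""
    if len(labels_c1) != len(labels_c2) or len(labels_c1) < 2:
        raise ValueError("need two label lists of equal length (at least 2 items)")
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_c1)), 2):
        same_c1 = labels_c1[i] == labels_c1[j]
        same_c2 = labels_c2[i] == labels_c2[j]
        if same_c1 and same_c2:
            a += 1  # together in both classifications
        elif not same_c1 and not same_c2:
            b += 1  # separated in both classifications
        elif same_c1:
            c += 1  # together in C1 only
        else:
            d += 1  # together in C2 only
    return (a + b) / (a + b + c + d)


# Hypothetical cluster labels for four gauged sub-basins.
physiographic = [1, 1, 2, 2]
hydrologic = [1, 1, 1, 2]
print(rand_index(physiographic, hydrologic))
```

Only the grouping matters, not the label values themselves, so two classifications that use different cluster numbering but the same partition still give R = 1.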

Derivation of the indices and their relationships
The choice of hydrologic indices depends on the objectives of the study, which in this case are focused on establishing indices that are suitable for constraining the outputs of a specific rainfall-runoff model (the Pitman model: Tumbo and Hughes, 2015; Ndzabandzaba and Hughes, 2017). The hydrologic indices used in the current version of this model are the mean monthly runoff volume (MMQ in 10⁶ m³), the mean monthly groundwater recharge (MMR in mm), the 10th, 50th and 90th percentiles of the flow duration curve (FDC) expressed as a fraction of MMQ, and the percentage of zero flow. However, this study is limited to the mean monthly runoff volume and the 10th, 50th and 90th percentiles of the FDC, because they can be directly estimated from the available streamflow data and because zero flows are not relevant to the Congo River Basin at the scale of the sub-basins used in the study. The 33rd and 66th percentiles, which are sometimes used to compute the slope of the flow duration curve, could also have been used. However, the 10th, 50th and 90th percentiles are considered here to represent the minimum number of key indices that can characterize the complete flow duration curve, because they capture more extreme high and low flows and therefore a wider range of flow behavior than the 33rd and 66th percentiles. These percentiles are also used within the modeling software that will be used.
The mean monthly runoff volume (MMQ) was expressed as a runoff ratio (RR) in order to suppress sub-basin scale effects:

$$RR = \frac{MMQ}{P}$$

where MMQ is the long-term average monthly streamflow and P the long-term average monthly precipitation. The RR represents the long-term water balance separation between water being released from the sub-basin as streamflow and as evapotranspiration.
The three percentiles of the flow duration curve represent the frequency distribution of flows of different magnitude: the 10th percentile represents high flows, the 50th medium flows and the 90th low flow conditions.
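The indices described above can be sketched in a few lines. The snippet assumes that monthly flow volumes are given in 10⁶ m³ and are converted to a depth over the sub-basin area before dividing by rainfall, and it assumes a simple linear-interpolation rule for reading percentiles off the ranked flows; neither detail is specified in the text. All names and data are hypothetical.

```python
from statistics import mean


def runoff_depth_mm(volume_mcm, area_km2):
    """Convert a volume in 10^6 m^3 to a depth in mm over an area in km^2."""
    return volume_mcm / area_km2 * 1000.0


def flow_exceeded(flows, percent):
    """Flow equalled or exceeded `percent` % of the time (linear interpolation)."""
    ranked = sorted(flows, reverse=True)
    pos = percent * (len(ranked) - 1) / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ranked) - 1)
    return ranked[lo] + (ranked[hi] - ranked[lo]) * (pos - lo)


def hydrologic_indices(monthly_flow_mcm, monthly_rain_mm, area_km2):
    """Runoff ratio and the FDC percentiles expressed as fractions of MMQ."""
    mmq = mean(monthly_flow_mcm)
    rain = mean(monthly_rain_mm)
    indices = {"RR": runoff_depth_mm(mmq, area_km2) / rain}
    for pct in (10, 50, 90):
        indices[f"Q{pct}/MMQ"] = flow_exceeded(monthly_flow_mcm, pct) / mmq
    return indices


# Hypothetical record: 11 monthly flow volumes for a 1 000 km^2 sub-basin.
flows = [10.0 * k for k in range(1, 12)]
rainfall = [150.0] * 11
print(hydrologic_indices(flows, rainfall, area_km2=1000.0))
```

Because the percentiles are exceedance values, Q10/MMQ is the largest of the three fractions and Q90/MMQ the smallest.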
The classification approach used in this study assumes that streamflow indices exhibit some consistent relationships with sub-basin climate and physiographic characteristics (Yadav et al., 2007). However, this study also follows the approach applied by Tumbo and Hughes (2015) and Ndzabandzaba and Hughes (2017), where the relationships between the same hydrologic indices and the aridity index are explored. It is assumed that any relationships used to estimate hydrologic indices in ungauged sub-basins will necessarily be uncertain, and therefore 90 % regression confidence limits are used to quantify the degree of uncertainty.
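One way to obtain regression confidence limits of the kind described above is sketched below. It assumes a power relationship fitted by least squares in log space and confidence limits for the mean response; whether the study fitted its relationships this way is not stated, so this is an illustration only. The critical t value is supplied by the caller (for example from statistical tables, for n − 2 degrees of freedom), and all names and data are hypothetical.

```python
import math


def fit_power_law(x, y):
    """Least-squares fit of ln(y) = ln(a) + b * ln(x); needs at least 3 points."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need at least three paired observations")
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx
    ln_a = my - b * mx
    sse = sum((v - (ln_a + b * u)) ** 2 for u, v in zip(lx, ly))
    s = math.sqrt(sse / (n - 2))  # residual standard error in log space
    return {"ln_a": ln_a, "b": b, "s": s, "n": n, "mx": mx, "sxx": sxx}


def confidence_limits(fit, x0, t_crit):
    """Lower limit, fitted value and upper limit of the mean response at x0."""
    lx0 = math.log(x0)
    centre = fit["ln_a"] + fit["b"] * lx0
    half_width = t_crit * fit["s"] * math.sqrt(
        1.0 / fit["n"] + (lx0 - fit["mx"]) ** 2 / fit["sxx"]
    )
    return (math.exp(centre - half_width), math.exp(centre),
            math.exp(centre + half_width))


# Hypothetical aridity index and runoff ratio values for five sub-basins.
ai = [0.55, 0.65, 0.80, 0.95, 1.10]
rr = [0.45, 0.38, 0.27, 0.20, 0.18]
fit = fit_power_law(ai, rr)
# Two-sided 90 % limits with 5 - 2 = 3 degrees of freedom: t is about 2.353.
print(confidence_limits(fit, x0=0.75, t_crit=2.353))
```

The band is narrowest near the mean of the (log-transformed) predictor and widens towards the ends of the observed range, which is why estimates for sub-basins with extreme aridity index values carry wider uncertainty ranges.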

Spatial disaggregation of flow time series
Due to the low number of gauging stations, the relationships between the climate/physiographic attributes and the hydrologic indices can result in high uncertainty. However, there are several gauging stations that include a relatively small number of sub-basins (≤6) that can be included if an appropriate method of spatially disaggregating the total downstream response characteristics can be used. While there may be several alternative approaches, the current study adopted an iterative process. Two gauging stations (in the north of the Congo River Basin), which represent eleven sub-basins, were selected and used to derive estimates of the sub-basin hydrologic indices. None of these have any identified upstream anthropogenic impacts, and none are influenced by wetland effects. The iterative process essentially involves using the initial regression relationships (developed from the gauged headwater sub-basins) to provide initial estimates for the 11 additional sub-basins. This step ensures that sub-basin relative differences in response are consistent with the initial relationships. The flow percentile indices are converted to absolute values (i.e. not as fractions of MMQ) and are summed to give cumulative values at the gauging station. All of the cumulative index values are compared to the observed gauging station values, and correction factors are determined for each index. These correction factors are then applied to the sub-basin initial estimates, and the FDC indices are converted back to fractional values by dividing by MMQ. Clearly, the indices for the additional 11 data points are less certain than those derived from the gauged headwaters. However, the approach is justified on the basis of the very limited number of gauged headwater sub-basins and at least allows some of the response characteristics of the two larger gauged catchments to be included in a second round of regression analysis.
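The correction step described above can be sketched as a single pass: convert the fractional indices to absolute values, sum them at the gauge, derive one correction factor per index, and convert back. The data structure, function name and numbers below are hypothetical, and the sketch does not reproduce any further iteration the study may have applied.

```python
INDICES = ("Q10", "Q50", "Q90")


def disaggregate(subbasins, gauge):
    """Rescale initial sub-basin estimates so their sums match the gauge.

    subbasins: list of dicts with absolute "MMQ" and fractional Q10, Q50, Q90.
    gauge: dict with observed absolute "MMQ", "Q10", "Q50" and "Q90".
    """
    # Fractions of MMQ -> absolute values for each sub-basin.
    absolute = [
        {"MMQ": sb["MMQ"], **{k: sb[k] * sb["MMQ"] for k in INDICES}}
        for sb in subbasins
    ]
    # One correction factor per index: observed value / summed initial estimate.
    factors = {
        k: gauge[k] / sum(a[k] for a in absolute) for k in ("MMQ",) + INDICES
    }
    corrected = []
    for a in absolute:
        mmq = a["MMQ"] * factors["MMQ"]
        # Apply the factor, then express the index again as a fraction of MMQ.
        corrected.append(
            {"MMQ": mmq, **{k: a[k] * factors[k] / mmq for k in INDICES}}
        )
    return corrected, factors


# Hypothetical initial estimates for two sub-basins upstream of one gauge.
initial = [
    {"MMQ": 100.0, "Q10": 2.0, "Q50": 0.8, "Q90": 0.3},
    {"MMQ": 300.0, "Q10": 1.5, "Q50": 0.9, "Q90": 0.4},
]
observed = {"MMQ": 440.0, "Q10": 780.0, "Q50": 350.0, "Q90": 120.0}
corrected, factors = disaggregate(initial, observed)
print(factors)
print(corrected)
```

Because every sub-basin receives the same factor for a given index, the relative differences between sub-basins implied by the initial regression estimates are preserved.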

Validating the uncertainty ranges and assessing the uncertainty
Two types of independent information on hydrologic indices were used to validate the derived ranges of indices. The first source is made up of published runoff ratios for specific areas across the basin (Snel, 1957; Laraque et al., 1998). The aridity index values of these areas were used to check whether the runoff ratio could fit within the computed ranges. The second source of information is based on the cumulative flow time series at downstream gauged stations that have no substantial attenuation effects. The hydrologic indices were computed at these gauges and plotted against the area-weighted values of the sub-basin aridity indices. One of the major sources of uncertainty in the quantified hydrologic indices involves the input rainfall data. It has been acknowledged that the reliability of a rainfall dataset is mainly limited by the number and the spatial coverage of surface stations (Sun et al., 2018), and the CRU rainfall dataset used in this study is based on a very limited amount of observed rainfall data. Previous studies reported on the lack of agreement between different interpolated rainfall datasets (Sun et al., 2018). The consistency of the CRU and the UNIDEL (Sun et al., 2018) datasets is checked by computing ratios of UNIDEL to CRU mean annual rainfall. Regions where the computed ratio is high (e.g. > 1.2) were identified as potential areas of high uncertainty where the uncertainty ranges of the indices would need further refinement.
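Two of the checks described above are simple enough to sketch: the area-weighted aridity index for the sub-basins upstream of a gauge, and the flagging of areas where the UNIDEL to CRU rainfall ratio exceeds a threshold (1.2 is the example value given in the text). The function names, region identifiers and numbers are hypothetical.

```python
def area_weighted_mean(values, areas_km2):
    """Area-weighted mean of sub-basin values upstream of a gauge."""
    total_area = sum(areas_km2)
    if total_area <= 0:
        raise ValueError("total area must be positive")
    return sum(v * a for v, a in zip(values, areas_km2)) / total_area


def flag_inconsistent_rainfall(cru_mm, unidel_mm, threshold=1.2):
    """Return {region: ratio} where UNIDEL / CRU mean annual rainfall > threshold."""
    flagged = {}
    for region, cru in cru_mm.items():
        ratio = unidel_mm[region] / cru
        if ratio > threshold:
            flagged[region] = ratio
    return flagged


# Hypothetical sub-basin aridity indices and areas upstream of one gauge.
print(area_weighted_mean([0.6, 1.0], [300.0, 100.0]))

# Hypothetical mean annual rainfall (mm) from the two datasets.
cru = {"A": 1500.0, "B": 1000.0}
unidel = {"A": 1550.0, "B": 1300.0}
print(flag_inconsistent_rainfall(cru, unidel))
```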

Extended streamflow series
It was important to obtain common record periods of streamflow time series in order to ensure the representativeness of the computed hydrologic indices and to minimize the differential effects of the number of wet and dry periods represented in short record periods. Table 3 lists the original and extended record periods, as well as the source gauging stations used for the record extension. Fig. 3a shows a plot of two streamflow series at one gauging station (L_CB261) located in the upper Lualaba sub-basin and illustrates the reliability of the approach by comparing the observed flows with the substitute time series (i.e. all months estimated from the source gauge). The general pattern of the observed streamflow series is well reproduced and therefore the extended and infilled records should be adequately reliable. Similar results were obtained for all the gauging stations, and their Nash coefficients of efficiency, based on comparisons between the observed and substitute flows, are shown in Fig. 3b. For the majority of gauging stations the Nash coefficient of efficiency is above 0.6 for both low and high flows. The final hydrologic indices were derived from the patched streamflow series (i.e. a combination of observed, patched and extended data).

Notes to Table 3:
* Gauging stations with long record periods that needed no extension.
# Gauging stations whose record period was not extended because no donor gauge was available in the vicinity.
@ New gauging stations added for the spatial disaggregation procedure (see Section 3.4).

Classification by climate and physiography
The classification of the 403 sub-basins of the Congo River Basin was achieved by training a self-organizing map (SOM). Out of the seven attributes considered for the classification, only five were retained; the TWI and Sand attributes were removed because they were highly correlated with Slope and Clay, respectively. Fig. 4a displays the quantization error (QE) obtained with different map sizes. The results show that the larger the map size (number of nodes), the smaller the quantization error, which expresses how adequately the input vectors are represented by a specific node. Therefore, a map size of 2 000 nodes (QE = 0.0003) was judged appropriate for representing the dataset and was used for the classification. Different numbers of clusters were also tested in order to obtain the number that maximizes the within-group similarity and the between-group dissimilarity. The application of an independent quality measure (ANOSIM statistic R: Fig. 4b) suggests that six clusters are appropriate for the Congo River Basin. These homogeneous groups are significantly different, with a global R of 0.7 at a p-value of 0.001 %. The R statistic increases with an increase in the number of clusters, but this increase is associated with low dissimilarity (0.25 < R < 0.48) between some clusters when more than six clusters are used. In contrast, with six clusters all the between-group dissimilarity values are above 0.53. Table 4 displays the dissimilarity matrix of the six homogeneous groups, where high values of R imply high dissimilarity between groups. Overall, the average squared distances within groups were far smaller than the average squared distances between groups.
The spatial distribution of the six climate and physiographic regions resulting from the classification is shown in Fig. 5. The groups are coherent and preserve a high degree of spatial proximity. Approximately 28 % of sub-basins are assigned to Region 1, 19.4 % to each of Regions 2 and 3, 14.6 % to Region 4, 11.6 % to Region 5 and 7.2 % to Region 6. Fig. 6 illustrates the spatial relationships between the climate and physiographic attributes across the six regions, while Table 5 provides descriptions of each homogeneous region. Region 1 has the highest number of gauged headwater sub-basins, while Region 6 has none.

Classification by hydrologic behavior
Based on the distribution of gauged sub-basins, two independent classifications, each having five groups, were performed using a similar approach to that described in Section 3.2. The comparison between the physiographic classification and the five classifications based on hydrologic behavior is shown in Fig. 7. Only the three fractions of the FDC indices and the runoff ratio were used to represent sub-basin response behavior, while five climate and physiographic attributes represented sub-basin physical properties. Overall, there is a high degree of affinity (Rand index = 73 %) between the physiographic classification and the hydrologic classification when all indices of hydrologic behavior are used in the classification. However, the highest affinity (Rand index = 82 %) is achieved when only the Q50/MMQ index is used. This suggests that the climate and physiographic attributes used in this classification are able to capture different components of the sub-basin's hydrologic response to different degrees. Due to the limited number of gauged headwater sub-basins, some clusters are made up of fewer than three sub-basins (Table S1 in supplementary information). The implication of this result is that predictive equations for the hydrologic indices cannot be developed in most of the clusters, because there are too few sample points. For those clusters where there are 4 representative gauges, Table 6 illustrates that there are potentially strong relationships between the physiographic/climate attributes and the hydrologic indices. However, the shape of the relationships is regionally variable and, because there are too few gauging stations to represent some clusters (Fig. 8), the relationships cannot be extrapolated to all the sub-basins in the Congo. In contrast, when all the sub-basins are included in the regression analysis, the aridity index is revealed as the best single predictor of hydrologic behavior (Table 7), with CN second. The other attributes do not appear to offer any additional predictive value (Table 7).

Aridity index as a control on hydrologic indices
The spatial pattern of the aridity index (Fig. 9) across the Congo River Basin exhibits some degree of similarity with the six homogeneous regions (Fig. 5). The concentric pattern indicates that the aridity index increases from the Cuvette Centrale towards the headwater tributaries located in the northern, eastern and southern parts of the basin.
Fig. 10a shows the plot of the runoff ratio (RR) as a function of the aridity index (AI). Overall, a high aridity index results in a low runoff ratio and vice versa. For instance, the majority of gauging stations representing sub-basins located in the southeast and north of the Congo River Basin (Region 1) show high values of aridity index, indicating a low runoff ratio potential (RR < 0.25). Under humid conditions (AI < 0.75), the pattern of RR differs from that of Region 1. With very limited information in Region 4, we cannot confirm the apparent increase of RR with the increase of AI. In general, while two discernible relationships could have been established between the AI and the runoff ratio, it would have been difficult to determine the basis on which ungauged sub-basins could have been assigned to either relationship, given that observed gauging stations of the same region would be on different regression lines. Therefore, a single regression relationship (R² = 0.63) is derived between the aridity index and the runoff ratio. A positive power relationship (R² = 0.84) between the aridity index and the Q10/MMQ is shown in Fig. 10b. Sub-basins of Region 4 are located at the bottom end of the regression line, indicating a slow to moderate response to rainfall inputs. In contrast, sub-basins of Region 1 are spread throughout the regression line. The Q50/MMQ index (Fig. 10c) shows a positive power relationship (R² = 0.81) with the aridity index and follows a similar trend to the runoff ratio, where high values of aridity index were associated with low indices. The Q90/MMQ index also exhibits a positive power relationship (Fig. 10d) but with a higher degree of scatter (R² = 0.725). A comparison of Fig. 10b and d suggests that Regions 2 and 4 have regimes with low variability, while Region 3 is more variable and Region 1 is represented by flow regimes of different degrees of variability. Without more data points it is difficult to predict whether areas with higher aridity indices would have very low Q90/MMQ indices and possibly zero flows for some of the time. This could depend on the spatial scale of the sub-basins, in that aggregation of contributions from different parts of a large sub-basin, coupled with the effects of flow routing, suggests that zero flow conditions are unlikely. Small sub-basins might experience zero flows, but the majority of the sub-basins used in the current modeling units (Tshimanga et al., 2018) are greater than 2 000 km², limiting the possibility of zero flow at a monthly time scale.
Through the application of the spatial disaggregation procedure (Section 3.4), estimates of the hydrologic index values are available for an additional 11 sub-basins (Table S2 in supplementary information). While these values are more uncertain than those based on the gauged headwater sub-basins, they have been included to expand the data set before calculating regression line confidence intervals to quantify the uncertainty ranges of the hydrologic indices.

Uncertainty ranges of hydrologic indices
Fig. 11 shows the 90 % confidence intervals based on the updated regression relationships after the inclusion of the data for the additional 11 sub-basins disaggregated from downstream gauging station data. A comparison between Figs. 10 and 11 suggests that the shape of the relationships has hardly changed, while most of the R² values have slightly decreased and therefore the final [...] Given that the available data to quantify the hydrologic indices are very limited, it is impossible to make any firm a priori statements about how representative these uncertainty relationships are for all of the sub-basins of the Congo River Basin. However, Ndzabandzaba and Hughes (2017) proposed that checks can only be made by using the uncertainty bounds to constrain the individual sub-basin responses in a hydrological model and assess the outputs against observed data representing aggregated responses downstream. Some limited independent estimates of runoff ratio (Fig. 12) suggest that many of the reported runoff ratios fit within the computed bounds. Notable exceptions are some estimates for the Batéké plateau system in the western Congo River Basin (Region 5), and for the rift valley region (eastern Congo River Basin). There are quite large uncertainties in the rainfall data for these areas, but even if a different rainfall data set is used (UNIDEL), the estimated runoff ratios remain well outside the computed uncertainty bounds. In contrast, the independent estimates in regions of flat to undulating topography (0.5-5 %) (Cuvette Centrale, Northeast Congo, upper Lualaba and Southeast Kasai) generally fall within the uncertainty bounds regardless of which rainfall data set is used. The average runoff ratio (0.24) for the whole Congo River Basin (Laraque and Olivry, 1996) also falls within the uncertainty bounds.
Table 5 (fragment): Characterized by high values of aridity index (AI), medium content of clay and flat to undulating topography (slope). These three attributes account for more than 75 % of the within-group similarity. Sub-basins are mostly located in the south-eastern and northern parts of the Congo River Basin. The presence of high values of curve number suggests the potential of the sub-basins to be dominated by surface hydrological processes, making them prone to flash flooding.
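The confidence intervals around a power regression can be sketched as follows. This is a minimal illustration, assuming the power relationship y = a * AI^b is fitted by ordinary least squares on log-transformed values and that the band is the standard confidence interval of the fitted line. The exact fitting procedure of the study is not restated here, and the data, the function names and the tabulated Student t value are hypothetical inputs chosen for the example.

```python
import math


def fit_power_law(x, y):
    """Fit y = a * x**b by least squares on log-transformed data."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx
    ln_a = my - b * mx
    # Residual standard error with n - 2 degrees of freedom.
    sse = sum((v - (ln_a + b * u)) ** 2 for u, v in zip(lx, ly))
    s = math.sqrt(sse / (n - 2))
    return {"ln_a": ln_a, "b": b, "s": s, "n": n, "mx": mx, "sxx": sxx}


def confidence_band(fit, x0, t_crit):
    """Return (lower, fitted, upper) for the regression line at x0."""
    lx0 = math.log(x0)
    centre = fit["ln_a"] + fit["b"] * lx0
    half = t_crit * fit["s"] * math.sqrt(
        1.0 / fit["n"] + (lx0 - fit["mx"]) ** 2 / fit["sxx"]
    )
    # Back-transform from log space to the original units.
    return math.exp(centre - half), math.exp(centre), math.exp(centre + half)


def within_bounds(value, lower, upper):
    """Check whether an independent estimate falls inside the band."""
    return lower <= value <= upper


# Hypothetical aridity index and runoff ratio pairs (not study data).
ai = [0.7, 0.8, 0.9, 1.0, 1.2, 1.5]
rr = [0.42, 0.35, 0.27, 0.22, 0.17, 0.11]
fit = fit_power_law(ai, rr)

# Two-sided 90 % interval with n - 2 = 4 degrees of freedom: t = 2.132.
T_CRIT_90_DF4 = 2.132
lower, fitted, upper = confidence_band(fit, 1.1, T_CRIT_90_DF4)
```

An independent runoff ratio estimate for a sub-basin with an aridity index of 1.1 could then be compared with the band using `within_bounds(estimate, lower, upper)`, which mirrors the kind of check shown in Fig. 12.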

Discussion and conclusion
It is common practice in hydrology to use catchment classification as a means of extending hydrologic information from gauged to ungauged sub-basins. This procedure requires each region to have a predictive equation of hydrologic response based on potential climate and physiographic predictors (Yadav et al., 2007; Kapangaziwiri et al., 2012). The success of this approach largely depends on the number of gauged sub-basins with sufficiently long records. The classification of the 403 sub-basins of the Congo River (Fig. 5 and Table 5) demonstrated that the climate and physiographic attributes used in this study can identify relatively homogeneous regions, suggesting that there is a potential to interpret hydrologic similarity based on similarity in climate and physiography (Oudin et al., 2010; Ley et al., 2011). Furthermore, the hydrologic classification based on the 15 gauged sub-basins showed that the aridity index, surface slope, curve number, and silt and clay contents are potential predictors of hydrologic response indices. However, the number of gauged sub-basins (even including the disaggregated data for a few larger gauged sub-basins) is not sufficient to develop individual predictive relationships between the hydrologic indices and the climate and/or physiographic attributes for each of the identified regions. Some of the clusters (identified in the hydrologic classification: Section 4.3) are represented by fewer than three sample gauged sub-basins. This is part of the challenge of data scarcity in many parts of the world, including the Congo River Basin. Fortunately, the alternative approach of developing generic predictive equations for the whole basin generated results with acceptable levels of uncertainty, as measured by the width of the confidence intervals around the regression relationships. While several approaches to developing these equations were assessed using combinations of the climate and physiographic attributes (informed by the results of the hydrologic classification), it transpired that the aridity index produced the best results for all of the [...]
A comparison between the patterns of climate/physiographic regions (Fig. 5) and the aridity index groups (Fig. 9) shows some distinct similarities. The areas where there are fewer similarities are mainly within or across those regions that are less distinguishable (based on the ANOSIM statistics in Table 4) from other regions. Regions 1, 2 and 3 have low ANOSIM statistics and also cover a wide range of aridity groups (2-6), while Regions 4 and 5 have the highest ANOSIM statistics and all the sub-basins have generally low aridity values. Similarly, Region 6 is generally distinguishable from the other regions and most sub-areas fall within aridity groups 3 and 4. The conclusion is that, although the original climate/physiographic regionalization results are not used in the developed predictive equations for the hydrologic indices, the spatial patterns of variability in hydrologic response are not too dissimilar to the originally identified regions.
The fact that the aridity index emerges as the best available predictor of hydrologic indices (Table 7) in the Congo River Basin is perhaps not surprising, as the same index has been used successfully in other parts of southern Africa (Tumbo and Hughes, 2015; Ndzabandzaba and Hughes, 2017) and elsewhere (Zhang et al., 2018). Zhang et al. (2018) found that the aridity index was one of the most influential attributes and was well correlated with mean discharge as well as the 10th and 50th percentiles of the flow duration curve. Similarly, Ndzabandzaba and Hughes (2017) determined relationships between the aridity index and the runoff ratio; however, they also found distinct regional differences across Eswatini (Swaziland) that are less evident in the Congo River Basin. Part of this difference may be related to the substantial topographic and climate variations across the small country of Eswatini, while the much larger number of sample points (based on previous simulations) used by Ndzabandzaba and Hughes (2017) could also play a major role. Tumbo and Hughes (2015) also found regional differences in the link between aridity and hydrologic response indices, but they were not able to quantify regression relationships, and their final result was based on simple index ranges for each identified region in the Great Ruaha River basin of Tanzania.
The main limitation for extending the predictive equations of the hydrologic indices to ungauged sub-basins is related to the spatial representativeness of the observed streamflow gauging stations. The majority of the final sub-basins used to develop the uncertainty ranges are found in Region 1, even after the inclusion of the disaggregated gauging station data. While the developed uncertainty ranges of hydrologic indices can be applied with high confidence in sub-basins representing the climate and physiographic properties of Region 1 (upper Lualaba and northern Oubangui), less confidence can be ascribed to their application in the other regions, where some aspects of the hydrologic behavior may not have been captured. The fact that the range of the aridity index values (0.66-1.59) used to develop the predictive equations represents most of the climate variability across all 403 sub-basins of the Congo River Basin suggests that the approach may be quite robust in representing the diverse climate conditions within the basin. However, Fig. 12 shows examples where independent estimates of runoff ratio for some steep sub-basins fall well outside the uncertainty range. While these may be isolated outliers, the lack of sufficient data seriously constrains any attempts to further validate the applicability of the relationships and the ranges of uncertainty across the whole basin. Gnann et al. (2019) have shown that the variability of low flow in humid sub-basins of the United Kingdom and the United States could not be primarily attributed to the aridity index and that the aridity index is the key determinant of low flows only in arid regions. The Congo data tend to support this conclusion in that the uncertainty range for the Q90/MMQ index is quite large and there is no real trend in the values for Region 4 (Fig. 9d). According to Laraque et al. (1998), the hydrologic response of the sub-group (Batéké plateaux) of Region 4 sub-basins is characterized by very little seasonal variation between the low and high flows, indicating the presence of a high storage capacity groundwater system that contributes to the regulation of flows. Physical attributes describing the geological settings might have been informative in terms of including the role of sub-surface processes, but a consistent database of groundwater characteristics for the whole Congo River Basin is not yet available.
Inevitably, the developed uncertainty ranges of the hydrologic indices account for several different sources of uncertainty. These uncertainties could be due to the uncertainty in rainfall (Maidment et al., 2015; Sun et al., 2018) and evapotranspiration estimates, the length of the streamflow records, the number of the gauging stations used and their spatial distribution, the percentage of missing data, the reliability of rating curves (Kiang et al., 2018) used to convert raw stage data into streamflows, and stage observational errors. Maidment et al. (2015) found considerable differences in trend sign and magnitude (−10 and +39 mm yr⁻¹ per decade) between different sets of global rainfall over Central Africa. They reported that the spurious negative trends identified in the CRU rainfall dataset were due to the decline in rainfall gauge density across Central Africa, including the Congo River Basin. Our results have shown that substantial differences between global interpolated rainfall datasets (CRU and UNIDEL) are observed in the Batéké plateau and rift valley sub-regions located in the western and eastern parts of the Congo River Basin, respectively. It is shown that these differences were more pronounced in steep topography (Section 4.5) and constitute a common problem facing interpolated global rainfall datasets, especially in complex mountain areas, due to the limited number and spatial coverage of surface stations as well as the types of algorithms and data assimilation models used to generate the interpolated rainfall data (Sun et al., 2018). This type of uncertainty clearly affects both the aridity index and the runoff ratio.
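To see why rainfall uncertainty propagates into both variables, consider the following minimal sketch. It uses the aridity index defined as PE/P and assumes that the runoff ratio is the mean annual runoff depth divided by the mean annual rainfall. All numerical values and data set labels are hypothetical and are not taken from CRU, UNIDEL or the gauged records.

```python
def aridity_index(pe_mm, p_mm):
    """Aridity index: potential evapotranspiration over rainfall (PE/P)."""
    return pe_mm / p_mm


def runoff_ratio(q_mm, p_mm):
    """Runoff ratio: runoff depth over rainfall (assumed definition)."""
    return q_mm / p_mm


# Hypothetical annual values for one sub-basin, in mm per year.
PE_MM = 1150.0   # potential evapotranspiration
Q_MM = 300.0     # observed runoff depth

# Two hypothetical rainfall estimates for the same sub-basin.
rainfall_mm = {"dataset_A": 1500.0, "dataset_B": 1800.0}

# Each entry holds (aridity index, runoff ratio) for one rainfall estimate.
results = {
    name: (aridity_index(PE_MM, p), runoff_ratio(Q_MM, p))
    for name, p in rainfall_mm.items()
}
```

Because the same rainfall value appears in the denominator of both indices, switching rainfall data sets moves a sub-basin along both axes of the regression plot at once.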
Translating the uncertainty in instantaneous discharge observations into potential errors in monthly streamflow volumes is more difficult, particularly if the raw stage data and rating curves are not accessible, as is often the case in many countries of southern Africa. In the Congo River Basin, the lower and upper bounds of these errors for the gauging stations in the southern part of the basin (Kasai and Lualaba drainage systems) were estimated to be between −12 % and +29 % (Charlier, 1955; Lempicka, 1971). The average uncertainties obtained for the Q10/MMQ and Q50/MMQ indices (38 % and 32 %, respectively) are less than this total range of 41 % (the span from −12 % to +29 %). These results are consistent with previous studies on the uncertainty in hydrologic signatures (Westerberg and McMillan, 2015). Westerberg et al. (2016) reported that the uncertainty in hydrologic signatures varied with signature type, with the highest uncertainties (±30-40 %) found in high and low flow characteristics due to the uncertainty in the observed discharge and the regionalization procedures.
In the northern part of the Congo River Basin (Sangha and Oubangui drainage systems), previous studies (Laraque and Olivry, 1996; Mahé, 1995) noted a decrease in streamflow from the main tributaries of the right bank of the Congo River for the period of 1953-1993. An average 28.5 % decrease was observed in the Oubangui drainage system, 15 % in Sangha, 11 % in the Cuvette Centrale and very little change in the Batéké plateau system. Therefore, any data records that only fall within this period would be expected to generate lower mean monthly flow (MMQ) indices than would be appropriate for a longer simulation period, thus adding further levels of uncertainty. The presence of any extreme flows over the record periods, as is the case for the Congo River in the 1960s, would also affect the MMQ indices. The majority of the gauging stations used in this study have short record periods around the 1960s. However, it is argued that the extension and infilling of missing data (Section 4.1) has at least partially overcome these effects. The uncertainty ranges of the hydrologic indices presented in this study are not affected by wetland and channel routing effects because any gauging stations located below wetlands were not represented in the 26 sub-basins used in the development of these uncertainty bounds.
The overall conclusion is that the developed relationships (and uncertainty bounds) between aridity index and the hydrologic indices are appropriate for constraining sub-basin hydrologic simulations for the whole of the Congo River Basin, with the likely exception of some areas of very steep topography on the eastern borders of the basin and the Batéké plateaux. The ultimate test of these relationships and their uncertainty bounds will be to assess the results of the constrained simulations at downstream gauging stations which have not been used in their development. It is likely that they will not work in areas downstream of the identified steep sub-basins and therefore will need to be re-calibrated for those areas.

Fig. 1. Presentation of the study area showing the 403 sub-basins of the Congo River Basin, the spatial distribution of the 58 gauging stations available across the basin and the headwater gauging stations used for the quantification of hydrologic indices.

Fig. 2. Flowchart of the methodological framework including catchment classification using SOMs, hydrologic indices quantification and uncertainty ranges.

Fig. 3. Results of the pre-processing analysis of flow data. (a) Graphical comparison between the original flow series and the substitute series (at L_CB261 gauging station). The latter is defined as estimated values of flow records that overlap with the original series. (b) Nash coefficients of efficiency for low (CE ln) and high (CE) flows showing how well the original series are reproduced during the extension procedure of the flow record period.

Fig. 4. Optimal size of the trained maps through SOM and the selection of the optimal number of clusters. (a) Asymptotic decrease of the quantization error, showing the lowest error achieved with 2000 nodes. (b) The evolution of the ANOSIM statistic R with the number of clusters. The dashed line shows the optimal number of clusters with a global R statistic of 0.7.

Fig. 5. Spatial distribution of the six homogeneous regions of sub-basins of similar climate and physiographic properties in the Congo River Basin identified from the application of SOM.

Fig. 6. Non-linear relationships between climate and physiographic attributes (Aridity index, Clay, Silt, Curve number and Slope) across the six regions of the Congo River Basin obtained from the application of SOM. The red colour represents high values of the attributes and blue low values.
texture with high content of clay resulting in a decrease of silt content with medium to high values of curve number. These attributes contribute more than 85 % to the overall similarity within the region. These conditions suggest the potential of the region for limited infiltration rates while maintaining an appropriate level of humidity (AI < 1) on flat to undulating topography. The majority of the sub-basins are located in the northeastern part of the Cuvette Centrale, while the others are specifically located in the Sangha drainage system.
3 78 1 Mostly dominated by high clay content, high slope and low aridity, which together account for more than 85 % of the within-region similarity. In contrast to region 2, this region is characterized by high silt content and high slope, suggesting the dominance of sub-surface processes, particularly interflow. The sub-basins are mostly located in the eastern mountainous region of the Congo, known as the rift valley. Similar conditions are found in the south of the Cuvette Centrale and the lower Congo River before exiting to the Atlantic Ocean.
4 59 4 Represents most of the sub-basins located in the Cuvette Centrale, the central part of the Congo River Basin. More than 80 % of the within-group similarity is controlled by CN, Slope and Silt. Low values of curve number suggest high infiltration rates, resulting in the predominance of subsurface processes over surface processes. The climate is humid, with the lowest values of aridity index, and the topography is flat to undulating. These conditions portray the prevalence of accumulation processes of the eroded materials coming from all the upstream tributaries, thus favouring factors that contribute to the formation of wetlands and channels with a high degree of sinuosity and braiding.
5 47 1 Clay, Slope and Silt represent more than 70 % of the within-group similarity. The soil characteristics (low clay and silt) imply the dominance of infiltration processes, resulting in high storage capacity. Sub-basins of the Batéké plateau system, located in the western part of the Congo River Basin, are found in this group and are mostly characterized by v-shaped valleys with deep soils, suggesting the presence of groundwater aquifer systems with high storage capacity. However, a humid climate (AI < 1) on undulating to steep topography characterizes this region.
6 29 0 Almost the same characteristics as region 5, the difference being that region 6 has an arid climate (0.94 < AI < 1.3) and flat to undulating topography. The sub-basins are located in the southern part of the Kasai drainage system. Clay, Silt and AI account for more than 75 % of the within-group similarity.

Fig. 8. Potential regression relationships between the climate/physiographic attributes and the hydrologic indices within clusters formed from the 15 gauged headwater sub-basins. (a) Runoff ratio seems to develop a relationship with Clay in cluster 1, (b) Q10/MMQ index seems to have a relationship with AI in cluster 2, (c) Q50/MMQ index seems to have a relationship with Curve number in cluster 1 and (d) Q90/MMQ index seems to develop relationships with Clay in clusters 2 and 3.

Fig. 9. Spatial pattern of the aridity index (PE/P) across the Congo River Basin showing a concentric pattern.The aridity index increases from the Cuvette Centrale towards north, east and south headwater tributaries of the Congo River Basin.

Fig. 10. Power regression relationships between the aridity index and the hydrologic indices across headwater gauged sub-basins of the Congo River Basin. (a) Runoff ratio, (b) Q10/MMQ index, (c) Q50/MMQ index and (d) Q90/MMQ index. Regions refer to those obtained through the application of SOM to all 403 sub-basins of the Congo based on climate and physiographic attributes.

Fig. 11. Final uncertainty ranges of hydrologic indices derived based on the aridity index for all sub-basins of the Congo River Basin. (a) Runoff ratio index, (b) Q10/MMQ index, (c) Q50/MMQ index and (d) Q90/MMQ index. Regions refer to those derived from the physiographic classification of the 403 sub-basins of the Congo River Basin. Region 6 does not appear among the plotted indices because of the lack of gauged headwater sub-basins.

Fig. 12. Validation of the uncertainty ranges of the runoff ratio index across the Congo River Basin. The average runoff ratio observed over the entire Congo River Basin at the Kinshasa gauging station fits within the computed bounds. Gauging stations located in the Batéké plateau system (C_CB169) and rift valley system (L_CB191) are out of the computed bounds regardless of which rainfall data set is used.

Table 1
Headwater gauging stations used for the quantification of hydrologic indices in the Congo River Basin.

Table 2
Description of the climate and physiographic attributes used for the classification of the 403 sub-basins of the Congo basin.

Table 3
Extended flow record periods of gauging stations used for the derivation of hydrologic indices in the Congo River Basin.

Table 4
Dissimilarity matrix (ANOSIM statistic R) of the six climate and physiographic regions showing that all regions are significantly different at p = 0.001 %

Table 5
Description of the six homogeneous regions of the Congo River Basin obtained from the application of SOM.

Table 6
Highest R² of the relationship observed between physiographic attributes and hydrologic indices within clusters/groups having 4 sub-basins formed from the 15 gauged headwater sub-basins. Correlation coefficients are given in brackets, and the highest R² for each hydrologic index is shown in bold.

Table 7
Coefficients of determination of the power regression relationships between hydrologic indices and physiographic attributes across the 15 gauging stations. AI and CN are potential predictors.