A priori selection of hydrological model structures in modular modelling frameworks: application to Great Britain

ABSTRACT Multi-model studies are widespread in large-sample hydrology. However, significant challenges remain in identifying interpretable connections between high-performing model structures and catchment characteristics, and thus in developing a coherent strategy for developing tailored multi-model ensembles. Here, we assess the importance of selecting model structures that are consistent with the expected hydrological variability across the study domain. We compare results of two modular modelling frameworks across 998 catchments in Great Britain. The RRMT framework includes model structures historically evolved in the UK, while the FUSE framework employs model structures from diverse global origins. While both groups of model structures contain high-performing members, the historically evolved group members separate between catchments in line with our expectation of hydrologic differences. We find that four hydrologic signatures organize these distinctions. Our results emphasize (1) the importance of model structure selection based on explicit perceptual models, and (2) the need to look beyond statistical performance alone.


Introduction
Modular rainfall-runoff modelling frameworks have been widely used to provide a more flexible approach to modelling diverse catchments. These frameworks consist of model structures or model structural elements that can be combined in various ways to represent different dominant hydrological processes (often at the catchment scale). Some of the more widely used frameworks include the Modular Modeling System (MMS) (Leavesley et al. 1996), the Rainfall-Runoff Modelling Toolbox (RRMT) (Wagener et al. 2001), the Framework for Understanding Structural Errors (FUSE) (Clark et al. 2008), the Catchment Modelling Framework (CMF) (Kraft et al. 2011), SUPERFLEX (Fenicia et al. 2011, Kavetski and Fenicia 2011), the Structure for Unifying Multiple Modelling Alternatives (SUMMA) (Clark et al. 2015a, 2015b), the Eco-hydrological Simulation Environment (ECHSE) (Kneis 2015), the Dynamic fluxEs and ConnectIvity for Predictions of HydRology framework (DECIPHeR) (Coxon et al. 2019), the Nonstationary Rainfall-Runoff Toolbox (NRRT) (Sadegh et al. 2019), the Modular Assessment of Rainfall-Runoff Models Toolbox (MARRMoT) (Knoben et al. 2019), and RAVEN (Craig et al. 2020), among others. These frameworks vary in their spatial resolution, in the model structures and structural elements included, in the granularity of the components that make up the model structures, and in other ways such as the optimization or uncertainty quantification tools included.
An initial step in any modular modelling exercise is the selection of the model structures (or model structural components) to be considered, given that they should be both potential representations of the system(s) under study and appropriate for the modelling objective(s). We might assume that a framework is so flexible that it can reflect any system, but even then, some pre-selection might be helpful to avoid testing model structures that can already be considered a priori unsuitable (e.g. those based on process perceptions that are not present in the study domain), and thus to avoid getting the right result for the wrong reasons (e.g. Grayson et al. 1992, Kirchner 2006). Experience, embedded in a perceptual model(s) of the underlying system(s), is one way to identify the differences between systems such as catchments (Seibert and McDonnell 2002, Beven and Chappell 2021, Wagener et al. 2021, Fenicia and McDonnell 2022). One recurring problem in this context is that various multi-model studies found that there is no specific model that performs better than all others for a specific catchment, and that we might only find some basic trend that models with more parameters have more flexibility to fit rainfall-runoff relationships (e.g. Perrin et al. 2001, Kollat et al. 2012, Van Esse et al. 2013, Orth et al. 2015). However, even that conclusion is not always clear, and it is compounded by the adequacy of certain process representations (Knoben et al. 2020). As a consequence, there is often no clear relationship between catchment type and well-performing model structures (e.g. Nicolle et al. 2014, Ley et al. 2016, Knoben et al. 2020). One aspect that has so far been studied less extensively is that the actual choice of model structures included in such studies might be increasing the problem of equifinality in model performances, and thus that the problem can be reduced through better a priori selection.
Two of the above-mentioned modular modelling frameworks have so far been used in multiple studies in the UK: the RRMT (Wagener et al. 2001, Lee et al. 2005) and FUSE (Coxon et al. 2014, Lane et al. 2019) frameworks. The former framework was specifically developed for the UK, a region with mostly small catchments located in a temperate climate, while the latter includes a wider range of model components based on models used around the globe (more details are given in the methods section). Using RRMT, Lee et al. (2005) tested 12 model structures on 28 UK catchments. The authors could not find evidence for a relationship between catchment type and model structure, although they identified a subset of models that performed better than the rest. Lee et al. (2005) characterized catchment types using only descriptors that are available for both gauged and ungauged UK catchments (area, regionalized baseflow index and average rainfall), which have limited value in characterizing hydrological differences (Addor et al. 2018). Using the FUSE framework, Coxon et al. (2014) evaluated performances of 78 model structures across a different set of 24 catchments in England and Wales. They found that statistical model performance increased with catchment wetness and that only certain model structures provided good model performance in baseflow-dominated catchments. Moreover, they highlighted the possibility of identifying more informative signatures for better model identification (in gauged catchments). Lane et al. (2019) followed up on Coxon et al.'s study by selecting only four model structures from the FUSE framework but implementing them in over 1000 Great Britain (GB) catchments. The performance patterns of all four models across GB were very similar. They performed better (worse) in wetter (drier) catchments and were particularly poor in catchments with groundwater leakage to neighbouring catchments, which none of the model structures accounted for. Even though some performance differences exist between these model structures in different types of catchments, the authors were not able to explain them through model structure/complexity differences.
In this study, we aim to test how much distinguishing between model structures in multi-model studies is a function of which model structures are selected beforehand. In other words, we would like to investigate the importance of the a priori model selection step (i.e. deciding which model structures will be included in a multi-model study) for our ability to observe relevant performance differences between model structures. To do so, we compare the performance of two modular rainfall-runoff modelling frameworks (and the model structures they include) on 998 GB catchments in a Monte Carlo framework. For this comparison, we select six model structures from the RRMT (Wagener et al. 2001) and compare these to simulation results from the four FUSE model structures reported by Lane et al. (2019). We then attempt to explain the resulting differences between model structures using hydrological signatures as more informative descriptors than catchment properties. Lane et al. (2019) used model structures of similar complexity and found no distinguishable differences between their process representations that can easily be linked to our perceptions of different dominant hydrological processes across the UK. We hypothesize that a different a priori model selection will improve our ability to distinguish the performance of model structures across catchment types. We are thus trying to overcome two problems we were left with after the study by Lane et al. (2019): (1) the lack of well-performing model structures in leaky catchments; and (2) the lack of distinctive and hydrologically relevant performance differences between model structures and catchment types.
2 Data, modular modelling framework and methods

Data and hydrological signatures
In this paper, we analyse the same catchments across GB as those selected by Lane et al. (2019) to ensure the comparability of our results. These catchments were selected from the National River Flow Archive (Centre for Ecology and Hydrology 2020) based on the quality and availability of flow time series. They represent a diverse range of catchment characteristics in terms of topography, geology and climate, thus capturing much of the variability found across GB. In GB, mean annual precipitation decreases from northwest to southeast, with a range of 550-3600 mm/year. Conversely, mean annual potential evapotranspiration (PET) increases from northwest to southeast, with a range of 380-570 mm/year. Given that precipitation decreases and PET increases, runoff coefficients (defined as the average fraction of precipitation that leaves the catchment as streamflow) decrease from northwest (average of 0.8) to southeast (average of 0.3). Streamflow regimes are affected not only by climate but also by groundwater interactions beyond the limits of topographic catchments, with highly permeable aquifers present in the southeast leading to groundwater-dominated regimes. Streamflow in the southeast and the midlands of England is further influenced by human modifications such as abstractions, effluent discharges (i.e. effluent returns), urbanization and/or reservoirs. Very little to no snow is observed in most (>90%) catchments, i.e. no more than 5% of all precipitation falls as snow, resulting in snow fractions of no more than 0.05 (Coxon et al. 2020). Only 12 catchments in Scotland have snow fractions higher than 0.1, up to 0.17 (Coxon et al. 2020).
We use daily rainfall, streamflow, and PET time series for 21 years (1 January 1988-31 December 2008) to cover the same time period as that used by Lane et al. (2019). Daily rainfall and PET data are derived from the Centre for Ecology and Hydrology Gridded Estimates of Areal Rainfall (CEH-GEAR) (Tanguy et al. 2021) and the Climate Hydrology and Ecology Research Support System Potential Evapotranspiration (CHESS-PE) (Robinson et al. 2015a), respectively. Daily PET (mm/day) was calculated using the Penman-Monteith equation (Monteith 1965) for a well-watered grass surface (Allen et al. 1998) with meteorological data from the Climate Hydrology and Ecology Research Support System dataset (CHESS-met) (Robinson et al. 2015a, 2015b). Daily observed streamflow data from the National River Flow Archive (NRFA) are used to evaluate model performances (Centre for Ecology and Hydrology 2020).
To compare model structure performances across different catchment types, we organize the catchments based on four hydrological signatures that have been found helpful for distinguishing UK catchments in the past (Coxon et al. 2014, McMillan et al. 2022): baseflow index (BFI), runoff ratio (RR), deficit in water balance (dRR), and slope of flow duration curve (slope of FDC). We choose these hydrological signatures because they differ in the information they provide about the runoff processes of catchments. RR represents the proportion of precipitation becoming streamflow, while BFI represents the proportion of streamflow sourced from groundwater. dRR indicates whether catchments produce more or less runoff than expected based on climate only, while the slope of the FDC indicates whether catchments have more or less flashy (i.e. variable) flow regimes (e.g. Yadav et al. 2007). We further use streamflow-derived BFI values from the UK Hydrometric Register (Marsh and Hannaford 2008).
For each catchment, we calculate its long-term water balance (i.e. runoff ratio, RR = Q/P), aridity index (AI = PET/P) and slope of FDC (33rd-66th percentile; Sawicz et al. 2011) values using daily precipitation, P, potential evapotranspiration, PET, and streamflow, Q, data. We also calculate the expected water balance (expected RR) based only on climate using the Turc-Mezentsev curve. The Turc-Mezentsev curve is based on the widely studied Budyko framework (Budyko 1961). It provides reference conditions for energy and water limits on a catchment water balance. Catchments with water balances unimpacted by other natural or human controls beyond climate are expected to plot along the Budyko curve located between water limits (AET = P) and energy limits (AET = PET), where AET represents actual evapotranspiration. Similarly, Turc and Mezentsev linked the long-term average evaporation to long-term average precipitation (Mezentsev 1955, Turc 1955). Since measurements of actual evapotranspiration are not available at the catchment scale, we adjust the formula using 1 − (Q/P) as a response term instead of AET/P as used in the original formulation. This adjustment is also used in previous studies (e.g. Lebecherel et al. 2013, Gnann et al. 2019) and quantifies water loss or gain in a catchment by taking the difference (i.e. delta RR) between the actual (i.e. observed) and the expected runoff ratio values. The formula that we use to plot the Turc-Mezentsev curve is therefore $1 - Q/P = \left[1 + (PET/P)^{-n}\right]^{-1/n}$, where $n$ is the shape parameter of the curve. dRR is calculated by taking the difference between the observed runoff ratio and the expected runoff ratio derived from the Turc-Mezentsev curve.
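The signature calculations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the Turc-Mezentsev shape parameter `n = 2` is an assumed default (the text does not state the value used), and the FDC slope follows the log-flow form between the 33% and 66% exceedance points commonly attributed to Sawicz et al. (2011).

```python
import numpy as np

def runoff_ratio(q, p):
    """Long-term runoff ratio RR = Q / P from daily series (same units)."""
    return np.sum(q) / np.sum(p)

def aridity_index(pet, p):
    """Long-term aridity index AI = PET / P."""
    return np.sum(pet) / np.sum(p)

def expected_runoff_ratio(ai, n=2.0):
    """Expected RR from the Turc-Mezentsev curve.

    AET/P = [1 + AI^(-n)]^(-1/n), so expected RR = 1 - AET/P.
    n = 2 is an ASSUMED shape parameter, not a value taken from the paper.
    """
    return 1.0 - (1.0 + ai ** (-n)) ** (-1.0 / n)

def delta_runoff_ratio(q, p, pet, n=2.0):
    """dRR = observed RR - expected RR (negative values suggest water losses)."""
    return runoff_ratio(q, p) - expected_runoff_ratio(aridity_index(pet, p), n)

def fdc_slope(q, lo=33, hi=66):
    """Slope of the flow duration curve between two exceedance percentages.

    The flow exceeded lo% of the time is the (100 - lo)th percentile of flows.
    Flows must be strictly positive because logarithms are taken.
    """
    q_lo = np.percentile(q, 100 - lo)
    q_hi = np.percentile(q, 100 - hi)
    return (np.log(q_lo) - np.log(q_hi)) / ((hi - lo) / 100.0)
```

With this sign convention, a catchment that loses water to a regional aquifer gives a negative dRR, and a flashier regime gives a larger FDC slope.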

Modular modelling framework
In this study, we selected a minimal set of model structures that covered the range of dominant hydrological processes across our study domain of GB. We wanted to ensure that the model structures had different levels of complexity (i.e. the number of parameters) to represent these dominant hydrological processes and that model structural choices (such as the type of flow routing module) could be evaluated in isolation to demonstrate the impact of different model structural modules on model performance.
To achieve these goals, we first selected model structures from the RRMT (Fig. 1(a)) (Wagener et al. 2001). RRMT is a flexible modelling framework that allows the user to develop model structures with different complexity levels by combining soil moisture accounting and flow routing modules that are of low and medium complexity (Wagener et al. 2004). We selected six model structures from the toolbox, consisting of different combinations of two soil moisture accounting modules and three flow routing modules. The soil moisture accounting modules (PEN, i.e. a model based on the Penman drying curve, and PDM, i.e. the probability distributed model) are based on long-standing experience with UK catchments (see discussions in Wagener et al. 2004, Lee et al. 2005, Moore 2007) and capture key runoff generation processes across GB. The flow routing modules (CRES, i.e. a conceptual reservoir; 2PAR, i.e. two conceptual reservoirs in parallel; and LEAK, i.e. a leaky aquifer model structure) capture different levels of complexity, from one to two linear reservoirs and a leaky routing component, to reflect different flow pathways because of different soils and regional aquifers that occur across GB (Moore 2007). The model structural modules are explained in detail below, and the parameters of each module are listed in the Supplementary material, Table S1.
PEN is a parsimonious two-store structure based on an empirical drying curve concept developed from observed drying patterns in UK soils by Penman (1949). The upper and lower stores represent the root zone and an infinite soil reservoir, respectively. Analysing UK soils, Penman (1949) found that actual evapotranspiration occurs close to the potential rate whenever water is available in the root zone reservoir. The actual rate decreases to a very small percentage of the potential rate (8%) when the upper store is depleted. Effective rainfall (the part of the rainfall that contributes to runoff) is created in two ways: either as rainfall bypass, to represent processes such as rapid groundwater recharge or rainfall falling close to a river, or as saturation-excess runoff, which is produced when both stores are full. The model parameters define the size of the root zone storage, S_max1, and the fraction of bypass flow, φ.
PDM is the probability-distributed soil moisture accounting component, which represents the variability in soil moisture storage across a typical humid catchment using a distribution of storage depths (Moore 2007). Effective rainfall is produced as overflow from the stores, which are described as a Pareto distribution based on two parameters: the maximum storage capacity, C_max, and parameter b, describing the shape of the distribution.
There are three different routing components. Firstly, CRES is a single linear reservoir defined only by a time constant. Secondly, 2PAR is a combination of two linear reservoirs in parallel for routing, one representing fast flow and the other representing slow flow. The effective rainfall is distributed with respect to parameter a, describing the fraction of flow through the fast reservoir, while both reservoirs are defined by a time constant. Thirdly, LEAK is a leaky aquifer routing component, which allows the model to consider the situation when the water balance of a catchment is not closed. The flow from the bottom outlet represents leakage from the catchment, while the middle and upper outlets contribute to routing the effective rainfall.
Figure 1. [...] (2019). TOPMODEL and ARNO/VIC have 10 parameters, PRMS has 11 parameters and SACRAMENTO has 12 parameters. In the diagram, p, e, ER, S and Q represent precipitation, evaporation, effective rainfall, storage and outflow, respectively. In the PEN module, S_max1, S_max2, d and φ represent the size of the upper store (i.e. root constant), the size of the lower store, the initial deficit in the upper store and the bypass value, respectively. In the PDM module, C_max, b and c represent maximum storage capacity, degree of spatial variability and initial critical capacity, respectively. In the CRES module, T represents the residence time of the reservoir. In the 2PAR module, a, T_s and T_f represent the fraction of effective rainfall going through the fast reservoir and the residence times of the reservoirs for slow flow and fast flow, respectively. In the LEAK module, T_u, T_m, T_l, h_1 and h_2 represent the residence times of the upper, middle and lower parts, the lower threshold and the upper threshold, respectively. In the FUSE model structures, θ_wlt, θ_fld and θ_sat represent the soil moisture at wilting point, field capacity and saturation, respectively.
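To make the CRES and 2PAR routing concepts concrete, the following is a minimal sketch of a single linear reservoir and of two linear reservoirs in parallel. It assumes a simple explicit daily time step with residence times of at least one day; the function names are hypothetical and this is not the RRMT implementation. The LEAK module is not sketched because its threshold formulation is not given in the text.

```python
import numpy as np

def linear_reservoir(er, t_const, s0=0.0):
    """Route effective rainfall through one linear reservoir (CRES-like).

    Outflow is storage divided by the residence time t_const (in time steps).
    Explicit scheme: assumes t_const >= 1 so storage never becomes negative.
    """
    storage = s0
    q = np.empty(len(er))
    for i, inflow in enumerate(er):
        storage += inflow              # add this step's effective rainfall
        q[i] = storage / t_const       # outflow proportional to storage
        storage -= q[i]                # update storage after release
    return q

def parallel_reservoirs(er, a, t_fast, t_slow):
    """Two linear reservoirs in parallel (2PAR-like).

    A fraction a of the effective rainfall goes through the fast reservoir,
    the remaining (1 - a) through the slow reservoir.
    """
    er = np.asarray(er, dtype=float)
    return linear_reservoir(a * er, t_fast) + linear_reservoir((1.0 - a) * er, t_slow)
```

A single rainfall pulse routed through `parallel_reservoirs` gives a sharp early peak from the fast store followed by a long recession from the slow store, which is the behaviour that helps in baseflow-dominated catchments.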
The four model structures provided by the FUSE modelling framework (Clark et al. 2008) and used by Lane et al. (2019) are shown in Fig. 1(b). These model structures are based on four hydrological models, which are TOPMODEL (Beven and Kirkby 1979), the Variable Infiltration Capacity (ARNO/VIC) (Liang et al. 1994, Todini 1996), the Precipitation-Runoff Modelling System (PRMS) (Leavesley et al. 1983) and SACRAMENTO (Burnash et al. 1973). The details of model parameters are listed in the Supplementary material, Table S2. The modelling decisions are described by Lane et al. (2019, Table 3). Even though these models have similar complexity, their structures are different in terms of the structures of upper and lower soil layers and the parametrizations of water balance components such as evaporation, surface runoff, percolation, interflow and baseflow. Since only a small proportion of the catchments (1%) have a snow fraction higher than 0.1 and are likely to be snow impacted, no snow modules are used in any model structures selected from either the RRMT or FUSE framework.
Model equations in the FUSE framework are solved by an implicit version of the Newton-Raphson method (see appendix A in Clark et al. 2008). Equations in the soil moisture accounting and routing modules of the RRMT framework are first-order equations, which are solved in the MATLAB programming environment (Wagener et al. 2001). However, our focus is not the analysis of the relative performance between the two frameworks, but rather the differences between model structures within each framework. Each framework uses a consistent strategy throughout its model structures, which means that differences between model structures should be unrelated to the numerical implementation.

Set-up
To enable comparison with the results from Lane et al. (2019), we replicated the modelling set-up the authors employed. Consequently, 10 000 parameter sets for the six model structures in this study are independently and randomly sampled from uniform distributions. The first five years of the 21-year period are used as a warm-up.
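The sampling and warm-up steps above might look as follows in Python. This is a sketch under assumptions: the parameter names and bounds passed to the function are placeholders rather than the ranges in Table S1, and the evaluation period is taken to start on 1 January 1993, i.e. five years after the 1 January 1988 start of the record.

```python
import numpy as np

def sample_parameters(bounds, n_samples=10_000, seed=0):
    """Draw independent uniform samples for every parameter.

    bounds: dict mapping a parameter name to its (lower, upper) limits.
    Returns the parameter names and an (n_samples, n_params) array.
    """
    rng = np.random.default_rng(seed)
    names = list(bounds)
    lower = np.array([bounds[k][0] for k in names], dtype=float)
    upper = np.array([bounds[k][1] for k in names], dtype=float)
    samples = rng.uniform(lower, upper, size=(n_samples, len(names)))
    return names, samples

def evaluation_mask(start="1988-01-01", end="2009-01-01", warmup_end="1993-01-01"):
    """Boolean mask over daily time steps that excludes the warm-up period."""
    dates = np.arange(np.datetime64(start), np.datetime64(end), dtype="datetime64[D]")
    return dates >= np.datetime64(warmup_end)
```

For example, `sample_parameters({"c_max": (1.0, 500.0), "b": (0.0, 2.0)})` returns 10 000 candidate PDM parameter sets (with illustrative bounds), and performance metrics would then be computed only where `evaluation_mask()` is true.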

Model performance evaluation
In this study, it is crucial to be able to compare the performance of multiple model structures across many catchments and to make results comparable with Lane et al. (2019). Considering this, we use the Nash-Sutcliffe efficiency (NSE), which was used by Lane et al. (2019), and the Kling-Gupta efficiency (KGE) (Gupta et al. 2009).
To establish whether the performance of specific model structures (measured using KGE or NSE) varies with the magnitude of a specific hydrological signature (Section 2.1), one might simply create scatter plots of one against the other, but there is a lot of noise that makes it difficult to see trends. We therefore smooth the data to lower the effects of variability across catchments so that the separations between increasing or decreasing trends in the relative performance differences of model structures can be observed more clearly (e.g. Burn and Elnur 2002). Without smoothing, it is difficult to observe the increasing and decreasing trends on the scatter plots (see Supplementary material, Fig. S1). We use a nonparametric locally weighted regression (LOWESS) approach that includes a bi-square weight function to minimize the effect of outliers on the smoothed values (Cleveland 1979, Coxon et al. 2015). Details regarding the LOWESS smoothing process are explained in the Supplementary material and shown in Fig. S1. We find that a smoothing window size of 40 catchments reflects the performance changes across catchments without overly smoothing the results. We then calculate the performance difference (i.e. the NSE or KGE difference) between each model structure and the best model structure (i.e. the model structure having the highest smoothed NSE or KGE value).
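The smoothing step can be illustrated with a compact robust LOWESS in the style of Cleveland (1979). This sketch makes several assumptions that are not confirmed by the text: the window is interpreted as the 40 catchments nearest in signature value, tricube weights are used for distance, the bi-square function is used for the robustness iterations, and two robustness iterations are run. The function names are hypothetical.

```python
import numpy as np

def _weighted_line(x, y, w, x0):
    """Weighted least-squares straight line through (x, y), evaluated at x0."""
    sw = w.sum()
    if sw == 0:
        return y.mean()
    xm = (w * x).sum() / sw
    ym = (w * y).sum() / sw
    sxx = (w * (x - xm) ** 2).sum()
    if sxx == 0:
        return ym                       # all x identical: fall back to the mean
    slope = (w * (x - xm) * (y - ym)).sum() / sxx
    return ym + slope * (x0 - xm)

def lowess(x, y, window=40, n_robust=2):
    """Robust locally weighted regression of y (e.g. NSE) on x (a signature)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = min(window, n)
    robust_w = np.ones(n)
    fitted = np.empty(n)
    for _ in range(n_robust + 1):
        for i in range(n):
            dist = np.abs(x - x[i])
            idx = np.argsort(dist)[:k]            # k nearest catchments
            h = dist[idx].max()
            u = dist[idx] / h if h > 0 else np.zeros(k)
            w = (1 - u ** 3) ** 3 * robust_w[idx]  # tricube x robustness weight
            fitted[i] = _weighted_line(x[idx], y[idx], w, x[i])
        resid = y - fitted
        scale = np.median(np.abs(resid))
        if scale == 0:
            break
        u = np.clip(resid / (6.0 * scale), -1.0, 1.0)
        robust_w = (1 - u ** 2) ** 2               # bi-square down-weights outliers
    return fitted
```

Applying `lowess` to each model structure's NSE values against one signature, and then subtracting each smoothed curve from the highest smoothed curve at every catchment, gives the performance differences described above.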

RRMT and FUSE performance across GB
First, we compare the performance of model structures from the two frameworks across GB. While Lane et al. (2019) included 1013 catchments in their analysis, we remove 15 catchments because they have unrealistic runoff ratio values (i.e. RR > 1) or because all model structures fail to work (i.e. NSE < 0). We assume that these problems are caused by unknown and thus unaccounted-for anthropogenic impacts. Figure 2(a) and (b) show the best NSE performance from all the model structures from the RRMT and FUSE frameworks, respectively, for 998 GB catchments. Both frameworks simulate 95% of the studied catchments with NSE values higher than 0.5, as shown in Fig. 2(c). The more complex FUSE models perform slightly better in catchments where both frameworks achieve high NSE values.
The spatial patterns of model performance in Fig. 2(a) and (b) are largely similar. However, there are 40 catchments located in southeastern GB where we find larger performance differences (i.e. > ± 0.2 NSE) between the frameworks (see Supplementary material, Fig. S2(a) and (c)). In 28 of them, the highest NSE values are obtained by the RRMT framework, and they are significantly higher than the ones achieved by the FUSE framework (i.e. NSE difference > 0.2). More than 80% of these catchments have highly permeable geology covering more than 60% of their respective catchment areas. Among them, there are six catchments where the FUSE models perform particularly poorly (i.e. NSE < 0) but RRMT is able to simulate their streamflow with NSE > 0.7. In this chalky region, the catchments are mostly baseflow-dominated, and some of them are losing water through regional groundwater flows. The inclusion of a LEAK routing component in the RRMT framework enables better performances under those conditions.
Lower NSE values are also seen in some catchments of northeast and central Scotland and north Wales, likely due to snow or reservoirs. There are three catchments in northeast Scotland (i.e. snow fractions > 0.1) for which model performances show NSE values less than 0.5. We did not focus on this any further given that these are just three out of almost 1000 catchments. To investigate the impact of reservoirs in more detail, we investigated the relationship between two reservoir-related descriptors (contributing area upstream of the reservoir and normalized upstream capacity; Salwey et al. 2023) and the highest NSE scores obtained by RRMT and FUSE model structures for 252 catchments (see Supplementary material). We found that there is a small decline in model performance the closer a reservoir is to the catchment outlet and, to a lesser degree, the larger it is (Fig. S4). However, the variability in performance change is very large, and it would take consideration of additional aspects such as reservoir management to add reservoirs to the models used here (e.g. Payan et al. 2008), which is beyond the main aims of this study.
We also calculate KGE values for the best NSE model runs for comparison (Fig. 2(c) and (b) and Fig. S3). Overall, they indicate similar patterns in comparison with NSE values across GB. However, Fig. 2(d) shows that RRMT has a larger number of catchments with KGE values > 0.4, and both frameworks have quite similar distributions after KGE > 0.8, whereas FUSE has a larger number of catchments with NSE values > 0.6 (Fig. 2(c)). Performance differences between the RRMT and FUSE frameworks are not fully consistent when comparing NSE and KGE values, which are calculated based on the simulation producing the best NSE in each catchment, due to the difference in formulation between NSE and KGE. While the bias, variance and correlation components of streamflow are equally weighted in the KGE formulation, they are weighted differently in the NSE formulation (i.e. the variance term is more dominant in NSE than the other terms) (Gupta et al. 2009). Moreover, the relationship between NSE and KGE will be different for different catchments, mostly depending on the coefficient of variation of the observed streamflow (Knoben, Freer and Woods 2019, Lamontagne et al. 2020). Having more complex model structures (i.e. with a larger number of parameters) might provide FUSE with more ability to capture the variance of streamflow than RRMT, a difference that goes away when the three components are weighted equally (i.e. in KGE).
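For reference, the two metrics compared above can be written as short Python functions. The NSE follows the standard Nash-Sutcliffe definition, and the KGE follows the correlation, variability and bias decomposition of Gupta et al. (2009); the function names are illustrative only.

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(sim, obs):
    """Kling-Gupta efficiency with equally weighted components.

    r     : linear correlation between simulated and observed flow
    alpha : ratio of simulated to observed standard deviation (variability)
    beta  : ratio of simulated to observed mean (bias)
    """
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)
```

Because the three KGE terms enter with equal weight, a simulation can rank differently under KGE than under NSE even when both are computed from the same run, which is the inconsistency discussed above.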

Linking model structure performance with hydrological signatures and catchment characteristics
Figures 3 and 5 show the differences in NSE values of the six RRMT model structures (PEN+2PAR, PEN+LEAK, PEN+CRES, PDM+2PAR, PDM+LEAK, PDM+CRES) and the four FUSE model structures (TOPMODEL, ARNO/VIC, PRMS, SACRAMENTO) in relation to the best-performing model structure in each framework. We plot these results against four hydrological signatures, (a) BFI, (b) dRR, (c) RR and (d) slope of FDC, which have in the past been shown to be informative for UK settings (Yadav et al. 2007). We visualize the results in two different ways. The left columns of Figs 3 and 5 show scatter plots of model structure performances in percent difference compared to the best-performing model (in each framework) against hydrological signatures after the smoothing process described in Section 2.3.2 has been applied. The left column also shows a threshold of 10% (as a dashed horizontal line) to visualize which model structures are similar in their performance (i.e. having an NSE difference of less than 10% with respect to the best model structure, which has the highest NSE value). Choosing 10% is a subjective decision (for visualization purposes only), but clearer separations are observed between performances of model structures using this value in comparison with other thresholds we tried (i.e. 5%, 8%, 15%), as shown in the Supplementary material, Fig. S5. The panel bar plots in the right column of Figs 3 and 5 indicate where model structures show NSE differences of less than 10% compared to the best-performing structure for selected attributes, to better show which model structures stop performing well as a function of different signature values. Histograms of BFI, dRR, RR and slope of FDC for 998 GB catchments are given in the Supplementary material, Fig. S6.
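The 10% similarity threshold can be expressed as a simple array operation. The sketch below assumes that the percent difference is taken relative to the best smoothed NSE at each catchment and that this best value is positive; the exact normalization used for the figures is not stated in the text, and the function names are hypothetical.

```python
import numpy as np

def percent_diff_from_best(scores):
    """Percent difference of each structure from the best one, per catchment.

    scores: array of shape (n_catchments, n_structures) holding smoothed NSE.
    ASSUMPTION: differences are normalized by the best score in each row.
    """
    scores = np.asarray(scores, dtype=float)
    best = scores.max(axis=1, keepdims=True)
    return 100.0 * (best - scores) / np.abs(best)

def within_threshold(scores, threshold=10.0):
    """True where a structure is within `threshold` percent of the best one."""
    return percent_diff_from_best(scores) < threshold
```

For a catchment with smoothed NSE values of 0.80, 0.76 and 0.60, the first two structures fall within the 10% band and the third does not.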
Figure 3 shows that there are clear separations between the performance levels of the six RRMT model structures. We find that the model structures containing a parallel flow routing module to represent fast and slow flows (i.e. PEN/PDM+2PAR) and the model structures with the leaky flow routing module (i.e. PEN/PDM+LEAK) outperform other models in 139 catchments with a high baseflow contribution (BFI > 0.7) (Fig. 3(a)). Both structures allow for slower responses and hence better baseflow representation. For small BFI values (i.e. BFI < 0.4, 244 catchments), all model structures have similar performances (i.e. NSE difference < 10%), which suggests that model structures with a single routing reservoir (i.e. PEN/PDM+CRES) are sufficient. Figure 3(b) shows that PEN/PDM+LEAK outperform other models in 62 catchments with significantly negative dRR values (dRR < −0.2, indicating subsurface losses or large abstractions). Model structures with the leaky flow routing module perform best in catchments that lose water. Interestingly, only PDM+2PAR outperforms other models in 19 catchments that have high water gains (i.e. dRR > 0.2), suggesting that the flexibility of this model in runoff generation and routing is better able to capture this situation than the other model structures.
To explore these interactions in more detail, Fig. 4 shows the relationship between BFI, dRR and model performance for PEN/PDM+2PAR and PEN/PDM+LEAK. We find that the majority (~66%) of catchments with high BFI values (i.e. BFI > 0.7) have higher NSE values when using PEN/PDM+2PAR. The ones where PEN/PDM+LEAK outperforms PEN/PDM+2PAR have very negative dRR values (i.e. dRR < −0.2). This implies that there are some catchments that have both high BFI and very negative dRR values and that PEN/PDM+LEAK outperforms the other RRMT models in these catchments.
Similarly, all model structures have NSE difference < 10% for 391 catchments with high runoff ratio values (RR > 0.6), except for two model structures with the PEN module (i.e. PEN+LEAK/CRES) that have NSE difference > 10% for 22 catchments with RR > 0.9 (Fig. 3(c)). Three of the catchments with RR > 0.9 tend to gain significant amounts of water (i.e. dRR > 0.4) due to water management activities (e.g. effluent returns, groundwater augmentation, etc.), and this makes these catchments artificially wet. Simpler model structures based on the Penman drying curve (i.e. PEN+LEAK/CRES) fail here. A possible reason for this is that the PDM module provides more flexibility through its distribution function, even though it likely performs well for the wrong reasons. If these artificially wet catchments are ignored, all model structures perform well in wet catchments. Therefore, the simplest model structures with a single conceptual reservoir for flow routing (i.e. PEN/PDM+CRES) are already suitable under those conditions. On the other end of the RR range (i.e. RR < 0.2, 42 catchments), it is interesting that only PDM+LEAK shows sufficient flexibility, i.e. that both soil moisture accounting and routing have to be rather flexible. Lastly, Fig. 3(d) shows that all model structures except the simplest ones with a single flow routing reservoir (i.e. PEN/PDM+CRES) perform well in catchments showing very high streamflow variability. Larger streamflow variability correlates with larger slope of FDC (i.e. slope of FDC > 4, 33 catchments). On the other end of this signature, only the PDM model with 2PAR and (to a lesser extent) LEAK seems to be able to capture the lack of streamflow variability (i.e. low slope of FDC values, slope of FDC < 1, 17 catchments). It is interesting that this is not just a question of the routing function, but again requires a flexible runoff production function (i.e. PDM).
Figure 5 indicates that there are also some separations between the four FUSE model structures (TOPMODEL, ARNO/VIC, PRMS, SACRAMENTO), although they are not clearly related to hydrological process differences. We find that ARNO/VIC performs well across all BFI values, and outperforms the rest in 139 catchments with high BFI values (BFI > 0.7). It is difficult to explain why this model structure outperformed the other model structures in baseflow-dominated catchments because all four models have a slow flow component, as shown in Fig. 1(b). On the lower end of BFI values (i.e. BFI < 0.4, 244 catchments), all FUSE model structures are within a 10% NSE difference range and there is therefore no significant difference (Fig. 5(a)).
Interestingly, the performance of the ARNO/VIC model is quite robust across a wide range of catchment behaviours. It performs better than or is sufficiently close (within 10%) to the best-performing model across the whole range of RR and slope of the FDC values (Fig. 5(c) and (d)). Also, the TOPMODEL implementation works across all RR values, while the other two models work across a slightly narrower range of RR values only. We know from the study of Lane et al. (2019) that TOPMODEL can produce simulations with less bias, but the reason for its performance advantage is unclear. It seems that there is no specific model structure, except ARNO/VIC, that outperforms the others in the 22 catchments with very high RR values. This is again due to the three artificially wet catchments we discussed above (Fig. 3(c)). Without those catchments, all FUSE model structures are within a 10% NSE difference range. The SACRAMENTO and PRMS models struggle if the water balance deviates more than about 20% from the one we expect using climate only (i.e. dRR values of > +0.2 / < −0.2, 19/62 catchments) (Fig. 5(b)). The models are thus quite sensitive to water balance problems. ARNO/VIC and TOPMODEL are robust in this regard, although for negative and positive dRR values, respectively. Finally, ARNO/VIC, TOPMODEL, SACRAMENTO and PRMS work increasingly poorly, in this order, when it comes to fitting flow variability as expressed through the slope of the FDC (Fig. 5(d)). All model structures except PRMS perform well in the 33 catchments with high slope of FDC values (Fig. 5(d)).
To ensure the robustness of our results to different performance metrics, we recreate Figs 3 and 5 using KGE differences (see Supplementary material, Figs S7 and S8). When we compare the NSE and KGE difference for the model structures (using the best NSE model), we find some differences between performance separations of both RRMT and FUSE model structures. For example, PEN+CRES seems to perform better for lower RR values when KGE is used, whereas PEN+2PAR seems to do worse in this region compared to using NSE. Moreover, while PEN/PDM+2PAR seem to outperform PEN/PDM+CRES in catchments with high slope of FDC values based on NSE values, this is not the case based on KGE values. When we look at FUSE model structures, we also observe some differences in their separations. For instance, ARNO/VIC does not outperform PRMS when KGE is used rather than NSE in high slope of FDC values (i.e. > 4). Moreover, TOPMODEL and PRMS are within the 10% threshold in the range of 0 < RR < 0.2 and 0.4 < RR < 0.6, respectively, based on NSE, whereas this is not the case based on KGE values. These findings imply that using KGE instead of NSE makes some difference in the performance separation of model structures with respect to the signatures assessed. However, when checking the signature ranges that define specific catchment types (i.e. baseflow-dominated, leaky, wet), only PEN+2PAR (RRMT) and SACRAMENTO (FUSE) show different performance separations when using KGE instead of NSE in baseflow-dominated catchments (i.e. BFI > 0.7). The other model structures from both frameworks show the same separation in all catchment types when using KGE or NSE. This result suggests that there is still more to learn about the differences in assessing model performances between KGE and NSE, which is beyond the scope of this short technical note.
A final question is whether we can predict the hydrological signatures used in this study (i.e. dRR, RR, BFI and slope of the FDC), so that we could apply what we have learned to ungauged catchments. In the GB setting, BFI has been predicted from physical catchment properties in the BFI-HOST (i.e. a baseflow index derived from the 29-class Hydrology Of Soil Types (HOST) classification) framework (Marsh and Hannaford 2008). These BFI-HOST values indicate strong correlation with the BFI values that we use in our study, as shown in the Supplementary material, Fig. S9(a), while RR shows a strong dependence on AI as shown in Fig. S9(b). However, we could not identify a single physical attribute or a reasonable combination of attributes to predict dRR for ungauged catchments given that different physical properties and anthropogenic activities likely influence this deviation. Runoff of leaky (dRR < −0.2) and gaining (dRR > 0.2) catchments is affected by both geological differences and different water management practices such as abstractions, reservoirs, and effluent returns (see Supplementary material, Fig. S10). However, the net effects of such practices across GB catchments have not been assessed so far.
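The signature thresholds used above to delineate catchment types can be written down compactly. The following is a minimal sketch, assuming the thresholds quoted in the text (BFI > 0.7, dRR < −0.2, dRR > 0.2, RR > 0.6); the class name, function name and example values are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Signatures:
    """Hydrological signatures of one catchment (illustrative container)."""
    bfi: float  # baseflow index
    drr: float  # deviation of the runoff ratio from the climate-based expectation
    rr: float   # runoff ratio


def catchment_types(s: Signatures) -> list[str]:
    """Label a catchment using the thresholds quoted in the text.

    A catchment can carry several labels (e.g. high BFI and very negative dRR).
    """
    types = []
    if s.bfi > 0.7:
        types.append("baseflow-dominated")
    if s.drr < -0.2:
        types.append("leaky")
    if s.drr > 0.2:
        types.append("gaining")
    if s.rr > 0.6:
        types.append("wet")
    return types


# Hypothetical catchment with high BFI and a strongly negative dRR
example = Signatures(bfi=0.85, drr=-0.30, rr=0.25)
labels = catchment_types(example)
```

Because the labels are not mutually exclusive, a catchment that is both baseflow-dominated and leaky receives both labels, which is the case in which PEN/PDM+LEAK was found to outperform PEN/PDM+2PAR.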

Discussion
We compare two modular modelling frameworks to analyse the influence of a priori model structure selection on performance separation in relation to catchment types across GB. In a direct comparison of model performances, we find that the FUSE structures perform slightly better with respect to the NSE metric when this metric is larger than 0.5 for both frameworks (the opposite holds for lower values). It is generally not surprising that FUSE is slightly better given that its models have between 10 and 12 free parameters, while the RRMT models have between three and seven. Multiple studies found a link between model performance and the number of free calibration parameters (e.g. Perrin et al. 2001, Kollat et al. 2012, Höge et al. 2018). However, we also show that it is not just the number of parameters that matters for model performance, as for example found by Knoben et al. (2020), since models that include the leaky routing structure of RRMT work better than the FUSE structures in catchments with significant subsurface losses, even though they have fewer parameters.
Figure 6 is a visual summary of what we find across the GB catchments studied here. There are 139 (14%), 62 (6%) and 391 (40%) catchments with BFI > 0.7, dRR < −0.2 and RR > 0.6, respectively. Slope of FDC did not provide additional information about separations between model structures because the catchments with flatter slopes are also those that generally have higher BFI values. There is therefore a large mirroring of the BFI and FDC results which does not justify including both. Fig. 6(a) shows that six model structures from RRMT are distinguished from each other across catchment types in line with our expectations regarding hydrological differences. In comparison, Fig. 6(b) indicates that some of the model structures from FUSE also outperform the others in some of the catchment types, but it is challenging to explain why they differ (as was previously concluded by Lane et al. 2019). The reason is that there are no identifiable structural/behavioural differences that explain performance differences between these model structures. The six model structures chosen in the RRMT framework have evolved from experience in modelling diverse GB catchments (Wagener et al. 2004, Lee et al. 2005, Moore 2007). Our results suggest that these model structures emerge as more suitable for specific catchment types, although we also find that they do not necessarily provide better performance than other model structures (except in the case of catchments with significant groundwater losses). Some of these catchment types have also been found to produce distinguishable model performances elsewhere. Kavetski and Fenicia (2011) and David et al. (2022) also found baseflow-dominated catchments to require routing structures with parallel reservoirs. Kavetski and Fenicia (2011) selected seven model structures from the SUPERFLEX framework and the fixed GR4H model (modèle du Génie Rural à 4 paramètres Horaire) and tested them on four catchments in New Zealand and Luxembourg. David et al. (2022) selected only four model structures, also from SUPERFLEX, and evaluated them across 508 Brazilian catchments. Both studies selected model structures based on their prior knowledge and experience in their study domain. Similarly, different studies found that wet catchments can be modelled well using a wide range of model structures (e.g. Atkinson et al. 2002, Kavetski and Fenicia 2011, Coxon et al. 2014, Massmann 2020, David et al. 2022). More specifically to GB, our findings are similar to those of Lee et al. (2005), who also found that a leaky routing component is needed in catchments with permeable aquifers such as Chalk, Jurassic limestone, and Carboniferous/Devonian rock.
Nonetheless, some studies (e.g. Lee et al. 2005, Van Esse et al. 2013, Lane et al. 2019, Knoben et al. 2020) that were conducted in different countries (e.g. UK, France, US) and used model structures from different modular frameworks (e.g. RRMT, SUPERFLEX, FUSE, MARRMoT) have not been able to identify clear model structure-catchment type relationships (beyond the aforementioned permeable catchments in the case of Lee et al.). Both Lee et al. (2005), who used 12 model structures from RRMT across 28 UK catchments, and Van Esse et al. (2013), who used 12 model structures from SUPERFLEX plus the GR4H model across 237 French catchments, observed performance differences between the model structures that they used, but they could not establish a catchment type-model structure relationship. Both studies suggested that the catchment characteristics used were insufficient to reflect catchments' hydrological behaviours. Lee et al. (2005) stated some additional possible reasons for this, such as the other choices made in their study (e.g. number of catchments, suitability criteria) and using observed rainfall-runoff data, which is insufficient to represent the catchments. In addition, studies by Lane et al. (2019), who used four model structures from FUSE across 1013 GB catchments, and Knoben et al. (2020), who used 36 model structures from MARRMoT across 559 US catchments, could not observe distinct separations between their model performances across catchment types due to the selection of multiple model structures with similar process representations or complexities.
Our findings suggest that modular modelling frameworks might benefit from an adequate strategy for the inclusion of specific model structures, process modules or system components in their frameworks (tailored to a specific domain). It might be beneficial for them to explicitly provide the conceptual differences and similarities between the processes or components of model structures and to establish expectations regarding the type of catchments that they can potentially represent well or poorly. If these differences are unclear a priori, then it is unlikely that we can subsequently explain model performance differences. While some modular modelling frameworks such as SUPERFLEX (Fenicia et al. 2011) and MARRMoT (Knoben et al. 2019) provide detailed information about the differences/similarities between components/fluxes of model structures included and the hydrological processes that they can represent, this might not be enough. Knoben et al. (2020) investigated model suitability by pre-selecting 36 of 46 MARRMoT model structures for 559 US catchments. They ranked the model structures according to their performance in each catchment and then attempted to correlate these rankings with 52 catchment attributes (e.g. hydrological, climatic and physical). However, they could not find clear relationships between model rankings and catchment attributes. Their study stated that not using suitable hydrological signatures/catchment attributes to reflect distinct hydrological behaviours across their study domains could be a reason. Our results suggest that a stronger focus on pre-selecting model structures consisting of (as much as possible) distinct process-based components for the study domain might be a way forward to reduce this problem.

Conclusions
Modular modelling frameworks are widely used, although the best approach for selecting model structural components has remained unclear. Probably unsurprisingly, many studies have found it difficult to find meaningful separations between the model structures or structural components considered. Here, we hypothesize that long-term experience within a study domain (e.g. a region such as GB) can lead to the development of different model structures that provide a guide for a priori model inclusion. While rainfall-runoff models have often not explicitly evolved into modular frameworks, they nonetheless can contain at least some of the experience gained when trying to simulate diverse catchments across a heterogeneous domain (e.g. Moore 2007). We therefore use GB experience as a guide in our study.
Applying model structures selected in this manner, we find that these a priori chosen model structures more logically separate regarding their performance across catchments than those used in a previous multi-model study with non-UK focused model structures (Lane et al. 2019).The routing components of our framework separate based on the extent of baseflow contribution into single or parallel flow components, while a leaky component is required for catchments with significant subsurface losses.The two soil moisture accounting components do not separate as strongly, unless significant flexibility is required in which case the PDM structure is favoured (e.g.wetter catchments than expected based on climate alone).
Our results suggest that it might be helpful to first build perceptual models of the diverse catchments (or systems) encountered across a study domain such as GB (e.g. Beven and Chappell 2021, Wagener et al. 2021, McMillan et al. 2023). Here, we conditioned our perceptions on previous experiences with different model structures applied across our study domain. Without consideration of different perceptual models that are reflected in the model structures included, the modular modelling exercise might be reduced to a regression-type analysis with limited knowledge gain.

Disclosure statement
No potential conflict of interest was reported by the authors.

Figure 1. Structures of models used in the study. (a) Six model structures consisting of different combinations of two soil moisture modules and three flow routing modules provided by the Rainfall-Runoff Modelling Toolbox (RRMT). PEN/PDM+2PAR have five parameters, PEN/PDM+LEAK have seven parameters and PEN/PDM+CRES have three parameters; (b) four models provided by the FUSE modelling framework. Schematic illustrations of their structures are redrawn from Lane et al. (2019). TOPMODEL and ARNO/VIC have 10 parameters, PRMS has 11 parameters and SACRAMENTO has 12 parameters. In the diagram, p, e, ER, S and Q represent precipitation, evaporation, effective rainfall, storage and outflow, respectively. In the PEN module, S_max1, S_max2, d and φ represent the size of the upper store (i.e. root constant), the size of the lower store, the initial deficit in the upper store and the bypass value, respectively. In the PDM module, C_max, b and c represent the maximum storage capacity, the degree of spatial variability and the initial critical capacity, respectively. In the CRES module, T represents the residence time of the reservoir. In the 2PAR module, a, T_s and T_f represent the fraction of effective rainfall going through the fast reservoir and the residence times of the reservoirs for slow flow and fast flow, respectively. In the LEAK module, T_u, T_m, T_l, h_1 and h_2 represent the residence times of the upper, middle and lower parts, the lower threshold and the upper threshold, respectively. In the FUSE model structures, θ_wlt, θ_fld and θ_sat represent the soil moisture at wilting point, field capacity and saturation, respectively.

Figure 2. NSE values of best simulations performed by any model structures selected from (a) RRMT and (b) FUSE frameworks; and cumulative distribution function (CDF) plots of (c) NSE and (d) KGE values of these frameworks.

Figure 3. NSE difference (%) values and bar plots of six model structures (PEN+2PAR, PEN+LEAK, PEN+CRES, PDM+2PAR, PDM+LEAK, PDM+CRES) plotted against their (a) BFI, (b) dRR, (c) RR and (d) slope of FDC attributes. NSE difference values are calculated by taking the difference between the maximum NSE value obtained by any model structure and the NSE values of the remaining model structures, divided by the maximum NSE value and multiplied by 100 for every catchment. NSE values of model structures are obtained by moving means with a 40-point window size. Through visual inspection, 10% is selected as the most helpful threshold to show which model structure is performing differently in relation to a specific attribute. The range between the two grey dashed vertical lines indicates where the smoothing is based on 20 points to the left and right of the average calculated. Outside this range, points become increasingly biased by the points at the minimum and maximum signature values.
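The NSE difference calculation described in this caption can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: catchments are assumed to be sorted by the signature before smoothing, the smoothing is assumed to be applied before the difference is taken, the moving mean is assumed to be centred (20 points on each side) with a shrinking window near the ends, and the best smoothed NSE is assumed to be positive. The function names are hypothetical.

```python
import numpy as np


def moving_mean(values, window=40):
    """Centred moving mean with window // 2 points on each side.

    Near the ends fewer points are available, so the window shrinks there
    (an assumption; the caption only notes that these points are biased).
    """
    values = np.asarray(values, dtype=float)
    half = window // 2
    out = np.empty_like(values)
    for i in range(values.size):
        lo = max(0, i - half)
        hi = min(values.size, i + half + 1)
        out[i] = values[lo:hi].mean()
    return out


def nse_difference(nse, signature, window=40):
    """NSE difference (%) of each model relative to the best model.

    nse:       array of shape (n_catchments, n_models)
    signature: array of shape (n_catchments,), e.g. BFI, dRR, RR or slope of FDC
    Rows of the result follow the catchments sorted by signature value.
    Assumes the best smoothed NSE is positive in every row.
    """
    nse = np.asarray(nse, dtype=float)
    order = np.argsort(signature)
    smoothed = np.column_stack(
        [moving_mean(nse[order, j], window) for j in range(nse.shape[1])]
    )
    best = smoothed.max(axis=1, keepdims=True)
    return 100.0 * (best - smoothed) / best


# Hypothetical example: two models with constant NSE of 0.8 and 0.6
demo_nse = np.array([[0.8, 0.6]] * 5)
demo_signature = np.array([3.0, 1.0, 2.0, 5.0, 4.0])
demo_diff = nse_difference(demo_nse, demo_signature, window=2)
```

A model whose column stays below the 10% threshold over a signature range is treated as performing comparably to the best model in that range.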

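Figure 5. NSE difference (%) values and bar plots of four model structures (TOPMODEL, ARNO/VIC, PRMS, SACRAMENTO) plotted against their (a) BFI, (b) dRR, (c) RR and (d) slope of FDC attributes. NSE difference values are calculated by taking the difference between the maximum NSE value obtained by any model structure and the NSE values of the remaining model structures, divided by the maximum NSE value and multiplied by 100 for every catchment. NSE values of model structures are obtained by moving means with a 40-point window size. Through visual inspection, 10% is selected as the most helpful threshold to show which model structure is performing differently in relation to a specific attribute. Therefore, bar plots of four model structures are created by taking 10% as the NSE difference. The range between the two grey dashed vertical lines indicates where the smoothing is based on 20 points to the left and right of the average calculated. Outside this range, points become increasingly biased by the points at the minimum and maximum signature values.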

Figure 6. Illustration of (a) six model structures' separation (PEN+2PAR, PEN+LEAK, PEN+CRES, PDM+2PAR, PDM+LEAK, PDM+CRES) and (b) four model structures' separation (TOPMODEL, ARNO/VIC, PRMS, SACRAMENTO) for catchments with different characteristics. Baseflow-dominated catchments are the ones containing a higher proportion of the river that derives from stored sources (i.e. having high BFI values). Leaky catchments are the ones most likely losing water (i.e. having very low negative dRR values). Wet catchments are the ones where the rainfall is most likely to become runoff (i.e. having high RR). *If two models perform equally well, and if we do not have any additional reason to prefer one model over the other (e.g. because they fit our perceptual model better), then we believe applying Occam's razor is a sensible strategy (Young et al. 1996). According to this principle, commonly attributed to William of Ockham in the 14th century, if two competing explanations for a phenomenon exist, one should prefer the simpler explanation, all else being equal.
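The Occam's razor rule in the footnote above can be illustrated with a small selection routine. This is a hedged sketch, not the procedure of the study: it treats models whose NSE difference is below 10% as performing equally well, uses the parameter counts from the Figure 1 caption as the measure of simplicity, and breaks ties between equally simple models by the higher NSE. The function name and the example NSE values are hypothetical.

```python
# Parameter counts of the RRMT structures, as listed in the Figure 1 caption
N_PARAMS = {
    "PEN+CRES": 3, "PDM+CRES": 3,
    "PEN+2PAR": 5, "PDM+2PAR": 5,
    "PEN+LEAK": 7, "PDM+LEAK": 7,
}


def select_simplest(nse_by_model, n_params=N_PARAMS, threshold=10.0):
    """Pick the simplest model among those close to the best one.

    A model counts as 'equally good' if its NSE difference (%) relative to
    the best model is below `threshold`. Assumes the best NSE is positive.
    """
    best = max(nse_by_model.values())
    if best <= 0:
        raise ValueError("best NSE must be positive for a relative difference")
    candidates = [
        model for model, value in nse_by_model.items()
        if 100.0 * (best - value) / best < threshold
    ]
    # Fewest parameters first; among equally simple models, the higher NSE wins
    return min(candidates, key=lambda m: (n_params[m], -nse_by_model[m]))


# Hypothetical NSE values for one catchment
choice = select_simplest({"PDM+LEAK": 0.80, "PDM+2PAR": 0.78, "PEN+CRES": 0.60})
```

In this example PEN+CRES falls outside the 10% range, so the five-parameter PDM+2PAR is preferred over the seven-parameter PDM+LEAK. A preference based on a perceptual model, as mentioned in the footnote, would take precedence over this purely numerical rule.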

Funding
MK was funded by the Ministry of National Education of the Republic of Turkey. Funding for TW has been provided by the Alexander von Humboldt Foundation in the framework of the Alexander von Humboldt Professorship endowed by the German Federal Ministry of Education and Research. Partial support for GC was provided by a NERC grant NE/V009060/1 and a UKRI Future Leaders Fellowship award [MR/V022857/1].
Lane et al. (2019) [...] KGE) efficiency metrics because they are normalized and unit-free metrics enabling the comparison of model performances across catchments. Both metrics are calculated for the time period of 1993-2008. We only have the best runs based on NSE for the Lane et al. (2019) study, which is why we calculate the KGE values for those and do not identify the best KGE run separately. NSE is calculated as follows (Nash and Sutcliffe 1970):

$$\mathrm{NSE} = 1 - \frac{\sum_{t=1}^{n} \left(x_{s,t} - x_{o,t}\right)^{2}}{\sum_{t=1}^{n} \left(x_{o,t} - \mu_{O}\right)^{2}}$$

where $x_{s,t}$ is the simulated value at time step $t$, $x_{o,t}$ is the observed value at time step $t$, $n$ is the total number of time steps and $\mu_{O}$ is the mean of observed values. NSE ranges from −∞ to 1, with a value of 1 indicating a perfect correspondence between simulations and observations. NSE = 0 indicates that simulations have the same predictive skill as the mean of the observations, while NSE < 0 indicates that simulations are a worse predictor (Schaefli and Gupta 2007). KGE is calculated as follows (Gupta et al. 2009):

$$\mathrm{KGE} = 1 - \sqrt{(r - 1)^{2} + (\alpha - 1)^{2} + (\beta - 1)^{2}}$$

with $\alpha = \sigma_{S} / \sigma_{O}$ and $\beta = \mu_{S} / \mu_{O}$, where $\sigma_{O}$ and $\sigma_{S}$ are the standard deviations of observed and simulated values, $\mu_{O}$ and $\mu_{S}$ are the means of observed and simulated values and $r$ is the linear correlation coefficient between observed and simulated values. Like NSE, KGE also ranges from −∞ to 1. KGE = 1 also means that simulations are perfectly in agreement with observations. Knoben, Freer and Woods (2019) found that when KGE is approximately −0.41, simulations have the same predictive skill as the mean of the observations.
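As an illustration of the two metrics, the following sketch computes NSE and KGE with NumPy, assuming the standard forms given by Nash and Sutcliffe (1970) and Gupta et al. (2009) and the symbol definitions above. The function names and the example series are hypothetical, and missing-data handling is omitted.

```python
import numpy as np


def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 minus squared error over observed variance."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)


def kge(sim, obs):
    """Kling-Gupta efficiency from correlation, variability ratio and bias ratio."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    r = np.corrcoef(obs, sim)[0, 1]   # linear correlation coefficient
    alpha = sim.std() / obs.std()     # ratio of standard deviations
    beta = sim.mean() / obs.mean()    # ratio of means
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)


# Hypothetical observed flows and two simple simulations
obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
perfect = obs.copy()
mean_only = np.full_like(obs, obs.mean())

nse_perfect = nse(perfect, obs)
kge_perfect = kge(perfect, obs)
nse_mean = nse(mean_only, obs)
```

A simulation equal to the observed mean gives NSE = 0, which is the benchmark interpretation described above. KGE is not evaluated for that constant series here because its correlation coefficient is undefined.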