Benchmarking high-resolution hydrologic model performance of long-term retrospective streamflow simulations in the contiguous United States

Because use of high-resolution hydrologic models is becoming more widespread and estimates are made over large domains, there is a pressing need for systematic evaluation of their performance. Most evaluation efforts to date have focused on smaller basins that have been relatively undisturbed by human activity, but there is also a need to benchmark model performance more comprehensively, including basins impacted by human activities. This study benchmarks the long-term performance of two process-oriented, high-resolution, continental-scale hydrologic models that have been developed to assess water availability and risks in the United States (US): the National Water Model v2.1 application of WRF-Hydro (NWMv2.1) and the National Hydrologic Model v1.0 application of the Precipitation–Runoff Modeling System (NHMv1.0). The evaluation is performed on 5390 streamflow gages from 1983 to 2016 (∼ 33 years) at a daily time step, including both natural and human-impacted catchments, representing one of the most comprehensive evaluations over the contiguous US. Using the Kling–Gupta efficiency as the main evaluation metric, the models are compared against a climatological benchmark that accounts for seasonality. Overall, the model applications show similar performance, with better performance in minimally disturbed basins than in those impacted by human activities. Relative regional differences are also similar: the best performance is found in the Northeast, followed by the Southeast, and generally worse performance is found in the Central and West areas. For both models, about 80 % of the sites exceed the seasonal climatological benchmark. Basins that do not exceed the climatological benchmark are further scrutinized to provide model diagnostics for each application. Using the underperforming subset, both models tend to overestimate streamflow volumes in the West, which could be attributed to not accounting for human activities, such as active management. Both models underestimate flow variability, especially the highest flows; this was more pronounced for NHMv1.0. Low flows tended to be overestimated by NWMv2.1, whereas there were both over- and underestimations for NHMv1.0, but they were less severe. Although this study focused on model diagnostics for underperforming sites based on the seasonal climatological benchmark, metrics for all sites for both model applications are openly available online.


Introduction
Across the hydrologic modeling community, there is a pressing need for more systematic documentation and evaluation of continental-scale land surface and streamflow model performance (Famiglietti et al., 2011). A challenge to hydrologic evaluation stems from the fact that the objectives of hydrologic modeling often vary. Archfield et al. (2015) reviewed how different communities have approached hydrologic modeling in the past, drawing a distinction between hydrologic catchment modelers, whose primary interest has been simulating streamflow at the local to regional scale, versus land surface modelers, who have historically focused on the water cycle as it relates to atmospheric and evaporative processes at the global scale. As modeling approaches have advanced toward coupled hydrologic and atmospheric systems, both perspectives have evolved and are converging towards the goal of improving hydrologic model performance through more intentional evaluation and benchmarking efforts.
Land surface modeling (LSM) has a rich history of community-developed benchmarking and intercomparison projects (van den Hurk et al., 2011; Best et al., 2015). In addition to comparative evaluations of process-based models, the LSM community has used statistical benchmarks, which in some cases have been shown to make better use of the forcing input data than state-of-the-art land surface models (Abramowitz et al., 2008; Nearing et al., 2018). The International Land Model Benchmarking (ILAMB) project is an international benchmarking framework developed by the LSM community (Luo et al., 2012) that has been applied to comprehensively evaluate Earth system models, including the categories of biogeochemistry, hydrology, radiation and energy, and climate forcing (Collier et al., 2018). Although hydrology is a component of ILAMB and other LSM benchmarking efforts, there is a need for closer collaboration with hydrologists to improve hydrologic process representation in these models (Clark et al., 2015).
Hydrologic catchment modeling has begun to move towards large-sample hydrology, an extension of comparative hydrology, where model performance is evaluated for a large sample of catchments rather than focusing solely on individual watersheds. This is appealing because evaluating hydrologic models across a wide variety of hydrologic regimes facilitates more robust regional generalizations and comparisons (Gupta et al., 2014). As such, many hydrologic modeling evaluation efforts have begun to encompass larger spatial scales. Monthly water balance models have been used to relate model errors for the contiguous United States (CONUS) to hydroclimatic variables (Martinez and Gupta, 2010) as well as for parameter regionalization (Bock et al., 2016). As part of Phase 2 of the North American Land Data Assimilation System project, Xia et al. (2012) evaluated the simulated streamflow for four land surface models, focusing mostly on 961 small basins and 8 major river basins in the CONUS, finding that the ensemble mean performs better than the individual models. Further, several large-sample datasets have been developed for community use. The Model Parameter Estimation Experiment (MOPEX) includes hydrometeorological time series and land surface attributes for hydrological basins in the US and globally that have minimal human impacts (Duan et al., 2006). The more recent Catchment Attributes and Meteorology for Large-sample Studies (CAMELS) dataset includes hydrometeorological data and catchment attributes for more than 600 small- to medium-sized basins in the CONUS (Addor et al., 2017). By using CAMELS basins that are minimally disturbed by human activities, Newman et al. (2015, 2017) and Addor et al. (2018) were able to attribute regional variations in model performance to continental-scale factors. Knoben et al. (2020) also used CAMELS with 36 lumped conceptual models, finding that model performance is more strongly linked to streamflow signatures than to climate or catchment characteristics.
While these efforts are useful towards evaluating smaller, minimally impacted basins, there is also a need to benchmark model performance for larger basins, including those impacted by human activities. On the global scale, catchment techniques have been applied to global hydrologic modeling, and they have been shown to outperform traditional gridded global models of river flow (Arheimer et al., 2020). On the regional scale, Lane et al. (2019) benchmarked the predictive capability of river flow for over 1000 catchments in Great Britain by using four lumped hydrological models; they included both natural and human-impacted catchments, finding poor performance when the water budget was not closed, such as due to non-modeled human impacts. Mai et al. (2022b) conducted a systematic intercomparison study over the Great Lakes region, finding that regionally calibrated models suffer from poor performance in urban, managed, and agricultural areas. Tijerina et al. (2021) compared the performance of two high-resolution models that incorporate lateral subsurface flow at 2200 streamflow gages during a 1-year study period; they found poor performance in the central US, potentially due to non-modeled groundwater abstraction and irrigation. As hydrologic model development moves to include human systems, these studies provide important baselines.
This study builds on previous large-sample studies by benchmarking long-term retrospective streamflow simulations over the CONUS. Specifically, two high-resolution, process-oriented models are evaluated that have been developed to address water issues nationally: the National Water Model v2.1 application of WRF-Hydro (NWMv2.1; Gochis et al., 2020a) and the National Hydrologic Model v1.0 application of the Precipitation–Runoff Modeling System (NHMv1.0; Regan et al., 2018). The evaluation is performed on daily streamflow for 5390 streamflow gages from 1983 to 2016 (∼ 33 years), including both natural and human-impacted catchments, representing one of the most comprehensive evaluations over the CONUS to date. The model performance is compared against a climatological benchmark that accounts for seasonality, and results are examined in terms of spatial patterns and human influences. The climatological seasonal benchmark is used as a threshold to screen the sites for each model application, offering a way to target the results for model diagnostics and development.

2.1

The National Water Model v2.1 application of WRF-Hydro (NWMv2.1)

NWMv2.1 is built on the Noah-MP land surface model (Niu et al., 2011), which calculates energy and water states and vertical fluxes on a 1 km grid. WRF-Hydro physics-based hydrologic routing schemes transport surface water and shallow saturated soil water laterally across a 250 m resolution terrain grid and into channels. NWMv2.1 also leverages WRF-Hydro's conceptual baseflow parameterization, which approximates deeper groundwater storage and release through a simple exponential decay model. The three-parameter Muskingum–Cunge river routing scheme is used to route streamflow on an adapted National Hydrography Dataset Plus (NHDPlus) version-2 (McKay et al., 2012) river network representation (Gochis et al., 2020a). A level-pool scheme is activated on 5783 lakes and reservoirs across the CONUS, representing passive storage and releases from waterbodies; however, no active reservoir management is currently included in the NWM. While the operational NWM does include data assimilation,
there is no data assimilation applied in the retrospective simulations used here. Using the Analysis of Record for Calibration (AORC) meteorological forcings, NWMv2.1 calibrates a subset of 14 soil, vegetation, and baseflow parameters to streamflow in 1378 gaged, predominantly natural-flow basins. The calibration procedure uses the Dynamically Dimensioned Search algorithm (Tolson and Shoemaker, 2007) to optimize parameters to a weighted Nash–Sutcliffe efficiency (NSE; Nash and Sutcliffe, 1970) of hourly streamflow (mean of the standard NSE and log-transformed NSE). Calibration runs separately for each calibration basin, and a hydrologic similarity strategy is then used to regionalize parameters to the remaining basins within the model domain. The calibration period was from water years 2008 to 2013, and the water years from 2014 to 2016 were used for validation. For the retrospective analysis, NWMv2.1 produces channel network output (streamflow and velocity), reservoir output (inflow, level, and outflow), and groundwater output (inflow, level, and outflow) every hour; land model output (e.g., snow, evapotranspiration, and soil moisture) and high-resolution terrain output (shallow water table depth and ponded water depth) are produced every 3 h. For this analysis, hourly streamflow is aggregated to daily averages.
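The weighted calibration objective described above (the mean of the standard and log-transformed NSE) can be sketched as follows. This is an illustrative Python sketch, not the NWM calibration code; the small `eps` offset guarding the log transform against zero flows is an assumption here, not documented NWM behavior.

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of model error
    variance to the variance of the observations."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

def weighted_nse(sim, obs, eps=0.01):
    """Mean of the standard NSE and the NSE of log-transformed flows.

    eps (assumed, not from the source) keeps the log transform finite
    when a series contains zero flows.
    """
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 0.5 * (nse(sim, obs) + nse(np.log(sim + eps), np.log(obs + eps)))

obs = np.array([1.0, 5.0, 2.0, 8.0, 3.0])
print(weighted_nse(obs, obs))  # 1.0 for a perfect simulation
```

The log-transformed component weights low flows more heavily, so a single objective balances high- and low-flow performance during calibration.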

2.2
The National Hydrologic Model v1.0 application of the Precipitation–Runoff Modeling System (NHMv1.0)

The U.S. Geological Survey (USGS) has developed the National Hydrologic Model (NHM, version 1.0) application of the Precipitation-Runoff Modeling System (PRMS) (Regan et al., 2018). PRMS uses a deterministic, physical-process representation of water flow and storage between the atmosphere and land surface, including snowpack, canopy, soil, surface depression, groundwater storage, and stream networks. Used here are the NHM daily discharge simulations from version v1.0 (NHMv1.0) and, more specifically, results from the calibration workflow "by headwater calibration using observed streamflow" with the Muskingum-Mann streamflow routing option ("byHRU_musk_obs"; Hay and LaFontaine, 2020). Climate inputs to the NHMv1.0 are 1 km resolution daily precipitation and daily maximum and minimum temperature from Daymet (version 3; Thornton et al., 2016). The geospatial structure, which defines the default parameters, spatial hydrologic response units (HRUs), and the stream network, is defined by the geospatial fabric version 1.0 (Viger and Bock, 2014). The NHM is calibrated using a multiple-objective, stepwise approach to identify an optimal parameter set that balances water budgets and streamflow. The first step calibrates the water balance of each spatial HRU to "baseline" observations of runoff, actual evapotranspiration, soil moisture, recharge, and snow-covered area derived from multiple datasets (Hay and LaFontaine, 2020). The second step considers timing of streamflow by calibration to statistically generated streamflow in 7265 headwater watersheds with a drainage area of less than 3000 km². The final step calibrates to observed gaged streamflow at 1417 stream gage locations; details of the calibration can be seen in Appendix 1 of LaFontaine et al.
(2019). The calibration period included the odd water years from 1981 to 2010, and the even water years from 1982 to 2010 were used for validation. The NHM does not simulate reservoir operations, surface or groundwater withdrawals, or stream releases. The NHM outputs daily streamflow, which is used in the analysis here.

https://doi.org/10.5194/hess-27-1809-2023 Hydrol. Earth Syst. Sci., 27, 1809–1825, 2023

3 Evaluation approach

Data
This study evaluates daily simulations from 1 October 1983 to 31 December 2016, or just over 33 years (∼ 12 100 d).
Model simulations are compared to observations at 5390 USGS stations (Foks et al., 2022); stations were included that had a minimum data length of at least 8 years or 2920 daily observations (i.e., ∼ 25 % complete data), although the observations did not need to be continuous (this allows for missing data, including intermittent and/or seasonally operated gages).A subset of these gages (n = 5389) also occurs in the Geospatial Attributes of Gages for Evaluating Streamflow, version II dataset (GAGES-II; Falcone, 2011); therefore, attributes from GAGES-II are used to examine select results.
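The gage-screening criterion above (at least 2920 non-missing daily values within the evaluation window, with gaps allowed) can be sketched as follows. This is an illustrative Python sketch with a hypothetical gage record; the study's actual processing used R and the names here (`passes_screen`) are not from the source.

```python
import numpy as np
import pandas as pd

def passes_screen(daily_flow: pd.Series, min_obs: int = 2920) -> bool:
    """Return True if a gage record has enough non-missing daily values.

    Gaps (NaN) are allowed, so intermittent or seasonally operated
    gages can still qualify as long as the total count is reached.
    """
    return bool(daily_flow.notna().sum() >= min_obs)

# Hypothetical gage: 10 years of record with a 2-year outage still passes.
idx = pd.date_range("1990-01-01", "1999-12-31", freq="D")
flow = pd.Series(np.random.default_rng(0).gamma(2.0, 5.0, len(idx)), index=idx)
flow.loc["1993":"1994"] = np.nan  # simulated gage outage (730 missing days)
print(passes_screen(flow))  # True: 2922 valid days >= 2920
```

Because the threshold counts valid observations rather than requiring a continuous record, seasonally operated gages with many short gaps are treated the same as a gage with one long outage.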
Figure 1 shows the spatial distribution of the gages as well as their designated region; regions are further aggregations of Level-II ecoregions as defined by GAGES-II (see Fig. 1 caption). Figure 1 shows the uneven distribution of gages: the eastern US has a dense network of gages, followed by decreasing coverage moving west around the 100th meridian (i.e., 100° W longitude). There is a modest increase in gage density across the intermountain west and higher coverage along the west coast. Figure 1 also shows the classification, that is, whether the site has been characterized as a reference or non-reference site. Reference gages indicate less-disturbed watersheds, and observations associated with non-reference gages have some level of anthropogenic influence (Falcone, 2011). Although the non-reference gages outnumber the reference gages by about 4 : 1, reference gages are relatively well-distributed through the regions.

Metrics
Table 1 shows the metrics used in the evaluation as well as their descriptions. Metrics were calculated in the statistical software R (R Core Team, 2021), including using the hydroGOF (hydrological goodness-of-fit) package (Zambrano-Bigiarini, 2020). The Kling–Gupta efficiency (KGE) is used as the overall performance metric, which is defined as follows (Gupta et al., 2009):

KGE = 1 − √[ (r − 1)² + (σ_sim/σ_obs − 1)² + (µ_sim/µ_obs − 1)² ],

where r is the linear (Pearson) correlation coefficient between the observations (obs) and simulations (sim), σ is the standard deviation of the flows, and µ is the mean. The KGE components of r and the ratio of standard deviations between the simulations and observations (rSD = σ_sim/σ_obs) are also examined. Correlation quantifies the relationship between modeled and observed streamflow and is often used to assess flow timing. The rSD shows the relative variability (Gupta et al., 2009; Newman et al., 2017), indicating if the model is over- or underestimating the variability in the simulated state (in this case, daily streamflow) relative to observations. In this evaluation, instead of using the ratio of means component, the related percent bias (PBIAS) is calculated as follows (Zambrano-Bigiarini, 2020):

PBIAS = 100 × Σ_{t=1}^{N} (S_t − O_t) / Σ_{t=1}^{N} O_t,

where observed flow is O, simulated flow is S, and t = 1, 2, ..., N is the time series flow index. Percent bias provides information on whether the model is over- or underestimating the total streamflow volume (based on the entire simulation period).
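The two definitions above can be computed directly from paired daily series. The sketch below is in Python rather than the R/hydroGOF workflow the study actually used, and is an illustration only.

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency (Gupta et al., 2009): 1 minus the Euclidean
    distance of (r, rSD, beta) from the ideal point (1, 1, 1)."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]     # linear (Pearson) correlation
    rsd = np.std(sim) / np.std(obs)     # ratio of standard deviations
    beta = np.mean(sim) / np.mean(obs)  # ratio of means
    return 1.0 - np.sqrt((r - 1) ** 2 + (rsd - 1) ** 2 + (beta - 1) ** 2)

def pbias(sim, obs):
    """Percent bias; positive values indicate overestimated volume."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 100.0 * np.sum(sim - obs) / np.sum(obs)

obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(round(kge(obs, obs), 6))          # 1.0 for a perfect simulation
print(round(pbias(1.1 * obs, obs), 6))  # 10.0 (a uniform 10 % overestimate)
```

Note that a uniform scaling of the flows leaves r and rSD at their ideal values and degrades the KGE only through the ratio-of-means term, which is why the bias terms are examined separately in the evaluation.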
To provide context for the interpretation of the KGE scores, a lower benchmark must be specified (Pappenberger et al., 2015; Schaefli and Gupta, 2007; Seibert, 2001; Seibert et al., 2018). The KGE does not include a built-in lower benchmark in its formulation, but Knoben et al. (2019) showed that models with KGE scores higher than −0.41 contribute more information than the mean flow benchmark. Recently, Knoben et al. (2020) showed that it is more robust to define a lower benchmark that considers seasonality. Hence, a reference time series based on the average and median flows for each day of the year is used to calculate a lower KGE value which serves as a climatological (lower) benchmark.
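One way to construct such a reference series is to pool the observations by calendar day of year and map the long-term statistic back onto the evaluation dates. The sketch below is illustrative (the study's exact handling of, e.g., leap days is not specified here) and builds either the average (AvgDOY) or median (MedDOY) benchmark.

```python
import numpy as np
import pandas as pd

def doy_benchmark(obs: pd.Series, stat: str = "mean") -> pd.Series:
    """Climatological reference flow: for each calendar day of year,
    the long-term mean (AvgDOY) or median (MedDOY) of the observed
    daily flows, repeated back over the original dates."""
    doy = obs.index.dayofyear
    grouped = obs.groupby(doy)
    clim = grouped.mean() if stat == "mean" else grouped.median()
    return pd.Series(clim.loc[doy].to_numpy(), index=obs.index)

# A purely seasonal (sinusoidal) record is reproduced exactly by the
# benchmark, so such a "model that only knows seasonality" scores well
# wherever flow has a strong repeating annual cycle.
idx = pd.date_range("2001-01-01", "2003-12-31", freq="D")
obs = pd.Series(10 + 5 * np.sin(2 * np.pi * idx.dayofyear / 365), index=idx)
bench = doy_benchmark(obs, stat="mean")
print(np.allclose(bench, obs))  # True
```

Scoring this reference series with the KGE against the observations yields the climatological lower benchmark against which the model simulations are judged.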
Two additional hydrologic signatures are included which evaluate performance based on different parts of the flow duration curve (FDC) for high and low flows. The definitions of these hydrologic signatures are consistent with those from Yilmaz et al. (2008). The bias of high flows (the top 2 %) is computed to evaluate how well the model captures the watershed response to big precipitation or melt events (PBIAS_HF). For low flows, the bias of the bottom 30 % (PBIAS_LF) offers insight into baseflow performance. Equations for these two metrics can be found in the online Supplement.
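These two signatures can be sketched as percent biases over slices of the sorted flows. This is a simplified illustration only: the Yilmaz et al. (2008) low-flow measure is computed on log-transformed flows, which is omitted here, and the exact equations used in the study are in its Supplement.

```python
import numpy as np

def pbias_fdc(sim, obs, lo: float, hi: float) -> float:
    """Percent bias over an exceedance-probability slice [lo, hi) of the
    flow duration curve (0 = highest flow). (0.0, 0.02) approximates a
    high-flow bias (PBIAS_HF); (0.7, 1.0) a low-flow bias (PBIAS_LF)."""
    sim = np.sort(np.asarray(sim, float))[::-1]  # descending: FDC ordering
    obs = np.sort(np.asarray(obs, float))[::-1]
    n = len(obs)
    i0, i1 = int(lo * n), max(int(hi * n), int(lo * n) + 1)
    return 100.0 * np.sum(sim[i0:i1] - obs[i0:i1]) / np.sum(obs[i0:i1])

obs = np.linspace(1.0, 100.0, 100)
sim = 2.0 * obs  # uniform doubling of all flows
print(pbias_fdc(sim, obs, 0.0, 0.02))  # 100.0 (high flows overestimated)
print(pbias_fdc(sim, obs, 0.7, 1.0))   # 100.0 (low flows too)
```

Because sim and obs are sorted independently, these signatures compare the shape of the two flow duration curves rather than day-by-day agreement, which is what makes them useful diagnostics alongside the KGE.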
Using daily observations and model simulations, the evaluation metrics from Table 1 are calculated for each of the gages for the NWMv2.1 (Towler et al., 2023a) and NHMv1.0 (Towler et al., 2023b). KGE is also calculated using daily observations and day-of-year averages and medians for each site; these KGE scores are referred to as AvgDOY and MedDOY, respectively.

Results
KGE scores for the benchmarks and models are presented as cumulative density functions (CDFs; Fig. 2), and Table 2 quantifies the percentage of sites less than or greater than select KGE scores. First, the seasonal benchmarks and model KGE scores can be compared to the mean flow benchmark (i.e., KGE < −0.41; Knoben et al., 2019): for the MedDOY, 18 % of sites have lower scores (Table 2). If the AvgDOY is used instead of the median flow, no sites have lower scores than the KGE benchmark (i.e., 0 % in Table 2). For the models, the NWMv2.1 simulations do not provide more skill than the mean flow benchmark at 14 % of the sites, similar to 12 % of sites using NHMv1.0. From Fig. 2, it can be seen that the CDFs for the models intersect with the AvgDOY curve at a KGE score of about −0.06; below this value, 19 %–20 % of the sites perform worse in terms of the KGE using the model simulation, whereas the model simulations perform better than the AvgDOY above this value.
In terms of median values, the AvgDOY (MedDOY) has a median KGE of 0.08 (−0.1), while the NWMv2.1 has a median of 0.53 and the NHMv1.0 median is 0.46. Given the better performance of the AvgDOY compared with the MedDOY, only the AvgDOY is used as the lower benchmark in the forthcoming analyses. KGE performance is also examined by whether the gage has been classified as reference or non-reference. Figure 3 shows KGE scores as CDFs for the models and the AvgDOY benchmark grouped by this classification. As expected, the AvgDOY curves are virtually identical regardless of classification. However, for both models, the reference gages outperform the non-reference gages. Table 3 shows the median values for the models: for the NHMv1.0, the KGE is 0.67 and 0.38 for the reference and non-reference sites, respectively; for the NWMv2.1, the KGE is 0.65 for the reference versus 0.49 for the non-reference sites. Looking at the components, the r values are the same for both model reference sites (0.78). For the PBIAS, the NHMv1.0 shows an underestimation for both reference and non-reference sites (−4.1 % and −5.7 %, respectively), but the NWMv2.1 underestimates (−4.0 %) at the reference sites and overestimates (5.3 %) at the non-reference sites.
Figure 4 shows KGE scores as CDFs for the models grouped by region. The model applications are similar, but there are notable differences by region. In general, performance is best for the Northeast, followed by the Southeast. The Central and West areas perform the worst, although the West exhibits some high KGE values. Table 4 displays the median KGE, r, rSD, and PBIAS values grouped by region, showing the biggest differences in PBIAS among regions and between models. Regional variability can be further examined using the KGE maps for the models: in the West, more of the poorly performing sites are in the arid southwest and the lower-elevation basins in the intermountain west; better performance is seen at higher elevations in the intermountain west and along the west coast, including the Pacific Northwest (Fig. 5a for NWMv2.1 and Fig. 5b for NHMv1.0). Figure 5 shows that, for both models in the Central region, relatively poor performance is concentrated along the plains areas that span from the high plains (i.e., North Dakota) vertically down through the center of the CONUS (i.e., South Dakota, Nebraska, Kansas, and Texas). Performance is more mixed as one moves further east in the Central region (e.g., around the Great Lakes). Relatively good performance is seen in most of the Southeast, but performance tends to be poor or mixed in Florida. However, as previously mentioned, the model results need to be placed into context by comparing them to a climatological benchmark. Figure 6 shows the KGE map for the AvgDOY, which has relatively high KGE values mostly in parts of the western CONUS, where there are notable seasonal signatures (e.g., snowmelt runoff), and relatively low KGE values in most of the other regions. By taking the KGE differences by site, it is easier to examine where the model applications are relatively better or worse than the seasonal benchmark. Figure 7 shows the spatial distribution of the KGE differences, where the model with the maximum KGE value is
used (i.e., the maximum of the NWMv2.1 and NHMv1.0 KGE values). Overall, the model applications tend to outperform the AvgDOY benchmark, except in the West and western Central regions. Figure S1 in the Supplement shows that if the AvgDOY benchmark is outperformed, it is usually by both models (at 63 % of sites); this is similar to the findings of Knoben et al. (2020). KGE difference maps for each individual model follow the same general spatial pattern (Figs. S2, S3). Basins that do not exceed the climatological benchmark are further scrutinized for each model application to offer insights for model diagnostics and development; that is, only sites that have KGE scores worse than the AvgDOY benchmark are examined hereafter. From here forward, these are called "underperforming sites". By classification, most underperforming sites are human impacted (90 %–93 %; see Table 5). By region, most underperforming sites are in the West (55 %–67 %) or Central (23 %–28 %) regions (Table 6). Next, the bias metrics can be examined to try to determine why these sites are not able to beat the climatological benchmark. Spatial maps of PBIAS show that the NWMv2.1 (Fig. 8a) generally overestimates volume; NHMv1.0 (Fig. 8b) is more mixed, with underestimation in the Central region. Both models overestimate water volumes in the West. This could be because neither model is capturing active reservoir operations or water extractions (e.g., for irrigation), which are important because water is heavily managed in the West. This is different from the overall distribution of PBIAS for the modeling applications, where, if one looks at all the gages (n = 5390), PBIAS for both models is centered around zero (Fig.
S4). The underestimation in the Central region for the NHMv1.0, which is absent in NWMv2.1, could be due to the different time steps of the models: NWMv2.1 is run hourly and NHMv1.0 is run daily; this hypothesis is expanded upon in Sect. 5. Maps for PBIAS_HF can be seen in Fig. S5; the overall distribution of PBIAS_HF is centered below zero, indicating that the models tend to underestimate high flows, but this is more pronounced for the underperforming gages in the NHMv1.0 than in the NWMv2.1 (Fig. S6). Results for rSD paint a similar picture: both models tend to underestimate variability, but the underestimation is more pronounced in NHMv1.0 (Figs. S7, S8). Figure 9 shows PBIAS_LF for both model applications: the NWMv2.1 tends to overestimate the low flows, whereas the NHMv1.0 is more mixed and the over- or underestimation is less severe. This can also be seen in the histograms for PBIAS_LF (Fig. S9).

Discussion and conclusions
Water availability is a critical concern worldwide; its assessment extends beyond the individual catchment scale and needs to include both large and small basins as well as anthropogenically influenced basins and those devoid of human influence. As such, large-sample hydrologic modeling and evaluation has taken on a new urgency, especially as these models are used to assess water availability and risks. In the US, the high-resolution model applications benchmarked here are two widely used federal hydrologic models, providing information at spatial and temporal scales that are vital to realizing water security. To our knowledge, this is the first time that these models have been evaluated so comprehensively, as this analysis included daily simulations at 5390 gages, over a 33-year period, and basins both impacted and not impacted by human activities. Further, a climatological seasonal benchmark is used to provide an a priori expectation of what constitutes a "good" model. This analysis is aligned with the recent aims of the hydrologic benchmarking community to put performance metrics in context (Clark et al., 2021; Knoben et al., 2020). This paper extends this approach by demonstrating how the climatological benchmark can be used as a threshold to further scrutinize errors at underperforming sites. This is complementary to other model diagnostic and development work that aims to understand model sensitivity and why models improve or degrade with changes. Recent studies have applied sensitivity analyses that consider both parametric and structural uncertainties to identify the water cycle components that streamflow predictions are most sensitive to (Mai et al., 2022a). Information theory also provides tools that help identify model components contributing to errors (Frame et al., 2021). Further, simple statistical or conceptual models (e.g., Nearing et al., 2018; Newman et al., 2017) could also be used as a benchmark if applied to the same sites/catchments and time periods.
In terms of the KGE, the model applications showed similar performance, despite differences in process representations, parameter estimation strategies, meteorological forcings, and space/time discretizations. Reference gages performed better than the non-reference gages, and the best regional performance was seen in the Northeast, followed by the Southeast, with worse performance in the Central and West areas, although the West presented some high KGE scores. Further, for both models, most of the sites were able to beat the seasonal benchmark, and the majority of sites that did not were non-reference. Despite different forcings (NWMv2.1 is forced by AORC and NHMv1.0 is forced by Daymet version 3), the model applications had generally similar performance. Although outside the scope of this study, further exploration of streamflow biases due to forcing biases could offer insights into error sources. Moreover, the calibration periods of the models differed, and both overlapped with the evaluation period used in this study. While this overlap can introduce biases into the evaluation process, it allowed us to evaluate long-term performance for the same sites and time periods for both models. While this is not without precedent (e.g., Duan et al., 2006), recent studies are exploring best practices for calibration and validation to improve model robustness and generalizability (Shen et al., 2022).
PBIAS results showed that simulated streamflow volumes are overestimated in the West region for both models, particularly for the sites designated as non-reference. Lane et al. (2019) found that poor model performance occurs when the water budget is not closed, such as when human modifications or groundwater processes are not accounted for in the models. This is a likely explanation in our case as well, because water withdrawal for human use is endemic throughout the west and neither model has a thorough representation of these withdrawals. Furthermore, neither model possesses significant representations of lake and stream channel evaporation, which, throughout the largely semiarid west, can constitute a significant amount of water "loss" to the hydrologic system (Friedrich et al., 2018). Lastly, nearly all western rivers are also subject to some form of impoundment. Even neglecting evaporative, seepage, and withdrawal losses from these waterbodies, the storage and timed releases of water from managed reservoirs can significantly alter flow regimes from daily to seasonal timescales, thereby degrading model performance statistics at gaged locations downstream of those reservoirs. As model development moves towards including human systems, the benchmark results provide a concrete target for how much improvement a management module would need to deliver. This addition of management could support decision makers as they grapple with how to account for the anthropogenic influence on watersheds, especially as most studies to date focus on minimally disturbed sites.
Another interesting difference in PBIAS was seen in the Central US, where the NHMv1.0 underestimates volumes at underperforming sites. As detailed in the model descriptions, the model applications are run at different temporal scales: NHMv1.0 is run daily, whereas NWMv2.1 is run hourly and aggregated to daily. One hypothesis is that some precipitation events occurring on sub-daily scales, like convective storms and the associated runoff modes (Buchanan et al., 2018), may be missed. Similarly, while both models tend to underestimate high flows (PBIAS_HF) and variability (rSD), this is more pronounced for the NHMv1.0, which is in line with this hypothesis. The model applications showed differences in PBIAS_LF, with the NWMv2.1 overestimating low flows and the NHMv1.0 both over- and underestimating them, although the latter over- and underestimations were less severe. Both models used in the applications benchmarked here have only rudimentary representations of groundwater processes. Additional attributes (e.g., baseflow or aridity indices) could be strategically identified to further understand these model errors and differences. Model target applications, which drive model developer selections for process representation, space and time discretization, and calibration objectives, also have a notable imprint on the performance. The NWMv2.1, with a focus on flood prediction and fast (hourly) timescales, shows better performance for high-flow-focused metrics, whereas the NHMv1.0, designed for water availability assessment and slower (daily) timescales, shows more balanced and better performance for low-flow-focused metrics.
Identifying a suite of evaluation metrics has an element of subjectivity, but our aim was to focus on streamflow magnitude, as these model applications were developed to inform water availability assessments. However, magnitude is only one aspect of streamflow, and different metrics for other categories (e.g., frequency, duration, and rate of change) could be more appropriate for addressing specific scientific questions or modeling objectives. Recently, McMillan (2019) linked hydrologic signatures to specific processes using only streamflow and precipitation. The aforementioned publication did not find many signatures that relate to human alteration; however, in this paper, the streamflow bias metrics are found to be useful in this regard. Clark et al. (2021) pointed out that it is important to characterize the sensitivity of the KGE to sampling uncertainty, which can be large for heavy-tailed streamflow errors. Using bootstrap methods (Clark et al., 2021), uncertainty in the KGE estimates for this study was computed (Towler et al., 2023a, b) and is illustrated in Fig. S11. Alternative estimators of the KGE that are more appropriate for skewed streamflow data (e.g., see Lamontagne et al., 2020) could be added in the future, but they currently require the separate treatment of sites with zero streamflow, which was not feasible for this initial evaluation. Finally, some of the metrics in the benchmark suite include redundant error information; one approach to remedy this has been put forth by Hodson et al.
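The bootstrap idea can be illustrated with a simple year-block resampling of the daily series. This is a sketch only: the study followed Clark et al. (2021), and the helper names and resampling scheme below are illustrative assumptions rather than the study's actual procedure.

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency (Gupta et al., 2009)."""
    r = np.corrcoef(sim, obs)[0, 1]
    rsd = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1.0 - np.sqrt((r - 1) ** 2 + (rsd - 1) ** 2 + (beta - 1) ** 2)

def kge_bootstrap(sim, obs, years, n_boot=1000, seed=0):
    """Sampling distribution of the KGE from resampling whole years
    with replacement, which respects the strong day-to-day serial
    correlation of streamflow better than resampling individual days."""
    rng = np.random.default_rng(seed)
    sim, obs, years = (np.asarray(a) for a in (sim, obs, years))
    uniq = np.unique(years)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        pick = rng.choice(uniq, size=len(uniq), replace=True)
        idx = np.concatenate([np.flatnonzero(years == y) for y in pick])
        scores[b] = kge(sim[idx], obs[idx])
    return scores

# Synthetic 10-year daily record: simulation = scaled + noisy observations.
rng = np.random.default_rng(1)
years = np.repeat(np.arange(2000, 2010), 365)
obs = rng.gamma(2.0, 5.0, years.size)
sim = 1.1 * obs + rng.normal(0.0, 1.0, years.size)
scores = kge_bootstrap(sim, obs, years, n_boot=200)
lo, hi = np.percentile(scores, [2.5, 97.5])  # 95 % interval for the KGE
```

The width of the resulting percentile interval indicates how much of an apparent KGE difference between two models at a site could be sampling noise rather than a real performance difference.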
(2021), where the mean log square error is decomposed into independent error components (see Hodson et al., 2021, for details). This could also be addressed using empirical orthogonal function (EOF) analysis, as has been done for climate model evaluation (Rupp et al., 2013). In closing, this paper uses the climatological seasonal benchmark as a threshold to screen sites for each model application. While this fits the purpose of this study, the metrics for NWMv2.1 (Towler et al., 2023a) and NHMv1.0 (Towler et al., 2023b) are available for all sites (Foks et al., 2022); these can be analyzed and/or screened as needed. In the future, it could also be useful to extend the analysis beyond streamflow to other water budget components to assess additional aspects of model performance.
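For reference, the KGE and a bootstrap estimate of its sampling uncertainty (in the spirit of Clark et al., 2021) can be sketched as follows. The day-wise resampling here is a deliberate simplification (Clark et al. resample by year to preserve autocorrelation), and the function names are our own, not those of the study's workflow.

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency (Gupta et al., 2009).

    KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2), where r is
    Pearson's correlation, alpha the ratio of standard deviations, and
    beta the ratio of means (simulated over observed).
    """
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim, ddof=1) / np.std(obs, ddof=1)
    beta = np.mean(sim) / np.mean(obs)
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)

def kge_bootstrap(sim, obs, n_boot=1000, seed=0):
    """Percentile bootstrap of the KGE (2.5th, 50th, 97.5th percentiles).

    Resamples individual days with replacement -- a naive simplification
    relative to the block (yearly) resampling recommended for
    autocorrelated streamflow errors.
    """
    rng = np.random.default_rng(seed)
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    n = len(obs)
    reps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        reps.append(kge(sim[idx], obs[idx]))
    return np.percentile(reps, [2.5, 50, 97.5])
```

A uniform 10 % overestimation, for instance, leaves r = 1 but sets alpha = beta = 1.1, giving KGE = 1 − √0.02 ≈ 0.86, which shows how the decomposition separates bias and variability errors from timing errors.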

Figure 1. Site locations used in the evaluation (n = 5390), including regions and classification. Regions are further combinations of the aggregated ecoregions defined by Falcone (2011): Central (n = 1450) includes the Central Plains, Western Plains, and Mixed Wood Shield; Northeast (n = 1218) includes the Northeast and Eastern Highlands; Southeast (n = 1212) includes the South East Plains and South East Coastal Plains; and West (n = 1510) includes the Western Mountains and West Xeric. Classifications are from Falcone (2011): reference (Ref, n = 1115) and non-reference (Non-ref, n = 4274); one gage was not designated (NA, n = 1). The map was sourced from Grannemann (2010), Natural Earth Data (2009), and Esri (2022a, b).

Figure 2. Cumulative density functions (CDFs) for Kling–Gupta efficiency (KGE) scores based on daily streamflow at U.S. Geological Survey (USGS) gages for the seasonal benchmarks based on the median day-of-year flows (MedDOY) and average day-of-year flows (AvgDOY) and for the National Water Model v2.1 (NWMv2.1) and National Hydrologic Model v1.0 (NHMv1.0). The dotted vertical line is the KGE mean flow benchmark (−0.41). Sites for which a KGE could not be calculated because the modeled time series was zero for the entire period (n = 1 for NWMv2.1 and n = 16 for NHMv1.0) are included as negative infinity in the CDFs.
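The seasonal benchmark underlying Fig. 2 can be sketched as follows. This is an assumed implementation: the AvgDOY climatology is computed in-sample by day of year, and the helper names, leap-day handling, and screening logic are illustrative rather than the study's exact code.

```python
import numpy as np
import pandas as pd

def kge(sim, obs):
    """Kling-Gupta efficiency (Gupta et al., 2009)."""
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim, ddof=1) / np.std(obs, ddof=1)
    beta = np.mean(sim) / np.mean(obs)
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)

def avg_doy_benchmark(obs):
    """AvgDOY seasonal benchmark: the mean observed flow for each day of
    year, mapped back onto the full daily record (no special leap-day
    smoothing in this sketch; MedDOY would use the median instead)."""
    doy = obs.index.dayofyear
    clim = obs.groupby(doy).mean()
    return pd.Series(clim.loc[doy].to_numpy(), index=obs.index)

def underperforms(sim, obs):
    """Screening rule: a site underperforms when the model's KGE falls
    below the KGE of the AvgDOY climatological benchmark."""
    bench = avg_doy_benchmark(obs)
    return kge(sim.to_numpy(), obs.to_numpy()) < kge(bench.to_numpy(), obs.to_numpy())
```

Replacing `.mean()` with `.median()` in `avg_doy_benchmark` gives the MedDOY variant; comparing a model's KGE against the benchmark's KGE, rather than a fixed threshold, is what makes the screen seasonality-aware.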

Figure 4. Cumulative density functions (CDFs) for Kling–Gupta efficiency (KGE) scores based on daily streamflow at U.S. Geological Survey (USGS) gages for the National Water Model v2.1 (NWMv2.1) and National Hydrologic Model v1.0 (NHMv1.0). The dotted vertical line is the KGE mean flow benchmark (−0.41). Regions are further combinations of the aggregated ecoregions defined by Falcone (2011): Central (n = 1450) includes the Central Plains, Western Plains, and Mixed Wood Shield; Northeast (n = 1218) includes the Northeast and Eastern Highlands; Southeast (n = 1212) includes the South East Plains and South East Coastal Plains; and West (n = 1510) includes the Western Mountains and West Xeric.

Table 5. The number (percentage) of sites in each classification for each hydrologic model application where the KGE score is less than that of the average day-of-year flow (AvgDOY) benchmark (underperforming sites). The abbreviations used in the table are as follows: NHMv1.0 – National Hydrologic Model v1.0, NWMv2.1 – National Water Model v2.1, max(Model) – model with the maximum KGE value from NHMv1.0 or NWMv2.1, Ref – reference (minimal human impacts), and Non-ref – non-reference (influenced by human activities).

Figure 7. Difference between the Kling–Gupta efficiency (KGE) from the maximum model (maxModel; i.e., the maximum KGE value from the National Water Model v2.1, NWMv2.1, or the National Hydrologic Model v1.0, NHMv1.0) and the seasonal benchmark based on the average day-of-year flows (AvgDOY); negative (orange) indicates where the AvgDOY has a higher (better) KGE, whereas positive (purple) indicates that at least one of the models has a higher (better) KGE. The map was sourced from Grannemann (2010), Natural Earth Data (2009), and Esri (2022a, b).

Figure 8. Percent bias (PBIAS) maps for the (a) National Water Model v2.1 (NWMv2.1) and (b) National Hydrologic Model v1.0 (NHMv1.0) for sites where the KGE score is less than that of the average day-of-year flow (AvgDOY) benchmark. Cooler colors indicate where the model application overestimates volume, whereas warmer colors indicate where it underestimates volume. The map was sourced from Grannemann (2010), Natural Earth Data (2009), and Esri (2022a, b).

Figure 9. Percent bias low flow (PBIAS_LF, flows below the 30th percentile) maps for the (a) National Water Model v2.1 (NWMv2.1) and (b) National Hydrologic Model v1.0 (NHMv1.0) for sites where the KGE score is less than that of the average day-of-year flow (AvgDOY) benchmark. Cooler colors indicate where the model application overestimates low flows, whereas warmer colors indicate where it underestimates low flows. The map was sourced from Grannemann (2010), Natural Earth Data (2009), and Esri (2022a, b).

Table 1. Evaluation metrics calculated on the daily streamflows. The abbreviations used in the table are as follows: KGE – Kling–Gupta efficiency, rSD – ratio of standard deviations between simulations and observations, PBIAS – percent bias, HF – high flows, LF – low flows, Inf – infinity, USGS – United States Geological Survey, and m³ s⁻¹ – cubic meters per second.

Table 2. Median Kling–Gupta efficiency (KGE) scores and the percentage of sites (p) less than or greater than given KGE scores for the seasonal benchmarks based on the median day-of-year flows (MedDOY) and average day-of-year flows (AvgDOY) and for the National Water Model v2.1 (NWMv2.1) and National Hydrologic Model v1.0 (NHMv1.0).

Table 3. Median values grouped by reference (Ref, n = 1115) and non-reference (Non-ref, n = 4274) gages (one gage was not designated as Ref or Non-ref and is, therefore, not included). The abbreviations used in the table are as follows: KGE – Kling–Gupta efficiency, r – Pearson's correlation coefficient, rSD – ratio of standard deviations between simulations and observations.

Table 4. Median values for each region. The abbreviations used in the table are

Table 6. The number (percentage) of sites in each region for each hydrologic model application where the KGE score is less than that of the average day-of-year flow (AvgDOY) benchmark (underperforming sites). The abbreviations used in the table are as follows: NHMv1.0 – National Hydrologic Model v1.0, NWMv2.1 – National Water Model v2.1, and max(Model) – model with the maximum KGE value from NHMv1.0 or NWMv2.1.