Large Ensemble Diagnostic Evaluation of Hydrologic Parameter Uncertainty in the Community Land Model Version 5 (CLM5)

Land surface models such as the Community Land Model version 5 (CLM5) seek to enhance understanding of terrestrial hydrology and aid in the evaluation of anthropogenic and climate change impacts. However, the effects of parametric uncertainty on CLM5 hydrologic predictions across regions, timescales, and flow regimes have yet to be explored in detail. The common use of the default hydrologic model parameters in CLM5 risks generating streamflow predictions that may lead to incorrect inferences for important dynamics and/or extremes. In this study, we benchmark CLM5 streamflow predictions relative to the commonly employed default hydrologic parameters for 464 headwater basins over the conterminous United States (CONUS). We evaluate baseline CLM5 default parameter performance relative to a large (1,307) Latin Hypercube Sampling‐based diagnostic comparison of streamflow prediction skill using over 20 error measures. We provide a global sensitivity analysis that clarifies the significant spatial variations in parametric controls for CLM5 streamflow predictions across regions, temporal scales, and error metrics of interest. The baseline CLM5 shows relatively moderate to poor streamflow prediction skill in several CONUS regions, especially the arid Southwest and Central U.S. Hydrologic parameter uncertainty strongly affects CLM5 streamflow predictions, but its impacts vary in complex ways across U.S. regions, timescales, and flow regimes. Overall, CLM5's surface runoff and soil water parameters have the largest effects on simulated high flows, while canopy water and evaporation parameters have the most significant effects on the water balance.

Diagnostic sensitivity analysis (SA) has emerged as an important tool for characterizing hydrologic parameter uncertainty. SA clarifies which parameterized processes primarily influence the range of variability in specific model outputs of interest (Herman et al., 2013;Pianosi & Wagener, 2016;Reed, Hadjimichael, Malek, et al., 2022;Saltelli et al., 2021). However, the structural complexity and computationally intensive nature of CLM have hindered SA efforts. A few studies (Hou et al., 2012;Huang et al., 2013;Ren et al., 2016) have performed CLM parameter sensitivity analyses using relatively small ensembles (∼100 members). Others studies (Gao et al., 2021; have attempted to overcome computational limits in their CLM calibration and SA by building a surrogate model or emulator to approximate CLM responses to changes in candidate inputs. Despite their computationally economical nature, selecting, fitting, and validating surrogate models or emulators of a model as complex as CLM5 gives rise to additional challenges and uncertainties that can be limiting, especially when several coupled dynamic processes and/or measures of performance are of interest (Asher et al., 2015;Razavi et al., 2012). For example, Huang et al. (2016) provided a detailed discussion of the difficulty of building a surrogate model for CLM (e.g., the extreme nonlinearity of the response surfaces, sensitivity to non-convergent parameterizations, and rapid growth of computational demands as additional model performance metrics are considered).
Another key limitation in CLM SA and assessments is the use of a single or limited number of error metrics (Gao et al., 2021;Hou et al., 2012;P. Li et al., 2019;Parajka et al., 2016;Ren et al., 2016;Yang et al., 2021;. The choice of model error metric is crucial for SA and diagnostic model evaluation  because the sensitivities of the parameterized processes and obtained inferences of which parameter sets are behavioral can vary with different objective functions (Herman et al., 2013;Kelleher et al., 2015;Pianosi & Wagener, 2016). Thus, it is important to consider a range of metrics to better understand CLM hydrological behaviors and dominant parameterized processes across regions and timescales.
In this study, we perform a comprehensive diagnostic evaluation and SA of CLM5's streamflow predictions using a large ensemble (1,307) of hydrologic parameter sets for 464 CAMELS (Catchment Attributes and Meteorology for Large-sample Studies) basins across the conterminous United States (CONUS). Our analysis considers over 20 model performance metrics that capture the magnitude, timing, variability, water balance, and other flow regime errors ranging from daily, seasonal, to annual scales. Specifically, we aim to: 1. Benchmark the performance of CLM5's default hydrologic parameters on streamflow predictions using over 20 diagnostic error metrics across seven CONUS hydrologic regions. 2. Characterize the effects of hydrologic parameter uncertainty on CLM5 runoff predictions using 1,307 ensemble parameter sets.
3. Conduct a detailed SA to evaluate which of CLM5's parameterized hydrologic processes dominate the suite of runoff error metrics and diagnostically assess how the dominant processes change across CONUS regions.

Study Area
The CAMELS data sets (Addor et al., 2017;Newman et al., 2015) extend the previous Model Parameter Estimation Experiment data sets (Duan et al., 2006) to include 671 headwater-type basins with minimal human influence across the CONUS. The data consists of unregulated, quality-controlled daily streamflow measurement data from 1980 to 2014 for each basin. The CAMELS data sets provide basin area information from two different sources: the national geospatial fabric polygon (Viger & Bock, 2014) and the United States Geological Survey Geospatial Attributes of Gages for Evaluating Streamflow version II database (Falcone, 2011). In this study, we used the 464 basins with a basin area relative difference of less than 2% for CLM evaluation (Figure 1). This cutoff comes from Addor et al. (2017) recommendation to not use basins with large area discrepancies between the two sources.

Data Sources
The necessary meteorological forcing data to drive CLM5 including hourly precipitation, air temperature, surface pressure, wind speed, specific humidity, incoming solar radiation, and incoming longwave radiation were acquired from Phase 2 of the North American Land Data Assimilation System (NLDAS-2) (Xia et al., 2012). NLDAS-2 was developed upon the first phase of the NLDAS project. NLDAS was initiated to generate reliable initial land surface states for coupled atmosphere-land models and improve weather predictions. The majority of NLDAS atmospheric forcing data is derived from the North American Regional Reanalysis (NARR), which features a 32-km spatial resolution and a 3-hr temporal resolution. The NLDAS software is used to interpolate coarse-resolution NARR data to the finer spatial scale 1/8° (∼12 km) NLDAS grid cell and 1-hr temporal resolution (Xia et al., 2012). NLDAS-2 archives the forcing data set from 1979 to the present with a 3-day lag.
The study uses land surface data, such as the fraction of land unit type (urban, natural vegetation, crop, lake, and glacier), soil properties (e.g., soil texture and color), and plant functional type characteristics (e.g., canopy top and bottom heights), acquired from the input data set for the CLM5 configuration at the same NLDAS-2 1/8° spatial resolution over the CONUS (D. Lawrence et al., 2019). The land surface data were aggregated from high-resolution input data sets derived from a variety of sources including the Global Land One-km Base Elevation Project, the International Geosphere-Biosphere Program, and the Moderate Resolution Imaging Spectroradiometer Vegetation Continuous Fields product. For details, readers are referred to D. Lawrence et al. (2020). Besides the requested CLM land surface data, we also extracted the 1-km spatial resolution baseflow index over the CONUS (Wolock, 2003) for basin clustering (see discussion in Section 3.1). For each CAMELS basin, meteorological forcing, land surface data, and the baseflow index were aggregated from grid to basin level using the area-weighted average method by overlaying the basin boundary with data layers.

Basin Clustering
To assess how the effects of CLM5 hydrologic parameter uncertainties and sensitivities vary across the CAMELS basins and major CONUS hydro-climatic regions, we employ the k-means++ clustering method (Arthur & Vassilvitskii, 2007) to classify the 464 basins into regional clusters. The k-means++ clustering method is an unsupervised machine learning approach that minimizes within-cluster variance in terms of squared Euclidean distances. We use bootstrapping with the k-means++ algorithm to construct a stable and reproducible clustering system (von Luxburg, 2010). For a range of cluster sizes (3-10), a clustering workflow consists of the following five steps: (a) randomly partition the 464 basins into training (90%) and testing (10%) sets, (b) bootstrap 70% of the training sets for model training, (c) train the clustering model using the k-means++ algorithm, (d) repeat steps 2-3 40 times (i.e., generate 40 clustering models), and (e) classify the testing sets and examine cluster stability and reproducibility.
Each bootstrapped sample likely yields the same clustering system, but with a different order of unsupervised classes. To address this problem, we identify a benchmark grouping system for the first bootstrapped sample and reorder the clusters found in subsequent bootstrapped sets by comparing the Euclidean distance of the cluster centers. To check the cluster stability and reproducibility, we use four different cluster similarity measures: the Rand Index (Rand, 1971), the Adjusted Rand Index (Hubert & Arabie, 1985), the Jaccard Index (Halkidi et al., 2001), and the Fowlkes-Mallows index (Fowlkes & Mallows, 1983). These measures all range between 0 and 1, with a value of 1 indicating a complete match between the benchmark and each subsequent model (i.e., full reproducibility). Detailed mathematical descriptions of these four similarity measures can be found in the existing literature (Bouchard et al., 2013;Park & Jun 2009). We use a total of 17 physical characteristics/features for clustering (Table 1). We provide additional details on our implementation and evaluation of clustering in Text S1 of the Supporting Information S1. Overall, we conduct steps 1-5 for a range of cluster sizes and select the most stable cluster size with the highest similarity measures to define our regional hydro-climatic clusters for CAMELS basins.

CLM5: Hydrologic Parameter Ensemble Modeling
The CLM5 model structure has been described in detail in several prior studies (Ekici et al., 2019;. Consequently, we will only provide a brief description of the hydrologic process representations that are a focus of this study. The runoff generation process in CLM5 is a simplified TOPMODEL-based representation (Niu et al., 2005). It estimates the fractional saturated area based on the topographic characteristics and soil moisture state of a grid cell. In CLM, saturation excess runoff occurs when precipitation falls on saturated soils. The fraction of the grid cell assumed to be saturated is estimated as an exponential function of the water table depth. When one or more layers of the soil column become saturated, subsurface lateral flow occurs. A general power-law equation, whose argument is the thickness of the saturated portion of the soil column, is used to model the amount of lateral subsurface flow that is a function of the soil's hydraulic properties according to soil texture (i.e., the fractions of sand and clay). The Clapp and Hornberger (1978) relationships are used to determine the porosity, saturated matric potential, saturated hydraulic conductivity, and shape parameter used in the soil water retention curve and to calculate hydraulic conductivity in unsaturated conditions. The fraction of a grid cell covered by snow is calculated differently depending on whether snowpack is accumulating or melting (Dai & Zeng, 1997;Swenson & Lawrence, 2012). During accumulation, the increase in the amount of snow-covered area per increment of new snow is scaled by the parameter acc . During melt, the decrease in the snow-covered area follows a depletion curve whose shape is governed by the parameter melt . The rate at which evaporation from the soil surface occurs is regulated by the thickness of the Dry surface layer (DSL), a thin layer through which moisture must diffuse. Above a specified moisture value, evaporation occurs at the maximum rate. Below this value, a DSL is assumed to develop, and resistance to evaporation increases proportionately with the thickness of the DSL. Precipitation on vegetation is either intercepted by the canopy or passes through to the ground. The fraction of precipitation that the canopy intercepts is a function of the combined leaf and stem area index. The maximum amount of intercepted rain and snow are constrained. Moisture greater than these values is assumed to either drip or fall to the ground. The fraction of the canopy covered in liquid water is constrained to be less than a maximum value. Readers should refer to D. Lawrence et al. (2020) for the detailed hydrologic parameterizations in CLM5. Here we use the CLM5 Perturbed Parameter Ensembles version (CLM5-PPE), recently developed by National Center for Atmospheric Research, to explore hydrologic parameter uncertainties and evaluate ensemble CLM simulations. Briefly, CLM5-PPE allows for the perturbation of originally hard-coded model parameters. For spatially distributed parameters (e.g., soil), values are perturbed by applying a spatially uniform scaling factor to preserve the underlying spatial structure in the original CLM5 parameter data set.

Hydrologic Parameter Uncertainty and Sensitivity Analysis
Parameters related to hydrologic processes in CLM5 can be classified into six groups: (a) canopy water, (b) surface water, (c) soil water, (d) subsurface water, (e) snow, and (f) evaporation. Building on previous sensitivity 6 of 23 studies (Hou et al., 2012;Ren et al., 2016) and expert judgment, we focus on 15 hydrologic parameters likely to have dominant impacts on simulations of hydrologic processes (Table 2). Table 2 provides the default parameter values of CLM5 and their prior ranges. The parameter prior ranges are determined based on prior literature (Hou et al., 2012;Huang et al., 2013;Ren et al., 2016), expert judgment as CLM5 developers, and nature of the equations in the model (i.e., linear ranges for linear equations and exponential-type ranges for non-linear equations).
To ensure that prior ranges won't change the rank order of sensitivity, we performed a range test for a randomly selected basin and compared the sensitivity rank order by increasing and decreasing the prior ranges by 20% and 5%, respectively. We confirmed similar sensitivity scores.
In this study, CLM5 is configured to run sampled hydrologic parameter ensembles separately for each CAMELS basin. For each basin, we use the Latin Hypercube Sampling (LHS) method to sample 1,500 parameter sets from their uniform prior distributions as shown in Table 2, resulting in a total of 1,500 × 464 = 696,000 runs. A total of 193 parameter sets (13%) failed to converge for a few basins. We attempt 1,500 parameter sets per basin with the goal of attaining at least 1,000 successful CLM5 simulations across CONUS. We remove these failed parameter sets and associated simulations from all basins to keep consistency in our parameter space samples, resulting in a total of 1,307 successful CLM5 simulations in each basin. Investigating these failed runs (due to water balance error) did not lead to any spatial or parameter-based patterns. For example, in these failed runs, the surface runoff parameter varies widely from 0.02 to 5 and the basin where the model reported the water balance error varies from the Olympic National Forest, WA to Yellowstone, MT. All sampled parameters are within physical ranges, but their interactions may result in nonphysical states in CLM5 simulations and lead to failed runs. Different parameter sets failed in different basins, suggesting that parameter interactions vary by basin. Numerical experiments need to be carefully designed to tease out error sources and relevant parameters for locations in different climate regimes. While this is a compelling question, it is outside the scope of this study given the considerable work involved. Note that removing these failed runs does not affect the final sensitivity results because we use a SA method that does not require a specific sampling scheme. These failed runs provide valuable insight into the most appropriate SA method for this study.
We select the LHS technique due to its ability to effectively sample full parameter ranges by dividing parameter space into regions of equal probability for representative sample draws (Mckay et al., 2000). After generating ensemble simulations, we use the Delta moment-independent sensitivity analysis method (Delta-MIM) (Borgonovo, 2007;Plischke et al., 2013) to analyze the sensitivity of the 15 parameters to hydrologic predictions. Delta-MIM exploits an empiric density-based measure that identifies the parameters that most influence the entire distribution of the response variable (i.e., it captures higher order interactive effects beyond mean and variance responses). For each parameter, the resulting Delta index measures the normalized expected shift in the distribution of the response variable induced by the parameter. We choose Delta-MIM for this study because it does not require a specific sampling scheme and includes the effects of high-order statistical moments in the response metrics of interest (Hadjimichael et al., 2020). The broad array of combinations in the LHS hydrologic parameter samples may cause CLM5 convergence failures where the model crashes, making it difficult to implement other SA methods that require a specific sampling scheme. In this study, we perform both LHS sampling and Delta-MIM through the open-source Python library SALib (Herman & Usher, 2017).
For each parameter set, we run CLM5 in the satellite phenology mode for 10 years from 2005 to 2014 for model assessment, SA, and uncertainty analysis. We select the 2005-2014 period because it can represent the CONUS flooding climatology (Dougherty & Rasmussen, 2019) and contains extreme hydrological events. These events include major flooding and droughts such as the 2005 Pacific Northwest drought, the 2012 central Great Plains drought, and the 2012-2016 California exceptional drought. Following Dagon et al. (2020), each CLM5 simulation was spun up for 25 years  to equilibrate all state variables.

Diagnostic Error Metrics
At CAMELS basins, we focus on the basin outlet streamflow response and use a total of 27 hydrologic metrics (Table 3) to benchmark the CLM5 baseline performance (using default parameters) and assess parameter sensitivities to the basin hydrologic flow traits and regimes (e.g., high/low flows, flashiness, flow variability, water balance) characterized at different temporal scales (e.g., daily, monthly, seasonal, and annual). These metrics have been described and used in many previous studies (Knoben et al., 2019;Kollat et al., 2012;Sun et al., 2019;Yan et al., 2015;Yilmaz et al., 2008). The mathematical descriptions of these metrics are in Text S2 of the Supporting Information S1. These metrics not only provide a diagnostic evaluation of how closely the model simulates watershed hydrological behaviors, but can also aid in the parameterization of CLM5 for a wide range of applications.

Basin Clusters and Characteristics
We use the k-means++ clustering method to evaluate eight different candidate hydro-climatic regional cluster sizes (3-10). Seven clusters have high similarity measures (e.g., the average value of the four similarity indices is 0.98, Table S1 in Supporting Information S1), suggesting that they are mostly reproducible. Figure 1 shows that six of these clusters contain approximately 50 or more CAMELS basins. The exception is Cluster 3-Arizona/New Mexico (AZ/NM) with only seven basins. We further interpret the clustering model by investigating the importance of each feature using the one-way Analysis of Variance F-statistic (Table S2 in Supporting Information S1). We find that all 17 features, except for PCT_URBAN, are statistically significant at a 5% significance level and useful in the clustering process. This result is not surprising as most CAMELS basins have minimal anthropogenic impacts.  Figure S1 in Supporting Information S1 shows the most frequent month for annual maximum/minimum monthly streamflows during 1980-2014, illustrating flood/drought seasonality in each cluster.
Note that we do not include any observed streamflow information in the clustering process. We can therefore extend the clustering analysis to the remainder of the CONUS where streamflow data is unavailable. We examine the within-cluster hydrologic similarity and between-cluster hydrologic difference for CAMELS basins using four commonly used hydrologic metrics developed from the 1980-2014 flow measurements: mean daily runoff, daily runoff coefficient of variation (CV), slope of daily flow duration curve (FDC) between the 25% and 75% quantiles, and mean annual runoff ratio (runoff/precipitation). As shown in Table 5, clear differences can be observed between the clusters. The mean daily runoff is much larger in Cluster 2-Pacific, followed by Cluster 1-Northeast and Cluster 7-Southeast. These clusters are also associated with higher runoff ratios and FDC slopes, suggesting a higher percent of flow generation from precipitation and a flashier response to precipitation, respectively. The daily runoff CV is much larger for more central and arid basins such as Cluster 3-AZ/NM and Cluster 5-Great Plains, which are also associated with lower runoff ratios and FDC slopes. Because of the heterogeneity, we expect streamflow features to vary within and between clusters. Comparisons of the between-cluster and within-cluster variances indicate that in most cases within-cluster variability is much smaller than between-cluster variability across all four streamflow metrics. This data suggests that streamflow features are much more similar for basins in the same cluster (Tables S3-S6 in Supporting Information S1). Figure 2a shows the CLM5 default parameter performance measured by the daily flow KGE (D-KGE) error metric for the simulation period 2005-2014. Both at-site and regional median D-KGE values are presented. Note that KGE and NSE both range between −∞ and 1 with a value of 1 indicating perfect correspondence between simulations and observations (Gupta et al., 2009 Knoben et al., 2019). The performance of the CLM5 default hydrologic parameters shows strong variation across regions, with very poor performance in 3 of 7 clusters (D-KGE < −0.5). Higher D-KGE values were found in Cluster 1-Northeast, Cluster 2-Pacific, and Cluster 7-Southeast with regional median values of 0.33, 0.30, and 0.21, respectively. The three clusters all had a higher mean daily runoff and winter flooding seasonality ( Figure S1 in Supporting Information S1). Negative regional median values were found in the arid and semi-arid regions of Cluster 3-AZ/NM (−1.69) and Cluster 5-Great Plains (−1.99), as well as the snow-dominated, high-elevation Cluster 4-Rockies (−0.60). We would like to note that model applications focused on flows at different temporal scales (e.g., daily scales for flood control vs. monthly scales for drought management) require careful selection of the error metric for evaluating model skill. Figures 2b and 2c illustrate the effects of temporal scale on model assessment using one basin as an example. Based on the NSE metric, the model can be evaluated as "poor" (NSE = −0.52) using daily flow and "some skill" (NSE = 0.52) using monthly flow.

Default Hydrologic Parameter Performance
Given that KGE is a weighted metric, a larger bias in flow volume or flow variance can result in lower KGE values. Figure 3a shows the at-site and regional median annual flow volume bias (%) for the simulation period 2005-2014. Consistent with the D-KGE spatial pattern, Cluster 1-Northeast, Cluster 2-Pacific, and Cluster 7-Southeast have the smallest flow biases with regional median values of −1%, −5%, and 21%, respectively. Arid regions Cluster 3-AZ/NM and Cluster 5-Great Plains show large positive flow biases with regional median values of 243% and 214%, respectively. These results suggest that CLM5 default hydrologic parameters significantly overestimate annual runoff volume for Cluster 3-AZ/NM and Cluster 5-Great Plains, contributing to their low D-KGE values. Cluster 4-Rockies and Cluster 6-Midwest show moderate annual flow bias in most basins, with   regional median values of 26% and 33%, respectively. The lower D-KGE values in these two clusters are also attributable to the large flow variance bias ( Figure S2 in Supporting Information S1).
Given the sensitivity of model skill to timescales, we further analyze the flow volume bias at seasonal timescales and different flow regimes based on the FDC quantiles ( Figure 3b). Strong spatiotemporal patterns can be observed in the model errors: Cluster 1-Northeast, Cluster 2-Pacific, and Cluster 4-Rockies show lower flow bias in all time windows, despite the negative D-KGE value of Cluster 4-Rockies. Cluster 7-Southeast shows a lower flow bias in all time windows except for at low flow regimes (i.e., Q0-10 and Q10-25), suggesting an overestimation of low flow or drought conditions. Cluster 6-Midwest shows a lower annual flow bias, but a large flow bias in low to moderately high flow regimes (i.e., Q0-75). The arid regions Cluster 3-AZ/NM and Cluster 5-Great Plains show a large flow bias in all time windows. Note that seasonal performance can be strongly correlated with flow quantiles for regions with strong flow seasonality. For example, in Cluster 2-Pacific, where the minimum monthly flow generally occurs in the summer season ( Figure S1 in Supporting Information S1), flow bias in the summer season is strongly associated with Q0-25. These results again emphasize the need for model assessment to consider the multiple timescales relevant for different model applications (e.g., droughts and floods).

Effects of Hydrologic Parameter Uncertainty
We use the CV of the mean annual and 99% quantile (i.e., extreme floods) daily FDC to provide a distributional measure of the hydrologic parameter uncertainty effects and the related strong regional differences across the CAMELS basins. Figure 4 presents CVs across the 1,307 ensemble CLM5 simulations at each of the CAMELS basins. The higher the CV, the greater the effects of hydrologic parameter uncertainties on dispersion around the predicted mean annual and 99% quantile flows. In Figures 4a and 4b, we use regional median CVs across the clustered CAMELS basins. Each region's median CVs are designated using performance pairs (the median across a region's mean annual flow CVs and the median across a region's 99% quantile CVs). Lower CVs of mean annual flow and daily flow 99% quantile are found in Cluster 1-Northeast (16.4%, 18.1%), Cluster 2-Pacific (7.2%, 11.2%), and Cluster 4-Rockies (13.8%, 13.0%). The median regional CVs are larger in the arid regions of Cluster 3-AZ/NM (46.0%, 54.0%) and Cluster 5-Great Plains (72.6%, 74.8%). Runoff contributes a smaller proportion of the water balance in more arid regions so small changes in the parameterized ET dynamics can propagate into larger uncertainties in the distribution of simulated streamflows. Cluster 6-Midwest (32.1%, 36.1%) and Cluster 7-Southeast (25.0%, 28.5%) had moderate median regional CVs.
We evaluate the aggregate quality of regional streamflows using regional mean flows computed by taking the average of the flow time series for all CAMELS basins within each cluster. Figure 5 shows how CLM5's hydrologic parameter uncertainties influence the model's regional aggregate streamflow predictions using boxplots that summarize the 1,307 attained results. These boxplots focus on four error metrics: NSE of monthly flow (M-NSE), annual flow bias, KGE of monthly flow (M-KGE), and transformed root-mean-square-error of monthly flow (M-TRMSE). The metrics emphasize high flows, water balance, weighted variability-bias-correlation errors, and low flows, respectively. Figure 5 highlights that large uncertainties are observed in the attained levels of error metrics and shows the strong regional differences across error metrics. For M-NSE, Cluster 2-Pacific has the lowest uncertainty ranging from −0.33 to 0.60 with a median value of 0.38 and Cluster 6-Midwest has the largest uncertainty ranging from −63.17 to 0.68 with a median value of −1.07. For annual flow bias, Cluster 2-Pacific has the lowest uncertainty ranging from −32.7% to −6.7% with a median value of −21.8% and Cluster 5-Great Plains has the largest uncertainty ranging from −77.3% to 300.9% with a median value of 41.4%. For M-KGE, Cluster 2-Pacific has the lowest uncertainty that ranges from 0.35 to 0.73 with a median value of 0.61 and Cluster 6-Midwest has the largest uncertainty that ranges from −6.30 to 0.80 with a median value of 0.11. For M-TRMSE, we compare the uncertainty among clusters using the CV of the M-TRMSE metric. Cluster 2-Pacific has the lowest uncertainty with a CV value of 0.055 and Cluster 5-Great Plains has the largest uncertainty with a CV value of 0.43. In summary, Cluster 2-Pacific shows the lowest parametric uncertainty across all four metrics, Cluster 6-Midwest shows the largest uncertainty based on M-NSE and M-KGE, and Cluster 5-Great Plains shows the largest parametric uncertainty when evaluated by annual flow bias and M-TRMSE. Figure 5 shows boxplots with the metric values attained using the CLM5 default hydrologic parameters. This information can aid in future model calibration efforts (i.e., how far can we improve CLM5 runoff predictions in terms of these four error metrics by calibrating these parameters). Figure 6 shows the full spread of the regional mean monthly FDCs attained across the 1,307 ensemble members (i.e., the green shading). This figure shows that the overall effects of CLM5's hydrologic parameter uncertainty are constant in neither space nor flow regimes. For clusters with higher aridity, such as Cluster 3-AZ/NM and Cluster 5-Great Plains, the ensemble spread is similar along the entire FDC (i.e., across levels of streamflow). In contrast, more humid regions show a smaller spread of high flows from hydrologic parameter uncertainty. For example, Cluster 2-Pacific has the highest historically observed runoff ratio of the analyzed regions and shows the least spread in high flows across the sampled parameter ensemble. In humid regions such as Cluster 2-Pacific, the parameterized effects of canopy interception and ET are at a minimum during the winter flooding season and a maximum during the summer low flow season. This variability leads to much larger parameter uncertainty effects during summer low flow simulations and, consequently, hydrological drought applications. Despite its visual appearance as a relatively small spread in Cluster 2-Pacific high flows, the effects of hydrologic parameter uncertainties cannot be ignored in all high flow simulations (e.g., the Q99 monthly runoff ranges from 14.5 to 21.2 mm/day or a 46% relative difference). Note that a 46% relative difference in streamflow is the same order of magnitude as the projected impacts of climate change by the end of 21st century under the worst-case scenario (Chegwidden et al., 2020;Hou et al., 2019;Queen et al., 2021;Yan et al., 2021). The clear implication is that hydrologic parameter uncertainty has the potential to substantially influence CLM5-based projections of climate change driven streamflow changes. Comparing monthly FDCs between regional mean observed flows and those attained using CLM5's ensemble hydrologic parameters also reveals underlying CLM5 structural issues in capturing flow regimes. For example, in Cluster 2-Pacific, the full sampled hydrologic parameter ensemble's spread does not capture the observed FDC in the low flow regimes (e.g., Q1-25%), suggesting a potential inadequacy in the CLM5 model structure when simulating the region's low flow season.    Figure 6 also presents the behavioral parameter ensemble spread in CLM5's regional mean monthly FDC predictions using two example performance requirements: (a) annual flow bias within 10% and (b) annual flow bias within 10% and M-NSE ≥ 0.5. The former, less stringent performance constraint only emphasizes water balance.
Adding the M-NSE constraint is far more restrictive in classifying candidate ensemble members as being acceptable in both monthly high flow and water balance. As expected, the latter more stringent performance constraint (two conditions) yields a conditioned posterior distribution of FDCs that substantially reduces the CLM5 hydrologic parameter uncertainty effects relative to the initial uniform prior ranges. However, the constraints' ability to reduce prior ensemble spread varies significantly among clusters, which will be discussed in the following section.

Parameter Screening
The results in Figure 6 strongly emphasize that hydrologic parameter uncertainty has large effects on CLM5 streamflow predictions. Moreover, the figure highlights that conditioning the full 1,307 member ensemble with progressively higher levels of required behavioral performance can serve to substantially reduce uncertainty in CLM5's streamflow predictions in regions where model parameterization can significantly enhance performance (e.g., Cluster 5-Great Plains). This motivates a further diagnosis of which of parameterized hydrologic processes primarily control good (or bad) model performance using a global SA. Figure 7 shows the normalized Delta sensitivity score of the 15 hydrologic parameters to the Monthly Root Mean Square Error (M-RMSE) metric (similar to M-NSE, also emphasizes high flow errors) at each basin. The parameter sensitivities vary significantly from basin to basin (i.e., they are heterogeneous within a cluster) while also showing regional patterns. For example, the surface runoff parameter is dominant in its influence on M-RMSE in all clusters except for Cluster 6-Midwest. The soil water parameters ( sat and/or Ψsat ) are also frequently important, although they are far more variable in dominance and sensitivities across regions and CAMELS basins. It is unsurprising that surface runoff and soil water parameters are of first-order significance: CLM5 simulated surface runoff is based on fractional saturated area and water table depth. The CAMELS basins that compose Cluster 6-Midwest have a large value for depth-to-bedrock (SOIL_DEP in Table 4) in CLM5 parameterizations, leading to a large soil water storage capacity and reduced variability in the sampled flow simulations. Consequently, the soil water parameter sat has a dominant sensitivity signature in nearly all basins in the Cluster 6-Midwest region. Snow parameters acc and melt are only important in some basins in Cluster 4-Rockies. The subsurface runoff parameter has a secondary influence in the humid Cluster 2-Pacific and Cluster 7-Southeast regions. The soil evaporation parameter ini has a secondary influence in some basins in the arid Cluster 3-AZ/NM and Cluster 5-Great Plains regions. As highlighted in Figure 6, different combinations of model performance metric requirements defining behavioral hydrologic parameter samples can substantially influence the posterior uncertainty in streamflow predictions. Although Figure 7 highlights some clear dominant parameters that should be the focus for understanding M-RMSE performance, it is important to diagnostically assess other metrics that emphasize flow traits of interest. Figure 8 provides a broader perspective on the dominant parameterized processes at regional cluster scales for CLM5 streamflow outputs across a wide array of error metrics and temporal scales. Figure 8 reports the cluster accumulated sensitivity scores (estimated by summing the sensitivity score for all basins within a cluster) for 23 selected error metrics listed in Table 3. The figure shows that rankings of the most sensitive CLM5 hydrologic parameters change based on the analyzed error metric and its underlying processes. For example, the top three parameters for the D-KGE metric in Cluster 1-Northeast are sat (soil water), (surface water), and Ψsat (soil water). However, they are lip (canopy water), sat (soil water), and Ψsat (soil water) for annual flow volume bias. Cluster 1-Northeast has a winter to spring flooding seasonality and saturated soil initial condition before runoff occurs (Stein et al., 2021). Therefore, surface water and soil water parameters play an important role in determining daily runoff timing, variability and, consequently, the D-KGE metric. Cluster 1-Northeast also has the largest deciduous forest land cover, making the canopy water parameter important in determining water balance. In arid regions such as Cluster 3-AZ/NM and Cluster 5-Great Plains, subsurface flow has a less important role. The surface runoff parameter is the most sensitive parameter for nearly all 23 metrics. The soil parameter sat and soil evaporation parameter ini are also important for some metrics, such as monthly flow variance bias and annual volume bias. Humid regions such as Cluster 1-Northeast and Cluster 2-Pacific present larger variability across the 23 metrics with no single dominating parameter. Generally, the surface runoff parameter , soil parameters ( sat and/or Ψsat ), and canopy water parameter lip are dominant for different metrics. In Cluster 6-Midwest, the deep depth-to-bedrock results in two soil parameters ( sat and/or Ψsat ) as the most sensitive parameters for most of the 23 metrics, while the soil evaporation parameter ini stands out in flow volume bias metrics. Figure 8 highlights that a single parameter set can yield a mix of "behavioral model runs (acceptable)" or "non-behavioral model runs (not acceptable)" depending on the required error metrics, level performance attainment, and region. The difficulty of attaining a behavioral run versus a non-behavioral run is strongly shaped by the stringency of the performance requirements. This challenge rapidly increases for higher levels of required performance across multiple error metrics that are themselves shaped by different underlying processes. The SA results in Figure 8 are helpful for understanding and diagnosing the underlying parameterized processes that most strongly influence the behavioral/non-behavioral or posterior FDC outcomes shown in Figure 6.
When considering just the annual flow bias constraint, the most substantial decrease in CLM5 hydrologic parameter uncertainty effects occurs in Cluster 3-AZ/NM and Cluster 5-Great Plains. In these two arid regions, the surface runoff parameter has a far greater sensitivity score than the other parameters in nearly all studied error metrics. The annual flow bias metric is able to heavily constrain the parameters and their dominant effects on the surface runoff generation process. In this instance, the improved performance in water balance also helps constrain monthly high flow for the M-NSE error metric. In this instance, the two measures are highly correlated and largely driven by the same dominant parameterized surface runoff generation processes (Figure 8). For humid regions such as Cluster 1-Northeast and Cluster 2-Pacific, the limited ability of the annual flow bias performance requirement to yield reduced posterior FDC uncertainty can be further explained using SA insights.  Figure 8 shows that the canopy water parameters lip and/or wet are the dominant parameters while the surface runoff parameter has minimal impacts. As a result, the annual flow bias metric only constrains the canopy interception process and is unable to constrain the soil water and surface runoff parameters. It is therefore a less stringent filter where a significant number of initial uniform prior parameter sets can meet annual flow bias within 10% (i.e., this also highlights an equifinality issue). In contrast to the arid regions, using the annual flow bias and M-NSE performance constraints substantially reduces the initial FDC parameter uncertainty in Cluster 1-Northeast while no parameter set meets the two conditions for Cluster 2-Pacific. For Cluster 1-Northeast, the M-NSE performance constraint also strongly influences the canopy water parameters in addition to surface runoff and soil water parameters. This explains the region's reduced ensemble spread in Figure 6. The Cluster 2-Pacific region has significant annual flow biases that lead to an overestimation of high flows and failures in meeting the M-NSE performance constraint. Figure 8 highlights the parameter tradeoffs issue (Kollat et al., 2012;Ruddell et al., 2019). Due to inadequacies in the CLM5 model structure, no single hydrologic parameter set can simultaneously optimize all of the hydrologic metrics that focus on different temporal scales and flow regimes.

Selection for Behavioral Modeling
Building from the multi-metric SA results of Figures 8 and 9 uses factor maps to illustrate differences in how CLM5's behavioral and non-behavioral model runs partition in the two or three most sensitive hydrologic parameters, according to different error metrics. Figure 9 focuses on an illustrative single CAMELS basin (02011400) in Cluster 1-Northeast where behavioral parameter sets are classified relative to performance requirements for four different error metrics: Q0-10% flow volume bias, M-NSE, D-KGE, and annual flow volume bias. The four performance requirements are representative of constraints that could be relevant to applications focused on drought and environmental flows, reservoir operations, flood management, and water supply, respectively. For the Q0-10% flow volume bias metric, the two most sensitive parameters are the canopy interception parameter lip and surface runoff parameter . Linear separability of the behavioral parameters is observed (Figure 9a). In this basin, the evaluated hydrologic ensembles tend to underestimate Q0-10% flow volume. Smaller lip and values lead to less canopy interception and higher flow generation. Only about 2% of the 1,307 parameter sets meet the performance requirement that Q0-10% flow bias be ≤50%.
The runoff generation process is highly nonlinear and Figure 9b shows the consequently more complex nonlinear behavioral parameter separability. Again, only about 2% of the total parameter sets are able to meet the performance requirement that M-NSE ≥ 0.5. In transitioning from a focus on monthly to daily flows, the finer timescale dynamics of the basins are more likely to be governed by local landscape characteristics (e.g., basin topography). Parameter equifinality can become more challenging for metrics with more than two sensitive parameters. Relative to M-NSE (Figure 9b), D-KGE ( Figure 9c) has a third highly sensitive soil water parameter sat . Mapping out the behavioral boundaries of the parameter performance space therefore becomes increasingly difficult (showing multiple isolated behavioral parameter sets). About 2% of the total evaluated parameter sets meet the performance constraint that D-KGE ≥ 0.3.
Three dominantly sensitive parameters emerge for the annual flow bias performance requirement (Figure 9d): canopy interception lip , soil water sat , and evaporation ini . Figure 9d shows the far less restrictive condition, where approximately 28% of the total evaluated parameter sets are able meet the requirement that annual flow bias be ≤10%. The figure clearly highlights that the less restrictive performance requirement leads to equifinality challenges and a highly nonlinear behavioral response classification context. In this context acceptable and unacceptable parameters sets are mixed and quite close to one another. Finally, Figure 10 shows the prior and posterior behavioral distributions of the error metrics and most sensitive parameters analyzed in Figure 9. More stringent performance requirements (Figures 10a-10c) yield conditioned posterior distributions very dissimilar to those of uniform prior parameter samples. They demonstrate a strong discrimination between behavioral and non-behavioral performance spaces. Instances where there is a large overlap between the prior and posterior parameter distributions indicate difficulties in distinguishing a clear behavioral parameter space and limiting the effects of hydrologic parameter uncertainty on CLM5's streamflow predictions. Overall, our results highlight that both the stochastic or deterministic calibration and use of CLM5 for streamflow predictions or projections is non-trivial given the complex regional, timescale, and flow regime specific variabilities in that dominate the parameterized processes. Inferred results from this study suggest that parameter sensitivities strongly depend on the diagnostic error metric and its temporal scale. These sensitivities further relates to the basin's hydroclimatic condition, flood-generation mechanism, and the CLM5 model structure.

Conclusions
In this study, we address an issue given limited attention in the LSM community and benchmark the CLM5 default hydrologic parameterization's performance for streamflow predictions at 464 CAMELS basins across the CONUS. We further classify the 464 basins into seven regional clusters using meteorological data and basin physical characteristics for regional analysis. We perform a detailed diagnostic global SA of CLM5's streamflow predictions for 15 hydrological parameters evaluated for over 20 error metrics using a large ensemble size (i.e., 1,307 parameter sets). We characterize the effects of parameter uncertainty on streamflow predictions across the CONUS and diagnostically demonstrate which CLM5 parameterized processes control key features of streamflow prediction errors. Our results clarify that the quality and uncertainty of CLM5 streamflow predictions vary significantly across regions, temporal scales, and error metrics of interest. Based on the results in this paper, we report three major findings: 1. We find that CLM5 default parameter performance for streamflow predictions varies widely over the CONUS.
Based on the D-KGE error metric, moderate model skill occurs in the humid Cluster 1-Northeast, Cluster 2-Pacific, and Cluster 7-Southeast with regional median D-KGE values of 0.33, 0.30, and 0.21, respectively. Very poor performance is observed for arid and semi-arid regions Cluster 3-AZ/NM (−1.69) and Cluster 5-Great Plains (−1.99) as well as the snow-dominated, high-elevation Cluster 4-Rockies (−0.60). These region-specific results closely relate to the CLM5 parameterizations of surface runoff and ET processes. We also show that the default performance changes for different error metrics of focus. For example, while the model shows moderate skill for the D-KGE metric in Cluster 2-Pacific, it significantly underestimates low flows in this region, suggesting substantial biases if used in drought applications. 2. We find that hydrologic parameter uncertainties are important for CLM5 streamflow prediction. The overall effects of hydrologic parameter uncertainty vary substantially in space, by timescale, and over flow regimes. For mean annual flow and 99% quantile daily flow, the Cluster 1-Northeast, Cluster 2-Pacific, and Cluster 4-Rockies regions show fewer effects from hydrologic parameter uncertainties than the Cluster 3-AZ/NM and Cluster 5-Great Plains arid regions. Conditioning CLM5 on carefully selected performance levels for error metrics tailored to the appropriate needs of the application context can potentially improve the quality of the streamflow predictions while reducing the effects of their underlying parameter uncertainty. However, behavioral and non-behavioral classifications of parameters are complex, metric-dependent, and vary significantly among regional clusters. We show that SA provides a helpful guide for navigating this complexity by clarifying which parameterized processes dominantly control a given streamflow prediction error metric of focus. For example, attaining acceptable levels of performance for the annual flow bias metric in Cluster 3-AZ/NM and Cluster 5-Great Plains is dominantly controlled by CLM5's parameterized representation of surface runoff generation. 3. Our diagnostic global SA provides a direct map of the most dominant CLM5 hydrologic parameters across the 23 error metrics. The error metrics evaluate different traits for streamflow predictions in terms of their accuracy in capturing timing, variability, water balance, and specific flow regimes of interest (e.g., high or low flows) at different temporal scales. For example, the top three most influential parameters for the D-KGE metric in Cluster 1-Northeast are sat (soil water), (surface water), and Ψsat (soil water). However, the most dominant parameters for the region's annual flow bias metric are lip (canopy water), sat , and Ψsat . These SA results are also helpful for diagnosing inadequacies in CLM5's model structure or parameterizations. For example, simulation results show that simulated flow variability is extremely low in Cluster 6-Great Plains and the SA results suggest that the soil water parameters dominate for nearly all 23 error metrics. These results indicate that the parameterization of depth-to-bedrock in this region may be inappropriate.
Overall, the results of this study have the potential to aid future CLM5 parameter calibration efforts by reducing the dimensionality of the problem and providing insights on the most sensitive parameterized processes for a given application's focus. These results can provide practical guidance to CLM5 modelers and users who want to improve their hydrologic prediction skills or quantify the parameter uncertainty in their hydrologic studies. The CLM5 hydrologic data sets are publicly available in the MultiSector Dynamics -Living, Intuitive, Value-adding, Environment (MSD-LIVE) data repository (Yan et al., 2023).