Gauging ungauged catchments – Active learning for the timing of point discharge observations in combination with continuous water level measurements

Hydrological models have traditionally been used for the prediction in ungauged basins despite the related challenge of model parameterization. Short measurement campaigns could be a way to obtain some basic information that is needed to support model calibration in these catchments. This study explores the potential of such field campaigns by i) testing the relative value of continuous water-level time series and point discharge observations for model calibration, and by ii) evaluating the value of point discharge observations collected using expert knowledge and active learning to guide when to measure streamflow. The study was based on 100 gauged catchments across the contiguous United States for which we pretended to have only limited hydrological observations, i.e., continuous daily water levels and ten daily point discharge observations from different hypothetical field trips conducted within one hydrological year. Water level data were used as a single source of information, as well as in addition to point discharge observations, for calibrating the HBV model. Calibration against point discharge observations was conducted iteratively by continually adding new observations from one of the ten field measurements. Our results suggested that the information contained in point discharge observations was especially valuable for constraining the annual water balance and streamflow response at the event scale, improving predictions based solely on water levels by up to 50% after ten field observations. In contrast, water levels were valuable to increase the accuracy of simulated daily streamflow dynamics. Informative discharge sampling dates were similar when selected with either active learning or expert knowledge and typically clustered during seasons with high streamflow.


Introduction
Many catchments that are of interest for research or practical purposes are ungauged or poorly gauged even in regions with a relatively dense hydrological observation network. Yet streamflow information is critical for the design and management of water infrastructures. Hydrological models are a commonly used tool to predict streamflow and its temporal variation under both current and future conditions. Parameter values of hydrological models are typically adapted to a specific catchment by calibration and validation against observed streamflow. The prediction of streamflow in ungauged catchments, that is, catchments without any observed discharge, is one of the major challenges in hydrology. This long-standing challenge has received renewed, community-wide attention through the PUB (Prediction in Ungauged Basins) initiative launched by the IAHS (International Association of Hydrological Sciences) (Hrachowitz et al., 2013).
Model calibration should ideally be based on long continuous discharge time series (Brath et al., 2004; Merz et al., 2009; Singh and Bárdossy, 2012; Tada and Beven, 2012; Vrugt et al., 2006). However, it has been demonstrated that much shorter time series between one and six months can lead to robust model parameter estimates (Brath et al., 2004; Melsen et al., 2014; Sun et al., 2017). Others have shown that some point discharge observations, taken at randomly chosen dates, can provide valuable information for model calibration (Kim and Kaluarachchi, 2009; Perrin et al., 2007). Collecting such individual discharge data points strategically by explicitly taking observations during peak flows or events and the subsequent recessions (Correa et al., 2016; McIntyre and Wheater, 2004; Pool et al., 2017; Seibert and McDonnell, 2015) could further lower the number of data points needed to reach acceptable model parameterizations. Results indicate that even a small sample of ten to sixteen observations can be highly informative (Pool et al., 2017; Seibert and Beven, 2009; Seibert and McDonnell, 2015), especially when the natural variability in streamflow is well represented and there are observations when dominant hydrological processes are active (Harlin, 1991; Singh and Bárdossy, 2012; Sun et al., 2017; Tan et al., 2008; Vrugt et al., 2006; Yapo et al., 1996). These findings are in line with results from influence diagnostic statistics, which demonstrated that the ten most influential observations cover a range of flow magnitudes (Wright et al., 2018), whereby the five most influential discharge observations have an order of magnitude more influence on model performance than any other observation in a ten-year time series (Wright et al., 2015). One potential solution to overcome the challenges related to predictions in data-scarce situations thus might be the collection of at least some hydrological data during field campaigns.
However, such field campaigns are restricted by practicalities, such as the accessibility of the catchment, financial resources, or time, which make a careful choice of observation times essential. The expert knowledge gained from the previous studies could thereby provide guidance on the choice of sampling dates.
Active learning methods provide an alternative option to investigate the value of short and discontinuous discharge time series for model calibration from an explorative point of view, rather than by testing hypotheses as has been done so far. Active learning is a subfield of machine learning that has been widely applied in the domains of text processing, remote sensing, or chemoinformatics. These domains typically face the challenge of having large unlabeled datasets (i.e., datasets with a large number of unknown samples) that need to be classified with a prediction model. The training of the model is based on labelled samples, whereby labelling (i.e., assigning a value to an unknown data point) is expensive (Cawley, 2011). Active learning provides a method to select and label the most informative samples from the pool of unlabeled data such that the most favourable model performance can be achieved with the smallest number of samples (Settles, 2012). Active learning is an iterative approach in which the model and the user regularly interact. Current model predictions of each sample are ranked by a performance criterion, and the user selects and labels the highest-ranked samples that are subsequently used to recalibrate the prediction model (Crawford et al., 2013). A commonly used performance criterion is prediction uncertainty (Lewis and Gale, 1994), which means that high ranks are assigned to samples that have been predicted with the least confidence. It is thereby assumed that samples are most informative for model parameter estimation for points at which model simulations disagree most (Crawford et al., 2013). In hydrology, we face a similar challenge when gauging an ungauged catchment: a hydrologist needs to measure (i.e., label) the most informative discharge observations (i.e., samples) for model calibration from a future time series (i.e., unlabeled dataset) with the least possible effort.
We, therefore, hypothesize that active learning could be a powerful tool to decide on the timing of discharge observations for the calibration of hydrological models in previously ungauged catchments. Note that the term sample has different meanings in hydrology (Brunner et al., 2018), and is used here to refer to point discharge observations selected from an existing discharge time series. Seibert and Vis (2016) suggested that instead of performing discharge measurements at different points in time, it could be easier and less time consuming to install a water-level logger. Simulations for more than 600 catchments in the contiguous United States indicated a surprisingly high value of water-level time series for model calibration, especially in humid catchments. With increasing aridity, however, information on dynamics alone was not sufficient, and the lack of volume information steadily reduced model performance. Lebecherel (2015) proposed to combine water-level time series and point discharge observations to make the most out of field trips. Specifically, Lebecherel (2015) argues that a discharge observation at a given water level can be assumed to be representative for all occasions with a similar water level provided that the stage-discharge relationship is stationary and unique. Discharge time series created this way were successfully used to inform the regionalization of model parameters in 609 French catchments. While Lebecherel (2015) exclusively used discharge for calibration and disregarded water levels, Seibert and Vis (2016) proposed a simple method to use water levels for calibration without increasing the number of model parameters. Thus, combining continuous water-level time series and point discharge observations for calibration would allow the prediction of discharge in a previously ungauged basin with local information collected with a reasonable amount of effort.
The aim of this study was to provide further guidance on the optimal collection of streamflow data at a limited number of observation times to improve the prediction in ungauged basins. This study extends previous work on the value of data in ungauged basins by explicitly comparing the value of water levels and point discharge observations, and by testing a machine learning approach for guiding the timing of point discharge observations. To evaluate the value of point discharge observations and water-level time series across a wide spectrum of hydroclimatic conditions, we used a set of 100 gauged catchments distributed over the contiguous United States. Treating these catchments as poorly gauged, with only a limited number of field observations, allowed the following two main objectives to be addressed:
1. Quantification of the relative value of individual point discharge observations, continuous water-level data, or a combination thereof for the calibration of hydrological models.
2. Evaluation of the potential of active learning for providing guidance on the timing of the most informative discharge observations as opposed to a prior decision based on hydrological expert knowledge.

Study catchments
In this study, data from 100 catchments across the contiguous United States were used. The catchments represent a wide range of topographic and hydroclimatic aspects (see Table 1 for statistics). They vary in size from 14 km² to 12,601 km², and their mean elevation ranges from 23 m a.s.l. up to 3271 m a.s.l. Annual precipitation is between 267 mm yr⁻¹ and 3160 mm yr⁻¹, of which up to 71% falls as snow in some catchments. While the majority of catchments are either humid (47%) or temperate (35%), 18% can be classified as somewhat arid or arid (as defined by Coopersmith et al., 2014). Annual specific discharge ranges between 28 mm yr⁻¹ and 2678 mm yr⁻¹ with baseflow contributing 6% to 91% to annual streamflow. The 100 catchments are a constrained-randomly selected subset of the Newman et al. (2015) dataset containing more than 600 U.S. catchments. The selection is consistent with a previous study and was necessary to reduce the computational costs of the modelling experiments conducted in this study. The dataset was compiled by Newman et al. (2015) and provides time series of daily discharge, precipitation and temperature for each catchment. Furthermore, the dataset contains time series of different meteorological variables that were used to compute monthly potential evapotranspiration using the Priestley-Taylor equation (Priestley and Taylor, 1972). The dataset also includes basic information on catchment boundaries. However, detailed elevation data were downloaded from the SRTM digital elevation database (Jarvis et al., 2008). Further information on catchment attributes, such as topographic information, climatic indices, and hydrological signatures were extracted from the CAMELS dataset (version 1.0; Addor et al., 2017). Climate indices and hydrological signatures were recalculated for the hydrological years 1990-2009, which were used for model simulations in this study.
Fig. 1. Information used for model calibration with the lower and upper benchmark (LB WL and UB WL_365Q), the active learning-based data collection approaches (AL WL_10Q, AL WL_nQ, and AL 10Q), and the expert knowledge-based data collection approach (HE WL_10Q). A snow-dominated catchment in the Rocky Mountains is used as an example to indicate the temporal distribution of point discharge observations after ten sampling iterations. Note that the same colour scheme is used in Figs. 3-5 to differentiate the data collection approaches.
The model calibration experiments were based on both discharge amounts at individual points in time and water-level time series. Since water levels are not part of the Newman et al. (2015) dataset, synthetic water-level time series were created for each catchment. This was done by replacing the discharge values for each day by their respective rank in the time series. In other words, we created time series that contained only the information about the temporal dynamics but not quantitative information. These series correspond to the information contained in water-level time series in the case of stationary stage-discharge relationships. In cases where there were shifts in the (real) rating curves used, these shifts were implicitly considered by our approach as we based the ranking on the estimated streamflow and not directly on the observed water levels.
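The rank transformation used to create synthetic water levels can be illustrated with a minimal sketch. The function name and the averaging of tied ranks are assumptions made for this illustration; the text does not specify how ties were handled.

```python
import numpy as np

def synthetic_water_levels(discharge):
    """Replace each daily discharge value by its rank in the series.

    Tied values receive their average rank (an assumption; the tie rule
    is not stated in the text).
    """
    q = np.asarray(discharge, dtype=float)
    order = np.argsort(q, kind="stable")
    ranks = np.empty(len(q), dtype=float)
    ranks[order] = np.arange(1, len(q) + 1)
    # Average the ranks of identical discharge values
    for value in np.unique(q):
        tied = q == value
        ranks[tied] = ranks[tied].mean()
    return ranks
```

The resulting series keeps the ordering of days from lowest to highest flow but carries no information on flow volume, which is the property the synthetic water levels are meant to have.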

Hydrological model
Continuous daily streamflow was simulated with the HBV runoff model (Hydrologiska Byråns Vattenbalansavdelning; Bergström, 1976; Lindström et al., 1997) using the software implementation HBV-light (Seibert and Vis, 2012). The HBV model is a bucket-type model with a conceptual representation of hydrological processes typically dominating streamflow response at the catchment scale. Hydrological fluxes and state variables are represented by fourteen model parameters and four model routines, including a snow routine, soil routine, groundwater routine, and routing routine. Daily temperature and precipitation are used as input time series, together with long-term mean monthly potential evapotranspiration estimates. In the snow routine, a degree-day method is used to calculate snow accumulation and snowmelt. Snowmelt and rainfall supply water to the soil routine, in which simulated soil moisture content controls actual evapotranspiration and groundwater recharge. Recharge increases groundwater levels in the upper and lower reservoirs of the groundwater routine. The two reservoirs simulate the variable contribution of shallow and deep groundwater, or fast and slow runoff components, to total streamflow. Finally, in the routing routine, the sum of the three streamflow components is transformed by a triangular weighting function to simulate the hydrograph at the catchment outlet.
In this study, HBV was used in a semi-distributed way by dividing each catchment into elevation bands of 200 m. Computations in the snow and soil moisture routines were performed separately for each elevation zone, but using the same parameter values. The groundwater routine, on the other hand, was applied in a lumped way for the entire catchment. Daily temperature and precipitation input data were adjusted to each elevation band using lapse rates of 0.6 °C per 100 m (Wallace and Hobbs, 2006) and 10% per 100 m (Johansson, 2000), respectively. In contrast, monthly potential evapotranspiration values were assumed to be equal in all elevation bands.
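The elevation adjustment of the forcing data can be sketched as follows. This is an illustration only: the linear form of the correction, the use of a single reference elevation for the input data, and all function and argument names are assumptions, not details taken from the study.

```python
import numpy as np

def adjust_to_bands(temp, precip, ref_elev, band_elevs, t_lapse=0.6, p_grad=0.10):
    """Adjust daily temperature and precipitation to elevation bands.

    temp, precip : daily series valid at ref_elev (m a.s.l.)
    band_elevs   : mean elevation of each band (m a.s.l.)
    t_lapse      : temperature decrease in degrees C per 100 m
    p_grad       : fractional precipitation increase per 100 m
    Returns two arrays of shape (n_days, n_bands).
    """
    temp = np.asarray(temp, dtype=float)[:, None]
    precip = np.asarray(precip, dtype=float)[:, None]
    # Elevation difference to the reference, in units of 100 m
    dz = (np.asarray(band_elevs, dtype=float)[None, :] - ref_elev) / 100.0
    band_temp = temp - t_lapse * dz
    # Linear correction (assumed); clip so precipitation never becomes negative
    band_precip = np.maximum(precip * (1.0 + p_grad * dz), 0.0)
    return band_temp, band_precip
```

For example, with a reference elevation of 1000 m, a band at 1200 m would be 1.2 °C colder and receive 20% more precipitation under these assumptions.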

Data collection approaches
We defined six data collection approaches representing different possible scenarios for the collection of streamflow information in a previously ungauged basin (see Fig. 1 for a visualization of the data collection approaches). The approaches mainly differ in the type of data measured, i.e., water-level time series or point discharge observations, and in the timing of the point discharge observations. The period considered for the collection of streamflow information was restricted to one hydrological year (October 1 to September 30) in each case, to reflect a realistic situation in which only limited time is available to collect data for a previously ungauged catchment. The data collection approaches were 'simulated' by selecting water level and discharge information from the observed time series of each catchment. A more detailed description of the approaches is provided in the following sections.

Benchmark approaches
A relatively simple data collection approach would be installing a water-level sensor for collecting continuous daily time series over an entire hydrological year (Fig. 1a). This approach calibrates the model against streamflow dynamics only and therefore served as a lower benchmark (LB WL ) for more advanced methods.
In contrast, the most data-rich approach would be the use of continuous water-level time series combined with discharge observations for each day of the hydrological year (Fig. 1b). Results from the calibration against the full dataset, including continuous water levels and continuous discharge, provide information about how good model simulations could be at best (see Section 2.4 for calibration details). These simulations, therefore, served as an upper benchmark (UB WL_365Q ).

Approaches based on active learning
Active learning could guide the decision about when to measure discharge to obtain the most informative data for model calibration. The basic idea is that discharge observations are most valuable for constraining a hydrological model on days of high model simulation uncertainty. In other words, we hypothesize that model parameterization needs most support from discharge observations when simulations disagree most.
Active learning is an iterative process in which model parameterization is improved by adding (discharge) information at each iteration. Here, we conducted a total of ten sampling iterations that represent ten individual field trips. The active learning approach adopted here followed five main steps, whereby steps two to five were repeated for each of the ten sampling iterations:
• First, an initial set of 100 parameter sets was obtained by calibration against water levels (LB WL) or by a random selection of parameter values.
• Second, the model was run using these 100 parameter sets, which resulted in a range of possible hydrographs for the same forcing input.
• Third, simulation uncertainty was calculated at each time step using the difference between the 5th and 95th quantiles of the simulated discharge time series.
• Fourth, a discharge measurement was selected at the time step with the highest simulation uncertainty. We selected discharge observations by alternating between the highest absolute uncertainty and the highest relative uncertainty to give similar weight to different flow conditions. For example, the absolute uncertainty was used in the first iteration, the relative uncertainty was used in the second iteration, and so on.
• Fifth, the hydrological model was recalibrated taking into account the discharge observation(s) obtained in step four.
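Steps three and four can be sketched in a few lines. The sketch below is not the implementation used in the study: the function name is hypothetical, the relative uncertainty is assumed to be the quantile range divided by the ensemble median, and excluding already sampled days is an added assumption.

```python
import numpy as np

def select_sampling_day(sim_q, iteration, sampled=()):
    """Pick the day with the highest simulation uncertainty.

    sim_q     : array (n_parameter_sets, n_days) of simulated discharge
    iteration : 1-based sampling iteration; odd iterations use absolute,
                even iterations use relative uncertainty
    sampled   : indices of days already measured (excluded; an assumption)
    """
    sim_q = np.asarray(sim_q, dtype=float)
    q05, q50, q95 = np.percentile(sim_q, [5, 50, 95], axis=0)
    spread = q95 - q05  # absolute uncertainty per day
    if iteration % 2 == 1:
        score = spread.copy()
    else:
        # Relative uncertainty: normalising by the ensemble median is an assumption
        score = spread / np.maximum(q50, 1e-9)
    score[np.asarray(list(sampled), dtype=int)] = -np.inf
    return int(np.argmax(score))
```

In a full loop, the discharge observed on the returned day would be added to the calibration data, the model recalibrated 100 times, and the new ensemble passed back to this function for the next iteration.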
The information collected by active learning was used for model calibration in three different ways:
• AL WL_10Q: Water-level time series and point discharge observations were used for model calibration (Fig. 1c). The date of the first discharge observation was defined from simulations with the lower benchmark only (LB WL).
• AL WL_nQ: The same procedure was applied as for AL WL_10Q. However, the discharge time series was extended by assuming that an observed discharge value was representative for all time steps with a comparable water level (Fig. 1d). Comparable water levels were defined as levels for which the corresponding discharge was within ±5% of the discharge observation on that day (note that water levels were derived from discharge time series as described in Section 2.1).
• AL 10Q: Only point discharge observations were used for model calibration (Fig. 1e). The date of the first discharge observation was selected based on the uncertainty range of simulations with randomly selected parameter values.
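The extension of a single observation to all days with a comparable water level (AL WL_nQ) can be illustrated with a short sketch. Because the synthetic water levels were derived from discharge, comparable days are identified here directly from the discharge series; the function name and return format are assumptions for this illustration.

```python
import numpy as np

def extend_observation(discharge, obs_day, tol=0.05):
    """Find all days whose water level is comparable to that on obs_day.

    A day is comparable if its discharge lies within +/- tol (5%) of the
    discharge observed on obs_day. Returns the indices of these days and
    the observed discharge value assigned to them.
    """
    q = np.asarray(discharge, dtype=float)
    q_obs = q[obs_day]
    comparable = np.abs(q - q_obs) <= tol * q_obs
    return np.flatnonzero(comparable), float(q_obs)
```

Applied after each field trip, this turns ten measurements into a longer, still discontinuous discharge series, which is why the number of observations for AL WL_nQ varies between catchments.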

Approach based on hydrological expert knowledge
Informative discharge days could alternatively be determined based on hydrological expert knowledge (Fig. 1f). Here we defined such an expert-based discharge collection strategy (HE WL_10Q) using findings from a previous study (Pool et al., 2017). The strategy consisted of ten discharge observations: one at the annual peak, three on the first three subsequent recession days, and six on the 15th of every other month. If the 15th of a month coincided with the annual peak or its recession, we randomly selected an alternative day within that month. Model calibration was based on these point discharge observations and water-level time series, whereby discharge observations were iteratively added, starting with the peak flow information.
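The expert-based strategy can be sketched as a simple date selection. Several details are assumptions made for this illustration: the months used for the bimonthly samples (October, December, February, April, June, August), the order in which the monthly samples are added, and the function name. The input series is assumed to start on October 1.

```python
import datetime as dt
import random

def expert_sampling_dates(dates, discharge, seed=0):
    """Select ten sampling dates within one hydrological year.

    dates     : list of datetime.date, starting on October 1
    discharge : daily discharge values aligned with dates
    """
    rng = random.Random(seed)
    peak = max(range(len(dates)), key=lambda i: discharge[i])
    # Annual peak followed by the first three recession days
    chosen = list(dates[peak:peak + 4])
    start_year = dates[0].year
    for month in (10, 12, 2, 4, 6, 8):  # assumed choice of "every other month"
        year = start_year if month >= 10 else start_year + 1
        day = dt.date(year, month, 15)
        if day in chosen:
            # The 15th coincides with the peak or its recession: draw another day
            options = [d for d in dates
                       if d.year == year and d.month == month and d not in chosen]
            day = rng.choice(options)
        chosen.append(day)
    return chosen
```

If the annual peak falls within the last three days of the hydrological year, this sketch returns fewer than ten dates; how the study handled that case is not stated.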

Model calibration using point discharge observations and water-level time series
The HBV model was calibrated for each study catchment using continuous daily meteorological input and streamflow information according to the six data collection approaches. Independent calibrations were run with data from the ten hydrological years between 1990 and 1999. The 33 months preceding the calibration periods were used for model warm-up to start model calibration from suitable initial state variables.
Parameter values were optimized within predefined feasible ranges using a genetic algorithm (Seibert, 2000) that selected and recombined an initial random set of fifty parameter values over 3500 model runs (note that no local Powell runs were conducted). Calibration was based on the two performance metrics R NS_sqrtQ_adj and R S (Table 2). R NS_sqrtQ_adj is originally a bounded version of the Nash-Sutcliffe efficiency R NS (Nash and Sutcliffe, 1970) that was proposed by Mathevet et al. (2006). R NS_sqrtQ_adj was used to minimize the error between simulated and observed square root-transformed discharge observations. Model optimization against water levels was based on the Spearman rank correlation R S (Spearman, 1904) as proposed by Seibert and Vis (2016). R S transforms the values of a time series into a sequence of ranks and thereby reduces the information of a continuous discharge time series to its dynamical aspects. Both calibration metrics used in this study can vary between −1 and 1, with 1 representing a perfect fit. The two metrics were averaged arithmetically with equal weights for model calibrations against water levels and point discharge observations.
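The combined calibration objective can be illustrated with a minimal sketch. The formulas of Table 2 are not reproduced in this text, so the bounded efficiency is assumed here to take the form NSE / (2 − NSE), which maps the Nash-Sutcliffe efficiency onto the interval from −1 to 1; any further adjustment implied by the subscript "adj" is not included. All function names are hypothetical.

```python
import numpy as np

def ranks(x):
    """Ranks of a series, with tied values given their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        r[x == v] = r[x == v].mean()
    return r

def spearman(sim, levels):
    """Spearman rank correlation between simulated discharge and water levels."""
    return float(np.corrcoef(ranks(sim), ranks(levels))[0, 1])

def bounded_nse_sqrt(sim, obs):
    """Bounded NSE on square root-transformed discharge (assumed form)."""
    s = np.sqrt(np.asarray(sim, dtype=float))
    o = np.sqrt(np.asarray(obs, dtype=float))
    nse = 1.0 - np.sum((s - o) ** 2) / np.sum((o - o.mean()) ** 2)
    return float(nse / (2.0 - nse))

def objective(sim, levels, obs_days, obs_q):
    """Equal-weight mean of the water-level and point-discharge metrics."""
    sim = np.asarray(sim, dtype=float)
    return 0.5 * (spearman(sim, levels) + bounded_nse_sqrt(sim[obs_days], obs_q))
```

The Spearman term uses every day of the water-level series, whereas the efficiency term only uses the days with a discharge measurement, so it requires at least two observations with different values.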
For each calibration step, 100 independent calibrations were conducted to account for parameter uncertainty, resulting in 100 possible hydrographs for the same forcing input. The described calibration procedure was repeated for each of the ten sampling iterations of the active learning and expert knowledge-based data collection approaches.

Evaluation of the value of point discharge observations and water-level time series

Characterization of point discharge observations
As a first step of the analyses, we characterized the sample of discharge observations resulting from the data collection approaches in terms of seasonal distribution and representation of streamflow classes. The seasonal distribution of discharge observations was analyzed using circular statistics (Pewsey et al., 2013). Circular statistics use the unit circle as the basis for the calculation of trigonometric moments, such as measures of location and concentration. Following the theory in Pewsey et al. (2013, Ch. 3.1-3.4, p. 21-29) and the hydrological example provided in Hall and Blöschl (2018), we first converted the dates of discharge observations to angular values measured in radians. The mean sampling date (sample mean direction) and concentration index (sample mean resultant length) were then calculated to describe the sample distribution of the discharge observations from all ten sampling years. A concentration index of 1 indicates that discharge observations were tightly clustered around the mean sampling date. In contrast, smaller index values indicate a large spread of sampling dates (a uniform distribution around the year would result in a value of zero). For a more detailed description of circular statistics, we refer the reader to Pewsey et al. (2013).
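The two circular statistics can be computed as sketched below. The conversion of dates to angles with a fixed year length of 365 days and the function name are assumptions for this illustration.

```python
import numpy as np

def circular_summary(day_of_year, year_length=365.0):
    """Mean sampling date and concentration index of sampling dates.

    day_of_year : sampling dates expressed as day numbers within the year
    Returns (mean_day, concentration), where concentration is the sample
    mean resultant length (1 = identical dates, 0 = no preferred season).
    """
    theta = 2.0 * np.pi * np.asarray(day_of_year, dtype=float) / year_length
    c, s = np.cos(theta).mean(), np.sin(theta).mean()
    # Sample mean direction, mapped back to the range [0, 2*pi)
    mean_angle = np.arctan2(s, c) % (2.0 * np.pi)
    mean_day = mean_angle * year_length / (2.0 * np.pi)
    return float(mean_day), float(np.hypot(c, s))
```

Working on the unit circle ensures that dates at the end and the beginning of a year, such as days 360 and 5, average to a date around the turn of the year rather than to mid-year.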
To gain insights into the distribution of discharge observations at the event scale, discharge observations were classified by streamflow class. Four streamflow classes were considered including the event peak, falling limb of an event, rising limb of an event, and baseflow between two events. The classification was based on the event definition of Sikorska et al. (2015) that was used in Swiss catchments representing a range of runoff regimes. An event was defined as the period that includes a peak flow day, i.e., a day at which the flow reaches a maximum within any moving window of fifteen days. The start of an event was then defined as the day with the minimum flow over five days before an event peak. The first day after the event peak with streamflow of less than 20% of peak flow was considered the end of an event.
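The event definition above can be turned into a simple classifier, sketched here under several assumptions: the fifteen-day window is centred on the candidate peak day, the latest day is chosen when several days share the minimum flow before the peak, the end day of an event is counted as part of the falling limb, and later events overwrite the labels of earlier ones where they overlap. The function name is hypothetical.

```python
import numpy as np

def classify_streamflow(q, window=15, lookback=5, end_fraction=0.2):
    """Label each day as 'peak', 'rising', 'falling', or 'baseflow'."""
    q = np.asarray(q, dtype=float)
    n, half = len(q), window // 2
    classes = np.full(n, "baseflow", dtype=object)
    for p in range(n):
        win = q[max(0, p - half):min(n, p + half + 1)]
        # Skip days that are not the window maximum, and flat windows
        if q[p] < win.max() or win.max() == win.min():
            continue
        # Event start: minimum flow in the five days before the peak
        before = q[max(0, p - lookback):p]
        start = p - 1 - int(np.argmin(before[::-1])) if before.size else p
        # Event end: first day after the peak below 20% of the peak flow
        below = np.flatnonzero(q[p + 1:] < end_fraction * q[p])
        end = p + 1 + int(below[0]) if below.size else n - 1
        classes[start:p] = "rising"
        classes[p + 1:end + 1] = "falling"
        classes[p] = "peak"
    return classes
```

Counting the labels at the sampled days then gives the share of observations per streamflow class.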

Model performance
In the second part of the analysis, we evaluated the model performance related to the six hypothetical data collection approaches. The approaches were thereby evaluated in an independent validation period covering the hydrological years 2000-2009. The continuous daily discharge simulations of the validation years were used to calculate five different performance metrics representing different aspects of the hydrograph (Table 2). R S and R VE were used to assess daily streamflow dynamics and annual volume separately. R NS calculated from untransformed (R NS ), square root-transformed (R NS_sqrtQ ), and log-transformed (R NS_logQ ) time series served to evaluate the daily dynamics and magnitude of high, mean, and low flows.
In addition, each of the five performance metrics was input to the relative model performance metric R*. As suggested by Girons Lopez and Seibert (2016), R* was used as an indicator for the relative value of the active learning and expert knowledge-based data collection approaches compared to the lower and upper benchmarks.
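Since the formula for R* from Table 2 is not reproduced in this text, the following sketch assumes the usual linear scaling between the two benchmarks, R* = (R D − R LB) / (R UB − R LB). Both the formula and the function name should be read as assumptions rather than as the definition used in the study.

```python
def relative_performance(r_d, r_lb, r_ub):
    """Relative model performance between lower and upper benchmark.

    r_d  : performance with limited data
    r_lb : performance of the lower benchmark (water levels only)
    r_ub : performance of the upper benchmark (full discharge series)
    Assumed form: 0 means no gain over the lower benchmark, 1 means the
    upper benchmark is reached, negative values mean a loss in performance.
    """
    return (r_d - r_lb) / (r_ub - r_lb)
```

Under this assumed form, a value of 0.5 means that the limited data closed half of the gap between the water-level-only calibration and the calibration against the full discharge record.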
Overall, model performance related to the six data collection approaches was calculated for 100 model parameterizations in 10 calibration years and 100 catchments. Unless stated otherwise, model performances for each catchment were aggregated by calculating the median of the 100 simulations and the 10 sampling years.
Finally, the median model performance values were evaluated in terms of their spatial distribution. Maps of the value of water-level time series and point discharge observations, as quantified by the model performance improvement, were used to visually investigate which parts of the contiguous United States a particular type of data was most
Table 2. Performance metrics used for evaluating model performance in the validation period and metrics optimized during model calibration. The relative model performance metric R* was calculated using the performance with limited data (R D), the lower benchmark (R LB), and the upper benchmark (R UB). Abbreviations used in the equations refer to observed (obs) and simulated (sim) discharge (Q), time step i of a time series of length n, and the rank S of time step i within the time series.

Metric Description Formula
Evaluation metrics: R NS, Nash-Sutcliffe efficiency, additionally calculated using square root-transformed (R NS_sqrtQ) and log-transformed (R NS_logQ) discharge.

Fig. 2. Seasonal distribution of point discharge observations resulting from the data collection approaches (a) AL WL_10Q (active learning with water levels and ten discharge observations), (b) AL WL_nQ (active learning with water levels and n discharge observations), (c) AL 10Q (active learning with ten discharge observations), and (d) HE WL_10Q (hydrological expert knowledge with water levels and ten discharge observations). The colours indicate the mean sampling date of all discharge observations after ten iterations. A large (small) marker size indicates that observations were strongly (weakly) concentrated around the mean sampling date. The mean sampling date and the concentration index were calculated from the discharge sampling dates of all ten sampling years.

Fig. 3. Streamflow classes represented in the point discharge observations resulting from the data collection approaches AL WL_10Q (active learning with water levels and ten discharge observations), AL WL_nQ (active learning with water levels and n discharge observations), AL 10Q (active learning with ten discharge observations), and HE WL_10Q (hydrological expert knowledge with water levels and ten discharge observations). The y-axis indicates the percentage of discharge observations in a given streamflow class in iteration 1-10, whereby values represent an average over all 100 catchments. The last column indicates the mean frequency of a streamflow class over all ten iterations. The total number of discharge observations in the final iteration was ten for AL WL_10Q, AL 10Q, and HE WL_10Q, and ranged from 32 to 303 for AL WL_nQ (average of all catchments was 67 discharge observations).
hydrological regime, we additionally selected the hydrological signatures of mean daily discharge, runoff ratio, and baseflow index. Maps with information on the geographical regions, hydrological regimes, climatic indices, and hydrological signatures can be found in the appendix (Figs. A.1 and A.2).

When are point discharge observations most informative?
The characterization of discharge observations in terms of mean sampling date, seasonal concentration, and streamflow class allowed us to explore the timing of the most informative observations (Figs. 2 and 3). Using active learning to decide on the timing of discharge observations resulted in a strong spatial variability in the mean sampling date that tended to follow the annual peak discharge season. Mean sampling dates were thus observed in fall and winter for the Pacific Northwest and the mountainous regions of the Atlantic Coast states, in spring and early summer in the Rocky Mountains, the adjoining Great Basins and the Great Lakes Region, and in fall along the Gulf Coast. The seasonality in sampling dates was indirectly reflected in the distribution of streamflow classes that indicated a tendency towards observations during the peak and falling limb of events.
For AL WL_10Q and AL 10Q , the concentration of informative discharge observation dates was most pronounced in snow-dominated catchments located in the Rocky Mountains, the Great Basins and the Great Lakes Region. As could be expected, discharge observations were spread across the year if an observation was assumed to exist at all time steps with a similar water level (AL WL_nQ ). The number of observations collected after ten iterations with AL WL_nQ ranged from 32 to 303 (with an average of 67), whereby more observations were collected with increasingly arid conditions or with increasing importance of baseflow. The use of active learning (AL WL_10Q and AL 10Q ) or hydrological expert knowledge (HE WL_10Q ) led to surprisingly similar selections of sampling dates as characterized by their mean. However, discharge observations were generally more distributed over the year with HE WL_10Q , which was a direct result of collecting a range of flow classes at different days of the year.

Learning curves: change of model performance with increasing availability of point discharge observations
Learning curves illustrate the learning effect of a model (here, change in model performance) as a function of the additional information. These curves answer the practical question of how many sampling iterations are needed to reach a certain model performance. The learning curves in calibration and validation indicated that the iterative addition of point discharge observations for the calibration of HBV generally increased model performance continuously. The added value of an additional observation decreased as the number of sampling iterations increased (Figs. 4 and 5).
Fig. 4. Learning curves in the calibration period for the model performance metrics R S and R NS_sqrtQ_adj (calibration metrics) as a function of the point discharge observations collected at ten sampling iterations using the data collection approaches AL WL_10Q (active learning with water levels and ten discharge observations), AL WL_nQ (active learning with water levels and n discharge observations), AL 10Q (active learning with ten discharge observations), and HE WL_10Q (hydrological expert knowledge with water levels and ten discharge observations). The learning curves are shown for a year with an average precipitation of a) a snow-dominated catchment in the Northeast (top row; a1 for R S and a2 for R NS_sqrtQ_adj ) and b) a rain-dominated catchment in the Northwest (bottom row; b1 for R S and b2 for R NS_sqrtQ_adj ). The points indicate the median performance of all 100 calibration runs and the lines range from the 5th to 95th performance quantile. Note that the y-axis limits are different for the two performance metrics.
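As an illustration of how such learning curves can be summarized, the following Python sketch computes the median and the 5th to 95th quantile range of model performance per sampling iteration, together with the marginal gain of each additional observation. The function names, the data layout, and the linear-interpolation quantile are assumptions made for this example and are not taken from the study's code.

```python
from statistics import median


def quantile(values, q):
    """Linear-interpolation quantile of a list of numbers (0 <= q <= 1)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


def learning_curve(performance):
    """Summarize performance per sampling iteration.

    performance[i] is assumed to hold the scores of all calibration
    runs obtained after sampling iteration i + 1.
    """
    curve = []
    for iteration, scores in enumerate(performance, start=1):
        curve.append({
            "iteration": iteration,
            "median": median(scores),
            "q05": quantile(scores, 0.05),
            "q95": quantile(scores, 0.95),
        })
    return curve


def marginal_gains(curve):
    """Change in median performance from one iteration to the next."""
    medians = [point["median"] for point in curve]
    return [later - earlier for earlier, later in zip(medians, medians[1:])]
```

With hypothetical scores such as `[[0.2, 0.4, 0.6], [0.5, 0.6, 0.7], [0.6, 0.65, 0.7]]`, the marginal gains shrink from one iteration to the next, which is the pattern of diminishing added value described above.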
Calibration results for two example catchments, one snow-dominated (Fig. 4a) and one rain-dominated (Fig. 4b), suggest a more consistent performance over all 100 calibration runs with an increasing number of point discharge observations. This effect is stronger for calibration against point discharge observations only (AL 10Q ) than for calibration against discharge observations and water levels (AL WL_10Q , AL WL_nQ , and HE WL_10Q ). The effect is also more pronounced for R NS_sqrtQ_adj than for R S as the latter is by definition relatively well simulated by using continuous water level time series.
Validation results for all 100 catchments indicate that the value of point discharge observations for model calibration varied considerably between catchments (grey area in Fig. 5). The variability was lowest when data were collected using active learning (as opposed to using hydrological expert knowledge) or when simulations were evaluated focusing on mean or high flows (R NS , R NS_sqrtQ , and R VE ). Furthermore, the value of point discharge observations tended to become more similar across catchments for an increasing number of sampling iterations.
The median performance for all catchments was used to analyze the learning curves for the relative model performance R* in the validation period (Fig. 6). Results indicated that the value of point discharge observations for improving water-level based model calibration was on average highest for annual volume estimates followed by high flows, mean flows and low flows. More specifically, model performance after ten sampling iterations improved by 58%-84% for R* VE , by 38%-93% for R* NS , by 22%-79% for R* NS_sqrtQ , and by 10%-83% for R* NS_logQ .
The majority of the simulation results are encouraging for the approach of collecting a few point discharge observations, whereby as few as two to six observations are typically already (highly) beneficial for model calibration. However, it is important to note that a small number of observations could, in some cases, also be disinformative for model calibration. This was especially the case for the collection of discharge based on hydrological expert knowledge, where a calibration with fewer than five point discharge observations negatively affected the simulation of mean and low flows (Fig. 6). A further exception was the simulations evaluated with R S , for which model performance decreased when any discharge observations were added to a previous calibration against water levels (Fig. 5; note that this was expected since a calibration against water levels was based on R S ).
Fig. 5. Learning curves in the validation period for the model performance metrics R S , R VE , R NS_sqrtQ , R NS , and R NS_logQ as a function of the point discharge observations collected at ten sampling iterations using the data collection approaches AL WL_10Q (active learning with water levels and ten discharge observations), AL WL_nQ (active learning with water levels and n discharge observations), AL 10Q (active learning with ten discharge observations), and HE WL_10Q (hydrological expert knowledge with water levels and ten discharge observations). The coloured line indicates the median performance of all 100 catchments and the grey shaded area represents the 25th and 75th performance quantile. Note that the y-axis limits are different for different performance metrics. Some values extend below the lower limit of the y-axis and are plotted onto the x-axis.

Relative value of point discharge observations and water-level time series
By looking at the relative value of discharge and water levels for model calibration, we analyzed for which hydrograph aspects (represented by the evaluation performance metrics) and for which catchments the two different types of data were more informative. The analysis was based on the validation model performance for each catchment after ten sampling iterations (Fig. 7). Spatial differences in the value of discharge and water levels are presented with a focus on the performance metric that showed the highest benefit of a certain data type (Figs. 8 and 9).
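The relative value used in this comparison is a per-catchment performance difference (ΔR) between two data collection approaches. A minimal Python sketch of that calculation is given below; the catchment identifiers, the dictionary layout, and the summary statistics are hypothetical choices for illustration only.

```python
from statistics import median


def relative_value(perf_a, perf_b):
    """Per-catchment performance difference, delta R = R_a - R_b.

    perf_a and perf_b map a catchment identifier to the validation
    performance of approach A and approach B, respectively.
    Only catchments present in both dictionaries are compared.
    """
    return {c: perf_a[c] - perf_b[c] for c in perf_a if c in perf_b}


def summarize(delta):
    """Median difference and share of catchments where A outperforms B."""
    values = list(delta.values())
    improved = sum(1 for v in values if v > 0)
    return {"median": median(values), "share_improved": improved / len(values)}
```

Positive values of ΔR then indicate catchments in which approach A (for example, water levels plus discharge) performed better than approach B (for example, water levels only).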

Value of point discharge observations
The comparison of performance metrics between AL WL_10Q and LB WL suggested that point discharge observations provided model calibration with information on streamflow volumes that was missing when only water levels were available (Fig. 7a). Point discharge observations were beneficial for all performance metrics (except for R S ), whereby the effect was most pronounced for R NS (high flows). While simulated high flows were improved in the majority of catchments all over the contiguous United States, calibration against water levels and discharge was most valuable in (semi-) arid catchments (Figs. 8a and 9).
Point discharge observations collected with both AL WL_10Q and HE WL_10Q generally improved model performance. Yet the choice of the data collection approach could make a difference in the effectiveness of point discharge observations when evaluating simulations in terms of R NS and R NS_logQ (Fig. 7b). In arid catchments as well as in baseflow- or snowfall-dominated catchments, low flows were better simulated when point discharge observations were selected based on HE WL_10Q . In contrast, observations from AL WL_10Q were more informative in these catchments for the simulation of high flows (Figs. 8b and 9). In relatively humid or rain-dominated catchments, R NS and R NS_logQ were not distinctly different if point discharge observations were selected based on HE WL_10Q or AL WL_10Q .
Fig. 6. Learning curves in the validation period for the relative model performance metrics R* VE , R* NS_sqrtQ , R* NS , and R* NS_logQ as a function of the point discharge observations collected at ten sampling iterations using the data collection approaches AL WL_10Q (active learning with water levels and ten discharge observations), AL WL_nQ (active learning with water levels and n discharge observations), AL 10Q (active learning with ten discharge observations), and HE WL_10Q (hydrological expert knowledge with water levels and ten discharge observations). The curves show the median performance for all 100 catchments. Note that values below the lower limit of the y-axis are plotted onto the x-axis.

Value of water-level time series
The value of water-level time series for model calibration was first evaluated by comparing simulations based on water levels and discharge (AL WL_10Q ) with simulations based on discharge only (AL 10Q ). Results demonstrated that water levels improved the simulation of streamflow dynamics (R S ) in all catchments (Fig. 7c). Also, model performance for metrics sensitive to the dynamics of flow magnitudes (in particular R NS and R NS_sqrtQ ) could often be slightly improved by the combined use of water levels and discharge. Spatially, water levels were most informative for model calibration in catchments with rather dry conditions and summer rainfall (Figs. 8c and 9).
Finally, the use of water-level time series to extend the observed discharge time series (AL WL_nQ ) led to an increased model performance for all metrics that evaluate volume-related hydrograph aspects (Fig. 7d). Thereby, low-flow simulations improved the most, especially in catchments with prolonged periods of relatively constant (low) flow conditions, such as arid or snow-influenced catchments (Figs. 8d and 9).
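The idea behind AL WL_nQ, namely treating a point discharge observation as valid for all time steps with a similar water level, can be sketched as follows in Python. The tolerance that defines "similar", the nearest-level rule, and all names are assumptions of this example. The sketch transfers the observed discharge value to the matching days, which is what a practitioner without further discharge data would have to do.

```python
def extend_observations(water_levels, observed, tolerance):
    """Extend point discharge observations to days with a similar water level.

    water_levels: list of daily water levels (index = day).
    observed: dict mapping a sampled day index to its measured discharge.
    tolerance: maximum water-level difference still counted as "similar"
               (an assumed threshold, in the same unit as water_levels).
    Returns a dict mapping every matching day to the discharge of the
    observation with the closest water level.
    """
    extended = {}
    for day, level in enumerate(water_levels):
        best = None  # (water-level gap, discharge) of the closest observation
        for obs_day, discharge in observed.items():
            gap = abs(level - water_levels[obs_day])
            if gap <= tolerance and (best is None or gap < best[0]):
                best = (gap, discharge)
        if best is not None:
            extended[day] = best[1]
    return extended
```

In a series with long periods of nearly constant water level, two or three observations can populate many days in this way, which is consistent with the larger number of extended observations reported for arid and baseflow-dominated catchments.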

Value of point discharge observations and water-level time series
Our results indicated that the collection of water-level and discharge data during a limited number of field visits could be highly valuable for predicting streamflow in previously ungauged catchments. Results thereby confirm earlier findings suggesting that a few months of continuous discharge observations (Brath et al., 2004; Melsen et al., 2014; Sun et al., 2017), or a small number of strategically timed discharge observations (Correa et al., 2016; McIntyre and Wheater, 2004; Pool et al., 2017; Seibert and McDonnell, 2015), can be very informative for model calibration.
Assuming that there is the opportunity to perform a number of streamflow observations, one needs to decide on when to measure which variable, i.e., discharge or water levels (Seibert and McDonnell, 2015). Our findings suggested that the combination of both types of data is advantageous over the use of either water levels or discharge. While continuous water-level time series provided information about streamflow dynamics, selected point discharge observations helped to link these dynamics to streamflow volumes.
As demonstrated by Seibert and Vis (2016), volume information is essential for the prediction of discharge in (semi-) arid catchments. In these catchments, the annual water balance is sensitive to actual evapotranspiration, and the corresponding model parameters could only be constrained if some information on volumes was also available. Independent of the hydroclimatic context, volume information was also found to be important at the event scale. While including point discharge observations in calibration improved the simulation of all flow conditions, model performance improved the most for annual volume estimates (R VE ). The improvement was furthermore larger for high flows than for low flows. This was likely because both active learning and hydrological expert knowledge resulted in the collection of a considerable number of observations at the peak and the falling limb of events, which better constrained model parameters, influencing the intensity of the streamflow response to a given precipitation event.
Fig. 7. Relative value of point discharge observations and water-level time series for model calibration. The relative value corresponds to the performance difference (ΔR) in the validation period between two data collection approaches after ten sampling iterations for each catchment. Positive values indicate an increased performance if (a) point discharge observations were used in addition to water-level time series, (b) hydrological expert knowledge was used to select point discharge observations as opposed to the use of active learning, (c) water-level time series are used in addition to point discharge observations only, and (d) point discharge observations were assumed to be representative for all dates with a similar water level. AL WL_10Q is active learning with water levels and ten discharge observations, AL WL_nQ is active learning with water levels and n discharge observations, AL 10Q is active learning with ten discharge observations, HE WL_10Q is hydrological expert knowledge with water levels and ten discharge observations, and LB WL is the lower benchmark with water levels. Note that the lower boxplot whisker extends to −0.47 in the case of R NS_logQ in subplot (c) (marked by *).
Our results suggested that the installation of a water-level logger at the beginning of a field campaign is beneficial for two reasons. First, as opposed to a calibration exclusively based on point discharge observations, considering water-level information improved the simulation of any hydrograph characteristic related to streamflow dynamics in most of the catchments. This was likely due to the high temporal resolution of the water-level time series used for model calibration. The benefit of water levels was therefore especially pronounced in catchments where active learning and hydrological expert knowledge led to a temporally concentrated collection of point discharge observations. Second, as demonstrated by Lebecherel (2015), using water-level time series to extend the observed discharge time series could be an effective way to reduce the number of field trips, in particular, if a catchment is characterized by prolonged periods of similar flow conditions. However, results reported here have to be considered as optimistic, because model calibration for AL WL_nQ was based on the actual discharge values and not on values approximated by the originally 'observed' value.

Value of active learning for the collection of point discharge observations
A main objective of this study was to explore the value of active learning for selecting the most informative points in time for discharge observations as opposed to a decision based on hydrological expert knowledge. The use of active learning and hydrological expert knowledge resulted in surprisingly similar mean sampling dates. These sampling dates were typically aligned with hydrologically 'active' season(s). As a consequence, sampling dates were most concentrated in catchments with a pronounced annual peak flow, such as snow-dominated or winter-precipitation-dominated catchments. Our results thereby indicate that active learning (i.e., model uncertainty) could guide the timing of point discharge observations towards hydrologically meaningful periods, which are generally in agreement with an expert's decision on the timing of informative sampling dates. Furthermore, the results confirmed the importance of observations during subperiods of high parameter sensitivity (Harlin, 1991), especially when constraining model parameters with limited data.
The set of point discharge observations collected with active learning is the result of minimizing prediction uncertainty with the least number of observations. For this reason, the timing of such observations is typically model-specific (Crawford et al., 2013). For hydrological applications, this not only means that results could be different for different models, but also that the timing of the final set of point discharge observations collected by active learning is subject to model uncertainty and input uncertainty (in particular disinformative events; Beven and Westerberg, 2011). In contrast, the timing of point discharge observations based on hydrological expert knowledge is defined a priori, and their selection is therefore not directly affected by model uncertainty and input uncertainty. However, in the context of this study, active learning and hydrological expert knowledge were applied to the same hydrological model under identical forcing input. Results presented here for active learning and hydrological expert knowledge should therefore be directly comparable.
Fig. 8. [...] assumed to be representative for all dates with a similar water level. AL WL_10Q is active learning with water levels and ten discharge observations, AL WL_nQ is active learning with water levels and n discharge observations, AL 10Q is active learning with ten discharge observations, HE WL_10Q is hydrological expert knowledge with water levels and ten discharge observations, and LB WL is the lower benchmark with water levels. Note that the colour scales are different in (a) to (d).
In this study, we applied active learning for the collection of point discharge observations during a hydrological year without respecting the temporal sequence of the observations. Results, and in particular model performance from calibrations with active learning, therefore provide an indication of how valuable active learning could be at best. Given the conceptual advantages and the practical limitations of active learning, we argue that active learning is especially valuable for complementing the collection of data based on expert knowledge. More specifically, expert knowledge could be used to decide on the timing of the first few field observations. Subsequently, active learning could guide the timing of additional measurements by providing information on flow situations that would be most informative.
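The combined strategy suggested above can be sketched in Python as follows: the first sampling days are fixed by expert knowledge, and each further day is the one on which an ensemble of model simulations disagrees most. Using the range across ensemble members as the uncertainty measure is an assumption of this example and not necessarily the selection criterion used in the study; all names are hypothetical.

```python
def next_sampling_day(ensemble, sampled):
    """Pick the unsampled day with the largest spread across simulations.

    ensemble: list of simulated discharge series, one per parameter set,
              all of equal length.
    sampled: set of day indices that already have an observation.
    Returns the selected day index, or None if every day is sampled.
    """
    best_day, best_spread = None, -1.0
    for day in range(len(ensemble[0])):
        if day in sampled:
            continue
        values = [series[day] for series in ensemble]
        spread = max(values) - min(values)  # assumed uncertainty measure
        if spread > best_spread:
            best_day, best_spread = day, spread
    return best_day


def plan_campaign(ensemble, expert_days, n_total):
    """Start from expert-chosen days, then add days by ensemble spread.

    In a real application the model would be recalibrated and the
    ensemble updated after every new observation; this sketch keeps the
    ensemble fixed for brevity.
    """
    sampled = list(expert_days)
    while len(sampled) < n_total:
        day = next_sampling_day(ensemble, set(sampled))
        if day is None:
            break
        sampled.append(day)
    return sampled
```

As in the study set-up, this sketch ignores the temporal sequence of observations, so a selected day may lie before an already sampled one.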

Limitations of the study set-up
Our findings provided evidence that the prediction of streamflow in a previously ungauged basin can be greatly improved by the collection of a relatively small amount of local hydrological information. These encouraging results are based on a number of idealized assumptions that might be challenged when moving from a modelling study into practice.
The first major assumption made in this study was the perfectly known forcing time series. In practice, uncertain weather forecasts can lead to point discharge observations being collected too early or too late. Results from previous studies with a limited amount of streamflow or water-level information suggested that a good coverage of a range of flow conditions is likely more important than the exact timing of observations (Etter et al., 2020; Pool et al., 2017). However, the importance of timing might depend on the flow regime of a catchment. Wright et al. (2015) showed that the influence of single discharge observations on model performance could be considerably larger in an arid catchment than in a humid catchment. The effect of a mismatch in the timing of observations might also differ among flow classes. We expect that the importance of accurate timing of observations is strongest for peak flows as they indicate the reactivity of a catchment to precipitation. In contrast, the timing might be less relevant during event recessions or baseflow conditions when similar hydrological processes dominate over a longer period. The sensitivity of model performance to the timing of point discharge observations is probably similar for active learning and expert knowledge because both data collection approaches led to a similar frequency of streamflow classes.
A further simplification of this study is the use of mean daily streamflow values as opposed to the use of instantaneous measurements taken during field visits. The difference between instantaneous discharge (discharge reported at 15-minute intervals) and mean daily discharge of the CAMELS dataset was small during recession and low-flow periods, but could be considerable during peak flows. This difference is expected to be most pronounced for either catchments or days with high streamflow variability and probably requires some attention when such field observations are used for model calibration.
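The size of this effect for a given day can be checked with a short calculation. The Python sketch below compares a single instantaneous value with the mean of all 15-minute values of the same day; the function names and the example values are hypothetical.

```python
def daily_mean(instantaneous):
    """Mean daily discharge from sub-daily values (e.g. 96 values at 15 min)."""
    return sum(instantaneous) / len(instantaneous)


def relative_difference(instantaneous, visit_index):
    """Relative deviation of the value measured at the time of the visit
    from the mean daily discharge of that day."""
    mean_q = daily_mean(instantaneous)
    return (instantaneous[visit_index] - mean_q) / mean_q
```

For a recession day with nearly constant flow the deviation is close to zero, whereas for a hypothetical event day on which discharge rises from 1 to 5 units at midday, a visit in the afternoon would overestimate the daily mean of 3 units by about two thirds.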
Another basic assumption of this study was that rating curves are time-invariant. In practice, rating curves can change considerably due to changes in the cross-section of a river, backwater, or hysteresis effects (McMillan and Westerberg, 2015). Such changes can affect our modelling results in two ways. First, water-level time series were derived from discharge time series (see Section 2.1) and substantial intra-annual changes in the rating curve could mislead model parameterization. Second, the success of the active learning approach in which point discharge observations were assumed to be valid for all time steps with comparable water level (AL WL_nQ ) relies on a time-invariant rating curve. The value of AL WL_nQ might therefore be overestimated in catchments with considerable rating curve changes within a hydrological year.
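The consequence of a changing rating curve for transferring discharge between days with the same water level can be illustrated with a power-law rating curve, Q = a (h − h0)^b, which is a commonly used functional form. The form itself and all parameter values in the Python sketch below are assumptions for illustration; the way water levels were derived in this study is described in Section 2.1 and is not reproduced here.

```python
def rating_curve(level, a, b, h0):
    """Power-law rating curve Q = a * (h - h0)**b (assumed form).

    Returns zero discharge when the water level is at or below h0.
    """
    depth = level - h0
    return a * depth ** b if depth > 0 else 0.0


def transfer_error(level, original, shifted):
    """Relative error made when a discharge valid under the original
    curve is reused for the same water level after the curve has shifted.

    original, shifted: dicts with the keys a, b and h0.
    """
    q_assumed = rating_curve(level, **original)
    q_actual = rating_curve(level, **shifted)
    if q_actual <= 0:
        raise ValueError("actual discharge must be positive")
    return (q_assumed - q_actual) / q_actual
```

With hypothetical parameters a = 2, b = 2 and a shift of h0 from 0.1 to 0.3 (for example after sediment deposition), a water level of 1.1 corresponds to a discharge of 2.0 before and 1.28 after the shift, so reusing the earlier observation would overestimate discharge by roughly 56%.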

Conclusions
Long continuous discharge time series representing a variety of hydrological conditions are usually seen as a requirement for model calibration. In practice, many catchments have no, or only limited, discharge data. Understanding which data, and how much, are most valuable for model calibration is essential to improve the prediction in ungauged basins. In this study, we contributed to an improved understanding by explicitly comparing the relative value of water-level time series and point discharge observations for model calibration, and by testing a machine learning approach to determine when to collect such a limited number of discharge observations.
Fig. 9. Spearman rank correlation between catchment attributes and the relative value of point discharge observations and water-level time series for model calibration. The relative value corresponds to the performance difference (ΔR) in the validation period between two data collection approaches after ten sampling iterations for each catchment. Positive values indicate an increased performance if (a) point discharge observations were used in addition to water-level time series, (b) hydrological expert knowledge was used to select point discharge observations as opposed to the use of active learning, (c) water-level time series are used in addition to point discharge observations only, and (d) point discharge observations were assumed to be representative for all dates with a similar water level. AL WL_10Q is active learning with water levels and ten discharge observations, AL WL_nQ is active learning with water levels and n discharge observations, AL 10Q is active learning with ten discharge observations, HE WL_10Q is hydrological expert knowledge with water levels and ten discharge observations, and LB WL is the lower benchmark with water levels.
Based on results from simulation experiments for 100 hydroclimatically diverse catchments, the following conclusions can be drawn:
• A small number of point discharge observations contained, surprisingly, a lot of information, and can considerably improve model calibrations based on water-level time series with respect to annual and event-scale streamflow volumes. While model performance continuously improved as the number of observations increased, the incremental improvements were most considerable for the first two to six observations. The value of point discharge observations was highest for (semi-) arid catchments and for the simulation of annual volumes.
• Continuous water-level time series provided valuable information for the simulation of daily streamflow dynamics. Furthermore, water-level time series could reduce the number of field trips if a point discharge observation was assumed to exist at all time steps with a similar water level. Such an extension of the number of discharge observations was most effective in catchments with prolonged periods of relatively constant flow conditions.
• Choosing the date of point discharge observations based on active learning led to similar sampling dates as the selection of dates according to hydrological expert knowledge. In both cases, most observations were selected for the seasons with the highest flows.
Our findings encourage the approach to gauge an ungauged catchment with discharge observations on strategically selected dates. Independent of the geographic region, the most informative sampling dates are typically expected to take place during hydrologically active periods, such as the annual peak discharge, other discharge peaks, and recession situations. Two to six observations during these periods can already be very informative. However, increasing the number of observations to ten allows the collection of additional discharge observations in periods of more constant flow conditions, which is beneficial for a more balanced evaluation of different flow conditions during model calibration. The exact timing of the first few discharge observations could be defined with hydrological expert knowledge; active learning could then be valuable for guiding the timing of the additional observations. Combining such a small number of point discharge observations with (short) continuous water-level time series is a promising way towards improved predictions in ungauged basins.

Author contributions
SP and JS designed this study; SP performed the hydrological simulations; SP and JS analyzed and discussed the results; writing of the paper was led by SP with contributions of JS.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

A2. Hydroclimatic attributes of the 100 study catchments used to explain the spatial distribution in model performance. The aridity index was calculated as the ratio of the sum of potential evapotranspiration and the sum of precipitation (ETo/P). The runoff ratio was calculated as the ratio of the sum of discharge and sum of precipitation (Q/P). The baseflow index was calculated using the EflowStats R-Package from the U.S. Geological Survey (2014).
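The two ratio-based attributes defined above can be computed as in the following Python sketch. It assumes that all series cover the same period and are expressed in the same depth unit (for example mm per day); the function names and example values are hypothetical. The baseflow index is not included because it relies on the EflowStats package.

```python
def aridity_index(pet, precip):
    """Aridity index: sum of potential evapotranspiration over sum of
    precipitation (ETo/P)."""
    return sum(pet) / sum(precip)


def runoff_ratio(discharge, precip):
    """Runoff ratio: sum of discharge over sum of precipitation (Q/P)."""
    return sum(discharge) / sum(precip)
```

For example, with hypothetical daily values ETo = [2, 3, 5], P = [4, 0, 6] and Q = [1, 1, 2], the aridity index is 1.0 and the runoff ratio is 0.4.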