PHEV! The PHysically-based Extreme Value distribution of river flows

Magnitude and frequency are prominent features of river floods informing design of engineering structures, insurance premiums and adaptation strategies. Recent advances yielding a formal characterization of these variables from a joint description of soil moisture and daily runoff dynamics in river basins are here systematized to highlight their chief outcome: the PHysically-based Extreme Value (PHEV) distribution of river flows. This is a physically-based alternative to empirical estimates and purely statistical methods hitherto used to characterize extremes of hydro-meteorological variables. Capabilities of PHEV for predicting flood magnitude and frequency are benchmarked against a standard distribution and the latest statistical approach for extreme estimation, by using both an extensive observational dataset and long synthetic series of streamflow generated for river basins from contrasting hydro-climatic regions. The analyses outline the domain of applicability of PHEV and reveal its fairly unbiased capabilities to estimate flood magnitudes with return periods much longer than the sample size used for calibration in a wide range of case studies. The results also emphasize reduced prediction uncertainty of PHEV for rare floods, notably if the flood magnitude-frequency curve displays an inflection point. These features, arising from the mechanistic understanding embedded in the novel distribution of the largest river flows, are key for a reliable assessment of the actual flooding hazard associated to poorly sampled rare events, especially when lacking long observational records.


Introduction
Reliable estimates of magnitude and frequency of river floods are crucial for a wide range of social and economic activities. For instance, they inform design of engineering structures, insurance premiums, urban planning and adaptation strategies. But despite hundreds of years of efforts, flooding is still the most common natural disaster [1]. Hazard assessment is indeed hampered by processes that might be more variable than suggested by observed records, let alone ongoing global changes [2,3]. This is a problem, as both purely statistical methods hitherto used for characterizing extremes of hydro-meteorological variables [4][5][6][7] and complex hydrological models encoding state-of-the-art knowledge of physical processes [8][9][10][11] heavily rely on observations. Tools pairing physical understanding of the mechanisms producing extreme events with easily tractable mathematical descriptions of them and less demanding data requirements are therefore vital to a more accurate flood risk estimation [12,13].
This letter systematizes and tests advances in the mechanistic-stochastic description of soil moisture and runoff dynamics in river basins to emphasize what might be such a tool: the PHysicallybased Extreme Value (PHEV) distribution of river flows. Perks of this physically-grounded alternative to statistical methods hitherto used for characterizing extremes of hydro-meteorological variables are its being simple as a statistical distribution, mechanistic as a fully-fledged model, and stochastic as nature is. Capabilities of PHEV for predicting flood magnitude and frequency are here benchmarked against leading flood hazard assessment methods, a pivotal milestone

The physically-based extreme value distribution of river flows
The physically-based extreme value distribution of river flows arises from advances in the mathematical description of catchment-scale daily soil moisture and runoff dynamics in river basins [14][15][16]. These well established scientific theories represent rainfall as a marked Poisson process with frequency λ P [T −1 ] and exponentially distributed depths with average α [L]. Infiltration of rainfall into the soil determines stochastic increments of the moisture content, whereas evapotranspiration causes loss of soil moisture from the root zone. Losses linearly vary with the soil moisture between the wilting point and a critical upper threshold akin to the water holding capacity of the soil. The exceedance of this threshold triggers runoff pulses occurring with frequency λ < λ P [T −1 ] and whose magnitudes follow an exponential distribution with average α [L]. These pulses feed a single catchment hydrologic storage, which is finally drained by the river network as streamflow. A nonlinear storage-discharge relation echoes a hydrological response which varies with the catchment water content and mimics the joint effect of different flow components [17].
The mechanistic-stochastic description of runoff generation processes summarized above yields the following expression for the probability distribution of daily streamflows q [18]: which, in addition to λ and α, depends on the coefficient K and exponent a of a power law function (alike the storage-discharge relation mentioned earlier) used to describe hydrograph's recessions. C 1 is a normalization constant.
It also provides a physically-grounded mathematical form for the probability distribution of ordinary peak flows (sensu [19], i.e. local flow peaks occurring as a result of streamflow-producing rainfall events) [20]: which depends on the same set of parameters. C 2 is also a normalization constant.
By further postulating independence of the peak flows, the physically-based probability distribution of flow maxima (i.e. maximum values in a specified time frame) finally emerges as [20]: where τ [T] is the duration in days of the chosen time frame, p j (q) is the probability distribution of peak flows (equation (2)), and D j (q) =´∞ q p j (q) dq is the exceedance cumulative probability of peak flows. PHEV is derived from the solution of the master equation for the probability distribution of streamflow in a catchment [18]. As such, the approach resembles what is done when large simulation ensembles of deterministic hydrological models are jointly considered to account for varied hydrometeorological scenarios and antecedent catchment conditions. PHEV operates in the same way, with the advantage that an analytical solution (equation (3)) is made available thanks to its low-dimensionality.
The description of precipitation and runoff generation mechanisms embedded in PHEV does not encompass all existing rainfall-runoff processes. In particular, intense snow dynamics entailing shift of water volumes across seasons [21] and precipitation mechanisms producing event duration comparable to the time frame of analysis (e.g. monsoonal rainfall, seasonally dry climates) are disregarded. Nonetheless, a large body of literature [17,[22][23][24][25][26][27][28][29] demonstrated the key relevance of the considered mechanisms in a variety of hydro-climatic and physiographic settings.
Equations (1)-(3) form a consistent and physically-grounded set of expressions to characterize the probabilistic structures of daily flows, ordinary peak flows and flow maxima. They describe their magnitudes, likelihoods of occurrence and the related flood hazard based on a set of four physically meaningful parameters (α, λ, a, K) which embody climatic and landscape attributes of the considered catchment. They also enable the possibility to account for changes of climatic conditions and land cover, as shown by [23,29,30]. These probability distributions might therefore constitute a sound alternative to purely statistical distributions commonly used to characterize hydrological variables and their maxima.

Benchmarking PHEV against leading methods for flood hazard assessment
We benchmark the performance of PHEV against two leading statistical models for flood hazard assessment, the generalized extreme value (GEV) [31] and metastatistical extreme value (MEV) [32] distributions. The GEV distribution is a standard tool [33,34] traditionally applied to samples of block maxima under the assumption that the number of events within each block tends towards infinity [35]. The MEV distribution is instead a recently proposed method to estimate extremes from the features of ordinary events, which is currently gaining momentum [36][37][38][39][40][41][42]. It relaxes the mentioned assumptions which lie at the heart of the GEV distribution and regards as random variables both the number of independent ordinary events occurring in the considered time interval and the parameters of the distribution used to describe their Notice that only quantiles with return periods longer than the length of the calibration samples (i.e. T > 10 years) and until the second longest T which could be estimated from the validation sample are shown. Warmer colors in panels (A)-(C) indicate higher density of points and mark the most frequent performances of the methods. Dark green markers in panel (D) identify case studies for which a strong relation between recession coefficient and ordinary peak flows (see panel (F)) was detected. The current mechanistic description of the catchment hydrologic response embedded in PHEV does not consider such a relation and prevents it from providing satisfactory performances in these cases, which can be identified ex ante through hydrograph recession analyses. Light green markers in panel (D) instead indicate catchments where such a relation is weak (see panel (E)). In these cases the mechanistic conceptualization of hydrological processes underpinning PHEV is suitable and the PHEV distribution of river flows guarantees satisfactory performances. The insets of panels (B)-(D) represent the median performances for all case studies in the figures belonging to ten different hazard categories (i.e. the deciles of the observed normalized flood quantiles). (E) and (F) Exemplary case studies displaying (E) weak (USGS ID: 03 253 500, spring) and (F) strong (USGS ID: 03 504 000, summer) relations between recession coefficients and ordinary peak flows. (G) Location of gauges from the MOPEX dataset analyzed in this study (grey dots). Red markers display a subset of river basins from varied hydro-climatic regions that have been selected to perform analyses on 1000 years long synthetic time series (see figure 2). magnitudes, without any restriction on the distribution underlying them. Details on these two methods are available in appendix A and in [19,35].
For the present analyses we used rainfall and streamflow time series of gauges unaffected by climatic and anthropogenic hydrograph alterations from the MOPEX dataset [43,44]. We only retained catchments for which at least 30 years of observations are available, with less than 10% of missing data in each season (December-February, March-May, June-August, September-November). The resulting minimum, mean and maximum lengths of the analyzed records are 35, 52 and 55 years. We performed all computations at seasonal scale [45] to account for the seasonality of rainfall and runoff generation processes. Catchments where precipitation falls as snow (i.e. when average daily temperatures below zero degrees occur during precipitation) for more than 50% of a season were discarded to comply with key hypotheses of the theoretical framework underlying PHEV [30]. This initial data screening resulted into 161 gauges (483 catchment-season case studies) retained for analysis, whose geographical locations are shown in figure 1(G). The Mann-Kendall test applied to seasonal maxima identified significant (p < 0.01) trends in only a marginal fraction (3%) of the case studies, confirming the absence of flood trends for this region highlighted by [46].
We further selected eight stations from different US hydro-climatic regions (figure 1(G) and table B1) to assess the capability of PHEV, GEV and MEV to predict magnitudes of events with return periods much longer than the available length of observations. This reflects typical conditions in the practice, where the magnitude of events with rather long return periods (say, 100-1000 years) must be extrapolated on the basis of relatively short observational samples. To this aim, we utilized a stochastic rainfall generator and a parsimonious hydrological model (see details in [47]) compliant with the main physical assumptions of PHEV (being MEV and GEV purely statistical models which do not require specific assumptions on the underlying processes) to produce 1000 years long synthetic time series of daily rainfall and streamflow for the select catchments [48]. The series were used as a reference and compared to estimates derived from shorter samples extracted from the 1000 years long data series. In this way we created a controlled experiment which enables analyzing the robustness of the estimation methods to non-ergodic data samples only.
Furthermore, we employed a cross-validation procedure [19,38] to analyze the ability of the different methods to extrapolate information outside the range covered by the observations. To do so, we first divided the available series of observed (synthetic) data into two complementary parts by randomly selecting from them (through resampling without substitution) 100 (1000) subsets of S years. These are the training sets used to calibrate the theoretical distributions, whereas the remaining parts constitute the validation samples. The procedure yields calibration and validation samples with no systematic variability. We used S = 10 years for the observations and repeated the process for S = 10, 20, 30 and 60 years for the long synthetic data series to explore how the different models perform with various sizes of the calibration sample.
We employed L-moments [49] to fit GEV on the calibration sample of seasonal maxima and to estimate the parameters of a Gamma distribution adopted to describe the whole set of ordinary flow peaks in the MEV approach. Further details on the calibration of GEV and MEV are available in [38]. Parameters of PHEV were instead either directly derived from daily rainfall and streamflow series (α, λ and a) or obtained through maximum likelihood calibration on the sample of seasonal maxima (K) [20]. In particular, we computed α as the average amount of precipitation in rainy days, λ as the ratio between longterm average streamflow and α, and a as the median value of the recession exponents obtained by fitting a power law function to dq/dt − q pairs observed for each hydrograph recession. The maximum likelihood approach here used to calibrate the parameter K is known to provide frail estimates when small calibration samples are available, as in this application. Although aware of the handicap thus imposed to the performance of PHEV, we were compelled to adopt this method due to the current unavailability of more suitable approaches.
We finally used metrics such as the relative error (appendix C) and quantile-quantile plots [50] to evaluate and compare performances of PHEV, GEV and MEV when both observed data and long synthetic series are analyzed.

Results and discussion
Figures 1(A)-(C) summarizes the performances of PHEV, GEV and MEV for predicting observed flood quantiles with return periods longer than the length of the calibration samples (i.e. 10 years) in all the case studies. Warmer colors indicate higher density of the point clouds and mark the most frequent performances of the methods. PHEV (panel A) tends to slightly underestimate flood magnitudes in the record (i.e. the red-to-yellow core of the point cloud stands below the 45-degree line). Higher variability of its performances compared to GEV and MEV is manifest, especially in the case of high quantiles for which overestimation often occurs. On the contrary, both GEV (panel B) and MEV (panel C) accurately estimate lower flood magnitudes (i.e. the point clouds are centered on the 45-degree lines for low quantiles) but tend to systematically underestimate the larger floods. In the case of MEV, this behavior is likely due to using a Gamma distribution to describe ordinary events, in agreement with [51]. In fact, such a light tailed distribution might underestimate ordinary peaks characterized by low probability, eventually resulting in underestimation of maxima with long return periods. Although the optimization of the MEV methodology is outside the scope of this work, we recognize that the choice of a site-specific distribution of ordinary peaks might reduce underestimation issues in the MEV approach [38].
As for PHEV, previous studies scantily reported about the possible existence of a relation between recession coefficient K and peak flow and its capability to interfere with the characterization of daily flow distributions provided by the mechanistic-stochastic model underlying it [17,52]. Was there such a relation, the parameters describing the hydrologic behavior of the catchment would vary along with the peak flow magnitude, thus precluding the use of a single effective K to portray the catchment hydrologic response and the resulting magnitude of floods. At the same time, such a relation would also suggest caution about the dependability of purely statistical flood estimates beyond the range of recorded magnitudes, as ostensibly good performances of these methods might indicate their higher flexibility to represent conditions which occurred in the observed time frame rather than robustness in estimating uncharted flood magnitudes [53].
To investigate whether this may be the reason for the higher variability of PHEV performances displayed in panel A, we computed the recession coefficient K by fitting a power law function with exponent set equal to a (i.e. the median exponent across all recessions in a case study) to dq/dt − q pairs observed for each hydrograph recession [54], and investigated the possible existence of a relation between K and peak flow in the MOPEX basins. Panels E and F display examples of catchments where we either did not or identified such a relation. We then classified all the case studies into two groups based on the absolute value of the slope of their K-peak flow relation, rendering them in panel D with either light or dark green markers if the relation was respectively weak (i.e. slope ⩽ 0.4) or strong (i.e. slope > 0.4). Notice that choosing alternative thresholds around this value does not substantially impact the key results of the study.
The performance of PHEV remarkably improves in case studies exhibiting weak relations between K and peak flow (i.e. the majority of light green markers lie just below the 45-degree line in panel D, whereas dark green markers mainly stand above it). The Twosample Kolmogorov-Smirnov test for the probability distribution of the relative estimation errors confirms that the two groups are significantly different (pvalue <0.01). The larger variability of performances previously exhibited by PHEV (panel A) substantially diminishes for the set of cases featuring weak K-peak flow relations (the interdecile and interquartile ranges of the relative error decrease by 25% and 48%). Although PHEV tends to underestimate flood magnitudes in the range of observed quantiles, the bias of the estimate seems to be relatively stable for increasing quantiles compared to GEV and MEV.
The proposed analysis of hydrograph recession characteristics is thus an effective method to distinguish ex ante in what conditions PHEV shall be applied, namely in all cases (268 out of 483 in our dataset, which will be referred in the following analyses) when the recession coefficient K is fairly constant across recessions and thereby agrees with the mechanistic description of runoff generation processes underlying PHEV. Possible causes of the variability of K across events and basins are a much debated topic in recent years [55][56][57][58][59][60][61][62]. The geomorphological theory of recession flows [63] suggests that enhanced variability of K is linked to the existence of manifold hydrologic storages which are differentially activated across events, either because of their uneven distribution along river networks or as a result of variable degrees of saturation preceding the events [64,65]. This behavior is conceivably related to the existence of assorted triggers of runoff events and floods [66].
The contrasting behaviors of PHEV, GEV and MEV for the prediction of the highest quantiles are further analyzed in the insets of figures 1(B)-(D), which display the median performances among case studies belonging to ten different hazard categories (i.e. the deciles of the observed flood quantiles). For PHEV (inset of panel D, light green hue) markers lie just below and parallel to the 45-degree line, indicating a slight tendency to underestimation that is independent of the flood magnitude. The markers instead diverge from the 45-degree lines in the insets of panels B and C (i.e. for GEV and MEV), indicating that the estimation errors associated to these distributions are low for small floods but increase with increasing quantiles.
To better visualize these traits and their implications for estimating magnitudes of events with long return periods, we plotted the median relative errors for each hazard class as a function of the observed quantiles ( figure 2(A)). We then extrapolated the performances of the different extreme distributions to higher quantiles by fitting regression lines using only half the observed errors, randomly extracted a hundred times from the whole set of them by means of resampling without substitution. This procedure yields an evaluation of the uncertainty of the extrapolations (shaded areas in panel A) and concurrently allows for inspecting their robustness against the observations themselves. The performance of PHEV (light green shaded area) is lesser affected by increasing flood quantiles, whereas it ceaselessly deteriorate with the magnitude of floods for GEV and MEV (red and blue shaded areas). This finding is especially relevant for the estimation of high quantiles associated to very long return periods. In this situation, errors of PHEV shall remain fairly limited while those of GEV and MEV are expected to further increase.
We provide a stricter assessment of the performances of the three extreme value distributions for very high quantiles in the remainder of figure 2. There we summarize the analyses performed on 1000 years long synthetic time series of rainfall and streamflow for eight select catchments from different US hydroclimatic regions which exhibit weak relations between K and peak flow. Figure 2(B) displays the relative errors of PHEV, GEV and MEV for a calibration sample of 20 years and a set of large return periods commonly used in the practice. A relative error equal to zero indicates a perfect match between quantiles predicted by the model and those estimated from the validation sample. Although PHEV is affected by a slight underestimation (the median of the box plot is always below the zero line), it shows the smallest interquartile range of the relative error, which is comparable to the one of MEV and lower than GEV, for all considered return periods. The amount of outliers is rather constant for MEV while it increases with the return period for both PHEV and GEV. In particular, large numbers and extent of outliers among GEV estimates appear for return periods as low as 50 years. Conversely, the capability of PHEV to provide reliable estimates does not deteriorate as strongly for large return periods.
A similar pattern is also highlighted by the relative errors between estimated and observed maxima, plotted for increasing values of the ratio between return period T and calibration sample size S (panel C). Here, results for every S = 10, 20, 30 and 60 years are pooled together to give a compendium of the analyses. Negative values of the median relative error for MEV reflect its tendency to underestimate high quantiles in this dataset when a Gamma distribution is used to describe the ordinary peaks. Notwithstanding comparable performances of GEV and PHEV for what concerns median errors, the latter exhibits lower error variance than the former in all analyzed ranges of T/S. Also in this case the performance of PHEV is rather constant with the return period, thus confirming the results of figure 2(A).
The capabilities of the three distributions and an empirical estimation method to reproduce the reference flood magnitude-frequency curve (grey dots) when a limited calibration sample (S = 20 years) is available are exemplified in panels D-I. We display results for these two case studies to emphasize how the shape of the flood frequency curve might affect the performance of the methods. Namely, we chose one case (panels D-F) where the flood frequency curve is rather predictable (i.e. the streamflow magnitude smoothly rises with increasing return period) and one case (panels G-I) characterized by the opposite behavior (i.e. the flood magnitude markedly increases above a certain return period).
The black lines and shaded areas span the breadth of empirical, PHEV (light green), GEV (red) and MEV (blue) estimates. The empirical method reveals its fickleness, as its estimates highly depend on the specific set of flood events sampled in the time frame S even for short return periods (i.e. the uncertainty of the maximum associated to an assigned frequency is considerable for return periods as low as ten years). The use of a theoretical extreme value distribution reduces this variability (i.e. the shaded areas span smaller ranges across the y-axes at short return periods). However, GEV estimates (panels E and H) are as variable for just slightly larger return periods (i.e. 50 years). Conversely, the uncertainty of MEV is limited (i.e. blue shades in panels F and I are narrow) for the whole range of investigated return periods, but the method tends to underestimate magnitudes of rare events with return periods longer than a few hundred years, in particular when the flood frequency curve is characterized by the presence of an inflection point (panel I). PHEV (panels D and G) provides dependable estimates of flood magnitudefrequency curves which are less affected by the specific set of flood events available in the sample (i.e. the amplitude of the light green shade is as well limited). Albeit more uncertain than MEV, the pattern of PHEV estimates closely resembles that of the reference flood magnitude-frequency curve (grey dots) for return periods above a hundred years, especially when the curve trends upward. Remarkably, knowledge of the magnitude of events with return periods substantially longer than the available data series is required at most in the practice.
Notwithstanding the highlighted perks of adopting a mechanistic-stochastic approach to estimate magnitudes and frequencies of floods from daily streamflow dynamics, work remains to be done. Stable calibration methods, such as L-moments, shall be developed for PHEV as done in the past decades for standard probability distributions now routinely used in extreme value analyses. The search for approaches to estimate the parameter K without resorting to calibration on observed maxima (thus achieving entirely calibration-free predictions of flood statistics) shall also continue. Given the apparent dependence of K on the wetness conditions of river basins [56,57], satellite-derived water storage and soil moisture products are promising allies in this regard, as recently shown by [67] and [68].
Using the physically-based distribution of peak flows underpinning PHEV to describe ordinary events in the MEV framework constitutes another promising research avenue whereby capitalizing on the assets of both methods, namely the mechanistic description of runoff generation processes embedded in PHEV and the possibility to explicitly account for the inter-annual variability of parameters provided by MEV. The mechanistic-stochastic conceptualization of runoff generation processes underlying PHEV might as well be further advanced to explicitly include the variability of recession coefficients in its mathematical expression, whose lack currently constitutes the main limitation of the method. Seminal works in this sense [17,69] demonstrated visible improvements of high flow estimates when this variability is embedded in the mechanistic-stochastic description of daily flows. Addressing these aspects would advance the PHEV distribution of river flows towards ripeness for a standard application in the practice.

Conclusions
The PHEV distribution of river flows emerges from a mechanistic-stochastic description of basin-scale soil moisture dynamics and runoff generation processes, which provides consistent analytical expressions of the probability distributions of daily flows, peak flows and flow maxima. It constitutes a physically-based alternative to purely statistical methods hitherto used to characterize extremes of hydrological variables, whose parameters are all but one measurable from daily rainfall and streamflow records. A novel method relying on flow recession analyses and the identification of an inverse relation between recession coefficients and peak flows allows for the domain of applicability of PHEV to be defined a priori. In the absence of such relation PHEV guarantees remarkable performances which are comparable to those of two leading statistical models for flood hazard assessment. In particular, PHEV estimates entail lesser increasing bias with larger flood quantiles than it occurs for the benchmark approaches, and lower uncertainty. These features ensure more reliable appraisal of the magnitude of rare extreme floods from relatively short data series, an especially valuable achievement for the community of professionals involved in environmental hazards assessment. Our findings finally hint at the possibility to leverage the reliability of PHEV estimates for very long return periods to pinpoint inflection points of flood magnitude-frequency curves, whose occurrence entail abrupt increases of the magnitude of high flows, amplified hydrological hazard and a propensity of rivers to generate extreme floods.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: Hydro-climatic time series from the MOPEX data set are available at https://hydrology.nws.noaa.gov/pub/ gcip/mopex/US_Data. Synthetic hydro-climatic time series for the select catchments used in this study to assess performances for very long return periods are available at: www.hydroshare.org/resource/ abd449fbced146eb8d0ec8ad95754f1c.

Acknowledgments
This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) -Project Number 421396820 'Propensity of rivers to extreme floods: climate-landscape controls and early detection (PREDICTED)' and Research Group FOR 2416 'Space-Time Dynamics of Extreme Floods (SPATE)' . The financial support of the Helmholtz Centre for Environmental Research -UFZ is as well acknowledged. We thank Andrea Domin for her contribution for producing the long synthetic time series used in this study. We finally thank two anonymous reviewers for their comments to the manuscript.

Appendix A. The generalized and MEV distributions
The extreme value theory states that if a random variable x follows a distribution F(x), the distribution of the maximum of n independent and identically distributed random variables (Y n = max(X 1 , . . ., X n )) can be written as H n (x) = F(x) n . In the traditional framework the assumption that n → ∞ gives H n (x) → H(x), where H(x) is one of three limiting distributions [31,70] unified in the generalized extreme value (GEV) distribution [35,71]. The nonexceedance cumulative distribution function of GEV reads as: with location parameter µ ∈ R, scale parameter σ > 0 and shape parameter ξ ∈ R. The MEV framework [32] relaxes the assumption of n → ∞ and considers as random variables both the parameters ⃗ θ of the ordinary distributions F(x; ⃗ θ) and the number of events n occurring in each considered time frame (e.g. a year or a season). Following [38], we use here the Gamma distribution to describe the ordinary events, and neglect the interannual variability of the parameters to reduce their estimation uncertainty. We instead consider as timevarying the number of events during each time frame (note that the subscript j appears in the exponent n in equation (A.2)). The resulting expression of the nonexceedance cumulative distribution function of MEV [19,32] is: where S is the length of the calibration sample, α > 0 and β > 0 are the shape and scale parameters of a Gamma distribution, and γ(·) and Γ(·) are respectively the lower incomplete and the complete Gamma functions.

Appendix B. Subset of river basins from different US hydro-climatic regions
Information concerning the eight select catchments from different US hydro-climatic regions, which are used in this study to assess performances of the theoretical extreme distributions for very long return periods (O(10 3 )), are available in table B1.

Appendix C. Error metric
In this study we use the relative error [72] as a metric to evaluate and compare the performances of the considered extreme value approaches. The relative error ε is a non-dimensional error between observed (q obs ) and estimated (q est ) maxima computed for the same return period T: ε(T) = q est (T) − q obs (T) q obs (T) . (C.1) A relative error equal to zero indicates a perfect match between estimated and observed quantiles. Negative values of ε indicate underestimation of the model (i.e. the model predicts maxima with lower magnitudes than the observed ones), whereas positive values denote overestimation (i.e. the model predicts maxima with larger magnitudes than the observed ones). In this study we computed the relative error in a predictive fashion, i.e. by only considering in the computation quantiles corresponding to return periods T longer than the sample size S used to calibrate the parameters of the distributions.