Design flood estimation with varying record lengths in Norway under stationarity and nonstationarity scenarios

In traditional flood frequency analysis, a minimum of 30 observations is required to guarantee the accuracy of design results within an allowable uncertainty; however, no recommendation exists for the required record length in nonstationary flood frequency analysis (NFFA). Therefore, this study has three aims: (i) to evaluate the predictive capabilities of nonstationary (NS) and stationary (ST) models with varying flood record lengths; (ii) to examine the impacts of flood record lengths on the NS and ST design floods and associated uncertainties; and (iii) to recommend the probable requirements of flood record length in NFFA. To achieve these objectives, 20 stations in Norway with record lengths longer than 100 years were selected and investigated using both a generalized extreme value (GEV)-ST model and a GEV-NS model with a linearly varying location parameter (denoted by GEV-NS0). The results indicate that the fitting quality and predictive capabilities of GEV-NS0 outperform those of GEV-ST when the record length is longer than approximately 60 years for most stations, and that the stability of both GEV-ST and GEV-NS0 improves as record lengths increase. Therefore, a minimum of 60 years of flood observations is recommended for NFFA for the selected basins in Norway.


INTRODUCTION
Flooding is one of the major natural disasters and has caused casualties and huge economic losses worldwide in the 21st century (Hallegatte et al. 2013; Arnell & Gosling 2016; Kan et al. 2016; Slater & Villarini 2016; Kuang et al. 2018; Yin et al. 2018; Zhang et al. 2018; Yan et al. 2019a; Hou et al. 2020; Li et al. 2020; Venvik et al. 2020). In the hydrological community, flood frequency analysis (FFA) is the most widely used approach to characterize flood risk and estimate design values for hydrological structures. The fundamental assumption of FFA is that historical extreme flood events are independent and identically distributed, the so-called stationary (ST) assumption. However, the ST assumption is quite restrictive because of the dynamic nature of flood events caused by climate change and human activities (Vogel et al. 2011; Luke et al. 2017; Serago & Vogel 2018; Zhang et al. 2019). Thus, the ST assumption has been questioned by many researchers under a changing environment (Milly et al. 2008; Villarini et al. 2009; Montanari et al. 2013; Milly et al. 2015; Yan et al. 2017a; Xu et al. 2018; Xiong et al. 2019; Yan et al. 2019b).
Nonstationary FFA (NFFA) has been proposed to address the nonstationary (NS) behavior of extreme flood events. As evidenced by recent review articles, there are several different methods for NFFA (Khaliq et al. 2006; Hall et al. 2013; Madsen et al. 2014; Bayazit 2015; Hao & Singh 2016; Yan et al. 2017b; Salas et al. 2018; Jiang et al. 2019; Qu et al. 2020; Xiong et al. 2020; Yan et al. 2021), among which the time-varying moments (TVM) method is the most popular. In the TVM method, the statistical parameters of the flood distribution are no longer invariant but change with time or other covariates to capture the trend of flood frequency (Rigby & Stasinopoulos 2005). However, despite the numerous studies on NFFA, there is neither an uncontroversial method for performing NFFA nor well-accepted governmental guidelines for updating current design flood strategies (Serago & Vogel 2018). One concern is the large uncertainty in NFFA induced by the model structure and parameter estimation of NS models (Hu et al. 2015; Serinaldi & Kilsby 2015; Luke et al. 2017; Serago & Vogel 2018; Sen et al. 2020). Typically, the structure of an NS model is more complex than that of an ST model, since additional parameters are introduced to capture the historical trend of extreme flood events. Thus, theoretically, more flood samples are required to provide reliable and robust parameter estimates for NS models. However, in many cases, the available flood record is shorter than 50 years and is often limited to 20-30 years (McCuen & Galloway 2010; Kjeldsen et al. 2014; Kobierska et al. 2017; Ekeu-Wei 2018; Hu et al. 2020). So, for practical reasons, it is necessary to investigate the impacts of record length on NFFA, considering the varying availability of flood observations worldwide.
Another concern is the predictive capabilities of NS models. Numerous studies have reported that NS models can improve the statistical representation of historical flood series compared with ST models (Sun et al. 2018; Xie et al. 2018; Lu et al. 2019; Su & Chen 2019), but the predictive capabilities of NS models for out-of-sample prediction are still questioned. Only a few studies have compared the predictive capabilities of NS and ST models. Luke et al. (2017) conducted split-sample testing and comprehensively evaluated the predictive capabilities of NS log-Pearson Type III distributions compared with ST models in the USA, and recommended the use of an updated ST model for flood risk analysis in basins that have been affected by human activities. However, they did not consider the effects of different flood record lengths on the predictive capabilities of NS models. In Norway, Kobierska et al. (2017) evaluated the impacts of record lengths on the performance of ST models with different combinations of distributions and estimation methods and found no clear record-length threshold for choosing between a three-parameter and a two-parameter distribution under ST conditions.
The last concern is the engineering design problem based on the NS model. Theoretically, we can predict the future annual flood distribution by extrapolating the statistical parameters of the established NS model. As a result, the predicted future flood distributions vary from one year to the next, which poses a new challenge for design flood estimation. Under the framework of NFFA, if we still follow the design method used under stationarity, we obtain a different design flood each year for a given return period, which is impractical for engineering design since the relationship between design flood and return period is no longer one-to-one. To address the problem of design flood estimation in NFFA, researchers have carried out many studies and developed several design flood methods in recent years (Olsen et al. 1998; Parey et al. 2007, 2010; Cooley 2013; Rootzén & Katz 2013; Acero et al. 2017, 2018; Hu et al. 2018; Wang et al. 2019; Byun & Hamlet 2020; Lu et al. 2020; Yan et al. 2020). Yan et al. (2019a) comprehensively compared different methods and recommended the use of average design life level (ADLL) and equivalent reliability (Hu et al. 2018). Byun & Hamlet (2020) developed an NS Monte Carlo method to address NS design problems by explicitly accounting for the projected changes of extreme hydrometeorological events. However, the effects of flood record length on the estimation of NS design floods have not been reported, which deserves more attention considering the larger uncertainties in the calculation of design floods with longer return periods using relatively short flood series (Machado et al. 2015; Engeland et al. 2017; Strupczewski et al. 2017; Hu et al. 2020; Kochanek et al. 2020).
The aforementioned concerns are all related to the record length of flood series. There is a consensus that sampling schemes, parameter estimation and design flood estimation can all be significantly influenced by the available flood record length. In Norway, dams should be able to withstand extreme floods with high return periods, such as 500 years or 1,000 years, based on the dam safety regulations of Norway (Lovdata 2010). In addition, infrastructure should withstand 20-, 200- or 1,000-year return period flood events, depending on the impacts of flood events (TEK10 2016). Under ST conditions, a minimum of 30 years of flood samples is typically required for FFA (Stedinger et al. 1993; Midttømme et al. 2011; Kobierska et al. 2017). Luke et al. (2017) conducted NFFA using 1,250 annual maximum discharge records throughout the USA with record lengths ranging from 30 to 65 years, in which the record length of about 75% of the series is between 35 and 45 years. Yan et al. (2017a) conducted NFFA using annual maximum flood series (AMFS) with record lengths of 59, 62 and 66 years. Parey et al. (2010) constructed the NS model using 106 years of annual maximum temperature data. However, when we turn to nonstationarity, what are the record length requirements for NFFA with the expected accuracy or uncertainty? And how might the record length influence the predictive capabilities of NS models and the estimation of NS design floods? In this study, we attempt to answer these questions. Therefore, the objectives of this study are: (i) to evaluate the predictive capabilities of NS and ST models with varying flood record lengths, (ii) to examine the effects of varying flood record lengths on the NS and ST design floods and associated uncertainties and (iii) to provide guidance on the probable requirements of record length for NFFA with the expected accuracy or uncertainty.
To achieve these objectives, 20 AMFS in Norway with record lengths longer than 100 years were selected for illustration purposes.

STUDY AREA AND DATA
In this study, annual maximum flood records from a number of hydrological stations in Norway were used to evaluate the effects of varying record lengths on model performance, predictive capabilities and design flood estimation of GEV (generalized extreme value)-ST and GEV-NS0. Initially, we collected flood data for 74 hydrological stations throughout the Norwegian mainland from the Norwegian Water Resources and Energy Directorate (NVE). Figure 1 presents the geographical location and data information of the collected stations. Then, only the data sets with more than 100 observations were retained to ensure the availability of at least 70 observations for the maximum record length in the fitting period and 30 observations in the evaluation period. Finally, the long flood records of 20 watersheds throughout Norway were chosen for illustration purposes (including 2 stations with 99 observations).
Hydrological variables are affected by both climate and land-use/land-cover changes. To facilitate the study of climate change and its impact on hydrological variability, a Hydrological Reference Dataset (HRD), consisting of good-quality long-term data series not influenced by human activities causing non-climate-related variability or change, was constructed by the NVE (Fleig et al. 2013). The HRD consists of 250 reference stations throughout Norway. One of the selection criteria is the degree of river basin development. In particular, the proportion of urban area for the 250 reference stations is lower than 10%. The other criterion is the absence of significant regulations, diversions or water use. The selected 250 potential reference stations are classified as 'unregulated' in the reports on NVE's network of streamflow stations (Pettersson 2003; Kleivane 2006). In this study, two of the final selected nine stations are included in the HRD. For the remaining stations, similar criteria of no significant regulations and no significant land-use/land-cover changes were used. For these reasons, land-use and land-cover change as well as other minor physical changes, if any, are not explicitly considered in the study.
The main characteristics of the selected 20 watersheds, including the drainage areas, the geographical locations and the starting and end years of the observations, are presented in Table 1. Previous studies have documented the existence of significant trends in the annual maxima and peaks over threshold in some regions of Norway (Vormoor et al. 2015, 2016; Debele et al. 2017). In this study, the trend-free pre-whitening Mann-Kendall test (TFPW-MK) (Yue et al. 2002) was employed to examine the trend of the selected AMFS by removing the effects of trend and serial correlation on the examination results (Table 1). The trend is statistically significant at the 5% significance level if the P-value in Table 1 is smaller than 0.05. It should be mentioned that TFPW-MK suffers from type I error, and Serinaldi & Kilsby (2016) proposed a corrected version called the TFPWcu method. See Serinaldi & Kilsby (2016) for the mathematical and numerical demonstration of TFPW flaws and the TFPWcu method, and see Serinaldi et al. (2020) for a detailed discussion of the errors resulting from this superficial application of statistical analysis.

METHODOLOGY
In this section, we describe how the effects of varying record lengths are considered in NFFA from model construction to design flood estimation. The flowchart of this study is presented in Figure 2.

Setting of varying record length scenarios and the split-sample testing method
To assess the effects of record lengths on model performance, predictive capabilities and design flood estimation, different record length scenarios are generated. For an observed AMFS $z_t$ ($t = 1, \ldots, n$), we assume there is a hydrological project to be constructed with a design life of 30 years within the observation years. Thus, the predetermined observations $z_t^f$ ($t = 1, \ldots, n^*$) with record length $n^*$ in the fitting period and the assumed future floods $z_t^e$ ($t = n^* + 1, \ldots, n^* + 30$) with a record length of 30 in the evaluation period are determined from subsets of the entire AMFS. The predetermined record lengths $n^*$ are sampled every 5 years from 30 to $n^*_{\max}$, which is given by the following equation:
$$n^*_{\max} = 5 \times \mathrm{floor}\left(\frac{n - 30}{5}\right) \tag{1}$$
where the function floor rounds $(n - 30)/5$ down to the nearest integer, and 30 is the design life of a project. Herein the split-sample testing method is used. For example, if there are 106 observations, $n^*_{\max}$ is equal to 75 and thus the record lengths are set to be 30, 35, 40, …, 75. Subsequently, both ST GEV (GEV-ST) and NS GEV (GEV-NS) models are built for each predetermined record length in the fitting period, and the following 30 years of observations are assumed to be the design lifespan of a hydrological project, which is retained for evaluating the predictive capabilities and estimating the design flood in the evaluation period (Figure 3).
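The scenario construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study; the function names `record_length_scenarios` and `split_sample` are hypothetical, and the formula for the maximum fitting length is taken as 5 × floor((n − 30)/5), which reproduces the worked example of 106 observations giving 75.

```python
def record_length_scenarios(n, design_life=30, step=5, n_min=30):
    """Return the fitting-period lengths n* = n_min, n_min + step, ..., n*_max.

    n*_max = step * floor((n - design_life) / step), so that design_life
    observations always remain after the fitting period for evaluation.
    """
    n_star_max = step * ((n - design_life) // step)
    return list(range(n_min, n_star_max + 1, step))


def split_sample(series, n_star, design_life=30):
    """Split an annual maximum series into a fitting part (first n_star values)
    and an evaluation part (the following design_life values)."""
    fitting = series[:n_star]
    evaluation = series[n_star:n_star + design_life]
    return fitting, evaluation


if __name__ == "__main__":
    # Example from the text: a station with 106 annual maxima.
    print(record_length_scenarios(106))
```

For a series with fewer than 60 values the scenario list is empty, which mirrors the requirement of at least 30 fitting and 30 evaluation observations.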

NS GEV model
In this study, we focus on NFFA using a single distribution rather than on nonstationarity caused by abrupt change or mixed flood generation mechanisms. The GEV distribution is often used to model annual maxima, such as AMFS, under the ST context. In the fitting period, for the AMFS $z_t^f$ ($t = 1, \ldots, n^*$), the cumulative distribution function (CDF) of the GEV-ST model is given by the following equation:
$$G(z_t^f) = \exp\left\{-\left[1 + \varepsilon\left(\frac{z_t^f - \mu}{\sigma}\right)\right]^{-1/\varepsilon}\right\} \tag{2}$$
where $\mu$, $\sigma > 0$ and $\varepsilon$ are the location, scale and shape parameters of the GEV-ST model, respectively. However, under NS conditions, the statistical parameters of the GEV model can be modeled as a function of covariates, such as time, to capture the trend of AMFS. The CDF of the GEV-NS model is as follows:
$$G_t(z_t^f) = \exp\left\{-\left[1 + \varepsilon_t\left(\frac{z_t^f - \mu_t}{\sigma_t}\right)\right]^{-1/\varepsilon_t}\right\} \tag{3}$$
where $G_t(\cdot)$ denotes the time-varying CDF of the GEV-NS model, $\mu_t$, $\sigma_t > 0$ and $\varepsilon_t$ are the time-varying location, scale and shape parameters of the GEV model, respectively, and $t$ is the time scale. The time-varying statistical parameters, varying from one year to the next, are estimated based on the overall trend of the entire AMFS. Theoretically, although all three statistical parameters can be time-varying, the shape parameter in Equation (3) is sensitive and difficult to estimate (Cheng & Aghakouchak 2014; Du et al. 2015). Besides, the aim of this study is not to select the optimal model for a single basin, but to investigate the impacts of record lengths in NS frequency analysis. Thus, the structure of the GEV-NS models should be simplified and fixed for each record length. GEV-NS models with only a time-varying location parameter are among the most widely used GEV-NS models in practical applications. This kind of simplified GEV-NS model can generate realistic design quantiles consistent with the probabilistic behavior of extreme events (Cheng & Aghakouchak 2014; Sarhadi & Soulis 2017), and is described as follows:
$$G_t^{-1}(1 - p) = \mu_0 + \mu_1 t + \frac{\sigma}{\varepsilon}\left\{\left[-\ln(1 - p)\right]^{-\varepsilon} - 1\right\} \tag{4}$$
where $G_t^{-1}(\cdot)$ is the time-varying inverse CDF of the GEV-NS model, $p$ is the annual exceedance probability, and $\mu_0$ and $\mu_1$ are the model parameters describing the change of the location parameter $\mu_t$, that is, $\mu_t = \mu_0 + \mu_1 \times t$. The GEV-NS model in Equation (4) is simple and convenient, considering that the shape parameter $\varepsilon$ of the GEV is quite sensitive and difficult to estimate. Therefore, in this study, the GEV-NS model is constructed based on Equation (4), as done by previous researchers (Zwiers et al. 2010; Westra et al. 2013; Luke et al. 2017; Um et al. 2017).
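As an illustration of the model with a linearly varying location parameter, the following Python sketch evaluates the GEV CDF and the year-specific quantile. It is a sketch under stated assumptions: the shape parameter follows the sign convention in which a positive value gives a heavy upper tail, the function names are hypothetical, and the parameter values in the example are made up rather than fitted to any Norwegian station.

```python
import math


def gev_cdf(z, mu, sigma, eps):
    """GEV CDF with location mu, scale sigma > 0 and shape eps."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if abs(eps) < 1e-12:  # Gumbel limit as eps -> 0
        return math.exp(-math.exp(-(z - mu) / sigma))
    s = 1.0 + eps * (z - mu) / sigma
    if s <= 0:  # z lies outside the support of the distribution
        return 0.0 if eps > 0 else 1.0
    return math.exp(-s ** (-1.0 / eps))


def gev_quantile(p_exceed, mu, sigma, eps):
    """Quantile with annual exceedance probability p_exceed."""
    y = -math.log(1.0 - p_exceed)
    if abs(eps) < 1e-12:
        return mu - sigma * math.log(y)
    return mu + sigma / eps * (y ** (-eps) - 1.0)


def ns0_quantile(p_exceed, t, mu0, mu1, sigma, eps):
    """Year-t quantile when only the location varies: mu_t = mu0 + mu1 * t."""
    return gev_quantile(p_exceed, mu0 + mu1 * t, sigma, eps)


if __name__ == "__main__":
    # Hypothetical parameters: the 100-year quantile shifts by mu1 each year.
    for year in (1, 15, 30):
        print(year, round(ns0_quantile(0.01, year, 100.0, 2.0, 30.0, 0.1), 2))
```

Because scale and shape are held constant, the quantile for a fixed exceedance probability moves in parallel with the location parameter, which is why a single design value cannot be read off without a method such as ADLL.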

Assessment of model performance for the fitting period
In this study, both GEV-ST and GEV-NS models with a linearly varying location parameter, denoted by GEV-NS0, are built for each predetermined record length. Thus, to assess the fitting quality and compare GEV-ST with GEV-NS0, the Akaike information criterion (AIC) (Akaike 1974) is employed, which is described as follows:
$$\mathrm{AIC} = -2\ln(\zeta) + 2r \tag{5}$$
where $r$ is the number of independently adjusted parameters of the model and $\zeta$ is the maximum likelihood value of the established model. It should be noted that the statistical parameters of both GEV-ST and GEV-NS0 are estimated by the maximum likelihood method. Thus, $\zeta$ can be obtained by maximizing the likelihood function, that is, by setting the partial derivatives of the log-likelihood function to zero; for the GEV-ST model, for instance, the log-likelihood is $\ln L(\theta \mid z_t^f) = \sum_{t=1}^{n^*} \ln g(z_t^f \mid \mu, \sigma, \varepsilon)$.
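The AIC comparison can be sketched as follows. The snippet only evaluates the GEV log-likelihood and AIC for parameters that are assumed to have been estimated already (the study itself used maximum likelihood via GAMLSS in R); the function names and the sign convention for the shape parameter are assumptions made for illustration.

```python
import math


def gev_logpdf(z, mu, sigma, eps):
    """Log-density of the GEV; returns -inf outside the support."""
    if sigma <= 0:
        return -math.inf
    if abs(eps) < 1e-12:  # Gumbel limit
        y = (z - mu) / sigma
        return -math.log(sigma) - y - math.exp(-y)
    s = 1.0 + eps * (z - mu) / sigma
    if s <= 0:
        return -math.inf
    return -math.log(sigma) - (1.0 + 1.0 / eps) * math.log(s) - s ** (-1.0 / eps)


def log_likelihood(z, mu_t, sigma, eps):
    """Sum of log-densities; mu_t holds one location value per observation
    (constant for GEV-ST, mu0 + mu1 * t for GEV-NS0)."""
    return sum(gev_logpdf(zi, mi, sigma, eps) for zi, mi in zip(z, mu_t))


def aic(loglik, r):
    """AIC = -2 ln(maximum likelihood) + 2 r, with r free parameters."""
    return -2.0 * loglik + 2.0 * r
```

With this layout, GEV-ST would use `r = 3` and GEV-NS0 `r = 4`, so the NS model is preferred only when its log-likelihood gain exceeds one unit.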

Evaluation of predictive capabilities with varying record lengths (evaluation period)
For each predetermined record length, the likelihood function is used to evaluate the predictive capabilities of both GEV-ST and GEV-NS0. Theoretically, the likelihood function provides information about how plausible the estimated statistical model is given the observations $z_t^e$ ($t = n^* + 1, \ldots, n^* + 30$) during the evaluation period. For the GEV-ST model, the likelihood function $L(\theta \mid z_t^e)$ for the evaluation period is described as follows:
$$L(\theta \mid z_t^e) = \prod_{t = n^* + 1}^{n^* + 30} g(z_t^e \mid \mu, \sigma, \varepsilon) \tag{6}$$
where $g(\cdot)$ is the probability density function (PDF) of the GEV, and $n^*$ is the record length of the predetermined observations. However, the likelihood function $L(\theta_t \mid z_t^e)$ of the GEV-NS model for the evaluation period is defined as follows:
$$L(\theta_t \mid z_t^e) = \prod_{t = n^* + 1}^{n^* + 30} g(z_t^e \mid \mu_t, \sigma, \varepsilon) \tag{7}$$
where $\theta_t$ represents the time-varying statistical parameters. It can be seen from Equations (6) and (7) that under ST conditions, the likelihood function is evaluated based on a set of invariant statistical parameters, whereas under NS conditions, the likelihood function is evaluated based on time-varying statistical parameters. Besides, the probabilistic coverages were also used to assess the predictive capabilities of GEV-ST and GEV-NS0. Theoretically, the percentage of flood observations below the prescribed percentiles, namely the 5th, 25th, 50th, 75th and 95th percentiles, should be 5, 25, 50, 75 and 95%. The absolute difference between theoretical and actual probabilistic coverages, denoted by $D_{pc}$, can be calculated by the following equation:
$$D_{pc} = \left|PC_T - PC_A\right| = \left|PC_T - \frac{\mathrm{number}(z_t^e < \mathrm{percentiles})}{n} \times 100\%\right| \tag{8}$$
where $PC_T$ and $PC_A$ are the theoretical and actual probability coverages, respectively, and $PC_T$ = 5, 25, 50, 75 and 95%; percentiles denote the 5th, 25th, 50th, 75th and 95th percentiles; and $n$ represents the number of observations retained in the evaluation period and equals 30 in this study.
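The probabilistic coverage check can be illustrated with a short Python sketch. It assumes that the model percentile for each evaluation year is computed from that year's location parameter (constant for GEV-ST, extrapolated linearly for GEV-NS0); the function names are hypothetical and the shape parameter uses the heavy-tail-positive sign convention.

```python
import math


def gev_percentile(prob, mu, sigma, eps):
    """GEV quantile for a nonexceedance probability prob in (0, 1)."""
    y = -math.log(prob)
    if abs(eps) < 1e-12:  # Gumbel limit
        return mu - sigma * math.log(y)
    return mu + sigma / eps * (y ** (-eps) - 1.0)


def coverage_difference(z_eval, mu_eval, sigma, eps, pct):
    """Absolute difference, in percentage points, between the theoretical
    coverage pct (e.g. 5, 25, 50, 75, 95) and the share of evaluation-period
    floods that fall below the model's pct-th percentile of the same year."""
    below = sum(
        1
        for z, mu in zip(z_eval, mu_eval)
        if z < gev_percentile(pct / 100.0, mu, sigma, eps)
    )
    actual = 100.0 * below / len(z_eval)
    return abs(pct - actual)
```

For GEV-NS0 fitted on the first `n_star` years, `mu_eval` would be `[mu0 + mu1 * t for t in range(n_star + 1, n_star + 31)]`; for GEV-ST it is the fitted location repeated 30 times.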

Average design life level
In this study, the ADLL method is used to solve the NS hydrological design problem using GEV-NS0. The ADLL method was proposed by Yan et al. (2017a) based on the concept of annual average reliability, denoted by $RE^{ave}_{T_1-T_2}$, under the NS context. $RE^{ave}_{T_1-T_2}$ is defined as follows (Read & Vogel 2015):
$$RE^{ave}_{T_1-T_2} = \frac{1}{T_2 - T_1 + 1}\sum_{t=T_1}^{T_2} p_t = \frac{1}{T_2 - T_1 + 1}\sum_{t=T_1}^{T_2} G_t(z_q) \tag{9}$$
where $p_t$ is the time-varying nonexceedance probability and $z_q$ is the design quantile over the design life of a hydrological project to be constructed, from $T_1$ to $T_2$. The ADLL method assumes that the annual average reliability corresponding to $z_q$ should be equal to the yearly reliability $1 - 1/m$ for the return period $m$. Thus, the $m$-year design quantile $z^{ADLL}_{T_1-T_2}(m)$ estimated by the ADLL method can be determined by the following equation:
$$\frac{1}{T_2 - T_1 + 1}\sum_{t=T_1}^{T_2} G_t\left(z^{ADLL}_{T_1-T_2}(m)\right) = 1 - \frac{1}{m} \tag{10}$$
ADLL is recommended for practical use because it associates design floods with the design life period of projects and is able to yield reasonable design quantiles and confidence intervals (CIs). See Yan et al. (2017a), which compared different NS design methods and recommended the ADLL method, for details of the advantages of ADLL. It should be noted that the maximum record length is equal to the full record length when analyzing the effects of record lengths on design flood quantiles and associated CIs. In addition, the coefficient of variation (CV) of the design flood is also estimated for each record length using the nonstationary nonparametric bootstrap (NNB) method to assess the stability of GEV-ST and GEV-NS0.
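Since the ADLL condition has no closed-form solution when the location parameter varies, the design quantile has to be found numerically. The sketch below uses plain bisection, which is an assumption of this illustration rather than the solver used in the study; the function names and parameter values are likewise hypothetical.

```python
import math


def gev_cdf(z, mu, sigma, eps):
    """GEV CDF (heavy upper tail for eps > 0, Gumbel for eps = 0)."""
    if abs(eps) < 1e-12:
        return math.exp(-math.exp(-(z - mu) / sigma))
    s = 1.0 + eps * (z - mu) / sigma
    if s <= 0:
        return 0.0 if eps > 0 else 1.0
    return math.exp(-s ** (-1.0 / eps))


def adll_quantile(m, t1, t2, mu0, mu1, sigma, eps, tol=1e-8):
    """Solve mean_{t=t1..t2} G_t(z) = 1 - 1/m for z by bisection,
    with G_t the GEV CDF at location mu0 + mu1 * t."""
    if m <= 1:
        raise ValueError("return period m must exceed 1")
    target = 1.0 - 1.0 / m
    locs = [mu0 + mu1 * t for t in range(t1, t2 + 1)]

    def avg_reliability(z):
        return sum(gev_cdf(z, mu, sigma, eps) for mu in locs) / len(locs)

    # Expand a bracket [lo, hi] until it contains the solution.
    lo, hi, step = min(locs), max(locs), sigma
    while avg_reliability(lo) > target:
        lo -= step
        step *= 2.0
    step = sigma
    while avg_reliability(hi) < target:
        hi += step
        step *= 2.0
    # Bisection on the monotonically increasing average reliability.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if avg_reliability(mid) < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

When `mu1 = 0` the result coincides with the ordinary stationary m-year quantile, which gives a convenient sanity check; with a positive trend over the design life the ADLL quantile is larger.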

Uncertainty analysis using the NNB method
In this study, to comprehensively evaluate the effects of record lengths on the ST and NS design results, the uncertainties or CIs of design floods estimated by the ADLL method are also calculated based on the NNB method. The advantage of NNB model is that it depends strictly on the available data without any hypotheses and can be implemented regardless of model complexity. In the NNB method, first, an NS model is used to fit the original observations. Then, the original observations are transformed to identically distributed residuals, and the transformed residuals are resampled with replacement and backtransformed to obtain the new samples. Next, the same NS model constructed in the first step is employed to fit the bootstrapped new samples and estimate corresponding design quantiles. Repeat the above steps for a large number of times and finally determine the CIs of design quantiles. See  and Yan et al. (2017a) for detailed information and operating steps about the NNB method.
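The resampling loop of the NNB method can be sketched as below. This is a simplified illustration under several assumptions: the residual transform `(z - mu_t) / sigma` is one choice that yields identically distributed residuals when only the location varies, the model fitting and design-quantile steps are passed in as user-supplied callables (`fit` and `design` are hypothetical placeholders for, e.g., a maximum likelihood fit and an ADLL solver), and the CI is a simple percentile interval.

```python
import random
import statistics


def nnb_design_floods(z, t, fit, design, n_boot=1000, seed=0):
    """Bootstrap design floods for a model with location mu0 + mu1 * t.

    fit(z, t)      -> (mu0, mu1, sigma, eps), the fitted NS model
    design(params) -> design quantile computed from fitted parameters
    """
    rng = random.Random(seed)
    mu0, mu1, sigma, eps = fit(z, t)  # step 1: fit the original series
    # Step 2: transform observations to identically distributed residuals.
    resid = [(zi - (mu0 + mu1 * ti)) / sigma for zi, ti in zip(z, t)]
    samples = []
    for _ in range(n_boot):
        # Step 3: resample residuals with replacement and back-transform.
        e_star = rng.choices(resid, k=len(resid))
        z_star = [mu0 + mu1 * ti + sigma * e for ti, e in zip(t, e_star)]
        # Step 4: refit the same model structure and store the design flood.
        samples.append(design(fit(z_star, t)))
    return samples


def summarize(samples, level=0.95):
    """Percentile confidence interval and coefficient of variation."""
    s = sorted(samples)
    lo = s[int(round((1.0 - level) / 2.0 * (len(s) - 1)))]
    hi = s[int(round((1.0 + level) / 2.0 * (len(s) - 1)))]
    cv = statistics.stdev(s) / statistics.mean(s)
    return lo, hi, cv
```

The CV returned by `summarize` corresponds to the stability measure used for each record length, and the interval bounds to the bootstrapped CIs of the design flood.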

NFFA with varying flood record lengths
According to the results of the TFPW-MK test, 9 of the 20 selected AMFS exhibited a significant increasing or decreasing trend at the 5% significance level, as displayed in Figure 4. Then, NS frequency analysis of the AMFS of these nine stations was carried out using the generalized additive models for location, scale and shape (GAMLSS) package available in the R language platform.
Both GEV-ST and GEV-NS0 were constructed for the different record length scenarios defined in Section 3.1. We evaluated and compared the model performance of GEV-ST and GEV-NS0 using the AIC values. As shown in Figure 5, for eight out of the nine stations, the AIC values of GEV-ST eventually became larger than those of GEV-NS0 as record lengths increased, indicating the better performance of GEV-NS0, particularly for samples with long record lengths. Furthermore, the differences between the AIC values of GEV-ST and GEV-NS0 became larger with the increase of record lengths, except for stations 12.228 and 19.127. Quantitatively, for eight out of the nine stations, the largest AIC differences in Figure 5 are larger than 5, and for two out of the nine stations, the largest AIC differences are around or larger than 10, which confirms the advantage of GEV-NS0. Besides, we also found that the superiority of GEV-NS0 starts to emerge roughly for record lengths longer than 60 years, indicating the necessity of collecting more flood samples in NFFA than the minimum requirement of 30 samples in conventional FFA.

Predictive capabilities of GEV-ST and GEV-NS models with varying record lengths
Likelihood values were employed to evaluate the predictive capabilities of GEV-ST and GEV-NS0. As shown in Figure 6, for the majority of stations (six out of nine), the likelihood values of GEV-NS0 are eventually higher than those of the GEV-ST models for longer record lengths, particularly when the record length is longer than 60 years. It should be mentioned that the likelihood values in Figure 6 may be affected by high uncertainty for small samples. Besides, we also examined the difference between the theoretical probability coverage and the actual probability coverages, which are the percentages of flood observations under the 5th, 25th, 50th, 75th and 95th percentiles, denoted by $D_5$, $D_{25}$, $D_{50}$, $D_{75}$ and $D_{95}$, respectively. As shown in Figures 7 and 8, the average value of $D_5$ is smaller than 10% for both GEV-ST and GEV-NS0 when the record length is longer than 60 years, while the average value of $D_{95}$ is smaller than 5% when the record length is longer than 40 years, indicating the good predictive capabilities for extreme events of both GEV-ST and GEV-NS0. Furthermore, the results indicate that the difference in nearly all five plots tends to become smaller with the increase of record lengths for GEV-NS0, particularly when the record length is roughly longer than 60 years. For the GEV-ST models, in contrast, there is no significant decrease in the five plots with the increase of record lengths, and $D_{25}$ and $D_{50}$ even exhibit upward trends. It should be mentioned that the difference of probability coverage in Figure 7 may be affected by the smaller number of stations available for longer record lengths. In particular, for $D_{25}$ and $D_{50}$, the decreasing trends become more pronounced for the two largest sample sizes of 90 and 95. The '95-size' value relies on station 2.1088, while the '90-size' average is taken over stations 2.145 and 2.1088. In fact, a decreasing trend can still be identified even without these two record lengths, particularly compared with the increasing trends of $D_{25}$ and $D_{50}$ for the GEV-ST model.
The above results confirm the better predictive capabilities of GEV-NS0 compared with GEV-ST models for most stations. The enhanced model performance and predictive capabilities of GEV-NS0 can be attributed to their better capture of the variability of extreme flood events when new flood samples are added into the flood population.

Influence of varying record lengths on NS design floods and uncertainties
After the evaluation of model performance and predictive capabilities of both GEV-ST and GEV-NS0, we further examine the influence of record length on design flood quantiles and associated CIs. Assuming the design life of hydrological structures is 30 years, both GEV-ST and GEV-NS0 were employed to estimate design floods using the ADLL method for each record length.
For each station, we first evaluated the stability of the established GEV-ST and GEV-NS0 by calculating the CVs of the 20-, 50- and 100-year design flood quantiles for each record length. Figure 9 shows the evolution of CV as a function of record length for all nine selected stations. Under both ST and NS conditions, the CVs of the 20-, 50- and 100-year design floods tend to become smaller with the increase of record length for nearly all stations, showing the increasing stability of the established statistical models as record lengths increase. Besides, when the record length is longer than approximately 50-60 years, the fluctuation of CV becomes gentle for most stations. It should be mentioned that for stations 19.127, 27.24 and 152.4, the stabilization of both models occurs for a record length shorter than 60 years. As for the comparison of the stability of GEV-ST and GEV-NS0, the CVs of GEV-ST are smaller than those of GEV-NS0 for all three return periods, which is due to the more complex model structure of GEV-NS0. Figure 10 summarizes the evolution of the 100-year design floods and associated 95% CIs with the increase of record length under both ST and NS contexts. The design values yielded by the GEV-ST model and the simplified GEV-NS model based on the ADLL method are similar, particularly as record length increases. For six out of the nine stations, the difference between the 100-year design floods of the ST and NS models remains relatively small with the increase of record length or even becomes smaller when the record length is longer than 50-60 years; see stations 12.228 and 19.127. Thus, it is difficult to discriminate between these two design strategies for longer record lengths. For the rest of the stations, a representative example is station 2.1088, for which the difference between ST and NS design floods becomes larger when the record length is longer than 70 years and the largest difference is obtained when the record length is 145 years.
In terms of CIs, it is found that the associated bootstrapped 95% CIs of the 100-year design flood become smaller as the record lengths increase for both GEV-ST and GEV-NS0, showing the advantage of collecting more flood samples under both ST and NS conditions. In addition, there is no significant difference between the CIs of ST and NS design floods for most stations.

Discussions
There is still a hot academic debate on whether we should switch to NS frequency analysis, considering the complex model structure, prediction of future distributions and the uncertainties associated with NS hydrological design. However, there is no doubt that we are living in a changing world. Thus, this study focuses on what we could do if the world is NS. Again, it should be emphasized that nonstationarity cannot be inferred only by trend tests but should be confirmed by attribution analysis, statistical analysis and empirical analysis (Milly et al. 2015;Yan et al. 2017a). As discussed by Serinaldi et al. (2018), trend analysis can at most be used as statistical tools for preliminary screening whose outcome should be carefully checked and complemented with exogenous information. If a clear physical mechanism related to a predictable evolution of the properties of the process at hand is not identified, we cannot make conclusions about the reason of rejection or lack of rejection, since multiple factors not included in the null and alternative hypotheses can actually play a role. Furthermore, because of the different availability of flood observations all over the world, it is crucial to assess the impacts of varying record lengths on NFFA for practical reasons, and our research can provide technical guidance for engineers and policymakers. The results of this study may be affected by the employed NS model. NS models make sense if we know a priori mechanism of change, and therefore we can set up parameters varying accordingly. Otherwise, we need to consider the model uncertainty including a variety of possible options and combinations in terms of covariates and link functions. Theoretically, all the three statistical parameters can be time-varying, but the shape parameter is sensitive and difficult to estimate. 
In practice, either the location or the scale parameter is likely to change under a changing climate, but in this study, only the location parameter of GEV-NS0 is time-varying to facilitate the model construction and investigation procedures. Although this simplified GEV-NS0 model reduces the model set and neglects an infinite set of possible alternative NS models, the aim of this study is not to select the optimal model for a given basin but to investigate the impacts of record lengths on NFFA. It is a trade-off between the complexity of the NS model structure and its practical application. Besides, to strengthen the physical basis of NS models, researchers have recommended using covariates more relevant to the flood process when constructing NS models for regions where human activity is an important contributor. In this study, we used time as a covariate for two reasons: (1) to keep the study focused on its main objectives, which are to examine the impacts of flood record lengths on the NS and ST design floods and associated uncertainties and to explore the probable requirements of flood record length in NFFA; and (2) human activities in the selected Norwegian catchments are not significant, and the maximum extreme values used are normally less affected by non-climatic factors than daily values or low-flow values. From the perspective of feasibility, all the selected stations have record lengths larger than 100 years, so it is very difficult to obtain physical covariates with such long records. It should be stressed that the time covariate is just a simple reflection of the gradual change of the time series and lacks physical meaning. Thus, in future, when more observational data are available, it will be feasible to incorporate physical covariates that are relevant to the flood process in NS modeling.

Figure 8 | The difference between theoretical probability coverage and actual probability coverages and the percentage of flood observations under 5th, 25th, 50th, 75th and 95th percentiles, denoted by D5, D25, D50, D75 and D95, respectively, using GEV-NS models. The difference was calculated using Equation (8), and ensemble mean was the average difference of nine stations.

Hydrology Research Vol 52 No 6, 1608
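To make the contrast between GEV-ST and GEV-NS0 concrete, the following minimal Python sketch fits both models by maximum likelihood, with the GEV-NS0 location written as mu_t = mu0 + mu1 * t. It is an illustration only and not the estimation procedure of this study: the function names `neg_log_lik` and `fit_gev`, the Nelder-Mead optimizer, the starting values and the use of the year index as `t` are all assumptions made here.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import genextreme


def neg_log_lik(params, x, t, nonstationary):
    """Negative log-likelihood of a GEV whose location may vary linearly in t."""
    if nonstationary:
        mu0, mu1, log_sigma, xi = params
    else:
        mu0, log_sigma, xi = params
        mu1 = 0.0
    # scipy's shape parameter c has the opposite sign to the usual GEV shape xi
    ll = genextreme.logpdf(x, c=-xi, loc=mu0 + mu1 * t, scale=np.exp(log_sigma))
    # Penalize parameter sets that put any observation outside the support
    return -np.sum(ll) if np.all(np.isfinite(ll)) else 1e10


def fit_gev(x, nonstationary=False):
    """Fit GEV-ST (default) or GEV-NS0 (linear location) by maximum likelihood."""
    x = np.asarray(x, dtype=float)
    t = np.arange(x.size, dtype=float)  # assumed covariate: year index 0, 1, 2, ...
    # Starting values: least-squares trend plus moment-based Gumbel estimates
    slope, intercept = np.polyfit(t, x, 1) if nonstationary else (0.0, x.mean())
    resid = x - (intercept + slope * t)
    sigma0 = resid.std() * np.sqrt(6.0) / np.pi
    start = [intercept - 0.5772 * sigma0, np.log(sigma0), 0.1]
    if nonstationary:
        start.insert(1, slope)
    res = minimize(neg_log_lik, start, args=(x, t, nonstationary),
                   method="Nelder-Mead",
                   options={"maxiter": 5000, "xatol": 1e-6, "fatol": 1e-6})
    k = len(start)
    return {"params": res.x, "loglik": -res.fun, "aic": 2 * k + 2 * res.fun}
```

Calling `fit_gev(floods)` and `fit_gev(floods, nonstationary=True)` on the same series and comparing the returned log-likelihoods or AIC values gives a simple check of whether the extra trend parameter is supported by the data at a given record length.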
Theoretically, the required sample size for a given analysis depends on the target statistic of interest (e.g., 50-, 100- and 1,000-year design flood) and the required accuracy. In practical flood design problems, hydrological engineers work with the flood observations available to them and account for the corresponding uncertainties at a targeted return period. The minimum sample size recommended for flood design in the handbook of hydrology is intended to guarantee the reliability of design results in cases where no specific degree of uncertainty is set in advance, and to urge hydrological engineers to collect more samples or turn to a 'peaks over threshold' sampling scheme to reduce the huge uncertainty when estimating a 1,000-year design flood from limited observations. Based on the results of this study, a minimum of 60 years of observations is recommended for NFFA to guarantee the superiority of the NS models. It should be mentioned that the 60-year threshold for switching from FFA to NFFA is not a 'magic' number. It is inferred from the fitting quality, predictive capabilities and stability of GEV-NS0 for most selected stations in Norway when estimating 20-, 50- and 100-year design floods, and it is an estimate that may depend on the nature of the time series, its memory, its distribution, the models used and the level of accuracy required. That said, we still think that recommending a plausible standard to guide engineers in estimating design floods, particularly in cases with limited observations, is beneficial for advancing the practical application of NS frequency analysis. 
Besides, it should be noted that in the practical design procedure, regardless of whether the available observations satisfy the minimum record length, it is important to perform a cost-benefit (multicriteria) analysis that considers the range of uncertainty corresponding to a predefined design value and thus to obtain the solution that optimizes the required criteria (i.e., structural and economic). This matters because a number of other variables that are not specified in this study may affect the required sample size. For example, if we raise the required accuracy of design values in a specific problem, a series of 60 observations is likely to be insufficient, and if we use an NS model with a different structure in which both the location and scale parameters are time-varying, the required sample size may be different. Besides, in this study, only two stations have record lengths larger than 120 years, which influences the analysis of probability coverage of GEV-ST and GEV-NS0. As more observations become available in future, further studies on the record length required for NFFA in other basins worldwide should be carried out to promote the development of NFFA.
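The dependence of design-flood uncertainty on sample size can be illustrated with a short resampling sketch. The Python code below bootstraps the 100-year design flood of a stationary GEV and reports its coefficient of variation (CV) and 95% CI. It is a simplified stand-in, not the bootstrap scheme used in this study: the GEV is fitted by L-moments with Hosking's shape approximation purely for speed, and the function names `gev_lmom_quantile` and `bootstrap_design_flood` are hypothetical.

```python
import math
import numpy as np


def gev_lmom_quantile(sample, return_period):
    """T-year quantile of a stationary GEV fitted by L-moments (Hosking's approximation)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    i = np.arange(n)  # 0-based rank of each sorted observation
    # Unbiased probability-weighted moments b0, b1, b2
    b0 = x.mean()
    b1 = np.sum(i * x) / (n * (n - 1))
    b2 = np.sum(i * (i - 1) * x) / (n * (n - 1) * (n - 2))
    l1, l2, l3 = b0, 2 * b1 - b0, 6 * b2 - 6 * b1 + b0
    c = 2.0 / (3.0 + l3 / l2) - math.log(2) / math.log(3)
    k = 7.8590 * c + 2.9554 * c ** 2  # Hosking's shape; the usual xi equals -k
    y = -math.log(1.0 - 1.0 / return_period)  # -ln(non-exceedance probability)
    if abs(k) < 1e-6:  # Gumbel limit
        alpha = l2 / math.log(2)
        return l1 - 0.5772156649 * alpha - alpha * math.log(y)
    g = math.gamma(1.0 + k)
    alpha = l2 * k / ((1.0 - 2.0 ** (-k)) * g)
    loc = l1 - alpha * (1.0 - g) / k
    return loc + alpha * (1.0 - y ** k) / k


def bootstrap_design_flood(sample, return_period=100, n_boot=500, seed=0):
    """CV and percentile 95% CI of the design flood from a nonparametric bootstrap."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    q = np.array([
        gev_lmom_quantile(rng.choice(sample, size=sample.size, replace=True),
                          return_period)
        for _ in range(n_boot)
    ])
    lower, upper = np.percentile(q, [2.5, 97.5])
    return {"cv": q.std(ddof=1) / q.mean(), "ci95": (lower, upper)}
```

Running `bootstrap_design_flood` on the first 30, 60, 90 and so on observations of a long series shows how the CV and the CI width respond to record length, which is the kind of information a cost-benefit analysis would weigh against a required accuracy.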

CONCLUSIONS
The aim of this study was to evaluate the influence of record length on ST and NS FFA. This aim was achieved using split-sample testing for stations with long records (larger than 100 years), and several goodness-of-fit measures were used to evaluate the performance and predictive capabilities of the statistical models. Subsequently, the ADLL design method was employed to predict NS design floods, and the NNB was used to calculate CIs by resampling the original flood observations for record lengths varying from 30 to 145 years.
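As an illustration of how an NS design flood can be obtained once a GEV-NS0 model is fitted, the Python sketch below solves for the flood level whose annual exceedance probability, averaged over the years of the design life, equals 1/T. This is our reading of the ADLL idea and is stated here as an assumption, since the definition is not repeated in this section; the function name `adll_design_flood` and its arguments are hypothetical.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import genextreme


def adll_design_flood(mu0, mu1, sigma, xi, years, return_period):
    """Level z with mean annual exceedance probability 1/T over the design life.

    Assumes a GEV with location mu0 + mu1 * year, constant scale sigma and
    constant shape xi (usual sign convention).
    """
    years = np.asarray(years, dtype=float)
    loc = mu0 + mu1 * years
    target = 1.0 / return_period

    def excess(z):
        # sf is the exceedance probability; scipy's shape c equals -xi
        return np.mean(genextreme.sf(z, c=-xi, loc=loc, scale=sigma)) - target

    # Year-by-year T-year quantiles bracket the solution; widen by one scale unit
    q = genextreme.isf(target, c=-xi, loc=loc, scale=sigma)
    return brentq(excess, q.min() - sigma, q.max() + sigma)
```

With `mu1 = 0` the result reduces to the ordinary stationary T-year quantile, and with a nonzero trend it falls between the smallest and largest year-specific quantiles within the design life.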
Based on the aforementioned results, the main conclusions of this study are as follows:
1. For frequency analysis of NS flood extremes, the performance of GEV-NS0 improves as record length increases, and in particular it outperforms the GEV-ST models when the record length is larger than 60 years in the selected basins in Norway, indicating the necessity of collecting more flood samples in NFFA than the minimum requirement of 30 samples in traditional FFA.
2. The predictive capabilities for extreme events (95th and 5th percentiles) are good for both GEV-ST and GEV-NS0 in the selected basins in Norway. However, compared with the GEV-ST models, the difference between theoretical and actual probabilistic coverages becomes smaller for GEV-NS0 as record lengths increase. More importantly, the likelihood values of GEV-NS0 are larger than those of the GEV-ST models when the record length is larger than 60 years for most stations, indicating the better predictive capabilities of GEV-NS0. The improved model performance and predictive capabilities of GEV-NS0 are attributed to its better capture of the variability of extreme flood events when more flood samples are added to the flood population.
3. The stability of the established GEV-ST and GEV-NS0 models improves as record lengths increase, since the CVs of the 20-, 50- and 100-year design floods become smaller with increasing record length. In particular, the fluctuation of the CV becomes gentle when the record length is roughly larger than 50-60 years for most stations in the selected basins. Besides, the CVs of GEV-ST are smaller than those of GEV-NS0.
4. The difference in the 100-year design flood estimated using the ADLL method between GEV-ST and GEV-NS0 is relatively small, or even becomes smaller, when the record length is larger than 50-60 years. In addition, the bootstrapped 95% CIs of the 100-year design flood become smaller with the increase of record length for both GEV-ST and GEV-NS0 models in the selected basins in Norway.
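The comparison of theoretical and actual probabilistic coverages in conclusion 2 can be sketched as follows. For each percentile p, the Python function below computes the percentage of observations lying at or below the fitted p-th percentile of their own year and subtracts p. The sign convention (actual minus theoretical), the function name `coverage_difference` and the parameterization of GEV-NS0 by `mu0`, `mu1`, `sigma` and `xi` are assumptions made for illustration and may differ from Equation (8).

```python
import numpy as np
from scipy.stats import genextreme


def coverage_difference(x, years, mu0, mu1, sigma, xi,
                        percentiles=(5, 25, 50, 75, 95)):
    """Actual minus theoretical coverage (in percentage points) per percentile."""
    x = np.asarray(x, dtype=float)
    years = np.asarray(years, dtype=float)
    # Non-exceedance probability of each observation under its own year's GEV;
    # scipy's shape parameter c has the opposite sign to the usual xi
    prob = genextreme.cdf(x, c=-xi, loc=mu0 + mu1 * years, scale=sigma)
    return {p: 100.0 * np.mean(prob <= p / 100.0) - p for p in percentiles}
```

Values close to zero for all percentiles indicate that the fitted model reproduces the observed distribution of floods well; setting `mu1 = 0` gives the corresponding check for GEV-ST.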