Assessment of Return Value Estimates from Stationary and Non-Stationary Extreme Value Models

This article compares the accuracy of return value estimates from stationary and non-stationary extreme value models when the data exhibits covariate dependence. The non-stationary covariate representation used is a penalised piecewiseconstant (PPC) model, in which the data are partitioned into bins defined by covariates and the extreme value distribution is assumed to be homogeneous within each bin. A generalised Pareto model is assumed, where the scale parameter can vary between bins but is penalised for the variance across bins, and the shape parameter is assumed constant over all covariate bins. The number and sizes of covariate bins must be defined by the user based on physical considerations. Numerical simulations are conducted to compare the performance of stationary and non-stationary models for various case studies, in terms of quality of estimation of the T -year return value over the full covariate domain. It is shown that a non-stationary model can give improved estimates of return values, provided that model assumptions are consistent with the data. When the data exhibits non-stationarity in the generalised Pareto tail shape, the use of non-stationary model assuming a constant shape parameter can produce biases in return values. In such cases, a stationary model can give a more accurate estimate of return value over the full covariate domain as only the most extreme observations (regardless of covariate) are used to estimate tail shape. In other cases, the assumption of a stationary model will ignore key features of the data and be less reliable than a non-stationary model. For example, if a relatively benign covariate interval exhibits a long (or heavy) tail, extreme values from this interval may influence the T -year return value for very large T . However the sample of peaks over threshold, with high threshold, used to estimate a stationary model in this case may not include sufficient observations from this interval to estimate the return value adequately.

Bias in metocean data obviously leads to bias in estimates of extremes. Random errors in metocean data lead to positive bias (i.e. a tendency to estimate return values that are higher than the true return values), since the distribution of random errors is convolved with the distribution of the variable ( [1,2]). Shorter datasets lead to higher variance in estimates of extremes, but can also increase bias, since bias in parameter estimators for various distributions can vary with the number of observations. Similarly, the shape of the tail of the distribution affects both the bias and variance of estimates of extreme values, with estimates of longer-tailed distributions having a higher variance for a given sample size.
Biases in parameter estimates also vary with the shape of the tail (see e.g. [3,4]).
Model misspecification refers to differences between the "true" characteristics of the data (and the data-generating model responsible for it) and the assumptions made in the statistical model. At present, the most commonly applied method for estimating extremes of metocean variables is the peaks-over-threshold (POT) method (see e.g. [5,6]). The POT method makes the following key assumptions about the data: (1) observations are independent and identically distributed (IID) given covariates, and (2) exceedances of a sufficiently high threshold follow a generalised Pareto (GP) distribution. The GP distribution describes the asymptotic behaviour of independent threshold exceedances from a maxstable data-generating distribution. As threshold level increases, theory suggests that the closeness of the conditional distribution of peaks over threshold to the GP form improves. The appropriateness of the GP distribution is therefore based on the threshold being sufficiently high that the asymptotic approximation is valid, with too low a threshold leading to increased bias in the estimated extreme values due to the GP distribution not being an appropriate model. The choice of threshold is a trade-off between increased bias from setting the threshold low and increased variance from setting the threshold high, so that there are fewer observations. The rate of convergence of threshold exceedances from the datagenerating distribution to the GP distribution may however be slow. That is, a very large threshold might be required for the GP form to be considered a reasonable approximation, making practical inference difficult for finite samples.
For this reason, a number of "pre-asymptotic" parametric distributional forms, or "penultimate approximations", have been proposed [7,8]; the idea being that the data-generating distribution is in some sense "closer" to the penultimate approximation than to the asymptotic distribution for finite threshold. However, since the GP model is the most widely used at present it has been applied in the present work and the use of penultimate approximations is not pursued further here. In addition, a large literature on non-parametric alternatives for estimation of distributional tails exists (see e.g. [9], [10]).
Regarding the assumption that observations are IID given covariates, metocean variables typically exhibit serial correlation, so the assumption of independence is not true if a model is fitted to all observations. This is dealt with by declustering the data, where only the largest observation in each storm are considered so that storm maxima can be considered approximately independent. The criterion for what determines independent storms is usually defined in terms of a minimum separation in time. A rigorous treatment of the correlation between successive extreme events can be made by plotting the extremogram [11], an analogue of the autocorrelation function for sequences of extreme events, although care must be taken to first remove the seasonal signal from the data which introduces a longer-range correlation. An example of the use of the extremogram to define a declustering time-scale was presented by Mackay and Johanning [12], which showed that a time-scale of around 5 days was sufficient for the datasets considered in that study. Alternatively, declustering times can be defined based on more heuristic arguments about the average time scales for weather systems to pass over a site, typically taken to be in the range of 2-5 days. Ewans and Jonathan [13] discuss a physically-motivated approach to declustering time series of significant wave height, H s , based on the assumption that the peak severities of different storm events, given covariates, are statistically independent. Storm events are identified from time-series of sea-state H s . A storm event corresponds to the time interval between the H s up-crossing of some threshold level and the subsequent down-crossing of the threshold. In addition, storm intervals separated by less than 24 hours are merged. The threshold can be defined e.g. in terms of a covariate-dependent quantile of sea-state H s . The peak value of sea-state H s during the storm interval then defines the storm peak H s . Values of storm peak H s for different storms are taken to be statistically independent.
The distributions of many metocean variables, such as (significant or individual) wave heights, wind speeds and storm surge, exhibit dependence on other variables, referred to as covariates. For example, many studies have considered the dependence of wind speeds or wave heights on the direction of origin of the storm and the time of year (season) [13][14][15][16][17].
Wave heights and wind speeds are also dependent on large-scale climatic indices such as the North Atlantic Oscillation (NAO) [18] or the El Niño Southern Oscillation (ENSO) [19]. Moreover, most studies tacitly assume that the distribution of metocean variables are stationary in time, neglecting the effects of the changing climate which have been observed in some studies [20,21].
In this study we focus on the effects of periodic covariates such as season and direction and defer consideration of longer-term variations in climate to future work. Specifically, we quantify differences in the performance of models which account for the covariate effects and those that do not (referred to here as constant or stationary models). Obviously, stationary models cannot produce estimates of seasonal or directional extremes, so our interest here is in which model gives the more accurate estimates of return values for the full covariate domain, typically referred to as annual (omniseasonal) or omnidirectional return values; we use the term "omnicovariate" where necessary below for clarity. The quality of historical data is not considered here, but all the other factors listed above influence the comparison between stationary and non-stationary models and therefore need to be considered.
There has been some debate in the literature about the circumstances in which non-stationary models should be applied and whether stationary or non-stationary models produce more accurate estimates of omnicovariate return values. The motivation for using non-stationary models is that their underpinning assumptions better reflect the characteristics of the data and our physical understanding. Non-stationary models assume that the distributions of independent peaks over (covariate-dependent) threshold, conditional on covariates, tend towards a GP distribution. Under this assumption, now consider the highest threshold value (on the covariate domain) for which the GP distribution is a reasonable approximation.
For this threshold value, the omnicovariate distribution is a convolution of the GP distributions over the covariate domain and therefore not a GP distribution itself. If we now increase the threshold yet further, we expect to eliminate the influence of all values of covariate except those contributing to the extreme tail of the omnicovariate distribution and that the resulting distribution of threshold exceedances would be "closer" to a GP distribution once more. It may therefore be expected that a high threshold would be required for stationary models to give a similar level of performance as nonstationary models. However, it is also reasonable to expect that the most information about the shape of the tail of the omnicovariate distribution is contained in the largest observations. In applied extreme value analysis there is a maxim that ensuring a good fit to the bulk of the data does not guarantee a good fit to the tail. It is therefore reasonable to ask whether modelling less extreme observations (in a non-stationary model) reduces the bias and variance of return values.
Model complexity is another consideration. Stationary models are simpler to implement and have fewer parameters to estimate. Whilst the complexity of non-stationary models is not an argument against their use on its own, practical considerations aside, it may be expected that the larger number of parameters that need to be estimated for non-stationary models would increase the variance of those estimates. The need to estimate more parameters is traded off against two effects. Firstly, due to the larger number of parameters, non-stationary models offer a more flexible (hence potentially more accurate) fit to the data. Secondly, non-stationary models are typically fitted to a larger proportion of the observations, increasing the sample size. From this discussion it is apparent that theoretical arguments alone cannot justify the use of a stationary or non-stationary model exclusively. From the practitioner's perspective, the challenge is knowing which type of model gives the most accurate estimates of extreme values in a given situation.
Many of the earlier studies on the use of non-stationary models (e.g. [22][23][24]) compared their performance to stationary models in situations where the true return values were not known. In these studies it is not possible to conclude which type of model is more accurate, only that results differ. Jonathan et al [25] presented a comparison of stationary and non-stationary models using simulated data where the observations are drawn from two distinct distributions, representing storms from two directions. They demonstrate that in the cases they consider the non-stationary models give lower bias in estimates of return values. Mackay et al [26] argued that these results were not representative of real situations, where the distribution of a variable will vary smoothly with direction, season or other covariate, rather than changing sharply at the boundary of two sectors. Mackay et al [26] presented the results of simulations where the distribution of storm peak H S conditioned on season varied continuously through the year. Piecewise-constant models were fitted, where the data were divided into a number of discrete bins and independent fits were made in each bin. It was shown that the piecewise-constant models performed worse in estimating omniseasonal return values than the stationary models, with higher bias and variance in all cases considered, and with both bias and variance increasing with the number of bins used. It was explained that the reason the non-stationary models performed worse in these case studies was due to the independent estimates of the GP shape parameter in each bin. As the number of bins increases, the sample size in each bin decreases and the variance of the parameter estimates increases. A high estimate of the GP shape parameter in one bin is not compensated for by a low estimate in another bin and therefore leads to a positive bias in the annual return values.
Jonathan and Ewans [27] argued that the results in [26] were due to a fortuitous choice of extreme value threshold for the stationary model and that there was no way of knowing in practice where the correct threshold should be set. Jones et al [17] extended the study of Jonathan et al [25] using more sophisticated covariate representations (splines, Fourier series and Gaussian processes), and suggested that the performance of stationary models in estimating omnicovariate models is in general more variable than the performance of a non-stationary model.
The purpose of the present study is to extend the results of Jonathan et al [25] and Mackay et al [26] in an attempt to provide further guidance on the relative performance of stationary and non-stationary models in realistic situations.
We extend the results from [26] in two main ways. Firstly, case studies are constructed where the threshold for both the stationary and non-stationary models can be varied, so that the effect of threshold choice can be examined. Secondly, we consider a penalised piecewise-constant (PPC) non-stationary model [28,29]. In this model the data are partitioned into bins defined by covariates, and the GP scale parameter is allowed to vary between bins but the shape parameter is constant over all bins. The likelihood function used to estimate the parameters is penalised for the variance in the scale parameter over all the bins, with the roughness penalty selected using cross-validation to maximise predictive likelihood.
More advanced non-stationary models than the PPC model have been proposed, which have the objective of providing optimally flexible descriptions of the systematic variability of extreme values with covariate (e.g. [30]). Typically, a regression approach underpins these models (e.g. [31]). A suitable set of basis functions for the covariate domain is defined, and the value of each of the extreme value model parameters (on the covariate domain) is then defined as a linear combination of basis functions; the basis coefficient vector is estimated statistically. Suitable bases for onedimensional covariate domains include splines and Fourier series. Basis functions with compact support, such as Bsplines, are advantageous computationally; PPC exploits a piecewise-constant basis in one-dimension. There are numerous variants of spline parameterisations. These include P-splines (penalised B-splines, [32]), for which squared differences of neighbouring basis coefficients are penalised to increase the smoothness of the representation, and adaptive regression splines (e.g. [33]), for which locations of spline basis knots are also estimated to optimise model fit. Useful bases for higher-dimensional covariates include thin-plate splines (e.g. [34]), suitable kernels (e.g. radial basis functions), and Voronoi tessellations (e.g. [35]); bases for higher-dimensional covariates can also be formed from tensor products of lowerdimensional bases (e.g. [36]). Higher-dimensional bases formed from tensor products of penalised B-splines admit efficient computation using generalised linear additive models ( [37]).
The motivation for using the PPC model over more advanced forms of non-stationary model is that is represents a good compromise between simplicity, robustness and flexibility. The PPC model represents a step up in complexity from binning the data and fitting independent models in each bin (the non-stationary model considered by Mackay et al [26]), where the additional complexity of the roughness penalisation makes the model more robust to increasing uncertainties from dividing the data into bins. The complexity of the PPC model is determined by the number of bins used, rather than the number of covariates. It can therefore be used for multidimensional covariate problems without modification, making it very flexible.
As with previous studies, the scope of the current study is necessarily limited to a relatively small number of case studies. Hence the conclusions drawn here may not be applicable universally. The results presented apply to the PPC model and similar types of non-stationary model. However, we have also attempted to draw more general conclusions that extend to other types of non-stationary model. In particular, the conclusions about the effects of binning the data and assuming a piecewise-constant distribution apply to other types of model that take this approach, and the conclusions about the effects of assuming a stationary shape parameter are likely to be applicable to any non-stationary model that makes this assumption. Moreover, as discussed further in Section 3, since the PPC model only considers the total level of variability between bins, the particular choice of patterns of covariate dependence are not restrictive.
The paper is organised as follows. A brief overview of the theory and model assumptions is presented in Section 2.
The design of the simulation case studies is described in Section 3. In Section 4 we examine the effect of non-stationarity in the data on the shape of the tail of the omnicovariate distribution, and the effect this has on quality of estimation from return values from a stationary fitted model. The effect of partitioning the data into bins and fitting a piecewise-constant non-stationary model is considered in Section 5. Section 6 summarises the results of the simulation studies. Finally, conclusions are given in Section 7.

Return values from a non-stationary distribution
In the present study we consider estimation of the distribution of an arbitrary variable, X, where X could be interpreted as storm-peak H s or another environmental variable, showing dependence on covariates. It is assumed that storm peaks are sufficiently separated in time that adjacent observations are independent. Further, it is assumed that X follows some arbitrary distribution, with parameters dependent on one or more covariates. In the current study we consider the influence of a single covariate, denoted t, which could be interpreted as the time of year (season) or mean wave direction at the storm peak.
Denote the cumulative distribution function (CDF) of X conditional on a particular choice of t as P S (X ≤ x|t). For simplicity, it is assumed that t ∈ T = [0, 360). It is further assumed that the occurrence rate of storm peaks, ρ t (t), is dependent on t, where the rate is defined as the number of storms per year per unit covariate. The probability that a storm, selected at random, has associated covariate t is where is the expected number of storms per year.
The unconditional CDF of X for a storm selected at random, denoted P RS , is obtained by integrating the conditional CDF over the covariate domain, weighted by occurrence The T -year return value, x T , is then the solution of .

Penalised Piecewise-Constant (PPC) model
of n values of peaks over threshold for a random variable X.
be the corresponding values of a covariate t on some domain T . We assume a single covariate, but extension to more complex covariate domains is straightforward as explained in [28]. We make inferences about extreme values of X given t, for t ∈ T .
The piecewise-constant model uses a particularly simple description of non-stationarity with respect to covariates. For each observation in the sample, the value of covariate t i is used to allocate the observation to one and only one of N bin covariate intervals (or bins) {C k } N bin k=1 by means of an allocation vector A such that k = A(i) and T = C k . For each k, all observations in the set {x i } A(i)=k with the same covariate interval C k are assumed to have common extreme value characteristics.
A non-stationary GP model is then estimated using cross-validated roughness-penalised maximum likelihood estimation. For covariate interval C k , the extreme value threshold u k > 0 is assumed to be a quantile of the empirical distribution of X in that interval, with specified non-exceedance probability ψ ∈ (0, 1), with ψ constant across intervals, and estimated by counting. Threshold exceedances are assumed to follow the GP distribution with shape ξ ∈ [−0.5, ∞) and scale σ k > 0, with CDF where The parameters u k , σ k and ξ are the threshold, scale and shape parameters, respectively. Since estimation of the shape parameter is particularly problematic, ξ is assumed constant (but unknown) across covariate intervals, and the reasonableness of the assumption assessed by inspection of diagnostic plots. Parameters ξ, {σ k } are estimated by maximising the predictive performance of a roughness-penalised model, optimally regulating the extent to which {σ k } varies across intervals, using a cross-validation procedure.
The sample GP likelihood L under the piecewise stationary model is where L, {u k }, {σ k } and ξ are functions of marginal extreme value threshold non-exceedance probability ψ, and ξ is constant across the N bin intervals {C k }. The negative log likelihood, penalised for the roughness of {σ k } across intervals, is then where ℓ * is a function of both ψ and roughness coefficient λ σ andσ is the mean value of σ k over the bins: For given ψ and λ σ , estimates for ξ and {σ k } are found by minimising ℓ * . The minimisation is conducted using a simplex search method [38]. The search is initialised using first guess ofξ = 0 (where the caretˆdenotes an estimate of a parameter) and the moment estimates of σ in each interval. The optimisation is constrained to giveξ ≥ −0.5 and A random 10-fold cross-validation is then used to select the valueλ σ of λ σ and correspondingξ, {σ k } which, for each ψ, maximises predictive performance. In the PPC model, if λ σ = ∞ then the model has only one degree of freedom for σ, whereas if λ σ = 0 then the fitted model has N Bin degrees of freedom for σ.
For intermediate values, the "effective" degrees of freedom for σ is at some intermediate value.
In a typical application, the complete PPC modelling procedure is repeated for a number of bootstrap resamples of the original sample to capture sampling uncertainty. Moreover, for each sample, the extreme value model is evaluated for multiple thresholds with non-exceedance probability ψ drawn at random from the interval I ψ ⊆ (0, 1) on which model performance is deemed reasonable from inspection of diagnostics. However, in the current study, where the model is applied in a large number of Monte Carlo simulated data sets, only the original sample is used. Moreover, as we wish to study the effect of threshold level on the estimates, the PPC model is fitted for several values of ψ and the results compared directly.
The method used to fit the PPC model is relatively simple. It is conceivable that other methods such as Markov Chain Monte Carlo (MCMC) could potentially improve results. However, this would represent a significant step up in terms of complexity. As mentioned above, the motivation for using the PPC model is for its balance between simplicity and flexibility. Examples of non-stationary models using MCMC can be found in e.g. [30,39].
Once the PPC model has been estimated the omnicovariate distribution is obtained using the discretised form of (3): where n k is the number of observations in interval C k and Return values can then be estimated using (4). For consistency, the stationary model used in this work is a special case of the PPC model with a single covariate bin and no roughness penalisation.

Assessment criteria
The performance of the stationary and non-stationary models are assessed in terms of the bias, standard deviation whereθ j is the estimate corresponding to the j th Monte Carlo simulated data set (henceforth referred to as a "trial" for brevity).

Design of case studies
Previous simulation studies comparing stationary and non-stationary models have generated data from a GP distribution, where the parameters depend on covariate values. The limitation of this type of study is that the minimum threshold for which the stationary model can be applied is the maximum threshold value over all covariates, since below this level the distribution is not defined for all covariate values. To overcome this limitation, a model is required for the distribution of all storm-peak data, not just the tails. This could be achieved by using a two-part model with a parametric distribution for the body of the distribution and a GP model for the tail. The problem with this approach is that the choice of distribution for the body is arbitrary and it is difficult to ensure continuity of the density function on the boundary between body and tail.
In our simulations we have opted to simulate from the generalised extreme value distribution (GEV) rather than the GP distribution, avoiding the need for a two-part model. Previous investigations (details available from the authors on request) with measured data also show that the GEV distribution is a reasonable model for storm-peak H S . The GEV is the asymptotic distribution of "block maxima" of fixed block size (e.g hourly, daily or weekly maxima). Storm peak data can be considered block maxima in a sense, where the block size is related to the method used for identifying storm peaks, although the block size is not strictly constant. However, we are not using the GEV to generate data to conduct a block-maxima analysis. Instead, we are using the GEV to generate data for a non-stationary POT analysis (using the PPC model). A POT analysis can be applied to data generated from any distribution. The motivation for using the GEV as the data-generating distribution in the current study, is that it has the convenient property that the tail converges to the GP distribution with the same shape parameter, in the sense illustrated below. The CDF of the GEV can be written as where z is defined in the same way as the for the GP distribution in (6). In the tail of the distribution z is small. As That is, the GEV and GP CDFs converge, with common scale and shape parameters, and GEV location parameter µ equal to the GP threshold u, as illustrated in Figure   1 for the case µ = 0, σ = 1, ξ = 0.
[ Figure 1 about here.] It is well known that there is a relation between block-maxima modelled using the GEV distribution and threshold exceedances modelled using the GP distribution (see e.g. [5]). However, the argument above merely relates to the similarity of the functional forms of the GEV and GP tails, and is not the same as the argument associating the GP distribution for peaks over threshold when the GEV is used for block maxima.
Fitting a GP model to a data set with GEV as the data-generating model will introduce some bias at lower threshold values, due to the mismatch between the fitted model and data-generating model (see Figure 1). The resulting bias and STD of parameter and quantile estimates when fitting the GP distribution to GEV data is examined in the Appendix.
The bias in parameter and quantile estimates are slightly higher when a GP model is fitted to GEV data than when a GP model is fitted to GP data. However the STD is slightly lower, resulting in an RMSE that is comparable. The use of the GEV distribution as the data-generating model rather than the GP distribution will therefore not significantly influence the results.
For the PPC model, the likelihood is penalised on the variance of the scale parameter. The difference between the estimates of the scale parameters in adjacent bins is not considered explicitly, only the total variance over all bins. The complexity of the PPC model is therefore determined by the total number of covariate bins only and not the number of covariates used. Therefore, the case studies considered here focus on a single covariate and the results can be expected to apply to cases with multiple covariates.
We now consider two sets of case studies. In the first, the GEV parameters are assumed to vary linearly with covariate t, and in the second the parameters are assumed to vary sinusoidally with t. The parameters in the first set of case studies are defined as The first set of case studies is designed to illustrate the effect of fitting a stationary extreme value model to data from a non-stationarity data-generating distribution, and is similar to the PPC fit in a specific covariate bin (see Section 4). The parameters in the second set of case studies are defined as where different choices of α, β and γ are also considered. As the PPC model does not directly account for the difference in parameter estimates between adjacent bins on the covariate domain, it is mainly the level of variation between bins that influences model fit and not the pattern of variation. The assumption of sinusoidal variation in model parameters is therefore not particularly restrictive. However, it will be shown in Section 6.2 that the level of non-stationarity of the data within a bin does influence model fit.
The second set of case studies is designed to be more representative of a real situation and are used to compare the performance of the stationary and non-stationary models. The GEV parameters for each case are listed in Table 1. The first case with α = β = γ = 0 is included to illustrate the effect of increasing the number of bins on the estimated omnicovariate return values in absence of covariate effects and is discussed in Section 5. The subsequent cases illustrate the effect of different patterns in the variation of the data-generating distribution parameters and are discussed in Section 6.
[ Table 1  For each case 10,000 trials were performed. The estimation of the optimal penalty via cross-validation is the most time consuming step in estimating the PPC model. To reduce computation, the optimal penalty is estimated for only the first 100 trials; subsequent trials use the median penalty from the first 100 trials. The estimated optimal penalty showed very little variation over the first 100 trials, justifying the use of the median value in the remaining trials.
Examples of simulated data sets for four of the cases (see equations 20-22 and Table 1) are shown in Figure 2, together with the theoretical quantiles at non-exceedance levels of ψ = 0.6, 0.9, 0.99, 0.999 and 0.9999. Since we have defined the location parameter to be sinusoidally varying about zero, there are some observations that are negative, which is not representative of some environmental variables such as storm peak wave heights or wind speeds. This could be rectified by adding an offset to µ, which would have the effect of offsetting all the observations. However, the choice of offset would be arbitrary, so has been left as zero. The return values calculated from (3) and (4)  4. Fitting a stationary model to data from a non-stationary data-generating distribution Here we examine the effect of non-stationarity in the data-generating model on the tail of the estimated omnicovariate distribution using a stationary fitted model, so that the effect of non-stationarity can be assessed in isolation, before considering the effect of partitioning the data by covariate. The true data-generating model, with a linear variation in parameters, is described in (17)- (19).
To illustrate the influence of non-stationarity on the shape of tail of the distribution, we apply a normalisation, so that the tail shape can be considered without the influence of varying location and scale parameters. Suppose we wish to examine the shape of the tail above threshold level u, corresponding to non-exceedance probability ψ. Define threshold exceedances as Y = X − u for X > u. The conditional distribution of threshold exceedances is The mean, m, and STD, s, of the conditional distribution are given by where f Y (y) = dF Y (y)/dy is the probability density function of threshold exceedances. Now we consider how non-stationarity affects bias in estimates when fitting a stationary model. For each value of a and b a simulation study was conducted as follows. The sample size was fixed at n = 500 observations. For each simulation, covariate values were generated as uniform random variables t ∈ [0, 360). The parameters of the GEV conditional on t were defined according to (17) and (18) and a stationary model (the PPC model with one covariate bin) was fitted at threshold levels corresponding to ψ = 0.7 and 0. 9. For each value of a and b, 10,000 random trials were conducted. Figure 5 shows the mean of the estimated shape parameter,ξ, as a function of a and b for thresholds at ψ = 0.7 and 0. 9. Note that only the meanξ can be shown, rather than bias, since there is no 'true' shape parameter when a, b = 0 since the data-generating distribution is in fact a non-stationary GEV integrated over t and is not GEV itself. In the case a = b = 0, the true shape parameter is ξ = −0.1. We observe a negative bias inξ due to two effects: the known bias in maximum-likelihood estimators ( [41]) and the fact that we are fitting a GP distribution to GEV data (see discussion in the Appendix). The trends in the meanξ with a and b are similar to the results indicated in Figure 4. When b = 0, non-stationarity in the location has little influence on the estimated shape parameter, but when there is non-stationarity in the scale then the estimated shape becomes more negative with increasing a. It is also clear that non-stationarity in the scale has more influence on the shape that non-stationarity in the location. Figure 6 shows the bias in the estimated omnicovariate return value, X P , where the return value is defined to be the quantile at a non-exceedance probability of 0.9999, corresponding to a return period around 20 times the length of the observations. For the threshold at ψ = 0.9 the non-stationarity has relatively little effect on the bias in the return value, since much of the non-stationarity is removed by the high threshold and the GP model is a good fit for the tail of the distribution. In fact, for a less than approximately 1.5

Fitting a non-stationary model to data from a stationary data-generating distribution
If we are confident there is no non-stationarity in the data, then there is obviously no need to apply a non-stationary model. It is, however, instructive to consider the application of a non-stationary model in this situation. PPC and similar models are designed so that, in application to stationary data, a large roughness parameter λ σ would be estimated, and the variability of estimated σ with covariate t would consequently be small, corresponding to an approximately stationary GP fit. However, it is interesting to study the practical performance of PPC in this setting, in particular the effect of choice of number of covariate bins and other characteristics of covariate binning. Since we know the data-generating distribution is stationary, any effects observed cannot be due to non-stationary in the data set. We consider the simplest case of a fit to data from a constant distribution (Case 1 of Table 1 with α = 0). We consider three model types with increasing complexity. In the first model, independent fits to the data in each covariate bin are performed. In the second model, the shape parameter is assumed to be constant over all bins but the scale parameter fit is unconstrained (i.e. PPC fit with λ σ = 0). The third model is full PPC, where the shape parameter is constant over all bins and the scale parameter per bin is chosen to maximise predictive performance. Note that in the case of a single covariate bin, all models are equivalent.
The bias, STD and RMSE in the 100-year omnicovariate return value, X 100 are shown in Figure 7 for fits using between 1 and 12 bins. The results for the 1000-year return value are similar and are not shown here. For the one-bin (stationary) fitted model there is a negative bias in the estimate of X 100 , which is a result of previously-mentioned bias in the maximum likelihood estimators and fitting a GP model to GEV data. In the case of independent fits to each bin, the bias increases with the number of bins used. This effect was reported by Mackay et al [26] and is caused by the increased uncertainty in the shape parameter, with a high estimate of ξ in one bin not being compensated for by a low estimate in another.
The STD of estimates also increases with both the threshold non-exceedance probability, ψ, and the number of bins used, since sample size per bin reduces with both ψ and the number of bins. The RMSE for the fits with ψ = 0.6 decreases slightly from its value in the 1 bin case to a minimum in the 3 bin case. We attribute this to a balancing between the negative bias from parameter estimation and positive bias from increased binning, resulting in a slightly lower RMSE. For all other threshold levels, the RMSE increases monotonically with the number of bins used.
For the PPC (λ σ = 0) case, the performance of the fitted model is much more stable as a function of the number of bins used, up to 8 bins. For more than 8 bins there is a large increase in both the bias and STD of the estimates.
For PPC (optimal λ σ ) model, the performance is very similar to the PPC (λ σ = 0) model up to 8 bins. However, when using more than 8 bins, the performance of the full PPC model is much more stable due to the influence of the roughness penalty on σ. For more than 10 bins there is some increase in the bias from the PPC model. It is thought that this bias results from lack of convergence of the simple simplex-type optimisation algorithm used for maximum likelihood inference.
Nevertheless, the bias is still very small compared with the other approaches even for 12 covariate bins.
The bias and STD in parameter estimates from the independent fits-per-bin model are shown in Figure 8, with the corresponding plots for full PPC model (with optimal λ σ ) in Figure 9. For the independent fits the bias and STD increases with the number of bins used, due to the reduced sample size in each bin. For the full PPC model, the results are again considerably more stable as a function of number of covariate bins. There is a small reduction in the bias with increasing number of bins used. This is likely due to the increased influence of the σ-roughness penalty, which acts to optimise the performance of the model. The STD in the estimates remains fairly constant with the number of bins used in the full PPC model. For fits with 11 and 12 bins, the STD increases when using a threshold at ψ = 0.6, but reduces for the threshold at ψ = 0. 9. Again, we attribute this effect at least in part to lack of convergence, for large numbers of covariate bins, of the simplex optimisation algorithm used in PPC.
We conclude from this study that the full PPC model provides a good representation of stationary data-generating distributions (with parameters considered), at least when the number of covariate bins does not exceed 10. Therefore, for the studies reported in Section 3, we focus on the fits using up to 8 bins. We note that more sophisticated optimisation schemes (exploiting likelihood slope and curvature characteristics, [36,42]) are available for more challenging applications. [

Fitting a non-stationary model to data from a non-stationary data-generating distribution
In this section we consider the performance of stationary and non-stationary fitted PPC models for the four cases with sinusoidal parameter variation described in Section 3 (equations 20-22 and Table 1), samples of which are illustrated in Figure 2. For these case studies the true omnicovariate data-generating distribution is an integral of GEV distributions over the covariate domain and therefore not itsef a GEV distribution. Hence it is not possible to assess performance in terms of the parameter estimates from the fitted GP models. Instead we focus on the estimating omnicovariate return values for which the true values can be calculated from (4), as illustrated in Figure 3. We consider the four cases from Table 1 in turn. The fitted (one-bin) stationary model shows a negative bias, which reduces with increasing threshold level, consistent with the results shown in Figure 6. For the lower threshold levels, the bias becomes slightly more negative as α increases and the amplitude of variation in location parameter grows. In contrast, at ψ = 0.9, the bias and STD does not vary much with α. The reduction in bias with increasing threshold is due to two effects. used is due to the use of the σ-roughness penalty; as the number of bins used increases, the optimal penalty also increases, so that the model does not over-fit. The trend in bias with increasing number of bins for α = 2 and 3 is somewhat more complex than might be anticipated. This is due to the effect of the location of bin edges, which is discussed further in Section 6.2. In general, the PPC model fitted using 5-8 bins using a threshold at ψ = 0.6 or 0.7 gives the best performance in this case.

Case 1: Data from distribution with non-stationary location parameter
[ Figure 10 about here.]

Case 2: Data from distribution with non-stationary scale parameter
In a similar fashion to Section 6.1, bias, STD and RMSE in the 100-year omnicovariate return values for Case 2 (Table 1)  The excursion for three-bin fits is due to the placement of the bin edges. Figure 12 shows true tail distributions in each covariate bin for a threshold level at ψ = 0.6, when the data is partitioned into 2, 3, 4 or 5 bins, together with the omnicovariate distribution as reference. The distributions in each bin have been normalised using the procedure described in Section 4. In each case, the first bin is centred at t = 0 and all bins are of equal width. For the two bin case, the distribution in bin 1 is shorter-tailed than the distribution in bin 2. As the PPC assumes a constant GP shape parameter across bins, the value ofξ will be an average over the shape for each bin. For the three bin case, the distribution in bins 2 and 3 has a longer tail than that in bin 1, since there is a larger change in the scale parameter in bins 2 and 3 than in bin 1 (see Figure 2). As discussed in Section 4, the non-stationarity in the shape parameter in bins 2 and 3 results in the distribution being longer-tailed in these bins. The estimated shape parameter over the three bins will be more influenced by the two longer-tailed distributions in the lower sectors than the shorter-tailed distribution in the higher sector in bin 1. For the cases with four and five bins, there is less difference between the shapes of the distributions in each bin, since the bins are smaller and the distribution in each bin is more homogeneous. Examination of the distribution ofξ showed that the estimates are indeed more positive for three-bin than for other cases. For higher threshold levels, the size of the excursion is reduced since the sample of threshold exceedances is smaller an more homogeneous. In practice, where smooth variation of the data with covariate is expected, it is not possible to define bins within which there is a homogeneous population. This means that some bin placement effects are unavoidable. However, increasing the number of bins means that a piecewise-constant covariate model is a better approximation to the true data-generating distribution, which should improve the performance of the PPC model. Optimisation of bins widths and locations for directional analysis of extreme conditions is discussed in [13].
To investigate the effect of bin placement further, additional simulations were conducted with random placement of the first bin edge on [0, 360), whilst keeping bin widths constant. This procedure effectively eliminated the excursion discussed above, but otherwise the characteristics of the results (not shown) are similar to those shown in Figure 11. Cases 3 and 4 discussed below utilise random bin placement for this reason. It is possible to optimise the number of bins used and the placement of bin edges (see e.g. [30]). However, this represents a significant step up from the PPC model in terms of complexity and has not been pursued further here.

Case 3: Data from distribution with non-stationary location and scale parameters
The corresponding STD and RMSE in omnicovariate return value estimates for Case 3 (Table 1) using random bin placement (as described in Section 6.2) are similar to those for Case 2 with random bin placement, and are therefore not shown here. The bias for β = 0.5 was found to be somewhat more negative than that for Case 2, but of a similar magnitude between 0 and -10%. The similarity in performance of PPC models for Cases 2 and 3 agrees with results from Section 4; the effect of non-stationary scale is similar regardless of whether the location parameter is stationary or non-stationary. Figure 13 shows the bias, STD and RMSE in the 100-year omnicovariate return value estimates for the cases with γ = −0.1 and -0. 2. The data-generating distribution in the "benign" covariate interval has a longer tail in than elsewhere (see Figure 3). Results for γ = −0.1 show a small negative bias (of 2-4%) for the one-bin case and a small positive bias (2-4%) for the PPC model with 3 or more bins. Bias for the two bin case is close to zero. STD is also relatively stable as a function of the number of bins used. Since STD is larger than bias, RMSE is also relatively stable. There is little difference between stationary and non-stationary models in this case. Results for γ = −0.2 show a small negative bias for the stationary model. For non-stationary models, bias increases with both the number of bins and threshold level. This behaviour is related to the model misspecification: the PPC model estimates a constant shape parameter by maximising predictive likelihood over all bins. The shape parameter estimate will therefore be influenced by the long tail for the benign sector, resulting in a positive bias overall. STD is relatively stable with increasing number of bins used. Due to the large bias effect, RMSE is lowest for the stationary model and increases with the number of bins used.

Case 4: Data from distribution with non-stationary location, scale and shape parameters
Corresponding results for the 1000-year omnicovariate return value are shown in Figure 14. Now the effect of the long tail in the benign sector is more pronounced (see Figure 3). Results for γ = −0.1 are similar to those in Figure 13, but with slightly larger biases and STDs. For γ = −0.2 the stationary model displays a large negative bias, since it does not account for the effect of the longer tail in the benign sector, which has a stronger influence on the 1000-year return value than the 100-year return value. Bias reduces with increasing number of bins used, up to approximately four bins. Since PPC assumes a constant ξ, this reduction in bias can only be explained by compensating optimal choices for bin scale parameters. STD is slightly lower for the stationary model than the non-stationary models, but due to the large negative bias in the stationary model, RMSE is lowest for the non-stationary models using four or more bins. RMSE for ψ = 0.9 is higher than the fits using the lower thresholds. It is likely that this is because of lack of evidence in the sample of threshold exceedances to justify a large variation in the scale parameter to account for the longer tails in the benign sector. In this case the use of more advanced non-stationary models discussed in the introduction may be appropriate.

Conclusions
This study compared the performance of stationary and non-stationary extreme value models in estimating omnicovariate return values in the presence of covariate effects, for samples of peaks over threshold. The non-stationary model considered was a penalised piecewise-constant (PPC) GP model, assuming a constant shape parameter, but covariate dependence of scale and extreme value threshold.
The effects of linear trends in the location and scale parameters of the data-generating GEV model on the shape of the omnicovariate tail distribution were examined. For the cases considered, linear variation of the location parameter has only a small effect on the tail. Linear variation in the scale parameter of the data-generating model results in the omnicovariate distribution having a longer tail. Further, we examined the performance of a stationary GP fit to nonstationary data-generating distribution. For the cases considered, the change in bias due to a linear variation in location or scale was small relative to the bias for the case of a stationary data-generating distribution.
The effect of fitting a non-stationary piecewise-constant model to data from a stationary data-generating distribution was also investigated. It was found that when independent GP models are fitted per covariate bin, bias and variance of and non-stationary PPC fitting models were misspecified, and hence there was less expectation that the non-stationary model would perform better. Clearly additional case studies need to be considered, for which the non-stationary model incorporates a non-stationary representation for shape parameter should be made, for more useful comparison with fits using a stationary GP.
In summary, a non-stationary extreme value model can give improved estimates of omnicovariate return values compared with stationary models, provided that the characteristics of the data-generating model, and the model to be estimated are consistent. However, the relative performance of stationary and non-stationary extreme value models in estimating an omnicovariate return value is problem specific; either approach works reasonably well when the analysis is performed carefully. When all that is needed from the analysis is an estimate of an omnicovariate return value, a stationary fitted model may be sufficient. However, when a set of return values corresponding to multiple different partitions of the covariate domain is required, in addition to the omnicovariate return value, the non-stationary model exploiting suitable covariate representations is likely to provide a more consistent and statistically efficient framework for inference.
This study addresses the relative performance of stationary and non-stationary extreme value models, in the presence of covariate effects. Maximum likelihood estimation, potentially penalised to ensure optimal parameter smoothness, is used as discussed in the main text, in conjunction with a GP distribution for exceedances of a high threshold. It is instructive, in addition, to consider the performance of the maximum likelihood estimators for the GP parameters and extreme quantiles in a stationary case. Moreover, in the current work, the data-generating distribution is the GEV. It is important also therefore to assess the bias and variance in parameter and quantile estimates for a GP fit to data generated from a GP distribution, with a GP fit to data generated from a GEV distribution.
There is a wide range of methods for estimating the parameters of the GP distribution, differing in bias and variance characteristics, with the performance depending on sample size and the value of the GP shape parameter (see e.g. [4,43]).
The maximum likelihood (ML) estimators are asymptotically unbiased and efficient (as the sample size tends to infinity the ML estimators achieve the Cramer-Rao lower bound for the variance of an unbiased estimator). However, the ML estimates do not achieve this asymptotic property for small sample sizes and other methods can produce lower bias and variance.
A key step in the estimation of the PPC model is the penalisation of the likelihood function for the "roughness" of the GP scale parameter estimates, which makes ML the most suitable computational framework for inference. We therefore focus on the properties of ML estimators. Various methods have been proposed for calculating the ML estimators for the GP distribution (e.g. [44,45]) and the performance depends somewhat on the numerical algorithm used. Convergence of the algorithm is sometimes problematic and some methods can give results inconsistent with data, in the sense that ξ < 0 and max(x) >û −σ/ξ. In the PPC model, parameter estimates are forced to be consistent with the data and the shape parameter is constrained to beξ > −0.5, as described in Section 2.2. The asymptotic covariance matrix for the ML estimators of GP parameters is [46] var  σ ξ where n is the sample size. This provides a lower bound for the variance of unbiased parameter estimates for the stationary model. The second-order bias in the ML estimators was derived by Giles et al. [47] n bias (σ) = σ 3 + 5ξ + 4ξ 2 (1 + 3ξ) A simulation study was conducted to compare the bias and variance of GP fits to GP and GEV data and to the theoretical values given above. The aim was to investigate the influence of the threshold level at which the GP distribution is fitted to the GEV. To make a meaningful comparison, the sample size must be consistent between the different threshold levels, which requires generating more extreme GEV values for fits using higher threshold values. The approach taken is summarised as (a) set shape parameter ξ, (b) set GEV threshold non-exceedance probability ψ, (c) set number of GP samples n GP , (d) define number of GEV observations to generate as n GEV = ⌊ n GP 1−ψ + 1 2 ⌋, where ⌊·⌋ is the floor function, (e) generate n GP samples from GP distribution with u = 0 and σ = 1 and fit GP distribution to all samples, and (f) generate n GEV samples from GEV distribution with µ = 0 and σ = 1 and fit GP distribution to largest (1 − ψ)n GEV ≈ n GP samples. For each value of ξ and ψ, 100,000 trials were conducted. As the GP distribution is fit to the GEV data at different threshold levels the estimated scale parameter must be adjusted to allow consistent comparison. A feature of the GP distribution is that if exceedances of threshold u 0 follow a GP distribution with parameters σ u0 and ξ, then for threshold u > u 0 , the exceedances are GP distributed with same ξ and scale parameter σ u = σ u0 + ξ(u − u 0 ) (see e.g. [5]). The parameter σ * = σ u − ξu is therefore threshold-independent. We therefore compare estimates of σ * rather than σ. Note that since u = 0 in this example the true value is σ * = 1. Figure 16 shows the results of the simulation study for a sample size of n = 50. Results for n = 200 yields similar results, and are not reproduced here. In this example the return value X P is defined as the quantile at a non-exceedance probability of P = 0.999 for the GP data. The bias of parameter estimates for fits to GP data agree reasonably well with the theoretical values from (27) and (28), when ξ > −0.1, but the theoretical values depart significantly from the simulations when ξ < −0.1 due to the influence of singularity in the theoretical expressions when ξ = −1/3. For the fits to the GEV data, the bias is larger than that for the fit to the GP data. The bias reduces as the threshold increases and the tail of the GEV converges to a GP distribution. STD of estimates (lower panel) for the fit to GP data is slightly above that predicted by the asymptotic result. For lower values of ξ, STD is closer to the asymptotic values. This is because the estimated shape parameter is constrained to be greater than −0.5, restricting the range of values that the estimates can take. STD ofξ for the fits to the GEV data is slightly lower than that for the fit to the GP data. STD forσ * is lower for the fit to the GEV data for ξ less than approximately −0.1 and higher than that for the fit to the GP data for larger values of ξ. For the estimated return values, there is an increase in absolute bias for fits to GEV data. There is a slight reduction in STD for the fits to the GEV data, except for higher threshold case with ψ = 0.9 and negative shape parameter.
In summary, fitting a GP distribution to threshold exceedances from a GEV data-generating distribution results in a slight increase in the bias of parameter and quantile estimates relative to fitting to GP data, with the bias decreasing as the threshold increases. STD of estimates in fits to GEV data is generally slightly lower, meaning that RMSE in return value estimates is comparable to that for fits to GP data. It therefore seems reasonable to use a GEV model in the case studies in this work, especially considering that the appropriate model for environmental data is not known beforehand.
[ Figure 16 about here.]        (Table 1, Case 1, α = 0) at various threshold non-exceedance levels, ψ. Top row: independent fits in each bin. Middle row: Constant shape parameter across all bins, with independent scale parameter (i.e. PPC with λσ = 0). Bottom row: Constant shape parameter across all bins, scale parameter penalised for variance (full PPC model with optimal λσ).  (Table 1, Case 1, α = 0) using independent fits per bin at various threshold levels.  (Table 1, Case 1, α = 0) using the full PPC model with optimal λσ at various threshold levels.  (Table 1), as a function of the number of covariate bins and extreme value thresholds used. ψ is the non-exceedance probability for the extreme value threshold.  (Table 1), as a function of number of covariate bins and extreme value thresholds used. ψ is the non-exceedance probability for the extreme value threshold.   (Table 1) with negative γ, as a function of number of covariate bins and extreme value thresholds used. ψ is the non-exceedance probability for the extreme value threshold.   (Table 1) with positive γ, as a function of number of covariate bins and extreme value thresholds used. ψ is the non-exceedance probability for the extreme value threshold. Figure 16: Bias and STD of estimators of GP shape and scale parameters and return values for a sample size of n = 50. The return value X P is the quantile corresponding to non-exceedance probability 0.999. Coloured lines are for GP fit to GEV data at different threshold levels with non-exceedance probability ψ. Black lines are for GP fit to GP data. Dashed black lines are theoretical bias and STD given by (26) and (27).