A penalised piecewise-linear model for non-stationary extreme value analysis of peaks over threshold

Metocean extremes often vary systematically with covariates such as direction and season. In this work, we present non-stationary models for the size and rate of occurrence of peaks over threshold of metocean variables with respect to one- or two-dimensional covariates. The variation of model parameters with covariate is described using a piecewise-linear function in one or two dimensions defined with respect to pre-specified node locations on the covariate domain. Parameter roughness is regulated to provide optimal predictive performance, assessed using cross-validation, within a penalised likelihood framework for inference. Parameter uncertainty is quantified using bootstrap resampling. The models are used to estimate extremes of storm peak significant wave height with respect to direction and season for a site in the northern North Sea. A covariate representation based on a triangulation of the direction-season domain with six nodes gives good predictive performance. The penalised piecewise-linear framework provides a flexible representation of covariate effects at reasonable computational cost.


Background
Metocean variables such as significant wave height, wind speed and current speed usually show dependence on covariates such as direction, season and water depth (e.g. Randell et al. 2015), as well as having complex inter-relationships: e.g. the relationship between wind speed and significant wave height (e.g. Towe et al. 2017; Mackay and Jonathan 2020b), or significant wave height and spectral peak period (e.g. Haver 1987; Jonathan et al. 2010), or significant wave height and surge (e.g. Ross et al. 2018). Good characterisation of the metocean environment and its extremes demands statistical models which accommodate dependencies of variables on covariates, and dependencies between extremes of variables, in a careful manner.
In this work, we consider a univariate response Y (e.g. storm peak significant wave height, or wind speed) which varies systematically with one or more covariates X = (X_1, X_2, ..., X_D) (e.g. storm direction, season, ...), such that the tail of the conditional distribution Y | X = x follows a known parametric form. In a peaks-over-threshold model, we expect Y | (X = x, Y > u(x)) to follow a generalised Pareto (GP) distribution, provided that the extreme value threshold u(x) is sufficiently large, with shape, scale and threshold parameters ξ(x), σ(x) > 0 and u(x) varying systematically with x, and density

f_GP(y; ξ(x), σ(x), u(x)) = (1/σ(x)) [1 + ξ(x)(y − u(x))/σ(x)]_+^(−1/ξ(x)−1), y > u(x),   (1)

where [·]_+ = max{·, 0}. Using a similar approach, we can also model the rate of occurrence of events, dependent on covariates X, assuming a Poisson model for counts of exceedances.
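As a concrete illustration of the density in Equation 1, the following minimal Python sketch (not part of the original work; the function name and example values are illustrative) evaluates the GP density for ξ ≠ 0, handling the [·]_+ truncation and the support y > u explicitly:

```python
def gp_density(y, xi, sigma, u):
    """Generalised Pareto density of Equation 1 for xi != 0 (illustrative
    sketch; the function name and parameter values used below are invented)."""
    if y <= u or sigma <= 0.0:
        return 0.0                               # density defined for y > u only
    t = max(1.0 + xi * (y - u) / sigma, 0.0)     # [1 + xi (y - u)/sigma]_+
    if t == 0.0:
        return 0.0                               # beyond the upper end point (xi < 0)
    return t ** (-1.0 / xi - 1.0) / sigma
```

For ξ = −0.2, σ = 1 and u = 5, the upper end point is u − σ/ξ = 10, and the density at y = 6 is 0.8⁴ = 0.4096.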
In marked contrast, Ross et al. (2018) use a penalised piecewise-constant (PPC) model. The motivation for this approach is the provision of a simple, pragmatic but useful representation for covariate effects, both in terms of parameter estimation and uncertainty quantification. In the PPC model, the multi-dimensional covariate domain D is partitioned into sub-domains or "bins" B_1, B_2, ..., B_M. In each bin B_k, the distribution of Y | (X = x, Y > u_k), x ∈ B_k, is assumed to be stationary and to follow a GP distribution with density given in Equation 1, with pre-specified extreme value threshold u(x) = u_k and scale σ(x) = σ_k. The shape parameter ξ is assumed common to all bins. The objective of the PPC inference is simultaneous estimation of the scale parameters and common shape parameter for all M bins. This is achieved by maximisation of predictive likelihood for a hold-out sample of data, further penalised by the roughness of the set σ_1, σ_2, ..., σ_M. The roughness penalty is selected to maximise the predictive likelihood, evaluated using a cross-validation scheme. Bootstrap resampling is used to quantify uncertainties of estimated parameters and other predictions. The PPC model has been used in a number of applications (e.g. Ross et al. 2020; Mackay and Jonathan 2020a; Guerrero et al. 2021).
In Mackay and Jonathan (2020a), it was observed that the piecewise-constant assumption did not provide the most natural, parsimonious representation for parameter variation on the covariate domain for some applications of interest, and that a piecewise-linear covariate parameterisation may be more appropriate. These observations motivated initial research into a penalised piecewise-linear (PPL) model outlined in Mackay and Jonathan (2020b) (using a common shape parameter and one-dimensional covariates), and subsequently the full development for arbitrary covariate domains reported here. There is a long history of piecewise-linear or segmented regression, motivated by the assumption of locally smooth (and in particular linear) variation of model parameters with covariates (e.g. Cleveland 1979; Yang et al. 2016). Like PPC, the PPL model is intended to provide a useful compromise between simplicity and flexibility. Like PPC, PPL is simpler to implement and compute than extreme value models incorporating general basis function representations for covariates, discussed by Zanini et al. (2020). At the same time, relative to PPC, PPL provides a more flexible and physically-realistic covariate representation at the cost of increased computational complexity.

Motivating application
We apply the non-stationary PPL extreme value model to data from the NORA10 hindcast of Reistad et al. (2011), for a location in the northern North Sea. The hindcast provides time-series of significant wave height, (dominant) wave direction and season (defined as day of the year, for a standardised year consisting of 360 days) for three-hour sea states for the period 1957-2010. Throughout, "direction" refers to the direction from which a storm travels, expressed in degrees clockwise with respect to north. Since significant wave height is serially correlated between sea states, storm peak significant wave height characteristics are isolated from the hindcast time-series using the procedure described in Ewans and Jonathan (2008). Contiguous intervals of significant wave height above a low peak-picking threshold are identified, each interval assumed to correspond to a storm event. The peak-picking threshold corresponds to a directional-seasonal quantile of significant wave height with specified non-exceedance probability, estimated using quantile regression. The maximum of significant wave height during the storm interval is taken as the storm peak significant wave height (henceforth H_S). The values of directional and seasonal covariates at the time of storm peak significant wave height are referred to as storm peak values of those variables. The resulting storm peak sample consists of 5388 values of H_S. The seasonal and directional characteristics of the sample are obvious from inspection of the figures in Sections 4 and 5. The directional influence of the land shadow of Norway is clear on the interval (45°, 210°). A more gradual seasonal variation is also present. We expect these sources of non-stationarity to be adequately characterised in the estimated PPL model. Data from the same location have been considered in Randell et al. (2016) and Konzen et al. (2021).

Objectives and layout
The objectives of the current work are to extend the PPL model of Mackay and Jonathan (2020b) to incorporate a non-stationary shape parameter and multi-dimensional covariates, and to provide software for applications in one and two dimensions. Further, we seek to demonstrate the usefulness of the PPL model in application to non-stationary extreme value analysis of storm peak significant wave height with directional-seasonal covariates for a location in the northern North Sea. The layout of the paper is as follows. Section 2 describes the PPL model formulation in detail, and Section 3 the approach to inference. Sections 4 and 5 discuss the North Sea application for covariates in one and two dimensions. Finally, Section 6 provides discussion and conclusions.

Model and preliminaries
This section introduces the extreme value model with piecewise-linear covariate representation. It also provides a summary of the approaches used to estimate the density f_X of covariates, and the non-stationary extreme value threshold u. A key motivation for the modelling approach is that covariate effects are more easily identified, and are more influential, in some aspects of the analysis than others (e.g. Anderson et al. 2001). In particular, covariate dependence of f_X and u is relatively easily identified. It is therefore reasonable to use more flexible non-stationary non-parametric methods to capture these effects. Covariate dependence of GP scale σ is also identifiable in general, but typically not to the same extent as that of f_X and u, suggesting a simpler covariate representation for σ may be appropriate. Quantifying the covariate dependence of GP shape ξ is most problematic; we anticipate that the piecewise-linear representation, or even a stationary estimate, will be suitable in general.

Formulation
The joint density of covariates X and response Y can be written as

f_{X,Y}(x, y) = f_{Y|X}(y|x) f_X(x),

where f_X is the density of covariates. We assume that the tail of f_{Y|X}(y|x) converges to a GP distribution. Our model for exceedances of a high non-stationary threshold u(x) can therefore be written

f_{Y|X}(y|x) = ζ f_GP(y; ξ(x), σ(x), u(x)), y > u(x),

where ζ = Pr(Y > u(x) | X = x) is the threshold exceedance probability, and f_GP is the density of the GP distribution with shape, scale and threshold parameters ξ(x), σ(x) and u(x), defined in Equation 1. Hence, for y > u(x), our model for the joint density is

f_{X,Y}(x, y) = ζ f_GP(y; ξ(x), σ(x), u(x)) f_X(x).   (2)

Inference requires estimation of the covariate density f_X(x), a non-stationary threshold u(x) (typically corresponding to a constant value of ζ), and a non-stationary GP model with parameters σ(x) and ξ(x). For y ≤ u(x), f_{X,Y}(x, y) can be estimated empirically, e.g. using kernel density estimation. Moreover, although this is not a requirement of the model in general, in the current work we assume that covariates X are periodic and therefore never extreme, so that kernel density estimation is also appropriate to estimate f_X(x). The threshold u(x) is estimated as a local quantile of Y on the covariate domain. It then remains to estimate the GP model, with parameters σ(x) and ξ(x) which vary systematically as a function of covariate, taking (penalised) piecewise-linear forms.

Piecewise-linear covariate representation
The piecewise-linear representation for functions σ(x) and ξ(x) of covariate x ∈ D is defined as follows. We first locate K nodes n_k, k = 1, 2, ..., K, on the covariate domain D, with coordinates n_k = (n_{k,1}, ..., n_{k,D}). Next we define a triangulation on D, with nodes as vertices, which partitions D into M covariate bins B_m, m = 1, 2, ..., M, each of which is a D-simplex. For problems with a single covariate (D = 1), each B_m is a line segment; for D = 2, 3, B_m is a triangle and tetrahedron respectively. GP model parameter values are specified at the vertices of the triangulation only, and are assumed to vary linearly within each bin B_m. The relationship between M and K depends on D, and on the choice of (non-unique) triangulation of nodes. The assumed periodicity of covariates means that some care is needed when interpolating within bins near the boundary of D.
The vertices of bin B_m are indexed using the indices of the nodes. For bin B_m, define the index vector t_m = (t_{m,1}, ..., t_{m,D+1}), where t_{m,j} ∈ {1, ..., K} and t_{m,i} ≠ t_{m,j} for i ≠ j, so that {n_{t_{m,1}}, ..., n_{t_{m,D+1}}} is the set of nodes defining the vertices of B_m. For an arbitrary function ψ : R^D → R, whose values are specified at the nodes, the linear interpolant of ψ for a point x = (x_1, x_2, ..., x_D) ∈ B_m is given by ψ(x) = x̃ β_m, where x̃ = (1, x_1, x_2, ..., x_D), and β_m = (β_{m,0}, ..., β_{m,D})^T is the solution of A_m β_m = c_m, where the rows of A_m are (1, n_{t_{m,j},1}, ..., n_{t_{m,j},D}), j = 1, ..., D+1, and c_m = (ψ(n_{t_{m,1}}), ..., ψ(n_{t_{m,D+1}}))^T stores the current values of ψ at the nodes t_m. Provided that D is not too large, it is straightforward to calculate A_m^{-1} for each bin B_m, and save it in memory for repeated use, making iterative estimation of a piecewise-linear model computationally efficient.
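The interpolation step can be sketched in Python for a single triangular bin in D = 2. The node coordinates and ψ values below are invented for illustration; as described above, A_m is inverted once and the inverse reused:

```python
import numpy as np

# Illustrative triangular bin B_m for D = 2: vertex coordinates and the
# current values of psi at those vertices (all numbers invented).
nodes = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # vertices of B_m
psi_nodes = np.array([1.0, 3.0, 2.0])                    # psi at the nodes

# Rows of A_m are (1, x_1, ..., x_D) for each vertex; c_m holds node values.
A_m = np.hstack([np.ones((3, 1)), nodes])
A_inv = np.linalg.inv(A_m)        # computed once per bin, cached for reuse
beta_m = A_inv @ psi_nodes        # beta_m = (beta_{m,0}, ..., beta_{m,D})

def interp(x):
    """Linear interpolant psi(x) = (1, x_1, x_2) . beta_m inside the bin."""
    return np.concatenate([[1.0], x]) @ beta_m
```

By construction the interpolant reproduces the node values exactly, and at the centroid returns the average of the three vertex values.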

Estimation of covariate density and extreme value threshold
Consider a sample S = {(x_1, y_1), ..., (x_N, y_N)} of N conditionally-independent observations, to be used to estimate the model in Equation 2. This section describes inference for covariate density f_X(x) and threshold u(x), with penalised piecewise-linear inference for the GP model described in Section 3.

Covariate density
The covariate density f_X(x) is estimated using kernel density (KD) estimation, for points on a regular grid in D. At location x, the KD estimate is

f̂_X(x) = (1/(Nw)) Σ_{i=1}^{N} φ((x − x_i)/w),

shown here for a single covariate, where φ is the standard normal density, the differences x − x_i are wrapped to respect the periodicity of the covariate domain, and the common bandwidth w is user-specified to give a reasonable compromise between smoothness and resolution. We note that many more sophisticated models for non-parametric density estimation are available, but this approach appeared adequate for the application discussed in Sections 4 and 5.
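A minimal sketch of a wrapped-kernel estimate for a single periodic covariate follows (the function name and grid are illustrative, and periodicity is handled simply by wrapping distances to half a period; the paper's multi-dimensional estimate would combine such kernels across covariate dimensions):

```python
import numpy as np

def periodic_kde(x_grid, x_obs, w, period=360.0):
    """Gaussian kernel density estimate on a periodic covariate (sketch,
    not the authors' code): wrap covariate differences to [-period/2,
    period/2) so that mass near 0 and 360 degrees/days is shared correctly."""
    d = x_grid[:, None] - x_obs[None, :]
    d = (d + period / 2.0) % period - period / 2.0   # wrapped differences
    phi = np.exp(-0.5 * (d / w) ** 2) / np.sqrt(2.0 * np.pi)
    return phi.sum(axis=1) / (len(x_obs) * w)
```

Provided w is small relative to the period, each wrapped kernel integrates to one over a single period, so the estimate integrates to one over the covariate domain.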

Threshold estimation
In the current work, the threshold exceedance probability ζ is set prior to analysis, hence threshold estimation reduces to estimation of conditional quantiles {u(x) : Pr(Y > u(x) | X = x) = ζ}. We choose to use a two-step process comprising (1) local quantile estimation and (2) subsequent quantile smoothing for this purpose, but again note that many alternative approaches would be suitable. The local quantile is again estimated on a regular grid in D. For each location x, we find the C nearest observations in S, calculate the empirical quantile corresponding to exceedance probability ζ, and smooth using a Gaussian kernel. The value of C and the bandwidth of the smoothing kernel are modelling choices.
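The local-quantile step (1) can be sketched as follows. The function is an illustrative stand-in for a 1-D covariate using plain Euclidean distance, omitting the periodic wrapping and the subsequent Gaussian-kernel smoothing step described above:

```python
import numpy as np

def local_quantile(x0, x, y, C, zeta):
    """Empirical quantile of y at exceedance probability zeta, based on the
    C observations whose covariate value is closest to x0 (illustrative
    sketch; distances are plain absolute differences, ignoring periodicity)."""
    idx = np.argsort(np.abs(x - x0))[:C]       # indices of C nearest neighbours
    return np.quantile(y[idx], 1.0 - zeta)     # non-exceedance prob. 1 - zeta
```

For example, with y increasing linearly in x, the local median (ζ = 0.5) at x0 recovers approximately y(x0).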

Estimation of penalised piecewise-linear generalised Pareto model
This section describes maximum roughness-penalised likelihood estimation of GP shape ξ(x) and scale σ(x), taking piecewise-linear forms for x ∈ D. A critical precursor to successful inference is reasonable node placement on D, discussed in Section 3.1. Given nodes, the maximum likelihood inference is discussed in Section 3.2. The use of cross-validation for optimal choice of roughness penalties is discussed in Section 3.3.

Node placement and triangulation
We seek to locate nodes on D such that the piecewise-linear representation is able to describe variation of GP parameters well. Intuitively, this suggests that nodes be located where the gradient of parameters changes sharply with respect to covariate. This is particularly important when there are only a few nodes. To guide selection of node positions, we calculate initial estimates of σ(x) and ξ(x) on a regular grid on D, in a similar manner to threshold estimation outlined in Section 2.3. For each x, we find the C nearest threshold exceedances of u(x), and use these to calculate (locally-stationary) moment estimators

ξ̃(x) = (1/2)(1 − m²(x)/v(x)),   σ̃(x) = (1/2) m(x)(1 + m²(x)/v(x)),

where m(x) and v(x) are the mean and variance of the C nearest threshold exceedances z_{j(x)} = y_{j(x)} − u(x_{j(x)}), j(x) = 1, ..., C, and j(x) indicates the index of the j-th closest observation to x. If ξ̃(x) < 0 and z_max(x) > −σ̃(x)/ξ̃(x), where z_max(x) = max{z_{j(x)} : j(x) = 1, ..., C}, then we set ξ̃(x) = −σ̃(x)/z_max(x), since for a negative shape parameter the upper end point of the GP distribution of threshold excess is −σ/ξ. If C is large, then the true values of the parameters may have a large variation over the range of x from which the sample is drawn, leading to bias in the estimates. Alternatively, if C is small, then the estimates will be subject to higher variance. Nevertheless, it is assumed that it is possible to choose C such that these initial estimates are adequate to identify gross features of σ(x) and ξ(x) on D. Once node locations have been specified, aided by the local estimates of σ and ξ, we triangulate the covariate domain. For D = 1, the partition achieved is unique, with bins corresponding to intervals between adjacent nodes. For D > 1, the triangulation is not unique, and different options are possible, as discussed in Section 5; clearly, the choice of triangulation also affects model performance.
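The moment estimators and the upper-end-point adjustment can be sketched as follows (an illustrative Python helper; the formulas ξ̃ = (1 − m²/v)/2 and σ̃ = m(1 + m²/v)/2 are the standard GP moment estimators assumed above):

```python
import numpy as np

def gp_moment_estimates(z):
    """GP moment estimators from threshold excesses z (illustrative sketch).
    If the implied upper end point -sigma/xi for xi < 0 lies below the sample
    maximum, xi is shrunk so that the end point equals the sample maximum."""
    m, v = np.mean(z), np.var(z)
    r = m * m / v
    xi = 0.5 * (1.0 - r)
    sigma = 0.5 * m * (1.0 + r)
    z_max = np.max(z)
    if xi < 0.0 and z_max > -sigma / xi:
        xi = -sigma / z_max            # enforce upper end point >= z_max
    return xi, sigma
```

For the excesses (1, 2, 3, 4, 5): m = 3, v = 2, so σ̃ = 8.25 and the raw ξ̃ = −1.75 implies an end point 8.25/1.75 ≈ 4.71 below the maximum 5, triggering the adjustment to ξ̃ = −8.25/5 = −1.65.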

Parameter estimation for given roughness coefficients
Estimates of the GP scale and shape parameters at the nodes are found by maximising the sample likelihood, penalised for parameter roughness on D. Optimal roughness penalties are estimated by cross-validation, described in Section 3.3.
With parameters θ = (σ(n_1), ..., σ(n_K), ξ(n_1), ..., ξ(n_K)) to be estimated, the sample likelihood for S under the PPL model is

L(θ | S) = Π_{i∈I} f_GP(y_i; ξ(x_i), σ(x_i), u(x_i)),

where I is the index set of exceedances {i : y_i > u(x_i)}, and for x ∈ B_m, parameters σ(x) and ξ(x) take piecewise-linear representations parameterised in terms of the corresponding bin nodes. The sample negative log-likelihood is denoted ℓ(θ | S) = −log L(θ | S). Roughness penalisation is used to smooth the variation of parameters on D for optimal predictive performance. Roughness is quantified using the gradient of the parameter function in each covariate dimension, in each bin; the partial derivatives of function ψ(x) for a point x ∈ B_m are given by the interpolation coefficients, ∂ψ/∂x_d = β_{m,d}, d = 1, ..., D. The penalised negative log-likelihood then takes the form

ℓ*(θ | λ, S) = ℓ(θ | S) + Σ_{m=1}^{M} Σ_{d=1}^{D} (λ_{σ,d} (β^σ_{m,d})² + λ_{ξ,d} (β^ξ_{m,d})²),

where β^σ_{m,d} and β^ξ_{m,d} are the interpolation coefficients for σ and ξ in bin B_m, and λ = (λ_{σ,1}, ..., λ_{σ,D}, λ_{ξ,1}, ..., λ_{ξ,D}) is the vector of roughness penalties. For given λ, constrained non-linear optimisation is used to minimise ℓ*(θ | λ, S). In the current work, we use the MATLAB function fmincon, under the constraint −0.5 ≤ ξ(x) < 0, reasonable for environmental variables (ξ = −0.5 corresponds to linear decrease in GP density with y, and ξ < 0 ensures a finite upper bound for the distribution of Y). A large value of λ for some parameter-covariate pair imposes a large penalty on the corresponding parameter gradient, and hence drives smoother solutions for that parameter-covariate pair. The starting solution for the optimisation is found using a Voronoi partition of D, with bins corresponding to the set of points closest to each node. Independent GP fits, constrained such that −0.5 ≤ ξ(x) < 0, are then made per Voronoi bin, and the parameter estimates used as a starting solution for the PPL estimation. In some applications, we may wish to assume that the GP shape parameter does not vary with covariate; in this case the model form can be simplified in the obvious way.
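To make the estimation step concrete, the following Python sketch fits a toy 1-D PPL model by minimising a penalised GP negative log-likelihood. This is illustrative only: the data are synthetic, SciPy's Nelder-Mead optimiser stands in for the constrained fmincon optimisation used in the paper (the shape constraint is not imposed here), and the penalty on squared per-bin slopes of σ is one plausible 1-D form of the roughness penalty described above:

```python
import numpy as np
from scipy.optimize import minimize

# Toy 1-D setup: piecewise-linear GP scale defined by values at K = 3 nodes,
# stationary shape xi, synthetic threshold excesses z (all values invented).
rng = np.random.default_rng(1)
nodes = np.array([0.0, 180.0, 360.0])
x = rng.uniform(0.0, 360.0, 500)
sigma_true = 1.0 + np.sin(np.pi * x / 360.0)
xi_true = -0.1
p = rng.uniform(size=500)
z = sigma_true / xi_true * (p ** (-xi_true) - 1.0)   # GP(xi, sigma) excesses

def pen_nll(theta, lam):
    """Penalised negative log-likelihood: GP term plus a roughness penalty
    on the per-bin slopes of sigma; theta = (sigma at 3 nodes, xi)."""
    sig, xi = np.interp(x, nodes, theta[:3]), theta[3]
    if np.any(theta[:3] <= 0.0):
        return 1e10                                  # invalid scale
    t = 1.0 + xi * z / sig
    if np.any(t <= 0.0):
        return 1e10                                  # outside GP support
    nll = np.sum(np.log(sig) + (1.0 / xi + 1.0) * np.log(t))
    slopes = np.diff(theta[:3]) / np.diff(nodes)     # beta_{m,1} per bin
    return nll + lam * np.sum(slopes ** 2)

theta0 = np.array([0.2 * z.max()] * 3 + [-0.1])      # feasible start
fit = minimize(pen_nll, theta0, args=(10.0,), method="Nelder-Mead",
               options={"maxiter": 2000})
```

The starting scale 0.2 z_max guarantees a feasible (finite) initial objective, playing the role of the per-bin Voronoi starting fits described above.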

Estimation of optimal roughness coefficients
Section 3.2 describes how the parameter set θ is estimated, given roughness coefficients λ. The value of λ is selected to maximise the predictive performance of the model using a G-fold cross-validation scheme. Cross-validation groups S_g, g = 1, 2, ..., G, are formed by random partition of the sample S into G groups of approximately equal size.
For each of a large number L of plausible choices for λ, and each cross-validation group S_g, we estimate the GP model using S \ S_g, finding the parameter vector θ̂_g(λ) that minimises the penalised negative log-likelihood ℓ*(θ | λ, S \ S_g). We then calculate the corresponding unpenalised negative log predictive likelihood ℓ(θ̂_g(λ) | S_g) for the excluded group S_g, using optimal parameter vector θ̂_g(λ). The predictive performance of the model for penalty vector λ is accumulated as

P(λ) = Σ_{g=1}^{G} ℓ(θ̂_g(λ) | S_g).

The optimal value of λ is that which minimises P(λ). Since random partitioning into groups leads to uncertainty in the predictive performance P(λ), the procedure is repeated R times, with different replicates of random partitions of S. The overall predictive performance of the model is quantified using P̄(λ) = (1/R) Σ_{r=1}^{R} P_r(λ), where P_r(λ) is the predictive performance estimated for replicate r. The uncertainty in P̄(λ) is calculated using a jackknife procedure. In outline, R different jackknife estimates of P̄(λ) are calculated, each using all R estimates P_r(λ), r = 1, 2, ..., R, except the r-th. The uncertainty U(P̄(λ)) in the mean predictive performance P̄(λ) is quantified as the range of the R jackknife estimates. The optimal choice of λ can be adapted to accommodate uncertainty due to the cross-validation strategy. In practice, we might select λ* such that P̄(λ*) ≤ P̄(λ•) + U(P̄(λ•)), where λ• is the value that minimises P̄, and the values of the components of λ* are larger than those of λ•. Hence, the estimated model using λ* will be stiffer and therefore more parsimonious than that using λ•, whilst constraining the degradation in predictive performance to an acceptable level.
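The replicate-averaging, jackknife and λ-selection rule can be illustrated with synthetic predictive scores (all numbers below are invented; in practice the matrix of P_r(λ) values would come from the cross-validation fits described above):

```python
import numpy as np

# Synthetic replicate scores P_r(lambda): R = 5 replicates over S = 10
# penalty values, flat for flexible models then sharply worse for very
# stiff ones (an invented but typical shape).
lams = 10.0 ** np.linspace(-1.0, 5.0, 10)
rng = np.random.default_rng(0)
base = 100.0 + 5.0 * np.maximum(np.log10(lams) - 2.0, 0.0) ** 2
P = base + rng.normal(0.0, 0.2, size=(5, 10))

P_bar = P.mean(axis=0)                         # mean predictive score
jack = (P.sum(axis=0) - P) / (P.shape[0] - 1)  # leave-one-replicate-out means
U = jack.max(axis=0) - jack.min(axis=0)        # jackknife range per lambda

i_dot = np.argmin(P_bar)                       # lambda with best mean score
acceptable = np.where(P_bar <= P_bar[i_dot] + U[i_dot])[0]
i_star = acceptable.max()                      # stiffest acceptable lambda
```

By construction λ* is at least as stiff as λ•, and its mean score lies within the jackknife uncertainty of the minimum.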

Computational complexity of predictive inference
The inference procedure involves iterative accumulation of predictive performance measure P over L roughness coefficients, G cross-validation groups, and R repetitions of the cross-validation partition. For a D-dimensional covariate domain D, exploring a reasonable number of values for each of the 2 × D components of the vector λ is necessary for reliable performance; searching over S values for each of the 2D roughness coefficients requires L = S^{2D} iterations. Computational complexity for the full analysis therefore scales as G × R × S^{2D}. Assuming that GP shape is stationary on D reduces computational complexity to G × R × S^D.
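For the settings adopted below (G = R = 5, S = 10) and a two-dimensional covariate domain, the implied grid-search cost is:

```python
# Grid-search cost implied by the scalings above (values from Section 3.4:
# G = R = 5, S = 10, D = 2); each "fit" is itself a penalised optimisation.
G, R, S, D = 5, 5, 10, 2
fits_nonstationary_shape = G * R * S ** (2 * D)   # 5 * 5 * 10^4 = 250,000 fits
fits_stationary_shape = G * R * S ** D            # 5 * 5 * 10^2 = 2,500 fits
```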
In light of this, to restrict computational time to reasonable values for physically-reasonable model complexity, we restrict our interest to the following archetypes. Case A is the computationally-simplest analysis undertaken, for which the GP shape parameter is assumed stationary on D, and a common roughness coefficient is assumed for GP scale with respect to each covariate dimension (i.e. λ_{σ,1} = λ_{σ,2} := λ_σ for a D = 2 directional-seasonal problem). Case B assumes that both GP shape and scale are non-stationary on D, with common roughness coefficients for each with respect to each covariate dimension (i.e. λ_{σ,1} = λ_{σ,2} := λ_σ and λ_{ξ,1} = λ_{ξ,2} := λ_ξ for a D = 2 directional-seasonal problem). Finally, Case C again assumes that GP shape is stationary on D, but estimates separate roughness coefficients for GP scale for each covariate dimension (i.e. λ_{σ,1} and λ_{σ,2} for a D = 2 directional-seasonal problem). Note that for D = 1, Cases A and C coincide.
After some experimentation for the North Sea application introduced in Section 1.2, and detailed in Sections 4 and 5 below, we choose to adopt G = 5, R = 5 and S = 10 as reasonable values for the numbers of cross-validation groups, partition repetitions and (component) roughness coefficients respectively. The S values of component roughness coefficients considered are equal logarithmic increments on the interval [10^{−1}, 10^{5}], ensuring reasonable coverage from very flexible to very stiff covariate representations.

Application with one-dimensional covariate
Here we discuss the estimation of one-dimensional directional and seasonal PPL extreme value models for the data described in Section 1.2, following the procedure outlined in Section 2.1. Two-dimensional directional-seasonal models are presented in Section 5.
Kernel density estimates for the covariate density f_X(x) are shown in Figure 1 for both seasonal and directional cases. After some experimentation, we chose to set the kernel density bandwidth to w = 15 days, to achieve a reasonable seasonal model compared to the empirical histogram. For the directional model, a bandwidth of w = 5° was chosen, to better resolve variation around 225°.
To estimate the non-stationary extreme value threshold u, the threshold exceedance probability ζ was set to 0.3 for the seasonal analysis and 0.2 for the directional analysis. Inspection of model output confirmed that these choices of ζ ensure relatively large samples are retained for GP inference, without incurring too much apparent bias. Samples of N = 1584 (seasonal) and N = 1077 (directional) threshold exceedances were retained. Estimates of the corresponding seasonal (directional) quantiles u use the C = 100 (50) nearest observations, smoothed using a Gaussian kernel with a bandwidth of 15 days (5°). The larger value of C and smoothing bandwidth for the seasonal analysis reflect the smoother variation of H_S with season compared to direction. Resulting threshold estimates are shown in Figure 2. For the seasonal analysis, the threshold ranges from u = 2.3 m in summer to 5.9 m in winter. For the directional analysis, the threshold ranges from u = 2.5 m for storms from the northeast to 6.9 m for storms from the southwest.
Seasonal and directional models with K = 2, 3, 4 and 6 nodes are estimated, corresponding to each of Cases A, B and C above. For the seasonal analysis, nodes are spaced uniformly on covariate domain D, with the first node location at n_1 = 20 days, corresponding to the peak value of the initial local estimate of GP scale σ. For the directional case, we place nodes manually, starting with two nodes located at the minimum and maximum of the initial local estimates of σ. For increasing K, additional nodes are added in turn, without changing the locations of existing nodes. Examples of the node placement for the K = 2 seasonal model and K = 4 directional model are shown in Figure 3, together with the local estimates (red) of σ, and corresponding initial values (blue) for the PPL model from a Voronoi partition. The blue curve appears to provide a reasonable representation of the variation in scale apparent in the local red estimates.
The left panel of Figure 4 illustrates the choice of λ•_σ for directional Case A with K = 4 nodes. The minimum mean negative log predictive likelihood P̄(λ•_σ) occurs at λ_σ = 10^{1.67}. However, values of λ_σ ≤ 10^{2.33} are all within the range of acceptable predictive performance, as quantified by the jackknife uncertainty in P̄(λ•_σ). Hence we set the optimal penalty as λ*_σ = 10^{2.33} for this case. The utility of replicating the cross-validation procedure R = 5 times is evident: for four of five repeats, the predictive performance is relatively constant up to the optimal penalty, whereas on one repeat, the performance deteriorates for λ_σ < λ*_σ. The sharp increase in P̄(λ_σ) around λ_σ = 10^{3} indicates that the corresponding models are too stiff to capture the variation of the scale parameter. For λ_σ ≥ 10^{3.67}, predictive performance asymptotes, as the estimated GP scale σ is effectively constant on D.
The corresponding plot for seasonal Case B with K = 2 nodes is shown in the right panel. Variation of mean predictive performance P̄ with λ_σ is similar to that in the left panel. There is less variation in P̄ with λ_ξ, indicating that allowing ξ to vary on D provides only small improvement in model performance.
For Case A, values of optimal predictive performance P̄(λ*), associated uncertainties, and optimal values of roughness coefficients λ* for the 1-D seasonal models with K = 2, 3, 4 and 6 are given in Table 1. Values of predictive performance scores are directly comparable between seasonal models, since all are based on the same sample.
Although there is some variability in P̄(λ*), given its uncertainty we can conclude that model performance is generally independent of K. Performance at K = 6 is somewhat worse, probably because that analysis suggests that a larger value of λ*_σ = 10^{3} is appropriate. Further investigation revealed that, for smaller values of the roughness coefficient, optimal parameter estimates from a particular cross-validation on sample S \ S_g (i.e. omitting subset S_g for some g = 1, 2, ..., G) yielded a predictive likelihood of zero on prediction set S_g, or equivalently infinite negative log predictive likelihood; this is a common problem with prediction for responses defined on bounded domains (e.g. Northrop et al. 2017). Thus, flexible models tend to be rejected, and a large roughness coefficient is required for plausible predictive performance; we surmise that the 6-node model with λ*_σ < 10^{3} induces some over-fitting. There is a slight improvement in performance, relative to the uncertainty associated with the predictive performance estimates, for Case B over Case A. We note that, generally, a larger roughness coefficient is selected for GP shape ξ than for scale σ. Conditional quantiles corresponding to different non-exceedance probabilities, estimated under the fitted model for different K, are shown in Figure 5. Unsurprisingly, given the similar predictive performances listed in Table 1, quantile levels are similar for all choices of K, with K = 6 suggesting somewhat lower extreme quantile levels in the interval [140, 200] days. Estimates of GP scale and shape for each of 100 bootstrap resamples are shown in Figure 6. Seasonal variation is clear, with larger scale in winter and somewhat more positive shape in the summer. Uncertainty in the scale parameter estimate is much lower than the magnitude of the seasonal variation. The corresponding uncertainty in the shape parameter estimate is higher relative to the seasonal variability; for some bootstrap resamples, the estimated shape parameter was approximately constant with season (visible as straight horizontal lines).
Comparisons of the tail probabilities Pr(Y > y), estimated directly from the sample and from simulation under the fitted 1-D seasonal Case B model with K = 4, are shown in Figure 7. Corresponding "monthly" tails are also compared. The number of simulated points is 1000 × N (where N is the original sample size). This reduces sampling uncertainty in the empirical distribution function for the simulated data to a reasonable level at the exceedance probability associated with the largest observation (see Mackay and Jonathan 2021 for details). For the left-hand all-year plot, a 95% bootstrap uncertainty band (based on 100 bootstrap resamples) is also shown; for clarity, uncertainty bands on monthly plots are suppressed. There is good agreement between observations and the fitted model.
Corresponding results for 1-D directional models for Cases A and B are summarised in Table 2. There is little difference in performance for 1-D directional models for Cases A and B with K > 2. However, K = 2 gives poorer performance, indicating that at least three nodes are required to capture directional variability. Allowing shape to vary gives a small improvement. However, for Case B with K = 4, 6, the optimal roughness coefficients correspond to effectively constant shape on D, again resulting from occurrences of infinite negative log predictive likelihoods for smaller choices of roughness coefficient. Conditional quantiles corresponding to different non-exceedance probabilities, estimated under the fitted Case B model for different K, are shown in Figure 8 (coloured lines show quantiles at non-exceedance probabilities of 0.8, corresponding to threshold u, 0.9, 0.99 and 0.999; circles are original observations; vertical lines show node locations; note that for K = 4, 6, optimal models correspond to constant GP shape). The K = 2 model does not adequately capture behaviour in the land shadow of Norway at angles in the interval [50°, 200°]; for larger values of K, quantile levels are generally comparable. For comparison with Figure 6, estimates of GP scale and shape for Case A with K = 6 are shown in Figure 9, for each of 100 bootstrap resamples of the original sample S.
Exceedance probability plots were generated for all 1-D models. The plots for the directional models showed comparable levels of agreement to that illustrated in Figure 7, so are not shown here. Note that the predictive performance of the directional and seasonal models cannot be compared, as the set of threshold exceedances is different in each case, due to the different choices of directional and seasonal non-stationary thresholds, with different exceedance probabilities ζ.

Application with two-dimensional covariate
Here we discuss the estimation of two-dimensional directional-seasonal models, again following the procedure outlined in Section 2.1. Note that the definition of the Case A, B and C archetypes is given in Section 3.4.
Kernel density estimates for the covariate density f_X(x) are shown in the left panel of Figure 10, using kernel density bandwidths of 10° directionally and 10 days seasonally, ensuring that directional details can be resolved. The estimated threshold corresponding to a local exceedance probability ζ = 0.3 is given in the right panel of Figure 10, using the C = 50 nearest neighbours, smoothed with a Gaussian kernel with bandwidths of 10° (directionally) and 10 days (seasonally).
Node placement in 2-D presents a greater challenge than in 1-D. We adopt two approaches. The first (henceforth the "regular grid") is defined in terms of D (= 2 here) sets of marginal node locations (each on [0, 360) here), used together to produce a regular lattice on D = [0, 360) × [0, 360). The first node location in each set is also used to locate an additional "wrapping node", placed at the location of the first node plus 360 degrees or days, to accommodate periodicity on D. The sets of marginal nodes are then used to create a regular rectangular grid of nodes on D, and extra nodes are added at the centre of each rectangle created, producing a (periodic) triangulation of D. For a 2-D representation with K_1 nodes (or bins) directionally and K_2 seasonally, the resulting total number K of nodes is K = 2 K_1 K_2, and the corresponding number of triangular bins is M = 4 K_1 K_2. An example of a 3 × 2 regular grid (with three marginal nodes in direction and two in season, and a total of 12 nodes and 24 bins) is shown in the left panel of Figure 11. We refer to such a representation as a "K_1 × K_2" grid. The intuitive approach explained in Section 3.1 is again used to specify each set of marginal nodes. The regular grid is attractive because it is specified in terms of sets of marginal nodes which are relatively straightforward to locate.
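The node and bin counts quoted above (K = 2 K_1 K_2 nodes and M = 4 K_1 K_2 triangular bins) can be checked with a short sketch of the regular-grid construction. This is an illustrative Python sketch with hypothetical names, assuming the 360-degree-or-day wrapping convention described above.

```python
def regular_grid(dir_nodes, sea_nodes, period=360.0):
    """Regular triangulated grid on the periodic 2-D domain: marginal nodes
    plus one wrapping node per dimension form a rectangular lattice; a centre
    node is added in each rectangle, which is then split into 4 triangles."""
    d = sorted(dir_nodes) + [min(dir_nodes) + period]   # directional wrapping node
    s = sorted(sea_nodes) + [min(sea_nodes) + period]   # seasonal wrapping node
    corners = [(a, b) for a in d for b in s]
    centres = [((d[i] + d[i + 1]) / 2, (s[j] + s[j + 1]) / 2)
               for i in range(len(d) - 1) for j in range(len(s) - 1)]
    K = len(dir_nodes) * len(sea_nodes) + len(centres)  # unique nodes (wrapped repeats excluded)
    M = 4 * (len(d) - 1) * (len(s) - 1)                 # 4 triangles per rectangle
    return corners, centres, K, M

# 3 x 2 grid as in the left panel of Figure 11: 12 unique nodes, 24 bins
_, _, K, M = regular_grid([0, 120, 240], [0, 180])
```

Corner nodes contribute K_1 K_2 unique locations and centre nodes another K_1 K_2, giving K = 2 K_1 K_2; each of the K_1 K_2 rectangles yields 4 triangles, giving M = 4 K_1 K_2.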
The second approach to node specification (henceforth the "irregular grid") permits "freehand" node location on D. This approach is obviously more flexible, and potentially more parsimonious, than the regular grid, but requires judicious choice of node locations in D dimensions; this may be relatively straightforward in 2-D, but in general may prove challenging. Once the irregular node locations are specified, we create "wrapping nodes" with coordinates shifted by ±360 (degrees or days) relative to those of the specified nodes in each covariate dimension. We then triangulate D using the specified and wrapping nodes. The right panel of Figure 11 illustrates an irregular grid with 6 specified nodes and 12 triangular bins.
As for 1-D covariates, irregular nodes in D dimensions should be placed where the local gradient of a model parameter with covariate is expected to change, i.e. at local turning points. We select node locations by inspection, again using initial local estimates of GP scale to guide placement. An example is shown in Figure 12 (corresponding to a 2-D extension of Figure 3). The surface shows initial local GP scale estimates from a stationary GP fit to the C = 50 nearest neighbours of each location on D, Gaussian-kernel-smoothed with common bandwidth w = 10 degrees or days. Black grid lines emanating from nodes represent the irregular grid triangulation, and a piecewise-linear representation for GP scale. Node placement can be adjusted manually until reasonable agreement is obtained between the actual value of scale and that suggested by the piecewise-linear triangulation.
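Once the triangulation is fixed, evaluating the piecewise-linear parameter surface at an arbitrary covariate location reduces to barycentric interpolation within the enclosing triangle. A minimal sketch (our own illustration, not the paper's code):

```python
import numpy as np

def barycentric(p, tri):
    """Barycentric coordinates of point p with respect to triangle tri (3 vertices in 2-D)."""
    a, b, c = (np.asarray(v, float) for v in tri)
    T = np.column_stack((b - a, c - a))
    l2, l3 = np.linalg.solve(T, np.asarray(p, float) - a)
    return np.array([1.0 - l2 - l3, l2, l3])

def pwl_eval(p, tri, node_vals):
    """Piecewise-linear parameter value at p: barycentric mix of the node values."""
    return float(barycentric(p, tri) @ np.asarray(node_vals, float))
```

At a node the interpolant returns that node's value exactly, and along any triangle edge it varies linearly, so the surface is continuous across bins; this is the continuity advantage over piecewise-constant bins discussed later.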
We assess the predictive performance of non-stationary PPL extreme value models on 2-D covariate domains using different choices of regular and irregular grids. For regular grids, we consider K_1 = 2, 3 or 4 marginal directional nodes and K_2 = 2 or 3 marginal seasonal nodes. We consider irregular grids with K = 4, 6, 8, 10, 12 and 14 nodes.
In the present examples, node sets for irregular grids are nested, in the sense that the nodes of a K-node grid are also nodes of the K′-node grid whenever K ≤ K′.
The cross-validation procedure used to estimate optimal values of roughness coefficients is equivalent to that discussed in Section 4 for 1-D covariates, adapted to accommodate the appropriate number of roughness coefficients. Figure 13 shows an example for a (regular) 3 × 3 Case A model, involving the estimation of a single roughness coefficient λ_σ for GP scale in both direction and season. Grey lines show negative log predictive likelihood P as a function of roughness coefficient λ_σ, for each of R = 5 replicate analyses, and the solid black line represents the mean P over the R replicates. As for the 1-D case, the dashed red line shows the "accepted" value of predictive performance. The inset panel shows a magnified view in the vicinity of the minimum of P. The optimal roughness coefficient is found to be λ*_σ = 10^1.67 in this case. Predictive performance for different regular and irregular grids is summarised in Tables 3 and 4. As for 1-D covariates, we find that more complex models sometimes show inferior performance, a consequence of the overly-heavy roughness penalisation required to avoid over-fitting, and hence infinite negative log predictive likelihoods, during cross-validatory assessment. For the regular grids, allowing separate penalties for σ in each covariate dimension (Case C) did not significantly improve predictive performance compared with a common penalty for σ in both covariate dimensions (Case A); therefore, for the irregular grids, only Cases A and B were considered. Allowing shape to vary with covariates (Case B) led to improved performance for some grids, but worse performance for others; in general the difference in performance was smaller than the uncertainty in the estimated performance. Moreover, for most grids, the optimal penalties for the shape parameter force shape to be effectively constant with covariate. Irregular grid models with six or more nodes give better performance than regular grid models. Even the simplest 4-node irregular grid model gives comparable performance to the regular grids; this feature is attributed to careful irregular grid node placement, allowing a more parsimonious description of parameter variation. Although there is some improvement in predictive performance for irregular grids with additional nodes, these improvements are small compared with jackknife performance uncertainties.
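The cross-validatory selection of a roughness coefficient can be illustrated with a stripped-down sketch: two bins with GP scales σ_1, σ_2, a squared-difference penalty λ(σ_1 − σ_2)² standing in for the full node-difference roughness penalty, and a crude grid search replacing the paper's optimiser. All names and simplifications here are ours.

```python
import numpy as np

def gp_nll(y, sigma, xi=-0.1):
    """Negative log-likelihood of GP excesses y with constant parameters."""
    z = 1.0 + xi * y / sigma
    if sigma <= 0 or (z.size and z.min() <= 0):
        return np.inf                      # parameters incompatible with data
    return float(np.sum(np.log(sigma) + (1.0 / xi + 1.0) * np.log(z)))

def fit_penalised(y1, y2, lam, grid):
    """Minimise NLL(bin 1) + NLL(bin 2) + lam*(s1 - s2)^2 over a grid of scales."""
    best, obj = None, np.inf
    for s1 in grid:
        for s2 in grid:
            o = gp_nll(y1, s1) + gp_nll(y2, s2) + lam * (s1 - s2) ** 2
            if o < obj:
                best, obj = (s1, s2), o
    return best

def cv_score(y1, y2, lam, grid, n_folds=5, seed=0):
    """Total out-of-fold predictive NLL for roughness coefficient lam."""
    rng = np.random.default_rng(seed)
    f1 = rng.integers(0, n_folds, y1.size)
    f2 = rng.integers(0, n_folds, y2.size)
    score = 0.0
    for k in range(n_folds):
        s1, s2 = fit_penalised(y1[f1 != k], y2[f2 != k], lam, grid)
        score += gp_nll(y1[f1 == k], s1) + gp_nll(y2[f2 == k], s2)
    return score
```

As in Figure 13, one would sweep a log-spaced ladder of coefficients, e.g. `lams = 10**np.linspace(-2, 4, 13)`, and keep the λ minimising the mean out-of-fold score over replicate random partitions.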
Figure 14 shows directional-seasonal quantiles under the regular 3 × 3 and irregular 6-node Case A models, corresponding to non-exceedance probability 0.99. The effects of both the piecewise-linear grid for σ(x) and the non-linear non-parametric threshold estimate are visible. The irregular grid provides a better description of the sharp transition in H_S around 210°, due to judicious node placement. However, quantile estimates are generally similar, indicating that inferences are not strongly dependent on grid choice.
Parameter estimates for the regular 3 × 3 Case A model are shown in Figure 15. The left panel shows the mean GP scale estimate over inferences using each of 100 bootstrap resamples. The right panel shows the corresponding distribution of stationary shape estimates. The general features of the left panel reflect those of the corresponding covariate density and threshold in Figure 10. Parameter estimates are also similar to those for the corresponding 1-D directional model illustrated in Figure 9. An illustrative directional tail plot for the irregular 6-node directional-seasonal Case A model is given in Figure 16. Agreement between observations and the fitted model is good.
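The bootstrap uncertainty quantification used throughout follows the usual pattern: resample the original sample with replacement, refit, and summarise the spread of the refitted parameters. A generic sketch (the `fit` argument is a placeholder for the full penalised GP inference, not the authors' implementation):

```python
import numpy as np

def bootstrap_params(y, fit, n_boot=100, seed=1):
    """Refit on n_boot resamples (with replacement) of y; the spread of the
    returned estimates quantifies sampling uncertainty."""
    rng = np.random.default_rng(seed)
    n = len(y)
    return np.array([fit(y[rng.integers(0, n, n)]) for _ in range(n_boot)])
```

For example, `np.quantile(bootstrap_params(y, fit), [0.025, 0.975])` gives a 95% band of the kind shown as dashed lines in Figures 7 and 16.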

Discussion and conclusions
Capturing covariate effects in ocean environmental extreme value models can be important for optimising the design of an asymmetric structure with respect to direction, or for assessing risk associated with an operation in a given season. It can also potentially improve inference for extreme quantiles. The level of sophistication required to describe covariate effects depends on the characteristics of the sample. We expect that a relatively simple penalised piecewise-linear (PPL) representation for the functional forms of extreme value model parameters will often be sufficient. In this paper, we demonstrate that the PPL generalised Pareto (GP) model for peaks over threshold of significant wave height provides a good characterisation of directional and seasonal effects at low computational cost. Roughness penalisation, regulated using a cross-validation scheme, ensures that parsimonious models are estimated. Uncertainties in parameter and return value estimates are quantified using bootstrap re-sampling. MATLAB software for the PPL model is available at https://github.com/edmackay/PPL-model, along with a user guide and the data underpinning the current study.
Relative to more sophisticated covariate representations (e.g. Wood 2011, Zanini et al. 2020), the PPL model is conceptually simple to understand, and facilitates computationally-simple inference, both of which we consider to be attractive features for the metocean practitioner. Relative to the simpler penalised piecewise-constant (PPC) approach of Ross et al. (2020), the additional flexibility of the PPL model allows a physically more plausible representation of continuous parameter variation on the covariate domain. Better use of the sample is made, with parameter values at each PPL node informed by observations in all bins adjacent to that node. In contrast, the PPC model is penalised only for the total variance of parameter values over the bins, ignoring bin proximity.
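The distinction can be made concrete. With node (or bin) parameter values θ, and taking hypothetical squared-difference forms for illustration (the exact penalty forms used in this paper and in Ross et al. 2020 may differ), a PPC-style variance penalty is invariant to permuting the bins, whereas a PPL-style adjacency penalty is not:

```python
import numpy as np

def ppc_penalty(theta):
    """PPC-style roughness: total variance of bin values, blind to adjacency."""
    theta = np.asarray(theta, float)
    return float(np.sum((theta - theta.mean()) ** 2))

def ppl_penalty(theta, edges):
    """PPL-style roughness: squared differences across adjacent nodes only."""
    theta = np.asarray(theta, float)
    return float(sum((theta[i] - theta[j]) ** 2 for i, j in edges))
```

For a chain of three adjacent nodes, a smooth ramp [1, 2, 3] and the shuffled [1, 3, 2] have identical PPC penalties, but the PPL penalty is smaller for the ramp, rewarding smooth variation across neighbouring nodes.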
In the current work, as outlined in Section 2.1 and Equation 2, we are interested in estimating the joint density f_{X,Y}(x, y) for y > u(x). This requires models for the extreme value threshold u(x) (in order to extract threshold exceedances and estimate their density f_GP(y | u(x), σ(x), ξ(x))) and for the density f_X(x) of covariates. Unlike the GP model for threshold exceedances, estimation of both u(x) and f_X(x) utilises the full sample. It is therefore reasonable to seek more detailed descriptions, using flexible local estimators, for covariate density and threshold than for the density of threshold exceedances. In this sense, the current hierarchical approach imposes increasing regularity on non-stationary effects as the statistical efficiency of the estimation decreases, incorporating kernel density estimation for covariate density and local quantile estimation for threshold, but a more rigid piecewise-linear model for GP scale, and either a constant or piecewise-linear model for GP shape. However, as implemented here, uncertainties from the estimation of covariate density and threshold are not propagated into the extreme value analysis; estimates of extreme value models are therefore conditional on the estimated extreme value threshold. Extreme quantile estimates are further conditional on the estimated covariate density. The effect of threshold uncertainty on return values can be large; in the current work, an appropriate threshold non-exceedance probability is selected to ensure that inferences are stable with respect to small variations in this choice. More complex approaches (e.g. Randell et al. 2016) seek to estimate covariate density, threshold and extreme value parameters in a single inference, at the cost of increased model complexity and computational burden. Due to the relative sparsity of information for the GP shape parameter, its roughness coefficient is often relatively large, implying smooth variation of the estimate on the covariate domain.
Nevertheless, models admitting non-constant shape generally perform slightly better than those with constant shape: allowing a little variation in shape improves performance.
Choice of the number and locations of nodes is key to the success of the PPL methodology. Diagnostic tools can aid these choices. In 2-D, it appears that judiciously-chosen "irregular" nodes provide the best predictive performance. Nevertheless, models exploiting nodes located on a "regular" rectangular grid, specified in terms of nodes for marginal components, also perform relatively well. In the current work, we have assumed that the same node locations are appropriate for GP scale and shape estimation; clearly this may not always be the case, and we might expect fewer nodes to be sufficient to describe shape parameter variation. Nevertheless, the optimal choice of roughness coefficients goes some way to providing balanced parameterisations across covariate dimensions and GP parameters.
In general, extreme value modelling using Bayesian inference, exploiting reversible jump or similar methodologies in 1-D (e.g. Zanini et al. 2020) and 2-D (e.g. Jonathan 2021), provides approaches to estimate the number and location of nodes automatically as part of the inference, at increased computational cost for a more complex model formulation. Experience also suggests that the Bayesian paradigm, which permits, via stochastic sampling schemes (e.g. the Metropolis-Hastings or mMALA algorithms), an exploration of the full posterior distribution of the parameters, is more reliable than the frequentist approach. The latter relies on accurate identification of the global mode of a high-dimensional and often highly multi-modal surface; in many cases it is difficult either to identify the mode or to verify that the mode is indeed global.
In the current analysis, the number of nodes is kept reasonably small, since we are motivated by the idea of providing the conceptually simplest useful extreme value model for the sample. However, the statistical literature on penalised B-splines (e.g. Eilers and Marx 2010) recommends specification of a large number of spline nodes (or knots), with roughness penalisation subsequently imposing the optimal smoothness on the solution, reducing the effective number of degrees of freedom in the covariate representation. In our case, with small numbers of nodes, penalisation plays a less important role in general, but is nevertheless important when the number of nodes is over-specified. Moreover, as a relatively simple optimisation method has been used in the present work, using a small number of nodes makes finding an optimal set of parameter estimates feasible without excessive computational effort.
Further work might include a systematic comparison of PPL performance relative to its peers, and perhaps also relative to methods involving transformation of data to standard scales prior to extreme value analysis (e.g. Eastoe and Tawn 2012). The current work has addressed periodic covariates only; inference for non-periodic covariates is in some senses simpler. However, in other applications (e.g. Mackay and Jonathan 2020a) a non-periodic covariate may itself become extreme, requiring joint extreme value modelling of "covariate" and response.

Figure 1 :
Figure 1: Kernel density estimates (red) of covariate density for 1-D seasonal and directional analyses, with empirical histograms (grey) for comparison. Kernel bandwidth is 15 days for the seasonal analysis and 5° for the directional analysis.

Figure 2 :
Figure 2: Threshold estimates for 1-D seasonal and directional analyses. Preliminary local quantile estimates in red, smoothed estimates in blue. Observations are shown as circles. Threshold exceedance probability is set at ζ = 0.3 for the seasonal analysis and ζ = 0.2 for the directional analysis.

Figure 3 :
Figure 3: Threshold exceedances together with initial local estimates of the GP scale parameter based on nearest observations. Starting values for the PPL estimates of scale based on the Voronoi partition are also shown. Node locations are shown as vertical dashed lines.

Figure 4 :
Figure 4: Optimal roughness coefficients for 1-D models. Left: predictive performance (negative log predictive likelihood) against roughness coefficient λ_σ, for directional Case A with K = 4. Grey lines indicate results for R = 5 replicate runs, and the solid black line is the mean over replicates. The dashed red line shows the value of the optimal predictive performance P(λ*) recorded. Right: mean predictive performance over R = 5 replicates for seasonal Case B with K = 2, with roughness coefficients λ_ξ, λ_σ. In both panels, the red dot shows the location of the penalty vector λ* yielding optimal predictive performance.
Estimates of GP scale and shape for Case B with K = 4 are shown in Figure 6, for each of 100 bootstrap resamples of the original sample S. Parameter estimates for individual bootstrap resamples are shown in grey, and their mean in black.

Figure 6 :
Figure 6: Estimates of GP scale and shape parameters for the 1-D seasonal Case B model with K = 4. Grey lines show estimates from each of 100 bootstrap resamples of the original sample. Bold lines show bootstrap means.

Figure 7 :
Figure 7: Annual and monthly exceedance probabilities for the 1-D seasonal Case B model with K = 4. Empirical estimates in black, model-based estimates in red. Dashed lines on the annual exceedance probability plot represent 95% uncertainty bands based on 100 bootstrap trials.

Figure 9 :
Figure 9: Estimates of GP scale and shape parameters for the 1-D directional Case A model with K = 6. In the left panel, grey lines show estimates from each of 100 bootstrap resamples of the original sample, and bold lines show bootstrap means. The right panel gives a histogram of the corresponding shape estimates.

Figure 10 :
Figure 10: Estimates of covariate density (left) and GP threshold parameter (right) for the 2-D directional-seasonal analysis. Kernel bandwidth w = 10° (directionally) and 10 days (seasonally) for both the kernel density estimate (left) and threshold smoothing (right). Initial local threshold estimates based on the 50 nearest observations.

Figure 11 :
Figure 11: Examples of regular and irregular 2-D grids (left and right respectively) on the extended covariate domain. Unique nodes are shown as red dots; all other nodes are repeats, included for ease of comprehension. Threshold exceedances are shown in grey. For ease of illustration on the regular grid, observations with x-coordinate ≤ 120° and/or y-coordinate ≤ 20 days (where (120°, 20 days) corresponds to the node location nearest the origin) have been shifted by +360. For ease of illustration on the irregular grid, the triangulation is extended by repeating node locations ±360 in each covariate dimension.

Figure 12 :
Figure 12: Gaussian-kernel-smoothed initial local GP scale estimates. Grid lines between nodes (black lines) visualise a piecewise-linear scale representation. Reasonable agreement between the actual value of scale and that suggested by the piecewise-linear triangulation suggests good node placement. An extended covariate domain is used to aid interpretation.

Figure 13 :
Figure 13: Illustrative results for estimation of the optimal roughness coefficient for a 3 × 3 Case A model. Grey lines show negative log predictive likelihood P as a function of roughness coefficient λ_σ, for each of R = 5 replicate analyses. The solid black line is the mean P over the R = 5 replicates. The dashed red line indicates the 'accepted' value of predictive performance. The inset panel gives a magnified view near the minimum of P.

Figure 14 :
Figure 14: Directional-seasonal conditional quantile of H_S for non-exceedance probability 0.99, for regular 3 × 3 and irregular 6-node Case A models (left and right respectively).

Figure 15 :
Figure 15: Estimates of GP scale and shape for the regular 3 × 3 Case A model. Left: mean directional-seasonal scale estimate. Right: distribution of (stationary) shape estimates.

Figure 16 :
Figure 16: Tail plots per directional octant from the irregular 6-node directional-seasonal Case A model. Abscissa is significant wave height in metres; ordinate is exceedance probability. Empirical estimates in black, model-based estimates in red. Dashed lines represent 95% uncertainty bands based on 100 bootstrap trials.