2T-POT Hawkes model for left- and right-tail conditional quantile forecasts of financial log-returns: out-of-sample comparison of conditional EVT models

Conditional extreme value theory (EVT) methods promise enhanced forecasting of the extreme tail events that often dominate systemic risk. We present an improved two-tailed peaks-over-threshold (2T-POT) Hawkes model that is adapted for conditional quantile forecasting in both the left and right tails of a univariate time series. This is applied to the daily log-returns of six large cap indices. We also take the unique step of fitting the model at multiple exceedance thresholds (from the 1.25% to 25.00% mirrored quantiles). Quantitatively similar asymmetries in Hawkes parameters are found across all six indices, adding further empirical support to a temporal leverage effect in financial price time series in which the impact of losses is not only larger but also more immediate. Out-of-sample backtests find that our 2T-POT Hawkes model is more reliably accurate than the GARCH-EVT model when forecasting (mirrored) value-at-risk and expected shortfall at the 5% coverage level and below. This suggests that asymmetric Hawkes-type arrival dynamics are a better approximation of the true data generating process for extreme daily log-returns than GARCH-type conditional volatility; our 2T-POT Hawkes model therefore presents a better performing alternative for financial risk modelling.


I. INTRODUCTION
A notable feature of many complex systems is that outcomes are often influenced more by rare extreme events than by more typical fluctuations [1].As a result, these extreme tail events often dominate the associated systemic risk, which makes accurate forecasting of them a vital objective in many disciplines.Extreme value theory (EVT) presents an approach to this problem in which asymptotic tail behaviour is modelled independently from the typical "bulk" fluctuations, under the justification that the two are often generated by distinct mechanisms [2,3].In practical terms, this means that values beyond a defined threshold are classified as extremes and are described by an exceedance distribution that is fitted independently from the sub-threshold (i.e.non-extreme or bulk ) distribution.The simplest application is made when the data generating process for extreme events is stationary.In this case, not only is the distribution of exceedances itself unconditional, but the intensity (i.e.arrival rate in time) of events from this distribution is constant.Together, these two conditions mean that arrivals of extreme events beyond any arbitrary threshold within the tail distribution occur in time according to a homogeneous Poisson point process (as defined by a constant intensity).However, in some systems of interest such as financial markets, complex dynamics and feedbacks give rise to non-stationary behaviour that means the conditions of constant intensity and of an unconditional distribution of exceedance mag- * mft28@bath.ac.uk nitudes cannot be assumed to hold.In such cases, extreme events tend to cluster together in time and their arrival intensity is often correlated with their magnitude; descriptions of these systems demand the development of conditional EVT methods that can account for these effects.
This raises a question about whether these dynamics should also be treated independently from the bulk behaviour or whether they should be related to the conditional moments of the full distribution.In the context of finance, both of these approaches have been used to describe extreme price changes, as measured by extreme log-returns.A seminal example of the latter approach is the GARCH-EVT model created by [4].This is an extension to the family of generalized autoregressive conditional heteroscedasticity (GARCH) processes -a ubiquitous class of reduced form models in financial analysis in which log-returns are described by independent and identically distributed (i.i.d.) innovations (i.e.white noise) scaled by the conditional heteroscedasticity, more often called the volatility [5].The GARCH-EVT model appends a generalized Pareto (GP) distribution to the innovation distribution of the GARCH process, in order to account for the residual heavy tails that are typically observed in log-returns when fitted with a standard (i.e.non-EVT) GARCH process [6].It follows that the intensity and magnitude of extreme events are then simply functions of the conditional volatility.Conversely, a pure conditional EVT model in which the dynamics of threshold-exceeding extremes are purely self-contained is presented by the peaks-over-threshold (POT) Hawkes model.This combines an exceedance distribution with a self-exciting Hawkes point process to describe the time inhomogeneous arrivals of tail events.The Hawkes point process, which first emerged as a stochastic model for the self-reflexive pattern of foreshocks and aftershocks that decorate major seismic activity [7][8][9][10], is defined by past events causing a time-decaying increase in the intensity of future events [11,12].This approach has found broad application to many systems characterized by activity bursts, including neural networks [13,14], crime [15,16], conflict [17,18], epidemics [19], social media [20], and financial markets [12,[21][22][23][24][25][26].[27] were the first to apply a POT Hawkes model to extreme log-returns of a financial price time series, which they defined as those less than the 10% in-sample quantile.In their approach, a Hawkes point process describes the arrival intensity of threshold exceeding logreturns, while the excess magnitudes of these events are described by an unconditional GP distribution.Subsequent literature has developed this work, by proposing alternative parametrizations [28,29], incorporating exogenous signals [30] and through multivariate extensions that describe contagious shocks between different price series [31][32][33].In this literature, the POT Hawkes model has almost always been used to describe the lefttail (i.e.extreme losses) exclusively, but this neglects that log-returns also experience right-tail extremes (extreme gains) 1 .This reflects a similarly exclusive focus on lefttail risk in the broader financial risk literature.Indeed, the two most commonly used tail risk measures -valueat-risk (VaR), which is a conditional quantile for the return of an investment over a given holding period, and the expected shortfall (ES), which is the expected value of the return given that it is less than the VaR -are named and defined for the left-tail only.However, this focus disguises the importance of right-tail extremes, which are highly correlated with their left-tail counterparts [31], can either quickly mitigate the impact of their mirror opposites or present an equivalent risk (or opportunity) under certain investment strategies.This is especially apparent in the wake of the Covid-19 pandemic and the impact its outbreak had on global financial markets.On 2020-03-12, the S&P 500 index suffered its worst daily loss since the infamous 1987 Black Monday crash, declining by −9.5%.Five of the next twelve trading days saw losses of −12.0%, −5.2%, −4.3%, −2.9%, and −3.4%.Crucially, however, the same period also saw daily gains of +9.3%, +6.0%, +9.4%, +6.2%, and +3.4%.Without these strong upswings, the index would have lost a staggering 31.1% of its value over this period rather than the 4.2% aggregate drop that it did experience.[34] developed the two-tail peaks-over-threshold (2T-POT) Hawkes model to investigate the interaction of and asymmetries between left-and right-tail extremes in financial log-returns.In this model, left-and righttail threshold exceedances are described by an asymmetric self-and cross-exciting Hawkes-type arrivals process combined with asymmetric conditional GP tails.This can account for the time clustering of threshold exceeding extremes from both tails, the correlation between the arrival intensity and the magnitude of exceedances, the heavy tailed distributions of those excess magnitudes, and the propensity for all these features to exhibit leftversus-right tail asymmetry -all of which are observed in the fluctuations of financial asset prices as measured by daily log-returns [5,6,[35][36][37].[34] applied their model to the daily log-returns of the S&P 500 (SPX), with leftand right-tail extremes defined by thresholds set at the 2.5% and 97.5% quantiles, respectively.They found that extreme daily losses and gains shared a common conditional intensity: losses were found to contribute 2.2 times as much to this intensity overall and this contribution was found to decay 4.6 times as quickly.These two asymmetries are connected to the leverage effect: a wellknown stylized fact of financial markets, which states that volatility is negatively correlated with the sign of past log-returns [6].However, while the first of these asymmetries can be explained by conventional models of this effect, the latter cannot 2 .If this temporal aspect of the leverage effect is demonstrated to be a common property of this and other classes of financial data, it would prove a novel insight into of one of the lowestorder non-zero correlations in the price signal [38].This could provide an enhanced understanding of the generating mechanism for extreme log-returns from both tails, which, if exploited, could then improve the forecasting of these most consequential events.
In this paper, we directly compare the 2T-POT model with a set of GARCH models including GARCH-EVT, by testing accuracy of one step ahead forecasts of conditional quantile-based risk measures for all the approaches.To achieve this, we first build on the work of [34] and reparametrize the 2T-POT Hawkes model so that the expected average intensity replaces the exogenous background intensity as a fitting parameter.This halves the optimization time and enables a constraint that achieves a dimension reduction of 1 at a negligible cost to the goodness of fit.The resulting improvements to the speed and reliability of the optimization procedure enable us to fit the 2T-POT Hawkes model to the daily log-returns of six international large cap equity indices over the in-sample 2 Specifically, when the 2T-POT Hawkes model was fitted to simulated data generated by a GJR-GARCH process (defined in Section II C 1) the impact asymmetry found in the branching vector γ ↔ was reproduced but the time asymmetry found in the decay vector β was not [34].
period, 1975-01-01 to 2015-01-01, and, in a novel step in the POT Hawkes literature, repeat this over a wide range of exceedance thresholds, from the 1.25% to 25.00% mirrored in-sample quantiles.This allows us to determine the sensitivity of our results on threshold selection.Comparable asymmetries to those reported in [34] are found in the fitted parameters of all six indices across a wide range of thresholds; this provides evidence that the temporal leverage effect is a universal property of this class of assets.
We further expand the 2T-POT model by extending its support to the full distribution of log-returns via a subordinate bulk (i.e.intra-threshold) distribution that is conditional upon the Hawkes exceedance process.This guarantees that forecasts of conditional quantile-based risk measures are always defined at all coverage levels.We compute next step ahead VaR and ES forecasts for both the left and right tails and assess their accuracy and serial independence in the out-of-sample period, 2015-01-01 to 2022-09-10, through backtesting methods.This greatly expands upon similar analysis in previous literature [29,[39][40][41]: both by extending the analysis to the right-tail and by evaluating forecasts over a much wider and finer range of coverage levels, from 0.25% to 15.00%.We find that our asymmetric 2T-POT Hawkes model produces the most reliably accurate forecasts in both the far left-tail (5% quantile or less) and the far right-tail (95% quantile or greater).This is not only of practical use, but also suggests that the asymmetric Hawkes-type arrival dynamics are a better approximation of the true data generating process for extreme daily log-returns than GARCH-type variance dynamics.
We present the reparametrized 2T-POT model with the subordinate bulk distribution in Section II.In this section, we also describe the GARCH-EVT model.The improved 2T-POT Hawkes model is fitted to the insample data in Section III, where we then examine the estimated parameters across different thresholds.In Section IV, we validate and compare the accuracy and serial independence of forecasts produced by the different models using backtesting methods in the out-of-sample data.In Appendix A, we derive the reparametrization of the 2T-POT Hawkes model in terms of the expected intensity.In turn, Appendix B defines the likelihood function and optimization procedure for the model and shows the results of the likelihood ratio tests used for model selection.

A. Conditional quantile-based risk measures
We first explicitly define the conditional quantile-based risk measures that are subject to validation in Section IV so that they can be explicitly defined for each model specified later in this section.When applied to the left-tail, these measures are known in the financial risk literature as value-at-risk and expected shortfall.However, because we generalize these to include equivalent measures for the right-tail, we adopt the more generic names: conditional quantile and conditional violation expectation.

Conditional quantile
If a discrete stochastic time series X t is generated according to the conditional cumulative distribution function (cdf) F X,t , then the left-tail conditional quantile (i.e. the value-at-risk) at the coverage level a q over the holding period from t − 1 to t is such that a fraction a q of X t are less than Q ← aq,t .Violations of the left-tail conditional quantile are indicated by the left-tail violation series where is the Heaviside step function.The right-tail conditional quantile is similarly defined such that a fraction a q of X t are right-tail violations, as identified by the series, I → aq,t = I[+(X t − Q → aq,t )].Note that the superscripts ← and → are used to denote the left-and right-tail, respectively.Henceforth the superscript is used to represent either tail (i.e.either ← or →) in generic expressions and the superscript ↔ denotes a union of the left and right tails.

Conditional violation expectation
The conditional quantile has the limitation of providing no information on the distribution beyond itself.In the context of finance, this has drawn the interest of many practitioners to expected shortfall (known as lefttail conditional violation expectation under our nomenclature) as an alternative risk measure.This has also been recognized at the level of regulation, where expected shortfall is now recommended as a risk measure by the Basel Committee on Banking Supervision [42].The left-and right-tail conditional violation expectations are defined as the conditional expectation of X t given a leftor right-tail violation, respectively: where E[.] is the expectation operator and f X,t = dF X,t /dX is the conditional probability density function (pdf) of X t .Note that throughout this paper F and f are used to denote cumulative distribution functions and probability density functions, respectively.These are accompanied with subscripts to denote the variable or model associated with the distribution.Conditional distributions are also indexed in discrete time with the subscript t.

Hawkes exceedance model
The two-tailed peaks-over-threshold (2T-POT) Hawkes model developed in [34] defines two sets of exceedance events in X t .
Left-tail exceedances events are defined as the values of X t that are less than left-tail exceedance threshold u ← ; this series of events, which is indexed with k, is fully described by the series of left-tail excess magnitudes Similarly, right-tail exceedance events are defined with respect to the righttail exceedance threshold u → , and are fully specified by the series T are defined symmetrically, so that number of exceedance events from each tail within the sample X 0:T is asymptotically equal as the length of the sample T tends to infinity.This is simply achieved by setting the thresholds equal to the value of mirrored in-sample quantiles, which is the typical approach to threshold selection in the financial POT Hawkes literature [27][28][29][31][32][33].Here, these quantiles are specified by the threshold level a u ∈ [0, 0.5], such that the left-tail threshold is equal to the in-sample a u -quantile, u ← = Qau (X 0:Tin ), and the right-tail threshold is equal to its mirrored in-sample quantile, u → = Q1−au (X 0:Tin ).
In the most general description, the arrivals of left-and right-tail exceedance events are counted by two distinct point processes, which can be viewed as the components of the bivariate point process where δ(t ) is the Dirac delta function.The arrival rate at time t of events within either point process is the conditional intensity for that process, The explicit time-dependence of λ (t|M t ) specifies N (t) to be inhomogeneous point processes; Hawkestype behaviour is specified by the conditional dependence on the event history up to the present time t, If N ← and N → are treated as distinct point processes, the conditional probability of a left-tail (right-tail) exceedance event occurring at time t is Note, however, that this is incompatible with the fact that the arrivals of these events within X t are mutually exclusive.Moreover, this treatment cannot natively forbid the case where the probability of an exceedance from either tail occurring at time t is found to be greater than 1, i.e.
It is established in [34] that, in the context of financial log-returns, these two requirements can be enforced at an insignificant cost to the goodness of fit by using the common intensity 2T-POT Hawkes model in which both types of extreme are counted within the same common point process N ↔ (t), whose arrival rate is given by the one-dimensional common conditional intensity λ ↔ .Each exceedance event within N ↔ (t) is then stochastically drawn from either tail upon arrival.Since the left-and right-tail thresholds are selected so that the average expected intensities of events from either tail, ), the high correlation of left and right tail extremes [31,34] means that it can be assumed that each event in N ↔ (t) is drawn from either tail with equal probability.Under this assumption, the conditional probability of a left-tail (right-tail) exceedance event occurring at time t is The Hawkes-type arrival dynamics of the common intensity process are constructed through a constrained bivariate model 3 that still allows for asymmetric self-and cross-excitement between asymmetric tails.This takes the form where θ u is the parameter vector for the Hawkes exceedance model, µ ↔ is the constant exogenous background intensity for the common arrival process, γ ↔ ≡ 3 We note that when the thresholds are selected so that a ← λ = a → λ , an unconstrained bivariate model of the form, λ = µ + Γχ, will tend to have equal rows in µ and the branching matrix Γ, such that λ ← ≈ λ → .Table IV   of left-and right-tail exceedance events as a linear sum of the endogenous excitements χ generated by the arrival of past left-(light red) and right-tail (dark green) exceedances.This may be understood as a branching process (fourth panel) in which the daughter events in generation n + 1 are spawned from the endogenous intensity produced by mother events in generation n.
As is illustrated in Fig. 1, the self-exciting dynamics of the Hawkes process can be understood as a branching process, in which daughter events are triggered by the additional endogenous intensity produced by the arrival of prior mother events.γ ↔ is called the branching vector, because γ ↔ is the mean number of daughter events in N ↔ that are triggered by a mother event in N .This is so, because the endogenous excitement χ is normalized, such that the expected lifetime contribution of each event in N to χ is 1.This normalization also guarantees that the model is uniquely fitted.The process is sub-critical (i.e.non-explosive) provided the aggregate branching ratio (γ ↔← + γ ↔→ )/2 is less than 1 [43].
The components χ of the endogenous excitement vector χ are the sums of contributions from all past events in each tail, (11) where, for each component, the decay kernel φ is a monotonically decreasing function of the time between the arrival of the past event, t k , and the present, t.The 2T-POT Hawkes model was conceived as a parametric model: for the sake of parsimony when describing multiple tails independently, the decay kernels -along with the other functions on which the endogenous excitement depends -are described with parametric functions adapted from previous financial POT literature [28,32].The decay kernel has previously been taken as either an exponential or power law decay [28].[34] used an exponential decay, which is scaled differently for the left and right tails according to the decay vector The advantages of this choice are that Eq. ( 11) can be recast in Markov form, and the decay vector β can be used to infer the characteristic timescales over which daughter events are triggered by mother events from either tail.The conditional impact function κ (M |t) is a monotonically increasing function of the excess magnitude M .Following the approach of [28,32], this is defined so that the intensity jump from the exceedance event arriving at time t k is determined by the contemporaneous value of the conditional cumulative distribution function of excess magnitudes for that tail where the mark parameter α ≥ 0. When α = 0, larger magnitude events produce greater jumps in the excitement.This reduces sensitivity on the choice of threshold value u , since κ (M |t) → (1 + α ) −1 as M → 0. Conversely, when α = 0, κ t becomes unity and an unmarked Hawkes process -in which χ is independent of the magnitudes of past events -is recovered.Eq. ( 14) is specified so that E[κ (M |t)] ≡ 1 for all values of α and t.Also, note that throughout this paper superscript calligraphic letters symbolize parametric probability distributions: N for the normal distribution, S for the Student-t distribution, and P for the generalized Pareto distribution.Hence, F P, M,t in Eq. ( 14) denotes that the conditional cdf of excess magnitudes from either tail is described by the GP distribution.
In EVT methods, the tail is almost always described by the GP distribution.This follows from the Gnedenko-Pickands-Balkema-de Haan (GPBH) theorem, which states that the GP distribution is the limiting distribution for threshold excesses [1,44,45].Accordingly, we assume that the excess magnitudes are distributed according to a conditional GP distribution, as specified by the cdf where the shape parameter ξ can specify a range of tail heaviness over three distinct phases: from the finite decay of the Weibull distribution (ξ < 0), through the exponential decay of the Gumbel distribution (ξ = 0), to the increasingly leptokurtic power-law decay of the Fréchet distribution (ξ > 0) [1,32].Conditional dependence on the endogenous (i.e.non-background) intensity of the Hawkes process is introduced via the conditional scale parameter Thus, when η > 0, larger magnitude events become more likely when the conditional intensity of exceedances is high, as is generally observed in financial returns [6].
Conversely, when η = 0 the excess magnitudes are drawn from an unconditional GP distribution with a fixed scale parameter σ t = ς .In this paper, we greatly expand the application of the 2T-POT Hawkes model to historic market data: in [34], the model was fitted to a single data sample (SPX) at a single threshold level, a u = 0.025; in Section III of this paper, we not only expand this to 6 data samples, but, uniquely within the financial POT Hawkes literature, we also fit the model to each sample at 20 different values of the threshold level, a u = 0.0125k u , where k u ∈ Z∪ [1,20].Given that the computational cost of calibrating the exceedance model scales with a u , the 120 independent calibrations in this paper are approximately 630 times more expensive in total than the single calibration performed in [34].This greatly expanded application of the 2T-POT Hawkes model required improvements to the speed and reliability of the optimization procedure.We achieved this achieved by reparametrizing the exceedance model so that expected average intensity a ↔ λ = E[λ ↔ ] replaced the background intensity µ ↔ as a fitting parameter, with the latter then calculated as as is derived in Appendix A. It is more efficient to use a ↔ λ as a fitting parameter instead of µ ↔ , since the former is orthogonal to γ ↔ .When this updated model was used to reproduce the single calibration from [34], we found that the maximum likelihood (ML) optimization time under the SLSQP method in SciPy [46] was reduced by 53%.More significantly, this modification also greatly reduced the number of failed optimizations in the expanded set of applications performed in this paper.Indeed, we emphasise that the large-scale analysis in Sections III and IV was only made feasible because of this reparametrization.
Even greater efficiency was achieved by noting the relationship between the expected average intensity a ↔ λ and the threshold level a u .Because the thresholds u are set equal to the (a u , 1 − a u ) in-sample quantiles, it follows that the average intensity a ↔ λ in the in-sample period is asymptotically equal to 2a u d t , where d t is the unit measuring one step in discrete time (in the case of daily log-returns, d t denotes trading days), as the size of the in-sample period T in tends to infinity.Thus, a ↔ λ = 2a u d t can be used either as an initial value in estimation or used as a fixed constraint.It is shown by likelihood ratio tests in Appendix B that the constraint achieves a parameter reduction of 1 at negligible cost to the goodness of fit; accordingly, this constraint is enforced for all calibrations in this paper.We also note that the constraint reduced the total optimization time for this paper by 12%.Having demonstrated these significant gains in optimization speed and reliability, we strongly recommend that our reparameterization is applied to all other Hawkes models that use the background intensity as a fitting parameter.
The common intensity 2T-POT Hawkes exceedance model is fully specified by the set of parameters, θ u = {γ ↔ , β, ξ, ς, η, α; a u }, where vector quantities are of the form, β ≡ [β ← , β → ] T .This model is hereafter denoted as H 2 (a u ).Note that, if all parameters in θ u are constrained to be symmetric (so that the left-and right-tail components of all vector parameters are equal), then the common intensity 2T-POT model is equivalent to the classical single-tailed peaks-over-threshold Hawkes model applied to the absolute values of a copy of the original time series that is centred on the mid-point between the thresholds.That is, the set of absolute exceedances , and a univariate Hawkes model applied to this exceedance series describes equal self-and cross-excitations between left and right tails that are symmetric in all properties.Hereafter, this is referred to as the symmetric 2T-POT Hawkes exceedance model and is denoted as H 1 (a u ).

Subordinate bulk distribution
It is simple to calculate left-tail (right-tail) conditional quantile at the coverage level a q over the holding period t − 1 to t (i.e. to calculate Q aq,t ) using the 2T-POT Hawkes exceedance model provided that the conditional probability of a left-tail (right-tail) exceedance estimated by the model is greater than or equal to the coverage level, i.e. p t ≥ a q .In this case, Q aq,t will lie within the GP tail distribution, F P, M,t , specified by the exceedance model.Thus, Hence, The conditional violation expectation over the same holding period can then be calculated as (20) If, however, the left-tail (right-tail) exceedance probability is less than the coverage level (i.e.p ,t < a q ), then the hypothetical conditional quantile would lie between the two thresholds -within the bulk of log-returns not supported by either the left-or right-tail exceedance distributions of the 2T-POT Hawkes model.It is therefore desirable to extend the support of the 2T-POT Hawkes model to the full distribution of X t .We do this by incorporating it as the jump process within a jump-diffusion type model where the intra-threshold bulk is supported by a supplementary conditional distribution that describes diffusion.Even when solely concerned with the forecasting of extremes, the distribution of the intrathreshold bulk becomes increasingly pertinent as the coverage level a q approaches the threshold level a u from below, and even more so if forecasts are to be made over more than a single time step.This extension was essential for this paper because the backtesting methods used in Section IV require that Q aq,t and E aq,t are defined at all values of t, and this can only be guaranteed when the full distribution of X t is supported.
Because we are interested in the Hawkes process as a pure conditional EVT model and how this compares to the conditional volatility mechanism described by the GARCH-EVT model, we take the unusual step of subordinating the diffusion model (responsible for the bulk) to the jump model (generating the extreme events).Specifically, we introduce an intra-threshold bulk distribution that is transformed by parameters that are conditional upon the intensity of the Hawkes exceedance process.This subordinate bulk distribution is described by the conditional cdf F D B,t , where the superscript D is replaced by the symbol of the specified parametric distribution (N for the normal distribution and S for the Student-t distribution).F D B,t is transformed by the conditional location and scale parameters, m t and s t , so that its value at the thresholds u matches the probability of an exceedance event at time t as determined by the Hawkes arrival process.These two constrains are expressed by the equations, where θ D B is a vector containing the additional unconstrained parameters of the chosen parametric distribution D. Here, it becomes essential that the condition, p ← t + p → t ≤ 1 is strictly enforced by the use of a common conditional intensity as described in Eq. (9).In contrast, the bivariate model described in Eq. ( 8) does not enforce this condition, and so it will attribute negative probability mass to the intra-threshold bulk when the components of the bivariate conditional intensity λ are sufficiently high.
The conditional pdf for X t at time t is then a weighted piecewise union of the bulk and tail distributions: The fully supported 2T-POT Hawkes model at threshold level a u is denoted as H D 2 (a u ).The equivalent fully supported symmetric 2T-POT Hawkes model is denoted as C. GARCH-EVT model

GARCH
In this paper, the 2T-POT Hawkes model is compared with the family of generalized autoregressive conditional heteroscedasticity (GARCH) models.These are the standard reduced form models for log-returns in industry and they are a ubiquitous baseline in the financial risk literature [5].GARCH models have earned this status due to their parsimonious description of two key stylized facts of financial returns: volatility clustering, which states that the standard deviation of log-returns (known as the volatility) is non-constant and exhibits significant positive autocorrelation, and the leverage effect, which states that volatility increases when returns become more negative [5,6].GARCH models describe a conditional volatility process in which a real univariate discrete time series is generated as where µ is the unconditional mean, σ t is the conditional volatility, and D t are i.i.d.random innovations drawn from a parametric distribution D, typically with zero mean and unit variance.The GARCH(p, r, q) model describes the conditional variance σ 2 t with an autoregressive moving average (ARMA) process where ω is the minimum conditional variance and {α i , β j , γ k } are the ARCH coefficients.When r > 0, Eq. ( 24) is a GJR-GARCH model [47] that accounts for the leverage effect -interpreted as an asymmetric impact of negative log-returns on volatility -through the Heaviside step function.
In this paper, we consider the GARCH(p = 1, r, q = 1) model with (r = 1) and without (r = 0) the leverage effect.For the innovation distribution, we consider the unit normal distribution (D = N 0,1 ) and the unit Studentt distribution with ν degrees of freedom (D = S 0,1,ν ).Hereafter, we label these specifications of the GARCH model as G D r (0).

GARCH-EVT
While the standard GARCH models specified in Section II C 1 provide the typical benchmark for forecasting log-returns, we are primarily interested in comparing the 2T-POT Hawkes model with the GARCH-EVT model [4,40].This model is constructed by appending GP tails to the distribution of innovations, so that the (unconditional) pdf for t now takes the form where f D B is the pdf of a continuous parametric distribution of zero mean and unit variance 4 and the GP tail distributions f P, M are parameterized equivalently to Eq. ( 15) except that the scale parameter remains constant and is therefore denoted as ς .Thus, The innovation thresholds are set according to the threshold level a u , such that u ← ≡ F D B −1 (a u ) and Thus, when a u = 0, the standard (i.e.non-EVT) GARCH models are recovered.Accordingly, the GARCH-EVT model is labelled as G D r (a u ).Estimates for the innovation tail parameters, ξ and ς , are obtained by ML estimation over the estimated innovations ˆ t = (X t − μ)/σ t [4].Note that we use hat accents to denote values that are estimated or are derived from estimates.
As well as being one of the best performing univariate risk models within the financial risk literature, the GARCH-EVT model represents an alternative conditional EVT approach compared with the 2T-POT Hawkes model.In the latter, the dynamics are solely influenced by the threshold exceeding events.Conversely, the dynamic properties of the former are functions of the conditional volatility σ t , which, as per Eq. ( 24), is influenced by each innovation in t in proportion to their magnitude.Thus, while the 2T-POT Hawkes model faithfully extends the foundational principle of EVT -that 'extreme events should speak for themselves' -to the conditional case, the GARCH-EVT model represents a compromise.The comparison of the forecasting accuracy of these two models in Section IV therefore acts as comparison between these two distinct views of how extreme events are generated.

III. DATA AND EMPIRICAL STUDY A. Data description
The models specified in Section II are applied to the daily log-returns of six international large cap equity indices: the S&P 500 (SPX) and Dow Jones Industrial Average (DJI) from the U.S.A., the DAX 30 (DAX) of Germany, the CAC 40 (CAC) of France, the Nikkei 225 (NKX) of Japan, and the Hang Seng index (HSI) of Hong Kong.These series are widely investigated financial benchmarks that are often perceived as proxies for the broader equity market in the world's three major financial centres: North America, Western Europe, and East Asia.We specifically use the daily log-returns over Table I.Summary statistics for the series of daily log-returns of the six large cap international equity indices.For each sample of log-returns Xt, the number of observations T , mean X, standard deviation σX , median Q0.5(X), and median absolute deviation MADX are given.I.We note that, for both the in-and out-of-sample periods, the six series are not of the exact same length T due to idiosyncratic holidays and suspensions.

B. Threshold selection and in-sample calibration
As is discussed in Section II B 1, the exceedance thresholds u of the POT Hawkes models are determined by the choice of threshold level a u , which specifies the quantiles at which the thresholds are set and is regarded as a tuneable parameter.In unconditional EVT applications, where an unconditional GP distribution is fitted to a stationary tail, the only concern with the threshold is that it determines the sample of excess magnitudes M k .Threshold selection is therefore treated as a classic bias-variance trade-off: a less extreme threshold (i.e. higher threshold level a u ) ensures a greater number of observations in the tail; reducing noise and improving the stability of parameter estimates.However, the asymptotic nature of the GPDH theorem means that the exceedance distribution is better approximated by the GP distribution as the threshold becomes more extreme (i.e. as the threshold level a u is lowered to 0).Diagnostic tools exist for the latter issue [49], and these are often used to inform threshold selection in this time homogeneous case: for instance, in the construction of the GARCH-EVT model when fitting GP tails to the estimated innovations ˆ t [40].However, the introduction of Hawkestype arrival dynamics adds a second layer of importance to threshold selection, because this also determines the sample of arrival times t k .This not only effects the expected distribution of the conditional intensity λ ↔ , but also distribution in time of the threshold exceeding events that provide information to the Hawkes process.While these additional effects invalidate the threshold selection procedures used in the stationary case, they may at the same time introduce distinct phases along a u in which the 2T-POT Hawkes model identifies signals from the data generating mechanisms of the underlying system.This could be evidenced through a phase transition along a u in the fitted parameters or in the forecasting accuracy, either of which could provide a physically meaningful definition of an extreme event in the combined context of arrivals and magnitudes.To explore this, we calibrate the threshold-based models across a wide range of threshold levels: a u = 0.0125k u , where k u ∈ Z ∪ [1,20].We believe this is a novel contribution in the application of POT Hawkes models to financial returns, since, to our knowledge, previous literature has only considered a single (arbitrary) threshold level per study, typically within the range, 0.025 ≤ a u ≤ 0.1.
Multiple specifications of the 2T-POT Hawkes model were calibrated across this range of threshold levels using the estimation procedure described in Appendix B. Likelihood ratio tests were used to select which variant of the model would be used in the later forecasting validation: this selected for the common intensity model with the a ↔ λ = 2a u d −1 t constraint applied and for a Student-t distributed bulk.Tables containing the results of these likelihood ratio tests are included in Appendix B along with a graphical presentation of the estimated parameters (with standard errors) of the selected H S 2 (a u ) model as a function of a u .The fit of the exceedance model to both the arrivals process and the distribution of excess magnitudes is tested through residual analysis [34].For the arrivals process, the Kolmogorov-Smirniov (KS) test is used to test the null hypothesis that the exceedance For the excess distributions, the null hypothesis that the residual excess magnitudes are unit exponential distributed is subject to the KS test.These tests are shown graphically in Fig. 2 for the H 2 (a u = 0.025) exceedance model fitted to the DAX daily log-returns.Equivalent plots for the other five indices are included in the supplementary material.
A further aim of fitting the Hawkes exceedance models across a wide range of threshold levels a u is to verify whether the asymmetric Hawkes arrival dynamics reported in [34] persist across this range.They fitted the 2T-POT Hawkes model to the daily log-returns of the SPX over the in-sample period, 1959-10-02 to 2008-09-01, at a u = 0.025 and found significant asymmetries in both the branching vector (γ ↔← /γ ↔→ = 2.2 ± 0.5) and the decay vector ( β← / β→ = 4.6 ± 1.2) that together described a temporal leverage effect.It is anticipated that the parameters will converge to symmetry as a u → 0.5, since, in this limiting case, every value within X t will qualify as an exceedance event: thus, the conditional intensity would be fixed and entirely attributed to the exogenous background, making excitation from either tail equally redundant.Fig. 3 shows that, with some exceptions, the estimated asymmetries in the arrival parameters are remarkably stable along a u and between the different indices.A slight convergence towards symmetry in both parameters is generally observed as a u increases, but there is no evidence of a sharp transition within the investigated range.The left-over-right component ratios are quantitatively similar to those reported in [34] in the vast majority of cases and are never observed to invert.A notable outlier is the HSI branching vector γ ,HSI , which converges to symmetry at a u = 0.1.This possibly reflects that the HSI had a notably higher average volatility in the in-sample period compared to the other indices (as shown in Table I), which corresponds to the fact that the HSI was a less developed market in this period and, therefore, possibly should be regarded as a separate class.It is also noted that the NKX and HSI decay constant asymmetries β← / β→ are much larger at low a u compared to the other indices.
Overall, these results add further empirical support to hypothesis of [34] that there is a temporal aspect to the leverage effect observed in financial daily log-returns, such that the impact of losses on future extremes is not only greater but also more immediate.This important structural feature contributes to the forecasting of exceedance arrivals, which will be seen in the relative forecasting performance of H 2 (a u ) compared to the fully symmetric H 1 (a u ) model in Section IV.

IV. BACKTESTING OF OUT-OF-SAMPLE QUANTILE FORECASTS
In this section, we use the models calibrated on the insample data (1975-01-01 to 2015-01-01) in Section III to produce next step ahead forecasts of the left-and righttail conditional quantile Q aq,t and conditional violation expectation Ê aq,t in the out-of-sample period (2015-01-01 to 2022-09-10); the accuracy of these forecasts is then validated and compared through backtesting methods.The forecasts produced by the H S 2 (a u ) Hawkes model are compared against those produced by the G S 1 (a u ) GARCH-EVT model in order to determine which conditional EVT approach -Hawkes-type arrival dynamics or GARCH-type volatility dynamics -provides the best description of extreme log-returns.We also include H S 1 (a u ) in this comparison to demonstrate explicitly the benefits of incorporating asymmetry within the 2T-POT Hawkes model.We also take into account three standard GARCH models of increasing complexity [G N 0 (0), G S 0 (0), G S 1 (0)].Forecasts are evaluated across the range of coverage levels: a q = 0.0025k q , where k q ∈ Z ∪ [1,60].This is a much wider and finer range of coverage levels than has been investigated with these backtesting methods in previous financial risk literature [29,[39][40][41]: this is intended to more fully compare where the relative advantages of each model lie and to explore how this depends on the threshold level a u .Since each of the backtesting methods are defined identically for the left-and the right-tail, the superscript is dropped from I aq,t , Q aq,t , and E aq,t in the remainder of Section IV for clarity unless explicitly required.

A. Conditional quantile backtesting methods
The first set of backtesting methods evaluate the accuracy of the conditional quantile forecasts Qaq,t by considering the series of observed violations Îaq,t .If the conditional quantile forecasts are perfectly accurate (i.e.Qaq,t = Q aq,t ∀t), then the arrival process of violations is a Poisson point process, Pr{ Îaq,t = 1|t} = Pr{I aq,t = 1|t} ≡ a q ∀t. ( All of the backtesting methods in Section IV A derive their null hypothesis from Eq. ( 29).

Unconditional coverage (UC) test
The unconditional convergence (UC) test [50] assesses the null hypothesis that the observed proportion of vi-olations π1 is equal to the assumed coverage level a q (H 0 : π1 = a q ).The UC test is formulated as a likelihood ratio test which compares two Bernoulli likelihood functions.Asymptotically, as the total number of observations in the sample, T , goes to infinity, the test statistic LR UC is distributed as χ 2 with one degree of freedom: where T1 = t Îaq,t is the observed number of violations in the sample of length T and π1 = T1 /T .H 0 is rejected if the conditional quantile forecasts Qaq,t have a significant bias, such that they consistently under-or overestimate the true conditional quantiles Q aq,t in the sample.

Conditional coverage (CC) test
The conditional coverage (CC) test [51] seeks to verify both the correct coverage and the independence of violations over consecutive observations.The process of violations is described by a first-order Markov model, with the CC test based upon the 2 × 2 estimated transition matrix with elements πij that give the estimated conditional probability of there being a violation (i = 1) or no violation (i = 0) at t given that there was (j = 1) or was not (j = 0) a violation at t − 1, πij = Tij / Ti0 + Ti1 , where Tij is the number of observations where Îaq,t = i| Îaq,t−1 = j.The null hypothesis of the CC test is that the conditional probability of a violation at t with and without a violation at t − 1 are both equal to the assumed coverage level, i.e.H 0 : π10 = π11 = a q .As the total number of observations T goes to infinity, the test statistic LR CC is asymptotically distributed as a χ 2 with two degrees of freedom: (32) H 0 is rejected if either the conditional quantile forecasts at time t that follow a violation at t−1 ( Qaq,t | Îaq,t−1 = 1) or that follow no violation at t−1 ( Qaq,t | Îaq,t−1 = 0) have a significant overall bias in the sample.

Dynamic quantile (DQJ ) test
The dynamic quantile (DQ J ) test [52] is designed to detect higher-order autocorrelation in the series of violations as well as dependence on other explanatory variables.The test is based upon the hit function, Ĥit aq,t = Îaq,t − a q . (33) If follows from Eqs. ( 29) and ( 33) that, under the correctly specified model, the series Ĥit aq,t should be i.i.d. and have zero mean.Accordingly, Ĥit aq,t should be independent of lagged values of itself and also independent of the contemporaneous conditional quantile Qaq,t , such that the conditional expectation of Ĥit aq,t should be 0 regardless of any such information available at t − 1.The DQ J test used here is derived as the Wald statistic from an auxiliary regression: where the DQ J null hypothesis is H 0 : In other words, the null states that the observed violation coverage probability is equal to the assumed coverage level (φ 0 = 0) and that there is no dependence of Ĥit aq,t on the 1 + J explanatory variables.
The DQ test statistic is asymptotically χ 2 distributed with 2 + J degrees of freedom: where Ĥit aq is a 1×T vector containing the series Ĥit aq,t observed within the sample and A is a T × (2 + J) matrix comprised of the observed sample series of each explanatory variable.

Zero mean discrepancy (ZMD) test
The conditional violation expectation forecasts Êaq,t are evaluated by testing the null hypothesis that the standardized discrepancies Daq,t = X t − Êaq,t Qaq,t − Q0.5,t , at violations (i.e. when Îaq,t = 1) have zero mean [4], i.e.H 0 : t Daq,t Îaq,t / T1 = 0. Assumptions about the distributions of the standardized discrepancies are avoided by employing the dependent circular block bootstrap used in [41,53].The null hypothesis is subject to a two-tailed test against the alternative hypothesis, H 1 : t Daq,t Îaq,t / T1 = 0. Thus, the null is rejected when there is a significant bias in the conditional violation expectation forecasts Êaq,t over the observed violations ( Îaq,t = 1) within the sample.

C. Backtesting results and discussion
We present the backtesting results synoptically in Tables II and III and as full visualizations in Figs.4Table II.Summarized backtesting results in the out-of-sample period, 2015-01-01 to 2022-09-10.The proportions of null hypothesis rejections at the 95% confidence level (p < 0.05) across all six indices within coverage level bands a 0 q < aq ≤ a 1 q are shown for each model, with the results for the EVT models aggregated over three representative values of the threshold level, au ∈ {0.05, 0.1, 0.2}.Lower rejection proportions correspond to better performance on the test; intra-tail performance rankings along each row are given in parentheses, with the first-and second-best performing models highlighted in dark and light grey, respectively.
p, left-tail ← p, right-tail → Test a 0 q to 6. Tables II and III summarize the results for each backtesting method by giving the proportion of null hypothesis rejections at the 95% confidence level (p < 0.05, which appears as dark blue in Figs. 4 to 6) aggregated across all six indices within six bands of the coverage level, a 0 q < a q ≤ a 1 q .For the EVT models (a u > 0), these tables include the results at three representative values of the threshold level, a u ∈ {0.05, 0.1, 0.2}.In Table II, the results at these three threshold levels are aggregated together to give a single overall proportion of null rejections for each of the EVT models; these are given with the equivalent results for the standard (a u = 0) GARCH models.In Table III, the results for the EVT models at the three representative threshold levels are given independently.These tables are designed so that the relative performance of the models at each test and coverage level can be more easily compared; we are primarily focused on the performance within the lower (i.e. more extreme) coverage levels, a q ≤ 0.05.In addition to these tables, we provide full visualisations of the original p-values as a field within the (a q , a u )-space.Due to the large size of these figures, we only include three within the main paper: Figs. 4 and 5 shows the full results of the UC test for each of six indices, while Fig. 6 shows the results of the ZMD test for the DJI, CAC, and HSI.The equivalent visualizations for the other tests and data series are included in the supplementary material.
We first observe that the asymmetric H S 2 (a u ) model consistently produces more accurate forecasts -as measured by lower proportions of null rejections -than the symmetric H S 1 (a u ).This is true for both the aggregated results in Table II and when the two models are compared at the same threshold level a u in Table III.In the left-tail, this advantage is mostly restricted to the lower coverage bands; specifically, to a q ≤ 0.05 in the UC and DQ 4 tests, to a q ≤ 0.075 in the CC test, and to the full range, a q ≤ 0.15, in the ZMD test.In the right-tail, H S 2 (a u ) tends to hold an advantage over H S 1 (a u ) at all values of a q in the UC, CC, and ZMD tests, but this advantage is larger in lower coverage bands and there is no consistent trend for the DQ 4 test.It can be taken from this that there is a demonstrable gain in predictive power when asymmetries are incorporated within the 2T-POT Hawkes model, and this gain is strongest for the most extreme future events.This adds further empirical evidence for the role of these asymmetries -including those that describe the temporal leverage effect -in the data generating process for extreme daily log-returns in high-cap equity indices.II, it is observed for the left-tail that increasing model complexity does tend to yield more accurate forecasts, especially within the lowest coverage level band, 0 < a q ≤ 0.025.This shows that each of the additional features -the leptokurtic innovation distribution [G S 0 (0)], the leverage effect in the variance dynamics [G S 1 (0)], and the asymmetric GP tails [G S 1 (a u )] -capture significant aspects of the generating process for left-tail extremes.However, for the right-tail extremes, the simplest G N 0 (0) model outperforms the two more complex standard (a u = 0) GARCH models in all of the tests based on the conditional quantile (UC, CC, and DQ 4 ) within this lowest coverage band.This is because the Studentt distribution used for the innovations t in G S 0 (0) and G S 1 (0) overestimates the heaviness of the right-tail and so provides a worse approximation than the normal distribution.This problem is to be expected when a symmetric innovation distribution is fitted to the full distribution of observations in one step, since the fit to the right-tail is compromised by the significantly heavier left-tail.In contrast, the asymmetric GP tails of the GARCH-EVT model are fitted to each tail independently, and so the negative impact on the forecasting accuracy of G S 1 (a u ) in both tails is mitigated, especially at the best performing threshold level in Table III.This independence demonstrates a key advantage of conditional EVT methods to forecasters of extreme events.

Now considering the GARCH-type models in Table
We now compare the asymmetric 2T-POT Hawkes model [H S 2 (a u )] with the GARCH-EVT model [G S 1 (a u )], focusing first on the results for the UC and CC tests.We reiterate that these are simple convergence tests that verify whether the fraction of conditional quantile violations in the out-of-sample period is equal to the coverage level a q (UC test) and also if the series Îaq,t is serial independent over a single time step (CC test).These tests therefore provide the most basic measures of overall accuracy and serial independence for the next step ahead forecasts of the conditional quantile Qaq,t .For the threshold aggregated UC and CC test results in Table II, H S 2 (a u ) is by far the best performing model within the lowest left-tail coverage band (a q ≤ 0.025) and in all right-tail coverage bands (a q ≤ 0.15).Looking at the equivalent threshold disaggregated results in Table III, we see that, for the left-tail, G S 1 (0.2) and H S 2 (0.05) stand out as the first-and second-best performing models within the lower half of coverage bands, a q ≤ 0.075.Over the same range of coverage bands in the right-tail, H S 2 (a u ) consistently outperforms G S 1 (a u ) irrespective of threshold selection, though H S 2 (0.2) is by far the best performing.
To better understand the summarized UC and CC test results in Tables II and III, we examine the visualization of the full results of the UC test in Figs. 4 and 5.The equivalent visualizations for the CC test (which can be found in the supplementary material) display very similar features, meaning that the following observations are robust with respect to lag-1 violation independence.A consistent tuning relationship is observed in the lefttail for the 2T-POT Hawkes models across the six indices, wherein there is an approximately linear relationship between coverage level a q and the optimal threshold level a u for forecasting accuracy at that coverage level, as observed by the highest average p-values and, accordingly, by the lowest density of p < 0.05 null hypothesis rejections within the (a q , a u )-space.For the DAX, NKX, and HSI, this trend is flatter, such that very accurate forecasts are produced across almost the full range of coverage levels a q at a single (low) value of the threshold level a u .This is reflected in Table III, where H S 2 (0.05) is the best performing Hawkes model up to a q ≤ 0.125.This predictable and stable dependency of forecasting accuracy on threshold selection is ideal for practical applications, since it means that the correctly tuned 2T-POT Hawkes model should produce reliably accurate forecasts of the left-tail conditional quantile Q← aq,t across a broad range of coverage levels a q .In contrast, the pattern of forecasting accuracy for G S 1 (a u ) in the left-tail is much more unpredictable.Phases of poor forecasting accuracy -as identified by rejection of the null hypothesis -show erratic dependence on threshold selection and instead tend to concentrate within ranges of coverage levels a q that are different for each of the indices.Once consistent feature (observed in all but the HSI data) is that G S 1 (a u ) produces a phase of poor forecasting in the left-tail when both a q and a u are at their lowest.Consequently, the forecasting accuracy of the GARCH-EVT model with respect to the left-tail conditional quantile Q← aq,t (i.e.valueat-risk) at the most extreme coverage levels (a q ≤ 0.025) is much more sensitive on threshold selection than that of the 2T-POT Hawkes model; hence it is the latter that has the lowest proportion of null hypothesis rejections in the corresponding rows of Table II.From this, we infer that the 2T-POT Hawkes model is more reliably accurate than the GARCH-EVT model in this most crucial case.
In the right-tail, the tuning relationship for the 2T-POT Hawkes models is found to be inverted, such that the most accurate forecasts for high coverage levels are achieved when the threshold level a u is low and vice versa.Considering the asymmetries in the arrival parameters reported in Section III B, this likely reflects that less extreme (a q > 0.05) right-tail observations are frequently triggered by the preceding arrivals of more extreme (a u < 0.05) left-tail observations.Despite this unexpected pattern, the right-tail forecasting accuracy of H S 2 (a u ) retains some of the practical advantages compared to G S 1 (a u ) as observed in the left-tail, namely greater consistency between the different indices and approximate tuning relationships that allow for maximally effective forecasting over a wider range of coverage levels a q .Indeed, it is noted that the aggregated right-tail performance of the GARCH-based models in the UC and CC tests in Tables II and III would be significantly worse across all coverage bands but for the outlier that is the NKX data.
Next, we consider the results for the DQ 4 test.As defined in Section IV A 3, this test verifies whether the intensity of conditional quantile violations (and, therefore, the accuracy of conditional quantile forecasts Qaq,t ) are dependent on a set of explanatory series, including lagged values of Qaq,t and Îaq,t .It is expected that the DQ          might be less favourable to the 2T-POT Hawkes model due to its inherently less smooth response to extreme logreturns.Violations at time t − 1 ( Îaq,t−1 = 1) are more likely to also be exceedance events within N ↔ than are non-violations ( Îaq,t−1 = 0).This would tend to cause larger increases in the forecast conditional quantile Qaq,t at time t due to the jump-like response of the Hawkes model to exceedances, which might then cause a negative bias in the coefficient φ 1 in Eq. ( 34).In the threshold aggregated DQ 4 test results (Table II), we find in both tails that H S 2 (a u ) and G S 1 (a u ) share near equal proportions of null hypothesis rejections in the two lowest coverage bands (a q ≤ 0.05), with those for the Hawkes model being slightly lower in the left-tail and slightly higher in the right-tail.For the higher coverage bands (0.05 < a q ), H S 2 (a u ) begins to perform notably worse on this test relative to G S 1 (a u ).In the threshold disaggregated DQ 4 test results (Table III), we again observe the performance of G S 1 (a u ) in the extreme left-tail is much more sensitive on threshold selection compared with H S 2 (a u ).The main implication of these results is that, at the 5% coverage level or lower, the left-and right-tail conditional quantile forecasts Q aq,t produced by the 2T-POT Hawkes model show no greater conditional bias than the forecasts produced by the GARCH-EVT model.This is an important point for practical forecasting applications, where such conditional biases might be correlated with risk, and so might inflict a disproportionate penalty.When taken with the UC and CC test results, this shows that the 2T-POT Hawkes model presents a superior alternative to GARCH-EVT for these forecasts.
Finally, we examine the results for the ZMD test, which are summarized in Tables II and III and are visualized in Fig. 6.The ZMD test is the only test that measures the accuracy of the conditional violation expectation forecasts Êaq,t ; consequently, it places more emphasis on the accuracy of the tail distribution -especially at the extremities -compared to the other tests.This is evident when comparing H S 2 (a u ) to H S 1 (a u ): the proportions of null hypothesis rejections under the former are universally lower in both tails because of the significant asymmetries found in the shape parameter ξ of the GP tail distributions (see Fig. 7 in Appendix B).In comparison to the UC test visualized in Figs. 4 and 5, the ZMD test results visualized in Fig. 6 show much weaker sensitivity on the threshold level a u in all cases.It is also apparent that the test is affected by low sample size at the smallest values of a u .Indeed, there is sometimes no defined value for the test statistic at the lowest values of a q because there are not enough violations at this coverage level within the out-of-sample period to perform the circular block bootstrap.Notably, this is only observed in the right-tail -indicating an asymmetric bias in all models that results in a systemic overestimation of the more extreme (a q ≤ 0.05) right-tail conditional quantiles Q→ aq,t (and perhaps also a systemic underestimation of Q← aq,t within this range as well).
For the threshold aggregated left-tail results in Table II, ZMD is the test in which the performance of the 2T-POT Hawkes model is strongest relative to the GARCH-EVT model.The two models perform almost equally well in the lowest band (a q ≤ 0.025), the H S 2 (a u ) is then the best performing model by far in the other coverage bands.Conversely, in the right-tail, G S 1 (a u ) is found to have slightly lower proportions of null rejections than H S 2 (a u ) in 5 out of 6 coverage bands; this is the opposite of what was observed in the UC and CC tests, where the latter held a clear advantage.One explanation is that tests are technically performed over different samples: the UC and CC tests (which examine the conditional quantile forecasts Qaq,t ) consider all observations in X t , whereas the ZMD test only considers observations that are also violations ( Îaq,t = 1).In particular, we note that the right-tail results in the lowest coverage band (a q ≤ 0.025) are affected by the failure of the circular block bootstrap observed in Fig. 6.Nevertheless, these results show that the 2T-POT Hawkes model presents a very competitive forecasting tool for financial risk analysts to consider.

V. SUMMARY AND CONCLUDING REMARKS
We have developed the 2T-POT Hawkes model of [34] as a conditional EVT model adapted for the forecasting of conditional quantile-based risk measures in both the left and right tails of the same real univariate discrete time series X t .Our reparameterization of the exceedance model in terms of the expected average intensity a λ more than halved the optimization time, and, when constrained by the threshold level a u , achieved a dimension reduction of 1 at a negligible cost to the goodness of fit, as verified by in-and out-of-sample likelihood ratio tests.The resulting improvements to the speed and reliability of the optimization procedure enabled us to greatly expand our application of the 2T-POT Hawkes model to historic market data: (i) the model was fitted to the daily log-returns of six international large cap equity indices over the in-sample period, 1975-01-01 to 2015-01-01; (ii) we independently fitted the model to each series using a wide range of exceedance thresholds, from the 1.25% to 25.00% mirrored in-sample quantiles.The significant asymmetries in the estimated branching vector γ↔ and the decay constant β reported in [34] were reproduced here and were quantitatively similar across the six indices; moreover, these asymmetries were found to be stable over the majority of the tested threshold levels a u .This adds further empirical support for a temporal leverage effect in which the impact of losses is not only greater but also more immediate.
By introducing a subordinate bulk distribution that is conditional upon the Hawkes exceedance process, we extended the support of the 2T-POT Hawkes model to the full distribution of X t ; this guaranteed that forecasts of conditional quantile-based risk measures were always defined at all coverage levels, a q ∈ [0, 1].The fully supported 2T-POT Hawkes model was used to produce next step ahead forecasts of the left-and right-tail conditional quantiles Q aq,t (value-at-risk) and the conditional violation expectations Ê aq,t (expected shortfall).The accuracy and serial independence of these forecasts in the out-of-sample period, 2015-01-01 to 2022-09-10, were assessed through backtesting methods; these results were compared with those for the symmetric 2T-POT Hawkes model and for a set of GARCH models that included GARCH-EVT.This greatly expanded upon similar analysis in previous literature [29,[39][40][41]: both by extending the analysis to the right-tail and by evaluating forecasts over a much wider and finer range of coverage levels, from 0.25% to 15.00%. 1 (a u )] confirms that the incorporation of leftright asymmetries provides a demonstrable increase in predictive power, which adds further empirical support to the significance of the temporal leverage effect.The comparison with the GARCH-EVT model indicates that asymmetric Hawkes-type arrival dynamics provide a better approximation of the true data generating process for extreme log-returns within large cap equity indices than GARCH-type variance dynamics.This successfully extends the foundational principle of extreme value theory -that 'extreme events should speak for themselves' -to the conditional case.
In future work, the comparison between Hawkes-and GARCH-based EVT methods could be extended to include time inhomogeneous processes.These could be based, for example, on the MSGARCH process developed in [54], which describes a single GARCH process that stochastically switches between multiple regimes according to a Markov transition matrix.If developed, a regime switching variant of the 2T-POT Hawkes model could identify whether the asymmetries between the tails reported here are constant or change under different market states.In order to extend the comparison of conditional EVT methods done in this paper to the like-for-like time inhomogeneous case, a novel merger of the MSGARCH model of [54] and GARCH-EVT model of [4] could be developed.Other possible future developments of this work include the extension of the analysis to multi-step ahead aggregate forecasts of the same quantile-based risk measures.This is a natural application for the 2T-POT Hawkes model, since accurate single-step forecasts for one tail would have a compounding impact on the multistep aggregate forecasts of both tails.To further improve forecasting accuracy, the 2T-POT Hawkes model could also incorporate explanatory variables as sources of nonconstant exogenous intensity.Candidates relevant to the equity indices used here include volatility indices such as the CBOE Volatility Index (VIX).Finally, possible generating mechanisms for the universal asymmetries observed across the six indices may be explored by fitting the 2T-POT Hawkes model to the output of relevant heterogeneous agent-based models (hABMs).

ACKNOWLEDGMENTS
An earlier version of this paper was uploaded to the arXiv preprint server under the title "2T-POT Hawkes model for dynamic left-and right-tail quantile forecasts of financial returns: out-of-sample validation of self-exciting extremes versus conditional volatility".We thank the Associate Editor of the International Journal of Forecasting and the two anonymous referees for their feedback on our original submission.Their comments greatly helped us improve the clarity of our arguments and the presentation of our results.M.F.T. acknowledges support from EPSRC (UK) Grant No. EP/R513155/1 and CheckRisk LLP.In the univariate Hawkes process, λ = µ + γχ, each mother event spawns γ daughter events on average.If the original (0-th) generation of events is triggered by the exogenous background intensity µ and every subsequent generation by the endogenous excitement produced by the preceding generation, then the n-th generation events will arrive with an average intensity γ n µ.Thus, by summation, the expected average intensity for the whole process is In Eq. (A1), the infinite sum is the average number of events in a full lineage.Equivalently, it may be said that 1 − γ is the proportion of all events that are in the 0-th generation, hence In the multivariate case, λ = µ + Γχ, the full lineage of the branching matrix is where Q and D are the matrices of eigenvectors and eigenvalues of Γ, respectively, and I is the identity matrix.By analogy with Eq. (A1), and, therefore, The common intensity model is implemented as a constrained case of the bivariate process, Appendix B: Model estimation and selection The log-likelihood of the 2T-POT Hawkes exceedance model under the parameters θ u over the data X 0:T = {X t |t ∈ Z ∪ [0, T )} can be expressed as the sum over the tails The log-likelihood components for each tail can be further separated: where are the log-likelihood components of the arrivals process and of the conditional GP tail distributions, respectively.The estimated exceedance model parameters θu are then estimated by ML estimation using the SLSQP method in SciPy [46].Any unconstrained parameters of the parametric bulk distribution (denoted by the vector θ D B ) are obtained in a second step, through maximization of the log-likelihood Fig. 7 shows the estimated parameters of the H S 2 (a u ) model as a function of a u .The standard errors of the estimated parameters are obtained by finite difference approximation of the Hessian matrix.
Alternative parameterizations of the 2T-POT Hawkes model are compared through likelihood ratio tests in order to select the appropriate parameterization for the analysis in Section IV.At almost all values of a u for all indices, Table IV shows no significant difference between the goodness of fit of the common intensity model compared with the bivariate model.This greatly expands upon the same finding in [34], which only tested the SPX at a u = 0.025.This suggest that a common intensity arrivals process can be universally assumed for the extreme daily log-returns of large cap stock indices.Table V shows that the a ↔ λ = 2a u d −1 t constraint produces a negligible cost to the goodness of fit in all but 1 case out of 120 in the in-sample data.H S 2 (au) parameter estimates (lines) and standard errors (shaded areas) calibrated over the in-sample period, 1975-01-01 to 2015-01-01.The left-and right-tail components of the vector parameters are shown in light orange and dark blue, respectively.Vertical axes (except for those on the row describing the estimated GP shape parameter vector ξ) are displayed on a log-scale.

Figure 1 .
Figure 1.Illustration of the self-exciting 2T-POT Hawkes exceedance model as a branching process.The geometric random walk discrete time series Pt (top panel) is transformed into the series of log-differences, Xt = ∆tln Pt = ln Pt − ln Pt−1 = ln Pt/Pt−1 (second panel from top), with threshold-exceeding "extremes" values marked in bolder shading.The 2T-POT Hawkes exceedance model (third panel) describes the common conditional intensity λ ↔ (light orange and dark blue dashed)of left-and right-tail exceedance events as a linear sum of the endogenous excitements χ generated by the arrival of past left-(light red) and right-tail (dark green) exceedances.This may be understood as a branching process (fourth panel) in which the daughter events in generation n + 1 are spawned from the endogenous intensity produced by mother events in generation n.

Figure 2 .
Figure 2. H2(au = 0.025) Hawkes exceedance model fitted to the DAX daily log-returns (top panel).The bottom panels show the KS test performed on the in-sample arrival process in time (bottom-left panel, H0: exceedance arrivals are Poisson in time t), the in-sample arrivals process in residual time (bottom-middle panel, H0: exceedance arrivals are unit Poisson in residual time t ), and the residual excess magnitudes (bottom-right panel, H0: residual excess magnitudes m k are unit exponential random variables): the grey shaded areas show the 95% (lighter) and 99% (darker) KS confidence intervals; the KS p-values for the left-(light orange) and right-tail (dark blue) processes are also shown, with rejections at the 95% confidence level highlighted in bold.The vertical black lines mark the end of the in-sample period.Equivalents of this figure for the other indices are included in the supplementary material.

Figure 3 .
Figure 3. Asymmetries in the Hawkes arrival parameters for H2(au) estimated over the in-sample period, 1975-01-01 to 2015-01-01.Estimated left-(light orange) and right-tail (dark blue) components of the branching vector γ ↔ (first row, thin black line is the combined branching ratio) and decay constant β (third row) are shown above their respective left-over-right component ratios (second and fourth rows).Vertical axes are displayed on a log-scale.

Figure 4 .
Figure 4. Unconditional convergence test p-values pUC at coverage level aq for three of the six indices (see Fig. 5 for the other three) in the out-of-sample period, 2015-01-01 to 2022-09-10.The left-and right-column panels show the results for the left and right tails, respectively, with each panel row giving the results for a particular index: (a) SPX, (b) DAX, and (c) NKX.Within each panel, the p-values under six different models (leftmost labels) are shown: the top three (thick) strips show the p-values under the EVT models [H S 1 (au), H S 2 (au), and G S 1 (au)] as a function of (aq, au); the bottom three (thin) strips show the p-values under the standard GARCH models [G S 1 (0), G S 0 (0), and G N 0 (0)] as a function of aq.

Figure 5 .
Figure 5. Unconditional convergence test p-values pUC at coverage level aq for three of the six indices (see Fig. 4 for the other three) in the out-of-sample period, 2015-01-01 to 2022-09-10.The left-and right-column panels show the results for the left and right tails, respectively, with each panel row giving the results for a particular index: (a) DJI, (b) CAC, and (c) HSI.Within each panel, the p-values under six different models (leftmost labels) are shown: the top three (thick) strips show the p-values under the EVT models [H S 1 (au), H S 2 (au), and G S 1 (au)] as a function of (aq, au); the bottom three (thin) strips show the p-values under the standard GARCH models [G S 1 (0), G S 0 (0), and G N 0 (0)] as a function of aq.

Figure 6 .
Figure 6.Zero mean discrepancy test p-values pZMD at coverage level aq for three of the six indices (see the supplementary material for the other three) in the out-of-sample period, 2015-01-01 to 2022-09-10.The left-and right-column panels show the results for the left and right tails, respectively, with each panel row giving the results for a particular index: (a) DJI, (b) CAC, and (c) HSI.Within each panel, the p-values under six different models (leftmost labels) are shown: the top three (thick) strips show the p-values under the EVT models [H S 1 (au), H S 2 (au), and G S 1 (au)] as a function of (aq, au); the bottom three (thin) strips show the p-values under the standard GARCH models [G S 1 (0), G S 0 (0), and G N 0 (0)] as a function of aq.

Figs.
Figs. S1 to S6 provide a visual summary of the residual analysis for the H 2 (a u = 0.025) Hawkes exceedance model fitted to each of the six indices.Fig.S3is included in the main paper as Fig.2.
Figure S7.Unconditional convergence test p-values pUC at coverage level aq for three of the six indices (see Fig. S8 for the other three) in the out-of-sample period, 2015-01-01 to 2022-09-10.The left-and right-column panels show the results for the left and right tails, respectively, with each panel row giving the results for a particular index: (a) SPX, (b) DAX, and (c) NKX.Within each panel, the p-values under six different models (leftmost labels) are shown: the top three (thick) strips show the p-values under the EVT models [H S 1 (au), H S 2 (au), and G S 1 (au)] as a function of (aq, au); the bottom three (thin) strips show the p-values under the standard GARCH models [G S 1 (0), G S 0 (0), and G N 0 (0)] as a function of aq.
[48]ed to backtest the next step ahead forecasts of Q aq,t and Ê aq,t in Section IV.The in-sample period is a forty-year span that includes among other notable episodes the 1987 Black Monday crash, the 1997-8 Asian Financial Crisis, the 2000-2002 Tech Bubble crash, and the 2007-8 Global Financial Crisis.The out-of-sample period encompasses more than 92 months of financial data and includes the severe fluctuations caused by the outbreak of the Covid-19 pandemic in March 2020 and by the Russian invasion of Ukraine in February 2022.The data were sourced from the stooq.plonlinedatabase[48];summary statistics are provided in Table 4 test Our asymmetric 2T-POT Hawkes model [H S 2 (a u )] was found to produce the most reliably accurate conditional quantile forecasts Q aq,t in both tails at the 5% coverage level or lower.Within this same coverage range, our model was also found to produce the most accurate forecasts of the lefttail conditional violation expectation Ê← aq,t , though the right-tail forecasts Ê→ aq,t were slightly less accurate than those produced by the GARCH-EVT model [H S 1 (a u )].The comparison with the symmetric 2T-POT Hawkes model [H S TableVIshows that the Studentt distributed bulk achieves a significantly better fit than the normal distribution in most cases.Given these results, we use the common intensity exceedance model with the a ↔ λ = 2a u d −1 t constraint applied and with a Student-t distributed bulk in Sections III and IV.

Table V .
Likelihood ratio test p-values comparing the goodness of fit to threshold exceeding returns of H2(au) with and without the constraint a 1.1E-01 Table VI.Likelihood ratio test p-values comparing the goodness of fit to bulk returns of H N 2 (au) against H S 2 (au).H0 : B H N 2 (au) = B H S 2 (au) .H1 : B H N 2 (au) < B H S 2 (au) .Rejections of H0 at the 95% and 99% confidence levels are highlighted in light and dark grey, respectively.p t LR | In-sample (1975-01-01 -2015-01-01) p t LR | Out-of-sample (2015-01-01 -2022-09-10)