Forecasting Risk Measures Using Intraday Data in a Generalized Autoregressive Score (GAS) Framework

A new framework for the joint estimation and forecasting of dynamic Value-at-Risk (VaR) and Expected Shortfall (ES) is proposed by incorporating intraday information into a generalized autoregressive score (GAS) model, introduced by Patton, Ziegel and Chen (2019) to estimate risk measures in a quantile regression setup. We consider four intraday measures: the realized volatility at 5-min and 10-min sampling frequencies, and the overnight return incorporated into these two realized volatilities. In a forecasting study, the set of newly proposed semiparametric models is applied to 4 international stock market indices: the S&P 500, the Dow Jones Industrial Average, the NIKKEI 225 and the FTSE 100, and is compared with a range of parametric, nonparametric and semiparametric models including historical simulations, GARCH and the original GAS models. VaR and ES forecasts are backtested individually, and the joint loss function is used for comparisons. Our results show that GAS models, enhanced with the realized volatility measures, outperform the benchmark models consistently across all indices and various probability levels.


Introduction
From the perspective of financial risk managers, a risk measure can be considered a map from the space of probability distributions to real numbers. Risk measures can provide banks and financial institutions with specific values of potential losses so that risk managers can adjust their capital reserves against the downside risk. Value-at-Risk (VaR) and Expected Shortfall (ES) are two prevailing measures of financial risk that dominate contemporary financial regulation. VaR provides banks and investment institutions with a loss level that occurs in the worst situation at a given confidence level, and it can be defined as:
$$VaR_t^{\alpha} = \inf\{y : F_Y(y \mid \mathcal{F}_{t-1}) \ge \alpha\},$$
where $F_Y(\cdot)$ is the cumulative distribution function of asset returns $y_t$ over a horizon given the information set $\mathcal{F}_{t-1}$, and $\alpha \in (0, 1)$ is a given significance level. As a quantile, VaR can be expressed directly in terms of the inverse cumulative distribution function, $VaR_t^{\alpha} = F_Y^{-1}(\alpha \mid \mathcal{F}_{t-1})$, and as a risk measure, it has the advantage of being intuitive and easily understood.
However, VaR has inherent deficiencies as it ignores the shape and structure of the tail and is not a coherent risk measure in the sense of Artzner et al. (1999). Thus, after the financial crisis of 2007-08, the Basel Committee on Banking Supervision proposed a transition from VaR with a confidence level of 99% to ES with a confidence level of 97.5% (Basel Committee on Banking Supervision, 2013). ES is the expectation of returns, conditional on their realization lying below VaR, and it can be defined as:
$$ES_t^{\alpha} = \mathbb{E}\left[y_t \mid y_t \le VaR_t^{\alpha}, \mathcal{F}_{t-1}\right].$$
ES is a coherent risk measure (Roccioletti, 2015), and it can be considered a perfect substitute for VaR in risk management applications. Normally, ES is estimated via a two-stage approach based on VaR estimation. Whilst ES is itself not elicitable, Fissler, Ziegel and Gneiting (2016) have shown that the pair $(VaR_t^{\alpha}, ES_t^{\alpha})$ is elicitable; see also Acerbi and Székely (2014). This means that ES can be estimated jointly with VaR by minimizing a loss function (Ziegel, 2016).
Following the classification of Engle and Manganelli (2004), the current literature on estimating and forecasting risk measures can be divided into three main categories: parametric, nonparametric and semiparametric models. Previous studies using parametric models to predict VaR and ES assume that financial returns follow a certain distribution, such as the standard normal (Gaussian) distribution. In reality, however, such strong assumptions are hardly reasonable. Nonparametric models do not make assumptions about the distribution of financial returns, and have the advantage of being model free. While such models need no distributional assumption, an inherent problem is the difficulty in finding the optimal size of the estimation window (Engle and Manganelli, 2004).
Semiparametric models impose a parametric structure on the dynamics of VaR and ES through their relationship with lagged information, but require no assumptions on the conditional distribution of financial returns (Patton, Ziegel and Chen, 2019).
Quantile regression, as an approach for estimating risk measures, has only recently been considered: Engle and Manganelli (2004) extend the basic quantile regression model to conditional autoregressive value at risk (CAViaR) models; these models focus solely on the estimation of VaR, and it is not obvious how they can be used for ES estimation. In order to estimate ES jointly with VaR in a semiparametric framework, Taylor (2008) proposes conditional autoregressive expectile (CARE) models, based on a simple function of expectiles. Following this, Taylor (2019) combines quantile regression with maximum likelihood estimation based on the asymmetric Laplace density proposed by Koenker and Machado (1999), and estimates VaR and ES jointly. A growing literature documents a significant improvement in VaR and ES estimation in a quantile regression framework (Halbleib and Pohlmeier, 2012; Žikeš and Baruník, 2014; Wang and Zhao, 2016; and Bayer, 2018).
Following the results of , Patton, Ziegel and Chen (2019) present several novel dynamic models for the joint estimation of VaR and ES. Specifically, they propose four dynamic semiparametric models for VaR and ES, based on the generalized autoregressive score (GAS) framework introduced by Creal, Koopman and Lucas (2013). This framework has been successfully applied in risk measure estimation (Patton, Ziegel and Chen, 2019); CDS spread modelling (Lange et al., n.d.; and Oh and Patton, 2018); systemic risk modelling (Cerrato et al., 2017; Eckernkemper, 2017; and Bernardi and Catania, 2019); and high-frequency data modelling (Gorgi et al., 2018; and Lucas and Opschoor, 2018). However, no studies on risk measures have so far incorporated realized volatilities into the GAS framework. This prompted the research question of this paper, namely whether adding intraday measures of volatility into the GAS framework improves the accuracy of joint VaR and ES forecasts.
The question whether intraday data can improve the predictive accuracy of risk measures has already been addressed by academics. Several studies extend quantile regression methods and other semiparametric models by using information variables generated from high-frequency data. Many realized volatility measures have been confirmed to perform efficiently. The realized volatility proposed by Andersen and Bollerslev (1998) and Alizadeh et al. (2002) is one of the most widely used intraday volatility measures. Inspired by Engle and Manganelli (2004), Fuertes and Olmo (2013) propose a conditional quantile forecast method combining an effective device to deal with the inter-daily/intra-daily information. Meng and Taylor (2018) extend the CAViaR model and the Quantile Regression HAR model with realized volatility, overnight return and intraday range. In terms of ES estimation, the CARE models of Taylor (2008) have been extended to allow intraday measures as explanatory variables (Gerlach and Chen, 2014; Gerlach and Wang, 2016a; Gerlach and Chen, 2017; and Wang et al., 2018). While the improvement from adding intraday variables into a semiparametric framework has been widely documented, evidence on using the score-driven model as the framework to estimate risk measures remains scarce. Therefore, in our study motivated by Salvatierra and Patton (2015), the first contribution is that we extend the set of semiparametric GAS models of Patton, Ziegel and Chen (2019), namely the two-factor GAS model, the one-factor GAS model, the GARCH-FZ model, and the hybrid GAS/GARCH model, to investigate whether realized measures can improve the predictive accuracy of GAS models. This study is the first to estimate and forecast VaR and ES jointly by using intraday data in a GAS framework.
We shed light on the potential improvement in risk forecasting from adding intraday information in the GAS framework for four stock indices using a long forecasting period (that includes the financial crisis period). Then we perform a thorough analysis to compare our forecasts with those generated from prevailing benchmarks in the current literature. Our results show that GAS models incorporating intraday data outperform the other (VaR, ES) forecasts in most cases.
Thus, our second contribution to the literature is that we provide empirical evidence that semiparametric models enhanced with realized volatility measures outperform other benchmark models via various backtesting methods. Our proposed models, especially the GAS-2F model, extended with realized volatilities dominate other benchmarks consistently. Thirdly, we compare four different types of realized measures with regard to their forecasting ability for risk measures, when added to GAS models.
The paper is structured as follows: Section 2 briefly introduces the new GAS models that incorporate intraday information; the data used in our empirical study and the in-sample estimation results are presented in Section 3; Section 4 presents the forecasting study and backtesting results; and finally, Section 5 concludes the paper.

GAS models for VaR and ES
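Several extensions of the GAS models introduced by Creal, Koopman and Lucas (2013) are proposed in Patton, Ziegel and Chen (2019), which can be estimated by minimizing the loss function of  called FZ0:
$$L_{FZ0}(Y, v, e; \alpha) = -\frac{1}{\alpha e}\,\mathbb{1}\{Y \le v\}(v - Y) + \frac{v}{e} + \log(-e) - 1, \tag{1}$$
where $Y$ denotes the daily return, $v$ and $e$ represent the values of VaR and ES, respectively, and $\mathbb{1}\{\cdot\}$ is an indicator function which returns 1 when $Y \le v$ (i.e., the VaR is exceeded), otherwise it returns zero. Patton, Ziegel and Chen (2019) propose four models to estimate VaR and ES jointly by minimizing the loss function FZ0: the two-factor GAS model, the one-factor GAS model, the GARCH-FZ model, and the hybrid GAS/GARCH model. The key novelty in their framework is the use of the scaled score (computed as the first-order derivative of the objective function; normally the objective function is a probability density function, but here the loss function FZ0 acts as the objective) to drive the time variation in the target parameter. Patton, Ziegel and Chen (2019) present a "news impact curve" to show the impact of past observations on current forecasts of VaR and ES through the score variable. When $Y > v$, the realized returns do not affect the estimation. But when $Y \le v$, forecasts of ES and VaR react to realized returns through the score variable. The GAS-FZ models are specified as below:
(1.A) One-factor GAS model (GAS-1F):
$$\begin{aligned} v_t &= a \exp\{\kappa_t\}, \\ e_t &= b \exp\{\kappa_t\}, \quad b < a < 0, \\ \kappa_t &= \omega + \beta \kappa_{t-1} + \gamma H_{t-1}^{-1} s_{t-1}, \end{aligned} \tag{2}$$
where the score variable $s_t$ is defined as:
$$s_t = \frac{\partial L_{FZ0}(Y_t, v_t, e_t; \alpha)}{\partial \kappa_t} = -\frac{1}{e_t}\left(\frac{1}{\alpha}\,\mathbb{1}\{Y_t \le v_t\}\,Y_t - e_t\right), \tag{3}$$
and the Hessian factor $H_t$ is set to one for simplicity;
(1.B) Two-factor GAS model (GAS-2F):
$$\begin{bmatrix} v_t \\ e_t \end{bmatrix} = w + B \begin{bmatrix} v_{t-1} \\ e_{t-1} \end{bmatrix} + A \begin{bmatrix} \lambda_{v,t-1} \\ \lambda_{e,t-1} \end{bmatrix}, \tag{4}$$
where $w$ is a (2×1) vector, $A$ is a (2×2) matrix, and $B$ is defined as a diagonal matrix for parsimony, and the forcing variables are
$$\lambda_{v,t} = -v_t\left(\mathbb{1}\{Y_t \le v_t\} - \alpha\right), \tag{5}$$
$$\lambda_{e,t} = \frac{1}{\alpha}\,\mathbb{1}\{Y_t \le v_t\}\,Y_t - e_t; \tag{6}$$
(1.C) GARCH-FZ model (GARCH-FZ):
$$v_t = a\,\sigma_t, \quad e_t = b\,\sigma_t, \quad b < a < 0, \qquad \sigma_t^2 = \omega + \beta \sigma_{t-1}^2 + \gamma Y_{t-1}^2, \tag{7}$$
where $\sigma_t^2$ is the conditional variance and is assumed to follow a GARCH(1,1) process. The parameters of this model are estimated by minimizing the loss function FZ0 in (1), instead of using (Q)MLE.
(1.D) A hybrid GAS/GARCH model (Hybrid):
$$v_t = a \exp\{\kappa_t\}, \quad e_t = b \exp\{\kappa_t\}, \quad b < a < 0, \qquad \kappa_t = \omega + \beta \kappa_{t-1} + \gamma s_{t-1} + \delta \log|Y_{t-1}|, \tag{8}$$
where the variable $\kappa_t$ is the log-volatility, described by the one-day lagged log-volatility, score factor and the logarithm of absolute return.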
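As a minimal sketch of how the FZ0 loss and the one-factor recursion fit together, the Python fragment below evaluates the loss for one observation and filters VaR and ES through time. It assumes the FZ0 form and the score of Patton, Ziegel and Chen (2019) with the Hessian factor set to one; the function names, the starting value `kappa0` and any parameter values passed in are illustrative assumptions, not estimates from this study.

```python
import math


def fz0_loss(y, v, e, alpha):
    """FZ0 loss for one observation; expects e <= v < 0."""
    hit = 1.0 if y <= v else 0.0
    return -hit * (v - y) / (alpha * e) + v / e + math.log(-e) - 1.0


def gas_1f_filter(returns, omega, beta, gamma, a, b, alpha, kappa0=0.0):
    """One-factor recursion: returns (VaR path, ES path, average FZ0 loss).

    kappa0 is a hypothetical starting value for the log-volatility factor.
    """
    kappa = kappa0
    var_path, es_path, total_loss = [], [], 0.0
    for y in returns:
        v = a * math.exp(kappa)
        e = b * math.exp(kappa)
        var_path.append(v)
        es_path.append(e)
        total_loss += fz0_loss(y, v, e, alpha)
        hit = 1.0 if y <= v else 0.0
        score = -(hit * y / alpha - e) / e  # score with the Hessian factor set to one
        kappa = omega + beta * kappa + gamma * score
    return var_path, es_path, total_loss / len(returns)
```

On days without a VaR exceedance the score equals one regardless of the return, which is the sense in which realized returns above the VaR do not affect the update.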

Realized measures
This section provides a brief introduction to the intraday realized measures (RM) used in this study. The most popular measure is the realized volatility (RV), where $RV_t$ denotes the realized volatility calculated from the sum of $M$ intraday squared returns, at frequency $\Delta$, within day $t$. Here, the intraday frequency $\Delta$ divides the whole span of market opening hours $S$ into $M$ equal intervals, and $P_{t,i\cdot\Delta}$ denotes the log price at time $i\cdot\Delta$ of day $t$. However, the realized volatility ignores the information from the market overnight return, which is computed from $P_{t,0}$ and $P_{t-1,S}$, the opening price on day $t$ and the closing price on the previous day, respectively. Several studies have shown that incorporating the overnight return can lead to a more accurate realized measure. In this paper, we consider the approach of incorporating the overnight return in the realized volatility of Blair, Poon and Taylor (2001), Hua and Manzan (2013) and Meng and Taylor (2018). In the following, we use frequencies of 5 and 10 minutes and, for simplicity, the notations RV5, RV10, RN5 and RN10, where, for example,
$$RN10_t = (RV10_t)^2 + (\text{overnight}_t)^2. \tag{12}$$
In the next section, RM can signify any of these four realized measures of volatility.
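The ingredients of these realized measures can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the overnight return is taken to be the log return from the previous close to the open, the first function returns the plain sum of squared intraday log returns (taking its square root to move to the volatility scale is left to the caller), and the function names are hypothetical.

```python
import math


def sum_squared_intraday_returns(log_prices):
    """Sum of squared intraday log returns for one day, prices sampled every Delta."""
    return sum((b - a) ** 2 for a, b in zip(log_prices, log_prices[1:]))


def overnight_return(open_price, previous_close):
    """Log return from the previous close to today's open (assumed definition)."""
    return math.log(open_price / previous_close)


def rn_measure(rv, overnight):
    """Combine a realized volatility with the overnight return as RV^2 + overnight^2."""
    return rv ** 2 + overnight ** 2
```

With 5-minute sampling, `log_prices` would hold one log price per 5-minute mark of the trading day; with 10-minute sampling, every second one.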

GAS models for VaR and ES with realized measures
Salvatierra and Patton (2015) propose a GAS model enhanced with high-frequency measures to obtain a GRAS model, in which the equation for the dependence parameter, similar to the last row of (2), is replaced with:
$$\kappa_t = \omega + \beta \kappa_{t-1} + \gamma H_{t-1}^{-1} s_{t-1} + c \log(RM_{t-1}). \tag{13}$$
They use the realized covariance as $RM_t$, computed from the intraday prices $P_{t,i\cdot\Delta}$ of a set of assets. The authors find that the inclusion of the 5-minute realized covariance significantly improves the in-sample fit and out-of-sample forecasts of the copula models. Motivated by the set of GAS models and the GRAS model, our new models are proposed as:
(2.A) One-factor GAS model with realized measures (GAS-1F-Re):
$$v_t = a \exp\{\kappa_t\}, \quad e_t = b \exp\{\kappa_t\}, \quad b < a < 0,$$
where $\kappa_t$ is defined in (13), and the score variable $s_t$ is defined in (3). Here, the Hessian factor $H_t$ is set to one for simplicity; $\log(RM_t)$ is the logarithm of a realized measure, which can be the realized volatility at 5-min and 10-min sampling frequencies (RV5 and RV10), or these two realized volatilities with the overnight return incorporated into them (RN5 and RN10), as defined in (12).
(2.B) Two-factor GAS model with realized measures (GAS-2F-Re):
$$\begin{bmatrix} v_t \\ e_t \end{bmatrix} = w + B \begin{bmatrix} v_{t-1} \\ e_{t-1} \end{bmatrix} + A \begin{bmatrix} \lambda_{v,t-1} \\ \lambda_{e,t-1} \end{bmatrix} + C \log(RM_{t-1}),$$
where $w$ and $C$ are (2×1) vectors, $A$ and $B$ are both (2×2) matrices, and $B$ is defined as a diagonal matrix to simplify computation. Following Patton, Ziegel and Chen (2019), we also define the forcing variables $\lambda_{v,t}$ and $\lambda_{e,t}$ from the partial derivatives of the given loss function $L_{FZ0}$ with respect to $v_t$ and $e_t$, as in (5) and (6).
(2.C) GARCH-FZ model with realized measures (GARCH-FZ-Realized): Hansen, Huang and Shek (2012) and Hansen, Lunde and Voev (2014) introduce a new framework, Realized (Beta) GARCH, where the variance follows a GARCH(1,1) process, with the squared returns replaced with a realized measure of volatility. Following this model, we propose a GARCH-FZ-Realized model, where the daily return $Y_{t-1}$ in the GARCH(1,1) variance equation in (7) is replaced with the realized measure $RM_{t-1}$. This model is estimated by minimizing the FZ0 loss function.
(2.D) A hybrid GAS/GARCH model with realized measures (Hybrid-Re):
$$v_t = a \exp\{\kappa_t\}, \quad e_t = b \exp\{\kappa_t\}, \quad b < a < 0, \qquad \kappa_t = \omega + \beta \kappa_{t-1} + \gamma s_{t-1} + c \log(RM_{t-1}) + \delta \log|Y_{t-1}|,$$
where the log-volatility $\kappa_t$ follows the hybrid GARCH model with one-day lagged log-volatility, score factor, realized measures and absolute daily return.
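A one-step update of the two-factor model with a realized measure can be sketched as follows. This is a hedged illustration rather than the estimation code of the study: it assumes the forcing variables of Patton, Ziegel and Chen (2019) and that the realized measure enters each equation additively through the logarithm of its one-day lag; the function names and argument layout are hypothetical.

```python
import math


def forcing_variables(y, v, e, alpha):
    """Forcing variables (lambda_v, lambda_e) driving the VaR and ES updates."""
    hit = 1.0 if y <= v else 0.0
    lam_v = -v * (hit - alpha)
    lam_e = hit * y / alpha - e
    return lam_v, lam_e


def gas_2f_re_step(v_prev, e_prev, y_prev, rm_prev, w, b_diag, A, C, alpha):
    """Next-day (VaR, ES) from yesterday's values, return and realized measure.

    w, C: length-2 sequences; b_diag: diagonal of B; A: 2x2 nested sequence.
    The additive C * log(RM) term is an assumed specification.
    """
    lam_v, lam_e = forcing_variables(y_prev, v_prev, e_prev, alpha)
    log_rm = math.log(rm_prev)
    v = w[0] + b_diag[0] * v_prev + A[0][0] * lam_v + A[0][1] * lam_e + C[0] * log_rm
    e = w[1] + b_diag[1] * e_prev + A[1][0] * lam_v + A[1][1] * lam_e + C[1] * log_rm
    return v, e
```

Setting `C = (0.0, 0.0)` recovers a step of the two-factor model without realized measures, which makes the two specifications easy to compare on the same data.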

Data description
To evaluate the forecasting performance of the new models and to compare them with benchmark models, we collected daily opening and closing prices of four international stock market indices: the S&P 500 (US); Dow Jones Industrial Average; NIKKEI 225 (Japan) and FTSE 100 (UK), from January 2000 to June 2019, from DataStream. To ensure the applicability of every model, we remove market-specific non-trading days and exactly zero returns from each index series. Panel A in Table 1 presents the summary statistics on the four daily equity return series over the full sample period. From the top panel, average annualized returns range from 0.544% for the NIKKEI 225 to 4.377% for the DJIA, and the annualized standard deviation ranges from 18% for the DJIA to about 24% for the NIKKEI 225. All daily return series exhibit substantial kurtosis at around 10. The second and third panels of this table show the sample VaR and ES for four different α levels: 1%, 2.5%, 5% and 10%. The NIKKEI 225 index proves to be different from the rest since its quantile and ES are lower than the sample risk measures of the other three indices.
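For readers who want to reproduce sample risk measures of this kind, the following Python sketch computes an empirical VaR and ES from a return series. The quantile convention (the k-th smallest return with k = floor(α·n), at least one) is an assumption, since the text does not state which convention is used, and the function name is hypothetical.

```python
import math


def sample_var_es(returns, alpha):
    """Empirical VaR (alpha-quantile) and ES (mean of returns at or below it)."""
    ordered = sorted(returns)
    k = max(1, math.floor(alpha * len(ordered)))  # number of tail observations
    var = ordered[k - 1]
    es = sum(ordered[:k]) / k
    return var, es
```

By construction the ES returned here is never above the VaR, matching the ordering of the two measures in the left tail.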
Panel B presents the estimated parameters of the ARMA(p,q) models where the lags (p,q) are optimally selected via the BIC method. The ARMA models for the indices only include a constant except for the S&P 500, which contains an MA term with one lag. Panel C shows the estimated parameters of the GARCH(1,1) model, where the residuals are assumed to follow the skew-t distribution. Panel D presents the degrees-of-freedom and skewness parameters of the skew-t distribution.
The percentage log overnight returns are generated as in (10). For the realized volatility, the data is obtained at 5-min and 10-min sampling frequencies from the Oxford-Man Institute's realized library (see Heber, Lunde and Shephard, 2009). To generate the new realized measure incorporating the overnight return in realized volatility, we use (12).

[ INSERT TABLE 1 ABOUT HERE ]
The full data period is divided into an in-sample period for estimation and an out-of-sample period to backtest the estimated results. We employ a rolling window approach, where each model is re-estimated every five trading days using a rolling window of 2000 observations. The remainder of the sample until June 2019, approximately 2900 days, is the out-of-sample period used to evaluate one-day-ahead VaR and ES forecasts.
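The rolling scheme can be made concrete with a small Python sketch that lists, for every out-of-sample day, which block of observations the parameters in force were estimated on. The indexing convention (0-based, half-open windows, first refit at the first forecast day) and the function name are assumptions made for illustration.

```python
def rolling_schedule(n_obs, window=2000, refit_every=5):
    """Return (fit_start, fit_end, forecast_index) for each out-of-sample day.

    Parameters are re-estimated on observations [fit_start, fit_end) and then
    reused until the next refit.
    """
    schedule = []
    fit_end = window
    for t in range(window, n_obs):
        if (t - window) % refit_every == 0:
            fit_end = t  # refit on the most recent `window` observations
        schedule.append((fit_end - window, fit_end, t))
    return schedule
```

With roughly 4900 observations this gives about 2900 forecast days and about 580 re-estimations per model.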

Forecasting models
VaR and ES are predicted via the score forecast for one trading day ahead in the out-of-sample period for each series, using the proposed GAS-Realized models and the GARCH-FZ-Realized model, as well as nonparametric and parametric models as benchmarks. For nonparametric models, historical simulations are widely used because of their advantages of being model free and easy to implement. In our study, we select three commonly used rolling window sizes to forecast VaR and ES: 125, 250 and 500 days. Two popular GARCH models, the Gaussian (GARCH-G) and the Skew-t (GARCH-Skt), are employed as parametric benchmarks. We also consider other established models that use high-frequency data and are considered to be well suited to forecasting VaR and ES: the HAR model of Corsi, Mittnik, Pigorsch and Pigorsch (2008), and the HEAVY model of Shephard and Sheppard (2010). In each model, we estimate VaR and ES with Gaussian and Skew-t distributions of the errors in the second step, after the conditional volatility estimation. We also include the semiparametric approach of Taylor (2019), based on the asymmetric Laplace distribution, in our benchmark set.
To evaluate the performance of the GAS models enhanced with realized measures, we also implement the four models proposed by Patton, Ziegel and Chen (2019) as benchmarks. Unlike Patton, Ziegel and Chen (2019), who used parameters estimated from a fixed in-sample period, we use a rolling window approach, where each model is re-estimated every five trading days using a window of 2000 trading days. In this study, we consider four sets of GAS models extended with different realized measures: RV5, RV10, RN5 and RN10, defined in (12). In the following sections we compare the forecasting performance of these four sets of extended models, which gives a total of 16 models, with the 13 benchmark models listed above.

In-sample estimation
The parameters of the GAS models and the proposed four sets of GAS-Realized models are estimated by minimizing the loss function in (1). We estimate these models by using a quasi-Newton method and the functions fminsearch and fminunc as optimization algorithms, which are similar routines to the one used by Engle and Manganelli (2004). Because the objective function is non-smooth, these models are hard to estimate, and the optimization algorithms are sensitive to the starting values used in the search. For each model, we first generate $10^5$ vectors of parameters from a uniform random number generator for the parameters of the GAS models. For the parameters used to generate VaR and ES, we set the individual values between -2 and -3, and -3 and -4, respectively, to ensure that ES is always less than VaR. We compute the average loss value for each vector, then select the 10 vectors that generate the lowest average loss as initial values for the optimization routine. The vector producing the lowest loss value is selected as the final initial value of the search algorithm for all windows in order to shorten computational time.
Table 2 presents the estimated parameters of the GAS models for the S&P 500, estimated using an estimation period of 2000 days from the beginning of January 2000 for α = 5%. The parameters of the three two-factor GAS models (GAS-2F, GAS-2F-RV5, and GAS-2F-RN5) are presented in the first panel of Table 2; we separate the parameters of VaR and ES. It is clear that the $b$ parameters are statistically significant for both VaR and ES, which can be explained by the volatility clustering effect. The four columns on the right side of this panel show the parameters of GAS-2F extended with the 5-minute realized measures. After adding the 5-min realized measures, the degree of clustering decreases for VaR and ES. Also, the score parameters $a_v$ and $a_e$ decrease significantly after adding the realized measures.
The parameters of the one-day lagged realized measures $RM_{t-1}$, $c$, are statistically significantly negative for both VaR and ES, indicating that larger values of these realized variables will result in a lower estimated quantile or ES, which is intuitive. The average loss generated by the GAS-2F model is 0.756, which is larger than the loss of the GAS-2F models extended with realized measures (0.735 and 0.734).
The second panel in Table 2 shows the estimated parameters of the other GAS models extended with the 5-minute realized measures using an estimation period of 2000 days from the beginning of January 2000 for the S&P 500, for α = 5%. Similarly to the $b$ parameters of the GAS-2F model, the $\beta$ parameters of the other models are also statistically significant, which means that the current estimated risk measures rely heavily on the previous estimation. Also, we find that the parameters of realized measures ($c$ for the GAS-1F model, the GARCH-FZ model, and the Hybrid model) are all statistically significantly positive. Intuitively, a large realized volatility will lead to a low quantile through the score variable in these models. We find that the inclusion of realized measures in the updating models results in smaller coefficients of the GAS shocks ($\gamma$), which is intuitive. Later, we will see the role that the score variable plays in forecasting VaR and ES.
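The search for starting values described in this section can be sketched in Python as a uniform random draw followed by a ranking on the objective. This is a simplified illustration: the objective is passed in as a function, the bounds for parameters other than the VaR and ES scale parameters are hypothetical, and the subsequent local optimization is not shown.

```python
import random


def select_starting_values(objective, bounds, n_draws=1000, n_keep=10, seed=0):
    """Draw parameter vectors uniformly within bounds and keep the n_keep best.

    Returns a list of (loss, vector) pairs sorted by increasing loss; each
    vector would then seed a local optimizer.
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_draws):
        theta = [rng.uniform(lo, hi) for lo, hi in bounds]
        candidates.append((objective(theta), theta))
    candidates.sort(key=lambda pair: pair[0])
    return candidates[:n_keep]
```

Drawing the VaR scale from (-3, -2) and the ES scale from (-4, -3), as in the text, guarantees that every candidate has ES below VaR before any optimization starts.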

Out-of-sample forecasting and backtesting
We evaluate one-day-ahead VaR and ES forecasts for the four international stock indices, and for the following four probability levels: 1%, 2.5%, 5% and 10%. One-day-ahead VaR and ES forecasts are made with parameter values estimated every 5 days, for each model and probability level, using rolling windows of size 2000 (except for historical simulations). The forecasting sample period for each index is approximately 2900 days. In this section, we backtest the VaR and ES forecasts of the proposed models and compare their performance with that of benchmark models. Firstly, we backtest VaR and ES individually via the Dynamic Quantile (DQ) regression and the Dynamic Expected Shortfall (DES) regression. Following these tests, we employ a method based on the FZ0 loss function to backtest VaR and ES jointly.

Backtesting VaR
The most popular procedures evaluating the performance of VaR forecasts are mainly based on VaR failures, i.e., on the indicator $\mathbb{1}\{Y_t \le VaR_t\}$ that the return falls below the VaR forecast. The commonly used VaR backtesting method, known as the unconditional coverage (UC) test, is proposed by Kupiec (1995), and uses the proportion of failures as its main tool. In this test, the hit percentage is defined as the proportion of the returns below the estimated VaR, then the difference between the hit percentage and its theoretical value of α is examined.
Thus, the VaR model is rejected or not according to the null hypothesis of the UC test below, based on which the Likelihood Ratio (LR) test is performed:
$$H_0: \mathbb{E}\left[\mathbb{1}\{Y_t \le VaR_t\}\right] = \alpha.$$
Table 3 presents the number of model rejections of the above null hypothesis for four daily equity return series, over the out-of-sample period, for the 29 different forecasting models, at 1% and 5% significance levels, respectively, and for different probability levels. To obtain these columns, we perform the unconditional backtest above for all indices, and count the number of rejections for each model.
The third and fourth columns of Table 3 show that the proposed new GAS models extended with realized measures generally tend to have a lower number of UC test rejections as compared to the number of rejections of the GAS-FZ models of Patton, Ziegel and Chen (2019), for α = 1%. The GARCH model and HEAVY model with a skew-t distribution also tend to have a lower number of rejections at 1% significance level. At 5% significance level, several GAS-FZ models with overnight returns incorporated in the realized volatility have zero rejections of the UC test. In general, adding realized measures into GAS models for predicting VaR achieves a lower number of test rejections, based on our results on the hit percentage test.
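The unconditional coverage test can be sketched in a few lines of Python. The sketch uses the standard proportion-of-failures likelihood ratio with a chi-squared(1) reference distribution; the function name and return layout are assumptions made for illustration.

```python
import math


def kupiec_uc_test(returns, var_forecasts, alpha):
    """Return (hit rate, LR statistic, p-value) for the unconditional coverage test."""
    n = len(returns)
    x = sum(1 for y, v in zip(returns, var_forecasts) if y <= v)  # number of failures
    pi_hat = x / n

    def loglik(p):
        # Bernoulli log-likelihood, treating 0 * log(0) as 0
        ll = 0.0
        if x > 0:
            ll += x * math.log(p)
        if n - x > 0:
            ll += (n - x) * math.log(1.0 - p)
        return ll

    lr = -2.0 * (loglik(alpha) - loglik(pi_hat))
    # Survival function of a chi-squared(1) variable at lr
    p_value = math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
    return pi_hat, lr, p_value
```

A model counts as rejected at the 5% level when the returned p-value is below 0.05, which is the quantity tallied per model across the indices.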

[ INSERT TABLE 3 ABOUT HERE ]
However, the UC test has low power in small samples, and several studies criticize it for ignoring the clustering of failures (see Nieto and Ruiz, 2016). To address these drawbacks, the conditional coverage (CC) test is considered, in which the null hypothesis is:
$$H_0: \mathbb{E}\left[\mathbb{1}\{Y_t \le VaR_t\} \mid \mathcal{F}_{t-1}\right] = \alpha.$$
We employ the dynamic quantile (DQ) test proposed by Engle and Manganelli (2004) to implement the CC test. The DQ test has power against the misspecification of ignoring conditionally correlated probabilities and can be extended to examine other explanatory variables. The DQ test is based on the hit variable, defined as $Hit_{v,t} = \mathbb{1}\{Y_t \le VaR_t\} - \alpha$, and examines whether the underlying failure indicator follows an i.i.d. Bernoulli distribution with probability level α and whether it is independent of the VaR estimator; the expected value of $Hit_{v,t}$ is 0. Furthermore, from the definition of the quantile function, the conditional expectation of $Hit_{v,t}$ given any information known at $t-1$ must also be 0, which means that the hit function cannot be correlated with other lagged variables. Also, $Hit_{v,t}$ must not be autocorrelated. If $Hit_{v,t}$ satisfies the conditions stated above, then there will be no autocorrelation in the hits, and no measurement error. In this study, we include one lag of $Hit_{v,t}$ in the regression of the test. Consider the following DQ regression:
$$Hit_{v,t} = a_0 + a_1 Hit_{v,t-1} + a_2 VaR_t + u_{v,t},$$
where $a = [a_0, a_1, a_2]$ is the set of parameters of the regression. Based on the null hypothesis, we test whether all parameters in the set $a$ are zero. Performing this DQ test gives a test statistic, which is distributed $\chi^2(3)$ asymptotically.
The middle panel of Table 4 shows the p-values of the DQ test of VaR forecasts for α = 1%, for the four stock indices. P-values that are greater than 5% indicate no evidence against the optimality at 5% significance level (in bold), and values between 1% and 5% are in italics. For the S&P 500, all of our newly proposed models pass the DQ test at 1% significance level.
When we consider the NIKKEI 225 and FTSE 100 indices, we see significant improvements after adding realized measures in the GAS models. For the DJIA index, fewer models fail the DQ test once realized measures are used, while the historical simulations pass the test and the GARCH model with the skew-t distribution performs well. Moreover, all of the GAS-1F models extended with realized measures are able to pass the DQ test for all four indices. Overall, adding realized measures enables GAS-FZ models to reduce the number of rejections of the DQ test for α = 1%.
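A compact Python sketch of a DQ test with one lagged hit and the contemporaneous VaR forecast as regressors is given below. It uses the usual form of the statistic from Engle and Manganelli (2004), the explained sum of squares of the hit regression divided by α(1 − α), and a closed-form chi-squared(3) tail probability; the helper names are hypothetical and the regressors are assumed not to be collinear.

```python
import math


def solve_3x3(m, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    aug = [row[:] + [r] for row, r in zip(m, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, 3):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, 4):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (aug[i][3] - sum(aug[i][j] * x[j] for j in range(i + 1, 3))) / aug[i][i]
    return x


def chi2_sf_3(x):
    """Upper tail probability of a chi-squared variable with 3 degrees of freedom."""
    x = max(x, 0.0)
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


def dq_test(returns, var_forecasts, alpha):
    """Return (DQ statistic, p-value) using a constant, one lagged hit and VaR_t."""
    hits = [(1.0 if y <= v else 0.0) - alpha for y, v in zip(returns, var_forecasts)]
    rows = [[1.0, hits[t - 1], var_forecasts[t]] for t in range(1, len(hits))]
    target = hits[1:]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * h for r, h in zip(rows, target)) for i in range(3)]
    coef = solve_3x3(xtx, xty)
    stat = sum(c * g for c, g in zip(coef, xty)) / (alpha * (1.0 - alpha))
    return stat, chi2_sf_3(stat)
```

Further lags of the hit variable or other explanatory variables could be added as extra columns, with the degrees of freedom of the reference distribution adjusted accordingly.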

[ INSERT TABLES 4-6 ABOUT HERE ]
For α = 2.5% (see Table 5), we obtain similar results, namely that adding realized measures generally reduces the number of rejections of the DQ test. For the DJIA index, the two-factor GAS model can pass the test after adding realized measures RN5 and RN10. For α = 5%, in Table 6, we can see that all original GAS-FZ models can pass the DQ test across the four indices except the Hybrid model for the S&P 500 index. After adding realized measures in the GAS models, it can be seen that the p-values increase and the DQ test is generally passed. Table 7 presents the number of model rejections at 1% and 5% significance levels for quantile regression VaR backtests across the four markets, for different probability levels. It can be concluded that the set of GAS models extended with realized measures tends to have a lower number of rejections than the original GAS models and several other benchmarks. It should be noted that the four GAS-1F models extended with different realized measures have the smallest number of rejections of the DQ test, especially for low values of α.

Backtesting ES
All models that we consider produce both VaR and ES forecasts. From an economic point of view, for example, when we compare the 2.5% ES forecasts of the GAS-1F-RV5 with those of the GAS-1F, the former are, on average, lower by 13.29% (S&P 500), 17.49% (DJIA), 8.40% (NIKKEI 225), and 5.31% (FTSE 100). These results indicate that ignoring realized measures overestimates risk on average. Turning to the statistical significance of these values, we follow the backtesting method of Patton, Ziegel and Chen (2019) to evaluate the ES estimates individually, using a dynamic ES (DES) regression test:
$$\lambda^{s}_{e,t} = b_0 + b_1 \lambda^{s}_{e,t-1} + b_2 e_t + u_{e,t},$$
where $\lambda^{s}_{e,t}$ is the standardized version of $\lambda_{e,t}$ defined in (6), and $b = [b_0, b_1, b_2]$ is the set of parameters of the regression. Based on the null hypothesis, we test whether all parameters in the set $b$ are zero.
The right panel of Table 4 shows the p-values from the DES test of the ES forecasts for α = 1%, for the four stock indices. Similarly to the result of the DQ test, incorporating the realized measure RN10 in GAS models seems to reduce the number of backtest rejections for the NIKKEI 225 and the FTSE 100 indices. GAS-1F models with realized measures can pass the DES test at 5% significance level for all indices, which is consistent with the result of the DQ test. The two-factor GAS model, after adding the realized measure RN10, passes the DES test for all indices. Almost all of our new models pass the DES test across the four indices for α = 2.5%, except the GAS-2F for the NIKKEI 225, as can be seen in the right panel of Table 5. Table 6 presents similar results across the four indices using an α of 5%, whilst some benchmarks also have p-values higher than 5%, for example, the HEAVY model with a skew-t distribution. Table 7 summarizes the total number of model rejections at 1% and 5% significance levels for the Dynamic ES regression backtests, across the four markets, for different probability levels. The GAS-1F models enhanced with realized measures have the smallest number of backtest rejections.

Joint backtesting of the (VaR, ES) risk measures
In order to compare jointly the VaR and ES forecasts generated by different models, in this section, a loss function proposed in  is employed. The authors discuss how VaR and ES are jointly elicitable and present a group of loss functions for risk measure estimation and backtesting. We follow the choice of Patton, Ziegel and Chen (2019) for the loss function FZ0, as defined in (1). To compare the performance of each model using the FZ0 loss function, we calculate the average loss value $\bar{L}_{FZ0} = \frac{1}{T}\sum_{t=1}^{T} L_{FZ0,t}$ for different α values across the four indices.
The left panel of Table 4 presents the average losses for the four equity return series, over the out-of-sample period, for 13 different benchmark forecasting models and 16 newly proposed models that use the 5-min and 10-min realized measures. The lowest average loss in each column is highlighted in bold, whilst the second lowest is highlighted in italics. For α = 1%, the GAS-FZ models enhanced with the realized volatility using overnight returns and the HEAVY-Skt model perform well, overall.
For α = 2.5% (see Table 5), the GAS-2F model employing the 10-min realized volatility and overnight returns (GAS-2F-RN10) outperforms the other models: it has a lower loss than most other models for most series, is consistently ranked well, and is the best model for the DJIA and the FTSE 100 index. In Table 6 (α = 5%), the GAS-2F-RN5 and GAS-2F-RN10 models have the lowest loss for the DJIA and the FTSE 100 index, respectively, while the HEAVY-Skt model has the lowest loss value for the S&P 500.
Table 8 presents the rankings (with the best performing model ranked 1 and the worst ranked 29) based on average losses using the FZ0 loss function, for the four index return series, over the out-of-sample period, for the 29 different forecasting models. The last two columns in each panel represent the average rank across the four series and the rank of the average, respectively. For α = 1%, the best-performing model is the GAS-1F model with the 5-min realized volatility and overnight returns, followed by the GAS-1F models extended with the other two realized measures. For α = 2.5%, the GAS-2F-RN10, GARCH-FZ-RV5, and GAS-1F-RN10 are the three models with the lowest average loss values. For α = 5% and α = 10%, our proposed models generally rank higher than the benchmarks, except for the HEAVY model with a skew-t distribution, which is ranked second for α = 5%.
Another observation is that the losses generated from the GAS-FZ models with realized measures are generally lower than the losses generated from most benchmark approaches. However, the HEAVY-Skt model is always one of the best five models in the overall ranking for all four probability levels. Since both sets of models use realized measures, this suggests that the variables extracted from intraday data provide useful information for risk measure forecasting.
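The rank columns described for Table 8 can be sketched as follows. This is an illustrative Python snippet, not the authors' code: the function name `rank_models` is hypothetical, and it assumes that "the rank of the average" means the rank of each model's average rank across the four series. Ties are broken by model order.

```python
import numpy as np


def rank_models(avg_losses):
    """Rank models by average loss within each series.

    avg_losses : array of shape (n_models, n_series) holding average FZ0 losses.
    Returns (ranks, avg_rank, rank_of_avg), where rank 1 is the lowest loss.
    """
    losses = np.asarray(avg_losses, dtype=float)
    # Double argsort converts losses into 1-based ranks within each column.
    order = losses.argsort(axis=0, kind="stable")
    ranks = order.argsort(axis=0, kind="stable") + 1
    # Average rank of each model across the series.
    avg_rank = ranks.mean(axis=1)
    # Assumption: the final column ranks the models by their average rank.
    rank_of_avg = avg_rank.argsort(kind="stable").argsort(kind="stable") + 1
    return ranks, avg_rank, rank_of_avg


if __name__ == "__main__":
    # Three hypothetical models evaluated on three hypothetical series.
    toy = [[0.5, 0.6, 0.2],
           [0.3, 0.9, 0.4],
           [0.4, 0.7, 0.3]]
    ranks, avg_rank, rank_of_avg = rank_models(toy)
    print(ranks)
    print(avg_rank)
    print(rank_of_avg)
```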

[ INSERT TABLE 8 ABOUT HERE ]
In order to analyse the relative performance of each model, we employ the Diebold-Mariano (DM) test to compare any two models using differences in average losses. In this study, t-statistics from the DM test compare the average losses, using the FZ0 loss function, for the four indices and for different probability levels, over the out-of-sample period. A negative t-statistic indicates that the row model has a lower average loss than the column model. Absolute values greater than 1.96 (2.575 or 1.64) indicate that the average loss difference is significantly different from zero at the 95% (99% or 90%) confidence level. In Figure 1, we present the results for the S&P 500 with the null hypothesis that the row model and the column model have equal values for the loss function. The numbering of the models used in the figure is given in the first column of Table 3. Positive test statistics, corresponding to darker colors, mean that the row model has larger losses than the column model. The white blocks mean that the row model dominates the column model in the loss comparison at the 5% significance level; the light green (below white in the color bar) blocks mean that the row model has lower average loss than the column model, but not significantly so; and the dark red blocks mean that the row model has higher loss than the column model at the 5% significance level. In Figure 1, the rows for Model 8 (HEAVY-Skt-RV5), Model 23 (GAS-1F-RN5), and Model 27 (GAS-1F-RN10) have lighter blocks than the other rows, which indicates that these are the three best performing models for the S&P 500 index.
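To make the sign convention of the pairwise comparison concrete, a minimal Python sketch of a DM-type t-statistic is given below. The function name `dm_tstat` and the choice of a Bartlett-weighted (Newey-West) long-run variance are assumptions made for illustration; the text does not state which variance estimator is used.

```python
import numpy as np


def dm_tstat(loss_row, loss_col, n_lags=0):
    """DM-type t-statistic for equal average loss of two models.

    loss_row, loss_col : per-period losses (e.g. FZ0) of the row and column model.
    n_lags             : number of autocovariance lags in the long-run variance;
                         n_lags=0 uses the plain sample variance (an assumption).
    A negative value means the row model has the lower average loss.
    """
    d = np.asarray(loss_row, dtype=float) - np.asarray(loss_col, dtype=float)
    T = d.size
    d_bar = d.mean()
    u = d - d_bar
    # Long-run variance of the loss differential with Bartlett weights.
    lrv = u @ u / T
    for k in range(1, n_lags + 1):
        weight = 1.0 - k / (n_lags + 1)
        lrv += 2.0 * weight * (u[k:] @ u[:-k]) / T
    return d_bar / np.sqrt(lrv / T)


if __name__ == "__main__":
    # Hypothetical loss series, for illustration only.
    row = [1.0, 2.0, 3.0, 4.0]
    col = [2.0, 2.0, 5.0, 5.0]
    t = dm_tstat(row, col)
    print(t, abs(t) > 1.96)  # compare with the 5% two-sided critical value
```

Swapping the two arguments flips the sign of the statistic, which is why the color map in Figure 1 is antisymmetric around its diagonal.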

[ INSERT FIGURE 1 ABOUT HERE ]
Following Wang, Gerlach and Chen (2018) and Taylor (2019), we use the model confidence set (MCS) test introduced by Hansen, Lunde and Nason (2011) to compare the forecasting models via the FZ0 loss function. This approach builds model confidence sets using one-sided elimination based on the Diebold-Mariano test. In this study, we consider the 75% confidence level and employ two methods: the R method, which uses sums of absolute values to calculate the MCS test statistic, and the SQ method, which uses summed squares. Table 9 presents the number of models within the MCS test using the block bootstrap with a block length of 12 and 10,000 replications, based on the losses generated from the FZ0 loss function. The GAS-2F-RN10 is the best performing model, overall, and the GAS models extended with realized measures perform better than most of the benchmark models. The main findings from the MCS test echo the results from the other backtesting methods. The result that some GAS models enhanced with realized measures end up in the MCS more often than the HAR and HEAVY models highlights the usefulness of the score function that the GAS models build on, and we also show evidence that the use of realized measures enhances the risk forecasts of GAS models.

Conclusions
Patton, Ziegel and Chen (2019) proposed a set of semiparametric models (GAS-FZ) in a generalized autoregressive score (GAS) framework to estimate risk measures. This study extends their approach by using exogenous information from high-frequency data in order to improve the prediction of VaR and ES. The result is a new semiparametric framework, named GAS-FZ-Realized, for estimating and forecasting VaR and ES jointly. By incorporating four realized measures (5-min and 10-min realized volatility, with or without the overnight return) into the GAS-FZ models, we observe an improvement in forecasting risk measures over the in-sample and out-of-sample periods.
We employ the newly proposed models to estimate the VaR and ES of four international stock indices empirically, over the period 2000 to 2019. The parameters of the models are estimated by minimizing the FZ loss function of Fissler and Ziegel (2016). VaR and ES forecasts are then built and individually backtested using the unconditional coverage test and the dynamic quantile (and ES) regression tests, and the joint loss function is computed. The main finding is that forecasts generated from the GAS-FZ-Realized models outperform forecasts based on GARCH models or historical simulations, and even those based on the original GAS-FZ models. The only exception is the HEAVY-Skt-RV5 model, which we found difficult to beat.
To conclude, the GAS-FZ-Realized models, especially the GAS-2F combined with the 10-min realized volatility and the overnight return, can provide more accurate risk measures for risk management across different stock indices and probability levels when compared to other models. This work could potentially be extended by improving the ES component, as the dynamics of VaR may not change simultaneously with ES, for example, by modelling an AR relationship between VaR and ES or by assuming a dynamic Omega ratio to describe the relationship between the two measures (Taylor, 2019). Moreover, this study can be extended by using realized volatility at different frequencies or via other proposed realized measures, for example those found in Meng and Taylor (2018).

Note: This table presents the parameter estimates and standard errors of the four GAS models proposed in Patton et al. (2019) and eight GAS models enhanced with 5-min realized volatility (and overnight returns), for VaR and ES, for the S&P 500 index using the first rolling window of 2000 days starting with January 2000. The top panel presents the estimated parameters of the two-factor GAS models. The bottom panel presents the parameters of the one-factor GAS model, the GARCH model, and the hybrid-factor GAS model, estimated using the FZ0 loss minimization. The bottom row of each panel presents the average (in-sample) losses from these models.

Note: This table presents the number of model rejections based on hit percentages of VaR forecasts (UC test) for the four daily equity return series, over the out-of-sample period, for 29 different forecasting models. The first three rows (Models 1-3) correspond to rolling window historical forecasts, the next two rows (Models 4 and 5) correspond to GARCH forecasts based on different distributions for the standardized residuals, and the next four rows (Models 6-9) correspond to forecasts using high-frequency data and the CAViaR model based on the asymmetric Laplace distribution. The next four rows (Models 10-13) correspond to GAS models proposed by Patton et al. (2019). The last 16 rows (Models 14-29) correspond to the GAS models extended with the 5-min and 10-min realized measures, respectively.

(c) 5% S&P 500   (d) 10% S&P 500
Fig. 1. Color map based on the Diebold-Mariano (DM) test comparing the average losses using the FZ0 loss function over the out-of-sample period for 29 different models, for the S&P 500. White blocks mean that the row model has lower average loss than the column model at the 5% significance level; light green (below white in the color bar) blocks mean that the row model has lower average loss than the column model, but not significantly so, and so on. Darker color blocks mean that the row model has higher average loss than the column model.