Assessing the accuracy of exponentially weighted moving average models for Value-at-Risk and Expected Shortfall of crypto portfolios

A plethora of papers on generalized autoregressive conditional heteroscedasticity (GARCH) models for bitcoin and other cryptocurrencies has been published in academic journals. Yet few, if indeed any, of these are employed by practitioners. Previous academic studies produce results that are fragmented, confusing and conflicting, so there is no commercial incentive to drive an expensive implementation of complex multivariate GARCH models, which in any case commonly require more data for calibration than are available in the history of most cryptocurrencies, at least at the daily frequency. Consequently, this paper assesses the forecasting accuracy of simple parametric RiskMetrics-type volatility and covariance models, with a focus on ad hoc parameter choice instead of a data-intensive calibration procedure. We provide extensive backtests of hourly and daily Value-at-Risk (VaR) and Expected Shortfall (ES) forecasts that are regarded as best practice in the industry and commonly used for regulatory approval. Our results demonstrate that much simpler models in the exponentially weighted moving average (EWMA) class are just as accurate as GARCH models for VaR and ES forecasting, provided they capture an asymmetric volatility response and a heavy-tailed returns distribution. Moreover, on ranking each model's variance and covariance forecasts using average scores generated from proper univariate and multivariate scoring rules, there is no evidence of superior performance of variance and covariance forecasts generated by GARCH models, using either daily or hourly data.


Introduction
The modelling and forecasting of volatility and quantile risk measures for cryptocurrencies is a fairly well-researched topic. Almost 350 papers have been published by academic journals and over 100 of these have appeared during the last two years. † This strand of research has become increasingly complex over time, examining numerous variants of the generalized autoregressive conditional heteroscedasticity (GARCH) family, often combined with skewed and heavy-tailed distributions, such as the generalized error distribution (GED), the Weibull, Beta, generalized hyperbolic, inverse Gaussian and Johnson's SU distributions.
However, this complexity in modelling choices for cryptocurrency risk modelling in the academic literature is in stark contrast with current practice in cryptocurrency markets. It is quite common for investors to apply no form of risk analysis at all, with risk management strategies consisting at most of stop-loss limit orders placed at arbitrary price levels for open positions. † The few online sources that do discuss, use or provide forecasts of volatility, Value-at-Risk (VaR) and/or Expected Shortfall (ES) use equally-weighted methodologies and inappropriate assumptions. For instance, Cryptodatadownload, a cryptocurrency market data and analytics provider, produces daily 1% and 5% VaR and ES forecasts for several cryptocurrencies using a historical methodology over a two-year period, i.e. the percentage VaR is forecast as −1× the corresponding quantile of the empirical returns distribution and ES as −1× the average of the returns that are lower than that quantile. A blog from the cryptocurrency exchange OKEx presents a parametric VaR estimation for bitcoin, under the assumption that its one-minute returns follow a normal distribution; the 1% and 5% VaR are then forecast using the sample mean and standard deviation of one-minute returns over the past seven days. ‡ Similarly, the daily 'Bitcoin Volatility Index' is calculated using the standard deviation of returns over the past 30 and 60 days; and the bitcoin Fear & Greed Index and a Forbes article (Bovaird 2021) reporting on bitcoin's volatility both appear to estimate volatility with a similar equally-weighted moving average. But there is a very well-known problem with any equally-weighted VaR or ES model: even a single historical outlier, a large negative return that may have occurred far in the past, has exactly the same influence on the current value of the risk measure as if it had happened just now. § The calibration of GARCH and GAS models requires a large number of historical returns. ¶
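To make the equally-weighted methodology concrete, here is a minimal sketch of the historical-simulation calculation described above; the function name and window handling are our own illustration, not the implementation used by any of the providers mentioned:

```python
import numpy as np

def historical_var_es(returns, alpha=0.01):
    """Equally-weighted ('historical simulation') VaR and ES.

    Percentage VaR is -1 times the alpha-quantile of the empirical
    returns distribution; ES is -1 times the average of the returns
    that fall below that quantile. Every observation in the window
    carries the same weight, however old it is.
    """
    returns = np.asarray(returns, dtype=float)
    q = np.quantile(returns, alpha)       # alpha-quantile of returns
    var = -q                              # percentage VaR (positive number)
    tail = returns[returns <= q]          # returns beyond the quantile
    es = -tail.mean()                     # percentage ES (positive number)
    return var, es
```

The equal weighting is exactly what creates the 'ghost features' discussed above: dropping a single old outlier from the window changes the estimate abruptly.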
While some cryptocurrencies such as bitcoin or ether have been trading for some time, the continuous emergence of new coins and tokens that gain investor attention means that newer cryptocurrencies often have insufficient data available to produce robust parameter estimates. For instance, at the time of writing, the list of top ten cryptocurrencies by market cap reported by Cryptocompare includes Avalanche, Solana and Terra, which have only been trading for about two years. For such cryptocurrencies, volatility models that can be 'jump-started' and produce forecasts without the need for a lengthy estimation period, such as the RiskMetrics™ exponentially-weighted moving average (EWMA) model (Longerstaey and Spencer 1996), are ideal. EWMA models have the added advantage of allowing the use of ad hoc parameter values even when we include features such as an asymmetric volatility response and a heavy-tailed Student-t distribution assumption.

† Note that the above refers to relatively unsophisticated retail investors that maintain unhedged positions in cryptocurrencies; it does not apply e.g. to investors or market makers that partially or completely hedge their positions with derivatives.
‡ The OKEx blog even mentions that 'VaR is useful for calculating the maximum expected loss on an investment', which is a highly inaccurate and misleading interpretation.
§ These so-called 'ghost features' have been recognized in traditional financial markets for decades; see, for instance, Section IV.2.10.2 in Alexander (2008).
¶ The optimal sample size is highly dependent on the characteristics of the data. For cryptocurrency univariate volatility modelling, between one and two years of data have been a common choice. Much larger samples are required for multivariate model calibrations to be stable over rolling or expanding windows.
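The RiskMetrics™ EWMA recursion that makes such jump-starting possible can be sketched as follows; the 20-observation sample-variance seed is an arbitrary choice for illustration, not part of the original methodology:

```python
import numpy as np

def ewma_variance(returns, lam=0.94, init_var=None):
    """RiskMetrics-style EWMA variance filter.

    sigma2_t = lam * sigma2_{t-1} + (1 - lam) * r_{t-1}^2

    lam = 0.94 is the classic RiskMetrics choice for daily data.
    No parameters are estimated, so the filter can be started on
    a short history of returns.
    """
    returns = np.asarray(returns, dtype=float)
    sigma2 = np.empty(len(returns))
    # Seed with a short-sample variance (illustrative assumption).
    sigma2[0] = init_var if init_var is not None else returns[:20].var()
    for t in range(1, len(returns)):
        sigma2[t] = lam * sigma2[t - 1] + (1 - lam) * returns[t - 1] ** 2
    return sigma2
```

Because the smoothing constant is fixed ad hoc rather than estimated, the same recursion can be applied unchanged to a coin with only a few months of history.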
However, before this paper we had little or no idea of the performance of EWMA models for bitcoin and other cryptocurrencies, relative to the more complex models that have been the focus of previous academic research. Indeed, a major limitation of the extant literature is the lack of consideration of simpler models, even though such models are most commonly employed by practitioners. There are also numerous gaps in the extant literature on cryptocurrency risk metrics. For instance, there is a complete absence of the traffic lights for VaR and ES backtesting which have been standard practice in the industry since Basel Committee (1996), and there is just one paper that uses scoring rules for density forecast evaluation. Likewise, only one other paper examines the VaR and ES of short positions on cryptocurrencies even though these are as easily traded as long positions on all the major exchanges. Furthermore, hardly any other papers examine the forecasting accuracy of multivariate models, even though these should form the cornerstone of cryptocurrency portfolio optimization techniques. And all previous academic studies employ data at the daily frequency, with samples that are often too small to yield robust and reliable results. None of them use hourly data even though these data are readily available and there are distinct advantages of using hourly data: firstly for a 24-fold increase in sample size and hence a much larger data set for risk model calibration and backtesting; and secondly for a means to capture intraday volatility, which is especially important for cryptocurrencies because they have many more price jumps and short bursts of volatility than traditional assets. One purpose of this paper is to fill all these gaps in the otherwise highly prolific literature.
By contrast, the complex end of the modelling spectrum is over-researched, at least from the cryptocurrency practitioner's perspective. Our tenet is that there is very limited scope for real-world applications of FIGARCH, ACGARCH, TGARCH, H-GARCH, ALL-GARCH, APARCH, MS-GARCH and several other varieties that have been explored in this strand of cryptocurrency research. By contrast, a class of EWMA models which extends the basic RiskMetrics™ methodology is ideally suited for risk-based applications of cryptocurrency portfolios, for two main reasons: first, because the methodology is easy to understand, validate and explain in a simple technical document; second, and perhaps most importantly, because these models do not require large samples of historical data for parameter estimation, and so they can be backtested using the maximum amount of historical data available which, for some cryptocurrencies, is already rather small. This paper investigates the relative performance of different types of EWMA model and a variety of GARCH models for capturing volatility clustering in USD prices of bitcoin, ether, ripple and litecoin. We use these coins because, unlike many other coins or tokens, they have a sufficiently long history for proper calibration and thorough backtesting of multivariate GARCH models. Bitcoin, ether and ripple are also among the largest coins by market capitalization, as litecoin also used to be. Our main purpose is to quantify the gains, if any, from using the complex GARCH models whose performance for bitcoin and a few other cryptocurrencies has already been extensively analysed in a burgeoning yet fragmented literature. First we present a concise and accessible summary of the crypto GARCH literature, reviewing its unifying themes and obvious gaps, and conclude that there is no consistent evidence to support the use of any model more complex than a simple asymmetric GARCH(1,1) with Student-t innovations.
Empirical results are divided as to whether the exponential GARCH (EGARCH) model of Nelson (1991) or the GJR-GARCH model of Glosten et al. (1993) is better at capturing the necessary asymmetry; we find the EGARCH slightly better for major coins, but either would serve.
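The asymmetric response that these models capture can be illustrated with the GJR-GARCH(1,1) variance filter; the parameter values and the unconditional-variance seed below are illustrative assumptions, not estimates from our data:

```python
import numpy as np

def gjr_garch_filter(returns, omega, alpha, gamma, beta):
    """GJR-GARCH(1,1) variance filter (Glosten et al. 1993).

    sigma2_t = omega + (alpha + gamma * 1{r_{t-1} < 0}) * r_{t-1}^2
               + beta * sigma2_{t-1}

    The gamma term loads on negative returns only, so a fall in price
    raises the variance forecast more than a rise of equal size: the
    asymmetric ('leverage') volatility response.
    """
    r = np.asarray(returns, dtype=float)
    sigma2 = np.empty(len(r))
    sigma2[0] = r.var()  # simple unconditional seed (assumption)
    for t in range(1, len(r)):
        lev = gamma if r[t - 1] < 0 else 0.0
        sigma2[t] = omega + (alpha + lev) * r[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2
```

EGARCH achieves the same effect by modelling the log-variance, which also guarantees positivity without parameter constraints.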
Our benchmark volatility model is the sample standard deviation (a simple equally-weighted moving average of past squared returns), against which we assess the performance, in both univariate and multivariate systems, of several adapted EWMA models, with and without asymmetric volatility responses, and both symmetric and asymmetric GARCH models, all with Student-t innovations. Our applications extend previous research in several ways: by analysing hourly as well as daily log returns; by backtesting one-step-ahead ES as well as standard VaR metrics; by studying both univariate and multivariate systems; and by further evaluating the volatility and covariance forecasts using univariate and multivariate proper scoring rules.
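Given a one-step-ahead volatility forecast, VaR and ES under a zero-mean Student-t assumption have closed forms, as the following sketch shows (assuming scipy is available; the sqrt((nu−2)/nu) factor rescales the standard t quantile to the unit-variance parameterization):

```python
from scipy.stats import t as student_t

def t_var_es(sigma, nu, alpha=0.01):
    """One-step-ahead VaR and ES, as positive numbers in return units,
    for a zero-mean Student-t return with variance sigma^2.

    q is the alpha-quantile of the standard t distribution; the ES uses
    the standard closed form f(q) * (nu + q^2) / ((nu - 1) * alpha).
    """
    scale = sigma * ((nu - 2) / nu) ** 0.5   # unit-variance rescaling
    q = student_t.ppf(alpha, nu)             # negative for small alpha
    var = -scale * q
    es = scale * student_t.pdf(q, nu) * (nu + q * q) / ((nu - 1) * alpha)
    return var, es
```

The same formulas apply whether sigma comes from an EWMA filter or a GARCH model, which is what makes the two model classes directly comparable in VaR and ES backtests.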
The daily data backtesting sample is from January 2017 to August 2021 and for the hourly data we produce forecasts from 1 May 2021 to 1 July 2021. Because this research is targeted towards risk management professionals, we backtest VaR forecasts with the industry-standard traffic light and conditional coverage test of Christoffersen (1998); similarly we use a modified traffic light test for ES as well as the exceedance residual test of McNeil and Frey (2000). The accuracy of volatility forecasts is also assessed using the continuous ranked probability score of Gneiting and Ranjan (2011), the energy score developed by Gneiting and Raftery (2007) is employed for evaluating covariance forecasts, and we also assess forecasting accuracy using the univariate and multivariate negatively oriented logarithmic scoring rules, as mentioned by Gneiting and Ranjan (2011) and used by Catania et al. (2019).
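The traffic light test itself is simple: count VaR exceedances over the backtest sample and map the binomial cumulative probability of that count into green, yellow and red zones. A minimal sketch, using the 95% and 99.99% zone boundaries of the Basel Committee (1996) framework:

```python
from math import comb

def binom_cdf(k, n, p):
    """Exact binomial CDF, P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def traffic_light_zone(n_exceptions, n_obs=250, coverage=0.99):
    """Basel-style traffic light for VaR backtests.

    Zones follow the binomial CDF of the exception count under correct
    coverage: green below 95%, yellow from 95% to 99.99%, red above.
    With 250 days and 99% VaR this gives green <= 4, yellow 5-9,
    red >= 10 exceptions.
    """
    p = binom_cdf(n_exceptions, n_obs, 1 - coverage)
    if p < 0.95:
        return 'green'
    if p < 0.9999:
        return 'yellow'
    return 'red'
```

In the regulatory setting a yellow or red zone triggers a multiplier on the market risk capital charge, which is why this test is the industry standard.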
Overall, we conclude that EWMA models perform at least as well as GARCH models at all levels of coverage up to and including 99%, and sometimes they perform even better. Interestingly, we find that hourly forecasts are less accurate than daily forecasts in general, when examining the number of models that fail the VaR and ES backtests in each case. Nevertheless, most EWMA models are sufficiently accurate to pass traffic light and coverage tests at all three tail quantiles, for both long and short positions. By contrast, the more sophisticated Student-t exponential GARCH models often fail to make accurate predictions at the hourly level. Their parameter estimates are less stable than they are with a daily rolling-window re-calibration. At the hourly frequency it seems that GARCH models are fitting high-frequency fluctuations that appear irrelevant for forecasting the tails of one-hour-ahead distributions, and it is better to use the stable, if ad hoc, parameters of an EWMA model. For predicting the volatility and covariance structure, and when assessing the results using proper scoring rules, all models (including the random walk benchmark) are equally (in)accurate. This is true for both univariate and multivariate density predictions and for one-day-ahead as well as one-hour-ahead forecasts. This finding supports a simple form of market efficiency, which is not surprising since trading volumes on large coins have grown very rapidly during the last few years, so by now these markets have become quite mature. Nevertheless it is worthwhile to have demonstrated this efficiency empirically, at the daily frequency since January 2017 and at the hourly frequency since 1 May 2021.
In the following: Section 2 provides a critical survey of the extensive literature on cryptocurrency volatility, VaR, ES and covariance forecasting; Section 3 specifies the models used in our empirical study, as well as the backtesting of VaR and ES predictions and the use of proper scoring rules for assessing the accuracy of volatility and covariance forecasts; Section 4 provides an overview of the daily and hourly historical data used for the analysis; Section 5 presents our empirical results; and Section 6 summarizes and concludes.

State-of-the-art crypto risk models
Here we summarize the burgeoning academic literature on cryptocurrency market risk modelling by focusing on papers which assess the in-sample and out-of-sample performance of parametric volatility and/or covariance models applied to cryptocurrency returns. For ease of reference, the main characteristics of the most relevant academic papers are summarized in Table 1, which reports the cryptocurrencies examined, the sample period, the models employed and their distributional assumptions, and the performance criteria used to discriminate between competing models. The cryptocurrencies most commonly examined are: bitcoin (BTC), ether (ETH), ripple (XRP), litecoin (LTC), dogecoin (DOGE), dash, monero (XMR), maidsafecoin (MAID), stellar (XLM), bytecoin (BCN), bitcoin cash (BCH), bitcoin gold (BTG), bitcoin diamond (BCD) and bitcoin private (BTCP), and also an equally-weighted and a minimum variance portfolio. † A few authors examine a more expanded cryptocurrency universe, e.g. Catania and Grassi (2021) analyse a total of 606 cryptocurrencies having at least 700 daily price observations until September 2019, but the majority of papers focus on bitcoin, ether, ripple and litecoin because these offer a historical period of at least five years and they are consistently amongst the largest coins by market capitalization. The sample frequency is almost invariably daily and the sample period used in each paper usually depends on the available historical data. For example, Katsiampa (2017) and Baur et al. (2018) only examine bitcoin, so their sample period begins in 2010. However, Fantazzini and Zimin (2020) use less than three years of data for both calibration and backtesting. This, like most of the studies summarized in Table 1, would not pass the stringent Basel guidelines on historical data for market risk capital calculation. †

† Historical data for the above cryptocurrencies are obtained from the following sources: blockchain.com, Binance, Bitstamp, Bloomberg, Brave New Coin, Coindesk, Coinmarketcap, Cryptocompare, Gemini and Kraken.

Notes to Table 1: The columns indicate the author of each paper, the cryptocurrencies and the sample period, the models and distribution assumptions used, and the in- and out-of-sample analysis performed. In-sample diagnostics include the Akaike (AIC), Bayesian (BIC) and Hannan-Quinn (HQ) information criteria, and their modifications such as the consistent AIC (cAIC) and corrected AIC (corAIC). Other in-sample performance criteria include the direct comparison of the log likelihood (LL), the marginal likelihood ratio (MLR) and also, for models estimated via MCMC, the deviance information criterion (DIC) and the Bayesian predictive information criterion (BPIC); further analysis of the in-sample fit includes the examination of parameter estimates and residuals. Out-of-sample tests include the unconditional coverage (UC), conditional coverage (CC) and dynamic quantile (DQ) tests for VaR, the exceedance residual (ER) and ESR tests for ES, and also the multi-level ES approximation test via VaR of Kratz et al. (2018). Other forecast evaluation methods include the likelihood ratio (LR) test, the model confidence set (MCS) process, the equal forecasting performance DM test of Diebold and Mariano (1995) and the regression test of Mincer and Zarnowitz (1969), using loss functions such as the mean square forecasting error (MSE) or its square root (RMSE), the mean absolute error (MAE), the robust family of loss functions (RLF) of Patton (2011) including the MIX and QLIKE functions, the quantile loss (QL) function and the FZL joint VaR/ES loss function of Fissler and Ziegel (2016). Proper scoring rules include the continuous ranked probability score (CRPS) and the log score. All GARCH models shown in the Models column refer to models of first order such as a GARCH(1,1).
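Of the loss functions listed above, Patton's QLIKE is particularly popular for variance forecast comparison because it is robust to noise in the volatility proxy (e.g. squared returns). A minimal sketch:

```python
import numpy as np

def qlike(proxy_var, forecast_var):
    """Average QLIKE loss (Patton 2011): proxy/h - log(proxy/h) - 1.

    Minimized (at zero) when the forecast equals the proxy, and, in
    expectation, when the forecast equals the true conditional variance,
    even if the proxy itself is a noisy estimate of that variance.
    """
    r = np.asarray(proxy_var, dtype=float) / np.asarray(forecast_var, dtype=float)
    return np.mean(r - np.log(r) - 1)
```

Average losses of this kind are the inputs to the DM test and the MCS procedure referred to above.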

Survey of models employed
First we summarize the models used not only in the papers summarized in Table 1 but also in numerous other applications of GARCH models to cryptocurrency returns. Regarding the literature summarized in Table 1, the most common choices include the symmetric GARCH of Bollerslev (1986) and asymmetric models such as the GJR-GARCH of Glosten et al. (1993), the exponential GARCH (EGARCH) of Nelson (1991), the threshold GARCH (TGARCH) of Zakoian (1994), the asymmetric power ARCH (APARCH) of Ding et al. (1993) and, less often, the AGARCH of Engle and Ng (1993). These models are in some cases extended further with distribution mixture and Markov switching (MS) frameworks. Some authors use the component GARCH (CGARCH) of Engle and Lee (1999) and variants such as its asymmetric extension ACGARCH, the weighted component GARCH (wCGARCH) of Bauwens and Storti (2009) and the component with multiple threshold (CMT) GARCH of Bouoiyour and Selmi (2014). Still more complex volatility model choices include the H-GARCH and ALL-GARCH of Hentschel (1995), the non-linear NGARCH of Higgins and Bera (1992), the AVGARCH of Schwert (1990), the robust GARCH model of Trucíos et al. (2017), the realized GARCH model of Hansen et al. (2012), the GARCH-MIDAS (mixed data sampling) model of Engle et al. (2013), and also an autoregressive jump intensity (ARJI) model and a stochastic volatility model with co-jumps (SVCJ). More sophisticated univariate models include the realized GARCH and stochastic volatility models which are discussed by Takahashi et al. (2016), Chen et al. (2021) and Takahashi et al. (2021). In the context of cryptocurrencies, stochastic volatility models have been used by Tiwari et al. (2019) and realized GARCH by Trucíos and Taylor (2022), but only in the univariate context. These papers also have mixed results, suggesting a possible need for further research.
In a similar vein, more complex distributional assumptions beyond the normal and Student-t and their skewed variants could be applied, including the generalized error distribution (GED), generalized hyperbolic, Weibull, Laplace, Beta-skew-t, generalized Pareto, reflected Gamma, inverse Gaussian and Johnson's SU distributions. All of these GARCH variants have been explored in the voluminous research literature on univariate GARCH modelling, but their extension to large-dimensional multivariate systems of returns presents a challenge. Consequently it is not surprising that there is no evidence of widespread adoption of these complex models by financial risk practitioners, even for volatility modelling in traditional asset classes. ‡ It may be that some of these state-of-the-art volatility models could produce superior results for some individual cryptocurrencies, but in this paper our focus is on the widespread uptake of simpler, multivariate risk models by practitioners, specifically those that fall within an asymmetric extension of the RiskMetrics™ EWMA class.

† See for example, [MAR31.12], [MAR31.23] and [MAR31.26] of Minimum capital requirements for market risk.
‡ Implementation of these models commonly employs maximum likelihood estimation (MLE), which is the simplest method of multivariate optimization. Markov chain Monte Carlo (MCMC) has also been applied to regime-switching GARCH models, but only at the univariate level. We note that Tiwari et al. (2019) use the cross-entropy method of Rubinstein (1997) for calculating the marginal likelihood, again in a univariate model.

The vast majority of other papers are about the diversification or hedging effects of bitcoin, and these typically employ some variant of the GARCH class with normal or Student-t distributed innovations, and again all such models are GARCH(1,1). § For instance: Dyhrberg (2016) compares bitcoin with gold and the dollar using both symmetric and exponential normal GARCH; Bouri et al. (2017) examine the hedging and safe-haven properties of bitcoin and use a symmetric model with innovations that follow a generalized error distribution (GED); Al-Khazali et al. (2018) compare the impact of macroeconomic news on bitcoin and gold and find that the best GARCH model is the exponential GARCH with normally distributed error terms; Corbet et al. (2018) examine the applications of bitcoin futures and use a symmetric GARCH; Vidal-Tomás and Ibañez (2018) use a component GARCH to examine the efficiency of bitcoin traded prices; Al-Yahyaee et al. (2019) study the diversification effects of bitcoin and gold for crude oil and S&P 500 investments and use several GARCH models including a fractionally integrated (FI) EGARCH model; and López-Cabarcos et al. (2020) analyse the effect of investor sentiment and S&P 500 and VIX returns on bitcoin's volatility, using GARCH and EGARCH models.

§ For this reason, in the following the term GARCH always means GARCH(1,1), unless otherwise stated.

Due to its simplicity and ease of use, the RiskMetrics™ EWMA model of Longerstaey and Spencer (1996) is very popular in financial market applications, and some academic papers focus on assessing its forecasting accuracy using traditional asset as well as cryptocurrency data. For instance, Pafka and Kondor (2001) examine its VaR forecasting ability for returns on the 30 constituent stocks of the DJIA index, arguing that it performs well at lower (e.g. 95%) coverage levels and for short-term risk horizons, but that its accuracy declines at 99% coverage and also for multi-period forecasts. Similar results are reported by McMillan and Kambouroudis (2009), now examining 31 stock market indices. Specifically in the cryptocurrency literature, there is some support for the use of integrated GARCH (IGARCH) models, and the EWMA model falls into the integrated volatility model class. For instance Chu et al. (2017) and Köchling et al. (2020) find that IGARCH provides the optimal in-sample fit for bitcoin and other cryptocurrencies; and Bouoiyour and Selmi (2016) and Baur et al. (2018) both find that bitcoin's variance process is integrated. The forecasting performance of EWMA volatility models is assessed by Catania et al. (2019), Bazán-Palomino (2020), Nekhili and Sultan (2020) and Silahli et al. (2021). Silahli et al. (2021) also examine an even simpler equally-weighted moving average (EQMA) model as a benchmark, while Guesmi et al. (2019) and Segnon and Bekiros (2020) use fractionally integrated models such as the FIGARCH and FIAPARCH. Liu et al. (2020) consider several univariate score-driven EWMA models based on the generalized autoregressive score (GAS) model framework of Creal et al. (2013), and Trucíos (2019), Troster et al. (2019) and Catania and Grassi (2021) also use GAS models.

The forecasting performance of multivariate covariance models has been only rarely studied, and in these few papers only in-sample performance has been assessed. Bouri et al. (2017) were the first to examine cryptocurrencies in a multivariate context, using the dynamic conditional correlation (DCC) model of Engle (2002) to test the hedge and safe-haven properties of bitcoin. The majority of other studies use the DCC model and only a few employ the earlier BEKK model of Engle and Kroner (1995). For instance, Bazán-Palomino (2020) considers the relationship between bitcoin and similarly structured cryptocurrencies using the multivariate EWMA, BEKK-GARCH and DCC-GARCH, while Guesmi et al. (2019) use the DCC model to examine bitcoin as well as a number of traditional financial assets. Regarding applications of a multivariate EWMA model, Matkovskyy et al. (2020) use one to examine the interdependence between bitcoin, economic policy uncertainty and traditional financial assets, but none of the relevant papers assess its forecasting performance for VaR and ES of cryptocurrencies, nor do they evaluate the accuracy of covariance forecasts via scoring rules. Other covariance modelling choices reported in Table 1 include the asymmetric ADCC model of Cappiello et al. (2006), the modified cDCC and cADCC of Aielli (2013), multivariate extensions of the marginal densities using copula functions to model the correlation structure, and time-varying parameter vector autoregression (TVP-VAR) models. Engle et al. (2012) provide a useful survey of the numerous papers that explore the best specification for univariate GARCH models on different types of financial data.
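A multivariate EWMA covariance filter of the kind used in these papers applies a single smoothing constant to the whole covariance matrix, which keeps every forecast positive semi-definite; the short-sample seed below is an illustrative assumption:

```python
import numpy as np

def ewma_covariance(returns, lam=0.94):
    """Multivariate EWMA covariance filter.

    H_t = lam * H_{t-1} + (1 - lam) * r_{t-1} r_{t-1}'

    Using the same lam for every variance and covariance means each
    update adds a rank-one positive semi-definite matrix to a scaled
    PSD matrix, so H_t remains positive semi-definite.
    """
    r = np.asarray(returns, dtype=float)
    T, n = r.shape
    H = np.empty((T, n, n))
    H[0] = np.cov(r[:20].T)  # short-sample seed (assumption)
    for t in range(1, T):
        H[t] = lam * H[t - 1] + (1 - lam) * np.outer(r[t - 1], r[t - 1])
    return H
```

As in the univariate case, nothing is estimated, so the filter scales to large cryptocurrency portfolios without the stability problems of multivariate GARCH calibration.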
To update this survey to include the recent research on cryptocurrencies is difficult because the results are often contradictory, suggesting that the best in-sample fit depends very much on both the cryptocurrencies chosen and the sample period, which vary considerably from study to study; although, as noted above, all previous work does at least employ data at the same, daily frequency. Katsiampa (2017) tests several parametric volatility models for the best in-sample fit on bitcoin returns and all criteria indicate that the ACGARCH model is optimal; this is consistent with Bouoiyour and Selmi (2016), whose in-sample analysis also indicates a model with a transitory and a permanent volatility component. The in-sample analysis of Baur et al. (2018) indicates superiority of the EGARCH model for bitcoin returns, and the authors note that using different asymmetric volatility models does not improve the in-sample fit. Tiwari et al. (2019) compare the fit of GARCH and stochastic volatility models for bitcoin and litecoin and find mixed results, for instance concluding that cryptocurrency returns do not exhibit any asymmetric volatility response, which is at odds with the previous findings. The findings of Sosa et al. (2019) suggest that an EGARCH model with GED innovations provides the best in-sample model fit for bitcoin. Troster et al. (2019) agree that a GED assumption instead of a normal significantly improves goodness-of-fit, but further conclude that the hyperbolic HGARCH model with GED innovations provides the best in-sample fit, which is again contrary to previous findings.

Survey of performance results
In the class of regime-switching volatility models, Ardia et al. (2019) find that a two-state Markov switching skewed Student-t GJR-GARCH provides a better in-sample fit for bitcoin compared to both non-switching and three-state switching models; the authors propose that the two-state model provides a better trade-off between fitting quality and model complexity and further show for three-regime models that fitting gains are only observed for the normal distribution. Alexander and Dakos (2020) also explore the in-sample fit of two-state Markov switching GARCH models for bitcoin returns and show that the best model depends on the exact source of data used.
To sum up, the plethora of in-sample diagnostics applied to GARCH models of cryptocurrency volatility reveals a picture of numerous, but imprecise and highly contradictory conclusions, derived from the painstaking estimation of increasingly complex models which often use insufficient data to provide robust and accurate results. Yet, the state-of-the-art results on out-of-sample forecasting for cryptocurrency returns, to which we now turn, are even more confusing.
Out-of-sample forecasting centres on VaR and Expected Shortfall backtests, usually focusing on the left tail of the returns distribution to assess the risk of downward price movements on long crypto asset positions. It is worth noting that the only study other than ours that assesses the performance of right-tail forecasts for losses made on short positions is that of Stavroyiannis (2018), who examines the GJR-GARCH model calibrated to bitcoin returns. The most common backtesting methodologies for VaR forecasts are the unconditional coverage (UC) test of Kupiec (1995), the conditional coverage (CC) test of Christoffersen (1998) and the dynamic quantile (DQ) test of Engle and Manganelli (2004); for ES, common backtesting methods include the exceedance residual (ER) test of McNeil and Frey (2000), the regression-based ESR test of Bayer and Dimitriadis (2020) and the multi-level backtest approximation via VaR of Kratz et al. (2018). Other methods of analysis include the use of loss functions, either in the model confidence set (MCS) process of Hansen et al. (2011) or in hypothesis tests of equal forecasting performance such as the DM test of Diebold and Mariano (1995). Finally, the use of proper scoring rules to evaluate cryptocurrency returns density forecasts is much less common, with Catania and Grassi (2021) using the continuous ranked probability score and Catania et al. (2019) using the log score. Also, and very much in the vein of our paper, we emphasize that the industry-standard traffic light backtesting framework of the Basel Committee (1996), e.g. as described by Costanzino and Curran (2018), is overlooked by all these papers.
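For a Gaussian density forecast the continuous ranked probability score has a closed form (Gneiting and Raftery 2007), which makes score-based model comparison cheap; like all the scores we use, it is negatively oriented, so smaller is better:

```python
from math import erf, exp, pi, sqrt

def crps_normal(mu, sigma, y):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) at outcome y.

    CRPS = sigma * ( z * (2*Phi(z) - 1) + 2*phi(z) - 1/sqrt(pi) ),
    with z = (y - mu) / sigma. Negatively oriented: smaller is better.
    """
    z = (y - mu) / sigma
    Phi = 0.5 * (1 + erf(z / sqrt(2)))       # standard normal CDF
    phi = exp(-0.5 * z * z) / sqrt(2 * pi)   # standard normal PDF
    return sigma * (z * (2 * Phi - 1) + 2 * phi - 1 / sqrt(pi))
```

Averaging such scores over the backtest period and ranking models by the average is exactly the procedure behind our scoring-rule comparisons; the Student-t case can be handled numerically.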
One reason for the confusing conclusions drawn from out-of-sample results is that they depend not only on the models employed but also on the particular cryptocurrency returns studied, the sample period employed and the significance levels examined. For instance, Ardia et al. (2019) compare the VaR forecasting accuracy of single-regime and Markov switching models for bitcoin, concluding that only regime-switching models produce accurate VaR forecasts at the 1% significance level; however, it is worth noting that 5% daily VaR forecasts produced using the relatively simpler single-regime skewed Student-t GJR-GARCH model also pass the CC test, because we cannot reject the null hypothesis of no clustering in exceedances at the 5% significance level, and the DQ test also. Maciel (2021) compares the prediction performance of Markov switching GARCH against single-regime GARCH models for several crypto assets and, like Ardia et al. (2019), favours the more complex models, but the results for a similar set of single- and two-regime GARCH models, also applied to bitcoin, are somewhat mixed. Caporale and Zekokh (2019) also apply a variety of different backtests to VaR and ES forecasts for bitcoin, ether, ripple and litecoin with an exhaustive set of mixture and regime switching model combinations, but again the results are inconclusive.
It further transpires that even when very complex volatility models can produce accurate out-of-sample VaR and ES forecasts, relatively simpler models can produce equally accurate results. For instance, Bonello and Suda (2018) compare VaR forecasts for bitcoin using single-regime and two-regime normal and Student-t GARCH models, and find that all specifications can produce accurate VaR forecasts at a 5% significance level. Troster et al. (2019) backtest daily 1% VaR forecasts for bitcoin and find that a Student-t standard GARCH model is on a par with several more complex GARCH and GAS models included in their study. Trucíos (2019) evaluates VaR forecasts for bitcoin between 2011 and 2017 using six competing models, finding that only a robust bootstrap VaR method produces accurate forecasts at the 1% significance level. In fact, in the preliminary results of a subsequent working paper, Trucíos and Taylor (2022) use a more recent sample period and show that bitcoin and ether VaR forecasts based on simpler volatility models such as the standard GARCH may be considered accurate. Acereda et al. (2020) find that more complex model specifications do not outperform the simpler ones for bitcoin VaR, as long as heavy-tailed distributions are used instead of the standard normal. Silahli et al. (2021) also find that simple benchmark models succeed in various VaR backtests for several crypto assets.
Contradictory results are even apparent when one considers EWMA models alone. For example, Silahli et al. (2021) claim that a normal EWMA volatility model produces accurate VaR forecasts for all cryptocurrencies, but Liu et al. (2020) find that a similar model fails VaR backtests. Nekhili and Sultan (2020) examine the out-of-sample performance of a benchmark RiskMetrics™ EWMA model and find that it produces accurate VaR forecasts at the 5% level, but not at 1%; yet, according to the ER test, an EWMA produces accurate ES forecasts for almost all cryptocurrencies examined. Within the multivariate setting the results seem a little more consistent: Silahli et al. (2021) find that an EWMA covariance model used to produce VaR forecasts for a portfolio of bitcoin, litecoin, ripple and dash passes performance tests; and Catania et al. (2019) examine bitcoin, ether, ripple and litecoin, testing several complex multivariate models against a vector autoregression with EWMA variance, and find that none significantly outperforms this much simpler benchmark.
Finally, Catania et al. (2019) and Catania and Grassi (2021) are the only applications of proper scoring rules specific to cryptocurrencies at the time of writing. Catania et al. (2019) produce multi-period point and density forecasts for bitcoin, litecoin, ripple and ether returns, employing the log score as a measure of forecast accuracy, and conclude that most models outperform the EWMA benchmark. Catania and Grassi (2021) use the continuous ranked probability score (CRPS) to assess volatility forecasts from the GAS model versus EGARCH, concluding equal predictive ability as measured by the DM test. They backtest VaR and ES forecasts for a total of 606 cryptocurrencies with at least 700 daily price observations until September 2019. The authors use score-driven volatility model specifications that incorporate several stylized features such as leverage effects, long memory of the volatility process and time-varying higher order moments, with a generalized hyperbolic skewed Student-t distribution. These models are compared against a benchmark Beta-Skew-t-EGARCH, producing multi-period 1% and 5% VaR and ES forecasts. VaR and ES forecasts are backtested with the DQ and ER tests and the density forecasts are assessed using the CRPS. Score-driven specifications produce accurate 5% and 1% ES and 5% VaR forecasts more often than the Beta-Skew-t-EGARCH benchmark, but GAS models and the EGARCH benchmark are on a par when backtesting 1% VaR. Regarding density forecast evaluation via the CRPS, the authors find that certain score-driven models outperform the benchmark more often than they underperform it. However, even for these successful specifications, equal predictive ability is the most common outcome. For instance, when examining the uniformly-weighted CRPS of the one-day-ahead density forecast across all cryptocurrencies, equal predictive ability occurs for 83% of the cryptocurrencies examined, including bitcoin, ether, ripple and litecoin.
While both Liu et al. (2020) and Catania and Grassi (2021) examine several volatility model specifications, the range of models examined is somewhat limited in both cases. Liu et al. (2020) focus specifically on EWMA-type models and test neither more complex models such as GARCH specifications, nor simpler specifications that require no calibration, such as an equally-weighted moving average or an EWMA with an ad hoc value chosen for the decay parameter. Therefore, their results are not conclusive with respect to the overall suitability of EWMA-type models for forecasting cryptocurrency volatility compared to other more complex or simpler models. By comparison, Catania and Grassi (2021) focus on highly sophisticated GAS model specifications with a similarly sophisticated heavy-tailed distribution assumption and test these against an already complex benchmark Beta-Skew-t-EGARCH model, often finding equal forecasting performance. It is important to note that, as discussed previously, the above finding also extends to VaR and ES forecasting, i.e. the VaR and ES forecasting performance of highly complex GARCH and GAS model specifications can be on a par with relatively simpler models such as the standard GARCH. For instance, this is shown in the results of Bonello and Suda (2018), Silahli et al. (2021) and also in the working paper results of Trucíos and Taylor (2022).

Methodology
Our benchmark model is that returns are normally distributed with zero mean and variance estimated as an equally-weighted moving average of the past n squared returns. Except for the benchmark model, we make the universal assumption of Student-t innovations, again with zero-mean returns. † This is because previous results, available on request, showed that none of the normal models outperformed their Student-t equivalent, for any cryptocurrency. On the other hand, using more complex distributional assumptions as in Chu et al. (2017), Trucíos (2019) and Liu et al. (2020) is tangential to the theme of this paper and would obfuscate its motivation with unnecessary detail. Therefore, to retain our focus on the main story here, i.e. the relative effectiveness of using ad hoc values for EWMA parameters, we only describe the models and report the results for Student-t innovations in all the EWMA and GARCH models.
Our benchmark model assumes returns are normal with variance estimated by an n-period equally-weighted moving average of squared returns; we call it the random walk for short. Then we have an EWMA model as per the RiskMetrics™ technical document (Longerstaey and Spencer 1996) and our own asymmetric extension, similar to the A-GARCH model of Engle and Ng (1993), as well as a symmetric GARCH(1,1) model (Bollerslev 1986) and an asymmetric EGARCH(1,1) model (Nelson 1991); all these models assume a Student-t distribution. Joint density forecasts are produced via n-period equally-weighted moving average covariance matrix estimates and multivariate versions of the EWMA models, while the GARCH and EGARCH models are combined with the dynamic conditional correlation (DCC) model of Engle (2002) and Tse and Tsui (2002) and also its asymmetric extension, the ADCC model of Cappiello et al. (2006).
The basic econometric methodology consists of producing one-period-ahead volatility and covariance forecasts on a daily or hourly rolling basis. These are then combined with parametric distribution assumptions to produce one-period-ahead VaR and ES forecasts at various quantiles, where each model has univariate versions for each cryptocurrency and a multivariate version. To assess the risk of both long and short positions we backtest quantiles at 1%, 2.5%, 5%, 95%, 97.5% and 99%. Then the accuracy of the one-period-ahead volatility and covariance forecasts is evaluated via univariate and multivariate proper scoring rules, respectively.
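As a concrete illustration of this pipeline, the following Python sketch (hypothetical helper names, not code from this study) turns a one-period-ahead volatility forecast and a standardized Student-t assumption into left-tail VaR and ES forecasts, here approximated by simple Monte Carlo rather than closed-form quantiles:

```python
import numpy as np

def var_es_student_t(sigma, alpha=0.05, nu=8.0, n_sim=500_000, seed=42):
    """Monte Carlo sketch: left-tail VaR and ES (reported as positive
    numbers) for zero-mean returns r = sigma * eps, where eps is a
    standardized (unit-variance) Student-t innovation with nu > 2."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_t(nu, size=n_sim) * np.sqrt((nu - 2.0) / nu)
    r = sigma * eps
    var = -np.quantile(r, alpha)          # VaR = -(alpha-quantile of returns)
    es = -r[r <= -var].mean()             # ES = -E[r | r <= -VaR]
    return var, es
```

Closed-form Student-t quantiles and tail expectations could replace the simulation, but the sketch keeps the dependency footprint minimal.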
We test the performance of VaR and ES predictions using the traffic light backtests that have been the industry standard for more than two decades (Basel Committee 1996), along with the two standard tests for clustering of exceedances, i.e. the conditional coverage (CC) test of Christoffersen (1998) for VaR and the (raw) exceedance residual (ER) test of McNeil and Frey (2000) for ES. ‡ Beyond quantile prediction backtesting, we also examine the accuracy of volatility forecasts using the continuous ranked probability score (CRPS) of Gneiting and Ranjan (2011) for univariate forecasts and the energy score of Gneiting and Raftery (2007) for covariance forecasts. § Additionally, we employ the univariate and multivariate negatively oriented logarithmic scoring rules as described by Gneiting and Ranjan (2011). Note that all models assume a zero mean, so these scoring rules aim to examine the accuracy of one-period-ahead volatility and covariance forecasts, over and above the specific quantile predictions previously assessed.
† There is much support for the zero-mean assumption, as already discussed in Section 2. For instance, Köchling et al. (2020) find that GARCH model specifications for bitcoin returns that have a zero mean are very often included in the model confidence set.

Variance and covariance models
Denote the return on a single cryptocurrency at time t by r_t and assume its mean is zero. In the random walk benchmark model we have:

r_t \sim N(0, \sigma_t^2), \qquad \sigma_t^2 = \frac{1}{n} \sum_{i=1}^{n} r_{t-i}^2, \quad (1)

where \sigma_t^2 is the average squared return over the most recent n periods. In both the EWMA and GARCH models, returns are assumed to follow a zero-mean, location-scale transformed Student-t distribution:

r_t = \sigma_t \varepsilon_t, \qquad \varepsilon_t \sim \sqrt{(\nu - 2)/\nu}\; t_\nu, \quad (2)

where t_\nu denotes the standardized Student-t distribution with \nu degrees of freedom, \sigma_t is the standard deviation of r_t and the distribution of \varepsilon_t is defined such that \varepsilon_t has unit standard deviation. The variance under the standard EWMA model with decay parameter \lambda is calculated as:

\sigma_t^2 = (1 - \lambda)\, r_{t-1}^2 + \lambda\, \sigma_{t-1}^2. \quad (3)

Based on the A-GARCH model of Engle and Ng (1993), we introduce the asymmetric EWMA (AEWMA) model with a decay parameter \lambda and an asymmetric volatility response parameter \eta. Under the AEWMA model, the variance is calculated as:

\sigma_t^2 = (1 - \lambda)\,(r_{t-1} - \eta)^2 + \lambda\, \sigma_{t-1}^2. \quad (4)

In the standard (symmetric) GARCH(1,1) model, the conditional variance is given by:

\sigma_t^2 = \omega + \alpha\, r_{t-1}^2 + \beta\, \sigma_{t-1}^2. \quad (5)

Similarly, in the Student-t EGARCH(1,1) model, we have:

\ln \sigma_t^2 = \omega + \beta \ln \sigma_{t-1}^2 + \theta\,(|\varepsilon_{t-1}| - E|\varepsilon_{t-1}|) + \gamma\, \varepsilon_{t-1}. \quad (6)

‡ Note that the 'raw' ER test consists of using the raw residuals, which are not divided by the estimated standard deviation. This version of the ER test is suggested by Bayer and Dimitriadis (2020), who argue that 'the test using the standardized ER is in fact a joint backtest for the triple VaR, ES and volatility, whereas the test using the raw ER is a joint backtest for the pair VaR and ES'.
§ It would also be possible to assess the accuracy of volatility and covariance forecasts via a loss function such as the mean absolute error of point forecasts but, as mentioned by Gneiting and Ranjan (2011), the CRPS may be regarded as an extension of the mean absolute error for density forecasts, instead of point forecasts.
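As a minimal illustration (hypothetical helper, not code from this study), the EWMA and asymmetric EWMA recursions above can be implemented in a few lines; setting eta = 0 recovers the symmetric RiskMetrics™ recursion:

```python
import numpy as np

def ewma_variance(returns, lam=0.94, eta=0.0):
    """Conditional variance via the (asymmetric) EWMA recursion:
    sigma2[t] = lam * sigma2[t-1] + (1 - lam) * (r[t-1] - eta)**2,
    seeded ad hoc with the sample variance of the returns."""
    r = np.asarray(returns, dtype=float)
    sigma2 = np.empty(len(r))
    sigma2[0] = r.var()
    for t in range(1, len(r)):
        sigma2[t] = lam * sigma2[t - 1] + (1 - lam) * (r[t - 1] - eta) ** 2
    return sigma2
```

With an ad hoc decay such as the RiskMetrics value λ = 0.94 for daily data, no calibration step is required, which is precisely the practical appeal emphasized in this paper.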
Regarding volatility forecasts, the random walk, EWMA and AEWMA models described in Equations (1), (3) and (4) have a constant volatility term structure, so their volatility forecasts for period t + 1 are set equal to the corresponding volatility estimates at time t. For the GARCH and EGARCH models, the one-period-ahead volatility forecasts \hat{\sigma}_{t+1} are obtained by updating the conditional volatility Equations (5) and (6) using the estimated model parameters and the last of the in-sample estimates for \hat{\sigma}_t and \hat{\varepsilon}_t.

In a multivariate setting, denote by r_t the (m × 1) vector of the m cryptocurrency returns at time t. The multivariate random walk benchmark model assumes that r_t follows a multivariate normal distribution:

r_t \sim N(0, \Sigma_t), \quad (7)

where the covariance matrix \Sigma_t is estimated as the sample covariance matrix of returns over the past n days. The EWMA and GARCH models follow their univariate counterparts, so the vector of returns is assumed to follow a multivariate location-scale transformed Student-t distribution with \nu degrees of freedom:

r_t \sim t_\nu\left(0, \tfrac{\nu - 2}{\nu}\, \Sigma_t\right), \quad (8)

where \Sigma_t is the covariance matrix of r_t, so that \tfrac{\nu - 2}{\nu}\Sigma_t is the distribution's scale matrix. The covariance matrix in the multivariate EWMA model with parameter \lambda is given by:

\Sigma_t = (1 - \lambda)\, r_{t-1} r_{t-1}' + \lambda\, \Sigma_{t-1}. \quad (9)

The covariance matrix of the asymmetric EWMA with parameters \lambda and \eta is calculated as:

\Sigma_t = (1 - \lambda)\,(r_{t-1} - \eta\mathbf{1})(r_{t-1} - \eta\mathbf{1})' + \lambda\, \Sigma_{t-1}, \quad (10)

where \mathbf{1} is an (m × 1) vector of ones. For the multivariate GARCH models, the covariance matrix is modelled as:

\Sigma_t = D_t C_t D_t, \quad (11)

where D_t is the diagonal matrix of standard deviations obtained from the univariate GARCH or EGARCH variance estimates and C_t is the conditional correlation matrix, which is modelled indirectly via the Q_t matrix to ensure that C_t is a proper, positive semi-definite correlation matrix.
In the DCC model, Q_t is given by:

Q_t = (1 - a - b)\,\bar{Q} + a\, \varepsilon_{t-1}\varepsilon_{t-1}' + b\, Q_{t-1}.

Similarly, in the ADCC model Q_t is calculated as:

Q_t = (1 - a - b)\,\bar{Q} - g\,\bar{Q}^- + a\, \varepsilon_{t-1}\varepsilon_{t-1}' + b\, Q_{t-1} + g\, \varepsilon_{t-1}^- (\varepsilon_{t-1}^-)',

where \varepsilon_t is the vector of standardized errors; \varepsilon_t^- are the zero-threshold errors, defined as equal to \varepsilon_t when the corresponding elements are less than zero and equal to zero otherwise; and \bar{Q} and \bar{Q}^- are the unconditional covariance matrices of \varepsilon_t and \varepsilon_t^-, respectively. The one-period-ahead covariance matrix forecasts are produced similarly to the volatility forecasts described previously. For the multivariate random walk, EWMA and AEWMA models, the one-period-ahead covariance matrix forecast for period t + 1 is set equal to the estimate at time t, and for the DCC and ADCC models it is obtained by updating the conditional covariance Equation (11).
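The multivariate EWMA covariance recursion is equally compact in code; the sketch below (hypothetical helper, not code from this study) updates the full covariance matrix with the outer product of lagged returns:

```python
import numpy as np

def ewma_cov(returns, lam=0.94):
    """Multivariate EWMA covariance recursion:
    Sigma[t] = lam * Sigma[t-1] + (1 - lam) * r[t-1] r[t-1]',
    seeded ad hoc with the sample covariance matrix."""
    r = np.asarray(returns, dtype=float)      # shape (T, m)
    T, m = r.shape
    cov = np.empty((T, m, m))
    cov[0] = np.cov(r, rowvar=False)
    for t in range(1, T):
        cov[t] = lam * cov[t - 1] + (1 - lam) * np.outer(r[t - 1], r[t - 1])
    return cov
```

Because each update is a convex combination of positive semi-definite matrices, every Sigma[t] remains positive semi-definite by construction, with no correlation-targeting step required.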

Backtesting methods
The forecasting accuracy of the volatility models presented in the previous section is assessed by producing rolling forecasts and backtesting them against realized returns. For each of the two quantile risk measures, we use the industry standard traffic light test of the Basel Committee (1996) and one academic standard test, i.e. the conditional coverage (CC) test of Christoffersen (1998) for VaR and the exceedance residual (ER) test of McNeil and Frey (2000) for ES.

Value-at-Risk. The VaR at a significance level α is defined as −1 times the α-quantile of the one-period-ahead forecast F_t, made at time t, of the returns' distribution function. We set α = 1%, 2.5%, 5% for lower (left-tail) quantiles, using 1 − α for upper (right-tail) quantiles, so:

VaR_t(\alpha) = -F_t^{-1}(\alpha) \text{ for long positions (left-tail VaR)}, \qquad VaR_t(1 - \alpha) = F_t^{-1}(1 - \alpha) \text{ for short positions (right-tail VaR)}.

The traffic light approach of the Basel Committee (1996), as described in Costanzino and Curran (2018), is extended here to both left- and right-tail VaR. The exceedance indicator X_t^{VaR}(\alpha) of each one-period-ahead left- and right-tail 100α% VaR forecast at times t = 1, …, N is defined as:

X_t^{VaR}(\alpha) = \mathbf{1}\{r_t \le -VaR_t(\alpha)\} \text{ for the left tail}, \qquad X_t^{VaR}(\alpha) = \mathbf{1}\{r_t \ge VaR_t(1 - \alpha)\} \text{ for the right tail}, \quad (16)

where \mathbf{1}\{\text{condition}\} denotes an indicator function which equals 1 if the condition is satisfied and 0 otherwise. The cumulative number of VaR exceedances X_N^{VaR}(\alpha) over the entire forecasting period t = 1, …, N is then calculated as:

X_N^{VaR}(\alpha) = \sum_{t=1}^{N} X_t^{VaR}(\alpha). \quad (17)

Under the null hypothesis that the VaR model is specified correctly, † the total number of VaR exceedances follows a binomial distribution with parameters N and α; we approximate the binomial with a normal distribution as:

z = \frac{x^{VaR} - N\alpha}{\sqrt{N\alpha(1 - \alpha)}},

where x^{VaR} is the number of realized VaR exceedances over the forecasting period and z is its standard normal transform. Denote the probability of obtaining x^{VaR} or fewer exceedances as \Phi(z), where \Phi is the standard normal distribution function. ‡ The traffic light colour zones are then defined as:

\text{green: } \Phi(z) < 0.95; \qquad \text{yellow: } 0.95 \le \Phi(z) < 0.9999; \qquad \text{red: } \Phi(z) \ge 0.9999.

As described by the Basel Committee (1996), the three-zone approach is introduced to mitigate the statistical limitations of backtesting and to balance the two error types: type I, the possibility that an accurate model is classified as inaccurate based on its backtesting results; and type II, the possibility that an inaccurate model is not classified as such based on its backtesting results. In the green zone, the backtesting results are considered consistent with an accurate model and the probability of erroneously accepting an inaccurate model is low. In the red zone, the backtesting results are highly unlikely to have resulted from an accurate model, and the probability of erroneously rejecting an accurate model is low. In the yellow zone, backtesting results could be consistent with either accurate or inaccurate models, so additional information is required to determine whether the model is specified correctly.

† The null hypothesis that the VaR model is 'specified correctly' implies a joint hypothesis that the time series of VaR exceedance indicators X_t^{VaR}(\alpha) is independent and identically distributed (i.i.d.) and that the proportion of realized VaR exceedances is equal to the VaR significance level α. Note that this definition holds for both left- and right-tail VaR, as exceedances are defined respectively on the 100α% left and right tail of the distribution.
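The zone classification under the normal approximation is easy to reproduce; a minimal sketch (hypothetical helper name; the 95% and 99.99% cumulative-probability cut-offs follow the Basel three-zone convention):

```python
import math

def traffic_light_zone(n_exceed, N, alpha):
    """Classify a VaR backtest into the Basel traffic light zones using
    the normal approximation to the binomial exceedance count:
    z = (x - N*alpha) / sqrt(N*alpha*(1-alpha))."""
    z = (n_exceed - N * alpha) / math.sqrt(N * alpha * (1 - alpha))
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # Phi(z), standard normal CDF
    if p < 0.95:
        return "green"
    return "yellow" if p < 0.9999 else "red"
```

For example, with N = 1,704 daily forecasts and α = 1%, around 17 exceedances are expected, so counts near that level fall in the green zone while materially larger counts move through yellow into red.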
The VaR forecasts are further backtested using the conditional coverage (CC) test of Christoffersen (1998), for which the likelihood ratio test statistic LR_cc is:

LR_{cc} = \frac{(1 - \alpha)^{n_0}\, \alpha^{n_1}}{(1 - \hat{\pi}_{01})^{n_{00}}\, \hat{\pi}_{01}^{n_{01}}\, (1 - \hat{\pi}_{11})^{n_{10}}\, \hat{\pi}_{11}^{n_{11}}},

where: α is the significance level used in the VaR model; \hat{\pi}_{01} = n_{01}/(n_{00} + n_{01}); \hat{\pi}_{11} = n_{11}/(n_{10} + n_{11}); n_1 is the number of realized VaR exceedances; n_0 = N − n_1 is the number of realized returns that do not exceed the VaR forecast; n_{00} is the number of non-exceedances preceded by a non-exceedance; n_{01} is the number of exceedances preceded by a non-exceedance; n_{10} is the number of non-exceedances preceded by an exceedance; and n_{11} is the number of exceedances preceded by an exceedance. § The asymptotic distribution of −2 ln LR_cc under the null hypothesis is chi-squared with 2 degrees of freedom, and the null hypothesis of the CC test for the true transition probabilities π_01 and π_11 is that π_01 = π_11 = α, i.e. that there is a correct proportion of exceedances and no clustering in exceedances.

† Note that in the case of the VaR traffic light backtest, the binomial distribution is commonly used and is (almost) as easy to work with as the normal. We choose the normal approximation of the binomial to ensure the easiest possible application of our ad hoc methodology, and also for consistency with the ES traffic light backtest, where we use the asymptotic normal approximation suggested by Costanzino and Curran (2018). The approximation of the binomial by the normal is considered accurate based on the rule of thumb that both Nα and N(1 − α) should be greater than 5, which is the case for the analysis presented in Section 5, as we use N = 1,704 at the daily frequency, N = 1,465 at the hourly frequency and α = 1%, 2.5% and 5%.
‡ As the number of realized VaR exceedances x^{VaR} over the forecasting period is a realization of the random variable X_N^{VaR}(\alpha) defined in Equations (16) and (17), the probability of obtaining x^{VaR} or fewer VaR exceedances is given by \Phi(z).
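The CC likelihood ratio can be computed directly from the 0/1 exceedance sequence; a minimal sketch (hypothetical names, not code from this study):

```python
import math

def cc_statistic(exceed, alpha):
    """-2 ln LR_cc for Christoffersen's conditional coverage test, from a
    0/1 exceedance sequence and the nominal level alpha; asymptotically
    chi-squared with 2 degrees of freedom under the null."""
    n = [[0, 0], [0, 0]]                      # transition counts n00, n01, n10, n11
    for prev, curr in zip(exceed[:-1], exceed[1:]):
        n[prev][curr] += 1
    n00, n01, n10, n11 = n[0][0], n[0][1], n[1][0], n[1][1]
    p01 = n01 / (n00 + n01) if n00 + n01 else 0.0
    p11 = n11 / (n10 + n11) if n10 + n11 else 0.0

    def loglik(p, zeros, ones):               # Bernoulli log-likelihood, 0*log(0) = 0
        out = 0.0
        if zeros:
            out += zeros * math.log(1.0 - p)
        if ones:
            out += ones * math.log(p)
        return out

    ll_null = loglik(alpha, n00 + n10, n01 + n11)
    ll_alt = loglik(p01, n00, n01) + loglik(p11, n10, n11)
    return -2.0 * (ll_null - ll_alt)
```

Since the alternative uses the maximum-likelihood transition probabilities of the exceedance chain, the statistic is non-negative by construction.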

Expected Shortfall. Expected Shortfall (ES) is defined as the expected loss given that the corresponding VaR forecast is exceeded, i.e.:

ES_t(\alpha) = -E_{F_t}[\, r_t \mid r_t \le -VaR_t(\alpha)\,] = -\frac{1}{\alpha} \int_0^{\alpha} F_t^{-1}(p)\, dp \quad \text{for long positions (left-tail ES)}.

We further extend this definition to right-tail ES, which yields:

ES_t(1 - \alpha) = E_{F_t}[\, r_t \mid r_t \ge VaR_t(1 - \alpha)\,] = \frac{1}{\alpha} \int_{1 - \alpha}^{1} F_t^{-1}(p)\, dp \quad \text{for short positions (right-tail ES)}.

Following Costanzino and Curran (2018), the generalized exceedances

X_t^{ES}(\alpha) = \left(1 - \frac{F_t(r_t)}{\alpha}\right) \mathbf{1}\{r_t \le -VaR_t(\alpha)\} \text{ for the left tail}, \qquad \left(1 - \frac{1 - F_t(r_t)}{\alpha}\right) \mathbf{1}\{r_t \ge VaR_t(1 - \alpha)\} \text{ for the right tail},

capture the severity of each VaR exceedance. Returns that exceed the VaR but not the ES receive a relatively low weight, and X_t^{ES}(\alpha) is dominated by returns of greater magnitude that exceed both the VaR and the ES. The cumulative ES generalized exceedance is then calculated as:

X_N^{ES}(\alpha) = \sum_{t=1}^{N} X_t^{ES}(\alpha).

Under the null hypothesis that the ES model is specified correctly, ¶ the distribution of X_N^{ES}(\alpha) is provided by Costanzino and Curran (2018) based on the binomial and Irwin-Hall distributions; the authors further note that the distribution tends asymptotically to a normal distribution for large forecasting periods, based on the derivation of Costanzino and Curran (2015): †

z = \frac{x^{ES} - N\alpha/2}{\sqrt{N\alpha(1/3 - \alpha/4)}}. \quad (22)

We also backtest the ES forecasts using the exceedance residual (ER) test of McNeil and Frey (2000) based on the raw residuals, i.e. not divided by the estimated standard deviation, as suggested by Bayer and Dimitriadis (2020):

\varepsilon_t = -(r_t + ES_t(\alpha)) \text{ when } r_t \le -VaR_t(\alpha), \text{ for long positions}, \qquad \varepsilon_t = r_t - ES_t(1 - \alpha) \text{ when } r_t \ge VaR_t(1 - \alpha), \text{ for short positions}.

The ER test statistic \hat{\mu} is then calculated as the sample mean of these exceedance residuals.

§ As in the case of the traffic light backtest, the conditional coverage test definitions hold for both left- and right-tail VaR, as exceedances are defined respectively on the 100α% left and right tail of the distribution.
¶ As noted previously for VaR, the null hypothesis that the ES model is 'specified correctly' implies a joint hypothesis that the time series of ES generalized exceedance indicators X_t^{ES}(\alpha) is i.i.d. and that for all p ∈ [0, α], the probability of VaR exceedances is P(r_t ≤ −VaR_t(p)) = p for the left tail and P(r_t ≥ VaR_t(p)) = p for the right tail.
The test statistic \hat{\mu} does not have a standard distribution, so we estimate its distribution using a bootstrap simulation. In the results presented in Section 5, the distribution of the ER test statistic \hat{\mu} is simulated using 1,000 bootstrap replications. The null hypothesis is that E[ε_t] = 0; this is tested against a one-sided alternative that E[ε_t] > 0, suggesting that ES is systematically underestimated.
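A bootstrap version of the left-tail ER test can be sketched as follows (hypothetical helper names, not code from this study; the sign convention makes positive residuals correspond to losses beyond the forecast ES, so the one-sided alternative E[ε] > 0 flags underestimation):

```python
import numpy as np

def er_test(returns, var_f, es_f, n_boot=1000, seed=1):
    """One-sided bootstrap ER test with raw (unstandardized) residuals,
    left tail: eps_t = -(r_t + ES_t) on days when r_t <= -VaR_t.
    Assumes the sample contains at least one VaR exceedance."""
    r, v, e = (np.asarray(x, dtype=float) for x in (returns, var_f, es_f))
    hit = r <= -v                             # VaR exceedance days
    eps = -(r[hit] + e[hit])                  # raw exceedance residuals
    mu = eps.mean()
    rng = np.random.default_rng(seed)
    centred = eps - mu                        # impose H0: E[eps] = 0
    boot = np.array([rng.choice(centred, size=eps.size, replace=True).mean()
                     for _ in range(n_boot)])
    p_value = (boot >= mu).mean()             # one-sided bootstrap p-value
    return mu, p_value
```
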

Score-based tests for variance.
Scoring rules measure the accuracy of probabilistic forecasts and allow for comparisons between competing prediction models. In the case of negatively oriented scoring rules, a lower score indicates a better forecast for the entire distribution, but the most important determinant of the score is the ability to predict an accurate expected value. Yet here we set all models equal in that sense, with every model simply assuming a zero mean return, because our focus is on the accuracy (or otherwise) of RiskMetrics™-type volatility forecasts. Therefore, the difference between scores in our study is entirely due to differences in the accuracy of the variance forecasts. We find these score-based tests useful, above and beyond the quantile predictions relating to the VaR and ES metrics, because our scores † can be used to rank the accuracy of a variance forecast in one simple number. We use the continuous ranked probability score (CRPS) for univariate distribution forecasts and its multivariate extension, the energy score, for joint density forecast evaluation. Similarly, we use the negatively oriented logarithmic score (LogS) to evaluate the univariate and joint density forecasts.
† Note that the derivation described in Costanzino and Curran (2015) yields the asymptotic distribution of our Equation (22), whereas the variance in equation (17) of Costanzino and Curran (2018) mistakenly omits a factor of N. Regarding the accuracy of the normal approximation, Clift et al. (2016) perform a simulation study and find that the approximation is accurate for a forecasting period of length N = 250; this confirms that the normal approximation is fit for use in our analysis, as our forecasting periods include over 1,000 observations.
The CRPS (Matheson and Winkler 1976, Gneiting and Ranjan 2011) generalizes the mean absolute error of an observation y under a forecast distribution F:

\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(x) - \mathbf{1}\{x \ge y\} \right)^2 dx.

According to Gneiting and Raftery (2007), the CRPS can also be expressed as:

\mathrm{CRPS}(F, y) = E|X - y| - \tfrac{1}{2} E|X - X'|,

where X and X' are independent random variables with sampling distribution F. This representation leads to the energy score extension, which generalizes the CRPS to multivariate distributions and is defined (Gneiting and Raftery 2007) as:

\mathrm{ES}(F, y) = E\|X - y\| - \tfrac{1}{2} E\|X - X'\|,

where \|\cdot\| denotes the Euclidean norm on R^n, X and X' are independent (n × 1) random vectors from a multivariate distribution with CDF forecast F and y = (y_1, …, y_n)' is a realized observation. Moreover, if F is given via m discrete (n-dimensional) samples X_1, …, X_m, then the energy score is calculated as:

\mathrm{ES}(F, y) = \frac{1}{m} \sum_{i=1}^{m} \|X_i - y\| - \frac{1}{2m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \|X_i - X_j\|. \quad (28)

Finally, the uniformly-weighted, negatively oriented logarithmic score of an observation y from a univariate or multivariate forecast distribution F with density f is defined by Gneiting and Ranjan (2011) as:

\mathrm{LogS}(F, y) = -\log f(y).

Given the one-period-ahead probability density function forecasts f_t, g_t and their corresponding univariate or multivariate scores S(f_t) and S(g_t), produced on a rolling basis over the out-of-sample period t = 1, …, N, we compare the forecasting performance of f and g directly using their average scores over the out-of-sample period. Alternatively, we use the hypothesis test of equal performance described by Gneiting and Ranjan (2011). If the average scores of f and g over the out-of-sample period are \bar{S}_N^f and \bar{S}_N^g respectively, then the test of equal performance is based on the statistic:

t_N = \sqrt{N}\, \frac{\bar{S}_N^f - \bar{S}_N^g}{\hat{\sigma}_N},

where:

\hat{\sigma}_N^2 = \frac{1}{N} \sum_{t=1}^{N} \left( S(f_t) - S(g_t) \right)^2.

The test statistic t_N is asymptotically standard normal under the null hypothesis of vanishing expected score differentials; therefore, in case of rejection, f is chosen if t_N is negative and g is chosen if t_N is positive.
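Both scores are straightforward to compute from simulated samples of the forecast distribution; a sketch of the sample-based energy score in Equation (28) (hypothetical helper, not code from this study):

```python
import numpy as np

def energy_score(samples, y):
    """Sample-based energy score:
    (1/m) sum_i ||X_i - y|| - 1/(2 m^2) sum_{i,j} ||X_i - X_j||.
    For one-dimensional samples this reduces to the sample CRPS."""
    X = np.asarray(samples, dtype=float)
    if X.ndim == 1:                           # univariate case: treat as (m, 1)
        X = X[:, None]
    y = np.atleast_1d(np.asarray(y, dtype=float))
    term1 = np.linalg.norm(X - y, axis=1).mean()
    pair_dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return term1 - 0.5 * pair_dists.mean()
```
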

Data
Intraday volatility is much greater in cryptocurrency markets than in traditional financial markets, so it is worth analysing hourly data here. † Thus, we obtain both daily and hourly historical data on four of the largest cap cryptocurrencies as of 1 January 2021: bitcoin, ether, ripple and litecoin. Since then all but litecoin have remained in the top five cryptocurrencies by market cap. Nevertheless, we retain litecoin because so many of the papers reviewed earlier also apply their models and tests to litecoin. Historical price data are collected using the Cryptocompare API and are in the form of volume-weighted (VWAP) close prices, averaged across multiple USD-denominated exchange-traded prices for each crypto asset. For the daily frequency analysis, the sample period is between 20 August 2015 and 31 August 2021, with daily prices recorded at 00:00 UTC 365 days per year. The rolling estimation window length for the GARCH models is fixed at 500 days so that the forecasting period consists of 1,704 daily observations between 1 January 2017 and 31 August 2021. For the hourly frequency analysis, the sample period is between 1 January 2021 00:00 UTC and 1 July 2021 00:00 UTC, with an estimation window length of 4 months, i.e. 2,882 hourly returns observations; the forecasting period therefore consists of 1,465 hourly observations, between 1 May 2021 00:00 UTC and 1 July 2021 00:00 UTC. Figure 1 depicts time series of daily log returns for each cryptocurrency. Bitcoin appears to be considerably less volatile than the other currencies, except during the 'Black Thursday' crypto market crash on 12 March 2020, and common volatility clusters are often observed simultaneously across all four cryptocurrencies. Figure 2 displays the time series of hourly log returns for each cryptocurrency over the entire sample period January to June 2021. 
All returns exhibit common volatility clustering and some extreme hourly returns above 10% or below − 10%, as also shown in the minimum and maximum returns in Table 3. Table 2 presents summary statistics for the daily log returns for each cryptocurrency over the sample period 20 August 2015 to 31 August 2021. All cryptocurrencies exhibit a relatively small mean, in line with our zero-mean assumption. The skewness is relatively small and can take either sign. The highly significant positive excess kurtosis justifies our assumption for heavy-tailed distributions, captured using Student-t innovations. Table 3 displays summary statistics for the hourly log returns for each cryptocurrency over the sample period January to June 2021. The mean and skewness are again relatively small. As expected, hourly returns are more volatile than daily returns according to the annualized standard deviation, and excess kurtosis is again positive so a heavy-tailed distribution such as the Student-t should be preferable.
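The summary statistics above rest on log returns and the annualization convention stated in the table notes; a minimal sketch (hypothetical helper name) makes the computation explicit:

```python
import numpy as np

def log_returns_and_ann_sd(prices, periods_per_year):
    """Log returns plus the annualized standard deviation: the factor is
    sqrt(365) for daily crypto data and sqrt(24 * 365) for hourly data,
    since crypto markets trade every day of the year."""
    p = np.asarray(prices, dtype=float)
    r = np.diff(np.log(p))                    # log returns
    ann_sd = r.std(ddof=1) * np.sqrt(periods_per_year)
    return r, ann_sd
```
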
We now ask whether the skewness observed in Tables 2 and 3 should warrant the use of the skewed Student-t distribution, such as that defined by Aas and Haff (2006), for the GARCH model innovations. To this end, we estimate a univariate skewed Student-t EGARCH model for each cryptocurrency using the entire daily and hourly sample periods and report the parameter estimates and p-values in Tables 4 and 5. Given that the entire sample is very large, for both hourly and daily returns, the estimates of most parameters, including the skewed-t distribution parameter τ, are significant. However, these conditional skewness estimates have a different sign to the unconditional sample skewness observed in Tables 2 and 3. Moreover, the estimates of τ are highly unstable when estimated on rolling sample windows, instead of the entire sample. This suggests that the use of a skewed Student-t distribution can produce a misspecified model, and because it is important to select a robust and easily-estimated GARCH model against which to judge the performance of the simpler, RiskMetrics™-type models with ad hoc parameter choices, we do not include the additional source of asymmetry afforded by skewed Student-t innovations in our empirical study of asymmetric GARCH models.
The asymmetric volatility response refers to a negative correlation between today's return and tomorrow's volatility. Wu (2001), Avramov et al. (2006) and many others since show that this is a well-known feature of traditional financial assets. Regarding cryptocurrency returns, the full-sample daily and hourly frequency Student-t EGARCH parameter estimates shown in Tables 6 and 7 indicate at least one significant volatility response parameter, i.e. θ and γ , as defined in Equation (6). For the daily data, Table 6 shows that the response size parameter θ is small and positive, and only significant for BTC, but the asymmetry parameter γ is significant for all four cryptocurrencies. For the hourly data, Table 7 shows that the asymmetry parameter γ is again significant for all four cryptocurrencies and the response size parameter θ is large and negative, except for XRP.
Based on daily data in Figure 3 and hourly data in Figure 4, we present θ and γ parameter estimates based on a rolling window of 500 observations. Figure 3 shows that the daily frequency estimates for θ are relatively stable over time, but there is more variation in γ. For instance, during the March 2020 Covid crisis, both BTC and ETH daily returns exhibited an asymmetric volatility response which changed from day to day, due to extreme volatility following the huge negative returns of over 30% on 12 March 2020. The most variable parameter estimates are for XRP, which we ascribe to a high-profile lawsuit between Ripple Labs and the SEC. † Figure 4 covers a different sample, starting only in May 2021. By this time the cryptocurrency markets had matured considerably, and the parameter estimates are much more stable than they are in Figure 3. Again, XRP stands out from the other currencies, with a much smaller θ but a greater γ.
Another intuitive visualization of the volatility response uses the news impact curves defined by Engle and Ng (1993). These are depicted in Figure 5 for daily data and Figure 6 for hourly data. Figure 5 shows symmetric news impact curves at the daily frequency, the greatest impact being for XRP, which is to be expected given the extensive media coverage of the SEC lawsuit. Figure 6 shows that negative hourly returns have a more pronounced news impact than positive returns of the same magnitude, except for XRP, which is more symmetric. We conclude that the cryptocurrency risk profile, as indicated by the shape of the news impact curve, varies considerably across different cryptocurrencies, as well as between daily and hourly returns. The differences observed here confirm a highly idiosyncratic volatility response behaviour even within the four major cryptocurrencies.
† For example, since December 2020, and thus during our entire sample for hourly data, Ripple Labs has been engaged in a high-profile lawsuit filed by the U.S. Securities and Exchange Commission (SEC), still ongoing at the time of writing. There is now a major class action by XRP coin holders claiming that the SEC lawsuit has had a highly detrimental effect on XRP returns.
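A news impact curve of this kind can be traced by fixing the lagged variance at a reference level and varying the shock. The sketch below (hypothetical helper, not code from this study) follows an EGARCH-style response with size parameter θ and asymmetry parameter γ; it uses the standard normal value E|ε| = √(2/π) as a simplification, whereas the models in this paper assume Student-t innovations:

```python
import numpy as np

def egarch_news_impact(eps, omega, beta, theta, gamma, sigma2_bar=1.0):
    """Volatility response sigma2 to a shock eps, holding the lagged
    variance fixed at sigma2_bar:
    ln sigma2 = omega + beta*ln(sigma2_bar)
                + theta*(|eps| - E|eps|) + gamma*eps."""
    e_abs = np.sqrt(2.0 / np.pi)              # E|eps| under normality (simplification)
    log_s2 = (omega + beta * np.log(sigma2_bar)
              + theta * (np.abs(eps) - e_abs) + gamma * eps)
    return np.exp(log_s2)
```

With γ < 0, negative shocks raise next-period variance more than positive shocks of the same magnitude, which is the asymmetry visible in the hourly curves of Figure 6.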
Because we use hourly data, we examine whether the cryptocurrencies in our sample exhibit intra-day volatility periodicity which, if found, can be accounted for as discussed by Andersen and Bollerslev (1997), Stroud and Johannes (2014) and others. To this end, Figure 7 shows the per-hour average in-sample volatility estimates from a univariate Student-t EGARCH model, estimated using the entire sample of hourly data for each cryptocurrency. Only slight variations are observed between different hours of the day, and these are very small relative to the overall average volatility levels. More specifically, based on hourly returns, bitcoin exhibits an average annualized volatility of 99%, but the range of the hour-of-day average volatility between maximum and minimum is only 3.1%. Similarly, for ether the average volatility is 126% and the range is 5.9%; for ripple, 173% and 7.9%; and for litecoin, 141% and 6.4%. While it may be the case that state-of-the-art volatility models accounting for intra-day volatility periodicity could produce superior results, such complex models may have limited scope for application by many practitioners, who are likely to favour the simpler models with ad hoc parameter choices that are presented in this paper.
Notes: The sample period is 1 January 2021 to 1 July 2021. The mean, minimum and maximum are expressed in % and the hourly standard deviation is annualized using √(24 × 365).
Notes: The first column denotes the model parameters as defined in Equation (6), with τ denoting the Student-t distribution's skewness parameter, and the remaining columns denote the parameter estimates and p-values for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC).
Figure 5. News impact curve denoting the volatility response σ²_t (vertical axis) to previous-period returns ε_{t−1} (horizontal axis) for bitcoin (BTC, in blue), ether (ETH, in red), ripple (XRP, in green) and litecoin (LTC, in gray), obtained from the univariate Student-t EGARCH model estimated for the entire daily frequency sample period 20 August 2015 to 31 August 2021.
Figure 6. As Figure 5, but obtained from the model estimated for the entire hourly frequency sample period 1 January 2021 to 1 July 2021.
Figure 7. Average in-sample volatility per hour-of-day obtained via the Student-t EGARCH model estimated for the entire hourly frequency sample period 1 January 2021 to 1 July 2021. In-sample estimated volatility is annualized with a factor of √(24 × 365) and averaged across the sample for each hour-of-day.

Empirical results
We generate out-of-sample, one-period-ahead forecasts for bitcoin, ether, ripple and litecoin daily and hourly log returns, comparing the EWMA-type models and the more complex GARCH models against the random walk benchmark. We first backtest both left- and right-tail VaR and ES in a univariate setting, then present the results of the univariate and multivariate score-based tests of variance and covariance forecast accuracy. The hourly frequency analysis is then presented in the same order.
We emphasize that the entire 'raison d'être' for this research is to set ad hoc values for the EWMA parameters. This is because we seek to argue for (or against) setting universal, or at least sector-specific, values for EWMA parameters that are the same for the tens, hundreds or even thousands of crypto assets included in a portfolio. This is an important question at this stage in the development of cryptocurrency markets, where some type of extended RiskMetrics™ product needs thorough and independent examination. Institutional interest in exploring investment opportunities in a wide variety of coins and tokens is increasing rapidly at the time of writing. However, our parameter selections also need to be sensible. Setting a decay parameter less than about 0.9 would probably be unacceptable to regulators, and the asymmetry parameter should not be so large as to dominate the influence that the realized returns have on volatility. With these comments in mind, we describe below how model parameters are selected.

Daily forecasts
Variance and covariance predictions are made daily between 1 January 2017 and 31 August 2021, i.e. for 1,704 daily observations. The benchmark model employs an equally-weighted 30-day moving average. We select precisely 30 days because, although ad hoc, this time period is very commonly used by practitioners and is probably the best-suited choice for a single benchmark model. EWMA and AEWMA volatilities and covariances are calculated using two possible values for λ, the RiskMetrics™ value 0.94 and a smaller value of 0.925; the AEWMA model further introduces an asymmetry parameter η, which is set to 1%, 2% and 3% for left-tail (long positions) VaR/ES forecasting and −1%, −3% and −5% for the right tail (short positions). Results for other choices of η are presented in the Appendix. For instance, Table A5 explores the effect of η on right-tail daily VaR results for short positions and Table A6 does the same, but for left-tail hourly VaR results, for long positions.† The univariate GARCH and multivariate DCC model parameters are estimated using MLE on a fixed-size rolling window of 500 daily log returns, with model parameters updated on a daily basis.
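For concreteness, the asymmetric EWMA variance recursion with decay λ and asymmetry η can be sketched as follows. This is an illustrative Python implementation rather than the paper's R code, and it assumes the shifted-return form σ²ₜ = λσ²ₜ₋₁ + (1 − λ)(εₜ₋₁ − η)², which is consistent with the sign conventions above: η > 0 amplifies the volatility response to negative returns, as required for left-tail (long position) forecasting.

```python
def aewma_variance(returns, lam=0.925, eta=0.01, sigma2_0=None):
    """Asymmetric EWMA variance recursion (illustrative sketch).

    Assumes the shifted-return form
        sigma2_t = lam * sigma2_{t-1} + (1 - lam) * (r_{t-1} - eta)**2,
    so that eta > 0 amplifies the impact of negative returns.
    Returns the full variance path; the last element is the
    one-step-ahead variance forecast.
    """
    if sigma2_0 is None:
        # seed with the sample variance if no initial value is supplied
        mean = sum(returns) / len(returns)
        sigma2_0 = sum((r - mean) ** 2 for r in returns) / len(returns)
    sigma2 = [sigma2_0]
    for r in returns:
        sigma2.append(lam * sigma2[-1] + (1.0 - lam) * (r - eta) ** 2)
    return sigma2
```

Setting η = 0 recovers the symmetric EWMA recursion, and a negative η gives the right-tail (short position) variant.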
It is important to note the trade-off between the length of the in-sample/rolling estimation period and the out-of-sample/forecasting period. A longer estimation period might improve the performance of GARCH models for which parameters are estimated via MLE; however, this produces a significantly smaller out-of-sample period and allows for fewer sample points in the forecast accuracy tests. For instance, we have attempted an alternative specification for the GARCH models using a 1,240-day rolling estimation window, but this leaves fewer than 1,000 out-of-sample observations between 1 January 2019 and 31 August 2021. The parameter estimates obtained using the longer estimation window are slightly more stable, but the forecasting results are very similar to those reported here, which use a 500-day estimation window.

† As mentioned previously, note that the ad hoc value choices for the asymmetry parameter should be large enough to improve forecasting accuracy but also small enough to avoid a dominating influence of realized returns on volatility.

Table 8. Backtesting results for one-day-ahead left-tail 1% and 2.5% VaR forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC), based on an out-of-sample period between 1 January 2017 and 31 August 2021.

Table 9. Backtesting results for one-day-ahead right-tail 1% and 2.5% VaR forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC), based on an out-of-sample period between 1 January 2017 and 31 August 2021.
Regarding the distribution assumptions, in the random walk benchmark model cryptocurrency returns are assumed to follow a zero-mean normal distribution. In the EWMA and AEWMA models, a zero-mean location-scale transformed Student-t distribution is used with ad hoc ν = 6 degrees of freedom, to produce a heavy-tailed distribution; similarly, a multivariate Student-t with ν = 6 is assumed for the joint distribution of bitcoin, ether, ripple and litecoin returns. The GARCH and DCC models also assume a univariate and multivariate Student-t distribution, respectively, where the degrees of freedom parameter is estimated jointly with the model parameters based on the 500-day rolling estimation window.† GARCH and DCC model estimations and forecasts, and also some of the VaR and ES backtesting methods, are implemented using the rugarch and rmgarch R packages of Ghalanos (2019, 2020).‡

† The daily returns exhibit some asymmetry in their sample statistics, but not always in the same direction, as shown in Table 2. But the asymmetric model specifications can still capture this characteristic to some extent even with the symmetric distribution assumption that we use. Also, the choice of a symmetric distribution simplifies the process of modelling the joint density for the returns of bitcoin, ether, ripple and litecoin.

VaR and ES backtests.
We generate one-day-ahead VaR and ES forecasts for bitcoin, ether, ripple and litecoin at the 1%, 2.5% and 5% significance levels, for both the left and right tail of the returns distribution, to assess each model's ability to adequately capture the risk that large negative daily returns present to long positions, and also the risk that large positive daily returns present to short positions. The accuracy of the forecasts generated by each model is assessed using the tests described in Section 3.2. The results for 1% and 2.5% are presented in Table 8 for the left tail (long positions) and in Table 9 for the right tail (short positions). The results for 5% are less important, at least from the point of view of market regulation, and so are left to the Appendix.
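Given a one-period-ahead variance forecast from any of the models above, the corresponding left-tail VaR and ES under the zero-mean location-scale Student-t assumption with ad hoc ν = 6 can be obtained in closed form or, as sketched below, by Monte Carlo simulation. This Python sketch is purely illustrative (the paper's backtests use the rugarch/rmgarch packages and custom R code); the function names and the simulation approach are our own.

```python
import math
import random

def simulate_t_var_es(sigma, alpha=0.01, nu=6, n=100_000, seed=1):
    """Monte Carlo left-tail VaR and ES (as positive numbers) for a
    zero-mean location-scale Student-t return with variance sigma**2.

    A Student-t draw is built from its normal/chi-squared mixture
    representation, then rescaled so the distribution has unit variance
    before multiplying by sigma (illustrative sketch only).
    """
    rng = random.Random(seed)
    scale = sigma * math.sqrt((nu - 2) / nu)  # unit-variance rescaling
    draws = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        v = rng.gammavariate(nu / 2.0, 2.0)   # chi-squared with nu d.o.f.
        draws.append(scale * z / math.sqrt(v / nu))
    draws.sort()
    k = int(alpha * n)
    var = -draws[k]               # alpha-quantile of the return distribution
    es = -sum(draws[:k]) / k      # average loss beyond the VaR
    return var, es
```

Because ES averages the losses beyond the VaR, it always exceeds the VaR at the same significance level, which the simulation reproduces.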
These results are consistent with most of the findings in the relevant literature as discussed in Section 2. Catania and Grassi (2021) examine GAS model specifications against an EGARCH benchmark for 606 cryptocurrencies and find that the additional modelling complexity introduced by the GAS framework 'pays off' for 5% and 1% ES and 5% VaR with increased accuracy, but less so for 1% VaR.

‡ The multivariate EWMA model is implemented using the RiskPortfolios R package of Ardia et al. (2017), and the remaining EWMA and AEWMA specifications and also the traffic light backtesting methodology are implemented using custom-written R code.

Table 11. Backtesting results for one-day-ahead right-tail 1% and 2.5% ES forecasts for bitcoin (BTC), ether (ETH), ripple (XRP) and litecoin (LTC), based on an out-of-sample period between 1 January 2017 and 31 August 2021.

Notes: As the CRPS and energy score are negatively oriented and produce positive values, a relative score lower than 100% indicates outperformance of the benchmark. For the random walk benchmark model the average CRPS is reported outright and the average scores of the remaining models are expressed as a percentage of the benchmark's score. The EWMA, AEWMA and GARCH models are based on a Student-t distribution assumption; the degrees of freedom for EWMA and AEWMA are set to the ad hoc ν = 6 and for GARCH the degrees of freedom are estimated via MLE. The right-most column reports the average energy scores for one-day-ahead joint density forecasts of all four coins. For the tGARCH and tEGARCH models, the energy scores refer to the ADCC multivariate model. The relative energy scores for the DCC multivariate model are 98.43% for tGARCH and 102.14% for tEGARCH.

Notes: For the random walk benchmark model the average log score is reported outright and the average scores of the remaining models are expressed as a percentage of the benchmark's score. As the negatively oriented version of the log score is used, which produces negative values, a relative score higher than 100% indicates outperformance of the benchmark. The EWMA, AEWMA and GARCH models are based on a Student-t distribution assumption; the degrees of freedom for EWMA and AEWMA are set to the ad hoc ν = 6 and for GARCH the degrees of freedom are estimated via MLE. The right-most column reports the average multivariate log scores (MV LogS) for one-day-ahead joint density forecasts of all four coins. For the tGARCH and tEGARCH models, the multivariate log scores refer to the ADCC multivariate model. The relative multivariate log scores for the DCC multivariate model are 125.08% for tGARCH and 120.63% for tEGARCH.
In that respect, the results presented in this section for one-day-ahead VaR and ES forecast backtesting are somewhat in agreement with Catania and Grassi (2021), in that introducing additional modelling complexity may sometimes 'pay off' in increased forecasting accuracy, especially at lower significance levels; however, we often find that AEWMA specifications are on par with EGARCH in terms of VaR and ES forecasting accuracy, even at the 1% significance level.

Score-based tests for variance.
We now present the results on univariate and multivariate scores to rank the accuracy of each model for forecasting the one-day-ahead volatility and covariance of bitcoin, ether, ripple and litecoin log returns. The continuous ranked probability score (CRPS) and negatively oriented univariate log score are used to assess univariate density forecasts, and joint distribution forecasts are evaluated with the energy score and again with the negatively oriented multivariate log score. Given the parametric distribution assumptions in the models used, i.e. normal for the random walk model and Student-t for the EWMA and GARCH specifications, the one-day-ahead volatility and covariance forecasts fully define the one-day-ahead distribution of log returns for each cryptocurrency and also their joint distribution, allowing the scores to be calculated. All univariate and multivariate scores are calculated using the scoringRules R package of Jordan et al. (2019) combined with custom-written R code. The CRPS is calculated every day by comparing the realized return with the one-day-ahead log returns density forecast produced by each model and, in order to rank the models, we average their CRPS over the entire forecasting period. Table 12 reports the results expressed as a percentage of the benchmark model's average score, which is given in the first row. Due to their negative orientation, relative scores below 100% suggest that a model may be more accurate than the benchmark. We also use the individual daily scores from each model to perform pair-wise comparisons of forecasting accuracy, by calculating the t_N test statistic of Gneiting and Ranjan (2011) for the hypothesis test of equal forecasting performance, as per Equation (30) in Section 3.
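Although the paper evaluates the CRPS from the parametric predictive densities via scoringRules, the score is easy to approximate from draws of the forecast distribution using the standard kernel representation CRPS(F, y) = E|X − y| − ½E|X − X′|. A minimal Python sketch of that sample-based estimator (purely illustrative; not the paper's R implementation):

```python
def crps_sample(samples, y):
    """Sample-based CRPS estimator (negatively oriented: lower is better).

    Uses the kernel representation
        CRPS(F, y) ~ mean(|x_i - y|) - 0.5 * mean(|x_i - x_j|),
    where x_i are draws from the forecast distribution F and y is the
    realized value. Illustrative O(m^2) implementation.
    """
    m = len(samples)
    term1 = sum(abs(x - y) for x in samples) / m
    term2 = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return term1 - term2
```

A degenerate forecast concentrated exactly on the realization scores zero, and a sharp, well-calibrated forecast scores lower than a needlessly dispersed one, which is what makes the CRPS a proper scoring rule for ranking density forecasts.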
As shown in Table 12, most models outperform the benchmark, although none of them achieves an average CRPS lower than 97% and the t_N test statistic for a significant difference between the highest and lowest average scores is always below 0.15, so the null hypothesis of equal forecasting performance is never rejected at 5%. This suggests that all models examined, including our very simple benchmark, produce equally inaccurate one-day-ahead density forecasts for the returns of bitcoin, ether, ripple and litecoin. This result extends the findings of Catania and Grassi (2021) that equal forecasting performance between an EGARCH benchmark model and more complex GAS models is the most common outcome.
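The pair-wise comparisons operate on the daily score differentials between two models. A simplified Diebold-Mariano-type statistic in the spirit of the Gneiting and Ranjan (2011) t_N test can be sketched as below; note this illustrative version omits any autocorrelation-robust variance correction that the paper's Equation (30) may specify.

```python
import math

def tn_statistic(scores_a, scores_b):
    """Simplified t statistic for equal forecasting performance.

    Computes sqrt(N) * mean(d) / sd(d) on the score differentials
    d_t = scores_a[t] - scores_b[t]. Illustrative sketch only; a
    HAC/serial-correlation correction is omitted.
    """
    n = len(scores_a)
    d = [a - b for a, b in zip(scores_a, scores_b)]
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return math.sqrt(n) * mean / math.sqrt(var)
```

Values of the statistic close to zero, as reported above, are far from the usual normal critical values, so equal forecasting performance cannot be rejected.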
The energy score is calculated by drawing 10,000 random samples from the one-day-ahead forecast joint density of log returns produced by each model and comparing these samples with the corresponding realized returns. As with the CRPS and energy scores, the univariate and multivariate log scores are calculated via the corresponding one-day-ahead predictive densities, based on the realized returns and the volatility and covariance forecasts produced by each model. Again, the average log score is reported outright for the benchmark in Table 13 and relative to the benchmark score for all other models. Table 13 shows that all models produce a higher relative score compared with the benchmark, contrary to the findings for the CRPS and energy score. However, this is because we employ the negatively oriented log score, which produces negative average scores in all cases; consequently, a lower (i.e. more negative) log score is larger in magnitude but still indicates a better forecast. Therefore, the results shown in Table 13 are consistent with the findings for the CRPS and energy scores, in that all models outperform the benchmark. Moreover, the t_N test statistic for equal forecasting performance applied to the log scores cannot reject the null hypothesis that the models produce the same scores, even at 5%. In other words, based on the logarithmic score results as well as the CRPS and energy score results, all models produce equally accurate one-day-ahead variance forecasts and, similarly, none of the multivariate parametric models perform any better than the benchmark.
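The energy score generalizes the CRPS to multivariate forecasts and has an analogous sample-based estimator, ES(F, y) = E‖X − y‖ − ½E‖X − X′‖, computed here from draws of the forecast joint density. A minimal Python sketch (illustrative only; the paper uses scoringRules in R with 10,000 draws per day):

```python
import math

def energy_score(samples, y):
    """Sample-based energy score (negatively oriented: lower is better).

    samples: list of draws from the forecast joint density, each a
    tuple/list of returns (one entry per asset); y: realized return
    vector. Uses
        ES(F, y) ~ mean(||x_i - y||) - 0.5 * mean(||x_i - x_j||).
    """
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    m = len(samples)
    term1 = sum(dist(x, y) for x in samples) / m
    term2 = sum(dist(a, b) for a in samples for b in samples) / (2 * m * m)
    return term1 - term2
```

In one dimension this reduces exactly to the CRPS kernel representation, which is why the two scores are reported side by side in Table 12.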

Hourly forecasts
We now present the same analysis performed on a more recent sample of hourly bitcoin, ether, ripple and litecoin log returns between 1 May 2021 00:00 UTC and 1 July 2021 00:00 UTC. The benchmark model forecasts are based on an equally-weighted 72-hour moving average; the EWMA and AEWMA forecasts are again produced with λ set to 0.925 and 0.94, but the AEWMA η parameter is now set to lower values, i.e. to 0.7%, 0.8% and 0.9% for long positions and −0.2% and 0.2% for short positions, to be in line with smaller hourly returns.† GARCH models are calibrated using a rolling estimation window of 4 months, i.e. 2,882 hourly observations. The same distribution assumptions as for the daily frequency analysis are followed, i.e. we use normal innovations for the benchmark, a Student-t distribution with ν = 6 for the EWMA and AEWMA models, and for the GARCH models the degrees of freedom parameter is estimated with the other model parameters via MLE.

VaR and ES backtesting.
The hourly 1% left-tail (long position) VaR forecast results shown in Table 14 suggest that the AEWMA models are the only ones that produce fewer than the expected 14.65 exceedances for 1% VaR.‡ For these models the CC test's null hypothesis of no clustering in exceedances is accepted even at the 10% level of significance. By comparison, the tEGARCH models are in the yellow zone and the standard tGARCH forecasts are in fact not considered accurate at all. The right-tail (short position) hourly 1% VaR backtesting results, which are presented in Table 15, appear to favour the simple model specifications even more; for instance, the 1% right-tail VaR forecasts produced by the symmetric tEWMA models are considered accurate based on both the traffic light and CC tests, and even the simplest random walk benchmark model produces accurate 1% VaR forecasts for bitcoin and ripple, on a par with the more complex GARCH specifications.

† Additional testing for other AEWMA η parameter choices is reported in the Appendix.

‡ The expected number of VaR exceedances for 1% VaR is calculated based on (17) using N = 1,465 and α = 1%.
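The traffic light classification referred to above follows the Basel logic of mapping the observed exceedance count to its cumulative binomial probability under correct 1% coverage. The sketch below assumes the standard 95% (green/yellow) and 99.99% (yellow/red) cut-offs; the paper's own traffic light implementation is custom R code and may differ in detail.

```python
import math

def traffic_light_zone(exceedances, n, alpha=0.01):
    """Basel-style traffic light zone for a VaR exceedance count.

    Classifies the count by its cumulative binomial probability under
    the null of correct coverage: green below 95%, yellow between 95%
    and 99.99%, red above (assumed standard cut-offs). For hourly 1% VaR
    with n = 1,465 forecasts, the expected count is n * alpha = 14.65.
    """
    def binom_cdf(k, n, p):
        return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k + 1))
    prob = binom_cdf(exceedances, n, alpha)
    if prob < 0.95:
        return "green"
    return "yellow" if prob < 0.9999 else "red"
```

With the regulatory 250-day window this reproduces the familiar Basel zone boundaries (green up to 4 exceedances), and the same rule extends directly to the 1,465-observation hourly backtest used here.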

Score-based tests for variance.
The average CRPS based on the hourly volatility forecasts for each cryptocurrency, and the energy score for the covariance forecasts, are presented in Table 18. As before, the average score for the benchmark is given in the first row and the subsequent rows present each model's average score relative to the benchmark score. Again, the benchmark yields a higher average score (indicating a poorer performance) than almost every other model. The tEGARCH model yields the lowest average CRPS for all cryptocurrencies except ripple and the ADCC-tGARCH yields the lowest average energy score. Importantly, and similar to the discussion of our daily frequency analysis, the test of equal forecasting performance does not distinguish between the models: the t_N test statistic between the highest and lowest average scores is always below 0.45, so the null hypothesis of equal forecasting performance is accepted, e.g. at the 5% and 10% significance levels, based on both the CRPS for each cryptocurrency and on the energy score. This suggests that the equal forecasting performance identified previously, based on the average scores of daily returns density forecasts, also holds at the hourly frequency: i.e. all models examined, even the random walk benchmark, are considered equally accurate regarding one-hour-ahead volatility and covariance forecasts for the returns of bitcoin, ether, ripple and litecoin.
Similarly, the average univariate and multivariate log scores are shown in Table 19. The average log scores of all models are greater in magnitude than the benchmark's average score which, as mentioned previously, indicates a better performance than the benchmark, due to the negative sign of all average scores. As before, the test of equal forecasting performance indicates an equal performance across all models, for all assets and also for the multivariate log score, even at the 5% and 10% significance levels. It therefore appears that both the CRPS/energy score and the univariate/multivariate log scores indicate equal forecasting performance in all cases, for hourly data just as they do for daily data.

Conclusions
The standard RiskMetrics™ EWMA methodology fails to provide the industry standard of 99% coverage for VaR and ES of cryptocurrencies, at least when the decay parameter λ is 0.94, this being the value set for returns on traditional financial assets. However, a lower value than 0.94 may be appropriate for cryptocurrencies because their volatility tends to be more highly reactive to market shocks and also less persistent, and reducing λ has precisely this effect. Having said this, we have not been able to produce the required level of coverage with other values of λ, whether lower or higher than 0.94, even when innovations are drawn from a Student-t distribution.
What is important is that the EWMA model includes an asymmetry in volatility response. That is, volatility increases more following a negative return than it does following a positive return of the same magnitude. It is simple to do this by introducing another parameter η to the EWMA model, and with this modification we refer to the model as AEWMA, or more precisely as the AEWMA(λ, η, ν) model, where ν is the degrees of freedom in the Student-t innovations. We explore the values of λ, η and ν that do provide 99% coverage, with the same parameter values for all cryptocurrencies. This way, we may also use these values in a multivariate framework without concern for non-positive definiteness of the covariance matrix.
Our empirical results show that this simple, constant-parameter AEWMA model performs as well as the fully-calibrated Student-t EGARCH models, which require at least several hundred in-sample observations before any backtesting begins, and even more observations are required for calibration of their multivariate equivalents if these are to be employed in portfolio optimization. Running several state-of-the-art VaR and Expected Shortfall backtests over a five-year period at the daily frequency, and using several thousands of observations at the hourly frequency, we find that a coverage level of 99% for VaR and ES is achievable for long or short positions on all cryptocurrencies considered, provided we make the right parameter choices. Although a small improvement in coverage is obtainable using sophisticated GARCH models, the incremental returns to this effort are unlikely to be seen as attractive by industry practitioners. Also, our univariate and multivariate score-driven tests for variance and covariance indicate no significant improvements in average scores which would warrant ranking GARCH models above our simple asymmetric extension of a RiskMetrics™-type EWMA.

Disclosure statement
No potential conflict of interest was reported by the author(s).