Modeling Expected Shortfall Using Tail Entropy

Given the recent replacement of value-at-risk as the regulatory standard measure of risk with expected shortfall (ES) undertaken by the Basel Committee on Banking Supervision, it is imperative that ES gives correct estimates for the value of expected levels of losses in crisis situations. However, the measurement of ES is affected by a lack of observations in the tail of the distribution. While kernel-based smoothing techniques can be used to partially circumvent this problem, in this paper we propose a simple nonparametric tail measure of risk based on information entropy and compare its backtesting performance with that of other standard ES models.


Introduction
Banking regulation, past and present, has centered on the measurement of risk needed for capital calculation, which can be computed based on various models. Additionally, banks and financial institutions need to estimate their own exposure to different relevant risk factors in order to understand the overall riskiness of their activities and to help them prepare for undesirable situations. While value-at-risk (VaR) is the most popular risk measure and has been widely used in the past, it is not a subadditive measure (Artzner et al. [1,2]) and it is a misleading measure for portfolio optimizers [3]. Expected shortfall (ES), which was popularized by Artzner et al. [1,2] and has the advantage of being a coherent measure [4], has increased in importance since the Basel Committee on Banking Supervision [5,6] began to base its regulatory capital calculations on ES estimates at the 2.5% significance level.
The standard way to estimate ES at level α is to simply compute the weighted average of the historical returns in the α-tail of the returns' distribution, where the weights are the probabilities associated with the returns estimated from historical data (nonparametric method). However, this often gives an estimate that is inaccurate (see McNeil et al. [7]), data-sensitive (due to the low number of observations in the tail), and often characterized by an undesirably large variability.
A comprehensive presentation of the modern techniques used in risk management, especially with regard to estimating value-at-risk, expected shortfall, and spectral risk measures, can be found in Guégan et al. [8].
Several alternative approaches to computing ES exist, and a comprehensive review can be found in Nadaraj et al. [9]. Starting with semiparametric methods, a strand of this approach is based on extreme value theory, as in McNeil and Frey [10], who apply the theoretical framework of Embrechts et al. [11] and employ the generalized Pareto distribution combined with an ARMA-GARCH(1,1) process.
Entropy measures are measures of uncertainty, and (asymmetric) tail measures are considered measures of risk. Thus, tail entropy is a measure of risk. Philippatos and Wilson in 1972 [32], and, later, Ebrahimi et al. [33] suggested that entropy compares favorably to volatility/variance as a measure of uncertainty. The same has been shown by Dionisio et al. [34], with the authors arguing that while variance measures the concentration around the mean, entropy measures the dispersion of the density irrespective of the location of the concentration (also see Ebrahimi et al. [35] and Allen et al. [36]). Moreover, the entropy of a distribution function is strongly related to its tails, as shown by Pele et al. [37], which becomes an important feature for distributions with heavy tails or with an infinite second-order moment, for which an estimator of variance is obsolete. Liu et al. [38] compare an entropy-based measure with the classical coefficient of correlation and conclude that their measure has certain superior characteristics as a descriptor of the relationship between time series when compared with the correlation measure.
The study of the relationship between entropy and financial markets goes back more than a decade. Entropy has been used as a measure of stock market efficiency in Zunino et al. [39], while Risso [40] links it to stock market crashes, arguing that the probability of having a crash increases as entropy decreases. Oh et al. [41] use entropy as a measure of the relative efficiency of the foreign exchange (FX) markets. Based on generalized entropy theory, Wang et al. [42] analyze the interactions among agents of stock markets. Entropy has also been used as a tool for the predictability of stock market returns. Maasoumi and Racine [43] show that entropy can capture nonlinear dependence between financial returns. An effective early warning indicator for crisis situations of banking systems has been built by Billio et al. [44]; this indicator considers different definitions of entropy.
Entropy has also been used in option pricing (Stutzer [45] and Stutzer and Kitamura [46]). An application of entropy-based risk measures for decision-making can be found in Yang and Qiu [47]. Applications of Tsallis entropy in risk management can be found in Gradojevic and Gencay [48], Gencay and Gradojevic [49], and Gradojevic and Caric [50]. Bowden [51] introduces directional entropy and uses it to improve the performance of VaR in capturing regime changes. Portfolio optimization based on maximum entropy has been discussed by Geman et al. [52].
The main objectives of this paper are to introduce a measure of expected shortfall based on tail entropy, to study this measure's properties, and to compare it with alternative measures of ES. The main advantage of the measure we propose is that it is less sensitive to the actual values of observations in the tail of the distribution (compared to historical ES), making it a more stable measure of tail risk. We use several ES backtests to verify the accuracy of the proposed measure. The rest of the paper is organized as follows. Section 2 derives the theoretical results specifying the steps in estimating tail entropy as a measure of market risk and details our methodology, Section 3 presents an empirical application, and Section 4 concludes.

Methodology
For a series of asset prices denoted by $S_t$, in the following we compute log-returns, denoted by X with realizations $(X_t)_{t \in \mathbb{N}}$, i.e.,
$$X_t = \ln S_t - \ln S_{t-1}.$$

Measures of Risk: VaR and ES
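Given the returns series X, with F denoting its cumulative distribution function, for an α ∈ (0, 1) the VaR at level α is the negative of the smallest number whose cumulative probability is at least α, i.e.,
$$\mathrm{VaR}_\alpha(X) = -\inf\{x \in \mathbb{R} : F(x) \ge \alpha\}.$$
In probabilistic terms we can write (for a continuous F) that
$$P\left(X \le -\mathrm{VaR}_\alpha(X)\right) = \alpha.$$
In addition, VaR is the negative of the quantile, i.e.,
$$\mathrm{VaR}_\alpha(X) = -F^{-1}(\alpha).$$
The ES at significance level α is the average of the returns below the VaR at level α, reported as a positive loss, as given by
$$\mathrm{ES}_\alpha(X) = -E\left[X \mid X \le -\mathrm{VaR}_\alpha(X)\right].$$
For a continuous variable with density function f, the above definition for ES can be rewritten as
$$\mathrm{ES}_\alpha(X) = -\frac{1}{\alpha}\int_{-\infty}^{-\mathrm{VaR}_\alpha(X)} x f(x)\,dx.$$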
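The nonparametric (historical) versions of these two measures can be illustrated with a minimal sketch. It assumes equally weighted observations and a tail made of the ⌊nα⌋ smallest returns (at least one); the function name `historical_var_es` and this tail-size rule are assumptions of the illustration rather than part of the text.

```python
import math


def historical_var_es(returns, alpha):
    """Historical VaR and ES at level alpha, both reported as positive losses.

    Assumption of this sketch: the alpha-tail consists of the
    floor(n * alpha) smallest returns (at least one), equally weighted.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    ordered = sorted(returns)
    k = max(1, math.floor(len(ordered) * alpha))
    tail = ordered[:k]
    var = -tail[-1]          # negative of the empirical alpha-quantile
    es = -sum(tail) / k      # negative of the average return in the tail
    return var, es


sample = [-0.05, -0.03, -0.01, 0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
var_20, es_20 = historical_var_es(sample, alpha=0.2)
```

With only two observations in the tail of this toy sample, the ES estimate is driven entirely by their values, which is the data sensitivity discussed in the Introduction.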

Entropy
Entropy is a measure of uncertainty, which is in some ways similar to volatility. Various definitions exist (e.g., Shannon entropy, Tsallis entropy, and Kullback cross-entropy) based on the informational content of a discrete or continuous random variable (see Zhou et al. [53] for a comprehensive review of entropy measures used in finance). Shannon information entropy is the most commonly used definition of entropy; it quantifies the expected value of the information contained in a discrete distribution.

Definition 1 (Shannon information entropy). For X, a discrete random variable with probability distribution
$$X : \begin{pmatrix} x_1 & \cdots & x_n \\ p_1 & \cdots & p_n \end{pmatrix},$$
with $p_i = P(X = x_i)$, $0 \le p_i \le 1$, and $\sum_i p_i = 1$, the Shannon information entropy is defined as
$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i.$$
For a discrete distribution, the entropy reaches its maximum value of $H(X) = \log_2 n$ for the uniform distribution, when all the $p_i$ values are the same (i.e., when there is a high level of uncertainty). Similarly, the entropy will reach its minimum value of 0 for a distribution with zero uncertainty (i.e., one of the probabilities $p_i$ is 1 and the rest are all 0). For X, a continuous random variable with probability density function f(x), the differential entropy is given by
$$H(f) = -\int f(x) \log_2 f(x)\,dx.$$
However, unlike Shannon entropy, differential entropy does not possess certain desirable properties, such as invariance to linear transformations and non-negativity (Lorentz [54] and Pele [55]). A measure of entropy similar to Shannon entropy can be defined via a transformation called quantization, as defined below [54].
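The two extreme cases of Definition 1 can be checked numerically. The sketch below is only an illustration; the function name `shannon_entropy` is hypothetical, and terms with zero probability are skipped (the usual convention 0 · log 0 = 0).

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete probability vector."""
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must be non-negative and sum to 1")
    # Zero-probability outcomes contribute nothing (0 * log 0 := 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)


h_uniform = shannon_entropy([1 / 8] * 8)              # maximum: log2(8)
h_degenerate = shannon_entropy([1.0, 0.0, 0.0, 0.0])  # minimum: 0
```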

For f : I = [a, b] → R, which is essentially bounded, the sampled function is obtained by mean sampling, i.e., by averaging f over each sampling interval.
Definition 3 (quantization). The quantization of a function creates a simple function that approximates the original one. Given q > 0, a quantum, the following function defines a quantization of f:
$$Q_q f(x) = \left(i + \tfrac{1}{2}\right) q \quad \text{if } f(x) \in [iq, (i+1)q).$$
Definition 4 (entropy of a function at quantization level q). Let f be a measurable and essentially bounded real valued function defined on [a, b] and let q > 0. Let $I_i = [iq, (i+1)q)$ and $B_i = f^{-1}(I_i)$. Then, the entropy of f at quantization level q is given by
$$H_q(f) = -\sum_i \mu(B_i) \log_2 \mu(B_i),$$
where µ denotes the Lebesgue measure.
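To make quantization concrete, the following sketch approximates the entropy of a function on [0, 1] at quantization level q. It reads the entropy in Definition 4 as the Shannon entropy of the measures of the level sets $B_i$, and approximates those measures by relative frequencies on a grid of interval mid-points (point sampling). The grid, the restriction to [0, 1], and the name `quantized_entropy` are assumptions of this illustration.

```python
import math
from collections import Counter


def quantized_entropy(func, q, n=10_000):
    """Approximate the entropy of func on [0, 1] at quantization level q.

    Each sample func(x) is mapped to the index i of the interval
    [i*q, (i+1)*q) that contains it; relative frequencies of the indices
    approximate the Lebesgue measures of the level sets B_i.
    """
    levels = Counter(math.floor(func((i + 0.5) / n) / q) for i in range(n))
    return -sum((c / n) * math.log2(c / n) for c in levels.values())


h_identity = quantized_entropy(lambda x: x, q=0.2)    # five equally likely levels
h_constant = quantized_entropy(lambda x: 0.5, q=0.2)  # a single level
```

For the identity function and q = 0.2 the five quantization levels are equally likely, so the value is close to log2(5); a constant function occupies a single level and has zero entropy.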
Lorentz's theorem given below calculates the entropy of a continuous function on a compact interval. Eventually, it helps define the entropy of a probability distribution function for a continuous random variable on a compact interval, regardless of the sampling and quantization.
Theorem (Lorentz [54]). Let f be continuous for point sampling, or measurable and essentially bounded for mean sampling. The sampling spacing is 1/n. Let $S_n(f)$ be the corresponding sampling; fix q > 0 and let $Q_q S_n$ be the quantization of the samples with resolution q as in Definition 3. The number of occurrences of $(i + 1/2)q$ in $Q_q S_n$ is $c_n(i) = \mathrm{card}\{(i + 1/2)q \in Q_q S_n\}$ and the relative probability of the occurrence of the value i is denoted by $p_n(i) = c_n(i)/n$. Then the entropy of the quantized samples converges to the entropy of f at quantization level q, i.e., $\lim_{n \to \infty} -\sum_i p_n(i) \log_2 p_n(i) = H_q(f)$.
In the following we assume that we are dealing with a continuous random variable X whose support is the set A, with the distribution function F : R → [0, 1], F(x) = P(X < x). If the distribution function is absolutely continuous, then there exists a non-negative, Lebesgue integrable function f which is a probability density function (PDF), i.e.,
$$F(x) = \int_{-\infty}^{x} f(t)\,dt.$$
Assuming that the probability density function f is essentially bounded, we can define the entropy of the probability density function $H_q(f)$ at the quantization level q > 0. Of note is that in general, for a probability density function defined on a set A not necessarily of finite measure, it is possible to consider this function's restriction to a compact interval [a, b]; this restriction satisfies the conditions of Lorentz's theorem, meaning the entropy can be defined.
The above framework can be applied to estimation of the entropy of a probability density function of a continuous random variable X.
Next, we present the estimation algorithm of the entropy of a probability density function. Let $X_0, \ldots, X_{n-1}$ be a sample of an i.i.d. variable with probability density function f. Algorithm 1 (see below) estimates the entropy of a probability density function (see Pele et al. [37]).
Algorithm 1. Estimation of the entropy of a probability density function.
• Sample from the probability density function using the sampled function $S_n(\hat{f}_n)(i) = \hat{f}_n(X_i)$ for i = 0, …, n − 1;
• Compute the probabilities $p_n(i) = c_n(i)/n$;
• Estimate the entropy of the probability density function, i.e., $H_q(\hat{f}_n) = -\sum_i p_n(i) \log_2 p_n(i)$.
A similar approach can be found in Miśkiewicz [56], where two possible algorithms for estimating entropy of a continuous probability density function are discussed.
The entropy of the probability density function reaches its maximum value for the uniform distribution. A dimensionless measure of uncertainty, normalized entropy, can be defined as the ratio between the entropy of the probability density function and the entropy of the uniform distribution, i.e., In the following we refer to the entropy of the probability density function as the normalized entropy of the probability density function, i.e.,

Tail Entropy
Similarly to the definition of entropy, we can define tail entropy as the entropy computed using observations in the tail only. Shannon tail entropy is defined below.
Definition 6 (Shannon tail entropy). For X, a discrete random variable with probability distribution
$$X : \begin{pmatrix} x_1 & \cdots & x_n \\ p_1 & \cdots & p_n \end{pmatrix},$$
with $p_i = P(X = x_i)$, $0 \le p_i \le 1$, and $\sum_i p_i = 1$, and α-level value-at-risk $\mathrm{VaR}_\alpha(X)$ for 0 < α < 1, the Shannon tail information entropy at level α is defined by
$$TE_\alpha(X) = -\sum_{x_i \le -\mathrm{VaR}_\alpha(X)} p_i^* \log_2 p_i^*. \quad (15)$$
The tail-adjusted probabilities are denoted by $p_i^* = p_i/\alpha$, and this normalization is required to make sure that the probabilities in the tail add up to 1. For a discrete distribution, the TE will reach its maximum value for a distribution which is uniform in the α-tail, with all tail observations having the same probability. Similarly, for a given α, the tail entropy will reach its minimum for a distribution with zero uncertainty in the tail (one observation in the tail has a probability of $p_i = \alpha$ and the rest of the tail observations have a probability of 0). For X, the discrete random variable given above, we can compute the expected shortfall based on the variable's probability distribution at level α; this is denoted by ESPD and given by
$$ESPD_\alpha(X) = -\sum_{x_i \le -\mathrm{VaR}_\alpha(X)} p_i^* x_i. \quad (16)$$
The similarity between Formulas (15) and (16) is obvious. There is a minus sign in Formula (15) to make sure the tail entropy is a positive measure. More importantly, while the entropy is a weighted sum of log probabilities, the ES is the weighted sum of observations in the tail. To visualize the difference between the two formulae, in Figure 1 we compare the left tail of a histogram of a variable distributed N(0,1) (normalized with the tail probability and using bins of width 1) with the left tail of a histogram built using the entropy measure. In Formula (15) we can denote $\log_2 p_i^*$ by $x_i$, giving $p_i^* = 2^{x_i}$, and this probability is presented in Figure 1 in the entropy histogram.
It can be seen that the probabilities calculated using entropy are higher (except for when using the value −1) than the probabilities in the normal histogram.
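The contrast between the two tail quantities (a weighted sum of log probabilities versus a weighted sum of tail observations) can be seen in a small discrete example. The sketch below assumes that the smallest outcomes carry a total probability of exactly α, so that the tail-adjusted probabilities $p_i/\alpha$ add up to 1; the function name `tail_entropy_and_mean` is hypothetical. It returns the probability-weighted tail mean as is; under the convention that ES is a positive loss, the corresponding ES is the negative of that mean.

```python
import math


def tail_entropy_and_mean(values, probs, alpha):
    """Shannon tail entropy and tail-weighted mean of a discrete distribution.

    Assumption of this sketch: the outcomes with the smallest values carry
    total probability alpha (up to rounding), so p_i / alpha sums to 1.
    """
    tail, cumulative = [], 0.0
    for x, p in sorted(zip(values, probs)):
        if cumulative + p > alpha + 1e-12:
            break
        tail.append((x, p / alpha))   # tail-adjusted probability p_i*
        cumulative += p
    if abs(cumulative - alpha) > 1e-9:
        raise ValueError("tail probabilities do not add up to alpha")
    entropy = -sum(p * math.log2(p) for _, p in tail if p > 0)
    mean = sum(x * p for x, p in tail)
    return entropy, mean


values = [-4, -3, -2, -1, 0, 1]
probs = [0.025, 0.025, 0.025, 0.025, 0.5, 0.4]
te, tail_mean = tail_entropy_and_mean(values, probs, alpha=0.1)
```

In this example the four tail outcomes are equally likely, so the tail entropy attains its maximum log2(4) = 2; moving the tail outcomes further out changes the tail mean but leaves the tail entropy unchanged.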
For a continuous random variable with density function f(x), the differential tail entropy is given by
$$H_\alpha(f) = -\int_{-\infty}^{-\mathrm{VaR}_\alpha(X)} \frac{f(x)}{\alpha} \log_2 \frac{f(x)}{\alpha}\,dx.$$
Similarly to differential entropy, tail differential entropy does not possess the properties of invariance to linear transformations and non-negativity. Using quantization, a measure of entropy similar to Shannon tail entropy can be defined as follows.
Definition 7 (tail entropy of a probability density function at quantization level q). For the probability 1]. Then, the tail entropy of f at level α with quantum q (using the notation µ* for the Lebesgue measure divided by α and [.] for the integer part of a number) is Of note is that as in Definition 5, for a probability density function defined on set A which is not necessarily of finite measure, we can consider its restriction on a compact interval, i.e., The tail entropy of the returns' distribution can be interpreted as the expected shortfall of the returns, if in the tail of the distribution the probability of losses can be expressed as an exponential function of the size of the losses. The tail entropy is a measure of uncertainty in the tail, with low values of tail entropy being associated with a lower risk in the tail.
This framework can be used to estimate the tail entropy of a distribution function of a continuous random variable X, defined on the support set of X, with values on [0, 1]. The distribution function can be estimated using the histogram estimator of a probability density or kernel density estimation methods as described in the previous section. We proceed by presenting an estimation algorithm of the tail entropy of a distribution function.
Let X 0 , . . . ., X n−1 be a sample of an i.i.d. variable with probability density function f. The Algorithm 2 (see below) estimates the tail entropy of a distribution function.
Sample from the probability density function using the sampled function Define a quantum q > 0; then The tail entropy of a distribution function reaches its maximum value for a distribution which has uniform distribution in the α-tail.

Property (maximum value of the tail entropy of a probability density function):
The α-tail entropy of a distribution function F : [0, 1] → [0, 1] reaches its maximum for a distribution with F(x) = x for x ≤ α, and this maximum value is log 2 m.
Proof. This is similar to the proof of the maximum Shannon entropy. Indeed, when all probabilities are equal for the observations in the α-tail (= 1/m), the tail entropy can be written as:
$$-\sum_{i=1}^{m} \frac{1}{m} \log_2 \frac{1}{m} = \log_2 m.$$
As such, a dimensionless measure of uncertainty, the normalized tail entropy, can be defined as the ratio between the tail entropy of the probability density function and its maximum value, i.e., $H_{\alpha,q}(f)/\log_2 m$. We refer below to the tail entropy of the probability density function as the normalized tail entropy of the probability density function.
In practical applications, the estimated tail entropy may be severely biased for small samples (Liu et al. [57]). In order to correct for small sample bias we propose a bootstrapping method, following DeDeo et al. [58]. Thus, we generate B independent samples of volume k from a multinomial distribution with probabilities $\hat{p} = (p_n^*(0), \ldots, p_n^*(m))$, and for each sample we estimate the normalized tail entropy; the unbiased estimator of the normalized tail entropy is then built from these B bootstrap estimates.
The estimated entropy can be influenced by the quantum value q > 0; as there is no canonical level of quantization, all results appear to depend on an arbitrarily chosen quantum. However, the Lorentz theorem [54] shows that the entropy estimator is consistent regardless of the choice of q; in practical applications, such as stock market data, we have used q = 0.2 (see Section 3). For q = 0.2, the number of bins used to compute the tail entropy is 1/q = 5; a smaller value of the quantum q will increase the number of bins and also the likelihood of having zero probability bins. For example, when the window used for estimation has 1000 observations (roughly 4 years of daily observations) and α = 5%, there are 50 observations in the left tail of the distribution; by decreasing the quantum level and increasing the number of bins, there is a high chance of having bins with zero probability where the term p log p is undefined.
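A bootstrap correction of this kind can be sketched as follows. The exact form of the corrected estimator is not recoverable from the text here, so the sketch uses the standard bootstrap bias correction (twice the plug-in estimate minus the average of the bootstrap replicates); this choice, the function names, and the default number of replicates are all assumptions of the illustration.

```python
import math
import random
from collections import Counter


def normalized_entropy(probs):
    """Shannon entropy of probs divided by its maximum, log2(number of bins)."""
    if len(probs) < 2:
        raise ValueError("at least two bins are required")
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return h / math.log2(len(probs))


def bootstrap_corrected_entropy(probs, k, n_boot=500, seed=0):
    """Bias-corrected normalized entropy via a multinomial bootstrap.

    Assumption: standard bootstrap bias correction,
    corrected = 2 * plug_in - mean(bootstrap replicates).
    """
    rng = random.Random(seed)
    bins = range(len(probs))
    plug_in = normalized_entropy(probs)
    replicates = []
    for _ in range(n_boot):
        # One multinomial sample of volume k, stored as bin counts.
        counts = Counter(rng.choices(bins, weights=probs, k=k))
        replicates.append(normalized_entropy([counts[i] / k for i in bins]))
    return 2.0 * plug_in - sum(replicates) / n_boot


corrected = bootstrap_corrected_entropy([0.2] * 5, k=50)
```

Because the plug-in entropy of a small sample tends to be too low, the correction shifts the estimate upward; under this particular correction the result can exceed 1 slightly and may need truncation to [0, 1].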
Unlike expected shortfall, which is not well-defined when the expectation fails to exist (for instance for the Pareto family when the parameter a ≤ 1), tail entropy does exist for the Pareto distribution for any tail parameter value a. In general, under the hypothesis of the Lorentz theorem [54], tail entropy exists for any probability density function with bounded support. Even if the probability density function is not bounded, we can consider its restriction to a compact interval in order to fulfil the conditions of the Lorentz theorem. Admittedly, tail entropy is sensitive to the choice of the quantum level, but in the next section we show how it can be transformed (by normalizing it and applying a linear adjustment) into an expected-shortfall-type measure which is less sensitive to the choice of quantum.

Tail Entropy Expected Shortfall
Next, we turn our attention to the link between tail entropy and measures of risk and uncertainty, seeking to propose a tail-entropy-enhanced ES estimator. First, we perform a Monte Carlo experiment, estimating and comparing tail entropy and historical expected shortfall for several simulated Pareto-like distributions; we report our results for different tail probabilities. Here we use the random variable Y = −X ∼ Pareto(a). Figure 2 presents the probability density function of the random variable Y for different values of the a parameter. The a parameter (the tail index) is a measure of tail probability: higher values of a correspond to lower probability in the tail.
In order to assess the relationship between the tail entropy and the historical expected shortfall (ESH), we simulate random variables $Y_k = -X_k \sim \mathrm{Pareto}(a_k)$, with $a_k = 2 + 0.1k$, k = 1, …, 180. Figure 3 presents the relationship between TE and tail index a using scatterplots for different levels of α, i.e., 1% and 2.5%. As the a parameter of the simulated distribution decreases, the tail entropy of the distribution decreases, too; as expected, low tail entropy values are associated with heavy-tailed distributions. Figure 4 presents the relationship between TE and historical ES, as functions of the a parameter of the simulated distribution, for 1% and 2.5% levels of α.
Additionally, we regress the historical ES estimates of the simulated returns above on tail entropy; the results are reported in Table 1. We estimate the historical ES as the average of the simulated returns in the α-tail. It can be seen that there is a strong linear negative relationship between the tail entropy and ESH, with higher values of tail entropy corresponding to lower values of expected shortfall. The validity of this regression approach depends on the tail behavior of the underlying distribution; without this explicit assumption, one would need regularly varying tails and a particular tail behavior in order to make the regression approach sound. Indeed, for some distributions the relationship between historical ES and tail entropy might not be exactly linear, but we show below how assuming a linear relationship helps us define a new measure of expected shortfall.
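The simulation design can be sketched in a few lines. The sketch assumes a Pareto distribution with minimum value 1 (the scale is not stated in the text), draws the losses by inverse-transform sampling, and computes the historical ES as the average of the ⌊nα⌋ largest losses; the sample size of 1000 per tail index, the seed, and the function names are assumptions of this illustration.

```python
import math
import random


def pareto_losses(a, n, rng):
    """Draw n losses Y ~ Pareto(a) with minimum 1 by inverse-transform sampling."""
    # 1 - rng.random() lies in (0, 1], so the power is always well defined.
    return [(1.0 - rng.random()) ** (-1.0 / a) for _ in range(n)]


def historical_es_of_losses(losses, alpha):
    """Average of the floor(n * alpha) largest losses (at least one)."""
    k = max(1, math.floor(len(losses) * alpha))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


rng = random.Random(42)
tail_indices = [2 + 0.1 * k for k in range(1, 181)]   # a_k = 2 + 0.1 k
es_by_index = [
    historical_es_of_losses(pareto_losses(a, 1000, rng), alpha=0.025)
    for a in tail_indices
]
```

The historical ES is far larger for the smallest tail index than for the largest, which is the heavy-tail effect against which the tail entropy estimates are compared.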
A multiple regression with several tail entropies of varying tail levels was also estimated in order to explicitly assess the relationship between historical ES and TE for a Pareto-like simulated distribution, as in the above examples.
The results of the regression as calculated using Formula (23) are reported in Table 2. By including varying tail levels for entropy, the performance of the model is significantly improved, with the adjusted $R^2$ increasing for α = 1% from 51% (see Table 1, left panel) to 55%, and for α = 2.5% from 48% (see Table 1, right panel) to 60%.
A second simulation using α-stable distributions can help understand this behavior. We chose α-stable distributions because they have several interesting properties: they allow for heavy tails, and any linear combination of independent stable variables follows a stable distribution up to scale and location parameters (Nolan [59]); additionally, the Gaussian distribution is a particular case of a stable distribution. A random variable X follows an α-stable distribution S(a, b, c, d; 1) if its characteristic function is
$$E\left[e^{itX}\right] = \begin{cases} \exp\left(-c^a |t|^a \left[1 - i b \tan\left(\frac{\pi a}{2}\right) \mathrm{sign}(t)\right] + i d t\right), & a \ne 1, \\ \exp\left(-c |t| \left[1 + i b \frac{2}{\pi} \mathrm{sign}(t) \ln |t|\right] + i d t\right), & a = 1. \end{cases}$$
In the above expression a ∈ (0, 2] is the tail index (we obtain a = 2 for a normal distribution and lower values correspond to heavier tails), b ∈ [−1, 1] is the skewness parameter, c ∈ (0, ∞) is the scale parameter, and d ∈ R is the location parameter. We simulate 1000 observations from an α-stable distribution $S(a_k, b, c, d; 1)$ for $a_k = 1.1 + 0.01k$, k = 1, …, 90, and study the relationship between tail index and tail entropy.
The tail entropy reaches its maximum value for a distribution that has uniform distribution in the tail, and as the a parameter decreases, the tail entropy of the stable distribution decreases, too. As expected, low tail entropy values are associated with heavier-tailed distributions. Figure 5 presents the relationship between TE and tail index, while Table 3 presents the regression results between TE and tail index.
Based on the results above (and taking into consideration that the tail entropy is between 0 and 1), we perform a linear adjustment on the tail entropy to obtain the tail entropy expected shortfall, i.e.,
$$ESTE_{\alpha,q}(f) = -\left[b_0 + \frac{b_m - b_0}{2} H_{\alpha,q}(f)\right], \quad (25)$$
where $b_0$ is the mid-point of the first bin $B_0$ and $b_m$ is the mid-point of the last bin $B_m$ in the α-tail of the distribution.
The justification for Formula (25) lies in the following points.
• If the entropy reaches its maximum ($H_{\alpha,q}(f) = 1$), which corresponds to the highest uncertainty level, then we have $ESTE_{\alpha,q}(f) = -(b_0 + b_m)/2$, i.e., the tail entropy expected shortfall is the average of losses below the value-at-risk.

• If the entropy reaches its minimum ($H_{\alpha,q}(f) = 0$), which corresponds to the lowest uncertainty level, then we have $ESTE_{\alpha,q}(f) = -b_0$, i.e., the tail entropy expected shortfall is the minimum value of the distribution.
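The two limiting cases above pin down a linear adjustment in the normalized tail entropy. A minimal sketch follows, assuming that the value at maximum entropy is the mid-point $-(b_0 + b_m)/2$ of the first and last tail bins; that endpoint, the function name `tail_entropy_es`, and the numerical bin mid-points are assumptions of this illustration.

```python
def tail_entropy_es(h, b0, bm):
    """Linear interpolation between the two limiting cases.

    h  : normalized tail entropy in [0, 1]
    b0 : mid-point of the first (most extreme) tail bin
    bm : mid-point of the last tail bin
    Returns -b0 for h = 0 and -(b0 + bm) / 2 for h = 1 (assumed endpoint).
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("normalized tail entropy must lie in [0, 1]")
    return -(b0 + 0.5 * (bm - b0) * h)


es_min_entropy = tail_entropy_es(0.0, b0=-0.08, bm=-0.02)   # 0.08
es_max_entropy = tail_entropy_es(1.0, b0=-0.08, bm=-0.02)   # 0.05
```

Because the measure depends on the tail observations only through the bin mid-points and the entropy, it moves less with individual extreme returns than the historical ES does.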

Tail Entropy Expected Shortfall as a Spectral Risk Measure
According to Guégan et al. [8], a spectral risk measure can be defined as the function ρ : L → R, where L is a functional space. In Formula (26), φ is a non-negative, non-increasing, right-continuous, integrable function defined on [0, 1] such that $\int_0^1 \varphi(p)\,dp = 1$. Following Formula (25), the tail entropy expected shortfall can be written as a linear function of the tail entropy (even though for some distributions this is only an approximation), with $\beta = -\frac{1}{(4 \ln 2) \log_2 m} = -\frac{1}{4 \ln m}$ and $\varphi(p) = -(4 \ln 2)\, p \log_2 p = -4 p \ln p$. Thus, according to Formula (27), the tail entropy expected shortfall is a spectral risk measure and satisfies the conditions of coherence (see Artzner et al. [1,2]), any spectral risk measure being law-invariant and comonotonic additive (see [8]).
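The normalization of the weighting function can be verified numerically. The sketch below checks, with a simple mid-point rule (an assumption of this illustration, as are the helper names), that φ(p) = −4p ln p integrates to 1 over [0, 1] and that the two stated forms of φ agree.

```python
import math


def phi(p):
    """Weighting function phi(p) = -4 p ln p, extended by 0 at p = 0."""
    return -4.0 * p * math.log(p) if p > 0 else 0.0


def midpoint_integral(func, a, b, n=100_000):
    """Mid-point rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))


total = midpoint_integral(phi, 0.0, 1.0)
# The same function written with a base-2 logarithm, as in the text.
phi_base2 = -(4.0 * math.log(2.0)) * 0.3 * math.log2(0.3)
```

The factor 4 is exactly what this normalization requires, since the integral of −p ln p over [0, 1] equals 1/4.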
The tail entropy expected shortfall can be interpreted in monetary units, being a linear combination of the expected shortfall and the dimensionless normalized entropy, which takes values in [0, 1], as can be seen from Formula (27).

Backtesting ES
In order to forecast daily ES, non-parametric and parametric models are used based on a rolling window approach. Thus, we compare the ES forecasting ability of the following six models.

2. Tail entropy ES forecasts.
In the above formulae, every risk measure is estimated using a rolling window of length w, with t ∈ {k + 1, …, k + w}, k ∈ {0, …, T − w + 1}, and T the number of daily log-returns. To see the ES forecasting performance of the above models, we run several backtests to compare the ES estimates. These are the unconditional exception frequency (UC) test and the conditional independence (CC) test of Du and Escanciano [60], and the $Z_2$ exception frequency and magnitude test of Acerbi and Szekely [61]. These are described in Appendices A and B.
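The rolling-window scheme can be sketched generically: any estimator that maps a window of returns to a risk figure is applied to each block of w consecutive observations. The helper names and the simple historical ES used as the plug-in estimator are assumptions of this illustration, not the six models compared in the text.

```python
import math


def historical_es(window_returns, alpha):
    """Negative average of the floor(n * alpha) smallest returns (at least one)."""
    ordered = sorted(window_returns)
    k = max(1, math.floor(len(ordered) * alpha))
    return -sum(ordered[:k]) / k


def rolling_estimates(returns, window, estimator):
    """Apply estimator to every block of `window` consecutive returns.

    Element k of the result uses returns[k : k + window]; in a backtest it
    would be compared with the first return observed after that window.
    """
    if window <= 0 or window > len(returns):
        raise ValueError("window must be between 1 and len(returns)")
    return [
        estimator(returns[k:k + window])
        for k in range(len(returns) - window + 1)
    ]


returns = [0.01, -0.02, 0.03, -0.04, 0.05, -0.06]
estimates = rolling_estimates(returns, 4, lambda r: historical_es(r, 0.25))
```

Note that the last full window has no subsequent observation in the sample, so only the earlier estimates can be backtested.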

Empirical Analysis
In order to illustrate the application of tail entropy in financial risk management, we consider S&P500 daily log-returns (sourced from Bloomberg). The time period considered is 1 January 1980 to 12 December 2018 (9835 observations). The estimators of expected shortfall described in Section 2 are computed on a rolling basis using a length of w = 1000 days for the estimation windows; these forecasts are later subjected to ES backtests. Tail entropy is estimated using a quantum q = 0.2 (see the theorem in Section 2). Figure 6 depicts the dynamics of tail-entropy-based ES versus historical ES estimated at 1% and 2.5% significance levels. It can be noted that the tail-entropy-based ES gives higher estimates of risk compared to the historical ES for both significance levels, as this model assigns a higher probability to large losses when compared to the historical ES model. However, it is not straightforward to comment on the correctness of these risk estimates, as the level of true risk is not observable, meaning no direct comparison with the true risk can be made. Statistical backtests, performed in the next section, can be used to verify the correctness of VaR and ES forecasts.


Backtesting ES
Backtesting of the ES forecasts based on the models given in Section 2.3 was performed using a window length of 1000 trading days and for α = 1% and 2.5%. Table 4 presents the percentage of times each model passed the given backtest for each significance level α. Based on these statistics, the t-GARCH(1,1) ES forecasts perform best under the unconditional ES test for both levels of α. For the conditional coverage ES backtest, the t-GARCH(1,1) model performs best at the 2.5% level, while at the 1% level the rejection rate of the tail entropy ES is comparable with that of the t-GARCH(1,1) model. According to the Z2 test, the rejection rates are lowest for the tail entropy ES at α = 2.5%, where they stand at around 10%. Based on these results, the t-GARCH(1,1) and tail entropy ES show similar performance and outperform the other models considered. Table 5 presents the test statistics of each backtest over the entire testing period for α levels of 1% and 2.5%. The tail entropy ES, like the other ES models considered, fails both the unconditional and conditional backtests at the 5% level, but it passes the Z2 test at the 5% level, which the t-GARCH(1,1) ES forecasts fail. To give an insight into the dynamic behavior of the Z2 backtest, Figure 7 depicts its time-varying test statistics (using a rolling window of length 1000) for ES estimated at significance levels α = 1% and 2.5%. As can be seen, the expected shortfall estimated with the Student's t-GARCH(1,1) model underperformed, for both α levels, during the financial crisis of 2008.
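For readers who want to see how a rolling Z2 statistic of this kind might be computed, the sketch below uses the form of the statistic commonly attributed to Acerbi and Szekely, Z2 = Σ_t X_t I_t / (T α ES_t) + 1, where I_t indicates a VaR exceedance. The formula is not restated in this section, so it is an assumption here, as are the function names and the sign convention (VaR and ES as positive losses, returns X_t negative for losses). Critical values and the rejection decision are deliberately left out.

```python
import numpy as np


def z2_statistic(returns, var_forecasts, es_forecasts, alpha):
    """Z2 statistic over one test period.

    Assumed conventions: VaR and ES forecasts are positive loss
    numbers, and an exceedance occurs when the return falls below
    -VaR. Values near zero are consistent with correct ES forecasts;
    negative values point to underestimated risk.
    """
    x = np.asarray(returns, dtype=float)
    var = np.asarray(var_forecasts, dtype=float)
    es = np.asarray(es_forecasts, dtype=float)
    exceed = x < -var  # indicator I_t
    return np.sum(x * exceed / es) / (len(x) * alpha) + 1.0


def rolling_z2(returns, var_forecasts, es_forecasts, alpha, window=1000):
    """Z2 recomputed on each rolling test window of the given length."""
    x = np.asarray(returns, dtype=float)
    var = np.asarray(var_forecasts, dtype=float)
    es = np.asarray(es_forecasts, dtype=float)
    n_windows = len(x) - window + 1
    stats = np.empty(n_windows)
    for i in range(n_windows):
        sl = slice(i, i + window)
        stats[i] = z2_statistic(x[sl], var[sl], es[sl], alpha)
    return stats


if __name__ == "__main__":
    # Toy check: 5 exceedances in 100 days at alpha = 5%, each loss
    # equal to the ES forecast, so the statistic is zero.
    x = np.zeros(100)
    x[::20] = -2.0
    print(z2_statistic(x, np.ones(100), np.full(100, 2.0), 0.05))
```

In the toy check, doubling the size of the exceedance losses while leaving the ES forecast unchanged drives the statistic to -1, which illustrates why strongly negative values are read as evidence of underestimated tail risk.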
The illustration of tail-entropy-based ES in Figure 6 indicates that, at similar backtesting rejection rates, tail-entropy-based ES seems to lead to overly high risk quantifications; in addition, the time to recover from a bad shock (like Black Monday on 19 October 1987) is much longer than for historical expected shortfall (which itself usually leads to long recovery times). These two observations seem to indicate that the proposed entropy-based version of expected shortfall tends to react fiercely in the presence of outlying observations. This may be a sign that tail-entropy-based ES is a more conservative risk measure than traditional ones due to the way it is defined, as it takes into account the tail distribution, which has a higher inertia and a longer recovery time from a negative shock.

Conclusions
In this paper we propose a nonparametric estimator of ES based on tail entropy (TE), as well as an extension of this model based on kernel smoothing, and compare it with the classical estimates of ES, i.e., historical ES, Gaussian ES, Student's t ES, Gaussian GARCH(1,1) ES and Student's t-GARCH(1,1) ES. The main advantage of the proposed measure is its low dependency on the actual values of the observations in the tail of the distribution, which makes it a more stable measure of tail risk.
We illustrate an application of tail entropy in financial risk management based on S&P 500 daily log-returns between 1980 and 2018 (9835 observations). Comparing backtest rejection rates based on rolling windows, it can be concluded that the performance of tail entropy ES is comparable with the performance of the t-GARCH(1,1) model. However, when we backtest over the entire sample period, tail-entropy-based ES is the only model which passes the Z2 backtest.
Further research is needed to fully assess the efficiency of the tail entropy ES estimator. This new measure may be beneficial for risk measurement as it is fully nonparametric and has a reduced sensitivity to the actual observations in the tail. One possible extension of this approach would be to assess the predictive performance of tail entropy ES in a non-standard environment, for example by assuming a fat-tailed distribution of the returns such as the generalized error distribution (see Cerqueti et al., 2019 [62]). In addition, our approach can be applied to other fields such as operational risk, where distributions are usually fat-tailed and highly skewed.
Our research can be easily reproduced using the Python code uploaded to www.github.com.