Forecasting Value-at-Risk Using Deep Neural Network Quantile Regression

In this article, we use a deep quantile estimator, based on neural networks and their universal approximation property, to examine a non-linear association between the conditional quantiles of a dependent variable and predictors. This methodology is versatile and allows both the use of different penalty functions and high-dimensional covariates. We present a Monte Carlo exercise where we examine the finite sample properties of the deep quantile estimator and show that it delivers good finite sample performance. We use the deep quantile estimator to forecast value-at-risk and find significant gains over linear quantile regression alternatives and other models, which are supported by various testing schemes. Further, we also consider an alternative architecture that allows the use of mixed frequency data in neural networks. This article also contributes to the interpretability of neural network output by comparing the commonly used Shapley Additive Explanation values with an alternative method based on partial derivatives.

network using the rectified linear unit (ReLU) as the activation function and show that it has superior statistical performance relative to other quantile regression methods. Chen et al. (2020) propose a unified non-linear framework, based on feed-forward neural networks, that allows the estimation of treatment effects, for which they establish consistency and asymptotic normality. Their framework includes the quantile estimator and allows for high-dimensional covariates. ML-based estimators for quantiles have been proposed in other fields as well; see, for example, Meinshausen (2006), where quantile random forests are introduced, and Zhang, Quan, and Srinivasan (2019), who propose a quantile neural network estimator.
In this article, we contribute to the expanding literature on the use of ML in finance and use a deep quantile estimator that can capture non-linear associations between asset returns and predictors to forecast VaR. Note that this estimator also allows for high-dimensional data. We further consider an alternative architecture that allows the use of mixed frequency data. We also contribute to the explainable ML literature by proposing the use of partial derivatives as a means of "peeking" inside the black box.
We first explore the small sample properties of the deep quantile estimator via Monte Carlo (MC) experiments, which show that the estimator delivers good finite sample performance. Then we examine the performance of the deep quantile estimator in the context of one of the most widely examined problems in finance: that of measuring, and subsequently forecasting, the risk of a portfolio adequately, via VaR modeling. VaR is a popular model that was first introduced in the late 1980s and has since become a standard tool for measuring market risk. It measures how much value a portfolio can lose within a given time period with some small probability $s$. VaR and quantiles are related in the following manner: let $r = (r_1, \ldots, r_T)'$ denote the returns of a portfolio; then the $s$th VaR is equivalent to computing the negative value of the $s$th quantile of $r$, $-q_s(r)$.
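To fix ideas, the following minimal sketch computes VaR as the negative empirical quantile of a return series; the simulated returns and numbers are illustrative only.

```python
import numpy as np

# Minimal sketch: VaR at level s is the negative of the s-th quantile of returns.
rng = np.random.default_rng(0)
r = rng.normal(loc=0.0005, scale=0.01, size=1000)   # hypothetical daily returns

s = 0.05
var_s = -np.quantile(r, s)                          # VaR_s = -q_s(r)
print(f"{s:.0%} one-day VaR: {var_s:.4f}")          # loss at least this large w.p. s
```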
In this article, we argue, following the non-parametric literature, that the linear relationship between VaR and predictors can be restrictive, and we use the deep quantile neural network estimator, which allows a non-linear association between covariates and VaR. This method appears particularly suitable for developing sound predictions of stock return losses in the United States over the sample period from September 1985 to August 2020, the importance of which has been brought to the forefront by the recent COVID-19 pandemic. Specifically, our aim is to forecast 10-day-ahead VaR produced from daily VaR forecasts. We use daily frequency returns in a fixed forecasting framework that is outlined below.
Under this forecasting framework, mixed frequency models become relevant benchmarks for the non-linear quantile estimator; see, for example, Ghysels, Plazzi, and Valkanov (2016). Hence, we include a linear MIxed DAta Sampling (MIDAS) model as a competitor, as well as a non-linear MIDAS model, which is an extension of the deep quantile estimator. Further, we consider 10-day compounded VaR forecasts, which exhibit similar patterns and which we relegate to the Supplementary Appendix.
We are not the first to use ML methods for VaR forecasting; see, for example, Du, Wang, and Xu (2019), who propose a recurrent neural network as a forecasting methodology for the VaR model and exhibit improved forecast performance relative to traditional methods. To the best of our knowledge, though, there has been no application that uses a neural network quantile estimator in finance for forecasting VaR. Note that in this article, we consider a set of neural networks that allows for mixed frequency estimation, following the current literature; see, for example, Xu et al. (2021); Borup, Rapach, and Schütte (2022); and Babii, Ghysels, and Striaukas (2022b).
Our empirical analysis shows that the deep quantile estimator outperforms the linear, MIDAS, and other non-parametric quantile models in forecasting VaR. We assess the forecasting accuracy between models based on two statistical tests. The first is the Diebold and Mariano (1995) (DM) test with the Harvey, Leybourne, and Newbold (1997) adjustment, and the second is the Giacomini and White (2006) (GW) test. Results from both tests suggest that the neural network estimator has higher accuracy in forecasting VaR. We use the linear quantile method as a benchmark to assess whether the deep quantile estimator delivers predictive gains. This measure illustrates gains of up to 98% relative to the linear benchmark for the deep quantile estimator, and up to 84% for the non-linear MIDAS model. Further, we use the quantile score test, which provides further evidence in favor of the neural network estimator.
We further examine whether the deep quantile estimator nests forecasts produced from the linear and other non-parametric models, using the encompassing test of Giacomini and Komunjer (2005). Overall, we find that forecasts from the deep quantile estimator encompass forecasts from competing models more times than vice versa. There are some cases where the test is inconclusive, suggesting that a forecast combination from a different pair of models would provide a better result, which is in line with the result of Bates and Granger (1969).
While ML methods show a great capacity at both approximating highly complicated non-linear functions and forecasting, they are routinely criticized for lacking interpretability and are considered a "black box," in the sense that they do not offer simple summaries of relationships in the data. Recently, though, there have been a number of studies that try to make ML output interpretable; see, for example, Athey and Imbens (2017); Wager and Athey (2018); Belloni, Chernozhukov, and Hansen (2014); and Joseph (2019). In this article, we also try to understand, in a semi-structural fashion, which variables impact the forecasting performance of the deep quantile estimator the most. To this end, we first use Shapley Additive Explanation (SHAP) values, as proposed by Lundberg and Lee (2017) and further developed by Joseph (2019), which have started to become a standard tool for interpretability in ML methods. Further, we use partial derivatives as a means of investigating the marginal contribution/influence of each variable on the output. We compare the partial derivatives and SHAP values over time, and our results can be summarized as follows. First, partial derivatives are overall more stable than SHAP values and are able to produce interpretable results at a fraction of the computational time of SHAP. Second, the partial derivatives of the deep quantile estimator fluctuate around the estimate of the conditional linear quantile and (i) exhibit time variation and (ii) can capture stressful events in the U.S. economy, for instance the COVID-19 pandemic and the 2008 financial crisis.
The remainder of the article is organized as follows. Section 1 introduces the deep quantile estimator. Section 2 contains the MC exercise. Section 3 presents our empirical application. Section 4 presents the semi-structural analysis. Conclusions are set out in Section 5. We relegate to the Supplementary Appendix the specifications of the competing models, empirical results from a 1-day ahead VaR forecasting exercise, 10-day compounded VaR forecasts, GW test, and results from the quantile score test and predictive gains.

Empirical Methodology
In this section, we start by summarizing the underlying theory of quantile regression as outlined by Koenker and Bassett (1978) and Koenker (2005) and argue that the assumption of a linear relationship between the conditional quantile of a dependent variable and the covariates can be restrictive. We illustrate how some fundamental results on the universal approximation property of neural networks can be used to approximate a non-linear relationship instead, and define the deep quantile estimator. We conclude with a discussion of how different penalization schemes can be used and, further, how hyper-parameters can be selected via time-series cross validation (CV).

Linear Quantile Regression
The standard goal in econometric analysis is to infer a relationship between a dependent variable and one or more covariates. We consider the following linear regression model:

$$y_t = x_t'\beta + u_t, \tag{1}$$

where $y_t$ is the dependent variable at time $t$, $\beta = (\beta_1, \ldots, \beta_p)'$ is a vector of unobserved slope parameters, $x_t = (x_{t1}, \ldots, x_{tp})'$ is a vector of known covariates, and $u_t$ is the random error of the regression, which satisfies $E(u_t \mid x_t) = 0$. Standard regression analysis tries to come up with an estimate of the conditional mean of $y_t$ given $x_t$ that minimizes the expected squared error loss:

$$E(y_t \mid x_t) = \arg\min_{g(x_t)} E\big[(y_t - g(x_t))^2\big]. \tag{2}$$

This can be restrictive, though, (i) when non-linearities and outliers exist and (ii) since, by construction, it provides just one aspect of the conditional distribution of $y_t$ given $x_t$. These potential limitations led to the development of quantile regression. In their seminal work, Koenker and Bassett (1978) generalize ordinary sample quantiles to the regression setting, giving more complete information on the conditional distribution of $y_t$ given $x_t$, for which we now provide a succinct description. The quantile regression model can be defined as

$$y_t = x_t'\beta(s) + u_t(s), \tag{3}$$

such that $y_t$ satisfies the quantile constraint $\Pr[y_t \le x_t'\beta(s) \mid x_t] = s$, where $\beta(s)$ are regression coefficients that depend on $s$. Quantile regression tries to come up with an estimate for the $s$th conditional quantile, $\widehat{Q}_y(s; x_t) := x_t'\widehat{\beta}(s)$, by minimizing the following function

$$\widehat{\beta}(s) = \arg\min_{\beta(s)} \sum_{t=1}^{T} \rho_s\big(u_t(s)\big), \tag{4}$$

where $\rho_s(\cdot)$ is the quantile loss function defined as

$$\rho_s\big(u_t(s)\big) = \begin{cases} s\, u_t(s), & \text{if } u_t(s) \ge 0, \\ (s-1)\, u_t(s), & \text{if } u_t(s) < 0, \end{cases}$$

and $u_t(s) = y_t - x_t'\beta(s)$. The quantile estimator in Equation (4) provides (i) much richer information on the whole conditional distribution of $y_t$ as a function of $x_t$, and (ii) more robust estimates in the presence of outliers and non-linearities, when compared with the ordinary least squares estimator.
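As a numerical illustration of Equation (4), the following minimal sketch implements the check loss $\rho_s(\cdot)$ and verifies on simulated data that the $s$th sample quantile minimizes the average check loss; the data and grid search are hypothetical.

```python
import numpy as np

def quantile_loss(u, s):
    """Check (pinball) loss rho_s(u): s*u if u >= 0, (s - 1)*u if u < 0."""
    return np.where(u >= 0, s * u, (s - 1) * u)

# Sanity check: the s-th sample quantile minimizes the average check loss.
rng = np.random.default_rng(1)
y = rng.standard_t(df=5, size=500)
grid = np.linspace(y.min(), y.max(), 2001)
c_hat = grid[np.argmin([quantile_loss(y - c, 0.10).mean() for c in grid])]
print(c_hat, np.quantile(y, 0.10))   # the two values should be close
```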
Notice that the linear association assumption, $Q_y(s \mid x_t) = x_t'\beta(s)$, can be generally restrictive. Instead, we consider the case of the following non-linear association,

$$Q_y(s \mid x_t) = h_s(x_t),$$

where $h_s(\cdot)$ is some unknown, (potentially highly) non-linear function. In this article, we use an estimation strategy that approximates $h_s(x_t)$ with neural networks, using their universal approximation property. Specifically, we assume that there exists a neural network with a function $G_s(x_t; \mathbf{w})$, to be defined below, that can approximate $h_s(x_t)$ well. Before we illustrate how this methodology is implemented, we provide a discussion of how neural networks can approximate $h_s(x_t)$.

Neural Networks
In this article, we limit our attention to feed-forward neural networks, to approximate h s ðx t Þ. This architecture consists of an input layer of covariates, the hidden layer(s) where non-linear transformations of the covariates occur, and the output layer that gives the final prediction. Each hidden layer has several interconnected neurons relating it to both the previous and next ones. Specifically, information flows from one layer to the other, via neurons only in one direction, and the connections correspond to weights. Optimizing a loss function w.r.t. these weights makes neural networks capable of learning.
Throughout our exposition, $L$ denotes the total number of hidden layers, a measure of the depth of a neural network, and $J^{(l)}$ denotes the total number of neurons at layer $l$, a measure of its width. We start by presenting a general definition of a deep (multi-layer) feed-forward neural network. Let $\sigma_l(\cdot)$, $l = 0, \ldots, L$, be the activation function used at the $l$th layer, which is applied element-wise and induces non-linearity. We use the ReLU activation function, $\sigma_l(\cdot) = \max(\cdot, 0)$, for $l = 1, \ldots, L-1$, applied element-wise, and a linear one for the output layer, $l = L$. We denote by $g^{(l)}$ the output of the $l$th layer, which is a vector of length equal to the number $J^{(l)}$ of neurons in that layer, such that $g^{(0)} = x_t$. Then, the overall structure of the network is equal to:

$$G_s(x_t; \mathbf{w}) = W^{(L)} g^{(L)}, \tag{5}$$

where

$$g^{(l)} = \sigma_l\big(W^{(l-1)} g^{(l-1)} + b^{(l)}\big), \quad l = 1, \ldots, L, \tag{6}$$

with weight matrices $W^{(l)}$ and bias vectors $b^{(l)}$. By the universal approximation property, for any $\epsilon > 0$ there exists a network of this form such that

$$\big| G_s(x_t; \mathbf{w}) - h_s(x_t) \big| \le \epsilon. \tag{7}$$

In this sense, the above $\epsilon$-approximation can be seen as a sieve-type non-parametric estimation bound, where $\epsilon$ can become arbitrarily small by increasing the complexity of $G_s(x_t; \mathbf{w})$.
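The recursion in Equations (5) and (6) can be sketched in a few lines; the toy dimensions below are illustrative and are not the architectures used in this article.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, Ws, bs):
    """g^(0) = x; g^(l) = sigma_l(W^(l-1) g^(l-1) + b^(l)) with ReLU hidden
    layers; the output layer is linear, as in Equations (5) and (6)."""
    g = x
    for W, b in zip(Ws[:-1], bs):
        g = relu(W @ g + b)
    return Ws[-1] @ g                     # linear output layer

# Toy network: p = 2 inputs, hidden widths 3 and 2, scalar output (illustrative).
rng = np.random.default_rng(2)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(2, 3)), rng.normal(size=(1, 2))]
bs = [rng.normal(size=3), rng.normal(size=2)]
print(forward(np.array([0.5, -1.0]), Ws, bs))
```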
The increase in complexity can occur either by letting $L \to \infty$, which stands for deep learning, or by letting $J^{(l)} \to \infty$. While asymptotically both ways deliver the same results (see, e.g., Farrell, Liang, and Misra (2021) and references therein), the approximation error has been shown to decline exponentially with $L$ (see, for example, Babii et al. (2020)) but only polynomially with $J^{(l)}$, providing some evidence for the prevalent use of deep learning. Notice that there also exists an alternative approximation theory for sparse deep learning; see, for example, the work of Schmidt-Hieber (2020). As an illustration, in the Supplementary Appendix, we depict a simple feed-forward neural network with two inputs, two hidden layers, a total of five neurons, and one output layer.

Non-linear Quantile Regression
We assume that the conditional quantile follows a non-linear relationship $Q_y(s \mid x_t) = h_s(x_t)$ and that there exists a function $G_s(x_t; \mathbf{w})$ that can $\epsilon$-approximate $h_s(x_t)$; see the bound in Equation (7). Using this assumption, we can formally define the conditional quantile function as the following approximation

$$Q_y(s \mid x_t) \approx G_s(x_t; \mathbf{w}),$$

where $G_s(x_t; \mathbf{w})$ is the unknown non-linear function we want to estimate in order to approximate $h_s(x_t)$. We obtain the deep neural network conditional quantile estimate from the solution of the following minimization problem:

$$\widehat{\mathbf{w}} = \arg\min_{\mathbf{w}} \frac{1}{T} \sum_{t=1}^{T} \rho_s\big(y_t - G_s(x_t; \mathbf{w})\big), \tag{8}$$

where $\mathbf{w} = \big(\mathrm{vec}(W^{(0)})', \ldots, \mathrm{vec}(W^{(L)})', b^{(1)\prime}, \ldots, b^{(L)\prime}\big)'$ contains all model parameters, and $G_s(x_t; \mathbf{w})$ denotes the overall non-linear mapping described in Equations (5) and (6). Notice that the choice of $G_s(x_t; \mathbf{w})$ will govern whether the model is parametric or non-parametric. If the number of neurons and layers is small, then the model is parametric; if that number becomes large, then the model becomes non-parametric, since the number of estimated parameters increases with the sample size, similar to sieve non-parametric approximations.
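A minimal training sketch of the minimization problem in Equation (8) follows, written with PyTorch; the architecture, learning rate, and synthetic data are placeholders rather than this article's tuned choices.

```python
import torch
import torch.nn as nn

def pinball_loss(y, y_hat, s):
    """Torch version of the check loss rho_s used in Equation (8)."""
    u = y - y_hat
    return torch.mean(torch.where(u >= 0, s * u, (s - 1) * u))

p, s = 3, 0.05
net = nn.Sequential(nn.Linear(p, 10), nn.ReLU(),
                    nn.Linear(10, 10), nn.ReLU(),
                    nn.Linear(10, 1))                 # placeholder architecture
opt = torch.optim.Adam(net.parameters(), lr=1e-3)     # placeholder learning rate

x = torch.randn(500, p)                               # synthetic covariates
y = x[:, :1] ** 2 + 0.1 * torch.randn(500, 1)         # synthetic target

for epoch in range(200):                              # no early stopping here
    opt.zero_grad()
    loss = pinball_loss(y, net(x), s)
    loss.backward()
    opt.step()
```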
To allow the use of mixed frequency data, we can make the following changes to the structure of the network $G_s(x_t; \mathbf{w})$. In the input layer, we implement frequency alignment on each input variable $x_t$ according to the corresponding maximum lag order $K$. Thus, each high-frequency predictor $x_t^u$ is transformed into a low-frequency variable

$$x_t^\star = \sum_{k=0}^{K} B(k; \theta)\, L_u^k\, x_t^u, \tag{9}$$

where $B(k; \theta)$ is the normalized Almon polynomial and $L_u^k$ is a lag operator such that $L_u^k x_t^u = x_{t-k}^u$; the lag coefficients in $B(k; \theta)$ of the corresponding lag operator $L_u^k$ are parameterized as a function of a small-dimensional vector of parameters $\theta$. We use this weight function on the frequency alignment vector to reduce the number of parameters and ensure a parsimonious specification. As a consequence, we obtain the low-frequency variable $x_t^\star$, which has the same frequency as the output $y_t$. The rest of the architecture of the deep MIDAS follows the architecture of the deep quantile estimator, but instead of using $x_t$ in Equation (6), we use $x_t^\star$.
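The sketch below illustrates the frequency alignment in Equation (9). Since the text names only a "normalized Almon polynomial," the exponential Almon form used here is an assumption, as are the parameter values.

```python
import numpy as np

def almon_weights(theta1, theta2, K):
    """Normalized exponential Almon weights B(k; theta) -- an assumed form,
    since the text names only a 'normalized Almon polynomial'."""
    k = np.arange(K + 1)
    w = np.exp(theta1 * k + theta2 * k ** 2)
    return w / w.sum()

def frequency_align(x_high, K, theta1, theta2):
    """x_t^* = sum_k B(k; theta) x_{t-k}^u: weight the last K+1 high-frequency
    lags at each date to produce a single low-frequency regressor."""
    B = almon_weights(theta1, theta2, K)
    return np.array([B @ x_high[t - K:t + 1][::-1]
                     for t in range(K, len(x_high))])

x_star = frequency_align(np.random.default_rng(3).normal(size=250),
                         K=20, theta1=-0.1, theta2=-0.05)   # illustrative values
```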

Regularized Non-linear Quantile Regression
Neural networks have a great capacity to estimate non-linear relationships from the data, but this comes at a cost, since they are prone to overfitting. This can lead to a severe drop in their forecasting performance, especially in small samples. There is a variety of commonly used techniques in ML, originally coming from the high-dimensional statistics literature, that can be used to ease this impact; see, for example, Gu, Kelly, and Xiu (2021) for a good summary. The reader is also referred to Goodfellow, Bengio, and Courville (2016) for an excellent summary of different topics on the implementation of neural networks, including regularization.

Regularization
A common solution to this caveat is regularization, where a penalty term on the weights of the neural network is appended to the loss function. Regularization generally improves the out-of-sample performance of the network by decreasing the in-sample noise from over-parameterization, utilizing the bias-variance trade-off. A further benefit of regularization is that it provides computational gains in the optimization algorithm. The penalized loss function, for a given quantile $s$, can be written as:

$$\min_{\mathbf{w}} \frac{1}{T} \sum_{t=1}^{T} \rho_s\big(y_t - G_s(x_t; \mathbf{w})\big) + \lambda\, \phi(\mathbf{w}),$$

where the penalty term is

$$\phi(\mathbf{w}) = \alpha \|\mathbf{w}\|_1 + (1 - \alpha) \|\mathbf{w}\|_2^2,$$

and $\lambda$ and $\alpha$ are tuning parameters, whose selection we discuss below. Generally, there is a plethora of loss functions, and the choice among them depends mainly on the task at hand. In this article, we use the quantile loss function. The different penalization schemes on $\phi(\mathbf{w})$ work as follows: deep LASSO, or $\ell_1$-norm penalization, is a regularization method that shrinks all the weights uniformly toward zero, and sets some exactly to zero. The latter is referred to as the variable selection property of the deep LASSO. Deep Ridge works in a similar manner to the deep LASSO, by shrinking the weights uniformly toward zero, but not exactly to zero. Finally, the deep Elnet is a combination of deep LASSO and deep Ridge, which has been shown to retain good features of both methods; see, for example, Zou and Hastie (2005).
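A sketch of the penalty follows, assuming the elastic-net form of $\phi(\mathbf{w})$ given above; it reuses the pinball_loss function from the earlier training sketch.

```python
import torch

def elnet_penalty(net, lam, alpha):
    """phi(w) = alpha * ||w||_1 + (1 - alpha) * ||w||_2^2, scaled by lambda:
    alpha = 1 gives deep LASSO, alpha = 0 deep Ridge, otherwise deep Elnet."""
    l1 = sum(p.abs().sum() for p in net.parameters())
    l2 = sum((p ** 2).sum() for p in net.parameters())
    return lam * (alpha * l1 + (1 - alpha) * l2)

# Appended to the quantile loss inside the training loop, e.g.:
# loss = pinball_loss(y, net(x), s) + elnet_penalty(net, lam=0.001, alpha=0.5)
```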

Cross validation
The CV scheme consists of choices on the overall architecture of the neural network: the total number of layers ($L$) and neurons ($J$), the learning rate ($\gamma$) of Stochastic Gradient Descent (SGD), the batch size, the dropout rate, the level of regularization, and a choice of activation functions.
Regarding the choice of activation functions, we use ReLU for the hidden layers and a linear function for the output layer. We tune the learning rate of the optimizer, $\gamma$, over five discrete values in the interval $[0.001, 0.01]$. For the width of the neural network, we tune the hyper-parameters over the grid $\{1, 5, 10\}$. The batch size is selected from the grid $\{10, 20\}$. (We have also considered batch normalization and find that, overall, results exhibit similar patterns with and without it.) Further, we tune the regularization parameter, $\lambda$, over five discrete values in the interval $[0.001, 0.01]$, both for deep LASSO and deep Ridge, and for the deep Elnet we choose $\alpha$ from the grid $\{0.1, 0.5, 0.9\}$. We also use dropout regularization, where the dropout probability is up to 20%; see, for example, Gu, Kelly, and Xiu (2020). For the non-linear MIDAS, we also cross validate $\theta_1$ over eight discrete values in the interval $[-1, 0.5]$ and $\theta_2$ over six discrete values in $[-0.5, 0.5]$.
We use two different sets of grids to tune the depth of the neural networks. The first set is used in our MC experiments, where we use two different grids, $\{1, 3, 5\}$ and $\{10, 15, 20\}$, which lead to shallow and deep neural networks, respectively. We use this first set to examine whether shallow or deep neural networks have better finite sample performance, which we discuss in Section 2. Given these results, we use the grid $\{1, 5, 10\}$ as the second set in our empirical application.
To select the various hyper-parameters outlined above, we follow Babii, Ghysels, and Striaukas (2022a) and references therein and use a time-series CV scheme, which we succinctly describe (a minimal index-construction sketch follows the list below). Let $d$ denote a gap of observations that separates the test and training samples, with the aim of reducing the dependence between the two. For some $d \in \mathbb{N}$ and at each $t = 1, \ldots, T$:

• If $t > d + 1$ and $t < T - d$, we use the sample $I_{t,d} = \{1, \ldots, t - d - 1, t + d + 1, \ldots, T\}$ to estimate all the different hyper-parameters, denoted as $\mathbf{w}_{-t}$ for simplicity.

• If $t \le d + 1$, we use $I_{t,d} = \{t + d + 1, \ldots, T\}$ as the training sample. For $t = T - d, \ldots, T$, the training sample is $I_{t,d} = \{1, \ldots, T - d - 1\}$.

• Next, we use the left-out observations to test the model, accumulating the quantile loss at each left-out point into a criterion $\mathrm{CV}$.

• Finally, we minimize $\mathrm{CV}$ with respect to all the different hyper-parameters.
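The index construction referred to above can be sketched as follows; the sample size, $\kappa$, and the random seed are illustrative.

```python
import numpy as np

def cv_indices(T, d):
    """For each left-out point t, the training set excludes a gap of d
    observations on either side of t, covering the three cases above."""
    for t in range(T):
        train = np.r_[0:max(t - d, 0), min(t + d + 1, T):T]
        yield t, train

# Randomly subsample kappa left-out points to ease the computational burden.
rng = np.random.default_rng(4)
subsample = set(rng.choice(200, size=20, replace=False))             # kappa = 20
folds = [(t, tr) for t, tr in cv_indices(200, 5) if t in subsample]  # d = 5
```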
It is clear that tuning all these different architectures, parameters, and hyper-parameters increases the computational cost considerably. To ease the computational burden of time-series CV, we follow Babii, Ghysels, and Striaukas (2022a) and draw randomly a subsample $I \subset \{1, \ldots, T\}$ of size $\kappa$ and minimize the CV criterion over this subsample only.
Throughout the MC experiments and the empirical application, we set $\kappa = 20$ and $d = 5$, as in Babii, Ghysels, and Striaukas (2022a). Finally, we use the optimal parameters and hyper-parameters from the time-series CV and evaluate the out-of-sample performance of the network.

Optimization
The estimation of neural networks is generally a computationally cumbersome optimization problem due to non-linearities and non-convexities. The most commonly used solution utilizes stochastic gradient descent (SGD) to train the neural network. To alleviate the computational hurdle, SGD uses a batch of a specific size, that is, a small subset of the data, at each epoch (iteration) of the optimization to evaluate the gradient. The step of the derivative at each epoch is controlled by the learning rate, $\gamma$. We use the adaptive moment estimation algorithm (ADAM) proposed by Kingma and Ba (2014), a more efficient version of SGD that uses estimates of the first and second moments of the gradient to adapt the learning rate. Finally, we set the number of epochs to 5000 and use early stopping, following Gu, Kelly, and Xiu (2020), to avoid any potential overfitting.

Setup
In this section, we present MC experiments in order to study the finite sample performance of the deep quantile estimator outlined in Section 1, for the different penalization schemes. We generate artificial data $\{y_t\}$ using a single predictor $\{x_t\}$, according to the following model:

$$y_t = h_s(x_t) + u_t,$$

where $u_t$ is the realization of a random variable $u$ distributed as $u_t \sim \text{i.i.d. } N\big(-\sigma \Phi^{-1}(s), \sigma^2\big)$, with $\sigma = 0.1$, and $\Phi^{-1}$ is the quantile function of the standard normal distribution; this centering ensures that the $s$th conditional quantile of $y_t$ given $x_t$ equals $h_s(x_t)$. Here $h_s(\cdot)$ is the general non-linear function that we wish to approximate via the deep quantile estimator.
All the experiments are based on the following values: $s \in \{1\%, 2.5\%, 5\%, 10\%, 20\%\}$, $T \in \{100, 300, 500, 1000\}$, and the number of MC replications is 1000. We consider the following five data-generating mechanisms to assess the finite sample properties of the deep quantile estimator. Case I: We consider the case of a single predictor simulated as $x_t \sim \text{i.i.d. } N(0, 1)$, with $h_s(\cdot)$ a sigmoid-type non-linear function. This is the simplest design in our MC experiments. We use this simple case to showcase that linear methods, as expected, cannot produce reasonable performance under a sigmoid type of non-linear function $h_s(\cdot)$.
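A sketch of a Case I-style draw is given below. The logistic function is illustrative, since the exact sigmoid is not reproduced in the text; the error centering follows the definition of $u_t$ above, so that the $s$th conditional quantile of $y_t$ given $x_t$ equals $h_s(x_t)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
T, s, sigma = 500, 0.05, 0.1

x = rng.standard_normal(T)                         # x_t ~ i.i.d. N(0, 1)
h = 1.0 / (1.0 + np.exp(-x))                       # assumed sigmoid h_s(x_t)
u = rng.normal(loc=-sigma * norm.ppf(s), scale=sigma, size=T)
y = h + u                                          # Pr[y_t <= h_s(x_t) | x_t] = s
```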
Case II: We consider a single predictor simulated from an AR(1) process. In this design, we increase the complexity by introducing a serially correlated predictor.
Case III: We consider the case of a single predictor generated via a GARCH(1,1) model. In this design, we wish to examine how the deep quantile estimator fares when the regressor is conditionally heteroskedastic, following a GARCH(1,1) model; a GARCH-type assumption on the distribution of asset returns is commonly used in the literature.
Case IV: We consider the case of a single predictor where we simulate $h_s(x_t)$ to reflect a function composition, as commonly used in neural networks. We simulate it with three hidden layers and a specific number of neurons, such that, for example, $W^{(2)}$ is $8 \times 10$ and $W^{(3)}$ is $1 \times 8$. Further, we simulate the weights $\mathbf{w}$ so that every entry $w_{i,j}$ is generated as $w_{i,j} = d_{i,j}\, \mathbb{1}(d_{i,j} > 0.5)$, where $d_{i,j} \sim U(0, 1)$, allowing for some sparsity.
Case V: We consider serially correlated errors, where $u_t$ follows an AR(1) process. We use this design to examine whether correlated errors impact the deep quantile estimator.
Across all cases, we estimate $h_s(x_t)$ using the deep quantile estimator with different penalization schemes. Let $\widehat{h}_{s,\text{pen}} = \widehat{G}_{s,\text{pen}}(x_t; \mathbf{w})$ denote the estimate, where "pen" corresponds to no regularization, deep LASSO, deep Ridge, or deep Elnet. We use the following metrics in order to evaluate the small sample properties of the deep quantile estimator across $R = 1000$ MC replications: (i) the average mean-squared error of the residuals, $\mathrm{AMSE}_{\widehat{u}_{t,\text{pen}}}$; (ii) the average mean-squared error of the fitted values; and (iii) the average absolute bias, $\mathrm{ABIAS}_{\widehat{h}_{s,\text{pen}}}$. We report results only for $\mathrm{AMSE}_{\widehat{u}_{t,\text{pen}}}$ below, since results for the alternative metrics exhibit similar patterns and are available upon request.

Results
We consider a linear quantile estimator, other sieve-type estimators such as polynomials and B-splines (defined in the Supplementary Appendix), and shallow and deep neural network estimators, with their depth as described in Section 1. Note that we also use the time-series CV to select the optimal number of knots of the B-splines estimator from a grid of six discrete values in the interval $[5, 10]$. Our MC aims to answer which estimator has the best finite sample performance under a non-linear setup. We find that both shallow and deep networks deliver good finite sample properties across quantiles, with shallow learning being the best of the two in most cases. Further, neural networks perform better than the linear quantile and other sieve estimators. We present our MC results for Cases I-V in Tables 1-5, respectively.
In Table 1, we can see that the linear quantile estimator does not work under a non-linear setup, as expected, and its MSE remains constant as the sample size increases. Next, we present the finite sample behavior of the deep quantile estimator across the different penalization schemes, namely deep quantile, deep LASSO, deep Ridge, and deep Elastic Net, and find that the deep quantile non-linear estimators have good finite sample properties.
When $s = 1\%$, the deep quantile estimator appears to work well for sample sizes larger than $T = 300$, and in comparison with the linear one it generally works better. In Case II, the non-linear estimators display good finite sample properties, and their performance is better than that of the linear one. In this case, the non-regularized estimator performs better than the regularized ones. Similar behavior appears in Case III. In Case IV, where we allow for some sparsity in the weights, we find, as expected, that the linear quantile regression estimator does not work under non-linearity, while the non-linear one works as expected. In Case V, where we consider serially correlated errors, we find that adding a penalty term to the non-linear estimators improves the performance of the deep quantile estimator in extreme quantiles.
We further find, as expected, that the linear quantile regression, second-order quantile polynomial, and cubic splines estimators do not work under non-linearity. Finally, on very few occasions, we find that the splines estimator performs better than shallow networks for small sample sizes and extreme quantiles.
Overall, our MC results suggest that the deep quantile estimator, using both deep and shallow learning, has good finite sample properties and can approximate non-linear functions. We also find evidence in favor of the penalization schemes described in Section 1. Specifically, the penalized deep quantile estimators also have good finite sample properties and, in some cases, perform better than the non-regularized one; a finding in favor of weight regularization.

Empirical Setup
In this section, we outline our empirical application setup, where we use the deep quantile estimator to forecast VaR. We examine the predictive ability of the deep quantile estimator and other non-parametric models, relative to the linear one, using the quantile-encompassing test of Giacomini and Komunjer (2005). We further examine the predictive performance of the different methods by testing their forecasting accuracy, using the DM, GW, and quantile score tests.

Deep Quantile VaR Forecasting
The data used in our empirical application consist of around 36 years of daily prices on the S&P500 index (source: Bloomberg), from September 1985 to August 2020 ($T = 9053$ observations). We use daily log returns, defined as $r_t = \log(P_t / P_{t-1})$, for our forecasting analysis. We use four different classes of VaR models and produce forecasts for the $s \in \{1\%, 5\%, 10\%\}$ empirical conditional quantiles, using the deep quantile estimator.
The first VaR specification we consider is the GARCH(1,1) model proposed by Bollerslev (1986), in which $\sigma_{1,t}^2 = \omega_0 + \omega_1 \sigma_{1,t-1}^2 + \omega_2 r_{t-1}^2$; see Equation (14). The second VaR specification we consider is RiskMetrics, proposed by Morgan (1996), which assumes $\sigma_{2,t}^2 = \lambda \sigma_{2,t-1}^2 + (1 - \lambda) r_{t-1}^2$, where for daily returns $\lambda = 0.94$; see Equation (15). The last two specifications we consider follow the Conditional Autoregressive VaR (CAViaR) model proposed by Engle and Manganelli (2004), where a specific quantile is analyzed rather than the whole distribution. Specifically, the CAViaR model corrects the past $\text{VaR}_{j,t-1}$ estimates in the following way: it increases $\text{VaR}_{j,t}$ when $\text{VaR}_{j,t-1}$ is above the $s$th quantile, while, when $\text{VaR}_{j,t-1}$ is less than the $s$th quantile, it reduces $\text{VaR}_{j,t}$. Thus, the third VaR we examine is the symmetric absolute value (SV), which responds symmetrically to past returns; see Equation (16). Lastly, we consider the asymmetric slope value (ASV), as it offers a different response to positive and negative returns; see Equation (17). For ease of exposition, we refer to the above specifications as $\text{VaR}_{1,t}, \ldots, \text{VaR}_{4,t}$, respectively. We summarize their specifications below:

$$\sigma_{1,t}^2 = \omega_0 + \omega_1 \sigma_{1,t-1}^2 + \omega_2 r_{t-1}^2, \tag{14}$$

$$\sigma_{2,t}^2 = \lambda \sigma_{2,t-1}^2 + (1 - \lambda) r_{t-1}^2, \tag{15}$$

$$\text{VaR}_{3,t} = \beta_0 + \beta_1 \text{VaR}_{3,t-1} + \beta_2 |r_{t-1}|, \tag{16}$$

$$\text{VaR}_{4,t} = \beta_0 + \beta_1 \text{VaR}_{4,t-1} + \beta_2 \max(r_{t-1}, 0) + \beta_3 \min(r_{t-1}, 0), \tag{17}$$

where $\beta_i$, $i = 0, \ldots, 3$, are parameters to be estimated. We use these specifications following Giacomini and Komunjer (2005). Under the mixed frequency setup, we consider the analogous specifications in which the covariates enter through the MIDAS weighting scheme $B(L^u; \theta)$ defined in Equation (9), for $i = 1, \ldots, 4$, where $\theta$ are parameters to be estimated. For a more detailed summary of MIDAS, we refer the reader to Ghysels, Santa-Clara, and Valkanov (2004). As discussed in Section 1, the linear association between VaR and the covariates can be restrictive. Instead, we assume that the relationship between the response variable, VaR, and the covariates has an unknown non-linear form for a given $s$, which we wish to approximate with the deep quantile estimator as

$$\text{VaR}_{j,t} = G_s(x_{j,t}; \mathbf{w}), \tag{22}$$

where $\text{VaR}_{j,t}$, $j = 1, \ldots, 4$, is indexed at (day) $t = 1, \ldots, T$. The dimension $p$ of the covariates that we use in our analysis depends on the specification chosen for VaR. Specifically, if $j = 1, 2$ then $p = 1$; if $j = 3$, $p = 2$; and if $j = 4$ then $p = 3$.
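To make the covariate construction concrete, the following sketch computes the RiskMetrics recursion in Equation (15) and the SV/ASV inputs on simulated returns; the initialization of the variance recursion is a choice we make here, not one stated in the text.

```python
import numpy as np

def riskmetrics_vol(r, lam=0.94):
    """sigma2_t = lam * sigma2_{t-1} + (1 - lam) * r_{t-1}^2 (Equation (15));
    initializing at the sample variance is our choice."""
    s2 = np.empty_like(r)
    s2[0] = r.var()
    for t in range(1, len(r)):
        s2[t] = lam * s2[t - 1] + (1 - lam) * r[t - 1] ** 2
    return s2

r = np.random.default_rng(6).normal(scale=0.01, size=1000)  # simulated returns
sigma2 = riskmetrics_vol(r)
# SV / ASV covariates: lagged VaR plus |r_{t-1}|, or (r_{t-1})^+ and (r_{t-1})^-
abs_r, pos_r, neg_r = np.abs(r), np.maximum(r, 0.0), np.minimum(r, 0.0)
```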
In the Supplementary Appendix, we briefly delineate the model specifications for the quantile B-splines, quantile polynomial, and quantile MIDAS estimators.

Forecasting Exercise Design
This section presents our forecasting exercise design. We reserve the last 2000 observations to evaluate the out-of-sample performance using various tests, and use the remaining 7053 observations to tune parameters via time-series CV, as described in Section 1. This specific split is used because we follow Giacomini and Komunjer (2005) and want the power of the conditional quantile forecast encompassing (CQFE) test to be comparable with their exercise. Generally, a forecasting exercise is performed either via a recursive or a rolling window, see for example, Ghysels et al. (2019); yet in either setting, producing all h-step-ahead forecasts for the last 2000 observations while tuning the hyper-parameters is computationally challenging. Instead, we follow Giacomini and Komunjer (2005) and perform a fixed forecast window exercise, in which we estimate our models once.
For our forecasting design, we use a fixed forecast window exercise and predict the ten-day-ahead VaR as

$$\widehat{\text{VaR}}_{1,t+10} = E\big[G_s(x_{1,t+10}; \mathbf{w}^\star) \,\big|\, \mathcal{F}_t\big], \tag{23}$$

where $\mathcal{F}_t$ denotes the information set up to time $t$ and $\mathbf{w}^\star$ denotes the optimal weights obtained from the time-series CV. Equation (23) illustrates how forecasts for the first VaR specification are obtained via the deep quantile estimator. In a similar manner, forecasts can be obtained for the other VaR specifications and the alternative models, using Equations (14)-(22). We evaluate the forecasting performance of the VaR models with the deep quantile estimator as in Section 1. Further, we consider ten-day compounded VaR forecasts, which we relegate to the Supplementary Appendix.

Forecast Evaluation
In this section, we discuss the various tests we have considered in order to evaluate the predictive ability of the deep quantile estimator, and present the testing results. In general, the root mean-squared forecast error (RMSFE) is used to measure the accuracy of point forecasts and is defined as

$$\mathrm{RMSFE}_j = \sqrt{\frac{1}{m} \sum_{t} \Big( \text{VaR}_{j,t+h} - \widehat{G}_s(x_{j,t+h}; \mathbf{w}) \Big)^2},$$

where $j = 1, \ldots, 4$ depends on the specification chosen for $\text{VaR}_j$, $h$ denotes the forecasting horizon, $m$ is the out-of-sample size, and $\widehat{G}_s(x_{j,t+h}; \mathbf{w})$ is the solution of Equation (8) after selecting the optimal $\mathbf{w}$ via CV at the $s$th quantile. We then report the relative RMSFE, $\mathrm{RMSFE}_M / \mathrm{RMSFE}_B$, where $\mathrm{RMSFE}_M$ comes from the competing estimator and $\mathrm{RMSFE}_B$ from the linear quantile estimator for each VaR specification. In most cases, the linear quantile estimator outperforms the polynomial and splines estimators across competing models and quantiles. In all cases, the deep quantile and deep quantile MIDAS estimators outperform the linear quantile estimator. The forecast gains of the deep quantile estimator vary from 50% to 98%, while the gains from the deep quantile MIDAS fluctuate between 11% and 84%. We find that neural network models improve VaR forecasts in all VaR models across the different quantiles we consider.
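For concreteness, a short sketch of the relative RMSFE measure used above:

```python
import numpy as np

def rmsfe(e):
    """Root mean-squared forecast error of a vector of forecast errors."""
    return np.sqrt(np.mean(np.asarray(e) ** 2))

def relative_rmsfe(e_model, e_benchmark):
    """Entries below one favour the competing estimator M over the linear
    quantile benchmark B."""
    return rmsfe(e_model) / rmsfe(e_benchmark)
```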

DM test
We perform a quantitative forecast comparison across the different methods and test their statistical significance. To do so, we calculate the RMSFE for each method and perform the DM test, with the Harvey, Leybourne, and Newbold (1997) adjustment, to gauge the statistical significance of the forecasts. As our empirical application entails quantiles, we compute the DM statistics based on the comparison of empirical quantile losses rather than the MSE loss. With the DM test, we assess the forecasting accuracy of the deep quantile estimator relative to the benchmark linear quantile regression model. In this exercise, we set $s$ equal to 1%, 5%, and 10%.
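A sketch of the DM comparison on quantile losses, with the Harvey, Leybourne, and Newbold (1997) small-sample adjustment, is given below; truncating the HAC variance at $h - 1$ autocovariances is a common choice we assume here.

```python
import numpy as np
from scipy import stats

def dm_test(loss1, loss2, h=10):
    """DM statistic on quantile-loss differentials with the HLN (1997)
    small-sample adjustment; a sketch under the assumptions above."""
    d = np.asarray(loss1) - np.asarray(loss2)
    m = len(d)
    gamma = [np.cov(d[k:], d[:m - k])[0, 1] for k in range(h)]
    var_d = (gamma[0] + 2.0 * sum(gamma[1:])) / m
    dm = d.mean() / np.sqrt(var_d)
    hln = dm * np.sqrt((m + 1 - 2 * h + h * (h - 1) / m) / m)
    p_value = 2.0 * stats.t.sf(abs(hln), df=m - 1)
    return hln, p_value
```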
Results from the DM test are reported in Table 6, where asterisks denote rejection of the null hypothesis of the test at the 1%, 5%, and 10% levels of significance, for all quantiles and models we consider. These results suggest that forecasts produced from the non-linear estimator outperform, in the majority of cases, forecasts obtained from the linear and non-parametric quantile regression estimators.

GW test
In a similar manner, and to complement the DM test, we follow Carriero, Kapetanios, and Marcellino (2009) and further calculate the GW test of equal forecasting accuracy, which can handle forecasts based on both nested and non-nested models, regardless of the estimation procedures used for the derivation of the forecasts, including the deep quantile estimator. As in the DM test, we compute the GW statistics based on the empirical quantile losses rather than the MSE loss. Table 2 in the Supplementary Appendix illustrates the results for the GW test, where daggers denote rejection of the null hypothesis of the test at the 1%, 5%, and 10% levels of significance, for all quantiles and the different models we consider. Similar to the DM forecasting accuracy test, the GW test is again significant at 1% in most cases, with the following exceptions.
Quantile polynomial regression forecasts are only significant at the 1% level of significance for the ASV model at $s = 5\%$. For quantile splines, forecasts for the RM specification at $s = 10\%$ are significant at the 5% level of significance, and under SV at $s = 10\%$ they are significant at the 10% level of significance. Forecasts from the linear MIDAS, under the SV specification, are insignificant at $s = 1\%$ and significant at the 5% level of significance at $s = 5\%$; under the ASV specification, at $s = 1\%$ and $s = 5\%$, they are significant at the 10% significance level.
Results for the SV with the Deep MIDAS estimator are not significant at $s = 1\%$ and $s = 5\%$ for the GW test, while results for the ASV with the Deep MIDAS estimator are significant at 5% for all quantiles we consider. Results for the SV with the Deep LASSO MIDAS estimator are not significant at $s = 1\%$, while at $s = 5\%$ they are significant at the 5% level of significance.
Results for the SV with the Deep Ridge MIDAS estimator are only significant at the 5% level of significance for the GW test at $s = 10\%$. Results for the ASV with the Deep Ridge MIDAS estimator are only significant at the 5% level of significance across quantiles. Forecasts from the deep Elnet MIDAS model under the SV specification at $s = 10\%$ are significant at the 5% level of significance. Finally, forecasts from the deep Elnet under the ASV specification across quantiles are significant at the 5% level of significance.
Overall, results from both the DM and GW tests suggest that the non-linear estimators outperform, in the majority of cases, competing linear and non-parametric estimators in VaR forecasting.

CQFE
We present the implementation of the CQFE test as proposed by Giacomini and Komunjer (2005) and the generalized method of moments (GMM) estimation as proposed by Hansen (1982). Let $\widehat{q}_{1,t}$ be a vector of the $s$th quantile forecasts produced from model 1 and $\widehat{q}_{2,t}$ be the competing forecasts produced from model 2. The basic principle of CQFE is to test whether $\widehat{q}_{1,t}$ conditionally encompasses $\widehat{q}_{2,t}$. Encompassing occurs when the second set of forecasts fails to add new information to the first set of quantile forecasts (or vice versa), in which case the first (second) quantile forecast is said to encompass the second (first).
The aim of the CQFE test is to test the null hypothesis that $\widehat{q}_{1,t}$ performs better than any linear combination of $\widehat{q}_{1,t}$ and $\widehat{q}_{2,t}$. Under the null hypothesis, this holds if and only if the weights $(\theta_1, \theta_2)$ are equal to $(1, 0)$. The objective function of the GMM is

$$J_T(\theta) = m\, g_T(\theta)'\, W_T\, g_T(\theta),$$

and the optimal weights are computed as

$$\theta^\star = \arg\min_{\theta} J_T(\theta),$$

where $W_T$ is a positive definite matrix, $g_T(\theta)$ is the sample moment condition, $\theta = (\theta_0, \theta_1, \theta_2)'$ is a set of weights, $\theta^\star = (\theta^\star_0, \theta^\star_1, \theta^\star_2)'$ denotes the optimal weights, and $\widehat{q}_t = (1, \widehat{q}_{1,t}, \widehat{q}_{2,t})'$ is a vector with the forecasted values based on the pairwise models 1 and 2 in the CQFE test; $m$ denotes the out-of-sample size and $z_T$ is a vector of instruments. Hansen (1982) showed that setting $W_T = S_T^{-1}$, that is, the inverse of an asymptotic covariance matrix, is optimal, as it estimates $\theta^\star$ with the smallest possible asymptotic variance. $S$ is also known as the spectral density matrix of $g_T$. We follow Newey and West (1987) and use a heteroskedasticity robust estimate $\widehat{S}_T$ of $S$.

[Table 6 about here. Notes: the table reports relative RMSFE; the smaller the entry (<1), the better the forecast. *, **, and *** denote results from the DM test with the Harvey, Leybourne, and Newbold (1997) adjustment for predictive accuracy, indicating rejection of the null hypothesis that the models have the same predictive accuracy at the 10%, 5%, and 1% levels of significance, respectively.]
$S_0$ is the estimated spectral density matrix evaluated at frequency zero. The GMM estimation is performed recursively, that is, (i) minimize $J_T$ using an identity weighting matrix to get $\theta^\star$, which gives $W_T$ via $\widehat{S}_T$, and (ii) minimize $J_T$ using $W_T = \widehat{S}_T^{-1}$. We then evaluate the objective function at the restrictions $(\theta_1, \theta_2) = (1, 0)$ and $(\theta_1, \theta_2) = (0, 1)$, which correspond to testing whether forecast $\widehat{q}_{1,t}$ encompasses $\widehat{q}_{2,t}$ or $\widehat{q}_{2,t}$ encompasses $\widehat{q}_{1,t}$. The CQFE statistics, $\mathrm{ENC}_1$ and $\mathrm{ENC}_2$, are then defined as the corresponding values of the GMM objective function. The asymptotic distribution of the GMM estimates of $\theta$ requires the moment conditions to be once differentiable. To satisfy this requirement, we follow Giacomini and Komunjer (2005) and replace the indicator function in the moment condition with a smooth approximation governed by a smoothing parameter $\eta$. We choose the critical values, $c_{\text{crit}}$, of the test from a $\chi^2_2$ distribution, in which $\widehat{q}_{i,t}$ encompasses $\widehat{q}_{j,t}$ if $\mathrm{ENC}_i \le c_{\text{crit}}$, $\forall i \ne j = 1, 2$. In the empirical application, the vector of instruments, $z_T$, is $(1, r_t, \text{VaR}_{i,t}, \text{VaR}_{j,t})$, $\forall i \ne j = 1, 2$.
We select $\eta$ to be 0.005, following the CQFE test rejection probabilities in Giacomini and Komunjer (2005), since our pseudo-out-of-sample (POOS) size is 2000 observations. We consider the following five blocks: (i) the non-parametric, (ii) the non-linear, (iii) the non-linear MIDAS, (iv) the linear, and (v) the linear MIDAS blocks. The non-parametric block consists of the quantile polynomial and quantile splines estimators, the non-linear block consists of the deep quantile estimators for the different regularization schemes, and the non-linear MIDAS block consists of the deep MIDAS estimators for the different regularization schemes. Finally, the linear and linear MIDAS blocks consist of the linear quantile and linear quantile MIDAS estimators, respectively.
We examine each block of models across the different quantiles. Specifically, we consider how many times the models within a specific block outperform models from other blocks and present these results in Table 7. Under this setting, a win denotes that the prevailing model encompasses the competing benchmark model, while a loss means that the competing model encompasses the prevailing one. Precisely, we consider a win when the computed p-value of the CQFE test fails to reject the null hypothesis, that is, $H_{10}$ or $H_{20}$. On the contrary, in the case where the CQFE test suggests that there is no encompassing between the forecasts, we consider this a loss, that is, the null hypothesis is rejected. Furthermore, the CQFE test has a gray zone in which the test can fail to reject both null hypotheses ($H_{10}$ and $H_{20}$), in which case the test is inconclusive. Below we summarize the CQFE testing results for the different quantiles when $\eta = 0.005$.
For the 10th quantile, the non-linear block encompasses the competing blocks 713 times, in comparison with the linear block, which encompasses the competing blocks 173 times, and the non-parametric block, which encompasses the others 322 times. The linear block fails to encompass the other blocks 15 times, and the non-linear block 86 times. Additionally, the test is inconclusive 696 times for the non-linear block and 159 times for the linear one. Thus, the non-linear block is ranked first in terms of how many times it encompasses the other blocks, and the non-linear MIDAS block is ranked second.
For the 5th quantile, the non-linear block encompasses the other blocks 739 times, the non-parametric block encompasses them 342 times, and the linear block 170 times. Further, the linear block fails to encompass the other blocks 18 times, and the non-linear block 60 times. Finally, for the non-linear block, the CQFE test is inconclusive 726 times, and 166 times for the linear block. The ranking of the first two blocks is the same as in the 10th quantile.
Finally, we examine the 1st quantile. In this case, the non-linear block encompasses the other blocks 757 times, the non-parametric block encompasses them 336 times, and the linear block 173 times. Furthermore, the linear block fails to encompass the other blocks 15 times, and the non-linear block 42 times. The test is inconclusive 751 times for the non-linear block and 170 times for the linear one. The ranking remains the same as above. Results for different smoothing parameters $\eta$ suggest similar patterns and are available upon request.

Semi-structural Analysis
A general issue in ML is the trade-off between accuracy and interpretability: the output of a highly complicated model, for example a deep neural network, can have great accuracy or forecasting performance, but cannot be easily interpreted. In this section, we first discuss the details of two methods that can be used to make ML methods interpretable. The first is SHAP values, which have received a lot of attention recently, and the second is partial derivatives. Further, we make a formal comparison of the two methods, based on the output of the deep quantile estimator, which illustrates (i) that both methods can be used to make the impact of each covariate in neural networks interpretable and (ii), perhaps surprisingly, that the use of partial derivatives offers more stable results at a fraction of the computational cost.
SHAP Values

SHAP values decompose the output of the model into additive contributions of each covariate relative to a reference value,

$$\Delta o = \sum_{i} C_{\Delta x_{t,i}}^{\Delta o}, \tag{27}$$

where $C_{\Delta x_{t,i}}^{\Delta o}$, which represents the impact of a covariate relative to a reference value, is assigned to each covariate $x_{t,i}$; $o = f(\cdot)$ is the output of the model, $\Delta o = f(x) - f(r)$, $\Delta x_{t,i} = f(x_{t,i}) - r_{t,i}$, and $r$ is the reference value. Equation (27) thus attributes the deviation of the model output from its reference level across the covariates.
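As a usage sketch, and assuming the shap Python package together with the fitted PyTorch network net from the training sketch in Section 1, SHAP values relative to a background (reference) sample can be obtained as follows.

```python
import shap   # assumes the shap package is installed
import torch

# The background sample plays the role of the reference value r above.
x = torch.randn(500, 3)                           # stand-in covariates
background = x[:100]
explainer = shap.DeepExplainer(net, background)   # `net` from Section 1
shap_values = explainer.shap_values(x[100:200])
```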

Partial Derivatives
The use of partial derivatives for the interpretation of a model is straightforward in econometrics, with various uses ranging from the simple linear regression model to impulse response analysis. In this section, we show how partial derivatives can be used even in highly non-linear deep neural networks. Before we start the analysis, note that while deep neural networks are highly non-linear, their solution/output via SGD optimization methods can be treated as a differentiable function, as the majority of activation functions are differentiable. Consider the case of ReLU, which is not differentiable at 0, whereas it is at every other point. From the point of view of gradient descent, heuristically, it works well enough to treat it as a differentiable function. Further, Goodfellow, Bengio, and Courville (2016) argue that this issue is negligible, and ML software is subject to rounding errors, which make it very unlikely to compute the gradient at exactly a singularity point. Note that even in this extreme case, both SGD and ADAM will use the right subgradient at 0.
For a general $x_t \in \mathbb{R}^p$, let

$$\frac{\partial \widehat{G}_{j,s}(x_t; \mathbf{w})}{\partial x_{it}}$$

denote the partial derivative with respect to covariate $x_i = x_{it}$, for $i = 1, \ldots, p$, at time $t = 1, \ldots, T$, where $\widehat{G}_{j,s}(x_t; \mathbf{w})$ is the forecasted $\text{VaR}_{j,t}$ across the $j$ different VaR specifications we consider. We assess the partial derivative over time since, following Kapetanios (2007), we expect it to vary in time, due to the inherent non-linearity of the neural network. Our covariate(s) $x_t$ are the conditional volatility for GARCH and RM, and lagged VaR values together with the absolute S&P500 daily return, or the positive and negative S&P500 daily returns, for SV and ASV, respectively. It is evident that under the classic linear regression problem, or the linear quantile regression model, the effect of the covariates $x_t$ on the dependent variable $y_t$ is constant, time invariant, and corresponds to $\widehat{\beta}(s)$.
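Because the fitted network is differentiable almost everywhere, these partial derivatives can be read off an automatic-differentiation pass; a sketch follows, reusing the PyTorch network net from the training sketch in Section 1.

```python
import torch

def partial_derivatives(net, x):
    """Per-observation gradient of the fitted network output with respect to
    each covariate: a T x p matrix of time-varying marginal effects."""
    x = x.clone().requires_grad_(True)
    net(x).sum().backward()     # summing decouples the per-row gradients
    return x.grad

derivs = partial_derivatives(net, torch.randn(50, 3))  # `net` from Section 1
```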

Results
In this application, we use the whole sample, that is, around 36 years of daily returns on the S&P500 index, to provide an accurate interpretation of the deep quantile estimator. Figures 1-4 illustrate the partial derivatives and SHAP values evaluated over time on the output of the deep quantile estimator, for a specific quantile $s$. We limit our attention to the output of the best performing model, in terms of its forecasting capacity as reflected by the forecast gains measure in Section 3, for each model, based on the different penalization schemes; results from all the different penalization schemes suggest similar patterns to the ones discussed below and are available upon request. Further, we compare the partial derivatives of the deep quantile estimator with the linear quantile regression partial derivative, that is, the $\beta(s)$ coefficient. Both partial derivatives and SHAP values
seem to identify interesting patterns that can be linked to some well-known events. Below we discuss our results for all the models we have considered in our empirical application. The results for the first two models, that is, GARCH and RM, can be summarized together, since in both models there is only one covariate, the conditional volatility, but with a different specification. The results for these models are illustrated in Figure 1. We find that the partial derivative appears more stable over time, fluctuating around the constant partial derivative, $\beta(s)$, of the linear quantile estimator; when there is a crisis or a stressful event in the financial markets, it increases. As an example, we see significant spikes in the partial derivatives both in March 2020 and in 2008, which mark the onset of the COVID-19 pandemic and the Great Recession, respectively. We also find that the biggest increase occurs in 1987, the year of Black Monday, and there is also significant variation during the U.S. government shutdown in 2019. The values of the partial derivatives generally increase as $s$ decreases. SHAP values behave similarly to the partial derivatives but are more volatile across time. For the first two models, there are some events, for example during 1991, where the values of both SHAP and the partial derivatives do not increase much. We view this finding as an inability of these two models to properly account for this crisis.
For the last two models, the merit of SHAP values and partial derivatives becomes clear, since these models have more than one covariate, and both methods can provide an indication of the effect of each covariate on the final output. Overall, we find that increasing the number of covariates allows the models to account for all the crises within the sample. For the case of the SV model, we find that the important covariate is the lagged value of VaR, rather than the absolute S&P500 return. Similar to the one-covariate models, we find that the partial derivatives are more stable than the SHAP values, fluctuating closely around $\beta(s)$ and picking up when there is a crisis or distress in the economy or financial markets. The SHAP values again appear more volatile, with a wider range. Similar to the findings for the one-covariate models, the lower the quantile $s$, the higher the values of the partial derivatives and SHAP.
For the case of the ASV model, we find that, again, the lagged values of VaR are the most significant covariate, the negative S&P500 returns have some impact, and the positive S&P500 returns are almost insignificant. Similar to the cases above, we find that the partial derivative is more stable than the SHAP values, fluctuating closely around $\beta(s)$ and picking up when there is a crisis or distress in the economy or financial markets. The SHAP values again appear more volatile, with a wider range. As before, lower quantiles have higher partial derivatives. The results for these two models are illustrated in Figures 2-4. Different penalization schemes maintain the aforementioned results, with a lower magnitude. Overall, we observe that the linear quantile regression shows a fixed pattern across time, and it is evident that this model does not anticipate shocks in the economy, whereas our analysis suggests that the marginal effect implied by the deep quantile estimator is higher during stressful events. As Engle and Manganelli (2004) suggest, SV and ASV react more to negative shocks, and in stressful events their spike is larger than that of the GARCH and RM models. Finally, covariates with the minimum contribution to the forecasted values, such as the positive S&P500 returns, have a negligible impact on both the SHAP and partial derivative values.

Conclusion
In this article, we contribute to the expanding literature on the use of ML in finance and use the deep quantile estimator that has the potential to capture the non-linear association between asset returns and predictors. In Section 1, we lay out the exact workings of the deep quantile estimator, and illustrate how it generalizes linear quantile regression.
In the MC exercise in Section 2, we study the finite sample properties of the deep quantile estimator, based on a number of data-generating processes. We present extensive evidence that the estimator delivers good finite sample performance, which improves with $T$, uniformly across the different regularization schemes.
We use the deep quantile estimator, with various penalization schemes, to forecast VaR. We find that the deep quantile estimator delivers considerable predictive gains, up to 98%, relative to the VaR forecasts produced by the linear quantile regression. This result is backed by the forecasting accuracy tests, that is, the DM, the GW, and the quantile score tests. Further, results from the CQFE test of Giacomini and Komunjer (2005) suggest that forecasts obtained from the non-linear estimators encompass forecasts from the linear and non-parametric models with a higher frequency. These findings support a non-linear association between the conditional quantile of asset returns and covariates, hence suggesting a new avenue for forecasting in finance and in macroeconomics during extreme events.
In addition, we conduct a semi-structural analysis to examine the contribution of the predictors to VaR over time. Following the ML literature, we consider SHAP values, as well as partial derivatives. Our findings suggest that the non-linear estimator reacts more in stressful events and exhibits time variation, while the linear quantile estimator presents, as expected, a constant, time-invariant behavior. We conclude that financial variables are characterized by non-linearities, which the deep quantile estimator can approximate quite well.
Finally, we make a formal comparison between SHAP values and partial derivatives and, interestingly, find that partial derivatives can be used to make ML methods interpretable, are less volatile, are easier to interpret, and can be computed at a fraction of the time required to calculate SHAP values.

Supplemental Data
Supplemental data are available at https://www.datahostingsite.com.