Forecasting bitcoin volatility: exploring the potential of deep learning

This study aims to evaluate the forecasting properties of classic methodologies (ARCH and GARCH models) in comparison with deep learning methodologies (MLP, RNN, and LSTM architectures) for predicting Bitcoin's volatility. As a new asset class with unique characteristics, Bitcoin's high volatility and structural breaks make forecasting challenging. Based on 2753 observations from 08-09-2014 to 01-05-2022, this study focuses on Bitcoin logarithmic returns. Results show that deep learning methodologies have advantages in terms of forecast quality, although significant computational costs are required. Although both MLP and RNN models produce smoother forecasts with less fluctuation, they fail to capture large spikes. The LSTM architecture, on the other hand, reacts strongly to such movements and tries to adjust its forecast accordingly. Forecasting accuracy at different horizons is compared using the MAPE and MAE metrics. Diebold–Mariano tests were conducted to compare the forecasts, confirming the superiority of deep learning methodologies. Overall, this study suggests that deep learning methodologies could provide a promising tool for forecasting Bitcoin returns (and therefore volatility), especially for short-term horizons.


Introduction
Civilization in its present form would not exist without money. Recent advancements in blockchain technology enable the creation of decentralized monetary systems called cryptocurrencies, of which the most famous is Bitcoin, which has become a new asset class. This new type of asset is becoming part of the global financial and economic ecosystem, bringing new and interesting research questions that represent investigation opportunities.
Current macro-economic conditions, with EUR/USD parity going hand in hand with high inflation worldwide, make it the right time to question the concepts of money and the role of central banks, to better understand what opportunities these alternative systems can bring to the discussion and, ultimately, to ask whether these new ideas can in fact help to improve our societies as a whole.
The motivation for this study is to address the need for better understanding and forecasting of Bitcoin volatility, as this new asset class becomes increasingly relevant in the global financial and economic ecosystem. While traditional econometric models have been used to forecast the volatility of financial assets, the high volatility and unusual market patterns of cryptocurrencies present a challenge for these techniques. As a result, there is a need for more modern and innovative forecasting models that can better capture the nature of these markets. This study compares the prediction results of traditional econometric models, such as ARCH and GARCH, with machine learning models, specifically neural networks, in predicting Bitcoin volatility, while reviewing what might be the causes of this extraordinary volatility. Moreover, the extra computational costs associated with machine learning models are justified by the improved forecasting accuracy. Thereby, a new insight into forecasting Bitcoin volatility will be provided and a contribution to the current discussion on the role and potential of cryptocurrencies and machine learning techniques in econometric studies will be made.
Forecasting models are critical decision-making tools for economic agents, investors, and governments, particularly when predicting financial and economic data (Aminian et al., 2006).
Econometric models, such as autoregressive conditional heteroskedasticity (ARCH) and generalized autoregressive conditional heteroskedasticity (GARCH) models, have been extensively used to model the volatility of financial assets. However, the high volatility and unusual patterns and behaviors of cryptocurrency markets make it challenging to apply such models (Franses & Van Dijk, 1996; Pilbeam & Langeland, 2015). To address this challenge, some scholars have proposed the use of modern techniques, such as machine learning/deep learning, to develop models that better explain and predict the nature of cryptocurrency markets (Bezerra & Albuquerque, 2017; Liu, 2019) and help businesses better understand the risks associated with these assets or assist in pricing derivatives. However, most of these works make no reference to the implicit computational cost.
For example, regarding stock price forecasting, Costa et al. (2019), Lopes et al. (2021) and Ramos et al. (2018) report that some Recurrent Neural Network (RNN) models, e.g., Long Short-Term Memory (LSTM) networks, can be promising for modeling and forecasting time series with structural breaks or with very irregular behavior (such as time series related to financial markets). However, despite the good forecasting quality, Lopes et al. (2021) and Ramos et al. (2021) note that these neural network architectures have a significant computational cost. In view of these observations, further reflection that weighs the predictive power of DNN models against their computational cost is important.
Thus, in addition to comparing methodologies (classical and deep learning), this work seeks to bring a scientific contribution in two aspects: (i) a comparative analysis between different deep learning methodologies, seeking to understand any differences; (ii) a critical analysis of the implicit computational cost (often omitted in scientific papers). These are aspects that have not been much discussed in the literature, so this work aims to contribute to the scientific debate on the subject.
The results of our study indicate that machine learning models, specifically neural networks, outperform traditional econometric models in forecasting Bitcoin volatility, especially in short-term horizons, although they require significant computational costs (especially LSTM models).
This paper is structured as follows: Sect. 2 reviews the relevant literature. The forecasting models are defined formally in Sect. 3, as well as the data to be used in the study (including graphics illustrating the volatility to be forecast). Section 4 outlines the methodology employed in the implementation, including the forecasting models, statistical tests, and evaluation metrics. Section 5 presents a descriptive and inferential data analysis, along with visualizations of the forecast obtained by each model and accuracy tables. Finally, Sect. 6 concludes the paper and outlines directions for future research.

Literature review: bitcoin and volatility
According to the literature, there are conflicting ideas about what may explain Bitcoin's extraordinary volatility. Hayes (2017) and Garcia et al. (2014) argue that the main determinants of the Bitcoin price are production costs (electricity costs), so that lower electricity prices or reduced mining difficulty will put negative pressure on the Bitcoin price. Yermack (2015) highlights that, since the quantity of new bitcoins is known with certainty by the public, the supply of new bitcoins is clear and transparent. Gronwald (2019) states that the limited long-term fixed supply of Bitcoin makes it scarce, as it is an "exhaustible resource commodity such as crude oil and gold", and analyzes demand shocks. Another important feature is the programmed supply shocks in the production of Bitcoin (halvings), which result in price volatility as buyers and sellers adjust towards an equilibrium price; this effect, however, will become less important over time (Chaim & Laurini, 2018). Pagnotta and Buraschi (2018) model price-hash rate spirals. Additionally, it is important to mention the high occurrence of settlement cascades: the unregulated nature of most crypto markets allows the use of high leverage and market manipulation, which contributes to this problem and increases volatility. Taleb (2021) disagrees with the cost models discussed above and states that any price should be zero, arguing that Bitcoin does not exhibit inflation hedging properties and has failed as a payment network due to high transaction costs and volatility in value.
Volatility plays an important role in measuring and assessing potential risks, and a better understanding of how it can be predicted may support decision-making regarding future expectations. Due to cryptocurrencies' high volatility, classical methodologies may face some difficulties. Kim and Won (2018) state that volatility plays crucial roles in financial markets, such as in derivative pricing, portfolio risk management, and hedging strategies. The work of Black and Scholes (1973) on option pricing models corroborates this importance. Markowitz (1952) argues that volatility is one of the key indicators to measure risk and uncertainty, implying that the higher the volatility, the higher the risk of the asset or portfolio of assets. Hang (2019) highlighted the importance of forecasting, stating that it is an important tool to help companies create competitive advantage.
Several authors have applied a wide range of techniques to forecast volatility. Some of the most important models for forecasting volatility across the literature include ARCH by Engle (1982) and GARCH by Bollerslev (1986). Some authors studied their properties on crypto assets (Bergsli et al., 2022; Gronwald, 2019; Klose, 2022). Kim and Won (2018) agree on the advantages of such models, since volatility clustering, heteroscedasticity and leptokurtosis can be captured. Klose (2022), in turn, uses GARCH models to forecast the volatility of crypto assets and gold. In addition, he studies similarities and differences based on important factors related to liquidity premia, volatility and pronounced responses.
Classical machine learning tools, such as random forest (RF) and support vector machine (SVM) models, have been used to forecast volatility. SVM models, for example, have been used to forecast the volatility of the S&P 500 index, taking advantage of their tolerance to high-dimensional inputs (Gavrishchaka & Banerjee, 2006). On the other hand, some authors have opted to use hybridization strategies mixing SVM with other models such as GARCH, ARIMA and the wavelet transform to improve forecasting performance, for example, in the forecast of real stock market data, daily changes of the pound sterling, the New York Stock Exchange composite index and major stocks in Colombia (Chen et al., 2010; Rubio & Alba, 2022; Tang et al., 2009). In addition, RF models are widely used in volatility forecasting, e.g. for high-frequency historical data, crude oil and electricity market volatility, obtaining in each case competitive forecast errors for different forecast horizons (Luong & Dokuchaev, 2018; Wang et al., 2022).
Regarding Bitcoin volatility, it should be noted that, historically, cryptocurrencies exhibit higher volatility than other traditional asset classes and their returns exhibit a set of structural anomalies and breaks that could generate forecasting problems for the mentioned models. Ramos (2021) argues that, although simple in application, classic linear methodologies have some difficulties in dealing with events that have out-of-the-ordinary patterns, as also noted by Pesaran and Timmermann (2004) and Chatfield (2016). Contagion spillovers are also a phenomenon in cryptocurrencies, particularly in Bitcoin, which exhibits strong interdependence across different exchange markets. Pagnottoni (2019, 2020) have shown that this interdependence persists both at high and low frequencies.
Owing to these challenges, over the past decade different artificial intelligence techniques, such as artificial neural networks (ANN) and deep neural networks (DNN), have been pointed out in the scientific literature as a promising alternative (Sezer et al., 2020; Tealab, 2020; Tkáč & Verner, 2016). Research on nonlinear methodologies based on neural networks, extensively discussed in the nineties and abandoned due to computational limitations (Bengio et al., 1994), reappears in recent works. Therefore, scientific research, along with the computational progress seen in recent years due to the use of graphics processing units (GPUs), has assumed a fundamental role in bringing ANN to a larger audience. This is seen in simpler DNN structures (e.g. multilayer perceptron (MLP)) and in more complex DNN structures (e.g. recurrent neural networks (RNN) and long short-term memory (LSTM)) (Ramos et al., 2022).
In fact, many applications of DNN to the modeling and forecasting of time series have appeared in the scientific literature, reporting their success (Balcilar et al., 2017; Kristjanpoller & Minutolo, 2018; Lahmiri & Bekiros, 2019; Mallqui & Fernandes, 2019; Pichl et al., 2017; Ramos et al., 2022). Part of these works point in that direction when forecasting volatility and prices of cryptocurrencies and/or financial time series using methodologies such as deep learning and hybrid models with both classical and neural network techniques. These techniques have shown significant improvements over classical approaches (Smyl, 2020).
However, the lack of interpretability in DNN models, commonly referred to as the "black box" problem, is a major challenge in adopting these models, particularly in finance where interpretability is crucial for regulatory compliance, risk management, and stakeholder communication. Previous studies, such as Bracke et al. (2019), have applied Shapley values to compare the explainability of neural network-based models with logistic regression models for default risk analysis. The Shapley values provided a useful tool to interpret the neural network model, highlighting the importance of individual input variables in predicting the model output. In a recent study, Giudici and Raffinetti (2021) proposed a novel approach, called Shapley-Lorenz explainable artificial intelligence (SLXAI), which combines Shapley values and Lorenz curves to provide a more nuanced measure of model explainability. The effectiveness of their approach in explaining the predictions of a random forest model for credit rating was demonstrated. On the other hand, there are other methodologies such as Recurrent Neural Networks (RNN) with Temporal Attention and Bayesian Neural Networks (BNN). Each of these methodologies allows assigning weights in the recurrent neural networks based on relevance and probability distribution, thus solving problems of interpretability, overfitting, and low data (Mirikitani & Nikolaev, 2010;Qin et al., 2017).
Despite these advances, the trade-off between model performance and interpretability remains an open question, and further research is needed to develop more effective approaches to model explainability.
This highlights the importance of this article as a contribution to the literature that raises awareness of such methodologies among researchers in business and financial markets, so that these tools are used on a daily basis in research.

ARCH and GARCH models
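An ARCH/GARCH model for the daily return $y_t$ is given by

$$y_t = \mu_t + \sigma_t Z_t \qquad (1)$$

where $Z_t$ is an i.i.d. random process such that $E[Z_t] = 0$ and $\mathrm{Var}[Z_t] = 1$. The terms $\mu_t$ and $\sigma_t$ are measurable functions with respect to the $\sigma$-field $\Sigma_{t-1}$ generated by the historical returns $y_{t-k}$, $k \geq 1$.
Engle (1982) introduced the ARCH(p) model

$$\sigma_t^2 = \gamma + \sum_{i=1}^{p} \alpha_i \varepsilon_{t-i}^2, \qquad \varepsilon_t = y_t - \mu_t = \sigma_t Z_t \qquad (2)$$

The strength of Eq. (2) resides in how it handles positive serial correlation in $\varepsilon_t^2$, that is, large (small) $\varepsilon_t^2$ values are followed by large (small) $\varepsilon_{t+1}^2$ values. Bollerslev (1986) extended the ARCH(p) method to introduce the GARCH(p, q) model

$$\sigma_t^2 = \gamma + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2 \qquad (3)$$

allowing an improved expression for $\sigma_t^2$ based on lagged $\sigma_t^2$ values (the constants $\gamma$, $\alpha_i$, $i = 1, \ldots, q$, and $\beta_j$, $j = 1, \ldots, p$, are each positive). Seasonal and non-market impacts are integrated into GARCH models by treating $\gamma$ as a function of time.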
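To make the variance recursion of Eq. (3) concrete, the following is a minimal sketch rather than the estimation code used in the study. It assumes the usual GARCH form, in which the conditional variance is a constant plus weighted lagged squared residuals plus weighted lagged variances. The function name `garch_variance`, the argument names and the start-up rule (the sample variance fills in for missing lags) are illustrative assumptions.

```python
def garch_variance(residuals, gamma, alphas, betas):
    """Conditional variances from a GARCH-type recursion (illustrative sketch).

    alphas weight the lagged squared residuals, betas weight the lagged
    conditional variances. Passing betas=[] gives the ARCH special case.
    """
    # Assumption: lags that fall before the sample use the sample variance.
    start = sum(e * e for e in residuals) / len(residuals)
    variances = []
    for t in range(len(residuals)):
        value = gamma
        for i, alpha in enumerate(alphas, start=1):
            lagged_sq = residuals[t - i] ** 2 if t - i >= 0 else start
            value += alpha * lagged_sq
        for j, beta in enumerate(betas, start=1):
            lagged_var = variances[t - j] if t - j >= 0 else start
            value += beta * lagged_var
        variances.append(value)
    return variances


# Toy residuals, not Bitcoin data.
print(garch_variance([1.0, -2.0, 1.0], gamma=0.1, alphas=[0.5], betas=[0.2]))
```

In practice the parameters would be estimated by maximum likelihood; the sketch only evaluates the recursion for given values.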

Deep neural networks models
The MLP architecture is nowadays one of the most widely used network structures for classification and regression (Bishop, 1995). The MLP model is defined by

$$\hat{y} = \lambda_0 + \sum_{j=1}^{N} \lambda_j f_A\left(\omega_j^{T} y'\right) \qquad (4)$$

where $y$ represents the vector of inputs, $y' = (1, y^{T})^{T}$, $\omega_j$ is the weight vector of hidden node $j$, $\lambda_0, \lambda_1, \ldots, \lambda_N$ are the output weights and $\hat{y}$ is the network output. The function $f_A$ gives the hidden node output and is expressed as a squashing function, e.g. the logistic function.
From a data set of predefined outputs, neural networks can rapidly learn and adapt themselves, allowing them to model and forecast non-linear and highly complex structures. RNNs are a group of neural networks in which additional connections among neurons create cycles. The RNN cycles save and transmit information between neurons, building an inner memory which permits learning sequential information. In this way RNNs differ from standard neural networks, since memory allows them to detect sequential correlations.
RNNs may be trained via the backpropagation through time (BPTT) algorithm (Pineda, 1987). The outputs of the hidden layer units are calculated as

$$h_t = f_A\left(M_l\, y_t + M_f\, h_{t-1}\right) \qquad (5)$$

where $f_A$ is the activation function of the hidden layer, $y_t$ is the input coming from the preceding layer, $M_l$ denotes the connection weights from the preceding layer, $h_{t-1}$ is the recurrent output determined in the previous step and $M_f$ is its weight (Hopfield, 1982; Rumelhart et al., 1986). Different researchers have demonstrated that RNNs can retain only limited information, causing long-term dependency issues. To address this problem, RNN frameworks such as the LSTM architecture are available (Hochreiter & Schmidhuber, 1997; Malhotra et al., 2015).
The LSTM model pioneered by Hochreiter and Schmidhuber (1997) is probably the preferred deep learning method for natural language processing problems, as it can handle long-term dependencies inherent in the data and overcome vanishing gradient issues. In their standard form, the equations for calculating the outputs and state values of the LSTM module are given by

$$f_t = f_A\left(M_f\,[h_{t-1}, y_t] + b_f\right)$$
$$i_t = f_A\left(M_i\,[h_{t-1}, y_t] + b_i\right)$$
$$o_t = f_A\left(M_o\,[h_{t-1}, y_t] + b_o\right)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(M_c\,[h_{t-1}, y_t] + b_c\right), \qquad h_t = o_t \odot \tanh(c_t)$$

where $f_A$ represents the activation function, $y_t$ the input data, $h_{t-1}$ the prior output, $c_t$ the cell state with candidate weights $M_c$ and bias $b_c$, and $M_f$, $M_i$, $M_o$ and $b_f$, $b_i$, $b_o$ represent the weights and biases of the forget, input and output gates (Chung et al., 2014; Hochreiter & Schmidhuber, 1997).
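As an illustration of the single-hidden-layer MLP described by Eq. (4), the sketch below computes the network output for given weights. It assumes a logistic squashing function and a bias term prepended to the input vector. The names `mlp_forecast`, `hidden_weights` and `output_weights` are hypothetical and not taken from the study.

```python
import math


def logistic(x):
    """Logistic squashing function."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_forecast(inputs, hidden_weights, output_weights):
    """Output of a one-hidden-layer MLP (illustrative sketch).

    hidden_weights: one weight vector per hidden node, each of length
                    len(inputs) + 1 (the first entry multiplies the bias 1).
    output_weights: [bias, weight of node 1, ..., weight of node N].
    """
    augmented = [1.0] + list(inputs)  # prepend the constant 1 to the inputs
    hidden = [
        logistic(sum(w * x for w, x in zip(node_weights, augmented)))
        for node_weights in hidden_weights
    ]
    return output_weights[0] + sum(
        w * h for w, h in zip(output_weights[1:], hidden)
    )


# Two lagged returns as inputs and two hidden nodes; toy weights only.
print(mlp_forecast([0.01, -0.02], [[0.0, 1.0, 1.0], [0.1, -1.0, 0.5]], [0.0, 0.3, -0.3]))
```

Training (choosing the weights by minimizing a loss) is left out on purpose; the sketch covers the forward pass only.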
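The recurrent update described above, in which each hidden output depends on the current input and on the previous hidden output, can be sketched for a single scalar unit as follows. This is an assumption-laden toy: `tanh` stands in for the activation function, and `input_weight` and `recurrent_weight` play the roles of the weights the text calls M l and M f.

```python
import math


def rnn_hidden_states(inputs, input_weight, recurrent_weight, initial_state=0.0):
    """Hidden states of a one-unit recurrent layer (illustrative sketch)."""
    states = []
    previous = initial_state
    for y_t in inputs:
        # The new state mixes the current input with the previous state.
        previous = math.tanh(input_weight * y_t + recurrent_weight * previous)
        states.append(previous)
    return states


print(rnn_hidden_states([0.5, 0.0, -0.2], input_weight=1.0, recurrent_weight=0.5))
```

Because each state feeds into the next, an early input keeps influencing later states, but its influence shrinks at every step, which is one way to picture the long-term dependency issue mentioned above.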
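A single LSTM step can be sketched for scalar inputs as below. This follows the commonly used gate formulation (forget, input and output gates plus a candidate cell value) and is an assumption about the general form, not a reproduction of the architecture trained in the study. The dictionary layout of `params` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(y_t, h_prev, c_prev, params):
    """One scalar LSTM step (illustrative sketch).

    params maps each of "forget", "input", "output" and "candidate" to a
    tuple (input_weight, recurrent_weight, bias).
    """
    def pre_activation(name):
        w_in, w_rec, bias = params[name]
        return w_in * y_t + w_rec * h_prev + bias

    forget_gate = sigmoid(pre_activation("forget"))
    input_gate = sigmoid(pre_activation("input"))
    output_gate = sigmoid(pre_activation("output"))
    candidate = math.tanh(pre_activation("candidate"))

    c_t = forget_gate * c_prev + input_gate * candidate  # updated cell state
    h_t = output_gate * math.tanh(c_t)                   # updated output
    return h_t, c_t


toy_params = {name: (0.5, 0.1, 0.0) for name in ("forget", "input", "output", "candidate")}
print(lstm_step(0.02, h_prev=0.0, c_prev=0.0, params=toy_params))
```

The cell state `c_t` is carried forward additively, which is what lets the unit preserve information over longer spans than the plain recurrent update.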

Cross-validator and performance metrics
The methodology proposed by Hodrick and Prescott (1997) was used to remove the cycle and trend components, and the CUSUM algorithm by Duarte and Watanabe (2018) to study structural breaks. For inferential analysis, several hypothesis tests were applied: Jarque-Bera, skewness and kurtosis tests to study normality, and the BDS test to study data independence.
To systematically assess the quality of forecasting models, error metrics are used. The most common performance/error metrics are the following: mean absolute error (MAE) and mean absolute percentage error (MAPE) (Willmott & Matsuura, 2005). Considering the time series $\{y_t\}_{t \in T}$ and the past observations from periods $1, \ldots, t$, and being $y_{t+h}$ an unknown value in the future $t + h$ and $\hat{y}_{t+h}$ its forecast, the prediction error corresponds to the difference of these two values, that is,

$$e_{t+h} = y_{t+h} - \hat{y}_{t+h} \qquad (10)$$

where MAE and MAPE are defined, respectively, by

$$\mathrm{MAE} = \frac{1}{s} \sum_{i=1}^{s} \left| e_{t+i} \right| \qquad (11)$$

$$\mathrm{MAPE} = \frac{100}{s} \sum_{i=1}^{s} \left| \frac{e_{t+i}}{y_{t+i}} \right| \qquad (12)$$

where $s$ corresponds to the number of observations in the forecasting sample (forecasting window).
In addition, a Diebold-Mariano test (Diebold & Mariano, 2002) was performed with the best-performing model from each category (ARCH/GARCH vs neural networks). The Diebold-Mariano test is the most widely used tool for assessing whether differences in forecast accuracy are significant. It is a z-test of the hypothesis that the mean of the loss differential series is zero, with the loss differential defined by

$$d_k = L\left(Z_k^{(1)}\right) - L\left(Z_k^{(2)}\right) \qquad (13)$$

where $Z_k = y_k - \tilde{y}_k$ is the prediction error of a given model at timestep $k$ (the superscripts index the two competing models) and $L$ is the loss function. To provide forecast at k , loss function is defined as
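For readers unfamiliar with the normality diagnostics mentioned above, the sketch below computes sample skewness, kurtosis and the Jarque-Bera statistic from their textbook definitions. The function name `jarque_bera` is illustrative, and the study may have used a library routine instead.

```python
def jarque_bera(sample):
    """Return (skewness, kurtosis, Jarque-Bera statistic) of a sample."""
    n = len(sample)
    mean = sum(sample) / n
    # Central moments of order 2, 3 and 4 (population form, divisor n).
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # equals 3 for a normal distribution
    statistic = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    return skewness, kurtosis, statistic


# Toy returns with one large spike, which inflates skewness and kurtosis.
print(jarque_bera([0.01, -0.02, 0.015, -0.01, 0.25, 0.0, -0.005]))
```

Under the null hypothesis of normality the statistic is asymptotically chi-squared with two degrees of freedom, so large values lead to rejection.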
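The two error metrics can be written in a few lines. The sketch below assumes the usual definitions (mean of absolute errors, and mean of absolute errors relative to the observed value, expressed in percent). The function names are illustrative.

```python
def mean_absolute_error(actual, forecast):
    """MAE: average absolute difference between observed and forecast values."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return sum(abs(e) for e in errors) / len(errors)


def mean_absolute_percentage_error(actual, forecast):
    """MAPE in percent; undefined when an observed value is exactly zero."""
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)


observed = [2.0, 4.0]
predicted = [1.0, 5.0]
print(mean_absolute_error(observed, predicted))             # 1.0
print(mean_absolute_percentage_error(observed, predicted))  # 37.5
```

Note that MAPE divides by the observed value, so it can become very large for returns close to zero, which is worth keeping in mind when reading MAPE values for return series.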
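A simplified version of the Diebold-Mariano statistic is sketched below for one-step-ahead forecasts. Several choices are assumptions made for illustration only: the absolute-error loss, the use of the lag-zero variance alone (appropriate for horizon one), the normal approximation for the p-value, and the omission of the small-sample correction of Harvey et al. (1997).

```python
import math


def diebold_mariano(errors_a, errors_b):
    """DM statistic and two-sided p-value for two forecast error series."""
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two error series of equal length >= 2")
    # Loss differential under an assumed absolute-error loss.
    diffs = [abs(a) - abs(b) for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    variance = sum((d - mean_diff) ** 2 for d in diffs) / n
    if variance == 0.0:
        raise ValueError("loss differential has zero variance")
    statistic = mean_diff / math.sqrt(variance / n)
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1.0 + math.erf(abs(statistic) / math.sqrt(2.0)))
    p_value = 2.0 * (1.0 - cdf)
    return statistic, p_value


stat, p = diebold_mariano([1.0, -2.0, 1.0, 2.0], [0.5, 1.0, -1.5, 1.0])
print(stat, p)
```

A positive statistic means the first model has the larger average loss; for multi-step horizons the variance would also need autocovariance terms.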

Data
Data used in this study was obtained from the Yahoo Finance public API by calling the ticker "BTC-USD" from 07-09-2014 to 01-05-2022 and obtaining the "Close Price" values expressed in U.S. Dollars. Using this time series, daily logarithmic returns were calculated, given by the expression

$$y_t = \ln\left(\frac{P_t}{P_{t-1}}\right) = \ln P_t - \ln P_{t-1}$$

where $P_t$ denotes the close price at time $t$. For time series forecasting there is a precedent for transforming non-iid returns to a closer approximation using log normalization (or the Fisher Transform) for the prediction process. The inverse transform is then performed on the output to restore the original distribution, so that the predicted returns can be used to calculate the predicted volatility. To achieve the research purpose of this study, two time series variables are defined: BTC-USD, which represents Bitcoin's Daily Closing Prices (Fig. 1), and BTC-USD-RET, which represents Bitcoin's Daily Returns (Fig. 2).
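The log-return calculation can be illustrated as follows. The prices are made-up numbers, not Yahoo Finance data, and `log_returns` is a hypothetical helper name.

```python
import math


def log_returns(prices):
    """Daily logarithmic returns ln(P_t / P_{t-1}) from a list of close prices."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


closes = [100.0, 110.0, 99.0]  # illustrative close prices
returns = log_returns(closes)
print(returns)

# Log returns are additive: their sum equals the log return over the whole period.
print(sum(returns), math.log(closes[-1] / closes[0]))
```

A series of n prices yields n - 1 returns, which is why the return series is one observation shorter than the price series.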

Empirical findings
Initial steps in this study involved performing statistical calculations to better describe the BTC-USD-RET series. Results presented in Table 1 showed a strongly positively skewed time series with extremely high (leptokurtic) kurtosis and a non-normal distribution, confirmed by the rejection of all the null hypotheses of normality. In addition, both the Augmented Dickey-Fuller (ADF) and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) tests confirmed the series' stationarity and independence and therefore the data is i.i.d., highlighting the importance of implementing nonlinear models for forecasting time series.
Several structural break situations were identified by applying the CUSUM algorithm, where shifts along the time series were observed in several regimes, as can be seen in Pratas (2022) for the same data set. As pointed out, the high number of structural breaks might represent forecasting difficulties for the classical econometric models and an advantage for the deep learning methodologies.
The subsequent step in the analysis involves model implementation, with the stationary nature of the time series allowing for the use of autocorrelation and partial autocorrelation functions to determine the optimal number of lags for the models. It was found that the ARCH(4) and GARCH(4, 2) models had the best expected generalization properties, with both AIC and BIC showing lower values for the given parameters. This finding is contrary to the literature's preference for GARCH(1, 1) but consistent with results on volatility forecasting for Bitcoin, as noted by Senarathne (2019).
In terms of forecasting itself, the out-of-sample forecasts of the ARCH model (Fig. 3) and the GARCH model (Fig. 4) do not seem to be well adjusted to the real data, as the forecast line does not follow the real data line.
For the neural network study, prior to training, the entire BTC-USD-RET dataset underwent an exponential smoothing pre-processing procedure. Three to five hidden layers were employed in conjunction with architecture-specific hyperparameters and the ADAM optimization algorithm, as recommended by Brownlee (2018) and Kingma and Ba (2015), respectively. Cross-validation was performed using the Forward Chaining methodology, as suggested by Ramos (2021). Model performance was assessed using the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE), and forecasts were generated for one, three, and seven-day horizons. The resulting models were evaluated, and the forecasts were plotted.
Upon examination of the DNN models, it was found that all models exhibited some degree of forecasting ability. However, the MLP model performed better on shorter time horizons (one-day and three-days), while the RNN model had lower prediction errors on the seven-day horizon. Interestingly, the LSTM model, despite its complexity, performed the worst in terms of accuracy. This was anticipated, as LSTM models tend to underperform when forecasting stationary time series, as pointed out by Ramos et al. (2022). Moreover, this type of neural network is the one that requires the most computational cost and time, making it highly inefficient for our forecast. Nonetheless, both the MLP and RNN models produced smoother forecasts with less fluctuation, but failed to capture large volatility spikes, such as the one that occurred on day two. In contrast, the LSTM model reacted strongly to such movements and attempted to adjust its forecast accordingly, due to its long-term memory properties that allow the model to "remember" that past volatility spikes may lead to high volatility spikes in the future, a phenomenon known as volatility clustering.
To compare the performance of ARCH/GARCH models and DNN models, MAE and MAPE values for their out-of-sample forecasts were calculated for three different time horizons. For the DNN models, the parameters of the neural network (weights and biases) benefited from a pseudo-random initialization instead of using a fixed seed (Glorot & Bengio, 2010). To ensure the reliability of the results and avoid outliers, the forecasting was conducted in a loop of 200 runs, and the 5% worst and best results were excluded (following Ramos, 2021). The range of MAPE values, with the lower and upper bounds trimmed by 5%, is presented in Table 2, along with the MAE values for models with intermediate forecast quality chosen from each architecture (as shown in Fig. 5).
Once all the models had been estimated, metrics calculated, and results presented, it was deemed useful to conclude with a visual representation comparing five models (see Fig. 6). Within the classical approach, results indicate that the ARCH(4) and GARCH(4, 2) models are superior, with ARCH(4) being the best model in terms of forecast accuracy as measured by mean absolute percentage error (MAPE).
Regarding the deep learning approach, the MLP model demonstrated superior performance for shorter time horizons (1-day and 3-days), while the RNN model showed lower prediction errors for the seven-day horizon. This finding can be attributed to the basic memory capabilities of the RNN model.
Deep learning methodologies seem to show advantages over classical methodologies in terms of forecast quality, since nonlinear dependencies of the data are better captured. However, it is noteworthy that these models are associated with considerably higher computational costs and greater implementation complexity compared to classical techniques (in line with the scientific literature: Lopes et al., 2021 and Ramos et al., 2021). Despite these limitations, the implementation of deep learning models in the present study yielded a substantial reduction in prediction errors. As such, it can be inferred that the increased computational costs associated with deep learning model implementation are justifiable, particularly when considering that the MLP model, which is the least complex model in terms of computational requirements, provided the highest forecast accuracy for the time series studied. Finally, to infer whether the forecast accuracy of these two models is the same, the Diebold-Mariano test, with the modification suggested by Harvey et al. (1997), was implemented (see Table 3).
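The forward-chaining cross-validation mentioned above can be pictured as a sequence of expanding training windows, each followed by a short test window. The sketch below is one common way to generate such splits. The function name and the parameters `initial_train` and `horizon` are assumptions for illustration and are not taken from Ramos (2021).

```python
def forward_chaining_splits(n_obs, initial_train, horizon):
    """Expanding-window splits: train on [0, end), test on the next `horizon` points."""
    splits = []
    train_end = initial_train
    while train_end + horizon <= n_obs:
        train_idx = range(0, train_end)
        test_idx = range(train_end, train_end + horizon)
        splits.append((train_idx, test_idx))
        train_end += horizon  # the tested block joins the next training window
    return splits


for train_idx, test_idx in forward_chaining_splits(n_obs=10, initial_train=4, horizon=3):
    print(list(train_idx), list(test_idx))
```

Unlike shuffled k-fold cross-validation, every test block lies strictly after its training data, so no future information leaks into the fit.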
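The trimming of repeated runs described above can be sketched as follows: sort the metric values from all runs, drop a fixed fraction from each end, and report the range of what remains. The helper name `trimmed_range` and the synthetic MAPE values are illustrative only.

```python
import random


def trimmed_range(values, trim_fraction=0.05):
    """Range (min, max) of values after dropping trim_fraction from each tail."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim_fraction)  # number of runs dropped per tail
    kept = ordered[cut:len(ordered) - cut]
    return kept[0], kept[-1]


# Synthetic MAPE values standing in for 200 training runs with random initial weights.
rng = random.Random()
mape_runs = [100.0 + rng.gauss(0.0, 5.0) for _ in range(200)]
print(trimmed_range(mape_runs))
```

With 200 runs and a 5% trim, the 10 best and 10 worst runs are discarded and the reported range covers the remaining 180.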
With this, for a significance level of 5%, there is statistically significant evidence to suggest that forecasts do not have the same precision and one is significantly better than the other. According to previous information, the MLP model has better forecast accuracy.
These facts are consistent with previous research findings and highlight the importance of these new methodologies and how researchers must be equipped with knowledge about how these models can help to understand economic reality.

Conclusion
In recent years, Bitcoin has received significant attention from scholars due to its distinctive patterns and characteristics, including high volatility, multiple structural breaks, and unusual probability distributions. However, academic literature has noted a lack of research on this topic. This study contributes to understanding factors underlying Bitcoin volatility by examining price of production (electricity costs), programmed scarcity, programmed supply shocks (halvings), demand shocks (price-hash rate spirals), hash rate, network trust, and liquidation cascades.
Our findings suggest that, among the classical models, ARCH(4) and GARCH(4, 2) are the most effective for forecasting Bitcoin returns, with the ARCH(4) model performing best in terms of the MAPE metric. Among deep learning approaches, the MLP model showed the best performance on shorter time horizons (one-day and three-days), while the RNN model had the lowest prediction errors on the seven-day horizon. The LSTM model, being the most complex, performed weakly among the deep learning methods. Deep learning models have advantages over classical methods in terms of forecast quality, providing an effective capability to capture nonlinear dependencies in the data. However, higher computational costs and implementation difficulties are also involved. Nonetheless, the improvement in prediction errors justifies their implementation, especially considering that the MLP model used in this study is not the most complex or computationally expensive. Our results are consistent with prior research and underscore the significance of these new methodologies for understanding economic reality. Although this study has made valuable contributions to understanding Bitcoin's returns and volatility factors, it is also important to recognize its limitations. One limitation is that it focuses on internal mechanisms of protocols as drivers of volatility, with less attention given to market dynamics specific to the cryptocurrency market such as low liquidity, market microstructure, high leverage, and market manipulation. Another limitation is the limited range of ARCH/GARCH models, which may not be the most advanced or effective for forecasting. In addition, models in this study used only one variable and did not consider external factors, which could be important in financial time series with nonlinear properties. Future research could adopt a multi-variable perspective that considers derivatives data, on-chain data, and market sentiment data, as well as the use of hybrid models to better understand Bitcoin volatility.
In addition, this work also makes a reflective contribution to the scientific literature by comparing classical methodologies (ARCH and GARCH models) and deep learning methodologies (DNN models) for returns and volatility forecasting. According to the scientific literature, classical methodologies are still the most used by professionals in economic, financial, and business fields (Wilson & Spralls, 2018). As expected, the results of the analysis show that DNN models have better forecast quality. However, it is important to highlight not only the potential of deep learning methodologies, but also the significant difference in forecast quality. In the economic and financial field, it is noteworthy that professionals often deal with high error rates. In an increasingly competitive economic environment, those who use robust tools to support decision-making therefore have an advantage. It is thus important to encourage the training and awareness of these professionals, particularly investors in the cryptocurrency market, to use more accurate methodologies (e.g., deep learning). This is a challenge that this work aims to highlight.
In conclusion, this study aims to provide a valuable contribution to the understanding of Bitcoin's daily returns and the potential of deep learning methodologies. While many researchers have traditionally used classical approaches to volatility models, the recent advancements in computational power suggest that deep learning methodologies may offer a promising option for improving forecast quality. It is important for researchers to consider the use of these advanced methodologies, not only in the study of crypto assets but in other areas as well.