Forecasting Cryptocurrency Prices Time Series Using Machine Learning

Abstract. This paper describes the construction of a short-term forecasting model for cryptocurrency prices using a machine learning approach. The modified Binary Auto Regressive Tree (BART) model is adapted from standard regression tree models to time series data. BART combines the classic classification and regression trees (C&RT) algorithm with autoregressive ARIMA models. Using the BART model, we made short-term forecasts (from 5 to 30 days) for the three most capitalized cryptocurrencies: Bitcoin, Ethereum and Ripple. We found that the proposed approach was more accurate than the ARIMA-ARFIMA models in forecasting cryptocurrency time series, both in periods of slow rising (falling) and in periods of transition dynamics (change of trend).


Introduction
The rapid development of digital currencies during the last decade is one of the most controversial and ambiguous innovations in the modern global economy. Significant fluctuations in the exchange rates of cryptocurrencies and their high volatility, as well as the lack of legal regulation of their transactions in most countries, have resulted in significant risks associated with investment in crypto assets. This has led to heated discussions about their place and role in the modern economy (see, for example, [1][2][3][4][5]). Therefore, developing appropriate methods and models for predicting the prices of crypto assets is relevant both for the scientific community and for financial analysts, investors and traders.
Methodological approaches to forecasting prices for financial assets depend on an analyst's understanding of the causal relationships in the pricing process.
For example, the forecasting model can be specified as a price formation model:
─ Based on the interaction of market players (demand-supply models) that make economic decisions based on some indicators or regularities, taking into account objective economic laws or laws of behavioral finance (econometric and balance models);
─ Given the past dynamics (time series models and autoregressive models);
─ Taking into account production-technological possibilities of creating the corresponding asset (in particular, for commodity markets, fundamental valuation of shares, technological opportunities for mining cryptocurrency, etc.);
─ Based on the consideration of random factors and events, for example, external shocks, which complicate the formal description of cause and effect relationships (stochastic models).
It should be noted that forecasting cryptocurrencies' prices is fundamentally different from forecasting the prices of other financial assets, in particular ordinary (fiat) currencies, whose dynamics have been the subject of a large number of theoretical and empirical studies. There are two fundamentally different approaches to forecasting the exchange rate dynamics of currencies. The first approach is to build a causal model that describes the relationship between exchange rates and other macroeconomic variables (in particular, the rates of economic growth, trade and balance of payments, purchasing power parity, public debt, inflation rates, etc.) within a certain theoretical economic concept.
The other approach is to study only the time series and make a prediction based on the processing and analysis of past observations. The most common models are the Box-Jenkins ARIMA time series models and their modifications, GARCH models, or artificial neural networks.
It should be noted that there is no consensus on the fundamental value of cryptocurrencies among scholars. The prevailing thesis is that the exchange rate of the majority of cryptocurrencies is determined only by the ratio of demand and supply [3,4,6-10].
Liu and Tsyvinski's [11] empirical analysis of the three most capitalized cryptocurrencies (Bitcoin, Ripple, and Ethereum) did not reveal a static relationship between the returns of cryptocurrencies and the difficulty of mining them.
At the same time, macroeconomic factors, which usually determine the dynamics of currency, stock and commodity markets have no significant effect on the dynamics of the cryptocurrencies market.
Conrad et al. [12] also found that the influence of the US stock market (S&P 500 index) and of the global stock market (Nikkei 225 index) on Bitcoin's volatility was not significant.
In addition, the studies reported in [1,8,9] show that the price dynamics of cryptocurrencies is described by classical log-periodic models of price bubbles of Sornette [13] and their modifications.
A number of recent cryptocurrency market studies show that, unlike other financial assets, cryptocurrency prices are influenced by a number of specific factors that shape their demand, such as the number of Google Trends searches and the number of posts in social networks and other mass media [6,14-16]. These studies substantiated the feasibility of using non-typical factors as predictors.
All of these factors complicate the development of causal econometric models of cryptocurrency price dynamics.
Recently, non-parametric methods based on Machine Learning and Deep Learning have gained popularity for the analysis and forecasting of financial and economic time series.
Machine Learning models are based on special artificial networks that solve prediction and classification problems by utilizing training sequences in the data. The effectiveness of such models depends on the training speed and the degree of universality of the approximating functions.
These models combine an arsenal of powerful methods, such as Artificial Neural Network (ANN), Support Vector Machines (SVM), Decision and Classification Tree (DT, CT), Fuzzy Logic, Genetic Algorithms (GA), linear and nonlinear statistical models, etc.
Examples of their effective use in forecasting exchange rates and stock indices are given, in particular, by Peng et al. [17].
Several studies [18][19][20] reported the results of the Bitcoin exchange rate forecasting using classical ARIMA models and using different methods of machine learning, such as Random Forest (RF), Logistic Regression (LR) and Linear Discriminant Analysis (LDA), and Long Short-Term Memory (LSTM). The results from these analyses showed that the models that relied on training proved to be better suited for forecasting both the prices of cryptocurrencies and their volatility.
Thus, in our view, the second approach, which is based on the application of the time series analysis using the CRISP-DM methodology [22], is more appropriate for predicting price trends in cryptocurrency.
The purpose of our work is to construct a short-term price forecasting model for the 3 cryptocurrencies with the highest market capitalization using binary autoregressive models and machine learning technology.

CRISP-DM Approach
To solve the problem of forecasting the dynamics of cryptocurrencies, we used the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology (Fig. 1-2). According to CRISP-DM, intelligent analysis is a continuous process with many cycles and feedback loops, and it has six phases (I-VI).
The main advantage of CRISP-DM is that it is platform- and application-neutral and can be adapted to various applied problems. Fig. 2 shows some of the CRISP-DM phases in the functional diagram of cryptocurrency dynamics forecasting: Phase II: Data understanding, Phase III: Data preparation, Phase IV: Modeling, Phase V: Evaluation.
The CRISP-DM methodology is the most widespread publicly available standard process model that describes the major phases and common methods of data mining.

Regression Tree
The regression tree is a class of regression models that allows separating the input space of factor variables into segments. Subsequently, a separate piecewise regression model can be constructed for each of them representing a regression function in an intuitive and visual form [23][24].
In such a tree, internal nodes contain rules for splitting the space of explanatory variables; branches indicate the conditions and the transition between the nodes; and tree leaves are local regression models.
The essence of this method is in sequential division of the data set into nonintersecting classes, which, in turn, are also subject to a breakdown by a partition efficiency criterion.
The decision tree consists of the following elements: "nodes", "leaves" and "branches". "Branches" contain records of attributes which define the target function (result variable), the "leaves" are the values of the target function, and "nodes" are the remaining attributes under which the classification takes place.
There are two types of trees: (i) classification trees, for which the result of the prediction is the class to which the data belong; and (ii) regression trees, for which the result is the predicted value of the target function.
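As a minimal illustration of the structure described in this section, the following Python sketch represents a regression tree whose internal nodes hold split rules and whose leaves hold local linear models. The names `Node` and `predict`, as well as the example threshold and coefficients, are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    # Internal node: split rule "x[feature] <= threshold" (hypothetical layout).
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    # Leaf: local linear model y = intercept + sum(coefs[i] * x[i]).
    intercept: float = 0.0
    coefs: Sequence[float] = ()

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def predict(node: Node, x: Sequence[float]) -> float:
    """Follow the branches down to a leaf, then apply its local model."""
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.intercept + sum(c * v for c, v in zip(node.coefs, x))


# Illustrative tree: one split on x[0] at 0.0, two leaves with different models.
tree = Node(
    feature=0,
    threshold=0.0,
    left=Node(intercept=1.0, coefs=(2.0,)),
    right=Node(intercept=-1.0, coefs=(0.5,)),
)
```

For an input on the left of the split, `predict(tree, [-1.0])` applies the left leaf's model; for an input on the right, the right leaf's model is used, which is what makes the overall function piecewise.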

BART Algorithm
Let us consider the proposed approach, which we call BART (Binary Auto Regressive Tree). It is a generalization of standard regression tree models that is adapted to time series data. BART combines the classic classification and regression trees (C&RT) algorithm [24][25] with the standard autoregressive integrated moving average (ARIMA) models and their components (AR, MA). ART (Auto Regressive Tree) models are closely related to the class of threshold autoregressive (TAR) models and their modifications SETAR and ASTAR [24]. The SETAR and ASTAR models are linear models that construct multivariate adaptive regression splines (MARS) based on time series [26][27]. BART models differ from the SETAR and ASTAR models in two ways: (1) error estimates for models based on BART differ from one another; (2) BART models allow for gaps between the built-in autoregressive models.
To convert a time series, the "window" data conversion method is used. The result variable $Y_t$ in this algorithm corresponds to the previous value ($Y_{t-1}$) and the value with the lag p ($Y_{t-p}$). This separation of the input space into segments (Fig. 3) makes it possible to construct a separate (local) model for each of them and to represent a piecewise function as an autoregressive tree (Fig. 4) in an intuitive visual form. Most such algorithms apply a recursive separation of the training data. In BART, unlike other algorithms, a step-by-step (staged and iterative) method of constructing a tree is used:
Step 1. The construction of a regression tree begins from a single value (root node), which is defined as the median (Me, the second quartile $Q_{50\%}$) of the entire time series $Y_t$ and is calculated by the equation

$$\mathrm{Me}(Y_t) = Q_{50\%}(Y_t), \qquad P\big(Y_t > \mathrm{Me}(Y_t)\big) = 0.5. \tag{1}$$

The median of the time series is defined as the median of the distribution of the realizations of the random variable at time t, that is, a real number that is exceeded with probability 0.5. For a stationary series and a series with a symmetric distribution, this value does not depend on the time of observation t, $\mathrm{Me}(Y_t) = \mathrm{Me}(Y)$, and it coincides with the mean value of the series. In the literature, the median is sometimes regarded as a prototype of a simple robust (stable) estimate.
Step 2. The best split is found for each unprocessed node, and it is selected according to a predefined rule.
These procedures are performed similarly to the C&RT algorithm. The difference lies in the accepted rules and in the criteria for evaluating and terminating splitting. To select the best split, we used an alternative selection criterion (an informational criterion) based on the entropy indicator, because it gives preference to options with lower tree complexity. The algorithm thus determines an entropy information gain for each candidate split.
In constructing BART, the number of branches (branching) is 2, that is, each node has two child nodes. The final tree is chosen from these nodes, and we have to evaluate informativeness of not only the predictor nodes that divide the time series into subsets, but also of those that separate a certain group of subsets from the set, that is, the subtree from the rest of the tree.
Entropy criterion. Initially, the probability is estimated as the frequency of assigning a particular observation to a certain subset (subtree), and the entropy $\hat{H}$ of the sample $Y^l$ is calculated using the following equation:

$$\hat{H}(P, N) = -\frac{P}{P+N}\log_2\frac{P}{P+N} - \frac{N}{P+N}\log_2\frac{N}{P+N}. \tag{2}$$

After the information φ in the node is obtained for a certain predecessor, the entropy is calculated using the following equation:

$$\hat{H}_{\varphi}(P, N, p, n) = \frac{p+n}{P+N}\,\hat{H}(p, n) + \frac{P+N-p-n}{P+N}\,\hat{H}(P-p, N-n), \tag{3}$$

where P is the number of objects that belong to the subset C, and p is the number of those objects that satisfy the membership conditions of the subset, $p \le P$; similarly, N and n are the corresponding numbers of objects for «not C», $n \le N$.
Then the entropy of the sample that satisfies the conditions is $\hat{H}(p, n)$, and the probability of obtaining an element from this sample is $\frac{p+n}{P+N}$.
Similarly, for the remaining sample, the entropy is $\hat{H}(P-p, N-n)$ and the probability is $\frac{P+N-p-n}{P+N}$. Thus, the entropy of the whole sample after obtaining information φ is calculated using equation (3). Then the decrease of entropy can be calculated as:

$$\mathrm{IGain}(P, N, p, n) = \hat{H}(P, N) - \hat{H}_{\varphi}(P, N, p, n), \tag{4}$$

which is called the entropy information gain, that is, the amount of information about the current division of the tree into two classes «c» and «not c».
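The entropy information gain described above can be sketched in a few lines of Python. The sketch assumes the standard two-class form, in which P and N count the objects of «c» and «not c» in the whole sample, and p and n count those selected by the candidate split; the function names `entropy` and `info_gain` are hypothetical.

```python
import math


def _h(z: float) -> float:
    """One entropy term -z*log2(z), with the convention 0*log2(0) = 0."""
    return -z * math.log2(z) if z > 0 else 0.0


def entropy(P: int, N: int) -> float:
    """Entropy of a sample with P objects in «c» and N objects in «not c»."""
    total = P + N
    if total == 0:
        return 0.0
    return _h(P / total) + _h(N / total)


def info_gain(P: int, N: int, p: int, n: int) -> float:
    """Decrease of entropy after a split that selects p of P and n of N objects."""
    total = P + N
    inside = p + n            # objects that satisfy the split condition
    outside = total - inside  # objects that do not
    after = (inside / total) * entropy(p, n) \
        + (outside / total) * entropy(P - p, N - n)
    return entropy(P, N) - after
```

A split that separates the two classes perfectly, for example `info_gain(10, 10, 10, 0)`, recovers the full entropy of the sample, while a split that leaves the class proportions unchanged, such as `info_gain(10, 10, 5, 5)`, gains nothing.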
In addition, as the early termination criterion Q in the BART algorithm, we used an extended Bayesian information criterion (EBIC) [28], which minimizes the statistic given in equation (5), where SSE is the sum of squared residuals of the model; J is the number of model parameters; n is the number of examples in the training sample; and p is a quantity that characterizes the complexity of the model space (it is the product of the tree size and the number of explanatory variables).
In equation (5), the first term is determined by the maximum value of the log-likelihood function, and the second is a penalty for the model complexity.
Splitting of the nodes continues as long as the EBIC value decreases. Note that this criterion cannot be applied in the recursive version of the regression tree algorithm, because during recursive tree construction only a part of the model is considered at a time, without considering the complete model as a whole.
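The stopping rule can be illustrated with a small Python sketch. The exact form of the statistic is an assumption here: the code uses one commonly cited Gaussian variant of EBIC, n·ln(SSE/n) + J·(ln n + 2γ·ln p), in which the weight γ and the function names `ebic` and `accept_split` are hypothetical and not taken from the paper.

```python
import math


def ebic(sse: float, n: int, J: int, p: int, gamma: float = 1.0) -> float:
    """EBIC-style score (assumed Gaussian form): fit term plus complexity penalty.

    sse   - sum of squared residuals of the whole model
    n     - number of training examples
    J     - number of model parameters
    p     - size of the model space (tree size times number of explanatory variables)
    gamma - assumed weight of the model-space penalty
    """
    if sse <= 0 or n <= 0 or p <= 0:
        raise ValueError("sse, n and p must be positive")
    fit = n * math.log(sse / n)
    penalty = J * (math.log(n) + 2.0 * gamma * math.log(p))
    return fit + penalty


def accept_split(score_before: float, score_after: float) -> bool:
    """A split is kept only if it lowers the score of the whole tree."""
    return score_after < score_before


# A split that barely reduces SSE but doubles the parameter count is rejected.
before = ebic(sse=10.0, n=100, J=2, p=2)
after = ebic(sse=9.9, n=100, J=4, p=4)
```

Because the score is computed for the whole tree, it can only be evaluated when the complete model is available at each step, which is the reason the text gives for preferring the iterative construction.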
For BART, the simplification procedure (i.e., early termination of the tree branching) is more important than, for example, for classification trees. This is because regression trees tend to be more complex: the values of the investigated metric variable (for example, a price in regression) are much more diverse than qualitative data.
Step 3. If the selected split improves the model and it is valid with an entropy information gain, then this split is performed and step 2 is repeated. Otherwise, the final tree is selected and the BART algorithm execution procedure is considered complete.
Abandoning recursion in the BART algorithm in favor of an iterative version gives complete control over the tree construction process, that is, it provides a "softer" control of tree construction through the following: (1) determining an arbitrary order in which nodes are split; (2) introducing early termination rules and algorithms that analyze both separate nodes and the regression tree as a whole; (3) terminating the construction of the regression tree at any time.
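The control flow of such an iterative construction can be sketched as follows. This is a simplified illustration, not the authors' implementation: it swaps in SSE reduction for split selection, uses the leaf mean in place of a local ARIMA model, and scores the whole tree with a BIC-like statistic. All function names and the example data are hypothetical.

```python
import math


def sse(ys):
    """Sum of squared deviations from the mean (stand-in for a leaf model's error)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(rows):
    """Best threshold on x for rows of (x, y); returns (gain, thr, left, right) or None."""
    best = None
    xs = sorted({x for x, _ in rows})
    parent = sse([y for _, y in rows])
    for lo, hi in zip(xs, xs[1:]):
        thr = (lo + hi) / 2
        left = [r for r in rows if r[0] <= thr]
        right = [r for r in rows if r[0] > thr]
        gain = parent - sse([y for _, y in left]) - sse([y for _, y in right])
        if best is None or gain > best[0]:
            best = (gain, thr, left, right)
    return best


def tree_score(leaves, n_total):
    """Whole-tree criterion (assumed BIC-like form): fit term plus size penalty."""
    total = sum(sse([y for _, y in leaf]) for leaf in leaves)
    return n_total * math.log(max(total, 1e-12) / n_total) + len(leaves) * math.log(n_total)


def grow(rows, max_leaves=8):
    """Iteratively split leaves while the score of the whole tree keeps decreasing."""
    leaves = [rows]
    score = tree_score(leaves, len(rows))
    while len(leaves) < max_leaves:
        # Every unprocessed leaf is examined; any ordering rule could be used here.
        candidates = [(best_split(leaf), i) for i, leaf in enumerate(leaves)]
        candidates = [(s, i) for s, i in candidates if s is not None]
        if not candidates:
            break
        (gain, thr, left, right), i = max(candidates, key=lambda c: c[0][0])
        trial = leaves[:i] + [left, right] + leaves[i + 1:]
        trial_score = tree_score(trial, len(rows))
        if trial_score >= score:  # early termination judged on the whole tree
            break
        leaves, score = trial, trial_score
    return leaves


# Two clearly separated regimes: the loop should stop after one split.
rows = [(x, y) for x, y in zip(range(0, 5), [0.1, -0.1, 0.0, 0.1, -0.1])]
rows += [(x, y) for x, y in zip(range(10, 15), [10.1, 9.9, 10.0, 10.1, 9.9])]
leaves = grow(rows)
```

The explicit list of leaves is what makes the three properties in the text possible: the order of splitting is a free choice, the termination test sees the whole tree, and the loop can be stopped after any iteration.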
Because the ultimate goal of the proposed algorithm is forecasting, a standard regression model of the ARIMA class, which is a traditional tool for forecasting financial series, needs to be built on the leaf nodes:

$$\varphi(L)\,(1-L)^{d}\,(Y_t - \mu) = \theta(L)\,\varepsilon_t, \tag{6}$$

where $Y_t$ is the time series, $L$ is the lag operator, $\varphi(L)$ is a polynomial of degree p in L, $\mu$ is the mean value of the process, $\theta(L)$ is a polynomial of degree q in L, $\varepsilon_t$ is white noise, and d is the order of integration of the process $Y_t$. If d=0, then the process $Y_t$ can be described by ARMA(p, q) or ARIMA(p, 0, q). This process is stationary and has a short memory. If d=1, then the series has infinite memory, that is, each perturbation has an impact on the behavior of the process indefinitely.
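To make the idea of a local model on a leaf concrete, the sketch below fits the simplest member of this family, an AR(1) model (p=1, d=0, q=0), by ordinary least squares and produces a multi-step forecast. It is a deliberately reduced stand-in for a full ARIMA fit, and the names `fit_ar1` and `forecast_ar1` are hypothetical.

```python
def fit_ar1(y):
    """Least-squares fit of Y_t = c + phi * Y_{t-1} on the observations of one leaf."""
    x, z = y[:-1], y[1:]  # lagged values and their successors
    n = len(x)
    if n < 2:
        raise ValueError("need at least three observations")
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("lagged values are constant; phi is not identifiable")
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    phi = sxz / sxx
    c = mz - phi * mx
    return c, phi


def forecast_ar1(c, phi, last, steps):
    """Iterate the fitted recursion forward from the last observed value."""
    out = []
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out


# Noise-free series generated by Y_t = 1 + 0.5 * Y_{t-1}; the fit recovers c and phi.
series = [0.0]
for _ in range(6):
    series.append(1.0 + 0.5 * series[-1])
c_hat, phi_hat = fit_ar1(series)
```

In practice a statistical library would be used to estimate the moving-average and integration components as well; the sketch only shows where such a model sits in the tree.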
Thus, the result variable $Y_t$ in this algorithm corresponds to the previous value ($Y_{t-1}$) and the value with the lag p ($Y_{t-p}$). In addition, the separation of the input space into segments makes it possible to construct a separate (local) model for each of them and to represent a piecewise function as an autoregressive tree in an intuitive visual form.
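The "window" transformation and the root value of Step 1 can be sketched as follows. The function name `make_windows` and the ordering of the lagged values within each row are assumptions made for illustration.

```python
import statistics


def make_windows(series, p):
    """Turn a series into rows [Y_{t-1}, ..., Y_{t-p}] with target Y_t."""
    if p < 1 or p >= len(series):
        raise ValueError("p must satisfy 1 <= p < len(series)")
    X, y = [], []
    for t in range(p, len(series)):
        X.append(series[t - p:t][::-1])  # most recent lag first
        y.append(series[t])
    return X, y


series = [1.0, 2.0, 3.0, 4.0, 5.0]
X, y = make_windows(series, p=2)

# Root node of the tree: the median of the whole series.
root_value = statistics.median(series)
```

Each row of `X` is a point in the input space that the tree partitions, and `y` holds the value the local models are fitted to predict.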

Empirical Results
For the empirical analysis, we selected the three cryptocurrencies that lead in market capitalization: Bitcoin (BTC), Ethereum (ETH) and Ripple (XRP). We took daily closing prices for the period from 01/01/2017 to 01/03/2019 from Yahoo Finance [29] and converted them to log-return time series.
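The conversion of closing prices to log-returns, r_t = ln(P_t / P_{t-1}), can be sketched as follows. The function name `log_returns` and the example prices are hypothetical and are not data from the study.

```python
import math


def log_returns(prices):
    """Log-returns r_t = ln(P_t / P_{t-1}) of a sequence of positive prices."""
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be positive")
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


# Illustrative prices only.
example_prices = [100.0, 110.0, 99.0]
example_returns = log_returns(example_prices)
```

A series of n prices yields n-1 returns, and log-returns add up over time, so their sum equals the log-return over the whole period.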
To compare the predictive properties of the BART algorithm, we also made a forecast using the classical ARIMA (1, 0, 1) and ARFIMA (1, d, 1) models.
As the parameter d in ARFIMA, one can use values based on the corresponding Hurst exponents (see, for example, E. Peters [30]). Accordingly, the differencing parameter d of the ARFIMA model for each currency was selected following [31]. For all sub-periods, the training sample for the BART algorithm was 80% of the total sample, and the remaining 20% was used as the out-of-sample dataset.
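The role of a fractional d can be illustrated with the sketch below. It assumes the standard relation d = H - 0.5 between the differencing parameter and the Hurst exponent H, expands the operator (1 - L)^d into its lag weights, and splits a series 80/20. The assumption that the split is chronological and all function names are ours, not the paper's.

```python
def hurst_to_d(hurst):
    """Standard relation between the Hurst exponent H and the ARFIMA parameter d."""
    return hurst - 0.5


def frac_diff_weights(d, k_max):
    """Coefficients w_k of (1 - L)^d = sum_k w_k L^k, for k = 0..k_max."""
    w = [1.0]
    for k in range(1, k_max + 1):
        w.append(w[-1] * (k - 1 - d) / k)
    return w


def chronological_split(series, train_fraction=0.8):
    """First part for training, remainder as the out-of-sample set (assumed ordering)."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


train, test = chronological_split(list(range(10)))
```

For d = 1 the weights reduce to the ordinary first difference (1, -1, 0, ...), whereas for a fractional d they decay slowly and never vanish, which is how ARFIMA captures long memory.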
To implement the models, we chose the Microsoft Azure Machine Learning Studio Cloud Application. A fragment of the implementation of machine experiments is shown in Fig. 5.
For each model, the target variable is the log-return for the next time period. The forecast was carried out on five different time horizons (5, 10, 14, 21, and 30 days) using three models for each cryptocurrency. To check the effectiveness of the BART algorithm and of the classical models, we conducted tests for periods with different types of dynamics of the cryptocurrency time series (two subperiods for each type), namely (Fig. 6): (1) Stable period; (2) Falling trend; (3) Transition dynamics (change of trend); (4) Rising trend. As we can see, BTC is a driver and the other cryptocurrencies repeat its dynamics.
Fig. 7-8 illustrate the forecast accuracy of the three models for ETH in the period of slow rising (falling) (Fig. 7) and in the rapid trend change period (Fig. 8). Forecasting accuracy for BTC and XRP has the same properties as for ETH. To estimate the prognostic properties of the models, we used the Root Mean Square Error (RMSE) metric.
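The evaluation metric can be sketched as follows. The horizons are those named in the text; the function names and the way a forecast path is truncated to each horizon are assumptions made for illustration.

```python
import math

HORIZONS = (5, 10, 14, 21, 30)  # forecast horizons in days, as listed in the text


def rmse(actual, predicted):
    """Root Mean Square Error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def rmse_by_horizon(actual, predicted, horizons=HORIZONS):
    """RMSE over the first h steps of the forecast, for every horizon that fits."""
    return {h: rmse(actual[:h], predicted[:h])
            for h in horizons if h <= len(actual)}
```

Applied to one out-of-sample path per model and sub-period, `rmse_by_horizon` yields the kind of per-horizon comparison that a results table can then average across cryptocurrencies.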
Results (averaged over three cryptocurrencies) of forecasting performance for all sub-periods are shown in Table 1.
The obtained results indicate that, for the investigated time series of cryptocurrencies, the proposed approach gives an RMSE within the range of 4% for the 14-day forecast horizon regardless of the type of dynamic behavior, and within the range of 6% and 8% for the 21-day and 30-day forecast horizons, respectively. The results show that, for the selected time series and short-term forecasts, the error of the BART algorithm is on average half the size of the error of the ARIMA model, and it is 15-20% lower than the error of the ARFIMA model for slowly changing periods (both falling and rising).
Note that all of our models show worse forecast accuracy for the periods of complex dynamic modes (rapid trend change periods).
In addition, the proposed algorithm is more accurate in the periods of transition dynamics (change of trend) compared to ARIMA-ARFIMA models.

Concluding Remarks
The modified Binary Auto Regressive Tree (BART) model is adapted from standard regression tree models to time series data. BART combines the classic C&RT algorithm with autoregressive ARIMA models. One of the advantages of the proposed method is the use of the "window" data transformation method for the time series.
The obtained results proved that the BART algorithm is more accurate for all investigated time series of cryptocurrencies and subperiods. In particular, RMSE for this algorithm for the horizon of 14, 21, and 30 days was within the range of 4%, 6%, and 8%, respectively.
The proposed BART method for analyzing and forecasting cryptocurrency time series demonstrated higher efficiency in building forecast estimates than traditional time series techniques, regardless of whether the target data is collected before, during or after a recession.