1 Introduction

In this research, we use high-frequency Bitcoin pricing data together with machine learning algorithms to predict mid-price movements for Bitcoin futures price series across a variety of time frequencies, ranging from 5 to 60 min. The novelty of our research lies in the use of all available Bitcoin futures series from the Chicago Mercantile Exchange (CME). The CME offers the product as a mechanism to hedge Bitcoin exposure, or to harness its performance, through futures and options on futures, both markets having shown tremendous growth since their introduction (Akyildirim et al. 2020; Corbet et al. 2018a, b). While liquidity proved to be a substantial issue for some longer-dated futures, such as those 6 and 7 months into the future, after several specification tests we present results based on the first five monthly futures products (1- to 5-month maturities). The contract is substantial in size, representing ownership of 5 bitcoin, as defined by the CME CF Bitcoin Reference Rate (BRR), quoted in U.S. dollars and cents per bitcoin. Exposure is based on a leverage rate of 43%, so the investment outlay is below the face value of 5 BTC. The minimum price fluctuation is $5.00 per bitcoin, while calendar spreads trade in increments of $1.00 per bitcoin. Monthly contracts are listed for six consecutive months plus two additional December contract months.

The CME's decision to launch Bitcoin futures on 10 December 2017 was viewed as a significant milestone in the development of such a relatively young financial product; to that point, few major exchanges with comparable reputation and historic experience had considered similar moves. The launch of CME Bitcoin futures was viewed as the first step in the cryptocurrency's path toward legitimacy, with the hope of enticing institutional investors who had, until late 2017, been unwilling to enter the market for a variety of reasons. In late 2020, CME futures possessed over $1 billion in open interest, representing the significant growth of the market over a very short period of time. The use of settlement pricing from multiple sources was initially identified as a strongly beneficial characteristic, particularly given the many problems pertaining to cyber-criminality and illicit behaviour across exchanges and directly through product development and creation.

We contribute to the literature by evaluating the application of six machine learning algorithms to high-frequency Bitcoin futures prices during the outbreak of COVID-19. We attempt to forecast the mid-price movement of CME Bitcoin futures across multiple futures products during COVID-19, using the sign prediction rate (accuracy rate), calculated as the proportion of times the related methodology correctly predicts the direction of the next mid-price return. If the underlying process were fully random, the correct sign prediction ratio would be 50%, so any accuracy rate greater than 50% indicates an ability of the algorithm to beat the market; this is further supported by the use of ideal profit ratios to measure the performance of each classification algorithm. We find that most methods provide results close to one another; the best performing model, the support vector machine, yields out-of-sample success rates of around 56% on average. Another important point is that while the maximum accuracy obtainable with the ARIMA model is only 56% across all cases considered, this figure rises to 71% for the support vector machine algorithm. Further evidence suggests that such predictability increases in magnitude as we focus on futures with longer maturities, particularly those of 4- and 5-month duration. This evidence indicates that Bitcoin futures products present sign predictability that can be exploited using machine learning.

The paper is structured as follows: previous research that guides our selected theoretical and methodological approaches is summarised in Sect. 2. Section 3 presents a thorough explanation of the data used in our analyses, while Sect. 4 presents a concise overview of the methodologies utilised. Section 5 presents the results and their relevance for policy-makers and regulatory authorities, while Sect. 6 concludes.

2 Previous literature

This research develops upon three key areas of research. The first is built on the development of machine learning and the inherent processes contained therein. The second is based on the development of cryptocurrencies with an emphasis on futures pricing behaviour, while the third area through which we develop our work is based on several studies that have examined the predictability of cryptocurrency spot prices. Machine learning has been used across a variety of areas such as stock markets (Wittkemper and Steiner 1996; Ntakaris et al. 2018; Sirignano 2019; Huck 2019; Sirignano and Cont 2019; Huang and Liu 2020; Philip 2020); currency markets during crises (El Shazly and El Shazly 1999; Zimmermann et al. 2001; Auld and Linton 2019); energy markets such as West Texas Intermediate (Chai et al. 2018), crude oil markets (Fan et al. 2016), Cushing oil and gasoline markets (Wang et al. 2018), and gas markets (Ftiti et al. 2020); gold markets (Chen et al. 2020); agricultural futures (Fang et al. 2020); copper markets (Sánchez Lasheras et al. 2015); coal markets (Matyjaszek et al. 2019; Alameer et al. 2020); cryptocurrency spot markets (Akyildirim et al. 2020; Chowdhury et al. 2020; Chen et al. 2021); options markets (Lajbcygier 2004; De Spiegeleer et al. 2018); and futures markets (Kim et al. 2020).

Our work further develops research on the use of neural networks for forecasting purposes, a concise synthesis of the earlier literature on which is provided by Zhang et al. (1998). Ghoddusi et al. (2019) presented a critical review of the literature on the application of machine learning, suggesting that Support Vector Machines (SVM), Artificial Neural Networks (ANN), and Genetic Algorithms (GAs) are among the most popular techniques used in energy markets. Nakano et al. (2018) investigated Bitcoin intraday technical trading based on artificial neural networks for return prediction, which Akyildirim et al. (2020) further developed by examining the predictability of the twelve most liquid cryptocurrencies at daily and minute-level frequencies. The authors found that machine learning classification algorithms reach about 55-65% predictive accuracy on average at the daily or minute-level frequencies, while support vector machines demonstrate the best and most consistent results in terms of predictive accuracy compared to logistic regression, artificial neural networks, and random forest classification algorithms. Saad et al. (2020) provided evidence of prediction accuracy of up to 99% for Bitcoin and Ethereum prices, whereas Hubáček et al. (2019) introduced a forecasting system designed to profit from the sports-betting market, developing their work through the application of convolutional neural networks for match outcome prediction.

Previous research on cryptocurrency futures has been quite extensive to date. An extensive overview of the key areas of research was presented by Corbet et al. (2019), covering market efficiency, the development of futures exchanges, and illicit behaviour. Akyildirim et al. (2020) utilised a high-frequency analysis to show significant pricing effects sourced from both fraudulent and regulatory unease within the industry, verifying that CME Bitcoin futures dominate price discovery relative to spot markets. Alexander et al. (2020) found similar evidence when considering the role that BitMEX derivatives played in informationally leading spot markets. Corbet et al. (2020) found further evidence of Bitcoin market maturity through significant responses to macroeconomic news, while Koutmos (2020) found that interest rates and implied stock market and foreign exchange market volatilities are important determinants of Bitcoin returns. de la Horra et al. (2019) analysed the demand for Bitcoin to determine whether it stems from Bitcoin's utility as a medium of exchange, a speculative asset, or a safe-haven commodity, finding that the asset is highly speculative in the short run. Giudici and Polinesi (2019) identified that Bitcoin exchange prices are positively related to each other and that large exchanges, such as Bitstamp, drive the prices. Such destabilising effects of fraud and regulatory events have also been identified by Akyildirim et al. (2020), Corbet et al. (2020), Katsiampa et al. (2019a), Katsiampa et al. (2019b) and Hu et al. (2020). Evidence supporting the predictability of Bitcoin futures prices through the use of machine learning would not be a characteristic unique to cryptocurrency markets, as it has previously been identified in foreign exchange markets (Plakandaras et al. 2015) and several other asset markets (Akyildirim et al. 2020); it is nonetheless essential to note that such predictability runs contrary to the efficient markets hypothesis, under which prices should follow a martingale process. However, such a result could present further evidence of the developing operational and technical efficiency that has been observed in these markets in recent years. These markets have grown to such international status that a substantial body of research has examined the usage of cryptocurrency markets as a hedging mechanism against the significant financial market pressures and contagion effects associated with the development of, and broad confusion surrounding, the COVID-19 pandemic, with emphasis on contagion effects (Akhtaruzzaman et al. 2020; Corbet et al. 2020, 2021; Mensi et al. 2020), asset price discovery (Corbet et al. 2020), safe-haven effects (Conlon et al. 2020), hedge fund performance (Yarovaya et al. 2020), sentiment (Corbet et al. 2020), political risk (Sharif et al. 2020), and the basis of future research focus (Goodell 2020). As Bitcoin futures market development was observed to be a significant milestone in the transition of not only Bitcoin in isolation, but cryptocurrency as a broad financial product, it is important to understand specifically whether behavioural differentials exist in comparison to traditional financial market assets.

Specifically forecasting Bitcoin spot prices using neuro-fuzzy techniques, Atsalakis et al. (2019) estimated that their selected PATSOS methodological structure performed 71% better than buy-and-hold strategies. Similarly, Faghih Mohammadi Jalali and Heidari (2020) found that, through the use of a first-order grey model (GM(1,1)), Bitcoin's price could be predicted accurately, to the extent of a confidence level of approximately 98% for specific periods. Alonso-Monsalve et al. (2020) found that convolutional LSTM neural networks significantly outperformed all other models considered, while CNN neural networks also provided satisfactory results, especially in the Bitcoin, Ether and Litecoin cryptocurrency markets. Further, Ma et al. (2020) found that their proposed novel MRS-MIDAS model exhibits statistically significant improvements when forecasting the realised volatility of Bitcoin. Using data between 2011 and 2018, Adcock and Gradojevic (2019) found that backpropagation neural networks dominate various competing models in terms of forecast accuracy. Further, when attempting to predict Bitcoin bubble crashes, Shu and Zhu (2020) showed that an LPPLS confidence indicator presented superior capability in detecting bubbles and accurately forecast bubble crashes, even when a bubble existed for only a short period of time. Such work built on the similarly structured analysis of Samitas et al. (2020), who found that the effectiveness of machine learning reached 98.8% as an early warning system to predict financial crises. Zoumpekas et al. (2020) found that a convolutional neural network and four types of recurrent neural network, including the Long Short Term Memory network, the Stacked Long Short Term Memory network, the Bidirectional Long Short Term Memory network, and the Gated Recurrent Unit network, could be utilised to predict the Ethereum closing price in real time with promising accuracy and experimentally demonstrated profitability. Such results present evidence that prediction of cryptocurrency markets is statistically possible, in direct opposition to the efficient markets hypothesis [previously examined in cryptocurrency markets by Sensoy (2019) and Akyildirim et al. (2020)], but this is not the first market to present such evidence, as Plakandaras et al. (2015) had previously identified similar atheoretical outcomes in spot foreign exchange markets.

3 Data

In this study, we use dollar-denominated Bitcoin futures data from the Chicago Mercantile Exchange. We obtain minutely data for 1-month up to 9-month futures, each series beginning at a different date. To have enough observations to draw meaningful and robust conclusions, we use only the 1-month to 5-month futures data, with an initial date of 2 January 2020 and an end date of 10 September 2020. Bitcoin futures can be traded at the CME at any time from 11:00 PM on Sunday until 10:00 PM on Friday (owing to the daylight saving time change on 8 March 2020, the trading hours shifted to 10:00 PM on Sunday until 9:00 PM on Friday). The time interval studied corresponds to 217 trading days, which we sample at 5-, 10-, 15-, 30-, and 60-min frequencies. For each time frequency, we compute the mid-price from the best ask and bid prices using the last observation in that time interval, and then compute the log-returns for each time scale from these mid-prices. Table 1 shows the total number of observations for the mid-price returns of Bitcoin futures at different time scales, while Fig. 1 presents a plot of the 1-month futures price at the 5-min frequency during 2020. For instance, while there are 4345 observations at the 60-min frequency, this number increases to 51,733 at the 5-min frequency. Table 2 provides descriptive statistics for mid-price futures returns for different maturities and time scales. As is clear from the table, mean and median values are always around zero, independent of time to maturity and time frequency. As expected, min, max, and standard deviation values get larger in absolute value as the time to maturity increases.
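The resampling step can be illustrated with a short pandas sketch. This is not the authors' code; it simply reproduces the procedure described above (last mid-price per interval, then log-returns), and the column names 'bid' and 'ask' for the best quotes, with a minutely datetime index, are assumptions for illustration.

```python
# Sketch of the resampling step described above, for one futures series.
# Assumes `quotes` is a DataFrame with a minutely DatetimeIndex and
# hypothetical columns 'bid' and 'ask' holding the best quotes.
import numpy as np
import pandas as pd

def mid_price_returns(quotes: pd.DataFrame, freq: str = "5min") -> pd.Series:
    mid = (quotes["bid"] + quotes["ask"]) / 2.0   # mid-price from best bid/ask
    mid = mid.resample(freq).last().dropna()      # last observation in each interval
    return np.log(mid).diff().dropna()            # log-returns at the chosen frequency

# Usage: returns_5m = mid_price_returns(quotes_1m_futures, "5min")
```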

Fig. 1: 1-month futures mid-price at 5-min frequency. Note: The figure presents a plot of the 1-month futures price at the 5-min frequency during 2020.

Table 1 Number of observations for the mid-price returns of bitcoin futures at different time scales for different train/test set divisions
Table 2 Descriptive statistics for mid-price futures returns for different maturities and time scales

4 Classification algorithms

4.1 Machine learning models

We apply six different machine learning algorithms (k-Nearest Neighbours, Logistic Regression, Naive Bayes, Random Forest, Support Vector Machine, Extreme Gradient Boosting) to classify the target variable, the mid-price return of the Bitcoin futures in our study, as “up” or “down” at varying time frequencies. These methods are selected for their popularity and fast implementation, and they are performed with Python’s well-known scikit-learn package. In what follows, we briefly describe how each of these classification algorithms helps to forecast the sign of the target variable.

4.1.1 k-Nearest neighbours classifier

The k-nearest neighbours algorithm (kNN) is a commonly used, simple yet successful classification method that has been applied in a large number of classification and regression problems such as handwritten digits and satellite image scenes (Melgani and Bruzzone 2004; Munder and Gavrila 2006). The kNN is a supervised machine learning model, in which the model learns from the labelled data how to map the inputs to the desired output so that it can make predictions on test data. It is a non-parametric algorithm as it does not make any assumptions about the data, such as normality. The kNN model picks an entry in the database and then looks at the ‘k’ entries in the database which are closest to the chosen point. The data point is then assigned the label of the majority of the ‘k’ closest points. For instance, if \(k = 6\) with 4 of the points labelled ‘up’ and 2 labelled ‘down’, the data point in question would be labelled ‘up’, since ‘up’ is the majority class.

More generally, the kNN algorithm works as follows. For a given value of k, it computes the distance between the test data and each row of the training data using a distance metric such as the Euclidean metric (other metrics that can also be used include cityblock, Chebychev, correlation, and cosine). The distance values are sorted in ascending order and the top k elements are extracted from the sorted array. The algorithm then finds the most frequent class among these k elements and returns it as the predicted class. In our application of kNN, we optimise the algorithm over k values from 1 to 20.
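A minimal scikit-learn sketch of this step is given below. The grid over k = 1, ..., 20 follows the text; the names X_train, y_train, X_test (lagged mid-price returns and their "up"/"down" labels) and the 5-fold cross-validation are illustrative assumptions rather than the authors' exact implementation.

```python
# kNN classification with a grid search over the neighbourhood size k = 1,...,20.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {"n_neighbors": list(range(1, 21))}          # k values from the text
knn = GridSearchCV(KNeighborsClassifier(metric="euclidean"), param_grid, cv=5)
knn.fit(X_train, y_train)                                  # assumed training features/labels
y_pred = knn.predict(X_test)                               # predicted sign of the next return
```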

4.1.2 Naive Bayes classifier

The Naive Bayes is another widely used classification algorithm, as it is easy to build and particularly useful for very large data sets (Chawla et al. 2002; Zhang et al. 2014). This method is a supervised learning algorithm based on the application of Bayes’ theorem, and is also called a probabilistic machine learning algorithm. It makes the “naive” assumption that the input features are conditionally independent of each other given the classification. If this assumption holds, the naive Bayes classifier may perform even better than more complicated models. In practice, however, it is rarely possible to obtain a set of predictors that are completely independent.

The naive Bayes classifier assigns observations to the most probable class by first estimating the densities of the predictors within each class. As a second step, it computes the posterior probabilities according to Bayes’ rule:

$$\begin{aligned} {\widehat{P}}(Y = k \mid X_1,\ldots ,X_P ) = \frac{\pi (Y = k) \displaystyle \prod \nolimits _{j=1}^P P(X_j \mid Y=k)}{ \displaystyle \sum \nolimits _{k=1}^K \pi (Y = k) \displaystyle \prod \nolimits _{j=1}^P P(X_j \mid Y=k)} \end{aligned}$$
(1)

where Y is the random variable corresponding to the class index of an observation, \(X_1,...,X_P\) are the random predictors of an observation, and \(\pi (Y = k)\) is the prior probability that a class index is k. Finally, it classifies an observation by estimating the posterior probability for each class, and then assigns the observation to the class yielding the maximum posterior probability.
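The sketch below shows how this classifier can be applied with scikit-learn, assuming Gaussian class-conditional densities for the continuous return features; the variable names follow the earlier illustrative examples and are not taken from the paper.

```python
# Gaussian naive Bayes classification in the spirit of Eq. (1).
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()                        # estimates per-class densities of each predictor
nb.fit(X_train, y_train)
posterior = nb.predict_proba(X_test)     # posterior probabilities per Bayes' rule
y_pred = nb.predict(X_test)              # class with the maximum posterior probability
```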

4.1.3 Logistic regression classifier

The Logistic Regression is a machine learning classification algorithm that is used to forecast the probability of a categorical dependent variable (Chen et al. 2013; Fischer and Krauss 2018). In logistic regression, the outcome of the target variable is dichotomous (i.e., there are only two possible classes). The classification algorithm forecasts the probability of occurrence of a binary event utilizing a logit function. More explicitly, logistic regression outputs a probability value by using the logistic sigmoid function and then this probability value is mapped to two discrete classes.

In our case, we have a binary classification problem of identifying the next excess return as up or down. Logistic regression assigns probabilities to each row of the feature matrix X. Let N denote the sample size of the dataset, so that X has N rows. Given the set of d features, i.e., \(x=(x_{1},...,x_{d})\), and parameter vector w, the logistic regression with the penalty term minimizes the following optimization problem:

$$\begin{aligned} \min _{w,c} \frac{w^T w}{2} + C \sum _{i=1}^{N}{\log ( \exp (-y_i(x_i^T w + c))+1 ) } \end{aligned}$$
(2)

where we find the optimal value of C by making a grid search over a set of reasonable values for C.
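A minimal sketch of this step with scikit-learn, whose L2-penalised logistic regression objective matches Eq. (2), is given below; the grid of C values and the cross-validation setting are illustrative assumptions.

```python
# L2-penalised logistic regression with a grid search over the penalty parameter C.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}          # illustrative grid for C
logit = GridSearchCV(LogisticRegression(penalty="l2"), param_grid, cv=5)
logit.fit(X_train, y_train)
prob_up = logit.predict_proba(X_test)[:, 1]                 # probability of the "up" class
y_pred = logit.predict(X_test)                              # mapped to the two discrete classes
```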

4.1.4 Random forest classifier

The Random Forest Classifier is an ensemble algorithm, in that it combines multiple models of the same or different kind for classifying objects. Decision trees are the building blocks of the random forest model. In other words, the random forest consists of a large number of individual decision trees that function as an ensemble. The random forest classifier creates a set of decision trees from randomly selected subsets of the training set, and each tree makes a class prediction. It then aggregates the votes from the different decision trees to decide the final class of the test object. For instance, assume that there are 5 points in our training set, namely \((x_1,x_2,..., x_5)\) with corresponding labels \((y_1,y_2,...,y_5)\); the random forest may then create four decision trees taking subsets such as \((x_1,x_2,x_3,x_4)\), \((x_1,x_2,x_3,x_5)\), \((x_1,x_2,x_4,x_5)\), \((x_2,x_3,x_4,x_5)\) as inputs. If three of the decision trees vote for “up” against “down”, the random forest predicts “up”. This works efficiently because a single decision tree may produce noise, but a large number of relatively uncorrelated trees operating together reduce the effect of noise, resulting in more accurate results.

More generally, in the random forest method as proposed by Breiman (2001), a random vector \(\theta _k\) is generated, independent of the past random vectors \(\theta _1,...,\theta _{k-1}\) but with the same distribution; and a tree is grown using the training set and \(\theta _k\), resulting in a classifier \(h(x,\theta _k)\) where x is an input vector. In random selection, \(\theta \) consists of several independent random integers between 1 and K. The nature and dimension of \(\theta \) depend on its use in tree construction. After a large number of trees are generated, they vote for the most popular class. This procedure is called a random forest. A random forest is a classifier consisting of a collection of tree-structured classifiers \(\{h(x,\theta _k ), k=1, \ldots \}\), where the \(\theta _k\)’s are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x.
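The corresponding scikit-learn sketch is below; the number of trees is an illustrative choice rather than a value reported in the paper, and the variable names follow the earlier assumed examples.

```python
# Random forest: an ensemble of decision trees voting on the class label.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, random_state=0)   # illustrative settings
rf.fit(X_train, y_train)          # each tree is grown on a random subset of the data
y_pred = rf.predict(X_test)       # majority vote across the trees
```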

4.1.5 Support vector machine classifier

The Support Vector Machine (SVM) is a supervised machine learning algorithm used for both regression and classification tasks (Fauvel et al. 2008; Suykens and Vandewalle 1999). The support vector machine algorithm's objective is to find a hyperplane in an N-dimensional space, where N is the number of features, that distinctly classifies the data points. Hyperplanes can be thought of as decision boundaries: data points falling on different sides of the hyperplane are assigned to different classes. Support vectors are the data points that lie closest to the hyperplane and influence its position and orientation; the margin of the classifier is maximised using these support vectors. In more technical terms, the above process can be summarised as follows. Given the training vectors \(x_i\) for \(i=1,2,...,N\) with a sample size of N observations, the support vector machine classification algorithm solves the following problem:

$$\begin{aligned} \min _{w,h,\xi } \frac{w^T w}{2} + C \sum _{i=1}^{N}{\xi _i} \end{aligned}$$
(3)

subject to \(y_i(w^T \phi (x_i) ) \ge 1-\xi _i\) and \(\xi _i \ge 0,i=1,2,...,N\). The dual of the above problem is given by

$$\begin{aligned} \min _{\alpha } \frac{\alpha ^T Q \alpha }{2} - e^T \alpha \end{aligned}$$
(4)

subject to \(y^T\alpha = 0\) and \(0\le \alpha _i \le C\) for \(i=1,2,...,N\), where e is the vector of all ones and \(C>0\) is the upper bound. Q is an N by N positive semi-definite matrix with \(Q_{ij} = y_i y_j K(x_i,x_j)\), where \(K(x_i,x_j) = \phi (x_i)^T \phi (x_j)\) is the kernel. Here the training vectors are implicitly mapped into a higher dimensional space by the function \(\phi \). The decision function in the support vector machines classification is given by

$$\begin{aligned} sign\left( \sum _{i=1}^{N}{y_i \alpha _i K(x_i,x) } + \rho \right) . \end{aligned}$$
(5)

The optimization problem in Equation 3 can be solved globally using the Karush-Kuhn-Tucker (KKT) conditions. This optimization problem depends on the choice of the kernel function. Our study employs the Gaussian (rbf) kernel, given by \(\exp (-\gamma \Vert x-x'\Vert ^2)\), where \(\gamma \) must be greater than 0. When the SVM is implemented, we find optimal values of C and \(\gamma \) for each futures series by using a grid search over each of these parameters.
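The following scikit-learn sketch shows the rbf-kernel SVM with the grid search over C and \(\gamma\) mentioned above; the grid values and cross-validation setting are illustrative assumptions, not the paper's reported configuration.

```python
# rbf-kernel SVM with a joint grid search over C and gamma.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1.0, 10.0, 100.0],        # illustrative grids
              "gamma": [0.001, 0.01, 0.1, 1.0]}
svm = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)                        # sign given by the decision function in Eq. (5)
```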

4.1.6 Extreme gradient boosting classifier

The Extreme Gradient Boosting (XGBoost) is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. As noted above, an ensemble method is a machine learning technique that combines several base models to produce one optimal predictive model (Weldegebriel et al. 2020). An algorithm is called boosting if it works by adding models on top of each other iteratively, with the errors of the previous model corrected by the next predictor, until the training data are accurately predicted or reproduced by the model. A method is called gradient boosting if, instead of assigning different weights to the classifiers after every iteration, it fits the new model to the residuals of the previous prediction and then minimizes the loss when adding the latest prediction; that is, if the model is updated using gradient descent, it is called gradient boosting. XGBoost improves upon the base gradient boosting framework through systems optimization and algorithmic enhancements. These enhancements include parallelised tree building, tree pruning using a depth-first approach, cache awareness and out-of-core computing, regularisation for avoiding over-fitting, efficient handling of missing data, and in-built cross-validation capability.
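A minimal sketch using the xgboost package is shown below; the hyper-parameters are illustrative, and y_train01 denotes the assumed binary labels encoded as 0 ("down") and 1 ("up"), since XGBClassifier expects numeric classes.

```python
# Gradient-boosted decision trees via XGBoost (illustrative hyper-parameters).
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
xgb.fit(X_train, y_train01)        # successive trees fitted to the residuals of earlier ones
y_pred = xgb.predict(X_test)
```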

4.2 Calculating the prediction success and potential profitability

Assume that the real label of the target variable is denoted by Y and the predicted label by \(Y'\). We employ the following measures to assess the usefulness of our selected forecasting techniques (a short code sketch of these measures follows the definitions below):

  • The Sign Prediction Ratio (SPR): a correctly predicted excess return direction is assigned 1 and 0 otherwise; the sign prediction ratio is then calculated by

    $$\begin{aligned} SPR = \frac{\sum _{j=1}^{M} matches (Y_{j},Y^{\prime }_{j})}{M}, \end{aligned}$$
    (6)

    where

    $$\begin{aligned} matches(Y_{j}, Y^{\prime }_{j}) = \left\{ \begin{array} {ll} 1 &{} \quad if \ Y_{j}=Y^{\prime }_{j} \\ 0 &{} \quad otherwise \end{array} \right. \end{aligned}$$
    (7)

    and M denotes the size of the set for which the sign prediction ratio is measured.

  • The Maximum Return is obtained by adding the absolute values of all the excess returns (denoted by h)

    $$\begin{aligned} MaxReturn = \sum _{j=1}^{M} abs(h_{j}) \end{aligned}$$
    (8)

    and represents the maximum achievable return assuming perfect forecast.

  • The Total Return is computed in the following way

    $$\begin{aligned} TotalReturn = \sum _{j=1}^{M}sign(Y^{\prime }_{j}) * h_{j} \end{aligned}$$
    (9)

    where sign is the standard sign function and \(*\) denotes the usual multiplication. Notice that the better the prediction method, the larger the total return is.

  • The Ideal Profit Ratio is the ratio of the total return in Eq. (9) to the maximum return in Eq. (8).

    $$\begin{aligned} IPR = \frac{Total\,Return}{Max\,Return} \end{aligned}$$
    (10)
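A minimal numpy sketch of these measures is given below, assuming y_true and y_pred are arrays of realised and predicted signs (+1/-1) and h is the array of realised excess returns; these names are illustrative.

```python
# Sign prediction ratio and ideal profit ratio as in Eqs. (6)-(10).
import numpy as np

def sign_prediction_ratio(y_true, y_pred):
    return np.mean(y_true == y_pred)                   # Eq. (6): share of matching signs

def ideal_profit_ratio(y_pred, h):
    total_return = np.sum(np.sign(y_pred) * h)         # Eq. (9): return of the predicted positions
    max_return = np.sum(np.abs(h))                     # Eq. (8): return under a perfect forecast
    return total_return / max_return                   # Eq. (10)
```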

5 Empirical results

As explained in Sect. 3, we sample our data at five different time scales (5-, 10-, 15-, 30-, and 60-min) and then implement the six machine learning algorithms described in the previous section, namely kNN, logistic regression, naive Bayes, random forest, SVM, and XGBoost. The target variable for all of these classification methods is the sign of the one-step-ahead mid-price return of Bitcoin futures with different maturities at different frequencies. As features, we use the one-step-lagged mid-price returns of the Bitcoin futures. For example, when the target variable is the sign of the 5-min mid-price return of the 1-month futures at time \(t+1\), we use the 5-min returns of the 1-, 2-, 3-, 4-, and 5-month futures at time t as features in the machine learning algorithm. We consider three different divisions of the dataset into train and test sets, called hold-outs. For the first hold-out, we take 70% of the total sample size, rounded to the nearest integer, as the training sample size and the remaining 30% as the test sample size. Similarly, we also consider 0.8/0.2 and 0.9/0.1 train/test partitions.
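The feature/target construction and the three hold-outs can be sketched as follows. This is an illustrative reconstruction, not the authors' code: returns_5m is a hypothetical DataFrame whose columns ("1m" to "5m") hold the time-aligned 5-min mid-price returns of the five futures, and the target here is the 1-month contract.

```python
# Build lagged features, the sign target, and the three chronological hold-outs.
import numpy as np

X = returns_5m.values[:-1]                      # returns of all five futures at time t
y = np.sign(returns_5m["1m"].values[1:])        # sign of the 1-month return at time t+1

for train_frac in (0.7, 0.8, 0.9):              # the 0.7/0.3, 0.8/0.2 and 0.9/0.1 hold-outs
    n_train = round(train_frac * len(y))
    X_train, X_test = X[:n_train], X[n_train:]
    y_train, y_test = y[:n_train], y[n_train:]
```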

The number of observations for the mid-price returns of Bitcoin futures at different time frequencies for different train/test set divisions is reported in Table 1. For instance, while there are 51,733 mid-price returns in total at the 5-min scale, only 4345 mid-price returns are available at the 60-min scale. The 0.7/0.3 hold-out at the 5-min frequency corresponds to 36,213 time intervals in the train set and 15,520 time intervals in the test set. Similarly, the 0.8/0.2 and 0.9/0.1 train/test partitions correspond to 41,386/10,347 and 46,559/5,174 five-minute intervals, respectively. The descriptive statistics for mid-price returns of Bitcoin futures at different time scales with different maturities are presented in Table 2, showing that both mean and median values are almost zero across different maturities and time frequencies. Minimum (maximum) values of the returns become smaller (larger) as the time to maturity increases across all time scales. However, minimum and maximum values do not change significantly from one time scale to another within the same Bitcoin futures series (except for the 1-month futures). As expected, standard deviations increase both with time to maturity and with time frequency within the same futures series.

The performance of the different machine learning algorithms is compared based on two key metrics. First, we use the sign prediction rate, or accuracy rate, calculated as the proportion of times the related methodology correctly predicts the direction of the next mid-price return. If the underlying process were entirely random, the correct sign prediction ratio would be 50%. In our case, it is essential to note that we use only the information contained in the Bitcoin futures themselves; in other words, we do not use any other information source, which would in any case be difficult to determine given the many different factors affecting Bitcoin prices. Hence, any accuracy rate greater than 50% already indicates the ability of the algorithm to beat the market. Second, we apply the ideal profit ratio to measure the performance of the related classification algorithm. As formalised in Sect. 4, the ideal profit ratio is the ratio of the return generated by a given algorithm to the return under a perfect sign forecast. The numerical results are produced on a 2.6 GHz Intel Core i7 computer using Python 3.7 with the scikit-learn machine learning package. Running the algorithms at the hourly time scale is completed in the order of seconds, whereas at higher frequencies, such as the 5-min sampling frequency, the computational time required for training and prediction increases to the order of minutes for the random forest algorithm.

Table 3 KNN classification: in-sample and out-of-sample accuracy results and ideal profit ratios for different train/test combinations for the static analysis
Table 4 Logistic regression: in-sample and out-of-sample accuracy results and ideal profit ratios for different train/test combinations for the static analysis
Table 5 Naive Bayes Classification: in-sample and out-of-sample accuracy results and ideal profit ratios for different train/test combinations for the static analysis
Table 6 Random Forest (RF) classification: in-sample and out-of-sample accuracy results and ideal profit ratios for different train/test combinations for the static analysis
Table 7 Support Vector Machine (SVM) classification: in-sample and out-of-sample accuracy results and ideal profit ratios for different train/test combinations for the static analysis
Table 8 XGBOOST classification: in-sample and out-of-sample accuracy results and ideal profit ratios for different train/test combinations for the static analysis

Tables 3, 4, 5, 6, 7, and 8 present the accuracy rates for the train (in-sample) and test (out-of-sample) periods in the first two columns for Bitcoin futures with maturities from 1 month to 5 months at different time scales. Similarly, ideal profit ratios are given in the following column for the out-of-sample period. The mean value (together with its t-statistic), standard deviation, maximum, and minimum of each column across different maturities and time frequencies are given below the tables. Table 3 provides results for the kNN algorithm, which are computed by optimising over neighbourhood sizes from 1 to 20. The kNN methodology yields an average out-of-sample (in-sample) success ratio of 55% (77%) for the first hold-out, 55% (75%) for the second hold-out, and 56% (75%) for the third hold-out. The maximum average ideal profit ratio is around 6% for the 0.9/0.1 division of the data set. The kNN methodology yields the highest in-sample accuracy results after the random forest algorithm, with a maximum value reaching as high as 94% for the 5-month futures at the 5-min frequency. Similarly, the maximum out-of-sample success rate (66%) is attained at the 15-min frequency. In most cases, we observe that the accuracy rate increases for the same maturity futures under the kNN method as the time frequency decreases. It is also evident from Table 3 that in most cases we obtain a positive ideal profit ratio, with a maximum value of 23% in the first hold-out (27% for the second hold-out, 31% for the third hold-out) for the 3-month futures at the 60-min frequency.

Table 4 provides the results for the logistic regression algorithm, which is based on linear classification. We observe that this method yields relatively stable results across different maturities and time scales. For instance, the average success rate for both the in-sample and out-of-sample periods is almost always 54% for the three different hold-outs. The same is also true for the minimum (51%) and maximum (57%) values across the different cases. Again, we obtain a positive ideal profit ratio with this prediction method in most cases. Although the average ideal profit ratios are not particularly high, their maximum values (29% for the 0.7/0.3 division, 32% for the 0.8/0.2 division, 35% for the 0.9/0.1 division) can be considered satisfactory. Table 5 shows the results for the Naïve Bayes classification, which is the worst-performing of the six methodologies. The in-sample success ratios are almost always indistinguishable from 50%, which is simply the result obtained from a random walk model, for which the empirical results are given in Table 9. Although the maximum accuracy rate can reach up to 60% for the 1-month futures at the 5-min frequency for the first hold-out, the average accuracy rate is only 45% across the different cases. This result also holds for the 0.8/0.2 and 0.9/0.1 divisions. Similar results are obtained for the ideal profit ratios: the Naïve Bayes algorithm is the only one for which the average ideal profit ratio is negative.

Table 9 Performance of Random walk and ARIMA time series forecasting over different maturities and different time scales

As can be noted in Table 6, the in-sample fits of the random forest algorithm are the highest among all the machine learning algorithms considered. The average in-sample success rate reaches up to 87% for the first hold-out, 83% for the second hold-out, and 87% for the third hold-out. However, the out-of-sample average performance is significantly worse than the in-sample fits. This indicates high variance in the random forest classification: a close in-sample fit to noisy data but lower out-of-sample performance. For all of the data divisions, the average out-of-sample success rate is around 56%, with a maximum value of 67%, attained for the 5-month futures at the 5-min frequency. It is also observed that, except in a few cases, positive ideal profit ratios are obtained, with a maximum value of 36% for the 3-month futures at the 60-min time scale.

Table 7 shows that the best out-of-sample forecasting results across different maturities and frequencies are obtained from a nonlinear classification method, namely the support vector machine with a radial kernel. A maximum value of 61% is obtained for the first division, 64% for the second division, and 71% for the third division. The average out-of-sample success rate is stable at around 56% across the different hold-outs, and the ideal profit ratios show a similar pattern. It is evident from Table 8 that the XGBoost method provides results similar to the kNN algorithm but with lower average in-sample fits; its average out-of-sample accuracy is around 55% across the different hold-outs.

As a benchmark for our models, we also utilise the classical ARIMA model to predict the direction of the next mid-price return. At first sight, one could argue that most of the methods provide results close to each other and to the ARIMA model. For instance, while the average out-of-sample success rate is around 56% for the support vector machine, it is around 52% for the ARIMA model, as can be seen from the results presented in Table 9. For a small sample, a 1% difference may not be meaningful in terms of the robustness of a method. However, given our large sample, an average increase of just 1% in the success rate at the 5-min frequency (0.7/0.3 hold-out) translates into 1,552 additional correct predictions of the target variable; a 4% difference corresponds to 6,208 additional accurate predictions at the 5-min level, which is far from negligible given the sample size analysed. Similar results hold for the other frequencies. Another critical point is that the maximum accuracy one can obtain with the ARIMA model is only 56% across all cases considered, whereas this figure rises to 71% for the support vector machine algorithm.

Table 10 t-test results for the mean of pairwise difference of seven classification algorithms, including ARIMA, kNN, logistic regression (Logistic), Naïve Bayes (NB), Random Forest (RF), Support Vector Machines (SVM), and XGBoost over different time train and test sets

In Table 10, a t-test is applied to assess the statistical significance of the differences in estimation results between alternative algorithms. We compare the methods using the success ratio results for each in-sample and out-of-sample period across the different maturities and time scales. For instance, to compare the kNN and ARIMA models for the in-sample period, we use the success ratio values from the same 0.7/0.3 in-sample period across different maturities and time scales. We test the null hypothesis that the pairwise differences between the values computed from the two models come from a normal distribution with mean equal to zero and unknown variance. The results show that, except for a few cases, the pairwise differences between these algorithms are statistically significant.
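A minimal sketch of such a paired test, assuming acc_a and acc_b are arrays of success ratios for two algorithms collected over the same maturities and time scales, is shown below; the variable names are illustrative.

```python
# Paired t-test: H0 is that the mean of the pairwise differences equals zero.
from scipy import stats

res = stats.ttest_rel(acc_a, acc_b)        # paired (related-samples) t-test
print(res.statistic, res.pvalue)           # t-statistic and p-value for the comparison
```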

6 Conclusion

We examine the forecast performance of the mid-price movement of CME Bitcoin futures from 2 January 2020 to 10 September 2020, using 1-, 2-, 3-, 4-, and 5-month futures at 5-, 10-, 15-, 30-, and 60-min frequencies. To this end, we employ machine learning algorithms (MLAs), which are compared with standard ARIMA and random walk models. Sign prediction ratios, together with ideal profit ratios, are utilised to measure the forecasting performance of the suggested MLAs. Our findings indicate that the k-nearest neighbour (kNN) approach and the random forest (RF) algorithm yield the highest in-sample accuracy rates at varying frequencies. For instance, for the random forest algorithm the in-sample success rate can reach up to 87% for the first hold-out (0.7/0.3), 83% for the second hold-out (0.8/0.2), and 87% for the third hold-out (0.9/0.1). However, the highest out-of-sample success rates are obtained by the SVM; for instance, an accuracy rate of 71% is achieved for the third hold-out. On the other hand, the logistic regression, Naïve Bayes, and XGBoost algorithms yield relatively stable results across different maturities, with average out-of-sample accuracy rates of 54%, 45%, and 55%, respectively. As a benchmark, the ARIMA model provides an average out-of-sample accuracy rate of around 52%. In general, our findings indicate that most of the MLAs outperform the benchmark model in both in- and out-of-sample forecasting accuracy. This highlights the importance and relevance of MLAs for forecasting Bitcoin futures prices during periods of turmoil. From a policy perspective, such research has important implications for the monitoring of cryptocurrency market developments; from a regulatory perspective, signals of market discontinuity could reveal sources of atypical influence. Furthermore, derivative products are widely held to complete the market and increase efficiency. Hence, as the cryptocurrency derivatives market enlarges, machine learning algorithms can also be used for the prediction and valuation of products such as options and swaps in the future.