The Short-Term Predictability of Returns in Order Book Markets: A Deep Learning Perspective



1 Introduction
1.1 Financial markets and exchanges: the rise of High-Frequency traders

A financial market is an ensemble of market agents willing to buy or sell a certain financial security, such as a stock, bond, or derivative. Today most trades take place on electronic exchanges, virtual places that bring together buyers and sellers, facilitating the occurrence of transactions. At the time of writing, the two largest U.S. equity exchanges by market capitalization are the NYSE, a hybrid (floor and electronic) auction market accounting for 20% of U.S. equities market transactions, and the Nasdaq, a fully electronic dealer market executing about 16% of such trades (source: Cboe Exchange, Inc.). Both of these markets allow traders to access live order book information, i.e. the collection of all standing orders for a given security. In theory, this allows for symmetric information across traders, who should all have access to the same data regarding market depth, liquidity, and price discovery dynamics.
In practice, traders have access to different technology, receiving market data and submitting orders at different latencies. In the quest to exploit the advantages gained by faster access to markets, almost two decades ago, a new market participant emerged. These market players are today known as High-Frequency Traders (HFTs) and, over the years, have rapidly grown to represent a significant share of the market (Abergel et al., 2014). Their role has since been the object of a fierce debate: their critics claim HFTs engage in predatory, and sometimes illegal, behavior, while their supporters believe HFTs to be overall beneficial to the market by providing liquidity, reducing spreads, and helping price discovery dynamics. In this paper, we will not delve into questions regarding the legitimacy of HFT practices, but will instead aim to independently analyze the value of immediate access to order book information. Over the past few decades, HFT companies have engaged in a fierce race to zero latency, making vast economic efforts to reduce their latency by just a few microseconds. What we aim to explore in this research is one of the possible reasons why such a race happened in the first place. Specifically, we will analyze the predictive value of order book data, i.e. to what extent can a trader with immediate access to the order book predict the future direction of the market?
It is also important to note that, over the past couple of decades, the trading process has become increasingly complex due to market fragmentation, the availability of new technologies such as smart order routers (SORs), and regulation that is sometimes controversial, e.g. RegNMS (US Securities and Exchange Commission, 2005) and MiFID (European Parliament and Council, 2004). While practices that exploit arbitrage between competing trading venues exist, in this research, we will assume the trader has access to a single electronic order book-based exchange, namely the Nasdaq.

1.2 Order book predictability: asking the right questions
As discussed in the opening paragraph, we would like to explore whether, contrary to low-frequency returns, ultra-high-frequency returns tend to display predictability. Empirical studies (Sirignano and Cont, 2019) have shown that price formation dynamics, i.e. next mid-price moves, are predictable. In this paper, we will try to understand whether such predictability persists at longer horizons. Intuitively, predictability in high-frequency returns may be understood to arise simply from the way an order book market is structured, i.e. the side of the order book with less liquidity is more likely to erode faster, resulting in a price increase/decrease, or as the product of recurring trading patterns in response to liquidity information. The approaches considered in the literature for forecasting high-frequency returns from order book data can be roughly divided into two categories: relatively simple models built on carefully handcrafted features (Aït-Sahalia et al., 2022) and more sophisticated architectures applied directly to raw order book data (Zhang et al., 2019; Zhang and Zohren, 2021; Kolm et al., 2021). In this research, we will focus on the latter class of models, leveraging the ability of deep learning techniques to learn complex dependence structures. We will consider a specific class of deep learning models, introduced by Zhang et al. (2019), designed to extract features from order book data. There is empirical evidence (Bengio et al., 2013) which suggests that, although deep learning models can extract complex features, the way data is represented may have a significant impact on model performance. We will hence explore how model performance varies when changing the way the order book data is arranged. Equipped with this class of models, we will aim to investigate the following questions, which naturally arise from our preceding discussion:

1. Do high-frequency returns display order book-driven predictability? If so, how far ahead can we predict?
2. Which order book representations perform best?
3. Can we use a single model across multiple horizons?
4. Can we use a single model across multiple stocks?
We aim to answer all these questions in a formalized statistical inference framework based on model confidence sets (Hansen et al., 2011).

1.3 Related work and contributions
It is important to note that mathematical modeling of order book dynamics is a very broad and active area of research, ranging from Hawkes process models for order and trade events in continuous time (Bacry and Muzy, 2014; Large, 2007) to discrete-time synthetic data-driven order book generation (Byrd et al., 2020; Coletta et al., 2022). The work presented in this paper can be seen as contributing to two parallel research streams in the literature on order book-driven mid-price predictions. On one hand, we expand on the deep learning ideas discussed in Zhang et al. (2019); Zhang and Zohren (2021); Kolm et al. (2021) by introducing new data representations and carrying out a disciplined comparison between the specifications. On the other, we explore a set of questions related to short-term price predictability in a similar spirit to Aït-Sahalia et al. (2022) but under a different class of models.
In the broader context of the questions addressed in this paper, related works are those of Sirignano and Cont (2019), exploring the universality of order book dynamics, and Wu et al. (2021), advocating for robust representations of order books. We base our experimental procedure on model confidence sets, introduced by Hansen et al. (2011). We believe this formalized statistical inference framework perfectly suits our aim of addressing questions that require comparisons between multiple models and benchmarks.
The two main contributions of this paper are summarized as follows. First, we introduce a deep learning model for mid-price forecasting based on a more robust representation of the order book, which we will refer to as deepVOL. This representation allows us to easily adapt the model to the setting where more granular L3 data is available. Second, we provide new empirical results addressing essential questions regarding short-term price predictability in a disciplined experimental framework.
2 The space of models under consideration

2.1 The order book

At a given point in time, an order book contains all the (visible) buy and sell orders placed for a given security on a specific exchange. The lowest ask price, resp. the highest bid price, is known as the first ask order book level, resp. the first bid order book level. Subsequent levels are defined accordingly. An example of a 10-level order book snapshot is displayed in Figure 1. We note that in electronic exchanges, orders can be submitted on an evenly spaced discrete set of prices, known as ticks. The smallest price increment is known as the tick size; for most U.S. traded stocks, it corresponds to $0.01. It is important to note that not all tick prices may have standing orders; therefore, the first 10-level ask/bid prices may not coincide with the first 10 ask/bid tick prices.
There are three main types of actions that traders can request on an exchange:

• a limit order: an order to buy or sell a given quantity of the security at a given price;
• a market order: an order to buy or sell a given quantity of the security at the best available price;
• a cancellation or partial deletion: an order to fully or partially delete a standing limit order.
Note that while market orders are always immediately executed once posted, passive limit orders, i.e. orders which do not cross the spread, will sit in the order book until they are matched. A limit order entering the market at a tick price where other limit orders are already present will be added at the end of the standing queue. Trades occur each time a market order or aggressive limit order is posted: the requested volume is matched to the standing limit orders according to price-time priority. We will not delve into the detailed characteristics of all the different order types which exist, but it suffices to point out that the three actions described above represent the fundamental drivers of the evolution of order books. For example, in Figure 1, the change in order book shape may be due to either a (partial) deletion of a first-level buy limit order or the execution of a sell order (either a market order or an aggressive limit order).
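The price-time priority matching just described can be sketched in a few lines. This is a toy illustration of ours, not exchange code: it keeps only the ask side of the book, with a FIFO queue of resting order sizes at each tick price.

```python
# Hypothetical toy example (ours): price-time priority on the ask side only.
from collections import deque

class ToyBook:
    """Ask side only: maps tick price -> FIFO queue of resting order sizes."""
    def __init__(self):
        self.asks = {}  # price -> deque of volumes; FIFO order = time priority

    def limit_sell(self, price, size):
        self.asks.setdefault(price, deque()).append(size)

    def market_buy(self, size):
        """Consume resting asks from the best (lowest) price, oldest first."""
        filled = []
        while size > 0 and self.asks:
            best = min(self.asks)          # price priority
            queue = self.asks[best]
            take = min(size, queue[0])     # time priority within the level
            queue[0] -= take
            size -= take
            filled.append((best, take))
            if queue[0] == 0:
                queue.popleft()
            if not queue:
                del self.asks[best]
        return filled

book = ToyBook()
book.limit_sell(100.01, 5)    # arrives first at 100.01
book.limit_sell(100.01, 3)    # queues behind it
book.limit_sell(100.02, 10)
print(book.market_buy(7))     # -> [(100.01, 5), (100.01, 2)]
```

The buy of 7 units fills the oldest order at the best price in full (5 units) before touching the younger order queued behind it, exactly the queue dynamics described above.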
During trading hours, electronic exchanges operate continuously, and for this reason, order books are sometimes referred to as continuous books. Exchanges such as the Nasdaq time stamp each event at nanosecond precision. Between different stocks, the level of trading activity may vary significantly, and the time elapsed between consecutive events may differ by orders of magnitude. For this reason, we define an alternative time clock: a discrete order book clock, which increments by one each time an action is executed on the order book. Throughout our discussion, we will explore questions of predictability with respect to this order book clock, which is the same as the one considered in Ntakaris et al. (2018). Some authors, for example Aït-Sahalia et al. (2022), consider alternative order book-driven time clocks, such as transaction clocks and volume clocks.
We understand that models based on order book-specific clocks might be challenging to use in practical trading applications. However, we believe order book-based clocks provide a more natural measure of time for exploring predictability and are more suitable for comparing results across stocks than physical time: a 100ms time horizon has a significantly different meaning for stocks with different levels of trading activity.
Another important observation is that order book data might not be accessed by all market participants at the same level of granularity. The Nasdaq Quotation Dissemination Service makes the following distinctions:

• L1 data: the best bid and ask prices and corresponding volumes;
• L2 data: all available bid and ask prices and corresponding volumes;
• L3 data: all available bid and ask prices and corresponding volumes split among the orders in the queue.
Throughout the paper, for a side of the book x ∈ {a, b}, we denote by:

• π^(j)_{x,t}, for j = 1, 2, . . ., the j-th ask/bid tick price from the mid-price;
• s^(j)_{x,t} the visible volume standing at price π^(j)_{x,t};
• q^(j,k)_{x,t}, for k = 1, 2, . . ., the queue corresponding to volume s^(j)_{x,t} and ordered by time priority.

2.2 Predictability of returns
We will explore models that aim to identify short-term predictability in returns. We first introduce the familiar regression framework for return predictions. We then rephrase the task in terms of a classification problem, extending the definition of predictability to this setting.
We will be exploring predictability arising from past order book data and thus define the information σ-algebra F_t to be F_t = σ(x_t, . . ., x_{t−T+1}), where x_t, . . ., x_{t−T+1} are order book derived features at times t, . . ., t − T + 1 for some look-back window of length T.
Predictions in the regression framework

Let us first consider the regression setting. At time t we denote by r_{t,t+h} ∈ R the h-step ahead mid-price return, as defined in Section 3.2.2. Given order book information F_t at time t, there exists a measurable function g such that

E[r_{t,t+h} | F_t] = g(x_t, . . ., x_{t−T+1}),

or, equivalently,

r_{t,t+h} = g(x_t, . . ., x_{t−T+1}) + ε_t,

for some mean-zero noise variable ε_t orthogonal to the space of F_t-measurable random variables. A prediction is defined to be any F_t-measurable random variable, and the best prediction is the F_t-measurable random variable r̂_{t,t+h} which minimizes the expected cost E[C(r̂_{t,t+h}, r_{t,t+h})] for some appropriate cost function C : R × R → R. In the case of quadratic cost C(r_1, r_2) = (r_1 − r_2)^2, we have

r̂_{t,t+h} = E[r_{t,t+h} | F_t] = g(x_t, . . ., x_{t−T+1}).

Different choices of cost function are possible; for example, when C is the absolute cost, i.e. C(r_1, r_2) = |r_1 − r_2|, the best prediction is given by the conditional median of r_{t,t+h} given F_t. Given a parametric family of models {g_θ(·)}_{θ∈Θ} and an observed training data set D_train = {(x_t, . . ., x_{t−T+1}, r_{t,t+h})}_{t∈I_train}, one can first learn a function g_θ approximating g and then produce the return predictions r̂_{t,t+h} = g_θ(x_t, . . ., x_{t−T+1}) for test data points D_test = {(x_t, . . ., x_{t−T+1}, r_{t,t+h})}_{t∈I_test}. Assuming returns to be stationary, we say that there is order book-driven predictability if the learned predictions outperform an unpredictive benchmark prediction on the testing set with respect to the chosen cost function C(·, ·).
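As a quick numerical illustration of the role of the cost function (our own sketch, using synthetic skewed "returns"), one can check by brute force that the best constant prediction under quadratic cost is approximately the mean, while under absolute cost it is approximately the median:

```python
# Hypothetical check (ours): quadratic cost -> mean, absolute cost -> median.
import numpy as np

rng = np.random.default_rng(0)
r = rng.exponential(1.0, 100_000)   # skewed stand-in "returns": mean != median

# Brute-force search over constant predictions on a fine grid.
candidates = np.linspace(0.0, 3.0, 301)
quad = [(np.mean((r - c) ** 2), c) for c in candidates]   # quadratic cost
absl = [(np.mean(np.abs(r - c)), c) for c in candidates]  # absolute cost

best_quad = min(quad)[1]   # close to the sample mean (about 1.0 here)
best_abs = min(absl)[1]    # close to the sample median (about log 2)
print(best_quad, best_abs)
```

The two minimizers differ precisely because the distribution is skewed, which is why the choice of cost function matters for defining the "best" prediction.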

Predictions in the classification framework
In this paper, we will discretize the space of returns by grouping mid-price movements as downward, no-change, and upward. We hence introduce the discretized return random variable

c_{t,t+h} = ↓ if r_{t,t+h} < −γ,  c_{t,t+h} = = if |r_{t,t+h}| ≤ γ,  c_{t,t+h} = ↑ if r_{t,t+h} > γ,

for an appropriate choice of γ > 0, cf. Section 3.2.2. Instead of modeling only the expected conditional return, in the classification setting one aims to approximate the whole conditional distribution, i.e. find measurable functions p_↓, p_=, p_↑ such that P(c_{t,t+h} = * | F_t) = p_*(x_t, . . ., x_{t−T+1}), for * ∈ {↓, =, ↑}. The discretized return prediction for an unobserved sample is then given by the minimizer of the expected misclassification cost:

ĉ_{t,t+h} = argmin_{⋆∈{↓,=,↑}} Σ_{*∈{↓,=,↑}} c_{*|⋆} p_{*,θ}(x_t, . . ., x_{t−T+1}),

where the parametric conditional probabilities p_{↓,θ}, p_{=,θ}, p_{↑,θ} are learnt from a training data set D_train = {(x_t, . . ., x_{t−T+1}, c_{t,t+h})}_{t∈I_train} and c_{*|⋆} is the misclassification cost of a * observation classified as a ⋆. Assuming the return process is stationary, we say that there is order book-driven predictability if the learned predictions outperform an unpredictive benchmark prediction on a test set in terms of total misclassification cost.
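The discretization step above can be sketched as follows (our own illustration; the function name and the encoding of the classes as −1/0/+1 are choices of the example, not the paper's):

```python
# Sketch (ours): discretizing h-step returns into {down, no-change, up}
# with a threshold gamma, encoded here as -1 / 0 / +1.
import numpy as np

def discretize(returns, gamma):
    """Map returns to -1 (down), 0 (no-change), +1 (up)."""
    r = np.asarray(returns)
    return np.where(r > gamma, 1, np.where(r < -gamma, -1, 0))

print(discretize([-0.004, -0.0005, 0.0, 0.002], gamma=0.001))
# -> [-1  0  0  1]
```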
Specifying an appropriate cost function is task-dependent. For example, a trader using the predictions as trading signals might be more interested in the correctness of up and down predictions than no-change ones. On the other hand, a market maker might prioritize the correctness of no-change predictions when using these to decide whether to tighten quoted spreads. In both cases, the consequences of different types of errors are asymmetric and heavily impact the choice of the cost matrix C = {c_{*|⋆}}_{*,⋆∈{↓,=,↑}}. In the classification framework an alternative approach is to compare the predicted conditional distributions {p_*(x_t, . . ., x_{t−T+1})}_{*∈{↓,=,↑}} to the realized outcomes directly. This approach requires specifying neither a cost matrix C nor a return prediction. The predicted conditional distribution (the output of our model) is directly compared with the observed data via a suitable "distance" on the space of probability measures P({↓, =, ↑}), where realized returns are encoded as Dirac measures. A natural choice of such "distance" is given by the categorical cross-entropy, as this can be interpreted (under appropriate assumptions) as the negative log-likelihood of the test set:

L(θ; D_test) = − Σ_{t∈I_test} Σ_{*∈{↓,=,↑}} 1{c_{t,t+h} = *} log p_{*,θ}(x_t, . . ., x_{t−T+1}),

where the model parameters θ are learnt from the training set D_train. As this paper does not target a specific trading strategy but is more concerned with general predictability questions, we will be evaluating our models based on the categorical cross-entropy loss. In the following, we will thus say that there is order book-driven predictability if the learned conditional distributions outperform an unpredictive benchmark distribution on a test set relative to categorical cross-entropy loss. As discussed in Appendix A.1, the natural choice for the loss used in training will also be categorical cross-entropy.
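The evaluation just described can be sketched as follows (our own illustration; the probabilities are made up, and the benchmark simply repeats one fixed distribution for every observation, as the unpredictive empirical benchmark does):

```python
# Sketch (ours): comparing predicted conditional distributions to realized
# classes via categorical cross-entropy; lower is better.
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the realized classes.
    probs: (n, 3) rows over {down, no-change, up}; labels: (n,) in {0, 1, 2}."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

labels = np.array([0, 1, 1, 2, 1, 0])
# Unpredictive benchmark: the same (training-set empirical) distribution
# for every observation.
bench = np.tile([1 / 3, 1 / 2, 1 / 6], (len(labels), 1))
# A made-up "model" putting 0.70 mass on the realized class each time.
model = np.full((len(labels), 3), 0.15)
model[np.arange(len(labels)), labels] = 0.70

print(cross_entropy(bench, labels), cross_entropy(model, labels))
```

A model whose cross-entropy is statistically below the benchmark's is exactly what we call order book-driven predictability in this setting.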
Note that in both the regression and classification frameworks, we compare the learned predictions with those obtained from an unpredictive benchmark model to determine whether there is predictability by using a score/cost function. Simply comparing point estimates of the scores does not provide a sound statistical argument answering the question of whether there is predictability: the difference in score may simply be due to statistical variability. This is where the model confidence set (Hansen et al., 2011) procedure comes in. As we will discuss in Section 4, this provides a statistical testing framework to determine whether the learned models statistically outperform the unpredictive benchmark, i.e. whether there is predictability.

Remark 2.1. It is clear that the definition of predictability is intrinsically tied to that of the unpredictive benchmark. Different choices for unpredictive benchmark models are possible; for example, in the regression framework, a natural choice is given by (a version of) the efficient market hypothesis (EMH) (Fama, 1970). In this case, the unpredictive hypothesis assumes the conditional expected return to be 0. Under the classification framework, the simple EMH does not translate into a clear-cut model for the conditional distribution of returns. In this setting, we consider as a natural unpredictive hypothesis a slightly stronger version of the EMH: under the unpredictive benchmark, returns are assumed to be IID and independent of any information up to time t. The unpredictive benchmark conditional distribution will therefore be given by the empirical distribution of the training set. As a side remark, we believe it is important to note the EMH was originally proposed in a very different market environment and at significantly higher latencies than the ones considered in this paper. Nevertheless, we believe the EMH provides a natural unpredictive benchmark hypothesis when testing for predictability.
As mentioned in Section 1.2, there are two main approaches considered in the literature for exploring the short-term predictability of returns in order book markets: the first considers carefully handcrafted features x_t, . . ., x_{t−T+1} and relatively simple specifications for the prediction functions g or (p_↓, p_=, p_↑), e.g. linear specifications or decision trees; the second, which we will explore in this paper, uses raw order book features x_t, . . ., x_{t−T+1} as inputs to more complex prediction functions, e.g. deep neural network architectures. We will assume the reader is familiar with deep learning techniques and thus only give a brief overview of the relevant concepts in Appendix A.1; more detailed expositions can be found in Goodfellow et al. (2016).

2.3 Deep learning models for short-term return predictions in order book markets
In the previous section, we set up the learning framework for the return prediction task. We now discuss how the neural networks covered in Appendix A.1 can be combined to model price formation mechanisms and predict h-step ahead high-frequency returns. We will consider a specific class of deep learning models based on the deepLOB architecture (Zhang et al., 2019).
2.3.1 deepLOB (Zhang et al., 2019)

This network acts on raw order book input. A CNN module and an inception module feed into an LSTM layer which produces the final classification output. The CNN and inception modules aim to extract short-term spatio-temporal features in the data, while the LSTM module works on longer-term dependencies. The deepLOB architecture is summarized in Figure 3 and is made up of the following components:

• Input. The first L levels of raw order book information, prices and volumes, with a look-back window of length T are used as input. The input at time t is thus a (T × 4L) array collecting, for each past time t − τ (τ = 0, . . ., T − 1) and each level l = 1, . . ., L, the ask price p^(l)_{a,t−τ}, ask volume v^(l)_{a,t−τ}, bid price p^(l)_{b,t−τ}, and bid volume v^(l)_{b,t−τ}. A feature-wise rolling window z-score standardization is applied to the input, e.g. v^(1)_{a,t−τ} is standardized using the mean and standard deviation of the first-level ask volumes over the previous five days.
• CNN module. Convolutions are applied to the data in both the spatial and temporal dimensions. The spatial convolutions aim to aggregate information across order book levels, and the temporal convolutions can be understood as smoothing operations. The CNN module is summarized in Figure 4.
• Inception module. This module up-samples the convolved data by applying various temporal convolutions with different filter lengths (time-windows). Each temporal convolution can be interpreted as a (weighted) moving average. It is similar in spirit to the computation of technical indicators, but the frequencies at which it is applied are substantially different.
• LSTM. The Long Short-Term Memory layer takes the multidimensional time series produced by the inception module and feeds it through a recurrent network structure aimed at extracting longer-term dependencies in the data. The last hidden state of the LSTM is passed through a dense layer with a softmax activation function to produce the h-step ahead return prediction ↓, =, or ↑.
The exact details of the deepLOB architecture can be found in Table 13.
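The rolling standardization mentioned in the input description can be sketched as follows (our simplified version: a trailing window measured in observations rather than the five calendar days used for the actual features):

```python
# Sketch (ours): feature-wise rolling z-score standardization, where each
# observation is standardized by the mean/std of a trailing window.
import numpy as np

def rolling_zscore(x, window):
    """Standardize x[t] by the mean/std of the previous `window` observations."""
    x = np.asarray(x, dtype=float)
    out = np.full_like(x, np.nan)   # undefined until a full window is available
    for t in range(window, len(x)):
        past = x[t - window:t]
        out[t] = (x[t] - past.mean()) / past.std()
    return out

v = np.array([10., 12., 11., 13., 50.])    # e.g. first-level ask volumes
print(rolling_zscore(v, window=4)[-1])      # the jump at the end scores very high
```

Standardizing by a trailing window (rather than the full training sample) keeps the features adapted to F_t: the statistics used at time t depend only on past data.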
2.3.2 deepOF (Kolm et al., 2021)

In the deepLOB model, order book states, which are non-stationary, are mapped to a stationary quantity, returns. While in theory this shouldn't be a problem (due to the universality property of deep neural networks), Kolm et al. (2021) argue that using some form of stationary input might improve model performance by facilitating the training procedure. A stationary order book quantity is order flow. First-level order flow describes the net flow of orders at the best bid and ask. This was introduced in Cont et al. (2013) to explore the price impact of order book events. This single quantity parsimoniously models the instantaneous effect of order book events on prices. The first-level bid and ask order flows corresponding to the order book event occurring at time t (in order book time) are defined by:

bOF_t = v^(1)_{b,t} if p^(1)_{b,t} > p^(1)_{b,t−1};  v^(1)_{b,t} − v^(1)_{b,t−1} if p^(1)_{b,t} = p^(1)_{b,t−1};  −v^(1)_{b,t−1} if p^(1)_{b,t} < p^(1)_{b,t−1};

aOF_t = −v^(1)_{a,t−1} if p^(1)_{a,t} > p^(1)_{a,t−1};  v^(1)_{a,t} − v^(1)_{a,t−1} if p^(1)_{a,t} = p^(1)_{a,t−1};  v^(1)_{a,t} if p^(1)_{a,t} < p^(1)_{a,t−1}.

The difference between the two is known as order flow imbalance:

OFI_t = bOF_t − aOF_t.

Bid order flow corresponds to the net change in volume at the best bid level; the three cases in the definition can be understood as:

• a new bid order being placed at a higher price than the current best bid;
• the order volume at the best bid price increasing or decreasing;
• the entire volume at the best bid price being consumed, thus decreasing the best bid price.
Note that, as discussed in Section 2.1, volumes can increase or decrease due to orders being placed, executed, or canceled.A similar interpretation holds for ask order flow.
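The first-level order flow computation can be sketched as follows (our own illustration; the case logic follows the three-case interpretation given in the text, applied to sequences of best bid/ask prices and volumes at consecutive order book events):

```python
# Sketch (ours) of first-level bid/ask order flow and order flow imbalance
# in the spirit of Cont et al. (2013).
import numpy as np

def first_level_order_flow(pb, vb, pa, va):
    """Return (bOF, aOF, OFI) arrays for events t = 1, ..., n-1.
    pb/vb: best bid price/volume series; pa/va: best ask price/volume series."""
    bOF = np.where(pb[1:] > pb[:-1], vb[1:],                  # bid improves
          np.where(pb[1:] == pb[:-1], vb[1:] - vb[:-1],       # volume change
                   -vb[:-1]))                                 # bid level consumed
    aOF = np.where(pa[1:] < pa[:-1], va[1:],                  # ask improves
          np.where(pa[1:] == pa[:-1], va[1:] - va[:-1],       # volume change
                   -va[:-1]))                                 # ask level consumed
    return bOF, aOF, bOF - aOF

pb = np.array([100.00, 100.00, 100.01])   # best bid improves at the last event
vb = np.array([500, 300, 200])
pa = np.array([100.02, 100.02, 100.02])
va = np.array([400, 400, 450])
bOF, aOF, ofi = first_level_order_flow(pb, vb, pa, va)
print(bOF, aOF, ofi)   # bOF: [-200, 200], aOF: [0, 50], OFI: [-200, 150]
```

Note that the series is stationary by construction: it depends only on changes between consecutive book states, not on the price level itself.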
In Xu et al. (2019) the authors explore flows of volumes at deeper levels in the order book, by introducing multi-level order flow.The definitions are exactly as above with superscript (1) replaced by general (l).We note that in both Cont et al. (2013) and Xu et al. (2019) the authors investigate the explanatory power of (multi-level) order flow imbalance for price changes, i.e. the relationship between contemporaneous order flow imbalance and price changes.
In our setting, as in Kolm et al. (2021), we are instead interested in exploring the predictive power of order flow, i.e. the relationship between past order flow and future price changes. We will hence refer to deepOF as the deepLOB architecture with stationary order flow input, i.e. a (T × 2L) array collecting, for each past time t − τ (τ = 0, . . ., T − 1) and each level l = 1, . . ., L, the bid and ask order flows bOF^(l)_{t−τ} and aOF^(l)_{t−τ}. The order flow input enters the CNN module in Figure 4 at the second convolutional layer. The rest of the architecture is exactly the same as deepLOB, as detailed in Table 13. We note that the first convolutions applied to the order flow input, i.e. the second convolutional layer in Figure 4, aggregate information across bid and ask order flows, essentially computing a weighted order flow imbalance.
Remark 2.2. The original deepOF specification in Kolm et al. (2021) was structured as a (multi-horizon) regression task. In their setting, the last layer of the deep neural network maps each prediction to R instead of P({↓, =, ↑}). Moreover, as discussed in Section 2.5, the way multi-horizon predictions are produced in the original work does not rely on the encoder-decoder structure we use for our multi-horizon models. Another slight difference with the experiments in Kolm et al. (2021) is in the standardization procedure: instead of standardizing each feature by its training mean and standard deviation, we use a rolling window approach.

2.4 The need for a robust representation
As discussed in the previous section, the main difference between deepLOB and deepOF is in the way the data is fed into the model. While deepLOB uses raw order book data as input, deepOF uses a derived quantity, order flow. In general, the success of deep learning tasks is highly dependent on the way the data is represented. The task of predicting returns in order book markets is no exception. In order to achieve the best possible results, one should adopt a robust representation of the data. Wu et al. (2021) identify five main desiderata for a robust representation of order book data:

• Region of interest: the entire order book may contain a wide range of prices; the data representation should select a region of interest based on a complexity-performance trade-off.
• Efficiency: the data representation should avoid excessive dimensionality.
• Validity: the data representation should have a simple definition of valid manipulations.
• Smoothness: the data representation should be robust to small perturbations.
• Compatibility: the data representation should be compatible with the deep learning architecture.
We note the order book representations used in deepLOB and deepOF do not conform to these desiderata. Order book states organized by 'level' do not have a simple validity (price and volume information are intrinsically tangled and would lose their significance if treated separately in a black box algorithm), are not robust to small perturbations (small orders added at empty ticks completely change the order book feature vector), and are incompatible with deepLOB's CNN module (the spatial structure is not homogeneous as there is no fixed interval between levels). It turns out that while the 'level' representation of the order book may be easily understandable by humans, it is less so for statistical models. In addition to not satisfying the desiderata, this representation does not respect the following fundamental but implicit assumption of deep learning models: signals at the same entry of the input should come from the same source. In the 'level' representation, as new order book events happen, the same signal (i.e. a posted order) may move between levels.
We therefore introduce volume features, which provide a robust representation of order book data. Fixing a window of size W > 0, we define the volume feature at time t − τ as the vector of volumes standing at the 2W tick prices closest to the mid-price,

(s^(W)_{b,t−τ}, . . ., s^(1)_{b,t−τ}, s^(1)_{a,t−τ}, . . ., s^(W)_{a,t−τ}) ∈ R^{2W},

where s^(j)_{x,t−τ} for x ∈ {a, b} are the bid/ask volumes corresponding to the j-th price from the mid π^(j)_{x,t−τ}, as defined in Section 2.1. We note that the volume representation indeed satisfies the five desiderata: a region of interest is identified (via the window W > 0), it is efficient (for the same dimension of input it may convey more or less information than the 'level' representation, depending on how sparsely the orders are placed in the order book), it has a simple validity (all entries are in the same units), it is robust to small perturbations (new orders at empty levels minimally affect the feature) and it is compatible with the CNN architecture (the spatial structure of the volumes is homogeneous). An intuitive visualization of this representation as a one-dimensional gray-scale strip is given in Figure 6; when including a time dimension this naturally becomes a two-dimensional gray-scale image. The main drawback of the volume representation is that when orders are placed far apart in the order book it is sparse, and a larger window W > 0 may be required. Our definition of volume features is similar to the mid-price-centered moving window representation of Wu et al. (2021), with the latter living in R^{2W+1} instead of R^{2W} and using ± signs to distinguish between bid and ask volumes. The need for a new, more robust representation of the order book was reached independently.

To adapt the architecture to this input, which we will refer to as deepVOL, we stack the bid and ask volumes along a third dimension and feed this into a three-dimensional convolutional layer with a (2 × 2 × 1) filter and (1 × 1 × 1) stride. This layer aims to extract imbalances in the order book by comparing volumes on the two sides of the mid-price. The CNN module with the appropriate changes is depicted in Figure 7. The rest of the deepVOL architecture is exactly the same as deepLOB, with one slight difference in the way the data is normalized: thinking of the volume representation as a gray-scale image, a natural choice is an image-style rescaling of the volumes, instead of the rolling window standardization we apply to deepLOB and deepOF features.

Remark 2.3. One could define a volume flow quantity based on the distance from the mid, in the same way order flow describes the flow of orders based on the level. Following the same motivation for considering order flow in deepOF presented in Section 2.3.2, one could consider using a volume flow quantity as input to the deep learning architectures with the desired robust representation properties.
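The construction of the volume features can be sketched as follows (a toy illustration of ours; the $0.01 tick size and the half-tick offsets of the mid-price in the example are assumptions of the sketch, not the paper's specification):

```python
# Sketch (ours): mapping a sparse 'level' book into the mid-price-centered
# volume strip: a 2W-vector of volumes at consecutive ticks around the mid,
# with zeros at empty ticks.
import numpy as np

TICK = 0.01

def volume_strip(bids, asks, mid, W):
    """bids/asks: dict price -> volume. Returns a length-2W volume vector:
    W bid ticks below the mid, then W ask ticks above it."""
    strip = np.zeros(2 * W)
    for j in range(1, W + 1):
        strip[W - j] = bids.get(round(mid - (j - 0.5) * TICK, 2), 0)
        strip[W + j - 1] = asks.get(round(mid + (j - 0.5) * TICK, 2), 0)
    return strip

bids = {100.00: 300, 99.98: 150}   # note the empty tick at 99.99
asks = {100.01: 200, 100.04: 500}
print(volume_strip(bids, asks, mid=100.005, W=4))
# -> [  0. 150.   0. 300. 200.   0.   0. 500.]
```

Observe the robustness property discussed above: adding a tiny order at the empty tick 99.99 changes one entry of the strip slightly, whereas in the 'level' representation it would shift every deeper level by one position.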

2.4.1 L3 volume features
All models considered so far, deepLOB, deepOF, and deepVOL, use L2 data as input. If one has access to more granular data, i.e. L3 data breaking down each volume queue into single orders, one might be interested in trying to leverage this information to obtain higher predictive performance. We thus define a natural extension of the volume representation considered in the previous section.
Let us denote by (q^(j,1)_{x,t}, q^(j,2)_{x,t}, . . .) ∈ R^N for x ∈ {a, b} the queue at the j-th bid/ask price from the mid π^(j)_{x,t}, as introduced in Section 2.1. Here q^(j,k)_{x,t} denotes the volume of the k-th order in the queue sorted by time priority and is set to zero if there is no such order. The aggregated volume at π^(j)_{x,t} is given by:

s^(j)_{x,t} = Σ_{k≥1} q^(j,k)_{x,t}.

A natural extension of the volume representation would therefore be to replace each aggregated volume s^(j)_{x,t} by the full queue (q^(j,1)_{x,t}, q^(j,2)_{x,t}, . . .). Unfortunately, this is an infinite-dimensional array that cannot be directly fed into machine learning models. We therefore cut off the queue at a given depth level. In order to avoid discarding precious information, we aggregate all orders sitting past the maximum depth level at the end of the queue. For a given depth level D > 0, we thus consider the L3 volume feature

(q^(j,1)_{x,t−τ}, . . ., q^(j,D−1)_{x,t−τ}, Σ_{k≥D} q^(j,k)_{x,t−τ}).

Remark 2.4. A similar approach to the one used for cutting off the queue might be applied to aggregate volumes sitting deeper in the order book when deriving L2 representations. This would give a better idea of the total liquidity in the order book.
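The queue cutoff with tail aggregation can be sketched as follows (our own illustration; `D` is the depth level defined above):

```python
# Sketch (ours): keep the first D-1 orders of a queue and aggregate all
# deeper orders into the D-th entry, so no volume is discarded.
import numpy as np

def truncate_queue(queue, D):
    """queue: order sizes sorted by time priority; returns a length-D vector."""
    q = np.zeros(D)
    head = queue[:D - 1]
    q[:len(head)] = head
    q[D - 1] = sum(queue[D - 1:])   # tail volume aggregated at the end
    return q

print(truncate_queue([50, 20, 10, 5, 5], D=3))   # -> [50. 20. 20.]
print(truncate_queue([50], D=3))                 # -> [50.  0.  0.]
```

By construction the truncated vector sums to the aggregated volume s^(j)_{x,t}, so the L2 feature can always be recovered from the L3 one.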
In order to extract relevant information from the queue, we add an initial convolutional layer, which maps each queue to a weighted sum of the order sizes. The weighted aggregated volumes are then fed into the deepVOL architecture as above. The resulting CNN module is summarized in Figure 8.
The full architectures for deepVOL and deepVOL L3 are detailed in Table 14.

2.5 Multi-horizon models
So far we have considered single-horizon models, i.e. order book input x_t is mapped to a three-class distribution (p_↓, p_=, p_↑) corresponding to the discretized return c_{t,t+h} at a fixed horizon h. In the following, we consider a generalization of the architectures considered thus far to the setting where the forecasting horizon is a vector h = (h_1, . . ., h_K). In this case the modelling task consists in predicting the distributions of the discretized returns c_{t,t+h} = (c_{t,t+h_1}, . . ., c_{t,t+h_K}).
The simplest way of adjusting the current models to the multi-horizon framework would be to replace the last softmax layer in Figure 5 with K parallel dense softmax layers. The last hidden state of the LSTM module would hence be mapped to an array of size 3 × K, corresponding to K distributions over three classes. A similar architecture, though in the regression framework, is considered in Kolm et al. (2021). While this approach is perfectly valid, it does not make use of the sequential nature of the task, potentially neglecting an important structural feature of the data.
In this paper, we leverage architectures inspired by machine translation, which are naturally suited for sequential forecasting. Specifically, we consider encoder-decoder models: an encoder maps the input data to a latent summary state (also known as the context vector) and a decoder then rolls forward predictions sequentially. In this context, let z_{t−T+1}, ..., z_t denote the last T hidden states of an encoder at time t; a decoder then rolls forward predictions by

    c_k = h(z_{t−T+1}, ..., z_t),    d_k = f(p_{k−1}, c_k),    p_k = g(d_k, c_k),    k = 1, ..., K.

Here h(•) is a function acting on the hidden states of the encoder to extract the context vector c_k (this may possibly depend on other inputs as well, such as previous hidden states of the decoder), f(r, p) is a recurrent layer with recurrent input r and exogenous input p, and g(•) is an output layer depending on both the decoder hidden state and the context. The general mechanism of such an encoder-decoder architecture is visualized in Figure 9. In our experiments, we consider a sequence-to-sequence (seq2seq) decoder (Cho et al., 2014), the simplest example of such an architecture. In this setting we set c_k ≡ z_t, implicitly assuming that, at every forecasting horizon, the last hidden state of the encoder summarizes all the relevant information required to make the prediction. More complex architectures exist; for example, Luong et al. (2015) introduce an attention-based decoder that uses a weighted combination of all the hidden states of the encoder as the context vector. In that setting, different weights are used at different forecasting horizons, selectively accessing hidden states of the encoder during decoding. While attention-based networks have been successfully applied to high-frequency mid-price predictions, cf. Zhang and Zohren (2021) and Tran et al. (2019), in this work we wish to exemplify the potential of multi-horizon models and thus restrict ourselves to simple seq2seq decoders.
In the experiments, we set f(•) to be an LSTM and g(•) to be a dense layer with softmax activation. Moreover, p_0 is initialized at p_0 = (0, 1, 0).
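The decoder roll-forward can be sketched in a few lines of numpy under the seq2seq choice c_k ≡ z_t. Dimensions and (random) weights are illustrative stand-ins for learned parameters, and a plain tanh layer stands in for the LSTM cell:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative dimensions: encoder/decoder hidden size 8, three classes.
H, C = 8, 3
z_t = rng.standard_normal(H)           # last encoder hidden state = context (c_k ≡ z_t)

# Hypothetical decoder parameters (learned in practice).
W_r = rng.standard_normal((H, C))      # acts on the recurrent input p_{k-1}
W_c = rng.standard_normal((H, H))      # acts on the context
W_o = rng.standard_normal((C, 2 * H))  # output layer on [decoder state, context]

def decode(z, K):
    """Roll forward K multi-horizon class distributions from the context z."""
    p = np.array([0.0, 1.0, 0.0])      # p_0 initialised at the 'no move' class
    out = []
    for _ in range(K):
        d = np.tanh(W_r @ p + W_c @ z)             # recurrent layer f(p_{k-1}, c_k)
        p = softmax(W_o @ np.concatenate([d, z]))  # output layer g(d_k, c_k)
        out.append(p)
    return np.stack(out)

preds = decode(z_t, K=4)               # K distributions over the three classes
```

Each of the K rows of `preds` is a distribution over the three return classes, one per forecasting horizon.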
Multi-horizon forecasting was first proposed for deepLOB in Zhang and Zohren (2021): the LSTM module of the deepLOB architecture discussed in Section 2.3.1 acts as an encoder, mapping order book states to the LSTM final latent vector z_t = [h_t, s_t]. A seq2seq decoder then rolls forward the prediction, producing distributional forecasts at horizons h = (h_1, ..., h_K). The LSTM module with seq2seq decoder is illustrated in Figure 10. Clearly, there is nothing stopping us from applying the same multi-horizon structure to the output of the deepOF and deepVOL convolutional modules.
Full details of the multi-horizon architectures can be found in Table 13 and Table 14.

Remark 2.5. We adopt the same multi-horizon framework as in Zhang and Zohren (2021), where the response c_{t,t+h_k} at horizon h_k ∈ {h_1, ..., h_K} corresponds to the return from time t to time t + h_k. Alternatively, one might wish to consider as multi-horizon responses the subsequent incremental returns c_{t+h_{k−1},t+h_k}, i.e. the returns between time t + h_{k−1} and time t + h_k. By choosing evenly spaced h_k's, one would obtain more consistent responses across prediction time steps {1, ..., K} and possibly improve the performance of the multi-horizon models.
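The two response conventions of Remark 2.5 can be illustrated on a toy mid-price path (the values below are made up):

```python
import numpy as np

# Mid-prices at times t, t+h1, t+h2, t+h3 (illustrative values).
mids = np.array([11.855, 11.860, 11.850, 11.870])

# Cumulative returns c_{t, t+h_k}, all relative to time t ...
cumulative = mids[1:] / mids[0] - 1.0

# ... versus incremental returns c_{t+h_{k-1}, t+h_k} between consecutive horizons.
incremental = mids[1:] / mids[:-1] - 1.0
```

The incremental responses compound back to the cumulative ones, so the two parametrizations carry the same information; they differ only in how the learning signal is distributed across the K prediction steps.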

Data set
In this section, we introduce the data set used in the experiments presented in Section 4. First, we briefly describe how LOBSTER order book data is related to the Nasdaq ITCH feed. Next, we give details of how we process LOBSTER data to obtain features and responses for the experiments. We also provide some descriptive statistics of the data. We consider the same universe of stocks and trading period as in Kolm et al. (2021): through LOBSTER (Huang and Polak, 2011) we access one year of open (9:30 EST) to close (16:00 EST) trading data for 115 Nasdaq tickers from January 2, 2019, to January 31, 2020. To produce results in feasible computational time, we select a subset of 10 stocks, trying to preserve a sufficiently varied set of liquidity characteristics. The ten tickers and their liquidity characteristics are summarized in Table 1. For full details on how the 10 stocks were selected, see Appendix A.8. The service LOBSTER provides to academic researchers is to reconstruct the historic order book from Nasdaq's Historical TotalView-ITCH data. For each selected ticker and date, LOBSTER returns message and order book files subsampled at a given granularity. In practical terms, a 10-level data granularity yields the set of messages corresponding to order book updates in the first 10 levels, along with the corresponding reconstructed order book 'chopped' at 10 levels. The evolution of the order book determined by the messages in Table 2 is reported in Table 3. More information on the order book reconstruction algorithm used by LOBSTER can be found in Huang and Polak (2011) and on the website www.lobsterdata.com.

Processed data
From the historic LOBSTER order book data we build the features x_t, ..., x_{t−T+1} and the target responses c_{t,t+h}. In the experiments in Section 4 we apply the learning framework of Section 2.2 with the deep learning models introduced in Section 2.3 to these feature-response pairs, using the model confidence set (MCS) procedure for model comparison. As previously discussed in Section 2.1, we measure time t (and prediction horizons h) using an order book-driven clock, ticking every time an event occurs on the order book. This clock corresponds to the Event ID in Table 2 and Table 3. Note that we access LOBSTER data up to level 10, thus our order book clock is conditional on the updating event occurring in the first 10 levels. By construction, the data contains all information on new limit orders, market orders, and cancellations restricted to the first 10 levels. We apply some minor pre-processing steps to the LOBSTER data, summarized in Appendix A.3.

Features: order book, order flow and volume
We start by discussing how to derive the features x_t at each time point t from the raw LOBSTER data, i.e. from Table 2 and Table 3. While it is quite simple to build L1/L2 order book, order flow and volume features, reconstructing L3 volume data requires a bit more work.
The raw order book input x_t used in deepLOB (Zhang et al., 2019) and described in Section 2.3.1 simply corresponds to the LOBSTER data in Table 3, i.e. at t = 1312 the 10-level order book feature is given by

    (p^{(1)}_{a,t}, s^{(1)}_{a,t}, p^{(1)}_{b,t}, s^{(1)}_{b,t}, ..., p^{(10)}_{b,t}, s^{(10)}_{b,t}) = (11.86, 9484, 11.85, 8800, ..., 11.73, 5500).
The features are standardized using a 5-day rolling window.
To compute the order flow input x_t used in deepOF (Kolm et al., 2021), we apply the equations given in Section 2.3.2 to the LOBSTER data in Table 3, i.e. at t = 1312 the 10-level order flow feature is given by

    (aOF^{(1)}_t, ..., aOF^{(10)}_t, bOF^{(1)}_t, ..., bOF^{(10)}_t) = (−2516, 0, ..., 0).
Again, the features are standardized using a 5-day rolling window.
To construct the volume input x_t at L2 granularity for the deepVOL model described in Section 2.4, we select only the volume information from Table 3, adding in zeros corresponding to empty price ticks, i.e. at t = 1312 the volume feature with window size W = 10 is given by

    (s^{(10)}_{b,t}, ..., s^{(1)}_{b,t}, s^{(1)}_{a,t}, ..., s^{(10)}_{a,t}) = (1400, ..., 8800, 9484, ..., 500).
Note that, in this example, some of the bid-side entries s^{(k)}_{b,t} are zero, since some bid price ticks are empty, cf. Figure 1. As discussed in Section 2.4, volume features are normalized using max-scaling over the whole input array (x_t, ..., x_{t−T+1}).
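The two normalization schemes can be sketched as follows. The single-window z-score below stands in for the paper's 5-day rolling version, and the feature values are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 50
features = rng.normal(loc=9000.0, scale=500.0, size=(T, 20))  # e.g. raw L2 inputs

# Order book / order flow features: z-score standardisation. In the paper the
# mean and std come from a 5-day rolling window; a single window is used here.
standardised = (features - features.mean(axis=0)) / features.std(axis=0)

# Volume features: max-scaling over the whole input array (x_t, ..., x_{t-T+1}).
volumes = np.abs(features)
max_scaled = volumes / volumes.max()
```

Max-scaling preserves the relative sizes of volumes across price ticks within the input window, which is the property the deepVOL representation relies on.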
Finally, to construct the L3 volume features, one needs to work with the message file in Table 2 to keep track of the queues. For example, the volume queue at the first ask price, i.e. π^{(1)}_{a,t} = $11.86, is given at time t = 1311 by

    (q^{(1,1)}_{a,t}, q^{(1,2)}_{a,t}, ...) = (2516, 2000, 1484, 4500, 1500),

and becomes (2000, 1484, 4500, 1500) at t = 1312 following the deletion in Table 2. All other queues are left unchanged. For more details on the complexities of reconstructing volume features from LOBSTER data, see Appendix A.3.1. A deep dive into the distributions of the processed order book, order flow, and volume features is carried out in Appendix A.4.1.
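Queue tracking of this kind can be sketched as follows. The message format and order IDs are hypothetical simplifications of the LOBSTER message file; the sizes are taken from the example above:

```python
# A minimal sketch of maintaining one price level's queue (time priority) from
# messages; order IDs and the message tuple format are illustrative.
queue = [("id1", 2516), ("id2", 2000), ("id3", 1484), ("id4", 4500), ("id5", 1500)]

def apply_message(queue, msg):
    kind, order_id, size = msg
    if kind == "add":                       # new limit order joins the back
        return queue + [(order_id, size)]
    if kind == "delete":                    # cancellation removes the order
        return [(i, s) for i, s in queue if i != order_id]
    if kind == "execute":                   # trade consumes the front of the queue
        i, s = queue[0]
        return ([(i, s - size)] if s > size else []) + queue[1:]
    return queue

# Deletion of the first order, as in the Table 2 example at t = 1312:
queue = apply_message(queue, ("delete", "id1", 2516))
sizes = [s for _, s in queue]
```

After the deletion the queue sizes sum to 9484, matching the L2 aggregated volume at the first ask price in the Table 3 example.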

Responses: categorical mid-price returns
In this paper, we are interested in answering questions regarding the predictability of market returns. Inevitably, the way returns, i.e. the target responses, are defined has a profound effect on this analysis. Here, in line with the related literature, we treat the mid-price as the "true" price and define returns relative to it, but it is important to note that, by definition, this is not a tradable price. We define the return at horizon h as

    r_{t,t+h} = (m̄_{t+h} − m_t) / m_t,

where m_t denotes the mid-price at time t and m̄_{t+h} is the mid-price averaged over a fixed smoothing window of k events around t + h. The mid-price m_t is computed from Table 3 as m_t = (p^{(1)}_{b,t} + p^{(1)}_{a,t}) / 2, i.e. at times t = 1311 and t = 1312 we have m_t = $11.855. This definition is subject to two possible interpretations. Treating the smoothed mid-price as a de-noised estimate of the true (latent) price h steps ahead, we can understand the return as the percentage change of the true (latent) price relative to the current mid. Alternatively, the return can be understood as the average return one would experience by entering a position at the current mid and exiting it roughly h steps ahead (assuming mid-mid trading). For all the horizons h ∈ {10, 20, 30, 50, 100, 200, 300, 500, 1000} considered in Section 4, we fix k = 5. In Appendix A.6 we discuss other methods for defining mid-price returns and their shortcomings.

Remark 3.1. When defining the returns we assume immediate access to the order book. In practice, though, hardware and software constraints lead to non-zero time lags when receiving messages and sending orders to the exchange. While, in our setting, such latencies have a negligible impact on the definition of returns, we discuss how their presence could be more precisely accounted for in Appendix A.7.
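The smoothed return can be sketched as below. The exact placement of the k-event smoothing window is an assumption here (the paper only specifies a smoothing window of size k), as is the illustrative mid-price path:

```python
import numpy as np

def smoothed_return(mids, t, h, k):
    """Return r_{t,t+h}: smoothed mid around t+h (window of k events) vs mid at t.

    Assumption: the smoothing window covers the k events ending at t + h.
    """
    m_bar = mids[t + h - k + 1 : t + h + 1].mean()
    return (m_bar - mids[t]) / mids[t]

mids = np.linspace(11.850, 11.900, 101)   # illustrative drifting mid-price path
r = smoothed_return(mids, t=0, h=50, k=5)
```

Averaging over k events removes the bid-ask bounce-like noise in the raw mid-price, which is precisely the de-noising interpretation given above.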
All our experiments are carried out in a classification framework where the discretized returns are defined as c_{t,t+h} = ↓ (down) if r_{t,t+h} < −γ, = (no move) if |r_{t,t+h}| ≤ γ, and ↑ (up) if r_{t,t+h} > γ, for some γ > 0. In order to make the three classes roughly symmetric and as balanced as possible, we empirically choose γ from the training set D_train via the empirical quantile function Q̂_h of the training set returns {r_{t,t+h}}_{t∈I_train}. As discussed in Section 4, we split our data D into disjoint windows D_w = D_{w,train} ∪ D_{w,val} ∪ D_{w,test} for w = 1, ..., W. The choice of γ will, therefore, be window w- and horizon h-specific. Descriptive statistics of the target return labels for the first window w = 1 are reported in Appendix A.4.2.
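A plausible implementation of the quantile-based choice of γ is sketched below. The exact formula combining the empirical quantiles is an assumption; here the classes are balanced via the 1/3 and 2/3 quantiles of the training returns:

```python
import numpy as np

rng = np.random.default_rng(2)
train_returns = rng.standard_normal(100_000) * 1e-4   # illustrative return sample

# One balanced-class choice of gamma (an assumption: any symmetric combination
# of the 1/3 and 2/3 empirical quantiles would serve the same purpose).
q = np.quantile(train_returns, [1 / 3, 2 / 3])
gamma = 0.5 * (q[1] - q[0])

# Discretize: -1 (down), 0 (no move), +1 (up).
labels = np.where(train_returns < -gamma, -1, np.where(train_returns > gamma, 1, 0))
shares = np.bincount(labels + 1) / labels.size        # class frequencies
```

For a roughly symmetric return distribution this produces three classes with frequencies close to 1/3 each, which is the stated goal of the threshold choice.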
We note that, since stocks trade on a discrete grid of prices determined by the tick size ϑ, the mid-price m_t also evolves on a discrete grid (with steps of size ϑ/2). One could thus define the dollar return from t to t + h by the number of (half) ticks the mid-price moves, i.e.

    R_{t,t+h} = m_{t+h} − m_t ∈ {..., −ϑ, −ϑ/2, 0, ϑ/2, ϑ, ...} ≡ {..., −2, −1, 0, +1, +2, ...}.

This return is discretized by definition, and thus one could directly apply classification models (grouping large negative and positive returns to obtain a finite number of classes). A similar approach is used in Sirignano and Cont (2019) when predicting the next change in mid-price. In our work, we consider the estimate for the "true" mid-price to be the smoothed mid-price m̄_{t+h}, which lives on a much finer grid than m_{t+h}: over the, possibly quite long, time horizon h, multiple changes to the mid-price might occur. In this case the smallest change has little meaning, and so we group the dollar returns into coarser classes via the threshold γ.

Experiments
In this section, we explore the four questions introduced in Section 1.2 and attempt to answer them via the statistical framework provided by model confidence sets (MCS). This inference procedure allows us to compare a set of competing models M_0 based on observed loss time series {L_{i,w}}^W_{w=1}, where L_{i,w} denotes the loss of model i ∈ M_0 at time w ∈ {1, ..., W}. To apply the model confidence set procedure to the data described in Section 3, we divide the 55 weeks from January 14, 2019 to January 31, 2020 into W = 11 five-week periods, as represented in Figure 11. Each window of data D_w is divided into a training-validation set D_{w,train} ∪ D_{w,val}, the first four weeks, and a test set D_{w,test}, the fifth week. The joint training-validation set is then further split into training and validation sets by randomly selecting 5 days out of the four weeks for validation. First, the training dataset D_{w,train} is used to choose the γ threshold for defining the return labels, as detailed in Section 3.2.2 (note that the choice of γ is specific to the choice of window, horizon, and ticker). Then, the joint training-validation dataset D_{w,train} ∪ D_{w,val} is used to train the models. For the unpredictive benchmark model, this simply means determining the empirical distribution of returns in D_{w,train} ∪ D_{w,val}, cf. Appendix A.4.2. For the deep neural network architectures, this amounts to finding the optimal parameters which minimize the training weighted cross-entropy loss. To do so, we use Adam optimization with validation-based early stopping as described in Appendix A.1. Once a model has been trained, we compute its out-of-sample loss on D_{w,test}: for period w ∈ {1, ..., 11} and model i ∈ M_0, L_{i,w} is the categorical cross-entropy loss of the estimated probabilities p̂_{i,w,test} against the observations c_{w,test}, where p̂_{i,w,test} are the class probabilities produced by model i on the testing set D_{w,test} after being trained on the training-validation set D_{w,train} ∪ D_{w,val}. The time series of test losses {L_{i,w}}^{11}_{w=1} are then fed through the MCS procedure described in Appendix A.2 to obtain the set of MCS p-values {p^{MCS}_i}_{i∈M_0}. The intuitive interpretation of these p-values is: if model i ∈ M_0 has an MCS p-value lower than a prescribed confidence level, then it is deemed statistically inferior to other models in M_0 at that confidence level. This naturally justifies the following definition of order book-driven predictability.

Definition 4.1. For α ∈ (0, 1), we say that there is order book-driven predictability at confidence level 1 − α if p^{MCS}_{benchmark} < α.
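The per-window test loss entering the MCS procedure can be sketched as follows. Probabilities and labels are illustrative, and the benchmark is simplified to a uniform distribution (the paper uses the empirical training distribution, which is close to uniform given how γ is chosen):

```python
import numpy as np

def categorical_cross_entropy(p_hat, y):
    """Mean cross-entropy of predicted class probabilities vs integer labels."""
    return float(-np.mean(np.log(p_hat[np.arange(len(y)), y])))

# Illustrative test-window data: 4 observations, 3 classes (down / no move / up).
p_hat = np.array([[0.2, 0.6, 0.2],
                  [0.1, 0.1, 0.8],
                  [0.3, 0.4, 0.3],
                  [0.5, 0.3, 0.2]])
y = np.array([1, 2, 1, 0])

L_iw = categorical_cross_entropy(p_hat, y)   # one entry of the series {L_{i,w}}

# The unpredictive benchmark (uniform classes here, as a simplification):
p_bench = np.full_like(p_hat, 1 / 3)
L_bench = categorical_cross_entropy(p_bench, y)
```

A model whose test losses L_{i,w} are systematically below the benchmark's across the 11 windows is what the MCS procedure would flag as evidence of order book-driven predictability.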
In other words, if the unpredictive benchmark is deemed to be statistically inferior to some of the other models in M_0, then at least one order book-driven model does better than the unpredictive benchmark, i.e. such a model is predictive.

Remark 4.1. Note that not all training procedures may converge to the optimal combination of parameters. This is a characteristic of any model learned via numerical optimization methods. We further note that in our experimental setup, no hyperparameter tuning is carried out for any of the models. When using these models in a production setting, one may obtain better results by selecting parameters using cross-validation on the training-validation set. Parameters which one may investigate tuning include:

• Architecture hyperparameters: number of filters in each convolutional layer (we fix 32 channels); number of weighted averages and lengths of averaging windows; number of LSTM hidden nodes (we fix 64 hidden nodes); decoder type for multi-horizon models (we use seq2seq).
Here we are not interested in obtaining the best possible fit for a specific model but in comparing different models/benchmarks on a level playing field. We thus leave questions related to hyperparameter tuning for future work.
All code is developed in Python with the tensorflow library and the keras API, with some layers requiring custom tensorflow methods. Due to the computationally intensive nature of the experiments, specialized infrastructure was required to store the data (5 TB) and train the models (GPUs). All computations were carried out on Imperial College's High-Performance Computing cluster (Imperial College Research Computing Service, 2022), which provides access to several RTX6000 GPUs.
The results discussed hereafter are specific to the experimental setup under consideration, i.e. they are specific to the selected stocks, time period, and models. Different experimental setups may lead to different results.
4.1 Do high-frequency returns display predictability? If so, how far ahead can we predict?
From the results reported in Table 4 we see that, at high frequencies, predictability is systematically present. For most of the stocks under consideration, we were able to identify predictability up to 50 order book events ahead at the 99% confidence level.
Except for LILAK, which is the most illiquid stock, we observe a substantial correlation between the persistence in predictability and the average Updates-to-Price-Changes ratio, cf. Table 1. Recalling that in our setting the horizon h is measured in order book updates, this can be interpreted as it being easier for the deep learning models to predict returns that are the result of fewer price changes. One might thus expect to obtain a more consistent maximum predictable horizon across stocks when using a price change-driven clock to measure time. When the p-value is low, at least one of the order book-driven models statistically outperforms the unpredictive benchmark, i.e. there is order book-driven predictability according to Definition 4.1.

Which order book representations perform the best?
Having discussed the extent to which the class of models under consideration can identify predictability, it is now natural to ask which of these models performs best. In the MCS framework, this corresponds to determining the specifications which are consistently placed in the set of superior models. We restrict our attention to the horizons and stocks at which predictability is identified and, for each model under consideration, count the number of times it is identified as a superior model. The results are reported in Table 5.

Table 5: % of times the model is in the α-MCS when predictability is identified at the corresponding level α.
From the results in Table 5 (at the 99% confidence level) we can make the following observations on the way order book representations influence model performance. When considering deep learning models for short-term return prediction, having access to L2 data provides a significant advantage over L1 data: models with L1 data are rarely placed in the set of superior models when predictability is identified. When going from L2 to L3 data, instead, the increased granularity does not seem to provide a clear advantage: deepVOL(L2) and deepVOL(L3) display similar performance. Our experiment, therefore, suggests that L3 data might be excessively granular when predicting high-frequency returns from order books.
The choice of features used to represent the order book is also crucial when leveraging deep learning methods for return prediction. Using order flow or volume representations provides a significant improvement in performance: the basic deepLOB model ends up being included in the set of best models only 10% of the time and is outperformed even by the model with only first-level (L1) order flow. Volume- and order flow-based models (with L2/L3 data granularity) display comparable performance, being placed in the set of superior models for 85-90% of the predictable horizons.

Can we use a single model across multiple horizons?
In this section, we explore whether using a seq2seq decoder to produce multi-horizon predictions is beneficial. Such models have the clear advantages of having a single set of weights (the network size is only slightly bigger than that of single-horizon networks) and of producing multiple predictions in very similar run times. This significantly reduces both the memory required to store the models and the time needed to train them. But how do they perform when compared to their single-horizon counterparts?
To answer this question, we run the same experiment as in Section 4.1 and Section 4.2 but enlarge the set of models with the seq2seq specifications. We focus only on prediction horizons h ∈ {10, 20, 30, 50}. The results are reported in Table 6 and Table 7.
From Table 7 we note that, for each input type, the seq2seq specifications outperform their single-horizon counterparts. We suggest this behavior might be due to the increased availability of information in a multi-horizon setting. For a given input-target pair, when targets are multi-horizon, more information is available on the "order book regime" the inputs should be mapped to. Multi-horizon models can thus learn a more granular map from the input variables to the latent space of "order book regimes", which might be beneficial for producing predictions.

Remark 4.2. Note that running the model confidence set procedure again with a larger set of models leads to a counter-intuitive situation where fewer prediction horizons are identified as predictable. One would expect that adding models could only increase the number of horizons at which predictability is identified. While this is true in the limit as the number of observations increases to infinity, it does not hold in the setting where we have access to a finite set of observations. The results in Table 6 should be read with this caveat in mind.

Table 7: % of times the model is in the α-MCS when predictability is identified at the corresponding level α.

Can we use a single model across multiple stocks?
In a similar spirit to Sirignano and Cont (2019), we investigate questions regarding the universality of order book dynamics. Intuitively, at a microstructural level, securities which are traded by market participants with similar characteristics may be subject to the same trading patterns, independently of the underlying stock's properties. In Sirignano and Cont (2019), the authors observe that price formation dynamics driven by past order book information, i.e. next mid-price moves, display common patterns across different stocks. In this paper, we explore whether similar results hold over longer horizons.
We run the same experiment as in Section 4.1 but train the models on multiple stocks simultaneously. Specifically, we split the set of 10 stocks under consideration into two: a first "in-sample" set of stocks given by {QRTEA, CHTR, EXC, WBA, AAPL}, and a second "out-of-sample" set {LILAK, XRAY, PCAR, AAL, ATVI}. For each window w, we use all the training-validation data for the "in-sample" stocks to train (and validate) the models. We then evaluate the trained models on the test data of both the "in-sample" and the "out-of-sample" stocks. The results are reported in Table 8.

Table 8: MCS p-values of the unpredictive benchmark model for the 10 tickers and 9 horizons under consideration, universal models only. When the p-value is low, at least one of the order book-driven models statistically outperforms the unpredictive benchmark, i.e. there is order book-driven predictability according to Definition 4.1. (*) denotes "in-sample" stocks.
Focusing on "in-sample" stocks, we note that using universal models leads to results that are partially inconsistent with those in Table 4. A possible interpretation is that universal models may be picking up different order book dynamics from those identified by stock-specific models. In this sense, relatively illiquid stocks with overall "standard" trading behavior might benefit from universal models thanks to the greater availability of data; we suggest this might be the case for QRTEA and EXC at horizon h = 10. When, instead, the ticker is mainly subject to stock-specific trading patterns, universal models have a hard time detecting predictability; we believe this might be the case for CHTR and AAPL. These results are intrinsically tied to the observations of Remark 4.3.
It is rather remarkable that universal models can identify predictability for "out-of-sample" stocks. This provides evidence of the presence of common trading patterns in the order books of different tickers. For AAL and ATVI, universal models can consistently outperform the unpredictive benchmark predictions without ever learning from the stock's past order book data.

Table 9: % of times each model is in the α-MCS when predictability is identified at the corresponding level α.
Since we believe universal and stock-specific models may be picking up different trading patterns, a natural question to ask is whether the conclusions from Section 4.2 carry over to the universal setting. The results in Table 9 suggest the superior power of L2 over L1 data is retained for universal models. In this case, though, the increased granularity of L3 data appears to be actually beneficial. Moreover, when considering universal models, the volume representation appears to outperform both order book and order flow inputs. These results thus suggest that some predictive universal patterns can be extracted only from the most granular data. This contrasts with stock-specific models, which achieve good predictive performance simply based on order flow information (which is, by construction, also contained in volume-based features).

Remark 4.3. As discussed in Section 3.2.2, the choice of γ used to define the return classes is stock-specific. This means that up/down labels for different securities may correspond to different numbers of mid-price changes. This is in line with our choice of order book-driven clock, which is by definition also stock-specific. It is important to note that the choices of γ and clock t might not entirely account for structural trading differences between stocks, making it harder to identify universal trading patterns. This contrasts with Sirignano and Cont (2019), where the next mid-price move has a straightforward universal structural interpretation. There are a couple more differences with Sirignano and Cont (2019) we want to point out:

• First, in Sirignano and Cont (2019), the authors consider a much bigger set of stocks, comprised of 500 "in-sample" stocks used for training and 500 "out-of-sample" stocks, making their model more general and thus more universal.
• Second, in that paper, the authors also investigate questions regarding the stationarity of price formation dynamics. This entails using a single long training window: the greater availability of data leads to more stable results. To employ the model confidence set procedure and allow the models to capture patterns specific to different economic regimes, we did not adopt this approach in our work.
• Third, the LSTM model considered in that paper differs from the deepLOB/deepOF/deepVOL architectures. In Sirignano and Cont (2019), the LSTM specification is an online model, i.e. inputs at time t represent the order book state at time t only and are fed through an LSTM-based architecture, updating the stored hidden state and producing output predictions. In our models, instead, there is no storage of hidden units between one prediction and the next, i.e. at each time step t, we input the order book history between t − T + 1 and t (in the form of raw order books, order flow or volumes) and, after appropriate convolutional feature extraction, apply an LSTM to the processed sequential data to produce the prediction. When studying the dependence on the order book history length T, Sirignano and Cont (2019) refer to the cut-off horizon used in the backpropagation-through-time computation of gradients during training.

Conclusions and Outlook
In this paper, we explored empirical questions regarding the predictability of mid-price returns driven by order book data in centralized exchanges. The predictability in price formation dynamics, already considered in the sense of the next mid-price change in Sirignano and Cont (2019), was found to persist up to 50-300 order book updates ahead, horizons over which multiple mid-price changes may occur. Such predictable horizons might vary from a few milliseconds to nearly half a second, depending on the stock under consideration. These results contrast with low-frequency returns, where predictability is much harder to identify but easier to trade once discovered. In fact, in the high-frequency context, predictability is not always exploitable due to technological limitations and market microstructure issues. A related and more challenging question is, therefore, to understand whether the predictability identified in this paper is actually tradable.
The experiments were carried out using specific deep learning architectures. Beyond identifying predictable horizons, we aimed to answer questions related to the models under consideration. In particular, we found strong empirical evidence for using L2 data over L1 data, while, from the results presented in this paper, using even more granular order book information (L3) seems to benefit only universal models. The experiments also highlight the importance of carefully choosing a data representation: models based on the basic order book representation are considerably outperformed by those with order flow or volume inputs. Finally, we found empirical evidence for the presence of universal trading patterns in order book dynamics. Our preliminary results suggest that deep learning architectures may pick up different predictable order book patterns when trained on a pool of stocks instead of a single one. The volume representation has some considerable theoretical advantages, in particular robustness to small perturbations, which might explain its superior predictive ability in the universal setting.
Many further theoretical questions regarding predictability which we believe to be of research interest remain unanswered. First, it would be natural to explore whether predictability in returns can be entirely explained by order book structure or if recurring trading patterns play a relevant role. Next, one could compare the predictive performance of the deep learning architectures discussed in this paper to that of the models based on carefully engineered features considered in Aït-Sahalia et al. (2022). Finally, only Nasdaq-traded stocks were considered in this paper. This is a centralized electronic dealer market on which relatively big companies are listed. It would be interesting to explore whether similar results are obtained for securities traded on exchanges with different market structures. With appropriate experimental setups, we believe the model confidence set procedure used in this paper provides a solid framework to tackle all these questions.
On the practical side, there are some relevant issues one should consider. First, it is essential to note that in this paper we focus on mid-to-mid returns, which are not tradable in practice. When considering specific trading applications, one should thus define the return labels appropriately, for example, working with ask-to-bid returns. Second, in this paper, we use an order book-driven clock. In practice, this means the prediction horizon is intrinsically stochastic, which might be a problem in execution. Another critical aspect that would need to be accounted for in practical high-frequency trading is total execution speed. This consists of order book information latency, model prediction run time, and order submission speed. Zhang et al. (2019) and Kolm et al. (2021) argue that their models (which are also considered in this paper) are sufficiently fast to be used by traders with good connections: see Appendix A.7 for a discussion of the effects of infrastructure latency.
As with any trading application, one must also account for the impact one's own trades will have on the order book. Moreover, if many traders exploit the same inefficiencies, this might erode all predictive power above the lowest tradable latencies. Overall, we believe that while the predictability identified in this paper might not be directly tradable through a standalone strategy, it might still help some market players, such as market makers, to gauge the direction of the market and adjust their quotes accordingly.

A Appendix
A.1 Deep Learning

A.1.1 Deep Neural Network Architectures
When approximating a function $f$ from a set of models $\{f_\theta\}_{\theta \in \Theta}$, such as in the return prediction task introduced in Section 2.2, a common approach is to consider a parametric family of deep neural networks. Theoretically motivated by the universality of such models, this approach does not rely on the functional form of $f$ being correctly specified.
The main idea behind deep neural networks is to lift the input into a higher-dimensional space to extract relevant latent features before projecting onto the output space. Mathematically, a neural network consists of a composition of functions, known as layers, $f_l : \mathbb{R}^{d_{l-1}} \to \mathbb{R}^{d_l}$ for $l = 1, \ldots, L$, where $d_0$ is the dimensionality of the input space and $d_L$ is the dimensionality of the output space, for example $d_L = 3$ in our three-class classification task. A neural network is considered deep if the number of layers $L$ is large.
In the simplest case the layers $f_l$ are "activated" affine transformations, i.e. for input $z_{l-1} \in \mathbb{R}^{d_{l-1}}$,
$$f_l(z_{l-1}) = \sigma_l(W_l z_{l-1} + b_l),$$
where $W_l \in \mathbb{R}^{d_l \times d_{l-1}}$ and $b_l \in \mathbb{R}^{d_l}$ are learnable parameters, also known as weights, and $\sigma_l$ is a non-linear activation function applied element-wise. Neural networks of this form have been shown to be universal for the class of continuous functions (Cybenko, 1989; Hornik et al., 1990), i.e. any continuous function (on a bounded domain) can be approximated arbitrarily well in the supremum norm by a large enough network. Related works have shown the representational benefits of depth: there are functions that deep networks can construct with polynomially many parameters, but which require exponentially many parameters when considering shallow networks (Telgarsky, 2015). More complex and domain-specific forms of layers $f_l$ exist; we briefly introduce convolutional and recurrent/LSTM layers.
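For illustration, the composition of activated affine layers described above can be sketched in a few lines of NumPy. The dimensions below are arbitrary and not those of the models in the paper; a softmax on the final layer maps the output onto the three-class probability simplex.

```python
import numpy as np

def dense_layer(z, W, b, activation):
    """One 'activated' affine layer: f_l(z) = sigma_l(W_l z + b_l)."""
    return activation(W @ z + b)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

rng = np.random.default_rng(0)
dims = [40, 64, 64, 3]  # d_0 = 40 input features, d_L = 3 classes (down/flat/up)
params = [(rng.normal(size=(d_out, d_in)) / np.sqrt(d_in), np.zeros(d_out))
          for d_in, d_out in zip(dims[:-1], dims[1:])]

z = rng.normal(size=dims[0])
for l, (W, b) in enumerate(params):
    last = l == len(params) - 1
    # softmax on the last layer maps the output to class probabilities
    z = dense_layer(z, W, b, softmax if last else np.tanh)
```

After the forward pass, `z` is a probability vector over the three return classes.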
Convolutional layers Convolutional layers are a natural choice when the input is an image, i.e. $z_{l-1} \in \mathbb{R}^{h_{l-1} \times w_{l-1} \times c_{l-1}}$, where $h_{l-1}$ is the height in pixels, $w_{l-1}$ is the width in pixels and $c_{l-1}$ is the number of channels, for example $c_{l-1} = 3$ if the image is in RGB. A convolutional layer is a specific case of a generic dense layer with parameter restrictions that account for adjacencies in the input's structure. It consists of $k_l$ weight kernels, also known as filters, which are convolved with the input image to produce the output. Mathematically, a 2-dimensional convolution with $k_l$ $(n_l \times m_l)$ filters and stride $(s_l \times t_l)$ is described by
$$[f_l(z)]_{i,j,k} = \sigma_l\Bigg( \sum_{a=1}^{n_l} \sum_{b=1}^{m_l} \sum_{c=1}^{c_{l-1}} [W_k]_{a,b,c}\, [z]_{s_l(i-1)+a,\; t_l(j-1)+b,\; c} + b_k \Bigg),$$
where the $W_k \in \mathbb{R}^{n_l \times m_l \times c_{l-1}}$ are the filters, the $b_k \in \mathbb{R}$ are bias terms and $[\cdot]_{i,j,k}$ denotes the $i,j,k$-th entry of a three-dimensional tensor. The output then lives in $\mathbb{R}^{h_l \times w_l \times k_l}$, where $h_l = \lfloor (h_{l-1} - n_l)/s_l \rfloor + 1$ and $w_l = \lfloor (w_{l-1} - m_l)/t_l \rfloor + 1$. While this equation might look a bit daunting, convolutional layers are often easily understood via a graphical depiction, as in Figure 12. Many empirical studies (Krizhevsky et al., 2012; He et al., 2016) have investigated the ability of convolutional layers to extract relevant features from input with grid-like topologies. In Section 2.3, we discuss how this ability might be leveraged to extract relevant features from order books.
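A naive NumPy implementation makes the indexing in the convolution formula concrete. This sketch omits the bias and activation, uses "valid" padding, and the input/filter shapes are arbitrary illustrations (e.g. a 10-level order book image with one channel), not the architectures used in the paper.

```python
import numpy as np

def conv2d(x, filters, stride=(1, 1)):
    """Naive 'valid' 2-D convolution (no bias, no activation).
    x: (h, w, c_in), filters: (k, n, m, c_in); returns (h_out, w_out, k)."""
    h, w, c = x.shape
    k, n, m, _ = filters.shape
    s, t = stride
    h_out = (h - n) // s + 1        # output height: floor((h - n)/s) + 1
    w_out = (w - m) // t + 1        # output width:  floor((w - m)/t) + 1
    out = np.zeros((h_out, w_out, k))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i*s:i*s+n, j*t:j*t+m, :]   # local receptive field
            # inner product of every filter with the patch
            out[i, j, :] = np.tensordot(filters, patch,
                                        axes=([1, 2, 3], [0, 1, 2]))
    return out

x = np.random.default_rng(1).normal(size=(10, 4, 1))     # e.g. 10 levels x 4 columns
f = np.random.default_rng(2).normal(size=(16, 1, 2, 1))  # 16 (1x2) filters
y = conv2d(x, f, stride=(1, 2))                          # -> shape (10, 2, 16)
```

Framework implementations (cuDNN, Keras) compute the same quantity with far more efficient memory layouts.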
Recurrent/LSTM layers When the input has a built-in temporal structure, this can be quite naturally accounted for by recurrent layers. This type of layer retains information over time, discovering temporal dependencies in the data. Recurrent layers may be used with streaming data to obtain online predictions or applied to a whole time series yielding a single output. We focus on the latter case, i.e. assuming the input has a temporal structure of the form $z_{l-1} = (z_{l-1}^{(1)}, \ldots, z_{l-1}^{(T)})$ with $z_{l-1}^{(t)} \in \mathbb{R}^{n_{l-1}}$,
the recurrent layer $z_l = f_l(z_{l-1})$ is given by
$$h_l^{(t)} = \phi_l\big(h_l^{(t-1)}, z_{l-1}^{(t)}\big), \quad t = 1, \ldots, T, \qquad z_l = h_l^{(T)},$$
where $\phi_l : \mathbb{R}^{m_{l-1}} \times \mathbb{R}^{n_{l-1}} \to \mathbb{R}^{m_{l-1}}$ is a parameterized recurrent function and the $h_l^{(t)}$'s are known as hidden states. The most widely used type of recurrent layer is Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997). While other types of recurrent layers may suffer from vanishing or exploding gradients during training, the specific structure of LSTMs largely prevents such problems (Hochreiter et al., 2001). In an LSTM layer each hidden state $h$ is augmented with a memory state $s$, and hence the recurrence becomes $(h_l^{(t)}, s_l^{(t)}) = \phi_l(h_l^{(t-1)}, s_l^{(t-1)}, z_{l-1}^{(t)})$. The equations governing the LSTM recurrence are given by
$$s_l^{(t)} = f^{(t)} \odot s_l^{(t-1)} + i^{(t)} \odot \tanh\big(W_s [h_l^{(t-1)}; z_{l-1}^{(t)}] + b_s\big), \qquad h_l^{(t)} = o^{(t)} \odot \tanh\big(s_l^{(t)}\big),$$
where the input, forget and output gates $i^{(t)}, f^{(t)}, o^{(t)}$ depend on the context, i.e. on $h_l^{(t-1)}$ and $z_{l-1}^{(t)}$:
$$g^{(t)} = \sigma\big(W_g [h_l^{(t-1)}; z_{l-1}^{(t)}] + b_g\big), \qquad g \in \{i, f, o\},$$
with $\sigma$ the logistic sigmoid and $\odot$ element-wise multiplication. The network diagram of an LSTM layer is given in Figure 13. LSTMs summarize relevant context information in the memory cell $s_l^{(t)}$, performing exceptionally well on tasks with a strong temporal dependency (Melis et al., 2017).
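The LSTM recurrence above can be sketched directly in NumPy. Dimensions and the random input sequence are arbitrary illustrations; the final hidden state plays the role of the single output $z_l = h_l^{(T)}$.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(h, s, z, P):
    """One LSTM recurrence step: gates depend on the context (h, input z)."""
    x = np.concatenate([h, z])
    i = sigmoid(P["Wi"] @ x + P["bi"])                   # input gate
    f = sigmoid(P["Wf"] @ x + P["bf"])                   # forget gate
    o = sigmoid(P["Wo"] @ x + P["bo"])                   # output gate
    s_new = f * s + i * np.tanh(P["Ws"] @ x + P["bs"])   # memory cell update
    h_new = o * np.tanh(s_new)                           # new hidden state
    return h_new, s_new

rng = np.random.default_rng(0)
d_in, d_h, T = 8, 16, 5
P = {k: rng.normal(scale=0.1, size=(d_h, d_h + d_in))
     for k in ("Wi", "Wf", "Wo", "Ws")}
P.update({b: np.zeros(d_h) for b in ("bi", "bf", "bo", "bs")})

h, s = np.zeros(d_h), np.zeros(d_h)
for t in range(T):                      # sweep the whole input sequence,
    z_t = rng.normal(size=d_in)         # returning a single summary output
    h, s = lstm_step(h, s, z_t, P)
```

The multiplicative forget gate `f` is what lets gradients flow through `s` over long horizons, mitigating the vanishing-gradient problem mentioned above.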

A.1.2 Training Deep Neural Networks
As discussed in Section 2.2, once we have specified a parametric model for approximating the prediction function $f \approx f_\theta$, we wish to learn the best approximation to $f$ from a training set of observed input-output data $D_{train} = \{(x_i, y_i)\}_{i \in I_{train}}$. To do so, we aim to find the parameter combination $\theta$ which yields predictions closest to the observations:
$$\hat\theta = \arg\min_{\theta \in \Theta} \sum_{i \in I_{train}} L\big(f_\theta(x_i), y_i\big),$$
where $L(\cdot, \cdot)$ is a loss function quantifying the distance between the prediction and the observed value, and $\theta$ contains all network parameters (weight matrices, bias terms, CNN filters, ...). The form of $L$ may be chosen using probabilistic arguments or according to some other task-specific criteria. When the model $f_\theta$ is a deep neural network, the loss landscape $L(\theta) = L(\theta | D_{train})$ may be very complex and finding a "good" minimizer is often a difficult task. The most widely used approach is (some variant of) gradient descent, in which the model parameters are iteratively updated by
$$\theta^{(n+1)} = \theta^{(n)} - \eta\, \nabla_\theta L\big(\theta^{(n)}\big)$$
until a convergence criterion is met. Here $\eta > 0$ is a fixed learning rate, and the weights $\theta^{(0)}$ are chosen according to an appropriate initialization rule. When considering deep neural network architectures, the number of parameters is often very large, and therefore second-order methods are often intractable or computationally infeasible. Many variants of the first-order gradient descent algorithm exist, though. For example, the stochastic gradient descent algorithm updates parameter values based on estimates of the gradient $\nabla_\theta L(\theta^{(n)})$ computed from a random subset (known as a batch) of the training data set. More advanced first-order optimizers consider momentum and/or use adaptive learning rates. In our experiments, we use the Adam optimizer (Kingma and Ba, 2015), which updates the parameters based on adaptive estimates of first and second-order moments:
$$m^{(n+1)} = \beta_1 m^{(n)} + (1 - \beta_1)\, \nabla_\theta L\big(\theta^{(n)}\big), \qquad v^{(n+1)} = \beta_2 v^{(n)} + (1 - \beta_2)\, \nabla_\theta L\big(\theta^{(n)}\big)^2,$$
$$\theta^{(n+1)} = \theta^{(n)} - \eta\, \frac{m^{(n+1)} / (1 - \beta_1^{n+1})}{\sqrt{v^{(n+1)} / (1 - \beta_2^{n+1})} + \epsilon},$$
for parameters $\eta, \beta_1, \beta_2, \epsilon > 0$.
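The Adam update is easily written out explicitly. The sketch below applies it to a toy quadratic loss with an exact gradient; the learning rate and iteration count are illustrative choices, not the training settings used in the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, n, eta=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update from bias-corrected moment estimates (n >= 1)."""
    m = beta1 * m + (1 - beta1) * grad         # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2    # second-moment estimate
    m_hat = m / (1 - beta1 ** n)               # bias corrections
    v_hat = v / (1 - beta2 ** n)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimize the toy loss L(theta) = ||theta||^2
theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
for n in range(1, 1001):
    grad = 2 * theta                           # exact gradient of the toy loss
    theta, m, v = adam_step(theta, grad, m, v, n)
```

In deep learning practice the exact gradient is replaced by a mini-batch estimate, but the update rule is unchanged.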
Remark A.1. Note that, contrary to other model specifications such as linear models, the fitting procedure of deep neural networks does not admit a closed form. This adds a further layer of uncertainty. Any model is subject to the following two sources of uncertainty: one due to model specification (i.e. $f \approx f_\theta$) and one due to parameter estimation uncertainty (i.e. statistical error arising from estimating $\theta$ from a finite sample). Models that do not have closed forms for $\theta$ given observed data are further subject to numerical optimization error in the parameter estimation phase. Sometimes, the optimization procedure may not even converge to a (global) minimum. With deep neural networks, we thus trade off smaller model uncertainty (at least theoretically, due to the universality property) for larger parameter estimation error.
When the task is a classification problem, $f_\theta$ is usually chosen so that it maps to $\{p \in [0,1]^C : \|p\|_1 = 1\}$, where $C$ is the number of classes in the set $\mathcal{C}$. This allows a probabilistic interpretation of the deep neural network outputs, i.e. $f_\theta(x) = \{p_{c,\theta}(x)\}_{c \in \mathcal{C}}$, where $p_{c,\theta}(x) = P(y = c | x, \theta)$. In this case, a natural choice of loss function to use during training is the cross-entropy loss, which can be understood via maximum likelihood arguments: assuming the responses are independent given the features and the distribution of the features is independent of the parameters $\theta$, the negative log-likelihood is
$$L(\theta | D_{train}) = -\sum_{i \in I_{train}} \log p_{y_i, \theta}(x_i).$$
When training samples for the classes are unbalanced, say class $c \in \mathcal{C}$ has many more samples than all other classes, the optimization algorithm tends to get stuck in the trivial minimum given by the Dirac measure at $c \in \mathcal{C}$. One way to mitigate this effect is to re-weight the categorical cross-entropy loss so that the network is less incentivized to move towards the trivial minimum. With appropriate weighting, class-$c$ observations influence the direction of the gradient less, thus reducing the strength of the attraction towards the trivial minimum. Mathematically, one uses the following weighted categorical cross-entropy loss
$$L(\theta | D_{train}) = -\sum_{i \in I_{train}} w_{y_i} \log p_{y_i, \theta}(x_i),$$
where the weights $w_c$ are chosen to be inversely proportional to the class's training set density. For example, in our experiments we set the weight of class $c$ proportional to the reciprocal of its empirical frequency in the training set.

A.2 Model Confidence Sets
The model confidence set procedure introduced by Hansen et al. (2011) provides a formalized statistical inference framework in which to compare competing models. It does not assume that any of the models is the true data generating process but simply aims to identify a set of models that will contain the best model with a given level of confidence, known as the model confidence set (MCS). A model confidence set is therefore analogous to a confidence interval for a parameter. To identify the best models, one must compare the model outputs (either as estimated class probabilities or as forecasts) by selecting an appropriate loss/score function.
Let us denote by $M_0$ the set of models under consideration and by $\{L_{i,w}\}_{w=1}^W$ for $i \in M_0$ the time series of losses. In the MCS framework, for models $i, j \in M_0$ and window $w \in \{1, \ldots, W\}$ we define the relative performance $d_{ij,w} = L_{i,w} - L_{j,w}$. Under a stationarity assumption on the $(d_{ij,w})_{w \ge 1}$, we define the set of best models among those in $M_0$ as
$$M^* = \big\{ i \in M_0 : E[d_{ij,w}] \le 0 \ \text{for all } j \in M_0 \big\}.$$
This is the set we want to identify by using the MCS procedure. The MCS algorithm is then defined as follows.
Definition A.1. Let $\delta_M$ be an equivalence test for the null hypothesis $H_{0,M} : E[d_{ij,w}] = 0$ for all $i, j \in M \subseteq M_0$, and let $e_M$ be an elimination rule.
Step 0. Set $M = M_0$.
Step 1. Test $H_{0,M}$ using $\delta_M$ at level $\alpha$.
Step 2. If $H_{0,M}$ is accepted, set $M^*_{1-\alpha} = M$; otherwise, use $e_M$ to eliminate a model from $M$ and repeat the procedure from Step 1.
Under appropriate assumptions on the equivalence test and elimination rule, the model confidence set $M^*_{1-\alpha}$ has the following asymptotic properties as the number of sample periods $W \to \infty$:
(i) $\liminf_{W \to \infty} P\big(M^* \subseteq M^*_{1-\alpha}\big) \ge 1 - \alpha$;
(ii) $\lim_{W \to \infty} P\big(i \in M^*_{1-\alpha}\big) = 0$ for all $i \notin M^*$.
If, furthermore, the equivalence test and elimination rule satisfy a coherency condition, we have the following finite sample property:
(iii) $P\big(M^* \subseteq M^*_{1-\alpha}\big) \ge 1 - \alpha$.
In Hansen et al. (2011), the authors give multiple practical examples of equivalence tests and elimination rules. We focus on implementing the MCS procedure with tests constructed from t-statistics and bootstrap estimators. This choice of equivalence test and elimination rule has two practical advantages: it does not require estimating a variance-covariance matrix for the time series $\{(d_{ij,w})_{i,j \in M_0},\ w \ge 1\}$, which might be difficult when $|M_0| \approx W$, and it satisfies the coherency condition required for the finite sample property (iii) of the MCS procedure. Let us assume the following holds:
Assumption A.1. For some $r > 2$ and $\gamma > 0$, the process $\{(d_{ij,w})_{i,j \in M_0},\ w \ge 1\}$ is strictly stationary and $\alpha$-mixing of order $-r/(r-2)$, with $\text{Var}(d_{ij,w}) > 0$ and $E|d_{ij,w}|^{r+\gamma} < \infty$ for all $i, j \in M_0$.
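For intuition, the elimination loop with t-statistics and a bootstrap can be sketched on synthetic loss series. This is illustrative only: relative performances are centred at the model average (proportional to the $\bar d_{i\cdot}$ of Hansen et al., 2011), and an i.i.d. bootstrap is used where a real application with serially dependent losses would call for a block bootstrap.

```python
import numpy as np

def mcs(losses, alpha=0.05, B=500, rng=None):
    """Sketch of the MCS elimination loop with T_max t-statistics.
    losses: dict model name -> (W,) array of per-window losses."""
    rng = rng or np.random.default_rng(0)
    models = list(losses)
    while len(models) > 1:
        L = np.stack([losses[m] for m in models])         # (k, W)
        d = L - L.mean(axis=0)            # performance relative to the average
        d_bar = d.mean(axis=1)
        W = L.shape[1]
        idx = rng.integers(0, W, size=(B, W))             # naive i.i.d. bootstrap
        boot = (d[:, idx].mean(axis=2) - d_bar[:, None]).T  # (B, k) centred means
        se = boot.std(axis=0)                             # bootstrap standard errors
        t_obs = d_bar / se
        t_max = t_obs.max()
        p = np.mean((boot / se).max(axis=1) >= t_max)     # bootstrap p-value
        if p >= alpha:                    # equivalence test accepts H_0
            break
        models.pop(int(np.argmax(t_obs)))                 # elimination rule e_M
    return models

rng = np.random.default_rng(1)
losses = {"good":      rng.normal(1.0, 1.0, 500),
          "also_good": rng.normal(1.0, 1.0, 500),
          "bad":       rng.normal(2.0, 1.0, 500)}
surviving = mcs(losses)
```

On this synthetic example the clearly inferior model is eliminated, while the two statistically indistinguishable models typically remain in the confidence set.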
For a set of models $M \subseteq M_0$ with time series of relative performances $\{(d_{ij,w})_{i,j \in M},\ w \ge 1\}$, define the following quantities for all $i, j \in M$:
$$\bar d_{ij} = \frac{1}{W} \sum_{w=1}^W d_{ij,w}, \qquad \bar d_{i\cdot} = \frac{1}{|M| - 1} \sum_{j \in M \setminus \{i\}} \bar d_{ij},$$
respectively measuring the sample relative performance of models $i$ and $j$, and the sample relative performance of model $i$ against all other models in $M$. We can then introduce the following t-statistics for all $i \in M$:
$$t_{i\cdot} = \frac{\bar d_{i\cdot}}{\sqrt{\widehat{\text{Var}}(\bar d_{i\cdot})}},$$
where the $\widehat{\text{Var}}(\bar d_{i\cdot})$ are appropriate estimators of the variances of the $\bar d_{i\cdot}$'s. The associated test statistic is $T_{\max,M} = \max_{i \in M} t_{i\cdot}$. The equivalence test rejects $H_{0,M}$ when $T_{\max,M}$ exceeds its critical value, and the elimination rule is $e_M = \arg\max_{i \in M} t_{i\cdot}$. The asymptotic distribution of the test statistic $T_{\max,M}$ is nonstandard because it depends on a (usually unknown) nuisance parameter $\varrho$ under the null. Knowledge of $\varrho$ is not strictly necessary, since we can consistently estimate the distribution of $T_{\max,M}$ by using a bootstrapping procedure. In this case, the equivalence test compares the observed value of $T_{\max,M}$ with the bootstrap empirical quantile. The resulting procedure preserves the asymptotic properties (i) and (ii) and the finite sample property (iii) under Assumption A.1. For details of the bootstrapping procedure see the Appendix of Hansen et al. (2011).
Remark A.2. The test based on the t-statistic discussed above relies on the fact that the null hypothesis $H_{0,M}$ can be equivalently rewritten as $E[\bar d_{i\cdot}] = 0$ for all $i \in M$.
In this paper, we present results using MCS p-values, which are defined as follows.
Definition A.2. Let $(\delta_M, e_M)$ be the equivalence test and elimination rule associated with an MCS procedure as defined in Definition A.1. The elimination rule defines a decreasing sequence of random sets $M_0 \supset M_1 \supset \ldots$ by successively eliminating models $e_{M_0}, e_{M_1}, \ldots$ Let $p_{H_{0,M_k}}$ denote the p-value associated with the null hypothesis $H_{0,M_k}$ under the equivalence test $\delta_{M_k}$, with the convention that the p-value of the last surviving model is $\equiv 1$. Then for model $i = e_{M_k} \in M_0$ the MCS p-value is defined as $\hat p_i = \max_{j \le k} p_{H_{0,M_j}}$.
Note that the definition of the MCS p-value is such that $i \in M^*_{1-\alpha}$ if and only if $\hat p_i \ge \alpha$. The MCS procedure is quite flexible. It allows one to compare models with each other and to compare them to an unpredictive benchmark model (by including the unpredictive benchmark in $M_0$). When only two models are considered, i.e. $|M_0| = 2$, the MCS procedure reduces to testing the null hypothesis $H_0 : E[d_{12,w}] = 0$, i.e. that the two models have equal expected loss. The model confidence set procedure relies on the assumption that the loss difference time series $\{(d_{ij,w})_{i,j \in M_0},\ w \ge 1\}$ is stationary and $\alpha$-mixing. This means that even when the loss process $\{(L_{i,w})_{i \in M_0},\ w \ge 1\}$ is non-stationary, e.g. model performance is tied to regime changes, the model confidence set procedure might still be applicable.

A.4.1 Features
Regarding the empirical distributions of the order book features, we note the following:
• As expected, order book prices display significant non-stationarity, cf. Figure 14.
• The difference between order book volumes $v^{(l)}_{t,x}$, Figure 15, and volumes $s^{(w)}_{t,x}$, Figure 17, is that in the former we measure the distance from the mid in levels, i.e. the volume is strictly positive, while in the latter we use all price ticks, i.e. also possibly empty ones.
• In Figure 16 we note that order flow is zero most of the time. By definition, we have non-zero flow only when an event affects exactly that level or the whole order book shifts.
• To provide some insight into the distribution of L3 volume features, we plot queue depth statistics in Figure 18.
We note that most queue depths are less than 9 orders long; this justifies the queue depth cutoff of length 10 chosen in our experiments.
• Finally, looking at Figure 15, Figure 17 and Figure 18, one can clearly note the difference between first level/price tick dynamics and those of the rest of the order book: active price discovery leads to lower volumes and shorter queues.

A.4.2 Responses
In Figure 19 we plot the unconditional empirical distributions of the target responses defined in Section 3.2.2 for the first window $w = 1$. Note that the threshold $\gamma_h$ is based on the training data set only. As detailed in Section 4, the train and validation sets are randomly partitioned from a four-week period, i.e. $D_{w,train} \cup D_{w,val}$, and thus are likely to display similar volatility regimes. On the other hand, the test set $D_{w,test}$ corresponds to the following week, cf. Figure 11, and thus may exhibit a different volatility profile. By joining the train and validation distributions we obtain the unpredictive benchmark described in Remark 2.1 and used to define predictability in the experiments in Section 4.
In Figure 20 we explore the dependence structure of the return labels by plotting the previous h-step return $c_{t-h,t}$ against the next h-step return $c_{t,t+h}$. We note that for all horizons $h \in \{10, 20, 30, 50, 100, 200, 300, 500, 1000\}$ the null hypothesis of independence is rejected by a contingency chi-square test. The fact that return labels display some form of persistence has a two-fold interpretation. On one hand, it provides evidence towards the existence of predictability in the data, i.e. against the i.i.d. Efficient Market Hypothesis on which the unpredictive benchmark is based, cf. Remark 2.1. On the other, it suggests that a simple one-step ahead Markovian model of returns may display predictive power, i.e. a model which predicts the next price move $c_{t,t+h}$ conditional on the last return $c_{t-h,t}$ only. In this respect, we note the return label $c_{t-h,t}$ can be computed from the features $(x_{t-T+1}, \ldots, x_t)$ for $T$ sufficiently large. In view of the universal approximation property of neural networks, any such simple model can thus be considered 'nested' in the deep learning models treated in this paper. In Appendix A.5 we explore how well a specific model based on $c_{t-h,t}$ only performs when compared to the more complex deep learning models considered in this paper.
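The contingency chi-square test mentioned above is easily reproduced. The counts below are hypothetical (not the paper's data) and the 1% critical value for 4 degrees of freedom, roughly 13.28, is a standard tabulated constant.

```python
import numpy as np

def chi2_stat(table):
    """Pearson chi-square statistic for independence in a contingency table."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())

# hypothetical counts of (previous label, next label) pairs over {down, flat, up},
# with visibly persistent diagonals
counts = np.array([[400, 250, 350],
                   [260, 480, 260],
                   [340, 260, 400]])
stat = chi2_stat(counts)
# for a 3x3 table, df = (3-1)*(3-1) = 4; the 1% critical value is ~13.28
reject_independence = stat > 13.28
```

With persistent labels the statistic far exceeds the critical value, so independence is rejected, matching the behaviour reported for the empirical data.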
A.5 Simpler predictive models: an empirical auto-regressive specification
In this section, we explore how a simple model performs when compared to the deep learning architectures considered in this paper. We base the model on the simple empirical observation that the target return labels display time-dependence, cf. Appendix A.4.2 and Figure 20. We consider a one-step ahead auto-regressive (AR) model.

A.6 Return definition
The way returns are defined is an often-overlooked topic which nevertheless has a significant effect on model evaluation.
Here we state two alternative definitions to that given in Section 3.2.2, taken from the literature, and discuss their properties. In all cases, the mid-price is treated as the "true" price and returns are defined relative to it. Recall this is not a tradable price.
• In Ntakaris et al. (2018) the authors define the return at horizon $h$ as
$$r_{t,h} = \frac{\frac{1}{h} \sum_{k=1}^{h} m_{t+k} - m_t}{m_t},$$
where $m_t$ denotes the mid-price at time $t$. This definition compares the smoothed mid-price over the next $h$ time steps to the current mid-price. In this case, the return can be equivalently understood as the average mid-to-mid return over the next $h$ time steps. As the horizon $h$ becomes longer, so does the window over which we smooth the returns. This has a couple of practical disadvantages in terms of the predictability questions we wish to explore. First, predictability over $h = 100$ time steps might be due to changes of the mid-price over the first, say, $h = 50$ time steps, and thus it becomes difficult to discuss questions of persistency in predictability. Second, the definition produces correlation between returns at different horizons, $r_{t,h_1}$ and $r_{t,h_2}$, inducing a hidden bias in multi-horizon models.
• In Zhang et al. (2019) and Zhang and Zohren (2021) the authors define the return at horizon $h$ by comparing an average of future mid-prices with an average of past mid-prices,
$$r_{t,h} = \frac{m^+_t - m^-_t}{m^-_t}, \qquad m^-_t = \frac{1}{h} \sum_{k=0}^{h-1} m_{t-k}, \quad m^+_t = \frac{1}{h} \sum_{k=1}^{h} m_{t+k},$$
where $m_t$ denotes the mid-price at time $t$. This definition has similar drawbacks to the previous one. But, additionally, we believe it may suffer from look-ahead bias: for example, if the mid-price has been going up in the past $h$ steps (information which is included in our covariates), then it is more likely the return $r_{t,h}$ will be positive.
The definition of returns used in this paper, cf. Section 3.2.2, does not suffer from the issues identified for these two return specifications.
Remark A.4. The definition of returns intrinsically depends on the clock used to measure time. Note that Kolm et al. (2021), for instance, use a physical time clock with a stock-specific horizon $h$ and simple un-smoothed mid-to-mid returns. With our choice of order book-driven clock, even assuming mid-to-mid trading, there is no way of placing a trade exactly $h$ order book events ahead, i.e. the prediction horizon is random. Smoothing the exit price can thus be understood as averaging out the uncertainty in the execution time. Moreover, in our work we aim to investigate questions regarding structural market predictability. In our setting, smoothing mid-prices leads to better estimates of the true (latent) prices by weakening idiosyncratic noise effects. As pointed out in the conclusions in Section 5, though, in practical trading applications it may be more appropriate to consider a physical time clock and un-smoothed returns.

A.7 Infrastructure latency
In this paper, we explored the predictive value of order book data by assuming instantaneous access to such information. In practice, due to technological constraints, market participants experience varying degrees of delay when engaging with the order book. Let us write $\lambda^{n,view}$ and $\lambda^{n,act}$ for the latencies that market participant $n$ experiences when viewing and acting on the order book respectively. For $x \in \{view, act\}$ we can decompose $\lambda^{n,x} = \lambda^x_0 + \lambda^{n,x}_+$, where $\lambda^{act}_0$ is the time it takes the financial entity running the market, e.g. the Nasdaq, to execute an action on the order book once received and $\lambda^{view}_0$ is the time it takes the market entity to send out the information regarding an order book update. Market participants make large efforts to reduce their idiosyncratic latencies $\lambda^{n,x}_+$ for $x \in \{view, act\}$ by optimizing software and hardware, one prototypical example being co-location. Note that $\lambda^{n,act}_+$ also includes the time needed to take a trading decision which, in our setting, requires a forward pass through a pre-trained predictive model, e.g. deepLOB/deepOF/deepVOL; see Figure 21. When considering practical trading applications one should thus take into account the effect these latencies may have on the definition of returns, cf. Section 3.2.2 and Appendix A.6. For example, denoting by $t = 1, 2, \ldots$
the order book-driven clock considered in this paper, one may want to predict the return between physical times $\tau_t + \lambda^{n,tot}$ and $\tau_{t+h} + \lambda^{n,tot}$, where $\lambda^{n,tot} = \lambda^{n,view} + \lambda^{n,act} = \lambda^{tot}_0 + \lambda^{n,tot}_+$ is the total round-trip latency to view and act once the order book clock ticks, and $\tau : \mathbb{N} \to [0, \infty)$ is the increasing random function mapping the order book clock to physical time. An interesting analysis in this setting would be to study how predictability varies as a function of $\lambda^{n,tot}_+ = \lambda^{n,view}_+ + \lambda^{n,act}_+$. In order to quantify the magnitude of these time lags and better understand their impact, we briefly summarize some infrastructure tests published by market participants.
• As of 2023, the Nasdaq reports the door-to-door speed of its matching engine, i.e. $\lambda^{tot}_0 = \lambda^{act}_0 + \lambda^{view}_0$, to be 'sub-40 µs, with the fastest production implementation at 14 µs' (Nasdaq, 2023). Studies from 2015 report the door-to-door latencies of Brazilian and European futures exchanges to be of the order of a few hundred microseconds (Kirilenko and Lamacie, 2015; Menkveld and Zoican, 2017).
• Leading industry producers of co-located network adapters and feed handlers report the total speed for processing an order book message and sending a trade action, i.e. the tick-to-trade latency $\lambda^{n,tot}_+ = \lambda^{n,view}_+ + \lambda^{n,act}_+$, to be of the order of a couple of microseconds (µs) (Enyx, 2020; CSPi, 2023). This is achieved via the use of specialized hardware known as FPGA cards. These figures assume relatively simple trading decision applications; when working with more sophisticated predictive models, such as the ones considered in this paper, one should account for a couple of hundred microseconds in tick-to-trade latency, see Table III in Zhang et al. (2019) for unoptimized forward-pass run times.
The total latency $\lambda^{n,tot}$ is thus less than a fraction of a millisecond (ms), while the average time elapsed between consecutive price changes in our data ranges from 0.18 to 8.38 seconds depending on the liquidity of the stock, cf. Table 1. The effect of infrastructure latencies on the definition of returns, cf. Section 3.2.2, is thus negligible.
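A back-of-envelope latency budget makes this negligibility concrete. The figures below are the rounded numbers quoted in the sources above; the 200 µs allowance for the model forward pass is an assumption consistent with the run times cited from Zhang et al. (2019).

```python
# back-of-envelope latency budget, in microseconds
matching_engine_us = 40    # Nasdaq door-to-door upper bound (lambda_0^tot)
tick_to_trade_us = 200     # rough allowance for model forward pass + FPGA I/O
total_us = matching_engine_us + tick_to_trade_us   # lambda^{n,tot} ~ 240 us

# average inter-price-change times in the data, in seconds (cf. Table 1)
fastest_stock_s, slowest_stock_s = 0.18, 8.38

# even for the most active stock, total latency is well under 1% of the
# average gap between consecutive price changes
latency_fraction = (total_us * 1e-6) / fastest_stock_s
```

Here `latency_fraction` is on the order of 0.1%, supporting the claim that infrastructure latency barely distorts the return definition at these horizons.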

A.8 Stock selection
Ideally, we would like to work with the same set of stocks as in Kolm et al. (2021). Due to computational limitations, we were able to conduct experiments only on a subset of 10 tickers from the 115 Nasdaq stocks considered in the original paper. We selected 10 stocks with diverse liquidity characteristics, aiming to provide a sufficiently representative sub-sample of the whole set.
To choose the sub-sample, we use the stock characteristics provided in Kolm et al. (2021), Table 6. For each of the liquidity characteristics (Updates, Trades, Price Changes, Spread) we compute a sub-score based on the characteristic's rank. For example, the stock with the most updates is assigned an updates score of 1, while the stock with the fewest updates is assigned an updates score of 0. The characteristic-specific scores are then averaged to obtain a general "liquidity score" for each stock. The 10 chosen stocks correspond to 10 evenly spaced quantiles of the "liquidity score". The liquidity characteristics of the 10 chosen stocks are reported in Table 1.
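The ranking-and-quantile selection just described can be sketched as follows. The characteristic matrix is randomly generated here for illustration, and all characteristics are ranked in the same direction for simplicity; in practice one would orient each characteristic (e.g. spread) so that higher always means more liquid.

```python
import numpy as np

def liquidity_scores(chars):
    """chars: (n_stocks, n_characteristics), larger = more liquid.
    Rank-based sub-scores in [0, 1] are averaged into one score per stock."""
    ranks = chars.argsort(axis=0).argsort(axis=0)   # 0 = smallest value
    sub_scores = ranks / (len(chars) - 1)           # most liquid stock scores 1
    return sub_scores.mean(axis=1)

rng = np.random.default_rng(0)
chars = rng.normal(size=(115, 4))   # e.g. updates, trades, price changes, spread
scores = liquidity_scores(chars)

# pick the stocks closest to 10 evenly spaced quantiles of the score
targets = np.quantile(scores, np.linspace(0, 1, 10))
chosen = [int(np.abs(scores - t).argmin()) for t in targets]
```

This yields a sub-sample spanning the full liquidity spectrum rather than clustering in one liquidity regime.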

A.9 Model details
In the experiments we use the model architectures reported in Table 13 and Table 14. To speed up the training procedure, we use batch normalization (Ioffe and Szegedy, 2015) (with momentum = 0.6) after every convolutional layer and inception block. To prevent overfitting, we use a dropout layer (Srivastava et al., 2014) (with noise = 0.2) positioned after the inception module.
All models are implemented with a float32 precision policy: at every layer, float32 is used as the computation and variable data type. Loss and gradient computations are also carried out in float32 precision.

Figure 1: A sample snapshot of an order book. Ask (resp. bid) volumes are denoted by red (resp. green) bars. The lighter green shaded area represents the change in the order book shape when some of the liquidity at the best bid price is removed.

Figure 2: L1, L2 and L3 representations of the order book in Figure 1.

Figure 3: Core modules of network architectures.

Figure 6: Gray-scale visualization of the volume representation of the order book in Figure 1.

Figure 10: Inception module and LSTM layer with seq2seq decoder for multi-horizon forecasting.

Figure 11: Data set windowing for experiments.

Figure 19: ATVI return labels during the first window of the experiment, i.e. from January 14th, 2019 to February 15th, 2019.

Figure 20: ATVI return labels dependence in the joint training and validation data set during the first window of the experiment, i.e. from January 14th, 2019 to February 15th, 2019.

Table 1: Selected stocks' characteristics, daily averages.
In Section 2.1 we described the basic mechanisms governing electronic exchanges such as the Nasdaq. Every day, trading activity on the Nasdaq alone results in hundreds of thousands of order book updates for each stock, cf. Table 1. To keep track of all events occurring on the exchange and communicate them efficiently to (subscribed) market participants, the Nasdaq uses the TotalView-ITCH protocol. For efficiency, instead of streaming the entire state of the order book after each update, only information on the event changing the order book is sent out to market participants. An example of a (decoded) ITCH message is reported in Table 2; note these are timestamped at nanosecond precision. It is up to each market participant to store the current state of the order book and update it each time a new message arrives.
Table 2: Example message from a LOBSTER message file (Huang and Polak, 2011). This example corresponds to the order book dynamics shown in Figure 1.

Table 3: Example order book update from a LOBSTER order book file. This example corresponds to the order book dynamics shown in Figure 1.

Table 4: MCS p-values of the unpredictive benchmark model for the 10 tickers and 9 horizons under consideration.

Table 6: MCS p-values of the unpredictive benchmark model for the 10 tickers and 9 horizons under consideration when seq2seq models are also considered. When the p-value is low, at least one of the order book-driven models statistically outperforms the unpredictive benchmark, i.e. there is order book-driven predictability according to Definition 4.1. The results in Table 6 are consistent with those in Table 4.
The model considered in Appendix A.5 is a one-step ahead auto-regressive (AR) model: we aim to estimate $p_*(\star) = P(c_{t,t+h} = * \,|\, c_{t-h,t} = \star)$ for $*, \star \in \{\downarrow, =, \uparrow\}$. We use a non-parametric approach, setting $p_*(\star)$ in period $w \in \{1, \ldots, 11\}$ to be the empirical distribution of $c_{t,t+h} = * \in \{\downarrow, =, \uparrow\}$ given that the previous return label was $c_{t-h,t} = \star \in \{\downarrow, =, \uparrow\}$ in the training and validation set $D_{w,train} \cup D_{w,val}$, i.e. the row-normalised matrices displayed in Figure 20. Formally, this is the maximum likelihood estimator of the conditional probabilities $P(c_{t,t+h} = * \,|\, c_{t-h,t} = \star)$ given the data set $D_{w,train} \cup D_{w,val}$. We repeat the experiments of Sections 4.1 and 4.2 with this empirical AR model included in the set $M_0$. The updated results are reported in Table 10 and Table 11 (we only report results at the 99% confidence level, i.e. $\alpha = 0.01$).
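The empirical AR estimator is just a row-normalised contingency matrix of consecutive labels. The sketch below fits it on a synthetic label sequence with built-in persistence (the data and the 50% persistence probability are hypothetical, chosen only to mimic the dependence seen in Figure 20).

```python
import numpy as np

def empirical_ar(prev, nxt, classes=(-1, 0, 1)):
    """MLE of P(c_{t,t+h} = * | c_{t-h,t} = #): the row-normalised
    contingency matrix of consecutive return labels."""
    idx = {c: i for i, c in enumerate(classes)}
    counts = np.zeros((len(classes), len(classes)))
    for a, b in zip(prev, nxt):
        counts[idx[a], idx[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# synthetic label sequence: repeat the last label half the time
rng = np.random.default_rng(0)
labels = [int(rng.choice([-1, 0, 1]))]
for _ in range(5000):
    labels.append(labels[-1] if rng.random() < 0.5
                  else int(rng.choice([-1, 0, 1])))

P = empirical_ar(labels[:-1], labels[1:])
# one-step ahead prediction: the modal class of the relevant row
predict = lambda c_prev: [-1, 0, 1][int(P[{-1: 0, 0: 1, 1: 2}[c_prev]].argmax())]
```

Because the diagonal of `P` dominates, the fitted AR model simply predicts that the last move repeats, which is exactly the kind of baseline the deep learning models must beat.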

Table 11: % of times the model is in the α-MCS when predictability is identified at the level α = 0.01.
While our results suggest the deepOF and deepVOL architectures significantly outperform the simple AR model, a more in-depth study comparing these specifications to the models in Aït-Sahalia et al. (2022) is required to shed light on whether the expressivity of deep learning techniques is the key to producing good predictive models or whether careful feature engineering can be as effective. This analysis is of particular relevance in applications where prediction speed plays a fundamental role, see Appendix A.7.
Note also that Kolm et al. (2021) use a physical time clock with a stock-specific horizon $h$. In their setting, the authors use simple un-smoothed mid-price returns, $r_{t,t+h} = (m_{t+h} - m_t)/m_t$.