Improved ACD-based financial trade duration prediction leveraging LSTM networks and an attention mechanism

The liquidity risk factor of a securities market plays an important role in the formulation of trading strategies. A more liquid stock market means that securities can be bought or sold more easily. As a sound indicator of market liquidity, the transaction duration is the focus of this study. We concentrate on estimating the probability density function $p(\Delta t_{i+1} \mid G_i)$, where $\Delta t_{i+1}$ represents the duration of the $(i+1)$-th transaction and $G_i$ represents the historical information available when the $(i+1)$-th transaction occurs. In this paper, we propose a new ultra-high-frequency (UHF) duration modelling framework that utilizes long short-term memory (LSTM) networks to extend the conditional mean equation of the classic autoregressive conditional duration (ACD) model while retaining its probabilistic inference ability. An attention mechanism is then leveraged to unveil the internal mechanism of the constructed model. To minimize the impact of manual parameter tuning, we adopt fixed hyperparameters during the training process. Experiments on a large-scale dataset demonstrate the superiority of the proposed hybrid models. The added attention layer efficiently highlights the temporal positions in the input sequence that are most important for predicting the next duration.


Introduction
Market liquidity refers to the degree to which an asset can be bought and sold easily at a fair price [1]. In other words, market liquidity can be regarded as the speed at which transactions can be concluded while maintaining a basically stable price [1]. Therefore, market liquidity risk is one of the most common factors considered by security investors, especially high-frequency traders, when building a trading strategy.
With the rapid development of computer storage technology, transaction-by-transaction financial trading data has become accessible to researchers. Let $t_i$ stand for the time at which the $i$-th trade occurs, so that the duration between the $(i+1)$-th and $i$-th trades is $\Delta t_{i+1} = t_{i+1} - t_i$, which directly measures the transaction speed of financial trading. The autoregressive conditional duration (ACD) model proposed by Engle and Russell has been the primary framework for analyzing trading durations of ultra-high-frequency (UHF) data, which are irregularly time-spaced and convey meaningful information [2]. In ACD models, the transaction duration is decomposed into the multiplicative product of two components: the conditional (expected) duration and the unexpected duration. The expected component is the portion of the transaction duration that is linearly conditional on past durations, whereas the unexpected duration is the fraction of the duration beyond what could be predicted from past durations, and is usually characterized by an exponential distribution.
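As a concrete illustration, durations can be obtained directly by differencing consecutive trade timestamps. The following minimal Python sketch assumes a hypothetical numpy array of trade times in seconds; note that zero durations produced by simultaneous trades are commonly removed or aggregated before ACD-style modelling.

```python
import numpy as np

# Hypothetical trade timestamps, in seconds since market open.
trade_times = np.array([0.8, 1.3, 1.3, 2.9, 5.0, 5.4])

# Duration of the (i+1)-th transaction: dt[i+1] = t[i+1] - t[i].
durations = np.diff(trade_times)   # -> [0.5, 0.0, 1.6, 2.1, 0.4] (up to float rounding)
```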
Based on the work of Engle and Russell [2], many studies have tried to improve the ability to capture the relation between the conditional duration and the lagged durations. For example, the logarithmic version of the ACD model was provided in [3], the threshold autoregressive conditional duration model was proposed in [4], the asymmetric autoregressive conditional duration model was put forward in [5], and the smooth transition ACD model and the time-varying ACD model were introduced in [6]. Many other works focus on choosing a suitable distribution to characterize the unexpected duration. The distributions that have been applied to ACD models include the generalized Gamma distribution in [7], the generalized F distribution in [8], the mixture of two exponential distributions in [9], the regime-switching Pareto distribution in [10], and the mixture of an exponential and a generalized beta of type 2 (GB2) distribution in [11]. Like many other statistical models, the ACD family of models requires strong assumptions that are difficult to satisfy in realistic situations [12].
In recent years, machine learning methods have been widely applied to image recognition and natural language processing problems. Compared with traditional statistical models, machine learning methods have looser model assumptions and better generalization ability. The artificial neural network (ANN), inspired by biological neural networks, is one of the most widely used machine learning methods. According to the universal approximation theorem [13], feedforward neural networks can approximate any Borel measurable function to any desired degree of accuracy, provided sufficiently many hidden units with arbitrary squashing functions. Recurrent neural networks (RNNs) are a family of specially designed artificial neural networks capable of extracting temporal information via their cyclic architecture [14].
With the development of optimization techniques and computation hardware, RNNs have recently been widely used in many domains [15]. To solve the vanishing/exploding gradient problem of simple RNNs, Hochreiter and Schmidhuber proposed LSTM networks, which make it possible to utilize a longer sequence of historical information [16]. Although LSTMs have strong fitting ability, they cannot provide probabilistic output, in contrast to ACD family models.
Inspired by the work of Kristjanpoller and Minutolo [17], we propose a new architecture called LSTM-ACD to predict UHF transaction durations by combining ANN networks with the ACD framework. We take a fully data-driven approach to extend the mean equation of the classic ACD model while retaining its probabilistic inference ability. In addition, an attention layer is added to the model to visualize the proposed network and improve its interpretability. The proposed architecture is applied to real-world stock duration datasets. The results show that the proposed model produces more accurate estimation and prediction, outperforming the classic ACD model on mean absolute error and quantile estimation.

The remainder of this paper is organized as follows: Section 2 introduces the methodology in detail, Section 3 contains the experimental design and the corresponding results, and Section 4 concludes the paper and points out possible directions for future research.

Methodology
In this section, the ACD framework is integrated with LSTM networks to form the new LSTM-ACD model for predicting the trading durations of UHF data. The section is organized as follows: Section 2.1 introduces the classic ACD model; Section 2.2 describes the proposed LSTM-ACD architecture in detail; Section 2.3 utilizes an attention mechanism layer to unveil the internal mechanism of the proposed model.

ACD model
A classic ACD model assumes that the durations are conditionally exponentially distributed with a mean that follows an ARMA process [2]. As shown in formula (1), the duration $\Delta t_i$ between the $i$-th and $(i-1)$-th trades is the multiplicative product of $\psi_i$ and $\varepsilon_i$, which represent the expected and unexpected portions of the transaction duration, respectively:

$\Delta t_i = \psi_i \varepsilon_i, \quad \varepsilon_i \sim \mathrm{Exp}(1)$ i.i.d.   (1)

In the conditional mean equation, $\psi_i$ depends linearly on the lagged durations and the lagged terms of itself:

$\psi_i = \omega + \sum_{j=1}^{p} \alpha_j \Delta t_{i-j} + \sum_{k=1}^{q} \beta_k \psi_{i-k}$   (2)

where $p$ and $q$ in formula (2) represent the lag orders.
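For intuition, the following Python sketch simulates an ACD(1,1) process according to formulas (1) and (2); the parameter values `omega`, `alpha` and `beta` are illustrative choices, not estimates from the paper's data.

```python
import numpy as np

def simulate_acd11(n, omega=0.1, alpha=0.1, beta=0.8, seed=0):
    """Simulate n durations from an ACD(1,1) model with Exp(1) errors.

    dt_i  = psi_i * eps_i,                       formula (1)
    psi_i = omega + alpha*dt_{i-1} + beta*psi_{i-1},  formula (2)
    """
    rng = np.random.default_rng(seed)
    psi = np.empty(n)
    dt = np.empty(n)
    psi[0] = omega / (1.0 - alpha - beta)   # start at the unconditional mean
    dt[0] = psi[0] * rng.exponential(1.0)
    for i in range(1, n):
        psi[i] = omega + alpha * dt[i - 1] + beta * psi[i - 1]
        dt[i] = psi[i] * rng.exponential(1.0)
    return dt, psi

durations, expected = simulate_acd11(10_000)
```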
A major limitation of the classic ACD model is the assumption that the variables in the conditional mean equation behave with strict stationarity and linearity, whereas duration sequences are usually nonlinear or non-stationary. Hence, this paper extends the linear conditional mean equation to the nonlinear case using LSTM networks, exploiting the strong fitting ability of deep learning techniques.

LSTM-ACD model
It is generally known that the LSTM cell is able to store information over a longer time range than simple RNNs. As depicted in Figure 1, the information flow propagating across time steps is controlled by three LSTM gates: the forget gate, the input gate and the output gate. The input vector, output (hidden state) vector and cell state vector at time $t$ are denoted as $x_t$, $h_t$ and $c_t$, respectively, and $b$ represents the bias vectors [18]. The operating process of an LSTM cell can be mathematically described as follows:

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
$h_t = o_t \odot \tanh(c_t)$

where $\sigma$ denotes the sigmoid function and $\odot$ denotes element-wise multiplication. As a type of RNN specially designed to avoid the exponentially fast decaying factor, LSTM networks can effectively prevent the gradient vanishing/exploding problem. Due to their ability to learn long-term dependencies, LSTMs are particularly suitable for financial prediction problems. Hence, we conjecture that extending the linear mean equation with an LSTM network will improve the ability to extract long-term dependencies in duration sequences.
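The gate equations above can be written compactly in code. The following numpy sketch implements a single LSTM time step with the four gate pre-activations stacked into one matrix multiplication; the parameter layout is an assumption made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; the four gates are stacked as (f, i, o, candidate).

    Shapes: x_t (m,), h_prev and c_prev (d,), W (4d, m), U (4d, d), b (4d,).
    """
    d = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b   # pre-activations of all four gates
    f = sigmoid(z[:d])             # forget gate
    i = sigmoid(z[d:2 * d])        # input gate
    o = sigmoid(z[2 * d:3 * d])    # output gate
    g = np.tanh(z[3 * d:])         # candidate cell state
    c_t = f * c_prev + i * g       # cell state update
    h_t = o * np.tanh(c_t)         # hidden state (output vector)
    return h_t, c_t
```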
To verify this hypothesis, we take the lagged durations $\Delta t_{i-1}, \ldots, \Delta t_{i-T}$ as the input of an LSTM network and assume that the unexpected component $\varepsilon_i$ follows an exponential distribution with unit mean, so that

$\ln \psi_i = f(\Delta t_{i-1}, \ldots, \Delta t_{i-T})$

where $f$ represents a mapping from the lagged durations to the log conditional duration, realized by the LSTM network. The log likelihood function can be mathematically described as follows:

$\ell = \sum_{i} \left( -\ln \psi_i - \dfrac{\Delta t_i}{\psi_i} \right)$
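A minimal PyTorch sketch of this idea is given below: an LSTM maps the $T$ lagged durations to $\ln \psi_i$, and the exponential log likelihood above becomes the (negated) training loss. The layer sizes are illustrative assumptions, not the configuration used in the experiments.

```python
import torch
import torch.nn as nn

class LSTMACD(nn.Module):
    """Sketch of an LSTM-ACD conditional mean network (illustrative sizes).

    Maps the T lagged durations to log(psi), the log conditional duration.
    """
    def __init__(self, hidden_size=16):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, lagged):               # lagged: (batch, T, 1)
        out, _ = self.lstm(lagged)
        log_psi = self.head(out[:, -1, :])   # use the last hidden state
        return log_psi.squeeze(-1)           # (batch,)

def neg_log_likelihood(log_psi, dt):
    """Negative log likelihood of dt under Exp(mean = psi)."""
    return (log_psi + dt * torch.exp(-log_psi)).mean()
```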

Visualization and promotion by attention mechanism
The attention mechanism was first proposed to improve image processing accuracy by mimicking the perceptual system of human beings [19]. In the work of [20], the attention mechanism was introduced to extend the basic encoder-decoder architecture and enhance interpretability on the task of machine translation. Unlike the sequence-to-sequence modelling in sentence translation, the problem we focus on in this paper is to predict the financial duration one step ahead. The attention weights, which help automatically search for the important hidden states of the sequence-to-one LSTM architecture, can be calculated by the following formulas:

$e_{i-j} = V \tanh(W h_{i-j})$   (11)

$\alpha_{i-j} = \dfrac{\exp(e_{i-j})}{\sum_{k=0}^{T-1} \exp(e_{i-k})}$   (12)

where $h_{i-j}$ represents the hidden state lagged $j$ time steps and $\alpha_{i-j}$ represents the attention weight of $h_{i-j}$. The $W$ and $V$ are parameter matrices in the attention mechanism. By allocating different attention weights to different hidden states, a new context vector

$c_i = \sum_{j=0}^{T-1} \alpha_{i-j} h_{i-j}$   (13)

is produced as the input of a feedforward network for predicting the target variable:

$\hat{y}_i = g(c_i)$   (14)

In this study, the attention layer is integrated with the LSTM to characterize the dynamics of $\ln \psi_i$ in the above-mentioned mean equation of the ACD model. In the proposed Attention-LSTM-ACD model, the LSTM state updates of Section 2.2 are applied to the lagged durations (formulas (15)-(16)), and the output layer maps the attention context vector to the log conditional duration:

$\ln \hat{\psi}_i = W' c_i$   (17)

where $c'_{i-T-1}$ represents the cell state of the LSTM lagged $T+1$ time steps, from which the recursion starts. Figure 2 shows the Attention-LSTM-ACD model in more detail.
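The following PyTorch sketch shows one common way to implement such a sequence-to-one additive attention layer over the LSTM hidden states; the exact parameterization of the paper's attention layer may differ, so this is an illustration under stated assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Sketch of a sequence-to-one additive attention layer.

    Scores each lagged hidden state (formula (11)), softmax-normalizes the
    scores into attention weights (formula (12)), and returns their weighted
    sum as a context vector (formula (13)). `attn_size` plays the role of
    the attention size in formula (11).
    """
    def __init__(self, hidden_size, attn_size):
        super().__init__()
        self.W = nn.Linear(hidden_size, attn_size, bias=False)
        self.v = nn.Linear(attn_size, 1, bias=False)

    def forward(self, h):                        # h: (batch, T, hidden)
        scores = self.v(torch.tanh(self.W(h)))   # (batch, T, 1)
        weights = torch.softmax(scores, dim=1)   # attention weights over lags
        context = (weights * h).sum(dim=1)       # (batch, hidden)
        return context, weights.squeeze(-1)
```

Returning the weights alongside the context vector is what later enables the visualization of the learned lag profile in Section 3.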

Data Characteristics
As the box plots in Figure 3 demonstrate, the transaction durations of each constituent stock of the SZSE 100 Index reveal a very long tail compared with the inter-quartile range. The large amount of data located in the tail indicates the existence of liquidity risk. To further explore the dynamic characteristics of the duration sequences, the averaged autocorrelation function (ACF) and partial autocorrelation function (PACF) coefficients are plotted. As shown in Figure 4, the duration time series exhibit long memory in that both the ACF and PACF coefficients decay very slowly as the lag order increases. Hence, the high complexity of the UHF duration data calls for a forecasting algorithm with strong fitting ability.
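The ACF and PACF coefficients can be computed, for example, with statsmodels; the sketch below uses synthetic exponential data as a stand-in for a stock's duration series.

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

# Synthetic exponential data stands in for one stock's duration series here.
durations = np.random.default_rng(0).exponential(scale=1.0, size=5000)

acf_vals = acf(durations, nlags=50)    # autocorrelation up to lag 50
pacf_vals = pacf(durations, nlags=50)  # partial autocorrelation up to lag 50
# Coefficients that decay slowly across many lags indicate long memory.
```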

Mean absolute error (MAE)
As one of the most widely used metrics, the MAE evaluates the performance of duration prediction directly and can be calculated by the following formula:

$\mathrm{MAE} = \dfrac{1}{N} \sum_{i=1}^{N} \left| \Delta t_i - \Delta \hat{t}_i \right|$   (18)

where $\Delta \hat{t}_i$ denotes the predicted duration of the $i$-th transaction. A smaller MAE means a more precise forecast of the transaction duration.

Performance measure for quantile prediction
To evaluate the forecasting performance at quantile points, we utilize the loss function of the quantile regression minimization problem [21]. Let $\hat{q}_{\tau,i}$ denote the predicted $\tau$-quantile of the $i$-th duration; the corresponding check loss is

$L_\tau = \dfrac{1}{N} \sum_{i=1}^{N} \rho_\tau\left(\Delta t_i - \hat{q}_{\tau,i}\right), \qquad \rho_\tau(u) = u\left(\tau - \mathbf{1}\{u < 0\}\right)$   (19)

and a smaller $L_\tau$ indicates more accurate quantile forecasts.
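Both evaluation metrics are straightforward to compute; the following numpy sketch implements the MAE of formula (18) and the check loss of formula (19), with `dt_true`, `dt_pred` and `q_pred` standing for hypothetical arrays of realized durations, point forecasts and $\tau$-quantile forecasts.

```python
import numpy as np

def mae(dt_true, dt_pred):
    """Mean absolute error, formula (18)."""
    return np.mean(np.abs(dt_true - dt_pred))

def quantile_loss(dt_true, q_pred, tau):
    """Check loss of formula (19): rho_tau(u) = u * (tau - 1{u < 0})."""
    u = dt_true - q_pred
    return np.mean(np.maximum(tau * u, (tau - 1.0) * u))
```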

Experiment models
In Section 2, we created a new framework for one-step-ahead prediction. Based on it, five experiment models are compared in this study: the classic ACD model, the LSTM-ACD model, the LSTM-ACD (M) model, the Attention-LSTM-ACD model and the Attention-LSTM-ACD (M) model.

Training
During the training process, configurations are determined with as few exogenous inputs as possible because of the various drawbacks of manual tuning. We adopt fixed hyperparameters, including the learning rate, the number of neurons in each layer, the batch size and the number of time steps, for every constituent stock of the SZSE 100.

Generation of training sets, validation sets and test sets
As mentioned above, the sample used in this study consists of the 100,000 durations in 2017 for each stock in the SZSE 100 Index. We select the last 30% of the data as the test set, while the remaining data are divided into training and validation sets at a ratio of 8:2.
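A chronological split along these lines might look as follows; the helper below is a hypothetical illustration that keeps the time ordering intact (no shuffling), as required for duration series.

```python
def chronological_split(durations, test_frac=0.3, val_frac=0.2):
    """Split a duration series into train/validation/test sets in time order.

    The last `test_frac` of the data is the test set; the remainder is
    split 8:2 into training and validation sets.
    """
    n = len(durations)
    n_test = int(n * test_frac)
    rest = durations[: n - n_test]
    n_val = int(len(rest) * val_frac)
    train = rest[: len(rest) - n_val]
    val = rest[len(rest) - n_val:]
    test = durations[n - n_test:]
    return train, val, test
```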

Training process
During the experiment, a fixed hyperparameter combination is selected for each model based on the LSTM-ACD framework. Table 1 lists the hyperparameters used in our experiment.
The attention size represents the height of the tensor in formula (11). The initial learning rate is 0.5, and it is reduced by 50% after 1000 training steps. Besides the selection of the hyperparameter combination, the remaining parameters of the proposed hybrid models are learned with the early-stopping technique to avoid over-fitting. We evaluate model performance on the validation set every 100 training steps, and the early-stopping patience represents the number of consecutive evaluations without improvement in the log likelihood calculated on the validation set.
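Putting these pieces together, the training loop might be sketched as follows. Here `model`, `train_loader` and `val_ll` (the validation log likelihood) are assumed to exist, `neg_log_likelihood` was sketched in Section 2.2, the choice of SGD is an assumption, and the patience value is illustrative since the paper's exact value is listed in Table 1.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.5)

best_ll, bad_evals, patience = float("-inf"), 0, 5   # patience is illustrative
step, stop = 0, False
while not stop:                           # loop until early stopping fires
    for lagged, dt in train_loader:
        optimizer.zero_grad()
        loss = neg_log_likelihood(model(lagged), dt)
        loss.backward()
        optimizer.step()
        scheduler.step()                  # halve the learning rate per 1000 steps
        step += 1
        if step % 100 == 0:               # validate every 100 training steps
            ll = val_ll(model)
            if ll > best_ll:
                best_ll, bad_evals = ll, 0
            else:
                bad_evals += 1
                if bad_evals >= patience:  # early stopping
                    stop = True
                    break
```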

Comparison of different models in terms of MAE
The out-of-sample forecasting errors of the five types of experiment models are calculated. Table 2 reports the average MAE on the test sets when the five models are applied to the SZSE 100 Index constituent stocks. The average MAE of the LSTM-ACD (M) model is smaller than that of the classic ACD model, while the remaining three models all perform slightly worse than the classic ACD model. We can also see from Figure 5 that the LSTM-ACD (M) and LSTM-ACD models are both superior to the classic ACD model on more stocks in terms of MAE.
As mentioned above, a uniform hyperparameter combination is chosen when applying the hybrid models. If different hyperparameters were selected for different stocks, the performance of these hybrid models would likely be much better.
In addition, we calculate the MAE of each model with respect to the durations lagged one step, using the following formula (20):

$\mathrm{MAE}' = \dfrac{1}{N} \sum_{i=1}^{N} \left| \Delta \hat{t}_i - \Delta t_{i-1} \right|$   (20)

The results in the third column of Table 3 show that the average $\mathrm{MAE}'$ of the ACD model is significantly smaller than its average MAE, i.e. the classic ACD forecasts stay close to the previous duration. This suggests that the predictions of the other four models based on the LSTM-ACD framework convey more meaningful information.

Attention weights of different lag orders
This section visualizes the Attention-LSTM-ACD model and the Attention-LSTM-ACD (M) model. As can be seen in Table 4 and Figure 6, the weights learned by the attention layer in both models decrease exponentially as the lag order increases. This means that more recent transactions have a more important effect on the current duration, which is consistent with our intuition.
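For reference, a plot like Figure 6 can be produced by averaging the attention weights per lag order over the test set; in the sketch below, a synthetic exponentially decaying profile stands in for the learned weights returned by the attention layer of Section 2.3.

```python
import numpy as np
import matplotlib.pyplot as plt

# A synthetic exponentially decaying profile stands in for the learned
# weights, which would be averaged over the test set per lag order.
lags = np.arange(1, 21)
avg_weights = np.exp(-0.4 * lags)
avg_weights /= avg_weights.sum()

plt.bar(lags, avg_weights)
plt.xlabel("lag order")
plt.ylabel("average attention weight")
plt.show()
```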