A micro-to-macro approach to returns, volumes and waiting times

Price and return are not the only fundamental variables in financial markets: trading volumes also play a very important role. Here we propose a new multivariate model that takes into account price returns, the logarithmic variation of trading volumes and waiting times, the latter understood as the time interval between changes in trades, price, and volume of stocks. Our approach is based on a generalization of semi-Markov chains in which an endogenous index process is introduced. We also take into account the dependence structure between the above-mentioned variables by means of copulae. The proposed model is motivated by empirical evidence that is known in the financial literature and that is also confirmed in this work by analysing real data from the Italian stock market in the period August 2015 - August 2017. By using Monte Carlo simulations, we show that the model reproduces all these empirical regularities.

stored and processed on modern computers.
A large part of the effort in market microstructure studies has been devoted to understanding, mimicking and predicting the basic empirical regularities observed in the most important financial variables. Special attention has been dedicated to the relation between financial volumes and returns. The majority of the works in this area can be classified within the so-called econometric framework, sometimes also referred to as the macro-to-micro approach. The cornerstone of this approach is to consider the observed price as a collateral effect of an unobservable volatility process to which a noise process transformation is applied, see e.g. [5]. This line of research has flourished during the last two decades, and considerable attention has been dedicated to the problem of irregular spacing in time of observations when dealing with high frequency financial data; the seminal work by [24] and the recent review by [4] provide a wide overview of the subject. Econometricians rapidly turned their attention to multivariate models of logarithmic price returns, volumes and durations (waiting times), the latter understood as the time interval between changes in trades, price, and volume of stocks, see e.g. [27,32,22,39].
Another strand of literature relies on the modelling of directly observable quantities, the so-called micro-to-macro approach, which is philosophically in contrast with the econometric approach. This framework has a long tradition that has its roots in the paper by [1] and also embraces lattice based models, including the popular binomial and trinomial models (see e.g. [9] and [7]). The micro-to-macro approach has undergone a revival in recent years, mainly due to the work of econophysicists who introduced the Continuous Time Random Walk (CTRW) apparatus in the modelling of financial returns, see [31,40,41]. The evolution equation of the CTRW was formulated and it was shown that it can capture non-Markovian effects.
Sometimes the non-Markovian behaviour of stocks has been accommodated by considering a latent Markov process acting as a switching process, as done in [10]. In any case, a viable solution to the non-Markovian problem is given by semi-Markov based models. Semi-Markov processes are the equivalent of CTRW having a non-independent space-time dynamic. They appeared in the fifties in the probability field due to the independent contributions by [29] and [46]. They have been successfully investigated and applied in connection with a very wide range of problems including reliability theory, queuing, stochastic systems and DNA analysis, see e.g. [28], [2], [43,44] and [20,21].
However, only recently has it been recognized that semi-Markov processes (CTRW included) are also unable to reproduce accurately the statistical properties of high-frequency financial data, and a more general solution has been advanced in a series of papers where the concept of weighted-indexed semi-Markov chains (WISMC) appeared, see [15] and [17]. The WISMC model represents a generalization of ordinary semi-Markov processes and proved to be particularly useful for reproducing long-term dependence in stock returns, among other stylized facts of statistical finance. WISMC models were extended to multivariate settings using different strategies and applied to measure the risk of financial portfolios, see [18] and [19]. In the meantime they were also successfully applied to the modelling of financial volumes by [12].
So far, models proposed in the micro-to-macro literature have not been able to provide a unifying approach in which returns, volumes and waiting times are jointly modelled in such a way as to reproduce their known empirical regularities. The contribution of this paper is to present a modelling framework where these three variables are handled simultaneously in a satisfactory and flexible way. In particular, we first conduct a detailed exploratory data analysis with the aim of better understanding the empirical relationships among the considered financial variables. The data analysed are from four Italian stocks for the period August 2015 - August 2017, observed at 1 minute frequency. The WISMC model is presented for the first time in discrete time with a general state space, and is considered as a model for the log-returns and also for the log-volume returns, with different kernels. To achieve the objective of a multivariate model of prices, volumes and waiting times (the triplet process), specific data-driven assumptions are advanced, and the dependence structure between the price and volume return processes, conditional on the waiting time process, is embodied by a copula function on the joint distribution of the modulus of returns and the modulus of volume returns, which exhibit a general dependence structure. The dynamic of the multivariate model is completely characterized by the determination of the kernel of the triplet process. In general, this allows the computation of any financial statistic that can be written as a functional of the kernel of the triplet process. The model is used to compute linear and nonlinear measures of dependence and the joint first passage time distribution function of price and volumes, and it shows the ability to reproduce the probability density functions of both variables as well as the cross- and auto-correlation functions.
The paper is organized as follows. Section 2 provides a statistical analysis of financial data with a particular focus on the relationships among price returns, volume returns and waiting times. Section 3 sets out the marginal models of price and volumes and the multivariate extension by means of copula functions. In this section the kernel of the triplet process is studied under some assumptions that are justified by the data. The section presents also the computation of some financial functions of broad interest. Section 4 illustrates the result of the application to real data and demonstrates the accuracy of the model in reproducing the main empirical regularities observed in financial markets. Section 5 summarizes our contribution and results. All proofs are deferred to the Appendix.

Data analysis
Empirical research on price changes, financial volumes and waiting times has identified some characteristics often called stylized facts [26,36,34,35,3,38,37]. In this section we conduct an exploratory data analysis with the aim of better understanding the empirical relations among the considered financial variables. This analysis will inspire the main assumptions on which our model is built in the next section. The data used are quotes of Italian stocks for the period August 2015 - August 2017 (2 full years) with 1 minute frequency. Every minute, the last price and the cumulated volume (number of transactions) are recorded. For each stock the database is composed of about $2.6 \times 10^5$ volumes and prices. The list of stocks analysed and their symbols are reported in Table 1. From now onward we will use only the codes in the table to identify each stock.
The analysed stocks are chosen to represent different market sectors. According to the Global Industry Classification Standard F is in the industrial sector, ISP is one of the largest banks in Italy (financial sector), TIT is in the telecommunication and TEN in the energy sector. In Figure 1 we show an example, for the stock TIT, of the time series of price S(t) and trading volumes V (t) in the analysed period.
As a first step we analyse the time series and look for the most important statistical features. From prices we build a time series of the price returns defined as r(t) = log(S(t)/S(t − 1)), and from trading volumes we define the log variation (from now onward volume returns) as v(t) = log(V(t)/V(t − 1)), where t is the time variable at one minute frequency. To be sure to use only variations over a one minute period, we exclude from the analysis the variation of both variables from the closing of the stock market on day d to the re-opening on the next trading day d + 1 (we recall that the stock market is open from 9:00 to 17:30 on weekdays).
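As an illustration only, the construction of r(t) and v(t) with the overnight variation excluded could be sketched as follows. The function name `intraday_log_changes` and the toy prices are assumptions made for this example; they are not part of the dataset or of the original analysis.

```python
import math
from datetime import datetime

def intraday_log_changes(timestamps, values):
    """Log variation log(x(t)/x(t-1)) between consecutive observations,
    keeping only pairs that belong to the same trading day."""
    out = []
    for k in range(1, len(values)):
        if timestamps[k - 1].date() != timestamps[k].date():
            continue  # skip the close-to-open (overnight) variation
        out.append((timestamps[k], math.log(values[k] / values[k - 1])))
    return out

# Hypothetical toy data: two one-minute observations on each of two days.
ts = [datetime(2015, 8, 3, 9, 0), datetime(2015, 8, 3, 9, 1),
      datetime(2015, 8, 4, 9, 0), datetime(2015, 8, 4, 9, 1)]
prices = [10.0, 10.1, 10.5, 10.4]
r = intraday_log_changes(ts, prices)
```

The same function applies unchanged to the volume series V(t) to obtain v(t).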
In Table 2 we summarize the descriptive statistics of price returns r(t), while in Table 3 we summarize the descriptive statistics of volume returns v(t).
To better visualize the distributions of both time series we show in Figures 2 and 3 the histograms of r(t) and v(t), respectively, and we also compare them with the best Gaussian fit.
We performed a Jarque-Bera test that rejected the Gaussian distribution for both r(t) and v(t) at 1% significance level.
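For readers who wish to repeat this kind of check, a minimal sketch of the Jarque-Bera statistic is given below. It uses the standard definition JB = n/6 (S² + (K − 3)²/4), with S the sample skewness and K the sample kurtosis, and the asymptotic chi-square law with 2 degrees of freedom, whose survival function is exp(−x/2). The synthetic heavy-tailed sample is an assumption for illustration and is not the data analysed here.

```python
import math
import random

def jarque_bera(x):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)  # chi-square survival function, 2 d.o.f.
    return jb, p_value

random.seed(0)
# Synthetic Laplace-like sample (heavy tails), not real market data.
sample = [random.expovariate(1.0) * random.choice((-1, 1)) for _ in range(5000)]
jb, p = jarque_bera(sample)
```

A p-value below 0.01 corresponds to rejection of the Gaussian hypothesis at the 1% level used in the text.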
One of the most important statistical features of both time series is that their absolute values are long range correlated. We show this in Figures 4 and 5. We also found that there is zero correlation between r(t) and v(t), while a non-zero correlation between |r(t)| and v(t) is present. This result is shown in Table 4, which reports all possible combinations of correlation between r(t) and v(t) and their absolute values, together with the p-values that give the statistical significance of a non-zero correlation. A good model should be able to take all of these properties into account. Another property that we found quite interesting and that should be included in a model is the following: for both time series (r(t) and v(t)) there is a dependence on the waiting time, which is defined as the time it takes for returns to change their values. This can be seen in Figures 6 and 7, where we have plotted the time it takes for a specific value of r(t) (v(t)) to jump to any other value. It can be noticed that waiting times show some dependence on the values of r(t) (v(t)). This empirical evidence was already observed for price returns in [31], and here we highlight it also for volume returns.
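The two kinds of correlation discussed above (the autocorrelation of absolute values at a given lag and the cross-correlation between two series) can be sketched as follows. The helper names and the deterministic toy series, built to have persistent magnitudes and alternating signs, are assumptions for illustration; p-values are not computed in this sketch.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def autocorrelation(x, lag):
    """Correlation between the series and itself shifted by `lag` steps."""
    return pearson(x[:-lag], x[lag:])

# Toy series: magnitudes cluster in blocks, signs alternate at every step.
mags = ([1.0] * 50 + [3.0] * 50) * 2
toy_r = [m if t % 2 == 0 else -m for t, m in enumerate(mags)]
toy_abs = [abs(v) for v in toy_r]

acf_abs = autocorrelation(toy_abs, 1)  # persistent magnitudes
acf_raw = autocorrelation(toy_r, 1)    # alternating signs
```

On real data one would evaluate `autocorrelation` of |r(t)| and |v(t)| over a range of lags, and `pearson` on each of the four combinations reported in Table 4.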
This result is confirmed by contingency tables, where we can see the dependence between r(t), v(t) and the waiting time T. The contingency tables, shown in Table 5, have been obtained by a discretization of both r(t) and v(t) into 5 states. It can be easily noticed from Table 5 that the number of transitions depends on T (for T = 1 there are many more transitions than for T = 2 or T = 3). More specifically, in brackets we give the number of transitions that we would expect for independent processes, where the probability of finding a given number of transitions is simply given by the product of the frequencies of having each variable in that given state. From Table 5 it is clear that the independence hypothesis does not hold for either process. We obtained similar results for all other stocks which, for reasons of space, are not shown here. From the above tables and from all results obtained in this section, we can say that a good real world model of price and volumes should take into account all the stylized facts detected above, which we summarize in the following list:
- distributions of price returns and volume returns are not Gaussian;
- the absolute values of price returns are long range correlated;
- the absolute values of volume returns are long range correlated;
- price returns and volume returns are uncorrelated, while a non-zero correlation between |r(t)| and v(t) is present;
- r(t) and the waiting times influence each other;
- v(t) and the waiting times influence each other.
The majority of them were already known and extensively documented in the financial literature; here they have been confirmed in our dataset. A very interesting summary of those empirical regularities and of their financial implications is given in [6] and in the references therein. The empirical facts in the list are the cornerstones on which the model presented in the next section is built.
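The expected counts under independence mentioned above (the numbers reported in brackets in the contingency tables) can be obtained as the product of the marginal frequencies times the sample size. A minimal sketch follows; the toy pairs (state, waiting time) are invented for illustration and do not come from Table 5.

```python
from collections import Counter

def observed_and_expected(pairs):
    """Observed counts of (state, T) pairs and the counts expected
    if state and waiting time T were independent."""
    n = len(pairs)
    row = Counter(a for a, _ in pairs)  # marginal counts of the state
    col = Counter(b for _, b in pairs)  # marginal counts of the waiting time
    observed = Counter(pairs)
    expected = {(a, b): row[a] * col[b] / n for a in row for b in col}
    return observed, expected

# Hypothetical sample: state 0 mostly with T = 1, state 1 mostly with T = 2.
toy_pairs = [(0, 1)] * 6 + [(0, 2)] * 2 + [(1, 1)] * 2 + [(1, 2)] * 6
obs, exp = observed_and_expected(toy_pairs)
```

A large gap between `obs` and `exp`, as in this toy sample, is the kind of evidence against independence that the text refers to.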

Mathematical Model
In this section we first present the WISMC model that is used as a marginal model for both the price and volume return processes. Subsequently, we extend the mathematical model to a multivariate setting by considering a dependence structure between price, volumes and waiting times (durations) using a copula function.

Weighted-Indexed Semi-Markov Chains
Here, we introduce the discrete-time WISMC model with Borel phase space in relation to the financial problem we are interested in.
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space endowed with a filtration $\mathbb{F} := (\mathcal{F}_n)_{n\in\mathbb{N}}$, on which all upcoming random variables are defined.
Let S(t) be the price of a financial asset at time $t \in \mathbb{N}$. The time varying log return, defined as log(S(t)/S(t − 1)), is usually the main variable under investigation in the financial literature. As commented in the previous section, at the short time scales considered in high-frequency finance, this variable changes value only at an increasing sequence of times $\{T^J_n\}_{n\in\mathbb{N}}$, the so-called jump times of the asset price process.
At the times $\{T^J_n\}_{n\in\mathbb{N}}$, the logarithmic return process assumes different values denoted by $\{J_n\}_{n\in\mathbb{N}}$, and along any waiting time $X_n := T^J_{n+1} - T^J_n$ it does not change value and remains constant. Thus, $J_n$ is the value of the logarithmic change in price at its n-th transition. Let us assume that at the current time, say $t_0 = 0$, we have at our disposal a set of past data consisting of two vectors of observations collecting the last m + 1 visited states of the log-return process and the corresponding transition times, respectively, i.e.
Consider also an index process: where The process $I^J_n(\lambda)$ can be interpreted as an accumulated reward process with the function $f^\lambda$ as a measure of the weighted rate of reward per unit time. The parameter λ is a memory parameter that should be calibrated on the data. A specific calibration procedure will be discussed in the application (Section 4). It should also be remarked that the index process considered in this paper is slightly more general than those considered in previous research articles, because we added the term $f^\lambda(J_n, T^J_n, T^J_n)$, which adds to the index process the score deriving from observing the current log-return state $J_n$ at the present time $T^J_n$. Introduce the counting process $N^J(t) := \max\{n \in \mathbb{N} : T^J_n \le t\}$, and let us now introduce the notion of weighted-indexed semi-Markov chains.
, called the indexed semi-Markov kernel, such that $\forall n \in \mathbb{N}$ the following equality holds true: (2)
Remark 1 Relation (2) asserts that the knowledge of the values of the variables $J_n$, $I^J_n(\lambda)$ is sufficient to give the conditional distribution of the couple $J_{n+1}$, $T^J_{n+1} - T^J_n$, whatever the values of the past variables might be. Therefore, to assess the probability of the next value of the log-return process and of the time at which the process is going to change state, we need only the knowledge of the last state of the log-return and the last value of the index process.

Remark 2
The function $Q^J(i, x; j, t) := \sum_{s \le t} q^J(i, x; j, s)$ satisfies the following properties: a) $Q^J(i, x; j, \cdot)$ is a nondecreasing discrete real function such that is a Markov transition probability function from $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ to itself.

Remark 3 If the indexed semi-Markov kernel is constant in the index value x, then it degenerates into a semi-Markov kernel and the WISMC model becomes equivalent to the classical semi-Markov chain model, see e.g. [30] and [11].
The triplet $\{J_n, T^J_n, I^J_n(\lambda)\}$ describes the system at any jump time $T^J_n$. However, it is also important to describe the system at any time t, which can be a jump time ($t = T^J_n$) or not ($t \ne T^J_n$). The random process $Z^J(t) := J_{N^J(t)}$ introduced in Definition 1 marks the log-return at any time t, while the backward recurrence time process denotes the time elapsed since the last transition. In our model this information is not sufficient to completely characterize the status of the system, because we also need to know the value of the index process. To this end we extend the definition of the index process to any time $t \in \mathbb{N}$ as follows: The following definition and result, which reduce the complexity of the model, are important for the practical application of the WISMC model.
. . , n} be a sequence of states and corresponding transition times, i.e. $i_\alpha \in \mathbb{R}$, $t_\alpha \in \mathbb{Z}$,
From an intuitive point of view, the shift operator, when applied to a trajectory $(i, t)^{n+1}_{-m}$, gives back a new trajectory where the sequence of visited states is the same as in the input trajectory, with the difference that transition times are translated by $t_{n+1}$ time units backward and the number of transitions is set one unit backward.
The following assumption concerning the score function $f^\lambda$ will be needed in the rest of the article:
Lemma 1 For a WISMC with score function $f^\lambda$ that satisfies assumption A1, for fixed arbitrary state j and time t and $(i, t)^{n+1}_{-m} \in \Theta^{n+1}_{-m}$, we have:
Proof See the appendix.
The result presented in Lemma 1 focuses on a class of score functions that make probability (4) independent of n. Accordingly, the WISMC inherits a homogeneity property that is particularly useful for the applications of the model. Throughout this article, we consider homogeneous WISMC only.
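To make the time-indexed description concrete, the counting process $N^J(t) = \max\{n : T^J_n \le t\}$, the state process $Z^J(t) = J_{N^J(t)}$ and the time elapsed since the last transition can be read off a trajectory as in the sketch below. The function name `state_at` and the toy trajectory are assumptions for illustration; the index process is deliberately left out because its formula depends on the chosen score function.

```python
from bisect import bisect_right

def state_at(jump_times, states, t):
    """Return (N(t), Z(t), B(t)) for an increasing list of jump times:
    number of the last transition, current state, time since last jump."""
    n = bisect_right(jump_times, t) - 1  # N(t) = max{n : T_n <= t}
    if n < 0:
        raise ValueError("t precedes the first observed jump time")
    return n, states[n], t - jump_times[n]

# Hypothetical trajectory: jump times T_0..T_3 and the values taken there.
toy_times = [0, 3, 4, 9]
toy_states = [0.01, -0.02, 0.0, 0.01]
n5, z5, b5 = state_at(toy_times, toy_states, 5)
```

At a jump time the elapsed time is zero, e.g. `state_at(toy_times, toy_states, 3)` gives transition 1 with state -0.02.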
In this research we also consider financial volume as an important variable worth investigating. The WISMC model was applied to the modelling of financial volumes in a recent article by [12] and proved able to reproduce several statistical properties of volumes at high-frequency scales. In order to distinguish between the WISMC model for returns and that for volumes, we introduce an additional notation for the volume model. Precisely, if V(t) is the volume of a financial asset at time $t \in \mathbb{N}$, the time varying log volume is defined as log(V(t)/V(t − 1)). At short time scales this variable changes value at an increasing sequence of times $\{T^V_n\}_{n\in\mathbb{N}}$, the so-called jump times of the asset volume process. At the times $\{T^V_n\}_{n\in\mathbb{N}}$, the logarithmic volume process assumes different values denoted by $\{J^V_n\}_{n\in\mathbb{N}}$. We introduce the index process for the volume by replacing in formula (1) the variables $J_n$, $T^J_n$ and λ by $V_n$, $T^V_n$ and γ, respectively. The semi-Markov kernel for the volume process will be denoted by $q^V = q^V(i, x; j, t)$, and the WISMC process for the volume variable is defined by

The multivariate model
In this section we extend the WISMC model to a multivariate setting in such a way that it is able to describe jointly the time evolution of the three considered variables: log-returns, log-volumes and waiting times. The extension is done by advancing a series of assumptions that allow us to merge the WISMC kernel of the log-return process and that of the log-volumes into a new kernel that is completely characterized in this section. The first step in the joint modelling of returns, volumes and durations is to synchronize the time events of the returns and volumes. In order to do so, let us start from the two sequences of states and jump times introduced above for returns and for volumes. They mark the values and points in time where log-returns and log-volumes change states, respectively. First, we define a new sequence of transition times:
Relation (6) means that we consider the union of the sets of transition times of returns and volumes, and the obtained ordered sequence of times is denoted by $\{\tilde T_n\}_{n\in\mathbb{N}}$. Intuitively, the time $\tilde T_1$ is the first time when a change in the returns or in the volumes occurred, $\tilde T_2$ is the second point in time when a change of state of either of the two processes $J_n$ and $V_n$ occurred, and so on. The corresponding inter-arrival times are denoted by $\tilde X_n$. Furthermore, we define the corresponding values of the returns and volumes for each time of the random sequence $\tilde T_n$ according to the following relations:
Thus, we end up with three variables $(\tilde J_n, \tilde V_n, \tilde T_n)$ that denote the synchronized sequences of log-returns, log-volumes and transition times. In order to advance a joint model for this three-variate process we need to specify some properties concerning their interdependence and dynamics.
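The synchronization step can be illustrated with a short sketch. It takes the union of the two sets of jump times, orders it, and attaches to each synchronized time the value currently held by each process. That last rule (the most recent value at or before the synchronized time) is an assumption consistent with the piecewise-constant description of the processes; the function name and the toy sequences are likewise hypothetical.

```python
from bisect import bisect_right

def synchronize(tj, j, tv, v):
    """Merge return jumps (tj, j) and volume jumps (tv, v) into one
    ordered sequence of times with the values held by both processes."""
    start = max(tj[0], tv[0])  # both processes must already be observed
    times = [t for t in sorted(set(tj) | set(tv)) if t >= start]
    j_sync = [j[bisect_right(tj, t) - 1] for t in times]  # last return value
    v_sync = [v[bisect_right(tv, t) - 1] for t in times]  # last volume value
    waits = [b - a for a, b in zip(times, times[1:])]     # inter-arrival times
    return times, j_sync, v_sync, waits

# Hypothetical jump times and values of the two marginal processes.
t_sync, j_sync, v_sync, x_sync = synchronize(
    [0, 2, 5], [0.1, -0.1, 0.2],
    [0, 3, 5, 6], [1, 2, 3, 4])
```

In this toy case the synchronized times are 0, 2, 3, 5, 6: at time 3 only the volume changes, so the return keeps the value it took at time 2.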
Suppose the following conditional independence relation, namely assumption A2, holds true:
The following information sets are introduced for notational convenience:
Probability (7) is important enough to merit a formal definition.
Definition 3 Let $(\tilde J_n, \tilde V_n, \tilde T_n)$ be the synchronized triplet process of log-return, log-volume and transition times. The function $q^{JV} = q^{JV}(A^{JV}_{n,s}; j, a, t)$ is called the kernel of the triplet process.
The kernel of the triplet process can be factorized into the product of the conditional joint distribution of log-returns and log-volumes and the conditional distribution of inter-arrival times, i.e.
Our next main task is to give a representation of this kernel in such a way that the dynamic of the joint process $(\tilde J_n, \tilde V_n, \tilde T_n)$ is completely characterized. We shall now consider reasonable data-driven assumptions that permit this computation.
Assumption A3: synchronized waiting time distribution. We assume that the probability distribution of the inter-arrival time is independent of the current time s; this avoids the use of time non-homogeneous probabilistic structures of the process. Moreover, we assume that the waiting time distribution does not explicitly depend on the time spent by log-return and log-volume in their current states, but includes past information through the index processes of returns and volumes. In formula
Assumption A3 implies that the distributional properties of the waiting times in our model can differ according to price and volume movements ($\tilde J_n$ and $\tilde V_n$ values) as well as according to their past behaviour, measured by the index processes $\tilde I^J_n$ and $\tilde I^V_n$. Denote the conditional probability of $\tilde X_n$ by and the corresponding probability mass function by The independence of the cdf of waiting times from the number of transitions n and from the time of the last transition s is assumed in order to avoid unnecessary complications that would have made the model inhomogeneous in time.
The knowledge of the kernel of the triplet process needs also the specification of the conditional joint probability distribution of log-returns and log-volumes. In this respect we propose the following Assumption A4: the conditional joint distribution of modulus of log-returns and log-volumes is given by where C is a Copula-function and the marginal distributions F |J| (i, x, t + b J ; j) and F |V | (v, w, t + b V ; a) are given by Assumption A4 is motivated by the data analysis executed in Section 2, specifically in Table 4 we have shown that the two processes are dependent on each other. Essentially this assumption allows us to consider a dependence structure between the modulus of the log-returns and log-volumes that is managed through the use of any copula function. The copula maps the two marginal distributions F |J| and F |V | into a joint probability distribution function. The quantity F |J| expresses the probability to get the modulus of log-return less or equal to j conditionally on the last value of the variable, corresponding index process, waiting time length and duration in the last visited states. The same interpretation can be given to the quantity F |V | with the only exception that it is related to the modulus of log-volume process.
The F |J| can be evaluated as follows: .

Similar computations give
.
By means of assumptions A3 and A4 we can get information on the joint distribution of modulus of log-returns and modulus of log-volumes. Nonetheless, it is our interest to recover information on the exact values (with signs) of these two variables. This is motivated by the empirical observation that although {|J n |} and {|Ṽ n |} are significantly correlated, {J n } and {Ṽ n } are uncorrelated.
To reach this objective we advance a final assumption, A5: for each $n \in \mathbb{N}$, $\tilde J_n$ and $\tilde V_n$ satisfy the following relations: where $\eta^J_n$ and $\eta^V_n$ are two sequences of i.i.d. random variables with pmf This assumption allows us to get the value of the variables starting from the knowledge of their modulus. Indeed, the variables $\eta^J_n$ and $\eta^V_n$ provide the sign of the variation. Obviously, the parameters $p_J$ and $p_V$ need to be estimated from the data.
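A minimal sketch of how signs might be attached to simulated moduli is given below. It assumes that each sign variable takes the value +1 with probability p and −1 with probability 1 − p, independently at each step; this two-point form is an assumption made for the example, since the pmf is not reproduced here, and the function name `attach_signs` is hypothetical.

```python
import random

def attach_signs(moduli, p_plus, rng):
    """Multiply each modulus by an independent random sign:
    +1 with probability p_plus, -1 otherwise (assumed two-point law)."""
    return [m if rng.random() < p_plus else -m for m in moduli]

rng = random.Random(42)
toy_moduli = [0.001, 0.002, 0.0005, 0.003]
signed = attach_signs(toy_moduli, 0.5, rng)
```

Because the signs are drawn independently of the moduli, any dependence between the two moduli is preserved while the signed variables themselves can remain uncorrelated.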
The next theorem will characterize the kernel of the triplet process.

Financial functions
In this subsection we show how it is possible to compute financial functions of specific interest using the characterization of the kernel of the triplet process given in Theorem 1. Results are confined to marginal distributions of log-returns and log-volumes, correlation structures and joint first passage time distributions. In general, given the kernel, it is possible to compute any type of functional of the kernel.

The one-step marginal distributions of returns and volumes
The first question we are interested in is the determination of the marginal distributions of log-returns and log-volumes. Since the dependence structure has been introduced on the modulus of these variables, the marginal distributions we are looking for do not coincide with those used in the copula, i.e. with $F_{|J|}$ and $F_{|V|}$. Let us consider the problem of finding the marginal distribution of the return process. We proceed by integrating over the volume variable and summing over the duration variable, which gives $\sum_{t \ge 0} q^{JV}(A^{JV}_{n,s}; j, \infty, t)$. Thus, for $j > 0$, using the kernel representation (14) and the facts that $F_{|V|}(v, w, t + b_V; \infty) = 1$ and that $C\big(F_{|J|}(i, x, t + b_J; j), F_{|V|}(v, w, t + b_V; \infty)\big) = F_{|J|}(i, x, t + b_J; j)$, we obtain the following sequence of equalities: This marginal distribution expresses the probability of observing at the next transition, executed at any future time t, a return not greater than j. Symmetric arguments can be used to get the marginal distribution of volumes, which results in

Dependence measures
The kernel of the triplet process (8) completely describes the dependence structure between returns and volumes and waiting times. Nevertheless, it is relevant to measure this dependence using classical indicators of linear and nonlinear dependence.
The most widely studied measure of linear dependence is the correlation coefficient. Let $\rho_{A^{JV}_{n,s}}(|\tilde J_{n+1}|, |\tilde V_{n+1}|)$ be the correlation coefficient between the modulus of returns and the modulus of volumes at the next transition, unconditionally on the time when the next transition will happen, i.e.
Using the formula discussed above, we can calculate the correlation coefficient by using the joint probability density function of (|J n+1 |, |Ṽ n+1 |) conditional on the information set A JV n,s . For every j, a ≥ 0, one gets: Consequently the density can be obtained by derivation of the cumulative distribution function, i.e.
Accordingly we get
This allows us to recover $\rho_{A^{JV}_{n,s}}(|\tilde J_{n+1}|, |\tilde V_{n+1}|)$ once the standard deviations $\sigma_{A^{J}_{n,s}}(|\tilde J_{n+1}|)$ and $\sigma_{A^{V}_{n,s}}(|\tilde V_{n+1}|)$ are known. They can be obtained from the univariate densities $f_{|\tilde J_{n+1}|}(\cdot)$ and $f_{|\tilde V_{n+1}|}(\cdot)$, which in turn can be obtained by integration of the joint density.
It is also interesting to compute the covariance function between the modulus of log-returns and the log-volumes at next transition, i.e.
, then the modulus of log-returns and log-volumes are uncorrelated at the next transition.
One may also be interested in providing nonlinear measures of dependence between random variables. Mutual information, which goes back to [42], possesses relevant properties that make it a suitable measure of nonlinear dependence, see e.g. [23]. It is simple to express the mutual information within our model: where the densities are given in formula (20).
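As a discretized illustration of this measure, the sketch below evaluates the standard mutual information $\sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$ for a joint probability mass function given as a dictionary. Working with a pmf over discretized states instead of the densities of formula (20) is an assumption made to keep the example self-contained, and the two toy joint laws are invented.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint pmf {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p  # marginal of the first variable
        py[y] = py.get(y, 0.0) + p  # marginal of the second variable
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
dependent = {(0, 0): 0.5, (1, 1): 0.5}
mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

The measure vanishes for the independent law and equals log 2 for the fully dependent two-state law, which is the behaviour expected of a dependence measure that also captures nonlinear relations.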

First passage time distributions
The first passage time distribution has attracted a lot of attention in finance. It has been considered under different assumptions about the stochastic processes that describe the asset behaviour. It has been investigated for log-returns described by Ornstein-Uhlenbeck processes (see e.g. [52]) and more recently for generalized semi-Markov models in [15,17,16]. We shall now derive the first passage time distribution for our multivariate model. Let $\tilde M^J_t(\tau)$ be the accumulation factor of the return process in the multivariate model from time t to t + τ. Formally, the accumulation factor can be defined as follows: A similar definition applies for the volume process, i.e.
For $\rho \in \mathbb{R}_+$ and $\psi \in \mathbb{R}_+$, denote the joint first passage time by Thus, $\Gamma_{(\rho;\psi)}$ is the first time when at least one accumulation factor exceeds its own threshold. Denote the corresponding conditional survival function by The definition of the shift operator given in Definition 2 can be easily extended to triplet sequences $(i, v, t)^{n}_{-m}$.
. . , n} be a sequence of returns, volumes and corresponding transition times.
We formulate and prove a theorem which provides an equation for the joint first passage time distribution.
Theorem 2 Let $f^\lambda$ and $g^\gamma$ be the score functions of the index processes relative to the return and volume processes, respectively. For $i_0 \ge 0$ and $v_0 \ge 0$, it results that where For $i_0 < 0$ replace $e^{i_0 t_1}$ with $e^{i_0}$ everywhere in formula (25).
Proof See the appendix.
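On a simulated or observed pair of trajectories, the joint first passage time can also be read off directly, which gives a simple empirical counterpart to the equation of Theorem 2. The sketch below assumes that the accumulation factor over τ steps is the exponential of the sum of the log variations over those steps, in line with the price reconstruction used in the application; the function name and the toy inputs are hypothetical.

```python
import math

def joint_first_passage_time(returns, volume_returns, rho, psi):
    """First step at which the return accumulation factor exceeds rho
    or the volume accumulation factor exceeds psi; None if never."""
    acc_r = acc_v = 0.0
    for tau, (r, v) in enumerate(zip(returns, volume_returns), start=1):
        acc_r += r
        acc_v += v
        if math.exp(acc_r) > rho or math.exp(acc_v) > psi:
            return tau
    return None

toy_r = [0.01, 0.01, 0.01]
toy_v = [0.5, 0.0, 0.0]
hit_price_first = joint_first_passage_time(toy_r, toy_v, 1.025, 2.0)
hit_volume_first = joint_first_passage_time(toy_r, toy_v, 1.025, 1.5)
```

Repeating this over many simulated trajectories and counting how often the passage time exceeds a given horizon gives a Monte Carlo estimate of the survival function.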

Application to real high frequency data
To verify the validity of the model described above, we applied it to the database introduced in Section 2. Following [16, ?] we use, as the definition of the function $f^\lambda$ in (1), an exponentially weighted moving average (EWMA) of the squares of $J_n$, which has the following expression: A similar choice is made for the volume return process, leading to the choice of We remark that the functional form of $f^\lambda$ is not obtained through any optimization procedure. One can probably find functional forms that perform better according to some performance measure. Our choice is motivated by its simplicity and by the fact that it gives very good results. Moreover, it is justified by the empirical evidence that the dynamics of price and volume returns depend on the volatility regime.
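To give a feel for an EWMA-type index, the sketch below computes a normalized exponentially weighted average of squared past values, with weight $\lambda^{a}$ for an observation that is a time units old. The exact normalization and the way ages are measured are assumptions made for this example and may differ from the expression used in the model; the function name `ewma_of_squares` is hypothetical.

```python
def ewma_of_squares(values, lam):
    """Normalized EWMA of squared values; `values` holds one entry per
    unit of time, oldest first, and the most recent entry has weight 1."""
    num = den = 0.0
    for age, x in enumerate(reversed(values)):
        w = lam ** age  # older observations are discounted geometrically
        num += w * x * x
        den += w
    return num / den

flat = ewma_of_squares([2.0, 2.0, 2.0], 0.9)
recent_shock = ewma_of_squares([0.0, 1.0], 0.5)
```

With λ close to 1 the index has long memory of past squared returns, while small λ makes it react mostly to the latest values, which is why λ acts as a memory parameter.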
Next step in the application is finding the optimal parameters to be used in the model. We followed the same procedure described in [19] and summarized in the following subsection.

Parameters optimization
We describe here the whole procedure to set and optimize the parameters used in the univariate models.
Table 6 Parameters used in the application to real data.
1. The first step to set the WISMC model is, by using the descriptive statistics of the dataset, to fix a number of states s and a value for the weight parameter λ;
2. Build the trajectory (J n , T {M AP E(s, λ)}.
Notice that the algorithm can stop whenever an increase in the number of states does not decrease the MAPE by more than a given threshold. This procedure should be repeated for all stocks in the portfolio and also for the variable v(t). Once all the parameters for the two univariate models are optimized, a copula is used to build the bivariate model.
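The search over the number of states s and the weight λ, with the stopping rule just described, can be sketched as follows. Everything model-specific is replaced by a hypothetical stand-in: `toy_model_curve` returns a made-up curve instead of a statistic computed from simulated WISMC trajectories, and the quantity compared through the MAPE (here a short decaying curve) is an assumption, since the full list of steps is not reproduced here.

```python
def mape(real, synthetic):
    """Mean absolute percentage error between two curves."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(real, synthetic)) / len(real)

def optimize(real_curve, model_curve, states_grid, lambda_grid, eps):
    """Scan increasing numbers of states; for each one pick the best lambda.
    Stop when adding states improves the MAPE by no more than eps."""
    best = None
    for s in states_grid:
        err, lam = min((mape(real_curve, model_curve(s, l)), l) for l in lambda_grid)
        if best is not None and best[0] - err <= eps:
            break  # gain from more states is too small: keep previous best
        best = (err, s, lam)
    return best

def toy_model_curve(s, lam):
    # Hypothetical stand-in for a statistic simulated from the fitted model.
    return [lam ** k + 0.1 / s for k in range(3)]

toy_real_curve = [1.0, 0.5, 0.25]
best_err, best_s, best_lam = optimize(
    toy_real_curve, toy_model_curve, (3, 5, 7, 9), (0.3, 0.5, 0.7), eps=1.0)
```

In this toy run the scan stops at 9 states because the improvement over 7 states is below the threshold, so 7 states are retained.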

Results
Here we show some of the results obtained and compare them with real data. Using the optimization procedure described above, we found the optimal parameters, which are summarized in Table 6 for the four stocks. The dependence between the two real processes v(t) and r(t) is kept in the model by using a copula function. We tested different copulas (Gaussian, Student's t, Gumbel and Clayton) and found almost no differences in the results. This is mainly because 1-minute price returns are almost discrete and vary in a small range, so in this dataset there is no tail effect. To keep the application as simple as possible, we decided to use a Gaussian copula, which has only one parameter. We simulated, using the estimated kernels and a Gaussian copula, the joint process |r(t)| and v(t), and obtained r(t) by using the relation described in Assumption A5. The results are trajectories with the same time length as the real data for both variables v(t) and r(t).

Table 7 Cross correlation between price and volume returns for simulated data. Columns: ρ(r(t), v(t)), p-value; ρ(|r(t)|, v(t)), p-value; ρ(r(t), |v(t)|), p-value; ρ(|r(t)|, |v(t)|).
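To make the role of the one-parameter Gaussian copula concrete, the sketch below draws dependent uniform pairs and maps them onto two discrete marginals by inverting their cumulative distributions. It is only an illustration: the function names, the state values and the probabilities passed in are hypothetical, and the marginals in the model come from the estimated kernels rather than from fixed probability vectors.

```python
import math
import random
from statistics import NormalDist


def sample_gaussian_copula(rho, n, rng):
    """Return n pairs (u1, u2) of uniforms linked by a Gaussian copula."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    std = NormalDist()
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Second normal with correlation rho to the first one.
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        pairs.append((std.cdf(z1), std.cdf(z2)))
    return pairs


def discrete_quantile(u, values, probs):
    """Invert a discrete CDF: smallest value whose cumulative prob >= u."""
    cum = 0.0
    for value, p in zip(values, probs):
        cum += p
        if u <= cum:
            return value
    return values[-1]  # guards against rounding in the last cumulative sum
```

A pair of dependent discrete states is then obtained by applying `discrete_quantile` to each coordinate of a sampled pair with the corresponding marginal.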
In Table 7 we show the cross-correlation between the synthetic v(t) and r(t) and their absolute values, for all combinations, as done in Table 4. The table shows good agreement with what was found for real data.
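The four correlations reported for signed and absolute series can be computed with a few lines of code. The sketch below uses the Pearson coefficient and omits p-values; the choice of estimator and the names `pearson` and `cross_correlations` are assumptions of this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cross_correlations(r, v):
    """Correlations between r and v and their absolute values."""
    abs_r = [abs(a) for a in r]
    abs_v = [abs(b) for b in v]
    return {
        "r,v": pearson(r, v),
        "|r|,v": pearson(abs_r, v),
        "r,|v|": pearson(r, abs_v),
        "|r|,|v|": pearson(abs_r, abs_v),
    }
```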
The model is also used to compare the first passage time distribution (fptd) of the joint processes v(t) and r(t). From the synthetic variables r(t) and v(t) we build the price and volume variables in the following way: each discrete state of r(t) (v(t)) is associated with a range of variability of the continuous real r(t) (v(t)); inside this range a continuous value is chosen by extracting a random number from a uniform distribution and then inverting the empirical distribution of the real continuous price (volume) returns. Once r(t) (v(t)) is transformed back into continuous values, prices S(t) (volumes V(t)) are obtained as $S(t) = S_0 \exp\left(\sum_{k=1}^{t} r(k)\right)$ ($V(t) = V_0 \exp\left(\sum_{k=1}^{t} v(k)\right)$). Synthetic prices and volumes are then used to build the distribution of the time at which a given threshold is first crossed.
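The two steps of this reconstruction can be sketched as follows. The sketch assumes that, for each discrete state, the real returns that fell into the corresponding range are available as a sorted list, so that inverting the empirical distribution inside the range amounts to picking an empirical quantile at a uniform random level. The names `draw_from_state` and `rebuild_path` are hypothetical.

```python
import math
import random


def draw_from_state(sorted_obs, rng):
    """Draw a continuous return for one discrete state.

    sorted_obs holds the real returns observed in that state's range, in
    ascending order (an assumption of this sketch). A uniform number is
    drawn and the empirical distribution is inverted at that level.
    """
    u = rng.random()
    idx = min(int(u * len(sorted_obs)), len(sorted_obs) - 1)
    return sorted_obs[idx]


def rebuild_path(x0, log_returns):
    """Return [x0 * exp(r_1), x0 * exp(r_1 + r_2), ...] for log-returns r_k."""
    path = []
    cumulative = 0.0
    for r in log_returns:
        cumulative += r
        path.append(x0 * math.exp(cumulative))
    return path
```

The same `rebuild_path` function serves for prices (with the price log-returns) and for volumes (with the logarithmic volume variations).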
To verify whether the price fptd depends on volume values, we estimated the fptd as a function of the initial condition on the discretized volume returns. In this way, for each initial value of v(t), we obtain a different fptd. At the same time, we verified whether the proposed model also preserves this dependence structure. In Figure 8 we show the results and the comparison with real data (for two of the given stocks). We fixed the price increment threshold at 0.5%. It is worth noting that the data are sampled at 1-minute intervals and that it takes 10 to 15 minutes to reach this price change.
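An empirical version of this conditional first passage time can be sketched as below. The sketch assumes a relative price increment threshold (for example 0.005 for 0.5%), one observation per time step, and a sequence of discretized volume-return states aligned with the prices; the function names are hypothetical.

```python
from collections import defaultdict


def first_passage_time(prices, start, threshold):
    """Steps needed for the price to rise by `threshold` from prices[start].

    Returns None if the threshold is never reached within the sample.
    """
    target = prices[start] * (1.0 + threshold)
    for k in range(start + 1, len(prices)):
        if prices[k] >= target:
            return k - start
    return None


def fpt_by_initial_state(prices, volume_states, threshold):
    """Group first passage times by the volume state at the starting time."""
    grouped = defaultdict(list)
    for t0 in range(len(prices) - 1):
        tau = first_passage_time(prices, t0, threshold)
        if tau is not None:
            grouped[volume_states[t0]].append(tau)
    return dict(grouped)
```

The empirical distribution of each list in the returned dictionary gives one fptd per initial volume state, which is the object compared between real and synthetic data.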
From the figure we can see that there is, indeed, a dependence on the initial conditions on volume. Furthermore, the dependence lasts up to 15 minutes, and after this time all distributions converge to similar values. The behaviour differs among stocks, but almost all of them reach the threshold faster when the initial volume return is in the third state. Figure 8 also shows that, although with some differences, the model behaves like the real data: there is a dependence on the initial conditions of the volume return, and it is very similar to that of real data. Finally, in Figure 9 we show the joint first passage time distribution, which represents the first time that both S(t) and V(t) cross a given threshold. Again, the price increment threshold is set at 0.5%, while the volume increment threshold is 100. Overall, we can say that the model is able to capture all the statistical features of real data, keeping all the dependencies between price, volumes and waiting times. Furthermore, we found very good agreement between real data and the model also for the first passage time distribution.
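A joint first passage time on a pair of paths can be sketched as follows. The sketch interprets the joint event through running maxima, i.e. it returns the first time by which both series have reached their levels, not necessarily at the same instant; this interpretation, the use of absolute target levels instead of increments, and the function name are assumptions of the example.

```python
def joint_first_passage_time(prices, volumes, price_level, volume_level):
    """First index at which both running maxima have reached their levels.

    Returns None if at least one of the two levels is never reached.
    """
    max_price = float("-inf")
    max_volume = float("-inf")
    for t, (p, v) in enumerate(zip(prices, volumes)):
        max_price = max(max_price, p)
        max_volume = max(max_volume, v)
        if max_price >= price_level and max_volume >= volume_level:
            return t
    return None
```

Applying this function to many simulated and real paths and collecting the returned times yields the empirical joint distribution.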

Conclusions
In this work we have advanced a new stochastic model, based on the Weighted-Indexed Semi-Markov Chain, for modelling price, volumes and waiting times in high frequency finance. After showing, by analysing real data, the empirical evidence that supports the use of a multivariate model, we defined the probabilistic structure of the model and gave a detailed mathematical implementation. Furthermore, mathematical expressions for the covariance and the first passage time distributions are given. In the last part we showed, by using Monte Carlo simulations, that the model has the same statistical features as real data. In fact, the proposed model is able to reproduce the autocorrelation functions, the dependence between price and volume, and the first passage time distributions. Further developments could include the use of the model in portfolio optimization, the development of risk measures, and volatility forecasting.

Appendix
Proof (of Lemma (1)) Given the information set $(J, T)_{-m}^{n+1} = (i, t)_{-m}^{n+1}$ we can proceed to compute the value of the index process at the $(n+1)$-th transition through formula (1) and assumption A1: … (29) For simplicity of notation, denote by $x$ this value, i.e. $I^{J}_{n+1}(\lambda) = x$. Thus, … and apply the definition of the shift operator to have: … and in turn … Since $s_{n-1-r} = i_{n-r}$ and $k_{n-1-r} = t_{n-r} - t_{n+1}$ it follows that … Accordingly we get … which completes the proof.
Proof (of Theorem (1)) The kernel of the triplet process has been represented in formula (9) as follows: $q^{JV}(A^{JV}_{n,s}; j, a, t) = P[J_{n+1} \le j, \tilde{V}_{n+1} \le a \mid A^{JVT}_{n,s}] \cdot P[X_n = t \mid A^{JV}_{n,s}]$, and from assumption A3 we get $h_{i,v}(x, w; t) = P[X_n = t \mid A^{JV}_{n,s}]$. Thus, it remains to evaluate the conditional probability of the joint distribution of log-return and log-volume. Let us consider the case when $j \ge 0$ and $a \ge 0$ and introduce the notation $F_{|J|}(j)$ and $F_{|V|}(a)$ to denote in a compact form the marginal distributions of the copula.

Proof (of Theorem
By the definition of conditional probability, (40) can be written as … Note that by definition $X_0 := T_1 - T_0$, but since $T_0 = t_0 = 0$, we can replace $T_1$ with the corresponding sojourn time $X_0$. Also note that the event $\{B(u) = u\}$ is equivalent to the event $\{T_{N(u)} = 0, T_{N(u)+1} > u\}$. The latter equality between events means that at least one of returns and volumes made its last transition at time $t_0 = 0$, while the other process made its last transition at some earlier time. Let $b^J$ and $b^V$ generically denote the times since the last transition of the backward recurrence time processes, i.e. …
Besides, note that the information set $(J, \tilde{V}, T)_{-m}^{0} = (i, v, t)_{-m}^{0}$ generates a value of the index process of returns equal to … and of the index process of volumes equal to … By the definition of joint first passage time we have that … where … It is clear that since $i_0 \ge 0$, $v_0 \ge 0$ and $T_1 > t$, the processes $M^{J}_{0}(\tau)$ and $M^{V}_{0}(\tau)$ are both increasing with respect to the variable $\tau$. Accordingly, … A substitution of (48) and (46) in (45) … It remains to compute probability (41). By the law of total probability and by the definition of conditional probability we have the following chain of equalities: … A substitution of (52) in (51) and then of the obtained quantity in (41), together with (50), concludes the proof.