Optimal liquidity-based trading tactics

We consider an agent who needs to buy (or sell) a relatively small amount of an asset over some fixed short time interval. We work at the highest frequency, meaning that we look for the optimal tactic to execute our quantity using limit orders, market orders and cancellations. To solve the agent's control problem, we build an order book model and optimize an expected utility function based on our price impact. We derive the equations satisfied by the optimal strategy and solve them numerically. Moreover, we show that our optimal tactic enables us to significantly outperform naive execution strategies.


Introduction
Most electronic exchanges use an order book mechanism. In such markets, buyers and sellers send their orders to a continuous-time double auction system. These orders are then matched according to price and time priority. Every submitted order has a specific price and size, and the order book is the collection of all submitted and unmatched limit orders. This is illustrated in Figure 1, which shows a classical representation of an order book at a given time. The prices $P^{Ask}_i$ (resp. $P^{Bid}_i$), with $i \geq 1$, are the sellers' (resp. buyers') limit prices; they are ordered increasingly (resp. decreasingly). For a given price $P^{Ask}_i$ (resp. $P^{Bid}_i$), the quantity $Q^{Ask}_i$ (resp. $Q^{Bid}_i$) is the available selling (resp. buying) quantity.
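As an illustration of price and time priority, the following minimal Python sketch matches a buy market order against the ask side of a book. The representation (a list of `[price, deque of order sizes]` levels sorted by increasing price) and the name `match_market_buy` are hypothetical choices for this example, not part of the model used in the paper.

```python
from collections import deque


def match_market_buy(asks, size):
    """Fill a buy market order of `size` against the ask side (hypothetical helper).

    `asks` is a list of [price, deque_of_order_sizes] sorted by increasing price
    (price priority); inside a level, orders are stored in arrival order (time
    priority). The book is modified in place. Returns (price, quantity) fills;
    if the book runs out, the order is only partially filled.
    """
    fills = []
    while size > 0 and asks:
        price, queue = asks[0]
        while size > 0 and queue:
            qty = min(size, queue[0])      # oldest order at the best price first
            fills.append((price, qty))
            size -= qty
            if qty == queue[0]:
                queue.popleft()            # order fully executed
            else:
                queue[0] -= qty            # order partially executed, keeps its place
        if not queue:
            asks.pop(0)                    # level depleted: next price becomes best ask
    return fills
```

For instance, with `asks = [[100.01, deque([3, 2])], [100.02, deque([4])]]`, a buy market order of size 6 fills 3 and then 2 at 100.01, then 1 at 100.02, and leaves `[[100.02, deque([3])]]`.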
In this limit order book setting, we consider the following issue: an agent has to buy or sell a given quantity of asset before a fixed horizon time. During the execution process, the agent can take four elementary decisions:
• Insert limit orders in the order book, hoping to get execution at the best ask/bid (we will assume that the agent does not place limit orders above the best limits).
• Stay in the order book with an already existing limit order, to keep his tactical placement.
• Cancel existing limit orders.
• Send market orders to get immediate execution.
Note that this is the microstructural version of the classical Almgren-Chriss optimal scheduling problem for the liquidation of a large quantity of asset over a time interval $[0, T]$, see [3,7,15], and [10,16] for various extensions. In the setting of [3], $[0, T]$ is split into sub-windows (typically a few minutes each) and one derives the number of shares to be executed in each window. In our case, we want to specify how to act optimally within each window. Indeed, the buyer or seller reacts to every order book move and handles reasonably small quantities over short periods of time.
In order to solve this problem, we of course need to model the order book dynamics. There are essentially two order book modelling approaches in the literature. First, "general equilibrium models", based on interactions between rational agents who take optimal decisions, see [14,30,31]. Second, "statistical models", where the order book is seen as a suitable random process, see [1,2,6,11,12,21,22,32]. Statistical models focus on reproducing many salient features of real markets rather than on individual agents' behaviours and the interactions between them.
In this paper, we use a statistical model. In such models, the arrival and cancellation flows often follow independent Poisson processes. The Poisson assumption allows for the derivation of simple, often closed-form, formulas, for example for the probabilities of various order book events, see [1,12,21,27].
However, as clearly shown in [19], this assumption is not realistic and it is necessary to accurately take into account the local state-dependent behaviour of the order book. Thus, in [19,20], the authors introduce the Queue-Reactive order book model, where order flows follow a Markov jump process. They also provide ergodicity conditions and a calibration methodology for the model parameters. Here, we refine this model to make it compatible with a stochastic control framework, enabling us to solve important practical issues. To do so, we only consider the best bid and ask limits so as to work with a reasonably small state space. Furthermore, to get a truly good fit of the order book dynamics, we focus on the so-called regeneration process, which models the order book state right after the total depletion of a limit. Indeed, in our setting, when one limit is totally depleted, the order book is regenerated in a new state whose regeneration law depends on the order book state just before the depletion. In general, order book models consider several bid and ask limits and use a regeneration process independent of the order book state, see [2,12,19]. Here, we model the order book by a three-dimensional Markov jump process $(Q^1_t, Q^2_t, P_t)$, where $Q^1_t$ is the available quantity at the best bid, $Q^2_t$ is the available quantity at the best ask and $P_t$ is the mid price. Furthermore, we focus on large tick assets (see [13]) and therefore fix the spread as a constant.
In this work, we deal with an optimal execution problem. The question of what a good execution means is not trivial, because it is difficult to define a suitable benchmark price: agents need a benchmark against which to compare the execution price of their own trading strategy. In our work, we place ourselves in a setting where we can define a notion of long term value of the price, $P_\infty = \lim_{t \to \infty} P_t$. We use it as a benchmark since it represents the future value of the asset. In practice, if the agent is able to buy the asset at a price lower than $P_\infty$, he can, in principle, make a profit by selling it back in the future.
Let us now introduce the agent's control problem. We express it for a buy order of size $q_a$ (it can be changed to a sell order in an obvious way). From time zero to the final time $T$, we assume that, at every decision time, the buyer can do nothing or take one of the three following actions: insert the remaining quantity to buy (if not already inserted) at the top of the bid queue (decision l), cancel the already inserted limit orders (decision c) or send a market order (decision m). We suppose that actions involve the whole inventory of the agent: it is not possible to apply one decision to part of the inventory and another to the remaining part. If the agent has not obtained the total execution of $q_a$ at time $T$, he cancels the remaining quantity in the order book and sends a market order. The agent aims at determining the optimal sequence of decisions to outperform the benchmark $P_\infty$.
Let $t$ be the current observation time, $\mu_t$ the agent's control and $I^\mu_t$ the agent's inventory, that is, the remaining quantity he has to buy at time $t$. We view the agent's control $\mu = \{\mu_t, t \leq T\}$ as a process valued in $\{l, c, m\}$ which remains constant when the agent does nothing. Recall that we need to track the agent's inventory since $q_a$ may be partially executed, although the agent never splits $q_a$. The benchmark being $P_\infty$, the agent wishes to make the quantity $\mathbb{E}[P_\infty - P^{Exec,\mu}_{T^\mu_{Exec}}]$ as high as possible, where $P^{Exec,\mu}_t$ is the acquisition price of the quantity $q_a - I^\mu_t$ and $T^\mu_{Exec}$ is the time at which $q_a$ is totally executed. An important point is that, in our setting, our trading has an impact on prices. In particular, trying to make $P^{Exec,\mu}_{T^\mu_{Exec}}$ small means that we want to minimize our transient market impact, that is, the price impact of our trading during the execution. Note that our trading also influences $P_\infty$, which we will be able to compute, and therefore we write $P^\mu_\infty$ instead of $P_\infty$. To take into account the waiting cost and the sensitivity to the price impact, and to work in a slightly more general setting, we consider the following optimisation problem:

where $f : \mathbb{R} \to \mathbb{R}$ is a Lipschitz function and $c$ is a non-negative homogenization constant representing the waiting cost. Note that we use a conditional expectation to account for the fact that agents collect information as they trade.
We study this problem in two cases. First, when the agent's decisions are taken at a fixed frequency $\Delta^{-1}$, which enables us to investigate latency effects and moderately high frequency trading issues. Second, when the agent's decisions can be taken at any time, which handles the situation where one has access to ultra high frequency trading technology.
Note that this paper is obviously not the first work where a stochastic control framework involving limit orders, market orders and cancellations is used to solve a high frequency trading problem. For example, in [23,24] the authors consider the problem of optimal posting of a limit order, while market making issues are addressed in [4,10,17,18]. However, to our knowledge, this is the first approach where the interactions between market participants' decisions, liquidity and the behaviour of the order book are accurately taken into account, see also the complementary paper [5].
The paper is organized as follows. We introduce our order book model in Section 2. In Section 3, we formulate the agent's control problem. Our main results including the computation of P ∞ and the equations satisfied by the value functions are provided in Section 4. Finally, the numerical methodology to solve these equations is given in Section 5. The proofs and additional results are relegated to an appendix.

Order book modelling
In this section, we first confirm on data that agents' behaviour depends on order book liquidity, see [19,24,25] for closely related results. Then, we describe the order book dynamics.

Preliminary: empirical evidence
One specificity of our work is that we wish to carefully model the interactions between market participants and liquidity. We first show on real data that market participants act differently when facing different liquidity conditions.
Database presentation. The data used here are from Bund futures on the Eurex exchange, Frankfurt. We focus on Bund futures since they are a good example of a very liquid, large tick asset. The database records the state of the order book (i.e. available quantities and prices at the best limits) event by event with microsecond accuracy, over one week from 1 to 5 September 2014. For each day, our data cover the time period from 8 a.m. to 10 p.m. Frankfurt time. Each event has a type, a side (i.e. bid/ask) and a size. We consider three types of events: insertion of limit orders, cancellation of existing limit orders and market orders. The database contains 3 407 574 events.
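Let $t$ be the time at which an event happens in the order book. We define the imbalance $\mathrm{Imb}_t$ and the mid price move $\delta$ seconds after the event time $t$, $\Delta P^{mid}_\delta(t)$, by
$$\mathrm{Imb}_t = \varepsilon_t \, \frac{Q^1_t - Q^2_t}{Q^1_t + Q^2_t}, \qquad \Delta P^{mid}_\delta(t) = \varepsilon_t \, \frac{P_{t+\delta} - P_t}{\psi_t},$$
where $Q^1_t$ (resp. $Q^2_t$) is the available quantity at the best bid (resp. ask), $P_t$ is the mid price, $\varepsilon_t$ is the event sign (i.e. $\varepsilon_t = 1$ for a buy order and $-1$ otherwise) and $\psi_t$ is the spread (i.e. $\psi_t = P^{Ask}_t - P^{Bid}_t$, with $P^{Ask}_t$ the best ask price and $P^{Bid}_t$ the best bid price).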
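A minimal Python sketch of how these two quantities could be computed from event-level data. It assumes $\mathrm{Imb}_t = \varepsilon_t (Q^1_t - Q^2_t)/(Q^1_t + Q^2_t)$ and $\Delta P^{mid}_\delta(t) = \varepsilon_t (P_{t+\delta} - P_t)/\psi_t$, a sign convention chosen to match the interpretation of Figure 2.a given below; the dictionary keys and helper names are hypothetical.

```python
from bisect import bisect_right


def mid_at(times, mids, t):
    """Last observed mid price at or before time t (`times` sorted increasingly)."""
    i = bisect_right(times, t) - 1
    if i < 0:
        raise ValueError("t precedes the first observation")
    return mids[i]


def event_features(event, times, mids, delta):
    """Signed imbalance and signed mid price move `delta` seconds later.

    `event` is a dict with hypothetical keys: time, q_bid, q_ask,
    sign (+1 buy, -1 sell), mid and spread. The formulas are assumed forms.
    """
    eps = event["sign"]
    imb = eps * (event["q_bid"] - event["q_ask"]) / (event["q_bid"] + event["q_ask"])
    later = mid_at(times, mids, event["time"] + delta)
    move = eps * (later - event["mid"]) / event["spread"]  # in units of the spread
    return imb, move
```

With prices in ticks, `times = [0.0, 1.0, 130.0]`, `mids = [0.5, 0.5, 1.5]` and a buy event at time 0 with $Q^1 = 30$, $Q^2 = 10$, mid 0.5 and spread 1, `event_features(event, times, mids, 150.0)` returns `(0.5, 1.0)`.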
We want to confirm that agents' decisions depend on the order book liquidity. A simple way to do so is to summarize the state of the order book liquidity through the imbalance. Figure 2.a shows the average imbalance value for each event type. We interpret Figure 2.a in the case of a buy limit/cancellation/market order, since the event sign is taken into account in the expression of $\mathrm{Imb}_t$. We see that market participants insert limit orders when imbalance is negative (execution highly probable), cancel orders when imbalance is positive (less chance to be executed) and use market orders when imbalance is highly positive (rushing for liquidity when it is scarce). Figure 2.b shows the distribution of the imbalance just before a liquidity provision event (i.e. insertion of limit orders) and a liquidity consumption event (i.e. cancellation of limit orders or market orders). We see that agents are highly active at extreme imbalance values (see footnote 4). Indeed, in these cases, they identify a profit opportunity to catch or, on the contrary, an adverse selection effect to avoid (for example, buying just before a price decrease). This is related to the predictive power of the imbalance. As can be seen in Figure 2.c, $\Delta P^{mid}_\delta(t)$ after 2 minutes (i.e. $\delta = 2$ min) is highly correlated with the imbalance. Hence, market participants use the imbalance as a signal to anticipate the next price moves (see footnote 5).
Hence, our empirical results clearly confirm that agents' decisions depend on the order book liquidity.

Footnote 4: The high rate of liquidity provision for very positive imbalance can be surprising at first sight. However, it may be due to orders inserted within the spread, creating a new best limit.
Footnote 5: Quoting Sasha Stoikov: "Imbalance is the least well hidden secret of high frequency trading."
Figure 2: (a) Average imbalance before limit/cancel/market orders; (b) imbalance density before liquidity provision/consumption events.

Order book framework
Let $(\Omega, \mathcal{F}, (\mathcal{F}_t), \mathbb{P})$ be a filtered probability space with $\mathcal{F}_0$ the trivial σ-algebra. The order book state is modelled by the Markov process $U_t = (Q^1_t, Q^2_t, P_t)$, where $Q^1_t$ (resp. $Q^2_t$) is the best bid (resp. ask) quantity and $P_t$ is the mid price. We focus on large tick assets and fix the spread $\psi$ as a constant.
Order book regeneration. When one limit is totally depleted, the order book is regenerated in a new state whose law depends on the order book state just before the depletion and on the depleted side (i.e. best bid/ask). We denote by $d^1_u$ (resp. $d^2_u$) the probability distribution of the regenerated state when the best bid (resp. ask) is totally depleted, with $u \in (\mathbb{N}^*)^2 \times \mathbb{R}$ the order book state before the depletion. For $i \in \{1, 2\}$ and $u \in (\mathbb{N}^*)^2 \times \mathbb{R}$, the regeneration law $d^i_u$ has support in $(\mathbb{N}^*)^2 \times \mathbb{R}$. A simple choice is, for example, the case where the mid price decreases (resp. increases) by one tick when the best bid (resp. ask) is totally depleted and the new best bid and ask quantities are drawn from a fixed stationary distribution, see [11].
The term $\sum_{k \geq q^i} \lambda^{i,-}(u, k)\, d^i_{(q,p);(q',p')}$ represents the order book regeneration when the best bid (or ask) is totally depleted.
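Moreover, without loss of generality, we can initialize the mid price at zero. Then, for every $u = (q^1, q^2, p)$, $u' = (q'^1, q'^2, p')$ and $n \in \mathbb{N}$, we assume the following bid-ask symmetry relations:
$$\lambda^{1,\pm}(u, n) = \lambda^{2,\pm}(u^{sym}, n), \qquad d^1_{u;u'} = d^2_{u^{sym};u'^{sym}}, \tag{1}$$
where $u^{sym} = (q^2, q^1, -p)$.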
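The dynamics of $(Q^1_t, Q^2_t, P_t)$ can be simulated event by event. The sketch below is a minimal Gillespie-type simulation under simplifying assumptions of ours: unit-size orders, constant intensities $\lambda^+$ and $\lambda^-$ identical on both sides (so the symmetry relations hold trivially), and the deterministic regeneration rule used in the numerical experiments of Figures 4 and 5 (after a depletion of the best bid, the mid price drops by one tick, the new best bid is 5 and the new best ask is 3; symmetrically for the ask). In the Queue-Reactive setting, the intensities would instead depend on the state $u$.

```python
import random


def simulate_order_book(q_bid, q_ask, horizon, lam_plus, lam_minus,
                        tick=1.0, regen=(5, 3), seed=None):
    """Simulate (Q1, Q2, P) up to `horizon` (illustrative assumptions only).

    Each side receives unit insertions at rate lam_plus and unit consumptions
    (market orders + cancellations) at rate lam_minus. When a best limit is
    depleted, the mid price moves by one tick and the book is regenerated to
    regen = (quantity on the depleted side, quantity on the other side).
    Returns a list of (time, Q1, Q2, mid) recorded after each event.
    """
    rng = random.Random(seed)
    t, mid = 0.0, 0.0
    path = [(t, q_bid, q_ask, mid)]
    side_rate = lam_plus + lam_minus
    while True:
        t += rng.expovariate(2 * side_rate)        # next event on either side
        if t > horizon:
            return path
        bid_side = rng.random() < 0.5                # both sides have the same total rate
        insertion = rng.random() < lam_plus / side_rate
        if bid_side:
            if insertion:
                q_bid += 1
            elif q_bid > 1:
                q_bid -= 1
            else:                                    # best bid depleted: price down one tick
                mid -= tick
                q_bid, q_ask = regen
        else:
            if insertion:
                q_ask += 1
            elif q_ask > 1:
                q_ask -= 1
            else:                                    # best ask depleted: price up one tick
                mid += tick
                q_ask, q_bid = regen
        path.append((t, q_bid, q_ask, mid))
```

For example, `simulate_order_book(3, 3, horizon=100.0, lam_plus=0.06, lam_minus=0.12, seed=0)` uses the constant rates of Figure 5, with prices expressed in ticks.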

Ergodicity
We now provide a theoretical result on the ergodicity of the process $Q_t = (Q^1_t, Q^2_t)$ under four general assumptions given below. A definition of ergodicity is given in Appendix B.
Assumption 2 ensures that there is no explosion in the system: the order arrival speed stays bounded for any given state of the order book. Using the symmetry relation, we also have $\sum_{n \geq 0} \lambda^{2,+}(u, n) + \lambda^{2,-}(u, n) \leq H$.
For a state $u = (q^1, q^2, p)$ and $i \in \{1, 2\}$, we write $U^{Disc,i,u} = (Q^{1,i,u}, Q^{2,i,u}, P^{Disc,i,u})$ for a random variable with law $d^i_u$.
Assumption 3 (Regeneration bound). There exist three positive constants $C_{Disc}$, $L$ and $z_1 > 1$ such that, for any $u = (q^1, q^2, p) \in (\mathbb{N}^*)^2 \times \mathbb{R}$ and $i \in \{1, 2\}$,

Assumption 3 also prevents explosion, by requiring that the probability of discovering large quantities tends quickly to zero. For example, Assumption 3 is satisfied when there is a quantity $\tilde{Q}_{\max}$ such that, for any $u \in (\mathbb{N}^*)^2 \times \mathbb{R}$ and $i \in \{1, 2\}$, $Q^{1,i,u} \leq \tilde{Q}_{\max}$ and $Q^{2,i,u} \leq \tilde{Q}_{\max}$ almost surely.
Finally, Assumption 4 means that the arrival rate of very large jumps $\lambda^{i,+}(u, n)$ quickly tends to zero when $n$ is large. Assumptions 1, 2, 3 and 4 are close to those used in [20] in a similar setting. We have the following result.
Theorem 1 (Ergodicity). Under Assumptions 1, 2, 3 and 4, and when the arrival rates, the consumption rates and the regeneration distribution do not depend on the mid price, the process $Q_t$ is ergodic (i.e. it converges towards a unique invariant distribution). Moreover, the convergence is geometric:
$$\| P^t_q(\cdot) - \pi \|_{TV} \leq B(q)\, \rho^t,$$
with $\|\cdot\|_{TV}$ the total variation norm, $P^t_q(\cdot)$ the Markov kernel of the process $Q_t$ starting from the initial point $q = (q^1, q^2) \in (\mathbb{N}^*)^2$, $\pi$ the invariant distribution, $\rho < 1$ and $B(q)$ a constant depending on the initial state $q$, see Appendix B.
This theorem is the basis for the asymptotic study of the order book dynamics in Section 2.1, since it ensures the convergence of the order book state towards an invariant probability distribution. Thus, the stylized facts observed on market data can be explained by a law-of-large-numbers type phenomenon for this invariant distribution. The proof of this result is given in Appendix B for the sake of completeness, although it is largely inspired by [19,20].
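Theorem 1 can be checked numerically on a toy version of $Q_t$. The sketch below builds the generator of a finite chain (constant unit-size intensities, queues capped at a hypothetical level `q_max` where insertions are blocked, and the regeneration rule (5, 3) / (3, 5) used in the experiments), solves $\pi \mathcal{Q} = 0$ for the invariant distribution, and computes the law of $Q_t$ by uniformization, a standard technique chosen here to avoid depending on SciPy. The total variation distance to $\pi$ should decrease with $t$, in line with the geometric bound of Theorem 1; the finite grid is an assumption, not the paper's state space.

```python
import numpy as np


def build_generator(q_max, lam_plus, lam_minus, regen=(5, 3)):
    """Generator of a toy (Q1, Q2) chain on {1..q_max}^2 (requires q_max >= max(regen))."""
    states = [(a, b) for a in range(1, q_max + 1) for b in range(1, q_max + 1)]
    index = {s: i for i, s in enumerate(states)}
    Q = np.zeros((len(states), len(states)))
    for (a, b), i in index.items():
        # Consuming the last unit of a limit regenerates the book.
        moves = [((a - 1, b) if a > 1 else regen, lam_minus),
                 ((a, b - 1) if b > 1 else regen[::-1], lam_minus)]
        if a < q_max:
            moves.append(((a + 1, b), lam_plus))
        if b < q_max:
            moves.append(((a, b + 1), lam_plus))
        for target, rate in moves:
            Q[i, index[target]] += rate
            Q[i, i] -= rate
    return states, Q


def stationary(Q):
    """Solve pi Q = 0 with sum(pi) = 1 (least squares on the stacked system)."""
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones((1, n))])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return pi


def law_at(Q, start, t, tol=1e-12, max_terms=100_000):
    """Row `start` of exp(tQ), computed by uniformization."""
    n = Q.shape[0]
    lam = max(-Q.diagonal().min(), 1e-12)
    R = np.eye(n) + Q / lam                  # stochastic matrix of the uniformized chain
    v = np.zeros(n)
    v[start] = 1.0
    weight = np.exp(-lam * t)                # Poisson(lam * t) probability of 0 jumps
    out, cum, k = weight * v, weight, 0
    while cum < 1.0 - tol and k < max_terms:
        k += 1
        v = v @ R
        weight *= lam * t / k
        out = out + weight * v
        cum += weight
    return out


def tv_distance(p, q):
    return 0.5 * np.abs(p - q).sum()
```

A typical use is `states, Q = build_generator(10, 0.06, 0.12)` followed by `tv_distance(law_at(Q, states.index((1, 10)), t), stationary(Q))` for increasing `t`. Since $e^{-\lambda t}$ underflows when $\lambda t$ exceeds roughly 700, `t` should stay moderate.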
3 Optimal tactic control problem

3.1 Presentation of the stochastic control framework
We express the control problem for a buy order of a size q a . It can be changed to a sell order in an obvious way.
Order book dynamics. The order book state is modelled by the process $U^\mu_t = (Q^{Bef,\mu}_t, Q^{a,\mu}_t, Q^{Aft,\mu}_t, Q^{2,\mu}_t, I^\mu_t, P^\mu_t, P^{Exec,\mu}_t)$, where the best bid queue is split into three quantities to take the agent's order placement into account: the quantity $Q^{Bef,\mu}_t$ standing before the agent's order, the agent's order $Q^{a,\mu}_t$ and the quantity $Q^{Aft,\mu}_t$ standing after it. Limit orders posted by the agent have a size equal to the whole inventory, and we do not handle splitting issues where only part of the inventory is inserted in the order book. We make minor changes to the order book dynamics:
• For the best bid, we differentiate the market order consumption rate $\lambda^{1,-}_m$ from the limit order cancellation rate $\lambda^{1,-}_c$. Cancellations consume $Q^{Aft,\mu}_t$ first, and market orders consume $Q^{Bef,\mu}_t$ first.
• The regeneration process of $U^\mu_t$ can be deduced from that of $(Q^{1,\mu}_t, Q^{2,\mu}_t, P^\mu_t)$, which is unchanged. After a regeneration, $Q^{a,\mu}_t = 0$ when the best bid is totally depleted, and it remains unchanged otherwise. Furthermore, the quantity $Q^{Aft,\mu} + Q^{Bef,\mu}$ is given by the regenerated best bid, and the position of $Q^{a,\mu}_t$ is drawn from a distribution $\iota^i_u$ depending on the order book state just before the regeneration and on the depleted side (i.e. the best ask in our case). A natural choice is to set $Q^{Aft,\mu} = 0$ and $Q^{Bef,\mu}$ equal to the new best bid when the price moves, and to keep the quantities $(Q^{Bef,\mu}_t, Q^{a,\mu}_t, Q^{Aft,\mu}_t)$ unchanged when the best ask is depleted with no price move.
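The queue rules above can be written as small update functions acting on $(Q^{Bef}, Q^a, Q^{Aft})$: market orders consume $Q^{Bef}$ first, then the agent's order, then $Q^{Aft}$, which reproduces the fill quantity $\min(q_m - Q^{Bef}_{t^-}, Q^a_{t^-})$ of the trader's inventory paragraph when $q_m > Q^{Bef}_{t^-}$; cancellations consume $Q^{Aft}$ first. This is a hypothetical sketch: falling back on $Q^{Bef}$ once $Q^{Aft}$ is exhausted is our assumption, and the total depletion of the best bid, which triggers a regeneration, is not handled.

```python
def apply_market_order(q_bef, q_a, q_aft, size):
    """Sell market order of `size` hitting the best bid (hypothetical helper).

    Consumes the queue ahead of the agent (Q^Bef) first, then the agent's
    order (Q^a), then the queue behind it (Q^Aft). Returns the new
    (q_bef, q_a, q_aft) and the quantity of the agent's order that was filled.
    """
    from_bef = min(size, q_bef)
    remaining = size - from_bef
    filled = min(remaining, q_a)              # equals min((q_m - Q^Bef)^+, Q^a)
    from_aft = min(remaining - filled, q_aft)
    return q_bef - from_bef, q_a - filled, q_aft - from_aft, filled


def apply_cancellation(q_bef, q_a, q_aft, size):
    """Cancellation of `size` by other participants at the best bid.

    Consumes Q^Aft first; falling back on Q^Bef afterwards is an assumption.
    The agent's own order is never touched.
    """
    from_aft = min(size, q_aft)
    from_bef = min(size - from_aft, q_bef)
    return q_bef - from_bef, q_a, q_aft - from_aft


def fill_cost(filled, mid, spread):
    """Cash paid for `filled` units bought at the best bid price mid - spread / 2."""
    return filled * (mid - spread / 2)
```

For instance, with $(Q^{Bef}, Q^a, Q^{Aft}) = (2, 1, 3)$, a market order of size 4 fills the agent's unit and leaves $(0, 0, 2)$, whereas a market order of size 1 leaves the agent's order untouched.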
The symmetry relation (1) satisfied by $(Q^{1,\mu}_t, Q^{2,\mu}_t, P^\mu_t)$ is unchanged. A detailed description of the infinitesimal generator $\mathcal{Q}^\mu$ of the process $U^\mu_t$ is provided in Appendix C.
Trader's controls. At every decision time, the trader can do nothing or take one of three decisions:
• l: He can insert the quantity $I^\mu$ at the top of the bid queue if it is not already inserted.
• c: He can cancel his existing limit order $Q^{a,\mu}$. By doing so, the trader can wait for a better order book state. This control is essentially used to avoid adverse selection, i.e. obtaining a transaction just before a price decrease.
• m: He can send a market order to get immediate execution.
Thus, the trader's control $\mu = \{\mu_t, t \leq T\}$ is a piecewise-constant càdlàg process valued in $\{l, c, m\}$. If the agent has no order inserted in the order book and does nothing at the beginning, the initial control is c.
Trader's inventory and liquidation price. Since we consider a buyer with an initial inventory $q_a$, we set $I^\mu_0 = q_a$. Let $q_m$ be the size of a market order sent at the best bid at time $t$ by another market participant. When $q_m > Q^{Bef,\mu}_{t^-}$, the quantity $\min(q_m - Q^{Bef,\mu}_{t^-}, Q^{a,\mu}_{t^-})$ of our order is bought at the best bid price $P_{t^-} - \frac{\psi}{2}$. Thus, when $\mu = l$, the dynamics of $Q^{a,\mu}_t$ and $P^{Exec,\mu}_t$ can be written as

with $\Delta X_t = X_t - X_{t^-}$ for any càdlàg process $X$. When a market order is sent (i.e. $\mu = m$), the quantity $I^\mu_{t^-}$ is bought at the best ask. A linear temporary price impact is added when the best ask is not large enough to absorb $I^\mu_{t^-}$. In this case, the dynamics of $Q^{a,\mu}_t$ and $P^{Exec,\mu}_t$ are

where the parameter $\alpha$ represents a linear temporary price impact and $(x)^+ = \max(x, 0)$. Finally, under the control $\mu = c$, we set $Q^{a,\mu}_t = 0$ and keep $I^\mu$ and $P^{Exec,\mu}$ unchanged, since the agent's order is not present in the order book. We add the final time constraint

Optimal control problem. We fix a finite horizon time $T < \infty$ and we want to compute

where
• $u = (q_{bef}, q_a, q_{aft}, q^2, i, p, p_{exec})$ is the initial state of the order book.
• $\Delta P^\mu_\infty = P^\mu_\infty - P^{Exec,\mu}_{T^\mu_{Exec}}$ represents the price impact. We will see that $\mathbb{E}[\Delta P^\mu_\infty \mid \mathcal{F}_{T^\mu_{Exec}}]$ is well defined; an explicit computation of this quantity is presented in the next section.
• $c$ is a non-negative homogenization constant representing the waiting cost, $q_a$ is the order size, and $f : \mathbb{R} \to \mathbb{R}$ is a Lipschitz function.
We solve the agent's control problem in two situations: when decisions are taken at a fixed frequency $\Delta^{-1}$ and when they are taken at any time.

Theoretical results
In this section, we compute $\Delta P^\mu_\infty$, discuss the existence, uniqueness and regularity of the solution of our control problem and give the equations satisfied by the value function.

Computation of $\Delta P^\mu_\infty$
To compute $\Delta P^\mu_\infty$, we replace Assumptions 3 and 4 by slightly less general ones.
Assumption 5 (Insertion bound). There exists a positive quantity $Q_{\max}$ such that, for any $u = (q^1, q^2, p) \in (\mathbb{N}^*)^2 \times \mathbb{R}$ and $n \geq 0$,

This assumption is not restrictive since the available quantities at the best limits remain essentially bounded. Using the symmetry relation, we also have, for any $u = (q^1, q^2, p) \in (\mathbb{N}^*)^2 \times \mathbb{R}$ and $n \geq 0$,

For a state $u = (q^1, q^2, p)$ and $i \in \{1, 2\}$, we write $U^{Disc,i,u} = (Q^{1,i,u}, Q^{2,i,u}, P^{Disc,i,u})$ for a random variable with distribution $d^i_u$. We now give a final boundedness assumption.
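Assumption 6 (Regeneration bound). The mid price $P^\mu_t$ lives in the space $\tau_0 \mathbb{Z}$, with $\tau_0 \in \mathbb{R}_+$ the tick value. Additionally, there exist two positive constants $\tilde{P}_{\max}$ and $\tilde{Q}_{\max}$ such that

Computation of the long term price impact. For every state $U_i$ at the end of the execution, we split $\Delta P^\mu_\infty$ into two quantities:
$$\Delta P^\mu_\infty = \big(P^\mu_{T^\mu_{Exec}} - P^{Exec,\mu}_{T^\mu_{Exec}}\big) + \Delta P'^\mu_\infty,$$
where:
• $P^\mu_{T^\mu_{Exec}}$ is the mid price at the execution (i.e. $P^\mu_{T^\mu_{Exec}}$ and $P^{Exec,\mu}_{T^\mu_{Exec}}$ are known at the execution);
• $\Delta P'^\mu_\infty = P^\mu_\infty - P^\mu_{T^\mu_{Exec}}$ is the long term mid price move after the execution.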
Thus, we only need to compute $\Delta P'^\mu_\infty$. Since we place ourselves after the execution of $I^\mu$, we have $Q^{a,\mu}_t = 0$ and $Q^{Aft,\mu}_t = 0$. With a slight abuse of notation, we can then write $U_i = (Q^1_i, Q^2_i, P_i)$ and drop the dependence on the control. Let $t_1 > T^\mu_{Exec}$ be the first time after the final execution time at which the best bid is totally consumed, and $t_2 > T^\mu_{Exec}$ the first time at which the best ask is depleted. When the best bid (resp. ask) is totally consumed, the price moves on average by

They represent, respectively, the probability that the best bid is consumed before the best ask (or conversely) with exit state $U_i$.
• $d^1_{i,k}$ (resp. $d^2_{i,k}$) are the transition probabilities from the state $U_i$ to $U_k$ when the best bid (resp. ask) is consumed.
i ,k represent respectively the average mid price move after the first regeneration and the probability to reach the state U k starting from the initial point U i right after the first regeneration.
For a state $U_i$, $i^{sym}$ is the index of the symmetric state $U^{sym}_i$. We have the following result.
Proposition 1 (Average mid price move). For an irreducible process $U_t$ (see Section 2.2) satisfying Assumptions 5 and 6, the vector $D$ satisfies

The proof of this result is given in Appendix D.1.
To compute $D$, we need to estimate $d^1_\cdot$, $d^2_\cdot$, $\alpha^\pm_\cdot$ and $q^\pm_\cdot$. The quantities $d^1_\cdot$, $d^2_\cdot$ and $\alpha^\pm_\cdot$ can be estimated from the empirical distribution of order book states after a depletion. Then, we only need to estimate $q^\pm_{ii'}$. We now give a result on the computation of $q^\pm_{ii'}$.
where $\tilde{Q}^*$, $z_1$ and $M$ are defined in Appendix D.2, see Equations (12) and (13). The solution of this equation is unique since $\tilde{Q}^*$ is invertible, see Appendix D.3.
When the dynamics of the best bid are independent of those of the best ask, $\tilde{Q}^*$ is even diagonalisable. In the simple case of constant intensities as in [11], the diagonalisation of $\tilde{Q}^*$ can be computed easily with closed form formulas, see Appendix D.3. The proof of this lemma is given in Appendix D.2. A numerical computation of the vector $P$ is given in Appendix A, Figure 8.
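Proposition 1 and the subsequent result characterize $D$ and $q^\pm_{ii'}$ through exact linear systems. As a self-contained illustration only, the sketch below performs a first-step analysis in a much simpler setting that is our assumption: unit jumps, constant intensities, queues capped at `q_max`, and the regeneration rule of the experiments, with the price moving by one tick at each depletion. It computes $h(q^1, q^2)$, the probability that the best bid is depleted before the best ask (a linear system on the transient states of the embedded jump chain), and then sums the expected price moves over successive regenerations. The sum converges here because, by bid-ask symmetry, the expected moves form a geometric series; with asymmetric rates it may not.

```python
import numpy as np


def bid_first_probability(q_max, lam_plus, lam_minus):
    """h[a-1, b-1] = P(best bid depleted before best ask | Q1 = a, Q2 = b) (toy model)."""
    states = [(a, b) for a in range(1, q_max + 1) for b in range(1, q_max + 1)]
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    rhs = np.zeros(len(states))
    for (a, b), i in idx.items():
        moves = [((a - 1, b) if a > 1 else "bid", lam_minus),
                 ((a, b - 1) if b > 1 else "ask", lam_minus)]
        if a < q_max:
            moves.append(((a + 1, b), lam_plus))
        if b < q_max:
            moves.append(((a, b + 1), lam_plus))
        total = sum(rate for _, rate in moves)
        for target, rate in moves:
            if target == "bid":
                rhs[i] += rate / total               # absorbed: bid depleted first
            elif target != "ask":
                A[i, idx[target]] -= rate / total    # transient-to-transient jump
    return np.linalg.solve(A, rhs).reshape(q_max, q_max)


def long_term_move(q_max, lam_plus, lam_minus, tick=1.0, regen=(5, 3),
                   tol=1e-14, max_iter=100_000):
    """D[a-1, b-1]: expected total future mid move from (Q1, Q2) = (a, b) in the toy model."""
    h = bid_first_probability(q_max, lam_plus, lam_minus)
    after_bid, after_ask = regen, regen[::-1]        # states after a bid / ask depletion
    pb = np.array([h[s[0] - 1, s[1] - 1] for s in (after_bid, after_ask)])
    m = tick * (1 - 2 * pb)                          # expected next move from each regenerated state
    M = np.column_stack([pb, 1 - pb])                # next regenerated state: after_bid / after_ask
    d_regen, term = np.zeros(2), m.copy()
    for _ in range(max_iter):                        # d_regen = sum_k M^k m
        d_regen += term
        term = M @ term
        if np.abs(term).max() < tol:
            break
    return tick * (1 - 2 * h) + h * d_regen[0] + (1 - h) * d_regen[1]
```

For instance, `D = long_term_move(10, 0.06, 0.12)` gives, in ticks, the toy model's expected long term mid price move for each initial $(Q^1, Q^2)$; $D$ is antisymmetric and positive when the bid queue is much larger than the ask queue.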

Existence and uniqueness of the optimal strategy, regularity properties
In the rest of the article, Assumptions 2, 5 and 6 are in force. In this section, we discuss the existence and uniqueness of the optimal strategy and show regularity results for the state process $U^\mu$ and the value function. First, for a finite horizon time $T$, we define the value function

Existence and uniqueness of the optimal control. The optimal strategy exists in both frameworks (i.e. decisions taken at a fixed frequency and at any time), but for different reasons. When decisions are taken at a fixed frequency $\Delta^{-1}$, the optimal strategy exists since only a finite number of strategies is available. When decisions are taken at any time, the sequence of optimal controls $(\tau_i, \upsilon_i)$, where $\tau_i$ is the optimal decision time and $\upsilon_i$ the optimal decision, satisfies $\tau_0 = 0$ and $\upsilon_0 = \arg\max_{r \in \{l, c, m\}} \mathbb{E}[V_T(0, U^r_0)]$, with $u^r$ the new state when the agent takes the decision $r$, and where, for a given state $u$ and control $r$,

Since $V_T$ is continuous, the optimal control is well defined, see Equations (15) and (16). The proof of (2) is given in Appendix F. However, there is a priori no uniqueness of the optimal strategy in either framework.
Regularization of the problem. To enforce the uniqueness of the optimal strategy, we use a practical criterion. We define an order relation between the trader's decisions: c < l < m. The intuition behind this relation is that m is the least risky decision because it gives immediate execution, while l is riskier than m but less risky than c because, unlike c, it does not postpone the execution. Hence, among the optimal decisions, we choose the least risky one in the above sense.
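The tie-breaking rule can be sketched in a few lines of Python: among the decisions whose value is (numerically) maximal, pick the largest one for the order c < l < m. The tolerance `tol` used to detect ties is a hypothetical parameter.

```python
RISK_RANK = {"c": 0, "l": 1, "m": 2}  # order c < l < m: m is the least risky decision


def choose_decision(values, tol=1e-12):
    """Among decisions within `tol` of the best value, return the least risky one.

    `values` maps a decision in {"c", "l", "m"} to its value.
    """
    best = max(values.values())
    optimal = [r for r, v in values.items() if v >= best - tol]
    return max(optimal, key=RISK_RANK.__getitem__)
```

For example, if c and l are both optimal, l is chosen; if c and m are both optimal, m is chosen.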
Regularity of the state process and the value function. The value function $V_T$ is Lipschitz in time, see Appendix E. The results of Appendix E are provided in the more general framework where the state process is allowed to take values in $\mathbb{R}^5_+ \times \mathbb{R}^2$. In this setting, we study the regularity of the process $U^\mu$, see Theorem 5, which enables us to recover the Lipschitz property of $V_T$.

Decisions taken at a fixed frequency $\Delta^{-1}$: dynamic programming equation
In this section, we provide and solve the system of equations satisfied by the value function V of the optimal control problem. We have the following result.
Theorem 2. Let $u = (q_{bef}, q_a, q_{aft}, q^2, i, p, p_{exec})$ be an initial state and $t \in [0, T]$. Then $V(t, u)$ satisfies:
• When $i > 0$:
- At the decision time $t = k\Delta < T$:

where

where $u^r_i$ is the new order book state when the decision $r \in \{l, c, m\}$ is taken. We keep in mind that the controls c and m may lead to several order book states because of the regeneration. Equation (3) should be understood coordinate by coordinate.
where $\mathcal{A} = \partial_t + \mathcal{Q}$ is the infinitesimal generator of the process $U^\mu_t$. The expression of $\mathcal{Q}$ is given in Appendix C.
• When $i = 0$ (execution time condition):

• The terminal condition is:

The proof of this result is given in Appendix F.
Remark 1. At every decision time, as long as the order is not executed, the agent compares the value functions given by each control and takes the highest one, see Equation (3). When the order is executed, the agent's gain is $g(U)$. If the order is not executed before time $T$, the agent sends a market order to obtain immediate execution and earns $g(U)$.

Remark 2.
Without the controls c and l, Equations (3), (4) and (5) are equivalent, in dimension 1, to the classical problem of a finite horizon Bermudan option. The above system can be solved explicitly, see Appendix F.

Second approach: Decisions taken at any time
Let us now consider the case where the agent can take a decision at any time. In this section, we provide the system of equations satisfied by the value function, and we also introduce a simplified control problem whose value function is easy to compute numerically and converges towards that of the initial optimal control problem.

Dynamic programming equation
We keep the same notations as in Theorem 2. We have the following result for the value function in this setting.
Theorem 3. Let $u = (q_{bef}, q_a, q_{aft}, q^2, i, p, p_{exec})$ be an initial state and $t \in [0, T]$. Then $V(t, u)$ satisfies, in the viscosity sense and almost everywhere,

• When $i = 0$ (execution time condition):

• The terminal condition is:

The proof of the result is given in Appendix F. Since $\partial_t V$ is a priori not continuous, we use the notion of viscosity solution. However, we show that $\partial_t V$ is continuous except on the boundary of $\{V = g\}$, and that the above equations are satisfied pointwise except on this boundary, see Appendix F.

Remark 3.
Without the controls c and l, Equations (6) and (7) are equivalent, in dimension 1, to the classical problem of a finite horizon American option.

Numerical resolution of the optimal execution problem
To solve the preceding optimal control problem numerically, we consider a discrete framework. We show here how this discrete framework can be used to approximate the solution of the continuous control problem, and we provide an error estimate.
with $U_t$ the process defined in Section 2.2. Given the infinitesimal generator $\mathcal{Q}$ of $U$, the transition matrix $P$ is easily computed since $P = e^{\Delta \mathcal{Q}}$, see Appendix C. In this approximation, $U^\Delta_n$ is viewed as the market evolution without the intervention of the agent. Associated with this new market, we introduce the controlled discrete-time Markov chain $U^{\Delta,\mu}_n = (Q^{Bef,\mu}_n, Q^{a,\mu}_n, Q^{Aft,\mu}_n, Q^{2,\mu}_n, I^\mu_n, P^\mu_n, P^{Exec,\mu}_n)$, using the same construction as in Section 3.1. We can also compute $\Delta P^\mu_\infty$ in this discrete-time approximation by following the same approach as in Section 4.1.
Numerical solution of the optimal control problem in the discrete framework. We denote by $V^\Delta(n, u)$ the value function associated with the discrete control problem, with $n$ the period and $u$ the order book state. The dynamic programming principle reads

Consequently, we have

with the terminal constraint $V^\Delta(n_f, u) = g(u)$, where $u^r$ is the new order book state after the control $r \in \{l, c, m\}$ and $n_f$ is the final period. Equation (8) provides a numerical scheme to compute $V^\Delta(0, u)$. At the final time $T$, we can compute $V^\Delta(n_f, u)$ for each reachable state. Using the backward Equation (8), we can compute $V^\Delta(i, u)$ from $V^\Delta(i + 1, \cdot)$ and thus obtain the initial value $V^\Delta(0, u)$. The numerical results of simulations are presented in Section 5. To compute the value function efficiently, the dynamic programming scheme can be parallelized.
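The backward recursion can be sketched on a generic finite state space. In this minimal Python version, which is our own encoding and not a transcription of Equation (8), each decision $r$ is represented by an index map `post[r]` sending a state to the state reached right after the decision, executed states are frozen at their gain $g$, and the other states wait one period under a transition matrix `P`; the per-period `cost` is a hypothetical stand-in for the waiting cost. With only two decisions (keep the limit order, or m), this reduces to the Bermudan optimal stopping problem of Remark 2.

```python
import numpy as np


def backward_induction(P, post, executed, g, n_periods, cost=0.0):
    """Finite-horizon dynamic programming on a finite state space (toy version).

    P         : (S, S) one-period transition matrix of the order book + agent state
    post      : dict decision -> length-S index array; post[r][u] is the state
                reached immediately after taking decision r in state u
    executed  : boolean array, True where the order is fully executed
    g         : array of gains g(u), also used as the terminal condition
    cost      : hypothetical waiting cost charged per period while not executed
    """
    S = len(g)
    actions = list(post)
    V = np.empty((n_periods + 1, S))
    policy = np.empty((n_periods, S), dtype=object)
    V[n_periods] = g                                   # terminal condition V(n_f, u) = g(u)
    for n in range(n_periods - 1, -1, -1):
        # Value of a post-decision state: g if executed, otherwise wait one period.
        cont = np.where(executed, g, P @ V[n + 1] - cost)
        values = np.array([cont[np.asarray(post[r])] for r in actions])  # shape (A, S)
        V[n] = np.where(executed, g, values.max(axis=0))
        policy[n] = [None if e else actions[k]
                     for e, k in zip(executed, values.argmax(axis=0))]
    return V, policy
```

In a toy example, state 0 is a resting limit order that gets filled (gain of 1 tick, state 1) with probability 1/2 per period, and m leads to state 2 (gain 0). Over two periods without cost, keeping the order is optimal and $V(0, 0) = 0.75$; with a cost of 0.6 per period, sending a market order immediately is optimal. Ties are broken by `argmax` in the order of `post`; the risk ordering of the regularization paragraph could be used instead.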
Remark 4. Note that applying a finite difference scheme to the equations of Theorem 3 gives the same result as the discrete-time approximation given by the Markov chain $\tilde{U}^\Delta_n$ with transition matrix $\tilde{P} = I + \mathcal{Q}\Delta$. When $\Delta$ is small, our discrete-time approximation is almost equivalent to the finite difference scheme since $P = e^{\Delta \mathcal{Q}} = I + \mathcal{Q}\Delta + o(\Delta)$.
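Remark 4 compares $P = e^{\Delta \mathcal{Q}}$ with $\tilde{P} = I + \mathcal{Q}\Delta$. The sketch below uses NumPy only (in practice `scipy.linalg.expm` is the usual library routine) and computes the matrix exponential by scaling and squaring with a truncated Taylor series, for a toy 3-state generator whose rates are illustrative. The gap between the two matrices should shrink by a factor of roughly 4 when $\Delta$ is halved, consistent with $e^{\Delta \mathcal{Q}} - (I + \mathcal{Q}\Delta) = O(\Delta^2)$.

```python
import numpy as np


def expm_taylor(A, terms=30):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    norm = np.abs(A).sum(axis=1).max()             # infinity norm of A
    s = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    B = A / 2.0**s                                 # now ||B|| <= 1/2
    E = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms + 1):
        term = term @ B / k
        E = E + term
    for _ in range(s):                             # undo the scaling: e^A = (e^B)^(2^s)
        E = E @ E
    return E


# Toy 3-state generator (rows sum to zero); the rates are illustrative.
Q = np.array([[-0.30, 0.20, 0.10],
              [0.12, -0.18, 0.06],
              [0.00, 0.12, -0.12]])
errors = {}
for delta in (1.0, 0.5, 0.25):
    exact = expm_taylor(delta * Q)                 # P = e^{Delta Q}
    first_order = np.eye(3) + delta * Q            # finite-difference matrix I + Delta Q
    errors[delta] = np.abs(exact - first_order).max()
```

Note that $I + \mathcal{Q}\Delta$ is itself a stochastic matrix only when $\Delta$ does not exceed the inverse of the largest exit rate, whereas $e^{\Delta \mathcal{Q}}$ is stochastic for every $\Delta \geq 0$.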
Finally, for every $k \geq 0$, we define the piecewise constant process $\tilde{U}^\Delta$ associated with $U^\Delta_n$ such that:

We denote by $\tilde{V}^\Delta(t, U)$ the value function of the control problem where the state process is $U^\Delta$. Then we have the following error estimate.
Theorem 4. For every state $u = (q_{bef}, q_a, q_{aft}, q^2, i, p, p_{exec})$, we have

with $R = 4 c q_a H$ and $H$ defined in Assumption 2. This ensures the convergence of our discrete approximation:

Moreover, the sequence $\mu^{Opti,\Delta}$ associated with the process $\tilde{U}^\mu_t$ satisfies

where $\mu^{Opti}$ is the optimal control associated with the continuous time control problem.
The proof of this result is given in Appendix G. When $\Delta$ is small, the above error estimate remains valid for $P = I + \mathcal{Q}\Delta$ (i.e. the finite difference scheme), see Appendix G.

Numerical experiments
In this section, we show the relevance of the optimal strategy in both frameworks: when decisions are taken at a fixed frequency $\Delta^{-1}$ and when they are taken at any time. To do so, we compare the optimal gain given by our strategy with the one given by the standard strategy "join the bid": stay in the order book at the best bid until the final time. Here, we write $Q^1$ (resp. $Q^2$) for the best bid (resp. ask) limit.

5.1 Computation of the optimal gain: decisions taken at a fixed frequency $\Delta^{-1}$
Figure 4 shows, for an order of size 1, the difference between the average gain (i.e. the initial value function) of the optimal strategy and that of the strategy "join the bid", for different values of the initial $Q^1$ and $Q^2$. The gain of the optimal strategy is obviously always higher than that of the strategy that stays in the order book. However, because the priority value (that is, the advantage of a limit order over another limit order standing at the rear of the same queue) is important, it is more useful to be active (i.e. to cancel the order or send a market order) when imbalance is highly positive than when it is negative. Finally, note that the gain of the optimal strategy over "join the bid" reaches a maximum value of 2.4 ticks (with a tick size $\delta = 0.01$).

5.2 Computation of the optimal gain: decisions taken at any time

Figure 5 shows the value function at time zero (i.e., the trader's gain) of the optimal strategy in red and that of the strategy "stay at the best bid" in blue, as a percentage of the tick $\delta = 0.01$, computed with the discrete approximation. The colours of the points refer to the initial decision given by the strategy: green points mean that staying in the order book is the best initial decision, red points mean that cancelling is the best initial decision, and black points mean that sending a market order is the best initial decision. When the imbalance is highly negative, it is optimal to cancel the order to avoid adverse selection; when it is highly positive, it is optimal to send a market order or to stay in the order book. In our case, staying in the order book is attractive when the imbalance is highly positive since the priority value is important (i.e., $Q^{Bef}$ is fixed equal to 1). In intermediate cases (i.e., imbalance close to 0), it is optimal to send a market order to reduce the waiting cost. The gain of the optimal strategy is significantly higher than that of the strategy "join the bid".
Acknowledgements. The authors gratefully acknowledge the financial support of the ERC grant 679836 Staqamof and the Chair Analytics and Models for Regulation.

Figure 4: Difference between the gain of the optimal strategy and that of the strategy "stay at the best bid". The initial parameters are as follows: the time step is $\Delta = 10$ seconds, the final time is $T = 100$ seconds, arrival and consumption rates are estimated on data (see Appendix A), the new bid (resp. ask) is set to 5 and the new ask (resp. bid) to 3 after the total depletion of the bid (resp. ask) limit, the quantity $q^a = 1$, the waiting cost $c = 0$, the price increases (resp. decreases) by $\delta = 0.01$ when the ask limit (resp. bid limit) is totally consumed, and the function $f$ is the identity.

Figure 5: Gain per tick of the optimal strategy (red) and of the strategy "join the bid" (blue) for different values of the initial imbalance. Initial imbalances are obtained with $Q^{Bef} = 0$, $Q^1 = 11$ and $Q^2$ from 1 to 11, and with $Q^2 = 11$ and $Q^1$ from 10 to 1. The initial parameters are as follows: the time step is $\Delta = 1$ second, there are 10 periods, arrival and consumption rates are constant with $\lambda^{1,+} = \lambda^{2,+} = 0.06$ and $\lambda^{1,-} = \lambda^{2,-} = 0.12$, the new bid (resp. ask) is set to 5 and the new ask (resp. bid) to 3 after the total depletion of the bid (resp. ask) limit, the quantity $q^a = 1$, the waiting cost $c = 0.0085$, the price increases (resp. decreases) by $\delta = 0.01$ when the ask limit (resp. bid limit) is totally consumed, and the function $f$ is the identity.
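To make the initial states of Figure 5 concrete, the Python sketch below enumerates the 21 queue configurations listed in its caption and attaches an imbalance to each. The caption does not define the imbalance, so the formula $(Q^1 - Q^2)/(Q^1 + Q^2)$ used here is an assumption; it is at least consistent with the convention of this section that a positive imbalance corresponds to a large bid queue. The quantity $Q^{Bef}$ is ignored.

```python
from fractions import Fraction


def imbalance(q1: int, q2: int) -> Fraction:
    """Assumed definition: (Q1 - Q2) / (Q1 + Q2), between -1 and 1."""
    return Fraction(q1 - q2, q1 + q2)


def figure5_initial_states() -> list[tuple[int, int, Fraction]]:
    """Initial (Q1, Q2) pairs from the Figure 5 caption, in order of decreasing imbalance."""
    states = [(11, q2) for q2 in range(1, 12)]        # Q1 = 11, Q2 from 1 to 11
    states += [(q1, 11) for q1 in range(10, 0, -1)]   # Q2 = 11, Q1 from 10 down to 1
    return [(q1, q2, imbalance(q1, q2)) for q1, q2 in states]


for q1, q2, imb in figure5_initial_states():
    print(f"Q1={q1:2d}  Q2={q2:2d}  imbalance={float(imb):+.3f}")
```

Under this assumed definition, the grid spans imbalances from $5/6$ down to $-5/6$, with the balanced book $Q^1 = Q^2 = 11$ in the middle.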

A Model parameters estimation
The estimation methodology of the arrival and cancellation rates of limit orders is similar to that in [19]. The regeneration distribution of the order book is estimated from the empirical distribution of order book states after a depletion.
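For readers who want to reproduce this kind of calibration, a minimal Python sketch of one common estimator is given below: for a Markov jump process, the maximum-likelihood estimate of a constant intensity in state $Q$ is the number of events of that type observed in $Q$ divided by the total time spent in $Q$. We have not checked that this is exactly the procedure of [19]; the input format (state, waiting time, event type) and the event labels are hypothetical.

```python
from collections import defaultdict


def estimate_intensities(observations):
    """observations: iterable of (state, waiting_time, event_type), where waiting_time is
    the time spent in `state` before `event_type` occurred (e.g. state=(q1, q2),
    event_type=(1, "+") for an insertion at the best bid).
    Returns {state: {event_type: count / total time spent in state}}."""
    time_in_state = defaultdict(float)
    counts = defaultdict(lambda: defaultdict(int))
    for state, waiting_time, event_type in observations:
        time_in_state[state] += waiting_time
        counts[state][event_type] += 1
    return {
        state: {event: n / time_in_state[state] for event, n in events.items()}
        for state, events in counts.items()
    }


def growth_ratio(rates, state, side):
    """tau^{side,+}(Q) = lambda^{side,+}(Q) / lambda^{side,-}(Q)."""
    return rates[state][(side, "+")] / rates[state][(side, "-")]


obs = [((2, 3), 1.0, (1, "+")), ((2, 3), 3.0, (1, "-")), ((2, 3), 4.0, (1, "-"))]
rates = estimate_intensities(obs)
print(rates[(2, 3)], growth_ratio(rates, (2, 3), side=1))
```

In this toy input, 8 time units are spent in state $(2, 3)$, with one insertion and two cancellations at the bid, giving $\lambda^{1,+} = 0.125$, $\lambda^{1,-} = 0.25$ and a growth ratio $\tau^{1,+} = 0.5$.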
In what follows, we provide the calibration results of our order book model using the database described in Section 2.1. Here, we write $Q_t = (Q^1_t, Q^2_t)$, with $Q^1_t$ (resp. $Q^2_t$) the best bid (resp. ask) quantity, and assume that intensities and regeneration distributions depend only on $Q_t$.
Intensities estimation. For every $Q = (Q^1, Q^2)$, we write $\tau^{1,+}(Q) = \lambda^{1,+}(Q)/\lambda^{1,-}(Q)$ and $\tau^{2,+}(Q) = \lambda^{2,+}(Q)/\lambda^{2,-}(Q)$ for the bid and ask side growth ratios, respectively. Given the bid-ask symmetry relation, we can aggregate data and focus on the bid side only. Figures 6.a, 6.b, 6.c and 6.d show respectively $\lambda^{1,+}$, $\lambda^{1,-}$, $\tau^{1,+}$ and $\tau^{2,+}$ for different values of $Q$. As expected, participants insert more limit orders when the imbalance is negative (see Figure 6.a when $Q^2 \gg Q^1$), while they cancel more when the imbalance is positive (see Figure 6.b when $Q^1 \gg Q^2$). Finally, Figure 6.c (resp. Figure 6.d) shows that $\tau^{1,+}$ (resp. $\tau^{2,+}$) is high when the imbalance is negative (resp. positive) and low when the imbalance is positive (resp. negative). This means that the bid limit (resp. ask limit) tends to increase (resp. decrease) when $Q^1 \ll Q^2$ and to decrease (resp. increase) when $Q^1 \gg Q^2$.

Quantities after depletion. When one limit is depleted, we write $Q^{New,1}$ (resp. $Q^{New,2}$) for the new best bid (resp. ask) quantity. Figures 7.a, 7.b and 7.c show respectively $Q^{New,1}$, $Q^{New,2}$ and the ratio $r^+(Q^1, Q^2) = Q^{New,1}/Q^{New,2}$ for different values of $Q^1$ and $Q^2$ before the mid price move. Since we aggregate data, the bid queue is always the depleted queue and the ask limit is the non-consumed limit. Figures 7.a and 7.b show that $Q^{New,1}$ depends mainly on $Q^2$, while $Q^{New,2}$ depends on both $Q^1$ and $Q^2$. More interestingly, $r^+$ reaches its maxima in two cases (see Figure 7.c). The first case, when the bid is low and the ask is high, can be explained by a mean reversion effect, while the second one, when both queues are initially high, is due to the arrival of a large order consuming market liquidity.

Approximation of $\mathbb{E}_{U_0}[\Delta P_\infty]$. Figure 8 shows $\mathbb{E}_{U_0}[\Delta P_\infty]$, defined in Section 4.1 and computed using Proposition 1, for different values of the initial state $U_0 = (Q^1, Q^2, P)$.
Figure 8 shows the predictive power of the imbalance: when the imbalance is positive, the price increases on average, and conversely when it is negative. We also note that the bid-ask symmetry relation is respected.
Model approximation at short time horizon. Figure 9 shows the empirical and theoretical distributions of $Q^1$ and $Q^2$ after 20 events. We choose 20 events since this is consistent with the duration of our control. The theoretical distribution is estimated by a Monte Carlo simulation of the order book. Both distributions are close, and consequently our model is consistent with the empirical order book dynamics, at least over the control duration. The model is also consistent with empirical data over long time horizons; see [19].

Figure 9: (a) (resp. (c)) Empirical distribution of $Q^2$ (resp. $Q^1$) after 20 events and (b) (resp. (d)) theoretical distribution of $Q^2$ (resp. $Q^1$). $Q^1$ and $Q^2$ are divided by the average event size.
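The Monte Carlo step can be sketched as follows in Python. This is a deliberately simplified stand-in for the calibrated model, and every modelling choice is an assumption: unit event sizes, state-independent rates $\lambda^{i,+} = 0.06$ and $\lambda^{i,-} = 0.12$ (taken from the caption of Figure 5 rather than from the calibration), and the deterministic regeneration rule of that caption ($(Q^1, Q^2) = (5, 3)$ after a bid depletion, $(3, 5)$ after an ask depletion). Because the rates do not depend on the state here, each event type is simply drawn with probability proportional to its rate.

```python
import random
from collections import Counter


def simulate_queues(q1, q2, n_events, lam_plus, lam_minus, rng):
    """Apply n_events unit-size events to the (best bid, best ask) quantities."""
    total = 2.0 * (lam_plus + lam_minus)
    for _ in range(n_events):
        u = rng.random() * total
        if u < lam_plus:                      # insertion at the bid
            q1 += 1
        elif u < lam_plus + lam_minus:        # cancellation/consumption at the bid
            q1 -= 1
        elif u < 2 * lam_plus + lam_minus:    # insertion at the ask
            q2 += 1
        else:                                 # cancellation/consumption at the ask
            q2 -= 1
        if q1 == 0:        # bid depleted: regenerate (assumed rule)
            q1, q2 = 5, 3
        elif q2 == 0:      # ask depleted: regenerate (assumed rule)
            q1, q2 = 3, 5
    return q1, q2


def distributions_after_events(q1_0, q2_0, n_events=20, n_paths=20_000,
                               lam_plus=0.06, lam_minus=0.12, seed=0):
    """Empirical distributions of Q1 and Q2 after n_events, over n_paths simulated paths."""
    rng = random.Random(seed)
    counts1, counts2 = Counter(), Counter()
    for _ in range(n_paths):
        q1, q2 = simulate_queues(q1_0, q2_0, n_events, lam_plus, lam_minus, rng)
        counts1[q1] += 1
        counts2[q2] += 1
    dist1 = {k: counts1[k] / n_paths for k in sorted(counts1)}
    dist2 = {k: counts2[k] / n_paths for k in sorted(counts2)}
    return dist1, dist2


dist_q1, dist_q2 = distributions_after_events(11, 11)
print(dist_q1)
```

Comparing such simulated histograms with their empirical counterparts is the kind of check that Figure 9 reports; with a calibrated model, the rates would be looked up from the current state instead of being constant.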
B Ergodicity of the process $(Q_t)$

B.1 Outline of the proof
Let $Z_t$ be a Markov process defined on the filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t), \mathbb{P})$ with values in $(W, \mathcal{W})$, and let $P_t(x, A)$ denote the transition probability of $Z_t$.
Definition 1 (Ergodicity). The process $Z_t$ is ergodic if there exists an invariant probability measure $\pi$ which satisfies $\lim_{t \to \infty} \|P_t(x, \cdot) - \pi\|_{TV} = 0$ for every $x \in W$.

Let $Q$ be the infinitesimal generator of $Q_t$. To prove that $Q_t$ is ergodic, we design a Lyapunov function $V : \mathbb{R}^2_+ \to (0, \infty)$ for which the negative drift condition $QV(q) \le -cV(q) + d$ is satisfied for some $c > 0$ and $d > 0$.
Then, using Theorem 6.1 in [26], the Markov process $Q_t$ is non-explosive and $V$-uniformly ergodic. Furthermore, by Theorem 4.2 in [26], it is Harris positive recurrent.
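The drift condition can be checked by hand in a toy case, sketched below in Python under simplifying assumptions that are ours, not the paper's: a single queue with unit jumps, constant rates $\lambda^+ < \lambda^-$, no regeneration (the queue simply cannot go below 0), and the hypothetical Lyapunov function $V(q) = z^q$ with $1 < z < \lambda^-/\lambda^+$. For $q \ge 1$, one gets $QV(q) = \lambda^+(z^{q+1} - z^q) + \lambda^-(z^{q-1} - z^q) = -cV(q)$ with $c = (z - 1)(\lambda^-/z - \lambda^+) > 0$, and the boundary state $q = 0$ is absorbed into the constant $d$.

```python
def generator_apply(V, q, lam_plus, lam_minus):
    """(QV)(q) for a toy queue: +1 at rate lam_plus, -1 at rate lam_minus when q >= 1."""
    up = lam_plus * (V(q + 1) - V(q))
    down = lam_minus * (V(q - 1) - V(q)) if q >= 1 else 0.0
    return up + down


def drift_constants(z, lam_plus, lam_minus):
    """Constants (c, d) such that QV <= -c V + d for V(q) = z**q."""
    c = (z - 1.0) * (lam_minus / z - lam_plus)   # exact drift rate for q >= 1
    d = lam_plus * (z - 1.0) + c                  # covers the boundary state q = 0
    return c, d


lam_plus, lam_minus, z = 0.06, 0.12, 1.5


def V(q):
    return z ** q   # hypothetical Lyapunov function


c, d = drift_constants(z, lam_plus, lam_minus)
worst_slack = min(-c * V(q) + d - generator_apply(V, q, lam_plus, lam_minus) for q in range(30))
print(c, d, worst_slack)
```

The same computation extends to two independent queues with $V(q^1, q^2) = z^{q^1} + z^{q^2}$ by adding the two drifts; handling state-dependent rates and regeneration, as in the paper, requires the additional bounds of Assumptions 2 and 4.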
When $u_i \le k_0$ with $i \in \{1, 2\}$, we only need to add a constant to the expression above using Assumptions 2 and 4. We remark that Term (3) $\le 0$ and Term (4) $\le \lambda^{i,-}(k)L$. Furthermore, Term (1) is split into two parts, (1.1) and (1.2). Using Assumption 2 for (1.1), we get $(1.1) \le H(z^{2k_0} - 1)$. Using Assumption 2 for (1.2), we also obtain an upper bound, in which we write $M(k_0)$ for $(z^{k_0} - 1)z^{k_0}$. Finally, the drift condition follows by combining the above inequalities.

C The infinitesimal generator of the process $U^\mu_t$

To fully describe the dynamics of $U^\mu_t$, we need to define the infinitesimal generator $Q^\mu$ of the process $U^\mu_t$ without our intervention. We write $\tilde d^i_u = d^i_u\, \iota^i_u$ for the joint regeneration density of the order book and the agent placement. Indeed, after a regeneration, the agent's order is placed in position $q^{bef}_0$ with probability $\iota^i_u(q^{bef}_0)$.
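• When a limit order of size $n$ is inserted at the best ask, the post-event state is $(\tilde q, \tilde z)$ with $\tilde q = (q^{bef}, q^a, q^{aft}, q^2 + n, i)$ and $\tilde z = z$.
• When a cancellation order of size $n$ is removed from the best bid:
  1. When $n > q^{bef} + q^{aft}$, $\tilde z$ and $\tilde q$ are obtained after a regeneration, where $p'$, $q'^1$, $q'^{bef}_0$ and $q'^2$ are fixed by the regeneration law.
  2. When $n < q^{bef} + q^{aft}$, we have $\tilde z = z$, and $\tilde q$ is updated accordingly.
• When a market order of size $n$ is sent to the best bid:
  1. When $n > q^{bef} + q^a + q^{aft}$, the bid limit is depleted, and $\tilde z$ and $\tilde q$ are obtained after a regeneration, where $p'$, $q'^1$, $q'^{bef}_0$ and $q'^2$ are fixed by the regeneration law.
  2. Otherwise, $\tilde z$ and $\tilde q$ satisfy $\tilde q^a = q^a - \min(n - q^{bef}, q^a)\,\mathbf{1}_{n > q^{bef}}$, $\tilde i = i\,\mathbf{1}_{q^a = 0} + q^a\,\mathbf{1}_{q^a > 0}$, $\tilde q^{aft} = q^{aft}\,\mathbf{1}_{n < q^{bef}} + (q^{aft} + q^{bef} + q^a - n)\,\mathbf{1}_{n \ge q^{bef} + q^a}$ and $\tilde q^2 = q^2$.
• When there is a liquidity consumption event at the best ask:
  1. When $n > q^2$, the ask limit is depleted, and $\tilde z$ and $\tilde q$ are obtained after a regeneration, where $p'$, $q'^1$, $q'^{bef}_0$ and $q'^2$ are fixed by the regeneration law.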
Additionally, we use the symmetry relation. Given that $0 \le p_{i,i} < 1$ (the price moves with a non-zero probability when one limit is totally consumed) and $p_{i,i^{sym}} \ge 0$, we have $1 - (p_{i,i} - p_{i,i^{sym}}) > 0$. This proves Proposition 1.

D.2 Proof of Lemma 1
For simplicity, we fix the added/cancelled quantity to $q = 1$. To take non-unit jumps into account, we can simply fill the zero entries of the matrix $\tilde Q^*$ with the corresponding probabilities; see Equation (12).
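A small numerical counterpart of Lemma 1 is sketched below in Python. It is an illustration under our own assumptions: unit jumps, state-independent rates, and insertions simply blocked at $Q_{max}$. It solves the classical hitting-probability system (in the spirit of Theorem 3.3.1 in [28]) restricted to the transient states, $\tilde Q^*_{TT}\, h = -r$, where $r$ collects the rates into the states $\{Q^1 = 0\}$. Unlike the matrix $R$ of the paper, it aggregates over the exit level $q'$ and only returns the probability that the bid is depleted first.

```python
import numpy as np


def bid_depletes_first(q_max, lam_bid_plus, lam_bid_minus, lam_ask_plus, lam_ask_minus):
    """h[q1 - 1, q2 - 1] = P(bid queue hits 0 before ask queue | start at (q1, q2))."""
    def index(q1, q2):
        return (q1 - 1) * q_max + (q2 - 1)   # level-by-level ordering on Q1

    n = q_max * q_max
    Q_tt = np.zeros((n, n))   # generator restricted to transient states (q1, q2 >= 1)
    r = np.zeros(n)           # rates from each transient state into {q1 = 0}
    for q1 in range(1, q_max + 1):
        for q2 in range(1, q_max + 1):
            s = index(q1, q2)
            moves = [(q1 + 1, q2, lam_bid_plus), (q1 - 1, q2, lam_bid_minus),
                     (q1, q2 + 1, lam_ask_plus), (q1, q2 - 1, lam_ask_minus)]
            for a, b, rate in moves:
                if a > q_max or b > q_max:
                    continue              # insertion blocked by the truncation
                Q_tt[s, s] -= rate
                if a == 0:
                    r[s] += rate          # absorbed: bid depleted first
                elif b > 0:
                    Q_tt[s, index(a, b)] += rate
                # b == 0: absorbed with the ask depleted first, contributes 0 to h
    h = np.linalg.solve(Q_tt, -r)
    return h.reshape(q_max, q_max)


H = bid_depletes_first(8, 0.06, 0.12, 0.06, 0.12)
print(H[0, 7], H[7, 0], H[3, 3])
```

With symmetric rates, the bid-ask symmetry shows up as $h(q^1, q^2) + h(q^2, q^1) = 1$, and a small bid facing a large ask is much more likely to be depleted first.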
To compute the matrix $R$, we first fix the price $P = 0$, since there is no price move before the total depletion of a limit, and model the order book state by $u = (q^1, q^2)$, with $q^1$ (resp. $q^2$) the best bid (resp. ask) quantity. Then, we introduce the absorbing states $U_{0,q'}$ (resp. $U_{q',0}$) with $q' \ge 1$, associated with the cases $u = (0, q')$ (resp. $u = (q', 0)$) where $Q^1$ (resp. $Q^2$) is consumed before $Q^2$ (resp. $Q^1$). We want to compute the probabilities of visiting $U_{0,q'}$ and $U_{q',0}$ with $q' \ge 1$ starting from $U_i$. To do this, we consider the infinitesimal generator $Q^*$ of the Markov process $(Q^1, Q^2)$ (the price $P = 0$ being fixed):

$$Q^* = \begin{pmatrix} 0_{2Q_{max}} & 0 \\ \begin{pmatrix} Q^{1,-} & Q^{1,+} \end{pmatrix} & \tilde Q^* \end{pmatrix},$$

where $0_{2Q_{max}}$ is the zero square matrix of size $2Q_{max}$, $Q^{1,-}$ encodes transitions to the absorbing states $U_{0,q'}$ and $Q^{1,+}$ encodes transitions to the absorbing states $U_{q',0}$ with $1 \le q' \le Q_{max}$, and $\tilde Q^*$ is similar to the infinitesimal generator of the process $U_t$ without regeneration. The matrix $\tilde Q^*$ has the following block-tridiagonal form:

$$\tilde Q^* = \begin{pmatrix} \tilde Q^{*,(1)}_1 & \tilde Q^{*,(1)}_0 & & \\ \tilde Q^{*,(2)}_2 & \tilde Q^{*,(2)}_1 & \tilde Q^{*,(2)}_0 & \\ & \ddots & \ddots & \ddots \\ & & \tilde Q^{*,(Q_{max})}_2 & \tilde Q^{*,(Q_{max})}_1 \end{pmatrix},$$

where $\tilde Q^{*,(l)}_0$ encodes transitions from level $Q^1 = l$ to level $Q^1 = l + 1$, $\tilde Q^{*,(l)}_2$ encodes transitions from level $Q^1 = l$ to $Q^1 = l - 1$, and $\tilde Q^{*,(l)}_1$ encodes transitions within level $Q^1 = l$. $Q_{max}$ is the maximum quantity available on each limit. Within each sub-matrix $\tilde Q^{*,(l)}_i$ with $i \in \{0, 1, 2\}$, $Q^1$ is equal to $l$ and $Q^2$ varies from 1 to $Q_{max}$. In particular, the sub-matrix $\tilde Q^{*,(l)}_0$ is diagonal: $\tilde Q^{*,(l)}_0 = \mathrm{diag}\big(\lambda^{1,+}(l, 1), \ldots, \lambda^{1,+}(l, Q_{max})\big)$.
Let $\lambda^*(l, l') = \sum_{i=1}^{2} \big(\lambda^{i,+}(l, l') + \lambda^{i,-}(l, l')\big)$ for every $l, l' \in \{1, \ldots, Q_{max}\}$. For $l \le Q_{max}$, we have
Finally, we define the matrix $Q^{1,-}$ such that $Q^{1,-}_{ii} = \lambda^{1,-}(1, i)$ for $1 \le i \le Q_{max}$ and 0 otherwise, and the matrix $Q^{1,+}$ such that $Q^{1,+}_{iQ_{max}+1,\, i+1} = \lambda^{2,-}(i, 1)$ for $0 \le i \le Q_{max} - 1$ and 0 otherwise. Using Theorem 3.3.1 in [28], for every absorbing state $U_i$ we have
In the above equations, we use a slight abuse of notation and do not differentiate the state $U_i$ and the index $i$. This reads $\tilde Q^* R = -z_1$ and $R = MR$.
The idea is to find a matrix $P$ such that $P^{-1}\tilde Q^* P$ is symmetric, with $P = LH$. First, we consider the block-diagonal matrix $H$, whose blocks are defined for all $i \ge 1$.
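Here $\sqrt{\cdot}$ refers to the square root of a matrix. The existence of such a matrix is trivial in this case since $\tilde Q^{*,(i)}_2$ and $\tilde Q^{*,(i-1)}_0$ are diagonal with strictly positive coefficients. Next, we consider the block-diagonal matrix $L = \mathrm{diag}\{L_1, L_1, \ldots, L_1\}$, where $L_1$ is a diagonal matrix with diagonal coefficients $L_1(1,1) = 1$ and $L_1(i+1, i+1) = L_1(i,i)$ for all $i \ge 1$. Given that queues are independent, we have $\tilde Q^{*,(0)}_1 = \tilde Q^{*,(0)}_i$ for all $i \ge 1$. Finally, we note that $P^{-1}\tilde Q^* P$, with $P = LH$, is symmetric.

D.3.2 Diagonalisation of the symmetric matrix $P^{-1}\tilde Q^* P$: constant coefficients

In the simple case of constant coefficients, the matrix $P$ defined in Appendix D.3.1 satisfies

$$P^{-1}\tilde Q^* P = \begin{pmatrix} A & V & & \\ V & A & \ddots & \\ & \ddots & \ddots & V \\ & & V & A \end{pmatrix}, \qquad A = \begin{pmatrix} a & b & & \\ b & a & \ddots & \\ & \ddots & \ddots & b \\ & & b & a \end{pmatrix},$$

where $V = \beta I$ with $\beta > 0$, and $a$ and $b$ are fixed constants. In this framework, the eigenvalues of $P^{-1}\tilde Q^* P$ are

$$\lambda^{k,j}_{a,b,\beta} = a + 2b\cos\Big(\frac{k\pi}{n+1}\Big) + 2\beta\cos\Big(\frac{j\pi}{n+1}\Big), \qquad \forall\, 1 \le k, j \le n,$$

and the associated eigenspace is generated by the eigenvector

$$\Big(\sin\big(\tfrac{j\pi}{n+1}\big)X_k,\ \sin\big(\tfrac{2j\pi}{n+1}\big)X_k,\ \ldots,\ \sin\big(\tfrac{nj\pi}{n+1}\big)X_k\Big)^\top,$$

where $X_k$ is the vector such that $X_k(m) = \sin\big(\tfrac{mk\pi}{n+1}\big)$ for $1 \le m \le n$.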
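The closed-form spectrum of the constant-coefficient case can be checked numerically. The Python sketch below builds the block-tridiagonal matrix with diagonal blocks $\mathrm{tridiag}(b, a, b)$ and off-diagonal blocks $\beta I$ as a Kronecker sum, and compares its eigenvalues and one eigenvector with the formulas of Appendix D.3.2; the numerical values of $n$, $a$, $b$ and $\beta$ are arbitrary choices made only for the check.

```python
import numpy as np


def tridiag(n, diag, off):
    """Symmetric tridiagonal Toeplitz matrix with `diag` on the diagonal and `off` next to it."""
    return diag * np.eye(n) + off * (np.eye(n, k=1) + np.eye(n, k=-1))


def block_matrix(n, a, b, beta):
    """Diagonal blocks tridiag(b, a, b), off-diagonal blocks beta * I (an n^2 x n^2 matrix)."""
    shift = np.eye(n, k=1) + np.eye(n, k=-1)
    return np.kron(np.eye(n), tridiag(n, a, b)) + np.kron(shift, beta * np.eye(n))


def closed_form_eigenvalues(n, a, b, beta):
    cos_k = np.cos(np.arange(1, n + 1) * np.pi / (n + 1))
    return (a + 2 * b * cos_k[:, None] + 2 * beta * cos_k[None, :]).ravel()


def closed_form_eigenvector(n, k, j):
    """Block m (1-based) equals sin(m j pi / (n + 1)) * X_k, with X_k(m) = sin(m k pi / (n + 1))."""
    m = np.arange(1, n + 1)
    x_k = np.sin(m * k * np.pi / (n + 1))
    y_j = np.sin(m * j * np.pi / (n + 1))
    return np.kron(y_j, x_k)


n, a, b, beta = 6, -0.36, 0.06, 0.12
M = block_matrix(n, a, b, beta)
numeric = np.sort(np.linalg.eigvalsh(M))
print(np.max(np.abs(numeric - np.sort(closed_form_eigenvalues(n, a, b, beta)))))
```

The Kronecker-sum construction makes the structure of the formula visible: the term $2b\cos(k\pi/(n+1))$ comes from the within-level block $A$ and the term $2\beta\cos(j\pi/(n+1))$ from the coupling between levels.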

E Generalities about the state process $U^\mu_t$ and the value function
In this section, we allow the state process $U^\mu_t$ to start from an initial state valued in $\mathcal{U} = \{u \in \mathbb{R}^5_+ \times \mathbb{R}^2 ;\ \sum_{i=1}^{3} u_i > 0,\ u_4 > 0\}$. For simplicity, we also assume that jumps are of size 1. By replacing Assumptions 7 and 8 below with Assumption 2, the results of this section remain valid when the state process takes values in $\mathbb{N}^5 \times \mathbb{R}^2$, although the values of the constants change.
E.1 Regularity of the regenerative process $U^\mu_t$

The regularity of our regenerative process is not trivial. Indeed, consider two processes $U^\mu$ and $U'^\mu$ satisfying the same order book dynamics (see Section 3.1) but starting from two different initial points $u_0$ and $u'_0$. As long as there is no regeneration, for every order flow trajectory, the error $\|U^\mu - U'^\mu\|$ is equal to the initial error $\|u_0 - u'_0\|$. However, when one of the two processes is regenerated before the other, the regenerated one starts a new cycle from a random position, and the error $\|U^\mu - U'^\mu\|$ is no longer bounded by $\|u_0 - u'_0\|$. Hence, the irregularity mainly comes from the regeneration. In our case, since the regeneration law depends on the killing state, it may introduce strong irregularities. Consequently, we need an assumption ensuring that regeneration distributions are similar when exit points are close enough. We give here a result on the regularity of the state process under two general assumptions.
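The coupling argument above can be illustrated with a toy Python experiment, under assumptions that are ours: a single queue with unit jumps, state-independent rates, and a hypothetical regeneration law that draws the new queue size uniformly from $\{2, 3, 4, 5\}$. Two copies started at different sizes are driven by the same order flow, so their gap stays equal to the initial error until the first regeneration, after which it is no longer controlled by it.

```python
import random


def coupled_gaps(q0, q0_prime, n_events, lam_plus, lam_minus,
                 regen_sizes=(2, 3, 4, 5), seed=0):
    """Drive two queues with the SAME events; a queue hitting 0 restarts at a random size.
    Returns (gaps after each event, index of the first event that triggered a regeneration)."""
    rng = random.Random(seed)
    p_plus = lam_plus / (lam_plus + lam_minus)
    q, q_prime = q0, q0_prime
    gaps, first_regen = [], None
    for step in range(n_events):
        move = 1 if rng.random() < p_plus else -1    # common order flow
        q, q_prime = q + move, q_prime + move
        regenerated = False
        if q == 0:
            q, regenerated = rng.choice(regen_sizes), True
        if q_prime == 0:
            q_prime, regenerated = rng.choice(regen_sizes), True
        if regenerated and first_regen is None:
            first_regen = step
        gaps.append(abs(q - q_prime))
    return gaps, first_regen


gaps, first_regen = coupled_gaps(3, 4, n_events=200, lam_plus=0.06, lam_minus=0.12)
print(first_regen, gaps[:10])
```

After `first_regen`, the gap depends on the regeneration draw rather than on the initial error, which is exactly the irregularity that Assumption 7 is designed to control.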
Assumption 7 (Regeneration smoothness). There exist four positive constants $K$, $q_0$, $q_1 \le 1 \vee q_0$ and $\beta$ such that for every $u = (q^{bef}, q^a, q^{aft}, q^2, i, p, p^{exec})$ and $u' = (q'^{bef}, q'^a, q'^{aft}, q'^2, i', p', p'^{exec}) \in \mathcal{U}$, where $q^1 = q^{bef} + q^a + q^{aft}$, $\|p - p'\|_{TV} = \sup_{A \in \mathcal{F}} |p(A) - p'(A)|$ is the total variation norm, $\|\cdot\|_p$ is the $L^p$ norm with $p \ge 1$ and $\tilde d^i_u$ is the regeneration distribution of the process $U^\mu_t$.
Assumption 7 is a Lipschitz-type inequality ensuring that regeneration distributions are close when exit states are close enough. Furthermore, we consider a boundedness assumption and a support constraint to guarantee that regenerated limits have a size larger than a fixed minimum quantity $q_0$. We also add the following assumption.
Assumption 8 (Exit dynamics). For every $u = (q^1, q^2, p) \in \mathbb{R}^2_+ \times \mathbb{R}$ and $i \in \{1, 2\}$, there exists a positive constant $\beta^-$ such that for every $\epsilon > 0$, there exists $q_\epsilon > 0$ such that:

For small queues (i.e., $q^i \le q_\epsilon$), Assumption 8 ensures that depletion intensities are high (i.e., $\lambda^{i,-} \ge 1/\epsilon$) while other intensities are bounded. Such an assumption avoids critical situations where the order book moves far away from an exit state after being very close to it. It is also consistent with empirical evidence, since a limit disappears almost instantaneously once it falls below a given bound $q_\epsilon$. We have the following result, proved in Appendix E.4.
Theorem 5 (Regularity of the state process). Under Assumptions 7 and 8, the process $U^\mu$ depends regularly on its initial state, where $\|\cdot\| = \|\cdot\|_p$ is the $L^p$ norm for $p \ge 1$, $U^\mu_t$ (resp. $U'^\mu_t$) is the Markov process starting from the initial state $U_0$ (resp. $U'_0$), $T$ is the final time, and $K_0$ and $C_0$ are constants defined in Appendix E.4.

E.2 Regularity of the value function
In this section, we fix $p = 1$ and write $\|\cdot\|_p = \|\cdot\|$. Let us assume that the function $g(u) = f(\mathbb{E}_u[\Delta P^\mu_\infty])$ is Lipschitz. When the state process is valued in $\mathbb{N}^5 \times \mathbb{R}^2$, we only need $g$ to be bounded, which always holds under Assumptions 5 and 6. Then, we have the following regularity properties, proved in Appendix E.5.

Proposition 2. The value function V is
• Lipschitz in space: with T the final time, A a constant defined in Appendix E.5 and C 0 defined in Theorem 5.
• Lipschitz in time: with $L_0 = cq^a + Ae^{C_0(T-t)}C$ and $C$ a constant defined in Appendix E.4.

E.3 Execution time inequalities
Here again, we fix $p = 1$ and write $\|\cdot\|_p = \|\cdot\|$. We recall that $T^{t,\mu}_{Exec}$ is defined in Section 4.2 for any control $\mu$. We provide here two execution time inequalities. First, when the agent's decisions are taken at a fixed frequency $\Delta^{-1}$, we have the following inequality.

Proposition 3. Let $U_1$, $U_2$ be two initial states and $\mu^{Opti}_1$ (resp. $\mu^{Opti}_2$) the optimal strategy for the process starting from $U_1$ (resp. $U_2$). Then, we have

Proposition 3 shows that both the initial states and the agent's latency $\Delta$ affect the optimal execution time. We have a second inequality when decisions can be taken at any time.
Proposition 4. Let $U_1$, $U_2$ be two initial states and $\mu^{Opti}_1$ (resp. $\mu^{Opti}_2$) the optimal strategy for the process starting from $U_1$ (resp. $U_2$). Then, we have

with $K'_0 = \log(K_0)$. The constant $K_0$ is given in Theorem 5.
Proofs of Propositions 3 and 4 are given in Appendix E.6.
Proof of Inequality (19): We assume that $\tau_1 > \tau'_1$ a.s.; the general case follows the same lines of argument. We have
In the second inequality, we use when the best bid (resp. ask) is totally depleted). Finally, we denote by $\mathcal{M}$ the set of Borel functions on $X$ taking values in $[-1, 1]$. In this case, we have
In (*), we used the total variation norm property $2\|\mu - \nu\|_{TV} = \sup_{h \in \mathcal{M}} \left| \int h\, d\mu - \int h\, d\nu \right|$. By combining Inequalities (20) and (21), we get the result.
Proof of Lemma 2: First, we assume that $\tau_1 > \tau'_1$ a.s. and consider the following notation. For every state $u$, we write $u^{2,+}$ (resp. $u^{2,-}$) for the new state of the order book when a quantity $q = 1$ is added to (resp. cancelled from) $Q^2$. The same reasoning holds for $Q^1$. We have
where $\tilde\tau_1 = \tau_1 - \tau'_1$ is also the first regeneration time of $U^\mu_t$, but starting from the initial point $U^\mu_{\tau'_1}$. The case $Q^{2,\mu}_{\tau_1^-} \le q_\epsilon$ is solved using the same arguments. Since the error is unchanged before the first regeneration, we have $Q^{1,\mu}_{\tau_1} \le \|U^\mu_0 - U'^\mu_0\| \le q_\epsilon$. We write $u_1 = U^\mu_{\tau_1}$. By considering the possible transitions of the process $U^\mu$, we have
with $\lambda^* = \sum_i \lambda^{i,+} + \lambda^{i,-}$. Using Assumption 8 and $h(u) \le T$, for every initial state $u$, we have
This proves the result.
Proof of Lemma 3: By following the same methodology as for Lemma 2, we first note that This proves the result.

E.5 Regularity of the value function
Proof of Inequality (15): We write $U^{1,\mu}_t$ (resp. $U^{2,\mu}_t$) for the process such that $U^{1,\mu}_t = U_1$ (resp. $U^{2,\mu}_t = U_2$). Using the fact that $g$ is Lipschitz and Inequality (14), we have
where $C' = g_{[Lip]}C + cq^a$, $C$ is a constant, $K'_0 = \log(K_0)$, $\mu^{Opti}_1$ (resp. $\mu^{Opti}_2$) is the optimal control when $U_1$ (resp. $U_2$) is the starting point, and $A = g_{[Lip]}K_0 + K_0C$. In the penultimate inequality, we use Inequality (17) to complete the proof.

E.6 Proofs of Propositions 3 and 4
Proof of Proposition 3: We fix $\Delta > 0$ and prove the result by recurrence on $n \ge 0$ for every $T \in [0, n\Delta]$. Assume that the result holds for $n$. When $T \in [0, n\Delta]$, the result is true by the recurrence assumption. When $T \in (n\Delta, (n+1)\Delta]$, we can write
Let $\tilde U^{\tilde\mu}$ be the process following the same dynamics as $U^\mu$, but with initial value $U^\mu_\Delta$, ending at $T - \Delta$, and with control $\tilde\mu_t = \mu_{t+\Delta}$. Then, classically, $T^{\tilde\mu}_{Exec} = T^{\mu,\Delta}_{Exec} - \Delta$ and $V(\Delta, u) = V_{T-\Delta}(0, u)$. Thus, we can write
- For Part (2), using the recurrence assumption and Inequality (14), we have
- For Part (3), using the same arguments as for Part (2), we have
Finally, the result follows since $C_1 = \log(K_0)$.

Proof of Proposition 4: Using Proposition 3, we have

F Resolution of the optimal control problem

F.1 Proof of Theorem 2

First, let us assume that the time derivative $\partial_t V$ is continuous in each sub-interval $(k\Delta, (k+1)\Delta)$. Then, we can show classically that $V$ satisfies the equations of Theorem 2 by applying Itô's formula. Thus, it suffices to exhibit a solution and use a verification argument to conclude. Let us exhibit a solution $V$ by solving the equations of Theorem 2 step by step.
• Step 1 - Initialisation: Since we know the value of $V$ at time $T$ and $V$ satisfies
where $k_1 = \lfloor T/\Delta \rfloor$, $k_1\Delta$ is the first decision time, and the vector $\tilde g$ encodes the execution time effect. Indeed, for every $\tilde q = (q^{bef}, q^a, q^{aft}, q^2, i) \in \mathbb{N}^5$, $q^1 = q^{bef} + q^a + q^{aft}$, $u = (q^1, q^2, p)$, $z = (p, p^{exec}) \in \mathbb{R}^2$, $\tilde q' = (q'^{bef}, q'^a, q'^{aft}, q'^2, i') \in \mathbb{N}^5$, $q'^1 = q'^{bef} + q'^a + q'^{aft}$, $u' = (q'^1, q'^2, p')$ and $z' = (p', p'^{exec}) \in \mathbb{R}^2$, we have $\tilde g = \sum_{n \ge 0} \tilde g_n$, where $\tilde g_n$ is defined as follows:
- When $i = 0$ and $q^{bef} + q^a \le n < q^1$: $\tilde g_n(\tilde q, z) = \lambda^{1,-}_m(u, n)\, g(\tilde q', z')$, with $z' = z$ and $\tilde q'$ such that $q'^{bef} = 0$, $q'^a = 0$, $q'^{aft} = q^1 - n$, $q'^2 = q^2$ and $i' = 0$.
with $z'$ and $\tilde q'$ such that $q'^{bef} = q'^1$, $q'^a = 0$, $q'^{aft} = 0$, $i' = 0$ and $p'^{exec} = p^{exec} + q^a(p - \psi^2)$, where $q'^1$, $q'^2$ and $p'$ are fixed by the regeneration distribution.
- In the remaining cases, we have $\tilde g_n(\tilde q, z) = 0$.
We know the solution of (23) explicitly, where $g$ is the vector such that $g_i = g(U_i)$ and $k_1 = \lfloor T/\Delta \rfloor$.
• Step 2 - Iteration: At time $k_1\Delta$, the agent can take a decision. He compares the expressions of Equation (3) and takes the maximum. When the optimal control is to send a market order, the agent stops the execution; otherwise, he reiterates Step 1 with the new initial values.
Since the exhibited solution satisfies the required regularity of ∂ t V , we conclude with a verification theorem as in Theorem 4.1 in [29].
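The two-step construction above (propagate between decision times, then take the maximum at each decision time) is a backward induction. The Python sketch below shows that recursion on a generic finite-state chain under strong simplifications that are ours: no waiting cost, a single "stop" alternative (standing in for the market order) with a known payoff vector, and "stay" meaning that the chain evolves freely with generator $Q$ over one period, so that the continuation value is $e^{Q\Delta}V$. The matrix exponential is computed by uniformization, and all names and numerical values are hypothetical.

```python
import numpy as np


def expm_uniformization(Q, t, tol=1e-14, max_terms=10_000):
    """exp(Q t) for a generator Q, via uniformization."""
    n = Q.shape[0]
    lam = max(-Q.diagonal().min(), 1e-12)
    jump = np.eye(n) + Q / lam
    weight, term = np.exp(-lam * t), np.eye(n)
    result, mass, k = weight * term, weight, 0
    while 1.0 - mass > tol and k < max_terms:
        k += 1
        term = term @ jump
        weight *= lam * t / k
        result += weight * term
        mass += weight
    return result


def backward_induction(Q, delta, n_steps, terminal, stop_value):
    """Decisions at times k * delta, k = 0, ..., n_steps - 1; terminal payoff at n_steps * delta.
    Returns V at time 0 and, for each decision time, a boolean array 'stop is optimal'."""
    P = expm_uniformization(Q, delta)
    V = np.asarray(terminal, dtype=float)
    stop_value = np.asarray(stop_value, dtype=float)
    decisions = []
    for _ in range(n_steps):
        continuation = P @ V                      # value of staying one more period
        stop = stop_value > continuation
        V = np.where(stop, stop_value, continuation)
        decisions.append(stop)
    return V, decisions[::-1]                     # decisions[k] refers to time k * delta


Q = np.array([[-0.3, 0.2, 0.1],
              [0.1, -0.2, 0.1],
              [0.0, 0.4, -0.4]])
V0, decisions = backward_induction(Q, delta=1.0, n_steps=10,
                                   terminal=[1.0, 0.0, -1.0], stop_value=[0.2, 0.1, 0.3])
print(V0, decisions[0])
```

In the paper, the continuation step also accounts for executions of the limit order and for the waiting cost (through the vector $\tilde g$), and the agent chooses among several actions rather than a single stop/continue alternative.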
F.2 Proof of Theorem 3

Here, $g$ is a Lipschitz function representing the final constraint. We denote by Q max = max(Q max , Q max ) and Ĩ max = P max Q max . The equations satisfied by $V$ can be formally derived by assuming that $V$ is smooth and using the dynamic programming principle, with $Qf(u) = \int \big(f(u') - f(u)\big)\, dQ(u'; u)$ the infinitesimal generator of the state process and $K^r f(u) = \int f(u')\, dk^r(u'; u)$ for every continuous and bounded function $f$, state $u$ and control $r \in \{l, c\}$. Since a control $r$ may lead to several states, we write $k^r(u'; u)$ for the probability of reaching the state $u'$ starting from $u$ after taking the decision $r$.
Existence, uniqueness of the solution: Uniqueness of the solution comes from a standard comparison principle using the same arguments as in [8,Theorem 2.2]. Existence of the solution can also be derived following [8,Theorem 2.3].
Regularity of the solution: Let us show that $\partial_t V$ is continuous except on the boundary of $\{V = g\}$. We denote by $V$ the continuous and Lipschitz viscosity solution of (24). Let $r$ be the control that modifies the agent's state, when it exists. Let $O$ be the open set $O = \{V > \max(K^r V, g)\} \cup \{(t, u);\ V > g,\ k^r(u; u) = 1\}$. On $O$, we have $\partial_t V = -QV$ in the viscosity sense. Hence, by considering a sequence of smooth functions converging uniformly towards $V$, $\partial_t V$ is continuous on $O$; see [9, Corollary 5.6] for a similar construction.
Let $O_1 = \{K^r V = V,\ V > g\}$ and let $\mathring{O}_1$ be its interior, assumed non-empty (otherwise there is nothing to prove). Since $V$ is Lipschitz, $\partial_t V$ is essentially bounded. To show that $\partial_t V$ is uniquely defined on $\mathring{O}_1$, we assume the opposite and consider a point $x_0 = (t_0, u_0)$ where $\partial_t V$ admits two possible values. We have $V(x_0) = K^r V(x_0) = \int V(t_0, u')\, dk^r(u'; u_0)$.
Since the sum in the above equation is finite, we can apply the same arguments as before several times to find that the null function is not uniquely defined, which provides the needed contradiction. Hence, $\partial_t V$ is uniquely defined on $\mathring{O}_1$. Furthermore, since $\partial_t V$ is continuous on $O$, we can prove by contradiction, using the same arguments, that $\partial_t V$ is continuous on $\mathring{O}_1$ and thus on $\bar O_1$.
Let $O_2 = \bar O \cap \bar O_1$, where $\bar O_1$ is the closure of $O_1$, and let $x$ be a point of $O_2$. Thus, $x$ is the limit of sequences $(x_n)_{n \ge 0}$ and $(x^1_n)_{n \ge 0}$ such that $x_n \in O$ and $x^1_n \in O_1$. Let $l$ (resp. $l_1$) be the limit $\lim_{n \to \infty} \partial_t V(x_n)$ (resp. $\lim_{n \to \infty} \partial_t V(x^1_n)$). Hence, we can check
Thus, $\partial_t V$ is continuous on $O_3 = \bar O \cup \bar O_1$. On the set $O_4 = \{V = g\}$, $\partial_t V$ is clearly continuous since $\partial_t V = 0$. Finally, we consider the set $O_5 = \partial O_4$ and a point $x$ of $O_5$. Here again, $x$ is the limit of sequences $(x_n)_{n \ge 0}$ and $(x^1_n)_{n \ge 0}$ such that $x_n \in O_3$ and $x^1_n \in O_4$. Let $l$ (resp. $l_1$) be the limit $\lim_{n \to \infty} \partial_t V(x_n)$ (resp. $\lim_{n \to \infty} \partial_t V(x^1_n)$). Thus, we have
This relation is not necessarily satisfied. Equation (24) is satisfied almost everywhere by $V$. Since $\partial_t V$ is continuous except on the set $O_5 = \partial\{V = g\}$, Equation (24) is satisfied pointwise except on $O_5$.

G Proof of Theorem 4

G.1 Proof of Inequality (9)
Let us fix $\Delta$ and show the result by recurrence on $n$ for every $T \in [0, n\Delta]$. Initialisation: in this case, we have $\tilde V^{\Delta} = V = g$. Iteration: let us assume the result true for $n$, and let $T \in [0, (n + 1)\Delta)$.
• When T ∈ [0, n∆]: the result is true using the recurrence assumption.

G.2 Proof of Equation (10)
Let $\mu^{Opti,\Delta}$ be the piecewise constant optimal control associated to the process $\tilde U^{\mu,\Delta}_t$. We say that a sequence of functions $f_n$ converges to $f$ in a stationary way when there exists $n_0$ such that $f_n = f$ for all $n \ge n_0$.