A simple learning agent interacting with an agent-based market model

We consider the learning dynamics of a single reinforcement learning optimal execution trading agent when it interacts with an event driven agent-based financial market model. Trading takes place asynchronously through a matching engine in event time. The optimal execution agent is considered at different levels of initial order-sizes and differently sized state spaces. The resulting impact on the agent-based model and market are considered using a calibration approach that explores changes in the empirical stylised facts and price impact curves. Convergence, volume trajectory and action trace plots are used to visualise the learning dynamics. Here the smaller state space agents had the number of states they visited converge much faster than the larger state space agents, and they were able to start learning to trade intuitively using the spread and volume states. We find that the moments of the model are robust to the impact of the learning agents except for the Hurst exponent, which was lowered by the introduction of strategic order-splitting. The introduction of the learning agent preserves the shape of the price impact curves but can reduce the trade-sign auto-correlations when their trading volumes increase.


Agent-based models
Much of the empirical economic and finance literature suffers from a multiple assumption problem with data-sets that are often static samples of the system that do not include the impact of system feedback. Often the key assumptions used to specify agents, and hence models, are that they are rational 1 and fully informed [17]. This introduced two key implementation problems for models: the problem of the size of the state space, and then, the problem of state space dynamics.
The state space can then become so large that the agents can not reasonably behave as fully informed and rational decision makers. Artificially forcing bounded rationality onto the agents [43] can become an ad hoc assumption that is introduced in order to retain some form of rationality in decision making to accommodate interactions with a static environment. In financial markets, and many social systems, the state space can change due to adaption, interaction, and feedback's between the agents and the changing environment they are creating and competing within. This implies that even with bounded rationality the data-sets extracted from historic data, or bootstrapped from historic data, are unlikely to be representative of either the current system configuration or future outcomes.
Adopting a bottom-up rule based approach using simple online updating representations of reasonable trading constraints can be an effective alternative [21]. Although these types of agent-based models explicitly exclude topdown causation [70] from higher complexity structures, that is feedback's from higher levels in a given systems hierarchy of complexity, they can still include the impact of a changing environment. Such Agent-Based Models (ABMs) provide a bottom-up approach to the modelling of a complex system where there can be multiple online heterogeneous interacting agents in a single level but changing environment. Zero-intelligence, or minimally intelligent, agents with simple trading rules can then be used in this environment to generate reasonably realistic dynamics, and these can be calibrated to observations of the system. Financial ABMs that model the limit order book attribute the dynamics of the Limit Order Book (LOB) to the emergent phenomena caused by the interactions of trading agents programmed by these simple rules. This is an attractive method for modelling the dynamics of the LOB because there has been an increase in the amount of high-frequency observational data available to the researcher and this data is a direct measurement of how markets are unfolding through time. This level of detail can provide a picture of the behaviour of individual agents, which can then guide the development of new ABMs while providing methods for validating the models [43,50,59]. Agent-based models are computationally expensive. This introduces a key trade-off between the computational cost of generating more sample paths against the additional statistical accuracy [59]. The most common method for validating an ABM is the empirical vs. simulation comparison of stylised facts that are present in the generated time series e.g. fat tails in the return distributions, the presence of volatility clustering, the leverage effect, and an appropriate decay in the autocorrelations of price changes. These can be found in Table 7. This is computationally convenient and is essentially a form of exploratory data analysis. However, this approach has been criticised in the literature in the context of intraday agentbased models.
Platt and Gebbie [60] considered an ABM that can recover several well-known stylised facts but demonstrates parameter degeneracies and calibration inconsistencies. It was then argued that these arise both, because of how markets clear, in real markets this is event based, as opposed to the batch market clearing mechanisms used in many financial agent based models, as well as due to the problem that many parameters in agent-based models are related to each other and share coupled impacts in the observed stylised facts. To guard against this they recommend validating an ABM by having the generated time series reproduce the statistical properties of empirically measured transaction data through a combination of sensitivity analysis, calibration and stylised-fact verification [60] while accepting a-priori some level of parameter degeneracy. Given this, the empirical-versus-simulation stylised fact comparison remains a good way to test the validity of an ABM of the limit-order book [20,28,59,60,61].
ABMs are also able to shed light on the cause of certain stylised facts. In financial time series, it is well known that the distribution of returns has fat-tails. Gaunersdorfer and Hommes [24] identified that trend-following trading strategies are likely the potential cause of these fat-tails. They show that when fundamental trading strategies (e.g. mean reversion) start to dominate, the returns are small because the market is tracking the fundamental value, but as trend followers start to dominate, due to the increased potential returns that this strategy has to offer, large price changes occur, which causes large returns and increases volatility. They also show that volatility clustering, the tendency of markets to exhibit periods of small returns followed by periods of large returns, is also potentially due to the trendfollowers.
When building ABMs a good place to start is to first try to understand the dynamics that can be realised from the interactions of minimally intelligent agents before adding on more complex forms of decision making. Farmer et al. [21] used minimally intelligent agents within a continuous double auction and were able to explain 96% of the variance of the spread and 76% of the variance of the price diffusion rate. They were also able to recover the price impact function and so demonstrate the simple laws relating price to order flows. This work also suggests institutions strongly shape our behaviour, so that some of the properties of markets may be a result of the constraints imposed by these institutions rather than the rationality of the individuals [21].
Many agent-based models have some form of batch clearing [61], where either batches of market orders are executed against the limit-order book at preset calendar times, or when rolling auctions are used. This can introduce spurious correlations between parameters and elements of the model as the system is synchronised by an external feature that is not present in real financial markets. This means that realistic models should clear markets in event time so that the sampling time of events is emergent from agent interactions and not externally imposed. From a modelling perspective, one approach is to directly use a matching engine to avoid batch market clearing. Any form of batch market clearing is introducing an averaging process that can introduce spurious correlations.
When market clearing takes place in event time rather than at preset times the model needs to be able to update based on each individual event, and to be able to react to each individual event in a manner that is asynchronous for all the other agent realisations. In an open system, using preset times, either randomly or uniformly sampled, is incompatible with an event driven process. If the agentbased model is the only model framework interacting via a given matching engine then all the possible events will be generated by the given agent based model. The model is a closed system. This is the hybrid framework [37]. Here preset times can be used to streamline the computations. However, in an open system without batch market clearing, where new agents can be introduced, the system does not know a-priori when events will occur as they arise from interactions. This requires the agents to be implemented in a reactive framework if order matching is to be realistic and event driven.
The ABM framework of Jericevich et al. [37] is used in this work for three reasons. First, the model was able to recover several conventional stylised facts, including persistent auto-correlation in the trade signs, fat-tail return distributions, and volatility clustering. Secondly, the model was able to recover a realistic price impact function, in particular the expected power-law relationship between the log difference in the price changes and the size of the normalised volumes. Finally, and most importantly, the trades were simulated in event time, and not in the usual calendar time. This is the key issue here, as the reactive model will also be simulated in event time so that additional agents can be included via direct interactions through the matching engine [39].

Time and reactive implementations
Typically in ABMs events, such as order submissions, are generated in either calendar time or in stochastic time. In both of these cases, the time stamp dictates when an event will occur. In the case of calendar time, the interarrival times are usually constant and measured by some external clock e.g. in minutes, seconds, milliseconds and so on. Whereas in stochastic time the inter-arrival times are stochastic and will usually be drawn from some distribution, but the times are still measured in calendar time.
In real financial markets at the small time scales, typically less than a day, events are generated based on the change of the system's state caused by the arrival of other events [11]. For example, if a large limit order is placed at the best offer, this may trigger buy market orders that want to take advantage of the lower expected price impact. This is event time [11]; this is how events in this model are generated. This also implies that the measured distributions of different inter-arrival times are emergent properties of the model that can be measured, and are not inputs.
Here we extend the hybrid ABM [37] to an explicitly event time implementation, and because the original hybrid ABM was created using stochastic time to approximate and test the event driven framework, some modifications to the rules are necessary. All agents should not submit orders in each event loop because this would cause the model to collapse into an approximate calendar time model; the generation of events would then depend on the event loop, and not on the state of the LOB. This would synchronise the agents' trading decisions. There need to be states of the LOB where agents do not trade, this is similar to having agent activation rules. This requires that the model representation is able to manipulate events streams, and agents react by these events. Reactive programming is one such approach.
Reactive programming can be thought of as a method of solving problems by manipulating or reacting to events in data streams generated by asynchronous observers using functional programming. Observables are the emitters of events in a data stream, and they can asynchronously emit events. There are a variety of reactive software implementations; here we use Rocket.jl 2 where actors subscribe to an observable, and when a new event occurs in the data stream the observable emits this event, and the actors consume this data and can act based on that event.
In an event-driven ABM (e-ABM) the observable emits the current state of the LOB, so the data stream consists of the states of the LOB through time. When the LOB is updated the observable emits this as an event and the actors (here represented as minimal intelligence agents) make trades based on the agent specifications as defined in Section 2.
Using reactive programming in this manner adds an element of realism to the simulations because in real financial markets changes to the LOB are events that occur asynchronously and the agents react to these events to make trades -it is not merely that time is non-uniformly sampled so that agents are non-uniformly or randomly activated in some calendar time. Reactive market models can be contrasted with the use of sequential market clearing, batched market clearing, or asynchronous stochastic time models which attempt to approximate the event time reactive nature of real trading by externally timing and triggering trading and market clearing. The e-ABM implemented in this work explicitly uses an event-based reactive programming approach to capture and model this behaviour. This in turn requires a careful modularisation of the software components built around a decoupled implementation of the matching engine [39].
A reactive ABM lends itself to extending current ABM models in two ways, namely: a fundamentally different model of time, and a fully decentralised control logic [14].
In most of the current research, the ABMs that are used to model financial markets use a synchronous model of time -that is, the model progresses through a sequence of global iterations where the agents are modelled as functions that are called in each of these iterations [14]. In many of these market models, the market is cleared using sequences of market clearing events that batch orders together, and then execute these as batches against the order book, or each other. Crafa [14] argues that this is not conducive to modelling financial agents as agents may respond to the same event at different times. For example, one trader may be able to respond quickly to a large increase in the price of a particular stock price while another may have to unwind other positions to participate in the sudden increase. In a reactive system, the agents can operate in parallel and update their independent state based on the state of the system [14]. They are also allowed to act whenever they deem it most appropriate.
Gilles [26] demonstrated that the dynamics of the ABM can change depending on whether the agents trade synchronously or asynchronously 3 . Aloud et al. [2] built an ABM of the foreign exchange (FX) market using agents that trade asynchronously, and was able to closely approximate the stylised facts of the market. This adds an important additional level of realism, but the market clearing mechanism still operated synchronously as at each time step all the agents' market orders are executed sequentially and all the limit orders are cleared. Here we decouple the matching engine and the model management system to ensure that the agents are simulated in a continuous double auction with no synchronous market clearing. This has the added benefit of allowing us to add additional agent classes that can interact with existing agents.
Here we can now consider our key question: "How can learning agent interactions change a minimally intelligent financial agent-based model?". In order to include an agent that has bounded rationality, but is strategic and adaptive, we need a learning framework that is event driven, trade by trade, but where there is an awareness of the state of the market, and the agent has an objective that provides a trade-off between the cost of execution, and the risk of future adverse market conditions. A simple starting point is a single optimal execution agent that will try to take sell (or buy) actions for an order in the simulated market by learning how to break the order up based on the state of the market and the feedback via the agent's objective. This can be thought of as a reinforcement learning problem [3,1,32].

Reinforcement learning
In a simple case of Reinforcement Learning (RL) an agent interacts with an environment at discrete time steps. At each time step, the agent observes the current state of the system and chooses an action out of a set of finite actions. One time step later the agent receives the reward for taking that particular action in that state and it also observes the new state of the environment. The goal of reinforcement learning is to find the mapping between states and actions to maximise the long-term reward of the agent. This framing of the reinforcement learning problem considers RL to be solving a Markov Decision Problem (MDP) [3]. RL problems are not restricted to solving discretetime finite state-action space MDPs. However, this is an ideal framework [3] for the simulation of simple RL algorithms. This provides a simple baseline foundation for comparison with future extensions, that is computationally inexpensive, but can accommodate a dynamic online environment.
Dynamic programming, as developed by Bellman [4], can also be used as a way of solving for the optimal mapping between states and actions i.e. to find an optimal policy. However, using dynamic programming can become impractical for solving real-world problems in two important ways: First, it assumes complete knowledge of the environment, which includes knowledge of the underlying probability transitions from one state to another given some action. Second, it suffers from the curse of dimensionality as the state and action spaces increase [65]. The advantages of using RL over dynamic programming are thus twofold: Firstly, the RL agents do not need full knowledge of the environment to learn an optimal policy [1]. They can learn the optimal policy by interacting with the environment and recording the rewards for taking actions in certain states. Secondly, to increase the scalability of the learning process the agent can use function approximations to represent their knowledge [1]. The RL algorithm that will be implemented in this paper is the Q-learning algorithm given by [69]. These simple RL agents are used to solve the optimal execution problem in order to represent a single learning agent with bounded rationality.
The optimal execution problem is defined as the problem of finding a way to trade a block of shares through time at the lowest possible cost [1,6,57,10]. Much of the existing research solves this problem using the tools of stochastic dynamic programming which assumes that the prices will follow some combination of random walks in discrete time, or Brownian motions in continuous time, to approximate the data-generating processes. Data driven representations of the data-generating processes have become popular due to the availability of market micro-structure data and improved computational resources. This has in turn made it possible to consider learning algorithms. Nevmyvaka et al. [54] provided one of the first large-scale empirical studies that used RL and static micro-structure data to learn the optimal trading strategy. They argue that RL is potentially well suited to the optimal execution problem as it significantly reduced the cost (up to 50%) over an existing simple strategy. However, Nevmyvaka et al. [54] compared the RL algorithm to a simple submit-and-leave strategy. This is not considered a meaningful comparison for determining the effectiveness of an optimal execution algorithm given that the seminal benchmark policy solution for optimal execution is based on the Almgren and Chriss optimal execution model [1]. Almgren and Chriss [1] provide trading trajectories that optimally minimise implementation shortfall [58] in the presence of both temporary and permanent linear price impact. Which by construction can beat simple strategies such as submit-and-leave without the added complexity of learning. The Almgren and Chriss policies optimally include solutions between the extremes of Time-Weighted-Average-Price (TWAP) and Immediate Execution (IE).
Hendricks and Wilcox [32] used tabular Q-learning to extend the Almgren and Chriss model [1] to make a more meaningful comparison and were able to reduce Perold's implementation shortfall [58] by 10.3% when compared with the benchmark model. This demonstrated the ability of RL to enhance the Almgren and Chriss model and thus improve on the balance between market impact, order size, and execution timing in the presence of a static data 4 driven perspective of the market. This was achieved by first computing the volume trajectory of the analytical model using linear price impact. Then at each time step, the RL model was used to determine the percentage of the Almgren and Chiss volume to trade. The idea is that by using micro-structure data the RL algorithm will be able to spot favourable and unfavourable trading conditions and dynamically adjust the Almgren and Chriss trading trajectory. They adapted a trading strategy with respect to the prevailing spread and volume to execute a sequence of optimised market orders. Hendricks [31] extended this to include a dynamic perspective where market states could be discovered using unsupervised learning as combined within a learning algorithm. This approach used clustering for state discovery as coupled to learning and could adapt as the market changed and evolved, and could be applied dynamically to streaming market data.
The RL agent implemented by Hendricks and Wilcox [32] was very simple. The agent only used time, inventory, spread and volume information to learn how to modify the Almgren and Chiss volume trajectory, but they were still able to demonstrate improvements in the profits obtained above that of the AC volume trajectory. The simplicity of the agent is important and for this reason, it will be the inspiration for our RL agent.
The training of the RL agent on market data was done in an offline manner, with the assumption that the order book is resilient. Meaning that the trading of the RL agent does not affect future prices, and the other agents in the market can't react to its trading. This is a necessary but strong constraint because not allowing other agents to react to the RL agent's trading may increase the observed profits above what will be realised in the market. We aim to give some insight into the effects of training a similar RL agent in an online manner.
This can be contrasted with a quite different approach, that of combining Deep Learning (DL) with RL. Given that the policy, value functions, and states are all functions, function approximation (FA) such as DL (or even linear models) can be used to learn them from experience. Moreover, when states or actions are continuous FA can be a useful method to extend the tabular methods to deal with a continuous setting. Additionally, if there are too many states in the discrete setting FA is used to give an approximate mapping. Deep RL is the method of using Artificial Neural Networks (ANNs) to perform these function approximations. Using DL to approximate the state-action values solves the problem of tabular Qlearning not being able to interpolate between state-action values [55]. However, it does not necessarily solve the problem of dynamic state discovery. Nevertheless, Deep RL can be applied to solving an optimal execution problem on static data. Ning et al. [55] compared a Deep RL agent to a Time-Weighted-Average-Price (TWAP) strategy to show that for seven out of nine stocks the Deep RL agent was able to produce lower costs. However, this does not demonstrate an ability to beat the canonical Almgren and Chriss policy trajectories, nor were they able to show that Deep RL methods can generalise Out-of-Sample (OOS) in real financial markets. It remains to be shown that Deep RL types methods can add value in real financial markets. Nevertheless, simple RL agents can be used as competitive proxies for realistic optimal execution agents that demonstrate bounded rationality in the simulation setting, and they can be naturally implemented within a reactive eventdrive software paradigm.

Event driven Agent-Based Model (e-ABM)
In this model, the liquidity takers (LTs) submit only market orders and the liquidity providers (LPs) submit only limit orders. To solve the activation problem for the liquidity takers we add a spread correction, ± 1 2 s t , to the fundamentalists and the chartists rules -see Section 2.1.1. The liquidity takers would then act in event time with actions that are asynchronous because they can now trade, hold, or do nothing as activated by the spread correction. In the case of high-frequency agents, the model's liquidity providers, a simplifying assumption is made to allow them to trade in each iteration of the event loop. Concretely, this is because the LPs are modelling HF market makers, and we can assume that as the LOB is updated there will be an HF trader attempting to use their latency advantage to exploit these updates to profit from the spread via round-trip trades -either taking advantage of the strategic or competition effect. The volume of the orders is sampled from a power-law distribution density function [37]: and is zero otherwise, where the parameters x m and α are determined based on the state of the order book for both the liquidity takers (Section 2.1.1) and the liquidity providers (Section 2.1.2).

Liquidity takers
There are two types of liquidity takers: First, there are the fundamentalists who trade based on a perceived fundamental value of the asset. Second, there are the chartists, or trend followers who trade based on the current trend of the asset price. There are N f LT fundamentalists and N c LT chartists that make up the population of liquidity takers, The direction and volumes of their market orders are determined by simple mechanistic decision rules. For the fundamentalists: where m k and s k are the mid-price and spread at event k, and f i is the fundamental price for the i th fundamental agent. Here the fundamental price is f i = m 0 e xi where x i ∼ N (0, σ 2 f ) and m 0 is the initial mid-price and σ 2 f is the variance of the normal distribution that generates the lognormal distribution. Each fundamentalist's fundamental price is determined at the start of the simulation and is kept constant throughout.
The spread correction, − 1 2 s k for sells and + 1 2 s k for buys, modifies the fundamentalists' trading rule in [37] to ensure that the fundamentalists never buy (sell) a stock for more (less) than their fundamental value. It also ensures that the fundamentalists are trading asynchronously. For the chartist: wherem i is the exponentially weighted moving average (EWMA) of the mid-price computed by the i th chartist agent. A similar spread correction was implemented for the chartists. The moving average is computed using: The forgetting factor, λ i , is not updated using the same method used by [37] because in event time the chartists can converge quickly, if they use the same moving average. The forgetting factor will then often not be able to update fast enough and can become consistently close to one. This will then cause the chartist agents to track the mid-price too closely to generate any trades. In this model the forgetting factor is sampled, for each agent, at the beginning of the simulation from the Uniform distribution, λ i = U (λ min , λ max ). Here, sampling from the uniform distribution gives us that each agent has an equal probability of having a long or short trading horizon. As with the fundamentalists the rule from [37] has been modified. The rationale for this rule is to ensure that the chartists' rules are not always activated, allowing for asynchronous trading, and greater heterogeneity in the chartist agents.
Once the liquidity takers have determined the direction of the market order the next step is to determine the volume of the order. The minimum volume, x m , of the volume distribution, is determined based on how far the fundamental value, or the moving average, is from the mid-price relative to a threshold, δ, given as a percentage of the current mid-price. For fundamentalists this is: For chartistis this is: This increases the aggression of the agents when the state of the system presents a favourable trading opportunity relative to their fundamental value or trading strategy [37]. The second parameter, α, of the volume distribution is determined with respect to the order imbalance, ρ k , and an additional model parameter ν: α LT = 1 ± ρ k /ν with + (-) for buy (sell) MOs. The order-imbalance ρ k is given by: where v a k and v b k is the volume of the asks and the bids respectively. The parameter ν > 1 ensures that α is never equal to 0. Specifying α in this way achieves two effects [37]: first, if the contra side of the order book is thicker then the agent will submit larger orders to take advantage of the extra liquidity because it will likely reduce the price impact of a large order. Second, if the same side is thicker then the orders will be smaller to reduce the price impact.

Liquidity providers
Each LP agent acts on every LOB modification in every event loop 5 . Thus an event loop can naively be thought to batch sets of limit order events, but for each agent, the orders are independently submitted to and matched by the matching engine. This is for three reasons. First, it ensures that the simulation does not result in a state 5 Sections Appendix A and Appendix A describe the collection of message updates as well as the event-loop respectively.
where no trade occurs because of the lack of liquidity provision. This would cause the simulation to prematurely halt, this would in turn be problematic when calibrating the model 6 . Second, the LP agents, in collective, are simulating high-frequency market-makers that are profiting from the spread, so it is natural that they want to submit as many orders, as quickly as possible, in the hope of generating lots of small profits, that are able to cover and exceed the large trading costs incurred from a high trading frequency. Moreover, given enough high-frequency agents in the market it is reasonable to assume at any moment limit orders are being placed based on the state of the LOB. Third, they ensure that the order book is reasonably stable throughout the simulation and reduces the chances of liquidity crashes.
During each event loop the LPs place an ask with a probability of θ, and a bid with probability of 1 − θ, where θ = 1 2 (ρ k + 1). This setup has LPs provide liquidity, on average, to the side that has the least liquidity [37]. After determining the direction of the order, the price placement and the volume have to be determined.
The price placement of the limit order is given by: where p a k and p b k is the best offer and best bid prices at event k respectively. The threshold parameter η is sampled from a Gamma distribution: for the current spread, s k . Here β is set using the order imbalance ρ k : β = e ±ρ k /κ with + (-) for bids (offers) where κ > 0 is a model parameter. The mean value for η at event k isη = s k e ±ρ k /κ with + (-) for bids (offers). This price placement setup achieves several effects [37]. First, when the order book is relatively balanced the orders are placed close to the best bid and offer prices. Second, when the order book is thicker on the contra side the orders are placed further away from the best bid and offer prices, capturing the strategic effect [49]. Third, when the same side of the order book is thicker we have that the orders are placed closer to the best bid and offer prices, capturing the competition effect [49]. The larger (resp. smaller) the value of the model parameter κ the lower (resp. higher) the intensity of the strategic and competition effects. The volume of the limit orders is sampled from the same power-law distribution as the market orders but with different parameter values. The minimum order volume was reduced by half when compared with [37]; which gives us that x m = 5. Using x m = 10 in the event-based model 6 Excessive premature halting causes some simulations to have very short trading sessions, and hence insufficient data to reliably estimate the simulated moments. In calendar time one can externally fix simulation times to generate enough trades for reliable estimation.
with the LPs trading in every event loop leads to the situation where too many orders can have order sizes of greater than 1 × 10 6 . This can saturate the order book. These orders can then become so large and frequent that the LTs cannot break through the best bid/offer orders, and the price would then not move for long periods of calendar time. By reducing the x m to at least 5 we reduced the frequency of these large orders. To ensure that there is enough liquidity an increase in the ratio of the number of LPs to the number of LTs is needed, but this ensures that there are more small orders rather than more big orders that saturate the best bid/offer. The shape parameter for the volume distribution for the liquidity providers is: α LP = 1 ± ρ k /ν with +(-) for bids (asks). Once again this achieves two effects. Firstly, when the order book is thicker on the contra side the volume will be larger. Secondly, when the order book is thicker on the same side the volume will be smaller. The rationale behind these effects is that it will produce a more stable order book [37].
Frequently cancelled orders are a feature of high-frequency traders and this ABM will model this behaviour using the method from [37], where orders are cancelled if they are not completely filled within a certain amount of time. The parameter controlling the number of milliseconds before an order is cancelled is φ.

Parameters
A full list of all the parameters in the model, as well as their descriptions, are shown in Table 1 where details on whether the parameters were kept fixed or were allowed to vary in the calibration are provided. If the variables were fixed during calibration then their value is also shown. For a detailed description of why certain parameters were kept fixed and others were allowed to vary see Appendix D.

Reinforcement Learning Agent (RLA)
Briefly, we consider an RL agent interacting with the environment at discrete event steps, k = 1, 2, 3, ..., and at each event step the agent observes the state of the environment S k ∈ S where S is the set of possible states, and takes an action A k ∈ A(S k ) where A(S k ) is the set of actions the agent can take in state S k . At the subsequent event step the agent then observes a reward R k+1 ∈ R ⊂ R, for taking action A k in state S k , and it then observes a new state S k+1 . To select the actions at each time step the agent uses a policy to map the state to the probabilities of selecting each possible action [65]. The goal of the agent is to find the policy such that it maximises the total reward received. The policy is denoted by π k where π k (a|s) is the probability that A k = a when S k = s [65].
Q-learning is an off-policy TD control algorithm. This means that the policy that generates the actions is not the same as the target policy. Using this method simplifies the analysis of the algorithm and reduces the conditions needed for guaranteed convergence to the optimal policy q * [65]. At a particular event k in an arbitrary episode we Table 1: The parameters of the ABM. Some parameters were kept fixed, others were allowed to vary in the calibration. If the variable was kept fixed its value is shown. For a detailed description of why certain parameters were kept fixed and others were allowed to vary see Appendix C and Appendix D. The relationships between the model parameters and the time-series moments used to calibrate the model are given in Figure 2.

Parameters Description
Type Value T Total simulation time, not including the initialisation of the LOB. Free -N f LT The number of fundamentalist (f) Liquidity Takers (LTs).
Free δ The mid-price cut-off % to set minimum volumes.
Free κ Limit order placement intensity for HFTs.
Free σ f Fundamental value variance Free λ min Minimum chartist forgetting factor Fixed 0.0005 λ max Maximum chartist forgetting factor Fixed 0.05 can give the simplest form of Q-learning using the one-step Q-learning update equation: Here α is the learning rate. Provided each state is visited infinitely often and each action is taken in each state infinitely often we have that Q converges to optimal policy q * with probability 1, for appropriately decaying step sizes 7 [68]. For a more detailed motivation for our implementation, we refer the reader to Appendix E.

Optimal execution
The optimal execution RL agent implemented here uses a simple tabular Q-learning approach with simple state attributes, actions, and reward functions. The simplicity of the agent is desired as this is both conceptually clean, and computationally efficient for our use case; that of adding an RL agent into an existing calibrated e-ABM. The state space used and the agent's actions were inspired by the RL agent defined in [32]. It was used because of the agent's simplicity and its demonstrated ability to improve upon the implementation shortfall of the Almgren and Chriss model [1].
Volume trajectory: The agent that is implemented in the ABM is a selling agent that aims to maximise the profit received from liquidating. Following from [32] our agent aims to modify a given volume trajectory using only market orders. However keeping with the theme of simplicity, the volume trajectory to be modified is the event time analog of the Time-Weighted-Average-Price (TWAP) volume trajectory. Since our model is implemented in event time we can't set aside specific times in the simulation for the agent to trade. Therefore, we have to modify the TWAP strategy to be able to fit into the event time paradigm. This volume trajectory will be referred to as the Eventbased TWAP (here denoted e-TWAP) volume trajectory. This is necessary because there is no external clock providing uniform time steps for the TWAP implementation in calendar time, this is because there is no top-down causation 8 imposing an external calendar time in our implementation. However, it does not matter whether the TWAP schedule is implemented at random times or uniformly sampled times, at each time it is triggering the update of the sequence of same sized trading lots that make the trading schedule.
Let X 0 be the total initial volume to be liquidated, and is chosen so that the RL agents trades initial volumes of: 21 500, 43 000, and 86 000 shares, and these are selected to be approximately 3%, 6%, and 12% of the ABM's Average Daily Volume (ADV) traded. The number of decision points is N dp and this can be thought of as the number of times an order of size X 0 /N dp needs to be executed to fully liquidate. A decision point count is the number of times the LOB is updated and the RL agent is asked to make a trading decision. N dp governs how long an RL agent will trade in the simulation if it follows the e-TWAP strategy. If N dp is small then the agent will trade for a short amount of time if it follows the e-TWAP volume trajectory and there is enough liquidity and vice versa for large N dp . The volume trajectory to be modified is X = (x 1 , x 2 , ..., x N dp ) where each of the N dp orders is of size X 0 /N dp . The equal size of the orders is due to the analogy to the TWAP strategy. Now the i th time the LOB is updated, the RL agent is sent the updated LOB, modifies x i based on its state, and submits a market order with the modified volume. In this implementation, we have set N dp to be such that, if the agent trades exactly x i at each decision point and there is enough liquidity, the agent will trade for almost the whole simulation and fully liquidate. We have chosen N dp this way because it most closely resembles a TWAP strategy implemented over an entire day.
On Virtual Machine 2 in Table A.8, all of the simulations had the number of LOB updates greater than 435 9 . Therefore, we set N dp = 430 to ensure the agent will trade for almost the whole simulation and fully liquidate if it follows the e-TWAP strategy and there is enough liquidity.
States: The state space consists of four variables, two variables that summarise the state of the individual agent's current execution and two variables that summarise the state of the LOB. Remaining time, t, and Remaining inventory, i, are used to summarise the state of execution. The Spread, s, and Best bid volume, v, are used as the variables to summarise the state of the LOB, with the idea being that the agent will learn to increase (resp. decrease) volume traded when the spreads are narrow (resp. wide) and the best bid volume is high (resp. low).
To make these variables compatible with tabular Qlearning, these variables have to be discretised and course grained to ensure that the state space does not grow too large. Table 2 describes the parameters necessary for defining the state space: Table 2 Parameters definitions used to create the state space for each RL agent. Here the remaining time index is t = 1, 2, . . . , n T and the remaining inventory index is i = 1, 2, . . . , n I ; with t and i decreasing as the remaining time and inventory decrease respectively. Similarly the spread state index is s = 1, 2, . . . , n S and the volume state is indexed v = 1, 2, . . . , n V .

Parameters Description
T 0 Total RL agent trading time. For the time and inventory variables, the total trading time is divided into n T intervals, and the total inventory to be traded is divided into n I blocks. As the time and inventory decrease through the simulation, the time interval and the inventory blocks that the state belongs to, determine its state value. The state values are given by indices i and t as defined in Table 2.
For the spread and volume states, we would like to know whether the spread and the best bid volume are "high" or "low". Therefore, we need something with which to compare the current values. To do this we use the same method as [32] and compare the current values to historical distributions of the spread and volume. To generate the historical distributions we ran the calibrated simulation model 365 times and recorded the spreads and the best bid volumes. The values that can be taken for the spread state values are in s Table 2. Here the state value is determined by which quantiles the spread falls in between. For the spread distribution, we could not use the same method used in [32] because too many of the state samples would have a value of one. To create the required quantiles so that the spread state value could be determined we assigned s = 1 when the spread is one. We then generate n S − 1 distinct quantiles by splitting the remaining density into n S − 1 equally spaced probabilities to ensure that the rest of the spread states have an equal probability of being chosen. If there are still too few quantiles we keep splitting the remaining density into finer grained equally spaced intervals until we reach the required n S quantiles. To see how the spread states are assigned based on the observed spread s k after some event k and the historical distributions see Table F.11.
For the volume distribution, we can use [32]'s method. We let π b v be the quantile of the best-bid volume. The volume state value v is then given by: Combining all the variables, we have that the agent's state is given by the tuple: z = (n T , n I , n S , n V ). Similarly, Table F.10, shows how the volume states are assigned for agents of different state space sizes. Actions: As was done in Hendricks and Wilcox [32], the RL agent is constrained to submitting only market orders. The volume of the market order is determined as follows: given a parent order X 0 and a volume trajectory {x i } = x 1 , x 2 , ..., x N dp such that the parent order is fully executed by the volume trajectory, X 0 = {x i }; then at the i th decision point the agent can increase, decrease, or trade x i . At the i th decision point the agent chooses an action a i ∈ {0, 0.25, 0.5, ..., 2} such that the modified volume is given byx i = a i x i . The goal is to have the agent learn in which states it is better to increase or decrease the volume traded, so as to maximise the reward. If at the end of the trading period the agent has not fully liquidated their position, the agent will submit a maximum order ofx i = 2x i , to try to trade as much of the remaining inventory allowed by the actions. Note that this does not guarantee all the inventory was traded. This was done to ensure that there is a penalty for having inventory left over as you won't receive the additional profit.
Reward: The goal for this agent is to maximise the total profit from selling the large initial parent order. Given this goal, the reward signal must be defined such that the return corresponds to the goal. Let there be N market orders submitted by the RL agent. For the n th order, in an episode, let the total profit received be the reward for that trade: R n = J j=1 p j x j , for an order that has walked J levels down the order book. Given that each simulation is used as an episode we then have that an episode's return is: G = N n=1 R n , where R n is the n th order in an episode. This corresponds to the total profit from selling inventory in that episode. Therefore, the agent will be maximising the total profit, which is its goal. This reward function is analogous to the reward function given by [6]. It was chosen for its simplicity.
The convergence guarantees, for the Q-learning algorithm, are based on the fact that the underlying model generating the state, action, and reward samples evolve according to a Markov process (see equation E.1), where the next state and the reward depend only on the previous state and the action taken in that state. Given the simplicity of the state space, it is unlikely that it encodes enough information about the LOB to ensure that the process is Markov. This means that the convergence guarantees for the Q-learning algorithm shown by [68] are violated. We show that this is sufficient to encode enough information for the agent to learn simple intuitive rules for liquidating a position in this model.
Comparing the updating equations in our model to Hendricks and Wilcox [32] we can see that they used the updating equation given by Garcia and Ndiaye [23]. They use this equation because they consider the system to evolve according to a finite-horizon MDP. Our implementation of the RL agent does not use the finite-horizon, as opposed to infinite-horizon distinction given by the MDP literature; instead we use the episodic, as compared to continuous implementation distinction [65]. This motivates the standard updating equation we use, rather than the updating equation given by Garcia and Ndiaye [23]. For a deeper discussion on the relationship between the finitehorizon vs. infinite-horizon and episodic vs. continuous learning, we refer to the literature [65].

Parameters
The Q-learning update equation needs to observe the next state that the agent transitioned to, to perform the update on the value of the current state-action pair. In a live implementation this state is not observed until the message from the matching engine has been received, indicating if, and then how, the order was matched. To remedy this the update to the value of the state-action pair is done one-time step later when the next state is observed. The algorithm is defined in Algorithm 6.
Numerous RL agents were implemented with different parent order and state space sizes. For larger parent orders the agent's volume trajectories that they will be adjusting will have higher volumes at each decision point and will therefore incur larger price impact, and visa-versa for smaller orders. Varying the size of the parent order will allow us to see if different behaviour were learned when the size of the parent order changes. See Table 3 to see the different sizes of the total inventory size of the parent order that will be traded. The sizes in brackets represent the sizes of the x i 's that the RL agent will have to adjust.
The size of the state space was also varied. This was done by changing the number of values each state variable could take on. The more values each state variable could take on the larger the state space will become, which could give the agent finer control and may lead to improved performance. However, the larger the state space the slower the agent will learn because the agent will return to states less often when compared with a smaller state space when using the same number of episodes. The number of values that each state space variable can take on is shown in Table 3. Table 3 also shows the maximum amount of time the RL agents trade in the simulation, T 0 . Once the initialisation is complete all the liquidity takers including the RL agents can submit orders. The simulation is run for 25 seconds 10 but the RL agents are only able to make trading decisions for the first 24.5 seconds. The small buffer time at the end was put in place so that if the agent has not liquidated then it can submit a final order with the largest action and the MMS has enough time to wait for the event to be processed by CoinTossX and sent back.
The number of episodes, the discount factor γ, and the learning rate α, are also detailed in Table 3. The number of episodes was chosen to be 1000 as this was the longest training time that could be used within our time and cost constraints. We set γ = 1 because it is the convention for episodic tasks and because it will mean our agent will aim to maximise the total profit obtained from selling the inventory. Multiple values for the learning rate were tested before the RL experiments were run and it was found that α = 0.1 was the most appropriate for the 1000 episodes. Setting α = 0.01 caused the learning to be too slow and setting it at 0.5 caused faster but more chaotic learning due to the larger step sizes. Table 3 The RL agent's training parameters are: the session length in machine time, and the different values for the initial inventories and state space variables that represent the different parameter combinations used to train the RL agents. The sizes in brackets represent the lot sizes i.e. the equally sized x i 's that the RL agent will have to adjust for each initial X 0 . The number of decision points, per episode, and the number of episodes are fixed along with the discounting and learning rates.

Parameters
Values Interpretation T 0 24.5s 8 hours of trading X 0 21500 (50) 3% of model ADV 11 43000 (100) 6% of model ADV 86000 (200) 12% of model ADV n T , n I , n S , n V 5, 10 "small", "large" N dp 430 decisions per episode N E 1000 episodes γ 1 discounting rate α 0.1 learning rate 10 A model run of 25 seconds of machine time is considered equivalent to 8 hours of calendar time trading because this generates comparable event counts to a single day of trading. 11 Estimated over 100 simulation runs and using the calibrated pa-In Q-learning there are two policies: the first being the target policy which we aim to optimise, and the second is the behaviour policy which generates the agents' behaviour used to optimise the target policy. In our implementation of Q-learning the target policy is the greedy policy (see equation E.6), and the behaviour policy is the -greedy policy. controls the amount of exploration versus exploitation in the optimisation. For a given state and N A actions, the -greedy policy chooses the action with the highest value with probability 1 − + ( /N A ) and the rest with probability /N A . Setting = 1 will mean an agent that selects each action with equal probability regardless of its value and setting = 0 will ensure the agent always selects the action with the highest value -the greedy policy. To ensure a good balance between exploration and exploitation the parameter was decayed throughout the simulation. Figure 1 shows how the parameter is decayed over the episodes. For the first 200 episodes, the decays from 1 to 0.9, for the next 400 episodes decays from 0.9 to 0.1, and for the following 150 episodes, it decays from 0.1 to 0.01 where it remains for the rest of the episode Figure 1: Plots the decay for -greedy policy used to generate the actions of the RL agents. For a given state and N A actions, the -greedy policy chooses the action with the highest value with probability 1− +( /N A ) and the rest with probability /N A . Setting = 1 will mean an agent selects each action with equal probability regardless of its value and setting = 0 will ensure the agent always selects the action with the highest value (greedy policy).

Calibration Data
A single day of Naspers trade-and-quote (TAQ) from the JSE was used as the empirical data that will be used to calibrate the model and compare the stylised facts. The day of data that was used was 08-07-2019 and was extracted from a larger static trading research dataset for rameters, we found the ADV to be 745 362. This is comparable to the JSE's traded volume of 498 981 for the day of trading used in the calibration.
the JSE and A2X exchanges [38,35,37]. The data set we have used is 8 hours of trading in calendar time. For this day of trading, we found that the number of limit order and trade events on the JSE was 15104 and 2510 respectively. The average number of limit order events in 100 simulations was 12071 95% CI (10988,12953), and similarly, the average number of trade events is 5467 95% CI (4566,5798).
The trade and limit order volumes are highly dependent on the parameters of the model, we chose to let the calibration procedure determine the parameters, which will produce the volume distributions. Even though there are some discrepancies between the counts i.e. fewer limit orders and more trades in the e-ABM, when compared to the JSE, increasing or decreasing the length of time the simulation was run for will only increase the difference. Based on this we are confident that the machine time period of 25s generates event counts that are of a comparable amount to that found in a day of real world trading, as observed in the calibration data.
Before estimating the moments and finding the stylised facts on both the calibrated price paths, as seen in Figure  3, and the JSE; we clean the simulated and real-market data into "top-of-book" or level-1 format (here denoted as L1LOB data). The mid/micro-prices obtained from this data are then used to estimate the moments and compare the stylised facts. This ensures a like-for-like data comparison between the simulated and JSE data. It also means that the data is compared in event time on both markets, where an event is classified as an order (limit order or trade) that modifies the top of book and trade data. During calibration, the simulated data can't be cleaned into this format without incurring a large performance overhead because cleaning the raw messages is a computationally intensive task. Therefore, we use the micro-prices from all updates to the LOB.
Here, the trade-off is valid since most limit orders are placed at the best bid/offer (See Figure B.21a) and all trades will affect the top of book data and the moments estimated during the calibration process are comparable to the top-of-book and trade data moments on the JSE. The calibration results also suggest this approximation is valid (see Section 3.4). This method was also used in Jericevich et al. [37], which enables a like-for-like comparison between their ABM and our e-ABM extension, as well as a like-forlike treatment between the e-ABM and the JSE.

Sensitivity Analysis
To get a better understanding of how varying the parameters cause changes to the simulated price paths we first perform a sensitivity analysis on the parameters of the ABM. To detect the changes in the price paths 9 moments have been chosen to characterise the behaviour of a single price path. These moments are the mean (µ), standard deviation (σ), Kolmogorov-Smirnov statistic (KS), Geweker-Porter-Hudak estimator (GPH), augmented Dicky-Fuller (ADF), the sum of the GARCH(1,1) parameters α and β (GARCH), and the modified Hill estimator. A detailed description of each of the moments is given in Table 4. These moments were chosen because they were found to be robust enough to characterise a price path but flexible enough to discriminate between parameter settings [71,60,37].
For most of the parameters, the distribution of moments across their values does not change considerably, and there is a large overlap of the distributions indicating that changing a single parameter does not cause a large change to the overall model price paths. An important exception is that if we increase the number of fundamentalist agents ceteris paribus the distribution of the Hurst exponent shifts to smaller values showing that fundamentalists are engaging in mean-reverting trading strategies. This can also be seen in the correlation matrix between parameters, the moments, and the objective function (See Figure  2) where there are only two large correlations, that of the standard deviation of the price paths with the objective function (0.918), and the Hurst exponent with the number of fundamentalists ( -0.699). Certain parameter values may tend to produce large outliers, e.g. smaller values of the κ parameter increased the number of outliers. This is explored in detail in the appendices Appendix C. The key takeaway is that for plots of the objective functions, these tend to decrease from areas of higher values to lower values giving heuristic evidence that optimisation is feasible. However, the optimisation of the objective function is non-trivial because there is evidence of local minima and parameter degeneracy.
Considering the bottom row of the correlation matrix we see that none of the parameters are correlated with the objective function however the objective function does depend on the values of the parameters. It is noted that while the correlations are informative towards finding linear dependencies they tell us little about the non-linear dependencies in the model. With this in mind interaction plots of the parameters showing how they affect each of the moments have been generated and are shown in the appendices.

Calibration Method
We calibrated the ABM to data from a single stock for a single day on the Johannesburg Stock Exchange (JSE) using the method-of-moments with simulated minimum distance (MM-SMD). This method was chosen because it is more computationally tractable than Bayesian inference and it does not require a closed-form solution like the maximum-likelihood method. A description of MM-SMD and a more detailed discussion as to why we chose it over other methods can be found in Appendix D and Appendix D respectively. The list of moments used is given in Table 4, and our reasons for choosing these can be found in Appendix D. The moments were estimated on the microprice log-returns because previous work found that using the mid-prices was unreliable when trying to identify the distributional properties [37].
To optimise the MM-SMD objective function we used the Nelder-Mead with threshold accepting (NMTA) algorithm as our optimisation routine due to its speed and its ability to escape local optima. See Appendix D for a description of the algorithm and our rationale for choosing it.

Model calibration results
The NMTA algorithm was able to converge to an optima during calibration and the results for the parameter convergence can be seen in Appendix D. The calibrated parameters and associated indicative confidence intervals are shown in Table 5. To compute the indicative confidence intervals we use the same block bootstrap method used by Jericevich et al. [37]. First, we notice that through the sensitivity analysis we have obtained estimates for the moments for a wide variety of parameter values. Using these estimates a 6 × 8 exposure matrix, B 12 , was estimated. Furthermore, we have the covariance matrix for the empirical moments, which is just Σ m e = Cov [m e ] = W −1 . Using these two estimated matrices we can compute an indicative covariance matrix for the parameters of the model Σ θ = BΣ m e B T . The diagonal entries of this matrix then give the indicative variance for each of the parameters. To create the confidence intervals we use: θ ± 1.96 × diag (Σ θ ). Due to the indicative nature of the calibration process and the construction of the intervals, it is difficult to perform inference on these values. Table 6 presents the simulated and empirical moments and their respective confidence intervals. The simulated and empirical moments were estimated on the micro-prices obtained from the cleaned level-1 order-book event data. This ensures that the moments are estimated in event time on both markets, where an event is classified as an order (limit order or trade) that modifies the top of book and trade data. We obtained the confidence intervals for the empirical moments using the covariance matrix estimated with the moving block bootstrap, which gives us m e ± 1.96 × diag (Σ m e ). For the simulated moments, we use the same method used to estimate the empirical moments. A price path is simulated using the calibrated parameters and is shown in Figure 3. The same moving block bootstrap method used to estimate Cov (m e ), and we estimate Σ s = Cov ([m s |θ]) for the simulated data. The indicative confidence intervals are then: Fabretti [20] states that the simulated time series should not only be compared at a qualitative level, by comparing stylised facts, but also determine if the simulated data comes from the same distribution and have the same characteristics as the empirical data. Following this advice, we compare the simulated and empirical moments. Four out of the eight moments have the simulated and empirical moments' confidence intervals overlapping. However, Table 4 The moments used to characterise the price paths. These are used in the sensitivity analysis to measure how changes in the parameters change the price paths, and they are used to calibrate the ABM to measured market data. Figure 2 shows the relationship between these moments, the model parameters, and the objective function used to calibrate the model.

Moments Description
Mean, Std. The mean, µ, and standard deviation, σ, measure the shape of the log-return distribution.

KS
The Kolmogorov-Smirnov [51](KS) statistic is used to compare the maximum absolute difference between the CDFs of the empirical and simulated log-returns. This is used to test if the samples are drawn from the same distribution. GPH Geweke and Porter-Hudak [25] (GPH) provide an estimate for the fractional integration parameter d ∈ [−0.5, 0.5] in absolute log-returns time series by estimating the memory parameter in the log-periodogram. This measures the long-range dependence in log-returns. ADF The Augmented Dickey-Fuller [15] (ADF) statistic tests for the presence of a unit root in the time series of the log-returns. The more negative the test statistic the less evidence there is for the presence of a unit root. GARCH α + β The parameters of the GARCH(1,1) model provide a useful summary for the short-range dependence in the squared returns. Winker et al. [71] found that the sum of the parameters in the GARCH(1,1) model is a more robust statistic as compared to the individual parameters. "Improved" Hill The Hill Estimator (HE) is used to estimate the tail-index of a power-law distribution which can be used to infer the power-law behaviour in the tails of experimental distributions. Nuyts [56] proposed an "improved" Hill estimator to remedy the issues with the standard estimator.

Hurst (H)
The Hurst (H) exponent [34,48] provides a measure of the long-memory of the log-returns. If H ∈ [0, 1] is significantly far from 0.5 we have evidence that the log-returns do not follow a random walk. Furthermore, if H ∈ [0, 0.5) then the process is said to be mean-reverting and if H ∈ (0.5, 1] the process is said to be trending. only the mean and the GPH estimator have a substantial enough overlap to consider the moments coming from the same distribution for this set of data. The GPH estimate is a measure of the long-range dependence in the absolute log-returns, and due to the overlap in the empirical and simulated moments' distributions, the ABM appears to replicate this feature. Considering the Hurst exponent, we see that the mean reversion feature of the empirical data has been replicated by the model, however, even though there is an overlap in distributions the ABM has overestimated the amount of mean reversion. Indicating that fundamentalist trading strategies dominate trend following strategies more in the simulation than they do in the real market. The overlap in the GARCH estimates can be explained by the high variance in the simulated moment's distribution. Even with the high variance, the model was still able to recover the fact that the sum of the GARCH(1,1) parameters should be close to one; indicating an integrated time series. Of the four moments that did not have an overlap in confidence intervals, the ADF statistic is the least concerning because both time series exhibit no evidence for a unit root. The KS statistic is the moment with the largest difference in the confidence intervals which gives evidence for the log-return distributions being significantly different.
The "improved" Hill estimator measures the power-law behaviour in the right tail of the log-return distribution. The simulated returns have a lower estimate across the entire confidence interval giving us that the distribution has fatter tails. The simulated returns also have a higher variance when compared with the empirical returns distribution. For each moment the simulated moments have a higher variance, which could be a result of fewer orders in the simulation or it could mean that the empirical moments are more stable. It must be noted that these estimates are for a single day of JSE tick-by-tick data and it is possible for moments to change across days, but since this was the day of data we calibrated on it is imperative that the moments derived from the simulation are compared to the data on which it is calibrated. Overall the calibration can be considered a partial success because some of the moments had overlapping confidence intervals and, with the exception of the KS statistic, the simulated moments were not considerably different when compared with the empirical data.
This provides evidence that the simulated and empir-  (Table 1), the moments (Table 4) ical time series paths exhibit similar characteristics. The simulated and empirical log-return distributions are centred around 0, the variance of the distributions are comparable, both price paths are mean-reverting, the GPH statistic shows that they have similar long-range dependence properties, the large negative ADF statistics indicate that the simulated and empirical price paths do not show evidence for a unit root, and considering the GARCH parameters we see evidence that both price paths are integrated. Additionally, we see similar extreme value behaviour as the values for the Hill estimator are reasonably close, even though the simulated price path exhibits fatter tails.

Model stylised facts
It is difficult to assess the "goodness" of one ABM relative to another as most comparisons are qualitative stylised fact comparisons, where there are often a variety of different ABMs that can produce similar stylised facts [37]. Qualitatively comparing empirical vs. simulation stylised facts is problematic [60] because there is usually no unique set of parameters that exists when an ABM is calibrated to a particular price path and even though the models can recover stylised facts the parameters of the model exhibit degeneracies. When multiple parameter sets generate the same required stylised facts of a particular market in non-unique combinations it is difficult to conclude that observed changes in the model are always due to changes in a specific parameter, or that parameter combinations will uniquely predict the same changes in the market [37]. This is often not merely a problem with the technology used to implement the calibration, but a fundamental feature of the model formulation. While these issues remain unresolved, demonstrating that a model can reproduce the stylised facts of a market is a good starting point in validating the ABM [60].
An indicative price path simulated using the calibrated parameters is plotted in Figure 3. The fluctuations in the log-returns, as well as the spread and order imbalance, are also plotted. This is also the price path on which the moments will be estimated and the stylised facts are to be computed. Given that this is a single price path that is subject to variation due to changes in the fundamental Table 6 Presents the simulated and empirical moments and their respective confidence intervals. The simulated moments (m s ) were obtained on the price path shown in Figure 3 and the empirical moments (m e ) were estimated on the day of JSE data described in Section 3.1. We obtained the confidence intervals for the m s and (m e ) using the covariance matrix estimated using the moving block bootstrap method outlined in Section 3.4. For four out of the eight moments, the simulated and empirical moments' confidence intervals overlap, and with the exception of the KS statistic, the simulated moments are not considerably different when compared with the empirical data. Indicating a reasonably successful calibration of the ABM. prices and the forgetting factors as well as the randomness of the samplings, the moments and the stylised facts are to be considered indicative.

Stylised facts
The stylised facts that this models aim to reproduce are provided in Table 7. Increasing the number of stylised facts increases the constraints on the model as it has to contain more features, therefore we have chosen several well-studied stylised facts to compare our model to empirical data.
All the stylised facts that involve returns will be computed using the micro-price 14 instead of the mid-price [37]. The micro-price is defined as the volume weighted average of the best bid and ask. The micro-price log-returns for the k th interval are: r k = ln(m v k ) − ln(m v k−1 ) and were computed from the level-1 order-book event data [37].
The first stylised fact that will be considered is the fat tails of the micro-price log-returns. Figure B.16 compares the micro-price log-return distributions, for the CoinTossX ABM (green) and the JSE (purple), to the normal distribution with MLE parameters. The inset is a QQ-plot comparing the two distributions. The figure shows that the ABM and the empirical returns have fatter tails than that of the normal distribution, and it shows that the ABM was able to reproduce the stylised fact. Table 6 shows that the variance for the ABM is higher than what is observed in the market and this fact is evident in Figure B.16. Now that we have shown that the returns have shown fatter tails than the normal it is required to compare the extreme values of the simulated and empirical distributions. Mandelbrot [47] showed that the tails of the return distributions exhibited power-law or Pareto-like distributions with a tail-index that is finite. Figure B.17 plots the upper 95th percentile as well as the lower 5th percentile. 14 Here and v a k are the best bid, best ask, best bid volume and best ask volume, at event k.
It is interesting to note that the return distributions are close to symmetric in both cases. The inset compares the upper and lower tails to a power-law distribution [12] given by the density: where α > 1, and x ≥ x min > 0; these can be used to provide a distribution function whose parameter can be estimated using an MLE method 15 . The insets, in both the plots, in Figure B.17 indicate that simulated and empirical return distributions have near power-law behaviour in their tails. If their distributions were exactly a powerlaw we would expect that the data would be linear on a log-log scale. However, we observe a slight downward curve which indicates only near power-law behaviour. Comparing the empirical vs. simulated returns we see that both curves observe this similar behaviour, giving credibility to the returns generated by the ABM. The tail-index estimate for the simulated returns is less than the tail-index of the empirical returns, meaning there are fatter tails in the simulated return distribution. This was also observed in the moments in Table 6. Given that the distributions do not exactly follow a power-law no conclusion about the existence of moments can be drawn from the tail-index parameter.
As stated in Table 7, at high-frequency times scales we have that there are significant negative autocorrelations for the first lags and then for the remaining lags the autocorrelations are insignificant. This has been cited as evidence of a mean-reverting component of the time series that could potentially be attributed to market making [13]. Figure B.18 plots the autocorrelations for the log-returns and the autocorrelations in the absolute log-returns. The significant negative autocorrelation for the first few lags 15 x min is the sample minimum and we estimate the tail-index α using a MLE:α = 1 + n n i=1 ln ( x i/x min ) −1 for n data points [12].  Table 5. This is equivalent to 8 hours of typical trading in calendar time. The different markers and colours correspond to the different types of orders and the different sides of the LOB respectively. You can see the mid-prices and micro-prices emerging from the interactions of these orders. The fluctuations in the log-returns, as well as the spread and order imbalance, are plotted in the middle and the bottom respectively. The log-returns exhibit clear volatility clustering, where there are some periods of low and high volatility. These results demonstrate that the ABM has been able to produce well-known stylised facts, and this is shown in more detail in Appendix B.
is present in both plots. The autocorrelation in the simulated returns is far higher than the empirical returns which may be the result of the frequency of limit orders placed by the high-frequency agents in the ABM. Once again the graphical representation is consistent with the estimated moments as the Hurst exponent in Table 6 suggests more mean-reversion in the simulated returns.
Another stylised fact that we aimed to reproduce is that of volatility clustering, the feature where large (resp. small) price variations are followed by further large (resp. small) price variations. This feature was calibrated using the GPH statistic, which is a measure of the long-range dependence in the absolute log-returns. Figure B.18 shows that the absolute log-returns for both the simulated and empirical both have persistence, meaning that simulated and empirical times series have evidence of volatility clustering. However, the decay in the autocorrelations is far quicker for the simulated returns.
Trade-signs have been shown to have persistence in their autocorrelations [45,8,66], meaning that buy (resp. sell) orders are followed by more buy (resp. sell) orders. To estimate the autocorrelation of the trade-signs each trade needs to be classified as a buy (+1) or a sell (-1). Following Jericevich et al. [35] we will be using the Lee/Ready rule [44] as the method by which we will be classifying trades. Figure B.19 plots the autocorrelations of the trade signs. The model's autocorrelation function (ACF) does not exhibit the same behaviour given by the empirical data. The ABM has been able to capture the decay of the ACF for increasing lags, however, this decay is much slower than the decay of the ACF for the empirical data. It has been shown on the London Stock Exchange (LSE) that the autocorrelations of the trade-signs should exhibit a power-law decay with tail-index exponents that indicate a series that is not integrable [45]. Meaning that the process is said to have long-memory. The insets on the plots show the autocorrelation on log-linear scale. The inset for the ABM shows that the decay of the order-flow autocorrelation does not decay as a power-law. However, the decay found on the JSE looks to observe a power-law decay which is illustrated in Figure B.21b. There is a linear decrease on the log-log scale giving evidence that the distribution is a power-law. This was observed by [45], and if we compare the plot in Figure B.21b to the figures they produce the Table 7 The key stylised facts that we aim to reproduce with our ABM.

Stylised Fact Description
Heavy tails The unconditional distribution of the log-returns has been shown to have power-law behaviour in the tails [47], with a finite tail-index between 2 and 5 [13].

Return autocorrelation
The log-returns exhibit an absence of significant autocorrelations [13]. In high-frequency time scales the return series observes negative significant autocorrelation for the first few lags which are usually given as evidence for a fast mean-reverting component [13].

Trade-sign autocorrelation
The trade sign time-series has been shown to have persistence in their autocorrelations [45,8,66]; implying that buy (resp. sell) orders are followed by more buy (resp. sell) orders. Herding and order-splitting have been used to explain this observed phenomena; with order-splitting being the main reason for the persistence in order flow [66]. Volatility Clustering Here large (resp. small) price variations are often followed by large (resp. small) price variations. The persistence of the autocorrelation of the absolute log-returns are signatures of volatility clustering and long-range dependence in the volatility [13].

Depth profile
The Paris Bourse and the NASDAQ have an observed hump in their shape with a slight increase in volume as the price moves away from the best bid/offer, with it reaching a maximum a few ticks away from the best bid/offer before a steep decline [9,29] 13 . The Standard and Poor's Depository Receipts (SPY) showed a maximum at the best bid/offer [62].

Price impact
Price impact can be measured as the change in bid/ask price caused by a trade of a given volume. As the volume increases the observed price changes increase linearly on a log-log scale indicating a power-law relationship [30,35,46].
results are very similar. It is clear that this model does reproduce the decay of the order-flow autocorrelation for increasing lags, but the volatility clustering effect is too strong and it does not describe correct the power-law decay seen in real markets. It does not fully match what is observed in the empirical data used to calibrate the ABM. This is important, and a feature of the model design.
Two potential reasons why the model has failed to correctly capture the required trade-sign ACF feature are: 1. The model has not been explicitly calibrated to describe the trade-sign ACF feature.
2. The ABM model does not have agents explicitly using order-splitting.
There is no moment estimated on the trade-sign autocorrelation that has been used to compare the empirical data and the simulated data -and this is a key design choice made in the calibration methodology [61]. This is important here because both herding and order-splitting have been used to explain the order-flow auto-correlation, with order-splitting being given as the main reason [66]. So in our ABM, the autocorrelations can only be explained by the herding of the chartists. Although the ABM can be enhanced to explicitly include minimally intelligent forms of order-splitting e.g. allowing the fundamentalists to implement TWAP strategies. This introduces a form of bound rationality at the level of the fundamentalists, and converts them into optimal execution agents, both buying and selling. The approach in this work is to explicitly introduce optimal execution agents as being strategic and hence able to learn with bounded rationality, so that we can then measure the impact of strategic order-splitting on the original model's stylised facts.
This also provides some insight into whether one can in fact improve on a naive TWAP execution strategy in this type of model knowing that this is what happens in real markets. Improving on the Almgren and Chriss execution trajectories is of more practical importance [32]. However, theoretically, this approach most easily sheds light on the impact of learning on a minimally intelligent agent-based model. The TWAP strategy is conceptually important because it can be recovered theoretically [6] when markets are informationally efficient in the sense of the linear in volume price impact link from the Kyle model [42] to ideas of no-arbitrage [33].
A secondary challenge relates to the depth profiles. The empirical data collected from the JSE for this project only contains level-1 information and therefore we cannot build a depth profile. Figure B.21a plots the mean volume at seven price levels for both the bid and the ask. We observe an asymmetry between the bid and ask curves, concerning total volume, but we also observe that both curves exhibit a rapid decrease in liquidity as the price moves away from the best bid/ask remaining consistent with empirical observation [9,29,62]. The maximum volume occurring at the best bid/ask is also consistent with previous research [62].
It is also worthwhile to compare the stylised facts generated by our model and [37]'s model since our model is an event-based implementation of the [37]. Both models were able to reproduce the fat-tailed return distributions however one limitation that has been remedied from [37] is the simulated returns now have more data evenly distributed from the mean and not so concentrated around zero, which was a limitation of their model. This also meant that there is a smaller discrepancy between the tailindex estimates for the extreme-value distributions of the log-returns. The event-based ABM captured the fact that only the first few lags should be significantly negative, [37] had significant lags (positive and negative) up to lags of fifty. The decay in the absolute log-returns is faster than what is observed in our model, which means there is less of a volatility clustering effect in [37]. Their model more accurately depicts the decay in the trade-sign autocorrelation. Since the order-flow autocorrelation is due to the herding effect caused by the chartists, and we changed the chartist rule to better suit the event-based ABM, their rule was likely better specified for their model. Future research on the event-based model should look to experiment with the chartist rule to better capture the correct decay. Another possible reason we see this discrepancy is that chartists are more likely to be trading in events that are very close together when compared with the stochastic time model implemented by [37]. This would increase the herding effect seen in the event-based model. It must be noted that the trade-sign autocorrelation has not been tested for the long-memory property. So, even though they better capture the decay it is unclear if the decay is indicative of a long-memory process. Both models were able to accurately recover the depth profiles.

Price impact
To compute the price impact of a trade the method of [46,30,35] was used. If we let m k be the mid-price at event k. We have the price impact of trade at event k is: ∆p ki+1 = ln(m ki+1 ) − ln(m ki ), where m ki+1 is the mid-price after the trade and m ki is the mid-price before the trade. We compute the normalised trade volume following [30,35]: ω i = vi /v wherev is the average volume for a given session, each session has T trading events, and v i is the traded volume for the i th event. To create the price impact curves we created 20 logarithmically spaced normalised volume bins on [10 −3 , 10 0.5 ]. Each trade was placed into one of these bins depending on its normalised volume. Then for each bin, we computed the average price impact, ∆p * and the average normalised volume, ω * , that occurred in that bin throughout the trading period. The price impact curves were plotted on a log-log scale and for both the simulated and empirical data we plot the price impact for buyer-initiated and seller-initiated trades. The empirical trades were classified using the Lee-Ready method [44].
The price impact curves for the simulated data and the empirical data are plotted in Figure 4. Even though there is noise, there is a clear linear upward trend in the price impact curves on a log-log scale. Therefore, the simulation has been able to recover the empirical stylised fact that there is a power-law relationship between the increasing trade volume and the price change caused by trade. These findings are consistent with previous research [30,35,46]. It must also be noted that both the event-based model, presented here, and the model presented by [37] were able to recover the correct form for the price impact curves.

Convergence results
The convergence of the returns of the learning agents is shown in Figure 5. The return is the total profit obtained from selling the parent order. To make the comparison between the different initial inventory levels easier the total profit has been normalised, to give us the Inventory Normalised Total Profit (INTP). Here m 0 is the initial mid-price, and if all agents were able to trade all shares at price m 0 then INTP would be 10000 and the implementation shortfall would be 0.
We can see that all agents show improvement and convergence in the profit over the episodes. From episodes 0 to approximately 250 there is an increase in the average profit and a decrease in the variance of the profit. After this period, the agents receive roughly the same profit if we ignore liquidity shocks of the agents with larger X 0 .
Consider the effect of increasing the size of the parent order and hence the number of shares to adjust at each decision point. As X 0 increases the frequency of liquidity crashes and large price decreases increase. This is consistent with the fact that larger orders will have a larger price impact which will push the price further down. Given that the ABM is simulating a low liquidity asset, increasing the inventory traded will also increase the chance of a large order being submitted which will wipe out all the liquidity on the contra side of the order book causing a liquidity crash.
The volume states are assigned based on the observed best bid (v b k ) after some event k, these are shown in Table  F.10 with the simulated historical distribution of the e-ABM. It also shows the approximate probabilities of falling into each of the states. The table indicates the levels of liquidity at the best bid, and the probabilities of breaking through the best bid e.g. an order of size 50 is likely to break through the best bid with a probability of a little over 20%, whereas an order of size 200 has a probability of a little less than 40%. From this, we can see that agents trading with less inventory are far less likely to cause excessive price impact. Based on the depth profiles observed in the e-ABM ( Figure B.21a), most of the liquidity is at the best bid/offer, therefore if you break through the best bid you have a high chance of clearing that side of the order book. Based on this, it is clear why the agents with the larger inventories are causing more frequent liquidity shocks.
Comparing the size of the state space, we do not see any significant difference between the profit obtained by the smaller state space agents and the larger state space   (Figure 4b). They are both plotted on a log-log scale and the simulated and empirical price impact curves both capture the linear upward trend, meaning that there is a power-law relationship between the volume traded and the price impact caused by the trade. These findings are consistent with previous empirical research [30,35,46] as well as previous ABM results [37]. The key differences relate to the impact of many small trades, which became particularly noticeable after the removal of the price floor in the JSE pricing model [30] agents when they are trading the same amount of initial inventory.
It is also clear that the agents are unable to significantly improve the profit received compared to if they were able to trade all inventory at the initial mid-price. Therefore, the agents have learned to only reduce the implementation shortfall. Figure 6a plots the convergence in the number of states visited for all the RL agents. This was done to find out if the agents are still finding new states when the number of episodes terminates. It also shows the difference in the convergence in the number of states visited between the larger and smaller state spaces. This is valuable information because if the agents are still rapidly finding new states then the agents will have more difficulty learning behaviour as they are not returning to previously visited states, and are therefore not updating the values of the states. Limiting their ability to find the optimal policy.
It is not surprising to see that the smaller state space converges much faster than the larger state space. The smaller state space also exhibits more jumps which is likely the result of the clustering of states, where the more coarsegrained space would cause the agent to sample from one cluster and then jump to another, while the larger state space would have less distinct clusters and the jumps between states would be less significant, giving a smoother convergence. It was precisely this state clustering feature that was exploited in the online learning method of Hendricks [31].
The difference between the number of states visited by the smaller and the larger state space agents is large. Having far more states means that there will be fewer returns to individual states, over the same number of episodes, which will mean the agents with the larger state space will likely have greater difficulty in learning a good policy over the same number of episodes.
We now consider the convergence in the behaviour of the agents and look at the convergence of the Q-matrix and the convergence of the greedy policy. The left-hand side plot of Figure 7 plots the one-step average absolute difference in the Q-matrix, for common states. Let A k and S k be the set of actions to choose and states observed at time step t, and let |A k | and |S k | denote the cardinality of these sets. The one-step average absolute difference in the Q-matrix at event k for common states is: Since A k−1 ⊂ A k and S k−1 ⊂ S k we averaged over the actions and state of the time step k − 1 because at event k we have that only more states could have been discovered.
The reason this is used as a measure is that if Q converges to q * then this one-step difference should converge to zero if there are appropriately decaying learning rates. For our RL agents we have not decayed the learning rate but set it at 0.1. This means that the differences will not converge to zero. However, it should still give us a good idea of the convergence of the Q-matrix values.
From Figure 7 we see that all the differences have decreased and converged around some constant, with some variance. The agents trading less inventory have smaller differences due to the smaller profit they will obtain, and they have a lower variance which could be partly due to the lower total profit and the smaller price impact.
Comparing the smaller state space to the larger, we see that the smaller state space agents all exhibit a hump shape instead of a smoother more monotone convergence of the larger state space agents. In each plot, we compare the profit convergence of the agents with smaller state space sizes to agents with larger state space sizes. m 0 is the initial mid-price, and if all agents were able to trade all shares at m 0 then INTP would be 10000 and the implementation shortfall would be 0. For the agents with the size of the parent order being 86000, the INTP was winsorised at the 1% and 99% level to allow for easier comparisons, as there were a few very large price decreases that distorted the figure. All agents demonstrate improvement and convergence in the profit. As the inventory increases, we start to see liquidity effects, as there are more frequent liquidity crashes. The size of the state space does not seem to cause significant differences in the profit, and all the agents were unable to significantly improve the profit received over the implementation shortfall amount.
Given that changing the Q-matrix may not change the actions the agent selects, in a one-step update, we consider the convergence of the greedy policy over the training of the agent. To measure this we use the one-step difference at event k: For a given state, this measures the difference between the actions chosen by the greedy policy, from event k − 1 to event k. The idea behind this measure is that as the agent learns the optimal policy the agent should learn to select the same action for the same state. This measure is plotted on the right-hand side of Figure 7. There is clear convergence in the actions selected by the agents as the training progressed. With the agents selecting the same actions for the same states using the greedy policy.
A surprising result is that the agents with the larger state space have the greedy policy converge faster and with less variance when compared with the agents with the smaller state space. This is once again related to the return problem, where the agents are not able to return to individual states and perform updates as frequently, causing fewer changes to the policy. It will also be shown that the larger state space agents are discovering more states than they are updating existing ones, which is another reason why the policy is not being updated.
This plot does not claim that the agent has found the optimal greedy policy, but it does show that the greedy policy is changing at a very slow rate. Moreover, we have that unless the number of episodes is drastically increased, the policies will not change significantly and hence the profit obtained will be unlikely to significantly improve.
Another way to visualise the convergence of the greedy policy for an agent can be seen in Figure 8, where we have plotted the actions taken in each state across the episodes. The action value of -1 represents a state that has not been visited yet, and a jump from -1 represents the discovery   Figure 6a. If the agents are still rapidly finding new states, then the agents will have more difficulty learning. The smaller state space agents converge much faster than the larger state space agents. Additionally, these agents have more jumps in the convergence whereas the larger state space agents exhibit a smoother convergence. The number of trades an agent submits in each episode is plotted on the right in Figure 6b. It shows the decrease in the number of trades over the episodes and the convergence of this decrease. The decrease is speculated to be due to the agent learning to guarantee the execution of all the inventory, because fewer trades imply the agent has finished executing the parent order in a shorter amount of time.
of the state. It is clear that the larger state space agent is still discovering states, while the smaller state space agent has discovered most of the states.
Looking at the smaller state space agent we can see that initially there is a large amount of changing of actions for each state, however, over the training period, the number of changes decreases and the agent starts to repeatedly select the same actions. Moreover, most of the states have been discovered, meaning that the agent is repeatedly sampling from the previously visited states. Therefore, this convergence in the actions selected for each state must be due to the agent finding some structure. Even though the agent has found structure it is clear that the agent's greedy policy has not converged as the agent is still changing actions in some states.
The agent with the larger state space is shown on the right of Figure 8. Initially there looks to be the same switching of actions found in the smaller state space agent's plot, and then over time, the agent starts repeatedly selecting the same actions for a given state and the switching slows. This is in contrast to the results seen in Figure 7, which demonstrates that the switching converges much quicker than exhibited in the trace plot. The initial noise was not due to action switching, but was the result of state discovery. It was found that with a large state space, agents switch actions as they get discovered, and then start to repeatedly sample the same action. The large number of states seen in the state space plot suggests that the agent is switching actions many times, per state; but there are only a few initial switches and a given state gets discovered by a large number of agents. Due to the size of the state space, individual states don't get visited as often and therefore updates to the policy are much slower, once again demonstrating the need for more episodes needed to train the larger state space agent.
It was found that the number of trades decreased over the episodes for all the agents, this is shown in Figure 6b. If the agent were to trade exactly x i shares (corresponding to action 1), and there is enough liquidity, the agent will be asked to make a trading decision approximately 430 times. Here the ABM is simulating an illiquid stock, and there may not be orders on the contra side of the order book. Moreover, the agent may trigger a volatility auction if it trades and will therefore, because of the rules of the ABM it will not submit an order. Initially, this is why the number of trades is less than 430.
The illiquid nature of the ABM means that if the agent submits an order, it may not be fully matched. All these challenges mean that the agent is at some risk of not fully executing the parent order. Therefore, it can be seen that the agent has learned to trade a higher volume, which means that it reduces the number of times it trades. It has correctly learnt to do this to mitigate the lack of liquidity and ensure that the parent order is fully executed.
Using the reward function it could be expected that the agent learns to only trade the maximum volume because this could potentially provide the agent with the highest reward based on the volume traded. Figure 6b has shown that the agent has learned that this is not a feasible tactic as it has converged to a higher number of trades than would be allowed by always trading the maximum volume.

Behaviour Analysis
The final greedy policy for the a single RL agent with initial inventory of 43000 and n T , n I , n S , n V = 5 is shown in Figure 9. An action of -1 indicates that the state has  : One-step absolute differences in common states between Q-matrices (left) and the number of one-step action changes for the common states in the greedy policy (right). The Q-matrix differences will not converge to zero as the learning rate was not decayed, but we do see convergence, around constant values, in all agents' Q-matrices. The one-step action changes for common states in the greedy policy demonstrate a rapid decrease in the number of action changes as the episodes increase. We also see that the larger state space converges faster than the smaller state space indicating fewer action changes. However, this is not likely due to the finding of the optimal policy but because the larger state space agents are still discovering new states and visit previous states far less often causing fewer updates and therefore fewer action changes. The trace plots in Figure 8 demonstrate this rationale.
not yet been reached. The states that are not visited correspond to situations the agent is highly unlikely to be in. For example, having a large (resp. small) amount of inventory when there is little (resp. lots) of time left. These states will seldom occur as the agent is unlikely to have traded most of the inventory when in the initial stages of the simulation nor will it have traded almost nothing for most of the simulation.
If you fix a specific time state and move upwards in the figure you see that the agent has learned to trade more, averaged across the spread and volume states, as the inventory increases. It can also be seen that if you fix a specific inventory state and increase the amount of time that the simulation has been running for (by moving left) you can see that, on average, the agent is learning to trade more. The agent has clearly learned to trade at different rates depending on the time and inventory states. From this figure there are no obvious changes in actions as the spread and volume states change, however, these effects will be investigated more thoroughly in the next section. This policy is for a single agent but the results are consistent across across all the agents and can be seen in Figures  F.29a, F.31, F.33, F.30a, and F. 35.
To see the change in behaviour of the agents across the episodes, the action volume trajectories for the first and last episodes are plotted in Figures 10a and 10b. Consider Figure 10a, in the first episode the agent is randomly selecting actions and there is no pattern in the volume trajectory. For the last episode the agent has learned to trade higher volumes initially, and at the end of the simulation, when most of the inventory has been traded, the agent starts to submit smaller orders. The higher inventory traded early in the simulation causes the agent to trade less when compared to the first episode. Figure 10a shows the action volume trajectory for the RL agent with the larger state space. Here the agent has learned to initially trade more; however, there are more variations in the actions when compared with the smaller state space agent. These figures only visualise the results for two out of the six agents. The remainder of the agents' volume trajectories can be found in the appendices in Figures F.29b, F.30b, F.32, and F.36 and give similar results.

Learning dynamics
To understand if the RL agents have learned intuitive behaviour, such as trading when the best bid volume is high and when the spreads are narrow, we select the states that correspond to high volume and narrow spreads and take an average of the actions taken by the greedy policy. If the agent has learned to trade higher when the volumes are high and the spreads are narrow we should see higher average actions than when the volume is low and the spreads are narrow. To gain a better understanding of how the agent has learned we plot the average actions over all episodes in Figures 11 and 11.
For the smaller state space agents, low volume and high volume states correspond to states with v ≤ 2 and v ≥ 4 respectively. While narrow and wide spread states correspond to s ≤ 2 and s ≥ 3. Having an extra state for the wide spread states was chosen because 60% of the spread states correspond to s = 1. These states get visited more often and get updated more frequently, therefore allowing the agent to learn the state-value function better in these states. To counter this imbalance we add an extra state to ensure that the wider spread states are sampled a greater number of times.   larger state space (right) with the same volume. The -1 action indicates a state that has not yet been visited and a jump from -1 to the other actions means the state has just been discovered. Looking at the smaller state space agent we can see that initially there is a large amount of changing of actions for each state, however, over the training period, the number of changes decreases and the agent starts to repeatedly select the same actions. Due to most of the states already being discovered the action changes are largely due to the action-values of previous states being modified. The larger state space agent seems to initially have lots of jumps which decrease over the episodes. However, it was found that this pattern was mainly due to state discovery rather than learning of actions. Trace plots for the remaining agents are similar and found in the supporting materials.
states are assigned based on the observed spread (s k ) after some event k, and the simulated historical spread distribution. The table also shows the approximate probabilities of being in each state. Looking at this table we can see the imbalance in the probability of observing the different states, which will impact the learning of the agents in each of the states. Figure 11 shows the average action for high (resp. low) volume and narrow (resp. wide) spreads for two of the RL agents with the smaller state space, but with different initial inventories. In both plots, we can see that the agent has learned to decrease the amount of inventory that is traded as the agent was trained for longer. This could be a sign that the agent was initially trading lots of inventory due to the reward signal but subsequently learned that this causes a larger price impact and hence reduced the inventory traded. For the lower initial inventory case (left), the agent has not learned to trade more when there is high volume and narrow spreads. However, when the initial inventory is increased (right) we see that it has learned the intuitive behaviour. The average action for the third agent with the smaller state space can be seen in Figure  F.37a, with similar results to the plot on the left.
To compare the learning dynamics of the agents with the larger state space to the agents with the smaller state space we plot the average action for the high (resp. low) and narrow (resp. wide) spread states. Figure 11 shows these results. Low and high volume states are given by v ≤ 4 and v ≥ 7 respectively, and narrow and wide spread states are given by s ≤ 4 and s ≥ 6 respectively. Once again we see that the agents are learning to trade less on average as the agents get trained. The larger state space agents trade more when the spreads are wide and the volume is low instead of trading more when the spreads are narrow and the volume is high. These agents have not learned intuitive behaviour.
Speculating as to why this is the case: when s = 1, being 60% of the spread states, we have that these states get returned to more often and the agent can learn optimal actions in these states better than in the other states. Moreover, the agents are learning to trade less in all the states over the training period so it could be the case that the agents are visiting these states more often and learning to reduce their volume. While the wider spread states are visited less often and are going to be learning to reduce the volume at a slower rate, which is what we see in the plots. This effect does occur in the smaller state space agents but it is somewhat mitigated. Once again the agents with the larger state space should be trained for longer so that the agent can visit each state-action pair more often. The plot for the third agent with the larger state space can be seen in Figure F.37b, for similar results.
To understand how the agents have learned to trade for different time and inventory states we chose to see how the agents' actions differ when they trade at the beginning, with a large amount of inventory, as compared to when they are trading at the end with less inventory. States with a lot of, and a little time, remaining are classified respectively as states that have t ≥ 4 and t ≤ 2. States with a lot of, and a little inventory remaining have i ≥ 4 and i ≤ 2 respectively. Figure 12 plots the results for the two agents with the smaller state space, the third agent's results can be found in Figure F.38a in the appendix. It is clear that the agent has learned to trade more inventory For each time and inventory state combination, we plot the actions taken across all volume and spread states. Fixing a particular inventory state and increasing the time past (moving left to right), you can see that the agent has learned to trade more as the time increases when averaged across the spread and volume states. This effect will be made more clear in Section 5.3. Furthermore, if you fix a time state and increase the inventory remaining (move bottom to top), you see that the agent has learned to trade at a higher rate when there is more inventory remaining. From this policy plot, it is not clear that the agent has learned any behaviour related to spread and volume. This will be investigated more thoroughly in Section 5.3. This example is for a single agent but the policies of the other agents show similar behaviour and can be found in Figures F.29a, F.31, F.33, F.30a, and F. 35. when it is early in the simulation and the agent has a large amount of inventory when compared with trading later when it has less inventory. This effect is present no matter how much initial inventory the agent was given. This finding corroborates the results seen in Figure 10a.
This effect may be due to the agents trading higher volumes initially in order to guarantee the execution of all the inventory, but then reducing the volume later to reduce the price impact. It must also be noted that the average action is rapidly decreasing when there is less time and inventory and if the episodes were increased it is likely the agent would continue to decrease the amount of volume traded in these states.
The time and inventory interaction results for the two agents with the larger state space are shown in Figure 12, with the third's results being shown in Figure F.38b in the appendix. States with a lot, and a little time, remaining are classified as states that respectively have t ≥ 7 and t ≤ 4; and states with a lot, and a little inventory remaining have i ≥ 7 and i ≤ 4, respectively. Once again we see that both agents have learned to decrease the amount of inventory traded as the agents are trained. However, the same effect, of trading more initially when the agent has a large inventory is not seen in the agents with the larger state space. The agents are trading roughly the same amounts in each state combination.
This reason for this discrepancy between the state spaces can be seen by plotting the marginal effects of time and inventory on the average actions. Figures F.39 and F.40 in the appendix plot these results. Here we see that all the agents have learned to trade, on average, more when there is less time remaining and trade more when there is less inventory remaining. However, in the smaller state space, we have that the time effect is much less than the inventory effect, which translates to the agents trading more when more time and inventory remain. However, for the agents with the larger state space, the time and inventory effects are roughly similar, therefore the effects are cancelled out in the interaction plot.

Stylised fact dynamics
Every 100 episodes, including the first and the last, all the raw messages of the simulation were recorded. The stylised fact moments were computed at each of these points so that we could see how adding in the RL agents   (1) at the top, and the last episode (1000) at the bottom. In the first episode, the agent is randomly selecting actions and there is no pattern in the volume trajectory. For the last episode, the agent has learned to trade higher volumes initially and at the end of the simulation, when most of the inventory has been traded, the agent submits smaller orders. The large agent has also learned to trade more initially, however, there is more variation in the actions when compared with the smaller state space agent seen in Figures 10a.
affected the stylised facts of the ABM, and if these effects changed as the agents learned to trade.
Four key moments were computed on both the JSE, and the ABM, and are plotted in Figure 13 along with their associated confidence intervals. This includes the moments for each of the RL agents over the episodes. The GPH statistic was roughly the same for the ABM and the JSE, and adding an RL agent to the simulation has not made a significant difference. This is also true for the GARCH moment, except for one erroneous point at the start of training for one of the agents. For the Hill estimator, the confidence intervals were close but did not overlap and the moments for the ABM's with the RL agents have Hill estimators that fall into the ABM confidence interval. The Hurst exponent moment was the only moment that showed slight differences between the ABM with and without the  Figure 11: We select the states that correspond to high (resp. low) volume and narrow (resp. wide) spread and take an average over the actions selected by the greedy policy to determine if the agent has learned to trade more (resp. less) when the volume is high (resp. low) and when the spreads are narrow (resp. wide). Figure (11a) shows that at lower initial inventory levels the agent was not able to learn this behaviour. However, for larger initial inventory levels ( Figure 11b) the agent was able to learn the correct intuitive actions. In both plots the agent has learned, on average, to decrease the amount of inventory traded. The agents with the larger state space have not been able to learn the intuitive actions as they trade significantly more when the volumes are low and the spreads are wide and are shown in Figure 11c and Figure 11d in the lower row.   Figure  12a) and when the agent has less inventory at the end of the simulation (Figure 12b) are plotted in the first row. The agent has learned to trade more when it is early in the simulation and the agent has a large amount of inventory. This effect is present when there is a low initial inventory and when there is a high initial inventory. In the lower row are the plots, Figure 12d and Figure 12d the average action were taken by two of the larger state space agents for a different time and inventory state combinations. We find that all the agents have learned to trade, on average, more when there is less time remaining, and trade more when there is more inventory remaining. However, in the large state space, the effects cancel out whereas in the smaller state space the inventory effect dominates.
RL agent. For the RL simulations, the Hurst exponent is lower than that of the ABM and consistently falls below the lower confidence interval of the ABM. The only ones to fall into the ABM's confidence interval are the RL agents that had the smallest parent orders. Indicating that the larger the amount of inventory traded by the RL agent, the more this increases the amount of mean reversion in the simulation. As the episodes increase there is a slight downward trend in the Hurst exponent for the agents with larger parent orders. Figure 13 only shows four out of the eight moments. The results for the other moments can be seen in Figure F.41, and they are similar to the results of the GPH, Hill, and GARCH estimators.

e-ABM
Even though the objective surface was difficult to optimise the calibration was able to find reasonable parameters. As argued by Fabretti [20], a successful calibration would mean that we need to show that the simulated data comes from the same, or at least similar distribution as the empirical data. We did this by comparing the moments of the simulated price path to the empirical price path and found that, except for the KS statistic, the moments of the simulated price path are similar to those found on the JSE (See Table 6). We saw that four out of the eight moments had overlapping confidence intervals. While there are clear differences between the distributions, the simulated data reasonably resembles the empirical data.
While acknowledging the challenges with a stylised fact based comparison, we found that the model was able to correctly reproduce many of the stylised facts observed in the market. Namely, Fat-tails, volatility clustering, significant negative autocorrelations for only the first few lags in the log-returns, decay in the trade-sign autocorrelation, and similarity in the extreme-value distributions of the ABM and the JSE. Even though the model was able to replicate the decay in the autocorrelation of the trade-signs, we did find that the decay did not exhibit the long-memory property that was found on the JSE. This can be seen in Figure B.19. We know that this decay is due to two effects: herding and order-splitting.
The e-ABM did not have any order-splitting agents therefore it is not surprising to see that this stylised fact has not been fully reproduced. Additionally, there is a large herding effect caused mainly by the chartists all submitting approximately the same orders at the same time, and it was also observed that they change their trades from buys to sells very slowly. The RL agent is an ordersplitting agent by construction, so to investigate whether this agent affected the decay of the trade-sign autocorrelation, we plot the trade-sign autocorrelation decay for all six of the RL agents in Figure 14. We see that adding a single learning agent that performs order-splitting, does not guarantee the correct decay in the trade-sign autocorrelation. However, agents plotted in green, blue and pink, all illustrate that power-law decay is possible using this model. After a few initial lags, the decay of autocorrelation is approximately linear on the log-linear scale indicating that the decay is moving closer to a power-law relationship.
The absolute log-returns are compared in Figure B.20 which shows those for the e-ABM ( Figure B.20a), those found in the real world measurements from the JSE (Figure B.20b) as compared to the simulations of the e-ABM with a single RL agent over different initial inventory and state space sizes (Figure 14b). We can see from this figure that the autocorrelations on the JSE decay far slower than the simulated moments, suggesting that there is higher volatility clustering found in the real market when compared with the simulated markets. We do not see significant differences between the autocorrelations of the e-ABM and the e-ABMs with the single RL agent added.
Changing the stochastic time model of Jericevich et al. [37] to an event-based model does change the stylised facts. The first difference was that the e-ABM model was able to increase the variance in the returns and reduce the amount of density at zero returns. This made the logreturn distribution more similar to the empirical distribution. The e-ABM model reduced the difference between the tail-index parameters of the simulated and empirical returns observed in the e-ABM when compared to Jericevich et al. [37]. There are clear differences in the autocorrelation of the returns. Jericevich et al. [37] found significant, positive and negative lags, for a larger number of lags, when compared with the e-ABM model. We only observed significant negative correlations for the first few lags. The log-return autocorrelations in the e-ABM model are thus more similar to the empirical data. The reason for this is likely because we have multiple fundamentalist agents who all trade mean reversion strategies, rather than just the one that was implemented in Jericevich et al. [37]. Both models were able to show persistence in the absolute log-return autocorrelations indicating volatility clustering, fat-tails in the log-returns, as well as correct depth profiles. Additionally, Jericevich et al. [37] was able to produce more accurate decay in the trade-sign autocorrelation.
Price impact curves for the e-ABMs that include the RL agents are shown in Figure 15 with the separation between buyer initiated (Figure 15a) and seller initiated (Figure 15b) trades. The linear upwards trend in the price impact on the log-log scale is indicative of a power law price impact function, where the price impact increases with the volume of the trade. Comparing these figures to Figure 4, we see that adding in the RL agents does not change the power law price impact seen in the underlying e-ABM. Even though the RL agents are all selling agents the power law price impact is demonstrated for both buyer and seller-initiated trades. Additionally, we observe the price impact increase as the inventory traded increases. These price impact curves are consistent with previous research [30,35,46,37].  Figure 13: Here we compare the ABM, and JSE moments to the moments computed on data generated by the simulations of the ABM with the RL agents. During the training of the RL agents, the moments were computed every 100 episodes. As the agents trained we could see if this generated an effect on the moments. The GPH statistic remained similar for the ABM and the JSE, and adding an RL agent to the simulation has no significant impact. This is also true for the GARCH moment, except for an initialisation. For the Hill estimator, the confidence intervals were close but did not overlap and the moments for the ABMs with the RL agents have Hill estimators that fall into the ABM confidence interval. The Hurst exponent moment was the only moment that showed small differences between the ABM with and without the RL agent. For the RL simulations, the Hurst exponent is lower than that of the ABM and consistently falls below the lower confidence interval of the ABM. The only ones to fall into the original ABM's confidence interval are the RL agents that had the smallest parent order, indicating that the larger the amount of inventory traded by the RL agents the more this increases the amount of mean reversion in the simulation. This plot only shows four out of the eight moments. The results for the other moments can be found in the supporting materials and are similar to that of the GPH, Hill and GARCH estimators.   Figure 14a. The plots for the autocorrelations in the absolute log-returns are in Figure 14b. The state space sizes and initial inventories are given in the legend. We see that adding a single learning agent that performs order-splitting does not guarantee the correct decay in the trade-sign autocorrelation. However, agents plotted in green, blue and pink in Figure 14a, all illustrate that adding a strategic order-splitting agent can reduce the trade-sign autocorrelations. We note that the addition of agents engaging in strategic order-splitting can boost the autocorrelation in the absolute returns as seen in Figure 14b on the right.

Reinforcement learning
Each RL agent learned to increase its profit and decrease the variance in the profit from an initial random agent to a reasonably optimised agent, and this is shown in Figure 5. The liquidity effect caused by increasing the size of the parent order is evident, as the agents with larger initial inventories increase the frequency and severity of the liquidity shocks. This is consistent with the power-law price impact effect seen in the underlying ABM.
What could be seen as a surprising result is that none of the agents were able to significantly improve on the amount of profit received over trading an initial single order with the value: X 0 × m 0 . In previous implementations of RL to solve the optimal execution problem the agents are trained offline with the assumption that the order book is resilient. Meaning that when the agent trades and pushes the price down, this does not affect the next price observed. This assumption has to be made because the agents are trained offline using static data sets, and the data was historic and used to generate a historic simulation of changes in the prices. This was because in a static data set the next observed price, that in the real world would be caused by that trade, is not easily simulated in a way that captures all the other agents' reactions to that specific RL agent's specific trade. In this ABM, we train the RL agent online using a dynamic data set, and we have that the trade from an RL agent will cause other agents to react to this trading decision. We speculate that this may be one of the reasons why we see that the RL agents are then unable to increase the profit more than X 0 ×m 0 as there is a permanent contribution to price impact. This can be tested by taking a static simulated price path, such as the one used in prior work, training the same RL agent with the same implementation as seen here, and determining if it can improve upon X 0 × m 0 . Consider the decay function ( Figure 1) and the convergence of the Q-matrix (left of Figure 7) and notice that Q-matrix differences stop improving suddenly at a similar time to when is at its lowest ( = 0.01). Agents with the highest X 0 show this effect most prominently. The rate of decrease in the one-step difference of the Q-matrix is much faster before is at its lowest indicating that the agents' Q-matrix was converging rather quickly.
It is reasonable to speculate that having more training episodes and a slower decay in , will promote greater convergence in the Q-matrix which will indicate that the agent is learning a better policy, which could increase the profit beyond X 0 × m 0 . This should be the case for the larger state space agent but Figure 7 suggests this could also be the case for the smaller state space agents. Furthermore, we saw that the policies are changing very slowly and to see significant behaviour changes we will have to significantly increase the training time for the RL agents.
In all the learning dynamics we saw that the agents initially increased the the inventory it traded, as shown in Figures (11, 11, F.37a, F.37b, 12, 12, F.38a, F.38b, F.39, and F.40). This is likely due to the initial reward signal, and as the agent learned it, they started to significantly decrease the inventory traded. This is evidence that the agent is learning about price impact and that it must reduce the inventory traded if it is to maximise its profit. When considering the smaller state space agent, the decreases for the most part have not converged, demonstrating that the agents have not converged to stable policies and are still trying to improve the return. This is additional evidence that the agents could be trained for a longer period, and that this could change their behaviour and improve their return.   Figure 4, we see that adding in the RL agents does not change the slope of the power law price impact function relative to that seen in the underlying e-ABM. The observed price impact increases as the inventory traded increases and these price impact curves remain consistent with previous research [30,35,46,37].
All RL agents learned to trade more when they have more inventory and trade more when they are at the end of the simulation, as evidenced by Figures F.39 and F.40. These are intuitive actions because if you have a large amount of inventory you will want to increase your rate of trading to ensure that you fully execute the parent order. This is also true for when there is little time remaining then on average you will want to increase the inventory traded to increase the chances of fully executing the parent order.
The agents' dynamical behaviour change from the start to the end of the simulation and this is shown in Figures 12  and 12. These confirm that the agent will start with high time and inventory states, and move to the lower states as time passes and they trade inventory. For the smaller state space agent, we see large differences in the behaviour of the agent at the start and end of the simulation, as the agent has learned to trade significantly more initially and reduces the amount traded at the end of the simulation. However, the larger state space agent does not exhibit a change in behaviour as the simulation progresses. The smaller state space agents have learned the correct trading behaviour when the initial inventory is high, but it is still learning this effect for the lower inventory levels. However, since convergence in the policy has not occurred it is speculated that the agent will learn this if the episodes are increased. The larger state space agents have not learned the correct behaviour.
This shows that smaller and larger state space agents have drastically different behaviours. The remaining question is why is this not shown as different profits in the return function? A potential reason for this is that each RL agent's profits are being washed out by the other agents in the ABM because they are reacting to their trades, and even though the agents' behaviours are different they are not different enough to overcome this effect. Figures 13 and F.41, demonstrate that adding an RL agent to the ABM does not cause the moments to differ too far away from the underlying ABM's moments. Illustrating that the underlying ABM's moments are robust to the addition of a single RL agent.

Limitations and future work
Here we faced many of the same challenges as those faced in the model of Jericevich et al. [37]. From the depth profiles, shown in Figure B.21a, we see that most of the volume is placed at the best bid/ask, which conforms well to what is empirically observed [37,62], and is the result of using a Gamma distribution to generate the price placements [37]. When the volumes are generated from a power-law distribution we find that either the LTs are unable to generate price moves due to the large volumes at the best bid/ask, or they quickly clear out one side of the LOB, generating liquidity shocks and extreme price movements. This can make the ABM fragile and difficult to optimise as different runs with the same parameters then produce very different price paths. This effect is exacerbated by the increased number of orders submitted by the agents. It was observed that a single order could have a volume of up to 1 × 10 7 , for either LPs or LTs.
Moreover, using the calibrated parameters we found the ADV to be 745 362 with a 95% CI (593536, 1 008 027), estimated over 100 simulations. This is comparable to the traded volume of 498 981 as found on the day of JSE data we used in the calibration. The range of traded volumes in the e-ABM is large, with the lowest values being of the same magnitude as the JSE, but the highest values being almost twice as large. The e-ABM has more volume traded than the JSE, which is likely due to the large amount of limit order volume at the top-of-book. For the e-ABM we have that the average volume at the best bid and offer is 197 909 744 95% CI (32 206 342,634 851 798). This is far higher than the 2 695 897 observed on the JSE. The larger volumes at the best bid and offer mean that the traded volumes need to be larger to ensure that the e-ABM estimated moments from the calibration procedure are similar to those observed in the real market.
Looking at the empirical evidence [9] we could generate price placement as a power-law, and then generate the volume conditional on the price. The conditional distribution of the volume based on the price is said to be independent of the price for the first 20-50 ticks and then decay as a power-law Bouchaud et al. [9]. More in-depth analysis of the price placement, limit order volume distributions, and trade distributions are necessary, especially for the JSE if we wish to be drawn from realistic distributions that match the market we are going to calibrate to. This did not detract from the overall intention of this work, which was aimed to demonstrate the impact of strategic ordersplitting.
As discussed previously, we have observed that the decay in the trade-signs does not completely match what is observed in the market. Indicating that the agents are submitting similar trades, and for longer periods of time, than what they would do in the real market. This rule should be changed to give greater heterogeneity to the agents so that they don't submit similar orders at similar times, and they should be programmed to change their orders at a quicker rate. However, here we show that by adding order-splitting agents, such as optimally executing RL agents, we can produce more realistic decay rates.
A further limitation to this model that can be improved, is to increase the number of episodes for training the RL agents. We have demonstrated that for the smaller state space agents, the agents are still decreasing the average inventory traded and there is evidence that increasing the number of episodes would make consecutive Qmatrices more similar, thus indicating the convergence in Q-matrix. Moreover, we have observed the policies changing at a slow rate and significantly increasing the episodes will increase the chances of the agent improving the return beyond that of a single large order of value: m 0 × X 0 using and immediate execution strategy.

Conclusion
This work combines a realistic but minimally intelligent agent-based model with a learning agent with bounded rationality in event time. This required moving the market clearing mechanism into one that is event-driven so that the minimally intelligent agents can interact with the learning agent in a truly asynchronous continuous-double auction market [39] without the need for batched market clearing events to be externally imposed in calendar-time.
The key challenge for this type of implementation is the requirement of moving the agent-based model (ABM) into an event-driven agent-based model (e-ABM) framework that is both reactive and reasonably realistic (Section 2).
The e-ABM was built on two class of agents: liquidity takers who only submit market orders, and liquidity suppliers who only submit limit-orders. The liquidity takers were split in turn into two sub-classes: fundamentalists and chartists. The fundamentalist started each trading session, such as a trading day, with different notions of fair or fundamental value, and traded accordingly; while the chartists considered a moving average of past price changes in order to trigger trading decisions. To adapt the hybrid ABM, the e-ABM model includes a spread condition as a cost to trading, and the software implementation was moved into reactive framework (Sections Appendix A, Appendix A, and Appendix A). Our implementation explicitly excludes direct calibration to either the order-flow, the trade-sign autocorrelations, or the shape of the price impact.
We were able to replicate key stylised facts of the market, these are demonstrated in Section 4 and Appendix B. The real market observations were based on trade and order-book data extracted from the Johannesburg Stock Exchange (JSE) as used by Jericevich et al. [38,37]. The recovery of the stylised facts are evidenced by Figures B.16 through B.18 where, respectively, the log-returns exhibit fat-tails, similar extreme value distributions to those observed in the market, volatility clustering, as well as replicating the ACF of the log-returns. The model was also able to capture the decay in the auto-correlation of the tradesigns, but with has a slower decay rate when compared with the JSE as evidenced by Figure B.19. The JSE exhibited the long-memory property described by Lillo and Farmer [45], however, this was not replicated by the e-ABM. The decay of the auto-correlation in the absolute values of the log-returns was slower in the real market than in the model. However, the model was able to replicate reasonable price impact functions and depth profiles shown in Figures 4 and B.21a. The learning agent with bounded rationality was represented by a simple tabular Q-learning algorithm that was learning to solve an optimal execution selling problem using only market orders. The algorithm considered two market states based on features representing the limit-order book: the spread and the best bid volume; and then two states representing the current execution agents states: the remaining time available in the trading session, and the agents remaining inventory. The RL agent attempted to learn how to modify an existing TWAP trading schedule to optimally execute its sell order.
After the calibration of the e-ABM the single RL optimal execution agent with its sell order was added into the market to interact with the e-ABM. The RL agent then attempts to modify its event-based TWAP strategy. We tested this agent with different parent order sizes and different state space sizes to observe whether any agents' profits could be found to increase and converge, to improve a random initial agent configuration. It was found that all agents were able to increase the profits over their initial random agent, demonstrating that each agent has learned. However, it was noted that there was an upper bound on the profits obtained. No agent was able to learn how to improve upon m 0 × X 0 , meaning that the agents were able to reduce the implementation shortfall, but they were not able to increase profits enough to observe a consistent and significant negative implementation shortfall. Adding RL agents with increasing order sizes caused the ABM to have significant liquidity effects as expected.
There were clear differences in the dynamics for agents with different state space sizes. The smaller state space agents had the number of states they visited converge much faster than the larger state space agents. They quickly finished the state discovery process while the larger state space agent was still rapidly discovering new states, which limited their ability to learn. The larger state space agents showed less switching of actions in each state when compared with the small state space agents, which was due to the agents not returning to individual states as often to update their state-action values.
This showed that the larger state space agents had difficulty learning in such a small number of episodes. This was also evident in the behaviour learned. If a large state space agent had a large amount of inventory at the beginning of the simulation versus a smaller amount at the end, the agent did not learn to change behaviour while the smaller state space agent learned to trade more initially and then reduce the rate of trading at the end. There was also evidence that the smaller state space agent was able to start learning to trade intuitively using the spread and volume states. Given this, all agents learned to reduce the number of trades and guarantee the execution of the entire inventory, and they learned some structure during the training process as evidenced by the change in the action volume trajectories over the training period.
The statistical analysis work is based on the comparison of the calibrated e-ABM via its simulated moments against the same e-ABM in the presence of the single RL optimal execution selling agent with varying initial order sizes. The e-ABM was calibrated using a simulated method of moments where the optimisations are implemented using Nelder-Mead with Threshold Acceptance [27] approach, using adaptive parameters [22]. This method of calibration was chosen to balance computational speed with stability Appendix D and was informed by prior work [27,22,37]. The calibrated parameters are provided with indicative confidence intervals in Table 5 following [37]. We see that the most significant parameters are, first, the variance in the fundamental value σ f , and then the intensity of the HFT order placement κ, the number of chartist liquidity takers N c LT (relative to the overall agent population), the shape of the order-imbalance ν, followed by the number of fundamental liquidity takers N f LT and the mid-price cut-off that sets the minimum volume lot sizes δ, these are described in Table 1.
Testing the effect the RL agents had on the moments of the ABM, we plotted the moments of the e-ABM that included the RL agents over the training period and compared them to the ABM's moments. We saw that, with the exception of the Hurst exponent, the moments were reasonably robust to the addition of a single RL agent even with increasing initial order sizes. The simulations are run on a system that consists of two parts: the Matching Engine (ME) and the Model Management System (MMS). The matching engine is its own piece of software that receives orders from the MMS, processes them, and sends messages back to the MMS notifying it that an event has occurred. The MMS consists of two parts, the first is the market-data listener that listens for update messages coming from the matching engine [37], and the second is where the LOB, maintained by the MMS, gets updated and the actors submit orders to the matching engine based on the current state of the LOB [16].

Matching Engine (ME)
The matching engine implementation is CoinTossX [63]. The first release was not designed to support a continuous market-data feed, which is needed to run an event-based simulation. Previous work that implemented CoinTossX submitted orders to the matching engine and requested snapshots of the LOB maintained by CoinTossX [36]. This is a high-coupling and low-cohesion design, as the matching engine is doing work that should be implemented by the MMS. Requesting a snapshot of the LOB is a time-consuming task that affects the low latency requirement of the matching engine and it was included in the first release as a convenience to be used by developers as a debugging tool, and not as part of a simulation workflow [37]. For this reason, a UDP socket was added to the matching engine [37,40] to send event messages 17 to the MMS and the internal debugging tools depreciated. This then achieves the desired low-coupling and high-cohesion design pattern required for simulation work.
Logging in and out, starting and stopping the simulation, and the submission of orders are submitted to the matching engine by calling functions using a java wrapper 18 implemented in Julia. These orders are processed by CoinTossX and messages are sent to the MMS over a network using a UDP socket. These messages are captured by an asynchronous task in the MMS which allows the MMS to build its own LOB. Messages received from the matching engine represent events in the event-based simulation that the agents will trade-off.
A key requirement of the event-based ABM is that the matching engine is low latency. To ensure this the messages in the continuous data feed are encoded and decoded in binary format to increase the speed of the transmission of the messages over the network [37].

Model Management System (MMS)
The MMS is split into 2 tasks: the first being the market-data listener, and the second being the event processor/generator. At the time of the project, Julia's multi-threading capabilities were relatively new and there were still issues in its implementation. Therefore the tasks could not be reliably implemented in a multi-threaded environment. 15 The depth profile measures the average volume observed at different prices on the ask and bid sides. 16 The actual simulation time is 25 seconds (in machine time) but due to the Just-In-Time (JIT) compilation of Julia and the initialisation of the LOB a single simulation run takes approximately 26 seconds. 17 Event messages refer to the messages detailing that a trade, cancellation or limit order has occurred. 18 The java wrapper package used is called JavaCall.
Thus an asynchronous programming approach became necessary to split the two tasks in the MMS [16]. Now because the code is run on the same thread the program is thread safe. The @async macro was used to implement the asynchronous programming [7]. The main task is the event processor/generator where the messages received by the market-data listener are used to update the LOB in the MMS and then generate trades from the agents based on the updated state of the LOB. The secondary task consists of the market-data listener that receives messages from the matching engine. The market-data listener was implemented in the secondary task because the call to read from the UDP socket blocks when there are no messages in the UDP buffer. To switch between tasks the main task is slept for a brief moment to allow the market-data listener to read the new messages from the UDP buffer. The amount of time the main thread is slept for depends on the number of agents submitting orders to the matching engine and the speed of the hardware the simulation is running on. If the main thread is slept for too short a time and there are a large number of messages that need to be processed, messages get dropped. Therefore, as the number of agents increases the time that the main thread is slept for will need to increase.

Market-data listener
The relationship between the two tasks can be thought of in the producer-consumer paradigm [18]. The market-data listener produces the messages, which it receives from the matching engine, and the event processor/generator consumes these messages and uses them to update the LOB. Getting data from the market-data listener is not as simple as a function call because there may be more than one message it has received. Julia provides a channel 19 mechanism for solving this problem. The market-data listener reads messages from the UDP buffer and places them into a channel that is used by the event processor/generator to read the messages.

Event processor/generator
The event processor/generator task is the main task. Since this is an event-based simulation the driver of the simulation is the event loop. In each event loop, all the new messages that were received by the market-data listener and placed in the channel are taken from the channel and used to update the LOB. The updated LOB is then sent to all the agents and they trade based on the state of the LOB, their current state, and the agent rules defined in Section 2. See the pseudocode for the event loop in Algorithm 1.
An event in the simulation is classified as an update to the LOB generated by all the order messages generated in the previous event loop. For example, if three orders were generated in the previous event loop the market-data listener would read the three new events from the UDP buffer and place them into the channel. Then in the next iteration of the event loop, those three events would be removed from the channel and they would be used to update the LOB. Classifying a single message updating the LOB as an event the agents would trade-off was tested, but it created a large lag between the matching engine's LOB used to match the orders and the MMS's LOB, which meant that agents were trading based on a stale LOB. The stale LOB lag effect was then exacerbated as more agents were submitting additional orders.
A simple Q-learning RL agent needs to observe the state of the environment after taking an action and the lag would mean that the effect of their action would only be observed after a large number of iterations, and in the worst case, the effect of their actions may not be observed by the agent if the simulation has ended before the update is received. Additionally, the software is designed in such a way that adding more RL agents to the simulation is straightforward in order to support future work that will use this software to analyse what effects there may be when adding additional learning agents to the model.
The current setup means that the market-data listener and the updating of the LOB are not in the same task. In an actor-based model, the market-data listener and the updating of the LOB would likely be placed in the same thread because you would want the messages to directly update the LOB and not have to go through another process. One limitation of the model is that it could not be multi-threaded and asynchronous programming had to be used. Updating the LOB in a separate task was done to ensure that the market-data listener had the fastest execution time possible so that the main thread could sleep for the least amount of time. This increases the speed of the program as well as allowed more messages to be processed, which helped the simulation scale better as the number of agents increased.

Ordering of LOB updates
The Reactive.jl framework delivers messages to actors in an ordered manner. The order is defined by the order in which the actors are initialised. For example, if actor A was defined before actor B then a message passed to actors A and B, who subscribe to the same source, would cause actor A to receive and process the message before B. The processing of a message can generate a trade for the given agent. This means that the order of the agents receiving the messages from the source also imposes an ordering on the trades. For example, if agent A is defined before agent B, agent A will always receive a message before agent B, and if they both trade then agent A will always submit an order before agent B. This ordering affects the dynamics of the LOB. Using another example, if liquidity providers always submit orders before the liquidity takers there will always be additional liquidity provided in the LOB before liquidity is removed. This may make the order book appear more resilient than is defined by the model rules. To ensure that an arbitrary ordering of agents does not affect the ABM it was decided to randomise the order in which agents receive messages (LOB updates), which will cause the order of the trades submitted to also be randomised. A uniformly random ordering ensured that the results of the ABM do not depend on an arbitrarily defined ordering of agents.

Simulation initialisation
In Jericevich et al. [37] the simulation was able to start with an empty limit order book. This did not work for the event-based ABM as it causes frequent liquidity crashes because the LTs can remove liquidity faster than it could be provided. Therefore, it was decided that it was necessary to start with a certain amount of liquidity in the LOB before the simulation starts. With the limit-order book initialised with some limited liquidity the frequency of crashes was reduced, and could be viewed as the outcome of an open auction.
The order book is initialised by having the LPs submit limit orders before the simulation starts. The number of orders submitted is set at 1001. This value was chosen as it provided a good balance between reducing the number of liquidity crashes and reducing the likelihood of having the LOB lock up due to too much liquidity being provided at the top-of-book. Given the agent specifications for the LPs, we have that the number of bids and asks will be roughly balanced and therefore the initial order book will start with only a small order imbalance.
During the initialising of the LOB, the LPs are not allowed to place orders inside the initial spread. This maintains the initial spread and does not allow the LPs to close the spread due to the competition effect without it being pushed back by the liquidity takers. The initial mid-price and the initial spread are set arbitrarily at 10000 and 40 respectively.

Miscellaneous considerations
In the Johannesburg Stock Exchange (JSE) a volatility auction is triggered when the execution price is larger than a 10% deviation from the dynamic reference price [41]. As per [64] the execution price is the price of the best bid or offer that is crossed when the market order hits the LOB, while the dynamic reference price is the deepest price level executed from the last transaction.
The agent specifications defined by [37] do not provide rules for how to trade in a volatility auction. Therefore, they imposed a restriction on the LTs to ensure that a market order is not submitted if it will trigger a volatility auction. The event-based model also does not account for behaviour in volatility auctions and will follow the same method as [37] to prevent them from occurring. This also applies to the RL agents.
Finally, liquidity takers and RL agents will not submit market orders if the contra side of the LOB is empty. A recommendation from [37] is to not submit an order at all as this will help in the data cleaning process, because when a market order is submitted to an empty LOB the data feed returns an empty string. So, submitting a market order or not will lead to the same result. In the event-based model, this will also reduce the number of unnecessary orders submitted, which will allow the main task to sleep for a shorter period and therefore speed up the simulation.

Hardware specifications and run times
The experiments run were all implemented using Microsoft Azure's Virtual machines [52]. The hardware specifications for the virtual machines used are detailed in Table A.8. The sensitivity analysis and the calibration were run on Virtual Machine 1. Due to the reduced number of virtual CPUs (cores) there were instances of some UDP packet loss, this could cause errors in the MMS. Error handling was implemented for the sensitivity analysis and calibration work so that the simulations could resume at the place where errors occurred, in subsequent runs, but without incorrect data being recorded or actual data loss.
For the RL experiments, the server was upgraded to provide for better message load handling. The training of the RL agent was therefore done on the compute-optimised Virtual Machine 2. The increase in the number of vCPUs from 4 to 16 was enough to ensure that during the training of all the RL agents no packets were dropped and no errors occurred. Table A.8 shows indicative run times for each of the experiments. Note that the increase in computing power, from sensitivity analysis and calibration to the RL experiments, was not used to increase the speed of the simulation but was used to reduce the number of packets dropped. The time it takes to run a single simulation takes the same time on both sets of hardware because the simulation is run for a set length of time.  were exactly a power-law we would expect that the data would be linear on a log-log scale. However, we observe a slight downward curve which indicates only near power-law behaviour. Comparing the empirical-versus-simulated returns we see that both distributions observe this behaviour, giving credibility to the returns generated by the ABM. We also note that the ABM's distribution is fatter tailed when compared with the JSE's distribution.

Appendix C. Sensitivity Analysis
Due to computational constraints, six parameters were chosen to be varied in the sensitivity analysis. These are the free parameters defined in Table 1 and were informed by prior work [37]. The parameters that were held constant during the sensitivity analysis as well as their values are also shown. The λ min and λ max parameters were chosen in such a way that a chartist agent has an equal probability of having a long or short trading horizon, and the time to cancellation, of the LPs limit-orders, was arbitrarily tuned to ensure that liquidity is not being removed too quickly from the order book, which would cause frequent liquidity crashes, and having too slow a cancellation time, which would cause almost no cancellations. This would cause behaviour that does not represent that of a high-frequency agent, which the LPs are trying to model. The total simulation time, after initialisation, T was set at 25 seconds in machine time, because it was the time that produced order counts most similar to the day of trading on the JSE that we will use to calibrate the model. On an Average Daily Volume (ADV) basis 25 seconds of machine time simulation is considered equivalent to both 8 hours of typical calendar time trading, and on an ADV count, as a measure of the typical number of volume time events, the volume count is equivalent. Here simulation time is measured in machine time.  Comparing the free parameters in my model with that of [37], it can be seen that I am varying the number of chartists and the number of fundamentalists while keeping the number of high-frequency agents constant. However, [37] kept the number of chartists and fundamentalists constant, at one agent each, and varied the number of high-frequency agents to determine how the ratio of liquidity providers to liquidity takers, N = N LP N LT , changed the simulated price paths. During initial testing of the model, it was observed that having different values for N f LT and N c LT changed the behaviour of the simulated price paths. For example, when there are more chartists (N c LT = 12) than fundamentalists (N f LT = 5) the log-return distribution has a smaller variance and but has a lower tail-index when compared with more fundamentalists and less chartists. A possible reason for this is there are fewer large orders placed by fundamentalists and smaller orders placed by the chartists on account of equations 4 and 5. The initial lags in the trade-sign autocorrelation are larger when there are more chartists and fundamentalists, which is possibly attributed to greater herding behaviour. Furthermore, as the number of fundamentalists increases more large orders are placed, and this caused wider spreads, which would give rise to more extreme price moves.
Due to these observed changes, it was decided that a better parameter set would be to keep the number of highfrequency agents constant and vary both the number of chartists and fundamentalists. Since we are still changing the ratio of liquidity takers to liquidity providers we will also be changing N and therefore, we will still be modelling how a changing N affects the moments as was done in [37].
When implementing the sensitivity analysis we considered 4 different values for each parameter, and for each value, we varied all other parameters and estimated the moments of all the simulated price paths, which leads to a total of 4096 parameter combinations. Therefore, for each parameter value and each moment, there were 1024 values to be aggregated Comparing the ACF for the JSE to rest, we see that the ACF found on the JSE decays much slower indicating that the JSE exhibits a stronger volatility clustering signature. We note that the addition of agents engaging in strategic order-splitting can boost the autocorrelation in the absolute returns as seen in Figure 14b.
(a) The depth profile is computed as the average volume observed at seven price levels throughout the simulation [37]. The maximum volume occurring at the best bid/ask and the rapid decrease in volume as the price increases is consistent with empirical data [62] as well as previous ABM results [37].
(b) The power-law decay in the order-flow autocorrelation shows the longmemory property Lillo and Farmer [45]. Here we plot the trade-sign autocorrelations for the JSE on a log-log scale and observe an approximately linear decrease to recover similar results. Using the MLE of the power-law described in equation 9, we show that α = 1.286, which means that the distribution is not integrable and exhibits the long-memory property.
over. Another benefit of this method is that it allowed us to create parameter surfaces for the 2 way interactions for each of the parameters. These surfaces allow us to visualise how the interactions of two parameters affect the statistical properties of the simulated price paths and the objective function. These effects can be seen in Figures C. 23 and C.24. It must be noted that when implementing the sensitivity analysis, for each of the parameter combinations the moments were estimated for a single run of the simulation with a single seed set which ensured the same fundamental prices and the same chartist forgetting factors. This was done due to time and computational cost constraints. Therefore, these results should be considered indicative rather than presenting the exact surfaces and moment values as there will be uncertainty around these values that have not been reduced through Monte Carlo replications over multiple seeds.
To ensure that the simulations for each of the parameter combinations remain independent of each other we had to start and stop trading sessions with CoinTossX before and after each simulation was run. This means, the LOB was cleared before the start of the next session and there were no orders from the previous session that could be matched with orders from the current session.
Due to a large number of parameter combinations after each simulation, the Julia garbage collector was run. Helping to free up memory from the Julia program and free up memory in the JVM created by the Julia program. This was recommended in the JavaCall documentation. They note that the pressure on the Julia and java heaps are different and the Julia garbage collector may not be run even if the java heap is full and memory could be freed with garbage collection. Therefore, they recommend periodically manually running the Julia garbage collector to ensure this issue does not cause an error. Indicating that a change to a single parameter value is not going to cause significant changes to the model. One parameter-moment combination where we do see a change in distribution when changing the parameter values is between the number of fundamentalist traders and the Hurst exponent. As we increase the number of fundamentalists the distribution of the Hurst exponent shifts to smaller values where most of the distribution lies further below 0.5. As described in Table 4, this represents more mean-reversion in the time series. This is consistent with the fact that the fundamentalists follow a mean-reverting trading strategy.

Marginals
Due to large outliers and large variations in the objective function, for each of the parameters, it is difficult to see a significant difference in the distributions of the objective function across the range of parameter values. The only observation we can make is that certain parameter values may be subject to producing large outliers. For example, smaller values of the κ parameter produce larger outliers.

Objective surfaces
A better way to visualise the changes in the objective function as the parameters change is to look at the behaviour of the objective function for an interaction of two variables. Figure C.23 plots the micro-price objective function when holding two variables fixed and averaging over all other combinations of the other parameters. Recall that the sensitivity analysis was run once for a single seed so there were no Monte Carlo replications. The surfaces are also course grained and made up of a 4x4 grid. Therefore, these surfaces are just an indicative representation of the objective function and are not to be interpreted as the true objective function. Despite this these surfaces let us see some indicative patterns.
In almost all the plots we have that the objective function decreases from areas of extremely high values to much lower values, giving evidence that optimisation is possible. However, optimisation of the objective function will not be trivial since there is evidence for local minima where the optimisation can get stuck. For example, consider the surface for parameters N f LT and ν. The optimisation algorithm could possible get stuck in the local minima at N f LT = 9 and ν = 7 rather than reaching the actual minimum at N f LT = 3 and ν = 2. Another reason that the objective function will be difficult to minimise is that the minima that the optimiser converges to may depend heavily on the initial state of the optimiser. For an extreme example, consider the interaction surface between N f LT and δ. Depending on which side of the saddle you are on will cause you to find different minima. These two issues make it extremely hard to calibrate the ABM. Calibration difficulties are not new and for detailed discussions on the difficulty of calibrating ABMs, we refer the reader to [60,59].

Moment surfaces
To show how the interaction of parameters affects the moments of the price paths, moment surfaces were produced in the same way as was done for the objective function. If the mean and standard deviation are not considered we have six moments and six parameters, which means we will have 90 surfaces. This is far too many surfaces to present here so we have considered just the surfaces for parameter combinations of N c LT and N f LT as well as κ and ν. These can be found in Figure C.24. Considering the Hurst exponent surface for the N c LT and N f LT parameters we can see that the Hurst exponent decrease drastically as the number of fundamentalists increase. There is also a small decrease in the Hurst exponent as the number of chartists increases but this is a much smaller effect and should not be interpreted as a trend. From this plot, there is evidence that the mean reversion of the fundamentalists is dominating the momentum from chartists. To illustrate some of the non-linear effects not captured by the correlations, consider the Hill estimator surface for the N c LT and N f LT parameters as well as the κ and ν parameters. LT and N f LT parameters we can see that the Hurst exponent decreases drastically as the number of fundamentalists increases, which is consistent with the fact that the fundamentalists employ a mean reverting strategy.
from the same distribution as that of the empirical data [20]. This presents a calibration problem in which the model must be calibrated to the empirical data [60].
The three most common methods used to calibrate financial ABMs are (1) maximum likelihood estimation, (2) Bayesian inference and, (3) method-of-moment using simulated minimum distance (MM-SMD). Maximum likelihood is reserved for those models involving closed-form solutions usually for daily sampled data [60], approximating closing auctions and not intraday trading. Therefore, this method was deemed unsuitable for calibrating our model. The second method, Bayesian inference, constructs a posterior distribution of the parameters using either the kernel density estimation (KDE) or parametric approaches. Parametric methods have been said to often not be flexible enough to accurately characterise the distribution [60]. However, Bayesian estimation has been shown to outperform MM-SMD in a wide range of circumstances using the KDE method [59], but this method is extremely computationally intensive for large ABMs such as the one described in Section 2, and therefore was ruled out as a calibration method. The method chosen is MM-SMD, because it is more computationally tractable than Bayesian inference, it does not require a closed-form solution, and it means that we are consistent with the methods used in [37], whose model we try to replicate in event time.
MM-SMD does have some pitfalls that need to be noted. The MM-SMD method has been criticised for producing parameter degeneracies and causing haphazard effects on the objective function [60]. These haphazard effects limit the ability of the optimiser to find identifiable parameters due to the optimiser converging to local optima. From the indicative objective surfaces in the sensitivity analysis, we see that there are likely local optima in our objective function, which could present the same calibration problems found in [60]. The method has also been criticised for the arbitrary selection of moments that will only represent a limited number of aggregate properties of the data, and therefore may not capture important dynamics [60]. Despite these shortcomings, MM-SMD presents a fast, simple and reliable (when there are more moments than parameters) method for calibrating an ABM. However, due to these shortcomings, these parameters should be considered indicative rather than robust [37].

Calibration data
Since we are simulating a continuous-double-auction (CDA) all data relating to the opening and closing auctions needed to be discarded. Therefore, the only activity that occurred between 9:00 and 16:50 was considered. Moreover, it was noticed that the first minute of trading produced erroneous data and it was decided to remove the first minute of trading since it would make the model more difficult to calibrate. It is also vital to remove any events related to intraday volatility auctions and the impact of various futures close-outs. Given that the future close-outs occur only in March, June, September and December we do not need to consider this case. To remove most of the unwanted trades such as after-hour trades (LT), correction of previous day's trades (LC) and auction uncrossing price trades (IP) I only considered automated trades (AT).
Often there are larger trades that get executed against a set of smaller trades. This means that single trade, at a given timestamp, gets split up into smaller trades that change the best bid and best ask as it moves up and down the order book. This event is represented in the data as multiple trades when it is a single trade. Also, some timestamps have multiple quotes. To deal with these events we performed trade and quote compacting. This is the process of changing the data to better represent the fact that a single trade has occurred, or to ensure that there is only a single quote per timestamp. For quotes with the same timestamp, the most recent quote in the sequence of quotes is kept and the rest are removed. For trades with the same timestamp, and the same order type, the volume of the trade is calculated to be the aggregated volume for that timestamp and the price of the trade is the volume weighted average price.

Selection of moments
The first step in the construction of MM-SMD is the selection of the moments. Winker et al. [71] recommend selecting a large number of moments taking into account a large number of characteristics of the empirical data, which will increase the probability of identification of the model parameters. The moments that have been chosen are described in Table  4. These moments have been chosen as they measure several stylised facts that we aim to reproduce. A description of these stylised facts can be found in Table 7. We differ from [37] as we do not include kurtosis as a moment. The fat tails of the log-return distribution call into question the existence of such higher order moments and kurtosis has been shown to not be a robust moment [71]. Jericevich et al. [37] stated that it was the least reliable moment as large kurtosis values were present when there were extreme price changes initially but the price remained unchanged for the rest of the simulation. To measure the fat tails of the return distribution the "improved" Hill estimator was used to measure the tail-index of the upper 5% of the right tail of the log-return distribution [71]. Removing the kurtosis moment is also valid as it still ensures that we have more moments than parameters. The standard deviation of the log-returns was kept as it is a robust statistic [71] and is far more likely to exist when compared with the kurtosis. During the initial testing of the model, the maximum likelihood estimate of the tail-index of the Pareto distribution using the upper and lower 5% of the tails of the log-return distribution was above 2 for all simulations, however, the estimate never went above 4 again giving evidence for the existence of the second moment and the nonexistence of the fourth.
The moments were estimated on the micro-price log-returns as Jericevich et al. [37] found that using the mid-prices was unreliable when trying to identify the distributional properties.

Simulated minimum distance
The calibration problem can be defined as the optimisation of a stochastic approximation of an objective function min θ∈Θ f (θ) where θ is a parameter vector, Θ is the space of all possible parameters and f is the objective function.
The parameters that we will be optimising are described in Table 1. We have that θ = {N c , N f , δ, κ, ν, σ f } are the free parameters of the model. There are several parameters in the model that were fixed to ensure that the fewest and most influential parameters were the only ones that varied. Keeping the number of free parameters as low as possible reduces the dimensionality of the surface and makes it easier to optimise. The maximum and minimum forgetting factors of the uniform distribution, λ min and λ max , that were sampled were kept fixed. These parameters aimed to establish a range of forgetting factors where each chartist, when sampling from the uniform distribution, had an equal chance of having a long or short trading horizon. The smallest forgetting factor was set low enough such that the moving average tracked the mid-price almost perfectly and the largest forgetting factor kept the moving average reasonably constant across the simulation. Since these values have already fulfilled their role there is no need to calibrate them. The time to trade, after initialisation, was set at 25 seconds because it was deemed to give the most similar order counts when compared with the day of data we calibrated to (see Section 3.1). The time to cancellation parameter for the high-frequency agent limit orders was set at 1 second as this was the fastest time to cancellation that did not remove liquidity too quickly to cause consistent liquidity crashes. The initial mid-price was arbitrarily set at 10000 and the minimum volume for the limit orders, x m , was reduced by half when compared to [37] to account for the increase in the number of limit orders due to the high-frequency agents trading in every event loop.
The MM-SMD method considers: the true moments, estimated moments and then moments estimated from the simulated paths. Here I = 5 for the entire calibration process following [60]. The Monte Carlo replications are key to reducing the variance of the stochastic approximation of the objective function. Moreover, for each simulation run, we have that the seed was also set to i. The seed controls the fundamental prices and the forgetting factors for each of the fundamentalists and chartists respectively. Averaging over multiple seeds is key as we want to try to average out the effects of the random fundamental prices and forgetting factors. Five Monte Carlo replications are low and this value was used due to computational constraints. Please see Table A.8 that shows the runtimes to get an idea of the computational burden.
In the ideal setting, each seed should be averaged over far more than 5 times to reduce the variance caused by the randomness in each simulation as well as the random fundamental prices and forgetting factors. We can construct the full objective function as a weighted sum-of-squares of the errors defined in G: Here W is a matrix of weights. Using the method of Winker et al. [71] we choose W = Cov [m e ] −1 . The intuition behind the choice is that it takes into account the uncertainty of the empirically estimated moments, m e , by assigning larger weights to moments with greater uncertainty [60]. The matrix W is estimated using a moving block bootstrap with a window of size 2000 [37]. The moving block bootstrap is used to preserve the autocorrelations in the time series. The total number of moving block bootstrap samples that were obtained was 1000. Therefore, we have 1000 samples in which the empirical moments could be estimated, and 1000 samples of each moment to estimate W . Since we are inverting a covariance matrix the condition number needs to be checked to determine the invertibility of the matrix. The condition number of Cov [m e ] is 1.9536096786628117 × 10 13 . This is concerning however it is far less than [37] who's condition number was 8.375811152908104 × 10 19 . Therefore the matrix may only just be invertible but it is at least comparable and better than that of [37].

Optimisation
The objective function described in Section Appendix D is computationally expensive to evaluate. A single run of the simulation lasts 25 seconds and because the number of Monte Carlo replications was set at 5, this means that it takes approximately 125 seconds to evaluate the objective function. Therefore, the Nelder-Mead (NM) optimisation algorithm [53] was chosen as it reduces the number of objective function evaluations by choosing efficient search steps [27]. The roughness of the stochastic approximation of the objective function will cause lots of local minima. To avoid the rapid descent into local minima the NM optimisation algorithm is combined with the threshold accepting algorithm (TA) to give us Nelder-Mead with threshold accepting (NMTA) used in [27,37]. The NM algorithm has been shown to become inefficient in higher dimensions and therefore the NM algorithm implementation with adaptive parameters was chosen to increase the efficiency of the optimisation [22].
Given n free parameters that need to be optimised the NM algorithm creates a simplex of n + 1 vertices, where each vertex represents a potential solution to the optimisation. At each iteration, the simplex is transformed by a reflection and either an expansion, contraction or shrinkage operator. This is repeated until either the convergence criteria are met or until the optimisation is halted by a time limit or a fixed number of iterations. As noted previously this algorithm will converge quickly to local optima if the objective surface is rough.
The TA algorithm selects a single point on the objective function and perturbs it to create a new point. This point is then evaluated and if it has a better objective function value then it is accepted as the new solution. Furthermore, if it has a worse objective function value by no more than some threshold τ of the current solution then the new solution replaces the current solution [71]. The acceptance of worse solutions allows this algorithm to escape from local optima. The threshold, τ , is decreased throughout the optimisation so that convergence can occur. Each value for τ is used for a set number of steps to allow the algorithm to search the parameter space using each threshold several times [37].
Combining these two algorithms gives the NMTA algorithm. At each iteration either a single step in the TA algorithm is executed or a single step in the NM algorithm is executed. A step of either the TA or NM algorithm is performed with probability ξ and 1 − ξ respectively with NM being chosen with a higher probability. The TA algorithm is applied to the vertex with the lowest objective function value and the size of the perturbation is dependent on the mean value, computed over the simplex, of the parameter being chosen [37]. Moreover, for each of the operations in the simplex search the objective function values are also compared using the same TA thresholding to ensure that worse solutions are selected in the NM algorithm.
Initial calibration attempts caused parameters of the model to fall outside reasonable bounds. For, example some vertices of the simplex would contain negative values for the number of liquidity takers. This is unrealistic and bounds on the parameters would need to be placed so that the simplex does not search areas of the objective function that have no grounds in reality. Table D.9 describes the parameter bounds and the justification for the bounds. Table D. 9 Describes the bounds on the parameters of the model and gives a justification as to why the particular bounds were chosen.

Parameter Bounds Justification
The lower bound is used to ensure that there are always liquidity takers. The upper bound was set such that there is a case for more liquidity takers than providers but it was not set too high that the program would drop messages from the matching engine. δ ∈ [0, 10] Assumes the fundamental prices and the moving averages would never be more than 10 times that of the mid-price. This is a reasonable assumption due to the bounds on the variance on the fundamental prices and the method used to sample the forgetting factors. During calibration, this parameter never went outside this range.
Was allowed to vary on the positive real line.
The lower bound was adjusted from 1 to 1.5 as values too close to 1 would cause volumes so high that there were errors when converting the sampled volume from a float to an integer. These volumes were deemed unrealistic.
Ensures that the fundamental prices are within approximately 10% of the mid-price. It was found that increasing this further did not make a difference as this value was rarely truncated and the prices did not move more than 10% over the course of the calibrated simulations.
Due to the bounds on the parameters the initial vertex was set at the mean values for each of the parameters and an affine transformation was applied to this vertex to give the initial simplex. The simulation was run for 100 iterations as it was determined that the simplex vertices converged to a state where the parameters in the vertices had stopped changing in a meaningful way at this point. The objective function was also shown to have converged to an optima in this time.

Calibration convergence results
Figures D. 25 and D.26, give evidence that the NMTA algorithm was able to converge to an optima during calibration. The value of the objective function for the best vertex has not significantly improved in the last 32 iterations, and the variance in the objective function values across the simplex also shows signs of convergence. Considering the values of each of the parameters across the simplex we see that each of the parameters has converged to similar values across the simplex. This indicates that there is no significant difference between the vertices of the simplex.
Given that the objective function has not significantly improved and the simplex vertices are essentially the same it is clear that the algorithm has found an optima. We cannot say that this is a global optima or that the algorithm will not be able to find another parameter combination that will significantly improve the objective function, but finding such parameter values would be very inefficient as the algorithm has collapsed into a slow threshold accepting algorithm. For a more detailed discussion on the convergence of the calibration, see Appendix D.
The plot on the left of Figure D.25 depicts the convergence of the objective function for the best vertex over the number of iterations (blue) as well as the estimate for the scaled standard deviation of the objective function values over the entire simplex (purple). The purple line can be thought of as the convergence criteria of the objective function because as this decreases this means that the simplex values are becoming more similar in their objective function values and are therefore likely closer together on the objective function. The plot on the right shows the convergence of the objective function for each of the vertices on the simplex. The objective function value for the best vertex does seem to have converged to an optima because there has only been a very small improvement in the last 32 iterations. The last 35 iterations had no thresholding applied. The convergence criteria have also decreased as the simplex vertices seem to have more similar objective function values. This is also shown in the right-hand side plot of the figure. Since the objective surface is stochastic and most likely rough, there will always be noise in the objective function values even if they are very close in the parameter space. Therefore, just looking at the objective function convergence will not provide the full picture of convergence. The key question is whether the vertices in the simplex are similar enough where any of the simplex transformations or threshold accepting perturbations will cause the simplex to continue to move around the parameter space. Figure D.26 shows the parameter value for each of the vertices in the simplex (dashed lines) and the parameter value for the centroid of the simplex (solid line). These plots show that throughout the optimisation we have that each parameter value on each vertex has converged to the same place on the objective surface because by the end of the calibration all the simplex vertices have essentially the same parameter values. This means that if the optimisation process was run for further iterations there would be no significant changes in the vertices of the simplex. The simplex search transformations are based on the difference in the parameter values of the simplex and the centroid and because they are so similar there would not be significant changes to the vertices and the algorithm would become a threshold accepting algorithm. The threshold accepting steps are run with probability ξ which is much smaller than (1 − ξ) and therefore will only be run very few times and the optimisation will become very inefficient to continue the search as the simplex will change very slowly if at all for the rest of the algorithm. For these reasons, the optimisation was halted at 100 iterations. It can also be noted that there is variance in the objective function values of each vertex in the simplex, in the last few iterations, even though the parameter values are not significantly different. Highlighting the stochastic and rough nature of the objective surface and its difficulty to optimise. The right-hand plot in Figure D.25 has large spikes in the value of the objective function. These spikes are due to the shrink operator of the simplex algorithm. For a deterministic smooth function, the shrinkage operator will shrink the simplex closer to the minimum as it assumes it is closer to an optima. However, for a stochastic approximation of a function shrinking the simplex can produce vertices that have worse objective function values due to the variance of the objective function and its rough surface, especially if the simplex vertices still have significant differences.

Appendix E. Reinforcement learning
A Reinforcement Learning (RL) agent interacts with the environment at discrete time steps, t = 1, 2, 3, ... 20 . At each time step the agent observes the state of the environment S t ∈ S where S is the set of possible states, and takes an action A t ∈ A(S t ) where A(S t ) is the set of actions the agent can take in state S t . At the following time step the agent observes the reward R t+1 ∈ R ⊂ R 21 , for taking action A t in state S t , and it observes the new state S t+1 .
To select the actions at each time step the agent uses a policy, which maps the state to the probabilities of selecting each possible action [65]. The policy is denoted by π t where π t (a|s) is the probability that A t = a when S t = s [65].
The goal of the agent is to find the policy such that it maximises the total reward received. Or more formally, at each time step t the agent wishes to maximise the expected return, G t , which is a function of the reward sequence. In the simplest case this is just the sum of the rewards: G t = R t+1 + R t+2 + R t+3 + · · · + R T , where T is the final time step [65]. This is the expected return function that our agent will be using.
This formulation of the return maps nicely to problems that have a final time step, where the agent-environment interaction can be broken up into sub-sequences, such as repeated plays of a game [65]. For example, blackjack, or in our case repeated executions of a large parent order. These repeated tasks are called episodes and tasks that can be broken up into episodes are called episodic tasks. At the final time step, of an episode, the agent moves into a terminal state, after which the state of the environment gets reset to some starting state.
Some tasks can't be broken up into episodes, where T = ∞. A trading agent that just aims to maximise its profit over its lifetime would be an example of such an agent. These tasks are called continuing tasks [65]. Since T = ∞, we have that the expected return could be infinite. This is remedied by adding discounting, which gives us the expected discounted return: G t = ∞ k=0 γ k R t+k+1 . Here 0 ≤ γ ≤ 1 is the discounting rate, and it controls how much the agent values immediate vs. farsighted returns. As γ approaches zero the agent becomes more myopic and values immediate rewards more heavily, while as γ approaches one the agent becomes more farsighted and starts to value future returns more heavily. Now, an issue occurs because we have two tasks and two reward functions. To recover a single notation that describes both tasks we use the convention given to us by [65], where we consider episode termination to be the case of entering an absorbing state that transitions only to itself and has a reward of zero. This allows us to describe both tasks' return functions.
Given action a and state s, the probability of observing a reward r and moving to state s is given by: Using these definitions we can derive equations for the state-value function v π (s) and the action-value function q π (s, a) for policy π, where v π (s) is the value of state s under policy π, which is defined as: Here the action-value function, which is the value of taking action a in state s, and following π thereafter, is: Value functions give us a partial ordering of policies. We say that policy π is better than or equal to a policy π if its expected return is greater than or equal to the expected return of π for all states [65]. The optimal policy is the policy such that no other policy has a greater expected return for any of the states. There could be more than one of these policies, however, we denote the optimal policy as π * . Using this ordering of policies we can define the optimal state-value function as: v * (s) = max π v π (s) ∀s ∈ S, and the optimal action-value function as: q * (s, a) = max π q π (s, a) ∀s ∈ S and ∀a ∈ A. The optimal policy can then be found by maximising over q * (s, a) using that if a = argmax a∈A q * (s, a) then π * (s, a) = 1, otherwise π * (s, a) = 0. Therefore if we know q * (s, a) we know the optimal policy. This is key since most RL algorithms, and the Q-learning algorithm that we will implement will aim to optimise the action-value function directly. Therefore, we will need to recover the optimal policy from the optimal action-value function. We do not have the full model of the environment therefore, we can't use equations E.4 and E.5. However, we can learn the optimal state-value and action-value functions from samples of sequences of states, actions, and rewards, that can be generated through actual or simulated interaction with the environment. The two main methods for learning these functions through experience are Monte Carlo and Temporal Difference (TD) learning. When using Monte Carlo methods to estimate the value functions, the value functions are only updated at the end of every episode. This can make learning inefficient. Moreover, Monte Carlo methods can't be used to learn in a step-by-step online way. Given the learning inefficiencies and the long runtimes of a simulation, it is likely that we won't be able to have enough episodes for the RL agent to accurately estimate the value functions, using Monte Carlo methods. Therefore, we will only consider TD learning.
TD methods allow the agent to learn in each episode and does not require waiting for the episode to complete like Monte Carlo. The TD algorithm that we will be using is the Q-learning algorithm first proposed by [69]. In this particular case, the update to the value function can be done after a single step in an episode.
Q-learning is an off-policy TD control algorithm which is in contrast to the SARSA on-policy control algorithm. This means that the policy that generates the actions is not the same as the target policy. Using this method simplifies the analysis of the algorithm and reduces the conditions needed for guaranteed convergence to q * .
At a particular time t in an arbitrary episode, we give the simplest form of Q-learning, the one-step Q-learning update equation is as follows: Q(S t , A t ) =Q(S t , A t ) + αR t+1 + α γ max a Q(S t+1 , a) − Q(S t , A t ) . (E.6) Here α is the learning rate. Provided each state is visited infinitely often and each action is taken in each state infinitely often we have that Q converges to q * with probability 1, for appropriately decaying step sizes 22 [68].

Appendix F. Learning agent results
Best bid volume state spaces Table F.10 Describes how the states, for the best bid volume (v b k ), are determined as well as the approximate probability of being in each of those states, based on the observed best bid volume at some event k and the simulated historical distributions. Table F.10a and F.10b shows how the volume states are assigned for the agents with the smaller and larger state spaces respectively.   22 The conditions on the step sizes are t αt = ∞ and t α 2 t < ∞.

Behaviour analysis
(a) Plots the policy for the agent with X0 = 21500 and n T , n I , n S , n V = 5 (b) Plots the action volume trajectory for the agent with X0 = 21500 and n T , n I , n S , n V = 5.
Figure F.29: Plots the policy (left) and the action volume trajectory (right) for the RL agent with X 0 = 21500 and n T , n I , n S , n V = 5. From the change in the action volume trajectory, it is clear that the agent has learned to change its behaviour from an initial random agent.       Learning dynamics (a) Plots the average action for the high (resp. low) volume and narrow (resp. wide) spread states over the training episodes. This figure plots the results for the agent with X0 = 43000 and n T , n I , n S , n V = 5.
(b) Plots the average action for the high (resp. low) volume and narrow (resp. wide) spread states over the training episodes. This figure plots the results for the agent with X0 = 43000 and n T , n I , n S , n V = 10.  Plots the average action taken by the RL agents with n T , n I , n S , n V = 5, when there is lots of time remaining and little time remaining as well as when the agent has a large amount of inventory remaining and a small amount of inventory remaining. Once again we see that the agents are learning to reduce the amount they trade as they get trained. Moreover, we see that all the agents have learned to trade, on average, more when there is less time remaining and trade more when there is less inventory remaining. It is clear that the amount of inventory remaining has a more significant effect when compared with the time remaining, as there is a larger difference between the average actions taken in the more and less, time and inventory remaining states. Here, we see that the mean and the KS statistic generated by the RL agents across their learning episodes do not significantly differ from these same moments generated by the e-ABM. The standard deviation (Std) moment was zero for e-ABM and the JSE, up to the third significant digit, however, we do see that some simulations saw larger variations in the log-returns. The ADF statistic sees more significant differences between the e-ABM and the e-ABMs with the RL agents. However, most of the moments fall into either the JSE or e-ABM confidence intervals, and all are significantly negative indicating that there insufficient evidence for the presence of a unit root.