Neural parameter calibration for large-scale multiagent models

Significance In this work, we consider multiagent models, widely used across the quantitative sciences to analyze complex systems. These often contain parameters which must be estimated from data. While many methods to do so have been developed, they can be mathematically involved or computationally expensive. We present an alternative using neural networks that addresses both these issues. Our method can make accurate predictions from various kinds of data in seconds where more classical techniques, such as MCMC, take hours, thereby presenting researchers across the quantitative disciplines with a valuable tool to estimate relevant parameters and produce more meaningful simulations at a greatly reduced computational cost.


I. Introduction
We live in an age of complexity. Thanks to an array of sophisticated and potent computational resources, paired with vast and fruitful reservoirs of data, researchers can increasingly see social, economic, biological, or epidemiological processes as the complex, self-organising, and dynamical systems they are. A line of great current interest in the mathematical sciences is to calibrate the parameters of mathematical models using data, thereby creating computationally efficient, theoretically grounded, and socially beneficial predictive tools. For instance, models of the spread of contagion continue to inform much government policy during the ongoing COVID-19 pandemic [1][2][3][4][5]: predictions are generated, subsequently compared to live data, and the underlying model revised accordingly. Computational models have also been used to understand the dynamics of crime and urban violence [6,7], pedestrian dynamics [8], synchronised oscillations [9], social network dynamics [10][11][12], chemotaxis and flocking [13][14][15], population dynamics [16,17], systemic risk [18], and the dynamics of economic systems [19,20].
Two modelling paradigms have established themselves over the years: in the first, a system of coupled differential equations is iteratively solved to simulate the behaviour of a finite number of interacting particles. This is at times implemented using what are called agent-based models (ABMs): a collection of entities (agents) moving through space and time, interacting with each other and their environment, and adapting their behaviour or even learning in the process. In the second approach, a model is realised as a discretised version of equations in continuous time and space [21,22]; this approach opens models up to analysis using mean-field theory, stochastic analysis, statistical physics, and kinetic theory [23][24][25][26]. A large and sophisticated toolset of statistical methods exists to estimate parameters, interaction kernels, or network structures from data, such as maximum likelihood estimators [27][28][29][30][31], Markov-Chain Monte Carlo methods based on a Bayesian paradigm [32][33][34][35], martingale estimators [36], estimation of active terms in ODE and PDE systems (see [37] for a review), entropy maximisation [38], and regression-based learning methods [39,40]. More recently, a promising new method has emerged in the form of artificial neural nets. Neural networks have of course prominently been used as powerful pattern-recognition devices and predictive models [41], but, as they become more and more accessible to the scientific community at large, researchers are beginning to apply their computational capabilities across the mathematical disciplines, including in fields heretofore dominated by more classical methods: examples include finding solutions of partial differential equations [42,43] or parameter estimation of multi-agent models [44,45].
Neural networks, and especially deep neural networks, are mathematically little understood: their theoretical underpinnings are sparse and mainly restricted to shallow networks (networks with only one hidden layer) [46,47]. This major drawback notwithstanding, recent results indicate that their computational performance can often outstrip that of other, thus far more rigorously understood approaches, though the method is still in its infancy.
This work is a contribution to the general push to better understand the possibilities of neural nets as calibration tools for mathematical models of complex dynamics. We present and investigate a simple yet powerful computational scheme to obtain probability densities for model parameters from data. The method combines classical numerical models with machine learning, and in the following case studies we recover probability densities from noiseless and noisy, synthetic and real, and time series and steady-state data. The case of a non-convex problem is also considered. Using the well-known SIR model of contagious diseases, we estimate parameter densities from a time series modelling the diffusion of infection through a population on a two-dimensional domain. In a second study, we use the Harris-Wilson model of economic activity on a network [19] to learn parameter densities both from synthetic steady-state data as well as from a real dataset of activity across Greater London. In doing so we revisit an earlier study of the same dataset that used Bayesian methods to estimate parameters [20]. Our proposed method can find estimates for model parameters in seconds, even for large systems, and provides parameter densities for the London dataset in about one minute where the Bayesian approach took between 3 and 7 hours. At the same time, the quality of the calibration (in terms of prediction error) is improved by two to three orders of magnitude.
The scope of our approach covers models formulated as coupled differential equations of the kind

dϕ/dt = f(ϕ, x; λ),  [1]

or, in the stochastic case,

dϕ = f(ϕ, x; λ) dt + g(ϕ, x; λ) dB,  [2]

where ϕ ∈ R^N is the state vector, x is a space-like variable, λ := (λ_1, ..., λ_p) ∈ R^p a set of scalar parameters, and B an N-dimensional stochastic process (such as a Wiener process). (We include x to allow for infinite-dimensional problems leading to (stochastic) partial differential equations, though in practice discretisation will often lead to it being absorbed into the state vector.) In a neural ODE or SDE, the scalar parameters λ are the outputs of a neural network, whose internal parameters are to be learned from data [48]. In this work, we investigate the use of such neural ODEs as calibration tools. In many cases, the spatial topology is given by a network structure, such that the dependency on x is more specifically a dependency on a graph adjacency matrix A; this is found in many contemporary models of dynamical systems. One might also like to learn network structures from data. By vectorising A, it can be conceived of as a single vector of parameters λ, transforming the question into one that can indeed be considered within our proposed framework. However, as the scale of this problem is typically of a different order of magnitude, we address it in future work.
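For concreteness, a system of the kind in eq. [1] can be integrated with any explicit scheme once λ is given. A minimal forward-Euler sketch (the right-hand side f, the step size, and the exponential-decay example are illustrative assumptions, not one of the models studied below):

```python
import numpy as np

def euler_solve(f, phi0, lam, dt=0.01, steps=100):
    """Integrate dphi/dt = f(phi, lam) with forward Euler.
    Returns the trajectory as an array of shape (steps + 1, N)."""
    traj = [np.asarray(phi0, dtype=float)]
    for _ in range(steps):
        phi = traj[-1]
        traj.append(phi + dt * f(phi, lam))
    return np.stack(traj)

# Illustrative example: exponential decay dphi/dt = -lam * phi
traj = euler_solve(lambda phi, lam: -lam * phi, phi0=[1.0], lam=0.5)
```

Any higher-order or adaptive scheme can be substituted here; the only requirement imposed later is that the scheme be differentiable with respect to λ.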
The dynamics of systems governed by equations such as eq. [1] can depend sensitively on the choice of parameters λ. For mathematical models such as those used in the social sciences or computational biology, theoretical estimates for these parameters are often difficult to obtain: what is the reproduction number of a novel disease, and how susceptible are different age groups to the disease? What is a good model for the topology of a social network? How should we estimate return on capital rates for different agents in economic models? What are recovery rates for different animal species subject to predator-prey dynamics? Such parameters are difficult to measure, and should instead be extracted from data.

Figure 1: The methodological pipeline proposed in this work (algorithm 1). The neural net u_θ takes q time series elements as input and outputs parameter predictions. These predictions are fed into a numerical solver, which produces a predicted time series. The true and predicted time series are used to generate a loss functional, which in turn can be used to train the neural net's internal parameters θ: the gradient ∇_θ J is calculated, and θ is updated using backpropagation. The goal is to retrieve the true parameters λ from the data. A single pass over the entire dataset is called an epoch. The dataset is processed in batches, meaning the loss is calculated over B steps of the time series before the weights are updated. If B = L, training is equivalent to batch gradient descent; if B = 1, training is equivalent to stochastic gradient descent. The integral in the solver step of the algorithm represents an arbitrary numerical scheme to solve eq. [1].

II. Methodology

i. Obtaining probability densities from Neural equations
We present a method to estimate parameter densities of ODE or SDE systems by training a neural net to find a set of parameters λ̂ that, when inserted into the model equations eq. [1], reproduce a given time series T = (ϕ_1, ..., ϕ_L). A neural network is a function u_θ : R^{N×q} → R^p, where q ≥ 1 represents the number of time series steps that are passed as input (cf. SI appendix fig. S1). Its outputs are the estimated parameters λ̂, which are used to run a numerical solver for B iterations (B is the batch size) to produce an estimated time series T̂(λ̂) = (ϕ̂_i, ..., ϕ̂_{i+B}). This in turn is used to train the internal parameters θ of the neural net (the weights and biases) via a loss functional J(T̂, T). A common choice for J is the ℓ2 norm, which we will use in this work. As λ̂ = λ̂(θ), we may calculate the gradient ∇_θ J and use it to optimise the internal parameters of the neural net using a backpropagation method of choice; popular choices include stochastic gradient descent, Nesterov schemes, or the Adam optimizer [49,50]. Calculating ∇_θ J thus requires differentiating the predicted time series T̂, and thereby the system equations [1], with respect to λ̂. In other words: the loss function contains knowledge of the dynamics of the model. Finally, the true data is once again input to the neural net to produce a new parameter estimate λ̂, and the cycle starts afresh. A single pass over the entire dataset is called an epoch (cf. fig. 1). In the following we will always let q = 1. The technicalities of the differentiation procedure are handled by the auto-differentiation scheme, which treats random vectors as constants. For non-differentiable convex functions (such as ‖·‖), the subgradient of minimum norm is used, see e.g. [51].
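A minimal end-to-end sketch of this training loop, using an illustrative one-parameter model dϕ/dt = −λϕ solved with forward Euler (the architecture, learning rate, and epoch count are arbitrary choices for this toy example, not those used in the case studies below):

```python
import torch

torch.manual_seed(0)

class Net(torch.nn.Module):
    """u_theta: maps a q = 1 observed state to a p = 1 parameter estimate."""
    def __init__(self):
        super().__init__()
        self.hidden = torch.nn.Linear(1, 20)
        self.out = torch.nn.Linear(20, 1)

    def forward(self, x):
        # |x| on the output keeps the predicted parameter non-negative
        return torch.abs(self.out(torch.tanh(self.hidden(x))))

net = Net()
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

# Synthetic 'true' time series for dphi/dt = -lam * phi, forward Euler
dt, lam_true, T = 0.1, 0.5, [torch.tensor([1.0])]
for _ in range(50):
    T.append(T[-1] - dt * lam_true * T[-1])

initial_loss = None
for epoch in range(300):
    lam_hat = net(T[0])                      # predict parameters from data
    loss, phi = torch.tensor(0.0), T[0]
    for k in range(1, len(T)):               # differentiable forward solve
        phi = phi - dt * lam_hat * phi
        loss = loss + (phi - T[k]).pow(2).sum()
    if initial_loss is None:
        initial_loss = loss.item()
    opt.zero_grad()
    loss.backward()                          # gradient flows through the solver
    opt.step()
```

The key point is that `loss.backward()` differentiates through the numerical solver, so the gradient carries information about the model dynamics, exactly as described above.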
This solves an optimisation problem, but it does not provide us with any sort of confidence intervals for the predictions. To obtain probability densities, we exploit the fact that as the model trains, it traverses the parameter space R^p, calculating a loss value J at each estimate λ̂. This produces a loss potential J(λ̂) : R^p → R, from which a posterior density can be estimated via π(λ̂|T) ∼ exp(−J(T̂, T)) π_0(λ̂), with π_0 the prior. The marginals are then proportional to ∫ exp(−J(T̂, T)) π_0(λ̂) dλ̂_{−i}, with the subscript −i signifying that we are not integrating over λ̂_i. In the following, we initialise the neural net's internal parameters such that the prior is a uniform density. Non-convexity can be dealt with by training the neural net multiple times on the same dataset, each time with a different initialisation, thereby increasing the volume of the sampled parameter space. This comes at a computational cost, but multiple runs can be parallelised and run concurrently, thereby greatly improving performance. Both the noiseless (eq. [1]) and the noisy (eq. [2]) version of the equations can be used when running the numerical solver (i.e. the forward pass of the pipeline alg. 1). Running the solver without noise circumvents having to make any assumptions about the nature of the true underlying noise in the dataset, which will often be unknown. The neural net will then simply fit the best noiseless model to the dataset. However, noise can be added to the solver when randomness is an inherent part of the dynamics, and can help make the predictions more robust. Alternatively, the variance of the noise can itself be learned. All these scenarios will be demonstrated in this work.
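The density-estimation step can be sketched as a weighted histogram over the sampled estimates λ̂, with weights exp(−J); the quadratic loss potential used here is hypothetical:

```python
import numpy as np

def marginal_density(lam_samples, losses, dim=0, bins=50):
    """Weighted-histogram estimate of the marginal posterior of one
    parameter, weighting each sampled estimate by exp(-J)."""
    w = np.exp(-(losses - losses.min()))   # shift for numerical stability
    hist, edges = np.histogram(lam_samples[:, dim], bins=bins,
                               weights=w, density=True)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, hist

# Hypothetical samples: quadratic loss potential centred at lam = 0.5
rng = np.random.default_rng(0)
lam = rng.uniform(0, 1, size=(10000, 1))
J = 50 * (lam[:, 0] - 0.5) ** 2
x, p = marginal_density(lam, J)
```

With a uniform prior (as in the text), no further reweighting is needed; a non-uniform prior would multiply the weights by π_0(λ̂).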

ii. Using Neural equations in practice
It is important to reiterate that the neural net is not fitting a dataset in the traditional sense, but rather producing a set of parameters that, when plugged into the governing equations, generate the dataset (or an estimate thereof). Differentiating the loss function (and thereby the physical equations) may seem daunting; but many of the standard machine learning libraries have auto-differentiation features, and so the differentiation procedure need not (and if possible should not) be implemented manually. In practice, the bottleneck of our method will lie in writing a fast numerical solver that is also compatible with the differentiation procedure of the machine learning package used. Explicitly iterating over agents or network nodes should be avoided, and the dynamics instead be implemented as operations on the entire state vector ϕ. Pre-implemented matrix and vector operations should be used as much as possible, as these typically (1) use efficient, tested, and pre-compiled algorithms, and (2) are compatible with auto-differentiation features. We have implemented an open-source code package (see below), written such that it can be easily extended and adapted to further models. See the supplementary material and the README files in the repository.
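As a toy illustration of this point (not one of the models in this paper), a coupling term over all network nodes can be computed as a single matrix-vector product instead of a Python loop over nodes:

```python
import torch

A = torch.rand(100, 100)       # illustrative dense adjacency matrix
phi = torch.rand(100)          # state vector over the 100 nodes
lam = 0.1                      # illustrative coupling parameter

# Looped version: explicit iteration over nodes (slow in Python)
phi_loop = torch.stack([phi[i] + lam * torch.dot(A[i], phi)
                        for i in range(100)])

# Vectorised version: one pre-compiled matrix-vector product
phi_vec = phi + lam * (A @ phi)

assert torch.allclose(phi_loop, phi_vec)
```

Both versions are differentiable, but only the vectorised one scales to large systems; this is the pattern used throughout the implementations discussed below.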
Aside from computational considerations, when learning parameters, several fundamental mathematical questions must be carefully considered. (1) First, the model's dynamical range should be analysed in order to ascertain whether there are non-identifiable regimes. Dynamical systems may gravitate toward attractors and steady equilibria which are independent of certain model parameters and thus cannot be used to train the net. (2) Numerical stability must be maintained throughout the training process, e.g. by rescaling the neural net output to prevent numerical overflow. In certain parameter regimes, dynamical systems may also exhibit chaotic behaviour that inhibits the learning process. The neural net must be kept from straying into the 'danger zones' of potential numerical instability as it traverses the parameter space during training. (3) To speed up computation, a suitable neural net architecture should be chosen that encapsulates as much information on the parameters as can reasonably be assumed a priori. For instance, using activation functions that guarantee parameters remain in valid ranges can help: if parameters are probabilities, an activation function that maps into [0, 1] (e.g. a sigmoid) may be an appropriate choice for the final layer.
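A few final-layer activation choices of this kind (illustrative values; the scaled sigmoid for a bounded range is our addition, not a construction from the text):

```python
import torch

raw = torch.tensor([-3.0, 0.0, 3.0])   # unconstrained net output

prob = torch.sigmoid(raw)              # probabilities: mapped into (0, 1)
positive = torch.abs(raw)              # positive parameters: |x| >= 0
scaled = 2.0 * torch.sigmoid(raw)      # parameter known to lie in (0, 2)
```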
iii. Code and data availability

All code and data can be found under https://github.com/ThGaskin/NeuralABM. It is easily adaptable to new models and ideas. The code uses the utopya package 1 [52,53] to handle simulation configuration and efficiently read, write, analyse, and evaluate data. This means that the model can be run by modifying simple and intuitive configuration files, without touching code. Multiple training runs and parameter sweeps are automatically parallelised. The neural core is implemented using pytorch 2. All datasets used in this work, including the synthetic data, have been made available, together with the configuration files needed to reproduce the plots. Detailed instructions are provided in the supplementary material and the repository.
III. Application to time series data: diffusive SIR model of epidemics

We first demonstrate our methodology by applying it to time series data generated by a classical agent-based model of epidemics. Consider N agents moving around a square domain [0, L]^2 with periodic boundary conditions, 0 < L ∈ R; each agent has a position x_i and a state k_i ∈ {S, I, R}. All agents with k_i = S are susceptible to the disease. If a susceptible agent lies within the infection radius r of an infected agent (an agent with k_i = I), they are infected with infection probability p. After a certain recovery time τ, agents recover from the disease (upon which k_i = R); each agent's time since infection is stored in a state τ_i. Agents move randomly around the space with diffusivities σ_S, σ_I, and σ_R. Each iteration of the agent-based model thus consists of the following steps: infect each susceptible agent lying within radius r of an infected agent with probability p; increment the infection clocks τ_j of all infected agents; recover all infected agents i with τ_i ≥ τ; finally, move each agent randomly with its respective diffusivity. The function d used to determine proximity is the distance metric on the torus, and we initialise the ABM with a single infected agent. Let S(x, t) be the spatio-temporal distribution of susceptible agents (analogously I and R). Assume we only observe the temporal densities, obtained by integrating out the spatial component; this is applicable to the spread of an epidemic where we only see the counts of infected and recovered patients without any location tracking or contact tracing. To these observations we now wish to fit stochastic SIR equations of the form

dS = −β S I dt + σ S ∘ dW,
dI = (β S I − I/τ) dt + σ I ∘ dW,
dR = (I/τ) dt + σ R ∘ dW,

where W is a Wiener process, and ∘ represents the Stratonovich integral. We will do so by recovering the parameters λ = (β, τ, σ) ∈ R^3_+ from observations of S, I, and R generated by the agent-based model. We expect β ≈ p, as the likelihood of a susceptible agent coming into contact with an infected agent approaches 1.
Naturally, there is some data-model mismatch, which is intentional and in this case accounted for by the noise σ; in reality, σ could for instance model errors in the estimates of the number of infected agents.
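The ABM iteration described above can be sketched as a handful of vectorised numpy operations (a sketch: the step order follows the description, all parameter values are illustrative):

```python
import numpy as np

def abm_step(pos, state, tau_i, *, L=10.0, r=0.5, p=0.2, tau=14,
             sigma=(0.2, 0.2, 0.2), rng=np.random.default_rng(0)):
    """One iteration of the diffusive SIR ABM.
    state: 0 = S, 1 = I, 2 = R; tau_i: time since infection."""
    # Pairwise distances on the torus [0, L]^2
    d = np.abs(pos[:, None, :] - pos[None, :, :])
    d = np.minimum(d, L - d)
    dist = np.linalg.norm(d, axis=-1)

    # Infect susceptible agents within radius r of an infected agent, w.p. p
    near_infected = (dist < r) & (state[None, :] == 1)
    at_risk = (state == 0) & near_infected.any(axis=1)
    state[at_risk & (rng.random(len(state)) < p)] = 1

    # Advance infection clocks; recover agents with tau_i >= tau
    tau_i[state == 1] += 1
    state[(state == 1) & (tau_i >= tau)] = 2

    # Move all agents with their state's diffusivity, periodic boundaries
    steps = rng.normal(0, 1, size=pos.shape) * np.asarray(sigma)[state][:, None]
    return (pos + steps) % L, state, tau_i
```

The O(N^2) pairwise-distance matrix is the expensive part; for large N, a cell list or k-d tree would replace it, but the vectorised form above already avoids any explicit Python loop over agents.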
We use a shallow neural net with 20 neurons in the hidden layer and, since parameters are known to be positive and negative outputs would produce unpredictable behaviour, the modulus x → |x| as an activation function. The loss function is the batch-averaged mean squared error, and we use the Adam optimizer [50] for the backpropagation. For a single initialisation we train the neural net for 70 epochs with a batch size of 90, and we run the model from 20 different initialisations. Figure 4 shows the parameter predictions. We recover the infection probability with a maximum likelihood estimate of β̂ = 0.19, and a slightly overestimated infection time τ̂ = 16.82. The most likely noise level is predicted to be 0, with an expectation value of 0.07 ± 0.15. In fig. 3, we use the neural net predictions from all 20 initialisations to calibrate the model. We see how, on average, the predicted time of peak infection matches the true time (dotted lines), though the density of infected agents both increases and decreases significantly more slowly than in the true dataset. Despite the significant data-model mismatch, the neural network manages to make reasonable predictions, both in terms of the estimated parameters and the output of the calibrated ABM, in only a few seconds: training the model for all 20 initialisations took 45 s on a standard laptop CPU.
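The architecture just described can be written down directly in pytorch. In this sketch the hidden width of 20 and the modulus output activation follow the text, while the tanh hidden activation and the input/output dimensions (densities S, I, R in; parameters β, τ, σ out) are our assumptions:

```python
import torch

class ShallowNet(torch.nn.Module):
    """One hidden layer of 20 neurons; |x| on the output keeps the
    predicted parameters non-negative."""
    def __init__(self, in_dim=3, out_dim=3):
        super().__init__()
        self.hidden = torch.nn.Linear(in_dim, 20)
        self.out = torch.nn.Linear(20, out_dim)

    def forward(self, x):
        return torch.abs(self.out(torch.tanh(self.hidden(x))))

net = ShallowNet()
opt = torch.optim.Adam(net.parameters())   # Adam optimiser, as in the text
lam_hat = net(torch.rand(3))               # one q = 1 observation as input
```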

IV. Application to a non-convex problem: the Harris-Wilson model of economic activity
In this section we analyse the connection between prediction uncertainty and noise in the training data, and compare the method to more classical approaches with regard to its predictive ability and computational performance. We do so using the Harris-Wilson model of economic activity on a network [19], a non-convex problem for which the loss function has at least two global minima. In a first step, we consider synthetic data, thereby avoiding any data-model mismatch and giving us control over the variance in the data; thereafter we analyse a real dataset of economic activity in Greater London. Before presenting our results, we briefly describe the model dynamics.
In the Harris-Wilson model, N origin zones are connected to M destination zones through a weighted, directed, complete bipartite network, i.e. each origin zone is connected to every destination zone. Economic demand flows from the origin zones to the destination zones, which supply the demand. Such a model is applicable for instance to an urban setting, the origin zones representing e.g. residential areas, and the destination zones representing retail areas, shopping centres, or other areas of consumer activity. Let C ∈ R^{N×M} be the non-zero section of the full network adjacency matrix. The network weights c_ij quantify the convenience of travelling from origin zone i to destination zone j: a low weight thus models a highly inconvenient route (e.g. due to a lack of public transport). Each origin zone has a fixed demand O_i. The resulting cumulative demand at some destination zone j is given by

D_j = Σ_i T_ij,

T_ij representing the flow of demand from i to j. The model assumption is that this flow depends both on the size W_j of the destination zone and the convenience of 'getting from i to j':

T_ij = O_i W_j^α c_ij^β / Σ_k W_k^α c_ik^β.

The parameters α and β represent the relative importance of size and convenience to the flow of demand from i to j: high α means consumers value large destination zones (e.g. prefer larger shopping centres to smaller ones), high β means consumers place a strong emphasis on convenient travel to destination zones. Finally, the sizes W_j are governed by a system of M coupled logistic equations,

dW_j = ϵ W_j (D_j − κ W_j) dt + σ W_j ∘ dB_j,  [6]

with given initial conditions W_j(t = 0) = W_{j,0}. Here, ϵ is a responsiveness parameter, representing the rate at which destination zones can adapt to fluctuations in demand, and κ models the cost of maintaining a larger site per unit floor space (e.g. rent, utilities, etc.). We recognise the logistic nature of the equations: the change in size is proportional to the size itself, as well as to D_j − κW_j. A low value of κ favours larger destination zones (e.g. larger malls), a high cost favours smaller zones (e.g. local stores). In addition, the model eq. [6] includes multiplicative noise with variance σ ≥ 0, with ∘ signifying Stratonovich integration. This represents a perturbation of the net capacity term D_j − κW_j by a centred Gaussian. We thus have a system of M coupled stochastic differential equations. In the noiseless case, the stable equilibrium is determined by

κ W = D(W),  [7]

where W ∈ R^M and D ∈ R^M are the destination zone sizes and demands, respectively. The steady state is thus independent of the responsiveness ϵ, which only affects the convergence rate to the equilibrium. We can therefore set ϵ = 1. Note also that α and β are unitless, and thus unaffected by any scaling of the origin or destination zone sizes. κ is given in units of cost/area, and scales accordingly.
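A vectorised sketch of one noiseless Euler step of these dynamics, assuming the standard Harris-Wilson flows T_ij = O_i W_j^α c_ij^β / Σ_k W_k^α c_ik^β and size dynamics Ẇ_j = ϵ W_j (D_j − κ W_j) (the normalisation should be checked against [19]; all values are illustrative):

```python
import numpy as np

def harris_wilson_step(W, O, C, alpha, beta, kappa, eps=1.0, dt=0.01):
    """One noiseless Euler step: compute flows, demands, and update sizes."""
    weights = W[None, :] ** alpha * C ** beta            # shape (N, M)
    T = O[:, None] * weights / weights.sum(axis=1, keepdims=True)
    D = T.sum(axis=0)                                    # demand at each zone
    return W + dt * eps * W * (D - kappa * W)

# Illustrative run: two origin and two destination zones, uniform convenience
W = np.array([0.5, 0.7])
O = np.array([1.0, 1.0])
C = np.ones((2, 2))
for _ in range(5000):
    W = harris_wilson_step(W, O, C, alpha=1.0, beta=0.0, kappa=1.0)
```

By construction each row of T sums to O_i, so total demand is conserved; in the tiny example above the equilibrium condition κ Σ_k W_k = Σ_i O_i then pins the total size Σ_k W_k to 2.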
In the stochastic case, the dynamics reach a steady-state equilibrium that is independent of the initial condition W_0. A good indicator to assess the effect of the parameters α and β on the system steady state is the inequality parameter ν ∈ [0, 1]: ν = 1 indicates a completely monopolised market, while ν = 0 indicates perfect equality of size, i.e. all zones are of equal size. Low values of α and high values of β (i.e. low relative importance of zone size, high relative importance of convenience) lead to low overall inequality, due to a large collection of smaller zones emerging, all of roughly equal size (cf. fig. 5). Conversely, low relative importance of travel convenience and high relative importance of store size lead to the emergence of a small number of very large superstores, with most smaller centres dying out: this is reflected in high values of ν.
The steady-state condition eq. [7] is given by the M coupled equations

κ W_j = Σ_i O_i W_j^α c_ij^β / Σ_k W_k^α c_ik^β, j = 1, ..., M.  [9]

In the case of ν = 1, eq. [9] will be solved by any α and β, with κ̂ uniquely given by

κ̂ = Σ_i O_i / W_k̂,

where k̂ is the index of the monopolising zone. This region is thus unlearnable in α and β (with consumers only having a single option, both their preference for size and convenience is irrelevant). Similarly, the case ν = 0 admits solutions that are independent of α (since all zones are of the same size, the size preference parameter becomes irrelevant). Parameters thus cannot be learned for ν ∈ {0, 1}, and throughout this work we only consider datasets with 0 < ν < 1. For any ν, the triple (α = 1, β = 0, κ = Σ_i O_i / Σ_k W_k) represents a global minimum of the loss function, and if 0 < ν < 1, it can be shown that there exist at most two global minima of the loss function (see [19] for a proof): Fact. In their stable equilibrium, the noiseless Harris-Wilson equations eq. [7] for 0 < ν < 1 admit at most two solutions in (α, β, κ), one being the trivial solution (1, 0, Σ_i O_i / Σ_k W_k). For any ν, the solution is unique in κ, where κ is given by κ = Σ_i O_i / Σ_k W_k.

i. Results on synthetic data
We generate synthetic data of the steady-state sizes W := W(t → ∞), determined via sup_j |dW_j| < tol for some sufficiently small tolerance. We generate training data with different noise levels; the data has length L = 1 in the noiseless case and L = 4 in the noisy case. We run the numerical solver without noise, since we do not wish to include any assumptions about σ in the training process.
Considering first the noiseless case, we wish to estimate a probability distribution with confidence bounds for the model parameters (α, β, κ) ∈ R^3_+ (ϵ is not learnable, as it does not affect the steady state). The steady state must meet the criterion 0 < ν < 1. We train the neural net 100 times from different initialisations for 10000 epochs each, performing a gradient descent step after each prediction (thus the batch size is B = 1), and using the same network architecture as before. The loss function is the same as in the SIR case (eq. [5]). Figure 6 visualises the resulting loss potential in α and β.
We clearly see the two global minima of J: one at the true parameters, and one at the trivial minimum of the decoupled system, α = 1, β = 0. By running the model several times with different initialisations, we are thus able to deal with the non-convexity of the problem. Turning to the case of noisy data, fig. 7 shows the marginals for α as we increase the level of noise in the data. The noise is generated by a centred Gaussian, and its variance ranges from 0 to a very high level of σ = 1.5. In the noiseless case (dark blue) the non-trivial solution dominates, with a small peak at the trivial case α = 1. As the noise increases, so do the peak widths, until at σ = 1.5 the non-trivial solution is no longer identifiable, and the model collapses to the trivial value α = 1. Figure 8 shows the average and standard deviation of the peak widths for all three parameters. Since κ is uniquely identifiable, there is only one peak, hence the standard deviation is 0. For the other two, the disappearance of the non-trivial peak is evident at σ = 1.5, where the standard deviation of the peak width becomes 0. For all three parameters, we observe an increase in the average peak width as the noise in the data increases. At σ = 0, there is still some uncertainty in the predictions, originating from the uncertainty in the internal model weights θ. Generating the 1 million samples in three-dimensional parameter space shown in fig. 6 takes about 15 minutes on a standard CPU. Far fewer data points would already have been sufficient to obtain good estimates of the parameters, since the model spends most of its time near the minima of the loss function. The time to run a single iteration increases with (N + M)^2, as the adjacency matrix of the network (which is not sparse) grows accordingly; the loss after a fixed number of iterations remains fairly constant (see SI appendix fig. S2).
In the following section, we give a more detailed analysis of the model performance by comparing it to that of a classical Bayesian estimator.

ii. Results on data of economic activity in London
We apply our methodology to a dataset of economic activity in Greater London and compare the results to a study of the same dataset, performed using an MCMC approach [20]. The data consists of origin zone sizes O, destination zone sizes W, and two different convenience matrices C for the region for the period around 2015/2016.
Origin zones: The origin zone sizes are given by the total spending budget of each of the N = 625 electoral wards in Greater London, as classified by the Greater London Authority (GLA) [55]. Ward-level income is calculated as the product of the number of households and the average household income; however, households do not spend their entire budget on retail and service goods: according to the Office for National Statistics [56], between 2015-2017 London households on average spent between 21%-30% of their total budget on comparison, service, and convenience goods 3, hence we multiply the income figures by 0.21 to obtain the origin size (we are excluding smaller retail zones from the dataset, see below, and hence choose to exclude food costs from the budget figure, see footnote). For example, for the City of London, 6680 households and an average household income of £63,620/a result in a total ward-level income of about £425 million per annum, and hence an origin size of £89 million/annum. In order to prevent numerical overflow, we use units of £10^8/a for the origin zones. Recall that scaling the origin zone values only affects the resulting prediction for κ, as α and β are unitless. It should be noted that while the population figures are for 2015, the income figures are only for 2012/2013.

[Figure caption fragment: network edges derived using the minimum transport time between locations for public transport or driving. Background map of London boroughs: [54]. Right: the densities of the distance distributions for three different metrics: temporal, spatial, and Euclidean.]

Destination zones: The destination zone sizes are the total occupied retail floor space sizes in m² for all M = 49 town centres classified as either 'international', 'metropolitan', or 'major' by the Greater London Authority (GLA) [57]. In the GLA report [57], this retail floor space comprises comparison, convenience, and service retail (as given by [57], tab. 1.1a-1.1.1c); for example, for the West End we obtain a total occupied retail floor space in 2016 of 474,456 m². In order to prevent numerical overflow (resulting from large values of W_j^α potentially occurring during the training), we use units of 10^5 m² for the destination zone sizes. With this choice of units for the origin and destination zone sizes, κ will be given in units of £1000/a. The true value of κ resulting from this data is then given by κ = Σ_i O_i / Σ_k W_k.

Cost network: For the cost network C we use the Google Distance Matrix API 4 to extract travel data from Google Maps. The API can be used to extract both travel distances and travel times for large batches of search queries for all the transport modes available on Google Maps (driving, public transport, walking, cycling). We only consider two modes of transport: driving and public transport. The API does not provide past data, so travel time and distance data reflect the state of the network in June 2022. In order to restore some level of comparability, we set the travel date to a Sunday, blending out the newly opened Elizabeth line (which in June 2022 did not yet run on Sundays). We define the distance d_ij between two nodes as the shorter of the two transport modes considered: for example, travelling from Kentish Town to the West End takes about 19.7 minutes by public transport and 22.35 minutes by car, so we set the distance to 19.7 minutes (of course, this assumes that everyone can choose between both modes of transport, and does not factor in the added cost of driving in London). Note that the Google Maps public transport mode defaults to walking for very short distances. We consider two different metrics: a temporal metric and a spatial metric. The convenience factors are derived in [58] as

c_ij = exp(−d_ij/τ),

where d_ij is the distance in the metric in question and τ = sup_{i,j} d_ij the time/length scale, ensuring a unitless exponent. The network is then given by C = (c_ij).
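Given a scraped distance matrix, the convenience factors can then be computed in two lines. Here we assume the exponential form c_ij = exp(−d_ij/τ) for the functional form derived in [58], with hypothetical travel times:

```python
import numpy as np

# Hypothetical travel-time matrix in minutes (e.g. from the Distance Matrix API)
d = np.array([[ 0.0, 19.7, 31.0],
              [19.7,  0.0, 12.4],
              [31.0, 12.4,  0.0]])

tau_scale = d.max()              # time scale tau = sup_ij d_ij
C = np.exp(-d / tau_scale)       # convenience factors, assumed exponential form
```

Dividing by the supremum makes the exponent unitless, so the same two lines work for the temporal and the spatial metric alike.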
Figure 9 visualises the dataset: the red nodes are the destination zones W j , the blue dots are the origin zones O i , and the network edge widths represent the cost matrix, using the temporal metric. The large, central destination zone is the London West End, by far the largest retail zone in London. On the right, the distance distributions for the two metrics are shown: it is interesting to observe that the temporal metric has a distribution more or less symmetrically centred around 0.5, whereas the distance metric is much more heavily skewed to the left. In practice this means that zones are statistically closer spatially than temporally. This may be an artefact of our choice of transportation mode: since users must use either public transport or drive, zones within realistic walking or cycling distance appear further away than they actually are. This is substantiated by the fact that more zones lie within zero spatial distance of each other than within zero temporal distance. We expect the choice of metric to only affect the predictions for β, the convenience parameter coupling the dynamics to the network.

Results
We estimate the marginal densities of the parameters for both metrics (temporal and spatial), and compare the results to those obtained in [20]. In that study, the authors considered two underlying noise levels in the data, σ = √2 × 10⁻² and σ = √2 × 10⁻¹, and estimated densities for α and β. We produce estimates for α, β, and κ, using the same values for the noise in the forward pass of the scheme. The resulting marginals are shown in fig. 10, along with the densities obtained from running the MCMC scheme from [20]. We train our model for 10,000 epochs over 20 different initialisations to obtain the same number of samples. It should be noted that the authors of [20] used Euclidean distances in their study, which, as can be seen in fig. 10, have much the same distribution as our spatial metric. Table 1 summarises the results. While the MCMC scheme produced very different estimates depending on the choice of σ, our method shows good consistency across noise regimes. As is to be expected, the predictions for α and κ are independent of the choice of metric. We obtain good to fair estimates for κ, as well as a consistent maximum likelihood estimate for α. We assess the quality of the predictions by using the most likely values to calibrate the numerical solver eq. [6] and computing the expected mean squared prediction error over 100 runs (thereby accounting for the random noise involved). Since [20] did not estimate κ, we use its true value for the calibration run with the MCMC values. As can be seen, our method improves on the MCMC prediction by three orders of magnitude in the low-noise regime and by two orders of magnitude in the high-noise regime. Note also that our method produces density estimates in just over one minute, while the MCMC scheme took between 3.7 and 7.3 hours to run; this represents a performance improvement of two orders of magnitude.
The speed-up is not attributable to parallelisation: in CPU time, our method took 9 minutes to run, still representing a performance increase of two orders of magnitude.

V. Conclusions
In this article, we outlined an approach to estimating probability densities for parameters of differential equations using neural nets. We considered a broad spectrum of datasets, including time series of different lengths, steady-state data with only a single time frame, noiseless and noisy data, and both synthetically generated and real data; the method was applied to a non-convex problem with two global minima of the loss function (the Harris-Wilson model), as well as to a situation with data-model mismatch (the SIR model). In all cases, the neural net quickly and reliably found densities for the relevant parameters. We assessed the quality of the predictions by comparing the parameter predictions to their true values and, in the case of the Harris-Wilson model, by comparing our model's prediction errors to those of a study using classical MCMC techniques. Our model significantly outperformed that study in terms of the accuracy of the predicted data, while being computationally much faster to run: in both regards, the improvements spanned two to three orders of magnitude. The use of numerical solvers for the forward pass allows estimates to be given for different levels of assumed noise in the data, which can help obtain realistic confidence bounds on predictions; in the SIR study, we moreover showed that our method can learn the noise level itself. We demonstrated the importance of (1) analysing a model's dynamical range before training, (2) ensuring numerical stability throughout the training process, and (3) adapting the neural net architecture to the problem. For the Harris-Wilson model, this was achieved by (1) only training on data from the dynamical range 0 < ν < 1, (2) scaling the origin and destination zone sizes appropriately to prevent numerical overflow, and (3) ensuring positive predictions by using the modulus as the activation function.
Due to the method's performance and relative mathematical simplicity, we hope that it will prove a powerful tool across the quantitative sciences, such as in quantitative sociology or computational epidemiology. Our work opens up fruitful lines for further research: in particular, we have not yet considered how the model performance scales with the size of λ. How, for instance, will the model fare when predicting very large parameter sets, such as graph adjacency matrices? This will be the subject of further investigation by the authors.

Methodology Neural Networks: Notation and Terminology
A neural network is a sequence of L ≥ 1 concatenated transformations. The i-th layer of the net consists of L_i neurons, connected through weight matrices W_i ∈ R^(L_{i+1} × L_i). Each layer applies the transformation

x ↦ σ_i(W_i x + b_i)

to the input x from the previous layer, where b_i ∈ R^(L_{i+1}) is the bias of the i-th layer. The function σ_i : R^(L_{i+1}) → R^(L_{i+1}) is the activation function; popular choices include the rectified linear unit (ReLU) σ(x) = max(x, 0) and the sigmoid σ(x) = (1 + e^(−x))^(−1). A neural net has an input layer, an output layer, and hidden layers, which are the layers in between the input and output layers. If a network has only one hidden layer, we call it shallow; otherwise, we call the neural net deep.
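The layer map just described can be written out explicitly. The following sketch (plain numpy with arbitrary layer sizes, not the architecture used in the paper) chains three layers of the form x ↦ σ_i(W_i x + b_i):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)         # ReLU: sigma(x) = max(x, 0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # sigmoid: sigma(x) = (1 + e^-x)^-1

# Layer sizes L_0, ..., L_3: input layer, two hidden layers, output layer
sizes = [4, 8, 8, 2]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]  # W_i in R^(L_{i+1} x L_i)
biases = [np.zeros(m) for m in sizes[1:]]                                   # b_i in R^(L_{i+1})
activations = [relu, relu, sigmoid]

def forward(x):
    # Each layer applies x -> sigma_i(W_i x + b_i)
    for W, b, sigma in zip(weights, biases, activations):
        x = sigma(W @ x + b)
    return x

y = forward(np.ones(4))
print(y.shape)  # (2,)
```

With more than one hidden layer, this network is deep in the sense defined above; deleting one hidden layer would make it shallow.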
Figure S1: Example of a deep neural network with 3 hidden layers. The inputs (light blue nodes) are passed through the layers, with links between layers representing the weight matrices W. Each layer also applies a bias (yellow nodes), with the network finally producing an output (orange).

Details on the code
The code is available in the GitHub repository given in the main text.

Installation
Detailed installation instructions are given in the repository. First, clone the repository and install the utopya package along with all required additional components into a virtual environment, for example via PyPI; in particular, install pytorch. Then enter the virtual environment and, from within the project folder, register the project:

utopya projects register .
You should get a positive response from the utopya CLI and your project should appear in the project list when calling:

utopya projects ls
Note that any changes to the project info file need to be communicated to utopya by calling the registration command anew. You will then additionally have to pass the --exists-action overwrite flag, because a project of that name already exists.

Running the code
To run the code, execute the following command:

utopya run <model_name>

By default, this runs the model with the settings in the <model_name>_cfg.yml file. All data and plots are written to an output directory, typically located in ~/utopya_output. To run the model with different settings, create a run_cfg.yml file and pass it to the model like this:

utopya run <model_name> path/to/run_cfg.yml

This is recommended over changing the default settings, because the defaults are parameters that are known to work and that you may wish to fall back on in the future.
Plots are generated using the plots specified in the <model_name>_plots.yml file. These too can be updated by creating a custom plot configuration and running the model like this:

utopya run <model_name> path/to/run_cfg.yml --plots-cfg path/to/plot_cfg.yml

See the Utopia tutorial for more detailed instructions.
All the images in this article can be generated using so-called configuration sets, which are complete bundles of run and evaluation configurations. For example, to generate the four frames of the SIR model, call:

utopya run SIR --cfg-set ABM_data

This will run and evaluate the SIR model with all the settings from the SIR/cfgs/ABM_data/run.yml and eval.yml configurations.

Parameter sweeps
Parameter sweeps are automatically parallelised by utopya, meaning simulation runs are performed concurrently whenever possible. The data is automatically stored in and loaded into a data tree. To run a sweep, simply add a !sweep tag to the parameters you wish to sweep over and specify the values, along with a default value to be used if no sweep is performed. Then, in the run configuration, add the following entry:

perform_sweep: true

Alternatively, call the model with the flag --run-mode sweep. The configuration sets used in this work automatically run sweeps whenever needed, so no adjustment is required to recreate the plots used in this work.
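As an illustration, a sweep declaration might look as follows; the parameter name alpha and its values are placeholders, not keys from the repository configurations:

```yaml
# Model configuration: tag the parameter with !sweep and give sweep values
alpha: !sweep
  default: 1.0             # used whenever no sweep is performed
  values: [0.5, 1.0, 1.5]  # values to sweep over

# Run configuration: enable the sweep
perform_sweep: true
```

Each value in the sweep then spawns its own simulation run, and utopya collects the outputs into the data tree.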

Initialising the neural net
The neural net is controlled from the NeuralNet entry of the configuration:

name: HardTanh  # you can also pass a function that takes additional args and/or kwargs
args:
  - -2  # min_value
  - +2  # max_value
learning_rate: 0.001  # optional; default is 0.001
optimizer: SGD        # optional; default is Adam

num_layers specifies the depth of the net; nodes_per_layer controls the architecture: provide a default size and, optionally, any deviations from it under layer_specific. The keys of the layer_specific entry should be the indices of the layers in question. The optional biases entry determines whether or not biases are included in the architecture and, if so, how they are initialised; a default and layer-specific values can again be passed. Setting an entry to default initialises the values using the pytorch default initialiser, a Xavier uniform initialisation; passing a custom interval instead initialises the biases uniformly at random on that interval, and passing a tilde ~ (None in YAML) turns the bias off. activation_funcs is a dictionary specifying the activation functions on each layer, following the same logic as above. Any pytorch activation function is permissible; if a function requires additional arguments, these can be passed as in the example above. Lastly, the optimizer keyword takes any argument allowed in pytorch; the default optimizer is the Adam optimizer [50] with a learning rate of 0.001. The neural net can be initialised from different initial values in the parameter space by changing the random seed in the configuration.

where dϕ(t) is given by

The ReLU function ensures that densities do not become negative. The estimated value ŵ is given by σ̂X, where σ̂ is the neural net prediction for the noise, and X ∼ N(0, 0.1) is a normally distributed random variable with variance 0.1. The function f is given by

with k = 1000; it approximates a step function, ensuring that recovery only begins after a certain period.
γ is a scaling factor, designed to ensure that the three estimated parameters are roughly of the same order of magnitude. This increases the speed of the neural net's convergence to a loss function minimum, as the parameters lie closer together, but is not required to achieve reliable results. When writing the data, τ̂ is scaled back to its original dimension. We choose γ = 10.
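The step-function approximation described above can be illustrated with a logistic sigmoid. Note that this is only a plausible sketch consistent with the description (a sharp transition controlled by k = 1000, with a hypothetical onset time t0); the exact form of f is given in the manuscript's equation:

```python
import numpy as np

def smooth_step(t, t0, k=1000.0):
    """Smooth approximation to a step at t = t0: ~0 before, ~1 after.

    Illustrative assumption only -- k controls the sharpness of the
    transition; the paper's f may differ in detail.
    """
    return 1.0 / (1.0 + np.exp(-k * (t - t0)))

t0 = 5.0                     # hypothetical onset time of recovery
print(smooth_step(4.9, t0))  # ~0: recovery has not yet begun
print(smooth_step(5.1, t0))  # ~1: recovery is active
```

With k = 1000, the transition from 0 to 1 happens over a window of order 1/k around t0, which is why recovery is effectively switched off before that time.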

Running the code
The training data for the ABM is provided in the data/SIR folder in hdf5 format. To train the neural net, simply run the following command (make sure you are in the project folder; otherwise, change the path to the dataset to an absolute path):

utopya run SIR --cfg-set Predictions
This will load the ABM data and run the model 20 times from different initialisations, finally producing the plots from the manuscript. You can change the number of sweeps and the initial seeds by changing the seed entry in the run.yml configuration. By default, it looks like this:

seed: !sweep
  default: 0
  range: [20]

Application to a non-convex problem: the Harris-Wilson model of economic activity

Neural Network Architecture
The neural network architecture is the same as for the SIR section, using the same optimizer with a learning rate of 0.002. We initialise the biases of the neural net in the interval [0, 4]. In the noiseless case we set the training noise to 0, as we are not learning the noise; for the noisy runs, we also learn the noise.

Training
We use the following matrix form of the Harris-Wilson equations to train the neural net. Let D ∈ R^M, O ∈ R^N, and W ∈ R^M be the demand vector, origin zone size vector, and destination zone size vector, respectively. The dynamics are given by

with ⊙ indicating the Hadamard product, and elementwise exponentiation. Z ∈ R^N is the vector of normalisation constants

The dynamics then read Ẇ = W ⊙ (D − κW) with given initial conditions W(t = 0) = W_0. This formulation is more conducive to machine learning, since it consists of easily differentiable matrix operations and avoids for-loop iteration.

Figure S2 shows a performance analysis of the model as a function of the network size N + M. On the left, the time a neural net with 20 neurons takes to run a single epoch (L = 1, B = 1) is shown; each data point is an average over 10 different initialisations, each with 6000 epochs. On the right, the loss after 6000 epochs is shown for each size, averaged over 10 different initialisations; the shaded area represents the standard deviation. Training was performed on the CPU of a standard laptop. Figure S3 shows the marginals for β, κ, and the noise level σ corresponding to figure 7 in the main manuscript.
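A minimal numpy sketch of this vectorised formulation follows. It assumes the standard Harris-Wilson demand term D_j = Σ_i O_i W_j^α c_ij^β / Z_i with normalisation constants Z_i = Σ_k W_k^α c_ik^β; the sizes, parameter values, and Euler step are placeholders, and any responsiveness and noise terms of the full stochastic model are omitted:

```python
import numpy as np

rng = np.random.default_rng(1)

N, M = 5, 3                    # number of origin and destination zones (placeholders)
alpha, beta, kappa = 1.2, 4.0, 2.0
dt = 0.01                      # explicit Euler step size (placeholder)

O = rng.uniform(0.5, 1.5, size=N)            # origin zone sizes, O in R^N
W = rng.uniform(0.5, 1.5, size=M)            # destination zone sizes, W in R^M
C = np.exp(-rng.uniform(0.0, 1.0, (N, M)))   # convenience matrix c_ij in (0, 1]

def demand(W):
    U = W[None, :] ** alpha * C ** beta  # utility terms W_j^alpha * c_ij^beta, shape (N, M)
    Z = U.sum(axis=1)                    # normalisation constants Z_i, shape (N,)
    T = O[:, None] * U / Z[:, None]      # flows T_ij; each origin's outflow sums to O_i
    return T.sum(axis=0)                 # demand D_j = sum_i T_ij, shape (M,)

# One explicit Euler step of the noiseless dynamics W' = W ⊙ (D - kappa * W)
D = demand(W)
W_next = W + dt * W * (D - kappa * W)
print(W_next)
```

Because every operation is a differentiable matrix expression, the same forward pass can be written in pytorch and backpropagated through, which is the point of this formulation.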

Datasets
The data/HarrisWilson folder contains several datasets, both real and synthetic, that can be used to train the neural net and learn parameters for the Harris-Wilson equations. Simply set the load_from_dir key in your run.yml file to point to the folder containing the data: the data will automatically be loaded and the model trained on it.

Synthetic data
The synthetic_data folder contains synthetically generated networks, origin sizes, and destination sizes, both with and without noise. The name of each folder indicates the network size, e.g. N_100_M_10 means N = 100 and M = 10. In addition, each folder includes a config.yml file detailing the specific configuration of the dataset.

London dataset
The London_data folder contains datasets of economic activity across Greater London. The GLA_data folder contains the data compiled from the two GLA studies on ward profiles and retail floor space. The dest_sizes.csv and origin_sizes.csv files are the destination and origin zone sizes used in the paper. The exp_times.csv and exp_distances.csv files contain the two transport network metrics used, calculated via exp(−d_ij / max_{i,j} d_ij) from the respective times.csv and distances.csv files. The Google_Distance_Matrix_Data folder contains transport times and distances obtained via the Google Maps API service. Each file is a pickled dictionary containing the API output for a travel mode: transit (public transport) or driving (driving, no traffic). The departure time for transit is Sunday, June 19th 2022, 1 pm GMT (in Unix time: departure_time = 1655640000); since trips in the past cannot be computed, a future date must always be specified when using the API. The data is also available as cost matrices in .csv format, with entries given in seconds and metres respectively.