Bayesian uncertainty quantification for data-driven equation learning

Equation learning aims to infer differential equation models from data. While a number of studies have shown that differential equation models can be successfully identified when the data are sufficiently detailed and corrupted with relatively small amounts of noise, the relationship between observation noise and uncertainty in the learned differential equation models remains unexplored. We demonstrate that for noisy datasets there exists great variation in both the structure of the learned differential equation models and their parameter values. We explore how to exploit multiple datasets to quantify uncertainty in the learned models, and at the same time draw mechanistic conclusions about the target differential equations. We showcase our results using simulation data from a relatively straightforward agent-based model (ABM) which has a well-characterized partial differential equation description that provides highly accurate predictions of averaged ABM behaviours in relevant regions of parameter space. Our approach combines equation learning methods with Bayesian inference approaches so that a quantification of uncertainty can be given by the posterior parameter distribution of the learned model.


Introduction
Many phenomena in nature arise as a result of complex interactions between individual agents at the microscale that give rise to emergent properties at the macroscale. Understanding the mechanistic basis for the observed macroscale behaviour, in order to gain fundamental insights into biological phenomena, is one of the key challenges in biology.
Mathematical models are well-placed to provide such insights, offering a rigorous framework in which hypotheses can be generated, tested and refined. However, practitioners wishing to develop models that they can use in real-world settings require, in addition, a thorough quantification of uncertainty [10,16,17,18,19]. This need arises because the often significant noise in real-world data can impact the models predicted by equation learning methods, and hence the predictive capability of those models for unseen data or scenarios [20,21]. For example, Nardini et al. [8] have recently shown, through several case studies, that it is possible to infer differential equation models that describe noisy data generated by stochastic ABMs. However, the stochasticity in the ABM results in variability in the learned macroscale differential equation. This means that, for a particular realisation generated from a stochastic model, the learned differential equation is a point estimate of the underlying differential equation, and there is no quantification of uncertainty in the learned equation.
Recently, some authors [11,22] have begun to address this problem by analysing the robustness of PDE-FIND with noisy or sparse data. For example, Rudy et al. [11] investigate how the learned PDE varies as the numerical solution of a ground truth PDE is corrupted by additive noise, while Li et al. [22] investigate how to increase the signal-to-noise ratio of a dataset prior to the use of equation learning techniques. While both works show that model parameters can be retrieved to within an impressive margin of error when the data are corrupted with relatively small amounts of noise, these approaches give no statistical quantification of the uncertainty in model predictions, nor do they address how to deal with significant noise levels.
In this work we demonstrate that noise can significantly impact both the structure and the values of the parameters of learned differential equation models, rendering uncertainty quantification a crucial component of the equation learning process. As such, our overarching aim is to develop and showcase a method for uncertainty quantification in the context of equation learning, where we harness the immense computational efficiencies of PDE-FIND in learning point estimates of governing equations, together with the power of computational Bayesian inference in evaluating the level of uncertainty in the learned equation. Figure 1 shows our proposed framework for uncertainty quantification. We start from the basis that it is possible to collect an ensemble of spatiotemporal datasets from a given system, and develop an approach to understand how the data can be used to learn a governing equation while simultaneously estimating the uncertainty in that learned equation.
Our motivation is thus: on the one hand, PDE-FIND provides a computationally cheap method to obtain a point estimate for the governing equation from a single time series. However, when the data are noisy, the individual predictions are unreliable. This is seen on the left-hand side of Figure 1, where this approach is identified with obtaining a point estimate of the PDE.
On the other hand, the field of computational Bayesian inference provides a number of methods to estimate, for a given model and data, posterior parameter distributions, i.e., it provides estimates of model parameters and quantifies the uncertainty in those estimates. In principle, computational Bayesian inference approaches could be used directly with the candidate library of the PDE-FIND method to estimate the posterior distribution of the library coefficients. However, due to the very large number of candidate terms in the PDE-FIND library, the computational cost associated with applying methods for computational Bayesian inference on the entire high-dimensional parameter space is generally prohibitive. Instead, we propose a framework that combines the strengths of each approach: first, we train the PDE-FIND algorithm on individual datasets from the ensemble to obtain an informed prior parameter distribution (top row of Figure 1). While this prior distribution will likely be relatively broad and uninformative of the uncertainty in the model, it can still be used to vastly reduce the dimensionality of the inference problem. Such a reduction in dimensionality makes it feasible to find an informative posterior distribution using computational Bayesian approaches (right-most column of Figure 1).
In this work, we demonstrate the potential of this approach using synthetic data generated from a widely used ABM that describes the behaviour of a motile and proliferative cell population and can be coarse-grained to a mean-field PDE that accurately describes ABM dynamics in certain regions of parameter space. In Section 2, we describe the ABM and discuss in detail its relation with a governing PDE, as well as the PDE-FIND algorithm. In Section 3, we demonstrate both that the PDE models learnt using PDE-FIND are intrinsically variable when the data are noisy, and that PDE-FIND can learn unphysical models. We provide an explanation for this in terms of the objective function of the PDE-FIND algorithm. In Section 4 we propose a method to combine methods for Bayesian inference with PDE-FIND in order to learn the structure of the governing PDE model, construct a prior parameter distribution for the PDE model, and infer the posterior parameter distribution of the learned PDE model. We conclude in Section 5 with a discussion of our results, and avenues for future research. Code for all algorithms can be found at https://github.com/simonmape/UQ-for-pdefind.

Models and equation learning methodology
We begin by describing the ABM and briefly outlining how to derive the corresponding coarse-grained PDE model, and then we provide details of the PDE-FIND algorithm.

Agent-based model
ABMs allow practitioners to investigate the collective behaviour of a population of individuals based on a description of the behaviour of individuals within that population. Here, in order to take into account the interactions between individuals of the population, we follow the volume-exclusion model presented in [4,23,24,25,26] for a population of agents that move and proliferate according to a discrete random walk model. This is a simple model that can be used to analyse a range of phenomena, including the collective migration of cells in a tissue, for example.
We assume that agents occupy sites on a square lattice of spacing ∆, so that their possible locations are (i∆, j∆), where (i, j) are integer coordinates, and volume exclusion entails that at most one agent can occupy a lattice site at any given time. We have 1 ≤ i ≤ I and 1 ≤ j ≤ J, and throughout this work we take I = 200 and J = 20. A pseudo-one-dimensional initial condition is imposed by initially populating all lattice sites with 90 ≤ i ≤ 110, and leaving the rest of the lattice empty. We impose zero-flux boundary conditions at all boundaries so that agents cannot leave the lattice, and use time step τ to advance the simulations through time, with T = 1000 time steps in total for each simulation. Let N(t) denote the number of agents on the lattice at time t. The parameter p_m ∈ [0, 1] specifies the attempted movement probability of each agent in a time interval of duration τ, and ρ ∈ [−1, 1] the left-right bias in movements. Similarly, the parameter p_p ∈ [0, 1] specifies the attempted proliferation probability of each agent in a time interval of duration τ.
At each time step, τ, a random sequential updating procedure is carried out: N(t) agents are selected, one at a time, with replacement, and are allowed to attempt a movement or proliferation event. When an agent is selected, a random number S_1 ∼ U(0, 1) is drawn. If S_1 ≤ p_p then the agent attempts to proliferate by placing a daughter agent into one of its randomly chosen nearest neighbour sites. If the target site is occupied then the proliferation event is aborted. If p_p < S_1 ≤ p_p + p_m then the agent attempts to move to one of its nearest neighbour lattice sites. A second random number S_2 ∼ U(0, 1) is drawn and the target site is chosen according to the rules in Table 1. As for proliferation, if the target site is occupied then the movement event is aborted. If S_1 > p_p + p_m then the agent does not attempt to move or proliferate. For convenience we take ∆ = 1, τ = 1 and p_m = 1, and consider the effects of varying p_p and ρ.
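The update rules above can be sketched as follows. This is a minimal illustration, not the authors' implementation; in particular, the biased-move probabilities (right with probability (1 + ρ)/4, left with (1 − ρ)/4, up and down with 1/4 each) are an assumed convention standing in for Table 1.

```python
import numpy as np

def abm_step(lattice, pm, pp, rho, rng):
    """One random sequential update of the exclusion-process ABM.

    lattice : (I, J) array of 0/1 occupancies, modified in place.
    Agents are drawn (with replacement) from the start-of-step
    configuration; stale selections (sites vacated mid-step) are skipped.
    """
    I, J = lattice.shape
    occupied = np.argwhere(lattice == 1)
    n = len(occupied)
    for _ in range(n):
        i, j = occupied[rng.integers(n)]
        if lattice[i, j] == 0:   # agent already moved away this step
            continue
        s1 = rng.random()
        if s1 <= pp:
            # attempt proliferation into a uniformly chosen neighbour
            di, dj = [(1, 0), (-1, 0), (0, 1), (0, -1)][rng.integers(4)]
        elif s1 <= pp + pm:
            # attempt movement; assumed left-right bias convention
            s2 = rng.random()
            if s2 < (1 + rho) / 4:
                di, dj = 1, 0        # right
            elif s2 < 1 / 2:
                di, dj = -1, 0       # left
            elif s2 < 3 / 4:
                di, dj = 0, 1        # up
            else:
                di, dj = 0, -1       # down
        else:
            continue                 # neither move nor proliferate
        ti, tj = i + di, j + dj
        # zero-flux boundaries and volume exclusion: abort if invalid
        if 0 <= ti < I and 0 <= tj < J and lattice[ti, tj] == 0:
            if s1 <= pp:
                lattice[ti, tj] = 1                       # daughter agent
            else:
                lattice[i, j], lattice[ti, tj] = 0, 1     # move
    return lattice
```

With p_p = 0 the agent number is conserved, which provides a quick sanity check of the exclusion logic.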

Move chosen         Target site     Where random number S_2 falls
vertically down     (i, j − 1)      0 ≤ S_2 < 1/4
vertically up       (i, j + 1)      1/4 ≤ S_2 < 1/2
right               (i + 1, j)      1/2 ≤ S_2 < 1/2 + (1 + ρ)/4
left                (i − 1, j)      1/2 + (1 + ρ)/4 ≤ S_2 ≤ 1

Table 1: Algorithm by which an agent at site (i, j) selects a target site to move into.
Let C^k_ij(t) denote the occupancy of site (i, j) at time t in simulation k, so that C^k_ij(t) = 1 if (i, j) is occupied by an agent at time t and C^k_ij(t) = 0 if it is empty. We can average the site occupancy over the columns of the lattice, defining the mean occupancy of column i at time t in simulation k as

C^k_i(t) = (1/J) Σ_{j=1}^{J} C^k_ij(t),    for 1 ≤ i ≤ I,    (1)

to give a one-dimensional averaged agent density profile for simulation k.

Coarse-grained PDE model
To make progress in deriving a coarse-grained PDE equivalent, we first note that the choice of a pseudo-one-dimensional initial condition means that we can consider deriving a one-dimensional PDE for c(x, t), the density of agents at position x at time t, without making explicit reference to the spatial coordinate y. We use C_i(t) to denote the average probability of occupancy of lattice site i at time t, for 1 ≤ i ≤ I, where the average is taken over K simulations:

C_i(t) = (1/K) Σ_{k=1}^{K} C^k_i(t).    (2)

Conservation of agents at column i then allows us to write

C_i(t + τ) − C_i(t) = (p_m(1 + ρ)/4) C_{i−1}(t)[1 − C_i(t)] + (p_m(1 − ρ)/4) C_{i+1}(t)[1 − C_i(t)]
    − (p_m(1 + ρ)/4) C_i(t)[1 − C_{i+1}(t)] − (p_m(1 − ρ)/4) C_i(t)[1 − C_{i−1}(t)]
    + (p_p/4) C_{i−1}(t)[1 − C_i(t)] + (p_p/4) C_{i+1}(t)[1 − C_i(t)],    (3)

where the first four terms on the right-hand side correspond to changes in occupancy owing to agent movement, and the final two to agent proliferation. Note that in writing down this conservation statement, we have implicitly assumed that lattice site occupancies are independent, so that, for example, the average probability that site i is occupied and site i ± 1 is unoccupied can be written as C_i(t)[1 − C_{i±1}(t)]. This is a standard assumption called the mean-field approximation [27,28,29].
We then identify C_i(t) with the continuous density c(x, t), Taylor expand the resulting equation and take limits as ∆, τ → 0, to arrive at the following PDE:

c_t = D c_xx − v [c(1 − c)]_x + λ c(1 − c),    (4)

where

D = lim_{∆,τ→0} p_m ∆² / (4τ),    v = lim_{∆,τ→0} ρ p_m ∆ / (2τ),    λ = lim_{τ→0} p_p / τ,

and the subscripts x and t denote partial derivatives. For the full derivation and details, we refer the reader to [24,25,26].
Identification of the ABM with a coarse-grained macroscale PDE model motivates us to investigate the performance of equation learning methods trained on data generated by the ABM, since the PDE accurately describes the time-evolution of the expected value of the density profile, and so we can evaluate the performance of equation learning methods against Equation (4). Note that in order for Equation (4) to provide an accurate description of the averaged dynamics of the ABM, we require that the assumption of lattice-site occupancy independence (i.e. the mean-field assumption) approximately holds. Typically, this requires p_p and |ρ| to be small relative to p_m [23,27,30].
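A mean-field PDE of this type can be integrated with a simple explicit scheme, sketched below. The functional form c_t = D c_xx − v (c(1 − c))_x + λ c(1 − c) and the zero-flux boundary treatment are assumptions of this sketch (standard for this model class), and the first-order explicit discretisation is illustrative rather than the scheme used to produce the results in this work.

```python
import numpy as np

def solve_meanfield_pde(c0, D, v, lam, dx, dt, n_steps):
    """Explicit finite-difference solver for a PDE assumed to take the form
        c_t = D c_xx - v (c(1-c))_x + lam c(1-c),
    with zero-flux boundary conditions imposed via edge-value ghost nodes."""
    c = c0.copy()
    for _ in range(n_steps):
        cp = np.pad(c, 1, mode="edge")                    # Neumann ghosts
        cxx = (cp[2:] - 2 * cp[1:-1] + cp[:-2]) / dx**2   # second derivative
        flux = cp * (1 - cp)                              # advective flux c(1-c)
        fx = (flux[2:] - flux[:-2]) / (2 * dx)            # centred first derivative
        c = c + dt * (D * cxx - v * fx + lam * c * (1 - c))
    return c
```

With v = λ = 0 the scheme conserves mass exactly under the zero-flux boundaries, which is a useful check; stability requires D ∆t/∆x² ≤ 1/2.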

Comparison of the ABM and PDE model predictions
As test cases for learning the governing equations from data, we explore three different parameter regimes in the model, which each correspond to a biologically relevant setting: in Case I, we consider agents moving without bias and without proliferation (ρ = p_p = 0); in Case II, we consider agents moving with bias but without proliferation (ρ = 0.075 and p_p = 0); and in Case III, we consider agents proliferating but moving without bias (ρ = 0). In each case we solve the corresponding coarse-grained PDE numerically, with space discretisation ∆x = 10^−3 and constant time discretisation ∆t = 10^−4.
We make two observations. First, we note that the solutions of the PDE models accurately predict the dynamics of the ABM in the chosen parameter regimes (see also Figure S1 of the Supplementary Information), so that we have a "ground truth" PDE against which to benchmark the equation learning methodology. Second, at early times the density profiles are very similar for the three different cases, but at later times differences due to the effects of bias and proliferation are clearly discernible. This observation implies that the equation learning methodology will require data on sufficiently long timescales in order to be able to accurately learn the correct PDE model.

Equation learning: PDE-FIND
In the following, assume that we have time series data for an unknown function u(x, t) on a grid of n points in time and m points in space. These data are stored in a matrix U ∈ R^{n×m}. We assume that the data are a noisy discretisation of a function u(x, t), the solution of an unknown PDE, and the aim is to learn the PDE that best describes the governing equation of the observed data.
Henceforth, and to avoid confusion, we will write U(x, t) for the observed data, u(x, t) for the solution of the learned PDE, and c(x, t) for the solution to the coarse-grained PDE defined in Equation (4). We follow Rudy et al. [11,12] in assuming that the PDE governing u(x, t) is of the form

u_t = N(u, u_x, u_xx, . . .),    (5)

where N is a nonlinear function of u(x, t) and its partial derivatives. Furthermore, it is assumed that N is a linear combination of a finite number of distinct library terms, so that we can write

N = Σ_{i=1}^{N} ξ_i N_i(u, u_x, u_xx, . . .),    (6)

for coefficients ξ_i, where i = 1, . . . , N, and hence

u_t = Σ_{i=1}^{N} ξ_i N_i(u, u_x, u_xx, . . .).    (7)

By design, we will specify that N has polynomial nonlinearities, as is common in many equations in the natural sciences, and we note that Equation (4) falls within this class of PDEs. The aim of PDE-FIND is then to select, from the large library of terms N_i, for i = 1, . . . , N, a small subset of relevant terms.
The first step of the PDE-FIND pipeline is to numerically approximate both sides of Equation (7). This is done by estimating derivatives of the data with respect to space and time.
The standard PDE-FIND implementation in [11] takes finite difference approximations when the data contain little noise, and polynomial differentiation when the data are very noisy. The data and its derivatives are combined in a matrix Θ(U), where each column of Θ contains all of the values of a particular candidate function across the entire n × m grid. For example, if the candidate library consists of all polynomials up to degree two and non-mixed derivatives up to second order, N = 9 and Θ(U) will look like

Θ(U) = [ 1   U   U²   U_x   UU_x   U²U_x   U_xx   UU_xx   U²U_xx ],    (8)

where each entry denotes a column vector of length nm. As a result, if there are N terms in the candidate library, Θ(U) is an nm × N matrix.
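Assembling such a library can be sketched as follows. Here np.gradient stands in for the polynomial differentiation described above (an assumption of this sketch, appropriate only for low-noise data), and the column names are illustrative.

```python
import numpy as np

def build_library(U, dx):
    """Build the (n*m) x 9 candidate matrix Theta(U) for polynomials up to
    degree two multiplied by non-mixed spatial derivatives up to second
    order. U has shape (n, m): n time points by m space points."""
    Ux = np.gradient(U, dx, axis=1)        # finite-difference u_x
    Uxx = np.gradient(Ux, dx, axis=1)      # finite-difference u_xx
    polys = [(np.ones_like(U), ""), (U, "u"), (U**2, "u^2")]
    derivs = [(np.ones_like(U), ""), (Ux, "u_x"), (Uxx, "u_xx")]
    cols, names = [], []
    for p, pn in polys:
        for d, dn in derivs:
            cols.append((p * d).ravel())   # one column per candidate term
            names.append((pn + dn) or "1")
    return np.column_stack(cols), names
```

Each column holds one candidate term evaluated at every one of the nm grid points, matching the layout of Equation (8).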
The left-hand side of Equation (7) is similarly approximated, and we obtain a linear matrix equation representing the PDE evaluated at the data points:

U_t = Θ(U) ξ,    (9)

where ξ = [ξ_1, . . . , ξ_N]^T and U_t ∈ R^{nm} contains the estimated time derivative at every grid point. Taking the same example for Θ(U) as in Equation (8), this representation makes explicit that each row in the matrix equation represents the governing dynamics behind the data at one point in time and space. The values of the coefficients ξ_i determine the form of the PDE, and so the aim is to learn the coefficients ξ_i in some sense "optimally".
Following Rudy et al. [11], we will assume Θ to be overspecified, meaning that the dynamics can be represented as a linear combination of the columns of Θ. However, many PDEs in the natural sciences contain only a few terms. Therefore, we wish to learn a sparse vector ξ = [ξ_1, . . . , ξ_N]^T as a solution of Equation (7). This is done in PDE-FIND by considering the optimisation criterion

ξ̂ = argmin_ξ ‖Θ(U)ξ − U_t‖_2² + λ‖ξ‖_2²,    (11)

for the coefficient vector ξ, where λ ∈ R_{>0} is a free parameter that penalises large coefficients. This is the method of ridge regression. We note here that the term ‖ξ‖_2² can be replaced with ‖ξ‖_1, which corresponds to performing the LASSO [8,15]. The optimal choice of implementation is largely problem-dependent, and various choices for the regularisation method have been compared in the literature [11,12,32,33], although no method has been proven to be definitively preferred over another.
The default implementation of PDE-FIND as proposed by Rudy et al. [11] supplements the ridge regression problem with a sequential thresholding procedure in which a solution to Equation (11) is found, and a hard threshold is applied to the regression coefficients by eliminating all library terms that have coefficients smaller than some pre-specified parameter d_tol. This process is then repeated on the remaining library terms until all coefficients are larger than d_tol, or until a maximum number of iterations has been reached. The sequential thresholding process is undertaken to enforce sparsity, as the solution to the ridge regression problem in Equation (11) may contain several small, but non-zero, values. The combined algorithm is called Sequential Thresholding Ridge regression (STRidge). For more details and motivation of the method we refer to [11], and for completeness we summarise the PDE-FIND method in Algorithm 1.
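A compact sketch of the STRidge procedure reads as follows; the default parameter values are illustrative rather than those of [11].

```python
import numpy as np

def stridge(Theta, Ut, lam=1e-5, d_tol=0.025, max_iter=10):
    """Sequentially thresholded ridge regression (STRidge) sketch:
    solve the ridge problem, zero out coefficients smaller than d_tol,
    and re-solve on the surviving library terms."""
    n_terms = Theta.shape[1]
    keep = np.arange(n_terms)          # indices of surviving terms
    for _ in range(max_iter):
        A = Theta[:, keep]
        # ridge solution: (A^T A + lam I)^{-1} A^T Ut
        w = np.linalg.solve(A.T @ A + lam * np.eye(len(keep)), A.T @ Ut)
        big = np.abs(w) >= d_tol
        if big.all():                  # converged: all coefficients survive
            xi = np.zeros(n_terms)
            xi[keep] = w
            return xi
        keep = keep[big]               # hard-threshold small coefficients
        if len(keep) == 0:
            return np.zeros(n_terms)
    # final refit on whatever terms remain after max_iter rounds
    A = Theta[:, keep]
    xi = np.zeros(n_terms)
    xi[keep] = np.linalg.solve(A.T @ A + lam * np.eye(len(keep)), A.T @ Ut)
    return xi
```

On a well-conditioned synthetic problem with two active terms, the procedure recovers the sparse coefficient vector while discarding the noise-level entries.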

Application of PDE-FIND to the ABM data
We first generate ABM data for Cases I, II and III. For each, we generate two datasets with different noise levels, denoting them D^i_r, where r = I, II, III denotes the case and i = 1, 2 denotes the dataset / noise level. For Dataset 1 (i = 1) the density profiles are generated using single realisations of the ABM and averaging (so that K = 1 and the data are relatively noisy), whereas for Dataset 2 (i = 2) the density profiles are generated using 50 realisations of the ABM and averaging (so that K = 50 and the data are less noisy). We use the standard implementation of PDE-FIND [11] and use polynomial differentiation at fourth order to evaluate both the time and space derivatives.
We select a library of candidate terms that includes all polynomial terms up to order two and up to the second derivative. Table 3 shows the values of the coarse-grained PDE coefficients, according to Equation (4). These are the values of the coefficients that we would expect the PDE-FIND algorithm to return for perfect spatio-temporal data. In Table 3, and the rest of this work, we use the notation c_i for the coefficient of term i in the learned PDE.

Table 3: Coefficients of the coarse-grained PDEs describing evolution of the mean population density over time for the three example cases used in this work. The coefficients correspond to the coarse-grained PDEs described in Table 2.

Sources of variability and model misspecification
In this section, we showcase three different, but related, directions in the uncertainty quantification of the learned differential equations. In Section 3.1, we demonstrate the intrinsic variability of the learned coefficients in the presence of observation noise. We evaluate how uncertainty changes as the noise level is varied and suggest that, even when using state-of-the-art denoising approaches, a need for uncertainty quantification remains: although increasing the signal-to-noise ratio helps, regardless of the method there is still variability that needs to be quantified.
In particular, this is important in biological applications where observations are often very noisy and practitioners rarely have access to very large amounts of data. In Section 3.2, we investigate the impact of varying the regularisation hyperparameter in PDE-FIND with a view to asking whether this can be optimised to reduce uncertainty. We find that, while this is possible, parameter estimates are still uncertain and this uncertainty needs to be quantified. Finally, in Section 3.3, we demonstrate that a key issue with PDE-FIND is that it aims to fit the time derivative of the solution and does not take into account the fit of the observed density to the data, leading to unphysical predictions. We demonstrate how to mitigate these issues in Section 4 through the use of Bayesian methods where we can evaluate uncertainty in a framework that optimises the fit of the model density profile to the data.
In order to quantify the variability in results from the application of PDE-FIND, we introduce a statistic which we term the identification ratio. Assume that for each sample, s, in the observed dataset (which contains N_s averaged density profile samples), we have used PDE-FIND with STRidge to produce an estimate of the library coefficients, ξ, and denote this estimate ξ̂^s.
For each term i in the library, we define the identification ratio, a_i, as

a_i = (1/N_s) Σ_{s=1}^{N_s} I( ξ̂^s_i ≠ 0 ),

where I represents the indicator function and ξ̂^s_i is the i-th entry of ξ̂^s. Therefore a_i quantifies how often the term N_i from Equation (7) is included in the PDE-FIND predictions. When a_i is close to unity, the term is identified across many samples as being relevant for the dynamics and, conversely, when a_i is close to zero, the term is identified in only a small minority of samples as being relevant.
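Computing the identification ratios from an ensemble of PDE-FIND estimates is straightforward; the sketch below assumes the per-sample coefficient vectors are stacked into an (N_s, N) array, with a small tolerance standing in for exact zero.

```python
import numpy as np

def identification_ratios(xi_samples, tol=1e-8):
    """Identification ratio a_i for each library term: the fraction of the
    N_s per-sample estimates in which term i has a nonzero coefficient.
    xi_samples has shape (N_s, N)."""
    return np.mean(np.abs(np.asarray(xi_samples)) > tol, axis=0)
```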

Variability of relevant PDE-FIND coefficients with noisy observations
We first demonstrate that a naive application of PDE-FIND on noisy synthetic data yields variable and unreliable parameter estimates. For this application, we do not carry out hyperparameter tuning, but simply use widely adopted parameter settings to learn the coefficients.

Case I
For Case I, recall that the true PDE model contains only the term u_xx, with coefficient 0.25; hence in noise-free scenarios we anticipate that c_uxx should be non-zero and all other coefficients should be zero. Table 4 shows that the two terms identified regularly by PDE-FIND on D^1_I, the high-noise dataset, are u_xx and uu_xx, with identification ratios of 0.826 and 0.199, respectively.
On D^2_I, the low-noise dataset, u_xx is consistently identified and no other terms are identified. However, there is significant variability in the learned coefficients between different samples from the same dataset (Figure 3). In addition, for the high-noise dataset, D^1_I, the coefficients of u_xx and uu_xx are correlated (Figure 3C). In some cases, PDE-FIND identifies just one of the two terms, and in others it identifies a combination of the two. This result highlights that potentially the wrong PDE can be learnt from noisy data, partly because different PDEs can give rise to similar predictions. Since all of the ABM data samples have the same corresponding coarse-grained PDE, this highlights the inability of PDE-FIND to confidently learn the governing PDE from noisy data.

Figure 3: Case I. A: histograms for c_uxx generated using D^1_I (blue) and D^2_I (red) compared to the true parameter value (black line). B: histogram for c_u·uxx generated using D^1_I compared to the true value (black line). C: joint distribution of c_uxx and c_u·uxx generated using D^1_I compared to the true parameter (black star).

Case II
For Case II, where motility is biased, Table 4 shows that the two terms identified regularly by PDE-FIND on the high-noise dataset D^1_II are u_x and u_xx, with identification ratios of 0.999 and 0.012, respectively. Note that the true model should also contain the term uu_x, but PDE-FIND fails to identify this term across all the samples of the dataset. This term arises in the coarse-grained PDE as a result of volume exclusion (incorporated into the ABM through the requirement that at most one agent can occupy a lattice site at any instant in time). Therefore we infer that, in this case, the data are insufficient to identify the impact of volume exclusion. This is most likely a result of the initial conditions and/or the timescale over which data are collected, since the density is relatively low across the domain and so crowding is likely unimportant.
For the low-noise dataset D^2_II only u_x is identified, with an identification ratio of 1.0. The histograms in Figure 4 reveal a significant amount of variability in the learned parameters. For instance, for both D^1_II and D^2_II, the parameters are distributed far away from the true parameter value. The variability appears to decrease with the noise level, and the distribution of estimated parameters moves towards the true parameter value; however, the u_xx coefficient is "lost" in the process.

Figure 4: Case II, in which motility is biased but there is no proliferation. A: histogram for c_uxx generated using D^1_II (blue) compared to the true parameter value (black line). B: histogram of c_ux generated using D^1_II (blue) compared to the true parameter value (black line). C: histogram of c_ux generated using D^2_II (red) compared to the true parameter value (black line).

Case III
For Case III, which includes proliferation, Table 4 shows that the terms identified regularly by PDE-FIND on the high-noise dataset D^1_III are u, u² and u_xx, with identification ratios equal to 0.659, 0.482 and 0.571, respectively. Note that this means that all terms we would expect to appear in the PDE are identified. However, as shown in detail in Figure S6 of the Supporting Information, there is a correlation between the learned coefficients of u and u², which points towards non-identifiability [34,35]. For the low-noise dataset D^2_III, the parameters identified are u and u_xx, both with identification ratio equal to 1.0. The histograms in Figure 5 reveal that the variability in the learned coefficients decreases as the noise level in the data is decreased. However, this does not mean that the model is increasingly well identified as the noise is decreased: the term u² is not identified by PDE-FIND for the low-noise dataset D^2_III, which contradicts the mean-field analysis.

Methods to decrease the noise levels
To investigate the impact of noise on PDE-FIND, we carried out two further studies in which the noise in the data is reduced. First, we investigated whether choosing a coarser spatial grid improves the PDE-FIND predictions. Choosing a coarser spatial discretisation results in a smoother density profile; however, greater errors are incurred in the approximation of the spatial derivatives and fewer data points are available. For this experiment, we subsampled the data along the x-dimension by averaging the occupancy over multiple columns at a time.
Mathematically, from the empirical densities, C_i, at each time point, we subsample over intervals containing 2B lattice sites, estimating the average occupancies C̄_i, for 1 ≤ i ≤ I/(2B), as

C̄_i(t) = (1/(2B)) Σ_{l=2B(i−1)+1}^{2Bi} C_l(t).

Table 4 summarises the identification ratios found for the high-noise dataset D^1_I, where motility is unbiased and there is no proliferation, and we take B = 2. We see that with spatial subsampling the identification ratio of c_uxx increases significantly, although there is no marked improvement in the identification ratios of other terms in the model. We conclude that even if this method of noise reduction allows the correct coefficients to be identified more frequently, there remains a need to mitigate the fact that many other terms are spuriously identified by PDE-FIND.
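The block-averaging step can be implemented as below; trimming any leftover sites when 2B does not divide I is an assumption of this sketch.

```python
import numpy as np

def subsample_density(C, B):
    """Block-average a density profile over consecutive groups of 2B lattice
    sites, reducing a length-I profile to length I // (2B). C may be a 1D
    profile or an (n_times, I) array; any trailing sites that do not fill a
    complete block are discarded."""
    C = np.asarray(C)
    I = C.shape[-1]
    w = 2 * B
    trimmed = C[..., : (I // w) * w]
    return trimmed.reshape(*C.shape[:-1], I // w, w).mean(axis=-1)
```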
Second, we investigated other means to reduce observation noise. In applications of PDE-FIND to real-life data, practitioners will typically not be able to control for the amount of observation noise in the way that is done in the numerical experiments of this work, hence methods to smooth data may be useful in allowing identification of the PDE model. In Supplementary Information Section S3 we explore two practically appealing methods, convolution with a Gaussian kernel and an implementation of principal component analysis for equation learning by Li et al. [22]. Our results show that even with these well-established techniques for reducing the influence of noise, predictions remain variable with coefficients highly correlated, and that uncertainty quantification remains necessary for a reliable application of PDE-FIND to realistic biological data.

Role of the regularisation hyperparameter
Recall that in Algorithm 1, a free parameter λ controls the level of penalty incurred by choosing large coefficients in the solution of Equation (7). It is well known that the choice of regularisation parameter is nontrivial because it modulates the amount of sparsity that is enforced on the estimated coefficients. The issue of how to choose the optimal value of this hyperparameter in the context of ABMs was addressed recently by Nardini et al. [8], where cross-validation is discussed, amongst other options. As a test case to investigate the effects of algorithm hyperparameters on the uncertainty of learned coefficients, we perform cross-validation on the dataset D^1_I and then apply PDE-FIND using the optimal value of λ found. To do this, we apply the grid search implementation of cross-validation suggested in [8], as detailed in Supplementary Information Section S4, to arrive at an optimal value of λ = 0.5. We note here that this value is problem-dependent, and whenever a new dataset is being investigated a different value of λ will generally be appropriate.
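A minimal version of such a grid search might look as follows; plain ridge regression is used as the fitting step for brevity, and the single random train/validation split is illustrative rather than the scheme of [8].

```python
import numpy as np

def grid_search_lambda(Theta, Ut, lambdas, rng, frac_train=0.8):
    """Hypothetical grid-search cross-validation for the regularisation
    parameter: for each candidate lambda, fit ridge regression on a random
    training split and score the squared residual on the held-out rows."""
    n = Theta.shape[0]
    idx = rng.permutation(n)
    n_tr = int(frac_train * n)
    tr, va = idx[:n_tr], idx[n_tr:]
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        A = Theta[tr]
        # ridge fit on the training rows only
        w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Ut[tr])
        err = np.mean((Theta[va] @ w - Ut[va]) ** 2)   # validation loss
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```

A fuller implementation would average over several folds and combine the selected λ with the thresholding step of STRidge.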
While the results presented in Supplementary Information Section S4 show that cross-validation improves the performance of PDE-FIND dramatically, as the number of misspecified coefficients decreases sharply when the regularisation parameter is optimised, cross-validation alone does not manage the uncertainty associated with variability in the predicted coefficients. Figure 6 shows that, even with the optimal value of the regularisation coefficient, there is still considerable uncertainty in the coefficients, as the support of the histogram is large. While the atom at zero has nearly vanished, uncertainty quantification is still necessary because the empirical distribution still indicates a large degree of variability. Moreover, Figure 6 shows that at the optimal value of the tuning parameter, λ, the coefficients c_uxx and c_u·uxx still have a nontrivial joint distribution, implying that even with an optimal choice of the regularisation parameter, Bayesian methods are needed to analyse the joint behaviour of these two coefficients.

Comparison of model predictions
We now provide an explanation for the poor performance of PDE-FIND on the ABM data. The PDE-FIND algorithm solves a sparse regression problem to fit linear combinations of spatial derivatives to the time derivative. When the data contain little-to-no noise, the temporal and spatial derivatives can be accurately estimated, and so the relationship between spatial and temporal derivatives can be inferred from observed data. In this context, comparing model predictions by their performance with respect to the L2-loss in the learned temporal derivative retrieves the ground truth. However, when the data are noisy, a number of different linear combinations of the spatial derivatives can result in an L2-loss comparable to (or better than) those of the ground truth PDE (Figure 8). In Supplementary Information Section S5 (Figures S11-S13), we demonstrate this by selecting, for each dataset, two instances where the learned equations contain different terms to the coarse-grained PDE, yet in both cases the temporal derivatives reproduce the observed temporal derivative qualitatively. However, there is no guarantee that such a match in the temporal derivative is sufficient to yield solutions that resemble the observed data when the PDE is numerically evaluated. We illustrate this in Figure 7 where, for each of Cases I, II, and III, we select two sets of coefficients learned by PDE-FIND: one where the solution of the corresponding PDE resembles a typical data trace, and one where the solution of the corresponding PDE bears little resemblance to typical observed data traces. In summary, this striking difference in predictive capabilities of the learned PDEs reveals that coefficients that optimise Equation (11) do not necessarily perform well in terms of their ability to predict the evolution of the spatio-temporal density profiles. We illustrate this in further detail in Figure 8.
We take the dataset $D_I^1$, which consists of unbiased motility and no proliferation, where each sample in the dataset consists of an average over $K = 1$ simulations from the ABM. First, we average over all $N_s = 1000$ samples in the dataset to obtain the density profile $C_i(t)$ as in Equation (2). The two coefficients consistently identified for dataset $D_I^1$ are $c_{u_{xx}}$ and $c_{u \cdot u_{xx}}$, which gives the PDE
$$ \frac{\partial u}{\partial t} = c_{u_{xx}} u_{xx} + c_{u \cdot u_{xx}}\, u u_{xx}. $$
We integrate this PDE numerically over a grid of values of $c_{u_{xx}}$ and $c_{u \cdot u_{xx}}$, and then evaluate the $L_2$-loss between the time derivative of the PDE model and that of the averaged ABM data (Figure 8A), as well as the difference between the density predicted by the PDE model and that of the averaged ABM data (Figure 8B). We estimate the sum of the $L_2$-loss between the PDE and ABM data at five time points as
$$ \sum_{j=1}^{5} \left\| D_{obs}^{50j} - D_{sim}^{50j} \right\|_2, $$
where, for example, $D_{obs}^{50j} = [C_1(50j), \ldots, C_I(50j)]^T$ when comparing density profiles, and $D_{obs}^{50j} = [C_{1t}(50j), \ldots, C_{It}(50j)]^T$ when comparing time derivatives, with $C_{it}(50j) = \left( C_i(50j) - C_i(50j + \tau) \right)/\tau$ for $j = 1, \ldots, 5$, where $\tau$ is the time step of the ABM simulation algorithm. The blue shading in Figure 8 shows the $L_2$-loss in each case, and we also plot the PDE-FIND-estimated coefficients on the same axes for each of the $N_s = 1000$ samples of the dataset. Moving away from the minimum of the loss landscape (Figure 8), we see increasing errors between the density profile predicted by solution of the PDEs and the averaged ABM data. Figure S14 of the Supplementary Information demonstrates that this issue is further compounded as the noise in the data increases.
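As an illustration of this loss-landscape computation, the following sketch generates synthetic "observed" data from a simple explicit finite-difference solver and evaluates the density-profile $L_2$-loss over a grid of candidate coefficients. The solver, grid, initial condition and coefficient values are all hypothetical stand-ins, not the paper's actual set-up:

```python
import numpy as np

def solve_pde(c1, c2, u0, dx, dt, n_steps):
    """Explicit finite-difference solver for u_t = c1*u_xx + c2*u*u_xx
    with zero-flux boundaries (a stand-in for the solver used in the paper)."""
    u = u0.copy()
    for _ in range(n_steps):
        uxx = np.empty_like(u)
        uxx[1:-1] = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2
        uxx[0] = 2 * (u[1] - u[0]) / dx**2      # zero-flux boundary
        uxx[-1] = 2 * (u[-2] - u[-1]) / dx**2
        u = u + dt * (c1 + c2 * u) * uxx
    return u

# Hypothetical stand-in for the averaged ABM density: data generated from
# the PDE itself with known coefficients (c1, c2) = (0.25, 0.0).
dx, dt, n_steps = 1.0, 0.1, 500
x = np.arange(0.0, 50.0, dx)
u0 = np.exp(-0.05 * (x - 25.0) ** 2)
data = solve_pde(0.25, 0.0, u0, dx, dt, n_steps)

# Evaluate the density-profile L2-loss over a grid of candidate coefficients.
c1_grid = np.linspace(0.1, 0.4, 7)
c2_grid = np.linspace(-0.05, 0.05, 5)
loss = np.array([[np.linalg.norm(solve_pde(c1, c2, u0, dx, dt, n_steps) - data)
                  for c2 in c2_grid] for c1 in c1_grid])
i, j = np.unravel_index(np.argmin(loss), loss.shape)
print(c1_grid[i], c2_grid[j])  # the minimum sits at the generating coefficients
```

Because the synthetic data here are noise-free and generated at a grid point, the minimum recovers the generating coefficients exactly; with noisy data the low-loss region broadens, as in Figure 8.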
In summary, in this section we have shown that the density profile loss landscape is much more informative about the underlying PDE model than the derivative loss landscape. We exploit this observation in the following section, where we propose a method, which we term "Bayes-PDE-FIND", that both tackles the variability in the predictions of PDE-FIND for noisy data and quantifies the uncertainty in those predictions.

Bayes-PDE-FIND
In this section, we propose an approach that harnesses the likelihood-free method of approximate Bayesian computation to quantify the uncertainty in estimates provided by PDE-FIND. In brief, our method involves the application of PDE-FIND to multiple datasets in order to define a prior distribution for the coefficients of the PDE library terms, $N_i$ for $i = 1, \ldots, N$, followed by application of Bayesian approaches for estimation of the posterior parameter distribution.

Approximate Bayesian computation
The goal of Bayesian parameter estimation is to update prior beliefs about model parameters $\theta$ encoded in a prior distribution $\pi(\theta)$. In this context, $\theta$ constitutes the coefficients, $\xi_i$, of the library terms $N_i$ for $i = 1, \ldots, N$. The updating process is dependent on observations $D_{obs}$, which in this work are the noisy, averaged data from the ABM, as detailed in Section 2.2.1.
The mathematical model is the PDE defined in Equation (7), which then defines a likelihood $P(D_{obs} \mid \theta)$. In Bayesian statistics, the likelihood is combined with the prior distribution to give the posterior distribution:
$$ \pi(\theta \mid D_{obs}) = \frac{P(D_{obs} \mid \theta)\, \pi(\theta)}{P(D_{obs})}. $$
Such posterior distributions provide information as to the uncertainty in parameter estimates that are learned from observed data, and also allow practitioners to understand the range of model predictions that are consistent with the data. To define a likelihood in this setting, one would need to first assume that the mean of the ABM data is exactly given by the PDE solution, and then prescribe the distribution of ABM outputs around the PDE model mean. However, for a general ABM, the distribution of the deviation from the mean is unknown. In some cases, one might choose to make a simplifying assumption, such as a Gaussian approximation; however, in the small data limit considered in EQL applications, such an assumption is unreasonable. The extent to which individual realisations vary from their mean depends on the details of the ABM and is, for the purposes of inference, unknown. As we prefer to avoid placing unnecessary assumptions on the process, we opt instead for a likelihood-free approach. We provide further mathematical insight and justification for avoiding likelihood-based methods in Supplementary Information Section S7.
Approximate Bayesian computation is a popular likelihood-free tool to estimate the posterior parameter distribution [36,37]. It approximates the likelihood, $P(D_{obs} \mid \theta)$, using repeated simulation of the model, and acceptance of the parameter $\theta$ requires that the output of the model, $D_{sim}(\theta)$, is in some sense close enough to the data, $D_{obs}$. The ABC posterior can be written
$$ \pi_\varepsilon(\theta \mid D_{obs}) \propto \pi(\theta)\, P\!\left( d(D_{obs}, D_{sim}(\theta)) < \varepsilon \right), $$
where $d$ is a distance function that quantifies the difference between the data, $D_{obs}$, and model output, $D_{sim}$, and $\varepsilon > 0$ is a tolerance. In this work, our aim is to use ABC to estimate the posterior distribution of the PDE model coefficients, $\xi_i$ for $i = 1, \ldots, N$, for a given dataset.
Despite the apparent simplicity of the ABC method, unless an informative prior is used to constrain the space of possible parameters and informative summary statistics can be found, the application of ABC methods to models with high-dimensional parameter and output spaces is generally computationally prohibitive. In particular, this high computational cost means that the direct application of ABC methods to estimate the coefficients $\xi_i$, for $i = 1, \ldots, N$, in Equation (7) is essentially infeasible unless it is possible to construct an informed prior distribution. Here, we propose a method that uses the predictions of PDE-FIND to construct an informed prior so that ABC can then be used to estimate the library coefficients $\xi_i$.

Using PDE-FIND to define a prior distribution for ABC
Assume that for each sample, $s$, in the observed dataset (which contains $N_s$ averaged density profile samples), we have used PDE-FIND (as defined in Algorithm 1) to produce an estimate, $\hat{\xi}^s$, of the parameters $\xi$. Recall that for each term $i$ in the library, we have defined the identification ratio, $a_i$, as
$$ a_i = \frac{1}{N_s} \,\#\{ s : \hat{\xi}_i^s \neq 0 \}, $$
to quantify how often the term $N_i$ from Equation (7) is included in the PDE-FIND predictions.
To make progress in specifying a prior distribution for the library coefficients, $\xi_i$, we first threshold, using a parameter $0 < \delta < 1$, so that we can define $A = \{i : a_i > \delta\}$ as the set of coefficients that are identified by PDE-FIND in more than a fraction $\delta$ of the $N_s$ samples of the dataset. We then eliminate from the library all terms for which $a_i \le \delta$, i.e. we set the marginal prior for coefficient $\xi_i$ to be $\pi_i \equiv 0$ for $i \notin A$. This achieves a first step of coefficient selection by eliminating variables for which the initial PDE-FIND screen indicates low confidence. On the other hand, for $i \in A$, there is still a need to investigate which coefficients to include in the final model, and a prior must be carefully chosen to explore which parameters to include and which to eliminate.
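The screening and thresholding step can be sketched as follows. The PDE-FIND output is mocked here with a hypothetical coefficient matrix in which one library term is always identified, one is identified intermittently, and one only spuriously:

```python
import numpy as np

# Hypothetical PDE-FIND output: one estimated coefficient vector per sample;
# rows = samples, columns = library terms (N = 4 terms here for brevity).
rng = np.random.default_rng(0)
n_samples, n_terms = 1000, 4
xi_hat = np.zeros((n_samples, n_terms))
xi_hat[:, 1] = rng.normal(0.25, 0.05, n_samples)       # always identified
mask = rng.random(n_samples) < 0.6                     # intermittently identified
xi_hat[mask, 2] = rng.normal(-0.1, 0.02, mask.sum())
xi_hat[rng.random(n_samples) < 0.02, 3] = 0.3          # spurious term

# Identification ratio a_i: fraction of samples in which term i is nonzero.
a = (xi_hat != 0).mean(axis=0)

# Threshold with delta to obtain the retained set A = {i : a_i > delta}.
delta = 0.1
A = np.flatnonzero(a > delta)
print(a.round(3), A)  # terms 1 and 2 are retained; 0 and 3 are eliminated
```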
Spike-and-slab models are powerful tools to perform variable selection in regression problems [38,39,40,41]. The main idea of a spike-and-slab type prior is that it defines a two-component mixture distribution for each coefficient, with coefficients mutually independent. The mixture is made up of a flat distribution with large support (the slab) and a degenerate distribution at zero (the spike). In early formulations, the slab was modelled as a uniform distribution over some region of parameter space [40,41], whereas in more recent work, inference is performed on hyperparameters of the marginal distributions [38,39]. Samples of the hyperparameters yielding a high variance will lead to sampling parameters far away from zero, whereas samples of the hyperparameters yielding a low variance will sample close to zero. In this way, the aim is to explore parameter space by iteratively sampling over the hyperparameters and the values for the coefficients using Gibbs sampling. In this work, we wish to exploit the simplicity of the earliest spike-and-slab models, which use a Dirac measure at zero to enforce sparsity, whilst utilising as much information as possible from the PDE-FIND screen in defining the prior distribution without violating the likelihood principle.
We follow the hierarchical Bayesian group LASSO model with an independent spike-and-slab type prior for each coefficient [39]. We set the group size in the model of Xu et al. [39] equal to one, so that the prior $\pi_i(\xi_i)$ for each coefficient $\xi_i$ is given by
$$ \pi_i(\xi_i) = (1 - a_i)\, \delta_0(\xi_i) + a_i\, \mathcal{N}(\xi_i; \mu_i, \sigma_i^2), \qquad \sigma_i^2 \sim \mathrm{IG}(\alpha_i, \beta_i), $$
where IG is the inverse-gamma distribution with parameters $\alpha_i, \beta_i$ that define the shape of the prior on $\sigma_i^2$. This is the standard choice for modelling the distribution of the hypervariances. In the approach of Xu et al. [39], $\mu_i = 0$. In this work, we can use knowledge of the coefficients gained through our initial PDE-FIND screen to inform the $\mu_i$. To do so, we randomly divide the ABM data in half and use one subset in the PDE-FIND screen to inform the prior (the exploration subset), and the other subset to perform inference (the inference subset). We first set $a_i$ equal to the $i$-th identification ratio and $\mu_i$ equal to the $i$-th sample mean of the PDE-FIND coefficients trained on the exploration subset. Since the hyperprior for the variances $\sigma_i^2$ allows for large values of $\sigma_i^2$, this prior is not overly restrictive, as values far away from the sample mean can still be sampled. Second, we tune $\alpha_i, \beta_i$ using the exploration subset so that the variance is on average of the same order of magnitude as the PDE-FIND coefficients. This is crucial: a large (small) variance in parameters that are typically small (large) will fail to sample from the relevant regions of parameter space. The exact values of $\mu_i, \alpha_i, \beta_i$ are given in Supplementary Information Section S9.
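A minimal sketch of sampling from one such marginal prior, assuming (as described above) a point mass at zero with weight $1 - a_i$ mixed with a Gaussian slab whose variance is drawn from an inverse-gamma hyperprior; all numerical values here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_spike_slab(a_i, mu_i, alpha_i, beta_i, size):
    """Draw from pi_i = (1 - a_i) * delta_0 + a_i * N(mu_i, sigma_i^2),
    with the hypervariance sigma_i^2 ~ Inverse-Gamma(alpha_i, beta_i)."""
    # Inverse-gamma draws via the reciprocal of Gamma(alpha, rate=beta) draws.
    sigma2 = 1.0 / rng.gamma(shape=alpha_i, scale=1.0 / beta_i, size=size)
    slab = rng.normal(mu_i, np.sqrt(sigma2))
    spike = rng.random(size) < (1.0 - a_i)   # with prob 1 - a_i: exactly zero
    return np.where(spike, 0.0, slab)

draws = sample_spike_slab(a_i=0.6, mu_i=0.25, alpha_i=3.0, beta_i=0.02,
                          size=100_000)
print((draws == 0).mean())        # ~0.4, the prior spike mass
print(draws[draws != 0].mean())   # ~0.25, the slab mean
```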
These considerations now imply that the prior for $\xi$ is given by
$$ \pi(\xi) = \prod_{i \in A} \pi_i(\xi_i), $$
with $\xi_i$ fixed at zero for $i \notin A$. The key advantage in specifying this prior distribution is that we are now only required to perform Bayesian inference for a reduced model that has a much lower-dimensional parameter space (of dimension $|A| \ll N$) compared to the original model (with parameter space of dimension $N$). This is because PDE-FIND promotes sparsity, and so the majority of coefficients will have identification ratio, $a_i$, close to zero. Such terms can be confidently eliminated from the model and so not considered in the ensuing parameter estimation process. On the other hand, if $a_i \approx 1$, PDE-FIND has consistently identified the $i$-th term as relevant for the dynamics.
As such, we can be confident that $N_i$ should be included in the model; however, uncertainty in estimates of $\xi_i$ must still be quantified. Finally, if $a_i$ is close to neither zero nor one, meaning that PDE-FIND has included the $i$-th term in the learned model for a non-trivial fraction of the samples in the dataset, Bayesian approaches can be used to investigate the joint posterior distribution of the $i$-th coefficient, $\xi_i$, with the rest of the model terms by considering the performance of models that both include and exclude the $i$-th term.

The Bayes-PDE-FIND algorithm
We now outline the Bayes-PDE-FIND approach. In essence, we apply the PDE-FIND algorithm to each sample $s = 1, \ldots, N_s$ of the dataset under consideration, and use the results to formulate a prior distribution for ABC as described in Section 4.2 and, in particular, Equations 4.2-4.3. We then apply ABC to estimate the posterior parameter distribution, noting that the computational cost of ABC is much reduced through the use of PDE-FIND to generate an informed prior distribution. In effect, we use PDE-FIND to reduce the target PDE in Equation (7) to
$$ \frac{\partial u}{\partial t} = \sum_{i \in A} \xi_i N_i, $$
with a prior distribution over the $\xi_i$, for $i \in A$, given by Equation (22).
Importantly, when we apply ABC to estimate the posterior parameter distribution, we use a low-noise dataset, $C_i^{LN}(t)$ for $1 \le i \le I$, created by averaging over the original $N_s$ samples of the dataset, as the observed data, $D_{obs}$. Recalling that each sample $s$ consists of column-averaged data, $C_i^s(t)$, from $K$ simulations of the ABM, we have, for $1 \le i \le I$,
$$ C_i^{LN}(t) = \frac{1}{N_s} \sum_{s=1}^{N_s} C_i^s(t). $$
For the distance function, $d$, we again use an averaged estimate of the $L_2$-difference between the ABM data and the PDE solution, here defined as
$$ d(D_{obs}, D_{sim}(\theta)) = \sum_{j=1}^{5} \left\| D_{obs}^{50j} - D_{sim}(\theta)^{50j} \right\|_2, $$
where, for $j = 1, \ldots, 5$, $D_{sim}(\theta)^{50j} = [u(\Delta, 50j; \theta), u(2\Delta, 50j; \theta), \ldots, u(200\Delta, 50j; \theta)]^T$, and $u(x, t; \theta)$ is the solution to Equation (23).

Results
The aim of this section is to showcase how the Bayes-PDE-FIND algorithm can be used to significantly improve the quality of the learned PDE model. Recall that the aim is to reduce the uncertainty surrounding which coefficients to include in the model, to reduce uncertainty in the estimated model parameters, and to improve the posterior predictive capability of the model by finding a posterior parameter distribution that takes into account properties of the observed density profiles. For each of the noisy-data test cases, that is, datasets $D_I^1$, $D_{II}^1$ and $D_{III}^1$, we apply Algorithm 2, using the Pakman package [42] for the ABC step.
Recall that for Case I, where the ABM contains only unbiased motility, the only two coefficients regularly identified using PDE-FIND are $c_{u_{xx}}$ and $c_{u \cdot u_{xx}}$ (see Table 4). This means that all library terms except for $u_{xx}$ and $uu_{xx}$ can be confidently excluded, and for the ABC process we consider the PDE model
$$ \frac{\partial u}{\partial t} = c_{u_{xx}} u_{xx} + c_{u \cdot u_{xx}}\, u u_{xx}, $$
and aim to infer the $(c_{u_{xx}}, c_{u \cdot u_{xx}})$ posterior parameter distribution. For Case II, which contains biased motility and no proliferation, the only coefficients regularly identified using PDE-FIND are $c_{u_x}$ and $c_{u_{xx}}$ (see Table 4), and so for the ABC process we consider the PDE model
$$ \frac{\partial u}{\partial t} = c_{u_x} u_x + c_{u_{xx}} u_{xx}, $$
and aim to infer the $(c_{u_{xx}}, c_{u_x})$ posterior parameter distribution. Note that PDE-FIND failed to identify one relevant model term in Case II, namely $uu_x$, most likely due to the significant noise in the samples of the dataset and/or the timescale over which the data are collected. For Case III, which contains both unbiased motility and proliferation, the only coefficients that are regularly identified are $c_u$, $c_{u^2}$ and $c_{u_{xx}}$ (see Table 4), and so for the ABC process we consider the PDE model
$$ \frac{\partial u}{\partial t} = c_u u + c_{u^2} u^2 + c_{u_{xx}} u_{xx}, $$
and aim to infer the $(c_u, c_{u^2}, c_{u_{xx}})$ posterior parameter distribution.
We explore the results of using Bayes-PDE-FIND with a form of ABC known as ABC-rejection sampling. At each step, $\theta^*$ is sampled from the prior distribution $\pi(\theta)$, and the PDE given in Equation (7) is integrated in time using parameters $\theta^*$ to yield simulated data $D_{sim}(\theta^*)$. The simulated data, $D_{sim}(\theta^*)$, are compared to the observed data, $D_{obs}$, according to a distance function $d(D_{obs}, D_{sim}(\theta^*))$ provided by the practitioner. Given a tolerance $\varepsilon > 0$, the sampled $\theta^*$ is accepted into the posterior distribution whenever $d(D_{obs}, D_{sim}(\theta^*)) < \varepsilon$. We provide full details of the ABC-rejection algorithm used in this work in the Supplementary Information.
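The ABC-rejection loop just described can be sketched as follows. To keep the example self-contained and fast, the PDE integration step is replaced by a hypothetical one-parameter toy model; in Bayes-PDE-FIND the `simulate` call would be a numerical PDE solve and the prior would be the spike-and-slab prior described above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in for "integrate the PDE": the model maps a parameter theta to
# a decaying profile; the observed data are generated at theta_true = 0.5.
x = np.linspace(0.0, 1.0, 50)
def simulate(theta):
    return np.exp(-theta * x)

theta_true = 0.5
d_obs = simulate(theta_true)

def abc_rejection(prior_sample, distance, eps, n_accept):
    """Plain ABC rejection: sample theta from the prior, simulate,
    accept whenever d(D_obs, D_sim(theta)) < eps."""
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_sample()
        if distance(d_obs, simulate(theta)) < eps:
            accepted.append(theta)
    return np.array(accepted)

posterior = abc_rejection(
    prior_sample=lambda: rng.uniform(0.0, 2.0),
    distance=lambda a, b: np.linalg.norm(a - b),
    eps=0.1,
    n_accept=300,
)
print(posterior.mean())  # concentrates near theta_true = 0.5
```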

Case I
In the case of unbiased motility only (Case I, dataset $D_I^1$, Figure 9), we see that use of the spike-and-slab prior distribution (Figure 9A) results in a posterior distribution with the true parameter value contained in the support of the posterior. Moreover, the sparsity enforced by the prior ensures that the correct PDE structure, with only $u_{xx}$ included, is selected. In contrast, with a uniform prior on $(c_{u_{xx}}, c_{u \cdot u_{xx}})$ (Figure 9B), although the true parameter value is contained in the support of the posterior distribution, the correct PDE structure is generally not established, with both $u_{xx}$ and $uu_{xx}$ terms contained in the PDE model.

Case II
In the case of biased motility (Case II, dataset $D_{II}^1$, Figure 10), both $c_{u_x}$ and $c_{u_{xx}}$ are non-zero for all parameter values accepted into the spike-and-slab posterior distribution. This is a striking result given an identification ratio of just 0.012 for $c_{u_{xx}}$ after the application of PDE-FIND. We remark that the posteriors in Figure 10 show that, in the presence of model misspecification (recall that the term $uu_x$ was not identified by PDE-FIND), the posterior distribution may be biased. In this case, the bias entails that the true parameter values are not contained in the support of the posterior distribution.

Case III
In the case of unbiased motility and proliferation (Case III, dataset $D_{III}^1$, Figure 11), the posterior obtained using the PDE-FIND prior still contains some accepted parameter samples with either $c_u$ or $c_{u^2}$ equal to zero, demonstrating that there is potential non-identifiability of these terms given the data. However, all parameter samples in the posterior have non-zero $c_{u_{xx}}$, a significant result given an identification ratio of 0.57 for $c_{u_{xx}}$. We note that the support of the spike-and-slab posterior contains the true parameter value for $c_{u_{xx}}$, even though it is not in the support of the empirical distribution of the PDE-FIND coefficients from the exploration subset.

Computational performance
To highlight the performance of our method, we compare the computational cost and the accuracy of our method against alternative options. In addition to the spike-and-slab and uniform priors used to generate the posterior distributions (which we call informed spike-and-slab and sparse uniform, respectively), we also consider a spike-and-slab prior with mean 0 in the slab for each coefficient (i.e. the PDE-FIND screen only informs the prior through the identification ratios), which we call naive spike-and-slab, as well as a uniform prior on all library coefficients (i.e. the classic Bayesian scenario, where no variable selection is performed), which we refer to as classic Bayesian. In each of the experiments, we perform ABC rejection to sample 300 parameters from the ABC posterior with the same thresholds as used in Cases I-III and record the time taken to complete. Inference is done on a Lenovo desktop computer using 6 Intel(R) i5-8500T cores with clock speed 2.10 GHz. The computational cost of the informed spike-and-slab approach is significantly lower than that of any of the uniform prior models, and using an informed spike-and-slab prior offers a substantial speed-up in computational time compared to using a naive spike-and-slab approach. The approach using a pre-screened uniform implementation failed to yield a sparse set of coefficients, thus showing unacceptable accuracy in learning the correct equations. The classic Bayesian analysis did not finish sampling within $1.5 \times 10^6$ s (approximately two weeks). By calculating the dimensionality of the space, we estimate that the acceptance probability should be expected to be of order $10^{-2}\%$, which confirms that performing a classical Bayesian analysis in such a case is inappropriate. We highlight that many applications of EQL methods will have even larger libraries, making the computational time of naive uniform priors exponentially longer.
The posteriors from both naive and informed spike-and-slab models are qualitatively similar across all cases and both identify the correct regions of parameter space.
In summary, Figures 9-11 highlight that the use of PDE-FIND in combination with ABC rejection can significantly reduce the uncertainty associated with the PDE coefficients. In Cases I and II, uncertainty regarding which parameters to include in the model is completely removed, as the posterior has support only on the diffusion parameter axis in Case I, and on a region where both $c_{u_x}$ and $c_{u_{xx}}$ are non-zero in Case II. In contrast, a uniform prior over all coefficients that have sufficiently large identification ratio (those for which $a_i > \delta$) does not enforce sparsity, which means that the resulting PDE models can be misspecified and/or contain greater complexity than is necessary to accurately predict the data. To further assess the quality of the resulting posterior parameter distributions, we carried out a posterior predictive check (Figure 12). For each parameter sample accepted into the posterior distribution, we used numerical integration, as detailed in the Methods section, to obtain a prediction for the density at $t = 250$, to assess how well the model interpolates the data, and a prediction for the density at $t = 1000$, which is $t = 750$ beyond the time horizon used to train PDE-FIND on the inference subset.
We then plot the 5% and 95% quantiles of the output distributions and overlay them with a representative sample from each of the datasets for Cases I, II and III. The results shown in Figure 12 demonstrate that the PDE model predictions can both interpolate and extrapolate the data well. We conclude that, even in the presence of model misspecification, such as in Case II, it is possible to obtain a posterior with reasonable predictive power, although as predictions are extended beyond the time horizon of the training data, the misspecification becomes apparent as a systematic prediction error. To highlight the increased accuracy, we compare to the results shown in Figure 7, where the integrated model solutions fail to resemble the empirical data when PDE-FIND is used in isolation.

Discussion and outlook
The aim of this work was to develop and showcase a framework to perform uncertainty quantification for equation learning methods in the context of noisy spatio-temporal data. In essence, our approach harnesses equation learning methodologies to generate an informed prior distribution that is then used within a Bayesian framework to estimate both the model structure and the model parameters. The motivation for developing such a framework stems from the fact that equation learning methodologies such as PDE-FIND typically make variable predictions, both in terms of model structure and parameters, in the presence of noise, which is common for datasets in the life and biomedical sciences. Incorporating uncertainty quantification through the use of Bayesian statistical approaches provides a means to quantify the uncertainty in both model structure and parameter values, and to understand how such uncertainty propagates into model predictions.
We showcased our methodology in the context of noisy spatio-temporal data generated using a canonical ABM that has seen widespread use in the modelling of motile and proliferative cell populations, and which has a corresponding PDE model that can make accurate predictions of averaged ABM output. We used datasets generated in three different parameter regimes (that incorporate different cellular behaviours) to show how to combine the advantages of the PDE-FIND algorithm (efficiency and the ability to learn simple, interpretable models) with those of ABC (ability to quantify uncertainty in parameter estimates and model predictions).
There are a number of ways in which our approach can be further improved going forward.
For example, we saw that our approach may return models that fail to respect important physical constraints, such as conservation of mass. Many PDE models in this context can be written in the conservation form
$$ \frac{\partial u}{\partial t} = -\frac{\partial F}{\partial x} + S, $$
where $F$ is the flux and $S$ is the net proliferation rate; models learned outside this form may make unphysical predictions (as occurs in Case II). A possible solution to this specific problem could be to encode the terms in the candidate library in flux form. More generally, however, it is not obvious how to balance the wish to include constraints in specifying terms in the candidate library whilst at the same time avoiding over-constraining the space of possible output PDE models. Bayesian approaches may prove useful in this respect. In the case of severe model mis-specification due to incompleteness of the supplied library, Bayes-PDE-FIND offers several possibilities. In some scenarios, the learned PDE will interpolate the data well and extrapolate to new settings, even though it contains library terms that are different from the ground truth. Such a PDE can still be used for the purposes of simulation and inference, since it is fast to solve, which is often crucial when performing inference. When the learned PDE performs poorly in interpolating or extrapolating, Bayes-PDE-FIND returns a quantification of the error between model solutions and observed data. This may offer real-world insights: the library terms are usually provided by practitioners to reflect hypothesised mechanisms in the system under consideration. That those terms fail to explain the data provides a motivation to reconsider which mechanisms should form part of the model.

Data access: All code to generate synthetic data, as well as code used to analyse the data, is available on GitHub at https://github.com/simonmape/UQ-for-pdefind.

Numerical solution of the PDE models is carried out using [31], which solves the PDE using the method of lines, where space is discretised using the grid on which spatial data for the ABM are collected. The resulting ODEs are solved using a Runge-Kutta solver.
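A minimal method-of-lines sketch in the spirit of the numerical scheme described above, for the linear diffusion model $u_t = D u_{xx}$; the diffusivity, grid and boundary conditions are hypothetical (periodic boundaries are used for simplicity):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical diffusivity, spatial grid and initial condition; the grid
# mirrors the idea of discretising on the ABM measurement grid.
D, dx = 0.25, 1.0
x = np.arange(0.0, 50.0, dx)
u0 = np.exp(-0.05 * (x - 25.0) ** 2)

def rhs(t, u):
    # Second-order central difference with periodic boundaries; the discrete
    # Laplacian sums to zero exactly, so total mass is conserved.
    return D * (np.roll(u, -1) - 2.0 * u + np.roll(u, 1)) / dx**2

# Runge-Kutta integration of the resulting system of ODEs.
sol = solve_ivp(rhs, (0.0, 50.0), u0, method="RK45", t_eval=[0.0, 50.0])
print(sol.success, float(sol.y[:, -1].sum() * dx))  # mass is conserved
```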
For each of Cases I, II, and III, we average over $K = 1000$ realisations of the ABM. We numerically integrate the corresponding coarse-grained PDEs to find solutions at times $t = 0, 50, 150, 500$ and plot the observed averaged densities on the same axes in Figure S1; see also Table S1.

S2.2 Empirical distribution of all learned PDE-FIND parameters
In this section we report the pairwise distributions of all learned PDE-FIND parameters in Case I, Case II and Case III, both in the low-noise and high-noise regimes. For convenience, the terms in the library are ordered as: $1, u, u^2, u_x, uu_x, u^2u_x, u_{xx}, uu_{xx}$ and $u^2u_{xx}$.
Figure S2: Pairwise correlation plots for coefficients estimated using $D_I^1$.
Figure S3: Pairwise correlation plots for coefficients estimated using $D_I^2$.
Figure S4: Pairwise correlation plots for coefficients estimated using $D_{II}^1$.
Figure S5: Pairwise correlation plots for coefficients estimated using $D_{II}^2$.
Figure S6: Pairwise correlation plots for coefficients estimated using $D_{III}^1$.
Figure S7: Pairwise correlation plots for coefficients estimated using $D_{III}^2$.

S3.1 Rescaled Robust PCA
Li et al. [22] propose an approach to perform PDE-FIND on noisy data. They denote the data matrix $U \in \mathbb{C}^{n \times m}$, given on a discretised domain $x \in [a, b]$ and $t \in [0, T]$. That is, $U$ contains the discretely and noisily sampled measurements of the function $u(x, t)$ that satisfies the target PDE. The authors observe that the underlying data are often low-rank, and so they suppose that $U$ and its time derivative $U_t$ can be decomposed as
$$ U = Z + E_1, \qquad U_t = D(Z, Q)\xi + E_2, $$
where $Z$ and $D(Z, Q)$ have low rank and $E_1, E_2$ are sparse. Informally, $Z$ represents the clean data, $D(Z, Q)$ the clean derivative matrix, and $E_1, E_2$ the perturbations of the clean data and derivatives, respectively. The method developed in [22] aims to find optimal $Z$ and $D(Z, Q)$ from the data matrix $U$. This problem is addressed by solving an optimisation problem that minimises low-rank penalties on $Z$ and $D(Z, Q)$ together with sparsity penalties on $E_1$ and $E_2$ and a sparse regularisation $R(\xi)$ of the parameters $\xi$, weighted by positive constants $\lambda_i$, for $i = 1, 2, 3$, subject to $U = Z + E_1$ and $U_t = D(Z, Q)\xi + E_2$. For details on solving this nonconvex optimisation problem, we refer to [22]. However, the essence of the technique is that the original data are first processed with an algorithm called Robust Principal Component Analysis (rPCA) to obtain $Z$, after which Low-rank STRidge (LrSTR) is used to construct $D(Z, Q)$ from $Z$, and $U_t$ is denoised prior to applying PDE-FIND. The final algorithm, which uses rPCA and LrSTR in tandem, is called Double Low-rank Sparse Regression (DLrSR). While the examples given by Li et al. in [22] show that the method is capable of retrieving the true parameters of the PDEs with remarkable accuracy, the choice of regularisation parameters is non-trivial and makes implementation challenging, since the many degrees of freedom complicate choosing a robust parameter set. As shown in Algorithm 1 of [22], $\lambda_2$ corresponds to the penalty in rPCA; one must choose $\lambda_2$ carefully, since different values of this parameter can yield unfavourable predictions when used in tandem with PDE-FIND.
We start by plotting the output of empirical data processed by rPCA at two different values of $\lambda_2$ (Figure S8). Focusing on intermediate values of $\lambda_2$, we plot the rescaled rPCA output compared to the original solutions in Figure S9. Based on the improved results in these plots, we use the rescaled output of the DLrSR algorithm of [22] on $D_I^1$ going forward, choosing for rPCA the value $\lambda_2 = 0.035$. Table S1 contains the values of the identification coefficients, $a_i$, found with the rPCA method for Case I, unbiased motility only, using the dataset $D_I^1$. It can be seen that no terms in the PDE are conclusively identified, as all the $a_i$ are either small or in the intermediate range. Most saliently, rPCA identifies the diffusion coefficient in only 38% of samples in the dataset. Comparing the rPCA-generated identification coefficients with the corresponding values generated using naive STRidge reveals that the rPCA method has a much higher variability, in the sense of variation in the terms that are included and excluded in the resulting PDE model. In addition, Figure S10 shows that the support of the PDE-FIND predictions for the diffusion coefficient obtained using DLrSR on $D_I^1$ is similar in size to that obtained with naive STRidge. This means that rPCA yields estimates for the diffusion coefficient that are no less variable than the original PDE-FIND method. Finally, Figure S10 shows that with rPCA the coefficients are non-trivially correlated. This correlation plot also highlights once again the problem with taking point estimates of the coefficients. For Case I, the predicted coefficients for the terms $u_{xx}$ and $uu_{xx}$ are contained in a region of parameter space such that either one of the two terms is discovered, or both terms are picked up and are not small. This implies that any point estimate from a single data sample is unreliable, and careful consideration needs to be given to the correlations between terms.
In summary, we conclude that even when using a state-of-the-art method of denoising observations, it is important to statistically quantify uncertainty in the learned coefficients.

S3.2 Convolution with Gaussian kernel: an alternative approach to denoising
In many applications, noisy data can be successfully smoothed using a suitable kernel. We choose to smooth our data using a Gaussian kernel with standard deviation $\sigma = 2$, which is roughly equivalent to the scale used in the spatial discretisation. Summary values for the identification coefficients, $a_i$, are found in Table S1. While the convolution approach is highly successful in recovering one of the relevant terms in the dynamical system, for example the diffusion term in $D_I^1$, the bias term in $D_{II}^1$ and the proliferation term in $D_{III}^1$, the method fails to recover the diffusion term regularly in both $D_{II}^1$ and $D_{III}^1$. As a result, we conclude that smoothing using a Gaussian kernel does not result in marked improvements of the method.

Figure S10: Joint distribution of the $c_{u_{xx}}$ and $c_{u \cdot u_{xx}}$ coefficients using PDE-FIND on $D_I^1$ (unbiased motion with no proliferation), where each dot is the result from a single sample. The location of the true parameter value is indicated by a black star. A: naive application of STRidge on the ABM data. B: ABM data are pre-processed with rPCA as detailed in Section S3.1 before the application of PDE-FIND. C: ABM data are pre-processed by smoothing with a Gaussian kernel with $\sigma^2 = 2$ before the application of PDE-FIND.
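The kernel-smoothing pre-processing step can be sketched with `scipy.ndimage.gaussian_filter1d`, using $\sigma = 2$ as above; the noisy profile here is a hypothetical stand-in for a single ABM density sample:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(3)

# Hypothetical noisy density profile (stand-in for a single ABM sample).
x = np.arange(200)
clean = np.exp(-0.002 * (x - 100) ** 2)
noisy = clean + rng.normal(0.0, 0.1, x.size)

# Smooth with a Gaussian kernel of standard deviation sigma = 2,
# comparable to the spatial discretisation scale.
smoothed = gaussian_filter1d(noisy, sigma=2)

# Smoothing reduces the pointwise error relative to the clean profile.
print(np.abs(smoothed - clean).mean(), np.abs(noisy - clean).mean())
```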

S4 Tuning of the regularisation parameter
Following Nardini et al. [8], we implement cross-validation to find the optimal value of the regularisation parameter in the STRidge algorithm for dataset $D_I^1$. As training data, we randomly select ten samples from the ensemble. For each $\lambda \in \{0.05j : 0 \le j \le 50\}$, we perform leave-one-out cross-validation: we apply PDE-FIND to the left-out data sample and compute the $L_2$-loss of the integrated solution at $t = 250$ against the other nine randomly selected samples. This gives an optimal value of $\lambda = 0.5$. To illustrate the role of varying the tuning parameter, Table S2 shows the values of the identification coefficients, $a_i$, at different values of $\lambda$.
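The cross-validation loop can be sketched as follows. To keep the example self-contained, the STRidge fit is replaced by a plain ridge regression on a hypothetical synthetic library, but the structure (a grid of $\lambda$ values, ten samples, and a leave-one-out loss) mirrors the procedure above:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical synthetic data: ten (library, time-derivative) sample pairs,
# with only the seventh library column ("u_xx") active.
n_terms = 9
true_xi = np.zeros(n_terms)
true_xi[6] = 0.25
samples = []
for _ in range(10):
    Theta = rng.normal(size=(50, n_terms))
    u_t = Theta @ true_xi + rng.normal(0.0, 0.05, 50)
    samples.append((Theta, u_t))

def ridge_fit(Theta, u_t, lam):
    """Closed-form ridge estimate (stand-in for one STRidge call)."""
    return np.linalg.solve(Theta.T @ Theta + lam * np.eye(n_terms),
                           Theta.T @ u_t)

lam_grid = [0.05 * j for j in range(51)]
cv_loss = []
for lam in lam_grid:
    total = 0.0
    for k in range(10):
        # Fit on the left-out sample k, score against the other nine.
        xi = ridge_fit(*samples[k], lam)
        total += sum(np.linalg.norm(Th @ xi - ut)
                     for i, (Th, ut) in enumerate(samples) if i != k)
    cv_loss.append(total)
best_lam = lam_grid[int(np.argmin(cv_loss))]
print(best_lam)
```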
For these values, the corresponding empirical coefficient distributions are shown in Figure 6.

S5 Comparison of model predictions
For noisy data, different linear combinations of the spatial derivatives best match the temporal derivative from different samples when measured according to the $L_2$-loss, and this can have a significant impact on the density profile predicted by the learned equation. In Figures S11-S13 we demonstrate this for each of Case I (using $D_I^1$, Figure S11), Case II (using $D_{II}^1$, Figure S12) and Case III (using $D_{III}^1$, Figure S13). In the left-hand column of each figure we show results when the PDE-FIND algorithm accurately predicts evolution of the population density, and in the right-hand column of each figure we show results when PDE-FIND fails to accurately predict evolution of the population density. The top two rows show plots of the temporal derivative estimated from single data samples, the temporal derivative estimated assuming the relevant coarse-grained PDE model in Table 2 of the main text, and the temporal derivative estimated using the (mis-specified) coefficients established via application of PDE-FIND to the data sample. In each case, the top row shows results at $t = 20$ and the middle row shows results at $t = 250$. The bottom row of each figure shows the density profiles computed via numerical solution of the corresponding PDEs at $t = 250$. In five of the six cases, the temporal derivatives estimated assuming the (mis-specified) PDE-FIND coefficients reproduce the observed temporal derivative qualitatively. However, Figures S11-S13 show that in many cases, these mis-specified PDE coefficients do not yield predictions that are in close agreement with the observed density data.

S6 Additional information on noisy reference data
In Section (c) of the main text we identify significant differences between the loss landscapes of the different error metrics for the computed derivatives and spatial solutions of the PDEs. Here, we show that we may also take as a reference point a single, noisy observation instead of the average of the observations in the ensemble, noting that the quality of the resulting posterior is decreased. As in the main text, we take the dataset D_1^I, which consists of unbiased motility and no proliferation, and in which each sample consists of an average over K = 1 simulations from the ABM; we plot the different loss landscapes in Figure S14. In the top row we plot results when we average over all N_s = 1000 samples in the dataset, whereas in the bottom row we plot results when we select a single sample at random from the dataset.
The large amount of noise in the single data sample case means that the loss landscape for the derivative does not have a minimum around the true parameter value, implying that minimising with respect to this norm does not guarantee convergence to the correct solution. This is very different to the case of low noise, where the loss landscape has a minimum around the true parameter value. On the other hand, the fit of the integrated solutions is a much more consistent summary statistic in the case of a single, highly noisy data sample, and retrieves low measurement error in a region around the true parameter value. When the loss is computed on the averaged data, this region is smaller and the range of the error is larger, suggesting that the averaged-data loss function is better able to distinguish between solutions.
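A minimal sketch of the two loss landscapes, using a 1D diffusion model with an analytic solution in place of the ABM data; the noise level, observation time and parameter grid are illustrative assumptions.

```python
# Derivative-space vs solution-space loss landscapes for a single noisy
# observation of a 1D diffusion profile (toy stand-in for the ABM data).
import numpy as np

def heat_kernel(x, t, D):
    """Fundamental solution of u_t = D u_xx."""
    return np.exp(-x**2 / (4.0 * D * t)) / np.sqrt(4.0 * np.pi * D * t)

rng = np.random.default_rng(1)
x = np.linspace(-5.0, 5.0, 201)
t_obs, D_true, sigma, dt = 1.0, 0.5, 0.05, 1e-2
u_obs = heat_kernel(x, t_obs, D_true) + sigma * rng.normal(size=x.size)
u_next = heat_kernel(x, t_obs + dt, D_true) + sigma * rng.normal(size=x.size)
ut_obs = (u_next - u_obs) / dt           # noisy finite-difference derivative

D_grid = np.linspace(0.1, 1.0, 50)
loss_solution = np.array([np.linalg.norm(heat_kernel(x, t_obs, D) - u_obs)
                          for D in D_grid])
loss_derivative = np.array([np.linalg.norm(
    (heat_kernel(x, t_obs + dt, D) - heat_kernel(x, t_obs, D)) / dt - ut_obs)
    for D in D_grid])
D_best = D_grid[np.argmin(loss_solution)]
```

Finite-differencing the noisy data amplifies the noise by a factor 1/dt, so the derivative-space landscape is dominated by noise and need not have a minimum near D_true, whereas the solution-space landscape typically does.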

S7 Intractability of the likelihood function
In this section, we aim to provide further motivation for the choice of a likelihood-free method instead of a classical likelihood-based approach. Recall that, in Bayesian statistics, the likelihood P(D_obs | θ) defines the probability density of the observations D_obs given the model parameters θ. In this context, the observed data {U(x, t)} at space points x = x_1, x_2, ..., x_N and time points t = t_1, t_2, ..., t_N are obtained from a stochastic ABM. The solution u(x, t; θ) of the PDE model is an approximation of the mean of the ABM data, i.e.

u(x, t; θ) ≈ E_θ[U(x, t)]. (S4)

To define a classical likelihood, one would need to first assume that the mean of the ABM data is exactly given by the PDE solution and describe the distribution of ABM outputs around the PDE model mean by

U(x_i, t_i) = u(x_i, t_i; θ) + ε_i, (S5)

where ε_i is a random variable with zero mean (not necessarily identically distributed, but generally assumed independent). One can then obtain a likelihood for the observed data by prescribing a probability distribution for the ε_i, which would yield a likelihood for U(x, t) given u(x, t; θ).
However, for a general ABM, the distribution of the deviation from the mean (i.e. the distribution of the ε_i) is unknown. In some cases, one might choose to make a simplifying assumption on the ε_i, such as a Gaussian approximation. This may be appropriate when one is familiar with the noise process and the distribution of the data is approximately Gaussian, for example in the case of a very large number of samples, where the Central Limit Theorem can be invoked.
However, in the small data limit considered in EQL applications, the assumption that the ε_i are Gaussian is unreasonable. The extent to which individual realisations vary from their mean depends on the details of the ABM and is, for the purposes of inference, unknown. As we prefer to avoid placing unnecessary assumptions on the process, in the form of assumptions on the ε_i, we opt instead for a likelihood-free approach. We additionally stress that classical likelihood models, when considering a likelihood for general PDE data, employ the assumption that the errors ε_i are independent in space and time, so that the joint likelihood of the data given the model, P(U(x_1, t_1), ..., U(x_N, t_N) | u(x_1, t_1; θ), ..., u(x_N, t_N; θ)), can be decomposed as

P(U(x_1, t_1), ..., U(x_N, t_N) | u(x_1, t_1; θ), ..., u(x_N, t_N; θ)) = ∏_i P(U(x_i, t_i) | u(x_i, t_i; θ)). (S7)

Since individual realisations of the ABM may have deviations from the mean that have spatial and temporal structure, it is inappropriate to assume that the ε_i are independent. To summarise, a classical likelihood approach can be used, but it requires the introduction of nontrivial additional assumptions and so may not target the true posterior. We prefer not to take this approach, and instead opt for a method that has been developed specifically to deal with cases where the likelihood is mathematically intractable.
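The mis-specification introduced by the independence assumption can be illustrated numerically. The AR(1)-type covariance below is an assumed toy structure for spatially correlated residuals, not derived from the ABM.

```python
# Compare the log-likelihood under the independence assumption of Eq. (S7)
# with the exact joint log-likelihood for spatially correlated residuals.
import numpy as np

n, rho, sigma = 50, 0.8, 0.1
idx = np.arange(n)
Sigma = sigma**2 * rho ** np.abs(idx[:, None] - idx[None, :])  # AR(1) covariance
rng = np.random.default_rng(2)
eps = rng.multivariate_normal(np.zeros(n), Sigma)   # correlated residuals

# independence assumption: sum of univariate Gaussian log-densities
ll_indep = -0.5 * np.sum(eps**2 / sigma**2 + np.log(2.0 * np.pi * sigma**2))
# exact joint log-density under the true covariance
sign, logdet = np.linalg.slogdet(Sigma)
ll_exact = -0.5 * (eps @ np.linalg.solve(Sigma, eps)
                   + logdet + n * np.log(2.0 * np.pi))
```

The two log-likelihoods differ, so inference built on the product form (S7) targets a different (mis-specified) posterior whenever the residuals carry spatial or temporal structure.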

S8 ABC-rejection implementation
ABC-rejection can be retrieved from Algorithm 1 by choosing the importance distribution q to be equal to the prior distribution π(θ).
1: Set n = 0
2: repeat
3:     Set n = n + 1
4:     Generate θ_n ~ q(θ)
5:     Simulate y_sim ~ f(· | θ_n)
6:     Set w_n = [π(θ_n)/q(θ_n)] · I(d(y_sim, y_obs) < ε)
7: until S is true

We choose the tolerance, ε, as 0.3 for Case I and 0.8 for Cases II and III, as detailed in the main text. For each posterior distribution, we set the stopping condition S to be the acceptance of 1500 parameters.
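A minimal sketch of ABC-rejection in the special case q = π, where the weights w_n reduce to the accept/reject indicator. The toy simulator, prior and distance below are illustrative stand-ins for the ABM forward model and the summary statistics used in the paper.

```python
# ABC-rejection: sample theta from the prior, simulate, and accept when the
# simulated summary lies within tolerance eps of the observed summary.
import numpy as np

rng = np.random.default_rng(3)
y_obs = 0.5

def simulate(theta):                      # stand-in for the ABM / PDE forward model
    return theta + 0.1 * rng.normal()

def distance(y_sim, y_obs):               # stand-in summary-statistic distance
    return abs(y_sim - y_obs)

eps, n_accept = 0.3, 1500                 # tolerance and stopping condition S
accepted = []
while len(accepted) < n_accept:
    theta = rng.uniform(0.0, 1.0)         # theta ~ pi(theta), since q = pi
    y_sim = simulate(theta)
    if distance(y_sim, y_obs) < eps:      # weight reduces to the indicator
        accepted.append(theta)
posterior = np.array(accepted)
```

Because q = π, the importance ratio π(θ_n)/q(θ_n) is identically one, and the accepted draws form an unweighted sample from the ABC posterior.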

S9 Additional experiments
In this section we report the hyperparameters used for the various spike-and-slab models employed in the paper, together with the settings for the uniform prior over all library terms. Recall that, to tune the hyperparameters, we used the exploration subset of the data. In all cases, the mixture parameters in the spike-and-slab models were taken to be the previously found identification ratios. Recall that the values of α and β in each hypervariance distribution are chosen such that the expected variance of the slab is of the same order of magnitude as the empirical variance of the sample.
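The choice of α and β can be sketched as follows, assuming an inverse-gamma hypervariance with mean β/(α − 1); the data and the shape parameter below are illustrative, not the values used in the paper.

```python
# Match the expected slab variance beta/(alpha - 1) of an inverse-gamma
# hypervariance to the empirical variance of the exploration data.
import numpy as np

rng = np.random.default_rng(4)
exploration_sample = rng.normal(0.0, 0.3, size=200)  # stand-in exploration data
emp_var = exploration_sample.var()
alpha = 3.0                              # assumed shape; must exceed 1 for a finite mean
beta = emp_var * (alpha - 1.0)           # so that E[sigma^2] = beta/(alpha - 1)
expected_slab_var = beta / (alpha - 1.0)
```

The spike-and-slab mixture weight would then be set to the identification ratio found during the exploration phase, as described above.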

S9.3 Uniform prior on all parameters
We report the prior used to compare the performance of Bayes-PDEFIND against using a uniform prior over all parameters. We choose the prior

S10 Posterior distributions for Case III
In Figure S15 we report in plot matrix format the posterior distributions found for the parameters in Case III.