Data-driven model reduction of agent-based systems using the Koopman generator

The dynamical behavior of social systems can be described by agent-based models. Although single agents follow easily explainable rules, complex time-evolving patterns emerge due to their interaction. The simulation and analysis of such agent-based models, however, is often prohibitively time-consuming if the number of agents is large. In this paper, we show how Koopman operator theory can be used to derive reduced models of agent-based systems using only simulation data. Our goal is to learn coarse-grained models and to represent the reduced dynamics by ordinary or stochastic differential equations. The new variables are, for instance, aggregated state variables of the agent-based model, modeling the collective behavior of larger groups or the entire population. Using benchmark problems with known coarse-grained models, we demonstrate that the obtained reduced systems are in good agreement with the analytical results, provided that the number of agents is sufficiently large.


Introduction
Systems of multiple agents that act and interact within a social network lead to complex dynamics and collective social phenomena. An agent can represent an individual person, a household, an organization, or any kind of discrete entity in an environment, which can be given, e.g., by geographical conditions, resources, infrastructure, but also rules or laws. Applications range from data-based micro-simulations to abstract agent-based models (ABMs), for instance for innovation spreading and infection kinetics (e.g., [1,2]). A well-studied application concerns opinion dynamics and can be traced back to the voter model introduced by Clifford and Sudbury [3] in the 1970s. The name was coined by Holley and Liggett [4] a few years later. In this model, an agent imitates the opinion of its neighbors. This means that whenever two agents with different opinions interact with each other, one of them copies the opinion of the other agent. There exist various modifications of the voter model, e.g., regarding the representation of the opinions, imitation, and interaction structure, see, for instance, [5][6][7][8] for an overview.
Agent-based models provide an easily explainable and accessible framework for studying the dynamical behavior of interacting agents without requiring an extensive mathematical background. Models range from (highly detailed) microscopic stochastic descriptions following spatial movement and neighbor interactions [9] and individual-based stochastic descriptions in a network without movement [10] to Markov chain approaches for collective population dynamics [11]. Most agent-based models have in common that they are hard to analyze due to their high dimensionality. Additionally, simulations are often time-consuming, so that a detailed analysis of such systems or parameter studies are typically infeasible. This is a clear disadvantage, especially for real-time decision and policy making. One way to mitigate this is to compute surrogate models via machine learning approaches that can be used for calibration, sensitivity analysis, and parameter studies, see [12]. Another way is to represent the agents as a system of ordinary or stochastic (partial) differential equations (ODEs, mean-field ODEs, SDEs, or SPDEs), see, for instance, [13][14][15][16]. Assuming that the population of homogeneous agents that interact with each other (e.g., via a complete network) is sufficiently large, this system can be modeled as a Markov jump process (see also [10,11]), which in turn can be approximated using ordinary or stochastic differential equations [8,17]. This does not hold for all ABMs (consider, e.g., network-free or off-lattice models). A drawback is that the aforementioned methods require knowledge about the process itself, which might not be available. Thus, there is a growing interest in learning the interaction laws of social dynamics in a data-driven fashion. One method is the so-called equation-free approach pioneered by Kevrekidis et al.
[18,19], which aims at circumventing the derivation of macroscopic, system-level equations when they are believed to exist but cannot be expressed in closed form. In [20], the equation-free approach is used to obtain a reduced model of a spatio-temporally varying agent-based civil violence model. The obtained model is a stochastic differential equation that depends on two coarse-grained variables. The estimation of the drift and diffusion terms is accomplished by suitable short realizations of the agent-based simulation. Other applications of the equation-free approach are, e.g., bifurcation and stability analysis for ABMs or rare-event analysis [21,22]. One key problem is the discovery of the right coarse-grained variables. If these are not known from physical insights or intuition, it is possible to use, e.g., a data-mining approach. In [23], the authors propose to use diffusion maps to learn the essential variables, resulting in an equation-free-variable-free approach. In [24], a non-parametric approach for learning the interaction laws that is similar to parameter estimation problems for ordinary differential equations is proposed, assuming that the interaction depends only on pairwise distances between agents. Furthermore, it is shown that the learning rate is then independent of the dimension, making their approach suitable for large-scale systems. The data-driven approach described in [25] utilizes memory terms to improve the accuracy of the coarse-grained model.
Our approach to learning coarse-grained systems for complex ABM dynamics relies on Koopman operator theory. The Koopman operator and its generator have been used for computing metastable and coherent sets, stability analysis, and control, but also for system identification, e.g., [26][27][28][29]. It was shown that by expressing the full-state observable in terms of the basis functions or eigenfunctions, it is possible to learn the governing equations of dynamical systems from data. While this has mostly been applied to ordinary differential equations [26,[30][31][32], the approach can be naturally extended to stochastic differential equations, where the drift and diffusion terms are then estimated in a similar fashion [33]. While Koopman operator-based methods have been successfully applied to molecular dynamics, fluid dynamics, engineering, and physics problems, their application to complex social systems such as ABMs is still lacking, although notions like metastability and coherence exist in this context as well. The goal then is to study the coarse-grained behavior of complex ABMs based on data. If the model describes, for instance, the voting behavior of a large population, we are often not interested in each agent's decision but in the collective behavior of larger groups or the entire population. In [34,35], the authors use Koopman mode analysis to investigate the dynamics of the spatio-temporal distribution of different agent types or to extract non-obvious information from the system's state indicating changes in the dynamics. It was shown in [17] that the long-term characteristic behavior of ABMs can be determined by simulating (many) short trajectories of the corresponding SDE instead.
Our goal is to illustrate how coarse-grained models of complex ABM dynamics can be learned from data. The approach is based on [33], with the difference that we here directly learn reduced models. Since we know the resulting limit processes in this case, which are given by systems of ODEs or SDEs, we can compare the numerical results obtained for finitely many agents with the theoretical results. We demonstrate that under appropriate conditions the estimated models are in good agreement with known limit cases. The aim is to use the reduced models also for sensitivity analysis, parameter optimization, and control, by combining them with techniques proposed in [28,36,37]. The main contributions of this work are:
• We show that the Koopman generator can be used to learn reduced stochastic models from aggregated trajectory data that represents the collective behavior of larger groups or the entire population.
• We demonstrate for a voter model defined on a complete network that the obtained reduced models are in good agreement with the SDE approximation for large population sizes and can not only be used for system identification but also for predictions of the temporal evolution. Furthermore, we show how the transition rate constants of the underlying Markov jump process corresponding to the ABM can be reconstructed.
• We show that the proposed procedure also yields good reduced models that allow prediction in other cases, where the limit process is unknown or the system is even far from a limit regime. We demonstrate this for incomplete, clustered interaction networks (again for the voter model) as well as for models that do not have a network-based formulation (using a predator-prey model).
In general, this method requires a lot of data; this is, however, not an issue in simulation studies, where a surrogate model is required for the optimization or control of the full-complexity ABM.
The remainder of this paper is structured as follows: In Section 2, we introduce the stochastic Koopman operator, its generator, and generator extended dynamic mode decomposition (gEDMD). We then briefly summarize the representation of ABMs as Markov jump processes and their SDE limit model for large population sizes in Section 3. Furthermore, we introduce the voter model and the predator-prey model, which are used as guiding examples throughout the paper. In Section 4, we learn reduced models for complex ABM dynamics purely from aggregated data. We show in Section 5 that, under certain conditions, the coarse-grained models agree with known limit cases. Furthermore, considering both ABMs with clustered interaction networks and ABMs without any underlying network structure, we demonstrate that the reduced models also allow prediction in other cases. Concluding remarks and future work are discussed in Section 6.

The Koopman operator and its generator
In what follows, let $\mathbb{X} \subseteq \mathbb{R}^d$ be the state space and $f \in L^\infty(\mathbb{X})$ a real-valued observable of the system, which can represent any kind of measurement. Furthermore, let $\mathbb{E}[\,\cdot\,]$ denote the expected value. Given a stochastic differential equation of the form

$$ \mathrm{d}X_t = b(X_t)\,\mathrm{d}t + \sigma(X_t)\,\mathrm{d}W_t, \tag{1} $$

where $b \colon \mathbb{R}^d \to \mathbb{R}^d$ is the drift term, $\sigma \colon \mathbb{R}^d \to \mathbb{R}^{d \times s}$ the diffusion term, and $W_t$ an $s$-dimensional Wiener process, the stochastic Koopman operator is defined by

$$ (\mathcal{K}^t f)(x) = \mathbb{E}\big[ f(F^t(x)) \big]. $$

Here, $F^t$ is the (stochastic) flow map associated with (1). It can be shown that the infinitesimal generator of the stochastic Koopman operator is

$$ \mathcal{L}f = \sum_{i=1}^d b_i \frac{\partial f}{\partial x_i} + \frac{1}{2} \sum_{i,j=1}^d a_{ij} \frac{\partial^2 f}{\partial x_i\, \partial x_j}, $$

where $a = \sigma \sigma^\top$. The adjoint operator is given by

$$ \mathcal{L}^* f = -\sum_{i=1}^d \frac{\partial}{\partial x_i} (b_i f) + \frac{1}{2} \sum_{i,j=1}^d \frac{\partial^2}{\partial x_i\, \partial x_j} (a_{ij} f). $$

The function $u(t, x) = (\mathcal{K}^t f)(x)$ solves the Kolmogorov backward equation, given by the second-order partial differential equation $\frac{\partial u}{\partial t} = \mathcal{L} u$, see [38]. Moreover, $\frac{\partial u}{\partial t} = \mathcal{L}^* u$ is called the Fokker–Planck equation [39]. For deterministic dynamical systems, $\sigma \equiv 0$ and consequently also $a \equiv 0$, so that we obtain a first-order partial differential equation, namely the Liouville equation.

Infinitesimal generator EDMD
While the classical extended dynamic mode decomposition (EDMD) approximates the Koopman operator or the Perron–Frobenius operator [27,40], we now seek to approximate their generators from data. We thus introduce generator EDMD or, in short, gEDMD, which was proposed in [33]. Assume that we have $m$ measurements of the system's state $\{x_l\}_{l=1}^m$, its drift $\{b(x_l)\}_{l=1}^m$, and diffusion $\{\sigma(x_l)\}_{l=1}^m$. We will discuss in Section 4 how to obtain these pointwise estimates. Then, choosing a set of basis functions $\{\psi_i\}_{i=1}^n$, which is sometimes also called a dictionary, and writing it in vector form as $\psi(x) = [\psi_1(x), \ldots, \psi_n(x)]^\top$, we define

$$ \mathrm{d}\psi_i(x) = (\mathcal{L}\psi_i)(x) = \sum_{j=1}^d b_j(x) \frac{\partial \psi_i}{\partial x_j}(x) + \frac{1}{2} \sum_{j,k=1}^d a_{jk}(x) \frac{\partial^2 \psi_i}{\partial x_j\, \partial x_k}(x). $$

For all measurements and basis functions, we can now assemble the matrices

$$ \Psi_X = [\psi(x_1), \ldots, \psi(x_m)], \qquad \mathrm{d}\Psi_X = [\mathrm{d}\psi(x_1), \ldots, \mathrm{d}\psi(x_m)], \tag{2} $$

where $\Psi_X, \mathrm{d}\Psi_X \in \mathbb{R}^{n \times m}$. Assuming there exists a matrix $M$ such that $\mathrm{d}\Psi_X = M \Psi_X$, we solve the problem in the least-squares sense by minimizing $\|\mathrm{d}\Psi_X - M \Psi_X\|_F$, since in general this problem cannot be solved exactly. Here, $\|\cdot\|_F$ denotes the Frobenius norm. The least-squares solution is given by

$$ \widehat{M} = \mathrm{d}\Psi_X\, \Psi_X^+, $$

where $A^+$ denotes the Moore–Penrose pseudoinverse of a matrix $A$. The matrix $L = \widehat{M}^\top$ is an empirical estimate of the matrix representation of the infinitesimal generator $\mathcal{L}$, as shown in [33]. In the infinite-data limit, gEDMD converges to a Galerkin approximation of the generator, i.e., a projection onto the space spanned by the basis functions.
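To make the least-squares step concrete, here is a minimal sketch (function and variable names are our own); it assumes the two matrices from (2) have already been assembled and checks the result on the deterministic toy system dx/dt = -x with dictionary {1, x, x^2}, for which L1 = 0, Lx = -x, and Lx^2 = -2x^2:

```python
import numpy as np

def gedmd_generator(Psi_X, dPsi_X):
    """gEDMD least-squares step: find M minimizing ||dPsi_X - M Psi_X||_F
    and return L = M^T, the empirical generator matrix (inputs are n x m)."""
    M = dPsi_X @ np.linalg.pinv(Psi_X)
    return M.T

# toy check: dx/dt = -x with dictionary {1, x, x^2} at four sample points
xs = np.array([0.1, 0.5, 1.0, 2.0])
Psi_X = np.vstack([np.ones_like(xs), xs, xs**2])
dPsi_X = np.vstack([np.zeros_like(xs), -xs, -2.0 * xs**2])
L = gedmd_generator(Psi_X, dPsi_X)   # should be diag(0, -1, -2)
```

Since the dictionary matrix has full row rank here, the pseudoinverse recovers the generator matrix exactly up to floating-point error.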

System identification
Let $\mathbb{X}$ be bounded so that the full-state observable $g(x) = x$ is (component-wise) contained in $L^\infty(\mathbb{X})$. With the aid of the full-state observable, it is possible to reconstruct the governing equations of the underlying dynamical system. We assume that the function $g(x) = x$ can be represented by the basis functions $\psi$. The easiest way to accomplish this is to add the observables $\{x_i\}_{i=1}^d$ to the dictionary. Let $B \in \mathbb{R}^{n \times d}$ be the matrix such that $g(x) = B^\top \psi(x)$. The system can then directly be represented in terms of the basis functions, which, for a deterministic dynamical system, is equivalent to SINDy [41]. For non-deterministic systems, the drift is identified by $b_i(x) = (\mathcal{L} x_i)(x)$, and for $\psi_k(x) = x_i x_j$ the diffusion term can be identified by

$$ a_{ij}(x) = (\mathcal{L}\psi_k)(x) - b_i(x)\, x_j - b_j(x)\, x_i, \tag{3} $$

provided that $b_i$ and $b_j$ as well as $b_i(x)\,x_j$ and $b_j(x)\,x_i$ are contained in the space spanned by the basis functions. If the diffusion term $\sigma$ itself is needed, we can obtain it using a Cholesky decomposition of $a$, see [33].
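As a worked toy instance of (3) (our example, not taken from the paper), consider the scalar Ornstein-Uhlenbeck process dX = -X dt + sqrt(2) dW with dictionary psi = [1, x, x^2]; the drift and diffusion coefficients can be read off the generator matrix as follows:

```python
import numpy as np

# Column k of L holds the coefficients of L(psi_k) in the dictionary [1, x, x^2].
# For dX = -X dt + sqrt(2) dW (b(x) = -x, a(x) = 2):
#   L 1 = 0,  L x = b(x) = -x,  L x^2 = 2 x b(x) + a(x) = -2 x^2 + 2
L = np.array([[0.0, 0.0, 2.0],
              [0.0, -1.0, 0.0],
              [0.0, 0.0, -2.0]])

b_coeffs = L[:, 1]   # coefficients of b(x) = (L x)(x) in [1, x, x^2]

# a(x) = (L x^2)(x) - 2 x b(x); with b = b0 + b1 x + b2 x^2, the product
# 2 x b(x) has coefficients [0, 2 b0, 2 b1] in [1, x, x^2] (the cubic term
# vanishes here because b2 = 0)
a_coeffs = L[:, 2] - np.array([0.0, 2.0 * b_coeffs[0], 2.0 * b_coeffs[1]])
```

Here the recovered coefficients are b(x) = -x and a(x) = 2, i.e., the constant diffusion of the OU process, as expected.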

Modeling agent-based systems
We consider agent-based systems of N interacting agents. For each system, there is a set {S_1, . . ., S_d} of types available to the agents, a set {R_1, . . ., R_K} of transition rules that define possible changes between the types S_i, and a set of propensity functions specifying the rates of random occurrences of the transitions. The ABM state space is given by {1, . . ., d}^N and grows like d^N, which is problematic for large N. For this reason, we describe the ABM via the population state, i.e., we count the number of agents of each type. The population state space grows like N^d in the worst case. If we assume random interactions between all agents (e.g., via a complete network) and indistinguishable agents, then the population state description is exact. In all other cases it involves an approximation error due to the aggregation of the ABM state space. We will consider two different agent-based models and modeling approaches. The first one is a continuous-time voter model without spatial resolution, where the agents are nodes in an interaction network and each of them can switch between d different types according to some given transition rules. This model is similar to the discrete-time model in [11]. The second ABM is a spatial predator-prey model (i.e., there is no underlying network) formulated in discrete time. Unlike in the first model, the agents do not change their types (in this context called breeds). Instead, transitions in the population state are caused by reproduction and death of predators and prey. The population size is thus not constant.
We will now describe the representation of agent-based systems (using the population state) as Markov jump processes and their approximation by SDEs for large population sizes. For further details, we refer the reader to [17].

Agent-based models as Markov jump processes
At any time $t$, the population state $x \in \mathbb{X}$ of the ABM is fully described by the vector

$$ x = (x_1, \ldots, x_d)^\top, $$

where $x_i$ is the number of agents of type $S_i$. For the sake of simplicity, we assume in this subsection random interactions between all agents so that transitions between agent types imply transitions between population states. We use a formalism that is most commonly used in the chemical context, where each transition rule is represented by an equation of the form

$$ R_k \colon \; s_{k1} S_1 + \ldots + s_{kd} S_d \;\longrightarrow\; r_{k1} S_1 + \ldots + r_{kd} S_d. $$

It induces an instantaneous change in the system's state of the form $x \mapsto x + \nu_k$, where the vector $\nu_k = (r_{k1} - s_{k1}, \ldots, r_{kd} - s_{kd})^\top$ describes the net change in the number of agents of each type $S_i$ due to transition $R_k$. Transition $R_k$ occurs in an infinitesimal time step $\mathrm{d}t$ with probability $\alpha_k(x)\,\mathrm{d}t$, where $\alpha_k \colon \mathbb{X} \to [0, \infty)$ denotes the propensity function associated with transition $R_k$. We assume that the propensity $\alpha_k$ is proportional to the number of combinations of agents in $x$ and, moreover, that it scales with the total population size $N$, i.e.,

$$ \alpha_k(x) = N \gamma_k \prod_{i=1}^d \left( \frac{x_i}{N} \right)^{s_{ki}}. $$

Here, $\gamma_k > 0$ denotes the rate constant for the $k$th transition $R_k$. The evolution of the population state can be described by a continuous-time stochastic process $\{X_t\}_{t \geq 0}$ with $X_t = (x_1(t), \ldots, x_d(t))^\top$, where $x_i(t)$ denotes the number of agents of type $S_i$ at time $t$. It is a Markov jump process, i.e., it is piecewise constant in time with jumps of the form $X_t \mapsto X_t + \nu_k$.
Let $P(x, t) := \mathbb{P}[X_t = x \mid X_0 = x_0]$ denote the probability of finding the process in state $x$ at time $t$ given some initial state $x_0$. The temporal evolution of $\{X_t\}_{t \geq 0}$ can then be described by the Kolmogorov forward equation

$$ \frac{\partial}{\partial t} P(x, t) = \sum_{k=1}^K \big( \alpha_k(x - \nu_k)\, P(x - \nu_k, t) - \alpha_k(x)\, P(x, t) \big). \tag{4} $$

By setting $\alpha_k(x) := 0$ and $P(x, t) := 0$ for $x \notin \mathbb{N}_0^d$, we exclude terms on the right-hand side of (4) where the argument $x - \nu_k$ contains negative entries. Since in general the Kolmogorov forward equation of the ABM process cannot be solved analytically, the distribution of the process can be estimated by Monte Carlo simulations, which can be generated using Gillespie's stochastic simulation algorithm [42].
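Sample paths of such a Markov jump process can be generated with Gillespie's stochastic simulation algorithm [42]; a minimal generic implementation might look as follows (function and variable names are ours):

```python
import numpy as np

def gillespie(x0, nu, propensities, t_max, rng):
    """Gillespie's stochastic simulation algorithm for a Markov jump process.
    x0: initial population state, nu: list of state-change vectors nu_k,
    propensities: function mapping a state x to the rates alpha_k(x)."""
    t, x = 0.0, np.asarray(x0, dtype=float)
    times, states = [t], [x.copy()]
    while t < t_max:
        a = np.asarray(propensities(x), dtype=float)
        a0 = a.sum()
        if a0 == 0.0:                       # absorbing state reached
            break
        t += rng.exponential(1.0 / a0)      # exponential waiting time
        k = rng.choice(len(a), p=a / a0)    # index of the firing transition
        x = x + nu[k]
        times.append(t)
        states.append(x.copy())
    return np.array(times), np.array(states)
```

For instance, a pure death process S_1 -> 0 with propensity alpha(x) = x_1 reaches the absorbing state after exactly x_1(0) jumps.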
Assuming convergence of the propensity functions for $N \to \infty$, it is well known that the rescaled jump process $X_t N^{-1}$ converges to the frequency process $C(t)$, $t \geq 0$, given by the SDE

$$ \mathrm{d}C(t) = \sum_{k=1}^K \nu_k\, \tilde{\alpha}_k(C(t))\, \mathrm{d}t + \frac{1}{\sqrt{N}} \sum_{k=1}^K \nu_k \sqrt{\tilde{\alpha}_k(C(t))}\, \mathrm{d}W_k(t), \tag{5} $$

with initial state $C(0) = \lim_{N \to \infty} X_0 N^{-1}$, independent Wiener processes $W_k(t)$, $k = 1, \ldots, K$, and rescaled propensities $\tilde{\alpha}_k(c) = N^{-1} \alpha_k(cN)$ [43]. The SDE limit model (5) is also known as the chemical Langevin equation in the context of chemical reaction kinetics [44]. Written as an SDE of the form (1), the drift and diffusion terms $b(c)$ and $\sigma(c)$ are given by

$$ b(c) = \sum_{k=1}^K \nu_k\, \tilde{\alpha}_k(c), \tag{6} $$

$$ \sigma(c) = \frac{1}{\sqrt{N}} \Big[ \nu_1 \sqrt{\tilde{\alpha}_1(c)}, \; \ldots, \; \nu_K \sqrt{\tilde{\alpha}_K(c)} \Big] \in \mathbb{R}^{d \times K}. \tag{7} $$
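Trajectories of an SDE of the form (5) can be simulated, e.g., with the Euler-Maruyama scheme; a generic sketch (our naming):

```python
import numpy as np

def euler_maruyama(b, sigma, c0, dt, n_steps, rng):
    """Euler-Maruyama discretization of dC = b(C) dt + sigma(C) dW.
    b maps (d,) -> (d,), sigma maps (d,) -> (d, K)."""
    c = np.asarray(c0, dtype=float)
    path = [c.copy()]
    for _ in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=sigma(c).shape[1])
        c = c + b(c) * dt + sigma(c) @ dW
        path.append(c.copy())
    return np.array(path)
```

For the chemical Langevin equation, b and sigma are assembled from the change vectors and rescaled propensities as in (6) and (7); since the noise amplitude shrinks like 1/sqrt(N), trajectories concentrate around the mean-field ODE for large N.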

Extended voter model
Throughout the paper, we will use the extended voter model (EVM) with $N$ agents, $d$ types, and two sorts of transitions as one of two guiding examples. This model is well known, e.g., as the noisy multi-state voter model, for describing foraging ant colonies, or chemical systems, see [8,45,46]. The agents are the nodes of an interaction network. Given two agents with types $S_i \neq S_j$, imitation or adaptation is a second-order transition of the form $R_{ij} \colon S_i + S_j \to 2 S_j$, whereas exploration or mutation is a first-order transition of the form $R'_{ij} \colon S_i \to S_j$. Imitation happens whenever an agent of type $S_i$ adopts the type of another agent with different type $S_j$. It can be interpreted as adopting an opinion or technology, or also as being infected. Exploration corresponds to an independent change of the agent's type. Given a complete network, the propensity functions for the imitative and exploratory transitions $R_{ij}$ and $R'_{ij}$ are given by

$$ \alpha_{ij}(x) = \gamma_{ij}\, \frac{x_i x_j}{N}, \qquad \alpha'_{ij}(x) = \gamma'_{ij}\, x_i, $$

where $\gamma_{ij}, \gamma'_{ij} > 0$ denote the rate constants for the transitions. Fig 1A shows a graph with $N = 10$ nodes representing the interaction network. Here, the agents can have three different types (represented by blue, red, and yellow vertices). Fig 1B shows a trajectory of the Markov jump process.
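In code, propensities of this form, alpha_ij = gamma_ij x_i x_j / N for imitation and alpha'_ij = gamma'_ij x_i for exploration (consistent with the scaling assumption of Section 3.1), might be evaluated as follows (a sketch with illustrative names):

```python
import numpy as np

def evm_propensities(x, gamma, gamma0, N):
    """Propensities of the extended voter model on a complete network.
    gamma[i, j]: imitation rate constants (S_i + S_j -> 2 S_j),
    gamma0[i, j]: exploration rate constants (S_i -> S_j)."""
    x = np.asarray(x, dtype=float)
    alpha_imit = gamma * np.outer(x, x) / N   # alpha_ij = gamma_ij x_i x_j / N
    alpha_expl = gamma0 * x[:, None]          # alpha'_ij = gamma0_ij x_i
    np.fill_diagonal(alpha_imit, 0.0)         # no transitions S_i -> S_i
    np.fill_diagonal(alpha_expl, 0.0)
    return alpha_imit, alpha_expl
```

These rate matrices can be flattened and passed, together with the corresponding change vectors, to a Gillespie-type simulator.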

PLOS ONE

Predator-prey model
The second agent-based model we consider as a guiding example in this work is a predator-prey model (PPM), where the agents move freely in a given domain. We formulate the PPM in discrete time using intuitive text-based transition rules to emphasize its ABM character.
Given a continuous and periodic space, all agents constantly perform Gaussian random walks with normally distributed step size. This means that, given its current position in space, an agent's next position is obtained by adding a normally distributed random displacement. There are two breeds of agents: predator agents and prey agents. We will denote them as predators and prey, respectively. At each time step, all agents carry out the following steps corresponding to their breed:
• A prey moves and reproduces with probability p_rep. The offspring is placed randomly in the space.
• A predator moves and looks for prey within a radius of vision v. If there is prey within the radius of vision, the predator chooses its victim randomly and kills it. The predator can only reproduce with probability p 0 rep if it killed a prey before. The offspring is placed randomly in the space. If there is no prey in the radius of vision, the predator dies with probability p death .
A flow chart describing the PPM in more detail can be found in Fig 2. In the absence of predators, the prey exhibits unlimited growth, which can be interpreted as independence of resources. There is no competition among the prey; the growth is only kept in check by the existence of predators. The population size is clearly not constant here. Fig 3A shows a snapshot of the PPM for a realization using the parameters summarized in Table 2. Green and red dots represent prey and predators, respectively. The search radius for prey is indicated by the light-red area around the red dots. The aggregate state is given by the numbers of prey and predators, respectively.
Remark 3.1 Due to the spatial component of the PPM, it cannot be formulated directly using the formalism summarized in Section 3.1. Assuming a well-mixed system and denoting prey by $S_1$ and predators by $S_2$, the rules given above translate to

$$ S_1 \to 2 S_1 \quad \text{(reproduction of prey)}, $$
$$ S_1 + S_2 \to 2 S_2 \quad \text{(reproduction of predators)}, $$
$$ S_2 \to \emptyset \quad \text{(death of predators)}, $$

for some rate constants $\gamma_i > 0$, $i = 1, \ldots, 3$. The aggregate state of the PPM then resembles the stochastic Lotka–Volterra predator-prey differential equations.
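A minimal sketch of one synchronous PPM step on the unit torus might look as follows; all parameter names are illustrative, and details such as the exact order of events may differ from the model in Fig 2:

```python
import numpy as np

def ppm_step(prey, pred, v, p_rep, p_rep_pred, p_death, step_std, rng):
    """One synchronous step of a minimal predator-prey sketch on the unit
    torus; prey and pred are (n, 2) position arrays."""
    # all agents perform a Gaussian random walk on the periodic domain
    prey = (prey + rng.normal(0.0, step_std, prey.shape)) % 1.0
    pred = (pred + rng.normal(0.0, step_std, pred.shape)) % 1.0
    # prey reproduce with probability p_rep; offspring placed uniformly at random
    n_off = rng.binomial(len(prey), p_rep)
    prey = np.vstack([prey, rng.random((n_off, 2))])
    alive = np.ones(len(prey), dtype=bool)
    survivors, n_born = [], 0
    for p in pred:
        delta = np.abs(prey - p)
        delta = np.minimum(delta, 1.0 - delta)   # periodic distance
        in_sight = np.flatnonzero(alive & (np.hypot(delta[:, 0], delta[:, 1]) <= v))
        if in_sight.size:
            alive[rng.choice(in_sight)] = False  # kill a random victim in sight
            survivors.append(p)
            if rng.random() < p_rep_pred:        # reproduce only after a kill
                n_born += 1
        elif rng.random() >= p_death:            # no prey in sight: maybe die
            survivors.append(p)
    pred = np.array(survivors).reshape(-1, 2)
    if n_born:
        pred = np.vstack([pred, rng.random((n_born, 2))])
    return prey[alive], pred
```

Iterating this step and recording the numbers of prey and predators yields exactly the kind of aggregated trajectory data used later for model reduction.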
In the next section, we will show how we can obtain reduced models of agent-based models using simulation data only.

Learning coarse-grained models from data
We will now illustrate how to learn reduced models for large agent-based dynamics from aggregated trajectory data using the Koopman generator. The approach is based on [33]. First, we estimate drift and diffusion pointwise, cf. Section 2.2. Subsequently, we apply gEDMD to the estimates to obtain a global description of the drift and diffusion terms, cf. Section 2.3. For the EVM, we will show in Section 5.1 that the identified SDE coincides with the SDE limit model (5), provided that the number of agents is sufficiently large. We will now go through the main steps that are necessary to learn the Koopman generator from data generated by an ABM.

Measurements
Assume that we have access to $m$ measurements of an aggregate state variable of a given ABM. This aggregate state can represent the number of agents sharing, e.g., the same type $S_i$ or belonging to some group. These $m$ measurements will be the starting point. Let us denote them by $\{x_l\}_{l=1}^m$. If possible, we choose the measurements $x_l$ such that they are uniformly distributed in the aggregate state space $\mathbb{X}$ to ensure a good coverage of the whole (aggregate) space. One way to achieve this is by constructing an appropriate map from the macroscopic (aggregate) state to the microscopic ABM state. By appropriate we mean that the mapped macroscopic state and a naturally developed ABM state with the same aggregate variables agree in probability. Practically, this means that if, e.g., the agents follow a certain spatial distribution, this needs to be taken into account when constructing the map. Another, rather straightforward possibility is to gather the measurements "on the fly", i.e., by using the states belonging to trajectories obtained from the simulation of the ABM.

Pointwise estimates
Since the drift and diffusion terms $b$ and $\sigma$ are in general unknown, we estimate them pointwise via finite-difference approximations for each measurement $\{x_l\}_{l=1}^m$ using the Kramers–Moyal formulae

$$ b(x) := \lim_{\tau \to 0} \frac{1}{\tau}\, \mathbb{E}\big[ X_{t+\tau} - x \mid X_t = x \big], \qquad a(x) := \lim_{\tau \to 0} \frac{1}{\tau}\, \mathbb{E}\big[ (X_{t+\tau} - x)(X_{t+\tau} - x)^\top \mid X_t = x \big]. \tag{8} $$

The formulae can be deduced from the Kramers–Moyal expansion, see, e.g., [47]. These expressions can be evaluated by Monte Carlo methods via multiple short trajectories at each data point $\{x_l\}_{l=1}^m$. The simulation of multiple short realizations of the original ABM is comparable to the equation-free approach and common practice in the context of transfer operator approximations. These pointwise estimates of the drift and diffusion at each training data point form the first stage; a global description of them is then obtained via gEDMD.
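In practice, the limits in (8) are replaced by a finite lag time tau and the expectations by Monte Carlo averages over k short bursts; a sketch (our naming):

```python
import numpy as np

def km_estimates(x, burst, tau, k, rng):
    """Monte Carlo Kramers-Moyal estimates at a single point x:
    b(x) ~ E[X_{t+tau} - x] / tau,
    a(x) ~ E[(X_{t+tau} - x)(X_{t+tau} - x)^T] / tau,
    from k short bursts; burst(x, tau, rng) returns the state after time tau."""
    dx = np.array([burst(x, tau, rng) - x for _ in range(k)])
    b_hat = dx.mean(axis=0) / tau
    a_hat = np.einsum('li,lj->ij', dx, dx) / (k * tau)
    return b_hat, a_hat
```

The estimates carry a bias of order tau (the limit is replaced by a finite lag) as well as a sampling error of order 1/sqrt(k).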

Conservation laws
If the aggregate state is subject to a conservation law, e.g., if the number of agents is constant for all times $t \geq 0$, we have only $d - 1$ degrees of freedom and the aggregated trajectory data belongs to a $(d-1)$-dimensional system, i.e., the number of agents $x_j(t)$ of one type can be expressed by

$$ x_j(t) = N - \sum_{i \neq j} x_i(t). $$

We thus reduce each measurement by keeping, without loss of generality, only the first $d - 1$ entries. This eliminates redundant representations of the system. Additionally, we can scale the measurements by the number of agents, $N$, to obtain a frequency representation $c_i(t) = \frac{x_i(t)}{N}$.

Basis functions
Next, we need to choose a set of basis functions $\{\psi_i\}_{i=1}^n$. This is a non-trivial step since in general it is not clear what the drift term $b$ and diffusion term $\sigma$ of the SDE (1) look like. If we assume that the SDE approximation of the ABM adheres to the model structure introduced in Section 3 and comprises at most $p$th-order transitions, we can show that monomials of degree up to $p + 1$ are sufficient to correctly identify a model of the form (5). The highest transition order coincides with the maximum degree of all propensity functions. First, to identify the drift term (6), we conclude from the propensity functions that the set of basis functions needs to contain at least monomials up to degree $p$. Second, as gEDMD identifies $a = \sigma \sigma^\top$ and not the diffusion term (7) itself, we obtain for $c = x/N$

$$ a(c) = \sigma(c)\, \sigma(c)^\top = \frac{1}{N} \sum_{k=1}^K \nu_k \nu_k^\top\, \tilde{\alpha}_k(c), $$

which shows that monomials are sufficient for the identification of the diffusion term as well. Finally, to identify the diffusion term via (3), the products $b_i(x)\,x_j$ must also be representable, so monomials of degree $p + 1$ are needed.
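A monomial dictionary up to a prescribed degree can be generated as follows (a sketch with our naming):

```python
import numpy as np
from itertools import combinations_with_replacement

def monomial_dictionary(d, max_degree):
    """All monomials in d variables up to max_degree, as a list of index
    tuples and an evaluation function psi: x (d,) -> (n,)."""
    index_sets = [()]  # the empty tuple is the constant function 1
    for deg in range(1, max_degree + 1):
        index_sets += list(combinations_with_replacement(range(d), deg))
    def psi(x):
        return np.array([float(np.prod([x[i] for i in s])) for s in index_sets])
    return psi, index_sets
```

For the EVM example considered later, with two free variables (after eliminating the conservation law) and second-order transitions (p = 2), degree p + 1 = 3 yields binom(2 + 3, 3) = 10 basis functions.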

Identification
We are now able to assemble the matrices $\Psi_X$ and $\mathrm{d}\Psi_X$ in (2) and solve the minimization problem $\min_M \|\mathrm{d}\Psi_X - M \Psi_X\|_F$ to obtain an approximation $L = M^\top$ of the infinitesimal generator $\mathcal{L}$ associated with the ABM. For a suitable projection matrix $B$, we then identify the drift and diffusion terms. These are now global descriptions (i.e., functions depending on $x$) forming the second stage, cf. Section 2.3. The overall procedure is summarized in the following algorithm.
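To illustrate how the stages fit together, here is an end-to-end sketch for a scalar toy SDE (our construction, not the paper's Algorithm 4.1; the ABM bursts are replaced by exact Euler steps of the true dynamics, and all names are illustrative):

```python
import numpy as np

def learn_sde_1d(b_true, a_true, xs, tau, k, degree, rng):
    """End-to-end sketch for a scalar toy SDE:
    (i)   Kramers-Moyal estimates of b and a at the sample points xs,
    (ii)  monomial dictionary psi_p(x) = x^p, p = 0..degree,
    (iii) gEDMD least squares; returns the generator matrix L = M^T."""
    m = len(xs)
    b_hat, a_hat = np.empty(m), np.empty(m)
    for l, x in enumerate(xs):
        # surrogate for the ABM bursts: Euler steps of the true SDE
        dx = b_true(x) * tau + np.sqrt(a_true(x) * tau) * rng.normal(size=k)
        b_hat[l] = dx.mean() / tau
        a_hat[l] = (dx**2).mean() / tau
    Psi = np.vstack([xs**p for p in range(degree + 1)])
    rows = []
    for p in range(degree + 1):
        d1 = p * xs**(p - 1) if p >= 1 else np.zeros_like(xs)
        d2 = p * (p - 1) * xs**(p - 2) if p >= 2 else np.zeros_like(xs)
        rows.append(b_hat * d1 + 0.5 * a_hat * d2)   # (L psi_p)(x_l)
    dPsi = np.vstack(rows)
    return (dPsi @ np.linalg.pinv(Psi)).T
```

The drift coefficients of the identified model can then be read off the column of L belonging to the observable psi(x) = x.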

Numerical results
We will now apply Algorithm 4.1 to three benchmark problems. First, we compare the numerical results with the theoretical SDE limit model (5) for the EVM from Section 3.2 for varying numbers of agents $N$ and numbers of Monte Carlo samples $k$ used for the pointwise drift and diffusion estimates, as these are two crucial parameters for the quality of the numerically obtained model. In Section 5.2, we will then show that the method can also be applied to the case where the network is not fully connected but consists of clusters connected by only a few edges. In Section 5.3, we show for the PPM that it is also possible to obtain a reduced model for systems that are not based on interaction networks.
All results are compared using the root mean square error (RMSE), which is defined by

$$ \mathrm{RMSE} = \sqrt{ \frac{1}{m} \sum_{i=1}^m (y_i - \hat{y}_i)^2 }, $$

where $y_i$ and $\hat{y}_i$ denote the measured quantity and its prediction, respectively.
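For reference, a direct implementation of this error measure:

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error between measurements y_i and predictions y_hat_i."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))
```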

Complete networks
Let us consider the EVM defined in Section 3.2 and assume that the network is complete. The state space of this ABM is given by the $(d-1)$-dimensional simplex

$$ \mathbb{X}_N = \Big\{ x \in \mathbb{N}_0^d : \sum_{i=1}^d x_i = N \Big\}. $$

We now consider $d = 3$ types and set the rate constants $\gamma_{ij}$ and $\gamma'_{ij}$, $i, j = 1, \ldots, 3$, to the values given in (9). Due to the conservation law, this is essentially a two-dimensional system. Thus, we eliminate one equation of the limit SDE (5) so that we can compare it with the data-driven SDE obtained by Algorithm 4.1. Additionally, after scaling the measurements by the number of agents, $N$, we obtain

$$ c_3(t) = 1 - c_1(t) - c_2(t). \tag{10} $$

We will then evaluate the quality of the identified coarse-grained model. Utilizing (10), we obtain the drift and diffusion terms (11a) and (11b), respectively. Note that $a(c) = a(c)^\top = (a_{ij}(c))$. Their derivation can be found in S1 Appendix. Following the arguments in Section 4, for a correct identification we need a set of basis functions comprising monomials up to degree 3, as the highest-order transition is of order 2. For any given number of agents $N$, we can construct the first columns of the approximation $L_N$ of the generator $\mathcal{L}$ analytically via the coefficients of $b$ and $a$. For example, for $N = 10$ we obtain the matrix entry $l_{22}$ from the coefficient of $c_1$ in $b_1$, i.e., $l_{22} = \gamma_{31} - \gamma_{13} - \gamma'_{12} - \gamma'_{13} - \gamma'_{31}$, see S1 Appendix for details. We will compare the numerical results to the corresponding columns of $L_N$ and the drift and diffusion terms (11a) and (11b), respectively. The identified system has the structure

$$ b_i(c) = \sum_h b_i^h\, \psi_h(c), \tag{12a} $$

$$ a_{ij}(c) = \sum_h k_{ij}^h\, \psi_h(c), \tag{12b} $$

where the coefficients are given by the expressions derived for (11a) and (11b), see S1 Appendix for details. The coefficients $b_i^h$ of (12a) can immediately be obtained from the second and third columns of $L_N$. The coefficients $k_{ij}^h$ of (12b) are extracted from columns four to six by using (3).
For example, for $a_{12}(c)$ we obtain

$$ b_1(c) = (L \psi_2)(c) = -c_1^2 - 2\, c_1 c_2 + 0.97\, c_1 + 0.01, $$
$$ b_2(c) = (L \psi_3)(c) = c_2^2 + 2\, c_1 c_2 - 1.03\, c_2 + 0.01, $$
$$ a_{12}(c) = (L \psi_5)(c) - b_1(c)\, c_2 - b_2(c)\, c_1 = -0.3\, c_1 c_2 - 0.001\, c_1 - 0.001\, c_2. $$

Comparing the coefficients of the SDE limit model with the corresponding parts of the data-driven system, we can (under certain conditions) recover the rate constants of the underlying Markov jump process. For the considered example, we compare the coefficients of (11) with (12) and set up a system of linear equations $A\gamma = v$ for a suitable matrix $A$, where the vector $\gamma$ collects the rate constants and $v$ the identified coefficients. Note that for this example, with the rate constants chosen in (9), the system cannot be solved exactly in general since the model is symmetric in the sense that imitation is possible in both directions (i.e., $\gamma_{ij} \neq 0$ for all $i \neq j$). Thus, we only find values for $\gamma_{ij}$ and $\gamma_{ji}$ satisfying the differences appearing in (11a) and (11b), see S1 Appendix for details. However, this only influences the reconstruction of the underlying Markov jump process, not the coarse-grained model.

Evaluations.
For both the number of agents $N$ and the number of Monte Carlo samples $k$, we set a maximum of 5000. Since the state space $\mathbb{X}_N$ is discrete and $N$ constant, the number of distinct points is finite and depends on $N$ and $d$; more precisely, for a $d$-dimensional regular discrete simplex with $N + 1$ points on each edge, the number of points is given by $\binom{N+d}{d}$ for $d \leq N$ [48]. In our example, we have a two-dimensional simplex and thus $\binom{N+2}{2}$ points. The number of uniformly chosen measurements is given in Table 1 for different $N$. We then estimate the drift and diffusion terms for each point via (8) from $k$ short simulations of the Markov jump process with a lag time of $\tau = 0.01$, resulting in a total of $m \cdot k$ training data points. Fig 4 shows the approximation error of the numerically obtained coefficients with respect to their theoretical counterparts appearing in (11a) and (11b), depending on the number of agents and the number of Monte Carlo samples. For both parameters, the RMSE decreases by several orders of magnitude as $N$ and $k$ increase. Note that the number of agents $N$ has a significantly larger influence than the number of samples $k$. Especially for small $N$, e.g., $N = 10$, we observe that higher values of $k$ do not improve the results. This is consistent with the literature, as the SDE model (5) approximates the Markov jump process only for large $N$.
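The state-space sizes underlying Table 1 can be checked directly (our helper):

```python
from math import comb

def simplex_points(N, d):
    """Number of grid points of a d-dimensional discrete simplex with N + 1
    points on each edge, i.e., binom(N + d, d)."""
    return comb(N + d, d)
```

For the two-dimensional simplex of the EVM with d = 3 types, N = 10 agents yield binom(12, 2) = 66 admissible population states.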
As it is not only important to identify the coefficients of an SDE limit model, we also compare how well the reduced model approximates the dynamics of the ABM, e.g., to make predictions about the number of agents of a specific type. Fig 5A shows a comparison for a long-time realization in terms of expectation (solid line) and standard deviation (dashed line) for the EVM dynamics. Fig 5B shows the dependency of the RMSE on the number of measurements $m$ for two fixed values $k_i$, namely $k_1 = 10$ (dashed line) and $k_2 = 100$ (solid line). The error is averaged over 100 simulations for 5000 agents. We observe that for larger $m$ the error, as expected, decreases by several orders of magnitude, independently of $k$. However, the impact of increasing $m$ is larger than that of increasing $k$. For a small fixed budget of training data $m \cdot k_i$, the error is smaller for $k_1 = 10$ (dashed line) than for $k_2 = 100$ because the measurements cover the state space more densely: for $k_1 = 10$, for example, we have $m = 10$ measurements, while for $k_2 = 100$ we only have $m = 1$ measurement. Thus, there are two tuning parameters for the amount of training data to be used.

Clustered networks
Let us now consider the case where the network consists of Q (not necessarily equally-sized) clusters. Within a cluster each agent is connected to all other agents, i.e., each cluster q is a complete sub-graph of size N q . Two agents of different clusters are connected with probability p. If p is sufficiently small, then the clusters are connected only by a few edges and the corresponding sub-matrix of the adjacency matrix is sparse. As before, each agent is influenced by its neighbors. However, due to the non-completeness of the network, the resulting transition propensities depend on the size of the individual neighborhood; therefore, they might differ among agents. Here, we do not model the population state of the ABM as described in Section 3 since the overall aggregation leads to errors in this case. Instead, we augment the population state by subpopulations, i.e., an aggregation by cluster. We will use these to learn a coarsegrained model of the agent dynamics. and standard deviation (dashed) of the SDE limit model C i (t) and its data-driven approximation (gray) estimated from 10 3 Monte Carlo simulations for the dynamics of the EVM of Section 3.2 for N = 5000 agents and initial state cð0Þ ¼ ½0:2; 0:7; 0:1� > 2 X. The relative number of agents of type S 3 can be reconstructed using (10) and is therefore not displayed. The approximate moments (gray solid and dashed lines) agree with the SDE limit model. (B) Approximation and evaluation error of the drift and diffusion estimates for the EVM in Section 3.2 compared to the exact SDE limit model (5) depending on the number of measurements m for fixed k 1 = 10 (dashed), k 2 = 100 (solid) and N = 5000 agents. The error is averaged over 100 simulations. Clearly, for higher amounts of training data a smaller error can be expected. This holds for both parameters m and k. https://doi.org/10.1371/journal.pone.0250970.g005
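Such clustered networks and the corresponding cluster-wise aggregate states are straightforward to construct. The following sketch is our own minimal illustration (the helper names are hypothetical, not from the paper's code): it builds the block adjacency matrix, complete within each cluster and Bernoulli(p) between clusters, and aggregates the agent types into per-cluster relative frequencies.

```python
import numpy as np

def clustered_adjacency(sizes, p, rng):
    """Block adjacency matrix: complete within clusters, edges with
    probability p between clusters, no self-loops."""
    N = sum(sizes)
    A = (rng.random((N, N)) < p).astype(int)
    A = np.triu(A, 1)
    A = A + A.T                                  # symmetrize
    start = 0
    for n in sizes:                              # complete diagonal blocks
        A[start:start + n, start:start + n] = 1 - np.eye(n, dtype=int)
        start += n
    return A

def cluster_frequencies(types, sizes, d):
    """Aggregate agent types into per-cluster relative frequencies,
    yielding the augmented state C of dimension Q * d."""
    C, start = [], 0
    for n in sizes:
        block = types[start:start + n]
        C.append(np.bincount(block, minlength=d) / n)
        start += n
    return np.concatenate(C)

rng = np.random.default_rng(1)
A = clustered_adjacency([3, 2], 0.0, rng)        # p = 0: disconnected clusters
C = cluster_frequencies(np.array([0, 0, 1, 1, 1]), [3, 2], d=2)
```

For p = 0 the clusters are disconnected complete subgraphs; increasing p adds the sparse inter-cluster blocks described above.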

An SDE limit model for clustered networks.
We can set up a limit model that describes the relative frequencies of each type per cluster. As mentioned before, this limit model contains an approximation error due to the aggregation of types in each cluster. However, under certain conditions (e.g., uniformly drawn connecting edges), the model yields a good approximation.
We extend (5) such that it describes the temporal evolution of the relative frequencies for a network that consists of Q clusters. Assume that the connecting edges are drawn uniformly with probability p. Let N be the number of agents in each cluster q = 1, . . ., Q; for simplicity, we assume that all clusters are equally sized. We augment the system state such that it contains the relative frequencies of each type per cluster. Let $\tilde{a}_{q,k}$ be the rescaled propensity function for transition k in cluster q and $\nu_{q,k} \in \mathbb{R}^{dQ}$ its corresponding net change vector. We obtain

$$\mathrm{d}C(t) = \sum_{q,k} \tilde{a}_{q,k}(C(t))\, \nu_{q,k}\, \mathrm{d}t + \frac{1}{\sqrt{N}} \sum_{q,k} \sqrt{\tilde{a}_{q,k}(C(t))}\, \mathrm{d}W_{q,k}(t)\, \nu_{q,k}. \qquad (13)$$

Note that equation (13) can be rewritten so that clusters can also have different sizes: given a cluster q, the diffusion term corresponding to transitions within the cluster scales with $1/\sqrt{N_q}$, while transitions induced by another cluster q′ scale with $1/\sqrt{N_{q'}}$.

Example 5.1 (SDE limit model for two clusters). Let us consider a network consisting of Q = 2 clusters with $N_1$ and $N_2$ agents, respectively, and let p be the probability for an edge connecting two agents of clusters $Q_1$ and $Q_2$. We define the connection strength of $Q_1$ and $Q_2$ as the ratio between the number of edges E connecting both clusters and the total number of possible connecting edges $E_{\max} = N_1 N_2$. Since the connecting edges are drawn independently with probability p, the expected connection strength is $\mathbb{E}[E]/E_{\max} = p\,N_1 N_2/(N_1 N_2) = p$. As in Section 3.2, we consider imitation and exploration. The latter is independent of the considered network, while the former is induced either from the inside or the outside of a cluster. If the transition is caused from the inside, we call it an intra-cluster transition, and an inter-cluster transition if it is caused from the outside. Intra-cluster transitions are denoted by $R_{ij}$ and $R'_{ij}$.
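A model of the form (13) can be simulated directly with an Euler-Maruyama scheme. The sketch below is a generic illustration, not the paper's implementation; the propensity functions and net change vectors are assumed to be supplied by the user as callables and arrays, and the simple validation case (a single linear decay transition) is our own choice.

```python
import numpy as np

def euler_maruyama_cle(c0, propensities, nus, N, T, dt, rng):
    """Simulate dC = sum_k a_k(C) nu_k dt + N^{-1/2} sum_k sqrt(a_k(C)) nu_k dW_k.

    propensities: list of callables a_k(c); nus: list of net change vectors.
    """
    c = np.array(c0, dtype=float)
    traj = [c.copy()]
    for _ in range(int(T / dt)):
        drift = np.zeros_like(c)
        noise = np.zeros_like(c)
        for a_k, nu_k in zip(propensities, nus):
            rate = max(a_k(c), 0.0)     # clip to keep the square root real
            drift += rate * nu_k
            noise += np.sqrt(rate) * nu_k * rng.normal(0.0, np.sqrt(dt))
        c = c + drift * dt + noise / np.sqrt(N)
        traj.append(c.copy())
    return np.array(traj)

# Validation: one species, one transition with a(c) = c, nu = -1. For very
# large N the noise vanishes and E[c(t)] follows dc/dt = -c, so c(1) ~ e^{-1}.
traj = euler_maruyama_cle([1.0], [lambda c: c[0]], [np.array([-1.0])],
                          N=1e12, T=1.0, dt=1e-3,
                          rng=np.random.default_rng(0))
```

For moderate N, the $1/\sqrt{N}$ noise term produces the fluctuations around the mean-field trajectory discussed above.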
Imitation as an inter-cluster transition rule is given by $R_{qq',ij}$. The propensity functions differ for intra- and inter-cluster transitions, as each agent has $N_q + p N_{q'}$ possible partners for interaction. For simplicity, we assume that both clusters are of the same size. For the corresponding net change vectors, it holds that $\nu_{qq',ij} = \nu_{q,ij}$, as the inter-cluster transition $R_{qq',ij}$ only influences the state $c_q(t)$ and not $c_{q'}(t)$. For $C(t) = [c_1(t)^\top, c_2(t)^\top]^\top \in \mathbb{R}^{2d}$, this results in the SDE (14), whose addends (a), (b), (d), and (e) correspond to intra-cluster transitions, while (c) and (f) correspond to inter-cluster transitions. We will drop the index q whenever it is clear from the context.

Evaluations.
We now simulate the EVM in discrete time with step size $t_{\mathrm{step}} = 0.01$, see S2 Appendix for the pseudocode. While the method can be applied to arbitrary networks, we restrict ourselves to highly clustered networks as depicted in Fig 6A and 6B. We create k = 1000 realizations for each of the m = 1000 uniformly drawn initial states of the ABM with a lag time of τ = 0.01. The network consists of two equally sized clusters, each containing N = 50 agents. We assume $\gamma_{q,ij} = \gamma_{q',ij}$, $\gamma'_{q,ij} = \gamma'_{q',ij}$, and $\beta_{q,ij} = \beta_{q',ij} = \gamma_{q,ij}$ for all i, j. The rate constants for imitative transitions are given by (9a) and (9b). For exploratory transitions, we set $\gamma'_{ij} = 0$ for all i, j. We compare the data-driven model and the model defined in (14) for two networks with different connection strengths. The adjacency matrices of both networks are shown in Fig 6A and 6B. The first network has a connection strength of p = 0.01, while the second has a 20 times larger connectivity, i.e., p = 0.2. The first network is a subgraph of the second. We apply Algorithm 3.1 to the cluster-based aggregate states of the agent dynamics for each network to obtain the data-driven coarse-grained model. Fig 6C and 6D show the prediction of the temporal evolution of the first moments for each type per cluster. Note that the colors differ from those in Fig 1. Both realizations start from the same initial value; the difference in their temporal evolution results directly from the network structure. As described in Section 5.1 for complete networks, the results improve for larger values of N, m, and k. We can also observe in Fig 6D that for a higher connectivity, i.e., larger p, both clusters synchronize so that the relative numbers of agents per type become identical in each cluster.
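For orientation, a single discrete-time imitation step on a network could look as follows. This is a minimal sketch under our own simplifying assumptions (one uniform imitation rate γ for all type pairs, no exploration) and is not the pseudocode of S2 Appendix.

```python
import numpy as np

def evm_step(types, A, gamma, t_step, rng):
    """One discrete-time step of a network voter model: each agent, with
    probability gamma * t_step, imitates a uniformly chosen neighbor.

    types: integer type of each agent; A: 0/1 adjacency matrix.
    Updates are computed from the current state (synchronous step).
    """
    new = types.copy()
    for i in range(len(types)):
        if rng.random() < gamma * t_step:
            nbrs = np.flatnonzero(A[i])          # neighborhood of agent i
            if nbrs.size:
                new[i] = types[rng.choice(nbrs)]
    return new

rng = np.random.default_rng(0)
A = np.ones((5, 5), dtype=int) - np.eye(5, dtype=int)   # complete graph
consensus = evm_step(np.ones(5, dtype=int), A, gamma=1.0, t_step=1.0, rng=rng)
```

A consensus state is absorbing under pure imitation, which is why the exploratory transitions γ′ are needed to keep the full model ergodic.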
Remark 5.2 Consider a random network of N = 500 nodes where two agents are connected with a probability of 10%. The resulting network is sparsely connected and has an average degree of approximately 50. Fig 7 shows the expectation of the data-driven model (solid) compared to the EVM (dashed) for this random network, estimated from $10^3$ Monte Carlo simulations. For short times t, the data-driven model agrees with the ABM. However, for larger times t the prediction deteriorates, mainly due to the sparsity of the network. Note that the absence of a ground truth model for the EVM on this sparse network complicates the analysis. Compared to the expectation obtained via the SDE limit model (5) (indicated in gray, dotted), the data-driven model yields a better approximation.

Predator-prey model
Let us now consider the PPM introduced in Section 3.3. The parameters we use are listed in Table 2. We learn a data-driven model from m = k = 1000 measurements and samples. The lag time for estimating drift and diffusion is set to τ = 1. Although the PPM has a spatial component, i.e., relatively slow movement of the agents with respect to the size of the domain and the search radius v of the predators, we use the classical Lotka-Volterra differential equations as a starting point for the set of basis functions. The set consists of monomials up to degree 3, so that we can identify the coefficients of the drift and diffusion terms. Fig 8A and 8B show the phase portrait of the first-order moment of the reduced SDE model and the PPM, averaged over 958 realizations. In 42 out of 1000 realizations, the predators died out before the prey, in which case the prey population grows exponentially. The results show that the reduced model is able to approximate the qualitative dynamical behavior of the PPM. Fig 8C shows a realization of the reduced SDE model.
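Fitting the drift onto a dictionary of monomials up to degree 3 amounts to a linear least-squares problem. The following sketch is our own minimal illustration, not the paper's code; it is validated on the exact Lotka-Volterra drift with all rate constants set to 1 (an assumption made only for this test).

```python
import numpy as np
from itertools import combinations_with_replacement

def monomial_dictionary(X, degree):
    """Evaluate all monomials up to `degree` on data X of shape (m, d)."""
    m, d = X.shape
    cols = [np.ones(m)]
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), deg):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

def fit_drift(X, B, degree=3):
    """Least-squares fit of pointwise drift estimates B (m, d) onto the dictionary."""
    Psi = monomial_dictionary(X, degree)
    coef, *_ = np.linalg.lstsq(Psi, B, rcond=None)
    return coef

# Validation on the exact Lotka-Volterra drift b(x) = (x1 - x1 x2, x1 x2 - x2).
rng = np.random.default_rng(2)
X = rng.uniform(0.1, 2.0, size=(200, 2))
B = np.column_stack([X[:, 0] - X[:, 0] * X[:, 1],
                     X[:, 1] * X[:, 0] - X[:, 1]])
coef = fit_drift(X, B)
B_hat = monomial_dictionary(X, 3) @ coef
```

Since the true drift lies in the span of the dictionary, the fit recovers it exactly up to numerical precision; for noisy pointwise estimates, the same regression yields the best approximation in the least-squares sense.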

Conclusion
In this work, we showed how the Koopman generator can be used to obtain coarse-grained stochastic models from aggregate state data of agent-based dynamics. We demonstrated the procedure for two different ABMs, namely a voter model and a predator-prey model. The ABM codes used for generating the results presented in this paper can be found at https://github.com/Henningston/ABMs.
In the first case, we considered complete and clustered interaction networks of homogeneous agents, such that each agent can interact at any time with all other agents (or with all agents within its cluster, respectively). We showed that under certain conditions the reduced models agree with their respective SDE limit models. In both considered cases, we showed that the data-driven reduced models are suitable for predictions. The results of Section 5.1 showed that when considering incomplete, clustered networks, aggregation of state variables led to an approximation error in the population state model. As a consequence, the data-driven model and its SDE approximation agreed only for short time intervals, see Fig 6A. The experiments also showed that the number of agents per cluster needs to be large enough or, alternatively, the connectivity between the clusters high enough for the data-driven coarse-grained model and the SDE model (14) to agree, see Fig 6B. First experiments showed that for networks with an arbitrary structure the prediction horizon can be shorter, which implies that if the state of an ABM depends strongly on the spatial structure (e.g., formation of clusters, coexistence, or spatial heterogeneity), this needs to be taken into account, see Fig 7. For the second model, the predator-prey system, we showed in Section 5.3 that it is also possible to identify a reduced model for an ABM that is not bound to interaction networks and whose time step is comparably large (i.e., not close to zero as in the first case). The reduced model is able to capture the qualitative behavior.
Our approach is limited to ABMs whose aggregated dynamics can be meaningfully represented by ODEs or SDEs. It might fail if spatial interaction, or interaction with the space itself, has a strong influence on the behavior of the agents and therefore on the outcome of the model.
In general, our approach relies on the assumption that all types of agents are present in sufficient numbers. If the number of agents (more generally speaking, the size of the system) is large enough, it is known that the SDE accurately approximates the chemical master equation [15]. However, there exist cases where the SDE fails to capture the behavior of a discrete ABM, most notably noise-induced metastability: bi- or multistability that stems from the discreteness of the system, i.e., that arises when the size of the system is not large enough [49,50]; see also Figs 1B and 5A for systems with small and large numbers of agents, respectively.
Additionally, the approach relies on accurate pointwise estimates of the drift and diffusion terms. Inaccurate or insufficient estimates lead to non-sparse solutions of the generator approximation. Additional techniques like iterative hard thresholding or denoising might be applied to improve the results, see [33] and references therein.
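Iterative hard thresholding in this setting can be as simple as a sequentially thresholded least-squares loop. The sketch below is our own illustration of the idea for a one-dimensional right-hand side, not the specific scheme of [33]: coefficients below a threshold are zeroed and the remaining columns are refit.

```python
import numpy as np

def stls(Psi, b, thresh, iters=10):
    """Sequentially thresholded least squares: promote sparsity by
    repeatedly zeroing coefficients below `thresh` and refitting on
    the surviving dictionary columns."""
    coef, *_ = np.linalg.lstsq(Psi, b, rcond=None)
    for _ in range(iters):
        small = np.abs(coef) < thresh
        coef[small] = 0.0
        big = ~small
        if big.any():
            coef[big], *_ = np.linalg.lstsq(Psi[:, big], b, rcond=None)
    return coef

# A sparse ground truth (columns 1 and 4) is recovered despite mild noise.
rng = np.random.default_rng(3)
Psi = rng.standard_normal((200, 6))
b = 2.0 * Psi[:, 1] - 3.0 * Psi[:, 4] + 0.01 * rng.standard_normal(200)
coef = stls(Psi, b, thresh=0.5)
```

The threshold trades off sparsity against fidelity; for noisy pointwise drift estimates, it suppresses spurious small coefficients in the generator approximation.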
Our approach to obtain data-driven coarse-grained models from agent-based dynamics opens up new possibilities for further analysis and has the potential to reduce the numerical effort when investigating ABMs. In addition to parameter optimization or sensitivity analysis, which are often infeasible due to the complexity of the ABM, the reduced model can also be used to find control schemes to steer the system to a desired state. More precisely, the reduced model can be used to find, e.g., harvesting schedules for systems like the predator-prey models or to develop strategies to persuade agents to change their opinion (e.g. electoral or commercial campaigns, or use of green technology). Future research will address the control of ABMs using data-driven reduced models.