Memory-Based Reduced Modelling and Data-Based Estimation of Opinion Spreading

We investigate opinion dynamics based on an agent-based model and are interested in predicting the evolution of the percentages of the entire agent population that share an opinion. Since these opinion percentages can be seen as an aggregated observation of the full system state, the individual opinions of each agent, we view this in the framework of the Mori–Zwanzig projection formalism. More specifically, we show how to estimate a nonlinear autoregressive model (NAR) with memory from data given by a time series of opinion percentages, and discuss its prediction capacities for various specific topologies of the agent interaction network. We demonstrate that the inclusion of memory terms significantly improves the prediction quality on examples with different network topologies.


Introduction
Political opinion polls capture how the opinions of people within a society regarding a certain topic or their current voting preferences are distributed.Individual opinions do not have to be constant but rather are subject to change induced by impactful events or the opinions of their peers which is formalized under the term conformity in [Sta15].There have been recent advances in simulating the process in which members of a society change their opinions, see, e.g., [BLA11, KLT07, Mis12, LBW + 12, NKB08, BG12, BCN17] and the review articles 1 arXiv:2006.08257v1[math.DS] 15 Jun 2020 [AY19,XWX11,CFL09,SLST17].This is in part due to increasing computing power which enables to carry out agent-based models that simulate behaviour of members of a synthetic population, such as members of a society, on the microscale by emulating the decision-making rules.The agents are often treated as the nodes of a network while an edge between two nodes means that these agents are neighbors of each other and thus influence each other's respective opinions.
One is often not interested in modelling, or predicting, which person has which opinion but rather, as in polls, what the percentage of each opinion within the society is.There is ample interest in deriving dynamics for the evolution of these percentages.
In this article, we will present a framework which identifies the governing equations for the dynamics of opinion percentages for different types of networks, more exactly, how the governing equations can be inferred from data on the opinion percentages.To this end, we will emulate the decision making process with a simple agent-based model (ABM) that is based on the assumption of conformity and inspired by the ABM in [Mis12].Introductions into agent-based modelling in general can be found in [JSW98] and [LJMR09] and specifically into agent-based models for opinion dynamics in [Ban16].We investigate the cases of complete and incomplete interaction networks.More precisely, we will consider complete networks where every agent interacts with all others, as well as incomplete networks in which there are subcommunities within the society that have few links between each other.As we will show, in the case of a complete network, one can identify a Markovian model for the dynamics of the opinion percentages using standard well-mixedness arguments known from the mean-field approaches or population limits, e.g., for predator-prey models [Ber92].However, arguments used for that case do not hold up in cases when the network is not complete.We will show how to use information from the past (memory) via a kind of delay embedding of the dynamics to describe the evolution of opinion percentages in the general case.
The exact reason for the inclusion of memory will formally by derived in Section 2 by using the Mori-Zwanzig formalism [Zwa01, LL19,CHK02].Inspired by problems in statistical physics, the Mori-Zwanzig formalism explains how in the case of only low-dimensional observations of a high-dimensional system being available, the evolution of these observations of the full system can be obtained by replacing the missing information of the full system by past information of these available observations.This is in light of the result of Takens [Tak81] that states that, under fairly generic assumptions, the delay embedding of the dynamics of an observable is diffeomorphic to the dynamics of the full system.
There are various techniques for the modelling of time-discrete dynamical systems which involve the memory of the system.An intuitive approach is comprised by Higher-order Markov Models [Raf85,Tuy18].These models are defined by transition probabilities between discrete states where each state represents a sequence of cells of a discretization of the state space with a given length.Although these models can be powerful in investigating the longterm behaviour of the process by means of Markov State Models for Markovian processes [BPN14], they yield two problems: The loss of accuracy obtained from the discretization and an exponentially increasing number of states with increasing length of the sequences and number of grid cells.
Another example is simplex projection as in [SM90] where, using Takens' result, subsequent states of a system are predicted from relative next steps of similar patterns as its recent history.A younger modelling technique is Long short-term memory neural networks (LSTMs) [HS97,PD18] which is a subclass of recurrent neural networks and specifically designed for prediction of time series for which past information is vital.However, both these techniques provide little to no understanding the dynamical rules of the system: Simplex projection does not produce any model or dynamical law but rather uses a procedure similar to the nearest neighbors classification algorithm (see, e.g., [DGL13]).LSTMs, as most neural networks, typically have far too many parameters to admit interpretability.An additional means for forecasting of memory-dependent dynamical systems is the well-known class of autoregressive (AR) models [BD91], which describes the evolution of a system by a linear combination of its most recent states.Additionally, there exist variants of these AR models that are sparse [DZZ12, FSG + 07] or nonlinear [Bil13] or comprise both aspects in application to a Singular Value Decomposition of a data matrix [BBP + 16].As we will see, linear (Markovian) systems cannot describe the evolution of opinion percentages even in the simplest case, but simple polynomial terms are sufficient for fully connected networks.We shall address this point with nonlinear AR (NAR) models, as derived through the Mori-Zwanzig formalism.
The novelty in our work lies in the usage of such NAR models, estimated from data, to describe the evolution of opinion percentages.Theoretical justification comes from the Mori-Zwanzig formalism applied to the special (clustered) network topologies that we consider.
Even though the models that we derive merely predict an expected evolution of a randomly evolving quantity, large-deviation type results suggest that in the large-population limit the random fluctuations vanish [BDM14,WNWS20].We will show that the prediction accuracy of the NAR models for the opinion percentages increases with larger memory depths.With this, we will deploy methods from data-driven (sparse) system identification-as in Dynamic Mode Decomposition [SS08, TRL + 14, JSN13] or Sparse Identification of Nonlinear Dynamics [BPK16a]-to the field of opinion dynamics.
In Section 2 we start with recalling the derivation of NAR models for the evolution of observations through the Mori-Zwanzig formalism.In Section 3, we will present a method that detects coefficients of functions in these NAR models.This method will be demonstrated on a toy example derived from the Hénon map in Section 4, and used to increase the accuracy of prediction of opinion percentages in the case of incomplete networks in Section 5.

Derivation of a nonlinear autoregressive model using the Mori-Zwanzig formalism
Below, we will model the spread of opinions inside a closed society by an agent-based model.
It will consist of a high number N of agents who change their opinions X i , i = 1, . . ., N , within a finite set of M possible opinions over discrete time steps according to a rule that is based on the opinions of themselves and other agents.This rule will be Markovian, or memory-free, i.e., the changes of opinions are only influenced by opinions in the current time step.These dynamics will be called the microdynamics.The state of the microdynamics at time t is denoted by X t = ((X t ) 1 , . . ., (X t ) N ) T .The respective state space is denoted by X, and has cardinality |X| = M N .We will only be able to observe the percentages of opinions, i.e., the ratios of those among all agents for each of the M opinions.In this article, we are interested in identifying the dynamical rules of the evolution of the percentages of opinions, which we call the macrodynamics.Identifying the dynamics of low-dimensional observations of a higher dimensional system is a typical setup for the Mori-Zwanzig formalism [Zwa01, CHK02, LL19].We will consider a general framework for this and show how it yields a nonlinear autoregressive model [Bil13] for the macrodynamics.Later on we show how it can be applied to the specific case of the spread of opinions.

The setting: Microdynamics and observations
First we assume that the microdynamics is Markovian (memory-free) and deterministic.We consider the dynamical system F : X → X that governs the microdynamics (2.1) Further, we denote the space of observations of the microdynamics (observables) by Y ⊆ R m and the set of functions that map states of the dynamical system (2.1) to Y by Let us denote the accessible, or relevant, variables by x = ξ(X) ∈ Y. Hence, ξ ∈ G is a fixed observable.We suppose from here on that we do not have knowledge of the state of the microdynamics at any point in time but instead have the value of the function ξ.
Additionally, we define the subspace H of functions in G that depend only on these relevant variables and map to Y as Functions in H still depend on X ∈ X but the information of ξ(X) is enough to evaluate them.When we write h(x) for x ∈ Y, we actually mean h(x) where h is the function in H with the property that h = h • ξ.An example is In this case it is enough to know the value of ξ(X) to evaluate h(X).

Approximating observations with projections
The goal is now to represent the evolution of a function from G under the microdynamics with knowledge only about ξ but not of the state of the microdynamics X t .As illustrated in the following diagram, instead of taking one step of the microdynamics and then evaluating a function g ∈ G, we only have access to the observation ξ(X) and want to evaluate g(F (X)) under the premise that ξ(X) = x.
This means, we want to install a dynamical system that is dependent only on ξ(X t ) but approximates all observations of the microdynamics as well as possible.
A simple example would be: If we only know X 1 + X 2 , how do we compute X 2 1 + X 2 ?In contrast to h in the above example, g / ∈ H.
If an exact representation is not possible, we define an approximation that depends only on ξ(X).To this end, we define a projection operator P : G → H that maps a function depending on X to a function depending on ξ(X).We additionally define its complement Q := Id − P .We assume from now on that the microdynamics are stationary with an Finvariant probability distribution µ over X, so that when asking what g(X) is, we assume that X is distributed by µ.Given this, a natural candidate for P would be the conditional expectation with respect to µ, which is more precisely defined in Appendix A.4.
This term represents exactly the question we should ask: What do we expect g(X) to be if we know that ξ(X) = x?However, the integrals in (2.3) are often infeasible if we do not know µ.We hence follow [LL19] until the end of Section 2.3 and define P as the orthogonal projection onto the span of a set of linearly independent functions from H.This set consists of the columns of ϕ = [ϕ 1 , . . ., ϕ L ]: where x ∈ Y and the scalar product •, • is defined for matrix-valued functions f : X → R m×a and g : which itself is matrix-valued.The term ϕ, ϕ is a mass matrix that ensures that P is in fact an orthogonal projection.This orthogonal projection has the property that P g is the closest function in span(ϕ) to g where closeness is measured by the scalar product, i.e., P g minimizes g − P g, g − P g = X (g − P g) T (X)(g − P g)(X)dµ(X), which is the expected quadratic difference between g and P g.
Note that if H is infinite-dimensional, one would need an infinite number of functions to yield that span(ϕ) = H.In this case the projection formalism is well defined if H is closed.
In practice, in this case for the computation that will follow one would choose a sufficiently rich finite set of functions so that span(ϕ) covers those parts of H that are of interest.

Mori-Zwanzig representation of the macrodynamics
We will now show how to represent the evolution of a function in H over time.With the Koopman operator [Koo31] K for the system (2.1), defined as the operator that maps a function g ∈ G to g • F ∈ G, we consider the Dyson formula (2.5) The Dyson formula describes a way to iteratively split up the application of the Koopman operator to a function g into parts P Kg and QKg.Equation (2.5) yields, by application of both sides of the equation to ξ and evaluation at the initial value X 0 of the macrodynamics, that (2.6) We can replace X t−k by x t−k in the last step because the application of P to a function makes this function depend only on the relevant variables.We explicitly used the parentheses around the operator P Kρ k and its equivalent formulations to indicate that P is a projection operator that works on the function Kρ k .
Since ρ 0 = ξ, we obtain that P (ρ 0 • F ) = P (ξ • F ).This is usually referred to as the optimal prediction term since it is the best Markovian approximation of ξ(X t+1 ), i.e., the best approximation of ξ(X t+1 ) that only uses ξ(X t ).The sum in the last row of (2.6) starting at k = 1 is referred to as the memory terms, since these terms use information from previous values of ξ(X).The term ρ t+1 (X 0 ) depending on the full state X 0 and not on the projection ξ(X 0 ), is often called noise, because one does not have explicit access to it and can often only treat it as a stochastic influence.1 In total, the last row of (2.6) is called the Mori-Zwanzig equation.
Substituting the definition of P as the orthogonal projection onto basis functions as in (2.4), we obtain )dµ(X).Finding a suitable approximation of the non-accessible noise term ρ t+1 (X 0 ) in (2.6) is generally a non-trivial task and depends on properties of the microdynamics.Examples are discussed in [LC17,HEVEDB10,KDG15].From this point onwards, we will make the simplification of replacing ρ t+1 (X 0 ) by a zero-mean stochastic noise term ε t+1 ∈ R m .A typical practice is to let ε t+1 be a zero-mean Gaussian random variable as, e.g., in [LL19,LBL16].
With this, we obtain the macrodynamics (2.8) In the form of the diagram (2.2), through the Mori-Zwanzig formalism, we get (2.9)

Macrodynamics as a nonlinear autoregressive process
If it is reasonable to assume a sufficiently fast decay of the terms h k with increasing k, the memory terms that lie far in the past have negligible influence [HHSN07, VVR17, CHK00 As a nonlinear autoregressive model with memory depth p we define a dynamical system of the form [Bil13,AH96] x t+1 = f (x t , . . ., x t−p+1 ) + ε t+1 .
A common structure for f is (2.10) with matrix-valued coefficients H i and a vector-valued function f .
A special case is the well-known family of linear auto-regressive models [BD91], for which f is linear, Note that Equation (2.8) does not have the form of NAR models as (2.10), since the coefficients h k are vector-valued and the basis functions ϕ are matrix-valued, opposed to having matrix-valued coefficients and vector-valued basis functions (Figure 1).It turns out that (2.10) is in fact a special case of the formulation we have derived here because by choosing scalar-valued functions φ1 , . . ., φL and defining where h i now denotes the i-th coordinate of a vector h ∈ R mL .
Since in Section 3 we will introduce a method that identifies matrix-valued coefficients for NAR models in a way that is motivated by system identification methods such as Dynamic Mode Decomposition [WKR14,TRL + 14], Extended Dynamic Mode Decomposition [WKR14] or Sparse Identification of Nonlinear Dynamics [BPK16a,BPK16b], we use the formulation of (2.8) which reads (2.12)

Stochastic microdynamics
Let us consider stochastic dynamics where ω t ∈ Ω is a random influence on F which is now defined as F : X × Ω → X.We will assume that the noise process ω t , t ∈ N, is i.i.d. with law P.In this case we only strive to forecast the expected macrodynamics, and define the (stochastic) Koopman operator as The spaces G and H, just as the projection P remain unchanged.Naturally, to the derivation of the Mori-Zwanzig approximation we need to apply the necessary obvious modifications.E.g., the last step in (2.6) now has to be modified as: We can thus obtain the identical structure of the macrodynamics as in (2.8) where for the computation of the coefficients h k in (2.7) the expectation with respect to P had to be added.

Models (SINAR)
We propose here a method of data-based identification for coefficients H k in (2.8) that is an extension of the Sparse Identification of Nonlinear Dynamics (SINDy) algorithm from [BPK16a,BPK16b,KKB18].SINDy can be used to identify the governing equations of a Markovian-in our case, discrete time-dynamical system We will extend this method to Non-Markovian systems by applying SINDy to an extended version of X, the Hankel matrix In essence, this is the concept used for the Hankel-alternative view of Koopman (HAVOK) analysis from [BBP + 16], which identifies a linear autoregressive model on transformed coordinates obtained from a Singular Value Decomposition of the Hankel matrix to separate linear from non-linear, or even chaotic, behaviour of a Markovian system.In this section and by the choice of the name SINAR, we explicitly want to point out the connection of system identification methods for non-linear Markovian systems to their counterparts for non-linear Non-Markovian systems (with finite memory these are NAR systems) that can be derived through the Mori-Zwanzig formalism from Section 2.

SINDy: A short summary
We start with a short description of SINDy: In SINDy we try to approximate each coordinate of f by a linear combination of basis functions θ i : R m → R and define To this end, we fit a coefficient matrix Ξ ∈ R m×v to the data X, X by finding In order to only obtain the basis functions from Θ that are dominant for the relation between x t+1 and Θ(x t ), we enforce a sparsity constraint using the LASSO regression algorithm [Tib96] in which a regularisation term is added onto the coefficient matrix.Note that (3.2) decomposes to independent minimization problems for the rows of Ξ.For every row X i of X , we solve The use of the 1-norm generates that the solution will be sparse if we set λ > 0 appropriately.
Sparse models will often times be less accurate than non-sparse models.However, what we gain through a sparse right-hand side of (3.3) is a better interpretability of the model since only the dominant terms have been identified as influential to the dynamics.It is vital to set λ so that the loss of accuracy is minimal compared to the gain in interpretability.
SINDy is related to the (first step of) the method of Dynamic Mode Decomposition (DMD) [WKR14, TRL + 14], which aims at finding a linear connection between x t and x t+1 .
To this end, one solves In a second step, DMD then uses Ξ from (3.4) to uncover properties of the Koopman operator of the system.SINDy, instead, tries to explain the evolution of x t by basis functions that do not have to be linear.Still, essentially, the problem (3.5) is equivalent to (3.4) for Θ(x) = x and λ = 0. Further, there exists a sparse version of DMD [JSN13], where the sparsity constraint is enforced by the additive 1-norm regularisation as in (3.4).Then the emerging minimization problem is the same as (3.4) with Θ(x) = x.

Extending SINDy to SINAR
Now, when we suspect that in the dynamical system (3.1)x t+1 depends not only on x t but on memory terms, too, we can apply the SINDy algorithm to suitably transformed data to obtain a nonlinear autoregressive model as in (2.12) with sparse coefficients, i.e., with only few basis functions with non-zero coefficients: Selecting a memory depth p and denoting let us define as data matrices the Hankel matrix (3.6) Again we choose basis functions , and minimize for every row of Then with the basis functions with non-zero coefficients in Ξ ∈ R m×v , we have derived a nonlinear autoregressive model that approximates the evolution of x: By deleting all columns of Ξ that only contain zeros, which should be many if we enforce the sparsity constraint, we get a reduced matrix and thus a low number of terms on the right-hand side of (3.8).We have thus identified a sparse nonlinear autoregressive model so that we call this extension of SINDy Sparse Identification of Nonlinear Autoregressive Models (SINAR).Note that for a memory depth of p = 1, SINDy and SINAR are equivalent.Figure 2 illustrates the different structures of SINDy and SINAR.
By choosing with φ1 , . . ., φL being scalar-valued functions as in Equation (2.11) in Section 2, we can directly fit the coefficients H k of the model (2.12) which was derived through the Mori-Zwanzig formalism previously.Then Ξ has the blockwise form Of course, by choosing linear basis functions Θ(x t ) = xt and setting λ = 0, one obtains a wellknown linear autoregressive model [BD91].Except for the sparsity term, the determination of model coefficients as in (3.7) is exactly the Least Squares method commonly used for the linear AR models.
The covariance of the noise term ε t+1 in (2.12) can be estimated in the common way for linear or nonlinear AR models [BD91,LL19] by calculating the statistical covariance between X and Ξ Θ(X) (see Appendix A.6 for more details on both statements).
The following diagram sketches how system identification methods from different contexts are related.With DMD, SINDy, SINAR and AR models in mind, one can observe that in all of them, a minimization problem of the same form is solved: Given are data matrices X and X which contain contain data points of the realisation of a, possibly memory-exhibiting, dynamical system that are shifted from each other by one time step.Then one tries to find a connection between both through a transformation of X which is multiplied with a coefficient matrix by solving (omitting possible sparsity constraints) In DMD, one tries to find a linear, Markovian, connection between x t and x t+1 , i.e.Θ(x) = x.
In SINDy, X is transformed in a possibly nonlinear way in order to explain the evolution of systems for which a linear model might be inaccurate.Linear AR models look for a linear connection between a fixed number of past values of the system and its next value.The columns of X, in this case, contain not just data points of the system but sequences of data points of a fixed length.In this way, AR models can be seen as a form of delay-embedded DMD.SINAR brings together SINDy and AR models in the sense that one seeks a possibly nonlinear connection between past values of the system and subsequent ones.We will now demonstrate the emergence of memory terms in the case of inaccessible variables in the sense of the Mori-Zwanzig formalism by means of an example of a dynamical system and use SINAR to detect an NAR model that reconstructs the dynamics.
The classical Henon system [Hé76] describes a two-dimensional system that is one of the most famous examples for systems with chaotic behaviour, i.e., where slightly deviated initial conditions lead to a significantly different trajectory.The dynamical system is given by where a, b are fixed parameters.As we can observe, y t is nothing more than a scaled and time-delayed version of x t .We now consider x as the relevant and y as the irrelevant variable; this means in the Mori-Zwanzig formalism the space H is given by all functions depending on only x.We can then still express the evolution of x exactly with dependence on the past two values of x by plugging in the equation for y t+1 into the equation for x t+1 : Let us now consider an extended version of the Henon system whose dynamical behaviour is visualized in Figure 3.only in dependence of its own past terms and without values of y then we do not get a system with a finite memory depth but with an infinite one: which can be quickly shown by induction on t.
We have hereby derived an equation of the form of the Mori-Zwanzig equation (2.6) for this simple example: The term 1 − ax 2 t is the optimal prediction, i.e., the Markovian approximation using the relevant variables x t .The sum t j=1 c j−1 bx t−j contains the memory terms depending on past values of x and the term c t y 0 is the noise term with information about the irrelevant, or for us inaccessible, variable y.

Reconstructing the extended Henon system with SINAR
We now apply the SINAR algorithm to data originating from a trajectory of the extended Henon system and demonstrate the increase in performance by using memory terms compared to applying the usual Markovian SINDy algorithm.
We set as parameters a = 1.3, b = 0.3, c = 0.3 and initial values x 0 = y 0 = 0.Then, for example, the exact model up to a memory depth of 3 in Equation (4.2) is As basis functions we choose monomials of the time-delayed coordinates up to second order without mixed terms between different delays, .

Short-term predictions
We now generate a trajectory of length T = 2000 out of which we erase the first 1000 steps Although all coefficients are recovered up to an error of smaller than 10 −14 when we use 800 time steps for training, the reconstruction becomes inaccurate after around 100 time steps which underlines the strongly chaotic nature of the system, i.e., small deviations at one point in time causing significant deviations in the long term behaviour.We thus use 920 time steps for training and only 80 time steps for validation to investigate how the relative Euclidean reconstruction error depends on the memory depth.Below we discuss how the attractor of the system is recovered using much longer reconstructions.
We see in Figure 4 how the relative Euclidean prediction error decreases for increasing memory depth p. Predicted was the evolution of x with data about x.It is interesting to note how large a memory depth is necessary to get an accurate prediction for x when c = 0.3 (Figure 4 (left)).The chaotic nature of the system yields that even coefficients of the form bc j for j = 27 have to be taken into account.Of course, for smaller c such as c = 0.03, memory terms in (4.2) decay quicker and a memory depth of p = 8 is sufficient to yield an accurate prediction as can be seen in Figure 4 (right).For the full system (x, y), the system is Markovian and the prediction error is unsurprisingly very small even for p = 1.

Attractor reconstruction
Although large deviations between original and reconstructed trajectories of x t occur after around 100 time steps, both trajectories remain on roughly the same set of points.We quantify this by the Hausdorff distance between the two-dimensional delay embeddings (see definition in Appendix A.7) of the original trajectory and each reconstructed trajectory.The Hausdorff distance denotes the maximal minimal distance of members of one set of points to another set.In other words, the Hausdorff distance between two sets is 0 if the sets are equal and big if there is a point in one set which is far away from all points in the other set.
We make predictions of 3000 time steps based on coefficients that were obtained with SINAR on data of 1000 time steps.In Figure 6 are depicted the two-dimensional delay embeddings of the original trajectory of x and the reconstructed trajectories for p = 1, 2, 5, 10 and p = 30.There we see how already for p = 2 the original and reconstructed attractors look much more similar compared to p = 1. Figure 5 shows the Hausdorff distances for different memory depths.Similar to the relative Euclidean prediction error, the distance decreases with increasing p.The remaining error is due to the fact that the complicated geometry of the attractor is hard to approximate uniformly well with a finite set of points.

Formulating the ABM
The ABM is given as follows: Suppose there are N agents and each agent has exactly one out of M different opinions, denoted by 1, . . ., M .The vector X t , which comes from then represents the opinions of each agent at time t and (X t ) i denotes the opinion of agent i at time t.The neighborhoods of all agents are represented by the symmetric adjacency-matrix A ∈ {0, 1} N ×N where A ij = 1 means that agents i and j are neighbors of each other and ) be the number of neighbors of an agent.The diagonal entries of A are set to 1, so that every agent is its own neighbour.
Let the procedure of opinion changing be given by the following rule: In every time step, every agent picks one of its neighbors in the network uniformly at random and changes its opinion with adaption probability α m m where m is the opinion of the agent and m is the opinion of the selected neighbour.This results in the term which we denote by p t i (m , m ).The probability for an agent not to change its opinion thus is In algorithmic form, the agent-based model is executed in the following way: To clarify the notation, remember that (X t ) i and (X t ) j denote the opinions of agents i and j at time t.Hence α (Xt) i (Xt) j is the adaption probability of opinion (X t ) j given that an agent has opinion (X t ) i .
We can now state the so-defined microdynamics by where at every time step, ω t denotes a tuple consisting of N agents that represents the chosen neighbour of each agent plus numbers u i ∼ U[0, 1] that govern the adaption probability α (Xt) i (Xt) j as in Algorithm 1.To be more precise, ω t has the form ω t = (j 1 , . . ., j N , u 1 , . . ., u N ), j i ∼ U{j : F then is given by This way of stating the microdynamics seems complicated compared to the more intuitive option of denoting by (ω t ) i the new opinion of the i-th agent, distributed by [p t i ((X t ) i , 1), . . ., p t i ((X t ) i , M )].However, this would mean that the distribution of ω t would change over time, since the p t i depend on (X t ) i .For the Mori-Zwanzig formalism, this would prevent us from applying the procedure of skew-shift systems introduced in Section 2.5 where we drew all ω t a priori and thus independently of the X t .By using the notation of ω t denoting a tuple of neighbors j i and random numbers u i that are compared to the adaption coefficients, we can draw the whole sequence of ω t independently of the X t and maintain consistency with the notation of skew-shift systems.

Deducing macrodynamics from the ABM
We now define as the opinion percentages the function and are interested in modelling how these percentages evolve over time.It turns out that for a complete network, i.e., A ij = 1 ∀i, j, we can derive macrodynamics for the expected evolution of that do not require memory terms.They are given by This equation can be derived as follows: In case of a complete network, p t i (m , m ) ≡ p t (m , m ) is independent of i because the percentages of opinions among neighbors are equal for all agents since they all have the same neighbors.Then In every time step, every agent with opinion m chooses its opinion in the next time step with respective probabilities p t (m , m ) for all opinions m = m and probability 1− m =m p t (m , m ) for keeping opinion m .Since the number of these agents is given by N • (x t ) m , the expected absolute number of agents that change their opinion from m to m is given by This is the expected absolute number of agents that change their opinion from m to m .This means, that from this term alone, the percentage (x t ) m of m is reduced by 1 N times this term, which is α m m (x t ) m (x t ) m .Since at the same time agents with opinion m can change their opinion to m with probability α m m (x t ) m (x t ) m , we have to subtract the analogous term for E[#Agents changing opinion from m to m ] and the factor (α m m − α m m ) comes in.
In consequence, for a complete network, the loss of information about X does not yield loss of information about the evolution of x.
In order to align this with the Mori-Zwanzig formalism from Section 2, this means that because we can state Kξ = E[ξ(F )] directly by using Equation (5.1) that depends only on ξ.
Let us now consider the penultimate line of Equation (2.6), where terms of the form occur.This yields for k > 0 that ρ k = (QK) k−1 (QKξ) (5.2) = 0.In this way we can see that memory terms are not required for the dynamics of ξ if the network is complete.
However, this is generally not the case for incomplete networks.An example that explicitly shows that probabilities for future states of the macrodynamics depend on past values is given in Appendix A.1.In this case, it is no longer sufficient to know ξ(X) instead of In other words, Equation (5.2) is no longer valid so that the ρ k do not vanish.In this case, by using as P the orthogonal projection onto basis functions we were able to find approximate representations of the terms P (ρ k • F ) in Equation (2.6).
With this we derived an NAR model.Here lies another part of the value of the application of the Mori-Zwanzig formalism: It installs that the structure of the ensuing macrodynamics in Equation (2.6) is additive, i.e., it can be written as a sum of transformations of individual memory terms as opposed to, e.g., containing products of memory terms.In this way, we can restrict the search for a good model of the macrodynamics to dynamics of this form.For an incomplete network, that is still sufficiently densely connected, we expect the microdynamics to be in expectation still close to that of a complete network.Thus, in such a case we expect QKξ ≈ 0, even if (5.2) does not hold exactly.Consequently, assuming dense connectedness, the opinion percentages should allow for a description of their evolution with a small memory depth.
In the following, we will use SINAR to identify NAR models of this form suggested by the Mori-Zwanzig formalism.

Recovering the macrodynamics in case of an incomplete network
We now create realisations of the ABM with networks that consist of equally sized clusters of agents.Edges between agents from different clusters exist but are few.Inside the clusters, all agents are connected with each other.To this end, we create networks of an even number N of agents and divide them into clusters of equal size.Two agents from different clusters are connected with probability p between .
From the same initial state and with the same parameters, we create multiple realisations of the form [X 0 . . ., X T ] of the ABM and deduce the percentages of opinions [x 0 , . . ., x T ] = [ξ(X 0 ), . . ., ξ(X T )].We denote the realisations of the resulting macrodynamics by X 1 , . . ., X r and divide these data into training data X 1 , . . ., X train and validation data X train+1 , . . ., X r .
Subsequently, we execute the SINAR method with different memory depths p on the training data.SINAR gives us NAR models that we use for the reconstruction of the validation data.
For this, the SINAR method can straightforwardly be modified for multiple trajectories by defining data matrices X = [X 1 , . . ., X train ] and X = [ X1 , . . ., Xtrain ] in the notation of Section 3. We then compute the reconstruction errors of the validation data for each value of p = 1, . . ., p max .For the reconstruction, we divide each realisation X i of the validation data into blocks of length l ≥ p.A block denotes l states x (j) i = [x jl , . . ., x (j+1)l−1 ] while the next block will be x We then compute a reconstruction x(j) i = [x jl , . . ., x(j+1)l−1 ] of this block with the NAR model obtained with SINAR for which we use the last p values of the previous block as starting values.As in Equation (4.3), we calculate the relative Euclidean error between reconstruction and data for each block by err(x . Afterwards, we take the mean over all err(x (j) i ) to measure the performance of the NAR model.
Since the entries of ξ(X t ) always sum up to 1, information about the percentages of opinions 1, . . ., M − 1 immediately yields the percentage of opinion M so that we use SINAR to find an NAR model for the evolution of the percentages of the first M − 1 opinions only and omit the redundant information ξ(X) M .For the reconstruction error, we compare data about the percentages of only the first M − 1 opinions with their reconstructions.This NAR model does not necessarily ensure that the predicted first M − 1 percentages stay between 0 and 1 and their sum is at most 1.Since we make short-term predictions only, however, there will at most be only slight deviations from this property.
In the form of the diagrams (2.2) and (2.9) from Section 2, the Mori-Zwanzig procedure applied to this concept can be described as

Case 1: A complete network
For p between = 1, the network is complete and there should be no improvement of the prediction by allowing memory terms.
We set N = 5000, T = 300 and A ij = 1 ∀i, j.The number of different opinions is M = 3.
As coefficients α m m we choose As initial percentages we assign values to the (X 0 ) i so that ξ(X 0 ) = (0.45, 0.1, 0.45) T .
As the block length in the validation data, we use l = 40.We can already write down the macrodynamics since they are given in Equation (5.1) (see Appendix A.2 for details): Inspired by this structure, we choose as basis functions in SINAR Markovian terms as in (5.3) , . . .
We create r = 20 realisations of which we use 12 for training and the others for validation.
We set the sparsity parameter to λ = 0 and to λ = 0.05 to test how the accuracy decreases with a sparser model.Since the macrodynamics (5.3) are Markovian, we obtain for the prediction error of the validation data no improvement by allowing memory terms (Figure 7) for neither the 40-nor the one-step prediction error.Note that the predictions with the sparse NAR model provide slightly better accuracy for large memory depths.This is, because small non-zero coefficients for memory terms improve the fit of the training data but cause errors in the prediction of the validation data, because the macrodynamics are Markovian.
Through the sparsity constraint enforced, these non-zero coefficients for memory terms are cut off.The recovered sparse macrodynamics p = 1 reads which is very close to the analytically derived macrodynamics (5.3).We can observe oscillatory behaviour since agents with opinion 1 tend to change their opinion to 2 and analogously from 2 to 3 and from 3 to 1. Bottom: 40-step and one-step relative prediction errors of the NAR models determined by SINAR for different memory depths p with λ = 0 and λ = 0.05.As expected, the prediction error does not decrease with higher memory depth than p = 1.
Case 2: A two-cluster network We now construct a network with N = 5000 agents, divided into two clusters of size 2500 each.We set p between = 0.0001.Again, M = 3 and α m m are the same as in case 1.As the starting condition, we let opinions in the first cluster be distributed by (0.8, 0.1, 0.1) and in the second cluster by (0.1, 0.1, 0.8).If the initial percentages in both clusters were equal then the percentages in both clusters would evolve in a quite similar way in parallel so that the macrodynamics would essentially be the same as in the complete network case.With the initial percentages being so different, it is possible that an opinion that is dominant in one cluster at one point in time but only sparsely represented in the other, can become popular through the links between agents from different clusters.This will cause the difference in behaviour of the evolution of percentages compared to the complete network.
Moreover, in order to derive the Markovian macrodynamics in Equation (5.1), we needed that the probabilities for an agent i to change its opinion (X t ) i at time t, which we denoted by p t t ((X t ) i , m ), be independent of i.If the neighborhoods of different agents are generally different from each other, this is not longer the case.Especially so, if agents are distributed into different clusters, where opinion percentages might be very different.Thus, we cannot derive Markovian macrodynamics for this case but, in light of the Mori-Zwanzig formalism, we will need memory terms.
To show this, we create r = 20 realisations of length T = 500 and again use 12 for training, the remaining for validation.As block length, we choose l = 20.Memory terms become immediately significant, as the error graphs illustrate (Figure 8).
For p = 1, the NAR model obtained with SINAR (λ = 0.05) is With λ = 0, the obtained NAR model has other terms with non-zero coefficients, but these are small.In Figure 9, an example for the predictions of opinion percentages in one block using the NAR models with p = 1, 2 and 10 is depicted and compared to the corresponding data.As the error graphs in Figure 8 show already, the predicted percentages come closer to the percentages in the data with increasing memory depth.Again there is oscillatory behaviour but also plateaus and short dips as in the red and green graphs at time 25 -150.This is because at these times one opinion is dominant in one cluster but not present in the other.Through the links between the clusters, an opinion, that is not present in a cluster but dominant in the other one can be revived, e.g., the blue opinion in the upper cluster.Bottom: 20-step and one-step relative prediction errors of the NAR models determined by SINAR for different memory depths p with λ = 0 and λ = 0.05.Memory terms yield a significant decrease in the prediction errors compared to Markovian predictions.
In order to illustrate why memory terms improve the prediction accuracy, let us imagine for now that there are no links between the clusters.Then the evolutions of opinion percentages in both clusters run in parallel to each other and are Markovian as derived previously.
The opinion percentages in the full network are then given by the averages of the cluster-wise percentages x t ).This means, if we know x t , then there are various options for what x (2) t−1 which themselves make some of the candidates for x (1) t and x (2) t unlikely, e.g., because they are far away from them.Thus, through the information of memory terms we can restrict the options for what the percentages inside each cluster are.We illustrate this in more detail in Appendices A.1 and A.3.
The links between the clusters have as consequence that within one cluster agents generally do not have identical opinion change probabilities since their neighborhoods are different.This yields additional need for memory terms since then not even for the macrodynamics in one cluster, a Markovian formulation can be derived.
Figure 9: Opinion percentages over one block of length 20 from the validation data and prediction evolutions with NAR models obtained with SINAR for p = 1, 2 and 10 and λ = 0. percentages from validation data are depicted with thin and predicted percentages with thick lines.With p = 1, the prediction accuracy is poor and improves drastically for p = 2.With p = 10, the predicted evolutions come even closer to the curves from the validation data.

Case 3: A five-cluster network
We repeat the same procedure as with the two-cluster network but with five clusters of equal size 1000.Again, all agents within a cluster are connected with each other and p between = 0.0001.The α m m are identical to the ones used in the first two examples.As starting conditions we let opinions in the different clusters be drawn according to different distributions for each cluster.Those distributions are (0.8, 0.1, 0.1), (0.1, 0.1, 0.8), (0.1, 0.8, 0.1), (0.3, 0.4, 0.3) and (0.5, 0.3, 0.2).The evolution of the opinion percentages is now much more irregular compared to the previous examples.The oscillatory behaviour is still present but the amplitudes differ from time to time.Through the higher number of clusters, more randomness comes into the model since an opinion can be randomly spread from one cluster, where it is dominant, to another one, where it is not dominant, suddenly altering the evolution of percentages in this cluster and thus in the whole network.
We now show that, similar to when we used a two-cluster network, memory terms become important for predictions of the evolution of the microdynamics.This can be seen in Figure 10.Again, the mean relative error per block converges with increasing p.While in the twocluster network example the performance did not improve visibly with p > 10, in this case we can get slightly lower errors for p approaching 20.
For p = 2 and λ = 0.05, we obtain the NAR model For p > 2, the models show increasing complexity, e.g., for p = 3: Again, we show as an example the predictions of percentages for one block of length 40 with memory depths 1, 2 and 10 (Figure 11).As in the example with the two-cluster network, we could see that a higher memory depth indeed increases the prediction accuracy for the evolution of the opinion percentages in the short term, i.e., for predictions of length 20 resp.40.Plus, enforcing the sparsity constraint with the parameter λ in SINAR set to 0.05 yielded significantly sparser models while the prediction accuracy only suffered slightly.

Discussion
In this article, we have summarized how the evolution of observations of a dynamical system can be derived through the Mori-Zwanzig formalism, and how this can result in a nonlinear autoregressive model with memory.For the determination of model parameters, we have used methodology from data-driven system identification methods, inspired by SINDy [BPK16a].
We could then extend SINDy to SINAR which identifies sparse NAR models from data, thus deploying a common system identification method for non-Markovian systems.
We applied this to an agent-based model (ABM) that simulates the dynamics of opinion changes in a population.Assuming that all agents are equally strongly influenced by all other agents in the population, we showed that for the prediction of the percentages of opinions within the population memory terms are not necessary.However, for incomplete networks, this is no longer the case.Our methodology enabled us to make more accurate predictions for the percentages of opinions among the agents when the population of agents was defined by clusters with little influence between them.Additionally, sparse models obtained from enforcing a sparsity constraint in the estimation of NAR models in SINAR, gave almost equally good prediction accuracy as the non-sparse ones, while yielding far simpler models.
In the context of opinion dynamics, such sparse models permit to point out more clearly which opinions impact which and how.
In the future, the following challenges are to be addressed: • In our methodology, we have assumed a noise term resulting from Mori-Zwanzig that was zero-mean.This allowed us to omit it when making predictions of the expected value of the opinion percentages.This simplifying assumption does not need to be true, and one could try to derive a more accurate representation for the noise term.As a result of this simplifiying assumption, the NAR models we considered were deterministic, even for non-deterministic microdynamics.Introduction of explicit noise in the NAR models, e.g., by extending the approach outlined in [KNP + 20], could improve their (statistical) predictive capacities.
• One could additionally choose a different projection P in the Mori-Zwanzig formalism.
The choice of an orthogonal projection on a finite set of basis functions explicitly yielded an NAR model.The right projection for a given system could inspire an optimal choice of basis functions, e.g., such that the memory depth is minimal.
• We have derived models that are stationary, i.e., do not change over time.Since the assumption of an equilibrium distribution over states of the microdynamics might not always hold, coefficients of the NAR model may become time-dependent.One could use a regime switching model as in [Hor11] that fixes coefficients for a time interval before changing them to other fixed values when the macrodynamics show certain behaviour, e.g., coefficients might be different depending on which opinion is dominating.
By summing up the columns and dividing by 3 since all options for X t−1 are assumed to be equally probable, this gives that This is unequal to the value of ≈ 0.2778 that we computed for P[x t+1 = (1, 0) T |x t = ( 2 3 , 1 3 ) T ].The additional knowledge that x t−1 = ( 2 3 , 1 3 ) T makes it slightly less probable that X t = (• − • − •) ( 13 45 instead of 1/3) which is the opinion vector that gives a lower probability that
The original trajectories and the trajectories obtained from this model are depicted in Figure 13.The one-step prediction error improves for memory depths larger than p = 2 (Figure 14).Since with NAR models obtained from the trajectories for these initial percentages, the predicted trajectories diverge, the full prediction error is not shown.In summary, for a network that consists of two clusters which are uncoupled but fully connected internally, the expected macrodynamics are given by the mean of the expected intra-cluster dynamics.Assuming the dynamics to have no variance and hence to be deterministic, given in Equation (A.1), with symmetric initial percentages, a memory depth of 2 is enough for us to generate an almost exact NAR model for the macrodynamics.However, for non-symmetric initial percentages, the ensuing best-fitting NAR models with the basis functions we use are not accurate in the long-term.This seems to be in part due to the fact, that for non-symmetric initial percentages, the trajectories show more complex behaviour which no longer consists of periodic oscillations but is rather more irregular.This could cause the best-fitting NAR models to then be dominated by linear terms.Results about to which degree one can analytically derive NAR models for both symmetric and non-symmetric initial percentages require further research.

A.7 Hausdorff distance of delay embedding of trajectories
The Hausdorff distance between two non-empty compact sets measures the maximal minimal distance a point from one set has to the other set.It is commonly used to compare attractors of dynamical systems.The lower the Hausdorff distance between two sets, the more similar they are.From two trajectories X = [x 0 , . . ., x T ] and X = [x 0 , . . ., xT , we construct the delay embeddings with embedding depth p as x p−1 . . .x − x 2 ).

Figure 2 :
Figure 2: Sketch of the SINDy algorithm (left) and SINAR (right).SINAR contains the additional step of creating a Hankel matrix.

Dynamic
4 Example: An extended Henon system

Figure 3 :
Figure 3: Trajectory of length 5000 of the two-dimensional extended Henon system (4.1) with a = 1.3, b = 0.3, c = 0.3 and initial values x 0 = y 0 = 0.The first 1000 states are omitted here so that the trajectory has time to converge towards the attractor.
to give the trajectory time to converge to the attractor.We then use the first T train data points for training and the remaining 1000 − T train for validation.With the training data, we determine coefficients Ξ for the basis functions in Θ with SINAR for different memory depths p and compute reconstructions xT train +1 , . . ., x1000 of x T train +1 , . . ., x 1000 using Equation (3.8) with initial values x T train −p+1 , . . ., x T train .In essence, we recover the coefficients of the forms a resp.c j−1 b from Equation (4.2) until j = p − 1 and recompute values of the extended Henon system with the recovered coefficients.As error measure we use the relative Euclidean prediction errorerr( X ) = X − X F X F (4.3)where X = [x T train +1 , . . ., x 1000 ] denotes data points from the original trajectory and X = [x T train +1 , . . ., x 1000 ] data points from the reconstructed trajectory.

Figure 4 :
Figure 4: Relative error of validation err( X ) for SINAR on visible variable x of the extended Henon system for two different values of c with different memory depths p on the x-axes.The prediction accuracy improves with increasing memory depth.Results based on SINAR with λ = 0.As parameters in the extended Henon system, we chose a = 1.3, b = 0.3 and c = 0.3 (left) resp.c = 0.03 (right).For every value of p, the same 80 time steps were taken into account for the reconstruction error. 2

Figure 5 :
Figure 5: Hausdorff distances between original and reconstructed attractors with 3000 points of two-dimensional delay embeddings of x for two different values of c with different memory depths p on the x-axes.Results based on SINAR with λ = 0 with parameters in the extended Henon system a = 1.3, b = 0.3 and c = 0.3 (left) resp.c = 0.03 (right).

Figure 6 :
Figure 6: Two-dimensional delay embedded attractors wih 3000 points of the extended Henon system with a = 1.3, b = 0.3, c = 0.3.Original (upper left) and reconstructed ones based on SINAR with λ = 0.For p ≥ 2, differences are difficult to see but exist as the Hausdorff distances in Figure 5 indicate.

Algorithm 1 :
Agent-based opinion change model 1 Choose end time T , number of agents N , network adjacency matrix A, opinion change coefficients α m m , initial opinions X 0 2 for t = 0, . . ., T do 3 for i = 1, . . ., N do 4 Draw j from {j : A ij = 1} uniformly at random (Choose neighbour) 5 Draw

Figure 7 :
Figure 7: Top left: One realisation of the microdynamics.Every column of the graphic represents the opinion of each of the 5000 agents at one point in time.Blue denotes opinion 1, green denotes opinion 2 and red denotes opinion 3. Top right: Corresponding realisation of the macrodynamics ξ(X) that represent the percentages of opinions among all agents.We can observe oscillatory behaviour since agents with opinion 1 tend to change their opinion to 2 and analogously from 2 to 3 and from 3 to 1. Bottom: 40-step and one-step relative prediction errors of the NAR models determined by SINAR for different memory depths p with λ = 0 and λ = 0.05.As expected, the prediction error does not decrease with higher memory depth than p = 1.

Figure 8 :
Figure 8: Top left: One realisation of the microdynamics.Colours represent opinions as in Figure7.Top right: Corresponding realisation of the macrodynamics ξ(X).Again there is oscillatory behaviour but also plateaus and short dips as in the red and green graphs at time 25 -150.This is because at these times one opinion is dominant in one cluster but not present in the other.Through the links between the clusters, an opinion, that is not present in a cluster but dominant in the other one can be revived, e.g., the blue opinion in the upper cluster.Bottom: 20-step and one-step relative prediction errors of the NAR models determined by SINAR for different memory depths p with λ = 0 and λ = 0.05.Memory terms yield a significant decrease in the prediction errors compared to Markovian predictions.
be, all of which might result in different values for x t+1 and thus x t+1 .If we are additionally given x t−1 , this might yield possible values for x (1) t−1 and x

Figure 10 :
Figure 10: Top left: One realisation of the microdynamics.Every column of the graphic represents the opinion of each of the 5000 agents at one point in time.Top right: Corresponding realisation of the macrodynamics ξ(X).The behaviour is much more complex than in the first two cases.Bottom: 20-step and one-step relative prediction errors of the NAR models determined by SINAR for different memory depths p with λ = 0 and λ = 0.05.

Figure 11 :
Figure 11: Opinion percentages over one block of length 40 from the validation data and prediction evolutions with NAR models obtained with SINAR for p = 1, 2 and 20 and λ = 0. percentages from validation data are depicted with thin and predicted percentages with thick lines.As in the example with a two-cluster network, we can see what the error graphs in Figure 10 indicate: The predicted evolutions are closer to the validation data with higher memory depth of the NAR model.

Figure 12 :
Figure 12: Original trajectories for initial percentages in (A.2) and predicted trajectories with the NAR model (A.3).

Figure 13 :
Figure 13: Original trajectories for initial percentages in (A.4) and predicted trajectories with the NAR model (A.3).

Figure 14 :
Figure 14: One-step prediction error for the NAR models obtained from trajectories of x t with initial percentages of the x (i) t as given in (A.4).