Minimising the Kullback–Leibler Divergence for Model Selection in Distributed Nonlinear Systems

The Kullback–Leibler (KL) divergence is a fundamental measure of information geometry that is used in a variety of contexts in artificial intelligence. We show that, when system dynamics are given by distributed nonlinear systems, this measure can be decomposed as a function of two information-theoretic measures, transfer entropy and stochastic interaction. More specifically, these measures are applicable when selecting a candidate model for a distributed system, where individual subsystems are coupled via latent variables and observed through a filter. We represent this model as a directed acyclic graph (DAG) that characterises the unidirectional coupling between subsystems. Standard approaches to structure learning are not applicable in this framework due to the hidden variables; however, we can exploit the properties of certain dynamical systems to formulate exact methods based on differential topology. We approach the problem by using reconstruction theorems to derive an analytical expression for the KL divergence of a candidate DAG from the observed dataset. Using this result, we present a scoring function based on transfer entropy to be used as a subroutine in a structure learning algorithm. We then demonstrate its use in recovering the structure of coupled Lorenz and Rössler systems.


Introduction
Distributed information processing systems are commonly studied in complex systems and machine learning research. We are interested in inferring data-driven models of such systems, specifically in the case where each subsystem can be viewed as a nonlinear dynamical system. In this context, the Kullback-Leibler (KL) divergence is commonly used to measure the quality of a statistical model [1][2][3]. When a model is compared with fully observed data, computing the KL divergence can be straightforward. However, in the case of spatially distributed dynamical systems, where individual subsystems are coupled via latent variables and observed through a filter, the presence of hidden variables renders typical approaches unusable. We derive the KL divergence in such systems as a function of two information-theoretic measures using methods from differential topology.
The model selection problem has applications in a wide variety of areas due to its usefulness in performing efficient inference and understanding the underlying phenomena being studied. Dynamical systems are an expressive model characterised by a map that describes their evolution over time and a read-out function through which we observe the latent state. Our research focuses on the more general case of a multivariate system, where a set of these subsystems are distributed and unidirectionally coupled to one another. The problem of inferring this coupling is an important multidisciplinary study in fields such as ecology [4], neuroscience [5,6], multi-agent systems [7][8][9], and various others that focus on artificial and biological networks [10].
We represent such a spatially distributed system as a probabilistic graphical model termed a synchronous graph dynamical system (GDS) [11,12], whose structure is given by a directed acyclic graph (DAG). Model selection in this context is the problem of inferring directed relationships between hidden variables from an observed dataset, also known as structure learning. A main challenge in structure learning for DAGs is the case where variables are unobserved. Exact methods are known for fully observable systems (i.e., Bayesian networks (BNs)) [13]; however, these are not applicable in the more expressive case when the state variables in dynamical systems are latent. The main focus of this paper is to analytically derive a measure for comparing a candidate graph to the underlying graph that generated a measured dataset. Such a measure can then be used to solve the two subproblems that comprise structure learning, evaluation and identification [14], and hence find the optimal model that explains the data.
For the evaluation problem, it is desirable to select the simplest model that incorporates all statistical knowledge. This concept is commonly expressed via information theory, where an established technique is to evaluate the encoding length of the data, given the model [1,15,16]. The simplest model should aim to minimise code length [2], and therefore we can simplify our problem to that of minimising KL divergence for the synchronous GDS. Using this measure, we find a factorised distribution (given by the graph structure) that is closest to the complete (unfactorised) distribution. We first analytically derive an expression for this divergence, and build on this result to present a scoring function for evaluating candidate graphs based on a dataset.
The main result of this paper is an exact decomposition of the KL divergence for synchronous GDSs. We show that this measure can be decomposed as the difference between two well-known information-theoretic measures, stochastic interaction [17,18] and collective transfer entropy [19]. We establish this result by first representing discrete-time multivariate dynamical systems as dynamic Bayesian networks (DBNs) [20]. In this form, both the complete and factorised distributions cannot be directly computed due to the hidden system state. Thus, we draw on state space reconstruction methods from differential topology to reformulate the KL divergence in terms of computable distributions. Using this expression, we show that the maximum transfer entropy graph is the most likely to have generated the data. This is experimentally validated using toy examples of a Lorenz-Rössler system and a network of coupled Lorenz attractors (Figure 1) of up to four nodes. These results support the conjecture that transfer entropy can be used to infer effective connectivity in complex networks. thus there is a dearth of work that provides formal derivations for the use of this measure for inferring effective connectivity. Our work allows us to compute scoring functions directly from multivariate time series (as in functional connectivity), yet still assumes an implicit model (albeit with weaker assumptions on the model than those considered in inferring effective connectivity).

Notation
We use the convention that (·) denotes a sequence, {·} a set, and · a vector. In this work, we consider a collection of stationary stochastic temporal processes Z. Each process Z i comprises a sequence of random variables (Z i 1 , . . . , Z i N ) with realisation (z i 1 , . . . , z i N ) for countable time indices n ∈ N. Given these processes, we can compute probability distributions of each variable by counting relative frequencies or by density estimation techniques [42,43]. We use bold to denote the set of all variables, e.g., z n = {z 1 n , . . . , z M n } is the collection of M realisations at index n. Furthermore, unless otherwise stated, X i n is a latent (hidden) variable, Y i n is an observed variable, and Z i n is an arbitrary variable; thus, Z n = {X n , Y n } is the set of all hidden and observed variables at temporal index n. Given a graphical model G, the p i parents of variable Z i n+1 are given by the parent set . . , z i n−k+1 denote the vector of k previous values taken by variable Z i n .

Representing Distributed Dynamical Systems as Probabilistic Graphical Models
We are interested in modelling discrete-time multivariate dynamical systems, where the state is a vector of real numbers given by a point x n lying on a compact d-dimensional manifold M. A map f : M → M describes the temporal evolution of the state at any given time, such that the state at the next time index x n+1 = f (x n ). Furthermore, in many practical scenarios, we do not have access to x n directly, and can instead observe it through a measurement function ψ : M → R M that yields a scalar representation y n = ψ(x n ) of the latent state [22,44]. We assume the multivariate system can be factorised and modelled as a DAG with spatially distributed dynamical subsystems, termed a synchronous GDS (see Figure 2a). This definition is restated from [12] as follows. Figure 2. Representation of (a) the synchronous GDS with two vertices (V 1 and V 2 ), and (b) the rolled-out DBN of the equivalent structure. Subsystems V 1 and V 2 are coupled by virtue of the edge X 1 n → X 2 n+1 .

Definition 1 (Synchronous GDS
The global dynamics and observations can therefore be described by the set of local functions [12]: where υ f i and υ ψ i are additive noise terms. The subsystem dynamics (1) are a function of the subsystem state x i n and the subsystem parents' state x ij n j at the previous time index, i.e., f i : (M i × j M ij ) → M i . However, the observation y i n+1 is a function of the subsystem state alone, i.e., ψ i : M i → R. We assume that the maps { f i } and {ψ i }, as well as the graph G, are time-invariant.
The discrete-time mapping for the dynamics (1) and measurement functions (2) can be modelled as a DBN in order to facilitate structure learning of the graph [12] (see Figure 2b). DBNs are a probabilistic graphical model that represent probability distributions over trajectories of random variables (Z 1 , Z 2 , . . .) using a prior BN and a two-time-slice BN (2TBN) [45]. To model the maps, however, we need only to consider the 2TBN B = (G, Θ G ), which can model a first-order Markov process p B (z n+1 | z n ) graphically via a DAG G and a set of conditional probability distribution (CPD) parameters Θ G [45]. Given a set of stochastic processes (Z 1 , Z 2 , . . . , Z N ), the realisation of which constitutes the sample path (z 1 , z 2 , . . . , z N ), the 2TBN distribution is given by To model the synchronous GDS as a DBN, we associate each subsystem vertex V i with a state variable X i n and an observation variable Y i n . The parents of subsystem V i are denoted Π G (V i ) [12]. From the dynamics (1), variables in the set Π G (X i n+1 ) come strictly from the preceding time slice, and additionally, from the measurement function (2), Π G (Y i n+1 ) = X i n+1 . Thus, we can build the edge set E in the GDS by means of the edges in the DBN [12], i.e., given an edge X i n → X j n+1 of the DBN, the equivalent edge V i → V j exists for the GDS. The distributions for the dynamics (1) and observation (2) maps of M arbitrary subsystems can therefore be factorised according to the DBN structure such that [12] The goal of learning nonlinear dynamical networks thus becomes that of inferring the parent set Π G (X i n ) for each latent variable X i n . Finally, recall that the parents of each observation are constrained such that Π G (Y i n+1 ) = X i n+1 . As a consequence, we use the shorthand notation y ij n to denote the observation of the j-th parent of the i-th subsystem at time n (and the same for x ij n ).

Network Scoring Functions
A number of exact and approximate DBN structure learning algorithms exist that are based on Bayesian statistics and information theory. We have shown in prior work how to compute the log-likelihood function for synchronous GDSs. In this section, we will briefly summarise the problem of structure learning for DBNs, focusing on the factorised distribution (3).
The score and search paradigm [46] is a common method for recovering graphical models from data. Given a dataset D = (y 1 , y 2 , . . . , y N ), the objective is to find a DAG G * such that where g(B:D) is a scoring function measuring the degree of fitness of a candidate DAG G to the data set D, and G is the set of all DAGs. Finding the optimal graph G * in Equation (4) requires solutions to the two subproblems that comprise structure learning: the evaluation problem and the identification problem [14]. The main problem we focus on in this paper is the evaluation problem, i.e., determining a score that quantifies the quality of a graph, given data. Later, we will address the identification problem by discussing the attributes of this scoring function in efficiently finding the optimal graph structure.
In prior work, we developed a score based on the posterior probability of the network structure G, given data D. That is, we considered maximising the expected log-likelihood [12] (Θ G : where the expectation E[Z] = ∞ −∞ z Pr(z)dz. It was shown that state space reconstruction techniques (see Appendix A) can be used to compute the log-likelihood of Equation (3) as a difference of conditional entropy terms [12]. In the same work, we illustrated that the log-likelihood ratio of a candidate DAG G to the empty network G ∅ is given by collective transfer entropy (see Appendix B), i.e., For the nested log-likelihoods above, the statistics of 2( (Θ G : D) − (Θ G ∅ : D)) asymptotically follow the χ 2 q -distribution, where q is the difference between the number of parameters of each model [47,48]. We will draw on this log-likelihood decomposition in later sections for statistical significance testing.

Computing Conditional KL Divergence
In this section, we present our main result, which is an analytical expression of KL divergence that facilitates structure learning in distributed nonlinear systems. We begin by considering the problem of finding an optimal DBN structure as searching for a parsimonious factorised distribution p B that best represents the complete digraph distribution p K M . That is, p K M is the joint distribution yielded by assuming no factorisation (the complete graph K M ) and thus no information loss. The distribution is expressed as: We quantify the similarity of the factorised distribution p B to this joint distribution via KL divergence. In prior work, De Campos [3] derived the MIT scoring function for BNs by this approach and it was later used for DBN structure learning with complete data [49]. We extend the analysis to DBNs with latent variables, i.e., we compare the joint and factorised distributions of time slices, given the entire history, Substituting the synchronous GDS model (3) into Equation (8), we get However, Equation (9) comprises maximum likelihood distributions with unobserved (latent) states x n . It is common in model selection to decompose the KL divergence as where the second term is simply the log-likelihood (5). In this form, p K M is often identical for all models considered and, in practice, it suffices to ignore this term and thus avoid the problem of computing distributions of latent variables. The resulting simpler expression can be viewed as log-likelihood maximisation (as in our previous work outlined in Section 3.3). However, as we show in this section, p K M is not equivalent for all models unless certain parameters of the dynamical systems are known. Hence, for now, we cannot ignore the first term of Equation (10) and we instead propose an alternative decomposition of KL divergence that comprises only observed variables.

A Tractable Expression via Embedding Theory
In order to compute the distributions in (9), we use the Bundle Delay Embedding Theorem [32,33] to reformulate the factorised distribution (denominator), and the Delay Embedding Theorem for Multivariate Observation Functions [50] for the joint distribution (numerator). We describe these theorems in detail in Appendix A, along with the technical assumptions required for ( f , ψ). Although the following theorems assume a diffeomorphism, we also discuss application of the theory towards inferring the structure of endomorphisms (e.g., coupled map lattices [51]) in the same appendix.
The first step is to reproduce a prior result for computing the factorised distribution (denominator) in Equation (9). First, the embedding where τ i is the (strictly positive) lag, and κ i is the embedding dimension of the i-th subsystem (the embedding parameters). Note that, although we can take either the future or past delay embedding (11) for diffeomorphisms, we explicitly consider a history of values to account for both endomorphisms and diffeomorphisms. Moreover, an important assumption of our approach is that the the structure (enforced by coupling between subsystems) is a DAG; this comes from the Bundle Delay Embedding Theorem [32,33] (see Lemma 1 of [12] for more detail). Our previous result is expressed as follows.
Lemma 1 (Cliff et al. [12]). Given an observed dataset D, where y n ∈ R M , generated by a directed and acyclic synchronous GDS (G, Next, we present a method for computing the joint distribution (numerator) in Lemma 3. For convenience, Lemma 2 restates part of the delay embedding theorem in [50] in terms of subsystems of a synchronous GDS and establishes existence of a map G for predicting future observations from a history of observations.
The multivariate observation is given, for some map G, Proof. The proof restates part of the proof of Theorem 2 of Deyle and Sugihara [50] in terms of subsystems. Given M inhomogeneous observation functions {ψ i }, the following map is an embedding where each subsystem (local) map Φ f i ,ψ i : M → R κ i , smoothly (at least C 2 ), and, at time index n is described by where ∑ i κ i = 2d + 1 [50]. Note that, from (13) and (14), we have the global map ).
The last 2d + 1 components of F are trivial, i.e., the set y i,(κ i ) n is observed; denote the first M components by G : Φ f ,ψ → R M , and then we have .
We now use the result of Lemma 2 to obtain a computable form of the KL divergence.

Lemma 3.
Consider a discrete-time multivariate dynamical system with generic ( f , ψ) modelled as a directed and acyclic synchronous GDS (G, x n , y n , { f i }, {ψ i }) with M subsystems. The KL divergence of a candidate graph G from the observed dataset D can be computed from tractable probability distributions: Proof. Lemma 1, we can substitute (12) into (9), and express the KL divergence D KL p K M p B as We now focus on p K M (z n+1 |z (n) n ). Using the chain rule, Given the Markov property of the dynamics (1) and observation (2) maps, we get Now, recall fom Lemma 2 that global equations for the entire system state x n and observation y n are Given the assumption of i.i.d noise on the function f , from (18), we express the probability of the dynamics x n+1 , given by the embedding, as By assumption, the observation noise is i.i.d or dependent only on the state x n+1 , and thus the probability of observing y n+1 , from (19) is By (20) and (21), we have that Substituting Equation (22) into (17) gives Finally, substituting (23) back into (16) yields the statement of the theorem.
Given all variables in (15) are observed, it is now straightforward to compute KL divergence; however, as we will see, it is more convenient to express (15) as a function of known information-theoretic measures.

Information-Theoretic Interpretation
The main theorem of this paper states KL divergence in terms of transfer entropy and stochastic interaction. These information-theoretic concepts are defined in Appendix B for convenience. Theorem 4. Consider a discrete-time multivariate dynamical system with generic ( f , ψ) represented as a directed and acyclic synchronous GDS (G, x n , y n , { f i }, {ψ i }) with M subsystems. The KL divergence D KL p K M p B of a candidate graph G from the observed dataset D can be expressed as the difference between stochastic interaction (A9) and collective transfer entropy (A8), i.e., Proof. We can reformulate the KL divergence in (15) as Substituting in the definitions of transfer entropy (A8) and stochastic interaction (A9) completes the proof.
To conclude this section, we present the following corollary showing that, when we assume a maximum or fixed embedding dimension κ i and time delay τ i , it suffices to maximise the collective transfer entropy alone in order to minimise KL divergence for a synchronous GDS.

Corollary 1.
Fix an embedding dimension κ i and time delay τ i for each subsystem V i ∈ V. Then, the graph G that minimises the KL divergence D KL p K M p B is equivalent to the graph that maximises transfer entropy, i.e., arg min Proof. The first term of (24) is constant, given a constant vertex set V, time delay τ and embedding dimension κ and is thus unaffected by the parent set Π G (V i ) of a variable. As a result, S Y does not depend on the graph G being considered, and, therefore, we only need to consider transfer entropy when optimising KL divergence (24).
As mentioned above, Corollary 1 is, in practice, equivalent to the maximum log-likelihood (5) and log-likelihood ratio (6) approaches. However, the statement only holds for constant embedding parameters. In the general case, where these parameters are unknown, one requires Theorem 4 to perform structure learning. Given this result, we can now confidently derive scoring functions from Corollary 1.

Application to Structure Learning
We now employ the results above in selecting a synchronous GDS that best fits data generated by a multivariate dynamical system. The most natural way to find an optimal model based on Theorem 4 is to minimise KL divergence. Here, we assume constant embedding parameters and use Corollary 1 to present the transfer entropy score and discuss some attributes of this score. We then use this scoring function as a subroutine for learning the structure of coupled Lorenz and Rössler attractors.
From Corollary 1, a naive scoring function can be defined as Given parameterised probability distributions, this score is insufficient, since the sum of transfer entropy in (27) is non-decreasing when including more parents in the graph [38]. Thus, we use statistical significance tests in our scoring functions to mitigate this issue.

Penalising Transfer Entropy by Independence Tests
Building on the maximum likelihood score (27), we propose using independence tests to define two new scores of practical value. Here, we draw on the result of de Campos [3], who derived a scoring function for BN structure learning based on conditional mutual information and statistical significance tests, called MIT. The central idea is to use collective transfer entropy T Y ij j →Y i to measure the degree of interaction between each subsystem V i and its parent subsystems Π G (V i ), but also to penalise this term with a value based on significance testing. As with the MIT score, this gives a principled way to re-scale the transfer entropy when including more edges in the graph.
To develop our scores, we form a null hypothesis H 0 that there is no interaction T Y ij j →Y i , and then compute a test statistic to penalise the measured transfer entropy. To compute the test statistic, it is necessary to consider the measurement distribution in the case where the hypothesis is true. Unfortunately, this distribution is only analytically tractable in the case of discrete and linear-Gaussian systems, where 2NT Y ij j →Y i is known to asymptotically approach the χ 2 -distribution [48]. Since this distribution is a function of the parents of Y i , we let it be described by the function χ 2 ({l ij } j ). Now, given this distribution, we can fix some confidence level α and determine the value χ α,{l ij } j such that then we accept the hypothesis of conditional independence between Y i and Y ij j ; otherwise, we reject it. We express this idea as the TEA score:  (28). Both distributions were generated by observing the outcome of 1000 samples from two Gaussian variables with a correlation of 0.05. The figures illustrate: the distribution as a set of 100 sampled points (black dots); the area considered independent (grey regions); the measured transfer entropy (black line); and the difference between measurement and penalty term (dark grey region). Both tests use a value of α = 0.9 (a p-value of 0.1). The distribution in (a) was estimated by assuming variables were linearly-coupled Gaussians, and the distribution in (b) was computed via a kernal box method (computed by the Java Information Dynamics Toolkit (JIDT), see [52] for details).
In general, we only have access to continuous measurements of dynamical systems, and so are limited by the discrete or linear-Gaussian assumption. We can, however, use surrogate measurements T Y ij s j →Y i to empirically compute the distribution under the assumption of H 0 [52]. This same technique has been used by [38] to derive a greedy structure learning algorithm for effective network analysis. Here, Y ij s j are surrogate sets of variables for Y ij j , which have the same statistical properties as Y ij j , but the correlation between Y ij s j and Y i is removed. Let the distribution of these surrogate measurements be represented by some general function T(s i ) where, for the discrete and linear-Gaussian systems, we could compute T(s i ) analytically as an independent set of χ 2 -distributions χ 2 ({l ij } j ). When no analytic distribution is known, we use a resampling method (i.e., permutation or bootstrapping), creating a large number of surrogate time-series pairs { Y ij s j , Y i } by shuffling (for permutations, or redrawing for bootstrapping) the samples of Y i and computing a population of T Y ij s j →Y i . As with the TEA score, we fix some confidence level α and determine the value T α,s i , such that p(T(s i ) ≤ T α,s i ) = α. This results in the TEE scoring function as We can obtain the value T α,s i by (1) drawing S samples T Y ij s j →Y i from the distribution T(s i ) (by permutation or bootstrapping), (2) fixing α ∈ {0, 1/S, 2/S, . . . , 1}, and then (3) taking T α,s i such that We can alternatively limit the number of surrogates S to α/(1 − α) and take the maximum as T α,s i [22]; however, taking a larger number of surrogates will improve the validity of the distribution T(s i ).
Both the analytical (TEA) and empirical (TEE) scoring functions are illustrated in Figure 3. Note that the approach of significance testing is functionally equivalent to considering the log-likelihood ratio in (6), where, as stated, nested log-likelihoods (and thus transfer entropy) follows the above χ 2 -distribution [48].

Implementation Details and Algorithm Analysis
The two main implementation challenges that arise when performing structure learning are: (1) computing the score for every candidate network and (2) obtaining a sufficient number of samples to recover the network. The main contributions of this work are theoretical justifications for measures already in use and, fortunately, algorithmic performance has already been addressed extensively using various heuristics. Here, we present an exact, exhaustive implementation for the purpose of validating our theoretical contributions.
First, for computing collective transfer entropy for the score (29), we require CPDs to be estimated from data. Given these CPDs, collective transfer entropy (A8) decomposes as a sum of p conditional transfer entropy (A7) terms, where p = |{Y ij } j | is the size of the parent set (see Appendix B for details). Since most observations of dynamical systems are expected to be continuous, we employ a non-parametric, nearest-neighbour based approach to density estimation called the Kraskov-Stögbauer-Grassberger (KSG) estimator [43]. For any arbitrary decomposition of collective transfer entropy (i.e., any ordering of the parent set), this density estimation can be computed in time O(κ(p + 1)KN κ(p+1) log(N)), where K is the number of nearest neighbours for each observation in a dataset of size N, and κ is the embedding dimension [52]. We upper bound this as O(κMKN κM log(N)) since the maximum p is M − 1. Now, the above density estimation was described for an arbitrary ordering of the parent set. In the case of parametric (discrete or linear-Gaussian) density estimation, every permutation of the parent set yields equivalent results, with potentially different χ α,{l ij } j values for each permutation [3]; however, this is not the case for non-parametric density estimation techniques, e.g., the KSG estimator. Hence, as a conservative estimate of the score, we compute all p! permutations of the parent set and take the minimum collective transfer entropy. In order to obtain the surrogate distribution, we require S uncorrelated samples of the density. Since the surrogate distributions decompose in a similar manner, the score for a candidate network can be computed in time O(S · M! · κMKN κM log(N)), where, again, we have upper bounded p! as M!.
Using this approach, we can now compute the score (29), and thus the optimal graph G * can be found using any search procedure over DAGs. Exhaustive search, where all DAGs are enumerated, is typically intractable because the search space is super-exponential in the number of variables (about 2 O(M 2 ) ), and so heuristics are often applied for efficiency. We restrict our attention to a relatively small network (a maximum of M = 4 nodes) and thus we are able employ the dynamic programming (DP) approach of Silander and Myllymaki [53] to search through the space of all DAGs efficiently. This approach requires first computing the scores for all local parent sets, i.e., 2 M scores. Once each score is calculated, the DP algorithm runs in time o(M · 2 M−1 ) and the entire search procedure run in time O(M · 2 M−1 + 2 M · S · M! · κMKN κM log(N)). As a consequence, the time complexity of the exhaustive algorithm is dominated by computing the 2 M scores and, in smaller networks, most of the time is spent on density estimation for surrogate distributions.
Finally, the problem of inferring optimal embedding parameters is well studied in the literature. In our experimental evaluation, we set the embedding dimension to the maximum, i.e., κ = 2d + 1, where d is the dimensionality of the entire latent state space (e.g., if M = 3 and d i = 3 for each subsystem, then κ = 2 ∑ i d i + 1 = 19). However, determining these parameters would give more insight into the system and reduce the number of samples required for inference. There are numerous criteria for optimising these parameters (e.g., [54]); most notably, the work of [55] suggests an information-theoretic approach that could be integrated into the scoring function (29) to search over the embedding parameters and DAG space simultaneously.

Experimental Validation
The dynamics (1) and observation (2) maps can be obtained by either differential equations, discrete-time maps, or real-world measurements. To validate our approach, we use the toy example of distributed flows, whereby the dynamics of each node are given by either the Lorenz [56] or the Rössler system of ODEs [57]. The discrete-time measurements are obtained by integrating these ODEs over constant intervals. In this section, we formally introduce this model, study the effect of changing the parameters of a coupled Lorenz-Rössler system, and finally apply our scoring function to learn the structure of up to four coupled Lorenz attractors with arbitrary graph topology. To compute the scores, we use the Java Information Dynamics Toolkit (JIDT) [52], which includes both the KSG estimator and methods for generating the surrogate distributions.

Distributed Lorenz and Rössler Attractors
For validating our scoring function, we study coupled Lorenz and Rössler attractors. The Lorenz attractor exhibits chaotic solutions for certain parameter values and has been used to describe numerous phenomena of practical interest [56,58,59]. Each Lorenz system comprises three components (d i = 3), which we denote x = u, v, w ; the state dynamics are given by: with free parameters {σ, ρ, β}. Similarly, the Rössler attractor has state dynamics given by: with free parameters {a, b, c} [57].
In the distributed case, the components of each state vector x i t are also driven by components of another subsystem. A number of different schemes have been proposed for coupling these variables, e.g., using the product [21,60] and the difference [61,62] of components. Our model uses the latter approach of linear differencing between one or more subsystem variables to couple the network. Let λ denote the coupling strength, C denote a three-dimensional vector of binary values, and A denote an adjacency (coupling) matrix (i.e., an M × M matrix of zeros with A ij = 1 iff V i ∈ Π G (V j )). Then, the state equations for M spatially distributed systems can be expressed aṡ where g i (·) represents the i-th chaotic attractor and ν f is additive noise. In our simulations, we use λ = 2, C = 1, 0, 0 (each subsystem is coupled via variable u), and the adjacency matrices shown in Figure 4. In our experiments, we use common parameters for both attractors, i.e., σ = 10, β = 8/3, ρ = 28 and a = 0.1, b = 0.1, c = 14. For the observation y i t , it is common to use one component of the state as the read-out function [4,32,33]; we therefore let y i t = u i t + ν ψ . The noise terms are normally distributed with ν f ∼ N (0, σ f ) and ν ψ ∼ N (0, σ ψ ). Figure 1 illustrates example trajectories of Lorenz-Lorenz attractors coupled via this model.

Case Study: Coupled Lorenz-Rössler System
In order to characterise the effect of coupling on our score, we begin our evaluation by measuring the transfer entropy of a coupled Lorenz-Rössler attractor. In this setup, M = 2, Π G (V 1 ) = ∅, and Π G (V 2 ) = V 1 , g 1 (x) was given by (30), and g 2 (x) was given by (31). The transfer entropy was computed with a finite sample size of N = 100, 000. Figure 5 shows the transfer entropy as a function of numerous parameters. In particular, the figure illustrates the effect of varying the coupling strength λ, embedding dimension κ, dynamics noise σ f , and observation noise σ ψ . As expected, increasing λ, or reducing either noise σ, increases the transfer entropy. The embedding dimension, however, increases to a set point, remains approximately constant, and then decreases. The κ-value above which transfer entropy remains constant illustrates the embedding dimension at which the dynamics are reconstructed; the decrease in transfer entropy after this point, however, is likely due to the finite sample size used for density estimation. There are two interesting features in Figure 5 due to the dynamical systems studied. First, in the bottom row (Figure 5g-i), there is a bifurcation around κ = 6. The theoretical embedding dimension for this system is κ = 2(d 1 + d 2 ) + 1 = 7, and, in this case, for κ < 6, the embedding does not suffice to reconstruct the dynamics. Second, in Figure 5i, the transfer entropy decreases after about λ = 2. This appears to be the case of synchrony due to strong coupling, where the dynamics of the forced variable become subordinate to the forcing [4], thus reducing the information transferred between the two subsystems.

Case Study: Network of Lorenz Attractors
In this section, we evaluate the score (27) in learning the structure of distributed dynamical systems. We will look at systems of three and four nodes of coupled Lorenz subsystems with arbitrary topologies. Unfortunately, significantly higher number of nodes become computationally expensive due to an increased embedding dimension κ, number of data points N, and number of permutations required to calculate the collective transfer entropy. To evaluate the performance of the score (27), the dynamics noise is constant σ f = 0.01, whereas the observation noise σ ψ and the number of observations taken N are varied. We selected the theoretical maximum embedding dimension κ = 2d + 1 and τ = 1 as is common given discrete-time measurements [22]. It should be noted that from the results from Section 6.2 that transfer entropy is sensitive to the numerous parameters used to generate the data, and thus depending on the scenario, a significant sample size can be required for recovering the underlying graph structure. We do not make an effort to reduce this sample size and instead show the effect of using a different number of samples on the accuracy of the structure learning procedure.
In order to evaluate the scoring function, we compute the recall (R, or true positive rate), fallout (F, or false positive rate), and precision (P, or positive predictive value) of the recovered graph. Let TP denote the number of true positives (correct edges); TN denote the number of true negatives (correctly rejected edges); FP denote the number of false positives (incorrect edges); and FN denote the number of false negatives (incorrectly rejected edges). Then, R = TP/(TP + FN), F = FP/(FP + TN), and P = TP/(TP + FP). Finally, the F 1 -score gives the harmonic mean of precision and recall to give a measure of the tests accuracy, i.e., F 1 = 2 · R · P/(R + P). Note that the ideal recall, precision and F 1 -score is 1, and ideal fallout is 0. Furthermore, a ratio of R/F >1 suggests the classifier is better than random. As a summary statistic, Tables 1 and 2 presents the F 1 -scores for all networks illustrated in Figure 4, and the full classification results (e.g., precision, recall, and fallout) are given in Appendix C. The F 1 -scores are thus a measure of how relevant the recovered network is to the original (generating) network from our data-driven approach.
In general, the results of Tables 1 and 2 show that the scoring function is capable of recovering the network with high precision and recall, as well as low fallout. In the table, the cell colours are shaded to indicate higher (white) to lower (black) F 1 scores. The best performing score is that with a p-value of 0.01 and no penalisation (a p-value of ∞) has the second highest classification results. As expected, the graphs recovered from data with low observational noise (σ ψ = 1) are more accurate than those inferred from noisier data (σ ψ = 10). The results for three-node networks (shown in Table 1) yields mostly full recovery of the structure for a higher number of observations N ≥ 75 K, whereas, the four-node networks (shown in Table 2) are more difficult to classify. Table 1. F 1 -scores for three-node (M = 3) networks. We present the classification summary for the three arbitrary topologies of coupled Lorenz systems represented by Figure 4b-d (network G 1 has no edges and thus an undefined F 1 -score). The p-value of the TEE score is given in the top row of each table, with ∞ signifying using no significance testing, i.e., score (27).   Table 2. F 1 -scores for four-node (M = 4) networks. We present the classification summary for the three arbitrary topologies of coupled Lorenz systems represented by Figure 4f-h (network G 5 has no edges and thus an undefined F 1 -score). The p-value of the TEE score is given in the top row of each table, with ∞ signifying using no significance testing, i.e., score (27). Interestingly, the statistical significance testing does not have a strong effect on the results. It is unclear if this is due to the use of the non-parametric density estimators, which, in effect, are parsimonious in nature since transfer entropy will likely reduce when conditioning on more variables with a fixed samples size. One challenging case is the empty networks G 1 and G 5 ; this is shown in Appendix C, where the fallout is rarely 0 for any of the p-values or sample sizes (although a large number of observations N = 100 K appears to reduce spurious edges). It would be expected that significance testing on these networks would outperform the naive score (27) given that a non-zero bias is introduced for a finite number of observations. Further investigation is required to understand why the null case fails.

Discussion and Future Work
We have presented a principled method to compute the KL divergence for model selection in distributed dynamical systems based on concepts from differential topology. The results presented in Figure 5 and Tables 1 and 2 illustrate that this approach is suitable for recovering synchronous GDSs from data. Further, KL divergence is related to model encoding, which is a fundamental measure used in complex systems analysis. Our result, therefore, has potential implications for other areas of research. For example, the notion of equivalence classes in BN structure learning [63] should lend insight into the area of effective network analysis [35,36].
More specifically, the approach proposed here complements explicit Bayesian identification and comparison of state space models. In DCM, and more generally in approximate Bayesian inference, models are identified in terms of their parameters via an optimisation of an approximate posterior density over model parameters with respect to a variational (free energy) bound on log evidence [64]. After these parameters have been identified, this bound can be used directly for model comparison and selection. Interestingly, free energy is derived from the KL divergence between the approximate and true posterior and thus automatically penalises more complex models; however, in Equation (8), these distributions are inverted. In future work, it would be interesting to explore the relationship between transfer entropy and the variational free energy bound. Specifically, computing an evidence bound directly from the transfer entropy may allow us to avoid the significance testing described in Section 5 and instead use an approximation to evidence for structure learning.
Multivariate extensions to transfer entropy are known to eliminate redundant pairwise relationships and take into account the influence of confounding relationships in a network (i.e., synergistic effects) [65,66]. In this work, we have shown that this intuition holds for distributed dynamical systems when confined to a DAG topology. We conjecture that these methods are also applicable when cyclic dependencies exist within a graph, given any generic observation can be used in reconstructing the dynamics [50]; however, the methods presented are more likely to reveal one source in the cycle, rather than all information sources due to redundancy.
There are a number of extensions that should be considered for further practical implementations of this algorithm. Currently, we assume that the dimensionality of each subsystem is known, and thus we can bound the embedding dimension κ for recovering the hidden structure. However, this is generally infeasible in practice and a more general algorithm would infer the embedding dimension and time delay for an unknown system. Fortunately, there are numerous techniques to recover these parameters [54,55]. Furthermore, evaluating the quality of large graphs is infeasible with our current approach. However, our exact algorithm illustrates the feasibility of state space reconstruction in recovering a graph in practice. In the future, we aim to leverage the structure learning literature on reducing the search space and approximating scoring functions to produce more efficient algorithms.
Finally, the theoretical results of this work supplements understanding in fields where transfer entropy is commonly employed. Point processes are being increasingly viewed as models for a variety of information processing systems, e.g., as spiking neural trains [67] and adversaries in robotic patrolling models [68]. It was recently shown how transfer entropy can be computed for continuous time point processes such as these [67], allowing for efficient use of our analytical scoring function g TEA in a number of contexts. Another intriguing line of research is the physical and thermodynamic interpretation of transfer entropy [69], particularly its relationship to the arrow of time [70]; this relationship between endomorphisms as discussed here and time asymmetry of thermodynamics should be explored further.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Embedding Theory
We refer here to embedding theory as the study of inferring the (hidden) state x n ∈ M of a dynamical system from a sequence of observations y n ∈ R. This section will cover reconstruction theorems that define the conditions under which we can use delay embeddings for recovering the original dynamics f from this observed time series.
In differential topology, an embedding refers to a smooth map Φ : M → N between manifolds M and N if it maps M diffeomorphically onto its image. In Takens' seminal work on turbulent flow [31], he proposed a map Φ f ,ψ : M → R κ , that is composed of delayed observations, can be used to reconstruct the dynamics for typical ( f , ψ). That is, fix some κ (the embedding dimension) and τ (the time delay), the delay embedding map, given by is an embedding. More formally, denote Φ f ,ψ , D r (M, M) as the space of C r -diffeomorphisms on M and C r (M, R) as the space of C r -functions on M, then the theorem can be expressed as follows.
Theorem A1 (Delay Embedding Theorem for Diffeomorphisms [31]). Let M be a compact manifold of dimension d ≥ 1. If κ ≥ 2d + 1 and r ≥ 1, then there exists an open and dense set ( f , ψ) ∈ D r (M, M) × C r (M, R) for which the map Φ f ,ψ is an embedding of M into R κ .
The implication of Theorem A1 is that, for typical ( f , ψ), the image Φ f ,ψ (M) of M under the delay embedding map Φ f ,ψ is completely equivalent to M itself, apart from the smooth invertible change of coordinates given by the mapping Φ f ,ψ . An important consequence of this result is that n ) [44]. The bound for the open and dense set referred to in Theorem A1 is given by a number of technical assumptions. Denote (D f ) x as the derivative of function f at a point x in the domain of f . The set of periodic points A of f with period less than τ has finitely many points. In addition, the eigenvalues of (D f ) x at each x in a compact neighbourhood A are distinct and not equal to 1.
Theorem A1 was established for diffeomorphisms D r ; by definition, the dynamics are thus invertible in time. Thus, the time delay τ in (A1) can be either positive (delay lags) or negative (delay leads). Takens later proved a similar result for endomorphisms, i.e., non-invertible maps that restricts the time delay to a negative integer. Denote by E (M, M) the set of the space of C r -endomorphisms on M, then the reconstruction theorem for endomorphisms can be expressed as the following.
Theorem A2 (Delay Embedding Theorem for Endomorphisms [71]). Let M be a compact m dimensional manifold. If κ ≥ 2d + 1 and r ≥ 1, then there exists an open and dense set ( f , ψ) ∈ D r (M, M) × C r (M, R) for which there is a map π κ : X κ → M with π κ Φ f ,ψ = f κ−1 . Moreover, the map π κ has bounded expansion or is Lipschitz continuous.
As a result of Theorem A2, a sequence of κ successive measurements from a system determines the system state at the end of the sequence of measurements [71]. That is, there exists an endomorphism f ,ψ to predict the next observation if one takes a negative time (lead) delay τ in (A1). In this work, we consider two important generalisations of the Delay Embedding Theorem A1. Both of these theorems follow similar proofs to the original and have thus been derived for diffeomorphisms, not endomorphisms. However, encouraging empirical results in [6] support the conjecture that they can both be generalised to the case of endomorphisms by taking a negative time delay, as is done in Theorem A2 above. This would allow for not only distributed flows that are used in our work, but endomorphic maps, e.g., the well-studied coupled map lattice structure [51].
The first generalisation is by Stark et al. [44] and deals with a skew-product system. That is, f is now forced by some second, independent system g : N → N . The dynamical system on M × N is thus given by the set of equations In this case, the delay map is written as Φ f ,g,ψ (x, ω) = y n , y n+τ , y n+2τ , . . . , y n+(κ−1)τ , and the theorem can be expressed as follows.
Theorem A3 (Bundle Delay Embedding Theorem [44]). Let M and N be compact manifolds of dimension d ≥ 1 and e, respectively. Suppose that κ ≥ 2(d + e) + 1 and the periodic orbits of period ≤ d of g ∈ D r (N ) are isolated and have distinct eigenvalues. Then, for r ≥ 1, there exists an open and dense set of ( f , ψ) ⊂ D r (M × N , M) × C r (M, R) for which the map Φ f ,g,ψ is an embedding of M × N into R κ .
Finally, all theorems up until now have assumed a single read-out function for the system in question. Recently, Sugihara et al. [4] showed that multivariate mappings also form an embedding, with minor changes to the technical assumptions underlying Takens' original theorem. That is, given M ≤ 2d + 1 different observation functions, the delay map can be written as where each delay map Φ f ,ψ i is as per (A1) for individual embedding dimension κ i ≤ κ. The theorem can then be stated as follows.
Theorem A4 (Delay Embedding Theorem for Multivariate Observation Functions [50]). Let M be a compact manifold of dimension d ≥ 1. Consider a diffeomorphism f ∈ D r (M, M) and a set of at most 2d + 1 observation functions ψ i where each ψ i ∈ C r (M, R) and r ≥ 2. If ∑ i κ i ≥ 2d + 1, then, for generic ( f , ψ i ), the map Φ f , ψ i is an embedding. N = 5000 (Tables A1 and A2), N = 10,000 (Tables A3 and A4), N = 25,000 (Tables A5 and A6), N = 50,000 (Tables A7 and A8), and N = 100,000 (Tables A9 and A10). Each table has results for various p-values (with a p-value of ∞ denoting the maximum likelihood score (27)), as well as two different observation noise variances, σ ψ = 1 and σ ψ = 10. Table A1. Classification results for three-node (M = 3) networks for N = 5000 samples. We present the precision (P), recall (R), fallout (F), and F 1 -score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.
14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 P  Table A2. Classification results for four-node (M = 4) networks for N = 5000 samples. We present the precision (P), recall (R), fallout (F), and F 1 -score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.    Table A3. Classification results for three-node (M = 3) networks for N = 10,000 samples. We present the precision (P), recall (R), fallout (F), and F 1 -score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.    Table A4. Classification results for four-node (M = 4) networks for N = 10,000 samples. We present the precision (P), recall (R), fallout (F), and F 1 -score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.    Table A5. Classification results for three-node (M = 3) networks for N = 25,000 samples. We present the precision (P), recall (R), fallout (F), and F 1 -score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.  Table A6. Classification results for four-node (M = 4) networks for N = 25,000 samples. We present the precision (P), recall (R), fallout (F), and F 1 -score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.   Table A7. Classification results for three-node (M = 3) networks with N = 50,000 samples. We present the precision (P), recall (R), fallout (F), and F 1 -score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.  Table A8. Classification results for four-node (M = 4) networks with N = 50,000 samples. We present the precision (P), recall (R), fallout (F), and F 1 -score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.   Table A9. Classification results for three-node (M = 3) networks with N = 100,000 samples. We present the precision (P), recall (R), fallout (F), and F 1 -score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.  Table A10. Classification results for four-node (M = 4) networks with N = 100,000 samples. We present the precision (P), recall (R), fallout (F), and F 1 -score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.