Topological reversibility and causality in feed-forward networks

Systems whose organization displays causal asymmetry constraints, from evolutionary trees to river basins or transport networks, can often be described in terms of directed paths (causal flows) on a discrete state space. Such a set of paths defines a feed-forward, acyclic network. A key problem associated with these systems involves characterizing their intrinsic degree of path reversibility: given an end node in the graph, what is the uncertainty of recovering the process backwards until the origin? Here we propose a novel concept, topological reversibility, which rigorously weighs such uncertainty in path dependency, quantified as the minimum amount of information required to successfully revert a causal path. Within the proposed framework we also analytically characterize limit cases for both topologically reversible and maximally entropic structures. The relevance of these measures within the context of evolutionary dynamics is highlighted.


I. INTRODUCTION
Causality is the fundamental principle pervading dynamical processes. Any set of time-correlated events, from the development of an organism to historical changes, defines a feed-forward structure of causal relations captured by a family of complex networks called directed acyclic graphs (DAGs). Their structure has recently attracted the interest of researchers [1][2][3][4] since DAGs represent time-ordered processes as well as a broad number of natural and artificial systems. Examples would include simple electronic circuits [5], feed-forward neural [6] and transmission networks [7], river basins [8], or even some food webs and chemical structures [9].
A paradigmatic example of a causal structure is the chart of the relations among states followed by a computational process through time. Intimately linked to the topology of the computational chart of consecutive states, a fundamental feature of computations is their degree of logical reversibility [10,11]. Indeed, a process is said to be logically reversible when, reverting the flow of causality, i.e. going backwards from the computational outputs to their inputs, we can unambiguously recover the causal structure of the process. Roughly speaking, if we have a computer performing a function g : N → N and we can unambiguously determine the input u from the sole knowledge of the value v = g(u), we say that the function is logically reversible. Otherwise, if there is uncertainty in determining u from the sole knowledge of v, we say that the function is logically irreversible, and thus additional information is needed to successfully reconstruct a given computational path.
Analogously, the potential scenarios emerging from an evolutionary process raise similar questions. Within evolutionary biology, a relevant problem is how predictable evolutionary dynamics is. In particular, it has been asked what would be the result of going backwards and "re-playing the tape of evolution" [12,13]. Since this question pervades the problem of how uncertain or predictable a given evolutionary path is, it seems desirable to provide a foundational framework.
In this paper, we analytically extend the concept of logical reversibility to the study of any causal structure having no cyclic topologies, thereby defining a broader concept to be named topological reversibility. Whereas thermodynamical irreversibility implies thermodynamical entropy production [14,15], topological irreversibility implies statistical entropy production. In general, we will say that a DAG is topologically reversible if we can unambiguously recover a path going backwards from any element to the origin. Genealogies and phylogenies are examples of tree-like structures where a chronological order can be established among the events and an unambiguous reconstruction of the lineage can be performed for every element of the graph [16]. Following this argument, we will label a graph as topologically irreversible when some uncertainty is observed in the reconstruction of trajectories.
As shown below, the entropy presented here weighs the extra amount of information that would be required to recover the causal flow backwards. Information measures are not new in the study of complex networks [17][18][19][20][21][22][23], although such measures accounted for connectivity correlations [18,19,21,22] or were used to characterize a Gibbsian formulation of the statistical mechanics of complex networks [17]. We finally note that the starting point of our formalism resembles the classical theory of Bayesian networks. However, the particular treatment of reversibility proposed here is qualitatively different from the concept of uncertainty used in such a framework and closer to the one described in [20].
The paper is organized as follows: In section II we provide the basic concepts underlying our analytical derivations. Section III provides the general mathematical definition of topological reversibility and the general expression for the average uncertainty associated to the reversion of the causal flow. This is consistently derived from the properties of the adjacency matrix. In section IV we consider two limit cases, finding the exact analytic form for their entropies and predicting the most uncertain configuration. Finally, in section V we outline the generality and relevance of our results in terms of characterizing DAG structure.

II. THEORETICAL BACKGROUND
The theoretical roots of this paper stem from fundamental notions of directed graph theory [24,25], ordered set theory [26,27] and information theory [28][29][30][31]. Specifically, we make use of Shannon's entropy which, as originally defined, quantifies the uncertainty associated to certain collections of random events [28,30]. In our framework, the entropy in a given feed-forward graph measures the uncertainty in reversing the causal flow depicted by the arrows [39].

A. Directed graphs and orderings
Let G(V, E) be a directed graph, where V = {v_1, ..., v_n}, |V| = n, is the set of nodes and E = {⟨v_k, v_i⟩, ..., ⟨v_j, v_l⟩} is the set of edges; the ordered pair ⟨v_k, v_i⟩ indicates that there is an arrow in the direction v_k → v_i. Given a node v_i ∈ V, the number of outgoing links, written k_out(v_i), is called the out-degree of v_i, and the number of incoming links, written k_in(v_i), is called the in-degree of v_i. The adjacency matrix of a given graph G, A(G), is defined as A_ij(G) = 1 ↔ ⟨v_i, v_j⟩ ∈ E, and A_ij(G) = 0 otherwise. Through the adjacency matrix, k_in and k_out are computed as
$$k_{in}(v_i) = \sum_{j \le n} A_{ji}(G), \qquad k_{out}(v_i) = \sum_{j \le n} A_{ij}(G).$$
Furthermore, we will use the known relation between the k-th power of the adjacency matrix and the number of paths of length k going from a given node v_i to a given node v_j. Specifically,
$$\left(A^{k}(G)\right)_{ij}$$
is the number of paths of length k going from node v_i to node v_j [25].
A feed-forward or directed acyclic graph is a directed graph characterized by the absence of cycles: if there is a directed path from v_i to v_k (i.e., there is a finite sequence ⟨v_i, v_j⟩, ⟨v_j, v_l⟩, ⟨v_l, v_s⟩, ..., ⟨v_m, v_k⟩ ∈ E), then there is no directed path from v_k to v_i. Conversely, the matrix A^T(G) depicts a DAG with the same underlying structure but having all the arrows (and thus, the causal flow) inverted. Given its acyclic nature, one can find a finite value L(G) as follows:
$$L(G) = \max\{k : (\exists\, v_i, v_j \in V) : \left(A^{k}(G)\right)_{ij} \neq 0\}. \tag{2}$$
It is easy to see that L(G) is the length of the longest path of the graph. The existence of such L(G) can be seen as a test for acyclicity. However, leaf-removal algorithms [32,33], i.e. the iterative pruning of nodes without outgoing links, are far more suitable than the above method in terms of computational cost. In a DAG, a leaf-removal algorithm removes the graph completely in a finite number of iterations, specifically, in L(G) iterations (see eq. (2)).
Now we study the interplay between DAGs and order relations. Borrowing concepts from order theory [27], we define the following set:
$$M = \{v_i \in V : k_{in}(v_i) = 0\}, \tag{3}$$
to be named the set of maximal nodes of G, with |M| = m. The set of all paths π_1, ..., π_s, s ≥ |E|, from M to any node v_i ∈ V \ M is indicated as Π(G). Given a node v_i ∈ V \ M, the set of all paths from M to v_i is written as Π(v_i) ⊆ Π(G). Furthermore, we define the set v(π_k) as the set of all nodes participating in the path π_k, except the maximal one. Additionally, one can define the set of nodes with k_out = 0 as the set of minimal nodes of G, to be named µ. Notice that the absence of cycles implies that m ≥ 1 and that the set of minimals µ must also contain at least one element (see fig. 1a).
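The quantities introduced so far can be read directly off the adjacency matrix. The following is a minimal sketch, assuming NumPy and a hypothetical 5-node DAG chosen only for illustration (the edge list `EDGES` and all function names are ours, not part of the original formalism). It computes the degrees, the sets M and µ, and L(G) as the largest power of A with a nonzero entry.

```python
import numpy as np

# Hypothetical 5-node DAG used only for illustration: A[i, j] = 1 iff v_i -> v_j.
EDGES = [(0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]

def adjacency(n, edges):
    """Build the 0/1 adjacency matrix of a directed graph with n nodes."""
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i, j] = 1
    return A

def degrees(A):
    """k_in is the column sum and k_out the row sum of the adjacency matrix."""
    return A.sum(axis=0), A.sum(axis=1)

def longest_path_length(A):
    """L(G): the largest k for which A^k has a nonzero entry (0 for an edgeless graph)."""
    n = A.shape[0]
    power, length = A.copy(), 0
    for k in range(1, n + 1):
        if not power.any():
            return length
        length = k
        power = power @ A
    # A path in a DAG visits at most n nodes, so A^n must vanish.
    raise ValueError("A^n is nonzero, so the graph contains a cycle")

A = adjacency(5, EDGES)
k_in, k_out = degrees(A)
maximal = np.flatnonzero(k_in == 0)   # the set M
minimal = np.flatnonzero(k_out == 0)  # the set mu
```

Using matrix powers mirrors the definition of L(G) rather than aiming at efficiency; for large graphs a pruning-based approach, as noted in the text, would be the more economical choice.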
Attending to the node relations depicted by the arrows, and due to the acyclic property, at least one node ordering can be defined, establishing a natural link between order theory and DAGs. This order is achieved by labeling all the nodes with sequential natural numbers and obtaining a configuration such that:
$$(\forall \langle v_i, v_j \rangle \in E)\ (i < j). \tag{4}$$
Accordingly, DAGs are ordered graphs [2]. However, as order relations imply transitivity, it is not the DAG but its transitive closure that properly defines the order relation among the elements of V. The transitive closure of G (see fig. 1b), to be written as T(G) = (V, E_T), is defined by
$$\langle v_i, v_j \rangle \in E_T \leftrightarrow \text{there is a directed path from } v_i \text{ to } v_j \text{ in } G.$$
In this framework, for a given number of maximal nodes, in the transitive closure the addition of a link either creates a cycle or destroys a maximal or minimal node. If the pairs defining the set of links of T(G) are conceived as the elements of a set relation E_T ⊂ V × V, such a relation satisfies the following three properties: (i) irreflexivity, ⟨v_i, v_i⟩ ∉ E_T; (ii) asymmetry, ⟨v_i, v_k⟩ ∈ E_T → ⟨v_k, v_i⟩ ∉ E_T; (iii) transitivity, (⟨v_i, v_k⟩ ∈ E_T ∧ ⟨v_k, v_j⟩ ∈ E_T) → ⟨v_i, v_j⟩ ∈ E_T.

FIG. 1: (a) A DAG, where M denotes the set of maximals, µ the set of minimals and V \ M the set of non-maximals. (b) The respective transitive closure. (c) The graph in which any node of the maximal set is connected to any node of the set V \ M. This is a special structure displaying maximal entropy (see text).

The DAG definition implies that E directly satisfies the first two conditions, whilst the third one (transitivity) is only warranted for E_T. Thus, only E_T meets all the requirements to be an order relation, specifically, a strict partial order. The transitive closure of a given DAG can be obtained by means of Warshall's algorithm [25]. Finally, a subgraph F(V_F, E_F) ⊆ G is said to be linearly ordered or totally ordered provided that, with the nodes labeled according to the ordering defined above, for all pairs v_i, v_k ∈ V_F,
$$i < k \rightarrow \langle v_i, v_k \rangle \in E_F. \tag{5}$$
Let us notice that if we understand E_F as a set relation E_F ⊂ V_F × V_F, then E_F is a strict total order on V_F. If G is linearly ordered and W ⊂ G, we refer to G as a topological sort of W [25].
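To make the order-theoretic statements concrete, here is a minimal sketch of Warshall's algorithm together with checks of the three properties of a strict partial order and of linear ordering. It assumes NumPy and the same hypothetical 5-node DAG as an example; the function names are illustrative, and "linearly ordered" is tested as "every pair of distinct nodes is related in exactly one direction".

```python
import numpy as np

def transitive_closure(A):
    """Warshall's algorithm on a boolean copy of A (A[i, j] is True iff v_i -> v_j)."""
    T = np.array(A, dtype=bool)
    for k in range(T.shape[0]):
        # Link v_i to v_j whenever v_i reaches v_k and v_k reaches v_j.
        T |= T[:, [k]] & T[[k], :]
    return T

def is_strict_partial_order(R):
    """Check irreflexivity, asymmetry and transitivity of a boolean relation matrix."""
    irreflexive = not R.diagonal().any()
    asymmetric = not (R & R.T).any()
    transitive = np.array_equal(transitive_closure(R), R)
    return irreflexive and asymmetric and transitive

def is_linearly_ordered(R, nodes):
    """True if every pair of distinct `nodes` is related in exactly one direction."""
    sub = R[np.ix_(nodes, nodes)]
    off_diagonal = ~np.eye(len(nodes), dtype=bool)
    return bool(np.array_equal(sub ^ sub.T, off_diagonal))

# Hypothetical example DAG: 0->2, 1->2, 1->3, 2->3, 3->4.
A = np.zeros((5, 5), dtype=bool)
for i, j in [(0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]:
    A[i, j] = True
T = transitive_closure(A)
```

In this example the edge set itself is irreflexive and asymmetric but not transitive, whereas its closure satisfies all three properties; the subgraph induced by nodes 2, 3, 4 of the closure is linearly ordered, while nodes 0 and 1 remain unrelated.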

B. Uncertainty
According to classical information theory [28][29][30][31], let us consider a system S with n possible states, whose occurrences are governed by a random variable X with an associated probability mass function formed by p_1, ..., p_n. According to the standard formalization, the uncertainty or entropy associated to X, to be written as H(X), is:
$$H(X) = -\sum_{i \le n} p_i \log p_i,$$
which is actually an average of log(1/p(X)) among all events of S, namely, H(X) = ⟨log(1/p(X))⟩, where ⟨...⟩ is the expectation or average of the random quantity between brackets. Since the logarithm is a concave function, the entropy satisfies the so-called Jensen's inequality [29], which reads:
$$\left\langle \log \frac{1}{p(X)} \right\rangle \le \log \left\langle \frac{1}{p(X)} \right\rangle = \log n. \tag{7}$$
The maximum value log n is achieved for p_i = 1/n for all i = 1, ..., n. Jensen's inequality provides an upper bound on the entropy that will be used below.
Analogously, we can define the conditional entropy. Given another system S' containing n' values or choices, whose behavior is governed by a random variable Y, let P(s'_i|s_j) be the conditional probability of obtaining Y = s'_i ∈ S' if we already know X = s_j ∈ S. Then, the conditional entropy of Y from X, to be written as H(Y|X), is defined as:
$$H(Y|X) = -\sum_{j \le n} p_j \sum_{i \le n'} P(s'_i|s_j) \log P(s'_i|s_j), \tag{8}$$
which is typically interpreted as a noise term in information theory. Such a noise term can be interpreted as the minimum amount of extra bits needed to unambiguously determine the input set from the sole knowledge of the output set. This will be the key quantity of our paper, for it accounts for the dissipation of information in a given process.
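As a small numerical illustration of these definitions, the sketch below evaluates the entropy, its Jensen upper bound log n, and the conditional entropy for hypothetical distributions. Natural logarithms are assumed (the choice of base only rescales the values), and the function names are ours.

```python
import math

def entropy(p):
    """Shannon entropy -sum p_i log p_i (natural logarithm), skipping zero-probability states."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def conditional_entropy(p_x, p_y_given_x):
    """H(Y|X) = sum_j p(x_j) * H(Y | X = x_j); row j of p_y_given_x is P(. | x_j)."""
    return sum(px * entropy(row) for px, row in zip(p_x, p_y_given_x))

# Hypothetical distributions chosen only for illustration.
p = [0.5, 0.25, 0.25]
uniform = [1 / 3, 1 / 3, 1 / 3]
# X = x_0 determines Y completely; X = x_1 leaves two equally likely values.
noise = conditional_entropy([0.5, 0.5], [[1.0, 0.0], [0.5, 0.5]])
```

Here `entropy(p)` stays below `math.log(3)`, the bound is attained by `uniform`, and `noise` is nonzero only because the second input state leaves Y undetermined.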

III. TOPOLOGICAL REVERSIBILITY AND ENTROPY
Let us imagine that a node v_i ∈ V \ M of a given DAG G receives the visit of a random walker that follows the flow chart depicted by the DAG. We only know that it began its walk at a given maximal node and that it followed a downstream random path, attending to the directions of the arrows, to reach the node v_i. Suppose also that the global structure of the graph is unknown. What is the uncertainty associated to the followed path? In other words, what is the amount of information we need, on average, to successfully perform the backward process?

FIG. 2: (a) Notice that more than one pathway, each with more or less probability to be chosen, connects maximals to each terminal. (b) Given a node (v_6) receiving two inputs, we consider two different alternatives to go backwards. The uncertainty in this particular case is obtained by computing h_L(v_i) from eq. (14), i.e., h_L(v_6) = log 2, assuming equiprobability in the selection.

A. The definition of entropy
As we mentioned above, the starting point of our derivation is close to the treatment of Bayesian networks [34]. In our approach, the first task is to define the probability to follow a given path π_k ∈ Π(v_i) when reverting the process. Let v(π_k) be the set of nodes participating in the path π_k except the maximal ones. Maximal nodes are not included in this set because they are the ends of the path of the reversal process. The probability to choose such a path from node v_i by making a random decision at every crossing when reverting the causal flow will be:
$$P(\pi_k | v_i) = \prod_{v_j \in v(\pi_k)} \frac{1}{k_{in}(v_j)}. \tag{9}$$
Consistently:
$$\sum_{\pi_k \in \Pi(v_i)} P(\pi_k | v_i) = 1.$$
As P is a probability distribution, we can compute the uncertainty associated to a reversal of the causal flow, starting the reversion process from a given node v_i ∈ V \ M, to be written as h(v_i):
$$h(v_i) = -\sum_{\pi_k \in \Pi(v_i)} P(\pi_k | v_i) \log P(\pi_k | v_i).$$
The overall uncertainty of G, written as H(G), is computed by averaging h over all non-maximal nodes, i.e.:
$$H(G) = \sum_{v_i \in V \setminus M} p(v_i)\, h(v_i),$$
where p(v_i) is the probability of starting the reversion process at v_i.

B. The transition matrix Φ and its relation to the adjacency matrix
The main combinatorial object of our approach is not the adjacency matrix but instead a mathematical representation of the probability to visit a node v_i ∈ V \ M starting the backward flow from a given, different node v_k ∈ V \ M, regardless of the distance separating them. As we shall see, this combinatorial information can be encoded in a matrix, to be named transition matrix Φ, and we can explicitly obtain it from A(G). We begin by noting that, from eq. (9),
$$\log P(\pi_k | v_i) = -\sum_{v_j \in v(\pi_k)} \log k_{in}(v_j),$$
and we can see that:
$$h(v_i) = \sum_{v_k \in V \setminus M} \phi_{ik}(G)\, h_L(v_k). \tag{13}$$
Let us explain eq. (13) and its consequences. First we define h_L(v_i) as:
$$h_L(v_i) = \log k_{in}(v_i), \tag{14}$$
where L indicates the amount of local entropy introduced in a given node when performing the reversion process (see fig. 2). Thereby, it is the amount of information needed to properly revert the flow backwards when a bifurcation point is reached having k_in possible choices. Secondly, we define φ_ik as the coefficients of an (n − m) × (n − m) matrix Φ(G) = [φ_ik(G)], i.e. our transition matrix:
$$\phi_{ij}(G) = \sum_{\pi_k : v_j \in v(\pi_k)} P(\pi_k | v_i).$$
This represents the probability to reach v_j starting from v_i. Now we derive the general expression for Φ. The derivation allows us to obtain a consistent mathematical definition of the transition matrix in terms of A(G). We first notice two important facts linking paths and the powers of the adjacency matrix that are only generically valid in DAG-like networks. First, we observe that:
$$|\Pi(v_i)| = \sum_{j \le L(G)} \sum_{v_l \in M} \left(A^{j}(G)\right)_{li}, \tag{15}$$
where L(G) is the length of the longest path of the graph as defined by (2). Analogously, the number of paths of Π(v_i) crossing v_k ≠ v_i, to be written as α_ik, is:
$$\alpha_{ik} = \left(\sum_{j \le L(G)} \sum_{v_l \in M} \left(A^{j}(G)\right)_{lk}\right) \left(\sum_{j \le L(G)} \left(A^{j}(G)\right)_{ki}\right).$$
The above quantities provide the number of paths. To compute the probability to reach a given node, we have to take into account the probability to follow a given path containing such a node, defined in (9). To rigorously connect it to the adjacency matrix, we first define an auxiliary (n − m) × (n − m) matrix B(G), namely:
$$B_{ij}(G) = \frac{A_{ij}(G)}{k_{in}(v_j)},$$
where v_i, v_j ∈ V \ M. From this definition, we obtain the explicit dependency of Φ on the adjacency matrix, namely [40],
$$\phi_{ij}(G) = \sum_{k=0}^{L(G)} \left([B^{T}(G)]^{k}\right)_{ij},$$
where the k = 0 term is the identity matrix (so that φ_ii = 1), and accordingly, we have
$$h(v_i) = \sum_{v_j \in V \setminus M} \sum_{k=0}^{L(G)} \left([B^{T}(G)]^{k}\right)_{ij} \log k_{in}(v_j).$$
It is worth mentioning that Φ(G) resembles the transition matrix related to the concept of information mobility [20]. In the general case of non-directed graphs, one can assume the presence of paths of arbitrary length, which leads (using a correction factor tied to the length of the path) to an asymptotic form of the transition matrix in terms of the exponential of the adjacency matrix. However, the intrinsically finite nature of the paths in a given DAG makes the above asymptotic treatment non-viable.
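The relation between Φ and the adjacency matrix can be checked numerically. The sketch below assumes the auxiliary matrix is B_ij = A_ij / k_in(v_j) restricted to V \ M, so that its transpose describes one backward step of the walker, and that Φ is the sum of the powers of B^T including the zeroth one. It then compares the result with a brute-force enumeration of backward paths on a hypothetical 5-node DAG. All names are illustrative.

```python
import numpy as np

def transition_matrix(A):
    """Return (Phi, rest): Phi[i, j] is the probability that a backward walk started at
    non-maximal node rest[i] visits non-maximal node rest[j]."""
    A = np.asarray(A, dtype=float)
    k_in = A.sum(axis=0)
    rest = np.flatnonzero(k_in > 0)            # indices of V minus M
    # Assumed auxiliary matrix: B[i, j] = A[i, j] / k_in(v_j), restricted to V minus M.
    B = A[np.ix_(rest, rest)] / k_in[rest]
    step = B.T                                 # one backward step of the walker
    Phi = np.eye(len(rest))                    # zeroth power: the start node itself
    power = np.eye(len(rest))
    for _ in range(len(rest)):                 # B is nilpotent on a DAG, so this suffices
        power = power @ step
        Phi += power
    return Phi, rest

def brute_force_visits(A, start):
    """Enumerate every backward path from `start`, adding its probability to each visited node."""
    A = np.asarray(A)
    visits = np.zeros(A.shape[0])
    def walk(node, prob):
        parents = np.flatnonzero(A[:, node])
        if parents.size == 0:                  # reached a maximal node: stop
            return
        visits[node] += prob
        for parent in parents:
            walk(parent, prob / parents.size)
    walk(start, 1.0)
    return visits

# Hypothetical example DAG: 0->2, 1->2, 1->3, 2->3, 3->4.
A = np.zeros((5, 5))
for i, j in [(0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]:
    A[i, j] = 1.0
Phi, rest = transition_matrix(A)
h = Phi @ np.log(A.sum(axis=0)[rest])          # h(v_i) = sum_k phi_ik * log k_in(v_k)
```

The matrix route avoids enumerating paths explicitly, which is the practical benefit of the decomposition: the brute-force walker grows with the number of paths, whereas the matrix sum needs at most n − m products.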

C. The general form of the Entropy
Let us now define the overall entropy in a compact form, only depending on the adjacency matrix of the graph. From eqs. (8,11,13), we obtain
$$H(G) = \sum_{v_i \in V \setminus M} p(v_i) \sum_{v_k \in V \setminus M} \phi_{ik}(G) \log k_{in}(v_k).$$
This is the central equation of this paper. This measure quantifies the additional information (other than the topological one) needed to properly revert the causal flow. We observe that this expression is a noise term within standard information theory [28]. In this equation we have been able to decouple the combinatorial term associated to the multiplicity of paths, on the one hand, and the particular contribution of every node to the overall uncertainty, on the other hand. The former is fulfilled by the matrix Φ, which encodes the combinatorial properties of the system and how they influence the computation of the entropies. The latter is obtained from the set of local entropies h_L(v_1), ..., h_L(v_{n−m}). These terms account for the contribution of local topology (i.e. the uncertainty when choosing an incoming link at the node level in the reversion of the causal flow) to the overall entropy. This uncoupling is a consequence of the extensive property of the entropy and, putting aside its conceptual interest, simplifies all derivations related to the uncertainties, since we are not forced to compute the complex series arising in the brute-force calculation of entropies.
This general expression of the entropy can be simplified if we assume that ∀v_i ∈ V \ M, p(v_i) = 1/(n − m). Then H(G) is expressed as:
$$H(G) = \frac{1}{n-m} \sum_{v_i \in V \setminus M} \sum_{v_k \in V \setminus M} \phi_{ik}(G) \log k_{in}(v_k). \tag{22}$$
Finally, we recall that the above entropy is bounded by Jensen's inequality (7), i.e.,
$$H(G) \le \frac{1}{n-m} \sum_{v_i \in V \setminus M} \log |\Pi(v_i)|. \tag{23}$$
Notice that the quantity on the right side of eq. (23) is the uncertainty obtained by considering all paths from M to v_i equally likely to occur.
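The entropy under uniform p(v_i) and its Jensen bound can be evaluated with a few matrix operations. The following sketch assumes NumPy, natural logarithms, a graph with at least one edge, and the same reading of the transition matrix as a sum of powers of the column-normalized, transposed adjacency matrix on V \ M; the path counts |Π(v_i)| are obtained from powers of A restricted to rows in M. Function names are illustrative.

```python
import numpy as np

def topological_entropy(A):
    """H(G) for uniform p(v_i) = 1/(n - m): mean over V minus M of sum_k phi_ik log k_in(v_k)."""
    A = np.asarray(A, dtype=float)
    k_in = A.sum(axis=0)
    rest = np.flatnonzero(k_in > 0)                 # non-maximal nodes
    step = (A[np.ix_(rest, rest)] / k_in[rest]).T   # one backward step of the walker
    phi = np.eye(len(rest))
    power = np.eye(len(rest))
    for _ in range(len(rest)):
        power = power @ step
        phi += power
    return float((phi @ np.log(k_in[rest])).mean())

def jensen_bound(A):
    """Mean over V minus M of log |Pi(v_i)|, the number of paths from M to v_i."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    maximal = A.sum(axis=0) == 0
    paths = np.zeros(n)
    power = np.eye(n)
    for _ in range(n - 1):                          # paths in a DAG have at most n - 1 links
        power = power @ A
        paths += power[maximal].sum(axis=0)
    return float(np.log(paths[~maximal]).mean())

# Hypothetical example DAG: 0->2, 1->2, 1->3, 2->3, 3->4.
A = np.zeros((5, 5))
for i, j in [(0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]:
    A[i, j] = 1.0
H, bound = topological_entropy(A), jensen_bound(A)
```

For this example the three non-maximal nodes are reached by 2, 3 and 3 paths respectively, and the entropy lies strictly below the bound because the backward paths into nodes 3 and 4 are not equally likely.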

D. Topological reversibility
Having defined an appropriate and well-grounded entropy measure, we can now discuss the meaning of topological (ir)reversibility. Let us first make a qualitative link with the standard theory of irreversible thermodynamics, where irreversibility is tied to the entropy production parameter σ_s in the entropy balance equation [15]. Here, σ_s = 0 depicts thermodynamically reversible processes, whereas σ_s > 0 appears in irreversible processes [14,15]. Irreversibility is rooted in the impossibility of reverting the process without generating a negative amount of entropy, which contradicts the second law of thermodynamics. Consistently, we will call topologically reversible those DAG structures such that
$$H(G) = 0.$$
In those structures (they belong to the set of trees, as we shall see in the following section) no ambiguity arises when performing the reversion process. On the contrary, a given DAG for which H(G) > 0 will be referred to as topologically irreversible. DAGs having H(G) > 0 display some degree of uncertainty when taking the causal flow backwards, since the reversion process is subject to some inevitable random decisions. In these cases, H(G) is the average amount of extra information needed to successfully perform the process backwards. Similarly, the successful reversion of a thermodynamically irreversible process would imply the (irreversible) addition of external energy, and the reversion of a logically irreversible computation requires an extra amount of external information to resolve the ambiguity arising in rewinding the chain of computations. In this context, for example, reversible computation is defined by considering a system that stores the history of the computational process [10]. Furthermore, we observe that, roughly speaking, we can associate the logical (ir)reversibility of a computational process to the topological (ir)reversibility of its DAG representation.
In our study, the adjective topological arises from the fact that we only use topological information to compute the uncertainty. Thus, we deliberately neglect the active role that a given node can play as, for example, a processing unit, or the different weights of the paths. However, it is worth mentioning that the entropy can be generalized to DAGs where links are weighted by a probability to be chosen in the process of reaching the maximal.

IV. LIMIT CASES: MAXIMUM AND MINIMUM UNCERTAINTY
Let us illustrate our previous results by exploring two limit cases, namely DAGs having zero or maximal uncertainty. In this section we identify those feed-forward structures which, containing n nodes and without a predefined number of links, minimize or maximize the above uncertainties. For example, a chain having m = 1 will display H(G) = 0, whereas the graph that is in some sense its opposite, the star having m = n − 1, will have H(G) = log(n − 1). The derivation of the limit scenarios will be more sophisticated, due to the active role of combinatorics in defining the paths. The minimum uncertainty is obtained when the graph G is a special kind of tree, to be described below. Afterwards, we also derive the graph configuration with maximum entropy. The conceptual starting point of this derivation is the graph representation of the linear order.

A. Zero Uncertainty: Trees
Imagine a random walker exploring a (directed) tree containing only a single maximal (fig. 3a). From such a maximal node, there exists only one path to a given node. In the evolutionary context, a single ancestor is at the root of every evolutionary tree [35]. Thus, the process of recovering the history of the random walker up to its initial condition is completely deterministic, and no uncertainty can be associated to it, in purely topological terms. Formally, we recognize two defining features of trees, namely:
$$m = 1, \qquad (\forall v_i \in V \setminus M)\ k_{in}(v_i) = 1.$$
We thus conclude that there is no uncertainty in recovering the flow, since the two reported properties are enough to conclude that there is one and only one path from M to any v_i ∈ V \ M. This agrees with the intuitive idea that trees are perfect hierarchical structures.
This result complements the more standard scenario of the forward, downstream paths followed by a random walker on a tree [16]. It is worth noting that evolutionary trees, particularly in unicellular organisms, have been found to be a poor representation of the actual evolutionary process [36,37].
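The zero-uncertainty property of trees can be verified directly from the path-based definition of h(v_i), without using the transition matrix. The sketch below is an illustration under stated assumptions: graphs are given as hypothetical parent lists, each backward step chooses uniformly among the parents, and p(v_i) is uniform over non-maximal nodes. It compares a small rooted tree with the same tree after adding one cross-link.

```python
import math

def backward_path_probabilities(parents, node):
    """Probabilities of all backward paths from `node` to a maximal node,
    choosing uniformly among the parents at every step."""
    if not parents.get(node):
        return [1.0]                      # maximal node: the path ends here
    k_in = len(parents[node])
    return [p / k_in
            for parent in parents[node]
            for p in backward_path_probabilities(parents, parent)]

def topological_entropy(parents, nodes):
    """Average over non-maximal nodes of the entropy of the backward-path distribution."""
    non_maximal = [v for v in nodes if parents.get(v)]
    def h(v):
        return -sum(p * math.log(p) for p in backward_path_probabilities(parents, v))
    return sum(h(v) for v in non_maximal) / len(non_maximal)

# Hypothetical rooted tree: every non-root node has exactly one parent.
nodes = ["a", "b", "c", "d", "e"]
tree = {"b": ["a"], "c": ["a"], "d": ["b"], "e": ["b"]}
# Same tree plus one cross-link c -> d, so that d now has two parents.
cross_linked = {**tree, "d": ["b", "c"]}

H_tree = topological_entropy(tree, nodes)
H_cross = topological_entropy(cross_linked, nodes)
```

In the tree every backward path has probability one, so the entropy vanishes; the single cross-link introduces one binary choice at node d, which contributes log 2 to h(d) and log(2)/4 to the average over the four non-maximal nodes.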

B. Maximum Uncertainty
Now we consider the maximum entropic scenario. For this purpose, we cut the problem in two pieces: First, we constructively obtain the feed-forward graph containing m maximal nodes that maximizes H(G). Once we have identified such a feed-forward configuration, we ask for the m that maximizes such a quantity. Let G be a feed-forward organized graph containing n nodes, where m of them are maximal. Since for the entropy computation all nodes become indistinguishable, let g(m, n) be the ensemble of different possible feed-forward configurations containing n nodes, where m of them are maximal. We are looking for the graph, to be written as G̃ ∈ g(m, n), that maximizes the entropy over all G_i ∈ g(m, n); as argued below, this is a graph containing all possible links, preserving the number of maximal nodes. This implies, as defined in section II A, eq. (5), that we must add links to the set V \ M until it becomes linearly ordered, attending to a labeling of nodes which respects the ordering depicted by the feed-forward graph (see fig. 1c). Once we have the set of nodes V \ M linearly ordered, we proceed to generate a link from any node v_i ∈ M to any node v_k ∈ V \ M. We thus obtain a feed-forward graph containing m maximal nodes and only one minimal node. In the graph constructed above, any new link creates a cycle or destroys a maximal vertex. Furthermore, given two fixed values of m and n, it is straightforward to demonstrate that it maximizes any entropy based on paths: any feed-forward graph of the ensemble g(m, n) other than G̃ is obtained by removing edges of G̃. This edge removal process will necessarily result in a reduction of uncertainty.
For the sake of clarity we differentiate the labeling of M and V \ M when working with G̃. Specifically, nodes v_i ∈ V \ M will be labeled sequentially from 1 to n − m respecting the ordering defined in eq. (4). This labeling will be widely used in the forthcoming sections. Furthermore, we recall that no special labeling other than different natural numbers is needed for v_k ∈ M, since there will be no ambiguous situations. Given the labeling proposed above, and starting from eq. (15), the number of paths in G̃ from M to v_i ∈ V \ M will be:
$$|\Pi(v_i)| = m\, 2^{i-1}.$$

2. The explicit form of entropies in the linear ordering of V \ M.
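We first bound H(G̃) using Jensen's inequality. Indeed, from eq. (7) we can derive an upper bound for H(G̃), namely
$$H(\tilde{G}) \le \frac{1}{n-m} \sum_{i \le n-m} \log\left(m\, 2^{i-1}\right) = \log m + \frac{n-m-1}{2} \log 2.$$
We can go further, first computing the probabilities defining the matrix Φ(G̃). To compute these probabilities, let us suppose we are in node v_i ∈ V \ M. The first observation is that the probability to reach a given maximal is 1/m. What about v_1, i.e., the first node we find after the maximal set? We observe that, from the node v_i, the situation is completely analogous to the situation where there are m + 1 maximal nodes, since the probability to pass through v_1 does not depend on what happens above v_1. Therefore:
$$\phi_{i1}(\tilde{G}) = \frac{1}{m+1},$$
and running the reasoning from v_1 to v_{i−1}, we find that:
$$\phi_{ik}(\tilde{G}) = \frac{1}{m+k}, \qquad k < i.$$
Interestingly, for k < i, φ_ik is invariant, no matter the value of i. This leads matrix Φ(G̃) to be:
$$\Phi(\tilde{G}) = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ \frac{1}{m+1} & 1 & 0 & \cdots & 0 \\ \frac{1}{m+1} & \frac{1}{m+2} & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \frac{1}{m+1} & \frac{1}{m+2} & \frac{1}{m+3} & \cdots & 1 \end{pmatrix}, \tag{27}$$
and the final expression is obtained by observing that
$$k_{in}(v_k) = m + k - 1,$$
and therefore, inserting it and (27) into eq. (22), we obtain after some algebra:
$$H(\tilde{G}) = \frac{n}{n-m} \sum_{i \le n-m} f(i), \qquad f(i) \equiv \frac{\log(m+i-1)}{m+i}. \tag{29}$$
We can see that the value of the entropy is reduced to the computation of the average of f over the set V \ M. If G̃ contains n nodes, m of them being the maximal ones, we will refer to this average as ⟨f(n, m)⟩, defined as:
$$\langle f(n, m) \rangle = \frac{1}{n-m} \sum_{i \le n-m} f(i). \tag{30}$$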
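The properties of the linearly ordered configuration can be checked numerically on a small instance. The sketch below assumes NumPy and natural logarithms, builds a graph with m maximal nodes linked to every non-maximal node and a linearly ordered set V \ M, and compares three quantities against closed forms that we treat as assumptions to be verified rather than as given: the path counts m·2^(i−1), the off-diagonal transition probabilities 1/(m + k), and the entropy (n/(n − m)) Σ log(m + k − 1)/(m + k). All function names are illustrative.

```python
import numpy as np

def most_entropic_dag(n, m):
    """Nodes 0..m-1 are maximal; nodes m..n-1 are linearly ordered; every maximal node
    points to every non-maximal node. A[i, j] = 1 iff v_i -> v_j."""
    A = np.triu(np.ones((n, n)), k=1)   # all links i -> j with i < j
    A[:m, :m] = 0.0                     # no links among maximal nodes
    return A

def transition_matrix(A, m):
    """Sum of powers of the transposed, column-normalized adjacency matrix on V minus M."""
    k_in = A.sum(axis=0)[m:]
    step = (A[m:, m:] / k_in).T
    phi = np.eye(len(k_in))
    power = np.eye(len(k_in))
    for _ in range(len(k_in)):
        power = power @ step
        phi += power
    return phi, k_in

def path_counts(A, m):
    """Number of paths from the maximal set to each non-maximal node."""
    n = A.shape[0]
    total, power = np.zeros(n), np.eye(n)
    for _ in range(n - 1):
        power = power @ A
        total += power[:m].sum(axis=0)
    return total[m:]

def closed_form_entropy(n, m):
    """Assumed closed form: (n / (n - m)) * sum_k log(m + k - 1) / (m + k)."""
    k = np.arange(1, n - m + 1)
    return n / (n - m) * np.sum(np.log(m + k - 1) / (m + k))

n, m = 7, 2                              # hypothetical sizes for the check
A = most_entropic_dag(n, m)
phi, k_in = transition_matrix(A, m)
H = float((phi @ np.log(k_in)).mean())   # uniform average over V minus M
```

Agreement on one small instance is of course not a proof, but it is a cheap way to catch index errors (for example, confusing k_in(v_k) = m + k − 1 with m + k) before relying on the closed form.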

Absolute maxima of entropies
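What is the relation between n and m maximizing the above entropies? As we shall see, given a fixed value of n, the absolute maximum is found in the linear ordering defined above at m* = 2, for graph sizes n ≫ 1. To support the above claim, let us first notice that:
$$\sum_{i \le n-1} \frac{\log i}{i+1} = \sum_{i \le n-2} \frac{\log(i+1)}{i+2},$$
since the i = 1 term of the left-hand side vanishes, enabling us to derive the first inequality:
$$H(\tilde{G})\big|_{m=1} < H(\tilde{G})\big|_{m=2}.$$
The second inequality, H(G̃)|_{m=2} > H(G̃)|_{m=3}, holds if and only if the average ⟨f(n, 2)⟩ is smaller than the first term of the sum, (log 2)/3. To this end, let us first observe a key property of f, defined in eq. (29). Indeed, we observe that (∀ε > 0)(∃k_ε) :
$$(\forall k > k_\epsilon)\quad f(k) < \epsilon,$$
provided that n is large enough. From this property, and since ⟨f(n, m)⟩ is an average (see eq. (30)), we can be sure that (∃n*) : (∀n > n*),
$$\langle f(n, 2) \rangle < \frac{\log 2}{3},$$
by choosing n appropriately, in such a way that we have enough terms lower than a given ε to obtain the above desired result. Thus, from eq. (30) and knowing that H(G̃) is proportional to the sum of the f(i) (with proportionality factor equal to n/(n − 2) for m = 2), we can conclude that
$$H(\tilde{G})\big|_{m=2} > H(\tilde{G})\big|_{m=3}.$$
The general case easily derives from the same reasoning, since for m ≥ 2:
$$H(\tilde{G})\big|_{m} > H(\tilde{G})\big|_{m+1} \leftrightarrow \langle f(n, m) \rangle < \frac{\log m}{m+1},$$
and thus, we can conclude that, provided n is large enough:
$$H(\tilde{G})\big|_{m=2} > H(\tilde{G})\big|_{m=3} > H(\tilde{G})\big|_{m=4} > \cdots$$
This closes the demonstration that G̃ containing m = 2 maximal nodes is the most entropic graph provided that n > 14, according to numerical computations.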
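The location of the maximum can be explored numerically with a brief scan over m. The sketch below assumes the closed form H(G̃) = (n/(n − m)) Σ_k log(m + k − 1)/(m + k) for the linearly ordered configuration (an expression obtained by inserting φ_ik = 1/(m + k) and k_in(v_k) = m + k − 1 into the uniform-average entropy, and worth re-deriving before relying on it) and natural logarithms; the base does not affect which m is optimal. Function names are illustrative.

```python
import math

def closed_form_entropy(n, m):
    """Assumed H of the linearly ordered configuration with n nodes and m maximal nodes."""
    return n / (n - m) * sum(math.log(m + k - 1) / (m + k)
                             for k in range(1, n - m + 1))

def most_entropic_m(n):
    """Number of maximal nodes m in 1..n-1 that maximizes the closed-form entropy."""
    return max(range(1, n), key=lambda m: closed_form_entropy(n, m))

# Scan a few hypothetical graph sizes around the reported threshold.
best = {n: most_entropic_m(n) for n in (14, 15, 100)}
```

Under these assumptions the scan returns m = 3 for n = 14 and m = 2 for n = 15 and n = 100, in line with the numerically determined threshold n > 14. As a sanity check, m = n − 1 reproduces the star value log(n − 1).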

V. DISCUSSION
In this paper we address the problem of quantifying path dependencies using the DAG metaphor. To this end, we introduce the concept of topological reversibility as a fundamental feature of causal processes that can be depicted by a DAG structure. The intuitive definition is rather simple: a system formed by an aggregation of causal processes is topologically reversible if we can recover all causal paths with no other information than that provided by the graph topology. If the graph topology induces some kind of ambiguity in the backward process, the graph is said to be topologically irreversible, and additional information is needed to build the backward flows.
We provided the analytical form of the uncertainty (the amount of extra information needed) arising in the reversion process by uncoupling the combinatorial information encoded by the graph structure from the contributions of the local connectivity patterns of individual nodes, as depicted in eqs. (22,21). It is worth noting that all our results are derived from just two basic concepts: the adjacency matrix of the graph and the definition of entropy. Furthermore, we offer a constructive derivation of the two limit cases, namely trees (as the reversible ones) and linearly ordered graphs having two maximal nodes (as the most uncertain ones).
According to our results, only a tree DAG is topologically reversible. However, beyond this singular case, the quantification of topological irreversibility by using the entropy proposed here could provide insights into the characterization of feed-forward systems. An illustrative case study can be found precisely in biological evolution. The standard view of the tree of life involves a directional, upward time-arrow where the genetic structure of a given species (its genome) derives from some ancestor after splitting (speciation) events. One would think that this classical but too simplistic view of evolution as a tree gives a topologically reversible lineage of genes, changing by mutations and passing from the initial ancestor to current species by vertical inheritance. However, recent evidence shows that so-called horizontal gene transfer among unrelated species may have had a deep impact on the evolution and diversification of microbes [37]. According to this genetic mechanism, the tree-like structure, and thus the logical/topological reversibility, is broken by the presence of cross-links between sibling species. In light of this evidence, tree-based phylogenies become unrealistic. In this context, our theoretical approach provides a suitable framework for the characterization of the logical irreversibility of biological evolution and, in general, for any process where time or energy dissipation imposes a feed-forward chart of events. Further research on this topic will contribute to understanding the causal structure of evolutionary processes.