Algorithmic independence of initial condition and dynamical law in thermodynamics and causal inference

We postulate a principle stating that the initial condition of a physical system is typically algorithmically independent of the dynamical law. We argue that this principle links thermodynamics and causal inference. On the one hand, it entails behaviour similar to the usual arrow of time. On the other hand, it motivates a statistical asymmetry between cause and effect that has recently been postulated in the field of causal inference, namely, that the probability distribution P(cause) contains no information about the conditional distribution P(effect|cause) and vice versa, while P(effect) may contain information about P(cause|effect).


Introduction
Drawing causal conclusions from statistical data is at the heart of modern scientific research. While it is generally accepted that active interventions on a system (e.g. randomized trials in medicine) reveal causal relations, statisticians have widely shied away from drawing causal conclusions from passive observations. Meanwhile, however, the growing interdisciplinary field of causal inference has shown that the latter is also possible, even without information about time order, if appropriate assumptions linking causality and statistics are made [1][2][3], with applications in biology [4], psychology [5], and economics [6]. More recently, foundational questions of quantum physics have also been revisited in light of the formal language and paradigms of causal inference [7][8][9][10][11][12][13].
Remarkably, recent results from causal inference have also provided new insights about the thorny issue of the arrow of time. Contrary to a widespread belief, the joint distribution $P_{X,Y}$ of two variables X and Y sometimes indicates whether X causes Y or vice versa [14]. More conventional methods rely on conditional independencies and thus require statistical information on at least three observed variables [1,2]. The intuitive idea behind the new approach is that if X causes Y, then $P_X$ contains no information about $P_{Y|X}$ and vice versa. Within this context it is not obvious, however, how to make the meaning of 'information' precise. Accordingly, different formalizations of this intuitive notion have been proposed.
The algorithmic information approach proposed in [15,16] gives a precise meaning to information by postulating that knowing $P_{Y|X}$ does not admit a shorter description of $P_X$ and vice versa. This is the approach we will follow more closely here, in particular for drawing a link to thermodynamics, given that algorithmic information has already been related to thermodynamic entropy [17]. Nevertheless, we should also mention an interpretation of the meaning of information recently stated in the context of machine learning, more precisely in semi-supervised learning (SSL).
SSL algorithms learn the statistical relation between two random variables X and Y from some (x, y)-pairs $(x_1, y_1), \ldots, (x_n, y_n)$ plus some unpaired instances $x_{n+1}, \ldots, x_{n+k}$. Why should the unpaired instances provide any information about $P_{Y|X}$? References [20,21] draw the link to causality via the conjecture that the additional x-values do not help to learn $P_{Y|X}$ if X is the cause and Y the effect (the 'causal learning' scenario), while they may help when Y is the cause and X the effect (the 'anticausal learning' scenario). The hypothesis is supported by a meta-study that found success cases of SSL in the literature only for the anticausal but not for the causal learning scenario [20,21]^6. This suggests that the 'no information' idea, despite its apparent vagueness, describes an asymmetry between cause and effect that is already relevant for scientific tasks other than causal inference. It is thus natural to explore such asymmetries in the context of physical systems.
As a matter of fact, like the asymmetries between cause and effect, similar asymmetries between past and future are manifest even in stationary time series [22], which can sometimes be used to infer the direction of empirical time series (e.g. in finance or brain research) or to infer the time direction of movies [23]. Altogether, these results suggest a deeper connection between the asymmetries of cause versus effect and past versus future. In particular, a physical toy model relating such asymmetries to the usual thermodynamic arrow of time has been proposed [24].
Motivated by all these insights, we propose a foundational principle underlying both types of asymmetries, cause versus effect and past versus future. The contributions of this paper are the following: (1) We postulate a principle stating that the initial state of a physical system and the dynamical law to which it is subject should be algorithmically independent.
(2) As we show, for a closed system this principle implies the non-decrease of physical entropy if the latter is identified with algorithmic complexity (also called 'Kolmogorov complexity'). Thus, it reproduces the thermodynamic behavior of closed systems, given earlier insights on the thermodynamic relevance of algorithmic information proposed in the literature [17].
(3) Our principle yields new insights into open systems, and we apply it to a toy model representing typical cause-effect relations.
(4) We show that the algorithmic independence of $P_{\text{cause}}$ and $P_{\text{effect}|\text{cause}}$ stated earlier can be seen as part of this principle, if we identify cause and effect with the initial and final states of a physical system, respectively.
This paper thus links recently stated ideas from causal inference with a certain perspective of thermodynamics. To bridge such different lines of research, we start by reviewing several relevant ideas of both.
Algorithmic randomness in thermodynamics. We start by briefly introducing some basic notions of algorithmic information theory. The algorithmic randomness (also called 'algorithmic complexity', 'algorithmic information', or 'Kolmogorov complexity') K(s) of a binary string s is defined as the length of its shortest compression. More precisely, K(s) is the length of the shortest program on a universal Turing machine (with prefix-free encoding) that generates s and then stops [25,26]. We call the shortest such program^7, denoted $s^*$, the shortest compression of s.
The conditional algorithmic complexity $K(s|t)$ is defined as the length of the shortest program generating the output s from the input t. A slightly different quantity is $K(s|t^*)$, since the input $t^*$ is slightly more valuable than the input t. This is because a Turing machine is able to convert $t^*$ into t (by definition), while there can in principle be no algorithm that finds $t^*$ when t is given. One can show that $K(s|t)$ may exceed $K(s|t^*)$ by a term that grows at most logarithmically in the length of t. Accounting for subtleties of this kind, several statements in Shannon information theory have nice analogues in algorithmic information theory. For instance, the information of a pair^8 is given by the sum of the information of one string and the conditional information of the other:

$K(s,t) \stackrel{+}{=} K(s) + K(t\,|\,s^*)$.

As is common in algorithmic information theory [27], the equation is not exact; the equality sign is therefore marked by the symbol $\stackrel{+}{=}$, indicating an error term that can be upper bounded by a constant (which does not depend on the strings involved, but does depend on the Turing machine).
As a further analogue to Shannon information theory, algorithmic mutual information can be defined in three equivalent ways [26]:

$I(s:t) \stackrel{+}{=} K(s) + K(t) - K(s,t) \stackrel{+}{=} K(s) - K(s\,|\,t^*) \stackrel{+}{=} K(t) - K(t\,|\,s^*)$.

^6 Note that success of SSL does not necessarily imply that the unpaired x-values contained information about $P_{Y|X}$. Instead, they could have helped to fit a function that is particularly precise in regions where the x-values are dense. The meta-study suggests, however, that this more subtle phenomenon does not play the major role in current SSL implementations.
^7 If there is more than one such program, we refer to the first one with respect to some standard enumeration of binary words.
^8 The algorithmic information of a pair of binary words can be defined by first converting the pair to a single string via some fixed bijection between pairs and single binary words; such a bijection can easily be constructed by enumerating binary words and then using some fixed bijection of $\mathbb{N}$ and $\mathbb{N} \times \mathbb{N}$.
Intuitively speaking, $I(s:t)$ is the number of bits saved when s and t are compressed jointly rather than independently, or, equivalently, the number of bits by which the description of s can be shortened when the shortest description of t is known, and vice versa.
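Since K is uncomputable, any practical use of these quantities must replace it by an upper bound. As a purely illustrative aside (not part of the paper's argument), one can use a general-purpose compressor such as zlib as a crude stand-in for the shortest description; the helper names `c` and `mi_proxy` below are ours:

```python
import random
import zlib

def c(s: bytes) -> int:
    # Compressed length in bits: a crude, compressor-dependent upper bound on K(s).
    return 8 * len(zlib.compress(s, 9))

def mi_proxy(s: bytes, t: bytes) -> int:
    # Proxy for I(s:t) =+ K(s) + K(t) - K(s,t): bits saved by compressing jointly.
    return c(s) + c(t) - c(s + t)

rng = random.Random(0)
s = bytes(rng.randrange(256) for _ in range(2000))             # incompressible string
t = s[:1000] + bytes(rng.randrange(256) for _ in range(1000))  # shares 1000 bytes with s
u = bytes(rng.randrange(256) for _ in range(2000))             # generated independently of s

# The pair (s, t) shares far more "information" than the independently generated pair (s, u).
print(mi_proxy(s, t), mi_proxy(s, u))
```

The compressor only detects a narrow class of regularities, so this proxy can badly underestimate the true mutual information; it merely illustrates the "bits saved by joint compression" reading of $I(s:t)$.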
There are cases where Shannon information and algorithmic information basically coincide: let $x_1, \ldots, x_n$ be samples drawn from a fixed distribution $P_X$ on some alphabet $\mathcal{X}$. If $s_n$ denotes the binary encoding of the n-tuple $(x_1, \ldots, x_n)$, then the algorithmic information rate $K(s_n)/n$ converges almost surely [27] to the Shannon entropy

$H(P_X) := -\sum_{x \in \mathcal{X}} p(x) \log p(x)$.

Here and throughout the paper, lower case letters denote probability densities (for discrete distributions the density is just the probability mass function) corresponding to the respective distributions. For instance, $p_X$ denotes the density of $P_X$, and whenever this causes no confusion, we write $p(x)$ instead of $p_X(x)$ for the sake of convenience.
The more interesting aspects of algorithmic information, however, are those where the information content of a string cannot be derived from Shannon entropy. On the one hand, the asymptotic statement on the information rate blurs the fact that a typical n-tuple can be compressed to length $nH(P_X)$ only when the distribution is known. Hence, the description length of the distribution also needs to be accounted for: in order to achieve the compression length $nH(P_X)$ one would need to know $p_X$, so the full description of $s_n$ involves describing $p_X$ as well [28]. In the context of causal inference it has been pointed out [15] that the description length of the joint distribution of some random variables sometimes also contains information about the underlying causal links between the variables. Therefore, in causal discovery, restricting attention to Shannon information unavoidably ignores essential aspects of information.
A second reason why Shannon information is not sufficient for our purpose is that a string may not result from independent sampling at all. If, for instance, s describes the state of a multi-particle system, the particles may have interacted and hence the particle coordinates may be correlated. Then, treating the joint state of the system as if each particle coordinate had been drawn independently at random overestimates the description length, because it ignores the correlations. In this sense, algorithmic information includes aspects of information that purely statistical notions of information cannot account for.
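Both points can be made concrete with a small numerical sketch (our illustration, again with zlib's compressed length as an upper-bound stand-in for K): an i.i.d. string compresses at a rate in the vicinity of the Shannon entropy, while a "correlated multi-particle" string, built by repeating one random block, compresses far below what its per-symbol statistics would suggest:

```python
import math
import random
import zlib

def rate(s: bytes) -> float:
    # compressed bits per symbol: a crude upper-bound proxy for K(s)/n
    return 8 * len(zlib.compress(s, 9)) / len(s)

rng = random.Random(1)

# (a) i.i.d. Bernoulli(p) symbols: rate stays above H(P_X) and within a modest factor of it
p = 0.1
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # binary entropy, about 0.469 bits/symbol
iid = bytes(1 if rng.random() < p else 0 for _ in range(100_000))

# (b) correlated string: one random 1000-byte block repeated 100 times; its true
# description is short even though the per-symbol statistics look almost incompressible
block = bytes(rng.randrange(256) for _ in range(1000))
correlated = block * 100

print(f"H = {H:.3f}, iid rate = {rate(iid):.3f}, correlated rate = {rate(correlated):.4f}")
```

A general-purpose compressor does not reach the entropy exactly, but the qualitative gap is robust: ignoring the correlations in case (b) would overestimate the description length by orders of magnitude.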
In a seminal paper, Bennett [29] proposed to consider K(s) as the thermodynamic entropy of a microscopic state of a physical system when s describes the latter with respect to some standard binary encoding after sufficiently fine discretization of the phase space. This assumes an 'internal' perspective (followed in parts of this paper), where the microscopic state is perfectly known to the observer. Although K(s) is in principle uncomputable, it can be estimated from the Boltzmann entropy in many-particle systems, given that the microscopic state is typical in a set of states satisfying some macroscopic constraints [17,29]. That is, in practice one needs to rely on more conventional definitions of physical entropy.
From a theoretical and fundamental perspective, however, it is appealing to have a definition of entropy that neither relies on missing knowledge, like the statistical Shannon/von Neumann entropy [30,31], nor on the separation between microscopic versus macroscopic states, which becomes problematic on the mesoscopic scale, like the Boltzmann entropy [32]. For imperfect knowledge of the microscopic state, Zurek [17] considers thermodynamic entropy as the sum of statistical entropy and Kolmogorov complexity [33], which thus unifies the statistical and the algorithmic perspectives on physical entropy.
To discuss how K(s) behaves under Hamiltonian dynamics, notice that the dynamics on a continuous space is usually not compatible with discretization, which immediately introduces statistical entropy in addition to the algorithmic term, particularly for chaotic systems [34], in agreement with the standard entropy increase caused by coarse-graining [35,36]. Remarkably, however, K(s) can also increase by applying a one-to-one map D on a discrete space [17]. Then $K(s) + K(D)$ is the tightest upper bound on $K(D(s))$ that holds in general. For a system starting in a simple initial state s and evolving by repeated application of some simple map $\tilde{D}$, the description of $s_t := D_t(s) := \tilde{D}^t(s)$ essentially amounts to describing t, and Zurek derives a logarithmic entropy increase until the scale of the recurrence time is reached [17]. Although logarithmic growth is rather weak [34], it is worth mentioning that the arrow of time here emerges from assuming that the system starts in a simple state. We will later argue that this is just a special case of the idea proposed here, namely that the initial state is independent of D. The fact that K depends on the Turing machine could arguably spoil its use in physics. However, in the spirit of Deutsch's idea that the laws of physics determine the laws of computation [37], future research may define a 'more physical version' of K by a computation model whose elementary steps directly use physically realistic particle interactions, see e.g. the computation models in [38][39][40]. Moreover, quantum thermodynamics [41] should rather rely on quantum Kolmogorov complexity [42].
Algorithmic randomness in causal inference. Reichenbach's principle [43] states that every statistical dependence between two variables X, Y must involve some sort of causation: either direct causation (X causes Y or vice versa) or a common cause of both X and Y. Conversely, variables without causal relation are statistically independent. However, causal relations in real life are not always inferred from statistical relations. Often, one just observes similarities between single objects that indicate a causal relation. As argued in [15], two binary words $x, y$ representing two causally disconnected objects should be algorithmically independent, i.e.

$I(x:y) \stackrel{+}{=} 0$.
Depending on the context, we will here read the equality sign $\stackrel{+}{=}$ in two different ways: for theorems, symbols like $x, y$ are considered placeholders for strings that can be arbitrarily long. Then $\stackrel{+}{=}$ means that the error term does not grow with the length of the strings (although it does depend on the Turing machine). In a concrete application where x and y are fixed finite strings, this is certainly meaningless. Then we interpret $\stackrel{+}{=}$ as saying that the error is 'small' compared to the complexity of the strings under consideration (provided that the latter are complex enough). The decision about what 'sufficiently complex' means is certainly difficult, but analogous issues also occur in statistics: rejecting or accepting statistical independence likewise depends on the choice of significance levels (which can never be chosen on purely scientific grounds), since statistical independence actually refers to an infinite sample limit that is never reached in real life. For the sake of simplicity, we will henceforth just distinguish between dependent versus independent.
Rephrasing the ideas of [15], one could say that algorithmic independence between objects is what typically happens when objects are generated without causal relations, i.e., without information exchange. To elaborate on this idea, [16] considers a model where strings are created according to Solomonoff's prior [44], defined as the distribution of outputs obtained by uniformly randomizing all bits on the infinite input tape of a Turing machine and conditioning on the case that the program halts. It can be shown [45] that this results essentially (up to factors of order 1) in the following probability distribution on the strings:

$P(s) = c\, 2^{-K(s)}$,

where c is a normalization constant. Obviously, Solomonoff's prior assigns higher probability to simple strings. For this reason, it is often considered a very principled implementation of Occam's razor in the foundations of learning. Under this prior, if two strings x, y are generated by two independent random processes of this kind, the pair occurs with probability $c^2\, 2^{-(K(x)+K(y))}$. On the other hand, it occurs with probability $c\, 2^{-K(x,y)}$ when it is generated in a joint process. Thus,

$I(x:y) \stackrel{+}{=} K(x) + K(y) - K(x,y)$

measures the log of the probability ratio for these two kinds of occurrence (after neglecting the constant c). In this sense, the amount of algorithmic information shared by two objects (or, more precisely, by the strings encoding them) can be taken as measuring the evidence for the hypothesis that they are causally related. Here, one may again object that the dependence of $I(x:y)$ on the Turing machine renders the claim vague if not useless: x and y can be independent with respect to one Turing machine but significantly dependent with respect to a second one. Indeed, the asymptotic statement 'equal up to a constant' does not help.
Apart from our remarks above calling for 'natural' Turing machines for the purpose of physics, we mention that [15] discusses that the notion of being causally connected or not is also relative: assume, for instance, one considers the genomes of two humans. With respect to a 'usual' Turing machine we will observe a significant amount of algorithmic mutual information just because both genomes are from humans. On the other hand, given a Turing machine that is specifically designed for encoding human genomes, the mutual information is only significant if the subjects are related beyond both being human. Certainly, the fact that they are from the same species is also a causal relation, but if we focus on causal relations on top of this (i.e., relatedness in the sense of family relations), we should only look at algorithmic dependences with respect to a Turing machine that has access to the respective background information. In other words, the fact that algorithmic mutual information is relative fits well with causality being relative as well (in the above sense).
Reference [15] further elaborates on the idea of using algorithmic dependences to obtain causal information. It develops a graphical model based framework for inferring causal relations among n objects based on conditional algorithmic (in)dependences in analogy to conventional causal inference which infers causal graphs among n variables from conditional statistical (in)dependences [1,2].
More surprisingly, algorithmic information can also be used to infer whether X causes Y (denoted by $X \to Y$) or Y causes X from their joint distribution, given that exactly one of the alternatives is true^9. If $P_{\text{cause}}$ and $P_{\text{effect}|\text{cause}}$ are 'independently chosen by nature' and thus causally unrelated, [15] postulates that their algorithmic mutual information is negligible, formally

$I(P_{\text{cause}} : P_{\text{effect}|\text{cause}}) \stackrel{+}{=} 0$.   (1)

The postulate raises, however, the following questions for practical applications: first, the joint distribution of cause and effect is not known and can only be estimated from finite data. The estimated distribution may show dependences between $P_{\text{cause}}$ and $P_{\text{effect}|\text{cause}}$ that disappear in the infinite sample limit. Second, algorithmic mutual information is uncomputable. For these reasons, the independence postulate has only been used as an indirect justification of practical causal inference methods. We now describe two examples.
Causal inference for linear models with non-Gaussian noise. First, consider linear non-Gaussian additive noise models [47]: let the joint distribution $P_{X,Y}$ of two random variables X, Y be given by the linear model

$Y = \alpha X + N_Y$,   (2)

where $N_Y$ is an unobserved noise term that is statistically independent^10 of X. Whenever X or $N_Y$ is non-Gaussian, it follows that for every model of the form $X = \alpha' Y + N_X$, the noise term $N_X$ and Y are statistically dependent, although they may be uncorrelated. That is, except for Gaussian variables, a linear model with independent noise can hold in at most one direction. Within that context, [47] infers the direction admitting additive independent noise to be the causal one. To justify this reasoning, [48] argues that whenever (2) holds, the densities of $P_Y$ and $P_{X|Y}$ are related by a differential equation expressing $\partial^2 \log p(y)/\partial y^2$ in terms of partial derivatives of $\log p(x|y)$. Therefore, knowing $P_{X|Y}$ enables a short description of $P_Y$. Whenever $P_Y$ actually has a high description length (which can, of course, only be conjectured but never proven for the specific case under consideration), we thus reject $Y \to X$ as a causal explanation. It should be emphasized that this justification does not assume that causal relations in nature are always linear. Instead, the statement reads: whenever the joint distribution admits a linear model in one direction but not the other, the former is likely to be the causal direction. This is because it would be an implausible coincidence if $P_{\text{cause}}$ and $P_{\text{effect}|\text{cause}}$ together generated a joint distribution that admits a linear model from effect to cause.
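The asymmetry is easy to see in simulation. The following sketch is our own illustration (not the method of [47]): it generates $Y = X + N_Y$ with uniform (hence non-Gaussian) cause and noise, fits a least-squares line in both directions, and measures residual dependence by a deliberately crude proxy, the correlation between the squared residual and the squared centered regressor:

```python
import random

def fit_residuals(xs, ys):
    # ordinary least-squares regression of ys on xs; returns residuals y - (a + b*x)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def corr(us, vs):
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs)) / n
    vu = sum((u - mu) ** 2 for u in us) / n
    vv = sum((v - mv) ** 2 for v in vs) / n
    return cov / (vu * vv) ** 0.5

def dep(resid, reg):
    # crude dependence proxy: |corr(residual^2, centered regressor^2)|
    m = sum(reg) / len(reg)
    return abs(corr([r ** 2 for r in resid], [(g - m) ** 2 for g in reg]))

rng = random.Random(0)
n = 20_000
x = [rng.random() for _ in range(n)]        # non-Gaussian (uniform) cause
y = [xi + rng.random() for xi in x]         # Y = X + N_Y as in (2), with alpha = 1

forward = dep(fit_residuals(x, y), x)   # residual of y given x vs. x: near-independent
backward = dep(fit_residuals(y, x), y)  # residual of x given y vs. y: clearly dependent

print(f"forward dep ~ {forward:.3f}, backward dep ~ {backward:.3f}")
```

Only the causal direction leaves a residual that passes this (weak) independence check; the anticausal fit produces a residual whose spread visibly varies with Y.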
Information-geometric causal inference. Second, we consider the toy scenario described in [49,50]. Assume that X and Y are random variables with values in $[0,1]$, deterministically related by $Y = f(X)$, where f is a monotonically increasing one-to-one mapping of $[0,1]$. If X is the cause and Y the effect, then $P_{\text{effect}|\text{cause}}$ is uniquely described by f, while $P_{\text{cause}|\text{effect}}$ is given by $f^{-1}$. Hence, applying (1) to this special case yields

$I(p_X : f) \stackrel{+}{=} 0$.   (3)

As a computable surrogate for (3), references [49,50] postulate

$\int_0^1 \log f'(x)\, p(x)\, \mathrm{d}x = \int_0^1 \log f'(x)\, \mathrm{d}x$,   (4)

where $f'$ denotes the derivative of f. In words, averaging the logarithmic slope of f over $p_X$ is the same as averaging it over the uniform distribution. As already observed in [49,50], (4) is equivalent to uncorrelatedness between $\log f'$ and $p_X$. Here, one interprets both functions as random variables on the probability space $[0,1]$ endowed with the uniform distribution. The difference between the left- and the right-hand side of (4) can then be written as the covariance of these random variables:

$\mathrm{Cov}(\log f', p_X) = \int_0^1 \log f'(x)\, p_X(x)\, \mathrm{d}x - \int_0^1 \log f'(x)\, \mathrm{d}x \int_0^1 p_X(x)\, \mathrm{d}x = \int_0^1 \log f'(x)\, p(x)\, \mathrm{d}x - \int_0^1 \log f'(x)\, \mathrm{d}x$,

using $\int_0^1 p_X(x)\, \mathrm{d}x = 1$. To further justify (3), [50] discusses scenarios where functions f and distributions $P_X$ are independently generated at random in a way that ensures that (4) holds approximately with high probability. For instance, $P_X$ can be obtained by randomly distributing some peaks across the interval $[0,1]$. The same type of process can be used to generate a monotonic function f at random, because the cumulative distribution function of any strictly positive probability density on $[0,1]$ defines, as desired, a monotonic bijection of $[0,1]$. Stating that (4) typically holds approximately always relies on strong assumptions about the generating processes for $p_X$ and f. Therefore, (4) is just a pragmatic way of replacing algorithmic independence with a computable independence condition.
Intuitively, we read (4) as stating that some of the peaks of $p_X$ lie in regions where f has a large slope and some in regions with a small slope, such that on average the expectation of $\log f'$ over $p_X$ does not significantly differ from the one over the uniform distribution.
One can show [49,50] that the independence condition (4) implies a dependence in the backward direction, i.e., the output density $p_Y$ is positively correlated with $\log (f^{-1})'$:

$\mathrm{Cov}\!\left(\log (f^{-1})',\, p_Y\right) \ge 0$,   (5)

with equality if and only if f is the identity, which is stronger than mere uncorrelatedness. Hence, the output density $p_Y$ tends to be higher in regions where the function $f^{-1}$ is steep. This is because the function f 'focuses' points into regions where its derivative is small. In that sense, $p_Y$ contains information about the mechanism relating X and Y. Moreover, [49,50] show that (4) implies that the Shannon entropies of $p_Y$ and $p_X$ satisfy

$H(P_Y) \le H(P_X)$,   (6)

with equality if and only if f is the identity. This information theoretic implication is the main reason, among others, for stating (4) in terms of $\log f'$ instead of just $f'$. Intuitively, (6) holds because applying f to a density typically adds additional peaks, which makes the density less uniform. Only functions f that are adapted to the specific shape of the density $p_X$ can make it smoother. As a result, [49] proposes the cause to be the variable with the smaller entropy (subject, of course, to the assumption of a deterministic relation).
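The covariance identity behind (4) can be checked numerically. In the sketch below (our illustration; the choices $f(x) = (e^x - 1)/(e - 1)$ and $p_X(x) = 2x$ are arbitrary assumptions, not taken from [49,50]), the difference between the two sides of (4) agrees with $\mathrm{Cov}(\log f', p_X)$ under the uniform measure:

```python
import math

def fprime(x):
    return math.exp(x) / (math.e - 1)   # derivative of f(x) = (e^x - 1)/(e - 1), a bijection of [0,1]

def p(x):
    return 2 * x                        # assumed density p_X(x) = 2x on [0,1], integrates to 1

def integrate(g, n=100_000):
    # midpoint rule on [0, 1]
    return sum(g((i + 0.5) / n) for i in range(n)) / n

lhs = integrate(lambda x: math.log(fprime(x)) * p(x))  # E_{p_X}[log f']
rhs = integrate(lambda x: math.log(fprime(x)))         # E_uniform[log f']
cov = lhs - rhs * integrate(p)                         # Cov(log f', p_X), since E_uniform[p_X] = 1

print(lhs - rhs, cov)  # the two numbers agree: the deviation from (4) is exactly this covariance
# Side note on (6): for uniform p_X, condition (4) holds exactly, and
# H(p_Y) = H(p_X) + E[log f'] = rhs < 0 = H(p_X), a concrete instance of entropy decrease.
```

For this particular f and $p_X$ the deviation evaluates to $1/6$, so (4) is violated: this $p_X$ is "tuned" to the slope of f in exactly the way the independence postulate deems atypical.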

Results
A common root for thermodynamics and causal inference. To provide a unifying foundation connecting thermodynamics and causal inference, we postulate:

Principle 1 (Algorithmic independence between input and mechanism). If s is the initial state of a physical system and M a map describing the effect of applying the system dynamics for some fixed time, then s and M are algorithmically independent, i.e.,

$I(s : M) \stackrel{+}{=} 0$.   (7)

In other words, knowledge of s does not enable a shorter description of M (and vice versa, with the roles of s and M interchanged). Here, we assume that the initial state, by definition, is a state that has not interacted with the dynamics before. The last sentence requires some explanation to avoid erroneous conclusions. Below we discuss its meaning for an intuitive example (see the end of the paragraph 'Physical toy model for a deterministic nonlinear cause-effect relation'). The example will also suggest that states that are independent in the sense of 'never having seen the mechanism before' occur quite often in nature. Note that 'not seeing the mechanism' also excludes any preparation procedure for s that accounts for the length of the time interval during which the dynamics is active, because this information is, by definition, considered part of M.
Principle 1 is entailed by the assumption that there is no algorithmic dependence in nature without an underlying causal relation. By overloading notation, we have identified mechanism and state with their encodings as binary strings. Principle 1 needs to be taken with a grain of salt: there may be some information shared by s and M that we do not account for because we call it 'background' information. Assume, for instance, that we place some billiard balls on a pool table and randomly give them some momenta. In doing this, we are aware of the dynamical laws governing the balls, but we may not be aware of the exact size of the table. Then the latter is the decisive aspect of the dynamics that is algorithmically independent of the initial state. More generally, we consider the descriptions of M and s given some background information and postulate independence conditional on the latter. Although this renders the postulate somewhat tautological, it is still useful because it has non-trivial mathematical implications, although these have to be taken relative to the respective background information.
To address further potential issues with principle 1, note that generalizations of algorithmic mutual information to infinite strings can be found in [45], which then allows one to apply principle 1 to continuous physical state spaces. Here, however, we consider finite strings describing states after sufficiently fine discretization of the state space, neglecting issues arising from chaotic systems [34] for the sake of conciseness.
We should also discuss how to interpret the sign $\stackrel{+}{=}$ in this context. For fixed s and M, the mutual information takes one specific value, and stating that it is zero 'up to a constant term' does not make sense. A pragmatic interpretation is to replace 'up to a constant term' with 'up to a small term', where the decision of what counts as small will heavily depend on the context. A more principled interpretation is the following: in continuous space, the binary strings describing state and dynamics depend on the chosen discretization. Then $\stackrel{+}{=}$ can be read as stating that the algorithmic mutual information does not increase with finer discretization.

Dynamics of closed physical systems. Principle 1 has implications that follow from the independence condition (7) regardless of why the independence holds in the first place. It may hold because the state has been prepared independently or because some noise has destroyed previous dependences between the state and M.
Moreover, one could argue for a notion of 'initial state' that, by definition, implies that it has been prepared independently of M and thus, typically, shares no algorithmic information with M.
To show one immediate consequence, consider a physical system whose state space is a finite set S. Assuming that the dynamics D is a bijective map of S, it follows that the entropy cannot decrease:

Theorem 1 (No entropy decrease). If the dynamics of a system is an invertible mapping D of a discrete set S of states, then principle 1 implies that the algorithmic complexity can never decrease when applying D to the initial state s, i.e.,

$K(D(s)) \stackrel{+}{\ge} K(s)$.   (8)
Proof. Algorithmic independence of s and D amounts to $K(s) \stackrel{+}{=} K(s\,|\,D^*)$. Since s can be computed from D(s) by applying the inverse of D, we have $K(s\,|\,D^*) \stackrel{+}{\le} K(D(s)\,|\,D^*) \stackrel{+}{\le} K(D(s))$, which yields $K(s) \stackrel{+}{\le} K(D(s))$.

That is, while [17] derives entropy increase for a simple initial state s, we have derived it for all states s that are independent of D.
To further illustrate theorem 1, consider a toy model of a physical system consisting of n×m cells, each of which may or may not be occupied by a particle, see figure 1. Its state is described by a binary word s with nm digits. For generic s, we have $K(s) \approx nm$, while figure 1, left, shows a simple state where all particles sit in the left uppermost corner, occupying a region of k×l cells. A description of this state s essentially consists of describing k and l (up to a negligible amount of extra information specifying that k and l describe the size of the occupied region), which requires $\log_2 k + \log_2 l$ bits. Assume now that the dynamical evolution D transforms s into $s' = D(s)$, where $s'$ looks 'more generic', as shown in figure 1, right. In principle, we cannot exclude that $s'$ is equally simple as s due to some non-obvious pattern. Excluding this possibility as unlikely, however, theorem 1 rules out any scenario where $s'$ is the initial state and s the final state of a bijective mapping D that is algorithmically independent of $s'$. The transition from s to $s'$ can be seen as a natural model of the mixing of a gas, as described by popular toy models like lattice gases [51]. These observations are consistent with standard results of statistical mechanics saying that mixing is the typical behavior, while de-mixing requires rather specific tuning of the microscopic state. We propose to formalize 'specific' by means of algorithmic dependences between the initial state and the dynamics. This view does not necessarily generate novel insights for typical scenarios of statistical physics, but it introduces a link to crucial concepts in the field of causal inference.
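A compression-based sketch of this toy model (our illustration, not from the paper): zlib's compressed length stands in for K, and the 'dynamics' is a fixed pseudo-random permutation of cell indices whose short description (a seed) was chosen without looking at s, mimicking algorithmic independence of state and dynamics:

```python
import random
import zlib

N = 32 * 32  # an n x m lattice, flattened to a binary word of length 1024

# simple initial state: an 8x8 block of particles in the upper-left corner
s = bytearray(b"0" * N)
for r in range(8):
    for c in range(8):
        s[r * 32 + c] = ord("1")
s = bytes(s)

# bijective "dynamics" D: a fixed pseudo-random permutation of the cell indices;
# its description (essentially the seed 42) is independent of s
perm = list(range(N))
random.Random(42).shuffle(perm)

def D(state: bytes) -> bytes:
    out = bytearray(N)
    for i, j in enumerate(perm):
        out[j] = state[i]
    return bytes(out)

def K_proxy(state: bytes) -> int:
    # compressed length in bytes: a crude, compressor-dependent proxy for K
    return len(zlib.compress(state, 9))

print(K_proxy(s), K_proxy(D(s)))  # the "mixed" state needs a longer description
```

The blocky state compresses to a few dozen bytes, while the scattered image of the same 64 particles does not; the particle number is conserved (D is a bijection), yet the description length grows, in the spirit of theorem 1.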
So far, we have avoided discussing whether the assumption of a discrete state space stems from the discretization of a continuous system (which is problematic for the reasons mentioned earlier) or from genuinely focusing on discrete systems. In the former case, despite these issues, theorem 1 still shows that an increase of physical entropy does not necessarily require coarse-graining effects. To argue for the latter view, one may think of a discrete quantum dynamics starting in an eigenstate with respect to some natural basis, e.g., the energy basis, and also ending up in such basis states. To satisfy principle 1, the basis must be considered background information relative to which the independence is stated.
Dynamics of open systems. Since applying (7) to closed systems reproduces the standard thermodynamic law of non-decrease of entropy, it is appealing to state algorithmic independence for closed system dynamics only and then derive conditions under which the independence for open systems follows. We will then see that the independence of $P_{\text{cause}}$ and $P_{\text{effect}|\text{cause}}$ can be seen as an instance of the independence principle for open systems.
Let D be a one-to-one map transforming the initial joint state (s, e) of system and environment into the final joint state (s′, e′). For fixed e, define the open system dynamics $M: s \mapsto s'$. If s is algorithmically independent of the pair (D, e) (which is true, for instance, when K(e) is negligible and s and D are independent), independence of s and M follows, because algorithmic independence of two strings $a, b$ implies independence of $a, c$ whenever c can be computed from b via a program of length O(1), see e.g. [15], lemma 6.
Further, we can extend the argument above to statistical ensembles: consider n systems with identical state space S, each coupled to an environment with identical state space E (where S and E are finite sets, for simplicity). Let (s_j, e_j) ∈ S × E be the initial state of the jth copy and (s′_j, e′_j) = D(s_j, e_j) its final state. Following the standard construction of Markovian dynamics, we assume statistical independence between the initial system states and the initial environmental states. Further, in agreement with the general idea of this paper, we assume that s^n := (s_1, …, s_n) is algorithmically independent of the pair (D, e^n), with e^n := (e_1, …, e_n). Then the empirical conditional distribution of final given initial states satisfies P_{S′|S}(s′|s) ≈ P_E({e : D(s, e) = (s′, e′) for some e′}), where P_E denotes the distribution of the environmental states. The approximate equality holds because of the statistical independence of s and e, which approximately also holds for empirical frequencies if n is large. Hence, P_S is determined by s^n, and P_{S′|S} is (in the limit of large n) determined by e^n and D. We thus conclude that P_S and P_{S′|S} are algorithmically independent, because they are derived from two algorithmically independent objects via a program of length O(1). Defining the variable 'cause' as the initial state of one copy S and 'effect' as its final state, we have thus derived the algorithmic independence of P(cause) and P(effect|cause). Notice that it is not essential to the reasoning above that cause and effect describe initial and final states of the same physical system; one could as well consider a tripartite instead of a bipartite system.
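The ensemble argument can be illustrated with a small simulation. This is a sketch under illustrative assumptions: toy state spaces S and E, a uniformly random bijection standing in for D, and uniform, statistically independent initial states. It checks that the empirical conditional distribution of final given initial states is, for large n, already determined by D and the distribution of e alone.

```python
import random
from collections import Counter

S = list(range(4))          # toy system state space (illustrative)
E = list(range(8))          # toy environment state space (illustrative)

rng = random.Random(1)
pairs = [(s, e) for s in S for e in E]
image = pairs[:]
rng.shuffle(image)
D = dict(zip(pairs, image))  # a random bijection on S x E standing in for the dynamics

# n independent copies with s and e drawn uniformly and independently
n = 200_000
samples = [(rng.choice(S), rng.choice(E)) for _ in range(n)]

# empirical P_S and empirical joint counts of (initial, final) system states
count_s = Counter(s for s, _ in samples)
count_ss = Counter((s, D[(s, e)][0]) for s, e in samples)

def p_cond_empirical(s_next, s):
    """Empirical P_{S'|S}(s_next | s) from the ensemble."""
    return count_ss[(s, s_next)] / count_s[s]

def p_cond_theory(s_next, s):
    """Conditional determined by D and the (uniform) law of e alone."""
    return sum(1 for e in E if D[(s, e)][0] == s_next) / len(E)

err = max(abs(p_cond_empirical(t, s) - p_cond_theory(t, s)) for s in S for t in S)
print(round(err, 3))
```

The maximal deviation shrinks with n, confirming that P_{S′|S} depends only on (D, e^n) while P_S depends only on s^n.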
Physical toy model for a deterministic nonlinear cause-effect relation. To describe a case where principle 1 implies thermodynamic statements that are less standard, we revisit the toy scenario of information geometric causal inference [49,50] and observe that (3) implies

K(p_Y) ⪆ K(p_X), (9)

where ⪆ denotes inequality up to a constant. To see this, we only need to interpret the set of probability distributions as states on which f defines an invertible map; then (9) follows in analogy to the proof of theorem 1, because p_X can be uniquely reconstructed from p_Y when f is known. Thus, if p_Y had a shorter description than p_X, knowing f would admit a shorter description of p_X. Equation (9) matches the intuition that a distribution typically acquires additional peaks when the nonlinear function f is applied. Remarkably, this increase of complexity on the phenomenological level, namely the level of distributions, is accompanied by a decrease of Shannon entropy.

Figure 2. Physical system generating a nonlinear deterministic causal relation: a particle P travelling towards a structured wall with momentum orthogonal to the wall, where it is backscattered in a slightly different direction. x and y denote the positions where P crosses a vertical line before and after the scattering process, respectively.

To avoid possible confusion, we should emphasize that the process f, although it is a bijection, is not a 'reversible process' in the sense of thermodynamics, because the latter term refers to maps that are locally volume preserving in phase space and thus preserve Shannon entropy. To further clarify this point, we now describe a simple physical system whose dynamics yields the function f when restricted to a certain part of the physical phase space. The decrease of Shannon entropy is then perfectly consistent with the conservation of Shannon entropy for the entire system, in agreement with Liouville's theorem. Figure 2 shows a simple two-dimensional system with a particle P travelling towards a wall W perpendicular to the momentum of P. P crosses a line L parallel to W at some position x ∈ [0, 1]. Let the surface of W be structured such that P hits the wall with an incident angle that depends on its vertical position. Then P crosses L again at some position y. Assume that L is so close to W that the mapping x ↦ y is one-to-one. Also, assume that 0 is mapped to 0 and 1 to 1. Let the experiment be repeated with particles having the same momenta but different positions, such that x is distributed according to some probability density p_X. Assuming principle 1, the initial distribution of momenta and positions contains no information about the structure of W. Due to theorem 1, the scattering process thus increases the algorithmic complexity of the state. Further, this process is thermodynamically irreversible for every thermodynamic machine that has no access to the structure of W. Hence, the entire dynamical evolution is thermodynamically irreversible when the structure of W is not known, although the Shannon entropy is preserved.
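The interplay described above, a bijection that lowers the Shannon entropy of the position distribution while making it more structured, can be sketched numerically. In the snippet below, the bijection f(x) = x³ is a hypothetical stand-in for the map induced by the wall (any nonlinear bijection of [0, 1] fixing 0 and 1 would do); a uniform p_X is discretized, pushed through f, and the Shannon entropies of the binned distributions are compared.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def histogram(values, bins):
    """Bin values from [0, 1] into equal-width bins, return relative frequencies."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    return [c / total for c in counts]

N, bins = 200_000, 100
xs = [(i + 0.5) / N for i in range(N)]   # uniform p_X on [0, 1]
ys = [x ** 3 for x in xs]                # hypothetical bijection f(x) = x^3

H_X = shannon_entropy(histogram(xs, bins))
H_Y = shannon_entropy(histogram(ys, bins))
print(round(H_X, 2), round(H_Y, 2))
```

The binned p_Y is strongly peaked near 0 and has markedly lower Shannon entropy than the uniform p_X, while its histogram is visibly more structured, consistent with (9) and the entropy decrease (6).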
Let us now focus on a restricted aspect of this physical process, namely the process that maps p_X to p_Y via the function f, so that we can directly apply the information-geometric approach to causal inference [49]. We can then conclude (9) because restricting attention to partial aspects of two objects cannot increase their mutual information, see e.g. [15]. This illustrates, again, that we can either conclude (9) by applying principle 1 directly to f, or, alternatively, state the principle only for the dynamics of the closed system and derive (9) by standard arguments of algorithmic information theory. Intuitively speaking, we expect p_Y to contain information about the wall. On the one hand, we already know that (4) implies that p_Y correlates with the logarithmic slope of f, due to (5). On the other hand, we can also prove that p_Y contains algorithmic information about f provided that K(p_Y) is properly larger than K(p_X). This is because independence of p_Y and f would imply independence of p_Y and f⁻¹, and then we could conclude K(p_X) ⪆ K(p_Y) by applying the above arguments to f⁻¹ instead of f, contradicting the assumption. Certainly, particles contain information about the objects they have been scattered at, not about the ones they are going to be scattered at. Otherwise a photographic image would show the future and not the past. In this sense, the observations trivially fit the usual arrow of time. What may be unexpected from the viewpoint of standard thermodynamics is, as already mentioned, the decrease of Shannon entropy (6), which could lead to misleading conclusions such as inferring the time direction to run from p_Y to p_X. Thus principle 1 is of particular relevance in scenarios where simple criteria like entropy increase/decrease are inapplicable, at least without accounting for the description of the entire physical system (which may often not be available, e.g. if the momentum of the particle is not measured).
The example above also suggests how algorithmic independence could provide a new tool for inferring the time direction in such scenarios. One could certainly time-reverse the scenario, so that p_Y is the particle density of the incoming beam while p_X corresponds to the outgoing beam. Then the incoming beam already contains information about the structure of the surface it will be scattered at later. We now argue how to make sense of principle 1 in this case. Of course, such a beam can only be prepared by a machine or a subject that is aware of the surface structure and directs the particles accordingly. As a matter of fact, particles that were never in contact with the object cannot a priori contain information about it. Then principle 1 can be maintained if we consider the process of directing the particles as part of the mechanism and reject the idea of calling the state of the hand-designed beam an 'initial' state. Instead, the initial state then refers to the time instant before the particles have been given the fine-tuned momenta and positions.
Arrow of time for an open system in the real world. So far, we have mainly provided examples that support a theoretical understanding of the common root of thermodynamics and causal inference. Apart from addressing the foundations of both fields, the independence principle aims at describing the arrow of time for systems for which it is not obvious how to derive asymmetries between past and future from standard thermodynamics.
As one such example, reference [52] considers audio signals from a piece of music and its echo at different places of a building and addresses the task of inferring which one is the original signal and which one its echo. On the one hand, one can consider this task as part of causal inference with the echo being the effect of the original signal, as in [52]. On the other hand, the problem is arguably related to the arrow of time since the echo comes later than its original signal. Here, it would be hard to infer the time direction from entropy arguments: even if one manages to define a physical system like the air that carries the signal, one could hardly keep track of the entropy contained in the entire system. The independence principle, on the other hand, does not have to account for entropies of the entire system in order to infer the time direction. To show this, we first rephrase some results from [52] and then discuss future directions using the principle of algorithmic independence.
the system is not only statistically but also algorithmically independent of the environment and, second, it is also algorithmically independent of the dynamical law. It thus provides a useful new rationale for finding the most plausible causal explanation of given observations arising in the study of open systems. It is known, however, that non-Markovian dynamics is ubiquitous. As argued in [55], for instance, the dynamics of a quantum system interacting strongly with the environment is not Markovian because it does not start in a product state. Instead, the initial states of system and environment already share information. At least for these cases, we also expect violations of principle 1. It should be emphasized, however, that also for non-Markovian systems (for which the initial state has not been prepared independently of the environment) one is sometimes interested in the question of what would happen to an input state if it were prepared independently. This perspective becomes particularly clear by discussing analogies between non-Markovian dynamics and the phenomenon of confounding in the world of statistics and causal inference [2].
To explain this, consider just two variables X and Y where the statistical dependence is entirely due to the causal influence of X on Y. The corresponding causal relation is visualized in figure 3, left. For this relation, the observed conditional distribution P(Y | X) can be interpreted as describing also the behavior of Y under interventions on X. Explicitly, P(Y | X = x) is not only the distribution of Y after we have observed that X attains the value x. Instead, it also describes the distribution of Y given that we set X to the value x by an external intervention. Using similar language as in [2], we write this coincidence of observational and interventional probabilities as P(Y | do(X = x)) = P(Y | X = x). On the other hand, if the dependence between X and Y is only partly due to the influence of X on Y and partly due to the common cause Z, as in figure 3, right, setting X to the value x yields a different distribution than observing the value x, i.e. P(Y | do(X = x)) ≠ P(Y | X = x). Assume, for instance, that one observes a correlation between taking a medical drug (variable X) and recovery from a disease (variable Y). Say the correlation arises partly because the drug helps and partly because women take the drug more often than men and are, at the same time, more likely to recover. The question of whether it is worth taking the drug must be based on P(Y | do(X = x)), not on P(Y | X = x). If the data base contains information on the gender Z, we can adjust for this confounder using (11) and obtain the interventional probabilities from the observational ones. Otherwise, finding P(Y | do(X = x)) requires randomized experiments. This example shows that although the input x is in fact not independent of the mechanism relating X and Y, we are interested in the question of what would happen if we made it independent.
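The drug example can be made concrete with hypothetical numbers (all probabilities below are invented for illustration). The snippet contrasts the observational conditional P(Y=1 | X=1) with the interventional P(Y=1 | do(X=1)) obtained by adjusting for the confounder Z as in (11):

```python
# Hypothetical numbers for the drug (X), recovery (Y), gender (Z) example.
P_Z = {"f": 0.5, "m": 0.5}                       # P(Z)
P_X_given_Z = {"f": 0.8, "m": 0.3}               # P(X=1 | Z): women take the drug more often
P_Y_given_XZ = {(1, "f"): 0.9, (0, "f"): 0.8,    # P(Y=1 | X, Z): the drug helps a bit,
                (1, "m"): 0.7, (0, "m"): 0.6}    # and women are more likely to recover anyway

def p_y_observational(x):
    """P(Y=1 | X=x): condition on merely observing X = x."""
    px = lambda z: P_X_given_Z[z] if x else 1 - P_X_given_Z[z]
    num = sum(P_Y_given_XZ[(x, z)] * px(z) * P_Z[z] for z in P_Z)
    den = sum(px(z) * P_Z[z] for z in P_Z)
    return num / den

def p_y_do(x):
    """P(Y=1 | do(X=x)): average P(Y=1 | x, z) over P(Z), adjusting for Z."""
    return sum(P_Y_given_XZ[(x, z)] * P_Z[z] for z in P_Z)

print(round(p_y_observational(1), 3), round(p_y_do(1), 3))
```

With these numbers the observational probability exceeds the interventional one, because conditioning on X = 1 also raises the probability that the patient is a woman; the adjustment removes exactly this spurious contribution.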
Markovian and non-Markovian systems can be seen as the physical analogs of figure 3, left and right, respectively: a system is non-Markovian because the future state of the system is not only influenced by the present state but also by some common history or state of the environment. Like in the case of random variables, for a non-Markovian system we may be interested in what would happen for a 'typical' input state, that is, one that is prepared independently of the state of the environment and the dynamics.
Going back to the world of causal inference, we should emphasize that algorithmic independence of P_X and P_{Y|X} has only been postulated for the causal relation in figure 3, left, and not for the confounded scenario on the right-hand side. Accordingly, confounding may be detected through dependences between P_X and P_{Y|X} [15]. Likewise, for physical systems, dependences between a state and the dynamics may indicate non-Markovian dynamics¹².
More generally speaking, algorithmic information has also recently attracted interest in the foundations of physics. For instance, given the recent connections of the phenomenon of quantum nonlocality [56] with algorithmic information [57,58] and causality [7][8][9][10][11][12][13], our results may also point to new directions for research in the foundations of quantum physics.

Figure 3. Left: pure cause-effect relation without common cause. Right: cause-effect relation that is confounded by a common cause.

¹² For quantum systems, note that [13] discusses a condition that can also indicate common causes.