Information Anatomy of Stochastic Equilibria

A stochastic nonlinear dynamical system generates information, as measured by its entropy rate. Some---the ephemeral information---is dissipated and some---the bound information---is actively stored and so affects future behavior. We derive analytic expressions for the ephemeral and bound informations in the limit of small-time discretization for two classical systems that exhibit dynamical equilibria: first-order Langevin equations (i) where the drift is the gradient of a potential function and the diffusion matrix is invertible and (ii) with a linear drift term (Ornstein-Uhlenbeck) but a noninvertible diffusion matrix. In both cases, the bound information is sensitive only to the drift, while the ephemeral information is sensitive only to the diffusion matrix and not to the drift. Notably, this information anatomy changes discontinuously as any of the diffusion coefficients vanishes, indicating that it is very sensitive to the noise structure. We then calculate the information anatomy of the stochastic cusp catastrophe and of particles diffusing in a heat bath in the overdamped limit, both examples of stochastic gradient descent on a potential landscape. Finally, we use our methods to calculate and compare approximations for the so-called time-local predictive information for adaptive agents.


Introduction
If we track the position of a particle diffusing on an unchanging potential long enough, we can estimate the probability of observing a sequence of positions [1]. From that, we can quantitatively answer questions about the process's behavior using a range of information statistics:
• How random is it? The entropy rate $h_\mu$, which is the uncertainty in the present observation conditioned on all past observations [2].
• What must be remembered about the past in order to optimally predict the future? The causal states, which are groupings of pasts that lead to the same probability distribution over future trajectories [3,4].
• How much memory is required to store these causal states? The statistical complexity $C_\mu$, or the entropy of the causal states [3].
• How much of the future is predictable from the past? The excess entropy $E$, which is the mutual information between the past and the future [5].
• How much of the generated information ($h_\mu$) is relevant to predicting the future? The bound information $b_\mu$, which is the mutual information between the present and future observations conditioned on all past observations [6].
• How much of the generated information is useless, in that it neither affects future behavior nor contains information about the past? The ephemeral information $r_\mu$, which is the uncertainty in the present observation conditioned on all past and future observations [6].
These informational quantities cannot be derived from a dynamical phase diagram in general, so we see them as providing a complementary view of a process's structure and behavior.
Here, we focus on continuous stochastic nonlinear dynamical systems, the theory for which has a long and venerable history, has met with a number of successful predictions, and has identified a number of principles describing how noise interacts with nonlinearity [21]. For nonlinear systems transitioning to chaos, to take just one example, noise plays the role of a "disordering" field, just as the magnetic field is an ordering field for spin systems at critical transitions [22,23]. Though their history substantially predates that of the wide range of complex systems just cited, relatively fewer analyses of their information processing components-their information anatomy-have been carried out. As a start, we demonstrate how to calculate the quantities above for continuous-time, continuous-state stochastic nonlinear systems exhibiting dynamical equilibria, yielding intuition for the properties these measures capture in simpler, and perhaps more familiar, physical models.
Throughout, we focus on a ubiquitous and simple nonlinear generative model: stochastic gradient descent or, in other words, diffusion on a potential surface. We assume infinite precision in our observation of the state space. The first calculation assumes that the diffusion matrix is invertible; the second assumes that the drift term is linear but allows for a noninvertible diffusion matrix. All calculations assume that the time between measurements is nonzero, but arbitrarily small.
To get started, background is given in Section 2. Results are presented in Section 3 and stated more succinctly in Table 1. To illustrate how to apply those formulae, we calculate the information anatomy of the stochastic cusp catastrophe in Section 4.1 and coupled particles diffusing in a heat bath in Section 4.2.
We provide a suite of appendices that are home to technical details necessary for completeness, but that would otherwise distract. Several appendices also draw out implications of information anatomy analysis. In particular, Appendix A shows that the information anatomy of a Markovian system requires looking only one time step into the future and past, as expected from a similar calculation in [6]. Appendix B establishes that the causal states of a first-order Langevin equation are isomorphic to the present position. Appendix C justifies why, given an infinitesimal time resolution τ , the conditional entropy of the measurement at a future time step given the present measurement can be approximated arbitrarily well by using a linearized drift term when the diffusion matrix is invertible. Appendix D then demonstrates that the entropy of the Green's function of a linear Langevin equation with a noninvertible diffusion matrix differs from that when the diffusion matrix is invertible. Finally, Appendix E applies the formulae in Appendices A-C to explore estimates of the time-local predictive information and related alternatives, used as optimization principles to choose action policies for adaptive autonomous agents [12].

Background
Let's first recall the information anatomy analysis of discrete-time, discrete-state processes introduced in [6]. The main object of study is a process $\mathcal{P}$: the list of all of a system's behaviors or realizations $\{\ldots, x_{-2}, x_{-1}, x_0, x_1, \ldots\}$ and their probabilities $\Pr(\ldots, X_{-2}, X_{-1}, X_0, X_1, \ldots)$. We denote a contiguous chain of random variables as $X_{0:L} = X_0 X_1 \cdots X_{L-1}$. We assume the process is ergodic and stationary, $\Pr(X_{0:L}) = \Pr(X_{t:L+t})$ for all $t \in \mathbb{Z}$, and that the measurement symbols range over a finite alphabet: $x \in \mathcal{A}$. In this setting, the present $X_0$ is the random variable measured at $t = 0$, the past is the chain $X_{:0} = \ldots X_{-2} X_{-1}$ leading up to the present, and the future is the chain following the present, $X_{1:} = X_1 X_2 \ldots$.
Shannon's various information quantities (entropy, conditional entropy, mutual information, and the like), when applied to time series, are functions of the joint distributions $\Pr(X_{0:L})$. Importantly, they define an algebra of information measures for a given set of random variables [24]. Ref. [6] used this to show that the past and future partition the single-measurement entropy $H[X_0]$ into several measure-theoretic atoms. These include the ephemeral information:
$$ r_\mu = H[X_0 | X_{:0}, X_{1:}] , $$
which measures the uncertainty of the present knowing the past and future; the bound information:
$$ b_\mu = I[X_0 ; X_{1:} | X_{:0}] , $$
which is the information shared between present and future conditioned on past; and the enigmatic information:
$$ q_\mu = I[X_0 ; X_{:0} ; X_{1:}] , $$
which is the co-information between past, present, and future.
For a stationary time series, the bound information is also the shared information between present and past conditioned on the future:
$$ b_\mu = I[X_0 ; X_{:0} | X_{1:}] . $$
One can also consider the amount of predictable information not captured by the present:
$$ \sigma_\mu = I[X_{:0} ; X_{1:} | X_0] , $$
which is called the elusive information. It measures the amount of past-future correlation not contained in the present. It is nonzero if the process has "hidden states" and is therefore quite sensitive to how the state space is "observed" or coarse-grained.
The total information in the future predictable from the past (or vice versa) is the excess entropy, which is the sum of the bound, enigmatic, and elusive informations:
$$ E = b_\mu + q_\mu + I[X_{:0} ; X_{1:} | X_0] . $$
The process's Shannon entropy rate $h_\mu$ can also be written as a sum of atoms:
$$ h_\mu = H[X_0 | X_{:0}] = r_\mu + b_\mu . $$
Thus, a portion of the information ($h_\mu$) a process spontaneously generates is thrown away ($r_\mu$) and a portion is actively stored ($b_\mu$). Putting these observations together gives the information anatomy of a single measurement:
$$ H[X_0] = q_\mu + 2 b_\mu + r_\mu . $$
These quantities were originally defined for stationary processes, but easily carry over to a nonstationary process of finite Markov order. (See Appendix A.)
The burden of the following is to analyze the limit from the discrete-time, discrete-value processes just discussed to continuous-time, continuous-value processes. Suppose that observations are made at very small intervals of duration $\tau$. Then the observation at time $t_n = n\tau$ is now labeled $X_{n\tau}$. Rather than entropy or mutual information per observed symbol, we define an entropy or mutual information per elapsed time; that is, informational rates. A step in that direction is to normalize the information measures defined above by the observation interval: $H_0(\tau) = H[X_0]/\tau$, and likewise $h_\mu(\tau)$, $b_\mu(\tau)$, $r_\mu(\tau)$, and $q_\mu(\tau)$ are the corresponding per-measurement quantities divided by $\tau$. In doing this, terms of order $\tau$ or higher are ignored. These definitions then lead to a familiar $\tau$-entropy rate using a discrete-time, continuous-value treatment [2,25,26]. More natural definitions of these quantities might involve a fully continuous-time development that avoids the $\log \tau$ divergences of the $\tau$-entropy rate [27], but we leave this for future research.
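Figures 1a and 1b give information diagrams that illustrate the algebra of the information measure atoms just defined. There, the entropy of a set is the sum of the entropy of its atoms. This reveals several useful linear dependencies that were originally noted in [6]: the entropy rate is the sum of the ephemeral and bound informations, $h_\mu = r_\mu + b_\mu$, and the single-measurement entropy is $H[X_0] = q_\mu + 2 b_\mu + r_\mu$. For a Markovian process, as illustrated in Figure 1b, the elusive information (the past-future information not captured by the present) vanishes:
$$ I[X_{:0} ; X_{1:} | X_0] = 0 . $$
Therefore, in this case, if we find expressions for $H_0(\tau)$, $h_\mu(\tau)$, and $b_\mu(\tau)$, then we can find $q_\mu(\tau)$, $r_\mu(\tau)$, and $E$ via:
$$ q_\mu(\tau) = H_0(\tau) - h_\mu(\tau) - b_\mu(\tau) , \qquad (1) $$
$$ r_\mu(\tau) = h_\mu(\tau) - b_\mu(\tau) , \qquad (2) $$
$$ E = \tau \left[ H_0(\tau) - h_\mu(\tau) \right] . \qquad (3) $$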

Information Anatomy of Stochastic Dynamical Systems
To determine a process's information anatomy one must calculate entropies and conditional entropies of the joint probability distribution of the entire past, the present, and the entire future. In the general case, this is challenging. However, since the first-order Langevin equations we consider are Markovian, we have:
$$ h_\mu(\tau) = \frac{H[X_\tau | X_0]}{\tau} , \qquad (4) $$
$$ b_\mu(\tau) = \frac{H[X_\tau | X_{-\tau}] - H[X_\tau | X_0]}{\tau} . \qquad (5) $$
(Appendix A provides the derivation.) Therefore, to calculate a Markovian process's information anatomy, we only need the joint probability distribution of three successive measurements instead of the joint probability distribution of the present and semi-infinite past and future.
To further simplify the calculation of conditional entropies, we assume that $\tau$ is small enough that the entropy of the Green's function, i.e., of the transition probabilities $P(x', t + \tau | x, t)$, is well approximated by the entropy of a corresponding Gaussian. This is exactly true for a linear Langevin equation. For a nonlinear Langevin equation, the Gaussian approximation is valid in the limit of infinitesimal $\tau$ for a set whose measure can be made arbitrarily close to 1. (Appendix C calculates small-$\tau$ approximations for the variance of this Gaussian.) We do not approximate the stationary distribution of a nonlinear Langevin equation by a Gaussian, however, and that means that the joint probability distribution over successive measurements is in general highly non-Gaussian.
Appendix B shows that, for first-order Langevin dynamics, the single-measurement entropy $H[X_0]$ is the process's statistical complexity $C_\mu$ [3,4]. The result is that the information anatomy analysis decomposes this causal-state information into:
• that useful for prediction or retrodiction beyond the information provided by the causal states at the previous time step (the bound information $b_\mu$);
• that useful for both prediction and retrodiction (the co-information $q_\mu$); and
• that useless for both prediction and retrodiction (the ephemeral information rate $r_\mu$).
This is a similar but finer C µ decomposition than considered in [28]. There, and more generally, C µ = E + χ. That is, the state information consists of that shared with the future (E) and information not shared with the future but that must be stored to implement optimal prediction-the crypticity χ [29]. Together with these observations, Eqn. 3 reminds us that χ = h µ for Markov processes, as originally noted for finite-range one-dimensional spin systems [30].

Nonlinear Langevin Dynamics
Consider an n-dimensional nonlinear Langevin equation:
$$ \frac{dx}{dt} = -D \nabla U(x) + \eta(t) , $$
where $x \in \mathbb{R}^n$, $U(x)$ is an analytic potential function, and $\eta(t)$ is zero-mean white noise with diffusion matrix $D$: $\langle \eta_i(t) \rangle = 0$ and $\langle \eta_i(t) \eta_j(t') \rangle = D_{ij} \delta(t - t')$. The diffusion coefficients $D_{ij} = D_{ji}$ are assumed to be independent of $x$ and such that $\det D \neq 0$. The following (well-known) stationary distribution is derived by converting the stochastic differential equation into its Fokker-Planck equation form:
$$ \rho_{eq}(x) = \frac{1}{Z} e^{-2U(x)} , \qquad (6) $$
where $Z = \int e^{-2U(x)} dx$. We assume that this is the stationary probability distribution experienced by the particle and that it is normalizable: $Z < \infty$. (See Fig. 2 for simulation results in one dimension.) The time-discretization normalized entropy of a measurement is:
$$ H_0(\tau) = \frac{1}{\tau} \left( \log Z + 2 \langle U \rangle_{\rho_{eq}} \right) . \qquad (7) $$
The conditional entropies $H[X_\tau | X_0]$ and $H[X_\tau | X_{-\tau}]$ in Eqns. 4-5 simplify if the conditional probabilities $\Pr(X_\tau | X_0)$ and $\Pr(X_\tau | X_{-\tau})$ are Gaussians, since the entropy of an n-dimensional Gaussian with covariance matrix $\Sigma$ is $\frac{1}{2} \log \left( (2\pi e)^n \det \Sigma \right)$ (Eqns. 8 and 9). Appendix C shows that the conditional distributions $\Pr(X_\tau | X_0)$ and $\Pr(X_\tau | X_{-\tau})$ are Gaussian to $o(\tau)$ over a region of $\mathbb{R}^n$ with measure arbitrarily close to 1. The entropies of these Gaussians are calculable to leading and subleading order in $\tau$ using a linearized version of the nonlinear Langevin equation about the initial position $x'$:
$$ \frac{dx}{dt} = -D \nabla U(x') - A(x')^\top (x - x') + \eta(t) , $$
where $A(x')$ is a matrix with entries $(A(x'))_{ij} = \partial (D \nabla U)_j / \partial x_i$ evaluated at $x'$. (This is similar but not identical to the approximation used in [12]. Appendix E comments on the differences.) Appendix C gives the resulting covariances of the two conditional distributions (Eqns. 10 and 11). Substituting Eqns. 10 and 11 into Eqns. 8 and 9, respectively, gives with some algebra the two conditional entropies (Eqns. 12 and 13). Substituting Eqn. 13 into Eqn. 4 then yields the entropy rate $h_\mu(\tau)$ (Eqn. 14).

Figure 2. A particle diffusing according to $\dot{x} = -x + \eta(t)$ with diffusion coefficient $D = 1$ moves as in Figure 2a. Over infinite time, the particle experiences positions distributed according to the probability density function in Eqn. 6; see Figure 2b. If the previous particle position is known, a future particle position can be determined with less uncertainty than if no previous particle position is known, as shown in Figure 2c. Panel (c): the probability of being in position $x$ at a time $t$ differs from the equilibrium probability distribution $\rho_{eq}(x)$ if we know the position of the particle at a previous time.
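For the Ornstein-Uhlenbeck process of Figure 2 the Green's function is exactly Gaussian, so the three-measurement reduction of the anatomy can be evaluated at any $\tau$ without linearizing. The following is a minimal sketch under stated assumptions rather than the calculation of Appendix C: it assumes the one-dimensional dynamics $\dot{x} = -kx + \eta(t)$ with $\langle \eta(t) \eta(t') \rangle = D \delta(t - t')$, the standard transition variance $v(\tau) = \frac{D}{2k}(1 - e^{-2k\tau})$, natural logarithms (nats), and the Markov identities $\tau h_\mu(\tau) = H[X_\tau | X_0]$ and $\tau b_\mu(\tau) = H[X_\tau | X_{-\tau}] - H[X_\tau | X_0]$. The helper names `gaussian_entropy` and `ou_anatomy` are hypothetical.

```python
import math

def gaussian_entropy(variance):
    """Differential entropy (nats) of a 1-D Gaussian with the given variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

def ou_anatomy(k, D, tau):
    """Per-measurement information anatomy (nats) of dx/dt = -k x + eta,
    <eta(t) eta(t')> = D delta(t - t'), sampled every tau.

    Assumes k > 0 and D > 0 so that a stationary distribution exists.
    Divide each entry by tau to obtain the corresponding rate.
    """
    def transition_variance(t):
        # Variance of X_t given X_0 for the Ornstein-Uhlenbeck process.
        return D / (2.0 * k) * (1.0 - math.exp(-2.0 * k * t))

    H0 = gaussian_entropy(D / (2.0 * k))                      # H[X_0]
    h = gaussian_entropy(transition_variance(tau))            # H[X_tau | X_0]
    h_skip = gaussian_entropy(transition_variance(2 * tau))   # H[X_tau | X_-tau]
    b = h_skip - h      # bound information
    r = h - b           # ephemeral information
    q = H0 - h - b      # enigmatic information
    return {"H0": H0, "h": h, "b": b, "r": r, "q": q}

if __name__ == "__main__":
    for tau in (1e-1, 1e-2, 1e-3):
        a = ou_anatomy(k=1.0, D=1.0, tau=tau)
        # Compare the per-step bound information with (1/2) log 2.
        print(tau, a["b"], 0.5 * math.log(2.0), a["r"])
```

In this sketch the per-step bound information reduces to $\frac{1}{2} \log(1 + e^{-2k\tau})$, which tends to $\frac{1}{2} \log 2$ as $\tau \to 0$ with a first correction $-k\tau/2$ that involves only the drift. The per-step ephemeral information, by contrast, changes only at second order in $\tau$ when $k$ is varied, in line with the statement that it is sensitive to the noise and not the drift.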
The leading order term is recognizable as an $(\epsilon, \tau)$-entropy rate of the Ornstein-Uhlenbeck process [25], except that the $\epsilon$ has been regularized away since we used Shannon's differential entropy. Substituting Eqns. 12 and 13 into Eqn. 5, we find the bound information rate (Eqn. 15). Thus, the rate of active information storage depends on the dimension of the state space to leading order in $\tau$, but its nondivergent part depends on the average curvature of the potential. From these quantities all other anatomy measures follow. Substituting Eqns. 14 and 15 into Eqn. 2, we find the ephemeral information (Eqn. 16). Unsurprisingly, the dissipated information (the entropy created in the present that is useful for neither predicting nor retrodicting) depends only on the noisiness of the dynamics and not the drift.

Table 1. Information anatomy of first-order, n-dimensional nonlinear Langevin dynamics. Columns: Information rates; Definition; Terms.
Finally, the enigmatic information-that shared between past, future, and present-follows by substituting Eqns. 7-15 into Eqn. 1: It is interesting to consider how q µ changes as the stochasticity of the system increases: The stationary distribution ρ eq (x) flattens out, leading to an unbounded increase in H 0 . This is counteracted by an unbounded increase in the entropy rate.
We can also bound the bound information rate when $D$ is positive semidefinite and $\nabla U$ grows more slowly than $e^{-2U}$ with $||x||$. Then, integration by parts applied to Eqn. 15 gives: When $D$ is positive semidefinite, $v^\top D v \geq 0$ for any vector $v$, so: Therefore, $b_\mu(\tau)$ is maximized for a positive semidefinite diffusion matrix when the potential well is as flat as possible, while maintaining $Z < \infty$.

Linear Langevin equation with Noninvertible Diffusion
What if the invertibility of the diffusion matrix is relaxed? In particular, do we still have qualitatively the same information anatomy if a subsystem of the stochastic dynamical system evolves deterministically? How does this affect the information generation and storage properties? To this end, split the state as $x = (x_d, x_n)$, where $x_d$ evolves deterministically and $x_n$ stochastically:
$$ \frac{dx_d}{dt} = B_{dd} x_d + B_{dn} x_n , \qquad (17) $$
$$ \frac{dx_n}{dt} = B_{nd} x_d + B_{nn} x_n + \eta(t) . \qquad (18) $$
Again, $\eta(t)$ is white noise with $\langle \eta(t) \rangle = 0$ and $\langle \eta(t) \eta(t')^\top \rangle = D \delta(t - t')$, where $D$ is invertible. Taken together, though, this is a linear Langevin equation for $x$ with a noninvertible diffusion matrix. Naively assuming that the deterministic subsystem evolves with a small amount of noise, Eqn. 15 would apply and give, for example, a bound information rate to $O(\tau)$ whose $\log 2/\tau$ divergence has pre-factor $n/2$. But this assumption would be incorrect; the noiseless limit is singular.
Since Eqns. 17 and 18 specify a linear Langevin equation for $x$, its Green's function is Gaussian. Appendix D gives, to $O(\tau)$, the entropy rate and the bound information (Eqn. 19). Applying Eqn. 2 then gives the ephemeral information rate to $O(\tau)$ (Eqn. 20). These answers are very different from those derived assuming that $x$'s deterministic subsystem $x_d$ evolves with an infinitesimal amount of noise.
The bound information in Eqn. 19 differs from that found from naive application of Eqn. 15 in two ways. First, the pre-factor for the $\log 2/\tau$ divergence is $n/2 + m$ rather than $n/2$. That is, the difference counts the dimension $m$ of the deterministically evolving state space $x_d$. Thus, the deterministic subsystem allows for the active storage of more of the spontaneously generated stochasticity. Second, $b_\mu$'s $O(1)$ term involves $\mathrm{tr}(B_{dd}) - 3\,\mathrm{tr}(B_{nn})$ rather than $\mathrm{tr}(B_{dd}) + \mathrm{tr}(B_{nn})$.
The ephemeral information in Eqn. 20 differs from a naive application of Eqn. 16 in two further ways. First, the expression in Eqn. 20 has an additional $O(1/\tau)$ term that is linearly proportional to the dimension $m$ of the deterministic subsystem. And, second, the term $\log(2\pi e \, |\det D_{nn}| \, |\det B_{dn} D_{nn} B_{dn}^\top|)$ can be interpreted by supposing that $B_{dn} D_{nn} B_{dn}^\top$ is the effective diffusion matrix felt by the deterministically evolving states.
These information anatomy quantities are therefore sensitive to the process's underlying noise architecture.
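The change in the pre-factor of the $\log 2/\tau$ divergence can be checked numerically for a small linear system. The following is a sketch under stated assumptions, not the derivation of Appendix D: it assumes the convention $dx/dt = Bx + \eta(t)$ with $\langle \eta(t) \eta(t')^\top \rangle = D \delta(t - t')$, a total state dimension $n = 2$ with $m = 1$ noiseless coordinate, an arbitrarily chosen stable matrix `B`, and natural logarithms. It uses the fact that, for linear dynamics, the covariance of the Gaussian Green's function is $\int_0^\tau e^{Bs} D e^{B^\top s} ds$ and does not depend on the conditioning value, so that $\tau b_\mu(\tau) = \frac{1}{2} \log \left[ \det \Sigma(2\tau) / \det \Sigma(\tau) \right]$. All function names are hypothetical.

```python
import numpy as np

def expm_taylor(A, terms=30):
    """Matrix exponential by Taylor series; adequate when ||A|| is small."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for j in range(1, terms):
        term = term @ A / j          # term now holds A^j / j!
        result = result + term
    return result

def green_covariance(B, D, tau, nodes=20):
    """Covariance of X_tau given X_0 for dx/dt = B x + eta:
    the integral of e^{Bs} D e^{B^T s} over [0, tau],
    evaluated by Gauss-Legendre quadrature."""
    pts, wts = np.polynomial.legendre.leggauss(nodes)
    s_vals = 0.5 * tau * (pts + 1.0)   # map [-1, 1] onto [0, tau]
    cov = np.zeros_like(B, dtype=float)
    for s, w in zip(s_vals, wts):
        G = expm_taylor(B * s)
        cov += 0.5 * tau * w * (G @ D @ G.T)
    return cov

def bound_info_per_step(B, D, tau):
    """tau * b_mu(tau) = H[X_tau | X_-tau] - H[X_tau | X_0], in nats."""
    _, logdet_one = np.linalg.slogdet(green_covariance(B, D, tau))
    _, logdet_two = np.linalg.slogdet(green_covariance(B, D, 2.0 * tau))
    return 0.5 * (logdet_two - logdet_one)

if __name__ == "__main__":
    B = np.array([[-1.0, 1.0], [-0.5, -1.0]])   # illustrative stable drift
    tau = 1e-3
    invertible = bound_info_per_step(B, np.eye(2), tau)
    singular = bound_info_per_step(B, np.diag([0.0, 1.0]), tau)
    print(invertible / np.log(2.0), singular / np.log(2.0))
```

With both coordinates noisy, the per-step bound information approaches $(n/2) \log 2 = \log 2$ as $\tau \to 0$; with the first coordinate noiseless it approaches $(n/2 + m) \log 2 = 2 \log 2$, because the covariance of the deterministic coordinate grows as $\tau^3$ rather than $\tau$.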

Examples
To illustrate how the information measures are helpful and interesting summaries of nonlinear Langevin dynamics, let's consider several examples.

Stochastic Gradient Descent in One Dimension
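Consider a first-order nonlinear Langevin dynamics for $x \in \mathbb{R}$ in which:
$$ \frac{dx}{dt} = -\frac{\partial U}{\partial x} + \eta(t) , $$
where $\langle \eta(t) \rangle = 0$ and $\langle \eta(t) \eta(t') \rangle = 2D \delta(t - t')$. The stationary distribution is:
$$ \rho_{eq}(x) = \frac{1}{Z} e^{-U(x)/D} , $$
with $Z$ a normalization factor:
$$ Z = \int e^{-U(x)/D} dx . $$
We require that $Z < \infty$. This process's elusive information is zero and the ephemeral information rate is set by the strength of the noise. The bound information, in contrast, depends on the potential: it involves the average curvature $\langle \partial^2 U / \partial x^2 \rangle_{\rho_{eq}}$, which integration by parts rewrites as $\langle (\partial U / \partial x)^2 \rangle_{\rho_{eq}} / D$. So, $b_\mu$ is sensitive to the average curvature of the potential or, equivalently, to the average squared drift normalized by the diffusion constant.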
In the deterministic limit, this expression simplifies. Suppose that {x * 1 , ..., x * m } are the global minima of the potential function: Applying this limit to Eqn. 21, we have: This limit is a little strange. If D = 0 exactly, so that we have deterministic gradient descent, then the stationary time series consists of a single measurement. The information anatomy becomes rather trivial.
There is no uncertainty in the present measurement and the past, present, and future share no information.
If D is nonzero, no matter how small, however, then there is finite uncertainty in a measurement and the past, present, and future share information with one another. As a concrete example, consider the canonical form for the cusp catastrophe [31]: , and the corresponding bound information in the noiseless limit is: The global minimum x * (r, h) is not everywhere differentiable in r and h, and this appears also in b µ (τ, r, h). See Figure 3. The contour of nondifferentiability is h = 0 for r > 0. Along the contour, the potential is symmetric, there are suddenly two global minima of U (x) with x * 1 = −x * 2 and so the sign of x * changes discontinuously across h = 0.
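The discontinuous jump of the global minimum across $h = 0$ can be seen directly by locating the minimum numerically. The sketch below assumes the cusp potential in the form $U(x) = x^4/4 - r x^2/2 - h x$; the coefficients are chosen to be consistent with the bifurcation contour quoted in the Conclusions, but the sign of the $h$ term is an assumption, as are the function names. It also reports the curvature $U''(x^*)$ at the global minimum, the quantity on which the stationary distribution concentrates as $D \to 0$.

```python
import numpy as np

def cusp_potential(x, r, h):
    # Assumed sign convention: U(x) = x^4/4 - r x^2/2 - h x.
    return x**4 / 4 - r * x**2 / 2 - h * x

def global_minimum(r, h):
    """Location of the global minimum of the assumed cusp potential.
    Critical points solve x^3 - r x - h = 0."""
    roots = np.roots([1.0, 0.0, -r, -h])
    real_roots = roots[np.abs(roots.imag) < 1e-8].real
    # At exactly h = 0 with r > 0 the two wells tie; argmin picks one of them.
    return float(real_roots[np.argmin(cusp_potential(real_roots, r, h))])

def curvature_at_minimum(r, h):
    """Second derivative U''(x*) = 3 x*^2 - r at the global minimum."""
    x_star = global_minimum(r, h)
    return 3.0 * x_star**2 - r

if __name__ == "__main__":
    for h in (-0.1, -0.01, 0.01, 0.1):
        print(h, global_minimum(1.0, h), curvature_at_minimum(1.0, h))
```

Under this convention, $x^*$ flips sign as $h$ crosses zero for $r > 0$, while the curvature at $x^*$ is continuous but has a kink there, since it depends on $h$ only through $|h|$.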
Interestingly, for double-well potentials but not single-well potentials, b µ (τ ) is maximized at a nonzero noise level D > 0. At some level, this is completely counterintuitive. Adding noise only decreases the predictability of a process. However, adding noise in the present affects the future in a way that cannot be predicted from the past. Since b µ (τ ) measures the amount of information shared between the present and future which is not shared with the past, there is (for some double-well potentials) a level of stochasticity that maximizes b µ (τ ). This is shown in Figure 3d.
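The dependence of the average curvature on the noise level can be explored by direct quadrature over the stationary distribution. The following sketch makes several assumptions that are not spelled out in the surviving text: the dynamics $\dot{x} = -U'(x) + \eta(t)$ with $\langle \eta(t) \eta(t') \rangle = 2D \delta(t - t')$, the stationary density $\rho_{eq} \propto e^{-U/D}$, and the cusp potential $U(x) = x^4/4 - r x^2/2 - h x$. The function name `average_curvature` is hypothetical. It also checks the integration-by-parts identity $\langle U'' \rangle = \langle (U')^2 \rangle / D$.

```python
import numpy as np

def average_curvature(r, h, D, x=np.linspace(-8.0, 8.0, 40001)):
    """Return (<U''>, <(U')^2>/D) under rho_eq ~ exp(-U/D) for the
    assumed potential U = x^4/4 - r x^2/2 - h x on a uniform grid."""
    U = x**4 / 4 - r * x**2 / 2 - h * x
    w = np.exp(-(U - U.min()) / D)   # subtract the minimum for stability
    w /= w.sum()                     # uniform grid, so the spacing cancels
    curvature = float(np.sum(w * (3 * x**2 - r)))
    sq_drift_over_D = float(np.sum(w * (x**3 - r * x - h) ** 2) / D)
    return curvature, sq_drift_over_D

if __name__ == "__main__":
    for r in (1.0, -1.0):            # double well, then single well
        for D in (0.01, 0.25, 5.0):
            print(r, D, average_curvature(r, 0.0, D))
```

For the double well ($r = 1$, $h = 0$) the average curvature in this sketch is smallest at an intermediate noise level, because moderate noise lets the particle spend time near the barrier where $U'' < 0$. For the single well ($r = -1$) it grows monotonically with $D$ over the values tried. If lower average curvature corresponds to larger bound information, as the flat-potential bound of Section 3 suggests, this matches the observation that $b_\mu(\tau)$ peaks at nonzero $D$ only for double-well potentials.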

Particles Diffusing in a Heat Bath
Suppose N particles with positions x 1 , ..., x N and masses m 1 , ..., m N diffuse according to the potential function U (x 1 , ..., x N ) in a heat bath of temperature T . Let x denote the vector of concatenated particle positions. When the inertial terms m i d 2 x i /dt 2 are negligible, an overdamped Langevin equation can be used to approximate the particles' trajectories:    M is a diagonal matrix whose entries are the particle masses and the parameter γ is a friction coefficient that controls how strongly the particles couple to the heat bath. The stationary distribution of positions x is the Boltzmann distribution: where Z is the partition function: From Eqn. 7, the normalized single-measurement entropy is: which is simply proportional to the familiar definition of entropy in physics. For notational ease, letm denote the geometric mean of the masses: , k i the effective "spring constant" for the i th particle: and ω i the effective "oscillation frequency" for the i th particle: From Eqn. 14, the entropy rate is: From Eqn. 15, the bound information is to similar order: From Eqn. 16, the ephemeral information rate is: Several information measures appear dimensionally incorrect. This is a perennial concern when calculating the differential entropy of random variables that themselves have units. The probability density over those variables also has a dimension and this leads to differential entropies that involve the log of a number with dimension. Implicitly, however, we chose a standard unit system such that all quantities are dimensionless.
All of these quantities are extensive in $N$. The normalized entropy per measurement $H_0$ is proportional to the Boltzmann entropy by a factor of $k_B/\tau$. The entropy rate $h_\mu(\tau)$ and ephemeral information $r_\mu(\tau)$ increase logarithmically with the mean squared velocity $\langle v^2 \rangle = k_B T / \bar{m}$. The bound information $b_\mu(\tau)$ increases when there is a larger $\gamma$; that is, when there is stronger coupling between the particles and the heat bath or when there is a smaller summed squared oscillation frequency $\sum_{i=1}^{N} \omega_i^2$. Since $\gamma \geq 0$ and $\omega_i^2 \geq 0$, the bound information is bounded above by an expression of the form $N \log(\cdot)$. To achieve this upper bound, the potential $U(x)$ must be "flattened out" to decrease $k_i$, as described in Section 3.
There are alternative models for coupled particles diffusing in a heat bath, and there is no guarantee that even the qualitative conclusions here will hold true when particle trajectories are modeled according to a second-order Langevin equation, for instance.

Conclusions
Our calculations led to general formulae for the information anatomy of stochastic equilibria in simple, familiar systems when the time discretization was very small. We considered a first-order nonlinear Langevin equation with a normalizable stationary distribution, invertible diffusion matrix, and analytic drift. We do not expect the expressions in Section 3 to hold for larger time discretizations, though Gaussian approximations could be used to upper bound conditional entropies more generally. We also considered first-order linear Langevin equations with normalizable stationary distribution and noninvertible diffusion matrix in Section 3.2.
An important technical consideration is that the information anatomy of Langevin stochastic dynamics is likely not unique, just as the pre-factors for the $(\epsilon, \tau)$-entropy rate of an Ornstein-Uhlenbeck process depend on definition and approximation procedure [25,26]. However, based on results not shown here, we have reason to believe that the qualitative scaling seen with drift and diffusion holds regardless of approximation method. This parallels the way that $(\epsilon, \tau)$-entropy rate estimates for an Ornstein-Uhlenbeck process all increase with diffusion coefficient. That said, a complete understanding of how information anatomy estimates vary with technique requires further study. We hope that our results are sufficiently compelling to motivate further efforts.
With this caveat in mind, let's focus on qualitative rather than quantitative conclusions. Even though the entropy rate is typically viewed as a measure of randomness, some of that randomness is useful for prediction. That is the bound information-shared between present and future but not contained in the past-and we showed that it is sensitive only to drift. In contrast, we showed that the ephemeral information-information in the present useless for predicting or retrodicting-is sensitive only to the diffusion. In short, for stochastic equilibria the entropy rate consists of a quantity (ephemeral information) that has to do with a process's inherent noisiness and a quantity (bound information) that has only to do with the underlying process regularities.
A key lesson is that information anatomy measures are sensitive to process organization. Section 3.2 showed that the information anatomy of linear Langevin dynamics changes discontinuously whenever one of the diffusion coefficients vanishes. This sensitivity to underlying process structure could also be a feature rather than a defect. For instance, if we know that the underlying process is a first-order linear Langevin equation, then one could infer the dimension of the deterministically evolving state space by comparing known τ -scaling relations in Section 3 with empirically determined scaling relations.
This brings us to discuss what was learned from the several example applications. Section 4.1 showed that the bound information picks up different features than one finds in a dynamical phase diagram. In the noiseless limit, $b_\mu$ of the cusp catastrophe as a function of parameters $r$ and $h$ is nondifferentiable on the line $h = 0$ for $r \geq 0$, because the location of the global minimum of the potential function changes discontinuously across that contour. Moreover, this is not related to the bifurcation contour $h = \pm 2 r^{3/2} / (3\sqrt{3})$ [31] where the number of equilibria changes from two to one or vice versa, which has no apparent signature in the bound information. However, in these calculations, we did not avoid the "ultraviolet catastrophe". We embraced it since we could then evaluate the information anatomy for general nonlinear Langevin equations by linearizing. If one evaluates the information anatomies of these types of stochastic dynamics when the time discretization is not infinitesimal, however, then signatures of dynamical phase transitions should show up in the bound information as they do for the finite-time predictable information or excess entropy [16,32].
Section 4.2 calculated the information anatomy of coupled particles in a heat bath. Physicists are concerned primarily with $H_0$, the entropy of a single measurement symbol, since its changes are proportional to heat loss [33]. However, the point of this example was that alternative information-theoretic quantities capture other behavioral properties of particles diffusing in a heat bath. As an application of this analysis it will be worth exploring how the information anatomy measures reflect the trade-off between stable information storage and heat loss in the context of Maxwell-like demons [34].
To close our discussion of applications, we briefly mention the use of information measures to express optimization principles that guide adaptive agents. A Markov process's bound information has been used as an optimization measure called the time-local predictive information (TiPi) [12]. Moreover, the class of systems used there and for which TiPi was calculated are exactly the first-order nonlinear Langevin dynamics analyzed here. Due to the similarities in setup and approach, Appendix E compares alternative TiPi measures. Generally, an agent that wishes to maximize its TiPi will be driven into unstable regions of the potential landscape on which it diffuses. However, Appendix E shows that the similarly motivated, but alternative optimization measures lead to different adaptive strategies. More investigation is required to compare such strategies to those seen in biological agents before general principles of adaptive behavior can be understood.

A. Information Anatomy of a Markov Process
If the system at hand is Markovian, then the information anatomy simplifies tremendously since one need only consider single time steps into the future and into the past. As a result, many of the Markovian formulae are special cases of those developed in [6] for more complex processes, but are derived here for completeness.
For notational ease, we use the discrete-time notation in which $X_{t:t'}$ is the random variable of measurements $X_t, X_{t+1}, \ldots, X_{t'-1}$. For a Markovian process the immediately preceding observation "shields" the future from the past:
$$ \Pr(X_n = x_n | X_{-m:n} = x_{-m:n}) = \Pr(X_n = x_n | X_{n-1} = x_{n-1}) . $$
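And, it becomes relatively easy to calculate the information anatomy measures, since the sequence probabilities simplify:
$$ \Pr(X_{0:L}) = \Pr(X_0) \prod_{i=1}^{L-1} \Pr(X_i | X_{i-1}) . $$
For example, the entropy rate becomes:
$$ h_\mu = H[X_0 | X_{:0}] = H[X_0 | X_{-1}] . $$
Moreover, all information shared between the past and future goes through the present:
$$ I[X_{:0} ; X_{1:} | X_0] = 0 . $$
This equality is evident from the information diagram of Figure 1b. The bound information likewise involves only three successive measurements:
$$ b_\mu = I[X_0 ; X_1 | X_{-1}] = H[X_1 | X_{-1}] - H[X_1 | X_0] . $$
The other information anatomy measures follow from $b_\mu$ and $h_\mu$ via identities given in Section 2:
$$ r_\mu = h_\mu - b_\mu , \qquad q_\mu = H[X_0] - h_\mu - b_\mu . $$
The excess entropy follows as the sum:
$$ E = b_\mu + q_\mu = H[X_0] - h_\mu . $$
As stated in Section 2, to normalize these measures as rates (entropies per unit time rather than per measurement), we simply divide the above by the time discretization $\tau$.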
If the system is Markovian, one only needs the joint distribution of three successive measurements to calculate the anatomy of a bit. Thus, the formulae derived here also can be used as time-local measures for nonstationary dynamics despite the subtleties of defining a measure over bi-infinite time series in general [35].
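The three-measurement reduction can be made concrete for a finite-state Markov chain. The following is a minimal sketch, assuming a row-stochastic transition matrix `T` with a unique stationary distribution and natural logarithms; the function names are hypothetical. It uses $h_\mu = H[X_1 | X_0]$ and the Markov-chain identity $b_\mu = I[X_0 ; X_1 | X_{-1}] = H[X_1 | X_{-1}] - H[X_1 | X_0]$, where the two-step conditional entropy is computed from $T^2$.

```python
import numpy as np

def stationary_distribution(T):
    """Left eigenvector of the row-stochastic matrix T with eigenvalue 1."""
    vals, vecs = np.linalg.eig(T.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def conditional_entropy(pi, T):
    """H[next | current] in nats when current ~ pi and next ~ T[current]."""
    log_T = np.log(np.where(T > 0, T, 1.0))   # treat 0 log 0 as 0
    return float(-np.sum(pi[:, None] * T * log_T))

def markov_anatomy(T):
    """Per-measurement information anatomy of a stationary Markov chain."""
    T = np.asarray(T, dtype=float)
    pi = stationary_distribution(T)
    H0 = float(-np.sum(pi[pi > 0] * np.log(pi[pi > 0])))   # H[X_0]
    h = conditional_entropy(pi, T)            # H[X_1 | X_0]
    h_skip = conditional_entropy(pi, T @ T)   # H[X_1 | X_-1]
    b = h_skip - h
    return {"H0": H0, "h": h, "b": b, "r": h - b,
            "q": H0 - h - b, "E": H0 - h}

if __name__ == "__main__":
    print(markov_anatomy([[0.9, 0.1], [0.2, 0.8]]))
```

For an independent, identically distributed chain the same function returns zero bound information and zero excess entropy, with all of the entropy rate being ephemeral, which is a useful sanity check on the bookkeeping.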

B. Statistical Complexity is the Entropy of a Measurement
The statistical complexity $C_\mu$ is the entropy of the probability distribution over causal states. Causal states themselves are groupings of pasts that are partitioned according to the predictive equivalence relation $\sim$ [4]:
$$ x_{:0} \sim x'_{:0} \Leftrightarrow \Pr(X_{0:} | X_{:0} = x_{:0}) = \Pr(X_{0:} | X_{:0} = x'_{:0}) . $$
Although causal states are difficult to determine for general complex processes, they are particularly easy for Markov processes. Recall that a Markov process is defined by single-time step shielding:
$$ \Pr(X_{0:} | X_{:0}) = \Pr(X_0 | X_{-1}) \Pr(X_{1:} | X_0) . $$

It follows that:
$$ \Pr(X_{0:} | X_{:0} = x_{:0}) = \Pr(X_{0:} | X_{-1} = x_{-1}) . $$
Therefore, for a Markov process, groupings of pasts in which only the last measurement is recorded constitute at least a prescient partition. Provided that distinct last measurements yield distinct conditional distributions over futures, we conclude that the causal states are simply groupings of pasts with the same last measurement: $\epsilon(x_{:0}) = x_{-1}$. The causal state space $\mathcal{S}$ is isomorphic to the alphabet of the process $\mathcal{A}$ and the statistical complexity is the entropy of a single measurement: $C_\mu = H[X_0]$.
First-order Langevin equations generate Markovian time series. Our claim, then, is that the stochastic differential equations considered here produce time series for which $\Pr(X_t | X_0 = x) = \Pr(X_t | X_0 = x')$ implies $x = x'$. So, the causal states are isomorphic to the present measurement $X_0$ and the statistical complexity is $C_\mu = H[X_0]$. Implicit in these calculations is an assumption that the transition probabilities $\Pr(X_0 | X_{-\tau})$ for a given stochastic differential equation exist and are unique.
For intuition, consider a linear Langevin dynamics for an Ornstein-Uhlenbeck process: As described in Appendix D and many other places, e.g., [21], the transition probability density $\Pr(X_t | X_0 = x)$ is a Gaussian: For $\Pr(X_t | X_0 = x) = \Pr(X_t | X_0 = x')$, the means and variances of the two distributions must match, meaning that $e^{Bt} x = e^{Bt} x' \Rightarrow x = x'$. Therefore, for an Ornstein-Uhlenbeck process, the causal states are indeed isomorphic to the present measurement and the statistical complexity is $C_\mu = H[X_0]$. The key here is that although $\Pr(X_t | X_0 = x)$ may quickly forget its initial condition $x$, for any finite-time discretization, the transition probability $\Pr(X_t | X_0 = x)$ still depends on $x$.
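The Ornstein-Uhlenbeck argument can be checked numerically in one dimension. The sketch below assumes the scalar equation $dx/dt = Bx + \eta$ with $B \neq 0$ and noise strength $D$, for which the transition density is Gaussian with mean $e^{Bt}x$ and variance $D(e^{2Bt}-1)/(2B)$; the parameter values and the function name are hypothetical.

```python
import math

def ou_transition(x0, t, B=-1.0, D=0.5):
    """Mean and variance of X_t given X_0 = x0 for the scalar OU process dx/dt = B*x + eta."""
    mean = math.exp(B * t) * x0
    var = D * (math.exp(2.0 * B * t) - 1.0) / (2.0 * B)  # requires B != 0
    return mean, var

t = 5.0  # several relaxation times, so most memory of x0 has decayed
m1, v1 = ou_transition(1.0, t)
m2, v2 = ou_transition(2.0, t)
gap = abs(m1 - m2)  # small but nonzero: the two initial conditions stay distinguishable
```

The variances agree for any pair of initial conditions, while the means differ by $e^{Bt}|x - x'|$, which is nonzero at any finite $t$.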
In the more general case, we have a nonlinear Langevin equation: where the stationary distribution $\rho_{eq}$ exists and is normalizable. Our goal is to show that if $\Pr(X_t | X_0 = x) = \Pr(X_t | X_0 = x')$, then $x = x'$. The transition probability $\Pr(X_t = x | X_0 = x')$ is a solution to the corresponding Fokker-Planck equation: with initial condition $\rho(x, 0) = \delta(x - x')$. As in [36], we can use an eigenfunction expansion to show that $\rho(x, t | x', 0)$ cannot equal $\rho(x, t | x'', 0)$ unless $x' = x''$ for finite time $t$. Therefore, $\Pr(X_t | X_0 = x') = \Pr(X_t | X_0 = x'') \Rightarrow x' = x''$. This implies that the causal states are again isomorphic to the present measurement and the statistical complexity is $C_\mu = H[X_0]$.
To summarize, this application of computational mechanics [3,4] to Langevin stochastic dynamics shows that the entropy of a single measurement is also the process's statistical complexity $C_\mu$. Recall that the latter is the entropy of the probability distribution over the causal states, which in turn are groupings of pasts that lead to equivalent predictions of future behavior. So, for the stochastic differential equations considered here, their causal states simply track the last measured position. What the information anatomy analysis reveals, then, is that not all of the information required for optimal prediction is predictable information about the future. In other words, Langevin stochastic dynamics are inherently cryptic [28,29]. Unfortunately, as is so often the case, the necessary and the apparent come packaged together and cannot be teased apart without effort.

C. Approximating the Short-time Propagator Entropy
The study of stochastic differential equations and short-time propagator approximations is mathematically rich and, as noted in the introduction, the application to nonlinear diffusion has a long history [21]. What follows is a brief sketch, not a rigorous proof, that likely glosses over important pathological cases.
Consider the nonlinear Langevin equation: with driving noise satisfying $\langle \eta(t) \rangle = 0$ and $\langle \eta(t) \eta^\top(t') \rangle = D\delta(t - t')$, where $\det D \neq 0$. Let $p(x|x')$ be the transition probability $\Pr(X_t = x | X_0 = x')$ for the system in Eqn. 22. From arguments in [36], it exists and is uniquely defined when the stationary distribution is normalizable. Let $q(x|x')$ be a Gaussian with the same mean and variance as $p(x|x')$.

We show that $H[p] = H[q] + o(\tau)$
where $H[p] = -\int p(x|x') \log p(x|x') \, dx$ and $H[q] = -\int q(x|x') \log q(x|x') \, dx$. Note that here and in the following we suppress notation for the dependence of these quantities on $x'$, using the shorthand $H[p] \equiv H[p | X = x']$ and the like. First, consider: Since $q(x|x')$ is the Maximum Entropy distribution consistent with the mean and the variance of $p(x|x')$, averages of $\log q(x|x')$ with respect to $p$ are the same as those with respect to $q$. Specifically, if $\bar{x}$ is the mean: and if $C(x')$ is the variance: then $q$ is the normal distribution consistent with that mean and variance: From this, we derive: Since the mean and variance for $p$ and $q$ are consistent, we have: and, thus: A moment expansion will show that the moments of $q$ can be determined to $o(\tau)$ from this linearized Langevin equation. Our strategy is to construct a series expansion for the moments of $p$ in the timescale $\tau$, as in [37]. Immediately, with that statement, we run into a problem. Moments do not uniquely specify a distribution unless an additional condition (e.g., Carleman's condition) is satisfied. However, we are interested in approximating the entropy of the transition probability, rather than approximating the transition probability itself. The Kullback-Leibler divergence is invariant to changes in the coordinate system and, for reasons that become apparent later, it is useful to move to the parametrization $z = (x - \bar{x})/\sqrt{t}$. In a slight abuse of notation, $p(z|x')$ and $q(z|x')$ will be used to denote the re-parametrized distributions $p(x|x')$ and $q(x|x')$. If we could show that all moments of $p(z|x')$ and $q(z|x')$ differ by a quantity that is at most of $O(\tau^{3/2})$, it would follow that $p(z|x') = q(z|x') + \tau^{3/2} \delta q$ where $\delta q$ is at most of $O(1)$ in $\tau$. From that it would follow that $D_{KL}[q + \tau^{3/2} \delta q \| q] = (\tau^{3/2})^2 I[q]$ where $I[q]$ is the Fisher information of a Gaussian (and hence bounded) and that $H[p] = H[q]$ to $o(\tau)$.
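The link between the Kullback-Leibler divergence and the entropy difference rests on a standard identity for a moment-matched Gaussian. The following is a sketch of that identity under the definitions in the text, not a reproduction of the displayed equations; the dimension $n$ and the expectation notation $\mathbb{E}_p$, $\mathbb{E}_q$ are assumptions introduced here.

```latex
\begin{align}
\log q(x|x') &= -\tfrac{1}{2}(x-\bar{x})^\top C^{-1}(x-\bar{x})
               - \tfrac{1}{2}\log\det(2\pi C), \\
\mathbb{E}_p[\log q]
  &= -\tfrac{1}{2}\,\mathrm{tr}\!\left(C^{-1}\,
       \mathbb{E}_p\!\left[(x-\bar{x})(x-\bar{x})^\top\right]\right)
     - \tfrac{1}{2}\log\det(2\pi C) \nonumber \\
  &= -\tfrac{n}{2} - \tfrac{1}{2}\log\det(2\pi C)
   = \mathbb{E}_q[\log q] = -H[q], \\
D_{KL}[p\,\|\,q]
  &= \mathbb{E}_p[\log p] - \mathbb{E}_p[\log q]
   = H[q] - H[p] \;\geq\; 0 .
\end{align}
```

The second line uses only that $p$ and $q$ share the mean $\bar{x}$ and the variance $C$, so a bound on the divergence translates directly into a bound on $H[q] - H[p]$.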
For intuition and simplicity, we start with the one-dimensional example. This is similar in flavor to the approach in [37], but our aim differs: we wish to understand how well we can approximate the full system with a linearized drift term. The stochastic differential equation for $x \in \mathbb{R}$ is: with noise as above. The mean $\langle x \rangle$ evolves according to: Using an Ito discretization scheme: where $d\eta(t) \sim \mathcal{N}(0, D\Delta t)$, we have: From these, we derive evolution equations for the moments $\langle (x - \langle x \rangle)^n \rangle$ for $n \geq 2$: Substituting Eqn. 23 into the above and simplifying leads to: Now, we re-express: where $\delta$ is at most $O(1)$ in $x - x'$. Then: When $\mu'(x') = 0$ and $\delta = 0$ the Green's function is a Gaussian with zero mean and variance $Dt$, so that $\langle (x - \langle x \rangle)^n \rangle \propto (Dt)^{n/2}$. Inspired by this base case, we consider the moments of the variable We expand $\langle z^n \rangle$ in terms of $t$, since we are interested in the small-$t$ limit: In terms of these coefficients, we have: Substituting Eqns. 27-28 into Eqn. 26 and matching $O(1/t)$ terms, $O(1/\sqrt{t})$ terms, and so on, yields: for $O(1/t)$, $O(1/\sqrt{t})$, and $O(1)$, respectively. Note that none of $C_n$, $\alpha_n$, or $\beta_n$ carry information about $\delta$, which encapsulates higher-order nonlinearities of the drift. The $O(\sqrt{t})$ term finally has information about $\delta$: Interestingly, this implies that any dependencies of the moments on $\delta$ are $O(t^{3/2})$, at most. Eqns. 29-31 can be solved with the following initial conditions: and, by construction: Then, $C_n = \alpha_n = \beta_n = 0$ for $n$ odd, and $\alpha_n = 0$ for $n$ even as well. Some algebra shows that: A Gaussian with mean $0$ and variance $C_2 + \alpha_2 \sqrt{t} + \beta_2 t = 1 + \mu'(x') t$ would also have $C_n = \alpha_n = \beta_n = 0$ for $n$ odd, $\alpha_n = 0$ for $n$ even, and $\langle z^n \rangle_q = C_n (1 + \beta_2 t)^{n/2} = C_n + \frac{n}{2} C_n \mu'(x') t + O(t^2)$. Thus, the moments $\langle z^n \rangle$ of $p(z|x')$ are consistent with the moments of $q(z|x')$ to $O(t^{3/2})$. And, those moments are consistent with the moments of the linearized Langevin equation to $O(t^2)$.
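For a linear drift, the moment evolution under an Ito (Euler) discretization can be iterated exactly, which gives a quick check of the second-order variance $Dt(1 + \mu'(x')t)$. The sketch below assumes the hypothetical scalar drift $\mu(x) = Bx$, so that $\mu'(x') = B$ and $\delta = 0$; the function name and parameter values are illustrative only.

```python
def euler_moments(x0, t, steps, B=-1.0, D=0.5):
    """Mean and variance of x under x_{k+1} = x_k + B*x_k*dt + d_eta, with d_eta ~ N(0, D*dt)."""
    dt = t / steps
    mean, var = x0, 0.0
    for _ in range(steps):
        mean = (1.0 + B * dt) * mean
        # the noise increment is independent of x_k, so the variances add
        var = (1.0 + B * dt) ** 2 * var + D * dt
    return mean, var

t, B, D = 0.01, -1.0, 0.5
mean, var = euler_moments(1.0, t, steps=10_000, B=B, D=D)
second_order = D * t + B * D * t ** 2  # D*t*(1 + mu'(x')*t) with mu' = B
residual = var - second_order          # remaining discrepancy, expected at order t**3
```

In this linear case the residual scales as $t^3$, so after dividing the variance by $Dt$ the disagreement with $1 + \mu'(x')t$ is of order $t^2$.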
From prior logic, $H[p]$ can be approximated to $O(t^2)$ by $\frac{1}{2} \log(2\pi e \, |Dt + \mu'(x') D t^2|)$.
The $n$-dimensional case follows the same principle, but the calculations are more arduous. We start with the stochastic differential equation for $x \in \mathbb{R}^n$: with the noise as before. The initial condition is $x(t = 0) = x'$. Since we are interested not only in whether the distribution is effectively Gaussian, but also in how important the nonlinearities of $\mu(x)$ are, we re-express $\mu(x)$ as: where $A_{ij}(x') = \partial \mu_j / \partial x_i$: and $\delta_{ijk}$ is at most of $O(1)$ in $\|x - x'\|$. The evolution equation for the means is: Using an Ito discretization scheme with time step $\Delta t$: where $d\eta(t) \sim \mathcal{N}(0, D\Delta t)$. From this, we find evolution equations for the moments of $x$. As before, we subtract the mean: For notational ease, let $\sigma(1), \ldots, \sigma(m)$ be a list of integers in the set $\{1, \ldots, n\}$ where $n$ is the dimension of $x$; repeats are allowed. We want an evolution equation for $\mathrm{Cov}(x_{\sigma(1)}, \ldots, x_{\sigma(m)})$: Using Eqn. 33 and steps similar to those outlined in Eqns. 24 and 25, we find that: $\mathrm{Cov}(x_{\sigma(k)} : k \neq i, j)$.
As before, we substitute the above series expansion into Eqn. 35. If $D$ is invertible, the conditional entropy is then: If the matrix $D$ is not invertible because $\det D = 0$, then we only have the leading-order term in $t$ of the entropy $H[p]$ and we cannot draw any conclusions about the $O(1)$ term in any of the information anatomy quantities. This becomes clear by example in Appendix D.
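The one-dimensional approximation $\frac{1}{2}\log(2\pi e\,|Dt + \mu'(x')Dt^2|)$ can be compared against a case where the propagator entropy is known in closed form. The sketch below assumes a scalar Ornstein-Uhlenbeck process with drift $\mu(x) = Bx$, $B \neq 0$, and measures entropy in nats; the function names and parameter values are hypothetical.

```python
import math

def exact_ou_entropy(t, B, D):
    """Differential entropy (nats) of the exact scalar OU propagator over time t."""
    var = D * math.expm1(2.0 * B * t) / (2.0 * B)
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def short_time_entropy(t, mu_prime, D):
    """Short-time approximation 0.5 * log(2*pi*e*|D*t + mu'(x')*D*t**2|)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * abs(D * t + mu_prime * D * t ** 2))

B, D = -1.0, 0.5
errors = {t: exact_ou_entropy(t, B, D) - short_time_entropy(t, B, D)
          for t in (1e-1, 1e-2, 1e-3)}
```

For this linear example the error shrinks by roughly a factor of 100 for each tenfold reduction in $t$, as expected of an $O(t^2)$ remainder. A nonlinear drift would add contributions from $\delta$ that this check does not probe.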

D. Linear Langevin Dynamics with Noninvertible Diffusion Matrix
If the stochastic differential equation is linear: where $\eta(t)$ is white noise with $\langle \eta(t) \rangle = 0$ and $\langle \eta(t) \eta^\top(t') \rangle = D\delta(t - t')$, then we can solve Eqn. 36 in terms of $\eta(t)$ as: yielding: Since $\eta(t)$ is white, $x(t)$ is a Gaussian random variable with mean: and variance: Finally, we can also calculate the stationary probability distribution's variance in several ways, but for now we simply define it as the long-time limit of $\mathrm{Var}(x(t))$.
Since the Green's function is Gaussian for all time, not just approximately in the short-time limit, and since the variance of this Gaussian does not depend on the initial start point, we can calculate the conditional entropies $H[X_t | X_0]$ via: The goal here is to calculate this quantity for small $t$ when the matrix $D$ is not invertible. We assume that it has the block matrix form: where $D_{nn} = D_{nn}^\top$. Let $B$ have the corresponding block matrix form: (Recall that the subscript $d$ stands for deterministic and the subscript $n$ stands for noisy.) We can rewrite the variance in Eqn. 37 as a power series in $t$: Since we are concerned with the small-$t$ limit, we consider only the first few terms of this power series and, for reasons that will become clear, we write all in block-matrix form. The first term, which is of $O(t)$, is the usual: The second term, of $O(t^2)$, has the form: The third term, of $O(t^3)$, has the form: We place a dash in the lower right block matrix entry since, as it turns out, it does not matter for this calculation. The fourth term, of $O(t^4)$, has the form: With some algebra, this becomes: Assume that $B_{dn} D_{nn} B_{dn}^\top$ is invertible; i.e., $\det(B_{dn} D_{nn} B_{dn}^\top) \neq 0$. Therefore: where Substituting Eqns. 46 and 47 into Eqn. 45 and substituting that into Eqn. 38, we have the conditional entropy: where $m = \dim(B_{dd})$ and $n = \dim(B_{nn})$. Suppose that $B_{dn} B_{dn}^\top$ is invertible. Then, some algebra not shown here reveals: For this special case, the conditional entropy is:
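The change in small-$t$ scaling caused by a noninvertible diffusion matrix can be seen numerically in the smallest nontrivial case, one deterministic and one noisy coordinate. The sketch below assumes the standard expression $\mathrm{Var}(x(t)) = \int_0^t e^{Bs} D e^{B^\top s} \, ds$ for $dx/dt = Bx + \eta$, with the deterministic coordinate listed first; the matrices, the Taylor-series exponential, and the quadrature order are choices made for this illustration.

```python
import numpy as np

def expm_taylor(M, terms=30):
    """Matrix exponential by truncated Taylor series; adequate when the norm of M is small."""
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

def conditional_variance(B, D, t, nodes=20):
    """Var(x(t) | x(0)) = integral_0^t exp(B s) D exp(B^T s) ds, by Gauss-Legendre quadrature."""
    u, w = np.polynomial.legendre.leggauss(nodes)
    s = 0.5 * t * (u + 1.0)  # map nodes from [-1, 1] to [0, t]
    total = np.zeros(D.shape, dtype=float)
    for si, wi in zip(s, w):
        E = expm_taylor(B * si)
        total += 0.5 * t * wi * (E @ D @ E.T)
    return total

# Hypothetical 2x2 example: coordinate 0 is deterministic, coordinate 1 is noisy.
B = np.array([[-1.0, 2.0],
              [0.5, -1.5]])
D = np.array([[0.0, 0.0],
              [0.0, 0.3]])

ts = np.array([1e-3, 2e-3, 4e-3])
dets = np.array([np.linalg.det(conditional_variance(B, D, t)) for t in ts])
slope = np.polyfit(np.log(ts), np.log(dets), 1)[0]  # log-log slope of det Var versus t
```

With an invertible $2 \times 2$ diffusion matrix the leading behavior would be $\det(Dt) \propto t^2$. Here the slope is close to 4 and $\det \mathrm{Var}(x(t)) \approx D_{nn}^2 B_{dn}^2 t^4 / 12$, so the conditional entropy $\frac{1}{2}\log\det(2\pi e \, \mathrm{Var})$ acquires a different coefficient on $\log t$.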

E. Time-local Predictive Information
Information anatomy measures will have broad application to monitoring and guiding the behavior of adaptive autonomous agents. Practically, information anatomy gives a suite of semantically distinct kinds of information [6,38] that is substantially richer and structurally more incisive than simple uses of Shannon mutual information that implicitly assume there is only a single kind of (correlational) information. For example, it is reasonable to hypothesize that biological sensory systems are optimized to transmit with high fidelity information that is predictively useful about stimuli or environmental organization. In such a setting, the bound information quantifies how much predictability is lost if one has extracted the full predictable information $\mathbf{E}$ from the past, but chooses to ignore the present $H(X_0)$. Along these lines, the time-local predictive information$^{1}$ (TiPi) was recently proposed as a quantity that agents maximize in order to access different behavioral modes when adapting to their environment [12].
In fact, [12] does a calculation very similar to the ones above, considering discrete-time stochastic dynamics of the form $x_t = \phi(x_{t-1}) + \eta_t$ and calculating the TiPi: with fixed $T > 1$. The motivation is that, whatever the history prior to $t - T$, the agent knows the environment state $x_{t-T}$ at that time. However, from that time forward the agent, making no further observations, is ignorant. The stochastic dynamics then models the evolution of that ignorance from the given state to a distribution of states at $t - 1$ and then at $t$, taking into account only the model $\phi$ the agent has learned or is given. They report that TiPi is the difference between state information and noise entropy: where: with $L^{(0)} = I$. Since $\Sigma$ depends on the states between times $t - T$ and $t - 1$, the TiPi expression in Eqn. 49 also depends on the states between times $t - T$ and $t - 1$. The TiPi definition in Eqn. 48 does not. Thus, even though the numerical results of [12] are quite interesting, the quantity that the behavioral agents there were maximizing was not the stated conditional mutual information.
Footnote 1: For clarity, we must address persistently misleading terminology in use here, since it is critical to correctly interpreting the benefits of information-theoretic analyses. The proposed measure is a special case of bound information $b_\mu$. Recall that both $b_\mu$ and the excess entropy $\mathbf{E}$ capture the amount of information in the future that is predictable [5,6] and not that which is predictive. The latter is the amount of information that must be stored to optimally predict and this is given by the statistical complexity $C_\mu$. And so, when we use the abbreviation TiPi, we mean the time-local predictable information: information the agent immediately sees as advantageous.
To address this concern and explore informational adaptation hypotheses, let's consider alternatives. If desired, for example, one could define an averaged TiPi as: Or, one could define TiPi to be: so that it depends on both x t−T and x t−1 . Even with these modifications, Eqn. 49 still cannot be a general expression for TiPi since it depends on measurements at intermediate times that must be marginalized out of the conditional probability distribution with which we are calculating the mutual information.
These formulae lead to the following expressions for the TiPi alternatives: Maximizing these with respect to θ has a different effect on the action policy. Maximizing the original TiPi I N [X t ; X t−τ ] leads the agent to alter the landscape so that it is driven into unstable regions.
Maximizing the averaged TiPi I N 1 [X t ; X t−τ ] leads to a flattening of the potential landscape. And, the effect of maximizing I N 2 [X t ; X t−τ ] is not yet clear. Not surprisingly, when N is small, we recover the result that maximizing I N [X t ; X t−τ ] has the same effect on the potential landscape as maximizing the TiPi in [12] when $T = 2$. Though the model there is set up for a discrete-time analysis, it is natural to suppose that adaptive agents in an environment move according to continuous-time equations, but receive sensory signals in a discrete-time manner. Equating notation used here and there: $\phi(x) = x - D\nabla U(x)\tau$ gives: where $A_{ij} = \partial (D\nabla U(x))_i / \partial x_j$. When $N = 2$, substituting this into Eqn. 50 yields: This then gives, upon substitution into Eqn. 49: The above expression is identical to that in Eqn. 51 for all practical purposes, as derivatives of the two with respect to $\theta$ are identical up to an unimportant multiplicative constant to subleading order in $\tau$. Therefore, for $T = 2$ many of the qualitative conclusions from numerical simulations are likely to carry over when Eqn. 51 is used as the objective function.
Finally, the difference in how these quantities were calculated is interesting to us. For instance, was the series expansion for the coefficients of the moments of the Green's function in Appendix C actually necessary? Could we have used an Ito discretization scheme to write $x_{t+\Delta t}$ in terms of $x_{t-\Delta t}$ and noise terms, and used that expression to evaluate $b_\mu$? This is related to the approach taken in [12]. However, the answer obtained using the moment series expansions differs by a factor of two from what would have been obtained using such a discretization scheme. And, by keeping track of the order of the approximation errors in Appendix C, we found that these formulae for both bound information and TiPi would only hold for invertible diffusion matrices.
As suggested by Appendix D, our estimates for such conditional mutual informations change qualitatively when the diffusion matrix is not invertible. And that, in turn, may be relevant to environments that are hidden Markovian, that is, settings for which the agent's sensorium does not directly report the environmental states.