A general theory to estimate information transfer in nonlinear systems

A general theory for computing information transfers in nonlinear stochastic systems driven by deterministic forcing and additive and/or multiplicative noises is presented. It extends the Liang-Kleeman framework of causality inference to nonlinear cases based on information transfer across system variables (Liang, 2016). We present an effective method of computing formulas for the rates of Shannon entropy transfer (RETs) between selected causal and consequential variables, relying on the estimation from data of conditional expectations of the system forcing and their derivatives. Those expectations are approximated by nonlinear differentiable regressions, leading to a much easier and more robust way of computing RETs than the brute-force approach, which calls for numerical integrals over the state-space and knowledge of the multivariate probability density of the system. The approach is fully adapted to the case where no model equations are available, starting with a nonlinear model fitting from data of the consequential variables, followed by the application of the method to the fitted model. RETs are decomposed into sums of single one-to-one RETs plus synergetic terms (of purely nonlinear nature) accounting for the joint causal effect of groups of variables. State-dependent RET formulas are also introduced, showing where in state-space the entropy transfers and local synergies are most relevant. The RET estimations are compared with previous methods in the context of two models: (i) a model derived from a potential function and (ii) the classical chaotic Lorenz system, both forced by additive and/or multiplicative noises. The analysis demonstrates that the new estimations are robust, cheaper, and less data-demanding, providing evidence of the possibilities and generalizations offered by the method and opening new perspectives on real-world applications.


Introduction
Causal inference has a long history in a wide range of fields, from medicine to the social sciences, as discussed for instance in the overview of Pearl [1]. An illustrative example put forward by Pearl [1] is the question of the influence of symptoms on the actual disease of a patient. Finding the directionality of the influence from cause to consequence is therefore key in answering such questions, thereby building a graph of influence from one process to another.
One famous approach to analyzing such problems is Granger causality, which consists in evaluating whether one variable X provides additional predictive power for the forecast of another variable Y, implying that X is a cause of Y [2]. This approach has been applied in many different contexts, from economics to atmospheric variability [3][4]. Since these early developments, other methods have been designed, in particular in the context of information theory. In the 1970s, this question was addressed by the development of directed information theory [5,6]. Since then, many information-theoretic techniques have been designed and compared, e.g. Schreiber [7] and Hlavackova-Schindler et al. [8]. Other approaches based on dynamical systems theory or connected directed networks have also been developed in parallel and illustrated in different applications [9,10].
One important development at the conjunction of dynamical systems theory and information theory was initiated by Liang and Kleeman [11]. They developed a theory of the rate of information transfer between specific variables, starting from the fundamental dynamical equations describing the evolution of the Shannon entropy. The theory was then considerably expanded in a series of papers by Liang [12][13][14][15][16][17][18], culminating on one side in analytical expressions of the rate of information transfer in multivariate systems, and on the other side in specific simple expressions that can be used when dealing pragmatically with time series. The latter developments were done by assuming that the system underlying the dynamics of the observed data could be approximated by a multivariate linear stochastic system with additive noise [18], i.e. a multivariate Ornstein-Uhlenbeck process. These tools have been used extensively in the context of climate science [19][20][21][22][23][24][25].
This linear approximation, however, is restrictive, as some important nonlinear interactions could be missed. In the present work, the extension of the estimation to nonlinear dynamical systems is addressed. The main objective is therefore to find a closed formula to evaluate the rate of entropy transfer (RET), equivalent to the rate of information transfer in the Liang-Kleeman (LK) theory, in nonlinear systems forced by additive or multiplicative noise, one that can be robustly estimated from data (e.g. time series or time-evolving ensembles) under closed, periodic, or pdf-vanishing boundary conditions. The main idea is to apply the hereafter-called 'Causal Sensitivity Method' (CSM), relying on the computation of conditional expectations (and their derivatives, or sensitivities) of the deterministic and stochastic terms of the model, for a selected consequential variable.
This applies also to the computation of the Shannon entropy budget terms, bridging them with the RETs. This raises the important concept of synergetic causality, coming from the joint effects of groups of variables, not accounted for in single variable-to-variable RETs. Formulas of RETs have been obtained for closed or periodic boundary conditions in the state-space. Open conditions in the state-space call for additional boundary terms in the entropy budget.
Comparison and validation of the different approaches to estimating the RETs are performed, namely: a) The analytical-numerical (AN) method, where the model stochastic equations are known and the probability density function (pdf), either analytical or estimated from data, is used by 'brute force' in the integral formula of the RET in state-space.
b) The CSM approach developed in this study, based on conditional expectations (obtained from nonlinear regressions) and their derivatives with respect to the selected consequential variable. Here the model equations are supposed to be known. c) The model fitting (MF), from data, of both the deterministic term and the diffusivity coefficients, with subsequent application of the CSM to the fitted model. Only the evolution equation of the selected consequential variable is necessary.
d) The multivariate linear (ML) approach proposed in Liang [18], supposing a multivariate linear model fitting forced by additive noise.
The analysis is performed in the context of two different models: a potential model following Liang [16,17] and the famous Lorenz model [26], for which the advantages of the CSM and MF approaches are exemplified. Sections 2 and 3 are devoted to the fundamental theory of the entropy budget and the rates of entropy transfer, respectively. Section 4 explains the numerical aspects of the different techniques. Section 5 describes the results obtained with the potential model, for which a sensitivity analysis is performed with respect to the presence of linear and nonlinear terms and additive or multiplicative noises, as well as with respect to the time-series length. In Section 6, the approach is also illustrated in the context of the Lorenz model, with the specific purpose of providing robust examples of estimations based on data only, and with a particular interest in obtaining good RET estimations from quite short time series, simulating real situations of low data availability. Finally, in Sec. 7, the main results are summarized and discussed, together with potential extensions of this type of analysis to more complex and higher-dimensional systems. Three appendices are added with theoretical proofs, numerical technicalities and a list of symbols and acronyms.

Theory of the entropy budget
Let us consider a $D$-dimensional stochastic continuous system described by the state vector $\mathbf{x} = (x_1, \dots, x_D)^T$ (where $T$ stands for transpose) and a parameter vector, which evolves by Euler-Bernstein equations: where $F_i$ are the deterministic terms, components of the $D$-dimensional differentiable vector $\mathbf{F}$. Stochastic forcings are introduced where $W_k \sim \sqrt{t}\,N(0,1)$ are independent Wiener processes and $B_{i,k}$ are noise-diffusion coefficients, merged in the differentiable matrix $\mathbf{B}$, from which one builds the positive-definite matrix of pdf diffusivities $\mathbf{g} \equiv \mathbf{B}\mathbf{B}^T$ with elements $g_{i,j} \equiv \sum_{k=1}^{D} B_{i,k} B_{j,k}$. We will hereafter denote by $\tilde{x}_i \equiv (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_D)$ the set of variables different from $x_i$, or its complementary vector, which will be used to discuss the effect of $\tilde{x}_i$, taken as a set of causal variables, on $x_i$, taken as a consequential variable.
In this work, we adopt the approach introduced originally by Liang [11][12][13][14][15][16][17][18][27], relying upon the concept of the Shannon entropy $H = E(-\log \rho)$ of $\mathbf{x}$ and its rate of change $dH/dt$. In the formula, $\rho$ denotes the probability density function (pdf) of $\mathbf{x}$ and $E$ stands for the expectation operator over the state-space. When the rate is positive (negative), there is a loss (gain) of information and $\mathbf{x}$ becomes less (more) precise. For Gaussian pdfs, $dH/dt$ is the relative rate of variation of the variance per time unit (e.g. $dH/dt = 2$ means that the variance doubles at every unit of time).
In that framework, one aims to decompose the rate of entropy change (REC) as: where the first term comes from the influence of $\tilde{x}_i$ on $x_i$, the rate of entropy transfer (RET), and the second term is the self entropy generation (SEG), obtained in a process where $\tilde{x}_i$ is 'frozen in time', i.e. REC = SEG + RET. This problem was solved for two-dimensional systems (Theorem 5.1 of Liang [27]), leading to a rate of entropy transfer $T(x_2 \to x_1) = T(\tilde{x}_1 \to x_1)$, where $\tilde{x}_1 = x_2$ is the complementary vector of $x_1$ in two-dimensional systems. That RET expression can be generalized to any dimension $D \ge 3$ in the form: where $T_F(\tilde{x}_i \to x_i)$ and $T_g(\tilde{x}_i \to x_i)$ are respectively the deterministic RET (D-RET), subindexed with $F$ and depending on $\mathbf{F}$, and the stochastic RET (S-RET), subindexed with $g$ and depending on $g_{i,i}$. The same subindexes will be used throughout the manuscript for other quantities. The proof of (4) (in Appendix A1) comes from the expression of the Frobenius-Perron operator governing the time evolution of the pdf of $x_i$ in the stochastic process where $\tilde{x}_i$ is frozen in the period $[t, t+\delta t]$, using a procedure similar to that of Propositions VI.2, VI.3 and VI.4 of Liang in 2016 [16]. Since the full state-space is spanned by $x_i$ and $\tilde{x}_i$, there are no 'indirect' transfers via excluded variables.
The rate of variation of (2) is obtained from the Fokker-Planck equation, governing the time evolution of the pdf of the system: by multiplying by $-(1+\log\rho)$ and integrating over the state-space domain. We further assume one of the following boundary conditions in the state-space: 1) the pdf vanishes when $\|\mathbf{x}\| \to \infty$ or at the border of the pdf support set; 2) closed and non-diffusive boundary conditions, i.e. the normal diffusive flux vanishes at borders perpendicular to each direction; 3) periodic boundary conditions at the border of the support set.
Under those quite generic boundary conditions, we obtain the rate of entropy change decomposed into the deterministic part (D-REC), depending on $\mathbf{F}$, and the stochastic part (S-REC), depending on $\mathbf{g}$, as: where $\rho_{\tilde{x}_i|x_i}$ is the conditional pdf of $\tilde{x}_i$ given $x_i$. In the expressions, $\mathbf{n}$ is the inward unit vector, the dot represents the inner product, $\nabla$ is the gradient operator and dl is the infinitesimal element of the boundary. Boundary terms appear, for instance, when a non-null flux is imposed by some external control at the system boundary, as in certain thermodynamical open systems. An example is a mass of moist air subjected to liquid evaporation and vapour condensation, taking the mass of vapour as a dynamical state variable, which is lower-bounded by zero (dry conditions) and upper-bounded by the saturation value. The sum of the boundary terms contributing to the budget of Shannon entropy (uncertainty) then involves the pdfs at the dry and saturated states, the rate of liquid evaporation at dry conditions and the rate of vapour condensation at saturated conditions. That shows that the importation/exportation of entropy through the bounds of state variables can affect the causality. Despite their great interest, the effects of boundary fluxes in the state-space are out of the scope of this study and will henceforth not be considered.
Therefore, under closed or periodic boundaries, the S-REC is further decomposed as: where (7b) and (7c) are, respectively, the additive-noise part and the multiplicative-noise part of the S-REC.
(7b) is always non-negative, being an entropy source coming from the dynamical noise. The term (7c) is null whenever the diffusivity $g_{i,i}$ is constant (i.e. under an additive noise). A non-vanishing (7c) value calls for a state-dependent, or multiplicative, noise.
After taking the difference between (6a) and (4), we obtain the SEG decomposed into its deterministic and stochastic parts: When looking for the expressions of the RET and SEG, it is useful to decompose $F_i$ and $g_{i,i}$ as: where the hatted terms are exclusive functions of $x_i$, and the remaining terms are the causal-depending parts of $F_i$ and $g_{i,i}$, since they contain a cross dependency (from the point of view of $x_i$) on $\tilde{x}_i$.

Deterministic part of RET and SEG
For the considered boundary conditions, and after some integral manipulations (e.g. integration by parts), we obtain the D-RET and D-SEG written respectively as: See the proof of (10a) in Appendix A2. (10b) is the difference between (6b) and (10a). From (10a,b), it is clear that the D-RET (10a) and D-SEG (10b) are linear in terms of the deterministic forcing, i.e. the D-RET and D-SEG of a linear combination of deterministic terms and diffusivities are the respective linear combinations of D-RETs and D-SEGs.
The term (10a) can be interpreted by considering the surprise function $-\log\rho$, whose averaged value is the entropy. Its rate of variation accompanying the state-space evolution contains an 'advective' term due to the deterministic forcing, and the conditional log-pdf term in (10a) thus contributes to augmenting the entropy following the state-space trajectory. Furthermore, only the parcel of $F_i$ with cross-dependency on $\tilde{x}_i$ contributes to the D-RET.
The entropy generation (10b) is the contribution of the deterministic forcing to the average state-space speed divergence. When all the parcels of $\mathrm{div}(\mathbf{F})$ are constant and negative, that leads to the contraction of state-space volumes and to a decrease of entropy.
Therefore, the D-RET (10a) can be further written as (proven in Appendix A3): The first term of the r.h.s. of (11) is the average derivative with respect to $x_i$ (or sensitivity) of the conditional expectation of the causal-depending term for a given $x_i$. Moreover, note that (11) is zero when the causal-depending term is replaced by its self-dependent part. A non-null inference of causality (11) means that causes, acting through the causal-depending term, affect different consequence values $x_i$ in different ways. For this reason, the application of (11)-like formulas is part of the hereafter-referred 'Causal Sensitivity Method' (CSM). Now, recalling that the full expectation decomposes into a conditional expectation followed by an expectation over $x_i$, we can write (11) as the average of the specific D-RET, i.e.: The function (12a) provides information about the states where the entropy transfer is positive or negative, and also about its intensity.
A very important point is the fact that the RET formula (10a) depends explicitly on the pdf (obtained from integration of the Fokker-Planck equation or estimated from data), whereas the alternative formula (11) depends only on conditional expectations. This is a key advantage with respect to (10a), because the estimation of (10a) calls for a brute-force and computationally expensive integration over the state-space, growing exponentially with the dimension, whereas (11) represents a much cheaper approach. There, the conditional expectation can be estimated from some linear or nonlinear fitting of the causal-depending term as a function of $x_i$, using an ensemble of realizations of the pdf of the state vector, where the ensemble can be either a transient ensemble or a long-enough time series if the system is ergodic. Note in (11) that the conditional expectation given $x_i$ is taken over the variables in $\tilde{x}_i$.
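As a concrete illustration of this regression-based estimation (a minimal sketch with hypothetical variables, not the authors' code), one can fit a differentiable polynomial to samples of a forcing term against the consequential variable and then average the fitted derivative, as in the sensitivity appearing in (11). Here the toy forcing has true conditional expectation $E(F|x) = -x$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy ensemble: consequential variable x and a "causal-depending" forcing
# sample whose conditional expectation is E(F|x) = -x (an assumed toy model);
# the second term mimics the scatter induced by the causal variables.
x = rng.standard_normal(n)
F = -x + 0.5 * rng.standard_normal(n)

# CSM step 1: nonlinear differentiable regression of E(F|x) (cubic here).
coeffs = np.polyfit(x, F, deg=3)

# CSM step 2: differentiate the fitted regression and average the
# sensitivity dE(F|x)/dx over the ensemble.
dcoeffs = np.polyder(coeffs)
mean_sensitivity = np.polyval(dcoeffs, x).mean()

print(mean_sensitivity)  # close to -1, the true slope of E(F|x)
```

The same two steps apply with any differentiable basis (splines, radial functions); only the fit and its analytic derivative change.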
A corollary of (11) follows:

Corollary of Theorem 1
If $F_{i,\tilde{x}_i} = f_1(x_i)\, f_2(\tilde{x}_i)$, i.e. the deterministic term is multiplicatively separable in terms of factors depending on $x_i$ and $\tilde{x}_i$ respectively, then: The proof is immediate using (11). The above separation is quite common (e.g. for a multivariate polynomial of the variables). From (13), we easily verify that if $\tilde{x}_i$ is statistically independent of $x_i$, then the conditional expectation of $f_2$ given $x_i$ does not depend on $x_i$, its derivative is null, and hence (13) vanishes. Therefore, a non-vanishing D-RET (13) comes from the statistical dependency on the consequential variable of the causal-dependent factor $f_2(\tilde{x}_i)$.
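A quick numerical check of this corollary (an illustrative sketch; all variables are hypothetical): when the causal factor is independent of the consequential variable, the fitted conditional expectation is flat, so its derivative, and hence the D-RET (13), vanishes, whereas a dependent factor gives a clearly non-zero slope:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x1 = rng.standard_normal(n)            # consequential variable

# Case 1: causal factor independent of x1 -> E(f2 | x1) is constant.
f2_indep = rng.standard_normal(n)
slope_indep = np.polyfit(x1, f2_indep, deg=1)[0]

# Case 2: causal factor statistically tied to x1 -> E(f2 | x1) = x1,
# whose derivative with respect to x1 is 1.
f2_dep = x1 + rng.standard_normal(n)
slope_dep = np.polyfit(x1, f2_dep, deg=1)[0]

print(slope_indep, slope_dep)  # ~0 versus ~1
```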

Stochastic part of RET and SEG
The S-REC (6c) depends on the diffusivity and on how it depends on $x_i$ and $\tilde{x}_i$. The additive-noise part of the S-REC (7b) is retained in the S-SEG, i.e.: which contributes to increasing the entropy. In a random-walk process, (14) is the sole entropy budget term, leading to a variance increasing at a constant rate in time.
Concerning the multiplicative-noise term (7c), it decomposes as: whose expressions are given by: We must note here that the S-SEG terms (14, 16a) depend uniquely on the conditional expectation of the diffusivity given $x_i$, which is an exclusive function of $x_i$, independently of any cross-dependency. The term (16a) is proportional to the concavity of that conditional expectation, through the second derivative in (16a) (see the proof in Appendix A4). A non-vanishing S-RET (16b) calls for a cross-dependency of the diffusivity on $\tilde{x}_i$. In fact, if the diffusivity is constant or depends only on $x_i$ itself, then (16b) is zero.
After some integral manipulations, the S-RET (16b) gets a CSM-like expression: which depends on the causal sensitivities (first and second derivatives) of the conditional diffusivity with respect to $x_i$.
The proof of (17) is given in Appendix A5. Like (11), the formula (17) has no explicit dependency on the pdf, hence its calculation is considerably cheaper.
The specific contribution to the S-RET (17), analogous to that for the D-RET in (12b), is: From (17) we infer the corollary below:

Corollary of Theorem 2
If $g_{i,i,\tilde{x}_i} = f_3(x_i)\, f_4(\tilde{x}_i)$, i.e. the diffusivity is multiplicatively separable in terms of factors depending on $x_i$ and $\tilde{x}_i$ respectively, then: The proof parallels that of the Corollary of Theorem 1, playing with second derivatives and separating the factors $f_3$, $f_4$ in the integrals spanning $x_i$ and $\tilde{x}_i$ respectively in the expectation operator. The formula (19), like (13), provides a very simple form for computing the S-RET.

Stationary balance of entropy
If the pdf is stationary, for instance when the system (1) is autonomous and reaches the ergodic pdf, then the entropy evolves towards a stationary value and all the budget terms must balance, i.e. their sum is zero: The SEG contributions can be estimated from the model equations, through the expressions of the forcing and diffusivity, and also from the marginal pdf estimated from data. The RET terms can be obtained by the CSM expressions (11, 17) from data-estimated conditional expectations and their derivatives. That balance is important for checking the correct estimation of every term. Under additive-noise conditions, the D-RET is given by: Furthermore, in the case of a noise-free dissipative model that evolves towards the balance, the RET equals minus the D-SEG and is positive.

Integral expressions of the RET
Let us now consider the RETs $T(x_j \to x_i)$, $j \ne i$, between single variables as in [16]. The integrated influence of $\tilde{x}_i$ upon the consequential variable $x_i$, through $T(\tilde{x}_i \to x_i)$, cannot in general be split as the sum of single RETs $T(x_j \to x_i)$, $j \ne i$, except when the variables in $\tilde{x}_i$ are conditionally independent with respect to $x_i$. We will come back to this point at the end of this section.
We start with a result (proven in Appendix A6).

Theorem 3
The rate of entropy transfer $T(x_j \to x_i)$ towards the consequential variable $x_i$ due to the influence of the causal variable $x_j$, $j \ne i$, is decomposed into a D-RET and an S-RET as: where $\rho_{\tilde{x}_j}$ is the pdf of the set of variables different from $x_j$ and the remaining densities are conditional pdfs. Liang [16] obtained $T(x_j \to x_i)$ as the difference between the rate of change of the entropy of $x_i$ and its rate of change in a process in which $x_j$ is frozen. The middle terms of (22b,c) are from Liang [16], and the rightmost ones are the new relationships based on the conditional pdfs. For the two-dimensional case, the relevant quantity is a generalized speed in state-space composed of the deterministic speed and stochastic components, the latter pointing towards lower pdf values and lower diffusivities (i.e. pointing against the gradient).
Note in (23a) that the conditional log-pdf can be written in terms of the copula function, which expresses how the variables are interrelated, independently of the marginal pdfs. The average value of the log-copula is the mutual information between $x_i$ and $x_j$. We can show that (23a) appears with the opposite sign in the budget equation of the copula entropy, which is the opposite of the mutual information, and thus the RET is an indirect entropy transfer, via the copula entropy.
The use of $D$-dimensional integrals for estimating RETs like (23a) calls for the estimation of the pdf of the system. This can be difficult due to data restrictions and sampling, even for moderately low dimensions. Moreover, the state-space discretization and bounding required for the numerical computation of the integrals represent an additional source of error. This will be further illustrated in Secs. 5 and 6 when computing RETs for two different models.
There are alternative integral formulas for the RETs containing fewer sources of error, which rely on the sensitivities (derivatives) with respect to $x_j$, in agreement with the Theorem below, proven in Appendix A7: where we use the conditional exceedance probability of $x_i$ given $x_j$: which vanishes if $x_i$ and $x_j$ are statistically independent.
The sensitivity of the generalized speed (23b) with respect to the causal variable appearing in (24) is: Splitting the first, second and third terms leads to corresponding formulas for the D-RET and S-RET: Let us interpret heuristically the sign of $T(x_j \to x_i)$ in (24). A positive value is favoured when the exceedance term and the speed sensitivity have the same sign. In particular, a positive exceedance term means that for a state-space displacement $\delta x_j > 0$, the tail (or exceedance) probability $\mathrm{Prob}(X_i \ge x_i \mid x_j)$ is larger, and thus values greater than $x_i$ are more likely, i.e. $\delta x_j > 0$ drives $\delta x_i > 0$. Furthermore, a positive speed sensitivity induces a positive variation of the speed and therefore contributes positively to the divergence of the generalized speed and, in analogy with (10b), to an increase of entropy.
Given the dependence on the sensitivity term in (25), we have the following corollary:
The difference between the generalized speed and its causal-dependent part includes all the terms not depending on $x_j$, in particular the diffusion speed under additive noise (27). In the case of additive or self-dependent multiplicative noise, the two coincide. The restriction to $x_j$-dependent terms in (27) simplifies considerably the computation of (24), since many terms can be discarded. It also shows that all the terms without dependence on $x_j$ have only an implicit impact on the RET, via the shape of the pdf.

Expressions of the RETs based on the Causal Sensitivity Method (CSM)
Let us now focus on how (26a,b) can be modified to enable a simple estimation based on conditional expectations, using the CSM already employed to reach (11)-(17). In order to simplify the notation, we split the variables into three kinds: the causal variable $x_j$, the consequential variable $x_i$ (the target) and the remaining outer variables within the state vector, i.e. those different from the pair $(x_i, x_j)$. The outer space, spanned by those variables, can be extended from the null set up to any variable physically possible, or in practice those which have plausible connections to $(x_i, x_j)$ in a certain context; it thus spans the contextual variables, like third-party agents interacting with $x_i$ and $x_j$.
That outer space also contains all the possible factors that can lead from cause to consequence by any sort of causal mechanism. The average of the deterministic term and diffusivity over the outer space, for a fixed value of the consequential variable, plays a very important role in causality diagnostics like the RETs. The RET values can even depend crucially on the extent of the outer space of variables.
Let us denote by an overbar the conditional expectation, over the outer space of variables, for fixed values of the causal and consequential variables. Those quantities are bivariate functions of the pair $(x_i, x_j)$. In particular, in the two-dimensional isolated case (cause, consequence), or in the two-dimensional equivalent case where the outer variables are totally independent of the pair, the overbar quantities coincide with the original deterministic terms and diffusivities. For instance, in fluid dynamics, the Reynolds turbulent stresses, parametrized as functions of large-scale variables, are conditional expectations of this form.
Let us also denote by subscripted operators the expectations over the causal and consequential spaces respectively, and their joint average. Now, we have the CSM equivalent theorem (proven in Appendices A8, A9): where the expectation variables are sub-indexed in the expectation operator, for ease of interpretation.
(29a,b) show that the RETs depend on conditional consequential averages of the deterministic terms and diffusivities (averaged over the outer space) and on their first and second derivatives with respect to the consequential variable. The terms in (29a,b) do not vanish in general because the conditional and marginal expectation operators used above do not commute.
The formulas (29a,b) are linear in the deterministic terms and diffusivities, and thus the single RETs coming from a linear combination of deterministic terms or diffusivities are the linear combination of the single RETs produced by each term.
The RET is the average value of the specific rate of entropy transfer, in analogy with (12, 18), i.e.:
to which the corresponding deterministic and stochastic components contribute: We must note that the specific rate is constant for Gaussian pdfs associated with linear systems, but can vary for non-Gaussian pdfs, yielding contributions to the rate of entropy that can fluctuate between positive and negative values depending on the state.
The formulas (29a,b) have the major advantage, compared with (26a,b), that their computation does not explicitly require the pdf of the system. In practice, the conditional expectations appearing there can be estimated by nonlinear regression using a certain set of basis functions (e.g. monomials), or by parametrizing the joint pdfs by Bayesian methods, followed by an estimation of the average. The regression can be obtained from an ensemble of realizations of state-space trajectories, like time series, exploring the ergodic pdf, or from a time-evolving transient ensemble where the regression coefficients vary in time.
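Such a basis-function regression can be set up with ordinary least squares; the sketch below (self-contained, with an assumed toy outer-averaged forcing containing a cross term, not one of the paper's models) fits a monomial basis in the causal and consequential variables and recovers the cross-term coefficient whose $x_i$-derivative feeds the sensitivity in (29a):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
xj = rng.standard_normal(n)   # causal variable samples
xi = rng.standard_normal(n)   # consequential variable samples

# Assumed toy outer-averaged forcing with a cross dependency: Fbar = xj * xi.
Fbar = xj * xi

# Monomial basis functions up to second order in (xj, xi).
basis = np.column_stack([np.ones(n), xj, xi, xj * xi, xi**2])
coef, *_ = np.linalg.lstsq(basis, Fbar, rcond=None)

# The regression isolates the cross term; its analytic derivative with
# respect to xi then gives the causal sensitivity without any pdf estimate.
print(coef[3])  # coefficient of xj*xi, close to 1
```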
Theorem 5 has important corollaries allowing for simple applications to quite general systems: linear, or nonlinear with additive or multiplicative noises, with or without cross-dependencies. They also provide relevant information about the rates of entropy transfer in models where the governing equations are known.

Corollary 1 of Theorem 5
If the deterministic term is linear with respect to the causal variable $x_j$, with a constant coefficient, then: The systems verifying that condition can even depend nonlinearly both on $x_i$ and on the outer variables. The proof is straightforward, since the conditional expectation is then linear in $x_j$. For a linear system of dimension D: where $A$ is a constant matrix and the noise is additive, i.e. $B$ is a constant matrix, the ergodic pdf is Gaussian with zero average and covariance matrix $C$, and the rate of entropy transfer is: where the entries of $C$ are the covariances between the variables and $\Delta$ denotes the corresponding cofactor determinant. In the particular case of a two-dimensional linear system driven by additive noise, we have, after computing the cofactors, the Liang [15] result: A very important corollary of Theorem 5, based on the CSM, is the case where the deterministic terms and diffusivity terms have a separated product dependence between the causal variable and the remaining variables, leading to the following result (proven in Appendices A10, A11):

Corollary 4 of Theorem 5
In the particular case where the deterministic term factorizes into functions of the consequential variable, the causal variable and the outer variables respectively, then: In a similar way, for a diffusivity that factorizes likewise, we get: The above closed formulas should be applied to each term of the deterministic part and of the diffusivity. These relations constitute the practical basis of the CSM for the single rates of entropy transfer.
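Returning to the two-dimensional linear additive-noise case above, the closed formula can be checked numerically. The sketch below assumes Liang's [15] linear-model form $T(x_2 \to x_1) = a_{12} C_{12}/C_{11}$ and obtains the stationary covariance $C$ from the Lyapunov equation; the matrices $A$ and $B$ are arbitrary illustrative choices:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Linear SDE dX = A X dt + B dW with additive noise (B constant).
A = np.array([[-1.0, 0.5],
              [ 0.0, -1.0]])
B = np.eye(2)
Q = B @ B.T

# Stationary covariance C solves the Lyapunov equation A C + C A^T + Q = 0.
C = solve_continuous_lyapunov(A, -Q)

# Assumed Liang [15] closed form for the transfer from x2 to x1.
T21 = A[0, 1] * C[0, 1] / C[0, 0]
print(T21)  # analytic value 1/9 for this choice of A and B
```

Here $C = [[9/16, 1/8], [1/8, 1/2]]$ by direct substitution, so $T_{2\to1} = 0.5 \times (1/8)/(9/16) = 1/9$.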

Causal synergies and rates of entropy transfer
At the beginning of Sec. 3.1, the problem of relating the global RET $T(\tilde{x}_i \to x_i)$ with the RETs between pairs of variables was posed. In order to figure out the link between these quantities, we consider the simplest situation of dimension $D = 3$ (without loss of generality) and a multiplicatively separable deterministic term. Now, let us decompose the averaged conditional product of the two causal factors as the product of their conditional expectations plus their conditional covariance. Then, after taking the derivative and applying the single-RET expression (37a), we get: which shows that the joint RET is the sum of the single RETs (41b) plus a synergetic term (41c) related to the conditional covariance between the functions of the complementary variables $x_2$, $x_3$. The same procedure can be replicated for multiple variables, putting into evidence the effect of synergies of pairs, triplets and other groups of variables, which are accounted for in multivariate forms of the mutual information, like the Interaction Information for triplets [28].
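The role of the conditional covariance can be illustrated numerically (a toy sketch with hypothetical variables; a discrete conditioning variable makes the conditional statistics exact): a latent shared driver makes x2 and x3 covary conditionally on x1, activating the synergetic term, whereas conditionally independent variables leave only the single RETs:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
x1 = rng.choice(np.array([-1.0, 0.0, 1.0]), size=n)  # discrete conditioner
shared = rng.standard_normal(n)                      # latent common driver

# Conditionally dependent pair: Cov(x2, x3 | x1) = Var(shared) = 1.
x2 = x1 + shared + 0.3 * rng.standard_normal(n)
x3 = x1 + shared + 0.3 * rng.standard_normal(n)
# Conditionally independent pair: Cov(y2, y3 | x1) = 0.
y2 = x1 + rng.standard_normal(n)
y3 = x1 + rng.standard_normal(n)

def mean_cond_cov(a, b, cond):
    """Average of Cov(a, b | cond) over the values of the conditioner."""
    covs = [np.mean(a[m] * b[m]) - np.mean(a[m]) * np.mean(b[m])
            for v in np.unique(cond) for m in [cond == v]]
    return float(np.mean(covs))

cov_dep = mean_cond_cov(x2, x3, x1)
cov_indep = mean_cond_cov(y2, y3, x1)
print(cov_dep, cov_indep)  # ~1 (synergy active) versus ~0 (single RETs add up)
```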
The sum of single RETs equals the global RET when all the non-consequential variables are conditionally independent with respect to the consequential variable.
This leads us to introduce the concept of the synergetic rate of entropy transfer as the difference between the global RET and the sum of single RETs, with corresponding deterministic and stochastic contributions: For linear systems the deterministic synergetic RET vanishes, though this is not a sufficient condition for linearity.

Rate of entropy transfer between transformed variables
We address in this section the problem of how changes of variables can affect the values of the RET. Ideally, in a naive view, any independent change of variables should not influence the RET, because causality only depends on the link between variables and not on how variables or processes are represented, in particular their physical units.
The next theorem checks in a rigorous manner the effect of one-to-one diffeomorphic changes of variables on the rate of entropy transfer. For that, let us consider new variables, defined in a union of open intervals (eventually not connected), which are functions respectively of the causal, consequential and outer variables, and where, without loss of generality, all the Jacobians are strictly positive. Then, we have the following result, proven in Appendix A12.

Theorem 6
The rate of entropy transfer (RET) between one-to-one diffeomorphism variable changes is: The result (43b) shows that the RET correction ( → ) is a unique function of the diffusion term, being the deterministic part of ( → ) kept invariant for any one-to-one diffeomorphisms of variables. Only the stochastic RET can change, being invariant when any of the factors in (43b) vanish, i.e.: the noise is additive or multiplicative without cross dependencies, i.e. , = 0 or when the variable transformation →̂ is affine (translation and scaling) leading to log ℑ = 0. This guarantees the RET invariance under quite generic conditions. This is a revision of Remark 1 of Theorem III.3 of Liang [17], showing that in certain conditions, it is possible to make invertible changes of the consequential variable, keeping the value of the entropy transfer. The variables ̂ where ≡~( , ) can additionally be an invertible mixture of , since |~ is preserved by that transformation. The r.h.s. of (43b) shows that the correction, due to the variable change is obtained by conditional expectations of the diffusivity. The result (43b) can be important for the RET estimation, since certain variable changes can make the pdf simpler, as for instance Gaussian anamorphoses of the single variables [29] where ̂,̂,̂ are such that the marginal pdfs follow a standard Gaussian pdf, though the multivariate pdf is not necessarily Gaussian.
When the variable changes are not injective, the problem is more complicated. For instance, for →̂= 2 , an extra term appears, given by the entropy of the inverse images of the transformation, which vanishes in the case of injective variable changes. This problem is not addressed here.

Numerical approach to the computation of RETs
The estimation of the rates of entropy transfer (RETs) is a more complicated task than the estimation of single, multivariate and conditional entropies, for which numerous estimators exist in the literature, as seen in the review [30]. The main difficulty comes from the RET dependency not only on the pdfs and conditional pdfs (like the entropy) but also on the partial derivatives of pdfs (10a, 23a). The latter can be difficult to evaluate, especially if pdfs display abrupt changes or their support set has a fractional Hausdorff dimension, as in strange chaotic attractors, leading to non-differentiable pdfs (although some pdf projections, like marginal pdfs, can be differentiable). This difficulty is alleviated if stochastic forcing noise of sufficient intensity is added to the evolution equations of deterministic chaotic systems, thus producing non-vanishing pdf values in the fractal gaps of the strange attractors. Therefore, the estimation of pdf derivatives by finite differences calls for the prescription of both the full pdf and the conditional pdfs along a regular grid of the state-space. This constraint calls for the estimation of RET integral formulas using integrals over the pdf, as used by Liang in [16,17], which will be considered as a benchmark against which to compare the CSM approach.
Therefore, in testing the methods to evaluate the rate of entropy transfer ( → ), ( , = 1, … , , ≠ ), four different approaches are considered: the Analytical-Numerical (AN) approach, which consists in using direct estimations of the probabilities and integrals over the pdfs; the Causal Sensitivity Method (CSM); Model Fitting (MF) followed by the application of CSM; and finally the Multivariate Linear (ML) approach of Liang [18], which supposes a linear model fitting and Gaussian-distributed noises. The four methods will be applied to two low-order models of dimension = 3 whose equations are known: (1) a model derived from a potential function [16,17], with some generalizations to include multiplicative noise, linearity and non-linearity, all weighted by appropriate parameters, and (2) the classical Lorenz chaotic model [26], driven by additive or multiplicative noise, weighted by controlled parameters. The details of the four approaches are described below. Sections 5 and 6 present the model results.

Analytical-Numerical (AN) approach
In the first approach, the so-called 'Analytical-Numerical' (AN) approach, we evaluate the integrals implicit in (26a,b). An analytical or a data-estimated multivariate (3D) pdf, as well as the analytical values of the sensitivities to the causal variable , , , 2 , are used at the centroids ( ′ , ′ , ′ ) of equal-volume voxels (3D cells) with sides ∆ , ∆ , ∆ , respectively in the variables , , ( ≠ , ), of the state-space region in which the bulk of the multivariate pdf lies. In the forthcoming discretized expressions, we use primed indices for the respective variables (e.g. ′ is the index discretizing variable ). This method was used in Liang [16,17], but using integrals of (22b,c).
Recall that in three dimensions there is only one outer variable ≠ , . The probability of each voxel ( ′ , ′ , ′ ) is denoted as , such that the probabilities sum to one, and ̃ is prescribed or estimated.
The one- or two-indexed probabilities are sums over the remaining indices, e.g.
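In a discretized implementation, these marginalizations reduce to sums over the remaining array axes. A minimal NumPy sketch (the array shape and values are illustrative, not taken from the models of this paper):

```python
import numpy as np

# Voxel probabilities stored as a 3D array P[i', j', k'] (here a random
# example, normalized so that all probabilities sum to one).
rng = np.random.default_rng(0)
P = rng.random((20, 20, 20))
P /= P.sum()

# One- and two-indexed probabilities: sums over the remaining indices.
P_ij = P.sum(axis=2)        # P(i', j') = sum over k'
P_i = P.sum(axis=(1, 2))    # P(i')    = sum over j' and k'
```

The same axis-summation pattern yields any lower-dimensional marginal needed in the discretized integrals below.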
Therefore, by restricting the 3D integrals of (26a,b) to the bounded domain, we get: where stands for pdf. In (44), the derivatives of a generic quantity at the voxel ( ′ , ′ , ′ ) are estimated by centred differences as ; at the interval bounds ′ = 1, , one takes the values at ′ = 2, − 1, respectively.
After the discretization of (44), one obtains: By using the prescribed analytical expressions of , , and to form in (45) at each voxel ( ′ , ′ , ′ ), the rightmost factor of the above sum is: The truncation and discretization errors (due to the finite bounds , and the non-infinitesimal Δ, respectively) committed in the estimated causality depend strongly on the model equations and pdf. For instance, if [ , ] spans standard deviations around the average , the estimation error in the generic average of ( ) within the probabilistic integral satisfies, after using the Chebyshev inequality, where is the estimated average in the bounded interval and is the average of outside the interval [ , ]. In the problem of computing ( → ), the above quantities are terms of . Therefore, to minimize | |, a high enough must be taken, especially if ( ) is monotonically increasing with a positive power of | |. Moreover, | | and | | increase in the case of fatter pdf tails. As a rule of thumb, at least = 3 standard deviations must be included in the interval used in the numerical integral. In practice, a stopping criterion for increasing | |, | | and the resolution is that the estimated averages of the highest-order products of variables (e.g. 2 2 ) have corrections smaller than 1% of that average. The deterministic and noise contributions ( → ) , and ( → ) , by AN are computed and used as benchmarks for the other methods. The AN method is considered a 'brute-force' method, since it requires the direct computation of the integrals. This can be a very heavy computation, since the number of cells in the domain integrals increases as a power of the system dimension. Moreover, the effective implementation increases in complexity with the dimension.
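The voxel-sum evaluation of probabilistic integrals with centred-difference pdf derivatives can be sketched as follows. As a sanity check we use a standard 3D Gaussian pdf, for which integration by parts gives ∫ x1 (∂p/∂x1) dV = −1; the grid bounds and resolution are illustrative choices, not those of the paper:

```python
import numpy as np

# pdf tabulated on a regular 3D grid, centred differences for its partial
# derivative, and a probabilistic integral approximated as a voxel sum.
L, n = 6.0, 121                       # interval half-width and grid points
x = np.linspace(-L, L, n)
dx = x[1] - x[0]
X1, X2, X3 = np.meshgrid(x, x, x, indexing="ij")
p = np.exp(-(X1**2 + X2**2 + X3**2) / 2) / (2 * np.pi) ** 1.5

# Centred differences along x1 (np.gradient uses one-sided ones at the bounds).
dp_dx1 = np.gradient(p, dx, axis=0)

# Voxel-sum approximation of  ∫ x1 (∂p/∂x1) dV,  which should be close to -1.
integral = np.sum(X1 * dp_dx1) * dx**3
```

The truncation error of this example is negligible because the Gaussian tails beyond |x| = 6 carry almost no mass; for fat-tailed pdfs the bounds must be widened, as discussed above.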

Causal Sensitivity Method (CSM)
In the second approach, which we developed in the context of this analysis, we identify the parts , and , , (27) of the deterministic term and of the diffusivity , that depend on . Then, when , and , , are composed of additive terms with a separation of the causal variable, we apply the formulas (36a,b) or (37a,b), which depend on the particular form of the model equations.
In the application of the CSM, the expectations are computed as averages over a certain ensemble of different realizations of the model states. The ensemble can be: 1) a transient ensemble of a large number of time-evolving states driven by the equations of the model, or 2) a long enough time-series of length with sampling time Δ , tuned to uncover the ergodic pdf of the model, with = /Δ realizations. Either the RET coming from a very large ensemble (long time-series) or the average of RETs obtained from shorter ensembles (time-series) can be taken. In the latter case, realizations of the estimated RET are obtained by generating a number of time-series (30 in practice) with different noise-generator seeds and initial conditions, leading to RET estimations (ensemble members) whose mean and standard deviation over the sampled realizations are computed. The sensitivity to the time-series length is also studied.
In practice, the values of , and , , are calculated for each realization (ensemble member or instant). One of the assumptions for applying the CSM formulas (36a,b) or (37a,b) is that the different additive terms of , and , , can be factorized into factors depending on the different variables. That happens, for instance, when the deterministic terms are products of monomials of the different variables , , . For each term, the single-variable dependent functions of , , are identified, after which one computes the conditional expectations for a given value of . Thanks to the product separability of the forcings, the regressions are univariate. In general, nonlinear regressions are performed, in which we consider a certain set of basis functions (e.g. monomials , = 0, … , , up to a certain degree, or a set of orthogonal functions, wavelets, trigonometric functions, or polynomial splines over a set of subintervals of ), ordered by decreasing scale (e.g. polynomial order, wave number). The conditional expectation function must be differentiable in order to apply CSM with algorithmic expressions for the first and second derivatives. The most efficient and robust regression depends on the choice of the basis functions or of the regression formula as a whole. In this respect, several strategies can be followed: 1) mixing different basis functions (e.g. polynomials, trigonometric functions, wavelets); 2) using machine-learning techniques such as Relevance Vector Machines [31]; and 3) evolving the basis functions towards an optimal set by using Symbolic Regression [32]. In general, the regression formula may depend nonlinearly on the fitted parameters.
For the sake of algorithmic simplicity and didactic presentation, we will restrict our analysis to polynomial basis functions. In the case of small ensembles or short time-series, the regressions must be accompanied by statistical significance tests in order to obtain robust regressions (e.g. forward stepwise regression or computation of the Akaike Information Criterion).
The regression coefficients are fixed for an ergodic pdf, or time-evolving for transient ensembles of the model. For a good performance of CSM, the conditional averages like [ ( | )] must be accurately estimated, as well as their derivatives. Therefore, depending on the particular RET, must be high enough that the regressions do not differ much from the probabilistic averages ∫ ( | ) ( ) along the interval [ , ]. Too small an can lead to biases.
After calculating the regressions, their first and second derivatives with respect to are computed, as well as the required products and averages. Then ensemble (or time) averages are taken. If the regressions are computed using a basis of unbounded functions like monomials, the regressions, as well as the values of the specific RETs ℱ , , ( ) and ℱ , , ( ), can reach (in absolute value) much higher values when is extrapolated beyond the bounds of the dataset. In those cases, a certain form of trimmed average (cutting the extremes) can be applied.
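A minimal sketch of this regression step, using a synthetic univariate example in which the true conditional expectation is known (E[y|x] = x², an assumption made purely for illustration):

```python
import numpy as np

# Fit a differentiable polynomial regression to the conditional expectation
# E[y|x], then evaluate its first and second derivatives algorithmically.
# Synthetic example: y = x^2 + noise, so E[y|x] = x^2, dE/dx = 2x, d2E/dx2 = 2.
rng = np.random.default_rng(1)
x = rng.normal(size=50_000)
y = x**2 + 0.3 * rng.normal(size=x.size)

Nd = 4                                   # polynomial degree of the regression
coef = np.polyfit(x, y, Nd)
d1 = np.polyder(coef)                    # first-derivative coefficients
d2 = np.polyder(coef, 2)                 # second-derivative coefficients

mean_d1 = np.polyval(d1, x).mean()       # ensemble average of dE/dx  (~ 0 here)
mean_d2 = np.polyval(d2, x).mean()       # ensemble average of d2E/dx2 (~ 2 here)
```

Trimming of extreme evaluation points, as mentioned above, would be applied before the final averages when the fitted monomials are extrapolated beyond the data range.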
The computational complexity of CSM is much lower than that of AN, making it an excellent alternative to the highly demanding AN approach, especially in high-dimensional systems.

Model fitting (MF) followed by CSM
The CSM approach assumes known model equations. However, in practice, both the deterministic terms and the diffusivities , ( = 1, … , ) may be partially or totally unknown. In this case, an additional preliminary model-fitting operation is required before applying CSM. This procedure, hereafter referred to as 'Model fitting followed by CSM' (MF), is described below.
In order to apply the CSM, we suppose that , ( = 1, … , ) are written as linear combinations of variable-separable basis functions (including the unity), Ψ , Φ , belonging to certain prefixed classes (e.g. multivariate polynomials). Let us consider the deterministic term decomposition and the diffusivity decomposition where , , (merged in the matrix ) and , , (merged in the matrix ) are constant parameters. The corresponding fitted values from data are denoted respectively as ̂, , and ̂, , , merged in the matrices ̂ and ̂.
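A minimal sketch of such a fit: the finite-difference tendency is regressed by least squares on the basis functions. The one-dimensional Ornstein-Uhlenbeck example and the basis {1, x, x²} are illustrative assumptions, not the paper's model:

```python
import numpy as np

# Model-fitting step: drift written as a linear combination of prefixed
# basis functions, coefficients obtained by least-squares regression of the
# finite-difference tendency on the basis.  Example:
#   dx = -x dt + 0.5 dW,  basis {1, x, x^2}  ->  expected coefficients (0, -1, 0).
rng = np.random.default_rng(2)
dt, N = 0.01, 200_000
noise = 0.5 * np.sqrt(dt) * rng.normal(size=N - 1)
x = np.empty(N)
x[0] = 0.0
for k in range(N - 1):                      # Euler-Maruyama integration
    x[k + 1] = x[k] - x[k] * dt + noise[k]

tendency = np.diff(x) / dt                  # finite-difference estimate of dx/dt
basis = np.column_stack([np.ones(N - 1), x[:-1], x[:-1] ** 2])
coef, *_ = np.linalg.lstsq(basis, tendency, rcond=None)
```

Fitting the diffusivity decomposition proceeds analogously, regressing the squared residual tendency on the squared basis.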
As in the case of conditional expectations, other regression techniques may be applied (e.g. machine-learning techniques), paying attention to the product separability of the forcings and to algorithmic expressions for their derivatives. Moreover, robust regressions must be accompanied by statistical significance tests when the available time-series are short.
The estimation method is presented in Appendix B1. For instance, in order to fit a linear model, we consider the class of basis functions {Ψ ( )} = {1, 1 , … , }.
Since the RETs of a linear combination of deterministic terms or diffusivities (47a, 48a) are the linear combination of the RETs driven by each term, we can write the total RET obtained by MF as a sum over the basis functions as: By applying the CSM to each contribution (49a,b) and using the estimated parameters, one has the RETs associated with each basis function: Despite the above constraints, the CSM and MF approaches turn out to be computationally cheaper and technically easier than the AN approach, even in high system dimensions. Tests of MF will be restricted to the Lorenz model here (see Sec. 6).

Multivariate Linear Approach
The multivariate linear (ML) approach uses the formulas obtained by Liang [18] and covered in (34), where a multivariate linear model with additive Gaussian noise is fitted to data, using the maximum-likelihood estimator of the model coefficients and also minimizing the mean squared error of the time derivative of . Then the corresponding RET, denoted as ( → ) , is obtained using the linear fitted model. Since the fitted model has no multiplicative noise, ( → ) , = 0.
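As an illustration, the bivariate form of this linear estimator can be sketched as below. The covariance-based formula is the commonly quoted bivariate version of Liang's maximum-likelihood estimator (the multivariate case of [18] is analogous), and the test series are synthetic:

```python
import numpy as np

# Bivariate linear (ML) estimator: a linear model with additive Gaussian
# noise is implicitly fitted, and the transfer from x2 to x1 is a closed-form
# function of the sample covariances, with Cjd = cov(x_j, dx1/dt).
def liang_bivariate(x1, x2, dt):
    d1 = np.diff(x1) / dt                     # finite-difference tendency of x1
    x1, x2 = x1[:-1], x2[:-1]
    C = np.cov(x1, x2)
    C1d = np.cov(x1, d1)[0, 1]
    C2d = np.cov(x2, d1)[0, 1]
    num = C[0, 0] * C[0, 1] * C2d - C[0, 1] ** 2 * C1d
    den = C[0, 0] ** 2 * C[1, 1] - C[0, 0] * C[0, 1] ** 2
    return num / den

# Two independent Ornstein-Uhlenbeck series: the estimated transfer
# should be close to zero.
rng = np.random.default_rng(3)
dt, N = 0.01, 200_000
w1 = np.sqrt(dt) * rng.normal(size=N - 1)
w2 = np.sqrt(dt) * rng.normal(size=N - 1)
x1, x2 = np.zeros(N), np.zeros(N)
for k in range(N - 1):
    x1[k + 1] = x1[k] * (1 - dt) + w1[k]
    x2[k + 1] = x2[k] * (1 - dt) + w2[k]

T21 = liang_bivariate(x1, x2, dt)
```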
The ML approach is equivalent to the CSM approach when the fitted model is linear and the regressions used for the conditional expectations are linear, i.e. when the order of the fitting polynomials is = 1. This method is used to emphasize the limitations of such a linear approach, which is not expected to work as accurately as a nonlinear estimator in strongly nonlinear situations. Therefore, ML is expected to work under weak nonlinearity, as will be shown in the examples below. Table 1 summarizes the methods of RET estimation, the models used in this study, and the drawbacks, advantages and computational cost (raw estimation of the number of simple algebraic operations) of each method in terms of the size parameters: = state-space dimension, = number of voxels (bins) taken in each variable, = number of ensemble realizations or time-series values, = number of basis functions or free regression parameters. According to Table 1, the computational complexity grows exponentially with in the AN approach, essentially due to the pdf estimation, whereas that of CSM grows as a power of , essentially due to the solution of the regression problems. This clearly shows the much higher efficiency of CSM compared to AN.

Table 1: Summary of tested methods with the applied models, drawbacks, advantages and computational cost of the RET estimation.

Model description
The AN computation of RETs was performed by [16,17] on a stochastic potential model. Here, we will make independent estimations on the same model, adding some generalizations, for the validation of three of the methods of computing RETs proposed in Sec. 4: the AN, CSM and ML methods. The MF approach will be tested with the Lorenz model in Sec. 6.
The values = 1, = = 0 reproduce the situation studied by Liang [17]. The parameter ≠ 0 produces a cross-linear symmetric term across the variables, whereas = = 0 leads to pure Ornstein-Uhlenbeck processes (the continuous equivalent of red-noise processes).
The stationary solution of the Fokker-Planck equation (5) governing the model pdf is ( 1 , 2 , 3 ) ≡̂/ , with where is the normalization constant such that the integral of over ℝ 3 equals one.
For the implementation of the AN approach and the computation of (45), one requires the sensitivities of the generalized speed with respect to (25), which can be written for the potential model as: in which we have used: The RETs based on the CSM approach are obtained by using the procedure described in Sec. 4.2, whose particular forms are given below.

Formula of ( → )
Here, the causal variable is 2 , the consequential variable is 1 , and the single outer variable is = 3 , or = 1, = 2, = 3. The terms used in the computation are: Application of (37a,b) to each term of 1,2 and 1,1,2 leads to: The deterministic part (57a) is decomposed into a nonlinear contribution proportional to and a linear one proportional to , whereas the noise part (57b) is proportional to the multiplicative-noise parameter .

Results of the experiments
We have computed the RETs for different values of , , . Three scenarios were considered: 1) An additive noise scenario with increasing non-linearity controlled by . 2) A linear scenario with increasing multiplicative noise controlled by . 3) A multiplicative noise scenario with increasing non-linearity controlled by .
In all validation cases, the pdf is ergodic and the CSM and ML estimates of the RET are obtained using model runs of size = 4000 (and 400) time units, corresponding to = 10 5 (10 4 ) time-steps with = ∆ = 0.04. Ergodicity was assessed by verifying the similarity of the statistical properties of independent model runs. The degree of the polynomial regressions of the conditional expectations is fixed to = 4, with some tests using = 2, 8. Model runs are started at 1 = 2 = 3 = 0.01, using = 30 random seeds for the noise generator. The AN integrals are estimated in a cube with edges [− , ], = 10. This obviously truncates the pdf tails, but to a relatively low extent. A discretization of = 200 points along every edge of the cube is used, leading to 200 3 cubic voxels.

Additive noise with increasing non-linearity
Here, we set = 0. , = 1, = 0 and ∈ [0,1], starting with a linear model ( = 0) and then increasing the amplitude of the non-linearity. The stochastic part of the RET vanishes since the noise is purely additive, and additionally ( 1 → 3 ) = 0 (61a). The value = 1 reproduces the case studied by Liang [17]. Figure 1 shows the values of ( 2 → 1 ) and ( 1 → 2 ) computed by the AN, CSM and ML approaches as a function of . The RETs increase with the strength of nonlinearity due to nonlinear cross-dependencies. Despite its explicit linear dependence on , the RET increases in a nonlinear way due to the modifications of the pdf as a function of . One obtains a good estimate with a quadratic regression of the conditional expectation, as shown below. Therefore, where and stand for covariance and variance, respectively. Taking the derivative, one gets: leading to an estimate of the specific RET: varying with 1 , as expected in nonlinear systems. Its average is For instance, for = 1, we have ( 2 2 , 1 2 ) = −0.038, ( 1 2 ) = 0.348, 1 ( 1 2 ) = 0.402, and the computation provides ( 2 → 1 ) ≈ 0.088, which is fairly close to ( 2 → 1 ) , = 0.094, obtained independently by [15]. Figure 1 shows the AN value of the RETs (dotted line) and the average (solid lines) and standard deviation (error-bar length) over the 30 seeds of the RETs obtained by CSM and ML. The AN and CSM values are relatively close to each other, with a slight negative bias of CSM, lower than 7-10%, valid both for the long time series (Fig. 1a) ( = 4000, = 10 5 ) and the short time series (Fig. 1b) ( = 400, = 10 4 ). The difference is due to estimation errors both in AN, from the specific choices of the interval size and its resolution, and in CSM, from the regression accuracy. By comparing Figs. 1a-b, we see that the standard deviation of the RETs is approximately proportional to 1/√ , being 5% of the AN RET for the long time series.
Therefore, to achieve an acceptable error of less than 20%, we must have > 160. The ML estimates of ( 1 → 2 ) and ( 2 → 1 ) are close to zero for the tested nonlinearity weights, which shows that ML is not able to detect RETs associated with those nonlinear terms. Figure 1. RETs for the scenario = 0. , = 1, = 0, with ∈ [0,1], of the potential model. Mean over 30 noise seeds of ( 1 → 2 ) (red curves) and ( 2 → 1 ) (black curves) by the AN (dotted line, squares), CSM (solid line, circles) and ML (solid line, triangles) approaches. Standard deviations over the ensemble of seeds are marked by the error-bar length. Panels a) and b) refer respectively to period (time-series) lengths of 4000 and 400 time units. Note that the curves of ( 1 → 2 ) and ( 2 → 1 ) for the ML approach overlap.

Linear system with increasing multiplicative noise
Here we set = 0, = 0.5, = 1 and ∈ [0,1], starting from a situation of pure additive noise ( = 0). Then the multiplicative-noise strength increases, producing non-Gaussian statistics despite the fact that the system is linear [34]. Thanks to the model symmetrization when = 0, all the RETs are equal, and thus we only evaluate ( 2 → 1 ) and ( 2 → 1 ) for the various approaches (AN, CSM, ML). Note that in this scenario both the deterministic term and the noise scale as ‖ ‖, leading to extreme behaviors when is large. In fact, over the range ∈ [0,1], the standard deviation of 1 increases from 0.92 to 1.94 and its kurtosis excess increases from 0.04 to 34, which is an extreme scenario. This is easily verified through the pdf formula (53). For instance, for = 0.25 the pdf scales as ̂~− 9 , while for = 1 the pdf scales as ̂~− 3 , producing a fat-tailed distribution.
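The qualitative mechanism (a linear drift whose multiplicative noise produces fat tails) can be reproduced with a one-dimensional sketch; the toy SDE dx = -x dt + (b + c x) dW and its parameter values are illustrative assumptions, not the potential model itself:

```python
import numpy as np

# A linear SDE develops fat tails as the multiplicative-noise amplitude c
# grows, even though the drift is linear.
def simulate(c, seed, b=0.5, dt=0.01, N=300_000):
    rng = np.random.default_rng(seed)
    w = np.sqrt(dt) * rng.normal(size=N - 1)
    x = np.empty(N)
    x[0] = 0.0
    for k in range(N - 1):                  # Euler-Maruyama integration
        x[k + 1] = x[k] - x[k] * dt + (b + c * x[k]) * w[k]
    return x[N // 10:]                      # drop a transient

def excess_kurtosis(x):
    x = x - x.mean()
    return (x**4).mean() / (x**2).mean() ** 2 - 3.0

kurt_additive = excess_kurtosis(simulate(c=0.0, seed=4))        # ~ 0 (Gaussian OU)
kurt_multiplicative = excess_kurtosis(simulate(c=0.6, seed=4))  # clearly positive
```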
In this case, the RET is approximated using a simple linear regression: For the whole range of , the correlation value is ( 1 , 2 ) ≈ 0.66 and thus ( 2 → 1 ) ≈ 0.33 = 0.66 × 0.5, in very good agreement with the AN and CSM-based RET estimations shown in Fig. 2 through their averages over noise-seed values.
The ML estimation, also shown in Fig. 2, is able to detect the deterministic RET, though affected by a bias, which becomes large for highly leptokurtic (high kurtosis) probability distributions, associated with high .
The stochastic RET is: For symmetry reasons, all the cross statistics are symmetric and thus 1 ( 1 | 2 ) and 2 ( 2 | 1 ) must have the same linear functional dependency, leading to 2 2 ( 2 | 1 ) 1 2 = 0. Moreover, by using a quadratic regression for 2 2 , we get 2 2 ( 2 | 1 ) 1 2 ≈ 2 ( 1 2 , 2 2 ) ≈ 0.52 over the whole range. Therefore, one obtains: Figure 2 also shows the values of the S-RET ( 2 → 1 ) , i.e. the stochastic RET, obtained with the AN and CSM approaches. Both approaches agree quite well for ≤ 0.6, beyond which the excess kurtosis is larger than 3 and the AN value becomes highly biased if too narrow interval bounds are used. Here the interval [−10,10] was shown to be sufficient to account for the pdf tails. The CSM value grows almost linearly with , in agreement with (67). This suggests that the CSM estimation can even provide a more accurate and outlier-resistant estimate than the AN method in certain circumstances. The standard deviations of the RETs obtained by CSM grow with the intensity of the multiplicative noise and scale approximately as 1/√ (as in the scenario of Sec. 5.3.1), as seen by comparing the values for the long period (Fig. 2a) and the short period (Fig. 2b).
Another interesting aspect is that, under certain extreme noise conditions, the stochastic RET can be larger than the deterministic RET (e.g. > 1), which means that information flows more through the noise than through the deterministic dynamics. Figure 2. RETs for the scenario = 0, = 0.5, = 1 and ∈ [0,1] of the potential model. Mean over 30 noise seeds of ( 2 → 1 ) by CSM (red curve, squares), by ML (green solid, diamonds), and of ( 2 → 1 ) by CSM (magenta solid, triangles). Values of ( 2 → 1 ) by AN (black dotted) and ( 2 → 1 ) by AN (blue dotted). Standard deviations over the ensemble of seeds are marked by the error-bar length. Panels a) and b) refer respectively to period (time-series) lengths of 4000 and 400 time units.

Multiplicative noise with increasing nonlinearity
Here we set = 0.5, = 0.3, = 1 and ∈ [0,1], simulating a mixture of non-linearity and multiplicative noise. The symmetry between variables is broken through a non-vanishing . Moreover, according to (51a-c), the deterministic terms scale as ‖ ‖ 3 whereas the noise scales as ‖ ‖ 2 , which contributes to a smaller kurtosis of the system variables than in the previous scenario.
Concerning ( 2 → 1 ) (57a), and using the linear regression for 2 ( 2 | 1 ) and the quadratic regression for 2 ( 2 2 | 1 ), one gets: (68) Figure 3 shows the values of ( 2 → 1 ) using the AN and CSM approaches with regression degree = 8, as well as the minimal-regression approach given by (68) (acronym MN) and the ML method. All the AN, CSM and MN estimations are quite close to each other. The ML approach is reasonable only for small values of the nonlinearity (small ), i.e. weak nonlinearity.
The specific RET is not constant, as expected for nonlinear systems, and has higher amplitude for larger values of 1 2 , according to: The minimal regression of ( 2 → 1 ) (59b) leads to Figure 3 also shows the values of ( 2 → 1 ) (Fig. 3a: long period, Fig. 3b: short period), obtained with the AN, CSM and MN approximations. The AN and CSM values agree quite well, with a quite small standard deviation over the noise-generated samples, as seen in the error bars of Figs. 3a,b, while MN presents lower RET values. Despite this bias, the dependence on is well captured by the minimal regression of MN. The standard deviation of the sampled RETs by CSM also scales approximately as 1/√ , as in the previous scenarios.
The graphics for ( 1 → 2 ) and ( 1 → 3 ) are similar to those of Fig. 3 (not shown). Figure 3. RETs for the scenario = 0.5, = 0.3, = 1 and ∈ [0,1] of the potential model. Mean over 30 noise seeds of ( 2 → 1 ) by CSM (red, solid), by ML (green, solid), by MN (deep blue, solid), and of ( 2 → 1 ) by CSM (magenta, solid) and by MN (purple, solid). Values of ( 2 → 1 ) by AN (black, dotted) and ( 2 → 1 ) by AN (blue, dotted). Standard deviations over the ensemble of seeds are marked by the error-bar length. Panels a) and b) refer respectively to period (time-series) lengths of 4000 and 400 time units.

Lorenz Model
In this section we compute the RETs without the explicit knowledge of the model pdf, unlike in the previous section. For this extension of the analysis to more classical models, Lorenz's 1963 model [26], driven by a noise term, is used [35]. The Lorenz model and its extensions have been used in numerous theoretical and methodological studies, such as predictability [36,37], data assimilation [38], bifurcation theory [39], ergodic theory [40], thermodynamics [41] and signal processing [42], where we set equal noise coefficients: with 2 ≥ 0 and ≥ 0 being parameters associated with the amplitudes of the additive and multiplicative noises, respectively. The noise-free model is set to = = 0. The noises follow the law: ~√ (0, 1). The model converges to pullback attractors whose properties depend on the parameter values [34].
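A minimal Euler-Maruyama integration of a noise-driven Lorenz-63 system can be sketched as follows (standard parameters sigma = 10, rho = 28, beta = 8/3; the additive-noise amplitude and step size are illustrative choices):

```python
import numpy as np

# Euler-Maruyama integration of Lorenz-63 with additive noise of amplitude b.
def lorenz_em(b=1.0, dt=0.005, N=100_000, seed=5):
    rng = np.random.default_rng(seed)
    s, r, beta = 10.0, 28.0, 8.0 / 3.0
    w = b * np.sqrt(dt) * rng.normal(size=(N - 1, 3))
    X = np.empty((N, 3))
    X[0] = (1.0, 1.0, 1.0)
    for k in range(N - 1):
        x, y, z = X[k]
        X[k + 1] = (x + s * (y - x) * dt + w[k, 0],
                    y + (x * (r - z) - y) * dt + w[k, 1],
                    z + (x * y - beta * z) * dt + w[k, 2])
    return X[N // 10:]                       # drop a transient

X = lorenz_em()
std_x = X[:, 0].std()                        # O(8) on the attractor
```

Time-series generated in this way are the input of the CSM, MF and ML estimations discussed below.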

Rates of entropy transfer and budget of entropy
The deterministic terms in (71a-c) are composed of polynomials, satisfying the separability assumption for the application of the CSM formulas (37a,b). The deterministic and stochastic RETs are easily computed as: (72a) ]; (72d) Since the normalization is an affine variable change, the RETs are the same as those of the non-normalized variables, according to Theorem 6 (43b).
Concerning the entropy budget, the terms of (20) can be easily estimated using the CSM formulas of Secs.
where the first term is the self-deterministic generation (SEG), the second term is the sum of single RETs (42a), the third one is the synergetic RET (42b), and the fourth and fifth terms are the additive-noise and multiplicative-noise terms, respectively, which have the following expressions: where ,1 is the Kronecker symbol. For the Lorenz system, we still have: We must note that, in the Lorenz model, there are two quadratic terms responsible for the nonlinearity: 1 3 in 2 (71b) and 1 2 in 3 (71c), yielding synergetic RETs in ( 1,3 → 2 ) and ( 1,2 → 3 ) : These terms are relevant for the entropy budget, as shown below, and are also important in the turbulent chaotic regime simulated by the Lorenz model. The conditional covariance terms ( 1 , 3 | 2 ) and ( 1 , 2 | 3 ) are essentially the vertical and horizontal heat fluxes, and the derivatives in (76a), (76b) are their parametrizations with respect to temperature gradients. Therefore, the synergetic terms (76a), (76b) are simply associated with the horizontal and vertical turbulent heat-flux diffusivities as functions of temperature gradients. This example suggests that the formulas provided by the CSM can be good tools to identify the physical factors influencing the entropy budget and the inferred causality.
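The estimation of such conditional-covariance terms and of their derivatives can be sketched on synthetic data for which cov(x1, x3 | x2) is known exactly (the generating model below is an assumption made purely for illustration):

```python
import numpy as np

# Estimate d/dx2 of cov(x1, x3 | x2): bin the conditioning variable, compute
# the covariance within each bin, fit a differentiable polynomial to the
# binned values and differentiate it.  Synthetic test case:
#   x1 = z1,  x3 = x2*z1 + z2  ->  cov(x1, x3 | x2) = x2, derivative = 1.
rng = np.random.default_rng(6)
N = 200_000
x2 = rng.uniform(0.0, 2.0, N)
z1, z2 = rng.normal(size=N), rng.normal(size=N)
x1, x3 = z1, x2 * z1 + z2

edges = np.linspace(0.0, 2.0, 21)            # 20 bins in the conditioning variable
centers = 0.5 * (edges[:-1] + edges[1:])
idx = np.digitize(x2, edges)
cond_cov = np.array([np.cov(x1[idx == i], x3[idx == i])[0, 1]
                     for i in range(1, 21)])

coef = np.polyfit(centers, cond_cov, 3)      # differentiable representation
slope_at_1 = np.polyval(np.polyder(coef), 1.0)   # should be close to 1
```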

Setup of experiments
The RET values are calculated by the AN, CSM, MF and ML approaches for three situations: noise-free ( = = 0); additive noise ( = 1. , = 0); and multiplicative noise ( = 0, = 1. ). The variance of the different variables is 1 in the noise-free case and within the interval [0.9,1.1] in the noisy scenarios, which are particularly extreme because the signal-to-noise ratio is small and the noise can considerably affect the S-RETs.
The numerical 3D integration used in AN is made over the cube [− , ] 3 , = 4., regularly discretized into cubic voxels of side ∆= /100. Two integration periods are tested, after a relaxation of 100 time units, with the two lengths given below. The ergodic pdf of the model is estimated by a kernel-based approach, with the probabilities ( ) at every state-space voxel centroid ∈ [− , ] 3 estimated as: The instantaneous state ( ) has an effect only at points within a Cartesian distance √3∆, which is imposed through the indicator function (1 or 0 if the condition is true or false, respectively). is the normalizing constant. Accuracy tests were performed by comparing probabilistic and temporal averages of powers (up to degree 4) of the variables. After the ( ) computation, the AN-based RETs are assessed at the voxel centroids.
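A simplified variant of this voxel-based pdf estimation can be sketched as below; it uses a top-hat, one-voxel indicator kernel (i.e. a normalized 3D histogram) rather than the √3∆-neighbourhood kernel described above, and Gaussian samples stand in for the model states:

```python
import numpy as np

# Voxel probabilities from samples, plus an accuracy test in the spirit of
# the text: compare probabilistic and temporal averages of powers of the
# variables.  L and the number of voxels per edge follow the setup above.
rng = np.random.default_rng(7)
samples = rng.normal(size=(500_000, 3))      # stand-in for model states x(t)
L, nbins = 4.0, 100
edges = [np.linspace(-L, L, nbins + 1)] * 3

counts, _ = np.histogramdd(samples, bins=edges)
P = counts / counts.sum()                    # voxel probabilities, sum to one

centers = 0.5 * (edges[0][:-1] + edges[0][1:])
m2_prob = np.sum(P.sum(axis=(1, 2)) * centers**2)   # probabilistic <x1^2>
m2_temp = np.mean(samples[:, 0] ** 2)               # temporal <x1^2>
```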
In the CSM approach, we use polynomial regressions of degree = 10 to obtain accurate representations of the conditional expectations and their derivatives, though lower degrees are sufficient for certain terms. This will be further discussed below in the validation section.
By substituting the minimal regressions into the D-RET formulas, we get: Table 2 presents the overall results of the different deterministic and stochastic single RETs, computed with the two period lengths ( = 400, and = 80 with ∆ = 0.02 and with ∆ = 0.01) for the different scenarios (noise-free, additive noise and multiplicative noise), using the four tested approaches: AN, CSM, MF and ML. The sample averages and sample standard deviations of the RETs were computed from an ensemble of 30 independent realizations of model runs. Polynomials of order 10 have been applied for the nonlinear regressions in CSM and MF. Model fitting uses polynomials of second order. The ensemble-average AN values obtained with the long run are taken as reference values (in bold in Table 2). The sources of estimation errors are discussed in Sec. 4 and summarized in Table 1. 3) The MF estimations have biases comparable to those of CSM, except in ( 2 → 1 ) . That bias in MF can be decreased by taking smaller time-steps (see the comparison between dt = 0.02 and 0.01). The sampling standard deviation of MF ranges within 2-10% in the long runs and 4-25% in the short runs, being generally larger than that of CSM due to model-fitting errors and also roughly proportional to 1/√ . This shows that even with very short time series, and without knowledge of the model equations, it is possible to obtain relatively accurate causality diagnostics.

4) Concerning the ML values, ( 2 → 1 ) agrees with those from AN, due to the absence of nonlinear terms in (71a). All the remaining ML values are close to zero, except ( 1 → 2 ) , because the product − ( 1 3 2 ) 1 3 in (71b) correlates with − 1 , leading to a negative (highly biased) D-RET (see Table 2). The zero ML values agree with AN in the cases ( 1 → 3 ) , ( 2 → 3 ) because of two coincidences: the absence of cross-linear terms in the equation governing 3 (71c) and the symmetry of the pdfs 1,3 ( 1 , 3 ) and 2,3 ( 2 , 3 ) (see Fig. 4).
5) The AN values computed with short runs exhibit high variability (sampling standard deviation), reaching two to three times that of CSM, especially due to the inaccuracy of the estimated pdfs. Therefore, the time-series length and the model dimension are severe limitations to the applicability of AN.
6) The deterministic RETs (D-RETs), which are mostly associated with nonlinear terms in ( 1 → 2 ) , ( 3 → 2 ) , decrease in amplitude in the noisy scenarios because the nonlinear dependencies weaken with respect to the noise-free situation. In particular, in the additive-noise scenario, the model becomes closer to Ornstein-Uhlenbeck models.
The sensitivity of the CSM and MF values to the degree Nd of the polynomial regression used for the conditional expectations is illustrated in Fig. 5, which shows the average and standard deviation (over 30 noise seeds) of two RETs of essentially nonlinear nature, T(x1 → x2) and T(x3 → x2), for the noise-free and additive-noise scenarios, short run (T = 80) and Δt = 0.02. We conclude from the figure that the speed of convergence as Nd grows depends on the considered RET. For the noise-free scenario, T(x1 → x2) stabilizes for Nd ≥ 4 whereas T(x3 → x2) stabilizes for Nd ≥ 10, justifying that choice in the CSM validation. In the noisy scenario, since the pdf is less non-Gaussian than in the noise-free situation and the nonlinearity decreases, the RETs have smaller absolute values and the convergence is faster: both RETs stabilize for Nd ≥ 4. The final bias (difference with respect to the analytical value) is still nonzero, indicating other sources of error, either in the analytical reference or in the CSM and MF algorithms.
The remaining RETs converge rather quickly, at Nd = 2 (not shown). This makes clear that the representativeness of the conditional expectations calls, in some cases, for higher-order polynomial regressions or eventually for more appropriate regression functions, namely with finite L2 norms over the real domain.
The convergence speed of the polynomial regressions of the conditional expectations generally depends on the nonlinearities of the studied model, on the shape of the corresponding pdf and on the set of basis functions (e.g. polynomials, trigonometric functions, orthogonal eigenfunctions, wavelets). Factors weakening the convergence of the regressions are the abruptness of pdf variations along state-space, the existence of topologically complex pdf shapes, and the 'strangeness' and fractal nature of chaotic attractors, especially if the Hausdorff dimension is small. In the 3D noise-free Lorenz model, the Hausdorff dimension is ~2.06 [43] and the attractor projects quite well on each of the three variables, thus leading to smooth univariate and bivariate pdfs (Fig. 4). Gaussian white-noise forcing tends to make the pdfs more Gaussian and smoother, leading to faster regression convergence, as shown in Fig. 5.
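As a minimal sketch of the regression step underlying the CSM, the snippet below fits a differentiable polynomial approximation of a conditional expectation from sample data and evaluates its derivative (the 'sensitivity'). The cubic toy relation, sample size and degree are hypothetical illustrative choices, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ensemble: y = x**3 + noise, so E[y | x] = x**3 (a hypothetical
# stand-in for the conditional expectation of a forcing term).
x = rng.normal(size=5000)
y = x**3 + 0.3 * rng.normal(size=5000)

def fit_conditional_expectation(x, y, deg):
    """Least-squares polynomial fit of E[y | x] and its derivative."""
    coeffs = np.polynomial.polynomial.polyfit(x, y, deg)
    p = np.polynomial.Polynomial(coeffs)
    return p, p.deriv()

p, dp = fit_conditional_expectation(x, y, deg=5)
# Sensitivity at x = 1: d/dx(x**3) = 3, so dp(1.0) should be close to 3.
print(round(float(dp(1.0)), 1))
```

Increasing `deg` plays the role of Nd above: a higher degree is needed when the underlying conditional expectation is strongly nonlinear, at the cost of larger sampling variance.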

Fig. 5. RETs as a function of the polynomial degree Nd, for each noise type. Solid and dotted lines refer to the noise-free and additive-noise scenarios, respectively. The analytical value is marked at Nd = 16.

Discussion of the entropy budget
The computation of the entropy-budget terms and of the specific contributions is important to identify the leading dynamical mechanisms and provides key information about predictability, causality and synergies. Table 3 presents the different terms (rows) of the budget (73) for the three scenarios (columns) and the three variables (sub-columns), using the CSM values based on a long run of 400 time units. As expected, the sum of the terms (last row) is very close to zero, being of the order of the sampling variability (in Table 2). The entropy budget points out the very important role of the synergetic RETs coming from the nonlinearities of the Lorenz system. In fact, while for x1 the deterministic terms are linear, leading to vanishing synergetic RETs, the synergetic RETs for x2 (joint effect of x1, x3) and x3 (joint effect of x1, x2) are different from zero. In the case of x3, the synergetic RET accounts for essentially the total value of the RET (Table 3). This shows that, despite the presence of x1 and x2 in the dynamics of x3, it is rather their joint effect, through the product x1x2 in (71c), that contributes to the entropy change. This must partially result from the high interrelation between x1 and x2, coming from their sharp joint pdf (Fig. 4a) and high linear correlation (~0.88 in the noise-free situation).
The computation of single RETs (one-to-one entropy transfers or, more appropriately, the entropy generation of one variable under the effect of another), as well as of their sum, both in the Lorenz system and probably in most nonlinear and complex network systems (like the atmospheric-oceanic system), thus appears to give an incomplete picture of the causality, due to the effect of synergies. Therefore, the set of one-to-one causal links, not only of the LK type but also issued from other techniques (e.g., conditional mutual information (CMI) [44], Peter and Clark Momentary Conditional Independence (PCMCI) [45], Granger causality [2,6]), may not reproduce the 'full story' of the system causality.
Moreover, the importance of the synergetic terms reveals the usefulness (possibly holding for more general models) of merging highly interrelated variables, concentrating them onto low-order spaces or mixed variables (e.g. through principal component analysis, independent component and subspace analysis [46,47], nonlinear principal component analysis [42] and projection pursuit [48]), followed by the computation of RETs between those mixed variables and the remaining variables.
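A minimal numpy sketch of one such merging step (linear PCA via SVD) is given below; the synthetic data, with two strongly correlated columns standing in for interrelated system variables (correlation comparable to the ~0.88 of x1, x2 above), are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic ensemble: two highly correlated variables sharing a common
# signal z, plus one independent variable.
z = rng.normal(size=(4000, 1))
data = np.hstack([z + 0.2 * rng.normal(size=(4000, 1)),
                  z + 0.2 * rng.normal(size=(4000, 1)),
                  rng.normal(size=(4000, 1))])

# Principal components from the SVD of the centred data matrix.
anom = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(anom, full_matrices=False)
explained = s**2 / np.sum(s**2)   # variance fraction per component
pcs = anom @ vt.T                 # merged (mixed) variables

# The leading PC absorbs the shared signal of the correlated pair.
print(round(float(explained[0]), 2))
```

RETs would then be computed between the leading mixed variables (`pcs` columns) and the remaining ones, rather than between the raw redundant coordinates.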
In order to illustrate that rationale in the noise-free scenario, we consider a pair of variables for which the neglected terms are about 10-20 times smaller than the retained ones. The two-dimensional system (80a, 80b) appears as approximately linear and quasi-Hamiltonian, due to the near vanishing of the sum of the forcing divergences, −7.21 + 7.16 ≈ 0 (i.e. the preservation of area elements by the dynamical-system flow), and a nearly conserved positive-definite Hamiltonian, close to that of the Lorenz-system oscillations. The time evolution of (x3, x4) is shown in Fig. 6, in which the oscillations are evident. The system also satisfies closed boundary conditions, since the state-space velocity is tangent to the axis x4 = 0, thus meeting the validity conditions of the presented theory (Secs. 2 and 3). Concerning the LK causality, the system (80a, 80b) is nearly linear and thus the RET formula (34), valid for linear systems, can be applied. This has been verified independently by empirical estimations of T(x1² → x3) and T(x3 → x1²), respectively 6.99 and −7.03, as well as in [49]. Therefore, the entropy appears as being self-generated by x1² (i.e. the square of the convection intensity), then transferred to x3 (the intensity of the vertical temperature gradient), and self-dissipated afterwards by x3. The variable x2² (the square of the intensity of the horizontal temperature gradient) is nonlinearly driven by (x3, x4), whose approximate evolution equation is obtained from (71b), multiplied by 2x2 and after using the regression for x1, with a positive RET T(x3, x4 → x2²) compensating the dissipation. In a similar way, an approximately Hamiltonian system is obtained by considering the pair (x1², x3).
The Liang-Kleeman entropy diagnostics can thus suggest groups of variables which, when mixed together, can provide a simplified picture of the entropy transfers.
For stochastic systems, the presence of noise induces a decrease of the amplitude of the deterministic RETs (single and synergetic), since the noise contributes positively to the entropy balance. In the case of additive noise, as its amplitude increases, every single variable becomes more nearly governed by an Ornstein-Uhlenbeck process, producing RETs of smaller amplitude as the noise variance increases. An important step forward is a systematic study of the RETs in stochastic systems and of how they evolve across the space of bifurcation parameters, noise intensity and noise characteristics, which is out of the scope of the present study. Another possible step is the merging of signal-processing techniques and blind-source-separation methods with the Liang-Kleeman entropy diagnostics, using the nonlinear approach devised in this study.
Table 3. Terms of the budget of entropy for the 3 variables of the stochastic Lorenz system and the three noise scenarios.
The specific RETs allow determining the regions of state-space that contribute most to the RET. Focusing on the noise-free situation, the deterministic contributions (12b, 31b) for x1 and x3 are nearly constant along the range of the consequential variable, being retained by the single and synergetic terms, respectively for x1 and x3. Concerning x2, there are strong variations and sign changes of the specific deterministic RETs ℱ2,1(x2), ℱ2,3(x2) and ℱ2,∼2(x2) along the range of x2 (see Fig. 7), which is a clear mark of nonlinearity, since in linear systems the specific RETs are constant. In fact, while the specific value of T(x1 → x2), i.e. ℱ2,1(x2), is mostly positive and nearly constant (~3.7, black curve in Fig. 7), the specific T(x3 → x2), i.e. ℱ2,3(x2) (red curve in Fig. 7), is positive near x2 = 0 but reaches large negative values at the lobes of the attractor, while the synergetic effect of (x1, x3), through ℱ2,∼2(x2) (blue curve in Fig. 7), is mostly negative, displaying an opposite behavior. This means that x1 and x3 separately contribute to increase the entropy of x2, whereas their joint combination contributes to decrease it.
Most of the RET activity takes place near the point of maximum joint pdf at (x1 = 0, x2 = 0, x3 ≈ −0.3), where most of the sign changes of the convective cells occur, as the system orbit approaches the unstable fixed point, near which the ergodic pdf is nearly null. The specific RETs can also be evaluated on forecast ensembles, particularly in short-to-long-range weather forecasting, where the entropy is interpreted as the entropy of the forecast error, coming mostly from the ensemble spread.

7. Discussion and Conclusions
A general theory is derived for computing Shannon entropy transfers in nonlinear (deterministic and/or stochastic) systems, holding under quite generic state-space boundary conditions, namely: 1) an unbounded pdf support set; 2) closed boundaries, i.e. null deterministic and diffusive fluxes at the borders of the pdf support set; 3) periodic conditions in state-space. The theory extends the Liang-Kleeman (LK) framework [16] of causality inference by decomposing the evolution of the Shannon entropy of a single state-vector component (taken as the consequential variable) into a self-entropy generation (SEG) and a global rate of entropy transfer (RET) issued from the complementary space of the consequential variable, with contributions assigned either to the deterministic or to the stochastic forcings. In the case of open conditions, some extra terms must be added to the entropy-balance equation, related to externally driven fluxes of entropy into the state-space, imposed by non-null velocities or diffusivities at the borders of the pdf support set.
The RET arises from cross-variable dependencies, either in the deterministic components or, in the multiplicative-noise case, in the probabilistic diffusivities. The global RET is further decomposed into the sum of RETs from the single variables spanning the complementary space plus a synergetic RET accounting for synergetic causal effects due to groups (e.g. triplets, quartets) of variables [28]. The synergetic term vanishes in linear systems, being thus of purely nonlinear origin.
Synergies due to nonlinearities (e.g. products of variables in the forcings), as well as redundancies between variables, suggest the extended computation of RETs between a generic causal subspace and a generic consequential subspace (or, more generally, manifolds), for a given contextual or outer subspace. As a climatic example of the causal, consequential and contextual subspaces, we can choose subspaces spanning the low-frequency variability of separate oceanic basins and the overlying atmosphere, where indirect oceanic links may take place via an 'atmospheric dynamical bridge' [50]. Consistent expressions of the space-to-space RETs are being devised (authors' future manuscript), following the same CSM technique presented here.
The causality diagnosed between two particular variables (inferred by RETs) can depend strongly on the set of contextual or remaining variables, spanning the so-called outer space, which acts as a room for 'hidden' indirect interactions between that pair of variables. This effect can be amplified when the allowed interactions jump from linear to nonlinear and the dimension of the outer space increases, thus calling for a proper analysis of the extension of the 'causality outer space'.
An effective method of estimating RETs, the 'Causal Sensitivity Method' (CSM), is then introduced as an alternative to the 'brute-force' and more expensive estimation approaches calling for the explicit knowledge of the pdf of the system, which may have severe pitfalls. In fact, the pdf estimation can be unreliable even for moderate system dimensions if not enough data are available, being biased by the 'curse of dimensionality' (the fact that multivariate outliers become more probable than near-average values).
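The degradation of pdf estimation with dimension can be illustrated with a toy experiment, assuming a standard Gaussian target and a fixed-bandwidth kernel density estimate at the origin (all parameters below are illustrative, not tied to the paper's estimators):

```python
import numpy as np

rng = np.random.default_rng(2)

def kde_rel_error_at_origin(d, n=2000, h=0.5):
    """Relative error of a fixed-bandwidth Gaussian-kernel density estimate
    of the N(0, I_d) density at the origin, from n samples."""
    x = rng.normal(size=(n, d))
    kde = np.mean(np.exp(-0.5 * np.sum((x / h) ** 2, axis=1))) \
        / ((2 * np.pi) ** (d / 2) * h ** d)
    true = (2 * np.pi) ** (-d / 2)   # exact Gaussian density at the origin
    return abs(kde - true) / true

# At fixed sample size and bandwidth, the relative error grows with d.
errs = [kde_rel_error_at_origin(d) for d in (1, 3, 6)]
print([round(e, 2) for e in errs])
```

The growing error with `d` at fixed `n` is the practical face of the curse of dimensionality that the CSM avoids by regressing low-dimensional conditional expectations instead of estimating the full multivariate pdf.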
The CSM relies on RET formulas depending explicitly on derivatives (sensitivities) of conditional expectations of the deterministic terms and noise diffusivities, conditioned on the consequential variable. The CSM is fully appropriate to cases where an ensemble of realizations is available, from which the conditional expectations are estimated using nonlinear differentiable regressions based upon an extended set of basis functions (e.g. monomials, orthogonal functions, wavelets), which can be fixed or optimized through signal-processing and machine-learning techniques, such as relevance vector machines and symbolic regression. The ensembles can be time-evolving or obtained by sampling a long time series covering the attractor of the system. In particular, in ensemble forecasting (e.g. medium-to-long-range ensemble weather forecasting), the entropy of the forecast errors grows with the ensemble spread, and the RETs become important to diagnose the drivers of predictability or unpredictability of chosen variables.
Formulas for the CSM-based specific contributions to RETs, along the state-space are available, being uniform (varying) on linear (nonlinear) systems, and thus providing information where the RETs are intense or not. The CSMbased RET formulas could also help to propose certain parametrizations of the effects of the outer variables on the causal-consequential pair, as for instance in parameterizing sub-grid scale processes in turbulence or in meteorology.
To exemplify the usefulness of the CSM formulas, four RET estimation approaches are compared: the 'brute-force', computationally expensive analytical-numerical (AN) method, relying upon numerical integrals over state-space that depend on the system pdf, taken as the reference in our study; the CSM, when the dynamical equations are known; the MF method, based on model fitting followed by the CSM; and the multivariate linear (ML) approach of Liang [18]. The methods are validated in two three-dimensional models: a model derived from a potential function and the classical chaotic Lorenz model [26], both subject to a variety of noises and nonlinearities. The CSM values appear to be robust and close to the reference AN values, being thus appropriate for application to real data. The ML approach only works under weak nonlinearity, being potentially strongly biased when strong nonlinearities and synergies across variables are present. This justifies the need for a nonlinear estimator for nonlinear problems. Diagnostics of RETs can suggest ways of combining system variables that lead to more simplified pictures of the entropy fluxes.
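For reference, the ML estimator admits a compact implementation. The snippet below follows the well-known covariance-based formula for the linear information flow T(x2 → x1) from Liang's linear framework; the two-variable test system and its parameters are hypothetical illustrations, not the models of this paper.

```python
import numpy as np

def liang_T21(x1, x2, dt):
    """Liang's linear (maximum-likelihood) information flow T(x2 -> x1),
    reliable only under near-linear dynamics, as stressed in the text."""
    dx1 = (x1[1:] - x1[:-1]) / dt            # Euler forward-difference derivative
    x1s, x2s = x1[:-1], x2[:-1]
    C = np.cov(x1s, x2s)
    c11, c12, c22 = C[0, 0], C[0, 1], C[1, 1]
    c1d = np.cov(x1s, dx1)[0, 1]
    c2d = np.cov(x2s, dx1)[0, 1]
    return (c11 * c12 * c2d - c12**2 * c1d) / (c11**2 * c22 - c11 * c12**2)

# Hypothetical linear test system: an OU process x2 drives x1, no feedback.
rng = np.random.default_rng(4)
dt, n = 0.01, 50_000
x1 = np.zeros(n)
x2 = np.zeros(n)
for k in range(n - 1):
    x2[k + 1] = x2[k] - x2[k] * dt + 0.5 * np.sqrt(dt) * rng.normal()
    x1[k + 1] = x1[k] + (-x1[k] + x2[k]) * dt + 0.5 * np.sqrt(dt) * rng.normal()

T21 = liang_T21(x1, x2, dt)   # should be clearly positive (x2 drives x1)
T12 = liang_T21(x2, x1, dt)   # should be near zero (no feedback)
print(round(T21, 2), round(T12, 2))
```

On this linear system the estimator behaves well; the biases reported for the ML values in Table 2 arise precisely when such a linear fit is forced onto nonlinear dynamics.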
The application of the CSM, with or without model fitting, is amenable to generalization to differentiable maps [12] and to the estimation of RETs between subspaces of variability, thus opening a straightforward and computationally cheap methodology to infer information flows across variables or groups of variables of a system, for instance spanning subspaces of variability of the climate system. Furthermore, this study has raised the importance of a systematic study of RETs in chaotic and stochastic systems as a function of the bifurcation and noise parameters.
The application of the CSM, as well as of other causality diagnostics, to real systems of moderate (10-20) and high dimension represents a great endeavor, both with known and unknown evolution equations, with the computational cost of many problems growing exponentially with the dimension (the 'curse of dimensionality'). Furthermore, when the available time series are short for the selected spatial and temporal scales, robust model fittings that avoid overfitting become necessary. In high-dimensional systems governed by complex networks (like the climate system), high levels of linear or nonlinear redundancy (measured, for instance, by mutual information) are possible. Therefore, it is advisable to first compute low-dimensional spaces (by linear and/or nonlinear principal component analysis and independent component and/or subspace analysis), and then diagnose the causality between the spanning variables or between subspaces, following the same CSM strategy. Those variables can be substituted by other relevant common indices (e.g. climatic indices, like the El Niño index), acting as drivers (causal variables) of the raw system variables, helping for instance to locate which regions are more or less influenced.
The implementation of the CSM in real case studies without available model equations has important prerequisites related to efficient and robust regressions, both in the fitting of the model equations and in the expression of the CSM conditional expectations. Robust regressions should preferentially be accompanied by statistical significance tests. The fitting may be performed by standard linear or nonlinear regression techniques, or by more sophisticated machine-learning and genetic-programming methods (e.g. relevance vector machines), keeping in mind that the basis functions must be multiplicatively separable in terms of the causal, consequential and outer variables. The basis functions should preferentially be non-recurrent (issued from sets of orthogonal functions) and range from large to small scales.
Further applications to real-world classes of problems, like the interaction between different components of the climate system [50] in the presence of feedbacks, are certainly a way forward that is being taken up and will be reported in the near future. Moreover, the CSM technique can be applied to ensemble forecasting (e.g. short-to-long-range weather forecasting), where the Shannon entropy mostly comes from the ensemble spread and forecast-error variance. There, the identification of the causes (drivers) of the ensemble forecast spread through the RETs can be useful to determine the predictability conditions of selected variables, thus opening a 'world' of applications.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Declaration of Generative AI and AI assisted technologies in the writing process
Generative AI/AI assisted technologies were not used in the preparation of this work.

Data availability
Data will be made available on request.
In the third equality we apply integration by parts twice, and once each in the fifth and sixth equalities. According to formula (2) of Liang [16], the rate of entropy transfer to xi due to the influence of xj is: Referring to [33], one can find the minimum mean-square difference between ẋ and its linear fit by applying ordinary multivariate linear regression, yielding the fitted matrix of parameters through the matrix product Â = (XᵀX)⁻¹XᵀẊ (the standard least-squares solution).

B.2. The Predictor-Corrector scheme
The stochastic differential equations are integrated in time using the predictor-corrector scheme described by the steps below:
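A common Heun-type predictor-corrector for SDEs with additive noise can be sketched as follows; this is a standard variant given for orientation and is not necessarily the exact scheme used in the paper. The noise amplitude, time step and initial condition are illustrative.

```python
import numpy as np

def heun_sde_step(x, f, g, dt, rng):
    """One predictor-corrector (stochastic Heun) step for dX = f(X) dt + g dW,
    with additive noise amplitude g: Euler-Maruyama predictor, then a
    trapezoidal correction of the drift reusing the same noise increment."""
    dw = np.sqrt(dt) * rng.normal(size=x.shape)
    x_pred = x + f(x) * dt + g * dw                      # predictor
    return x + 0.5 * (f(x) + f(x_pred)) * dt + g * dw    # corrector

# Example drift: the Lorenz system (sigma = 10, rho = 28, beta = 8/3).
def lorenz(x, s=10.0, r=28.0, b=8.0 / 3.0):
    return np.array([s * (x[1] - x[0]),
                     x[0] * (r - x[2]) - x[1],
                     x[0] * x[1] - b * x[2]])

rng = np.random.default_rng(3)
x = np.array([1.0, 1.0, 1.0])
for _ in range(8000):                 # 80 time units at dt = 0.01
    x = heun_sde_step(x, lorenz, 0.5, 0.01, rng)
print(np.all(np.isfinite(x)))
```

Reusing the same Wiener increment `dw` in predictor and corrector is what makes this a consistent stochastic Heun step for additive noise, improving the deterministic order of the drift integration without altering the noise statistics.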