Causality Analysis with Information Geometry: A Comparison

The quantification of causality is vital for understanding various important phenomena in nature and laboratories, such as brain networks, environmental dynamics, and pathologies. The two most widely used methods for measuring causality are Granger Causality (GC) and Transfer Entropy (TE), which rely on measuring the improvement in the prediction of one process based on the knowledge of another process at an earlier time. However, they have their own limitations, e.g., in applications to nonlinear, non-stationary data, or non-parametric models. In this study, we propose an alternative approach to quantify causality through information geometry that overcomes such limitations. Specifically, based on the information rate that measures the rate of change of the time-dependent distribution, we develop a model-free approach called information rate causality that captures the occurrence of the causality based on the change in the distribution of one process caused by another. This measurement is suitable for analyzing numerically generated non-stationary, nonlinear data. The latter are generated by simulating different types of discrete autoregressive models which contain linear and nonlinear interactions in unidirectional and bidirectional time-series signals. Our results show that information rate causalitycan capture the coupling of both linear and nonlinear data better than GC and TE in the several examples explored in the paper.


Introduction
The study of causality aims to explore cause-and-effect relationships between processes. In general, causality can be investigated in two categories: causal discovery and causal analysis. Causal discovery shows the inherent causal relationship within the dataset by analyzing and creating causal models based on graph theory. The construction of these models can be achieved via algorithmic dynamics and evaluating the algorithms against the observed data [1][2][3]. Generally, the complexity of the models' networks is characterized and measured using graph-theoretic measures. A recent approach using classical information theory, such as Shannon entropy, has been considered as an alternative measurement, but its high dependence on the distribution can lead to spurious disparate values for the same complexity, as shown in [4].
Alternatively, causal analysis, which is the focus of this paper, studies the potential changes in a system caused by another. In practical terms, causal discovery requires more detailed background information about the provided dataset to accurately conduct the study. However, the causal analysis does not require detailed underlying knowledge of the connectivity embedded within the dataset [3,5]. Among the various analysis methods used to quantify causality, Granger causality (GC) and Transfer Entropy (TE) are extensively used in various fields, including economics, climate, education, meteorology, neuroscience, and more [5][6][7][8][9][10][11][12][13][14][15][16]. For instance, Rebecca M. Pullon et al. utilized GC to demonstrate the decrease in cortical information flow throughout the brain as the subject loses responsiveness [10]; Yunyuan Gao et al. used TE to investigate the coupling strength of the brain represents the net TE of signals 1 and 2. The net TE enables us to identify the appropriate lag(s) that yield significant results in TE. Using the knowledge of the appropriate lag(s) from the net TE, the TE is calculated through the window sliding approach, with each small window of the signals evaluating TE at the designated lag. Similarly, information rate causality will be evaluated through the window sliding approach with the information rate calculated within each window. In the scenario where the signals are coupled, it is expected that there will be spikes in the information rate causality, indicating changes in the distribution of one signal caused by another. Note that the methodology and concepts of the analyses used in this paper are mainly from the previous works [17][18][19]28,30,32].
The remainder of this paper is organized as follows. In Section 2, we briefly review the concepts for GC, TE, and information rate causality along with their implementation. The three different types of autoregressive models-unidirectional, interchange unidirectional, and bidirectional models-are introduced and analyzed in Sections 3 to 5, respectively. Section 6 contains our conclusions. Appendices A-C contain some background theory to make the paper self-contained. As a reference, the basic model of continuous coupling is studied and analyzed in Appendix D.

Granger Causality
The general idea behind GC is to evaluate the dependency of one process on another process [17,18]. This dependency is calculated by comparing the errors/noises of the modelled signals through the autoregressive model. Consider two stochastic processes, namely x 1 (t) and x 2 (t), which can be modelled using the information from their respective signals (as shown in Equations (1) and (2)) or including some information from each other (as shown in Equations (3) and (4)).
Here, {a i , c i , a * i , b * i , c * i , d * i } ∈ R are the constant parameters that represent the fractions of contribution from the past observations towards x 1 (t) and x 2 (t). i∈ [1,2] is the uncorrelated external additive noise needed for modelling the processes with variance Σ ii,i∈ [1,2] . * i∈ [1,2] is the correlated noise with the variance Σ * ij,(i,j)∈ [1,2] , which can be represented by The total interdependence or total causality index (F x 1 ·x 2 ) between x 1 (t) and x 2 (t) is composed of two directional causalities (F x 1 →x 2 and F x 2 →x 1 ) and one instantaneous causality (F x 1 ↔x 2 ). These are defined in Equations (5) to (8).
In conjunction, the spectral of the GC can be evaluated in the frequency domain (ω = 2π f , f ≡ frequency) using the spectral matrix S(ω), transfer matrix H(ω), and the covariance matrix Σ. They are related according to Equation (9). Here, · is defined as an ensemble average and † is denoted as the complex conjugate transpose of a matrix.
Similar to the analysis in the time domain, the total independence in the frequency domain between x 1 (ω) and x 2 (ω) (I x 1 ·x 2 ) also consists of two directional causalities (I x 1 →x 2 and I x 2 →x 1 ) and one instantaneous causality (I x 1 ↔x 2 ). They are expressed as Equations (10) to (13).
Here,Ĥ 22 is denoted as the complex conjugate of the element.
In this paper, the frequency domain and the time-varying frequency domain of GC will be calculated through the non-parametric method. The general idea of this method is to decompose the spectral matrix S(ω) into the transfer matrix H(ω) and the covariance of the noises Σ. This decomposition can be achieved using Wilson's algorithm [18,24,25,35,36]. The spectral matrices for the frequency domain and the time-varying frequency domain are expressed as Equations (14) and (15), respectively.
Here, T denotes the total period of the signal. x(ω) n|n∈ [i,j] is the discrete Fourier transform of the signal and s n|n∈[i,j] (t, ω) is the short-time Fourier transform of the signal x(t) n|n∈ [i,j] . Note that [·] denotes the complex conjugate. The Hann function window is used when evaluating the short-time Fourier transform. The window will move through the whole series with half of the data points being overlapped. The general flow of the procedure is shown in Figure 1a.
n , x n+1 , and y (1) n . Refer to Section 2.2 for the definition of TE. Figure (c) demonstrates the evaluation of Information rate causality. Each window (labelled as 1st window) is divided into two windows with the first half labelled as 2nd window. The distribution p(x(t), y(t i )) estimates the evolution of distribution x(t) while y(t i ) is fixed at the 2nd window. Refer to Section 2.3 for the definition of information rate causality.  n , x n+1 , and y (1) n . Refer to Section 2.2 for the definition of TE. (c) demonstrates the evaluation of information rate causality. Each window (labelled as 1st window) is divided into two windows with the first half labelled as 2nd window. The distribution p(x(t), y(t i )) estimates the evolution of distribution x(t) while y(t i ) is fixed at the 2nd window. Refer to Section 2.3 for the definition of information rate causality.

Transfer Entropy
TE is a model-free approach used to calculate causality by evaluating the dependency between processes. The general expression of TE is given by Equation (17), which measures the additional information required for the realization of one state of the process (x 1 (t)) depending on the past states of the processes (e.g., x 1 (t − 1) and x 2 (t − 1)) [19,28,37].
Here, x n+1 represents the state of x 1 (t) at the (n + 1)-th moment, x (k) n represents the state at n-th moment of x 1 (t) consisting of k previous states of x 1 (t), y (l) n represents the state at n-th moment of x 2 (t) consisting of l previous states. Note that the previous states of k and l are arbitrary depending on the interest of the study. For instance, in a collective observed stochastic process X = (x 1 , x 2 , . . . , x 10 ), x Thus, TE quantifies the additional information needed for the realization of x 1 (t) at (n + 1) from x 2 (t) with the assumption that x 1 (t) is independent of x 2 (t). If x 2 (t) has no impact on the future evolution of x 1 (t), and hence p(x n+1 |x For the TE calculation, the probability distribution p(. . .) will be estimated via histogram with the bin size of 5, as this bin size is determined by the cubic root of Rice's rule to accommodate the 3-dimensional of joint probability within TE (p(x n+1 , x Using a larger number of bin sizes will not be able to properly depict the distribution and it will bring a spurious result in calculating the TE. To simplify the analysis, the number of past values of k and l is set to 1 (k = l = 1) in this paper. Two different evaluations will be conducted for TE. First, assuming that the processes are stationary, the net TE (TE x 1 (t),x 2 (t) = TE x 1 (t)→x 2 (t) − TE x 2 (t)→x 1 (t) ) is evaluated on the entire processes at different lags. This will enable us to choose the appropriate past value(s)/lag(s) in evaluating the TE based on the significance of the net TE. Second, using the knowledge of the appropriate lag from the net TE, the TE is next evaluated via a sliding window approach. This sliding window approach is used to calculate the TE at each instance of the interested window. In this sliding window approach, a small number of data points of the stochastic processes is sampled within the window and the TE is calculated. These calculations are sketched in Figure 1b.

Information Rate Causality
In a dimensionless statistical manifold, the distance between two probability distributions is defined by their statistical difference. One commonly used measure of this difference is the KL divergence/relative entropy, but this measure is not symmetrical and not path-dependent. For the time-dependent probability distribution, the changes in the distribution along with time can be measured through the information rate and information length. The information rate (Γ(t)) quantifies the rate of change of the distribution, while information length (L) measures the total change of the distribution. These measures are defined and expressed as Equations (18) and (19) [30][31][32].
Here p(x, t) is the probability distribution of a stochastic process x at the time t. For studying causality, we consider two stochastic processes (x 1 (t) and x 2 (t)) and the information rate for a joint probability distribution is evaluated. The causality between these processes can be quantified by the changes in the information rate (statistical changes or the changes in distribution) while having another process remain static. For instance, the information rate of x 2 (t) causing x 1 (t) is defined as Equation (21) and called information rate causality in this paper.
Referring to Figure 1c, here t i is denoted as the reference time that remains static in the process. Therefore, p(x 1 (t), x 2 (t i )) is a probability distribution that is sampled by having the process x 2 (t) remain static at time t i while the process x 1 (t) evolves along the time t. Since information rate can only quantify the changes in the distribution, therefore the information rate causality is evaluated within each window of interest instead of the whole process of the time series to determine whether the coupling of the processes is still persists. To accommodate the calculation, a discretized version of Equation (21) is used and it is expressed as Equation (22).
For calculating the information rate, the joint probability distribution for two processes is estimated via the histogram with the bin size (b) determined by the square root of Rice's rule, as expressed in Equation (23). Rice's rule is used because it is able to appropriately determines the bin size to sample the obtained data [38,39] for 1-dimensional distribution. Hence, the square root of Rice's rule is to accommodate the 2-dimensional distribution in this case. b = 2 3 √ n , n = number of sampled data. (23) In this paper, we evaluate the information rate causality by employing Equation (22) to examine the impact of one process, which remains static in time, on the distribution of another process within a specific time window. To estimate the joint probability distribution utilized in Equation (22), a histogram is employed with a bin size determined by Equation (23). Figure 1c illustrates the overall procedure of this evaluation.

Summary of the Procedure: Data And Analysis
We further develop information rate causality (refer to Section 2.3) for the analysis of our numerically generated data and compare it with the non-parametric GC (refer to Section 2.1) and window sliding TE (refer to Section 2.2). Our numerical data are generated by simulating different discrete autoregressive models covering both linear and nonlinear models. The simulation is conducted for 10,000 trials with each trial running for 25 s (physical time) at 200 Hz (physical sampling frequency). Different coupling/causal relationships between the signals are considered in this paper, such as unidirectional (refer to Section 3), interchanging unidirectional (refer to Section 4), and bidirectional (refer to Section 5). Note that the noncoupling cases are also considered in this paper to check the robustness of the causality analyses; specifically, we refer to the uncoupling of x 1 (t) towards x 2 (t) after the physical time 10 s. Using simulated signals, the causality analyses (GC, TE, information rate causaltiy) are conducted according to the sketch shown in Figure 1.
To be consistent with the data points used for each window, each sampling window will contain 0.5 s of data points with an overlap of 0.25 s (0.5 × 0.5 s), which is equivalent to 100 data points for one simulation. Since the models are simulated for 10,000 trials, each window will consist of 1 × 10 6 sample points (100 points × 10, 000 trials). Note that this window is not applied to the frequency domain of GC and the net TE as both are calculated based on the whole series.

Summary of the Different Models and Key Results
Prior to the discussion of different models in Sections 3 to 5 with the implementation of different analysis methods, we provide the key results/findings in Table 1. Table 1. Summary of the performance of GC, TE, and information rate causality for different models discussed in Sections 3 to 5. In general, GC performs well for linear and stationary signals; TE performs well only if the lags are chosen correctly; information rate causality performs well and can recover the underlying lags of the signals.

Unidirectional Model
In this part, unidirectional causality, where the first process (x 1 (t)) influencing the second process (x 2 (t)), is considered for both linear and nonlinear autoregressive models given by Equations (24) and (25), respectively. The causality between the processes occurs as the processes are coupled with each other through the Heaviside step function (H(. . .)). Linear model: Nonlinear model: Here, t is the time steps and τ is the physical time. Both t and τ are related as τ = [physical frequency] × t. H(τ − 10) is a Heaviside step function that allows the coupling to occur starting at the physical time of 10 s. n|n∈ [1,2] is the Gaussian noise that perturbates the systems. In this study, the noises are set to have zero mean which will be expressed in covariance matrix ( ). Due to the exponential term in the nonlinear model, the process will have non-stationary oscillation when the power of the exponential term becomes positive (e R → diverge). Hence, two different values of noise covariance will be considered to study the stationary and non-stationary oscillation of the processes. They are labelled as large noise and small noise with the respective matrices expressed as that the cross-covariance of the noises is set to zero (Σ 12 = Σ 21 = 0.0) to ensure the coupling is purely due to the intrinsic interaction in the model but not through the mutual noises. The linear model will have stationary oscillation for either large or small noises, and hence only one noise will be used to simulate the linear model and large noise is chosen. The general structure of the models is shown in Figure 2.
To explore all possible combinations of cases, we investigate the coupling and noncoupling cases for both linear and nonlinear models. To this end, we investigate the following six difference cases:    (24) and (25) is shown in Figure 3. Notice that the signals for the nonlinear model exhibit non-stationary oscillation when the perturbating noise is small, as shown in Figure 3e,f. This is due to the divergence of the positive power of the exponential term of Equation (25) and the influences of the previous state towards the current state (x 1 (t) ← 0.55x 1 (t − 1)). The rest of the simulated signals have stationary oscillation, as shown in the phase space where the observed states are localized around (x 1 , x 2 ) = (0, 0).

Granger Causality
As shown in Figure 4, the coupling of x 1 to x 2 can be captured well through nonparametric GC analysis for stationary linear signals, as shown in Figure 4a. Concurrently, the detection of the causality works well for stationary nonlinear signals, as shown in Figure 4c. This is because the nonlinear exponential term is well-approximated to a constant value as the spectral matrix decomposed to the transfer matrix and covariance matrix of noise via Wilson's algorithm. Note that the amplitude of the GC decreases for this nonlinear stationary signal. This is due to the contribution of the exponential term (e −R ) that eventually affects the S 22 (ω) element of the spectral matrix (refer to Equation (11)) and hence the amplitude of the causality (I x 1 →x 2 ). For non-stationary signals, the frequency domain of GC is still able to show the causality within the signals, but the time-varying frequency of GC cannot represent well the period of causation, as shown in Figure 4e. From the figure, we can see that the nonparametric GC gives a false result, as it suggests that the coupling between the processes occurs throughout the time. Similarly, the nonparametric GC analysis works fine in evaluating the non-causality cases for stationary cases but not so well for the non-stationary case (refer to Figure 4b,d,f. GC analysis fails to work on the signals oscillating non-stationary, as the signals cannot be modelled well via the linear autoregressive equations (refer to Equations (3) and (4)).       (24) and (25). Refer to Figure 2 for the graphical explanation of the process.

Transfer Entropy
The results from TE analysis are shown in Figure 5. By first assuming that the signals are stationary, the net TE (TE X,Y = TE X→Y − TE Y→X ) is first calculated to find the suitable lag for evaluating the sliding window TE later. Figure 5a,c,e show that the net TE between the signals is significant at the lag of 1. This is because the simulated models (Equations (24) and (25)) have the coupling occurring at one lag (compares ). Next, the window sliding TE is conducted through the conditional probability (multidimensional probability) that contains the data from one previous lag (refer to Equation (17)) at each window of the evaluation of TE. As shown in Figure 5a,c,e, the window sliding TE is able to show the time (10 s physical time) when the causality occurs. Similarly, the TE is also able to capture the non-causality situation well for all the cases, as shown in Figure 5b,d,f. Even though TE is able to detect the causality between the signals for either linear stationary or nonlinear non-stationary cases, it requires extra inputs/assumptions (such as bin sizes, number of lags, and the dimension of the multidimensional probability) to work properly/accurately. The failure of TE in capturing the causality for the linear model (Equation (24)) is shown in Figure 6 when the lag is set to 9 instead of 1 when evaluating the TE.   (24) and (25). Refer to Figure 2 for the graphical explanation of the process.  (24)) when lag 9 is used for evaluation, as compared to Figure 5a, where the TE accurately shows that the coupling occurs after 10 s.

Information Rate Causality
Both GC and TE detect causality by examining the increase in the predictability of the stochastic processes (measured by the inverse of entropy) when the past information is included. In comparison, information rate causality evaluates causality by quantifying the rate of change of the probability distribution of one variable conditioned on another, as noted previously. The results are shown in Figures 7 and 8 where the changes of the distribution due to the causality between the processes are well-reflected via the changes in the information rate (Γ). For the cases of noncoupling, the information rate is not changing and remains constant, which would suggest that none of the signals cause the changes to the joint distribution (refer to Figure 7b,d,f). Hence, no causality occurs between the signals. Similarly, prior to the coupling of the signals, the causality is not seen, and hence the information rate again remains constant. Notice that the information rate does not stay at zero when no causality happens, it is due to the noises/spikes of the estimated distribution, as shown in Figure S1.
In the presence of the couplings, the changes in the distribution due to the causality among the signals can be observed via the changes in the information rate. For instance, the changes in the information rate due to the causality of x 1 (t) to x 2 (t) can be observed in Figure 7a,c,e (zoom-in: Figure 8a,c,e). Referring to the models, the signals are coupled with the lag difference of one (x 1 (t) = x 1 (t − 1) and x 2 (t) = x 1 (t − 2)), and hence the peak of the information rate is observed after one lag. From Figure 8a    To demonstrate the capability of information rate causality to detect lag differences within a signal, we simulated and analyzed Equation (26), which has the coupling starting at 10 s with a lag difference of 9 (compares x 1 = x 1 (t − 1) and x 2 = x 1 (t − 10)). Figure 9 shows the section of the information rate from 9.6 s to 10.4 s. The figure reveals that the peak of the information rate appears at the 9th time step (0.045 s) at every window of evaluation, which accurately reflects the lag difference specified in Equation (26). This result suggests that information rate causality can effectively capture the underlying causality of the signals at each evaluation window.

Interchange of Unidirectional Model
In this part, similar models from Section 3 are used with slight modifications to study the interchange of the unidirectional causality. The models are modified to Equations (27) and (28) for linear and nonlinear models, respectively. The modification with the Heaviside step functions (H(. . .)) in the models portrays that the signal x 2 (t) causing x 1 (t) occurs prior to physical time (τ) 10 s, while the signal x 1 (t) causes x 2 (t) after 10 s. The general structure of Equations (27) and (28) is illustrated in Figure 10. Linear model: Nonlinear model: Here t is the time-step and τ is the physical time. The Heaviside step functions are used with H(10 − τ) suggesting the coupling occurs from the beginning of the simulation till the physical time 10 s conversely H(τ − 10), suggesting the coupling occurs after 10 s. Similar to Section 3, 6 possible cases are simulated based on Equations (27) and (28). The simulated signals will be evaluated by GC, TE, and information rate causality.
The result of the simulations is shown in Figure 11, which suggests that all the signals oscillate stationary. Unlike Equation (25), in which the signal x 1 (t) is perturbated by its previous state by a factor of 0.55 (0.55x 1 (t − 1)) and the small noise ( 1 (t)), Equation (28) is perturbated by the past state of x 2 (t) (x 2 (t − 2)) instead. Hence, the power of the exponential will always be negative; consequently, all the simulated signals have stationary oscillations. Note that the noncoupling is referring to the decoupling of x 1 (t) to x 2 (t) (H(τ − 10) = 0).
Here t is the time-step and τ is the physical time. The Heaviside step functions are used 282 with H(10 − τ) suggesting the coupling occurs from the beginning of the simulation till 283 the physical time 10 seconds conversely H(τ − 10) suggesting the coupling occurs after 284 10 seconds. Similar to Section 3, 6 possible cases are simulated based on Equations (27) 285 and (28). The simulated signals will be evaluated by GC, TE, and information rate causality. 286 noise, AFTER time, noise, BEFORE time, Figure 10. Model of the flow of information of the processes x 1 (t) and x 2 (t) for Equations (27) and (28). In this paper, the equations are simulated with the physical time of 25 seconds and sampling frequency of 200 Hertz (5000 realizations) with either large noise or small noise. Note that the process x 2 (t) coupling with x 1 (t) occurs before 10 seconds and the interchange of coupling occurs after 10 seconds. In the context of noncoupling, it is referring to H(τ − 10) = 0.
The result of the simulations is shown in Figure 11 which suggests that all the signals 287 oscillate stationary. Unlike Equation (25) which the signal x 1 (t) is perturbated by its 288 previous state by a factor of 0.55 (0.55x 1 (t − 1)) and the small noise (ϵ 1 (t)), Equation (28) 289 is perturbated by the past state of x 2 (t) (x 2 (t − 2)) instead. Hence, the power of the 290 exponential will always be negative consequently all the simulated signals have stationary 291 oscillations. Note that the noncoupling is referring to the decoupling of x 1 (t) to x 2 (t) 292 (H(τ − 10) = 0) 293 (a) linear: couple Figure 10. Model of the flow of information of the processes x 1 (t) and x 2 (t) for Equations (27) and (28). In this paper, the equations are simulated with the physical time of 25 s and sampling frequency of 200 Hertz (5000 realizations) with either large noise or small noise. Note that the process x 2 (t) coupling with x 1 (t) occurs before 10 s and the interchange of coupling occurs after 10 s. In the context of noncoupling, it is referring to H(τ − 10) = 0.

Granger Causality
Based on the models, the signal x 2 (t) will first cause x 1 (t) for the beginning of the process until the physical time of 10 s, and x 1 (t) causes x 2 (t) after the 10 s. This pattern is captured in the time-varying frequency of GC, as shown in Figure 12 for all the cases. Prior to 10 s, the x 1 (t) is caused by x 2 (t) by a factor of 0.55, and hence a similar factor of GC is shown in the time-varying frequency domain. Similarly, the coupling between the signals can also be observed in the frequency domain of GC. Notice that the [nonlinear: Entropy 2023, 25, 806 23 of 59 small noise] cases are able to show the significance of x 2 (t) causing x 1 (t) prior to 10 s because the small noise and the exponential term had x 2 (t) causing x 1 (t) increased to about 1.5 (e 1 × 0.55 ≈ 1.5).   (27) and (28). Refer to Figure 10 for the graphical explanation of the process. Note that the noncoupling is referring to H(τ − 10) = 0 for all the cases.

Transfer Entropy
TE is able to show the correction direction of the causality between the signals if the correct lag is chosen. Therefore, having models (like Equations (27) and (28)) with the coupling occurring at different time lags cannot properly show the direction of the causality via the window sliding TE. For instance, the window sliding TE shows that x 2 (t) is causing the x 1 (t) when lag 0 is chosen, while the reverse is shown when lag 1 is used, as shown in Figure 13 (for lag 0) and Figure 14 (for lag 1). This is depicted well in the underlining models (Equations (27) and (28)), in which x 1 (t) causes x 2 (t) one lag later. As a result, TE cannot show the whole causality of the signals by just one lag.  (27) and (28). Refer to Figure 10 for the graphical explanation of the process. Note that the noncoupling is referring to H(τ − 10) = 0 for all the cases.   (27) and (28). Refer to Figure 10 for the graphical explanation of the process. Note that the noncoupling is referring to H(τ − 10) = 0 for all the cases.

Information Rate Causality
Alternatively, the information rate causality is able to show the coupling of the signals along with the lag of coupling in each window of the evaluation. As discussed, the information rate causality evaluates the causality by observing the changes in the probability distribution of one signal after it is influenced by another signal. As shown in Figures 15 and 16, the information rate causality can captures the interchange of the coupling among the signals of the simulated models (Equations (27) and (28)). As discussed in TE, signal x 2 (t) is causing x 1 (t) at a lag of 0, and x 1 (t) causes x 2 (t) at a lag of 1. These lags can be observed at each window of the information rate evaluation. For instance, as shown in Figure 16, the information rate of 2 to 1 (Γ 2 2 to 1) has a spike at the beginning of the time, while the information rate of 1 to 2 (Γ 2 1 to 2) has a spike after one time step (one lag). Hence, the information rate causality is capable of showing the underlying causality of the signals.   Figure 15. Result of information rate causality for Equations (27) and (28). Note that the interchange of the causality occurs at 10 s.  Figure 15 for the information rate causality between 9.6 s and 10.4 s. Note that the interchange of the causality occurs at 10 s.

Bidirectional Model
In this section, the bidirectional causality is studied by modifying the linear and nonlinear models from Section 3 as Equations (29) and (30), respectively. These modified models have the signals x 2 (t) continuously influencing x 1 (t) at all times t ≥ 1. In addition, at time 10 s, the Heaviside step function (H(τ − 10)) enables the coupling of x 1 (t) and x 2 (t), which allows x 1 (t) to influence x 2 (t) while the previous coupling remains intact. The general structure of the Equations (29) and (30) is illustrated in Figure 17. Linear model: Nonlinear model: Here, t is the time step and τ is the physical time; H(τ − 10) is the Heaviside step function that allows the coupling of the signal starting at a physical time of 10 s. Similar to the previous model, 6 cases are simulated and the signals are used to evaluate through GC, TE, and information rate causality. Recall that the noncoupling cases here are referring to the decoupling of x 1 (t) to x 2 (t) (H(τ − 10) = 0). and x 2 (t) which allows x 1 (t) to influence x 2 (t) while the previous coupling remains inta The general structure of the Equations (29) and (30) is illustrated in Figure 17. Linear model: (2 Nonlinear model: Here, t is the time-step and τ is the physical time; H(τ − 10) is the Heaviside step functio that allows the coupling of the signal starting at physical time of 10 seconds. Similar to th previous model, 6 cases are simulated and the signals are used to evaluate through GC, T and information rate causality. Recall that the noncoupling cases here are referring to th decoupling of The simulation result from all aforementioned cases is shown in Figure 18. From th observation of the individual signal and the phase space, the scenario of [nonlinear: sma noise, couple] has nonstationary oscillation after the coupling as shown in Figure 18e. Th is due to the divergence caused by the nonlinear exponential term embedded within th simulated equation (e 1−|x 2 (t−1)| = e R ) along with the factor of 0.55 of the previous sta of x 2 (t) (0.55x 2 (t − 1) . . .). On the other hand, [nonlinear: small noise, noncouple] has stationary oscillation as shown in Figure 18f, it is because x 1 (t) only receives the inpu from x 2 (t) which keeps the power of the exponential term negative and hence results in stationary oscillation. Figure 17. Model of the flow of information of the processes x 1 (t) and x 2 (t) for Equations (29) and (30). In this paper, the equations are simulated with physical time of 25 s and sampling frequency of 200 Hz (5000 realizations) with either large noise or small noise. Note that the x 2 (t) coupled with x 1 (t) throughout the process and the occurrence of bidirectional causality happens after 10 s. In the context of noncoupling, it is referring to H(τ − 10) = 0.
The simulation result from all aforementioned cases is shown in Figure 18. From the observation of the individual signal and the phase space, the scenario of [nonlinear: small noise, couple] has non-stationary oscillation after the coupling, as shown in Figure 18e. This is due to the divergence caused by the nonlinear exponential term embedded within the simulated equation (e 1−|x 2 (t−1)| = e R ) along with the factor of 0.55 of the previous state of x 2 (t) (0.55x 2 (t − 1) . . .). On the other hand, [nonlinear: small noise, noncouple] has a stationary oscillation, as shown in Figure 18f; this is because x 1 (t) only receives the input from x 2 (t), which keeps the power of the exponential term negative and hence results in a stationary oscillation.

Granger Causality
From the GC analysis, the coupling from x 2 (t) to x 1 (t) cannot be captured well in it except for the [nonlinear: small noise] cases. This result is due to a similar reason as discussed in Section 4, i.e., the coupling between the signals is affected by a factor of 1.5 (e 1 × 0.55 ≈ 1.5). Therefore, the rest of the cases cannot show the significant contrast of the causality from [x 2 (t) to x 1 (t)] as shown in Figure 19. Furthermore, referring to the time-varying GC, the bidirectional coupling/causality between x 1 (t) and x 2 (t) that occurs after 10 s are not able to capture all the coupling cases.    (29) and (30). Refer to Figure 17 for the graphical explanation of the process. Note that the noncoupling is referring to H(τ − 10) = 0 for all the cases.

Transfer Entropy
The results of TE shown in Figures 20 and 21 once again are quite similar to those shown in Section 4. The window sliding TE, evaluated with lag 0, shows that signal x 2 (t) causes x 1 (t) throughout the time, as shown in Figure 20. For the coupling cases, the amplitude of the TE changes when x 1 (t) feedbacks to x 2 (t) (after time 10 s). Similar to Section 4, the window sliding TE only shows that signal x 1 (t) is causing x 2 (t) when evaluated at lag 1 for the coupling cases shown in Figure 21. Hence, TE cannot fully capture the causality between the signals with just one lag.   (29) and (30). Refer to Figure 17 for the graphical explanation of the process. Note that the noncoupling is referring to H(τ − 10) = 0 for all the cases.    (29) and (30). Refer to Figure 17 for the graphical explanation of the process. Note that the noncoupling is referring to H(τ − 10) = 0 for all the cases.

Information Rate Causality
The information rate causality analysis evaluates the causality through the change in probability distribution of an original signal after including another signal assuming it causes the original signal. The information rate causality is able to show the coupling among the signals well, as shown in Figures 22 and 23. Similar to Section 4, prior to time 10 s, the causality from signal x 2 (t) to x 1 (t) can be depicted with the presence of the peak of Γ 2 2 to 1. The peak presented at the beginning of the evaluation window suggests that the coupling occurs at lag 0. After 10 s, the mutual feedback between the signals is shown within the evaluation window of the information rate causality. This mutual causality between the signals is observed through the alternating oscillations of the peak occurrence in the information rate within the window. Hence, this can reflect well the underlying model of the signals. Note that this is different than Section 4, which is an interchange of unidirectional causality that has the feedback solely from one signal to the other without retaining information from itself.  (29) and (30). Note that the bidirectional causation begins at 10 s. The results shown are based on the expression in Equations (29) and (30). Refer to Figure 17 for the graphical explanation of the process. Note that the noncoupling is referring to H(τ − 10) = 0 for all the cases.  Figure 22 for the information rate causality between 9.6 s and 10.4 s. Note that the bidirectional causation begins at 10 s. The results shown are based on the expression in Equations (29) and (30). Refer to Figure 17 for the graphical explanation of the process. Note that the noncoupling is referring to H(τ − 10) = 0 for all the cases.

Conclusions
In this paper, we proposed an alternative causal analysis method to the popular GC and TE for causality quantification. Specifically, based on information rate, which measures the rate of temporal change in the time-dependent distribution, we developed a model-free, information-geometric measure of causality-information rate causality-that is suitable for analyzing numerically generated non-stationary, nonlinear data. To compare the GC, TE, and information rate causality, we applied the methods to numerical data generated by simulating different types of discrete autoregressive models which contain linear and nonlinear interactions. In Section 3, we showed that information rate causality performs equally well compared to GC in the standard linear stationary signals. Later, we showed that the GC did not perform well for non-stationary and/or nonlinear signals. Furthermore, it failed to capture mutual feedback between the signals, as shown specifically in Section 5 in all the coupling cases. TE performed slightly better than GC, since it is model-independent. However, the information on the lag is needed to properly evaluate the TE, as shown in Sections 4 and 5. In comparison with GC and TE, information rate causality has shown to be able to uncover the underlying mechanism of causality in the signals, such as the interchanging oscillatory feedback between the signals that is discussed in Section 5, which was not captured by either GC or TE.
While our results in this paper were obtained based on the interaction of two time series of different types of (non)linearity and (non)stationary, they have at least pointed out some limitations of GC and TE that can be resolved by employing information rate causality. It remains as future work to extend this work and to investigate a larger class of time series data, including real data such as EEG signals.

Data Availability Statement:
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Granger Causality
Appendix A.1. Autoregressive Process Given a stochastic process (x(t)), it can be modelled through an autoregressive (AR) model with available/appropriate n number of observations from the process (e.g., AR(n) known as autoregressive of nth order). The general expression of the AR(n) model is expressed as Equation (A1).
such that t is the time step, and α i is the constant coefficient which suggested the weight of contribution from the ith past observation towards the process x(t). (t) is the additive noise that has the variance Σ, which indicated the proportion of the external influence towards the process. Hence, a process that is strongly related to past observations will have small noise (e.g., small Σ).

Appendix A.2. Granger Causality: Time Domain
Given two stochastic processes (x 1 (t) and x 2 (t)), they can be modelled independently, as shown in Equation (A1) and expressed as Equations (A2) and (A3).
such that 1 (t) and 2 (t) are the noises with variances of Σ 11 and Σ 22 , respectively; a i and c i are the constant parameters that suggested the fractions of the contribution from the past observations towards x 1 (t) and x 2 (t), respectively.
If there is a present causal relationship between two processes, they can be modelled with the past values from each other, as shown in Equations (A4) and (A5).
Here a i , b i , c i , and d i are the constants parameters. The noises ( * 1 (t) and * 2 (t)) would have the variance presented as the covariance matrix Σ shown in Equation (A6): Similar to Equations (A2) and (A3), Σ * 11 and Σ * 22 are the variance of the * 1 (t) and * 2 (t), respectively, while Σ * 12 = Σ * 21 is the covariance of the noises ( * 1 (t) and * 2 (t)). The causal relationship between x 1 (t) and x 2 (t) can determined by evaluating the noises from Equations (A2) to (A5). For instance, Σ * 11 < Σ 11 would indicate that the predictability of x 1 (t) increases as the past information of x 2 (t) is included. Hence, this would suggest that there is a causal relationship from x 2 (t) to x 1 (t) which is known as GC (x 2 (t) Granger-causes x 1 (t), F x 2 →x 1 ). Similarly, Σ * 22 < Σ 22 would suggest x 1 (t) Granger-causes x 2 (t), F x 1 →x 2 . The covariance of the noises Σ * 12 = Σ * 21 would suggest that the processes are correlated by one another via a common external force that drives the processes; consequently, it shows an instantaneous causality (F x 1 ↔x 2 ).
The idea behind GC is to evaluate the total interdependence (also known as total causality index) between x 1 (t) and x 2 (t), which consists of two directional causalities (F x 1 →x 2 and F x 2 →x 1 ) and one instantaneous causality (F x 1 ↔x 2 ). The total interdependence between x 1 (t) and x 2 (t) is defined as Equation (A7).
such that |Σ| is the determinant of the covariance matrix Σ. If either one of the signals Granger-causes or correlates to one another, the value of F x 1 ·x 2 > 0 because the Σ * 11 < Σ 11 , Σ * 22 < Σ 22 , or Σ * 12 = 0. On the contrary, F x 1 ·x 2 = 0 indicates there is no GC relationship and correlation between the signals.
The directional causalities and instantaneous causality are defined as Equations (A8) to (A10). The spectral of the GC is studied at the frequency domain of the AR model. Unlike the time domain, the frequency domain of GC required both the constant parameters and the variance of the noises for the evaluation. To understand the formulation, the lag operator (L k ) is used and defined as Equation (A11).
Hence, by applying the lag operator expressed in Equation (A11) to Equations (A4) and (A5), they can be rewritten as x(ω) = H(ω) * (ω), such that A(ω) is known as the coefficient matrix. The elements in the coefficient matrix are expressed as which is obtained via the Fourier transform from the time domain to the frequency domain. Note that j is the imaginary unit while ω is the angular frequency that is defined as ω = 2π f , f = frequency. The matrix H(ω) is known as the transfer matrix. The spectral matrix (S(ω)) can be obtained by proper ensemble averaging of Equation (A17), which is expressed as Equation (A22): such that · is defined as an ensemble average, and † is denoted as the complex conjugate transpose of a matrix. The element S nm|n=m is known as auto spectra and S nm|n =m is known as cross spectra. Similar to Equation (A7), the total interdependence between x 1 (ω) and x 2 (ω) is defined as Equation (A23), which consists of two directional causalities (I x 1 →x 2 and I x 2 →x 1 ) and one instantaneous causality (I x 1 ↔x 2 ).
such that |S(ω)| is the determinant of spectral matrix S(ω). The individual causalities are defined as H 12 (ω); * is denoted as the complex conjugate of the element [17].
As shown in Equations (A23) to (A26), the elements from the transfer matrix and covariance matrix of the noises are needed to evaluate the GC in the frequency domain. The elements can be obtained via the constant parameters of Equations (A4) and (A5), which can be estimated via Yule-Walker equations [18,40]. This approach of evaluating GC through the knowledge of constant parameters is known as the parametric method; it has the disadvantage that the information of lags (number of past values) is needed when estimating the constant parameters. Alternatively, the non-parametric method conducts the GC analysis by directly decomposing the spectral matrix (S(ω)) into the transfer matrix (H(ω)) and the covariance matrix of the noises (Σ). The decomposition is conducted via a recursive algorithm known as the Wilson algorithm [18,35]. The derivation of the Wilson algorithm will be discussed in the next section.

Appendix A.4. Wilson Algorithm
The spectral matrix S(ω) that is defined in the range [−π, π] is known as the spectral density function. It can be expressed as Equation (A27).
such that the C k is known as the correlation coefficient obtained via inverse Fourier transform Equation (A28). Following Wilson's factorization theorem (theorem 1.1 in [35]), the spectral density function is the product of generating function ψ(e jω ) with its complex conjugate transpose that is defined on a unit circle |z| = 1: ψ(e jω ) = ψ(e jω )e −jωi dω.
Here A k is the moving average coefficient, which can be obtained by inverse Fourier transform as expressed in Equation (A31). With Equation (A30) being holomorphic extended to the inner of the unit circle (|z| < 1), it can be rewritten as Equation (A32).
Comparing Equations (A22) and (A33) when z = e jω = 0: such that I is the identity matrix, and [·] indicates the transpose of the matrix. Since A 0 −1 A 0 A 0 A 0 − = I, the transfer matrix H(ω) can be obtained via The definition of MI is derived from the Kullback-Leibler (KL) divergence. KL divergence measures the differences between two probability distributions (p(x) and q(x)) non-symmetrically; it is defined as Equation (A42) [37].
Hence, MI measures the difference between the marginal probability distribution (p(x)p(y)) and joint probability distribution (p(x, y)) that quantifies the common information shared between the processes. Therefore, the MI will be zero when both processes are statistically independent and no common information is presented between the processes as the joint probability distribution will be equal to the product of marginal probability distribution (p(x, y) = p(x)p(y) → MI = ∑ x∈X,y∈Y p(x, y) log 2 1 = 0) [19,37].
As discussed above, MI only measured the commonly shared information, which subsequently does not indicate the directional flow of information. On the contrary, Transfer Entropy (TE) is able to reveal the directional flow of information by considering the conditional probability p(·|·) and is defined as Equation (A43).
such that x n+1 is the state of X at the (n + 1)-th moment; x (k) n is the state at n-th moment of X consists of k previous states of X; y (l) n is the state at n-th moment of Y consisting of l previous states. Note that the previous states of k and l are arbitrary and depend on the interest of the study. For instance, given a collective observed stochastic process X = (x 1 , x 2 , . . . , x 10 ), x (2) 8 = (x n , x m ) |n, m ∈ [1,8] → (x 1 , x 2 ) or (x 3 , x 8 ), etc. Hence, TE measures the additional information needed for the realization of X at (n + 1) from Y with the assumption that X is independent of Y. If Y does not affect the future evolution of X, then p(x n+1 |x (k) n , y (l) n ) = p(x n+1 |x (k) n ) and subsequently T Y→X = 0. Using the general definition of conditional probability distribution (p(x|y) = p(x,y) p(y) ), Equation (A43) can be rewritten as Equation (A44) [19]. . (A44)

Appendix C. Information Rate, Information Length, and Information Rate Causality
In a dimensionless statistical manifold, the distance of two probability distributions is defined by their differences. One of the distances is discussed in Appendix B; this is KL divergence/relative entropy with the expression as Equation (A42). As mentioned, the KL divergence measured the differences non-symmetrically, and hence it is hardly considered as a metric distance because it does not satisfy the triangle inequality [32]. The symmetric version of KL divergence is introduced as Jensen divergence and expressed as Equation (A45). The square root of Jensen divergence has shown to be a metric distance that satisfies triangle inequality [32,43,44].
As depicted in Equations (A42) and (A45), the distance is measured at the ending points (e.g., initial and final probabilities), and hence it is not path-dependent. The pathdependent distance can be evaluated with the information length (L), which is computed through the knowledge of information rate (Γ). The detailed discussion and derivation are shown in [32]. An essential brief discussion and derivation of Γ and L are shown in the next part. The results of the GC are shown in Figure A2 for all the cases. Based on the results, both the coupling and noncoupling cases can be depicted well via the nonparametric GC approach (refer to Section 2.1). The continuous causality from x 1 to x 2 can be depicted well in Figure A2a,c. Similarly, the noncoupling between the signals is shown well in Figure A2b,d.

Appendix D.3. Information Rate Causality
Information rate causality is a quantitative measure of the changes in probability distribution caused by one signal influencing another. In the cases of coupling, a high peak is expected to indicate a significant change in the distribution due to the causal relationship between the signals, as depicted in Figure A4a,c. The magnification of these figures is shown in Figure A5a,c. In this scenario, the causation from x 1 (t) to x 2 (t) is represented well by the peaks at an order of magnitude of ×10 4 . In noncoupling cases, as shown in Figure A4b,d (with magnification shown in Figure A5b,d), information rate causality does not reduce to zero due to the noise present in the estimation of the probability distribution (refer to Figure S1). However, this noise does not affect the evaluation, as it does not result in a high order of peaks observed in coupling cases.