MuTE: A MATLAB Toolbox to Compare Established and Novel Estimators of the Multivariate Transfer Entropy

A challenge for physiologists and neuroscientists is to map information transfer between components of the systems that they study at different scales, in order to derive important knowledge on structure and function from the analysis of the recorded dynamics. The components of physiological networks often interact in a nonlinear way and through mechanisms which are in general not completely known. It is then safer that the method of choice for analyzing these interactions does not rely on any model or assumption on the nature of the data and their interactions. Transfer entropy has emerged as a powerful tool to quantify directed dynamical interactions. In this paper we compare different approaches to evaluate transfer entropy, some of them already proposed, some novel, and present their implementation in a freeware MATLAB toolbox. Applications to simulated and real data are presented.


Introduction
Since its first introduction by Schreiber [1] transfer entropy (TE) has been recognized as a powerful tool to detect the transfer of information between joint processes. The most appealing features of TE are that it has a solid foundation in information theory and it naturally detects directional and dynamical information. Moreover, the formulation of TE does not assume any particular model as underlying the interaction between the considered processes, thus making it sensitive to all types of dynamical interactions. The popularity of this tool has grown even more with the recent elucidation of its close connection with the ubiquitous concept of Granger causality [2], which has led to formally bridge information-theoretic and predictive approaches to the evaluation of directional interactions between processes. Given all these advantages, TE has been increasingly used to assess the transfer of information in physiological systems with several applications in neurophysiology [3][4][5][6]. It is worth noting that speaking of the transfer of information as measured by TE we refer to the ''predictive information transfer'' intended as the amount of information added by the past (and present) states of a source process to the present state of a target process.
The estimation of TE from time series data which constitute realizations of the investigated physiological processes is complicated by a number of practical issues that need to be addressed and that are contributing to the development of several recipes to compute this measure.
In this study we discuss three different approaches (binning, nearest neighbor, linear) to evaluate the probability distribution function which constitutes the basis for TE in multivariate systems. In turn, each approach has to be paired with the choice of the time series' past values which contribute information to the knowledge of the present state of a given target time series. The first choice is the classical uniform embedding (UE) that considers a fixed amount of past terms for each series; the second approach is quite recent and employs a non-uniform embedding (NUE) [7,8] iteratively selecting the most informative terms through an optimization criterion.
These recipes, some of them already established, some novel, are accordingly revisited or explained. Then, in order to contribute to the foundation of a common framework for the application of TE, we describe their implementation in a modular MATLAB toolbox. Several examples are presented allowing a critical comparison of UE and NUE approaches for all the three entropy estimators.
The paper is organized as follows. We first provide an overview of TE. We then distinguish between UE and NUE approaches to the representation of the history of the observed processes. We describe in detail the methods used to estimate the probabilities involved in the evaluation of the TE and their implementation in the toolbox. The approaches are then validated on synthetic time series and then tested on real data: the electroencephalogram of an epileptic patient and cardiovascular measurements in healthy subjects.

Transfer entropy
Let us consider a composite system described by a set of M interacting dynamical subsystems and suppose that, within the composite system, we are interested in evaluating the information flow from the source system X to the destination system Y, collecting the remaining systems in the vector Z~Z k È É k~1,...,M{2 . We develop our framework under the assumption of stationarity, which allows to perform estimations replacing ensemble averages with time averages (for nonstationary formulations see, e.g., [9] and references therein). Accordingly, we denote X, Y and Z as the stationary stochastic processes describing the state visited by the systems X , Y and Z over time, and X n , Y n and Z n as the stochastic variables obtained by sampling the processes at the present time n. Moreover, we denote X { n~½ X n{1 X n{2 . . ., Y { n~½ Y n{1 Y n{2 . . ., and Z { n~½ Z n{1 Z n{2 . . . as the vector variables representing the whole past of the processes X, Y and Z. In some cases it can be desirable to take into account also the instantaneous influences of the candidate drivers. In this case, the vectors X { n and Z { n defined above should contain also the present terms X n and Z n . Then, the multivariate transfer entropy from X to Y conditioned to Z is defined as: where the sum extends over all the phase space points forming the trajectory of the composite system. p(a) is then the probability associated with the vector variable a while p(bDa)~p(a,b)=p(a) is the probability of observing b knowing the values of a. The conditional probabilities used in (1) can be interpreted as transition probabilities, quantifying to which extent the transition of the target system Y towards its present state is affected by the past states visited by the source system X . Specifically, the TE quantifies the information provided by the past of the process X about the present of the process Y that is not already provided by the past of Y or any other process included in Z. The formulation presented in (1) is an extension of the original TE measure proposed for pairwise systems [1] to the case of multiple interacting processes. The conditional TE formulation, also denoted as partial TE [5,8], rules out the information shared between X and Y that is mediated by their common interaction with Z. Note that the TE can be seen as a difference of two conditional entropies (CE), or equivalently as a sum of four Shannon entropies: TE has a great potential in detecting information transfer because it does not assume any particular model that can describe the interactions governing the system dynamics, it is able to discover purely non-linear interactions and to deal with a range of interaction delays [4]. Recent research has proven that TE is equivalent to Granger Causality (GC) for data that can be assumed to be drawn from a Gaussian distribution, a case in which the data covariance is fully described by a linear parametric model [2,10]. This establishes a convenient joint framework for both measures. Here we evaluate GC in the TE framework and compare this model-based approach with two model-free approaches.

Reconstruction of the system's past states and TE evaluation
We will discuss here the crucial issue of how to approximate the infinite-dimensional variables representing the past of the processes. This problem can be seen in terms of performing suitable conditioned embedding of the considered set of time series [11].
The main idea is to reconstruct the past of the whole system represented by the processes X, Y, Z with reference to the present of the destination process Y, in order to obtain a vector V~½V Y n ,V X n ,V Z n containing the most significant past variables to explain the present of the destination. Once V is computed it is easy to evaluate TE as the difference of two CEs or through the four entropies using the whole V or convenient subsets of it according to equation (2).
Uniform embedding. The large majority of approaches applied so far to estimate TE implicitly follow uniform conditioned embedding schemes where the components to be included in the embedding vectors are selected a priori and separately for each time series. For instance the vector Y { n is approximated using the embedding vector V Y n~½ Y n{m Y n{2m . . . Y n{dm , where d and m are respectively the embedding dimension and embedding delay (the same for X { n and Z { n , approximated by V X n and V Z n ). In this way it is possible to distinguish between a first phase during which the past states are collected and a second phase during which the estimate of the entropy, and consequently of the CE, is evaluated by means of the chosen estimator, according to the following pseudo-code: 1. build the vector V~½V Y n ,V X n ,V Z n ; 2. use V and Y n to evaluate the last two entropies of (2) and, consequently, the lowest CE term (CE2); 3. use V \V X n to evaluate the first two entropies of (2) and, consequently, the highest CE term (CE1); 4. compute TE as equal to the difference CE1-CE2.
The obvious arbitrariness and redundancy associated with this strategy are likely to cause problems such as overfitting and detection of false influences [11]. Moreover one should assess which TE values are significant. The significance tests associated with TE estimation based on UE are different for model-based and model free estimators, and are described in the respective following subsections.
Non-uniform embedding. Non-uniform embedding constitutes the methodological advance, with respect to the state of art, that we implement as a convenient alternative to UE. This approach is based on the progressive selection, from a set of candidate variables including the past of X, Y, and Z considered up to a maximum lag (candidate set), of the lagged variables which are most informative for the target variable Y n . At each step, selection is performed maximizing the amount of information that can be explained about Y by observing the variables considered with their specific lag up to the current step. This results in a criterion for maximum relevance and minimum redundancy for candidate selection, so that the resulting embedding vector V~½V X n V Y n V Z n includes only the components of X { n , Y { n and Z { n , which contribute most to the description of Y n . Starting from the full candidate set, the procedure which prunes the less informative terms is described below: 1. Get the matrix with all the candidate terms MC ½X n{1 . . . X n{lX Y n{1 . . . Y n{lY Z n{1 . . . Z n{lZ , with l X , l Y , l Z representing the maximum lag considered for the past variables of the observed processes; these matrices will contain also the terms X n and Z n in case one wants to take into account instantaneous effects. 2. Run the procedure to select the most informative past variables and the optimal embedding vector: (c) Among the tested W (i) n , select the termŴ n which maximizes the mutual information (d) ifŴ n fulfills a test for candidate significance, as described below, include it in the embedding vector, V (k) n~½Ŵ n V (k{1) n , delete it from MC and set k~kz1.
(e) else end the procedure setting k~numCz1 and returning V~V (k{1) n 3. Use Y n and the full embedding vector V~½V X n V Y n V Z n to evaluate the third and fourth entropy values of (2) and, consequently, the lowest CE term (CE2) 4. Take the subset of V without the past states belonging to the source process, ½V Y n V Z n to evaluate the first and the second term of (2) and, consequently, the highest CE term (CE1) 5. compute TE subtracting CE2 from CE1.
As described above, candidate selection is performed maximizing the mutual information between the target variable and the vector of the candidates already selected, incremented by the candidate under examination. As we will see in the following sections, the practical implementation of this general criterion consists of an optimization process (i.e., minimization of the conditional entropy or maximization of the conditional mutual information, depending on the estimator chosen). The performances of the processes mentioned above in the reconstruction of the optimal embedding for an assigned target process are also discussed in [8].
The complexity of the algorithm concerns mainly step 2, in particular step 2(b), involving a for loop nested inside a while loop: in the worst case the body of the for loop is executed numC 2 times resulting in a complexity O(numC 2 ).
At step 2(d), the test for candidate significance is performed at the k-th step comparing the conditional mutual information between the target variable and the selected candidate given the candidates previously selected up to the (k{1)-th step, I(Y n ;Ŵ n DV (k{1) n ), with its null distribution empirically built by means of a proper randomization procedure applied to the points ofŴ n . The test for candidate significance is fulfilled if the original measure I(Y n ;Ŵ n DV (k{1) n ) is above the 100(1{a) th percentile (where a is the desired significance level) of its null distribution. In order to maximize detection accuracy, the adopted randomization procedure is varied for each estimator, and is thus described in the relevant section.
Summarizing, the non-uniform embedding is a feature selection technique selecting, among the available variables describing the past of the observed processes, those who are the most significantin the sense of predictive information -for the target variable. Moreover, given the fact that the variables are included into the embedding vector only if associated with a statistically significant contribution to the description of the target, the statistical significance of the TE estimated with the NUE approach results simply from the selection of at least one lagged component of the source process. In other words, if at least one component from X is selected by NUE, the estimated TE is strictly positive and can be assumed as statistically significant. If this is not the case, the estimated TE results exactly zero and is assumed as nonsignificant. This latter case occurs also when the first candidate (k~1) does not reach the desired level of significance, meaning that none of the candidates provides statistically significant information about the target variable. In such a case, that is encountered for instance when the target process is a white noise, the code returns an empty embedding vector and assigns a value of zero to the TE.

Entropy estimators
Estimation of the TE, performed according to either UE or NUE presented above, results from the application of estimators of entropy and CE to the various terms in (2). The toolbox contains three of such estimators. The first is the linear estimator (LIN) that assumes that data are drawn from a Gaussian distribution. Under this assumption, the two CE terms defining the TE can be quantified by means of linear regressions involving the relevant variables taken from the embedding vector [2]. The second estimator is the classical binning estimator (BIN), which consists of coarse-graining the observed dynamics using Q quantization levels, and then computing entropies by approximating probability distributions with the frequencies of occurrence of the quantized values [12]. The third estimator is based on k-nearest neighbor techniques (NN) which exploit the statistics of distances between neighboring data points in the embedding space to estimate entropy terms; we adopted the bias-reduction method of estimating entropies through neighbor search in the space of higher dimension and range searches in the subspaces of lower dimension [13].
A problem that can arise dealing with UE and NUE procedures when we use entropy estimators that does not assume any probability distribution concerns the curse of dimensionality. Indeed the more candidates we work with, the more the data points will be spread in the phase space, the more the probability density function will assume a constant value. Consequently the NUE should be the most apt method to avoid the curse of dimensionality because it reduces the dimension of the phase space. We will prove this statement in the Results section when it will be clear from the comparison between UE and NUE for the BIN and NN estimators in multidimensional spaces. We are now going to introduce each estimator in detail.
Linear estimator (LIN). The linear estimator method works under the assumption that the overall process fX ,Y ,Zg has a joint Gaussian distribution. This assumption allows to work with wellknown expressions for the probability density functions. Under this assumption, the two CE terms defining the TE in (2) are expressed by means of linear regressions involving the past states of the systems collected in the vector variables [14]. When the UE is implemented, X { n is approximated with the vector of length p, V X n~X n{1 , . . . ,X n{p Â Ã , and the same for Y { n and Z { n which are approximated by V Y n~½ Y n{1 , . . . ,Y n{p and V Z n1 Z n{1 , . . . ,Z n{p (here m~1, p~d). When the NUE is implemented, the embedding vectors will contain only the components resulting from the selection procedure. Then, an unrestricted regression of Y n on the full vector V (u)~½ V X n V Y n V Z n T , and a restricted regression of Y n on the reduced vector V (r)~½ V Y n V Z n T , are performed as follows: where A (u) and A (r) are vectors of linear regression coefficients. The terms e (u) n and e (r) n are scalar white noise residuals with variance s (u) and s (r) . Under the joint Gaussian assumption, it has been demonstrated [2] that the entropy of Y n conditioned to the unrestricted or restricted regression vectors is, respectively, H(Y n DV (u) )0 :5( log s (u) z2pe) and H(Y n DV (r) )~0:5( log s (r) z2pe), from which follows immediately that: In this study, the unrestricted and restricted regression models in (3) and (4) were estimated by the least-squares method. In the UE implementation, the order p of the regressions was selected by the Bayesian information criterion [15]; in the NUE implementation, the order resulted implicitly from the selection procedure. In NUE, maximization of the mutual information between the componentŴ n selected at the step k and the target variable Y n (step 2d) was obtained in terms of minimization of the CE H(Y n DŴ n ,V k{1 n )~0:5( log s (k) z2pe), where s (k) denotes the variance of the residuals of the linear regression of Y n on ½Ŵ n ,V k{1 n . Here, the randomization procedure applied to test candidate significance consisted time-shifting the points ofŴ W n by a randomly selected lag (of at least 20 lags, set to avoid autocorrelation effects) [16].
The statistical significance of the TE estimated through the UE approach is assessed by a parametric F-test for the null hypothesis that the p coefficients of A (u) which weigh the past components of the driving process, collected in V X n , are all zero [17]. In this case, the test statistic is F~( where RSS r and RSS u are the residual sum of squares of the restricted and the unrestricted model, and N is the time series length. The TE is considered statistically significant if F is larger than the value of the Fisher distribution with (p, N{p) degrees of freedom at the significance level a~0:05.
Binning estimator (BIN). Here we describe the estimator based on fixed state space partitioning. This approach consists of an uniform quantization of the time series followed by estimation of the entropy approximating probabilities with the frequency of visitation of the quantized states [12]. This is the classical approach adopted in the first definition of TE [1]. A time series y, realization of the generic process Y, is first normalized to have zero mean and unit variance, and then coarse grained spreading its dynamics over j quantization levels of amplitude r~(y max {y min )=j, where y max and y min represent minimum and maximum values of the normalized series. Quantization assigns to each sample the number of the level to which it belongs, so that the quantized time series y j takes values within the alphabet A~(0,1, . . . ,j{1). Uniform quantization of embedding vectors of dimension d results in an uniform partition of the ddimensional state space into j d disjoint hypercubes of size r, such that all vectors V falling within the same hypercube are associated with the same quantized vector V j , and are thus indistinguishable within the tolerance r. The entropy is then estimated as: where the sum is extended over all vectors found in the available realization of the quantized series, and the probabilities p(V j ) are estimated for each hypercube simply as the fraction of quantized vectors V j falling into the hypercube (i.e., the frequency of occurrence of V j within A d ). According to this approach, the estimate of TE based on binning results from the application of (6) to the four embedding vectors defined in (2) and determined either by UE or by NUE.
In the NUE implementation, maximization of the mutual information between the componentŴ W n selected at the step k and the target variable Y n (step 2d) was obtained in terms of , with the two entropy terms estimated through the application of (6). As for the LIN estimator, the randomization procedure applied to test candidate significance consisted in timeshifting the points ofŴ n by a randomly selected lag [16].
The statistical significance of the TE estimated through the BIN UE approach exploited the method of surrogate data implemented by the time-shift procedure proposed in [11,16,18]. Specifically, the estimated TE is tested against its null distribution formed by the values of TE computed on replications of the original series, where in each replication the source series is time-shifted by a randomly selected lag, set to exclude autocorrelation effects.
Nearest Neighbor estimator (NN). Since its first introduction in 1967 [19], the nearest neighbor method has been shown to be a powerful non-parametric technique for classification, density estimation, and regression estimation. This method can be used to estimate the entropy of a d-dimensional random variable X, H(X ), starting from a random sample (x 1 , . . . ,x n ) of N realizations of X. Following the reasoning in [13], if we consider the probability distribution P k (e) for the distance between x i and its k-th nearest neighbor, the probability P k (e)de is equal to the chance that there is one point lying within a distance r[½e=2,e=2zde=2 from x i , that there are k{1 other points at smaller distances from it, and that N{k{1 points have larger distances from x k . Let p i be the mass of the e-sphere centered at x i , p i (e)~Ð Ej{xiEve=2 m(j)dj, where m(j) is the density of the variable j. Then, the expectation value of log(p i (e)) is where P k (e) is evaluated through the trinomial formula and y( : ) is the digamma function. The expectation is taken here over the positions of all other N{1 points, with x i kept fixed. An estimator for log(m(x)) is then obtained by assuming that m(x) is constant in the entire e-sphere. The latter gives where d is the dimension of x and c d is the volume of the ddimensional unit sphere. For the maximum norm one has simply c d~1 , while c d~p e=2 =C(1zd=2)=2 d for the Euclidean norm. From (7) and (8) we can evaluate log(m(x i )) and finally: The NN estimator faces the issue of the bias in the estimation of multiple entropies for vector variables of different dimensions by computing entropy sums through a neighbor search in the space of higher dimension, and range searches in the projected sub-spaces of lower dimensions [13]. This approach can be fruitfully exploited for the estimation of the TE, as previously done, e.g., in [3,4]. To do this, we first rewrite the expression for TE in (2) in terms of the components of the embedding vector V~½V Y n ,V X n ,V Z n spanning a space of dimension (d X zd Y zd Z ): The term H(Y n ,V ) is estimated through neighbor search in the (d X zd Y zd Z z1){dimensional space, while the three other terms are estimated through range searches in the spaces of . Accordingly, adaptation of (9) to the four terms in (10) yields the equation for TE based on the nearest neighbor estimator: where N V Y n V Z n ,N YnV Y n V Z n and N V are the number of points whose distance from ½V Y n ,V Z n , ½Y n ,V Y n ,V Z n and V, respectively, is strictly less than the distance from ½Y n ,V to its k-th neighbor, and v : w denotes average over all n.
In the NUE implementation of the NN estimator, maximization of the mutual information between the componentŴ n selected at the step k and the target variable Y n (step 2d) was obtained in terms of maximization of the conditional mutual information I(Y n ;Ŵ n DV (k{1) n ), which was computed as described above by estimating the four relevant entropies through a neighbor search in the complete space, and range searches in the projected sub-spaces of lower dimensions. Moreover, the randomization procedure applied to test candidate significance consisted in shuffling randomly and independently both the points ofŴ n and those of Y n . These techniques have been recently shown to be optimal for the selection of candidates in a non-uniform embedding approach using nearest neighbor entropy estimators [8]. As for the BIN UE method, the statistical significance of the TE estimated through the NN UE approach exploited the method of surrogate data implemented by the time-shift procedure proposed in [11,16,18].

Toolbox structure
This section describes how the three TE estimators presented above are implemented in the toolbox, exploiting either the UE or the NUE approach for system state reconstruction.
The same main structure, consisting of the following steps, is common to all methods: 1. normalize the data and perform quantization when needed; 2. evaluate the probability density function (PDF); 3. evaluate CE2 (the second conditional entropy in (2)). This term, accounting for the present state of the target series conditioned to the past of the remaining series including the driver, is evaluated first since it is needed to obtain the complete set of conditional terms including all the series; 4. evaluate CE1 (the first conditional entropy in (2)): this term accounts for the present state of the target series conditioned to a vector including the past of the target series and of the all other series except the driver; such a vector is obtained subtracting the candidates belonging to the driver series from the set of candidates evaluated in the previous step.
Keeping this general scheme in mind, specific steps will be performed for any method of choice. For instance, when using the NUE with the BIN estimator, the steps to be performed are: 1. data quantization; 2. estimation of the PDF, as described in Binning estimator section; 3. evaluation of the first and second transfer entropy terms according to Non-uniform embedding section.
Given the modularity of the structure shown previously it has been possible to build a user friendly toolbox that allows one to compare all the methods at the same time. The toolbox is available at this link http://dx.doi.org/10.6084/m9.figshare.1005245. The package also contains two existing MATLAB toolboxes which are used in some of the calculations: ARFIT [20], a collection of modules for modeling and analyzing multivariate time series with autoregressive models, used for choosing the model order in LIN UE, and OPENTSTOOL [21], a software package for signal processing with emphasis on nonlinear time-series analysis, and used in searching for neighbor in NN. In order to optimize the toolbox for speed, the routine evaluateEntropy, that estimates the entropy among variables according to entropy~{ P p log(p), has been converted in a.cpp executable substantially reducing the computation time.
In the following we provide guidelines for the use of the toolbox. Let's start from a hypothetical main function and let's explore how a user should set the parameters to chose which methods to use and, possibly, how to build a new method to be inserted within the toolbox.
In the exampleMain file, included in the folder /MuTE/ exampleToolbox, some commented lines remind the method order that has to be kept in mind when setting the parameters, and the parameters available for each method. A first part then follows, devoted to setting the name of the folder that contains data as, for instance, mat files. Each file should contain a matrix with the time series as the rows. Then the folders in which all the output files will be stored are defined. In the second part the function para-metersAndMethods is called.
The function parametersAndMethods requires the following inputs, as reported in table 1.
N the number of data realizations; N the sampling rate; N the subset of interest; N a value to cut the series length if necessary N a vector specifying whether each method will take into account all the pairwise combinations of chosen variables. By default the instantaneous effects won't be considered; N a vector specifying whether the user will set by hand all the pairwise combinations of the chosen variables. This vector will also be used for visualizing the output. It is worth noting that in this case the user should provide as input also the sequence of the destination series and the driver series; N the folder in which results can be stored, previously defined; N the folder in which data are stored, previously defined; N the folder in which results can be eventually copied; N the number of processors if the code can be run in parallel on several nodes; N the name of the method chosen and all the relevant parameters as shown in the comments. Here attention should be paid in setting four parameters if the instantaneous effects have to be considered. First of all the function choosing the candidate terms should be set and consequently the variable usePresent: generateConditionalTerm, and usePresent~0 if the instantaneous effects do not have to be taken into account, generateCondTermLagZero and usePresent~1 otherwise. Then, if one is interested in the action of more than one driver on a target series, for each driver it can be specified whether its instantaneous effect should be considered by writing twice in a row the number of the driver series. One can also choose which variables belonging to the Z set can be considered with their instantaneous effects, filling the vector idOtherLagZero, table 2, third column.
For an example of how these parameters should be set, let's consider 5 variables; a conditioned analysis and a vector idTargets ½1 2 3 4 5 would result in the situation shown in table 2, second column, in which no instantaneous effect are set and the variable idDrivers contains on the columns the id of the driver series only once and the variable idOtherLagZero is the null vector. An example considering instantaneous effects is reported in table 2, third column, when looking at how drivers 1 and 4 influence the target 2 and how drivers 5 and 2 influence the target 3, with series 5 and 2 as conditoning variables.
The input parameters, including the methods of choice, specified in the function createNameMethodParams are stored in a structure called params by the function parametersAndMethods. This function then computes TE according to the chosen methods, via the function callingMethods, and stores the significant results through the function storingOutput. In case of multiple realiza-tions/data sets to analyze, the computation can be performed in parallel on separate pools.
The description of the toolbox structure should take into account the structure of the function callingMethods that receives in input the data matrix with the time series points in row and the structure params. The function reads the names of the methods stored in the params structure and computes the TE with all the chosen methods (in parallel if the hardware architecture allows it). This function will return a cell array containing the output of each method.
The open structure of the toolbox allows users to integrate in it their own method. The main function should in this case be modified with some comments showing which parameters should be passed as an input to parametersAndMethods, and in which order. Each new method should then be implemented following the steps described above using all the necessary parameters conveniently grouped via the function createNameNewMethod-Params. The new method will be called by setting the appropriate name in callingMethods.
The execution time for a single run of the system 14 ranged from 0.4 s for the LIN UE to 90.4 s for NN NUE on a Dell Mini Tower Computer, OptiPlex 990 with four Intel Core i5-2400 CPU at 3.10 GHz, 16 GB of RAM.
One of the purposes of this toolbox is to provide a common framework for all the researchers interested in the application of Transfer Entropy to their data. As part of this effort, MuTE will soon be interfaced with the toolbox TRENTOOL [22]. Readers and users are invited to check periodically the webpages of both toolboxes, that will announce when this interface has been set up.

Simulated data
The first set of simulated data, implemented to validate the simplest approach to TE, BIN UE, consists of two coupled chaotic maps: 1 zd e n X 2,n~( 1{C 1 )(1{bb 2 2 )zC 1 (1{bb 2 1 )zd e n where C 1~0 :2 is the coupling coefficient according to which X 1 is influencing X 2 , b 1~D mean(x 1,n{1 )D, b 2~D mean(x 2,n{1 )D, b~1:8, d~0:03 is the coefficient that regulates the noise and e is a Gaussian noise [23]. The function generating these data is multichaoticmap available in the folder /MuTE/commonFunctions.

Electroencephalogram in epilepsy
The second experiment is performed on intracranial electroencephalography (EEG) measurements recorded from a patient with refractory epilepsy. The dataset consists of time series from 76 contacts. The first sixty-four of these contacts were placed on a 868 grid at the cortical level, while the other 12 were along two six-electrode strips that were implanted in deeper brain structures. Eight sets of measurements were taken on this patient, corresponding to eight different epileptic seizures. An epileptolo-gist, examining the data for each seizure, identified two key periods relating to the seizure i.e., a pre-ictal period, just before the clinical onset, and an ictal one, corresponding to the seizure spread and to the clinical symptoms. Each epoch contained 10 seconds of data recorded at 400 Hz. The data are available at http://math. bu.edu/people/kolaczyk/datasets.html and described in [25]. In order to reduce overfitting, in this application data were downsampled to 100 Hz.

Cardiovascular and Cardiorespiratory time series
We considered cardiorespiratory time series measured from 15 young healthy subjects (25:7+2:7 years old) undergoing a standard head-up tilt testing protocol [26]. The acquired signals were the surface electrocardiogram (ECG), the finger arterial blood pressure, and the respiratory nasal flow, measured at 1 kHz sampling rate for 15 minutes in the resting supine position, and 15 further minutes in the 60 0 position after passive head-up tilting of the bed table. From these signals, the beat-to-beat variability series of heart period (RR interval), RR(n), systolic arterial pressure (SAP), Sap(n), and respiratory activity, Resp(n), were offline measured respectively as the temporal interval occurring between the n-th and the (nz1)-th R waves of the ECG, as the local maximum of the systolic arterial pressure signal inside the n-th heartbeat, and as the nasal flow taken at the onset of the n-th heartbeat. The time series are available in the folder /MuTE/ cardiovascular_data. This measurement convention allows instantaneous effects from Sap(n) to RR(n), as well as from Resp(n) to Sap(n) and to RR(n), which were implemented using the relevant Table 2. Example of the parameters required to define the methods for an experiment on 5 variables. In the second column the instantaneous effects are neglected both for targets and conditioning. In the third column we set instantaneous effects for some drivers and the respective targets. For example, when the target is 1, instantaneous effects are taken into account for driver 2 (first two rows, right column, parameter idDrivers) and conditioning variable 3 (first row, right column, parameter idOtherLagZero). doi:10.1371/journal.pone.0109462.t002 feature of the toolbox. The subsequent data analysis was performed on stationary windows of 300 beats taken in supine and upright body positions; inside these windows, the series were normalized to zero mean and unit variance, obtaining the dimensionless series resp(n), sap(n), rr(n).

Simulated data
The aim of testing the BIN UE approach on the coupled maps of eq. 12 was to show a simple case of applicability for this method, which constitutes the most basic approach to the model-free evaluation of TE. We generated 100 realizations of eq. 12, each of 512 points, and performed the analysis setting 1 as maximum lag for the candidates, 100 surrogates, a~0:05 and 6 quantization levels. As we can see in figure 2, the method detected correctly the information transfer returning 100 significant realizations for the link X 1 ?X 2 and an average TE much higher than the average TE for the link X 2 ?X 1 by means of the detection of only 2 significant realizations over 100. We tested also the other methods, which gave similar positive results as the BIN UE, thus demonstrating the applicability of the toolbox for simulations of bivariate systems with short memory.
Then we moved to a more challenging situation in terms of number of interacting systems and lag of the interaction effects, considering the time series simulated with equations 13 and 14, which involve five systems and contain influences up to 3 points in the past. The experiments were run on 100 realizations of eqs. 13 and 14, of length equal to 512 points. We investigated the TE between each pair of variables conditioned to the other three. The setup of the experiment was the following: for all estimators, used either in the UE or in the NUE framework, the maximum lag for the candidates was set as 5, the number of surrogates was fixed to 100 and a~0:05. We set 6 quantization levels for BIN and 10 nearest neighbor for NN estimator.
In order to check whether the methods were able to detect the right information transfers, taking into account figure 1, we expect the estimators to find a TE greater than zero with the highest significance at the following matrix elements: (1,2), (1,3), (1,4), (4,5), (5,4). Figures 3 and 4 report the analysis results obtained respectively for the linear system and the non-linear system. Looking at Figure 3 one can notice that LIN UE has very good performances: this reflects the fact that this approach is, in this case of a linear AR system, ''by construction'', the most likely to correctly detect information transfer. Its NUE version can detect the same links between the variables, though with a slightly higher number of false positives. The LIN estimator, therefore, is able to reveal the correct information flows for this simulation. On the contrary, BIN UE suffers from the curse of dimensionality mentioned in Entropy estimators section: evaluating the influences up to the first 5 past points for all the series implies that the uniform embedding procedure projects the data into a phase space of M|5 dimensions, where M is the number of time series, resulting in a phase space with 25 dimensions, with the points spread enough to lose relevant information about the transfer entropies in the system. As a consequence, no significant link is retrieved with this approach. NN UE retrieves all the true links, but also detects a number of false interactions. Its better performance compared with BIN UE reflects the ability of the nearest neighbor approach to achieve bias compensation in the estimation of entropies of variables of different dimension. Still, the performance of NN UE is not optimal due to the curse of dimensionality. On the other hand, BIN and NN used in the NUE framework are able to recover all the correct links, with only a few false positives. Moving to Figure 4 depicting TE analysis for the non-linear systems, one can notice that the LIN estimator cannot detect all the correct information flows, returning in addition some false positives. Again, BIN UE cannot detect any link because of the curse of dimensionality; conversely BIN NUE, in which the dimensionality of the space is considerably reduced, has high specificity and sensitivity. NN NUE can achieve almost the same performance as BIN NUE but its specificity is lower, especially along the direction X 2 ?X 4 . NN UE this time is not able to detect all the correct information transfers (X 5 ?X 4 remains undetected) and reveals some false positives (X 2 ?X 4 , X 2 ?X 5 ).
To better clarify whether and how much the methods are able to distinguish between the true information transfer links and the false ones, in Figures 5 and 6 we plotted the average TE with respect to the number of significant realizations found by the methods. Each retrieved link is a point in this bidimensional space. The true links should be in the upper right corner of the plot corresponding to high TE and high number of significant realizations, and they should be apart from the false links, whose natural location would be around the origin of the plot (low TE and low number of significant links). Looking at figure 5 one can notice that for all the methods, except BIN UE and partly NN UE, the two groups of links are well separated and the false links with an averaged TE greater than zero in figure 3 can be neglected. The opposite reasoning holds for BIN UE that is not able to distinguish between false and true links. For the non-linear system (figure 6) only BIN NUE can separate well true positive from false positive links.
To understand how stable the performance of the methods is, in terms of sensitivity and the specificity, with respect to the length of the analyzed data set, we computed the analysis varying the series length from 128 to 1024 points. Figures 7 and 8 depict, respectively for the systems 13 and 14, the Receiver Operating Characteristic (ROC) curves obtained for all methods as a function of the series length. Evaluating the amount of TP (true positives), TN (true negatives), FP (false positives) and FN (false negatives) after grouping together all coupled directions (positives) and all uncoupled directions (negatives), we computed sensitivity as TP=(TPzFN) and specificity as TN=(TNzFP). In the case of the linear system (Figure 7), all methods except the BIN UE provide good performance, with the LIN estimator providing the best sensitivity and specificity. All methods provided robust results with respect to the series length, with only a limited decay in the performance observed for 128 points. In the case of the non-linear system (Figure 8), the performance was optimal for BIN NUE and NN NUE (with a slightly lower specificity), while the methods implementing either the LIN estimator or the UE approach were considerably less sensitive.

Electroencephalogram in epilepsy
In such high dimensional and redundant data, a non-uniform embedding approach is intuitively the most appropriate to identify the patterns of information transfer specific to the onset and spread of the epileptic seizure. The aim of the experiment was to use the NUE approach in order to characterize the dynamical interactions in the epileptic brain by looking at the information transfer between the variables during the pre-ictal and ictal phases. The embedding size in the embedding matrix (EM) was set to eight. The results are reported in figure 9. The regions corresponding to one of the depth strips (contacts 70 to 76) and the lower left corner of the grid (contacts 1-4, 9-11 and 17) were resected during anterior temporal lobectomy as they were identified by the epileptologists as the seizure onset zone. The Binning approach to NUE seems to be the one which best identifies these areas as those most influential at the start of the seizure and in the early phases of the spread, signature of a putative seizure onset zone. The Binning  approach is more selective with respect to the target variables for each driver and less sensitive to the confounding effect of volume conduction resulting in the diagonal patterns observed with the other methods and probably due to conduction effects on the grid.

Cardiovascular data
The analysis of the information transfer for cardiovascular and cardiorespiratory time series was focused on the directions of interaction that are more studied from a physiological point of view: the link from SAP to RR which is related to the so-called cardiac baroreflex, and the links originating from Resp and directed either to RR or to SAP, related respectively to cardiopulmonary or vasculo-pulmonary regulation mechanisms [26]. The particular protocol considered allows to establish a sort of verifiable ground truth. Indeed, in the studied protocol, the transition from supine to upright is known to evoke an activation of the sympathetic nervous system and a concurrent deactivation of the parasympathetic nervous system [27]. Accordingly, the two main physiological regulation mechanisms that are expected to be solicited by this transition are: (i) a substantial increase of the baroreflex regulation (direction sap?rr), reflecting the necessity of the cardiovascular system to react with changes in the heart rate to the higher fluctuations in the arterial pressure induced by the sympathetic activation; and (ii) a substantial decrease of cardiopulmonary regulation (direction resp?rr), reflecting the dampening of respiratory sinus arrhythmia consequent to the parasymphatetic deactivation [28]. On the contrary, no known alterations of the vasculo-pulmonary regulation (direction resp?sap) are expected when moving from supine to upright [26]. In our analysis all these trends are well reflected in terms of information transfer  when the multivariate TE is estimated using the BIN NUE and NN NUE methods. Figure 10 reports the distribution of the multivariate TE computed along these directions using all methods, with subjects studied in the supine and upright body positions. We observe in Figure 10 that BIN NUE and NN NUE reveal, moving from supine to upright, a substantial increase of the TE from Sap to RR, a substantial decrease of the TE from Resp to RR, and an unchanged TE from resp to SAP. These trends were also observed, though with less evident differences, computing the TE according to the LIN estimator. These results suggest the appropriateness of model free TE estimators based on NUE for detecting the information transfer in physiological time series. On the contrary, the BIN UE estimator shows different trends of difficult physiological interpretation, thus suggesting also in experimental data that the estimated TE may be unreliable due to the curse of dimensionality.

Conclusions
In this work we have considered three entropy estimators able to reveal the information transferred among variables represented by time series. We implemented the estimators in two different ways according to UE and NUE approaches, resulting in six methods, two of which are novel, BIN NUE and NN NUE. We compared all the methods validating them on simulated data first and then on real data. We checked whether and how the methods were affected by the number of variables and by the time lag at which the series influenced each other. From the results obtained we can conclude that the new methods introduced, not assuming any  model to explain the data and exploiting the NUE strategy for component selection, can detect the correct information flows and are less affected by the number of involved processes and by their interaction lags. The NUE approaches are indeed prone to work in high dimensional spaces as well as in low dimensional spaces because of their ability to reduce the effective dimension of the phase space, choosing only the right variables at the specific time lag that are better able to explain the destination series. On the contrary, BIN UE and NN UE suffer from the curse of dimensionality when several time series and longer interaction delays are present. Finally, looking at LIN UE and LIN NUE performances we can conclude that, even though the equivalence Figure 9. TE matrices for human EEG recordings. Matrices of Transfer Entropy among the 76 intracranial contacts implanted in an epileptic subject. Contacts 1 to 64 belong to a cortical grid, contacts 65 to 76 to two strips implanted in deeper structures. Transfer Entropy values are obtained with three approaches to non-uniform embedding considering ten seconds of brain activity in the pre-ictal phase (top panels) and ictal phase (bottom panels). The color scale reflects Transfer Entropy values, the shading is inversely proportional to the significance: brighter colors correspond to more significant values. doi:10.1371/journal.pone.0109462.g009 between Granger causality and TE establishes a convenient joint framework for these two measures, there are some drawbacks in having a predefined model to explain the data when these are nonlinear. The better performances obtained by the new methods appear when looking at the ROC curves: BIN NUE and NN NUE have high sensitivity and specificity both for linear and non-linear systems.
All the methods have been implemented in an organic toolbox in MATLAB, allowing straightforward comparisons between the methods, and flexible enough to allow other users to implement their own methods.