Determining Causal Skeletons with Information Theory

Modeling a causal association as arising from a communication process between cause and effect simplifies the discovery of causal skeletons. The communication channels enabling these processes are fully characterized by stochastic tensors, which allows us to use linear algebra. This tensor-based approach reduces the dimensionality of the data needed to test for conditional independence; e.g., for systems comprising three variables, pair-wise determined tensors suffice to infer the causal skeleton. The only thing needed is a minor extension to information theory, namely the concept of path information.

The gold standard for causal inference is experimentation. Deliberately changing one variable while keeping all other variables constant tests for three necessary conditions of a causal association: temporal precedence of the cause over the effect, the existence of a physical influence, and the distinction between an apparent direct association and a "real" direct association [1]. When experiments, or interventions, are not possible, other methods are needed to test whether these conditions are met. An important class of methods uses graphical models that reflect the statistical (conditional) independence relations found in observational data [2,3]. If the data are causally sufficient, i.e., there are no unobserved variables, causal associations can be inferred if the Markov property is applicable [2]. The vertices in the related graph represent the variables; the directed edges represent the existence and directionality of the causal associations. For a graph to be causal, the faithfulness assumption [2] must be satisfied. This assumption entails that the edges in a probabilistic graphical model reflect dependencies in the data that do not arise as a consequence of other pathways. For example, in a system comprising three variables, independence of X and Z given Y, denoted as X ⊥⊥ Z | Y, implies that the causal structure is a cascade. The causal skeleton, i.e., the causal graph with undirected edges, equals X−Y−Z. When testing for conditional independence, a multivariate approach is therefore needed; it is deemed impossible to infer a causal skeleton using bivariate measures. An issue with multivariate approaches is the "curse of dimensionality" [4].
In this letter, we summarize our work that potentially minimizes the impact of the curse of dimensionality and enables us, in some cases, to use bivariate analysis to discover the causal skeleton of multivariate systems. Our approach is based on information theory [5]. An implicit assumption of this theory of communication is the existence of a mechanism that enables repeatability in communication, which suggests that information theory could be used to infer causality. This idea is not new; see, for example, [6,7]. Mutual information, the information-theoretic measure of association between random variables [8], arises from data transmission over a noisy discrete memoryless communication channel, or channel for short. In our approach, an edge is modeled as a channel. If the maximal amount of information that a channel can transfer, the so-called channel capacity [8], equals zero, no direct causal relation can exist between the input and output of the channel, and the edge is not shown in the graph. Using an additional measure of association, path-based mutual information, or path information for short, we show that for a system comprising three variables, pair-wise determined measures can differentiate between direct and indirect associations. Because of the symmetry of mutual information, directionality cannot follow from information theory, i.e., we can only use it to discover the causal skeleton. If one so wishes, one could use the fact that dependence between otherwise independent causes is induced when conditioning on the collider of a v-structure, e.g., X and Z become dependent given Y for the v-structure X → Y ← Z. See, for example, the "Fast Causal Inference" (FCI) method described in [2]. Here we focus on the discovery of the causal skeleton. The assumptions underlying our approach are the same as those used in information theory: stationarity and ergodicity of the data [5]. We furthermore assume that the systems under consideration are noisy.
A foundational aspect of our approach is the earlier mentioned discrete memoryless channel. This channel transforms the probability mass function of the input data into the probability mass function of the output data via a linear transformation. In a memoryless channel, the output depends solely on its input, i.e., it encodes the Markov property. The linear map is the channel-specific transition probability matrix [8]. The data are represented as discrete random variables and denoted with uppercase letters, e.g., X and Y. The related realizations are represented by the corresponding lowercase letters, i.e., x and y. Assuming a fixed, e.g., lexicographical, ordering of the related sample spaces or alphabets, a one-to-one relationship exists between the realizations x and y and their positions in the respective alphabets. When using contravariant and covariant notation, i.e., superscript and subscript notation, it is immediately clear from the equation whether to interpret x and y as indices or as realizations. With $p^x$ the probability that X = x, i.e., p(x), $p^y$ the probability p(y), and $A^y_x$ the transition probability p(y|x), the channel transforms an input probability mass function into an output probability mass function according to

$$p^y = \sum_x p^x A^y_x. \qquad (1)$$

From now on, we use the Einstein summation convention, i.e., summation is implied over indices that appear both as contravariant and covariant indices. Equation (1) can therefore be written as

$$p^y = p^x A^y_x.$$

When the input and/or output alphabets are transformed via a linear transformation, the transition probability matrix, represented using the contravariant and covariant notation, transforms as a proper tensor [9]. We denote a tensor with calligraphic font, e.g., $\mathcal{A}$. Let us now consider a cascade comprising three variables, the chain X → Y → Z. As we know, the mutual information for any pair of random variables, e.g., X and Z, equals

$$I(X;Z) = \sum_{x,z} p(x,z) \log \frac{p(x,z)}{p(x)\,p(z)}.$$

Because the mediator Y is not an explicit variable in this equation, it is unknown how Y influences I(X; Z) in this chain.
The resulting uncertainty is reflected in the data processing inequality, $I(X;Z) \le \min[I(X;Y), I(Y;Z)]$, where I(X; Y) and I(Y; Z) are the mutual information values of the constituent direct associations of the chain. There is equality if one or more associations are due to a noiseless channel [8]. With observational data more information is available, allowing us to be more specific: each association is represented by a tensor, and these tensors contain the transition probabilities that can be determined from the data. We now make a small sidestep and introduce some new terminology and notation. The tensor representing the association between the beginning and the end of this chain is referred to as "the tensor of the path {X}{Y}{Z}". A path with one or more mediators is called an indirect path, while a path without mediators is a so-called direct path. Using the tensors of the constituent direct paths {X}{Y} and {Y}{Z}, denoted as $A^y_x = p(y|x)$ and $B^z_y = p(z|y)$ respectively, and the associated expressions for the linear transformations analogous to Equation (1), it follows (see Appendix A) that for the chain

$$C^z_x = A^y_x B^z_y,$$

where $C^z_x = p(z|x)$. The mutual information resulting from transmission along the path therefore equals

$$I(X;Z)_{p_{in}} = \sum_{x,z} p^{xz} \log \frac{A^y_x B^z_y}{p^z},$$

with $p^z = p(z)$ and $p^{xz} = p(x, z)$. The index $p_{in}$ indicates that the probability distributions are associated with the indirect path {X}{Y}{Z}. The dynamics within the chain involving the mediator Y are now captured by the matrix product $A^y_x B^z_y$. If the three variables in the chain X → Y → Z are the only three variables in the system, then $I(X;Z)_{p_{in}} = I(X;Z)$. In that case, the direct association between X and Z is a consequence of the indirect path, so an edge between X and Z should be removed; in other words, the graph needs to be pruned.
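As a minimal numerical sketch of the chain X → Y → Z, the following Python fragment composes two hypothetical binary channels into the tensor of the indirect path and evaluates the resulting path information. The function names, the example matrices, the input distribution, and the use of base-2 logarithms (bits) are assumptions made for illustration; they are not taken from the implementation used in this letter.

```python
from math import log2


def compose(a, b):
    """Tensor of the indirect path: c[x][z] = sum_y a[x][y] * b[y][z]."""
    return [[sum(a[x][y] * b[y][z] for y in range(len(b)))
             for z in range(len(b[0]))]
            for x in range(len(a))]


def push_forward(p_in, t):
    """Output distribution p^z = sum_x p^x t[x][z] of a channel t."""
    return [sum(p_in[x] * t[x][z] for x in range(len(p_in)))
            for z in range(len(t[0]))]


def information(p_in, t):
    """sum_{x,z} p^{xz} log2(t[x][z] / p^z), with p^{xz} = p^x t[x][z]."""
    p_out = push_forward(p_in, t)
    total = 0.0
    for x, px in enumerate(p_in):
        for z, pz in enumerate(p_out):
            joint = px * t[x][z]
            if joint > 0.0:  # zero-probability pairs contribute nothing
                total += joint * log2(t[x][z] / pz)
    return total


# Hypothetical channels: rows are indexed by the input, columns by the output.
A = [[0.9, 0.1], [0.2, 0.8]]   # p(y|x)
B = [[0.7, 0.3], [0.1, 0.9]]   # p(z|y)
p_x = [0.3, 0.7]               # assumed input distribution

C = compose(A, B)                              # tensor of the path {X}{Y}{Z}
path_info = information(p_x, C)                # path information in bits
i_xy = information(p_x, A)                     # I(X; Y)
i_yz = information(push_forward(p_x, A), B)    # I(Y; Z)
```

For these noisy example channels, the path information stays below both I(X; Y) and I(Y; Z), in line with the data processing inequality.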
In Appendix B it is proven that the existence of an additional association between X and Z, mediated by a fourth variable, does not violate the validity of Equation (3), but that the path information no longer equals the mutual information: $I(X;Z)_{p_{in}} \neq I(X;Z)$. Instead of comparing the information measures, we use the tensors. Inequality of the path information and the mutual information implies that $\mathcal{T}_{\{X\}\{Z\}}$, the tensor of the direct path {X}{Z}, does not equal the tensor of the indirect path: $T^z_x \neq A^y_x B^z_y$. Therefore, under the assumption of causal sufficiency, comparing the tensors of the direct and indirect paths makes it possible to differentiate between direct and indirect associations using pair-wise determined tensors for a system comprising three variables. It is of course possible that the tensor of the direct association equals the tensor of the indirect association purely by chance. Using the simplified probabilistic model from Appendix C, it can be shown, however, that this probability decreases with increasing cardinalities of the alphabets because it scales with $10^{-N(M-1)}$. The variables $N \in \mathbb{N}$ and $M \in \mathbb{N}$ represent the cardinalities of the input and output alphabets, respectively.
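The tensor comparison can be sketched in a few lines of Python. The element-wise tolerance `tol`, the function names, and the example tensors are hypothetical; with estimated tensors, a statistical criterion such as confidence intervals would take the place of the fixed tolerance assumed here.

```python
def compose(a, b):
    """Tensor of the indirect path: c[x][z] = sum_y a[x][y] * b[y][z]."""
    return [[sum(a[x][y] * b[y][z] for y in range(len(b)))
             for z in range(len(b[0]))]
            for x in range(len(a))]


def explained_by_indirect_path(t_direct, a, b, tol=0.02):
    """True if t_direct matches the indirect-path tensor element-wise."""
    t_indirect = compose(a, b)
    return all(abs(d - i) <= tol
               for row_d, row_i in zip(t_direct, t_indirect)
               for d, i in zip(row_d, row_i))


# Hypothetical tensors of the direct paths {X}{Y} and {Y}{Z}.
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.7, 0.3], [0.1, 0.9]]

# Two hypothetical candidates for the tensor of the direct path {X}{Z}.
T_chain = [[0.63, 0.37], [0.23, 0.77]]   # close to the product of A and B
T_extra = [[0.80, 0.20], [0.30, 0.70]]   # not reproduced by the product

prune_chain = explained_by_indirect_path(T_chain, A, B)  # edge X-Z is indirect
prune_extra = explained_by_indirect_path(T_extra, A, B)  # edge X-Z is kept
```

In the first case the edge between X and Z would be pruned; in the second case it would be retained as a direct association.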
The approach is also applicable to forks. Consider for example the fork Y ← X → Z. The indirect association between Y and Z is a consequence of either the path {Y}{X}{Z} or the path {Z}{X}{Y}. Because Equation (1) expresses the law of total probability [10], a stochastic tensor always exists for a path, irrespective of the direction in which this path is traversed. We use the following naming convention: if the tensor of the path {X}{Y} is denoted as $\mathcal{A}$, then the stochastic tensor associated with the path {Y}{X} is denoted as $\mathcal{A}^{\ddagger}$, i.e., $p^x = p^y (A^{\ddagger})^x_y$. The tensor $\mathcal{A}^{\ddagger}$ is not the inverse of $\mathcal{A}$, because the inverse of a stochastic tensor is in general not a stochastic tensor itself and therefore does not represent a channel. This naming convention allows us to write the linear transformations associated with the two paths as $p^y (A^{\ddagger})^x_y C^z_x = p^z$ and $p^z (C^{\ddagger})^x_z A^y_x = p^y$, respectively. Because a fork can be considered a chain, and the path information is independent of how the chain is traversed [11], $I(Y;Z)_{p_{in}} = I(Z;Y)_{p_{in}}$, where $p_{in}$ indicates that the probability distributions are associated with the path {Y}{X}{Z}, which are identical to the probability distributions associated with the path {Z}{X}{Y}. Like mutual information, path information cannot be used to infer directionality either.
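A minimal sketch of the reversed tensor, assuming it is obtained from the forward tensor and the input distribution via Bayes' rule, p(x|y) = p(x) p(y|x) / p(y). The function names and the numerical values are hypothetical and chosen only for illustration.

```python
def push_forward(p_in, t):
    """Output distribution of a channel t[input][output] for input p_in."""
    return [sum(p_in[i] * t[i][o] for i in range(len(p_in)))
            for o in range(len(t[0]))]


def reverse_tensor(p_in, a):
    """Return a_rev[y][x] = p(x|y) = p^x a[x][y] / p^y (Bayes' rule)."""
    p_out = push_forward(p_in, a)
    # Assumes that every output symbol has non-zero probability.
    return [[p_in[x] * a[x][y] / p_out[y] for x in range(len(p_in))]
            for y in range(len(p_out))]


A = [[0.9, 0.1], [0.2, 0.8]]   # hypothetical tensor of the path {X}{Y}
p_x = [0.3, 0.7]               # assumed input distribution

p_y = push_forward(p_x, A)
A_rev = reverse_tensor(p_x, A)        # tensor of the path {Y}{X}
p_x_back = push_forward(p_y, A_rev)   # recovers p_x

# Traversing A and then A_rev does not give the identity matrix,
# so A_rev is not the matrix inverse of A.
round_trip = [[sum(A[x][y] * A_rev[y][k] for y in range(2)) for k in range(2)]
              for x in range(2)]
```

The reversed tensor is row-stochastic and maps the output distribution back onto the input distribution, while `round_trip` differs clearly from the identity matrix.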
Extending the approach to systems comprising more than three variables is straightforward because a mediator, say S, could also be a set of variables. If the tensor of the path {X}{Z} equals the tensor of the path {X}{S}{Z}, the association between X and Z is indirect. The tensors of the paths from or towards S are multivariate. However, such a multivariate pruning step is not needed if all but one of the indirect paths have channel capacities that do not differ significantly from zero.
The concepts discussed allow us to discover the causal skeleton in three steps. First, using the observational data, an undirected graph is inferred in which each edge represents a channel capable of transferring information, i.e., only edges whose channel capacities differ significantly from zero are shown. The channel capacity is calculated using the Blahut-Arimoto algorithm, which takes a transition probability matrix as input [12]. In the second step, the resulting undirected graph is pruned using the bivariate tensors that were determined in the first step. The pruned graph is the input for the third step. In this step, the channel capacity of each indirect path is calculated for any start and end point connected by a direct path and two or more indirect paths. If two or more indirect paths have a channel capacity larger than zero, the multivariate pruning step is performed. The output of the third step is the final causal skeleton.
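The capacity calculation in the first step can be sketched with a textbook form of the Blahut-Arimoto iteration. This is a minimal illustration, not the implementation used for the results reported here; the function name, the stopping tolerance, and the iteration limit are assumptions.

```python
from math import exp, log


def channel_capacity(channel, tol=1e-9, max_iter=10000):
    """Capacity in bits of a discrete memoryless channel.

    channel[x][y] holds the transition probability p(y|x).
    """
    n_in, n_out = len(channel), len(channel[0])
    r = [1.0 / n_in] * n_in  # start from the uniform input distribution
    lower = 0.0
    for _ in range(max_iter):
        # Output distribution induced by the current input distribution.
        q = [sum(r[x] * channel[x][y] for x in range(n_in))
             for y in range(n_out)]
        # d[x]: Kullback-Leibler divergence (nats) between row x and q.
        d = [sum(p * log(p / q[y])
                 for y, p in enumerate(channel[x]) if p > 0.0)
             for x in range(n_in)]
        weights = [r[x] * exp(d[x]) for x in range(n_in)]
        norm = sum(weights)
        lower, upper = log(norm), max(d)  # bounds on the capacity (nats)
        r = [w / norm for w in weights]   # multiplicative update of the input
        if upper - lower < tol:
            break
    return max(lower, 0.0) / log(2.0)


bsc = [[0.9, 0.1], [0.1, 0.9]]        # binary symmetric channel, crossover 0.1
useless = [[0.5, 0.5], [0.5, 0.5]]    # output independent of the input

c_bsc = channel_capacity(bsc)          # about 0.53 bits
c_useless = channel_capacity(useless)  # zero: such an edge would not be drawn
```

With estimated tensors, the test of whether a capacity "differs significantly from zero" needs a statistical criterion in addition to this point estimate.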
We applied our approach to a toy data set with a well-defined ground truth, the "LUng CAncer Simple set", LUCAS0 [13]. The data in this set were generated artificially by causal Bayesian networks with binary variables and comprise 12 binary variables with 2000 samples each. To infer the skeleton, the probability transition tensors and their 95% confidence intervals were determined for all pairs of variables. For the latter, Jeffreys interval estimation for a binomial proportion was used [14]. The calculations were performed on a 2010 Mac mini with a 2.4 GHz Intel Core 2 Duo processor and 8 GB of RAM. The proposed framework was implemented in MATLAB R2018b. The processing time was 1.9 s.
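As an illustration of how a pair-wise transition probability tensor can be obtained from samples, the following Python sketch uses plain relative frequencies. It is not the MATLAB implementation mentioned above; the function name and the toy samples are hypothetical, and the confidence intervals are not computed here.

```python
from collections import Counter


def estimate_tensor(xs, ys, n_x, n_y):
    """Estimate t[x][y] = p(y|x) from paired samples by counting.

    xs and ys hold the symbol indices (0 .. n_x-1 and 0 .. n_y-1) of
    simultaneously observed realizations of X and Y.
    """
    counts = Counter(zip(xs, ys))
    tensor = []
    for x in range(n_x):
        row_total = sum(counts[(x, y)] for y in range(n_y))
        if row_total == 0:
            raise ValueError(f"input symbol {x} was never observed")
        tensor.append([counts[(x, y)] / row_total for y in range(n_y)])
    return tensor


# Hypothetical paired binary samples.
xs = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

T = estimate_tensor(xs, ys, n_x=2, n_y=2)
```

Each row of the estimate is a conditional distribution and therefore sums to one, so the result can serve directly as the input of a capacity calculation or a tensor comparison.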
In Figure 1, the result of the bivariate pruning step is depicted. After this pruning step, which results in the solid edges, the direct association between {Lung Cancer} and {Fatigue} is the only candidate for which a multivariate pruning step could be needed, because it is the only direct association that could arise as a result of two indirect paths. However, the channel capacity of the indirect path {Lung Cancer}{Genetics}{Attention Disorder}{Fatigue} does not differ significantly from zero. Because the tensor of the direct association does not equal the tensor of the second indirect path, {Lung Cancer}{Coughing}{Fatigue}, either, no multivariate pruning step is needed. The inferred causal skeleton should therefore be equal to the skeleton of the ground truth, which is indeed the case.

To conclude, with the addition of path information, causal skeletons are a consequence of information-theoretic considerations. Because directionality can be inferred from the causal skeleton using aspects of the earlier mentioned FCI method, one could argue that information theory is inherently a theory of causation. It furthermore follows from the data processing inequality that in a noisy system the channel capacity of a cascade decreases with increasing path length. We therefore speculate that for larger, noisy systems, the path-based method is less sensitive to the curse of dimensionality, as illustrated by our example, where bivariate analysis sufficed. This will be the subject of future research. It is interesting to note that our approach is also applicable to time series. For example, transfer entropy [15] is proven to result from transmission over a set of discrete memoryless channels with associated tensors [16]. In this case, directionality is also inferred.
Funding: This research received no external funding.

Acknowledgments:
We acknowledge the support of Hans Onvlee, ASML Research.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proof of Matrix Product
Consider the chain X → Y → Z and the transition probability matrices $A^y_x = p(y|x)$, $B^z_y = p(z|y)$, and $C^z_x = p(z|x)$. For this chain, $C^z_x = A^y_x B^z_y$.

Proof. Consider the linear transformation $p^z = p^y B^z_y$ related to the association Y → Z. The input for this linear transformation, $p^y$, is itself the result of the linear transformation related to the association X → Y: $p^y = p^x A^y_x$. Substituting the latter expression in the linear transformation $p^z = p^y B^z_y$ results in the expression

$$p^z = p^x A^y_x B^z_y.$$

Because the linear transformation related to the association X → Z equals $p^z = p^x C^z_x$, it follows that for the chain $C^z_x = A^y_x B^z_y$.

Because the transition probability matrices $A^y_x = p(y|x)$ and $B^z_y = p(z|y)$ are associated with the direct paths {X}{Y} and {Y}{Z} respectively, the transition probability matrix $A^y_x B^z_y$ is associated with the indirect path {X}{Y}{Z}. This is used in Appendix B.

Appendix B. Proof of Inequality Path Information and Mutual Information in Case of Multiple Indirect Paths
Consider a system comprising the chains X → Y → Z and X → U → Z.

Appendix C. Simple Probabilistic Model for Coincidental Equality of Direct and Indirect Associations
Let us assume that the tensor for the indirect path is a stochastic N×M tensor, where $N \in \mathbb{N}$ and $M \in \mathbb{N}$ represent the cardinalities of the input and output alphabets, respectively. Per element of this tensor, an associated sample space is determined, consisting of integers ≥ 1. The cardinality of the sample space is equal to the product of the tensor element and $10^{\varsigma}$, where $\varsigma \in \mathbb{N}$ represents the number of significant digits. Finally, a new N×M transition probability tensor is constructed by selecting, per tensor element, a number at random from the associated sample space and multiplying it by $10^{-\varsigma}$. The probability that this newly constructed stochastic tensor equals the tensor of the indirect association, $p(\varsigma, N, M)$, scales with $10^{-\varsigma \cdot N(M-1)}$. For binary data, this probability is small for $\varsigma \ge 2$, e.g., $p(2, 2, 2) \approx 10^{-4}$. The probability approaches zero rapidly with increasing cardinalities. The probability that the tensor of the direct association is equal to the tensor of the indirect association purely by chance is therefore negligible.
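The scaling can be evaluated for a few alphabet sizes with a short Python helper. The function name is hypothetical, and the helper returns only the exponent of the scaling term; any prefactor of the probability is left out because it is not specified above.

```python
def chance_equality_exponent(sig_digits, n_in, n_out):
    """Exponent e such that the probability scales with 10**e.

    sig_digits is the number of significant digits; n_in and n_out are
    the cardinalities N and M of the input and output alphabets.
    """
    return -sig_digits * n_in * (n_out - 1)


binary = chance_equality_exponent(2, 2, 2)    # exponent for p(2, 2, 2)
ternary = chance_equality_exponent(2, 3, 3)   # exponent for p(2, 3, 3)
```

For binary alphabets and two significant digits the exponent is -4, which matches the example above; for ternary alphabets it already drops to -12.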