The entropic approach to causal correlations

The existence of a global causal order between events places constraints on the correlations that parties may share. Such "causal correlations" have been the focus of recent attention, driven by the realization that some extensions of quantum mechanics may violate so-called causal inequalities. In this paper we study causal correlations from an entropic perspective, and we show how to use this framework to derive entropic causal inequalities. We consider two different ways to derive such inequalities. Firstly, we consider a method based on the causal Bayesian networks describing the causal relations between the parties. In contrast to the Bell-nonlocality scenario, where this method has previously been shown to be ineffective, we show that it leads to several interesting entropic causal inequalities. Secondly, we consider an alternative method based on counterfactual variables that has previously been used to derive entropic Bell inequalities. We compare the inequalities obtained via these two methods and discuss their violation by noncausal correlations. As an application of our approach, we derive bounds on the amount of information (a quantity naturally expressed in the entropic framework) that parties can communicate when operating in a definite causal order.


I. INTRODUCTION
When describing most physical phenomena it seems natural to assume that physical events take place within a well-defined causal structure. For instance, earlier events can influence later ones but not vice versa, and if two events are sufficiently distant (typically, space-like separated) from each other, any correlation between them can only be due to some common cause in their past. This intuition is formalized in Reichenbach's principle [1] and generalized by the mathematical theory of causal models [2], which forms the basis for our current understanding of how to infer causation from empirically observed correlations. Not surprisingly, this theory has found a wide range of applications [2][3][4]. Yet, quantum phenomena defy such an intuitive notion of cause and effect.
As shown by Bell's Theorem [5], quantum correlations obtained by measurements on distant entangled systems are incompatible with Reichenbach's principle [6,7] or, more generally, with classical theories of causality, forcing us to generalize the notion of causal models [8][9][10][11][12][13]. In a scenario where different experimenters interact only once with a given system that is exchanged between them, one could expect that no simultaneous causal influences between them should be possible, but rather only one-way influences. However, it has been realized that physical theories do not necessarily have to comply with the idea of a definite causal order [14,15]. One can also imagine theories where the causal order itself is in a sort of "quantum superposition" [14,16], which can be verified using so-called causal witnesses [17,18].
* These authors contributed equally to this work.
As with entanglement witnesses [19,20], the use of causal witnesses assumes that we have a precise description of the measurement apparatus; that is, they are relevant in a device-dependent framework. Nevertheless, by allowing physical theories that are locally equivalent to quantum mechanics but relax the assumption of a fixed global causal structure, it is possible to verify causal indefiniteness also in a device-independent manner. With the aim of providing a general framework for such scenarios, the process matrix formalism [14] has been introduced and shown to allow for the violation of so-called causal inequalities [14,[21][22][23][24][25], which are device-independent constraints that play a similar role to that of Bell inequalities [5]. However, whether violations of causal inequalities can be experimentally observed remains an important open question.
Our goal in this paper is to introduce a new framework for the derivation of causal inequalities and the study of their potential violations: the entropic approach to causal correlations. The idea of using entropies to understand sets of correlations has its origin in the context of Bell inequalities [26][27][28][29] but has since found various other applications in quantum contextuality [30][31][32], device-independent applications [33,34], causal inference [9,35,36] and the characterization of nonsignaling correlations [37]. As with these previous applications, the interest in characterizing the entropies compatible with causal correlations stems not only from practical and technical considerations, but also from a more fundamental reason. To begin with, causal inequalities expressed in terms of probabilities are constructed for a fixed number of inputs and outputs, and their systematic derivation becomes harder as this number increases [24,25]. In contrast, we will derive entropic causal inequalities that are valid for arbitrary finite alphabets, either for both the input and output variables or just for the output variables. Furthermore, entropic inequalities can easily be combined with extra assumptions, such as conditional independence relations or information-theoretic constraints (e.g., bounds on the amount of communication), which would be hard to treat in the probabilistic framework [35,37,38]. More fundamentally, given that entropies are a core concept in classical and quantum information theory, it is of clear relevance to have a framework that focuses on these quantities rather than on probabilities, and it may help connect causal inequalities with principles such as information causality [39].
The paper is organized as follows. In Sec. II, we will introduce the basic notions relevant for our investigation, namely, causal correlations and the entropic approach to causal structures. In Sec. III we will derive entropic causal inequalities for the bipartite scenario, and discuss their violation. In Sec. IV, we will show how this approach can be generalized to multipartite scenarios. Finally, as an application, in Sec. V we use this approach to derive bounds on communication in causal games.

A. Causal correlations
Causal correlations are most easily introduced in the bipartite case, where we consider two parties, Alice (A) and Bob (B), who together conduct a joint experiment while each having control over a separate closed laboratory. During each round of the experiment, Alice and Bob each receive, operate on, and send out a single physical system, which is the only means by which they may communicate. In addition, they each receive some (external) classical inputs X and Y, for Alice and Bob respectively, and produce some classical outputs A and B, respectively. Throughout the paper we use upper-case letters (e.g., X) to denote random variables, and corresponding lower-case letters (e.g., x) to denote the specific values they take. Their probability distributions will generically be denoted by P; we will also use the shorthand notations P(x) for P(X = x), P(x, y) for P(X = x, Y = y), P(a|x) for P(A = a|X = x), etc.
The joint conditional probability distributions P(ab|xy) that can be produced in such an experiment depend on the causal relation between Alice and Bob. If Bob cannot signal to Alice, their correlations should obey P(a|xy) = P(a|xy′) for all x, y, y′, a, where P(a|xy) = ∑_b P(ab|xy). We denote this situation by A ≺ B, and write P = P_{A≺B} in this case. Note that this does not necessarily imply that Alice is in the causal past of Bob, since the events could be space-like separated, but merely that the correlation is compatible with such a causal order. Similarly, if the correlation is compatible with Bob being in the causal past of Alice we write B ≺ A, and we have P_{B≺A}(b|xy) = P_{B≺A}(b|x′y) for all x, x′, y, b. The correlations that satisfy both these conditions (and are thus consistent both with A ≺ B and B ≺ A) are precisely the nonsignaling correlations [40].
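To make the one-way no-signaling condition concrete, here is a minimal numerical check (our own sketch, not part of the original paper; the function name and the dictionary encoding of P(ab|xy) are hypothetical conventions):

```python
def compatible_with_A_before_B(p, tol=1e-9):
    """Check the one-way no-signaling condition for A before B:
    P(a|xy) = sum_b P(ab|xy) must be independent of Bob's input y.
    Here p[(x, y)] is a dict mapping outcome pairs (a, b) to probabilities."""
    xs = {x for x, _ in p}
    ys = {y for _, y in p}
    outs = {a for d in p.values() for a, _ in d}
    for x in xs:
        for a in outs:
            margs = [sum(pr for (aa, _), pr in p[(x, y)].items() if aa == a)
                     for y in ys]
            if max(margs) - min(margs) > tol:
                return False
    return True

# Example: Bob always outputs Alice's input (communication from A to B is allowed):
p = {(x, y): {(0, x): 1.0} for x in (0, 1) for y in (0, 1)}
print(compatible_with_A_before_B(p))  # True
```

A correlation compatible with B ≺ A can be tested analogously by exchanging the roles of the two parties.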
More generally, we are interested in the correlations achievable under the assumption of a definite causal order in each round of the experiment, even if the causal relation between Alice and Bob may be different (e.g., chosen randomly) for each individual round. We thus say that a correlation P(ab|xy) is causal if it can be written as

P(ab|xy) = q_0 P_{A≺B}(ab|xy) + q_1 P_{B≺A}(ab|xy), (1)

with q_0, q_1 ∈ [0, 1] and q_0 + q_1 = 1, where P_{A≺B}(ab|xy) and P_{B≺A}(ab|xy) satisfy the respective (one-way) no-signaling conditions defined above [14]. It was shown in Ref. [24] that the set of bipartite causal correlations forms a convex polytope, whose vertices are simply the deterministic causal correlations (i.e., causal correlations for which the outputs A, B are deterministic functions of the inputs X, Y). The facets of this polytope thus specify causal inequalities, analogous to Bell inequalities for local correlations, that any causal correlation must satisfy [14]. The situation with binary input and output variables was characterized completely in [24], where it was shown that there are only two nonequivalent causal inequalities (up to symmetries). The simplest of these is perhaps the "guess your neighbor's input" (GYNI) inequality, which has a simple interpretation as a game (up to a relabeling of the inputs and outputs) in which the inputs X, Y are chosen uniformly at random and the goal is for each party to output the other party's input. One such form of this inequality can be written as [24]

(1/4) ∑_{x,y,a,b} δ_{a,y} δ_{b,x} P(ab|xy) ≤ 1/2, (2)

where δ is the Kronecker delta function. The notion of causal correlations can be generalized to more parties, although one has to take into account the fact that, in a given round of the experiment, the causal order of some parties may depend on the inputs and outputs of previous parties [23,25]. In this paper we will primarily, in Sec. III, focus on applying the entropic approach to bipartite causal correlations, before returning to the multipartite case in Sec. IV.
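As a sanity check on the GYNI game (a sketch of ours, not from the paper), one can enumerate all deterministic strategies compatible with the order A ≺ B and confirm that none exceeds a success probability of 1/2; by the symmetry of the game, the same bound holds for B ≺ A and hence for all causal correlations:

```python
import itertools

def gyni_score(p):
    """GYNI success probability (1/4) * sum_{x,y} P(a=y, b=x | xy) for binary
    variables, where p[x][y] maps outcome pairs (a, b) to probabilities."""
    return sum(p[x][y].get((y, x), 0.0) for x in (0, 1) for y in (0, 1)) / 4

best = 0.0
# Deterministic A-before-B strategies: a depends only on x, while b may depend
# on both y and x (since Alice may communicate x to Bob).
for fa in itertools.product((0, 1), repeat=2):       # a = fa[x]
    for fb in itertools.product((0, 1), repeat=4):   # b = fb[2*x + y]
        p = {x: {y: {(fa[x], fb[2 * x + y]): 1.0} for y in (0, 1)}
             for x in (0, 1)}
        best = max(best, gyni_score(p))

print(best)  # 0.5: no fixed-order strategy wins the game more than half the time
```

Since the causal polytope is the convex hull of such deterministic fixed-order strategies, the maximum over them already bounds all causal correlations.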

B. The entropic approach and marginal problems
Below we introduce the basic notions concerning entropy cones and marginal scenarios. We then review the entropic characterization of marginal scenarios [28] using two complementary methods, the first considering the entropies of the variables composing a given causal model, and the second based on the counterfactual approach to correlations.

1. Entropy cones
Let S = {X_1, . . . , X_n} be a set of n random variables taking values x_1, . . . , x_n, whose joint distribution P(x_1, . . . , x_n) we wish to characterize entropically. For every nonempty subset T ⊂ S we shall denote by X_T = (X_i)_{X_i ∈ T} the joint random variable that involves all variables in T, taking values x_T = (x_i)_{X_i ∈ T}. We can then compute the marginal Shannon entropies H(X_T) = H(T) from the marginal probability distributions P(x_T) = ∑_{x_{S\T}} P(x_1, . . . , x_n), namely

H(T) = −∑_{x_T} P(x_T) log_2 P(x_T). (3)

Together with H(∅) := 0, every global probability distribution P(x_1, . . . , x_n) thus specifies 2^n real numbers in the entropic description, which can be expressed as the components of a 2^n-dimensional vector

h = (H(∅), H(X_1), . . . , H(X_1X_2), . . . , H(X_1 . . . X_n)) (4)

in R^{2^n}.
A fundamental problem in information theory is to decide whether a given vector is an entropy vector, that is, whether it is obtainable from some probability distribution. The (closure of the) region of valid entropy vectors is known to be a convex cone Γ*_S, called the entropy cone (see Ref. [41] for a comprehensive discussion of entropy cones). There is no known explicit description of Γ*_S, so one generally has to rely on an approximation of it. A well-known and very useful outer approximation of Γ*_S is the so-called Shannon cone Γ_S, defined by the elemental inequalities

H(X_i | X_{S \ {X_i}}) ≥ 0,  I(X_i : X_j | X_T) ≥ 0, (5)

for all 1 ≤ i, j ≤ n, i ≠ j, and T ⊂ S \ {X_i, X_j}. That is, the Shannon cone Γ_S is described by a finite system of m = n + 2^{n−2} (n choose 2) linear inequalities, which one can write in the form Ih ≤ 0, where I is an m × 2^n real matrix and 0 a vector with null entries. The inequalities in Eq. (5) are the minimal set of inequalities implying the monotonicity of entropy, i.e., H(U|T) := H(TU) − H(T) ≥ 0, and the submodularity (or strong subadditivity), i.e., I(U : V|T) := H(TU) + H(TV) − H(TUV) − H(T) ≥ 0, for any subsets T, U, V ⊂ S. These inequalities and any combination thereof are known as Shannon-type inequalities. It is known that for n ≤ 3 variables every inequality delimiting the entropy cone Γ*_S is of the Shannon type; however, this is not the case for n > 3 [41].
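The construction of entropy vectors and the elemental inequalities can be illustrated with a short script (our own sketch; the helper names are arbitrary):

```python
import itertools
from math import log2

def entropy_vector(p, n):
    """Marginal Shannon entropies H(T) for every subset T of n variables.
    p maps outcome tuples (x_1, ..., x_n) to probabilities."""
    h = {}
    for r in range(n + 1):
        for T in itertools.combinations(range(n), r):
            marg = {}
            for xs, pr in p.items():
                key = tuple(xs[i] for i in T)
                marg[key] = marg.get(key, 0.0) + pr
            h[frozenset(T)] = -sum(q * log2(q) for q in marg.values() if q > 0)
    return h

def satisfies_elemental(h, n, tol=1e-9):
    """Check monotonicity H(X_i|rest) >= 0 and submodularity I(X_i:X_j|X_T) >= 0."""
    full = frozenset(range(n))
    if any(h[full] - h[full - {i}] < -tol for i in range(n)):
        return False
    for i, j in itertools.combinations(range(n), 2):
        rest = [k for k in range(n) if k not in (i, j)]
        for r in range(len(rest) + 1):
            for T in itertools.combinations(rest, r):
                T = frozenset(T)
                if h[T | {i}] + h[T | {j}] - h[T | {i, j}] - h[T] < -tol:
                    return False
    return True

# Three pairwise-independent but globally correlated bits: x3 = x1 XOR x2.
p = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
h = entropy_vector(p, 3)
print(satisfies_elemental(h, 3))  # True: every entropy vector obeys Eq. (5)
```

Conversely, a vector violating any elemental inequality cannot arise from any probability distribution.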
The inequalities characterizing the Shannon cone simply arise from demanding that the function P(x_T) appearing in (3) should be identified with a valid probability distribution (i.e., it should be nonnegative and normalized). However, one often wishes to consider (and characterize the entropy vectors for) situations where additional constraints on the random variables are known. For example, X_i and X_j might be known to be independent, which implies that P(x_i, x_j) = P(x_i)P(x_j). Such independence constraints, which are nonlinear in terms of probabilities, define simple linear constraints in terms of entropies, e.g., I(X_i : X_j) = 0, i.e., H(X_iX_j) = H(X_i) + H(X_j). These extra constraints can be easily incorporated into the entropic framework since they define a linear subspace, which we denote L_C, characterized by linear equalities. When combined with the elemental inequalities one obtains a new finite system of inequalities I′h ≤ 0 characterizing the "constrained" Shannon cone Γ_S ∩ L_C.
In some cases, one may also wish to add linear inequality constraints which, in general, may give rise to more general polyhedra described by inhomogeneous systems of linear inequalities I′h ≤ β [42]. In such cases we will again denote the set of vectors h satisfying these additional constraints by L_C; we will return to this point in more detail in Sec. III.

2. Marginal scenarios
Consider again a set of random variables {X_1, . . . , X_n} with a joint probability distribution P(x_1, . . . , x_n). We often encounter situations where not all variables, or combinations thereof, are empirically accessible. For example, our system of interest could be composed of three random variables X_1, X_2, X_3 but, for some reason, we can access at most two of them at a time, thus implying that we cannot know their joint entropy H(X_1X_2X_3). Alternatively, there might be variables that represent latent factors [2] and that, for this reason, are unobservable. In such cases, we face a marginal problem: decide whether some given information on the marginals is compatible with a global description fulfilling certain constraints (for example, the elemental entropy inequalities). In the example with three variables, it is easy to see that the elemental inequalities imply that

H(X_1X_2) + H(X_2X_3) ≥ H(X_2) + H(X_1X_3).

That is, the global structure of entropy vectors implies nontrivial constraints (which are not elemental inequalities (5)) that should be respected by any marginal information compatible with it.
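This marginal constraint on the three pairwise entropies is easy to test numerically; the following sketch (ours, with arbitrary random sampling) checks it on randomly drawn three-bit distributions:

```python
import itertools
import random
from math import log2

def H(p, T):
    """Shannon entropy of the marginal of p on the coordinate set T."""
    marg = {}
    for xs, pr in p.items():
        key = tuple(xs[i] for i in T)
        marg[key] = marg.get(key, 0.0) + pr
    return -sum(q * log2(q) for q in marg.values() if q > 0)

random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(8)]
    z = sum(w)
    p = dict(zip(itertools.product((0, 1), repeat=3), (wi / z for wi in w)))
    # Nontrivial constraint the global Shannon cone imposes on pairwise marginals:
    assert H(p, (0, 1)) + H(p, (1, 2)) >= H(p, (1,)) + H(p, (0, 2)) - 1e-9
print("constraint satisfied on all samples")
```

The inequality follows from submodularity plus monotonicity, so no sample can violate it; a violation would certify that the pairwise marginals admit no global extension.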
More formally, given a set of random variables S = {X_1, . . . , X_n}, a marginal scenario is a collection of subsets M = {M_1, . . . , M_{|M|}}, M_j ⊂ S, representing those variables for which we have access to the probability distribution P(x_{M_j}) (and thus to H(M_j)). Clearly, M_j ∈ M and M_j′ ⊂ M_j implies M_j′ ∈ M; that is, given some probability distribution we also have access to any marginal of it. In a slight abuse of notation we will therefore write M only in terms of its maximal subsets, since these are sufficient to specify the entire marginal scenario. In the example above the marginal scenario would then be represented as M = {{X_1, X_2}, {X_1, X_3}, {X_2, X_3}}. In general we are interested in characterizing the entropy cone Γ*_M associated with a marginal scenario M, thus obtaining the constraints implied by the global entropy cone on the marginal subspace of interest. Geometrically, this corresponds to the projection of the original entropy cone onto the subspace corresponding to the variables in M. Since, in practice, we work with the Shannon cone Γ_S (possibly constrained by some further linear constraints specifying a linear subspace L_C, as described previously), which is characterized by a finite system of inequalities, this projection corresponds to a simple variable elimination of all the terms not contained in M [28,43,44]. After removing redundant inequalities, the remaining inequalities are the facets (i.e., the boundaries) of the Shannon cone, or more generally polyhedron, in the observable marginal subspace. Formally, the marginal Shannon polyhedron Γ_M is defined as

Γ_M := Π_M(Γ_S ∩ L_C),

where Π_M denotes the projection onto the coordinates associated with the marginal scenario M.

3. Probability structures
The characterization of entropy cones and marginal problems outlined above can be easily extended to the case where we no longer assume that there is a well-defined global probability distribution over all the variables in the set S. Instead, we may assume that only certain subsets of variables have such a joint distribution, and that only the marginals of certain subsets of these subsets are empirically accessible. This type of restriction may be imposed by assumptions about the underlying physical theory being described, as will become clear in the example we discuss in Sec. II B 5.
We will denote the collection of subsets of S for which we assume joint probability distributions exist by S = {S_1, . . . , S_{|S|}}, with each S_i ⊂ S, such that ∪_i S_i = S; we call S the probability structure. As for the marginal scenario, we will represent S by just its maximal subsets in a slight abuse of notation; the complete representation of S, which explicitly includes all (not necessarily maximal) subsets for which a joint probability distribution exists, will be denoted S_c = {T | T ⊂ S_i, S_i ∈ S}. In such a situation the entropies H(T) cannot be defined for all subsets T ⊂ S, but only for the subsets in S_c. The entropy vectors we shall consider will thus be defined here as h = (H(T))_{T ∈ S_c} ∈ R^{|S_c|}. Again, no explicit characterization is known for the set of valid entropy vectors; we will instead rely on its outer approximation characterized via the Shannon constraints, now restricted to each subset S_i ∈ S. Namely, the Shannon cone of interest is now the intersection

∩_{S_i ∈ S} Γ_{S_i},

where Γ_{S_i} ⊂ R^{|S_c|} is the cone defined by the Shannon inequalities on the variables in S_i which, in particular, leave the other variables in S \ S_i unconstrained. In the extremal case where we do assume a global joint probability distribution for all variables we have S = {S} and S_c = 2^S, and this intersection reduces to the usual Shannon cone Γ_S. One can similarly consider marginal scenarios under a given probability structure S, with the constraint that marginals must arise from existing probability distributions, i.e., for all M_j ∈ M there must exist an S_i ∈ S such that M_j ⊂ S_i. One can also add linear constraints to the entropy vectors under consideration, as before, represented by some subset L_C. We can thus define the marginal Shannon polyhedron associated with S, M, and L_C as

Γ_{S,M} := Π_M((∩_{S_i ∈ S} Γ_{S_i}) ∩ L_C).

The choice of probability structure can generally be considered on a case-by-case basis depending on the scenario being modeled.
Unless otherwise stated we will take S = {S} but, as we will discuss, this will not always be the most pertinent choice.

4. The entropic characterization of causal Bayesian networks
In order to describe the causal relations between random variables, we will use the framework of causal Bayesian networks [2]. Such networks can be conveniently represented as directed acyclic graphs (DAGs), in which each node represents a variable and directed edges (arrows) encode the causal relations between them. A set of variables S = {X_1, . . . , X_n} forms a Bayesian network with respect to a given DAG if and only if the variables admit a global probability distribution P(x_1, . . . , x_n), i.e., S = {S}, that factorizes according to

P(x_1, . . . , x_n) = ∏_i P(x_i | pa_i),

where Pa_i stands for the graph-theoretical parents of the variable X_i, that is, all those variables X_j that have an outgoing edge pointing to X_i in the DAG under consideration. The decomposition above implies a set of conditional independences (CIs), which are either independence relations of the type P(x_i, x_j) = P(x_i)P(x_j) (in which case we write X_i⊥⊥X_j) or conditional independence relations such as P(x_i, x_j | x_k) = P(x_i | x_k)P(x_j | x_k) (in which case we write X_i⊥⊥X_j | X_k). Given a DAG, a complete list of CIs can be obtained via the d-separation criterion [2]. If the CIs implied by a Bayesian network describe the direct causal relations between the variables in question, then we call it a causal Bayesian network. Entropically, these CIs correspond to simple linear relations, namely I(X_i : X_j) = 0 and I(X_i : X_j | X_k) = 0, respectively. As a result, the set of entropy vectors compatible with a given DAG is the intersection of the entropy cone Γ*_S with the linear subspace L_CI defined by the set of linear constraints that characterize the CIs associated with the DAG [29,35]. In practice, we again rely on the outer approximation given by the intersection of the Shannon cone Γ_S with L_CI.
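As an elementary illustration of how a d-separation statement becomes a linear entropic constraint, consider the chain DAG X → Z → Y (our own toy example; the numerical conditionals are arbitrary):

```python
from math import log2

def H(p, T):
    """Shannon entropy of the marginal of p on the coordinate set T."""
    marg = {}
    for xs, pr in p.items():
        key = tuple(xs[i] for i in T)
        marg[key] = marg.get(key, 0.0) + pr
    return -sum(q * log2(q) for q in marg.values() if q > 0)

# Markov chain X -> Z -> Y; d-separation gives X independent of Y given Z,
# which entropically reads I(X:Y|Z) = H(XZ) + H(YZ) - H(XYZ) - H(Z) = 0.
# Coordinates: X -> 0, Z -> 1, Y -> 2.
p = {}
for x in (0, 1):
    for z in (0, 1):
        for y in (0, 1):
            pz = 0.8 if z == x else 0.2   # arbitrary P(z|x)
            py = 0.7 if y == z else 0.3   # arbitrary P(y|z)
            p[(x, z, y)] = 0.5 * pz * py  # uniform P(x)

i_xy_given_z = H(p, (0, 1)) + H(p, (1, 2)) - H(p, (0, 1, 2)) - H(p, (1,))
print(abs(i_xy_given_z) < 1e-9)  # True
```

Any distribution that factorizes according to this DAG satisfies the same linear relation exactly, which is what makes the entropic constraints easy to impose on the cone.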
If all the variables in a DAG are observable, then in order to check the compatibility of a given entropy vector with the DAG it suffices to check whether all the entropic CIs are satisfied. However, we are often interested in DAGs containing latent, non-observable, variables. Splitting the n variables making up the DAG into j observable variables O_1, . . . , O_j and n − j latent variables Λ_1, . . . , Λ_{n−j}, we thus need to compute the marginal Shannon cone Π_M(Γ_S ∩ L_CI) for the marginal scenario M = {{O_1, . . . , O_j}} containing only the observable variables. As an illustration, consider the paradigmatic causal Bayesian network for a local hidden variable model satisfying Bell's assumption of local causality [5,7]. (Note that although the notions of causal correlations and causal Bayesian networks both share the "causal" qualifier, they are distinct concepts: a causal correlation is not simply one that can be obtained from any particular causal Bayesian network.) The relevant DAG, shown in Fig. 1, has five variables, four of which are observable while the hidden variable Λ is not: in the context of Bell's Theorem the "hidden variables" indeed refer to the latent factors introduced above. This DAG represents the physical scenario where two distant observers receive physical systems produced by a common source (the hidden variable Λ) and make different measurements (choices of which are labelled by X and Y), obtaining measurement outcomes (represented by the variables A and B). That is, the probability structure is S = {S} with S = {X, Y, A, B, Λ}. Some of the conditional independences implied by this DAG are given by P(xyλ) = P(x)P(y)P(λ) (the measurement independence assumption; for CIs between more than two variables we use the natural extension of the above notation, writing in this case X⊥⊥Y⊥⊥Λ), together with P(a|xybλ) = P(a|xλ) and P(b|xyaλ) = P(b|yλ) (the locality assumption), which in turn imply (after eliminating the hidden variable Λ) Bell inequalities for the observed variables [5,7]. These constraints also imply the no-signaling conditions P(a|xy) = P(a|x) and P(b|xy) = P(b|y).
This example shows that, in general, DAGs with latent variables imply CIs both on the level of observable and unobservable variables. The CIs involving latent variables are not directly testable but imply further constraints (Bell inequalities, in the example above) that can be tested to check whether the observable behavior is compatible with the proposed underlying DAG.
If, instead of characterizing the allowed probability distributions, we consider the entropic description of the Bell scenario, i.e., the Shannon cone together with the linear constraints arising from the DAG's CIs, then after eliminating the latent variable Λ one obtains no further constraints other than the elemental inequalities (which are trivial since they are respected by all probability distributions) and the observable CIs implied by the DAG: H(XY) = H(X) + H(Y), H(A|XY) = H(A|X) and H(B|XY) = H(B|Y). The first CI relation represents the independence of the two measurement choices, while the latter two are no-signaling conditions. Thus, for this particular causal Bayesian network, when the entropic approach is applied to the variables making up the DAG one does not obtain any nontrivial constraints (i.e., entropic Bell inequalities) [45]. However, there are many examples of Bayesian networks for which one does obtain such nontrivial constraints [9,29,35]. In fact, as we will see in Sec. III, a slight modification of this method also leads to nontrivial constraints on causal correlations.

5. The entropic characterization of counterfactuals
While the DAG method fails to provide nontrivial constraints for the Bell scenario (a result that can be extended to a larger class of "line-like" Bayesian networks [45]), it has been known for some time that entropic Bell inequalities can be derived using different methods [26]. Interestingly, these inequalities can even be turned into necessary and sufficient conditions for a given probability distribution to satisfy Bell's local causality assumption [31].
The method that allows such inequalities to be derived is motivated by the realization that the entropic approach can be applied to any marginal scenario for a relevant set of random variables [28], and not only to those arising from causal Bayesian networks. In particular, when we are interested in constraints on conditional distributions of the form P(ab|xy), where we have distinct sets of input and output variables, we may consider the output variables conditioned on certain relevant input variables (e.g., A_xy and B_xy, where the notation A_xy denotes the random variable A|(X = x, Y = y)). The choice of relevant input variables to condition on, as well as the appropriate probability structure, will depend on the physical situation being considered. In general, a global probability distribution may not exist on such "counterfactual" variables even if one does on the unconditioned variables.
Let us illustrate how this method may be applied by considering again its application to the Bell scenario. Instead of considering all the input and output variables as in the DAG approach (i.e., X, Y, A, B), one can consider copies of the output variables conditioned on the corresponding party's input, i.e., A_x and B_y, where A_x denotes the random variable A|(X = x). Indeed, due to the DAG constraints (no-signaling), the output variables can only depend on the corresponding local input. Furthermore, from Fine's Theorem [46] we know that Bell's local causality assumption is equivalent to the existence of a well-defined (although empirically inaccessible) joint probability distribution P(a_1, . . . , a_{|X|}, b_1, . . . , b_{|Y|}) (where X = {1, . . . , |X|} and Y = {1, . . . , |Y|} denote the alphabets of Alice and Bob's inputs) on these variables, which marginalizes to the observable one as P(ab|xy) = P(a_x b_y). Hence, the appropriate probability structure for local correlations in the Bell scenario is S = {S} with S = {A_1, . . . , A_{|X|}, B_1, . . . , B_{|Y|}}, and we consider the Shannon cone Γ_S that contains all 2^{|X|+|Y|}-dimensional entropy vectors h = (H(T))_{T⊂S}. The marginal scenario in this case is simply M = {{A_x, B_y}}_{x,y}, and local correlations are then characterized by the cone Π_M(Γ_S).
In contrast to the characterization based directly on the DAG variables, this approach leads to nontrivial entropic inequalities (i.e., ones not obtainable from the elemental inequalities in Eq. (5)) in the Bell scenario. For example, for two measurement settings per party, which we take as X = Y = {0, 1}, one obtains the Braunstein-Caves inequality [26],

I(A_0 : B_0) + I(A_0 : B_1) + I(A_1 : B_0) − I(A_1 : B_1) − H(A_0) − H(B_0) ≤ 0,

together with its symmetries obtained by relabeling the inputs, where I(A_x : B_y) = H(A_x) + H(B_y) − H(A_xB_y) is the mutual information between the variables A_x and B_y. This inequality can be understood as the entropic counterpart of the paradigmatic CHSH inequality [47].
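The Braunstein-Caves inequality, in the form I(A_0:B_0) + I(A_0:B_1) + I(A_1:B_0) − I(A_1:B_1) ≤ H(A_0) + H(B_0), is a Shannon-type inequality and must therefore hold for every global distribution over the four counterfactual variables. The following sketch (our own check, with arbitrary random sampling) verifies this:

```python
import itertools
import random
from math import log2

def H(p, T):
    """Shannon entropy of the marginal of p on the coordinate set T."""
    marg = {}
    for xs, pr in p.items():
        key = tuple(xs[i] for i in T)
        marg[key] = marg.get(key, 0.0) + pr
    return -sum(q * log2(q) for q in marg.values() if q > 0)

def I(p, T, U):
    """Mutual information between the disjoint coordinate sets T and U."""
    return H(p, T) + H(p, U) - H(p, T + U)

# Coordinates: (A0, A1, B0, B1) -> indices (0, 1, 2, 3).
random.seed(1)
for _ in range(500):
    w = [random.random() for _ in range(16)]
    z = sum(w)
    p = dict(zip(itertools.product((0, 1), repeat=4), (wi / z for wi in w)))
    lhs = (I(p, (0,), (2,)) + I(p, (0,), (3,)) + I(p, (1,), (2,))
           - I(p, (1,), (3,)) - H(p, (0,)) - H(p, (2,)))
    assert lhs <= 1e-9  # holds for every global (i.e., local-realistic) model
print("no violation found")
```

A violation of the inequality by observed marginals would therefore certify, via Fine's theorem, that no global distribution over the counterfactual variables exists.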
In general (i.e., beyond the simplest Bell scenario), both methods based on the variables in a causal Bayesian network and on counterfactual variables can lead to nontrivial constraints [9,29,35,37,48,49]. To conclude this section, let us nonetheless highlight an important difference between the two methods: while the former is valid for arbitrary input alphabets, the latter fixes the number of inputs to which the inequalities apply.
Although the choice of probability structure above corresponds, via Fine's theorem, to the assumption of a local hidden variable theory, one can also consider other possibilities. For instance, taking S = M amounts to assuming a nonsignaling theory [40]. In this case, the entropy cone is characterized only by the Shannon inequalities and one can obtain a characterization of the extremal rays of the cone, corresponding to the entropic analogue of Popescu-Rohrlich (PR) boxes [37].

III. BIPARTITE ENTROPIC CAUSAL INEQUALITIES
With the entropic approach to characterizing sets of correlations outlined, we can now proceed to apply this approach to causal correlations, so as to derive entropic causal inequalities. We consider in this section the bipartite case. We first show how the method based on causal Bayesian networks can be adapted to characterize causal correlations, before considering also the method based on counterfactual variables.

A. Conditional DAGs for bipartite causal correlations
The ability to apply the entropic approach to DAGs, as outlined in Sec. II B 4, is a powerful tool for characterizing the correlations obtainable within arbitrary causal networks. However, the notion of causal correlations defined in Eq. (1) is somewhat more general and cannot be directly expressed within the framework of causal Bayesian networks. In order to see why this is the case, let us first note that the random variables of interest are X, Y, A, B, representing the inputs X, Y and outputs A, B for Alice and Bob. Note that since we consider signaling scenarios here, unlike in the Bell scenario, we do not need to include any latent variable Λ in our description to account for shared randomness, since this can be established via local randomness and communication.
If Alice and Bob share a correlation compatible with a fixed causal order (i.e., either A ≺ B or B ≺ A), then the functional dependences between X, Y, A, B can indeed be expressed as a DAG (specifically, one of the two DAGs containing these variables in Fig. 2). However, a causal correlation may in general not be compatible with any fixed causal order, but may require a mixture thereof. This has some similarities with the situation in the Svetlichny definition of genuine multipartite nonlocality [50,51], where a convex mixture of different DAGs has to be considered.
To tackle this problem it is necessary to find a way to take into account the constraints arising separately from each of the two fixed causal orders, and then to combine them to obtain those satisfied by causal correlations. In order to do this, we exploit the fact that any mixture of fixed-order causal correlations can be seen as arising from a latent variable that determines the causal order for each individual experiment [23]. We thus introduce a new random variable Q, which we call a "switch", and which unambiguously determines the appropriate causal Bayesian network for each trial. The resulting causal model is shown in Fig. 2, where the DAG with A ≺ B is used for Q = 0, and the one with B ≺ A for Q = 1. By identifying q_0, q_1 in Eq. (1) as q_0 = P(Q = 0) and q_1 = P(Q = 1), one can readily see that this description is equivalent to the definition of a causal correlation in Eq. (1).
Both DAGs imply the independence of the inputs, X⊥⊥Y. The DAG for Q = 0 (i.e., for A ≺ B) also implies the CI A⊥⊥Y | X (i.e., that there is no signaling from B to A), while the DAG for Q = 1 implies B⊥⊥X | Y instead. In addition, the switch variable Q should be independent of Alice and Bob's inputs X and Y, so that we have XY⊥⊥Q, which, together with X⊥⊥Y, implies X⊥⊥Y⊥⊥Q.

B. Shannon polyhedron of causal correlations
In order to use the "conditional" causal Bayesian network in Fig. 2 to characterize the set of entropy vectors obtainable from causal correlations, we first note that we can directly use the techniques of Sec. II B 4 to characterize the Shannon cones for each of the two DAGs appearing in the figure conditioned on Q (i.e., for fixed-order correlations with A ≺ B or B ≺ A). Denoting these cones Γ_{A≺B} and Γ_{B≺A}, we have

Γ_{A≺B} := Γ_S ∩ L_C(X⊥⊥Y) ∩ L_C(A⊥⊥Y | X)

and

Γ_{B≺A} := Γ_S ∩ L_C(X⊥⊥Y) ∩ L_C(B⊥⊥X | Y),

where Γ_S is the Shannon cone for the four variables in S = {X, Y, A, B}. Recall that in the probabilistic case the polytope of causal correlations is simply the convex hull of the polytopes of correlations for A ≺ B and B ≺ A [24], and with the new variable Q the definition Eq. (1) can be rewritten as

P(ab|xy) = P(Q = 0) P_{A≺B}(ab|xy, Q = 0) + P(Q = 1) P_{B≺A}(ab|xy, Q = 1). (14)

In contrast, the convex hull of the cones Γ_{A≺B} and Γ_{B≺A} does not contain all entropy vectors of causal correlations, due to the concavity of the Shannon entropy. Indeed, in Appendix A we provide an explicit example of a causal correlation whose entropy vector is not contained in the convex hull conv(Γ_{A≺B}, Γ_{B≺A}).
To see more precisely why this is the case, and how to give a correct entropic characterization of causal correlations, observe that, when taking a convex mixture of two causal correlations with different causal orders, the "conditional entropy vectors" h 0 = (H(T|Q = 0)) T⊂S and h 1 = (H(T|Q = 1)) T⊂S must be contained in Γ A≺B and Γ B≺A , respectively, and thus satisfy I 0 h 0 ≤ 0 and I 1 h 1 ≤ 0. For any causal correlation, the convex mixture h conv = q 0 h 0 + q 1 h 1 is thus contained in conv(Γ A≺B , Γ B≺A ). Observe now that, in contrast to the convex sum Eq. (14) defining causal correlations, h conv thus defined is equal to (H(T|Q)) T⊂S , rather than just (H(T)) T⊂S . Hence the convex hull of the fixed-order entropy cones characterizes the conditional entropies (conditioned on the switch variable Q) obtainable with causal correlations, rather than the entropy vectors of causal correlations directly.
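The role of concavity here can be made concrete with a small numerical sketch (an illustration of ours, with hypothetical deterministic strategies): mixing two fixed-order correlations via an unbiased switch Q, the entropy H(AB) of the mixture exceeds the convex combination H(AB|Q) of the component entropies, though never by more than H(Q):

```python
from math import log2

def Hdist(p):
    """Shannon entropy of a distribution given as a dict of probabilities."""
    return -sum(v * log2(v) for v in p.values() if v > 0)

# Two deterministic fixed-order correlations with uniform inputs x, y:
#   Q=0 (A≺B): a = x, b = x       -> (a,b) uniform on {(0,0),(1,1)}
#   Q=1 (B≺A): a = y, b = y XOR 1 -> (a,b) uniform on {(0,1),(1,0)}
p0 = {(0, 0): 0.5, (1, 1): 0.5}
p1 = {(0, 1): 0.5, (1, 0): 0.5}

q0 = q1 = 0.5
mix = {ab: q0 * p0.get(ab, 0.0) + q1 * p1.get(ab, 0.0) for ab in set(p0) | set(p1)}

H_AB_given_Q = q0 * Hdist(p0) + q1 * Hdist(p1)  # H(AB|Q): the convex combination
H_AB = Hdist(mix)                               # entropy of the mixture itself
H_Q = 1.0                                       # Q is an unbiased bit

print(H_AB_given_Q, H_AB)  # 1.0 2.0
```

Here conv(Γ A≺B , Γ B≺A ) captures H(AB|Q) = 1, while the actual entropy vector of the mixture has H(AB) = 2; the gap H(AB) − H(AB|Q) = I(AB:Q) is bounded by H(Q).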
With the appropriate transformation, the inequalities Ih ≤ 0 characterizing conv(Γ A≺B , Γ B≺A ) can be transformed into inequalities satisfied by the standard (i.e., non-conditional) entropy vector h̃ = (H(T)) T⊂S̃ for the variables now in S̃ = S ∪ {Q} (the probability structure being consequently extended to S̃ = {S̃}). Specifically, each row I of the matrix I (defining each individual inequality Ih ≤ 0) must undergo the linear transformation that replaces each term H(T) by the conditional entropy H(T|Q) = H(T ∪ {Q}) − H(Q), for all nonempty subsets T ⊂ S. We will denote by conv Q (Γ A≺B , Γ B≺A ) the cone of vectors h̃ satisfying the resulting inequalities Ĩh̃ ≤ 0.
To complete the characterization of entropy vectors for causal correlations, we recall that, in addition to the fact that any distribution on S̃ must give an entropy vector in the Shannon cone Γ S̃ , the conditional DAG in Fig. 2 gives us the CI constraints X⊥⊥Y⊥⊥Q. Moreover, since Q is a binary variable (as there are only two orders to switch between), we have H(Q) ≤ 1. A consequence of this final inequality constraint is that the set of entropy vectors under consideration is characterized by an inhomogeneous system of inequalities of the form Ĩh̃ ≤ β̃ for some β̃ ∈ R^(2^|S̃|−1), and is thus no longer a cone but a polyhedron. The polyhedron characterizing entropy vectors associated with the conditional DAG (when still including Q) is thus given by

Γ̃ causal = conv Q (Γ A≺B , Γ B≺A ) ∩ Γ S̃ ∩ L C (X⊥⊥Y⊥⊥Q) ∩ {h̃ : H(Q) ≤ 1},   (17)

where L C (·) denotes the subspace or polyhedron defined by the corresponding linear constraints. Finally, following the general approach presented in Sec. II B, it remains just to eliminate the terms containing the (unobservable) switch variable Q in order to obtain the inequalities characterizing bipartite causal correlations. This is done by projecting Γ̃ causal onto the marginal scenario M = {S} = {{X, Y, A, B}}. We thus finally obtain the polyhedron Γ causal given by this projection (Eq. (18)), which we shall refer to as the causal Shannon polyhedron, or simply the causal polyhedron; it is again characterized by an inhomogeneous system of inequalities I h ≤ β for some β ∈ R^(2^|S|−1).

We emphasize that the construction given above is in fact not at all restricted to the description of causal correlations, and can be used to characterize arbitrary convex mixtures of different Bayesian networks. Furthermore, as we will see in Sec. IV, this method can be generalized to convex combinations of more distributions, in our case corresponding to more than two causal orders in multipartite scenarios (and even to correlations with "dynamical causal order" [23,25,53]).

Entropic causal inequalities and their violation
The constructive description of the causal polyhedron Γ causal from Eqs. (17) and (18) also makes it clear how we can characterize it, in practice, as a system of linear inequalities. A description of Γ̃ causal in terms of its facets is straightforwardly obtained by taking the union of the inequalities describing the individual cones, linear subspaces, and polyhedra appearing in Eq. (17). The inequalities characterizing Γ causal can then be found by eliminating the terms not contained in the marginal scenario M = {S}, either by Fourier-Motzkin elimination [43] or by finding the extremal rays of Γ̃ causal and projecting out the unwanted coordinates.
The corresponding system of inequalities is thus satisfied by any bipartite causal correlation. However, many of these inequalities are either elemental inequalities (see Eq. (5)) or can be obtained from these by using the independence constraint X⊥⊥Y, and thus represent trivial constraints. After computing the polyhedron in Eq. (18) and eliminating all trivial inequalities, i.e., those satisfied by any distribution P(xyab) with X⊥⊥Y, we find 35 novel entropic causal inequalities. Several of these inequalities are equivalent under the exchange of parties (i.e., exchanging (X, A) ↔ (Y, B)), and under this symmetry there are in fact 20 equivalence classes of entropic causal inequalities, the full list of which is given in Appendix B. Of these, 10 have bounds of 0 (i.e., are of the form I · h ≤ 0), while the remaining 10 have nonzero bounds (resulting from a nontrivial dependence on H(Q) before this variable was eliminated; see Appendix B). Simple interpretations of the entropic causal inequalities seem to be less forthcoming than for the bipartite causal inequalities in terms of probabilities [24] (which hold for binary inputs and outputs; recall that the entropic inequalities given here are, in contrast, valid for any number of possible inputs and outputs). One of the simpler examples, which is symmetric under the exchange of parties, is given in Eq. (19).

Note that the fact that we find nontrivial inequalities is in stark contrast to the situation for Bell-type inequalities (and line-like causal Bayesian networks), where the DAG-based entropic method only leads to trivial inequalities obtainable from the elemental inequalities and no-signaling conditions [45]. While these entropic inequalities are obeyed by any bipartite causal correlation, we note that a priori they need not be tight. Indeed, recall that the Shannon cone is only an outer approximation to the true entropy cone, and the method we applied to bound convex combinations of fixed-order correlations may introduce extra slack.
It is thus interesting to study the tightness and violation of these inequalities more carefully.
Although one would generally not expect every point on the boundary of Γ causal to be obtainable by a causal correlation, it is nonetheless desirable to be able to saturate each inequality by some causal probability distribution, for appropriate distributions of X and Y. By looking at deterministic causal distributions with binary inputs and outputs, which can easily be enumerated, we readily verified that all 10 families of inequalities that are bounded by 0 (given in Eq. (B1)) can indeed be saturated when taking uniformly distributed inputs. However, we were unable to find causal distributions, either by mixing binary ones or by considering more outputs, that saturate the remaining inequalities, and their tightness remains an open question.
To understand the violation by noncausal distributions of the entropic inequalities, we consider the extremal rays of the constrained Shannon cone which violate the inequalities. 7 A crucial question is whether or not these extremal rays actually correspond to valid probability distributions (i.e., whether they support entropy vectors), and if not, whether the inequalities can nonetheless be violated.
In order to look at this, it is instructive to first restrict our attention to distributions satisfying H(X) ≤ 1, H(Y) ≤ 1, H(A) ≤ 1 and H(B) ≤ 1. These constraints are satisfied by all distributions with binary inputs and outputs, and this therefore also allows us to compare the violation of the entropic causal inequalities to the violation of standard causal inequalities, which are well understood in this scenario [24]. Imposing these constraints on the cone in Eq. (20), one obtains a polytope whose extremal points correspond to the extremal rays of the cone scaled to satisfy these constraints (together with the null vertex 0). Under these constraints we found that the 10 inequalities in Eq. (B1) and the two in Eq. (B2) could be violated, although the latter are weaker than, and implied by, the former and are thus redundant. The remaining 8 inequalities, in Eqs. (B3) and (B4), cannot be violated. All in all, the set of binary causal correlations is entropically characterized by the 10 inequalities in Eq. (B1) that are bounded by 0.
Amongst the extremal points violating each of these inequalities, those that give the maximal violation all satisfy H(X) = H(Y) = 1 and H(XY) = H(XYAB) and thus, if realizable, correspond to deterministic conditional distributions taken with uniformly distributed inputs X and Y. In fact, all but one of these 10 inequalities are maximally violated (by which we henceforth mean with respect to the Shannon cone augmented with the independence constraint X⊥⊥Y) by one of the three following deterministic distributions taken with uniform inputs:

P(ab|xy) = δ a,y δ b,x⊕y ,
P(ab|xy) = δ a,x⊕y δ b,x ,   (21)
P(ab|xy) = δ a,x⊕y δ b,x⊕y ,

where x, y, a, b take the binary values 0, 1, and ⊕ denotes addition modulo 2. For example, Eq. (19) can be put in the form of inequality (22), which, in turn, is violated by the deterministic distribution (again taken with uniform inputs)

P(ab|xy) = δ a,x⊕xy δ b,y⊕xy .   (23)
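One can check directly that each of these four distributions signals in both directions; since a deterministic conditional distribution is extremal in the polytope of all conditional distributions, two-way signaling rules out any causal decomposition. A short sketch (our own illustration):

```python
# The three deterministic distributions of Eq. (21) and that of Eq. (23),
# written as output functions (a(x,y), b(x,y)) with binary inputs.
dists = {
    "d1": lambda x, y: (y, x ^ y),                  # δ_{a,y} δ_{b,x⊕y}
    "d2": lambda x, y: (x ^ y, x),                  # δ_{a,x⊕y} δ_{b,x}
    "d3": lambda x, y: (x ^ y, x ^ y),              # δ_{a,x⊕y} δ_{b,x⊕y}
    "d4": lambda x, y: (x ^ (x & y), y ^ (x & y)),  # Eq. (23)
}

def signals_to_alice(f):
    # does Alice's output a(x, y) depend on Bob's input y?
    return any(f(x, 0)[0] != f(x, 1)[0] for x in (0, 1))

def signals_to_bob(f):
    # does Bob's output b(x, y) depend on Alice's input x?
    return any(f(0, y)[1] != f(1, y)[1] for y in (0, 1))

two_way = {name: signals_to_alice(f) and signals_to_bob(f) for name, f in dists.items()}
print(two_way)  # every distribution signals in both directions, hence none is causal
```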
However, unlike for the other inequalities, this distribution does not give the maximal possible violation of inequality (22) (which is 1/2), as the corresponding extremal point h ext that does maximally violate it is not reachable by a valid probability distribution with binary inputs and outputs. This is easily verified by making use of the previous observation that this extremal point must correspond to a deterministic distribution taken with uniform inputs, the set of which can easily be enumerated for binary inputs and outputs. Amongst such distributions, the one in Eq. (23) gives the best violation, of 1 − (3/2) log2(3/2) ≈ 0.123 > 0.

The distributions in Eq. (21) are particularly interesting, as they all maximally violate some symmetries of the GYNI inequality (2) (i.e., versions of it obtained by relabeling the parties, inputs, and outputs), but not Eq. (2) itself. Interestingly, it turns out that all binary deterministic noncausal distributions, when taken with uniform inputs, violate at least one of our entropic inequalities, except the distribution P GYNI (ab|xy) = δ a,y δ b,x (which maximally violates Eq. (2)) and its four symmetries under relabeling of outputs only. Note, however, that if Alice and Bob have a noncausal resource producing the distribution P GYNI , they can produce any of the distributions in Eq. (21) by appropriately XORing their input with their output, and thus still obtain an operational violation of an entropic causal inequality. 8 It is interesting to observe that distributions maximally violating GYNI-type inequalities have such a crucial role in violating the entropic causal inequalities, given that the entropic inequalities superficially bear little resemblance to these, and are valid for arbitrary numbers of inputs and outputs.

8 This illustrates an important difference between the probabilistic and entropic frameworks: while all symmetries of a correlation obtained by flipping inputs and outputs (possibly conditioned on the local inputs for the latter) are equivalent in the probabilistic case (in the sense that if one violates a causal inequality, then all the others violate a symmetry of that inequality), this is not the case in the entropic approach. The entropy vectors of two different symmetries of a correlation may be inequivalent, with one violating an entropic causal inequality while the other does not.

Returning to the more general situation with no upper bound imposed on H(X), H(Y), H(A) and H(B),
we see that all the remaining entropic causal inequalities can be violated by entropy vectors parallel to the realizable entropy vectors giving violations in the restricted scenario; more precisely, those obtained from the distributions of Eq. (21) (for all but one of the remaining inequalities) and Eq. (23) (for the remaining one). This shows that, given large enough alphabets for the input and output variables, all the entropic causal inequalities we obtained can indeed be violated by noncausal probability distributions, since if the distribution P(xyab) has entropy vector h, then the distribution

P(xyab) = P(x 1 y 1 a 1 b 1 ) × · · · × P(x n y n a n b n ),   (24)

where x = (x 1 , . . . , x n ) and similarly for y, a and b, has entropy vector n · h. One should be careful, however, to note that the operation of sharing multiple independent correlations among the same parties is not a free operation, either in the framework of causal correlations (since, for example, two independent copies of a causal distribution may give rise to a noncausal one) or in the process matrix framework (where two independent copies of a process matrix do not, in general, produce a valid process matrix). Nevertheless, P(ab|xy) = P(xyab)/P(xy) obtained from Eq. (24) still represents a valid (possibly noncausal) distribution.
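The additivity used here, that n independent copies of P have entropy vector n · h, is easy to verify numerically; the sketch below (our illustration, using the third distribution of Eq. (21) with uniform inputs as an example) checks it for n = 2:

```python
import itertools
from math import log2

def H(joint, idx):
    """Shannon entropy of the marginal on the variable positions in idx."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

# Example (noncausal) distribution P(xyab) with uniform inputs and a = b = x XOR y.
P = {}
for x, y in itertools.product([0, 1], repeat=2):
    P[(x, y, x ^ y, x ^ y)] = 0.25

# Two independent copies: P2(x1 y1 a1 b1 x2 y2 a2 b2) = P(x1 y1 a1 b1) * P(x2 y2 a2 b2)
P2 = {o1 + o2: p1 * p2 for o1, p1 in P.items() for o2, p2 in P.items()}

# Every entropy doubles: H_{P2}(T1 ∪ T2) = 2 H_P(T), so the entropy vector is 2h.
for idx in ([0], [0, 1], [0, 1, 2], [0, 1, 2, 3]):
    idx2 = idx + [i + 4 for i in idx]
    assert abs(H(P2, idx2) - 2 * H(P, idx)) < 1e-9
print("entropy vector of the product distribution is 2·h")
```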
It is also interesting to ask how sensitive the entropic causal inequalities are for detecting noncausality. Since it does not appear possible to saturate the inequalities (B2)-(B4), which have nonzero bounds, using causal distributions, these inequalities are not tight and are consequently unable to detect noncausal correlations that are very close to being causal. For the other inequalities, in Eq. (B1), this is nonetheless a pertinent question. More precisely, one may ask whether there exists a distribution P ε of the form

P ε (ab|xy) = ε P NC (ab|xy) + (1 − ε) P C (ab|xy),   (25)

where P NC is a noncausal distribution and P C is causal, that violates any of these entropic inequalities for arbitrarily small ε > 0.
We looked in detail at this question for the case of binary inputs and outputs, where the inequalities in Eq. (B1) can all be both saturated by causal distributions and violated by noncausal ones. By exhaustively trying all deterministic distributions P NC and P C , we found that such behaviour was exhibited (for such distributions) only by the two inequalities (26) and (27). Equation (26), for example, is violated by P ε for all ε > 0 when taking P NC (ab|xy) = δ a,x⊕y δ b,x⊕y and P C (ab|xy) = δ a,0 δ b,x⊕y along with uniformly distributed inputs X and Y, a choice which also violates the GYNI-type causal inequality (1/4) ∑ x,y,a,b δ a,x⊕y δ b,x⊕y P(ab|xy) ≤ 1/2, with a left-hand-side value of (1+ε)/2 > 1/2. For the remaining inequalities, such mixtures that violate a standard causal inequality for arbitrarily small ε only violate an entropic causal inequality when ε > ε 0 for some ε 0 bounded away from 0. We observed identical behavior when we extended our consideration to various non-deterministic distributions P NC and P C , and it thus seems that only Eqs. (26) and (27) exhibit this ability to detect the noncausality of distributions that are arbitrarily close to being causal.
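The left-hand-side value (1+ε)/2 quoted above is straightforward to reproduce; a minimal sketch using the stated P NC and P C :

```python
# Mixture P_ε = ε·P_NC + (1−ε)·P_C from Eq. (25), with the distributions used for Eq. (26):
#   P_NC(ab|xy) = δ_{a,x⊕y} δ_{b,x⊕y}  (noncausal),  P_C(ab|xy) = δ_{a,0} δ_{b,x⊕y}  (causal).
def P_eps(a, b, x, y, eps):
    p_nc = 1.0 if (a == x ^ y and b == x ^ y) else 0.0
    p_c = 1.0 if (a == 0 and b == x ^ y) else 0.0
    return eps * p_nc + (1 - eps) * p_c

def gyni_lhs(eps):
    # (1/4) Σ_{x,y} P_ε(a = x⊕y, b = x⊕y | xy), i.e. the GYNI-type LHS with uniform inputs
    return 0.25 * sum(P_eps(x ^ y, x ^ y, x, y, eps) for x in (0, 1) for y in (0, 1))

for eps in (0.0, 0.01, 0.5):
    assert abs(gyni_lhs(eps) - (1 + eps) / 2) < 1e-12  # LHS = (1+ε)/2, exceeding 1/2 for ε > 0
print("GYNI-type LHS equals (1+ε)/2")
```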
A final point worth discussing relates to the physical interpretation of the distributions violating entropic causal inequalities. One of the motivations for introducing the notion of causal correlations was the question of whether nature permits more general causal structures that might allow such correlations to be realized, for example in quantum gravity [14]. In particular, the authors of Ref. [14] introduced the so-called process matrix formalism, in which quantum mechanics is assumed to hold locally for each party, while no global order is assumed between the parties. They showed that causal inequalities can be violated within this framework, and this helped motivate further studies of causal correlations, in which it has been shown that the violation of facet-inducing causal inequalities is ubiquitous within this framework [21,22,24,25,54,55]. It is thus interesting to see whether entropic causal inequalities share this property and can also be violated within the process matrix framework.
To look for such violations, we used the optimization techniques of Refs. [24,25] with qubit systems to try and optimize the violation of the GYNI-type inequalities that the distributions in Eq. (21) violate maximally. We also tried minimizing the distance to other deterministic noncausal correlations such as Eq. (23), as well as optimizations in random directions in probability space. Unfortunately, we were unable to find any process matrices operating on qubits that violate entropic causal inequalities with such techniques. We additionally attempted to reproduce (as closely as possible) distributions of the form (25) for small ε in order to violate inequalities (26) and (27), but similarly found no violation. Finally, we looked at noncausal correlations obtained by mixing noncausal correlations realizable by process matrices with causal correlations. An analogous mixing procedure was shown to enable all nonlocal distributions to violate the entropic Bell inequalities described in Sec. II B 5 [31], but we were unable to find violations of the entropic causal inequalities with this approach.
This lack of violation is perhaps unsurprising given the general lack of sensitivity of the entropic inequalities to nearly-causal distributions, and the fact that the best-known violations of causal inequalities for this scenario with process matrices are relatively small [24]. Nonetheless, it remains possible that violations can be found with higher-dimensional systems or more inputs and outputs.

B. Characterization based on counterfactual variables
In this section we will consider counterfactual variables as outlined in Sec. II B 5. Rather than considering the inputs as random variables X and Y, we take copies of each output variable for all input combinations, i.e. A xy and B xy . In contrast to the method based on causal Bayesian networks, this method fixes the number of inputs that the inequalities apply to but may lead to novel constraints, as is the case in the Bell scenario.

Counterfactual variables for bipartite causal correlations
To keep the discussion simple, we will consider only the case of binary inputs, but the generalization to arbitrary inputs is straightforward. We consider the variables S = {A 00 , A 01 , A 10 , A 11 , B 00 , B 01 , B 10 , B 11 }. (29) Note that, in contrast to the example of Bell inequalities discussed in Sec. II B 5, we need to consider copies of each variable for each input pair (x, y). This is a consequence of the fact that the correlations which we want to characterize may be signaling, e.g., for the causal order A ≺ B, B 00 and B 10 will in general be different.
Since only the counterfactual variables corresponding to a single pair of inputs (x, y) are jointly observable in any given run, the marginal scenario to consider is

M = {{A xy , B xy }} x,y∈{0,1} .   (30)

In contrast to the DAG-based method, several choices of probability structure S compatible with M are possible, and the particular choice must be motivated on the basis of physical assumptions. One natural possibility would be to take S = M, as one may have no a priori reason to think that the variables A xy and A x′y′ have simultaneous physical meaning for (x, y) ≠ (x′, y′), and hence may not have a well-defined joint probability distribution. On the other hand, in some cases one may imagine that such inputs correspond to the choice of measurements of some physical properties that are simultaneously well-defined, as in a classical theory; hence, one may alternatively take S = {∪ M j ∈M M j } = {S}. In the following, we will adopt the former approach and take S = M, since this constitutes the minimal assumption compatible with the marginal scenario. The Shannon cone for S is thus

Γ S = ∩ x,y∈{0,1} Γ {A xy ,B xy } ,   (31)

as in Eq. (8). We note, however, that this physically motivated choice for S implies, for this particular scenario, that a global probability distribution does in fact exist. 9 Taking S = {S} would thus provide an equivalent entropic characterization and, moreover, an equivalent characterization also at the level of Shannon (rather than entropic) cones (see Appendix C for an extensive discussion).
We follow a method analogous to that used in Sec. III A. First, we characterize the cones Γ A≺B and Γ B≺A of entropy vectors for fixed-order causal correlations, then, we characterize the convex mixtures of such correlations.
To do this, we note that the no-signaling conditions obeyed by fixed-order correlations (see Sec. II A) impose constraints on the counterfactual variables. For example, correlations consistent with the order A ≺ B obey P(a|xy) = P(a|xy′) for all x, y, y′, a, which implies A xy = A xy′ and thus H(A xy ) = H(A xy′ ) also. Similarly, for B ≺ A, we have H(B xy ) = H(B x′y ) for all x, x′, y. The cones Γ A≺B and Γ B≺A are thus given by

Γ A≺B = Γ S ∩ L C ({H(A xy ) = H(A xy′ )} x,y,y′ ),
Γ B≺A = Γ S ∩ L C ({H(B xy ) = H(B x′y )} x,x′,y ),

where L C (·) again denotes the linear subspace defined by the corresponding constraints. As in Sec. III A, we introduce the latent switch variable Q, denote the augmented set of random variables S̃ = S ∪ {Q}, and extend the probability structure accordingly to S̃ = {{A xy , B xy , Q}} x,y (in Appendix C we discuss further the implications of different choices of probability structures). With this extra variable, we note again that the convex hull conv(Γ A≺B , Γ B≺A ) contains the conditional entropy vectors h conv = (H(T|Q)) T for causal correlations. The system of inequalities Ih ≤ 0 characterizing conv(Γ A≺B , Γ B≺A ) can then again be transformed, in a similar way to Eq. (16), into a new system Ĩh̃ ≤ 0 defining the cone of corresponding entropy vectors h̃ = (H(T)) T , which we again denote by conv Q (Γ A≺B , Γ B≺A ). In contrast to the DAG-based method, the only constraint on Q is now H(Q) ≤ 1, since Q need not be independent of the (counterfactual) output variables A xy , B xy . Finally, we need to project onto the marginal scenario M in Eq. (30). The causal polyhedron Γ causal is thus given, in analogy to Eqs. (17) and (18), by the projection onto M of conv Q (Γ A≺B , Γ B≺A ) ∩ Γ S̃ ∩ {h̃ : H(Q) ≤ 1}, where

Γ S̃ = ∩ x,y∈{0,1} Γ {A xy ,B xy ,Q} .

Entropic causal inequalities for counterfactual variables and their violation
As in Sec. III A, the construction above allows one to obtain the full list of entropic inequalities characterizing Γ causal . After removing the trivial inequalities directly implied by the Shannon constraints on M, we find that there are 6 nontrivial entropic causal inequalities, which can be grouped into two equivalence classes under relabelings of the inputs, given in Eqs. (36) and (37). The fact that these inequalities have nontrivial bounds is, as for the DAG-based method, a result of the constraint H(Q) ≤ 1, which means that Γ causal is a polyhedron characterized by a set of inhomogeneous inequalities. Indeed, if one chooses not to eliminate Q from the entropic description, one obtains a convex cone characterized by the same inequalities, except that the right-hand sides are multiplied by H(Q) (see the discussion in Appendix B).

In contrast to the case of the DAG-based approach, where violation of the causal inequalities we obtained was possible even with deterministic distributions, it is clear that such distributions provide no interesting behavior in the counterfactual approach, since any deterministic distribution has a null entropy vector. By looking at equal mixtures of deterministic causal distributions, however, we were able to verify that the inequalities in Eqs. (36) and (37) can indeed be saturated by such (causal) distributions and are thus tight.

In order to study the potential violation of these entropic inequalities, we again need to look at nondeterministic distributions. One can easily see, however, that Eqs. (36) and (37) cannot be violated when restricted to distributions satisfying H(A xy ) ≤ 1 and H(B xy ) ≤ 1 for all x, y ∈ {0, 1}, as this also implies that I(A xy : B xy ) ≤ 1. This means that the inequalities for counterfactual variables are unable to detect noncausality when both parties are restricted to binary outputs.
To study possible violations, we again look at the extremal rays of the Shannon cone Γ S of Eq. (31) which violate one of the inequalities, and examine whether these rays can be reached by any probability distribution. Considering bounds on H(A xy ) and H(B xy ) strictly larger than 1, we find that violations are possible for any such bound. Moreover, the entropy vectors giving maximal violation of Eqs. (36) and (37) are generally realizable with equal mixtures of causal and noncausal distributions. For example, given the constraints H(A xy ) ≤ log 2 k and H(B xy ) ≤ log 2 k for some integer k ≥ 2, the distribution P k defined in Eq. (38), where a, b ∈ {0, . . . , k − 1}, realizes such an extremal point for all k ≥ 2, and provides a violation of both Eqs. (36) and (37) for k > 2. For k = 2 (binary outputs), this distribution can be written as the convex combination P 2 (ab|xy) = (1/2) P NC (ab|xy) + (1/2) P C (ab|xy), where P NC (ab|xy) = δ a⊕1,x⊕y δ b⊕1,x⊕y maximally violates a GYNI-type inequality (it is simply a symmetry of the third distribution in Eq. (21), obtained by flipping all outputs), and P C (ab|xy) = δ a,0 δ b,0 is causal. Even though it does not violate Eq. (36) or (37), P 2 is noncausal.

The distribution P k can be seen as a possible generalization of a GYNI-violating distribution. This link to the GYNI-type inequalities and correlations can be made more explicit by considering the related distribution P̃ k of Eq. (39), with again a, b ∈ {0, . . . , k − 1}. We have P̃ 2 = P NC and, for k ≥ 3, P̃ k has the same entropy vector as P k−1 . P̃ k can clearly be simulated from P̃ 2 = P NC by making use of shared randomness, letting both parties replace the output 1 obtained from P̃ 2 by a shared random value a = b ∈ {1, . . . , k − 1}. It is interesting to see, then, that the GYNI-maximally-violating distributions also provide the best behavior entropically, when augmented with shared randomness, even though they fail to violate the inequalities when the parties have only binary outputs.
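The binary-output bound I(A xy : B xy ) ≤ 1 invoked above, and the structure of P 2 , can be checked directly; in the sketch below (our own illustration) each counterfactual pair (A xy , B xy ) under P 2 carries at most one bit of mutual information:

```python
from math import log2

def mutual_info(pab):
    """I(A:B) for a joint output distribution given as {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in pab.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b])) for (a, b), p in pab.items() if p > 0)

# P_2(ab|xy) = 1/2 δ_{a⊕1,x⊕y} δ_{b⊕1,x⊕y} + 1/2 δ_{a,0} δ_{b,0}  (binary outputs)
def P2(x, y):
    pab = {}
    nc = ((x ^ y) ^ 1, (x ^ y) ^ 1)  # outputs of the GYNI-type noncausal part
    for ab, w in ((nc, 0.5), ((0, 0), 0.5)):
        pab[ab] = pab.get(ab, 0.0) + w
    return pab

infos = {(x, y): mutual_info(P2(x, y)) for x in (0, 1) for y in (0, 1)}
print(infos)
# Each I(A_xy:B_xy) is at most 1 bit, as guaranteed for binary outputs,
# so Eqs. (36) and (37) cannot be violated in this scenario.
assert all(v <= 1 + 1e-9 for v in infos.values())
```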
As for the DAG-based method, it is also interesting to look at the sensitivity of these inequalities with respect to the detection of noncausality. Unfortunately, looking at distributions of the form given in Eq. (25), but now where P NC and P C are equal mixtures of 3-outcome deterministic noncausal and causal distributions, respectively, we were unable to find distributions P ε (ab|xy) which violate the entropic inequalities (36) and (37) for arbitrarily small ε.
Finally, one may again ask whether one can violate any of the entropic inequalities for counterfactuals within the process matrix formalism, or whether any noncausal correlation can be mixed with a causal one to violate an entropic inequality, as is the case for entropic Bell inequalities obtained from the counterfactual approach [31]. We leave this as an open question, but note only that we were not able to find a way to do so: for example, we were unable to find a violation (with or without the use of shared randomness) for noncausal distributions realizable within the process matrix framework.

IV. MULTIPARTITE ENTROPIC CAUSAL INEQUALITIES
The notion of causal correlations can be extended to more than two parties in a recursive manner [23,25]. Consider N parties A 1 , . . . , A N , with inputs x = (x 1 , . . . , x N ) and outputs a = (a 1 , . . . , a N ). In any given run, one party, say A k , must act first, and none of the other parties can signal to them, which implies P(a k |x) = P(a k |x k ). The correlations shared by the remaining N − 1 parties, conditioned on the input and output of the first, must also in turn be causal. However, note that the causal order itself (and not only the response functions) of the remaining parties may depend on the input and output of the first, a phenomenon called dynamical causal order [23,25,53], and which goes beyond the standard model of fixed causal Bayesian networks.
An N-partite correlation P(a|x) is thus called causal if it can be decomposed in the following way [23,25]:

P(a|x) = ∑ k q k P k (a k |x k ) P k,x k ,a k (a \k |x \k ),   (41)

where x \k = (x 1 , . . . , x k−1 , x k+1 , . . . , x N ) and a \k = (a 1 , . . . , a k−1 , a k+1 , . . . , a N ), with q k ≥ 0, ∑ k q k = 1, and where for each k, x k , a k , P k,x k ,a k (a \k |x \k ) is a causal (N−1)-partite correlation (down to the lowest level of this recursive definition, where any 1-partite correlation is considered causal). Note that for N = 2 this reduces to Eq. (1). The entropic approach can be generalized to the multipartite scenario using a similar recursive method.

A. Causal Bayesian network method
It is instructive to first look into the details of the tripartite case -in which case we shall denote the parties Alice (A), Bob (B) and Charlie (C), as is standard -before generalizing the method to more parties. The general method follows that used for the bipartite case in Sec. III A, and the relevant conditional DAG is shown in Fig. 3. The set of observable variables to be considered here is S = {X, Y, Z, A, B, C}.
The polytope of tripartite causal correlations (i.e., of correlations of the form of Eq. (41)) can be written as the convex hull conv(P A ∪ P B ∪ P C ), where P A is the polytope of causal distributions consistent with Alice acting first and such that the remaining conditional correlation shared by Bob and Charlie is causal, and analogously for P B and P C . As a consequence, in order to define the polyhedron characterizing tripartite causal correlations, which we denote Γ causal ABC , we first need to define the corresponding Shannon polyhedra, namely Γ A , Γ B , Γ C , associated with each party acting first.
Let us thus consider Γ A . According to the recursive definition given in Eq. (41), for any x, a, the conditional entropy vector h xa BC = (H(T|X = x, A = a)) T⊂{Y,Z,B,C} for a correlation in P A must be contained in the bipartite causal polyhedron Γ causal BC for Bob and Charlie, as characterized in Sec. III A. Together with the facts that h must lie in the Shannon cone Γ S for the relevant variables, that all the inputs must be independent of each other, and that Alice's output must be independent of Bob's and Charlie's inputs (conditioned on her input), we obtain the characterization of Γ A given in Eq. (43), with similar expressions for Γ B and Γ C .

Following the same approach as in Sec. III A, we introduce a (now three-valued) switch variable Q (see Fig. 3). Similarly to what we observed in the bipartite case, the convex hull conv(Γ A , Γ B , Γ C ) contains the conditional entropy vectors (H(T|Q)) T⊂S for tripartite causal correlations. The inequalities characterizing conv(Γ A , Γ B , Γ C ) can again be transformed into inequalities satisfied by the entropy vector h̃ = (H(T)) T⊂S̃ , for variables in S̃ = S ∪ {Q}, by introducing a transformation T Q as in Eq. (16), thus defining the polyhedron conv Q (Γ A , Γ B , Γ C ) as before. Taking into account the Shannon constraints for all variables in S̃, the independence constraints CI Q = (X⊥⊥Y⊥⊥Z⊥⊥Q) and the bound H(Q) ≤ log 2 3, and finally projecting onto the observable variables in S, we obtain a polyhedron, which we denote (Γ causal ABC ) 0 , containing the entropy vectors of all tripartite causal correlations.

While this characterization is certainly valid, some subtleties arising from the differences between the probabilistic and entropic descriptions allow one to make it tighter. Specifically, certain conditions implied by the definition (41) need not be implied by the corresponding entropic definition outlined above.
For example, if P(abc|xyz) is a causal correlation, then the bipartite marginal distributions P x (bc|yz) = ∑ a P(abc|xyz) and P(bc|yz) = ∑ x P(x)P x (bc|yz) are both causal (as are the corresponding marginals for each other pair of parties) [25].
This implies that the entropy vectors (H(T|X)) T⊂{Y,Z,B,C} and (H(T)) T⊂{Y,Z,B,C} corresponding to a tripartite causal correlation must also satisfy all the inequalities characterizing the bipartite causal polyhedron Γ causal BC , which may not necessarily be implied by the characterization of (Γ causal ABC ) 0 above. We can thus tighten the previous characterization and define the tripartite causal polyhedron as in Eq. (45), 10 where [perms.] denotes the permutations of the preceding two terms for the other parties. Note that such extra constraints do not need to be imposed in the bipartite case, since the causality of all one-party marginals is equivalent to them being valid probability distributions, which is already ensured by the elemental inequalities.
To extend the above idea to the general multipartite case of Eq. (41), we simply define the corresponding polyhedra recursively (the notation should be self-evident), where CI A k denotes the set of independence constraints resulting from the assumptions that all parties' inputs are independent, i.e., X 1 ⊥⊥ · · · ⊥⊥X N , and that party A k acts first, which implies A k ⊥⊥X \k | X k . The causal polyhedron is then defined analogously, where CI Q now denotes the independence relation between all inputs and Q, i.e., X 1 ⊥⊥ · · · ⊥⊥X N ⊥⊥Q.

B. Counterfactual variable method
A similar generalization is possible also for the counterfactual method. Again, it is instructive to look first at the tripartite case. We start by defining the polyhedron Γ_A for the case in which Alice acts first, which is the analogue, for the counterfactual method, of the polyhedron in Eq. (43); the transformation T_X, of which T*_X is the dual, is again defined in a similar way as in Eq. (16). Similar definitions hold for Γ_B and Γ_C. The tripartite polyhedron of causal counterfactual inequalities can then be defined, following a reasoning similar to the previous case, as in Eq. (49), where M = {A_xyz, B_xyz, C_xyz}_{xyz} and Γ^causal_{BC|x} is defined by imposing the constraints characterizing Γ_BC (a priori defined for some variables B_yz, C_yz) on the variables B_xyz, C_xyz, with similar definitions for Γ^causal_{AC|y} and Γ^causal_{AB|z}. As for the case based on causal Bayesian networks, the construction in Eq. (49) can then be generalized to an arbitrary number of parties in a recursive way.

10. In Eq. (45) we abuse the notation slightly and denote by Γ^causal_{BC} the set of entropy vectors (H(T))_{T⊂S}, instead of (H(T))_{T⊂{Y,Z,B,C}}, which satisfy the constraints characterizing Γ^causal_{BC} as defined in Eqs. (17)-(18).

V. INFORMATION BOUNDS IN CAUSAL GAMES
One of the advantages of the entropic approach is that it allows information-theoretic constraints to be naturally imposed, derived, and interpreted [10,39]. As an illustration, we consider a simple application of our approach to understanding the role of bounded communication in causal games.
Consider the generalization of the GYNI game described in Sec. II A to arbitrary numbers of inputs and outputs, in which two parties try to maximize the winning probability p_succ = P(a = y, b = x). If the parties operate causally, then in any given round of the game only one-way communication may occur. One may be interested in the effect of limiting the amount of communication that can occur in any such round. In the entropic framework, this can easily be taken into account by adding an additional constraint of the form I(X : B) ≤ H(M) to Γ_{A≺B} in order to restrict B's dependency on X, and similarly imposing I(Y : A) ≤ H(M) on Γ_{B≺A}, where the variable M represents the message that is sent. For example, if the parties are permitted, in each round, to exchange a classical d-dimensional system, then H(M) = log₂ d. In general, the amount of one-way communication H(M) need not be specified in advance; it will appear as a parameter in our inequalities. By applying the approach of Sec. III A to this scenario, one finds that causal correlations must then obey an inequality stating that the two-way communication is similarly bounded by H(M). Although this is perhaps not unexpected, it shows the ease with which such bounds can be derived in the entropic framework.

A more subtle variant is obtained by considering a slight generalization of the causal game proposed by Oreshkov, Costa, and Brukner (OCB) in Ref. [14]. In this game, the goal is also for one party to guess the other party's input; in contrast to the GYNI game, however, an additional random input bit Y′ is given,^11 which determines whether it is Bob who should guess Alice's input (if Y′ = 0) or vice versa (if Y′ = 1). The parties thus now attempt to maximize the corresponding winning probability. An analogous entropic inequality can be obtained via a combination of the methods discussed in Sec. III.
Since the relevant direction of communication in each round of this game depends on the additional input Y′, we combine the DAG-based method for the variables A, B, X, Y with the counterfactual approach to condition on Y′. More precisely, one may take S = {A_{y′}, B_{y′}, X, Y, Q}_{y′} and M = {A_{y′}, B_{y′}, X, Y}_{y′}; the relevant causal constraints for the cones Γ_{A≺B} and Γ_{B≺A} and the polyhedron Γ^causal are the same as those imposed on A, B, X, Y, Q in the DAG-based method, except that now they are applied to each copy of the conditional variables A_{y′} and B_{y′}, and the communication bounds I(X : B_{y′}) ≤ H(M) and I(Y : A_{y′}) ≤ H(M) are imposed on the corresponding cones. Notice that, in this way, we are assuming that Q ⊥⊥ X ⊥⊥ Y ⊥⊥ Y′. Combining the above constraints with the analysis in Sec. III, one finds that causal correlations must obey Eq. (52). This inequality, for the special case of binary inputs and outputs and with H(M) = 1, was proposed in Ref. [58] as a potential principle to bound the set of correlations obtainable within the process matrix formalism,^12 in analogy with the celebrated information causality principle [39] that provides bounds on the strength of bipartite quantum correlations. Our approach allowed us to show that Eq. (52) indeed holds for causal processes, but it remains to be seen whether such a constraint on communication for causal correlations can be violated within the process matrix framework. This example, however, highlights the potential of the entropic approach to causal correlations for studying information-theoretic principles.
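The constraint I(X : B) ≤ H(M) has a simple operational reading: if Bob's output depends on Alice's input only through a d-valued message, the data-processing inequality caps the correlation at log₂ d. A minimal numerical sketch (in Python; the encoder/decoder strategy and all names are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(pxb):
    """I(X:B) from a joint distribution pxb[x, b]."""
    return H(pxb.sum(axis=1)) + H(pxb.sum(axis=0)) - H(pxb)

nx, d = 4, 2                 # 4-valued input X, but only a 1-bit message M
enc = [0, 1, 1, 0]           # illustrative encoder X -> M
dec = [2, 3]                 # illustrative decoder M -> B
pxb = np.zeros((nx, nx))
for x in range(nx):          # X uniform; B = dec(enc(X)) factors through M
    pxb[x, dec[enc[x]]] += 1.0 / nx

# Bob's dependence on X is capped by the message entropy: I(X:B) <= log2(d)
print(mutual_information(pxb) <= np.log2(d) + 1e-9)
```

Any strategy of this form, deterministic or noisy, respects the same bound; the entropic framework imposes it directly as a linear constraint on the cones Γ_{A≺B} and Γ_{B≺A}.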

VI. DISCUSSION
Since Bell first formulated his eponymous theorem, understanding the role of causality within quantum mechanics has been a central yet thorny goal. Complicating matters further, the very idea of a definite causal order has itself begun to be questioned. While sophisticated frameworks have been introduced in an effort to free physical theories from the shackles of a rigid causal structure, whether nature permits violations of causal inequalities remains an elusive question.
Against this backdrop, our aim in this paper was to introduce an entropic approach to studying causal correlations, and to this end we presented two complementary methods: the first based on the entropies of the variables appearing in the causal Bayesian networks describing causal scenarios, and the second based on a counterfactual description of the outcome variables appearing in such networks. Focusing on bipartite causal scenarios, we described in detail the successful application of both methods to derive nontrivial entropic causal inequalities, before showing how the characterizations can be generalized to multipartite scenarios. In contrast to the usual approach to causal correlations based on probability distributions, the entropic causal inequalities we derived using both methods are valid for any finite number of possible outcomes, as well as for any number of inputs in the case of the first method based on causal Bayesian networks, and thus provide a very concise description of causal correlations. We discussed the ability of the derived entropic causal inequalities to witness the noncausality of several interesting classes of noncausal correlations, but were nonetheless unable to find violations of the inequalities by correlations obtainable within the process matrix formalism [14] using qubit systems. In light of the coarse-grained description provided by entropic inequalities and the fact that the known violations of standard causal inequalities are in general rather small [24], this negative result is arguably unsurprising. The question of whether entropic causal inequalities can be violated within the process matrix formalism and (more importantly) by quantum correlations thus remains open.
More generally, our construction can be used to characterize arbitrary convex combinations of different causal Bayesian networks, and thus provides, for example, a natural tool to investigate stronger notions of multipartite Bell nonlocality [50,51,59,60] from the entropic perspective.
In view of this new framework for the study of causal correlations, we believe that several other directions of research can naturally be pursued. Here we focused on the Shannon entropies of the relevant variables, but it is known that, at least in particular scenarios, the same approach can be used to derive constraints using certain generalized entropies [61,62] and even non-statistical information measures such as the Kolmogorov complexity [35]. Can our framework be extended to these other information measures, and if so, are they more sensitive to violations of causality? Similarly, one may wonder whether the addition of non-Shannon-type inequalities to the entropic descriptions of causal correlations considered here might lead to tighter constraints [28,63,64]. More generally, it remains an open question whether the definition of causal correlations implies any additional constraints within the entropic description that might allow a tighter characterization, particularly in the multipartite case, similar to the additional constraints on marginal and conditional entropies imposed in Sec. IV.
Another important direction to consider would be the ability to formulate, and perhaps violate, information-theoretic principles [10] of causality. As a simple application, we provided a glimpse of one possible approach, showing how simple bounds on two-way communication can be derived for causal games where communication is limited in each direction. It would be interesting to see, in particular, whether such principles could be violated within the process matrix formalism and, if so, how this relates to the violation of causal inequalities. For example, does the violation of causal inequalities imply the violation of some principle implied by quantum mechanics? We expect our results to motivate these and many more future investigations.

Appendix A: Causal correlations with entropy vectors outside conv(Γ_{A≺B}, Γ_{B≺A})

In order to see that there are causal bipartite correlations with entropy vectors not contained in conv(Γ_{A≺B}, Γ_{B≺A}), consider the following counterexample. Take P_{A≺B}(ab|xy) = δ_{a,x} δ_{b,x} and P_{B≺A}(ab|xy) = δ_{a,y} δ_{b,y}, and consider the inputs x, y to be uniformly distributed, so that P_{A≺B}(xyab) = (1/4) P_{A≺B}(ab|xy) and P_{B≺A}(xyab) = (1/4) P_{B≺A}(ab|xy). The distribution P(ab|xy) = (1/2)[P_{A≺B}(ab|xy) + P_{B≺A}(ab|xy)] is thus also causal, but one can verify that the entropy vector for the distribution P(xyab) = (1/4) P(ab|xy) violates the first and last inequalities in (A1), with a value for the left-hand sides of 1 − (3/2) log₂(3/2) ≈ 0.123 > 0. A similar conclusion can also be reached for the method based on counterfactual variables: starting from the definitions of Γ_{A≺B} and Γ_{B≺A} in Eqs. (32) and (33), one finds that the inequalities characterizing conv(Γ_{A≺B}, Γ_{B≺A}) are precisely the same as the causal inequalities in Eqs. (36) and (37), except with bounds of 0 on the right-hand sides. One can easily verify that Eqs. (36) and (37) can be saturated by causal correlations (for some equal mixtures of correlations P_{A≺B} and P_{B≺A}), thus providing such a counterexample.
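The quoted violation can be checked numerically. The sketch below (Python) builds the mixed distribution P(xyab) and evaluates the entropic combination I(XY:AB) − I(X:A) − I(Y:B), which reproduces the quoted value 1 − (3/2) log₂(3/2) ≈ 0.123; note that identifying this combination with the inequalities in (A1) is our reading of the counterexample, since the explicit list is not reproduced in this excerpt:

```python
import itertools
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# P(x, y, a, b) = (1/4) * (1/2) [ delta_{a,x} delta_{b,x} + delta_{a,y} delta_{b,y} ]
P = np.zeros((2, 2, 2, 2))
for x, y in itertools.product(range(2), repeat=2):
    P[x, y, x, x] += 1 / 8      # A-first branch: a = b = x
    P[x, y, y, y] += 1 / 8      # B-first branch: a = b = y

# mutual informations from marginals (axes: 0=X, 1=Y, 2=A, 3=B)
I_XY_AB = H(P.sum(axis=(2, 3))) + H(P.sum(axis=(0, 1))) - H(P)
I_X_A = H(P.sum(axis=(1, 2, 3))) + H(P.sum(axis=(0, 1, 3))) - H(P.sum(axis=(1, 3)))
I_Y_B = H(P.sum(axis=(0, 2, 3))) + H(P.sum(axis=(0, 1, 2))) - H(P.sum(axis=(0, 2)))

lhs = I_XY_AB - I_X_A - I_Y_B   # equals 1 - (3/2) log2(3/2) ≈ 0.123 > 0
```

Here I(XY:AB) = 1/2 while I(X:A) = I(Y:B) = (3/4) log₂ 3 − 1, so the combination is strictly positive even though P is causal, confirming that conv(Γ_{A≺B}, Γ_{B≺A}) does not contain all causal entropy vectors.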

Appendix B: Bipartite entropic causal inequalities from the DAG method
The following is the full list of the ten (equivalence classes of) inequalities from the DAG method, up to their symmetries under the exchange of parties.

where X_{M_i} denotes the joint random variable associated with the subset M_i ∈ M. The linear constraints in Eqs. (32) and (33) can then be imposed after the projection. Hence, the use of S = M or S = {S} is irrelevant in this case, even at the level of the Shannon-cone description. A similar analysis can be applied to compare S̃ and S̃ = {S̃}, where S̃_i ∩ S̃_j = {Q} for all distinct S̃_i, S̃_j ∈ S̃. Even though the marginal scenario is the same as above, Eq. (35) involves the extra constraints given by conv_Q(Γ_{A≺B}, Γ_{B≺A}) and H(Q) ≤ 1, so the previous result does not directly apply. However, one can look at an intermediate projection Π_S(Γ_S̃), and then impose the constraints involving Q. Hence, to prove the equivalence of the two probability structures at the level of the Shannon description, it would be sufficient to prove that Γ_S = Π_S(Γ_S̃). Using again the result by Matúš [65], one obtains a corresponding inclusion; however, we do not know whether the equality Π_S(Γ_S̃ ∩ L_C({X_{S̃_i\{Q}} ⊥⊥ X_{S̃_j\{Q}} | Q}_{S̃_i, S̃_j ∈ S̃, i ≠ j})) = Π_S(⋂_{S̃_i ∈ S̃} Γ_{S̃_i}) holds in general, so we were unable to prove the equivalence.
Nonetheless, we stress that any differences in tightness between the entropic inequalities arising here from the choice of a particular probability structure are not a consequence of stricter physical assumptions (i.e., the existence of joint probability distributions), but rather of the different outer approximations of the true entropy cone via Shannon inequalities. We remark, however, that the choice of a minimal probability structure is computationally easier to handle due to the much lower number of variables; for example, compare the case S = M in Eq. (31), where Γ_S ⊂ R^12, with the corresponding case for S = {S}, where Γ_S ⊂ R^(2^8) = R^256. For an extensive discussion of the role of such constraints in the computation of tighter approximations to the entropy cone we refer the reader to Ref. [57].