An operational information decomposition via synergistic disclosure

Multivariate information decompositions hold promise to yield insight into complex systems, and stand out for their ability to identify synergistic phenomena. However, the adoption of these approaches has been hindered by there being multiple possible decompositions, and no precise guidance for preferring one over the others. At the heart of this disagreement lies the absence of a clear operational interpretation of what synergistic information is. Here we fill this gap by proposing a new information decomposition based on a novel operationalisation of informational synergy, which leverages recent developments in the literature of data privacy. Our decomposition is defined for any number of information sources, and its atoms can be calculated using elementary optimisation techniques. The decomposition provides a natural coarse-graining that scales gracefully with the system’s size, and is applicable in a wide range of scenarios of practical interest.


I. INTRODUCTION
The familiarity with which we relate to the notion of "information", due to its central role in our modern worldview, is at odds with the mysteries still surrounding some of its fundamental properties. Continued investigations in complexity science have highlighted the key role of synergy: interdependencies held by a group of variables, but not by any subset of them. Synergistic relationships have been shown to be instrumental in a wide range of systems, including the nervous system [1, 2], artificial neural networks [3], cellular automata [4], and music scores [5]. Furthermore, the concept of synergy traces a particularly promising road towards formalising the notion of "the whole being greater than the sum of the parts", one of the long-standing aims of complexity science [6].
There have been a number of attempts to formalise the notion of synergy within various frameworks, including redundancy-synergy balances [5, 7-9], information geometry [10, 11], and others. Within this literature, one of the most elegant and powerful proposals is the Partial Information Decomposition (PID) framework [12], which divides information into redundant (contained in every part of the system), unique (contained in only one part), and synergistic (contained in the whole, but not in any part) components. One peculiarity of the PID framework is the absence of precise prescriptions about how synergy should be quantified [13]; despite numerous efforts, an agreed-upon measure of synergy remains elusive [14-17]. Most approaches to quantifying synergy proceed by postulating axioms that encode some "intuitive" desiderata, which should ideally lead towards a unique measure, following the well-known axiomatic derivation of Shannon's entropy [18]. Unfortunately, a number of critical incompatibilities between some of these axioms have been reported [19, 20], which highlight the limitations of our intuition as a guide within the counterintuitive realms of high-order statistics.
Building on these remarks, we argue that measures of synergy with little concrete, operational meaning provide only a limited advance over previous qualitative criteria. Moreover, as argued by Kolchinsky [20], there might exist not a single but multiple reasonable definitions of synergy, and hence it is crucial to clarify what each proposed measure is capturing [21]. There have been a few attempts at formulating operational measures of unique information [15, 22], but these efforts are still in progress, and apply only indirectly to synergy [23]. Providing a clear operational meaning for synergy is, to the best of our knowledge, an important unresolved challenge.
In this paper we address these issues by proposing a synergy-centered information decomposition, rooted in the notion of synergistic data disclosure from the data privacy literature [24, 25] and the synergistic variables introduced in Ref. [26]. In this view, synergy is the information that can be disclosed about a system without compromising any of its parts. This measure is efficiently computable and is, to the best of our knowledge, the first to provide a direct operational interpretation for synergy. Based on this measure, we put forward a novel information decomposition, applicable to any number of source variables. Moreover, the operational meaning of our measure enables a natural coarse-graining that alleviates the super-exponential growth of terms in the decomposition, providing useful tools for many practical analysis scenarios.
The paper is structured as follows. First, Sec. II introduces our operational definition of synergy, and Sec. III uses it to build our proposed decomposition. The decomposition's coarse-graining is discussed in Sec. IV, and the special case of self-synergy in Sec. V. Finally, the relationship with other decompositions is studied in Sec. VI.

II. SYNERGY AND DATA DISCLOSURE
In line with the PID literature, our goal is to develop a method to decompose the information that a multivariate system X := (X 1 , . . . , X n ) provides about a target variable Y , as quantified by Shannon's mutual information I(X; Y ). Our approach consists of three steps: 1. Introduce synergistic channels, which convey information about X but not about any of its parts (Sec. II A); 2. Define synergistic disclosure as the maximum amount of information about Y that can be obtained through a synergistic channel on X (Sec. II B); and 3. Build an information decomposition by computing the synergistic disclosure for every node of a lattice, and using the Möbius inversion formula (Sec. III).
The rest of this section provides technical details about synergistic disclosure, building upon the work recently reported in Refs. [24,25].
A. Synergistic channels

Consider a system described by n variables, X := (X_1, ..., X_n), where each X_k takes values on a discrete alphabet 𝒳_k of cardinality |𝒳_k|. Consider also a channel that is applied on X to generate a scalar observable V, characterised by a conditional distribution p_{V|X}. We are interested in a particular class of observables, which carry information about X while revealing no information about specific subsystems. Subsystems of X can be represented by sets of indices of the form α = {n_1, ..., n_k} ⊂ [n], with [n] := {1, ..., n} being a shorthand notation; the corresponding subsystem is denoted by X_α = (X_{n_1}, ..., X_{n_k}).
In the sequel we also consider collections of subsystems, which are represented by source-sets of the form α = {α_1, ..., α_L}, where α_j ⊂ [n] for all j = 1, ..., L. For example, possible source-sets for n = 2 are {∅}, {{1}}, {{1}, {1, 2}}, etc. With the notion of source-set in hand, we can formally define synergistic channels as follows.

Definition 1. Given a source-set α = {α_1, ..., α_L}, a channel p_{V|X} is said to be α-synergistic if the resulting observable V is statistically independent of each of the subsystems X_{α_1}, ..., X_{α_L}. The set of all α-synergistic channels is denoted by
$$ \mathcal{C}(X; \alpha) := \{ p_{V|X} : I(X_{\alpha_i}; V) = 0 \ \ \forall i = 1, \ldots, L \}. \qquad (1) $$
A variable V generated via an α-synergistic channel is said to be an α-synergistic observable.
Due to the independence constraints, an α-synergistic observable V satisfies I(X_{α_i}; V) = 0 for all i = 1, ..., L. Thus the name synergistic: by construction, an α-synergistic observable V might convey information about the whole, X, while disclosing no information about the corresponding parts X_{α_1}, ..., X_{α_L}.
Example 1. If X = (X_1, X_2) are two independent fair coins, then the observable V = X_1 XOR X_2 is a {{1}, {2}}-synergistic observable: V carries one bit of information about the pair (X_1, X_2), yet satisfies I(V; X_1) = I(V; X_2) = 0.

Note that p_{V|X} can be depicted as a rectangular matrix. Elegant algebraic methods for characterising synergistic channels based on this matrix are available, and are presented in Appendix A.
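As a quick numerical check of Example 1, the following minimal Python sketch (an illustration of ours; the helper mutual_information is defined here for convenience and is not part of the paper's software package [45]) builds the joint distribution of the two coins and the XOR observable, and verifies the independence constraints.

```python
import numpy as np
import itertools

def mutual_information(pxy):
    """Mutual information (in bits) of a 2D joint probability table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])))

# Joint distribution of (X1, X2, V) with V = X1 XOR X2 and independent fair coins.
p = np.zeros((2, 2, 2))
for x1, x2 in itertools.product([0, 1], repeat=2):
    p[x1, x2, x1 ^ x2] = 0.25

p_x1_v = p.sum(axis=1)   # joint of (X1, V)
p_x2_v = p.sum(axis=0)   # joint of (X2, V)
p_x_v = p.reshape(4, 2)  # joint of ((X1, X2), V)

print(mutual_information(p_x1_v))  # 0.0 bits: V reveals nothing about X1
print(mutual_information(p_x2_v))  # 0.0 bits: V reveals nothing about X2
print(mutual_information(p_x_v))   # 1.0 bit: V reveals one bit about the whole
```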
B. Fundamental properties of synergistic disclosure

Now that we have defined synergistic channels, we are in a position to formulate our measure of synergistic disclosure. To do this, consider a target variable Y, potentially correlated with X according to a given joint distribution p_{X,Y}. We are interested in quantifying to what extent the collective properties of X can predict Y without revealing any information about the subsystems X_{α_1}, ..., X_{α_L}. This intuition can be naturally operationalised by the mutual information between Y and the synergistic observables of X, as described in the next definition.
Definition 2. The α-synergy between sources X and target Y is defined as
$$ S^{\alpha}(X \to Y) := \sup_{p_{V|X} \in \mathcal{C}(X;\alpha)} I(V; Y). \qquad (3) $$

Above, the supremum is calculated over all the α-synergistic channels p_{V|X}, so that the joint distribution of (V, Y) over which I(V; Y) is calculated is of the form
$$ p_{V,Y}(v, y) = \sum_{x} p_{V|X}(v|x)\, p_{X,Y}(x, y). $$

Importantly, Definition 2 has a concrete operational interpretation [24, 25] that follows from Shannon's source coding theorem [27]: S^α(X → Y) represents the amount of information about Y that can be disclosed from X while revealing no information about X_{α_i} for i = 1, ..., L. This strongly contrasts with previous approaches to information decomposition, which have proceeded by writing down an axiomatic base and then formulating a measure consistent with those axioms. Note that in information theory most problems are operational in nature [28], and hence one could argue that this approach lies closer to Shannon's original contribution.
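To make the structure of Definition 2 concrete, the sketch below evaluates the objective I(V; Y) for a single candidate channel; it is an illustrative fragment of ours (the function name disclosure_objective is not from the paper's code), not the optimisation itself, which is discussed next.

```python
import numpy as np

def mutual_information(pxy):
    """Mutual information (in bits) of a 2D joint probability table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])))

def disclosure_objective(p_v_given_x, p_xy):
    """
    I(V; Y) for a candidate channel p(v|x) applied to sources with joint p(x, y).
    Rows of p_v_given_x are indexed by x, columns by v; the Markov chain V - X - Y
    gives p(v, y) = sum_x p(v|x) p(x, y).
    """
    p_vy = p_v_given_x.T @ p_xy
    return mutual_information(p_vy)

# Example: the XOR channel from Example 1 applied to an AND-gate joint distribution.
p_xy = np.array([[0.25, 0.0],    # x = (0, 0)
                 [0.25, 0.0],    # x = (0, 1)
                 [0.25, 0.0],    # x = (1, 0)
                 [0.0, 0.25]])   # x = (1, 1)
xor_channel = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [0.0, 1.0],
                        [1.0, 0.0]])
# Value of the objective for this particular channel (not necessarily the supremum).
print(disclosure_objective(xor_channel, p_xy))  # approx 0.311 bits
```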
Let us now explore a few basic properties of our new measure of synergy, S^α. A first fortunate feature is that this quantity is computable via simple optimisation techniques; the following result is a direct extension of Ref. [24, Theorem 1].
Theorem 1. The supremum in Eq. (3) is always attained, and the corresponding synergistic channel can be obtained as the solution to a standard linear-programming problem.
Despite the guarantees provided by this result, it is useful to have simple bounds. Note that, due to the data processing inequality, S^α satisfies S^α(X → Y) ≤ I(X; Y) for all α. The following result introduces a less trivial upper bound.

Proposition 1. For any source-set α = {α_1, ..., α_L}, the α-synergy satisfies
$$ S^{\alpha}(X \to Y) \ \le\ \min_{j = 1, \ldots, L} I(X_{-\alpha_j}; Y \mid X_{\alpha_j}), $$
where X_{-α_j} denotes the variables of X whose indices are not in α_j.

Proof. See Appendix B.

The above property is particularly useful to calculate S^α: if one finds a particular synergistic observable that attains this upper bound, then it is clear that it is maximal. One immediate consequence of this Proposition, noting that I(X_{-α_j}; Y | X_{α_j}) = I(X; Y) − I(X_{α_j}; Y), is that
$$ I(X; Y) - S^{\alpha}(X \to Y) \ \ge\ \max_{j = 1, \ldots, L} I(X_{\alpha_j}; Y). $$
In other words, the amount of non-synergistic information is lower-bounded by the amount of information carried by the most strongly correlated subgroup.
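The following sketch evaluates the upper bound of Proposition 1, as stated above, for the source-set α = {{1}, {2}} on a small illustrative distribution (an AND gate with independent fair-coin inputs); the distribution and the helper functions are chosen by us for illustration.

```python
import numpy as np
from itertools import product

def entropy(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Joint table p[x1, x2, y] for an AND gate with independent fair-coin inputs.
p = np.zeros((2, 2, 2))
for x1, x2 in product((0, 1), repeat=2):
    p[x1, x2, x1 & x2] = 0.25

def cmi_x2_y_given_x1(p):
    """I(X2; Y | X1) = H(X1,X2) + H(X1,Y) - H(X1,X2,Y) - H(X1)."""
    return (entropy(p.sum(axis=2)) + entropy(p.sum(axis=1))
            - entropy(p) - entropy(p.sum(axis=(1, 2))))

def cmi_x1_y_given_x2(p):
    """I(X1; Y | X2), by the analogous identity."""
    return (entropy(p.sum(axis=2)) + entropy(p.sum(axis=0))
            - entropy(p) - entropy(p.sum(axis=(0, 2))))

upper_bound = min(cmi_x2_y_given_x1(p), cmi_x1_y_given_x2(p))
print(upper_bound)  # any {{1},{2}}-synergistic disclosure about Y is at most this value
```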
Further details on S^α, including properties of its bounds, algebraic properties, and a data processing inequality, are presented in Section VI A and Appendix C.

III. INFORMATION DECOMPOSITION
This section uses the functional definition of α-synergy to formulate our proposed information decomposition. For this, we focus on the sets of constraints of the form α = {α_1, ..., α_L}, which are the argument of the synergy S^α(X → Y). For such sets, we say |α| := L is the cardinality of the source-set.

A. The extended constraint lattice
Let us start by observing that not all source-sets yield distinct classes of synergistic channels. As a simple example, if α = {{1, 2}} and β = {{1}, {1, 2}} one has that C(X; α) = C(X; β), as all the additional constraints in β are subsumed by the constraints in α. More formally, we say that two source-sets are equivalent, denoted by α ≡_I β, if C(X; β) = C(X; α). Our next result shows that the set of anti-chains
$$ \mathcal{A}^* := \{ \alpha = \{\alpha_1, \ldots, \alpha_L\} : \alpha_i \not\subseteq \alpha_j \ \text{for all} \ i \neq j \} \qquad (6) $$
contains exactly one member of each equivalence class, and this member is the simplest such source-set.

Lemma 1. Every source-set is equivalent to exactly one anti-chain in A*, namely the source-set of smallest cardinality within its equivalence class.

Proof. See Appendix D.
In other words, considering collections of indices that are not anti-chains would not provide new classes of channels, as broader subunits subsume smaller ones. This property is strongly reminiscent of Williams and Beer's redundancy lattice [12], which we will discuss in detail in Section VI [29].
In addition to the set of nodes, to build a lattice on which one can formulate a decomposition one needs a partial order relationship. Considering our setup, a natural candidate is the order introduced by James et al. in their proposed constraint lattice [30], defined by
$$ \alpha \preceq_c \beta \iff \forall \alpha_i \in \alpha, \ \exists \beta_j \in \beta \ \text{such that} \ \alpha_i \subseteq \beta_j $$
for α, β ∈ A*. Intuitively, α ⪯_c β means that all the constraints imposed by α are included within those imposed by β, and therefore C(X; β) ⊆ C(X; α).
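A minimal sketch of these order-theoretic notions, with function names of our own choosing, is given below; it checks the anti-chain property and the ordering ⪯_c as written above.

```python
from itertools import combinations

def is_antichain(alpha):
    """True if no element of the source-set is strictly contained in another."""
    return not any(a < b or b < a
                   for a, b in combinations(map(frozenset, alpha), 2))

def constraint_leq(alpha, beta):
    """alpha <=_c beta: every constraint in alpha is subsumed by some constraint in beta."""
    return all(any(frozenset(a) <= frozenset(b) for b in beta) for a in alpha)

# Example for n = 3 sources: protecting each single variable is a weaker requirement
# than protecting every pair of variables.
gamma1 = [{1}, {2}, {3}]
gamma2 = [{1, 2}, {1, 3}, {2, 3}]
print(is_antichain(gamma1), is_antichain(gamma2))  # True True
print(constraint_leq(gamma1, gamma2))              # True: gamma1 <=_c gamma2
print(constraint_leq(gamma2, gamma1))              # False
```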
Putting these structures together, we introduce the extended constraint lattice L* := (A*, ⪯_c), which extends the lattice introduced by James et al. [30]. The cases n = 2 and n = 3 are depicted in Fig. 1. Importantly, in contrast with James' proposal, L* includes nodes that do not cover all the sources. The resulting lattice is isomorphic in shape to Williams and Beer's redundancy lattice, but with different relationships between the nodes. Despite this similarity, however, comparisons between these two lattices are not straightforward (c.f. Sec. VI).
The lattice L* possesses some interesting properties, most prominently:

Lemma 2. If α, β ∈ L* and α ⪯_c β, then
$$ S^{\beta}(X \to Y) \ \le\ S^{\alpha}(X \to Y). $$

This result shows that S^α(X → Y) is a non-increasing function of α ∈ L* for any given variables X, Y. With this, one can propose the following decomposition based on the Möbius inversion formula [31]:

Definition 3. The synergy atoms S^α_∂(X → Y) correspond to the terms given by the Möbius inverse of S^α(X → Y); i.e. the unique set of values that satisfy
$$ S^{\alpha}(X \to Y) = \sum_{\beta \succeq_c \alpha} S^{\beta}_{\partial}(X \to Y) \qquad (9) $$
for all α ∈ A*.

Intuitively, the Möbius inversion can be understood as a discrete derivative over a lattice. In effect, an equivalent representation of the Möbius relationship is given by
$$ S^{\alpha}_{\partial}(X \to Y) = S^{\alpha}(X \to Y) - \sum_{\beta \succ_c \alpha} S^{\beta}_{\partial}(X \to Y), \qquad (10) $$
which is analogous to the fundamental theorem of calculus. The Möbius inversion yields synergy atoms of the form S^α_∂, which quantify how much information about the target is contained in the collective effects of variables α. For example, since the empty source-set imposes no constraints, S^∅(X → Y) = I(X; Y) [32]. This last identity, combined with Eq. (10), gives the following important result:

Proposition 2 (Information decomposition). The mutual information between X and Y can be decomposed as
$$ I(X; Y) = \sum_{\alpha \in \mathcal{A}^*} S^{\alpha}_{\partial}(X \to Y). \qquad (11) $$

Proof. Follows directly from noting that S^∅(X → Y) = I(X; Y), and combining this with Eq. (10).
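Since the extended constraint lattice can be represented explicitly for small n, the discrete-derivative form in Eq. (10) can be evaluated by a simple recursion over the lattice. The following generic sketch is our own illustration; it assumes the node values S^α have already been computed by some means.

```python
def moebius_atoms(values, leq):
    """
    Moebius inverse of `values`, a dict mapping each lattice node to its cumulative
    value, given a partial order `leq(a, b)` meaning a <=_c b.
    Uses the discrete-derivative form: atom(a) = value(a) - sum of atoms strictly above a.
    """
    atoms = {}
    # Process nodes from the top of the lattice downwards (smallest up-set first).
    order = sorted(values, key=lambda a: sum(leq(a, b) for b in values))
    for a in order:
        above = [b for b in values if leq(a, b) and b != a]
        atoms[a] = values[a] - sum(atoms[b] for b in above)
    return atoms

# Toy example on a 3-node chain 0 <= 1 <= 2 with values 5, 3, 1: atoms are 2, 2, 1.
print(moebius_atoms({0: 5, 1: 3, 2: 1}, lambda a, b: a <= b))
```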
The next section builds our intuition on this decomposition for small systems.

B. The case n = 2
After having formally presented the decomposition for n variables, let us focus on the bivariate (n = 2) case, and develop some intuitions about the resulting synergy atoms. For two predictors X = (X_1, X_2), Equation (11) yields
$$ I(X_1, X_2; Y) = S^{\emptyset}_{\partial} + S^{\{1\}}_{\partial} + S^{\{2\}}_{\partial} + S^{\{1\}\{2\}}_{\partial}, $$
where the argument (X → Y) of each atom is omitted for brevity. Here, S^{{1}{2}}_∂ can be understood as the information about Y that is related to collective properties of X that can be disclosed without compromising either X_1 or X_2. In turn, S^{{1}}_∂ is the information about Y that can be disclosed without revealing parts of X_1 but compromising X_2 (otherwise it would have been included in S^{{1}{2}}_∂ with the synergy), and S^{{2}}_∂ is defined analogously with the roles of X_1 and X_2 exchanged. Finally, S^∅_∂ is the information about Y that can only be disclosed by compromising both X_1 and X_2. A detailed comparison of these and the standard PID atoms is presented in Section VI.
FIG. 1. Extended constraint lattice for systems of n = 2 (left) and n = 3 (right) sources.

For the particular case where X_1 and X_2 are binary variables, the optimal synergistic channel only depends on their joint distribution, and not on the target variable, as shown in Ref. [24]. Interestingly, if X_1 and X_2 are independent fair coin flips, then the optimal channel is the XOR of the two inputs, so that
$$ S^{\{1\}\{2\}}(X \to Y) = I(X_1 \oplus X_2;\, Y) \ [34]. $$
This result shows that our definition of synergy effectively captures high-order statistical effects, which are most purely exhibited by XOR logic gates [35]. Analytical results for the more general case where X_1 and X_2 are binary, though not necessarily independent, are presented in Appendix E.
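Under the quoted result for independent fair-coin inputs, the synergy S^{{1}{2}}(X → Y) reduces to the mutual information between the XOR of the inputs and the target, which can be evaluated directly. The sketch below (an illustration of ours, valid only under that optimality result) does this for XOR and AND targets.

```python
import numpy as np
from itertools import product

def mutual_information(pxy):
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])))

def xor_disclosure(p_x1x2y):
    """I(X1 XOR X2; Y) from a joint table p[x1, x2, y] with binary X1, X2."""
    p_vy = np.zeros((2, p_x1x2y.shape[2]))
    for x1, x2 in product((0, 1), repeat=2):
        p_vy[x1 ^ x2] += p_x1x2y[x1, x2]
    return mutual_information(p_vy)

p_xor = np.zeros((2, 2, 2))  # target Y = X1 XOR X2
p_and = np.zeros((2, 2, 2))  # target Y = X1 AND X2
for x1, x2 in product((0, 1), repeat=2):
    p_xor[x1, x2, x1 ^ x2] = 0.25
    p_and[x1, x2, x1 & x2] = 0.25

print(xor_disclosure(p_xor))  # 1.0 bit: purely synergistic target
print(xor_disclosure(p_and))  # approx 0.311 bits
```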
With these results, it is straightforward to compute the decomposition in Eq. (11) for a few illustrative examples; results are presented in Table I. First, we notice that the paradigmatic distributions COPY and XOR have the expected 1 bit of redundancy and synergy, respectively, in agreement with our intuition for these cases. Similarly, the Unq.1 distribution shows only one non-zero atom, S^{{2}}_∂, which corresponds to unique information. The index of the atom, however, might seem counterintuitive; the confusion is resolved by noting that the superscript {2} refers to a constraint (the impossibility of disclosing what is in X_2), and hence S^{{2}}_∂ is more closely related to the contents of X_1. This illustrates a general theme: S^α, while operationally meaningful and intuitive, needs to be interpreted differently from other PIDs (c.f. Sec. VI).
As a further example, we compute the disclosure decomposition S^α_∂ for the result of an AND gate with correlated inputs (Fig. 2). As the inputs become more correlated, there is less information that can be disclosed without compromising either of them, and therefore the fraction of the total information that corresponds to S^∅_∂ grows as the correlation increases.

IV. THE BACKBONE DECOMPOSITION
As the extended constraint lattice L* grows extremely rapidly with system size, it is infeasible to examine every element of our proposed decomposition in all but very small systems. Luckily, the nature of S^α allows us to formulate a reduced collection of source-sets that form the "backbone" of the constraint lattice, which provides a natural summary of the system's high-order interactions.
In the sequel, Subsection IV A introduces the backbone lattice, then Subsection IV B discusses the backbone decomposition, and finally Subsection IV C illustrates these ideas with some examples.

A. The backbone constraint lattice
We introduce the backbone constraint lattice, denoted by B ⊂ L*, as the sublattice composed of the elements of A* of the form γ_m = {α ⊂ [n] : |α| = m} for m = 0, ..., n (the dependency on n is left implicit). Importantly, ⪯_c restricted to B provides a total order:
$$ \gamma_0 \preceq_c \gamma_1 \preceq_c \cdots \preceq_c \gamma_n. $$
For the synergy terms associated with B, we use the shorthand notation B^m(X → Y) := S^{γ_m}(X → Y). In simple words, B^m(X → Y) accounts for the information about Y that can be disclosed without compromising any group of m variables. Furthermore, as γ_{m−1} ⪯_c γ_m, the following chain of inequalities is guaranteed:
$$ I(X; Y) = B^{0}(X \to Y) \ \ge\ B^{1}(X \to Y) \ \ge\ \cdots \ \ge\ B^{n}(X \to Y) = 0, $$
where the first equality holds because γ_0 imposes no effective constraints, and the last because a γ_n-synergistic observable is independent of the whole X.

B. Backbone atoms

A new application of the Möbius inversion formula allows us to define backbone atoms, B^m_∂(X → Y), which we define as
$$ B^{m}_{\partial}(X \to Y) := B^{m-1}(X \to Y) - B^{m}(X \to Y), \qquad m = 1, \ldots, n. $$
Equivalently, the backbone atoms are the unique values that satisfy
$$ B^{m}(X \to Y) = \sum_{l = m+1}^{n} B^{l}_{\partial}(X \to Y) \qquad \text{for all } m = 0, \ldots, n. $$
Intuitively, B^{m−1} corresponds to the amount of information about Y that X can reveal without compromising any group of m − 1 variables; or, equivalently, information revealed by compromising only groups of m or more variables. Consequently, B^m_∂ quantifies the marginal gain of information that can be disclosed by relaxing the constraints from groups of m variables to groups of m − 1. For example, for m = 1, B^1(X → Y) measures how much information can be disclosed while keeping each X_j confidential, while B^1_∂(X → Y) corresponds to how much is gained when these constraints are relaxed. Additionally, note that these backbone atoms can be directly related to the synergy atoms in Eq. (9), as
$$ B^{m}_{\partial}(X \to Y) = \sum_{\substack{\alpha \succeq_c \gamma_{m-1} \\ \alpha \not\succeq_c \gamma_m}} S^{\alpha}_{\partial}(X \to Y). $$
Putting all these results together one finds a reduced decomposition, which is formalised by the following result.

Proposition 3 (Backbone decomposition). The mutual information between X and Y can be decomposed as
$$ I(X; Y) = \sum_{m=1}^{n} B^{m}_{\partial}(X \to Y). \qquad (18) $$
These backbone atoms provide a coarse-graining of the full decomposition in Eq. (11). A basic schematic of this backbone decomposition, as well as its relationship with the S^α_∂ atoms of the extended constraint lattice, is shown in Fig. 3. Importantly, note that the cardinality of the backbone lattice grows linearly with system size, and hence the number of atoms in Eq. (18) remains tractable for large systems.
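A minimal sketch of the backbone coarse-graining, using the difference form of the backbone atoms written above, is given below. The backbone terms themselves would come from the optimisation of Theorem 1; here they are placeholder numbers of our own choosing.

```python
def backbone_atoms(B):
    """
    Backbone atoms from the sequence B = [B^0, B^1, ..., B^n] of backbone terms,
    with B^0 = I(X; Y). Returns [B^1_d, ..., B^n_d], where B^m_d = B^{m-1} - B^m.
    """
    atoms = [B[m - 1] - B[m] for m in range(1, len(B))]
    assert abs(sum(atoms) - (B[0] - B[-1])) < 1e-12  # atoms sum to B^0 - B^n
    return atoms

# Toy illustration with made-up backbone values for n = 3 (not from the paper):
print(backbone_atoms([1.0, 0.6, 0.2, 0.0]))  # [0.4, 0.4, 0.2], summing to I(X;Y) = 1.0
```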

C. Examples
As an illustrative example of the potential of the backbone decomposition, let us apply it to scenarios where the relationship between X and Y can be expressed as a Gibbs distribution. In particular, we consider systems of n + 1 spins (i.e. 𝒳_i = {−1, 1} for i = 1, ..., n + 1) whose joint probability distributions can be expressed in the form
$$ p(x_1, \ldots, x_{n+1}) = \frac{1}{Z}\, e^{-\beta H_k(x_1, \ldots, x_{n+1})}, $$
where β is the inverse temperature, Z a normalisation constant, and H_k a Hamiltonian function of the form
$$ H_k(x_1, \ldots, x_{n+1}) = -\sum_{j=1}^{k} \sum_{\substack{I \subseteq [n+1] \\ |I| = j}} J_I \prod_{i \in I} x_i, \qquad (20) $$
with the last sum running over all collections of indices I ⊆ [n + 1] of cardinality |I| = j, and J_I the corresponding coupling coefficients. To calculate all quantities in this section we consider Y = X_{n+1} as target variable. Full simulation details are reported in Appendix F.

As a first test case, we consider Hamiltonians with interactions up to order k, as in Eq. (20) above. For these systems, we calculated the backbone term B^1(X → Y), which measures the strength of the high-order statistical effects beyond pairwise interactions (Fig. 4a). As expected, our results show that if the Hamiltonian only possesses first or second order interactions (i.e. k = 1 or 2) then B^1(X → Y) is negligible; and for k ≥ 3, B^1(X → Y) grows monotonically with k.
As a second test case, we studied Hamiltonians with source-target interactions only of order k, and computed their full backbone decomposition. Fig. 4b shows all the backbone atoms B^m_∂ (indexed on the horizontal axis), normalised by I(X; Y). Interestingly, for each Hamiltonian order k there is only one non-zero backbone atom, which suggests that I(X; Y) ≈ B^k_∂(X → Y). Note that this relationship between Hamiltonian interaction order and backbone atom is highly non-trivial, and finding analytical methods to make this connection more explicit is an open question.
These findings suggest that the backbone decomposition may provide an analogue to the measure of connected information introduced in Refs. [10,11], which captures the effects of Hamiltonian high-order terms over their corresponding Gibbs distributions [36]. The main difference between the connected information and the backbone decomposition is that in the former all variables play an equivalent role, while in the latter they are divided between sources and target.
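For concreteness, the following sketch builds a small Gibbs distribution of the form of Eq. (20) and computes I(X; Y) with Y = X_{n+1}. The choice of random Gaussian couplings J_I, the fixed seed, and all function names are ours, and do not reproduce the simulation settings of Appendix F.

```python
import numpy as np
from itertools import combinations, product

def gibbs_distribution(n_plus_1, k, beta, rng):
    """
    Joint distribution over n+1 spins with random couplings on all interactions
    of order exactly k (an illustrative choice). Returns an array of shape
    (2,)*(n+1) indexed by spin configurations, mapping s in {0,1} to (-1)**s.
    """
    couplings = {I: rng.normal() for I in combinations(range(n_plus_1), k)}
    p = np.zeros((2,) * n_plus_1)
    for config in product((0, 1), repeat=n_plus_1):
        spins = [(-1) ** s for s in config]
        H = -sum(J * np.prod([spins[i] for i in I]) for I, J in couplings.items())
        p[config] = np.exp(-beta * H)
    return p / p.sum()

def mutual_info_sources_target(p):
    """I(X; Y) in bits, with Y the last spin and X the remaining ones."""
    def entropy(q):
        q = q[q > 0]
        return float(-np.sum(q * np.log2(q)))
    flat = np.asarray(p).reshape(-1, 2)   # rows: source configs, cols: target values
    return entropy(flat.sum(axis=1)) + entropy(flat.sum(axis=0)) - entropy(flat.ravel())

rng = np.random.default_rng(0)
p = gibbs_distribution(n_plus_1=4, k=3, beta=1.0, rng=rng)
print(mutual_info_sources_target(p))
```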

V. SYNERGISTIC CAPACITY AND PRIVATE SELF-DISCLOSURE
So far, we have investigated the usual information decomposition scenario, in which a group of source variables X hold information about another, target variable Y . Using the tools developed so far, we can ask a new question: how much information can X disclose about itself under specific constraints? Answering this question will provide further intuitions on the nature of synergistic disclosure, while revealing some unexpected properties.
We start by presenting the definition of the self-disclosure of a system, which is a particular case of the formalism presented above.
Definition 4. The α-self-synergy of X is given by S^α(X → X), and is denoted simply by S^α(X).
This definition makes it straightforward to extend the concepts above to define self-synergy atoms S^α_∂(X), as well as backbone self-synergy terms and atoms, denoted by B^m(X) and B^m_∂(X), respectively.

Let us begin with an example, by computing the self-disclosure of binary bivariate distributions. Consider two binary variables X = (X_1, X_2), with P{X_1 = 1} = P{X_2 = 1} = p and P{X_1 = 1, X_2 = 1} = r (Fig. 5). Perhaps surprisingly, a direct calculation shows that maximal synergy is achieved for X_1, X_2 independent and p = 1/2, which is equivalent to the much-debated Two-Bit-Copy (TBC) gate commonly discussed in the PID literature [16, 37, 38]. To make sense of this result, consider the following bounds on the self-disclosure:

Lemma 3. For any X, Y the following bound holds:
$$ S^{\alpha}(X \to Y) \ \le\ S^{\alpha}(X) \ \le\ \min_{j = 1, \ldots, L} H(X_{-\alpha_j} \mid X_{\alpha_j}). $$

Proof. The upper bound is proven by an application of Proposition 1 with Y = X, and the lower bound by an application of Lemma 5.
This lower bound is particularly insightful, as it suggests that the synergistic self-disclosure of X is the tightest upper bound on the synergistic information that X could hold about any other target. Therefore, this (admittedly heterodox) perspective of synergy provides a clear explanation of why the TBC could have non-zero synergy, since it accounts for the "synergistic capacity" of its inputs.
Additionally, the upper bound in Lemma 3 provides a quick way to estimate how much synergy can be found with respect to a given set of sources X. For example, if (X_1, X_2) are two i.i.d. fair coins, Lemma 3 states that their synergy cannot be larger than 1 bit, which is attained by the optimal self-synergistic channel V* = X_1 XOR X_2 [39].
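The upper bound of Lemma 3 for the bivariate binary family of Fig. 5 can be evaluated in closed form; the sketch below (our own illustration, with the parametrisation p, r defined above) shows that the bound peaks at 1 bit for independent fair coins, consistent with the discussion above.

```python
import numpy as np

def h2(q):
    """Binary entropy in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def upper_bound_self_synergy(p, r):
    """
    min( H(X1|X2), H(X2|X1) ) for the bivariate binary family with
    P(X1=1) = P(X2=1) = p and P(X1=1, X2=1) = r.
    """
    joint = np.array([[1 - 2 * p + r, p - r],
                      [p - r, r]])
    h_joint = -sum(q * np.log2(q) for q in joint.ravel() if q > 0)
    # By symmetry H(X1|X2) = H(X2|X1) = H(X1, X2) - H(X1).
    return h_joint - h2(p)

print(upper_bound_self_synergy(0.5, 0.25))  # 1.0 bit: independent fair coins
print(upper_bound_self_synergy(0.5, 0.40))  # smaller: correlated coins disclose less
```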
Another natural conjecture, in light of the findings reported in Section IV C, would be to posit a relationship between self-synergy and connected information, as both measures treat all the corresponding variables symmetrically. However, numerical evaluations show that there is no such relationship between them. As a matter of fact, systems with low degrees of interdependency can have high levels of self-synergy while having low levels of connected information.
A final lesson that can be learnt from studying self-synergy is that high-order synergies are not rare corner cases, but are in fact prevalent in the space of probability distributions. More formally, our next result shows that B^m(X) accounts for most of the information contained in X as the system size grows.
Proposition 4. Consider a sequence of random variables X := (X_1, ..., X_n) for which there exists K ∈ ℕ such that |𝒳_k| ≤ K for all k ∈ ℕ. If lim_{n→∞} H(X)/n exists and is not zero, then for any fixed m ∈ ℕ
$$ \lim_{n \to \infty} \frac{B^{m}(X)}{H(X)} = 1. $$

Proof. See Appendix G.
Let us work an example to gain intuition on this seemingly counterintuitive result.

VI. RELATIONSHIP WITH OTHER INFORMATION DECOMPOSITIONS
This section explores the relationship of our proposed framework with other information decompositions. For this, Subsection VI A examines various properties of our definition of synergy in the light of axioms typically used in the PID literature, then Subsection VI B explores relationships between our decomposition and other PIDs, and finally Subsection VI C carries out numerical comparisons between our metrics and other well-known decompositions.

A. Axioms
In previous literature, partial information decomposition is usually discussed in terms of axioms, which encode various desirable properties that measures might or might not satisfy. These axioms are often formulated for redundancy measures, which, given that the basic constituent of our decomposition is a synergy measure, makes assessing our framework in these terms non-trivial. Nevertheless, this subsection explores some of the common axioms from the point of view of S^α, using as a guideline the set of axioms discussed in Ref. [40].
The following axioms are satisfied by our measure:

• (GP) Global positivity: S^α(X → Y) ≥ 0 for all X, Y and α ∈ A*.
• (Eq) Equivalence-class invariance: S^α(X → Y) is invariant under substitution of X_i or Y by an informationally equivalent random variable (i.e. relabeling).
• (wM) Weak monotonicity: S^α((X, Z) → Y) ≥ S^α(X → Y) for any additional source variable Z, where the source-set α refers to the same subsystems of X on both sides. Note that this does not hold for the backbone terms, as the corresponding α's are not equal.
• (T-DPI) Target data processing inequality: if X − Y − Z form a Markov chain, then S^α(X → Z) ≤ S^α(X → Y).

The proposed measure does not satisfy strong symmetry (sS), as it might be the case that S^α((X_1, X_2) → Y) ≠ S^α((X_1, Y) → X_2) [41].
We can prove by counterexample that S^α_∂ does not satisfy strong local positivity (LP), i.e. there exist distributions for which S^α_∂(X → Y) < 0 for some α ∈ A*, at least for n ≥ 3 [42]. For n = 2, numerical explorations did not find any case with S^α_∂ < 0, although we could not find a proof of (LP) either. On the other hand, note that the backbone atoms B^m_∂(X → Y) do satisfy (LP), as shown in Section IV.

B. General relationship with PID
In this section we focus on the relationship between our decomposition for the case of n = 2 (c.f. Sec. III B) and the standard PID. When considering α, β ∈ A := A* \ {∅}, the classic work of Williams and Beer [12] introduces the following partial ordering:
$$ \alpha \preceq_{wb} \beta \iff \forall \beta_j \in \beta, \ \exists \alpha_i \in \alpha \ \text{such that} \ \alpha_i \subseteq \beta_j. $$
While the difference between ⪯_wb and ⪯_c might seem subtle, they induce drastically different lattice structures. Traditional PID-type decompositions for two sources are based on the following conditions:
$$ I(X_i; Y) = \text{Red} + \text{Un}_i, \qquad I(X_i; Y \mid X_j) = \text{Un}_i + \text{Syn}, $$
which are valid for i, j ∈ {1, 2} with i ≠ j, where Red, Un_i, and Syn denote the redundant, unique, and synergistic components, respectively. A direct parallel between these terms and our framework can be made, and is shown in Table II. A key relationship between any PID and our decomposition comes from noticing that, considering Proposition 1 for X = (X_1, X_2) and α = {{1}}, one finds that
$$ S^{\{1\}}(X \to Y) \ \le\ I(X_2; Y \mid X_1). $$
Moreover, numerical evaluations show that this bound is often not attained, as illustrated by Figure 7 (see Appendix F for more details).
As a consequence of this, one has that
$$ S^{\{1\}}_{\partial}(X \to Y) + S^{\{1\}\{2\}}_{\partial}(X \to Y) \ \le\ \text{Un}_2 + \text{Syn}. $$
Conversely, an opposite relationship holds for the marginal mutual information:
$$ S^{\emptyset}_{\partial}(X \to Y) + S^{\{2\}}_{\partial}(X \to Y) \ \ge\ I(X_1; Y) = \text{Red} + \text{Un}_1. $$
By combining these two results, one can compare the co-information with a corresponding co-information obtained from our decomposition, as follows:
$$ S^{\emptyset}_{\partial}(X \to Y) - S^{\{1\}\{2\}}_{\partial}(X \to Y) \ \ge\ \text{Red} - \text{Syn} = I(X_1; Y) + I(X_2; Y) - I(X_1, X_2; Y). $$
This result implies that, when assessing the balance between redundancy and synergy, our decomposition always tends towards redundancy over synergy with respect to any PID decomposition. In this sense, one can say that, at least for n = 2, our decomposition is conservative when attributing dominance of synergies. The next section provides further evidence to support this claim.

C. Numerical comparisons with other PIDs
Let us now study how our proposed measure of synergy relates to the ones corresponding to other well-known decompositions. Our analysis includes the I_BROJA decomposition by Bertschinger et al. [15], Common Change in Surprisal (I_ccs) by Ince [14], I_min by Williams and Beer [12], I_dep by James et al. [30], and the pointwise decomposition by Finn and Lizier (I_±) [16]; all computed using the dit package [43]. To do this comparison, we draw random distributions from the probability simplex following an NSB prior (see Appendix F for details), and then compute their synergy values with all measures.

A first, somewhat striking result is the overwhelming correlation found between most proposed measures: BROJA, CCS, I_min, and I_dep are all related to each other with correlations greater than 0.94 for every pair (Fig. 8a). The two outliers in this plot are our proposed measure S^α and I_±, which are less well correlated with the rest and with each other (correlations range around 0.70 for S^{γ_1} and around 0.50 for I_±). To examine this discrepancy, we computed the inverse cumulative function of the resulting synergy values for the various measures (Fig. 8b). This curve shows the fraction of all sampled distributions that have a synergy greater than a given threshold, which gauges how prevalent synergy is judged to be according to each measure. Consistent with Fig. 8a, Fig. 8b shows that the measures BROJA, CCS, I_min, and I_dep all follow similar profiles. Interestingly, S^{γ_1} falls much faster than the rest, while I_± falls much more slowly. Therefore, our measure S^{γ_1} can be said to be more "restrictive," in the sense that it tends to assign lower values of synergy, while I_± is more lenient. We hypothesise that this "overestimation" of synergy by I_± happens because of its tendency to assign negative values to the redundant or unique information [44].

VII. CONCLUSION
This paper puts forward an operational definition of informational synergy, and uses it as a foundation to build a multivariate information decomposition. Compared to previous approaches to information decomposition, our framework possesses two key features:

1. It is a "synergy-first" decomposition, which begins by positing a measure of synergy and builds a decomposition from it, as opposed to previous approaches that are based on redundancy or unique information and obtain synergy as a by-product.
2. It is based on a quantity that is the optimal solution of a well-defined problem in the data privacy literature, which makes reasoning about the measure more transparent while bringing the decomposition closer to standard information-theoretic formulations.
We illustrated the capabilities of the proposed decomposition on various examples, and showed that it offers a complementary perspective to other information decompositions. In particular, our results show that our measure of synergy is in general more conservative than other approaches, as it tends to attribute smaller values of synergy. We also showed how its operational interpretation provides clear explanations for open questions in the field of information decomposition, such as the well-known two-bit-copy problem [14, 16, 20].
Moreover, our measure has an associated "backbone" decomposition, which provides a natural coarse-graining of the information atoms. Our results show that in some scenarios the backbone atoms provide a directed version of the well-known connected information, which captures the effect of high-order interaction terms within Gibbs distributions. The number of backbone atoms grows linearly with system size, which makes this decomposition practical for studying a wide range of systems of interest.
The operational approach taken in this work represents a step towards establishing a solid foundation in the field of information decomposition. Additionally, we provide an open-source software package [45] implementing the key quantities in this paper, opening the door for an exciting range of applications in data analysis, neuroscience, and information dynamics.
Appendix A

The constraints that define an α-synergistic channel can be encoded in a marginalisation matrix P_α of dimensions G × |X̃|, where
$$ G := \sum_{k=1}^{L} \prod_{i \in \alpha_k} |\mathcal{X}_i|, $$
and X̃ is the set of tuples x ∈ ∏_{k=1}^{n} 𝒳_k such that p_X(x) > 0. This matrix is designed such that the matrix product P_α p_X (with p_X being the probability vector of X) yields the marginals within p_X that need to be preserved by the synergistic channel, so that p_{X_{α_i}|V=v} coincides with p_{X_{α_i}} for every α_i ∈ α and every v. Note that P_α is a binary matrix, since the X_{α_i}'s are deterministic functions of X. As an example, if |𝒳_i| = 2 for all i ∈ [n] and α = {{1}, ..., {n}}, then P_α is a 2n × 2^n matrix that can be built recursively, stacking the marginalisation matrix of the first m − 1 variables (applied to both halves of the probability vector) together with the two rows that compute the marginal of the m-th variable:
$$ P_m = \begin{pmatrix} P_{m-1} & P_{m-1} \\ \mathbf{1}_{2^{m-1}}^{\top} & \mathbf{0}_{2^{m-1}}^{\top} \\ \mathbf{0}_{2^{m-1}}^{\top} & \mathbf{1}_{2^{m-1}}^{\top} \end{pmatrix}, $$
with P_α = P_n and P_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
With this matrix one can characterise the channels in C(X; α), as shown in the next lemma. This result is a direct extension of Ref. [24, Lemma 1], and is presented here for completeness.

Lemma 4. p_{V|X} ∈ C(X; α) if and only if (p_X − p_{X|v}) ∈ Null(P_α) for all v ∈ 𝒱.
Proof. Let X, Y and Z be discrete random variables that form a Markov chain X − Y − Z. Having X ⊥⊥ Z is equivalent to p_X(·) = p_{X|Z}(·|z), i.e., p_X = p_{X|z} for all z ∈ 𝒵. Furthermore, due to the Markov chain assumption, we have p_{X|z} = P_{X|Y} p_{Y|z} for all z ∈ 𝒵, and in particular, p_X = P_{X|Y} p_Y. Therefore, having p_X = p_{X|z} for all z ∈ 𝒵 results in P_{X|Y}(p_Y − p_{Y|z}) = 0 for all z ∈ 𝒵, or equivalently, p_Y − p_{Y|z} ∈ Null(P_{X|Y}) for all z ∈ 𝒵.
The proof is complete by noting that (i) X_{α_i} − X − V form a Markov chain for each index i ∈ [L], and (ii) Null(P_α) = ∩_{i=1}^{L} Null(P_{X_{α_i}|X}).
In summary, the matrix form of the (reverse) synergistic channel is related to p_X and the null space of P_α. The key take-away from this lemma is that one can compute the conditional distributions p_{X|v} of the synergistic channel by algebraic manipulation of P_α and p_X. Furthermore, this lemma has one important implication: the synergistic channels needed to compute the synergistic components of I(X; Y) with respect to a target variable Y depend only on p_X, not on p_{Y|X}.
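To illustrate Lemma 4 numerically, the sketch below builds the marginalisation matrix P_α for two binary variables and α = {{1}, {2}} (constructed directly rather than via the recursion above; the lexicographic state ordering and function names are our own choices) and inspects its null space.

```python
import numpy as np
from scipy.linalg import null_space

def marginalisation_matrix(n):
    """
    Build P_alpha for n binary variables and alpha = {{1}, ..., {n}}:
    a (2n x 2^n) binary matrix such that P_alpha @ p_X stacks the n single-variable
    marginals of the joint probability vector p_X (states ordered lexicographically).
    """
    rows = []
    for k in range(n):                       # marginal of X_{k+1}
        for value in (0, 1):
            row = [1.0 if ((state >> (n - 1 - k)) & 1) == value else 0.0
                   for state in range(2 ** n)]
            rows.append(row)
    return np.array(rows)

P = marginalisation_matrix(2)
p_x = np.full(4, 0.25)                       # two independent fair coins
print(P @ p_x)                               # stacked marginals: [0.5 0.5 0.5 0.5]
print(null_space(P).shape[1])                # dimension of Null(P_alpha): here 1

# By Lemma 4, any synergistic channel must satisfy (p_X - p_{X|v}) in Null(P_alpha).
# For two fair coins the null space is one-dimensional, spanned by a vector
# proportional to (1, -1, -1, 1): exactly the direction picked out by the XOR
# observable of Example 1.
delta = null_space(P)[:, 0]
print(delta / np.abs(delta).max())
```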