Partial Information Decomposition as a Unified Approach to the Specification of Neural Goal Functions

In many neural systems anatomical motifs are present repeatedly, but despite their structural similarity they can serve very different tasks. A prime example of such a motif is the canonical microcircuit of six-layered neocortex, which is repeated across cortical areas and is involved in a number of different tasks (e.g. sensory, cognitive, or motor tasks). This observation has spawned interest in finding a common underlying principle, a 'goal function', of information processing implemented in this structure. By definition such a goal function, if universal, cannot be cast in processing-domain specific language (e.g. 'edge filtering', 'working memory'). Thus, to formulate such a principle, we have to use a domain-independent framework. Information theory offers such a framework. However, while the classical framework of information theory focuses on the relation between one input and one output (Shannon's mutual information), we argue that neural information processing crucially depends on the combination of \textit{multiple} inputs to create the output of a processor. To account for this, we use a very recent extension of Shannon information theory, called partial information decomposition (PID). PID makes it possible to quantify the information that several inputs provide individually (unique information), redundantly (shared information) or only jointly (synergistic information) about the output. First, we review the framework of PID. Then we apply it to reevaluate and analyze several earlier proposals of information theoretic neural goal functions (predictive coding, infomax, coherent infomax, efficient coding). We find that PID makes it possible to compare these goal functions in a common framework, and also provides a versatile approach to design new goal functions from first principles. Building on this, we design and analyze a novel goal function, called 'coding with synergy'. [...]


Introduction
In many neural systems anatomical and physiological motifs are present repeatedly in the service of a variety of different functions. A prime example is the canonical cortical microcircuit that is found in many different regions of the six-layered mammalian neocortex. These different regions serve various sensory, cognitive, and motor functions, but how can a common circuit be used for such a variety of different purposes? This issue has spawned interest in finding a common abstract framework within which the relevant information processing functions can be specified.
Several solutions for such an abstract framework have been proposed previously, among them approaches that still use semantics to a certain extent (predictive coding with its initial focus on sensory perception), teleological ones that prescribe a goal based on statistical physics of the organism and its environment (free energy principle), and information theoretic ones that focus on local operations on information (Coherent Infomax). While these are all encouraging developments, they also raise the question of how to compare these approaches, and how many more possibilities of defining new approaches of this kind exist. Ideally, there would be an abstract framework that comprises these approaches as specific cases. This article suggests a possible starting point for the development of such a unifying framework.
By definition this framework cannot be cast in processing-domain specific language, such as 'edge-filtering', 'face perception', or 'visual working memory', but must avoid any use of semantics beyond describing the elementary operations that information processing is composed of (footnote 6). A framework that has these properties is information theory. In fact, information theory is often criticized exactly for its lack of semantics, i.e. for ignoring the meaning of the information that is processed in a system. As we will demonstrate here, this apparent shortcoming can be a strength when trying to provide a unified description of the goals of neural information processing. Moreover, by identifying separate component processes of information processing, information theory provides a meta-semantics that serves to better understand what neural systems do at an abstract level (for more details see [1]). Last, information theory is based on evaluating probabilities of events and is thereby closely related to the concepts and hypotheses of probabilistic inference that are at the heart of predictive coding theory [2,3,4,5]. Thus information theory is naturally linked to the domain-general semantics of this and related theories.
Based on the domain-generality of information theory, several variants of information theoretic goal functions for neural networks have been proposed. The optimization of these abstract goal functions on artificial neural networks leads to the emergence of properties also found in biological neural systems. This can be considered an amazing success of the information theoretic approach, given that we still know very little about general cortical algorithms. This success raises hopes for finding unifying principles in the flood of phenomena discovered in experimental neuroscience. Examples of successful, information-theoretically defined goal functions are Linsker's infomax [6], which produces receptive fields and orientation columns similar to those observed in primary visual cortex V1 [7]; recurrent infomax, which produces neural avalanches and an organization to synfire-chain-like behaviour [8]; and coherent infomax [9]. The goal function of coherent infomax is to find coherent information between two streams of inputs from different sources, one conceptualized as sensory input, the other as internal contextual information. As coherent infomax requires the precomputation of an integrated receptive field input as well as an integrated contextual input to be computable efficiently (and thereby, in a biologically plausible way), the theory predicted the recent discovery of two distinct sites of neural integration in neocortical pyramidal cells [10]. For details see the contribution of Phillips to this special issue. We will revisit some of these goal functions below and demonstrate how they fit into the larger abstract framework presented here, which aims at a unified description.
Apart from the desire for a unified description of the common goals of repeated anatomical motifs, there is a second argument in favor of using an abstract framework. This argument is based on the fact that a large part of neural communication relies on axonal transmission of action potentials and on their transformation into post-synaptic potentials by the receiving synapse. Thus, for neurons, there is only one currency of information. This fact has been convincingly demonstrated by the successful rewiring of sensory organs to alternative cortical areas that gave rise to functioning, sense-specific perception (see for example the cross-wiring, cross-modal training experiments in [11]). In sum, neurons only see the semantics inherent in the train of incoming action potentials, not the semantics imposed by the experimenter. Therefore, a neurocentric framework describing information processing must necessarily be abstract. From this perspective information theory is again a natural choice.
Classic Shannon information theory, however, mostly deals with the transmission of information through a communication channel with one input and one output variable. In a neural setting this would amount to asking how much information present at the soma of one cell reaches the soma of another cell across the connecting axons, synapses and dendrites, or how much information is passed from one circuit to another. Information processing, however, comprises more operations on information than just its transfer. A long tradition dating back all the way to Turing has identified the elementary operations on information as information transfer, active storage, and modification. Correspondingly, measures of information transfer have been extended to cover more complex cases than Shannon's channels, incorporating directed and dynamic couplings [12] and multivariate interactions [13], and measures of active information storage have also been introduced [14]. Information modification, which seemingly comprises subfunctions such as de novo creation and fusion of information, has however been difficult to define [15].
One reason for extending our view of information processing to more complicated cases is that even the most simple function from Boolean logic that any other logic function can be composed of (NAND, see for example [16], chapter 1) uses two distinct input variables and one output. While such a logic function could be described as a channel between the two inputs and the output, this does not do justice to the way the two inputs interact with each other. What is needed instead is an extension of classic information theory to three-way systems, describing how much information in the output of this Boolean function, or any other three-way processor of information, comes uniquely from one input, uniquely from the other input, how much they share about the output, and how much output information can only be obtained from evaluating both inputs jointly. These questions can be answered using an extension of information theory called partial information decomposition (PID) [17,18,19,20]. This article will introduce PID and show how to use it to specify a generic goal function for neural information processing. This generic goal function can then be adapted to represent previously defined neural information processing goals such as infomax, coherent infomax and predictive coding. This representation of previous neural goal functions in just one generic framework is highly useful to understand their differences and commonalities. Apart from a reevaluation of existing neural goal functions, the generic neural goal function introduced here also serves to define novel goals not investigated before.

Figure 1 (caption fragment): To establish the link to the coherent infomax literature we identify the input $X_1$ with the receptive field input $R$, which may be excitatory (e) or inhibitory (i), and which is summed. In the same way, $X_2$ is identified with the contextual input $C$. (C) Overlay of the coherent infomax neural processor on a layer 5 pyramidal cell, highlighting potential parallels to existing physiological mechanisms. Layer 5 cells created with the TREES toolbox [21], courtesy of Hermann Cuntz.
The remainder of the text will first introduce partial information decomposition, and then demonstrate its use to decompose the total output information of a neural processor. From this decomposition we derive a generic neural goal function "$G$", and then express existing neural goal functions as specific parameterizations of $G$. We will then discuss how the use of $G$ simplifies the comparison of these previous goal functions and how it helps to develop new ones.

Partial information decomposition
In this section we will describe the framework of partial information decomposition (PID) to the extent that is necessary to understand the decomposition of the mutual information between the output $Y$ of a neural processor and a set of two inputs $X_1$, $X_2$ (Figure 1). The inputs themselves may be multivariate random variables but we will not attempt to decompose their contributions further. This is linked to the fact that in many neurons contextual and driving inputs are first summed separately before being brought to interact to produce the output. This summation strongly reduces the parameter space and thereby makes learning tractable; see [22,23] (footnote 7). Therefore, we limit ourselves to the PID of the mutual information between one "left hand side" or "output" variable $Y$ and two "right hand side" or "input" variables $X_1$, $X_2$. That is, we decompose the mutual information $I(Y : X_1, X_2)$ (footnote 8), the total amount of information held in the set $\{X_1, X_2\}$ about $Y$ (footnote 9):

$$ I(Y : X_1, X_2) = \sum_{x_1 \in A_{X_1},\, x_2 \in A_{X_2},\, y \in A_Y} p(x_1, x_2, y) \log_2 \frac{p(y \mid x_1, x_2)}{p(y)} = H(Y) - H(Y \mid X_1, X_2), $$

where the $A_{\bullet}$ signify the support of the random variables and $H(\cdot)$, $H(\cdot \mid \cdot)$ are the entropy and the conditional entropy, respectively (see [25] for definitions of these information theoretic measures).
The PID of this mutual information addresses the questions:
1. What information does one of the variables, say $X_1$, hold individually about $Y$ that we cannot obtain from any other variable ($X_2$ in our case)? This information is the unique information of $X_1$ about $Y$: $I_{unq}(Y : X_1 \setminus X_2)$.
2. What information does the joint input variable $\{X_1, X_2\}$ have about $Y$ that we cannot get from observing both variables $X_1$, $X_2$ separately? This information is called the synergy, or complementary information, of $\{X_1; X_2\}$ with respect to $Y$: $I_{syn}(Y : X_1; X_2)$.
3. What information does one of the variables, again say $X_1$, have about $Y$ that we could also obtain by looking at the other variable ($X_2$) alone? This information is the shared information (footnote 10) of $X_1$ and $X_2$ about $Y$: $I_{shd}(Y : X_1; X_2)$.
Following [17], the above three types of partial information terms together by definition provide all the information that the set $\{X_1, X_2\}$ has about $Y$, and other sources agree on this [17,20,18,19], i.e.:

$$ I(Y : X_1, X_2) = I_{unq}(Y : X_1 \setminus X_2) + I_{unq}(Y : X_2 \setminus X_1) + I_{shd}(Y : X_1; X_2) + I_{syn}(Y : X_1; X_2). \tag{3} $$

Figure 2 is a graphical depiction of this notion by means of the partial information (PI-) diagrams introduced in [17]. In addition, there is agreement that the information one input variable has about the output should decompose into a unique and a shared part as:

$$ I(Y : X_1) = I_{unq}(Y : X_1 \setminus X_2) + I_{shd}(Y : X_1; X_2), \qquad I(Y : X_2) = I_{unq}(Y : X_2 \setminus X_1) + I_{shd}(Y : X_1; X_2). \tag{4} $$

For the treatment of neural goal functions we furthermore have to give PID representations of the relevant conditional mutual information terms. These can be obtained from equations 3 and 4 as:

$$ I(Y : X_1 \mid X_2) = I_{unq}(Y : X_1 \setminus X_2) + I_{syn}(Y : X_1; X_2), \qquad I(Y : X_2 \mid X_1) = I_{unq}(Y : X_2 \setminus X_1) + I_{syn}(Y : X_1; X_2). \tag{5} $$

Moreover, all parts of the PI-diagram are typically required to be non-negative to allow an interpretation as information terms. Due to the pioneering work of Williams and Beer [17] it is now well established that neither unique, nor shared, nor synergistic information can be obtained from the definitions of entropy, mutual information and conditional mutual information in classical information theory. Essentially, this is because we have an underdetermined system, i.e. we have fewer independent equations relating the output and inputs in classical information theory (three for two input variables) than we have PID terms (four for two input variables). For at least one of these PID terms a new, axiomatic definition is necessary, from which the others then follow, as per equations 3-5. To date, the equivalent axiom systems introduced by Bertschinger and colleagues [19], and by Griffith and Koch [20] have found the widest acceptance. They also yield results that are very close to an earlier proposal by Harder and colleagues [18]. All of these axiom systems lead to measures that are sufficiently close to a common sense view of unique, shared and synergistic information, and all satisfy equations 3-5. Hence, their exact details do not matter at first reading for the purposes of this paper, and will therefore be presented in Appendix B.

The one exception to this statement is that we have to mention here already that shared information may arise in the frameworks of Bertschinger et al. [19], Griffith and Koch [20], and also Harder et al. [18] for two reasons. First, there can be shared information because the two inputs $X_1$, $X_2$ have mutual information between them (termed source redundancy in [18], and source shared information here); this is quite intuitive for most. Second, shared information can arise because of certain mechanisms creating the output $Y$ (mechanistic redundancy in [18], mechanistic shared information here). This second possibility of creating shared information is less intuitive but nevertheless arises in all of the frameworks mentioned above. For example, the binary AND operation on two independent (identically distributed) binary random variables creates 0.311 bits of shared information in [19,18,20], and 0.5 bits of synergistic mutual information, while there is no unique information about the inputs in its output.

Footnote 7: Furthermore, the measures providing generally accepted decompositions [19,20] are at present only defined for two variables [24].
Footnote 8: As the concepts of unique, shared and synergistic information require a more fine-grained distinction of how individual variables are grouped, we employ the following extended notation that was introduced in [19] and is defined in Appendix A: ":" separates sets of variables between which mutual information or partial information terms are computed, ";" separates multiple sets of variables on one side of a partial information term, whereas "," separates variables within a set that are considered jointly (see Appendix A for examples).
Footnote 9: See notational definitions in Appendix A.
Footnote 10: Also known as redundant information in [17].
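The AND example can be checked numerically with a few lines of Python. The following is only a sketch: it computes the classical mutual information terms and does not implement any of the PID measures of [18,19,20]. Instead it assumes, as stated in the text, that the unique information of both inputs is zero, so that the shared part equals the single-input mutual information and the synergy is whatever remains of the total. The helper names (`marginal`, `entropy`, `mi`) are illustrative choices, not part of any library.

```python
from collections import defaultdict
from itertools import product
from math import log2

def marginal(p, idx):
    """Marginalize a joint distribution {outcome_tuple: prob} onto positions idx."""
    out = defaultdict(float)
    for outcome, prob in p.items():
        out[tuple(outcome[i] for i in idx)] += prob
    return out

def entropy(p, idx):
    """Shannon entropy (in bits) of the variables at positions idx."""
    return -sum(q * log2(q) for q in marginal(p, idx).values() if q > 0)

def mi(p, a, b):
    """Mutual information I(A : B) = H(A) + H(B) - H(A, B)."""
    return entropy(p, a) + entropy(p, b) - entropy(p, a + b)

X1, X2, Y = (0,), (1,), (2,)  # positions of the variables in each outcome tuple

# Y = X1 AND X2 with independent, uniformly distributed binary inputs
p_and = {(x1, x2, x1 & x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

i_total = mi(p_and, Y, X1 + X2)  # I(Y : X1, X2)
i_x1 = mi(p_and, Y, X1)          # I(Y : X1)
i_x2 = mi(p_and, Y, X2)          # I(Y : X2)

# Assumption taken from the text: AND carries no unique information.
unq1 = unq2 = 0.0
shd = i_x1 - unq1                  # single-input information = unique + shared
syn = i_total - unq1 - unq2 - shd  # the four PID terms sum to the total

print(f"I(Y : X1, X2) = {i_total:.3f} bits")
print(f"shared = {shd:.3f} bits, synergy = {syn:.3f} bits")
```

Under the zero-unique-information assumption the sketch reproduces the values quoted above (about 0.311 bits shared and 0.5 bits synergistic). For a general distribution the unique information is not known in advance and one of the axiomatic measures has to be used.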

A generic decomposition of the output information of a neural processor
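We use PID in this section to decompose the information $H(Y)$ that is contained in the output of a general neural processor (Figure 1) with two inputs (or input sets) $X_1$ and $X_2$ and an output $Y$:

$$ H(Y) = I(Y : X_1, X_2) + H(Y \mid X_1, X_2) = I_{unq}(Y : X_1 \setminus X_2) + I_{unq}(Y : X_2 \setminus X_1) + I_{shd}(Y : X_1; X_2) + I_{syn}(Y : X_1; X_2) + H(Y \mid X_1, X_2). \tag{6} $$

To arrive at a neural goal function we can add weight coefficients to each of the terms in the entropy decomposition above to specify how 'desirable' each one of these should be for the neural processor, i.e. we can specify a neural goal function $G$ as a function of these coefficients. Since all the terms in equation 6 are non-overlapping, and the coefficients can be chosen independently, this is the most generic way possible to specify such a goal function:

$$ G = \Gamma_0 I_{unq}(Y : X_1 \setminus X_2) + \Gamma_1 I_{unq}(Y : X_2 \setminus X_1) + \Gamma_2 I_{shd}(Y : X_1; X_2) + \Gamma_3 I_{syn}(Y : X_1; X_2) + \Gamma_4 H(Y \mid X_1, X_2), \tag{7} $$

which can also be rewritten with another set of coefficients $\gamma_i$ as:

$$ G = \gamma_0 I_{unq}(Y : X_1 \setminus X_2) + \gamma_1 I_{unq}(Y : X_2 \setminus X_1) + \gamma_2 I_{shd}(Y : X_1; X_2) + \gamma_3 I_{syn}(Y : X_1; X_2) + \gamma_4 H(Y), \tag{8} $$

where $\gamma_i = \Gamma_i - \Gamma_4$ for $i \in \{0, \dots, 3\}$ and $\gamma_4 = \Gamma_4$ (compare equation 7 and equation 6).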
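To make the relation between the two parameterizations concrete, here is a minimal Python sketch. It assumes that the weights in equation 7 are ordered as (unique $X_1$, unique $X_2$, shared, synergistic, residual entropy $H(Y \mid X_1, X_2)$) and that equation 8 differs only in placing the last weight on the total entropy $H(Y)$. The function names and all numerical values are hypothetical and chosen for illustration only.

```python
TERMS = ("unq1", "unq2", "shd", "syn")  # assumed ordering of the four PID terms

def goal_residual(pid, h_res, Gamma):
    """G with the last weight on the residual entropy H(Y | X1, X2)."""
    return sum(Gamma[i] * pid[t] for i, t in enumerate(TERMS)) + Gamma[4] * h_res

def goal_total(pid, h_res, gamma):
    """G with the last weight on the total output entropy H(Y)."""
    h_y = sum(pid[t] for t in TERMS) + h_res  # entropy decomposition of H(Y)
    return sum(gamma[i] * pid[t] for i, t in enumerate(TERMS)) + gamma[4] * h_y

def to_total_weights(Gamma):
    """Convert weights: gamma_i = Gamma_i - Gamma_4 (i = 0..3), gamma_4 = Gamma_4."""
    return [g - Gamma[4] for g in Gamma[:4]] + [Gamma[4]]

# Hypothetical PID values (in bits) and hypothetical weights
pid = {"unq1": 0.20, "unq2": 0.05, "shd": 0.30, "syn": 0.10}
h_res = 0.15
Gamma = [1.0, -1.0, 2.0, 0.5, -0.25]

g_residual = goal_residual(pid, h_res, Gamma)
g_total = goal_total(pid, h_res, to_total_weights(Gamma))
print(g_residual, g_total)  # the two forms agree up to rounding
```

The conversion works because the residual entropy is the total entropy minus the four PID terms, so moving the last weight from one to the other shifts each of the first four weights by the same amount.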
Note that training a neural processor will obviously change the value of the goal function in equation 7, but of course also change the relative composition of the entropy in equation 6.
This decomposition of the entropy and its parametrization are closely modeled on the approach taken by Kay and Phillips in their formulation of another versatile information theoretic goal function ("$F$", see below) for the coherent infomax principle [9,26,22,23].
In general, we will choose the formulation used in equation 7 because the conditional entropy does not overlap with the parts in the PI-diagram (Figure 2), but note that the formulation used in equation 8 may be useful when goals with respect to total bandwidth, rather than unused bandwidth, are to be made explicit. This could for example happen when neuronal plasticity acts to increase the total bandwidth of a neural processor (footnote 11).
In the next sections we introduce coherent infomax and analyze it by means of PID. We then show how to (re-)formulate infomax and predictive coding using specific choices of parameters for $G$. Last, we will introduce a neural goal function, called coding with synergy, that explicitly exploits synergy for information processing.

The coherent infomax principle
The coherent infomax principle (CIP) proposes an information theoretically defined neural goal function in the spirit of domain-independence laid out in the introduction, and a neural processor implementing this goal function [9,26,22,23]. The neural processor operates on information it receives from two distinct types of inputs $X_1$, $X_2$ and sends the results to a single output $Y$ (see Figure 1). The two distinct types of input in CIP were described as driving and modulatory, formally defined by their distinct roles in local processing as detailed in the coherent infomax principles CIP.1-CIP.4 below. Here we will denote the driving input by $X_1$, and the contextual input by $X_2$.
In the mammalian brain the driving input $X_1$ includes, but is not limited to, both external information received from the sensors and information retrieved from memory. The contextual input $X_2$ arises from diverse sources such as lateral long-range input from the same or different brain regions, descending inputs from hierarchically higher regions, and input via non-specific thalamic areas. Phillips, Clark and Silverstein [27] provide a recent in-depth review of this issue in relation to the evidence for such distinct inputs from several disciplines.
The coherent infomax principle (CIP) states the following four goals of information processing:

CIP.1
The output $Y$ should transmit information that is shared between the two inputs, so as to enable the processor to preferentially transmit information from the driving inputs ($X_1$) that is supported by context-carrying information from internal sources elsewhere in the system arriving at input $X_2$. This is what the term 'coherent' refers to.

CIP.2
The output $Y$ could transmit some information that is only in the driving input $X_1$, but not in the context, so as to enable local processors to transmit some information that is not related to the information currently available to them from elsewhere in the system.

CIP.3
The output $Y$ should minimize transmission of information that is only in the contextual input $X_2$. This is necessary to ensure that the effects of the context do not become confounded with the effects of the drive and thereby reduce the reliability of coding.

CIP.4
The output Y should be optimally used in terms of bandwidth.
To state these goals more formally, Kay and Phillips first decomposed the total entropy of the output, $H(Y)$, as:

$$ H(Y) = I(Y : X_1 : X_2) + I(Y : X_1 \mid X_2) + I(Y : X_2 \mid X_1) + H(Y \mid X_1, X_2), \tag{9} $$

where the three-term multi-information $I(Y : X_1 : X_2)$ is defined as:

$$ I(Y : X_1 : X_2) = I(Y : X_1) - I(Y : X_1 \mid X_2) = I(Y : X_2) - I(Y : X_2 \mid X_1) = I(X_1 : X_2) - I(X_1 : X_2 \mid Y). \tag{10} $$

Kay and Phillips then re-weighted the terms of this decomposition by coefficients $\Phi_i$ to obtain a generic information theoretic goal function $F$ as:

$$ F = \Phi_0 I(Y : X_1 : X_2) + \Phi_1 I(Y : X_1 \mid X_2) + \Phi_2 I(Y : X_2 \mid X_1) + \Phi_3 H(Y \mid X_1, X_2). \tag{11} $$

Here, the first term, $I(Y : X_1 : X_2)$, was meant to reflect the information in the output that is shared between the two inputs, the second term the information in the output that was only in the driving input, the third term the information in the output that was only in the contextual input, while the last term represents the unused bandwidth (see Figure 3 for a graphical representation of these terms). Below, these assignments will be investigated using PID.
In previous work [9], the goal of coherent infomax was implemented by setting $\Phi_0 = 1$, $\Phi_1 = \Phi_2 = \Phi_3 = 0$, leading to the objective function $I(Y : X_1 : X_2)$. While this objective function appears not to explicitly embody any asymmetry between the influences of the $X_1$ and $X_2$ inputs, it is important to realize that the modulatory role played by the contextual input $X_2$ is expressed through the special form of activation function introduced in Phillips et al. (1995), and defined in Appendix 7.4. The possibility of expressing this asymmetry explicitly in the objective function was also discussed in [9,26] by taking $\Phi_0 = 1$, $0 \le \Phi_1 < 1$, $\Phi_2 = \Phi_3 = 0$, leading to the goal function

$$ F_{CIP} = I(Y : X_1 : X_2) + \Phi_1 I(Y : X_1 \mid X_2), \tag{12} $$

which is a weighted combination of the multi-information and the information between $Y$ and the driving input $X_1$ conditional on the contextual input $X_2$. This last term was meant to represent information that was both in the output $Y$ and the driving input $X_1$, but not in the contextual input $X_2$.
Next, we will investigate how this goal function $F_{CIP}$ implements the goals CIP.1-CIP.4 when these are restated using the language of PID.

F as seen by PID
We first take the generic goal function $F$ from equation 11, which is independent of CIP proper, rewrite it as a sum of mutual information terms, and decompose these using PID. We will sort the resulting decomposition by PID terms and compare this result to the general goal function $G$. This will tell us about the space of goal functions covered by $F$. Knowing this space is highly useful as a working neural network implementation of $F$ with learning rules exists (reviewed in [22,23]). This implementation can also be used to implement goal functions formulated in the precise PID framework based on $G$, whenever the specific $G$ that is of interest lies in the space that can be represented by $F$.
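We begin by decomposing $F$ into mutual information terms:

$$ F = \Phi_0 \left[ I(Y : X_1) - I(Y : X_1 \mid X_2) \right] + \Phi_1 I(Y : X_1 \mid X_2) + \Phi_2 I(Y : X_2 \mid X_1) + \Phi_3 H(Y \mid X_1, X_2), $$

which, using the PID equations 3-5, and collecting PID terms, turns into:

$$ F = \Phi_1 I_{unq}(Y : X_1 \setminus X_2) + \Phi_2 I_{unq}(Y : X_2 \setminus X_1) + \Phi_0 I_{shd}(Y : X_1; X_2) + (\Phi_1 + \Phi_2 - \Phi_0) I_{syn}(Y : X_1; X_2) + \Phi_3 H(Y \mid X_1, X_2). $$

Comparing this to the general PID goal function $G$, with its coefficients ordered as unique information of $X_1$, unique information of $X_2$, shared information, synergistic information and residual entropy, we see that the coefficients $\Gamma = [\Gamma_0 \dots \Gamma_4]$ and $\Phi = [\Phi_0 \dots \Phi_3]$ are linked by the matrix $\Omega$ as:

$$ \Gamma^T = \Omega \, \Phi^T, \qquad \Omega = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ -1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}. $$

Since $\Omega$ is not invertible (it maps four coefficients onto five), there are parameter choices in terms of $\Gamma$ that have no counterpart in $\Phi$. These are described by the complement of the range of this matrix (the null space of $\Omega^T$). This one-dimensional subspace is described by (footnote 12):

$$ \Gamma = a \, [1, 1, -1, -1, 0], \qquad a \in \mathbb{R}. $$

The existence of this subspace of coefficients not expressible in terms of the $\Phi_i$ means that it is impossible to prescribe the goal of simultaneously maximizing synergistic and shared information, while minimizing the two unique contributions, and vice versa, when using $F$. Ultimately, the existence of a subspace not representable by the $\Phi_i$ is a consequence of the fact that PID terms cannot be expressed using classic information theory (while $F$ in contrast was defined from classical information theoretic terms only).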
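The link between the two coefficient sets can be sketched in Python as follows. The matrix below is an assumption: it takes the $\Gamma$ coefficients to be ordered as (unique $X_1$, unique $X_2$, shared, synergistic, residual entropy) and uses the fact that the multi-information equals shared minus synergistic information, while each conditional mutual information equals the respective unique information plus the synergy. The names `OMEGA`, `NORMAL`, `phi_to_gamma` and `representable` are illustrative.

```python
# Rows: Gamma_0 (unique X1), Gamma_1 (unique X2), Gamma_2 (shared),
#       Gamma_3 (synergy), Gamma_4 (residual entropy H(Y | X1, X2)).
# Columns: Phi_0 .. Phi_3. The row ordering is an assumption of this sketch.
OMEGA = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [-1, 1, 1, 0],
    [0, 0, 0, 1],
]

# Direction orthogonal to every column of OMEGA (spans the null space of its transpose)
NORMAL = [1, 1, -1, -1, 0]

def phi_to_gamma(phi):
    """Map the four F coefficients onto the five G coefficients."""
    return [sum(w * f for w, f in zip(row, phi)) for row in OMEGA]

def representable(Gamma, tol=1e-12):
    """A Gamma vector has a counterpart in Phi only if it is orthogonal to NORMAL."""
    return abs(sum(n * g for n, g in zip(NORMAL, Gamma))) < tol

# Coherent infomax with Phi_1 = 0.5 (an arbitrary illustrative value)
gamma_cip = phi_to_gamma([1, 0.5, 0, 0])
print(gamma_cip, representable(gamma_cip))

# Maximize shared and synergistic information while minimizing both unique terms
gamma_target = [-1, -1, 1, 1, 0]
print(gamma_target, representable(gamma_target))
```

In this sketch every $\Gamma$ produced by `phi_to_gamma` satisfies $\Gamma_0 + \Gamma_1 - \Gamma_2 - \Gamma_3 = 0$, which is why the second target, lying along the excluded direction, cannot be reached by any choice of $\Phi$.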

The coherent infomax principle as seen by PID
For the investigation of the specific goal function $F_{CIP}$, we first want to clarify how we understand the four goals listed in the previous section. To this end we identify them one to one with goals in terms of PID as:
1. → CIP.1: The output should contain as much shared information $I_{shd}(Y : X_1; X_2)$ as possible.
2. → CIP.2: [...]
3. → CIP.3: The output should minimize unique information $I_{unq}(Y : X_2 \setminus X_1)$.
4. → CIP.4: [...]

With respect to item 1 on this list, it is important to recall from section 2 that shared information can arise from mutual information between the sources (source shared information) or be created by a mechanism in the processor (mechanistic shared information). Kay and Phillips had in mind the first of these two possibilities.
To see whether $F_{CIP}$ indeed reflects these goals as stated via PID, we look at the specific choice of parameters, $\Phi_0 = 1$, $0 \le \Phi_1 < 1$, $\Phi_2 = \Phi_3 = 0$, that was used to implement the coherent infomax principle, and find using equations 3-5 (the reader may also verify this graphically using Figure 3):

$$ F_{CIP} = I_{shd}(Y : X_1; X_2) + \Phi_1 I_{unq}(Y : X_1 \setminus X_2) - (1 - \Phi_1) I_{syn}(Y : X_1; X_2). $$

We will now discuss the various contributions to $F_{CIP}$ in detail, starting with the shared information, which figures most prominently in the goals CIP.1-CIP.4.

Shared information. We see that shared information is maximized. This shared information contains contributions from mutual information between the sources (source shared information) as well as shared information created by mechanisms in the processor (mechanistic shared information, see the note on item 1 above). The first type of shared information is the one aimed for in CIP.1. Thus, for inputs that are not independent the coherent infomax goal function indeed maximizes source shared information as desired. We will investigate the case of independent inputs below.
Unique information. In addition to the shared information, the unique information from the driving input is also maximized, albeit to a lesser degree. In contrast, synergy between the output and the combined inputs is minimized. Therefore, goals 1, 2 and 3 are expressed explicitly in this objective function but there is no explicit mention of minimizing the output bandwidth.
Synergistic information. Among the PID terms, synergy is discouraged. This may at first seem surprising as the mapping of goals of coherent infomax to PID did not appear to make any explicit statements about synergistic components, unless one views the transmission of undesirable synergistic components as being an extra component of the bandwidth (along with $H(Y \mid X_1, X_2)$) that is not used in the optimal attainment of goals 1-3. Nevertheless the minimization of synergy serves the original goals of coherent infomax. This can be seen when we consider that these were formulated for two different types of inputs, driving and modulatory. For these two types of input, the goal of coherent infomax is to use the modulatory inputs to guide transmission of information about the driving inputs. Synergistic components would transmit information about both driving and modulatory inputs, so transmitting them would be treating the modulatory inputs as driving inputs. This is clearly undesirable in the setting of coherent infomax.
At a more technical level, we note a trade-off: increasing the value of the parameter $\Phi_1$ towards 1 enhances the promotion of the unique information from the driving input while simultaneously lessening the pressure to minimize the synergy. This is a remnant of the term $\Phi_1 I(Y : X_1 \mid X_2)$ in equation 12, which had been included in order to capture information that was both in $Y$ and $X_1$ but not in $X_2$ (i.e. the unique information from the driving input), but inadvertently also served to capture the synergy.
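The trade-off in $\Phi_1$ can be illustrated on the AND gate from section 2. The sketch below evaluates the coherent infomax goal function in its classical form, multi-information plus $\Phi_1$ times the conditional mutual information of equation 12, and compares it with a PID reading of the same goal, namely shared information plus $\Phi_1$ times the unique information of the drive minus $(1 - \Phi_1)$ times the synergy. That PID reading and the rounded PID values (0.311 bits shared, 0.5 bits synergy, no unique information) are taken from the surrounding discussion; the helper names are illustrative.

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy(p, idx):
    """Shannon entropy (in bits) of the variables at positions idx."""
    m = defaultdict(float)
    for outcome, prob in p.items():
        m[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * log2(q) for q in m.values() if q > 0)

def mi(p, a, b):
    """I(A : B) = H(A) + H(B) - H(A, B)."""
    return entropy(p, a) + entropy(p, b) - entropy(p, a + b)

def cmi(p, a, b, c):
    """I(A : B | C) = H(A, C) + H(B, C) - H(A, B, C) - H(C)."""
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

X1, X2, Y = (0,), (1,), (2,)
p_and = {(x1, x2, x1 & x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

def f_cip_classical(p, phi1):
    """Multi-information plus phi1 times I(Y : X1 | X2)."""
    multi = mi(p, Y, X1) - cmi(p, Y, X1, X2)
    return multi + phi1 * cmi(p, Y, X1, X2)

def f_cip_pid(shd, unq1, syn, phi1):
    """Assumed PID reading: shared + phi1 * unique(X1) - (1 - phi1) * synergy."""
    return shd + phi1 * unq1 - (1 - phi1) * syn

# Rounded PID values for AND as quoted in the text
SHD, UNQ1, SYN = 0.311, 0.0, 0.5

for phi1 in (0.0, 0.5, 0.9):
    print(phi1, round(f_cip_classical(p_and, phi1), 3), round(f_cip_pid(SHD, UNQ1, SYN, phi1), 3))
```

Both forms agree up to the rounding of the quoted PID values, and the goal function grows with $\Phi_1$ here only because the penalty on synergy shrinks, since AND offers no unique information to reward.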
In terms of the range of tasks that can be learned by a processor with $F_{CIP}$, the minimization of synergy between the two types of inputs means for example that tasks that require a lot of synergy between the inputs, like the XOR function, cannot be learned easily. It is crucial, however, to realize that the discouragement of synergy concerns only relations between drive $X_1$ and modulation $X_2$. In contrast, synergistic relations between just the components of a multivariate $X_1$ can be learned by the coherent infomax learning rule. The XOR between components of $X_1$ for example can be learned reliably if supervised, and still occasionally if not [9].

Independent sources. What remains to be investigated is what the goal function aims for in the specific case of statistically independent inputs, i.e. when source shared information cannot be obtained. In other words, we may ask whether the coherent infomax processor will maximize mechanistic shared information in this case.
Since the mutual information between the inputs, $I(X_1 : X_2)$, is assumed to be zero, then using one of the forms of the multi-information (eq. 10) we have

$$ I(Y : X_1 : X_2) = I(X_1 : X_2) - I(X_1 : X_2 \mid Y) = -I(X_1 : X_2 \mid Y) \le 0, $$

and so the multi-information is non-positive. It follows from the other forms of the multi-information (eq. 10) that

$$ I(Y : X_1) \le I(Y : X_1 \mid X_2), \qquad I(Y : X_2) \le I(Y : X_2 \mid X_1). $$

This implies directly (compare Figure 2A) that for independent inputs we must have:

$$ I_{shd}(Y : X_1; X_2) \le I_{syn}(Y : X_1; X_2), $$

an important additional constraint that arises from independent inputs. Thus, in this case the minimization of synergy and the maximization of shared information compete, giving more effective weight to the unique information from the driving input. Nevertheless, limited shared information may exist in this scenario, and if so it will be of the mechanistic type.
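The sign of the multi-information for independent inputs can be checked on small examples. The following Python sketch, with illustrative helper names, evaluates the multi-information for the AND and XOR gates with independent uniform binary inputs and compares it with the negative conditional mutual information between the inputs given the output.

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy(p, idx):
    """Shannon entropy (in bits) of the variables at positions idx."""
    m = defaultdict(float)
    for outcome, prob in p.items():
        m[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * log2(q) for q in m.values() if q > 0)

def mi(p, a, b):
    """I(A : B) = H(A) + H(B) - H(A, B)."""
    return entropy(p, a) + entropy(p, b) - entropy(p, a + b)

def cmi(p, a, b, c):
    """I(A : B | C) = H(A, C) + H(B, C) - H(A, B, C) - H(C)."""
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

X1, X2, Y = (0,), (1,), (2,)

def multi_info(p):
    """Three-term multi-information in the form I(Y : X1) - I(Y : X1 | X2)."""
    return mi(p, Y, X1) - cmi(p, Y, X1, X2)

def gate(fn):
    """Joint distribution of (x1, x2, fn(x1, x2)) for independent uniform binary inputs."""
    return {(x1, x2, fn(x1, x2)): 0.25 for x1, x2 in product((0, 1), repeat=2)}

p_and = gate(lambda a, b: a & b)
p_xor = gate(lambda a, b: a ^ b)

for name, p in (("AND", p_and), ("XOR", p_xor)):
    print(name, round(multi_info(p), 3), round(-cmi(p, X1, X2, Y), 3))
```

For both gates the two columns coincide and are negative, as expected when the inputs share no mutual information. For AND the value of about -0.189 bits is also what one obtains as shared minus synergistic information from the PID values quoted in section 2 (0.311 - 0.5).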
In sum, we showed that (i) the generic goal function $F$ in the coherent infomax principle cannot represent all goal functions that are possible in the PID framework using the goal function $G$ (specifically, $F$ lacks one degree of freedom); (ii) for the CIP this leads to a weighted maximization of the shared information (source shared information and mechanistic shared information) and the unique information from the driving input; (iii) it can be shown that within the space of all possible goal functions $F$ it is impossible to maximize synergy and shared information together, while minimizing the two unique information terms, and vice versa; and (iv) for the CIP synergy between the driving and modulatory inputs is explicitly discouraged.

Figure 3 (caption fragment): (E) The three-way information $I(Y : X_1 : X_2)$, weighted by $\Phi_0$. Here the three-way information is the checkered minus the striped area. (F) This region appears in (C), (D), (E) and is weighted accordingly by three coefficients simultaneously ($\Phi_0$, $\Phi_1$, $\Phi_2$). The area in (F) is the synergistic mutual information that is also shown in cyan in Fig. 2.

Partial information decomposition as a unified framework to generate neural goal functions
In this section we will use PID to investigate infomax, another goal function proposed for neural systems, and we will formulate an information-theoretic goal function for a neural processor aimed at predictive coding.

Infomax
To investigate infomax, we recall that the goal stated there is to maximize the information in the output about the relevant input $X_1$, which typically is multivariate [6]. This goal function is implicitly designed for situations with limited output bandwidth, i.e. $H(X_1) > H(Y)$. If no second type of input $X_2$ is considered, PID obviously cannot contribute to the understanding of infomax. This changes, however, if the variables in a multivariate input are considered separately. Then, it may make sense to ask whether the output information in a given system is actually being maximized predominantly due to unique or synergistic information.
Mathematically, the infomax goal can also be represented by using $F$ with two types of inputs $X_1$, $X_2$, where the information transmitted about $X_1$ is to be maximized. This can be achieved by choosing $\Phi_0 = \Phi_1 = 1$, with the remaining coefficients set to zero, to obtain (e.g. [22]): $F = I(Y : X_1 : X_2) + I(Y : X_1 | X_2) = I(Y : X_1) = I_{unq}(Y : X_1 \setminus X_2) + I_{shd}(Y : X_1; X_2)$. The insight to be gained using PID here is that infomax does not incorporate the use of auxiliary variables $X_2$ to extract even more information from $X_1$ via the synergy $I_{syn}(Y : X_1; X_2)$, nor does it prefer either shared or unique information over the other.
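The identity behind this choice of coefficients can be checked numerically. The following is a minimal sketch, assuming that $F$ is the weighted sum of the three-way information, the two conditional mutual information terms and the unused bandwidth (weights $\Phi_0, \ldots, \Phi_3$, as in Fig. 3); the joint distribution `p` is drawn at random purely for illustration and is not taken from any data.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint distribution p[x1, x2, y], random for illustration only.
rng = np.random.default_rng(0)
p = rng.random((3, 3, 2))
p /= p.sum()

h_y = entropy(p.sum(axis=(0, 1)))
h_x1 = entropy(p.sum(axis=(1, 2)))
h_x2 = entropy(p.sum(axis=(0, 2)))
h_x1y = entropy(p.sum(axis=1))
h_x2y = entropy(p.sum(axis=0))
h_x1x2 = entropy(p.sum(axis=2))
h_all = entropy(p)

mi_y_x1 = h_y + h_x1 - h_x1y
mi_y_x2 = h_y + h_x2 - h_x2y
mi_y_x1x2 = h_y + h_x1x2 - h_all

# The four classical terms that enter F (assumed form, see lead-in).
co_information = mi_y_x1 + mi_y_x2 - mi_y_x1x2          # I(Y:X1:X2)
cmi_y_x1_given_x2 = h_x1x2 + h_x2y - h_x2 - h_all       # I(Y:X1|X2)
cmi_y_x2_given_x1 = h_x1x2 + h_x1y - h_x1 - h_all       # I(Y:X2|X1)
unused_bandwidth = h_all - h_x1x2                       # H(Y|X1,X2)

phi = np.array([1.0, 1.0, 0.0, 0.0])                    # infomax weights
terms = np.array([co_information, cmi_y_x1_given_x2,
                  cmi_y_x2_given_x1, unused_bandwidth])
f_infomax = float(phi @ terms)

print(f_infomax, mi_y_x1)  # the two values agree up to rounding
```

With these weights $F$ reduces to $I(Y : X_1)$ for any joint distribution, so $X_2$ plays no role in the infomax goal.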

Predictive coding
In predictive coding the goal is to predict inputs $X_1(t)$ using information available from past inputs$^{13}$. Thus, the processor has to learn a model $M_{PC}$ that yields predictions $X_2(t) = M_{PC}(X_1(t-1))$, such that $X_2(t) \approx X_1(t)$. This is the same as maximizing the mutual information between outcome and prediction, $I(X_1(t), X_2(t)) = I(X_1(t), M_{PC}(X_1(t-1)))$, at least if we do not care how exactly $X_2(t)$ represents$^{14}$ the prediction. Under some mild constraints$^{15}$ the data processing inequality here actually states that trying to tackle this problem information theoretically is trivial, as $I(X_1(t), X_2(t)) = I(X_1(t), M_{PC}(X_1(t-1)))$ is maximized by $M_{PC}(X_1(t-1)) = X_1(t-1)$, i.e. all the information we can ever hope to exploit for prediction is already in the raw data (and it is a mere technicality to extract it in a useful way). The whole problem becomes interesting only when there is some kind of bandwidth limitation on $M_{PC}$, i.e. when for example $M_{PC}(X_1(t-1))$ has to use the same alphabet as $X_1(t)$, meaning that we have to state our prediction as a single value that $X_1(t)$ will take. Of course, this actually is the typical scenario in neural circuits. Therefore, we state the main goal of predictive coding as maximizing $I(X_1(t), X_2(t)) = I(X_1(t), M_{PC}(X_1(t-1)))$, under the constraint that $X_1(t)$ and $M_{PC}(X_1(t-1))$ have the same "bandwidth" (the same raw bit content, to be precise). Despite the goal of maximizing a simple mutual information, this is not an infomax problem, due to the temporal order of the variables, i.e. we need the output $X_2(t)$ before the input $X_1(t)$ is available. Thus, we have to find a different solution to our problem.
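The data processing argument can be illustrated with a small numerical sketch. It assumes a hypothetical four-symbol input whose transition probabilities are drawn at random, and a hypothetical bandwidth-limited model `model` that merges symbols pairwise; none of these choices come from the circuit discussed here.

```python
import numpy as np

def mutual_information(p_ab):
    """Mutual information in bits from a joint probability table p_ab[a, b]."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    ratio = p_ab[mask] / (p_a * p_b)[mask]
    return float(np.sum(p_ab[mask] * np.log2(ratio)))

# Hypothetical input statistics: uniform x1(t-1), random transition probabilities.
rng = np.random.default_rng(1)
transition = rng.random((4, 4))
transition /= transition.sum(axis=1, keepdims=True)
p_joint = transition / 4.0            # p(x1(t-1), x1(t))

# Hypothetical model M_PC that maps the four past symbols onto two values.
model = [0, 0, 1, 1]
p_model = np.zeros((2, 4))            # p(M_PC(x1(t-1)), x1(t))
for x_prev, m in enumerate(model):
    p_model[m] += p_joint[x_prev]

mi_raw = mutual_information(p_joint)      # identity model, no bandwidth limit
mi_model = mutual_information(p_model)    # bandwidth-limited model

print(mi_raw, mi_model)
```

No deterministic model of the past input can exceed `mi_raw`, which is why the problem only becomes non-trivial once the bandwidth of the prediction is limited.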
To this end, we suggest that a minimal circuit performing predictive coding will have to perform at least three subtasks: (i) produce predictions as output, (ii) detect whether there were errors in the predictions, and (iii) use these errors for learning. In Fig. 4 we detail a minimalistic circuit performing these tasks, with subtask (i) represented in $X_2(t)$, subtask (ii) in $Y(t)$ and subtask (iii) in $M_{PC}$. This circuit assumes the following properties for its neural circuits: (a) neurons have binary inputs and outputs, (b) information passes through a neuron in one direction, and (c) information from multiple inputs can be combined into one output only. The circuit consists of two separate units: (1) the error detection unit that operates on past predictions $X_2(t-1) = M_{PC}(X_1(t-2))$, obtained via a memory buffer, and past inputs $X_1(t-1)$, to create the output $Y$ via an XOR operation, with $y = 1$ indicating an erroneous prediction in the past; (2) the prediction unit that has the capability to produce output based on a weighted summation over a vector of past inputs $X_1(t-1)$ via a weighting function in the model $M_{PC}$. $M_{PC}$ will update its weights whenever an error was received. We suggest that the information theoretic goal function of this circuit is simply to minimize the entropy of the output of the error unit, i.e. $H(Y)$. In principle, this would drive the binary output of the circuit either to $p(y = 1) \to 1$ or to $p(y = 0) \to 1$. Of these two possibilities, only the second one is stable, as the constant signaling of the presence of an error will lead to incessant changes in $M_{PC}$, which in turn will change $Y$ even for unchanging input $X_1$. Thus, minimizing $H(Y)$ should enforce $p_Y(y = 0) \to 1$. Therefore, we can formulate an information theoretic goal function of the form $G$ if we conceive of the whole circuit as being just one neural processor with inputs $X_1(t-1)$ and $X_1(t-1)$, and as having the error $Y$ as its main output. In this case, we find as a goal
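The stability argument can be made concrete with a toy simulation. This is only a sketch under strong simplifying assumptions: the input is a hypothetical alternating bit stream, the model $M_{PC}$ is replaced by a two-entry lookup table instead of a weighted summation, and the error bit is indexed by the time step of the prediction it evaluates rather than the step at which it becomes available.

```python
import numpy as np

def binary_entropy(p1):
    """Entropy in bits of a binary variable with P(y = 1) = p1."""
    if p1 <= 0.0 or p1 >= 1.0:
        return 0.0
    return float(-p1 * np.log2(p1) - (1.0 - p1) * np.log2(1.0 - p1))

n_steps = 200
x1 = np.arange(n_steps) % 2        # hypothetical input stream: 0, 1, 0, 1, ...
model = {0: 0, 1: 1}               # stand-in for M_PC; initially predicts "repeat"
y = np.zeros(n_steps, dtype=int)   # error bits produced by the XOR unit

for t in range(1, n_steps):
    previous = int(x1[t - 1])
    prediction = model[previous]           # X2(t) = M_PC(X1(t-1))
    y[t] = prediction ^ int(x1[t])         # XOR error unit: 1 signals an error
    if y[t] == 1:
        model[previous] ^= 1               # the model changes only after an error

h_early = binary_entropy(y[1:11].mean())   # entropy estimate while still learning
h_late = binary_entropy(y[100:].mean())    # entropy estimate after learning

print(int(y.sum()), h_early, h_late)
```

In this toy setting the state with constant errors cannot persist, because every error changes the model, whereas the error-free state is a fixed point with $H(Y) = 0$.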
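function for the predictive coding error (PCE): $G_{PCE} = -I_{unq}(Y : X_1 \setminus X_2) - I_{unq}(Y : X_2 \setminus X_1) - I_{shd}(Y : X_1; X_2) - I_{syn}(Y : X_1; X_2) - H(Y | X_1, X_2) = -H(Y)$.

Interestingly, this goal function formally translates to $F_{PCE} = -I(Y : X_1 : X_2) - I(Y : X_1 | X_2) - I(Y : X_2 | X_1) - H(Y | X_1, X_2)$.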
This gives hope that one can translate the established formalism for $F$ to the present case by taking into account that the original architecture behind $F$ is augmented here by an additional XOR subunit. Learning of the circuit's goal function may have to proceed in two steps if we do not have subunits able to perform XOR at the beginning. In this case, the "XOR" subunit will first have to learn to perform its function. This can be achieved by maximizing the synergy of two uniform, random binary inputs and the subunit's output $Y$. After this initial learning the XOR subunit is 'frozen' and learning of predictions can proceed to minimize $H(Y)$. One conceivable mechanism for this would be to use learning based on coincidences between input bits in $M(X_1(t-2))$ and the error bit $Y$. We note that this goal function is not entirely new, as the idea of making the output of a processing unit as constant as possible in learning has been used before in various implementations (e.g. [28,29,30]). It is also closely related to the homeostatic goals pursued by the free energy minimization principle [31,32,33]. We have merely added here a generic minimal circuit diagram and the information theoretic interpretation to these previous approaches. Also, note that the actual prediction $X_2(t) = M_{PC}(X_1(t-1))$ must be implicitly part of the information theoretic goal function, as the goal function we suggest here would be nonsensical on many other circuits.
As a next level of complication one may consider that the predictions $X_2$ that are created within our minimal circuit are sent back to the source of the input $X_1$ to interact with it there. One such interaction scheme will be studied in the next section.

Coding with synergy
So far the goal functions investigated in our unifying framework $G$ had in common that maximization of synergy did not appear as a desirable goal. This may historically be simply due to the profound mathematical difficulties that had to be overcome in the definition of synergistic information. In this section we will therefore show how synergy naturally arises in a generalization of ideas from efficient coding by PID. We will call the goal function simply coding with synergy (CWS).
The neural coding problem that we will investigate here is closely related to predictive coding discussed in the previous section. However, in contrast to predictive coding, where the creation of predictions was in focus, here we focus on possible uses of prior (or contextual) information from $X_2$, be it derived from predictions or by any other means. In other words, we here simply assume that there is (valid) prior information in the system that does not have to be extracted from the ongoing input stream $X_1$ by our neural processor. Moreover, we assume that there is no need to waste bandwidth and energy on communicating $X_2$ as this information is already present in the system. Last, we assume that we want to pass as much of the information in $X_1$ as possible, as well as of the information created synergistically by $X_1$ and $X_2$. This synergistic information will arise for example when $X_2$ serves to decode or disambiguate information in $X_1$.
Looking at the PID diagram (Fig. 2) one sees that in this setting it is optimal to minimize $I_{unq}(Y : X_2 \setminus X_1)$ and the unused bandwidth $H(Y | X_1, X_2)$ while maximizing the other terms. This leads to: $G_{CWS} = I_{unq}(Y : X_1 \setminus X_2) - I_{unq}(Y : X_2 \setminus X_1) + I_{shd}(Y : X_1; X_2) + I_{syn}(Y : X_1; X_2) - H(Y | X_1, X_2)$. The important point here is that this is different from maximizing just $I(Y : X_1 | X_2)$, as this would omit the shared information, i.e. we would lose this part of the information in $X_1$. The goal function $G_{CWS}$ is also different from just maximizing $I(Y : X_1)$, as this would omit the synergistic information, i.e. the possibility to decode information from $X_1$ by means of $X_2$. Furthermore, there is no corresponding goal function $F$ here in terms of classical information theoretic measures. This can easily be proven by noting that $\Gamma = [1, -1, 1, 1, -1]$ has a non-zero projection in $V_\Gamma$ (equation 17). In other words, there is no $\Phi$ that satisfies equation 15.
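The claim that no $\Phi$ reproduces this $\Gamma$ can be checked with a few lines of linear algebra. The sketch below rests on two assumptions that are not spelled out in this section: the coefficients are ordered as $\Gamma$ = [unique $X_1$, unique $X_2$, shared, synergy, unused bandwidth] and $\Phi$ = [three-way information, $I(Y : X_1 | X_2)$, $I(Y : X_2 | X_1)$, unused bandwidth], and the three-way information equals shared minus synergistic information. The helper names `gamma_from_phi` and `representable` are hypothetical.

```python
import numpy as np

# Each row states how one PID term enters the four classical terms of F.
# Columns: [I(Y:X1:X2), I(Y:X1|X2), I(Y:X2|X1), H(Y|X1,X2)]
A = np.array([
    [0.0, 1.0, 0.0, 0.0],    # unique information of X1
    [0.0, 0.0, 1.0, 0.0],    # unique information of X2
    [1.0, 0.0, 0.0, 0.0],    # shared information
    [-1.0, 1.0, 1.0, 0.0],   # synergistic information
    [0.0, 0.0, 0.0, 1.0],    # unused bandwidth
])

def gamma_from_phi(phi):
    """PID coefficients implied by a choice of classical coefficients phi."""
    return A @ np.asarray(phi, dtype=float)

def representable(gamma, tol=1e-9):
    """Return (flag, phi): flag is True if some phi reproduces gamma exactly."""
    gamma = np.asarray(gamma, dtype=float)
    phi, *_ = np.linalg.lstsq(A, gamma, rcond=None)
    return bool(np.allclose(A @ phi, gamma, atol=tol)), phi

gamma_infomax = gamma_from_phi([1, 1, 0, 0])          # unique X1 and shared only
cws_ok, _ = representable([1, -1, 1, 1, -1])          # coding with synergy
pce_ok, phi_pce = representable([-1, -1, -1, -1, -1]) # minimizing H(Y)

print(gamma_infomax, cws_ok, pce_ok, phi_pce)
```

Under these assumptions the synergy coefficient of any $F$ is fixed by the other three information coefficients, which is the missing degree of freedom that rules out the CWS weights.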
Given there were bandwidth constraints on $Y$, one might want to preferentially communicate one or two of the positively weighted terms in equation 24. The natural choice here is to favor synergy and unique information about $X_1$, because the shared information with $X_2$ is already in the system. If just one contribution can be communicated this leaves us with three choices. We will quickly discuss the meaning of each here: first, focusing on the unique information $I_{unq}(Y : X_1 \setminus X_2)$ emphasizes the surprising information in $X_1$, because this is the information that is not yet in the system at all (i.e. not in $X_2$); second, focusing on the shared information $I_{shd}(Y : X_1; X_2)$ basically leads to coherent infomax; third, focusing on the synergistic information $I_{syn}(Y : X_1; X_2)$ emphasizes information which can only be obtained when putting together prior knowledge in $X_2$ and incoming information $X_1$; this would be the extreme case of CWS. This case should arise naturally in binary error computation, e.g. in error units suggested as integral parts of certain predictive coding architectures (see [3] for a discussion of error units, also compare the XOR unit in Figure 4).
A classic example for this last coding strategy would be cryptographic decoding. Here, the mutual information between cypher text (serving as input $X_1$) and plain text (serving as output $Y$) is close to zero, i.e. $I(Y : X_1) \approx 0$, given randomly chosen keys and a well performing cryptographic algorithm. Nevertheless the mutual information between the two, given keys (serving as input $X_2$), is the full information of the plain text, i.e. $I(Y : X_1 | X_2) = H(Y)$, assuming the unused bandwidth is zero ($H(Y | X_1, X_2) = 0$). As the mutual information between key and plain text should also be zero ($I(Y : X_2) = 0$) we see that in this case the full mutual information is synergistic: $I(Y : X_1, X_2) = I_{syn}(Y : X_1; X_2)$. In a similar vein, any task in neural systems that involves an arbitrary key-dependent mapping between information sources, as in the above cryptographic example, will involve CWS. One such task would be to read a newspaper printed in Latin characters (which could be in quite a range of languages) to get knowledge about the current state of the world (or at least some aspects of it). Visually inspecting the text without the information incorporated in the rules of the unknown written language used will not reveal information about the world. Yet, having all the information on the rules of written language without having a specific text will also not reveal anything about the world. To obtain this knowledge we need both the text of the newspaper and the language-specific information on how written words map to possible states of the world.
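The cryptographic case can be verified on a small one-time pad. This sketch assumes two-bit symbols, a uniformly random key that is independent of the plain text, and encryption by bitwise XOR; the plain text statistics `p_plain` are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

n_symbols = 4                                  # two-bit symbols
p_plain = np.array([0.4, 0.3, 0.2, 0.1])       # hypothetical plain text statistics

# Joint distribution p[x1 = cypher text, x2 = key, y = plain text].
p = np.zeros((n_symbols, n_symbols, n_symbols))
for y in range(n_symbols):
    for key in range(n_symbols):
        p[y ^ key, key, y] = p_plain[y] / n_symbols   # uniform, independent key

h_y = entropy(p.sum(axis=(0, 1)))
h_x1 = entropy(p.sum(axis=(1, 2)))
h_x2 = entropy(p.sum(axis=(0, 2)))
h_x1y = entropy(p.sum(axis=1))
h_x2y = entropy(p.sum(axis=0))
h_x1x2 = entropy(p.sum(axis=2))
h_all = entropy(p)

mi_y_cypher = h_y + h_x1 - h_x1y                       # I(Y:X1)
mi_y_key = h_y + h_x2 - h_x2y                          # I(Y:X2)
cmi_y_cypher_given_key = h_x1x2 + h_x2y - h_x2 - h_all # I(Y:X1|X2)
unused_bandwidth = h_all - h_x1x2                      # H(Y|X1,X2)

print(mi_y_cypher, mi_y_key, cmi_y_cypher_given_key, h_y, unused_bandwidth)
```

Since both single-input mutual information terms vanish, neither unique nor shared information can be present for any PID that satisfies the consistency equations, and all of $H(Y)$ is carried synergistically.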
A corollary of the properties of synergistic mutual information is that when a neuron's inputs are investigated individually they will seem unrelated to the output, to the extent that synergistic information is transmitted in the output. Therefore, the minimal configuration of neuronal recordings needed to investigate the synergistic goal function is a triplet of two inputs and one output. Thus, though coding with synergy has not been prominent in empirical reports to date, it might become more frequently detected as dense and highly parallel recordings of neuronal activity become more widely available.
The general setting of coding under prior knowledge discussed here is also related to Barlow's efficient coding hypothesis [34] if we take the prior information $X_2$ to be information about which inputs to our processor are typical for the environment it lives in. We here basically generalize Barlow's principle by dropping reference to what the input or the prior knowledge are about.
Last, this goal function seems significant to us as synergy is seen by some authors as useful in a formal definition of information modification (e.g. [15]). Thus synergy is a highly useful measure in the description of neural processors with two or more inputs (or one input and an internal state), as it taps into the potential of the processor to genuinely modify information$^{16}$.

Discussion

Biological neural processors and PID
In this study we introduced partial information decomposition (PID) as a universal framework to describe and compare neural processors in a domain-independent way. PID is indispensable for the information theoretic analysis of systems where two (or more) inputs are combined to one output, because it allows us to decompose the information in the output into contributions provided either uniquely by any one of the inputs alone (unique information), by either of them (shared information), or only by both of them jointly (synergistic information). Using PID, the information processing principles of the processor can be quantitatively described by specific coefficients $\Gamma$ for each of the PID contributions in a PID-based goal function $G(\Gamma)$, which the processor maximizes.
This framework is useful in several ways. First, and perhaps most importantly, it allows the principled comparison of existing neural goal functions, such as infomax, coherent infomax, predictive coding, and efficient coding. Second, it aids in the design of novel neural goal functions. Here we presented a specific example, coding with synergy (CWS), that exploits synergy to maximize the information that can be obtained from the input when prior information is available in the system. Note, however, that the actual implementation of a neural circuit maximizing the desired goal function is not provided by the new framework and will have to be constructed on a case by case basis at the moment. This is in contrast to coherent infomax, where a working implementation is known. Third, applying this framework to neural recordings may help us to better understand the function of neural circuits that are far away from the sensory and motor periphery, and for which we do not have the necessary semantics.
Currently, the applicability of our framework rests on the assumption that a neural processor with two inputs is a reasonable approximation of a neuron or microcircuit$^{17}$. Of course, neurons typically have many more inputs than just two. However, if such inputs naturally fall into two groups, e.g. being first integrated locally in two groups on the dendrites before being brought to interact at the soma, then indeed the two-input processor is a useful approximation. If, moreover, these integrated inputs are measured before their fusion in the soma, then the formalism of goal functions presented here will allow us to assess the function of this neuron in a truly domain-independent way, relying only on information that is also available to the neuron itself.
For example, two such spatially segregated and separately integrated inputs can be distinguished on pyramidal cells (Fig. 1). Pyramidal cells are usually highly asymmetric and consist of a cell body with basal dendrites and an elongated apical dendrite that rises to form a distal dendritic tuft in the superficial cortical layers. Thus, the inputs are spatially segregated into basal/perisomatic inputs, and inputs that target the apical tuft. Intracellular recordings indicate that there are indeed separate integration sites for each of these two classes of input, and that there are conditions in which apical inputs amplify (i.e. modulate) responses to the basal inputs in a way that closely resembles the schematic two-input processor shown in Fig. 1. There is also emerging evidence that these segregated inputs have driving and modulatory functions and are combined in a mechanism of apical amplification of basal inputs, resembling the coherent infomax goal function. Direct and indirect evidence on this apical amplification and its cognitive functions is reviewed by Phillips [submitted to this special issue]. That evidence shows that apical amplification occurs within pyramidal cells in the superficial layers, as well as in layer 5 cells, and suggests that it may play a leading role in the use of predictive inferences to modulate processing.
Which of the goal functions proposed here, e.g. infomax, coherent infomax, or coding with synergy, a neural processor actually performs is an empirical question that must be answered by analyzing PID footprints of $G$ obtained from data recorded in neural processors. At present this is still a considerable challenge when applied to the level of single cells or microcircuits, because this requires the separate recording of at least one output and two inputs, which must moreover be of different types in the case of coherent infomax. Next, the PID terms have to be estimated from data, instead of from distributions that are known. This type of estimation is still a field of ongoing research at present. Overcoming these challenges will yield in-depth understanding of, for example, the information processing of the layer 5 cell described above in terms of PID, and elucidate which of the potential goal functions is implemented in such a neuron.
In the spirit of the framework proposed here, classical information theoretic techniques have already been applied to psychophysical data to search for coherent infomax-like processing at this level [36]. These studies confirmed for example that attentional influences are modulatory, and showed how modulatory interactions can be distinguished from interactions that integrate multiple driving input streams. These results are a promising beginning of a larger-scale analysis of neuronal data at all levels with information theoretic tools, such as PID.
Further information theoretic insight relevant to predictive processing may also be gained by relating the predictable information in a neural processor's inputs (measured via 'local active information storage' [14]) to the information transmitted to its output (measured via transfer entropy [12], or local transfer entropy [13]) to investigate whether principles of predictive coding apply to the information processing in neurons. This is discussed in more detail in [1].

Conclusion
We here argued that the understanding of neural information processing will profit from taking a neural perspective, focusing on the information entering and exiting a neuron, and stripping away semantics imposed by the experimenter, i.e. semantics that is not available to a neuron. We suggest that the necessary analyses are best carried out in an information theoretic framework, and that this framework must be able to describe the processing in a multiple input system to accommodate neural information processing. We find that PID provides the necessary measures, and allows us to compare most if not all theoretically conceivable neural goal functions in a common framework.
Moreover, PID can also be used to design new goal functions from first principles. We demonstrated the use of this technique in understanding neural goal functions proposed for the integration of contextual information (coherent infomax) and the learning of predictions (predictive coding), and introduced a novel one for the decoding of input based on prior knowledge, called coding with synergy (CWS).

[...] designed by her, where the payout depends only on the outcomes of $Y$. In such a game, her reward will depend only on the probability distribution $p(x_1, y) = p(x_1 | y) p(y)$, while Bob's reward will depend only on $p(x_2, y) = p(x_2 | y) p(y)$. The winner is thus determined simply by the two distributions $p(x_1, y)$ and $p(x_2, y)$, but not by the details of the full distribution $p(x_1, x_2, y)$. Practically speaking, Alice should therefore construct the game in such a way that her payout is high for outcomes $y$ about which she can be relatively certain, knowing $x_1$.
From this argument, it follows that Alice could not only prove to have unique information in the case described by the full joint distribution $P = P(X_1, X_2, Y)$, but also for all other cases described by distributions $Q = Q(X_1, X_2, Y)$ that have the same pairwise marginal distributions, i.e. $p(x_1, y) = q(x_1, y) \wedge p(x_2, y) = q(x_2, y) \;\; \forall x_1, x_2, y \in A_{X_1, X_2, Y}$.
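Based on this observation it makes sense to request that $I_{unq}(Y : X_1 \setminus X_2)$ and $I_{unq}(Y : X_2 \setminus X_1)$ stay constant on a set $\Delta_P$ of probability distributions that is defined by: $\Delta_P = \{ Q \in \Delta : Q(X_1 = x_1, Y = y) = P(X_1 = x_1, Y = y) \text{ and } Q(X_2 = x_2, Y = y) = P(X_2 = x_2, Y = y) \;\; \forall x_1, x_2, y \}$, where $\Delta$ is the set of all joint probability distributions of $X_1$, $Y$, $X_2$.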
From this, it follows from equation 4 that also the shared information $I_{shd}(Y : X_1; X_2)$ must be constant on $\Delta_P$ (consult Figure B.5, and take into account that the mutual information terms $I(Y : X_1)$ and $I(Y : X_2)$ are also constant on $\Delta_P$). Hence, the only thing that may vary when exchanging the distribution $P$, for which we want to determine the unique information terms, for another distribution $Q \in \Delta_P$ is the synergistic information $I_{syn}(Y : X_1; X_2)$. It therefore makes sense to look for a specific distribution $Q_0 \in \Delta_P$ where the unique information terms coincide with something computable from classic information theory. From Figure 2 we see that for the case of a distribution $Q_0 \in \Delta_P$ where synergistic information vanishes, the unique information terms would coincide with conditional mutual information terms, i.e. $I_{unq,P}(Y : X_1 \setminus X_2) = I_{unq,Q_0}(Y : X_1 \setminus X_2) = I_{Q_0}(Y : X_1 | X_2)$. It is known, however, that a $Q_0$ with this property does not necessarily exist for all definitions of unique, shared and synergistic information that satisfy equations 3-5, and that also satisfy the above game-theoretic property (being able to prove the possession of unique information). Therefore, Bertschinger and colleagues suggested to define a measure $\tilde{I}_{unq}$ of unique information via the following minimization: $\tilde{I}_{unq}(Y : X_1 \setminus X_2) = \min_{Q \in \Delta_P} I_Q(Y : X_1 | X_2)$ and $\tilde{I}_{unq}(Y : X_2 \setminus X_1) = \min_{Q \in \Delta_P} I_Q(Y : X_2 | X_1)$. From this, measures for shared and synergistic information can be immediately obtained via equations 4, 3 as: $\tilde{I}_{shd}(Y : X_1; X_2) = \max_{Q \in \Delta_P} \left( I(Y : X_1) - I_Q(Y : X_1 | X_2) \right)$ and $\tilde{I}_{syn}(Y : X_1; X_2) = I(Y : X_1, X_2) - \min_{Q \in \Delta_P} I_Q(Y : X_1, X_2)$. Here the conditional mutual information $I_Q(Y : X_1 | X_2)$ is known from information theory, and depends on the choice of $Q$.
Figure B.5: Graphical depiction of the principle behind the definition of unique information in [19]. For details also see the main text. (A) Reminder of the partial information diagram; the aim is to quantify $I_{unq}$, $I_{shd}$, and $I_{syn}$. (B) Explanation of how unique information can be defined using minimization of conditional mutual information on the space of probability distributions $\Delta_P$ (see text). With the aim to estimate $I_{unq}(Y : X_1 \setminus X_2)$, one defines a set of $I_Q(Y : \{X_1, X_2\})$ on $\Delta_P$; both $I_Q$ and $I_{syn,Q}(Y : X_1; X_2)$ depend on the choice of $Q$. Note that if the synergy in (B6) cannot be reduced to 0, then we simply define the unique information measure as $\tilde{I}_{unq}(Y : X_1 \setminus X_2) = \min_{Q \in \Delta_P} I_Q(Y : X_1 | X_2)$. Note that CoI refers to the co-information $CoI(Y; X_1; X_2) = I(Y : X_1) - I(Y : X_1 | X_2)$ (see [19] for details).

For this particular choice of measures it can be shown that there is always at least one distribution $Q_0 \in \Delta_P$ for which the synergy vanishes, as was desired above. As knowledge of the pairwise marginal distributions $P(X_1, Y)$, $P(X_2, Y)$ only specifies the problem up to any $Q \in \Delta_P$, and as the synergy varies on $\Delta_P$, we need to know the joint distribution $P(X_1, X_2, Y)$ to know about the synergy. This is indeed an intuitively plausible property and supports the functionality of the definitions given by Bertschinger and colleagues [19].
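From Figure 2 and the definition of $\tilde{I}_{unq}$, $\tilde{I}_{shd}$, and $\tilde{I}_{syn}$ in equations B.2-B.5 it seems obvious that the following bounds hold for these measures: $\tilde{I}_{unq}(Y : X_1 \setminus X_2) \geq I_{unq}(Y : X_1 \setminus X_2)$, $\tilde{I}_{unq}(Y : X_2 \setminus X_1) \geq I_{unq}(Y : X_2 \setminus X_1)$, $\tilde{I}_{shd}(Y : X_1; X_2) \leq I_{shd}(Y : X_1; X_2)$, and $\tilde{I}_{syn}(Y : X_1; X_2) \leq I_{syn}(Y : X_1; X_2)$, and this can indeed be proven, given that $I_{unq}$, $I_{shd}$, and $I_{syn}$ are taken to mean any other definition of PID that satisfies equations 3-5 [19] and the above game theoretic assumption of a constant $I_{unq}$ on $\Delta_P$. The measures $\tilde{I}_{unq}$, $\tilde{I}_{shd}$, and $\tilde{I}_{syn}$ require finding minima and maxima of conditional mutual information terms on $\Delta_P$. Fortunately, these constrained optimization problems are convex for two inputs as shown in [19], meaning that there is only one local minimum (maximum), which is the desired global minimum (maximum). Incorporating the constraints imposed by $\Delta_P$ into the optimization may be non-trivial, however.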
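For binary variables the constraint set $\Delta_P$ is small enough to be searched directly, which gives a feel for the minimization. The following is a brute-force sketch, not the estimator used in [19]: it takes $\tilde{I}_{unq}(Y : X_1 \setminus X_2)$ to be the minimum of $I_Q(Y : X_1 | X_2)$ over $\Delta_P$, parameterizes $\Delta_P$ by one free number per outcome $y$, and scans a grid. The function names and the grid resolution are hypothetical choices.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def cond_mi_y_x1_given_x2(q):
    """I_Q(Y:X1|X2) in bits for a joint table q[x1, x2, y]."""
    return (entropy(q.sum(axis=2)) + entropy(q.sum(axis=0))
            - entropy(q.sum(axis=(0, 2))) - entropy(q))

def mi_y_x1(q):
    """I(Y:X1) in bits for a joint table q[x1, x2, y]."""
    return (entropy(q.sum(axis=(0, 1))) + entropy(q.sum(axis=(1, 2)))
            - entropy(q.sum(axis=1)))

def unique_information_x1(p, n_grid=41):
    """Grid-search minimum of I_Q(Y:X1|X2) over Delta_P (binary variables only)."""
    p = np.asarray(p, dtype=float)
    assert p.shape == (2, 2, 2)
    p_y = p.sum(axis=(0, 1))
    candidates = []
    for y in range(2):
        a = p[:, :, y].sum(axis=1)[0]      # P(x1 = 0, y), fixed on Delta_P
        b = p[:, :, y].sum(axis=0)[0]      # P(x2 = 0, y), fixed on Delta_P
        lo, hi = max(0.0, a + b - p_y[y]), min(a, b)
        candidates.append([
            np.array([[q00, a - q00], [b - q00, p_y[y] - a - b + q00]])
            for q00 in np.linspace(lo, hi, n_grid)
        ])
    best = np.inf
    for table_y0 in candidates[0]:
        for table_y1 in candidates[1]:
            q = np.clip(np.stack([table_y0, table_y1], axis=2), 0.0, None)
            best = min(best, cond_mi_y_x1_given_x2(q))
    return best

def pid_binary(p):
    """Unique (X1), shared and synergistic information under the assumed measure."""
    unq = unique_information_x1(p)
    return {"unq_x1": unq,
            "shd": mi_y_x1(p) - unq,
            "syn": cond_mi_y_x1_given_x2(p) - unq}

# Two reference cases: Y = X1 XOR X2, and Y = X1 with an irrelevant X2.
p_xor = np.zeros((2, 2, 2))
p_copy = np.zeros((2, 2, 2))
for x1 in range(2):
    for x2 in range(2):
        p_xor[x1, x2, x1 ^ x2] = 0.25
        p_copy[x1, x2, x1] = 0.25

pid_xor = pid_binary(p_xor)
pid_copy = pid_binary(p_copy)
print(pid_xor, pid_copy)
```

For larger alphabets $\Delta_P$ has many more free parameters, so a grid is no longer practical and a proper convex solver is needed.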

Appendix B.2. PID by example: Of casinos and spies
A short example may demonstrate the above reasoning: Let Alice and Bob bet on the outcomes $Y$ of a (perfect, etc.) Roulette table at a Casino in a faraway city, such that they do not have immediate access to these outcomes; they will only get a list of these outcomes when the Casino closes, but will have to place their bets before that. Alice has a spy $X_1$ at the Casino who informs her directly after an outcome was obtained there, but only tells the truth when the outcome was even (this includes 0). Otherwise he tells her a random possible outcome from a uniform distribution across natural numbers from 0 to 36 (just like the Roulette). Bob also has a spy $X_2$ at the casino, but in contrast to Alice's spy he only tells Bob the truth for uneven outcomes and for 0; otherwise he lies in the same way as Alice's spy, picking a random number. Neither Alice nor Bob knows about the spy of the other$^{20}$. While this situation looks quite symmetric at first glance, both can prove to each other to have unique information about the outcomes at the casino, $y$. To see this, remember that Alice may suggest a game constructed by herself when trying to prove the possession of unique information. Thus, Alice could suggest to double the stakes for bets on even numbers$^{21}$. At the end of the day, both Alice and Bob will have won a roughly equal number of bets, but the bets Alice has won will typically have paid out more, and Alice wins. In the same way, Bob could suggest to double the stakes for uneven outcomes if it were his turn to prove the possession of unique information. Thus, both have the same amount of information about the outcomes at the casino, but a part of that information is about different outcomes.
In this example, there is also redundancy as both will have the same information about the outcome Y = 0.
It is left for the reader to verify that Alice and Bob will gain some information (i.e. synergy) by combining what their spies tell them, but that this is not enough to be certain about the outcome of the Roulette, i.e. $I(Y : X_1, X_2) < H(Y)$ $^{22}$.
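The classical quantities in this example can be computed exactly. The sketch below builds the joint distribution under one additional assumption that the text does not state explicitly, namely that the two spies pick their random numbers independently of each other; it covers only classical mutual information terms and does not compute the PID terms themselves.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

n = 37                                   # Roulette outcomes 0..36
uniform = np.full(n, 1.0 / n)
p = np.zeros((n, n, n))                  # p[x1 = Alice's spy, x2 = Bob's spy, y]

for y in range(n):
    truthful = np.zeros(n)
    truthful[y] = 1.0
    report_alice = truthful if y % 2 == 0 else uniform            # truth for even y
    report_bob = truthful if (y % 2 == 1 or y == 0) else uniform  # truth for odd y and 0
    # Assumption: the spies' reports are independent given the outcome.
    p[:, :, y] = np.outer(report_alice, report_bob) / n

h_y = entropy(p.sum(axis=(0, 1)))
mi_alice = h_y + entropy(p.sum(axis=(1, 2))) - entropy(p.sum(axis=1))
mi_bob = h_y + entropy(p.sum(axis=(0, 2))) - entropy(p.sum(axis=0))
mi_both = h_y + entropy(p.sum(axis=2)) - entropy(p)

print(mi_alice, mi_bob, mi_both, h_y)
```

Alice and Bob hold the same amount of information individually, jointly they hold more than either alone, and some uncertainty about the outcome remains even when they combine their reports.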

Appendix B.3. Estimating Synergy and PID for jointly Gaussian variables
While synergy, shared and unique information are already difficult to estimate for discrete variables, it is not immediately clear how to extend the definitions to continuous variables in general. Barrett has made significant advances in this direction though by considering PID for jointly Gaussian variables [37]. Approaches to Gaussian variables are important analytically because the classical information theoretic terms there may be computed directly from the covariance matrix of $Y$, $X_1$, $X_2$, and are important empirically due to the wide use of Gaussian models to simplify analysis (e.g. in neuroscience).
First, Barrett was able to demonstrate the existence of cases of non-zero quantities for each of synergy and shared information for such variables. This was done without reference to any specific formulation of PID measures by examining the 'net synergy' (synergy minus shared information), i.e. $I(Y : X_1, X_2) - I(Y : X_1) - I(Y : X_2)$, which provides a sufficient condition for synergy where it is positive and for shared information where it is negative. This was an important result, since the intuition of many authors was that the linear relationship between such Gaussian variables could not support synergy.
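The net synergy of jointly Gaussian variables follows directly from determinants of sub-blocks of the covariance matrix. The sketch below uses the standard Gaussian formula for mutual information; the two covariance matrices are hypothetical examples chosen for illustration, with variables ordered as ($Y$, $X_1$, $X_2$), and are not taken from [37].

```python
import numpy as np

def gaussian_mi(cov, idx_a, idx_b):
    """Mutual information I(A:B) in nats for jointly Gaussian variables."""
    cov = np.asarray(cov, dtype=float)
    a, b = list(idx_a), list(idx_b)
    _, logdet_a = np.linalg.slogdet(cov[np.ix_(a, a)])
    _, logdet_b = np.linalg.slogdet(cov[np.ix_(b, b)])
    _, logdet_ab = np.linalg.slogdet(cov[np.ix_(a + b, a + b)])
    return 0.5 * (logdet_a + logdet_b - logdet_ab)

def net_synergy(cov):
    """I(Y:X1,X2) - I(Y:X1) - I(Y:X2), with index 0 = Y, 1 = X1, 2 = X2."""
    return (gaussian_mi(cov, [0], [1, 2])
            - gaussian_mi(cov, [0], [1])
            - gaussian_mi(cov, [0], [2]))

# Hypothetical case 1: uncorrelated inputs, each correlated 0.5 with Y.
cov_uncorrelated_inputs = np.array([[1.0, 0.5, 0.5],
                                    [0.5, 1.0, 0.0],
                                    [0.5, 0.0, 1.0]])

# Hypothetical case 2: the same input-output correlations, but the two
# inputs are strongly correlated with each other (0.9).
cov_correlated_inputs = np.array([[1.0, 0.5, 0.5],
                                  [0.5, 1.0, 0.9],
                                  [0.5, 0.9, 1.0]])

ns_uncorrelated = net_synergy(cov_uncorrelated_inputs)
ns_correlated = net_synergy(cov_correlated_inputs)
print(ns_uncorrelated, ns_correlated)
```

A positive net synergy in the first case shows that purely linear Gaussian systems can carry synergy, while the negative value in the second case indicates shared information between strongly correlated inputs.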
Next, Barrett demonstrated a unique form for the PID for jointly Gaussian variables which satisfies the original axioms of Williams and Beer [17] as well as having unique and shared information terms depending only on the marginal distributions $(X_1, Y)$ and $(X_2, Y)$ (as argued by Bertschinger et al. [19] above, and consistent with [18,20]). To be specific, this unique form holds only for a univariate output (though multivariate inputs are allowed). This formulation maps the shared information to the minimum of the marginal mutual information terms $I(Y : X_1)$ and $I(Y : X_2)$, hence it is labeled the Minimum Mutual Information (MMI) PID, and the other PID terms follow from equations 3-5. Interestingly, this formulation always attributes zero unique information to the input providing less information about the output. Furthermore, synergy follows directly as the additional information provided by this "weaker" input after considering the "stronger" input. Some additional insights into this behaviour have recently been provided by Rauh and colleagues in [38].

Appendix C. Learning rules for maximizing F and for learning the coherent infomax goal function $F_{CIP}$
We here briefly present the learning rules for gradient ascent learning of a neural processor learning to maximize the goal function $F$ from equation 11. We only consider the basic case of a single neural processor with binary output $Y$ here [26,22]. The inputs to this processor are partitioned into two groups, $\{X_{1i}\}$, representing the driving inputs, and $\{X_{2j}\}$, representing the contextual inputs. These inputs enter the information theoretic goal function $F(X_1, X_2, Y)$ via their weighted sums per group as: [...] The inputs affect the output probability of the processor via an activation function $A(x_1, x_2)$ as: [...] For the sake of deriving general learning rules, $A$ may be any general, differentiable nonlinear function of the input. Note that $\Theta$ fully determines the information theoretic operation that the processor performs. $\Theta$ is a function of the weights used in the summation of the inputs. Thus, learning a specific information processing goal can only be done via learning these weights, assuming that the input distributions of the processor cannot be changed. Learning rules for these weights will now be presented.
To write the learning rules in concise form, the following additional definitions are used: [...] Using online learning therefore necessitates computing these expectations over a suitable time window of past inputs. To write the learning rules in concise notation a non-linear floating average $\bar{O}$ of the above expectations is introduced as: [...] Last, we note that for the specific implementations of CIP, the activation function was chosen as: [...] (C.12) with $0 \leq k_1 < 1$ and $k_2 > 0$, and with $x_1$, $x_2$ being realizations of $X_1$, $X_2$ from equations C.1, C.2. This specific activation function [22,9] guarantees that:
• Zero output activation can only be obtained if the summed driving input $X_1$ is zero.
• For zero summed contextual input $X_2$, the output equals the summed driving input.
• A summed contextual input of the same sign as the summed driving input leads to an amplification of the output. The reverse holds for unequal signs.
• The sign of the output is equal to the sign of the summed driving input.
These four properties were seen as essential for an activation function that supports coherent infomax.
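The four properties can be checked numerically for a candidate activation function. The form used below, $A(x_1, x_2) = x_1 [k_1 + (1 - k_1) \exp(k_2 x_1 x_2)]$, is an assumption: it is a form known from the coherent infomax literature that is compatible with the stated constraints on $k_1$ and $k_2$, but it is not confirmed here to be identical to equation C.12. The parameter values are illustrative.

```python
import numpy as np

def activation(x1, x2, k1=0.5, k2=2.0):
    """Assumed activation: driving input x1, modulated by contextual input x2.

    Requires 0 <= k1 < 1 and k2 > 0; the default values are illustrative only.
    """
    return x1 * (k1 + (1.0 - k1) * np.exp(k2 * x1 * x2))

# Property 1: zero output only for zero summed driving input.
zero_drive = activation(0.0, 3.0)

# Property 2: without contextual input the output equals the driving input.
no_context = activation(1.5, 0.0)

# Property 3: context of equal sign amplifies, context of opposite sign attenuates.
amplified = activation(1.0, 1.0)
attenuated = activation(1.0, -1.0)

# Property 4: the sign of the output follows the sign of the driving input.
grid = np.linspace(-2.0, 2.0, 9)
drive, context = np.meshgrid(grid, grid, indexing="ij")
signs_match = bool(np.all(np.sign(activation(drive, context)) == np.sign(drive)))

print(zero_drive, no_context, amplified, attenuated, signs_match)
```

Because the bracketed factor is always positive, the contextual input can scale the response to the driving input but can neither create an output on its own nor reverse its sign.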

Figure 1: Neural processors: (A) Neural processor with multidimensional inputs $X_1$, $X_2$, and output $Y$. (B) Processor with local weighted summation of inputs as used in coherent infomax and in this study. To establish the link to the coherent infomax literature we identify the input $X_1$ with the receptive field input $R$, which may be excitatory (e) or inhibitory (i), and which is summed. In the same way, $X_2$ is identified with the contextual input $C$. (C) Overlay of the coherent infomax neural processor on a layer 5 pyramidal cell, highlighting potential parallels to existing physiological mechanisms. Layer 5 cells created with the TREES toolbox [21], courtesy of Hermann Cuntz.

Figure 2: Partial information diagram with both classical information terms (solid lines) and PID terms (color patches).

Figure 3: Graphical depiction of the various contributions to $F$ and their weighting coefficients in the PID diagram. (A) Classical unconditional mutual information terms. (B) Unused bandwidth, weighted by $\Phi_3$. (C) Conditional mutual information $I(Y : X_1 | X_2)$, weighted by $\Phi_1$. (D) Conditional mutual information $I(Y : X_2 | X_1)$, weighted by $\Phi_2$. Note the overlap of this contribution with the one from (C). (E) The three-way information $I(Y : X_1 : X_2)$, weighted by $\Phi_0$. Here the three-way information is the checkered minus the striped area. (F) This region appears in (C), (D), (E) and is weighted accordingly by three coefficients simultaneously ($\Phi_0$, $\Phi_1$, $\Phi_2$). The area in (F) is the synergistic mutual information that is also shown in cyan in Fig. 2.

Figure 4: Graphical depiction of a minimalistic, binary predictive coding circuit. This circuit can be conceived of as one neural processor (indicated by the box) with inputs $X_1(t-1)$, $X_1(t-1)$ and (main) output $Y(t)$.