Entropy and information causality in general probabilistic theories

We investigate the concept of entropy in probabilistic theories more general than quantum mechanics, with particular reference to the notion of information causality (IC) recently proposed by Pawlowski et al (2009 arXiv:0905.2292). We consider two entropic quantities, which we term measurement and mixing entropy. In the context of classical and quantum theory, these coincide, being given by the Shannon and von Neumann entropies, respectively; in general, however, they are very different. In particular, while measurement entropy is easily seen to be concave, mixing entropy need not be. In fact, as we show, mixing entropy is not concave whenever the state space is a non-simplicial polytope. Thus, the condition that measurement and mixing entropies coincide is a strong constraint on possible theories. We call theories with this property monoentropic. Measurement entropy is subadditive, but not in general strongly subadditive. Equivalently, if we define the mutual information between two systems A and B by the usual formula I(A: B)=H(A)+H(B)-H(AB), where H denotes the measurement entropy and AB is a non-signaling composite of A and B, then it can happen that I(A:BC)<I(A:B). This is relevant to IC in the sense of Pawlowski et al: we show that any monoentropic non-signaling theory in which measurement entropy is strongly subadditive, and also satisfies a version of the Holevo bound, is informationally causal, and on the other hand we observe that Popescu–Rohrlich boxes, which violate IC, also violate strong subadditivity. We also explore the interplay between measurement and mixing entropy and various natural conditions on theories that arise in quantum axiomatics.


Introduction
One can view quantum mechanics as an extension of the classical probability calculus, allowing for random variables that are not simultaneously measurable. In order to gain a clearer understanding of quantum theory from this perspective, it is useful to contrast it with various (fictitious) alternatives that are neither classical nor quantum. The best-known example of such a 'foil' probabilistic theory is probably the theory of 'non-local boxes' [1,28,29]; but in fact, there is a standard mathematical framework for such theories, going back to the work of Mackey in the 1950s [26]. Working in this framework, one can show that many phenomena commonly regarded as characteristically quantum (no-cloning and no-broadcasting theorems [2,3], the trade-off between state disturbance and measurement [1], and the existence and basic properties of entangled states [1,2,23,24]) are actually quite generic features of all non-classical probabilistic theories satisfying a basic 'non-signaling' constraint. Other quantum phenomena, such as the possibility of teleportation [4] or remote steering of ensembles [6], are more special (and in some sense, more classical), but can still be seen to arise outside the boundaries of quantum theory. One might hope to find some reasonably short list of probabilistic or information-theoretic phenomena that more cleanly separates quantum theory from other possible non-signaling theories. In a recent paper [27], Pawlowski et al take a step in this direction by showing that any non-signaling correlation violating the Tsirel'son bound also violates a qualitative information-theoretic principle they call information causality (IC). In essence, this prohibits a form of 'multiplexing' in which one party (Bob) gains the ability to access a total of more than m bits of information held by another party (Alice), on the basis of an m-bit message from Alice plus some shared non-signaling bipartite state.
It is also established in [27] that quantum mechanics, and hence also classical probability theory, satisfies this IC constraint.
In establishing that quantum mechanics satisfies IC, Pawlowski et al make use only of standard formal properties of the von Neumann entropy of joint quantum states. This raises the obvious question of where their proof breaks down in other contexts (e.g. a PR box) in which IC fails. In order to address this question, we develop some of the basic machinery of entropy, conditional entropy and mutual information in a very general probabilistic setting, an independently interesting problem that seems not to have received much previous attention (an exception being the paper [20] of Hein).
We begin by identifying two notions of entropy, which we call measurement and mixing entropy, and which we denote respectively by H(A) and S(A), where A is a general probabilistic model. Briefly, the measurement entropy of a system is the minimum Shannon entropy of any possible measurement thereon, whereas the mixing entropy is the infimum of the Shannon entropies of the various ways of preparing the system's state as a mixture of pure states. These coincide classically and in quantum theory, but are generally quite different animals. For example, measurement entropy is always subadditive and concave; mixing entropy is generally neither. In fact, in appendix A, we show that there are always violations of concavity of the mixing entropy for any system with a state space that is a non-simplicial polytope. Thus, the condition that mixing and measurement entropies do coincide, as in quantum mechanics, is a powerful constraint on the structure of a probabilistic theory. We call theories with this feature monoentropic.
Next, we develop an account of joint measurement entropy, conditional entropy and mutual information for composite systems, and apply this apparatus to the notion of IC given in [27]. Somewhat surprisingly, it seems that the main issue is not so much the strength of non-local correlations as the failure of two other, very basic principles. One is strong subadditivity or, equivalently, the condition that the mutual information, defined by I(A : B) = H(A) + H(B) − H(AB), satisfies I(A : B) ≤ I(A : BC) whenever B is part of a larger composite BC. This holds both classically and in quantum theory, but is violated in very simple non-classical models, even models in which A and B are classical, so that no issue of non-locality can arise. Another basic principle, equivalent to the Holevo bound, is that I(E : B) ≤ I(A : B), where E is any particular measurement on system A.
Both strong subadditivity and the Holevo bound can be viewed as special cases of an even more basic principle, usually called the data processing inequality (DPI). This asserts that, for any systems A, B and B′, and for any reasonable process E : B → B′, we have I(A : E(B)) ≤ I(A : B) (where E(B) := B′ is the output system of the process). This is intuitively appealing as a basic physical postulate. Finally, we apply the apparatus just described to the notion of IC. We consider in detail the basic example, due to van Dam [33], of an IC-violating composite system, and find that it exhibits a violation of strong subadditivity. We also establish that, within a very broad class of finite-dimensional monoentropic theories, strong subadditivity together with the Holevo bound entails IC. It remains an open question whether all three of these conditions are necessary for this conclusion.
The remainder of this paper is organized as follows. In section 2, we review in some detail the framework of generalized probability theory, largely following [2]. In section 3, we define, and establish some elementary properties of, measurement and mixing entropy for states of an arbitrary probabilistic model. Section 4 discusses composite systems in our framework, and collects some observations about the behavior of joint measurement entropy and the notion of mutual information based on this. Using this apparatus, we establish in section 5 that any monoentropic probabilistic theory in which measurement entropy is strongly subadditive and satisfies the Holevo bound is informationally causal in the sense of [27]. We also point out that violations of strong subadditivity are possible in theories having no entanglement. Section 6 collects some final remarks and open questions. Appendix A contains the proof that mixing entropy is not concave on state spaces that are non-simplicial polytopes. Appendix B establishes some further properties of monoentropic theories, relevant to axiomatic characterizations of quantum theories, and also shows that monoentropicity follows from two other properties, steering and pure conditioning, the physical content of which may be more transparent. Finally in appendix C, we discuss how the framework of this paper relates to the 'convex sets' framework, and consider analogous definitions of measurement entropy in that context.

General probabilistic models
As mentioned above, there is a more-or-less standard mathematical framework for discussing general probabilistic models, going back at least to the work of Mackey in the 1950s, and further developed (or, in some cases, rediscovered) in succeeding decades by various authors [1,13,15,16,18,25]. In what follows, we work in the idiom of [8], which we briefly recall.
We characterize a probabilistic model or, more briefly, a system, by a pair A = (A, Ω), where A is a collection, possibly infinite, of discrete classical experiments or measurements and Ω is a set of states. We make the following assumptions.
(i) Every experiment in A is defined by its set of possible outcomes, so that we may represent A, mathematically, as a collection of sets E, F, . . .. In the language of [16,35], A is a test space, and the sets E ∈ A are tests.
(ii) Every state α ∈ Ω assigns to each measurement outcome x a probability α(x) ∈ [0, 1], in such a way that the probabilities of the outcomes of any single test E ∈ A sum to 1.
For a given test space A, one can define the space of all states on A. This is called the maximal state space and is denoted by Ω(A). It is clearly convex. The physical state space Ω is necessarily either equal to, or a subset of, the maximal state space. This framework, although very simple, is broad enough to accommodate both measure-theoretic classical probability theory and non-commutative probability theory based on von Neumann algebras (see footnote 7). In this paper, we shall be interested exclusively in discrete, finite-dimensional systems. Accordingly, from this point forward, we make the standing assumptions that (i) A is locally finite, meaning that all tests E ∈ A are finite sets (footnote 8), and (ii) Ω is finite dimensional and closed.
[Figure 1. A test space depicted as a Greechie diagram, wherein vertices denote outcomes and every smooth line through a set of vertices represents a test.]
As is easily checked, local finiteness guarantees that the maximal state space Ω(A) is compact; thus, the closedness of the physical state space Ω ensures that it, too, is compact (footnote 9). It follows that every state can be represented as a finite convex combination, or mixture, of pure states, that is, extreme points of Ω.
We now consider several examples. For us, a classical system corresponds to a pair ({E}, Δ(E)), where the test space {E} consists of a single measurement and Δ(E) denotes the entire simplex of probability weights on E. In other words, there is just one test, and any probability distribution over the outcomes is a possible state. A quantum system corresponds to (F(H), Ω(H)), where F(H) is the set of (unordered) orthonormal bases of a complex Hilbert space H and Ω(H) is the set of density operators (footnote 10).
A simple example that is neither classical nor quantum, and to which we shall refer often, is the 'two-bit' test space A_2 = {{a, a′}, {b, b′}}, consisting of a pair of two-outcome tests, depicted in figure 1. The full state space Ω(A_2) is isomorphic to the unit square [0, 1]² under the map α → (α(a), α(b)) and is depicted in figure 2. Accordingly, we shall call a system of this form a square bit or squit. A PR box is a particular entangled state of two squits, as discussed below in section 5.1.

Measurement and mixing entropies
Let H be a finite-dimensional Hilbert space, representing a quantum system. The von Neumann entropy of a state ρ on this system is defined as −Tr(ρ log ρ), where here and elsewhere, logarithms have base 2. Equivalently, it is the Shannon entropy of the coefficients λ_i in the spectral decomposition ρ = Σ_i λ_i P_i (where the P_i are ρ's rank-one eigenprojections). In effect, the spectral decomposition privileges a particular convex decomposition of the state and (up to phases) a privileged test in F(H). In our much more general setting, where we have nothing like a spectral theorem, how might we define the entropy of a state? The following definitions suggest themselves.

Definition 1. Let α be a state of A = (A, Ω). For each test E ∈ A, let H_E(α) denote the Shannon entropy of the outcome distribution that α induces on E, i.e. H_E(α) = −Σ_{x∈E} α(x) log α(x). The measurement entropy of α, denoted H(α), is the infimum of H_E(α) over all tests E ∈ A.

Note that the measurement entropy of a state of A = (A, Ω) depends entirely on the structure of A, and is independent of the choice of state space Ω. It will often be convenient to write H(α) as H(A), where context makes clear which state is being considered.

Footnote 7. Measure-theoretic classical probability theory is, in effect, the theory of systems of the form (D, Ω), where D = D(S, Σ) is the set of all finite (respectively, countable) partitions E = {a_i} of a measurable space S by nonempty measurable sets a_i ∈ Σ, and Ω is some closed convex set of probability measures on (S, Σ). The probabilistic apparatus of states and observables associated with von Neumann algebras can be modeled in a similar way.
Footnote 8. Alternatively, this condition could be derived from some other mild conditions on test spaces, as discussed in appendix B.
Footnote 9. By Shultz [31], any compact convex set can be represented as the full state space Ω(A) of some locally finite test space A.
Footnote 10. To be a bit more precise, a quantum state is the quadratic form associated with a density operator. We shall routinely identify a density operator ρ with its quadratic form, writing ρ(x) for ⟨ρx, x⟩, where x is a unit vector in H.
For the remainder of this paper, we make, and shall make free use of, the assumption that the measurement entropy of a state is actually achieved on some test, i.e. that H(α) = H_E(α) for some E ∈ A. This is the case in quantum theory, and can be shown to hold much more generally, given some rather weak analytic requirements on an abstract model (A, Ω); for details, see appendix B. It follows that H(α) = 0 if and only if there is a test such that α assigns probability 1 to one of its outcomes.
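As a concrete illustration, the measurement entropy of a squit state can be computed directly: the test space A_2 has just two tests, so the infimum over tests is a minimum of two binary Shannon entropies. The following minimal sketch (function names are ours, not the paper's) makes this explicit.

```python
import math

def shannon(probs):
    """Shannon entropy (base 2) of a finite probability vector."""
    return 0.0 - sum(p * math.log2(p) for p in probs if p > 0)

def squit_measurement_entropy(pa, pb):
    """Measurement entropy of a squit state alpha with alpha(a) = pa, alpha(b) = pb.

    The test space A_2 has exactly two tests, {a, a'} and {b, b'}, so the
    infimum over tests reduces to a minimum of two binary Shannon entropies.
    """
    return min(shannon([pa, 1 - pa]), shannon([pb, 1 - pb]))

# A state assigning probability 1 to an outcome of some test has H = 0,
# regardless of its statistics on the other test.
print(squit_measurement_entropy(1.0, 0.5))  # 0.0
print(squit_measurement_entropy(0.5, 0.5))  # 1.0
```

Note that the first state has zero measurement entropy even though its statistics on the other test are completely random, illustrating how strongly the choice of test space shapes this quantity.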

Definition 2.
Let α be a state on A. The mixing (or preparation) entropy of α, denoted S(α), is the infimum of the classical (Shannon) entropy H(p_1, . . . , p_n) over all finite convex decompositions α = Σ_i p_i α_i with the α_i pure.
Again, we write S(A) for S(α) where α belongs to the state space of a system A = (A, Ω). In contrast to measurement entropy, the mixing entropy of a state depends only on the geometry of the state space Ω, and is independent of the choice of test space A. The mixing entropy is essentially the same as the entropy defined for elements of compact convex sets by Uhlmann [34]. The mixing entropy of a pure state is 0.
Trivially, in classical probability theory, measurement and mixing entropies coincide, both being simply the Shannon entropy. Much less trivially, measurement and mixing entropies also coincide in quantum theory, where they equal the von Neumann entropy (footnote 11). As the following example shows, however, measurement and mixing entropies can be quite different.

Example 1. Consider the test space A depicted in figure 3. One can check that Ω(A) has five pure states. Consider the pure states β and γ determined by β(b) = β(z) = 1 and γ(x) = γ(y) = γ(z) = 1, and let ω := ½β + ½γ be their average. The mixing entropy of ω is S(ω) = 1. This follows from the fact that the only convex decomposition of ω into pure states is into β and γ, which in turn follows from the fact that these are the only pure states that assign probability one to z. On the other hand, ω(z) = 1, so H(ω) = 0.
Even in the general case, measurement entropy is quite well behaved. For example, it is easy to see that H(α) is continuous as a function of α. Further, we have the following theorem.

Theorem 1. Measurement entropy is concave: for any states α_i and weights p_i ≥ 0 with Σ_i p_i = 1,
H(Σ_i p_i α_i) ≥ Σ_i p_i H(α_i). (1)

Proof. Since for each test E the local entropy H_E is concave, and H is the pointwise infimum of the concave functions H_E, H is itself concave.

Mixing entropy is, by contrast, a curious beast. The following example shows that it need not be continuous as a function of the state.

Example 2. Consider a compact convex state space C, as in figure 4, containing a non-extreme point α with S(α) > 0 that is nevertheless a limit of extreme points of C. Then α can be approached as closely as we like by extreme points belonging to C \ {α}, which have mixing entropy 0. The mixing entropy is therefore discontinuous at α.

Example 3.
Let Ω be a square, as illustrated in figure 5. Let α and β be the midpoints of adjacent faces, noting that each has unit mixing entropy, S(α) = S(β) = 1. Let γ = ½(α + β) be the midpoint of the line segment between α and β, and note that it also lies on the line segment between antipodal vertices of Ω (the diagonal through the square between the chosen faces). But since γ is not the midpoint of this diagonal, the Shannon entropy of the associated convex decomposition is less than one, and hence so is the infimum over convex decompositions. Therefore, the mixing entropy of γ satisfies S(γ) < 1. Consequently, S(γ) < ½S(α) + ½S(β), and we have a failure of concavity of the mixing entropy.
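Example 3 can also be checked numerically. Taking the unit square with pure states at the four corners (a coordinatization we are assuming for illustration, consistent with figure 2), the decompositions of γ into pure states form a one-parameter family, so the mixing entropy can be brute-forced by a grid search:

```python
import math

def shannon(probs):
    """Shannon entropy (base 2), ignoring (numerically) zero weights."""
    return 0.0 - sum(p * math.log2(p) for p in probs if p > 1e-12)

# Assumed coordinatization: pure states at (0,0), (1,0), (1,1), (0,1);
# alpha = (0.5, 0) and beta = (1, 0.5) are midpoints of adjacent faces.
gx, gy = 0.75, 0.25   # gamma = (alpha + beta) / 2

# Every decomposition of gamma into the four corners has weights
# (t, gx - t, t, gy - t) for t in [0, gy], so a grid search over t
# brute-forces the mixing entropy S(gamma).
S_gamma = min(shannon([t, gx - t, t, gy - t])
              for t in (i * gy / 2000 for i in range(2001)))
print(round(S_gamma, 3))  # 0.811, strictly below (S(alpha) + S(beta)) / 2 = 1
```

The minimum is attained at t = 0, i.e. at the two-point decomposition along the diagonal, matching the argument in the example.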
In fact, the failure of concavity for the mixing entropy is quite generic.

Theorem 2.
Mixing entropy is not concave whenever the state space is a non-simplicial polytope.
The proof is given in appendix A. It follows that an assumption of concavity for the mixing entropy forces the state space to be either a simplex (i.e. classical) or not a polytope. Hence such an assumption, or one that implies it, may be a useful tool in axiomatizing quantum theory.

It is natural to ask what follows from the condition that, as in classical and quantum theories, measurement and mixing entropy coincide. One immediate consequence is that mixing entropy will be concave. In view of theorem 2, this implies that either the system is essentially classical or there are infinitely many pure states. Hence equality of measurement and mixing entropies substantially narrows the field of possible theories. We discuss this matter further in appendix B.
Both measurement and mixing entropies have been considered before, notably by Uhlmann [34] (who considered preparation entropy) and by Hein [20] in a similar context, albeit with aims somewhat different from ours. There are various other entropic quantities one could reasonably consider. For example, a concept of entropy that might be more closely related to operational tasks is the supremum, over convex decompositions of the state and over tests, of the classical mutual information between the random variable specifying the element of the convex decomposition and the random outcome of the test. Natural analogues of this quantity, and of the measurement and preparation entropies defined above, exist in the closely related ordered linear spaces framework (also known as the convex sets framework) for theories. Test space models such as we have defined above induce ordered linear spaces models by a linearization procedure that embeds the test space in a vector space and identifies outcomes in the test space with certain elements of the dual vector space; this procedure allows one to define concepts of measurement entropy more tightly related to the geometry of the state space, which can usually be viewed as special cases of the test space definition. Appendix C gives a further brief discussion of this.
From this point on, we focus mainly on measurement entropy. As always with mathematical definitions, there is a certain tension between the ideals of flexibility and generality, on the one hand, and the desire to avoid annoying pathologies, on the other. Our test-space-dependent definition of measurement entropy definitely errs on the side of the former, in that it is consistent with quite absurd examples. For example, if one includes in one's test space a test having a single outcome, then all states will automatically have zero entropy. One can avoid such difficulties by placing various restrictions on the test spaces to be considered, at the cost of a slightly more involved technical development. Going to the linearized setting mentioned above may also help. Our work in this paper does not demand such fastidiousness, however, as our results are of a very general character.

Composite systems and joint entropy
Most of the interesting problems of information theory involve more than one system. The following subsection describes how to treat composite systems in the language of test spaces. The idea is that, given systems A and B, the joint system AB should be associated with a test space and state space of its own. However, there is no unique recipe for determining test and state spaces for AB given the test and state spaces for A and B. Instead, a theory must give additional rules that specify how systems combine. Our results will pertain to a variety of notions of composition, although we limit the scope by requiring certain properties to hold. In particular, we assume that the test space of the composite includes all product tests and conditional two-stage tests (where one party's choice of test is conditioned on the outcome of the other party's test). One motivation for this is to have a test space that is sufficiently rich to be interesting. Another is that this assumption guarantees that all states are non-signaling. We go on to define analogues of familiar quantities, such as joint entropies and mutual information, which are used later to analyze IC.

Composite systems
Consider two systems A and B, where A = (A, A ) and B = (B, B ). For convenience, assume that these are controlled by two parties, called Alice and Bob. The first, and most basic, assumption we shall make is that Alice can perform any test E ∈ A simultaneously with Bob performing any test F ∈ B. This can be regarded as a single product test. The possible outcomes of this product test are pairs of the form (e, f ) ∈ E × F.

Definition 3. The Cartesian product of the test spaces A and B, denoted A × B, is the collection of all product tests.
The set Ω(A × B) of all states that can be defined on the Cartesian product test space typically includes signaling states, which allow Alice to send messages instantaneously to Bob, or vice versa, by varying her choice of which test to perform.
If a state ω_AB is non-signaling, it is possible to define the marginal (or reduced) state ω_A via
ω_A(e) = Σ_{f∈F} ω_AB(e, f),
where equation (2) ensures that the right-hand side is independent of F ∈ B. The marginal ω_B is defined similarly.
If ω_AB is non-signaling, it is also possible to define a conditional state ω_{B|e}. Informally, this is the updated state at Bob's end following the outcome e being obtained for a test at Alice's end:
ω_{B|e}(f) = ω_AB(e, f) / ω_A(e).
By convention, ω_{B|e} is zero if ω_A(e) is zero. The conditional state ω_{A|f} is defined similarly.

Note that a particular type of measurement, which might be thought reasonable, is not included in the Cartesian product. This is a joint measurement in which Alice first measures her system and then communicates the result to Bob, who performs a measurement that depends on Alice's outcome. Entangled measurements, such as are allowed in quantum theory, are also not included. Hence the Cartesian product A × B models a situation in which Alice and Bob are fairly limited: they can act independently and collate the results of their actions at a later time, but cannot otherwise communicate.
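The marginal and conditional states just defined are easy to compute for any state given by a table of joint probabilities. The sketch below uses the PR box of section 5.1 as a standing example, under the hypothetical convention box[(x, y)][(a, b)] = probability of outcomes (a, b) on the product test (x, y), and checks the no-signaling condition along the way.

```python
# Tabular representation (our convention, for illustration) of a two-squit
# state: box[(x, y)][(a, b)] is the probability of outcomes (a, b) when
# Alice performs her test x and Bob his test y.
# Here: a PR box, defined by the correlation a XOR b = x AND y.
def pr_box():
    return {(x, y): {(a, b): (0.5 if (a ^ b) == (x & y) else 0.0)
                     for a in (0, 1) for b in (0, 1)}
            for x in (0, 1) for y in (0, 1)}

def marginal_A(box, x, y):
    """Alice's marginal on test x, computed from the product test (x, y)."""
    return [sum(p for (a, b), p in box[(x, y)].items() if a == out)
            for out in (0, 1)]

def conditional_B(box, x, y, a):
    """Bob's conditional state on test y, given Alice's outcome a on test x."""
    pa = marginal_A(box, x, y)[a]
    return [box[(x, y)][(a, b)] / pa for b in (0, 1)]

box = pr_box()
# No-signaling: Alice's marginal is independent of Bob's choice of test y.
assert all(marginal_A(box, x, 0) == marginal_A(box, x, 1) for x in (0, 1))
print(marginal_A(box, 0, 0))        # [0.5, 0.5]
print(conditional_B(box, 1, 1, 0))  # [0.0, 1.0]
```

The conditional state here is deterministic: once both tests are fixed, Alice's outcome determines Bob's, which is why the definition of ω_{B|e} must divide by ω_A(e).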
It is possible to construct a more sophisticated product of two test spaces, which does allow for the kind of two-stage measurements just described (although still not entangled measurements). Let ←→AB denote the test space consisting of the following:
1. All two-stage tests in which a test E ∈ A is performed and then, depending on the outcome e obtained, a pre-selected test F_e ∈ B is performed.
2. All two-stage tests in which a test F ∈ B is performed and then, depending on the outcome f obtained, a pre-selected test E_f ∈ A is performed.

←→AB is called the Foulis-Randall or bilateral product of the test spaces A and B.
The Foulis-Randall product contains the Cartesian product, A × B ⊆ ←→AB, because product tests are a special case of two-stage tests. Furthermore, if either A or B is non-classical, then not all two-stage tests are product tests, so the containment is strict. The containment of one test space in another has consequences for their state spaces. Specifically, if X and Y are test spaces such that X ⊆ Y, then the convex set Ω(Y) may live in a higher-dimensional space than Ω(X), but the restrictions of states in Ω(Y) to X (which are well defined, because every test in X is also a test in Y) are all contained in Ω(X). In other words, writing Ω(Y)|_X for the set of restrictions to X of states on Y, we have Ω(Y)|_X ⊆ Ω(X). Because the additional measurements in Y place additional constraints on these states, the containment may well be strict.
It follows that the restriction of the maximal state space of the Foulis-Randall product to the Cartesian product is contained within the maximal state space of the Cartesian product, Ω(←→AB)|_{A×B} ⊆ Ω(A × B). The containment is strict if one of the systems is non-classical. Indeed, the states in Ω(←→AB)|_{A×B} correspond exactly to the non-signaling states in Ω(A × B). This is demonstrated in [16]. See [5] and [37] for further examples and properties of the Foulis-Randall product and related constructions.
We are now prepared to define the class of test and state spaces for composites in which we shall be interested. The test space of the composite, which we denote by C, is required to contain the Foulis-Randall product of the components, ←→AB ⊆ C. The state space of the composite, which we denote by Ω_AB, is unconstrained beyond being a subset of the maximal state space Ω(C). Since Ω(C)|_{←→AB} ⊆ Ω(←→AB), and all the states in Ω(←→AB) are non-signaling, it follows that all states in Ω_AB are non-signaling. Indeed, the main motivation for confining our attention to test spaces containing ←→AB is that this is sufficient to ensure no-signaling without any further constraints on the state space.
Given a state ω_AB ∈ Ω_AB, the marginals ω_A, ω_B and conditionals of the form ω_{A|f}, ω_{B|e} are defined in the obvious way by the probabilities that ω_AB assigns to the product tests. Furthermore, we assume that the composite systems we consider satisfy the following natural requirement: if a test is performed on system A, the conditional state on system B must be allowed in the theory, i.e. be contained in Ω_B, and vice versa. Hence Ω_AB satisfies the constraint that, for all e and f such that ω_A(e), ω_B(f) ≠ 0, the states ω_{B|e} and ω_{A|f} belong to Ω_B and Ω_A, respectively. This is enough to ensure that the marginal states ω_A, ω_B also belong to the state spaces of the component systems.
A general composite test space C may contain non-product measurements, which are not contained in the Foulis-Randall product. Quantum theory, for instance, has a test space for composites that is larger than the Foulis-Randall product. If A and B are quantum systems, so that A = (F(H), Ω(H)) and B = (F(K), Ω(K)), then the quantum joint system is AB := (F(H ⊗ K), Ω(H ⊗ K)), which is a composite in our sense and contains non-product measurement outcomes, for instance entangled ones.
Henceforth, AB will stand for a general non-signaling composite of systems A and B. In the particular case where A = ({E}, Δ(E)) is a classical system, we always take C to be the Foulis-Randall product ←→{E}B. We also assume that composition of systems is associative, so that for any three systems A, B and C, there is a natural isomorphism A(BC) ≅ (AB)C.
In addition to specifying how systems combine, a probabilistic theory must specify what sorts of systems are allowed. For instance, in finite-dimensional quantum theory, every dimensionality of Hilbert space defines a different type of system, and they are all allowed. Furthermore, a classical system of arbitrary dimensionality (that is, arbitrary cardinality of the test) can be defined within quantum theory as a restriction of a quantum system of the same dimensionality, so in this sense classical systems are allowed as well. A probabilistic theory must specify the types of systems that are allowed and how these compose. We shall confine our attention to theories incorporating only finite-dimensional systems, and ones that contain, for any finite set E, the classical system ({E}, Δ(E)). (Thus, for us, quantum theory means finite-dimensional quantum theory in conjunction with classical systems.) For a discussion of what such theories might look like in category-theoretic terms, see [9,10].

Joint entropies, conditional entropies and mutual information
Consider a composite system AB = (C, Ω_AB). The measurement entropy H(ω_AB) of a state ω_AB ∈ Ω_AB, which we shall sometimes denote by H(AB), is the infimum over E ∈ C of H_E(ω_AB). In this context, it will also be understood that H(A) and H(B) stand for the entropies H(ω_A) and H(ω_B) of the marginal states ω_A and ω_B.

Definition 5. The conditional measurement entropy of A given B is defined as
H(A|B) := H(AB) − H(B). (4)
Our notation here is less precise than it might be, because the joint entropy H(AB) depends on the test space associated with the joint system, and hence so do conditional entropies. We shall try to be clear, at any point where the question could arise, as to which product is in play. Classically, given a joint distribution ω_AB over variables A and B, one defines the mutual information by
I(A : B) = H(A) + H(B) − H(AB), (5)
where H denotes the Shannon entropy. One can regard this as a measure of how far A and B are from being independent: by subadditivity, I(A : B) ≥ 0, with I(A : B) = 0 iff A and B are independent, i.e. ω_AB factorizes. In attempting to extend the concept of mutual information to more general models, one might very naturally consider defining I(A : B) to be the maximum of the mutual informations I(E : F) as E and F range over tests belonging to systems A and B, respectively. However, the usual practice in quantum theory is simply to take equation (5), with von Neumann entropies replacing Shannon entropies, as defining mutual information.
In general, this gives a different value. In order to facilitate comparison with quantum theory, we shall adopt the following.

Definition 6. The mutual information between systems A and B in the joint state ω_AB is
I(A : B) := H(A) + H(B) − H(AB), (6)
where H is the measurement entropy.
With this definition, the subadditivity of measurement entropy (theorem 3) implies that measurement-entropy-based mutual information is non-negative. Hereafter, we refer to this simply as 'mutual information'. Note that equation (5) is a special case of this definition. Now, intuitively, one might expect that the mutual information I(A : B) between two systems should not decrease if we recognize that B is a part of some larger composite system BC, i.e. that I(A : B) ≤ I(A : BC). Simple algebraic manipulations (using equations (4) and (6)) allow us to reformulate this condition in various ways. Both the Shannon and von Neumann entropies are strongly subadditive. In the former case, this is a straightforward exercise; in the latter, it is a relatively deep fact. Colloquially, this means that in classical and quantum theories, simply forgetting about or discarding a system C never increases one's mutual information between systems A and B. As the following shows, however, strong subadditivity can fail in general theories, even when two of the three systems are classical. One potential gloss is that discarding or forgetting about system C can increase the mutual information between systems A and B. But a more sensible reading is perhaps that the quantity defined as mutual information should not, in the general case, be interpreted as 'the information one system contains about another.' Note that the foregoing example is all but classical, depending not on any notion of entanglement or non-locality, but only on the fact that one can measure either, but never both, of {e, e′} and {f, f′}.
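For the PR box of section 5.1, the measurement-entropy-based mutual information can be computed directly. A minimal numerical sketch (the tabular representation of the box, and the restriction of the infimum to product tests, are our assumptions for illustration):

```python
import math

def shannon(probs):
    """Shannon entropy (base 2), ignoring zero weights."""
    return 0.0 - sum(p * math.log2(p) for p in probs if p > 0)

# PR-box joint probabilities on the product test (x, y): a XOR b = x AND y.
def pr_probs(x, y):
    return [0.5 if (a ^ b) == (x & y) else 0.0 for a in (0, 1) for b in (0, 1)]

# Joint measurement entropy, with the infimum restricted to product tests.
# (For the PR box, two-stage tests give the same value, since either party's
# outcome determines the other's once both tests are fixed.)
H_AB = min(shannon(pr_probs(x, y)) for x in (0, 1) for y in (0, 1))

# Both marginals are uniform on every local test, so H(A) = H(B) = 1.
H_A = H_B = 1.0
I_AB = H_A + H_B - H_AB
print(H_AB, I_AB)  # 1.0 1.0
```

Every product test yields a uniform distribution over two outcomes, so H(AB) = 1 and hence I(A : B) = 1, a full bit of mutual information despite the maximally mixed marginals.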
This section concludes with some lemmas, which hold in the special case that one or more of the systems in the composite is classical. Some of these are useful later on.

Lemma 2. Let ω_AB be a state on AB, where A is classical. Then

H(AB) = H(A) + Σ_{e∈E} ω_A(e) H(ω_B|e).

The proof is straightforward. As a shorthand, when A is classical we might write H(B|A) for Σ_{e∈E} ω_A(e) H(ω_B|e); with this notation, lemma 2 reads H(AB) = H(A) + H(B|A), which is immediate from equations (4) and (7).

Corollary 2. If A is classical and independent of B, then H (AB) = H (A) + H (B).
Proof. The assertion that A and B are independent means that the joint state is ω_AB = ω_A ⊗ ω_B, i.e. that ω_B|e = ω_B for all e ∈ E. By lemma 2, we have H(AB) = H(A) + Σ_{e∈E} ω_A(e) H(ω_B) = H(A) + H(B). Finally, strong subadditivity does hold in the special case that systems A and C in lemma 1 are classical. Colloquially, discarding a classical system can never increase the mutual information between a general system and another classical system.
By lemma 2, applied to the classical system AC, we have H(ABC) = H(AC) + Σ_{e,g} ω_AC(eg) H(ω_B|eg), and likewise H(BC) = H(C) + Σ_g ω_C(g) H(ω_B|g). We can rewrite the difference H(A|BC) − H(A|C) = H(ABC) − H(BC) − H(AC) + H(C) as

Σ_{e,g} ω_AC(eg) H(ω_B|eg) − Σ_g ω_C(g) H(ω_B|g).    (8)

Since measurement entropy is concave, H(ω_B|g) ≥ Σ_e ω_A|g(e) H(ω_B|eg) for each g. It follows that Σ_{e,g} ω_AC(eg) H(ω_B|eg) − Σ_g ω_C(g) H(ω_B|g) ≤ 0, which, combined with equation (8), gives the desired result that H(A|BC) ≤ H(A|C).
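A quick numerical sanity check of the fully classical case (our own sketch; helper names are hypothetical): strong subadditivity of the Shannon entropy, H(AB) + H(BC) ≥ H(ABC) + H(B), holds for randomly sampled tripartite distributions.

```python
import numpy as np

def shannon(p):
    """Shannon entropy (in bits) of any array of probabilities."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.random((3, 3, 3))       # random joint distribution on A, B, C
    p /= p.sum()
    h_abc = shannon(p)
    h_ab = shannon(p.sum(axis=2))   # marginal on AB
    h_bc = shannon(p.sum(axis=0))   # marginal on BC
    h_b = shannon(p.sum(axis=(0, 2)))
    # strong subadditivity: H(AB) + H(BC) >= H(ABC) + H(B)
    assert h_ab + h_bc >= h_abc + h_b - 1e-9
print("strong subadditivity holds on all samples")
```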

Data processing and the Holevo bound
A fundamental result in quantum information theory, the Holevo bound, asserts that if Alice prepares a quantum state ρ = Σ_{x∈E} p_x ρ_x for Bob, then, for any measurement F that Bob can make on his system,

I(E : F) ≤ S(ρ) − Σ_{x∈E} p_x S(ρ_x)

(the right-hand side is often called the Holevo quantity).
This inequality makes sense in our more general setting. Suppose that Alice has a classical system A = ({E}, Δ(E)) and Bob a general system B. Alice's system is to serve as a record of which state of B she prepared. Hence the situation above is modeled by the joint state ω_AB = Σ_{x∈E} p_x δ_x ⊗ ρ_x, where δ_x is the point distribution on the outcome x. Accordingly, the content of the Holevo bound is simply that the mutual information between the measurement of Alice's classical system and any measurement on Bob's system is no greater than I(A : B): for every test F of B, I(E : F) ≤ I(A : B). While this is certainly natural, it does not always hold. Both strong subadditivity and the Holevo bound are instances of a more basic principle, the data processing inequality (DPI). The DPI asserts that, for any systems A and B and any physical process E : B → C, I(A : E(B)) ≤ I(A : B).
The strong subadditivity of entropy amounts to the DPI for the process that simply discards a system (the marginalization map BC → C). The Holevo bound is the DPI for the special case of measurements, which can be understood as processes taking a system into a classical system that records the outcome. It seems reasonable that discarding a system, or performing a measurement, should be allowed processes in a physical theory. But a notion of mutual information, according to which discarding a system, or performing a measurement, causes a gain of mutual information seems bizarre. So it is an attractive idea that a physical theory should allow at least some definition of entropy and mutual information such that the corresponding DPI is satisfied.
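For purely classical systems the DPI can be checked directly: post-processing B through any stochastic map (classical channel) never increases its mutual information with A. A minimal numerical sketch (our own; the names are hypothetical):

```python
import numpy as np

def shannon(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mi(joint):
    """I(A:B) = H(A) + H(B) - H(AB) for a joint distribution matrix."""
    return shannon(joint.sum(axis=1)) + shannon(joint.sum(axis=0)) - shannon(joint)

rng = np.random.default_rng(1)
for _ in range(500):
    joint = rng.random((4, 4))
    joint /= joint.sum()
    # random classical channel E : B -> C; chan[c, b] = p(c | b)
    chan = rng.random((4, 4))
    chan /= chan.sum(axis=0, keepdims=True)
    processed = joint @ chan.T      # p(a, c) = sum_b p(a, b) p(c | b)
    assert mi(processed) <= mi(joint) + 1e-9
print("data processing inequality holds on all samples")
```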

Information causality
In [27], Pawlowski et al define a principle they call 'IC' in terms of the following protocol. Alice and Bob share a joint non-signaling state, known to both parties. Alice receives a random bit string E of length N, makes measurements, and sends Bob a message F of no more than m bits. Bob receives a random variable G encoding a number k = 1, . . . , N, instructing him to guess the value of Alice's kth bit E_k. Bob thereupon makes a suitable measurement and, based upon its outcome and the message from Alice, produces his guess, b_k. IC is the condition that

Σ_{k=1}^N I(E_k : b_k) ≤ m.    (9)

The main result of [27] is that if a theory contains states that violate the CHSH inequality [11] by more than the Tsirel'son bound [32], then it violates IC. In particular, if Alice and Bob can share PR boxes, then using a protocol due to van Dam [33], they can violate IC maximally, meaning that Bob's guess is correct with certainty, and the left-hand side of equation (9) is N. Pawlowski et al also give a proof, using fairly standard manipulations of quantum mutual information, that quantum theory does satisfy IC.
Having seen how to define notions of entropy and mutual information for general systems, it is interesting to consider where Pawlowski et al's quantum proof breaks down for some non-quantum systems such as PR boxes. One issue is that the proof uses strong subadditivity. As the following subsection shows, in the case where a PR box is the shared state, the van Dam protocol itself provides an example of the failure of strong subadditivity of the measurement entropy. Section 5.2 provides a converse result. Any theory that is monoentropic, strongly subadditive and where the Holevo bound holds must satisfy IC.
First, a few words about how to describe this setting in our terminology. Let Alice and Bob share two systems A and B, where each of these, as usual, has an associated test space. The joint test space of AB is immaterial, as long as it includes the Foulis-Randall product (i.e. allows all the separable measurements). The bit strings E and F are regarded as classical systems in their own right and the joint test space for a classical and a general system is, as always, assumed to be the Foulis-Randall product. Systems A and B begin the protocol in some joint non-signaling state ω AB .

The van Dam protocol
Consider a special case of the protocol described above, in which Alice and Bob share a PR box. Alice is supplied with a two-bit string E = E_1 E_2, and transmits one bit F to Bob. Let the PR box be a state of two systems A and B, where A and B are squits corresponding to the test spaces {{a_1, a_1′}, {a_2, a_2′}} and {{b_1, b_1′}, {b_2, b_2′}}, respectively. The joint state of A and B assigns probability 1/2 to each of the two perfectly correlated outcome pairs of every joint test, except for the joint test {a_2, a_2′} with {b_2, b_2′}, where it assigns probability 1/2 to each of the two perfectly anticorrelated pairs. It can be verified that these outcome probabilities are indeed the PR box correlations, violating the CHSH inequality maximally. In van Dam's protocol, Alice determines the parity, E_1 ⊕ E_2 (where ⊕ denotes addition mod 2). If this is zero, she performs the {a_1, a_1′} measurement on her system; if it is 1, she performs the {a_2, a_2′} measurement. She then sends Bob a single bit with a value equal to the parity of her outcome and E_1 (where unprimed outcomes correspond to 0 and primed outcomes to 1). Bob can then determine the value of E_1 by measuring {b_1, b_1′}, or the value of E_2 by measuring {b_2, b_2′}.
Consider now an intermediate stage in this protocol, at which Alice has measured her system, and sent the bit F to Bob, who has not yet measured his system. Bob has access to systems B and F, but does not know the outcome of Alice's measurement. Hence consider the joint state of E F B, averaged over the outcomes of Alice's measurement. This is easily verified to be an equiprobable mixture, over the four values of E and the two values of F, of states in which B is in the pure squit state determined by E and F. But clearly, H(E, B) = 3; hence I(E : B) = H(E) + H(B) − H(EB) = 2 + 1 − 3 = 0, while I(E_k : F B) = 1 for each of k = 1, 2. The inequality I(E : F B) ≥ I(E_1 : F B) + I(E_2 : F B), which would follow from strong subadditivity together with the independence of E_1 and E_2 (as in the proof of theorem 4 below), therefore fails: the left-hand side equals 1, while the right-hand side equals 2. The van Dam protocol thus exhibits the failure of strong subadditivity of the measurement entropy.
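The protocol itself can be simulated by sampling the PR-box correlations (a sketch in our own notation; `pr_box` and `van_dam_round` are hypothetical helpers, with inputs and outcomes encoded as bits so that a ⊕ b = x · y):

```python
import random

def pr_box(x, y):
    """Sample PR-box outcomes for settings x, y in {0, 1}: a, b uniform
    individually, correlated so that a XOR b = x AND y."""
    a = random.randint(0, 1)
    b = a ^ (x & y)
    return a, b

def van_dam_round(e1, e2, k):
    """One round of van Dam's protocol: Bob guesses Alice's k-th bit (k in {0, 1})."""
    x = e1 ^ e2            # Alice's measurement choice: the parity of her bits
    y = k                  # Bob measures b_1 (k = 0) or b_2 (k = 1)
    a, b = pr_box(x, y)
    f = e1 ^ a             # Alice's one-bit message
    return f ^ b           # Bob's guess

random.seed(0)
ok = all(van_dam_round(e1, e2, k) == (e1, e2)[k]
         for e1 in (0, 1) for e2 in (0, 1) for k in (0, 1)
         for _ in range(100))
print("Bob always guesses correctly:", ok)
```

Algebraically, Bob's guess is e1 ^ a ^ b = e1 ^ ((e1 ^ e2) & k), which equals e1 when k = 0 and e2 when k = 1, so one PR box plus one classical bit lets Bob recover either of Alice's bits with certainty.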

Theories satisfying IC
As the previous subsection observes, the van Dam protocol involves a joint state on a classical-nonclassical composite system, which does not satisfy strong subadditivity of entropy. This is enough to prevent the proof of IC going through. This subsection proves a converse result.

Theorem 4. Suppose that a theory (a) is monoentropic, meaning that measurement entropy equals mixing entropy for all systems, (b) is strongly subadditive and (c) satisfies the Holevo bound.
Then the theory satisfies IC. It follows that any theory satisfying these conditions cannot violate Tsirel'son's bound.
Note that, as discussed in section 4.3, the second and third conditions both follow from a single assumption, the data processing inequality. Note also that in proving theorem 4, the condition that a theory be monoentropic is only used to establish the technical condition that H(A|B) ≥ 0 when A is classical. So the theorem would still be valid if the monoentropic assumption were replaced by a direct assumption that, for classical A, H(A|B) ≥ 0. We begin with the following lemma.

Lemma 4. Suppose that a theory is monoentropic and that A is a classical system. Then H(A|B) ≥ 0 for any system B.
Proof. (Lemma 4). Suppose that A is a classical system and that the joint state of AB is ω_AB. If the measurement and mixing entropies are equal, then lemma 2 immediately gives

H(A|B) = H(AB) − H(B) = H(A) + Σ_x p_x S(β_x) − S(ω_B),

where p_x = ω_A(x) and β_x is the state of B conditioned on x. Recall that the mixing entropy of a state is defined in terms of an infimum over convex decompositions into pure states. For a fixed ε, call a convex decomposition of a state ω into pure states ε-optimal if the Shannon entropy of the coefficients is at most S(ω) + ε. For any ε > 0, there is an ε-optimal decomposition. Let β_x = Σ_y q_y|x β_xy be an ε-optimal convex decomposition of β_x into pure states β_xy. It follows that ω_B = Σ_{x,y} p_x q_y|x β_xy is a (possibly far from optimal) convex decomposition of ω_B into pure states. Hence S(ω_B) is less than or equal to the Shannon entropy of the coefficients on the right-hand side. Therefore,

S(ω_B) ≤ H({p_x q_y|x}) = H({p_x}) + Σ_x p_x H({q_y|x}) ≤ H(A) + Σ_x p_x S(β_x) + ε.

Since ε is arbitrary, S(ω_B) ≤ H(A) + Σ_x p_x S(β_x), whence H(A|B) ≥ 0. Given lemma 4, the proof of theorem 4 is essentially a reconstruction of the quantum argument of appendix A of [27], adapted to the broader setting of non-signaling states on test spaces. In its form the proof is the same, but great care must be taken at each step to ensure that the relevant properties of entropies and mutual information still hold. Many of the steps still go through by virtue of generic properties of the measurement entropy. The explicit assumptions of theorem 4 are needed for the rest.
Proof. (Theorem 4). Assume that Alice and Bob share a joint system AB. Consider the N -bit string that Alice receives as a classical system E, and consider the m-bit message that Alice sends to Bob as a classical system F. Let E k denote Alice's kth bit. Consider the stage of the protocol where Alice has measured system A, and sent F to Bob, but Bob has not yet measured system B. Bob has control of systems F and B at this point, and does not know the outcome of Alice's measurement. Hence the strategy is to consider the joint state of systems E, F and B, averaged over Alice's outcomes.
The first goal is to show that the joint state at this point satisfies

I(E : F B) ≤ m.    (10)

Since F is a classical m-bit system, I(E : F B) ≤ H(F) + I(E : B) ≤ m + I(E : B). By the fact that the initial state of AB is non-signaling, E is independent of B. Therefore I(E : B) = 0. This gives equation (10). The next step is to establish

I(E : F B) ≥ Σ_{k=1}^N I(E_k : F B).    (13)

Because the distribution on E is uniform, the bits E_i are independent, so I(E_2 . . . E_N : E_1) = 0. Hence, I(E : F B) = I(E_1 : F B) + I(E_2 . . . E_N : F B E_1). By strong subadditivity, I(E_2 . . . E_N : F B E_1) ≥ I(E_2 . . . E_N : F B). Hence I(E : F B) ≥ I(E_1 : F B) + I(E_2 . . . E_N : F B). Applying this inequality recursively gives equation (13). Finally, consider the last stage of the protocol. If Bob is instructed to guess the kth bit, then, depending on the message F, he measures system B. This can be seen as a single joint measurement X_k on the system F B. The Holevo bound, combined with equations (10) and (13), gives Σ_{k=1}^N I(E_k : X_k) ≤ m. Finally, Bob outputs a guess b_k for the value of E_k, where the guess depends on k and on the outcome of the measurement X_k. The usual DPI applied to classical mutual information yields Σ_{k=1}^N I(E_k : b_k) ≤ m, which is IC.

Conclusions, discussion and further questions
We have defined preparation and measurement-based generalizations of quantum and classical entropy and mutual and conditional information, and studied some of their basic properties. We called theories in which they coincide monoentropic, and showed that if they in addition satisfy the DPI (or at least its corollaries strong subadditivity and the generalized Holevo bound), Pawlowski et al's IC principle holds. By their remarkable result that any correlations violating the Tsirel'son bound can be used to violate IC, it follows that monoentropic theories satisfying data processing must, like quantum theory, obey the Tsirel'son bound. Monoentropicity is a strong constraint on theories, as we have shown by establishing that it fails for all polytopes except simplices.
Our results indicate that it is interesting and profitable to develop notions of entropy, and allied notions of conditional entropy and mutual information, for abstract probabilistic models. This paper should be regarded as only a preliminary exploration of this possibility.
A natural direction for further research is to study data compression and channel capacities in the abstract setting of this paper. It is natural to seek a measure of entropy that governs the rate of high-fidelity data compression, as Shannon and von Neumann entropy do in classical and quantum theory. A first step toward exploring classical channel capacities in generalized probabilistic theories might be to identify sufficient conditions for the Holevo bound to hold. This is related to the issue of finding an operationally motivated definition of mutual information. Arguably, a properly motivated notion of mutual information should manifestly be monotonic. Of course, the monotonicity of quantum mutual information-equivalently, the strong subadditivity of quantum entropy-is not manifest from its usual functional form. Still, the outright failure of the measurement-entropy-based mutual information to satisfy monotonicity in some cases raises a question as to its significance. Although in such cases measurement-based mutual information cannot be used to establish IC through a proof parallel to Pawlowski et al's quantum proof, it could be that IC nevertheless holds in some such cases. One should be cautious, though, about dismissing natural generalizations of classical quantities on the grounds that they fail to satisfy intuitively compelling properties. A case in point is the history of skepticism, based on the fact that it can be negative, about the operational significance of conditional information in quantum information theory. It was known for many years that conditional mutual information can be negative, but it was eventually shown to have an operational interpretation, involving the rate for quantum state merging protocols. It is also good to keep in mind that different operational motivations might turn out to be naturally associated with different entropic quantities, each with reasonable claim to be called mutual information.
At a more fundamental level, one would like to understand better the operational significance of various notions of entropy for abstract probabilistic models and theories. It is likely that the entropic quantities we have discussed here, measurement and mixing entropies, will turn out not to be the best notions of entropy to use in many situations. For example, in appendix C a variation (or perhaps better, a specialization) of the notion of measurement entropy that is more tightly coupled to the geometry of the state space is considered.
We have seen that, taken together, the conditions of monoentropicity, strong subadditivity and the Holevo bound imply IC. It is not out of the question that some subset of these conditions would suffice (especially since we need only very special cases of strong subadditivity). Alternatively, it would be of interest to find a single, reasonably simple physical postulate that would imply all three of these conditions. It seems plausible that such a postulate exists. On the one hand, strong subadditivity and the Holevo bound are both special cases of the DPI, which in turn can be derived (as we will detail in a future paper) from the assumption that arbitrary processes can be dilated to reversible ones. On the other hand, as we show in appendix B, monoentropicity can be derived from conditions of a similar flavor, involving the dilatability of mixed states to pure states with a 'marginal steering' property. Another avenue to explore is the consequence of monoentropicity that is needed for the IC proof: positivity of conditional information when a classical system is conditioned upon a general one. Although its operational interpretation is not evident at first blush, it warrants further study.
We hope to discuss all of these matters in detail in a future paper.
Acknowledgments

This work was also supported by the EU's FP6-FET Integrated Projects SCALA (CT-015714) and QAP (CT-015848), and the UK EPSRC project QIP-IRC. JB is supported by an EPSRC Career Acceleration Fellowship. At IQC, Matthew Leifer was supported in part by MITACS and ORDCF. At Perimeter Institute, Matthew Leifer was supported in part by grant RFP1-06-006 from The Foundational Questions Institute (fqxi.org).

Note added. In the course of this work, we learned of independent work on the same general topic [40] that will be published simultaneously in New Journal of Physics.

Appendix A. Non-concavity of mixing entropy
In this appendix we prove theorem 2, which states that the mixing entropy is not concave for non-simplicial polytopes. As a preliminary to the proof, we state some basic definitions and facts that we will use. A face of a convex set C is a set F ⊆ C such that every x ∈ C that can appear in a convex decomposition of an element of F is itself in F. A maximal face of C is one that is not a proper subset of any face of C other than C itself. An exposed face of C is a subset of C that is the intersection of C with a hyperplane supporting it (such a subset is easily shown to be a face). All faces of a polytope are exposed, and the maximal ones have affine codimension 1, i.e. their spans are affine hyperplanes. We denote the affine space generated by a set S by aff(S), the linear span of S by lin(S) and the cone generated by S (i.e. the set of non-negative linear combinations of elements of S) by cone(S). Note that when a subset S of a real vector space contains 0, aff(S) = lin(S). The relative interior of a convex compact set C is the interior of C when it is considered as a subset of aff(C). Finally, we will use the term Z-ball, where Z is an affine subspace of the ambient vector space, to mean a subset of Z that is a ball in Z.
The proof relies on the following lemma, proven below.

Lemma. If S is a non-simplicial polytope all of whose maximal faces are simplices, then the mixing entropy fails to be concave on S.

We begin by proving the theorem.
Proof. (Theorem). First note that any counterexample to concavity of the mixing entropy in a polytope S will also be a counterexample in a polytope S′ that has S as a face. This follows from the fact that if S is a face of S′, only states in S can appear in convex decompositions of states in S. The proof of the theorem is by induction. Suppose as our induction hypothesis that the mixing entropy fails to be concave for non-simplicial polytopes in dimension d. For every polytope in dimension d + 1, either (i) every maximal face is simplicial or (ii) there is a maximal face that is non-simplicial. If case (ii) applies, then there is a face that constitutes a non-simplicial polytope of dimension d and, by our induction hypothesis, the mixing entropy fails to be concave for this face, and hence, by the observation above, for the polytope itself. If case (i) applies, then the polytope satisfies the conditions of the lemma and the mixing entropy fails to be concave by virtue of the lemma.
To complete the inductive argument, we need to show that the mixing entropy fails to be concave for non-simplicial polytopes in dimension d = 2, the lowest dimension in which there exist non-simplicial polytopes. This follows from the fact that all of the maximal faces of a two-dimensional non-simplicial polytope are line segments, which are simplices, so that the conditions of the lemma apply.
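The d = 2 base case can be made fully concrete. Take the unit square (the state space of a 'gbit'; this worked example and the helper names below are ours, not the paper's): the barycenters of two adjacent edges each have mixing entropy log 2 = 1 bit, but their even mixture lies on the segment between two opposite vertices and so admits a two-point pure decomposition of entropy below 1 bit, violating concavity. A brute-force sketch:

```python
from itertools import combinations
from math import log2
import numpy as np

SQUARE = [np.array(v, dtype=float) for v in [(0, 0), (1, 0), (1, 1), (0, 1)]]

def mixing_entropy(state, vertices=SQUARE, tol=1e-9):
    """Minimum Shannon entropy over convex decompositions into vertices.
    The entropy is concave on the polytope of decompositions, so the minimum
    is attained on an affinely independent support (<= 3 points in 2D)."""
    best = float("inf")
    for r in (1, 2, 3):
        for sub in combinations(vertices, r):
            # solve for weights w >= 0 with sum(w) = 1 reproducing the state
            A = np.vstack([np.array(sub).T, np.ones(r)])
            b = np.append(state, 1.0)
            w, *_ = np.linalg.lstsq(A, b, rcond=None)
            if np.allclose(A @ w, b, atol=1e-7) and np.all(w > -tol):
                w = np.clip(w, 0.0, 1.0)
                h = -sum(p * log2(p) for p in w if p > tol)
                best = min(best, h)
    return best

rho1 = np.array([0.5, 0.0])     # barycenter of the bottom edge
rho2 = np.array([1.0, 0.5])     # barycenter of the right edge
rho = 0.5 * rho1 + 0.5 * rho2   # their even mixture, (0.75, 0.25)

print(mixing_entropy(rho1), mixing_entropy(rho2))  # 1.0 each
print(mixing_entropy(rho))  # ~0.811 < 1, so concavity fails
```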
We now prove the lemma.  Proof. (Lemma). Suppose S is a d-dimensional polytope that satisfies the conditions of the lemma, that is, it is non-simplicial, but all of its maximal faces are simplicial. In this case, one can always find two maximal faces ((d − 1)-dimensional simplices), F_1 and F_2, whose intersection, F_1 ∩ F_2, is a (d − 2)-dimensional simplex. We define V_1 to be the vertex of F_1 that is not contained in F_1 ∩ F_2; V_2 is defined similarly. Let ρ_1 be the barycenter of F_1, ρ_2 the barycenter of F_2 and ρ_3 the barycenter of F_1 ∩ F_2. Let V be a vertex of S that is not contained in F_1 or in F_2. Such a vertex always exists because if it did not, then the total number of vertices in S would be d + 1 and S would be a simplex, contrary to hypothesis.
Define the (d − 1)-dimensional polytope H to be the convex hull of F 1 ∩ F 2 and V . Note that H is a simplex.
Define T to be the triangle with vertices ρ 1 , ρ 2 and ρ 3 . Define L to be the intersection of T and the (d − 1)-dimensional polytope H . L is a line segment; we defer establishing this to the end of the proof, because it is somewhat technical.
Finally, we define the state ρ for which the mixing entropy will fail to be concave. It is defined as the second vertex of L; that is, L is the line segment extending from ρ_3 to ρ. By definition, ρ ∈ T, so that it is a convex combination ρ = p_1 ρ_1 + p_2 ρ_2 + p_3 ρ_3, where the p_i form a probability distribution. Because ρ_1 (ρ_2) is the barycenter of F_1 (F_2), which has d vertices, its mixing entropy is S(ρ_1) = S(ρ_2) = log d. Recalling that L is one dimensional, we know that ρ ≠ ρ_3 or, equivalently, p_1 + p_2 > 0. (A.7) It follows from equation (A.7) and the fact that ρ_1, ρ_2 are barycenters of (d − 1)-dimensional simplices that

Σ_i p_i S(ρ_i) = (p_1 + p_2) log d + p_3 S(ρ_3).    (A.8)

Next, we show that ρ cannot be the barycenter of H. We begin by demonstrating that ρ is the barycenter of H′ ≡ conv(F_1 ∩ F_2, V′), where V′ ≡ p_1 V_1 + p_2 V_2 and where conv(S, S′) denotes the convex hull of S ∪ S′ (note that, since V′ is itself a point of the state space, p_1 + p_2 = 1, so that equation (A.8) gives Σ_i p_i S(ρ_i) = log d). Letting bary(F) denote the barycenter of F, the proof is a direct comparison of the affine expansions of ρ and bary(H′) over the vertices of F_1 ∩ F_2 together with V′. Since bary(H) = bary(H′) would require V = V′, it suffices to note that V ≠ V′: V, V_1 and V_2 are all vertices of the non-simplicial polytope S, and consequently V cannot be in conv(F_1 ∩ F_2, V_1, V_2), unlike V′, which is. Given that ρ ∈ H but is not at its barycenter, and given that H has d vertices, we have S(ρ) < log d. (A.14) From equations (A.8) and (A.14), we infer the failure of concavity of the mixing entropy. We finish the proof by establishing the claim that L, defined above, is a line segment. First note that because it is an intersection of convex compact sets, it is compact and convex. Because dim(aff(T)) = 2 and dim(aff(H)) = d − 1, aff(T) ∩ aff(H) is one or two dimensional. For it to be two dimensional would require T ⊂ aff(H), implying ρ_1, ρ_2 ∈ aff(H), and hence, since F_1 ∩ F_2 ⊂ H, that V_1, V_2 lie in the hyperplane aff(H). This contradicts the assumption that F_1, F_2 are distinct maximal faces.
Since L ⊂ aff(T ) ∩ aff(H ), L is at most one dimensional. To show it is at least one dimensional, we begin by observing that because they are subsets of S, both T and H lie in the 'tangent wedge' W to S at ρ 3 , i.e. the intersection of the half-spaces aff(F 1 ) + and aff(F 2 ) + . Here aff(F i ) + is defined to be the closed half-space to the polytope S's side of aff(F i ). In fact, V lies in the interior of W because if it lay in aff(F 1 ) or aff(F 2 ), our assumption that all maximal faces were simplices would be violated. Viewing ρ 3 as the origin of a real linear space, and noting that lin T and lin (F 1 ∩ F 2 ) are complementary subspaces (they span the space and intersect only at 0, i.e. ρ 3 ), we can decompose V in a unique way into a component in lin (T ) and a component in the edge, lin (F 1 ∩ F 2 ), of the tangent wedge.
Let q be the linear projection with kernel lin(F_1 ∩ F_2) and image lin(T). For any set X such that X = X + lin(F_1 ∩ F_2), q(X) = lin(T) ∩ X. Both W and aff(H) satisfy this condition. As already noted, V ∈ int W; this is equivalent to q(V) being in the relative interior of cone(T). Because V ∈ H, q(V) is also in aff(H). Therefore, cone(T) ∩ aff(H) is an interior ray r of cone(T). Now, because ρ_3 is the barycenter of F_1 ∩ F_2, it lies in the relative interior of F_1 ∩ F_2; choose a small aff(H)-ball B around ρ_3. The hyperplane aff(F_1 ∩ F_2) divides B into half-balls, and the H-side half-ball B\B_0 (B_0 being the half-ball on the other side) is contained entirely in the relative interior of H. Because, as established near the beginning of our argument, aff(T) ∩ aff(H) is a line in aff(H), and we now know that while it contains ρ_3 it is not entirely contained in F_1 ∩ F_2, it must intersect B\B_0. Its intersection with B\B_0 is contained in H. By choosing B small enough, we can ensure that this intersection is also contained in T. This is obvious from two-dimensional geometry. To be slightly more explicit, the facts that r (i.e. the half of aff(H) ∩ aff(T) on the H-side of aff(F_1 ∩ F_2)) is interior to cone(T), that cone(T) is generated by T, and that T is closed under multiplication by scalars in [0, 1], ensure this. Because aff(T) ∩ aff(H) ∩ B is contained in both H and T, it is contained in L; because it is one dimensional, so is L and so, because L is a compact convex set, L is a line segment.
In generalized theories we can define (cf [7], where analogous quantities for convex-sets-based theories were defined, and their failure to be concave in general was also observed) measurement-entropy-like quantities H_T based on any function T that (like entropy) is Schur concave and defined on finite lists of classical probabilities. For a state ρ, H_T(ρ) is defined as the infimum over tests of the value of T on the probabilities for the results of the test. We define U_d for positive integers d as the uniform distribution on d alternatives. The same proof as before (with T(U_d) in place of log d) gives us the following proposition.

Proposition 1.
For any T whose value on U_{d+1} is strictly greater than its value on U_d for all d (for example, any strictly Schur-concave T), the only polytopes on which H_T is concave are simplices.

Appendix B. Entropy and quantum axiomatics
That mixing and measurement entropies coincide, as they do in classical and quantum theory, has powerful consequences for the structure of a probabilistic model and, perhaps even more profoundly, for the structure of a probabilistic theory. As already noted, it implies that mixing entropy is concave, which places sharp restrictions on the geometry of state spaces. It also figures importantly in our derivation, in section 5, of IC. In this appendix, we explore some further consequences of monoentropicity, and also suggest some other postulates, the physical content of which may be clearer, that enforce this property.
It will be helpful to impose some mild restrictions on the models we consider. (These are satisfied by all of the examples discussed earlier.) First, we want to have enough analytic structure to guarantee that measurement entropies will be well behaved. Accordingly, in this appendix we shall require of all models A = (A, Ω) that Ω be a compact, finite-dimensional convex set.

Corollary 3. Suppose A is monoentropic, and that the set of pure states in A is closed. Then A is sharp.
Proof. If α is a pure state, then H(α) = S(α) = 0. By lemma 6, there exists a measurement outcome x with α(x) = 1. On the other hand, if α(x) = 1, then S(α) = H(α) = 0, whence, again by lemma 6, α is the limit of a sequence of pure states, say α_n → α. By assumption, the set of pure states is closed, so α is pure. Because the set of states assigning unit probability to x is convex, it follows that α is the unique such state.
While the condition that the set of pure states be closed is not totally innocent (consider e.g. example 2 above), neither is it unreasonable. For example, it will be satisfied if there exists a compact group of symmetries of the state space that acts transitively on the pure states.
The condition that measurement and mixing entropies coincide also places some constraints on how systems compose.

Proposition. Suppose that A, B and the composite AB are monoentropic, with closed sets of pure states, and that every outcome of AB is a product outcome. Then no pure state of AB is entangled.

Proof. By the previous lemma, A, B and AB are sharp. If x ∈ A and y ∈ B are outcomes of A and B, respectively, and δ_x, δ_y and δ_xy are the unique pure states making x, y and xy certain, then δ_xy = δ_x ⊗ δ_y. Now if ρ is a pure entangled state in AB, then S(ρ) = 0. If H = S, then H(ρ) = 0, whence ρ = δ_z for some outcome z of AB. If z is a product outcome, say z = xy, then ρ = δ_x ⊗ δ_y, a contradiction.
We now consider whether the condition H = S can be derived from more physically transparent considerations.

Lemma 8. If a theory has the pure conditioning (PC) property, then for every pure bipartite state ω, S(ω_B) ≤ H(ω_A) and S(ω_A) ≤ H(ω_B).

Proof. Let ω be a pure bipartite state. Pick a test E minimizing the measurement entropy of ω_A, so that H(ω_A) is the Shannon entropy H_E(ω_A) of the outcome probabilities {ω_A(x)}_{x∈E}. By PC (and the assumption that ω is pure), the conditional states ω_B|x are pure, and ω_B = Σ_{x∈E} ω_A(x) ω_B|x. By definition, S(ω_B) is the minimum Shannon entropy of the mixing coefficients in any pure-state ensemble for ω_B, so S(ω_B) ≤ H_E(ω_A) = H(ω_A). By the same argument, S(ω_A) ≤ H(ω_B).

Definition 9.
A theory has the steering property iff, for every pair of systems A and B, every pure bipartite state ω of AB steers its marginals, in the sense that for any convex decomposition ω B = i p i β i , with β i pure and distinct from each other, there is a test E = {a i } of A with β i = ω B|a i , and similarly for ω A .
The term 'steering' is due to Schrödinger [30], who showed that quantum theory is steering; further proofs and extensions are in [19,21]; a survey is in [22]. Definition 9 is closely related to the notion of steering introduced in [6].

Lemma 9.
If a theory has the steering property, then for every pure bipartite state ω, H(ω_A) ≤ S(ω_B).
Proof. For any ε > 0, choose a convex decomposition ω_B = Σ_i p_i β_i of ω_B into pure states β_i, with S(ω_B) > H({p_i}) − ε. Because the state ω is steering, there exists a test E = {x_i} with ω_B|x_i = β_i, whence p_i = ω_A(x_i). It follows that S(ω_B) > −Σ_i p_i log p_i − ε = H_E(ω_A) − ε ≥ H(ω_A) − ε. Since ε is arbitrary, S(ω_B) ≥ H(ω_A).

Definition 10.
A state α of a system A is purifiable iff there exists a pure bipartite state ω (a purification of α) on a composite AB, with B a copy of A, such that ω_A = ω_B = α. An abstract probabilistic theory has the purifiability property iff every state in the theory is purifiable.
Quantum mechanics has the purifiability property. D'Ariano et al [12] have considered a condition very similar to purifiability as a potential axiom for quantum theory, and have shown that many other features of quantum theory follow from it. From the lemmas above, we have the following proposition.

Proposition 2.
A theory that has the pure conditioning, steering and purifiability properties is monoentropic.

Appendix C. Linearized test space models, ordered linear space models and entropy
The apparatus of states on test spaces can be linearized, as follows. If A = (A, Ω), with total outcome space X = ∪A, let V(A) denote the span of Ω in R^X, regarded as an ordered real vector space with positive cone V_+(A) = {α ∈ V(A) | α(x) ≥ 0 for all x ∈ X}. Every outcome x ∈ X defines a positive linear evaluation functional f_x ∈ V*(A) by f_x(µ) = µ(x) for all µ ∈ V(A). Moreover, one has Σ_{x∈E} f_x = u for every test E, where u is the unique functional taking the constant value 1 on Ω. Abstracting, one defines an effect to be a positive linear functional f ∈ V*(A) with 0 ≤ f(α) ≤ 1 for all α ∈ Ω (equivalently, 0 ≤ f ≤ u); an observable on A is a sequence f_1, . . . , f_n of effects with Σ_i f_i = u.
From this point of view, the structure of the test space is essentially a privileged set of observables: an additional structure that (like a preferred basis for a vector space) may or may not carry some useful information, or may simply be a computational convenience. For example, if A(H) = (F(H), Ω(H)) is a quantum system, V(A) is the space of quadratic forms associated with, but one might as well say the space of, Hermitian operators on H, and V* is essentially the same space, under the duality a(ρ) = Tr(ρa). In particular, an effect is a positive operator between 0 and 1, and an observable is essentially a discrete POVM. The convex sets, or ordered linear spaces, formalism takes this kind of combination of a convex state space and a set of effects in the dual cone to the state space as primary. Roughly, a convex model is defined by taking a convex compact set Ω of states as a base for a cone V(Ω)_+ of unnormalized states, together with a cone of 'unnormalized allowed effects': a closed subcone V^+, containing u in its interior, of the dual cone V*(Ω)_+ of all effects. Here u is defined by the condition u(α) = 1 for all α ∈ Ω, and the interval [0, u], according to the ordering defined by V^+, is the set of effects allowed in the theory. When V^+ = V*(Ω)_+, the model is called maximal (or sometimes saturated [10]). If the model is constructed from a test space, one will usually want to choose V^+ to contain the effects associated with all outcomes of the test space.
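As a toy illustration of this linearization (our own coordinates for the squit, not the paper's notation), states can be represented as vectors whose first coordinate carries normalization, with effects acting as linear functionals and each test summing to the unit effect u:

```python
import numpy as np

# Squit states as vectors (1, p1, p2): p_i is the probability of the
# unprimed outcome of the i-th test; the first coordinate is normalization.
def state(p1, p2):
    return np.array([1.0, p1, p2])

u = np.array([1.0, 0.0, 0.0])        # unit effect: u(omega) = 1 for all states
f_a1 = np.array([0.0, 1.0, 0.0])     # effect for outcome a1
f_a1_prime = u - f_a1                # effect for a1': f_a1 + f_a1' = u

omega = state(0.3, 0.9)
assert np.isclose(u @ omega, 1.0)           # normalization
assert 0.0 <= f_a1 @ omega <= 1.0           # effects take values in [0, 1]
assert np.allclose(f_a1 + f_a1_prime, u)    # the test {a1, a1'} is an observable
print(f_a1 @ omega, f_a1_prime @ omega)
```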
Two natural distinguished classes of effects are the ray-extremal ones, i.e. effects that lie on extremal rays of the cone generated by effects, and atomic effects, i.e. maximal effects in extremal rays (equivalently, ray-extremal effects that are also extremal in the convex set [0, u] of effects). We may define the measurement entropy as the infimum of entropies obtainable by measuring observables consisting of ray-extremal elements, or alternatively as the infimum of entropies obtainable by measuring observables consisting of atomic effects. Intuitively, the observables consisting of ray-extremal effects are maximally fine grained. Ray-extremal effects cannot be further refined by decomposing them as sums of other effects. Although they can be decomposed as sums of shrunken versions of themselves, intuitively this cannot provide any additional information about the system being measured. Certainly in the case of atomic effects, and probably with some care and relabeling in the case of ray-extremal effects (which unlike atomic effects may appear more than once in a given observable), the measurements with such outcomes can be organized into distinguished test spaces associated with a given convex sets model, so the test space framework we use in the main text will probably cover this natural possibility, although the additional assumptions we make to obtain particular results will need to be checked for these cases. In the case of ray-extremal effects, the infimum in the definition of measurement entropy will likely not be changed if we omit measurements in which an effect appears more than once; the measurements without repetitions should be easier to organize into a test space. Linearization and the application of one of these definitions may well remove pathologies in measurement entropy that are associated with some test space/state space models. 
The spirit of the definition of measurement entropy via an infimum suggests excluding tests that are not maximally fine grained when viewed from the convex-states perspective, as the above definitions do. The definition using ray-extremal effects potentially includes more fine-grained measurements than the one that just involves atomic effects, and should probably be preferred, as allowing the largest class of observables that can reasonably be considered to be maximally fine grained. Passing to one of these definitions may also remove pathologies that might arise when the set of distinguished observables associated with tests has an irregular relationship to a state space whose underlying geometry is quite regular.