Entropy in general physical theories

Information plays an important role in our understanding of the physical world. We hence propose an entropic measure of information for any physical theory that admits systems, states and measurements. In the quantum and classical world, our measure reduces to the von Neumann and Shannon entropy respectively. It can even be used in a quantum or classical setting where we are only allowed to perform a limited set of operations. In a world that admits superstrong correlations in the form of non-local boxes, our measure can be used to analyze protocols such as superstrong random access encodings and the violation of 'information causality'. However, we also show that in such a world no entropic measure can exhibit all properties we commonly accept in a quantum setting. For example, there exists no 'reasonable' measure of conditional entropy that is subadditive. Finally, we prove a coding theorem for some theories that is analogous to the quantum and classical setting, providing us with an appealing operational interpretation.


I. INTRODUCTION
Understanding information in classical and quantum physics has helped us shed light on the fundamental nature of these theories. Indeed, it has even been suggested that quantum theory could be more naturally formulated in terms of its information-theoretic properties [5,7,10,16]. Yet, we have barely scratched the surface of understanding the role of information in the natural world. To gain a deeper understanding of information in physical systems, and to help explain why nature is quantum, it is sometimes instructive to take a step back and view quantum mechanics in a much broader context of possible physical theories. Many examples are known that indicate that if our world were only slightly different, our ability to perform information processing tasks could change dramatically [2,6,15,26,33,35,37,39].
However, before we can hope to really investigate general theories from the perspective of information processing, we first need to find a way to quantify information. In a quantum and classical world, this can be done using the von Neumann and Shannon entropy respectively, which capture our notions of information and uncertainty in an intuitive way. These quantities have countless practical applications, and have played an important role in understanding the power of such theories with respect to information processing.
Here, we propose a measure of information that applies to any physical theory [44] which admits the minimal notions of finite physical systems, their states, and the probabilistic outcomes of measurements performed on them. Many such theories have been suggested, each of which shares some aspects with quantum theory, yet has important differences. For example, we might consider quantum mechanics itself with a limited set of allowed measurements, quantum mechanics in a real Hilbert space, generalized probabilistic theories [1,3], general C*-algebraic theories [10], box world [32] (a theory admitting all non-signalling correlations [27,42], previously called Generalized Non-Signalling Theory [3]), classical theories with an epistemic restriction [34], or theories derived by relaxing uncertainty relations [35].
* Electronic address: ajs256@cam.ac.uk
† Electronic address: wehner@caltech.edu
A. A measure of information

Entropy
We propose an entropic measure of information H that can be used in any such theory in Section IV A. We will show that our measure reduces to the von Neumann and Shannon entropy in the quantum and classical setting respectively. In addition, we show that it shares many of their appealing intuitive properties. For example, we show that the quantity is always positive and bounded for the finite systems we consider. This provides us with a notion that each system has some maximum amount of information that it can contain. Furthermore, we might expect that mixing increases entropy, i.e. that the entropy of a probabilistic mixture of states cannot be less than the average entropy of its components. This is indeed the case for our entropic quantity. Another property that is desirable for a useful measure of information is that it should take on a similar value for states which are 'close', in the sense that there exists no way to tell them apart very well. This is the case for the von Neumann and Shannon entropy, and also for our general entropic quantity, given one extra minor assumption. Finally, when considering two different systems A and B, one may consider how the entropy of the joint system AB relates to the entropy of the individual systems. It is intuitive that our uncertainty about the entire system AB should not exceed the sum of our uncertainties about A and B individually. This property is known as subadditivity and is obeyed by our measure of entropy given one additional reasonable assumption on the physical theory. Our entropic quantity thus behaves in very intuitive ways. Yet, we will see that there exist physical theories for which it is not strongly subadditive, unlike in quantum mechanics.
Of course, there are multiple ways to quantify information and we discuss our choice by examining some alternatives and possible extensions such as notions of accessible information, relative entropy as well as Rényi entropic quantities in Sections IV C and IV D.

Conditional entropy and mutual information
Clearly, it is also desirable to capture our uncertainty about some system A conditioned on the fact that we have access to another system B. This is captured by the conditional entropy, for which we provide two definitions in Section IV B, both of which are interesting and useful in their own right. Based on these definitions we also define notions of mutual information, which allow us to quantify the amount of information that two systems hold about each other. Our first definition of conditional entropy is analogous to the quantum setting, and indeed reduces to the conditional von Neumann entropy in a quantum world. This is an appealing feature, and opens the possibility of interesting operational interpretations of this quantity as in a quantum setting [20,21]. Yet, we will see that there exists a theory (called box world) in which not only is the subadditivity of the conditional entropy violated, but where conditioning can even increase entropy. Intuitively, we would not expect to grow more uncertain when given additional information, which we could always choose to ignore.
We will hence also introduce a second definition of conditional entropy, which does not reduce to the von Neumann entropy in the quantum world. However, it has the advantage that in any theory conditioning reduces our uncertainty, as we would intuitively expect when taking an operational viewpoint. Nevertheless, even our second definition of the conditional entropy violates subadditivity.

Possible properties of the conditional entropy
Naturally, one might ask whether the fact that both our definitions of the conditional entropy violate subadditivity is simply a shortcoming of our definitions. In Section VI we therefore examine what properties any 'reasonable' measure of conditional entropy can have in principle. By reasonable here we mean that if given access to a system B we have no uncertainty about some classical information A, then the quantity is '0', and otherwise it is positive (or even non-zero). We show that under this simple assumption there exists no measure of conditional entropy in box world that is subadditive or obeys a chain rule.

B. Examples
To give some intuition about how our entropies can be used outside of quantum theory, we examine a very simple example in box world in Section V, which illustrates all the peculiar properties our entropies can have. This is based on a task in which Alice must produce an encoding of a string x, such that Bob can retrieve any bit of his choosing with some probability [38] (known as a random access encoding). It is known that superstrong random access codes exist in box world [35], leading to a violation of the quantum bound for such encodings [23].
A similar game was used in [26] to argue that one of the defining characteristics that sets the quantum world apart from other possibilities (and particularly box world) is that communication of m classical bits causes information gain of at most m bits, a principle called 'information causality'. In Section VII, we examine this statement using our entropic quantity. We notice that it is the failure of subadditivity of conditional entropy in box world that leads to a violation of the inequality quantifying 'information causality' given in [26]. We conclude our examples by discussing the definition of 'information causality' more generally.

C. A coding theorem
In the classical, as well as the quantum setting, the Shannon and von Neumann entropies have appealing operational interpretations as they capture our ability to compress information. In Section VIII, we show that the quantity H(·) has a similar interpretation for some physical theories. When defining entropy we have chosen to restrict ourselves to a minimal set of assumptions, only assuming that a theory would have some notion of states and measurements. To consider compressing a state or indeed decoding it again, however, we need to know a little more about our theory. In particular, we first have to define a notion of 'size' for any compression procedure to make sense. Second, we need to consider what kind of encoding and decoding operations we are allowed to perform. Given these ideas, and several additional assumptions on our physical theory, we prove a simple coding theorem.

D. Outline
In Section II, we introduce a framework for describing states, measurements and transformations in general physical theories, followed in Section III by some examples. In Section IV we then define our entropic measures of information that can be applied in any theory. Examples of how these entropies can be applied in box world can be found in Section V. In Section VI we examine what properties we can hope to expect from a conditional entropy in box world. Section VII investigates the notion of 'information causality' in our framework and finally we show a coding theorem for many theories in Section VIII. We conclude with many open questions in Section IX.

II. AN OPERATIONAL FRAMEWORK FOR PHYSICAL THEORIES
We now present a simple framework, based on minimal operational notions (such as systems, states, measurements and probabilities), that encompasses both classical and quantum physics, as well as more novel possibilities (such as 'box world') [1,3,11,16]. Our approach is similar to that of [1]; however, it is slightly more general, as it does not assume that all measurements that are mathematically well-defined are physically implementable, or that joint systems can be characterised by local measurements.
A. Single systems and states
Firstly, we will assume that there is a notion of discrete physical systems. With each system A we associate a set of allowed states S_A, which may differ for each system. We furthermore assume that we can prepare arbitrary mixtures of states (for example by tossing a biased coin, and preparing a state dependent on the outcome), and therefore take S_A to be a convex set, with S_mix = p S_1 + (1 − p) S_2 denoting the state that is the mixture of S_1 with probability p and S_2 with probability 1 − p. To characterize when two states are the same, or close to each other, we first need to introduce the notion of measurements.

B. Measurements
Secondly, we thus assume that on each system A, we can perform a certain set of allowed measurements E_A = {e}. If the system A is clear from context, we will omit the subscripts and simply write E and S.
With each measurement e we associate a set of outcomes R_e, which for simplicity of exposition we take to be finite. When a particular measurement is performed on a system, the probability of each outcome should be determined by its state. We therefore associate each possible outcome r ∈ R_e with a functional e_r : S → [0, 1], such that e_r(S) is the probability of obtaining outcome r given state S. We refer to such a functional as an effect. To ensure that measurement behaves according to our intuition when applied to mixed states, we require that e_r(S_mix) = p e_r(S_1) + (1 − p) e_r(S_2). This means that each effect can be taken to be linear [45]. In order for the probabilities of all measurement outcomes to sum to one, we also require that ∑_{r ∈ R_e} e_r = u, where u is the unit effect, which has the property that u(S) = 1 for all S ∈ S. We can thus characterize a measurement e as a set of outcome/effect pairs [46]: e = {(r, e_r) | r ∈ R_e and ∑_r e_r = u}.
We write e(S) for the probability distribution over outcomes when e is performed on a state S. Note that in this general framework, not all measurements that are mathematically well-defined need be part of a particular physical theory. One measurement can be equivalent to, or strictly more informative than, another. Consider two measurements e (with outcomes R_e and effects e_r) and f (with outcomes R_f and effects f_r), for which there exists a map M : R_e → R_f such that f_{r'} = ∑_{r : M(r) = r'} e_r for all r' ∈ R_f. If M is one-to-one it corresponds to a re-labelling of the outcomes. Otherwise, we say that f is a coarse-graining of e (or alternatively that e is a refinement of f). Because we can always re-label the outcomes of an experiment according to any map M, we assume that E is closed under re-labelling and coarse-graining. This implies that E always contains the trivial measurement u (with one outcome corresponding to effect u).
A refinement/coarse-graining is trivial if every effect of e is proportional to the corresponding effect of f, i.e. e_r ∝ f_{M(r)} for all r ∈ R_e. In this case, the measurement of e is equivalent to performing f and obtaining r', then outputting a randomly selected r satisfying M(r) = r' (where the distribution depends on the proportionality constants). Hence the two measurements are equally informative about the state. In contrast, when e is a non-trivial refinement of f it offers strictly more information about the state, and in this case we write e ≻ f. A subset of measurements of particular importance are the fine-grained measurements E* ⊆ E, which have no non-trivial refinements, and are therefore optimal for gathering information about the state. Formally, E* := {e ∈ E | there is no e' ∈ E with e' ≻ e}. We will also call an effect e fine-grained if it is part of a fine-grained measurement. We assume that E* is nonempty (i.e. that there exists at least one finite-outcome fine-grained measurement). In quantum and classical theory this restricts us to the finite-dimensional case.
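To make the notion of coarse-graining concrete, here is a minimal Python sketch (our own illustration, not part of the formalism; representing classical effects as vectors acting on distributions is an assumption made for simplicity). A measurement is a set of effects, and a coarse-graining under a map M simply sums the effects that M identifies.

```python
import numpy as np

def coarse_grain(effects, M):
    """Coarse-grain a measurement: f_{r'} = sum of e_r over all r with M(r) = r'.

    effects: dict outcome -> effect vector (classical effects: e_r(S) = e_r . p)
    M:       dict mapping each outcome of e to an outcome of the new measurement f
    """
    f = {}
    for r, e_r in effects.items():
        f[M[r]] = f.get(M[r], 0) + np.asarray(e_r, dtype=float)
    return f

# A fine-grained 3-outcome classical measurement (delta-function effects):
e = {0: np.array([1., 0., 0.]),
     1: np.array([0., 1., 0.]),
     2: np.array([0., 0., 1.])}

# Merging outcomes 1 and 2 gives a coarse-graining f of e:
f = coarse_grain(e, {0: 'a', 1: 'b', 2: 'b'})

p = np.array([0.5, 0.3, 0.2])                        # a classical state
probs = {r: float(e_r @ p) for r, e_r in f.items()}  # outcome distribution f(S)
```

The effects of f still sum to the unit effect, so the outcome probabilities sum to one; here e is a non-trivial refinement of f, since it distinguishes the two elements that f merges.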

C. Transformations
As well as preparing states and performing measurements, it may be possible to perform transformations on a system. As in the case of effects, in order to behave reasonably when applied to mixed states, a transformation must correspond to a linear map T : S_A → S_B taking allowed states to allowed states (where the input and output systems may be of different types). For each type of system, there will be some set of allowed transformations T.
We assume that the identity transformation I is allowed, and that the composition of two allowed transformations is allowed (as long as the system output by the first transformation is of the same type as the input to the second). Furthermore, it must be the case that any allowed transformation followed by an allowed measurement is an allowed measurement.
We can also combine the notion of transformation with that of measurement in a natural way to represent non-destructive measurements [3,11]. To incorporate non-destructive measurements, define the sub-normalised states S̃ = {pS | 0 ≤ p ≤ 1, S ∈ S}. A measurement can then be described by assigning a subnormalised transformation t_r : S → S̃ to each outcome r. Result r occurs with probability p_r = u(t_r(S)) and the post-measurement state is S_r = t_r(S)/p_r. However, we will not need such constructions in the main part of this paper.

D. Relations between states
Having introduced measurements, we can now define what it means for two states to be equal. Given that we are taking an operational viewpoint, we adopt the intuitive notion that two states S_1, S_2 ∈ S are equal if and only if there exists no measurement that distinguishes them. That is, S_1 = S_2 if and only if e_r(S_1) = e_r(S_2) for all e ∈ E and r ∈ R_e. We can also define a natural measure of distance for states S_0, S_1 ∈ S that directly relates to the probability that we can distinguish these states using measurements available in our theory, in analogy to the quantum setting [22]. Suppose we are given either S_0 or S_1 with equal probability, and perform a measurement e to distinguish the two cases. Note that the above implies that any theory that admits at least two possible states has at least one measurement e with two possible outcomes. Furthermore, any such theory must have a measurement e with exactly two outcomes, since any theory admits arbitrary coarse-grainings of measurements. We will base our decision on the maximum likelihood rule; that is, when we obtain outcome r, we will conclude we received state S_0 if e_r(S_0) > e_r(S_1) and S_1 otherwise. The probability of distinguishing the two states using measurement e is then given by (1/2)(1 + C(e(S_0), e(S_1))), where C(e(S_0), e(S_1)) = (1/2) ∑_{r ∈ R_e} |e_r(S_0) − e_r(S_1)| is the classical statistical distance between the probability distributions e(S_0) and e(S_1). We now define the distance as D(S_0, S_1) := sup_e C(e(S_0), e(S_1)).
By the above, we see that this measure of distance has an appealing operational interpretation because it directly captures our ability to distinguish the two states S 0 and S 1 using any available measurement (see appendix A, Lemma A.1 for details). In the quantum setting, it thus directly reduces to the well-known trace distance.
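For classical states the supremum in D is attained by the fiducial measurement, so the distance and the distinguishing probability can be computed directly. The following short Python sketch (our own illustration) does so:

```python
import numpy as np

def stat_distance(p, q):
    """Classical statistical distance C(p, q) = (1/2) * sum_r |p_r - q_r|."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def success_prob(p, q):
    """Maximum-likelihood probability of distinguishing two equiprobable states
    whose outcome distributions under a given measurement are p and q."""
    return 0.5 * (1 + stat_distance(p, q))

# Two classical states; the fiducial measurement is optimal in classical theory,
# so C equals the distance D(S0, S1) itself:
S0 = [0.9, 0.1]
S1 = [0.4, 0.6]
D = stat_distance(S0, S1)
```

Here D = 0.5, so the best guessing strategy succeeds with probability 0.75, matching the operational reading of the distance above.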

E. Multi-partite systems
Suppose that we have two systems A and B, each of which may admit different sets of states and measurements. We allow that two individual systems can be combined into a composite system AB, which we can treat as a new type of system having its own set of allowed states, measurements, and transformations just as in the single-system case. However, these sets must bear some relation to those of the component subsystems.
With respect to states, we would like it to be possible to independently prepare any state S_A ∈ S_A of system A and S_B ∈ S_B of system B. This corresponds to a product state of the composite system, which we denote by S_AB = S_A ⊗ S_B ∈ S_AB. Note that at this point we have not proved that ⊗ corresponds to a tensor product in the usual sense [47], but we would nevertheless expect that it is distributive over mixtures and associative. We make use of the standard terminology that states are separable if they can be written as a mixture of product states, and entangled otherwise. To avoid excessive subscripts when dealing with multiple systems, we will usually refer to the states of systems AB and B directly by these letters, rather than the more cumbersome S_AB and S_B (e.g. e(S_AB) = e(AB) etc.).
Similarly, we would expect to be able to perform a measurement e ∈ E_A and f ∈ E_B, giving a product measurement which we denote by g = e ⊗ f ∈ E_AB (with outcome set R_g = R_e × R_f and effects g_ij = e_i ⊗ f_j). By considering coarse-graining and tri-partite systems, we would again expect ⊗ to be distributive and associative. When applying a product measurement to a product state we furthermore require that (e_i ⊗ f_j)(S_A ⊗ S_B) = e_i(S_A) f_j(S_B). When considering multiple systems, we can consider what happens if we only measure some of these systems. Note that this means that we perform a measurement consisting of a unit effect on some of these systems. This only makes sense if marginal states are well defined, and we hence assume that even when a bipartite state is entangled each part is an allowed marginal state. We can thus define the marginal state A ∈ S_A via e(A) = (e ⊗ u)(AB) for all e ∈ E_A. Furthermore, in the case in which B performs a measurement on his subsystem and obtains result r (corresponding to an effect e_r), we would expect A's subsystem to 'collapse' to an allowed state A|r ∈ S_A. We will denote such a state as A|r, satisfying e(A|r) = (e ⊗ e_r)(AB) / (u ⊗ e_r)(AB) for all e ∈ E_A. Finally, a crucial constraint on multi-partite systems is the existence of product transformations T_A ⊗ T_B ∈ T_AB. In a variant of quantum theory in which all positive (rather than completely positive) trace-preserving maps are allowed transformations, this would prevent the existence of entangled states.

III. EXAMPLE THEORIES
In this section we show how quantum theory and classical probability theory fit into the framework defined above, and also describe the theory known as 'box world' [3,32], which admits all non-signalling correlations [27,42], and was one of the main motivations for this work.

A. Classical probability theory
In classical probability theory, a state S corresponds to a probability distribution (p_i) over a finite set of elements. The effects correspond to linear functionals of the form e_r(S) = ∑_i q_r^i p_i for any q_r^i ∈ [0, 1]. Note that the unit effect corresponds to q^i = 1 ∀ i. Normalisation of measurements therefore requires ∑_r q_r^i = 1 ∀ i. Transformations correspond to stochastic maps.
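As a quick sanity check of these conditions, the following sketch (our own illustration) evaluates a two-outcome classical measurement, given by coefficient vectors q_r, on a distribution:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])    # a classical state: a distribution over 3 elements

# A 2-outcome measurement: rows are the coefficient vectors q_r, with q_r^i in [0, 1].
# Normalisation requires sum_r q_r^i = 1 for every i (effects sum to the unit effect u).
q = np.array([[1.0, 0.3, 0.0],   # effect of outcome 0
              [0.0, 0.7, 1.0]])  # effect of outcome 1

assert np.allclose(q.sum(axis=0), 1.0)  # the effects sum to the unit effect

outcome_probs = q @ p                    # e_r(S) = sum_i q_r^i p_i
```

Because the effects sum to the unit effect, the outcome probabilities sum to one for every state p.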

B. Quantum theory
In quantum theory, the convex set of states consists of the density operators S = ρ (trace-1 positive operators), and effects correspond to linear functionals of the form e_r(S) = tr(ρ E_r), where E_r is a positive operator. All measurements satisfying the normalisation constraint are allowed, and the fine-grained measurements are those for which all E_r are rank-1 operators. The allowed transformations are the completely positive trace-preserving maps [24].

C. Restricted quantum/classical theories
Note that unlike other approaches [1,3] our framework also encompasses real Hilbert space quantum mechanics. Furthermore, because we do not assume that all well-defined operations are physically realizable, it can be used to study quantum or classical theory with a restricted set of states, measurements and transformations (for an interesting example in the classical case consider Spekkens' toy model [34]). The entropies we would assign in such cases would differ from the standard von Neumann entropy, and may be interesting to study.

D. Box world
In box world, the state of a single system X corresponds to a conditional probability distribution S = P(x_out | x_in), where x_in and x_out are elements of a finite set of 'inputs' and 'outputs' respectively. The intuition is that there is a special set of measurements on each system represented by x_in (referred to as fiducial measurements), and that any probability distribution for these measurements corresponds to an allowed state. We represent a system X with k possible inputs x_in and m possible outputs x_out diagrammatically as a box with an input line and an output line. In the special case in which there is only one possible input, the conditional probability distribution reduces to the standard unconditional probability distribution P(x_out), and we omit the input line to the box in the diagram. Thus box world contains classical probability theory as a special case, and we will use such classical boxes to represent classical information in our treatment of information-theoretic protocols in box world.
A multi-partite state in box world corresponds to a joint conditional probability distribution P(x_out^1 … x_out^N | x_in^1 … x_in^N), with a separate input and output for each system. Aside from the usual constraints of normalisation and positivity, the allowed states must also satisfy the non-signalling conditions: the marginal probability distribution obtained by summing over x_out^k must be independent of x_in^k, for all k. This means that the other parties cannot learn anything about a distant party's measurement choice from their own measurement results. A bipartite state of particular interest is the PR-box state [27-29], for which all inputs and outputs are binary, and the probability distribution is P(x_out y_out | x_in y_in) = 1/2 if x_out ⊕ y_out = x_in · y_in, and 0 otherwise, where ⊕ denotes addition modulo 2. This state is 'more entangled' than any quantum state, yielding correlations that achieve the maximum possible value of 4 for the Clauser-Horne-Shimony-Holt (CHSH) expression [9], compared to ≤ 2√2 for quantum theory (Tsirelson's bound [36]), and ≤ 2 for classical probability theory. We represent entanglement between systems in box world by a zigzag line between them, and classical correlations (i.e. separable but non-product states) by a dotted line.
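The claimed CHSH value of the PR-box state is easy to verify numerically. In the minimal Python sketch below (our own illustration; `pr_box` and `chsh` are hypothetical helper names), the CHSH expression is the signed sum of the four correlators E(x, y):

```python
def pr_box(a, b, x, y):
    """PR-box distribution P(a, b | x, y): uniform over outputs with a XOR b = x AND y."""
    return 0.5 if (a ^ b) == (x & y) else 0.0

def chsh(P):
    """CHSH value: sum over settings of (-1)^(x*y) * E(x, y), with correlators
    E(x, y) = sum_{a,b} (-1)^(a XOR b) * P(a, b | x, y)."""
    total = 0.0
    for x in (0, 1):
        for y in (0, 1):
            E = sum((-1) ** (a ^ b) * P(a, b, x, y) for a in (0, 1) for b in (0, 1))
            total += (-1) ** (x * y) * E
    return total

value = chsh(pr_box)   # 4: above Tsirelson's bound 2*sqrt(2) and the classical bound 2
```

Each correlator of the PR box equals +1 when x·y = 0 and −1 when x = y = 1, so all four terms contribute +1 and the expression attains its algebraic maximum of 4.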
In box world, we allow all mathematically well-defined measurements and transformations to be physically implemented. Writing x_out = (x_out^1, x_out^2, …, x_out^N) and x_in = (x_in^1, x_in^2, …, x_in^N), every effect takes the form e_r(S) = ∑_{x_in, x_out} Q_r(x_out | x_in) P(x_out | x_in), where Q_r(x_out | x_in) can be taken to be positive [3]. The effect e^{x_in}_{x_out} corresponding to performing joint fiducial measurements x_in and obtaining results x_out is represented by Q(x_out' | x_in') = δ_{x_in, x_in'} δ_{x_out, x_out'}. Because of the positivity of Q_r, any effect can be expressed as a weighted sum of such fiducial measurement effects. It follows that a measurement is fine-grained if and only if each of its effects is proportional to some e^{x_in}_{x_out}, and that products of fine-grained measurements are themselves fine-grained.

IV. GENERALIZED ENTROPIES
The Shannon entropy H(p) = −∑_i p_i log p_i and von Neumann entropy S(ρ) = −tr(ρ log ρ) are extremely useful tools for analyzing information processing in a classical or quantum world. Here, we would like to define an analogous entropy for general probabilistic theories which reduces to H(p) and S(ρ) for classical probability theory and quantum theory respectively. We would also like our new entropy to retain as many of the mathematical properties of the Shannon and von Neumann entropy as possible. Not only will this help our new entropy conform to our intuitive notions, but it will make it easier to prove general results using these quantities, and to transfer known results to the general case. Note that although we can use any base for the logarithm in the definition of the Shannon and von Neumann entropies (as long as we are consistent), in what follows we will use base 2 (i.e. log = log_2) throughout.

A. Entropy
We now give a concrete definition of entropy for any physical theory, which satisfies the above desiderata. Other definitions are certainly possible, and we will consider one alternative (based on mixed state decomposition) in Section IV D. However, the following definition has many appealing properties.
Given any state S ∈ S, we define its entropy H(S) by H(S) := inf_{e ∈ E*} H(e(S)), where the infimum is taken over all fine-grained measurements e ∈ E* on the state space S, and H(e(S)) = −∑_{r ∈ R_e} e_r(S) log e_r(S) is the Shannon entropy of the probability distribution e(S) over possible outcomes of e. This has an intuitive operational meaning as the minimal output uncertainty of any fine-grained measurement on the system. Note that for information-gathering purposes, the best measurements are always fine-grained, and without restricting to this subset the unit measurement would always be optimal (giving zero outcome uncertainty). Furthermore, note that trivial refinements of e always generate a higher output entropy, so it is sufficient to consider only measurements in the infimum that have no parallel effects.
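In classical probability theory this definition reduces to the Shannon entropy of the distribution itself, and the remark about trivial refinements can be checked directly. The sketch below (our own illustration) compares the output entropy of the fiducial measurement with that of a trivial refinement which splits one delta effect into two parallel effects:

```python
import numpy as np

def shannon(p):
    """Shannon entropy (base 2) of a probability vector, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

S = np.array([0.75, 0.25])      # a classical state on 2 elements

# Fiducial (fine-grained) measurement: the output distribution is S itself.
H_fine = shannon(S)

# Trivial refinement: replace the first delta effect by two parallel effects,
# each half of it. The output distribution splits the 0.75 mass in two:
H_refined = shannon([0.375, 0.375, 0.25])
# H_refined exceeds H_fine by exactly 0.75 bits, so such refinements never
# help in the infimum defining H(S).
```

This illustrates why the infimum may be restricted to measurements without parallel effects: splitting an effect only injects extra randomness into the outcome.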
In appendix B, we prove that H retains several important properties of the Shannon and von Neumann entropy. In particular, we show:

1. (Reduction) H reduces to the Shannon entropy for classical probability theory, and the von Neumann entropy for quantum theory.

2. (Positivity and boundedness) Suppose that the minimal number of outcomes for a fine-grained measurement in E*_S is d. Then for all states S ∈ S, 0 ≤ H(S) ≤ log d.

3. (Concavity) For any S_1, S_2 ∈ S and any mixed state S_mix = p S_1 + (1 − p) S_2 ∈ S: H(S_mix) ≥ p H(S_1) + (1 − p) H(S_2).

4. (Limited subadditivity) Consider a theory with the additional property that fine-grained measurements remain fine-grained for composite systems, i.e. e ∈ E*_A and f ∈ E*_B implies e ⊗ f ∈ E*_AB. This is true in quantum theory, classical theory, and box world. When this holds, H(AB) ≤ H(A) + H(B).

5. (Limited continuity) Consider a system for which all allowed measurements have at most D outcomes, or for which restricting the allowed measurements to have at most D outcomes does not change the entropy of any state. This is true in quantum theory, with D = d = dim(H), and also in box world and classical theory. Then we can prove an analogue of the Fannes inequality [14,17], which says that the entropies of two states which are close do not differ by too much. In particular, given S_1, S_2 ∈ S satisfying D(S_1, S_2) < 1/e, the difference |H(S_1) − H(S_2)| is bounded by a function of D(S_1, S_2) and log D that vanishes as the states approach each other.

We will also see in section VIII that H has an appealing operational interpretation as a measure of compressibility for some theories. However, one property of the von Neumann entropy that does not carry over to H is strong subadditivity [24]. In particular, we will see in section V that there exists a tripartite state in box world such that H(ABC) + H(B) > H(AB) + H(BC).

B. Conditional entropy and mutual information

A standard definition
Based on the entropy H, we can also define a notion of conditional entropy. In analogy to the von Neumann entropy [8], we define the conditional entropy of a general bipartite state AB ∈ S_AB with reduced states A ∈ S_A and B ∈ S_B by H(A|B) := H(AB) − H(B).
This has the nice property that for quantum or classical systems it reduces to the conditional von Neumann and Shannon entropies respectively. In some theories (including quantum theory but not classical probability theory), H(A|B) can be negative, which is strange, but opens the way for an appealing operational interpretation as in the quantum setting [20]. However, unlike in quantum theory, we will see that H(·|·) has the counterintuitive property that it can decrease when 'forgetting' information in some probabilistic theories. In particular, the violation of strong subadditivity for H in box world implies that it is possible to obtain H(A|BC) > H(A|B), and that H(·|·) is not subadditive. These properties will motivate us to consider an alternative definition of the conditional entropy below. However, we will show that no 'reasonable' entropy in box world can have all the appealing properties of the conditional von Neumann entropy.
In analogy to the quantum case, we can also define the mutual information I(A; B) := H(A) + H(B) − H(AB), which for classical states reduces to the classical mutual information I.

An alternative definition
Given the problems observed with the previous definition in some theories, we now define a second form of conditional entropy based on H, which sometimes captures our intuitive notions about information in a nicer way. For any bipartite state AB ∈ S_AB with reduced states A ∈ S_A and B ∈ S_B we define H+(A|B) := inf_{f ∈ E_B} ∑_{j ∈ R_f} f_j(B) H(A|j), where the infimum is taken over all measurements f on B, and A|j is the reduced state of the first system conditioned on obtaining measurement outcome j when performing f on the second system. This definition has the appealing property that conditioning on more systems always reduces the entropy, that is, H(A) ≥ H+(A|B) ≥ H+(A|BC) (see appendix C, Lemma C.1), and it reduces to the conditional Shannon entropy in the classical case. Note, however, that H+(·|·) does not reduce to the conditional von Neumann entropy in the quantum setting, as it is always positive. Furthermore, we will see in section VI that it is not subadditive, and does not obey the usual chain rule (even though a limited form of the chain rule holds in box world, as we show in appendix Section C 2). Nevertheless, H+(·|·) seems quite a natural entropic quantity, and its corresponding quantum version has found an interesting application in the study of quantum correlations [12]. We can also define a corresponding information quantity via I+(A; B) := H(A) − H+(A|B), which is always positive. However, unlike I(A; B), this definition is not symmetric and hence it cannot really be considered a 'mutual information'. Instead, I+(A; B) captures the amount of information that B holds about A.
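For classical states the two definitions coincide, since measuring B with the fiducial measurement attains the infimum. The following Python sketch (our own illustration) computes both H(A|B) and H+(A|B) for a classical joint distribution:

```python
import numpy as np

def shannon(p):
    """Shannon entropy (base 2) of a probability array, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# A classical bipartite state: a joint distribution P(a, b) (rows: A, columns: B).
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])

# First definition: H(A|B) = H(AB) - H(B).
H_cond = shannon(P) - shannon(P.sum(axis=0))

# Second definition: measure B (classically the fiducial measurement is optimal)
# and average the entropies of the collapsed states A|j.
H_plus = sum(p_b * shannon(P[:, j] / p_b)
             for j, p_b in enumerate(P.sum(axis=0)))
# Classically H_cond == H_plus; in box world the two can differ.
```

Both evaluate to the conditional Shannon entropy of this distribution; the divergence between the two definitions only appears in theories such as box world.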

C. Other entropic quantities
For cryptographic purposes, such as in the setting of device-independent security for quantum key distribution, it is useful to define the following Rényi entropic variants of H. More precisely, we define H_α(S) := inf_{e∈E*} H_α(e(S)), where H_α(e(S)) = (1/(1−α)) log Σ_j (e(S)_j)^α is the Rényi entropy of order α. Note that H_1(S) = H(S) (taking the limit α → 1). These quantities can also be useful in order to bound the value of H(·) itself, as for any state S ∈ S and α < β we have H_α(S) ≥ H_β(S).
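The ordering of the Rényi family can be checked numerically for a fixed outcome distribution; a small sketch (function names are our own):

```python
import math

def renyi(probs, alpha):
    """Renyi entropy H_alpha in bits; alpha = 1 is the Shannon limit."""
    probs = [p for p in probs if p > 0]
    if alpha == 1:
        return -sum(p * math.log2(p) for p in probs)
    return math.log2(sum(p ** alpha for p in probs)) / (1 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
# Renyi entropies are non-increasing in alpha; taking the infimum over
# measurements preserves this ordering, so H_alpha(S) >= H_beta(S) for
# alpha < beta, which is how these quantities bound H(.) from below.
vals = [renyi(p, a) for a in (0.5, 1, 2, 3)]
assert all(x >= y - 1e-12 for x, y in zip(vals, vals[1:]))
assert abs(renyi(p, 1) - 1.75) < 1e-12   # Shannon entropy of p
```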
To define a notion of relative entropy, we adopt a purely operational viewpoint. Suppose we are given N copies of a state S_1 or a state S_2, which we write as S_1^{⊗N} and S_2^{⊗N}. Classically, as well as quantumly, the relative entropy captures our ability to distinguish S_1^{⊗N} from S_2^{⊗N} for large N. Note that to distinguish the two cases, it is sufficient to coarse-grain any measurement to a two-outcome measurement e = {(1, e_1), (2, e_2)}, where without loss of generality we associate the outcome '1' with the state S_1^{⊗N} and '2' with S_2^{⊗N}. Then e_1(S_2^{⊗N}) denotes the probability that we conclude that the state was S_1^{⊗N}, when really we were given S_2^{⊗N}. Similarly, e_2(S_1^{⊗N}) denotes the probability that we falsely conclude that the state was S_2^{⊗N}. In what is called asymmetric hypothesis testing, we wish to minimize the error e_1(S_2^{⊗N}) while simultaneously demanding that e_2(S_1^{⊗N}) is bounded from above by a parameter ε. Here we fix ε = 1/2. We therefore want to determine p_N := inf_e { e_1(S_2^{⊗N}) | e_2(S_1^{⊗N}) ≤ 1/2 }. In a quantum setting, it has been shown that the quantum relative entropy is directly related to this quantity via the quantum Stein's lemma [4,18,25], which states that we have D(S_1||S_2) = lim_{N→∞} −(1/N) log p_N. (32) This is a deep result giving a clear operational interpretation to the relative entropy, telling us that in the large N limit the probability of making the error p_N decreases exponentially with D(S_1||S_2). Furthermore, as it is expressed in operational terms, we can simply adopt (32) as our definition of relative entropy in any theory for which the limit is well defined. Thus we recover the usual value in the quantum (and classical) case, and in all other theories we still capture the same operational interpretation. Note also that our choice of ε = 1/2 was quite arbitrary, and one may consider a family of relative entropies, one for each choice of ε. In quantum theory, these are all equivalent [4], but they may yield different values in other theories.
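Stein's lemma can be illustrated in the simplest classical instance. The sketch below (our own construction, assuming Bernoulli sources and the likelihood-ratio test, which is a threshold on the number of ones) checks numerically that −(1/N) log2 p_N approaches the classical relative entropy:

```python
import math

def log_binom_pmf(n, k, p):
    """Natural log of the Binomial(n, p) pmf at k."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def log_tail(n, p, t):
    """Natural log of P[K >= t] for K ~ Binomial(n, p), via log-sum-exp."""
    logs = [log_binom_pmf(n, k, p) for k in range(t, n + 1)]
    m = max(logs)
    return m + math.log(sum(math.exp(x - m) for x in logs))

n, p1, p2 = 400, 0.6, 0.3
# Largest threshold t with e_2 = P_{S1}[K < t] <= 1/2: the test declares
# 'S1' iff the observed number of ones is at least t.
t = max(t for t in range(n + 1) if math.exp(log_tail(n, p1, t)) >= 0.5)
p_N = math.exp(log_tail(n, p2, t))      # e_1 = P_{S2}[declare S1]
rate = -math.log2(p_N) / n
D = p1 * math.log2(p1 / p2) + (1 - p1) * math.log2((1 - p1) / (1 - p2))
assert abs(rate - D) < 0.02             # Stein: rate -> D(S1||S2) as N grows
```

The residual gap between `rate` and `D` is the usual O(1/√N) correction to the error exponent.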

D. Decomposition entropy
Although the entropy H has several appealing properties, and seems quite intuitive, it is nevertheless interesting to consider alternative notions of entropy for general theories. One seemingly natural alternative is the decomposition entropy, which measures the mixedness of a state.
There is a special subset of states S* ⊆ S which cannot be obtained by mixing other states: S* forms the set of extreme points of S, and its elements are referred to as pure states (with the remaining states being mixed). Suppose that any state in S can be decomposed into a finite mixture of pure states. Then we can define the entropy of a state by the minimal Shannon entropy of its decompositions into pure states. Define a decomposition D(S) of a state S ∈ S as a probability distribution over the set of pure states that is non-zero for only a finite set of states S_i ∈ S* with probabilities p_i ∈ (0,1] such that S = Σ_i p_i S_i, and set H̃(S) := inf_{D(S)} H({p_i}). Like our previous entropy definition, we show in appendix D that H̃ reduces to the Shannon and von Neumann entropy in classical probability theory and quantum theory respectively. However, it has a number of unappealing properties when compared with H. In particular it is neither concave nor subadditive, as revealed by explicit counterexamples from box world given in appendix D.
After studying simple examples in box world, it seems that H̃ is a less intuitive and helpful measure of uncertainty than H. For this reason, although H̃ may play an important role in discussions of entanglement or purity in many generalized theories, and may also lead to interesting operational interpretations, we do not discuss it further here.

V. EXAMPLES IN BOX WORLD
We now investigate how our entropic quantity H(·) behaves in box world with a simple, yet illustrative, example.
To first gain some intuition on how H behaves in such a setting, consider a trivial classical system X which admits only one possible measurement and outputs 2 possible values x_out ∈ {0, 1}, each with probability 1/2. Clearly, since the system admits only one possible measurement e, we have H(X) = H(e(X)) = H((1/2, 1/2)) = 1.
Consider now a PR-box (a bipartite system in the state (16)) where Alice holds system Y (with binary input y_in and output y_out) and Bob holds system Z (with binary input z_in and output z_out). Note that the fine-grained measurements on the entire system correspond to a sequence of fiducial measurements on the two subsystems (where the choice of input to the second subsystem may depend on the output of the first) [3], and the outcome is the output of both measurements. The minimal entropy for the joint system can be obtained by inputting '0' into both boxes, giving outputs '00' or '11' each with probability 1/2 (in fact, any other fine-grained measurement is equally good), and the marginal states yield a random output bit for any input. Hence we have that H(Y Z) = H(Y) = H(Z) = 1.

We now consider a scenario for which it is known that PR-boxes yield an advantage over the quantum setting in terms of information processing. The basis of our example is a simple non-local game in which Alice is given a random 'parity' bit x, and has to output two bits x_0 and x_1 satisfying x_0 ⊕ x_1 = x (where ⊕ denotes addition modulo 2). Then, without receiving any communication from Alice, Bob is given a random target bit t and has to successfully output x_t [13]. This game is equivalent to the CHSH game [9,38]. We begin with Alice having the parity bit (which we model by a classical box in the state X described above), and Alice and Bob sharing a PR-box in the state Y Z. Now Alice performs the following procedure, which corresponds to an allowed transformation in box world. She measures the parity bit X to obtain x := x_out, then uses this as the input to her part of the PR-box, setting y_in = x and obtaining outcome y_out. Finally, she prepares two new classical bits x_0 = y_out and x_1 = x ⊕ y_out (represented by classical boxes X_0, X_1).
Note that because of the correlations inherent in the PR-box, the output of Bob's system will now be described by z_out = y_in · z_in ⊕ y_out = (x_0 ⊕ x_1) · z_in ⊕ x_0 = x_{z_in}. Hence the state of X_0 X_1 Z after this procedure is the classically correlated state P(x_0 x_1 z_out | z_in) = (1/4) δ_{z_out, x_{z_in}}. (37) Given any target bit t, Bob can win the game by setting z_in = t and outputting the result z_out = x_t. We can think of Bob's system as a perfect random access encoding of the two-bit string x_0 x_1 [35,38]. Consider the entropies of the state X_0 X_1 Z. All of the individual systems yield a random output bit, giving H(X_0) = H(X_1) = H(Z) = 1, and x_0 and x_1 are independent random bits, so H(X_0 X_1) = 2. Also note that we have H(X_0 X_1 Z) = 2, since for any input z_in the output z_out will be perfectly correlated with one of the other bits (giving only 2 independent random output bits). Finally, because we can make z_out perfectly correlated with either of the remaining bits we have H(X_0 Z) = H(X_1 Z) = 1, where the optimal measurements are z_in = 0 and z_in = 1 respectively.
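The protocol above is deterministic once the PR-box's shared randomness is fixed, so it can be verified exhaustively. A minimal sketch (our own modelling convention: the box's uniform output bit is represented by a shared bit `lam`):

```python
from itertools import product

def pr_box(y_in, z_in, lam):
    """PR-box with shared randomness lam: outputs satisfy
    y_out XOR z_out = y_in AND z_in, each output marginally uniform."""
    y_out = lam
    z_out = lam ^ (y_in & z_in)
    return y_out, z_out

# Random access encoding: Bob recovers x_t for either t, with no communication.
for x, lam, t in product([0, 1], repeat=3):
    y_out, z_out = pr_box(x, t, lam)     # Alice inputs her parity bit x
    x0, x1 = y_out, x ^ y_out            # Alice's two new classical bits
    assert x0 ^ x1 == x                  # parity condition of the game
    assert z_out == (x0, x1)[t]          # Bob's output equals x_t
```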
These entropy values all seem very intuitive (note in contrast that for the decomposition entropy H̃(X_0 Z) = 2). However, they violate several natural properties of the Shannon and von Neumann entropies.
(a) Strong subadditivity. First of all, it is easy to see from the above that H(X_0 X_1 Z) + H(Z) = 3 > 2 = H(X_0 Z) + H(X_1 Z), which violates strong subadditivity. We now turn to the two possible forms of conditional entropy that we defined, where our simple example clearly illustrates their differences.
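These entropy values, and the failure of strong subadditivity, can be verified by brute force: for this state the fine-grained measurements simply read the requested classical bits and then choose z_in, possibly adaptively on those bits. A sketch (our own encoding of the state; names are not from the paper):

```python
import math
from itertools import product

def shannon(probs):
    """Shannon entropy (in bits)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def box_entropy(parts):
    """H for a subset of the tripartite state X0 X1 Z produced by Alice's
    protocol: x0, x1 are uniform bits and z_out = x_{z_in}.  Fine-grained
    measurements read the requested classical bits and then choose z_in,
    possibly adaptively; we minimise the outcome entropy over strategies."""
    classical = [p for p in parts if p in ('x0', 'x1')]
    if 'z' in parts:
        keys = list(product([0, 1], repeat=len(classical)))
        strategies = [dict(zip(keys, choice))
                      for choice in product([0, 1], repeat=len(keys))]
    else:
        strategies = [None]
    best = float('inf')
    for strat in strategies:
        dist = {}
        for x0, x1 in product([0, 1], repeat=2):
            out = tuple({'x0': x0, 'x1': x1}[c] for c in classical)
            if strat is not None:
                out = out + ((x0, x1)[strat[out]],)   # append z_out
            dist[out] = dist.get(out, 0) + 0.25
        best = min(best, shannon(dist.values()))
    return best

H = box_entropy
assert H(['x0']) == H(['x1']) == H(['z']) == 1.0
assert H(['x0', 'x1']) == H(['x0', 'x1', 'z']) == 2.0
assert H(['x0', 'z']) == H(['x1', 'z']) == 1.0   # z_out can track either bit
# Strong subadditivity H(X0X1Z) + H(Z) <= H(X0Z) + H(X1Z) fails: 3 > 2.
assert H(['x0', 'x1', 'z']) + H(['z']) > H(['x0', 'z']) + H(['x1', 'z'])
```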

A. Standard conditional entropy
First of all, we consider the standard form of conditional entropy, which reduces to the conditional von Neumann entropy in the quantum setting. By the above, we can immediately see that it has the following interesting properties.
(b) Subadditivity of the conditional entropy. Using (25) we deduce that H(X_0|Z) = H(X_1|Z) = 0, H(X_0 X_1|Z) = 1, (43) which seems intuitive, as we can perfectly predict the output of either X_0 or X_1 (but not both) using Z. However, this yields a violation of subadditivity for the conditional entropy, as H(X_0 X_1|Z) = 1 > 0 = H(X_0|Z) + H(X_1|Z). This may seem rather bizarre at first glance; however, we will see in Section VI that no 'reasonable' measure of conditional entropy in box world is subadditive, unlike the von Neumann entropy. It is also interesting to consider the corresponding mutual information quantities, which are I(X_0; Z) = I(X_1; Z) = I(X_0 X_1; Z) = 1. Again, these seem intuitive, as we can extract one bit of information about either X_0 or X_1 or the pair X_0 X_1 from Z. It may be tempting to conclude that the point at which H(X_0 X_1|Z) becomes subadditive (or equivalently, where H(X_0 X_1 Z) becomes strongly subadditive) is exactly when the PR-box is weakened to obey Tsirelson's bound. Note, however, that our simple example only shows that PR-boxes which are correct with probability greater than ≈ 0.89 > 1/2 + 1/(2√2) do not obey subadditivity. Note also that constraining non-local boxes to obey Tsirelson's bound alone is insufficient to reduce box world to quantum theory (e.g. each quantum system admits a continuum of fine-grained measurements, whereas any box admits only a finite set).
(c) Conditioning can increase entropy. Our small example also emphasizes another curious property of the conditional entropy. By definition, H(X_0|X_1 Z) = H(X_0 X_1 Z) − H(X_1 Z) = 1. But this is strange, because we can perfectly determine the output of X_0 given Z. Furthermore, since H(X_0|Z) = 0, we then clearly have H(X_0|X_1 Z) > H(X_0|Z), which means that 'forgetting information', namely discarding X_1, can decrease uncertainty. Again, it may seem that this is a consequence of not choosing the 'correct' definition of entropy.

B. Alternative conditional entropy
Reevaluating the conditional entropies of the previous section using this new definition, we find that H^+(X_0|Z) = H^+(X_1|Z) = 0, H^+(X_0 X_1|Z) = 1 (48) as before; hence this new measure still violates subadditivity. However, we now have H^+(X_0|X_1 Z) = 0 = H^+(X_0|Z), as we would intuitively expect. This means that conditioning on X_1 no longer increases the entropy. However, it generates a violation of the chain rule, as H^+(X_0 X_1|Z) = 1 ≠ 0 = H^+(X_0|X_1 Z) + H^+(X_1|Z). On balance though, this measure of conditional entropy seems more reasonable than the original one in this example.

VI. PROPERTIES OF CONDITIONAL ENTROPIES IN BOX WORLD
We now show that any 'reasonable' measure of the conditional entropy in box world will necessarily defy our intuition about information in several ways.
Intuitively, the goal of any entropic quantity is to capture the degree of uncertainty we have about a system, possibly given access to some additional information. We assign a label A to the system of interest and use B to denote any additional systems or information available to us. For simplicity, let us suppose that A corresponds to some classical information (i.e. it is a state of a classical box). Let H(A|B) denote some entropic quantity that quantifies our uncertainty about A given B. If we were able to determine A with certainty given access to B (i.e. to determine the precise output of the classical box A), we would intuitively say that there is no uncertainty and the quantity H(A|B) should vanish. Conversely, if we cannot determine A given B, but will necessarily have some residual uncertainty, then the quantity H(A|B) should be positive. Motivated by this intuition, we demand that the following two properties hold for any 'reasonable' measure of uncertainty when A is classical.
{1} If the output of A can be obtained from B with certainty, H(A|B) = 0.
{2} If the output of A cannot be obtained from B with certainty, then H(A|B) > 0.
In the classical and quantum world, all commonly used entropic quantities satisfy these conditions (given that A is classical). In both such worlds, there also exist entropic quantities that are subadditive and obey a chain rule, for example the conditional Shannon and von Neumann entropies. In box world, H + (A|B) is 'reasonable' according to this definition, while H(A|B) is 'unreasonable'. Curiously, it turns out that in box world there cannot be any reasonable measure of conditional entropy that obeys conditions {1} and {2}, but at the same time is subadditive or obeys a chain rule.
(a) Subadditivity of the conditional entropy. Consider the state of the two classical bits A = X_0 X_1 and Bob's binary input/output box B = Z described by (37) in the previous section. We now show that in this case no reasonable measure of entropy that obeys properties {1} and {2} is subadditive. First of all, note that Bob can determine one of the bits perfectly, given access to Z. Therefore, from condition {1} we obtain H(X_0|Z) = H(X_1|Z) = 0. (51) However, since Bob cannot determine the parity of the two bits, he certainly cannot learn both bits perfectly, and hence from condition {2} we have H(X_0 X_1|Z) > 0. (52) In order for subadditivity to hold, we would need that H(X_0 X_1|Z) ≤ H(X_0|Z) + H(X_1|Z), (53) which using (51) and (52) leads to a contradiction. Note that subadditivity could still hold if the quantity H(X_0 X_1|Z) were negative.
(b) Chain rule. Note that for the state described by (37), condition {1} gives us H(X_0|Z X_1) = H(X_0|Z) = 0 (54) because x_0 can be obtained perfectly from B = Z or B = Z X_1. A chain rule for the conditional entropy would mean that H(X_0 X_1|Z) = H(X_0|Z X_1) + H(X_1|Z). (55) Using Eq. (54), together with Eqs. (51) and (52), again gives us a contradiction. Note that H^+(·|·) obeys conditions {1} and {2}, and hence does not admit a chain rule in box world. As H(·|·) satisfies a chain rule, it follows from the above that it must be 'unreasonable'. Indeed, this can be seen from the fact that H(X_0|X_1 Z) = 1, despite the fact that we can perfectly determine the output of X_0 given Z and X_1, violating condition {1}. It is also easy to see that if we were to drop the conditions that make an entropy 'reasonable', but assume instead that the entropy is not subadditive while still enforcing a chain rule, then conditioning can increase entropy.

VII. INFORMATION CAUSALITY
We now use our entropic quantities to investigate the game given in [26]. This task relates to 'information causality', which is expressed as the principle that 'communication of k classical bits causes information gain of at most k bits'. In [26] it is reported that this principle can be violated in box world using the following simple game (where we take k = 1): Alice is given two random classical bits a_0 and a_1, and Bob is given a single random bit t. Alice is allowed to send a single bit message m to Bob, after which he must output a bit b. They succeed in the task if b = a_t. This task is clearly very similar to the non-local game considered in section V. Indeed, any solution to the previous problem can also be used to solve this one. Alice takes the parity bit as x = a_0 ⊕ a_1, then generates x_0 and x_1 = x_0 ⊕ x as before. She sends the message m = x_0 ⊕ a_0 to Bob. Using the previous protocol, Bob generates x_t, and then outputs b = x_t ⊕ m = a_t.
In the context of this game, 'information causality' is interpreted as meaning that I := I(a_0; b|t = 0) + I(a_1; b|t = 1) ≤ 1, (56) where I(·;·|·) is the classical conditional mutual information. This inequality is obeyed in quantum theory. However, given the above argument it is clear that it can be violated in box world, as Alice and Bob can achieve I = 2. Let us examine why (56) fails in terms of our general entropies. We consider the state just after Bob has received the message from Alice, when she holds classical bits A_0 and A_1, and Bob holds the classical message M and his part of the PR-box Z. This state is described by P(a_0 a_1 m z_out|z_in) = (1/8) δ_{z_out, m ⊕ a_{z_in}}. (57) We can compute entropies explicitly in this case as in section V, and will obtain similar results. However, [26] also contains a proof of (56) in quantum theory based on the quantum mutual information. It is interesting to attempt to follow this proof using our general mutual information I (or I^+) to see where it fails. The quantum proof relies on the chain rule for quantum mutual information (which I satisfies by definition) [48], positivity of the mutual information (which is true for I in box world due to the subadditivity of H), and non-signalling (which is one of the defining features of box world). However, the crucial step is a use of the data processing inequality to deduce that I(A_0; M Z | A_1) ≥ I(A_0; M Z). Although it is very natural that 'forgetting' A_1 can only decrease the mutual information, this inequality is violated in box world. Indeed, for the state (57) we find I(A_0; M Z | A_1) = 0 < 1 = I(A_0; M Z). This is again a consequence of the violation of strong subadditivity for H, which forms the key ingredient in why (56) can be violated in box world. Although the violation of (56) in box world, and its validity in quantum theory, is a very interesting result, it is worth considering whether this really implies that communicating k bits has caused an information gain of more than k bits.
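The value I = 2 can be checked by exhaustive enumeration of the deterministic protocol; a sketch (our own modelling, with the PR-box again represented by a shared random bit `lam`):

```python
import math
from itertools import product
from collections import Counter

def mutual_info(pairs):
    """I(U;V) in bits from a list of equally likely (u, v) samples."""
    n = len(pairs)
    pu, pv, puv = Counter(), Counter(), Counter(pairs)
    for u, v in pairs:
        pu[u] += 1; pv[v] += 1
    return sum(c / n * math.log2((c / n) / ((pu[u] / n) * (pv[v] / n)))
               for (u, v), c in puv.items())

def run(a0, a1, lam, t):
    """One deterministic run of the protocol for shared box randomness lam."""
    x = a0 ^ a1                 # Alice's parity bit
    y_out = lam                 # her PR-box output on input y_in = x
    x0 = y_out
    m = x0 ^ a0                 # the single classical bit sent to Bob
    z_out = lam ^ (x & t)       # Bob's PR-box output on input z_in = t
    return z_out ^ m            # Bob's guess b

I = 0.0
for t in (0, 1):
    samples = [((a0, a1)[t], run(a0, a1, lam, t))
               for a0, a1, lam in product([0, 1], repeat=3)]
    I += mutual_info(samples)
assert abs(I - 2.0) < 1e-12    # violates (56), which demands I <= 1
```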
From the state (57) it is easy to check that I(A_0 A_1; M Z) = I^+(A_0 A_1; M Z) = 1, hence under both these measures the total information about the composite system A_0 A_1 has only increased by one bit due to the one-bit classical message. We show in Section C 2 in the appendix that in box world we indeed have that, given some arbitrary system Z held by Bob, the mutual information about a classical string A can never increase by more than the length of a classical message M that is transmitted. Furthermore, Bob can extract only one of the two bits, either A_0 or A_1, with the help of the message, as is indeed noted in [26]. It is therefore arguable that the information gain of Bob is only one bit. Perhaps 'information causality' should be restated in a clearer way that more directly represents the form of (56), e.g. as the principle that an m-bit classical communication allows us to learn any one out of at most m unknown bits.

VIII. A SIMPLE CODING THEOREM
We now show that for some theories, the entropic quantity H(·) has an appealing operational interpretation in capturing our ability to compress information. Here, we will only show this for theories obeying further restrictions, and it is an interesting open question how generally this interpretation applies.

A. Dimension and subspaces
Before we can talk about compression, we first need to clarify our notions of the size of a system. Intuitively, the size of a system should limit the amount of uncertainty we can have about it. Furthermore, to compress, we will clearly need to shrink the original state space. It is therefore helpful to define a notion of size for any subset of allowed states S T ⊆ S.
We refer to the size of a set of states S_T as its dimension d, which we define by d := min_{e∈E*} |{r ∈ R_e | ∃S ∈ S_T, e_r(S) > 0}|.
This corresponds to eliminating all measurement outcomes that cannot occur for any state in S T , and then counting the minimal number of remaining outcomes for any fine-grained measurement. It follows that log d ≥ H(S) for all S ∈ S T . In quantum theory d corresponds precisely to the dimension of a Hilbert space. A natural way to select a subset of states is to consider all states that yield a given measurement outcome with certainty. We refer to an effect f such that {f, u − f } is an allowed measurement, and that occurs with certainty for some state, as a full effect (i.e. f is full if there exists S ∈ S such that f (S) = 1). For any full effect f , we can therefore define a non-empty subset of states S f = {S|S ∈ S, f (S) = 1}. We refer to such a subset as the subspace of S given by f . Note that subspaces are always convex, and the subspace corresponding to an effect f which is both full and fine-grained obeys d f = 1.
We say that we have compressed a state if we have constrained it to lie within a set of states of smaller dimension.

B. Additional assumptions
So far, we were never concerned about what happens to a state after a measurement. In our compression protocol, however, we will need to use an abstract notion of post-measurement states as described in Section II C. In particular, we will consider pseudo-projective measurements, which we define to be measurements that fulfill two conditions.

1. (Repeatability) A pseudo-projective measurement is repeatable, such that if the same measurement is applied again the same result is obtained. This requires that the output state S_r after obtaining a result r lies in the subspace given by e_r (i.e., e_r(S_r) = 1). Consequently, all effects in a pseudo-projective measurement must be full effects.
2. (Weak Disturbance) If a particular outcome r of a pseudo-projective measurement occurs with probability e_r(S) ≥ 1−δ for a state S, then the post-measurement state S_r after this result is obtained satisfies e_r(S) D(S, S_r) ≤ c δ^ε, where c ≥ 0 and ε ∈ (0, 1] are constants depending on the particular theory. For example, for projective measurements in quantum theory c = (√8 + 1)/2 and ε = 1/2.
Any projective measurement in quantum theory fulfills these conditions, but these conditions alone do not define projective measurements, hence the slightly different name. In quantum theory, the weak disturbance property can be understood as an instance of the gentle measurement lemma [43]. Furthermore, in order to prove our simple coding theorem, we will need to make some additional assumptions on the states and the measurements that achieve the minimal output entropy H(·) in our theory. In particular, we assume that for all states, the minimal output entropy can be attained by a pseudo-projective measurement. That is, we assume that for all S ∈ S there exists some pseudo-projective measurement e ∈ E* such that H(S) = H(e(S)). We further assume that for all such measurements, e^{⊗n} is fine-grained and pseudo-projective, and that coarse-grainings of e^{⊗n} can also be made pseudo-projective. Lastly, we assume that the dimension of S^{⊗n} is d^n. These assumptions are all true in the classical and quantum case (where e is projective).
We will see in Appendix E that this is all we need to show the following simple coding theorem, following the steps taken by Shannon [31] and Schumacher [30] (see for example [24]).

C. Compression
We consider a source that emits a state S̃_k ∈ S with probability q_k, chosen independently at random in each time step. When considering n time steps, we hence obtain a sequence of states S̃_k = S̃_{k_1}, ..., S̃_{k_n} ∈ S^{⊗n} with k = (k_1, ..., k_n), where each sequence occurs with probability q_k = Π_j q_{k_j}. A compression scheme consists of an encoding and decoding procedure. The encoding procedure maps each possible S̃_k into a state Ŝ_k ∈ S_f ⊂ S^{⊗n}. In turn, the decoding procedure maps the states Ŝ_k back to states S̄_k ∈ S^{⊗n} on the original state space. In analogy with the quantum case, we say that the compression scheme has rate R if the dimension of the smaller space obeys d_f ≤ 2^{nR}. Note that in order for a compression scheme to be useful, it must have R < log d (and hence d_f < d^n). A compression scheme is called reliable if we can recover the original state (almost) perfectly, in the sense that the average distance between the original and the reconstructed state can be made arbitrarily small for sufficiently large n, i.e. for any ε > 0 and all sufficiently large n, Σ_k q_k D(S̃_k, S̄_k) ≤ ε. Note that the output of the source can be described as a mixed state Src = Σ_k q_k S̃_k in each time step, and a product state Src^{⊗n} ∈ S^{⊗n} over the course of n time steps. We then obtain the following theorem (see appendix Section E) in terms of the entropy of the source H(Src).
Theorem VIII.1. Consider an i.i.d. source {q_k, S̃_k ∈ S}_k with entropy rate H(Src). Then for R > H(Src) there exists a reliable compression scheme with rate R.
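The classical special case of this theorem is ordinary typical-set compression, which can be checked numerically. A sketch for a biased-bit source (the parameters n, q, and delta are our own choices, not from the paper):

```python
import math

def h2(p):
    """Binary Shannon entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n, q, delta = 2000, 0.1, 0.05
H = h2(q)

def log2_pmf(k):
    """log2 probability that a length-n i.i.d. Bernoulli(q) string has k ones."""
    lc = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return (lc + k * math.log(q) + (n - k) * math.log(1 - q)) / math.log(2)

# Typical set: sequences x with |-(1/n) log2 q(x) - H| <= delta; for a
# binary source this depends only on the number of ones k.
typical = [k for k in range(n + 1)
           if abs((k * math.log2(1/q) + (n - k) * math.log2(1/(1-q))) / n - H)
           <= delta]
mass = sum(2 ** log2_pmf(k) for k in typical)         # probability of typicality
count = sum(math.comb(n, k) for k in typical)         # number of typical strings

assert mass > 0.9                                     # almost all weight is typical
assert count.bit_length() <= math.ceil(n * (H + delta)) + 1   # ~2^{n(H+delta)} states
```

Encoding only the typical set therefore achieves any rate R > H with vanishing error, which is the content of the theorem in the classical case.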
Note that in order to establish that H(·) truly characterizes our ability to compress information, we would also like to have a converse stating that for R < H(Src) there exists no reliable compression scheme. In quantum theory, it is not hard to prove the converse of the above theorem, since quantum theory admits a strong duality between states and measurements, a property which may also hold for other theories. Here, however, we have explicitly tried to avoid introducing any such strong assumptions.

IX. CONCLUSION AND OPEN QUESTIONS
We introduced entropic measures to quantify information in any physical theory that admits minimal notions of systems, states and measurements. Even though these measures necessarily have some limitations, we nevertheless showed that they also exhibit many intuitive properties, and for some theories have an appealing operational interpretation, quantifying our ability to compress states.
Most of the problems we encountered with the conditional entropy seem to arise due to a violation of strong subadditivity. It is an interesting question whether quantum and classical theories are the only ones in which H is strongly subadditive, or whether this is true for other theories. Indeed, it would be an exciting question to turn things around and start by demanding that our entropic measures do satisfy these properties, and determine how this restricts the set of possible theories.
In H^+(·|·) we defined a natural entropic quantity which differs from the conditional von Neumann entropy in quantum theory, and has been used in [12] to study quantum correlations. It would be interesting to study whether this quantity can shed any further light on quantum phenomena, or if an alternative conditional entropy can be defined that behaves like H^+(·|·) in box world, but still reduces to the conditional von Neumann entropy in quantum theory.
Whereas we have proved some intuitive properties of our quantities, it is interesting to see whether other properties of the von Neumann or Shannon entropy carry over to this setting. In particular, it would be interesting to prove bounds on the mutual and accessible information analogous to Holevo's theorem when none of the systems are classical.
Another interesting question is whether one can find a closed form expression for the relative entropy in general theories. In quantum theory, we can define the mutual information (and indeed the entropy itself) in terms of the relative entropy [49], hence such an approach may also yield an alternative definition of other entropic quantities for general theories.
We believe our measures are an interesting step towards understanding information processing in general physical theories, which may in turn shed some light on our own quantum world.
in part by the EU QAP project (CT-015848). Part of this work was done while AJS was visiting Caltech (Pasadena, USA).
Note added: In the course of this work we learned of independent work on the same general topic [40], to appear simultaneously in NJP. Related work has also appeared later on [41].
Consider the N × N matrix M determined by the entries M_{ℓ,j}, which allows us to write q = Mp. Note that since M_{ℓ,j} ≥ 0 and Σ_j M_{ℓ,j} = Σ_ℓ M_{ℓ,j} = 1, M is a doubly stochastic matrix. Using Birkhoff's theorem (see e.g. [19, Theorem 8.7.1]), we may thus write M as a convex combination of permutation matrices, that is, M = Σ_π P(π) M_π, where P is a probability distribution over the group of permutations S_N. Using the concavity of the Shannon entropy we obtain H(q) = H(Mp) ≥ H(p).

Suppose first that the infimum defining H(S_2) is achieved by a measurement f, so that H(S_2) = H(f(S_2)). We can then bound |H(S_1) − H(S_2)| ≤ |H(f(S_1)) − H(f(S_2))| (B15) ≤ C(f(S_1), f(S_2)) log(D/C(f(S_1), f(S_2))) ≤ D(S_1, S_2) log(D/D(S_1, S_2)), where the first inequality follows from the fact that H(S_1) ≤ H(f(S_1)), the second from Fannes' inequality [14] applied to the classical case, and the final inequality by noting that C(f(S_1), f(S_2)) ≤ D(S_1, S_2). If the infimum is not achieved, then for all sufficiently small δ > 0 there nevertheless exists f ∈ E* such that H(S_2) ≥ H(f(S_2)) − δ. Following the same procedure as before, we obtain the same bound up to an additive δ, which can be made arbitrarily small.

We now see that, in consistency with the no-signalling principle, the transmission of an ℓ-bit message M causes the mutual information about a classical system C, given access to some arbitrary box information B, to increase by at most ℓ bits. An analogous statement holds for our alternative definitions of conditional entropy and mutual information.

To show the reduction of H̃(ρ) to the von Neumann entropy S(ρ) in quantum theory, we use the following lemma.
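The Birkhoff step above, that applying a doubly stochastic matrix can only increase the Shannon entropy, can be illustrated directly; a small sketch with an arbitrarily chosen mixture of permutation matrices (our own example values):

```python
import math
from itertools import permutations

def shannon(probs):
    """Shannon entropy (in bits)."""
    return -sum(x * math.log2(x) for x in probs if x > 0)

# A doubly stochastic M built as a convex mixture of permutation matrices,
# as guaranteed by Birkhoff's theorem; q = Mp is then a mixture of
# permuted copies of p, so concavity gives H(q) >= H(p).
perms = list(permutations(range(3)))
weights = [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]
p = [0.7, 0.2, 0.1]
q = [sum(w * p[perm[i]] for w, perm in zip(weights, perms))
     for i in range(3)]
assert abs(sum(q) - 1.0) < 1e-12     # M is stochastic, so q is a distribution
assert shannon(q) >= shannon(p) - 1e-12
```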

Subadditivity and concavity
In this section we will show that H̃ is neither concave nor subadditive by giving explicit counterexamples from box world.
First consider a single box with binary input/output. For clarity, we will represent its state by giving its probability distribution P(a|x) in vector form. The relevant states can both be optimally decomposed into two equally weighted pure states.

To obtain a violation of subadditivity we consider a bipartite state in which each system has a binary input/output, represented in the form of a matrix

S_AB = ( P(00|00) P(01|00) P(00|01) P(01|01)
         P(10|00) P(11|00) P(10|01) P(11|01)
         P(00|10) P(01|10) P(00|11) P(01|11)
         P(10|10) P(11|10) P(10|11) P(11|11) ).

It is known that in this case there are exactly 24 pure states for the bipartite binary input/output case (16 product states and 8 entangled states) [3], which we denote by S^i_AB. By demanding that S_AB − p_i S^i_AB be a positive matrix for each pure state, we find that any decomposition must satisfy p_i ≤ 1/4 for all i. Hence H̃(AB) = inf_{D(S_AB)} H({p_i}) ≥ 2. In fact, we can construct an explicit decomposition in terms of an entangled state and three product states (all equally weighted), giving H̃(AB) = 2.

We then note that the inequality in the last line follows from the typical subspace theorem. As δ can be chosen to be arbitrarily small, this concludes our proof.