Memory cost of temporal correlations

A possible notion of nonclassicality for single systems can be defined on the basis of the notion of memory cost of classically simulating probabilities observed in a temporal sequence of measurements. We further explore this idea in a theory-independent framework, namely, from the perspective of general probability theories (GPTs), which includes classical and quantum theory as special examples. Under the assumption that each system has a finite memory capacity, identified with the maximal number of states perfectly distinguishable with a single measurement, we investigate what are the temporal correlations achievable with different theories, namely, classical, quantum, and GPTs beyond quantum mechanics. Already for the simplest nontrivial scenario, we derive inequalities able to distinguish temporal correlations where the underlying system is classical, quantum, or more general.


I. INTRODUCTION
Given a single quantum system, in what sense can we say that it has some nonclassical properties? The most celebrated phenomena where quantum systems depart from their classical counterparts involve notions such as entanglement [1,2] and nonlocality [3,4], which can be defined only in terms of multipartite systems. What if we are able to perform experiments only on a single, indivisible system? Can we still say that the observed statistics have some "nonclassical properties"? Some notions of nonclassicality have been proposed for single systems, such as contextuality [5] and nonmacrorealism [6,7]. One may argue that such notions are limited to specific measurement procedures and hence are not fully satisfactory. Contextuality restricts the set of possible operations to compatible measurements, which in many cases need to be (approximately) projective or at least satisfy some analogous notion of repeatability and nondisturbance [8,9], in order to avoid the so-called "compatibility loophole" [8] or other similar classical explanations. Macrorealism has similarly strong restrictions on the set of allowed measurements, namely, they must be noninvasive to avoid the clumsiness loophole or other forms of classical interpretation of the results [10].
A strong motivation for developing such a notion of nonclassicality for single systems also arises from quantum information theory. Notions such as entanglement and nonlocality have been proved to play a role in quantum information tasks related to communication, such as device-independent quantum key distribution [11]. That such notions should also play a role for tasks involving only single systems, such as quantum computation, is less evident. Several recent results connected quantum contextuality with models of quantum computation such as quantum computation via magic state injection or measurement-based quantum computation [12–18]. However, a natural question arises of whether this connection is fundamental or just related to the particular model used for quantum computation [19]. If one moves from compatible projective measurements to general instruments, it is no longer clear whether the notion of quantum contextuality makes sense at all, due to the compatibility loophole mentioned above [8].
In this paper, we go beyond such notions and introduce a notion of nonclassicality for the measurement statistics of a single system which is not restricted to specific measurement operations. The main tool of this investigation is the notion of memory cost of simulating temporal correlations. By temporal correlations we mean the observed statistics arising from sequences of measurements on a single system, and memory roughly refers to the amount of classical information that can be stored in the physical system. The notion of memory cost has been explored in connection with quantum contextuality [20,21] and a related notion, i.e., that of communication cost, has been explored in relation to both Bell nonlocality [22,23] and temporal correlations [24,25]. Similar notions have also been explored in the prepare-and-measure scenario [26–30] and in connection with quantum information tasks such as random access codes [31–33].
In our approach, we go beyond the prepare-and-measure scenario by exploring arbitrarily long sequences of measurements and we remove any restriction on the type of measurement by considering arbitrary quantum instruments. Our analysis is not restricted to the differences between classical and quantum theory, but is extended to general probabilistic theories (GPTs) [34–37], which also embrace the former theories. In particular, we derive inequalities on the observed probabilities that are able to discriminate between classical, quantum, and genuine GPT correlations. Moreover, as a further development of the ideas presented in Refs. [20,21], we show that in the framework of finite-state machines it is impossible to simulate contextual correlations on a qubit system, for a fixed initial state and arbitrary instruments.
The paper is organized as follows. In Sec. II, we will introduce the basic notions and tools necessary for our analysis, namely, temporal correlations and the arrow of time polytope. In Sec. III, we will introduce finite-state machines in GPTs, in particular, also in classical and quantum theory. In Sec. IV, we will discuss the existence of nontrivial temporal bounds for such theories and the impossibility of simulating contextual correlations on a qubit. Finally, we present the conclusions and an outlook of the paper.

II. TEMPORAL CORRELATIONS
We consider a box that accepts certain inputs from an input alphabet $X$ and produces outputs from an output alphabet $A$. The box is operated in a sequential fashion, see Fig. 1(a), such that, for instance, it first receives an input labeled by $x \in X$ yielding an output labeled by $a \in A$, subsequently it receives $y$ yielding $b$, and finally it receives $z$ yielding $c$. Prior to this sequence the box is initialized, such that its behavior is independent of anything except the input sequence $xyz$. Consequently, for a fixed input sequence $xyz \in X^3$, the admissible output sequences $abc \in A^3$ are governed by a probability distribution. If we now consider all possible inputs, we obtain the correlations $p(abc|xyz)$. Due to the time ordering of the inputs and outputs, these correlations must satisfy the arrow of time constraints [38],
$$\sum_{c} p(abc|xyz) = \sum_{c} p(abc|xyz') \quad \text{for all } a, b \in A \text{ and all } x, y, z, z' \in X, \tag{1}$$
$$\sum_{b,c} p(abc|xyz) = \sum_{b,c} p(abc|xy'z') \quad \text{for all } a \in A \text{ and all } x, y, y', z, z' \in X. \tag{2}$$
These constraints encode the fact that a future choice of an input, e.g., $z$ or $z'$ in Eq. (1), must not influence previous outputs of the box, e.g., $a$ or $b$. This is in analogy to the nonsignaling conditions in the usual Bell scenario [39]. The arrow of time constraints come solely from causality and hence, they must be satisfied not only in classical and quantum theory, but in any GPT. We can represent the correlations $p(abc|xyz)$ as a vector with coordinates labeled by the possible sequences $abc$ and $xyz$. Due to the linearity of the arrow of time constraints, the set of correlations satisfying them forms a polytope. Its extremal points have been recently characterized [40–42]. It is instructive to briefly sketch the central steps for the simple case of sequences of length three. All correlations in the corresponding polytope can be decomposed as
$$p(abc|xyz) = p(a|x)\, p(b|a; xy)\, p(c|ab; xyz), \tag{3}$$
since the marginals on the right hand side are well defined (for the pathological cases where $p(ab|xy) = 0$ we define the right hand side to be zero). Vice versa, taking valid probability distributions $p(a|x)$, $p(b|a; xy)$, $p(c|ab; xyz)$ over $a$, $b$, $c$, respectively, one always obtains an element of the polytope. Its extremal points are obtained by deterministic strategies, i.e., where each of the probability distributions on the right hand side of Eq. (3) consists only of probabilities 0 or 1. It easily follows that classical and quantum models can reach extremal points if enough memory is available. In more precise terms, each deterministic strategy can be reached if the box internally keeps a record of all previous inputs and outputs. Storing this record then requires the box to have memory. Of course, the notion of memory needs clarification, in particular if the box is described using quantum theory or a GPT; for details see Sec. III.
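The arrow of time constraints and the decomposition in Eq. (3) can be checked numerically for small alphabets. The following is a minimal sketch, assuming binary alphabets; the helper names `correlations` and `satisfies_arrow_of_time`, as well as the two example tables `det` and `bad`, are hypothetical illustrations and not taken from the references.

```python
from itertools import product

A = (0, 1)  # output alphabet (assumed binary for this sketch)
X = (0, 1)  # input alphabet (assumed binary for this sketch)

def correlations(p1, p2, p3):
    """Build p(abc|xyz) = p1(a|x) * p2(b|a;xy) * p3(c|ab;xyz), cf. Eq. (3)."""
    return {
        (a, b, c, x, y, z): p1(a, x) * p2(b, a, x, y) * p3(c, a, b, x, y, z)
        for a, b, c, x, y, z in product(A, A, A, X, X, X)
    }

def satisfies_arrow_of_time(p, tol=1e-12):
    """Check Eqs. (1) and (2): marginals must not depend on later inputs."""
    for a, b, x, y in product(A, A, X, X):
        # Eq. (1): after summing over c, nothing may depend on z.
        ref = sum(p[a, b, c, x, y, X[0]] for c in A)
        if any(abs(sum(p[a, b, c, x, y, z] for c in A) - ref) > tol for z in X):
            return False
    for a, x in product(A, X):
        # Eq. (2): after summing over b and c, nothing may depend on y or z.
        ref = sum(p[a, b, c, x, X[0], X[0]] for b, c in product(A, A))
        for y, z in product(X, X):
            if abs(sum(p[a, b, c, x, y, z] for b, c in product(A, A)) - ref) > tol:
                return False
    return True

# Hypothetical deterministic strategy: each output is a function of the full record.
det = correlations(
    lambda a, x: float(a == x),
    lambda b, a, x, y: float(b == (a ^ y)),
    lambda c, a, b, x, y, z: float(c == (a ^ b ^ z)),
)

# Hypothetical normalized table in which the first output copies the last input,
# i.e., the future influences the past; it violates Eq. (1).
bad = {k: float(k[0] == k[5] and k[1] == 0 and k[2] == 0) for k in det}

print(satisfies_arrow_of_time(det), satisfies_arrow_of_time(bad))
```

Any table built through `correlations` from valid conditional distributions passes the check by construction, which mirrors the "vice versa" statement in the text; the table `bad` is normalized for every input sequence but lies outside the polytope.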
Clearly, storing the full record of previous inputs and outputs is not necessarily memory optimal and gives rise to the question: What is the minimal number of states necessary to obtain certain correlations? How does such a number depend on the specific theory we use to describe the internals of the box? An important element, in order to be able to speak about the memory cost of temporal correlations, is the requirement that all time-dependent information used to produce the outputs must be stored within the physical system used to implement the box. This implies that the physical operations performed to produce an output must be time-independent, e.g., the experimenter is not allowed to look at the wall clock and decide to implement in a different way the operation associated with a certain input $x$, as this would result in an additional source of memory, i.e., the clock keeping track of time. It is interesting to notice that the case where such time-dependent operations are admissible is equivalent to the case of quantum communication scenarios such as quantum random access codes or the scenario described by Brierley et al. [24]. In fact, the latter scenario can be modeled as a network with ordered nodes, where a single physical system is transmitted through the nodes, and at each time step one of the nodes receives the system, performs a local operation, and transmits the system to the subsequent node. Since for each node it is known in advance in which part of the sequence it is situated, its local operations can be adapted to maximize a certain figure of merit defined in terms of probabilities of outcomes. This scenario covers the notion of "communication cost" and it must be distinguished from the notion of "memory cost" that is considered here.
Moreover, even though in the memory cost scenario we are not allowed to change the operations throughout the sequence, it still makes sense to use classical randomness at the beginning of a sequence: at each experimental run, the experimenter can flip a coin and decide to perform the whole sequence with one box or another. The resulting correlations will be a convex combination of the correlations obtained from either box. A graphical representation of the above ideas is presented in Fig. 1. These intuitive notions are made more rigorous in the next section.

III. FINITE-STATE MACHINES
In this section, we formally define the classical, quantum, and GPT models for the box used in the previous section. In this model we assume that the box is implemented as a machine which acts on an internal state. Upon receiving an input x, the box operates on the internal state and produces the output a. The internal state is the specific model of the memory from the previous section. More precisely, we use the finite number of perfectly distinguishable states as a measure for the memory and for this reason we call this model a finite-state machine.
In a first step we need to describe the internal state $\omega$ and the operations $I_{a|x}$ of the machine. We choose ordered vector spaces to describe the machine, which is an appropriate framework for a wide range of GPTs. In Appendix A we give a brief summary of this mathematical formalism. In brief, a GPT is then described by a real vector space $V$ with partial order "$\le$" and an order unit $e \in V$. In quantum theory $V$ would be the set of Hermitian operators, $A \le B$ would correspond to $B - A$ being positive semidefinite, and $e$ to the identity operator. Measurement outcomes are represented by effects $f \in V$ with $0 \le f \le e$ and a measurement $M_x$ is represented by a collection of effects $M_x = (f_{a|x})_a$ with $\sum_a f_{a|x} = e$. The set of states $S$ is a subset of the dual space of $V$ such that the probability of outcome $a$ in the measurement $M_x$ is given by $p(a|x) = \omega f_{a|x}$. Therefore $\omega e = 1$ and $\omega f \ge 0$ for all $f \in V$ with $f \ge 0$. The operations $I_{a|x}$ represent a specific way to implement a measurement, taking into account the change of the internal state $\omega$. More precisely, the linear map $I_{a|x}\colon V \to V$ is such that $f_{a|x} = I_{a|x} e$ is the effect describing the output $a$. In addition, the positivity condition $I_{a|x} f \ge 0$ for any $f \in V$ with $f \ge 0$ needs to be satisfied, and further restrictions on $I_{a|x}$ may apply depending on the specific GPT. If we group together the transformations $I_x = (I_{a|x})_{a \in A}$ for a fixed input $x$, then $I_x$ is called an instrument. If we ignore the outcome $a$, then the instrument maps states to states, in the sense that $\omega \sum_a I_{a|x} \in S$ for any state $\omega$.
Given the initial internal state $\omega$ of the finite-state machine and the instrument $I_x = (I_{a|x})_a$, the probabilities associated with a sequence of measurements are given by
$$p(a|x) = \omega I_{a|x} e, \quad p(ab|xy) = \omega I_{a|x} I_{b|y} e, \quad \text{etc.} \tag{4}$$
Note that we write the transformations in the Heisenberg picture, so that the time ordering proceeds from the left to the right. For a general sequence of inputs $x_1 x_2 \cdots x_n = \vec{x}$ and outputs $a_1 a_2 \cdots a_n = \vec{a}$ we write
$$p(\vec{a}|\vec{x}) \equiv p(a_1 \cdots a_n | x_1 \cdots x_n) = \omega I_{a_1|x_1} \cdots I_{a_n|x_n} e \equiv \omega I_{\vec{a}|\vec{x}}\, e. \tag{5}$$
We exemplify in the next sections how this expression is specialized to the classical and quantum case.
As we discussed previously, we exclude any external source of memory, such as a clock keeping track of time. This is formalized by the fact that all instruments solely depend on the input and in particular by the fact that all transformations are time-independent. In general, for a fixed GPT this requirement makes the set of achievable correlations nonconvex. Nevertheless, we can recover convexity by allowing the use of convex mixtures as follows. Before starting the experiment we use a random variable $\lambda$, distributed according to some probability distribution $q(\lambda)$, to decide which finite-state machine to use subsequently. Since the machine is characterized by the initial state $\omega^\lambda$ and the instruments $I^\lambda_x$, this yields the correlations
$$p(\vec{a}|\vec{x}) = \sum_\lambda q(\lambda)\, \omega^\lambda I^\lambda_{\vec{a}|\vec{x}}\, e. \tag{6}$$
The above procedure allows us to generate all correlations from the convex hull of correlations obtainable from a family of finite-state machines parametrized by $\lambda$. Finally, we define the memory of the system using the GPT notion of capacity (cf. Ref. [43]), i.e., the size of the maximal set of perfectly distinguishable states. More precisely, we say that a GPT defines a $d$-state machine if $d$ is the maximal integer such that there exists a collection of $d$ states $(\omega_k)_k$ and $d$ effects $(f_k)_k$ such that
$$\sum_k f_k \le e \quad \text{and} \quad \omega_i f_j = \delta_{ij} \text{ for all } i, j. \tag{7}$$
Namely, all effects are part of the same measurement, which is able to perfectly (i.e., with probability one) discriminate among the states. This notion of capacity corresponds to the dimension of the Hilbert space in quantum mechanics and to the number of extremal points of the state simplex in classical probability theory (see, e.g., Ref. [43]).
It is instructive to discuss in more detail the classical and quantum case, which may be more familiar to the reader. We subsequently introduce a particular class of capacity-2 GPTs, the dichotomic norm cones [44].

A. Classical finite-state machines
A classical finite-state machine [45] is described by its internal rules for state transitions and output probabilities. Given the set of classical states $C = \{1, 2, \dots, d\}$, the observed probability distribution $p(\vec{a}|\vec{x})$ for an input sequence $\vec{x}$ of length $n$ can be written as
$$p(\vec{a}|\vec{x}) = \sum_{s_0, \dots, s_n \in C} r(s_0)\, q(a_1, s_1 | s_0, x_1) \cdots q(a_n, s_n | s_{n-1}, x_n). \tag{8}$$
Here, $r(s_0)$ describes the probability of preparing the initial state $s_0$ of the machine and $q(a, s'|s, x)$ describes the probability that the machine yields the output $a$ and transitions to the state $s'$, given that the internal state is $s$ and the input is $x$. Those machines can in particular deviate in which is the initial state. For clarity reasons, we use only Eq. (8) in the following. The correlations $p(\vec{a}|\vec{x})$ can be rewritten as
$$p(\vec{a}|\vec{x}) = \pi^\dagger T(a_1|x_1) \cdots T(a_n|x_n)\, \eta \equiv \pi^\dagger T(\vec{a}|\vec{x})\, \eta,$$
where $\eta = (1, 1, \dots, 1)^\dagger$ is the $d$-dimensional vector of ones, $\pi$ is the vector representing the initial state, and $T(a|x)$ is the $d \times d$ transition matrix. Hence, $\pi_s = r(s)$ and $[T(a|x)]_{s,s'} = q(a, s'|s, x)$. The rules for probabilities that constrain $q(a, s'|s, x)$ translate to $[T(a|x)]_{s,s'} \ge 0$ for all $s, s', a, x$, and $\sum_a [T(a|x)\eta]_s = 1$ for all $s, x$.
Translating the above into the language of GPTs, we let $V = \mathbb{R}^d$ and set the order unit $e$ to $\eta$. The partial order is such that $v \le w$ if $v_s \le w_s$ for all $s$. Then the set of states is given by the canonical $(d-1)$-dimensional simplex,
$$S = \Big\{ \pi \in \mathbb{R}^d \;\Big|\; \pi_s \ge 0 \text{ for all } s, \ \sum_s \pi_s = 1 \Big\}.$$
In particular $\pi$ is a state. Analogously, the transition matrix $T(a|x)$ corresponds to the instrument $I_{a|x}$, whereas the effects can be obtained as $f_{a|x} := T(a|x)\eta$. It can be easily seen that $d$ corresponds exactly to the capacity defined according to Eq. (7).
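The matrix form of the classical correlations can be evaluated directly. The following is a minimal sketch, assuming a two-state machine with binary inputs and outputs; the particular machine (input 0 reads the state, input 1 reads it and then flips it) and the names `matvec`, `probability`, and `is_valid_machine` are hypothetical choices made for illustration only.

```python
from itertools import product

def matvec(T, v):
    """Multiply a d x d matrix (given as a list of rows) with a column vector."""
    return [sum(row[t] * v[t] for t in range(len(v))) for row in T]

def probability(pi, T, outputs, inputs):
    """p(a|x) = pi^T T(a_1|x_1) ... T(a_n|x_n) eta, with eta the all-ones vector."""
    v = [1.0] * len(pi)  # eta
    # Apply the transition matrices to eta from right to left.
    for a, x in reversed(list(zip(outputs, inputs))):
        v = matvec(T[a, x], v)
    return sum(p * w for p, w in zip(pi, v))

def is_valid_machine(pi, T, A, X, tol=1e-12):
    """Check [T(a|x)]_{s,s'} >= 0, sum_a [T(a|x) eta]_s = 1, and that pi is a state."""
    d = len(pi)
    nonneg = all(T[a, x][s][t] >= 0
                 for a, x in product(A, X) for s in range(d) for t in range(d))
    normalized = all(abs(sum(sum(T[a, x][s]) for a in A) - 1) <= tol
                     for x in X for s in range(d))
    return nonneg and normalized and min(pi) >= 0 and abs(sum(pi) - 1) <= tol

A, X = (0, 1), (0, 1)
pi = [1.0, 0.0]  # the machine starts in the first state
# T[a, x][s][s'] = q(a, s'|s, x); hypothetical example machine:
T = {
    (0, 0): [[1.0, 0.0], [0.0, 0.0]],  # input 0: output the state, keep it
    (1, 0): [[0.0, 0.0], [0.0, 1.0]],
    (0, 1): [[0.0, 1.0], [0.0, 0.0]],  # input 1: output the state, then flip it
    (1, 1): [[0.0, 0.0], [1.0, 0.0]],
}

print(probability(pi, T, (0, 1), (1, 0)))  # read-and-flip, then read: 1.0
```

Because the same dictionary `T` is used at every time step, the sketch respects the time-independence requirement discussed above; all time-dependent information sits in the internal state.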

B. Quantum finite-state machines
The quantum case is perhaps the most familiar to readers from quantum information. The probability distribution is obtained by sequences of generalized measurements $M_x = (E_{a|x})_a$ on a single system described by a Hilbert space of fixed dimension $d$. The outcomes of the measurement are described by positive semidefinite operators $E_{a|x} \ge 0$ with $\sum_a E_{a|x} = \mathbb{1}$.
In order to discuss sequential measurements, however, we need to know the post-measurement state, or, better, the transformation induced by the measurements. This information is provided by a quantum instrument $I_x$, defined as a collection of completely positive maps $I_x = (I_{a|x})_a$, from the space of linear operators into itself, that sum up to a unital map, i.e., $\sum_a I_{a|x}(\mathbb{1}) = \mathbb{1}$, corresponding to the rule of preservation of probability in the Heisenberg picture, see, e.g., [46]. Each instrument defines a generalized measurement through the formula $E_{a|x} = I_{a|x}(\mathbb{1})$. Similarly to the previous cases, we can shorten the notation by defining $I_{\vec{a}|\vec{x}} := I_{a_1|x_1} \circ \dots \circ I_{a_n|x_n}$, where $\circ$ denotes the composition of maps, and write
$$p(\vec{a}|\vec{x}) = \operatorname{tr}[\rho\, I_{\vec{a}|\vec{x}}(\mathbb{1})]. \tag{12}$$
As mentioned before, quantum theory is a particular case of a GPT, where the vector space $V$ is the set of Hermitian operators, the partial order is defined through positive semidefiniteness, and the order unit $e$ is given by $\mathbb{1}$. The set of states is given by the density operators, identified by the Hilbert-Schmidt inner product with the elements of the dual space of $V$,
$$\omega\colon X \mapsto \operatorname{tr}(\rho X).$$
Hence Eq. (12) and Eq. (5) are equivalent. It is then clear that the capacity of the system, defined as the number of perfectly distinguishable states [42,47], precisely corresponds to the dimension of the Hilbert space. It is important to remark that we need to consider the general formalism of quantum instruments, since if the measurement devices merely acted projectively, there would be nontrivial limitations on the achievable correlations that are valid for arbitrary dimensions [48,49].
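Eq. (12) can be evaluated for a qubit by composing the Heisenberg-picture maps from the last measurement backwards. The following is a minimal sketch in plain Python, assuming instruments given by Kraus operators $K$ acting as $X \mapsto \sum_k K_k^\dagger X K_k$; the example instruments (projective measurements in the $Z$ and $X$ bases) and all function names are hypothetical choices for illustration.

```python
def matmul(A, B):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def dagger(A):
    """Conjugate transpose of a 2 x 2 matrix."""
    return [[A[j][i].conjugate() for j in range(2)] for i in range(2)]

def heisenberg(kraus, X):
    """I(X) = sum_k K_k^dagger X K_k for one outcome of an instrument."""
    out = [[0j, 0j], [0j, 0j]]
    for K in kraus:
        term = matmul(dagger(K), matmul(X, K))
        out = [[out[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    return out

def probability(rho, instr, outputs, inputs):
    """p(a|x) = tr[rho I_{a_1|x_1}(I_{a_2|x_2}(... I_{a_n|x_n}(1)))], cf. Eq. (12)."""
    X = [[1 + 0j, 0j], [0j, 1 + 0j]]  # identity operator
    for a, x in reversed(list(zip(outputs, inputs))):
        X = heisenberg(instr[a, x], X)
    return sum(rho[i][k] * X[k][i] for i in range(2) for k in range(2)).real

h = 0.5
# instr[a, x] is the list of Kraus operators for output a given input x.
instr = {
    (0, 0): [[[1, 0], [0, 0]]],    # input 0: projector onto |0>
    (1, 0): [[[0, 0], [0, 1]]],    # input 0: projector onto |1>
    (0, 1): [[[h, h], [h, h]]],    # input 1: projector onto |+>
    (1, 1): [[[h, -h], [-h, h]]],  # input 1: projector onto |->
}
rho = [[1, 0], [0, 0]]  # initial state |0><0|

print(probability(rho, instr, (0, 0), (0, 1)))        # 0.5
print(probability(rho, instr, (0, 0, 1), (0, 1, 0)))  # 0.25
```

The example uses projective instruments only for readability; the same functions accept arbitrary lists of Kraus operators, as long as the maps for each input sum up to a unital map.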

C. GPT two-state machines
We already provided a definition of GPT finite-state machines at the beginning of Sec. III. In this section, we specialize this definition by considering a class of GPTs where the effects belong to a dichotomic norm cone. These theories are a generalization of the classical bit (cbit) and quantum bit (qubit), in the sense that they have capacity two, i.e., they allow for a set of perfectly distinguishable states, in the sense of Eq. (7), of at most size two. We then specialize our discussion to the case of hyperbits (hbits) [50] and generalized bits (gbits) [51]. The former are a generalization of the Bloch sphere to dimensions higher than three, whereas the latter are the local part of a Popescu-Rohrlich box [39]. We also provide a more detailed discussion of GPTs in Appendix A.
Consider the vector space $V := \mathbb{R} \times \mathbb{R}^n$, and the partial order where $(t, x) \ge 0$ if $t \ge |x|$. Here, $|x|$ is any norm in $\mathbb{R}^n$. We define the order unit $e := (1, 0)$. This implies that effects are vectors $f = (t, x)$ such that $|x| \le \min\{t, 1 - t\}$. The states for a dichotomic norm cone are the maps $\omega\colon (t, x) \mapsto t + w^\dagger x$ with the condition $|w|_* \le 1$, where $|w|_* := \sup\{ w^\dagger y \mid |y| \le 1 \}$ is the dual norm of $|\cdot|$. A peculiarity of this GPT is that it has exactly capacity two, independent of $n$ or the choice of the norm $|\cdot|$. We provide a proof of this fact in Appendix C.
Depending on the norm chosen and on $n$ we have different GPTs. If we take $|x|$ to be the Euclidean (or $\ell_2$) norm, i.e., $|x|^2 = \sum_i x_i^2$, we obtain hbits, and specifically cbits for $n = 1$, qubits for $n = 3$, and more general hbits for $n > 3$. If we take $n = 2$ and the Manhattan (or $\ell_1$) norm, i.e., $|x| = \sum_i |x_i|$, we obtain a gbit. For the case of the Euclidean norm, the dual norm is also the Euclidean norm itself, whereas the dual of the Manhattan norm is the supremum (or $\ell_\infty$) norm, i.e., $|w|_* = \max_i |w_i|$.
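The effect and state conditions of a dichotomic norm cone are simple enough to test numerically. The following is a minimal sketch, assuming the pairing of norm and dual norm stated above; the names `HBIT`, `GBIT`, `is_effect`, `is_state`, and `prob`, as well as the example vectors, are hypothetical.

```python
def l1(x):
    return sum(abs(v) for v in x)

def l2(x):
    return sum(v * v for v in x) ** 0.5

def linf(x):
    return max(abs(v) for v in x)

# (norm for effects, dual norm for states): hbits use l2/l2, gbits use l1/linf.
HBIT = (l2, l2)
GBIT = (l1, linf)

def is_effect(f, theory, tol=1e-12):
    """f = (t, x) is an effect iff |x| <= min(t, 1 - t)."""
    t, x = f
    return theory[0](x) <= min(t, 1 - t) + tol

def is_state(w, theory, tol=1e-12):
    """The map (t, x) -> t + w.x is a state iff |w|_* <= 1."""
    return theory[1](w) <= 1 + tol

def prob(w, f):
    """Probability of the effect f = (t, x) in the state given by w."""
    t, x = f
    return t + sum(wi * xi for wi, xi in zip(w, x))

# Qubit (n = 3): two states perfectly distinguished by the effect f and e - f.
f = (0.5, (0.0, 0.0, 0.5))
print(is_effect(f, HBIT), prob((0, 0, 1), f), prob((0, 0, -1), f))

# Gbit (n = 2): w = (1, 1) is a valid gbit state but not a valid hbit state.
print(is_state((1, 1), GBIT), is_state((1, 1), HBIT))
```

In the gbit example, the state with $w = (1, 1)$ assigns probability one to both effects $\tfrac{1}{2}(1, e_1)$ and $\tfrac{1}{2}(1, e_2)$, which has no counterpart for the Euclidean norm with $n = 2$.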

IV. BOUNDS ON TEMPORAL CORRELATIONS
In this section, we consider the simplest nontrivial scenario, a sequence of two measurements, with inputs $x, y$ and outputs $a, b$, with $a, b, x, y \in \{0, 1\}$. We are interested in bounds on a sum of correlations, denoted by $S$ in the following. Similar expressions have been considered in Refs. [41,42,52]. Clearly, the trivial bound $S \le 3$ holds. For hbits the value $S = 3$ cannot be reached and therefore there must exist a nontrivial bound $S \le \Omega_{\mathrm{hbit},n}$ for any dimension $n$ of the hbit, in particular for the cbit ($n = 1$) and the qubit ($n = 3$). A simple analytical proof of $\Omega_{\mathrm{hbit},n} < 3$ is presented in Appendix B.

A. Measure-and-prepare strategies
The analysis of the case of sequences of length two can be greatly simplified using measure-and-prepare instruments. These are instruments of the form $T_x = (f_{a|x}\sigma_{a|x})_a$, where $M_x = (f_{a|x})_a$ is a measurement and $(\sigma_{a|x})_a$ is a collection of states. Hence $T_x$ can be implemented by first measuring $M_x$ and then, depending on the outcome $a$, preparing the state $\sigma_{a|x}$. Now, for a sequence of length two, the correlations are given by
$$p(ab|xy) = \sum_\lambda q(\lambda)\, \omega^\lambda I^\lambda_{a|x} I^\lambda_{b|y}\, e,$$
where $\omega^\lambda$ is given by the initialization procedure of the individual finite-state machines participating in the mixture of machines. Clearly, the extremal values of $S$ can be achieved by a single finite-state machine and hence in the following we will omit the index $\lambda$ and the summation over $\lambda$.
The instruments $I_x$ can be replaced by measure-and-prepare instruments, by letting $f_{a|x} = I_{a|x} e$ and $\sigma_{a|x} = \omega I_{a|x} / \omega(f_{a|x})$ if the denominator is nonzero, or $\sigma_{a|x} = \omega$ otherwise. Then $p(ab|xy) = \omega f_{a|x}\, \sigma_{a|x} f_{b|y}$, i.e., the product of the two numbers $\omega f_{a|x}$ and $\sigma_{a|x} f_{b|y}$. Hence we can equivalently replace $I_{a|x}$ by the measure-and-prepare strategy $T_{a|x} = f_{a|x}\sigma_{a|x}$. Using this simplification, we obtain an expression for $S$, Eq. (16), where we used the notation $p(b|a; xy)$ for the probabilities conditioned on previous outputs.
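The replacement by measure-and-prepare instruments can be illustrated in the classical case, where $f_{a|x} = T(a|x)\eta$ and $\sigma_{a|x}$ is the normalized row vector $\pi^\dagger T(a|x)$. The following is a minimal sketch; the stochastic two-state machine and the function names `two_step` and `measure_and_prepare` are hypothetical and serve only to check that both computations of $p(ab|xy)$ agree.

```python
from itertools import product

def matvec(T, v):
    """Matrix times column vector."""
    return [sum(row[t] * v[t] for t in range(len(v))) for row in T]

def vecmat(v, T):
    """Row vector times matrix."""
    return [sum(v[s] * T[s][t] for s in range(len(v))) for t in range(len(T[0]))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def two_step(pi, T, a, b, x, y):
    """Direct evaluation: p(ab|xy) = pi^T T(a|x) T(b|y) eta."""
    eta = [1.0] * len(pi)
    return dot(pi, matvec(T[a, x], matvec(T[b, y], eta)))

def measure_and_prepare(pi, T, a, b, x, y):
    """p(ab|xy) = (omega f_{a|x}) * (sigma_{a|x} f_{b|y})."""
    eta = [1.0] * len(pi)
    f_ax = matvec(T[a, x], eta)  # effect f_{a|x} = T(a|x) eta
    f_by = matvec(T[b, y], eta)  # effect f_{b|y} = T(b|y) eta
    p_a = dot(pi, f_ax)          # omega f_{a|x}
    if p_a > 0:
        sigma = [v / p_a for v in vecmat(pi, T[a, x])]  # post-measurement state
    else:
        sigma = list(pi)         # arbitrary choice when the outcome never occurs
    return p_a * dot(sigma, f_by)

pi = [0.7, 0.3]
# Hypothetical stochastic machine: for each x and s, sum_a sum_s' T[a, x][s][s'] = 1.
T = {
    (0, 0): [[0.5, 0.0], [0.2, 0.1]],
    (1, 0): [[0.0, 0.5], [0.3, 0.4]],
    (0, 1): [[0.1, 0.3], [0.0, 1.0]],
    (1, 1): [[0.6, 0.0], [0.0, 0.0]],
}

max_diff = max(abs(two_step(pi, T, a, b, x, y) - measure_and_prepare(pi, T, a, b, x, y))
               for a, b, x, y in product((0, 1), repeat=4))
print(max_diff < 1e-12)  # True
```

The agreement holds only for sequences of length two: for longer sequences the measure-and-prepare replacement would discard the dependence of later states on the state before the first measurement.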

B. Analytical and numerical bounds
Since $S = 3$ cannot be reached with hbits, there must be a finite gap between the algebraic maximum $S = 3$ and the actual bound for cbits, qubits, and hbits with a Bloch sphere of fixed dimension. In fact, the sets of states and effects are compact, and the expression $S$ can be written as a continuous function from the set of states and effects into the interval $[0, 3]$, so its image must be compact. In this section, we explore in more detail the bounds for cbits, qubits, and hbits via numerical methods.

Classical bit
For the cbit case, we use the representation from Sec. III A, specifically, $\omega$ is represented by $(1, 0)$, $\sigma_{i|i}$ by $(s_i, 1 - s_i)$, and $f_{i|i}$ by $(a_i, b_i)^\dagger$, where $s_i, a_i, b_i \in [0, 1]$. Inserting this into Eq. (16), only $a_0$ and $a_1$ appear nonlinearly in the resulting expression. Therefore, the maximum of $S$ is attained when all remaining parameters are either 0 or 1. This leaves us with a two-dimensional, at most quadratic optimization, which can be performed at once. The maximal value $\Omega_{\mathrm{cbit}}$ of $S$ using classical bits occurs at a unique point, where $s_1 = b_1 = 0$, $b_0 = s_0 = a_1 = 1$, and $a_0 = \tfrac{1}{2}$. Hence, an optimal machine is given by the initial state $\pi^\dagger = (1, 0)$ and corresponding transition matrices $T(a|x)$. Note that while the solution for the chosen parametrization is unique, the transition matrices are not unique.

Quantum bit
For the qubit case, we can proceed similarly to Ref. [42]. First we note that in Eq. (16), the initial state $\omega$ can be replaced by a pure state, so that $\omega\colon X \mapsto \langle 0|X|0\rangle$. The expression $S$ can then be written in terms of effects $0 \le E_{i|i} \le \mathbb{1}$ and density operators $\sigma_0$ and $\sigma_1$. Since the latter occur only linearly in $S$, we can substitute them with pure states $|\psi_0\rangle$ and $|\psi_1\rangle$, respectively. The maximum of $S$ for qubits, $\Omega_{\mathrm{qubit}}$, is hence given by the maximization in Eq. (21) over $E_{0|0}$, $E_{1|1}$, $|\psi_0\rangle$, and $|\psi_1\rangle$. By parametrizing $E_{0|0}$, $E_{1|1}$, $|\psi_0\rangle$, $|\psi_1\rangle$ with real parameters, one can write the expression in Eq. (21) as a fourth-degree polynomial. This can be further simplified by taking $E_{0|0}$, $E_{1|1}$, $|\psi_0\rangle$, $|\psi_1\rangle$ as real expressions, which lowers the number of parameters to ten. The reduction to the real part of a qubit does not affect the optimality, as we show in the next section, see Eq. (27).
It is always possible to obtain a lower bound $\Omega^{\mathrm{feas}}_{\mathrm{qubit}}$ on $\Omega_{\mathrm{qubit}}$ by guessing appropriate values for the free parameters. An upper bound, $\Omega^{\mathrm{Lass}}_{\mathrm{qubit}}$, can be obtained via Lasserre's method [53] of polynomial optimization based on moment matrices and semidefinite programming (SDP) [54], which provides analytical upper bounds up to the numerical precision. That is,
$$\Omega^{\mathrm{feas}}_{\mathrm{qubit}} \le \Omega_{\mathrm{qubit}} \le \Omega^{\mathrm{Lass}}_{\mathrm{qubit}}.$$
With the simplifications used above, the upper and lower bounds coincide up to the numerical precision of $10^{-5}$. The resulting value, Eq. (23), shows a gap between the cbit and qubit case. A feasible solution is given in terms of the post-measurement states and the effects.

Hyperbit
For the case of hbits, and also the more general dichotomic norm cones, we use the parametrization $\omega\colon (t, x) \mapsto t + w^\dagger x$ and $\sigma_{i|i}\colon (t, x) \mapsto t + w_i^\dagger x$ for the states and $f_{i|i} = (t_i, f_i)$ for the effects, and insert it into Eq. (16). When maximizing $S$, we can eliminate the maximization over $w_0$ and $w_1$ by choosing appropriate vectors. The maximal value of $S$ for a given dichotomic norm cone is hence given by Eq. (27), where the constraints of the optimization are $|w|_* \le 1$ and $|f_i| \le \min\{t_i, 1 - t_i\}$. For the case of hbits, both $|\cdot|_*$ and $|\cdot|$ correspond to the $\ell_2$ norm, hence the conditions are invariant under orthogonal transformations, as is the case for the function to be maximized, which depends only on the norm of $f_i$ and the scalar products between $w$ and $f_i$. Since the only contribution for $w$ comes from the component in the span of $f_0$, $f_1$, the problem reduces to a two-dimensional one. This is equivalent to the qubit case with the Bloch ball restricted to the $xz$-plane, both for states and effects. This implies that the bound for hbits coincides with the bound for qubits. We thus obtain for hbits the same bound as in Eq. (23).

Generalized bit
The case of gbits differs from the previous one because we can actually reach $S = 3$ already for a two-state machine, namely the dichotomic norm cone with $n = 2$ and the $\ell_1$ norm. This model corresponds to the local part of a Popescu-Rohrlich box [39,51]. The space of effects is a polytope with extremal effects given by the extremal points of the two-dimensional $\ell_1$ unit ball, i.e., $a_{\pm i} = \tfrac{1}{2}(1, \pm e_i)$, with $e_i$ the canonical vectors in $\mathbb{R}^2$. Then, the states are the $\omega = (1, w)$ with $w$ in the square $[-1, 1] \times [-1, 1]$, i.e., the unit ball with respect to the $\ell_\infty$ norm. Suitable choices of the state and the effects yield, according to Eq. (27), the algebraic maximum for $S$, i.e., $S = 3$, for gbits and hence also for the set of all dichotomic norm cones with the same norm and arbitrary $n$.

C. Impossibility of simulating contextual correlations with general instruments on a qubit
In this section, we investigate whether qubit machines are able to simulate some contextual correlations that arise in higher dimensional quantum systems. In Ref. [20] it was proved that in order to simulate all deterministic predictions associated with the observables of the Peres-Mermin square [55,56], a classical machine with at least 4 states is necessary. This result was obtained in the framework of tests of contextuality involving sequential measurements [8], in which the relevant compatibility notion is given by the nondisturbance among compatible measurements and repeatability of outcomes, e.g., if $M_x$ and $M_y$ are compatible measurements in the measurement sequence $M_x M_y M_x$, the outcome for the first measurement of $M_x$ will be repeated in the second measurement of $M_x$.
We derive here a related result by showing that even a qubit is not sufficient to exhibit contextual correlations. For this we use a rather broad notion of contextuality. Consider a box with inputs from an alphabet $X$ and outputs from an alphabet $A$ as before. The input sequences are restricted such that a sequence $\vec{x}$ is admissible if and only if all inputs are from the same context $C \subset X$, i.e., $\{ x_i \mid i \} \subset C$. A context $C$ is a set of inputs, such that $p(\vec{a}|\vec{x}) = p[\pi(\vec{a})|\pi(\vec{x})]$ for any input sequence $\vec{x}$ from $C$, any output sequence $\vec{a}$, and any permutation $\pi$. In addition we assume that any input is repeatable, i.e., $p(\vec{a} b|\vec{x} x_i) = p(\vec{a}|\vec{x})\, \delta_{b, a_i}$ for any position $i$ in any admissible sequence.
Such a box is noncontextual, if all correlations of the box (using only admissible input sequences) can be reproduced by a box without memory, i.e., by a noncontextual model. We claim that any such box implemented on a qubit is noncontextual.
We start the proof of this statement by determining those inputs which cannot require the use of memory. First, if an input $z$ only ever produces the output $c$, within all admissible input sequences, then we can eliminate this input from our considerations. This is the case because in any sequence we can permute $z$ to the end of the sequence. Then $p(\vec{a} c|\vec{x} z) = p(\vec{a}|\vec{x})\, p(c|\vec{a}; \vec{x} z) = p(\vec{a}|\vec{x})$, where the first equality is due to Eq. (3) and the second due to the assumption that only the output $c$ ever occurs. Second, assume that for a certain input $z$, whenever it occurs in an admissible sequence, the internal state of the machine before the input $z$ is only ever the state $\rho$. Again we can eliminate this input from our considerations, because the output for $z$ and the state after the output can be determined without considering the state. Third, we can ignore the pathological cases of inputs which are not members of any context. In the following we assume without loss of generality that the box does not have any input falling under the three cases just discussed. Next, we show that for any input $z$ the instrument $(I_{c|z})_c$ must be a measure-and-prepare instrument of the form
$$I_{c|z}\colon X \mapsto |\psi_{c,z}\rangle\langle\psi_{c,z}|X|\psi_{c,z}\rangle\langle\psi_{c,z}| \quad \text{with } \langle\psi_{c,z}|\psi_{c,z}\rangle \in \{0, 1\}. \tag{33}$$
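This can be seen as follows. According to the assumptions, there are two input sequences $\vec{x} z$ and $\vec{y} z$ and corresponding output sequences $\vec{a} c$ and $\vec{b} c$, so that the state before the input $z$ is $\rho$ and $\rho'$, respectively, with $\rho \ne \rho'$. Using Eq. (3) and Eq. (12) we have
$$p(\vec{a} c|\vec{x} z)\, \delta_{c,c'} = p(\vec{a} c c'|\vec{x} z z) = p(\vec{a}|\vec{x})\, p(c c'|\vec{a}; \vec{x} z z) = p(\vec{a}|\vec{x}) \operatorname{tr}[\rho\, I_{c|z} I_{c'|z} \mathbb{1}] \tag{34}$$
and
$$p(\vec{b} c|\vec{y} z)\, \delta_{c,c'} = p(\vec{b} c c'|\vec{y} z z) = p(\vec{b}|\vec{y})\, p(c c'|\vec{b}; \vec{y} z z) = p(\vec{b}|\vec{y}) \operatorname{tr}[\rho'\, I_{c|z} I_{c'|z} \mathbb{1}], \tag{35}$$
where $p(\vec{a}|\vec{x}) > 0$ and $p(\vec{b}|\vec{y}) > 0$. Therefore, for $c \ne c'$,
$$\operatorname{tr}[\bar{\rho}\, I_{c|z} I_{c'|z} \mathbb{1}] = 0 \tag{36}$$
with $\bar{\rho} = (\rho + \rho')/2$. Since $\rho \ne \rho'$ and we assume a qubit system, the mixture $\bar{\rho}$ has necessarily rank two, i.e., $\bar{\rho} \ge \varepsilon \mathbb{1}$ for some $\varepsilon > 0$. We arrive at the condition
$$\sum_{i,j} \operatorname{tr}[(K_i Q_j)^\dagger K_i Q_j] = 0,$$
where $K_i$ and $Q_j$ are the Kraus operators associated, respectively, with the instruments $I_{c'|z}$ and $I_{c|z}$, e.g., $I_{c'|z} X = \sum_i K_i^\dagger X K_i$. Then $K_i Q_j = 0$ for all $i, j$. Similarly, exchanging $c$ with $c'$, we obtain $Q_j K_i = 0$ for all $i, j$. This implies that $K_i$ and $Q_j$ are of rank one and that $K_i$ is proportional to $K_{i'}$ as well as $Q_j$ being proportional to $Q_{j'}$, for all $i, i'$ and $j, j'$. Hence we can omit the indices $i, j$ and consider simply $K$ and $Q$. Note that from $\sum_c I_{c|z} \mathbb{1} = \mathbb{1}$, the condition $Q^\dagger Q \le \mathbb{1}$ follows, which allows us to write $Q = |\alpha\rangle\langle\beta|$ with $\langle\alpha|\alpha\rangle = 1$ and $\langle\beta|\beta\rangle \le 1$. Now, for $c = c'$ we obtain $\operatorname{tr}(\bar{\rho}\, I_{c|z} I_{c|z} \mathbb{1}) = \operatorname{tr}(\bar{\rho}\, I_{c|z} \mathbb{1})$, which implies $(Q^\dagger)^2 Q^2 = Q^\dagger Q$. It follows that either $|\beta\rangle = 0$ or $|\alpha\rangle$ and $|\beta\rangle$ are equal up to a phase and hence $I_{c|z}$ is as stated in Eq. (33). As a final step we need to show that there is no contextuality for projective qubit instruments. Given an admissible input sequence $\vec{x} y z$ and an output sequence $\vec{a} b c$ such that $p(\vec{a} b|\vec{x} y) > 0$, we have
$$p(\vec{a} b c|\vec{x} y z) = p(\vec{a} b|\vec{x} y)\, |\langle\psi_{b,y}|\psi_{c,z}\rangle|^2 \quad \text{and} \quad p(\vec{a} b c b|\vec{x} y z y) = p(\vec{a} b|\vec{x} y)\, |\langle\psi_{b,y}|\psi_{c,z}\rangle|^4.$$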
The left-hand sides of both expressions have to be equal, yielding $|\langle\psi_{b,y}|\psi_{c,z}\rangle| \in \{0,1\}$. Consequently, any two inputs within a context are realized by the same projective instrument, except for some relabeling of the outcomes. We choose a specific measurement within one context, say $y$, so that $\mathcal{I}_{a|x} = \sum_b \mathcal{I}_{b|y}\, f_b(a|x)$ with some coefficients $f_b(a|x) \in \{0,1\}$. In this way, any correlations of this context can be written in exactly the form of a one-state machine, i.e., a noncontextual model. This concludes the proof of our statement, due to the following observation. If two contexts share an observable, then our argument already applies: the union of both contexts must admit a noncontextual model, and hence the union of both contexts is again a context. Eventually, we can join contexts until all contexts are mutually disjoint. For each disjoint set we can construct a noncontextual model, and since there are no admissible sequences involving two different contexts, we have constructed a noncontextual model for all admissible input sequences.
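The two algebraic steps used in this proof can be checked numerically. The following is a minimal sketch in plain Python and is not part of the original argument; the helper names (`dagger`, `matmul`, `outer`, `inner`, `is_repeatable`, `sequence_factors`) and the test vectors are illustrative assumptions. It tests the condition $(Q^\dagger)^2 Q^2 = Q^\dagger Q$ for $Q = |\alpha\rangle\langle\beta|$, and it evaluates the two factors $t$ and $t^2$, with $t = |\langle\psi_{b,y}|\psi_{c,z}\rangle|^2$, that multiply $p(ab|xy)$ in the last two expressions of the proof.

```python
import cmath
import math

def dagger(m):
    """Conjugate transpose of a 2x2 matrix stored as nested lists."""
    return [[complex(m[j][i]).conjugate() for j in range(2)] for i in range(2)]

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def outer(ket, bra):
    """The operator |ket><bra| for two-component vectors."""
    return [[complex(ket[i]) * complex(bra[j]).conjugate() for j in range(2)]
            for i in range(2)]

def inner(u, v):
    """The scalar product <u|v>."""
    return sum(complex(x).conjugate() * complex(y) for x, y in zip(u, v))

def is_repeatable(alpha, beta, tol=1e-12):
    """Check (Q^dagger)^2 Q^2 == Q^dagger Q for Q = |alpha><beta|."""
    q = outer(alpha, beta)
    qd = dagger(q)
    lhs = matmul(matmul(qd, qd), matmul(q, q))
    rhs = matmul(qd, q)
    return all(abs(lhs[i][j] - rhs[i][j]) < tol
               for i in range(2) for j in range(2))

def sequence_factors(psi_b, psi_c):
    """Factors multiplying p(ab|xy) in p(abc|xyz) and in p(abcb|xyzy)."""
    t = abs(inner(psi_b, psi_c)) ** 2
    return t, t ** 2

if __name__ == "__main__":
    alpha = [1, 0]  # hypothetical normalized test vector
    print(is_repeatable(alpha, [cmath.exp(0.7j), 0]))  # beta equals alpha up to a phase
    print(is_repeatable(alpha, [0.6, 0.8]))            # normalized, but not parallel
    print(is_repeatable(alpha, [0.5, 0]))              # parallel, but subnormalized
    plus = [math.sqrt(0.5), math.sqrt(0.5)]
    print(sequence_factors([1, 0], plus))              # the two factors differ
    print(sequence_factors([1, 0], [0, 1]))            # orthogonal vectors
```

In this sketch the condition holds only when $|\beta\rangle$ vanishes or agrees with $|\alpha\rangle$ up to a phase, and the two sequence factors coincide only when the overlap is 0 or 1, in line with the argument above.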

V. CONCLUSIONS AND OUTLOOK
We introduced the notion of the memory cost of simulating temporal correlations, based on the notion of a finite-state machine, i.e., a physical system accepting an input at each time instant and generating an outcome and an internal state transition according to probabilistic rules. We investigated the correlations obtainable via such finite-state machines operating according to different probability theories, i.e., classical, quantum, or GPT. Our framework allows us to derive inequalities that discriminate among different theories for the simplest nontrivial case, i.e., two-state machines, two inputs, two outputs, and sequences of length two. Moreover, we investigated, from the perspective of quantum finite-state machines, the possibility of simulating contextual correlations with a qubit and answered this question in the negative.
Our framework provides a notion of nonclassicality for single systems, which is based solely on observed correlations and does not make any assumption about the type of measurements involved, e.g., compatibility or noninvasiveness. We believe that several problems in quantum foundations and quantum information could be studied in this framework. For instance, a notion of nonclassicality for single systems, i.e., quantum contextuality, has recently been suggested as a resource for quantum computation. On the other hand, memory has been identified as a resource needed to simulate contextual correlations classically [20,21]. In addition, a different notion of contextuality for sequential operations has been defined and connected to speed-up in quantum computation [57]. Our work could provide a general framework to discuss such different results and to better understand the connection between the memory cost of (classical) simulations, contextual correlations, and advantages in computation. Moreover, the idea of computation in GPTs that are intermediate between classical and quantum probability, such as Spekkens' toy model [58], has recently been investigated [59,60]. In particular, this GPT can be exactly simulated with two classical bits.
In quantum theory the set of effects is represented by Hermitian operators $F$ with $0 \leq F \leq \mathbb{1}$. This convex set has three characteristic properties. (i) It is a subset of the real vector space of Hermitian operators. (ii) There exists the special operator $\mathbb{1}$ representing the all-embracing effect. (iii) Its shape is given by the partial order $A \leq B$, which is defined by the condition that $B - A$ is positive semidefinite.
In a GPT, the notion of an effect is generalized by considering a straightforward generalization of those properties. We start with an arbitrary real vector space $V$ with a partial order $a \leq b$. This partial order has to be linear in the sense that $a \leq b$ implies $\lambda a \leq \lambda b$ for any $\lambda \in \mathbb{R}_+$, and $a \leq b$ implies $a + c \leq b + d$ if also $c \leq d$. This turns $(V, \leq)$ into an ordered vector space.
The all-embracing effect is a distinct element $e \in V$. It is required to dominate all of $V$, i.e., for any $x \in V$ there is a positive number $\lambda$ such that $x \leq \lambda e$. This property makes $e$ an order unit and $(V, \leq, e)$ an order unit vector space. In addition, it is convenient to assume that the order unit is Archimedean, i.e., if $x \leq \lambda e$ holds for all $\lambda > 0$, then already $x \leq 0$. In our paper we implicitly assume that any order unit is Archimedean.
It is sometimes convenient to let $V^+ = \{ x \in V \mid 0 \leq x \}$. Since $a \leq b$ is equivalent to $b - a \in V^+$, we can equivalently describe an Archimedean order unit (AOU) space by the tuple $(V, V^+, e)$. The effects in a GPT are now given by the set $V^+_e = V^+ \cap (e - V^+)$. A measurement $M$ in a GPT is represented by a collection of elements $M = (f_k)_k \subset V^+_e$ with $\sum_k f_k = e$, where the $f_k$ represent the outcomes of the measurement.
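As a concrete illustration, classical theory with $n$ perfectly distinguishable states fits this description with $V = \mathbb{R}^n$, $V^+$ the nonnegative orthant, and $e = (1, \dots, 1)$, so that effects are vectors with entries in $[0,1]$. The following minimal sketch covers only this classical special case; the function names `is_effect` and `is_measurement` are illustrative assumptions and do not appear in the paper.

```python
from fractions import Fraction

def is_effect(f):
    """Membership in V+ ∩ (e - V+) for V = R^n with the componentwise
    order and e = (1, ..., 1): every entry must lie between 0 and 1."""
    return all(0 <= x <= 1 for x in f)

def is_measurement(effects):
    """A collection (f_k) of effects of equal length whose sum is e."""
    if not effects or not all(is_effect(f) for f in effects):
        return False
    n = len(effects[0])
    if any(len(f) != n for f in effects):
        return False
    # Exact comparison is safe when entries are ints or Fractions.
    return all(sum(f[i] for f in effects) == 1 for i in range(n))

if __name__ == "__main__":
    half = Fraction(1, 2)
    f1 = [1, half, 0]   # hypothetical two-outcome measurement on a trit
    f2 = [0, half, 1]
    print(is_measurement([f1, f2]))
    print(is_effect([2, 0]))  # exceeds the order unit in the first entry
```

Exact rational arithmetic is used here so that the normalization $\sum_k f_k = e$ can be tested with equality rather than with a numerical tolerance.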
For the set of states, we note that in quantum theory one can represent a state $\rho$ equivalently by the linear map $\omega\colon X \mapsto \mathrm{tr}(\rho X)$. Then the normalization of $\rho$ becomes $\omega(\mathbb{1}) = 1$ and the condition $\rho \geq 0$ reads $\omega(X) \geq 0$ for all $X \geq 0$. By analogy, the set of states in a GPT is given by $S = \{ \omega \in V^* \mid \omega(e) = 1 \text{ and } \omega(f) \geq 0 \text{ for all } f \geq 0 \}$, where $V^* = \{ \varphi\colon V \to \mathbb{R} \mid \varphi \text{ is linear} \}$ is the dual space of $V$. With this definition, the probability for outcome $k$ of a measurement $M = (f_k)_k$ is given by $p_k = \omega(f_k)$.
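For the classical special case $V = \mathbb{R}^n$ with the componentwise order and $e = (1, \dots, 1)$, a linear functional $\omega$ is given by a weight vector $w$ via $\omega(f) = \sum_i w_i f_i$, and the two state conditions reduce to $w_i \geq 0$ and $\sum_i w_i = 1$. The sketch below, under these assumptions, evaluates $p_k = \omega(f_k)$; the function names and the example numbers are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def is_state(w):
    """State conditions for V = R^n: omega(f) >= 0 for all f >= 0 means
    w_i >= 0, and omega(e) = 1 means the weights sum to one."""
    return all(x >= 0 for x in w) and sum(w) == 1

def probabilities(w, effects):
    """Outcome probabilities p_k = omega(f_k) = sum_i w_i * (f_k)_i."""
    return [sum(wi * fi for wi, fi in zip(w, f)) for f in effects]

if __name__ == "__main__":
    w = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]  # hypothetical state
    f1 = [1, Fraction(1, 2), 0]                           # hypothetical effects
    f2 = [0, Fraction(1, 2), 1]
    print(is_state(w))
    p = probabilities(w, [f1, f2])
    print(p, sum(p))
```

Because the effects sum to $e$ and $\omega(e) = 1$, the resulting probabilities sum to one for any state.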