Fisher information of correlated stochastic processes

Many real-world tasks include some kind of parameter estimation, i.e., determination of a parameter encoded in a probability distribution. Often, such probability distributions arise from stochastic processes. For a stationary stochastic process with temporal correlations, the random variables that constitute it are identically distributed but not independent. This is the case, for instance, for quantum continuous measurements. In this paper we prove two fundamental results concerning the estimation of parameters encoded in a memoryful stochastic process. First, we show that for processes with finite Markov order, the Fisher information is always asymptotically linear in the number of outcomes, and determined by the conditional distribution of the process' Markov order. Second, we prove with suitable examples that correlations do not necessarily enhance the metrological precision. In fact, we show that unlike for entropic information quantities, in general nothing can be said about the sub- or super-additivity of the joint Fisher information, in the presence of correlations. We discuss how the type of correlations in the process affects the scaling. We then apply these results to the case of thermometry on a spin chain.


INTRODUCTION
In many experimentally meaningful scenarios, a quantity of interest cannot be accessed directly via a measurement, but only retrieved from a sample of data; this problem is commonly referred to as estimation. Examples arise in many disciplines: for instance, one might be interested in estimating financial parameters from the time series of stock prices, the lethality of epidemic diseases from real-world hospital data, or the Rabi frequency of a two-level atom. Estimation theory [1][2][3][4] provides a rigorous formal basis for this procedure, and gives ways to compute the maximum precision that can be attained for a certain parameter, given the probability distribution from which the sample is drawn. The variance $\sigma_\theta^2$ of an unbiased estimator (the inverse of the precision) for the parameter $\theta$ is subject to the Cramér-Rao bound [5,6],
$$\sigma_\theta^2 \geq \frac{1}{F_{1:N}(\theta)}, \tag{1}$$
where $F_{1:N}(\theta)$ is the Fisher information associated with $N$ outcomes. In the case of i.i.d. random variables, the Fisher information is additive, $F_{1:N}(\theta) = N F_1(\theta)$, and the Cramér-Rao bound reduces to its most well-known version, $\sigma_\theta^2 \geq 1/[N F_1(\theta)]$. However, this is not true when, as often happens in the real world, processes exhibit correlations, where past values influence the future behaviour, possibly up to a fixed number of steps.
* radaellm@tcd.ie † gtlandi@gmail.com ‡ quantum@felix-binder.net
Being able to quantify how many past outcomes have to be recorded to retrieve an accurate description of the future statistics is of crucial importance for data compression purposes, but also as a way to understand the structure and complexity of the underlying model [7][8][9]. In the context of discrete-time classical stochastic processes, the notion of memory length is expressed by the Markov order [10].
An important scenario featuring such correlated outcomes is that of quantum systems subject to continuous weak measurements [11,12]. This is commonplace in various experimental platforms, from quantum optics [13][14][15] to mesoscopic conductors [16]. It involves a stream of outcomes whose correlations crucially depend on the underlying quantum properties of the model. A famous example is the emission of light from quantum dots, which is known to present intermittencies [17] (called blinking) with very long-range correlations [18,19]. Several papers have recently analyzed the metrological properties of continuous measurements [20][21][22][23][24][25][26][27][28], aiming both at constructing useful estimation strategies and at establishing the ultimate bounds on precision.
The application of estimation theory to correlated processes is still largely unexplored, and a general framework is lacking, except for very specific parameter encodings, such as location parameters [29,30]. In this work, we specifically address this issue, and prove through suitable examples that the joint Fisher information is not in general sub-additive with respect to the marginal Fisher information; this fact sets the Fisher information apart from entropy-like information measures, for which sub-additivity is a fundamental requirement.
The main result of our work, summarized in Eq. (30), is a general decomposition of the Fisher information for a stationary stochastic process of finite Markov order $M$. This result showcases the interplay between the number of outcomes $N$ and the Markov order $M$. As we discuss in detail, this leads to two fundamental consequences. First, it shows that when $N \gg M$, the Fisher information will asymptotically always scale as
$$F_{1:N}(\theta) \simeq N f(\theta), \tag{2}$$
where $f(\theta)$ is the Fisher information rate, which is independent of $N$. We here report the precise form of $f(\theta)$, its conditional dependence on preceding samples, and the non-asymptotic residual. Second, Eq. (2) allows us to establish the sub- or super-additivity of $F_{1:N}$, by comparing $f$ with the Fisher information $F_1$ of a single outcome. In fact, we show below by means of examples that in general $f \lessgtr F_1$; that is, unlike entropic information quantities, the Fisher information does not satisfy sub-additivity. Correlations can therefore be either advantageous or deleterious. The impact of these correlations in concrete estimation scenarios will be discussed in detail below.
This article is organized as follows. In Section I, we review the basics of estimation theory, and introduce the notion of unbiased estimator and the Fisher information. In Section II we discuss the behaviour of the Fisher information in the presence of correlations, showing with suitable toy models that in general it can exhibit super-additive or sub-additive behaviour with respect to the individual Fisher information. In Section III we introduce the notion of Markov order, and present our main result: a decomposition of the joint Fisher information in a stochastic process of finite Markov order. We discuss the consequences of the decomposition for the Cramér-Rao bound.
To provide more intuition for our results, in Section III B, we discuss how the sign of the correlations affects the behaviour of the Fisher information, and we show that, for the specific class of estimators based on the sample mean, anti-correlated processes usually yield higher Fisher information. Finally, in Section IV we apply our results to the task of estimating temperature on an Ising spin chain, and show a direct connection between our decomposition of the Fisher information and the heat capacity per unit length (specific heat capacity) of the chain.

I. ESTIMATION THEORY
Consider the following problem. We are given a probability distribution $P_\theta(x_{1:N})$, describing the statistics of a sequence $X_{1:N} = X_1, \ldots, X_N$ (left- and right-inclusive notation will be assumed throughout this work) of $N$ random variables, and depending on a vector of real parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_d)$. Our aim is to estimate, in the most precise way possible, the vector $\theta$, by sampling from the probability distribution.
In many scenarios, the parameter vector $\theta$ reduces to a scalar, as in the case of thermometry that we will deal with below. Other times, $\theta$ is a genuine vector (multi-parameter estimation), for example of Cartesian coordinates, velocities, rotation or torsion parameters. In the following, we introduce the formalism of estimation theory in full generality, in a multi-parameter fashion.
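It is possible to show that the precision that can be achieved in the estimation of $\theta$ is upper-bounded by an intrinsic property of the probability distribution itself. An estimator $T$ is a function which, given as input a realization $x_{1:N} = \{x_1, x_2, \ldots, x_N\}$, outputs an estimate for the parameter vector:
$$T : x_{1:N} \mapsto T(x_{1:N}) \in \mathbb{R}^d.$$
The Fisher information matrix $F_{1:N}(\theta)$ is defined as [2,5,31]
$$[F_{1:N}(\theta)]_{ij} = \mathbb{E}\left[\frac{\partial \log P_\theta(X_{1:N})}{\partial \theta_i}\,\frac{\partial \log P_\theta(X_{1:N})}{\partial \theta_j}\right].$$
Since these derivatives have zero mean, this is the covariance between the first derivatives (with respect to the parameter vector components $\theta_i$ and $\theta_j$) of the logarithm of the probability distribution (i.e., up to a sign, the surprisal).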
The importance of the Fisher information is mainly due to its role in the expression of the Cramér-Rao bound, the ultimate limit to the precision of any estimate of the parameters. We define the bias of an estimator [3] as the average difference between the estimate of the parameter vector and the true value of the parameter vector itself:
$$b(\theta) = \mathbb{E}\left[T(X_{1:N})\right] - \theta.$$
An estimator is unbiased if its bias is equal to zero. Given an unbiased estimator $T$, we can define its covariance matrix $\sigma^2$ as the matrix [2]
$$\sigma^2_{ij} = \mathbb{E}\left[(T_i - \theta_i)(T_j - \theta_j)\right].$$
The Cramér-Rao bound then reads [6,32]
$$\sigma^2 \geq \frac{1}{F_{1:N}(\theta)}, \tag{7}$$
where $\frac{1}{F_{1:N}(\theta)}$ is the inverse of the Fisher information matrix, and the inequality means that the difference $\sigma^2 - \frac{1}{F_{1:N}(\theta)}$ is a positive semi-definite matrix. Positive semi-definiteness implies, in particular, that the variance of each $\theta_k$ will satisfy Eq. (1), with $F$ replaced by the $(k,k)$th component of the Fisher information matrix.

Figure 1: Sequential parameter estimation from correlated outcomes. The picture denotes a black box outputting outcomes $X_i$ that depend on some unknown parameter $\theta$. The outcomes, however, are not statistically independent, but instead have a finite Markov order $M$ (in this example $M = 2$). This encompasses various scenarios in both classical and quantum processes. For example, it may represent the output current of a quantum continuous measurement, or the outcomes of sequential local measurements in a quantum spin chain (Sec. IV). The goal of this paper is to understand how correlations in the outcomes affect the precision in estimating $\theta$.

II. FISHER INFORMATION IN PRESENCE OF CORRELATIONS
The simplest parameter estimation scenario is that of independent and identically distributed (i.i.d.) random variables, for which the joint probability distribution factorizes, $P_\theta(x_{1:N}) = \prod_{i=1}^{N} P_\theta(x_i)$. In this case the Fisher information matrix behaves additively,
$$F_{1:N}(\theta) = N F_1(\theta), \tag{9}$$
as proven in Appendix B, where $F_1(\theta)$ is the Fisher information matrix corresponding to the single-outcome probability distribution $P_\theta(X)$. The Cramér-Rao bound (Eq. 7) in this case expresses the variance in $\theta$ scaling inversely with $N$:
$$\sigma^2 \geq \frac{1}{N F_1(\theta)}.$$
In the presence of correlations the factorization of the joint probability distribution no longer holds. Instead, we define the conditional Fisher information matrix between two random variables $X_1$ and $X_2$ as
$$[F_{X_2|X_1}(\theta)]_{ij} = \mathbb{E}_{X_1, X_2}\left[\frac{\partial \log P_\theta(X_2|X_1)}{\partial \theta_i}\,\frac{\partial \log P_\theta(X_2|X_1)}{\partial \theta_j}\right].$$
It is possible to show [33] that the joint and the conditional Fisher information matrices between two random variables are connected by the relation
$$F_{X_1 X_2}(\theta) = F_{X_1}(\theta) + F_{X_2|X_1}(\theta),$$
which may serve as an alternative definition of $F_{X_2|X_1}(\theta)$. The single-parameter case was treated in Ref. [33]. A proof for the multi-parameter case can be found in Appendix B (we are not aware of this general proof in the literature). It directly follows that the joint Fisher information matrix of $N$ outcomes can be decomposed as [34]
$$F_{1:N}(\theta) = F_1(\theta) + \sum_{k=2}^{N} F_{k|1:k-1}(\theta), \tag{13}$$
valid also in the case of multi-parameter estimation. If the variables are independent, this reduces back to Eq. (9).
Until this point, our treatment has considered the case of multi-parameter estimation in full generality. From now on, for the sake of clarity we restrict our attention to single-parameter estimation, and we therefore think of the Fisher information as a scalar quantity. Our results can be straightforwardly generalized to a multi-parameter scenario.
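For the single-parameter case, the relation between the joint and the conditional Fisher information can be checked in a few lines. The following is only a sketch for discrete outcomes, assuming that derivatives and sums can be exchanged; the shorthand $\partial_\theta$ and the expansion steps are ours, not taken from Ref. [33] or Appendix B.

```latex
\begin{align}
F_{X_1 X_2}(\theta)
  &= \mathbb{E}\Big[\big(\partial_\theta \log P_\theta(X_1)
     + \partial_\theta \log P_\theta(X_2|X_1)\big)^2\Big] \\
  &= F_{X_1}(\theta) + F_{X_2|X_1}(\theta)
     + 2\sum_{x_1} \partial_\theta P_\theta(x_1)
        \sum_{x_2} \partial_\theta P_\theta(x_2|x_1) \\
  &= F_{X_1}(\theta) + F_{X_2|X_1}(\theta),
\end{align}
% The cross term vanishes because \sum_{x_2} P_\theta(x_2|x_1) = 1 for every x_1,
% so its derivative with respect to \theta is zero.
```

Iterating the same step on $X_{1:k-1}$ and $X_k$ gives one conditional term per additional outcome.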
A. Lack of sub-additivity for the Fisher information
In general, nothing can be said about the inequality
$$F_{X,Y}(\theta) \lessgtr F_X(\theta) + F_Y(\theta).$$
In this sense, the Fisher information does not behave as an entropy-like quantity, for which sub-additivity is a fundamental requirement [3]. To illustrate this idea, we consider a simple example based on Gaussian random variables $X, Y$, with identical mean $\mu$ and a covariance matrix $\sigma$ defined as
$$\sigma = \gamma_0 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$
for $\rho \in [-1, 1]$ and $\gamma_0 > 0$. The task at hand is then the estimation of the mean value $\mu$. The joint Fisher information for Gaussian random variables with mean vector $\vec{\mu}$ and covariance matrix $\sigma$ is well known, and given by [1]
$$[F(\theta)]_{ij} = \frac{\partial \vec{\mu}^{T}}{\partial \theta_i}\, \sigma^{-1}\, \frac{\partial \vec{\mu}}{\partial \theta_j} + \frac{1}{2}\,\mathrm{tr}\left(\sigma^{-1} \frac{\partial \sigma}{\partial \theta_i}\, \sigma^{-1} \frac{\partial \sigma}{\partial \theta_j}\right).$$
Applying this to the estimation of $\mu$ itself yields
$$F_{X,Y} = \frac{2}{\gamma_0 (1 + \rho)}.$$
The marginal Fisher information, on the other hand, is given by
$$F_X = F_Y = \frac{1}{\gamma_0},$$
whence
$$F_{X,Y} = \frac{F_X + F_Y}{1 + \rho}. \tag{18}$$
It is therefore clear that we obtain a super-additive behaviour ($F_{X,Y} > F_X + F_Y$) when the variables are anti-correlated, $\rho < 0$, and a sub-additive behaviour ($F_{X,Y} < F_X + F_Y$) when they are correlated, $\rho > 0$. For instance, if $X$ and $Y$ are perfectly correlated ($\rho = 1$), we get $F_{X,Y} = F_X$: we learn nothing from $Y$ that we had not already learned from $X$. The fact that we can get super-additive behaviour for $\rho < 0$, though, is not at all intuitive. We will revisit this in Sec. III B, where we show that it happens because for anti-correlated variables the errors tend to cancel out, while in the correlated case they tend to add up.
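The Gaussian example can be reproduced numerically. The sketch below assumes that the covariance matrix has the form $\gamma_0$ times the matrix with unit diagonal and off-diagonal entries $\rho$, and uses the standard result that the Fisher information for a common mean is $\vec{1}^T \sigma^{-1} \vec{1}$; the function names are illustrative.

```python
import numpy as np

def gaussian_mean_fisher(cov):
    """Fisher information for a common mean of a Gaussian vector with fixed covariance.

    Only the mean depends on the parameter, so F = 1^T cov^{-1} 1.
    """
    ones = np.ones(cov.shape[0])
    return float(ones @ np.linalg.solve(cov, ones))

def additivity_ratio(rho, gamma0=1.0):
    """Return F_{X,Y} / (F_X + F_Y) for two Gaussians with correlation rho."""
    # Assumed form of the covariance matrix: gamma0 * [[1, rho], [rho, 1]].
    cov = gamma0 * np.array([[1.0, rho], [rho, 1.0]])
    f_joint = gaussian_mean_fisher(cov)
    f_marginal = 1.0 / gamma0  # Fisher information of a single outcome
    return f_joint / (2.0 * f_marginal)

if __name__ == "__main__":
    for rho in (-0.5, 0.0, 0.5):
        # A ratio above 1 is super-additive, below 1 is sub-additive.
        print(f"rho = {rho:+.1f}  ratio = {additivity_ratio(rho):.4f}")
```

Under these assumptions the ratio equals $1/(1+\rho)$, independently of $\gamma_0$.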

III. THE MARKOV ORDER
Let us consider a family of stochastic processes $\ldots, X_0, X_1, \ldots, X_N, \ldots = \overleftrightarrow{X}$, fully represented by the overall joint parameter-dependent probability distribution $P_\theta(\overleftrightarrow{x})$. Here, we idealize stochastic processes as bi-infinite. A process is stationary [3] if any marginal of the overall joint probability distribution is translationally invariant, i.e.,
$$P_\theta(X_{k:m}) = P_\theta(X_{k+l:m+l})$$
for any $k, m, l \in \mathbb{Z}$, $m \geq k$. Given a stationary stochastic process, one can define its Markov order [10] as
$$M = \min\left\{ m : P_\theta(X_1 | X_{-\infty:0}) = P_\theta(X_1 | X_{-m+1:0}) \right\}, \tag{20}$$
where $X_{-\infty:0} = \overleftarrow{X}$ denotes the left-semi-infinite string of random variables in the stochastic process. The concept is illustrated in Figure 1. The definition of the Markov order implies the possibility to truncate any conditioning in the probabilities to a fixed number of steps in the past, representing the memory of the system [35] (see also Eq. 28 below):
$$P_\theta(\overrightarrow{X} | \overleftarrow{X}) = P_\theta(\overrightarrow{X} | X_{-M+1:0}),$$
where $\overrightarrow{X} = X_{1:+\infty}$. For example, in a process of Markov order $M = 2$, all knowledge about the future is encoded in the two 'most recent' random variables in the past:
$$P_\theta(\overrightarrow{X} | \overleftarrow{X}) = P_\theta(\overrightarrow{X} | X_{-1}, X_0).$$
Markov order $M = 0$ represents independent random variables, $P_\theta(X_1 | \overleftarrow{X}) = P_\theta(X_1)$. Processes yielding Markov order $M \leq 1$ are customarily called Markov processes.
It is worth emphasizing that the Markov order only establishes conditional independence. Unconditionally, all random variables in the chain are correlated (unless $M = 0$). These correlations, however, decay exponentially with their distance for all processes with finite Markov order. For instance, in the case of Markov order $M = 1$ the process forms a Markov chain specified by the transition probabilities $Q_{xy} = P(X_{i+1} = x | X_i = y)$. For any two random variables $X_i$ and $X_j$, one will then have that (assuming $j > i$)
$$P(X_j = x | X_i = y) = \left(Q^{j-i}\right)_{xy}.$$
Since $Q$ is a stochastic matrix, it has an eigenvalue equal to one, associated with the stationary distribution, while all its other eigenvalues have absolute value at most one (strictly smaller than one for an ergodic chain). Thus, the correlations between $X_i$ and $X_j$ will decay exponentially with $|j - i|$. The same reasoning also holds for processes having $M > 1$, provided that $|j - i| > M$.
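The exponential decay of correlations can be made concrete for a two-state Markov chain. The sketch below uses hypothetical flip probabilities `a` and `b` (they are not taken from the paper), builds the column-stochastic matrix $Q_{xy} = P(X_{i+1} = x | X_i = y)$, and compares the covariance at lag $k$ with the closed form $\pi_0 \pi_1 \lambda^k$, where $\lambda = 1 - a - b$ is the second eigenvalue of $Q$.

```python
import numpy as np

def stationary(Q):
    """Stationary distribution: eigenvector of Q with eigenvalue 1, normalized to sum 1."""
    vals, vecs = np.linalg.eig(Q)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

def covariance(Q, k):
    """Cov(X_i, X_{i+k}) for outcomes in {0, 1} of a stationary two-state chain."""
    pi = stationary(Q)
    Qk = np.linalg.matrix_power(Q, k)
    joint_11 = Qk[1, 1] * pi[1]  # P(X_i = 1, X_{i+k} = 1)
    return joint_11 - pi[1] ** 2

# Hypothetical flip probabilities: a = P(0 -> 1), b = P(1 -> 0).
a, b = 0.3, 0.1
Q = np.array([[1 - a, b],
              [a, 1 - b]])
lam = 1 - a - b  # second eigenvalue of Q; the first is 1

if __name__ == "__main__":
    pi = stationary(Q)
    for k in range(1, 6):
        print(k, covariance(Q, k), pi[0] * pi[1] * lam ** k)
```

The two printed columns should agree: the lag-$k$ covariance is controlled entirely by the sub-unit eigenvalue.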
The interplay between conditional and unconditional independence is subtle, and often counterintuitive. Some non-trivial examples are discussed in Appendix F. We also mention that, in general the Markov order is a structural property of a stochastic process (or topological in the language of computational mechanics [36]), independent of the precise choice of the parameters; other quantities like the (covariance-based) correlation length, on the contrary, usually exhibit a significant dependence on the parameters.
Alongside the probability-theoretic approach (20), it is equivalently possible to define the Markov order from an information-theoretic perspective as [37]
$$M = \min\left\{ m : H(X_{1:m}) = E + m\,h \right\}. \tag{25}$$
Here, $H$ denotes the Shannon entropy of the distribution, while $h$ is the entropy rate, defined as
$$h = \lim_{N \to \infty} \frac{H(X_{1:N})}{N}.$$
Finally, the excess entropy $E$ is given by the mutual information between past and future,
$$E = I(\overleftarrow{X} ; \overrightarrow{X}).$$
It follows that, for a process of finite Markov order $M$, the memory renders past and future conditionally independent:
$$I(X_{-\infty:-M} ; \overrightarrow{X} \,|\, X_{-M+1:0}) = 0, \tag{28}$$
meaning that no information about the future is retained in the past beyond the Markov order. The relation between the information-theoretic and the probability-theoretic versions of the Markov order is discussed in Appendix A.
For a stationary process of finite Markov order $M$ we now consider the decomposition (13) of the Fisher information $F_{1:N}$ for a string of $N \geq M$ outcomes. In this case, all the conditional probabilities can be truncated at length $M$, and the same will happen to the conditional Fisher information. We are therefore left with
$$F_{1:N}(\theta) = F_{1:M}(\theta) + \sum_{k=M+1}^{N} F_{k|k-M:k-1}(\theta),$$
but, since the process is stationary, all the terms of the summation yield the same contribution, and, using (13), we can directly write
$$F_{1:N}(\theta) = F_{1:M}(\theta) + (N - M)\, F_{M+1|1:M}(\theta). \tag{30}$$
This is our main result. In the remainder of this section, we showcase its significance by discussing some of its main consequences. It is natural to write Eq. (30) in the spirit of (25), as
$$F_{1:N}(\theta) = N f(\theta) + \epsilon(\theta), \tag{31}$$
defining the Fisher information rate $f(\theta)$ as
$$f(\theta) = F_{M+1|1:M}(\theta)$$
and the excess Fisher information $\epsilon(\theta)$ as
$$\epsilon(\theta) = F_{1:M}(\theta) - M\, F_{M+1|1:M}(\theta).$$
This expression leads to a clear analogy with (25); however, it is remarkable that, as already shown in the example of the two Gaussian random variables (in Sec. II), $\epsilon$ can be either positive or negative, while the excess entropy $E$ is always non-negative. To provide a concrete example, consider the case $M = 1$. We then get
$$f = F_{2|1} = F_{12} - F_1, \qquad \epsilon = F_1 - F_{2|1} = 2F_1 - F_{12}.$$
We must always have $F_{12} \geq F_1$, since extra information can only improve the estimation.
However, $\epsilon$ does not have a well-defined sign because
$$F_{12} \lessgtr 2F_1.$$
Exploiting the decomposition (30), the Cramér-Rao bound (7) can be expressed as
$$\sigma_\theta^2 \geq \frac{1}{F_{1:N}(\theta)} = \frac{1}{N f(\theta) + \epsilon(\theta)} \simeq \frac{1}{N f(\theta)}, \tag{36}$$
where the last approximation is valid in the large-$N$ limit. Hence, for a process of finite Markov order, the maximum precision of the measurement, given by the Cramér-Rao bound, grows linearly with the number of measurements, in the limit of a large number of measurements.
Since a very large class of stochastic processes can be assumed to have a finite (possibly large) Markov order, this means that for sufficiently large $N$ the precision will eventually scale linearly with $N$. The remaining question, then, is how $F_{M+1|1:M}$ behaves. More concretely, we can aim to compare it with $F_1$, which would be the Fisher information rate that one would obtain if the outcomes were independent. Eq. (18), however, has already provided one illustration that nothing can in general be said about this. If $F_{M+1|1:M} > F_1$, then having correlations is advantageous for the estimation, while if $F_{M+1|1:M} < F_1$ they are deleterious. We therefore now turn to additional examples, which aim to pinpoint which ingredients might lead to $F_{M+1|1:M} \lessgtr F_1$.
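The linear decomposition can be checked by brute force on a small example. The sketch below assumes that, for Markov order $M = 1$, the decomposition reads $F_{1:N} = F_1 + (N-1)\,F_{2|1}$ with $F_{2|1} = F_{1:2} - F_1$. The $\theta$-dependent flip probabilities (`theta` and `theta ** 2`) are a hypothetical choice made only for illustration; they are not the models used in the paper. Derivatives are taken by central finite differences.

```python
import itertools
import numpy as np

def transition(theta):
    """Column-stochastic matrix Q[x, y] = P(next = x | current = y) (hypothetical model)."""
    a, b = theta, theta ** 2  # assumed flip probabilities 0 -> 1 and 1 -> 0
    return np.array([[1 - a, b],
                     [a, 1 - b]])

def joint(theta, n):
    """Probabilities of all 2^n strings of a stationary two-state Markov chain."""
    Q = transition(theta)
    a, b = Q[1, 0], Q[0, 1]
    pi = np.array([b, a]) / (a + b)  # stationary distribution
    probs = []
    for x in itertools.product((0, 1), repeat=n):
        p = pi[x[0]]
        for prev, nxt in zip(x, x[1:]):
            p *= Q[nxt, prev]
        probs.append(p)
    return np.array(probs)

def fisher(theta, n, h=1e-5):
    """Fisher information of n consecutive outcomes via a central finite difference."""
    p = joint(theta, n)
    dp = (joint(theta + h, n) - joint(theta - h, n)) / (2 * h)
    return float(np.sum(dp ** 2 / p))

if __name__ == "__main__":
    theta, n = 0.4, 6
    f1, f12 = fisher(theta, 1), fisher(theta, 2)
    rate = f12 - f1    # Fisher information rate f = F_{2|1}
    excess = f1 - rate  # excess Fisher information
    print("brute force F_1:N :", fisher(theta, n))
    print("N f + excess      :", n * rate + excess)
```

Only the one- and two-outcome Fisher informations are needed to predict the value for any longer string, which is the practical content of the decomposition for $M = 1$.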
A. An example of super- and sub-additive behaviours
We illustrate the above results with minimal two-state models having Markov order $M = 1$. For instance, a model having a sub-additive Fisher information is one given by the transition matrix: Conversely, a model having a super-additive Fisher information is that given by The sub- or super-additivity of the Fisher information is quantified by the sign of the difference $F_{1:2} - 2F_1$, or equivalently $f - F_1$, represented for both processes in the left-most plot of Figure 2 as a function of the parameter $\theta$.
The relevance of the sub-or super-additive behaviour becomes apparent if we consider an actual estimation task for these two models. It is well-known [3,38] that the Maximum Likelihood Estimator always asymptotically attains the Cramér-Rao precision bound (also in the case of correlated variables, under adequate regularity conditions, see Appendix C); in Figure 2 we compare the Mean Squared Error given by the MLE estimation on the models with our asymptotic bound (36).
The plots show two relevant facts: first, the convergence of the MSE of the MLE estimations to the asymptotic Cramér-Rao bound (36) is verified; second, such a bound can be either much larger or much smaller than the i.i.d. Cramér-Rao bound for the specific case: once more, the Fisher information behaves sometimes super- and sometimes sub-additively.
The existence of processes yielding a sub-additive Fisher information could raise the question of whether it is possible to circumvent the bound by employing some estimator which, by ignoring the correlations, achieves the i.i.d. Cramér-Rao bound. It turns out that this is not the case, as discussed in more detail in Appendix D: the Fisher information (and consequently the Cramér-Rao bound) is an intrinsic property of the process, and cannot be circumvented by any choice of estimator.
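The convergence of the maximum likelihood estimator to the asymptotic bound can be illustrated on a deliberately simple chain. The model below is a hypothetical one chosen because its MLE has a closed form, and is not one of the two models of Figure 2: a symmetric two-state chain that flips with probability $\theta$ at each step. Its stationary distribution is uniform, so $F_1 = 0$, while the rate is $f = 1/[\theta(1-\theta)]$; the MLE is the observed fraction of flips.

```python
import numpy as np

def simulate_chain(theta, n, rng):
    """Sample n outcomes of a symmetric two-state chain that flips with probability theta."""
    flips = rng.random(n - 1) < theta
    x0 = rng.integers(0, 2)  # uniform stationary start
    return np.concatenate(([x0], (x0 + np.cumsum(flips)) % 2))

def mle_flip_probability(x):
    """Maximum likelihood estimate of theta: the fraction of observed flips."""
    return float(np.mean(x[1:] != x[:-1]))

def mse_of_mle(theta, n, runs, seed=0):
    """Mean squared error of the MLE, averaged over independent simulated chains."""
    rng = np.random.default_rng(seed)
    estimates = [mle_flip_probability(simulate_chain(theta, n, rng)) for _ in range(runs)]
    return float(np.mean((np.array(estimates) - theta) ** 2))

def asymptotic_bound(theta, n):
    """Asymptotic Cramér-Rao bound 1 / (N f) with f = 1 / (theta (1 - theta))."""
    f = 1.0 / (theta * (1.0 - theta))
    return 1.0 / (n * f)

if __name__ == "__main__":
    theta = 0.3
    for n in (20, 200, 2000):
        print(n, mse_of_mle(theta, n, runs=500), asymptotic_bound(theta, n))
```

In this toy model a single outcome carries no information about $\theta$, so an estimator that ignores correlations cannot estimate $\theta$ at all; all the information sits in the transitions.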

B. Influence of the sign of correlations in the sample mean
Eq. (18) provided an example where the sign of the correlations ρ ≶ 0 determined whether correlations are advantageous or not for the estimation process. When considering a general estimator, it unfortunately turns out that a direct link cannot be established; however, an interesting insight can be gained by restricting our attention to a specific subclass of estimators, namely those based on the sample mean. The employment of the sample mean usually represents the simplest choice when asked to estimate something based on a stochastic process.
Let us consider a $\theta$-dependent stationary stochastic process $X_1, \ldots, X_N$; its sample mean is given by
$$Y = \frac{1}{N} \sum_{i=1}^{N} X_i. \tag{39}$$
Let $\mu = \mathbb{E}[X_i]$ be the local expectation value, and $\sigma^2 = \mathrm{Var}[X_i]$ the local variance. Due to the stationarity requirement, the covariance between two random variables $X_i$ and $X_j$ depends only on their distance; let
$$C_{|i-j|} = \mathrm{Cov}(X_i, X_j).$$
We refer to $C_i > 0$ as positively correlated and $C_i < 0$ as negatively correlated.
The sample mean $Y$ has expectation value $\mathbb{E}[Y] = \mu$. Its variance, on the other hand, reads
$$\mathrm{Var}[Y] = \frac{1}{N^2} \sum_{i,j=1}^{N} \mathrm{Cov}(X_i, X_j) = \frac{1}{N}\left[\sigma^2 + 2\sum_{i=1}^{N-1}\left(1 - \frac{i}{N}\right) C_i\right],$$
where we used the relation
$$\sum_{i \neq j} C_{|i-j|} = 2\sum_{k=1}^{N-1} (N - k)\, C_k.$$
Under appropriate hypotheses (discussed in Appendix E), a generalization of the Central Limit Theorem for correlated variables guarantees asymptotic normality of the distribution of $Y$, with average $\mu$ and variance $\nu$, given by
$$\nu = \frac{1}{N}\left(\sigma^2 + 2\sum_{i=1}^{\infty} C_i\right).$$
Under this normality assumption, the Fisher information of the sample mean $Y$ will be
$$F_Y = N\,\frac{(\partial_\theta \mu)^2}{v} + \frac{1}{2}\left(\frac{\partial_\theta v}{v}\right)^2,$$
where $v = N\nu$. Notice that the second term is negligible in the $N \to \infty$ limit, hence we finally get a Fisher information
$$F_Y \simeq \frac{N\,(\partial_\theta \mu)^2}{\sigma^2 + 2\sum_{i=1}^{\infty} C_i}. \tag{44}$$
This result, which also appears in Ref. [39] in the context of super-resolution imaging theory, clearly illustrates the role that the sign of $C_i$ has on the Fisher information. Positively correlated variables ($C_i > 0$) increase the denominator and hence decrease $F_Y$, while negatively correlated $C_i$ do the opposite. For estimation, it is thus advantageous to have negatively correlated outcomes. To understand the intuition behind this, consider Eq. (39) and write $X_i = \mu + \delta X_i$, where $\delta X_i$ represents the fluctuations around the mean. Then
$$Y - \mu = \frac{1}{N}\sum_{i=1}^{N} \delta X_i. \tag{45}$$

Figure 2: Two toy-model estimation tasks using the Maximum Likelihood Estimator show the convergence of the expectation value of the MSE, obtained by averaging over 50 iterations of the simulation, to the asymptotic behaviour given by (36). The expressions of the conditional probabilities yielding these processes are in the main text, and the left-most plot shows the corresponding sub- or super-additive behaviours of the Fisher information. The semi-transparent lines represent the MSE of single runs of the simulation. Notice that the bound is achieved only asymptotically, hence it is possible to have, for finite $N$, an MSE lower than the bound.
If two variables $\delta X_i$ and $\delta X_j$ are negatively correlated, then an outcome $\delta X_i > 0$ implies there is a tendency to observe $\delta X_j < 0$. As a consequence, errors tend to cancel out in Eq. (45). Conversely, for positively correlated variables, the errors tend to add up. Since $F_Y$ is the Fisher information for a restricted set of estimators, it must necessarily lower bound the full Fisher information,
$$F_Y \leq F_{1:N}.$$
Using (30) we have therefore that, in the $N \to \infty$ limit,
$$F_{M+1|1:M} \geq \frac{(\partial_\theta \mu)^2}{\sigma^2 + 2\sum_{i=1}^{\infty} C_i}.$$
This provides a fundamental bound for the asymptotic conditional Fisher information, clearly showcasing the effects of correlations.
These ideas are illustrated in Figure 3, which analyzes the estimation of the mean $\mu$ of a Gaussian process, with tunable nearest-neighbor correlations given by the parameter $\rho \in [0, 1]$. Details on how to construct the process are discussed in Appendix F. The left panel shows a single stochastic realization using the sample mean as an estimator. As can be seen, positively and negatively correlated samples seem to converge at a somewhat similar rate to the true parameter, while uncorrelated samples seem to converge more slowly. The impression from this single realization, however, is deceptive. In the right panel of Fig. 3 we show the mean squared error (MSE) averaged over multiple realizations. It is now quite clear that negatively correlated samples yield much more precise estimates when compared to the uncorrelated case (notice the log scale), in agreement with the predictions of Eq. (44). Conversely, positively correlated samples yield a clear and sizable disadvantage.
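The effect of the sign of the correlations on the variance of the sample mean can be verified directly. The sketch below assumes the finite-$N$ expression $\mathrm{Var}[Y] = \frac{1}{N}\big[\sigma^2 + 2\sum_{i=1}^{N-1}(1 - i/N)\,C_i\big]$ and compares it with the direct evaluation $\vec{1}^T \Sigma\, \vec{1} / N^2$ on the full covariance matrix. The geometric covariance profile $C_i = \sigma^2 r^i$ is a hypothetical choice for illustration, not the process of Figure 3.

```python
import numpy as np

def sample_mean_variance(sigma2, c, n):
    """Variance of the sample mean from the lag covariances, with c[i-1] = C_i."""
    i = np.arange(1, n)
    return (sigma2 + 2.0 * np.sum((1.0 - i / n) * c[: n - 1])) / n

def sample_mean_variance_direct(sigma2, c, n):
    """Same quantity from the full Toeplitz covariance matrix: 1^T Sigma 1 / n^2."""
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    full = np.concatenate(([sigma2], c[: n - 1]))  # full[k] = covariance at lag k
    cov = full[lags]
    ones = np.ones(n)
    return float(ones @ cov @ ones) / n ** 2

def geometric_covariances(sigma2, r, n):
    """Hypothetical covariance profile C_i = sigma2 * r**i for i = 1..n-1."""
    return sigma2 * r ** np.arange(1, n)

if __name__ == "__main__":
    n, sigma2 = 40, 1.0
    for r in (-0.5, 0.0, 0.5):
        c = geometric_covariances(sigma2, r, n)
        print(r, sample_mean_variance(sigma2, c, n), sigma2 / n)
```

Negative `r` pushes the variance below the i.i.d. value $\sigma^2/N$ and positive `r` pushes it above, matching the cancellation argument in the text.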

IV. APPLICATION TO THERMOMETRY OF SPIN CHAINS
In this section, we apply the methods detailed above to the task of temperature estimation on an Ising spin chain [40], consisting of $N$ spin-1/2 particles, with a Hamiltonian $\hat{H}$, Eq. (48), in which $\sigma^z_j$ are Pauli matrices, $B$ is the external magnetic field, and $J_k$ is a $k$-distance coupling strength. The system is prepared in a thermal state
$$\hat{\rho} = \frac{e^{-\hat{H}/T}}{Z},$$
whose temperature $T$ we wish to estimate (i.e., it plays the role of the parameter $\theta$), and $Z$ is the partition function (we set the Boltzmann constant to $k_B = 1$). Due to the structure of the Gibbs state $\hat{\rho}$, the best strategy for thermometry is always to measure the system in the energy basis [41,42]. Eq. (48) is already diagonal, and thus a projective energy measurement is tantamount to $N$ local measurements in the basis of $\sigma^z_j$. The outcome is a string $s = (s_1, \ldots, s_N)$, with $s_j = \pm 1$, which occurs with probability
$$p(s) = \frac{e^{-E(s)/T}}{Z}, \tag{50}$$
where $E(s)$ is the energy of the configuration $s$, Eq. (51). Because of the interactions, the outcomes of each site are not statistically independent. Hence $p(s)$ forms a Markov chain, precisely of the form described in the previous section.
For intuition, one may imagine that the spins are measured sequentially, from left to right as in Fig. 1. In this way, subsequent measurements (each yielding the current state of the measured spin) will progressively form a stochastic process. To apply the definition of the Markov order (20) to the spin chain, one first has to be able to marginalize the chain's thermal state (50) in order to express probability distributions for configurations of finite number of spins. The procedure for doing so is based on the well-known transfer matrix method [43] and is detailed in [44,45]. In Appendix G, we briefly summarize the main results.
The problem of the Markov order of spin chains has been discussed in Ref. [10]. Let us first define the interaction range $R$ of a spin chain as
$$R = \max\{ k : J_k \neq 0 \}.$$
For instance, for nearest-neighbor interactions $J_1 \neq 0$, while $J_s = 0$ for $s > 1$, which leads to $R = 1$. Naively, one might expect that $R = M$. However, this is not always true. In [44] it was implicitly proved that $R \geq M$ for a completely general spin chain. Despite its intuitiveness, it is not clear whether the equality holds in all cases (as already noted in Ref. [45]). For example, obvious exceptions appear for $T = 0$ and $1/T = 0$; however, these extremal, singular points bear no mathematical relevance in this context, where $T$ is the parameter to be estimated. We leave a general proof of the full conditions under which $R = M$ as an open question. A review of the discussion about the Markov order of classical spin chains can be found in Appendix H. Once we have determined the Markov order of a specific spin chain, for any number of consecutive spins larger than the Markov order the Fisher information scales according to (30).
If a system is in a thermal state, expressed by a Gibbs distribution (50), then the well-known relation between the thermal Fisher information and the heat capacity $C$ holds [41]:
$$F(T) = \frac{C}{T^2},$$
where $C$ is defined as
$$C = \frac{\partial \langle \hat{H} \rangle}{\partial T}.$$
It is important to remark that this relation holds exactly only if we take into account the whole chain, since the marginals of a Gibbs distribution are not, in general, of Gibbs form. However, the Fisher information decomposition (30) allows us to relate directly the heat capacity per unit length (i.e., the specific heat capacity) to the Fisher information of suitable marginals. Let us recall the decomposition of the Fisher information (Eq. (31)); defining the specific heat capacity as
$$c = \lim_{N \to \infty} \frac{C(N)}{N},$$
where $C(N)$ denotes the heat capacity for a chain of length $N$, we have
$$\frac{C(N)}{N} = \frac{T^2 F_{1:N}(T)}{N} = T^2\left[f(T) + \frac{\epsilon(T)}{N}\right].$$
Taking the $N \to \infty$ limit, since neither $f$ nor $\epsilon$ depends on $N$, we then obtain
$$c = T^2 f(T) = T^2 F_{M+1|1:M}(T).$$
Spin chains constitute an extremely interesting playground for our discussion: already in their simplest version, the nearest-neighbor Ising chain, they exhibit a surprising richness of features. This nearest-neighbor model requires a single parameter to encode the coupling strength; we will call it $J$ instead of $J_1$, for the sake of conciseness. Given the decomposition (36), for $M = 1$ all the joint probabilities involving more than two spins are fully determined by the one-site and two-site probabilities. Hence, it makes sense to discuss the relation between the two-site joint Fisher information $F_{1:2}(T)$ and the one-site marginal Fisher information $F_1(T)$. In the same spirit as the Gaussian example in Section II, we can discuss the inequality
$$F_{1:2}(T) \lessgtr 2F_1(T),$$
or, equivalently, $F_{2|1} \lessgtr F_1$ (since $F_{1:2} = F_1 + F_{2|1}$ in this case). Let us define the ratio
$$\xi = \frac{F_{1:2}(T)}{2F_1(T)}.$$
Super-additive behaviour corresponds to $\xi > 1$, and sub-additive to $\xi < 1$; independent random variables yield a perfectly additive behaviour with $\xi = 1$. The value of $\xi$ for a wide choice of parameters is represented in Figure 4.
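The relation between the thermal Fisher information of the whole chain and its heat capacity can be checked by exhaustive enumeration on a short chain. The sketch below assumes a nearest-neighbor open chain with the sign convention $E(s) = -J\sum_j s_j s_{j+1} - B\sum_j s_j$, which is our assumption and may differ from the convention of Eq. (48); the relation $F = C/T^2$ itself does not depend on that choice. The heat capacity is computed from energy fluctuations, $C = \mathrm{Var}(E)/T^2$ with $k_B = 1$.

```python
import itertools
import numpy as np

def energies(n, J, B):
    """Energies of all 2^n configurations of an open chain (assumed sign convention)."""
    configs = np.array(list(itertools.product((-1, 1), repeat=n)))
    coupling = np.sum(configs[:, :-1] * configs[:, 1:], axis=1)
    return -J * coupling - B * np.sum(configs, axis=1)

def gibbs(E, T):
    """Gibbs probabilities exp(-E/T)/Z, shifted by min(E) for numerical stability."""
    w = np.exp(-(E - E.min()) / T)
    return w / w.sum()

def thermal_fisher(E, T, h=1e-5):
    """Fisher information about T of the full configuration, by finite differences."""
    p = gibbs(E, T)
    dp = (gibbs(E, T + h) - gibbs(E, T - h)) / (2 * h)
    return float(np.sum(dp ** 2 / p))

def heat_capacity(E, T):
    """Heat capacity from energy fluctuations: C = Var(E) / T^2."""
    p = gibbs(E, T)
    mean = np.sum(p * E)
    return float(np.sum(p * (E - mean) ** 2) / T ** 2)

if __name__ == "__main__":
    E = energies(6, J=-1.0, B=0.5)
    T = 1.2
    print("F(T)    :", thermal_fisher(E, T))
    print("C / T^2 :", heat_capacity(E, T) / T ** 2)
```

The same enumeration applied to a marginal of the chain would not reproduce this equality in general, since marginals of a Gibbs distribution are not of Gibbs form.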
It is apparent that there is a significant difference between the behaviours in the ferromagnetic and the antiferromagnetic regimes: the first gives ξ ≲ 1 almost everywhere, corresponding to a slightly sub-additive Fisher information. In contrast, the anti-ferromagnetic regime sees significantly high ξ values, representing a strongly super-additive Fisher information.
Considering the shape of the different probabilities as a function of the temperature allows us to gain valuable intuition about the conditions under which each regime is attained, as illustrated in Figure 5. Note how, in the $T \to 0$ limit, each curve reaches the probability of the specified configuration in the ground state of the chain: total anti-alignment in the anti-ferromagnetic case ($P(\uparrow\downarrow) = P(\downarrow\uparrow) = 1/2$) and total alignment in the ferromagnetic case ($P(\uparrow\uparrow) = P(\uparrow) = 1$).
The $\xi > 1$ condition in the anti-ferromagnetic case is mainly due to the strong suppression of the denominator $2F_1(T)$. The reason for the suppression is apparent if one considers how two two-site joint probabilities add up to give a one-site marginal: the $P(\uparrow\uparrow)$ and $P(\uparrow\downarrow)$ probabilities in the anti-ferromagnetic case exhibit almost opposite derivatives with respect to $T$; hence, their sum $P(\uparrow)$ will show a very mild dependence on the temperature, yielding a reduced Fisher information. This does not happen in the ferromagnetic regime, where the derivatives of $P(\uparrow\uparrow)$ and $P(\uparrow\downarrow)$ have significantly different absolute values.
This difference at the level of the derivatives can be easily understood in terms of the structure of the ground state of the chain, or, equivalently, of the thermal state for $T \to 0$. In the anti-ferromagnetic case, the ground state is two-fold degenerate (for values of $B$ that are not too large): $P(\uparrow\downarrow)$ and $P(\downarrow\uparrow)$ both have probability $1/2$, while $P(\uparrow\uparrow) = P(\downarrow\downarrow) = 0$. In the high-$T$ limit, all the two-site probabilities reach a value of $1/4$. Hence, in the anti-ferromagnetic case $P(\uparrow\downarrow)$ has to lose $1/4$ of probability between $T \to 0$ and $T \to \infty$; in the same temperature interval, $P(\uparrow\uparrow)$ has to gain $1/4$; the two derivatives can almost cancel one another.
On the other hand, the ferromagnetic chain has a non-degenerate ground state: $P(\uparrow\uparrow)$ has to lose $3/4$ of probability over the $T$-interval, while $P(\uparrow\downarrow)$ only has to gain $1/4$: the probabilities cannot exhibit opposite derivatives.
This perspective also allows us to interpret the curve labeled as the zero-derivative curve in Figure 4, which emerges in the anti-ferromagnetic region. The curve corresponds to points of maximum or minimum of the one-site probability $P(\uparrow)$ with respect to the temperature; consequently, these points have zero derivative and zero one-site thermal Fisher information, hence causing a divergence of the ratio $\xi$. Notice that the zero-derivative curve is only visible for a ratio $B/J$ larger than $-2$. This can be traced back to a fundamental regime change in the structure of the ground state, happening exactly at $B/J = -2$: the anti-ferromagnetic ground state $\ldots\uparrow\downarrow\uparrow\downarrow\uparrow\downarrow\ldots$ gives way to a ferromagnetic ground state $\ldots\uparrow\uparrow\uparrow\uparrow\uparrow\ldots$, due to the frustration of the anti-ferromagnetic interaction caused by the extremely strong external magnetic field. The one-site probability in this high-$B$ regime is monotonic, thus leading to the absence of zero-derivative points.
Finally, the ξ ratio diverges at all temperatures for B → 0; this is due to F_1(T) → 0 in that limit. In fact, in the absence of an external magnetic field, it makes no energetic difference whether a single spin is in state ↑ or ↓; no information about temperature can therefore be obtained from observations of single spins (F_1(T) = 0), whereas correlations can still give rise to meaningful estimations of temperature (F_{1:2}(T) ≠ 0).
We also considered how the introduction of a second-order coupling (J_2 in the language of Eq. (51)) affects the temperature estimation task. What appears is a much richer phenomenology, summarized in Appendix I.
The key points discussed above are preserved also in this more general case: chains with predominantly antiferromagnetic behaviour tend to give a large advantage, exploiting correlations for temperature estimation (ξ ≫ 1), whereas chains dominated by ferromagnetic interactions tend to yield a disadvantage (ξ < 1).
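The B → 0 statement above can be checked numerically with a minimal sketch. It assumes a nearest-neighbour Ising Hamiltonian H = −J Σ s_i s_{i+1} − B Σ s_i with k_B = 1 (the sign and unit conventions of Eq. (51) may differ), the thermodynamic limit, and the ratio ξ = F_{1:2}/(2F_1); the function names and the finite-difference step h are illustrative choices, not taken from the paper.

```python
import math

def two_site_probs(J, B, T):
    """Thermodynamic-limit P(s1, s2) of a nearest-neighbour Ising chain
    (assumed convention: H = -J sum s_i s_{i+1} - B sum s_i, k_B = 1)."""
    beta = 1.0 / T
    a = math.exp(beta * (J + B))   # transfer-matrix entry V(up, up)
    b = math.exp(beta * (J - B))   # V(down, down)
    c = math.exp(-beta * J)        # V(up, down) = V(down, up)
    # Dominant eigenvalue of the symmetric 2x2 matrix [[a, c], [c, b]].
    lam = 0.5 * (a + b) + math.sqrt(0.25 * (a - b) ** 2 + c * c)
    u_up, u_dn = c, lam - a        # unnormalised dominant eigenvector
    norm2 = u_up ** 2 + u_dn ** 2
    return {
        "uu": u_up * a * u_up / (lam * norm2),
        "ud": u_up * c * u_dn / (lam * norm2),
        "du": u_dn * c * u_up / (lam * norm2),
        "dd": u_dn * b * u_dn / (lam * norm2),
    }

def one_site_probs(J, B, T):
    """One-site marginal obtained by summing the two-site probabilities."""
    p = two_site_probs(J, B, T)
    return {"u": p["uu"] + p["ud"], "d": p["du"] + p["dd"]}

def fisher(prob_fn, J, B, T, h=1e-5):
    """Thermal Fisher information sum_x (dP/dT)^2 / P via central differences."""
    lo, mid, hi = prob_fn(J, B, T - h), prob_fn(J, B, T), prob_fn(J, B, T + h)
    return sum(((hi[k] - lo[k]) / (2 * h)) ** 2 / mid[k] for k in mid)

def xi(J, B, T):
    """Assumed ratio xi = F_{1:2} / (2 F_1); infinite when F_1 vanishes."""
    f1 = fisher(one_site_probs, J, B, T)
    f12 = fisher(two_site_probs, J, B, T)
    return math.inf if f1 == 0.0 else f12 / (2.0 * f1)

if __name__ == "__main__":
    # Without a field, single spins carry no thermal information,
    # while pairs of neighbouring spins still do.
    print(fisher(one_site_probs, -1.0, 0.0, 1.0))
    print(fisher(two_site_probs, -1.0, 0.0, 1.0))
```

Calling `xi(J, B, T)` on a grid of couplings and fields is one way to explore the ferromagnetic and anti-ferromagnetic regimes discussed above; no particular values are claimed here.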

V. DISCUSSION AND OUTLOOK
In this paper we studied the Fisher information of correlated stochastic processes. The central result is the decomposition of the joint Fisher information (30). It entails that the precision that may be achieved for parameter estimation from any process with finite Markov order obeys an asymptotic bound that scales linearly with the number of outcomes (36). A finite Markov order does not mean that the correlation is finite; on the contrary, even for M = 1, all outcomes are correlated, although these correlations decay exponentially. Thus, our results encompass an extremely large class of physical processes (a notable exception being critical systems, for which the correlation length diverges). Second, a central realization of our framework was that, in contrast to entropic information measures, the Fisher information does not necessarily obey subadditivity, and can be either super- or sub-additive, depending on the problem. This means that correlations can sometimes help and sometimes hamper the estimation process, in the sense that the convergence of the mean-squared error with N may be quicker or slower. As our results illustrate, generally speaking, anti-correlated outcomes tend to be advantageous, whereas correlated outcomes tend to result in reduced scaling.
Our focus here was on classical stochastic processes. It is natural to ask about a possible generalization of our main result, Eq. (30), to the quantum domain. In this regard, it is useful to distinguish between two scenarios. The first is that of continuous quantum measurements [12]. In this case, the process is quantum, but the outcomes are classical and local. Indeed, quantum continuous measurements fall precisely within the paradigm of our framework, which is thus directly applicable. For instance, Ref. [22] studied the estimation of the Rabi frequency from photon counting experiments. Their results fall precisely within our framework with M = 1. The same is true for Ref. [23], which studied homodyne detection.
A classical time series may also be obtained from a quantum collision model [46][47][48]. For instance, we may consider a memory system which is inaccessible to direct observation. At each time step, a fresh probe system is made to interact with the memory system and is then subjected to a fixed, generalised quantum measurement. Even for Markovian dynamics on the composite space, the resulting time series is in general non-Markovian, with any Markov order realisable. Our results apply directly to this scenario of indirectly probing a quantum system.
More generally, however, one may also consider scenarios requiring non-local measurements, which is often the case for systems exhibiting entanglement or other forms of quantum correlations. The generalization of the notion of Markov order to this case has proved to be rather cumbersome [35,46]: when a completely general measurement is taken into account, it can be shown [35] that the quantum Markov order is either 0, 1 or infinite (one could instead try to apply the concept of approximate Markov order [49,50]). In fact, it has been shown that the fundamental decomposition (13) generally does not hold for the quantum Fisher information [33]; this fact underlies the possibility of achieving higher precision by exploiting specifically quantum properties, such as entanglement. Furthermore, when dealing with multiparameter estimation, one has to take into account the possible non-commutativity of the corresponding operators, which calls for more refined theoretical tools than the Fisher information matrix [4].
To give a simple example of a non-local measurement scenario, suppose one wished to do thermometry, exactly as in Sec. IV, but considering instead the transverse field Ising model, i.e., where the field is now applied in the x direction. There has been considerable interest in thermometry of quantum systems (for reviews, see [51,52]). As already discussed in Sec. IV, in this case the optimal approach is to measure the system in the energy eigenbasis. In general, however, this basis is non-local and hence our main result, Eq. (30), is not necessarily applicable.
An extension to the fully-quantum case would be an interesting future avenue of research.
Let us now consider the N-site joint Shannon entropy H(X_{1:N}), for which a well-known chain rule holds [3]: Each conditional Shannon entropy can be written as: where we used the probabilistic definition of the Markov order (20) in the last equality. Hence, the joint Shannon entropy can be reduced to: where in the last step the stationarity of the process has been exploited. With the definitions we obtain the structure of the information-theoretic definition (25).
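As a sketch of the steps just described, the chain of equalities can be written out as follows. This assumes the notation X_{a:b} for consecutive outcomes and a process of Markov order M; the equation labels and the final definitions leading to (25) are not reproduced here.

```latex
\begin{align}
H(X_{1:N}) &= \sum_{k=1}^{N} H(X_k \mid X_{1:k-1}), \\
H(X_k \mid X_{1:k-1})
  &= -\sum_{x_{1:k}} P(x_{1:k}) \log P(x_k \mid x_{1:k-1}) \nonumber \\
  &= -\sum_{x_{1:k}} P(x_{1:k}) \log P(x_k \mid x_{k-M:k-1})
   = H(X_k \mid X_{k-M:k-1}) \qquad (k > M), \\
H(X_{1:N}) &= H(X_{1:M}) + \sum_{k=M+1}^{N} H(X_k \mid X_{k-M:k-1})
   = H(X_{1:M}) + (N-M)\, H(X_{M+1} \mid X_{1:M}).
\end{align}
```

The first M terms of the chain rule recombine into H(X_{1:M}), and stationarity makes the remaining N − M conditional entropies identical, which gives the linear growth in N.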

Appendix B: Decomposition of the joint Fisher information
The proof of the decomposition of the joint Fisher information in this appendix is analogous to that in Ref. [33], but generalized to the case of multi-parameter dependence. Let us consider definition (4) of the Fisher information matrix for two random variables X_1, X_2: Now, notice that the second and the third terms are zero. In fact, for the second: but by normalization of probabilities we have that In the same way, it is easy to prove that the third term vanishes. Hence, we have: The multi-partite case can be proven by iteratively following the same procedure. As a direct consequence, in the case of i.i.d. multi-parametric estimation, we simply have:
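The argument can be sketched explicitly under the assumption that definition (4) has the standard form [F(θ)]_{ij} = E[∂_i log P_θ ∂_j log P_θ], with ∂_i ≡ ∂/∂θ_i and a discrete sample space (this form is an assumption, since definition (4) is not reproduced here). Writing log P_θ(X_1, X_2) = log P_θ(X_1) + log P_θ(X_2|X_1):

```latex
\begin{align}
[F_{1:2}(\theta)]_{ij}
  &= \mathbb{E}\big[\partial_i \log P_\theta(X_1)\, \partial_j \log P_\theta(X_1)\big]
   + \mathbb{E}\big[\partial_i \log P_\theta(X_1)\, \partial_j \log P_\theta(X_2|X_1)\big] \nonumber \\
  &\quad + \mathbb{E}\big[\partial_i \log P_\theta(X_2|X_1)\, \partial_j \log P_\theta(X_1)\big]
   + \mathbb{E}\big[\partial_i \log P_\theta(X_2|X_1)\, \partial_j \log P_\theta(X_2|X_1)\big], \\
\mathbb{E}\big[\partial_i \log P_\theta(X_1)\, \partial_j \log P_\theta(X_2|X_1)\big]
  &= \sum_{x_1} \partial_i P_\theta(x_1)\; \partial_j \sum_{x_2} P_\theta(x_2|x_1)
   = \sum_{x_1} \partial_i P_\theta(x_1)\; \partial_j (1) = 0, \\
F_{1:2}(\theta) &= F_1(\theta) + F_{2|1}(\theta),
  \qquad F_{1:N}(\theta) = N F_1(\theta) \quad \text{(i.i.d.)}.
\end{align}
```

The third term vanishes by the same normalization argument with i and j exchanged, so only the marginal and the conditional contributions survive.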

Appendix C: Maximum Likelihood Estimation
In this appendix, we present a brief review of Maximum Likelihood Estimation (MLE), following mainly the book by Kay [1], and give some indications about its application to parameter estimation on finite-Markov order processes on a finite sample space.
Let us consider a parameter-dependent probability distribution P_θ(X_{1:N}), and define the function of θ
$$\ell(\theta) = P_\theta(X_{1:N} = x_{1:N}),$$
evaluated on the observed outcomes x_{1:N}, which is called the likelihood function. The value of the parameter θ proposed by the maximum likelihood estimator is given by the maximisation of ℓ(θ) over the entire space in which θ can take values. This corresponds to choosing the value of θ which makes the observed outcomes as likely as possible.
Notice that the maximising θ does not change if a monotonically increasing function is applied to ℓ(θ); it is therefore common to maximise the log-likelihood L(θ) = log ℓ(θ).
When dealing with processes with finite Markov order M (M ≪ N) and a finite sample space, the computation of the MLE is greatly eased by making use of the following observations. First, exploiting the definition of the Markov order (20), we can rewrite the log-likelihood as:
$$L(\theta) = \sum_{k=M+1}^{N} \log P_\theta(X_k \mid X_{k-M:k-1}) + \log P_\theta(X_M \mid X_{1:M-1}) + \dots + \log P_\theta(X_2 \mid X_1) + \log P_\theta(X_1). \tag{C2}$$
Notice that all the terms in the summation have the same form: the probability of a certain outcome in the sequence, conditioned on exactly the previous M outcomes. The remaining terms, conditioned on a smaller number of random variables, give a negligible contribution in the N ≫ M limit. Let us now group the items in the sequence into groups of homogeneous size M + 1, with a sliding-window approach:
$$\eta_j = (X_j, X_{j+1}, \dots, X_{j+M}), \qquad j = 1, \dots, N-M.$$
Each of the η sequences is associated with a conditional probability of the form:
$$C(\eta_j) = P_\theta(X_{j+M} \mid X_{j:j+M-1}).$$
Notice that, if each of the variables X_j can take up to d different values, then there are only z = d^{M+1} possible different groups η_j; the summation can therefore be reorganised over the distinct blocks:
$$L(\theta) = \sum_{q=1}^{z} f_q \log C(q) + R,$$
where f_q denotes the number of η groups having the same form, indexed by q, in the sequence, C(q) is the associated conditional probability, and R denotes the negligible remaining terms. In the common situation d^{M+1} ≪ N, this allows us to compute the likelihood much more efficiently than by direct evaluation.
We recall also some relevant properties of the MLE. Let us first consider a set X_{1:N} of i.i.d. random variables, extracted from the same θ-dependent distribution P_θ(X). Under mild regularity conditions on P_θ, it can be proven that [38]:
• the MLE is consistent, meaning that, in the limit N → ∞, the maximum likelihood estimate \hat{θ} is such that \hat{θ} → θ*, where θ* is the true value of the parameter;
• the MLE is asymptotically normal, meaning that the random variable \hat{θ} − θ* (where randomness is given by the extraction of different datasets) tends to be distributed according to a Gaussian distribution, with variance given by the Cramér-Rao bound in the form (10).
These propositions were originally formulated for i.i.d. random variables. For the purposes of this paper, it is however fundamental to consider non-i.i.d. random variables. It has been proven [53] that, under appropriate regularity conditions, they can be extended to Markov chains, once the marginal Fisher information F_1(θ) is replaced by the conditional Fisher information F_{2|1}(θ). Furthermore, a process with a finite sample space and of finite Markov order M can be directly translated into a process of Markov order 1 by grouping together M random variables, in a similar fashion to what is done in the transfer matrix approach (see Appendix G). This allows the extension of the result to larger Markov orders.
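The block-counting evaluation of the log-likelihood described in this appendix can be sketched as follows. This is a minimal illustration, not the implementation used for the paper: the two-state "flip" model with M = 1, the grid search and all function names are hypothetical choices, and the remainder terms R are simply dropped.

```python
import math
from collections import Counter

def window_counts(seq, M):
    """Count the occurrences f_q of each sliding window of length M + 1."""
    return Counter(tuple(seq[j:j + M + 1]) for j in range(len(seq) - M))

def flip_model(theta, block):
    """Hypothetical M = 1 conditional probability C(q): the chain changes
    state with probability theta and stays put with probability 1 - theta."""
    prev, cur = block
    return theta if cur != prev else 1.0 - theta

def log_likelihood(theta, counts, cond_prob):
    """Approximate log-likelihood sum_q f_q log C(q); the remainder R is ignored."""
    return sum(f * math.log(cond_prob(theta, block)) for block, f in counts.items())

def mle_on_grid(counts, cond_prob, steps=1000):
    """Maximise the log-likelihood over a uniform grid in the open interval (0, 1)."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda th: log_likelihood(th, counts, cond_prob))

if __name__ == "__main__":
    outcomes = [0, 1, 1, 0, 1, 0, 0, 0, 1]
    counts = window_counts(outcomes, M=1)
    print(counts)                           # block frequencies f_q
    print(mle_on_grid(counts, flip_model))  # grid estimate of theta
```

Once the counts f_q are tabulated, each likelihood evaluation costs at most d^{M+1} terms rather than N, which is the efficiency gain referred to above.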

Appendix D: On the meaning of sub-additive Fisher information
When dealing with a process yielding a sub-additive Fisher information, it might seem that, by using an estimator that ignores the correlations, the experimenter could achieve a precision beyond the Cramér-Rao bound. We illustrate the situation with an example.
Consider the Markov process defined by the transition matrix (37); we know that it yields a sub-additive Fisher information for θ. Being an M = 1 process, its genuine MLE has to take into account the correlations; the likelihood function for a sequence of outcomes x_{0:N} is the product of the marginal probability of x_0 and of the transition probabilities P_θ(x_k | x_{k−1}). We could now define an uncorrelated MLE, taking into account only the marginal probabilities, i.e., maximising the product of the one-site marginals P_θ(x_k). Naively, one could think that this uncorrelated MLE should result in a higher precision, with an MSE converging to the i.i.d. version of the Cramér-Rao bound (10) instead of the correlated version (36). However, this is not the case, as shown in Figure 6. This gives an interesting perspective on the meaning of the Fisher information and of the Cramér-Rao bound: they are related to the nature of the process, not to the way in which the estimator is designed. In other words, there is no way of taking advantage of the sub-additivity of the Fisher information to obtain a precision beyond the Cramér-Rao bound.
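The two estimators contrasted above can be written down side by side in a short sketch. The transition matrix used here is a hypothetical two-state example and is not Eq. (37); `B_FIXED`, the function names and the grid search are likewise illustrative assumptions.

```python
import math

B_FIXED = 0.5  # hypothetical, theta-independent probability of the 1 -> 0 transition

def transition(theta):
    """Hypothetical two-state transition matrix P[prev][cur]; NOT Eq. (37)."""
    return [[1.0 - theta, theta], [B_FIXED, 1.0 - B_FIXED]]

def stationary(theta):
    """Stationary one-site marginal of the chain defined by `transition`."""
    z = theta + B_FIXED
    return [B_FIXED / z, theta / z]

def loglik_markov(theta, seq):
    """Genuine (M = 1) log-likelihood: marginal of x_0 plus all transitions."""
    P, pi = transition(theta), stationary(theta)
    ll = math.log(pi[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(P[prev][cur])
    return ll

def loglik_marginal(theta, seq):
    """'Uncorrelated' log-likelihood: outcomes treated as i.i.d. marginal draws."""
    pi = stationary(theta)
    return sum(math.log(pi[x]) for x in seq)

def argmax_on_grid(loglik, seq, steps=1000):
    """Maximise a log-likelihood over a uniform grid in the open interval (0, 1)."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda th: loglik(th, seq))

if __name__ == "__main__":
    outcomes = [0, 1, 1, 0, 0, 1]
    print(argmax_on_grid(loglik_markov, outcomes))    # genuine MLE
    print(argmax_on_grid(loglik_marginal, outcomes))  # uncorrelated MLE
```

Repeating both estimates over many simulated sequences and comparing their mean-squared errors is the kind of experiment summarised in Figure 6; the sketch only fixes the two objective functions.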

Appendix E: The Central Limit Theorem for correlated random variables
The usual formulations of the Central Limit Theorem [54] consider i.i.d. variables. Many possible generalizations of the theorem that relax the independence hypothesis have been presented; in this appendix we discuss one that is directly applicable to most cases of physical interest. Let X_1, X_2, . . . be a stationary stochastic process, and let us define the coefficients α_n by the inequality In most cases, we will have that lim_{n→∞} α_n = 0, meaning that two random variables in the stochastic process become increasingly independent as the distance between them increases. If α_n → 0, the process is called α-mixing. It is possible to prove, for instance, that every Markov process is α-mixing, with α_n decaying exponentially.
Let us now consider an α-mixing process such that:
• α_n = O(n^{−5});
• the expectation value E[X_0^{12}] is finite (hence, due to stationarity, it is finite for all random variables composing the process);
then the results of the Central Limit Theorem can be recovered [54].
Notice that the first hypothesis is easily fulfilled by any process exhibiting exponential decay of the α_n coefficients, as is the case for Markov processes; the second one is trivially true for any process on a finite space, and also for processes on continuous sample spaces whose distributions have exponentially suppressed tails, such as normal distributions.
A stationary Markov process is fully determined by the joint probability P(X_1 X_2), in terms of which all the relevant conditional and marginal probabilities can be written. We assume this two-site joint distribution to be Gaussian, with the form: where \vec{x} = (x_1, x_2)^T, \vec{μ} = (μ, μ)^T, and the normalization factors have been dropped. Here σ is the two-site covariance matrix, written in terms of the two parameters γ_0 ∈ R^+ and ρ ∈ [−1, 1] as Notice that the sign of the nearest-neighbor correlations is encoded in the sign of ρ. It is necessary to determine the conditional probability distribution P(X_2|X_1), which we want to sample for the simulation. Like the joint probability, it is also Gaussian, with mean and variance given respectively by Using these two relations, we are able to iteratively simulate a Gaussian Markov process. The process built in this way clearly has Markov order 1 by construction, because only the most recent element from the past plays a role in the probability of the future. The full covariance matrix for this process turns out to be
It is rather instructive to compare this behaviour with another possible definition of a Gaussian process; let us consider the Gaussian joint probability distribution with mean vector \vec{μ} = (μ, μ, . . . , μ) and covariance matrix In this case, the covariances for a distance larger than 1 are exactly zero; however, computing the conditional probabilities explicitly, we find that a probability of the form P(X_3|X_2 X_1) depends non-trivially on X_1, hence it is necessarily different from P(X_3|X_2): despite its appearance, Eq. (F5) does not have Markov order 1. Instead, its Markov order is infinite. The process with actual Markov order 1 is that given by Eq. (F4).
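The construction above can be illustrated with a minimal sketch. It assumes that the two-site covariance matrix has variance γ_0 on the diagonal and γ_0 ρ off the diagonal, so that the conditional mean is μ + ρ(x_1 − μ), the conditional variance is γ_0(1 − ρ²), and the full covariance is γ_0 ρ^{|i−j|}; this parametrisation and all function names are assumptions. The second function computes the coefficients of the conditional mean of X_3 given (X_1, X_2) from a 3 × 3 covariance matrix, which shows why a covariance with zero entries beyond nearest neighbours still depends on X_1.

```python
import random

def simulate_gaussian_markov(n, mu, gamma0, rho, rng):
    """Sample n outcomes: X_1 from the stationary marginal, then X_{k+1} | X_k.
    Assumes Var(X) = gamma0 and nearest-neighbour covariance gamma0 * rho."""
    x = rng.gauss(mu, gamma0 ** 0.5)
    out = [x]
    cond_std = (gamma0 * (1.0 - rho ** 2)) ** 0.5
    for _ in range(n - 1):
        x = rng.gauss(mu + rho * (x - mu), cond_std)
        out.append(x)
    return out

def regression_on_past(c11, c12, c22, c31, c32):
    """Coefficients (a1, a2) of E[X3 | X1, X2] = mu + a1 (X1 - mu) + a2 (X2 - mu),
    obtained from the covariances c_ij = Cov(X_i, X_j) of a Gaussian vector."""
    det = c11 * c22 - c12 * c12
    a1 = (c31 * c22 - c32 * c12) / det
    a2 = (c32 * c11 - c31 * c12) / det
    return a1, a2

if __name__ == "__main__":
    gamma0, rho = 2.0, 0.5
    # Covariance gamma0 * rho^|i-j|: no dependence on X1 (Markov order 1).
    print(regression_on_past(gamma0, gamma0 * rho, gamma0,
                             gamma0 * rho ** 2, gamma0 * rho))
    # Covariance truncated beyond nearest neighbours: X1 still matters.
    print(regression_on_past(gamma0, gamma0 * rho, gamma0, 0.0, gamma0 * rho))
    sample = simulate_gaussian_markov(5, 0.0, gamma0, rho, random.Random(0))
    print(sample)
```

A vanishing coefficient a1 is the Gaussian signature of P(X_3|X_2 X_1) = P(X_3|X_2); the truncated covariance gives a nonzero a1, in line with the statement that its Markov order is not 1.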

Appendix G: Marginalisation of spin chains
A detailed description of the method to obtain the marginals has been provided in [44]; here we only give a brief review, considering for the sake of simplicity an R = 1 spin chain with local dimension d = 2 (a spin-1/2 Ising chain).
The Gibbs state (50) of N spins can be conveniently rewritten in terms of an iterated product of components of the so-called transfer matrix V (note that periodic boundary conditions are assumed): where the transfer matrix is defined as [55]: and it is well-known that the partition function Z of N spins can be written as: We would now like to marginalize p(s_1, . . . , s_N) over all the spins but the first m, i.e., to compute: Note that, since the periodic boundary conditions grant translational invariance, the obtained expression is the general form of an m-spin marginal, and not specific to the first m spins. In terms of the transfer matrix expression (G1) we therefore have: but the summation over the spins s_{m+1}, . . . , s_N is equivalent [56] to a matrix multiplication. Thus: This expression has been obtained without any approximation, and holds for any value of N. However, we are interested in the thermodynamic limit N → ∞. It is noteworthy that only the dominant eigenvalue λ of the transfer matrix and the corresponding eigenvector u play a role in this limit, yielding the final expression:
A procedure to generalize the transfer matrix to higher-d or higher-R chains is detailed in [56], and involves the creation of a d^R × d^R-dimensional matrix, whose components are denoted by multi-indices built on R spins (here we will only put the relevant indices one after the other, like s_i s_j). The obtained transfer matrix is in general non-symmetric, and it is hence not possible to guarantee that it can be fully diagonalised. However, it can be proven [44,57] that the dominant eigenvalue of V still dominates the trace (and the partition function, according to (G3)); the corresponding left and right eigenvectors (u_L and u_R respectively) have non-negative components.
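The marginalisation procedure for the R = 1, d = 2 case can be sketched numerically. The sketch assumes the symmetric transfer matrix V_{s s'} = exp[β(J s s' + B(s + s')/2)] with k_B = 1, corresponding to H = −J Σ s_i s_{i+1} − B Σ s_i; the conventions of Eqs. (50) and (G2) may differ, and the function names are illustrative. Index 0 stands for ↑ and index 1 for ↓.

```python
import math
from itertools import product

def transfer_matrix(J, B, T):
    """Assumed symmetric 2x2 transfer matrix; index 0 = spin up, 1 = spin down."""
    beta = 1.0 / T
    spins = (+1, -1)
    return [[math.exp(beta * (J * s * t + 0.5 * B * (s + t))) for t in spins]
            for s in spins]

def matmul(A, C):
    return [[sum(A[i][k] * C[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def matpow(A, n):
    R = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        R = matmul(R, A)
    return R

def marginal_finite(V, N, m):
    """Exact m-spin marginal of a periodic chain of N spins."""
    VN = matpow(V, N)
    Z = VN[0][0] + VN[1][1]          # partition function Z = Tr V^N
    rest = matpow(V, N - m + 1)      # sums over spins m+1..N and closes the ring
    probs = {}
    for cfg in product(range(2), repeat=m):
        w = rest[cfg[-1]][cfg[0]]
        for a, b in zip(cfg, cfg[1:]):
            w *= V[a][b]
        probs[cfg] = w / Z
    return probs

def marginal_infinite(V, m):
    """m-spin marginal for N -> infinity from the dominant eigenpair of V."""
    a, c, b = V[0][0], V[0][1], V[1][1]
    lam = 0.5 * (a + b) + math.sqrt(0.25 * (a - b) ** 2 + c * c)
    u = [c, lam - a]                 # unnormalised dominant eigenvector
    norm = math.sqrt(u[0] ** 2 + u[1] ** 2)
    u = [x / norm for x in u]
    probs = {}
    for cfg in product(range(2), repeat=m):
        w = u[cfg[0]] * u[cfg[-1]] / lam ** (m - 1)
        for s, t in zip(cfg, cfg[1:]):
            w *= V[s][t]
        probs[cfg] = w
    return probs

if __name__ == "__main__":
    V = transfer_matrix(J=-1.0, B=0.5, T=1.0)
    print(marginal_finite(V, N=200, m=2))
    print(marginal_infinite(V, m=2))
```

For long chains the two functions should agree up to corrections that decay with the ratio of the subdominant to the dominant eigenvalue raised to a power of order N, which is why only λ and u survive in the thermodynamic limit.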
It is important to mention here that, unlike in the R = 1 scenario, for longer ranges R the transfer matrix only permits the direct computation of marginals of m = kR, k ∈ N, spins. Any marginal consisting of a number of spins which is not a multiple of R has to be computed by explicit marginalization of a larger marginal. The general expression for an m = kR marginal of a range-R spin chain is:
$$p(s_1, \dots, s_{kR}) \propto u^{L}_{s_1 \dots s_R} \left[ \prod_{j=0}^{k-2} V_{s_{jR+1} \dots s_{(j+1)R};\; s_{(j+1)R+1} \dots s_{(j+2)R}} \right] u^{R}_{s_{(k-1)R+1} \dots s_{kR}}. \tag{G8}$$

Appendix H: Markov order of spin chains
The proof of M ≤ R is fully general, and was first presented in [45]. Let R be the interaction range, and consider the calculation of a marginal over L spins, where L = RL′ for a suitable L′ ∈ N. Let us introduce a multi-index notation to describe the behavior of a whole block of R spins as: The marginal probability distribution for a sequence of length L is then given by: Let us now consider two sequences, both of length L, Using the marginal probability detailed before: from which it is apparent that, of all the spin blocks in $\overleftarrow{s}_L$, the conditional probability only depends on the single spin block η_{−1}, since it is the only one which appears in the expression. One has finally: The claim follows, because η_{−1} has length R.
In this paper, we have only considered nearest-neighbour chains, with R = 1. It is rather easy to prove that, for such chains, M = 1 whenever T ≠ 0, ∞. Notice that the condition R = 1 implies that J_1 ≠ 0, by definition of R.
Proceeding by contradiction, let us assume that the Markov order is smaller than the interaction range, hence M = 0, R = 1. Therefore the spins behave independently and the conditioning can be dropped: P(s_2|s_1) = P(s_2), or, equivalently, the probability factorizes as P(s_1 s_2) = P(s_1)P(s_2).
The expressions for both the two-site joint probabilities P(s_1 s_2) and the one-site marginals are known analytically (but are lengthy). Plugging them into (H7) and solving for J, the only possible solution we obtain is J = 0, which contradicts the assumption R = 1. Hence, for the Ising chain we have, at finite temperature, M = R.
the relevant points are the ones already mentioned in the previous parts. Again, for J > 0 (primarily ferromagnetic), one has a disadvantage, whereas for J < 0 (primarily anti-ferromagnetic) an advantage arises. In the presence of a very strong external magnetic field, alignment is also enforced in the otherwise anti-ferromagnetic chain; in this case, the behaviour is similar to the ferromagnetic case, with ∆F < 0. Overall, weaker coupling (relative to B) leads to weaker ∆F.