An improved estimator of Shannon entropy with applications to systems with memory

We investigate the memory properties of discrete sequences built upon a finite number of states. We find that the block entropy can reliably determine the memory for systems modeled as Markov chains of arbitrary finite order. Further, we provide an entropy estimator that remarkably gives accurate results when correlations are present. To illustrate our findings, we calculate the memory of daily precipitation series at different locations. Our results are in agreement with existing methods being at the same time valid in the undersampled regime and independent of model selection.


INTRODUCTION
Stochastic modeling lies at the heart of many modern studies of complex systems. In this approach, the detailed dynamics is replaced by a set of transition probabilities that connect the states of the system at two different times. Quite often, the resulting stochastic processes can be modeled with the aid of discrete Markov chains [1], where the transition probabilities to a future state depend only on the present state and not on past states [2]. The broad applicability of this Markovian dynamics runs across disciplines: fluctuation theory in statistical physics [3], DNA sequence analysis in molecular biology [4], weather forecast in meteorology simulations [5], or web searches in information retrieval [6], just to mention a few.
However, this memoryless approximation is insufficient if one deals with strongly correlated systems [7,8], whose next state depends not only on the present one but also on a longer history of past states. Higher-order Markov chains [9], which include a finite number m of past states to determine the transition probability to the next one, are thus needed for an accurate characterization of systems with memory. Recent examples of systems modeled with higher-order Markovian discrete chains include Parkinsonian tremor time series [10], linguistically detected emotional events [11], genomic polymorphisms for cancer research [12], human navigation on the web [13] and autochemotactic searchers [14]. It is therefore of utmost importance to have reliable methods that faithfully determine the order or memory m for a given sequence of data [15].
This goal can be achieved in several ways. For the purpose of this paper, we focus on information-theoretical methods. The widely used Akaike information criterion (AIC) [16] is based on a loss function constructed from the Kullback-Leibler relative entropy. The loss function is then minimized to find a good estimate of m [17]. Since this approach views the transition probabilities from a frequentist perspective, the Bayesian information criterion (BIC) uses instead Bayes factors to further constrain the loss function [18]. While the AIC criterion is shown to overestimate m, the BIC is biased towards overly simple models [19]. To overcome these difficulties we propose in this paper a new method to determine the order of a Markov chain that best represents a given series of data. Advantages of the proposed method include that it is independent on model selection and sufficiently precise even in the undersampled regime.
Central to our proposal is the block Shannon entropy [20]. Data are grouped in blocks of a given size n from which the entropy is extracted using the probability of each block. When the total sequence length is much larger than the number of possible blocks, such that all blocks appear with sufficient statistical relevance, the maximum likelihood estimator for the entropy works well. Nevertheless, sequences extracted from real data can be short and the previous estimator is consequently negatively biased [21]. There have been several attempts to derive unbiased estimators for the entropy but, to the best of our knowledge, all these proposals rely on the sequence being composed by independent random states. What is needed is a faithful estimator being able to deal with correlated systems, as inferred in this work. Therefore, our contribution is twofold. First, we suggest a novel method that determines the memory of a generic discrete sequence when the series is described by a Markov chain of arbitrary order. Second, we introduce an estimator suitable for the expected value of any observable (like the entropy) that provides information of a correlated system. We emphasize that both findings are independent and constitute, by their own, interesting advances that do not rely on each other. The calculation technique for the order of Markov chains would still hold if another block entropy estimator is used in any of the multiple contexts where these processes are relevant [22]. Additionally, our improved estimator can be applied to any model of correlations (not necessarily a Markov chain) and is thus of interest in situations where entropy is fundamental to understand the system's properties [23].
The paper is organized as follows. In Section II we develop the main relation between the block entropy and the order of a Markov chain. We then explain the method to extract the order from the entropy estimator, based on a linear relation between the block entropy and the order, whose proof is left for the Appendix. In Section III we develop the algorithm to estimate the block entropy from a series of correlated data. In Section IV A we present a case study where we determine the memory of a series constructed numerically, while in Section IV B we apply the method to determine the order of daily precipitation series, comparing with the results found within the BIC approach. Finally, Section V contains our conclusions.

II. RELATION BETWEEN BLOCK ENTROPY AND MEMORY
Consider a discrete random variable X with L possible outcomes, {z i }, i ∈ {1, . . . , L}. Let S = (X 1 , . . . , X N ) be a sequence of time-ordered observations, where N is the sequence size. Then, S has memory m ≥ 1 if the transition probabilities satisfy with s the time or position in the series. For the case m = 0 the probabilities P (X s = z j ) are independent for all s. The sequence is termed Markovian if m = 1 in Eq. (1). Values m > 1 correspond to higher-order Markov systems, which are the focus of this work. In Fig. 1(a) we illustrate the transition probabilities for a generic sequence with m = 1: the outcome probability at time s depends only on the previous state at time s−1 (red arrow). Figure 1(b) is a graphical representation for the case m = 2, where now the probability to observe the system in a given state at time s depends on states at both previous times s − 1 and s − 2 (red arrows).
In what follows, we consider homogeneous Markov chains for which the L m+1 transition probabilities of Eq. (1) are independent of time index s and the conditional probabilities are then uniquely determined by the states {z i }. Furthermore, we will only consider the stationary case and therefore neglect transient effects.
Given a long sequence S it is not an easy task to determine the order m by a direct application of Eq. (1) since one should check a growing number of relations as m is increased. However, a much more efficient procedure stems from an analysis of the block Shannon en-tropy [24,25]. Let us form overlapping blocks in S of size n ≥ 1. The j-th block is then B (n) j = (X j , . . . , X j+n−1 ) with 1 ≤ j ≤ N −n+1 ≡ N n . There exist L n distinct possible blocks of size n, which we denote as {b i } 1≤i≤L n . The block Shannon entropy is defined by where p(b i . We use the convention that H 0 ≡ 0. The key point of the proposed method to determine the order of a stochastic process is to realize that H n is a linear function of n for n ≥ m if and only if the sequence has memory m (see A for a proof). Therefore, we can write which, for the independent case (m = 0), reduces to the known expression for the entropy rate per symbol [26] H n = nH 1 n ≥ 1.
For the general case m > 0 the following procedure yields an accurate assessment of m. Given a trial memory µ = 0, 1, 2, . . . we define the trial entropies H µ (n) by the relation: Now, according to Eq. (3), H µ (n) would be equal to H n for n ≥ µ if the series S could be represented by a Markov chain of order µ. Therefore, if we consider the mean squared error we find that ∆ µ vanishes for µ ≥ m, independently of the value of the cutoff n max . Hence, the criterion to find the order m of the sequence S is In contrast to the BIC and AIC methods, where the loss function depends on the model selected, Eq. (7) is independent of any selection procedure. For memory determination, both AIC and BIC methods are generally used along with a comparison among possible models suitable for a given sequence, whereas our proposed method provides an exact result for discrete random variables. If the system is naturally described with a continuous random variable, we first need to discretize this variable and the results obtained with Eq. (7) will depend on the binsize used for the discretization. As an example of an explicit calculation of m, in Fig. 2 we show results for a binary system (L = 2) and memory m = 3. The 2 3+1 = 16 transition probabilities are chosen randomly from a uniform distribution in the interval (0, 1) (and multiplied by a common factor in order to fulfill the necessary normalization conditions). We plot in the main panel with color lines the trial entropy H µ (n) given by Eq. (5) for different values of µ. For comparison, the exact block entropies H n , computed from the transition probabilities, are also shown with black dots. Clearly, the trial entropies H µ (n) coincide for µ ≥ 3 with the exact block entropy H n in agreement with the memory of the process. Alternatively, the inset of Fig. 2 shows that the mean squared error ∆ µ vanishes for µ ≥ 3. Therefore, Eq. (7) is met and the memory is m = 3, as expected.
This particular example is based on a particular set of transition probabilities drawn from a uniform distribution. We stress that for a different set of transition probabilities the curves in Fig. 2 would change quantitatively but the condition ∆ µ = 0 for µ ≥ 3 will always hold for a system with memory m = 3. As a consequence, the method is robust against strong variations of the transitions between states.
It should be clear that in this example the determination of the memory m has been very efficient because we had access to the true block entropies H n , determined from the knowledge of the L m+1 transition probabilities. In a case where this is not possible, for example, if we are given a numerical sequence S of finite length N , we would need to replace the exact block entropies H n in Eqs. (5) and (6) with appropriate estimatorsĤ n . As a consequence, a good performance of our method requires a reliable estimator of the block entropies. Unfortunately, this is not an easy issue since all the entropy estimators reported in the literature have systematic biases that de-grade the performance of the method. In the next section, we will propose a new entropy estimator that turns out to be particularly suitable for our purposes of determining the memory of a higher-order Markov process.

III. ENTROPY ESTIMATOR
Block probabilities and, hence, the Shannon block entropy, could also be obtained with a sufficient precision if we had access to an unlimited amount of data. However, data records are necessarily finite and we must resort to other less accurate estimates of the entropy that take into account this finite amount of data. Hereafter, we will use the notationâ to denote a numerical estimator for an arbitrary random variable a. We recall that an unbiased estimator is one for which â = a, while for biased estimators the difference â − a generally decreases with the sample size N .
A first attempt to estimate the entropy of a general sequence S is to employ the maximum likelihood estimator i /N n is the relative frequency given by the observed number of occurrencesn with respect to the total number N n = N − n + 1 of overlapping blocks of size n present in S.
Despite the fact thatp MLE is unbiased,Ĥ MLE n turns out to be biased [27] and this can lead to an extreme underestimation of H n , especially in the undersampled regime N n < L n where Ĥ MLE n < H n . The estimation provided by this method can be considered reliable up to n max ∼ log(N )/ log(L).
There is not known unbiased estimator for the entropy [28] but there is a large number of estimators that manage to improve the results obtained with the MLE [29], allowing one to increase their range of validity as given by n max . In particular, the coverage adjusted estimator proposed by Chao and Shen [21] provides good results for sequences of independent events. Here, we will present an estimator that generally improves the coverage adjusted estimator for systems with memory, using a combination of the Horvitz-Thompson adjustment [30] to account for missing elements in the sequence and a correction to the probabilities that takes into account correlations.

A. Horvitz-Thompson correction
We are interested in a numerical estimation of the sum of A for all possible blocks, namely Y n = In a finite series S, not all L n possible values of b (n) i will necessarily appear. To account for the missing blocks in the data, Horvitz and Thompson [30] proposed to estimate Y n by summing only the contributions of the blocks that do appear in S and weighting each term by the probability that the element is included in S: For a random sequence of length N , memory m = 0 (independent events) and block size n = 1, the probability of appearance for block b (1) i can be computed as In principle, Eq. (10) does not hold in the presence of correlations (m > 0). It does not hold either for m = 0 and n > 1 as the existing overlapping between consecutive blocks already induces correlations in the block series: e.g., in the case L = 2 and n = 3 with possible results z i = 0, 1, the block (0, 0, 1) can only be followed by the blocks (0, 1, 0) and (0, 1, 1). Nevertheless, we have checked in all our numerical simulations that the corrections introduced by these effects are negligible in the limit N n (see B). Using Eqs. (9) and (10) in Eq. (2), one arrives at the Horvitz-Thompson estimator for the block entropŷ

B. Chao-Shen estimator
In general, the exact probabilities p(b (n) i ) that appear in Eq. (11) are not known. Thus, it is necessary to replace them by their estimators obtained from the sequence S. The most basic estimator is the maximumlikelihood-estimatorp MLE (b (n) i ) defined above. However, since not all possible blocks are likely to appear in the finite sequence S, we note that b (n) Therefore, it is convenient to correct the estimated frequencies before substituting them into Eq. (11). Chao and Shen [21] proposed to use instead ofp MLE The new estimator is requested to satisfy the condition i ). From Eq. (12) and using the normalization condition of the MLE estimator, one finds thatĈ represents the total probability for the occurrence of the blocks observed in S.
i ) < 1 implies that there exist unseen blocks in the data and this is an error source for the estimation of H n .
Again, as the exact values of p(b (n) i ) are not known, Eq. (13) can not be used to computeĈ n for a given sequence S. The Good-Turing estimator [31] can be used as an estimate for the sample coverage, where N (n) 1 is the number of blocks of size n that appear only once in the sequence S. Substituting Eq. (12) into Eq. (11) and using Eq. (14), one finds the Chao-Shen entropy estimator which has proven to provide very good results for independent sequences. However, we show below that Eq. (15) does not work so well for correlated data as happens in systems with memory. This is our motivation to present an improved estimator that does take into account correlations.

C. Correlation coverage estimator
Our proposal follows Eq. (12) butĈ n is now estimated using a sequential procedure to tackle possible correlations in the sequence. First, we consider the initial part of the sequence, namely S 0 = (X 1 , X 2 , . . . , X N1 ), with N 1 ≡ n − 1 + N n /2, such that S 0 contains exactly N n /2 ≡ N n blocks of size n. We adopt the initial estimatorĈ N n +k , 1 ≤ k ≤ N n ), and modify the sample coverage according tô here the factor 1/(N n + 1) accounts for the probability that the observed block B (n) +1 appears for the first time. We repeat the process with the next observation B (n) +2 to modify the estimator where S +i = S 0 ∪ (X N1+1 , . . . , X N1+i ). At variance with the coverage given by Eq. (14), which is obtained from a single observation from the whole series, this procedure has the advantage of iteratively updatingĈ (k) n , k = 0, 1, 2, . . . , N n based on the previously observed data. We continue with this procedure until we arrive at the final value of the correlation coverage (CC) estimator where the indicator function I(Z) yields 1 if the event Z is true and 0 otherwise.
Finally, we substitute the corrected probabilities into Eq. (11) to obtain the correlation coverage-adjusted entropy estimator As an example, in Fig. 3 we show the exact entropy H n (dashed black line and dots) as a function of the block size for a Markovian (m = 1) binary process that takes the values z = 0, 1, with transition probabilities P 0 ≡ p(0|0) = P (X s = 0|X s−1 = 0) = 0.7 and P 1 ≡ p(1|1) = P (X s = 1|X s−1 = 1) = 0.6. Note that, as expected, H n is linear from n ≥ 1. To compare with, we also depict the results obtained with the maximum-likelihood estimators given by Eq. Notably, the proposed estimator performs better than both the MLE and Chao-Shen entropies and perfectly agrees with the exact result for a wide range of block sizes up to n 17 while the MLE and the Chao-Shen estimators deviate earlier from the exact H n . In fact, the MLE estimator provides a good approximation up to a value of n 12 which is close to the expected limit n ∼ log(N )/ log(L) ∼ 13, while the validity of the Chao-Shen estimator extends up to n 13 and our estimator even further. In C we present a similar comparison for sequences with memory 2 and 3.
Departures from the exact value will show up only in the extremely undersampled regime (N n L n ). In this case, every observation will constitute a different element. As a consequence, the sum in Eq. (19) will include all blocks in the second part andĈ CC n = 1 − N n j=1 1 N n +j 1 − ln(2) 0.307, for large N . The proximity of the computed estimator to this limiting value indicates a largely undersampled sequence and determines the validity of the estimators to the coverage C n .
For a more detailed comparison, we show in panel (a) of Fig. 4 the performance of the Chao-Shen estimator given by Eq. (15), and in panel (b) that of the our correlation estimator from Eq. (19), both calculated for a process with m = 1 and L = 2. We show the results as a function of all possible combinations of transition probabilities P 0 and P 1 , the remaining probabilities being derived from the normalization conditions p(1|0) = 1 − P 0 and p(0|1) = 1−P 1 . We measure the goodness of each estimator for particular values of P 0 and P 1 with the mean squared error (P 0 , P 1 ) ≡ 1 nmax nmax n=1 (H n −Ĥ n ) 2 . To improve the statistics of the error, for each set of transition probabilities (P 0 , P 1 ) we generate M = 20 series, each with N = 10 4 elements, and average the mean squared error over the M values. We use in all cases n max = 17 and plot the resulting values in a color code. As shown in the figure, quite generally, Eq. (19) performs better than Eq. (15), with the possible exception of the cases near P 0 , P 1 ≈ 1/2 (the independent case with equal outcome probability), in which the latter is slightly better. The overall performance is obtained by adding all values of (P 0 , P 1 ) using a grid size ∆P = 0.1 in Fig. 4, resulting in an aggregated mean squared error five times larger for the Chao-Shen estimator as compared with the correlation coverage-adjusted estimator.

IV. DETERMINATION OF THE MEMORY OF A SEQUENCE
Having demonstrated the usefulness of the estimator given by Eq. (19) for Markovian systems, we now return to the method of Sec. II for the determination of the memory in discrete sequences.
Suppose we are given a finite time series S that describes a phenomenon for which we would like to determine its memory m. As explained above, we will use the criterion given by Eq. (7)  ing Eqs. (5,6) and using a suitable estimatorĤ n for the block entropy H n . For a given entropy estimator with a known n max and fixed N and L, a meaningful linear fit for the calculation of ∆ µ as given by Eq. (6) requires that µ ≤ n max − 2 (at least three points are required for a meaningful fit to a straight line). Since the chosen entropy estimator works for block sizes up to n max , our method can provide in principle an accurate result if the system under study has memory m ≤ n max −2 but would fail otherwise.
In Figs It is worth mentioning that, in general, the value of n max for a certain estimator is not the same for every sequence since n max may depend on the transition probabilities. To be conservative, in this section we use n CC max = n MLE max even though our estimator certainly provides good results for larger block sizes.
Due to the limitations of all estimators we expect that ∆ µ is determined within an error that must be taken into account. To do so, we consider M series of data S (i) , i = 1, . . . , M , all with the same length, obtained either by repeating the experiment M times or by splitting the original series in M disconnected sequences. For each sequence we calculate the corresponding value of ∆ (i) µ . Then, we calculate the mean∆ µ and standard deviation σ µ for the obtained values ∆ (i) µ . The condition given by Eq. (7) is transformed into the criterion that the mean value∆ µ is consistent with the value 0 within the standard deviation, A. Numerical simulation We first present results arising from controlled numerical simulations. We generate M = 20 series of length N = 1000 with memory m = 1 and possible values z = 0, 1 (L = 2). We show only results corresponding to p(0|0) = 0.7 and p(1|1) = 0.6 but similar conclusions are obtained quite generally for different values of the transition probabilities. In Fig. 5 we plot the values of∆ µ obtained using the method explained above. In Fig. 5(a) we employ the MLE estimator for the block entropy, while Fig. 5(b) uses the correlation coverage-adjusted estimator. Remarkably, the MLE entropy is not able to yield any useful result, and it is thus not possible to determine the sequence memory since Eq. (20) is never satisfied. In stark contrast, the calculation of∆ µ and σ µ using the entropy estimator given by Eq. (19) clearly shows that the criterion Eq. (20) yields the correct result m = 1.
In Fig. 6(a) and (b) we present results arising from similar numerical simulations as before but now generating sequences with memory m = 2 and m = 5, respectively (same values of M , N and L as before) and transition probabilities chosen randomly from an uniform distribution in the interval (0, 1). We plot the values of∆ µ obtained using the correlation coverage-adjusted estimator for the entropy. Again, the usage of our proposed entropy estimator allows us to accurately determine the memory of the system. Similar conclusions are generally obtained for a different set of transition probabilities.
These numerical experiments illustrate the importance of having a reliable entropy estimator to successfully apply the method of memory determination.

B. Daily precipitations
We have thus far tested our proposed method only with numerically generated sequences of known memories. We now test the method with real data. It has been widely accepted that the occurrence (or not) of precipitation can be modeled as a system of memory 1, but it is also well known that this assumption has several shortcomings [32]. There have been some attempts to improve this model by studying sequences of daily data worldwide. It has been found [33] that the model memory depends on the geographical location. Here, we select a few locations and compare the results with those obtained using the BIC method [34].
We collect data from the Global Historical Climatology Network Daily [35]. For each location, we record the available observation for daily precipitation, setting the threshold at 0.1 mm to specify if a day is rainy or not. This way we produce a time series with two states (L = 2) and length N s 25000 (the exact value of N s varies for each location). Then, this series is divided in M = 5 sequences of equal lengths N = N s /5 and for each of this sequence we estimate the block entropiesĤ CC n for n = 1, . . . , 12, from which we obtain the mean values∆ µ and their standard deviations σ µ . It should be noted that the use of the correlation coverage estimator for the entropy allows us to obtain reliable results up to n max = 12. If we had used the MLE estimator, this limit value would have been n max ∼ 11.
In Fig. 7 we show the results obtained for four locations: Rome, Dallas, Bangkok and Than Lwin. For the first two locations, our method predicts m = 1 whereas for the second two places the procedure suggests that both series are better described with m = 2. These results are in excellent agreement with the BIC method (see Fig. 5 in Ref. [33]) and consequently validate our technique for the memory determination in real data sequences modeled with Markov chains of arbitrary order.

V. CONCLUSIONS
We have developed a novel method to determine the memory of a discrete sequence. Importantly, the method is valid for both Markovian and non-Markovian systems. Since the technique relies on the calculation of the Shannon entropy as a function of the block size, it is crucial to additionally propose an entropy estimator that gives good results for correlated systems. To this end, we have introduced a correction to the estimated probabilities to amend the source of error stemming from unseen ele- ments in small samples. Our estimator is shown to significantly increase the accuracy of the entropy for systems with memory. Both numerically generated sequences and real data series have been used as benchmarks. These successful results will certainly encourage further applications of the proposals discussed in this work. It is left as a future project to present a more detailed comparison of our entropy estimator with different estimators that have shown promising results for independent sequences [36][37][38][39].
As far as n ≥ m, the right-hand side of Eq. (A4) is independent of n. Making the replacement n → m we arrive at which proves that the dependence of H n on n is linear for n ≥ m, i.e., H n = an + b with constant parameters a and b.
000 (black lines) for binary sequences with memory m = 0 as a function of the probability p(0) for occurrence of outcome 0. For each combination of N and p(0), we generate K = 10 4 numerical sequences and calculate P (1) 000 from K000/K, where K000 is the number of sequences where the block (0, 0, 0) appears at least once. We also plot with blue lines P (2) 000 = 1 − (1 − (p(0)) 3 ) N −2 . We show results for N = 100 in (a), 200 in (b), 500 in (c) and 1000 in (d). We observe that the black and blue curves overlap as N grows.

Appendix B
Given a sequence S of length N , we calculate the probability that we observe the block b (n) i of size n as follows, Even though Eq. (B1) is exact only when n = 1 and the sequence memory is m = 0, we have checked in all our simulations that the corrections introduced by correlations when m > 0 and n > 1 can be neglected if N n. As an example, we now assess the probability that the block (X s = 0, X s+1 = 0, X s+2 = 0) ≡ (0, 0, 0) appears in S by generating K numerical sequences for fixed parameters L, m and N , and a particular set of transition probabilities. If the block (0, 0, 0) appears in K 000 of those sequences, then P (1) 000 ≡P ((0, 0, 0) ∈ S) = K 000 /K. We note that for K 1 P (1) 000 P ((0, 0, 0) ∈ S). We compare this result with the value P 000 ≡ 1 − (1 − p(0, 0, 0)) N −2 , where p(0, 0, 0) is the probability of occurrence of the block (0, 0, 0), which is computed from the transition probabilities.
In Fig. B1 we plot the results for both P In Fig. B2 we show similar curves but now considering sequences with memory m = 1. For each case, we fix p(1|1) = 0.6 and vary p(0|0) between 0 and 0.9.
The results plotted in both Figs. B1 and B2 clearly show that as N increases the difference between P (1) 000 and P (2) 000 vanishes. We have verified with our simulations that this holds for every block sequence and for different values of m. Therefore, Eq. (B1) is an excellent approximation when the size of the sequence is much larger than the size of the block, regardless of the memory value.
000 (black lines) for binary sequences with memory m = 1 as a function of the conditional probability p(0|0). For each combination of N and p(0|0), we generate K = 10 4 numerical sequences and calculate P (1) 000 from K000/K, where K000 is the number of sequences where the block (0, 0, 0) appears at least once. We also plot with blue lines P (2) 000 calculated from the transition probabilities. For all the cases considered we tale p(1|1) = 0.6. We show results for N = 100 in (a), 200 in (b), 500 in (c) and 1000 in (d). We observe that the black and blue curves overlap as N grows.