On Asymptotic Equipartition Property for Stationary Process of Moving Averages

Abstract: Let {X_n}_{n∈Z} be a stationary process taking values in a finite set. In this paper, we present a moving-average version of the Shannon–McMillan–Breiman theorem, which generalizes the corresponding classical results. A sandwich argument reduces the proof to direct applications of the moving strong law of large numbers. The result generalizes the work of Algoet and Cover, while relying on a similar sandwich method. It is worth noting that, in a certain sense, the indices a_n and φ(n) are symmetric: if the growth rate of (a_n)_{n∈Z} is slow enough, all conclusions in this article still hold.


Introduction
Information theory is mainly concerned with stationary random processes X = {X_n}_{n∈Z}, where X_n takes values in a set X with cardinality |X| < ∞. The strong convergence of the entropy at time n of a random process, divided by n, to a constant limit called the entropy rate of the process is known as the ergodic theorem of information theory or the asymptotic equipartition property (AEP) [1]. Its original version, proven in the 1950s for ergodic stationary processes, is known as the Shannon–McMillan theorem for the convergence in mean and as the Shannon–McMillan–Breiman theorem [2][3][4] for the almost everywhere convergence. Since then, generalized versions of the Shannon–McMillan–Breiman limit theorem have been developed by many authors [1,2,4,5]. Extensions have been made in the direction of weakening the assumptions on the reference measure, state space, index set and required properties of the process. For the general development, please see Girardin [6] and the references therein.
In statistics, smoothing data means creating an approximating function that attempts to capture important patterns in the data while leaving out noise. One of the most widely used smoothing methods is the moving average (MA). A number of authors have studied the question of almost everywhere convergence of moving averages for a measure-preserving invertible transformation of X, e.g., Akcoglu and del Junco [7]; Bellow, Jones, and Rosenblatt [8]; del Junco and Steele [9]; Schwartz [10]; and Haili and Nair [11]. Recently, Wang and Yang [12,13] proposed a new concept of the generalized entropy density, and established a generalized entropy ergodic theorem for time-nonhomogeneous Markov chains and for non-null stationary processes. Shi, Wang et al. [14] studied the generalized entropy ergodic theorem for nonhomogeneous Markov chains indexed by a binary tree.
Motivated by the work above, in this paper we give a moving-average version of the Shannon–McMillan–Breiman theorem. The results in this paper generalize those of [2]. It is worth noting that, in some sense, the indices a_n and φ(n) are symmetric. In this paper we discuss the so-called forward moving average; if the growth rate of (a_n)_{n∈Z} with respect to the integer n is slow enough, all conclusions in this article still hold true, i.e., the corresponding results for the backward moving average are still established.
The method used in showing the main results is the "sandwich" approximation approach of Algoet and Cover [2], which depends strongly on the moving strong law of large numbers: sample entropy is asymptotically sandwiched between two functions whose limits can be determined from the moving SLLN theorem.
This paper is organized as follows. In Section 2, we introduce the necessary preparatory material: some required preliminaries and three lemmas. In Section 3, we give the main results, study some of their properties, and give examples of applications.

Preliminaries
Throughout this section, let (Ω, F, P) denote a fixed probability space and let {X_n}_{n∈Z} be a stationary sequence taking values in a finite set X. For integers i ≤ j, write X_i^j = (X_i, X_{i+1}, ..., X_j), let p(x_i^j) = P(X_i^j = x_i^j) denote the joint distribution, and define the conditional probabilities p(x_j | x_i^{j−1}) = p(x_i^j)/p(x_i^{j−1}) wherever the conditioning event has positive probability. Define random variables p(X_i^j) and p(X_j | X_i^{j−1}) by setting x_i^j = X_i^j(ω) in the corresponding definitions. Since P(p(X_i^j) = 0) = 0, the conditional probability makes sense P-a.e. (i.e., it holds almost everywhere under the measure P).
Definition 1 (see, e.g., [2]). The canonical Markov approximation of order m to the probability p(X_1^j) is defined for j > m as

p^{[m]}(X_1^j) = p(X_1^m) ∏_{i=m+1}^{j} p(X_i | X_{i−m}^{i−1}).

We will prove a new version of the AEP for a stationary process {X_n}_{n∈Z}. Before developing the main theme of the paper, we shall need to derive some basic lemmas. Let {a_n, φ(n)}_{n∈Z} be pairs of positive integers such that φ(n) → ∞ as n → ∞ and, for every ε > 0,

∑_{n=1}^{∞} 2^{−εφ(n)} < ∞

(this holds, for instance, whenever φ(n)/log n → ∞).

Lemma 1. Let {X_n}_{n∈Z} be a stationary process with values in a finite set X; then, we have

lim sup_{n→∞} (1/φ(n)) log [ p^{[m]}(X_{a_n}^{a_n+φ(n)−1}) / p(X_{a_n}^{a_n+φ(n)−1}) ] ≤ 0  a.e.  (1)

and

lim sup_{n→∞} (1/φ(n)) log [ p(X_{a_n}^{a_n+φ(n)−1}) / p(X_{a_n}^{a_n+φ(n)−1} | X_{−∞}^{a_n−1}) ] ≤ 0  a.e.,  (2)

where the base of the logarithm is taken to be 2.
Proof. Let A denote the support set of p(X_{a_n}^{a_n+φ(n)−1}). Then

E_P[ p^{[m]}(X_{a_n}^{a_n+φ(n)−1}) / p(X_{a_n}^{a_n+φ(n)−1}) ] = ∑_{x_{a_n}^{a_n+φ(n)−1} ∈ A} p^{[m]}(x_{a_n}^{a_n+φ(n)−1}) ≤ 1,  (3)

where E_P indicates taking expectation under the measure P. Similarly, let B(X_{−∞}^{a_n−1}) denote the support set of p(· | X_{−∞}^{a_n−1}). Then, we have

E_P[ p(X_{a_n}^{a_n+φ(n)−1}) / p(X_{a_n}^{a_n+φ(n)−1} | X_{−∞}^{a_n−1}) ] = E_P[ ∑_{x_{a_n}^{a_n+φ(n)−1} ∈ B(X_{−∞}^{a_n−1})} p(x_{a_n}^{a_n+φ(n)−1}) ] ≤ 1.  (4)

By Markov's inequality and Equation (4), we have, for any ε > 0,

P{ (1/φ(n)) log [ p(X_{a_n}^{a_n+φ(n)−1}) / p(X_{a_n}^{a_n+φ(n)−1} | X_{−∞}^{a_n−1}) ] ≥ ε } ≤ 2^{−εφ(n)}.

Since ∑_n 2^{−εφ(n)} < ∞, we see by the Borel–Cantelli lemma that the event above occurs only finitely often with probability one, so that

lim sup_{n→∞} (1/φ(n)) log [ p(X_{a_n}^{a_n+φ(n)−1}) / p(X_{a_n}^{a_n+φ(n)−1} | X_{−∞}^{a_n−1}) ] ≤ ε  a.e.

By the arbitrariness of ε, we have Equation (2). Applying the same argument, with Markov's inequality now applied to Equation (3), we obtain Equation (1). This proves the lemma.
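As a small illustration of Definition 1, the sketch below computes the order-m approximation p^{[m]} for a hypothetical two-state stationary Markov chain (the transition matrix T is an illustrative assumption, not taken from the paper); since the true process is first-order Markov, p^{[1]} coincides with p exactly, while p^{[0]} (the i.i.d. approximation) generally differs:

```python
# Sketch of Definition 1: order-m canonical Markov approximation p^[m]
# for a hypothetical two-state stationary Markov chain with transition
# matrix T. The chain is first-order, so p^[1] equals the true path
# probability, while p^[0] (product of marginals) does not.
import numpy as np

T = np.array([[0.9, 0.1],
              [0.4, 0.6]])                      # assumed transition matrix
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()                              # stationary distribution

def p_true(x):
    """Exact probability of the path x under the stationary chain."""
    prob = pi[x[0]]
    for a, b in zip(x, x[1:]):
        prob *= T[a, b]
    return float(prob)

def p_markov_approx(x, m):
    """p^[m](x) = p(x_1^m) * prod_{j>m} p(x_j | x_{j-m}^{j-1})."""
    if m == 0:
        return float(np.prod(pi[list(x)]))      # product of marginals
    prob = p_true(x[:m])                        # exact prefix probability
    for j in range(m, len(x)):
        # for a first-order chain, conditioning on m >= 1 past symbols
        # reduces to conditioning on the most recent one
        prob *= T[x[j - 1], x[j]]
    return float(prob)

x = (0, 0, 1, 0, 1, 1, 0)
print(p_markov_approx(x, 0))                    # i.i.d. approximation
print(p_markov_approx(x, 1), p_true(x))         # order 1 is exact here
```

The ratio p^{[m]}(x)/p(x) appearing in Lemma 1 can be formed directly from these two functions.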
Lemma 2 (SLLN for MA). For a stationary stochastic process {X_n}_{n∈Z},

lim_{n→∞} −(1/φ(n)) log p^{[m]}(X_{a_n}^{a_n+φ(n)−1}) = H^m  a.e.  (5)

and

lim_{n→∞} −(1/φ(n)) log p(X_{a_n}^{a_n+φ(n)−1} | X_{−∞}^{a_n−1}) = H^∞  a.e.,  (6)

where H^m = E[−log p(X_m | X_0^{m−1})] and H^∞ = E[−log p(X_0 | X_{−∞}^{−1})].

Proof. Write S_n = −log p^{[m]}(X_{a_n}^{a_n+φ(n)−1}). By stationarity, E_P S_n = H(X_1^m) + (φ(n) − m) H^m, so that E_P S_n / φ(n) → H^m; it therefore suffices to prove

lim_{n→∞} (1/φ(n)) (S_n − E_P S_n) = 0  a.e.  (7)

Let s ∈ [−1/2, 1/2] \ {0}, and define

f_n(s, ω) = 2^{s S_n} / E_P[2^{s S_n}].  (8)

Note that

E_P[f_n(s, ·)] = 1 for every n.  (9)

By Equations (8) and (9), Markov's inequality and the Borel–Cantelli lemma (using ∑_n 2^{−εφ(n)} < ∞, as in the proof of Lemma 1), together with the property of superior limits, we have

lim sup_{n→∞} (1/φ(n)) log f_n(s, ω) ≤ 0  a.e.  (10)

Setting s ∈ (0, 1/2] and dividing both sides of Equation (10) by s, we obtain

lim sup_{n→∞} (1/φ(n)) ( S_n − (1/s) log E_P[2^{s S_n}] ) ≤ 0  a.e.  (11)

Using the inequalities log x ≤ (x − 1)/ln 2 and 2^x ≤ 1 + x ln 2 + (x ln 2)² 2^{|x|}/2, together with the fact that max{t^{1/2} ln² t : 0 ≤ t ≤ 1} = 16e^{−2}, which bounds the second-moment terms arising from the finite alphabet, we can estimate

(1/s) log E_P[2^{s S_n}] ≤ E_P S_n + C s φ(n),  (12)

where C is a constant depending only on |X| and m. From Equations (11) and (12), we have

lim sup_{n→∞} (1/φ(n)) (S_n − E_P S_n) ≤ C s  a.e.  (13)

Letting s ↓ 0 in Equation (13), we obtain

lim sup_{n→∞} (1/φ(n)) (S_n − E_P S_n) ≤ 0  a.e.

Replacing s ∈ (0, 1/2] by s ∈ [−1/2, 0) in the above argument, we obtain the corresponding inequality for the inferior limit. These together imply Equation (7), and hence Equation (5). (Note that P ≪ P^{[m]}, so statements established P^{[m]}-almost everywhere also hold P-almost everywhere.) Similarly, letting s be a nonzero real number and defining f_n(s, ω) with S_n replaced by −log p(X_{a_n}^{a_n+φ(n)−1} | X_{−∞}^{a_n−1}), the remainder of the argument is analogous to that used in proving Equation (5) and is left to the reader.
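The moving strong law of large numbers underlying Lemma 2 can be visualized with a short sketch; the i.i.d. Bernoulli source and the window choice a_n = n, φ(n) = n are illustrative assumptions (this choice satisfies φ(n) → ∞ and the summability condition):

```python
# Sketch of the moving strong law of large numbers behind Lemma 2:
# sample means over the windows X_{a_n}, ..., X_{a_n + phi(n) - 1},
# with a_n = n and phi(n) = n, for an i.i.d. Bernoulli(p) sequence.
import random

random.seed(0)
p = 0.3
N = 4000
xs = [1 if random.random() < p else 0 for _ in range(2 * N)]

def moving_average(n):
    a_n, phi_n = n, n                  # window start and width
    window = xs[a_n:a_n + phi_n]
    return sum(window) / phi_n

for n in (10, 100, 1000, N):
    print(n, moving_average(n))        # approaches E[X_0] = p
```

Even though consecutive windows overlap only partially and drift to the right, the summability condition on φ(n) keeps the Borel–Cantelli argument alive, which is what the lemma exploits.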
Lemma 3. With H^m and H^∞ as in Lemma 2, H^m ↘ H^∞ = H, where H is the entropy rate of the process.

Proof. We know that for stationary processes H^m ↘ H, so it remains to show that lim_{m→∞} H^m = H^∞. Since all random variables are discrete, we may write

H^m = E[ −∑_{x_0 ∈ X} p(x_0 | X_{−m}^{−1}) log p(x_0 | X_{−m}^{−1}) ].

For each x_0 ∈ X, the sequence p(x_0 | X_{−m}^{−1}) is a non-negative martingale in m, hence converges a.e. to p(x_0 | X_{−∞}^{−1}). Note that, for any m,

H^m = E[ −log p(X_m | X_0^{m−1}) ] = E[ −log p(X_0 | X_{−m}^{−1}) ],

where the last equation follows from stationarity. Since X is finite and p log p is bounded and continuous in p for all 0 ≤ p ≤ 1, the bounded convergence theorem allows the interchange of expectation and limit, yielding

lim_{m→∞} H^m = E[ −∑_{x_0 ∈ X} p(x_0 | X_{−∞}^{−1}) log p(x_0 | X_{−∞}^{−1}) ] = H^∞.

Hence H^∞ = lim_{m→∞} H^m = H.
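The monotone convergence H^m ↘ H used above can be checked on a small example. The sketch below uses a hypothetical two-state first-order Markov chain, for which H^0 = H(π) strictly exceeds the entropy rate H while H^m = H already for every m ≥ 1:

```python
# Sketch: the conditional entropies H^m decrease to the entropy rate H.
# Hypothetical two-state first-order stationary Markov chain: H^0 is the
# marginal entropy H(pi); H^m for m >= 1 equals the entropy rate H.
import numpy as np

T = np.array([[0.9, 0.1],
              [0.4, 0.6]])                      # assumed transition matrix
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()                              # stationary distribution

def h(dist):
    """Shannon entropy (base 2) of a probability vector."""
    d = np.asarray(dist, dtype=float)
    nz = d[d > 0]
    return float(-(nz * np.log2(nz)).sum())

H0 = h(pi)                                      # H^0 = H(X_0)
H = float(sum(pi[i] * h(T[i]) for i in range(2)))  # entropy rate = H^1
print("H^0 =", H0, " H =", H)                   # H^0 > H
```

For longer-memory processes the decrease continues past m = 1; the Markov-chain example merely makes the first step of the descent explicit.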

Main Results
With the preliminaries accounted for, we wish to use Lemma 1 to conclude that

lim_{n→∞} −(1/φ(n)) log p(X_{a_n}^{a_n+φ(n)−1}) = H  a.e.  (17)

It is not easy to prove Equation (17) directly. However, the limits of the closely related quantities p^{[m]}(X_{a_n}^{a_n+φ(n)−1}) and p(X_{a_n}^{a_n+φ(n)−1} | X_{−∞}^{a_n−1}) are easily identified as entropy rates (Lemma 2). Recall that the entropy rate is given by

H = lim_{n→∞} (1/n) H(X_1, X_2, ..., X_n).

Of course, H^m ↘ H by stationarity and the fact that conditioning does not increase entropy. It will be crucial that H^m ↘ H^∞ = H. With the help of the preceding lemmas, we can now prove the following theorem:

Theorem 1 (AEP). If H is the entropy rate of a finite-valued stationary process {X_n}_{n∈Z}, then it holds that

lim_{n→∞} −(1/φ(n)) log p(X_{a_n}^{a_n+φ(n)−1}) = H  a.e.

Remark 1. In the case a_n ≡ 1, φ(n) = n, Theorem 1 reduces to the famous Shannon–McMillan–Breiman theorem, a fundamental theorem of information theory. Letting a_n = n gives a delayed-average version of the AEP.
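The delayed-average case a_n = n mentioned in Remark 1 can be checked in a small simulation. The sketch below is for an i.i.d. Bernoulli source (the parameter p = 0.3 and the sample sizes are illustrative assumptions, not part of the theorem):

```python
# Simulation sketch of Theorem 1 for an i.i.d. Bernoulli(p) source with
# the delayed-average choice a_n = n, phi(n) = n from Remark 1.
import math
import random

random.seed(1)
p = 0.3
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # entropy rate
xs = [1 if random.random() < p else 0 for _ in range(20000)]

def window_entropy_rate(n):
    """-(1/phi(n)) * log2 p(X_{a_n}^{a_n+phi(n)-1}) with a_n = phi(n) = n."""
    a_n, phi_n = n, n
    window = xs[a_n:a_n + phi_n]
    log_p = sum(math.log2(p if x == 1 else 1 - p) for x in window)
    return -log_p / phi_n

for n in (10, 100, 1000, 10000):
    print(n, window_entropy_rate(n), "H =", H)  # approaches H
```

For an i.i.d. source the window probability factorizes, so the per-symbol log-likelihood is exactly a moving sample average of −log₂ p(X_k); Theorem 1 asserts that it converges to H almost everywhere.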
Proof. We argue that the sequence of random variables −(1/φ(n)) log p(X_{a_n}^{a_n+φ(n)−1}) is asymptotically sandwiched between the upper bound H^m (for every m ≥ 0) and the lower bound H^∞. The AEP will follow, since H^m → H^∞ and H^∞ = H.
From the first conclusion of Lemma 1, we have

lim sup_{n→∞} (1/φ(n)) log [ p^{[m]}(X_{a_n}^{a_n+φ(n)−1}) / p(X_{a_n}^{a_n+φ(n)−1}) ] ≤ 0  a.e.,

which we rewrite, using the existence of the limit lim_{n→∞} −(1/φ(n)) log p^{[m]}(X_{a_n}^{a_n+φ(n)−1}) = H^m from Lemma 2, as

lim sup_{n→∞} −(1/φ(n)) log p(X_{a_n}^{a_n+φ(n)−1}) ≤ H^m  a.e. for every m ≥ 0.

Similarly, from the second conclusion of Lemma 1 and the definition of H^∞ in Lemma 2, we have

lim inf_{n→∞} −(1/φ(n)) log p(X_{a_n}^{a_n+φ(n)−1}) ≥ H^∞  a.e.

Putting these together with Lemma 3 (H^m ↘ H^∞ = H) completes the proof. Now, we give some interesting applications of our main results in the next examples.
Example 1. Let {X_n}_{n∈Z} be independent, identically distributed random variables drawn from the probability mass function p(x); then

−(1/φ(n)) log p(X_{a_n}^{a_n+φ(n)−1}) → H(p) = −∑_{x∈X} p(x) log p(x)  a.e.

Example 2. Let {X_n}_{n∈Z} be independent, identically distributed random variables drawn according to the probability mass function p(x), x ∈ X, and let q be another probability mass function on X; then

−(1/φ(n)) log q(X_{a_n}^{a_n+φ(n)−1}) → H(p) + D(p ∥ q)  a.e.,

where D(p ∥ q) is the informational divergence between the two probability distributions p and q on the common alphabet X.

Since convergence almost everywhere implies convergence in probability, Theorem 1 has the following implication:

Definition 2. The typical set A_ε^{(a_n,φ(n))} with respect to P is the set of sequences x_{a_n}^{a_n+φ(n)−1} ∈ X^{φ(n)} with the property

2^{−φ(n)(H+ε)} ≤ p(x_{a_n}^{a_n+φ(n)−1}) ≤ 2^{−φ(n)(H−ε)}.

As a consequence of Theorem 1, we can show that the set A_ε^{(a_n,φ(n))} has the following properties:

Proposition 1. Let {X_n}_{n∈Z} be independent, identically distributed random variables drawn from the probability mass function p(x); then:
(1) If x_{a_n}^{a_n+φ(n)−1} ∈ A_ε^{(a_n,φ(n))}, then H − ε ≤ −(1/φ(n)) log p(x_{a_n}^{a_n+φ(n)−1}) ≤ H + ε.
(2) P(A_ε^{(a_n,φ(n))}) > 1 − ε for sufficiently large n.
(3) |A_ε^{(a_n,φ(n))}| ≤ 2^{φ(n)(H+ε)}, where |A| denotes the number of elements in the set A.
(4) |A_ε^{(a_n,φ(n))}| ≥ (1 − ε) 2^{φ(n)(H−ε)} for sufficiently large n.
Proof. Property (1) is immediate from the definition of A_ε^{(a_n,φ(n))}. Property (2) follows directly from Theorem 1, since the probability of the event {X_{a_n}^{a_n+φ(n)−1} ∈ A_ε^{(a_n,φ(n))}} tends to 1 as n → ∞. Thus, for any δ > 0, there exists an n_0 such that, for all n ≥ n_0, we have

P{ | −(1/φ(n)) log p(X_{a_n}^{a_n+φ(n)−1}) − H | < ε } > 1 − δ.  (19)

Setting δ = ε, we obtain P(A_ε^{(a_n,φ(n))}) > 1 − ε for all sufficiently large n. To prove property (3), notice that

1 = ∑_{x ∈ X^{φ(n)}} p(x) ≥ ∑_{x ∈ A_ε^{(a_n,φ(n))}} p(x) ≥ ∑_{x ∈ A_ε^{(a_n,φ(n))}} 2^{−φ(n)(H+ε)} = 2^{−φ(n)(H+ε)} |A_ε^{(a_n,φ(n))}|,

where the second inequality follows from Definition 2; hence |A_ε^{(a_n,φ(n))}| ≤ 2^{φ(n)(H+ε)}. Finally, for sufficiently large n, by Equation (19),

1 − ε < P(A_ε^{(a_n,φ(n))}) ≤ ∑_{x ∈ A_ε^{(a_n,φ(n))}} 2^{−φ(n)(H−ε)} = 2^{−φ(n)(H−ε)} |A_ε^{(a_n,φ(n))}|,

where the second inequality follows from Definition 2. Therefore, |A_ε^{(a_n,φ(n))}| ≥ (1 − ε) 2^{φ(n)(H−ε)}. This completes the proof of the proposition.
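Proposition 1 can be checked by brute-force enumeration for a small i.i.d. Bernoulli source. The values p = 0.3, ε = 0.2, n = 18 below are illustrative; at such a short block length the probability of the typical set is still well below 1 − ε, which is consistent with properties (2) and (4) holding only for sufficiently large n:

```python
# Brute-force check of the typical-set bounds in Proposition 1 for an
# i.i.d. Bernoulli(p) source. Window indices play no role for an i.i.d.
# source, so we take a_n = 0 and phi(n) = n (illustrative choices).
import itertools
import math

p, eps, n = 0.3, 0.2, 18
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

prob_A, size_A = 0.0, 0
for x in itertools.product((0, 1), repeat=n):
    k = sum(x)                                  # number of ones
    px = (p ** k) * ((1 - p) ** (n - k))        # probability of the block
    if 2 ** (-n * (H + eps)) <= px <= 2 ** (-n * (H - eps)):
        prob_A += px
        size_A += 1

print("P(A_eps) =", prob_A)                     # tends to 1 as n grows
print("|A_eps|  =", size_A)
print("upper bd =", 2 ** (n * (H + eps)))       # property (3)
print("lower bd =", (1 - eps) * 2 ** (n * (H - eps)))  # cf. property (4)
```

The cardinality bounds (3) and (4) already hold at this block length, while the probability bound (2) illustrates how slowly P(A_ε) approaches 1.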
Now fix ε > 0 and let B^{(a_n,φ(n))} ⊂ X^{φ(n)} be a sequence of sets of windows for which the moving strong law of large numbers gives P(X_{a_n}^{a_n+φ(n)−1} ∈ B^{(a_n,φ(n))}) → 1 (for instance, the set of windows whose empirical frequency of each symbol is within ε of its probability). Then we have the following:
(1) lim_{n→∞} P{X_{a_n}^{a_n+φ(n)−1} ∈ A_ε^{(a_n,φ(n))}} = 1;
(2) P{X_{a_n}^{a_n+φ(n)−1} ∈ A_ε^{(a_n,φ(n))} ∩ B^{(a_n,φ(n))}} > 1 − ε, for sufficiently large n.
Proof. (1) By Theorem 1, the probability that X_{a_n}^{a_n+φ(n)−1} is typical tends to 1.
(2) By the strong law of large numbers for moving averages, we have P(X_{a_n}^{a_n+φ(n)−1} ∈ B^{(a_n,φ(n))}) → 1. So, for any ε > 0, there exists N_1 such that P(X_{a_n}^{a_n+φ(n)−1} ∈ A_ε^{(a_n,φ(n))}) > 1 − ε/2 for all n > N_1, and there exists N_2 such that P(X_{a_n}^{a_n+φ(n)−1} ∈ B^{(a_n,φ(n))}) > 1 − ε/2 for all n > N_2. So, for all n > max(N_1, N_2),

P(X_{a_n}^{a_n+φ(n)−1} ∈ A_ε^{(a_n,φ(n))} ∩ B^{(a_n,φ(n))}) ≥ 1 − P(X_{a_n}^{a_n+φ(n)−1} ∉ A_ε^{(a_n,φ(n))}) − P(X_{a_n}^{a_n+φ(n)−1} ∉ B^{(a_n,φ(n))}) > 1 − ε.

Hence, for any ε > 0, there exists N = max(N_1, N_2) such that P(X_{a_n}^{a_n+φ(n)−1} ∈ A_ε^{(a_n,φ(n))} ∩ B^{(a_n,φ(n))}) > 1 − ε for all n > N; therefore, P(X_{a_n}^{a_n+φ(n)−1} ∈ A_ε^{(a_n,φ(n))} ∩ B^{(a_n,φ(n))}) → 1.