Nonparametric statistical inference for the context tree of a stationary ergodic process

We consider the problem of estimating the context tree of a stationary ergodic process with finite alphabet without imposing additional conditions on the process. As a starting point we introduce a Hamming metric in the space of irreducible context trees and we use the properties of the weak topology in the space of ergodic stationary processes to prove that if the Hamming metric is unbounded, there exist no consistent estimators for the context tree. Even in the bounded case we show that there exist no two-sided confidence bounds. However we prove that one-sided inference is possible in this general setting and we construct a consistent estimator that is a lower bound for the context tree of the process with an explicit formula for the coverage probability. We develop an efficient algorithm to compute the lower bound and we apply the method to test a linguistic hypothesis about the context tree of codified written texts in European Portuguese.


Introduction
In this work we address the issue of whether or not there exist consistent estimators (and confidence bounds) for the context tree of a discrete time stationary ergodic process with finite alphabet. In words, the context tree of a stochastic process is a set of finite strings or left-infinite sequences that determines the portion of the past the process has to look at in order to decide the distribution of its next symbol. For example, an i.i.d. process has the tree consisting only of the empty string as context tree, a Markov chain of order 1 has a context tree containing strings of length (at most) 1, and an infinite memory process has a context tree having at least one left-infinite sequence. We refer to this statistical inference problem as nonparametric because we make no further assumptions concerning the distribution of the process.
Finite context trees were introduced by Rissanen (1983) as an efficient tool for data compression. The corresponding processes were originally called Variable length Markov Chains (VLMC) and their estimation was first addressed in Bühlmann and Wyner (1999). Recently, they have received increasing attention in the applied statistics literature, being used in a wide range of problems from different areas (Bejerano and Yona;2001;Dalevi et al.;2012, for instance). Their success in real-world applications seems to stem from their parsimony (including memory only where the data need it) and their capacity to capture structural dependencies in the data. The drawback of the model, when compared to finite-step Markov models for instance, is that estimation is a much more complicated task. When he introduced the model, Rissanen (1983) also provided an algorithm for estimating the (finite) context tree from a given sample. Since then, a large part of the related statistical literature has focused on consistent estimation of the context tree in the finite and infinite memory cases; an incomplete list includes (Bühlmann and Wyner;Collet et al.;Csiszár and Talata;Garivier and Leonardi;2011). Most of these works make additional assumptions on the processes, such as lower bounding the transition probabilities or imposing mixing conditions. In the nonparametric case, that is, in the general class of stationary ergodic processes over a finite alphabet, Csiszár and Talata (2006) proved the consistency of the Bayesian Information Criterion (BIC) when the context trees are truncated to a given finite length (the truncation being necessary only for infinite context trees). Interestingly, nothing has been done concerning confidence bounds as far as we know.
Given a sample of a stationary ergodic process, it is natural to wonder whether this process has a finite or infinite context tree. Interestingly, this cannot be consistently decided in this general class (Bailey, 1976; Morvai and Weiss, 2005). That is, there exists no two-valued function of the sample which, as the sample size increases, stabilizes to the value "yes" for every process having a finite context tree and "no" for every process having an infinite context tree. Thus, when considering the discrete metric in the space of trees, the existence of a universal consistent estimator relies on assumptions that cannot be checked empirically. This situation has its counterpart in nonparametric statistics for i.i.d. observations. For instance, Fraiman and Meloche (1999) observed that it is impossible to decide, from a random sample, whether or not the underlying distribution has a finite number of modes. Assuming a priori that the number of modes is finite, it can be consistently estimated.
In the present work the space of irreducible context trees with finite alphabet is equipped with the Hamming distance. Using only topological arguments we prove that if this metric in the space of trees is unbounded, there exists no consistent estimator of the context tree in the class of stationary and ergodic processes. In the bounded case, we construct an estimator that is consistent and also a nonparametric lower bound with an explicit coverage probability, based on a result of Garivier and Leonardi (2011). Finally, following Donoho (1988), we also prove that it is not possible to obtain nonparametric upper bounds. To our knowledge, this is the first work considering the problem of construction of nonparametric confidence bounds for context trees.
Notation, definitions and necessary background are given in the next section. We state the results in Section 3 and prove them in Section 4.

Notation and basic definitions
Let A be a finite set called alphabet. For any m ≤ n, we denote by $a_m^n$ the string $a_m \dots a_n$ of symbols of A with length n − m + 1. This notation is also valid for m = −∞, in which case we obtain a string $a_{-\infty}^n$ which is infinite on the left. The length of a string w will be denoted by |w|. For any j ∈ {0, 1, . . .}, we let $A^j$ denote the set of strings of symbols of A having length j; in particular $A^0 = \{\emptyset\}$ consists only of the empty string. We also let $A^\star = \cup_{j\ge 0} A^j$ denote the set of all finite strings on A and $A^\infty$ the set of all left-infinite sequences $a_{-\infty}^n$ with symbols in A.
We will need to concatenate strings. For instance, if $v \in A^i$ and $w \in A^j$ are strings of length i and j respectively, then vw denotes the string of length i + j obtained by naturally concatenating both strings. We also extend concatenation to the case where $v \in A^\infty$ is an infinite string on the left. We say that w is a suffix of the sequence s if there exists a sequence v such that s = vw. When |v| ≥ 1 we say that w is a proper suffix of s.
A tree τ is any set of strings or perhaps of left-infinite sequences, called leaves, such that no w ∈ τ is a proper suffix of any other s ∈ τ. We say that the tree τ is irreducible if no w ∈ τ can be replaced by a proper suffix without violating the tree property. The tree consisting of the entire set $A^\infty$ of left-infinite sequences will be denoted by $\tau_\infty$ and the tree consisting of the unique sequence ∅ will be denoted by $\tau_{\mathrm{root}}$.
Any finite string that is a suffix of some s ∈ τ will be called a node of τ. Sometimes it will be convenient to identify τ with the set of its nodes $\bar\tau \subset A^\star$. In fact it is easy to verify that τ uniquely determines $\bar\tau$ and vice versa.
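The definitions of tree, leaf and node can be made concrete with a minimal Python sketch. It is only an illustration under stated assumptions: leaves are finite strings over the alphabet {0, 1}, written with the most recent symbol last, and the helper names (`is_proper_suffix`, `is_tree`, `node_set`, `find_leaf`) are hypothetical and do not come from the paper. Left-infinite leaves are not handled.

```python
from typing import Iterable, Optional, Set


def is_proper_suffix(w: str, s: str) -> bool:
    """True if s = v + w for some non-empty string v."""
    return len(w) < len(s) and s.endswith(w)


def is_tree(leaves: Iterable[str]) -> bool:
    """Tree property: no leaf is a proper suffix of another leaf."""
    leaves = list(leaves)
    return not any(is_proper_suffix(w, s) for w in leaves for s in leaves)


def node_set(leaves: Iterable[str]) -> Set[str]:
    """All finite suffixes of the leaves, including the empty string ''."""
    return {s[i:] for s in leaves for i in range(len(s) + 1)}


def find_leaf(past: str, leaves: Iterable[str]) -> Optional[str]:
    """Return the leaf that is a suffix of `past`, or None if there is none."""
    for s in leaves:
        if past.endswith(s):
            return s
    return None


# Illustrative tree: after a 0 one symbol of memory suffices,
# after a 1 the symbol before it is also needed.
tau = {"0", "01", "11"}
print(is_tree(tau))                                      # True
print(sorted(node_set(tau), key=lambda v: (len(v), v)))  # ['', '0', '1', '01', '11']
print(find_leaf("0011", tau))                            # 11
```

By the tree property at most one leaf can be a suffix of a given past, so the order in which `find_leaf` scans the leaves does not matter.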
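We let T denote the set of all irreducible trees on A, with the following partial order: $\tau \le \tau'$ if and only if $\bar\tau \subset \bar\tau'$. Given a tree τ ∈ T and a constant k ∈ N, we denote by τ[k] = $A^k$ the complete tree of depth k, and by $\tau|_k$ the truncated tree at level k, defined by $\overline{\tau|_k} = \{v \in \bar\tau : |v| \le k\}$. Finally, T is equipped with the Hamming distance defined by
$$d_\varphi(\tau, \tau') = \sum_{v \in A^\star} \varphi(v)\,\big|\mathbf{1}\{v \in \bar\tau\} - \mathbf{1}\{v \in \bar\tau'\}\big|,$$
where φ is a fixed positive function on $A^\star$. When φ is summable, that is, when $\sum_{v \in A^\star} \varphi(v) < \infty$, the space $(T, d_\varphi)$ is a bounded metric space and $\tau_\infty$ is the unique accumulation point. Unless explicitly mentioned, we will always consider the summable case.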
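The order, the truncation and the Hamming distance can be sketched in Python on node sets. This is an illustration under assumptions that are not taken from the text: trees are represented by their finite sets of nodes, the distance is read as the φ-weighted size of the symmetric difference of the two node sets, and the weight `phi` below is one hypothetical summable choice for a binary alphabet.

```python
from typing import Callable, Set


def truncate(nodes: Set[str], k: int) -> Set[str]:
    """Nodes of the tree truncated at level k: keep nodes of length <= k."""
    return {v for v in nodes if len(v) <= k}


def leq(nodes1: Set[str], nodes2: Set[str]) -> bool:
    """Partial order on trees, read as inclusion of node sets."""
    return nodes1 <= nodes2


def hamming(nodes1: Set[str], nodes2: Set[str],
            phi: Callable[[str], float]) -> float:
    """phi-weighted size of the symmetric difference of two node sets."""
    return float(sum(phi(v) for v in nodes1 ^ nodes2))


def phi(v: str, alphabet_size: int = 2) -> float:
    """Hypothetical summable weight: (2|A|)^(-|v|), whose total mass is 2."""
    return (2.0 * alphabet_size) ** (-len(v))


order2 = {"", "0", "1", "01", "11"}   # nodes of the tree {0, 01, 11}
order1 = {"", "0", "1"}               # nodes of the tree {0, 1}

print(truncate(order2, 1) == order1)  # True
print(leq(order1, order2))            # True
print(hamming(order1, order2, phi))   # 0.125
```

With this weight, two trees that differ only at depth j are at distance of order $4^{-j}$, which is why deep disagreements contribute little in the summable case.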
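Let $\{X_i : i \in \mathbb{Z}\}$ be a stationary and ergodic process assuming values in the alphabet A. We denote by $P(a_m^n)$ the stationary probability of the string $a_m^n$, that is, $P(a_m^n) = \mathrm{Prob}(X_m^n = a_m^n)$. If $s \in A^\star$ is such that $P(s) > 0$, we write $P(a|s) = \mathrm{Prob}(X_0 = a \mid X_{-|s|}^{-1} = s)$ for a ∈ A. In the particular case s = ∅ we define $P(a|s) = \mathrm{Prob}(X_0 = a)$. A process as above is said to have law, or measure, P.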
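Definition 2.1. We say that the string $s \in A^\star$ is a context for a process with measure P if it satisfies
1. $P(s) > 0$.
2. For any a ∈ A and any finite string w having s as a suffix and such that $P(w) > 0$, $P(a|w) = P(a|s)$.
3. No proper suffix of s satisfies 2.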
An infinite context is a left-infinite sequence $x_{-\infty}^{-1}$ such that all its finite suffixes $x_{-n}^{-1}$, n = 1, 2, . . . have positive probability but none of them is a context.
The set of contexts of a process with measure P is an irreducible tree; it will be denoted by $\tau_P$. A process of infinite memory has $\tau_\infty$ as context tree. On the contrary, an i.i.d. process has context tree $\tau_{\mathrm{root}} = \{\emptyset\}$.
Let Σ be the σ-algebra on $\Omega = A^{\mathbb{Z}}$ obtained as the product of the discrete σ-algebra on A. Let P denote the set of all stationary ergodic probability measures over (Ω, Σ).
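Define the following distance in P:
$$D(P, Q) = \sum_{k \ge 1} 2^{-k}\,|P - Q|_k, \qquad \text{where} \quad |P - Q|_k = \sum_{a_1^k \in A^k} |P(a_1^k) - Q(a_1^k)|$$
is the k-th order variational distance. This distance is known in the literature as the weak distance, and the topology induced by it is known as the weak topology.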
Lemma 2.2. The space (P, D) is a Baire space.
In this paper we are interested in making statistical inference in P, that is, in inferring properties of P from samples $X_1, \dots, X_n$ of size n of the corresponding stationary and ergodic process. Classically, assuming that the data are generated by a measure in the huge class P amounts to saying that we are in a nonparametric setting. In the sequel we define the notion of consistency of a sequence of estimators in this framework.
Definition 2.3. Let F : P → F be a functional with values in some metric space (F, d). We say that F is consistently estimable on P (in probability) if there exists a sequence $\{F_n\}_{n \in \mathbb{N}}$ of statistics, with $F_n : A^n \to F$, such that for all P ∈ P, $d(F_n(X_1, \dots, X_n), F(P)) \to 0$ in P-probability.
In this case we say that {F n } n∈N is a consistent estimator for F on P. We say that F is strongly consistent on P if the convergence takes place almost surely with respect to the probability measure P , and in this case we say that {F n } n∈N is a strongly consistent estimator for F on P.
Proposition 2.4. Assume F : P → R is bounded (that is there exists R ∈ R such that |F (P )| ≤ R for all P ∈ P). If F is consistently estimable on P then F must be continuous on a dense subset of P.
Concerning confidence bounds, the following definition is taken from Donoho (1988).
Definition 2.5. We say that F admits a non-trivial upper confidence bound somewhere on P if there exists a sequence $\{U_n\}_{n \in \mathbb{N}}$ of statistics, with $\sup_{P \in P} P(U_n < \sup F(P)) = 1$ for all n ≥ 1, and such that for any α > 0 we have $\inf_{P \in P} P(F(P) < U_n) \ge 1 - \alpha$ for any sufficiently large n. Analogously, we say F admits a non-trivial lower confidence bound somewhere on P if there exists a sequence $\{L_n\}_{n \in \mathbb{N}}$ of statistics, with $\sup_{P \in P} P(L_n > \inf F(P)) = 1$ for all n ≥ 1, and such that for any α > 0 we have $\inf_{P \in P} P(F(P) > L_n) \ge 1 - \alpha$ for any sufficiently large n.

Main results
Our principal concern in this paper is about nonparametric inference for the functional T : P → T that assigns to any measure P ∈ P its associated context tree τ P ∈ T .
A first question is whether it is possible to decide, from a finite sample, whether the sum of the function φ over the nodes of the context tree is finite or not.
This result states, in particular, that the functional that attributes the value 1 if the measure is Markovian, and 0 otherwise, is not consistently estimable when φ is not summable. This is a known result; see Morvai and Weiss (2005) and references therein. However, our proof is completely different and based on topological properties of P.
The only if part of this theorem is a direct consequence of Theorem 3.1. The if part is proved constructively later: the estimator $T^c_n$ defined by (3.1) will be proved to be consistent when φ is summable.
Our first theorem concerning confidence bounds is a negative result stating that the functional T does not admit a non-trivial upper confidence bound on P.
This functional does, however, admit non-trivial lower confidence bounds. In what follows, we construct a statistic which will be proved to be a consistent estimator of T and a non-trivial lower confidence bound. Its definition requires some additional notation.
Given a sequence w, denote by $N_n(w)$ the number of occurrences of w in the sample $X_1, \dots, X_n$. If $N_n(w) > 0$, we define for any a ∈ A the estimated transition probability $\hat p_n(a|w) := N_n(wa) / N_n(w)$.
In the case $N_n(w) = 0$ we use the convention $\hat p_n(a|w) = 1/|A|$ for any a ∈ A. Given a tree τ ∈ T we also define the length of the smallest context of τ, $\ell(\tau) := \min\{|v| : v \in \tau\}$.
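A minimal Python sketch of the counts and of the estimated transition probabilities follows. It assumes that occurrences are counted with overlaps and that the empty string occurs n times; these conventions, the function names `count` and `p_hat`, and the binary alphabet are assumptions made for the illustration only.

```python
def count(sample: str, w: str) -> int:
    """Number of (possibly overlapping) occurrences of w in the sample."""
    if not w:
        # Assumed convention: the empty string occurs once per symbol.
        return len(sample)
    return sum(1 for i in range(len(sample) - len(w) + 1)
               if sample.startswith(w, i))


def p_hat(sample: str, a: str, w: str, alphabet: str = "01") -> float:
    """Estimated transition probability N_n(wa) / N_n(w).

    Falls back to the uniform value 1/|A| when w does not occur.
    """
    n_w = count(sample, w)
    if n_w == 0:
        return 1.0 / len(alphabet)
    return count(sample, w + a) / n_w


sample = "0110100110"
print(count(sample, "1"), count(sample, "10"))  # 5 3
print(p_hat(sample, "0", "1"))                  # 0.6
print(p_hat(sample, "0", "111"))                # 0.5 (the string 111 never occurs)
```

With this counting convention an occurrence of w at the very end of the sample has no successor, so the estimates over a ∈ A may sum to slightly less than 1; the effect disappears as n grows.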
Consider t(n), any integer-valued function satisfying t(n) → ∞ and t(n)/log(n) → 0 when n → ∞. We introduce a discrepancy measure between a sample $X_1, \dots, X_n$ and a measure Q ∈ P as a function $d_n : A^n \times P \to \mathbb{R}$ defined by

We are now ready to define the statistic of interest. Given a constant c > 0, for any n ∈ N let $T^c_n$ :

Observe that the Markov measure having transition probabilities $\{\hat p_n(\cdot|w)\}_{w \in A^{t(n)}}$ belongs to the set and has context tree smaller than or equal to τ[t(n)]. Therefore we have $T^c_n(x_1^n) \le$ τ[t(n)] for all $x_1^n \in A^n$.

Theorem 3.4. The statistic $T^c_n(X_1^n)$ is a strongly consistent estimator of T on P, that is, if $X_1, \dots, X_n$ has law P, then $d_\varphi(T^c_n(X_1^n), \tau_P) \to 0$ almost surely as n → ∞.

Proofs
Proof of Lemma 2.2. With respect to the weak topology, the set of all stationary probability measures over (Ω, Σ) is a compact Hausdorff space (Shields, 1996) and the subspace P of all stationary and ergodic probability measures over (Ω, Σ) is a $G_\delta$ set (Parthasarathy, 1961, Theorem 2.1). Since every $G_\delta$ subset of a compact Hausdorff space is a Baire space, P is a Baire space with the induced topology.
Proof of Proposition 2.4. The proof uses the same arguments as Lemma 1.1 in Fraiman and Meloche (1999). The difference is that here we do not have independent random variables and the space P is not a complete metric space with respect to D. But the same result can be obtained in our setting, as we show in the sequel. Assume $\{F_n\}_{n \in \mathbb{N}}$ is a consistent sequence of estimates of F. Define

where I is the indicator function and sg is the sign of $F_n$. It is not hard to show that $\{S_n\}_{n \in \mathbb{N}}$ is also a consistent sequence of estimates of F; for the details see Fraiman and Meloche (1999, Lemma 1.1). Since for any n ∈ N the function $S_n$ is bounded by R, convergence in probability to F(P) implies convergence in mean. Therefore we have that

Therefore, for each n, $\varphi_n$ is uniformly continuous with respect to the weak topology (induced by D) on P. Then, by Lemma 2.2 and the Baire Category Theorem, the function F must be continuous on a dense subset of P.

Proof of Theorem 3.1
The proof is based on Proposition 2.4 and on the following lemma, that is the core of all our negative results.
Lemma 4.1. Any measure P ∈ P can be approximated, in D, by a sequence of measures {P n : n ≥ 1} in P having context tree τ ∞ , and by a sequence of measures {P ′ n : n ≥ 1} in P having finite context tree.
Proof. Along this proof, convergence is understood with respect to D.
For the proof of the first statement, we proceed in two steps: (1) we define a sequence of Markov measures $P^{[k]}$, k ≥ 1, converging to P, and (2) for any k ≥ 1, we construct a sequence of ergodic stationary measures $P^{[k]}_i$, i ≥ 1, converging to $P^{[k]}$ and having context tree $\tau_\infty$. The conclusion of the proof then follows by a diagonal argument, since convergence in D (or in the weak topology) corresponds to convergence of the measure of cylinders (Shields, 1996, Section I.9).
For any k ≥ 1, let $P^{[k]}$ be the k-step canonical Markov approximation of P, which has kernel $P^{[k]}(a|a_{-k}^{-1}) := P(X_0 = a \mid X_{-k}^{-1} = a_{-k}^{-1})$, a and $a_{-i}$ ∈ A, i = 1, . . . , k. It is well known that the sequence $P^{[k]}$, k ≥ 1, converges weakly to P (see Rudolph and Schwarz (1977) for instance). Before we define $P^{[k]}_i$, let us introduce the continuity rate of a kernel P along a given past a, which is the non-increasing [0, 1]-valued sequence $\{\beta^P_l(a)\}_{l \ge 1}$ defined as $\beta^P_l(a) := \sup$

Observe that $\beta^P_l(a) > 0$ means that the context tree of P has a branch of size larger than l along the past a, and $\beta^P_l(a) > 0$ for any l means that the branch is infinite. We now define the kernel P

where $r_i \nearrow 1$ is a sequence of (0, 1)-valued real numbers, and $\tilde P$ is a positive kernel satisfying (i) for any a, $\beta^{\tilde P}_l(a) > 0$ for all l ≥ 1, and (ii) $\sum_{l \ge 0} \sup_a \beta^{\tilde P}_l(a) < \infty$. Positivity is clearly satisfied since $P^{[k]}_i(a|a) \ge (1 - r_i)\tilde P(a|a) > 0$ for any a and any i ≥ 1. This kernel enjoys properties (i) and (ii) as well. This is because for any l ≥ k and a, β implying that β

We prove the second statement using a similar two-step argument. First, we use the sequence of canonical Markov approximations $P^{[k]}$, k ≥ 1, to approximate P. Second, as we do not know whether these Markov measures are ergodic or not, we construct, for any k ≥ 1, a sequence $P^{[k]}_i$, i ≥ 1, of ergodic Markov measures converging to $P^{[k]}$. The conclusion of the proof also follows from a diagonal argument.
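The construction of $P^{[k]}_i$, for any i, k ≥ 1, is also carried out as above. Define the kernel
$$P^{[k]}_i(a|a_{-k}^{-1}) := r_i\,P^{[k]}(a|a_{-k}^{-1}) + (1 - r_i)\,\frac{1}{|A|},$$
that is, a convex combination between the kernel of the k-step canonical Markov approximation and the uniform distribution on A. Then $P^{[k]}_i$ is ergodic (since $P^{[k]}_i > 0$), and the sequence $P^{[k]}_i$, i ≥ 1, converges to $P^{[k]}$.

We are now ready to prove Theorem 3.1.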
Proof of Theorem 3.1. Assume that $\sum_{v \in \bar\tau_\infty} \varphi(v) = +\infty$. Then Lemma 4.1 states that any P ∈ P having L(P) = i, i = 0, 1, is the limit (in D) of a sequence $P_n$, n ≥ 1, of measures in P satisfying $L(P_n) = 1 - i$, n ≥ 1. In other words, the functional L is discontinuous (with respect to the D-distance) at any point of P. Together with Proposition 2.4, this proves that L is not consistently estimable on P when $\sum_{v \in \bar\tau_\infty} \varphi(v) = +\infty$.

Proof of Theorem 3.2
As we already mentioned, the proof of the if part of the theorem follows from Theorem 3.4, which states that $T^c_n$ is actually a consistent estimator of $\tau_P$. It remains to prove the only if part. Assume $\sum_{v \in \bar\tau_\infty} \varphi(v) = +\infty$ and suppose there exists $\{T_n\}_{n \in \mathbb{N}}$, a consistent estimator of T on P. Define $L_n : A^n \to \{0, 1\}$ by $L_n(x_1^n) = \mathbf{1}\{\sum_{v \in \bar T_n(x_1^n)} \varphi(v) < +\infty\}$. We will prove that $\{L_n\}_{n \in \mathbb{N}}$ is a consistent estimator for L, which is in contradiction with Theorem 3.1, concluding the proof of the only if part of the theorem.
We will prove that for any ε > 0 the ball of center $\tau_P$ and radius ε contains only trees where L is constant and equal to $L(\tau_P)$. Let τ′ ∈ T, then

and for any two context trees $\tau_1$ and $\tau_2$ we have that

Then if $d_\varphi(\tau_P, \tau') < \varepsilon$ we have $L(\tau_P) = 1$ if and only if $L(\tau') = 1$. Therefore

which proves that L is consistently estimable on P. But by Theorem 3.1, L is not consistently estimable on P, which is a contradiction.

Proof of Theorem 3.4
We will use the following lemma.
Lemma 4.2. Given n, let P ∈ P be such that $\ell(\tau_P) \le t(n)$ and let $X_1, \dots, X_n$ be a sample with law P. Then there exists a constant γ > 0 such that for any constant c > 0, we have
$$P\big(d_n(X_1^n, P) \le c \log(n)\big) \ge 1 - \frac{e\,(c^2 \log^3(n) + 2|A|)}{n^{c^2 \log(n)/(2|A|) - \gamma}}.$$
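To get a feeling for the size of the bound in Lemma 4.2, the right-hand side can be evaluated numerically. The sketch below reads the bound as $1 - e\,(c^2 \log^3 n + 2|A|)\, n^{-(c^2 \log n/(2|A|) - \gamma)}$; the function name and the value of γ passed to it are hypothetical, since the lemma only asserts that some γ > 0 exists.

```python
import math


def coverage_lower_bound(n: int, c: float, alphabet_size: int, gamma: float) -> float:
    """Evaluate 1 - e (c^2 log^3 n + 2|A|) n^{-(c^2 log n / (2|A|) - gamma)}.

    gamma is a user-supplied placeholder for the unspecified constant of the lemma.
    """
    log_n = math.log(n)
    exponent = c ** 2 * log_n / (2 * alphabet_size) - gamma
    # n^{-exponent} is computed through exp/log to stay in floating point range.
    tail = math.e * (c ** 2 * log_n ** 3 + 2 * alphabet_size) * math.exp(-exponent * log_n)
    return 1.0 - tail


for c in (1.0, 3.0):
    for n in (100, 10_000, 1_000_000):
        print(c, n, coverage_lower_bound(n, c, alphabet_size=2, gamma=1.0))
```

Under these illustrative values the bound is negative, hence uninformative, for c = 1 and n = 100, while for c = 3 it is already very close to 1; the choice of c therefore trades coverage against the sharpness of the lower bound.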
Proof. For any two distributions on A, $Q_1(\cdot)$ and $Q_2(\cdot)$, the Kullback-Leibler divergence between $Q_1$ and $Q_2$, denoted by $D(Q_1; Q_2)$, is defined as
$$D(Q_1; Q_2) = \sum_{a \in A} Q_1(a) \log \frac{Q_1(a)}{Q_2(a)}.$$
Assume P satisfies the conditions of the lemma and fix w, a finite string that is a suffix of some context of $\tau_P$. Proposition A.7 in Garivier and Leonardi (2011) states that for any δ > 0
$$P\big(N_n(w)\, D(\hat p_n(\cdot|w); P(\cdot|w)) > \delta\big) \le 2e(\delta \log n + |A|) \exp(-\delta/|A|). \tag{4.2}$$
Recall that $\bar\tau|_{t(n)}$ denotes the set of nodes of $\tau_P$ having length smaller than or equal to t(n). Here we want to obtain an upper bound for
$$\mathcal{A} := P\Big(\sup_{w \in \bar\tau|_{t(n)}} N_n^{1/2}(w)\,|\hat p_n(\cdot|w) - P(\cdot|w)|_1 > c \log n\Big).$$
Recalling that t(n)/log n → 0, the term $|A|^{t(n)+1}$ grows at most polynomially and the lemma is proved.
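The step between (4.2) and the conclusion of the lemma can be sketched as follows. This is a plausible reconstruction rather than the authors' own computation: it assumes Pinsker's inequality $|Q_1 - Q_2|_1^2 \le 2 D(Q_1; Q_2)$, a union bound over the at most $|A|^{t(n)+1}$ nodes of length at most t(n) (valid for |A| ≥ 2), and the choice $\delta = c^2 \log^2(n)/2$ in (4.2).

```latex
\begin{align*}
P\Big(\sup_{w \in \bar\tau|_{t(n)}} N_n^{1/2}(w)\,|\hat p_n(\cdot|w) - P(\cdot|w)|_1 > c \log n\Big)
 &\le \sum_{w \in \bar\tau|_{t(n)}}
      P\big(N_n(w)\,|\hat p_n(\cdot|w) - P(\cdot|w)|_1^2 > c^2 \log^2 n\big) \\
 &\le \sum_{w \in \bar\tau|_{t(n)}}
      P\Big(N_n(w)\, D(\hat p_n(\cdot|w); P(\cdot|w)) > \tfrac{c^2 \log^2 n}{2}\Big) \\
 &\le |A|^{t(n)+1}\, 2e \Big(\tfrac{c^2 \log^3 n}{2} + |A|\Big)
      \exp\Big(-\tfrac{c^2 \log^2 n}{2|A|}\Big) \\
 &= |A|^{t(n)+1}\, e\,\big(c^2 \log^3 n + 2|A|\big)\, n^{-c^2 \log(n)/(2|A|)} .
\end{align*}
```

Since $|A|^{t(n)+1} = n^{(t(n)+1)\log|A|/\log n}$ and $t(n)/\log n \to 0$, the exponent $(t(n)+1)\log|A|/\log n$ is bounded by some constant γ for all n ≥ 2, which gives the factor $n^{\gamma}$ appearing in the statement.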
Hereafter, we will say that an event E n is satisfied eventually almost surely (e.a.s. in the sequel, it will always be understood that it is when n → ∞) if there exists N ⋆ = N ⋆ (X ∞ 1 ) a.s. finite such that E n is satisfied for any n ≥ N ⋆ .
Proof of Theorem 3.4. First notice that $T^c_n(X_1^n) \le \tau_P|_{t(n)}$ e.a.s. This is obvious if $\ell(\tau_P) = \infty$, and in the case where $\ell(\tau_P) < \infty$, this follows directly from the definition of $T^c_n(X_1^n)$ and the fact that the bound in Lemma 4.2 is summable in n, which implies, by the Borel-Cantelli lemma, that $d_n(X_1^n, P) \le c \log(n)$ e.a.s. Now, let ε > 0 and take k ∈ N such that $\sum_{u \in \bar\tau_P : |u| > k} \varphi(u) < \varepsilon$.
Since $T^c_n(X_1^n) \le \tau_P|_{t(n)} \le \tau_P$ e.a.s., we will have

Therefore it is enough to prove that $\bar\tau_P|_k \setminus T^c_n(X_1^n) = \emptyset$ e.a.s., or equivalently $T^c_n(X_1^n) \ge \tau_P|_k$ e.a.s. Let $v \in \bar\tau_P|_k$; we will prove that e.a.s. the set $\{\eta : d_n(X_1^n, \eta) \le c \log n\}$ is included in the set $\{\eta : v \in \bar\tau_\eta\}$. As the set $\bar\tau_P|_k$ is finite we will have that $\{\eta : d_n(X_1^n, \eta) \le c \log n\} \subset \{\eta : \tau_\eta \ge \tau_P|_k\}$ e.a.s. and therefore $T^c_n(X_1^n) \ge \tau_P|_k$ e.a.s.

It remains to explain why the above set inclusion is true. Note that for a sequence v as above it can be shown that there exists a finite string w such that $w \in \bar\tau_P$, $P(\cdot|v) \ne P(\cdot|w)$, and either w is the greatest proper suffix of v or v is a proper suffix of w. If v is a context for P then it is enough to take w equal to the greatest proper suffix of v and the assertion is satisfied. On the other hand, if v is not a context then it is a proper suffix of a context of P, and it can be shown that there exists a finite w having v as a suffix and satisfying $P(w) > 0$ and $P(\cdot|v) \ne P(\cdot|w)$, as shown in Csiszár and Talata (2006, proof of Lemma 3.1). Now, by ergodicity we have that P-almost surely
$$\frac{N_n(s)}{n} - P(s) \to 0 \quad \text{and} \quad |\hat p_n(\cdot|s) - P(\cdot|s)|_1 \to 0 \tag{4.3}$$
for s = v and s = w. Using the triangle inequality we have, for any Q ∈ P, that
$$\begin{aligned}
|Q(\cdot|v) - Q(\cdot|w)|_1 \ge{} & |P(\cdot|v) - P(\cdot|w)|_1 && (4.4)\\
& - |P(\cdot|v) - \hat p_n(\cdot|v)|_1 - |\hat p_n(\cdot|v) - Q(\cdot|v)|_1 && (4.5)\\
& - |\hat p_n(\cdot|w) - P(\cdot|w)|_1 - |Q(\cdot|w) - \hat p_n(\cdot|w)|_1. && (4.6)
\end{aligned}$$
Therefore, e.a.s. a measure Q belonging to $\{\eta : d_n(X_1^n, \eta) \le c \log n\}$ will satisfy
$$|Q(\cdot|v) - Q(\cdot|w)|_1 > 0. \tag{4.7}$$
This is because the right-hand side of the first line is strictly positive (w was selected to satisfy this), the first terms of the second and third lines vanish due to (4.3), and the second terms of the second and third lines vanish P-almost surely due to the assumption on Q.
To finish the proof of the theorem we will show that $T^c_n(X_1^n)$ is a non-trivial nonparametric confidence bound for T on P. For any P ∈ P with $\ell(\tau_P) \le t(n)$, the event $\{d_n(X_1^n, P) \le c \log(n)\}$ implies $\{T^c_n(X_1^n) \le \tau_P\}$. Therefore, by Lemma 4.2 we have $P(T^c_n(X_1^n) \le \tau_P) \ge P(d_n(X_1^n, P) \le c \log(n)) \ge 1 - \alpha$.
This concludes the proof of the theorem.