Estimating beta-mixing coefficients via histograms

The literature on statistical learning for time series often assumes asymptotic independence or"mixing"of the data-generating process. These mixing assumptions are never tested, nor are there methods for estimating mixing coefficients from data. Additionally, for many common classes of processes (Markov processes, ARMA processes, etc.) general functional forms for various mixing rates are known, but not specific coefficients. We present the first estimator for beta-mixing coefficients based on a single stationary sample path and show that it is risk consistent. Since mixing rates depend on infinite-dimensional dependence, we use a Markov approximation based on only a finite memory length $d$. We present convergence rates for the Markov approximation and show that as $d\rightarrow\infty$, the Markov approximation converges to the true mixing coefficient. Our estimator is constructed using $d$-dimensional histogram density estimates. Allowing asymptotics in the bandwidth as well as the dimension, we prove $L^1$ concentration for the histogram as an intermediate step. Simulations wherein the mixing rates are calculable and a real-data example demonstrate our methodology.


Introduction
For time series analysis, the independence assumption is replaced by requiring the asymptotic independence of events far apart in time, or "mixing". Mixing conditions make the dependence of the future on the past explicit, quantifying the decay in dependence as the future moves farther from the past. There are many definitions of mixing of varying strength with matching dependence coefficients (see [12,10,6] for reviews), but many of the results in the statistical literature focus on β-mixing or absolute regularity. Roughly speaking (see Definition 2.1 below for a precise statement), the β-mixing coefficient at lag a is the total variation distance between the actual joint distribution of events separated by a − 1 time steps and the product of their marginal distributions, i.e., the L 1 distance from independence.
Numerous results in the statistics literature rely on knowledge of mixing coefficients. While much of the theoretical groundwork for the analysis of mixing processes was laid years ago (cf. [35,5,13,27,1,31,37,38]), recent work has continued to use mixing to prove interesting results about the analysis of time-series data. Non-parametric inference under mixing conditions is treated extensively in Bosq [4]. Baraud et al. [2] study the finite sample risk performance of penalized least squares regression estimators under β-mixing. Kontorovich and Ramanan [18] prove concentration of measure results based on a notion of mixing defined therein which is related to the more common φ-mixing coefficients. Ould-Saïd et al. [26] investigate kernel conditional quantile estimation under α-mixing. Steinwart and Anghel [30] show that support vector machines are consistent for time series forecasting under a weak dependence condition implied by α-mixing. Asymptotic properties of nonparametric inference for time series under various mixing conditions are described in Liu and Wu [20]. Finally, Lerasle [19] proposes a block-resampling penalty for density estimation. He shows that the selected estimator satisfies oracle inequalities under both β-and τ -mixing.
Many common time series models are known to be β-mixing, and the rates of decay are known up to constant factors given the underlying parameters of the process, say θ. Among the processes for which such knowledge is available are ARMA models [23], GARCH models [7], and certain Markov processes -see [12] for an overview of such results. Fryzlewicz and Subba Rao [16] derive upper bounds for the α-and β-mixing rates of non-stationary ARCH processes.
With a few exceptions, the mapping from θ to the constants in the mixing coefficients is unknown in the literature. For example, it is known that the mixing coefficients of the ARMA process at time lag a are O(ρ a ) for some 0 ≤ ρ < 1. However, both ρ and the constants are unrecoverable given knowledge of θ. To our knowledge, only Nobel [24] approaches a solution to the problem of estimating mixing rates by giving a method to distinguish between different polynomial mixing rate regimes through hypothesis testing.
We present the first method for estimating the β-mixing coefficients for stationary time series data given a single sample path. Our methodology can be applied to real data assumed to be generated from some unknown β-mixing process. Additionally, it can be used to examine known mixing processes thereby determining exact mixing rates via simulation. Section 2 defines the β-mixing coefficient and states our main results on convergence rates and consistency for our estimator. Section 3 gives an intermediate result on the L 1 convergence of the histogram estimator with β-mixing inputs which is asymptotic in the dimension of the target distribution in addition to the bandwidth. Section 4 proves the main results from §2. Section 5 demonstrates the performance of our estimator in three simulated examples, both providing good recovery of known rates in simple settings as well as providing insight into unknown mixing regimes. Section 6 concludes and lays out some avenues for future research.

Estimator and consistency results
In this section, we present one of many equivalent definitions of absolute regularity and state our main results, deferring proof to §4.
To fix notation, let X = {X t } ∞ t=−∞ be a sequence of random variables where each X t is a measurable function from a probability space (Ω, F, P) into a measurable space X . A block of this random sequence will be given by X j i ≡ {X t } j t=i where i and j are integers, and may be infinite. We use similar notation for the sigma fields generated by these blocks and their joint distributions. In particular, σ j i will denote the sigma field generated by X j i , and the joint distribution of X j i will be denoted P j i .

Definitions
There are many equivalent definitions of β-mixing (see for instance [12], or [6] as well as Meir [22] or Yu [38]), however the most intuitive is that given in Doukhan [12].
Definition 2.1 (β-mixing). For each a ∈ N and any t ∈ Z, the coefficient of absolute regularity, or β-mixing coefficient, β(a), is where || · || T V is the total variation norm, and P t,a is the joint distribution of (X t −∞ , X ∞ t+a ). A stochastic process is said to be absolutely regular, or β-mixing, if β(a) → 0 as a → ∞.
Loosely speaking, Definition 2.1 says that the coefficient β(a) measures the total variation distance between the joint distribution of random variables separated by a − 1 time units and a distribution under which random variables separated by a − 1 time units are independent. In the most general setting, P t −∞ ⊗ P ∞ t+a − P t,a T V is preceded by a supremum over t. However, this additional generality is unnecessary for stationary random processes X which is the only case we consider here. Our main result requires the method of blocking used by Yu [37,38]. The purpose is to transform a sequence of dependent variables into subsequences of nearly IID ones. Consider a sample X n 1 from a stationary β-mixing sequence with density f . Let m n and µ n be non-negative integers such that 2m n µ n = n. Now divide X n 1 into 2µ n blocks, each of length m n . Identify the blocks as follows: Let U be the entire sequence of odd blocks U j , and let V be the sequence of even blocks V j . Finally, let U be a sequence of blocks which are independent of X n 1 but such that each block has the same distribution as a block from the original sequence. That is construct U j such that where L(·) means the probability law of the argument. The blocks U are now an IID block sequence, in that for integers i, j ≤ 2µ n , i = j, U i ⊥ ⊥ U j so standard results about IID random variables can be applied to these blocks. (See [38] for a more rigorous analysis of blocking.) With this structure, we can state our main result.

Results
Our result emerges in two stages. First, we recognize that the distribution of a finite sample depends only on finite-dimensional distributions. This leads to an estimator of a finite-dimensional version of β(a). Next, we let the finite-dimension increase to infinity with the size of the observed sample.
For positive integers t, d, and a, define where P t,a,d is the joint distribution of (X t t−d+1 , X t+a+d−1 t+a ). Also, let f d be the d-dimensional histogram estimator of the joint density of d consecutive observations, and let f 2d a be the 2ddimensional histogram estimator of the joint density of two sets of d consecutive observations separated by a − 1 time points.
We construct an estimator of β d (a) based on these two histograms. 1 Define We show that, by allowing d = d n to grow with n, this estimator will converge on β(a). This can be seen most clearly by bounding the 1 -risk of the estimator with its estimation and approximation errors: The first term is the error of estimating β d (a) with a random sample of data. The second term is the non-stochastic error induced by approximating the infinite dimensional coefficient, β(a), with its d-dimensional counterpart, β d (a). Our first theorem in this section establishes consistency of β dn (a) as an estimator of β(a) for all β-mixing processes provided d n increases at an appropriate rate. Theorem 2.4 gives finite sample bounds on the estimation error while some measure theoretic arguments contained in §4 show that the approximation error must go to zero as d n → ∞. The O(exp{W (log n)}) growth rate of d n in the above theorem leads to the optimal rate of decay for the estimation error. A finite sample bound for the estimation error is the first step to establishing consistency for β dn . This result gives convergence rates for estimation of the finite dimensional mixing coefficient β d (a) and also for Markov processes of known order d, since in this case, β d (a) = β(a).
Theorem 2.4. Consider a sample X n 1 from a stationary β-mixing process. Let µ n and m n be positive integers such that 2µ n m n = n and µ n ≥ d > 0. Then Consistency of the estimator β d (a) is guaranteed only for certain choices of m n and µ n . Clearly µ n → ∞ and µ n β(m n ) → 0 as n → ∞ are necessary conditions. Consistency also requires convergence of the histogram estimators to the target densities. We leave the proof of this theorem for section 4. As an example to show that this bound can go to zero with proper choices of m n and µ n , the following corollary proves consistency for first order Markov processes. Consistency of the estimator for higher order Markov processes can be proven similarly. These processes are geometrically β-mixing as shown in e.g. Nummelin and Tuominen [25]. Proof. Recall that n = 2µ n m n . Then, if m n = Ω(log n) for constants K 1 and K 2 . But the exponential terms are exp −K 3 n 2 j m n for j = 1, 2 and a constant K 3 . Therefore, both exponential terms go to 0 as n → ∞ if m n = o(n).
Setting the right side to δ and solving, gives that as long as m n = Ω(log n), then β 1 (a) P − → β(a) at a rate of o( √ n) apart from a logarithmic factor.
Proving Theorem 2.4 requires showing the L 1 convergence of the histogram density estimator with β-mixing data. We do this in the next section.

L 1 convergence of histograms
Convergence of density estimators is thoroughly studied in the statistics and machine learning literature. Early papers on the L ∞ convergence of kernel density estimators (KDEs) include [36,3,29]; Freedman and Diaconis [15] look specifically at histogram estimators, and Yu [37] considered the L ∞ convergence of KDEs for β-mixing data and shows that the optimal IID rates can be attained. Tran [32] proves L 2 convergence for histograms under α-and β-mixing. Devroye and Györfi [11] argue that L 1 is a more appropriate metric for studying density estimation, and Tran [31] proves L 1 consistency of KDEs under α-and β-mixing. As far as we are aware, ours is the first proof of L 1 convergence for histograms under β-mixing.
Additionally, the dimensionality of the target density is analogous to the order of the Markov approximation. Therefore, the convergence rates we give are asymptotic in the bandwidth h n which shrinks as n increases, but also in the dimension d which increases with n. Even under these asymptotics, histogram estimation in this sense is not a high dimensional problem. The dimension of the target density considered here is on the order of exp{W (log n)}, a rate somewhere between log n and log log n.
To prove this result, we use the blocking method of [38] to transform the dependent β-mixing sequence into a sequence of nearly independent blocks. We then apply McDiarmid's inequality to the blocks to derive asymptotics in the bandwidth of the histogram as well as the dimension of the target density. For completeness, we state Yu's blocking result and McDiarmid's inequality before proving the doubly asymptotic histogram convergence for IID data. Combining these lemmas allows us to derive rates of convergence for histograms based on β-mixing inputs.
Lemma 3.2 (Lemma 4.1 in [38]). Let φ be a measurable function with respect to the block sequence U uniformly bounded by M . Then, where the first expectation is with respect to the dependent block sequence, U, and E is with respect to the independent sequence, U .
This lemma essentially gives a method of applying IID results to β-mixing data. Because the dependence decays as we increase the separation between blocks, widely spaced blocks are nearly independent of each other. In particular, the difference between expectations over these nearly independent blocks and expectations over blocks which are actually independent can be controlled by the β-mixing coefficient. [21]). Let X 1 , . . . , X n be independent random variables, with X i taking values in a set A i for each i. Suppose that the measurable function f :

Lemma 3.3 (McDiarmid Inequality
whenever the vectors x and x differ only in the i th coordinate. Then for any > 0, The following lemma provides the doubly asymptotic convergence of the histogram estimator for IID data. It differs from standard histogram convergence results in the bias calculation. In this case we need to be more careful about the interaction between d and h n . Lemma 3.4. For an IID sample X 1 , . . . , X n from some density f on R d , where f is the histogram estimate using a grid with sides of length h n .
Proof of Lemma 3.4. Let p j be the probability of falling into the j th bin B j . Then, For the second claim, consider the bin B j centered at c. Let I be the union of all bins B j . Assume the following regularity conditions as in [14]: 1. f ∈ L 2 and f is absolutely continuous on I, with a.e. partial derivatives f i = ∂ ∂y i f (y) 2. f i ∈ L 2 and f i is absolutely continuous on I, with a.e. partial derivatives Using a Taylor expansion where f i (y) = ∂ ∂y i f (y). Therefore, p j is given by since the integral of the second term over the bin is zero. This means that for the j th bin, Therefore, Since each bin is bounded, we can sum over all J bins. The number of bins is J = h −d n by definition, so We can now prove the main result of this section.
Proof of Theorem 3.1. Let g be the L 1 loss of the histogram estimator, Let f U , f V , and f U be histograms based on the block sequences U, V, and U respectively. Clearly f n = 1 Here, where the probability on the right is for the σ-field generated by the independent block sequence U . Since these blocks are independent, showing that g U satisfies the bounded differences requirement allows for the application of McDiarmid's inequality 3.3 to the blocks. For any two block sequences u 1 , . . . , u µn andū 1 , . . . ,ū µn with u =ū for all = j, then Therefore,

Proofs of results in §2.2
The proof of Theorem 2.4 relies on the triangle inequality and the relationship between total variation distance and the L 1 distance between densities.
Proof of Theorem 2.4. For any probability measures ν and λ defined on the same probability space with associated densities f ν and f λ with respect to some dominating measure π, Let P be the d-dimensional stationary distribution of the d th order Markov process, i.e. P = P t t−d+1 = P t+a+d−1 t+a in the notation of equation 3. Let P a,d be the joint distribution of the bivariate random process created by the initial process and itself separated by a − 1 time steps. By the triangle inequality, we can upper bound β d (a) for any d = d n . Let P and P a,d be the distributions associated with histogram estimators f d and f 2d a respectively. Then, a | is our estimator β d (a) and the remaining terms are the L 1 distance between a density estimator and the target density. Thus, A similar argument starting from β d (a) = ||P ⊗ P − P a,d || T V shows that Therefore, The proof of Theorem 2.3 requires two steps which are given in the following Lemmas. The first specifies the histogram bandwidth h n and the rate at which d n (the dimensionality of the target density) goes to infinity. If the dimensionality of the target density were fixed, we could achieve rates of convergence similar to those for histograms based on IID inputs. However, we wish to allow the dimensionality to grow with n, so the rates are much slower as shown in the following lemma. These choices lead to the optimal rate of convergence.
Proof. Let h n = n −kn for some k n to be determined. Then we want n −1/2 h −dn/2 n = n (kndn−1)/2 → 0, d n h n = d n n −k → 0, and d 2 n h 2 n = d 2 n n −2k → 0 all as n → ∞. Call these A, B, and C. Taking A and B first gives Similarly, combining A and C gives k n ∼ 2 log d n + 1 2 log n log n 1 2 d n + 2 . (10) Equating (9) and (10) and solving for d n gives where W (·) is the Lambert W function. Plugging back into (9) gives that It is also necessary to show that as d grows, β d (a) → β(a). We now prove this result. Proof. Because the process is stationary, let t = 0 in Definition 2.1 without loss of generality. Let P 0 −∞ be the distribution on σ 0 −∞ = σ(. . . , X −1 , X 0 ), and let P ∞ a be the distribution on σ ∞ a = σ(X a , X a+1 , X a+2 , . . .). Let P a be the distribution on σ = σ 0 −∞ ⊗ σ ∞ a (the product sigma-field). Then we can rewrite Definition 2.1 using this notation as Let σ 0 −d+1 and σ a+d−1 a be the sub-σ-fields of σ 0 −∞ and σ ∞ a consisting of the d-dimensional cylinder sets for the d dimensions closest together. Let σ d be the product σ-field of these two. Then we can rewrite β d (a) as β d (a) = sup As such β d (a) ≤ β(a) for all a and d. We can rewrite (11) in terms of finite-dimensional marginals: where P a,d is the restriction of P to σ(X −d+1 , . . . , X 0 , X a , . . . , X a+d−1 ). Because of the nested nature of these sigma-fields, we have for all finite d 1 ≤ d 2 . Therefore, for fixed a, {β d (a)} ∞ d=1 is a monotone increasing sequence which is bounded above, and it converges to some limit L ≤ β(a). To show that L = β(a) requires some additional steps. Let , which is a signed measure on σ. Let which is a signed measure on σ d . Decompose R into positive and negative parts as R = Q + − Q − and similarly for R d = Q +d − Q −d . Notice that since R d is constructed using the marginals of P, then R(E) = R d (E) for all E ∈ σ d . Now since R is the difference of probability measures, we must have that for all D ∈ σ.
Such a set C is guaranteed by the Hahn decomposition theorem (letting C * be a set which attains the supremum in (11), we can throw away any subsets with negative R measure) and (12) assuming without loss of generality that P a (C) > [P 0 −∞ ⊗ P ∞ a ](C). We can use the field σ f = d σ d to approximate σ in the sense that, for all , we can find A ∈ σ f such that Q(A∆C) < /2 (see Theorem D in Halmos [17, §13] or Lemma A.24 in Schervish [28]). Now, Also, since A ∩ C and A c ∩ C are contained in C and A ∩ C ⊆ A. Therefore since A ∩ C ⊆ C and Q − (C) = 0 by (14). Finally, And since β d (a) ≥ Q +d (A), we have that for all > 0 there exists d such that for all d 1 > d, Thus, we must have that L = β(a), so that β d (a) → β(a) as desired.

Performance in simulations
To demonstrate the performance of our proposed estimator, we examine its performance in three simulated examples. The first example is a simple two state Markov chain. The second example takes this Markov chain as an unobserved input and outputs a non-Markovian binary sequence which remains β-mixing. Finally, we examine an autoregressive model. As shown in [9], homogeneous recurrent Markov chains are geometrically β-mixing, i.e. β(a) = O(ρ a ) for some 0 ≤ ρ < 1. In particular, if the Markov chain has stationary distribution π and a-step transition distribution P a , then Consider first the two-state Markov chain S t pictured in Figure 1. By direct calculation using (15), the mixing coefficients for this process are β(a) = 4 9 1 2 a . We simulated chains of length n = 1000 from this Markov chain. Based on 1000 replications, the performance of the estimator is depicted in Figure 2. Here, we have used two bins in all cases, but we allow the Markov approximation to vary as d ∈ {1, 2, 3}, even though d = 1 is exact. The estimator performs well for a ≤ 5, but begins to exhibit a positive bias as a increases. This is because the estimator is nonnegative, whereas the true mixing rates are quickly approaching zero. The upward bias is exaggerated for larger d. This bias will go to 0 as n → ∞.
As an example of a long memory process, we construct, following Weiss [34], a partially observable Markov process which we refer to as the "even process". Let X t be the observed sequence which takes as input the Markov process S t constructed above. We observe Since S t is Markovian, the joint process (S t , S t−1 ) is as well, so we can calculate its mixing rate β(a) = 8 9 1 2 a . The even process must also be β-mixing, and at least as fast as the joint process, since it is a measurable function of a mixing process. However, X t itself is non-Markovian: sequences of one's must have even lengths, so we need to know how many one's have been observed to know whether the next observation can be zero or must be a one. Thus, the true mixing coefficients are bounded above, though unknown. Using the same procedure as above, Figure 3 shows the estimated mixing coefficients. Again we observe a bias for a large due to the nonnegativity of the estimator. Finally, we estimate the β-mixing coefficients for an AR(1) model iid ∼ N(0, 1).
where we calculate the expectation based on independent simulations from the process. Figure 4 shows the performance for n = 3000. The optimal number of bins is 33, 11, 7, 5, and 3 for a = 1, . . . , 5 and 1 for a > 5. However, since the use of one bin corresponds to an estimate of zero, the figure plots the estimate with two bins. Using two bins, we again see the positive bias for a > 5.

Discussion
We have shown that our estimator of the β-mixing coefficients is consistent for the true coefficients β(a) under some conditions on the data generating process. There are numerous results in the statistics literature which assume knowledge of the β-mixing coefficients, yet as far as we know, this is the first estimator for them. An ability to estimate these coefficients will allow researchers to apply existing results to dependent data without the need to arbitrarily assume their values. Additionally, it will allow probabilists to recover unknown mixing coefficients for stochastic processes via simulation. Despite the obvious utility of this estimator, as a consequence of its novelty, it comes with a number of potential extensions which warrant careful exploration as well as some drawbacks. Several other mixing and weak-dependence coefficients also have a total-variation flavor, perhaps most notably α-mixing [12,10,6]. None of them have estimators, and the same trick might well work for them, too.
The reader will note that Theorem 2.3 does not provide a convergence rate. The rate in Theorem 2.4 applies only to the difference between β d (a) and β d (a). In order to provide a rate in Theorem 2.3, we would need a better understanding of the non-stochastic convergence of β d (a) to β(a). It is not immediately clear that this quantity can converge at any well-defined rate. In particular, it seems likely that the rate of convergence depends on the tail of the sequence {β(a)} ∞ a=1 . The use of histograms rather than kernel density estimators for the joint and marginal densities is surprising and perhaps not ultimately necessary. As mentioned above, Tran [31] proved that KDEs are consistent for estimating the stationary density of a time series with β-mixing inputs, so perhaps one could replace the histograms in our estimator with KDEs. However, this would need an analogue of the double asymptotic results proven for histograms in Lemma 3.4. In particular, we need to estimate increasingly higher dimensional densities as n → ∞. This does not cause a problem of small-n-large-d since d is chosen as a function of n, however it will lead to increasingly higher dimensional integration. For histograms, the integral is always trivial, but in the case of KDEs, the numerical accuracy of the integration algorithm becomes increasingly important. This issue could swamp any efficiency gains obtained through the use of kernels. However, this question certainly warrants further investigation.
The main drawback of an estimator based on a density estimate is its complexity. The mixing coefficients are functionals of the joint and marginal distributions derived from the stochastic process X, however, it is unsatisfying to estimate densities and solve integrals in order to estimate a single number. Vapnik's main principle for solving problems using a restricted amount of information is "When solving a given problem, try to avoid solving a more general problem as an intermediate step [33, p. 30]." However, despite our estimator's complexity, we are able to obtain nearly parametric rates of convergence to the Markov approximation departing only by logarithmic factors. While the simplicity principle is clearly violated, perhaps our seed will precipitate a more aesthetically pleasing solution.