Dynamics of Bayesian Updating with Dependent Data and Misspecified Models

Much is now known about the consistency of Bayesian updating on infinite-dimensional parameter spaces with independent or Markovian data. Necessary conditions for consistency include the prior putting enough weight on the correct neighborhoods of the data-generating distribution; various sufficient conditions further restrict the prior in ways analogous to capacity control in frequentist nonparametrics. The asymptotics of Bayesian updating with mis-specified models or priors, or non-Markovian data, are far less well explored. Here I establish sufficient conditions for posterior convergence when all hypotheses are wrong, and the data have complex dependencies. The main dynamical assumption is the asymptotic equipartition (Shannon-McMillan-Breiman) property of information theory. This, along with Egorov's Theorem on uniform convergence, lets me build a sieve-like structure for the prior. The main statistical assumption, also a form of capacity control, concerns the compatibility of the prior and the data-generating process, controlling the fluctuations in the log-likelihood when averaged over the sieve-like sets. In addition to posterior convergence, I derive a kind of large deviations principle for the posterior measure, extending in some cases to rates of convergence, and discuss the advantages of predicting using a combination of models known to be wrong. An appendix sketches connections between these results and the replicator dynamics of evolutionary theory.


Introduction
The problem of the convergence and frequentist consistency of Bayesian learning goes as follows. We encounter observations X_1, X_2, ..., which we would like to predict by means of a family Θ of models or hypotheses (indexed by θ). We begin with a prior probability distribution Π_0 over Θ, and update this using Bayes's rule, so that our distribution after seeing X_1, X_2, ... X_t ≡ X_1^t is Π_t. If the observations come from a stochastic process with infinite-dimensional distribution P, when does Π_t converge P-almost surely? What is the rate of convergence? Ref. [41] has dealt with the convergence of non-parametric Bayesian estimation for IID data when P is not in the support of the prior, obtaining results similar to Berk's in far more general settings, extending in some situations to rates of convergence. All of this work, however, relies on the dynamical assumption of an IID data-source. This paper gives sufficient conditions for the convergence of the posterior without assuming (a), and substantially weakening (c) and (d). Even if one uses non-parametric models, cases where one knows that the true data-generating process is exactly represented by one of the hypotheses in the model class are scarce. Moreover, while IID data can be produced, with some trouble and expense, in the laboratory or in a well-conducted survey, in many applications the data are not just heterogeneous and dependent, but their heterogeneity and dependence are precisely what is of interest. This raises the question of what Bayesian updating does when the truth is not contained in the support of the prior, and observations have complicated dependencies.
To answer this question, I first weaken the dynamical assumptions to the asymptotic equipartition property (Shannon-McMillan-Breiman theorem) of information theory, i.e., for each hypothesis θ, the log-likelihood per unit time converges almost surely. This log-likelihood per unit time is basically the growth rate of the Kullback-Leibler divergence between P and θ, h(θ). As observations accumulate, areas of Θ where h(θ) exceeds its essential infimum h(Θ) tend to lose posterior probability, which concentrates in divergence-minimizing regions. Some additional conditions on the prior distribution are needed to prevent it from putting too much weight initially on hypotheses with high divergence rates but slow convergence of the log-likelihood. As the latter assumptions are strengthened, more and more can be said about the convergence of the posterior.
Using the weakest set of conditions (Assumptions 1-3), the long-run exponential growth rate of the posterior density at θ cannot exceed h(Θ) − h(θ) (Theorem 1). Adding Assumptions 4-6 to provide better control over the integrated or marginal likelihood establishes (Theorem 2) that the long-run growth rate of the posterior density is in fact h(Θ) − h(θ). One more assumption (7) then lets us conclude (Theorem 3) that the posterior distribution converges, in the sense that, for any set of hypotheses A, the posterior probability Π t (A) → 0 unless the essential infimum of h(θ) over A equals h(Θ). In fact, we then have a kind of large deviations principle for the posterior measure (Theorem 4), as well as a bound on the generalization ability of the posterior predictive distribution (Theorem 5). Convergence rates for the posterior (Theorem 6) follow from the combination of the large deviations result with an extra condition related to assumption 6. Importantly, Assumptions 4-7, and so the results following from them, involve both the prior distribution and the data-generating process, and require the former to be adapted to the latter. Under mis-specification, it does not seem to be possible to guarantee posterior convergence by conditions on the prior alone, at least not with the techniques used here.
For the reader's convenience, the development uses the usual statistical vocabulary and machinery. In addition to the asymptotic equipartition property, the main technical tools are, on the one hand, Egorov's theorem from basic measure theory, which is used to construct a sieve-like sequence of sets on which log-likelihood ratios converge uniformly, and, on the other hand, Assumption 6, bounding how long averages over these sets can remain far from their long-run limits. The latter assumption is crucial, novel, and, in its present form, awkward to check; I take up its relation to more familiar assumptions in the discussion. It may be of interest, however, that the results were first found via an apparently-novel analogy between Bayesian updating and the "replicator equation" of evolutionary dynamics, which is a formalization of the Darwinian idea of natural selection. Individual hypotheses play the role of distinct replicators in a population, the posterior distribution being the population distribution over replicators and fitness being proportional to likelihood. Appendix A gives details.

Preliminaries and Notation
Let (Ω, F, P) be a probability space, and X_1, X_2, ..., for short X_1^∞, be a sequence of random variables, taking values in the measurable space (Ξ, X), whose infinite-dimensional distribution is P. The natural filtration of this process is σ(X_1^t). The only dynamical properties are those required for the Shannon-McMillan-Breiman theorem (Assumption 3); more specific assumptions, such as P being a product measure, Markovian, exchangeable, etc., are not required. Unless otherwise noted, all probabilities are taken with respect to P, and E[·] always means expectation under that distribution.
Statistical hypotheses, i.e., distributions of processes adapted to σ(X_1^t), are denoted by F_θ, the index θ taking values in the hypothesis space, a measurable space (Θ, T), generally infinite-dimensional. For convenience, assume that P and all the F_θ are dominated by a common reference measure, with respective densities p and f_θ. I do not assume that P ∈ Θ, still less that P ∈ supp Π_0; i.e., quite possibly all of the available hypotheses are false.
We will study the evolution of a sequence of probability measures Π_t on (Θ, T), starting with a non-random prior measure Π_0. (A filtration on Θ is not needed; the measures Π_t change but not the σ-field T.) Assume all Π_t are absolutely continuous with respect to a common reference measure, with densities π_t. Expectations with respect to these measures will be written either as explicit integrals or de Finetti style, Π_t(f) = ∫ f(θ) dΠ_t(θ); when A is a set, Π_t(f, A) = ∫_A f(θ) dΠ_t(θ), and Π_t(1_A) = Π_t(A). Bayesian updating of course means that, for any A ∈ T,

    Π_t(A) = ∫_A f_θ(X_t | X_1^{t-1}) dΠ_{t-1}(θ) / ∫_Θ f_θ(X_t | X_1^{t-1}) dΠ_{t-1}(θ),

or, in terms of the density,

    π_t(θ) ∝ π_{t-1}(θ) f_θ(X_t | X_1^{t-1}).

It will also be convenient to express Bayesian updating in terms of the prior and the total likelihood. Writing R_t(θ) ≡ f_θ(X_1^t)/p(X_1^t) for the likelihood ratio with respect to the true distribution,

    Π_t(A) = Π_0(R_t 1_A) / Π_0(R_t);

dividing through by p(X_1^t) changes nothing, since it appears in both the numerator and the denominator, but will prove convenient later. The one-step-ahead predictive distribution of the hypothesis θ is given by F_θ(X_t ∈ · | X_1^{t-1}), with the convention that t = 1 gives the marginal distribution of the first observation. Abbreviate this by F_θ^t. Similarly, let P_t ≡ P(X_t ∈ · | σ(X_1^{t-1})); this is the best probabilistic prediction we could make, did we but know P [39]. The posterior predictive distribution is given by mixing the individual predictive distributions with weights given by the posterior:

    F_Π^t ≡ ∫_Θ F_θ^t dΠ_{t-1}(θ).

Remark on the topology of Θ and on T
The hope in studying posterior convergence is to show that, as t grows, with higher and higher (P) probability, Π_t concentrates more and more on sets which come closer and closer to P. The tricky part here is "closer and closer": points in Θ represent infinite-dimensional stochastic process distributions, and the topology of such spaces is somewhat odd, and irritatingly abrupt, at least under the more common distances. Any two ergodic measures are either equal or have completely disjoint supports [31], so that the Kullback-Leibler divergence between distinct ergodic processes is always infinity (in both directions), and the total variation and Hellinger distances are likewise maximal.
Most previous work on posterior consistency has restricted itself to models where the infinite-dimensional process distributions are formed by products of fixed-dimensional base distributions (IID, Markov, etc.), and in effect transferred the usual metrics' topologies from these finitedimensional distributions to the processes. It is possible to define metrics for general stochastic processes [31], and if readers like they may imagine that T is a Borel σ-field under some such metric. This is not necessary for the results presented here, however.

Example
The following example will be used to illustrate the assumptions (§2.2.1 and Appendix B), and, later, the conclusions (§3.6). The data-generating process P is a stationary and ergodic measure on the space of binary sequences, i.e., Ξ = {0, 1}, and the σ-field X = 2^Ξ. The measure is naturally represented as a function of a two-state Markov chain S_1^∞, with S_t ∈ {1, 2}. The transition matrix is

    T = [  0    1
          1/2  1/2 ]

so that the invariant distribution puts probability 1/3 on state 1 and probability 2/3 on state 2; take S_1 to be distributed accordingly. The observed process is a binary function of the latent state transitions, X_t = 0 if S_t = S_{t+1} = 2 and X_t = 1 otherwise. Figure 1 depicts the transition and observation structure.
Qualitatively, X_1^∞ consists of blocks of 1s of even length, separated by blocks of 0s of arbitrary length. Since the joint process {(S_t, X_t)} is a stationary and ergodic Markov chain, X_1^∞ is also stationary, ergodic and mixing. This stochastic process comes from symbolic dynamics [43; 37], where it is known as the "even process", and serves as a basic example of the class of sofic processes [66], which have finite Markovian representations, as in Figure 1, but are not Markov at any finite order. (If X_t = 1, X_{t-1} = 1, ... X_{t-k} = 1 for any finite k, the corresponding S_{t-i} must have alternated between one and two, but whether S_t is one or two, and thus the distribution of X_{t+1}, cannot be determined from the length-k history alone.) More exactly [36], sofic systems or "finitary measures" are ones which are images of Markov chains under factor maps, and strictly sofic systems, such as the even process, are sofic systems which are not themselves Markov chains of any order. Despite their simplicity, these models arise naturally when studying the time series of chaotic dynamical systems [3; 15; 57; 16], as well as problems in statistical mechanics [50] and crystallography [62].
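To make this description concrete, here is a short simulation of the even process. It is a sketch under the assumptions stated above (the invariant distribution (1/3, 2/3) and the observation rule X_t = 0 iff S_t = S_{t+1} = 2); the function name and the seed are invented for illustration.

```python
import random

def simulate_even_process(t, seed=1):
    """Simulate the even process: a two-state Markov chain in which state 1
    always jumps to state 2, while state 2 stays or jumps with probability
    1/2 each (invariant distribution: 1/3 on state 1, 2/3 on state 2).
    The observation is X_t = 0 iff S_t = S_{t+1} = 2, and X_t = 1 otherwise."""
    rng = random.Random(seed)
    s = 1 if rng.random() < 1 / 3 else 2  # draw S_1 from the invariant law
    xs = []
    for _ in range(t):
        s_next = 2 if s == 1 else rng.choice((1, 2))
        xs.append(0 if (s == 2 and s_next == 2) else 1)
        s = s_next
    return xs

x = simulate_even_process(100000)
# Maximal blocks of 1s flanked by 0s on both sides should have even length;
# the first and last blocks may be truncated by the observation window.
blocks = [len(b) for b in "".join(map(str, x)).split("0")[1:-1] if b]
```

Each visit to state 1 contributes exactly two 1s (the transition into it and the transition out), which is where the even-length blocks come from.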
Let Θ_k be the space of all binary Markov chains of order k with strictly positive transition probabilities and their respective stationary distributions; each Θ_k has dimension 2^k. (Allowing some transition probabilities to be zero creates uninteresting technical difficulties.) Since each hypothesis is equivalent to a function Ξ^{k+1} → (0, 1], we can give Θ_k the topology of pointwise convergence of functions, and the corresponding Borel σ-field. We will take Θ = ⋃_{k=1}^∞ Θ_k, identifying Θ_k with the appropriate subset of Θ_{k+1}. Thus Θ consists of all strictly-positive stationary binary Markov chains, of whatever order, and is infinite-dimensional. As for the prior Π_0, it will be specified in more detail below (§2.2.1). At the very least, however, it needs to have the "Kullback-Leibler rate property", i.e., to give positive probability to every "neighborhood" N_ε(θ) around every θ ∈ Θ, i.e., the set of hypotheses whose Kullback-Leibler divergence from θ grows no faster than εt:

    N_ε(θ) ≡ {θ′ : lim_{t→∞} (1/t) E_θ[log (f_θ(X_1^t) / f_θ′(X_1^t))] ≤ ε}.

(The limit exists for all θ, θ′ combinations [32].) This example is simple, but it is also beyond the scope of existing work on Bayesian convergence in several ways. First, the data-generating process P is not even Markov. Second, P ∉ Θ, so all the hypotheses are wrong, and the truth is certainly not in the support of the prior. (P can however be approximated arbitrarily closely, in various process metrics, by distributions from Θ.) Third, because P is ergodic, and ergodic distributions are extreme points in the space of stationary distributions [20], it cannot be represented as a mixture of distributions in Θ. This means that the Doob-style theorem of Ref. [42] does not apply, and even the subjective certainty of convergence is not assured. The results of Refs. [38; 68; 5; 6] on mis-specified models do not hold because the data are dependent.
To be as concrete and explicit as possible, the analysis here will focus on the even process, but only the constants would change if P were any other strictly sofic process. Much of it would apply even if P were a stochastic context-free language or pushdown automaton [12], where in effect the number of hidden states is infinite, though some of the details in Appendix B would change.
Ref. [47] describes a non-parametric procedure which will adaptively learn to predict a class of discrete stochastic processes which includes the even process. Ref. [58] introduces a frequentist algorithm which consistently reconstructs the hidden-state representation of sofic processes, including the even process. Ref. [61] considers Bayesian estimation of the even process, using Dirichlet priors for finite-order Markov chains, and employing Bayes factors to decide which order of chain to use for prediction.

Assumptions
The needed assumptions have to do with the dynamical properties of the data generating process P , and with how well the dynamics meshes both with the class of hypotheses Θ and with the prior distribution Π 0 over those hypotheses.
Assumption 1 The likelihood f_θ(X_1^t) is σ(X_1^t) × T-measurable for all t.

The next two assumptions actually need only hold for Π_0-almost-all θ. But this adds more measure-0 caveats to the results, and it is hard to find a natural example where it would help.
Assumption 2 For every θ ∈ Θ, the Kullback-Leibler divergence rate from P,

    h(θ) ≡ lim_{t→∞} (1/t) E[log (p(X_1^t) / f_θ(X_1^t))],

exists (possibly being infinite) and is T-measurable.
As mentioned, any two distinct ergodic measures are mutually singular, so there is a consistent test which can separate them. ( [53] constructs an explicit but not necessarily optimal test.) One interpretation of the divergence rate [32] is that it measures the maximum exponential rate at which the power of such tests approaches 1, with d = 0 and d = ∞ indicating sub-and supra-exponential convergence, respectively.
Assumption 3 For each θ ∈ Θ, the generalized or relative asymptotic equipartition property holds, i.e.,

    lim_{t→∞} (1/t) log R_t(θ) = −h(θ)     (1)

with P-probability 1, where R_t(θ) ≡ f_θ(X_1^t)/p(X_1^t) is the likelihood ratio.
Refs. [1; 32] give sufficient, but not necessary, conditions for Assumption 3 to hold for a given θ. The ordinary, non-relative asymptotic equipartition property, also known as the Shannon-McMillan-Breiman theorem, is that lim t^{-1} log p(X_1^t) = −h_P a.s., where h_P is the entropy rate of the data-generating process. (See [32].) If this holds and h_P is finite, one could rephrase Assumption 3 as lim t^{-1} log f_θ(X_1^t) = −h_P − h(θ) a.s., and state results in terms of the likelihood rather than the likelihood ratio. (Cf. [24, ch. 5].) However, there are otherwise-well-behaved processes for which h_P = −∞, at least in the usual choice of reference measure, so I will restrict myself to likelihood ratios.
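What Assumption 3 asserts can be watched numerically: the time-averaged log-likelihood of any fixed hypothesis settles down to a constant. The sketch below is purely illustrative, and simplifies drastically by using an IID Bernoulli source as the data-generating process (so the convergence is just the law of large numbers) and an arbitrary misspecified order-1 Markov hypothesis; none of these choices come from the paper.

```python
import math
import random

def avg_loglik_markov1(xs, q):
    """Time-averaged log-likelihood (1/t) log f_theta(x_1^t) of an order-1
    binary Markov hypothesis, where q[a] = Pr(next symbol = 1 | current = a).
    The first symbol is scored with context 0, a one-off choice that cannot
    affect the limit."""
    total, prev = 0.0, 0
    for x in xs:
        p1 = q[prev]
        total += math.log(p1 if x == 1 else 1.0 - p1)
        prev = x
    return total / len(xs)

rng = random.Random(2)
# Stand-in data source P: IID Bernoulli(0.7), for which the AEP is trivial.
data = [1 if rng.random() < 0.7 else 0 for _ in range(200000)]
theta = {0: 0.3, 1: 0.8}  # an arbitrary, misspecified order-1 hypothesis
half = avg_loglik_markov1(data[:100000], theta)
full = avg_loglik_markov1(data, theta)
```

The averages `half` and `full` agree to a few decimal places, which is the convergence of t^{-1} log f_θ(X_1^t) in action; h(θ) would then be the gap between this limit and the corresponding limit for the true density.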
The meaning of Assumption 3 is that, relative to the true distribution, the likelihood of each θ goes to zero exponentially, the rate being the Kullback-Leibler divergence rate. Roughly speaking, an integral of exponentially-shrinking quantities will tend to be dominated by the integrand with the slowest rate of decay. This suggests that the posterior probability of a set A ⊆ Θ should depend on the smallest divergence rate which can be attained at a point of prior support within A. Thus, adapting notation from large deviations theory, define

    h(A) ≡ ess inf_{θ∈A} h(θ),    J(θ) ≡ h(θ) − h(Θ),    J(A) ≡ ess inf_{θ∈A} J(θ),

where here and throughout ess inf is the essential infimum with respect to Π_0, i.e., the greatest lower bound which holds with Π_0-probability 1.
Our further assumptions are those needed for the "roughly speaking" and "should" statements of the previous paragraph to be true, so that, for reasonable sets A ∈ T,

    lim_{t→∞} (1/t) log Π_t(A) = −J(A).

Assumption 4 h(Θ) < ∞, i.e., at least some hypotheses in the support of the prior have finite divergence rates.

If this assumption fails, then every hypothesis in the support of the prior doesn't just diverge from the true data-generating distribution; it diverges so rapidly that the error rate of a test against the latter distribution goes to zero faster than any exponential. (One way this can happen is if every hypothesis has a finite-dimensional distribution assigning probability zero to some event of positive P-probability.) The methods of this paper seem to be of no use in the face of such extreme mis-specification.
Our first substantial assumption is that the prior distribution does not give too much weight to parts of Θ where the log likelihood converges badly.
Assumption 5 There exists a sequence of sets G_t → Θ such that:
1. Π_0(G_t) ≥ 1 − α exp{−tβ} for some α > 0, β > 2h(Θ);
2. the convergence in Eq. 1 is uniform in θ over G_t \ I, where I is the set of θ with h(θ) = ∞;
3. h(G_t) → h(Θ).

Comment 1: An analogy with the method of sieves [25] may clarify the meaning of the assumption. If we were constrained to some fixed G, the uniform convergence in the second part of the assumption would make the convergence of the posterior distribution fairly straightforward. Now imagine that the constraint set is gradually relaxed, so that at time t the posterior is confined to G_t, which grows so slowly that convergence is preserved. (Assumption 6 below is, in essence, about the relaxation being sufficiently slow.) The theorems work by showing that the behavior of the posterior distribution on the full space Θ is dominated by its behavior on this "sieve".
Comment 2: Recall that by Egorov's theorem [35, Lemma 1.36, p. 18], if a sequence of finite, measurable functions f_t(θ) converges pointwise to a finite, measurable function f(θ) for Π_0-almost-all θ ∈ G, then for each ε > 0, there is a (possibly empty) B ⊂ G such that Π_0(B) ≤ ε, and the convergence is uniform on G \ B. Thus the first two parts of the assumption really follow for free from the measurability in θ of likelihoods and divergence rates. (That β needs to be at least 2h(Θ) becomes apparent in the proof of Lemma 5, but that could always be arranged.) The extra content comes in the third part of the assumption, which could fail if the lowest-divergence hypotheses were also the ones where the convergence was slowest, consistently falling into the bad sets B allowed by Egorov's theorem.
For each measurable A ⊆ Θ, for every δ > 0, there exists a random natural number τ(A, δ) such that

    (1/t) log Π_0(R_t 1_A) ≤ δ + lim sup_{t→∞} (1/t) log Π_0(R_t 1_A)

for all t ≥ τ(A, δ), provided the lim sup is finite.

Assumption 6 The sets G_t of the previous assumption can be chosen so that, for every δ, the inequality t ≥ τ(G_t, δ) holds a.s. for all sufficiently large t.
The fraction of the prior probability mass outside of G t is exponentially small in t, with the decay rate large enough that (Lemma 5) the posterior probability mass outside G t also goes to zero. Using the analogy to the sieve again, the meaning of the assumption is that the convergence of the log-likelihood ratio is sufficiently fast, and the relaxation of the sieve is sufficiently slow, that, at least eventually, every set G t has δ-converged by t, the time when we start using it.
To show convergence of the posterior measure, we need to be able to control the convergence of the log-likelihood on sets smaller than the whole parameter space.

Assumption 7
The sets G_t of the previous two assumptions can be chosen so that, for any set A ∈ T with Π_0(A) > 0, h(G_t ∩ A) → h(A).

Assumption 7 could be replaced by the logically-weaker assumption that for each set A, there exists a sequence of sets G_{t,A} satisfying the equivalents of Assumptions 5 and 6 for the prior measure restricted to A. Since the most straightforward way to check such an assumption would be to verify Assumption 7 as stated, the extra generality does not seem worth it.

Verification of Assumptions for the Example
Since every θ ∈ Θ is a finite-order Markov chain, and P is stationary and ergodic, Assumption 1 is unproblematic, while Assumptions 2 and 3 hold by virtue of [1].
It is easy to check that inf_{θ∈Θ_k} h(θ) > 0 for each k. (The infimum is not in general attained by any θ ∈ Θ_k, though it could be if the chains were allowed to have some transition probabilities equal to zero.) The infimum over Θ as a whole, however, is zero. Also, h(θ) < ∞ everywhere (because none of the hypotheses' transition probabilities are zero), so the possible set I of θ with infinite divergence rates is empty, disposing of Assumption 4.
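This separation of scales, inf over each Θ_k positive but shrinking toward the overall infimum of zero, can be checked numerically. In the sketch below, order-k chains are fit to a simulated run of the even process by smoothed empirical counts, and h(θ) is estimated on held-out data as the cross-entropy rate minus the even process's entropy rate h_P = 2/3 bits; the simulator, smoothing constant, and sample sizes are all invented scaffolding, not part of the construction in Appendix B.

```python
import math
import random

def simulate_even_process(t, seed):
    # Two-state chain: state 1 -> 2 always; state 2 -> 1 or 2 w.p. 1/2;
    # emit X = 0 iff the transition is 2 -> 2.
    rng = random.Random(seed)
    s = 1 if rng.random() < 1 / 3 else 2
    xs = []
    for _ in range(t):
        s_next = 2 if s == 1 else rng.choice((1, 2))
        xs.append(0 if (s == 2 and s_next == 2) else 1)
        s = s_next
    return xs

def fitted_divergence_rate(k, train, test, smooth=0.5):
    """Fit an order-k binary Markov chain to `train` via smoothed counts,
    then estimate h(theta) in nats on `test` as the cross-entropy rate
    minus the even process's entropy rate, h_P = (2/3) log 2."""
    counts = {}
    for i in range(k, len(train)):
        ctx, nxt = tuple(train[i - k:i]), train[i]
        c = counts.setdefault(ctx, [smooth, smooth])
        c[nxt] += 1
    cross, n = 0.0, 0
    for i in range(k, len(test)):
        ctx, nxt = tuple(test[i - k:i]), test[i]
        c = counts.get(ctx, [smooth, smooth])
        cross -= math.log(c[nxt] / (c[0] + c[1]))
        n += 1
    return cross / n - (2 / 3) * math.log(2)

train = simulate_even_process(200000, seed=3)
test = simulate_even_process(200000, seed=4)
d = {k: fitted_divergence_rate(k, train, test) for k in (1, 2, 4)}
```

The estimated rates decrease in k but stay bounded away from zero, matching the claim in the text: no finite-order chain drives the divergence rate all the way down to the infimum.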
Verifying the remaining assumptions means building a sequence G_t of increasing subsets of Θ on which the convergence of t^{-1} log R_t is uniform and sufficiently rapid, and ensuring that the prior probability of these sets grows fast enough. This will be done by exploiting some finite-sample deviation bounds for the even process, which in turn rest on its mixing properties and the method of types. Details are deferred to Appendix B. The upshot is that the sets G_t consist of chains whose order is at most (log_2 t)/(2/3 + ε) − 1, for some ε > 0, and where the absolute logarithm of each transition probability is bounded by Ct^γ, where the positive constant C is arbitrary but 0 < γ < (2/3 + ε/2)/(2/3 + ε). (With a different strictly sofic process P, the constant 2/3 in the preceding expressions should be replaced by h_P.) The exponential rate β > 0 for the prior probability of G_t^c can be chosen to be arbitrarily small.

Results
I first give the theorems here, without proof. The proofs, in § §3.1-3.5, are accompanied by re-statements of the theorems, for the reader's convenience.
There are six theorems. The first upper-bounds the growth rate of the posterior density at a given point θ in Θ. The second matches the upper bound on the posterior density with a lower bound, together providing the growth-rate for the posterior density. The third is that Π t (A) → 0 for any set with J(A) > 0, showing that the posterior concentrates on the divergence-minimizing part of the hypothesis space. The fourth is a kind of large deviations principle for the posterior measure. The fifth bounds the asymptotic Hellinger and total variation distances between the posterior predictive distribution and the actual conditional distribution of the next observation. Finally, the sixth theorem establishes rates of convergence.
The first result uses only Assumptions 1-3. (It is not very interesting, however, unless 4 is also true.) The later theorems, however, all depend on finer control of the integrated likelihood, and so finer control of the prior, as embodied in Assumptions 5-6. More exactly, those additional assumptions concern the interplay between the prior and the data-generating process, restricting the amount of prior probability which can be given to hypotheses whose log-likelihoods converge excessively slowly under P. I build to the first result in the next subsection, then turn to the control of the integrated likelihood and its consequences in the next three sub-sections, and then consider how these results apply to the example.
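Before the proofs, the flavor of Theorems 1-3 can be conveyed by a deliberately tiny stand-in for the paper's setting: a finite grid of IID Bernoulli hypotheses, none of them true. Everything here (the source, the grid, the uniform prior) is an invented illustration, far simpler than the dependent-data setting of the theorems, but the mechanics are the same: the posterior mode moves to the divergence-rate minimizer, and log posterior odds grow linearly at the rate h(θ′) − h(θ) that Theorem 2 predicts.

```python
import math
import random

def kl_bernoulli(p, q):
    """Divergence rate h(theta): for IID Bernoulli data this is just the
    one-step KL divergence D(p || q) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

rng = random.Random(5)
truth = 0.65                          # data-generating parameter, off the grid
thetas = [0.1, 0.2, 0.3, 0.4, 0.5]    # the (misspecified) hypothesis grid
log_post = {th: 0.0 for th in thetas} # uniform prior, unnormalized log scale

t = 4000
for _ in range(t):
    x = 1 if rng.random() < truth else 0
    for th in thetas:
        log_post[th] += math.log(th if x == 1 else 1 - th)

best = max(log_post, key=log_post.get)  # posterior mode
# Empirical growth rate of the log odds of 0.3 against 0.5 ...
rate = (log_post[0.3] - log_post[0.5]) / t
# ... which should approach h(0.5) - h(0.3), a negative number.
predicted = kl_bernoulli(truth, 0.5) - kl_bernoulli(truth, 0.3)
```

Here 0.5 is the divergence minimizer on the grid, so the posterior piles onto it exponentially fast even though it is wrong.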
In particular, this holds whenever 2h(A) < β or A ⊂ ⋃_{k=n}^∞ G_k for some n.
Theorem 5 Under Assumptions 1-7, with probability 1, the asymptotic ρ_H and ρ_TV distances between the posterior predictive distribution and the actual conditional distribution P_t of the next observation are bounded, where ρ_H and ρ_TV are, respectively, the Hellinger and total variation metrics.
Theorem 6 Make Assumptions 1-7, and pick a positive sequence ε_t → 0; then the rate-of-convergence bound stated in §3.5 holds with probability 1.

Upper Bound on the Posterior Density
The primary result of this section is a pointwise upper bound on the growth rate of the posterior density. To establish it, I use some subsidiary lemmas, which also recur in later proofs. Lemma 2 extends the almost-sure convergence of the likelihood (Assumption 3) from holding pointwise in Θ to holding simultaneously for all θ on a (possibly random) set of Π_0-measure 1. Lemma 3 shows that the prior-weighted likelihood ratio, Π_0(R_t), tends to be at least exp{−th(Θ)}. (Both assertions are made more precise in the lemmas themselves.) I begin with a proposition about exchanging the order of universal quantifiers (with almost-sure caveats).
Lemma 1 Let Π be any probability measure on (Θ, T), and let Q ∈ F × T be jointly measurable, with sections Q_θ ≡ {ω : (ω, θ) ∈ Q} and Q_ω ≡ {θ : (ω, θ) ∈ Q}. If P(Q_θ) = 1 for all θ, then Π(Q_ω) = 1 for P-almost-all ω.

Proof: Since Q is measurable, for all ω and θ, the sections are measurable, and the measures of the sections, P(Q_θ) and Π(Q_ω), are measurable functions of θ and ω, respectively. Using Fubini's theorem,

    ∫_Θ P(Q_θ) dΠ(θ) = (P × Π)(Q) = ∫_Ω Π(Q_ω) dP(ω).

By hypothesis, however, P(Q_θ) = 1 for all θ. Hence it must be the case that Π(Q_ω) = 1 for P-almost-all ω. (In fact, the set of ω for which this is true must be a measurable set.)

Lemma 2 Under Assumptions 1-3, there exists a set C ⊆ Ξ^∞, with P(C) = 1, where, for every y ∈ C, there exists a Q_y ∈ T such that, for every θ ∈ Q_y, Eq. 1 holds. Moreover, Π_0(Q_y) = 1.
Proof: Let the set Q consist of the (θ, ω) pairs where Eq. 1 holds, i.e., those for which

    lim_{t→∞} (1/t) log R_t(θ, ω) = −h(θ),

being explicit about the dependence of the likelihood ratio on ω. Assumption 3 states that P(Q_θ) = 1 for every θ, so applying Lemma 1 just needs the verification that Q is jointly measurable. But, by Assumptions 1 and 2, h(·) is T-measurable and each R_t(θ, ω) is jointly measurable, so the set on which t^{-1} log R_t(θ, ω) converges to −h(θ) is jointly measurable as well. Everything then follows from the preceding lemma.
Remark: Lemma 2 generalizes Lemma 3 in [4]. Lemma 1 is a specialization of the quantifier-reversal lemma used in [45] to prove PAC-Bayesian theorems for learning classifiers. Lemma 1 could be used to extend any of the results below which hold a.s. for each θ to ones which a.s. hold simultaneously almost everywhere in Θ. This may seem too good to be true, like an alchemist's recipe for turning the lead of pointwise limits into the gold of uniform convergence. Fortunately or not, however, the lemma tells us nothing about the rate of convergence, and is compatible with its varying across Θ from instantaneous to arbitrarily slow, so uniform laws need stronger assumptions.
Lemma 3 Under Assumptions 1-3, for every ε > 0, it is almost sure that the ratio between the integrated likelihood and the true probability density falls below exp{−t(h(Θ) + ε)} only finitely often:

    P(Π_0(R_t) ≤ exp{−t(h(Θ) + ε)} infinitely often) = 0,     (4)

and as a corollary, with probability 1,

    lim inf_{t→∞} (1/t) log Π_0(R_t) ≥ −h(Θ).

Proof: It's enough to show that Eq. 4 holds for all x_1^∞ in the set C from the previous lemma, since that set has probability 1.
Let N_{ε/2} be the set of all θ in the support of Π_0 such that h(θ) ≤ h(Θ) + ε/2. Since x_1^∞ ∈ C, the previous lemma tells us there exists a set Q_{x_1^∞} of θ for which Eq. 1 holds under the sequence x_1^∞.
We must have Π_0(N_{ε/2}) > 0, otherwise h(Θ) would not be the essential infimum, and we know from the previous lemma that Eq. 1 holds for every θ in the positive-measure set N_{ε/2} ∩ Q_{x_1^∞}, so that

    Π_0(R_t) ≥ ∫_{N_{ε/2} ∩ Q_{x_1^∞}} R_t(θ) dΠ_0(θ) ≥ exp{−t(h(Θ) + ε)}     (6)

for all but finitely many t. Since this holds for all x_1^∞ ∈ C, and P(C) = 1, Equation 6 holds a.s., as was to be shown. The corollary statement follows immediately.
Theorem 1 Under Assumptions 1-3, with probability 1, for all θ where π_0(θ) > 0,

    lim sup_{t→∞} (1/t) log (π_t(θ)/π_0(θ)) ≤ h(Θ) − h(θ).

Proof: As remarked,

    π_t(θ)/π_0(θ) = R_t(θ)/Π_0(R_t).

By Assumption 3, for each ε > 0, it's almost sure that

    R_t(θ) ≤ exp{−t(h(θ) − ε/2)}

for all sufficiently large t, while by Lemma 3, it's almost sure that

    Π_0(R_t) ≥ exp{−t(h(Θ) + ε/2)}

for all sufficiently large t. Hence, with probability 1,

    π_t(θ)/π_0(θ) ≤ exp{t(h(Θ) − h(θ) + ε)}

for all sufficiently large t, and so lim sup (1/t) log (π_t(θ)/π_0(θ)) ≤ h(Θ) − h(θ).

Lemma 3 gives a lower bound on the integrated likelihood ratio, showing that in the long run it has to be at least as big as exp{−th(Θ)}. (More precisely, it is significantly smaller than that on vanishingly few occasions.) It does not, however, rule out its being larger. Ideally, we would be able to match this lower bound with an upper bound of the same form, since h(Θ) is the best attainable divergence rate, and, by Lemma 2, log-likelihood ratios per unit time are converging to divergence rates for Π_0-almost-all θ, so values of θ for which h(θ) is close to h(Θ) should come to dominate the integral in Π_0(R_t). It would then be fairly straightforward to show convergence of the posterior distribution.
Unfortunately, additional assumptions are required for such an upper bound, because (as earlier remarked) Lemma 2 does not give uniform convergence, merely universal convergence; with a large enough space of hypotheses, the slowest pointwise convergence rates can be pushed arbitrarily low. For instance, let δ_{x_1^t} be the distribution on Ξ^∞ which assigns probability 1 to endless repetitions of x_1^t; clearly, under this distribution, seeing X_1^t = x_1^t is almost certain. If such measures fall within the support of Π_0, they will dominate the likelihood, even though h(δ_{x_1^t}) = ∞ under all but very special circumstances (e.g., P = δ_{x_1^t}). Generically, then, the likelihood and the posterior weight of δ_{x_1^t} will rapidly plummet at times T > t. To ensure convergence of the posterior, overly-flexible measures like the family of δ_{x_1^t}'s must either be excluded from the support of Π_0 (possibly because they are excluded from Θ), or be assigned so little prior weight that they do not end up dominating the integrated likelihood; otherwise the posterior will concentrate on them.

Convergence of Posterior Density via Control of the Integrated Likelihood
The next two lemmas tell us that sets in Θ of exponentially-small prior measure make vanishingly small contributions to the integrated likelihood, and so to the posterior. They do not require assumptions beyond those used so far, but their application will.
Lemma 4 Make Assumptions 1-3, and choose a sequence of sets B_t ⊂ Θ such that, for all sufficiently large t, Π_0(B_t) ≤ α exp{−tβ} for some α, β > 0. Then, almost surely,

    Π_0(R_t 1_{B_t}) ≤ exp{−tβ/2}     (8)

for all but finitely many t.
Proof: By Markov's inequality. First, use Fubini's theorem and the chain rule for Radon-Nikodym derivatives to calculate the expectation value of the ratio, with μ the common reference measure:

    E[Π_0(R_t 1_{B_t})] = ∫_{B_t} E[R_t(θ)] dΠ_0(θ) = ∫_{B_t} (∫ f_θ(x_1^t) dμ(x_1^t)) dΠ_0(θ) ≤ Π_0(B_t) ≤ α exp{−tβ}.

Now apply Markov's inequality:

    P(Π_0(R_t 1_{B_t}) ≥ exp{−tβ/2}) ≤ α exp{−tβ} / exp{−tβ/2} = α exp{−tβ/2}

for all sufficiently large t. Since these probabilities are summable, the Borel-Cantelli lemma implies that, with probability 1, Eq. 8 holds for all but finitely many t.

The next lemma asserts that a sequence of exponentially-small sets makes a (logarithmically) negligible contribution to the posterior distribution, provided the exponent is large enough compared to h(Θ).
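The pivot of this proof is that a likelihood ratio has expectation at most 1 under the true distribution, however misspecified θ may be, so that the expectation of the prior-weighted ratio over B_t is at most Π_0(B_t). A quick Monte Carlo sanity check of that fact, with an invented IID Bernoulli truth and hypothesis:

```python
import random

rng = random.Random(7)
p_true, p_hyp, t, n_mc = 0.7, 0.4, 5, 200000

total = 0.0
for _ in range(n_mc):
    r = 1.0
    for _ in range(t):
        x = 1 if rng.random() < p_true else 0
        # multiply in one factor f_theta(x) / p(x)
        r *= (p_hyp / p_true) if x == 1 else ((1 - p_hyp) / (1 - p_true))
    total += r
mean_ratio = total / n_mc  # E[R_t(theta)] is exactly 1 here
```

The individual draws of R_t vary over orders of magnitude, but the average hugs 1; it is this martingale-like property, not any closeness of θ to the truth, that Markov's inequality exploits.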
Lemma 5 Let B_t be as in the previous lemma. If β > 2h(Θ), then, almost surely, Π_t(B_t) → 0.

Proof: Begin with the likelihood integrated over B_t rather than its complement, and apply Lemmas 3 and 4:

    Π_t(B_t) = Π_0(R_t 1_{B_t}) / Π_0(R_t) ≤ exp{−tβ/2} / exp{−t(h(Θ) + ε)} = exp{−t(β/2 − h(Θ) − ε)}

for any ε > 0, provided t is sufficiently large. If β > 2h(Θ), this bound can be made to go to zero as t → ∞ by taking ε to be sufficiently small. Since ε was arbitrary, the conclusion follows.

Lemma 6 Make Assumptions 1-3, and take any set G on which the convergence in Eq. 1 is uniform and where Π_0(G) > 0. Then, P-a.s.,

    lim sup_{t→∞} (1/t) log Π_0(R_t 1_G) ≤ −h(G).

Proof: Pick any ε > 0. By the hypothesis of uniform convergence, there almost surely exists a T(ε) such that, for all t ≥ T(ε) and for all θ ∈ G, t^{-1} log R_t(θ) ≤ −h(θ) + ε. Hence

    Π_0(R_t 1_G) ≤ ∫_G exp{−t(h(θ) − ε)} dΠ_0(θ).     (15)

Let Π_{0|G} denote the probability measure formed by conditioning Π_0 to be in the set G.
Note that

    ∫_G z(θ) dΠ_0(θ) = Π_0(G) ∫_Θ z(θ) dΠ_{0|G}(θ)

for any integrable function z. Apply this to the last term from Eq. 15:

    (1/t) log Π_0(R_t 1_G) ≤ (1/t) log Π_0(G) + (1/t) log Π_{0|G}(exp{−th(θ)}) + ε.
The second term on the right-hand side is the cumulant generating function of −h(θ) with respect to Π 0|G , which turns out (cf. 6) to have exactly the right behavior as t → ∞.
Since h(θ) ≥ 0, exp{−h(θ)} ≤ 1, and the L^p norm of the latter, taken with respect to Π_{0|G}, will grow towards its L^∞ norm as p grows. Hence, for sufficiently large t,

    (1/t) log Π_{0|G}(exp{−th(θ)}) = log ‖exp{−h(θ)}‖_{L^t(Π_{0|G})} ≤ log ‖exp{−h(θ)}‖_{L^∞(Π_{0|G})} = −ess inf_{θ∈G} h(θ) = −h(G),

where the next-to-last step uses the monotonicity of log and exp.
Putting everything together, we have that, for any ε > 0 and all sufficiently large t, t^{-1} log ∫_G R_t(θ) dΠ_0(θ) ≤ −h(G) + ε + t^{-1} log Π_0(G). Hence the limit superior of the left-hand side is at most −h(G).
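The quantities in Lemmas 3-7 can be made concrete with a toy computation. The sketch below uses illustrative choices throughout (IID Bernoulli data, where the Shannon-McMillan-Breiman limit is just the law of large numbers; a four-point misspecified hypothesis set; a uniform prior; none of these choices are from the paper) to check numerically that t^{-1} log ∫ R_t dΠ_0 settles near −h(Θ), the negative essential infimum of the divergence rate.

```python
# Numerical illustration of the equipartition argument (Lemmas 3-7) in a
# deliberately simple setting: IID Bernoulli data, a finite misspecified
# hypothesis set, and a uniform prior.  All names here are illustrative.
import numpy as np

rng = np.random.default_rng(42)
p_true = 0.5                               # data-generating parameter (not in Theta)
thetas = np.array([0.1, 0.2, 0.3, 0.4])    # misspecified hypothesis set Theta
prior = np.full(len(thetas), 0.25)

def kl_bernoulli(p, q):
    """Divergence rate h(theta) for IID Bernoulli data, in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

h = kl_bernoulli(p_true, thetas)
h_min = h.min()                            # h(Theta), the essential infimum

t = 20000
x = rng.random(t) < p_true                 # sample path
# log R_t(theta): log-likelihood of theta minus log-likelihood of the truth
loglik = x.sum() * np.log(thetas) + (t - x.sum()) * np.log(1 - thetas)
loglik_true = x.sum() * np.log(p_true) + (t - x.sum()) * np.log(1 - p_true)
log_R = loglik - loglik_true
# t^{-1} log of the prior-integrated likelihood ratio (log-sum-exp for stability)
m = log_R.max()
avg_log_int = (m + np.log(np.sum(prior * np.exp(log_R - m)))) / t

print(avg_log_int, -h_min)                 # the two should be close
```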

Lemma 7 Under Assumptions 1-6, almost surely,
lim sup_t t^{-1} log ∫_Θ R_t(θ) dΠ_0(θ) ≤ −h(Θ). Proof: By Lemma 5, the sets G_t^c make a logarithmically negligible contribution to the integrated likelihood, so for large enough t the integral over Θ is controlled by the integral over G_t, to which Lemma 6 applies. By Assumption 6, t ≥ τ(G_t, ε/3) for all sufficiently large t. Hence t^{-1} log ∫_{G_t} R_t(θ) dΠ_0(θ) ≤ −h(G_t) + ε/3 for all ε > 0 and all t sufficiently large. Since h(G_t) → h(Θ), we have that, for every ε > 0 and large enough t, t^{-1} log ∫_Θ R_t(θ) dΠ_0(θ) ≤ −h(Θ) + ε, or, in short, lim sup_t t^{-1} log ∫_Θ R_t(θ) dΠ_0(θ) ≤ −h(Θ) almost surely.

Proof: Combining Lemmas 3 and 7,
−h(Θ) ≤ lim inf_t t^{-1} log ∫_Θ R_t(θ) dΠ_0(θ) ≤ lim sup_t t^{-1} log ∫_Θ R_t(θ) dΠ_0(θ) ≤ −h(Θ), so the limit exists and equals −h(Θ) almost surely. The standard version of Egorov's theorem concerns sequences of finite measurable functions converging pointwise to a finite measurable limiting function. However, the proof is easily adapted to an infinite limiting function.
Lemma 9 Let f_t(θ) be a sequence of finite, measurable functions, converging to ∞ almost everywhere (Π_0) on I. Then for each ε > 0, there exists a possibly-empty B ⊂ I such that Π_0(B) < ε, and the convergence is uniform on I \ B.
Proof: Parallel to the usual proof of Egorov's theorem. Begin by removing the measure-zero set of points on which pointwise convergence fails; for simplicity, keep the name I for the remaining set. For each pair of natural numbers t and k, let B_{t,k} ≡ {θ ∈ I : f_s(θ) < k for some s ≥ t} — the points where the function has not yet permanently risen above k by step t. Since the limit of f_t is ∞ everywhere on I, each θ has a last t such that f_t(θ) < k, no matter how big k is. Hence, for each k, the sets B_{t,k} decrease in t and ∩_{t=1}^∞ B_{t,k} = ∅. By continuity of measure, for any δ > 0 there exists an n_k such that Π_0(B_{t,k}) < δ if t ≥ n_k. Fix ε as in the statement of the lemma, and for each k set δ = ε2^{−k}. Finally, set B = ∪_{k=1}^∞ B_{n_k,k}. By the union bound, Π_0(B) ≤ ε, and by construction, the rate of convergence to ∞ is uniform on I \ B.
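A concrete run of this construction may help. In the sketch below (an illustrative choice, not from the paper), f_t(θ) = tθ on a grid approximating (0, 1] diverges pointwise but not uniformly near 0; the code builds the bad sets B_{n_k,k}, checks that their union has prior mass at most ε, and checks that the divergence is uniform on the complement.

```python
# A concrete run of the Egorov-type construction in Lemma 9: f_t(theta) =
# t * theta diverges pointwise on (0, 1] but not uniformly near 0.  The
# grid, prior, and f_t are illustrative choices.
import numpy as np

N = 100000
thetas = np.arange(1, N + 1) / N           # grid approximating (0, 1]
prior = np.full(N, 1.0 / N)                # uniform prior Pi_0

def f(t, th):
    return t * th                          # f_t -> infinity pointwise

eps = 0.1
K = 8                                      # levels of the construction
bad = np.zeros(N, dtype=bool)
n_ks = []
for k in range(1, K + 1):
    # Here f_t is increasing in t, so B_{t,k} = {theta : f_t(theta) < k}
    # already decreases in t; pick the first n_k with mass < eps * 2^{-k}.
    n_k = 1
    while prior[f(n_k, thetas) < k].sum() >= eps * 2.0 ** (-k):
        n_k *= 2                           # geometric search suffices here
    n_ks.append(n_k)
    bad |= f(n_k, thetas) < k              # accumulate B = union of B_{n_k,k}

bad_mass = prior[bad].sum()                # <= eps by the union bound
good = ~bad
# uniform divergence on the complement: by step n_k, f is >= k everywhere
uniform_ok = all(f(n, thetas[good]).min() >= k
                 for k, n in zip(range(1, K + 1), n_ks))
print(bad_mass, uniform_ok)
```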
Proof: The integrated likelihood ratio can be divided into two parts, one from integrating over I and one from integrating over its complement. Previous lemmas have established that the latter is upper bounded, in the long run, by a quantity which is O(exp{−h(Θ)t}). We can use Lemma 9 to divide I into a sequence of sub-sets, on which the convergence is uniform, and hence on which the integrated likelihood shrinks faster than any exponential function, and remainder sets, of prior measure no more than α exp{−tβ}, on which the convergence is less than uniform (i.e., slow). If we ensure that β > 2h(Θ), however, by Lemma 5 the remainder sets' contributions to the integrated likelihood are negligible in comparison to that of Θ \ I. Said another way, if there are alternatives which a consistent test would rule out at a merely exponential rate, those which would be rejected at a supra-exponential rate end up making vanishingly small contributions to the integrated likelihood.
Proof: Theorem 1 says that, for all θ, lim sup_t t^{-1} log π_t(θ) ≤ −J(θ) a.s., so there just needs to be a matching lim inf. Pick any ε > 0. By Assumption 3, it's almost certain that, for all sufficiently large t, t^{-1} log R_t(θ) ≥ −h(θ) − ε, while by Lemma 10, it's almost certain that t^{-1} log ∫_Θ R_t(θ) dΠ_0(θ) ≤ −h(Θ) + ε for all sufficiently large t. Combining these as in the proof of Theorem 1, it's almost certain that, for all sufficiently large t, t^{-1} log π_t(θ) ≥ −J(θ) − 2ε, which gives the matching lim inf.

Convergence and Large Deviations of the Posterior Measure
Adding Assumption 7 to those before it implies that the posterior measure concentrates on sets A ⊂ Θ where h(A) = h(Θ). Proof: Decompose the posterior measure of A as Π_t(A) = Π_t(A ∩ G_t) + Π_t(A ∩ G_t^c). The last term is easy to bound. From Eq. 11 in the proof of Lemma 5, for any ε > 0, Π_t(A ∩ G_t^c) ≤ Π_t(G_t^c) ≤ exp{t(h(Θ) + ε − β/2)} for all sufficiently large t, almost surely. Since β > 2h(Θ), the whole expression → 0 as t → ∞.
To bound Π_t(A ∩ G_t), reasoning as in the proof of Lemma 7, but invoking Assumption 7, leads to the conclusion that, for any ε > 0, with probability 1, t^{-1} log ∫_{A ∩ G_t} R_t(θ) dΠ_0(θ) ≤ −h(A) + ε for all sufficiently large t. Recall that by Lemma 3, for all ε > 0 it's almost sure that t^{-1} log ∫_Θ R_t(θ) dΠ_0(θ) ≥ −h(Θ) − ε for all sufficiently large t. Hence for every ε > 0, it's almost certain that for all sufficiently large t, Π_t(A ∩ G_t) ≤ exp{t(h(Θ) − h(A) + 2ε)}. Since h(A) > h(Θ), by picking ε small enough the right-hand side goes to zero.
The proof of the theorem provides an exponential upper bound on the posterior measure of sets where h(A) > h(Θ). In fact, even without the final assumption needed for the theorem, there is an exponential lower bound on that posterior measure.
Lemma 11 Make Assumptions 1-6, and pick a set A ∈ T with Π_0(A) > 0. Then lim inf_t t^{-1} log Π_t(A) ≥ −J(A), where J(A) ≡ h(A) − h(Θ). Proof: Reasoning as in the proof of Lemma 3, it is easy to see that lim inf_t t^{-1} log ∫_A R_t(θ) dΠ_0(θ) ≥ −h(A), and by Lemma 7, lim sup_t t^{-1} log ∫_Θ R_t(θ) dΠ_0(θ) ≤ −h(Θ); subtracting the latter from the former gives the result.

Theorem 4 Under the conditions of Theorem 3, if in addition Π_0(A ∩ G_t^c) ≤ α′ exp{−tβ′} for some α′ > 0 and some β′ > 2h(A), then Eq. 26 holds: lim_t t^{-1} log Π_t(A) = −J(A).
In particular, this holds whenever 2h(A) < β, or when A ⊂ ∩_{k=n}^∞ G_k for some n. Proof: Trivially, Π_t(A) = Π_t(A ∩ G_t) + Π_t(A ∩ G_t^c). From Eq. 23 from the proof of Theorem 3, we know that, for any ε > 0, Π_t(A ∩ G_t) ≤ exp{t(h(Θ) − h(A) + 2ε)} a.s. for sufficiently large t. On the other hand, under the hypothesis of the theorem, the proof of Eq. 22 can be imitated for Π_t(A ∩ G_t^c), with the conclusion that, for all ε > 0, Π_t(A ∩ G_t^c) ≤ exp{t(h(Θ) + ε − β′/2)}, again a.s. for sufficiently large t. Since β′/2 > h(A), Eq. 26 follows, the matching lower bound coming from Lemma 11. Finally, to see that this holds for any A where h(A) < β/2, observe that we can always upper bound Π_t(A ∩ G_t^c) by Π_t(G_t^c), and the latter goes to zero with exponential rate at least β/2 − h(Θ).
Remarks: Because h(A) is the essential infimum of h(θ) on the set A, as the set shrinks h(A) grows. Sets where h(A) is much larger than h(Θ) tend accordingly to be small. The difficulty is that the sets G_t^c are also small, and conceivably overlap so heavily with A that the integral of the likelihood over A is dominated by the part coming from A ∩ G_t^c. Eventually this will shrink towards zero exponentially, but perhaps only at the comparatively slow rate β/2 − h(Θ), rather than the faster rate h(A) − h(Θ) attained on the well-behaved part A ∩ G_t. Theorem 4 is close to, but not quite, a large deviations principle on Θ. We have shown that the posterior probability of any set A where J(A) > 0 goes to zero with an exponential rate at least equal to min{J(A), β/2 − h(Θ)}. But in a true LDP, the rate would have to be an infimum, not just an essential infimum, of a point-wise rate function. This deficiency could be removed by means of additional assumptions on Π_0 and h(θ).
Ref. [22] obtains proper large and even moderate deviations principles, but for the location of Π t in the space M 1 (Θ) of all distributions on Θ, rather than on Θ itself. Essentially, they use the assumption of IID sampling, which makes the posterior a function of the empirical distribution, to leverage the LDP for the latter into an LDP for the former. This strategy may be more widely applicable but goes beyond the scope of this paper. Papangelou [49], assuming that Θ consists of discrete-valued Markov chains of arbitrary order and P is in the support of the prior, and using methods similar to those in Appendix B, derives a result which is closely related to Theorem 4. In fact, fixing the sets G t as in Appendix B, Theorem 4 implies the theorem of [49].

Generalization Performance
Lemma 10 shows that, in hindsight, the Bayesian learner does a good job of matching the data: the log integrated likelihood ratio per time-step approaches −h(Θ), the limit of values attainable by individual hypotheses within the support of the prior. This leaves open, however, the question of the prospective or generalization performance.
What we want is for the posterior predictive distribution F_t^Π to approach the true conditional distribution of future events, P_t, but we cannot in general hope for the convergence to be complete, since our models are mis-specified. The next theorem uses h(Θ) to put an upper bound on how far the posterior predictive distribution can remain from the true predictive distribution.
Theorem 5 Under Assumptions 1-7, with probability 1, the Hellinger and total-variation distances between the predictive distributions obey the lim sup bounds of Eqs. 28 and 29, where ρ_H and ρ_TV are, respectively, the Hellinger and total variation metrics.
Proof: Recall the well-known inequalities relating Hellinger distance to Kullback-Leibler divergence on the one side and to total variation distance on the other [30]: for any two distributions P and Q, It's enough to prove Eq. 28, and Eq. 29 then follows from Eq. 31.
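The two inequalities invoked here can be checked numerically. The sketch below fixes one common normalization — ρ_H(P,Q)² = Σ(√p_i − √q_i)², KL in nats, ρ_TV = ½ Σ|p_i − q_i|; the constants in the paper's Eqs. 30-31 may differ under other conventions — under which ρ_H² ≤ KL(P‖Q) and ρ_TV ≤ ρ_H:

```python
# Numerical check of the Hellinger / KL / total-variation inequalities used
# in the proof of Theorem 5, under one common normalization (other
# conventions carry different constants): rho_H(P,Q)^2 = sum (sqrt p_i -
# sqrt q_i)^2, KL in nats, rho_TV = (1/2) sum |p_i - q_i|.
import numpy as np

rng = np.random.default_rng(0)

def hellinger(p, q):
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def kl(p, q):
    return np.sum(p * np.log(p / q))

def tv(p, q):
    return 0.5 * np.sum(np.abs(p - q))

checks = []
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))          # random pair of 5-point distributions
    q = rng.dirichlet(np.ones(5))
    checks.append(hellinger(p, q) ** 2 <= kl(p, q)
                  and tv(p, q) <= hellinger(p, q))

all_hold = all(checks)
print(all_hold)
```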
Remark: It seems like it should be possible to prove a similar result for the Kullback-Leibler divergence rate of the predictive distribution, but it would take a different approach, because h(·) has no upper bound, and the posterior weight of the high-divergence regions might decay too slowly to compensate for this.

Rate of Convergence
Recall that N_ε was defined as the set of all θ such that h(θ) ≤ h(Θ) + ε. (This is measurable by Assumption 2.) The set N_ε^c thus consists of all hypotheses whose divergence rate is more than ε above the essential infimum h(Θ). For any ε > 0, Π_t(N_ε^c) → 0 a.s., by Theorem 3, and for sufficiently small ε, lim_{t→∞} t^{-1} log Π_t(N_ε^c) = −ε a.s., by Theorem 4. For such sets, in other words, for any δ > 0, it's almost certain that for all sufficiently large t, Π_t(N_ε^c) ≤ exp{−t(ε − δ)} (Eq. 32). Now consider a non-increasing positive sequence ε_t → 0. Presumably if ε_t decays slowly enough, Π_t(N_{ε_t}^c) will still go to zero, even though the sets N_{ε_t}^c are non-decreasing. Examination of Eq. 32 suggests, naively, that this will work if ε_t t → ∞, i.e., if the decay of ε_t is strictly sublinear. This is correct under an additional condition, similar to Assumption 6.
Proof: By showing that Π_t(N_{ε_t}^c) → 0 a.s. Begin by splitting the sets into the parts inside the G_t, say U_t, and the parts outside: Π_t(N_{ε_t}^c) = Π_t(U_t) + Π_t(N_{ε_t}^c ∩ G_t^c). From Lemma 4, the second term → 0 with probability 1, so for any η_1 > 0, it is ≤ η_1 eventually a.s. Turning to the other term, Theorem 4 applies to U_k for any fixed k, so Π_t(U_k) → 0 (a.s.), implying, with Lemma 10, that the likelihood integrated over U_k is eventually exponentially small compared to the likelihood integrated over all of Θ (a.s.). By Eq. 33, for any η_2 > 0, the corresponding bound holds for U_t as well, eventually almost surely. By Lemma 10 and Bayes's rule, then, Π_t(U_t) ≤ exp{−t(ε_t − η_3)} eventually a.s., for any η_3 > 0. Putting things back together, Π_t(N_{ε_t}^c) ≤ exp{−t(ε_t − η_3)} + η_1 eventually a.s. Since ε_t t → ∞, the first term goes to zero, and since η_1 can be as small as desired, Π_t(N_{ε_t}^c) → 0 almost surely.
The theorem lets us attain rates of convergence ε_t just slower than t^{-1} (so that ε_t t → ∞). This matches existing results on rates of posterior convergence for mis-specified models with IID data in [68, Corollary 5.2] (t^{-1} in the Renyi divergence) and in [38] (t^{-1/2} in the Hellinger distance; recall Eq. 30), and for correctly-specified non-IID models in [29] (t^{-α} for suitable α < 1/2, again in the Hellinger distance).
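The arithmetic behind the sub-linearity condition can be spelled out in a few lines. The constants below are illustrative; the point is only that ε_t = c t^{−γ} with γ < 1 makes tε_t diverge, so the naive bound from Eq. 32 still vanishes.

```python
# The sub-linearity condition in the rate-of-convergence argument: with
# eps_t = c * t**(-gamma) and gamma < 1, t * eps_t -> infinity, so the
# naive posterior bound exp(-t * eps_t * (1 - delta)) from Eq. 32 still
# goes to zero.  c, gamma, delta are illustrative values.
import math

c, gamma, delta = 1.0, 0.9, 0.1
ts = [10 ** j for j in range(1, 7)]
t_eps = [c * t ** (1 - gamma) for t in ts]              # t * eps_t
bounds = [math.exp(-x * (1 - delta)) for x in t_eps]    # naive posterior bound

increasing = all(a < b for a, b in zip(t_eps, t_eps[1:]))
vanishing = all(a > b for a, b in zip(bounds, bounds[1:]))
print(t_eps[-1], bounds[-1])
```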

Application of the Results to the Example
Because h(Θ) = 0, while h(θ) > 0 everywhere, the behavior of the posterior is somewhat peculiar. Every compact set K ⊂ Θ has J(K) > 0, so by Theorem 3, Π t (K) → 0. On the other hand, Π t (G t ) → 1 -the sequence of good sets contains models of increasingly high order, with increasingly weak constraints on the transition probabilities, and this lets its posterior weight grow, even though every individual compact set within it ultimately loses all weight.
In fact, each G_t is a convex set, and h(·) is a convex function, so there is a unique minimizer of the divergence rate within each good set. Conditional on being within G_t, the posterior probability becomes increasingly concentrated on neighborhoods of this minimizer, but the minimizer itself keeps moving, since it can always be improved upon by increasing the order of the chain and reducing some transition probabilities. (Recall that P gives probability 0 to sequences 010, 01110, etc., where the block of 1's is of odd length, but Θ contains only chains with strictly positive transition probabilities.) Outside of the good sets, the likelihood is peaked around hypotheses which provide stationary and smooth approximations to the distribution that endlessly repeats the observed sequence x_1^t to date. The divergence rates of these hypotheses are however extremely high, so none of them retains its high likelihood for very long. (The process which deterministically repeats x_1^t is a Markov chain of order t, but it is not in Θ, since it is neither stationary nor does it have strictly positive transition probabilities. It can be made stationary, however, by assigning equal probability to each of its t states; this gives the data likelihood 1/t rather than 1, but that is still vastly larger than the likelihoods of better models, whose log-likelihoods are of order −ct. (Recall that even the log-likelihood of the true distribution is only of order −(2/3)t.) Allowing each of the t states a probability 0 < ι ≪ 1 of not proceeding to the next state in the periodic sequence is easy and leads to only an O(ιt) reduction in the log-likelihood up to time t. In the long run, however, it means that the log-likelihood will be O(t log ι).) In any case, the total posterior probability of G_t^c goes to zero exponentially.
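The example's qualitative behavior can be reproduced in simulation. The sketch below (standard constructions, not the paper's code) draws from the even process via its two-state hidden representation and fits maximum-likelihood Markov chains of increasing order; the per-symbol log-likelihood improves with order but stays below the −2/3 bit-per-symbol ceiling set by the entropy rate, matching the moving-minimizer story above.

```python
# Simulation of the running example: data from the "even process" (blocks
# of 1s have even length), which is not a finite-order Markov chain,
# fitted by Markov chains of increasing order.  The hidden-state
# construction and the ML fit are standard; all names are illustrative.
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

def even_process(t):
    """Hidden 2-state chain: state A emits 0 (stay) or 1 (start a pair,
    go to B) with probability 1/2 each; state B emits 1 and returns to A."""
    x, state = np.empty(t, dtype=int), 0
    for i in range(t):
        if state == 0:
            x[i] = rng.integers(2)
            state = x[i]              # emitted 1 -> move to B
        else:
            x[i] = 1
            state = 0
    return x

def markov_loglik(x, k, start):
    """Per-symbol log2-likelihood of the MLE order-k chain, fitted and
    evaluated on symbols x[start:], conditioning on the k preceding ones."""
    ctx, ctx_sym = Counter(), Counter()
    for i in range(start, len(x)):
        w = tuple(x[i - k:i])
        ctx[w] += 1
        ctx_sym[w + (x[i],)] += 1
    ll = sum(n * np.log2(n / ctx[wa[:-1]]) for wa, n in ctx_sym.items())
    return ll / (len(x) - start)

x = even_process(20000)
kmax = 3
lls = [markov_loglik(x, k, kmax) for k in range(1, kmax + 1)]
print(lls)   # improves with order, approaching -2/3 bits per symbol
```

Fitting and evaluating on the same symbol range makes the sequence monotone in k, since order-k chains are nested inside order-(k+1) chains.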
Despite - or rather, because of - the fact that no point in Θ is the ne plus ultra around which the posterior concentrates, the conditions of Theorem 5 are met, and since h(Θ) = 0, the posterior predictive distribution converges to the true predictive distribution in the Hellinger and total variation metrics. That is, the weird gyrations of the posterior do not prevent us from attaining predictive consistency. This is so even though the posterior always gives the wrong answer to such basic questions as "Is P(X_t^{t+2} = 010) > 0?" - inferences which in this case can be made correctly through non-Bayesian methods [47; 58].

Discussion
The crucial assumptions were 3, 5 and 6. Together, these amount to assuming that the time-averaged log-likelihood ratio converges universally; to fashioning a sieve, successively embracing regions of Θ where the convergence is increasingly ill-behaved; and to hoping that the prior weight of the remaining bad sets can be bounded exponentially.
Using asymptotic equipartition in place of the law of large numbers is fairly straightforward. Both results belong to the general family of ergodic theorems, which allow us to take sufficiently long sample paths as representative of entire processes. The unique a.s. limit in Eq. 1 can be replaced with a.s. convergence to a distinct limit in each ergodic component of P . However, the notation gets ugly, so the reader should regard h(θ) as that random limit, and treat all subsequent results as relative to the ergodic decomposition of P . (Cf. [31; 17].) It may be possible to weaken this assumption yet further, but it is hard to see how Bayesian updating can succeed if the past performance of the likelihood is not a guide to future results.
A bigger departure from the usual approach to posterior convergence may be allowing h(Θ) > 0; this rules out posterior consistency, to begin with. More subtly, it requires β > 2h(Θ). This means that a prior distribution which satisfies the assumptions for one value of P may not satisfy them for another, depending, naturally enough, on just how mis-specified the hypotheses are, and how much weight the prior puts on very bad hypotheses. On the other hand, when h(Θ) = 0, Theorem 5 implies predictive consistency, as in the example.
Assumption 6 is frankly annoying. It ensures that the log-likelihood ratio converges fast enough, at least on the good sets, that we can be confident that the integrated likelihood of G_t has converged well by the time we want G_t to start dominating the prior. It was shaped, however, to fill a hole in the proof of Lemma 7 rather than by more natural considerations. The result is that verifying the assumption in its present form means proving the sub-linear growth rate of sequences of random last entry times, and these times are not generally convenient to work with. (Cf. Appendix B.) It would be nice to replace it with a bracketing or metric entropy condition, as in [4; 68], or similar forms of capacity control, as used in [46; 63]. Alternately, the uniformly consistent test conditions widely employed in Bayesian nonparametrics [30; 67] have been adapted to the mis-specified setting by [38], where the tests become reminiscent of the "model selection tests" used in econometrics [64]. Since the latter can work for dynamical models [51], this approach may also work here. In any event, replacing Assumption 6 with more primitive, comprehensible and easily-verified conditions seems a promising direction for future work.
These results go some way toward providing a frequentist explanation of the success of Bayesian methods in many practical problems. Under these conditions, the posterior is increasingly weighted towards the parts of Θ which are closest (in the Kullback-Leibler sense) to the data-generating process P. For Π_t(A) to persistently be much more or much less than ≈ exp{−tJ(A)}, R_t(θ) must be persistently far from exp{−th(θ)}, not just for isolated θ ∈ A, but for a whole positive-measure subset of them. With a reasonably smooth prior, this requires a run of bad luck amounting almost to a conspiracy. From this point of view, Bayesian inference amounts to introducing bias so as to reduce variance, and then relaxing the bias. Experience with frequentist non-parametric methods shows this can work if the bias is relaxed sufficiently slowly, which is basically what the assumptions here do. As the example shows, this can succeed as a predictive tactic without supporting substantive inferences about the data-generating process. However, Assumptions 4-7 involve both the prior and the data-generating process, and so cannot be verified using the prior alone. For empirical applications, it would be nice to have ways of checking them using sample data.
When h(Θ) > 0 and all the models are more or less wrong, there is an additional advantage to averaging the models, as is done in the predictive distribution. (I owe the argument which follows to Scott Page; cf. [48].) With a convex loss function ℓ, such as squared error, Kullback-Leibler divergence, Hellinger distance, etc., the loss of the predictive distribution, ℓ(Π_t), will be no larger than the posterior-mean loss of the individual models, Π_t(ℓ(θ)). For squared-error loss, the difference is equal to the variance of the models' predictions [40]. For divergence, some algebra gives a similar decomposition, in which the second term on the right-hand side is again an indication of the diversity of the models; the more different their predictions are, on the kind of data generated by P, the smaller the error made by the mixture. Having a diversity of wrong answers can be as important as reducing the average error itself. The way to accomplish this is to give more weight to models which make mostly good predictions, but make different mistakes. This suggests that there may actually be predictive benefits to having the posterior concentrate on a set containing multiple hypotheses. Finally, it is worth remarking on the connection between these results and prediction with "mixtures of experts" [2; 10]. Formally, the role of the negative log-likelihood and of Bayes's rule in this paper was to provide a loss function and a multiplicative scheme for updating the weights. All but one of the main results (Theorem 5, which bounds Hellinger distance by Kullback-Leibler divergence) would carry over to multiplicative weight training using a different loss function, provided the accumulated loss per unit time converged.
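The squared-error version of the averaging argument is a one-line identity, checked numerically below (all numbers are illustrative): the mixture's loss plus the variance ("diversity") of the individual predictions equals their average loss.

```python
# The model-averaging point in squared-error form: for a convex loss, the
# loss of the mixture's prediction is at most the weighted-average loss of
# the individual models, and for squared error the gap is exactly the
# variance of their predictions [40].  All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(7)
truth = 1.0
preds = rng.normal(loc=0.8, scale=0.5, size=100)   # individual model predictions
weights = np.full(100, 0.01)                       # posterior weights (uniform here)

mixture_pred = np.sum(weights * preds)
mixture_loss = (truth - mixture_pred) ** 2
mean_loss = np.sum(weights * (truth - preds) ** 2)
diversity = np.sum(weights * (preds - mixture_pred) ** 2)

gap = mean_loss - mixture_loss
print(mixture_loss, mean_loss, diversity)   # mean_loss = mixture_loss + diversity
```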

Appendix A: Bayesian Updating as Replicator Dynamics
Replicator dynamics are one of the fundamental models of evolutionary biology; they represent the effects of natural selection in large populations, without (in their simplest form) mutation, sex, or other sources of variation. [34] provides a thorough discussion. They also arise as approximations to many other adaptive processes, such as reinforcement learning [8; 9; 54]. In this appendix, I show that Bayesian updating also follows the replicator equation.
We have a set of replicators -phenotypes, species, reproductive strategies, etc. -indexed by θ ∈ Θ. The population density at type θ is π(θ). We denote by φ t (θ) the fitness of θ at time t, i.e., the average number of descendants left by each individual of type θ. The fitness function φ t may in fact be a function of π t , in which case it is said to be frequency-dependent. Many applications assume the fitness function to be deterministic, rather than random, and further assume that it is not an explicit function of t, but these restrictions are inessential.
The discrete-time replicator dynamic [34] is the dynamical system given by the map π_{t+1}(θ) = π_t(θ) φ_t(θ) / Π_t(φ_t), where Π_t(φ_t) is the population mean fitness at t, i.e., Π_t(φ_t) = ∫ φ_t(θ) π_t(θ) dθ. The effect of these dynamics is to re-weight the population towards replicators with above-average fitness. It is immediate that Bayesian updating has the same form as Eq. 36, as soon as we identify the distribution of replicators with the posterior distribution, and the fitness with the conditional likelihood. In fact, Bayesian updating is an especially simple case of the replicator equation, since the fitness function is frequency-independent, though stochastic. Updating corresponds to the action of natural selection, without variation, in a fluctuating environment. The results in the main text assume (Assumption 3) that, despite the fluctuations, the long-run fitness is nonetheless a determinate function of θ. The theorems assert that selection can then be relied upon to drive the population to the peaks of the long-run fitness function, at the cost of reducing the diversity of the population, rather as in Fisher's fundamental theorem of natural selection [23; 34].
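The identification can be verified directly in a toy model (a Bernoulli family with three types; all choices illustrative): iterating the replicator map with fitness set to the conditional likelihood reproduces batch Bayesian updating exactly.

```python
# Bayes's rule as a replicator map (Eq. 36): identify the population
# distribution with the posterior and fitness with conditional likelihood.
# The two updates below are the same computation; the model is a toy
# Bernoulli family chosen for illustration.
import numpy as np

rng = np.random.default_rng(3)
thetas = np.array([0.2, 0.5, 0.8])      # replicator types = hypotheses
pi = np.array([1 / 3, 1 / 3, 1 / 3])    # population density = prior

x = (rng.random(200) < 0.7).astype(int)
for xi in x:
    fitness = thetas ** xi * (1 - thetas) ** (1 - xi)   # conditional likelihood
    pi = pi * fitness / np.sum(pi * fitness)            # replicator map / Bayes

# direct batch Bayes for comparison (uniform prior cancels)
loglik = x.sum() * np.log(thetas) + (len(x) - x.sum()) * np.log(1 - thetas)
post = np.exp(loglik) / np.sum(np.exp(loglik))

agree = np.allclose(pi, post)
print(pi, agree)
```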
Under the conditions of Theorem 2, the time average of the log relative fitness converges a.s. (Eq. 37).
Remark: Theorem 2 implies that H_t ≡ |log π_t(θ) + tJ(θ)| is a.s. o(t). To strengthen Eq. 37 from convergence of the time average or Cesàro mean to plain convergence requires forcing H_t − H_{t−1} to be o(1), which it generally isn't. It is worth noting that Haldane [33] defined the intensity of selection on a population as, in the present notation, log [π_t(θ̂)/π_0(θ̂)], where θ̂ is the "optimal" (i.e., most selected-for) value of θ. For us, this intensity of selection is just log [R_t(θ̂)/Π_0(R_t)], where θ̂ is the (or a) MLE.

Appendix B: Verification of Assumptions 5-7 for the Example
Since the X_1^∞ process is a function of the S_1^∞ process, and the latter is an aperiodic Markov chain, both are ψ-mixing (see [44; 60] for the definition of ψ-mixing and demonstrations that aperiodic Markov chains and their functions are ψ-mixing). Let P̂_t^{(k)} be the empirical distribution of sequences of length k obtained from x_1^t. For a Markov chain of order k, the likelihood is a function of P̂_t^{(k+1)} alone; we will use this and the ergodic properties of the data-generating process to construct sets on which the time-averaged log-likelihood converges uniformly. Doing this will involve constraining both the order of the Markov chains and their transition probabilities, and gradually relaxing the constraints.
It will simplify notation if from here on all logarithms are taken to base 2. Pick ε > 0 and let k(t) be an increasing positive-integer-valued function of t, k(t) → ∞, subject to the limit k(t) ≤ (log t)/(h_P + ε), where h_P is the Shannon entropy rate of P, which direct calculation shows is 2/3. The ψ-mixing property of X_1^∞ implies [60, p. 179] a deviation bound of the form P(ρ_TV(P̂_t^{(k(t))}, P^{(k(t))}) > δ) ≤ q_1(t) 2^{−nC_1δ²} (38), where ρ_TV is total variation distance, P^{(k(t))} is P's restriction to sequences of length k(t), n = t/k(t) − 1, the prefactor q_1(t) grows only sub-exponentially in t (at a rate controlled by γ_1 = (h_P + ε/2)/(h_P + ε)), and C_1 is a positive constant specific to P (the exact value of which is not important). The log-likelihood per observation of a Markov chain θ ∈ Θ_k is t^{-1} log f_θ(x_1^t) = Σ_{w,a} P̂_t^{(k+1)}(wa) log f_θ(a|w), up to boundary terms from the first k observations, where f_θ(a|w) is of course the probability, according to θ, of producing a after seeing w. By asymptotic equipartition, this is converging a.s. to its expected value, −h_P − h(θ).
Let z(θ) = max_{w,a} |log f_θ(a|w)|. If z(θ) ≤ z_0 and ρ_TV(P̂_t^{(k+1)}, P^{(k+1)}) ≤ δ, then t^{-1} log f_θ(x_1^t) is within z_0δ of −h_P − h(θ). Meanwhile, t^{-1} log p(x_1^t) is converging a.s. to −h_P, and again [60] P(|t^{-1} log p(X_1^t) + h_P| > δ) ≤ q(t, δ) 2^{−tC_2δ} (39) for some C_2 > 0 and sub-exponential q(t, δ). (The details are unilluminating in the present context and thus skipped.) Define G(t, z_0) as the set of all Markov models whose order is at most k(t) − 1 and whose log transition probabilities do not exceed z_0 in magnitude, in symbols G(t, z_0) ≡ {θ ∈ ∪_{k<k(t)} Θ_k : z(θ) ≤ z_0}. Combining the deviation-probability bounds 38 and 39 gives, uniformly over θ ∈ G(t, z_0), deviation probabilities for t^{-1} log R_t(θ) which are clearly summable as t → ∞, so by the Borel-Cantelli lemma, we have uniform almost-sure convergence of t^{-1} log R_t(θ) to −h(θ) for all θ ∈ G(t, z_0). The sets G(t, z_0) eventually expand to include Markov models of arbitrarily high order, but maintain a constant bound on the transition probabilities. To relax this, let z_t be an increasing function of t, z_t → ∞, subject to z_t ≤ C_3 t^{γ_2} for positive γ_2 < γ_1. Then the deviation probabilities remain summable, and for each t, the convergence of t^{-1} log R_t(θ) is still uniform on G(t, z_t). Set G_t = G(t, z_t), and turn to verifying the remaining assumptions.
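The block-distribution machinery at work here can be sketched numerically. In the code below (illustrative constructions, not the paper's), P̂_t^{(k)} is computed from simulated even-process data, the "true" P^{(k)} is proxied by a much longer run, and the total variation distance shrinks as t grows; the forbidden block 010 never appears, as the example requires.

```python
# Empirical block distributions, the workhorse of Appendix B: hat-P_t^{(k)}
# is the empirical distribution of length-k words in x_1^t, approaching
# the process distribution P^{(k)} in total variation as t grows.  Here
# the "truth" is proxied by a much longer run; all choices are illustrative.
import numpy as np
from collections import Counter

rng = np.random.default_rng(5)

def even_process(t):
    # state A (0): emit 0 or 1 equiprobably, a 1 starts a pair; state B: emit 1
    x, state = np.empty(t, dtype=int), 0
    for i in range(t):
        x[i] = rng.integers(2) if state == 0 else 1
        state = 1 if (state == 0 and x[i] == 1) else 0
    return x

def block_dist(x, k):
    c = Counter(tuple(x[i:i + k]) for i in range(len(x) - k + 1))
    total = len(x) - k + 1
    return {w: n / total for w, n in c.items()}

def tv(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in keys)

k = 3
reference = block_dist(even_process(500000), k)   # proxy for P^{(k)}
d_small = tv(block_dist(even_process(1000), k), reference)
d_large = tv(block_dist(even_process(100000), k), reference)
print(d_small, d_large)    # the longer run should be closer
```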
Start with Assumption 5; take its items in reverse order. So far, the only restriction on the prior Π_0 has been that its support should be the whole of Θ, and that it should have the "Kullback-Leibler rate property", giving positive weight to every set N_ε = {θ : d(θ) < ε}. This, together with the fact that lim_t G_t = Θ, means that h(G_t) → h(Θ), which is item (3) of the assumption. The same argument also delivers Assumption 7. Item (2), uniform convergence on each G_t, is true by construction. Finally (for this assumption), since h(Θ) = 0, any β > 0 will do, and there are certainly probability measures where Π_0(G_t^c) ≤ α exp{−βt} for some α, β > 0. So, Assumption 5 is satisfied. Only Assumption 6 remains. Since Assumptions 1-3 have already been checked, we can apply Eq. 18 from the proof of Lemma 6 and see that, for each fixed G from the sequence of G_t, for any ε > 0 and all sufficiently large t, t^{-1} log Π_0(G R_t) ≤ −h(G) + ε + t^{-1} log Π_0(G) a.s. This shows that τ(G_t, δ) is almost surely finite for all t and δ, but still leaves open the question of whether, for every δ and all sufficiently large t, t ≥ τ(G_t, δ) (a.s.). Reformulating a little, the desideratum is that for each δ, with probability 1, t < τ(G_t, δ) only finitely often. By the Borel-Cantelli lemma, this will happen if Σ_t P(τ(G_t, δ) > t) < ∞. However, if τ(G_t, δ) > t, it must be equal to some particular n > t, so there is a union bound: P(τ(G_t, δ) > t) ≤ Σ_{n>t} P(n^{-1} log Π_0(G_t R_n) > δ − h(G_t)). From the proof of Lemma 6 (specifically from Eqs. 15, 16 and 17), we can see that, by making t large enough, the only way to have the event n^{-1} log Π_0(G_t R_n) > δ − h(G_t) is to have n^{-1} log R_n(θ) + h(θ) > δ/2 everywhere on a positive-measure subset of G_t. But we know from Eq. 40 not only that the inner sum can be made arbitrarily small by taking t sufficiently large, but that the whole double sum is finite. So τ(G_t, δ) > t only finitely often (a.s.).