Asymptotic properties of predictive recursion: Robustness and rate of convergence

Abstract: Here we explore general asymptotic properties of Predictive Recursion (PR) for nonparametric estimation of mixing distributions. We prove that, when the mixture model is mis-specified, the estimated mixture converges almost surely in total variation to the mixture that minimizes the Kullback-Leibler divergence, and a bound on the (Hellinger contrast) rate of convergence is obtained. Simulations suggest that this rate is nearly sharp in a minimax sense. Moreover, when the model is identifiable, almost sure weak convergence of the mixing distribution estimate follows. PR assumes that the support of the mixing distribution is known. To remove this requirement, we propose a generalization that incorporates a sequence of supports, increasing with the sample size, that combines the efficiency of PR with the flexibility of mixture sieves. Under mild conditions, we obtain a bound on the rate of convergence of these new estimates.


Introduction
Despite a well-developed theory and numerous applications of mixture models, estimation of a mixing distribution remains a challenging statistical problem. However, some recent progress has been made through a computationally efficient nonparametric estimate due to Newton [17]; see also Newton, et al. [18] and Newton and Zhang [19]. A mixture model views data (X_1, ..., X_n) ∈ X^n as independent observations from a density m(x) of the form

m(x) = m_F(x) = ∫_Θ p(x|θ) dF(θ),   (1.1)

where F = F(Θ, µ) is the class of probability measures on a parameter space (Θ, B) dominated by a σ-finite measure µ, and {p(·|θ) : θ ∈ Θ} is a parametric family of densities on (X, A), dominated by a σ-finite measure ν. More succinctly, the mixture model (1.1) assumes m ∈ M := {m_F : F ∈ F}. To estimate F from the data X_1, ..., X_n, Newton [17] proposed the following n-step recursive algorithm, called Predictive Recursion (PR):

Algorithm PR. Choose an initial measure F_0 ∈ F having µ-density f_0, and a sequence of weights {w_i : i ≥ 1} ⊂ (0, 1). For i = 1, ..., n, compute

f_i(θ) = (1 − w_i) f_{i−1}(θ) + w_i p(X_i|θ) f_{i−1}(θ) / m_{i−1}(X_i), where m_{i−1}(x) = ∫_Θ p(x|θ) f_{i−1}(θ) dµ(θ),   (1.2)

and produce F_n, the measure with µ-density f_n, as the final estimate of F.
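As a concrete illustration, here is a minimal sketch of the PR update on a discretized parameter space. The Gaussian kernel, grid resolution, weight sequence, and sample size below are our own illustrative choices, not prescribed by the algorithm.

```python
import numpy as np

def predictive_recursion(x_data, theta_grid, kernel, gamma=0.75, f0=None):
    """One pass of the PR algorithm over a discrete grid on Theta.

    kernel(x, theta_grid) must return p(x | theta) evaluated on the grid.
    Returns the estimated mixing density f_n (with respect to grid spacing).
    """
    dtheta = theta_grid[1] - theta_grid[0]            # mu = Lebesgue on Theta
    f = np.ones_like(theta_grid) if f0 is None else f0.astype(float).copy()
    f /= f.sum() * dtheta                             # normalize initial density
    for i, x in enumerate(x_data):
        w = (i + 2.0) ** -gamma                       # w_i = (i+1)^{-gamma}
        like = kernel(x, theta_grid)                  # p(x | theta) on the grid
        m_prev = np.sum(like * f) * dtheta            # m_{i-1}(x) by quadrature
        f = (1 - w) * f + w * like * f / m_prev       # the PR update
    return f

# Illustration: N(theta, 0.1^2) kernel, data from N(0.5, 0.1^2), Theta = [0, 1]
rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 200)
gauss = lambda x, t: np.exp(-0.5 * ((x - t) / 0.1) ** 2) / (0.1 * np.sqrt(2 * np.pi))
f_n = predictive_recursion(rng.normal(0.5, 0.1, 2000), grid, gauss)
# f_n concentrates near theta = 0.5, the true mixing location
```

Note that the update preserves normalization of f automatically, since the multiplicative term integrates to one against f; this is one reason PR is so fast compared with iterative likelihood maximization.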
Key features of PR include its speed and its unique flexibility to estimate a mixing distribution which has a density with respect to any user-defined dominating measure µ. The latter is a practically important property as a number of modern applications demand existence of a mixing density with respect to a specified dominating measure. For example, high-dimensional empirical Bayes analysis, spurred mainly by recent developments in DNA microarray technologies, starts with a Bayesian model whose prior/mixing distribution is itself a mixture of both discrete and continuous components; see Efron [5] and the references therein. Estimation of prior/mixing distributions in this "two-groups model" context is a promising application of PR; see Bogdan, et al. [3].
Until recently, very little was known about the large-sample behavior of PR. Ghosh and Tokdar [8] used a novel martingale argument to prove, when Θ is finite and m = m_F ∈ M, that F_n → F a.s. Martin and Ghosh [16] proved a slightly stronger consistency theorem using tools from stochastic approximation theory. Most recently, Tokdar, Martin, and Ghosh [23] (henceforth, TMG) handled the case of a more general parameter space Θ by extending the martingale argument to the X-space, proving that the mixture density estimate

m_n(x) := m_{F_n}(x) = ∫ p(x|θ) dF_n(θ), x ∈ X,   (1.3)

converges a.s. to m_F(x) in the L_1 topology. From L_1 convergence of m_n, consistency of F_n in the weak topology on Θ is obtained. In this paper, we extend the convergence results of TMG in two important directions. First, in the more general context, where the mixture model M need not contain the true density m, we show that the estimated mixture m_n in (1.3) is asymptotically robust in the sense that it converges almost surely in the total variation (or Hellinger) topology to the mixture m_F that minimizes the Kullback-Leibler (KL) divergence K(m, m_Φ) = ∫ log(m/m_Φ) m dν. When the mixing distribution is identifiable, we also obtain weak convergence of F_n. Our second main result is a bound on the rate of convergence for m_n. For a unified treatment of the well- and mis-specified cases, we consider the Hellinger contrast

ρ(m_n, m_F)^2 = 1 − ∫ (m_n/m_F)^{1/2} m dν,

where m_F is a mixture that minimizes K(m, m_Φ). The Hellinger contrast has been previously used to study asymptotics of maximum likelihood and Bayes estimates under model mis-specification; see, for example, Patilea [20], and Kleijn and van der Vaart [10]. In Section 4, we show that if F is compact, then √a_n ρ(m_n, m_F) → 0 almost surely, where a_n = Σ_{i=1}^n w_i. This establishes a direct connection between the choice of weights w_i and the performance of the resulting PR estimate.
Moreover, this bound is derived without using any structural knowledge about the mixands p(x|θ) and applies to a wide range of such densities, including Normal, Gamma, and Poisson. The conditions on w i required for this result are satisfied by w n ≍ n −γ , γ ∈ (2/3, 1], leading to a n ≍ n 1−γ . Consequently, the Hellinger contrast convergence rate of m n is strictly faster than n −(1−γ)/2 . How this relates to the rate of convergence for F n remains an important open problem.
Our nearly n −1/6 bound on the convergence rate for m n closely matches the results in Genovese and Wasserman [6] derived for the special case of finite Gaussian mixtures. But it falls short of the nearly parametric rates obtained in Li and Barron [14] and Ghosal and van der Vaart [7]. However, the simulation results presented in Section 4.3 indicate that our rate is minimax in nature and, therefore, should not be directly compared to the rates in these two papers. In fact, empirical evidence suggests that, in some cases, the PR estimates can converge faster than what our bounds indicate.
A shortcoming of PR is that one needs to specify the compact mixing parameter space Θ a priori. To remove this requirement, we propose, in Section 5, a generalized PR (GPR) algorithm that features an increasing sieve-like sequence of supports Θ i ⊂ Θ i+1 . We obtain a bound on the rate of convergence for GPR in the special case where m = m F and F has an unknown compact support.

Almost supermartingales
The primary tool used to prove the new results of this paper is an "almost supermartingale" convergence theorem of Robbins and Siegmund [22]. For convenience, we give the statement of this result here. Let {M_n : n ≥ 1} be a sequence of non-negative random variables adapted to a filtration {F_n : n ≥ 1}. Suppose there are non-negative random variables {β_n, ξ_n, ζ_n} such that

E(M_n | F_{n−1}) ≤ (1 + β_{n−1}) M_{n−1} + ξ_{n−1} − ζ_{n−1}, n ≥ 1.   (2.1)

If both β_n ≡ 0 and ξ_n ≡ 0, then M_n would be exactly a supermartingale. But, more generally, Robbins and Siegmund [22] call a sequence {M_n} that satisfies (2.1) an almost supermartingale.

Theorem 2.1 (Robbins and Siegmund [22]). Let {M_n} be an almost supermartingale as in (2.1). If Σ_n β_n < ∞ and Σ_n ξ_n < ∞ a.s., then M_n converges a.s. to a finite random variable and Σ_n ζ_n < ∞ a.s.
That is, even if the sequence is not exactly a supermartingale, as long as the perturbations β n and ξ n vanish fast enough, then the conclusions of the usual martingale convergence theorem remain valid. For further discussion on Theorem 2.1 and its extensions, see Lai [11].
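To see the theorem in action, consider the toy scalar Robbins-Monro recursion x_n = x_{n−1} − w_n (x_{n−1} + ε_n) with ε_n ~ N(0, 1). Taking M_n = x_n^2, a direct computation shows M_n satisfies (2.1) with β_{n−1} = w_n^2, ξ_{n−1} = w_n^2, and ζ_{n−1} = 2 w_n M_{n−1}, so Theorem 2.1 forces M_n → 0 since Σ w_n = ∞. The recursion and step sizes below are our own illustrative example, not taken from the paper.

```python
import numpy as np

# Robbins-Monro: x_n = x_{n-1} - w_n (x_{n-1} + eps_n), eps_n ~ N(0, 1).
# With M_n = x_n^2: E(M_n | F_{n-1}) = (1 - w_n)^2 M_{n-1} + w_n^2
#   <= (1 + w_n^2) M_{n-1} + w_n^2 - 2 w_n M_{n-1},
# an almost supermartingale; since sum w_n = inf, the limit must be 0.
rng = np.random.default_rng(1)
x = 5.0
for n in range(1, 50_001):
    w = (n + 1) ** -0.75              # sum w_n = inf, sum w_n^2 < inf
    x -= w * (x + rng.standard_normal())
# x is now close to the root x* = 0
```

This is exactly the pattern used repeatedly below: identify β, ξ, ζ in a one-step expansion of a KL divergence, verify summability, and read off convergence plus summability of the ζ terms.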

Kullback-Leibler projections
It is quite natural to use the KL divergence to study the large-sample properties of PR. Indeed, Martin and Ghosh [16] remark that, for Θ finite, ℓ(Φ) = K(m, m_Φ) is a Lyapunov function for the differential equation that characterizes the asymptotics of PR: roughly, the KL divergence controls the dynamics of PR, driving the estimates toward a stable equilibrium. But when m ∉ M, this equilibrium cannot be m. In such cases, we consider the mixture m_F that is closest to m in a KL sense; that is,

K(m, m_F) = inf {K(m, m_Φ) : Φ ∈ F̄},

where F̄ is the weak closure of F, and M̄ = {m_Φ : Φ ∈ F̄}. We call m_F the KL projection of m onto M̄. Similar ideas may be found in [12, 20, 10]. Existence of the KL projection is an important issue, and various results are available; see, for example, Liese and Vajda [13, Chap. 8]. Here we prove a simple result which gives sufficient conditions for the existence of a KL projection in our special case of mixtures. Assume the following:

A1. F is pre-compact with respect to the weak topology.
A2. θ → p(x|θ) is bounded and continuous for ν-almost all x.

Lemma 3.1. Under A1-A2, the infimum of K(m, m_Φ) over Φ ∈ F̄ is attained at some F ∈ F̄.

Proof. Choose any Φ ∈ F̄ and any {Φ_s} ⊂ F̄ such that Φ_s → Φ weakly. Then A2 and Scheffé's theorem imply m_{Φ_s} → m_Φ in the L_1(ν) topology and, hence, the weak topology. Further,

κ(Φ) := K(m, m_Φ) ≤ lim inf_s K(m, m_{Φ_s}) = lim inf_s κ(Φ_s),

where the inequality follows from weak lower semi-continuity of K(m, ·); see Liese and Vajda [13], Theorem 1.47. Consequently, κ(·) is weakly lower semi-continuous and, therefore, must attain its infimum on the compact F̄. A similar proof may be given based on Lemma 4 of Brown, et al. [4].
Remark 3.2. Convexity of the set M and of the mapping K(m, ·) together imply that the KL projection m F in Lemma 3.1 is unique. However, in general, there could be many mixing distributions Φ ∈ F whose mixture m Φ corresponds to the KL projection m F . Identifiability is needed to guarantee uniqueness of F .
Next we state one more important property of the KL projection which will be useful in what follows. Various proofs of this result can be found in the literature; see, e.g., Patilea [20] or Kleijn and van der Vaart [10].

Preliminaries
We begin with our assumptions and some preliminary lemmas; proofs can be found in the Appendix. Let {w_i : i ≥ 1} be the user-specified weight sequence in the PR algorithm, and define the sequence of partial sums a_n = Σ_{i=1}^n w_i. In addition to A1-A2 in Section 3, assume the following:

A3. w_n > 0, w_n ↓ 0, Σ_n w_n = ∞, and Σ_n a_n w_n^2 < ∞.
A4. B := ∫ sup_{θ ∈ Θ} {p(x|θ)/m(x)}^2 m(x) dν(x) < ∞.

Condition A3 is satisfied if w_n ≍ n^{−γ}, for γ ∈ (2/3, 1]. The square-integrability condition A4 is the strongest assumption, but it does hold for exponential family mixands (including normal or Poisson) with sufficient statistic S(x), provided that Θ is compact and the push-forward measure m∘S^{−1} admits a moment-generating function on Θ. If one is willing to assume that m ∈ M, then A4 can be replaced by a less restrictive condition, depending only on p(x|θ); cf. assumption A5 in TMG (p. 2505).
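The summability requirement in A3 can be sanity-checked numerically. For w_n ≍ n^{−γ} we have a_n ≍ n^{1−γ}, so the terms a_n w_n^2 behave like n^{1−3γ}, summable exactly when γ > 2/3. The short sketch below (our own illustration) confirms the power-law behavior of the summands for γ = 0.75, where the exponent is 1 − 3γ = −1.25.

```python
import numpy as np

gamma = 0.75
n = np.arange(1, 1_000_001, dtype=float)
w = n ** -gamma                      # w_n = n^{-gamma}
a = np.cumsum(w)                     # a_n ~ n^{1-gamma} / (1 - gamma)
terms = a * w ** 2                   # a_n w_n^2 ~ n^{1-3*gamma}: summable iff gamma > 2/3
# power-law check: terms[n] * n^{3*gamma-1} should stabilize as n grows
c1 = terms[999] * 1_000.0 ** (3 * gamma - 1)
c2 = terms[999_999] * 1_000_000.0 ** (3 * gamma - 1)
```

For γ = 0.75 the normalized constants c1 and c2 approach 1/(1 − γ) = 4 from below, confirming that the terms decay like a constant times n^{−1.25}.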
Our new developments are partially based on calculations in TMG, and we begin by recording a few of these for future reference. First, write the PR update (1.2) in the X-space as m_i(x) = m_{i−1}(x){1 + w_i h_i(x)}, where

h_i(x) = ∫ p(x|θ) p(X_i|θ) f_{i−1}(θ) dµ(θ) / {m_{i−1}(x) m_{i−1}(X_i)} − 1,

and let R_i(x) be the remainder term of a first-order Taylor approximation of log{1 + w_i h_i(x)}; that is,

log{1 + w_i h_i(x)} = w_i h_i(x) + R_i(x).   (4.1)

Then the KL divergence K_n := K(m, m_n) satisfies

K_n = K_{n−1} − w_n ∫ h_n(x) m(x) dν(x) − ∫ R_n(x) m(x) dν(x),   (4.2)

where R_n(x) satisfies (4.1). Let A_{n−1} be the σ-algebra generated by the data sequence X_1, ..., X_{n−1}. Since K_{n−1} is A_{n−1}-measurable, upon taking conditional expectation with respect to A_{n−1} we get

E(K_n | A_{n−1}) = K_{n−1} − w_n T(F_{n−1}) + w_n^2 E(Z_n | A_{n−1}),   (4.3)

where T(·) and Z_n are defined as

T(Φ) = ∫ D(θ; Φ)^2 dΦ(θ), with D(θ; Φ) = ∫ p(x|θ) {m(x)/m_Φ(x)} dν(x) − 1,   (4.4)
Z_n = −w_n^{−2} ∫ R_n(x) m(x) dν(x).   (4.5)

Note that T(F_{n−1}) is exactly M*_n defined in TMG (p. 2508). The following properties of T(·) will be critical in the proof of our main result.
Some analysis shows that D(θ; Φ) is the negative Gâteaux derivative of K(m, η) at η = m_Φ in the direction of p(·|θ). Now, if T(Φ) = 0, then D(θ; Φ) = 0 for µ-almost all θ and, hence,

∫ p(x|θ) {m(x)/m_Φ(x)} dν(x) = 1 for µ-almost all θ.

The fact that the Gâteaux derivative vanishes in all directions suggests that m_Φ is a point at which the infimum of K(m, ·) over M̄ is attained, and this is exactly the conclusion of Lemma 4.1(a). A similar characterization of the NPMLE is given in Lindsay [15]. In light of Lemma 4.1(a) and (4.3), we see in K_n the makings of an almost supermartingale (2.1). The following bound on Z_n is needed to push through the argument based on Theorem 2.1. Our last preliminary result of this section is needed to bound the convergence rate of K(m, m_n). Define

K*_n := K(m, m_n) − inf {K(m, m_Φ) : Φ ∈ F̄} = K_n − K(m, m_F).
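The stationarity characterization can be made concrete with a small numerical check. The example below is our own illustration, assuming a two-point Poisson mixture: taking D(θ; Φ) = ∫ p(x|θ){m(x)/m_Φ(x)} dν(x) − 1 and T(Φ) = ∫ D(θ; Φ)^2 dΦ(θ), both vanish at the mixing distribution attaining the infimum (here, the truth, since the model is well-specified), while T is positive at other mixing distributions.

```python
import numpy as np
from math import exp, factorial

# Well-specified Poisson mixture: m = 0.3 Pois(1) + 0.7 Pois(4), Theta = {1, 4}.
xs = np.arange(0, 60)                                  # effectively the full support
pois = lambda lam: np.array([exp(-lam) * lam ** k / factorial(k) for k in xs])
P = np.column_stack([pois(1.0), pois(4.0)])            # p(x | theta), one column per theta
m = P @ np.array([0.3, 0.7])                           # true density

def T(phi):
    """T(Phi) = sum_theta D(theta; Phi)^2 Phi(theta) for mixing weights phi."""
    m_phi = P @ phi
    D = (P * (m / m_phi)[:, None]).sum(axis=0) - 1.0   # D(theta; Phi) for each theta
    return float((D ** 2) @ phi)

T_true = T(np.array([0.3, 0.7]))     # ~ 0: the truth is a stationary point
T_wrong = T(np.array([0.8, 0.2]))    # > 0: not stationary
```

Since K_n decreases in expectation by roughly w_n T(F_{n−1}) per step, a strictly positive T away from the KL projection is exactly what drives the estimates toward it.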

Main results
We are now ready to state and prove our main results. Those convergence properties advertised in Section 1 are corollaries to Theorems 4.5 and 4.8 that follow.
Theorem 4.5. Assume A1-A4. Then K*_n → 0 a.s.

Proof. Subtracting K(m, m_F) from both sides of (4.3) gives

E(K*_n | A_{n−1}) = K*_{n−1} + w_n^2 E(Z_n | A_{n−1}) − w_n T(F_{n−1}).   (4.6)

This is of the form (2.1) with β_n ≡ 0, ξ_{n−1} = w_n^2 E(Z_n | A_{n−1}), and ζ_{n−1} = w_n T(F_{n−1}). Therefore, from Theorem 2.1 we get K*_n → K*_∞ ≥ 0 a.s. and

Σ_n w_n T(F_{n−1}) < ∞ a.s.   (4.7)

It remains to show that K*_∞ = 0 a.s. Suppose, on the contrary, that K*_∞ > 0 with positive probability. Then there exists ε > 0 such that K*_{n−1} ≥ ε for all but perhaps finitely many n. Recall the proof of Lemma 3.1, which shows that the mapping κ(Φ) = K(m, m_Φ) is lower semi-continuous with respect to the weak topology on F̄. Consequently, F_ε := {Φ ∈ F̄ : κ(Φ) > K(m, m_F) + ε/2} is a weakly open set, and its closure F̄_ε is compact by A1. Since F_n ∈ F_ε for all but finitely many n, Lemma 4.1 implies that T(F_{n−1}) is bounded away from zero. But this and A3 together contradict (4.7). Therefore, K*_∞ = 0 a.s.

Next we show that K*_n → 0 implies ‖m_n − m_F‖ → 0, where ‖·‖ denotes the L_1(ν) norm.

Corollary 4.6. Under A1-A4, ‖m_n − m_F‖ → 0 a.s.

Proof. Suppose not; that is, there is an ε > 0 and a subsequence {n_s} such that ‖m_{n_s} − m_F‖ > ε for all s. Then, by A1, this sequence has a further subsequence {n_{s(t)}} such that F_{n_{s(t)}} converges weakly to some F_∞ ∈ F̄. By A2, m_{n_{s(t)}} converges to m_∞ := m_{F_∞} pointwise and, by Scheffé's theorem, in L_1(ν). Define u_t = m_{n_{s(t)}}/m_∞ − 1. Then u_t → 0 pointwise and {u_t} is uniformly integrable with respect to m dν by A4 and Jensen's inequality. It follows that K(m, m_{n_{s(t)}}) → K(m, m_∞); since K*_n → 0, m_∞ attains the infimum of K(m, ·) over M̄ and, by uniqueness of the KL projection (Remark 3.2), m_∞ = m_F. But then ‖m_{n_{s(t)}} − m_F‖ → 0, contradicting ‖m_{n_s} − m_F‖ > ε.

Corollary 4.6 suggests that F_n converges to some F ∈ F̄ at which the infimum of K(m, ·) over M̄ is attained. However, to conclude weak convergence of F_n from L_1 convergence of m_n, we need two additional conditions:

A5. Identifiability: m_Φ = m_Ψ ν-a.e. implies Φ = Ψ; cf. Remark 3.2.
A6. For any ε > 0 and any compact X′ ⊂ X, there exists a compact Θ′ ⊂ Θ such that ∫_{X′} p(x|θ) dν(x) < ε for all θ ∉ Θ′.
With conditions A5-A6 and Theorem 3 of TMG, the next result follows immediately from Corollary 4.6.

Corollary 4.7. Under A1-A6, F_n converges weakly a.s. to the unique mixing distribution corresponding to the KL projection m_F.

A slight modification of Theorem 4.5 produces a bound on the rate of convergence. But one extra assumption is needed to push through the proof, namely, that the KL minimizer F is dominated by µ. The precise result is next.
Theorem 4.8. In addition to A1-A4, assume a KL minimizer F exists in the interior of F. Then a_n K*_n → 0 a.s.

Proof. Multiply through (4.6) by a_n and use a_n = a_{n−1} + w_n to get

E(a_n K*_n | A_{n−1}) = a_n K*_{n−1} + a_n w_n^2 E(Z_n | A_{n−1}) − a_n w_n T(F_{n−1})
= a_{n−1} K*_{n−1} + w_n K*_{n−1} + a_n w_n^2 E(Z_n | A_{n−1}) − a_n w_n T(F_{n−1}).

This last line is also of the almost supermartingale form (2.1), with β_n ≡ 0, ζ_{n−1} = a_n w_n T(F_{n−1}), and ξ_{n−1} = a_n w_n^2 E(Z_n | A_{n−1}) + w_n K*_{n−1}. Since Σ_n ξ_n < ∞ a.s. by Lemmas 4.3-4.4, it follows from Theorem 2.1 that a_n K*_n converges a.s. and Σ_n a_n w_n T(F_{n−1}) < ∞ a.s. To show that the limit is zero a.s., proceed by contradiction as in the proof of Theorem 4.5.
Remark 4.9. The extra condition that the KL minimizer F sits inside F can be viewed as an assumption about the quality of the model. That is, F should be inside F unless the mixture model M is "too bad." This notion of model quality is not yet fully understood, so sufficient conditions are currently not available. However, an example of a "bad" model is one where m(x) = p(x|θ) for some θ, which amounts to a mis-specification of the dominating measure µ. We suspect that the conclusion of Theorem 4.8 holds without this extra condition (see Section 4.3), but our proof hinges on Lemma 4.4, which we are unable to prove unless the KL minimizer F has a density f with respect to µ.
In the mis-specified case, even though K*_n → 0 implies ‖m_n − m_F‖ → 0, an L_1(ν) rate does not easily follow without extra assumptions, such as a.e. boundedness of m_F/m. But a Hellinger contrast rate is a direct consequence of Theorem 4.8. In the well-specified case, when m = m_F, the Hellinger contrast reduces to the usual Hellinger distance, so our convergence rate results are comparable to those of, say, Genovese and Wasserman [6].

Corollary 4.10. Assume the conditions of Theorem 4.8. Then: (a) √a_n ρ(m_n, m_F) → 0 a.s.; and (b) if, in addition, m_F/m is essentially bounded with respect to m dν, then √a_n ‖m_n − m_F‖ → 0 a.s.

Proof. Part (a) follows by arguments similar to those in Patilea [20]. Indeed, ρ(m_n, m_F)^2 is bounded by a constant multiple of K*_n, so the claim follows from Theorem 4.8. For part (b), let Γ_n = ∫ (m m_n/m_F) dν and q_n = m m_n/(m_F Γ_n), a probability density with respect to ν, and notice that K*_n ≥ K(m, q_n); see Barron [1, Theorem 3]. Then Pinsker's inequality [9, Theorem 6.1] gives ‖m − q_n‖^2 ≤ 2 K(m, q_n) ≤ 2 K*_n.

The triangle inequality applied to the decomposition m m_n/m_F − m = Γ_n (q_n − m) + (Γ_n − 1) m gives

‖m m_n/m_F − m‖ ≤ Γ_n ‖q_n − m‖ + (1 − Γ_n),

and hence, since Γ_n ≤ 1,

‖m m_n/m_F − m‖ ≤ (2 K*_n)^{1/2} + (1 − Γ_n).

The right-hand side is related to the L_1(ν) error via Hölder's inequality:

‖m_n − m_F‖ = ∫ |m m_n/m_F − m| (m_F/m) dν ≤ C ‖m m_n/m_F − m‖,

where C := ‖m_F/m‖_{L_∞(m dν)} is finite by assumption. Therefore, ‖m_n − m_F‖ ≤ C {(2 K*_n)^{1/2} + (1 − Γ_n)}, and part (b) follows from Theorem 4.8 and the fact that (1 − Γ_n) ≤ K*_n.

Numerical illustrations
We present a brief simulation study to highlight an example where our bound on the rate appears sharp. Let p(·|θ) be a N(θ, 0.1^2) density, with Θ = [0, 1] and µ the Lebesgue measure on Θ. We simulate data X_1, ..., X_n from N(0.5, 0.1^2), which equals the mixture m_F where F ∈ F is the point mass at 0.5. We consider weight sequences of the form w_i = (i+1)^{−γ} for γ ∈ {0.5, 0.6, 0.67, 0.7, 0.75, 0.8, 0.9, 1.0}. For each choice of γ, the KL divergence K_n = K(m, m_n) is computed for 500 simulated data sets of size n, for each n ∈ {10^3, 10^4, 10^5, 10^6}. In order to estimate the empirical rate of convergence, set L_n = −log_10 K_n and consider the following linear models:

Model 1: E(L_n) = β_0 + β_1 log_10 n   (4.8)
Model 2: E(L_n) = β_0 + β_1 log_10 n + β_2 log_10 log_10 n   (4.9)

Figure 1 shows K_n, averaged over the 500 replications, against n for each choice of γ. Table 1 shows the estimated coefficients of the linear models. In either model we would expect β_1 to be close to 1 − γ had our upper bound been sharp. This indeed appears to be the case, particularly for γ not too close to 1. Also, the two estimates of β_1 sandwich 1 − γ, perhaps indicating that the actual rate is n^{−(1−γ)} modulo a factor of log n.
We should point out that this example does not exactly satisfy the assumptions of Theorem 4.8. Indeed, the KL minimizer F = δ_{0.5} lies on the boundary of F and not in its interior. However, we interpret this as the limit of examples where the optimal F gets increasingly close to the boundary, and this limit-based argument points to a minimax-type sharpness of the derived bounds. On the other hand, in examples where F is in the interior of F, simulation studies have shown convergence rates faster than n^{−(1−γ)}. For example, when F is a Unif(Θ) distribution and γ = 0.75, a fit of Model 1 gives β̂_1 = 0.73, and a fit of Model 2 gives β̂_1 = 0.53 and β̂_2 = 1.98. These empirical results suggest that a nearly parametric rate of convergence, like (log n)^k/√n, may be attainable in some cases.
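The simulation setup described above can be reproduced in outline as follows. The grid discretization of Θ and quadrature evaluation of K_n below are our own implementation choices, and the sample sizes are scaled down for speed.

```python
import numpy as np

def kl_after_pr(n, gamma, rng, grid_size=200):
    """Run PR on n draws from N(0.5, 0.1^2) with a N(theta, 0.1^2) kernel on
    Theta = [0, 1]; return K_n = K(m, m_n), computed by quadrature over x."""
    sd = 0.1
    theta = np.linspace(0, 1, grid_size)
    dth = theta[1] - theta[0]
    norm = lambda x, mu: np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    f = np.ones(grid_size) / (grid_size * dth)       # uniform initial mixing density
    for i, x in enumerate(rng.normal(0.5, sd, n)):
        like = norm(x, theta)
        w = (i + 2.0) ** -gamma                      # w_i = (i+1)^{-gamma}
        f = (1 - w) * f + w * like * f / (np.sum(like * f) * dth)
    xg = np.linspace(0, 1, 500)                      # x-grid for evaluating K(m, m_n)
    m_true = norm(xg, 0.5)
    m_n = np.array([np.sum(norm(x, theta) * f) * dth for x in xg])
    return float(np.sum(m_true * np.log(m_true / m_n)) * (xg[1] - xg[0]))

rng = np.random.default_rng(42)
K_small, K_large = kl_after_pr(100, 0.75, rng), kl_after_pr(5000, 0.75, rng)
# K_n shrinks as n grows; across many replications and larger n, the decay is
# roughly like n^{-(1-gamma)}, consistent with the empirical rates reported above
```

Repeating this over a grid of γ values and sample sizes, and regressing −log_10 K_n on log_10 n, reproduces the kind of slope estimates reported in Table 1.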

Table 1
Estimated coefficients for Models 1 and 2 (4.8-4.9). Standard errors for β̂_1 are in (1, 2) × 10^{−3} and (1, 2) × 10^{−2} for Models 1 and 2, respectively, and standard errors for β̂_2 are in (1, 2) × 10^{−1}. Our upper bound suggests β_1 should be close to 1 − γ. It is larger than 1 − γ for Model 1 and smaller for Model 2 with the addition of a log log n term. This indicates that the actual rate is unlikely to be faster than n^{−(1−γ)} by a power of n.

A generalized PR algorithm
To satisfy the conditions of Theorem 4.8, one typically needs the mixing parameter space Θ to be compact. Also, this Θ must be known in practice since computation requires integration over Θ in (1.2). These requirements can be somewhat restrictive, particularly when there is no natural choice of Θ. A potential solution is to use a mixture sieve, which allows the support of the estimated mixing distribution to grow with the sample size. The motivation for this dynamic choice of support is that, eventually, the support will be large enough that the class of all mixtures over that support is sufficiently rich.
Borrowing on this idea, we propose a sieve-like extension of the PR algorithm which, instead of requiring Θ to be fixed and known, incorporates a sequence of compact mixing parameter spaces that increases with n. Let Θ denote the mixing parameter space, which may or may not be compact. For example, if the model is a Gaussian location-scale mixture, then Θ = R×R + . The generalized PR algorithm, in terms of densities, is as follows.
Algorithm GPR. Choose an increasing sequence of compact sets Θ_n such that Θ_n ↑ Θ, a bounded, strictly positive, µ-measurable function g(θ), and a sequence c_n ≥ 0 that satisfies Σ_n log(1 + c_n) < ∞. Define

g_n(θ) = g(θ) 1{θ ∈ Θ_n \ Θ_{n−1}} / d_n,

where d_n = ∫_{Θ_n \ Θ_{n−1}} g dµ is the normalizing constant. Start with an initial estimate f_0 on Θ_0 and, for n ≥ 1, define

f̃_{n−1}(θ) = {f_{n−1}(θ) + c_n g_n(θ)} / (1 + c_n), θ ∈ Θ_n,

and then

f_n(θ) = (1 − w_n) f̃_{n−1}(θ) + w_n p(X_n|θ) f̃_{n−1}(θ) / m̃_{n−1}(X_n), where m̃_{n−1}(x) = ∫_{Θ_n} p(x|θ) f̃_{n−1}(θ) dµ(θ).

As in (1.3), define m_n := m_{f_n} as the final estimate of m.
Our motivation for using g n is that if Θ n is small compared to the unknown Θ, the estimate should be padded near the boundary of Θ n to compensate for the possibly heavy tails of f assigning non-negligible mass to Θ \ Θ n . The GPR algorithm requires specification of two parameters, namely, the weight sequence c n and the function g. Simple choices are c n = w 2 n and g(θ) ≡ 1, but the practical performance of GPR for these or other choices has yet to be studied.
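Here is a minimal sketch of GPR on a growing grid, taking g ≡ 1 and c_n = w_n^2 as suggested above. The padding step reflects our reading of the algorithm, and the growth schedule for the grids is an arbitrary illustrative choice, not one prescribed by the theory.

```python
import numpy as np

def gpr(x_data, grids, kernel, gamma=0.75):
    """Sketch of generalized PR: the support grid Theta_n may grow with n;
    newly added support is padded with mass c_n g_n (here g = 1, c_n = w_n^2)."""
    f, grid = None, None
    for i, x in enumerate(x_data):
        new_grid = grids(i)                           # discrete version of Theta_n
        dth = new_grid[1] - new_grid[0]
        if f is None:
            f = np.ones_like(new_grid)                # initial estimate on Theta_0
        elif len(new_grid) > len(grid):
            c = (i + 2.0) ** (-2 * gamma)             # c_n = w_n^2
            pad = np.zeros_like(new_grid)
            pad[len(grid):] = 1.0                     # g restricted to Theta_n \ Theta_{n-1}
            pad /= pad.sum() * dth                    # ... normalized by d_n
            f = np.concatenate([f, np.zeros(len(new_grid) - len(grid))])
            f = (f + c * pad) / (1 + c)               # padding step
        grid = new_grid
        f /= f.sum() * dth                            # keep f a density on Theta_n
        like = kernel(x, grid)
        w = (i + 2.0) ** -gamma
        f = (1 - w) * f + w * like * f / (np.sum(like * f) * dth)   # PR update
    return grid, f

# True mixing location 1.2 lies outside the initial support [0, 1] but is
# eventually covered as the grids grow toward [0, 2].
rng = np.random.default_rng(7)
gauss = lambda x, t: np.exp(-0.5 * ((x - t) / 0.1) ** 2) / (0.1 * np.sqrt(2 * np.pi))
grids = lambda i: np.arange(0.0, 1.0 + min(i / 500.0, 1.0) + 1e-9, 0.01)
grid, f = gpr(rng.normal(1.2, 0.1, 3000), grids, gauss)
```

In this toy run the estimated mixing density ends up concentrated near θ = 1.2, illustrating how the padding lets mass migrate into support that was unavailable at the start.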
We now consider convergence of the GPR estimate m_n. The primary obstacle in extending the results in Section 4 is that the support of the mixing distributions now changes with n; this makes the comparisons of the mixing densities in the proof of Lemma 4.4 more difficult. Here we will consider only the case where m = m_f for some mixing density f, but both f and its support Θ_f ⊂ Θ are unknown. Theorem 5.1 below establishes a bound on the rate of convergence in the case where Θ_f is a compact subset of Θ. Note that, while we are restricting ourselves to the compact case, we do not assume Θ_f is known.
Recall condition A4 in Section 4, requiring that the likelihood ratio be square integrable uniformly over Θ. Here we have a sequence Θ_n, and we require a bound similar to that of A4 for each Θ_n. To this end, define the sequence

B_n := ∫ sup_{θ ∈ Θ_n} {p(x|θ)/m_f(x)}^2 m_f(x) dν(x).

Since Θ_n ⊂ Θ_{n+1}, the sequence B_n is clearly increasing; if we are to push the proof of Theorem 4.8 through in this more general situation, we will need to control how fast B_n increases.
Theorem 5.1. Assume that Θ f ⊂ Θ is compact and that conditions A2-A3 hold. Furthermore, assume that n a n w 2 n B n < ∞. Then the GPR estimates m n satisfy a n K(m f , m n ) → 0 a.s.
The proof is essentially the same as that of Theorem 4.8; we simply need to check that Lemmas 4.3 and 4.4 continue to hold in this new context. The details are provided in the Appendix.

Remark 5.2. The growth rate condition Σ_n a_n w_n^2 B_n < ∞ in Theorem 5.1 clearly holds provided that {w_n} and {B_n} satisfy w_n ≍ (n^α log n)^{−1} for some α ∈ [2/3, 1], and B_n = O(log n).
Example 5.3. Let p(x|θ) = (2π)^{−1/2} exp{−(x − θ)^2/2} be a N(θ, 1) density (with respect to Lebesgue measure), and let Θ_n = [−t_n, t_n]. If we choose t_n ≍ {c + (1/12) log log n}^{1/2}, for some constant c > 0, then B_n = O(log n). Taking w_n as in Remark 5.2 satisfies the conditions of the theorem.
Example 5.4. Suppose that p(x|θ) = e^{−θ} θ^x / x! is a Poisson density (with respect to counting measure), and take Θ_n = [α_n, β_n], where α_n and β_n are to be determined. If β_n ≍ (c + log log n)^{1/5}, for some constant c > 0 making Θ_0 suitably large, and α_n = β_n^{−1}, then it is easy to check that B_n = O(log n). Therefore, the conditions of Theorem 5.1 are satisfied with w_n as in Remark 5.2.

Discussion
PR is an exciting stochastic algorithm for mixture models that is quite different from EM or MCMC in its focus and structure. While MCMC and EM focus exclusively on the mixture distribution, PR brings the mixing density to the fore.
Structurally, PR is not a hill-climbing algorithm like EM or MCMC; rather, it draws inspiration from the recursive aggregation idea of stochastic approximation (SA). Martin and Ghosh [16] have established a precise connection between PR and SA for finite-dimensional F, and this connection is likely to extend to the infinite-dimensional case as well. Interestingly, very little is known about convergence properties of general infinite-dimensional SA algorithms, even though their finite-dimensional counterparts are well understood. In this regard, our convergence results here can make a major contribution to the study of SA in general.
In this paper we have extended our recent work on asymptotic analysis of PR in two key directions. First, we have shown that PR is robust to mixture model mis-specification, in the sense that the PR estimate m n converges to the mixture m F which is closest in KL divergence to the true density m. This property is important since, typically, m itself is not a mixture, but is closely approximated by one. Second, we have established a bound on the PR rate of convergence, which makes a direct connection between the choice of weight sequence w n and the performance of the PR estimates. We suspect that the extra condition needed for the rate in Theorem 4.8-namely, that a KL minimizer F sit inside F-can be relaxed, but more work is needed. Simulation results presented in Section 4.3 reveal two interesting observations: (i) the bound on the rate is of a minimax nature, and (ii) the best (minimax) rates, when w n ≍ n −γ , are achieved for γ near 0.5. Further investigations are needed to better understand in what sense the rate is minimax, to extend Theorem 4.8 to the case when γ ≈ 0.5, and to characterize the rate in a typical "non-minimax" problem.
We have also proposed a practical extension of the PR algorithm which does not require that the mixing distribution have a known compact support. Instead, GPR uses an increasing, sieve-like sequence of compact supports. Sufficient conditions are given, which essentially control the growth rate of the sieves, that guarantee consistency of the estimated mixture and bound the rate of convergence. Here, however, the growth of the sieve space is rather slow (see Examples 5.3-5.4), so the advantages of a dynamic support over a large fixed support may be difficult to see in finite samples. We suspect that the extension of Theorem 4.8 to handle γ near 0.5 can be used here to relax the growth rate conditions and allow the sieve space to grow more rapidly.
where convergence follows from the above three properties and Theorem 1 of Pratt [21]. Therefore, T (Φ s ) → T (Φ) and since the sequence {Φ s } was arbitrary, T must be continuous.
Proof of Lemma 4.4. Since F ∈ F, the KL divergence of F_n from F is well-defined. Following the argument of TMG (p. 2505), it is possible to write a recursion for K(F, F_n) as we did in (4.2). In particular,

E{K(F, F_n) | A_{n−1}} = K(F, F_{n−1}) − w_n D(F_{n−1}) + w_n^2 E(Y_n | A_{n−1}),

where Y_n, like Z_n, is uniformly bounded by B + 1, and the functional D(·) is given by

D(Φ) = ∫ D(θ; Φ) dF(θ).

It follows from Jensen's inequality that

D(Φ) ≥ K(m, m_Φ) − K(m, m_F),   (A.4)

with equality iff m_Φ = m_F. This is an almost supermartingale, so we conclude from Theorem 2.1 that K(F, F_n) converges a.s. and Σ_n w_n D(F_{n−1}) < ∞ a.s. But in light of (A.4), we have Σ_n w_n K*_{n−1} < ∞, the desired result.

Proof of Theorem 5.1. Since B_n is increasing, the condition Σ_n a_n w_n^2 B_n < ∞ implies that the conclusion of Lemma 4.3 holds, and the remainder term Z_n in (4.5) satisfies Σ_n w_n^2 E(Z_n) < ∞. The key observation is that, since Θ_f is compact, there is a number N = N(f) such that Θ_f ⊂ Θ_n for all n ≥ N. Consequently, there are only N iterates f_0, ..., f_{N−1} for which K(f, f_i) = ∞. Without loss of generality, we can shift the "time scale" so that N = 0. Now apply the same expansion as in the proof of Lemma 4.4 to get

E{K(f, f_n) | A_{n−1}} − K(f, f_{n−1}) = −E{ ∫ f log(f_n/f_{n−1}) dµ | A_{n−1} } ≤ −w_n D(f_{n−1}) + w_n^2 E(Y_n | A_{n−1}) + log(1 + c_n).

This is of the almost supermartingale form (2.1), and both Σ_n w_n^2 E(Y_n | A_{n−1}) and Σ_n log(1 + c_n) are finite. Therefore, Theorem 2.1 implies K(f, f_n) converges, and Σ_n w_n D(f_{n−1}) < ∞. From this and (A.4) we can conclude that Σ_n w_n K(m_f, m_n) < ∞ a.s. To finish the proof, simply throw away the first N iterations and apply the argument used to prove Theorem 4.8.