Catoni-style confidence sequences for heavy-tailed mean estimation

A confidence sequence (CS) is a sequence of confidence intervals that is valid at arbitrary data-dependent stopping times. These are useful in applications like A/B testing, multi-armed bandits, off-policy evaluation, election auditing, etc. We present three approaches to constructing a confidence sequence for the population mean, under the minimal assumption that only an upper bound $\sigma^2$ on the variance is known. While previous works rely on light-tail assumptions like boundedness or subGaussianity (under which all moments of a distribution exist), the confidence sequences in our work are able to handle data from a wide range of heavy-tailed distributions. The best among our three methods -- the Catoni-style confidence sequence -- performs remarkably well in practice, essentially matching the state-of-the-art methods for $\sigma^2$-subGaussian data, and provably attains the $\sqrt{\log \log t/t}$ lower bound due to the law of the iterated logarithm. Our findings have important implications for sequential experimentation with unbounded observations, since the $\sigma^2$-bounded-variance assumption is more realistic and easier to verify than $\sigma^2$-subGaussianity (which implies the former). We also extend our methods to data with infinite variance, but having $p$-th central moment ($1 < p < 2$) bounded by a known constant.


Introduction
We consider the classical problem of sequential nonparametric mean estimation. As a motivating example, let P be a distribution on $\mathbb{R}$ from which a stream of i.i.d. samples $X_1, X_2, \ldots$ is drawn. The mean $\mu$ of the distribution is unknown and is our estimand. The traditional and most commonly studied approaches to this problem include, among others, the construction of confidence intervals (CIs). That is, we construct a $\sigma(X_1, \ldots, X_t)$-measurable random interval $\mathrm{CI}_t$ for each $t \in \mathbb{N}^+$ such that

$\mathbb{P}[\mu \in \mathrm{CI}_t] \geq 1 - \alpha \quad \text{for each fixed } t \in \mathbb{N}^+. \tag{2}$

It is, however, also well known that confidence intervals suffer from numerous deficiencies. For example, random stopping rules frequently arise in sequential testing problems, and it is well known that confidence intervals (2) typically fail to satisfy the guarantee

$\mathbb{P}[\mu \in \mathrm{CI}_\tau] \geq 1 - \alpha \quad \text{for every stopping time } \tau. \tag{3}$

In other words, traditional confidence intervals are invalid and may undercover at stopping times. To remedy this, a switch of order between the universal quantification over t and the probability bound in the definition of a CI (2) was introduced [Darling and Robbins, 1967]:

$\mathbb{P}[\forall t \in \mathbb{N}^+,\ \mu \in \mathrm{CI}_t] \geq 1 - \alpha. \tag{4}$

The random intervals $\{\mathrm{CI}_t\}$ that satisfy the property above are called a (1 − α)-confidence sequence (CS). The definition of a CS (4) and the property of stopped coverage (3) are actually equivalent, due to Howard et al. [2021, Lemma 3].
It is known that CSs do not suffer from the perils of applying CIs in sequential settings (e.g., continuous monitoring or peeking at CIs as they arrive). For example, Howard et al. [2021, Figure 1(b)] show in a similar context that the cumulative type-I error grows without bound if a traditional confidence interval is continuously monitored, whereas a confidence sequence keeps that error bounded by α; also see Johari et al. [2017] for a similar phenomenon stated in terms of p-values.
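To make this phenomenon concrete, the following toy simulation (our own illustration, not the experiment of Howard et al.; all parameter choices are illustrative) tracks how often a fixed-time 95% Gaussian CI has missed the true mean at least once by time t:

```python
import numpy as np

# Minimal illustration: continuously monitoring a fixed-time CI inflates the
# cumulative type-I error, even though the marginal error at any single t is alpha.
rng = np.random.default_rng(0)
alpha, sigma, T, runs = 0.05, 1.0, 800, 2000
z = 1.959964  # standard normal quantile for a two-sided 95% CI

ever_missed = 0
for _ in range(runs):
    x = rng.normal(0.0, sigma, T)          # true mean is 0
    t = np.arange(1, T + 1)
    mu_hat = np.cumsum(x) / t              # running empirical mean
    half_width = z * sigma / np.sqrt(t)    # fixed-t Gaussian CI half-width
    missed = np.abs(mu_hat) > half_width   # CI_t fails to cover 0
    ever_missed += missed.any()

print(f"P[some CI_t misses, t <= {T}] ~ {ever_missed / runs:.3f}  (vs alpha = {alpha})")
# Typically prints a frequency several times alpha, echoing Howard et al. [2021, Fig. 1(b)].
```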
Prior studies on constructing confidence sequences for the mean µ hinge on certain stringent assumptions on P. Darling and Robbins [1967] considered exclusively the case where P is a normal distribution. Jennison and Turnbull [1989] made the same parametric assumption. Later authors, including Lai [1976], Csenki [1979], and recently Johari et al. [2021] (who notably defined "always-valid p-values"), allowed P to be a distribution belonging to a fixed exponential family. More recently, Howard et al. [2020, 2021] performed a systematic study of nonparametric confidence sequences, whose assumptions on P ranged among subGaussian, sub-Bernoulli, subgamma, sub-Poisson, and sub-exponential; most of the cases considered involve a bounded moment generating function and, in particular, require that all moments exist. The latest advance in CSs was the paper by Waudby-Smith and Ramdas [2023], which studied the case of bounded P, largely because of its "betting" set-up, which of course implies all moments exist. Finally, the prior result closest to our setting is a recent study on heavy-tailed bandits by Agrawal et al. [2021, Proposition 5], whose implicit CS is based on empirical likelihood techniques; but it demands a nontrivial and costly optimization computation, and its code is currently not publicly available.
In this paper, we remove all the parametric and tail-lightness assumptions of the existing literature mentioned above, and instead make only one simple assumption (Assumption 2 in Section 2): the variance of the distribution exists and is upper bounded by a constant σ² known a priori,

$\operatorname{Var}[X] \leq \sigma^2 \quad \text{for } X \sim P. \tag{5}$

Further, we shall show that, even under this simple assumption, which allows for a copious family of heavy-tailed distributions (whose third moment may be infinite), the (1 − α)-CS $\{\mathrm{CI}_t\}$ we shall present achieves remarkable width control. We characterize the tightness of a confidence sequence from two perspectives: first, the rate of shrinkage, that is, how quickly $|\mathrm{CI}_t|$, the width of the interval $\mathrm{CI}_t$, decreases as t → ∞; second, the rate of growth, that is, how quickly $|\mathrm{CI}_t|$ increases as α → 0. It is useful to review here how the previous CIs and CSs in the literature behave in these regards. Chebyshev's inequality, which yields (2) when (5) is required, states that

$\mathrm{CI}^{\mathrm{Cheb}}_t = \left[\hat\mu_t \pm \frac{\sigma}{\sqrt{\alpha t}}\right], \quad \text{where } \hat\mu_t = \frac{1}{t}\sum_{i=1}^t X_i, \tag{6}$

forms a (1 − α)-CI at every t, which has shrinkage rate $O(t^{-1/2})$ and growth rate $O(\alpha^{-1/2})$.
Strengthening the assumption from (5) to subGaussianity with variance factor σ² [Boucheron et al., 2013, Section 2.3], the Chernoff bound ensures that (1 − α)-CIs can be constructed by

$\mathrm{CI}^{\mathrm{Chern}}_t = \left[\hat\mu_t \pm \sigma\sqrt{\frac{2\log(2/\alpha)}{t}}\right]; \tag{7}$

i.e. the stronger subGaussianity assumption leads to a sharper growth rate of $O(\sqrt{\log(1/\alpha)})$. It is Catoni [2012, Proposition 2.4] who shows the striking fact that, by discarding the empirical mean $\hat\mu_t$ and instead using an influence function to stabilize the outliers associated with heavy-tailed distributions, an $O(\sqrt{\log(1/\alpha)})$ growth rate can be achieved even when only the variance is bounded as in (5); similar results can be found in the recent survey by Lugosi and Mendelson [2019].
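The contrast between the two growth rates is easy to see numerically; the snippet below (a simple illustration of ours) tabulates the α-dependent factors $\alpha^{-1/2}$ versus $\sqrt{\log(1/\alpha)}$:

```python
import numpy as np

# Width growth in alpha: Chebyshev pays alpha^(-1/2), while Chernoff/Catoni
# pay only sqrt(log(1/alpha)) -- exponentially better as alpha -> 0.
for a in [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]:
    print(f"alpha={a:.0e}:  alpha^(-1/2) = {a ** -0.5:9.1f}   "
          f"sqrt(log(1/alpha)) = {np.sqrt(np.log(1 / a)):5.2f}")
```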
In the realm of confidence sequences, we see that recent results by Howard et al. [2021] and Waudby-Smith and Ramdas [2023], while often requiring stringent Chernoff-type assumptions on the distribution, all have $O(\sqrt{\log t/t})$ shrinkage rates and $O(\sqrt{\log(1/\alpha)})$ growth rates. For example, Robbins' famous two-sided normal mixture confidence sequence for subGaussian P with variance factor σ² (see e.g. Howard et al. [2021, Equation (3.7)]) is of the form

$\mathrm{CI}^{\mathrm{NM}}_t = \left[\hat\mu_t \pm \sqrt{\frac{2(\rho^2\sigma^2 t + 1)}{t^2\rho^2}\log\frac{\sqrt{\rho^2\sigma^2 t + 1}}{\alpha}}\right]$

for a mixing parameter ρ > 0. The best among the three confidence sequences in this paper (Theorem 9) draws direct inspiration from Catoni [2012], and achieves a provable shrinkage rate of $\tilde{O}(t^{-1/2})$, where the $\tilde{O}$ hides polylog t factors, and a growth rate of $O(\sqrt{\log(1/\alpha)})$. A fine-tuning of it leads to the exact shrinkage rate $O(\sqrt{\log\log t/t})$, matching the lower bound of the law of the iterated logarithm under precisely the same assumption (5). The significance of this result, in conjunction with Howard et al. [2021], is that in moving from one-time valid interval estimation to anytime-valid interval estimation (confidence sequences), no significant excess width need be incurred; nor does weakening the distributional assumption from sub-exponential to finite variance result in any loss of interval tightness, for CIs and CSs alike. Our experiments demonstrate that published subGaussian CSs are extremely similar to our finite-variance CSs, though the former assumption is harder to check and less likely to hold (all moments may not exist for unbounded data in practice). We summarize and compare the mentioned works in terms of tightness in Table 1.

Problem set-up and notations
Let $\{X_t\}_{t\in\mathbb{N}^+}$ be a real-valued stochastic process adapted to the filtration $\{\mathcal{F}_t\}_{t\in\mathbb{N}_0}$, where $\mathcal{F}_0$ is the trivial σ-algebra. We make the following assumptions.
Assumption 1. The process has a constant, unknown conditional expected value:

$\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = \mu \quad \text{for all } t \in \mathbb{N}^+.$

Assumption 2. The process is conditionally square-integrable with a uniform upper bound, known a priori, on the conditional variance:

$\mathbb{E}[(X_t - \mu)^2 \mid \mathcal{F}_{t-1}] \leq \sigma^2 \quad \text{for all } t \in \mathbb{N}^+.$

The task of this paper is to construct confidence sequences $\{\mathrm{CI}_t\}$ for µ from the observations $X_1, X_2, \ldots$, that is, $\sigma(X_1, \ldots, X_t)$-measurable intervals satisfying $\mathbb{P}[\forall t \in \mathbb{N}^+,\ \mu \in \mathrm{CI}_t] \geq 1 - \alpha$.

Table 1: Comparison of asymptotic tightnesses among prominent confidence intervals and confidence sequences. Here "(EM)" indicates that the corresponding CI or CS is constructed around the empirical mean; "w.h.p." stands for "with high probability" (used when the interval widths are not deterministic). The "Markov inequality" bounds in the last cell of the "CI" column can be derived from, e.g., the martingale $L^p$ bound, Lemma 7 in Wang et al. [2021], Appendix A.
We remark that our assumptions, apart from incorporating the i.i.d. case (with $\mathbb{E}[X_t] = \mu$ and $\operatorname{Var}[X_t] \leq \sigma^2$) mentioned in Section 1, allow for a wide range of settings. Assumption 1 is equivalent to stating that the sequence $\{X_t - \mu\}$ forms a martingale difference sequence (viz. $\{\sum_{i=1}^t (X_i - \mu)\}$ is a martingale), which oftentimes arises as the model for non-i.i.d., state-dependent noise in the optimization, control, and finance literature (see e.g. Kushner and Yin [2003]). A very simple example would be the drift estimation setting with the stochastic differential equation $\mathrm{d}G_t = \sigma f(G_t, t)\,\mathrm{d}W_t + \mu\,\mathrm{d}t$, where f is a function such that $|f(G_t, t)| \leq 1$ and $W_t$ denotes the standard Wiener process. When sampling $X_t = G_t - G_{t-1}$, the resulting process $\{X_t\}$ satisfies our Assumption 1 and Assumption 2.
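A quick numerical check of this example is below (our own sketch; the choice f(g, t) = tanh(g), the unit sampling grid, and the Euler-Maruyama step size are all illustrative assumptions, not from the original text):

```python
import numpy as np

# Euler-Maruyama sketch of dG = sigma * f(G, t) dW + mu dt with |f| <= 1,
# sampled on the unit time grid so that X_t = G_t - G_{t-1}.
rng = np.random.default_rng(1)
mu, sigma, T, substeps = 0.7, 2.0, 2000, 100
dt = 1.0 / substeps

G, X = 0.0, []
for t in range(T):
    G_prev = G
    for _ in range(substeps):
        G += sigma * np.tanh(G) * np.sqrt(dt) * rng.normal() + mu * dt
    X.append(G - G_prev)

X = np.asarray(X)
# Assumption 1: E[X_t | F_{t-1}] = mu;  Assumption 2: Var[X_t | F_{t-1}] <= sigma^2.
print(f"sample mean of X_t: {X.mean():.3f}  (mu = {mu})")
print(f"sample var  of X_t: {X.var():.3f}  (bound sigma^2 = {sigma ** 2})")
```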
We further note that Assumption 1 and Assumption 2 can be weakened to drifting conditional means and a growing conditional p-th central moment bound (1 < p ⩽ 2), respectively, indicating that our framework may encompass any $L^p$ stochastic process $\{X_t\}$. These issues are addressed in Section 9 and Section 10.2; we follow Assumption 1 and Assumption 2 in our exposition for the sake of simplicity. Finally, we remark that the requirement of a known moment upper bound like Assumption 2 may seem restrictive, but it is known to be minimal, in the sense that no inference on µ would be possible in its absence, as we discuss in Section 10.1.
Throughout the paper, an auxiliary process $\{\lambda_t\}_{t\in\mathbb{N}^+}$ consisting of predictable coefficients (i.e. each $\lambda_t$ is an $\mathcal{F}_{t-1}$-measurable random variable) is used to fine-tune the intervals. We denote by [m ± w] the open or closed (oftentimes the endpoints do not matter) interval [m − w, m + w] or (m − w, m + w), to simplify the lengthy expressions for CIs and CSs; and by min(I), max(I) respectively the lower and upper endpoints of an interval I. The asymptotic notations follow conventional use: for two sequences of nonnegative numbers $\{a_t\}$ and $\{b_t\}$, we write $a_t = O(b_t)$ if $\limsup_{t\to\infty} a_t/b_t < \infty$, and $a_t \asymp b_t$ if $\lim_{t\to\infty} a_t/b_t$ exists and $0 < \lim_{t\to\infty} a_t/b_t < \infty$. We write $a_t = \mathrm{polylog}(b_t)$ if there exists a universal polynomial p such that $a_t = O(p(\log b_t))$. Finally, if $a_t = O(b_t\,\mathrm{polylog}(t))$, we say $a_t = \tilde{O}(b_t)$.

Confidence sequence via the Dubins-Savage inequality
The following inequality by Dubins and Savage [1965] is widely acknowledged as a seminal result in the martingale literature, and it will be the foundation of our first confidence sequence.
Lemma 1 (Dubins-Savage inequality). Let $\{M_t\}$ be a square-integrable martingale with $M_0 = 0$, and let $V_t = \sum_{i=1}^t \mathbb{E}[(M_i - M_{i-1})^2 \mid \mathcal{F}_{i-1}]$ denote its conditional variance process. Then, for all a, b > 0,

$\mathbb{P}\left[\exists t \in \mathbb{N}^+ : M_t \geq a + bV_t\right] \leq \frac{1}{1 + ab}.$

We prove Lemma 1 in Appendix B for completeness. Recall from Section 2 that $\{\lambda_t\}_{t\in\mathbb{N}^+}$ is a sequence of predictable coefficients. Define the processes

$M^+_t = \sum_{i=1}^t \lambda_i(X_i - \mu), \qquad M^-_t = -\sum_{i=1}^t \lambda_i(X_i - \mu). \tag{13}$

As a consequence of Assumption 1, $\mathbb{E}[\lambda_t(X_t - \mu) \mid \mathcal{F}_{t-1}] = 0$. Hence both $\{M^+_t\}$ and $\{M^-_t\}$ are martingales. Applying Lemma 1 to these two martingales yields the following result.
Theorem 2 (Dubins-Savage confidence sequence). Let $\{\lambda_t\}_{t\in\mathbb{N}^+}$ be any predictable process. The following intervals $\{\mathrm{CI}^{\mathrm{DS}}_t\}$ form a (1 − α)-confidence sequence for µ:

$\mathrm{CI}^{\mathrm{DS}}_t = \left[\frac{\sum_{i=1}^t \lambda_i X_i}{\sum_{i=1}^t \lambda_i} \pm \frac{(2/\alpha - 1) + \sigma^2\sum_{i=1}^t \lambda_i^2}{\sum_{i=1}^t \lambda_i}\right]. \tag{14}$

The straightforward proof of this theorem is in Appendix C. Now, we shall choose the coefficients $\{\lambda_t\}$ that appear in the theorem in order to optimize the interval widths $\{|\mathrm{CI}^{\mathrm{DS}}_t|\}$. Our heuristic for optimizing the width is inspired by Waudby-Smith and Ramdas [2023, Equations (24-28)]: we first fix a target time $t_\star$ and consider a constant sequence $\lambda_i \equiv \lambda$; after finding the $\lambda_\star$ that minimizes $|\mathrm{CI}^{\mathrm{DS}}_{t_\star}|$, we set $\lambda_{t_\star}$ to this value. The detailed tuning procedure can be found in Appendix A.1, where we show that

$\lambda_t = \sqrt{\frac{2/\alpha - 1}{\sigma^2 t}} \tag{16}$

is a prudent choice. Then, the width of the confidence sequence at time t is

$|\mathrm{CI}^{\mathrm{DS}}_t| = O\!\left(\sigma\sqrt{\frac{2/\alpha - 1}{t}}\,(1 + \log t)\right). \tag{17}$

Let us briefly compare the $O(\alpha^{-1/2})$ rate of width growth and the $\tilde{O}(t^{-1/2})$ rate of width shrinkage we achieved in (17) with the well-known case of confidence intervals. Both the $O(\alpha^{-1/2})$ rate of growth and the $O(t^{-1/2})$ rate of shrinkage of the Chebyshev CIs (6), which hold under a stronger assumption than ours (i.e. independence and the variance upper bound $\operatorname{Var}[X_i] \leq \sigma^2$), are matched by our Dubins-Savage CS, up to the log t factor. It is worth remarking that the Chebyshev CIs $\{\mathrm{CI}^{\mathrm{Cheb}}_t\}$ never form a confidence sequence at any level: almost surely, there exists some $\mathrm{CI}^{\mathrm{Cheb}}_{t_0}$ that does not contain µ. While the $\tilde{O}(t^{-1/2})$ rate of shrinkage cannot be improved (as we shall discuss in Section 7), we shall see in the following sections that growth rates sharper than $O(\alpha^{-1/2})$ can be achieved. The sharper rates require eschewing the (weighted) empirical means, e.g. the $\sum_i \lambda_i X_i / \sum_i \lambda_i$ that centers the interval $\mathrm{CI}^{\mathrm{DS}}_t$ in (14) above, because they have tails just as heavy as the observations $\{X_t\}$.
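The Dubins-Savage CS admits a direct implementation; below is a minimal sketch, assuming the closed form (14) and the tuning (16) as stated above (the helper name and data-generation choices are ours):

```python
import numpy as np

def dubins_savage_cs(x, sigma, alpha):
    """Dubins-Savage CS of Theorem 2: center is the lambda-weighted mean,
    half-width follows (14), with the tuning lambda_t from (16)."""
    t = np.arange(1, len(x) + 1)
    lam = np.sqrt((2 / alpha - 1) / (sigma**2 * t))   # predictable weights (16)
    s1 = np.cumsum(lam * x)                           # sum_i lambda_i X_i
    sl = np.cumsum(lam)                               # sum_i lambda_i
    sl2 = np.cumsum(lam**2)                           # sum_i lambda_i^2
    half = ((2 / alpha - 1) + sigma**2 * sl2) / sl    # half-width from (14)
    center = s1 / sl
    return center - half, center + half

rng = np.random.default_rng(2)
x = rng.standard_t(df=3, size=1000) * np.sqrt(25 / 3)  # t_3 rescaled to Var = 25
lo, hi = dubins_savage_cs(x, sigma=5.0, alpha=0.05)
print(f"CI_1000 = [{lo[-1]:.2f}, {hi[-1]:.2f}]")
```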

Intermezzo: review of Ville's inequality
The remaining two types of confidence sequence in this paper are both based on the technique of constructing an appropriate pair of nonnegative supermartingales [Howard et al., 2020].This powerful technique results in dramatically tighter confidence sequences compared to the previous approach à la Dubins-Savage.
A stochastic process $\{M_t\}_{t\in\mathbb{N}}$, adapted to the filtration $\{\mathcal{F}_t\}$, is called a supermartingale if $\mathbb{E}[M_t \mid \mathcal{F}_{t-1}] \leq M_{t-1}$ for all $t \in \mathbb{N}^+$. Since many of the supermartingales we are to construct are in an exponential, multiplicative form, we frequently use the following (obvious) lemma.

Lemma 3. Let $\{Y_t\}_{t\in\mathbb{N}^+}$ be an adapted process such that $\mathbb{E}[e^{Y_t} \mid \mathcal{F}_{t-1}] \leq 1$ for all $t \in \mathbb{N}^+$. Then the process $M_t = \exp\{\sum_{i=1}^t Y_i\}$ is a nonnegative supermartingale with $M_0 = 1$.
A remarkable property of nonnegative supermartingales is Ville's inequality [Ville, 1939], which extends Markov's inequality from a single time to an infinite time horizon.

Lemma 4 (Ville's inequality). Let $\{M_t\}_{t\in\mathbb{N}}$ be a nonnegative supermartingale with $M_0 = 1$. Then, for any a > 0,

$\mathbb{P}\left[\exists t \in \mathbb{N} : M_t \geq a\right] \leq \frac{1}{a}.$
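To see Ville's inequality in action (a toy check of ours, with illustrative parameters), take the familiar Gaussian exponential supermartingale and estimate the probability that it ever crosses a level a:

```python
import numpy as np

# For i.i.d. N(0,1) increments, M_t = exp(lam * S_t - lam^2 * t / 2) is a
# nonnegative (super)martingale with M_0 = 1, so P[sup_t M_t >= a] <= 1/a.
rng = np.random.default_rng(3)
lam, a, T, runs = 0.5, 20.0, 5000, 4000

crossed = 0
for _ in range(runs):
    s = np.cumsum(rng.normal(size=T))
    t = np.arange(1, T + 1)
    log_m = lam * s - lam**2 * t / 2
    crossed += (log_m >= np.log(a)).any()

print(f"P[sup M_t >= {a}] ~ {crossed / runs:.4f}  (Ville bound: {1 / a:.4f})")
```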

Confidence sequence by self-normalization
Our second confidence sequence comes from a predictable-mixing version of Delyon [2009, Proposition 12] and Howard et al. [2020, Lemma 3(f)].
Lemma 5. Let $\{\lambda_t\}_{t\in\mathbb{N}^+}$ be any predictable process. The following process is a nonnegative supermartingale:

$M^{\mathrm{SN}}_t = \exp\left\{\sum_{i=1}^t \lambda_i(X_i - \mu) - \frac{1}{6}\sum_{i=1}^t \lambda_i^2(X_i - \mu)^2 - \frac{\sigma^2}{3}\sum_{i=1}^t \lambda_i^2\right\}. \tag{19}$

The proof is in Appendix C. We can obtain another nonnegative supermartingale by flipping $\{\lambda_t\}$ into $\{-\lambda_t\}$ in (19). Applying Ville's inequality (Lemma 4) to the two nonnegative supermartingales, we have the following result, which is again proved in Appendix C.
Lemma 6 (Self-normalized anticonfidence sequence). Let $\{\lambda_t\}_{t\in\mathbb{N}^+}$ be any predictable process. Define $U^-_t \leq U^+_t$ to be the two roots (when they exist) of the concave quadratic

$q^+_t(m) = \sum \lambda_i(X_i - m) - \frac{1}{6}\sum \lambda_i^2(X_i - m)^2 - \frac{\sigma^2}{3}\sum \lambda_i^2 - \log\frac{2}{\alpha} \tag{20}$

(where each $\sum$ stands for $\sum_{i=1}^t$), and define the interval

$\mathrm{aCI}^{\mathrm{SN}+}_t = [U^-_t, U^+_t], \tag{21}$

with the convention that the interval is taken to be ∅ when the term inside the square root of the quadratic formula is negative. Further, let

$\mathrm{aCI}^{\mathrm{SN}-}_t \tag{22}$

denote the interval obtained in the same way with each $\lambda_i$ replaced by $-\lambda_i$. Then, both $\{\mathrm{aCI}^{\mathrm{SN}+}_t\}$ and $\{\mathrm{aCI}^{\mathrm{SN}-}_t\}$ form (1 − α/2)-anticonfidence sequences for µ. That is,

$\mathbb{P}\left[\exists t \in \mathbb{N}^+ : \mu \in \mathrm{aCI}^{\mathrm{SN}\pm}_t\right] \leq \frac{\alpha}{2}.$

Applying a union bound to Lemma 6 immediately gives rise to the following confidence sequence.
Theorem 7 (Self-normalized confidence sequence). Let $\{\lambda_t\}_{t\in\mathbb{N}^+}$ be any predictable process. Then

$\mathrm{CI}^{\mathrm{SN}}_t = \mathbb{R} \setminus \left(\mathrm{aCI}^{\mathrm{SN}+}_t \cup \mathrm{aCI}^{\mathrm{SN}-}_t\right)$

forms a (1 − α)-confidence sequence for µ, where $\mathrm{aCI}^{\mathrm{SN}+}_t$ and $\mathrm{aCI}^{\mathrm{SN}-}_t$ are defined in (21) and (22).

It is not difficult to perform a cursory analysis of the topology of $\mathrm{CI}^{\mathrm{SN}}_t$. Without loss of generality assume µ = 0, since the method is translation invariant. When we take $\{\lambda_t\}$ to be a decreasing sequence, with high probability $\lambda_i^2 X_i^2$ will be much smaller than $\lambda_i$ in the long run, implying that $\max(\mathrm{aCI}^{\mathrm{SN}+}_t) < 0 < \min(\mathrm{aCI}^{\mathrm{SN}-}_t)$ with high probability. Thus, with high probability, $\mathrm{CI}^{\mathrm{SN}}_t$ is the disjoint union of three intervals: a spurious lower interval $L^{\mathrm{SN}}_t$, a middle interval $M^{\mathrm{SN}}_t$ containing µ, and a spurious upper interval $U^{\mathrm{SN}}_t$; we expect this to be the typical topology of $\mathrm{CI}^{\mathrm{SN}}_t$ for large t. Indeed, we demonstrate this with a simple experiment under $X_t$ i.i.d. ∼ N(0, 1) and $\lambda_t = 1/\sqrt{t}$, plotted in Figure 1.
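The three-interval topology is easy to reproduce numerically. The sketch below (ours; it hardcodes the Delyon-type constants 1/6 and 1/3 from (19), and the helper names are illustrative) computes the two anticonfidence intervals as superlevel sets of concave quadratics; $\mathrm{CI}^{\mathrm{SN}}_t$ is the complement of their union:

```python
import numpy as np

def anti_interval(x, lam, sigma, alpha, sign=+1):
    """Anticonfidence interval aCI_SN(+/-): the superlevel set, in m, of the
    concave-quadratic exponent of (19) at the level log(2/alpha).
    Constants 1/6 and 1/3 are the Delyon-type constants assumed in (19)."""
    A  = sign * lam.sum()
    Bx = sign * (lam * x).sum()
    C2 = (lam**2).sum()
    D  = (lam**2 * x).sum()
    E  = (lam**2 * x**2).sum()
    # exponent(m) = Bx - A*m - (E - 2*D*m + C2*m^2)/6 - sigma^2*C2/3 - log(2/alpha)
    a2 = -C2 / 6
    a1 = D / 3 - A
    a0 = Bx - E / 6 - sigma**2 * C2 / 3 - np.log(2 / alpha)
    disc = a1**2 - 4 * a2 * a0
    if disc < 0:
        return None  # empty anticonfidence interval, by convention
    r = np.sqrt(disc)
    return tuple(np.sort([(-a1 - r) / (2 * a2), (-a1 + r) / (2 * a2)]))

rng = np.random.default_rng(4)
x = rng.normal(size=1000)                 # mu = 0, sigma^2 = 1
lam = 1 / np.sqrt(np.arange(1, 1001))
plus  = anti_interval(x, lam, 1.0, 0.05, +1)
minus = anti_interval(x, lam, 1.0, 0.05, -1)
print("aCI_SN+ =", plus, "  aCI_SN- =", minus)
# CI_SN is the complement of their union: typically three intervals, with the
# bounded middle one containing mu = 0.
```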
We now come to the question of choosing the predictable process $\{\lambda_t\}$ to optimize the confidence sequence. Since the set $\mathrm{CI}^{\mathrm{SN}}_t$ always has infinite Lebesgue measure, a reasonable objective is to ensure $\min(\mathrm{aCI}^{\mathrm{SN}+}_t) \to -\infty$ and $\max(\mathrm{aCI}^{\mathrm{SN}-}_t) \to +\infty$ (so that the spurious side intervals recede to infinity), and to make the middle interval $M^{\mathrm{SN}}_t$ as narrow as possible. We resort to the same heuristic approach as in the Dubins-Savage case when optimizing $|M^{\mathrm{SN}}_t|$; this is detailed in Appendix A.2, resulting in the choice (28) stated there. Indeed, $\min(\mathrm{aCI}^{\mathrm{SN}+}_t) \to -\infty$ almost surely if we set $\{\lambda_t\}$ as above. We remark that the removal of the "spurious intervals" $L^{\mathrm{SN}}_t$ and $U^{\mathrm{SN}}_t$ is easily achieved. For example, one can first construct another CS $\{\mathrm{CI}^{\bullet}_t\}$ at the 1 − (α − α′) confidence level that does not have such spurious intervals (e.g., the Dubins-Savage CS (14)); next, construct the self-normalized CS $\{\mathrm{CI}^{\mathrm{SN}}_t\}$ at the 1 − α′ confidence level. A union bound argument then yields that $\{\mathrm{CI}^{\bullet}_t \cap \mathrm{CI}^{\mathrm{SN}}_t\}$ is a (1 − α)-CS, and the intersection with $\{\mathrm{CI}^{\bullet}_t\}$ gets rid of the spurious intervals $L^{\mathrm{SN}}_t$ and $U^{\mathrm{SN}}_t$.

Confidence sequence via Catoni supermartingales

Our last confidence sequence is inspired by Catoni [2012], where, under only the finite variance assumption, the author constructs an M-estimator for the mean that is $O(\sigma\sqrt{\log(1/\alpha)/t})$-close to the true mean with probability at least 1 − α; hence a corresponding (1 − α)-CI whose width has $O(\sqrt{\log(1/\alpha)})$ growth rate exists; cf. (6). We shall sequentialize the idea of Catoni [2012] by constructing two nonnegative supermartingales, which we shall call the Catoni supermartingales.
Following Catoni [2012, Equation (2.1)], we say that $\phi : \mathbb{R} \to \mathbb{R}$ is a Catoni-type influence function if it is increasing and satisfies

$-\log\left(1 - x + \frac{x^2}{2}\right) \leq \phi(x) \leq \log\left(1 + x + \frac{x^2}{2}\right).$

Lemma 8 (Catoni supermartingales). Let $\{\lambda_t\}_{t\in\mathbb{N}^+}$ be any predictable process, and let ϕ be a Catoni-type influence function. The following processes are nonnegative supermartingales:

$M^{\mathrm{C}}_t = \exp\left\{\sum_{i=1}^t \phi(\lambda_i(X_i - \mu)) - \frac{\sigma^2}{2}\sum_{i=1}^t \lambda_i^2\right\}, \qquad N^{\mathrm{C}}_t = \exp\left\{-\sum_{i=1}^t \phi(\lambda_i(X_i - \mu)) - \frac{\sigma^2}{2}\sum_{i=1}^t \lambda_i^2\right\}. \tag{30}$

This lemma is proved in Appendix C. We now remark on the "tightness" of Lemma 8. On the one hand, it is tight in the sense that the pair of processes makes the fullest use of Assumption 2 to be supermartingales, which we formalize in Appendix C with Proposition 16; on the other hand, a slightly tighter (i.e. larger) pair of supermartingales does exist, but is not as useful in deriving a CS (see Section 10.5). In conjunction with Ville's inequality (Lemma 4), Lemma 8 immediately gives a confidence sequence.
Theorem 9 (Catoni-style confidence sequence). Let $\{\lambda_t\}_{t\in\mathbb{N}^+}$ be any predictable process, and let ϕ be a Catoni-type influence function. The following intervals $\{\mathrm{CI}^{\mathrm{C}}_t\}$ form a (1 − α)-confidence sequence for µ:

$\mathrm{CI}^{\mathrm{C}}_t = \left\{ m \in \mathbb{R} : \left|\sum_{i=1}^t \phi(\lambda_i(X_i - m))\right| < \log\frac{2}{\alpha} + \frac{\sigma^2}{2}\sum_{i=1}^t \lambda_i^2 \right\}. \tag{32}$

Although this confidence sequence lacks a closed-form expression, it is easily computed using root-finding methods, since the function $m \mapsto \sum_{i=1}^t \phi(\lambda_i(X_i - m))$ is monotonic. A preliminary experiment is shown in Figure 2.
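Computationally, each endpoint of (32) is a one-dimensional root of a monotone function, so standard bracketing root-finders apply. Below is a minimal sketch (ours), using the canonical influence function from Catoni [2012]; the $\Theta(t^{-1/2})$ coefficient sequence here is an illustrative stand-in for the tuning (33), not (33) itself:

```python
import numpy as np
from scipy.optimize import brentq

def phi(x):
    """Canonical Catoni-type influence function:
    phi(x) = log(1 + x + x^2/2) for x >= 0, -log(1 - x + x^2/2) for x < 0."""
    return np.where(x >= 0, np.log1p(x + x**2 / 2), -np.log1p(-x + x**2 / 2))

def catoni_cs(x, sigma, alpha, lam):
    """Endpoints of the Catoni-style CS (32) by root-finding on the
    strictly decreasing map m -> sum_i phi(lam_i * (x_i - m))."""
    thresh = np.log(2 / alpha) + sigma**2 / 2 * (lam**2).sum()
    f = lambda m: phi(lam * (x - m)).sum()
    # Bracket both roots around the weighted mean, expanding geometrically.
    lo = hi = (lam * x).sum() / lam.sum()
    step = 1.0
    while f(lo) <= thresh:
        lo -= step
        step *= 2
    step = 1.0
    while f(hi) >= -thresh:
        hi += step
        step *= 2
    return (brentq(lambda m: f(m) - thresh, lo, hi),
            brentq(lambda m: f(m) + thresh, lo, hi))

rng = np.random.default_rng(5)
sigma, alpha = 5.0, 0.05
x = rng.standard_t(df=3, size=500) * np.sqrt(25 / 3)   # heavy tails, Var = 25
t = np.arange(1, len(x) + 1)
lam = np.sqrt(2 * np.log(2 / alpha) / (sigma**2 * t))  # illustrative Theta(1/sqrt(t))
print("CI_500 =", catoni_cs(x, sigma, alpha, lam))
```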
We shall show very soon, in extensive experiments (Section 8), that the Catoni-style confidence sequence performs remarkably well in controlling the width $|\mathrm{CI}^{\mathrm{C}}_t|$, not only outperforming the two previously introduced confidence sequences, but also matching the best-performing confidence sequences and even confidence intervals in the literature, many of which require much more stringent distributional assumptions. On the theoretical side, we establish the following nonasymptotic concentration result on the width $|\mathrm{CI}^{\mathrm{C}}_t|$.

Theorem 10. Suppose the coefficients $\{\lambda_t\}_{t\in\mathbb{N}^+}$ are nonrandom and let 0 < ε < 1. Suppose further that

$\left(\sum_{i=1}^t \lambda_i\right)^2 \geq 2\left(\sum_{i=1}^t \lambda_i^2\right)\left(\log\frac{4}{\alpha\varepsilon} + \sigma^2\sum_{i=1}^t \lambda_i^2\right). \tag{34}$

Then, with probability at least 1 − ε,

$|\mathrm{CI}^{\mathrm{C}}_t| \leq \frac{4\left(\log\frac{4}{\alpha\varepsilon} + \sigma^2\sum_{i=1}^t \lambda_i^2\right)}{\sum_{i=1}^t \lambda_i}. \tag{35}$

The proof of the theorem is inspired by the deviation analysis of the nonsequential Catoni estimator [Catoni, 2012, Proposition 2.4] itself, and can be found in Appendix C.
We remark that (34) is an entirely deterministic inequality when $\{\lambda_i\}_{i\in\mathbb{N}^+}$ are all nonrandom. When $\lambda_t = \Theta(t^{-1/2})$, which is the case for (33), the condition (34) holds for large t, since $\sum_{i=1}^t \lambda_i = \Theta(\sqrt{t})$ while $\sum_{i=1}^t \lambda_i^2$ grows logarithmically. This gives us the following qualitative version of Theorem 10.

Corollary 10.1. Take $\lambda_t = \Theta(t^{-1/2})$. Then, for all sufficiently large t, with probability at least 1 − ε,

$|\mathrm{CI}^{\mathrm{C}}_t| = O\!\left(\frac{\log(1/\alpha) + \log(1/\varepsilon) + \sigma^2\log t}{\sqrt{t}}\right) = \tilde{O}(t^{-1/2}).$
We hence see that the Catoni-style confidence sequence enjoys the $\tilde{O}(t^{-1/2})$ and $O(\sqrt{\log(1/\alpha)})$ near-optimal rates of shrinkage and growth. If we do not ignore the logarithmic factors in t (for example, under the choice (33)), the width carries an extra log t factor over the optimal rate. It is now natural to ask whether the Catoni-style CS can attain the law-of-the-iterated-logarithm rate $\Theta(\sqrt{\log\log t/t})$. This cannot be achieved by tuning the sequence $\{\lambda_t\}$ alone [Waudby-Smith and Ramdas, 2023, Table 1], but it can be achieved using a technique called stitching [Howard et al., 2021].
Corollary 10.2 (Stitched Catoni-style confidence sequence). Fix a Catoni-type influence function ϕ and error level α. Then, let $t_j = e^j$, $\alpha_j = \frac{\alpha}{(j+2)^2}$, and $\Lambda_j = \sqrt{\log(2/\alpha_j)\,e^{-j}}$. The following stitched Catoni-style confidence sequence

$\mathrm{CI}^{\mathrm{st}}_t = \bigcap_{j\,:\,t_j \leq t} \mathrm{CI}^{\mathrm{C}}_t(\alpha_j, \Lambda_j),$

where $\mathrm{CI}^{\mathrm{C}}_t(\alpha_j, \Lambda_j)$ denotes the Catoni-style CS (32) at level $1 - \alpha_j$ with constant coefficients $\lambda_i \equiv \Lambda_j$, forms a (1 − α)-CS because of a union bound and $\sum_{j\geq 0} \alpha_j \leq \alpha$. The width in (38) matches both the $\Theta(\sqrt{\log(1/\alpha)})$ lower bound on the growth rate and the $\Theta(\sqrt{\log\log t/t})$ lower bound on the shrinkage rate (which we shall present soon in Section 7). It pays the price of a larger multiplicative constant to achieve the optimal shrinkage rate, so we only recommend it when long-term tightness is of particular interest. The proof of this corollary is in Appendix C.
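The error-budget bookkeeping behind the stitching is elementary; the following snippet (ours; the form of $\Lambda_j$ follows our reading of the corollary above) lays out the epoch schedule and checks the union bound budget via $\sum_{j\geq 0} (j+2)^{-2} = \pi^2/6 - 1 < 1$:

```python
import numpy as np

# Epoch schedule for the stitched Catoni-style CS: geometric epochs t_j = e^j,
# error split alpha_j = alpha / (j + 2)^2, per-epoch coefficient Lambda_j.
alpha = 0.05
j = np.arange(0, 12)
t_j = np.exp(j)
alpha_j = alpha / (j + 2) ** 2
Lambda_j = np.sqrt(np.log(2 / alpha_j) * np.exp(-j))  # assumed form, see above

# Union bound budget: sum_{j>=0} alpha_j = alpha * (pi^2/6 - 1) < alpha.
print(f"sum alpha_j (j -> inf) = {alpha * (np.pi**2 / 6 - 1):.4f} <= alpha = {alpha}")
```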
Remark 1. While the union bound argument of Corollary 10.2 asymptotically improves an existing $t^{-1/2}\,\mathrm{polylog}(t)$ CS to a $\sqrt{\log\log t/t}$ CS, there is a related (but much less involved) idea to construct a CS from a sequence of CIs: split $\alpha = \sum_{t=1}^\infty \alpha_t$ and define the CS as the CI at time t with error level $\alpha_t$. This, however, leads to poor performance, because the events $\{\mu \in \mathrm{CI}_t\}$ and $\{\mu \in \mathrm{CI}_{t+1}\}$ are highly dependent, making the union bound over $t \in \mathbb{N}^+$ very loose. In Figure 3, we visually compare the Catoni CIs with $\alpha_t = \alpha/[t(t+1)]$ (forming what we call the "trivial Catoni CS") against our supermartingale-based Catoni-style CS (32).

Lower bounds
For the sake of completeness, we now discuss lower bounds for confidence sequences for mean estimation. We first introduce the following notion of tail symmetry, which is standard practice when constructing two-sided confidence intervals and sequences.
Definition 1 (tail-symmetric CI/CS). Let Q be a family of distributions over $\mathbb{R}^{\mathbb{N}^+}$. We call a (1 − α)-CI (or CS) $\{\mathrm{CI}_t = [\mathrm{lw}_t, \mathrm{up}_t]\}$ for µ tail-symmetric over Q if, under every Q ∈ Q, each of the one-sided miscoverage probabilities $\mathbb{P}[\mu < \mathrm{lw}_t]$ and $\mathbb{P}[\mu > \mathrm{up}_t]$ is at most α/2.

The following lower bound of minimax nature, akin to Catoni [2012, Section 6.1], characterizes the minimal growth rate of confidence intervals (hence also of confidence sequences) when α is small. Its proof can be found in Appendix C.
Proposition 11 (Gaussian lower bound). We define the minimax width

$w_{t,\alpha,\varepsilon} = \inf\ \sup_{Q \in \mathcal{Q}_{\sigma^2}} w_{Q,\varepsilon},$

where $w_{Q,\varepsilon}$ denotes the (1 − ε)-quantile of the width $|\mathrm{CI}_t|$ under Q; the supremum is taken over $\mathcal{Q}_{\sigma^2}$, the set of all distributions of $\{X_t\}$ satisfying Assumption 1 (where µ ranges over $\mathbb{R}$) and Assumption 2 (where σ² is fixed); and the infimum over all tail-symmetric (1 − α)-CIs. Let $z_p$ be the quantile function of the standard normal distribution. Then, as long as α + ε < 1/2,

$w_{t,\alpha,\varepsilon} \geq \frac{2\sigma\left(z_{1-\alpha-\varepsilon} - O(1)\right)}{\sqrt{t}} = \Theta\!\left(\sigma\sqrt{\frac{\log(1/\alpha)}{t}}\right).$

Here O(1) is with respect to α, ε → 0.
The next lower bound, due to the law of the iterated logarithm (LIL), lower bounds the shrinkage rate of confidence sequences as t → ∞. The proof is again deferred to Appendix C.
Proposition 12 (LIL lower bound). Let P be a distribution on $\mathbb{R}$ with mean µ and variance σ² ∈ (0, ∞), and let $X_1, X_2, \ldots$ be drawn i.i.d. from P. Let $\{\mathrm{CI}_t\}$ be any (1 − α)-CS for µ such that the empirical mean $\hat\mu_t = \frac{1}{t}\sum_{i=1}^t X_i$ belongs to $\mathrm{CI}_t$ for every t. Then, with probability at least 1 − α,

$\limsup_{t \to \infty} \frac{\sqrt{t}\,|\mathrm{CI}_t|}{\sqrt{2\sigma^2\log\log t}} \geq 1.$

Remark 2. The assumption $\hat\mu_t \in \mathrm{CI}_t$ is true for many existing confidence sequences for mean estimation in the literature, e.g. Darling and Robbins [1967], Jennison and Turnbull [1989], Howard et al. [2021], meaning that our CS that matches this lower bound (38) is fundamentally no worse than them, even under the much weaker Assumption 2. While assuming $\hat\mu_t \in \mathrm{CI}_t$ does not encompass all the CSs in the literature, it can be relaxed by assuming instead that there exists an estimator $\theta_t \in \mathrm{CI}_t$ that follows the law of the iterated logarithm. For example, it is known that the weighted empirical average satisfies the LIL [Teicher, 1974], which implies that all of the predictable mixture confidence sequences due to Waudby-Smith and Ramdas [2023], as well as our Theorem 2, are subject to a similar LIL lower bound. Relatedly, the LIL is also satisfied by some M-estimators [He and Wang, 1995, Schreuder et al., 2020]. However, these LIL-type results are only valid under constant weight multipliers (in our parlance, when the sequence $\{\lambda_t\}$ is constant, in which case our CSs do not shrink), and hence the M-estimators for µ can be inconsistent, the limit of the LIL-type convergence being some value other than µ. The search for new LIL-type results for consistent M-estimators under decreasing $\{\lambda_t\}$ sequences, e.g. the zero of $\sum_{i=1}^t \phi(\lambda_i(X_i - m))$ which is included in the Catoni-style CS (32), shall stimulate our future study.

Experiments
We first examine the empirical cumulative miscoverage rates of our confidence sequences, as well as of Catoni's confidence interval. These are the frequencies, at time t, that any of the intervals $\{\mathrm{CI}_i\}_{1\leq i\leq t}$ fails to cover the population mean µ, over 2000 (for all CSs) or 250 (for the Catoni CI, owing to its inherently non-sequentializable computation) independent runs of i.i.d. samples of size t = 800 from a Student's t-distribution with 3 degrees of freedom, randomly centered and rescaled to variance σ² = 25. The result, in Figure 4, shows the clear advantage of CSs under continuous monitoring: they never accumulate error beyond the preset α, unlike the Catoni CI, whose cumulative miscoverage rate exceeds α early on. In fact, a similar experiment in the light-tailed regime by Howard et al. [2021, Figure 1(b)] shows that the cumulative miscoverage rate of CIs will grow to 1 if the sequential monitoring process is extended indefinitely. We then compare the confidence sequences in terms of their growth rates. That is, we take decaying values of the error level α and plot the widths of the CSs, with the corresponding $\{\lambda_t\}$ sequences (16), (28), and (33) that we chose (for the Catoni-style CS we actually use $\{\lambda_{\max\{t,9\}}\}$ to facilitate the root-finding in (32)). We draw a t = 250 i.i.d. sample from the same Student's t-distribution as above (σ² = 25). The Dubins-Savage CS has deterministic interval widths, while the self-normalized CS (we only consider $|M^{\mathrm{SN}}_t|$) and the Catoni-style CS both have random interval widths, for which we repeat the experiments 10 times each. We add the Catoni CI for the sake of reference. The comparison of widths is exhibited in Figure 5a.
We observe from the graph that the self-normalized CS and the Catoni-style CS both exhibit restrained growth of interval width as α becomes small. On the other hand, the Dubins-Savage CS, with its super-logarithmic $O(\alpha^{-1/2})$ growth, performs markedly worse than those with logarithmic growth.
We run the same experiment again, this time with Gaussian data with variance σ² = 25, and we add two CSs and one CI for subGaussian random variables with variance factor σ² from the previous literature for comparison: first, the stitched subGaussian CS [Howard et al., 2021, Equation (1.2)], which we review in Lemma 17; second, the predictably-mixed Hoeffding CS
with (33); third, the standard subGaussian Chernoff CI from (7). Recall that all three of the above bounds are not valid under only a finite-variance assumption, but require a subGaussian moment generating function. This extended comparison is plotted in Figure 5b. Next, we examine the rates of shrinkage as t → ∞. We sequentially sample from (i) a Student's t-distribution with 5 degrees of freedom, randomly centered and rescaled to variance σ² = 25, and (ii) a normal distribution with variance σ² = 25, at each time calculating the intervals. We again include Catoni's CI in both cases, and the three subGaussian CI/CSs in the Gaussian case. The evolution of these intervals is shown in Figures 6a and 6b, and their widths in Figures 6c and 6d. In the heavy-tailed setting, our Catoni-style CS performs markedly better than the other two CSs, and is close to the Catoni CI. In the Gaussian setting, our Catoni-style CS is of approximately the same caliber as the best subGaussian CSs in the literature. There appears to be no benefit in using the minimax-optimal stitched Catoni-style CS for sample sizes up to 10,000, due to its large constant. These new observations appear to be practically useful.
It is important to remark that, in the particular instantiation of the random variables drawn for the runs plotted in these figures, the Catoni CI seems always to cover the true mean; however, we know for a fact (theoretically, from the law of the iterated logarithm for M-estimators [Schreuder et al., 2020]; empirically, from Figure 4) that the Catoni CI will eventually miscover with probability one, and it will in fact miscover infinitely often, in every single run.
When exactly the first miscoverage happens is a matter of chance (it could happen early or late), but it will almost surely happen infinitely often [Howard et al., 2021]. Thus, the Catoni CI cannot be trusted at data-dependent stopping times, as encountered via continuous monitoring in sequential experimentation, but the CSs can. The price for the extra protection offered by the CSs is in lower-order terms (polylogarithmic), and the figures suggest that it is quite minimal, the Catoni-style CS being only ever so slightly wider than the Catoni CI.

Heteroscedastic and infinite variance data
In lieu of Assumption 2, we can consider a much more general setting that encompasses data drawn from distributions without a finite variance (e.g. a Pareto distribution, or a stable distribution with index in (1, 2)), and possibly data whose scale increases over time.
Assumption 3. The process is conditionally $L^p$ with an upper bound, known a priori, on the conditional central p-th moment:

$\mathbb{E}[|X_t - \mu|^p \mid \mathcal{F}_{t-1}] \leq v_t \quad \text{for all } t \in \mathbb{N}^+,$

where $\{v_t\}_{t\in\mathbb{N}^+}$ is a predictable, nonnegative process.
When p = 2, all three of our confidence sequences extend naturally to handle such scenarios of heteroscedasticity. We leave the details of the heteroscedastic versions of the Dubins-Savage CS and the self-normalized CS to Appendix D. For the infinite-variance case 1 < p < 2, the generalization of the Dubins-Savage inequality to this regime by Khan [2009] can easily be used to construct a confidence sequence under Assumption 3, extending our Theorem 2. However, due to the relatively unsatisfactory performance of the Dubins-Savage CS, we do not elaborate upon this extension.
Let us focus in this section primarily on extending our Catoni-style CS in Theorem 9 to Assumption 3. To achieve this, we resort to an argument similar to the generalization of the Catoni CI by Chen et al. [2021].
We say that $\phi_p : \mathbb{R} \to \mathbb{R}$ is a p-Catoni-type influence function if it is increasing and satisfies $-\log(1 - x + |x|^p/p) \leq \phi_p(x) \leq \log(1 + x + |x|^p/p)$. A simple example is

$\phi_p(x) = \begin{cases} \log(1 + x + |x|^p/p), & x \geq 0, \\ -\log(1 - x + |x|^p/p), & x < 0. \end{cases}$

Lemma 13 (p-Catoni supermartingales). Let $\phi_p$ be a p-Catoni-type influence function. Under Assumption 1 and Assumption 3, the following processes are nonnegative supermartingales:

$M^{\mathrm{C}_p}_t = \exp\left\{\sum_{i=1}^t \phi_p(\lambda_i(X_i - \mu)) - \frac{1}{p}\sum_{i=1}^t v_i\lambda_i^p\right\}, \qquad N^{\mathrm{C}_p}_t = \exp\left\{-\sum_{i=1}^t \phi_p(\lambda_i(X_i - \mu)) - \frac{1}{p}\sum_{i=1}^t v_i\lambda_i^p\right\}.$

The proof is straightforwardly analogous to that of Lemma 8. The corresponding CS can be expressed just as easily, akin to Theorem 9.
Theorem 14 (p-Catoni-style confidence sequence). Let $\phi_p$ be a p-Catoni-type influence function. Under Assumption 1 and Assumption 3, the following intervals $\{\mathrm{CI}^{\mathrm{C}_p}_t\}$ form a (1 − α)-confidence sequence for µ:

$\mathrm{CI}^{\mathrm{C}_p}_t = \left\{ m \in \mathbb{R} : \left|\sum_{i=1}^t \phi_p(\lambda_i(X_i - m))\right| < \log\frac{2}{\alpha} + \frac{1}{p}\sum_{i=1}^t v_i\lambda_i^p \right\}.$

Chen et al.
[2021] point out that in the i.i.d. case (i.e. assuming $v_t = v$ for all t in Assumption 3), the asymptotically optimal choice for the rate of decrease of $\{\lambda_t\}$, when working with this $L^p$ set-up, is $\lambda_t \asymp t^{-1/p}$. Specifically, in Chen et al. [2021, Proof of Theorem 2.6], the authors recommend a particular tuning, restated as (50), to optimize their CI. We adopt exactly the same tuning (50) in our experiment, shown in Figure 7, with i.i.d., infinite-variance Pareto data. Indeed, employing $\lambda_t \asymp t^{-1/p}$ also leads to a width concentration bound with the optimal shrinkage rate $t^{-(p-1)/p}$, similar to Theorem 10 and proved in Appendix C.

Theorem 15. Suppose the coefficients $\{\lambda_t\}$ and the conditional p-th moment bounds $\{v_t\}$ are all nonrandom, and let 0 < ε < 1. Suppose further that the deterministic condition (51), the $L^p$ analogue of (34), holds. Then, with probability at least 1 − ε, the width $|\mathrm{CI}^{\mathrm{C}_p}_t|$ satisfies a bound (52) of order

$O\!\left(\left(\log\frac{1}{\alpha\varepsilon} + \sum_{i=1}^t v_i\lambda_i^p\right)\Big/\sum_{i=1}^t \lambda_i\right).$

Similar to the case of Theorem 10, (51) is an entirely deterministic inequality when $\{v_t\}$ and $\{\lambda_t\}$ are all nonrandom. When $v_t = \Theta(1)$ and $\lambda_t = \Theta(t^{-1/p})$, which is the case for (50), the condition (51) holds for large t, since $\sum_{i=1}^t \lambda_i = \Theta(t^{(p-1)/p})$ while $\sum_{i=1}^t \lambda_i^p$ grows logarithmically. This gives us the following qualitative version of Theorem 15, like (and generalizing) Corollary 10.1: with $\lambda_t = \Theta(t^{-1/p})$, the width is $\tilde{O}(t^{-(p-1)/p})$ with probability at least 1 − ε.
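For reference, here is a sketch (ours) of the "simple example" of a p-Catoni-type influence function, together with an illustrative $\Theta(t^{-1/p})$ coefficient decay; the constant inside $\lambda_t$ is our placeholder, not the tuning (50):

```python
import numpy as np

def phi_p(x, p):
    """The 'simple example' of a p-Catoni-type influence function:
    phi_p(x) = log(1 + x + |x|^p / p) for x >= 0,
              -log(1 - x + |x|^p / p) for x < 0."""
    return np.where(x >= 0,
                    np.log1p(x + np.abs(x)**p / p),
                    -np.log1p(-x + np.abs(x)**p / p))

# Illustrative Theta(t^(-1/p)) decay; the constant is ours, not that of (50).
p, alpha, v = 1.5, 0.05, 1.0
t = np.arange(1, 501)
lam = (np.log(2 / alpha) / (v * t)) ** (1 / p)
print(lam[:3], phi_p(np.array([-2.0, 0.5, 3.0]), p))
```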
We remark that this shrinkage rate, up to a logarithmic factor in t, matches the lower bound for CIs by Devroye et al. [2016, Theorem 3.1]. If we let $\{v_t\}$ grow, say at a rate $v_t = \Theta(t^\gamma)$, one may match the growing scale of the data by adjusting the $\{\lambda_t\}$ sequence to decay faster, in order to optimize the width bound in Theorem 15.
Discussions and extensions

Minimality of the moment assumptions

We stress here that an upper bound on a (1+δ)-th moment, for example the upper variance bound σ² in Assumption 2, is required to be known. We have seen in Section 9 that Assumption 2 can be weakened in various ways, but never eliminated, since another moment bound is always introduced. Such assumptions, strong as they may seem at first sight, are necessitated by the results of Bahadur and Savage [1956], which immediately imply that if no upper bound on a moment is known a priori, mean estimation is provably impossible. Indeed, without a known moment bound, even nontrivial tests for whether the mean equals zero do not exist, meaning that all tests have trivial power (power bounded by the type-I error), and thus cannot have power going to 1 while the type-I error stays below α. The lack of power-one tests for a point null, thanks to the duality between CIs and families of tests, in turn implies the impossibility of intervals that shrink to zero width. In a similar spirit, one can see that the lower bound of Proposition 11 grows to infinity as σ does, indicating that a confidence interval (hence a confidence sequence) must be unboundedly wide when no bound on σ is in place.

Drifting means
Our three confidence sequences also extend, at least in theory, to the case where Assumption 1 is weakened to

$\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = \mu_t \quad \text{for all } t \in \mathbb{N}^+,$

where $\{\mu_t\}$ is any predictable process. This, in conjunction with Section 9, implies that our work provides a unified framework for any $L^2$ process $\{X_t\}$. Such a generalization is achieved by replacing every occurrence of $(X_i - \mu)$ in the martingales (13) and the supermartingales (19), (30) by $(X_i - \mu_i)$. The closed-form Dubins-Savage confidence sequence (14) now tracks the weighted average $\sum_{i=1}^t \lambda_i\mu_i / \sum_{i=1}^t \lambda_i$. In the case of the self-normalized and Catoni-style confidence sequences, a confidence region $\mathrm{CR}_t \subseteq \mathbb{R}^t$ can be solved from Ville's inequality at each t, such that $\mathbb{P}[\forall t \in \mathbb{N}^+, (\mu_1, \ldots, \mu_t) \in \mathrm{CR}_t] \geq 1 - \alpha$. The exact geometry of such confidence regions shall be of interest for future work.

Sharpening the confidence sequences by a running intersection
It is easy to verify that if $\{\mathrm{CI}_t\}$ forms a (1 − α)-CS for µ, so does the sequence of running intersections

$\widetilde{\mathrm{CI}}_t = \bigcap_{i=1}^t \mathrm{CI}_i,$

a fact first pointed out by Darling and Robbins [1967]. The intersected sequence $\{\widetilde{\mathrm{CI}}_t\}$ is at least as tight as the original one $\{\mathrm{CI}_t\}$, while still enjoying the same level of sequential confidence. However, Howard et al. [2021, Section 6] point out that this practice does not extend to the drifting-parameter case, and that it may produce an empty interval. We remark that, following the discussion in Section 10.2, we can still perform an intersective tightening under drifting means.
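The running intersection is a one-line transformation of interval endpoints; a minimal sketch (ours):

```python
import numpy as np

def running_intersection(lo, hi):
    """Replace CI_t by the intersection of CI_1, ..., CI_t: cumulative max of
    lower endpoints and cumulative min of upper endpoints. An empty interval
    (lo > hi) can occur, but only with probability at most alpha overall."""
    return np.maximum.accumulate(lo), np.minimum.accumulate(hi)

lo = np.array([-3.0, -1.0, -2.0, -0.5])
hi = np.array([3.0, 2.5, 1.0, 2.0])
print(running_intersection(lo, hi))  # ([-3, -1, -1, -0.5], [3, 2.5, 1, 1])
```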
To wit, if $\{\mathrm{CR}_t\}$ is a confidence region sequence satisfying $\mathbb{P}[\forall t \in \mathbb{N}^+, (\mu_1, \ldots, \mu_t) \in \mathrm{CR}_t] \geq 1 - \alpha$, so is the sequence formed by

$\widetilde{\mathrm{CR}}_t = \bigcap_{i=1}^t \left(\mathrm{CR}_i \times \mathbb{R}^{t-i}\right).$

The peril of the running intersection, however, is that it may result in an empty interval. Though this happens with probability less than α by the definition of a CS, an empty interval is a problematic outcome in practice that one would like to avoid.

Tighter supermartingales

A slightly tighter (i.e. larger) pair of nonnegative supermartingales than the Catoni supermartingales (30) is

$M^{\mathrm{C}*}_t = \prod_{i=1}^t \frac{1 + \lambda_i(X_i - \mu) + \lambda_i^2(X_i - \mu)^2/2}{1 + \sigma^2\lambda_i^2/2}, \qquad N^{\mathrm{C}*}_t = \prod_{i=1}^t \frac{1 - \lambda_i(X_i - \mu) + \lambda_i^2(X_i - \mu)^2/2}{1 + \sigma^2\lambda_i^2/2},$

since $e^{\phi(x)} \leq 1 + x + x^2/2$ and $e^{-\sigma^2\lambda_t^2/2} \leq \frac{1}{1 + \sigma^2\lambda_t^2/2}$. By Lemma 18 in Appendix C, this larger pair of supermartingales indeed yields a CS even tighter than the Catoni-style CS. However, we remark that the difference between this tighter CS and the Catoni-style CS is small as $\lambda_t$ decreases; moreover, this tighter CS is computationally infeasible: finding the roots of $M^{\mathrm{C}*}_t, N^{\mathrm{C}*}_t = 2/\alpha$ suffers from non-monotonicity (so that we may not easily find the largest/smallest root, which defines the endpoints of the CS) and high sensitivity. Nevertheless, following the discussion in Section 10.4, it is easy to test whether µ is in this tighter CS, i.e. whether $M^{\mathrm{C}*}_t, N^{\mathrm{C}*}_t \leq 2/\alpha$ actually holds. Therefore, we recommend that one use the Catoni supermartingales when constructing a CS, but use the pair (63), (64) above when sequentially testing the null Assumption 1.

Concluding remarks
In this paper, we present three kinds of confidence sequences of increasing tightness for mean estimation, under the extremely weak assumption that the conditional variance is bounded. The third of these, the Catoni-style confidence sequence, is shown both empirically and theoretically to be close to previously known confidence sequences, and even confidence intervals, that only work under light-tail assumptions requiring the existence, as well as the decay, of all moments.
This elegant result bears profound theoretical implications. We now know that the celebrated rate of shrinkage $O(t^{-1/2})$ and rate of growth $O(\sqrt{\log(1/\alpha)})$ of confidence intervals produced by MGF-based concentration inequalities (e.g. the Chernoff bound (7)) extend essentially in two directions simultaneously: to heavy tails, up to the point where only the second moment is required to exist; and to sequentialization, i.e. the anytime-valid regime.
Our work shall also find multiple scenarios of application, many of which are related to multi-armed bandits and reinforcement learning. For example, the best-arm identification problem [Jamieson and Nowak, 2014] in the stochastic bandit literature relies on the construction of confidence sequences, and most previous works study the cases of Bernoulli and subGaussian bandits. Given the results of this paper, we may now have a satisfactory solution when heavy-tailed rewards [Bubeck et al., 2013] are to be learned. A similar locus of application is the off-policy evaluation problem [Thomas et al., 2015] in contextual bandits, whose link to confidence sequences was recently established [Karampatziakis et al., 2021]. While Karampatziakis et al. [2021] only considered bounded rewards, our work provides the theoretical tools to handle a far wider range of instances.
Besides the issue of drifting means mentioned in Section 10.2 and the search for an all-encompassing LIL lower bound mentioned in Section 7, we also expect future work to address multivariate or matrix extensions; the study by Catoni and Giulini [2017], we speculate, can be a starting point. Finally, online algorithms for approximating the interval in the Catoni-style CS (32) can also be studied.

A Tuning the coefficients {λ t }
A.1 Tuning the coefficients in the Dubins-Savage confidence sequence

Note that when (15) happens (i.e. $\lambda_i \equiv \lambda$ up to time $t_\star$), the half-width of the CI at $t_\star$ is

$\frac{(2/\alpha - 1) + \sigma^2 t_\star \lambda^2}{t_\star \lambda},$

which obtains its optimal (smallest) value when

$\lambda = \lambda_\star = \sqrt{\frac{2/\alpha - 1}{\sigma^2 t_\star}}.$

With the above guidance, in Theorem 2 we take

$\lambda_t = \sqrt{\frac{2/\alpha - 1}{\sigma^2 t}}.$

Then, the CS half-width at time t is

$\frac{(2/\alpha - 1)\left(1 + \sum_{i=1}^t i^{-1}\right)}{\sum_{i=1}^t \sqrt{(2/\alpha - 1)/(\sigma^2 i)}} = O\!\left(\sigma\sqrt{\frac{2/\alpha - 1}{t}}\,\log t\right).$

A.2 Tuning the coefficients in the self-normalized confidence sequence

Take $t_\star$ as fixed and consider a constant sequence $\lambda_i \equiv \lambda$, where we use the approximation $\sqrt{t^2 + \text{smaller term}} \approx t + \frac{\text{smaller term}}{2t}$. Examining the final expression, the optimal $\lambda_\star$ is a function of $S_2 = \sum_{i=1}^{t_\star} X_i^2$ and $S_1 = \sum_{i=1}^{t_\star} X_i$, stated in (69). Since we need $\lambda_t$ to be $\mathcal{F}_{t-1}$-measurable, we replace $S_2$ and $S_1$ with $\sum_{i=1}^{t-1} X_i^2$ and $\sum_{i=1}^{t-1} X_i$, and all other occurrences of $t_\star$ with t in (69), to obtain our predictable sequence $\{\lambda_t\}$ of choice (28) for Theorem 7.

B Discussion on the Dubins-Savage confidence sequence

We first present here a short and self-contained proof of the Dubins-Savage inequality [Dubins and Savage, 1965, Khan, 2009].
Proof of Lemma 1. Consider the function $Q(x) = \frac{1}{1 - \min(x, 0)}$. It is not hard to see that Q satisfies two elementary inequalities: (71), valid for any $x \in \mathbb{R}$ and $m \leq 0$, and (72), valid for any $x, m \leq 0$; see Figure 8 for an illustration. Now define the random variables

$x_t = b\left(M_t - a - bV_t\right), \quad t \in \mathbb{N},$

so that $x_0 = -ab$. We shall show that $\{Q(x_t)\}$ is a supermartingale, meaning that $\mathbb{E}[Q(x_t) \mid \mathcal{F}_{t-1}] \leq Q(x_{t-1})$; this follows by applying (71) and (72) to the increments of $\{x_t\}$. Since $x_0 = -ab$ and $Q(x_0) = 1/(1 + ab)$, we define $R(x) := (1 + ab)Q(x)$ to obtain a nonnegative supermartingale $\{R(x_t)\}$ with $R(x_0) = 1$, on which we can use Ville's inequality (Lemma 4) to conclude that

$\mathbb{P}[\exists t \in \mathbb{N}^+ : M_t \geq a + bV_t] = \mathbb{P}[\exists t \in \mathbb{N}^+ : x_t \geq 0] = \mathbb{P}[\exists t \in \mathbb{N}^+ : R(x_t) \geq 1 + ab] \leq \frac{1}{1 + ab},$

concluding the proof.
Indeed, we can see from the proof that the Dubins-Savage inequality can actually be derived from Ville's inequality applied to a nonnegative supermartingale. In the parlance of Section 10.4, the process $\{R(x_t)\}$ can be used as a test supermartingale for the null Assumption 1, when setting $M_t$ to be $\sum_{i=1}^t \lambda_i(X_i - \mu)$. However, there is a major difference in how this test supermartingale relates to the Dubins-Savage confidence sequence: if one fixes the parameters a and b a priori, the rejection rule $R(x_t) \geq 1/\alpha$ is equivalent to the Dubins-Savage CS (Theorem 2) only when $\alpha = 1/(1 + ab)$. This is unlike the cases of the other two CSs in this paper, where the duality between confidence sequence and sequential testing holds for any α.

C Omitted proofs and additional propositions
Proof of Theorem 2. We apply Lemma 1 to the two martingales (13), with $a = (2/\alpha - 1)/b$. Then we have, for each sign,

$\mathbb{P}\left[\exists t \in \mathbb{N}^+ : \pm\sum_{i=1}^t \lambda_i(X_i - \mu) \geq \frac{2/\alpha - 1}{b} + b\sum_{i=1}^t \lambda_i^2\,\mathbb{E}[(X_i - \mu)^2 \mid \mathcal{F}_{i-1}]\right] \leq \frac{1}{1 + ab} = \frac{\alpha}{2}.$

Using Assumption 2, we then have

$\mathbb{P}\left[\exists t \in \mathbb{N}^+ : \pm\sum_{i=1}^t \lambda_i(X_i - \mu) \geq \frac{2/\alpha - 1}{b} + b\sigma^2\sum_{i=1}^t \lambda_i^2\right] \leq \frac{\alpha}{2}.$

We remark here that the parameter b in the inequalities above is actually redundant and can be eliminated (i.e., one may take b = 1), since tuning b is equivalent to tuning the coefficients $\lambda_i$: multiplying b by a constant $\lambda_0$ results in the same inequalities as dividing each $\lambda_i$ by $\lambda_0$. Putting b = 1 in the inequalities above and taking a union bound, we immediately arrive at the result.
Proof of Lemma 6. Applying Ville's inequality (Lemma 4) to the nonnegative supermartingale (19), we have that

$\mathbb{P}\left[\exists t \in \mathbb{N}^+ : \sum \lambda_i(X_i - \mu) - \frac{1}{6}\sum \lambda_i^2(X_i - \mu)^2 - \frac{\sigma^2}{3}\sum \lambda_i^2 \geq \log\frac{2}{\alpha}\right] \leq \frac{\alpha}{2}$

(each $\sum$ standing for $\sum_{i=1}^t$). Solving for µ from the quadratic inequality yields the interval $\mathrm{aCI}^{\mathrm{SN}+}_t$ at issue, which then forms a (1 − α/2)-anticonfidence sequence for µ. Another (1 − α/2)-anticonfidence sequence, $\mathrm{aCI}^{\mathrm{SN}-}_t$, can be formed by replacing each $\lambda_t$ with $-\lambda_t$.

Proof of Lemma 8. We observe that

$\mathbb{E}\left[e^{\pm\phi(\lambda_t(X_t - \mu))} \mid \mathcal{F}_{t-1}\right] \leq \mathbb{E}\left[1 \pm \lambda_t(X_t - \mu) + \frac{\lambda_t^2(X_t - \mu)^2}{2} \,\Big|\, \mathcal{F}_{t-1}\right] \leq 1 + \frac{\sigma^2\lambda_t^2}{2} \leq e^{\sigma^2\lambda_t^2/2}.$

Hence $\{M^{\mathrm{C}}_t\}$ and $\{N^{\mathrm{C}}_t\}$ are both nonnegative supermartingales by Lemma 3.
We have the following statement on the tightness of Lemma 8, which says that the variance bound of Assumption 2 is necessary for the processes $\{M^{\mathrm{C}}_t\}$ and $\{N^{\mathrm{C}}_t\}$ to be supermartingales: the violation of Assumption 2 on any non-null set will prevent $\{M^{\mathrm{C}}_t\}$ and $\{N^{\mathrm{C}}_t\}$ from being supermartingales.
Proof of Proposition 16. Let η be any real number in (0, 1/2). There exists a positive number $x_\eta$ such that, for any ϕ that is a Catoni-type influence function, $\phi(x) \geq \log(1 + x + \eta x^2)$ whenever $|x| < x_\eta$. The first two limits inferior in the ensuing computation are 0, since $Y_t$ has finite conditional (2+δ)-th moment; and since η is arbitrary in (0, 1/2), the stated lim inf bound follows (it is in fact not hard to see that the inequality is an equality). Now, recall that $v_t > (1 + 2\kappa)\sigma^2$ on the set $S \in \mathcal{F}_{t-1}$. By (100), there exists some $\lambda_0$ such that the desired bound holds whenever $\lambda < \lambda_0$. Let g(κ) be the unique positive zero of $e^x - 1 - (1+\kappa)x$, so that $1 + (1+\kappa)x > e^x$ when $x \in (0, g(\kappa))$. Hence, the supermartingale property fails whenever $0 < \lambda \leq \min\{\lambda_0, 2g(\kappa)/\sigma\}$.

Proof of Theorem 10. We define $f_t(m)$ to be the random function in (32), $f_t(m) = \sum_{i=1}^t \phi(\lambda_i(X_i - m))$, which is always strictly decreasing in m. First, for all $m \in \mathbb{R}$, the process

$M_t(m) = \exp\left\{f_t(m) - \sum_{i=1}^t \lambda_i(\mu - m) - \frac{(\mu - m)^2 + \sigma^2}{2}\sum_{i=1}^t \lambda_i^2\right\}$

is, again due to Lemma 3, a nonnegative supermartingale. Note that $\{M_t(\mu)\}$ is just the Catoni supermartingale $\{M^{\mathrm{C}}_t\}$ defined in (30). We hence have $\mathbb{E}[M_t(m)] \leq 1$; that is, with probability at least 1 − ε/2,

$f_t(m) \leq \log\frac{2}{\varepsilon} + \sum_{i=1}^t \lambda_i(\mu - m) + \frac{(\mu - m)^2 + \sigma^2}{2}\sum_{i=1}^t \lambda_i^2. \tag{110}$

The same construction with $-\phi(-\cdot)$ in place of ϕ yields another nonnegative supermartingale for all $m \in \mathbb{R}$, giving the corresponding lower bound on $f_t(m)$. Without loss of generality take µ = 0, and consider the equation obtained by setting the right-hand side of (110) equal to $-\log\frac{2}{\alpha} - \frac{\sigma^2}{2}\sum\lambda_i^2$, which, by rearrangement, can be written as

$\frac{\sum \lambda_i^2}{2}m^2 - \left(\sum \lambda_i\right)m + \log\frac{4}{\alpha\varepsilon} + \sigma^2\sum \lambda_i^2 = 0 \tag{112}$

(each $\sum$ standing for $\sum_{i=1}^t$). As a quadratic equation, it has solutions if and only if its discriminant is nonnegative, which is just the condition (34). Let $m = \pi_t$ be the smaller solution of (112). Since $\{\lambda_t\}$ is assumed to be nonrandom, the quantity $\pi_t$ is also nonrandom. Putting $m = \pi_t$ into (110), we see that, with probability at least 1 − ε/2, $f_t(\pi_t) \leq -\log\frac{2}{\alpha} - \frac{\sigma^2}{2}\sum\lambda_i^2$, and hence, by the monotonicity of $f_t$, $\max(\mathrm{CI}^{\mathrm{C}}_t) \leq \pi_t$; notice also that $\pi_t \leq 2\left(\log\frac{4}{\alpha\varepsilon} + \sigma^2\sum\lambda_i^2\right)/\sum\lambda_i$. Combining these gives one one-sided concentration. Now, let $\rho_t$ be the larger solution of the mirrored equation; a similar analysis yields the other one-sided concentration. Hence a union bound on the two one-sided concentrations gives rise to the concentration on the interval width we desire,

$\mathbb{P}\left[|\mathrm{CI}^{\mathrm{C}}_t| \leq \frac{4\left(\log\frac{4}{\alpha\varepsilon} + \sigma^2\sum_{i=1}^t \lambda_i^2\right)}{\sum_{i=1}^t \lambda_i}\right] \geq 1 - \varepsilon.$

This concludes the proof.
Before we prove Corollary 10.2, we review the technique of stitching as it appears in Howard et al. [2021, Section 3.1]. Let $\{Y_t\}$ be an i.i.d. sequence of random variables with mean µ that are subGaussian with variance factor 1. Then, for any $\lambda \in \mathbb{R}$, the following process is a nonnegative supermartingale:

$M^{(\lambda)}_t = \exp\left\{\lambda\sum_{i=1}^t (Y_i - \mu) - \frac{\lambda^2 t}{2}\right\},$

which, in conjunction with Ville's inequality, yields the following "linear boundary" confidence sequence:

$\mathbb{P}\left[\forall t \in \mathbb{N}^+ : \mu > \hat\mu_t - \frac{\log(1/\alpha)}{\lambda t} - \frac{\lambda}{2}\right] \geq 1 - \alpha, \quad \text{where } \hat\mu_t = \frac{1}{t}\sum_{i=1}^t Y_i.$

The idea of Howard et al. [2021] is to divide $\alpha = \sum_{j=0}^\infty \alpha_j$, take some sequences $\{\Lambda_j\}$ and $\{t_j\}$ ($t_0 = 1$), and consider the following CS:

$\left(\hat\mu_t - \min_{j : t_j \leq t}\left(\frac{\log(1/\alpha_j)}{\Lambda_j t} + \frac{\Lambda_j}{2}\right),\ \infty\right),$

which is indeed a (1 − α)-CS due to a union bound. Howard et al. [2021] show that using geometrically spaced epochs $\{t_j\}$, the $\sqrt{\log\log t/t}$ lower bound of the law of the iterated logarithm can be matched. We prove a slightly different bound than Howard et al. [2021, Equation (11)] below.
Further, we see that the required inequality holds as long as $e^j > \mathrm{polylog}(1/\alpha)$. Hence (136) is met when $t > \mathrm{polylog}(1/\alpha)$. Combining all of the above, we arrive at the desired conclusion.
Proof of Proposition 11. Let $\theta_t = \frac{\mathrm{lw}_t + \mathrm{up}_t}{2}$ and $r_t = \frac{w_{Q,\varepsilon}}{2}$. Due to a union bound, for any $Q \in \mathcal{Q}_{\sigma^2}$, the estimator $\theta_t$ satisfies $\mathbb{P}[|\theta_t - \mu| \leq r_t] \geq 1 - \alpha - \varepsilon$. Now, by Catoni [2012, Proposition 6.1] (note that $r_t$ here is a data-independent constant), there exists $\mu_0 \in \mathbb{R}$ such that, when $X_i \overset{\mathrm{iid}}{\sim} N(\mu_0, \sigma^2)$, the above coverage forces $r_t$ to be of at least the claimed order. Without loss of generality suppose the latter holds. Surely $N(\mu_0, \sigma^2)^{\otimes\mathbb{N}^+} \in \mathcal{Q}_{\sigma^2}$. This shows that the claimed lower bound on $w_{Q,\varepsilon}$ holds for any tail-symmetric CI, which clearly implies the minimax lower bound.

Before we prove Theorem 15, let us introduce three lemmas, which are also used to prove the infinite-variance case in the recent follow-up work on robust Catoni-style confidence sequences, Wang and Ramdas [2023]. Our proof of Theorem 15 is also inspired by Wang and Ramdas [2023, Proof of Theorem 5], which in turn roughly follows the proof of Theorem 10 in this paper. The first two lemmas are proved by direct substitution, the third by Taylor expansion; we refer the reader to the works cited for their proofs. Now we are ready to prove Theorem 15.
Proof of Theorem 15. We define $f_{p,t}(m)$ to be the random function $\sum_{i=1}^t \phi_p(\lambda_i(X_i - m))$, which is strictly decreasing in m, and consider the process analogous to $\{M_t(m)\}$ in the proof of Theorem 10. The relevant constant c is bounded by the maximum of the right-hand side, meaning that such a c does exist, and is smaller than $x_p = \frac{1}{p-1}$. So the equation (163) can be solved, and the remainder of the argument proceeds as in the proof of Theorem 10.

D Heteroscedastic Dubins-Savage and self-normalized CS
Suppose in this section that instead of Assumption 2, the following assumption holds.
Assumption 4. The process is conditionally square-integrable with an upper bound, known a priori, on the conditional variance:

$\mathbb{E}[(X_t - \mu)^2 \mid \mathcal{F}_{t-1}] \leq \sigma_t^2 \quad \text{for all } t \in \mathbb{N}^+,$

where $\{\sigma_t\}_{t\in\mathbb{N}^+}$ is a predictable, nonnegative process.
We can easily generalize Theorem 2 as follows.

Theorem 22 (Heteroscedastic Dubins-Savage confidence sequence). Let $\{\lambda_t\}_{t\in\mathbb{N}^+}$ be any predictable process. Under Assumption 1 and Assumption 4, the following intervals form a (1 − α)-confidence sequence for µ:

$\mathrm{CI}^{\mathrm{DS}'}_t = \left[\frac{\sum_{i=1}^t \lambda_i X_i}{\sum_{i=1}^t \lambda_i} \pm \frac{(2/\alpha - 1) + \sum_{i=1}^t \lambda_i^2\sigma_i^2}{\sum_{i=1}^t \lambda_i}\right].$
Theorem 23 (Heteroscedastic self-normalized confidence sequence). Define the intervals $\mathrm{aCI}^{\mathrm{SN}+'}_t$ and $\mathrm{aCI}^{\mathrm{SN}-'}_t$ by replacing the term $\frac{\sigma^2}{3}\sum_{i=1}^t \lambda_i^2$ with $\frac{1}{3}\sum_{i=1}^t \lambda_i^2\sigma_i^2$ in the definitions (21) and (22) ($U^\pm_t$ are defined back in (20)). Then, setting $\mathrm{CI}^{\mathrm{SN}'}_t = \mathbb{R} \setminus (\mathrm{aCI}^{\mathrm{SN}+'}_t \cup \mathrm{aCI}^{\mathrm{SN}-'}_t)$, we have that $\{\mathrm{CI}^{\mathrm{SN}'}_t\}$ forms a (1 − α)-confidence sequence for µ, under Assumption 1 and Assumption 4.

Figure 3: To achieve the same level of lower tightness (e.g. the time at which the lower confidence bound surpasses 0), the trivial Catoni CS needs a sample of size 880, about four times the 246 that the Catoni-style CS takes.

Figure 4: Cumulative miscoverage rates when continuously monitoring CSs and the CI under the $t_3$ distribution, which (provably) grow without bound for the Catoni CI, but are guaranteed to stay within α = 0.05 for the CSs.

Figure 5: Comparison of CI/CS growth rates at t = 250. In both figures, triangular markers denote random widths (for which we repeat 10 times to let the randomness manifest), and square markers deterministic widths; hollow markers denote the widths of CIs, while filled markers the widths of CSs. Our Catoni-style CS is among the best CSs (even CIs) in terms of tightness under small error probability α, in heavy- and light-tailed regimes alike. In the right figure, note the overlap of the Chernoff CI with the Catoni CI, as well as that of the Hoeffding-type subGaussian CS with the Catoni-style CS.
Proof of Proposition 12. By the law of the iterated logarithm,

$\limsup_{t \to \infty} \frac{\sqrt{t}\,|\hat\mu_t - \mu|}{\sqrt{2\sigma^2\log\log t}} = 1 \quad \text{almost surely}.$

With probability at least 1 − α, $\hat\mu_t, \mu \in \mathrm{CI}_t$ for every t, which implies that $|\mathrm{CI}_t| \geq |\hat\mu_t - \mu|$ for every t. Hence, with probability at least 1 − α,

$\limsup_{t \to \infty} \frac{\sqrt{t}\,|\mathrm{CI}_t|}{\sqrt{2\sigma^2\log\log t}} \geq 1.$

Lemma 18. Let $\{R^m_t\}$ and $\{S^m_t\}$ be two families of nonnegative adapted processes indexed by $m \in \mathbb{R}$, among which $\{R^\mu_t\}$ and $\{S^\mu_t\}$ are supermartingales with $R^\mu_0 = S^\mu_0 = 1$. If, almost surely, $R^m_t \geq S^m_t$ for every $m \in \mathbb{R}$, then the (1 − α)-CSs

$\mathrm{CI}_{R,t} = \{m : R^m_t \leq 1/\alpha\}, \quad \mathrm{CI}_{S,t} = \{m : S^m_t \leq 1/\alpha\} \tag{147}$

satisfy

$\mathrm{CI}_{R,t} \subseteq \mathrm{CI}_{S,t} \quad \text{a.s.} \tag{148}$

Proof of Lemma 18. First, by Ville's inequality (Lemma 4), we see that $\{\mathrm{CI}_{R,t}\}$ and $\{\mathrm{CI}_{S,t}\}$ are indeed confidence sequences for µ. Then, almost surely, for all $m \in \mathrm{CI}_{R,t}$, we have $S^m_t \leq R^m_t \leq 1/\alpha$, i.e. $m \in \mathrm{CI}_{S,t}$. Hence, almost surely, $\mathrm{CI}_{R,t} \subseteq \mathrm{CI}_{S,t}$.