Time-uniform Chernoff bounds via nonnegative supermartingales

We develop a class of exponential bounds for the probability that a martingale sequence crosses a time-dependent linear threshold. Our key insight is that it is both natural and fruitful to formulate exponential concentration inequalities in this way. We illustrate this point by presenting a single assumption and theorem that together unify and strengthen many tail bounds for martingales, including classical inequalities (1960-80) by Bernstein, Bennett, Hoeffding, and Freedman; contemporary inequalities (1980-2000) by Shorack and Wellner, Pinelis, Blackwell, van de Geer, and de la Peña; and several modern inequalities (post-2000) by Khan, Tropp, Bercu and Touati, Delyon, and others. In each of these cases, we give the strongest and most general statements to date, quantifying the time-uniform concentration of scalar, matrix, and Banach-space-valued martingales, under a variety of nonparametric assumptions in discrete and continuous time. In doing so, we bridge the gap between existing line-crossing inequalities, the sequential probability ratio test, the Cramér-Chernoff method, self-normalized processes, and other parts of the literature.


Introduction
Concentration inequalities play an important role in probability and statistics, giving non-asymptotic tail probability bounds for random variables or suprema of random processes. In this paper, we consider a method to bound the probability that a martingale ever crosses a time-dependent linear threshold. We were motivated by the fact that such bounds are the key ingredient in many sequential inference procedures. We argue, however, that this formulation is materially better for the development of exponential concentration inequalities, even in some non-sequential settings. We give a master assumption and theorem which handle all of these cases, in discrete and continuous time, for scalar-valued, matrix-valued, and smooth Banach-space-valued martingales. By unifying and organizing dozens of results, we illustrate how these results relate to one another and highlight the specific ingredients contributed by each author. Our improvements to existing results come in the form of weakened assumptions, extension of fixed-time or finite-horizon bounds to infinite-horizon uniform bounds, and improved exponents.
Our main results are presented in full generality in the following section. To motivate these results, we first contrast a small handful of well-known, concrete results from the exponential concentration literature; see Section 1.2 for a more detailed overview of the literature we draw upon. Throughout the paper, most of our results are presented for filtered probability spaces, and we use E_t to denote expectation conditional on the underlying filtration F_t at time t. For any discrete-time process (Y_t)_{t∈N}, we write ∆Y_t := Y_t − Y_{t−1} for the increments. Finally, we write H_d for the space of d × d Hermitian matrices. The relation A ⪯ B denotes the semidefinite order on H_d, while γ_max : H_d → R denotes the maximum-eigenvalue map.
Example 1. Unless indicated otherwise, let (S_t)_{t=0}^∞ be a real-valued martingale with respect to a filtration (F_t)_{t=0}^∞, with S_0 = 0.

(a) Three of the earliest and most well-known results for exponential concentration are attributed to Bernstein, Bennett, and Hoeffding. Assume the increments (∆S_t) are independent, and let v_t := ∑_{i=1}^t E(∆S_i)². We present Bernstein's inequality in a widely used form (e.g., Boucheron et al., 2013, Corollary 2.11): if, for some fixed m ∈ N and c > 0, the increments satisfy the moment condition ∑_{i=1}^m E(∆S_i)^k ≤ (k!/2) c^{k−2} v_m for all integers k ≥ 3, then for any x > 0, we have

P(S_m ≥ x) ≤ exp{−x²/(2(v_m + cx))}. (1.1)

Bernstein's moment condition is easily seen to be satisfied if the increments are bounded. Bennett (1962, eq. 8b) improved Bernstein's result for bounded increments: if ∆S_t ≤ 1 for all t, then for any x > 0 and m ∈ N, we have

P(S_m ≥ x) ≤ (v_m/(x + v_m))^{x+v_m} e^x. (1.2)

Finally, Hoeffding (1963, eq. 2.3) gave a simplified result for increments bounded from above and below: if |∆S_t| ≤ 1 for all t, then for any x > 0 and m ∈ N, we have

P(S_m ≥ x) ≤ e^{−x²/(2m)}. (1.3)

(b) Blackwell (1997, Theorem 1): if |∆S_t| ≤ 1 for all t, then for any a, b > 0, we have

P(∃t ∈ N : S_t ≥ a + bt) ≤ e^{−2ab}. (1.4)

Relative to Hoeffding's inequality, Blackwell removes the assumption of independent increments, although this possibility was noted by Hoeffding himself (Hoeffding, 1963, p. 18). More importantly, Blackwell replaces the event {S_m ≥ x} for fixed time m with the time-uniform event {∃t ∈ N : S_t ≥ a + bt}. To see that Blackwell's result recovers and strengthens that of Hoeffding, set a = x/2 and b = x/2m, and note that Blackwell's uniform bound recovers Hoeffding's bound at time t = m, so that Blackwell obtains the same probability bound for a larger event.

(c) Freedman (1975, Theorem 1.6): if |∆S_t| ≤ 1 for all t, then writing V_t := ∑_{i=1}^t Var(∆S_i | F_{i−1}), for any x, m > 0 we have

P(∃t ∈ N : V_t ≤ m and S_t ≥ x) ≤ (m/(x + m))^{x+m} e^x.
(1.5)

Similar to Bernstein's and Bennett's inequalities, but unlike those of Hoeffding and Blackwell, Freedman's inequality measures time in terms of a predictable quantity, the accumulated conditional variance V_t, rather than simply the number of observations t. Freedman's inequality bounds the deviations of (S_t) uniformly over time, but only up to the finite time horizon implied by V_t ≤ m.

(d) de la Peña (1999, Theorem 6.2, eq. 6.4): if the increments are conditionally symmetric, that is, ∆S_t ∼ −∆S_t | F_{t−1} for all t, then letting V_t = ∑_{i=1}^t ∆S_i², for any α ≥ 0 and β, x, m > 0 we have

P(∃t ∈ N : V_t ≥ m and S_t/(α + βV_t) ≥ x) ≤ exp{−x²(αβ + β²m/2)}. (1.6)

A remarkable feature of this result is that we measure time via the adapted quantity V_t. Unlike Freedman's inequality, which uses the true conditional variance to measure time, de la Peña's inequality relies only on empirical quantities. In further contrast to Freedman's inequality, de la Peña's bound holds uniformly over V_t ≥ m rather than V_t ≤ m, and we bound the deviations of the self-normalized process S_t/(α + βV_t).
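Two of the closed forms above can be checked numerically: Blackwell's time-uniform bound (1.4) with a = x/2 and b = x/2m matches Hoeffding's fixed-time bound (1.3) exactly, and Freedman's bound (1.5) coincides with the sub-Poisson Chernoff form exp{−mψ*(x/m)}, where ψ*(u) = (1+u)log(1+u) − u is the Legendre-Fenchel transform of the centered Poisson CGF (the helper names below are ours, for illustration; the sub-Poisson terminology is made precise later in the paper):

```python
import math

def hoeffding_bound(x, m):
    # (1.3): P(S_m >= x) <= exp(-x^2 / (2m)) when |dS_t| <= 1
    return math.exp(-x * x / (2 * m))

def blackwell_bound(a, b):
    # (1.4): P(exists t: S_t >= a + b*t) <= exp(-2ab) when |dS_t| <= 1
    return math.exp(-2 * a * b)

def freedman_bound(x, m):
    # (1.5): (m/(x+m))^(x+m) * e^x
    return (m / (x + m)) ** (x + m) * math.exp(x)

def psi_star_poisson(u):
    # Legendre-Fenchel transform of psi_P(lam) = e^lam - lam - 1
    return (1 + u) * math.log(1 + u) - u

# Blackwell's line through (m, x) recovers Hoeffding's bound at time m:
x, m = 3.0, 25.0
assert math.isclose(blackwell_bound(x / 2, x / (2 * m)), hoeffding_bound(x, m))

# Freedman's bound equals the sub-Poisson Chernoff bound exp{-m psi*(x/m)}:
for x, m in [(1.0, 4.0), (2.5, 10.0), (0.3, 1.0)]:
    assert math.isclose(freedman_bound(x, m),
                        math.exp(-m * psi_star_poisson(x / m)))
```

The first identity is the observation in part (b) above; the second rewrites the product form of (1.5) as a single exponential.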
(e) Tropp (2012, Theorem 6.2): departing from the above results for real-valued martingales, here we begin with a martingale (Y_t)_{t∈N} taking values in H_d. Assume that the increments ∆Y_t are independent and, for some fixed c > 0 and H_d-valued sequence (W_t)_{t∈N}, the moments of the increments satisfy

E(∆Y_t^k | F_{t−1}) ⪯ (k!/2) c^{k−2} ∆W_t for all t and all integers k ≥ 2.

Then, writing S_t = γ_max(Y_t) and V_t = γ_max(W_t), for any x > 0 and t ≥ 1, we have

P(S_t ≥ x) ≤ d · exp{−x²/(2(V_t + cx))}. (1.7)

This elegant result extends Bernstein's inequality to the matrix setting. Note the prefactor of d that appears when we bound the deviations of the maximum eigenvalue of a d × d matrix-valued process.

(f) Finally, we recall a textbook result for Brownian motion (e.g., Durrett, 2017, Exercise 7.5.2): if (S_t)_{t∈(0,∞)} is a standard Brownian motion, then for any a, b > 0, we have

P(∃t ∈ (0, ∞) : S_t ≥ a + bt) = e^{−2ab}. (1.8)

The result closely resembles Blackwell's inequality for discrete-time martingales with bounded increments, but here we have an equality.
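A seeded Monte Carlo sketch of the discrete analogue: for a ±1 random walk (bounded increments), the empirical frequency of ever crossing the line a + bt over a long finite horizon should respect Blackwell's bound e^{−2ab}. This simulation is ours, purely for illustration; the finite horizon undercounts the infinite-horizon event, so the check is one-sided:

```python
import math, random

random.seed(7)

def crosses_line(a, b, steps):
    # Does a fair ±1 random walk ever reach the line a + b*t within `steps` steps?
    s = 0
    for t in range(1, steps + 1):
        s += random.choice((-1, 1))
        if s >= a + b * t:
            return True
    return False

a, b, n = 3.0, 0.5, 2000
empirical = sum(crosses_line(a, b, 200) for _ in range(n)) / n
# Blackwell's bound (1.4) caps the infinite-horizon crossing probability;
# the small slack allows for Monte Carlo error.
assert empirical <= math.exp(-2 * a * b) + 0.03
```

With these parameters e^{−2ab} = e^{−3} ≈ 0.05, and the observed frequency sits comfortably below it.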
Clearly, these results have much in common with each other and with myriad other results from the exponential concentration literature. Examining the proofs, we find many shared ingredients which are now well known: the notions of sub-Gaussian and sub-exponential random variables, the Cramér-Chernoff method, the large-deviations supermartingale, and so on. Nonetheless, there are enough differences among the results and their proofs to leave us wondering whether these results are merely similar in appearance, or whether they are all special cases of some underlying, general argument.
In this paper, we provide a framework that formally unifies the above results along with many others. Our framework consists of two pieces. First, we crystallize the notion of a sub-ψ process (Definition 1), a sufficient condition general enough to encompass a broad set of results not previously treated together, yet specific enough to derive a useful set of equivalent concentration inequalities. This definition provides a convenient categorization of exponential concentration results into sub-Bernoulli, sub-Gaussian, sub-Poisson, sub-exponential, and sub-gamma bounds. Second, we give a generalization of the Cramér-Chernoff argument, Theorem 1. This result yields strengthened versions of many existing inequalities and illustrates equivalences among different forms of exponential bounds. For example, Theorem 1 strengthens both "Freedman-style" inequalities such as (1.5) and "de la Peña-style" inequalities such as (1.6) to hold uniformly over all time, and in these strengthened forms, the two styles of inequalities are shown to be equivalent, as depicted in Figure 1. We remark that the seminal works from which these examples are drawn, like others referenced below, include many other important contributions, and our claims about Theorem 1 refer only to the particular inequalities cited from each work. Once the framework is in place, the proof of the main result follows using tools from classical large-deviation theory (Dembo and Zeitouni, 2010). We construct a nonnegative supermartingale and obtain a bound on its entire trajectory using Ville's maximal inequality (Ville, 1939). We invoke Tropp's ideas (Tropp, 2011) to extend the results to the matrix setting. The equivalences that follow from optimizing linear bounds are obtained using convex analysis (Rockafellar, 1970). By drawing together various proof ingredients from different sources, we elucidate previously unrecognized or understated connections.
For example, we demonstrate how self-normalized matrix inequalities follow easily upon combining ideas from the literature on self-normalized processes with those from matrix concentration.

Paper organization
Section 2 lays out our framework for exponential line-crossing inequalities. Specifically, we formally state Definition 1 and Theorem 1 that together describe a general formulation of the Cramér-Chernoff method. After stating Theorem 1, we give a quick overview of existing results which can be recovered in our framework and the improvements thus obtained. A short proof of our master theorem comes next, and following some remarks, we provide three illustrative examples.
Sections 3 and 4 are devoted to a catalog of important results from the literature which fit into our framework, often yielding results which are stronger than those originally published. In Section 3, we consider the maximum-eigenvalue process of a matrix-valued martingale and enumerate useful sufficient conditions for such a process to be sub-ψ, collecting and in some cases generalizing a variety of ingenious results from the literature. Section 4 examines various instantiations of our master theorem, obtaining corollaries by combining one of the sufficient conditions from Section 3 with one of the four equivalent conclusions of Theorem 1. These illustrate how our framework recovers and strengthens existing exponential concentration results. We discuss sharpness, another geometrical insight, and future work in Section 5. Proofs of most results are in Section 6.

Historical context
To aid the reader, we give here some historical context for the existing results discussed below. This is not intended to be a comprehensive history of the literature on exponential concentration, and we focus on the specific results discussed in Section 4, giving pointers to further references as appropriate.
The Cramér-Chernoff method takes its name from the works of Cramér (1938) and Chernoff (1952). Both of these authors were concerned with a precise characterization of the asymptotic decay of tail probabilities beyond the regime in which the central limit theorem applies; Cramér provided the first proof of such a "large deviation principle", while Chernoff gave a more general formulation and placed more emphasis on the non-asymptotic upper bound which is our focus. These results spawned a vast literature on large deviation principles, with the goal of giving sharp upper and lower bounds on the limiting exponential decay of certain probabilities under a sequence of measures; see Dembo and Zeitouni (2010) for an excellent presentation of this literature. Our focus, on non-asymptotic upper bounds for nonparametric classes of distributions, is rather different, though such upper bounds often make an appearance in proofs of large deviation principles.
Bernstein was perhaps the earliest proponent of the sort of exponential tail bounds that are the focus of this paper, having proposed his famous inequality in 1911, according to Prokhorov (1995); see also Craig (1933) and Uspensky (1937, ch. 10, ex. 12-14, pp. 204-205), as well as a further source that appears rather inaccessible. The modern theory of exponential concentration began to take shape in the 1960s, as (using the terminology of this paper, from Section 3) Bennett (1962) improved Bernstein's sub-gamma inequality to sub-Bernoulli and sub-Poisson ones for random variables bounded from above, and Hoeffding (1963) gave alternative sub-Bernoulli and sub-Gaussian bounds for random variables bounded from both above and below. For further references on this line of work, see Boucheron et al. (2013), whose treatment of the Cramér-Chernoff method has been invaluable in formulating our own framework, as well as McDiarmid (1998). Godwin (1955, p. 936) reports that Bernstein generalized his inequality to dependent random variables. Hoeffding (1963, pp. 17-18) considered the generalization of his sub-Bernoulli and sub-Gaussian bounds to martingales and the possibility of finite-horizon uniform inequalities based on Doob's maximal inequality; the martingale generalization was later explored by Azuma (1967). Freedman (1975) extended Bennett's sub-Poisson bound to martingales, giving a uniform bound subject to a maximum value of the predictable quadratic variation of the martingale. This "Freedman-style" bound has been generalized to other settings in many subsequent works (e.g., Tropp, 2011; Fan et al., 2015). Chen (2012a,b) has considered the extension of Chernoff-style bounds to hold uniformly over time for scalar-valued martingales in a manner similar to our line-crossing inequalities, including a condition similar to our sub-ψ definition; our formulation further encompasses matrix-valued processes and self-normalized inequalities.
The extension of these methods to matrix-valued processes, via control of the matrix moment-generating function, originated with Ahlswede and Winter (2002). The method was refined by Christofides and Markström (2007), Oliveira (2010a,b) and then by Tropp (2011, 2012), whose influential treatment synthesized and improved upon past work, generalizing many scalar exponential inequalities to operator-norm inequalities for matrix martingales. We have incorporated Tropp's formulation into our framework, and we focus on his theorem statements for our matrix bound statements. See Tropp (2015) for a recent exposition and further references.
There is a long history of investigation of the concentration of Student's t-statistic under non-normal sampling. Efron (1969) gives many references to early work. He also shows, by making use of Hoeffding's sub-Gaussian bound, that the equivalent self-normalized statistic (∑_i X_i)/√(∑_i X_i²) satisfies a 1-sub-Gaussian tail bound whenever the X_i satisfy a symmetry condition, a result he attributes to Bahadur and Eaton (Efron, 1969, p. 1284). Starting with Logan et al. (1973), there has been a great deal of work on limiting distributions and large deviation principles for self-normalized statistics; see Shao (1997) and references therein. In terms of exponential tail bounds, de la Peña (1999) explored general conditions for bounding the deviations of a martingale, introduced new decoupling techniques (cf. de la Peña and Giné, 1999), and showed that any martingale with conditionally symmetric increments satisfies a self-normalized sub-Gaussian bound with no integrability condition. This work laid the foundation for the type of self-normalized exponential inequalities which we explore in this paper. These methods were extended by de la Peña et al. (2000, 2004), which introduced a general supermartingale "canonical assumption" that is a key precursor of our sub-ψ condition, and initiated a flurry of subsequent activity on self-normalized exponential inequalities (cf. de la Peña et al., 2007; de la Peña, Klass and Lai, 2009). We note in particular inequality (3.9) of de la Peña et al. (2001), which gives an infinite-horizon boundary-crossing inequality based on a mixture extension of their canonical assumption, as well as the multivariate inequalities (3.24) (for a t-statistic) and (3.29) (for general mixture boundaries) given by de la Peña, Klass and Lai (2009).
Bercu and Touati (2008) gave a self-normalized sub-Gaussian bound without symmetry by incorporating the conditional quadratic variation, requiring only finite second moments, and some ingenious further extensions have been given by Delyon, Fan et al. (2015), and Bercu et al. (2015), many of which we include in our collection of sufficient conditions for a process to be sub-ψ (Section 3.2). See de la Peña, Lai and Shao (2009) and Bercu et al. (2015) for further references.

Main results
Let (S t ) t∈T ∪{0} be a real-valued process adapted to an underlying filtration (F t ) t∈T ∪{0} , where either T = N for discrete-time processes or T = (0, ∞) for continuous-time processes. In continuous time, we assume (F t ) satisfies the "usual hypotheses", namely, that it is right-continuous and complete, and we assume (S t ) is càdlàg; see, e.g., Protter (2005). In a statistical setting, we may think of (S t ) as a summary statistic accumulating over time, for example a cumulative sum of observations, whose deviations from zero we would like to bound under some null hypothesis. In this setting, a bound on the deviations of (S t ) holding uniformly over time can be used to construct an appropriate sequential hypothesis test, a special case of which is Wald's sequential probability ratio test discussed in Section 4.6. We first explain our key condition on (S t ), the sub-ψ condition. We then state, prove, and interpret our master theorem, followed by some more detailed examples of its application.

The sub-ψ condition
Our key condition on (S_t) is stated in terms of two additional objects. The first object is a real-valued, nondecreasing process (V_t)_{t∈T∪{0}}, also adapted to (F_t) (and càdlàg in the continuous-time case). It is an "accumulated variance" process which serves as a measure of intrinsic time, an appropriate quantity with which to control the deviations of S_t from zero (Blackwell and Freedman, 1973). The second object is a function ψ : R_{≥0} → R, reminiscent of a cumulant-generating function (CGF), which quantifies the relationship between S_t and V_t. The simplest case is when S_t is a cumulative sum of i.i.d., real-valued, mean-zero random variables with distribution F, in which case we take V_t = t and let ψ(λ) = log ∫ e^{λx} dF(x) be the CGF of F. Our key condition requires that S_t is unlikely to grow too quickly relative to intrinsic time V_t; it generalizes developments from de la Peña et al. (2004), Tropp (2011), Chen (2012b), and others.
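A minimal instance of this simplest case: for i.i.d. fair ±1 increments, the CGF is ψ(λ) = log cosh(λ), and since log cosh(λ) ≤ λ²/2 for all λ, the same pair (S_t, t) also satisfies the analogous condition with the Gaussian CGF λ²/2 (a sketch with our own helper names, anticipating the sub-Gaussian terminology below):

```python
import math

def rademacher_cgf(lam):
    # CGF of a fair ±1 increment: log E[e^{lam * X}] = log cosh(lam)
    return math.log(math.cosh(lam))

# exp{lam*S_t - psi(lam)*t} is a martingale for psi = rademacher_cgf; since
# log cosh(lam) <= lam^2/2, the same process with psi_N(lam) = lam^2/2 is a
# (nonnegative) supermartingale, the key object in Definition 1 below.
for lam in [0.1, 0.5, 1.0, 2.0]:
    assert rademacher_cgf(lam) <= lam ** 2 / 2
```

The inequality log cosh(λ) ≤ λ²/2 is the standard fact behind treating bounded-increment walks as sub-Gaussian.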
Definition 1 (Sub-ψ process). Let (S_t)_{t∈T∪{0}} and (V_t)_{t∈T∪{0}} be two real-valued processes adapted to an underlying filtration (F_t)_{t∈T∪{0}} with S_0 = V_0 = 0 a.s. and V_t ≥ 0 a.s. for all t ∈ T. For a function ψ : [0, λ_max) → R and a scalar l_0 ∈ [1, ∞), we say (S_t) is l_0-sub-ψ with variance process (V_t) if, for each λ ∈ [0, λ_max), there exists a supermartingale (L_t(λ))_{t∈T∪{0}} with respect to (F_t) such that L_0(λ) ≤ l_0 a.s. and

exp{λS_t − ψ(λ)V_t} ≤ L_t(λ) a.s. for all t ∈ T. (2.1)

For given ψ and l_0, we write S^{l_0}_ψ for the class of pairs of l_0-sub-ψ processes:

S^{l_0}_ψ := {(S_t, V_t) : (S_t) is l_0-sub-ψ with variance process (V_t)}.

We often say simply that a process is sub-ψ, omitting l_0 from our terminology for simplicity. All examples considered in this paper fit into three cases for the value of l_0: l_0 = 1, when deriving one-sided bounds on scalar martingales; l_0 = 2, when deriving bounds on the norm of certain Banach-space-valued martingales; or l_0 = d, when deriving bounds on the maximum-eigenvalue process of a d × d matrix-valued martingale. Also, though we often speak of a process (S_t) being sub-ψ, the sub-ψ condition formally applies to the pair (S_t, V_t) and not to the process (S_t) alone, so that meaningful statements are always made in the context of a specific intrinsic time process (V_t).
Definition 1 may at first defy intuition. We can motivate it from several angles:

• Suppose S_t is a scalar-valued martingale whose deviations we wish to bound uniformly over time. We might like to apply Ville's maximal inequality (see Section 2.3), but must first transform S_t into a nonnegative supermartingale. It is natural to consider the exponential transform e^{λS_t} for some λ > 0, which immediately yields a submartingale. Our task, then, is to find some appropriate ψ and (V_t) which "pull down" the submartingale so that the process exp{λS_t − ψ(λ)V_t} is a supermartingale. Intuitively, this exponential process measures how quickly S_t has grown relative to intrinsic time V_t, and the free parameter λ determines the relative emphasis placed on the tails of the distribution of S_t, i.e., on the higher moments. Larger values of λ exaggerate larger movements in S_t, and ψ captures how much we must correspondingly exaggerate V_t.

• Consider again the simple case in which S_t is a cumulative sum of i.i.d. draws from a distribution F over the reals with mean zero and CGF ψ(λ) < ∞ for λ ∈ [0, λ_max). Then, setting V_t = t, we may take L_t(λ) equal to the exponential process exp{λS_t − ψ(λ)t}, which is a martingale in this case, so that the defining inequality of Definition 1 is an equality. The exponential process may be interpreted as the likelihood ratio in an exponential family containing F with sufficient statistic S_t. See Example 2 for a more detailed exposition of this setting and Section 4.6 for more on the connection with exponential families.

• Alternatively, we may begin from the martingale method for concentration inequalities (Azuma, 1967; McDiarmid, 1998; Raginsky and Sason, 2012, section 2.2), itself based on the classical Cramér-Chernoff method (Cramér, 1938; Chernoff, 1952; Boucheron et al., 2013, section 2.2). The martingale method starts from an assumption such as

E(e^{λ(X_t − E(X_t | F_{t−1}))} | F_{t−1}) ≤ e^{ψ(λ)σ_t²} for all t ≥ 1 and λ ∈ [0, λ_max).

When ψ(λ) = λ²/2 and λ_max = ∞ (and the condition holds for λ < 0 as well), this is the definition of a conditionally sub-Gaussian random variable with variance parameter σ_t². When ψ(λ) = λ²/(2(1 − cλ)) and λ_max = 1/c, we have the definition of a random variable which is conditionally sub-gamma on the right tail with variance parameter σ_t² and scale parameter c (Boucheron et al., 2013).
For example, if |∆S_t| ≤ 1 for all t, then (S_t) is 1-sub-ψ with ψ(λ) = λ²/2; this fact underlies Example 1(a,b). Or, if ∆S_t ≤ 1 for all t, then (S_t) is 1-sub-ψ with ψ(λ) = e^λ − λ − 1 on λ ∈ [0, ∞), a fact which leads to Example 1(c).

• Unlike the martingale method assumption, Definition 1 allows (V_t) to be adapted rather than predictable, which leads to a variety of self-normalized inequalities (de la Peña et al., 2004; de la Peña, Lai and Shao, 2009; Bercu et al., 2015; Fan et al., 2015), for example yielding bounds on the deviation of a martingale in terms of its quadratic variation. In this context, Definition 1 is closely related to the "canonical assumption" of de la Peña et al. (2004, eq. 1.6), which requires that exp{λS_t − Φ(λV_t)} is a supermartingale for certain nonnegative, strictly convex functions Φ. We have found it more useful to separate the second term into ψ(λ)V_t, though both formulations yield interesting results. For example, a martingale with conditionally symmetric increments is 1-sub-ψ with ψ(λ) = λ²/2 and the adapted variance process V_t = ∑_{i≤t} ∆S_i², from which we may obtain Example 1(d).
• Also in contrast to de la Peña et al. (2004), we allow the exponential process to be merely upper bounded by a supermartingale, rather than being a supermartingale itself; this permits us to handle bounds on the maximum eigenvalue process of a matrix-valued martingale, using techniques from Tropp (2011). For example, under the conditions of Example 1(e), the maximum eigenvalue process S_t = γ_max(Y_t) is d-sub-ψ with variance process V_t = γ_max(W_t). In this case, the exponential process exp{λS_t − ψ(λ)V_t} is not a supermartingale, but is upper bounded by the trace-exponential supermartingale tr exp{λY_t − ψ(λ)W_t}. The initial value of this trace-exponential process is l_0 = d, which leads to the prefactor of d in the bound (1.7).
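The mechanism behind the prefactor can be checked directly in the 2 × 2 symmetric case: every eigenvalue of a matrix exponential is positive, so the maximum eigenvalue of exp(A) is sandwiched between tr exp(A)/d and tr exp(A). The sketch below is self-contained, using the standard closed-form eigenvalues of a symmetric 2 × 2 matrix:

```python
import math

def sym2x2_eigs(a, b, c):
    # Eigenvalues of the symmetric matrix [[a, c], [c, b]]
    mean, rad = (a + b) / 2, math.hypot((a - b) / 2, c)
    return mean - rad, mean + rad

# For Hermitian A, exp(A) has eigenvalues e^{lam_i} > 0, so
# gamma_max(exp(A)) = e^{gamma_max(A)} <= sum_i e^{lam_i} = tr(exp(A))
#                  <= d * e^{gamma_max(A)}.
for (a, b, c) in [(1.0, -2.0, 0.5), (0.0, 0.0, 3.0), (2.0, 2.0, -1.0)]:
    lo, hi = sym2x2_eigs(a, b, c)
    gamma_max_exp = math.exp(hi)
    trace_exp = math.exp(lo) + math.exp(hi)
    assert gamma_max_exp <= trace_exp <= 2 * gamma_max_exp
```

Bounding γ_max exp by tr exp is what replaces the scalar exponential supermartingale in the matrix case, at the cost of the initial value d.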
Section 3 collects a variety of sufficient conditions from the literature for a process to be sub-ψ, including all of the examples given above. These conditions illustrate the broad applicability of Definition 1 in nonparametric settings, i.e., those which restrict the distribution of (S_t) to some infinite-dimensional class, for example all processes with bounded increments, or with increments having finite variance. Even in such nonparametric cases, ψ is still a CGF of some distribution in all of our examples, though this is not required for the most basic conclusion of Theorem 1. Indeed, the full force of Theorem 1 comes into effect only when ψ satisfies certain properties which hold for CGFs of zero-mean, non-constant random variables (Jorgensen, 1997); we call such functions CGF-like (Definition 2), writing λ_max for the supremum of the domain of ψ and b̄ := lim_{λ↑λ_max} ψ(λ)/λ for the limiting slope. In many typical cases we have λ_max = ∞ and b̄ = ∞. With Definitions 1 and 2 in place, we are ready to set up and state our main result in the following section.

The master theorem
To state our main theorem on general exponential line-crossing inequalities, we will make use of the following transforms of ψ:

The Legendre-Fenchel transform: ψ*(u) := sup_{λ∈[0,λ_max)} [λu − ψ(λ)], for u ≥ 0. (2.4)

The "decay" transform: D(u) := sup{λ ∈ (0, λ_max) : ψ(λ)/λ ≤ u}. (2.5)

The "slope" transform: s(u) := ψ(ψ'^{−1}(u))/ψ'^{−1}(u), for u ∈ (0, b̄). (2.6)

In the definition of D(u), we take the supremum of the empty set to equal zero instead of the usual −∞. For u > 0, this case can arise in general, but not when ψ is CGF-like. Note that D(u) can also be infinite. We call D(u) the "decay" transform because it determines the rate of exponential decay of the upcrossing probability bound in Theorem 1(a) below. We call s(u) the "slope" transform because it gives the slope of the linear boundary in Theorem 1(b); this is defined only when ψ is CGF-like, in which case ψ' is strictly increasing and the inverse ψ'^{−1} exists. Defining s(0) = 0 and s(b̄) = b̄ when b̄ < ∞, we find that s(u) is continuous, strictly increasing, and 0 ≤ s(u) < u on u ∈ [0, b̄) (see Lemma 2). We do not know of other references for the slope transform, or other situations where it arises naturally. Table 2 gives examples of these transforms for some common ψ functions.
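For the sub-Gaussian case ψ_N(λ) = λ²/2, these transforms work out to ψ*(u) = u²/2, D(u) = 2u, and s(u) = u/2. The sketch below recovers the first two by brute-force grid search (the grid and helper names are ours, for illustration only):

```python
import math

PSI = lambda lam: lam ** 2 / 2                    # sub-Gaussian psi_N
GRID = [i / 10000 for i in range(1, 200001)]      # lambda grid on (0, 20]

def psi_star(u):
    # Legendre-Fenchel transform: sup over the grid of lam*u - PSI(lam)
    return max(lam * u - PSI(lam) for lam in GRID)

def decay(u):
    # D(u) = sup{lam : PSI(lam)/lam <= u}; the empty set maps to 0
    return max((lam for lam in GRID if PSI(lam) / lam <= u), default=0.0)

for u in [0.5, 1.0, 3.0]:
    assert math.isclose(psi_star(u), u ** 2 / 2, rel_tol=1e-3)  # psi*(u) = u^2/2
    assert math.isclose(decay(u), 2 * u, rel_tol=1e-3)          # D(u) = 2u
    # Here psi'(lam) = lam, so the slope transform is
    # s(u) = PSI(u)/u = u/2.
```

These closed forms are what make the sub-Gaussian rows of Table 2 especially simple.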
Our main theorem has four parts, each of which facilitates comparisons with a particular related literature, as we discuss in Section 4. Recall Definition 1 of the class S^{l_0}_ψ of l_0-sub-ψ processes, and the underlying filtration (F_t) to which the processes (S_t) and (V_t) are adapted.
Theorem 1. Suppose (S_t, V_t) ∈ S^{l_0}_ψ.

(a) For any a, b > 0, we have

P(∃t ∈ T : S_t ≥ a + bV_t | F_0) ≤ l_0 e^{−aD(b)}. (2.7)

Additionally, whenever ψ is CGF-like, the following three statements are equivalent to statement (a).

(b) For any m > 0 and x ∈ (0, mb̄), we have

P(∃t ∈ T : S_t ≥ x + s(x/m)(V_t − m) | F_0) ≤ l_0 exp{−mψ*(x/m)}. (2.8)

(c) For any m > 0 and x ∈ (0, b̄), we have

P(∃t ∈ T : S_t/V_t ≥ s(x) + m(x − s(x))/V_t | F_0) ≤ l_0 exp{−mψ*(x)}. (2.9)

(d) For any b ≥ 0, m > 0 and x > bm, we have

P(∃t ∈ T : V_t ≥ m and S_t ≥ x + b(V_t − m) | F_0) ≤ l_0 e^{−(x−bm)D(b)}, and the bound improves to l_0 exp{−mψ*(x/m)} whenever x ≤ mb̄ and s(x/m) ≤ b. (2.10)

We give a straightforward proof in Section 2.3 that uses only Ville's maximal inequality for nonnegative supermartingales (Ville, 1939) and elementary convex analysis. Theorem 1 can be seen to unify and strengthen many known exponential bounds, showing that we lose nothing in going from a fixed-time to a uniform bound. This includes classical inequalities by Hoeffding (Corollary 1a), Bennett and Freedman (Corollary 1b), and Bernstein (Corollary 1c), along with their matrix extensions due to Tropp; discrete-time scalar line-crossing inequalities due to Blackwell (Corollaries 4 and 5) and Khan (Section 4.2); self-normalized bounds due to de la Peña (Corollaries 6 and 7), Delyon (Corollary 8), Bercu and Touati (Corollary 8), and Fan (Corollary 9); bounds for martingales in smooth Banach spaces due to Pinelis (Corollary 10); continuous-time bounds due to Shorack and Wellner (Corollary 11) and van de Geer (Corollary 11); and Wald's sequential probability ratio test (Corollary 12). Visualizations of how the bounds of Theorem 1 relate to Freedman's and de la Peña's inequalities are provided in Figures 4 and 5. For convenience, Table 1 lists the existing results we recover and our corresponding corollaries, along with ways in which our analysis strengthens their conclusions.
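A numeric illustration of the relationship between the line-crossing and optimized forms, in the sub-Gaussian case ψ_N(λ) = λ²/2 with l_0 = 1 (where D(b) = 2b and ψ*(u) = u²/2): minimizing the part-(a) bound over all lines a + bV_t passing through a fixed point (m, x) reproduces the part-(b) bound. The grid search and function names below are ours, for illustration only:

```python
import math

# Sub-Gaussian case: psi_N(lam) = lam^2/2, so D(b) = 2b and psi*(u) = u^2/2.
def part_a_bound(a, b):
    return math.exp(-a * 2 * b)              # exp{-a D(b)}

def part_b_bound(x, m):
    return math.exp(-m * (x / m) ** 2 / 2)   # exp{-m psi*(x/m)}

x, m = 3.0, 25.0
# Minimize the part-(a) bound over lines through (m, x), i.e., with
# intercept a = x - b*m and slope b in (0, x/m):
best = min(part_a_bound(x - b * m, b)
           for b in [i / 10000 for i in range(1, 1200)])
assert math.isclose(best, part_b_bound(x, m), rel_tol=1e-4)
```

The minimizing slope here is b = x/(2m), which is exactly s(x/m) = (x/m)/2 in this case.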
For the remainder of the paper after Section 2.3, we will assume F 0 is the trivial σ-field and omit from our notation the conditioning on F 0 in the results of Theorem 1 and its corollaries.

Proof of Theorem 1
Throughout the proof, we write P_0(·) for the conditional probability P(· | F_0). Ville's maximal inequality for nonnegative supermartingales (Ville, 1939) is the foundation of all uniform bounds in this paper. It is an infinite-horizon uniform extension of Markov's inequality:

Lemma 1 (Ville's inequality). If (L_t)_{t∈T∪{0}} is a nonnegative supermartingale with respect to the filtration (F_t)_{t∈T∪{0}}, then for any a > 0, we have

P(∃t ∈ T : L_t ≥ a | F_0) ≤ L_0/a. (2.11)

The following codes are used in Table 1 to indicate the ways in which our analysis strengthens each existing result:

[A] Assumptions: we recover the result under weaker conditions on the distributional or dependence structure of the process.

[B] Boundary: we strengthen the result by replacing a fixed-time bound or a finite-horizon constant uniform boundary with an infinite-horizon linear uniform boundary which is everywhere at least as strong (i.e., low) as the fixed-time or finite-horizon bound.
[C] Continuous time: we extend a discrete-time result to include continuous time.
[D] Dimension: we extend a result for a scalar process to one for H_d-valued processes, recovering the scalar result at d = 1.
[E] Exponent: we improve the exponent in the result's probability bound.
For completeness, we give an elementary proof of Lemma 1 in Section 6.1. Applying Ville's inequality to the supermartingale (L_t(λ)) of Definition 1 gives, for any (S_t, V_t) ∈ S^{l_0}_ψ, λ ∈ (0, λ_max), and z ∈ R,

P(∃t ∈ T : λS_t − ψ(λ)V_t ≥ z | F_0) ≤ l_0 e^{−z}. (2.12)

To derive Theorem 1(a) from (2.12), fix a, b > 0 and choose λ ∈ [0, λ_max) such that ψ(λ) ≤ bλ, supposing for the moment that some such value of λ exists. Then, since V_t ≥ 0,

P(∃t ∈ T : S_t ≥ a + bV_t | F_0) ≤ P(∃t ∈ T : λS_t − ψ(λ)V_t ≥ λa | F_0) ≤ l_0 e^{−λa},

applying (2.12) in the last step. This bound holds for all choices of λ in the set {λ ∈ [0, λ_max) : ψ(λ)/λ ≤ b}, so to minimize the final bound, we take the supremum over this set, recovering the stated bound l_0 e^{−aD(b)} by the definition of D(b). If no value λ ∈ [0, λ_max) satisfies ψ(λ) ≤ bλ, then D(b) = 0 by definition, so that the bound holds trivially. This shows that Definition 1 implies Theorem 1(a).
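Ville's inequality is easy to probe by simulation: below, a seeded ±1 random walk drives the process L_t = exp{λS_t − (λ²/2)t}, which is a nonnegative supermartingale because log cosh(λ) ≤ λ²/2, and the empirical frequency of L_t ever reaching level a stays below L_0/a = 1/a. This illustrative sketch and its names are ours:

```python
import math, random

random.seed(1)

def sup_exceeds(lam, a, steps):
    # Track L_t = exp{lam*S_t - (lam^2/2)*t} along a fair ±1 walk; this is a
    # supermartingale since log cosh(lam) <= lam^2/2.  Return True if L_t
    # ever reaches level a within the horizon.
    s = 0.0
    for t in range(1, steps + 1):
        s += random.choice((-1, 1))
        if math.exp(lam * s - lam ** 2 * t / 2) >= a:
            return True
    return False

lam, a, n = 0.8, 5.0, 2000
empirical = sum(sup_exceeds(lam, a, 300) for _ in range(n)) / n
# Ville's inequality (Lemma 1): P(exists t: L_t >= a) <= L_0/a = 1/a,
# with a small slack for Monte Carlo error.
assert empirical <= 1 / a + 0.03
```

The slack in the bound here reflects that log cosh(λ) is strictly below λ²/2, so the supermartingale decays and the true crossing probability is well under 1/a.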
To complete the proof, we will show that the four parts of Theorem 1 are equivalent whenever ψ is CGF-like. We repeatedly use the well-known fact about the Legendre-Fenchel transform that ψ'^{−1}(u) = ψ*'(u) for 0 < u < b̄, which follows by differentiating the identity ψ*(u) = uψ'^{−1}(u) − ψ(ψ'^{−1}(u)). We also require some simple facts about the function λ ↦ ψ(λ)/λ, collected in the following lemma.

Lemma 2. Suppose ψ is CGF-like. Then (i) λψ'(λ) > ψ(λ) for all λ ∈ (0, λ_max); (ii) λ ↦ ψ(λ)/λ is continuous and strictly increasing on (0, λ_max); (iii) ψ(λ)/λ ↓ 0 as λ ↓ 0; (iv) ψ(λ)/λ ↑ b̄ as λ ↑ λ_max; (v) for u ∈ (0, b̄), D(u) is the unique value of λ satisfying ψ(λ)/λ = u; and (vi) s is continuous and strictly increasing on (0, b̄).

For (i), write ψ(λ) = ψ(λ) − ψ(0) = ∫_0^λ ψ'(v) dv < λψ'(λ), where the inequality follows since ψ is strictly convex so that ψ' is strictly increasing. For (ii), the function is continuous because ψ is continuous, and differentiating reveals it to be strictly increasing by part (i). L'Hôpital's rule implies (iii) along with the assumptions ψ(0) = ψ'(0) = 0, and implies (iv) along with the CGF-like assumption sup_λ ψ(λ) = ∞, which means ψ(λ) ↑ ∞ as λ ↑ λ_max since ψ is convex. Part (v) follows from the definition of D(·) and parts (ii), (iii) and (iv). To obtain (vi), note that s is the composition of λ ↦ ψ(λ)/λ with ψ*'. Both of these are continuous and strictly increasing, the former by part (ii) and the latter since ψ*' = ψ'^{−1} and ψ' is continuous and strictly increasing by the CGF-like assumption. Lemma 2 allows us to prove the equivalences among the parts of Theorem 1 as follows.
• (a) ⇒ (b): The line with slope b and intercept x − bm passes through the point (m, x) in the (V_t, S_t)-plane, and part (a), together with Lemma 2(v), yields

P(∃t ∈ T : S_t ≥ (x − bm) + bV_t | F_0) ≤ l_0 e^{−(x−bm)D(b)}.

Now we choose the slope b to minimize the probability bound. The unconstrained optimizer b* satisfies ψ'(D(b*)) = x/m, and a solution is guaranteed to exist by our restriction on x. This solution is given by b* = s(x/m), and substituting it gives the exponent (x − b*m)D(b*) = mψ*(x/m), which is part (b).

• (b) ⇒ (c): Now applying part (b) with values m and mx yields part (c): dividing the resulting line mx + s(x)(V_t − m) through by V_t gives the curve s(x) + m(x − s(x))/V_t in the (V_t, S_t/V_t)-plane, and recognizing the Legendre-Fenchel transform in the exponent, we see that the probability bound equals l_0 exp{−mψ*(x)}.
If instead b ≥ b̄, then the above argument yields the corresponding bound, and combining the two cases proves the first case in (2.10). On the other hand, if x ≤ mb and s(x/m) ≤ b, then (x′, s(x′/m)) is feasible for any x′ < x, by Lemma 2(vi). This yields a bound as in part (b). We minimize the probability bound over x′ < x, noting that sup_{x′<x} ψ*(x′/m) = ψ*(x/m) since ψ* is increasing (as ψ is CGF-like) and closed (Rockafellar, 1970, Theorem 12.2). This proves the second case in (2.10). • (d) ⇒ (a): set m = 0 and x = a to recover part (a).
It is worth noting here that, unlike the proofs of Tropp (2011), Fan et al. (2015), and related works, we do not explicitly construct a stopping time in our proof. While an optional stopping argument is hidden within the proof of Ville's inequality, the underlying stopping time here is different from that in the aforementioned works.

Interpreting the theorem
It is instructive to think of the parts of Theorem 1 as statements about the process (V_t, S_t) or (V_t, S_t/V_t) in R². Many of our results are better understood via this geometric intuition. Specifically, Figure 2 illustrates the following points.

Figure 2: Illustration of the equivalent statements of Theorem 1, as described in the text.
• Theorem 1(a) takes a given line a + bV_t and bounds its S_t-upcrossing probability.
• Theorem 1(b) takes a point (m, x) in the (V_t, S_t)-plane and, out of the infinitely many lines passing through it, chooses the one which yields the tightest upper bound on the corresponding S_t-upcrossing probability.
• Theorem 1(c) is like part (b), but instead of looking at S_t, we look at S_t/V_t, fix a point (m, x) in the (V_t, S_t/V_t)-plane, and choose from among the infinitely many curves b + a/V_t passing through it to minimize the probability bound.
• The intuition for Theorem 1(d) is as follows. If we want to bound the upcrossing probability of the line (x − bm) + bV_t on {V_t ≥ m}, we can clearly obtain a conservative bound from Theorem 1(a) with a = x − bm. This yields the first case in (2.10). However, we can also apply Theorem 1(b) with the values m, x, obtaining a bound on the upcrossing probability for a line which passes through the point (m, x) in the (V_t, S_t)-plane, and this line yields the minimum possible probability bound among all lines passing through (m, x). If the slope of this line, s(x/m), is less than b, then this optimal probability bound is conservative for the upcrossing probability over the original line. This gives the second case in (2.10), which is guaranteed to be at least as small as the bound in the first case when s(x/m) ≤ b.
We make some additional remarks below:
• We extend bounds for discrete-time scalar-valued processes to include both discrete-time matrix-valued processes and continuous-time scalar-valued processes, but we do not handle continuous-time matrix-valued processes, as this seems to require further technical developments beyond the scope of this paper (see Bacry et al. (2018) for one approach to exponential bounds in this case). We write [C or D] when discussing extensions to existing results to emphasize this fact (see Table 1).
• Most of this paper is concerned with right-tail bounds, hence the restriction to λ ≥ 0 in Definition 1. It is understood that identical techniques yield left-tail bounds upon verifying that Definition 1 holds for (−S_t).
• The purpose of excluding the CGF-like requirement on ψ from Definition 1 is to separate the truth of statement (a), which follows solely from Definition 1, from its equivalence to (b), (c), and (d), which holds when ψ is CGF-like.

Three simple examples
We illustrate some simple instantiations of our theorem with three examples: a sum of coin flips, a discrete-time concentration inequality for random matrices, and a continuous-time scalar Brownian motion. These examples make use of several results from Section 3 describing conditions under which a process is sub-ψ; such results may be taken for granted on a first reading.
Here λ_max = ∞ and b̄ = 1/p. One may directly check the martingale property to confirm that (L_t(λ)) is a supermartingale, yielding the bound (2.27), where KL denotes the Bernoulli Kullback-Leibler divergence, defined in (2.28). It takes some algebra to obtain this KL as the Legendre-Fenchel transform of ψ_B; in Table 2 we summarize all such transforms used in this paper. The final expression matches a classical fixed-time Chernoff bound, but here we have a bound not just for the deviation of S_m above its expectation at the fixed time m, but for the upper deviations of S_t for all t ∈ N, simultaneously. We can use this to sequentially test a hypothesis about p, or to construct a sequence of confidence intervals for p possessing a coverage guarantee holding uniformly over unbounded time.
The slope transform s_B(u) for ψ_B, given in Table 2, is unwieldy. To derive a more analytically convenient bound, we use the fact that the sub-Bernoulli condition implies a sub-Gaussian one (Proposition 2), which yields (2.29). This is equivalent to Blackwell's line-crossing inequality (1.4), and in the form (2.29) it is clear that it recovers Hoeffding's inequality at the fixed time t = m.
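The relationship between the KL-based exponent and its sub-Gaussian relaxation can be checked numerically. The following sketch (with illustrative values of p, ε, m chosen here, not taken from the paper) compares the two fixed-time tail bounds, using the standard Pinsker-type fact KL(q‖p) ≥ 2(q − p)²:

```python
import numpy as np

def bernoulli_kl(q, p):
    """KL(q || p) between Bernoulli(q) and Bernoulli(p) distributions."""
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

p, eps, m = 0.3, 0.1, 200
kl_bound = np.exp(-m * bernoulli_kl(p + eps, p))   # KL (Chernoff) tail bound
hoeffding = np.exp(-2 * m * eps**2)                # sub-Gaussian relaxation
# Pinsker-type inequality KL(q||p) >= 2(q-p)^2 makes the KL bound tighter:
assert kl_bound <= hoeffding
```

This is why the sub-Bernoulli bound is preferred when tightness matters, while the sub-Gaussian form (2.29) is preferred for analytical convenience.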
Alternatively, one may verify a sub-Poisson condition for this process; see the proof of Proposition 2, part 3. This will yield a uniform extension of Bennett's inequality (1.2) which improves upon Hoeffding's inequality substantially for values of p near zero and one. We will see other examples of such "sub-Poisson" bounds below.
Example 3 (Covariance estimation for a spiked random vector ensemble). The estimation of a covariance matrix via an i.i.d. sample is a common application of exponential matrix concentration, starting with Rudelson (1999). See also Vershynin (2012), Gittens and Tropp (2011), Tropp (2015), and Koltchinskii and Lounici (2017) for more recent treatments; this particular example is drawn from Wainwright (2017). Let d ≥ 2 and consider R^d-valued, mean-zero observations. The relevant moment inequality holds for all λ ∈ [0, 3/(d − 1)), as demonstrated in the proof of Proposition 2, part 5. Applying Theorem 1(c) with ψ equal to the final expression in (2.30), we obtain, after some algebra, the bound (2.31), holding for any x, m > 0 with probability at least 1 − α. At the fixed time t = m this recovers a known fixed-sample result (Wainwright, 2017). However, as above, (2.31) gives a bound on the upper deviations of Σ_t for all t ∈ N simultaneously. Such a bound enables, for example, sequential hypothesis tests concerning the true covariance matrix.
Example 4 (Line-crossing for Brownian motion). Let (S_t)_{t∈[0,∞)} denote a standard Brownian motion. It is a standard fact that the process exp{λS_t − λ²t/2} is a martingale, so that (S_t) is 1-sub-ψ with ψ(λ) = λ²/2 and V_t = t. In this case, Theorem 1 says that, for any a, b > 0,

P(∃t ∈ [0, ∞) : S_t ≥ a + bt) ≤ e^{−2ab},

a well-known line-crossing bound for Brownian motion, which in fact holds with equality (Durrett, 2017, Exercise 7.5.2).
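Since this bound holds with equality, it can be checked by simulation. The sketch below discretizes Brownian motion on a finite horizon, so the empirical crossing frequency slightly undershoots e^{−2ab} (discrete monitoring misses excursions between grid points); only rough agreement is expected. All tuning constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 1.0, 0.5
dt, T, n_paths = 0.01, 20.0, 2000
steps = int(T / dt)
# Simulate Brownian paths and record whether each ever crosses the line a + b*t.
S = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=(n_paths, steps)), axis=1)
t = dt * np.arange(1, steps + 1)
crossed = (S >= a + b * t).any(axis=1)
print(crossed.mean(), np.exp(-2 * a * b))  # empirical frequency vs e^{-2ab}
```

With these parameters the theoretical crossing probability is e^{−1} ≈ 0.368, and the Monte Carlo estimate should land within a few percentage points of it.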

Sufficient conditions for sub-ψ processes
Much of the power of Definition 1 comes from the array of sufficient conditions for it which have been discovered under diverse, nonparametric conditions. In this section, we define some standard ψ functions and collect a broad set of conditions from the literature for a process (S t ) to be sub-ψ with one of these functions, summarized in Tables 3 and 4. In other words, we collect here some families of process pairs (S t , V t ) which are contained within S l0 ψ for standard choices of ψ. All discrete-time results in this paper use S t = γ max (Y t ) where (Y t ) t∈N is a martingale taking values in H d , with the exception of Section 4.4, which deals with martingales in abstract Banach spaces. Typically, setting d = 1 recovers the corresponding known scalar result exactly. We note also that our results for Hermitian matrices extend directly to rectangular matrices using Hermitian dilations (Tropp, 2012), as we illustrate in Corollary 2.

Five useful ψ functions
We define five particular ψ functions corresponding to five sub-ψ cases: the sub-Gaussian case from Hoeffding's inequality, the "sub-gamma" case corresponding to Bernstein's inequality, the sub-Poisson case from Bennett's and Freedman's inequalities, and the sub-exponential and sub-Bernoulli cases which are used in several other existing bounds. The ψ functions and corresponding transforms for these five cases are summarized in Table 2, while Figure 3 summarizes relationships among these cases, with Proposition 2 containing the formal statements. Recall b̄ = sup_{λ∈[0,λ_max)} ψ′(λ) from Definition 2, and note that we take 1/0 = ∞ by convention in the expressions for λ_max and b̄ below.
which is the scaled CGF of a mean-zero random variable taking values −g and h. Here b̄ = 1/g.
We will typically write ψ B , ψ P , ψ G , and ψ E , omitting the range or scale parameters from the notation when they are clear from the context. We follow the definition of sub-gamma from Boucheron et al. (2013), despite the somewhat inconsistent terminology: unlike the other four cases, ψ G is not the CGF of a gamma-distributed random variable. It is convenient for a number of reasons: it includes ψ N as a special case, it gives a useful upper bound for ψ P (see Proposition 2 part 5, below), it falls naturally out of the use of a Bernstein condition on higher moments to bound the CGF, and it is simple enough to permit analytically tractable results for the slope and decay transforms and the various bounds to follow. We remark also that our definition of sub-exponential in terms of the CGF of the exponential distribution follows that of Boucheron et al. (2013, Exercise 2.22), but differs from another well-known definition which says that the CGF is bounded by λ 2 /2 for λ in some neighborhood of zero. The two are equivalent up to appropriate choice of constants, as detailed in Appendix E.
The sub-gamma and sub-exponential functions ψ G,c and ψ E,c possess the following universality property, which we prove in Section 6.2.
In particular, this means that if S t = t i=1 X i for any zero-mean, i.i.d. sequence (X i ) satisfying Ee λX1 < ∞ for some λ > 0, then (S t ) is sub-gamma and sub-exponential with appropriate scale constants and variance process V t proportional to t. Furthermore, any process that is sub-ψ with a CGF-like ψ function is also sub-gamma and sub-exponential with appropriate scaling of the variance process by a constant.
Table 2: Summary of common ψ functions and related transforms. KL denotes the Bernoulli Kullback-Leibler divergence, KL(q ∥ p) = q log(q/p) + (1 − q) log((1 − q)/(1 − p)). For the gamma and exponential cases, the domain of ψ is bounded by λ_max = 1/(c ∨ 0); for the other three cases, λ_max = ∞. For the Bernoulli, Poisson, and exponential cases, a closed-form expression for D(u) is not available, but we give lower bounds based on Proposition 2; ϕ(g, h) is defined in (3.7).

Conditions for sub-ψ processes
In Tables 3 and 4, we summarize a variety of standard and novel conditions for a process (S_t) to be sub-ψ. Fact 1 and Lemma 3 contain discrete-time results, while results for continuous time are in Fact 2. We let I_d denote the d × d identity matrix. For a process (Y_t)_{t∈T}, [Y]_t denotes the quadratic variation and ⟨Y⟩_t the conditional quadratic variation; in discrete time, [Y]_t = Σ_{i≤t} (ΔY_i)² and ⟨Y⟩_t = Σ_{i≤t} E_{i−1}(ΔY_i)². We extend a function f : R → R on the real line to an operator f : H_d → H_d on the space of Hermitian matrices in the standard way, by applying f to each eigenvalue. In particular, the absolute value function extends to H_d by taking absolute values of the eigenvalues, while truncation functions operate by truncating the eigenvalues. In the discrete-time case, we have the following known results.
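The standard extension of a scalar function to Hermitian matrices is a small computation worth making concrete. The sketch below (with `apply_to_hermitian` a hypothetical helper name) diagonalizes A and applies f to the eigenvalues, then checks two consequences for the matrix absolute value:

```python
import numpy as np

def apply_to_hermitian(f, A):
    """Extend scalar f to Hermitian A by acting on eigenvalues: f(A) = U f(Lambda) U*."""
    w, U = np.linalg.eigh(A)
    return (U * f(w)) @ U.conj().T   # U * f(w) scales column j of U by f(w)[j]

A = np.array([[1.0, 2.0], [2.0, -3.0]])
absA = apply_to_hermitian(np.abs, A)
# |A| is positive semidefinite, with eigenvalues |gamma_i(A)| ...
assert np.all(np.linalg.eigvalsh(absA) >= 0)
# ... and shares A's eigenvectors, so |A|^2 = A^2.
assert np.allclose(absA @ absA, A @ A)
```

The same helper computes truncations by passing, e.g., `lambda w: np.minimum(w, c)`.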
Fact 1. Let (Y t ) t∈N be any H d -valued martingale, and let S t := γ max (Y t ) for t ∈ N. In all cases we set l 0 = d.
(a) (Scalar parametric) If d = 1 and S_t is a cumulative sum of i.i.d., real-valued random variables, each of which is mean zero with known CGF ψ(λ) finite on [0, λ_max), then (S_t) is sub-ψ with variance process V_t = t. (b) (Bernoulli) If −gI_d ⪯ ΔY_t ⪯ hI_d a.s. for all t ∈ N, then (S_t) is sub-Bernoulli with variance process V_t = ght and range parameters g, h (Tropp, 2012).
(c) (Bennett) If ΔY_t ⪯ cI_d a.s. for all t ∈ N for some c > 0, then (S_t) is sub-Poisson with variance process V_t = γ_max(⟨Y⟩_t) and scale parameter c (Tropp, 2012). (d) (Bernstein) If E_{t−1}(ΔY_t)^k ⪯ (k!/2)c^{k−2} E_{t−1}(ΔY_t)² for all t ∈ N and k = 2, 3, ..., then (S_t) is sub-gamma with variance process V_t = γ_max(⟨Y⟩_t) and scale parameter c (Tropp, 2012; Boucheron et al., 2013). (e) (Heavy on left) Let T_a(y) := (y ∧ a) ∨ −a for a > 0 denote the truncation of y. If d = 1 and each increment ΔY_t satisfies (3.6) conditionally on the past, then (S_t) is sub-Gaussian with variance process V_t = [Y]_t. A random variable satisfying (3.6) is called heavy on left, and (Y_t) need not be a martingale in this case (Delyon, 2015; Bercu et al., 2015). For example, the centered versions of the exponential, gamma, Pareto, log-normal, Poisson (λ ∈ N), Bernoulli (p < 1/2) and geometric (0 < p < 1) distributions are known to be heavy on left. When −ΔY_t satisfies (3.6) we say ΔY_t is heavy on right.
In addition to the above known results, we provide the following extensions of known scalar results to matrices.

Condition ψ Vt
Discrete time, one-sided | Continuous time, two-sided

Table 3: Summary of sufficient conditions for a real-valued, discrete- or continuous-time martingale (S_t) to be sub-ψ with the given variance process. We use the shorthand μ_t^k := E_{t−1}(ΔS_t)^k and |μ|_t^k := E_{t−1}|ΔS_t|^k. In starred cases (*), the first moment E_{t−1}ΔS_t need not exist, so (S_t) need not be a martingale. See Facts 1 and 2 and Lemma 3 for details of each case. "⇒ Hoeffding I" indicates that the variance process (V_t) for Hoeffding-KS is smaller. "SN" is short for "self-normalized".

Condition ψ Zt
Discrete time, one-sided

Table 4: Summary from Fact 1 and Lemma 3 of sufficient conditions for an H_d-valued, discrete-time martingale (Y_t) to have a sub-ψ maximum eigenvalue process S_t = γ_max(Y_t) with variance process V_t = γ_max(Z_t). We use the shorthand μ_t^k := E_{t−1}(ΔS_t)^k and |μ|_t^k := E_{t−1}|ΔS_t|^k. In the symmetric case, E_{t−1}ΔY_t need not exist, so (Y_t) need not be a martingale. "⇒ Hoeffding I" indicates that (V_t) for Hoeffding-KS is smaller. "SN" is short for "self-normalized".
Lemma 3. Let (Y t ) t∈N be any H d -valued martingale, and let S t := γ max (Y t ) for t ∈ N. In all cases we set l 0 = d.
(a) (Bernoulli II) If, for all t ∈ N, ΔY_t ⪯ hI_d a.s. and E_{t−1}(ΔY_t)² ⪯ ghI_d, then (S_t) is sub-Bernoulli with variance process V_t = ght and range parameters g, h. The remaining cases (b)-(i) are summarized in Table 4; the third-moment case (i) is sub-gamma with scale parameter c = 1/6.
The proof of the above lemma can be found in Section 6.5. Case (a) is a straightforward extension of Bennett's condition for upper-bounded random variables with bounded variance to matrices with upper-bounded eigenvalues and bounded matrix variance (Bennett, 1962, p. 42). Cases (b) and (c) are similar extensions of Hoeffding's sub-Gaussian conditions for bounded random variables to matrices with bounded eigenvalues (Hoeffding, 1963, Theorems 1 and 2; Kearns and Saul, 1998; Bercu et al., 2015, Theorem 2.49). In the conditionally symmetric case (d), we can achieve control without any moment or boundedness assumptions by defining V_t in terms of observed rather than expected squared deviations; this is known for d = 1 (de la Peña, 1999, Lemma 6.1; Bercu et al., 2015), allowing exponential concentration for distributions like the Cauchy. In the lower-bounded increments case (e), we have a self-normalized complement to the Bennett-style bound, a result known for d = 1 (Fan et al., 2015, Lemma 4.1). For the square-integrable martingale cases (f, g), we achieve control for a broad class of processes by incorporating the conditional variance and the observed squared deviations, as known for d = 1 (Delyon, 2009, Theorem 4; Bercu et al., 2015). The Hoeffding-like case (h) follows from the self-normalized bounds, highlighting a connection implicit in the proof of Corollary 4.2 of Mackey et al. (2014). The third moment bound (i) is similar to a fixed-sample bound given by Fan et al. (2015, Corollary 2.2).
In the continuous-time, scalar case we have the following sufficient conditions for a local martingale (S_t) to be sub-ψ. Here we always assume (S_t) is càdlàg, ΔS_t := S_t − S_{t−} denotes the jumps of S, [S]_t denotes the quadratic variation, and ⟨S⟩_t is the conditional quadratic variation, the compensator of [S]_t.
Fact 2. (a) (Lévy process) If (S_t) is a Lévy process which is a martingale with CGF ψ(λ) = log E e^{λS_1} < ∞ for all λ ∈ [0, λ_max), then (S_t) is sub-ψ with variance process V_t = t. See, e.g., Papapantoleon (2008, Proposition 10.2). (b) (Continuous Bennett) If (S_t) is a local martingale with ΔS_t ≤ c for all t a.s., then (S_t) is sub-Poisson with scale parameter c and variance process V_t = ⟨S⟩_t (Lepingle, 1978, p. 157). (c) (Continuous Bernstein) Suppose (S_t) is a locally square-integrable martingale: let W_{2,t} = ⟨S⟩_t, and for m = 3, 4, ... let W_{m,t} be the compensator of the process Σ_{u≤t} |ΔS_u|^m. If, for some c > 0 and predictable, càdlàg, nondecreasing process (V_t), it holds that W_{m,t} ≤ (m!/2)c^{m−2}V_t for all m ≥ 2, then (S_t) is sub-gamma with scale parameter c and variance process V_t (van de Geer, 1995, implicit in the proof of Lemma 2.2). (d) (Continuous paths) If (S_t) is a local martingale with a.s. continuous paths, then (S_t) is sub-Gaussian with variance process V_t = ⟨S⟩_t. This may be seen as a special case of (c), or a limiting case of (b).

Implications between sub-ψ conditions
In many settings, a process (S_t) may satisfy Definition 1 with several different choices of ψ and (V_t). Choosing a smaller ψ function will lead to tighter bounds in Theorem 1, but in some cases one may opt for a larger ψ function to achieve analytical or computational convenience. It is clear that making ψ uniformly larger retains the sub-ψ property, since the exponential process exp{λS_t − ψ(λ)V_t} can only become smaller. It is therefore useful to characterize relationships among the above sub-ψ conditions, so that, after invoking one of the sufficient conditions given in Section 3.2, one may invoke Theorem 1 with a different, more convenient ψ function. Note that ψ_G, ψ_P and ψ_E are nondecreasing in c for all values of λ ≥ 0, so that if a process is sub-ψ with scale c for any of these ψ functions, then it is sub-ψ for any scale c′ > c as well. Similarly, ψ_B is nonincreasing in g and nondecreasing in h. Table 5 and Proposition 2 fully characterize all implications among sub-ψ conditions, as illustrated in Figure 3. These follow from inequalities of the form ψ_1 ≤ aψ_2, some of which are based on standard arguments; see Section 6.3.
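Inequalities of the form ψ_1 ≤ aψ_2 are easy to verify numerically. As one classical illustration (chosen for familiarity; Proposition 2 and Table 5 give the precise constants and restrictions used in this paper), the sub-Poisson function is dominated by a sub-gamma function with scale c/3, via the standard bound e^x − x − 1 ≤ x²/(2(1 − x/3)) for 0 ≤ x < 3:

```python
import numpy as np

c = 1.0
lam = np.linspace(0.01, 2.99 / c, 10000)
psi_P = (np.exp(c * lam) - c * lam - 1) / c**2   # sub-Poisson psi with scale c
psi_G = lam**2 / (2 * (1 - (c / 3) * lam))       # sub-gamma psi with scale c/3
# Domination holds pointwise on [0, 3/c), so sub-Poisson implies sub-gamma here.
assert np.all(psi_P <= psi_G + 1e-12)
```

A process that is sub-ψ_P with scale c is therefore also sub-ψ_G after the appropriate rescaling, which is exactly the kind of implication Table 5 catalogs.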

Figure 3: Each arrow indicates that any process satisfying the source sub-ψ condition (sub-Bernoulli, sub-Gaussian, sub-Poisson, sub-gamma, or sub-exponential), subject to a restriction on the scale parameter c, also satisfies the destination sub-ψ condition with appropriately scaled variance process. See Table 5 and Proposition 2 for details.

Table 5: If (S_t) is sub-ψ_1 with variance process (V_t), subject to the given restriction, then (S_t) is also sub-ψ_2 with variance process (aV_t). ϕ(g, h) is defined in (3.7). See Proposition 2 for details.

Proposition 2. For each row of Table 5, if (S_t) is sub-ψ_1 with variance process (V_t), and the given restrictions are satisfied, then (S_t) is also sub-ψ_2 with variance process (aV_t). Furthermore, when we allow only scaling of V_t by a constant, these capture all possible implications among the five sub-ψ conditions defined above, and the given constants are the best possible (in the case of row (2), the constant (g + h)²/4gh is the best possible of the form k/gh where k depends only on the total range g + h).

Applications of Theorem 1
Here, we illustrate how Theorem 1 recovers or strengthens a wide variety of existing results. Most results in this section follow immediately upon combining one of the sufficient conditions from Fact 1, Lemma 3, or Fact 2 with Theorem 1, and we omit proof details in many cases. As a rough plan, we first discuss classical Cramér-Chernoff and Freedman-style bounds and then Blackwell's line-crossing inequalities. After discussing de la Peña-style self-normalized bounds and Pinelis' Banach-space inequalities, we end by exhibiting some continuous-time results and mentioning connections to the sequential probability ratio test.

Figure 4: Comparison of (i) the fixed-time Cramér-Chernoff bound (4.2), which bounds the deviations of S_m at a fixed time m; (ii) the "Freedman-style" constant uniform bound (4.3), which bounds the deviations of S_t for all t such that V_t ≤ m, with a constant boundary equal in value to the fixed-time Cramér-Chernoff bound; and (iii) the linear uniform bound from Theorem 1(b), which bounds the deviations of S_t for all t ∈ N, with a boundary growing linearly in V_t. Each bound gives the same tail probability and thus implies the preceding one.

Fixed-time Cramér-Chernoff and Freedman-style uniform bounds
In the discrete-time, scalar setting, a simple sufficient condition for a process (S_t) to be 1-sub-ψ with variance process (V_t) is that

E_{t−1} exp{λΔS_t} ≤ exp{ψ(λ)ΔV_t} for all λ ∈ [0, λ_max),

which is the standard assumption for a martingale-method Cramér-Chernoff inequality, typically with (V_t) predictable (McDiarmid, 1998; Chung and Lu, 2006; Boucheron et al., 2013). When (V_t) is deterministic, the fixed-time Cramér-Chernoff method gives, for fixed x and m,

P(S_m ≥ x) ≤ e^{−mψ*(x/m)}, (4.2)

so Theorem 1(b) is a uniform extension of the Cramér-Chernoff inequality, losing nothing at the fixed time m [B; C or D]. For random (V_t), a stopping time argument due to Freedman (1975) extends this to the uniform bound

P(∃t ∈ N : S_t ≥ x and V_t ≤ m) ≤ e^{−mψ*(x/m)}. (4.3)

When (V_t) is deterministic, analogous uniform bounds can be obtained from Doob's maximal inequality for submartingales, as in Hoeffding (1963, eq. 2.17). Theorem 1 strengthens this "Freedman-style" inequality [B; C or D], since it yields tighter bounds for all times t such that V_t < m, and also extends the inequality to hold for all times t with V_t > m, as illustrated by Figure 4. Tropp (2011, 2012) extends the scalar Cramér-Chernoff approach to random matrices via control of the matrix moment-generating function, giving matrix analogues of Hoeffding's, Bennett's, Bernstein's and Freedman's inequalities. Following this approach, Theorem 1 gives correspondingly strengthened versions of these inequalities for matrix-valued processes [B]. We summarize explicit results below for three well-known special cases reviewed in Example 1(a): Hoeffding's sub-Gaussian inequality for observations bounded from above and below, with variance process depending only on the radius of the interval of boundedness; Bennett's sub-Poisson inequality for observations bounded from above, with variance process depending on the true variance of the observations; and Bernstein's sub-gamma inequality for observations satisfying a bound on growth of higher moments, also with a variance process depending on the true variance.
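The geometric relationship behind Figure 4 can be made concrete in the sub-Gaussian case, where a short calculation (sketched here under the assumption ψ(λ) = λ²/2, with the standard facts D(b) = 2b and ψ*(u) = u²/2) shows the optimal line through (m, x) has slope x/2m and intercept x/2:

```python
import numpy as np

alpha, m = 0.05, 100.0
x = np.sqrt(2 * m * np.log(1 / alpha))   # fixed-time Chernoff level at V_t = m
v = np.linspace(1.0, 300.0, 500)         # intrinsic time axis
linear = x / 2 + (x / (2 * m)) * v       # uniform linear boundary through (m, x)
# The linear boundary coincides with the fixed-time level exactly at v = m ...
assert np.isclose(x / 2 + (x / (2 * m)) * m, x)
# ... and lies strictly below the constant (Freedman-style) boundary for v < m.
assert np.all(linear[v < m] < x)
```

All three boundaries yield the same tail probability α, but the linear one is valid for all intrinsic times, which is the content of item (iii) in Figure 4.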
In each case below, we recover the standard, fixed-sample result at V t = m. Recall the definitions of s P , ψ P , s G , ψ G from Table 2.
(a) Suppose ΔY_t² ⪯ A_t² a.s. for all t for some H_d-valued, predictable sequence (A_t). Let S_t := γ_max(Y_t), and let either

(4.4)
Then for any x, m > 0, we obtain a uniform bound that strengthens Hoeffding's inequality for all t. (b) Let S_t := γ_max(Y_t) and V_t := γ_max(⟨Y⟩_t). Then for any x, m > 0, we obtain a uniform bound that strengthens Bennett's and Freedman's inequalities [B; C or D] for scalars and the corresponding matrix bounds from Tropp (2011, 2012) [B]. (c) Suppose (S_t) is l_0-sub-gamma with variance process (V_t) and scale parameter c. Then for any x, m > 0, we obtain a uniform bound that strengthens Bernstein's inequality. Case (a) is a consequence of Lemma 3(g); see also Corollary 8. Using A_i² in place of the observed squared deviations yields the second setting of V_t. As is well known, the Hoeffding-style bound in part (a) and the Bennett-style bound in part (b) are not directly comparable: V_t may be smaller in part (b), but ψ*_P ≤ ψ*_N, so neither subsumes the other. We remark that ψ*_P(u) ≥ (u/2c) arcsinh(cu/2), so the Bennett-style inequality in part (b) is an improvement on the inequality of Prokhorov (1959) for sums of independent random variables, as well as on its extension to martingales in de la Peña (1999).
As an example of the Hermitian dilation technique for extending bounds on Hermitian matrices to bounds for rectangular matrices, we give a bound for rectangular matrix Gaussian and Rademacher series, following Tropp (2012); here A op denotes the largest singular value of A. The proof is in Section 6.6.
Corollary 2. Consider a sequence (B_t)_{t∈N} of fixed matrices with dimension d_1 × d_2, and let (ε_t)_{t∈N} be a sequence of independent standard normal or Rademacher variables. Let S_t := ‖Σ_{i=1}^t ε_i B_i‖_op and define V_t by (4.8). Then for any x, m > 0, we obtain a uniform bound that strengthens Corollary 4.2 of Tropp (2012) [B].
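The Hermitian dilation underlying Corollary 2 is a simple construction, and its key spectral property is directly checkable. The sketch below (with `dilation` a hypothetical helper name) builds H(B) = [[0, B], [B*, 0]] and verifies that its largest eigenvalue equals the largest singular value of B:

```python
import numpy as np

def dilation(B):
    """Hermitian dilation H(B) = [[0, B], [B*, 0]]; gamma_max(H(B)) = ||B||_op."""
    d1, d2 = B.shape
    H = np.zeros((d1 + d2, d1 + d2), dtype=complex)
    H[:d1, d1:] = B
    H[d1:, :d1] = B.conj().T
    return H

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 5))
lam_max = np.linalg.eigvalsh(dilation(B)).max()
# The eigenvalues of H(B) are plus/minus the singular values of B (and zeros),
# so the maximum eigenvalue is exactly the operator norm of B.
assert np.isclose(lam_max, np.linalg.norm(B, 2))
```

This is why a bound on the maximum eigenvalue of a Hermitian process transfers directly to operator-norm bounds for rectangular matrix series.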

Line-crossing inequalities
Before giving specific results in this section, we start with simplified versions of Theorem 1(d) which are useful for recovering existing results. The probability bound in (4.10) is merely an analytically simplified upper bound on that from Theorem 1(d). We prove the following in Section 6.7.
Corollary 3. If (S_t) is l_0-sub-ψ with variance process (V_t) and ψ is CGF-like, then for any m ≥ 0, x > 0 and b ∈ (0, b̄), we have (4.10); in particular, for m > 0, we have (4.11). In keeping with the approach of this paper, Theorem 1(d) and Corollary 3 bound the upcrossing probability on {V_t ≥ m} using the results of Theorem 1(a,b) and a geometric argument. It may seem naive and wasteful to bound a line-crossing probability on {V_t ≥ m} using a bound which applies on {V_t > 0}. The literature includes a handful of results bounding line-crossing probabilities on {V_t ≥ m} which appear to give bounds tighter than what Theorem 1 offers, by making more direct use of the intrinsic-time condition. Below we demonstrate that this is not true: we give several special cases of Theorem 1(d) and Corollary 3 which improve upon existing results.
Corollary 4. Suppose (S t ) is l 0 -sub-gamma with variance process (V t ) and scale parameter c.
(a) For any a, b > 0, we have (4.12). When T = N, c = 0 and d = 1, this strengthens Blackwell's Theorem 1 [A; C or D], which is written for discrete-time scalar processes with bounded increments. (b) For any m, b > 0, we have (4.13). When T = N, c = 0 and d = 1, this strengthens the second bound in Blackwell's Theorem 2 [A; C or D], which is likewise written for discrete-time scalar processes with bounded increments.
In discrete time, as presented in Fact 1, for a process with bounded increments we may construct both sub-Bernoulli and sub-Gaussian bounds. The sub-Bernoulli case, in combination with (4.11), yields the following. Corollary 5. Suppose d = 1 and the increments ΔS_t are bounded a.s. for all t ∈ N. Then for any b ∈ [0, 1] and m ≥ 1, we have (4.14). This strengthens the first bound in Blackwell's Theorem 2 [D].
Theorems 4.1-4.3 of Khan are closest in form to our main results and represent key precedents to our framework. The simplified bound (4.10) recovers Khan's Theorem 4.3 [C or D], while Theorem 1(d) improves the exponent [E]. Our Theorem 1(b) gives a strengthened version of Khan's Theorem 4.2 [B; C or D]. Khan's Theorem 4.1 is not strictly comparable to our work since it involves an initial condition on nominal time, t ≥ t_0, rather than on intrinsic time, V_t ≥ m; but when V_t is deterministic, our Theorem 1(d) is tighter [B; C or D; E].

Self-normalized uniform bounds
Collectively, de la Peña (1999); de la Peña et al. (2000, 2004, 2007); de la Peña, Klass and Lai (2009); and de la Peña, Lai and Shao (2009) give a wide variety of sufficient conditions for the exponential process exp{λS_t − ψ(λ)V_t} to be a supermartingale in both discrete- and continuous-time settings. They formulate their bounds for ratios involving S_t in the numerator and V_t in the denominator, as in Theorem 1(c), and often specify initial-time conditions, as in Theorem 1(d).
In this section we draw some comparisons between Theorem 1 and their results. As a first example, consider the boundary of Theorem 1(c) for the ratio S_t/V_t, which is strictly decreasing towards the asymptotic level s(x). In particular, at time V_t = m the boundary equals x, so Theorem 1(c) strengthens various theorems of de la Peña (1999) and de la Peña et al. (2007) which use a constant boundary after time V_t = m [B; C or D]; for example, Theorem 1.2B, eq. 1.5 of de la Peña (1999) states that (4.15) holds for scalar processes (S_t) which are 1-sub-gamma with variance process (V_t). As before, we give explicit results for special cases.
Corollary 6. Suppose (S_t) is l_0-sub-gamma with variance process (V_t) and scale parameter c. Then for any x, m > 0, we have (4.16). In the sub-Gaussian case (obtained at c = 0), the above bound simplifies further.

Figure 5: Comparison of our decreasing boundary from Theorem 1(c), as in inequality (4.16), to a "de la Peña-style" constant uniform bound as in inequality (4.15), which bounds the deviations of S_t/V_t for all t such that V_t ≥ m with a constant boundary.
Recall that s G (x) = x/(1 + √ 1 + 2cx), so for the boundary in (4.16), we have s G (x)(1 + m √ 1 + 2cx/V t ) ≤ x for all V t ≥ m with equality at V t = m. Corollary 6(a), therefore, gives the same probability bound as (4.15) for a larger crossing event. Figure 5 visualizes this relationship.
More generally, when we normalize by α + βV_t and include an initial-time condition V_t ≥ m, Theorem 1(d) and Corollary 3 become the following. Suppose (S_t) is l_0-sub-ψ with variance process (V_t), where ψ is CGF-like and b̄ = ∞; then for any β, x > 0 and α, m ≥ 0 with at least one of α, m > 0, we have (4.20). In the case that (S_t) is sub-Gaussian, for any β, x > 0 and α, m ≥ 0 with at least one of α, m > 0, we have a correspondingly simplified bound, taking 0/0 = 0 on the right-hand side when m = 0. With Lemma 3(d), this improves eq. 6.4 from Theorem 6.2 of de la Peña (1999) [C or D; E].
A defining feature of self-normalized bounds is that they involve a variance process (V_t) constructed from the squared observations themselves rather than just conditional variances or constants. Such normalization can be found in common statistical procedures such as the t-test. Furthermore, it allows for Gaussian-like concentration while reducing or eliminating moment conditions. Lemma 3 gives several extensions of well-known conditions for scalar sub-Gaussian concentration of self-normalized processes. As one particular special case, Lemma 3(f) and (g) yield general self-normalized uniform bounds for any discrete-time, square-integrable, H_d-valued martingale, building upon breakthrough results obtained for scalar processes by Bercu, Touati and Delyon; this yields Corollary 8, which holds for any x, m > 0. Corollary 8 is remarkable for the fact that it gives Gaussian-like concentration with only the existence of second moments for the increments. If the increments have conditionally symmetric distributions, one may instead apply Lemma 3(d) to achieve Gaussian-like concentration without existence of any moments, as discovered by de la Peña (1999) and illustrated in the following example.
Example 5 (Cauchy increments). Let (ΔS_t)_{t∈N} be i.i.d. standard Cauchy random variables (symmetric about zero). Lemma 3(d) shows that (S_t) is sub-Gaussian with variance process V_t = [S]_t, and Corollary 6 yields the uniform bound (4.24) for any x, m > 0. For another example, Lemma 3(i) gives a self-normalized bound involving third rather than second moments: using s_G and ψ_G with c = 1/6, we obtain, for any x, m > 0, the bound (4.26), which is a uniform alternative to Corollary 2.2 of Fan et al. (2015) [B, D].
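The mechanism behind the conditionally symmetric case can be checked directly. For any symmetric increment X, E exp{λX − λ²X²/2} = E[cosh(λX)e^{−λ²X²/2}] ≤ 1 because cosh(y) ≤ e^{y²/2} pointwise, so no moments are needed. The sketch below verifies the pointwise inequality on a grid and then Monte Carlos the supermartingale property with Cauchy increments (all tuning constants illustrative):

```python
import numpy as np

# Pointwise fact: cosh(y) * exp(-y^2/2) <= 1 for all real y (equality at y = 0).
y = np.linspace(-20, 20, 100001)
assert np.all(np.cosh(y) * np.exp(-y**2 / 2) <= 1 + 1e-12)

# Monte Carlo check with standard Cauchy increments, which have no finite mean:
rng = np.random.default_rng(2)
lam, n, paths = 0.3, 50, 20000
X = rng.standard_cauchy(size=(paths, n))
L = np.exp(lam * X.sum(axis=1) - lam**2 * (X**2).sum(axis=1) / 2)
assert L.mean() <= 1.05   # E L_n <= 1 for the exponential supermartingale
```

The self-normalization by [S]_t = ΣX_i² is what tames the heavy tails: large increments make the exponent strongly negative rather than large.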
Note that the exponent in (4.26) differs from that in Fan et al. (2015); neither strictly dominates the other. Also note that, unlike the classical Bernstein bound, neither Corollary 8 nor Corollary 9 assumes the existence of moments of all orders.

Martingales in smooth Banach spaces
The applications presented thus far allow us to uniformly bound the operator-norm deviations of a sequence of random Hermitian matrices. A different route is due to Pinelis (1992, 1994), who gave an innovative approach to exponential tail bounds in abstract Banach spaces. We describe how this approach can be incorporated into our framework. For this section, let (Y_t)_{t∈N} be a martingale with respect to (F_t) taking values in a separable Banach space (X, ‖·‖). We can use Pinelis's device to uniformly bound the process (Ψ(Y_t)) for any function Ψ : X → R which satisfies the following smoothness property. Definition 3 (Pinelis, 1994). A function Ψ : X → R is called (2, D)-smooth for some D > 0 if it satisfies conditions (4.27a)-(4.27c) for all x, v ∈ X. A Banach space is called (2, D)-smooth if its norm is (2, D)-smooth; in such a space we may take Ψ(·) = ‖·‖ to uniformly bound the deviations of a martingale. In this case, observe that property (4.27a) is part of the definition of a norm, property (4.27b) is the triangle inequality, and property (4.27c) can be seen to hold with D = 1 for the norm induced by the inner product in any Hilbert space, regardless of the (possibly infinite) dimensionality of the space. Note also that setting x = 0 shows that D ≥ 1 whenever Ψ(·) = ‖·‖. Finally, observe that if we write f(x) = Ψ²(x), then we may equivalently replace condition (4.27c) by f(tx + (1 − t)y) ≥ tf(x) + (1 − t)f(y) − D²t(1 − t)‖x − y‖², a perhaps more familiar definition of smoothness.
Corollary 10. Consider a martingale $(Y_t)_{t\in\mathbb{N}}$ taking values in a separable Banach space $(\mathcal{X}, \|\cdot\|)$. Let the function $\Psi : \mathcal{X} \to \mathbb{R}$ be $(2, D)$-smooth and define $\bar{D} := 1 \vee D$.
(a) Suppose $\|\Delta Y_t\| \le c_t$ a.s. for all $t \in \mathbb{N}$ for some constants $(c_t)_{t\in\mathbb{N}}$, and let $V_t := \sum_{i=1}^t c_i^2$. Then for any $x, m > 0$, we have the uniform bound (4.28). This strengthens Theorem 3.5 of Pinelis (1994) [B]. (b) Suppose $\|\Delta Y_t\| \le c$ a.s. for all $t \in \mathbb{N}$ for some constant $c$, and let $V_t := \sum_{i=1}^t \mathbb{E}_{i-1}\|\Delta Y_i\|^2$. Then for any $x, m > 0$, we have the uniform bound (4.29). This strengthens Theorem 3.4 of Pinelis (1994) [B].
We prove this result in Section 6.8. As before, the Hoeffding-style bound in part (a) and the Bennett-style bound in part (b) are not directly comparable: $V_t$ may be smaller in part (b), but the exponent is also smaller.
We briefly highlight some strengths and limitations of this approach. Since the Euclidean $\ell_2$-norm is induced by the standard inner product in $\mathbb{R}^d$, Corollary 10 gives a dimension-free uniform bound on the $\ell_2$-norm deviations of a vector-valued martingale in $\mathbb{R}^d$ which exactly matches the form for scalars; compare this to bounds based on the operator norm of a Hermitian dilation, such as that of Tropp (2012). Similarly, Corollary 10 gives dimension-free uniform bounds for the Frobenius-norm deviations of a matrix-valued martingale. This extends to martingales taking values in a space of Hilbert-Schmidt operators on a separable Hilbert space, with deviations bounded in the Hilbert-Schmidt norm; compare Minsker (2017, §3.2), which gives operator-norm bounds. The method of Corollary 10 does not extend directly to operator-norm bounds because the operator norm is not $(2, D)$-smooth for any $D$: for a simple illustration in $\mathcal{H}^2$, consider $x = aI_2$ and $v = \mathrm{diag}\{b, -b\}$ with $a, b > 0$, so that $\|x+v\|_{\mathrm{op}}^2 + \|x-v\|_{\mathrm{op}}^2 - 2\|x\|_{\mathrm{op}}^2 = 2b^2 + 4ab$, which grows without bound in $a$ while $2D^2\|v\|_{\mathrm{op}}^2 = 2D^2b^2$ is fixed, so condition (4.27c) cannot be satisfied. However, Corollary 10 does apply to the matrix Schatten $p$-norm for $2 \le p < \infty$, using $D = \sqrt{p-1}$, and this holds for rectangular matrices as well (Ball et al., 1994).
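The operator-norm counterexample above can be verified directly; a minimal numerical sketch (assuming numpy is available):

```python
import numpy as np

def smoothness_gap(a, b):
    """Left side of the (2, D)-smoothness condition (4.27c) for the
    operator norm, at x = a*I_2 and v = diag(b, -b). It should be
    bounded by 2*D^2*||v||^2 for some fixed D, but in fact equals
    2*b^2 + 4*a*b, which is unbounded in a."""
    x = a * np.eye(2)
    v = np.diag([b, -b])
    op = lambda m: np.linalg.norm(m, ord=2)  # spectral (operator) norm
    return op(x + v)**2 + op(x - v)**2 - 2 * op(x)**2

b = 1.0
for a in [1.0, 10.0, 100.0]:
    assert np.isclose(smoothness_gap(a, b), 2*b**2 + 4*a*b)
# ||v||_op = b is fixed, so no single D can satisfy (4.27c):
assert smoothness_gap(1e6, b) > 1e6
```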

Continuous-time processes
While Corollaries 1, 4, 6, and 7 already generalize results known in discrete time to new results for continuous-time martingales [C], here we summarize a few more useful bounds explicitly for continuous-time processes, which follow from Theorem 1 and the conditions of Fact 2, making use of the novel strategies devised by Shorack and Wellner (1986) and van de Geer (1995). These results use the conditional quadratic variation $\langle S\rangle_t$. We remind the reader that $[S]_t = \langle S\rangle_t = t$ for Brownian motion, and the first equality holds more generally for martingales with continuous paths, while for a compensated rate-one Poisson process $S_t = N_t - t$ we have $[S]_t = N_t$ but $\langle S\rangle_t = t$.

Corollary 11. Let $(S_t)_{t\in(0,\infty)}$ be a real-valued process.
(a) If $(S_t)$ is a locally square-integrable martingale with a.s. continuous paths, then for any $a, b > 0$, we have
$$\mathbb{P}(\exists t \in (0,\infty): S_t \ge a + b\langle S\rangle_t) \le e^{-2ab}. \tag{4.30}$$
If $\langle S\rangle_t \uparrow \infty$ as $t \uparrow \infty$, then the probability upper bound holds with equality. This recovers as a special case the standard line-crossing probability for Brownian motion (e.g., Durrett, 2017, Exercise 7.5.2). (b) If $(S_t)$ is a local martingale with $\Delta S_t \le c$ for all $t$, then for any $x, m > 0$, we have the sub-Poisson bound (4.31). This strengthens Appendix B, Inequality 1 of Shorack and Wellner (1986) [B]. (c) If $(S_t)$ is any locally square-integrable martingale satisfying the Bernstein condition of Fact 2(c) for some predictable process $(V_t)$, then for any $x, m > 0$, a corresponding sub-exponential bound holds.

Clearly, Corollary 11(b) applies to centered Poisson processes with $c = 1$. Of course, one can also apply Fact 2(a) for general Lévy processes, obtaining the same bound (4.31). The point of Corollary 11(b) is that any local martingale with bounded jumps obeys this inequality, and so concentrates like a centered Poisson process in this sense. Barlow et al. (1986, §4) describe further exponential supermartingales obtained for continuous-time processes using the quadratic variation, and derive "Freedman-style" self-normalized bounds; incorporating these cases into our framework would be interesting future work.
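The $e^{-2ab}$ crossing probability in part (a) can be checked by simulation for Brownian motion; a minimal Monte Carlo sketch (numpy assumed; horizon, grid size, and path count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def crossing_prob(a, b, T=20.0, dt=0.01, n_paths=5000):
    """Monte Carlo estimate of P(exists t <= T: B_t >= a + b*t) for
    standard Brownian motion, approximated on a time grid. The bound
    (4.30) says this is at most exp(-2ab), with equality as the
    horizon grows (Durrett, Exercise 7.5.2)."""
    steps = int(T / dt)
    increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, steps))
    paths = np.cumsum(increments, axis=1)
    times = dt * np.arange(1, steps + 1)
    crossed = (paths >= a + b * times).any(axis=1)
    return crossed.mean()

a, b = 1.0, 1.0
est = crossing_prob(a, b)
# Grid discretization can only miss crossings, so the estimate sits
# slightly below the exact value exp(-2) ~= 0.135.
assert 0.09 < est < 0.15
```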

Exponential families and the sequential probability ratio test
It is well known that the likelihood ratio $f_{1,t}(X_1^t)/f_{0,t}(X_1^t)$ is a martingale under the null hypothesis that $X_1^t \sim f_{0,t}$. Then Ville's inequality gives a sequential test with valid type I error, equivalent to an open-ended sequential probability ratio test (SPRT; Wald, 1945), in which we stop when the likelihood ratio exceeds an upper threshold, but not when it drops below any lower threshold. In the one-parameter exponential family case, we obtain a simple analytical result which is equivalent to Theorem 1, as we detail below.
Corollary 12. This one-sided SPRT has type I error rate no greater than $\alpha$. This standard fact follows easily from Theorem 1 because $L_t \ge A$ if and only if $S_t \ge (\log A)/\lambda + \psi(\lambda)t/\lambda$, where $\psi(\lambda) = A(\theta_0 + \lambda) - A(\theta_0)$ is the CGF of $T(X_i)$ at $\theta = \theta_0$. Hence the rejection boundary for the SPRT is equivalent to the linear boundary of Theorem 1. In light of this, we may interpret the above sub-Gaussian, sub-Poisson, sub-exponential and sub-Bernoulli bounds as open-ended SPRTs for i.i.d. observations from these exponential families. The fact that such tests are also valid for testing various nonparametric classes of distributions, as outlined in Section 3, illustrates how our framework provides nonparametric generalizations of the SPRT. For example, to test the mean of a bounded distribution, our framework suggests applying an SPRT for Bernoulli or Poisson observations. It has long been known that the normal SPRT bound can be applied to sequential problems involving any i.i.d. sequence of sub-Gaussian observations (Darling and Robbins, 1967; Robbins, 1970). Our work expands the breadth of nonparametric sequential problems amenable to such methods and deepens the connection between exponential concentration inequalities and sequential testing procedures.
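The type I error guarantee of Corollary 12 can be checked by simulation in the Gaussian case, where $\psi(\lambda) = \lambda^2/2$ at $\theta_0 = 0$; a minimal sketch (numpy assumed; $\lambda$, horizon, and path count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def sprt_type1_error(alpha=0.1, lam=0.5, horizon=500, n_paths=2000):
    """One-sided SPRT for a Gaussian mean (a one-parameter exponential
    family with psi(lam) = lam^2/2 at theta_0 = 0). Under the null,
    L_t = exp(lam*S_t - t*lam^2/2) is a martingale with L_0 = 1, and
    Ville's inequality bounds P(sup_t L_t >= 1/alpha) by alpha."""
    x = rng.normal(0.0, 1.0, size=(n_paths, horizon))
    s = np.cumsum(x, axis=1)
    t = np.arange(1, horizon + 1)
    log_lr = lam * s - t * lam**2 / 2
    rejected = (log_lr >= np.log(1.0 / alpha)).any(axis=1)
    return rejected.mean()

est = sprt_type1_error()
assert est < 0.12  # at most alpha = 0.1, up to Monte Carlo error
```

The overshoot of the discrete-time likelihood ratio over the threshold makes the realized error rate somewhat below $\alpha$.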

Discussion and extensions
This section is divided into three parts. We first discuss the sharpness of the derived bounds. Then, building further on the geometric intuition of the paper, we point out an interesting geometric relationship between fixed-sample exponential bounds and our uniform bounds. We end by discussing directions for future work.

When is Theorem 1 sharp?
In the discrete-time, sub-Gaussian case $\psi = \psi_N$ and $l_0 = 1$, Theorem 1(a) is sharp: the crossing probability bound (5.1) holds with equality in the worst case for any $a, b > 0$. In fact, this can be achieved in the limit (5.2) by rescaling any sum of i.i.d. observations with finite variance, which we prove in Section 6.9 as a corollary of Theorem 2 of Robbins and Siegmund (1970). The following more general sandwich relation, which we prove in Section 6.10, quantifies the looseness in Theorem 1(a) and gives a sufficient condition for the probability bound to be exact. This condition involves the "overshoot" of the process $S_t$ over the line $a + bV_t$, a quantity which has been studied extensively in the context of sequential testing (Siegmund, 1985). The upper bound in equation (5.3) below is a restatement of Theorem 1(a); only the lower bound is new.
Then we have the sandwich relation (5.3). In particular, if the conditions of Proposition 3 hold with $\epsilon = 0$, then the probability bounds in Theorem 1 parts (a), (b) and (c) hold with equality. In the continuous-time case with $(S_t)$ a continuous martingale, these conditions often hold with $\psi = \psi_N$ and $V_t = [S]_t$. We give details for the following result in Section 6.11; see Protter (2005, Theorem III.44) for more on Kazamaki's criterion.

Corollary 14. Suppose $(S_t)_{t\in(0,\infty)}$ is a continuous martingale with $S_0 = 0$ and $[S]_t \uparrow \infty$ a.s. satisfying Kazamaki's criterion: $\sup_T \mathbb{E}e^{S_T/2} < \infty$, where the supremum is taken over all bounded stopping times $T$. Then $\mathbb{P}(\exists t \in (0,\infty): S_t \ge a + b[S]_t) = e^{-2ab}$ for any $a, b > 0$.

In the discrete-time case with i.i.d. observations bounded above by $\epsilon$ a.s. and having CGF $\psi$, the conditions of Proposition 3 hold, setting $V_t = t$. Hence the probability bound in Theorem 1(a) can be made arbitrarily close to exact by taking $b$ sufficiently small relative to $\epsilon$, and similarly for parts (b) and (c). So Theorem 1 is sharp in the sense that, for any such process, the probability bound is arbitrarily close to exact for some choice of $(a, b)$ or $(x, m)$. To see the connection with Corollary 13, rewrite (5.2) to keep the processes $S_t$ and $V_t = t\sigma^2$ fixed and take limits with respect to $a, b$, as in (5.4).

Figure 6: Geometric illustration of Theorem 1(b) and its relation to fixed-time Cramér-Chernoff bounds, plotting linear uniform bounds against the curve of fixed-time Chernoff bounds. Theorem 1(b) chooses the linear boundary which is optimal for $V_t = m$, but other linear boundaries with the same crossing probability are illustrated, each of which achieves the optimal fixed-time bound at some other time $V_t = m_\pm$. Each uniform Chernoff bound is tangent to the curve of fixed-time bounds, and indeed the curve of fixed-time bounds may be defined as the pointwise infimum of such linear uniform bounds.
Proposition 4. Any line $a + bt$ which is tangent to $f_\alpha(t)$ satisfies $\mathbb{P}(\exists t \in \mathcal{T}: S_t \ge a + bt) \le \alpha$.
In words, the above proposition states that the set of linear boundaries from Theorem 1 is exactly the set of tangent lines to $f_\alpha$, or conversely, $f_\alpha$ is defined as the pointwise infimum of this set of linear boundaries, as illustrated in Figure 6. We give the proof in Section 6.12. This observation provides some intuition for the appearance of the Legendre-Fenchel transform in the standard Cramér-Chernoff formula (4.2).
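For the sub-Gaussian case $\psi = \psi_N$, this tangency can be checked numerically: the infimum (6.56) has the closed form $f_\alpha(t) = \sqrt{2t\log\alpha^{-1}}$, and the envelope of the tangent lines recovers it. A minimal sketch (numpy assumed; the $\lambda$ grid is an illustrative discretization):

```python
import numpy as np

alpha = 0.05
log_inv_alpha = np.log(1.0 / alpha)
psi = lambda lam: lam**2 / 2  # sub-Gaussian psi_N

# f_alpha as the pointwise infimum over a grid of tangent lines, each
# with intercept log(1/alpha)/lam and slope psi(lam)/lam, as in (6.56).
lams = np.linspace(0.01, 10.0, 5000)
ts = np.linspace(0.5, 50.0, 200)
lines = log_inv_alpha / lams[:, None] + (psi(lams) / lams)[:, None] * ts
envelope = lines.min(axis=0)

# Closed form for psi_N: f_alpha(t) = sqrt(2 t log(1/alpha)).
exact = np.sqrt(2 * ts * log_inv_alpha)
assert np.allclose(envelope, exact, rtol=1e-3)
```

Each line in the family has crossing probability $\alpha$ by Theorem 1, yet their pointwise infimum is the fixed-time Chernoff curve.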

Future work
Characterizing families of sub-$\psi$ processes. Our Theorem 1 bounds the maximal line-crossing probability over each family $\mathcal{S}_\psi^{l_0}$, and Section 3 collects sufficient conditions for membership in many such families. It would be interesting to better delineate such families, for example by characterizing necessary conditions for inclusion. When $\psi$ is CGF-like and $(V_t)$ is predictable, it is necessary for the increments $\Delta S_t$ to have finite conditional CGFs a.s. When $S_t$ is a cumulative sum of i.i.d., real-valued random variables and $V_t \propto t$, the existence of the CGF is sufficient as well (Fact 1(a)). When the increments are not i.i.d., however, existence of conditional CGFs is no longer sufficient. When $(V_t)$ is not predictable, as with self-normalized bounds, it is no longer necessary for increments to have finite CGFs (e.g., Example 5).
Determining optimal $l_0$ values. Smaller values of $l_0$ are preferable since they lead to tighter bounds in Theorem 1. Most of the results in this paper take either $l_0 = 1$ for scalar observations or $l_0 = d$ for $d \times d$ matrix observations. Taking $\lambda \downarrow 0$ in Definition 1 shows we cannot have $l_0 < 1$. Furthermore, asymptotic results about maxima of independent Gaussians show that $l_0 = d$ is an asymptotic lower bound as $d \uparrow \infty$ for operator-norm inequalities over any class that includes matrices with independent Gaussians on the diagonal (Galambos, 1978; Boucheron et al., 2013, Exercise 2.17). It would be useful to derive more results about optimal values of $l_0$ in various settings.
Generalizing assumptions. Definition 1 can be further generalized, allowing it to subsume more known inequalities and yield sharper results for certain cases. However, the corresponding general theorem and specific results are less user-friendly. We have chosen our Definition 1 and Theorem 1 to balance generality and tractability, but in Appendix D we present one possible generalization of our assumption and a corresponding general theorem and specific bound.
Polynomial line-crossing inequalities. We have focused on exponential inequalities, but polynomial concentration also plays an important role in the literature. A theory of polynomial line-crossing analogous to that presented here may begin with the Dubins-Savage inequality (see Appendix B) and its $l_p$ extension.
Banach spaces. The Banach space bounds in Section 4.4 give dimension-free $\ell_p$ bounds for $2 \le p < \infty$, but do not give $\ell_\infty$ bounds. In particular, this does not yield operator-norm bounds for infinite-dimensional Hilbert-Schmidt operators, as in Minsker (2017). Extending Minsker's "effective rank" approach to the uniform bounds of this paper would be an interesting future extension.

Proof of Lemma 1
Define the stopping time $\tau := \inf\{t \in \mathcal{T}: L_t \ge a\}$, with $\inf\emptyset = \infty$. For any fixed $m \in \mathcal{T}$, Markov's inequality implies
$$\mathbb{P}(\tau \le m \mid \mathcal{F}_0) \le \mathbb{P}(L_{\tau\wedge m} \ge a \mid \mathcal{F}_0) \le \frac{\mathbb{E}(L_{\tau\wedge m} \mid \mathcal{F}_0)}{a} \le \frac{L_0}{a},$$
where we have used Doob's optional stopping theorem for bounded stopping times in the final step (e.g., Durrett, 2017, Exercise 4.4.2; or Protter, 2005, Theorem I.17). Taking $m \to \infty$ and using the bounded convergence theorem yields $\mathbb{P}(\tau < \infty \mid \mathcal{F}_0) \le L_0/a$, which is the desired conclusion.
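Ville's inequality can be checked by simulation for any concrete nonnegative martingale; a minimal sketch with a multiplicative Rademacher walk (numpy assumed; the step size $0.5$, threshold, and horizon are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def ville_crossing(a=5.0, horizon=1000, n_paths=4000):
    """Monte Carlo check of Ville's inequality: for the nonnegative
    martingale L_t = prod_{i<=t} (1 + 0.5*eps_i) with Rademacher eps_i
    and L_0 = 1, P(sup_t L_t >= a) <= L_0/a = 1/a."""
    eps = rng.choice([-1.0, 1.0], size=(n_paths, horizon))
    log_l = np.cumsum(np.log1p(0.5 * eps), axis=1)
    crossed = (log_l >= np.log(a)).any(axis=1)
    return crossed.mean()

est = ville_crossing()
assert est < 1 / 5.0 + 0.02  # bound L_0/a = 0.2, up to Monte Carlo error
```

Since $\mathbb{E}\log(1 + 0.5\varepsilon) < 0$, the martingale converges to zero a.s., so the supremum is essentially attained within the simulated horizon.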

Proof of Proposition 2
In each case, we show an inequality between two $\psi$ functions. The conclusion then follows from the fact that if $\psi_1 \le \psi_2$, then $\exp\{\lambda S_t - \psi_2(\lambda)V_t\} \le \exp\{\lambda S_t - \psi_1(\lambda)V_t\}$, showing that the key condition of Definition 1 continues to hold with $\psi_2$ in place of $\psi_1$. Part (1): the cited proof of Theorem 1 shows that, for all $\mu \in (0,1)$ and all $t \in [0, 1-\mu)$, an inequality holds with equality at $t = 1 - 2\mu$. Substituting $\mu = g/(g+h)$ and $t = u/(g+h)$ for $u \in [0, h)$, some algebra shows that the left-hand side is equal to $gh\,\psi_B(u/gh)$ and the right-hand side is equal to $\psi_N(u)/\varphi(g,h)$, so that, for all $g, h > 0$ and $u \in [0, h)$, $\psi_B(u/gh) \ge \psi_N(u)/[gh\,\varphi(g,h)]$, with equality at $u = h - g$.
To see that the above constants are the best possible when we allow only scaling of $V_t$ by a constant, consider the third-order expansions of each $\psi$ function about $\lambda = 0$. It is clear from these expansions that parts (3), (4), (5), (6), and (11) have the best possible constants. Part (7) is unimprovable because $\psi_E$ diverges at $\lambda = 1/c$, and using any scale parameter in $\psi_G$ smaller than $c$ would make $\psi_G$ finite at $\lambda = 1/c$. For part (8), recall that when $c < 0$, $\bar{b} = |c|^{-1}$ for $\psi_E$, while $\bar{b} = |2c|^{-1}$ for $\psi_G$. Hence, if $c' < c/2 < 0$, then $\lim_{\lambda\to\infty}\psi_{G,c'}(\lambda) = |2c'|^{-1} < |c|^{-1} = \lim_{\lambda\to\infty}\psi_{E,c}(\lambda)$, so that $\psi_{G,c'}(\lambda)$ must be smaller than $\psi_{E,c}(\lambda)$ for sufficiently large $\lambda$. Part (9) is unimprovable by an analogous argument. For part (1), when $g \ge h$, we know that the constant of one in front of $\psi_N(\lambda)$ is the best possible from the expansions above. When $g < h$, some algebra shows that the inequality $\psi_{B,g,h}(\lambda) \le \frac{\varphi(g,h)}{gh}\psi_N(\lambda)$ holds with equality at $\lambda = (h-g)/\varphi(g,h)$, so the constant cannot be improved. For part (2), it is easy to see that $\varphi(g,h) = \left(\frac{g+h}{2}\right)^2 = g^2$ when $g = h$, so the constant $\frac{(g+h)^2}{4gh}$ is the best possible of the form $k/gh$ where $k$ is a function of $g + h$ alone.
A brief remark on the rationale behind part (2). In the "Bernoulli I" (Fact 1(b)) and "Bernoulli II" (Lemma 3(a)) conditions, $V_t = ght$, so applying Proposition 2, part (2) leads to $V_t = \left(\frac{g+h}{2}\right)^2 t$, a function of the total range $g + h$ alone. This is useful in the common case that observations are known to be bounded in a range $[a, b]$, and an inequality is desired which depends only on the range $b - a$ and not on the location of the means within $[a, b]$.

An intermediate condition for sub-ψ processes
In discrete time, the following result captures a useful general condition on a matrix-valued process $(Y_t)$ that is sufficient to show that the maximum-eigenvalue process $S_t = \gamma_{\max}(Y_t)$ is sub-$\psi$.
Lemma 4. Let $\psi$ be a real-valued function with domain $[0, \lambda_{\max})$. Let $(Y_t)_{t\in\mathbb{N}}$ be an adapted, $\mathcal{H}^d$-valued process. Let $(W_t)_{t\in\mathbb{N}}$ be predictable, $\mathcal{H}^d$-valued, and nondecreasing in the semidefinite order, with $W_0 = 0$. Let $(U_t)_{t\in\mathbb{N}}$ be defined by $U_0 = 0$ and $\Delta U_t = u_t(\Delta Y_t)$ for some $u_t : \mathbb{R} \to \mathbb{R}_{\ge 0}$, for each $t$. If condition (6.14) holds for all $t \in \mathbb{N}$ and $\lambda \in [0, \lambda_{\max})$, then $S_t := \gamma_{\max}(Y_t)$ is sub-$\psi$ with variance process $V_t = \gamma_{\max}(U_t + W_t)$ and $l_0 = d$.

For a familiar example, suppose $d = 1$ and $(Y_t)$ has independent increments. Let $W_t = t$, $U_t \equiv 0$ and $\psi(\lambda) = \lambda^2/2$. Then (6.14) reduces to the usual definition of a 1-sub-Gaussian random variable (Boucheron et al., 2013). For a self-normalized example, let $(\Delta Y_t)$ be i.i.d. from any distribution symmetric about zero. Then, again letting $\psi(\lambda) = \lambda^2/2$, a symmetrization argument shows that (6.14) holds with $W_t \equiv 0$ and $U_t = \sum_{i=1}^t \Delta Y_i^2$. See Lemma 3(d) for a general statement of this condition.
The value $l_0 = d$, the ambient dimension, leads to a pre-factor of $d$ in all of our operator-norm matrix bounds. In cases when $\sup_{t\in\mathcal{T}}\operatorname{rank}(U_t + W_t) \le r < d$ a.s., the pre-factor $d$ in our bounds may be replaced by $r$ via an argument originally due to Oliveira (2010b). See Appendix A for details.
Proof of Lemma 4. The key result here is Lieb's concavity theorem: Fact 3 (Lieb, 1973; Tropp, 2012). For any fixed $H \in \mathcal{H}^d$, the function $A \mapsto \operatorname{tr}\exp\{H + \log(A)\}$ is concave on the positive-definite cone.
Fixing $\lambda \in [0, \lambda_{\max})$, Lieb's theorem and Jensen's inequality together imply inequality (6.15). Now we apply inequality (6.14) to the expectation and use the monotonicity of the trace exponential to obtain (6.16). This shows that the process $L_t := \operatorname{tr}\exp\{\lambda Y_t - \psi(\lambda)\cdot(U_t + W_t)\}$ is a supermartingale, with $L_0 = d$. Next we show that the key condition of Definition 1 holds a.s. for all $t$. We repeat a short argument from Tropp (2012). First, by the monotonicity of the trace exponential,
$$L_t \ge \operatorname{tr}\exp\{\lambda Y_t - \psi(\lambda)\gamma_{\max}(U_t + W_t)I_d\} \ge \gamma_{\max}\left(\exp\{\lambda Y_t - \psi(\lambda)\gamma_{\max}(U_t + W_t)I_d\}\right) =: B,$$
using the fact that the trace of a positive semidefinite matrix is at least as large as its maximum eigenvalue. Then the spectral mapping property gives (6.20). Finally, we use the fact that $\gamma_{\max}(A - cI_d) = \gamma_{\max}(A) - c$ for any $A \in \mathcal{H}^d$ and $c \in \mathbb{R}$ to see that $B = \exp\{\lambda\gamma_{\max}(Y_t) - \psi(\lambda)\gamma_{\max}(U_t + W_t)\}$, completing the argument.
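The concavity asserted by Lieb's theorem (Fact 3) can be sanity-checked numerically by testing midpoint concavity on random positive-definite matrices; a minimal sketch (assuming numpy and scipy are available):

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(4)

def tr_exp_h_log(H, A):
    """The map A -> tr exp(H + log A), which Lieb's theorem asserts is
    concave on the positive-definite cone for fixed Hermitian H."""
    return np.trace(expm(H + logm(A))).real

def random_sym(d):
    M = rng.normal(size=(d, d))
    return (M + M.T) / 2

def random_pd(d):
    M = rng.normal(size=(d, d))
    return M @ M.T + 0.1 * np.eye(d)  # strictly positive definite

d = 4
for _ in range(20):
    H, A, B = random_sym(d), random_pd(d), random_pd(d)
    lhs = tr_exp_h_log(H, (A + B) / 2)
    rhs = (tr_exp_h_log(H, A) + tr_exp_h_log(H, B)) / 2
    assert lhs >= rhs - 1e-8 * abs(rhs)  # midpoint concavity
```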

Proof of Lemma 3
We rely on the following transfer rule for the semidefinite ordering.
Fact 4 (Tropp, 2012, eq. 2.2). If $f(a) \le g(a)$ for all $a \in S$, then $f(A) \preceq g(A)$ when the eigenvalues of $A$ lie in $S$.
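The transfer rule can be illustrated numerically with $f(x) = e^x$ and $g(x) = 1 + x + x^2$, which satisfy $f \le g$ on $S = [-1, 1]$; a minimal sketch (numpy and scipy assumed):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)

def transfer_gap(A):
    """Minimum eigenvalue of (I + A + A^2) - expm(A) for symmetric A.
    Since e^a <= 1 + a + a^2 on [-1, 1], the transfer rule (Fact 4)
    says this is nonnegative whenever the eigenvalues of A lie in [-1, 1]."""
    d = A.shape[0]
    gap = np.eye(d) + A + A @ A - expm(A)
    return np.linalg.eigvalsh(gap).min()

for _ in range(20):
    M = rng.normal(size=(4, 4))
    A = (M + M.T) / 2
    A /= max(1.0, np.abs(np.linalg.eigvalsh(A)).max())  # eigenvalues in [-1, 1]
    assert transfer_gap(A) >= -1e-10
```

Because the scalar inequality is applied eigenvalue by eigenvalue, the matrix inequality holds in the semidefinite order.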
We make frequent use of the martingale property $\mathbb{E}_{t-1}\Delta Y_t = 0$, and prove in most cases that condition (6.21) holds for some $(U_t)$ and $(W_t)$, then invoke Lemma 4. This is a stronger condition than property (6.14); the latter is implied by taking logarithms on both sides, recalling the monotonicity of the matrix logarithm.
Part (a): we adapt the argument of Bennett (1962, p. 42). Fix $\lambda \ge 0$ and choose real numbers $u, v, w$ so that $e^{\lambda x} \le ux^2 + vx + w$ for all $x \le h$, with equality at $x = h$ and $x = -g$. Using the assumption $\Delta Y_t \preceq hI_d$, the transfer rule implies (6.22), where the second inequality uses the assumption $\mathbb{E}_{t-1}\Delta Y_t^2 \preceq ghI_d$ and the martingale property. Now consider the random matrix $Z$ which equals $hI_d$ with probability $\frac{g}{g+h}$ and $-gI_d$ with probability $\frac{h}{g+h}$ (6.23). Evidently $\mathbb{E}Z = 0$ and $\mathbb{E}Z^2 = ghI_d$, so $Z$ also satisfies the aforementioned assumptions. Note that identity (6.24) holds for any function $f : \mathbb{R} \to \mathbb{R}$. By our choice of $u, v, w$, we see that $\mathbb{E}e^{\lambda Z} = \mathbb{E}(uZ^2 + vZ + wI_d) = (ugh + w)I_d$, so (6.25) follows by direct calculation. Combining (6.25) with (6.22) shows that (6.21) holds with $U_t \equiv 0$ and $W_t = ghtI_d$, as desired.

Part (b): as in the cited lemma, we use the fact that $e^{\lambda x} \le \frac{g+x}{g+h}e^{h\lambda} + \frac{h-x}{g+h}e^{-g\lambda}$ for all $x \in [-g, h]$, along with the transfer rule, to conclude (6.26) for each $t$. Now the proof of Proposition 2 part (1) shows that $\psi_{B,g,h}(\lambda) \le \varphi(g,h)\psi_N(\lambda)/gh$, which shows that (6.21) holds with $U_t \equiv 0$ and $\Delta W_t = \varphi(G_t, H_t)I_d$, as desired.

Part (c): the argument is identical to that for part (a), except for the use of $\psi_{B,g,h}(\lambda) \le \frac{(g+h)^2}{4gh}\psi_N(\lambda)$ from the proof of Proposition 2 part (2).

Part (d): from the standard inequality $\cosh x \le e^{x^2/2}$ we see that $f(x) := e^{-x^2/2}\cosh x \le 1$ for all $x$. Introducing an independent Rademacher random variable $\varepsilon$, applying the transfer rule, and taking expectations yields the desired inequality for each $t$. Replacing $\lambda$ with $c\lambda$ and identifying $\psi_E$ completes the argument that (6.21) holds with $U_t = [Y]_t$ and $W_t \equiv 0$.

Part (f): the cited Proposition 12 shows that $e^{x - x^2/6} \le 1 + x + x^2/3$ for all $x \in \mathbb{R}$. By the transfer rule, this shows that (6.21) holds with $U_t = [Y]_t/3$ and $W_t = 2\langle Y\rangle_t/3$.

Part (g): the cited Proposition 12, together with the fact that $e^{-x} + x - 1 \le x^2/2$ for $x \ge 0$, shows that $e^{x - x_+^2/2} \le 1 + x + x_-^2/2$.
Again the transfer rule shows that (6.21) holds with $U_t = [Y^+]_t/2$ and $W_t = \langle Y^-\rangle_t/2$.

Part (h): we appeal to part (d) to see that $S_t$ is $d$-sub-Gaussian with the stated variance process. Substituting this larger variance process only makes the exponential process in Definition 1 smaller, so the assumption remains satisfied.

Part (i): the proof of Corollary 2.2 in Fan et al. (2015) is based on the inequality $e^{x - x^2/2} \le 1 + x + x_-^3/3$ for all $x \in \mathbb{R}$, which transfers to matrices by the transfer rule. Setting $c = 1/6$ in $\psi_G$, we have for all $x \in [0, 6)$ the obvious inequality $x^2/2 \le \psi_G(x)$, and we claim $x^3/3 \le \psi_G(x)$ as well; indeed, the relevant ratio reaches a maximum value of one at $x = 3$. The transfer rule now implies the desired bound, which shows that (6.21) holds with $U_t = [Y]_t$ and $V_t = \sum_{i=1}^t \mathbb{E}_{i-1}|\Delta Y_i|^3$.

Proof of Corollary 2
Define the $\mathcal{H}^{d_1+d_2}$-valued process $(Y_t)$ using the Hermitian dilation of $B_t$ (6.43). Since the dilation operation is linear and preserves spectral information, we have $\gamma_{\max}(Y_t) = \|B_t\|_{\mathrm{op}}$ (Tropp, 2012, Eq. 2.12). Furthermore, since each $B_i$ is fixed and $\epsilon_i$ is 1-sub-Gaussian (in the usual sense for scalar random variables), $(Y_t)$ satisfies the conditions of Lemma 4 with $\psi = \psi_N$, $U_t \equiv 0$, and $(W_t)$ given by Tropp (2012, Lemma 4.3). Hence $(S_t)$ is $(d_1+d_2)$-sub-Gaussian with variance process (6.45). The result now follows from Theorem 1(b).

Proof of Corollary 10
We invoke arguments from Pinelis (1994) and Pinelis (1992) to show that Definition 1 is satisfied.
For part (a), the proofs of Theorem 3 in Pinelis (1994) and Theorem 3 in Pinelis (1992) show that, for each $t \in \mathbb{N}$,
$$\mathbb{E}_{t-1}\cosh(\lambda\Psi(Y_t)) \le e^{\lambda^2 D^2 c_t^2/2}\cosh(\lambda\Psi(Y_{t-1})). \tag{6.49}$$
Hence $L_t := \cosh(\lambda\Psi(Y_t))\,e^{-\lambda^2 D^2\sum_{i=1}^t c_i^2/2}$ is a supermartingale, and the inequality $\cosh x > e^x/2$ implies that Definition 1 is satisfied for $S_t = \Psi(Y_t)$, $V_t = D^2\sum_{i=1}^t c_i^2$ and $\psi = \psi_N$ with $\lambda_{\max} = \infty$ and $l_0 = 2$. The conclusion (4.28) follows from a slight reparametrization of $V_t$ to make $D^2$ explicit in the bound.

Proof of Corollary 13
We invoke Theorem 2 of Robbins and Siegmund (1970) for the sum $S_t/\sigma$ with $g(t) = a/\sigma + b\sigma t$, noting that
$$\lim_{m\to\infty}\mathbb{P}\left(\exists t \in \mathbb{N}: \frac{S_t}{\sqrt{m}} \ge a + \frac{bt\sigma^2}{m}\right) = \lim_{m\to\infty}\mathbb{P}\left(\exists t \in \mathbb{N}: \frac{S_t}{\sigma} \ge \sqrt{m}\,g\!\left(\frac{t}{m}\right)\right). \tag{6.52}$$
It is easy to verify the conditions of parts (i) and (ii) of Robbins and Siegmund's theorem, yielding the conclusion
$$\lim_{m\to\infty}\mathbb{P}\left(\exists t \in \mathbb{N}: \frac{S_t}{\sigma} \ge \sqrt{m}\,g\!\left(\frac{t}{m}\right)\right) = \mathbb{P}(\exists t \in (0,\infty): B_t \ge g(t)), \tag{6.53}$$
where $(B_t)$ is standard Brownian motion. The latter probability is equal to $e^{-2ab}$ by the standard line-crossing formula for Brownian motion (e.g., Durrett, 2017, Exercise 7.5.2).

Proof of Proposition 3
From the definition of $D(\cdot)$, we see that $M_t = \exp\{D(b)\cdot(S_t - bV_t)\}$. Since $\tau$ is a stopping time, $(M_{t\wedge\tau})$ is a martingale, so $1 = \mathbb{E}M_{t\wedge\tau}$ for each $t \in \mathbb{N}$. The third condition of the proposition ensures that $M_{t\wedge\tau} \le e^{D(b)\cdot(a+\epsilon)}$ for all $t$ a.s., so by dominated convergence we have $\mathbb{E}M_{t\wedge\tau} \to \mathbb{E}M_\tau = 1$, where $M_\tau$ is defined as the a.s. limit of $(M_{t\wedge\tau})$, whose existence is guaranteed since the stopped process is a nonnegative martingale. The second condition of the proposition implies that $M_t \to 0$ a.s. on the event $\{\tau = \infty\}$.

Proof of Corollary 14
The conclusion follows immediately from Proposition 3 with $\epsilon = 0$ once we show that the conditions of the proposition are satisfied for $(S_t)$ with $V_t = [S]_t$ and $\psi = \psi_N$. In this case, since $(S_t)$ has continuous paths a.s., $(M_t)$ is the stochastic exponential of the process $(D(b)S_t)$ (Protter, 2005, Ch. II, Theorem 37). Kazamaki's criterion is sufficient to ensure $(M_t)$ is a martingale (Protter, 2005, Ch. III, Theorem 44), and $M_0 = 1$ since $S_0 = 0$. This shows that condition (1) of Proposition 3 holds. Condition (3) follows directly from the continuity of the paths of $(S_t)$.
It remains to show that condition (2) holds. For this we express $(S_t)$ as a time change of Brownian motion (Protter, 2005, Ch. II, Theorem 42): $S_t = B_{[S]_t}$, where $(B_t)$ is a standard Brownian motion (with respect to a different filtration). From the law of the iterated logarithm we know that $B_t/t \to 0$ a.s., which yields condition (2) since $[S]_t \uparrow \infty$ a.s.

Proof of Proposition 4
Lemma 2.4 of Boucheron et al. (2013) shows that
$$f_\alpha(t) = \inf_\lambda\left\{\frac{\log\alpha^{-1}}{\lambda} + \frac{\psi(\lambda)}{\lambda}\cdot t\right\}, \tag{6.56}$$
so that $f_\alpha(t)$ is a pointwise infimum of lines indexed by $\lambda$ with intercepts $a_\lambda = (\log\alpha^{-1})/\lambda$ and slopes $b_\lambda = \psi(\lambda)/\lambda$. Hence $D(b_\lambda) = \lambda$, and by Theorem 1 the crossing probability of each such line is $e^{-a_\lambda D(b_\lambda)} = \alpha$. Note we have also shown that $f_\alpha$ is concave. The optimizer $\lambda^\star(t)$ in (6.56) is the solution in $\lambda$ of $\lambda\psi'(\lambda) - \psi(\lambda) = (\log\alpha^{-1})/t$. The left-hand side of this equation has positive derivative in $\lambda$ by the convexity of $\psi$, so the map $t \mapsto \lambda^\star(t)$ is injective. Hence the optimal line $a_{\lambda^\star(m)} + b_{\lambda^\star(m)}t$ is tangent to the curve $f_\alpha(t)$ at $t = m$.

The Dubins-Savage inequality states that
$$\mathbb{P}(\exists t \in \mathbb{N}: S_t \ge a + bV_t) \le \frac{1}{1 + ab}. \tag{B.1}$$
The Dubins-Savage inequality may be proved by means similar to ours, invoking Ville's inequality for a suitable supermartingale. The relationship of our bounds to the Dubins-Savage inequality is analogous to that between fixed-time Cramér-Chernoff bounds and Chebyshev's inequality. More precisely, the Dubins-Savage inequality is analogous to Uspensky's one-sided version of Chebyshev's inequality (Uspensky, 1937) (B.2). Similar to our Theorem 1(b), we may optimize the RHS of (B.1) over all lines passing through a point $(m, x)$ to obtain an equivalent bound recovering Uspensky's inequality (B.2) with $x/2$ in place of $x$. The Dubins-Savage inequality does not recover Uspensky's inequality at the fixed time $m$: something is necessarily lost in going from a fixed time to a uniform bound. Compare our Theorem 1(b), which exactly recovers the fixed-time Cramér-Chernoff bound (4.2). For these exponential bounds, we lose nothing in going from a fixed time to a uniform bound.

Table 2 compares these $\psi$ functions: we have set $g = h = 1$ in $\psi_B$, $c = 1$ in $\psi_P$, $c = 1/3$ in $\psi_G$, and $c = 1/2$ in $\psi_E$. These are all values that might be used in bounding a process with $[-1, 1]$-valued increments using the same variance process; see Figure 3 and Proposition 2.
In general, bounds based on different $\psi$ functions may have different assumptions and variance processes, so may not be comparable based on $\psi$ functions alone. However, with identical variance processes, a smaller $\psi$ function yields a tighter bound. Note that all of these functions behave like $\psi_N(\lambda) = \lambda^2/2$ near the origin.
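The Dubins-Savage bound (B.1) discussed above can also be checked by simulation; a minimal sketch (numpy assumed) using a Rademacher-increment martingale, for which the variance process is $V_t = t$:

```python
import numpy as np

rng = np.random.default_rng(6)

def dubins_savage_crossing(a=2.0, b=0.5, horizon=400, n_paths=3000):
    """Monte Carlo check of the Dubins-Savage inequality (B.1): for a
    martingale S_t with Rademacher increments (conditional variance 1,
    so V_t = t), P(exists t: S_t >= a + b*V_t) <= 1/(1 + a*b)."""
    steps = rng.choice([-1.0, 1.0], size=(n_paths, horizon))
    s = np.cumsum(steps, axis=1)
    t = np.arange(1, horizon + 1)
    crossed = (s >= a + b * t).any(axis=1)
    return crossed.mean()

est = dubins_savage_crossing()
assert est <= 1.0 / (1.0 + 2.0 * 0.5)  # bound 1/(1+ab) = 0.5
```

For this light-tailed example the polynomial bound is quite loose; the exponential bounds of Theorem 1 are far tighter, illustrating the Chebyshev-versus-Chernoff analogy drawn above.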
The proof follows the same principles as that of Theorem 1 and is omitted for brevity. One application of this result is to martingales with bounded increments, making use of $\psi_B$: Corollary 15. Let $(Y_t)_{t\in\mathbb{N}}$ be an $\mathcal{H}^d$-valued martingale and let $S_t := \gamma_{\max}(Y_t)$. Suppose $\gamma_{\max}(\Delta Y_t) \le c$ for all $t$ for some $c > 0$, and let $V_t := \gamma_{\max}(\langle Y\rangle_t)$. Then for any $x, m > 0$ and $n \in \mathbb{N}$, we have a uniform bound on $\mathbb{P}\left(\exists t \le n: S_t \ge x + n\left[g\left(\tfrac{V_t}{n}\right) - g\left(\tfrac{m}{n}\right)\right]\right)$.