Freedman's inequality for matrix martingales

Freedman's inequality is a martingale counterpart to Bernstein's inequality. This result shows that the large-deviation behavior of a martingale is controlled by the predictable quadratic variation and a uniform upper bound for the martingale difference sequence. Oliveira has recently established a natural extension of Freedman's inequality that provides tail bounds for the maximum singular value of a matrix-valued martingale. This note describes a different proof of the matrix Freedman inequality that depends on a deep theorem of Lieb from matrix analysis. This argument delivers sharp constants in the matrix Freedman inequality, and it also yields tail bounds for other types of matrix martingales. The new techniques are adapted from recent work by the present author.


1. An Introduction to Freedman's Inequality
The Freedman inequality [Fre75, Thm. (1.6)] is a martingale extension of the Bernstein inequality. This result demonstrates that a martingale exhibits normal-type concentration near its mean value on a scale determined by the predictable quadratic variation, and the upper tail has Poisson-type decay on a scale determined by a uniform bound on the difference sequence.
Oliveira [Oli10, Thm. 1.2] proves that Freedman's inequality extends, in a certain form, to the matrix setting. The purpose of this note is to demonstrate that the methods from the author's paper [Tro10b] can be used to establish a sharper version of the matrix Freedman inequality. Furthermore, this approach offers a transparent way to obtain other probability inequalities for adapted sequences. Let us introduce some notation and background on martingales so that we can state Freedman's original result rigorously. Afterward, we continue with a statement of our main results and a presentation of the methods that we need to prove the matrix generalization.
1.1. Martingales. Let (Ω, F, P) be a probability space, and let F_0 ⊂ F_1 ⊂ F_2 ⊂ ⋯ ⊂ F be a filtration of the master sigma algebra. We write E_k for the expectation conditioned on F_k. A martingale is a (real-valued) random process {Y_k : k = 0, 1, 2, . . . } that is adapted to the filtration and that satisfies two properties:

E_{k−1} Y_k = Y_{k−1} and E|Y_k| < ∞ for k = 1, 2, 3, . . . .

Roughly, the present value of a martingale depends only on the past values, and the martingale has the status quo property: today, on average, is the same as yesterday. For simplicity, we assume that the initial value of a martingale is null: Y_0 = 0. The difference sequence is the random process {X_k : k = 1, 2, 3, . . . } defined by

X_k := Y_k − Y_{k−1} for k = 1, 2, 3, . . . .

1.2. Freedman's Inequality.

Theorem 1.1 (Freedman). Consider a real-valued martingale {Y_k : k = 0, 1, 2, . . . } with difference sequence {X_k : k = 1, 2, 3, . . . }. Assume that the difference sequence is uniformly bounded: X_k ≤ R almost surely for k = 1, 2, 3, . . . .
Define the predictable quadratic variation process of the martingale:

W_k := ∑_{j=1}^{k} E_{j−1}(X_j^2) for k = 1, 2, 3, . . . .

Then, for all t ≥ 0 and σ^2 > 0,

P{∃k ≥ 0 : Y_k ≥ t and W_k ≤ σ^2} ≤ exp( −(t^2/2) / (σ^2 + Rt/3) ).

When the difference sequence {X_k} consists of independent random variables, the predictable quadratic variation is no longer random. In this case, Freedman's inequality reduces to the usual Bernstein inequality [Lug09, Thm. 6].
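As a sanity check, the scalar inequality can be compared with simulation. The sketch below (assuming numpy; the random-walk example is ours, not from the text) uses a ±1 random walk, for which the differences are bounded by R = 1 and the predictable quadratic variation is W_k = k:

```python
import numpy as np

def freedman_bound(t, sigma2, R):
    """Right-hand side of Freedman's inequality."""
    return np.exp(-(t**2 / 2) / (sigma2 + R * t / 3))

# For a +/-1 random walk, |X_k| <= 1 and W_k = k, so the event
# {exists k : Y_k >= t and W_k <= sigma2} with sigma2 = n is simply
# {the running maximum over the first n partial sums reaches t}.
rng = np.random.default_rng(0)
n = 100           # horizon, so W_k = k <= n = sigma2
t = 30.0
trials = 2000
steps = rng.choice([-1.0, 1.0], size=(trials, n))
running_max = np.maximum.accumulate(steps.cumsum(axis=1), axis=1)[:, -1]
empirical = np.mean(running_max >= t)
bound = freedman_bound(t, sigma2=float(n), R=1.0)
print(empirical, bound)  # the empirical frequency sits below the bound
```

The gap between the two numbers reflects the fact that the bound must cover every martingale with the same variance proxy, not just this one.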
1.3. Matrix Martingales. Matrix martingales are defined in much the same manner as scalar martingales. Consider a random process {Y_k : k = 0, 1, 2, . . . } whose values are matrices of finite dimension. We say that the process is a matrix martingale when

E_{k−1} Y_k = Y_{k−1} and E ‖Y_k‖ < ∞ for k = 1, 2, 3, . . . .

We write ‖·‖ for the spectral norm, which coincides with the operator norm between Hilbert spaces. As before, we assume that Y_0 = 0, and we define the difference sequence {X_k : k = 1, 2, 3, . . . } via the relation X_k = Y_k − Y_{k−1} for k = 1, 2, 3, . . . .
A matrix-valued random process is a martingale if and only if we obtain a scalar martingale when we track each fixed coordinate in time.

Theorem 1.2 (Matrix Freedman). Consider a matrix martingale {Y_k : k = 0, 1, 2, . . . } whose values are self-adjoint matrices with dimension d, and let {X_k : k = 1, 2, 3, . . . } be the difference sequence. Assume that the difference sequence is uniformly bounded in the sense that λ_max(X_k) ≤ R almost surely for k = 1, 2, 3, . . . .
Define the predictable quadratic variation process of the martingale:

W_k := ∑_{j=1}^{k} E_{j−1}(X_j^2) for k = 1, 2, 3, . . . .

Then, for all t ≥ 0 and σ^2 > 0,

P{∃k ≥ 0 : λ_max(Y_k) ≥ t and ‖W_k‖ ≤ σ^2} ≤ d · exp( −(t^2/2) / (σ^2 + Rt/3) ).

Here and elsewhere, λ_max denotes the algebraically largest eigenvalue of a self-adjoint matrix, and ‖·‖ denotes the spectral norm, which returns the largest singular value of a matrix.
Theorem 1.2 offers several concrete improvements over Oliveira's original work. His theorem [Oli10, Thm. 1.2] requires a stronger uniform bound of the form ‖X_k‖ ≤ R, and the constants in his inequality are somewhat larger (but still very reasonable).
We prove Theorem 1.2 in Section 3 as a consequence of a stronger probability inequality that follows from a general result for adapted sequences of matrices. These tail bounds cannot be sharpened without changing their structure; see [Tro10b, §4 and §6] for a more detailed discussion.
As an immediate corollary of Theorem 1.2, we obtain a result for rectangular matrices.
Corollary 1.3 (Rectangular Matrix Freedman). Consider a matrix martingale {Y_k : k = 0, 1, 2, . . . } whose values are matrices with dimension d_1 × d_2, and let {X_k : k = 1, 2, 3, . . . } be the difference sequence. Assume that the difference sequence is uniformly bounded:

‖X_k‖ ≤ R almost surely for k = 1, 2, 3, . . . .

Define two predictable quadratic variation processes for this martingale:

W_{col,k} := ∑_{j=1}^{k} E_{j−1}(X_j X_j^*) and W_{row,k} := ∑_{j=1}^{k} E_{j−1}(X_j^* X_j) for k = 1, 2, 3, . . . .

Then, for all t ≥ 0 and σ^2 > 0,

P{∃k ≥ 0 : ‖Y_k‖ ≥ t and max(‖W_{col,k}‖, ‖W_{row,k}‖) ≤ σ^2} ≤ (d_1 + d_2) · exp( −(t^2/2) / (σ^2 + Rt/3) ).

Proof. Apply Theorem 1.2 to the self-adjoint dilation of the martingale {Y_k}.

The proof of Theorem 1.2 depends on a deep theorem of Lieb from matrix analysis.

Theorem 1.4 (Lieb). Let H be a fixed self-adjoint matrix. The function A ↦ tr exp(H + log A) is concave on the cone of positive-definite matrices.

We require a simple corollary of this result.

Corollary 1.5. Let H be a fixed self-adjoint matrix, and let X be a random self-adjoint matrix. Then E tr exp(H + X) ≤ tr exp(H + log(E e^X)).
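The reduction from rectangular to self-adjoint matrices runs through the self-adjoint dilation. A minimal numpy illustration (the example matrix is an arbitrary choice of ours) of the two facts the reduction uses: the dilation's largest eigenvalue equals the spectral norm, and its square carries X X* and X* X on its diagonal blocks:

```python
import numpy as np

def dilation(X):
    """Self-adjoint dilation of a d1 x d2 matrix X."""
    d1, d2 = X.shape
    return np.block([[np.zeros((d1, d1)), X],
                     [X.conj().T, np.zeros((d2, d2))]])

X = np.array([[1.0, 2.0, 0.5],
              [0.0, -1.0, 3.0]])
D = dilation(X)
spectral_norm = np.linalg.norm(X, 2)      # largest singular value of X
lam_max = np.linalg.eigvalsh(D).max()     # largest eigenvalue of the dilation
print(spectral_norm, lam_max)             # the two quantities coincide

# D @ D is block-diagonal with X X* and X* X, which is why both the
# column and the row variation processes appear in the corollary.
```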
Proof. Define the random matrix Y = e^X, and calculate that

E tr exp(H + X) = E tr exp(H + log Y) ≤ tr exp(H + log(E Y)) = tr exp(H + log(E e^X)).
The first identity follows because the logarithm can be defined as the functional inverse of the matrix exponential. Lieb's result, Theorem 1.4, establishes that the trace function is concave in Y , so we may invoke Jensen's inequality to draw the expectation inside the logarithm.
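Corollary 1.5 can also be probed numerically. The sketch below (assuming scipy; the distribution of X, two equally likely symmetric matrices, is an arbitrary construction of ours) compares the two sides of the inequality:

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(1)

def random_sym(d):
    A = rng.standard_normal((d, d))
    return (A + A.T) / 2

d = 4
H = random_sym(d)
X1, X2 = random_sym(d), random_sym(d)   # X takes each value with prob 1/2

lhs = 0.5 * (np.trace(expm(H + X1)) + np.trace(expm(H + X2)))
mean_exp = 0.5 * (expm(X1) + expm(X2))  # E e^X, symmetric positive definite
rhs = np.trace(expm(H + logm(mean_exp).real)).real
print(lhs, rhs)                          # lhs <= rhs, as the corollary asserts
```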
A significant advantage of our point of view is that the proof extends in a transparent way to yield other types of probability inequalities for adapted sequences of random matrices. We have dilated on this observation in a preliminary version of this work that is now available as a technical report [Tro11]. Here, for brevity, we focus on proving Freedman's inequality.

2. Tail Bounds via Martingale Methods
In this section, we show that Freedman's techniques extend to the matrix setting with minor (but profound) changes. The key idea is to use Corollary 1.5 to control the evolution of a matrix version of the moment generating function. This argument culminates in a rather general theorem on the large deviation behavior of an adapted sequence of random matrices. In §3, we specialize this result to obtain Freedman's inequality.
2.1. Additional Terminology. We say that a sequence {X_k} of random matrices is adapted to the filtration when each X_k is measurable with respect to F_k. Loosely speaking, an adapted sequence is one where the present depends only upon the past. We say that a sequence {V_k} of random matrices is previsible when each V_k is measurable with respect to F_{k−1}. In particular, the sequence {E_{k−1} X_k} of conditional expectations of an adapted sequence {X_k} is previsible. A stopping time is a random variable κ : Ω → N_0 ∪ {∞} that satisfies {κ ≤ k} ∈ F_k for k = 0, 1, 2, . . . , ∞.
In words, we can determine if the stopping time has arrived from current and past experience.
2.2. The Large Deviation Supermartingale. Consider an adapted random process {X_k : k = 1, 2, 3, . . . } and a previsible random process {V_k : k = 1, 2, 3, . . . } whose values are self-adjoint matrices with dimension d. Suppose that the two processes are connected through a relation of the form

log(E_{k−1} e^{θ·X_k}) ⪯ g(θ) · V_k for θ > 0,    (2.1)

where ⪯ denotes the semidefinite order and g : (0, ∞) → [0, ∞]. Define the partial sum processes

Y_k := ∑_{j=1}^{k} X_j and W_k := ∑_{j=1}^{k} V_j for k = 0, 1, 2, . . . .

The random matrix W_k can be viewed as a measure of the total variability of the process {X_k} up to time k. The partial sum Y_k is unlikely to be large unless W_k is also large.
To continue, we fix the function g and a positive number θ. Define a real-valued function with two self-adjoint matrix arguments:

G_θ(Y, W) := tr exp(θ·Y − g(θ)·W).

We use the function G_θ to construct a real-valued random process:

S_k := G_θ(Y_k, W_k) for k = 0, 1, 2, . . . .
This process is an evolving measure of the discrepancy between the partial sum process {Y_k} and the cumulant sum process {W_k}. The following lemma describes the key properties of this random sequence. In particular, the average discrepancy decreases with time.

Lemma 2.1. For each θ > 0, the random process {S_k : k = 0, 1, 2, . . . } is a positive supermartingale whose initial value S_0 = d.
Proof. It is easily seen that S_k is positive because the exponential of a self-adjoint matrix is positive definite, and the trace of a positive-definite matrix is positive. We obtain the initial value from a short calculation:

S_0 = tr exp(θ·Y_0 − g(θ)·W_0) = tr exp(0) = tr I = d.

To prove that the process is a supermartingale, we ascend a short chain of inequalities:

E_{k−1} S_k = E_{k−1} tr exp(θ·Y_{k−1} − g(θ)·W_k + θ·X_k)
  ≤ tr exp(θ·Y_{k−1} − g(θ)·W_k + log(E_{k−1} e^{θ·X_k}))
  ≤ tr exp(θ·Y_{k−1} − g(θ)·W_k + g(θ)·V_k)
  = tr exp(θ·Y_{k−1} − g(θ)·W_{k−1}) = S_{k−1}.
In the second line, we invoke Corollary 1.5, conditional on F k−1 . This act is legal because Y k−1 and W k are both measurable with respect to F k−1 . The next inequality depends on the assumption (2.1) together with the fact that the trace exponential is monotone with respect to the semidefinite order [Pet94, §2.2]. The last step follows because {W k } is the sequence of partial sums of {V k }.
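One step of the supermartingale property can be verified numerically. In this sketch (our own example, assuming scipy), the difference takes the two values ±A with ‖A‖ ≤ 1, in which case V = E_{k−1}(X_k^2) = A^2 satisfies a relation of the form (2.1) with g(θ) = e^θ − θ − 1 by the mgf bound quoted in Section 3:

```python
import numpy as np
from scipy.linalg import expm

theta = 0.7
g = np.exp(theta) - theta - 1.0

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 3)); A = (B + B.T) / 2
A /= np.abs(np.linalg.eigvalsh(A)).max()     # enforce ||A|| <= 1
C = rng.standard_normal((3, 3)); Y_prev = (C + C.T) / 2  # partial sum so far
W_prev = np.eye(3)                            # accumulated variability so far
V = A @ A                                     # V_k = E_{k-1} X_k^2 for X = +/-A
W = W_prev + V

S_prev = np.trace(expm(theta * Y_prev - g * W_prev))
# conditional expectation E_{k-1} S_k: average over the branches X = A, -A
S_next = 0.5 * (np.trace(expm(theta * (Y_prev + A) - g * W))
                + np.trace(expm(theta * (Y_prev - A) - g * W)))
print(S_next, S_prev)  # S_next <= S_prev: the discrepancy shrinks on average
```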
Finally, we present a simple inequality for the function G_θ that holds when we have control on the eigenvalues of its arguments.

Lemma 2.2. Let Y and W be self-adjoint matrices that satisfy λ_max(Y) ≥ t and W ⪯ wI. For each θ > 0,

G_θ(Y, W) ≥ e^{θt − g(θ)·w}.

Proof. Under the stated hypotheses,

G_θ(Y, W) = tr exp(θ·Y − g(θ)·W) ≥ tr exp(θ·Y − g(θ)·wI) ≥ λ_max( exp(θ·Y − g(θ)·wI) ) = e^{θ·λ_max(Y) − g(θ)·w} ≥ e^{θt − g(θ)·w}.

The first inequality depends on the semidefinite relation W ⪯ wI and the monotonicity of the trace exponential with respect to the semidefinite order [Pet94, §2.2]. The second inequality relies on the fact that the trace of a psd matrix is at least as large as its maximum eigenvalue. The third identity follows from the spectral mapping theorem and elementary properties of the maximum eigenvalue map.
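A quick numerical instance of this lower bound, with diagonal arguments and parameter values of our own choosing:

```python
import numpy as np
from scipy.linalg import expm

theta, g_theta = 0.5, 0.2      # any positive theta and any value of g(theta)
Y = np.diag([2.0, -1.0, 0.5])  # lambda_max(Y) = 2
W = np.diag([0.3, 0.6, 0.1])   # W <= w * I with w = 0.6
t, w = 2.0, 0.6

G = np.trace(expm(theta * Y - g_theta * W))
lower = np.exp(theta * t - g_theta * w)
print(G, lower)  # G dominates the lower bound exp(theta*t - g(theta)*w)
```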

2.3. A Tail Bound for Adapted Sequences.
Our key theorem for adapted sequences provides a bound on the probability that the partial sum of a matrix-valued random process is large. In the next section, we apply this result to establish a stronger version of Theorem 1.2. This result also allows us to develop other types of probability inequalities for adapted sequences of random matrices; see the technical report [Tro11] for additional details.

Theorem 2.3 (Master Tail Bound for Adapted Sequences). Consider an adapted sequence {X_k} and a previsible sequence {V_k} of self-adjoint matrices with dimension d. Assume that these sequences satisfy the cgf relation

log(E_{k−1} e^{θ·X_k}) ⪯ g(θ) · V_k for θ > 0,    (2.3)

where ⪯ denotes the semidefinite order and g : (0, ∞) → [0, ∞]. In particular, (2.3) holds whenever

E_{k−1} e^{θ·X_k} ⪯ e^{g(θ)·V_k} for θ > 0.    (2.4)

Define the partial sum processes

Y_k := ∑_{j=1}^{k} X_j and W_k := ∑_{j=1}^{k} V_j for k = 0, 1, 2, . . . .

Then, for all t, w ∈ R,

P{∃k ≥ 0 : λ_max(Y_k) ≥ t and λ_max(W_k) ≤ w} ≤ d · inf_{θ>0} exp(−θt + g(θ)·w).

Proof. To begin, note that the cgf hypothesis (2.3) holds in the presence of (2.4) because the logarithm is an operator monotone function [Bha97, Ch. V].
The overall proof strategy is identical to the stopping-time technique used by Freedman [Fre75]. Fix a positive parameter θ, which we will optimize later. Following the discussion in §2.2, we introduce the random process S_k := G_θ(Y_k, W_k). Lemma 2.1 implies that {S_k} is a positive supermartingale with initial value d. These simple properties of the auxiliary random process distill all the essential information from the hypotheses of the theorem.
Define a stopping time κ by finding the first time instant k when the maximum eigenvalue of the partial sum process reaches the level t even though the sum of cgf bounds has maximum eigenvalue no larger than w.
κ := inf{k ≥ 0 : λ_max(Y_k) ≥ t and λ_max(W_k) ≤ w}.

When the infimum is over the empty set, the stopping time κ = ∞. Consider a system of exceptional events:

E_k := {λ_max(Y_k) ≥ t and λ_max(W_k) ≤ w} for k = 0, 1, 2, . . . .

Construct the event E := ⋃_{k=0}^{∞} E_k that one or more of these exceptional situations takes place. The intuition behind this definition is that the partial sum Y_k is typically not large unless the process {X_k} has varied substantially, a situation that the bound on W_k disallows. As a result, the event E is rather unlikely.
We are prepared to estimate the probability of the exceptional event. First, note that κ < ∞ on the event E. Therefore, Lemma 2.2 provides a conditional lower bound for the process {S_k} at the stopping time κ:

S_κ = G_θ(Y_κ, W_κ) ≥ e^{θt − g(θ)·w} on the event E.

Since {S_k} is a positive supermartingale with initial value d, an optional stopping argument yields

d = E S_0 ≥ E[S_κ · 1_E] ≥ P(E) · e^{θt − g(θ)·w}.

We require the fact that S_κ is positive to justify these inequalities. Rearrange the relation to obtain

P(E) ≤ d · e^{−θt + g(θ)·w}.

Minimize the right-hand side with respect to θ to complete the main part of the argument.

3. Proof of Freedman's Inequality
In this section, we use the general martingale deviation bound, Theorem 2.3, to prove a stronger version of Theorem 1.2.

Theorem 3.1 (Matrix Freedman: Stronger Version). Consider a matrix martingale {Y_k : k = 0, 1, 2, . . . } whose values are self-adjoint matrices with dimension d, and let {X_k : k = 1, 2, 3, . . . } be the difference sequence. Assume that the difference sequence is uniformly bounded: λ_max(X_k) ≤ R almost surely for k = 1, 2, 3, . . . . Define the predictable quadratic variation process

W_k := ∑_{j=1}^{k} E_{j−1}(X_j^2) for k = 1, 2, 3, . . . .

Then, for all t ≥ 0 and σ^2 > 0,

P{∃k ≥ 0 : λ_max(Y_k) ≥ t and ‖W_k‖ ≤ σ^2} ≤ d · exp( −(σ^2/R^2) · h(Rt/σ^2) ),

where h(u) := (1 + u) log(1 + u) − u for u ≥ 0.

Theorem 1.2 follows easily from this result.
Proof of Theorem 1.2 from Theorem 3.1. To derive Theorem 1.2, we note that the difference sequence {X_k} of a matrix martingale {Y_k} satisfies the conditions of Theorem 3.1, and the martingale can be expressed using partial sums of the difference sequence. Finally, we apply the numerical inequality

h(u) ≥ (u^2/2) / (1 + u/3) for u ≥ 0,

which we obtain by comparing derivatives.
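The comparison between h and the Bernstein-type rational function can be checked on a grid (a minimal sketch assuming numpy; the grid range is our choice):

```python
import numpy as np

def h(u):
    """h(u) = (1 + u) log(1 + u) - u."""
    return (1 + u) * np.log1p(u) - u

u = np.linspace(0.0, 50.0, 2001)
gap = h(u) - (u**2 / 2) / (1 + u / 3)
print(gap.min())  # nonnegative: h(u) >= (u^2/2)/(1 + u/3) for u >= 0
```

The gap vanishes at u = 0 and grows with u, which is why the Bernstein-style bound in Theorem 1.2 is slightly weaker in the far tail than the bound of Theorem 3.1.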
3.1. Demonstration of Theorem 3.1. We conclude with the proof of Theorem 3.1. The argument depends on the following estimate for the moment generating function of a zero-mean random matrix whose eigenvalues are uniformly bounded. See [Tro10b, Lem. 6.7] for the proof.
Lemma 3.2 (Freedman mgf). Suppose that X is a random self-adjoint matrix that satisfies E X = 0 and λ_max(X) ≤ 1. Then, for θ > 0,

E e^{θ·X} ⪯ exp( (e^θ − θ − 1) · E(X^2) ).
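This mgf bound can also be probed numerically in the semidefinite order. The three-point distribution below is an arbitrary construction of ours with E X = 0 and λ_max(X) ≤ 1 (assuming scipy):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)

def sym(d):
    B = rng.standard_normal((d, d))
    return (B + B.T) / 2

d = 3
A, B = sym(d), sym(d)
C = -(A + B)                    # three equally likely values, so E X = 0
branches = [A, B, C]
scale = max(np.abs(np.linalg.eigvalsh(M)).max() for M in branches)
branches = [M / scale for M in branches]   # enforce lambda_max(X) <= 1

theta = 0.8
g = np.exp(theta) - theta - 1.0
mean_mgf = sum(expm(theta * M) for M in branches) / 3   # E e^{theta X}
EX2 = sum(M @ M for M in branches) / 3                  # E X^2
bound = expm(g * EX2)
# semidefinite order: bound - mean_mgf should be positive semidefinite
min_eig = np.linalg.eigvalsh(bound - mean_mgf).min()
print(min_eig)
```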
The main result follows quickly from this lemma.
Proof of Theorem 3.1. We assume that R = 1; the general result follows by re-scaling since Y_k is 1-homogeneous and W_k is 2-homogeneous. Invoke Lemma 3.2 conditionally to see that

E_{k−1} e^{θ·X_k} ⪯ exp( g(θ) · E_{k−1}(X_k^2) ) where g(θ) := e^θ − θ − 1.

This bound is the hypothesis (2.4) of Theorem 2.3 with V_k := E_{k−1}(X_k^2), so that W_k is the predictable quadratic variation. Theorem 2.3 with w = σ^2 delivers

P{∃k ≥ 0 : λ_max(Y_k) ≥ t and λ_max(W_k) ≤ σ^2} ≤ d · inf_{θ>0} exp(−θt + g(θ)·σ^2).
The infimum is achieved when θ = log(1 + t/σ^2), which yields the probability bound d · exp(−σ^2 · h(t/σ^2)). Finally, note that the norm of a positive-semidefinite matrix, such as W_k, equals its largest eigenvalue.
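The optimization in this final step can be confirmed numerically (a sketch assuming numpy; the values of t and σ^2 are arbitrary):

```python
import numpy as np

t, sigma2 = 3.0, 2.0
g = lambda th: np.exp(th) - th - 1.0          # g(theta) = e^theta - theta - 1
objective = lambda th: -th * t + g(th) * sigma2
h = lambda u: (1 + u) * np.log1p(u) - u

# claimed minimizer and a brute-force grid search for comparison
theta_star = np.log1p(t / sigma2)
grid = np.linspace(1e-3, 10.0, 100001)
print(objective(theta_star), objective(grid).min())

# the minimum value matches -sigma2 * h(t / sigma2)
print(-sigma2 * h(t / sigma2))
```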