On the sub-Gaussianity of the Beta and Dirichlet distributions

We obtain the optimal proxy variance for the sub-Gaussianity of the Beta distribution, thus proving upper bounds recently conjectured by Elder (2016). We provide different proof techniques for the symmetric (around its mean) case and the non-symmetric case. The technique in the latter case relies on studying the ordinary differential equation satisfied by the Beta moment-generating function, known as the confluent hypergeometric function. As a consequence, we derive the optimal proxy variance for the Dirichlet distribution, which is apparently a novel result. We also provide a new proof of the optimal proxy variance for the Bernoulli distribution, and discuss, in this context, the relation of the proxy variance to log-Sobolev inequalities and transport inequalities.


Introduction
The sub-Gaussian property (Buldygin and Kozachenko, 1980, 2000; Pisier, 2016) and related concentration inequalities (Boucheron et al., 2013; Raginsky and Sason, 2013) have attracted a lot of attention in the last couple of decades due to their applications in various areas such as pure mathematics, physics, information theory and computer science. Recent interest has focused on deriving the optimal proxy variance for discrete random variables like the Bernoulli distribution (Buldygin and Moskvichova, 2013; Kearns and Saul, 1998; Berend and Kontorovich, 2013) and the missing mass (McAllester and Schapire, 2000; McAllester and Ortiz, 2003; Berend and Kontorovich, 2013; Ben-Hamou et al., 2017). Our focus is instead on two continuous distributions, the Beta and the Dirichlet, for which the optimal proxy variance was, to the best of our knowledge, not known. Some upper bounds were recently conjectured by Elder (2016); we prove them in the present article by providing the optimal proxy variance for both the Beta and Dirichlet distributions. Similar concentration properties of the Beta distribution have been recently used in many contexts, including Bayesian adaptive data analysis (Elder, 2016), Bayesian nonparametrics (Castillo, 2016) and spectral properties of random matrices (Perry et al., 2016).
We start by recalling the definition of the sub-Gaussian property for random variables:

Definition 1 (Sub-Gaussian variables). A random variable X with finite mean µ = E[X] is sub-Gaussian if there is a positive number σ² such that:

∀λ ∈ R, E[e^{λ(X−µ)}] ≤ e^{λ²σ²/2}. (1)

Such a σ² is called a proxy variance, and the smallest proxy variance is called the optimal proxy variance, denoted σ²_opt. X is said to be strictly sub-Gaussian when σ²_opt = Var[X].

Every compactly supported distribution, as is the Beta(α, β) distribution, is sub-Gaussian. This can be seen from Hoeffding's classical inequality: any random variable X supported on [0, 1] with mean µ satisfies

∀λ ∈ R, E[e^{λ(X−µ)}] ≤ e^{λ²/8},

thus exhibiting 1/4 as an upper bound on the proxy variance. This bound can be improved by taking into account the location of the mean µ within the interval [0, 1]. An early step in this direction is the second inequality of Hoeffding's (1963) paper, indexed (2.2) there. It states that if µ < 1/2, then for any positive ε,

P(X − µ > ε) ≤ e^{−ε²g(µ)}, where g(µ) = (1/(1−2µ)) ln((1−µ)/µ), (2)

thus indicating that X has a right tail lighter than a Gaussian tail of variance 1/(2g(µ)). Hoeffding's result was strengthened by Kearns and Saul (1998) to comply with Definition 1 of sub-Gaussianity¹ as follows:

∀λ ∈ R, E[e^{λ(X−µ)}] ≤ e^{λ²/(4g(µ))},

thus indicating that 1/(2g(µ)) is a distribution-sensitive proxy variance for any [0, 1]-supported random variable with mean µ (see also Berend and Kontorovich, 2013, for a detailed proof of this result). While this is the optimal proxy variance for the Bernoulli distribution (see Theorem 2.1 and Theorem 3.1 of Buldygin and Moskvichova, 2013), it is clear from our result that it is not optimal for the Beta distribution. However, fixing α/(α+β) = µ and letting α → 0, β → 0, the Beta(α, β) distribution concentrates to the Bern(µ) distribution, and we show that we then recover the optimal proxy variance for the Bernoulli distribution (Theorem 2).
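As a quick numerical sanity check of the Kearns–Saul bound (a minimal sketch with our own helper names, not code from this note's companion software), one can verify the moment-generating-function inequality of Definition 1 for a Bernoulli variable with the proxy variance 1/(2g(µ)):

```python
import math

def bernoulli_mgf_centered(lam, mu):
    """E[exp(lam*(X - mu))] for X ~ Bern(mu)."""
    return (1 - mu) * math.exp(-lam * mu) + mu * math.exp(lam * (1 - mu))

def kearns_saul_proxy(mu):
    """Proxy variance 1/(2*g(mu)), with g(mu) = ln((1-mu)/mu)/(1-2*mu)."""
    if abs(mu - 0.5) < 1e-12:
        return 0.25  # continuous limit at mu = 1/2
    return (1 - 2 * mu) / (2 * math.log((1 - mu) / mu))

mu = 0.2
s2 = kearns_saul_proxy(mu)
# Definition 1 holds with this proxy variance over a grid of lambdas
ok = all(bernoulli_mgf_centered(l, mu) <= math.exp(l * l * s2 / 2) + 1e-12
         for l in [k / 10 for k in range(-100, 101)])
print(ok)  # True
```

The same check with any smaller constant in place of `s2` fails for some λ, which is what optimality for the Bernoulli distribution means.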
An interesting feature common to the optimal proxy variance of the Bernoulli distribution, 1/(2g(µ)), and to that of the Beta distribution derived later on is that both deteriorate in a similar fashion as the mean µ goes to 0 or 1; see for instance the left panel of Figure 1. We briefly present here classical proof techniques for sub-Gaussianity hinging on tools from functional analysis. We show how they apply in the Bernoulli setting, and leave as an interesting open problem whether our proof in the Beta distribution setting could be supplemented by these same functional-analytic tools. Essentially two (related) families of functional inequalities allow one to derive a sub-Gaussian property: log-Sobolev inequalities, which date back to Gross (1975), and transport inequalities. The relation with the former is called Herbst's argument. It states that if a probability measure satisfies a log-Sobolev inequality with some constant, then it is sub-Gaussian with the same constant as a proxy variance² (see for instance Ledoux, 1999, Section 2.3 and Proposition 2.3). The optimal constant in the log-Sobolev inequality satisfied by the Bernoulli distribution also produces its optimal proxy variance (Ledoux, 1999, Corollary 5.9).
The relation with transport inequalities is usually referred to as Marton's argument (see for instance Raginsky and Sason, 2013, Section 3.4). Define the Wasserstein distance between two probability measures P and Q on a space X by

W(P, Q) = inf_{π∈Π(P,Q)} ∫_{X×X} d(x, y) π(dx, dy),

where Π(P, Q) is the set of probability measures on X × X with fixed marginal distributions respectively P and Q. The Wasserstein distance depends on the choice of a distance d on X. A probability measure P is said to satisfy a transport inequality with constant c if, for any probability measure Q dominated by P,

W(P, Q) ≤ √(2c D(Q||P)), (4)

where D(Q||P) is the entropy, or Kullback–Leibler divergence, between Q and P. The transport inequality (4) is denoted by T(c). Bobkov and Götze (1999) proved that T(c) implies c-sub-Gaussianity. See also Proposition 3.6 and Theorem 3.4.4 of Raginsky and Sason (2013) for general results. Further developments in the discrete X setting are interesting for our purposes. Equip a discrete space X with the Hamming metric, d(x, y) = 1_{x≠y}. The induced Wasserstein distance then reduces to the total variation distance, W(P, Q) = ‖P − Q‖_TV. In that setting, Ordentlich and Weinberger (2005) proved the distribution-sensitive transport inequality:

‖P − Q‖_TV ≤ √(D(Q||P)/g(µ_P)), (5)

where the function g is defined in Equation (2) and the coefficient µ_P, called the balance coefficient of P, is defined by µ_P = max_{A⊂X} min{P(A), 1 − P(A)}. In particular, the Bernoulli balance coefficient is easily shown to coincide with its mean. Hence, applying the result of Bobkov and Götze (1999) to the T(1/(2g(µ_P))) transport inequality (5) yields a distribution-sensitive proxy variance of 1/(2g(µ)) for the Bernoulli distribution with mean µ. It is optimal; see for instance Theorem 3.4.6 of Raginsky and Sason (2013). This viewpoint highlights the key role played by the balance coefficient in the non-uniformity of the optimal proxy variance for discrete distributions such as the Bernoulli. However, it is not clear how this argument would carry over to non-discrete distributions such as the Beta distribution in order to explain a similar sensitivity to the mean. Moreover, to quote Raginsky and Sason (2013), the general approach may not produce optimal concentration estimates, which often require case-by-case treatments. This is the route followed in this note for the Beta distribution.

¹ Note indeed that Equation (1), together with Markov's inequality, implies P(X − µ > ε) ≤ e^{−ε²/(2σ²)}.
² The implied predicate is actually stronger than sub-Gaussianity, but this is not needed for our purposes.
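In the Bernoulli case, the distribution-sensitive transport inequality (5) can be checked numerically. The sketch below (helper names are ours) verifies it over a grid of alternative distributions Q = Bern(q) and exhibits the equality case at q = 1 − p, which shows the constant is sharp:

```python
import math

def g(mu):
    """Hoeffding's function g from Equation (2); g(1/2) = 2 by continuity."""
    if abs(mu - 0.5) < 1e-12:
        return 2.0
    return math.log((1 - mu) / mu) / (1 - 2 * mu)

def kl_bern(q, p):
    """Kullback-Leibler divergence D(Bern(q) || Bern(p))."""
    term = lambda a, b: 0.0 if a == 0 else a * math.log(a / b)
    return term(q, p) + term(1 - q, 1 - p)

# For P = Bern(p), Q = Bern(q): total variation is |p - q| and the balance
# coefficient is min(p, 1-p), so (5) reads |p - q|^2 * g(min(p,1-p)) <= D(Q||P).
p = 0.2
ok = all(abs(q - p) ** 2 * g(min(p, 1 - p)) <= kl_bern(q, p) + 1e-12
         for q in (k / 1000 for k in range(1, 1000)))
print(ok)  # True
# Sharpness: equality holds at q = 1 - p
print(abs((1 - 2 * p) ** 2 * g(p) - kl_bern(1 - p, p)) < 1e-12)  # True
```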
The outline of the note is as follows. We introduce the Beta distribution and state the main result (Theorem 1) in Section 2.1. We then prove our result depending on whether α = β (Section 2.2) or α ≠ β (Section 2.3). In the first case, the proof is elementary and based on comparing the coefficients of the power series representations of both sides of inequality (1). However, it does not directly carry over to the second case, whose proof requires a finer analytical tool: the study of the ordinary differential equation (ODE) satisfied by the confluent hypergeometric function 1F1. Although the second proof also covers the case α = β upon slight modifications, the independent proof for the symmetric case is kept owing to its simplicity. As a by-product, we derive the optimal proxy variance for the Bernoulli and the Dirichlet distributions in Section 3. The R code for the plots presented in this note and for a function deriving the optimal proxy variance in terms of α and β is available at http://www.julyanarbel.com/software.
2 Optimal proxy variance for the Beta distribution

Notations and main result
The Beta(α, β) distribution, with α, β > 0, is characterized by a density on the segment [0, 1] given by:

f(x) = x^{α−1}(1 − x)^{β−1}/B(α, β),

where B(α, β) = Γ(α)Γ(β)/Γ(α+β) is the Beta function. The moment-generating function of a Beta(α, β) distribution is given by a confluent hypergeometric function (also known as Kummer's function):

∀λ ∈ R, E[e^{λX}] = 1F1(α; α+β; λ). (6)

This is equivalent to saying that the j-th raw moment of a Beta(α, β) random variable X is given by:

E[X^j] = (α)_j/(α+β)_j, (7)

where (x)_j = x(x+1)···(x+j−1) = Γ(x+j)/Γ(x) is the Pochhammer symbol, also known in the literature as a rising factorial. In particular, the mean and variance are given by:

E[X] = α/(α+β), Var[X] = αβ/((α+β)²(α+β+1)).

The Beta distribution is ubiquitous in statistics. It plays a central role in the binomial model in Bayesian statistics, where it is a conjugate prior distribution (the associated posterior distribution is also Beta): if X ∼ Binomial(θ, N) and θ ∼ Beta(α, β), then θ|X ∼ Beta(α + X, β + N − X). It is also key to Bayesian nonparametrics, where it embodies, among others, the distribution of the breaks in the stick-breaking representation of the Dirichlet process and the Pitman–Yor process, the marginal distributions of Polya trees (Castillo, 2016), and the posterior distribution of discovery probabilities under a Bayesian nonparametric model (Arbel et al., 2017). Our main result opens new research avenues, for instance regarding asymptotic (frequentist) assessments of these procedures.
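These identities are easy to check numerically. The following sketch (helper names are ours; it relies on SciPy's `hyp1f1` for Kummer's function) verifies the mean, the variance, and the agreement between the moment-generating function and its moment expansion:

```python
import math
from scipy.special import hyp1f1  # Kummer's function 1F1(a; b; x)

def beta_moment(j, a, b):
    """j-th raw moment of Beta(a, b): the Pochhammer ratio (a)_j / (a+b)_j."""
    num = den = 1.0
    for k in range(j):
        num *= a + k
        den *= a + b + k
    return num / den

a, b = 2.0, 3.0
mean = beta_moment(1, a, b)             # a/(a+b) = 0.4
var = beta_moment(2, a, b) - mean ** 2  # a*b/((a+b)^2 (a+b+1)) = 0.04
# The MGF E[e^{lam X}] = 1F1(a; a+b; lam) agrees with its moment series
lam = 0.5
series = sum(lam ** j / math.factorial(j) * beta_moment(j, a, b)
             for j in range(30))
print(abs(hyp1f1(a, a + b, lam) - series) < 1e-10)  # True
```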

Theorem 1 (Optimal proxy variance for the Beta distribution). For any α, β > 0, the Beta(α, β) distribution is sub-Gaussian, and its optimal proxy variance is given by:

σ²_opt(α, β) = (µ/x₀)(1F1(α+1; α+β+1; x₀)/1F1(α; α+β; x₀) − 1), with µ = α/(α+β),

where x₀ is the unique non-zero solution of the transcendental equation:

µx₀(1F1(α+1; α+β+1; x₀)/1F1(α; α+β; x₀) + 1) = 2 ln 1F1(α; α+β; x₀). (8)

A simple and explicit upper bound to σ²_opt(α, β) is given by:

σ²_opt(α, β) ≤ 1/(4(α+β+1)).

Equation (8) defining x₀ is a transcendental equation, the solution of which is not available in closed form. However, it is simple to evaluate numerically. The values of the variance, the optimal proxy variance and its simple upper bound are illustrated on Figure 1. Note that for a fixed value of the sum of the parameters, α + β = S, the optimal proxy variance deteriorates when α, or equivalently β, gets close to 0 or to S. This is reminiscent of the Bernoulli optimal proxy variance behavior, which deteriorates when the success probability moves away from 1/2 (Buldygin and Moskvichova, 2013). The intuition of the proof can be seen from Figure 2, which displays the behavior for various values of σ²_t. The main argument is that the optimal proxy variance is obtained for the curve (in magenta) whose positive local minimum equals zero, thus leading to the system of equations of Theorem 1.

Figure 1: Left: optimal proxy variance σ²_opt(α, β) and its upper bound 1/(4(α+β+1)) (dotted black) for the Beta(α, β) distribution with α + β set to 1, and σ²_opt(µ) for the Bern(µ) distribution (blue); varying mean µ on the x-axis. Center: curves of σ²_opt(µ) for the Bern(µ) distribution (blue), and of σ²_opt(α, β) for the Beta(α, β) distribution with α + β varying on a log scale from 0.1 (purple) to 10 (red); varying mean µ on the x-axis. Right: surfaces of Var[Beta(α, β)] (green) and σ²_opt(α, β) (purple), for values of α and β varying in [0.2, 4].
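Independently of the characterization of Theorem 1, the optimal proxy variance is by definition (1) the supremum of 2(ln 1F1(α; α+β; x) − µx)/x² over x ≠ 0, so it can be evaluated by brute force. The sketch below (our own helper, not the companion R code) illustrates this and checks that the optimal value lies between the variance and the simple upper bound:

```python
import numpy as np
from scipy.special import hyp1f1

def proxy_variance(a, b, grid=np.linspace(-60, 60, 20001)):
    """Numerical sigma^2_opt(a, b): sup over x != 0 of
    2 * (ln 1F1(a; a+b; x) - mu*x) / x^2, with mu = a/(a+b)."""
    mu = a / (a + b)
    x = grid[np.abs(grid) > 1e-9]
    return float(np.max(2 * (np.log(hyp1f1(a, a + b, x)) - mu * x) / x ** 2))

a, b = 1.0, 2.0
s2_opt = proxy_variance(a, b)
var = a * b / ((a + b) ** 2 * (a + b + 1))  # 1/18
s2_0 = 1 / (4 * (a + b + 1))                # simple upper bound, 1/16
print(var < s2_opt < s2_0)                  # True: Beta(1,2) is not strictly sub-Gaussian
# Symmetric case: sigma^2_opt(a, a) = Var = 1/(4(2a+1)), here 0.05
print(abs(proxy_variance(2.0, 2.0) - 0.05) < 1e-3)  # True
```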
As a direct consequence, we obtain the strict sub-Gaussianity of the uniform, the arcsine and the Wigner semicircle distributions, which are special cases (up to a trivial rescaling) of the Beta(α, α) distribution with α respectively equal to 1, 1/2 and 3/2.
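For instance, for the uniform distribution (the case α = 1, mean 1/2, variance 1/12), the centered moment-generating function is sinh(λ/2)/(λ/2), and strict sub-Gaussianity can be checked directly (a minimal sketch, not from the note):

```python
import math

def centered_mgf_uniform(l):
    """E[exp(l*(U - 1/2))] for U ~ Uniform(0,1) = Beta(1,1)."""
    return math.sinh(l / 2) / (l / 2) if l != 0 else 1.0

# Strict sub-Gaussianity: the MGF is dominated by exp(l^2 * Var / 2), Var = 1/12
ok = all(centered_mgf_uniform(l) <= math.exp(l * l / 24) + 1e-12
         for l in [k / 10 for k in range(-200, 201)])
print(ok)  # True
```

The inequality can also be seen term by term: sinh(u)/u = Σ u^{2j}/(2j+1)! while e^{u²/6} = Σ u^{2j}/(6^j j!), and (2j+1)! ≥ 6^j j! for all j ≥ 0.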

The Beta(α, α) distribution is strictly sub-Gaussian
Since a random variable X ∼ Beta(α, α) is symmetric around 1/2, only its even centered moments are non-zero. Moreover, in this symmetric case the variance coincides with the simple upper bound: Var[X] = 1/(4(2α+1)) = σ²₀. Strict sub-Gaussianity holds because the coefficients of the series expansions at λ = 0 of each side of (1),

E[e^{λ(X−1/2)}] = Σ_{j≥0} E[(X−1/2)^{2j}] λ^{2j}/(2j)! and e^{σ²₀λ²/2} = Σ_{j≥0} σ₀^{2j} λ^{2j}/(2^j j!),

satisfy the inequalities:

E[(X−1/2)^{2j}] ≤ ((2j)!/(2^j j!)) σ₀^{2j} for all j ≥ 0. (9)

Indeed, algebra yields:

E[(X−1/2)^{2j}] = (1/2)_j/(4^j (α+1/2)_j). (12)

Combining this expression of the centered moments, obtained from the raw moments (7), with the identity (2j)!/(2^j j!) = 2^j (1/2)_j and the following inequality:

2^j (α+1/2)_j = Π_{k=0}^{j−1} (2α+1+2k) ≥ (2α+1)^j,

in (9) concludes the proof.
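The moment comparison underlying this proof is easy to reproduce numerically from the closed form (12) (a sketch with our own helper names):

```python
import math

def poch(x, j):
    """Pochhammer symbol (x)_j = x(x+1)...(x+j-1)."""
    out = 1.0
    for k in range(j):
        out *= x + k
    return out

# E[(X-1/2)^(2j)] = (1/2)_j / ((alpha+1/2)_j 4^j) never exceeds the Gaussian
# moment sigma_0^(2j) (2j)!/(2^j j!), with sigma_0^2 = 1/(4(2*alpha+1)).
alpha = 0.7
s2_0 = 1 / (4 * (2 * alpha + 1))
ok = all(
    poch(0.5, j) / (poch(alpha + 0.5, j) * 4 ** j)
    <= s2_0 ** j * math.factorial(2 * j) / (2 ** j * math.factorial(j)) + 1e-15
    for j in range(1, 10)
)
print(ok)  # True
```

Note that for j = 1 the two sides coincide (both equal the variance), consistently with strict sub-Gaussianity.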
Remark 1. The non-symmetric distribution with α ≠ β has even centered moments whose expressions are not as simple as (12). Moreover, it obviously has non-zero odd centered moments. For this last reason, the present proof does not carry over to the case α ≠ β.

Connection with ordinary differential equations
In this section, we assume that X ∼ Beta(α, β) with β ≠ α, and we write µ = E[X] = α/(α+β). We denote σ²₀ = 1/(4(α+β+1)) (we omit the dependence on α and β for compactness) and define for all t ∈ R:

σ²_t = σ²₀ − t(σ²₀ − Var[X]).

In other words, the decreasing function t ↦ σ²_t maps the interval [0, 1] to the interval [σ²₁, σ²₀], with σ²₁ = Var[X]. Then, we introduce the function u_t defined by:

u_t(x) = e^{σ²_t x²/2} − e^{−µx} 1F1(α; α+β; x),

so that σ²_t-sub-Gaussianity amounts to non-negativity of u_t on R. Since the confluent hypergeometric function y : x ↦ y(x) = 1F1(α; α+β; x) satisfies the linear second-order ordinary differential equation xy''(x) + (α+β−x)y'(x) − αy(x) = 0, we obtain together with Equation (6) that u_t is the unique solution of the Cauchy problem:

xu_t''(x) + (α+β−(1−2µ)x)u_t'(x) − µ(1−µ)xu_t(x) = xe^{σ²_t x²/2}P₂(x; t), u_t(0) = 0 and u_t'(0) = 0, (14)

where P₂ is a polynomial of degree 2 in x:

P₂(x; t) = σ⁴_t x² − σ²_t(1−2µ)x + (1−t)(1−2µ)²/4.

For normalization purposes, we also define the rescaled function v_t of Equation (15), the unique solution of the corresponding Cauchy problem (16). Note that u_t and v_t have the same sign, hence proving that u_t is positive (resp. negative) is equivalent to proving that v_t is positive (resp. negative). From standard theory on ODEs (Birkhoff and Rota, 1989; Robinson, 2004), we get that the functions u_t and v_t are C^∞(R). Indeed, the only possible singularity is at x = 0, but the initial conditions imply that the functions are regular at this point. In particular, a Taylor expansion at x = 0 shows that:

u_t(x) = ((1−t)/2)(σ²₀ − Var[X])x² + O(x³). (17)

We also observe that the discriminant of the polynomial x ↦ P₂(x; t) is given by:

Δ(t) = tσ⁴_t(1−2µ)².

Hence we conclude that for t > 0, P₂ admits two distinct real zeros (both positive when β > α), while for t < 0 it remains strictly positive on R. For t = 0, P₂ admits a double zero and thus remains positive on R apart from its zero.
By definition (15), we want to study the sign of v_t on R: showing that v_t is non-negative on R is equivalent to showing that X is σ²_t-sub-Gaussian. We first observe that we may restrict the sign study to R₊. Indeed, suppose that we prove:

∀λ ≥ 0, E[e^{λ(X−µ)}] ≤ e^{λ²σ²_t/2}. (18)

Then the case λ < 0 is automatically obtained by noting that 1 − X ∼ Beta(β, α), whose mean is β/(α+β) = 1 − α/(α+β), and that σ²₀ and Var[X] (hence σ²_t) are invariant under swapping α and β. Therefore, applying (18) to 1 − X gives, for all λ < 0:

E[e^{λ(X−µ)}] = E[e^{(−λ)((1−X)−(1−µ))}] ≤ e^{λ²σ²_t/2}.

Finally, in agreement with the general theory, we observe that for t > 1 (i.e. σ²_t < Var[X]), X is not σ²_t-sub-Gaussian. Indeed, the series expansion at x = 0 (17) shows that for t > 1, v_t is strictly negative in a neighborhood of 0. On the contrary, for t < 1, the function v_t is strictly positive in a neighborhood of 0, so that we may not directly conclude. Note also that for any value of t, we always have lim_{x→∞} v_t(x) = +∞.
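The sign behavior described above can be observed directly from the definition of u_t, without going through the ODE. A numerical sketch (helper names are ours):

```python
import numpy as np
from scipy.special import hyp1f1

a, b = 1.0, 3.0                                  # non-symmetric case, beta > alpha
mu = a / (a + b)
var = a * b / ((a + b) ** 2 * (a + b + 1))
s2_0 = 1 / (4 * (a + b + 1))

def u(x, t):
    """u_t(x) = exp(s2_t x^2/2) - E[exp(x(X - mu))], with s2_t interpolating
    between s2_0 (t = 0) and Var[X] (t = 1)."""
    s2_t = s2_0 - t * (s2_0 - var)
    return np.exp(s2_t * x ** 2 / 2) - np.exp(-mu * x) * hyp1f1(a, a + b, x)

x = np.linspace(0.01, 40, 4000)
print(bool(np.all(u(x, 0.0) > 0)))  # True: sigma_0^2 is a valid proxy variance
print(bool(np.any(u(x, 1.0) < 0)))  # True: the variance alone is not
```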

Proof that the Beta(α, β) distribution is σ 2 0 -sub-Gaussian
In this section, we take t = 0. As explained above, this corresponds to a case where P₂ is positive on R (apart from its double zero). We prove that u₀(x) > 0 for x > 0 by proceeding by contradiction. Let us assume that there exists x₁ > 0 such that u₀(x₁) = 0. Since the non-empty set {x > 0 : u₀(x) = 0} is closed (because u₀ is continuous) and excludes a neighborhood of 0, we may define x₀ = min{x > 0 such that u₀(x) = 0} > 0. Let us now define the set:

M = {0 < x < x₀ such that u₀'(x) = 0 and u₀' changes sign at x}.

Since u₀(0) = u₀(x₀) = 0, u₀ is strictly positive in a neighborhood of 0 and u₀' is continuous on R, Rolle's theorem shows that M is not empty and that:

m = min{x ∈ M} exists and 0 < m < x₀.

Evaluating the ODE (14) at x = m, and using the facts that the polynomial P₂ is positive on R (apart from its double zero) and that u₀(m) > 0, leads to:

u₀''(m) = µ(1−µ)u₀(m) + e^{σ²₀m²/2}P₂(m; 0) > 0.

However, combined with u₀'(m) = 0, this contradicts the fact that u₀' changes sign at x = m: since u₀ is positive and increasing near 0, the first sign change of u₀' must be from positive to negative, whereas u₀''(m) > 0 forces a change from negative to positive.
Thus, we conclude that there cannot exist x 1 > 0 such that u 0 (x 1 ) = 0. Since u 0 is strictly positive in a neighborhood of 0 and continuous on R, we conclude that it must remain strictly positive on R * + .
Remark 2. In this proof, the case β = α requires an adaptation, since then u₀''(0) = 0. Thus, we must determine u₀⁽⁴⁾(0) > 0 (u₀'''(0) = 0 holds by symmetry) to ensure that the function u₀ is locally convex and remains strictly positive in a neighborhood of 0. Apart from this minor verification, the rest of the proof also applies to this case.

Proof of the optimal proxy variance for the Beta(α, β) distribution
In this section we assume that β ≠ α. From general theorems regarding ODEs, we have that the application:

g : (t, x) ↦ v_t(x)

is smooth (g ∈ C^∞([0, +∞) × [0, +∞))). Indeed, the t-dependence of the coefficients of the ODE (16) is polynomial and thus smooth. The x-dependence of the coefficients of the ODE (16) is also polynomial and, as explained above, the only possible singularity in x is at x = 0, where the initial conditions ensure that the solutions x ↦ v_t(x) are regular. Since for all t ≥ 0 we have lim_{x→∞} v_t(x) = +∞, we also have that the function:

h : t ↦ min{v_t(x), x ∈ R*₊}

is continuous.

We now observe that for any 0 ≤ t < 1, the functions v_t are strictly positive in a neighborhood of 0. More precisely, if we choose a segment [0, t₀] with t₀ < 1, then we may choose η > 0 such that for all 0 ≤ t ≤ t₀, the function v_t is strictly positive on (0, η]. Moreover, since lim_{x→∞} v₀(x) = +∞ and v₀ is strictly positive (Section 2.3.2), v₀ is bounded from below on [η, +∞) by a constant A > 0. Thus, since g is continuous, there exists a neighborhood of t = 0 in which all solutions v_t remain greater than A/2 on [η, +∞), and thus strictly positive on R*₊. This shows that σ²_t remains a proxy variance for t in a neighborhood of 0, so that σ²₀ is not optimal. Let us now introduce the set:

T₊ = {t ≥ 0 such that ∀x ∈ R*₊, v_t(x) ≥ 0}.

Then, from the results presented above, we know that T₊ is non-empty, that it contains a neighborhood of 0 and that it is bounded from above by 1. Moreover, by connection with the initial problem (15), T₊ is an interval, and thus of the form [0, t_opt] with 0 < t_opt < 1. Indeed, t ∈ T₊ implies by construction that for all s ≤ t, s ∈ T₊. Note also that t_opt < 1, since v₁ is strictly negative in a neighborhood of 0 and thus min{v₁(x), x ∈ R*₊} < 0; hence the continuity of h shows that there exists a neighborhood of t = 1 in which the solutions v_t take negative values on R*₊. For t = t_opt, the function v_{t_opt} must have a zero on R*₊: otherwise, by continuity of h, we could find a neighborhood of t_opt in which min{v_t(x), x ∈ R*₊} remains strictly positive, thus contradicting the maximality of t_opt. Since v_{t_opt} must remain non-negative, this zero is at least a double zero, and therefore there exists x₀ > 0 such that v_{t_opt}(x₀) = 0, v_{t_opt}'(x₀) = 0 and v_{t_opt}''(x₀) ≥ 0.
From (6) and (15), the conditions v_{t_opt}(x₀) = 0 and v_{t_opt}'(x₀) = 0 are equivalent to the following system of equations (we use here the contiguous relations for the confluent hypergeometric function, in particular d/dx 1F1(a; b; x) = (a/b) 1F1(a+1; b+1; x)):

e^{σ²_{t_opt}x₀²/2} = e^{−µx₀} 1F1(α; α+β; x₀),
σ²_{t_opt}x₀ = µ(1F1(α+1; α+β+1; x₀)/1F1(α; α+β; x₀) − 1). (21)

This is equivalent to saying that x₀ ≡ x₀(α, β) is the solution of the transcendental equation:

µx₀(1F1(α+1; α+β+1; x₀)/1F1(α; α+β; x₀) + 1) = 2 ln 1F1(α; α+β; x₀),

and that σ²_{t_opt} is given by:

σ²_{t_opt} = (µ/x₀)(1F1(α+1; α+β+1; x₀)/1F1(α; α+β; x₀) − 1).

Note that by symmetry we have x₀(β, α) = −x₀(α, β), hence σ²_{t_opt}(β, α) = σ²_{t_opt}(α, β). Moreover, if β > α then x₀(α, β) > 0, while α > β implies x₀(α, β) < 0. We may illustrate the situation with Figure 2, which displays the difference function x ↦ u_t(x).
Remark 3. The system of equations (21) admits only one solution on R*₊. Indeed, let us transpose the problem from v_{t_opt} to u_{t_opt} using (15), and assume that there exist two points 0 < x₀ < x₁ such that u_{t_opt}(x₀) = 0, u_{t_opt}'(x₀) = 0 and u_{t_opt}(x₁) = 0, u_{t_opt}'(x₁) = 0, with u_{t_opt} strictly positive on (x₀, x₁) (hence u_{t_opt}''(x₀) ≥ 0 and u_{t_opt}''(x₁) ≥ 0). Using (14), this implies that P₂(x₀; t_opt) ≥ 0 and P₂(x₁; t_opt) ≥ 0. If we denote by x₋ < x₊ the two distinct positive zeros of x ↦ P₂(x; t_opt), we may exclude the case x₀ ≤ x₋. Indeed, if x₀ ≤ x₋, then we may apply to u_{t_opt} on the interval [0, x₀] the same argument as the one developed for u₀ in Section 2.3.2 and obtain a contradiction. Thus, the only remaining case is x₁ > x₀ > x₊. In that case, since u_{t_opt}(x₀) = 0, u_{t_opt}'(x₀) = 0, u_{t_opt}(x₁) = 0 and x ↦ P₂(x; t_opt) is positive on [x₀, x₁], we may apply the same argument to u_{t_opt} on the interval [x₀, x₁] as the one developed for u₀ in Section 2.3.2 and obtain a contradiction.

Optimal proxy variance for the Bernoulli distribution
We show that our proof technique can be used to recover the optimal proxy variance for the Bernoulli distribution, known since Kearns and Saul (1998). This is illustrated by the center panel of Figure 1.
Theorem 2 (Optimal proxy variance for the Bernoulli distribution). For any µ ∈ (0, 1), the Bernoulli distribution with mean µ is sub-Gaussian with optimal proxy variance σ²_opt(µ) given by:

σ²_opt(µ) = (1−2µ)/(2 ln((1−µ)/µ)) for µ ≠ 1/2, and σ²_opt(1/2) = 1/4 by continuity.

Proof. In the limit α → 0 with α/(α+β) fixed equal to µ, the differential equation (14) simplifies into:

u_{t,µ}''(x) − (1−2µ)u_{t,µ}'(x) − µ(1−µ)u_{t,µ}(x) = e^{σ²_{t,µ}x²/2}P₂(x; t), with σ²_{t,µ} = 1/4 − t(1−2µ)²/4,

Figure 2: Difference function x ↦ u_t(x). For t = 0 (simple upper bound σ²₀), the curve [dotted black] remains strictly positive. For t = t_opt (optimal proxy variance σ²_opt), the curve [magenta] has a zero minimum (at x₀). For t = 1 (leading to the variance), the curve [dashed green] has a vanishing second derivative at x = 0 and is directly negative around 0. The intermediate case, with t_non-opt in the interval (t_opt, 1), produces a curve [orange, dash and dots] which is first positive, then negative, and positive again.
with the Cauchy initial conditions u_{t,µ}(0) = 0 and u_{t,µ}'(0) = 0. The solution of this Cauchy problem is explicit and given by:

u_{t,µ}(x) = e^{σ²_{t,µ}x²/2} − (1−µ)e^{−µx} − µe^{(1−µ)x}. (23)

Therefore the optimal proxy variance is given by σ²_opt(µ) = 1/4 − (1/4)(2µ−1)²t₀, where t₀ is determined by the system of equations:

u_{t₀,µ}(x₀) = 0 and u_{t₀,µ}'(x₀) = 0,

thus implicitly defining t₀ and x₀ as functions of µ. In order to solve this system explicitly, we perform the change of variable s = µ/(1−µ). One then checks that x₀ = −2 ln s = 2 ln((1−µ)/µ) solves the system, with:

t₀ = 1/(2µ−1)² + 2/((2µ−1) ln((1−µ)/µ)).

Consequently, the optimal proxy variance is given by:

σ²_opt(µ) = 1/4 − (1/4)(2µ−1)²t₀ = (1−2µ)/(2 ln((1−µ)/µ)),

which is precisely the optimal proxy variance of a Bernoulli random variable with mean µ.
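Numerically, the Bernoulli optimal proxy variance of Theorem 2 coincides with the supremum of 2 ln E[e^{λ(X−µ)}]/λ² over λ ≠ 0, and the supremum is attained at x₀ = 2 ln((1−µ)/µ). A sketch with our own helper names:

```python
import math

def bern_log_mgf(l, mu):
    """ln E[exp(l*(X - mu))] for X ~ Bern(mu)."""
    return math.log((1 - mu) * math.exp(-l * mu) + mu * math.exp(l * (1 - mu)))

def s2_opt(mu):
    """Theorem 2: (1 - 2*mu)/(2*ln((1-mu)/mu)), with value 1/4 at mu = 1/2."""
    if abs(mu - 0.5) < 1e-12:
        return 0.25
    return (1 - 2 * mu) / (2 * math.log((1 - mu) / mu))

mu = 0.3
sup = max(2 * bern_log_mgf(l, mu) / l ** 2
          for l in (k / 100 for k in range(-3000, 3001) if k != 0))
print(abs(sup - s2_opt(mu)) < 1e-4)  # True: the proxy variance is optimal
x0 = 2 * math.log((1 - mu) / mu)
print(abs(2 * bern_log_mgf(x0, mu) / x0 ** 2 - s2_opt(mu)) < 1e-12)  # True: tight at x0
```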

Optimal proxy variance for the Dirichlet distribution
We start by recalling the definition of the sub-Gaussian property for random vectors:

Definition 2 (Sub-Gaussian vectors). A random d-dimensional vector X with finite mean µ = E[X] is σ²-sub-Gaussian if the random variable u⊤X is σ²-sub-Gaussian for any unit vector u of the sphere S^{d−1}. This is equivalent to saying that:

∀λ ∈ R^d, E[e^{⟨λ, X−µ⟩}] ≤ e^{‖λ‖²σ²/2}.

Finally, a random vector X is said to be strictly sub-Gaussian if the random variables u⊤X are strictly sub-Gaussian for any unit vector u ∈ S^{d−1}.
Let d ≥ 2. The Dirichlet distribution Dir(α), with positive parameters α = (α₁, ..., α_d), is characterized by a density on the simplex given by:

f(x) = (Γ(ᾱ)/Π_{i=1}^d Γ(α_i)) Π_{i=1}^d x_i^{α_i−1}, where ᾱ = Σ_{i=1}^d α_i.

It generalizes the Beta distribution in the sense that the components are Beta distributed. More precisely, for any non-empty and strict subset I of {1, ..., d}:

Σ_{i∈I} X_i ∼ Beta(Σ_{i∈I} α_i, ᾱ − Σ_{i∈I} α_i).

However, we remind the reader that the components (X_i)_{1≤i≤d} are not independent, and the variance/covariance matrix is given by:

Cov(X_i, X_j) = (δ_{ij}ᾱα_i − α_iα_j)/(ᾱ²(ᾱ+1)).

Finally, if we define n = (n₁, ..., n_d) ∈ N^d and n̄ = Σ_{i=1}^d n_i, then the moments of the Dirichlet distribution are given by:

E[Π_{i=1}^d X_i^{n_i}] = (Π_{i=1}^d (α_i)_{n_i})/(ᾱ)_{n̄}.

This is equivalent to saying that the moment-generating function of the Dirichlet distribution is:

E[e^{⟨λ, X⟩}] = Σ_{n∈N^d} (Π_{i=1}^d (α_i)_{n_i}λ_i^{n_i}/n_i!)/(ᾱ)_{n̄}.

Let us define e_i the i-th canonical vector of R^d and X = (X₁, ..., X_d) ∼ Dir(α). From Definition 1 and the results regarding the Beta(α, β) distribution obtained in Section 2.3, we immediately get that e_i⊤X = X_i is σ²_i-sub-Gaussian, with σ²_i := σ²_opt(α_i, ᾱ − α_i) defined from Theorem 1. Moreover, in direction e_i, σ²_i is the optimal proxy variance. Therefore, the remaining issue is to generalize these results to arbitrary unit vectors of S^{d−1}. We obtain the following result:

Theorem 3 (Optimal proxy variance for the Dirichlet distribution). For any parameter α, the Dirichlet distribution Dir(α) is sub-Gaussian with optimal proxy variance σ²_opt(α) given from Theorem 1 by:

σ²_opt(α) = σ²_opt(α_max, ᾱ − α_max), where α_max = max_{1≤i≤d} α_i.
Proof. We first observe that the computations of σ²_opt(α_i, β_i = ᾱ − α_i) correspond to cases where the sum α_i + β_i is fixed to ᾱ, and is thus independent of i. Therefore, σ²_opt(α_i, ᾱ − α_i) is maximal when |(ᾱ − α_i) − α_i| is minimal, i.e. when the distance from α_i to ᾱ/2 is minimal. It is easy to see that this corresponds to choosing α_i = α_max = max{α_i, 1 ≤ i ≤ d} (by looking at the two possible cases α_max ≤ ᾱ/2 and α_max > ᾱ/2). We write σ²_max = σ²_opt(α_max, ᾱ − α_max).

We then observe that σ²_max cannot be improved. Indeed, let us denote by i₀ one of the components for which the maximum is obtained. Then, if we take u = e_{i₀}, the discussion presented above shows that σ²_{i₀} = σ²_max is the optimal proxy variance in this direction. Hence the optimal proxy variance cannot be lower than σ²_max.

Let us now prove that X is σ²_max-sub-Gaussian. Let u = (u₁, ..., u_d) be a unit vector of S^{d−1} and λ ∈ R. We define for clarity λ = λu. We have:

E[e^{⟨λ, X⟩}] = Σ_{n∈N^d} (Π_{i=1}^d (α_i)_{n_i}λ_i^{n_i}/n_i!)/(ᾱ)_{n̄}. (24)

Note that we also have:

Π_{i=1}^d 1F1(α_i; ᾱ; λ_i) = Σ_{n∈N^d} Π_{i=1}^d (α_i)_{n_i}λ_i^{n_i}/((ᾱ)_{n_i}n_i!). (25)

Moreover, we have the inequality:

Π_{i=1}^d (ᾱ)_{n_i} ≤ (ᾱ)_{n̄},

because both sides have the same number of factors in the product (namely n̄), but those of the right-hand side are always greater than or equal to those of the left-hand side. Hence, from (24) and (25), we find:

E[e^{⟨λ, X⟩}] ≤ Π_{i=1}^d 1F1(α_i; ᾱ; λ_i).

Using the optimal proxy variance of the Beta distribution proven in Theorem 1, we find:

E[e^{⟨λ, X−µ⟩}] ≤ Π_{i=1}^d e^{λ_i²σ_i²/2} ≤ e^{σ²_max Σ_{i=1}^d λ_i²/2} = e^{λ²σ²_max/2},

thus showing that X is σ²_max-sub-Gaussian and concluding the proof.
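Theorem 3 can be illustrated numerically: among the marginal proxies σ²_opt(α_i, ᾱ−α_i), the largest is the one attached to α_max, the parameter closest to ᾱ/2. A sketch (the brute-force proxy helper below is ours, based on the supremum characterization of (1)):

```python
import numpy as np
from scipy.special import hyp1f1

def beta_proxy(a, b, grid=np.linspace(-40, 40, 8001)):
    """Numerical sigma^2_opt(a, b): sup over x != 0 of
    2 * (ln 1F1(a; a+b; x) - mu*x) / x^2, with mu = a/(a+b)."""
    mu = a / (a + b)
    x = grid[np.abs(grid) > 1e-9]
    return float(np.max(2 * (np.log(hyp1f1(a, a + b, x)) - mu * x) / x ** 2))

alpha = [0.5, 1.0, 2.5]
abar = sum(alpha)
marginals = [beta_proxy(ai, abar - ai) for ai in alpha]
# The Dirichlet proxy variance is the marginal proxy attached to alpha_max = 2.5
print(abs(max(marginals) - beta_proxy(max(alpha), abar - max(alpha))) < 1e-12)  # True
```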
Finally, note that the Dirichlet distribution can be strictly sub-Gaussian only in dimension d = 2. Indeed, we first need to require α₁ = ··· = α_d =: α so that all directions have the same optimal proxy variance. Then, each component satisfies X_i = e_i⊤X ∼ Beta(α, (d−1)α), and Theorem 1 shows that this component is strictly sub-Gaussian if and only if α = (d−1)α, i.e. if and only if d = 2.