Measuring distributional asymmetry with Wasserstein distance and Rademacher symmetrization

Abstract: We propose an improved version of the ubiquitous symmetrization inequality, making use of the Wasserstein distance between a measure and its reflection in order to quantify the asymmetry of that measure. An empirical bound on this asymmetric correction term is derived through a bootstrap procedure and shown to give tighter results in practical settings than the original uncorrected inequality. Lastly, a wide range of applications is detailed, including testing for data symmetry, constructing nonasymptotic high dimensional confidence sets, bounding the variance of an empirical process, and improving constants in Nemirovski-style inequalities for Banach space valued random variables.


Introduction
The symmetrization inequality is a ubiquitous result in the probability in Banach spaces literature and in the concentration of measure literature. Dating back at least to Paul Lévy, it is found in the classic text of [18], Section 6.1, and the more recent [6], Section 11.3. [13] use symmetrization in the context of empirical process theory, which is followed by a collection of more recent appearances [23,16,11,2,19,15,9].
Recall that a Rademacher random variable ε, sometimes referred to as a symmetric Bernoulli random variable or a random sign, satisfies P(ε = 1) = P(ε = −1) = 1/2. The symmetrization inequality is then as follows.

Proposition 1.1. Let (B, ‖·‖) be a Banach space, and let X_1, …, X_n ∈ B be independent random variables with measure μ. Let ε_1, …, ε_n be independent and identically distributed Rademacher random variables. Then
\[
  \mathbb{E}\Bigl\| \sum_{i=1}^n (X_i - \mathbb{E}X_i) \Bigr\| \le 2\, \mathbb{E}\Bigl\| \sum_{i=1}^n \varepsilon_i X_i \Bigr\|.
\]

A.B. Kashlak
This can be readily proved via Jensen's inequality and the insight that if Z is a symmetric random variable, that is \( Z \overset{d}{=} -Z \), then \( Z \overset{d}{=} \varepsilon Z \) for ε an independent Rademacher random variable. The proof is included for completeness.
Proof. Let X'_1, …, X'_n be independent copies of X_1, …, X_n, so that X_i and X'_i are equal in distribution for all i = 1, …, n. Then,
\[
  \mathbb{E}\Bigl\| \sum_{i=1}^n (X_i - \mathbb{E}X_i) \Bigr\|
  = \mathbb{E}\Bigl\| \sum_{i=1}^n (X_i - \mathbb{E}X'_i) \Bigr\|
  \le \mathbb{E}\Bigl\| \sum_{i=1}^n (X_i - X'_i) \Bigr\|
  = \mathbb{E}\Bigl\| \sum_{i=1}^n \varepsilon_i (X_i - X'_i) \Bigr\|
  \le 2\, \mathbb{E}\Bigl\| \sum_{i=1}^n \varepsilon_i X_i \Bigr\|.
\]
The first inequality comes from Jensen's inequality and the convexity of the norm. The subsequent equality results from the fact that X_i − X'_i is a symmetric random variable for all i = 1, …, n. The second inequality is just the result of the subadditivity of the norm and the fact that X_i and X'_i are equal in distribution.

Remark 1.2.
As the main tool of the previous proof is Jensen's inequality, the result generalizes with the addition of any convex function F : R⁺ → R⁺ to the following:
\[
  \mathbb{E}\, F\Bigl( \Bigl\| \sum_{i=1}^n (X_i - \mathbb{E}X_i) \Bigr\| \Bigr)
  \le \mathbb{E}\, F\Bigl( 2 \Bigl\| \sum_{i=1}^n \varepsilon_i X_i \Bigr\| \Bigr).
\]

The most notable oversight of this result is that it does not incorporate any measure of the symmetry of the data. Specifically, in the extreme case that the X_i are symmetric about their mean, the coefficient of 2 can be dropped and the inequality becomes an equality. Taking note of this fact, [2] state that "it can be shown that this factor of 2 is unavoidable in general for a fixed n when the symmetry assumption is not satisfied, although it is unnecessary when n goes to infinity." They furthermore "conjecture that an inequality holds under an assumption less restrictive than symmetry (e.g., concerning an appropriate measure of skewness of the distribution)." [2]

Hence, in response to this conjecture, we propose an improved symmetrization inequality making use of the Wasserstein distance and Hilbert space geometry in order to account for the symmetry, or lack thereof, of the distribution of the X_i under analysis. The main contribution of this article is that, for a Hilbert space H and X_1, …, X_n ∈ H independent and identically distributed random variables with common measure μ, there is a fixed explicit constant C(μ), depending only on the symmetry of the underlying measure μ of the X_i, such that the factor of 2 above can be removed at the cost of an additive correction term of order n^{−1/2} C(μ). This result is detailed and proved in Section 3.2. Furthermore, an empirical bound C_n(μ) on the constant C(μ) can be calculated, as is done in Section 4. Such an empirical bound can further serve as a data driven measure of the symmetry of the given sample. In the case that the distribution of the X_i is symmetric, the true C(μ) = 0, and our data driven estimate satisfies C_n(μ) = O(n^{−δ}) for some δ ∈ (0, 0.5), implying a fast rate of convergence to zero for the additive correction term: n^{−1/2} C_n(μ) = o(n^{−1/2}).
Applications of this result to testing the symmetry of a data set, constructing nonasymptotic high dimensional confidence sets, bounding the variance of an empirical process, and improving coefficients in probabilistic inequalities in the Banach space setting are given in Section 5.

Empirical estimate of the Rademacher sum
Before discussing the main results detailed and proved in Section 3, we take a closer look at Rademacher sums to motivate the research in the following sections. These sums arise in the theoretical setting when proving various bounds and inequalities for random variables in Banach spaces; examples can be found among the many results in the monographs [18] and [6]. Alternatively, these sums are used in the applied setting as an analogue for the unknown expectation \( \mathbb{E}\bigl\| \sum_{i=1}^n (X_i - \mathbb{E}X_i) \bigr\| \), which arises when constructing confidence sets using concentration inequalities in such settings as wavelet estimators [19], kernel density estimators [9], and covariance operators [14]. Rademacher averages also appear in statistical learning theory under the name Rademacher complexities in [4,16,27] and many others.
In this section, we consider the practical issue of computing the norm of the Rademacher sum \( R_n = \sum_{i=1}^n \varepsilon_i (X_i - \bar X) \), with sample mean \( \bar X = n^{-1} \sum_{i=1}^n X_i \), to directly estimate the expected value of the norm of the sum \( S_n = \sum_{i=1}^n (X_i - \mathbb{E}X_i) \). The Rademacher sum falls into a category of generalized bootstrap techniques: each sign vector corresponds to a random subset I ⊆ {1, …, n} of indices (those with ε_i = 1), whose cardinality satisfies \( \mathbb{P}(|I| = k) = \binom{n}{k} 2^{-n} \). Thus, given some observed X_1, …, X_n, the total expectation \( \mathbb{E}\|R_n\| \) can be approximated by the conditional expectation \( \mathbb{E}_\varepsilon \|R_n\| = \mathbb{E}( \|R_n\| \mid X_1, \ldots, X_n ) \). This conditional expectation can in turn be approximated by randomly drawing M sets of signs \( \{\varepsilon_1^{(j)}, \ldots, \varepsilon_n^{(j)}\} \), j = 1, …, M, and averaging. However, before continuing, we consider alternative bootstrap techniques to demonstrate the superiority of the Rademacher sum and why the symmetrization inequality matters.
The term \( \mathbb{E}\|S_n\| \) cannot be estimated directly, but must instead be approached via some bootstrap technique. Beyond the Rademacher sum, two other bootstrap estimators for \( \mathbb{E}\|S_n\| \) will be considered. Given a sample X_1, …, X_n of size n, the first method is to randomly split the data in half, using the first half to estimate EX_i and the second half to estimate \( \mathbb{E}\|S_n\| \), which is equivalent to restricting the Rademacher sum bootstrap to index sets I ⊂ {1, …, n} of cardinality n/2. Namely, for such sets, we have
\[
  \hat S_n^{\mathrm{half}} = \Bigl\| \sum_{i \in I} X_i - \sum_{i \notin I} X_i \Bigr\|,
\]
which can, of course, be approximated by selecting a reasonable number M of such sets I_1, …, I_M.
The second approach is a leave-one-out estimate similar to the jackknife estimator [8]. Once again, given a sample X_1, …, X_n of size n, this method is equivalent to the Rademacher sum bootstrap with the cardinality of the index set restricted to |I| = n − 1, which yields a leave-one-out estimator \( \hat S_n^{\mathrm{loo}} \). Each of these bootstrap methods is in some sense comparable to the others with respect to the accuracy and variance of the estimate for \( \mathbb{E}\|S_n\| \). For example, Theorem 3 of [27] compares the Permutational Rademacher Complexity to the Conditional Rademacher Complexity, which are more sophisticated versions of R_n and \( \hat S_n^{\mathrm{half}} \). However, the symmetrization inequality allows us to explicitly bound \( \mathbb{E}\|S_n\| \) by the Rademacher sum. Indeed, using the original symmetrization inequality, it is reasonable to bound \( \mathbb{E}\|S_n\| \le 2\, \mathbb{E}_\varepsilon \|R_n\| \). In contrast, the goal of this article is to theoretically derive and explicitly compute a small correction term C_n(μ) to update this bound to the tighter \( \mathbb{E}\|S_n\| \le \mathbb{E}_\varepsilon \|R_n\| + C_n(\mu) \). This is powerful in the construction of non-asymptotic confidence sets for high dimensional data where one desires to achieve a minimum coverage, say 1 − α, for such confidence sets as performed in both [2] and [14]. Using one of these alternative bootstrap methods does not guarantee such coverage. However, using the Rademacher sum, with either the coefficient of 2 or with the correction term proposed in the subsequent section, will, in fact, result in a confidence set with no less than the desired coverage.
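To make the comparison concrete, the conditional expectation \( \mathbb{E}_\varepsilon \|R_n\| \) can be approximated by Monte Carlo over random sign vectors. The following is a minimal sketch using NumPy; the function name `rademacher_sum_norm` is our own illustration, not from the paper:

```python
import numpy as np

def rademacher_sum_norm(X, M=500, rng=None):
    """Monte Carlo estimate of the conditional Rademacher average
    E_eps || sum_i eps_i (X_i - Xbar) ||_2, given the sample X (n x d)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                    # centre at the sample mean
    total = 0.0
    for _ in range(M):
        eps = rng.choice([-1.0, 1.0], size=n)  # one draw of random signs
        total += np.linalg.norm(eps @ Xc)      # ||sum_i eps_i (X_i - Xbar)||
    return total / M

# A symmetric Gaussian sample in R^5; the estimate approximates E_eps ||R_n||.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
est = rademacher_sum_norm(X, M=200, rng=1)
```

A few hundred sign draws already give a stable estimate, consistent with the small variance of such sign-flip bootstraps.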

Overview of Wasserstein spaces
We first require the standard notions of Wasserstein distance and Wasserstein space as stated below. These are defined on Polish spaces, which are complete separable metric spaces. For a thorough introduction to such topics, see [28].
Definition 3.1 (Wasserstein Distance). For p ≥ 1, the Wasserstein-p distance between probability measures μ and ν on a Polish metric space (X, d) with finite pth moments is
\[
  W_p(\mu, \nu) = \Bigl( \inf_{\gamma} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p \, d\gamma(x, y) \Bigr)^{1/p},
\]
where the infimum is taken over all measures γ on X × X with marginals μ and ν.
An equivalent and useful formulation of the Wasserstein distance is
\[
  W_p(\mu, \nu) = \inf \bigl( \mathbb{E}\, d(X, Y)^p \bigr)^{1/p},
\]
where the infimum is taken over all possible joint distributions of X and Y with marginals μ and ν, respectively.

Definition 3.2 (Wasserstein Space).
Let P(X) be the space of probability measures on X. The Wasserstein space is
\[
  P_p(\mathcal{X}) = \Bigl\{ \mu \in P(\mathcal{X}) : \int_{\mathcal{X}} d(x, x_0)^p \, d\mu(x) < \infty \Bigr\}
\]
for any arbitrary choice of x_0 ∈ X. This is the space of measures with finite pth moment.
Convergence in Wasserstein space is characterized by weak convergence of measures together with convergence of pth moments. From Theorem 6.8 of [28], convergence in Wasserstein distance is equivalent to weak convergence in P_p(X). Hence, for a sequence of measures μ_n, W_p(μ_n, μ) → 0 if and only if μ_n converges weakly to μ and
\[
  \int_{\mathcal{X}} d(x, x_0)^p \, d\mu_n(x) \to \int_{\mathcal{X}} d(x, x_0)^p \, d\mu(x)
\]
for any fixed x_0 ∈ X.
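In one dimension these definitions can be evaluated exactly on empirical measures. A minimal sketch using SciPy's `wasserstein_distance`, which computes the empirical W_1 (we use W_1 here purely for illustration; the paper's correction term uses W_2), showing the distance between a centred sample and its reflection as a measure of asymmetry:

```python
import numpy as np
from scipy.stats import wasserstein_distance  # exact empirical W_1 in 1-d

rng = np.random.default_rng(0)

# Symmetric law: the distance between the centred sample and its
# reflection is near zero, shrinking as the sample grows.
x = rng.standard_normal(5000)
x -= x.mean()
w_sym = wasserstein_distance(x, -x)

# Skewed law: the same distance stays bounded away from zero.
y = rng.exponential(size=5000)
y -= y.mean()
w_skew = wasserstein_distance(y, -y)
```

The skewed sample yields a markedly larger distance to its reflection than the symmetric one.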

Symmetrization result
In the following lemma, we bound the expectation of the centred sum by the sum of a "symmetric" term and an "asymmetric" correction term.

Lemma 3.3.
Let H be a Hilbert space, and let X_1, …, X_n ∈ H be independent and identically distributed random variables, with μ denoting the common law of the centred X_i − EX_i. Define μ^- to be the law of the reflection EX_i − X_i. Furthermore, let ε_1, …, ε_n be independent and identically distributed Rademacher random variables, also independent of the X_i. Then, for any 1-Lipschitz function ψ, the expectation of ψ at the centred average is bounded by the corresponding symmetrized Rademacher term plus an asymmetric correction term driven by W_2(μ, μ^-).

Proof. For a Polish space X, let Π(μ, ν) be the space of all joint measures on X × X with marginals μ and ν. For δ ∈ (0, 1), let Π_δ(μ, ν) be the space of all joint measures with marginals μ and ν_δ = δμ + (1 − δ)ν. The inequality obtained on the second line arises from taking the infimum over a more restrictive set. The law of ε(X − EX) is ½(μ + μ^-); hence, for our purposes, the above applies with ν = ½(μ + μ^-). Define μ*_n to be the law of the normalized sum of the centred variables; the second, third, and fourth inequalities then come respectively from Lemmas A.1, A.2, and A.3 in the appendix. Rearranging the terms gives the desired result.
This lemma leads immediately to the following theorem. The intuition behind the theorem is that averaging a collection of random variables has an inherent smoothing and symmetrizing effect following from the central limit theorem. Thus, as the sample size n increases, the difference between the expectations of the true average and the Rademacher average becomes negligible. Of course, given a finite second moment for the probability measure μ, such limit theorems already imply that this difference vanishes asymptotically. In the next theorem, however, we explicitly quantify this error and use it for finite sample empirical estimation in the following sections. This behaviour was shown in the simulations detailed in [14].
Proof. We run the proof of Lemma 3.3 with the appropriate terms swapped. Under condition 1, the result is immediate. Under condition 2, let μ be the law of X_i − EX_i as before. Then, redefining μ*_n to be the law of the corresponding normalized sum, the bound follows, where the infimum is taken over all joint distributions of X and Y with marginals μ and (μ + μ^-)/2, respectively. The desired result follows.

Empirical estimate of W 2 (μ, μ − )
In order to explicitly make use of the above results, an empirical estimate of W 2 (μ, μ − ) is required. We first establish the following bound.
Proposition 4.1. Let X_1, …, X_n be iid with law μ, and let Y_1, …, Y_n be iid with law ν. Furthermore, let μ_n and ν_n be the empirical distributions of μ and ν, respectively. Then,
\[
  W_p(\mu_n, \nu_n)^p \le \min_{\sigma \in S_n} \frac{1}{n} \sum_{i=1}^n d(X_i, Y_{\sigma(i)})^p .
\]

Proof. The following infima are taken over all possible joint distributions of the random variables in question, given fixed marginal distributions. Let X and Y be random variables with laws μ and ν, respectively, and let S_n be the group of permutations on n elements. Then
\[
  W_p(\mu_n, \nu_n)^p = \inf \mathbb{E}\, d(X, Y)^p \le \min_{\sigma \in S_n} \frac{1}{n} \sum_{i=1}^n d(X_i, Y_{\sigma(i)})^p,
\]
where the above inequality arises by replacing the infimum over all possible joint distributions of the X_i and Y_j with the specific joint distribution that pairs each X_i with Y_{σ(i)}.
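Because this bound is a minimum over permutations, it is precisely a linear assignment problem. Below is a sketch using SciPy's Hungarian solver; the helper `matching_distance` is our own. In one dimension with p = 1 and equal sample sizes the permutation bound is attained by the sorted coupling, so it coincides with SciPy's exact empirical W_1, which gives a quick sanity check:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm
from scipy.stats import wasserstein_distance

def matching_distance(X, Y, p=1):
    """Permutation bound for two n-point samples in R^d:
    (min over sigma of (1/n) sum_i d(X_i, Y_sigma(i))^p) ** (1/p)."""
    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2) ** p
    rows, cols = linear_sum_assignment(cost)   # optimal permutation
    return cost[rows, cols].mean() ** (1.0 / p)

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = rng.standard_normal(200) + 0.5
# In 1-d with p = 1, the permutation bound equals the exact empirical W_1.
w_match = matching_distance(x[:, None], y[:, None], p=1)
w_exact = wasserstein_distance(x, y)
```

In higher dimensions the same helper evaluates the bound directly; only the one-dimensional equality with `wasserstein_distance` is special.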
The following subsections establish that it is reasonable to replace W 2 (μ, μ − ) with a data driven estimate of EW 2 (μ n , μ − n ) in Lemma 3.3 and Theorem 3.4. Rates of convergence of W 2 (μ n , μ − n ) are presented, and a bootstrap estimator for EW 2 (μ n , μ − n ) is proposed and tested numerically.

Rate of convergence of empirical estimate
As W_p(·, ·) is a metric, the triangle inequality and the fact that the reflection x ↦ −x is an isometry, so that W_p(μ^-, μ_n^-) = W_p(μ, μ_n), together imply
\[
  \bigl| W_p(\mu, \mu^-) - W_p(\mu_n, \mu_n^-) \bigr| \le 2\, W_p(\mu, \mu_n).
\]
By Lemma A.4, W_p(μ, μ_n) → 0 with probability one, making the discrepancy negligible for large data sets. However, it is also possible to get a hard upper bound on this term; specifically, the recent work of [10] proposes explicit moment bounds on W_p(μ, μ_n). Their result can be used to demonstrate the speed with which our empirical measure of asymmetry, W_2(μ_n, μ_n^-), converges to zero when μ is symmetric.
In the case that μ is symmetric, W_2(μ, μ^-) = 0, and the ideal correction term is equal to zero. By the triangle inequality, our empirical bound then satisfies W_2(μ_n, μ_n^-) ≤ 2 W_2(μ, μ_n). Therefore, the moment bound from Theorem 1 of [10] implies that W_2(μ_n, μ_n^-) = O(n^{−δ}), where δ ∈ (0, 0.5] depends on the specific moment used and the dimensionality of the measure. Thus, the empirical bound on the correction term in our improved inequality, W_2(μ_n, μ_n^-)/√n, achieves a faster convergence rate in the symmetric case than the general rate of n^{−1/2}.
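This decay is easy to observe numerically for a symmetric law. A small sketch, using the exact one-dimensional W_1 in place of W_2 purely for convenience:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# For a symmetric law (here N(0,1)), the empirical asymmetry measure
# between the centred sample and its reflection shrinks as n grows,
# illustrating the O(n^-delta) decay in the symmetric case.
dists = {}
for n in (100, 10000):
    x = rng.standard_normal(n)
    x -= x.mean()
    dists[n] = wasserstein_distance(x, -x)
```

The distance at n = 10000 is roughly an order of magnitude below that at n = 100, in line with a root-n style decay.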
The tightness of the bounds proposed in [10] was tested experimentally. While the moment bounds are certainly of theoretical interest, implementing these bounds resulted in an inequality less sharp than the original symmetrization inequality. However, the bootstrap procedure detailed in the following section does produce a practically useful estimate of the expected empirical Wasserstein distance.

Bootstrap estimator
We propose a bootstrap procedure to estimate the expected Wasserstein distance between the empirical measure and its reflection, EW_2(μ_n, μ_n^-). Given observations x_1, …, x_n empirically centred so that x̄ = 0, let μ̂_n be the empirical measure of the data; this is a specific instance of μ_n. Then, for some specified m, two sets Y_1, …, Y_m and Z_1, …, Z_m can be sampled as independent draws from μ̂_n. The goal is to move a mass of 1/m from each of the Y_i to each of the negated −Z_j in an optimal fashion. Hence, the m × m matrix of pairwise distances is constructed with entries A_{i,j} = d(Y_i, −Z_j), which can be accomplished in O(m²) time. From here, the problem reduces to a linear assignment problem, a specific instance of the minimum-cost flow problem from linear programming [1]. That is, given a complete bipartite graph with vertices L ∪ R such that |L| = |R| = m and with weighted edges, we wish to construct a perfect matching minimizing the total sum of the edge weights. Here, the weights are the pairwise distances A_{i,j}. This linear program can be efficiently solved in O(m³) time via the Hungarian algorithm [17]. For more on linear programs in the probabilistic setting, see [26].
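The procedure above can be sketched as follows; `bootstrap_w2_reflection` is our own illustrative helper built on SciPy's `linear_sum_assignment` implementation of the assignment problem:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm, O(m^3)

def bootstrap_w2_reflection(x, m=None, reps=10, rng=None):
    """Bootstrap estimate of E W_2(mu_n, mu_n^-): optimally match m draws
    from the centred empirical measure against m negated independent draws."""
    rng = np.random.default_rng(rng)
    x = x - x.mean(axis=0)                 # empirically centre the data
    n = x.shape[0]
    m = n if m is None else m
    est = []
    for _ in range(reps):
        Y = x[rng.integers(0, n, size=m)]  # draws from the empirical measure
        Z = x[rng.integers(0, n, size=m)]
        # cost[i, j] = ||Y_i - (-Z_j)||^2, with mass 1/m per matched pair
        cost = ((Y[:, None, :] + Z[None, :, :]) ** 2).sum(axis=2)
        r, c = linear_sum_assignment(cost)
        est.append(np.sqrt(cost[r, c].mean()))
    return float(np.mean(est))

rng = np.random.default_rng(1)
sym = bootstrap_w2_reflection(rng.standard_normal((100, 3)), reps=5, rng=2)
skew = bootstrap_w2_reflection(rng.exponential(size=(100, 3)), reps=5, rng=2)
```

As expected, the skewed exponential sample produces a visibly larger estimate than the symmetric Gaussian one, and a handful of replications suffices given the small variance of the estimator.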
This estimated distance can be averaged over multiple bootstrapped samples, though, in general, only a few replications are necessary to achieve a stable estimate, as the bootstrap estimator has a very small variance. Indeed, consider the bounded difference inequality detailed in Section 3.2 of [6] and in Section 3.3.4 of [12], which is a direct corollary of the Efron-Stein-Steele inequality [8,25,24]: if a function f of independent random variables satisfies
\[
  \sup_{x_1, \ldots, x_n,\, x_i'} \bigl| f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n) \bigr| \le c_i
\]
for each i, then \( \operatorname{Var} f \le \tfrac{1}{4} \sum_{i=1}^n c_i^2 \). Therefore, if m is chosen to be of order n, as in the numerical experiments below, then the variance of the bootstrap estimate decays at a rate of O(n^{−1}).
The proposed bootstrap procedure was experimentally tested on both high dimensional Rademacher and Gaussian data, as will be seen in Section 4.3.1. For each replication, the observed data was randomly split in half. That is, given a random permutation ρ ∈ S_n, the symmetric group on n elements, the Hungarian algorithm was run to calculate the cost of an optimal perfect matching between {X_{ρ(1)}, …, X_{ρ(n/2)}} and {−X_{ρ(n/2+1)}, …, −X_{ρ(n)}}. As the sample size for each set is n/2, the expected distance between the two sets of points will be larger than the expected distance between two sets of n points. Indeed, let Y_1, …, Y_n, Z_1, …, Z_n be iid with law μ, let n be even, and let I ⊂ {1, …, n} be a subset of cardinality n/2; then the expected optimal matching cost between {Y_i}_{i∈I} and {Z_i}_{i∈I} is at least the expected optimal matching cost between the two full samples, where, similarly to Proposition 4.1, the inequality comes from taking a minimum over a smaller set.

Numerical experiments
From Proposition 4.1, there is an obvious positive bias in our new symmetrization inequality when using the Wasserstein distance between the empirical measures, W_2(μ_n, μ_n^-), in lieu of the Wasserstein distance between the unknown underlying measures, W_2(μ, μ^-). This is specifically troublesome when μ is symmetric or nearly symmetric. That is, if W_2(μ, μ^-) = 0, then, barring trivial cases, the distance between the empirical measures will be positive with positive probability. However, as stated in Lemma A.4, W_2(μ_n, μ_n^-) → 0 with probability one, which still makes this approach superior to the standard symmetrization inequality. In the following subsections, we compare the magnitudes of the expected symmetrized sum and the asymmetric correction term, which are, respectively, the expected Rademacher average R_n and the empirical Wasserstein correction C_n. The goal is to demonstrate through numerical simulations that the latter is smaller than the former, and thus that the newly proposed R_n + C_n is a sharper upper bound than the original 2R_n for n sufficiently large.

Rademacher data
For a dimension k and sample sizes n ∈ {2, 4, 8, …, 256}, the data for this first numerical test was generated from a multivariate symmetric Rademacher distribution. That is, for a size n iid sample X_1, …, X_n from this distribution, let X_{i,j} be the jth entry of the ith random variable, with X_{i,1}, …, X_{i,k} iid Rademacher(1/2) random variables. Across 10,000 replications, random samples were drawn and used to estimate the expected Rademacher average, R_n, and the expected empirical Wasserstein distance, C_n, under the ℓ¹-norm. The dimensions considered were k ∈ {2, 20, 200}. The results are displayed in the left column of Figure 1. As the sample size n increases with respect to k, we approach an asymptotic state and the bound based on the empirical Wasserstein distance becomes more attractive.

Gaussian data
For a dimension k and sample sizes n ∈ {2, 4, 8, …, 256}, the data for this second numerical test was generated from a multivariate Gaussian mixture distribution, specifically ½N(−1, I_k) + ½N(1, I_k), which is a symmetric distribution. Over 10,000 replications, random samples were drawn and used to estimate the expected Rademacher average, R_n, and the expected empirical Wasserstein distance, C_n, under the ℓ²-norm. The dimensions considered were k ∈ {2, 20, 200}. The results are displayed in the right column of Figure 1. Similarly to the multivariate Rademacher setting, as the sample size n increases, the bound based on the empirical Wasserstein distance becomes sharper than the original symmetrization bound.

Asymmetric data
The above experiments were repeated for asymmetric data, with results displayed in Figure 2. On the left side, the symmetric Rademacher distribution, where P(X_{i,j} = 1) = P(X_{i,j} = −1) = 1/2, was replaced with an asymmetric one where P(X_{i,j} = 1) = 2/3 and P(X_{i,j} = −1) = 1/3. On the right side, the mixture of normals is ⅓N(−1, I_k) + ⅔N(1, I_k). In the case of the asymmetric Rademacher data, the new bound performed worse than the standard bound when the sample size n is less than the dimension k. In the case of the imbalanced Gaussian mixture, the results are similar to the balanced case and give an improvement over the old bound.

Applications
In the following subsections, a collection of applications of the improved symmetrization inequality are briefly proposed to demonstrate the potential wide range of usefulness of this result. Such applications range from those of theoretical interest to those of practical application to statistical testing. These include a test for data symmetry, the construction of nonasymptotic high dimensional confidence sets, bounding the variance of an empirical process, and Nemirovski's inequality for Banach space valued random variables.

Permutation test for data symmetry
In the previous sections, we proposed the Wasserstein distance W 2 (μ, μ − ) to quantify the symmetry of a measure μ. Now, given n independent and identically distributed observations X 1 , . . . , X n with common centred measure μ, we propose a procedure to test for whether or not μ is symmetric. Unlike other tests for data symmetry which may be restricted to finite dimensional Euclidean space, this testing procedure applies to general Hilbert space valued random variables. Thus, it is applicable to many diverse settings such as, notably, functional data analysis.
The bootstrap approach from Section 4 for estimating the empirical Wasserstein distance is applied, with a permutation test run on the bootstrapped sample. Note that while the Wasserstein-2 metric is specifically used in our improved symmetrization inequality, any Wasserstein-p metric can be utilized for this test, as is done in the numerical simulations below.
The bootstrap-permutation test proceeds as follows:

0. Choose a number r of bootstrap replications to perform. Also, centre the data: X_i ← X_i − X̄.
1. For each bootstrap replication j = 1, …, r, permute the data by a uniformly randomly drawn ρ ∈ S_n, the symmetric group on n elements.
2. Use the Hungarian algorithm to compute the optimal assignment cost, ω_0, between the data sets {X_{ρ(1)}, …, X_{ρ(n/2)}} and {−X_{ρ(n/2+1)}, …, −X_{ρ(n)}}.
3. Denote this new half-negated data set Y, where Y_i = X_{ρ(i)} for i ≤ n/2 and Y_i = −X_{ρ(i)} for i > n/2.
4. Draw m random permutations ρ_1, …, ρ_m ∈ S_n. For each ρ_i, compute ω_i, the optimal assignment cost between {Y_{ρ_i(1)}, …, Y_{ρ_i(n/2)}} and {Y_{ρ_i(n/2+1)}, …, Y_{ρ_i(n)}}.
5. Return the p-value, p_j = #{ω_i > ω_0}/m.
6. Average the r p-values to get an overall p-value, p̄ = r^{−1} Σ_{j=1}^r p_j.

Note that for very large data sets, it may be computationally impractical to find a perfect matching between two sets of n/2 nodes, as performing this test as stated has a computational complexity of order O(mn³). In that case, randomly draw n′ < n elements from the data set in step 1, draw a ρ ∈ S_{n′}, and proceed as before with the smaller sample size.
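Steps 0 through 5 can be sketched as follows for a single bootstrap replication; the helper names are our own, and squared ℓ² assignment costs are an illustrative choice:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def matching_cost(A, B):
    """Optimal assignment cost between two point sets, squared l2 weights."""
    cost = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    r, c = linear_sum_assignment(cost)
    return cost[r, c].sum()

def symmetry_pvalue(X, n_perm=200, rng=None):
    """One bootstrap replication (steps 1-5) of the permutation test."""
    rng = np.random.default_rng(rng)
    X = X - X.mean(axis=0)                 # step 0: centre the data
    n = X.shape[0] // 2 * 2
    rho = rng.permutation(n)               # step 1: random permutation
    # steps 2-3: half-negated data set and its baseline matching cost
    Y = np.concatenate([X[rho[: n // 2]], -X[rho[n // 2:]]])
    omega0 = matching_cost(Y[: n // 2], Y[n // 2:])
    count = 0
    for _ in range(n_perm):                # step 4: permute the negated set
        pi = rng.permutation(n)
        count += matching_cost(Y[pi[: n // 2]], Y[pi[n // 2:]]) > omega0
    return count / n_perm                  # step 5: p-value

rng = np.random.default_rng(0)
p_sym = symmetry_pvalue(rng.standard_normal((60, 2)), n_perm=100, rng=1)
p_skew = symmetry_pvalue(rng.exponential(size=(60, 2)) - 1.0, n_perm=100, rng=1)
```

Under asymmetry, the baseline cost ω_0 sits in the upper tail of the permuted costs, driving the p-value toward zero; averaging over r replications, as in step 6, stabilizes the result.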
This permutation test was applied to simulated multivariate Rademacher(p) data in R⁵. For sample sizes n = 10 and n = 100, let X_1, …, X_n be independent and identically distributed multivariate Rademacher(p) random variables, where each X_i is comprised of a vector of independent univariate Rademacher(p) random variables, that is, P(ε = 1) = p and P(ε = −1) = 1 − p for p ∈ (0, 1). For values of p ∈ [0.5, 0.8], the power of this test was experimentally computed over 1000 simulations. The results are displayed in Figure 3.

Fig 3. For data in R⁵, the ℓ¹ and ℓ² metrics, and the Wasserstein distances W₁ and W₂, the experimentally computed power of the permutation test is plotted for Rademacher(p) data as p, the probability of 1, increases, thus skewing the distribution. The sample size is n = 100 in the left plot and n = 10 in the right plot. The n = 100 case includes an asymptotic test for skewness; this test fails in the nonasymptotic n = 10 case and thus is not included.

For the ℓ¹ and ℓ² metrics and the Wasserstein distances W₁ and W₂, the performances of the permutation test were comparable except for the (ℓ², W₂) case, which performed worse in both the large and small sample size settings. For the large sample size, n = 100, Mardia's test for multivariate skewness [20,21] was included, which uses the result that, under the null hypothesis of multivariate normality,
\[
  \frac{1}{6n} \sum_{i=1}^n \sum_{j=1}^n \bigl( (X_i - \bar X)^\top \hat\Sigma^{-1} (X_j - \bar X) \bigr)^3 \xrightarrow{d} \chi^2_{d(d+1)(d+2)/6},
\]
where Σ̂ is the empirical covariance matrix of the data. Similar asymptotic statistics are proposed in [3] for the larger class of elliptically symmetric distributions. However, Mardia's test is shown to be less powerful than the proposed permutation test. Furthermore, as it is asymptotic in design, it gave erroneous results in the n = 10 case and was thus excluded from the figure.

High dimensional confidence sets
A method for constructing nonasymptotic confidence regions for high dimensional data using a generalized bootstrap procedure was proposed in [2]. Beginning with a sample of independent and identically distributed Y_1, …, Y_n ∈ R^K and the assumptions that the Y_i are symmetric about their mean and bounded, ‖Y_i − EY_i‖ ≤ M almost surely for all i and some M > 0, they prove, among many other results, that for some fixed α ∈ (0, 1), the following holds with probability 1 − α, where φ : R^K → R is a function that is subadditive, positive homogeneous, and bounded by the ℓ_p-norm. Substituting our Theorem 3.4 for their Proposition 2.4 allows us to drop the symmetry condition and achieve a more general (1 − α) confidence region.

Bounds on empirical processes
Symmetrization arises when bounding the variance of an empirical process. In [6], the following result is stated as Theorem 11.8 and is subsequently proved using the original symmetrization inequality, resulting in suboptimal coefficients, where \( \sigma^2 = \sup_{s \in T} \sum_{i=1}^n \mathbb{E} X_{i,s}^2 \). The given proof uses the symmetrization inequality twice, as well as the contraction inequality (see [18], Theorem 4.4, and [6], Theorem 11.6), to establish the required bounds. Making use of the improved symmetrization inequality cuts the coefficient of EZ by a factor of 4, yielding a tighter variance bound. Beyond this textbook example of bounding the variance of an empirical process, symmetrization arguments are used to construct confidence sets for empirical processes in [11,19,15,9]. The coefficients in all of their results can be similarly improved using the improved symmetrization inequality.

Type, cotype, and Nemirovski's inequality
In the probability in Banach spaces setting, let X_i ∈ (B, ‖·‖) for i = 1, …, n be a collection of independent zero mean Banach space valued random variables. A collection of results referred to as Nemirovski inequalities [22,7] is concerned with whether or not there exists a constant K, depending only on the norm, such that
\[
  \mathbb{E}\Bigl\| \sum_{i=1}^n X_i \Bigr\|^2 \le K \sum_{i=1}^n \mathbb{E}\| X_i \|^2 .
\]
For example, in the Hilbert space setting, orthogonality allows for K = 1, and the inequality can be replaced by an equality. One such result requires the notion of type and cotype. A Banach space (B, ‖·‖) is said to be of Rademacher type p for 1 ≤ p < ∞ (respectively, of Rademacher cotype q for 1 ≤ q < ∞) if there exists a constant T_p (respectively, C_q) such that for all finite non-random sequences (x_i) ⊂ B and all sequences (ε_i) of independent Rademacher random variables,
\[
  \mathbb{E}\Bigl\| \sum_i \varepsilon_i x_i \Bigr\|^p \le T_p^p \sum_i \| x_i \|^p
  \qquad \Bigl( \text{respectively, } \sum_i \| x_i \|^q \le C_q^q\, \mathbb{E}\Bigl\| \sum_i \varepsilon_i x_i \Bigr\|^q \Bigr).
\]

These definitions and the original symmetrization inequality lead to the following proposition.
The proposition can be refined by applying our improved symmetrization inequality along with the Rademacher type p condition if the X i are additionally norm bounded. If the X i have a common law μ, let W 2 = W 2 (μ, μ − ) be the Wasserstein distance between μ and its reflection.
Proof. In the context of Theorem 3.4, set ψ(·) = ‖·‖^p. Given the bound ‖X_i‖ ≤ 1, we have that ‖ψ‖_Lip = p. Scale by p, and the first result follows.
Note that for identically distributed X_i ∈ B, the order of the original bound for a type p Banach space is O(n^{1−p}), while the Wasserstein correction term is O(n^{−1/2}). This correction gives an obvious benefit for spaces of type p < 3/2. However, even for spaces of type 2, the new bound can be tighter, specifically in the high dimensional setting when d ≫ n. Indeed, consider ℓ^∞(R^d), discussed in particular in Section 3.2 of [7], where it is shown to be of type 2 with constant T_2² = 2 log(2d). For independent and identically distributed X_i ∈ ℓ^∞(R^d), the two resulting bounds are compared in Figure 4 for n = 10, d ∈ {5, 25, 50}, and iid X_{i,j} + α/(1 + α) ∼ Beta(α, 1) for i = 1, …, n and j = 1, …, d. Hence, the X_i are Beta random variables shifted to have zero mean. W_2(μ, μ^-) is approximated by EW_2(μ_5, μ_5^-), which is computed via the bootstrap procedure outlined in Section 4. The new bound can be seen to perform better than the old one, specifically in the cases d = 25 and d = 50 when α is not too large. Note that the new bound does not perform as well when d = 5 and that, in general, the improvement in performance occurs when d ≫ n.

A Nemirovski variant with weak variance
As one further example of improved symmetrization, a variation of Nemirovski's inequality found in Section 13.5 of [6] is proved via a similar symmetrization argument for the ℓ_p norm with p ≥ 1. Let X_1, …, X_n ∈ R^d be independent zero mean random variables. Let B_q = {x ∈ R^d : ‖x‖_q ≤ 1} with q the conjugate exponent of p, and define the weak variance \( \Sigma_p^2 = n^{-2}\, \mathbb{E} \sup_{t \in B_q} \sum_{i=1}^n \langle t, X_i \rangle^2 \). The resulting inequality is \( \mathbb{E}\|S_n\|_p^2 \le 578\, d\, \Sigma_p^2 \). Replacing the old symmetrization inequality with the improved version reduces the coefficient of 578 roughly by a factor of 4, resulting in \( \mathbb{E}\|S_n\|_p^2 \le 146\, d\, \Sigma_p^2 + O(n^{-1/2}) \).

Discussion
The symmetrization inequality is a fundamental result for probability in Banach spaces, concentration inequalities, and many other related areas. However, not accounting for the amount of asymmetry in the given random variables has led to pervasive powers of two throughout derivative results. Our improved symmetrization inequality incorporates such a quantification of asymmetry through use of the Wasserstein distance. Besides being theoretically sound, it is shown in simulations to provide a tightness superior to that of the original result. Going beyond the inequality itself, this Wasserstein distance offers a novel and powerful way to analyze the symmetry of random variables or lack thereof. It can and should be applied to countless other results that were not considered in this current work.

This article detailed a connection between symmetrization and the Wasserstein metric to answer the question regarding measuring the asymmetry in the symmetrization inequality. However, many open questions remain. Theorem 3.4 is a non-asymptotic bound on the difference between the centred sum and the Rademacher sum, which aligns with the usual n^{−1/2} rate of convergence. A comparison of this approach with respect to versions of the central limit theorem under assumptions on μ could comment on the sharpness of this bound as we move from the non-asymptotic to the asymptotic regime. This would also take into account how much is lost when moving from the smoothed convolution μ*_n to just μ via Lemma A.3.

The numerical experiments of Section 4.3 show the improvement in the bound as n → ∞, but also show that such improvement is dampened as d → ∞. While Theorem 3.4 is independent of dimension, the bootstrap estimator clearly is not. Further investigation of the rate of convergence in various asymptotic regimes would be of interest. Furthermore, a better estimator than the proposed bootstrap estimator for the Wasserstein distance would improve the performance of this bound in practice.