Sanov-type large deviations and conditional limit theorems for high-dimensional Orlicz balls

In this paper, we prove a Sanov-type large deviation principle for the sequence of empirical measures of vectors chosen uniformly at random from an Orlicz ball. From this level-2 large deviation result, in combination with Gibbs conditioning, entropy maximization, and an Orlicz version of the Poincaré-Maxwell-Borel lemma, we deduce a conditional limit theorem for high-dimensional Orlicz balls. Roughly speaking, the latter shows that if V_1 and V_2 are Orlicz functions, then random points in the V_1-Orlicz ball, conditioned on having a small V_2-Orlicz radius, look like an appropriately scaled V_2-Orlicz ball. In fact, we show that the limiting distribution in our Poincaré-Maxwell-Borel lemma, and thus the geometric interpretation, undergoes a phase transition depending on the magnitude of the V_2-Orlicz radius.


Introduction & Main results
It is a classical and famous result, independently observed by Borel [3] and Maxwell (e.g., [7]), and commonly referred to as the Poincaré-Maxwell-Borel lemma, that any fixed number of coordinates of a random vector uniformly distributed on a high-dimensional Euclidean sphere is approximately Gaussian. More precisely, if X^(n) = (X^(n)(1), ..., X^(n)(n)) is uniformly distributed on √n S^{n−1} = {(x_i)_{i=1}^n ∈ R^n : Σ_{i=1}^n |x_i|^2 = n}, then, for any fixed k ≤ n, the distribution of the first k coordinates X^(n)(1), ..., X^(n)(k) converges weakly to the distribution of a k-dimensional standard Gaussian random vector. The normalization of the sphere is chosen so that a typical coordinate of an element of √n S^{n−1} is of unit order when the dimension n of the ambient space is large. In 1991, Mogul'skiĭ [22] and Rachev and Rüschendorf [25] obtained an ℓ_p analogue of the Poincaré-Maxwell-Borel lemma for the surface and cone measure, respectively, on a properly scaled ℓ_p-sphere (note that surface and cone measure coincide whenever p ∈ {1, 2, ∞}). A simplification of the arguments can be found in the work of Naor and Romik [23, Theorems 3 and 4], and a significant generalization (including the case of Orlicz balls) has recently been obtained by Johnston and Prochno in [11]. In the ℓ_p-versions of the Poincaré-Maxwell-Borel lemma, instead of the standard Gaussian distribution, the so-called p-generalized Gaussian distribution appears in the weak limit, its Lebesgue density being given by

f_p(x) = (2 p^{1/p} Γ(1 + 1/p))^{−1} e^{−|x|^p / p},   x ∈ R.

This density is intimately related to the ℓ_p^n-balls, which we denote by B_p^n, and is of fundamental importance in essentially any probabilistic approach to understanding their geometry; see, e.g., [13, 28, 29] and the references cited therein. Recently, Kim and Ramanan [16, Theorem 2.3] showed that a conditional analogue of the ℓ_p-version of the Poincaré-Maxwell-Borel lemma holds. Roughly speaking, they found that for 1 ≤ q < p < ∞, when the dimension n of the space is high, a random point on the ℓ_p^n-sphere, conditioned on having a sufficiently small ℓ_q-norm, is weakly close to a random point drawn from an appropriately scaled ℓ_q^n-sphere. As the authors point out, this means that conditioning on a sufficiently small ℓ_q-norm induces a probabilistic change that admits a nice geometric interpretation. The strategy of proof is based on ideas from statistical mechanics and large deviations theory (see also [12] for more in this direction). The authors first prove a level-2 large deviation principle for the empirical measure of the coordinates of a random point on a properly scaled ℓ_p^n-sphere, where the random choice is made with respect to the cone probability measure on the boundary. Fundamental to this proof is a probabilistic representation of the cone measure in terms of the p-generalized Gaussian distribution [16, Lemma 3.9], which goes back to Schechtman and Zinn [29] and Rachev and Rüschendorf [25] and allows one to pass to random vectors with independent coordinates. Let us point out already that in the generalized setting considered here, no such probabilistic representation is available. Having proved a Sanov-type large deviation principle for the empirical measure, the authors in [16] proceed to deduce the conditional limit theorems from this large deviation result and the famous Gibbs conditioning principle; naturally, numerous (technical) hurdles need to be overcome here. The use of the Gibbs conditioning principle allows one to transform a Sanov-type large deviation principle for an empirical measure into a statement about the most likely behavior of the underlying sequence of random vectors conditioned on a rare event.
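As a quick numerical illustration of the classical Poincaré-Maxwell-Borel lemma (our own sketch, not part of the paper's argument; all function names are ours), the following Python code samples points uniformly from √n S^{n−1} by normalizing a standard Gaussian vector, and checks that the first coordinate has approximately zero mean and unit variance.

```python
import math
import random

def uniform_on_scaled_sphere(n, rng):
    """Sample uniformly from sqrt(n) * S^(n-1): normalize a standard
    Gaussian vector (rotational invariance) and rescale to radius sqrt(n)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x * math.sqrt(n) / norm for x in g]

def first_coordinate_moments(n, samples, seed=0):
    """Empirical mean and variance of the first coordinate over many draws."""
    rng = random.Random(seed)
    xs = [uniform_on_scaled_sphere(n, rng)[0] for _ in range(samples)]
    mean = sum(xs) / samples
    var = sum((x - mean) ** 2 for x in xs) / samples
    return mean, var
```

For n around a few hundred and a few thousand samples, the empirical variance of the first coordinate is close to 1, matching the standard Gaussian limit.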
In this work, we prove that an analogous result holds for the general and fundamental class of Orlicz balls; Orlicz spaces are natural generalizations of ℓ_p-spaces and are intensively studied in functional analysis and optimization theory (see, e.g., [10, 14, 17, 18, 21, 24, 27, 30]). One of the challenges that arises is that no probabilistic representation, as in the case of ℓ_p-balls, is at our disposal, which makes the analysis more delicate. In addition, a general Orlicz function is not homogeneous, making any approach to understanding analytic, geometric, or probabilistic aspects of these spaces and their unit balls even more complicated. Key to our generalization, which follows the general theme of first proving a level-2 large deviation principle for the sequence of empirical measures of the coordinates of random points chosen uniformly from a suitably scaled Orlicz ball and then applying Gibbs conditioning, is the statistical mechanics point of view. Indeed, conditional distributions appear here in connection with the study of non-interacting particle systems (micro-canonical and canonical ensembles), where one seeks to describe the most likely state of the system under an energy constraint (see, e.g., [9], [26], and [12, Subsection 1.2]). This point of view directly links such questions to certain Gibbs distributions. Considering such Gibbs measures at suitable critical temperatures, with potentials given by an Orlicz function, allows us to work around the problem arising from the lack of a probabilistic representation. This idea was recently put forward by Kabluchko and Prochno [12] (see particularly Subsection 1.2 there) in their probabilistic approach to the geometry of Orlicz balls and has also been used successfully in [1], [2], and [15].

Main results
In order to present the main results of this paper, we need to introduce some notation; for the rest we refer the reader to Section 2 below. Recall that an Orlicz function V : R → [0, ∞) is a convex, symmetric function with V(0) = 0 and V(x) > 0 for x ≠ 0. For R ∈ (0, ∞), we denote by

B^{n,V}_R := {x = (x_i)_{i=1}^n ∈ R^n : Σ_{i=1}^n V(x_i) ≤ nR}

the associated Orlicz ball in R^n. If V : R → [0, ∞) is an Orlicz function and β ∈ (−∞, 0), we define the corresponding Gibbs measure µ_{V,β} (with potential V and at critical temperature β) via the Lebesgue density

(dµ_{V,β}/dx)(x) = e^{βV(x) − ϕ_V(β)},   where   ϕ_V(β) := log ∫_R e^{βV(y)} dy.

We shall denote by M_1(R) the set of probability measures on the Borel σ-algebra on R and, for µ ∈ M_1(R), we define m_V(µ) := ∫_R V(x) µ(dx). Our first main result establishes a level-2 or Sanov-type large deviation principle for the empirical measure of the coordinates of a vector chosen uniformly at random from an Orlicz ball.
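To make the Gibbs measure concrete, the following Python sketch (our own illustration, with dµ_{V,β}/dx proportional to e^{βV(x)} for β < 0; helper names are ours) evaluates the normalizing constant and the moment m_V(µ_{V,β}) numerically for the Orlicz function V(x) = x², where both are known in closed form: Z(β) = √(π/|β|) and m_V(µ_{V,β}) = 1/(2|β|).

```python
import math

def gibbs_normalizer(V, beta, lo=-12.0, hi=12.0, steps=20001):
    """Z(beta) = integral over R of exp(beta * V(x)) dx, approximated on
    [lo, hi] with the trapezoidal rule; beta < 0 makes this finite."""
    h = (hi - lo) / (steps - 1)
    total = 0.0
    for i in range(steps):
        x = lo + i * h
        w = 0.5 if i in (0, steps - 1) else 1.0
        total += w * math.exp(beta * V(x))
    return total * h

def gibbs_moment(V, beta, lo=-12.0, hi=12.0, steps=20001):
    """m_V(mu_{V,beta}) = E[V(X)] under the density exp(beta * V(x)) / Z."""
    Z = gibbs_normalizer(V, beta, lo, hi, steps)
    h = (hi - lo) / (steps - 1)
    total = 0.0
    for i in range(steps):
        x = lo + i * h
        w = 0.5 if i in (0, steps - 1) else 1.0
        total += w * V(x) * math.exp(beta * V(x))
    return total * h / Z
```

For β = −1/2 and V(x) = x², the Gibbs measure is the standard Gaussian, so the two routines should return √(2π) and 1, respectively.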
Theorem A. Let V be an Orlicz function and R ∈ (0, ∞). Then, there exists a unique α ∈ (−∞, 0) such that the sequence of empirical measures satisfies an LDP in M_1(R), equipped with the weak topology, with the strictly convex good rate function I_V given in Equation (2). In addition, if there exists another Orlicz function W such that lim_{|x|→∞} W(x)/V(x) = 0, then the LDP also holds in the generalized Wasserstein space (M^W_1(R), d_W).

Remark 1.1. The proof of Theorem A rests on a general version of the Gärtner-Ellis theorem. Beyond the Sanov-type large deviation principle on M_1(R) endowed with the weak topology, the last part of the statement, the large deviation principle on the generalized Wasserstein space (M^W_1(R), d_W), is also essential for the proof of our second main result, Theorem B below.
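By Lemma 3.1 below, the tilt α(λ) minimizes β → ϕ_V(β, λ) − βR; at λ ≡ 0 this amounts to the moment identity m_V(µ_{V,α}) = R (our reading of the statement). As an illustrative sketch (function names are ours), the following Python code finds α by bisection for V(x) = x², where the closed form α = −1/(2R) is available for comparison.

```python
import math

def gibbs_v_moment(V, alpha, lo=-12.0, hi=12.0, steps=4001):
    """E[V(X)] when X has density exp(alpha * V(x)) / Z on R, alpha < 0
    (trapezoidal rule on a truncated range)."""
    h = (hi - lo) / (steps - 1)
    Z = num = 0.0
    for i in range(steps):
        x = lo + i * h
        w = 0.5 if i in (0, steps - 1) else 1.0
        p = math.exp(alpha * V(x))
        Z += w * p
        num += w * V(x) * p
    return num / Z

def solve_alpha(V, R, lo_a=-50.0, hi_a=-1e-6, iters=60):
    """Bisection for the unique alpha < 0 with m_V(mu_{V,alpha}) = R;
    the moment is strictly increasing in alpha, so bisection applies."""
    for _ in range(iters):
        mid = 0.5 * (lo_a + hi_a)
        if gibbs_v_moment(V, mid) < R:
            lo_a = mid
        else:
            hi_a = mid
    return 0.5 * (lo_a + hi_a)
```

For V(x) = x² and R = 1/4, the solver should return a value close to −2.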
Let (X^{n,V}_R)_{n∈N} be a sequence of random variables, where each X^{n,V}_R is uniformly distributed on the Orlicz ball B^{n,V}_R. For fixed k ∈ N, we write X^{n,V}_R(1), ..., X^{n,V}_R(k) for its first k coordinates, and for R ∈ (0, ∞) we define the corresponding conditioning sets. We can now state the second main result of this paper, which establishes a conditional limit theorem for random points in Orlicz balls.
for all α ∈ (−∞, 0). Then, there exists a critical radius R_c ∈ (0, ∞) such that for all 0 < R ≤ R_c and for fixed k ∈ N, the conditional distributions of the first k coordinates converge with respect to d_w, where d_w denotes some metric inducing the weak convergence of probability measures.
We shall deduce Theorem B from our Sanov-type large deviation result in Theorem A, which we combine with the so-called Gibbs conditioning principle, entropy maximization, and an Orlicz version of the Poincaré-Maxwell-Borel lemma. Let us continue with a few remarks related to Theorem B.
Remark 1.2. Under the assumptions of Theorem B, roughly speaking, in high dimensions random points in the V_1-Orlicz ball, conditioned on having a small V_2-Orlicz radius, look like an appropriately scaled V_2-Orlicz ball.
Remark 1.3. Theorem B generalizes [16, Theorem 2.3], where the authors proved a conditional limit theorem in an ℓ_p-setting. Let us point out that spheres rather than balls were considered there.
Remark 1.4. In Corollary 4.9 we will see that the conditioning becomes trivial when the radius R in the condition of Theorem B is large. Geometrically this means that, in high dimensions, points in a V_1-Orlicz ball conditioned on being in a big V_2-Orlicz ball are uniformly distributed in the V_1-Orlicz ball, which is to be expected. We refer to Remark 4.11 for the phase transition in the "intermediate case".
Remark 1.5. The second condition in Equation (4) ensures the existence of m_{V_2}(µ_{V_1,α}) for all α ∈ (−∞, 0). This condition does not appear in the case of ℓ_p-balls, since ∫_R |x|^p e^{−|x|^q} dx < ∞ for all 1 ≤ p, q < ∞. In particular, the second condition in (4) is satisfied for all polynomially bounded V_1. A similar condition appears in [12, Theorem B].
Remark 1.6. Fix p ∈ [1, ∞) and take an Orlicz function V satisfying the growth condition of Corollary 3.5. Then, the sequence of empirical measures (L_{n,V})_{n∈N} from Equation (1) satisfies an LDP in the p-Wasserstein space M^p_1(R) with the good rate function I_V defined in Equation (2) (see also Corollary 3.5).

Notation
We shall use the following standard notation. Given a Borel measurable set A ⊆ R^n, we denote by vol_n(A) its n-dimensional Lebesgue measure. If X is a topological space, then for Borel sets A ⊆ X, we write A° and Ā for the interior and the closure of A, respectively. A related quantity is the entropy h : M_1(R) → R of a measure, defined as h(µ) := −∫_R f(x) log f(x) dx whenever µ has a Lebesgue density f. For µ ∈ M_1(R) and a measurable function f : R → R, we set µ(f) := ∫_R f dµ, provided the latter exists. For x ∈ R, δ_x denotes the Dirac measure at x.
We will frequently work with the space of bounded and continuous functions mapping from R to R, which we shall denote by C b (R).

Orlicz functions & Gibbs measures
Let us recall here what an Orlicz function is and introduce Gibbs measures whose potentials are given by Orlicz functions.

Basics from large deviation theory and probability
Consider a sequence of random variables (ξ_n)_{n∈N} taking values in a topological space X, equipped with a Borel σ-algebra and a probability measure P on it. We say that (ξ_n)_{n∈N} satisfies an LDP in X if there exists a good rate function (GRF) I : X → [0, ∞], i.e., a rate function with compact level sets, such that

−inf_{x ∈ A°} I(x) ≤ liminf_{n→∞} (1/n) log P(ξ_n ∈ A) ≤ limsup_{n→∞} (1/n) log P(ξ_n ∈ A) ≤ −inf_{x ∈ Ā} I(x)

for all Borel sets A ⊆ X. There are different ways to show that a sequence of random variables satisfies an LDP; one of the most commonly used is the so-called contraction principle (see, e.g., Theorem 4.2.1 in [6]).
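For intuition (an illustration of the definition, not taken from the paper), the LDP can be observed numerically in the simplest case of empirical means of fair coin flips, where the GRF is the classical Cramér rate function I(a) = a log(2a) + (1 − a) log(2(1 − a)). The sketch below (names are ours) evaluates (1/n) log P(S_n ≥ an) exactly and compares it with −I(a).

```python
import math

def log_tail_prob(n, a):
    """log P(S_n >= a*n) for S_n ~ Binomial(n, 1/2), computed exactly via
    a log-sum-exp over binomial log-weights (no overflow for large n)."""
    k0 = math.ceil(a * n)
    logs = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        - n * math.log(2.0)
        for k in range(k0, n + 1)
    ]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs))

def cramer_rate(a):
    """Cramér rate function for Bernoulli(1/2) empirical means, 1/2 < a < 1."""
    return a * math.log(2 * a) + (1 - a) * math.log(2 * (1 - a))
```

For moderate n the normalized log-probability already sits close to −I(a); the gap shrinks at rate of order (log n)/n, reflecting the subexponential prefactor.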
Lemma 2.2. Let X and Y be Hausdorff topological spaces and let f : X → Y be a continuous function. Let (ξ_n)_{n∈N} be a sequence of random variables that satisfies an LDP in X with GRF I : X → [0, ∞]. Then (f(ξ_n))_{n∈N} satisfies an LDP in Y with GRF I′ : Y → [0, ∞], where I′(y) := inf{I(x) : x ∈ X, f(x) = y}.

When working with LDPs it may occur that one encounters different topologies on the same ambient space X. Corollary 2.4 is a useful tool when dealing with that situation.

Corollary 2.4. ([6]) Let (ξ_n)_{n∈N} be an exponentially tight sequence of random variables on a topological space X equipped with a topology τ_1. If (ξ_n)_{n∈N} satisfies an LDP with respect to a Hausdorff topology τ_2 that is coarser than τ_1, then the same LDP holds with respect to the finer topology τ_1.
Lastly, we present a version of the Gärtner-Ellis theorem, which will play an important role in the proof of Theorem A.
Lemma 2.5. (Corollary 4.6.14 in [6]) Let (ξ_n)_{n∈N} be a sequence of exponentially tight random variables on a locally convex Hausdorff topological vector space E, and suppose that the Gärtner-Ellis limit Λ(λ) := lim_{n→∞} (1/n) log E[exp(n λ(ξ_n))] exists for every λ in the topological dual of E and defines a Gateaux-differentiable function. Then (ξ_n)_{n∈N} satisfies an LDP in E with GRF given by the Legendre transform of Λ.

For a measure ν ∈ M_1(R), we recall the relative entropy H(·|ν) from Equation (6) in Section 2.1, for which the following variational formula holds.
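Since the variational formula of Proposition 2.6 is invoked repeatedly below, the following self-contained Python sketch illustrates the standard Donsker-Varadhan form H(µ|ν) = sup_f {∫ f dµ − log ∫ e^f dν} on discrete measures, where the supremum is attained at f = log(dµ/dν). This is our own illustration under the standard form of the formula; the exact statement of Proposition 2.6 is not reproduced here.

```python
import math

def relative_entropy(mu, nu):
    """H(mu|nu) = sum_i mu_i * log(mu_i / nu_i) for discrete measures on a
    common finite support (with the convention 0 * log 0 = 0)."""
    return sum(m * math.log(m / n) for m, n in zip(mu, nu) if m > 0)

def dv_functional(f, mu, nu):
    """Donsker-Varadhan functional: sum_i f_i mu_i - log sum_i e^{f_i} nu_i.
    It is <= H(mu|nu) for every f, with equality at f = log(mu/nu)."""
    lin = sum(fi * mi for fi, mi in zip(f, mu))
    log_mgf = math.log(sum(math.exp(fi) * ni for fi, ni in zip(f, nu)))
    return lin - log_mgf
```

Plugging in the maximizer f = log(µ/ν) recovers H(µ|ν) exactly, while any other bounded f gives a strict lower bound.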
When working with the space of probability measures M 1 (R), the notion of tightness plays a major role.
The following result is known as the Gibbs conditioning principle, one of the fundamental tools in the theory of large deviations. It will be used in the proof of Proposition 4.7, which is itself used in the proof of Theorem B.
Proposition 2.8. (Gibbs conditioning principle, Theorem 7.1 in [20]) Let X be a topological space, and let (ξ_n)_{n∈N} be a sequence of X-valued random variables that satisfies an LDP in X with GRF I. In addition, let F ⊆ X be a subset satisfying conditions (1)-(4) of [20, Theorem 7.1]; in particular, condition (2) requires that F is closed. Let M_F be the set of x ∈ F which minimize I over F, that is, M_F := {x ∈ F : I(x) = inf_{y∈F} I(y)}. Then, for all open G ⊆ X such that M_F ⊆ G, we have lim_{n→∞} P(ξ_n ∈ G | ξ_n ∈ F) = 1. As a consequence, if M_F = {x} is a singleton, then the conditional distributions of ξ_n given {ξ_n ∈ F} converge weakly to δ_x.

In the proof of Proposition 4.7 we shall make use of Proposition 2.11, which relies on [31, Proposition 2.1]. We start with a definition, where we adapt the notation from [31] in order to stay consistent with the notation used in this article.
Definition 2.9. Let (µ^n_ε)_{n∈N, ε>0} be a family of probability measures, where for each ε > 0 and each n ∈ N, µ^n_ε is a symmetric distribution on R^n. For µ ∈ M_1(R), we say that the family is µ-chaotic if, for every k ∈ N and all f_1, ..., f_k ∈ C_b(R),

lim_{ε→0} lim_{n→∞} ∫_{R^k} f_1(x_1) ⋯ f_k(x_k) µ^{n,k}_ε(dx_1, ..., dx_k) = ∏_{j=1}^k ∫_R f_j dµ,   (9)

where µ^{n,k}_ε denotes the marginal distribution of the first k coordinates of µ^n_ε.
Remark 2.10. In particular, the limit in Equation (9) implies weak convergence of the family (µ^{n,k}_ε)_{n∈N, ε>0} towards µ^{⊗k}, where the latter denotes the k-fold product measure of µ.
We have the following characterization of the notion defined in Definition 2.9.

Proposition 2.11. Let (µ^n_ε)_{n∈N, ε>0} be a family of probability measures, where for each ε > 0 and n ∈ N, µ^n_ε is a symmetric distribution on R^n. For all ε > 0 and n ∈ N, let X^n_ε = (X^n_ε(1), ..., X^n_ε(n)) ∈ R^n be distributed according to µ^n_ε. Then, the family (µ^n_ε)_{n∈N, ε>0} is µ-chaotic if and only if the empirical measures n^{−1} Σ_{i=1}^n δ_{X^n_ε(i)} converge weakly, in probability, to µ as n → ∞ followed by ε → 0.

The generalized Wasserstein spaces
Let M_1(R) be the set of probability measures on the Borel σ-algebra on R, let V be an Orlicz function, and recall the following quantities from Section 2.
In the case V(x) = |x|^p for some p ∈ [1, ∞), we write m_p := m_V and d_p := d_V.

Remark 2.13. By the very definition of the metric d_p, the space (M^p_1(R), d_p) becomes the p-Wasserstein space, which is complete and separable (see, e.g., Chapter 6 in [32]).
Note that for general Orlicz functions V, it is not claimed that (M^V_1(R), d_V) is Polish, but we have the following result, which suffices for our purposes.

Proposition 2.15. (Lemma C.1 in [8]) Let V be an Orlicz function and let M^V_1(R) and d_V be defined as in (11) and (12), respectively. Then, a sequence (µ_n)_{n∈N} ⊆ M^V_1(R) converges to µ ∈ M^V_1(R) with respect to d_V if and only if µ_n → µ weakly and m_V(µ_n) → m_V(µ). Furthermore, the metric space (M^V_1(R), d_V) is complete.

Proof. The proof is analogous to the proof of Lemma 3.14 in [16].
Proof. Since the function V_1 : R → [0, ∞) is convex, it has compact level sets. Thus, by Lemma 3.2 in [8], K^{V_1}_M is tight. By Prohorov's theorem (see, e.g., Theorem D.9 in [6]), K^{V_1}_M is precompact in the weak topology. To see that K^{V_1}_M is also closed in the weak topology, we take a sequence (µ_n)_{n∈N} ⊆ K^{V_1}_M with µ_n → µ ∈ M_1(R) weakly. Then, for fixed C ∈ (0, ∞), the corresponding bound follows, where the convergence in the second line holds since min{C, V_1} is continuous and bounded; by monotone convergence, the bound extends to V_1 itself. Then, by weak compactness, there exists a subsequence (µ_{n_k})_{k∈N} and a µ ∈ M_1(R) such that µ_{n_k} → µ weakly; we claim that the convergence also holds in the metric d_{V_2} (see (12) for the definition of this metric). In order to show this, we work with the characterization of convergence in M^{V_2}_1(R) given in Proposition 2.15. We obtain lim_{k→∞} m_{V_2}(µ_{n_k}) = m_{V_2}(µ), where we used that (µ_{n_k})_{k∈N} ⊆ K^{V_1}_M and lim_{|x|→∞} V_2(x)/V_1(x) = 0. Thus, we have shown that every sequence in K^{V_1}_M has a subsequence converging in (M^{V_2}_1(R), d_{V_2}).

Proof of Theorem A
We now present the proof of the Sanov-type large deviation principle of Theorem A.
Lemma 3.1. Let V be an Orlicz function and let R ∈ (0, ∞). For α ∈ (−∞, 0) and λ ∈ C_b(R), define

ϕ_V(α, λ) := log ∫_R e^{αV(x) + λ(x)} dx.

Then, for each λ ∈ C_b(R), there exists a unique α(λ) ∈ (−∞, 0) minimizing β → ϕ_V(β, λ) − βR. Moreover, the mapping λ → ϕ_V(α(λ), λ) is Fréchet-differentiable, and its derivatives are given by (17) and (18).

Remark 3.2. To be precise, the function α(λ) also depends on R and on V, but to keep the notational burden low, we omit these variables in this section.
In the proof of Theorem A, the Legendre transform of a certain functional involving ϕ_V(α, λ) from Lemma 3.1 will appear; we compute it in the following proposition.
Remark 3.4. If µ ∈ M_1(R) with m_V(µ) ≤ R and µ is absolutely continuous with respect to the Lebesgue measure, then H(µ|µ_{V,α(0)}) can be written in terms of the entropy h(µ), where we used that (dx/dµ_{V,α(0)})(x) = e^{−α(0)V(x) + ϕ_V(α(0),0)}. For the GRF I_V(µ), we thus obtain the corresponding representation.

Proof. (of Proposition 3.3) By definition, the Legendre transform I_V of Λ_V is given as a supremum over λ ∈ C_b(R), where the second equality uses the fact that α(λ) minimizes β → ϕ_V(β, λ) − βR (see Lemma 3.1). By rearranging the terms and interchanging the suprema, we arrive at an expression in terms of the relative entropy; here we used the variational formula for the relative entropy H given in Proposition 2.6. Next, we take a closer look at the relative entropy H(µ|µ_{V,β}) in order to obtain the desired representation of I_V, and from Equation (21) it follows that I_V has the asserted form. For the strict convexity, we just note that µ → H(µ|µ_{V,α(0)}) is strictly convex (see, e.g., Ex. 5.5 in [26]) and consider convex combinations κµ_1 + (1 − κ)µ_2 for κ ∈ (0, 1) and µ_1, µ_2 in the effective domain.

Proof. (of Theorem A) We are going to prove Theorem A by applying the version of the Gärtner-Ellis theorem stated in Lemma 2.5. We view (L_{n,V})_{n∈N} as M(R)-valued random variables, where M(R) is the space of finite signed measures on the Borel σ-algebra on R. Let M(R)′ denote the algebraic dual of M(R), i.e., the space of linear functionals mapping from M(R) to R. Recall that C_b(R) denotes the space of continuous and bounded functions mapping from R to R, and define Y accordingly. The set Y is separating in M(R), i.e., for ν, µ ∈ M(R) with µ ≠ ν, we find a λ ∈ Y such that λ(µ − ν) ≠ 0 (see [6, p. 261]). The Y-topology on M(R) is the topology generated by the system of sets indexed by λ ∈ Y, x ∈ R, and δ ∈ (0, ∞). By definition, the relative topology on M_1(R) ⊆ M(R) is the weak topology. By [6, Theorem B.8], M(R) together with the Y-topology is a locally convex, Hausdorff topological vector space, and Y is the topological dual M(R)* of M(R). Thus, we can identify M(R)* with C_b(R). Now the framework for Lemma 2.5 is set up, and we can compute the Gärtner-Ellis limit and then show that it is Gateaux-differentiable. This can be done by using an appropriate exponential tilting argument. In the following, we denote by X^{n,V}_R a random variable uniformly distributed on B^{n,V}_R, and we consider the quantity in the Gärtner-Ellis limit (22), where λ ∈ C_b(R). We now consider a sequence of iid random variables (Z_i)_{i∈N}, all distributed according to the Gibbs density with parameters α(λ) and ϕ_V(α(λ), λ) as defined in Lemma 3.1, and define a new sequence of independent and identically distributed random variables (Y_i)_{i∈N}. We obtain an identity whose expectation is bounded from above, since α(λ) ∈ (−∞, 0); therefore, we obtain the upper bound (24). For a lower bound, we fix some c ∈ (0, ∞), which yields the lower bound (25). By Equation (17) and Equation (18) in Lemma 3.1, we can apply the central limit theorem to the sequence (Y_i)_{i∈N}. Thus, the probability in Equation (25) tends to N(0, V[Y_1])([−c, 0]) ∈ (0, 1) as n → ∞. Combining the upper bound (24) and the lower bound (25), we obtain (26). In the case λ ≡ 0, we recover the asymptotic volume of an Orlicz ball with radius R, i.e., (27). Putting (26) and (27) together, we obtain from (23) the Gärtner-Ellis limit Λ_V(λ). In particular, by Lemma 3.1, Λ_V is Fréchet-differentiable. In order to be able to apply Lemma 2.5, it remains to show that the sequence of empirical measures (L_{n,V})_{n∈N} is exponentially tight. We define the quantity M := sup_{|x|≥1} |x|/V(x) and note that M ∈ (0, ∞), since V is an Orlicz function. Then, recalling that m_1 denotes the first moment map, we have, for every n ∈ N, L_{n,V} ∈ K^1_M, where K^1_M is the weakly compact set from Proposition 2.16. In other words, the empirical measures remain in a fixed weakly compact set, implying the exponential tightness of (L_{n,V})_{n∈N}. It now follows from Lemma 2.5 that (L_{n,V})_{n∈N} satisfies an LDP in the weak topology on M_1(R). The GRF is the Legendre transform of Λ_V, which is, by Proposition 3.3, the strictly convex function I_V. Now consider the situation in which there exists another Orlicz function W such that lim_{|x|→∞} W(x)/V(x) = 0, recall the set K^R_V from Proposition 2.17, and note that the empirical measures belong to it.

Remark 3.6. In the case p = 2 and R = 1, the GRF J_{V,2} from Corollary 3.5 is the same as the GRF from [15, Section 3]. There, the authors used a different approach but encountered the same growth condition on V; this condition is necessary to introduce the exponential change of measure used in the proof of Proposition 3.11 in [15]. In [15] this condition may seem artificial at first, but looking at the proof of Theorem A, one sees that it appears in a quite natural way.
Proof. (of Corollary 3.5) Under our assumption, by Theorem A, (L_{n,V})_{n∈N} satisfies an LDP in M^p_1(R) with the strictly convex GRF I_V. We can apply the contraction principle to the continuous mapping m_p : M^p_1(R) → [0, ∞) (see Remark 2.13) in order to establish an LDP for (n^{−1} ‖X^{n,V}_R‖_p^p)_{n∈N} with GRF J_{V,p} : R → [0, ∞), given as the induced infimum. Further, because the relative entropy appears, in the infimum we only need to consider measures µ which are absolutely continuous with respect to µ_{V,α(0)} (and hence absolutely continuous with respect to the Lebesgue measure on R). By Remark 3.4, we obtain the asserted formula, where we note that the supremum is a maximum, since it is of the form given in Equation (39) in the maximum entropy principle (see Lemma 4.8). Another application of the contraction principle yields the claim, where we again used Lemma 3.1. This completes the proof.

Proof of Theorem B
We prove Theorem B by employing the Gibbs conditioning principle (see Proposition 2.8).This leads to the maximization of the entropy function h under certain moment constraints.
where the second inclusion holds since m_{V_2} is continuous. This shows properties (2) and (4) of Proposition 2.8. For property (3), we need to check that the conditioning events have positive probability. If we are able to show that there exists a measure ν ∈ F with I_{V_1}(ν) < ∞, we are done: such a ν is absolutely continuous with respect to the Lebesgue measure, and thus by Remark 3.4 we have the corresponding identity, where c ∈ R denotes some finite constant and h is the entropy function; for −h(µ_{V_1,α}), we obtain a finite value. For property (1), we observe that F is closed and convex. Indeed, for any convex combination of ν_1, ν_2 ∈ F, we used that m_{V_2}(ν_i) ∈ [0, R] for i = 1, 2. Moreover, we have shown that inf_{ν∈F} I_{V_1}(ν) < ∞, and since I_{V_1} is a strictly convex GRF, I_{V_1} attains its minimum over such a set F uniquely. In total, we obtain property (1) of Proposition 2.8 and that M_F is a singleton, M_F = {ν_*}. The equality uses Remark 3.4 and the observation that in the minimization only those ν ∈ M^{V_2}_1(R) are relevant which are absolutely continuous with respect to the Lebesgue measure and satisfy m_{V_1}(ν) ≤ 1 and m_{V_2}(ν) ∈ C. The Gibbs conditioning principle (more precisely, (8)) now gives us the limit in Equation (37). We now want to employ Proposition 2.11 to derive the limit in Equation (38) from the limit just established. We introduce the family of measures (µ^n_ε)_{n∈N, ε>0}, where, for n ∈ N and ε > 0, µ^n_ε is the conditional distribution defined above. Here, for n ∈ N, L_{n,V_1} is the empirical measure associated to X^{n,V_1}_1. We note that (µ^n_ε)_{n∈N, ε>0} is a family of symmetric probability measures, since, by the symmetry of the Orlicz function V_1, X^{n,V_1}_1 has a symmetric distribution. For n ∈ N and ε > 0, we consider random variables X^n_ε with distribution µ^n_ε and let L^n_ε be the associated empirical measure. The claimed identification then follows, where we used that the event {m_{V_2}(L_{n,V_1}) ∈ C_ε} can be expressed through the V_2-Orlicz radius being at most R + ε. Thus, by Proposition 2.11 and Remark 2.10, we obtain that the marginal distribution of the first k coordinates of µ^n_ε converges weakly to ν_*^{⊗k}. This completes the proof.
In the preceding Proposition 4.7 we have seen that the limiting distribution ν_* arises from a maximum entropy principle. In order to compute ν_*, we need the following classical result, which relies on Ex. 5.3 in [4] and Chapter 12 in [5]. We state it in the version given in [16]; there, the moment constraints are imposed for all j = 1, ..., n.
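The maximum entropy principle referenced here selects, under moment constraints, a density of Gibbs (exponential-family) form. As an illustrative sketch on a finite support (our own construction, not the statement of Lemma 4.8; names are ours), the following Python code computes the entropy maximizer subject to a single V-moment constraint by solving for the Gibbs parameter with bisection, and compares it against another feasible measure.

```python
import math

def maxent_gibbs(xs, V, c, lo=-20.0, hi=20.0, iters=200):
    """Entropy maximizer on the finite support xs subject to
    sum_i p_i V(xs_i) = c: it has the Gibbs form p_i ~ exp(beta * V(xs_i)),
    and beta is found by bisection (the moment is increasing in beta)."""
    def gibbs(beta):
        w = [math.exp(beta * V(x)) for x in xs]
        Z = sum(w)
        return [wi / Z for wi in w]
    def moment(beta):
        return sum(pi * V(x) for pi, x in zip(gibbs(beta), xs))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if moment(mid) < c:
            lo = mid
        else:
            hi = mid
    return gibbs(0.5 * (lo + hi))

def entropy(p):
    """Shannon entropy, with the convention 0 * log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)
```

Any other distribution meeting the same moment constraint, for instance a two-point measure, has strictly smaller entropy, which is exactly the mechanism that singles out ν_* in Proposition 4.7.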

Since (L_{n,V})_{n∈N} is contained in K^R_V, which is compact in (M^W_1(R), d_W) by Proposition 2.17, (L_{n,V})_{n∈N} is exponentially tight in M^W_1(R). By Corollary 2.4, (L_{n,V})_{n∈N} satisfies the LDP in (M^W_1(R), d_W) with the strictly convex GRF I_V. This completes the proof.

Corollary 3.5. Let V be an Orlicz function and p ∈ (1, ∞) such that V(x)