Tight minimax rates for manifold estimation under Hausdorff loss

Abstract: This paper deals with minimax rates of convergence for manifold estimation. A new lower bound is obtained through a novel construction of two sets of manifolds and an application of the convex hull testing method of Le Cam (1973). The minimax lower bound matches, up to a constant factor, the upper bound obtained by Genovese et al. (2012b).


Introduction
We observe an i.i.d. random sample $Y_1, \dots, Y_n \in \mathbb{R}^D$ from a distribution $Q$ that is supported on a $d$-dimensional manifold $M$ with $d < D$. The goal is to estimate the unknown manifold $M$ based on the sample $\{Y_i\}$.
Manifold learning is an active area of research in machine learning, applied mathematics, and statistics, but little optimality theory regarding rates of convergence has been developed. To the best of our knowledge, optimal convergence rates for estimating manifolds have only been considered by Genovese et al. (2012a,b) (henceforth GPVW) under a minimax criterion. They compare the convergence rates of their theoretical estimators to lower bounds; however, their upper and lower bounds do not match. To close this gap, this paper establishes the optimal rates of convergence via a novel lower bound argument.
GPVW considered three noise models for $Q$: noiseless, clutter, and additive. The noiseless model assumes that the sample is drawn from a distribution $G$ supported on $M$. The clutter model assumes that each observation is drawn from $G$ with probability $\pi$ and from $U$ with probability $1 - \pi$, where $U$ is the uniform distribution on a compact set $K \subset \mathbb{R}^D$ with nonempty interior. We only consider the noiseless model.
Following GPVW, we measure the loss by the Hausdorff distance $H(\cdot, \cdot)$, and we write $\Lambda_n(\mathcal{M}) := \inf_{\hat M} \sup_{M \in \mathcal{M}} \sup_{Q \in \mathcal{G}(M)} E_{Q^n} H(\hat M, M)$ for the corresponding minimax risk over a class $\mathcal{M}$ of manifolds. GPVW proved a lower bound of the form $c\, n^{-2/d}$, for some constant $c$ depending on $\mathcal{M}$, by testing between two smooth manifolds: the first looks like a squashed ball with a flat region, and the second coincides with the first except for a small bump on the flat area, chosen so that the two manifolds are not statistically distinguishable. The main contribution of our paper (Theorem 1 in Section 2) is a uniform lower bound for $\Lambda_n(\mathcal{M})$ of order $(\log n/n)^{2/d}$, thereby determining the exact minimax rate for one of the manifold estimation problems considered by GPVW.
We use a method first presented by Le Cam (1973), in the form used by Yu (1997). In our setting the method becomes: if $\mathcal{M}_0$ and $\mathcal{M}_1$ are subsets of $\mathcal{M}$ for which $\inf\{H(M_0, M_1) : M_0 \in \mathcal{M}_0, M_1 \in \mathcal{M}_1\} \ge 2\gamma$ for some positive constant $\gamma$, then
$$\Lambda_n(\mathcal{M}) \ge \gamma \sup\big\{ \|P_0 \wedge P_1\| : P_i \in \mathrm{co}(\mathcal{Q}_i^n),\ i = 0, 1 \big\}, \qquad (2)$$
where $\mathcal{Q}_i^n$ denotes the set of all $n$-fold product measures $Q_i^n$ with $Q_i \in \mathcal{Q}(M_i)$ for $i = 0, 1$, and $\mathrm{co}(\cdot)$ denotes the convex hull. The quantity $\|P_0 \wedge P_1\|$ is called the testing affinity; it equals one minus half the $L_1$ distance between $P_0$ and $P_1$.
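As a toy illustration of the testing affinity appearing in (2), the following sketch (ours, not from the paper; the two discrete distributions are made up for illustration) computes $\|P_0 \wedge P_1\| = 1 - \tfrac{1}{2}L_1(P_0, P_1)$ as a sum of pointwise minima.

```python
# Toy illustration (not from the paper): for discrete distributions, the
# testing affinity |P0 ^ P1| is the total mass of the pointwise minimum,
# which equals 1 minus half the L1 distance.

def affinity(p0, p1):
    """|P0 ^ P1| = sum of pointwise minima."""
    return sum(min(a, b) for a, b in zip(p0, p1))

def l1_distance(p0, p1):
    return sum(abs(a - b) for a, b in zip(p0, p1))

p0 = [0.5, 0.3, 0.2]  # hypothetical null distribution
p1 = [0.2, 0.3, 0.5]  # hypothetical alternative

aff = affinity(p0, p1)
assert abs(aff - (1 - l1_distance(p0, p1) / 2)) < 1e-12
print(aff)  # 0.7
```

An affinity near 1 means the two (mixtures of) distributions are nearly untestable, which via (2) forces a minimax risk of order $\gamma$.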
The general analog of inequality (2) is best known for the case where both $\mathcal{M}_0$ and $\mathcal{M}_1$ are singleton subsets, a situation sometimes referred to as "two-point testing". GPVW used that method to establish their lower bound (1). The situation where only one of the $\mathcal{M}_i$'s is a singleton set has been used by many authors (see Tsybakov 2009, Section 2.7.5, for example). The full power of inequality (2) has, to our knowledge, only been effectively applied in the paper by Cai and Low (2011). In the present paper we employ (2) by embedding an occupancy problem into the manifold problem. We bound the affinity between the two convex hulls (mixtures) using combinatorial arguments involving the hypergeometric distribution. In Section 2, we give some intuition behind our construction and the reasons why it does not suffice to have even one of the $\mathcal{M}_i$ a singleton set.
Proofs for the main results are given in Section 3. We collect proofs for auxiliary lemmas in Section 4.

Main theorems
In this section we present the lower bound results and the intuition behind them. The lower bounds, together with the upper bounds in Genovese et al. (2012b), yield the minimax rates of convergence.
First, we describe the setting in detail. Assume that $M$ is a compact $C^1$ Riemannian submanifold without boundary in $\mathbb{R}^D$, contained in some compact set $K \subset \mathbb{R}^D$ with nonempty interior. In addition, we need a regularity condition on the curvature of the manifold $M$. Define $\Delta(M)$ to be the largest $r$ such that each point in $M \oplus r$ (the set of points within distance $r$ of $M$) has a unique projection onto $M$. As proved by Niyogi et al. (2006, Section 6), $\Delta(M)$, usually called the condition number, controls the curvature of the manifold. We assume that the $d$-dimensional manifold $M$ satisfies $\Delta(M) \ge \kappa$, where $\kappa$ is a fixed positive constant. Let $\mathcal{M}(\kappa) := \{M : \Delta(M) \ge \kappa,\ M \subseteq K\}$, and let $\mathcal{G}(M)$ be a set of distributions $Q$ whose densities $q$ with respect to the uniform measure on $M$ satisfy
$$0 < b(\mathcal{M}(\kappa)) \le q(y) \le B(\mathcal{M}(\kappa)) < \infty \quad \text{for all } y \in M, \qquad (3)$$
where $b(\mathcal{M}(\kappa))$ and $B(\mathcal{M}(\kappa))$ may depend on $\mathcal{M}(\kappa)$ but not on the particular manifold $M$.
Here is the intuition behind our main result, Theorem 1. Consider a one-dimensional ($d = 1$) closed smooth curve $M$ in a two-dimensional space ($D = 2$), and observe $n$ points uniformly distributed on $M$. Without loss of generality, assume that the length of the curve is 1. The maximum gap among those points is of order $\log n/n$ with high probability. That means there exists at least one connected piece of the manifold, of length of order $\log n/n$, on which we have no observations. The locally quadratic approximation error of interpolating a smooth manifold then yields a possibly unavoidable estimation error of order $(\log n/n)^2$. This idea carries over to a general $d$-dimensional manifold in $\mathbb{R}^D$ with $d < D$ by dividing the manifold into on the order of $n/\log n$ disjoint pieces, each with diameter of order $(\log n/n)^{1/d}$.

Theorem 1. Let $Y_1, \dots, Y_n$ be an i.i.d. sample from a distribution $Q$, where $Q$ is supported on a manifold $M \in \mathcal{M}(\kappa)$ and $Q \in \mathcal{G}(M)$. Then there is a constant $c$, not depending on $n$ but depending on $d$ and on $b$ and $B$ defined in Equation (3), such that
$$\inf_{\hat M} \sup_{M \in \mathcal{M}(\kappa)} \sup_{Q \in \mathcal{G}(M)} E_{Q^n} H(\hat M, M) \ge c \left(\frac{\log n}{n}\right)^{2/d}.$$

The following two remarks explain why a naive "one versus a mixture" test cannot lead to the desired lower bound in Theorem 1, and give the intuition for testing two convex hulls. We write $a_n \asymp b_n$ if $a_n \le C_1 b_n$ and $b_n \le C_2 a_n$, where $C_1, C_2$ are constants not depending on $n$. Denote by $L_1(P, Q) := \int |p - q|\, d\mu$ the $L_1$ distance between $P$ and $Q$, where $\mu$ is a dominating measure and $p$ and $q$ are the densities of $P$ and $Q$, respectively.
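The $\log n/n$ order of the maximum gap in the intuition above can be checked numerically. The following Monte Carlo sketch (ours, for illustration only) estimates the mean maximum spacing of $n$ uniform points on a circle of circumference 1 and compares it with $\log n/n$; the ratio stays bounded as $n$ grows.

```python
import math
import random

def max_gap_on_circle(n, rng):
    """Maximum arc gap among n uniform points on a circle of circumference 1."""
    pts = sorted(rng.random() for _ in range(n))
    gaps = [b - a for a, b in zip(pts, pts[1:])]
    gaps.append(1 - pts[-1] + pts[0])  # wrap-around gap
    return max(gaps)

rng = random.Random(0)
for n in [100, 1000, 10000]:
    avg = sum(max_gap_on_circle(n, rng) for _ in range(200)) / 200
    ratio = avg / (math.log(n) / n)
    # ratio remains O(1) as n grows, consistent with a max gap of order log(n)/n
    print(n, round(avg, 5), round(ratio, 2))
```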
Remark 1 (One versus a mixture). In many cases, including sparse support recovery in high-dimensional estimation, testing one distribution (the null) versus a mixture (the alternative) gives tight bounds. For instance, consider estimating a multivariate normal mean vector $\theta \in \mathbb{R}^n$ with covariance matrix $I_n/n$. It can be shown that, over the parameter space satisfying $\sum_{i=1}^n 1\{\theta_i \ne 0\} = 1$, the magnitude of the nonzero $\theta_i$ needs to be at least of order $\sqrt{\log n/n}$ for consistent support recovery, by Le Cam's one-versus-a-mixture testing. However, the same reasoning does not work for manifold estimation if one simply tests one manifold versus a mixture of many manifolds with one bump each. Consider a base manifold $M_0$ (defined in (14)) as the null. For the alternatives, we can construct a set $\mathcal{M}$ of manifolds having one bump deviating from $M_0$ such that $H(M_0, M) \asymp (\log n/n)^{2/d}$ for all $M \in \mathcal{M}$. Define the uniform distributions $Q_0 := U(M_0)$ and $Q := U(M) \in \mathcal{Q}$ on these manifolds. We find that $L_1\big(Q_0^n, \frac{1}{|\mathcal{Q}|}\sum_{Q \in \mathcal{Q}} Q^n\big)$ converges to 2; the two distributions are very different from each other. This can be understood as follows. Given a sample from any manifold in $\mathcal{M}$, with high probability at least one observation lies on the bump, so we instantly know that the null hypothesis is wrong. This differs from the multivariate normal mean problem, where, based on a sample generated from the alternative, one cannot tell whether there is a nonzero $\theta_i$, nor where it is located. See Remark 3 for the exact calculation of the $L_1$ distance.
Remark 2 (A mixture versus a mixture). From Remark 1, we see that the problem of estimating a set of manifolds with at most one bump each is not hard enough to establish the desired lower bound. Instead, we construct a large set of manifolds with $2m = n/(t\log n)$ inward and outward bumps, for some $t \in (0, 1/2)$. Two subsets of manifolds, $\mathcal{M}_0 = \{M_{0j}\}$ and $\mathcal{M}_1 = \{M_{1j'}\}$, are selected. Manifolds in $\mathcal{M}_0$ have $m = n/(2t\log n)$ outward bumps, while manifolds in $\mathcal{M}_1$ have either $m + 1$ or $m - 1$ outward bumps on $M_0$, such that $H(M_{0j}, M_{1j'}) \asymp (\log n/n)^{2/d}$ for all $M_{0j} \in \mathcal{M}_0$ and $M_{1j'} \in \mathcal{M}_1$. Consider the uniform distributions on these manifolds, that is, $Q_{0j} := U(M_{0j})$ and similarly $Q_{1j'} := U(M_{1j'})$. In Section 3 it will be shown that the $L_1$ distance between the two uniform mixtures, $L_1\big(\frac{1}{N_0}\sum_{j} Q_{0j}^n, \frac{1}{N_1}\sum_{j'} Q_{1j'}^n\big)$, converges to 0, which implies that we cannot distinguish the two convex hulls from the observations. The intuition behind the $L_1$ distance calculation is that, for observations generated from either class, several bumps will contain no observation at all, so it is essentially impossible to tell one convex hull from the other. See Section 3 for a rigorous justification.
Combining the upper bound in Theorem 3 in Genovese et al. (2012b) and the improved lower bound in Theorem 1, we have the following corollary.
Theorem 1 can be easily extended to the so-called clutter model, in which $Q = \pi G + (1 - \pi) U$, where $G$ is supported on $M$, $U$ is a uniform distribution on $K$, and $\pi \in (0, 1)$. Construct the same sets of manifolds used in the proof of Theorem 1. The Hausdorff distance between any pair of manifolds from the two sets is at least $(\log n/n)^{2/d}$ (up to a constant), as in the proof of Theorem 1, so it suffices to show that the testing affinity in (2) is bounded away from zero in the clutter model. See the proof of Theorem 3 for the calculation.
Theorem 3. In the clutter model, there is a constant $c$, not depending on $n$ but depending on $d$, $\pi$, and on $b$ and $B$ defined in Equation (3), such that the minimax risk is bounded below by $c(\log n/n)^{2/d}$. Combining this with the upper bound in Theorem 5 of Genovese et al. (2012b), we obtain the optimal rate of convergence.

Proof of main results
In this section we derive the lower bounds in Theorems 1 and 3. The key technique is Le Cam's method for testing two convex hulls. Consider a set of distributions $\mathcal{Q}$ supported on a manifold $M \in \mathcal{M}$, and let $\hat M$ be an estimator of $M$. Le Cam (1973) establishes a minimax lower bound as follows; see also Yu (1997).
The proof of Theorem 1 consists of the following four steps: (i) construction of two finite sub-parameter spaces $\mathcal{M}_0$ and $\mathcal{M}_1$ that are separated by an order of $(\log n/n)^{2/d}$; (ii) simplification of the $L_1$ representation; (iii) reduction of the $L_1$ distance to a combinatorial counting problem; and finally (iv) bounding the $L_1$ distance by studying the combinatorics through tail probability bounds for the traditional occupancy problem and the hypergeometric distribution.

Construction of finite sub-parameter spaces
The construction extends the manifold with one bump in Genovese et al. (2012a) to the case of multiple bumps. In particular, $\mathcal{M}_0$ corresponds to the set of manifolds with $m$ outward bumps among $2m$ possible perturbation regions, while $\mathcal{M}_1$ corresponds to the set of manifolds with $m + 1$ or $m - 1$ outward bumps. These bumps are disjoint and congruent, and the volume of each bump is of order $\log n/n$. Lemma 6 shows that suitable constructions of the sets $\mathcal{M}_0$ and $\mathcal{M}_1$ exist for the application of Le Cam's method.

A simplified L 1 representation
In order to consider the $L_1$ distance between distributions on these manifolds, we introduce some more notation. The part of the manifolds without bumps is denoted by $m$. For later use, we also write $m_l := m_l^+ \cup m_l^-$, the union of the outward and inward versions of the $l$-th perturbation region. We let $\mu$ be a dominating uniform measure on $\cup_{j=1}^{N_0} M_{0j} \cup \cup_{j'=1}^{N_1} M_{1j'}$. We let $Q_{0j}$ be the uniform probability measure on $M_{0j}$ (i.e., $Q_{0j} := U(M_{0j})$) with density $q_{0j}$ with respect to $\mu$, and similarly define $Q_{1j'}$ and $q_{1j'}$ on $M_{1j'}$.

From L 1 to combinatorics
Before starting the detailed calculations of (5), let us emphasize that the density values $\prod_{i=1}^n f_{0j}(x_i)$ are determined (as the nonzero fixed value $1/(C - C_0)^n$) only by the specified perturbations in $m_{0j}$. In order to use combinatorial ideas, we divide the whole integration region into sets $S_u$, $u = 1, \dots, 2m$, according to the number $u$ of distinct perturbation regions occupied by the sample points. Now we evaluate the integral (5). First note that the sets $S_u$ are disjoint, so the integral splits into a sum over $u$. For notational convenience, we define a representative disjoint region in each $S_u$; for example, $s_1 = m_1^n$, ..., and $s_{2m} = m_1 \times m_2 \times \cdots \times m_{2m-1} \times m_{2m}^{n-2m+1}$. By Lemma 8, a result for the traditional occupancy problem, the total number $\Upsilon_u$ of disjoint regions in each $S_u$ satisfies (6).

Lemma 8. Consider the distribution of $n$ balls in $2m$ bins, where each ball independently has equal probability $(2m)^{-1}$ of being placed in each bin. Let $\Upsilon_u$ be the total number of arrangements with exactly $u$ occupied bins (i.e., $2m - u$ empty bins). Then (6) holds:
$$\Upsilon_u = \binom{2m}{u} \sum_{i=0}^{u} (-1)^i \binom{u}{i} (u - i)^n. \qquad (6)$$
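For small instances, the occupancy count in Lemma 8 can be verified by brute force. The sketch below (ours, illustrative) evaluates the inclusion–exclusion formula $\binom{2m}{u}\sum_{i=0}^{u}(-1)^i\binom{u}{i}(u-i)^n$, counting placements with exactly $u$ occupied bins, and checks it against direct enumeration of all assignments for a tiny case.

```python
from itertools import product
from math import comb

def upsilon(n, bins, u):
    """Number of ways to place n labeled balls into `bins` labeled bins with
    exactly u nonempty bins: C(bins, u) times the number of surjections n -> u,
    computed by inclusion-exclusion."""
    surj = sum((-1) ** i * comb(u, i) * (u - i) ** n for i in range(u + 1))
    return comb(bins, u) * surj

n, bins = 5, 4  # tiny instance, small enough to enumerate exhaustively
brute = [0] * (bins + 1)
for assign in product(range(bins), repeat=n):
    brute[len(set(assign))] += 1

for u in range(bins + 1):
    assert upsilon(n, bins, u) == brute[u]
print([upsilon(n, bins, u) for u in range(1, bins + 1)])  # [4, 180, 600, 240]
```

Note that the counts sum to $(2m)^n$, so $\Upsilon_u/(2m)^n$ is exactly the occupancy probability used in the final step of the proof.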
Using Equation (6), we simplify the right-hand side of (5) in Lemma 9.

Bounding the L 1 distance
In this final step, we shall prove that the simplified $L_1$ distance converges to 0. We start with bounds for the quantity $\Upsilon_u/(2m)^n$, whose limiting behavior is described by the traditional occupancy problem. Let $\Psi_n$ be the random variable counting the number of nonempty bins when we throw $n$ balls into $2m$ bins; then $\Upsilon_u/(2m)^n$ is the probability that $\Psi_n = u$. Define $\alpha_n := 2m \exp(-n/(2m)) = n^{1-t}/(t \log n) \to \infty$ (recalling that $2m = n/(t\log n)$ with $0 < t < 1/2$); $\alpha_n$ is the asymptotic expected number of empty bins. By applying Theorem 2 of Kamath et al. (1994), for any $\theta > 0$ and all large $n$ satisfying $n^{1-2t}/\log n \ge (t/\theta^2)\log(2n^2)$, the number of nonempty bins concentrates on the range $\Psi_n \in [2m - \alpha_n - \theta\alpha_n, 2m - \alpha_n + \theta\alpha_n] =: [u_l, u_u]$ with high probability. This implies that we only need to carry out the calculation for $u$ in the range $[u_l, u_u]$, since each remaining summand is bounded by 2 and carries negligible total probability. Observing this with (7), we now focus on the first term in (8). Since $\alpha_n = o(m)$, for $u \in [u_l, u_u]$ the index regions simplify, e.g.
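The concentration of the number of empty bins around $\alpha_n = 2m\exp(-n/(2m))$ can be observed numerically. The sketch below (ours, with an arbitrary choice of $t$) simulates throwing $n$ balls into $2m = n/(t\log n)$ bins and compares the average number of empty bins with $\alpha_n$.

```python
import math
import random

rng = random.Random(1)
n, t = 200_000, 0.3
bins = max(1, round(n / (t * math.log(n))))  # 2m = n/(t log n)
alpha = bins * math.exp(-n / bins)           # asymptotic expected # empty bins

empty_counts = []
for _ in range(20):
    occupied = set(rng.randrange(bins) for _ in range(n))
    empty_counts.append(bins - len(occupied))

avg_empty = sum(empty_counts) / len(empty_counts)
# avg_empty closely tracks alpha = n^(1-t)/(t log n)
print(bins, round(alpha, 1), round(avg_empty, 1))
```

In this regime $\alpha_n \to \infty$ while $\alpha_n = o(m)$, so a growing (but vanishing-fraction) collection of bins stays empty, which is exactly what makes the two mixtures hard to distinguish.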
Hence we need to treat four cases outside of the common index region $I_1 \cap I_2 \cap I_3$. Substituting these into the calculations, and restricting to the range $u - m + 1 \le l \le m - 1$ (where $u_l \le u \le u_u$), we can further simplify the summand in terms of
$$p_{2m,u,m}(l) := \frac{\binom{u}{l}\binom{2m-u}{m-l}}{\binom{2m}{m}},$$
the hypergeometric probability with parameters $(2m, u, m)$. Let $Z$ be the random variable with $P(Z = l) = p_{2m,u,m}(l)$. From the properties of the hypergeometric distribution, we know $EZ = u/2$.

Lemma 10. Let $Z$ be the hypergeometric random variable with parameters $(2m, u, k)$, that is, $P(Z = l) = \binom{u}{l}\binom{2m-u}{k-l}/\binom{2m}{k}$. Then, for $\eta \ge 0$,
$$P\big(|Z - ku/(2m)| \ge \eta k\big) \le 2\exp(-2\eta^2 k).$$

A. K. H. Kim and H. Zhou
Proof. For the one-sided inequality, see Hoeffding (1963) or Chvátal (1979); the other side follows by symmetry.
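Both the mean identity $EZ = u/2$ (when $k = m$) and the Hoeffding-type tail bound can be checked exactly for small parameters. Below is our illustrative sketch, using one common hypergeometric parametrization (population $2m$, $u$ marked items, draw size $k$); the parameter values are arbitrary.

```python
from math import comb, exp

def hyper_pmf(two_m, u, k, l):
    """P(Z = l) for Z hypergeometric: population 2m, u marked, draw of size k."""
    return comb(u, l) * comb(two_m - u, k - l) / comb(two_m, k)

two_m, u = 40, 30
k = two_m // 2  # k = m, the case used in the proof
supp = range(max(0, k - (two_m - u)), min(u, k) + 1)

mean = sum(l * hyper_pmf(two_m, u, k, l) for l in supp)
assert abs(mean - u / 2) < 1e-9  # EZ = k*u/(2m) = u/2 when k = m

# One-sided Hoeffding-type tail: P(Z >= k*(u/(2m) + eta)) <= exp(-2 eta^2 k)
eta = 0.15
tail = sum(hyper_pmf(two_m, u, k, l) for l in supp
           if l >= k * (u / two_m + eta))
assert tail <= exp(-2 * eta * eta * k)
print(round(tail, 4), round(exp(-2 * eta * eta * k), 4))
```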
By the tail probability in Lemma 10, replacing $EZ$ by its value $u/2$ and $k$ by $m$, we note that for $0 \le \eta \le u/(2m)$ the tails of $Z$ beyond $u/2 \pm \eta m$ carry probability at most $2\exp(-2\eta^2 m)$. Here we take $\eta := \alpha_n/(m\log n) \to 0$, so that this tail bound vanishes for $0 < t < 1/2$. Based on these small tail probabilities, we split the summation in (10) into a central region and a tail region; the tail region is controlled by the tail probability (11). For the central region, we bound the absolute term by its value at the largest index $l = u/2 + \alpha_n/\log n$ (the absolute term takes its smallest value at $u/2$), where $u \in [2m - \alpha_n - \theta\alpha_n, 2m - \alpha_n + \theta\alpha_n]$. Combining the two regions with (8) and (9) proves the claim. We now combine all of the above ideas into the proof.
Proof of Theorem 3. We construct the same sets of manifolds $\mathcal{M}_0$ and $\mathcal{M}_1$ as in the noiseless model. By construction, $H(M_{0j}, M_{1j'}) \ge 2\gamma$. Again we make the claim with the choice $\gamma^{-d/2} \asymp m \asymp n/\log n$, which gives the lower bound $(\log n/n)^{2/d}$ by Lemma 5.
Here we let the dominating measure be $\mu := U(K) + U\big(m \cup (\cup_{l=1}^{2m} m_l)\big)$. By symmetry and the mutual singularity of $U(K)$ with $U(M_{0j})$ and $U(M_{1j'})$, the $L_1$ distance again reduces to a mixture calculation. Now, using exactly the same idea as in Lemma 11, the affinity is bounded away from zero, which implies the claim by Theorem 1 together with $\sum_{k=1}^n \beta_{n,k,\pi} = 1 - (1-\pi)^n \le 1$.

Remark 3 (Continuation of Remark 1). As explained in Remark 1, we use the base manifold $M_0$ defined in (14) as the null. We then construct alternatives having one inward bump each: for $j = 1, \dots, m$, where $m \asymp n/\log n$ and $b_j(u)$ is defined exactly as before in Step 1. By construction, the $L_1$ distance between $Q_0^n$ and the uniform mixture of the $Q_j^n$ converges to 2, which proves that $Q_0$ and mixtures of the $Q_j$ are distinguishable.
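The binomial identity $\sum_{k=1}^n \beta_{n,k,\pi} = 1 - (1-\pi)^n$ used in the proof of Theorem 3 is just the complement of the binomial mass at $k = 0$. A quick check (ours), with $\beta_{n,k,p} := \binom{n}{k}p^k(1-p)^{n-k}$, the form consistent with this identity:

```python
from math import comb

def beta(n, k, p):
    """Binomial mass beta_{n,k,p} = C(n,k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, pi = 12, 0.3  # arbitrary small example
total = sum(beta(n, k, pi) for k in range(1, n + 1))
assert abs(total - (1 - (1 - pi) ** n)) < 1e-12
print(round(total, 6))  # 0.986159
```

In the clutter model this quantity is the probability that at least one of the $n$ observations comes from the manifold component $G$ rather than the uniform clutter.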
The radius $1 + \kappa + \gamma$ of this base manifold is larger than the radius $1 + \kappa$ that appeared in Genovese et al. (2012a). The larger radius is chosen so that a manifold $M$ with bumps (to be constructed on $M_0$) satisfies the curvature condition $\Delta(M) \ge \kappa$.
For the second group, $\mathcal{M}_1$ consists of manifolds $M_{1j'}$, $j' = 1, \dots, N_1$, similar to the $M_{0j}$ but with $m + 1$ or $m - 1$ outward bumps, which in turn means $m - 1$ or $m + 1$ inward bumps; here $b_l(u)$ is defined exactly as before. Genovese et al. (2012b) constructed one bump on a similar kind of base manifold $M_0$, for which the uniform measure of the bump is about $\gamma^{d/2}$. The shape of the bump is a union of two portions of spheres; the bump is centered at $(0, \dots, 0) \in \mathbb{R}^d$, supported on $\|u\| \le \sqrt{4\gamma\kappa - \gamma^2}$, and its Hausdorff distance from $M_0$ is $\gamma$. Here we consider slightly modified bumps, located not just in one region but in as many disjoint regions as possible on $M_0$. In other words, we seek the maximal number of disjoint bumps that still guarantees Hausdorff distance $\gamma$ from $M_0$. For the case $d = 1$, we can construct these bumps on disjoint intervals of length $\sqrt{4\gamma\kappa + 3\gamma^2}$ (which is upper bounded by $\sqrt{5\gamma\kappa}$ since $\gamma \le \kappa/3$). For general $d$-dimensional manifolds, using grid points separated by $\sqrt{\gamma}$ in each dimension, there exist on the order of $\gamma^{-d/2}$ disjoint bumps in the region $\{\|u\| \le 1\}$. Thus, we let $2m \asymp \gamma^{-d/2}$.
Then we need to check that these manifolds satisfy the conditions of the model. Note that each outward bump is just a magnified version of the bump used by Genovese et al. (2012b); indeed, these bumps are constructed from parts of spheres of radius $\kappa + \gamma$ located on different regions of $M_0$. Each inward bump is the reflected version of the outward bump. Thus the constructed manifolds have no boundary, and $\Delta(M_{0j}) \ge \kappa$ and $\Delta(M_{1j'}) \ge \kappa$ for all $j$ and $j'$. We also check that $H(M_{0j}, M_{1j'}) \ge 2\gamma$ for all $j = 1, \dots, N_0$ and $j' = 1, \dots, N_1$, because there is always at least one perturbation region on which $M_{0j}$ and $M_{1j'}$ differ.

Proof of Lemma 7
To conveniently express the $L_1$ distance between mixtures, we use the notation $x := (x_1, \dots, x_n)$ and $x_{-i} := (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$. We first expand the product term, and then, by symmetry, together with the disjointness of $m$ from $m_{0j}$ and of $m$ from $m_{1j'}$, the $L_1$ distance is equal to the resulting expression, where the second equality follows from $\mu(m) = C_0$ and $\mu(m_{0j}) = \mu(m_{1j'}) = C - C_0$, with the binomial notation $\beta_{n,k,p} := \binom{n}{k} p^k (1 - p)^{n-k}$, and the last equality is obtained by definition.

Proof of Lemma 9
First, we consider the simplest case $u = 1$, with integration region $m_1^n$. Then only the single perturbation region $m_1$ of the constructed manifold affects the density. Suppose $m_1$ takes the outward perturbation $m_1^+$. Then the integral becomes a comparison between counting how many $m_{0j}$ can contain $m_1^+$ and counting how many $m_{1j'}$ can contain $m_1^+$. If $m_1$ takes the inward perturbation $m_1^-$, the same question arises with $m_1^+$ replaced by $m_1^-$. Again, we can ask the same question for every other region $m_l$, $l = 2, \dots, 2m$. By symmetry, defining the representative region in $S_1$ as $s_1 := m_1^n$, the contribution of $S_1$ is $\Upsilon_1$ times that of $s_1$, where $\Upsilon_1$ is the total number of distinct regions in $S_1$, as defined in (6). We extend the same idea to $S_2$, where exactly two perturbation regions of the constructed manifold affect the joint density. First consider the regions $m_1$ and $m_2$. Regardless of whether the integration region is $m_1^l \times m_2^{n-l}$ or $m_1^{n-l} \times m_2^l$, where $l$ ($1 \le l < n$) is an arbitrary integer, the density $\prod_{i=1}^n \big(1\{x_i \in m_{0j}\}/\mu(m_{0j})\big)$ is nonzero as long as $m_{0j}$ contains the perturbations defined on $m_1$ and $m_2$. Suppose $m_1$ takes $m_1^+$ and $m_2$ takes $m_2^-$. Then the integral becomes a comparison between counting how many $m_{0j}$ can contain $\{m_1^+, m_2^-\}$ and how many $m_{1j'}$ can contain $\{m_1^+, m_2^-\}$, for any $l = 1, \dots, n - 1$. Again, it makes no difference if we change the location of the perturbations, as long as the number $u$ of distinct regions is 2.
In general, since the situation is more complicated, we first consider only the regions without specifying the perturbation shapes. With the same intuition as before, the density takes the same value on any of the disjoint regions in $S_u$. Indeed, we only need to count how many $m_{0j}$ and $m_{1j'}$ contain each specific perturbation pattern (a combination of outward and inward bumps on the regions $m_1, \dots, m_u$; we explain how to calculate these counts in detail later). As before, by symmetry, it makes no difference whether we consider the regions $m_1, \dots, m_u$ or $m_2, \dots, m_{u+1}$.
For notational convenience, we define the representative disjoint region in $S_u$ as $s_u := m_1 \times m_2 \times \cdots \times m_{u-1} \times m_u^{n-u+1}$. Then, by the symmetry just explained, we obtain (16). Now we evaluate the above integrals. First, we consider $\int_{s_1} |\bar f_0(x) - \bar f_1(x)|\, d\mu^n$. As explained before, on $(m_1^-)^n$ each $\prod_{i=1}^n f_{0j}(x_i)$ is either zero or $1/\mu(m_{0j})^n = 1/(2mc\gamma^{d/2})^n$, and $\int_{(m_1^-)^n} d\mu^n = (c\gamma^{d/2})^n$. Thus we need to count how many of the $m_{0j}$ contain $m_1^-$ (and $m_1^+$) for the first group, and how many of the $m_{1j'}$ contain $m_1^-$ (and $m_1^+$) for the second group. Now we consider the case $u \ge m$, where we must be more careful in deciding the possible range of $l$. In the extreme case $u = 2m$, all $2m$ regions are occupied, so for the first group the $m$ outward (+) perturbations are fixed among the $2m$ regions, which restricts the range of $l$ to $l = m$. After choosing these locations there is no freedom left, since the occupied regions determine the exact form of the manifold. Moreover, in this case no manifold of the second group matches a configuration with $m$ +'s (since manifolds in the second group cannot have $m$ +'s and $m$ −'s). Similarly, a configuration with $m + 1$ or $m - 1$ +'s is matched only by the second group, again with no freedom. This yields