Data-based decision rules about the convexity of the support of a distribution

Given n independent, identically distributed random vectors in R^d, drawn from a common density f, one wishes to find out whether the support of f is convex or not. In this paper we describe a decision rule that decides correctly for sufficiently large n, with probability 1, whenever f is bounded away from zero on its compact support. We also show that the assumption of boundedness is necessary. The rule is based on a statistic that is a second-order U-statistic with a random kernel. Moreover, we suggest a way of approximating the distribution of the statistic under the hypothesis of convexity of the support. The performance of the proposed method is illustrated on simulated data sets. As an example of its potential statistical implications, the decision rule is used to automatically choose the tuning parameter of ISOMAP, a nonlinear dimensionality reduction method.


Introduction
Let X be a random vector with distribution µ on R^d having density f. The support of µ is defined as

    S = ⋂_{C ⊂ R^d closed : µ(C) = 1} C.    (1)

(We also call S the support of the density f.) Let X_1, . . . , X_n be independent random vectors drawn from the distribution µ. In this paper we investigate the problem of testing whether the support S is a convex set or not. In other words, we consider the hypothesis testing problem in which the null and alternative hypotheses are

    H_0: S is a convex set,    H_1: S is not a convex set.
We are interested in finding tests (or, perhaps more adequately, decision rules) that decide correctly when the sample size is large. Formally, a decision rule is a sequence of functions T_n : (R^d)^n → {0, 1}. T_n(X_1, . . . , X_n) = 1 is interpreted as a guess that f has a convex support, while if T_n(X_1, . . . , X_n) = 0, the decision rule suggests that the support is non-convex. A decision rule is consistent for a density f if it is correct eventually, almost surely, that is, if

    P{T_n(X_1, . . . , X_n) ≠ 𝟙{f has convex support} for finitely many n} = 1.
The main objective of this paper is to investigate the possibility of constructing consistent decision rules for the convexity of the support. We show that consistent decision rules (i.e., rules that decide correctly eventually almost surely) exist whenever f is bounded away from zero on its support and some other mild regularity conditions are satisfied. The rule, proposed in Section 2, is based on a statistic which is the average, over all pairs of points (X i , X j ), of the distance of the closest data point to the mid-point (X i + X j )/2. We show that under the null hypothesis this average value converges to zero, in probability, while under the alternative, it stays bounded away from zero. This makes it possible to define a consistent decision rule. The difficulty of the analysis is that the proposed statistic is not a U -statistic since every summand depends not only on X i and X j but on all other data.
In Section 3 it is shown that it is impossible (in a well-defined sense described below) to design a decision rule that behaves asymptotically correctly for all bounded densities of bounded support. This shows that an assumption like the density being bounded away from zero on its support is necessary for consistent decision rules.
In Section 4, using the terminology of hypothesis testing, we describe some heuristics to approximate the distribution of the proposed statistic under the hypothesis of convexity of the support. Such approximations are essential in practice, where the threshold for accepting or rejecting the null hypothesis needs to be adjusted for a given problem at a fixed sample size. We present numerical examples for illustration. Finally, Section 5 illustrates, with a numerical example, how the decision rule can be successfully applied to the automatic choice of the tuning parameter of ISOMAP.

A decision rule for the convexity of the support of a distribution
Let X 1 , . . . , X n be i.i.d. vectors drawn from the probability distribution µ on R d . We assume that µ is absolutely continuous with respect to the Lebesgue measure, with density f . Suppose that f has a support S ⊂ R d and that there exists a constant c > 0 such that for every x ∈ S, f (x) ≥ c. In this section we propose a test for the convexity of S. The main result of the section is that the decision rule is consistent, that is, regardless of whether S is convex or not, the rule decides correctly for sufficiently large sample sizes. For this we also need some mild regularity conditions detailed below.
The basic idea of the proposed test is the fact that a closed set S ⊆ R^d is convex if and only if for all x, y ∈ S, the mid-point (x + y)/2 is also in S. Thus, if the support S of f is convex, it is reasonable to expect that for each pair of observations (X_i, X_j), there is some other data point X_h close to the mid-point (X_i + X_j)/2. On the other hand, if the support is not convex, then we expect to have a large number of pairs (X_i, X_j) such that the closest point to (X_i + X_j)/2 is far away. Based on this intuition, we introduce the statistic

    U_n = (n(n − 1)/2)^{-1} Σ_{1 ≤ i < j ≤ n} γ(X_i, X_j, X_{h^{(1)}(i,j)}),

where γ(x, y, z) = ‖(x + y)/2 − z‖^d and h^{(1)}(i,j) = argmin_{h ≠ i,j} ‖(X_i + X_j)/2 − X_h‖ is the index of the data point closest to the mid-point of X_i and X_j. U_n resembles a U-statistic (see, e.g., Serfling (1980), Chapter 6) as it is a sum, over all pairs of points, of a function depending on the pair. However, U_n is not a U-statistic because the kernel γ(X_i, X_j, X_{h^{(1)}(i,j)}) depends not only on (X_i, X_j) but also on the rest of the points X_h, which makes its analysis more complex. U-statistics with a random kernel were investigated by Schick (1997), but these results are not applicable to U_n, as Schick deals with random kernels k̂_n(X_i, X_j; X_1, . . . , X_n) that converge (as n → ∞) in some sense to a nonrandom kernel k_n(X_i, X_j) for which the standard results on U-statistics apply. This is not the case for the kernel γ defining U_n.
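To make the construction concrete, here is a minimal computational sketch of U_n, assuming the forms of γ and h^{(1)} reconstructed in the display above; the brute-force loop over pairs is meant for illustration, not efficiency.

    import numpy as np

    def u_statistic(X):
        """U_n: average over all pairs (i, j) of the d-th power of the
        distance from the mid-point (X_i + X_j)/2 to the nearest data
        point X_h, h != i, j."""
        n, d = X.shape
        total = 0.0
        for i in range(n - 1):
            for j in range(i + 1, n):
                mid = (X[i] + X[j]) / 2.0
                sq = np.sum((X - mid) ** 2, axis=1)
                sq[i] = sq[j] = np.inf               # exclude the pair itself
                total += np.min(sq) ** (d / 2.0)     # ||.||^d = (||.||^2)^(d/2)
        return total / (n * (n - 1) / 2.0)

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 2))    # uniform on the unit square (convex support)
    print(u_statistic(X))             # small value, consistent with convexity

A decision rule in the sense of Theorem 1 below is then obtained by comparing u_statistic(X) with a threshold such as τ_n = (log n)²/√n, one admissible choice satisfying τ_n → 0 and τ_n √n/log n → ∞.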
In Propositions 1 and 2 below we show that, under a certain regularity condition, if the support is not convex, U_n stays bounded away from zero almost surely, while for convex S, its expectation converges to zero at a rate (log n/n)^{1/d} and it is concentrated around its mean. Thus, it makes sense to define the following rule: accept H_0 if and only if U_n ≤ τ_n, where τ_n → 0 but more slowly than (log n)/√n. Indeed, this test is guaranteed to make the correct decision eventually, almost surely, whenever f is bounded from above and from below on its support. The regularity condition we require is the following.
Assumption 1. Assume that the topological boundary ∂S of S has zero Lebesgue measure.
Since the density f is supposed to be bounded away from zero on its support, the assumption is equivalent to saying that f is such that for almost every x ∈ S there exists ε > 0 such that ess inf_{y: ‖y−x‖ < ε} f(y) > 0; see Lemma 4 in the appendix for the proof of this simple fact. Note that Assumption 1 is equivalent to S being Jordan measurable. If S is convex, the assumption is always satisfied; see Lang (1986).
The regularity assumption, together with the assumption that f is bounded away from zero on its support, excludes some pathological cases in which the statistical problem of deciding whether the support is convex is not only difficult, but also of questionable meaning. For example, Assumption 1 excludes cases such as a uniform density on a Cantor set of positive measure. As another illustration, consider the following example of a density over the real line. Let r_1, r_2, . . . be an enumeration of all rational numbers. Then the set A = ∪_{n≥1}(r_n − 2^{−(n+2)}, r_n + 2^{−(n+2)}) has Lebesgue measure at most 1/2 and we may define µ as the uniform distribution over A. Then the support S of µ is R (in particular, S is convex), yet the density vanishes everywhere except on a set of measure 1/2. Our regularity assumptions exclude such pathological cases.
The next performance guarantee is the main result of this section.

Theorem 1. Suppose that the support of f satisfies Assumption 1 and that there exist constants 0 < c < C such that c ≤ f(x) ≤ C for all x ∈ S. Consider the test which accepts H_0 if and only if U_n ≤ τ_n, and suppose that τ_n is chosen such that

    lim_{n→∞} τ_n = 0    and    lim_{n→∞} τ_n n^{1/2}/log n = ∞.
Then regardless of whether S is convex or not, with probability one, there exists an index N such that for all n > N the test always decides correctly.
Remark 1. Of course, the density f is not uniquely defined, as its value can be changed on a set of zero Lebesgue measure. The boundedness condition for f in the theorem should be interpreted as meaning that f has a version satisfying this condition. More precisely, we assume that ess sup_{x∈S} f(x) ≤ C and ess inf_{x∈S} f(x) ≥ c. This comment applies throughout the whole paper.
The theorem is an immediate consequence of Propositions 1 and 2 below. As shown in Section 3, the condition of f being bounded away from zero cannot be dropped. However, we conjecture that the condition that f is bounded from above is not necessary.
Note that our choice of the function γ is far from being the only possibility that gives rise to a consistent decision rule. In particular, the d-th power of the norm may be replaced by any other positive power. However, the proposed choice has some advantages that we exploit in Section 4 in defining a bootstrap approximation of the distribution of U n under the null hypothesis.
First we establish the asymptotic behavior of U_n under both the null and alternative hypotheses. We treat the simpler case, when S is not convex, first.

Proposition 1 (Asymptotic properties of U_n under H_1). Suppose that Assumption 1 is satisfied and that f is bounded away from zero on its support S. If S is not convex, then lim inf_{n→∞} U_n > 0 almost surely.
Proof. For z ∈ R d and r > 0, denote by N (z, r) the open ball of radius r centered at z.
Suppose that S is not convex. Then there exist x, y ∈ S such that (x + y)/2 ∉ S. (The fact that we may take the mid-point of x and y follows from the closedness of S.) Since S is closed, (x + y)/2 has a neighborhood entirely outside of S. Also, by Assumption 1, S equals the closure of the set A = {x : ∃δ > 0 : ess inf_{y∈N(x,δ)} f(y) > 0} (see Lemma 5 in the appendix). This implies that there exist x′, y′ ∈ A and ε > 0 such that N(x′, ε) ∪ N(y′, ε) ⊂ A and N((x′ + y′)/2, 2ε) ∩ S = ∅. (Indeed, if x ∈ A then we may take x′ = x; otherwise any x′ ∈ A sufficiently close to x will do. y′ is chosen similarly.) By assumption, there exists a constant c > 0 such that for every x ∈ A, f(x) ≥ c. By the law of large numbers, with probability one, there exists an index N such that for all n > N,

    (1/n) Σ_{i=1}^n 𝟙{X_i ∈ N(x′, ε)} ≥ (1/2)c v_d ε^d    and    (1/n) Σ_{i=1}^n 𝟙{X_i ∈ N(y′, ε)} ≥ (1/2)c v_d ε^d,

where v_d is the volume of the d-dimensional unit Euclidean ball. On the other hand, clearly, if X_i ∈ N(x′, ε) and X_j ∈ N(y′, ε), then (X_i + X_j)/2 ∈ N((x′ + y′)/2, ε), so every data point is at distance at least ε from (X_i + X_j)/2 and therefore γ(X_i, X_j, X_{h^{(1)}(i,j)}) ≥ ε^d. Since, for n > N, the fraction of such pairs among all n(n − 1)/2 pairs is bounded away from zero, lim inf_{n→∞} U_n > 0 almost surely, as claimed.

The next result shows that under the null hypothesis, the expected value of U_n goes to zero at a rate (log n/n)^{1/d} and it is very unlikely to exceed its expectation by more than log n/√n. This result, combined with the Borel–Cantelli lemma, implies that for any sequence τ_n such that τ_n n^{1/d}/log n → ∞, with probability one, U_n < τ_n for all sufficiently large n.
Proposition 2 (Asymptotic properties of U_n under H_0). Suppose that Assumption 1 is satisfied and that there exist constants 0 < c < C such that c ≤ f(x) ≤ C for all x ∈ S. If S is convex, then there exists a constant K depending on c, C, and S such that, for all n,

    E U_n ≤ K (log n/n)^{1/d},

and such that, for all q ≥ 2, the centered moments (E[(U_n − E U_n)_+^q])^{1/q} satisfy the bound derived at the end of the proof, where (·)_+ denotes the positive part; this moment bound implies the concentration at rate log n/√n described above.
Proof. Note first that convexity of S implies that Assumption 1 holds and therefore the open set A = {x : ∃δ > 0 : essinf y∈N (x,δ) f (y) > 0} ⊂ S is also convex.
(To see this, consider x, y ∈ A and λ ∈ (0, 1). Since A is open and S is convex, λx + (1 − λ)y has a neighborhood entirely included in S. Since f is at least c at every point of S, this implies that λx + (1 − λ)y ∈ A.) Since A is open, there exists an x ∈ A and δ > 0 such that N (x, δ) is contained in A.
Since f is assumed to be bounded away from zero on A, which is convex, A must also be bounded. To see this, note that since A contains an open ball N(x, δ), if A were unbounded, for all n > 0 there would exist x_n ∈ A with ‖x − x_n‖ > n. Since A is convex, it contains the convex hull of N(x, δ) and x_n, whose volume is bounded from below by a positive constant (depending on δ and d) times n, which contradicts the fact that A has finite Lebesgue measure.
Since S equals the closure of A (again by Lemma 5 in the appendix), S is compact.
By translating S if necessary, we may assume, without loss of generality, that N (0, δ) ⊂ A. We may also assume, without loss of generality, that ∆ A , the diameter of A (and S), is equal to 1.
For all ε > 0, define the ε-interior of A by

    A_ε = {x ∈ A : B(x, ε) ⊂ A},

where B(x, ε) is the closed ball of radius ε centered at x. Note that A_ε is nonempty, open, and convex whenever ε ≤ δ. The reason we introduce A_ε is that, in order to bound the expectation of U_n, we assume that both X_1 and X_2 are in the ε-interior of A and show that the probability that this assumption fails is small. More precisely, to bound the expected value, note first that for all ε ≤ δ,

    E U_n = E[γ(X_1, X_2, X_{h^{(1)}(1,2)}) 𝟙{X_1 ∉ A_ε or X_2 ∉ A_ε}] + E[γ(X_1, X_2, X_{h^{(1)}(1,2)}) 𝟙{X_1, X_2 ∈ A_ε}].    (2)

Since γ(X_1, X_2, X_h) ≤ 1, the first term on the right-hand side may be bounded by

    P{X_1 ∉ A_ε or X_2 ∉ A_ε} ≤ 2P{X_1 ∉ A_ε} ≤ 2C Vol(A \ A_ε),

where Vol denotes the d-dimensional Lebesgue measure. To bound the volume of the boundary region, first observe that since S is the closure of A, Vol(A \ A_ε) = Vol(S \ A_ε). Next we show that there exists a constant κ_S > 0, depending on S, such that for all ε < δ/4,

    S ⊂ A_ε ⊕ N(0, κ_S ε),

where ⊕ denotes the Minkowski sum. This may be seen as follows. For x ∈ S, let θ(x) denote the infimum of the angle of any cone centered at x that includes N(0, δ/2), where for x, v ∈ R^d a cone of angle θ centered at x with direction v is defined as x + C(v, θ) with

    C(v, θ) = {y ∈ R^d : ⟨y, v⟩ ≥ ‖y‖ ‖v‖ cos θ}

(see Figure 1). Since S is compact and θ(x) is positive and continuous, θ = inf_{x∈S\A_ε} θ(x) > 0. Let x ∈ A \ A_ε and define y = ax, where

    a = sup{α ∈ (0, 1) : N(αx, ε) ⊂ x + C(−x, θ(x))}.

In words, y is the point on the segment joining x and 0 such that N(y, ε) "just fits" in the cone x + C(−x, θ(x)); see Figure 2. (Such a point exists by the definition of θ(x) and since ε < δ/4.) Note that N(y, ε) lies in the convex hull of {x} ∪ N(0, δ) and therefore, by convexity of A, N(y, ε) ⊂ A. Since A is open, this implies that B(y, ε) ⊂ A and therefore y ∈ A_ε. Consider now any straight line containing x and tangent to N(y, ε) at, say, the point z. The right triangle formed by x, y, and z is such that its hypotenuse is the segment [x, y] and its leg [y, z] has length ε. Since the angle of the triangle at the vertex x equals θ(x), we have ‖x − y‖ ≤ ε/sin(θ(x)) ≤ ε/sin(θ). Therefore, we may take κ_S = 1/sin(θ). Therefore, for all ε < δ/κ_S,

    Vol(S \ A_ε) ≤ Vol((A_ε ⊕ N(0, κ_S ε)) \ A_ε) ≤ K_1 ε,

where the last inequality follows from Taylor's theorem. Thus, we have shown that the first term on the right-hand side of (2) may be bounded by Kε, where K is a constant depending on S and C only.

It remains to bound the second term on the right-hand side of (2). To this end, suppose ε ≤ δ. In the event that X_1, X_2 ∈ A_ε, the mid-point (X_1 + X_2)/2 also belongs to A_ε by convexity, and therefore the ball of radius ε centered at the mid-point is entirely contained in A. Hence γ(X_1, X_2, X_{h^{(1)}(1,2)}) > ε^d is only possible if none of the remaining n − 2 data points falls in this ball, an event whose probability is at most (1 − c v_d ε^d)^{n−2} ≤ e^{−(n−2)c v_d ε^d}. Summarizing, we have proved that, for all ε ≤ min(δ/2, 2δ/(d2^{d−2})),

    E U_n ≤ Kε + ε^d + e^{−(n−2)c v_d ε^d}.

Choosing ε = (log n/((n − 2)c v_d))^{1/d} completes the proof of the bound for the expected value of U_n.

To bound the higher moments of U_n, we apply a general moment inequality for functions of independent random variables of Boucheron et al. (2005, Theorem 3). Thus, we need to study the effect of removing the point X_k from the sample on the value of U_n. Removing X_k changes U_n in two ways: the pairs involving X_k disappear, and for the remaining pairs (X_i, X_j) the value of γ(X_i, X_j, X_{h^{(1)}(i,j)}) can change only if X_k was the closest point to the mid-point (X_i + X_j)/2. The first effect is non-negative and bounded by 2/n. The second contributes terms in [−1, 0] that are nonzero only for pairs (i, j) with h^{(1)}(i,j) = k; let N_k denote the number of such pairs. Thus, it suffices to find suitable upper bounds for the moments of max_k N_k. To this end, note that N_k ≤ (n − 1)(n − 2)/2 for all k, and write N_1 = Σ_{i=2}^n N_{1,i}, where N_{1,i} denotes the number of indices j such that h^{(1)}(i,j) = 1. It remains to bound P{N_{1,i} > t}, which is done in Lemma 1 below.
Putting everything together, we obtain that there exists a constant K (possibly different from the one above) bounding the moments of max_k N_k. Choosing ε_n = K′((q log n)/n)^{1/d} for a sufficiently large K′, the upper bound takes the form required by Proposition 2, which completes the proof.

In the proof above we have used the following auxiliary result.

Lemma 1. Let N_{1,i} be defined as above. There exists a constant K depending on c and the set S such that, for all n and t > 0, the tail probability P{N_{1,i} > t} satisfies the exponential bounds derived in the proof below.

Proof. In the proof we condition on the value X_1 = x_1 and consider two different cases: the first, somewhat simpler, case is when x_1 falls in the ε_n-interior of A (with ε_n defined below). The case when X_1 is closer than ε_n to the boundary can be handled by a similar argument, though one should proceed with some care. Let ε_n = ((K_2 q log n)/n)^{1/d} for some constant K_2 specified below. Recall that A_{ε_n} denotes the ε_n-interior of A.
By Lemma 5.5 in Devroye, Györfi and Lugosi (1996), R^d can be covered by ρ_d cones C_1, . . . , C_{ρ_d} of angle π/6 centered at x_1, where ρ_d depends on d only. (Recall the key property of such cones: if u and v lie in the same cone of angle π/6 centered at x_1 and ‖x_1 − u‖ ≤ ‖x_1 − v‖, then ‖u − v‖ ≤ ‖x_1 − v‖.) Consider the data points falling in each cone and mark, in each nonempty cone, the one nearest to x_1. Let R be the distance between x_1 and the farthest of these marked nearest neighbors, and define, for each j = 1, . . . , ρ_d, the projection of the j-th marked point to the surface of the ball centered at x_1, of radius R. Now, for each j = 1, . . . , ρ_d, let H_j be the half-space containing x_1 defined by the bisecting hyperplane between x_1 and the j-th projected point, and let P be the convex polytope defined by the intersection of the bisecting half-spaces.

It is easy to see (and this is the second key observation) that P ⊂ B(x_1, R), where B(x_1, R) is the closed ball of radius R centered at x_1. Thus, for every pair (i, j) whose mid-point (X_i + X_j)/2 falls outside P, one of the marked data points is closer to the mid-point than x_1, so x_1 can be the closest data point to (X_i + X_j)/2 only if (X_i + X_j)/2 ∈ P. Thus,

    N_{1,i} ≤ #{j : (X_i + X_j)/2 ∈ B(x_1, R)} = #{j : X_j ∈ B(2x_1 − X_i, 2R)}.    (3)

We use the decomposition

    P{N_{1,i} > t} = P{N_{1,i} > t, R < ε_n} + P{N_{1,i} > t, R ≥ ε_n}.

Note that given R < ε_n, by (3), N_{1,i} is dominated by a binomial random variable with parameters n − 1 and µ(B(2x_1 − X_i, 2ε_n)), a probability proportional to at most v_d(2ε_n)^d, where v_d is the volume of the d-dimensional unit Euclidean ball. Therefore, choosing t = 2(n − 1)c v_d(2ε_n)^d and setting c_1 = log(4/e)c v_d 2^{d+1}, a standard estimate for the tail of the binomial distribution bounds the first term of the decomposition exponentially in (n − 1)ε_n^d.

It remains to bound the second term, which is at most P{R ≥ ε_n}. Since x_1 is at least ε_n away from the complement of A and the density f is bounded from below by c on A, each set C_j ∩ B(x_1, ε_n) has µ-probability bounded from below by a positive constant multiple of ε_n^d, and therefore P{R ≥ ε_n} also decreases exponentially in (n − 1)ε_n^d. Putting everything together, we obtain the announced bound for x_1 ∈ A_{ε_n}.

It remains to handle the case when x_1 is not in the ε_n-interior of A. Suppose that n is so large that ε_n < δ/2. The key observation is that there exists α ∈ (0, π/12] such that for all x_1 ∈ A, there exists a cone C_1 centered at x_1, of angle α, such that C_1 ∩ B(x_1, ε_n) ⊂ A. (This follows by an argument similar to the one used earlier: first note that N(0, δ/2) ⊂ A; by convexity, the smallest cone centered at x_1 that includes B(0, δ/2) satisfies the required property. The smallest angle of all such cones over x_1 ∈ A \ A_{ε_n} is bounded away from zero by compactness of S.) Now fix x_1 ∈ A ∩ A^c_{ε_n}. Cover R^d by a minimal number N_α of cones C_1, . . . , C_{N_α} centered at x_1 of angle α, such that one of the cones, C_1, satisfies C_1 ∩ B(x_1, ε_n) ⊂ A (see Figure 4: if a cone intersected with B(x_1, ε_n) is contained in S, then it is very likely to contain a data point, and if the angle of the cones is less than π/12, the convex set defined by the intersection of S with the bisecting half-spaces is contained in B(x_1, ε_n)). Then the same argument as in the case of x_1 ∈ A_{ε_n} carries over, with the only difference that instead of cones of angle π/6 we now have cones of angle α, and we obtain that there exists a constant K, depending on c and α, for which the same form of exponential bound holds. Since the above estimate holds independently of what x_1 is, the statement of Lemma 1 is established.

On the non-discernibility of support convexity

The purpose of this section is to show that without the assumption that the density is bounded away from zero on its support, Theorem 1 is not true. Without further assumptions, it is impossible to decide whether the support of a density is convex. In order to formalize this statement, we recall the notion of discernibility introduced by Dembo and Peres (1994) (see also Devroye and Lugosi (2002)). Let F and G be two disjoint sets of densities on R^d. Let X_1, . . . , X_n be independent random vectors drawn according to a density f ∈ F ∪ G. Based on these data, one tries to decide whether f ∈ F or not. Recall from Section 1 that a decision rule is a sequence of functions T_n : (R^d)^n → {0, 1}. T_n(X_1, . . . , X_n) = 1 means that the rule guesses that f ∈ F, while if T_n(X_1, . . . , X_n) = 0, the decision rule guesses that f ∈ G. A decision rule is consistent if, for every f ∈ F ∪ G, it is correct eventually, almost surely, that is, if

    P{T_n(X_1, . . . , X_n) ≠ 𝟙{f ∈ F} for finitely many n} = 1.
We say that the pair (F , G) is discernible if there exists a consistent decision rule.
Theorem 1 shows that if F is the class of densities that are bounded, bounded away from zero, and have convex support and G is the class of bounded densities, bounded away from zero with non-convex support (in both cases satisfying Assumption 1), then the pair (F , G) is discernible.
The main result of this section implies that the sets of all uniformly bounded densities with convex and non-convex support are not discernible. In other words, every decision rule must fail infinitely often for some density. Thus, the assumption of boundedness (from below) of the densities is necessary in Theorem 1, or at least needs to be replaced by another assumption. This is true even if we only consider densities on R with support in [0, 1].

Theorem 2. Let F be the class of all densities on R bounded by 2 with support [0, 1], and let G be the class of all densities on R bounded by 2, satisfying Assumption 1, whose support is a non-convex subset of [0, 1]. Then the pair (F, G) is not discernible.
A general impossibility theorem that gives sufficient conditions for a pair (F, G) to be non-discernible was given by Devroye and Lugosi (2002, Theorem 8). However, their theorem does not seem to apply here and we need a separate proof. We crucially use the following basic and well-known fact (see, e.g., Devroye and Györfi (2002)): if X and Y are real random variables with densities f and g, respectively, then there exists a coupling (i.e., a joint distribution of (X, Y) with marginal densities f and g) such that P{X ≠ Y} = (1/2)∫|f − g|.
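For intuition, here is a small numerical illustration of that coupling fact. The densities f (uniform) and g (triangular) on [0, 1], and the overlap mass p = ∫ min(f, g) = 3/4, are illustrative choices made for this sketch, not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    f = lambda x: np.ones_like(x)      # uniform density on [0, 1]
    g = lambda x: 2.0 * x              # triangular density on [0, 1]

    def rejection_sample(density, bound):
        """Draw one point on [0, 1] from an (unnormalized) density <= bound."""
        while True:
            x = rng.uniform()
            if rng.uniform() * bound <= density(x):
                return x

    def maximal_coupling():
        """Return (X, Y) with X ~ f, Y ~ g and P{X != Y} = (1/2) integral |f - g|."""
        p = 3.0 / 4.0                  # integral of min(f, g) for this pair
        if rng.uniform() < p:
            z = rejection_sample(lambda x: np.minimum(f(x), g(x)), 1.0)
            return z, z                # the two variables coincide
        x = rejection_sample(lambda x: np.maximum(f(x) - g(x), 0.0), 1.0)
        y = rejection_sample(lambda x: np.maximum(g(x) - f(x), 0.0), 1.0)
        return x, y

    pairs = [maximal_coupling() for _ in range(20000)]
    print(np.mean([x != y for x, y in pairs]))   # close to (1/2)*|f-g|_1 = 1/4

The closer f and g are in L_1, the more likely the two coupled samples coincide, which is exactly how the proof below transfers decisions between nearby densities.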
Proof of Theorem 2. To prove the theorem, we assume that the pair (F, G) is discernible, that is, that there exists a consistent decision rule T_n. We construct subclasses A ⊂ F and B ⊂ G such that, for any consistent decision rule T_n, there is a density φ in the L_1-closure of A ∪ B such that, if X_1, X_2, . . . are distributed as φ, then, with probability at least 1/2, T_n changes its decision infinitely many times, thus arriving at a contradiction.
Assume that there exists a consistent decision rule T n . Then for any density f ∈ A, and almost all ω, there exists an integer N (ω) such that T n (X 1 , . . . , X n ) = 1 if n > N (ω) and for any density f ∈ B, and almost all ω there exists an integer N (ω) such that T n (X 1 , . . . , X n ) = 0 if n > N (ω).
We may continue this procedure in such a way that φ_m ∈ A for all odd m and φ_m ∈ B for all even m, with N_1 < N_2 < · · · chosen accordingly. The sequence of densities φ_m converges in L_1 to a density φ such that ∫|φ_m − φ| < 2ε_m, and φ agrees with φ_m for all x ≤ Σ_{i=1}^{k_m − 1} 2^{−i} (which converges to 1 as m → ∞).
Now let X_1, X_2, . . . be independent random variables drawn according to the density φ. Then, according to the coupling mentioned before the proof, for each m there exists a sample distributed according to φ_m that coincides, with high probability, with X_1, . . . , X_{N_{k_m}}, and the coupling events for different m can be taken to be independent. Therefore, infinitely often the decision rule behaves, at time N_{k_m}, as if the data were drawn from the alternating densities φ_m. Hence, with probability at least 1/2, the decision rule changes its decision infinitely often, concluding the proof.

Data-based heuristics for calibrating the decision rule
Theorem 1 shows consistency of the rule that decides that the support is convex if and only if U_n ≤ τ_n, and it provides an asymptotic criterion to select τ_n. Nevertheless, this result is not directly applicable in practice: if the sequence τ_n satisfies the assumptions of Theorem 1, then so does τ*_n = kτ_n for any positive k, but it can be the case that U_n ≤ τ_n whereas U_n > τ*_n for a fixed n. In order to address this question, we find it convenient to use the standard terminology of hypothesis testing, where the null hypothesis is that the underlying distribution has convex support.
An objective way of selecting τ_n is needed so that it is possible to control the probability of either of the two possible errors: deciding that the support is not a convex set when in fact it is (type I error), and deciding that the support is convex when it is not (type II error).
The main difficulty is that the optimal value of the threshold τ_n depends on the unknown set S and the distribution µ. Therefore, a mechanism is required to obtain a value of τ_n = τ_n(S, µ) that is valid for any S and µ in a large class of distributions. The situation is similar to the one arising in the usual practice of bootstrap methods, where the distribution of a given statistic T is unknown and, moreover, depends on the specific distribution of the data under study. In this context, bootstrap methods are used to approximate the specific distribution of T in every particular case.
We present a bootstrap-type approximation of the distribution of U_n under the hypothesis of support convexity (this procedure is also referred to as calibration of the null distribution of U_n). We provide a heuristic justification that this approximation is valid both when the support is actually convex and when it is not (Lemmas 2 and 3). This approximation allows us to control the significance level (the probability of a type I error), to give approximate values of the power (1 minus the probability of a type II error, for a fixed significance level), and to define approximate p-values (the probability that the distribution approximating that of U_n under the null hypothesis assigns to values greater than or equal to the observed value of U_n). The approximate p-value acts as a score of how plausible the support convexity hypothesis is. Our proposal has been empirically validated with the simulation study presented throughout the section. A rigorous proof that the proposed bootstrap-type approximation works (that is, that it provides a sequence of probability distributions converging weakly to the same limit distribution as that of U_n as n goes to infinity) is beyond the scope of this paper and probably deserves a separate contribution.
Lemma 2. Let µ be a probability distribution on R^d with density f and compact support S ⊂ R^d. Assume that there exist constants 0 < c < C such that c ≤ f(x) ≤ C for all x ∈ S. Let X_1, . . . , X_n be i.i.d. vectors drawn from µ. Fix a pair of observations X_i and X_j such that a = (X_i + X_j)/2 is in the interior of the support S, and let h^{(1)}(i,j) be defined as above. Let G_{(k)} = γ(X_i, X_j, X_{h^{(k)}(i,j)}), k = 1, 2, be the two smallest elements of the ordered sample of G_h = γ(X_i, X_j, X_h), h = 1, . . . , n. Assume that f is continuous at a. Then, conditionally on (X_i, X_j), nG_{(1)} and n(G_{(2)} − G_{(1)}) converge in distribution, as n → ∞, to an exponential distribution with expected value (f(a)v_d)^{−1}, where v_d is the volume of the d-dimensional unit Euclidean ball. Moreover, nG_{(1)} and n(G_{(2)} − G_{(1)}) are asymptotically independent, given (X_i, X_j).
Proof. Let D_{(1)} = ‖a − X_{h^{(1)}(i,j)}‖ = G_{(1)}^{1/d}. For s > 0 small enough that B(a, s) ⊆ S,

    µ(B(a, s)) = f(a)v_d s^d (1 + o(1))

as s goes to zero, where the continuity of f at a is used in the last step. Observe that f(a) < ∞ because a ∈ S. Therefore, for 0 < t < n‖X_i − X_j‖^d/2^d,

    P{nG_{(1)} > t} = (1 − µ(B(a, (t/n)^{1/d})))^{n−2} → e^{−f(a)v_d t},

and the first part of the lemma is proved.
We now prove the result for k = 2. Let D_{(2)} = ‖a − X_{h^{(2)}(i,j)}‖ = G_{(2)}^{1/d} and write d_1 for the observed value of D_{(1)}. Then, reasoning as before, for 0 < s + d_1 < ‖X_i − X_j‖/2 such that B(a, s + d_1) ⊆ S,

    µ(B(a, s + d_1) \ B(a, d_1)) = f(a)v_d((s + d_1)^d − d_1^d)(1 + o(1))

as s goes to zero. Then, for 0 < t < n‖X_i − X_j‖^d/2^d − nd_1^d, the probability that n(G_{(2)} − G_{(1)}) exceeds t, conditionally on D_{(1)} = d_1, converges to e^{−f(a)v_d t}, as stated. The asymptotic independence between G_{(2)} and G_{(1)} follows from the observation that the asymptotic conditional distribution of G_{(2)} given G_{(1)} = g_1 does not depend on the value g_1.
Motivated by Lemma 2, we may define the statistic

    U^{(2)}_n = (n(n − 1)/2)^{-1} Σ_{1 ≤ i < j ≤ n} (G^{(i,j)}_{(2)} − G^{(i,j)}_{(1)}),

where G^{(i,j)}_{(1)} ≤ G^{(i,j)}_{(2)} denote the two smallest values of γ(X_i, X_j, X_h) over h. Lemma 2 establishes that, under the hypothesis of support convexity, each term in the sum defining U^{(2)}_n has the same asymptotic conditional distribution as the corresponding term in the sum defining U_n. Therefore it is reasonable to expect that the asymptotic distributions of U_n and U^{(2)}_n coincide. However, we have not proved that the joint distributions of the summands asymptotically coincide, as Lemma 2 states this result only for the marginals. This is sufficient to conclude that the expectations of U_n and U^{(2)}_n asymptotically coincide, but we cannot make such a statement about the distributions of these statistics.

(Figure 5: One convex and five non-convex configurations of points: different two-dimensional S-shaped patterns with different sharpness. The patterns consist of two circular arcs of radius R, with R = 1, 1.5, 3, 6, 24, ∞, each arc with constant length equal to 3π/2. The bigger the value of R, the closer the configuration is to convexity, which is achieved for R = ∞.)
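Under the same conventions as the earlier sketch of U_n (in particular, excluding h = i, j from the minimization), U^{(2)}_n can be computed as follows.

    import numpy as np

    def u2_statistic(X):
        """U_n^(2): average over all pairs (i, j) of the gap G_(2) - G_(1)
        between the two smallest values of gamma(X_i, X_j, X_h), h != i, j."""
        n, d = X.shape
        total = 0.0
        for i in range(n - 1):
            for j in range(i + 1, n):
                mid = (X[i] + X[j]) / 2.0
                g = np.sum((X - mid) ** 2, axis=1) ** (d / 2.0)
                g[i] = g[j] = np.inf
                two = np.partition(g, 1)[:2]    # two[0] <= two[1]
                total += two[1] - two[0]
        return total / (n * (n - 1) / 2.0)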
We have carried out some simulations to compile evidence that, under the hypothesis of support convexity, the statistics nU_n and nU^{(2)}_n have similar asymptotic distributions. Consider the different two-dimensional noisy S-shaped data sets plotted in Figure 5. They are obtained as follows (see the sketch below). Consider two adjacent circles, both with radius R > 3/4. Take an arc of length 1.5π in each of them in such a way that their union forms a differentiable one-dimensional curve of length 3π. Smaller values of the radius R correspond to sharper curves. A flat segment of length 3π corresponds to R = ∞. To generate a random point around such an S-shaped curve, we generate a random position uniformly distributed over the curve. Then we add an orthogonal deviation from this position, distributed as a truncated normal with zero mean and standard deviation σ = 0.15, truncated at [−4σ, 4σ].
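A sketch of this generator follows, with one concrete way of joining the two arcs smoothly (a unit-speed parametrization with piecewise-constant curvature ±1/R); the flat case R = ∞ corresponds to sampling along a straight segment instead.

    import numpy as np

    def noisy_s_shape(n, R=1.0, sigma=0.15, rng=None):
        """n points around a two-arc S-shaped curve of total length 3*pi
        (each arc of radius R and length 3*pi/2), plus orthogonal noise
        from a normal truncated at +/- 4 sigma."""
        rng = np.random.default_rng(rng)
        L = 1.5 * np.pi                               # length of each arc
        s = rng.uniform(0.0, 2.0 * L, size=n)         # position along the curve
        eps = sigma * np.clip(rng.standard_normal(n), -4.0, 4.0)
        phi1 = L / R                                  # heading at the junction
        p1 = np.array([R * np.sin(phi1), R * (1.0 - np.cos(phi1))])
        on1 = s <= L                                  # points on the first arc
        phi = np.where(on1, s / R, phi1 - (s - L) / R)
        x = np.where(on1, R * np.sin(s / R),
                     p1[0] + R * (np.sin(phi1) - np.sin(phi)))
        y = np.where(on1, R * (1.0 - np.cos(s / R)),
                     p1[1] + R * (np.cos(phi) - np.cos(phi1)))
        nx, ny = -np.sin(phi), np.cos(phi)            # unit normal to the curve
        return np.column_stack([x + eps * nx, y + eps * ny])

For example, noisy_s_shape(500, R=1.0) reproduces (up to a rigid motion) the sharpest pattern of Figure 5.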
Consider now data following the noisy S-shaped pattern with radius R = ∞ (that is, a line segment), so that the support is a convex set. For sample sizes n ∈ {100, 250, 500, 1000}, we have generated 500 data sets. The statistics nU_n and nU^{(2)}_n have been calculated for each sample. The first row of Table 1 shows the p-values of the two-sample Kolmogorov–Smirnov test comparing the distributions of nU_n and nU^{(2)}_n (see, for instance, Hollander and Wolfe (1999, Chapter 5)). For large sample sizes, the null hypothesis that both statistics have the same distribution cannot be rejected. We have also tested the normality of nU_n and nU^{(2)}_n using the Lilliefors normality test (see, for instance, Hollander and Wolfe (1999, Chapter 11)). The corresponding p-values are shown in Table 1. It seems that asymptotic normality is admissible, even though U-statistic theory is not directly applicable to these statistics.
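A small-scale version of this comparison takes only a few lines; here a uniform strip stands in for the R = ∞ configuration (with uniform rather than truncated normal cross-sectional noise, for simplicity), and u_statistic and u2_statistic are the sketches given earlier.

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(1)
    u_vals, u2_vals = [], []
    for _ in range(200):                  # 200 Monte Carlo samples of size 100
        X = rng.uniform(size=(100, 2)) * [3 * np.pi, 1.2]  # flat strip (convex)
        u_vals.append(100 * u_statistic(X))
        u2_vals.append(100 * u2_statistic(X))
    print(ks_2samp(u_vals, u2_vals))      # large p-value: similar distributions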
In order to establish results under the hypothesis of non-convexity, we need an additional regularity assumption for the support. Given a ∈ R^d \ S, let π_S(a) = {x ∈ S : ‖x − a‖ = min_{y∈S}‖y − a‖} be the set of closest points in S to a. When π_S(a) consists of a unique point, we call it a_S. Erdős (1945) proved that the set of points a ∈ R^d \ S with more than one point in π_S(a) has null Lebesgue measure. The required regularity condition is as follows:

(A) For any a ∈ R^d \ S such that π_S(a) = {a_S}, there exist constants η > 0 and ν ≥ 1, both depending on a and S, such that, when s → 0, s > 0,

    Vol(B(a, ‖a − a_S‖ + s) ∩ S) = ηs^ν + o(s^ν).

Condition (A) is satisfied for many regular sets S, convex or not. For instance, let S = [−1, 1]³ ⊂ R³ and a = (1 + δ, 0, 0). Then a_S = (1, 0, 0), ‖a − a_S‖ = δ and, for small δ and s (such that s + δ ≤ ‖a − (1, 1, 0)‖), the solid B(a, δ + s) ∩ S is a spherical cap (the portion of a ball cut off by a plane) and its volume is (see, e.g., Li (2011))

    Vol(B(a, δ + s) ∩ S) = (π/3)s²(3δ + 2s) = πδs² + o(s²),

and (A) is verified with η = πδ and ν = 2. For the volume V_d(δ, s) of the corresponding d-dimensional hyperspherical cap, Li (2011) gives the formula

    V_d(δ, s) = (1/2)v_d(δ + s)^d I_{1−{δ/(δ+s)}²}((d + 1)/2, 1/2),

where I_x(a, b) is the regularized incomplete beta function, defined for 0 ≤ x ≤ 1, a > 0, b > 0, as

    I_x(a, b) = (1/B(a, b)) ∫_0^x u^{a−1}(1 − u)^{b−1} du.

Observe that I_x(a, b) = x^a/(aB(a, b)) + o(x^a) when x → 0, and that 1 − {δ/(δ + s)}² = 2s/δ + o(s) when s → 0. Therefore, when s → 0,

    V_d(δ, s) = ηs^{(d+1)/2} + o(s^{(d+1)/2})

for some constant η > 0 depending on δ and d. So ν = (d + 1)/2 in this case.
Lemma 3. Let µ be a probability distribution on R^d with density f and compact support S ⊂ R^d. Assume that there exist constants 0 < c < C such that c ≤ f(x) ≤ C for all x ∈ S. Let X_1, . . . , X_n be i.i.d. vectors drawn from µ. Assume that S is not convex and fix a pair of observations X_i and X_j such that a = (X_i + X_j)/2 ∉ S. Then, with probability one, there is only one closest point in S to a. Let a_S be this point and assume that f is continuous at a_S. Assume additionally that condition (A) is satisfied, and let G_{(1)} and G_{(2)} be defined as in Lemma 2. Then, conditionally on (X_i, X_j), as n → ∞, G_{(1)} converges in probability to ‖a − a_S‖^d and n(G_{(2)} − G_{(1)}) converges in distribution to an exponential distribution whose expected value depends on f(a_S), d, ‖a − a_S‖ and the constants η and ν of condition (A).

Proof. Given Erdős' result on the null Lebesgue measure of the set of points with more than one closest point in S, this set has zero probability because the distribution of (X_i + X_j)/2 is absolutely continuous.
Let D_{(1)} = ‖a − X_{h^{(1)}(i,j)}‖ = G_{(1)}^{1/d} and let s > 0. Then, arguing as in the proof of Lemma 2 and using (A),

    µ(B(a, ‖a − a_S‖ + s)) = f(a_S)ηs^ν(1 + o(1))

as s goes to zero. Therefore, for t > 0,

    P{n^{1/ν}(D_{(1)} − ‖a − a_S‖) > t} = (1 − µ(B(a, ‖a − a_S‖ + t/n^{1/ν})))^{n−2} → e^{−f(a_S)ηt^ν}.

It follows that n^{1/ν}(D_{(1)} − ‖a − a_S‖) converges in distribution to a Weibull distribution with shape parameter ν. It follows that D_{(1)} converges in probability to ‖a − a_S‖ and, by continuity of the map x ↦ x^d, that G_{(1)} converges in probability to ‖a − a_S‖^d, and the first part of the lemma is proved.
It follows from Lemma 3 that, given (X_i, X_j) with a = (X_i + X_j)/2 ∉ S, G_{(1)} converges in probability to ‖a − a_S‖^d > 0. Therefore, nG_{(1)} goes to infinity (in probability) at rate n. Lemma 3 thus suggests that, for a non-convex support S and assuming (A), nU^{(2)}_n should be bounded in probability (because nU^{(2)}_n is the average of n(n − 1)/2 random variables that are bounded in probability), but nU_n should not be (because it is the average of n(n − 1)/2 random variables that go to infinity at rate n). In fact, from Proposition 1 it follows that lim_n nU_n = ∞ almost surely when the support S is not convex.
To understand the intuitive meaning of Lemmas 2 and 3, let F^{conv}_{nU_n} and F^{conv}_{nU^{(2)}_n} be the distributions of the two statistics under consideration, nU_n and nU^{(2)}_n, respectively, when the support S is convex. For the case of a non-convex support, we call F^{non-conv}_{nU_n} and F^{non-conv}_{nU^{(2)}_n} the distributions of the two corresponding statistics. Lemma 2 establishes that F^{conv}_{nU_n} and F^{conv}_{nU^{(2)}_n} are similar, and Lemma 3 indicates that F^{non-conv}_{nU^{(2)}_n} looks more like F^{conv}_{nU_n} than F^{non-conv}_{nU_n} does. Therefore we propose to use the distribution of the statistic nU^{(2)}_n to approximate that of nU_n under the support convexity hypothesis, whether the support is indeed convex or not.

(Figure 6: Estimated densities of log(nU_n) and log(nU^{(2)}_n) generated from data according to each of the six noisy S-shaped configurations, for sample size n = 500.)
Some simulations have been conducted to support the use of the distribution of U^{(2)}_n as an approximation to that of U_n. The upper panel of Figure 6 shows the estimated density of the statistic log(nU_n) calculated on 500 samples (of size n = 500) generated according to each of the six S-shaped supports shown in Figure 5. It can clearly be seen how the distribution of nU_n changes considerably with the data pattern and how its values get closer to 0 as the support gets closer to convexity.
The lower panel of Figure 6 shows estimated densities (from 500 values) of log(nU^{(2)}_n) calculated over 500 samples (of size n = 500) generated according to each of the six S-shaped configurations shown in Figure 5. The scale of the upper panel has been kept in order to clearly show that the distributions of log(nU^{(2)}_n) under non-convexity are closer to 0 than those of log(nU_n). This is the main conclusion of the simulations.
Observe that most of the estimated densities of log(nU_n) corresponding to non-convex S do not overlap the density of log(nU_n) corresponding to convex S (i.e., R = ∞). This fact does not necessarily contradict our belief that the null distribution of nU_n can always be approximated with that of nU^{(2)}_n: the asymptotic distribution of n(G_{(2)} − G_{(1)}) depends on whether the data follow the null distribution or not (see Lemmas 2 and 3).

The use of the distribution of the statistic nU^{(2)}_n as an approximation for that of nU_n under the null hypothesis entails a problem: only one observation of nU^{(2)}_n is available from each sample X_1, . . . , X_n. Therefore a resampling procedure is required to provide a set of pseudo-observations of nU^{(2)}_n. The standard bootstrap is not adequate in this context because each bootstrap sample would contain some repeated observations, say X*_i = X*_j = X_l for i ≠ j, and therefore we would have min_{h=1,...,n} γ(X*_i, X*_j, X*_h) = 0, thus reducing the effective number of summands defining U*_n. We propose to perform subsampling bootstrap, that is, resampling without replacement from the original sample at a size smaller than the original sample size (see Politis and Romano (1994); Politis, Romano and Wolf (1999)). The procedure to compute p-values for the support convexity decision rule is then as follows. Let nU^{Obs}_n be the observed value of nU_n for the sample X_1, . . . , X_n at hand. We take B subsamples of size m < n and compute the statistics mU^{(2)*}_{m,b}, b = 1, . . . , B, with mean µ*_m and standard deviation s*_m. We approximate the distribution of nU_n (under support convexity) by a normal distribution centered at µ*_m and having standard deviation equal to s*_m. Let Φ be the distribution function of the standard normal distribution. The p-value is therefore defined as

    p-value = 1 − Φ((nU^{Obs}_n − µ*_m)/s*_m).    (4)

Table 2 illustrates the performance of the proposed procedure for deciding about support convexity when the support is indeed convex. For each sample size, 500 samples have been generated according to the hypothesis of support convexity (S-shaped pattern with R = ∞). For each sample, B = 100 bootstrap subsamples have been drawn, with sizes m = n/2 (for n ∈ {100, 250, 500}) or m = n/4 (for n = 1000). The empirical significance levels are calculated as the proportion of samples for which the computed p-value is lower than the nominal one. We see that the nominal significance level is well reproduced for α = 0.01 and α = 0.05 (when n ≥ 500 in this case), but the case α = 0.1 is unsatisfactory.
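Under the normal approximation just described, the procedure is a few lines on top of the earlier sketches (u_statistic and u2_statistic); B and the rule m = n/2 are taken from the text.

    import numpy as np
    from scipy.stats import norm

    def convexity_p_value(X, B=100, m=None, rng=None):
        """Subsampling-bootstrap p-value (4) for H0: the support is convex."""
        rng = np.random.default_rng(rng)
        n = len(X)
        m = m if m is not None else n // 2
        obs = n * u_statistic(X)                       # observed n * U_n
        boot = np.empty(B)
        for b in range(B):
            sub = rng.choice(n, size=m, replace=False) # subsample, no repeats
            boot[b] = m * u2_statistic(X[sub])         # m * U_m^(2) on subsample
        mu, sd = boot.mean(), boot.std(ddof=1)
        return 1.0 - norm.cdf((obs - mu) / sd)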
The empirical power of the convexity decision rule has been calculated (for a nominal significance value α = 0.05) for sample sizes n ∈ {100, 250, 500, 1000} as the proportion of samples for which the computed p-value is lower than α.

(Figure 7: Empirical powers. The parameter R indicates the radius of the circles used to produce the six noisy S-shaped configurations in Figure 5: the bigger the value of R, the closer the support is to convexity, which is achieved at R = ∞. Horizontal lines mark acceptance intervals of the null hypothesis that the observed powers equal the nominal significance level α = 0.05.)

For n = 100, the case closest to convexity (R = 24) is not detected as non-convex, while it is detected when n ∈ {250, 500, 1000}.

Choice of the tuning parameter in ISOMAP
In this section we present a statistical application of the rule introduced in Section 2. We use this decision rule to automatically choose the tuning parameter of ISOMAP, a nonlinear dimensionality reduction method due to Tenenbaum, de Silva and Langford (2000). Given n points x_1, . . . , x_n ∈ R^p in a high-dimensional space equipped with a metric d, the object of nonlinear dimensionality reduction (also known as manifold learning) is to find a low-dimensional configuration, that is, an n × d matrix, with d ≪ p, with rows y_i, i = 1, . . . , n, and a nonlinear function ρ : R^d → R^p such that ρ(y_i) is close (in some sense) to the observed x_i, for i = 1, . . . , n. Principal Component Analysis (PCA) is without doubt the most widely used dimensionality reduction technique, but it is not able to detect nonlinear structures. See Lee and Verleysen (2007) or Gorban et al. (2007) for broad coverage of nonlinear dimensionality reduction.
We focus on the ISOMAP transformation. The underlying implicit assumption is that the high-dimensional data lie on, or close to, a low-dimensional nonlinear manifold, and that the geodesic distance on the manifold represents a meaningful metric. ISOMAP tries to recover this hidden information. In its ε-version, the algorithm first builds a graph G_ε connecting every pair of points at distance at most ε, then approximates geodesic distances by shortest-path distances along G_ε, and finally applies multidimensional scaling to the resulting distance matrix. The performance of ISOMAP depends critically on the tuning parameter ε. Our proposal is to run ε-ISOMAP over a grid of values ε_min ≤ ε ≤ ε_max and to choose the value ε* whose low-dimensional configuration attains the largest p-value for the support convexity test, since a correctly unfolded configuration should fill an approximately convex region. Details on the choice of ε_min and ε_max can be found in the Appendix. There you can also find a proposal to avoid disconnected graphs (even if we take ε = 0).
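The following sketch implements the selection loop under the assumptions above: an ε-ISOMAP whose graph is joined with the minimum spanning tree (the Appendix proposal, so the graph is connected for every ε), classical MDS on the geodesic distances, and the p-value of Section 4 as the selection criterion. The function names eps_isomap and choose_eps are ours, and convexity_p_value is the Section 4 sketch.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path

    def eps_isomap(X, eps, ndim=2):
        """epsilon-ISOMAP sketch: epsilon-neighborhood graph union the MST,
        shortest-path geodesics, then classical MDS."""
        D = squareform(pdist(X))
        W = np.where(D <= eps, D, 0.0)                   # 0 means "no edge"
        mst = minimum_spanning_tree(D).toarray()
        W = np.maximum(W, np.maximum(mst, mst.T))        # union with the MST
        G = shortest_path(W, method="D", directed=False) # geodesic distances
        n = len(X)
        J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
        B = -0.5 * J @ (G ** 2) @ J                      # classical MDS step
        vals, vecs = np.linalg.eigh(B)
        top = np.argsort(vals)[::-1][:ndim]
        return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

    def choose_eps(X, grid, **kwargs):
        """Return the eps in the grid whose embedding maximizes the
        support-convexity p-value."""
        pvals = [convexity_p_value(eps_isomap(X, e), **kwargs) for e in grid]
        return grid[int(np.argmax(pvals))], pvals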
We have applied the proposed procedure for the choice of ε to a two-dimensional synthetic data set corresponding to the sharpest S in Figure 5, the one with radius R = 1. The sample size is n = 100 and B = 200 bootstrap subsamples are drawn. We have used seven evenly spaced values ε_1, . . . , ε_7, with ε_1 = 0 and ε_7 = median{d(x_i, x_j)}. The resulting p-values are shown in Figure 8 (only for ε_1 to ε_6; the result for ε_7 is very similar to that for ε_6).
The two panels in the first row of Figure 8 show that when ε is too small, a linear representation is adequate for many points in the sample, but some points appear that are too far from the common linear structure, making the support convexity hypothesis hard to accept. For moderate values of ε (second row of Figure 8), a compromise is achieved between a common linear structure and the absence of outliers. When ε = 1.26 (lower left panel), a short-circuit appears (leading to a misleading two-dimensional configuration). It could be said that the short-circuit is also present (though not so clearly) for ε = 0.94. For ε = 1.57 (lower right panel) there are two short-circuits. Following our proposal, the chosen value of ε is ε* = 0.63. Observe that the p-value corresponding to this best choice of ε is 0.0281, indicating moderate evidence against the null hypothesis of support convexity. This happens because the ISOMAP procedure is not able to fully linearize the data configuration, even for the most favorable value of the parameter ε.

Conclusions
In this paper we investigated the possibilities and limitations of constructing data-based procedures to decide whether the support of the underlying density generating the data points is convex or not. We defined a decision rule, based on a U-statistic with a random kernel, which decides correctly for sufficiently large n, with probability 1, whenever the density is bounded away from zero on its compact support and the support has a boundary of zero Lebesgue measure.
We also showed that such asymptotically correct decision rules are impossible to define if one only assumes boundedness of the density.
Moreover, we suggested a bootstrap-like procedure for approximating the distribution of the proposed test statistic under the hypothesis of convexity of the support. The performance of the proposed method was illustrated on simulated data sets.
To illustrate potential applications, the decision rule was used to automatically choose the tuning parameter of ISOMAP, a nonlinear dimensionality reduction method.
Appendix

For every x ∈ ∂S, ess inf_{y: ‖y−x‖<ε} f(y) = 0 for all ε > 0. To see this, suppose that this is not true and that for some x ∈ ∂S there exists ε > 0 such that ess inf_{y: ‖y−x‖<ε} f(y) > 0. But since x is on the boundary of S, there exists z ∉ S such that ‖z − x‖ < ε/2. Since S is closed, there exists δ < ε/2 such that the ball N(z, δ) is entirely outside of S. But then ess inf_{y∈N(z,δ)} f(y) > 0, which contradicts the definition of the support.
This implies that if Vol(∂S) > 0, then the Lebesgue measure of the set of points x ∈ S with ess inf_{y: ‖y−x‖<ε} f(y) = 0 for all ε > 0 is positive.
Lemma 5. Let f be a density with support S. Suppose that Vol(∂S) = 0 and f(x) ≥ c for all x ∈ S, where c > 0. Define the set A = {x : ∃δ > 0 : ess inf_{y∈N(x,δ)} f(y) > 0}. Then S is the closure of A.
Proof. Since A ⊂ S and S is closed, Ā ⊆ S (where Ā stands for the closure of A). Suppose S ≠ Ā. Then there exist x ∈ S and ε > 0 such that N(x, ε) ∩ A = ∅.
Observe that, since f ≥ c on S, for every point y ∈ N(x, ε), either y ∉ S or y ∈ ∂S. Thus, by assumption, Vol(S ∩ N(x, ε)) = 0. But then the closed set S ∩ N(x, ε)^c has f-measure 1, which contradicts the definition of S (since the support is defined as the smallest closed set of f-measure 1).

Some technical details on the choice of ε*
A remark on the meaningful choice of ε_min and ε_max follows. Very low values of ε produce disconnected graphs G_ε in the first step of the ε-ISOMAP algorithm. The usual way to circumvent this problem is to analyze only the largest connected component of G_ε. But then different samples are used for different values ε < ε_conn, where ε_conn is the lowest value assuring the connectivity of G_ε. So it may seem plausible to take ε_min = ε_conn. Unfortunately, the value of ε_conn is extremely sensitive to outliers, because ε_conn ≥ max_i min_{j≠i} d(x_i, x_j).
Our proposal to avoid disconnected graphs G_ε for small ε is based on the minimum spanning tree associated with the distance matrix D. Let G^0_MST be the graph representing this minimum spanning tree, which is connected by definition. We propose to always replace the graph G_ε in the first step of the ε-ISOMAP algorithm by the union graph G^ε_MST = G_ε ∪ G^0_MST, and to proceed with the further steps in the usual way. Observe that G^ε_MST is connected for all ε ≥ 0, with G^ε_MST = G^0_MST for ε = 0. Therefore we may choose ε_min = 0.
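As a side remark, ε_conn itself is easy to compute: it is a standard fact that the smallest ε making G_ε connected equals the longest edge of the minimum spanning tree, which makes the sensitivity of ε_conn to outliers easy to check numerically (the outlier below is an illustrative choice of ours).

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse.csgraph import minimum_spanning_tree

    def eps_conn(X):
        """Smallest eps for which G_eps is connected: the longest MST edge."""
        D = squareform(pdist(X))
        return minimum_spanning_tree(D).toarray().max()

    rng = np.random.default_rng(2)
    X = np.vstack([rng.uniform(size=(99, 2)), [[10.0, 10.0]]])  # one outlier
    print(eps_conn(X))   # inflated by the outlier, unlike the MST-union graph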
An easy way to fix ε_max is taking ε_max = max_{i,j} d(x_i, x_j), so that G_{ε_max} is the complete graph. This choice leaves open the possibility that the observed distance matrix D is compatible with a probability distribution having convex support. In practice, a lower value may be chosen, such as ε_max = median{d(x_i, x_j)}. Then a fine regular grid ε_1 = ε_min < · · · < ε_E = ε_max is used and the p-values p(ε_e), e = 1, . . . , E, are computed.