A notion of stability for k-means clustering

In this paper, we define and study a new notion of stability for the $k$-means clustering scheme building upon the notion of quantization of a probability measure. We connect this notion of stability to a geometric feature of the underlying distribution of the data, named absolute margin condition, inspired by recent works on the subject.


Introduction
Unsupervised classification consists in partitioning a data set into a series of groups (or clusters), each of which may then be regarded as a separate class of observations. This task, widely considered in data analysis, enables practitioners in many disciplines to get a first intuition about their data by identifying meaningful groups of observations. The tools available for unsupervised classification are various. Depending on the nature of the problem, one may rely on a model-based strategy, modeling the unknown distribution of the data as a mixture of known distributions with unknown parameters. Another approach, model-free, is embodied by the well-known $k$-means clustering scheme. This paper focuses on the stability of this clustering scheme.

Quantization and the k-means clustering scheme
The $k$-means clustering scheme prescribes to classify observations according to their distances to chosen representatives. This clustering scheme is strongly connected to the field of quantization of probability measures, and this paragraph briefly recalls how these concepts interact. Suppose our data are modeled by $n$ i.i.d. random variables $X_1, \dots, X_n$, taking their values in some metric space $(E, d)$, with the same distribution $P$ as (and independent of) a generic random variable $X$. Let $k \ge 1$ be an integer fixed in advance, representing the prescribed number of clusters, and define a $k$-points quantizer [1] as any mapping $q : E \to E$ such that $|q(E)| = k$ [2]. Denoting by $c_1, \dots, c_k$ the values taken by $q$, the sets $\{x \in E : q(x) = c_j\}$, $1 \le j \le k$, partition the space $E$ into $k$ subsets (or cells), and each point $c_j$ (called indifferently a center, a centroid or a code point) stands as a representative of all points in its cell. Given a quantizer $q$, the associated data clusters are defined, for all $1 \le j \le k$, by
$$C_j(q) := \{x \in E : q(x) = c_j\} \cap \{X_1, \dots, X_n\}.$$
The performance of this clustering scheme is naturally measured by the average square distance, with respect to $P$, of a point to its representative. In other words, the risk of $q$ (also referred to as its distortion) is defined by
$$R(q) := \mathbb{E}\, d(X, q(X))^2 = \int_E d(x, q(x))^2 \, \mathrm{d}P(x). \tag{1.1}$$
Quantizers of special interest are nearest neighbor (NN) quantizers, i.e. quantizers such that, for all $x \in E$,
$$q(x) \in \arg\min_{c \in q(E)} d(x, c).$$
The interest in these quantizers relies on the straightforward observation that, for any quantizer $q$, an NN quantizer $q'$ such that $q'(E) = q(E)$ satisfies $R(q') \le R(q)$. Hence, attention may be restricted to NN quantizers, and any optimal quantizer
$$q^\star \in \arg\min_q R(q), \tag{1.2}$$
(where $q$ ranges over all $k$-points quantizers) is necessarily an NN quantizer.
[1] The integer $k$ is supposed fixed throughout the paper and all quantizers considered below are supposed to be $k$-points quantizers.
[2] For a set $A$, the notation $|A|$ refers to the number of elements in $A$.
We will denote by $Q_k$ the set of all $k$-points NN quantizers and, unless explicitly mentioned otherwise, all quantizers involved in the sequel will be considered as members of $Q_k$. For $q \in Q_k$, the value of its risk is entirely described by its image. Indeed, if $q \in Q_k$ takes values $c_1, \dots, c_k$, then
$$R(q) = \mathbb{E} \min_{1 \le j \le k} d(X, c_j)^2. \tag{1.3}$$
Denoting $c = \{c_1, \dots, c_k\}$, referred to as a codebook, we will often denote by $R(c)$ the right hand side of (1.3) with a slight abuse of notation.
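As a concrete illustration of (1.3), the following minimal sketch evaluates the risk of a codebook under the empirical measure of a small sample, assuming $E = \mathbb{R}^2$ with the Euclidean distance. The names `sq_dist`, `nn_quantize` and `empirical_risk`, as well as the toy data, are hypothetical and chosen only for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(x: Point, y: Point) -> float:
    """Squared Euclidean distance d(x, y)^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def nn_quantize(x: Point, codebook: Sequence[Point]) -> int:
    """Index j of a nearest center c_j (ties go to the smallest index)."""
    return min(range(len(codebook)), key=lambda j: sq_dist(x, codebook[j]))


def empirical_risk(sample: Sequence[Point], codebook: Sequence[Point]) -> float:
    """Average over the sample of min_j d(X_i, c_j)^2, i.e. (1.3) under P_n."""
    return sum(min(sq_dist(x, c) for c in codebook) for x in sample) / len(sample)


# Toy data (assumption): two well separated pairs of points.
sample: List[Point] = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
codebook: List[Point] = [(0.0, 1.0), (10.0, 1.0)]

labels = [nn_quantize(x, codebook) for x in sample]
risk = empirical_risk(sample, codebook)
```

Here every sample point lies at distance 1 from its nearest center, so the risk only depends on the codebook and not on how ties would be broken.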
A few additional considerations, relative to NN quantizers, will be useful in the paper. Given $c = \{c_1, \dots, c_k\}$, denote by $V_j(c)$ the set of points in $E$ closer to $c_j$ than to any other $c_\ell$, that is
$$V_j(c) := \{x \in E : d(x, c_j) \le d(x, c_\ell) \text{ for all } 1 \le \ell \le k\}.$$
These sets do not partition the space $E$ since, for $i \ne j$, the set $V_i(c) \cap V_j(c)$ is not necessarily empty. A Voronoi partition of $E$ relative to $c$ is any partition $W_1, \dots, W_k$ of $E$ such that, for all $1 \le j \le k$, $W_j \subset V_j(c)$ up to relabeling. For instance, given $q \in Q_k$ with image $c$, the sets $W_j = q^{-1}(c_j)$, $1 \le j \le k$, form a Voronoi partition relative to $c$. We call frontier of the Voronoi diagram generated by $c$ the set
$$F(c) := \bigcup_{i \ne j} V_i(c) \cap V_j(c).$$
Given an optimal quantizer $q^\star$ with image $c^\star = \{c^\star_1, \dots, c^\star_k\}$, a remarkable property, known as the center condition, states that for all $1 \le j \le k$, and provided $|\mathrm{supp}(P)| \ge k$,
$$P(V_j(c^\star)) > 0 \quad \text{and} \quad c^\star_j \in \arg\min_{c \in E} \int_{V_j(c^\star)} d(x, c)^2 \, \mathrm{d}P(x). \tag{1.5}$$
From now on, the probability measure $P$ will be supposed to have a support of more than $k$ points.
We end this subsection by mentioning that computing an optimal quantizer requires the knowledge of the distribution $P$. From a statistical point of view, when the only information available about $P$ consists in the sample $X_1, \dots, X_n$, reasonable quantizers are empirically optimal quantizers, i.e. NN quantizers associated to any codebook $\hat c = \{\hat c_1, \dots, \hat c_k\}$ satisfying
$$\hat c \in \arg\min_{c = \{c_1, \dots, c_k\}} \frac{1}{n} \sum_{i=1}^n \min_{1 \le j \le k} d(X_i, c_j)^2. \tag{1.6}$$
In other words, empirically optimal quantizers minimize the risk associated to the empirical measure
$$P_n := \frac{1}{n} \sum_{i=1}^n \delta_{X_i}.$$
The computation of empirically optimal centers is known to be a hard problem, due in particular to the non-convexity of $c \mapsto R(c)$, and is usually performed by Lloyd's algorithm, for which convergence guarantees have been obtained recently by Lu and Zhou (2016) in the context where $P$ is a mixture of sub-gaussian distributions.
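For readers unfamiliar with Lloyd's algorithm, the sketch below shows its two alternating steps (nearest-center assignment, then recentering each cell at its mean) in the Euclidean case. It is only an illustration under simplifying assumptions: the function name `lloyd`, the initialization and the toy sample are hypothetical, empty cells simply keep their center, and the iteration generally reaches a local rather than a global minimizer of the empirical risk.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(x: Point, y: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def lloyd(sample: Sequence[Point], init: Sequence[Point], n_iter: int = 100) -> List[Point]:
    """Alternate assignment and recentering until the centers stop moving."""
    centers = [tuple(c) for c in init]
    for _ in range(n_iter):
        # Assignment step: send each point to the cell of a nearest center.
        cells: List[List[Point]] = [[] for _ in centers]
        for x in sample:
            j = min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
            cells[j].append(x)
        # Update step: move each center to the mean of its cell
        # (an empty cell keeps its previous center).
        new_centers = []
        for c, cell in zip(centers, cells):
            if cell:
                new_centers.append(
                    tuple(sum(p[d] for p in cell) / len(cell) for d in range(len(c)))
                )
            else:
                new_centers.append(c)
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Toy data and initialization (assumptions made for the example).
sample = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = lloyd(sample, init=[(1.0, 0.0), (9.0, 0.0)])
```

On this sample the iteration stabilizes after one update, with each center at the mean of its pair of points.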

Risk bounds
The performance of the $k$-means clustering scheme, based on the notion of risk, has been widely studied in the literature. Whenever $(E, |\cdot|)$ is a separable Hilbert space, the existence of an optimal codebook, i.e. of $c^\star = \{c^\star_1, \dots, c^\star_k\}$ such that
$$R(c^\star) = R^\star := \inf_c R(c),$$
is well established (see, e.g., Theorem 4.12 in Graf and Luschgy, 2000), provided $\mathbb{E}|X|^2 < +\infty$. In this same context, works of Pollard (1981, 1982a) and Abaya and Wise (1984) imply that $R(\hat c) \to R^\star$ almost surely as $n$ goes to $+\infty$, where $\hat c$ is as in (1.6). The non-asymptotic performance of the $k$-means clustering scheme has also received a lot of attention and has been studied, for example, by Chou (1994); Linder et al. (1994); Bartlett et al. (1998); Linder (2000, 2001); Antos (2005) and Biau et al. (2008). For instance, Biau et al. (2008) prove that in a separable Hilbert space, and provided $|X| \le L$ almost surely, $\mathbb{E} R(\hat c) - R^\star \le 12 k L^2 / \sqrt{n}$ for all $n \ge 1$. A similar result is established in Cadre and Paris (2012), relaxing the hypothesis of bounded support by supposing only the existence of an exponential moment for $X$. In the context of a separable Hilbert space, Levrard (2015) establishes a stronger result under some conditions involving the quantity $p(t)$ defined as follows.
Definition 1.1 (Levrard, 2015). Let $M$ be the set of all $c = \{c_1, \dots, c_k\}$ such that $R(c) = R^\star$. For all $t \ge 0$, let
$$p(t) := \sup_{c \in M} P(F(c)^t), \tag{1.7}$$
where, for any set $A \subset E$, the notation $A^t$ stands for the $t$-neighborhood of $A$ in $E$ defined by
$$A^t := \{x \in E : d(x, A) \le t\}.$$
For any codebook $c = (c_1, \dots, c_k)$, $P(F(c)^t)$ corresponds to the probability mass of the frontier of the associated Voronoi diagram inflated by $t$ (see Figure 1). Under some slight restrictions and supposing $p(t)$ does not increase too rapidly with $t$, it appears that the excess risk is of order $O(1/n)$ as described below.
(2) Suppose in addition that there exists $r_0 > 0$ such that, for all $0 < t \le r_0$, where $p(t)$ is as in (1.7). Then, for all $x > 0$, and any $\hat c$ minimizing the empirical risk as in (1.6), with probability at least $1 - e^{-x}$, where $C > 0$ denotes a constant depending on auxiliary (and explicit) characteristics of $P$.

Figure 1:
The light-green area, inside the red dashed lines, corresponds to the $t$-neighborhood of this frontier for some small $t$.

Stability
For a quantizer $q \in Q_k$, the risk $R(q)$ describes the average square distance of a point $x \in E$ to its representative $q(x)$ whenever $x$ is drawn from $P$. The risk of $q$ therefore characterizes an important feature of the clustering scheme based on $q$, and defining optimality of $q$ in terms of the value of its risk appears as a reasonable approach. However, an important though simple observation is that the excess risk $R(q) - R(q^\star)$, for an optimal quantizer $q^\star$, is not well suited to describe the geometric similarity between the clusterings based on $q$ and $q^\star$. For one thing, there might be several optimal codebooks. Also, even in the context where there is a unique optimal codebook $c^\star$, quite different configurations of centers $c$ may give rise to very similar values of the excess risk $R(c) - R(c^\star)$. This observation relates to the difference between estimating the optimal quantizer and learning to perform as well as the optimal quantizer, and is relevant in a more general context, as briefly discussed in Appendix B below. Basically, the idea of stability we are referring to consists in identifying situations where having centers $c$ with small excess risk guarantees that $c$ is not far from an optimal codebook $c^\star$, geometrically speaking. We formalize this idea below.
The clustering problem discussed in subsections 1.1 and 1.2 is called $(F, \phi)$-stable if, for any optimal quantizer $q^\star$ and any auxiliary quantizer $q$,
$$F(q^\star, q) \le \phi\big(R(q) - R(q^\star)\big).$$
We say that the clustering problem is strongly stable for $F$ if $\phi$ is linear.
Note first that, for some chosen $F$, the notion of stability defined above characterizes a property of the underlying distribution $P$. Here, properties of the function $F$ are deliberately unspecified as, in practice, $F$ can be chosen in order to encode very different properties, of more or less geometric nature. An important property of this notion is that stable clustering problems are such that $\varepsilon$-minimizers of the risk are "close" (in the sense of $F$) to the optimal quantizer (see Corollary 2.5 below).
Remark 1.4. The notion of stability described above differs from the notion of algorithm stability studied in Ben-David et al. (2006) and Ben-David et al. (2007). Their notion of stability is defined for a function (called algorithm) $A : \bigcup_n E^n \to Q_k$ that maps any data set $\{X_1, \dots, X_n\}$ to a quantizer $A(\{X_1, \dots, X_n\})$. In this context, the stability of $A$ is defined by
$$\mathrm{Stab}(A, P) := \limsup_{n \to +\infty} \mathbb{E}\, D\big(A(\{X_1, \dots, X_n\}), A(\{Y_1, \dots, Y_n\})\big),$$
where the $X_i$'s and $Y_i$'s are i.i.d. random variables of common distribution $P$ and $D$ is a (pseudo-)metric on $Q_k$. Then, an algorithm is said to be stable for $P$ if $\mathrm{Stab}(A, P) = 0$.

According to this definition, any constant algorithm $A \equiv q$ is stable. A notable difference is that our notion of stability includes a notion of consistency. Indeed, since $q \mapsto R(q)$ is continuous (for a proper choice of the metric on $Q_k$), our notion of stability measures (if and) at which rate $q \to q^\star$ whenever $R(q) \to R^\star$. Thus, we focus only on the behaviour of algorithms $A$ such that $R(A(\{X_1, \dots, X_n\})) \to R^\star$.

Remark 1.5. Note that $F_1(q^\star, q)$ does not always coincide with the Hausdorff distance $d_H(c^\star, c)$ between $c^\star = \{c^\star_1, \dots, c^\star_k\}$ and $c = \{c_1, \dots, c_k\}$. Indeed, Figure 2 presents a configuration of codebooks $c^\star$ and $c$ that have small Hausdorff distance but define NN quantizers $q^\star$ and $q$ with large $F_1(q^\star, q)$. However, it may be seen that the inequality
$$d_H(c^\star, c) \le F_1(q^\star, q)$$
always holds and that, provided $d_H(c^\star, c) < \frac{1}{2} \min_{i \ne j} |c^\star_i - c^\star_j|$, the two quantities coincide. The proof of these statements is reported in Appendix A.1.

Figure 2:
In this simple case, where k=3, the set of black dots and the set of white dots have small Hausdorff distance but define two NN quantizers, say q 1 and q 2 , for which F 1 (q 1 , q 2 ) is large.
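The situation of Figure 2 can be reproduced numerically. The sketch below assumes that $F_1(q^\star, q)$ is the minimum over permutations $\sigma$ of $\max_i |c^\star_i - c_{\sigma(i)}|$ (a reading suggested by the surrounding discussion, not a formula quoted from the text), and uses the max-min form of the Hausdorff distance, which agrees with the usual definition for finite sets. The function names and the coordinates of the black and white points are hypothetical.

```python
import math
from itertools import permutations
from typing import Sequence, Tuple

Point = Tuple[float, float]


def hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Hausdorff distance between two finite sets (max-min form)."""
    a_to_b = max(min(math.dist(x, y) for y in b) for x in a)
    b_to_a = max(min(math.dist(x, y) for x in a) for y in b)
    return max(a_to_b, b_to_a)


def f1(c_star: Sequence[Point], c: Sequence[Point]) -> float:
    """Assumed definition: min over matchings of the largest matched distance."""
    k = len(c_star)
    return min(
        max(math.dist(c_star[i], c[sigma[i]]) for i in range(k))
        for sigma in permutations(range(k))
    )


# k = 3: two black points on the left and one on the right,
# one white point on the left and two on the right.
black = [(0.0, 0.0), (0.0, 0.1), (10.0, 0.0)]
white = [(0.0, 0.05), (10.0, 0.0), (10.0, 0.1)]

d_h = hausdorff(black, white)
f_1 = f1(black, white)
```

Every point has a neighbor of the other color within distance 0.1, so the Hausdorff distance is small, while any one-to-one matching must pair a left black point with a right white point, which forces a matched distance of at least 10.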
Whenever $(E, |\cdot|)$ is Euclidean, it follows from the previous remark and Pollard (1982b) that, provided the optimal codebook $c^\star$ is unique, when $\hat q$ is any quantizer minimizing the empirical risk $\hat R$. In Levrard (2015), under the conditions of Theorem 1.2, it is proven that for any optimal quantizer $q^\star$, and any $q \in Q_k$ such that $q(E) \subset \{x : |x| \le L\}$, provided $F_1(q^\star, q) \le B r_0 / (4\sqrt{2} M)$, which proves in this case (a local version of) the stability of the clustering scheme for $F_1$ (constants are defined in Theorem 1.2). In the same spirit, when $E = \mathbb{R}^d$ and for a measure $P$ with bounded support, Rakhlin and Caponnetto (2007) show that $F_1(q_n, q'_n) \to 0$ as $n \to \infty$ whenever $q_n$ and $q'_n$ are optimal quantizers for empirical measures $P_n$ and $P'_n$ whose supports differ by at most $o(\sqrt{n})$ points. In addition, their Lemma 5.1 shows that, for $P$ with bounded support,
While $F_1$ captures distances between representatives of the two quantizers, it is however totally oblivious to the amount of wrongly classified points. From this point of view, a more interesting quantity is $F_2(q^\star, q)$, described by (1.10), where the minimum is taken over all permutations $\sigma$ of $\{1, \dots, k\}$ (see Figure 3). This quantity measures exactly the amount of points that are misclassified by $q$ compared to $q^\star$, regarding $P$.
In the present paper, we study a related quantity, of geometric nature, defined simply as the average square distance between a quantizer $q$ and an optimal quantizer $q^\star$, i.e.
$$F(q^\star, q)^2 := \mathbb{E}\, d(q(X), q^\star(X))^2. \tag{1.11}$$
As discussed later in the paper (see Subsection 2.2), this quantity may be seen as an intermediate between $F_1$ and $F_2$, incorporating both the notion of proximity of the centers and the amount of misclassified points. The general concern of the paper will be to establish conditions under which the clustering scheme is strongly stable for this function $F^2$.
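To make this quantity concrete, the following sketch computes, under the empirical measure of a toy sample in $\mathbb{R}^2$, the average square distance between $q(X)$ and $q^\star(X)$ together with the excess risk of $q$. The helper names, the sample and the two codebooks are hypothetical; `c_star` is taken as the pair of cell means and `c` is a perturbation of it.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(x: Point, y: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def nearest(x: Point, codebook: Sequence[Point]) -> Point:
    """Value q(x) of the NN quantizer with the given codebook."""
    return min(codebook, key=lambda c: sq_dist(x, c))


def risk(sample: Sequence[Point], codebook: Sequence[Point]) -> float:
    """Empirical risk: average squared distance to the nearest center."""
    return sum(sq_dist(x, nearest(x, codebook)) for x in sample) / len(sample)


def avg_sq_dist_between_quantizers(
    sample: Sequence[Point], c_star: Sequence[Point], c: Sequence[Point]
) -> float:
    """Empirical average of |q(X_i) - q*(X_i)|^2."""
    return sum(sq_dist(nearest(x, c_star), nearest(x, c)) for x in sample) / len(sample)


sample = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
c_star = [(0.0, 1.0), (10.0, 1.0)]   # cell means of the two pairs
c = [(0.0, 1.5), (10.0, 1.0)]        # first center shifted by 0.5

f_sq = avg_sq_dist_between_quantizers(sample, c_star, c)
excess = risk(sample, c) - risk(sample, c_star)
```

In this example both quantizers induce the same cells, so the two quantities only reflect the displacement of the first center: half of the sample sees a squared displacement of 0.25.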

Figure 3:
The image of $q^\star$ (resp. $q$) is represented by the black (resp. blue) dots. The quantity $F_1(q^\star, q)$ corresponds to the length of the longest pink segment in the first (left) figure. The quantity $F_2(q^\star, q)$ is the $P$-measure of the light green area in the second (right) figure.

Stability results
In this section, we present our main results. In the sequel, we restrict ourselves to the case where $E$ is a (separable) Hilbert space with scalar product $\langle \cdot, \cdot \rangle$ and associated norm $|\cdot|$. For any $E$-valued random variable $Z$ with square integrable norm, we will denote
$$\|Z\|^2 := \mathbb{E}|Z|^2$$
for brevity.

Absolute margin condition
We first address the issue of characterizing the stability of the clustering scheme in terms of the function F defined in (1.11). The next definition plays a central role in our main result. Recall that X denotes a generic random variable with distribution P .
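Definition 2.1 (Absolute margin condition). Suppose that $\int |x|^2 \, \mathrm{d}P(x) < +\infty$ and let $q^\star$ be an optimal $k$-points quantizer of $P$. For $\lambda \ge 0$, define
$$A(\lambda) := \{x \in E : q^\star(x + \lambda(x - q^\star(x))) = q^\star(x)\}.$$
Then, $P$ is said to satisfy the absolute margin condition with parameter $\lambda_0 > 0$ if both the following conditions hold:
1. $P(A(\lambda_0)) = 1$.
2. For any random variable $Y$ such that $\|X - Y\| \le \lambda_0 \|X - q^\star(X)\|$, the map $q \in Q_k \mapsto \|Y - q(Y)\|^2$ has a unique minimizer.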
The second condition means that every probability measure, in a neighborhood of $P$, has a unique optimal $k$-points quantizer. Note that $A(0) = E$ and that $A(\lambda') \subset A(\lambda)$ for $\lambda \le \lambda'$. Letting $c^\star = q^\star(E)$, the first point of this definition states that the neighborhood $E \setminus A(\lambda_0)$ of the frontier $F(c^\star)$ is of probability zero (see Figure 4). The next remark discusses the geometry of the set $A(\lambda)$, involved in the previous definition, in comparison with the sets $F(c^\star)^t$ used in Definition 1.1. In particular, it follows from the following remark that, for appropriate
Remark 2.2. For all $\lambda \ge 0$ and $t > 0$, let Then the following statements hold. 1. For all $0 < t < M(c)/2$,
We are now in position to state the main result of this paper.
Theorem 2.3. Suppose that $\int |x|^2 \, \mathrm{d}P(x) < +\infty$. Let $q^\star$ be an optimal quantizer for $P$ and suppose that $P$ satisfies the absolute margin condition 2.1 with parameter $\lambda_0 > 0$. Then, for any $q \in Q_k$, it holds that
$$\mathbb{E}|q(X) - q^\star(X)|^2 \le \frac{1 + \lambda_0}{\lambda_0} \big( R(q) - R(q^\star) \big).$$
Remark 2.4. The above theorem states that the clustering scheme is strongly stable for $F^2$ provided the absolute margin condition holds. Here, we briefly argue that this result is optimal in the sense that strong stability requires that both hypotheses of the absolute margin condition 2.1 hold in general.
1. The following example shows that the first point of the absolute margin condition cannot be dropped. Take  Then it can be checked through straightforward computations that $F(q^\star, q_\varepsilon) = \varepsilon$ and that $R(q_\varepsilon) - R(q^\star) \le \varepsilon^2$, so that there exists no $\lambda > 0$ for which inequality holds for all $\varepsilon > 0$.
An interesting consequence of Theorem 2.3 holds in the context of empirical measures, for which the absolute margin condition always holds. Consider a sample $X_1, \dots, X_n$ composed of i.i.d. variables with distribution $P$ and let
$$P_n := \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$$
be the associated empirical measure, whose risk is denoted by $R_n$. The next result ensures that an $\varepsilon$-empirical risk minimizer (i.e. a quantizer $q_\varepsilon$ such that $R_n(q_\varepsilon) \le \inf_q R_n(q) + \varepsilon$) is at a distance (in terms of $F$) at most $\varepsilon(1 + \lambda)/\lambda$ to an empirical risk minimizer, for some $\lambda$ depending only on $P_n$.
Corollary 2.5. Let $\varepsilon > 0$. Let $P_n$ be the empirical measure of a measure $P$, associated with the sample $\{X_1, \dots, X_n\}$. Suppose $P_n$ has a unique optimal quantizer $\hat q$. Then $P_n$ satisfies the absolute margin condition for some $\lambda_n > 0$. In addition, if $q_\varepsilon \in Q_k$ satisfies
$$R_n(q_\varepsilon) \le \inf_q R_n(q) + \varepsilon,$$
then
$$\frac{1}{n} \sum_{i=1}^n |q_\varepsilon(X_i) - \hat q(X_i)|^2 \le \frac{1 + \lambda_n}{\lambda_n}\, \varepsilon.$$
The last result follows easily from Theorem 4.2 in Graf and Luschgy (2000) (stating that $P_n(F(\hat c)) = 0$, for $\hat c = \hat q(E)$, and thus $P_n(A(\lambda)) = 1$ for some $\lambda > 0$) and from Theorem 2.3. The proof is therefore omitted for brevity. The interpretation of this corollary is that any algorithm producing a quantizer $q$ with small empirical risk $\hat R(q)$ will automatically be such that $F(\hat q, q)$ is small (again, provided uniqueness of $\hat q$) if $\lambda_n$ is large. The parameter $\lambda_n$ defined by the absolute margin condition thus provides a key feature for the stability of the $k$-means algorithm. A nice property of the previous result is that $\lambda_n$ is independent of the $\varepsilon$-minimizer $q_\varepsilon$. However, an important remaining question, of large practical value, is to lower bound $\lambda_n$ with large probability to assess the size of the coefficient $(1 + \lambda_n)/\lambda_n$. This is left for future research.
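As a rough illustration of how the first requirement of the absolute margin condition can be examined on a sample, the sketch below computes the largest $\lambda$ for which every sample point $x$ keeps its nearest center after being moved to $x_\lambda = x + \lambda(x - q(x))$. This rests on two assumptions that are not taken from the text: the set $A(\lambda)$ is read as $\{x : q(x_\lambda) = q(x)\}$ with ties counted as agreement, and the space is Euclidean. Expanding $|x_\lambda - c_i|^2 \le |x_\lambda - c_j|^2$ for $q(x) = c_i$ gives the constraint $2\lambda \langle x - c_i, c_j - c_i \rangle \le |x - c_j|^2 - |x - c_i|^2$, which the code turns into a bound on $\lambda$. The second requirement (uniqueness of minimizers) is not checked, so the returned value is only an upper bound on an admissible $\lambda_n$. All names and data are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(x: Point, y: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def inner(u: Point, v: Point) -> float:
    """Euclidean scalar product."""
    return sum(a * b for a, b in zip(u, v))


def sub(u: Point, v: Point) -> Point:
    """Componentwise difference u - v."""
    return tuple(a - b for a, b in zip(u, v))


def margin_threshold(sample: Sequence[Point], centers: Sequence[Point]) -> float:
    """Largest lam such that, for every sample point x with nearest center c_i,
    x + lam * (x - c_i) is still at least as close to c_i as to any other center.
    Returns math.inf when no sample point imposes a constraint."""
    lam = math.inf
    for x in sample:
        i = min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
        ci = centers[i]
        for j, cj in enumerate(centers):
            if j == i:
                continue
            slope = 2.0 * inner(sub(x, ci), sub(cj, ci))
            if slope > 0.0:  # x moves towards c_j as lam grows
                gap = sq_dist(x, cj) - sq_dist(x, ci)
                lam = min(lam, gap / slope)
    return lam


# Toy sample on a line (assumption), with centers at the two cell means.
sample = [(1.0, 0.0), (-1.0, 0.0), (9.0, 0.0), (11.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
lam_n = margin_threshold(sample, centers)
```

In this example the binding points are $(1, 0)$ and $(9, 0)$: pushing either of them away from its center by a factor $\lambda = 4$ brings it exactly to the midpoint $(5, 0)$ between the two centers.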

Comparing notions of stability
This subsection describes some relationships between the function $F$ involved in our main result and the two functions $F_1$ and $F_2$ mentioned earlier in Section 1.3. Below, we restrict attention to the case where there is a unique optimal quantizer $q^\star$. Comparing $F$ and $F_2$ can be done straightforwardly. Let $m := \inf_{i \ne j} |c^\star_i - c^\star_j|$. Observe that, for $F_1(q^\star, q)$ small enough, the permutation reaching the minimum in the definitions of $F_1$ and $F_2$ is the same and can be assumed to be the identity without loss of generality. Then, it follows that, for $F_1(q^\star, q)$ small enough, and similarly, when $m \ge F_1(q^\star, q)$, These two inequalities imply that $F^2$ and $F_2$ are comparable whenever $F_1$ is small enough.
Comparing $F_1$ and $F$ requires more effort, although one inequality is also quite straightforward. Recall the notation $p_{\min} = \inf_i P(V_i(c^\star))$. Suppose again that the optimal permutation in the definition of $F_1$ is the identity. Then, remark that $F_1(q^\star, q) \le m/2$ implies Thus, in this case, In view of providing a more detailed result, we define the function $\tilde p$, similar in nature to the function $p$ introduced by Levrard (2015) and defined in Definition 1.1.
Definition 2.6. For a metric space $(E, d)$ and a probability measure $P$ on $E$, let $X$ be a random variable of law $P$. Denote by $q^\star$ an optimal quantizer of $P$ with image $c^\star = \{c^\star_1, \dots, c^\star_k\}$ and by $\partial V_i(c^\star)$ the frontier of the Voronoi cell associated to $c^\star_i$. Then, for all $t > 0$, we let While $p(t)$ corresponds to the probability of the $t$-inflated frontier of the Voronoi cells (defined in Definition 1.1), $\tilde p(t)$ corresponds to a similar object in which the inflation of the frontier gets larger as the points go further from their representative in the codebook $c^\star$. These two functions can thus differ significantly, in general. However, since $m/4 \le d(X, q^\star(X))$ for $X$ such that $d(X, \partial V_i(c^\star)) < m/4$, it follows that whenever $0 < t < m/4$. And when the probability measure $P$ has its support in a ball of diameter $R > 0$, it can be readily seen that for all $t > 0$ If the support of $P$ is not contained in a ball, the comparison is not as straightforward.
We can now state the last comparison inequality.
Proposition 2.7. Under the same setting as in Definition 2.6, A consequence of this proposition and the result of Levrard (2015) for any empirical risk minimizer $\hat q$.

Proofs
This section gathers the proofs of the main results of the paper. Additional proofs are postponed to the appendices.

Proof of Theorem 2.3
Recall that $E$ is a Hilbert space with scalar product $\langle \cdot, \cdot \rangle$ and norm $|\cdot|$, and that, for an $E$-valued random variable $Z$ with square integrable norm, we denote $\|Z\|^2 = \mathbb{E}|Z|^2$ for brevity. For $\lambda > 0$, set $X_\lambda := X + \lambda(X - q^\star(X))$. As $E$ is a Hilbert space, we have for all $y, z \in E$ and all $t \in [0, 1]$, Now for all $x \in E$, any quantizer $q \in Q_k$ and any $\lambda > 0$, using the previous inequality with where the last inequality follows from the fact that $q$ is a nearest neighbor quantizer. Integrating this inequality with respect to $P$, we obtain where we have denoted Observe that $\lambda \mapsto c_q(\lambda)$ is continuous. Now, define where the supremum is taken over all $k$-points quantizers $q \in Q_k$. The function $\lambda \mapsto c_\infty(\lambda)$ obviously satisfies $c_\infty(\lambda) \ge c_q(\lambda) \ge 0$, for all $\lambda > 0$. To prove the theorem, we will show that $c_\infty(\lambda_0) \le 0$ whenever $P$ satisfies the absolute margin condition with parameter $\lambda_0 > 0$. To that aim, we provide two auxiliary results.
Lemma 3.1. Suppose there exists $R > 0$ such that $P(B(0, R)) = 1$. For all $\lambda > 0$, denote by $q_\lambda$ any quantizer such that $c_{q_\lambda}(\lambda) = c_\infty(\lambda)$ and denote by $q^\star$ an optimal quantizer of the law of $X$. Suppose the absolute margin condition holds for $\lambda_0 > 0$. Then, for all $0 < \lambda_1 < \lambda_0$, there exists $\varepsilon > 0$ such that for all 0
Proof of Lemma 3.1. The main idea of the proof is that, since the Voronoi cells are well separated (inflated borders have probability 0), a quantizer that is close enough to the optimal one shares its Voronoi cells (on the support of $P$); the center condition then requires each of its centers to be a centroid of its cell for the quantizer to be optimal. Set $q_\lambda(E) = \{c_1, \dots, c_k\}$ and $\{c^\star_1, \dots, c^\star_k\} = q^\star(E)$. Suppose without loss of generality that the optimal permutation in the definition of $F_1$ is the identity. The assumption implies that, with probability one, for each $1 \le i \le k$, on the event $q^\star(X) = c^\star_i$, the inequality $|X_{\lambda_0} - c^\star_i|^2 \le |X_{\lambda_0} - c^\star_j|^2$ holds, or equivalently However, Since (3.2) holds, for all $\lambda_1 < \lambda_0$, there exists therefore $\varepsilon = \varepsilon(\lambda_0, \lambda_1, R, \max\{|c^\star_i - c^\star_j| : i \ne j\})$ such that, if $F_1(q^\star, q) < \varepsilon$, then for all $\lambda \le \lambda_1$, on the event $q^\star(X) = c^\star_i$. As a result, This means that $q^\star$ and $q_\lambda$ share the same cells on the support of $P$. Thus, where inequality (3.3) follows from the center condition (1.5). Therefore, since $q_\lambda$ minimizes $\|X_\lambda - q(X_\lambda)\|^2$ amongst NN quantizers, (3.3) is an equality, so that $c_i = c^\star_i$, i.e. $q^\star = q_\lambda$, since $\|X_\lambda - X\| = \lambda \|X - q^\star(X)\| \le \lambda_0 \|X - q^\star(X)\|$ implies, by the absolute margin condition, that $q_\lambda$ is unique.

Proof of Proposition 2.7
The following proof borrows some arguments from the proof of Lemma 4.2 of Levrard (2015). Recall that $m = \inf_{i \ne j} |c^\star_i - c^\star_j|$. Take $1 \le i, j \le k$ and consider the hyperplane Then, for all $x \in V_i(c^\star)$, Without loss of generality, suppose now for simplicity that the permutation $\sigma$ achieving the minimum in the definition of $F_1(q^\star, q)$ is the identity, $\sigma(j) = j$, so that Then, it follows that for Thus, using the fact that, for all $x \in V_i(c^\star)$, we have we deduce from the previous observations that, for all $i \ne j$, The right hand side being independent of $j$, we obtain in particular, Therefore, which shows the desired result.

A.1 Proofs for Remark 1.5
Recall that, for any two sets $A, B \subset E$, their Hausdorff distance is defined by
$$d_H(A, B) := \inf\{\varepsilon > 0 : A \subset B^\varepsilon \text{ and } B \subset A^\varepsilon\}.$$
The fact that $d_H(c^\star, c) \le F_1(q^\star, q)$ then follows easily from definitions. Now, to prove the second statement, observe that, in the context of the finite sets $c^\star$ and $c$, the infimum in the definition of $\delta := d_H(c^\star, c)$ is attained so that, for any $i \in \{1, \dots, k\}$, there exists some $j \in \{1, \dots, k\}$ such that $|c^\star_i - c_j| \le \delta$. Then, the balls $B(c^\star_i, \delta)$ are necessarily disjoint and therefore contain one and only one element of $c$, denoted $c_{\sigma(i)}$. As a result, where the last equality follows by construction. This implies the desired result.

A.2 Proof for Remark 2.2
Let $x \in E$ and denote $c_i = q(x)$. First, it may be checked that assumption $d(x, F(c)) > \varepsilon$ holds if, and only if, Using the definition of $x_\lambda$, this last condition may be equivalently written, for all $j \ne i$, as After simplification in (A.2), we therefore obtain that $q(x_\lambda) = q(x)$ if, and only if, The result now easily follows from combining (A.1) and (A.3).

A.3 A consistency result
The next result is adapted from Theorems 4.12 and 4.21 in Graf and Luschgy (2000).
Lemma A.1. Suppose $X \in L^2(\mathbb{P})$. Then, letting $q^\star$ be an optimal $k$-points quantizer for the distribution of $X$ and denoting $X_\lambda = X + \lambda(X - q^\star(X))$, the following statements hold.
1. For any $\lambda \ge 0$, there exists a $k$-points NN quantizer $q_\lambda$ such that
$$\|X_\lambda - q_\lambda(X_\lambda)\|^2 = \min_q \|X_\lambda - q(X_\lambda)\|^2,$$
where the minimum is taken over all $k$-points quantizers.
2. For all $\lambda_0 \ge 0$, if $q_{\lambda_0}$ is unique,
Proof of Lemma A.1. We state the result for a measure with bounded support and refer to Graf and Luschgy (2000) for the unbounded case.
1. Let $q_n$ be a sequence of quantizers such that as $n \to \infty$. Since balls in $E$ are weakly compact, the centers $q_n(E) = \{c^n_1, \dots, c^n_k\}$ weakly converge to some limit $\{c_1, \dots, c_k\}$ up to a subsequence. Denote by $q_0(X_\lambda)$ a limit of a weakly converging subsequence of $q_n(X_\lambda)$, realizing the limit $\liminf \|X_\lambda - q_n(X_\lambda)\|^2$; then, by Fatou's lemma, which shows that $q_0$ realizes the minimum of $\inf_q \|X_\lambda - q(X_\lambda)\|^2$.

B Stability of a learning problem
In this section, we briefly argue that the problem considered in the paper, while of special interest in the context of unsupervised learning, finds a natural extension in a more general framework of learning theory, namely the context of contrast minimization. Let $\mathcal{Z}$ be a measurable space equipped with a probability distribution $P$ and let $\mathcal{T}$ be a given set of parameters. Suppose given a sample $Z_1, \dots, Z_n$ of i.i.d. variables with common distribution $P$. Given a contrast function $C : \mathcal{Z} \times \mathcal{T} \to \mathbb{R}_+$, consider the problem of designing a data-driven $\hat t$, based on the sample $Z_1, \dots, Z_n$, achieving a small value of the risk function
$$R(t) := \int C(z, t) \, \mathrm{d}P(z).$$
This general problem, known as contrast minimization, is a classical way to unify the supervised and unsupervised learning approaches, as illustrated in the next example.
• Supervised learning. The supervised learning problem corresponds to the contrast minimization problem where $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{T}$ is a class of candidate functions $t : \mathcal{X} \to \mathcal{Y}$ and where, for a given loss function $\ell : \mathcal{Y}^2 \to \mathbb{R}_+$, the contrast is $C((x, y), t) = \ell(y, t(x))$.

• Unsupervised learning. The unsupervised learning problem discussed earlier in the present paper corresponds to the contrast minimization problem where $\mathcal{Z}$ is a metric space $(E, d)$, where $\mathcal{T}$ is the set $Q$ of all $k$-points quantizers, for a given integer $k$, and where the contrast function is
$$C(x, q) = d(x, q(x))^2. \tag{B.1}$$
Given the general problem of contrast minimization formulated above, one may naturally extend the question discussed in the present paper by considering the following notion of stability.
Definition B.2. Consider a function $F : \mathcal{T}^2 \to \mathbb{R}_+$ and an increasing function $\phi : \mathbb{R}_+ \to \mathbb{R}_+$. Then, the contrast minimization problem is called $(F, \phi, \varepsilon)$-stable if, for any $t^\star$ minimizing the risk on $\mathcal{T}$ and any $t \in \mathcal{T}$ such that $R(t) - R(t^\star) \le \varepsilon$,
$$F(t^\star, t) \le \phi\big(R(t) - R(t^\star)\big).$$
Our main result, Theorem 2.3, proves the stability of the contrast minimization problem for the contrast function defined in (B.1). The following result proves the stability of the supervised learning problem for a strongly convex loss function.
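Example B.3. Consider the supervised learning problem described in the example above. Suppose there exists $\alpha > 0$ such that, for all $y \in \mathcal{Y}$, the function $u \in \mathcal{Y} \mapsto \ell(y, u)$ is $\alpha$-strongly convex. Then, for any convex class $\mathcal{T}$ of functions $t : \mathcal{X} \to \mathcal{Y}$ and any $t^\star$ minimizing the risk on $\mathcal{T}$, we have, for any $t \in \mathcal{T}$,
$$\|t - t^\star\|_{L^2(\mu)} \le 2 \sqrt{\frac{R(t) - R(t^\star)}{\alpha}},$$
where $\mu$ is the marginal of $P$ on $\mathcal{X}$. In particular, for all $\varepsilon > 0$, this learning problem is $(\varepsilon, \phi)$-stable for the $L^2(\mu)$ metric with $\phi(u) = 2\sqrt{u/\alpha}$.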
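The bound of Example B.3 can be recovered by a short midpoint argument, sketched below under assumptions that are not stated in the text: the output space is a subset of a Hilbert space with norm $|\cdot|$, and $\alpha$-strong convexity is understood in the sense that $\ell(y, su + (1-s)v) \le s\,\ell(y, u) + (1-s)\,\ell(y, v) - \frac{\alpha}{2} s(1-s)|u - v|^2$ for all $s \in [0, 1]$. With a different normalization of $\alpha$, the constants change accordingly.

```latex
% Take s = 1/2, u = t(x), v = t^\star(x), and integrate with respect to P:
\[
  R\!\left(\tfrac{t + t^\star}{2}\right)
  \le \tfrac{1}{2} R(t) + \tfrac{1}{2} R(t^\star)
      - \tfrac{\alpha}{8}\, \| t - t^\star \|_{L^2(\mu)}^2 .
\]
% The class is convex, so (t + t^\star)/2 belongs to it and, by optimality of t^\star,
\[
  R(t^\star) \le R\!\left(\tfrac{t + t^\star}{2}\right).
\]
% Combining the two displays and rearranging:
\[
  \| t - t^\star \|_{L^2(\mu)}^2 \le \tfrac{4}{\alpha} \big( R(t) - R(t^\star) \big),
  \qquad\text{i.e.}\qquad
  \| t - t^\star \|_{L^2(\mu)} \le 2 \sqrt{\tfrac{R(t) - R(t^\star)}{\alpha}} .
\]
```

Under these assumptions, the last inequality is the stability statement with $\phi(u) = 2\sqrt{u/\alpha}$; note that convexity of the class is used only to guarantee that the midpoint is an admissible competitor.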